* [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map
@ 2022-08-18 14:31 Yafang Shao
  2022-08-18 14:31   ` Yafang Shao
                   ` (12 more replies)
  0 siblings, 13 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

In our production environment, we may load, run, and pin bpf programs and
maps in containers. For example, some of our networking bpf programs and
maps are loaded and pinned by a process running in a container in our
k8s environment. In that container there are also other user applications
running, which watch the networking configurations from remote servers and
update them on the local host, log error events, monitor the traffic, and
perform some other tasks. Sometimes we need to update these user
applications to a new release, and in this update process we destroy the
old container and then start a new generation. In order not to interrupt
the bpf programs across the update, we pin the bpf programs and maps in
bpffs. That is the background and use case in our production environment.

After switching to memcg-based bpf memory accounting to limit the bpf
memory, some unexpected issues jumped out at us.
1. The memory usage is not consistent between the first generation and
new generations.
2. After the first generation is destroyed, the bpf memory can't be
limited if the bpf maps are not preallocated, because they will be
reparented.

This patchset tries to resolve these issues by introducing an
independent memcg to limit the bpf memory.

When creating a bpf map, we can assign a specific memcg instead of using
the current memcg. That makes it flexible in containerized environments.
For example, if we want to limit the pinned bpf maps, we can use the
hierarchy below:

    Shared resources              Private resources 
                                    
     bpf-memcg                      k8s-memcg
     /        \                     /             
bpf-bar-memcg bpf-foo-memcg   srv-foo-memcg        
                  |               /        \
               (charged)     (not charged) (charged)                 
                  |           /              \
                  |          /                \
          bpf-foo-{progs,maps}              srv-foo

srv-foo loads and pins bpf-foo-{progs, maps}, but they are charged to an
independent memcg (bpf-foo-memcg) instead of srv-foo's memcg
(srv-foo-memcg).
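
To make the selection concrete, here is a purely illustrative userspace
sketch of how srv-foo might direct a map's charging to bpf-foo-memcg at
creation time. The cgroup path and the commented-out attribute name are
assumptions for illustration only; the real uapi change is part of the
last patch of this series and is not reproduced here.

/* Illustrative only: open the target memcg's cgroup directory and hand
 * its fd to BPF_MAP_CREATE.  The attribute name below is hypothetical
 * and therefore left commented out; without it this creates an ordinary
 * map charged to the creator's memcg.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int create_map_charged_to(const char *cgrp_path)
{
	union bpf_attr attr;
	int cgrp_fd;

	cgrp_fd = open(cgrp_path, O_RDONLY | O_DIRECTORY);
	if (cgrp_fd < 0)
		return -1;

	memset(&attr, 0, sizeof(attr));
	attr.map_type    = BPF_MAP_TYPE_HASH;
	attr.key_size    = sizeof(__u32);
	attr.value_size  = sizeof(__u64);
	attr.max_entries = 1024;
	/* attr.map_memcg_fd = cgrp_fd;  hypothetical field, for the real
	 * interface see the last patch of this series
	 */

	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}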

Please note that there may be no process in bpf-foo-memcg, which means it
can currently be rmdir-ed by the root user. Meanwhile, we don't forcefully
destroy a memcg if it doesn't have any residents, so this hierarchy is
acceptable.

In order to make the memcg of bpf maps selectable, this patchset
introduces some memory allocation wrappers for map related memory.
These wrappers get the memcg from the map and then charge the allocated
pages or objects to it.
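
These wrappers follow the same pattern as the existing memcg-aware
helpers; for reference, here is bpf_map_kmalloc_node() condensed from
the diffs below:

/* Resolve the memcg saved in the map (or the root memcg), make it the
 * active memcg around the allocation so the memory is charged to it
 * rather than to the current task's memcg, then drop the reference.
 */
void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size,
			   gfp_t flags, int node)
{
	struct mem_cgroup *memcg, *old_memcg;
	void *ptr;

	memcg = bpf_map_get_memcg(map);
	old_memcg = set_active_memcg(memcg);
	ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
	set_active_memcg(old_memcg);
	bpf_map_put_memcg(memcg);

	return ptr;
}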

Currently it only supports bpf maps, and we can extend it to bpf progs
as well.

Observability can also be supported in a next step, for example,
showing a bpf map's memcg via 'bpftool map show', or even showing which
maps are charged to a specific memcg via 'bpftool cgroup show'.
Furthermore, we may also show an accurate memory size of a bpf map
instead of an estimated memory size in 'bpftool map show' in the future.

v1->v2:
- cgroup1 is also supported after
  commit f3a2aebdd6fb ("cgroup: enable cgroup_get_from_file() on cgroup1")
  So update the commit log.
- remove incorrect fix to mem_cgroup_put (Shakeel, Roman, Muchun)
- use cgroup_put() in bpf_map_save_memcg() (Shakeel)
- add detailed commit log for get_obj_cgroup_from_cgroup (Shakeel) 

RFC->v1:
- get rid of bpf_map container wrapper (Alexei)
- add the new field into the end of struct (Alexei)
- get rid of BPF_F_SELECTABLE_MEMCG (Alexei)
- save memcg in bpf_map_init_from_attr
- introduce bpf_ringbuf_pages_{alloc,free} and keep them inside
  kernel/bpf/ringbuf.c  (Andrii)

Yafang Shao (12):
  cgroup: Update the comment on cgroup_get_from_fd
  bpf: Introduce new helper bpf_map_put_memcg()
  bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM
  bpf: Call bpf_map_init_from_attr() immediately after map creation
  bpf: Save memcg in bpf_map_init_from_attr()
  bpf: Use scoped-based charge in bpf_map_area_alloc
  bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
  bpf: Use bpf_map_kzalloc in arraymap
  bpf: Use bpf_map_kvcalloc in bpf_local_storage
  mm, memcg: Add new helper get_obj_cgroup_from_cgroup
  bpf: Add return value for bpf_map_init_from_attr
  bpf: Introduce selectable memcg for bpf map

 include/linux/bpf.h            |  40 +++++++++++-
 include/linux/memcontrol.h     |  11 ++++
 include/uapi/linux/bpf.h       |   1 +
 kernel/bpf/arraymap.c          |  34 ++++++-----
 kernel/bpf/bloom_filter.c      |  11 +++-
 kernel/bpf/bpf_local_storage.c |  17 ++++--
 kernel/bpf/bpf_struct_ops.c    |  19 +++---
 kernel/bpf/cpumap.c            |  17 ++++--
 kernel/bpf/devmap.c            |  30 +++++----
 kernel/bpf/hashtab.c           |  26 +++++---
 kernel/bpf/local_storage.c     |  11 +++-
 kernel/bpf/lpm_trie.c          |  12 +++-
 kernel/bpf/offload.c           |  12 ++--
 kernel/bpf/queue_stack_maps.c  |  11 +++-
 kernel/bpf/reuseport_array.c   |  11 +++-
 kernel/bpf/ringbuf.c           | 104 +++++++++++++++++++++----------
 kernel/bpf/stackmap.c          |  13 ++--
 kernel/bpf/syscall.c           | 136 ++++++++++++++++++++++++++++-------------
 kernel/cgroup/cgroup.c         |   2 +-
 mm/memcontrol.c                |  47 ++++++++++++++
 net/core/sock_map.c            |  30 +++++----
 net/xdp/xskmap.c               |  12 +++-
 tools/include/uapi/linux/bpf.h |   1 +
 tools/lib/bpf/bpf.c            |   3 +-
 tools/lib/bpf/bpf.h            |   3 +-
 tools/lib/bpf/gen_loader.c     |   2 +-
 tools/lib/bpf/libbpf.c         |   2 +
 tools/lib/bpf/skel_internal.h  |   2 +-
 28 files changed, 443 insertions(+), 177 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v2 01/12] cgroup: Update the comment on cgroup_get_from_fd
@ 2022-08-18 14:31   ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao, Yosry Ahmed

After commit f3a2aebdd6fb ("cgroup: enable cgroup_get_from_file() on cgroup1")
we can open a cgroup1 dir as well. So let's update the comment.

Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 5f4502a..b7d2e55 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6625,7 +6625,7 @@ struct cgroup *cgroup_get_from_path(const char *path)
 
 /**
  * cgroup_get_from_fd - get a cgroup pointer from a fd
- * @fd: fd obtained by open(cgroup2_dir)
+ * @fd: fd obtained by open(cgroup_dir)
  *
  * Find the cgroup from a fd which should be obtained
  * by opening a cgroup directory.  Returns a pointer to the
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v2 02/12] bpf: Introduce new helper bpf_map_put_memcg()
@ 2022-08-18 14:31   ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Replace the open-coded mem_cgroup_put() with a new helper,
bpf_map_put_memcg(), to make the code clearer.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/syscall.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 83c7136..2f18ae2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -441,6 +441,11 @@ static struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
 	return root_mem_cgroup;
 }
 
+static void bpf_map_put_memcg(struct mem_cgroup *memcg)
+{
+	mem_cgroup_put(memcg);
+}
+
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node)
 {
@@ -451,7 +456,7 @@ void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 	old_memcg = set_active_memcg(memcg);
 	ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
 	set_active_memcg(old_memcg);
-	mem_cgroup_put(memcg);
+	bpf_map_put_memcg(memcg);
 
 	return ptr;
 }
@@ -465,7 +470,7 @@ void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 	old_memcg = set_active_memcg(memcg);
 	ptr = kzalloc(size, flags | __GFP_ACCOUNT);
 	set_active_memcg(old_memcg);
-	mem_cgroup_put(memcg);
+	bpf_map_put_memcg(memcg);
 
 	return ptr;
 }
@@ -480,7 +485,7 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 	old_memcg = set_active_memcg(memcg);
 	ptr = __alloc_percpu_gfp(size, align, flags | __GFP_ACCOUNT);
 	set_active_memcg(old_memcg);
-	mem_cgroup_put(memcg);
+	bpf_map_put_memcg(memcg);
 
 	return ptr;
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v2 03/12] bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM
  2022-08-18 14:31 [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Yafang Shao
  2022-08-18 14:31   ` Yafang Shao
  2022-08-18 14:31   ` Yafang Shao
@ 2022-08-18 14:31 ` Yafang Shao
  2022-08-18 14:31 ` [PATCH bpf-next v2 04/12] bpf: Call bpf_map_init_from_attr() immediately after map creation Yafang Shao
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

With this change, these helpers can also be used when CONFIG_MEMCG_KMEM
or CONFIG_MEMCG is not set. bpf_map_{get,put}_memcg are also moved into
include/linux/bpf.h, so the two helpers can be used in other source
files.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h        | 26 ++++++++++++++++++++++++++
 include/linux/memcontrol.h | 10 ++++++++++
 kernel/bpf/syscall.c       | 13 -------------
 3 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a627a02..ded7d23 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -27,6 +27,7 @@
 #include <linux/bpfptr.h>
 #include <linux/btf.h>
 #include <linux/rcupdate_trace.h>
+#include <linux/memcontrol.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -2572,4 +2573,29 @@ static inline void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype) {}
 static inline void bpf_cgroup_atype_put(int cgroup_atype) {}
 #endif /* CONFIG_BPF_LSM */
 
+#ifdef CONFIG_MEMCG_KMEM
+static inline struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
+{
+	if (map->objcg)
+		return get_mem_cgroup_from_objcg(map->objcg);
+
+	return root_mem_cgroup;
+}
+
+static inline void bpf_map_put_memcg(struct mem_cgroup *memcg)
+{
+	mem_cgroup_put(memcg);
+}
+
+#else
+static inline struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
+{
+	return root_memcg();
+}
+
+static inline void bpf_map_put_memcg(struct mem_cgroup *memcg)
+{
+}
+#endif
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9ecead1..2f0a611 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -361,6 +361,11 @@ struct mem_cgroup {
 
 extern struct mem_cgroup *root_mem_cgroup;
 
+static inline struct mem_cgroup *root_memcg(void)
+{
+	return root_mem_cgroup;
+}
+
 enum page_memcg_data_flags {
 	/* page->memcg_data is a pointer to an objcgs vector */
 	MEMCG_DATA_OBJCGS = (1UL << 0),
@@ -1138,6 +1143,11 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 #define MEM_CGROUP_ID_SHIFT	0
 #define MEM_CGROUP_ID_MAX	0
 
+static inline struct mem_cgroup *root_memcg(void)
+{
+	return NULL;
+}
+
 static inline struct mem_cgroup *folio_memcg(struct folio *folio)
 {
 	return NULL;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2f18ae2..19c3a81 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -433,19 +433,6 @@ static void bpf_map_release_memcg(struct bpf_map *map)
 		obj_cgroup_put(map->objcg);
 }
 
-static struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
-{
-	if (map->objcg)
-		return get_mem_cgroup_from_objcg(map->objcg);
-
-	return root_mem_cgroup;
-}
-
-static void bpf_map_put_memcg(struct mem_cgroup *memcg)
-{
-	mem_cgroup_put(memcg);
-}
-
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node)
 {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v2 04/12] bpf: Call bpf_map_init_from_attr() immediately after map creation
  2022-08-18 14:31 [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (2 preceding siblings ...)
  2022-08-18 14:31 ` [PATCH bpf-next v2 03/12] bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM Yafang Shao
@ 2022-08-18 14:31 ` Yafang Shao
  2022-08-18 14:31   ` Yafang Shao
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

In order to make all other map related memory allocations happen after
the memcg is saved in the map, we should save the memcg immediately
after map creation. But the map is created in bpf_map_area_alloc(),
within which we can't get the related bpf_map (except with a pointer
cast, which may be error prone), so we do it in
bpf_map_init_from_attr(), which is used by all bpf maps.

bpf_map_init_from_attr() is executed immediately after
bpf_map_area_alloc() for almost all bpf maps except bpf_struct_ops,
devmap and hashtab, so this patch changes these three maps.

In the future we will change the return type of bpf_map_init_from_attr()
from void to int for error cases, so placing it immediately after
bpf_map_area_alloc() will make it easy to handle the error case.
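
As a schematic only (foo_map is not a real map type in the tree), the
->map_alloc() ordering that the following patches rely on looks like
this:

static struct bpf_map *foo_map_alloc(union bpf_attr *attr)
{
	struct foo_map {
		struct bpf_map map;
		void *data;
	} *fmap;

	fmap = bpf_map_area_alloc(sizeof(*fmap), NUMA_NO_NODE);
	if (!fmap)
		return ERR_PTR(-ENOMEM);

	/* runs right after the container allocation, so the memcg saved
	 * here (see the next patch) is visible to every later allocation
	 */
	bpf_map_init_from_attr(&fmap->map, attr);

	fmap->data = bpf_map_kzalloc(&fmap->map, attr->value_size,
				     GFP_USER | __GFP_NOWARN);
	if (!fmap->data) {
		bpf_map_area_free(fmap);
		return ERR_PTR(-ENOMEM);
	}

	return &fmap->map;
}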

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/bpf_struct_ops.c | 2 +-
 kernel/bpf/devmap.c         | 5 ++---
 kernel/bpf/hashtab.c        | 4 ++--
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 84b2d9d..36f24f8 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -624,6 +624,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 
 	st_map->st_ops = st_ops;
 	map = &st_map->map;
+	bpf_map_init_from_attr(map, attr);
 
 	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
 	st_map->links =
@@ -637,7 +638,6 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 
 	mutex_init(&st_map->lock);
 	set_vm_flush_reset_perms(st_map->image);
-	bpf_map_init_from_attr(map, attr);
 
 	return map;
 }
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index f9a87dc..20decc7 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -127,9 +127,6 @@ static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
 	 */
 	attr->map_flags |= BPF_F_RDONLY_PROG;
 
-
-	bpf_map_init_from_attr(&dtab->map, attr);
-
 	if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
 		dtab->n_buckets = roundup_pow_of_two(dtab->map.max_entries);
 
@@ -167,6 +164,8 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	if (!dtab)
 		return ERR_PTR(-ENOMEM);
 
+	bpf_map_init_from_attr(&dtab->map, attr);
+
 	err = dev_map_init_map(dtab, attr);
 	if (err) {
 		bpf_map_area_free(dtab);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 8392f7f..48dc04c 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -499,10 +499,10 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	if (!htab)
 		return ERR_PTR(-ENOMEM);
 
-	lockdep_register_key(&htab->lockdep_key);
-
 	bpf_map_init_from_attr(&htab->map, attr);
 
+	lockdep_register_key(&htab->lockdep_key);
+
 	if (percpu_lru) {
 		/* ensure each CPU's lru list has >=1 elements.
 		 * since we are at it, make each lru list has the same
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v2 05/12] bpf: Save memcg in bpf_map_init_from_attr()
@ 2022-08-18 14:31   ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Move bpf_map_save_memcg() into bpf_map_init_from_attr(), so that all
other map related memory allocations happen after the memcg is saved.
We can then get the memcg from the map in the follow-up memory
allocations.

To pair with this change, bpf_map_release_memcg() is moved into
bpf_map_area_free(), and a new parameter, struct bpf_map, is added to
bpf_map_area_free() for this purpose.
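
Condensed from the hashtab conversion below, the resulting calling
convention is to pass the map only for the area that embeds struct
bpf_map, so the memcg reference saved at init time is dropped exactly
once:

static void htab_map_free(struct bpf_map *map)
{
	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);

	/* member area: does not touch the memcg */
	bpf_map_area_free(htab->buckets, NULL);
	/* container area: also releases the memcg saved at init time */
	bpf_map_area_free(htab, map);
}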

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            |  2 +-
 kernel/bpf/arraymap.c          |  8 +++---
 kernel/bpf/bloom_filter.c      |  2 +-
 kernel/bpf/bpf_local_storage.c |  4 +--
 kernel/bpf/bpf_struct_ops.c    |  6 ++---
 kernel/bpf/cpumap.c            |  6 ++---
 kernel/bpf/devmap.c            |  8 +++---
 kernel/bpf/hashtab.c           | 10 +++----
 kernel/bpf/local_storage.c     |  2 +-
 kernel/bpf/lpm_trie.c          |  2 +-
 kernel/bpf/offload.c           |  4 +--
 kernel/bpf/queue_stack_maps.c  |  2 +-
 kernel/bpf/reuseport_array.c   |  2 +-
 kernel/bpf/ringbuf.c           |  8 +++---
 kernel/bpf/stackmap.c          |  8 +++---
 kernel/bpf/syscall.c           | 60 ++++++++++++++++++++++--------------------
 net/core/sock_map.c            | 12 ++++-----
 net/xdp/xskmap.c               |  2 +-
 18 files changed, 76 insertions(+), 72 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ded7d23..6f85137 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1637,7 +1637,7 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
 void bpf_map_put(struct bpf_map *map);
 void *bpf_map_area_alloc(u64 size, int numa_node);
 void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
-void bpf_map_area_free(void *base);
+void bpf_map_area_free(void *base, struct bpf_map *map);
 bool bpf_map_write_active(const struct bpf_map *map);
 void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
 int  generic_map_lookup_batch(struct bpf_map *map,
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index d3e734b..9ce4d1b 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -147,7 +147,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	array->elem_size = elem_size;
 
 	if (percpu && bpf_array_alloc_percpu(array)) {
-		bpf_map_area_free(array);
+		bpf_map_area_free(array, &array->map);
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -430,9 +430,9 @@ static void array_map_free(struct bpf_map *map)
 		bpf_array_free_percpu(array);
 
 	if (array->map.map_flags & BPF_F_MMAPABLE)
-		bpf_map_area_free(array_map_vmalloc_addr(array));
+		bpf_map_area_free(array_map_vmalloc_addr(array), map);
 	else
-		bpf_map_area_free(array);
+		bpf_map_area_free(array, map);
 }
 
 static void array_map_seq_show_elem(struct bpf_map *map, void *key,
@@ -774,7 +774,7 @@ static void fd_array_map_free(struct bpf_map *map)
 	for (i = 0; i < array->map.max_entries; i++)
 		BUG_ON(array->ptrs[i] != NULL);
 
-	bpf_map_area_free(array);
+	bpf_map_area_free(array, map);
 }
 
 static void *fd_array_map_lookup_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
index b9ea539..e59064d 100644
--- a/kernel/bpf/bloom_filter.c
+++ b/kernel/bpf/bloom_filter.c
@@ -168,7 +168,7 @@ static void bloom_map_free(struct bpf_map *map)
 	struct bpf_bloom_filter *bloom =
 		container_of(map, struct bpf_bloom_filter, map);
 
-	bpf_map_area_free(bloom);
+	bpf_map_area_free(bloom, map);
 }
 
 static void *bloom_map_lookup_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 4ee2e72..77e075b 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -582,7 +582,7 @@ void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
 	synchronize_rcu();
 
 	kvfree(smap->buckets);
-	bpf_map_area_free(smap);
+	bpf_map_area_free(smap, &smap->map);
 }
 
 int bpf_local_storage_map_alloc_check(union bpf_attr *attr)
@@ -623,7 +623,7 @@ struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
 	smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets,
 				 GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT);
 	if (!smap->buckets) {
-		bpf_map_area_free(smap);
+		bpf_map_area_free(smap, &smap->map);
 		return ERR_PTR(-ENOMEM);
 	}
 
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 36f24f8..9fb8ad1 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -577,10 +577,10 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
 
 	if (st_map->links)
 		bpf_struct_ops_map_put_progs(st_map);
-	bpf_map_area_free(st_map->links);
+	bpf_map_area_free(st_map->links, NULL);
 	bpf_jit_free_exec(st_map->image);
-	bpf_map_area_free(st_map->uvalue);
-	bpf_map_area_free(st_map);
+	bpf_map_area_free(st_map->uvalue, NULL);
+	bpf_map_area_free(st_map, map);
 }
 
 static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index b5ba34d..7de2ae6 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -118,7 +118,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 
 	return &cmap->map;
 free_cmap:
-	bpf_map_area_free(cmap);
+	bpf_map_area_free(cmap, &cmap->map);
 	return ERR_PTR(err);
 }
 
@@ -622,8 +622,8 @@ static void cpu_map_free(struct bpf_map *map)
 		/* bq flush and cleanup happens after RCU grace-period */
 		__cpu_map_entry_replace(cmap, i, NULL); /* call_rcu */
 	}
-	bpf_map_area_free(cmap->cpu_map);
-	bpf_map_area_free(cmap);
+	bpf_map_area_free(cmap->cpu_map, NULL);
+	bpf_map_area_free(cmap, map);
 }
 
 /* Elements are kept alive by RCU; either by rcu_read_lock() (from syscall) or
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 20decc7..3268ce7 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -168,7 +168,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 
 	err = dev_map_init_map(dtab, attr);
 	if (err) {
-		bpf_map_area_free(dtab);
+		bpf_map_area_free(dtab, &dtab->map);
 		return ERR_PTR(err);
 	}
 
@@ -221,7 +221,7 @@ static void dev_map_free(struct bpf_map *map)
 			}
 		}
 
-		bpf_map_area_free(dtab->dev_index_head);
+		bpf_map_area_free(dtab->dev_index_head, NULL);
 	} else {
 		for (i = 0; i < dtab->map.max_entries; i++) {
 			struct bpf_dtab_netdev *dev;
@@ -236,10 +236,10 @@ static void dev_map_free(struct bpf_map *map)
 			kfree(dev);
 		}
 
-		bpf_map_area_free(dtab->netdev_map);
+		bpf_map_area_free(dtab->netdev_map, NULL);
 	}
 
-	bpf_map_area_free(dtab);
+	bpf_map_area_free(dtab, &dtab->map);
 }
 
 static int dev_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 48dc04c..1b3653d 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -290,7 +290,7 @@ static void htab_free_elems(struct bpf_htab *htab)
 		cond_resched();
 	}
 free_elems:
-	bpf_map_area_free(htab->elems);
+	bpf_map_area_free(htab->elems, NULL);
 }
 
 /* The LRU list has a lock (lru_lock). Each htab bucket has a lock
@@ -576,10 +576,10 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 free_map_locked:
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
-	bpf_map_area_free(htab->buckets);
+	bpf_map_area_free(htab->buckets, NULL);
 free_htab:
 	lockdep_unregister_key(&htab->lockdep_key);
-	bpf_map_area_free(htab);
+	bpf_map_area_free(htab, &htab->map);
 	return ERR_PTR(err);
 }
 
@@ -1492,11 +1492,11 @@ static void htab_map_free(struct bpf_map *map)
 
 	bpf_map_free_kptr_off_tab(map);
 	free_percpu(htab->extra_elems);
-	bpf_map_area_free(htab->buckets);
+	bpf_map_area_free(htab->buckets, NULL);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	lockdep_unregister_key(&htab->lockdep_key);
-	bpf_map_area_free(htab);
+	bpf_map_area_free(htab, map);
 }
 
 static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 098cf33..c705d66 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -345,7 +345,7 @@ static void cgroup_storage_map_free(struct bpf_map *_map)
 	WARN_ON(!RB_EMPTY_ROOT(&map->root));
 	WARN_ON(!list_empty(&map->list));
 
-	bpf_map_area_free(map);
+	bpf_map_area_free(map, _map);
 }
 
 static int cgroup_storage_delete_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index d833496..fd99360 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -609,7 +609,7 @@ static void trie_free(struct bpf_map *map)
 	}
 
 out:
-	bpf_map_area_free(trie);
+	bpf_map_area_free(trie, map);
 }
 
 static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key)
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 13e4efc..c9941a9 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -404,7 +404,7 @@ struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr)
 err_unlock:
 	up_write(&bpf_devs_lock);
 	rtnl_unlock();
-	bpf_map_area_free(offmap);
+	bpf_map_area_free(offmap, &offmap->map);
 	return ERR_PTR(err);
 }
 
@@ -428,7 +428,7 @@ void bpf_map_offload_map_free(struct bpf_map *map)
 	up_write(&bpf_devs_lock);
 	rtnl_unlock();
 
-	bpf_map_area_free(offmap);
+	bpf_map_area_free(offmap, map);
 }
 
 int bpf_map_offload_lookup_elem(struct bpf_map *map, void *key, void *value)
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index 8a5e060..f2ec0c4 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -92,7 +92,7 @@ static void queue_stack_map_free(struct bpf_map *map)
 {
 	struct bpf_queue_stack *qs = bpf_queue_stack(map);
 
-	bpf_map_area_free(qs);
+	bpf_map_area_free(qs, map);
 }
 
 static int __queue_map_get(struct bpf_map *map, void *value, bool delete)
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index e2618fb..594cdb0 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -146,7 +146,7 @@ static void reuseport_array_free(struct bpf_map *map)
 	 * Once reaching here, all sk->sk_user_data is not
 	 * referencing this "array". "array" can be freed now.
 	 */
-	bpf_map_area_free(array);
+	bpf_map_area_free(array, map);
 }
 
 static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index b483aea..74dd8dc 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -116,7 +116,7 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 err_free_pages:
 	for (i = 0; i < nr_pages; i++)
 		__free_page(pages[i]);
-	bpf_map_area_free(pages);
+	bpf_map_area_free(pages, NULL);
 	return NULL;
 }
 
@@ -172,7 +172,7 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 
 	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
 	if (!rb_map->rb) {
-		bpf_map_area_free(rb_map);
+		bpf_map_area_free(rb_map, &rb_map->map);
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -190,7 +190,7 @@ static void bpf_ringbuf_free(struct bpf_ringbuf *rb)
 	vunmap(rb);
 	for (i = 0; i < nr_pages; i++)
 		__free_page(pages[i]);
-	bpf_map_area_free(pages);
+	bpf_map_area_free(pages, NULL);
 }
 
 static void ringbuf_map_free(struct bpf_map *map)
@@ -199,7 +199,7 @@ static void ringbuf_map_free(struct bpf_map *map)
 
 	rb_map = container_of(map, struct bpf_ringbuf_map, map);
 	bpf_ringbuf_free(rb_map->rb);
-	bpf_map_area_free(rb_map);
+	bpf_map_area_free(rb_map, map);
 }
 
 static void *ringbuf_map_lookup_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 1adbe67..042b7d2 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -62,7 +62,7 @@ static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
 	return 0;
 
 free_elems:
-	bpf_map_area_free(smap->elems);
+	bpf_map_area_free(smap->elems, NULL);
 	return err;
 }
 
@@ -120,7 +120,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 put_buffers:
 	put_callchain_buffers();
 free_smap:
-	bpf_map_area_free(smap);
+	bpf_map_area_free(smap, &smap->map);
 	return ERR_PTR(err);
 }
 
@@ -648,9 +648,9 @@ static void stack_map_free(struct bpf_map *map)
 {
 	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
 
-	bpf_map_area_free(smap->elems);
+	bpf_map_area_free(smap->elems, NULL);
 	pcpu_freelist_destroy(&smap->freelist);
-	bpf_map_area_free(smap);
+	bpf_map_area_free(smap, map);
 	put_callchain_buffers();
 }
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 19c3a81..01d8d4a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -293,6 +293,34 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 	return err;
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static void bpf_map_save_memcg(struct bpf_map *map)
+{
+	/* Currently if a map is created by a process belonging to the root
+	 * memory cgroup, get_obj_cgroup_from_current() will return NULL.
+	 * So we have to check map->objcg for being NULL each time it's
+	 * being used.
+	 */
+	map->objcg = get_obj_cgroup_from_current();
+}
+
+static void bpf_map_release_memcg(struct bpf_map *map)
+{
+	if (map->objcg)
+		obj_cgroup_put(map->objcg);
+}
+
+#else
+static void bpf_map_save_memcg(struct bpf_map *map)
+{
+}
+
+static void bpf_map_release_memcg(struct bpf_map *map)
+{
+}
+
+#endif
+
 /* Please, do not use this function outside from the map creation path
  * (e.g. in map update path) without taking care of setting the active
  * memory cgroup (see at bpf_map_kmalloc_node() for example).
@@ -344,8 +372,10 @@ void *bpf_map_area_mmapable_alloc(u64 size, int numa_node)
 	return __bpf_map_area_alloc(size, numa_node, true);
 }
 
-void bpf_map_area_free(void *area)
+void bpf_map_area_free(void *area, struct bpf_map *map)
 {
+	if (map)
+		bpf_map_release_memcg(map);
 	kvfree(area);
 }
 
@@ -363,6 +393,7 @@ static u32 bpf_map_flags_retain_permanent(u32 flags)
 
 void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 {
+	bpf_map_save_memcg(map);
 	map->map_type = attr->map_type;
 	map->key_size = attr->key_size;
 	map->value_size = attr->value_size;
@@ -417,22 +448,6 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static void bpf_map_save_memcg(struct bpf_map *map)
-{
-	/* Currently if a map is created by a process belonging to the root
-	 * memory cgroup, get_obj_cgroup_from_current() will return NULL.
-	 * So we have to check map->objcg for being NULL each time it's
-	 * being used.
-	 */
-	map->objcg = get_obj_cgroup_from_current();
-}
-
-static void bpf_map_release_memcg(struct bpf_map *map)
-{
-	if (map->objcg)
-		obj_cgroup_put(map->objcg);
-}
-
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node)
 {
@@ -477,14 +492,6 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 	return ptr;
 }
 
-#else
-static void bpf_map_save_memcg(struct bpf_map *map)
-{
-}
-
-static void bpf_map_release_memcg(struct bpf_map *map)
-{
-}
 #endif
 
 static int bpf_map_kptr_off_cmp(const void *a, const void *b)
@@ -605,7 +612,6 @@ static void bpf_map_free_deferred(struct work_struct *work)
 
 	security_bpf_map_free(map);
 	kfree(map->off_arr);
-	bpf_map_release_memcg(map);
 	/* implementation dependent freeing, map_free callback also does
 	 * bpf_map_free_kptr_off_tab, if needed.
 	 */
@@ -1154,8 +1160,6 @@ static int map_create(union bpf_attr *attr)
 	if (err)
 		goto free_map_sec;
 
-	bpf_map_save_memcg(map);
-
 	err = bpf_map_new_fd(map, f_flags);
 	if (err < 0) {
 		/* failed to allocate fd.
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d0c4338..e8f414a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -52,7 +52,7 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 				       sizeof(struct sock *),
 				       stab->map.numa_node);
 	if (!stab->sks) {
-		bpf_map_area_free(stab);
+		bpf_map_area_free(stab, &stab->map);
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -360,8 +360,8 @@ static void sock_map_free(struct bpf_map *map)
 	/* wait for psock readers accessing its map link */
 	synchronize_rcu();
 
-	bpf_map_area_free(stab->sks);
-	bpf_map_area_free(stab);
+	bpf_map_area_free(stab->sks, NULL);
+	bpf_map_area_free(stab, map);
 }
 
 static void sock_map_release_progs(struct bpf_map *map)
@@ -1106,7 +1106,7 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
 
 	return &htab->map;
 free_htab:
-	bpf_map_area_free(htab);
+	bpf_map_area_free(htab, &htab->map);
 	return ERR_PTR(err);
 }
 
@@ -1158,8 +1158,8 @@ static void sock_hash_free(struct bpf_map *map)
 	/* wait for psock readers accessing its map link */
 	synchronize_rcu();
 
-	bpf_map_area_free(htab->buckets);
-	bpf_map_area_free(htab);
+	bpf_map_area_free(htab->buckets, NULL);
+	bpf_map_area_free(htab, map);
 }
 
 static void *sock_hash_lookup_sys(struct bpf_map *map, void *key)
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index acc8e52..5abb87e 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -90,7 +90,7 @@ static void xsk_map_free(struct bpf_map *map)
 	struct xsk_map *m = container_of(map, struct xsk_map, map);
 
 	synchronize_net();
-	bpf_map_area_free(m);
+	bpf_map_area_free(m, map);
 }
 
 static int xsk_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v2 06/12] bpf: Use scope-based charge in bpf_map_area_alloc
@ 2022-08-18 14:31   ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Currently bpf_map_area_alloc() is used both to allocate the container of
a struct bpf_map and to allocate members inside that container.  To
distinguish map creation from the other cases, a new parameter, struct
bpf_map *, is added to bpf_map_area_alloc(). For the non-map-creation
cases we can then get the memcg from the map instead of using the
current memcg.
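
The diff below uses bpf_map_get_memcg() and bpf_map_put_memcg(), which
are introduced earlier in the series and are not shown in this message.
A minimal sketch of what they are assumed to look like, resolving the
objcg saved at map creation time and falling back to the root memcg:

/* Sketch only, not part of this patch; assumes map->objcg is populated
 * by bpf_map_save_memcg() and that the root memcg is used when no objcg
 * was recorded.
 */
static inline struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
{
	if (map->objcg)
		return get_mem_cgroup_from_objcg(map->objcg);

	return root_mem_cgroup;
}

static inline void bpf_map_put_memcg(struct mem_cgroup *memcg)
{
	mem_cgroup_put(memcg);
}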

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            |  2 +-
 kernel/bpf/arraymap.c          |  2 +-
 kernel/bpf/bloom_filter.c      |  2 +-
 kernel/bpf/bpf_local_storage.c |  2 +-
 kernel/bpf/bpf_struct_ops.c    |  6 +++---
 kernel/bpf/cpumap.c            |  5 +++--
 kernel/bpf/devmap.c            | 13 ++++++++-----
 kernel/bpf/hashtab.c           |  8 +++++---
 kernel/bpf/local_storage.c     |  2 +-
 kernel/bpf/lpm_trie.c          |  2 +-
 kernel/bpf/offload.c           |  2 +-
 kernel/bpf/queue_stack_maps.c  |  2 +-
 kernel/bpf/reuseport_array.c   |  2 +-
 kernel/bpf/ringbuf.c           | 15 +++++++++------
 kernel/bpf/stackmap.c          |  5 +++--
 kernel/bpf/syscall.c           | 16 ++++++++++++++--
 net/core/sock_map.c            | 10 ++++++----
 net/xdp/xskmap.c               |  2 +-
 18 files changed, 61 insertions(+), 37 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6f85137..066f351d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1635,7 +1635,7 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
 struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
-void *bpf_map_area_alloc(u64 size, int numa_node);
+void *bpf_map_area_alloc(u64 size, int numa_node, struct bpf_map *map);
 void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
 void bpf_map_area_free(void *base, struct bpf_map *map);
 bool bpf_map_write_active(const struct bpf_map *map);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 9ce4d1b..80974c5 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -135,7 +135,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 		array = data + PAGE_ALIGN(sizeof(struct bpf_array))
 			- offsetof(struct bpf_array, value);
 	} else {
-		array = bpf_map_area_alloc(array_size, numa_node);
+		array = bpf_map_area_alloc(array_size, numa_node, NULL);
 	}
 	if (!array)
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
index e59064d..6691f79 100644
--- a/kernel/bpf/bloom_filter.c
+++ b/kernel/bpf/bloom_filter.c
@@ -142,7 +142,7 @@ static struct bpf_map *bloom_map_alloc(union bpf_attr *attr)
 	}
 
 	bitset_bytes = roundup(bitset_bytes, sizeof(unsigned long));
-	bloom = bpf_map_area_alloc(sizeof(*bloom) + bitset_bytes, numa_node);
+	bloom = bpf_map_area_alloc(sizeof(*bloom) + bitset_bytes, numa_node, NULL);
 
 	if (!bloom)
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 77e075b..67ab249 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -610,7 +610,7 @@ struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
 	unsigned int i;
 	u32 nbuckets;
 
-	smap = bpf_map_area_alloc(sizeof(*smap), NUMA_NO_NODE);
+	smap = bpf_map_area_alloc(sizeof(*smap), NUMA_NO_NODE, NULL);
 	if (!smap)
 		return ERR_PTR(-ENOMEM);
 	bpf_map_init_from_attr(&smap->map, attr);
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 9fb8ad1..37ba5c0 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -618,7 +618,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 		 */
 		(vt->size - sizeof(struct bpf_struct_ops_value));
 
-	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
+	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE, NULL);
 	if (!st_map)
 		return ERR_PTR(-ENOMEM);
 
@@ -626,10 +626,10 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 	map = &st_map->map;
 	bpf_map_init_from_attr(map, attr);
 
-	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
+	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE, map);
 	st_map->links =
 		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_links *),
-				   NUMA_NO_NODE);
+				   NUMA_NO_NODE, map);
 	st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
 	if (!st_map->uvalue || !st_map->links || !st_map->image) {
 		bpf_struct_ops_map_free(map);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 7de2ae6..b593157 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -97,7 +97,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	    attr->map_flags & ~BPF_F_NUMA_NODE)
 		return ERR_PTR(-EINVAL);
 
-	cmap = bpf_map_area_alloc(sizeof(*cmap), NUMA_NO_NODE);
+	cmap = bpf_map_area_alloc(sizeof(*cmap), NUMA_NO_NODE, NULL);
 	if (!cmap)
 		return ERR_PTR(-ENOMEM);
 
@@ -112,7 +112,8 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	/* Alloc array for possible remote "destination" CPUs */
 	cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
 					   sizeof(struct bpf_cpu_map_entry *),
-					   cmap->map.numa_node);
+					   cmap->map.numa_node,
+					   &cmap->map);
 	if (!cmap->cpu_map)
 		goto free_cmap;
 
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 3268ce7..807a4cd 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -89,12 +89,13 @@ struct bpf_dtab {
 static LIST_HEAD(dev_map_list);
 
 static struct hlist_head *dev_map_create_hash(unsigned int entries,
-					      int numa_node)
+					      int numa_node,
+					      struct bpf_map *map)
 {
 	int i;
 	struct hlist_head *hash;
 
-	hash = bpf_map_area_alloc((u64) entries * sizeof(*hash), numa_node);
+	hash = bpf_map_area_alloc((u64) entries * sizeof(*hash), numa_node, map);
 	if (hash != NULL)
 		for (i = 0; i < entries; i++)
 			INIT_HLIST_HEAD(&hash[i]);
@@ -136,7 +137,8 @@ static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
 
 	if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
 		dtab->dev_index_head = dev_map_create_hash(dtab->n_buckets,
-							   dtab->map.numa_node);
+							   dtab->map.numa_node,
+							   &dtab->map);
 		if (!dtab->dev_index_head)
 			return -ENOMEM;
 
@@ -144,7 +146,8 @@ static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
 	} else {
 		dtab->netdev_map = bpf_map_area_alloc((u64) dtab->map.max_entries *
 						      sizeof(struct bpf_dtab_netdev *),
-						      dtab->map.numa_node);
+						      dtab->map.numa_node,
+						      &dtab->map);
 		if (!dtab->netdev_map)
 			return -ENOMEM;
 	}
@@ -160,7 +163,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	if (!capable(CAP_NET_ADMIN))
 		return ERR_PTR(-EPERM);
 
-	dtab = bpf_map_area_alloc(sizeof(*dtab), NUMA_NO_NODE);
+	dtab = bpf_map_area_alloc(sizeof(*dtab), NUMA_NO_NODE, NULL);
 	if (!dtab)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 1b3653d..e9a6d2c4 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -332,7 +332,8 @@ static int prealloc_init(struct bpf_htab *htab)
 		num_entries += num_possible_cpus();
 
 	htab->elems = bpf_map_area_alloc((u64)htab->elem_size * num_entries,
-					 htab->map.numa_node);
+					 htab->map.numa_node,
+					 &htab->map);
 	if (!htab->elems)
 		return -ENOMEM;
 
@@ -495,7 +496,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	struct bpf_htab *htab;
 	int err, i;
 
-	htab = bpf_map_area_alloc(sizeof(*htab), NUMA_NO_NODE);
+	htab = bpf_map_area_alloc(sizeof(*htab), NUMA_NO_NODE, NULL);
 	if (!htab)
 		return ERR_PTR(-ENOMEM);
 
@@ -534,7 +535,8 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	err = -ENOMEM;
 	htab->buckets = bpf_map_area_alloc(htab->n_buckets *
 					   sizeof(struct bucket),
-					   htab->map.numa_node);
+					   htab->map.numa_node,
+					   &htab->map);
 	if (!htab->buckets)
 		goto free_htab;
 
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index c705d66..fcc7ece 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -313,7 +313,7 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
 		/* max_entries is not used and enforced to be 0 */
 		return ERR_PTR(-EINVAL);
 
-	map = bpf_map_area_alloc(sizeof(struct bpf_cgroup_storage_map), numa_node);
+	map = bpf_map_area_alloc(sizeof(struct bpf_cgroup_storage_map), numa_node, NULL);
 	if (!map)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index fd99360..3d329ae 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -558,7 +558,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 	    attr->value_size > LPM_VAL_SIZE_MAX)
 		return ERR_PTR(-EINVAL);
 
-	trie = bpf_map_area_alloc(sizeof(*trie), NUMA_NO_NODE);
+	trie = bpf_map_area_alloc(sizeof(*trie), NUMA_NO_NODE, NULL);
 	if (!trie)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index c9941a9..87c59da 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -372,7 +372,7 @@ struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr)
 	    attr->map_type != BPF_MAP_TYPE_HASH)
 		return ERR_PTR(-EINVAL);
 
-	offmap = bpf_map_area_alloc(sizeof(*offmap), NUMA_NO_NODE);
+	offmap = bpf_map_area_alloc(sizeof(*offmap), NUMA_NO_NODE, NULL);
 	if (!offmap)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index f2ec0c4..bf57e45 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -74,7 +74,7 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
 	size = (u64) attr->max_entries + 1;
 	queue_size = sizeof(*qs) + size * attr->value_size;
 
-	qs = bpf_map_area_alloc(queue_size, numa_node);
+	qs = bpf_map_area_alloc(queue_size, numa_node, NULL);
 	if (!qs)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 594cdb0..52c7e77 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -158,7 +158,7 @@ static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 		return ERR_PTR(-EPERM);
 
 	/* allocate all map elements and zero-initialize them */
-	array = bpf_map_area_alloc(struct_size(array, ptrs, attr->max_entries), numa_node);
+	array = bpf_map_area_alloc(struct_size(array, ptrs, attr->max_entries), numa_node, NULL);
 	if (!array)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 74dd8dc..5eb7820 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -59,7 +59,8 @@ struct bpf_ringbuf_hdr {
 	u32 pg_off;
 };
 
-static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
+static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node,
+						  struct bpf_map *map)
 {
 	const gfp_t flags = GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL |
 			    __GFP_NOWARN | __GFP_ZERO;
@@ -89,7 +90,7 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 	 * user-space implementations significantly.
 	 */
 	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
-	pages = bpf_map_area_alloc(array_size, numa_node);
+	pages = bpf_map_area_alloc(array_size, numa_node, map);
 	if (!pages)
 		return NULL;
 
@@ -127,11 +128,12 @@ static void bpf_ringbuf_notify(struct irq_work *work)
 	wake_up_all(&rb->waitq);
 }
 
-static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
+static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node,
+					     struct bpf_map *map)
 {
 	struct bpf_ringbuf *rb;
 
-	rb = bpf_ringbuf_area_alloc(data_sz, numa_node);
+	rb = bpf_ringbuf_area_alloc(data_sz, numa_node, map);
 	if (!rb)
 		return NULL;
 
@@ -164,13 +166,14 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-E2BIG);
 #endif
 
-	rb_map = bpf_map_area_alloc(sizeof(*rb_map), NUMA_NO_NODE);
+	rb_map = bpf_map_area_alloc(sizeof(*rb_map), NUMA_NO_NODE, NULL);
 	if (!rb_map)
 		return ERR_PTR(-ENOMEM);
 
 	bpf_map_init_from_attr(&rb_map->map, attr);
 
-	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
+	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node,
+				       &rb_map->map);
 	if (!rb_map->rb) {
 		bpf_map_area_free(rb_map, &rb_map->map);
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 042b7d2..9440fab 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -49,7 +49,8 @@ static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
 	int err;
 
 	smap->elems = bpf_map_area_alloc(elem_size * smap->map.max_entries,
-					 smap->map.numa_node);
+					 smap->map.numa_node,
+					 &smap->map);
 	if (!smap->elems)
 		return -ENOMEM;
 
@@ -100,7 +101,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-E2BIG);
 
 	cost = n_buckets * sizeof(struct stack_map_bucket *) + sizeof(*smap);
-	smap = bpf_map_area_alloc(cost, bpf_map_attr_numa_node(attr));
+	smap = bpf_map_area_alloc(cost, bpf_map_attr_numa_node(attr), NULL);
 	if (!smap)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 01d8d4a..02ce7e9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -362,9 +362,21 @@ static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
 			flags, numa_node, __builtin_return_address(0));
 }
 
-void *bpf_map_area_alloc(u64 size, int numa_node)
+void *bpf_map_area_alloc(u64 size, int numa_node, struct bpf_map *map)
 {
-	return __bpf_map_area_alloc(size, numa_node, false);
+	struct mem_cgroup *memcg, *old_memcg;
+	void *ptr;
+
+	if (!map)
+		return __bpf_map_area_alloc(size, numa_node, false);
+
+	memcg = bpf_map_get_memcg(map);
+	old_memcg = set_active_memcg(memcg);
+	ptr = __bpf_map_area_alloc(size, numa_node, false);
+	set_active_memcg(old_memcg);
+	bpf_map_put_memcg(memcg);
+
+	return ptr;
 }
 
 void *bpf_map_area_mmapable_alloc(u64 size, int numa_node)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index e8f414a..2b3b24e 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -41,7 +41,7 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
 
-	stab = bpf_map_area_alloc(sizeof(*stab), NUMA_NO_NODE);
+	stab = bpf_map_area_alloc(sizeof(*stab), NUMA_NO_NODE, NULL);
 	if (!stab)
 		return ERR_PTR(-ENOMEM);
 
@@ -50,7 +50,8 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 
 	stab->sks = bpf_map_area_alloc((u64) stab->map.max_entries *
 				       sizeof(struct sock *),
-				       stab->map.numa_node);
+				       stab->map.numa_node,
+				       &stab->map);
 	if (!stab->sks) {
 		bpf_map_area_free(stab, &stab->map);
 		return ERR_PTR(-ENOMEM);
@@ -1076,7 +1077,7 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
 	if (attr->key_size > MAX_BPF_STACK)
 		return ERR_PTR(-E2BIG);
 
-	htab = bpf_map_area_alloc(sizeof(*htab), NUMA_NO_NODE);
+	htab = bpf_map_area_alloc(sizeof(*htab), NUMA_NO_NODE, NULL);
 	if (!htab)
 		return ERR_PTR(-ENOMEM);
 
@@ -1093,7 +1094,8 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
 
 	htab->buckets = bpf_map_area_alloc(htab->buckets_num *
 					   sizeof(struct bpf_shtab_bucket),
-					   htab->map.numa_node);
+					   htab->map.numa_node,
+					   &htab->map);
 	if (!htab->buckets) {
 		err = -ENOMEM;
 		goto free_htab;
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 5abb87e..beb11fd 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -75,7 +75,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 	numa_node = bpf_map_attr_numa_node(attr);
 	size = struct_size(m, xsk_map, attr->max_entries);
 
-	m = bpf_map_area_alloc(size, numa_node);
+	m = bpf_map_area_alloc(size, numa_node, NULL);
 	if (!m)
 		return ERR_PTR(-ENOMEM);
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v2 07/12] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
@ 2022-08-18 14:31   ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao, Andrii Nakryiko

Move the page-related memory allocation into a new helper,
bpf_ringbuf_pages_alloc(), so that it can be handled as a single unit.

Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/ringbuf.c | 80 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 56 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 5eb7820..1e7284c 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -59,6 +59,57 @@ struct bpf_ringbuf_hdr {
 	u32 pg_off;
 };
 
+static void bpf_ringbuf_pages_free(struct page **pages, int nr_pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		__free_page(pages[i]);
+	bpf_map_area_free(pages, NULL);
+}
+
+static struct page **bpf_ringbuf_pages_alloc(struct bpf_map *map,
+					     int nr_meta_pages,
+					     int nr_data_pages,
+					     int numa_node,
+					     const gfp_t flags)
+{
+	int nr_pages = nr_meta_pages + nr_data_pages;
+	struct mem_cgroup *memcg, *old_memcg;
+	struct page **pages, *page;
+	int array_size;
+	int i;
+
+	memcg = bpf_map_get_memcg(map);
+	old_memcg = set_active_memcg(memcg);
+	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
+	pages = bpf_map_area_alloc(array_size, numa_node, NULL);
+	if (!pages)
+		goto err;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = alloc_pages_node(numa_node, flags, 0);
+		if (!page) {
+			nr_pages = i;
+			goto err_free_pages;
+		}
+		pages[i] = page;
+		if (i >= nr_meta_pages)
+			pages[nr_data_pages + i] = page;
+	}
+	set_active_memcg(old_memcg);
+	bpf_map_put_memcg(memcg);
+
+	return pages;
+
+err_free_pages:
+	bpf_ringbuf_pages_free(pages, nr_pages);
+err:
+	set_active_memcg(old_memcg);
+	bpf_map_put_memcg(memcg);
+	return NULL;
+}
+
 static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node,
 						  struct bpf_map *map)
 {
@@ -67,10 +118,8 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node,
 	int nr_meta_pages = RINGBUF_PGOFF + RINGBUF_POS_PAGES;
 	int nr_data_pages = data_sz >> PAGE_SHIFT;
 	int nr_pages = nr_meta_pages + nr_data_pages;
-	struct page **pages, *page;
 	struct bpf_ringbuf *rb;
-	size_t array_size;
-	int i;
+	struct page **pages;
 
 	/* Each data page is mapped twice to allow "virtual"
 	 * continuous read of samples wrapping around the end of ring
@@ -89,22 +138,11 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node,
 	 * when mmap()'ed in user-space, simplifying both kernel and
 	 * user-space implementations significantly.
 	 */
-	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
-	pages = bpf_map_area_alloc(array_size, numa_node, map);
+	pages = bpf_ringbuf_pages_alloc(map, nr_meta_pages, nr_data_pages,
+					numa_node, flags);
 	if (!pages)
 		return NULL;
 
-	for (i = 0; i < nr_pages; i++) {
-		page = alloc_pages_node(numa_node, flags, 0);
-		if (!page) {
-			nr_pages = i;
-			goto err_free_pages;
-		}
-		pages[i] = page;
-		if (i >= nr_meta_pages)
-			pages[nr_data_pages + i] = page;
-	}
-
 	rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
 		  VM_MAP | VM_USERMAP, PAGE_KERNEL);
 	if (rb) {
@@ -114,10 +152,6 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node,
 		return rb;
 	}
 
-err_free_pages:
-	for (i = 0; i < nr_pages; i++)
-		__free_page(pages[i]);
-	bpf_map_area_free(pages, NULL);
 	return NULL;
 }
 
@@ -188,12 +222,10 @@ static void bpf_ringbuf_free(struct bpf_ringbuf *rb)
 	 * to unmap rb itself with vunmap() below
 	 */
 	struct page **pages = rb->pages;
-	int i, nr_pages = rb->nr_pages;
+	int nr_pages = rb->nr_pages;
 
 	vunmap(rb);
-	for (i = 0; i < nr_pages; i++)
-		__free_page(pages[i]);
-	bpf_map_area_free(pages, NULL);
+	bpf_ringbuf_pages_free(pages, nr_pages);
 }
 
 static void ringbuf_map_free(struct bpf_map *map)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v2 08/12] bpf: Use bpf_map_kzalloc in arraymap
@ 2022-08-18 14:31   ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Allocate the memory after map creation, so that we can use the generic
helper bpf_map_kzalloc() instead of the open-coded kzalloc().

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/arraymap.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 80974c5..3039832 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1090,20 +1090,20 @@ static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
 	struct bpf_array_aux *aux;
 	struct bpf_map *map;
 
-	aux = kzalloc(sizeof(*aux), GFP_KERNEL_ACCOUNT);
-	if (!aux)
+	map = array_map_alloc(attr);
+	if (IS_ERR(map))
 		return ERR_PTR(-ENOMEM);
 
+	aux = bpf_map_kzalloc(map, sizeof(*aux), GFP_KERNEL);
+	if (!aux) {
+		array_map_free(map);
+		return ERR_PTR(-ENOMEM);
+	}
+
 	INIT_WORK(&aux->work, prog_array_map_clear_deferred);
 	INIT_LIST_HEAD(&aux->poke_progs);
 	mutex_init(&aux->poke_mutex);
 
-	map = array_map_alloc(attr);
-	if (IS_ERR(map)) {
-		kfree(aux);
-		return map;
-	}
-
 	container_of(map, struct bpf_array, map)->aux = aux;
 	aux->map = map;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v2 09/12] bpf: Use bpf_map_kvcalloc in bpf_local_storage
@ 2022-08-18 14:31   ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Introduce a new helper, bpf_map_kvcalloc(), for this memory allocation.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            |  8 ++++++++
 kernel/bpf/bpf_local_storage.c |  4 ++--
 kernel/bpf/syscall.c           | 15 +++++++++++++++
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 066f351d..f95a2df 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1656,6 +1656,8 @@ int  generic_map_delete_batch(struct bpf_map *map,
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node);
 void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags);
+void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
+		       gfp_t flags);
 void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 				    size_t align, gfp_t flags);
 #else
@@ -1672,6 +1674,12 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 	return kzalloc(size, flags);
 }
 
+static inline void *
+bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size, gfp_t flags)
+{
+	return kvcalloc(n, size, flags);
+}
+
 static inline void __percpu *
 bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, size_t align,
 		     gfp_t flags)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 67ab249..71d5bf1 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -620,8 +620,8 @@ struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
 	nbuckets = max_t(u32, 2, nbuckets);
 	smap->bucket_log = ilog2(nbuckets);
 
-	smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets,
-				 GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT);
+	smap->buckets = bpf_map_kvcalloc(&smap->map, sizeof(*smap->buckets),
+					 nbuckets, GFP_USER | __GFP_NOWARN);
 	if (!smap->buckets) {
 		bpf_map_area_free(smap, &smap->map);
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 02ce7e9..2bd3bcf 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -489,6 +489,21 @@ void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 	return ptr;
 }
 
+void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
+		       gfp_t flags)
+{
+	struct mem_cgroup *memcg, *old_memcg;
+	void *ptr;
+
+	memcg = bpf_map_get_memcg(map);
+	old_memcg = set_active_memcg(memcg);
+	ptr = kvcalloc(n, size, flags | __GFP_ACCOUNT);
+	set_active_memcg(old_memcg);
+	bpf_map_put_memcg(memcg);
+
+	return ptr;
+}
+
 void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 				    size_t align, gfp_t flags)
 {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v2 10/12] mm, memcg: Add new helper get_obj_cgroup_from_cgroup
  2022-08-18 14:31 [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (8 preceding siblings ...)
  2022-08-18 14:31   ` Yafang Shao
@ 2022-08-18 14:31 ` Yafang Shao
  2022-08-18 20:38   ` Shakeel Butt
  2022-08-18 14:31 ` [PATCH bpf-next v2 11/12] bpf: Add return value for bpf_map_init_from_attr Yafang Shao
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

We want to open a cgroup directory and pass the fd into the kernel, and
then in the kernel we can charge the allocated memory to that cgroup if
it has a valid memory subsystem. In the bpf subsystem, the opened cgroup
will be stored as a struct obj_cgroup pointer, so a new helper,
get_obj_cgroup_from_cgroup(), is introduced.

It also adds a comment on why the helper __get_obj_cgroup_from_memcg()
must be protected by the rcu read lock.
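
A minimal sketch of how a caller might turn a user-supplied cgroup fd
into the obj_cgroup that will later be saved in the map. Only
get_obj_cgroup_from_cgroup() comes from this patch; the wrapper function
below is hypothetical and just illustrates the intended usage:

/* Hypothetical caller, not part of this series. cgroup_get_from_fd()
 * takes a reference on the cgroup, which is dropped once the objcg
 * reference has been obtained.
 */
static struct obj_cgroup *objcg_from_user_fd(int cgroup_fd)
{
	struct obj_cgroup *objcg;
	struct cgroup *cgrp;

	cgrp = cgroup_get_from_fd(cgroup_fd);
	if (IS_ERR(cgrp))
		return ERR_CAST(cgrp);

	objcg = get_obj_cgroup_from_cgroup(cgrp);
	cgroup_put(cgrp);
	return objcg;
}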

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 47 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2f0a611..901a921 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1713,6 +1713,7 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg,
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
 
+struct obj_cgroup *get_obj_cgroup_from_cgroup(struct cgroup *cgrp);
 struct obj_cgroup *get_obj_cgroup_from_current(void);
 struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 618c366..0409cc4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2895,6 +2895,14 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 	return page_memcg_check(folio_page(folio, 0));
 }
 
+/*
+ * Pls. note that the memcg->objcg can be freed after it is dereferenced,
+ * which can happen when the memcg is changed from online to offline.
+ * So this helper must be protected by the rcu read lock.
+ *
+ * After obj_cgroup_tryget(), it is safe to use the objcg outside of the rcu
+ * read-side critical section.
+ */
 static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
 {
 	struct obj_cgroup *objcg = NULL;
@@ -2908,6 +2916,45 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
 	return objcg;
 }
 
+static struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
+{
+	struct obj_cgroup *objcg;
+
+	if (memcg_kmem_bypass())
+		return NULL;
+
+	rcu_read_lock();
+	objcg = __get_obj_cgroup_from_memcg(memcg);
+	rcu_read_unlock();
+	return objcg;
+}
+
+struct obj_cgroup *get_obj_cgroup_from_cgroup(struct cgroup *cgrp)
+{
+	struct cgroup_subsys_state *css;
+	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
+
+	rcu_read_lock();
+	css = rcu_dereference(cgrp->subsys[memory_cgrp_id]);
+	if (!css || !css_tryget_online(css)) {
+		rcu_read_unlock();
+		return ERR_PTR(-EINVAL);
+	}
+	rcu_read_unlock();
+
+	memcg = mem_cgroup_from_css(css);
+	if (!memcg) {
+		css_put(css);
+		return ERR_PTR(-EINVAL);
+	}
+
+	objcg = get_obj_cgroup_from_memcg(memcg);
+	css_put(css);
+
+	return objcg;
+}
+
 __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
 {
 	struct obj_cgroup *objcg = NULL;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v2 11/12] bpf: Add return value for bpf_map_init_from_attr
  2022-08-18 14:31 [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (9 preceding siblings ...)
  2022-08-18 14:31 ` [PATCH bpf-next v2 10/12] mm, memcg: Add new helper get_obj_cgroup_from_cgroup Yafang Shao
@ 2022-08-18 14:31 ` Yafang Shao
  2022-08-18 14:31 ` [PATCH bpf-next v2 12/12] bpf: Introduce selectable memcg for bpf map Yafang Shao
  2022-08-18 22:20   ` Tejun Heo
  12 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Add a return value to bpf_map_init_from_attr() to indicate whether it
initialized successfully. This is a preparation for the followup patch.
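
Each caller is converted to check the return value and to free the
just-allocated map area on failure. The resulting pattern, shown here for
the arraymap conversion below (the other maps follow suit), is:

	err = bpf_map_init_from_attr(&array->map, attr);
	if (err) {
		bpf_map_area_free(array, NULL);
		return ERR_PTR(err);
	}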

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            | 2 +-
 kernel/bpf/arraymap.c          | 8 +++++++-
 kernel/bpf/bloom_filter.c      | 7 ++++++-
 kernel/bpf/bpf_local_storage.c | 7 ++++++-
 kernel/bpf/bpf_struct_ops.c    | 7 ++++++-
 kernel/bpf/cpumap.c            | 6 +++++-
 kernel/bpf/devmap.c            | 6 +++++-
 kernel/bpf/hashtab.c           | 6 +++++-
 kernel/bpf/local_storage.c     | 7 ++++++-
 kernel/bpf/lpm_trie.c          | 8 +++++++-
 kernel/bpf/offload.c           | 6 +++++-
 kernel/bpf/queue_stack_maps.c  | 7 ++++++-
 kernel/bpf/reuseport_array.c   | 7 ++++++-
 kernel/bpf/ringbuf.c           | 7 ++++++-
 kernel/bpf/syscall.c           | 4 +++-
 net/core/sock_map.c            | 8 +++++++-
 net/xdp/xskmap.c               | 8 +++++++-
 17 files changed, 94 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f95a2df..0c3534c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1639,7 +1639,7 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
 void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
 void bpf_map_area_free(void *base, struct bpf_map *map);
 bool bpf_map_write_active(const struct bpf_map *map);
-void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
+int bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
 int  generic_map_lookup_batch(struct bpf_map *map,
 			      const union bpf_attr *attr,
 			      union bpf_attr __user *uattr);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 3039832..28d0310 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -85,6 +85,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	bool bypass_spec_v1 = bpf_bypass_spec_v1();
 	u64 array_size, mask64;
 	struct bpf_array *array;
+	int err;
 
 	elem_size = round_up(attr->value_size, 8);
 
@@ -143,7 +144,12 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	array->map.bypass_spec_v1 = bypass_spec_v1;
 
 	/* copy mandatory map attributes */
-	bpf_map_init_from_attr(&array->map, attr);
+	err = bpf_map_init_from_attr(&array->map, attr);
+	if (err) {
+		bpf_map_area_free(array, NULL);
+		return ERR_PTR(err);
+	}
+
 	array->elem_size = elem_size;
 
 	if (percpu && bpf_array_alloc_percpu(array)) {
diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
index 6691f79..be64227 100644
--- a/kernel/bpf/bloom_filter.c
+++ b/kernel/bpf/bloom_filter.c
@@ -93,6 +93,7 @@ static struct bpf_map *bloom_map_alloc(union bpf_attr *attr)
 	u32 bitset_bytes, bitset_mask, nr_hash_funcs, nr_bits;
 	int numa_node = bpf_map_attr_numa_node(attr);
 	struct bpf_bloom_filter *bloom;
+	int err;
 
 	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
@@ -147,7 +148,11 @@ static struct bpf_map *bloom_map_alloc(union bpf_attr *attr)
 	if (!bloom)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&bloom->map, attr);
+	err = bpf_map_init_from_attr(&bloom->map, attr);
+	if (err) {
+		bpf_map_area_free(bloom, NULL);
+		return ERR_PTR(err);
+	}
 
 	bloom->nr_hash_funcs = nr_hash_funcs;
 	bloom->bitset_mask = bitset_mask;
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 71d5bf1..b82d124 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -609,11 +609,16 @@ struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
 	struct bpf_local_storage_map *smap;
 	unsigned int i;
 	u32 nbuckets;
+	int err;
 
 	smap = bpf_map_area_alloc(sizeof(*smap), NUMA_NO_NODE, NULL);
 	if (!smap)
 		return ERR_PTR(-ENOMEM);
-	bpf_map_init_from_attr(&smap->map, attr);
+	err = bpf_map_init_from_attr(&smap->map, attr);
+	if (err) {
+		bpf_map_area_free(&smap->map, NULL);
+		return ERR_PTR(err);
+	}
 
 	nbuckets = roundup_pow_of_two(num_possible_cpus());
 	/* Use at least 2 buckets, select_bucket() is undefined behavior with 1 bucket */
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 37ba5c0..7cfbabc 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -598,6 +598,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 	struct bpf_struct_ops_map *st_map;
 	const struct btf_type *t, *vt;
 	struct bpf_map *map;
+	int err;
 
 	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
@@ -624,7 +625,11 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 
 	st_map->st_ops = st_ops;
 	map = &st_map->map;
-	bpf_map_init_from_attr(map, attr);
+	err = bpf_map_init_from_attr(map, attr);
+	if (err) {
+		bpf_map_area_free(st_map, NULL);
+		return ERR_PTR(err);
+	}
 
 	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE, map);
 	st_map->links =
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index b593157..e672e62 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -101,7 +101,11 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	if (!cmap)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&cmap->map, attr);
+	err = bpf_map_init_from_attr(&cmap->map, attr);
+	if (err) {
+		bpf_map_area_free(cmap, NULL);
+		return ERR_PTR(err);
+	}
 
 	/* Pre-limit array size based on NR_CPUS, not final CPU check */
 	if (cmap->map.max_entries > NR_CPUS) {
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 807a4cd..10f038d 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -167,7 +167,11 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	if (!dtab)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&dtab->map, attr);
+	err = bpf_map_init_from_attr(&dtab->map, attr);
+	if (err) {
+		bpf_map_area_free(dtab, NULL);
+		return ERR_PTR(err);
+	}
 
 	err = dev_map_init_map(dtab, attr);
 	if (err) {
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index e9a6d2c4..f059ae0 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -500,7 +500,11 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	if (!htab)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&htab->map, attr);
+	err = bpf_map_init_from_attr(&htab->map, attr);
+	if (err) {
+		bpf_map_area_free(htab, NULL);
+		return ERR_PTR(err);
+	}
 
 	lockdep_register_key(&htab->lockdep_key);
 
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index fcc7ece..1901195 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -287,6 +287,7 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
 	__u32 max_value_size = BPF_LOCAL_STORAGE_MAX_VALUE_SIZE;
 	int numa_node = bpf_map_attr_numa_node(attr);
 	struct bpf_cgroup_storage_map *map;
+	int err;
 
 	/* percpu is bound by PCPU_MIN_UNIT_SIZE, non-percu
 	 * is the same as other local storages.
@@ -318,7 +319,11 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 
 	/* copy mandatory map attributes */
-	bpf_map_init_from_attr(&map->map, attr);
+	err = bpf_map_init_from_attr(&map->map, attr);
+	if (err) {
+		bpf_map_area_free(map, NULL);
+		return ERR_PTR(err);
+	}
 
 	spin_lock_init(&map->lock);
 	map->root = RB_ROOT;
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 3d329ae..38d7b00 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -543,6 +543,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 static struct bpf_map *trie_alloc(union bpf_attr *attr)
 {
 	struct lpm_trie *trie;
+	int err;
 
 	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
@@ -563,7 +564,12 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 
 	/* copy mandatory map attributes */
-	bpf_map_init_from_attr(&trie->map, attr);
+	err = bpf_map_init_from_attr(&trie->map, attr);
+	if (err) {
+		bpf_map_area_free(trie, NULL);
+		return ERR_PTR(err);
+	}
+
 	trie->data_size = attr->key_size -
 			  offsetof(struct bpf_lpm_trie_key, data);
 	trie->max_prefixlen = trie->data_size * 8;
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 87c59da..dba7ed2 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -376,7 +376,11 @@ struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr)
 	if (!offmap)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&offmap->map, attr);
+	err = bpf_map_init_from_attr(&offmap->map, attr);
+	if (err) {
+		bpf_map_area_free(offmap, NULL);
+		return ERR_PTR(err);
+	}
 
 	rtnl_lock();
 	down_write(&bpf_devs_lock);
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index bf57e45..f231897 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -70,6 +70,7 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
 	int numa_node = bpf_map_attr_numa_node(attr);
 	struct bpf_queue_stack *qs;
 	u64 size, queue_size;
+	int err;
 
 	size = (u64) attr->max_entries + 1;
 	queue_size = sizeof(*qs) + size * attr->value_size;
@@ -78,7 +79,11 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
 	if (!qs)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&qs->map, attr);
+	err = bpf_map_init_from_attr(&qs->map, attr);
+	if (err) {
+		bpf_map_area_free(qs, NULL);
+		return ERR_PTR(err);
+	}
 
 	qs->size = size;
 
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 52c7e77..4da969b 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -153,6 +153,7 @@ static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 {
 	int numa_node = bpf_map_attr_numa_node(attr);
 	struct reuseport_array *array;
+	int err;
 
 	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
@@ -163,7 +164,11 @@ static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 
 	/* copy mandatory map attributes */
-	bpf_map_init_from_attr(&array->map, attr);
+	err = bpf_map_init_from_attr(&array->map, attr);
+	if (err) {
+		bpf_map_area_free(array, NULL);
+		return ERR_PTR(err);
+	}
 
 	return &array->map;
 }
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 1e7284c..3edefd3 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -185,6 +185,7 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node,
 static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 {
 	struct bpf_ringbuf_map *rb_map;
+	int err;
 
 	if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
@@ -204,7 +205,11 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 	if (!rb_map)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&rb_map->map, attr);
+	err = bpf_map_init_from_attr(&rb_map->map, attr);
+	if (err) {
+		bpf_map_area_free(rb_map, NULL);
+		return ERR_PTR(err);
+	}
 
 	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node,
 				       &rb_map->map);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2bd3bcf..5f5cade4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -403,7 +403,7 @@ static u32 bpf_map_flags_retain_permanent(u32 flags)
 	return flags & ~(BPF_F_RDONLY | BPF_F_WRONLY);
 }
 
-void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
+int bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 {
 	bpf_map_save_memcg(map);
 	map->map_type = attr->map_type;
@@ -413,6 +413,8 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 	map->map_flags = bpf_map_flags_retain_permanent(attr->map_flags);
 	map->numa_node = bpf_map_attr_numa_node(attr);
 	map->map_extra = attr->map_extra;
+
+	return 0;
 }
 
 static int bpf_map_alloc_id(struct bpf_map *map)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 2b3b24e..766a260 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -31,6 +31,7 @@ static int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog,
 static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 {
 	struct bpf_stab *stab;
+	int err;
 
 	if (!capable(CAP_NET_ADMIN))
 		return ERR_PTR(-EPERM);
@@ -45,7 +46,12 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	if (!stab)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&stab->map, attr);
+	err = bpf_map_init_from_attr(&stab->map, attr);
+	if (err) {
+		bpf_map_area_free(stab, NULL);
+		return ERR_PTR(err);
+	}
+
 	raw_spin_lock_init(&stab->lock);
 
 	stab->sks = bpf_map_area_alloc((u64) stab->map.max_entries *
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index beb11fd..8e7a5a6 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -63,6 +63,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 	struct xsk_map *m;
 	int numa_node;
 	u64 size;
+	int err;
 
 	if (!capable(CAP_NET_ADMIN))
 		return ERR_PTR(-EPERM);
@@ -79,7 +80,12 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 	if (!m)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&m->map, attr);
+	err = bpf_map_init_from_attr(&m->map, attr);
+	if (err) {
+		bpf_map_area_free(m, NULL);
+		return ERR_PTR(err);
+	}
+
 	spin_lock_init(&m->lock);
 
 	return &m->map;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH bpf-next v2 12/12] bpf: Introduce selectable memcg for bpf map
  2022-08-18 14:31 [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (10 preceding siblings ...)
  2022-08-18 14:31 ` [PATCH bpf-next v2 11/12] bpf: Add return value for bpf_map_init_from_attr Yafang Shao
@ 2022-08-18 14:31 ` Yafang Shao
  2022-08-18 22:20   ` Tejun Heo
  12 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-18 14:31 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

A new member memcg_fd is introduced into the bpf attr of the
BPF_MAP_CREATE command. It is the fd of an opened cgroup directory, and
the memory subsystem must be enabled in that cgroup. A valid memcg_fd
must be a positive number, which means it can't be zero (a valid return
value of open(2)). Once the kernel gets the memory cgroup from this fd,
it will set this memcg into the bpf map, and then all the subsequent
memory allocations of this map will be charged to that memcg. The map
creation paths in libbpf are changed accordingly to set this new attr.

The usage of this new member is as follows:
	struct bpf_map_create_opts map_opts = {
		.sz = sizeof(map_opts),
	};
	int memcg_fd, map_fd, old_fd;
	int key, value;

	memcg_fd = open("/sys/fs/cgroup/memory/bpf", O_DIRECTORY);
	if (memcg_fd < 0) {
		perror("memcg dir open");
		return -1;
	}

	/* 0 can't be used as memcg_fd */
	if (memcg_fd == 0) {
		old_fd = memcg_fd;
		memcg_fd = fcntl(memcg_fd, F_DUPFD_CLOEXEC, 3);
		close(old_fd);
		if (memcg_fd < 0) {
			perror("fcntl");
			return -1;
		}
	}

	map_opts.memcg_fd = memcg_fd;
	map_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "map_for_memcg",
				sizeof(key), sizeof(value),
				1024, &map_opts);
	if (map_fd <= 0) {
		close(memcg_fd);
		perror("map create");
		return -1;
	}
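
Once the map has been created, memcg_fd can be closed; the kernel resolves
the memcg at BPF_MAP_CREATE time and does not keep the fd around, so
(assuming the usage above):

	close(memcg_fd);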

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/syscall.c           | 45 ++++++++++++++++++++++++++++++++----------
 tools/include/uapi/linux/bpf.h |  1 +
 tools/lib/bpf/bpf.c            |  3 ++-
 tools/lib/bpf/bpf.h            |  3 ++-
 tools/lib/bpf/gen_loader.c     |  2 +-
 tools/lib/bpf/libbpf.c         |  2 ++
 tools/lib/bpf/skel_internal.h  |  2 +-
 8 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7d1e279..048b692 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1300,6 +1300,7 @@ struct bpf_stack_build_id {
 		 * to using 5 hash functions).
 		 */
 		__u64	map_extra;
+		__u32	memcg_fd;	/* selectable memcg */
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5f5cade4..fd1f67c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -294,14 +294,33 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static void bpf_map_save_memcg(struct bpf_map *map)
+static int bpf_map_save_memcg(struct bpf_map *map, u32 memcg_fd)
 {
-	/* Currently if a map is created by a process belonging to the root
-	 * memory cgroup, get_obj_cgroup_from_current() will return NULL.
-	 * So we have to check map->objcg for being NULL each time it's
-	 * being used.
-	 */
-	map->objcg = get_obj_cgroup_from_current();
+	struct obj_cgroup *objcg;
+	struct cgroup *cgrp;
+
+	if (memcg_fd) {
+		cgrp = cgroup_get_from_fd(memcg_fd);
+		if (IS_ERR(cgrp))
+			return -EINVAL;
+
+		objcg = get_obj_cgroup_from_cgroup(cgrp);
+		if (IS_ERR(objcg)) {
+			cgroup_put(cgrp);
+			return PTR_ERR(objcg);
+		}
+		cgroup_put(cgrp);
+	} else {
+		/* Currently if a map is created by a process belonging to the root
+		 * memory cgroup, get_obj_cgroup_from_current() will return NULL.
+		 * So we have to check map->objcg for being NULL each time it's
+		 * being used.
+		 */
+		objcg = get_obj_cgroup_from_current();
+	}
+
+	map->objcg = objcg;
+	return 0;
 }
 
 static void bpf_map_release_memcg(struct bpf_map *map)
@@ -311,8 +330,9 @@ static void bpf_map_release_memcg(struct bpf_map *map)
 }
 
 #else
-static void bpf_map_save_memcg(struct bpf_map *map)
+static int bpf_map_save_memcg(struct bpf_map *map, u32 memcg_fd)
 {
+	return 0;
 }
 
 static void bpf_map_release_memcg(struct bpf_map *map)
@@ -405,7 +425,12 @@ static u32 bpf_map_flags_retain_permanent(u32 flags)
 
 int bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 {
-	bpf_map_save_memcg(map);
+	int err;
+
+	err = bpf_map_save_memcg(map, attr->memcg_fd);
+	if (err)
+		return err;
+
 	map->map_type = attr->map_type;
 	map->key_size = attr->key_size;
 	map->value_size = attr->value_size;
@@ -1091,7 +1116,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	return ret;
 }
 
-#define BPF_MAP_CREATE_LAST_FIELD map_extra
+#define BPF_MAP_CREATE_LAST_FIELD memcg_fd
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
 {
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e174ad2..1e9efe7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1300,6 +1300,7 @@ struct bpf_stack_build_id {
 		 * to using 5 hash functions).
 		 */
 		__u64	map_extra;
+		__u32	memcg_fd;	/* selectable memcg */
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 6a96e66..6517553 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -171,7 +171,7 @@ int bpf_map_create(enum bpf_map_type map_type,
 		   __u32 max_entries,
 		   const struct bpf_map_create_opts *opts)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, map_extra);
+	const size_t attr_sz = offsetofend(union bpf_attr, memcg_fd);
 	union bpf_attr attr;
 	int fd;
 
@@ -199,6 +199,7 @@ int bpf_map_create(enum bpf_map_type map_type,
 	attr.map_extra = OPTS_GET(opts, map_extra, 0);
 	attr.numa_node = OPTS_GET(opts, numa_node, 0);
 	attr.map_ifindex = OPTS_GET(opts, map_ifindex, 0);
+	attr.memcg_fd = OPTS_GET(opts, memcg_fd, 0);
 
 	fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, attr_sz);
 	return libbpf_err_errno(fd);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9c50bea..dd0d929 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -51,8 +51,9 @@ struct bpf_map_create_opts {
 
 	__u32 numa_node;
 	__u32 map_ifindex;
+	__u32 memcg_fd;
 };
-#define bpf_map_create_opts__last_field map_ifindex
+#define bpf_map_create_opts__last_field memcg_fd
 
 LIBBPF_API int bpf_map_create(enum bpf_map_type map_type,
 			      const char *map_name,
diff --git a/tools/lib/bpf/gen_loader.c b/tools/lib/bpf/gen_loader.c
index 23f5c46..f35b014 100644
--- a/tools/lib/bpf/gen_loader.c
+++ b/tools/lib/bpf/gen_loader.c
@@ -451,7 +451,7 @@ void bpf_gen__map_create(struct bpf_gen *gen,
 			 __u32 key_size, __u32 value_size, __u32 max_entries,
 			 struct bpf_map_create_opts *map_attr, int map_idx)
 {
-	int attr_size = offsetofend(union bpf_attr, map_extra);
+	int attr_size = offsetofend(union bpf_attr, memcg_fd);
 	bool close_inner_map_fd = false;
 	int map_create_attr, idx;
 	union bpf_attr attr;
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 3f01f5c..ad7aa39 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -511,6 +511,7 @@ struct bpf_map {
 	bool reused;
 	bool autocreate;
 	__u64 map_extra;
+	__u32 memcg_fd;
 };
 
 enum extern_type {
@@ -4936,6 +4937,7 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
 	create_attr.map_flags = def->map_flags;
 	create_attr.numa_node = map->numa_node;
 	create_attr.map_extra = map->map_extra;
+	create_attr.memcg_fd = map->memcg_fd;
 
 	if (bpf_map__is_struct_ops(map))
 		create_attr.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id;
diff --git a/tools/lib/bpf/skel_internal.h b/tools/lib/bpf/skel_internal.h
index bd6f450..83403cc 100644
--- a/tools/lib/bpf/skel_internal.h
+++ b/tools/lib/bpf/skel_internal.h
@@ -222,7 +222,7 @@ static inline int skel_map_create(enum bpf_map_type map_type,
 				  __u32 value_size,
 				  __u32 max_entries)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, map_extra);
+	const size_t attr_sz = offsetofend(union bpf_attr, memcg_fd);
 	union bpf_attr attr;
 
 	memset(&attr, 0, attr_sz);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v2 07/12] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
  2022-08-18 14:31   ` Yafang Shao
  (?)
@ 2022-08-18 17:30   ` Andrii Nakryiko
  -1 siblings, 0 replies; 61+ messages in thread
From: Andrii Nakryiko @ 2022-08-18 17:30 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x, cgroups, netdev, bpf,
	linux-mm

On Thu, Aug 18, 2022 at 7:32 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Allocate pages related memory into the new helper
> bpf_ringbuf_pages_alloc(), then it can be handled as a single unit.
>
> Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  kernel/bpf/ringbuf.c | 80 ++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 56 insertions(+), 24 deletions(-)
>

LGTM.

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> index 5eb7820..1e7284c 100644
> --- a/kernel/bpf/ringbuf.c
> +++ b/kernel/bpf/ringbuf.c
> @@ -59,6 +59,57 @@ struct bpf_ringbuf_hdr {
>         u32 pg_off;
>  };
>

[...]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v2 01/12] cgroup: Update the comment on cgroup_get_from_fd
  2022-08-18 14:31   ` Yafang Shao
@ 2022-08-18 19:11     ` Yosry Ahmed
  -1 siblings, 0 replies; 61+ messages in thread
From: Yosry Ahmed @ 2022-08-18 19:11 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, john fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Tejun Heo, Zefan Li, Cgroups,
	Networking, bpf, Linux-MM

On Thu, Aug 18, 2022 at 7:31 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> After commit f3a2aebdd6fb ("cgroup: enable cgroup_get_from_file() on cgroup1")
> we can open a cgroup1 dir as well. So let's update the comment.

I missed updating the comment in my change. Thanks for catching this.

>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Cc: Hao Luo <haoluo@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  kernel/cgroup/cgroup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 5f4502a..b7d2e55 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6625,7 +6625,7 @@ struct cgroup *cgroup_get_from_path(const char *path)
>
>  /**
>   * cgroup_get_from_fd - get a cgroup pointer from a fd
> - * @fd: fd obtained by open(cgroup2_dir)
> + * @fd: fd obtained by open(cgroup_dir)
>   *
>   * Find the cgroup from a fd which should be obtained
>   * by opening a cgroup directory.  Returns a pointer to the
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v2 10/12] mm, memcg: Add new helper get_obj_cgroup_from_cgroup
  2022-08-18 14:31 ` [PATCH bpf-next v2 10/12] mm, memcg: Add new helper get_obj_cgroup_from_cgroup Yafang Shao
@ 2022-08-18 20:38   ` Shakeel Butt
  2022-08-19  1:21     ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Shakeel Butt @ 2022-08-18 20:38 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	kpsingh, Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	Tejun Heo, Zefan Li, Cgroups, netdev, bpf, Linux MM

On Thu, Aug 18, 2022 at 7:32 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> We want to open a cgroup directory and pass the fd into kernel, and then
> in the kernel we can charge the allocated memory into the open cgroup if it
> has valid memory subsystem. In the bpf subsystem, the opened cgroup will
> be store as a struct obj_cgroup pointer, so a new helper
> get_obj_cgroup_from_cgroup() is introduced.
>
> It also add a comment on why the helper  __get_obj_cgroup_from_memcg()
> must be protected by rcu read lock.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/memcontrol.h |  1 +
>  mm/memcontrol.c            | 47 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 48 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 2f0a611..901a921 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1713,6 +1713,7 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg,
>  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
>  void __memcg_kmem_uncharge_page(struct page *page, int order);
>
> +struct obj_cgroup *get_obj_cgroup_from_cgroup(struct cgroup *cgrp);
>  struct obj_cgroup *get_obj_cgroup_from_current(void);
>  struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 618c366..0409cc4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2895,6 +2895,14 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
>         return page_memcg_check(folio_page(folio, 0));
>  }
>
> +/*
> + * Pls. note that the memg->objcg can be freed after it is deferenced,
> + * that can happen when the memcg is changed from online to offline.
> + * So this helper must be protected by read rcu lock.
> + *
> + * After obj_cgroup_tryget(), it is safe to use the objcg outside of the rcu
> + * read-side critical section.
> + */

The comment is too verbose. My suggestion would be to add
WARN_ON_ONCE(!rcu_read_lock_held()) in the function and if you want to
add a comment, just say that the caller should have a reference on
memcg.
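
Something along these lines (untested, just to illustrate the suggestion):

	static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
	{
		struct obj_cgroup *objcg = NULL;

		/* The caller must hold rcu_read_lock() and a reference on @memcg. */
		WARN_ON_ONCE(!rcu_read_lock_held());

		/* ... existing lookup loop unchanged ... */
		return objcg;
	}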

>  static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
>  {
>         struct obj_cgroup *objcg = NULL;
> @@ -2908,6 +2916,45 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
>         return objcg;
>  }
>
> +static struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> +{
> +       struct obj_cgroup *objcg;
> +
> +       if (memcg_kmem_bypass())
> +               return NULL;
> +
> +       rcu_read_lock();
> +       objcg = __get_obj_cgroup_from_memcg(memcg);
> +       rcu_read_unlock();
> +       return objcg;
> +}
> +
> +struct obj_cgroup *get_obj_cgroup_from_cgroup(struct cgroup *cgrp)

Since this function is exposed to other files, it would be nice to
have a comment similar to get_mem_cgroup_from_mm() for this function.
I know other get_obj_cgroup variants do not have such a comment yet
but let's start with this.

> +{
> +       struct cgroup_subsys_state *css;
> +       struct mem_cgroup *memcg;
> +       struct obj_cgroup *objcg;
> +
> +       rcu_read_lock();
> +       css = rcu_dereference(cgrp->subsys[memory_cgrp_id]);
> +       if (!css || !css_tryget_online(css)) {

Any reason to use css_tryget_online() instead of css_tryget()?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map
@ 2022-08-18 22:20   ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2022-08-18 22:20 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, lizefan.x, cgroups, netdev, bpf,
	linux-mm

Hello,

On Thu, Aug 18, 2022 at 02:31:06PM +0000, Yafang Shao wrote:
> After switching to memcg-based bpf memory accounting to limit the bpf
> memory, some unexpected issues jumped out at us.
> 1. The memory usage is not consistent between the first generation and
> new generations.
> 2. After the first generation is destroyed, the bpf memory can't be
> limited if the bpf maps are not preallocated, because they will be
> reparented.
> 
> This patchset tries to resolve these issues by introducing an
> independent memcg to limit the bpf memory.

memcg folks would have better informed opinions but from generic cgroup pov
I don't think this is a good direction to take. This isn't a problem limited
to bpf progs and it doesn't make a whole lot of sense to solve this for bpf.

We have the exact same problem for any resources which span multiple
instances of a service including page cache, tmpfs instances and any other
thing which can persist longer than process lifetime. My current opinion is
that this is best solved by introducing an extra cgroup layer to represent
the persistent entity and put the per-instance cgroup under it.
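
IOW, something along these lines, where the persistent entity holds the
long-lived resources and the per-instance cgroups come and go underneath
it (the names are purely illustrative):

  svc-foo/           <- persistent entity: pinned bpf maps, page cache, tmpfs
    instance-1/      <- per-instance cgroup, cycled on each service update
    instance-2/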

It does require reorganizing how things are organized from userspace POV but
the end result is really desirable. We get entities accurately representing
what needs to be tracked and control over the granularity of accounting and
control (e.g. folks who don't care about telling apart the current
instance's usage can simply not enable controllers at the persistent entity
level).

We surely can discuss other approaches but my current intuition is that it'd
be really difficult to come up with a better solution than layering to
introduce persistent service entities.

So, please consider the approach nacked for the time being.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map
  2022-08-18 22:20   ` Tejun Heo
  (?)
@ 2022-08-18 22:33   ` Tejun Heo
  2022-08-19  1:09       ` Yafang Shao
  -1 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2022-08-18 22:33 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, lizefan.x, cgroups, netdev, bpf,
	linux-mm

On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> We have the exact same problem for any resources which span multiple
> instances of a service including page cache, tmpfs instances and any other
> thing which can persist longer than procss life time. My current opinion is

To expand a bit more on this point, once we start including page cache and
tmpfs, we now get entangled with memory reclaim which then brings in IO and
not-yet-but-eventually CPU usage. Once you start splitting the tree like
you're suggesting here, all those will break down and now we have to worry
about how to split resource accounting and control for the same entities
across two split branches of the tree, which doesn't really make any sense.

So, we *really* don't wanna paint ourselves into that kind of a corner. This
is a dead-end. Please ditch it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map
  2022-08-18 22:20   ` Tejun Heo
  (?)
  (?)
@ 2022-08-19  0:59   ` Yafang Shao
  2022-08-19 16:45       ` Tejun Heo
  -1 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2022-08-19  0:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, lizefan.x, Cgroups, netdev, bpf, Linux MM

On Fri, Aug 19, 2022 at 6:20 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Aug 18, 2022 at 02:31:06PM +0000, Yafang Shao wrote:
> > After switching to memcg-based bpf memory accounting to limit the bpf
> > memory, some unexpected issues jumped out at us.
> > 1. The memory usage is not consistent between the first generation and
> > new generations.
> > 2. After the first generation is destroyed, the bpf memory can't be
> > limited if the bpf maps are not preallocated, because they will be
> > reparented.
> >
> > This patchset tries to resolve these issues by introducing an
> > independent memcg to limit the bpf memory.
>
> memcg folks would have better informed opinions but from generic cgroup pov
> I don't think this is a good direction to take. This isn't a problem limited
> to bpf progs and it doesn't make whole lot of sense to solve this for bpf.
>

This change is bpf specific. It doesn't refactor a whole lot of things.

> We have the exact same problem for any resources which span multiple
> instances of a service including page cache, tmpfs instances and any other
> thing which can persist longer than procss life time. My current opinion is
> that this is best solved by introducing an extra cgroup layer to represent
> the persistent entity and put the per-instance cgroup under it.
>

It is not practical on k8s.
Because before the persistent entity is introduced, the cgroup dir is
stateless; after it, the cgroup dir is stateful.
Pls. don't keep turning a blind eye to k8s.

> It does require reorganizing how things are organized from userspace POV but
> the end result is really desirable. We get entities accurately representing
> what needs to be tracked and control over the granularity of accounting and
> control (e.g. folks who don't care about telling apart the current
> instance's usage can simply not enable controllers at the persistent entity
> level).
>

Pls. also think about why k8s refuses to use cgroup2.

> We surely can discuss other approaches but my current intuition is that it'd
> be really difficult to come up with a better solution than layering to
> introduce persistent service entities.
>
> So, please consider the approach nacked for the time being.
>

It doesn't make sense to nack it.
I will explain to you by replying to your other email.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map
@ 2022-08-19  1:09       ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-19  1:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, lizefan.x, Cgroups, netdev, bpf, Linux MM

On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> > We have the exact same problem for any resources which span multiple
> > instances of a service including page cache, tmpfs instances and any other
> > thing which can persist longer than procss life time. My current opinion is
>
> To expand a bit more on this point, once we start including page cache and
> tmpfs, we now get entangled with memory reclaim which then brings in IO and
> not-yet-but-eventually CPU usage.

Introduce-a-new-layer vs introduce-a-new-cgroup, which one has more overhead?

> Once you start splitting the tree like
> you're suggesting here, all those will break down and now we have to worry
> about how to split resource accounting and control for the same entities
> across two split branches of the tree, which doesn't really make any sense.
>

k8s has already been broken by the memcg accounting of bpf memory.
In case you ignored it, I paste it below.
[0]"1. The memory usage is not consistent between the first generation and
new generations."

This issue will persist even if you introduce a new layer.

> So, we *really* don't wanna paint ourselves into that kind of a corner. This
> is a dead-end. Please ditch it.
>

It makes no sense to ditch it.
Because the hierarchy I described in the commit log is *one* use case
of the selectable memcg, not *the only* use case of it. If you
dislike that hierarchy, I will remove it to avoid misleading you.

Even if you introduce a new layer, you still need the selectable memcg.
For example, to avoid the issue I described in [0],  you still need to
charge to the parent cgroup instead of the current cgroup.

That's why I described in the commit log that the selectable memcg is flexible.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v2 10/12] mm, memcg: Add new helper get_obj_cgroup_from_cgroup
  2022-08-18 20:38   ` Shakeel Butt
@ 2022-08-19  1:21     ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-19  1:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	Tejun Heo, Zefan Li, Cgroups, netdev, bpf, Linux MM

On Fri, Aug 19, 2022 at 4:38 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Aug 18, 2022 at 7:32 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > We want to open a cgroup directory and pass the fd into kernel, and then
> > in the kernel we can charge the allocated memory into the open cgroup if it
> > has valid memory subsystem. In the bpf subsystem, the opened cgroup will
> > be store as a struct obj_cgroup pointer, so a new helper
> > get_obj_cgroup_from_cgroup() is introduced.
> >
> > It also add a comment on why the helper  __get_obj_cgroup_from_memcg()
> > must be protected by rcu read lock.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/memcontrol.h |  1 +
> >  mm/memcontrol.c            | 47 ++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 48 insertions(+)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 2f0a611..901a921 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1713,6 +1713,7 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg,
> >  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
> >  void __memcg_kmem_uncharge_page(struct page *page, int order);
> >
> > +struct obj_cgroup *get_obj_cgroup_from_cgroup(struct cgroup *cgrp);
> >  struct obj_cgroup *get_obj_cgroup_from_current(void);
> >  struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 618c366..0409cc4 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2895,6 +2895,14 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
> >         return page_memcg_check(folio_page(folio, 0));
> >  }
> >
> > +/*
> > + * Pls. note that the memg->objcg can be freed after it is deferenced,
> > + * that can happen when the memcg is changed from online to offline.
> > + * So this helper must be protected by read rcu lock.
> > + *
> > + * After obj_cgroup_tryget(), it is safe to use the objcg outside of the rcu
> > + * read-side critical section.
> > + */
>
> The comment is too verbose. My suggestion would be to add
> WARN_ON_ONCE(!rcu_read_lock_held()) in the function and if you want to
> add a comment, just say that the caller should have a reference on
> memcg.
>

Sure, I will change it.

> >  static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> >  {
> >         struct obj_cgroup *objcg = NULL;
> > @@ -2908,6 +2916,45 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> >         return objcg;
> >  }
> >
> > +static struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> > +{
> > +       struct obj_cgroup *objcg;
> > +
> > +       if (memcg_kmem_bypass())
> > +               return NULL;
> > +
> > +       rcu_read_lock();
> > +       objcg = __get_obj_cgroup_from_memcg(memcg);
> > +       rcu_read_unlock();
> > +       return objcg;
> > +}
> > +
> > +struct obj_cgroup *get_obj_cgroup_from_cgroup(struct cgroup *cgrp)
>
> Since this function is exposed to other files, it would be nice to
> have a comment similar to get_mem_cgroup_from_mm() for this function.
> I know other get_obj_cgroup variants do not have such a comment yet
> but let's start with this.
>

Sure, I will add it.

> > +{
> > +       struct cgroup_subsys_state *css;
> > +       struct mem_cgroup *memcg;
> > +       struct obj_cgroup *objcg;
> > +
> > +       rcu_read_lock();
> > +       css = rcu_dereference(cgrp->subsys[memory_cgrp_id]);
> > +       if (!css || !css_tryget_online(css)) {
>
> Any reason to use css_tryget_online() instead of css_tryget()?

Because in this case, the cgroup is from an opened cgroup dir, which
should be online.
But considering this is a generic helper, which may be used by others,
it would be more reasonable to use css_tryget().
I will change it.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map
@ 2022-08-19 16:45       ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2022-08-19 16:45 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, lizefan.x, Cgroups, netdev, bpf, Linux MM

Hello,

On Fri, Aug 19, 2022 at 08:59:20AM +0800, Yafang Shao wrote:
> On Fri, Aug 19, 2022 at 6:20 AM Tejun Heo <tj@kernel.org> wrote:
> > memcg folks would have better informed opinions but from generic cgroup pov
> > I don't think this is a good direction to take. This isn't a problem limited
> > to bpf progs and it doesn't make whole lot of sense to solve this for bpf.
> 
> This change is bpf specific. It doesn't refactor a whole lot of things.

I'm not sure what point the above sentence is making. It may not change a
lot of code but it does introduce a significantly new mode of operation
which affects memcg and cgroup in general.

> > We have the exact same problem for any resources which span multiple
> > instances of a service including page cache, tmpfs instances and any other
> > thing which can persist longer than procss life time. My current opinion is
> > that this is best solved by introducing an extra cgroup layer to represent
> > the persistent entity and put the per-instance cgroup under it.
> 
> It is not practical on k8s.
> Because, before the persistent entity, the cgroup dir is stateless.
> After, it is stateful.
> Pls, don't continue keeping blind eyes on k8s.

Can you please elaborate why it isn't practical for k8s? I don't know the
details of k8s and what you wrote above is not a detailed enough technical
argument.

> > It does require reorganizing how things are organized from userspace POV but
> > the end result is really desirable. We get entities accurately representing
> > what needs to be tracked and control over the granularity of accounting and
> > control (e.g. folks who don't care about telling apart the current
> > instance's usage can simply not enable controllers at the persistent entity
> > level).
> 
> Pls.s also think about why k8s refuse to use cgroup2.

This attitude really bothers me. You aren't spelling it out fully but
instead of engaging in the technical argument at hand, you're putting
forth conforming upstream to the current k8s's assumptions and behaviors as
a requirement and then insisting that it's upstream's fault that k8s is
staying with cgroup1.

This is not an acceptable form of argument and it would be irresponsible to
grant any kind weight to this line of reasoning. k8s may seem like the world
to you but it is one of many use cases of the upstream kernel. We all should
pay attention to the use cases and technical arguments to determine how we
chart our way forward, but being k8s or whoever else clearly isn't a waiver
to claim this kind of unilateral demand.

It's okay to emphasize the gravity of the specific use case at hand but
please realize that it's one of the many factors that should be considered
and sometimes one which can and should get trumped by others.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map
@ 2022-08-19 17:06         ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2022-08-19 17:06 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, lizefan.x, Cgroups, netdev, bpf, Linux MM

Hello,

On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote:
> On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> > > We have the exact same problem for any resources which span multiple
> > > instances of a service including page cache, tmpfs instances and any other
> > > thing which can persist longer than process lifetime. My current opinion is
> >
> > To expand a bit more on this point, once we start including page cache and
> > tmpfs, we now get entangled with memory reclaim which then brings in IO and
> > not-yet-but-eventually CPU usage.
> 
> Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead?

Introducing a new layer in cgroup2 doesn't mean that any specific resource
controller is enabled, so there is no runtime overhead difference. In terms
of logical complexity, introducing a localized layer seems a lot more
straightforward than building a whole separate tree.
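
As a rough illustration only (paths and names below are made up, and this
assumes cgroup2 mounted at /sys/fs/cgroup with enough privilege to mkdir
there), a freshly created intermediate cgroup has an empty
cgroup.subtree_control, so its children see no enabled controllers until
somebody explicitly delegates them:

/* Sketch: a new intermediate cgroup2 layer enables no controllers by
 * itself; children only get a memcg once "+memory" is written to the
 * parent's cgroup.subtree_control. Assumes cgroup2 at /sys/fs/cgroup. */
#include <stdio.h>
#include <sys/stat.h>

static void dump(const char *path)
{
        char buf[128] = "";
        FILE *f = fopen(path, "r");

        if (f && !fgets(buf, sizeof(buf), f))
                buf[0] = '\0';
        if (f)
                fclose(f);
        printf("%s: %s", path, buf[0] ? buf : "(empty)\n");
}

int main(void)
{
        mkdir("/sys/fs/cgroup/persistent.layer", 0755);
        mkdir("/sys/fs/cgroup/persistent.layer/instance", 0755);

        /* Both start out empty: no controller enabled below the new
         * layer, hence no extra accounting work for the instance. */
        dump("/sys/fs/cgroup/persistent.layer/cgroup.subtree_control");
        dump("/sys/fs/cgroup/persistent.layer/instance/cgroup.controllers");
        return 0;
}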

Note that the same applies to cgroup1, where a collapsed controller tree is
represented by simply not creating those layers in that particular
controller tree.

No matter how we cut the problem here, if we want to track these persistent
resources, we have to create a cgroup to host them somewhere. The discussion
we're having is mostly around where to put them. With your proposal, it can
be anywhere and you draw out an example where the persistent cgroups form
their own separate tree. What I'm saying is that the logical place to put it
is where the current resource consumption is and we just need to put the
persistent entity as the parent of the instances.

Flexibility, just like anything else, isn't free. Here, if we extrapolate
this approach, the cost is evidently hefty in that it doesn't generically
work with the basic resource control structure.

> > Once you start splitting the tree like
> > you're suggesting here, all those will break down and now we have to worry
> > about how to split resource accounting and control for the same entities
> > across two split branches of the tree, which doesn't really make any sense.
> 
> k8s has already been broken, thanks to the memcg accounting of bpf memory.
> In case you missed it, I'll paste it below.
> [0] "1. The memory usage is not consistent between the first generation and
> new generations."
> 
> This issue will persist even if you introduce a new layer.

Please watch your tone.

Again, this isn't a problem specific to k8s. We have the same problem with
e.g. persistent tmpfs. One idea which I'm not against is allowing specific
resources to be charged to an ancestor. We gotta think carefully about how
such charges should be granted / denied but an approach like that jives well
with the existing hierarchical control structure and because introducing a
persistent layer does too, the combination of the two works well.

> > So, we *really* don't wanna paint ourselves into that kind of a corner. This
> > is a dead-end. Please ditch it.
> 
> It makes no sense to ditch it.
> The hierarchy I described in the commit log is *one* use case
> of the selectable memcg, but not *the only* use case of it. If you
> dislike that hierarchy, I will remove it to avoid misleading you.

But if you drop that, what'd be the rationale for adding what you're
proposing? Why would we want bpf memory charges to be attached to any part of
the hierarchy?

> Even if you introduce a new layer, you still need the selectable memcg.
> For example, to avoid the issue I described in [0],  you still need to
> charge to the parent cgroup instead of the current cgroup.

As I wrote above, we've been discussing exactly that. Again, I'd be a lot more
amenable to such an approach because it fits with how everything is structured.

> That's why I described in the commit log that the selectable memcg is flexible.

Hopefully, my point on this is clear by now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map
@ 2022-08-20  2:25           ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-20  2:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM

On Sat, Aug 20, 2022 at 1:06 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote:
> > On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> > > > We have the exact same problem for any resources which span multiple
> > > > instances of a service including page cache, tmpfs instances and any other
> > > > thing which can persist longer than process lifetime. My current opinion is
> > >
> > > To expand a bit more on this point, once we start including page cache and
> > > tmpfs, we now get entangled with memory reclaim which then brings in IO and
> > > not-yet-but-eventually CPU usage.
> >
> > Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead?
>
> Introducing a new layer in cgroup2 doesn't mean that any specific resource
> controller is enabled, so there is no runtime overhead difference. In terms
> of logical complexity, introducing a localized layer seems a lot more
> straightforward than building a whole separate tree.
>
> Note that the same applies to cgroup1, where a collapsed controller tree is
> represented by simply not creating those layers in that particular
> controller tree.
>

No, we have observed in our production environment that multiple layers of
cpuacct cause an obvious performance hit due to cache misses.

> No matter how we cut the problem here, if we want to track these persistent
> resources, we have to create a cgroup to host them somewhere. The discussion
> we're having is mostly around where to put them. With your proposal, it can
> be anywhere and you draw out an example where the persistent cgroups form
> their own separate tree. What I'm saying is that the logical place to put it
> is where the current resource consumption is and we just need to put the
> persistent entity as the parent of the instances.
>
> Flexibility, just like anything else, isn't free. Here, if we extrapolate
> this approach, the cost is evidently hefty in that it doesn't generically
> work with the basic resource control structure.
>
> > > Once you start splitting the tree like
> > > you're suggesting here, all those will break down and now we have to worry
> > > about how to split resource accounting and control for the same entities
> > > across two split branches of the tree, which doesn't really make any sense.
> >
> > k8s has already been broken, thanks to the memcg accounting of bpf memory.
> > In case you missed it, I'll paste it below.
> > [0] "1. The memory usage is not consistent between the first generation and
> > new generations."
> >
> > This issue will persist even if you introduce a new layer.
>
> Please watch your tone.
>

Hm? I apologize if my words offended you.
But could you please take a serious look at the patchset before giving a NACK?
You didn't even want to know the background before you sent your NACK.

> Again, this isn't a problem specific to k8s. We have the same problem with
> e.g. persistent tmpfs. One idea which I'm not against is allowing specific
> resources to be charged to an ancestor. We gotta think carefully about how
> such charges should be granted / denied but an approach like that jives well
> with the existing hierarchical control structure and because introducing a
> persistent layer does too, the combination of the two works well.
>
> > > So, we *really* don't wanna paint ourselves into that kind of a corner. This
> > > is a dead-end. Please ditch it.
> >
> > It makes no sense to ditch it.
> > The hierarchy I described in the commit log is *one* use case
> > of the selectable memcg, but not *the only* use case of it. If you
> > dislike that hierarchy, I will remove it to avoid misleading you.
>
> But if you drop that, what'd be the rationale for adding what you're
> proposing? Why would we want bpf memory charges to be attached to any part of
> the hierarchy?
>

I have explained it to you,
but unfortunately you ignored it again.
I don't mind explaining it to you again.

                 Parent-memcg
                     \
                   Child-memcg (k8s pod)

The user can charge the memory to the parent directly without charging
it to the k8s pod.
Then memory.stat stays consistent across different generations.
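
As a rough sketch of how this looks with the interface that exists today
(plain libbpf, arbitrary map name and sizes, needs root or CAP_BPF): the map
is charged to the memcg of the task that creates it, so today the only way
to land the charge on Parent-memcg is to create the map from a task running
there. A selectable memcg would let the creator name the target instead.

/* Sketch: with the current memcg accounting, a bpf map created here is
 * charged to the creator's memcg (visible in /proc/self/cgroup). Uses
 * libbpf; build with -lbpf. Map name and sizes are arbitrary. */
#include <stdio.h>
#include <bpf/bpf.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/self/cgroup", "r");

        /* Show which memcg this creator task is in, i.e. where the map
         * memory will be charged under the current scheme. */
        if (f) {
                while (fgets(line, sizeof(line), f))
                        fputs(line, stdout);
                fclose(f);
        }

        /* A small hash map; it is preallocated by default, so its memory
         * is charged up front to the creator's memcg. */
        int fd = bpf_map_create(BPF_MAP_TYPE_HASH, "demo_map",
                                sizeof(int), sizeof(long), 1024, NULL);
        if (fd < 0) {
                perror("bpf_map_create");
                return 1;
        }
        printf("created map fd %d, charged to the memcg shown above\n", fd);
        return 0;
}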

> > Even if you introduce a new layer, you still need the selectable memcg.
> > For example, to avoid the issue I described in [0],  you still need to
> > charge to the parent cgroup instead of the current cgroup.
>
> As I wrote above, we've been discussing the above. Again, I'd be a lot more
> amenable to such approach because it fits with how everything is structured.
>
> > That's why I described in the commit log that the selectable memcg is flexible.
>
> Hopefully, my point on this is clear by now.
>

Unfortunately, you didn't want to get my point.


-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [RFD RESEND] cgroup: Persistent memory usage tracking
@ 2022-08-22 11:29             ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2022-08-22 11:29 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM,
	Yosry Ahmed, Dan Schatzberg, Lennart Poettering

(Sorry, this is a resend. I messed up the header in the first posting.)

Hello,

This thread started on a bpf-specific memory tracking change proposal and
went south, but a lot of people who would be interested are already cc'd, so
I'm hijacking it to discuss what to do w/ persistent memory usage tracking.

Cc'ing Mina and Yosry who were involved in the discussions on the similar
problem re. tmpfs, Dan Schatzberg who has a lot more prod knowledge and
experience than me, and Lennart for his thoughts from systemd side.

The root problem is that there are resources (almost solely memory
currently) that outlive a given instance of a, to use systemd-lingo,
service. Page cache is the most common case.

Let's say there's system.slice/hello.service. When it runs for the first
time, page cache backing its binary will be charged to hello.service.
However, when it restarts after e.g. a config change, the initial
hello.service cgroup gets destroyed and we reparent the page cache charges to
the parent system.slice; when the second instance starts, its binary will
stay charged to system.slice. Over time, some may get reclaimed and
refaulted into the new hello.service but that's not guaranteed and most hot
pages likely won't.
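
As a small illustration (paths assume a systemd-style cgroup2 layout under
/sys/fs/cgroup and are only an example), the effect is visible by comparing
memory.current before and after a restart: the reparented pages keep counting
toward system.slice but no longer show up under the recreated hello.service:

/* Sketch: where the charges sit is visible in memory.current. After a
 * restart, pages reparented to system.slice no longer show up under the
 * recreated hello.service cgroup. Paths assume a systemd-style layout. */
#include <stdio.h>

static void usage_of(const char *cgrp)
{
        char path[256];
        unsigned long long bytes = 0;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/memory.current", cgrp);
        f = fopen(path, "r");
        if (f) {
                if (fscanf(f, "%llu", &bytes) != 1)
                        bytes = 0;
                fclose(f);
        }
        printf("%-45s %llu bytes\n", cgrp, bytes);
}

int main(void)
{
        /* Compare before and after "systemctl restart hello.service". */
        usage_of("system.slice");
        usage_of("system.slice/hello.service");
        return 0;
}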

The same problem exists for any memory which is not freed synchronously when
the current instance exits. While this isn't a problem for many cases, it's
not difficult to imagine situations where the amount of memory which ends up
getting pushed to the parent is significant, even a clear majority, with a big
page cache footprint, persistent tmpfs instances and so on, creating issues
with accounting accuracy and thus control.

I think there are two broad issues to discuss here:

[1] Can this be solved by layering the instance cgroups under a persistent
    entity cgroup?

So, instead of system.slice/hello.service, the application runs inside
something like system.slice/hello.service/hello.service.instance and the
service-level cgroup hello.service is not destroyed as long as it is
something worth tracking on the system.

The benefits are

a. While requiring changing how userland organizes the cgroup hierarchy, it is a
   straightforward extension of the current architecture and doesn't
   require any conceptual or structural changes. All the accounting and
   control schemes work exactly the same as before. The only difference is
   that we now have a persistent entity representing each service as we want
   to track their persistent resource usages.

b. Per-instance tracking and control is optional. To me, it seems that the
   persistent resource usages would be more meaningful than per-instance and
   tracking down to the persistent usages shouldn't add noticeable runtime
   overheads while keeping per-instance process management niceties and
   allowing use cases to opt-in for per-instance resource tracking and
   control as needed.

The complications are:

a. It requires changing cgroup hierarchy in a very visible way.

b. What should be the lifetime rules for persistent cgroups? Do we keep them
   around forever or maybe they can be created on the first use and kept
   around until the service is removed from the system? When the persistent
   cgroup is removed, do we need to make sure that the remaining resource
   usages are low enough? Note that this problem exists for any approach
   that tries to track persistent usages no matter how it's done.

c. Do we need to worry about nesting overhead? Given that there's no reason
   to enable controllers w/o persistent state for the instance level and the
   nesting overhead is pretty low for memcg, this doesn't seem like a
   problem to me. If this becomes a problem, we just need to fix it.

A couple alternatives discussed are:

a. Userspace keeps reusing the same cgroup for different instances of the
   same service. This simplifies some aspects while making others more
   complicated. e.g. Determining the current instance's CPU or IO usages now
   requires the monitoring software to remember what they were when this
   instance started and to calculate the deltas. Also, if some use cases want
   to distinguish persistent vs. instance usages (more on this later), this
   isn't gonna work. That said, this definitely is attractive in that it
   minimizes overt user-visible changes.

b. Memory is disassociated rather than just reparented on cgroup destruction
   and gets re-charged to the next first user. This is attractive in that it
   doesn't require any userspace changes; however, I'm not sure how this
   would work for non-pageable memory usages such as bpf maps. How would we
   detect the next first usage?
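
Going back to the layering idea in [1], here is a rough sketch of the
hierarchy manipulation involved; the paths are illustrative, cgroup2 is
assumed to be mounted at /sys/fs/cgroup, and the program is assumed to run
with enough privilege:

/* Sketch: persistent service-level cgroup with a per-instance child.
 * The service level is created once and kept; the instance level is
 * created and removed on every restart. Illustrative paths, run as root. */
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

#define SVC  "/sys/fs/cgroup/system.slice/hello.service"
#define INST SVC "/hello.service.instance"

static void write_str(const char *path, const char *s)
{
        FILE *f = fopen(path, "w");

        if (f) {
                fputs(s, f);
                fclose(f);
        }
}

int main(void)
{
        /* Persistent layer: survives instance restarts and accumulates
         * reparented charges from dead instances. */
        mkdir(SVC, 0755);

        /* Optional: only needed if per-instance memory tracking is wanted. */
        write_str(SVC "/cgroup.subtree_control", "+memory");

        /* Per-instance layer: recreated on every restart. */
        mkdir(INST, 0755);

        /* Move this (service) process into the instance cgroup. */
        char pid[32];
        snprintf(pid, sizeof(pid), "%d", getpid());
        write_str(INST "/cgroup.procs", pid);

        /* ... service runs; on restart the instance cgroup is emptied and
         * rmdir'd, and whatever memory it still charges is reparented to
         * the persistent hello.service level. */
        return 0;
}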


[2] Whether and how to solve first and second+ instance charge differences.

If we take the layering approach, the first instance will get charged for
all memory that it uses while the second+ instances likely won't get charged
for a lot of persistent usages. I don't think there is a consensus on
whether this needs to be solved and I don't have enough context to form a
strong opinion. memcg folks are a lot better equipped to make this decision.

Assuming this needs to be solved, here's a braindump to be taken with a big
pinch of salt:

I have a bit of a difficult time imagining a perfect solution given that
whether a given page cache page is persistent or not would be really
difficult to know (or maybe all page cache is persistent by default while
anon is not). However, the problem still seems worthwhile to consider for
big ticket items such as persistent tmpfs mounts and huge bpf maps as they
can easily make the differences really big.

If we want to solve this problem, here are options that I can think of:

a. Let userspace put the charges where they belong using the current
   mechanisms. ie. Create persistent entities in the persistent parent
   cgroup while there's no current instance.

   Pro: It won't require any major kernel or interface changes. There still
   needs to be some tweaking, such as allowing tmpfs pages to always be
   charged to the cgroup which created the instance (maybe as long as it's
   an ancestor of the faulting cgroup?) but nothing too invasive.

   Con: It may not be flexible enough.

b. Let userspace specify which cgroup to charge for some constructs like
   tmpfs and bpf maps. The key problems with this approach are

   1. How to grant/deny what can be charged where. We must ensure that a
      descendant can't move charges up or across the tree without the
      ancestors allowing it.

   2. How to specify the cgroup to charge. While specifying the target
      cgroup directly might seem like an obvious solution, it has a couple
      rather serious problems. First, if the descendant is inside a cgroup
      namespace, it might not be able to see the target cgroup at all. Second,
      it's an interface which is likely to cause misunderstandings on how it
      can be used. It's too broad an interface.

   One solution that I can think of is leveraging the resource domain
   concept which is currently only used for threaded cgroups. All memory
   usages of threaded cgroups are charged to their resource domain cgroup
   which hosts the processes for those threads. The persistent usages have a
   similar pattern, so maybe the service level cgroup can declare that it's
   the encompassing resource domain and the instance cgroup can say whether
   it's gonna charge e.g. the tmpfs instance to its own or the encompassing
   resource domain.

   This has the benefit that the user only needs to indicate its intention
   without worrying about how cgroups are composed and what their IDs are.
   It just indicates whether the given resource is persistent and if the
   cgroup hierarchy is set up for that, it gets charged that way and if not
   it can be just charged to itself.

   This is a shower thought but, if we allow nesting such domains (and maybe
   name them), we can use it for shared resources too so that co-services
   are put inside a shared slice and shared resources are pushed to the
   slice level.
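
To make option a. a bit more concrete, a libbpf sketch under the assumption
that the creating task is already running in the persistent service-level
cgroup and that bpffs is mounted at /sys/fs/bpf; map name, sizes and pin
path are made up:

/* Sketch of option a.: a task running in the persistent service-level
 * cgroup creates the bpf map and pins it, so the charge lands on the
 * persistent memcg; per-instance tasks later just reopen the pinned map.
 * Uses libbpf (link with -lbpf); illustrative names and paths. */
#include <stdio.h>
#include <bpf/bpf.h>

#define PIN_PATH "/sys/fs/bpf/hello_service_map"

int main(void)
{
        int fd = bpf_obj_get(PIN_PATH);

        if (fd < 0) {
                /* First boot of the service: create and pin from the
                 * persistent cgroup, before any instance exists. */
                fd = bpf_map_create(BPF_MAP_TYPE_HASH, "hello_map",
                                    sizeof(int), sizeof(long), 4096, NULL);
                if (fd < 0) {
                        perror("bpf_map_create");
                        return 1;
                }
                if (bpf_obj_pin(fd, PIN_PATH)) {
                        perror("bpf_obj_pin");
                        return 1;
                }
        }
        /* Instances reuse the pinned map; no new map (or charge) is created. */
        printf("using pinned map fd %d\n", fd);
        return 0;
}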

This became pretty long. I obviously have a pretty strong bias towards
solving this within the current basic architecture but other than that most
of these decisions are best made by memcg folks. We can hopefully build some
consensus on the issue.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFD RESEND] cgroup: Persistent memory usage tracking
  2022-08-22 11:29             ` Tejun Heo
@ 2022-08-22 16:12               ` Shakeel Butt
  -1 siblings, 0 replies; 61+ messages in thread
From: Shakeel Butt @ 2022-08-22 16:12 UTC (permalink / raw)
  To: Tejun Heo, Mina Almasry
  Cc: Yafang Shao, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
	john fastabend, KP Singh, Stanislav Fomichev, Hao Luo, jolsa,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM,
	Yosry Ahmed, Dan Schatzberg, Lennart Poettering

Ccing Mina.

On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote:
>
> (Sorry, this is a resend. I messed up the header in the first posting.)
>
> Hello,
>
> This thread started on a bpf-specific memory tracking change proposal and
> went south, but a lot of people who would be interested are already cc'd, so
> I'm hijacking it to discuss what to do w/ persistent memory usage tracking.
>
> Cc'ing Mina and Yosry who were involved in the discussions on the similar
> problem re. tmpfs, Dan Schatzberg who has a lot more prod knowledge and
> experience than me, and Lennart for his thoughts from systemd side.
>
> The root problem is that there are resources (almost solely memory
> currently) that outlive a given instance of a, to use systemd-lingo,
> service. Page cache is the most common case.
>
> Let's say there's system.slice/hello.service. When it runs for the first
> time, page cache backing its binary will be charged to hello.service.
> However, when it restarts after e.g. a config change, when the initial
> hello.service cgroup gets destroyed, we reparent the page cache charges to
> the parent system.slice and when the second instance starts, its binary will
> stay charged to system.slice. Over time, some may get reclaimed and
> refaulted into the new hello.service but that's not guaranteed and most hot
> pages likely won't.
>
> The same problem exists for any memory which is not freed synchronously when
> the current instance exits. While this isn't a problem for many cases, it's
> not difficult to imagine situations where the amount of memory which ends up
> getting pushed to the parent is significant, even clear majority, with big
> page cache footprint, persistent tmpfs instances and so on, creating issues
> with accounting accuracy and thus control.
>
> I think there are two broad issues to discuss here:
>
> [1] Can this be solved by layering the instance cgroups under persistent
>     entity cgroup?
>
> So, instead of systemd.slice/hello.service, the application runs inside
> something like systemd.slice/hello.service/hello.service.instance and the
> service-level cgroup hello.service is not destroyed as long as it is
> something worth tracking on the system.
>
> The benefits are
>
> a. While requiring changing how userland organizes the cgroup hierarchy, it is a
>    straightforward extension of the current architecture and doesn't
>    require any conceptual or structural changes. All the accounting and
>    control schemes work exactly the same as before. The only difference is
>    that we now have a persistent entity representing each service as we want
>    to track their persistent resource usages.
>
> b. Per-instance tracking and control is optional. To me, it seems that the
>    persistent resource usages would be more meaningful than per-instance and
>    tracking down to the persistent usages shouldn't add noticeable runtime
>    overheads while keeping per-instance process management niceties and
>    allowing use cases to opt-in for per-instance resource tracking and
>    control as needed.
>
> The complications are:
>
> a. It requires changing cgroup hierarchy in a very visible way.
>
> b. What should be the lifetime rules for persistent cgroups? Do we keep them
>    around forever or maybe they can be created on the first use and kept
>    around until the service is removed from the system? When the persistent
>    cgroup is removed, do we need to make sure that the remaining resource
>    usages are low enough? Note that this problem exists for any approach
>    that tries to track persistent usages no matter how it's done.
>
> c. Do we need to worry about nesting overhead? Given that there's no reason
>    to enable controllers w/o persistent state for the instance level and the
>    nesting overhead is pretty low for memcg, this doesn't seem like a
>    problem to me. If this becomes a problem, we just need to fix it.
>
> A couple alternatives discussed are:
>
> a. Userspace keeps reusing the same cgroup for different instances of the
>    same service. This simplifies some aspects while making others more
>    complicated. e.g. Determining the current instance's CPU or IO usages now
>    require the monitoring software remembering what they were when this
>    instance started and calculating the deltas. Also, if some use cases want
>    to distinguish persistent vs. instance usages (more on this later), this
>    isn't gonna work. That said, this definitely is attractive in that it
>    minimizes overt user-visible changes.
>
> b. Memory is disassociated rather than just reparented on cgroup destruction
>    and gets re-charged to the next first user. This is attractive in that it
>    doesn't require any userspace changes; however, I'm not sure how this
>    would work for non-pageable memory usages such as bpf maps. How would we
>    detect the next first usage?
>
>
> [2] Whether and how to solve first and second+ instance charge differences.
>
> If we take the layering approach, the first instance will get charged for
> all memory that it uses while the second+ instances likely won't get charged
> for a lot of persistent usages. I don't think there is a consensus on
> whether this needs to be solved and I don't have enough context to form a
> strong opinion. memcg folks are a lot better equipped to make this decision.
>
> Assuming this needs to be solved, here's a braindump to be taken with a big
> pinch of salt:
>
> I have a bit of a difficult time imagining a perfect solution given that
> whether a given page cache page is persistent or not would be really
> difficult to know (or maybe all page cache is persistent by default while
> anon is not). However, the problem still seems worthwhile to consider for
> big ticket items such as persistent tmpfs mounts and huge bpf maps as they
> can easily make the differences really big.
>
> If we want to solve this problem, here are options that I can think of:
>
> a. Let userspace put the charges where they belong using the current
>    mechanisms. ie. Create persistent entities in the persistent parent
>    cgroup while there's no current instance.
>
>    Pro: It won't require any major kernel or interface changes. There still
>    need to be some tweaking such as allowing tmpfs pages to be always
>    charged to the cgroup which created the instance (maybe as long as it's
>    an ancestor of the faulting cgroup?) but nothing too invasive.
>
>    Con: It may not be flexible enough.
>
> b. Let userspace specify which cgroup to charge for some of constructs like
>    tmpfs and bpf maps. The key problems with this approach are
>
>    1. How to grant/deny what can be charged where. We must ensure that a
>       descendant can't move charges up or across the tree without the
>       ancestors allowing it.
>
>    2. How to specify the cgroup to charge. While specifying the target
>       cgroup directly might seem like an obvious solution, it has a couple
>       rather serious problems. First, if the descendant is inside a cgroup
>       namespace, it might be able to see the target cgroup at all. Second,
>       it's an interface which is likely to cause misunderstandings on how it
>       can be used. It's too broad an interface.
>
>    One solution that I can think of is leveraging the resource domain
>    concept which is currently only used for threaded cgroups. All memory
>    usages of threaded cgroups are charged to their resource domain cgroup
>    which hosts the processes for those threads. The persistent usages have a
>    similar pattern, so maybe the service level cgroup can declare that it's
>    the encompassing resource domain and the instance cgroup can say whether
>    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
>    resource domain.
>
>    This has the benefit that the user only needs to indicate its intention
>    without worrying about how cgroups are composed and what their IDs are.
>    It just indicates whether the given resource is persistent and if the
>    cgroup hierarchy is set up for that, it gets charged that way and if not
>    it can be just charged to itself.
>
>    This is a shower thought but, if we allow nesting such domains (and maybe
>    name them), we can use it for shared resources too so that co-services
>    are put inside a shared slice and shared resources are pushed to the
>    slice level.
>
> This became pretty long. I obviously have a pretty strong bias towards
> solving this within the current basic architecture but other than that most
> of these decisions are best made by memcg folks. We can hopefully build some
> consensus on the issue.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFD RESEND] cgroup: Persistent memory usage tracking
@ 2022-08-22 16:12               ` Shakeel Butt
  0 siblings, 0 replies; 61+ messages in thread
From: Shakeel Butt @ 2022-08-22 16:12 UTC (permalink / raw)
  To: Tejun Heo, Mina Almasry
  Cc: Yafang Shao, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
	john fastabend, KP Singh, Stanislav Fomichev, Hao Luo, jolsa,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM

Ccing Mina.

On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote:
>
> (Sorry, this is a resend. I messed up the header in the first posting.)
>
> Hello,
>
> This thread started on a bpf-specific memory tracking change proposal and
> went south, but a lot of people who would be interested are already cc'd, so
> I'm hijacking it to discuss what to do w/ persistent memory usage tracking.
>
> Cc'ing Mina and Yosry who were involved in the discussions on the similar
> problem re. tmpfs, Dan Schatzberg who has a lot more prod knowledge and
> experience than me, and Lennart for his thoughts from systemd side.
>
> The root problem is that there are resources (almost solely memory
> currently) that outlive a given instance of a, to use systemd-lingo,
> service. Page cache is the most common case.
>
> Let's say there's system.slice/hello.service. When it runs for the first
> time, page cache backing its binary will be charged to hello.service.
> However, when it restarts after e.g. a config change, when the initial
> hello.service cgroup gets destroyed, we reparent the page cache charges to
> the parent system.slice and when the second instance starts, its binary will
> stay charged to system.slice. Over time, some may get reclaimed and
> refaulted into the new hello.service but that's not guaranteed and most hot
> pages likely won't.
>
> The same problem exists for any memory which is not freed synchronously when
> the current instance exits. While this isn't a problem for many cases, it's
> not difficult to imagine situations where the amount of memory which ends up
> getting pushed to the parent is significant, even clear majority, with big
> page cache footprint, persistent tmpfs instances and so on, creating issues
> with accounting accuracy and thus control.
>
> I think there are two broad issues to discuss here:
>
> [1] Can this be solved by layering the instance cgroups under persistent
>     entity cgroup?
>
> So, instead of systemd.slice/hello.service, the application runs inside
> something like systemd.slice/hello.service/hello.service.instance and the
> service-level cgroup hello.service is not destroyed as long as it is
> something worth tracking on the system.
>
> The benefits are
>
> a. While requiring changing how userland organizes cgroup hiearchy, it is a
>    straight-forward extension of the current architecture and doesn't
>    require any conceptual or structural changes. All the accounting and
>    control schemes work exactly the same as before. The only difference is
>    that we now have a persistent entity representing each service as we want
>    to track their persistent resource usages.
>
> b. Per-instance tracking and control is optional. To me, it seems that the
>    persistent resource usages would be more meaningful than per-instance and
>    tracking down to the persistent usages shouldn't add noticeable runtime
>    overheads while keeping per-instance process management niceties and
>    allowing use cases to opt-in for per-instance resource tracking and
>    control as needed.
>
> The complications are:
>
> a. It requires changing cgroup hierarchy in a very visible way.
>
> b. What should be the lifetime rules for persistent cgroups? Do we keep them
>    around forever or maybe they can be created on the first use and kept
>    around until the service is removed from the system? When the persistent
>    cgroup is removed, do we need to make sure that the remaining resource
>    usages are low enough? Note that this problem exists for any approach
>    that tries to track persistent usages no matter how it's done.
>
> c. Do we need to worry about nesting overhead? Given that there's no reason
>    to enable controllers w/o persisten states for the instance level and the
>    nesting overhead is pretty low for memcg, this doesn't seem like a
>    problem to me. If this becomes a problem, we just need to fix it.
>
> A couple alternatives discussed are:
>
> a. Userspace keeps reusing the same cgroup for different instances of the
>    same service. This simplifies some aspects while making others more
>    complicated. e.g. Determining the current instance's CPU or IO usages now
>    require the monitoring software remembering what they were when this
>    instance started and calculating the deltas. Also, if some use cases want
>    to distinguish persistent vs. instance usages (more on this later), this
>    isn't gonna work. That said, this definitely is attractive in that it
>    miminizes overt user visible changes.
>
> b. Memory is disassociated rather than just reparented on cgroup destruction
>    and get re-charged to the next first user. This is attractive in that it
>    doesn't require any userspace changes; however, I'm not sure how this
>    would work for non-pageable memory usages such as bpf maps. How would we
>    detect the next first usage?
>
>
> [2] Whether and how to solve first and second+ instance charge differences.
>
> If we take the layering approach, the first instance will get charged for
> all memory that it uses while the second+ instances likely won't get charged
> for a lot of persistent usages. I don't think there is a consensus on
> whether this needs to be solved and I don't have enough context to form a
> strong opinion. memcg folks are a lot better equipped to make this decision.
>
> Assuming this needs to be solved, here's a braindump to be taken with a big
> pinch of salt:
>
> I have a bit of difficult time imagining a perfect solution given that
> whether a given page cache page is persistent or not would be really
> difficult to know (or maybe all page cache is persistent by default while
> anon is not). However, the problem still seems worthwhile to consider for
> big ticket items such as persistent tmpfs mounts and huge bpf maps as they
> can easily make the differences really big.
>
> If we want to solve this problem, here are options that I can think of:
>
> a. Let userspace put the charges where they belong using the current
>    mechanisms. ie. Create persistent entities in the persistent parent
>    cgroup while there's no current instance.
>
>    Pro: It won't require any major kernel or interface changes. There still
>    needs to be some tweaking such as allowing tmpfs pages to be always
>    charged to the cgroup which created the instance (maybe as long as it's
>    an ancestor of the faulting cgroup?) but nothing too invasive.
>
>    Con: It may not be flexible enough.
>
> b. Let userspace specify which cgroup to charge for some of constructs like
>    tmpfs and bpf maps. The key problems with this approach are
>
>    1. How to grant/deny what can be charged where. We must ensure that a
>       descendant can't move charges up or across the tree without the
>       ancestors allowing it.
>
>    2. How to specify the cgroup to charge. While specifying the target
>       cgroup directly might seem like an obvious solution, it has a couple
>       rather serious problems. First, if the descendant is inside a cgroup
>       namespace, it might not be able to see the target cgroup at all. Second,
>       it's an interface which is likely to cause misunderstandings on how it
>       can be used. It's too broad an interface.
>
>    One solution that I can think of is leveraging the resource domain
>    concept which is currently only used for threaded cgroups. All memory
>    usages of threaded cgroups are charged to their resource domain cgroup
>    which hosts the processes for those threads. The persistent usages have a
>    similar pattern, so maybe the service level cgroup can declare that it's
>    the encompassing resource domain and the instance cgroup can say whether
>    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
>    resource domain.
>
>    This has the benefit that the user only needs to indicate its intention
>    without worrying about how cgroups are composed and what their IDs are.
>    It just indicates whether the given resource is persistent and if the
>    cgroup hierarchy is set up for that, it gets charged that way and if not
>    it can be just charged to itself.
>
>    This is a shower thought but, if we allow nesting such domains (and maybe
>    name them), we can use it for shared resources too so that co-services
>    are put inside a shared slice and shared resources are pushed to the
>    slice level.
>
> This became pretty long. I obviously have a pretty strong bias towards
> solving this within the current basic architecture but other than that most
> of these decisions are best made by memcg folks. We can hopefully build some
> consensus on the issue.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFD RESEND] cgroup: Persistent memory usage tracking
@ 2022-08-22 19:02               ` Mina Almasry
  0 siblings, 0 replies; 61+ messages in thread
From: Mina Almasry @ 2022-08-22 19:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yafang Shao, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
	john fastabend, KP Singh, Stanislav Fomichev, Hao Luo, jolsa,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Zefan Li, Cgroups, netdev, bpf,
	Linux MM, Yosry Ahmed, Dan Schatzberg, Lennart Poettering

On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote:
>
> (Sorry, this is a resend. I messed up the header in the first posting.)
>
> Hello,
>
> This thread started on a bpf-specific memory tracking change proposal and
> went south, but a lot of people who would be interested are already cc'd, so
> I'm hijacking it to discuss what to do w/ persistent memory usage tracking.
>
> Cc'ing Mina and Yosry who were involved in the discussions on the similar
> problem re. tmpfs, Dan Schatzberg who has a lot more prod knowledge and
> experience than me, and Lennart for his thoughts from systemd side.
>
> The root problem is that there are resources (almost solely memory
> currently) that outlive a given instance of a, to use systemd-lingo,
> service. Page cache is the most common case.
>
> Let's say there's system.slice/hello.service. When it runs for the first
> time, page cache backing its binary will be charged to hello.service.
> However, when it restarts after e.g. a config change, when the initial
> hello.service cgroup gets destroyed, we reparent the page cache charges to
> the parent system.slice and when the second instance starts, its binary will
> stay charged to system.slice. Over time, some may get reclaimed and
> refaulted into the new hello.service but that's not guaranteed and most hot
> pages likely won't.
>
> The same problem exists for any memory which is not freed synchronously when
> the current instance exits. While this isn't a problem for many cases, it's
> not difficult to imagine situations where the amount of memory which ends up
> getting pushed to the parent is significant, even clear majority, with big
> page cache footprint, persistent tmpfs instances and so on, creating issues
> with accounting accuracy and thus control.
>
> I think there are two broad issues to discuss here:
>
> [1] Can this be solved by layering the instance cgroups under persistent
>     entity cgroup?
>
> So, instead of systemd.slice/hello.service, the application runs inside
> something like systemd.slice/hello.service/hello.service.instance and the
> service-level cgroup hello.service is not destroyed as long as it is
> something worth tracking on the system.
>
> The benefits are
>
> a. While requiring changing how userland organizes cgroup hierarchy, it is a
>    straight-forward extension of the current architecture and doesn't
>    require any conceptual or structural changes. All the accounting and
>    control schemes work exactly the same as before. The only difference is
>    that we now have a persistent entity representing each service as we want
>    to track their persistent resource usages.
>
> b. Per-instance tracking and control is optional. To me, it seems that the
>    persistent resource usages would be more meaningful than per-instance and
>    tracking down to the persistent usages shouldn't add noticeable runtime
>    overheads while keeping per-instance process management niceties and
>    allowing use cases to opt-in for per-instance resource tracking and
>    control as needed.
>
> The complications are:
>
> a. It requires changing cgroup hierarchy in a very visible way.
>
> b. What should be the lifetime rules for persistent cgroups? Do we keep them
>    around forever or maybe they can be created on the first use and kept
>    around until the service is removed from the system? When the persistent
>    cgroup is removed, do we need to make sure that the remaining resource
>    usages are low enough? Note that this problem exists for any approach
>    that tries to track persistent usages no matter how it's done.
>
> c. Do we need to worry about nesting overhead? Given that there's no reason
>    to enable controllers w/o persistent states for the instance level and the
>    nesting overhead is pretty low for memcg, this doesn't seem like a
>    problem to me. If this becomes a problem, we just need to fix it.
>
> A couple alternatives discussed are:
>
> a. Userspace keeps reusing the same cgroup for different instances of the
>    same service. This simplifies some aspects while making others more
>    complicated. e.g. Determining the current instance's CPU or IO usages now
>    require the monitoring software remembering what they were when this
>    instance started and calculating the deltas. Also, if some use cases want
>    to distinguish persistent vs. instance usages (more on this later), this
>    isn't gonna work. That said, this definitely is attractive in that it
>    minimizes overt user-visible changes.
>
> b. Memory is disassociated rather than just reparented on cgroup destruction
>    and get re-charged to the next first user. This is attractive in that it
>    doesn't require any userspace changes; however, I'm not sure how this
>    would work for non-pageable memory usages such as bpf maps. How would we
>    detect the next first usage?
>
>
> [2] Whether and how to solve first and second+ instance charge differences.
>
> If we take the layering approach, the first instance will get charged for
> all memory that it uses while the second+ instances likely won't get charged
> for a lot of persistent usages. I don't think there is a consensus on
> whether this needs to be solved and I don't have enough context to form a
> strong opinion. memcg folks are a lot better equipped to make this decision.
>

The problem of first and second+ instance charges is something I've
run into when dealing with tmpfs charges, and something I believe
Yosry ran into with regard to bpf as well, so AFAIU we think it's
something worth addressing. I'd also like to hear from other memcg
folks whether this is something that needs to be solved for them as well.
For us this is something we want to fix because the memory usage of
these services is hard to predict and limit when the first charger's
usage may be much greater than that of the second+ instances.
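
As a rough illustration of that unpredictability (the cgroup path is just
an example), a monitor can compare memory.current for the service cgroup
across a restart. With the current reparenting behaviour the post-restart
value can be much smaller even though the same pages are still resident,
because the charges moved to the parent:

#include <stdio.h>

/* Read memory.current (bytes) for a cgroup; path is an example only. */
static long long cg_memory_current(const char *cg)
{
	char path[512];
	long long bytes = -1;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.current", cg);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%lld", &bytes) != 1)
		bytes = -1;
	fclose(f);
	return bytes;
}

int main(void)
{
	/* sample before the restart, then again after the new instance starts */
	printf("%lld\n",
	       cg_memory_current("/sys/fs/cgroup/system.slice/hello.service"));
	return 0;
}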

> Assuming this needs to be solved, here's a braindump to be taken with a big
> pinch of salt:
>
> I have a bit of a difficult time imagining a perfect solution given that
> whether a given page cache page is persistent or not would be really
> difficult to know (or maybe all page cache is persistent by default while
> anon is not). However, the problem still seems worthwhile to consider for
> big ticket items such as persistent tmpfs mounts and huge bpf maps as they
> can easily make the differences really big.
>
> If we want to solve this problem, here are options that I can think of:
>
> a. Let userspace put the charges where they belong using the current
>    mechanisms. ie. Create persistent entities in the persistent parent
>    cgroup while there's no current instance.
>
>    Pro: It won't require any major kernel or interface changes. There still
>    needs to be some tweaking such as allowing tmpfs pages to be always
>    charged to the cgroup which created the instance (maybe as long as it's
>    an ancestor of the faulting cgroup?) but nothing too invasive.
>
>    Con: It may not be flexible enough.
>

I believe this works for us, provided memory is charged to the
persistent parent entity directly and is not charged to the child
instance. If memory is charged to the child instance first and then
reparented, I think we still have the same problem, because the first
instance is charged for this memory but the second+ instances are not,
which makes predicting the memory usage of such use cases difficult.

> b. Let userspace specify which cgroup to charge for some of constructs like
>    tmpfs and bpf maps. The key problems with this approach are
>
>    1. How to grant/deny what can be charged where. We must ensure that a
>       descendant can't move charges up or across the tree without the
>       ancestors allowing it.
>
>    2. How to specify the cgroup to charge. While specifying the target
>       cgroup directly might seem like an obvious solution, it has a couple
>       rather serious problems. First, if the descendant is inside a cgroup
>       namespace, it might not be able to see the target cgroup at all. Second,
>       it's an interface which is likely to cause misunderstandings on how it
>       can be used. It's too broad an interface.
>

This is pretty much the solution I sent out for review about a year
ago and yes, it suffers from the issues you've brought up:
https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/


>    One solution that I can think of is leveraging the resource domain
>    concept which is currently only used for threaded cgroups. All memory
>    usages of threaded cgroups are charged to their resource domain cgroup
>    which hosts the processes for those threads. The persistent usages have a
>    similar pattern, so maybe the service level cgroup can declare that it's
>    the encompassing resource domain and the instance cgroup can say whether
>    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
>    resource domain.
>

I think this sounds excellent and addresses our use cases. Basically
the tmpfs/bpf memory would get charged to the encompassing resource
domain cgroup rather than the instance cgroup, making the memory usage
of the first and second+ instances consistent and predictable.

I would love to hear from other memcg folks what they think of
such an approach. I would also love to hear what kind of interface you
have in mind. Perhaps a cgroup tunable that says whether the instance is
going to charge its tmpfs/bpf usage to itself or to the encompassing
resource domain?
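
To make that concrete, such a knob could look something like the sketch
below. The file name memory.charge_origin and its values are invented
here purely to show the shape of the interface, i.e. the instance cgroup
opting to charge tmpfs/bpf either to itself or to its encompassing
resource domain:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical knob; neither the file nor the values exist today. */
static int set_charge_origin(const char *instance_cg, const char *mode)
{
	char path[512];
	int fd, ret = 0;

	snprintf(path, sizeof(path), "%s/memory.charge_origin", instance_cg);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	/* mode: "self" (default) or "domain" in this sketch */
	if (write(fd, mode, strlen(mode)) < 0)
		ret = -1;
	close(fd);
	return ret;
}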

>    This has the benefit that the user only needs to indicate its intention
>    without worrying about how cgroups are composed and what their IDs are.
>    It just indicates whether the given resource is persistent and if the
>    cgroup hierarchy is set up for that, it gets charged that way and if not
>    it can be just charged to itself.
>
>    This is a shower thought but, if we allow nesting such domains (and maybe
>    name them), we can use it for shared resources too so that co-services
>    are put inside a shared slice and shared resources are pushed to the
>    slice level.
>
> This became pretty long. I obviously have a pretty strong bias towards
> solving this within the current basic architecture but other than that most
> of these decisions are best made by memcg folks. We can hopefully build some
> consensus on the issue.
>
> Thanks.
>
> --
> tejun
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFD RESEND] cgroup: Persistent memory usage tracking
  2022-08-22 19:02               ` Mina Almasry
@ 2022-08-22 21:19                 ` Johannes Weiner
  -1 siblings, 0 replies; 61+ messages in thread
From: Johannes Weiner @ 2022-08-22 21:19 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Tejun Heo, Yafang Shao, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
	john fastabend, KP Singh, Stanislav Fomichev, Hao Luo, jolsa,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM,
	Yosry Ahmed, Dan Schatzberg, Lennart Poettering

On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote:
> On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote:
> > b. Let userspace specify which cgroup to charge for some of constructs like
> >    tmpfs and bpf maps. The key problems with this approach are
> >
> >    1. How to grant/deny what can be charged where. We must ensure that a
> >       descendant can't move charges up or across the tree without the
> >       ancestors allowing it.
> >
> >    2. How to specify the cgroup to charge. While specifying the target
> >       cgroup directly might seem like an obvious solution, it has a couple
> >       rather serious problems. First, if the descendant is inside a cgroup
> >       namespace, it might not be able to see the target cgroup at all. Second,
> >       it's an interface which is likely to cause misunderstandings on how it
> >       can be used. It's too broad an interface.
> >
> 
> This is pretty much the solution I sent out for review about a year
> ago and yes, it suffers from the issues you've brought up:
> https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
> 
> 
> >    One solution that I can think of is leveraging the resource domain
> >    concept which is currently only used for threaded cgroups. All memory
> >    usages of threaded cgroups are charged to their resource domain cgroup
> >    which hosts the processes for those threads. The persistent usages have a
> >    similar pattern, so maybe the service level cgroup can declare that it's
> >    the encompassing resource domain and the instance cgroup can say whether
> >    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> >    resource domain.
> >
> 
> I think this sounds excellent and addresses our use cases. Basically
> the tmpfs/bpf memory would get charged to the encompassing resource
> domain cgroup rather than the instance cgroup, making the memory usage
> of the first and second+ instances consistent and predictable.
> 
> Would love to hear from other memcg folks what they would think of
> such an approach. I would also love to hear what kind of interface you
> have in mind. Perhaps a cgroup tunable that says whether it's going to
> charge the tmpfs/bpf instance to itself or to the encompassing
> resource domain?

I like this too. It makes shared charging predictable, with a coherent
resource hierarchy (congruent OOM, CPU, IO domains), and without the
need for cgroup paths in tmpfs mounts or similar.

As far as who is declaring what goes, though: if the instance groups
can declare arbitrary files/objects persistent or shared, they'd be
able to abuse this and sneak private memory past local limits and
burden the wider persistent/shared domain with it.

I'm thinking it might make more sense for the service level to declare
which objects are persistent and shared across instances.

If that's the case, we may not need a two-component interface. Just
the ability for an intermediate cgroup to say: "This object's future
memory is to be charged to me, not the instantiating cgroup."

Can we require a process in the intermediate cgroup to set up the file
or object, and use madvise/fadvise to say "charge me", before any
instances are launched?
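
A minimal sketch of that flow, assuming a made-up MADV_CHARGE_HERE advice
value (nothing like it exists today): a helper running in the service-level
cgroup creates and maps the shared file, tags it, and only then are the
instances launched.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define MADV_CHARGE_HERE	0x100	/* invented for illustration only */

/* Run by a process in the service-level cgroup before instances start. */
static int tag_shared_file(const char *path, size_t size)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	void *addr;

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0)
		goto err;
	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		goto err;
	/* "this object's future memory is to be charged to me" */
	if (madvise(addr, size, MADV_CHARGE_HERE) < 0) {
		munmap(addr, size);
		goto err;
	}
	munmap(addr, size);
	return fd;
err:
	close(fd);
	return -1;
}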

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFD RESEND] cgroup: Persistent memory usage tracking
  2022-08-22 21:19                 ` Johannes Weiner
@ 2022-08-22 21:52                   ` Mina Almasry
  -1 siblings, 0 replies; 61+ messages in thread
From: Mina Almasry @ 2022-08-22 21:52 UTC (permalink / raw)
  To: Johannes Weiner, Yosry Ahmed
  Cc: Tejun Heo, Yafang Shao, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
	john fastabend, KP Singh, Stanislav Fomichev, Hao Luo, jolsa,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM,
	Dan Schatzberg, Lennart Poettering

On Mon, Aug 22, 2022 at 2:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote:
> > On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote:
> > > b. Let userspace specify which cgroup to charge for some of constructs like
> > >    tmpfs and bpf maps. The key problems with this approach are
> > >
> > >    1. How to grant/deny what can be charged where. We must ensure that a
> > >       descendant can't move charges up or across the tree without the
> > >       ancestors allowing it.
> > >
> > >    2. How to specify the cgroup to charge. While specifying the target
> > >       cgroup directly might seem like an obvious solution, it has a couple
> > >       rather serious problems. First, if the descendant is inside a cgroup
> > >       namespace, it might not be able to see the target cgroup at all. Second,
> > >       it's an interface which is likely to cause misunderstandings on how it
> > >       can be used. It's too broad an interface.
> > >
> >
> > This is pretty much the solution I sent out for review about a year
> > ago and yes, it suffers from the issues you've brought up:
> > https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
> >
> >
> > >    One solution that I can think of is leveraging the resource domain
> > >    concept which is currently only used for threaded cgroups. All memory
> > >    usages of threaded cgroups are charged to their resource domain cgroup
> > >    which hosts the processes for those threads. The persistent usages have a
> > >    similar pattern, so maybe the service level cgroup can declare that it's
> > >    the encompassing resource domain and the instance cgroup can say whether
> > >    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > >    resource domain.
> > >
> >
> > I think this sounds excellent and addresses our use cases. Basically
> > the tmpfs/bpf memory would get charged to the encompassing resource
> > domain cgroup rather than the instance cgroup, making the memory usage
> > of the first and second+ instances consistent and predictable.
> >
> > Would love to hear from other memcg folks what they would think of
> > such an approach. I would also love to hear what kind of interface you
> > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > charge the tmpfs/bpf instance to itself or to the encompassing
> > resource domain?
>
> I like this too. It makes shared charging predictable, with a coherent
> resource hierarchy (congruent OOM, CPU, IO domains), and without the
> need for cgroup paths in tmpfs mounts or similar.
>
> As far as who is declaring what goes, though: if the instance groups
> can declare arbitrary files/objects persistent or shared, they'd be
> able to abuse this and sneak private memory past local limits and
> burden the wider persistent/shared domain with it.
>
> I'm thinking it might make more sense for the service level to declare
> which objects are persistent and shared across instances.
>
> If that's the case, we may not need a two-component interface. Just
> the ability for an intermediate cgroup to say: "This object's future
> memory is to be charged to me, not the instantiating cgroup."
>
> Can we require a process in the intermediate cgroup to set up the file
> or object, and use madvise/fadvise to say "charge me", before any
> instances are launched?

I think doing this at file granularity makes it logistically hard to
use, no? The service needs to create a file in the shared domain, and
all its instances need to reuse this exact same file.

Our Kubernetes use case from [1] shares a mount between subtasks
rather than specific files. This allows subtasks to create files at
will in the mount with the memory charged to the shared domain. I
imagine this is more convenient than a shared file.

Our other use case, which I hope to address here as well, is the
service-client relationship from [1], where the service would like to
charge per-client memory back to the client itself. In this case the
service or the client can create a mount from the shared domain and pass
it to the service, at which point the service is free to create/remove
files in this mount as it sees fit.

Would you be open to a per-mount interface rather than a per-file
fadvise interface?
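
To make the per-mount idea concrete, it could be something along the
lines of the sketch below; the "memcg=" tmpfs option is illustrative
only, not an existing interface:

#include <sys/mount.h>

/* Illustrative only: direct all charges for this mount to a shared cgroup. */
static int mount_shared_tmpfs(void)
{
	return mount("tmpfs", "/mnt/shared-scratch", "tmpfs", 0,
		     "size=1G,memcg=/system.slice/hello.service");
}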

Yosry, would a proposal like so be extensible to address the bpf
charging issues?

[1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFD RESEND] cgroup: Persistent memory usage tracking
@ 2022-08-23  3:01                   ` Roman Gushchin
  0 siblings, 0 replies; 61+ messages in thread
From: Roman Gushchin @ 2022-08-23  3:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mina Almasry, Tejun Heo, Yafang Shao, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin Lau, Song Liu,
	Yonghong Song, john fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, jolsa, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM,
	Yosry Ahmed, Dan Schatzberg, Lennart Poettering

On Mon, Aug 22, 2022 at 05:19:50PM -0400, Johannes Weiner wrote:
> On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote:
> > On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote:
> > > b. Let userspace specify which cgroup to charge for some of constructs like
> > >    tmpfs and bpf maps. The key problems with this approach are
> > >
> > >    1. How to grant/deny what can be charged where. We must ensure that a
> > >       descendant can't move charges up or across the tree without the
> > >       ancestors allowing it.
> > >
> > >    2. How to specify the cgroup to charge. While specifying the target
> > >       cgroup directly might seem like an obvious solution, it has a couple
> > >       rather serious problems. First, if the descendant is inside a cgroup
> > >       namespace, it might not be able to see the target cgroup at all. Second,
> > >       it's an interface which is likely to cause misunderstandings on how it
> > >       can be used. It's too broad an interface.
> > >
> > 
> > This is pretty much the solution I sent out for review about a year
> > ago and yes, it suffers from the issues you've brought up:
> > https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
> > 
> > 
> > >    One solution that I can think of is leveraging the resource domain
> > >    concept which is currently only used for threaded cgroups. All memory
> > >    usages of threaded cgroups are charged to their resource domain cgroup
> > >    which hosts the processes for those threads. The persistent usages have a
> > >    similar pattern, so maybe the service level cgroup can declare that it's
> > >    the encompassing resource domain and the instance cgroup can say whether
> > >    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > >    resource domain.
> > >
> > 
> > I think this sounds excellent and addresses our use cases. Basically
> > the tmpfs/bpf memory would get charged to the encompassing resource
> > domain cgroup rather than the instance cgroup, making the memory usage
> > of the first and second+ instances consistent and predictable.
> > 
> > Would love to hear from other memcg folks what they would think of
> > such an approach. I would also love to hear what kind of interface you
> > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > charge the tmpfs/bpf instance to itself or to the encompassing
> > resource domain?
> 
> I like this too. It makes shared charging predictable, with a coherent
> resource hierarchy (congruent OOM, CPU, IO domains), and without the
> need for cgroup paths in tmpfs mounts or similar.
> 
> As far as who is declaring what goes, though: if the instance groups
> can declare arbitrary files/objects persistent or shared, they'd be
> able to abuse this and sneak private memory past local limits and
> burden the wider persistent/shared domain with it.
> 
> I'm thinking it might make more sense for the service level to declare
> which objects are persistent and shared across instances.

I like this idea.

> 
> If that's the case, we may not need a two-component interface. Just
> the ability for an intermediate cgroup to say: "This object's future
> memory is to be charged to me, not the instantiating cgroup."
> 
> Can we require a process in the intermediate cgroup to set up the file
> or object, and use madvise/fadvise to say "charge me", before any
> instances are launched?

We need to think about how to make this interface convenient to use.
First, these persistent resources are likely created by some agent software,
not the main workload, so the requirement to call madvise() from the
actual cgroup might not be easily achievable.

So _maybe_ something like writing an fd into cgroup.memory.resources.
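
As a strawman, that could look something like the sketch below; the
cgroup.memory.resources file and the fd-as-decimal-string format are
hypothetical:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hand a bpf map / tmpfs file fd to the persistent cgroup for charging. */
static int charge_fd_to_cgroup(const char *cg, int resource_fd)
{
	char path[512], buf[16];
	int fd, ret = 0;

	snprintf(path, sizeof(path), "%s/cgroup.memory.resources", cg);
	snprintf(buf, sizeof(buf), "%d", resource_fd);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, buf, strlen(buf)) < 0)
		ret = -1;
	close(fd);
	return ret;
}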

Second, it would be really useful to export the current configuration
to userspace. E.g. a user should be able to query to which cgroup the given
bpf map "belongs" and which bpf maps belong to the given cgroups. Otherwise
it will create a problem for userspace programs which manage cgroups
(e.g. systemd): they should be able to restore the current configuration
from the kernel state, without "remembering" what has been configured
before.

Thanks!

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFD RESEND] cgroup: Persistent memory usage tracking
  2022-08-23  3:01                   ` Roman Gushchin
@ 2022-08-23  3:14                     ` Tejun Heo
  -1 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2022-08-23  3:14 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Mina Almasry, Yafang Shao, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin Lau, Song Liu,
	Yonghong Song, john fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, jolsa, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM,
	Yosry Ahmed, Dan Schatzberg, Lennart Poettering

Hello,

On Mon, Aug 22, 2022 at 08:01:41PM -0700, Roman Gushchin wrote:
> > > >    One solution that I can think of is leveraging the resource domain
> > > >    concept which is currently only used for threaded cgroups. All memory
> > > >    usages of threaded cgroups are charged to their resource domain cgroup
> > > >    which hosts the processes for those threads. The persistent usages have a
> > > >    similar pattern, so maybe the service level cgroup can declare that it's
> > > >    the encompassing resource domain and the instance cgroup can say whether
> > > >    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > > >    resource domain.
> > > >
> > > 
> > > I think this sounds excellent and addresses our use cases. Basically
> > > the tmpfs/bpf memory would get charged to the encompassing resource
> > > domain cgroup rather than the instance cgroup, making the memory usage
> > > of the first and second+ instances consistent and predictable.
> > > 
> > > Would love to hear from other memcg folks what they would think of
> > > such an approach. I would also love to hear what kind of interface you
> > > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > > charge the tmpfs/bpf instance to itself or to the encompassing
> > > resource domain?
> > 
> > I like this too. It makes shared charging predictable, with a coherent
> > resource hierarchy (congruent OOM, CPU, IO domains), and without the
> > need for cgroup paths in tmpfs mounts or similar.
> > 
> > As far as who is declaring what goes, though: if the instance groups
> > can declare arbitrary files/objects persistent or shared, they'd be
> > able to abuse this and sneak private memory past local limits and
> > burden the wider persistent/shared domain with it.

My thought was that the persistent cgroup and instance cgroups should belong
to the same trust domain and system level control should be applied at the
resource domain level. The application may decide to shift between
persistent and per-instance however it wants to and may even configure
resource control at that level but all that's for its own accounting
accuracy and benefit.

> > I'm thinking it might make more sense for the service level to declare
> > which objects are persistent and shared across instances.
> 
> I like this idea.
> 
> > If that's the case, we may not need a two-component interface. Just
> > the ability for an intermediate cgroup to say: "This object's future
> > memory is to be charged to me, not the instantiating cgroup."
> > 
> > Can we require a process in the intermediate cgroup to set up the file
> > or object, and use madvise/fadvise to say "charge me", before any
> > instances are launched?
> 
> We need to think how to make this interface convenient to use.
> First, these persistent resources are likely created by some agent software,
> not the main workload. So the requirement to call madvise() from the
> actual cgroup might be not easily achievable.

So one worry that I have for this is that it requires the application itself
to be aware of cgroup topologies and restructure itself so that allocation of
those resources is factored out into something else. Maybe that's not a
huge problem but it may limit its applicability quite a bit.

If we can express all the resource constraints and structures on the cgroup
side, configured by the management agent, the application can simply e.g.
madvise whatever memory region or flag bpf maps as "these are persistent"
and the rest can be handled by the system. If the agent has set up the
environment for that, it gets accounted accordingly; otherwise, it'd behave
as if the tagging didn't exist. Asking the application to set up all its
resources in separate steps might require significant restructuring
and knowledge of how the hierarchy is set up in many cases.
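
A minimal sketch of what such application-side tagging could look like;
MADV_PERSISTENT and BPF_F_PERSISTENT are hypothetical names used purely for
illustration here, not existing kernel flags:

/* Hypothetical: the application only declares intent and never names a
 * cgroup. Where the charge lands is decided by how the agent set up the
 * hierarchy. These flags do not exist in current kernels. */
#include <stddef.h>
#include <sys/mman.h>

#define MADV_PERSISTENT   64            /* hypothetical advice value  */
#define BPF_F_PERSISTENT  (1U << 15)    /* hypothetical map_flags bit */

static void tag_persistent(void *region, size_t len, unsigned int *map_flags)
{
        madvise(region, len, MADV_PERSISTENT);
        *map_flags |= BPF_F_PERSISTENT;
}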

> So _maybe_ something like writing a fd into cgroup.memory.resources.
> 
> Second, it would be really useful to export the current configuration
> to userspace. E.g. a user should be able to query to which cgroup the given
> bpf map "belongs" and which bpf maps belong to the given cgroups. Otherwise
> it will create a problem for userspace programs which manage cgroups
> (e.g. systemd): they should be able to restore the current configuration
> from the kernel state, without "remembering" what has been configured
> before.

This too can be achieved by separating out cgroup setup and tagging specific
resources. Agent and cgroup know what each cgroup is supposed to do, as they
already do now, and each resource is tagged as persistent or not, so
everything is always known without the agent and the application having to
explicitly share the information.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFD RESEND] cgroup: Persistent memory usage tracking
  2022-08-22 11:29             ` Tejun Heo
@ 2022-08-23 11:08               ` Yafang Shao
  -1 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-23 11:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM,
	Yosry Ahmed, Dan Schatzberg, Lennart Poettering

On Mon, Aug 22, 2022 at 7:29 PM Tejun Heo <tj@kernel.org> wrote:
>
> (Sorry, this is a resend. I messed up the header in the first posting.)
>
> Hello,
>
> This thread started on a bpf-specific memory tracking change proposal and
> went south, but a lot of people who would be interested are already cc'd, so
> I'm hijacking it to discuss what to do w/ persistent memory usage tracking.
>
> Cc'ing Mina and Yosry who were involved in the discussions on the similar
> problem re. tmpfs, Dan Schatzberg who has a lot more prod knowledge and
> experience than me, and Lennart for his thoughts from systemd side.
>
> The root problem is that there are resources (almost solely memory
> currently) that outlive a given instance of a, to use systemd-lingo,
> service. Page cache is the most common case.
>
> Let's say there's system.slice/hello.service. When it runs for the first
> time, page cache backing its binary will be charged to hello.service.
> However, when it restarts after e.g. a config change, when the initial
> hello.service cgroup gets destroyed, we reparent the page cache charges to
> the parent system.slice and when the second instance starts, its binary will
> stay charged to system.slice. Over time, some may get reclaimed and
> refaulted into the new hello.service but that's not guaranteed and most hot
> pages likely won't.
>
> The same problem exists for any memory which is not freed synchronously when
> the current instance exits. While this isn't a problem for many cases, it's
> not difficult to imagine situations where the amount of memory which ends up
> getting pushed to the parent is significant, even clear majority, with big
> page cache footprint, persistent tmpfs instances and so on, creating issues
> with accounting accuracy and thus control.
>
> I think there are two broad issues to discuss here:
>
> [1] Can this be solved by layering the instance cgroups under persistent
>     entity cgroup?
>

Hi,

Below is some background on Kubernetes.
In kubernetes, a pod is organized as follows,

               pod
               |- Container
               |- Container

IOW, it is a two-layer unit, or a two-layer instance.
The cgroup dir of the pod is named with a UUID assigned by the kube-apiserver.
Once the old pod is destroyed (which can happen when the user wants to
update their service), the new pod will have a different UUID.
That said, different instances will have different cgroup dirs.

If we want to introduce a persistent entity cgroup, we have to make
it a three-layer unit.

           persistent-entity
           |- pod
                 |- Container
                 |- Container

There will be some issues:
1.  The kube-apiserver must maintain the persistent-entity on each host.
     That needs a major refactor, and compatibility is also a problem,
per my discussion with Kubernetes experts.
2.  How to do the monitoring?
     If there's only one pod under this persistent-entity, we can
easily get the memory size of shared resources by:
         Sizeof(shared-resources) = Sizeof(persistent-entity) - Sizeof(pod)
    But what if it has N pods and N is dynamically changed?
3.  What if it has more than one shared resource?
     For example, pod-foo has two shared resources A and B, pod-bar
has two shared resources A and C, and another pod has two shared
resources B and C.
     How to deploy them?
     Please note that we can introduce a multiple-layer persistent-entity,
but which one should be the parent?

So from my perspective, it is almost impossible.

> So, instead of systemd.slice/hello.service, the application runs inside
> something like systemd.slice/hello.service/hello.service.instance and the
> service-level cgroup hello.service is not destroyed as long as it is
> something worth tracking on the system.
>
> The benefits are
>
> a. While requiring changing how userland organizes cgroup hierarchy, it is a
>    straight-forward extension of the current architecture and doesn't
>    require any conceptual or structural changes. All the accounting and
>    control schemes work exactly the same as before. The only difference is
>    that we now have a persistent entity representing each service as we want
>    to track their persistent resource usages.
>
> b. Per-instance tracking and control is optional. To me, it seems that the
>    persistent resource usages would be more meaningful than per-instance and
>    tracking down to the persistent usages shouldn't add noticeable runtime
>    overheads while keeping per-instance process management niceties and
>    allowing use cases to opt-in for per-instance resource tracking and
>    control as needed.
>
> The complications are:
>
> a. It requires changing cgroup hierarchy in a very visible way.
>
> b. What should be the lifetime rules for persistent cgroups? Do we keep them
>    around forever or maybe they can be created on the first use and kept
>    around until the service is removed from the system? When the persistent
>    cgroup is removed, do we need to make sure that the remaining resource
>    usages are low enough? Note that this problem exists for any approach
>    that tries to track persistent usages no matter how it's done.
>
> c. Do we need to worry about nesting overhead? Given that there's no reason
>    to enable controllers w/o persistent state for the instance level and the
>    nesting overhead is pretty low for memcg, this doesn't seem like a
>    problem to me. If this becomes a problem, we just need to fix it.
>
> A couple alternatives discussed are:
>
> a. Userspace keeps reusing the same cgroup for different instances of the
>    same service. This simplifies some aspects while making others more
>    complicated. e.g. Determining the current instance's CPU or IO usages now
>    require the monitoring software remembering what they were when this
>    instance started and calculating the deltas. Also, if some use cases want
>    to distinguish persistent vs. instance usages (more on this later), this
>    isn't gonna work. That said, this definitely is attractive in that it
>    minimizes overt user-visible changes.
>
> b. Memory is disassociated rather than just reparented on cgroup destruction
>    and get re-charged to the next first user. This is attractive in that it
>    doesn't require any userspace changes; however, I'm not sure how this
>    would work for non-pageable memory usages such as bpf maps. How would we
>    detect the next first usage?
>

JFYI, there is a reuse path for the bpf map; see my previous RFC[1].
[1] https://lore.kernel.org/bpf/20220619155032.32515-1-laoar.shao@gmail.com/

>
> [2] Whether and how to solve first and second+ instance charge differences.
>
> If we take the layering approach, the first instance will get charged for
> all memory that it uses while the second+ instances likely won't get charged
> for a lot of persistent usages. I don't think there is a consensus on
> whether this needs to be solved and I don't have enough context to form a
> strong opinion. memcg folks are a lot better equipped to make this decision.
>

Just sharing our practice.
For many of our users, it means the memcg is unreliable (at least for
observability) when the memory usage is inconsistent.
So they prefer to drop the page cache (by writing to memory.force_empty)
when the container (pod) is destroyed, at the cost of taking time to
reload these page caches next time. Reliability is more important
than performance.

> Assuming this needs to be solved, here's a braindump to be taken with a big
> pinch of salt:
>
> I have a bit of difficult time imagining a perfect solution given that
> whether a given page cache page is persistent or not would be really
> difficult to know (or maybe all page cache is persistent by default while
> anon is not). However, the problem still seems worthwhile to consider for
> big ticket items such as persistent tmpfs mounts and huge bpf maps as they
> can easily make the differences really big.
>
> If we want to solve this problem, here are options that I can think of:
>
> a. Let userspace put the charges where they belong using the current
>    mechanisms. ie. Create persistent entities in the persistent parent
>    cgroup while there's no current instance.
>
>    Pro: It won't require any major kernel or interface changes. There still
>    need to be some tweaking such as allowing tmpfs pages to be always
>    charged to the cgroup which created the instance (maybe as long as it's
>    an ancestor of the faulting cgroup?) but nothing too invasive.
>
>    Con: It may not be flexible enough.
>
> b. Let userspace specify which cgroup to charge for some of constructs like
>    tmpfs and bpf maps. The key problems with this approach are
>
>    1. How to grant/deny what can be charged where. We must ensure that a
>       descendant can't move charges up or across the tree without the
>       ancestors allowing it.
>

We can add restrictions to check which memcg can be selected
(regarding the selectable memcg).
But I think it may be too early to add those restrictions, as only a
privileged user can set it.
It is the sysadmin's responsibility to select a proper memcg.
That said, the selectable memcg is not going south.

>    2. How to specify the cgroup to charge. While specifying the target
>       cgroup directly might seem like an obvious solution, it has a couple
>       rather serious problems. First, if the descendant is inside a cgroup
>       namespace, it might not be able to see the target cgroup at all.

It is not a problem. Just sharing our practice below.
$ docker run -tid --privileged \
      --mount type=bind,source=/sys/fs/bpf,target=/sys/fs/bpf \
      --mount type=bind,source=/sys/fs/cgroup/memory/bpf,target=/bpf-memcg \
      docker-image

The bind-mount can make it work.

>  Second,
>       it's an interface which is likely to cause misunderstandings on how it
>       can be used. It's too broad an interface.
>

As I said above, we just need some restrictions or guidance if that is
desired now.

>    One solution that I can think of is leveraging the resource domain
>    concept which is currently only used for threaded cgroups. All memory
>    usages of threaded cgroups are charged to their resource domain cgroup
>    which hosts the processes for those threads. The persistent usages have a
>    similar pattern, so maybe the service level cgroup can declare that it's
>    the encompassing resource domain and the instance cgroup can say whether
>    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
>    resource domain.
>
>    This has the benefit that the user only needs to indicate its intention
>    without worrying about how cgroups are composed and what their IDs are.
>    It just indicates whether the given resource is persistent and if the
>    cgroup hierarchy is set up for that, it gets charged that way and if not
>    it can be just charged to itself.
>
>    This is a shower thought but, if we allow nesting such domains (and maybe
>    name them), we can use it for shared resources too so that co-services
>    are put inside a shared slice and shared resources are pushed to the
>    slice level.
>
> This became pretty long. I obviously have a pretty strong bias towards
> solving this within the current basic architecture but other than that most
> of these decisions are best made by memcg folks. We can hopefully build some
> consensus on the issue.
>

JFYI.
I apologize in advance if my words offend you again.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFD RESEND] cgroup: Persistent memory usage tracking
@ 2022-08-23 17:12                 ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2022-08-23 17:12 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM,
	Yosry Ahmed, Dan Schatzberg, Lennart Poettering

Hello,

On Tue, Aug 23, 2022 at 07:08:17PM +0800, Yafang Shao wrote:
> On Mon, Aug 22, 2022 at 7:29 PM Tejun Heo <tj@kernel.org> wrote:
> > [1] Can this be solved by layering the instance cgroups under persistent
> >     entity cgroup?
>
> Below is some background of kubernetes.
> In kubernetes, a pod is organized as follows,
> 
>                pod
>                |- Container
>                |- Container
> 
> IOW, it is a two-layer unit, or a two-layer instance.
> The cgroup dir of the pod is named with a UUID assigned by kubernetes-apiserver.
> Once the old pod is destroyed (that can happen when the user wants to
> update their service), the new pod will have a different UUID.
> That said, different instances will have different cgroup dir.
> 
> If we want to introduce a  persistent entity cgroup, we have to make
> it a three-layer unit.
> 
>            persistent-entity
>            |- pod
>                  |- Container
>                  |- Container
> 
> There will be some issues,
> 1.  The kuber-apiserver must maintain the persistent-entity on each host.
>      It needs a great refactor and the compatibility is also a problem
> per my discussion with kubernetes experts.

This is gonna be true for anybody. The basic strategy here should be
defining a clear boundary between the system agent and applications so that
individual applications, even when they build their own subhierarchy
internally, aren't affected by system-level hierarchy configuration changes.
systemd is already like that with a clear delegation boundary. Our (fb)
container management is like that too. So, while this requires some
reorganization from the agent side, things like this don't create huge
backward compatibility issues involving applications. I have no idea about
k8s, so the situation may differ but in the long term at least, it'd be a
good idea to build in similar conceptual separation even if it stays with
cgroup1.

> 2.  How to do the monitor?
>      If there's only one pod under this persistent-entity, we can
> easily get the memory size of  shared resources by:
>          Sizeof(shared-resources) = Sizeof(persistent-entity) - Sizeof(pod)
>     But what if it has N pods and N is dynamically changed ?

There should only be one live pod instance inside the pod's persistent
cgroup, so the calculation doesn't really change.

> 3.  What if it has more than one shared resource?
>      For example, pod-foo has two shared resources A and B, pod-bar
> has two shared resources A and C, and another pod has two shared
> resources B and C.
>      How to deploy them?
>      Pls, note that we can introduce multiple-layer persistent-entity,
> but which one should be the parent ?
> 
> So from my perspective, it is almost impossible.

Yeah, this is a different problem and cgroup has never been good at tracking
resources shared across multiple groups. There is always tension between
overhead, complexity and accuracy and the tradeoff re. resource sharing has
almost always been towards the former two.

Naming specific resources and designating them as being shared and
accounting them commonly somehow does make sense as an approach as we get to
avoid the biggest headaches (e.g. how to split a page cache page?) and maybe
it can even be claimed that a lot of use cases which may want cross-group
sharing can be sufficiently served by such an approach.

That said, it still has to be balanced against other factors. For example,
memory pressure caused by the shared resources should affect all
participants in the sharing group in terms of both memory and IO, which
means that they'll have to stay within a nested subtree structure. This does
restrict the overlapping partial sharing that you described above, but the goal
is finding a reasonable tradeoff, so something has to give.

> > b. Memory is disassociated rather than just reparented on cgroup destruction
> >    and get re-charged to the next first user. This is attractive in that it
> >    doesn't require any userspace changes; however, I'm not sure how this
> >    would work for non-pageable memory usages such as bpf maps. How would we
> >    detect the next first usage?
> >
> 
> JFYI, There is a reuse path for the bpf map, see my previous RFC[1].
> [1] https://lore.kernel.org/bpf/20220619155032.32515-1-laoar.shao@gmail.com/

I'm not a big fan of explicit recharging. It's too hairy to use, requiring
mixing system-level hierarchy configuration knowledge with in-application
resource handling. There should be clear isolation between the two. This is
also what leads to namespace and visibility issues. Ideally, these should be
handled by structuring the resource hierarchy correctly from the get-go.

...
> > b. Let userspace specify which cgroup to charge for some of constructs like
> >    tmpfs and bpf maps. The key problems with this approach are
> >
> >    1. How to grant/deny what can be charged where. We must ensure that a
> >       descendant can't move charges up or across the tree without the
> >       ancestors allowing it.
> >
> 
> We can add restrictions to check which memcg can be selected
> (regarding the selectable memcg).
> But I think it may be too early to do the restrictions, as only the
> privileged user can set it.
> > It is the sys admin's responsibility to select a proper memcg.
> That said, the selectable memcg is not going south.

I generally tend towards shaping interfaces more carefully. We can of course
add a do-whatever-you-want interface and declare that it's for root only but
even that becomes complicated with things like userns as we're finding out
in different areas, but again, nothing is free and approaches like that
often bring more longterm headaches than the problems they solve.

> >    2. How to specify the cgroup to charge. While specifying the target
> >       cgroup directly might seem like an obvious solution, it has a couple
> >       rather serious problems. First, if the descendant is inside a cgroup
> >       namespace, it might not be able to see the target cgroup at all.
> 
> It is not a problem. Just sharing our practice below.
> $ docker run -tid --privileged \
>       --mount type=bind,source=/sys/fs/bpf,target=/sys/fs/bpf \
>       --mount type=bind,source=/sys/fs/cgroup/memory/bpf,target=/bpf-memcg \
>       docker-image
> 
> The bind-mount can make it work.

Not the mount namespace; the cgroup namespace, which is used to isolate the
cgroup subtree that's visible to each container. Please take a look at
cgroup_namespaces(7).
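
A minimal sketch of that visibility issue, assuming a privileged test program
(unshare(2) with CLONE_NEWCGROUP is the real interface; the program itself is
only illustrative):

/* After entering a new cgroup namespace, the process's view of the cgroup
 * tree is rooted at its own cgroup: /proc/self/cgroup reports "0::/" on
 * cgroup v2, and ancestors such as a shared bpf memcg are no longer
 * addressable by absolute path from inside. Needs CAP_SYS_ADMIN. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        if (unshare(CLONE_NEWCGROUP)) {
                perror("unshare(CLONE_NEWCGROUP)");
                return EXIT_FAILURE;
        }
        system("cat /proc/self/cgroup");   /* prints "0::/" on cgroup v2 */
        return 0;
}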

> >  Second,
> >       it's an interface which is likely to cause misunderstandings on how it
> >       can be used. It's too broad an interface.
> >
> 
> As I said above, we just need some restrictions or guidance if that is
> desired now.

I'm really unlikely to go down that path because as I said before I believe
that that's a road to long term headaches and already creates immediate
problems in terms of how it'd interact with other resources. No matter what
approach we may choose, it has to work with the existing resource hierarchy.
Otherwise, we end up in a situation where we have to split accounting and
control of other controllers too which makes little sense for non-memory
controllers (not that it's great for memory in the first place).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFD RESEND] cgroup: Persistent memory usage tracking
@ 2022-08-24 11:57                   ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2022-08-24 11:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM,
	Yosry Ahmed, Dan Schatzberg, Lennart Poettering

On Wed, Aug 24, 2022 at 1:12 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Aug 23, 2022 at 07:08:17PM +0800, Yafang Shao wrote:
> > On Mon, Aug 22, 2022 at 7:29 PM Tejun Heo <tj@kernel.org> wrote:
> > > [1] Can this be solved by layering the instance cgroups under persistent
> > >     entity cgroup?
> >
> > Below is some background of kubernetes.
> > In kubernetes, a pod is organized as follows,
> >
> >                pod
> >                |- Container
> >                |- Container
> >
> > IOW, it is a two-layer unit, or a two-layer instance.
> > The cgroup dir of the pod is named with a UUID assigned by kubernetes-apiserver.
> > Once the old pod is destroyed (that can happen when the user wants to
> > update their service), the new pod will have a different UUID.
> > That said, different instances will have different cgroup dir.
> >
> > If we want to introduce a  persistent entity cgroup, we have to make
> > it a three-layer unit.
> >
> >            persistent-entity
> >            |- pod
> >                  |- Container
> >                  |- Container
> >
> > There will be some issues,
> > 1.  The kuber-apiserver must maintain the persistent-entity on each host.
> >      It needs a great refactor and the compatibility is also a problem
> > per my discussion with kubernetes experts.
>
> This is gonna be true for anybody. The basic strategy here should be
> defining a clear boundary between system agent and applications so that
> individual applications, even when they build their own subhierarchy
> internally, aren't affected by system level hierarchy configuration changes.
> systemd is already like that with clear delegation boundary. Our (fb)
> container management is like that too. So, while this requires some
> reorganization from the agent side, things like this don't create huge
> backward compatbility issues involving applications. I have no idea about
> k8s, so the situation may differ but in the long term at least, it'd be a
> good idea to build in similar conceptual separation even if it stays with
> cgroup1.
>
> > 2.  How to do the monitor?
> >      If there's only one pod under this persistent-entity, we can
> > easily get the memory size of  shared resources by:
> >          Sizeof(shared-resources) = Sizeof(persistent-entity) - Sizeof(pod)
> >     But what if it has N pods and N is dynamically changed ?
>
> There should only be one live pod instance inside the pod's persistent
> cgroup, so the calculation doesn't really change.
>
> > 3.  What if it has more than one shared resource?
> >      For example, pod-foo has two shared resources A and B, pod-bar
> > has two shared resources A and C, and another pod has two shared
> > resources B and C.
> >      How to deploy them?
> >      Pls, note that we can introduce multiple-layer persistent-entity,
> > but which one should be the parent ?
> >
> > So from my perspective, it is almost impossible.
>
> Yeah, this is a different problem and cgroup has never been good at tracking
> resources shared across multiple groups. There is always tension between
> overhead, complexity and accuracy and the tradeoff re. resource sharing has
> almost always been towards the former two.
>
> Naming specific resources and designating them as being shared and
> accounting them commonly somehow does make sense as an approach as we get to
> avoid the biggest headaches (e.g. how to split a page cache page?) and maybe
> it can even be claimed that a lot of use cases which may want cross-group
> sharing can be sufficiently served by such approach.
>
> That said, it still has to be balanced against other factors. For example,
> memory pressure caused by the shared resources should affect all
> participants in the sharing group in terms of both memory and IO, which
> means that they'll have to stay within a nested subtree structure. This does
> restrict overlapping partial sharing that you described above but the goal
> is finding a reasonable tradeoff, so something has to give.
>
> > > b. Memory is disassociated rather than just reparented on cgroup destruction
> > >    and get re-charged to the next first user. This is attractive in that it
> > >    doesn't require any userspace changes; however, I'm not sure how this
> > >    would work for non-pageable memory usages such as bpf maps. How would we
> > >    detect the next first usage?
> > >
> >
> > JFYI, There is a reuse path for the bpf map, see my previous RFC[1].
> > [1] https://lore.kernel.org/bpf/20220619155032.32515-1-laoar.shao@gmail.com/
>
> I'm not a big fan of explicit recharging. It's too hairy to use requiring
> mixing system level hierarchy configuration knoweldge with in-application
> resource handling. There should be clear isolation between the two. This is
> also what leads to namespace and visibility issues. Ideally, these should be
> handled by structuring the resource hierarchay correctly from the get-go.
>
> ...
> > > b. Let userspace specify which cgroup to charge for some of constructs like
> > >    tmpfs and bpf maps. The key problems with this approach are
> > >
> > >    1. How to grant/deny what can be charged where. We must ensure that a
> > >       descendant can't move charges up or across the tree without the
> > >       ancestors allowing it.
> > >
> >
> > We can add restrictions to check which memcg can be selected
> > (regarding the selectable memcg).
> > But I think it may be too early to do the restrictions, as only the
> > privileged user can set it.
> > > It is the sys admin's responsibility to select a proper memcg.
> > That said, the selectable memcg is not going south.
>
> I generally tend towards shaping interfaces more carefully. We can of course
> add a do-whatever-you-want interface and declare that it's for root only but
> even that becomes complicated with things like userns as we're finding out
> in different areas, but again, nothing is free and approaches like that
> often bring more longterm headaches than the problems they solve.
>
> > >    2. How to specify the cgroup to charge. While specifying the target
> > >       cgroup directly might seem like an obvious solution, it has a couple
> > >       rather serious problems. First, if the descendant is inside a cgroup
> > >       namespace, it might not be able to see the target cgroup at all.
> >
> > It is not a problem. Just sharing our practice below.
> > $ docker run -tid --privileged \
> >       --mount type=bind,source=/sys/fs/bpf,target=/sys/fs/bpf \
> >       --mount type=bind,source=/sys/fs/cgroup/memory/bpf,target=/bpf-memcg \
> >       docker-image
> >
> > The bind-mount can make it work.
>
> Not mount namespace. cgroup namespace which is used to isolate the cgroup
> subtree that's visible to each container. Please take a look at
> cgroup_namespaces(7).
>

IIUC, we can get the target cgroup with a relative path if cgroup
namespace is enabled, for example ../bpf, right?
(If not, then I think we can extend it.)

> > >  Second,
> > >       it's an interface which is likely to cause misunderstandings on how it
> > >       can be used. It's too broad an interface.
> > >
> >
> > As I said above, we just need some restrictions or guidance if that is
> > desired now.
>
> I'm really unlikely to go down that path because as I said before I believe
> that that's a road to long term headaches and already creates immediate
> problems in terms of how it'd interact with other resources. No matter what
> approach we may choose, it has to work with the existing resource hierarchy.

Unfortunately, the bpf map has already broken and is still breaking
the existing resource hierarchy.
What I'm trying to do is improve or even fix this breakage.

The reason I say it is breaking the existing resource hierarchy is
that the bpf map is improperly treated as a process while it is really
a shared resource.

A bpf map can be written by processes running in other memcgs, but the
memory allocated by those writes won't be charged to the
writer's memcg; it will be charged to the bpf map's memcg.

That's why I try to convey to you that what I'm trying to fix is a bpf
specific issue.
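
For reference, this is roughly how that charging happens today; a simplified
sketch modeled on bpf_map_kmalloc_node() in kernel/bpf/syscall.c (names and
details abbreviated from the 5.19-era code, so treat it as an approximation):

/* The memcg recorded at map creation time is made the active memcg around
 * the allocation, so no matter which task later updates the map, the new
 * element is charged to the map's original memcg, not to the updater's. */
void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size,
                           gfp_t flags, int node)
{
        struct mem_cgroup *old_memcg;
        void *ptr;

        old_memcg = set_active_memcg(map->memcg);
        ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
        set_active_memcg(old_memcg);

        return ptr;
}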

> Otherwise, we end up in a situation where we have to split accounting and
> control of other controllers too which makes little sense for non-memory
> controllers (not that it's great for memory in the first place).
>

If the resource hierarchy is what concerns you, then I think it can be
addressed with the additional change below,

if (cgroup_is_not_ancestor(cgrp))
    return -EINVAL;
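
A slightly fuller sketch of that check using the existing
cgroup_is_descendant() helper (the call site and function name here are
hypothetical, just like cgroup_is_not_ancestor() above):

/* Only accept a selected cgroup that is an ancestor of the caller's own
 * cgroup on the default hierarchy, so charges can be moved up the caller's
 * own path but never across the tree. */
static int check_selected_cgroup(struct cgroup *selected_cgrp)
{
        if (!cgroup_is_descendant(task_dfl_cgroup(current), selected_cgrp))
                return -EINVAL;

        return 0;
}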

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFD RESEND] cgroup: Persistent memory usage tracking
  2022-08-23  3:14                     ` Tejun Heo
@ 2022-08-24 19:02                       ` Mina Almasry
  -1 siblings, 0 replies; 61+ messages in thread
From: Mina Almasry @ 2022-08-24 19:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Roman Gushchin, Johannes Weiner, Yafang Shao, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin Lau, Song Liu,
	Yonghong Song, john fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, jolsa, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM,
	Yosry Ahmed, Dan Schatzberg, Lennart Poettering

On Mon, Aug 22, 2022 at 8:14 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Aug 22, 2022 at 08:01:41PM -0700, Roman Gushchin wrote:
> > > > >    One solution that I can think of is leveraging the resource domain
> > > > >    concept which is currently only used for threaded cgroups. All memory
> > > > >    usages of threaded cgroups are charged to their resource domain cgroup
> > > > >    which hosts the processes for those threads. The persistent usages have a
> > > > >    similar pattern, so maybe the service level cgroup can declare that it's
> > > > >    the encompassing resource domain and the instance cgroup can say whether
> > > > >    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > > > >    resource domain.
> > > > >
> > > >
> > > > I think this sounds excellent and addresses our use cases. Basically
> > > > the tmpfs/bpf memory would get charged to the encompassing resource
> > > > domain cgroup rather than the instance cgroup, making the memory usage
> > > > of the first and second+ instances consistent and predictable.
> > > >
> > > > Would love to hear from other memcg folks what they would think of
> > > > such an approach. I would also love to hear what kind of interface you
> > > > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > > > charge the tmpfs/bpf instance to itself or to the encompassing
> > > > resource domain?
> > >
> > > I like this too. It makes shared charging predictable, with a coherent
> > > resource hierarchy (congruent OOM, CPU, IO domains), and without the
> > > need for cgroup paths in tmpfs mounts or similar.
> > >
> > > As far as who is declaring what goes, though: if the instance groups
> > > can declare arbitrary files/objects persistent or shared, they'd be
> > > able to abuse this and sneak private memory past local limits and
> > > burden the wider persistent/shared domain with it.
>
> My thought was that the persistent cgroup and instance cgroups should belong
> to the same trust domain and system level control should be applied at the
> resource domain level. The application may decide to shift between
> persistent and per-instance however it wants to and may even configure
> resource control at that level but all that's for its own accounting
> accuracy and benefit.
>
> > > I'm thinking it might make more sense for the service level to declare
> > > which objects are persistent and shared across instances.
> >
> > I like this idea.
> >
> > > If that's the case, we may not need a two-component interface. Just
> > > the ability for an intermediate cgroup to say: "This object's future
> > > memory is to be charged to me, not the instantiating cgroup."
> > >
> > > Can we require a process in the intermediate cgroup to set up the file
> > > or object, and use madvise/fadvise to say "charge me", before any
> > > instances are launched?
> >
> > We need to think how to make this interface convenient to use.
> > First, these persistent resources are likely created by some agent software,
> > not the main workload. So the requirement to call madvise() from the
> > actual cgroup might be not easily achievable.
>
> So one worry that I have for this is that it requires the application itself
> to be aware of cgroup topologies and restructure itself so that allocation of
> those resources are factored out into something else. Maybe that's not a
> huge problem but it may limit its applicability quite a bit.
>

I agree with this point 100%. The interfaces being discussed here
require existing applications to restructure themselves, which I don't
imagine will be very useful for us at least.

> If we can express all the resource constraints and structures in the cgroup
> side and configured by the management agent, the application can simply e.g.
> madvise whatever memory region or flag bpf maps as "these are persistent"
> and the rest can be handled by the system. If the agent set up the
> environment for that, it gets accounted accordingly; otherwise, it'd behave
> as if those tagging didn't exist. Asking the application to set up all its
> resources in separate steps, that might require significant restructuring
> and knowledge of how the hierarchy is setup in many cases.
>

I don't know if this level of granularity is needed with a madvise()
or the like. The kernel knows whether a resource is persistent from the
nature of the resource. For example, a shared tmpfs file is persistent
and is not cleaned up after the process using it dies, but private
memory is. madvise(PERSISTENT) on private memory would not make sense,
and I don't think madvise(NOT_PERSISTENT) on a tmpfs-backed memory
region would make sense either. Also, this requires adding madvise()
hints to userspace code to get any benefit.

> > So _maybe_ something like writing a fd into cgroup.memory.resources.
> >

Sorry, I don't see this being useful - to us at least - if it is an
fd-based interface. It needs to support marking entire tmpfs mounts as
persistent. The reasoning is as Tejun alludes to: our container
management agent generally sets up the container hierarchy for a job
and also sets up the filesystem mounts the job requests. This is
generally because the job doesn't run as root and doesn't bother with
mount namespaces. Thus, our jobs are well-trained in receiving mounts
set up for them by our container management agent. Jobs are _not_
well-trained in receiving an fd from the container management agent,
and restructuring our jobs/services for such an interface will not be
feasible, I think.

This applies to us but I imagine it is common practice for the
container management agent to set up mounts for the jobs to use,
rather than provide the job with an fd or collection of fds.

> > Second, it would be really useful to export the current configuration
> > to userspace. E.g. a user should be able to query to which cgroup the given
> > bpf map "belongs" and which bpf maps belong to the given cgroups. Otherwise
> > it will create a problem for userspace programs which manage cgroups
> > (e.g. systemd): they should be able to restore the current configuration
> > from the kernel state, without "remembering" what has been configured
> > before.
>
> This too can be achieved by separating out cgroup setup and tagging specific
> resources. Agent and cgroup know what each cgroup is supposed to do as they
> already do now and each resource is tagged whether they're persistent or
> not, so everything is always known without the agent and the application
> having to explicitly share the information.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFD RESEND] cgroup: Persistent memory usage tracking
  2022-08-24 19:02                       ` Mina Almasry
@ 2022-08-25 17:59                         ` Tejun Heo
  -1 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2022-08-25 17:59 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Roman Gushchin, Johannes Weiner, Yafang Shao, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin Lau, Song Liu,
	Yonghong Song, john fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, jolsa, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM,
	Yosry Ahmed, Dan Schatzberg, Lennart Poettering

Hello,

On Wed, Aug 24, 2022 at 12:02:04PM -0700, Mina Almasry wrote:
> > If we can express all the resource constraints and structures in the cgroup
> > side and configured by the management agent, the application can simply e.g.
> > madvise whatever memory region or flag bpf maps as "these are persistent"
> > and the rest can be handled by the system. If the agent set up the
> > environment for that, it gets accounted accordingly; otherwise, it'd behave
> > as if those tagging didn't exist. Asking the application to set up all its
> > resources in separate steps, that might require significant restructuring
> > and knowledge of how the hierarchy is setup in many cases.
> 
> I don't know if this level of granularity is needed with a madvise()
> or such. The kernel knows whether resources are persistent due to the
> nature of the resource. For example a shared tmpfs file is a resource
> that is persistent and not cleaned up after the process using it dies,
> but private memory is. madvise(PERSISTENT) on private memory would not
> make sense, and I don't think madvise(NOT_PERSISTENT) on tmpfs-backed
> memory region would make sense. Also, this requires adding madvise()
> hints in userspace code to leverage this.

I haven't thought hard about what the hinting interface should be like. The
default assumptions would be that page cache belongs to the persistent
domain and anon belongs to the instance (mm folks, please correct me if I'm
off the rails here), but I can imagine situations where that doesn't
necessarily hold - like temp files which get unlinked on instance shutdown.

In terms of hint granularity, more coarse-grained (e.g. file, mount,
whatever) seems to make sense, but again I haven't thought too hard about it.
That said, as long as the default behavior is reasonable, I think adding
some hinting calls in the application is acceptable. It doesn't require any
structural changes and the additions would be for its own benefit of more
accurate accounting and control. That makes sense to me.

One unfortunate effect this will have is that we'll be actively putting
resources into intermediate cgroups. This already happens today but if we
support persistent domains, it's gonna be a lot more prevalent and we'll
need to update e.g. iocost to support IOs coming out of intermediate
cgroups. This kinda sucks because we don't even have knobs to control self
vs. children distributions. Oh well...

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2022-08-25 17:59 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-18 14:31 [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 01/12] cgroup: Update the comment on cgroup_get_from_fd Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 19:11   ` Yosry Ahmed
2022-08-18 19:11     ` Yosry Ahmed
2022-08-18 14:31 ` [PATCH bpf-next v2 02/12] bpf: Introduce new helper bpf_map_put_memcg() Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 03/12] bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 04/12] bpf: Call bpf_map_init_from_attr() immediately after map creation Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 05/12] bpf: Save memcg in bpf_map_init_from_attr() Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 06/12] bpf: Use scoped-based charge in bpf_map_area_alloc Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 07/12] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free} Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 17:30   ` Andrii Nakryiko
2022-08-18 14:31 ` [PATCH bpf-next v2 08/12] bpf: Use bpf_map_kzalloc in arraymap Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 09/12] bpf: Use bpf_map_kvcalloc in bpf_local_storage Yafang Shao
2022-08-18 14:31   ` Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 10/12] mm, memcg: Add new helper get_obj_cgroup_from_cgroup Yafang Shao
2022-08-18 20:38   ` Shakeel Butt
2022-08-19  1:21     ` Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 11/12] bpf: Add return value for bpf_map_init_from_attr Yafang Shao
2022-08-18 14:31 ` [PATCH bpf-next v2 12/12] bpf: Introduce selectable memcg for bpf map Yafang Shao
2022-08-18 22:20 ` [PATCH bpf-next v2 00/12] " Tejun Heo
2022-08-18 22:20   ` Tejun Heo
2022-08-18 22:33   ` Tejun Heo
2022-08-19  1:09     ` Yafang Shao
2022-08-19  1:09       ` Yafang Shao
2022-08-19 17:06       ` Tejun Heo
2022-08-19 17:06         ` Tejun Heo
2022-08-20  2:25         ` Yafang Shao
2022-08-20  2:25           ` Yafang Shao
2022-08-22 11:29           ` [RFD RESEND] cgroup: Persistent memory usage tracking Tejun Heo
2022-08-22 11:29             ` Tejun Heo
2022-08-22 16:12             ` Shakeel Butt
2022-08-22 16:12               ` Shakeel Butt
2022-08-22 19:02             ` Mina Almasry
2022-08-22 19:02               ` Mina Almasry
2022-08-22 21:19               ` Johannes Weiner
2022-08-22 21:19                 ` Johannes Weiner
2022-08-22 21:52                 ` Mina Almasry
2022-08-22 21:52                   ` Mina Almasry
2022-08-23  3:01                 ` Roman Gushchin
2022-08-23  3:01                   ` Roman Gushchin
2022-08-23  3:14                   ` Tejun Heo
2022-08-23  3:14                     ` Tejun Heo
2022-08-24 19:02                     ` Mina Almasry
2022-08-24 19:02                       ` Mina Almasry
2022-08-25 17:59                       ` Tejun Heo
2022-08-25 17:59                         ` Tejun Heo
2022-08-23 11:08             ` Yafang Shao
2022-08-23 11:08               ` Yafang Shao
2022-08-23 17:12               ` Tejun Heo
2022-08-23 17:12                 ` Tejun Heo
2022-08-24 11:57                 ` Yafang Shao
2022-08-24 11:57                   ` Yafang Shao
2022-08-19  0:59   ` [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map Yafang Shao
2022-08-19 16:45     ` Tejun Heo
2022-08-19 16:45       ` Tejun Heo
