linux-mm.kvack.org archive mirror
* [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
@ 2022-09-02  2:29 Yafang Shao
  2022-09-02  2:29 ` [PATCH bpf-next v3 01/13] cgroup: Update the comment on cgroup_get_from_fd Yafang Shao
                   ` (13 more replies)
  0 siblings, 14 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:29 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

In our production environment, we may load, run, and pin bpf programs and
maps in containers. For example, some of our networking bpf programs and
maps are loaded and pinned by a process running in a container in our
k8s environment. Other user applications also run in this container;
they watch the networking configurations from remote servers and apply
them on the local host, log error events, monitor the traffic, and do
some other work. Sometimes we need to update these user applications to
a new release, and in this update process we destroy the old container
and then start a new generation. In order not to interrupt the bpf
programs across the update, we pin the bpf programs and maps in bpffs.
That is the background and use case in our production environment.

After switching to memcg-based bpf memory accounting to limit the bpf
memory, some unexpected issues jumped out at us.
1. The memory usage is not consistent between the first generation and
new generations.
2. After the first generation is destroyed, the bpf memory can't be
limited if the bpf maps are not preallocated, because they will be
reparented.

Besides, there's another issue: the bpf-map's memcg breaks the memcg
hierarchy, because a bpf-map has its own memcg. A bpf map can be written
to by tasks running in other memcgs, and once a writer in another memcg
writes to a shared bpf map, the memory allocated by that write won't be
charged to the writer's memcg but to the bpf-map's own memcg instead.
IOW, the bpf-map is improperly treated as a task, while it is actually a
shared resource. This patchset doesn't resolve this issue. I will post
another RFC once I find a workable solution to address it.
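
To make the above concrete, here is a minimal sketch of the current
charging path on the map update side. It mirrors bpf_map_kmalloc_node()
in kernel/bpf/syscall.c (its current form is visible in the context of
patch 02 below); the comments are added here for illustration only:

  /* Because of __GFP_ACCOUNT, the allocation is charged to the active
   * memcg, which is switched to the map's own memcg (map->objcg) here,
   * regardless of which task performs the map update.
   */
  void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size,
                             gfp_t flags, int node)
  {
          struct mem_cgroup *memcg, *old_memcg;
          void *ptr;

          memcg = bpf_map_get_memcg(map);  /* the map's memcg, not the writer's */
          old_memcg = set_active_memcg(memcg);
          ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
          set_active_memcg(old_memcg);
          mem_cgroup_put(memcg);

          return ptr;
  }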

This patchset tries to resolve the above two issues by introducing a
selectable memcg to limit the bpf memory. Currently we only allow
selecting an ancestor of the current memcg, to avoid breaking the memcg
hierarchy any further.
Possible use cases of the selectable memcg are as follows:
- Select the root memcg as the bpf-map's memcg
  Then the bpf-map's memory won't be throttled by the current memcg's limit.
- Put the current memcg under a fixed memcg dir and select the fixed
  memcg as the bpf-map's memcg
  The hierarchy is as follows:

      Parent-memcg (A fixed dir, e.g. /sys/fs/cgroup/memory/bpf)
         \
        Current-memcg (Container dir, e.g. /sys/fs/cgroup/memory/bpf/foo)

  At map creation time, the bpf-map's memory will be charged to the
  parent directly, without being charged to the current memcg, and thus
  the current memcg's usage will be consistent among different
  generations. To limit the bpf-map's memory usage, we can set the limit
  in the parent memcg, as sketched below.
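
As a rough illustration of the second use case, the sketch below shows
how an operator might cap the bpf memory by setting the limit in the
fixed parent memcg. It assumes the cgroup v1 memory controller is
mounted at /sys/fs/cgroup/memory and the parent dir is 'bpf' as in the
hierarchy above; the helper name is ours, for illustration only. The
maps themselves are then created with this parent memcg selected via the
new map-creation attribute (see the last patch and the tools/lib/bpf
changes for the actual interface).

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  /* Write a byte limit (as a decimal string) into the fixed parent
   * memcg, e.g. limit_bpf_parent_memcg("1073741824") for 1 GiB.
   */
  static int limit_bpf_parent_memcg(const char *bytes)
  {
          const char *path = "/sys/fs/cgroup/memory/bpf/memory.limit_in_bytes";
          int ret = -1;
          int fd = open(path, O_WRONLY);

          if (fd < 0)
                  return -1;
          if (write(fd, bytes, strlen(bytes)) == (ssize_t)strlen(bytes))
                  ret = 0;
          close(fd);
          return ret;
  }

  int main(void)
  {
          return limit_bpf_parent_memcg("1073741824") ? 1 : 0;
  }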

Currently only bpf maps are supported; we can extend this to bpf progs
as well.

Observability can also be supported in a next step, for example, showing
the bpf map's memcg via 'bpftool map show', or even showing which maps
are charged to a specific memcg via 'bpftool cgroup show'. Furthermore,
we may also show an accurate memory size of a bpf map instead of an
estimated memory size in 'bpftool map show' in the future.

v2->v3:
- use css_tryget() instead of css_tryget_online() (Shakeel)
- add comment for get_obj_cgroup_from_cgroup() (Shakeel)
- add new memcg helper task_under_memcg_hierarchy()
- add restriction to allow selecting ancestor only to avoid breaking the
  memcg hierarchy further, per discussion with Tejun 

v1->v2:
- cgroup1 is also supported after
  commit f3a2aebdd6fb ("cgroup: enable cgroup_get_from_file() on cgroup1")
  So update the commit log.
- remove incorrect fix to mem_cgroup_put  (Shakeel,Roman,Muchun) 
- use cgroup_put() in bpf_map_save_memcg() (Shakeel)
- add detailed commit log for get_obj_cgroup_from_cgroup (Shakeel) 

RFC->v1:
- get rid of bpf_map container wrapper (Alexei)
- add the new field into the end of struct (Alexei)
- get rid of BPF_F_SELECTABLE_MEMCG (Alexei)
- save memcg in bpf_map_init_from_attr
- introduce bpf_ringbuf_pages_{alloc,free} and keep them inside
  kernel/bpf/ringbuf.c  (Andrii)

Yafang Shao (13):
  cgroup: Update the comment on cgroup_get_from_fd
  bpf: Introduce new helper bpf_map_put_memcg()
  bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM
  bpf: Call bpf_map_init_from_attr() immediately after map creation
  bpf: Save memcg in bpf_map_init_from_attr()
  bpf: Use scope-based charge in bpf_map_area_alloc
  bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
  bpf: Use bpf_map_kzalloc in arraymap
  bpf: Use bpf_map_kvcalloc in bpf_local_storage
  mm, memcg: Add new helper get_obj_cgroup_from_cgroup
  mm, memcg: Add new helper task_under_memcg_hierarchy
  bpf: Add return value for bpf_map_init_from_attr
  bpf: Introduce selectable memcg for bpf map

 include/linux/bpf.h            |  40 +++++++++++-
 include/linux/memcontrol.h     |  25 ++++++++
 include/uapi/linux/bpf.h       |   1 +
 kernel/bpf/arraymap.c          |  34 +++++-----
 kernel/bpf/bloom_filter.c      |  11 +++-
 kernel/bpf/bpf_local_storage.c |  17 +++--
 kernel/bpf/bpf_struct_ops.c    |  19 +++---
 kernel/bpf/cpumap.c            |  17 +++--
 kernel/bpf/devmap.c            |  30 +++++----
 kernel/bpf/hashtab.c           |  26 +++++---
 kernel/bpf/local_storage.c     |  11 +++-
 kernel/bpf/lpm_trie.c          |  12 +++-
 kernel/bpf/offload.c           |  12 ++--
 kernel/bpf/queue_stack_maps.c  |  11 +++-
 kernel/bpf/reuseport_array.c   |  11 +++-
 kernel/bpf/ringbuf.c           | 104 ++++++++++++++++++++----------
 kernel/bpf/stackmap.c          |  13 ++--
 kernel/bpf/syscall.c           | 140 ++++++++++++++++++++++++++++-------------
 kernel/cgroup/cgroup.c         |   2 +-
 mm/memcontrol.c                |  48 ++++++++++++++
 net/core/sock_map.c            |  30 +++++----
 net/xdp/xskmap.c               |  12 +++-
 tools/include/uapi/linux/bpf.h |   1 +
 tools/lib/bpf/bpf.c            |   3 +-
 tools/lib/bpf/bpf.h            |   3 +-
 tools/lib/bpf/gen_loader.c     |   2 +-
 tools/lib/bpf/libbpf.c         |   2 +
 tools/lib/bpf/skel_internal.h  |   2 +-
 28 files changed, 462 insertions(+), 177 deletions(-)

-- 
1.8.3.1




* [PATCH bpf-next v3 01/13] cgroup: Update the comment on cgroup_get_from_fd
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
@ 2022-09-02  2:29 ` Yafang Shao
  2022-09-02  2:29 ` [PATCH bpf-next v3 02/13] bpf: Introduce new helper bpf_map_put_memcg() Yafang Shao
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:29 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao, Yosry Ahmed

After commit f3a2aebdd6fb ("cgroup: enable cgroup_get_from_file() on cgroup1"),
the fd may be obtained by opening a cgroup1 directory as well. So let's
update the comment.

Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 5f4502a..b7d2e55 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6625,7 +6625,7 @@ struct cgroup *cgroup_get_from_path(const char *path)
 
 /**
  * cgroup_get_from_fd - get a cgroup pointer from a fd
- * @fd: fd obtained by open(cgroup2_dir)
+ * @fd: fd obtained by open(cgroup_dir)
  *
  * Find the cgroup from a fd which should be obtained
  * by opening a cgroup directory.  Returns a pointer to the
-- 
1.8.3.1




* [PATCH bpf-next v3 02/13] bpf: Introduce new helper bpf_map_put_memcg()
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
  2022-09-02  2:29 ` [PATCH bpf-next v3 01/13] cgroup: Update the comment on cgroup_get_from_fd Yafang Shao
@ 2022-09-02  2:29 ` Yafang Shao
  2022-09-02  2:29 ` [PATCH bpf-next v3 03/13] bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM Yafang Shao
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:29 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Replace the open-coded mem_cgroup_put() with a new helper,
bpf_map_put_memcg(), which pairs with bpf_map_get_memcg() and makes the
code clearer.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/syscall.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4e9d462..7ce024c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -441,6 +441,11 @@ static struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
 	return root_mem_cgroup;
 }
 
+static void bpf_map_put_memcg(struct mem_cgroup *memcg)
+{
+	mem_cgroup_put(memcg);
+}
+
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node)
 {
@@ -451,7 +456,7 @@ void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 	old_memcg = set_active_memcg(memcg);
 	ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
 	set_active_memcg(old_memcg);
-	mem_cgroup_put(memcg);
+	bpf_map_put_memcg(memcg);
 
 	return ptr;
 }
@@ -465,7 +470,7 @@ void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 	old_memcg = set_active_memcg(memcg);
 	ptr = kzalloc(size, flags | __GFP_ACCOUNT);
 	set_active_memcg(old_memcg);
-	mem_cgroup_put(memcg);
+	bpf_map_put_memcg(memcg);
 
 	return ptr;
 }
@@ -480,7 +485,7 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 	old_memcg = set_active_memcg(memcg);
 	ptr = __alloc_percpu_gfp(size, align, flags | __GFP_ACCOUNT);
 	set_active_memcg(old_memcg);
-	mem_cgroup_put(memcg);
+	bpf_map_put_memcg(memcg);
 
 	return ptr;
 }
-- 
1.8.3.1




* [PATCH bpf-next v3 03/13] bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
  2022-09-02  2:29 ` [PATCH bpf-next v3 01/13] cgroup: Update the comment on cgroup_get_from_fd Yafang Shao
  2022-09-02  2:29 ` [PATCH bpf-next v3 02/13] bpf: Introduce new helper bpf_map_put_memcg() Yafang Shao
@ 2022-09-02  2:29 ` Yafang Shao
  2022-09-02  2:29 ` [PATCH bpf-next v3 04/13] bpf: Call bpf_map_init_from_attr() immediately after map creation Yafang Shao
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:29 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Define bpf_map_{get,put}_memcg() so that these helpers can also be used
when CONFIG_MEMCG_KMEM (or CONFIG_MEMCG) is not set. Also move them into
include/linux/bpf.h, so that they can be used in other source files.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h        | 26 ++++++++++++++++++++++++++
 include/linux/memcontrol.h | 10 ++++++++++
 kernel/bpf/syscall.c       | 13 -------------
 3 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9c16749..a50d29c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -27,6 +27,7 @@
 #include <linux/bpfptr.h>
 #include <linux/btf.h>
 #include <linux/rcupdate_trace.h>
+#include <linux/memcontrol.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -2594,4 +2595,29 @@ static inline void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype) {}
 static inline void bpf_cgroup_atype_put(int cgroup_atype) {}
 #endif /* CONFIG_BPF_LSM */
 
+#ifdef CONFIG_MEMCG_KMEM
+static inline struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
+{
+	if (map->objcg)
+		return get_mem_cgroup_from_objcg(map->objcg);
+
+	return root_mem_cgroup;
+}
+
+static inline void bpf_map_put_memcg(struct mem_cgroup *memcg)
+{
+	mem_cgroup_put(memcg);
+}
+
+#else
+static inline struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
+{
+	return root_memcg();
+}
+
+static inline void bpf_map_put_memcg(struct mem_cgroup *memcg)
+{
+}
+#endif
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4d31ce5..6040b5c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -361,6 +361,11 @@ struct mem_cgroup {
 
 extern struct mem_cgroup *root_mem_cgroup;
 
+static inline struct mem_cgroup *root_memcg(void)
+{
+	return root_mem_cgroup;
+}
+
 enum page_memcg_data_flags {
 	/* page->memcg_data is a pointer to an objcgs vector */
 	MEMCG_DATA_OBJCGS = (1UL << 0),
@@ -1147,6 +1152,11 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 #define MEM_CGROUP_ID_SHIFT	0
 #define MEM_CGROUP_ID_MAX	0
 
+static inline struct mem_cgroup *root_memcg(void)
+{
+	return NULL;
+}
+
 static inline struct mem_cgroup *folio_memcg(struct folio *folio)
 {
 	return NULL;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7ce024c..eaf1906 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -433,19 +433,6 @@ static void bpf_map_release_memcg(struct bpf_map *map)
 		obj_cgroup_put(map->objcg);
 }
 
-static struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
-{
-	if (map->objcg)
-		return get_mem_cgroup_from_objcg(map->objcg);
-
-	return root_mem_cgroup;
-}
-
-static void bpf_map_put_memcg(struct mem_cgroup *memcg)
-{
-	mem_cgroup_put(memcg);
-}
-
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node)
 {
-- 
1.8.3.1




* [PATCH bpf-next v3 04/13] bpf: Call bpf_map_init_from_attr() immediately after map creation
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (2 preceding siblings ...)
  2022-09-02  2:29 ` [PATCH bpf-next v3 03/13] bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM Yafang Shao
@ 2022-09-02  2:29 ` Yafang Shao
  2022-09-02  2:29 ` [PATCH bpf-next v3 05/13] bpf: Save memcg in bpf_map_init_from_attr() Yafang Shao
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:29 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

In order to make all other map-related memory allocations happen after
the memcg is saved in the map, we should save the memcg immediately
after map creation. But the map itself is created in
bpf_map_area_alloc(), within which we can't get the related bpf_map
(except with a pointer cast, which may be error prone), so we do it in
bpf_map_init_from_attr(), which is used by all bpf maps.

bpf_map_init_from_attr() is already executed immediately after
bpf_map_area_alloc() for almost all bpf maps, except bpf_struct_ops,
devmap and hashtab, so this patch changes these three maps.

In the future we will change the return type of bpf_map_init_from_attr()
from void to int to handle error cases, so putting it immediately after
bpf_map_area_alloc() will make it easy to handle such errors.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/bpf_struct_ops.c | 2 +-
 kernel/bpf/devmap.c         | 5 ++---
 kernel/bpf/hashtab.c        | 4 ++--
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 84b2d9d..36f24f8 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -624,6 +624,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 
 	st_map->st_ops = st_ops;
 	map = &st_map->map;
+	bpf_map_init_from_attr(map, attr);
 
 	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
 	st_map->links =
@@ -637,7 +638,6 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 
 	mutex_init(&st_map->lock);
 	set_vm_flush_reset_perms(st_map->image);
-	bpf_map_init_from_attr(map, attr);
 
 	return map;
 }
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index f9a87dc..20decc7 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -127,9 +127,6 @@ static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
 	 */
 	attr->map_flags |= BPF_F_RDONLY_PROG;
 
-
-	bpf_map_init_from_attr(&dtab->map, attr);
-
 	if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
 		dtab->n_buckets = roundup_pow_of_two(dtab->map.max_entries);
 
@@ -167,6 +164,8 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	if (!dtab)
 		return ERR_PTR(-ENOMEM);
 
+	bpf_map_init_from_attr(&dtab->map, attr);
+
 	err = dev_map_init_map(dtab, attr);
 	if (err) {
 		bpf_map_area_free(dtab);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index eb1263f..fc7242c 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -508,10 +508,10 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	if (!htab)
 		return ERR_PTR(-ENOMEM);
 
-	lockdep_register_key(&htab->lockdep_key);
-
 	bpf_map_init_from_attr(&htab->map, attr);
 
+	lockdep_register_key(&htab->lockdep_key);
+
 	if (percpu_lru) {
 		/* ensure each CPU's lru list has >=1 elements.
 		 * since we are at it, make each lru list has the same
-- 
1.8.3.1




* [PATCH bpf-next v3 05/13] bpf: Save memcg in bpf_map_init_from_attr()
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (3 preceding siblings ...)
  2022-09-02  2:29 ` [PATCH bpf-next v3 04/13] bpf: Call bpf_map_init_from_attr() immediately after map creation Yafang Shao
@ 2022-09-02  2:29 ` Yafang Shao
  2022-09-02  2:29 ` [PATCH bpf-next v3 06/13] bpf: Use scope-based charge in bpf_map_area_alloc Yafang Shao
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:29 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Move bpf_map_save_memcg() into bpf_map_init_from_attr(), so that all
other map-related memory allocations happen after the memcg has been
saved, and we can then get the memcg from the map in those follow-up
allocations.

To pair with this change, bpf_map_release_memcg() is moved into
bpf_map_area_free(). A new parameter, struct bpf_map, is added to
bpf_map_area_free() for this purpose.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            |  2 +-
 kernel/bpf/arraymap.c          |  8 +++---
 kernel/bpf/bloom_filter.c      |  2 +-
 kernel/bpf/bpf_local_storage.c |  4 +--
 kernel/bpf/bpf_struct_ops.c    |  6 ++---
 kernel/bpf/cpumap.c            |  6 ++---
 kernel/bpf/devmap.c            |  8 +++---
 kernel/bpf/hashtab.c           | 10 +++----
 kernel/bpf/local_storage.c     |  2 +-
 kernel/bpf/lpm_trie.c          |  2 +-
 kernel/bpf/offload.c           |  4 +--
 kernel/bpf/queue_stack_maps.c  |  2 +-
 kernel/bpf/reuseport_array.c   |  2 +-
 kernel/bpf/ringbuf.c           |  8 +++---
 kernel/bpf/stackmap.c          |  8 +++---
 kernel/bpf/syscall.c           | 60 ++++++++++++++++++++++--------------------
 net/core/sock_map.c            | 12 ++++-----
 net/xdp/xskmap.c               |  2 +-
 18 files changed, 76 insertions(+), 72 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a50d29c..729ddf7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1638,7 +1638,7 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
 void bpf_map_put(struct bpf_map *map);
 void *bpf_map_area_alloc(u64 size, int numa_node);
 void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
-void bpf_map_area_free(void *base);
+void bpf_map_area_free(void *base, struct bpf_map *map);
 bool bpf_map_write_active(const struct bpf_map *map);
 void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
 int  generic_map_lookup_batch(struct bpf_map *map,
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 6245274..7eebdbe 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -147,7 +147,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	array->elem_size = elem_size;
 
 	if (percpu && bpf_array_alloc_percpu(array)) {
-		bpf_map_area_free(array);
+		bpf_map_area_free(array, &array->map);
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -430,9 +430,9 @@ static void array_map_free(struct bpf_map *map)
 		bpf_array_free_percpu(array);
 
 	if (array->map.map_flags & BPF_F_MMAPABLE)
-		bpf_map_area_free(array_map_vmalloc_addr(array));
+		bpf_map_area_free(array_map_vmalloc_addr(array), map);
 	else
-		bpf_map_area_free(array);
+		bpf_map_area_free(array, map);
 }
 
 static void array_map_seq_show_elem(struct bpf_map *map, void *key,
@@ -780,7 +780,7 @@ static void fd_array_map_free(struct bpf_map *map)
 	for (i = 0; i < array->map.max_entries; i++)
 		BUG_ON(array->ptrs[i] != NULL);
 
-	bpf_map_area_free(array);
+	bpf_map_area_free(array, map);
 }
 
 static void *fd_array_map_lookup_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
index b9ea539..e59064d 100644
--- a/kernel/bpf/bloom_filter.c
+++ b/kernel/bpf/bloom_filter.c
@@ -168,7 +168,7 @@ static void bloom_map_free(struct bpf_map *map)
 	struct bpf_bloom_filter *bloom =
 		container_of(map, struct bpf_bloom_filter, map);
 
-	bpf_map_area_free(bloom);
+	bpf_map_area_free(bloom, map);
 }
 
 static void *bloom_map_lookup_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 802fc15..7b68d846 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -582,7 +582,7 @@ void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
 	synchronize_rcu();
 
 	kvfree(smap->buckets);
-	bpf_map_area_free(smap);
+	bpf_map_area_free(smap, &smap->map);
 }
 
 int bpf_local_storage_map_alloc_check(union bpf_attr *attr)
@@ -623,7 +623,7 @@ struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
 	smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets,
 				 GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT);
 	if (!smap->buckets) {
-		bpf_map_area_free(smap);
+		bpf_map_area_free(smap, &smap->map);
 		return ERR_PTR(-ENOMEM);
 	}
 
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 36f24f8..9fb8ad1 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -577,10 +577,10 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
 
 	if (st_map->links)
 		bpf_struct_ops_map_put_progs(st_map);
-	bpf_map_area_free(st_map->links);
+	bpf_map_area_free(st_map->links, NULL);
 	bpf_jit_free_exec(st_map->image);
-	bpf_map_area_free(st_map->uvalue);
-	bpf_map_area_free(st_map);
+	bpf_map_area_free(st_map->uvalue, NULL);
+	bpf_map_area_free(st_map, map);
 }
 
 static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index b5ba34d..7de2ae6 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -118,7 +118,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 
 	return &cmap->map;
 free_cmap:
-	bpf_map_area_free(cmap);
+	bpf_map_area_free(cmap, &cmap->map);
 	return ERR_PTR(err);
 }
 
@@ -622,8 +622,8 @@ static void cpu_map_free(struct bpf_map *map)
 		/* bq flush and cleanup happens after RCU grace-period */
 		__cpu_map_entry_replace(cmap, i, NULL); /* call_rcu */
 	}
-	bpf_map_area_free(cmap->cpu_map);
-	bpf_map_area_free(cmap);
+	bpf_map_area_free(cmap->cpu_map, NULL);
+	bpf_map_area_free(cmap, map);
 }
 
 /* Elements are kept alive by RCU; either by rcu_read_lock() (from syscall) or
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 20decc7..3268ce7 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -168,7 +168,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 
 	err = dev_map_init_map(dtab, attr);
 	if (err) {
-		bpf_map_area_free(dtab);
+		bpf_map_area_free(dtab, &dtab->map);
 		return ERR_PTR(err);
 	}
 
@@ -221,7 +221,7 @@ static void dev_map_free(struct bpf_map *map)
 			}
 		}
 
-		bpf_map_area_free(dtab->dev_index_head);
+		bpf_map_area_free(dtab->dev_index_head, NULL);
 	} else {
 		for (i = 0; i < dtab->map.max_entries; i++) {
 			struct bpf_dtab_netdev *dev;
@@ -236,10 +236,10 @@ static void dev_map_free(struct bpf_map *map)
 			kfree(dev);
 		}
 
-		bpf_map_area_free(dtab->netdev_map);
+		bpf_map_area_free(dtab->netdev_map, NULL);
 	}
 
-	bpf_map_area_free(dtab);
+	bpf_map_area_free(dtab, &dtab->map);
 }
 
 static int dev_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index fc7242c..063989e 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -303,7 +303,7 @@ static void htab_free_elems(struct bpf_htab *htab)
 		cond_resched();
 	}
 free_elems:
-	bpf_map_area_free(htab->elems);
+	bpf_map_area_free(htab->elems, NULL);
 }
 
 /* The LRU list has a lock (lru_lock). Each htab bucket has a lock
@@ -585,10 +585,10 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 free_map_locked:
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
-	bpf_map_area_free(htab->buckets);
+	bpf_map_area_free(htab->buckets, NULL);
 free_htab:
 	lockdep_unregister_key(&htab->lockdep_key);
-	bpf_map_area_free(htab);
+	bpf_map_area_free(htab, &htab->map);
 	return ERR_PTR(err);
 }
 
@@ -1501,11 +1501,11 @@ static void htab_map_free(struct bpf_map *map)
 
 	bpf_map_free_kptr_off_tab(map);
 	free_percpu(htab->extra_elems);
-	bpf_map_area_free(htab->buckets);
+	bpf_map_area_free(htab->buckets, NULL);
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	lockdep_unregister_key(&htab->lockdep_key);
-	bpf_map_area_free(htab);
+	bpf_map_area_free(htab, map);
 }
 
 static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 098cf33..c705d66 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -345,7 +345,7 @@ static void cgroup_storage_map_free(struct bpf_map *_map)
 	WARN_ON(!RB_EMPTY_ROOT(&map->root));
 	WARN_ON(!list_empty(&map->list));
 
-	bpf_map_area_free(map);
+	bpf_map_area_free(map, _map);
 }
 
 static int cgroup_storage_delete_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index d833496..fd99360 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -609,7 +609,7 @@ static void trie_free(struct bpf_map *map)
 	}
 
 out:
-	bpf_map_area_free(trie);
+	bpf_map_area_free(trie, map);
 }
 
 static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key)
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 13e4efc..c9941a9 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -404,7 +404,7 @@ struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr)
 err_unlock:
 	up_write(&bpf_devs_lock);
 	rtnl_unlock();
-	bpf_map_area_free(offmap);
+	bpf_map_area_free(offmap, &offmap->map);
 	return ERR_PTR(err);
 }
 
@@ -428,7 +428,7 @@ void bpf_map_offload_map_free(struct bpf_map *map)
 	up_write(&bpf_devs_lock);
 	rtnl_unlock();
 
-	bpf_map_area_free(offmap);
+	bpf_map_area_free(offmap, map);
 }
 
 int bpf_map_offload_lookup_elem(struct bpf_map *map, void *key, void *value)
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index 8a5e060..f2ec0c4 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -92,7 +92,7 @@ static void queue_stack_map_free(struct bpf_map *map)
 {
 	struct bpf_queue_stack *qs = bpf_queue_stack(map);
 
-	bpf_map_area_free(qs);
+	bpf_map_area_free(qs, map);
 }
 
 static int __queue_map_get(struct bpf_map *map, void *value, bool delete)
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 85fa9db..3ac9276 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -143,7 +143,7 @@ static void reuseport_array_free(struct bpf_map *map)
 	 * Once reaching here, all sk->sk_user_data is not
 	 * referencing this "array". "array" can be freed now.
 	 */
-	bpf_map_area_free(array);
+	bpf_map_area_free(array, map);
 }
 
 static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index b483aea..74dd8dc 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -116,7 +116,7 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 err_free_pages:
 	for (i = 0; i < nr_pages; i++)
 		__free_page(pages[i]);
-	bpf_map_area_free(pages);
+	bpf_map_area_free(pages, NULL);
 	return NULL;
 }
 
@@ -172,7 +172,7 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 
 	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
 	if (!rb_map->rb) {
-		bpf_map_area_free(rb_map);
+		bpf_map_area_free(rb_map, &rb_map->map);
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -190,7 +190,7 @@ static void bpf_ringbuf_free(struct bpf_ringbuf *rb)
 	vunmap(rb);
 	for (i = 0; i < nr_pages; i++)
 		__free_page(pages[i]);
-	bpf_map_area_free(pages);
+	bpf_map_area_free(pages, NULL);
 }
 
 static void ringbuf_map_free(struct bpf_map *map)
@@ -199,7 +199,7 @@ static void ringbuf_map_free(struct bpf_map *map)
 
 	rb_map = container_of(map, struct bpf_ringbuf_map, map);
 	bpf_ringbuf_free(rb_map->rb);
-	bpf_map_area_free(rb_map);
+	bpf_map_area_free(rb_map, map);
 }
 
 static void *ringbuf_map_lookup_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 1adbe67..042b7d2 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -62,7 +62,7 @@ static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
 	return 0;
 
 free_elems:
-	bpf_map_area_free(smap->elems);
+	bpf_map_area_free(smap->elems, NULL);
 	return err;
 }
 
@@ -120,7 +120,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 put_buffers:
 	put_callchain_buffers();
 free_smap:
-	bpf_map_area_free(smap);
+	bpf_map_area_free(smap, &smap->map);
 	return ERR_PTR(err);
 }
 
@@ -648,9 +648,9 @@ static void stack_map_free(struct bpf_map *map)
 {
 	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
 
-	bpf_map_area_free(smap->elems);
+	bpf_map_area_free(smap->elems, NULL);
 	pcpu_freelist_destroy(&smap->freelist);
-	bpf_map_area_free(smap);
+	bpf_map_area_free(smap, map);
 	put_callchain_buffers();
 }
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index eaf1906..68a6f59 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -293,6 +293,34 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 	return err;
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static void bpf_map_save_memcg(struct bpf_map *map)
+{
+	/* Currently if a map is created by a process belonging to the root
+	 * memory cgroup, get_obj_cgroup_from_current() will return NULL.
+	 * So we have to check map->objcg for being NULL each time it's
+	 * being used.
+	 */
+	map->objcg = get_obj_cgroup_from_current();
+}
+
+static void bpf_map_release_memcg(struct bpf_map *map)
+{
+	if (map->objcg)
+		obj_cgroup_put(map->objcg);
+}
+
+#else
+static void bpf_map_save_memcg(struct bpf_map *map)
+{
+}
+
+static void bpf_map_release_memcg(struct bpf_map *map)
+{
+}
+
+#endif
+
 /* Please, do not use this function outside from the map creation path
  * (e.g. in map update path) without taking care of setting the active
  * memory cgroup (see at bpf_map_kmalloc_node() for example).
@@ -344,8 +372,10 @@ void *bpf_map_area_mmapable_alloc(u64 size, int numa_node)
 	return __bpf_map_area_alloc(size, numa_node, true);
 }
 
-void bpf_map_area_free(void *area)
+void bpf_map_area_free(void *area, struct bpf_map *map)
 {
+	if (map)
+		bpf_map_release_memcg(map);
 	kvfree(area);
 }
 
@@ -363,6 +393,7 @@ static u32 bpf_map_flags_retain_permanent(u32 flags)
 
 void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 {
+	bpf_map_save_memcg(map);
 	map->map_type = attr->map_type;
 	map->key_size = attr->key_size;
 	map->value_size = attr->value_size;
@@ -417,22 +448,6 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static void bpf_map_save_memcg(struct bpf_map *map)
-{
-	/* Currently if a map is created by a process belonging to the root
-	 * memory cgroup, get_obj_cgroup_from_current() will return NULL.
-	 * So we have to check map->objcg for being NULL each time it's
-	 * being used.
-	 */
-	map->objcg = get_obj_cgroup_from_current();
-}
-
-static void bpf_map_release_memcg(struct bpf_map *map)
-{
-	if (map->objcg)
-		obj_cgroup_put(map->objcg);
-}
-
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node)
 {
@@ -477,14 +492,6 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 	return ptr;
 }
 
-#else
-static void bpf_map_save_memcg(struct bpf_map *map)
-{
-}
-
-static void bpf_map_release_memcg(struct bpf_map *map)
-{
-}
 #endif
 
 static int bpf_map_kptr_off_cmp(const void *a, const void *b)
@@ -605,7 +612,6 @@ static void bpf_map_free_deferred(struct work_struct *work)
 
 	security_bpf_map_free(map);
 	kfree(map->off_arr);
-	bpf_map_release_memcg(map);
 	/* implementation dependent freeing, map_free callback also does
 	 * bpf_map_free_kptr_off_tab, if needed.
 	 */
@@ -1154,8 +1160,6 @@ static int map_create(union bpf_attr *attr)
 	if (err)
 		goto free_map_sec;
 
-	bpf_map_save_memcg(map);
-
 	err = bpf_map_new_fd(map, f_flags);
 	if (err < 0) {
 		/* failed to allocate fd.
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index a660bae..8da9fd4 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -52,7 +52,7 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 				       sizeof(struct sock *),
 				       stab->map.numa_node);
 	if (!stab->sks) {
-		bpf_map_area_free(stab);
+		bpf_map_area_free(stab, &stab->map);
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -360,8 +360,8 @@ static void sock_map_free(struct bpf_map *map)
 	/* wait for psock readers accessing its map link */
 	synchronize_rcu();
 
-	bpf_map_area_free(stab->sks);
-	bpf_map_area_free(stab);
+	bpf_map_area_free(stab->sks, NULL);
+	bpf_map_area_free(stab, map);
 }
 
 static void sock_map_release_progs(struct bpf_map *map)
@@ -1115,7 +1115,7 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
 
 	return &htab->map;
 free_htab:
-	bpf_map_area_free(htab);
+	bpf_map_area_free(htab, &htab->map);
 	return ERR_PTR(err);
 }
 
@@ -1167,8 +1167,8 @@ static void sock_hash_free(struct bpf_map *map)
 	/* wait for psock readers accessing its map link */
 	synchronize_rcu();
 
-	bpf_map_area_free(htab->buckets);
-	bpf_map_area_free(htab);
+	bpf_map_area_free(htab->buckets, NULL);
+	bpf_map_area_free(htab, map);
 }
 
 static void *sock_hash_lookup_sys(struct bpf_map *map, void *key)
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index acc8e52..5abb87e 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -90,7 +90,7 @@ static void xsk_map_free(struct bpf_map *map)
 	struct xsk_map *m = container_of(map, struct xsk_map, map);
 
 	synchronize_net();
-	bpf_map_area_free(m);
+	bpf_map_area_free(m, map);
 }
 
 static int xsk_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
-- 
1.8.3.1




* [PATCH bpf-next v3 06/13] bpf: Use scope-based charge in bpf_map_area_alloc
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (4 preceding siblings ...)
  2022-09-02  2:29 ` [PATCH bpf-next v3 05/13] bpf: Save memcg in bpf_map_init_from_attr() Yafang Shao
@ 2022-09-02  2:29 ` Yafang Shao
  2022-09-02  2:29 ` [PATCH bpf-next v3 07/13] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free} Yafang Shao
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:29 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Currently bpf_map_area_alloc() is used to allocate either the container
of a struct bpf_map or members of this container. To distinguish map
creation from the other cases, a new parameter, struct bpf_map, is added
to bpf_map_area_alloc(). Then, for the non-map-creation case, we can get
the memcg from the map instead of using the current task's memcg.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            |  2 +-
 kernel/bpf/arraymap.c          |  2 +-
 kernel/bpf/bloom_filter.c      |  2 +-
 kernel/bpf/bpf_local_storage.c |  2 +-
 kernel/bpf/bpf_struct_ops.c    |  6 +++---
 kernel/bpf/cpumap.c            |  5 +++--
 kernel/bpf/devmap.c            | 13 ++++++++-----
 kernel/bpf/hashtab.c           |  8 +++++---
 kernel/bpf/local_storage.c     |  2 +-
 kernel/bpf/lpm_trie.c          |  2 +-
 kernel/bpf/offload.c           |  2 +-
 kernel/bpf/queue_stack_maps.c  |  2 +-
 kernel/bpf/reuseport_array.c   |  2 +-
 kernel/bpf/ringbuf.c           | 15 +++++++++------
 kernel/bpf/stackmap.c          |  5 +++--
 kernel/bpf/syscall.c           | 16 ++++++++++++++--
 net/core/sock_map.c            | 10 ++++++----
 net/xdp/xskmap.c               |  2 +-
 18 files changed, 61 insertions(+), 37 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 729ddf7..e8ac29f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1636,7 +1636,7 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
 struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
-void *bpf_map_area_alloc(u64 size, int numa_node);
+void *bpf_map_area_alloc(u64 size, int numa_node, struct bpf_map *map);
 void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
 void bpf_map_area_free(void *base, struct bpf_map *map);
 bool bpf_map_write_active(const struct bpf_map *map);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 7eebdbe..114fbda 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -135,7 +135,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 		array = data + PAGE_ALIGN(sizeof(struct bpf_array))
 			- offsetof(struct bpf_array, value);
 	} else {
-		array = bpf_map_area_alloc(array_size, numa_node);
+		array = bpf_map_area_alloc(array_size, numa_node, NULL);
 	}
 	if (!array)
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
index e59064d..6691f79 100644
--- a/kernel/bpf/bloom_filter.c
+++ b/kernel/bpf/bloom_filter.c
@@ -142,7 +142,7 @@ static struct bpf_map *bloom_map_alloc(union bpf_attr *attr)
 	}
 
 	bitset_bytes = roundup(bitset_bytes, sizeof(unsigned long));
-	bloom = bpf_map_area_alloc(sizeof(*bloom) + bitset_bytes, numa_node);
+	bloom = bpf_map_area_alloc(sizeof(*bloom) + bitset_bytes, numa_node, NULL);
 
 	if (!bloom)
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 7b68d846..44498d7d 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -610,7 +610,7 @@ struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
 	unsigned int i;
 	u32 nbuckets;
 
-	smap = bpf_map_area_alloc(sizeof(*smap), NUMA_NO_NODE);
+	smap = bpf_map_area_alloc(sizeof(*smap), NUMA_NO_NODE, NULL);
 	if (!smap)
 		return ERR_PTR(-ENOMEM);
 	bpf_map_init_from_attr(&smap->map, attr);
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 9fb8ad1..37ba5c0 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -618,7 +618,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 		 */
 		(vt->size - sizeof(struct bpf_struct_ops_value));
 
-	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE);
+	st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE, NULL);
 	if (!st_map)
 		return ERR_PTR(-ENOMEM);
 
@@ -626,10 +626,10 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 	map = &st_map->map;
 	bpf_map_init_from_attr(map, attr);
 
-	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE);
+	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE, map);
 	st_map->links =
 		bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_links *),
-				   NUMA_NO_NODE);
+				   NUMA_NO_NODE, map);
 	st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
 	if (!st_map->uvalue || !st_map->links || !st_map->image) {
 		bpf_struct_ops_map_free(map);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 7de2ae6..b593157 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -97,7 +97,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	    attr->map_flags & ~BPF_F_NUMA_NODE)
 		return ERR_PTR(-EINVAL);
 
-	cmap = bpf_map_area_alloc(sizeof(*cmap), NUMA_NO_NODE);
+	cmap = bpf_map_area_alloc(sizeof(*cmap), NUMA_NO_NODE, NULL);
 	if (!cmap)
 		return ERR_PTR(-ENOMEM);
 
@@ -112,7 +112,8 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	/* Alloc array for possible remote "destination" CPUs */
 	cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
 					   sizeof(struct bpf_cpu_map_entry *),
-					   cmap->map.numa_node);
+					   cmap->map.numa_node,
+					   &cmap->map);
 	if (!cmap->cpu_map)
 		goto free_cmap;
 
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 3268ce7..807a4cd 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -89,12 +89,13 @@ struct bpf_dtab {
 static LIST_HEAD(dev_map_list);
 
 static struct hlist_head *dev_map_create_hash(unsigned int entries,
-					      int numa_node)
+					      int numa_node,
+					      struct bpf_map *map)
 {
 	int i;
 	struct hlist_head *hash;
 
-	hash = bpf_map_area_alloc((u64) entries * sizeof(*hash), numa_node);
+	hash = bpf_map_area_alloc((u64) entries * sizeof(*hash), numa_node, map);
 	if (hash != NULL)
 		for (i = 0; i < entries; i++)
 			INIT_HLIST_HEAD(&hash[i]);
@@ -136,7 +137,8 @@ static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
 
 	if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
 		dtab->dev_index_head = dev_map_create_hash(dtab->n_buckets,
-							   dtab->map.numa_node);
+							   dtab->map.numa_node,
+							   &dtab->map);
 		if (!dtab->dev_index_head)
 			return -ENOMEM;
 
@@ -144,7 +146,8 @@ static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr)
 	} else {
 		dtab->netdev_map = bpf_map_area_alloc((u64) dtab->map.max_entries *
 						      sizeof(struct bpf_dtab_netdev *),
-						      dtab->map.numa_node);
+						      dtab->map.numa_node,
+						      &dtab->map);
 		if (!dtab->netdev_map)
 			return -ENOMEM;
 	}
@@ -160,7 +163,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	if (!capable(CAP_NET_ADMIN))
 		return ERR_PTR(-EPERM);
 
-	dtab = bpf_map_area_alloc(sizeof(*dtab), NUMA_NO_NODE);
+	dtab = bpf_map_area_alloc(sizeof(*dtab), NUMA_NO_NODE, NULL);
 	if (!dtab)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 063989e..975bcc1 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -341,7 +341,8 @@ static int prealloc_init(struct bpf_htab *htab)
 		num_entries += num_possible_cpus();
 
 	htab->elems = bpf_map_area_alloc((u64)htab->elem_size * num_entries,
-					 htab->map.numa_node);
+					 htab->map.numa_node,
+					 &htab->map);
 	if (!htab->elems)
 		return -ENOMEM;
 
@@ -504,7 +505,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	struct bpf_htab *htab;
 	int err, i;
 
-	htab = bpf_map_area_alloc(sizeof(*htab), NUMA_NO_NODE);
+	htab = bpf_map_area_alloc(sizeof(*htab), NUMA_NO_NODE, NULL);
 	if (!htab)
 		return ERR_PTR(-ENOMEM);
 
@@ -543,7 +544,8 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	err = -ENOMEM;
 	htab->buckets = bpf_map_area_alloc(htab->n_buckets *
 					   sizeof(struct bucket),
-					   htab->map.numa_node);
+					   htab->map.numa_node,
+					   &htab->map);
 	if (!htab->buckets)
 		goto free_htab;
 
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index c705d66..fcc7ece 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -313,7 +313,7 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
 		/* max_entries is not used and enforced to be 0 */
 		return ERR_PTR(-EINVAL);
 
-	map = bpf_map_area_alloc(sizeof(struct bpf_cgroup_storage_map), numa_node);
+	map = bpf_map_area_alloc(sizeof(struct bpf_cgroup_storage_map), numa_node, NULL);
 	if (!map)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index fd99360..3d329ae 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -558,7 +558,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 	    attr->value_size > LPM_VAL_SIZE_MAX)
 		return ERR_PTR(-EINVAL);
 
-	trie = bpf_map_area_alloc(sizeof(*trie), NUMA_NO_NODE);
+	trie = bpf_map_area_alloc(sizeof(*trie), NUMA_NO_NODE, NULL);
 	if (!trie)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index c9941a9..87c59da 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -372,7 +372,7 @@ struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr)
 	    attr->map_type != BPF_MAP_TYPE_HASH)
 		return ERR_PTR(-EINVAL);
 
-	offmap = bpf_map_area_alloc(sizeof(*offmap), NUMA_NO_NODE);
+	offmap = bpf_map_area_alloc(sizeof(*offmap), NUMA_NO_NODE, NULL);
 	if (!offmap)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index f2ec0c4..bf57e45 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -74,7 +74,7 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
 	size = (u64) attr->max_entries + 1;
 	queue_size = sizeof(*qs) + size * attr->value_size;
 
-	qs = bpf_map_area_alloc(queue_size, numa_node);
+	qs = bpf_map_area_alloc(queue_size, numa_node, NULL);
 	if (!qs)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 3ac9276..447f664 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -155,7 +155,7 @@ static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 		return ERR_PTR(-EPERM);
 
 	/* allocate all map elements and zero-initialize them */
-	array = bpf_map_area_alloc(struct_size(array, ptrs, attr->max_entries), numa_node);
+	array = bpf_map_area_alloc(struct_size(array, ptrs, attr->max_entries), numa_node, NULL);
 	if (!array)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 74dd8dc..5eb7820 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -59,7 +59,8 @@ struct bpf_ringbuf_hdr {
 	u32 pg_off;
 };
 
-static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
+static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node,
+						  struct bpf_map *map)
 {
 	const gfp_t flags = GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL |
 			    __GFP_NOWARN | __GFP_ZERO;
@@ -89,7 +90,7 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node)
 	 * user-space implementations significantly.
 	 */
 	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
-	pages = bpf_map_area_alloc(array_size, numa_node);
+	pages = bpf_map_area_alloc(array_size, numa_node, map);
 	if (!pages)
 		return NULL;
 
@@ -127,11 +128,12 @@ static void bpf_ringbuf_notify(struct irq_work *work)
 	wake_up_all(&rb->waitq);
 }
 
-static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
+static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node,
+					     struct bpf_map *map)
 {
 	struct bpf_ringbuf *rb;
 
-	rb = bpf_ringbuf_area_alloc(data_sz, numa_node);
+	rb = bpf_ringbuf_area_alloc(data_sz, numa_node, map);
 	if (!rb)
 		return NULL;
 
@@ -164,13 +166,14 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-E2BIG);
 #endif
 
-	rb_map = bpf_map_area_alloc(sizeof(*rb_map), NUMA_NO_NODE);
+	rb_map = bpf_map_area_alloc(sizeof(*rb_map), NUMA_NO_NODE, NULL);
 	if (!rb_map)
 		return ERR_PTR(-ENOMEM);
 
 	bpf_map_init_from_attr(&rb_map->map, attr);
 
-	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node);
+	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node,
+				       &rb_map->map);
 	if (!rb_map->rb) {
 		bpf_map_area_free(rb_map, &rb_map->map);
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 042b7d2..9440fab 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -49,7 +49,8 @@ static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
 	int err;
 
 	smap->elems = bpf_map_area_alloc(elem_size * smap->map.max_entries,
-					 smap->map.numa_node);
+					 smap->map.numa_node,
+					 &smap->map);
 	if (!smap->elems)
 		return -ENOMEM;
 
@@ -100,7 +101,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-E2BIG);
 
 	cost = n_buckets * sizeof(struct stack_map_bucket *) + sizeof(*smap);
-	smap = bpf_map_area_alloc(cost, bpf_map_attr_numa_node(attr));
+	smap = bpf_map_area_alloc(cost, bpf_map_attr_numa_node(attr), NULL);
 	if (!smap)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 68a6f59..eefe590 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -362,9 +362,21 @@ static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
 			flags, numa_node, __builtin_return_address(0));
 }
 
-void *bpf_map_area_alloc(u64 size, int numa_node)
+void *bpf_map_area_alloc(u64 size, int numa_node, struct bpf_map *map)
 {
-	return __bpf_map_area_alloc(size, numa_node, false);
+	struct mem_cgroup *memcg, *old_memcg;
+	void *ptr;
+
+	if (!map)
+		return __bpf_map_area_alloc(size, numa_node, false);
+
+	memcg = bpf_map_get_memcg(map);
+	old_memcg = set_active_memcg(memcg);
+	ptr = __bpf_map_area_alloc(size, numa_node, false);
+	set_active_memcg(old_memcg);
+	bpf_map_put_memcg(memcg);
+
+	return ptr;
 }
 
 void *bpf_map_area_mmapable_alloc(u64 size, int numa_node)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 8da9fd4..25a5ac4 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -41,7 +41,7 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
 
-	stab = bpf_map_area_alloc(sizeof(*stab), NUMA_NO_NODE);
+	stab = bpf_map_area_alloc(sizeof(*stab), NUMA_NO_NODE, NULL);
 	if (!stab)
 		return ERR_PTR(-ENOMEM);
 
@@ -50,7 +50,8 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 
 	stab->sks = bpf_map_area_alloc((u64) stab->map.max_entries *
 				       sizeof(struct sock *),
-				       stab->map.numa_node);
+				       stab->map.numa_node,
+				       &stab->map);
 	if (!stab->sks) {
 		bpf_map_area_free(stab, &stab->map);
 		return ERR_PTR(-ENOMEM);
@@ -1085,7 +1086,7 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
 	if (attr->key_size > MAX_BPF_STACK)
 		return ERR_PTR(-E2BIG);
 
-	htab = bpf_map_area_alloc(sizeof(*htab), NUMA_NO_NODE);
+	htab = bpf_map_area_alloc(sizeof(*htab), NUMA_NO_NODE, NULL);
 	if (!htab)
 		return ERR_PTR(-ENOMEM);
 
@@ -1102,7 +1103,8 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
 
 	htab->buckets = bpf_map_area_alloc(htab->buckets_num *
 					   sizeof(struct bpf_shtab_bucket),
-					   htab->map.numa_node);
+					   htab->map.numa_node,
+					   &htab->map);
 	if (!htab->buckets) {
 		err = -ENOMEM;
 		goto free_htab;
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 5abb87e..beb11fd 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -75,7 +75,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 	numa_node = bpf_map_attr_numa_node(attr);
 	size = struct_size(m, xsk_map, attr->max_entries);
 
-	m = bpf_map_area_alloc(size, numa_node);
+	m = bpf_map_area_alloc(size, numa_node, NULL);
 	if (!m)
 		return ERR_PTR(-ENOMEM);
 
-- 
1.8.3.1




* [PATCH bpf-next v3 07/13] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (5 preceding siblings ...)
  2022-09-02  2:29 ` [PATCH bpf-next v3 06/13] bpf: Use scope-based charge in bpf_map_area_alloc Yafang Shao
@ 2022-09-02  2:29 ` Yafang Shao
  2022-09-02  2:29 ` [PATCH bpf-next v3 08/13] bpf: Use bpf_map_kzalloc in arraymap Yafang Shao
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:29 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Move the page-related allocations into a new helper,
bpf_ringbuf_pages_alloc(), so that they can be handled as a single unit.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/ringbuf.c | 80 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 56 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 5eb7820..1e7284c 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -59,6 +59,57 @@ struct bpf_ringbuf_hdr {
 	u32 pg_off;
 };
 
+static void bpf_ringbuf_pages_free(struct page **pages, int nr_pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		__free_page(pages[i]);
+	bpf_map_area_free(pages, NULL);
+}
+
+static struct page **bpf_ringbuf_pages_alloc(struct bpf_map *map,
+					     int nr_meta_pages,
+					     int nr_data_pages,
+					     int numa_node,
+					     const gfp_t flags)
+{
+	int nr_pages = nr_meta_pages + nr_data_pages;
+	struct mem_cgroup *memcg, *old_memcg;
+	struct page **pages, *page;
+	int array_size;
+	int i;
+
+	memcg = bpf_map_get_memcg(map);
+	old_memcg = set_active_memcg(memcg);
+	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
+	pages = bpf_map_area_alloc(array_size, numa_node, NULL);
+	if (!pages)
+		goto err;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = alloc_pages_node(numa_node, flags, 0);
+		if (!page) {
+			nr_pages = i;
+			goto err_free_pages;
+		}
+		pages[i] = page;
+		if (i >= nr_meta_pages)
+			pages[nr_data_pages + i] = page;
+	}
+	set_active_memcg(old_memcg);
+	bpf_map_put_memcg(memcg);
+
+	return pages;
+
+err_free_pages:
+	bpf_ringbuf_pages_free(pages, nr_pages);
+err:
+	set_active_memcg(old_memcg);
+	bpf_map_put_memcg(memcg);
+	return NULL;
+}
+
 static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node,
 						  struct bpf_map *map)
 {
@@ -67,10 +118,8 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node,
 	int nr_meta_pages = RINGBUF_PGOFF + RINGBUF_POS_PAGES;
 	int nr_data_pages = data_sz >> PAGE_SHIFT;
 	int nr_pages = nr_meta_pages + nr_data_pages;
-	struct page **pages, *page;
 	struct bpf_ringbuf *rb;
-	size_t array_size;
-	int i;
+	struct page **pages;
 
 	/* Each data page is mapped twice to allow "virtual"
 	 * continuous read of samples wrapping around the end of ring
@@ -89,22 +138,11 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node,
 	 * when mmap()'ed in user-space, simplifying both kernel and
 	 * user-space implementations significantly.
 	 */
-	array_size = (nr_meta_pages + 2 * nr_data_pages) * sizeof(*pages);
-	pages = bpf_map_area_alloc(array_size, numa_node, map);
+	pages = bpf_ringbuf_pages_alloc(map, nr_meta_pages, nr_data_pages,
+					numa_node, flags);
 	if (!pages)
 		return NULL;
 
-	for (i = 0; i < nr_pages; i++) {
-		page = alloc_pages_node(numa_node, flags, 0);
-		if (!page) {
-			nr_pages = i;
-			goto err_free_pages;
-		}
-		pages[i] = page;
-		if (i >= nr_meta_pages)
-			pages[nr_data_pages + i] = page;
-	}
-
 	rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
 		  VM_MAP | VM_USERMAP, PAGE_KERNEL);
 	if (rb) {
@@ -114,10 +152,6 @@ static struct bpf_ringbuf *bpf_ringbuf_area_alloc(size_t data_sz, int numa_node,
 		return rb;
 	}
 
-err_free_pages:
-	for (i = 0; i < nr_pages; i++)
-		__free_page(pages[i]);
-	bpf_map_area_free(pages, NULL);
 	return NULL;
 }
 
@@ -188,12 +222,10 @@ static void bpf_ringbuf_free(struct bpf_ringbuf *rb)
 	 * to unmap rb itself with vunmap() below
 	 */
 	struct page **pages = rb->pages;
-	int i, nr_pages = rb->nr_pages;
+	int nr_pages = rb->nr_pages;
 
 	vunmap(rb);
-	for (i = 0; i < nr_pages; i++)
-		__free_page(pages[i]);
-	bpf_map_area_free(pages, NULL);
+	bpf_ringbuf_pages_free(pages, nr_pages);
 }
 
 static void ringbuf_map_free(struct bpf_map *map)
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH bpf-next v3 08/13] bpf: Use bpf_map_kzalloc in arraymap
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (6 preceding siblings ...)
  2022-09-02  2:29 ` [PATCH bpf-next v3 07/13] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free} Yafang Shao
@ 2022-09-02  2:29 ` Yafang Shao
  2022-09-02  2:29 ` [PATCH bpf-next v3 09/13] bpf: Use bpf_map_kvcalloc in bpf_local_storage Yafang Shao
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:29 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Allocate the memory after map creation, so that we can use the generic
helper bpf_map_kzalloc() instead of the open-coded kzalloc().

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/arraymap.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 114fbda..f953acc 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -1096,20 +1096,20 @@ static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
 	struct bpf_array_aux *aux;
 	struct bpf_map *map;
 
-	aux = kzalloc(sizeof(*aux), GFP_KERNEL_ACCOUNT);
-	if (!aux)
+	map = array_map_alloc(attr);
+	if (IS_ERR(map))
 		return ERR_PTR(-ENOMEM);
 
+	aux = bpf_map_kzalloc(map, sizeof(*aux), GFP_KERNEL);
+	if (!aux) {
+		array_map_free(map);
+		return ERR_PTR(-ENOMEM);
+	}
+
 	INIT_WORK(&aux->work, prog_array_map_clear_deferred);
 	INIT_LIST_HEAD(&aux->poke_progs);
 	mutex_init(&aux->poke_mutex);
 
-	map = array_map_alloc(attr);
-	if (IS_ERR(map)) {
-		kfree(aux);
-		return map;
-	}
-
 	container_of(map, struct bpf_array, map)->aux = aux;
 	aux->map = map;
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH bpf-next v3 09/13] bpf: Use bpf_map_kvcalloc in bpf_local_storage
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (7 preceding siblings ...)
  2022-09-02  2:29 ` [PATCH bpf-next v3 08/13] bpf: Use bpf_map_kzalloc in arraymap Yafang Shao
@ 2022-09-02  2:29 ` Yafang Shao
  2022-09-02  2:30 ` [PATCH bpf-next v3 10/13] mm, memcg: Add new helper get_obj_cgroup_from_cgroup Yafang Shao
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:29 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Introduce a new helper, bpf_map_kvcalloc(), and use it for the bucket
allocation in bpf_local_storage, so that the memory is charged to the
map's memcg.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            |  8 ++++++++
 kernel/bpf/bpf_local_storage.c |  4 ++--
 kernel/bpf/syscall.c           | 15 +++++++++++++++
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e8ac29f..52d8df0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1657,6 +1657,8 @@ int  generic_map_delete_batch(struct bpf_map *map,
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node);
 void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags);
+void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
+		       gfp_t flags);
 void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 				    size_t align, gfp_t flags);
 #else
@@ -1673,6 +1675,12 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 	return kzalloc(size, flags);
 }
 
+static inline void *
+bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size, gfp_t flags)
+{
+	return kvcalloc(n, size, flags);
+}
+
 static inline void __percpu *
 bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, size_t align,
 		     gfp_t flags)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 44498d7d..8a24828 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -620,8 +620,8 @@ struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
 	nbuckets = max_t(u32, 2, nbuckets);
 	smap->bucket_log = ilog2(nbuckets);
 
-	smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets,
-				 GFP_USER | __GFP_NOWARN | __GFP_ACCOUNT);
+	smap->buckets = bpf_map_kvcalloc(&smap->map, sizeof(*smap->buckets),
+					 nbuckets, GFP_USER | __GFP_NOWARN);
 	if (!smap->buckets) {
 		bpf_map_area_free(smap, &smap->map);
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index eefe590..034accd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -489,6 +489,21 @@ void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 	return ptr;
 }
 
+void *bpf_map_kvcalloc(struct bpf_map *map, size_t n, size_t size,
+		       gfp_t flags)
+{
+	struct mem_cgroup *memcg, *old_memcg;
+	void *ptr;
+
+	memcg = bpf_map_get_memcg(map);
+	old_memcg = set_active_memcg(memcg);
+	ptr = kvcalloc(n, size, flags | __GFP_ACCOUNT);
+	set_active_memcg(old_memcg);
+	bpf_map_put_memcg(memcg);
+
+	return ptr;
+}
+
 void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 				    size_t align, gfp_t flags)
 {
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH bpf-next v3 10/13] mm, memcg: Add new helper get_obj_cgroup_from_cgroup
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (8 preceding siblings ...)
  2022-09-02  2:29 ` [PATCH bpf-next v3 09/13] bpf: Use bpf_map_kvcalloc in bpf_local_storage Yafang Shao
@ 2022-09-02  2:30 ` Yafang Shao
  2022-09-02  2:30 ` [PATCH bpf-next v3 11/13] mm, memcg: Add new helper task_under_memcg_hierarchy Yafang Shao
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:30 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

We want to open a cgroup directory and pass its fd into the kernel, so that
the kernel can charge the allocated memory to that cgroup, provided it has
a valid memory subsystem. In the bpf subsystem, the opened cgroup will be
stored as a struct obj_cgroup pointer, so a new helper
get_obj_cgroup_from_cgroup() is introduced.
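
A minimal sketch of the intended caller in the bpf subsystem, simplified
from a later patch in this series (memcg_fd is the cgroup directory fd
passed in from userspace):

	cgrp = cgroup_get_from_fd(memcg_fd);
	if (IS_ERR(cgrp))
		return -EINVAL;

	/* take a reference on the cgroup's objcg, then drop the cgroup ref */
	objcg = get_obj_cgroup_from_cgroup(cgrp);
	cgroup_put(cgrp);
	if (IS_ERR(objcg))
		return PTR_ERR(objcg);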

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 48 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6040b5c..7a7f252 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1734,6 +1734,7 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg,
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
 
+struct obj_cgroup *get_obj_cgroup_from_cgroup(struct cgroup *cgrp);
 struct obj_cgroup *get_obj_cgroup_from_current(void);
 struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b69979c..4e3b51e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2940,6 +2940,7 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
 {
 	struct obj_cgroup *objcg = NULL;
 
+	WARN_ON_ONCE(!rcu_read_lock_held());
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
 		objcg = rcu_dereference(memcg->objcg);
 		if (objcg && obj_cgroup_tryget(objcg))
@@ -2949,6 +2950,53 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
 	return objcg;
 }
 
+static struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
+{
+	struct obj_cgroup *objcg;
+
+	if (memcg_kmem_bypass())
+		return NULL;
+
+	rcu_read_lock();
+	objcg = __get_obj_cgroup_from_memcg(memcg);
+	rcu_read_unlock();
+	return objcg;
+}
+
+/**
+ * get_obj_cgroup_from_cgroup: Obtain a reference on given cgroup's objcg.
+ * @cgrp: cgroup from which objcg should be extracted. It can't be NULL.
+ *        The memory subsystem of this cgroup must be enabled, otherwise
+ *        -EINVAL will be returned.
+ * On success, the objcg will be returned.
+ * On failure, -EINVAL will be returned.
+ */
+struct obj_cgroup *get_obj_cgroup_from_cgroup(struct cgroup *cgrp)
+{
+	struct cgroup_subsys_state *css;
+	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
+
+	rcu_read_lock();
+	css = rcu_dereference(cgrp->subsys[memory_cgrp_id]);
+	if (!css || !css_tryget(css)) {
+		rcu_read_unlock();
+		return ERR_PTR(-EINVAL);
+	}
+	rcu_read_unlock();
+
+	memcg = mem_cgroup_from_css(css);
+	if (!memcg) {
+		css_put(css);
+		return ERR_PTR(-EINVAL);
+	}
+
+	objcg = get_obj_cgroup_from_memcg(memcg);
+	css_put(css);
+
+	return objcg;
+}
+
 __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
 {
 	struct obj_cgroup *objcg = NULL;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH bpf-next v3 11/13] mm, memcg: Add new helper task_under_memcg_hierarchy
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (9 preceding siblings ...)
  2022-09-02  2:30 ` [PATCH bpf-next v3 10/13] mm, memcg: Add new helper get_obj_cgroup_from_cgroup Yafang Shao
@ 2022-09-02  2:30 ` Yafang Shao
  2022-09-02  2:30 ` [PATCH bpf-next v3 12/13] bpf: Add return value for bpf_map_init_from_attr Yafang Shao
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:30 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Introduce a new helper to check whether a task belongs to a specific memcg
hierarchy. It is similar to mm_match_cgroup(), except that it takes a task
rather than an mm struct, so callers can check a task directly.
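
For example, a later patch in this series uses it to restrict the
selectable memcg to an ancestor of the map creator, roughly:

	/* only allow selecting an ancestor of the current task's memcg */
	if (objcg && !task_under_memcg_hierarchy(current, objcg->memcg)) {
		obj_cgroup_put(objcg);
		return -EINVAL;
	}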

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7a7f252..3b8a8dd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -887,6 +887,20 @@ static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
 	return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
 }
 
+static inline bool task_under_memcg_hierarchy(struct task_struct *p,
+					      struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *task_memcg;
+	bool match = false;
+
+	rcu_read_lock();
+	task_memcg = mem_cgroup_from_task(p);
+	if (task_memcg)
+		match = mem_cgroup_is_descendant(task_memcg, memcg);
+	rcu_read_unlock();
+	return match;
+}
+
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 				   struct mem_cgroup *memcg)
 {
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH bpf-next v3 12/13] bpf: Add return value for bpf_map_init_from_attr
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (10 preceding siblings ...)
  2022-09-02  2:30 ` [PATCH bpf-next v3 11/13] mm, memcg: Add new helper task_under_memcg_hierarchy Yafang Shao
@ 2022-09-02  2:30 ` Yafang Shao
  2022-09-02  2:30 ` [PATCH bpf-next v3 13/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
  2022-09-07 15:43 ` [PATCH bpf-next v3 00/13] " Tejun Heo
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:30 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

Add a return value to bpf_map_init_from_attr() to indicate whether it
initialized successfully. This is in preparation for the follow-up patch.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            | 2 +-
 kernel/bpf/arraymap.c          | 8 +++++++-
 kernel/bpf/bloom_filter.c      | 7 ++++++-
 kernel/bpf/bpf_local_storage.c | 7 ++++++-
 kernel/bpf/bpf_struct_ops.c    | 7 ++++++-
 kernel/bpf/cpumap.c            | 6 +++++-
 kernel/bpf/devmap.c            | 6 +++++-
 kernel/bpf/hashtab.c           | 6 +++++-
 kernel/bpf/local_storage.c     | 7 ++++++-
 kernel/bpf/lpm_trie.c          | 8 +++++++-
 kernel/bpf/offload.c           | 6 +++++-
 kernel/bpf/queue_stack_maps.c  | 7 ++++++-
 kernel/bpf/reuseport_array.c   | 7 ++++++-
 kernel/bpf/ringbuf.c           | 7 ++++++-
 kernel/bpf/syscall.c           | 4 +++-
 net/core/sock_map.c            | 8 +++++++-
 net/xdp/xskmap.c               | 8 +++++++-
 17 files changed, 94 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 52d8df0..2dd520e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1640,7 +1640,7 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
 void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
 void bpf_map_area_free(void *base, struct bpf_map *map);
 bool bpf_map_write_active(const struct bpf_map *map);
-void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
+int bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
 int  generic_map_lookup_batch(struct bpf_map *map,
 			      const union bpf_attr *attr,
 			      union bpf_attr __user *uattr);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index f953acc..9f08694 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -85,6 +85,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	bool bypass_spec_v1 = bpf_bypass_spec_v1();
 	u64 array_size, mask64;
 	struct bpf_array *array;
+	int err;
 
 	elem_size = round_up(attr->value_size, 8);
 
@@ -143,7 +144,12 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	array->map.bypass_spec_v1 = bypass_spec_v1;
 
 	/* copy mandatory map attributes */
-	bpf_map_init_from_attr(&array->map, attr);
+	err = bpf_map_init_from_attr(&array->map, attr);
+	if (err) {
+		bpf_map_area_free(array, NULL);
+		return ERR_PTR(err);
+	}
+
 	array->elem_size = elem_size;
 
 	if (percpu && bpf_array_alloc_percpu(array)) {
diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
index 6691f79..be64227 100644
--- a/kernel/bpf/bloom_filter.c
+++ b/kernel/bpf/bloom_filter.c
@@ -93,6 +93,7 @@ static struct bpf_map *bloom_map_alloc(union bpf_attr *attr)
 	u32 bitset_bytes, bitset_mask, nr_hash_funcs, nr_bits;
 	int numa_node = bpf_map_attr_numa_node(attr);
 	struct bpf_bloom_filter *bloom;
+	int err;
 
 	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
@@ -147,7 +148,11 @@ static struct bpf_map *bloom_map_alloc(union bpf_attr *attr)
 	if (!bloom)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&bloom->map, attr);
+	err = bpf_map_init_from_attr(&bloom->map, attr);
+	if (err) {
+		bpf_map_area_free(bloom, NULL);
+		return ERR_PTR(err);
+	}
 
 	bloom->nr_hash_funcs = nr_hash_funcs;
 	bloom->bitset_mask = bitset_mask;
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 8a24828..ab7fd6b 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -609,11 +609,16 @@ struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
 	struct bpf_local_storage_map *smap;
 	unsigned int i;
 	u32 nbuckets;
+	int err;
 
 	smap = bpf_map_area_alloc(sizeof(*smap), NUMA_NO_NODE, NULL);
 	if (!smap)
 		return ERR_PTR(-ENOMEM);
-	bpf_map_init_from_attr(&smap->map, attr);
+	err = bpf_map_init_from_attr(&smap->map, attr);
+	if (err) {
+		bpf_map_area_free(&smap->map, NULL);
+		return ERR_PTR(err);
+	}
 
 	nbuckets = roundup_pow_of_two(num_possible_cpus());
 	/* Use at least 2 buckets, select_bucket() is undefined behavior with 1 bucket */
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 37ba5c0..7cfbabc 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -598,6 +598,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 	struct bpf_struct_ops_map *st_map;
 	const struct btf_type *t, *vt;
 	struct bpf_map *map;
+	int err;
 
 	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
@@ -624,7 +625,11 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 
 	st_map->st_ops = st_ops;
 	map = &st_map->map;
-	bpf_map_init_from_attr(map, attr);
+	err = bpf_map_init_from_attr(map, attr);
+	if (err) {
+		bpf_map_area_free(st_map, NULL);
+		return ERR_PTR(err);
+	}
 
 	st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE, map);
 	st_map->links =
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index b593157..e672e62 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -101,7 +101,11 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	if (!cmap)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&cmap->map, attr);
+	err = bpf_map_init_from_attr(&cmap->map, attr);
+	if (err) {
+		bpf_map_area_free(cmap, NULL);
+		return ERR_PTR(err);
+	}
 
 	/* Pre-limit array size based on NR_CPUS, not final CPU check */
 	if (cmap->map.max_entries > NR_CPUS) {
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 807a4cd..10f038d 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -167,7 +167,11 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	if (!dtab)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&dtab->map, attr);
+	err = bpf_map_init_from_attr(&dtab->map, attr);
+	if (err) {
+		bpf_map_area_free(dtab, NULL);
+		return ERR_PTR(err);
+	}
 
 	err = dev_map_init_map(dtab, attr);
 	if (err) {
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 975bcc1..705ffdd 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -509,7 +509,11 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	if (!htab)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&htab->map, attr);
+	err = bpf_map_init_from_attr(&htab->map, attr);
+	if (err) {
+		bpf_map_area_free(htab, NULL);
+		return ERR_PTR(err);
+	}
 
 	lockdep_register_key(&htab->lockdep_key);
 
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index fcc7ece..1901195 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -287,6 +287,7 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
 	__u32 max_value_size = BPF_LOCAL_STORAGE_MAX_VALUE_SIZE;
 	int numa_node = bpf_map_attr_numa_node(attr);
 	struct bpf_cgroup_storage_map *map;
+	int err;
 
 	/* percpu is bound by PCPU_MIN_UNIT_SIZE, non-percu
 	 * is the same as other local storages.
@@ -318,7 +319,11 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 
 	/* copy mandatory map attributes */
-	bpf_map_init_from_attr(&map->map, attr);
+	err = bpf_map_init_from_attr(&map->map, attr);
+	if (err) {
+		bpf_map_area_free(map, NULL);
+		return ERR_PTR(err);
+	}
 
 	spin_lock_init(&map->lock);
 	map->root = RB_ROOT;
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 3d329ae..38d7b00 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -543,6 +543,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
 static struct bpf_map *trie_alloc(union bpf_attr *attr)
 {
 	struct lpm_trie *trie;
+	int err;
 
 	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
@@ -563,7 +564,12 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 
 	/* copy mandatory map attributes */
-	bpf_map_init_from_attr(&trie->map, attr);
+	err = bpf_map_init_from_attr(&trie->map, attr);
+	if (err) {
+		bpf_map_area_free(trie, NULL);
+		return ERR_PTR(err);
+	}
+
 	trie->data_size = attr->key_size -
 			  offsetof(struct bpf_lpm_trie_key, data);
 	trie->max_prefixlen = trie->data_size * 8;
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 87c59da..dba7ed2 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -376,7 +376,11 @@ struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr)
 	if (!offmap)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&offmap->map, attr);
+	err = bpf_map_init_from_attr(&offmap->map, attr);
+	if (err) {
+		bpf_map_area_free(offmap, NULL);
+		return ERR_PTR(err);
+	}
 
 	rtnl_lock();
 	down_write(&bpf_devs_lock);
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index bf57e45..f231897 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -70,6 +70,7 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
 	int numa_node = bpf_map_attr_numa_node(attr);
 	struct bpf_queue_stack *qs;
 	u64 size, queue_size;
+	int err;
 
 	size = (u64) attr->max_entries + 1;
 	queue_size = sizeof(*qs) + size * attr->value_size;
@@ -78,7 +79,11 @@ static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr)
 	if (!qs)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&qs->map, attr);
+	err = bpf_map_init_from_attr(&qs->map, attr);
+	if (err) {
+		bpf_map_area_free(qs, NULL);
+		return ERR_PTR(err);
+	}
 
 	qs->size = size;
 
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 447f664..f9604f3 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -150,6 +150,7 @@ static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 {
 	int numa_node = bpf_map_attr_numa_node(attr);
 	struct reuseport_array *array;
+	int err;
 
 	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
@@ -160,7 +161,11 @@ static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 
 	/* copy mandatory map attributes */
-	bpf_map_init_from_attr(&array->map, attr);
+	err = bpf_map_init_from_attr(&array->map, attr);
+	if (err) {
+		bpf_map_area_free(array, NULL);
+		return ERR_PTR(err);
+	}
 
 	return &array->map;
 }
diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
index 1e7284c..3edefd3 100644
--- a/kernel/bpf/ringbuf.c
+++ b/kernel/bpf/ringbuf.c
@@ -185,6 +185,7 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node,
 static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 {
 	struct bpf_ringbuf_map *rb_map;
+	int err;
 
 	if (attr->map_flags & ~RINGBUF_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
@@ -204,7 +205,11 @@ static struct bpf_map *ringbuf_map_alloc(union bpf_attr *attr)
 	if (!rb_map)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&rb_map->map, attr);
+	err = bpf_map_init_from_attr(&rb_map->map, attr);
+	if (err) {
+		bpf_map_area_free(rb_map, NULL);
+		return ERR_PTR(err);
+	}
 
 	rb_map->rb = bpf_ringbuf_alloc(attr->max_entries, rb_map->map.numa_node,
 				       &rb_map->map);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 034accd..f710495 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -403,7 +403,7 @@ static u32 bpf_map_flags_retain_permanent(u32 flags)
 	return flags & ~(BPF_F_RDONLY | BPF_F_WRONLY);
 }
 
-void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
+int bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 {
 	bpf_map_save_memcg(map);
 	map->map_type = attr->map_type;
@@ -413,6 +413,8 @@ void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 	map->map_flags = bpf_map_flags_retain_permanent(attr->map_flags);
 	map->numa_node = bpf_map_attr_numa_node(attr);
 	map->map_extra = attr->map_extra;
+
+	return 0;
 }
 
 static int bpf_map_alloc_id(struct bpf_map *map)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 25a5ac4..67c502d 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -31,6 +31,7 @@ static int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog,
 static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 {
 	struct bpf_stab *stab;
+	int err;
 
 	if (!capable(CAP_NET_ADMIN))
 		return ERR_PTR(-EPERM);
@@ -45,7 +46,12 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	if (!stab)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&stab->map, attr);
+	err = bpf_map_init_from_attr(&stab->map, attr);
+	if (err) {
+		bpf_map_area_free(stab, NULL);
+		return ERR_PTR(err);
+	}
+
 	raw_spin_lock_init(&stab->lock);
 
 	stab->sks = bpf_map_area_alloc((u64) stab->map.max_entries *
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index beb11fd..8e7a5a6 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -63,6 +63,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 	struct xsk_map *m;
 	int numa_node;
 	u64 size;
+	int err;
 
 	if (!capable(CAP_NET_ADMIN))
 		return ERR_PTR(-EPERM);
@@ -79,7 +80,12 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 	if (!m)
 		return ERR_PTR(-ENOMEM);
 
-	bpf_map_init_from_attr(&m->map, attr);
+	err = bpf_map_init_from_attr(&m->map, attr);
+	if (err) {
+		bpf_map_area_free(m, NULL);
+		return ERR_PTR(err);
+	}
+
 	spin_lock_init(&m->lock);
 
 	return &m->map;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH bpf-next v3 13/13] bpf: Introduce selectable memcg for bpf map
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (11 preceding siblings ...)
  2022-09-02  2:30 ` [PATCH bpf-next v3 12/13] bpf: Add return value for bpf_map_init_from_attr Yafang Shao
@ 2022-09-02  2:30 ` Yafang Shao
  2022-09-07 15:43 ` [PATCH bpf-next v3 00/13] " Tejun Heo
  13 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-02  2:30 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, tj, lizefan.x
  Cc: cgroups, netdev, bpf, linux-mm, Yafang Shao

A new member, memcg_fd, is introduced into the bpf attr of the
BPF_MAP_CREATE command; it is the fd of an opened cgroup directory in
which the memory subsystem must be enabled. A valid memcg_fd must be a
positive number, which means it can't be zero (a valid return value of
open(2)). Once the kernel gets the memory cgroup from this fd, it sets
this memcg in the bpf map, and all subsequent memory allocations of this
map will be charged to that memcg. The map creation paths in libbpf are
changed accordingly.

Currently we only allow selecting one of its ancestors, to avoid breaking
the memcg hierarchy further. For example, we can select its parent, another
ancestor, or the root memcg. Possible use cases of the selectable memcg
are as follows,
- Select the root memcg as bpf-map's memcg
  Then bpf-map's memory won't be throttled by current memcg limit.
- Put current memcg under a fixed memcg dir and select the fixed memcg
  as bpf-map's memcg
  The hierarchy as follows,

      Parent-memcg (A fixed dir, i.e. /sys/fs/cgroup/memory/bpf)
         \
        Current-memcg (Container dir, i.e. /sys/fs/cgroup/memory/bpf/foo)

  At the map creation time, the bpf-map's memory will be charged
  into the parent directly without charging into current memcg, and thus
  current memcg's usage will be consistent among different generations.
  To limit bpf-map's memory usage, we can set the limit in the parent
  memcg.

Below is an example of how to use this new API:
	struct bpf_map_create_opts map_opts = {
		.sz = sizeof(map_opts),
	};
	int memcg_fd, map_fd, old_fd;
	int key, value;

	memcg_fd = open("/sys/fs/cgroup/memory/bpf", O_DIRECTORY);
	if (memcg_fd < 0) {
		perror("memcg dir open");
		return -1;
	}

	/* 0 is an invalid memcg_fd, so move it off fd 0 */
	if (memcg_fd == 0) {
		old_fd = memcg_fd;
		memcg_fd = fcntl(memcg_fd, F_DUPFD_CLOEXEC, 3);
		close(old_fd);
		if (memcg_fd < 0) {
			perror("fcntl");
			return -1;
		}
	}

	map_opts.memcg_fd = memcg_fd;
	map_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "map_for_memcg",
				sizeof(key), sizeof(value),
				1024, &map_opts);
	if (map_fd <= 0) {
		close(memcg_fd);
		perror("map create");
		return -1;
	}

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/syscall.c           | 49 +++++++++++++++++++++++++++++++++---------
 tools/include/uapi/linux/bpf.h |  1 +
 tools/lib/bpf/bpf.c            |  3 ++-
 tools/lib/bpf/bpf.h            |  3 ++-
 tools/lib/bpf/gen_loader.c     |  2 +-
 tools/lib/bpf/libbpf.c         |  2 ++
 tools/lib/bpf/skel_internal.h  |  2 +-
 8 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 962960a..9121c4f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1319,6 +1319,7 @@ struct bpf_stack_build_id {
 		 * to using 5 hash functions).
 		 */
 		__u64	map_extra;
+		__u32	memcg_fd;	/* selectable memcg */
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f710495..1b1af68 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -294,14 +294,37 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static void bpf_map_save_memcg(struct bpf_map *map)
+static int bpf_map_save_memcg(struct bpf_map *map, u32 memcg_fd)
 {
-	/* Currently if a map is created by a process belonging to the root
-	 * memory cgroup, get_obj_cgroup_from_current() will return NULL.
-	 * So we have to check map->objcg for being NULL each time it's
-	 * being used.
-	 */
-	map->objcg = get_obj_cgroup_from_current();
+	struct obj_cgroup *objcg;
+	struct cgroup *cgrp;
+
+	if (memcg_fd) {
+		cgrp = cgroup_get_from_fd(memcg_fd);
+		if (IS_ERR(cgrp))
+			return -EINVAL;
+
+		objcg = get_obj_cgroup_from_cgroup(cgrp);
+		cgroup_put(cgrp);
+		if (IS_ERR(objcg))
+			return PTR_ERR(objcg);
+
+		/* Currently we only allow to select its ancestors. */
+		if (objcg && !task_under_memcg_hierarchy(current, objcg->memcg)) {
+			obj_cgroup_put(objcg);
+			return -EINVAL;
+		}
+	} else {
+		/* Currently if a map is created by a process belonging to the root
+		 * memory cgroup, get_obj_cgroup_from_current() will return NULL.
+		 * So we have to check map->objcg for being NULL each time it's
+		 * being used.
+		 */
+		objcg = get_obj_cgroup_from_current();
+	}
+
+	map->objcg = objcg;
+	return 0;
 }
 
 static void bpf_map_release_memcg(struct bpf_map *map)
@@ -311,8 +334,9 @@ static void bpf_map_release_memcg(struct bpf_map *map)
 }
 
 #else
-static void bpf_map_save_memcg(struct bpf_map *map)
+static int bpf_map_save_memcg(struct bpf_map *map, u32 memcg_fd)
 {
+	return 0;
 }
 
 static void bpf_map_release_memcg(struct bpf_map *map)
@@ -405,7 +429,12 @@ static u32 bpf_map_flags_retain_permanent(u32 flags)
 
 int bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr)
 {
-	bpf_map_save_memcg(map);
+	int err;
+
+	err = bpf_map_save_memcg(map, attr->memcg_fd);
+	if (err)
+		return err;
+
 	map->map_type = attr->map_type;
 	map->key_size = attr->key_size;
 	map->value_size = attr->value_size;
@@ -1091,7 +1120,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	return ret;
 }
 
-#define BPF_MAP_CREATE_LAST_FIELD map_extra
+#define BPF_MAP_CREATE_LAST_FIELD memcg_fd
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
 {
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f4ba82a..fc19366 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1319,6 +1319,7 @@ struct bpf_stack_build_id {
 		 * to using 5 hash functions).
 		 */
 		__u64	map_extra;
+		__u32	memcg_fd;	/* selectable memcg */
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 1d49a03..b475b28 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -169,7 +169,7 @@ int bpf_map_create(enum bpf_map_type map_type,
 		   __u32 max_entries,
 		   const struct bpf_map_create_opts *opts)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, map_extra);
+	const size_t attr_sz = offsetofend(union bpf_attr, memcg_fd);
 	union bpf_attr attr;
 	int fd;
 
@@ -197,6 +197,7 @@ int bpf_map_create(enum bpf_map_type map_type,
 	attr.map_extra = OPTS_GET(opts, map_extra, 0);
 	attr.numa_node = OPTS_GET(opts, numa_node, 0);
 	attr.map_ifindex = OPTS_GET(opts, map_ifindex, 0);
+	attr.memcg_fd = OPTS_GET(opts, memcg_fd, 0);
 
 	fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, attr_sz);
 	return libbpf_err_errno(fd);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9c50bea..dd0d929 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -51,8 +51,9 @@ struct bpf_map_create_opts {
 
 	__u32 numa_node;
 	__u32 map_ifindex;
+	__u32 memcg_fd;
 };
-#define bpf_map_create_opts__last_field map_ifindex
+#define bpf_map_create_opts__last_field memcg_fd
 
 LIBBPF_API int bpf_map_create(enum bpf_map_type map_type,
 			      const char *map_name,
diff --git a/tools/lib/bpf/gen_loader.c b/tools/lib/bpf/gen_loader.c
index 23f5c46..f35b014 100644
--- a/tools/lib/bpf/gen_loader.c
+++ b/tools/lib/bpf/gen_loader.c
@@ -451,7 +451,7 @@ void bpf_gen__map_create(struct bpf_gen *gen,
 			 __u32 key_size, __u32 value_size, __u32 max_entries,
 			 struct bpf_map_create_opts *map_attr, int map_idx)
 {
-	int attr_size = offsetofend(union bpf_attr, map_extra);
+	int attr_size = offsetofend(union bpf_attr, memcg_fd);
 	bool close_inner_map_fd = false;
 	int map_create_attr, idx;
 	union bpf_attr attr;
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 3ad1392..ce04d93 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -512,6 +512,7 @@ struct bpf_map {
 	bool reused;
 	bool autocreate;
 	__u64 map_extra;
+	__u32 memcg_fd;
 };
 
 enum extern_type {
@@ -4948,6 +4949,7 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
 	create_attr.map_flags = def->map_flags;
 	create_attr.numa_node = map->numa_node;
 	create_attr.map_extra = map->map_extra;
+	create_attr.memcg_fd = map->memcg_fd;
 
 	if (bpf_map__is_struct_ops(map))
 		create_attr.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id;
diff --git a/tools/lib/bpf/skel_internal.h b/tools/lib/bpf/skel_internal.h
index 1e82ab0..8760747 100644
--- a/tools/lib/bpf/skel_internal.h
+++ b/tools/lib/bpf/skel_internal.h
@@ -222,7 +222,7 @@ static inline int skel_map_create(enum bpf_map_type map_type,
 				  __u32 value_size,
 				  __u32 max_entries)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, map_extra);
+	const size_t attr_sz = offsetofend(union bpf_attr, memcg_fd);
 	union bpf_attr attr;
 
 	memset(&attr, 0, attr_sz);
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
                   ` (12 preceding siblings ...)
  2022-09-02  2:30 ` [PATCH bpf-next v3 13/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
@ 2022-09-07 15:43 ` Tejun Heo
  2022-09-07 15:45   ` Tejun Heo
  2022-09-07 22:28   ` Roman Gushchin
  13 siblings, 2 replies; 33+ messages in thread
From: Tejun Heo @ 2022-09-07 15:43 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, lizefan.x, cgroups, netdev, bpf,
	linux-mm

Hello,

On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
...
> This patchset tries to resolve the above two issues by introducing a
> selectable memcg to limit the bpf memory. Currently we only allow to
> select its ancestor to avoid breaking the memcg hierarchy further. 
> Possible use cases of the selectable memcg as follows,

As discussed in the following thread, there are clear downsides to an
interface which requires the users to specify the cgroups directly.

 https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org

So, I don't really think this is an interface we wanna go for. I was hoping
to hear more from memcg folks in the above thread. Maybe ping them in that
thread and continue there?

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-07 15:43 ` [PATCH bpf-next v3 00/13] " Tejun Heo
@ 2022-09-07 15:45   ` Tejun Heo
  2022-09-07 16:13     ` Alexei Starovoitov
  2022-09-07 22:28   ` Roman Gushchin
  1 sibling, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2022-09-07 15:45 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hannes, mhocko, roman.gushchin,
	shakeelb, songmuchun, akpm, lizefan.x, cgroups, netdev, bpf,
	linux-mm

On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> ...
> > This patchset tries to resolve the above two issues by introducing a
> > selectable memcg to limit the bpf memory. Currently we only allow to
> > select its ancestor to avoid breaking the memcg hierarchy further. 
> > Possible use cases of the selectable memcg as follows,
> 
> As discussed in the following thread, there are clear downsides to an
> interface which requires the users to specify the cgroups directly.
> 
>  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> 
> So, I don't really think this is an interface we wanna go for. I was hoping
> to hear more from memcg folks in the above thread. Maybe ping them in that
> thread and continue there?

Ah, another thing. If the memcg accounting is breaking things right now, we
can easily introduce a memcg disable flag for bpf memory. That should help
alleviate the immediate breakage while we figure this out.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-07 15:45   ` Tejun Heo
@ 2022-09-07 16:13     ` Alexei Starovoitov
  2022-09-07 16:18       ` Tejun Heo
  0 siblings, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2022-09-07 16:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yafang Shao, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Zefan Li,
	open list:CONTROL GROUP (CGROUP),
	Network Development, bpf, linux-mm

On Wed, Sep 7, 2022 at 8:45 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > Hello,
> >
> > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > ...
> > > This patchset tries to resolve the above two issues by introducing a
> > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > Possible use cases of the selectable memcg as follows,
> >
> > As discussed in the following thread, there are clear downsides to an
> > interface which requires the users to specify the cgroups directly.
> >
> >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> >
> > So, I don't really think this is an interface we wanna go for. I was hoping
> > to hear more from memcg folks in the above thread. Maybe ping them in that
> > thread and continue there?
>
> Ah, another thing. If the memcg accounting is breaking things right now, we
> can easily introduce a memcg disable flag for bpf memory. That should help
> alleviating the immediate breakage while we figure this out.

Hmm. We discussed this option already. We definitely don't want
to introduce an uapi knob that will allow anyone to skip memcg
accounting today and in the future.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-07 16:13     ` Alexei Starovoitov
@ 2022-09-07 16:18       ` Tejun Heo
  2022-09-07 16:27         ` Alexei Starovoitov
  0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2022-09-07 16:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yafang Shao, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Zefan Li,
	open list:CONTROL GROUP (CGROUP),
	Network Development, bpf, linux-mm

Hello,

On Wed, Sep 07, 2022 at 09:13:09AM -0700, Alexei Starovoitov wrote:
> Hmm. We discussed this option already. We definitely don't want
> to introduce an uapi knob that will allow anyone to skip memcg
> accounting today and in the future.

cgroup.memory boot parameter is how memcg provides last-resort workarounds
for this sort of problems / regressions while they're being addressed. It's
not a dynamically changeable or programmable thing. Just a boot time
opt-out. That said, if you don't want it, you don't want it.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-07 16:18       ` Tejun Heo
@ 2022-09-07 16:27         ` Alexei Starovoitov
  2022-09-07 17:01           ` Tejun Heo
  0 siblings, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2022-09-07 16:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yafang Shao, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Zefan Li,
	open list:CONTROL GROUP (CGROUP),
	Network Development, bpf, linux-mm

On Wed, Sep 7, 2022 at 9:18 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Sep 07, 2022 at 09:13:09AM -0700, Alexei Starovoitov wrote:
> > Hmm. We discussed this option already. We definitely don't want
> > to introduce an uapi knob that will allow anyone to skip memcg
> > accounting today and in the future.
>
> cgroup.memory boot parameter is how memcg provides last-resort workarounds
> for this sort of problems / regressions while they're being addressed. It's
> not a dynamically changeable or programmable thing. Just a boot time
> opt-out. That said, if you don't want it, you don't want it.

ahh. boot param.
Are you suggesting a global off switch? Like nosocket and nokmem.
That would be a different story.
Need to think more about it. It could be ok.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-07 16:27         ` Alexei Starovoitov
@ 2022-09-07 17:01           ` Tejun Heo
  2022-09-08  2:44             ` Yafang Shao
  0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2022-09-07 17:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yafang Shao, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Zefan Li,
	open list:CONTROL GROUP (CGROUP),
	Network Development, bpf, linux-mm

On Wed, Sep 07, 2022 at 09:27:09AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 7, 2022 at 9:18 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Wed, Sep 07, 2022 at 09:13:09AM -0700, Alexei Starovoitov wrote:
> > > Hmm. We discussed this option already. We definitely don't want
> > > to introduce an uapi knob that will allow anyone to skip memcg
> > > accounting today and in the future.
> >
> > cgroup.memory boot parameter is how memcg provides last-resort workarounds
> > for this sort of problems / regressions while they're being addressed. It's
> > not a dynamically changeable or programmable thing. Just a boot time
> > opt-out. That said, if you don't want it, you don't want it.
> 
> ahh. boot param.
> Are you suggesting a global off switch ? Like nosocket and nokmem.
> That would be a different story.
> Need to think more about it. It could be ok.

Yeah, nobpf or sth like that. An equivalent cgroup.memory parameter.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-07 15:43 ` [PATCH bpf-next v3 00/13] " Tejun Heo
  2022-09-07 15:45   ` Tejun Heo
@ 2022-09-07 22:28   ` Roman Gushchin
  2022-09-08  2:37     ` Yafang Shao
  1 sibling, 1 reply; 33+ messages in thread
From: Roman Gushchin @ 2022-09-07 22:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yafang Shao, ast, daniel, andrii, kafai, songliubraving, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, hannes, mhocko,
	shakeelb, songmuchun, akpm, lizefan.x, cgroups, netdev, bpf,
	linux-mm

On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> ...
> > This patchset tries to resolve the above two issues by introducing a
> > selectable memcg to limit the bpf memory. Currently we only allow to
> > select its ancestor to avoid breaking the memcg hierarchy further. 
> > Possible use cases of the selectable memcg as follows,
> 
> As discussed in the following thread, there are clear downsides to an
> interface which requires the users to specify the cgroups directly.
> 
>  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> 
> So, I don't really think this is an interface we wanna go for. I was hoping
> to hear more from memcg folks in the above thread. Maybe ping them in that
> thread and continue there?

As I said previously, I don't like it, because it's an attempt to solve a non
bpf-specific problem in a bpf-specific way.

Yes, memory cgroups are not great for accounting of shared resources, it's well
known. This patchset looks like an attempt to "fix" it specifically for bpf maps
in a particular cgroup setup. Honestly, I don't think it's worth the added
complexity. Especially because a similar behaviour can be achieved simply
by placing the task which creates the map into the desired cgroup.
Beautiful? Not. Neither is the proposed solution.

Thanks!


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-07 22:28   ` Roman Gushchin
@ 2022-09-08  2:37     ` Yafang Shao
  2022-09-08  2:43       ` Alexei Starovoitov
  2022-09-08 16:13       ` Roman Gushchin
  0 siblings, 2 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-08  2:37 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin Lau, Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
	Cgroups, netdev, bpf, Linux MM

On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > Hello,
> >
> > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > ...
> > > This patchset tries to resolve the above two issues by introducing a
> > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > Possible use cases of the selectable memcg as follows,
> >
> > As discussed in the following thread, there are clear downsides to an
> > interface which requires the users to specify the cgroups directly.
> >
> >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> >
> > So, I don't really think this is an interface we wanna go for. I was hoping
> > to hear more from memcg folks in the above thread. Maybe ping them in that
> > thread and continue there?
>

Hi Roman,

> As I said previously, I don't like it, because it's an attempt to solve a non
> bpf-specific problem in a bpf-specific way.
>

Why do you still insist that bpf_map->memcg is not a bpf-specific
issue after so many discussions?
Do you charge the bpf-map's memory the same way as you charge the page
caches or slabs?
No, you don't. You charge it in a bpf-specific way.

> Yes, memory cgroups are not great for accounting of shared resources, it's well
> known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> in a particular cgroup setup. Honestly, I don't think it's worth the added
> complexity. Especially because a similar behaviour can be achieved simple
> by placing the task which creates the map into the desired cgroup.

Are you serious?
Have you ever read the cgroup doc, which clearly describes the "No
Internal Process Constraint"? [1]
Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.

[1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt

> Beatiful? Not. Neither is the proposed solution.
>

Is it really hard to admit a fault?

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-08  2:37     ` Yafang Shao
@ 2022-09-08  2:43       ` Alexei Starovoitov
  2022-09-08  2:48         ` Yafang Shao
  2022-09-08 16:13       ` Roman Gushchin
  1 sibling, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2022-09-08  2:43 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Roman Gushchin, Tejun Heo, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
	john fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM

On Wed, Sep 7, 2022 at 7:37 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > ...
> > > > This patchset tries to resolve the above two issues by introducing a
> > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > Possible use cases of the selectable memcg as follows,
> > >
> > > As discussed in the following thread, there are clear downsides to an
> > > interface which requires the users to specify the cgroups directly.
> > >
> > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > >
> > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > thread and continue there?
> >
>
> Hi Roman,
>
> > As I said previously, I don't like it, because it's an attempt to solve a non
> > bpf-specific problem in a bpf-specific way.
> >
>
> Why do you still insist that bpf_map->memcg is not a bpf-specific
> issue after so many discussions?
> Do you charge the bpf-map's memory the same way as you charge the page
> caches or slabs ?
> No, you don't. You charge it in a bpf-specific way.
>
> > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > complexity. Especially because a similar behaviour can be achieved simple
> > by placing the task which creates the map into the desired cgroup.
>
> Are you serious ?
> Have you ever read the cgroup doc? Which clearly describe the "No
> Internal Process Constraint".[1]
> Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
>
> [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
>
> > Beatiful? Not. Neither is the proposed solution.
> >
>
> Is it really hard to admit a fault?

Yafang,

This attitude won't get you anywhere.

Selecting memcg by fd is no go.
You need to work with the community to figure out a solution
acceptable to maintainers of relevant subsystems.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-07 17:01           ` Tejun Heo
@ 2022-09-08  2:44             ` Yafang Shao
  0 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-08  2:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Zefan Li,
	open list:CONTROL GROUP (CGROUP),
	Network Development, bpf, linux-mm

On Thu, Sep 8, 2022 at 1:01 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Wed, Sep 07, 2022 at 09:27:09AM -0700, Alexei Starovoitov wrote:
> > On Wed, Sep 7, 2022 at 9:18 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > On Wed, Sep 07, 2022 at 09:13:09AM -0700, Alexei Starovoitov wrote:
> > > > Hmm. We discussed this option already. We definitely don't want
> > > > to introduce an uapi knob that will allow anyone to skip memcg
> > > > accounting today and in the future.
> > >
> > > cgroup.memory boot parameter is how memcg provides last-resort workarounds
> > > for this sort of problems / regressions while they're being addressed. It's
> > > not a dynamically changeable or programmable thing. Just a boot time
> > > opt-out. That said, if you don't want it, you don't want it.
> >
> > ahh. boot param.
> > Are you suggesting a global off switch ? Like nosocket and nokmem.
> > That would be a different story.
> > Need to think more about it. It could be ok.
>
> Yeah, nobpf or sth like that. An equivalent cgroup.memory parameter.
>

It may be a useful feature for some cases, but it can't help container users.
Memcg works well to limit non-pinned bpf maps; that's the reason why we,
as container users, switched to memcg-based bpf charging.
Our goal is to make it also work for pinned bpf maps.

That said, your proposal may be a useful feature, but it should be
a separate patchset.
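
(For reference, such a global switch could plug into the existing
cgroup.memory= boot parameter parser in mm/memcontrol.c roughly as
below; the "nobpf" token and the cgroup_memory_nobpf flag are only a
sketch of the idea, not code from any tree.)

    static bool cgroup_memory_nobpf;          /* hypothetical opt-out flag */

    static int __init cgroup_memory(char *s)
    {
            char *token;

            while ((token = strsep(&s, ",")) != NULL) {
                    if (!*token)
                            continue;
                    if (!strcmp(token, "nosocket"))
                            cgroup_memory_nosocket = true;
                    if (!strcmp(token, "nokmem"))
                            cgroup_memory_nokmem = true;
                    if (!strcmp(token, "nobpf"))   /* the proposed opt-out */
                            cgroup_memory_nobpf = true;
            }
            return 1;
    }
    __setup("cgroup.memory=", cgroup_memory);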

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-08  2:43       ` Alexei Starovoitov
@ 2022-09-08  2:48         ` Yafang Shao
  0 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-08  2:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Roman Gushchin, Tejun Heo, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
	john fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Andrew Morton, Zefan Li, Cgroups, netdev, bpf, Linux MM

On Thu, Sep 8, 2022 at 10:43 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Sep 7, 2022 at 7:37 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > ...
> > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > Possible use cases of the selectable memcg as follows,
> > > >
> > > > As discussed in the following thread, there are clear downsides to an
> > > > interface which requires the users to specify the cgroups directly.
> > > >
> > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > >
> > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > thread and continue there?
> > >
> >
> > Hi Roman,
> >
> > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > bpf-specific problem in a bpf-specific way.
> > >
> >
> > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > issue after so many discussions?
> > Do you charge the bpf-map's memory the same way as you charge the page
> > caches or slabs ?
> > No, you don't. You charge it in a bpf-specific way.
> >
> > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > complexity. Especially because a similar behaviour can be achieved simple
> > > by placing the task which creates the map into the desired cgroup.
> >
> > Are you serious ?
> > Have you ever read the cgroup doc? Which clearly describe the "No
> > Internal Process Constraint".[1]
> > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> >
> > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> >
> > > Beatiful? Not. Neither is the proposed solution.
> > >
> >
> > Is it really hard to admit a fault?
>
> Yafang,
>
> This attitude won't get you anywhere.
>

Thanks for pointing it out. It is my fault.

> Selecting memcg by fd is no go.
> You need to work with the community to figure out a solution
> acceptable to maintainers of relevant subsystems.

Yes, I'm trying.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-08  2:37     ` Yafang Shao
  2022-09-08  2:43       ` Alexei Starovoitov
@ 2022-09-08 16:13       ` Roman Gushchin
  2022-09-13  6:15         ` Yafang Shao
  1 sibling, 1 reply; 33+ messages in thread
From: Roman Gushchin @ 2022-09-08 16:13 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Tejun Heo, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin Lau, Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
	Cgroups, netdev, bpf, Linux MM

On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > ...
> > > > This patchset tries to resolve the above two issues by introducing a
> > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > Possible use cases of the selectable memcg as follows,
> > >
> > > As discussed in the following thread, there are clear downsides to an
> > > interface which requires the users to specify the cgroups directly.
> > >
> > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > >
> > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > thread and continue there?
> >
> 
> Hi Roman,
> 
> > As I said previously, I don't like it, because it's an attempt to solve a non
> > bpf-specific problem in a bpf-specific way.
> >
> 
> Why do you still insist that bpf_map->memcg is not a bpf-specific
> issue after so many discussions?
> Do you charge the bpf-map's memory the same way as you charge the page
> caches or slabs ?
> No, you don't. You charge it in a bpf-specific way.

The only difference is that we charge the cgroup of the process that
created the map, not of the process that is doing a specific allocation.
Your patchset doesn't change this.
There are pros and cons to this approach; we discussed them back
when bpf memcg accounting was developed. If you want to revisit this,
it may be possible (given that a really strong and likely new motivation
appears), but I haven't seen any complaints yet except from you.
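
(For context, the current charging works roughly like this: the memcg is
captured from the creating task at map creation time, and later
allocations for the map are charged to that memcg via set_active_memcg(),
no matter which task triggers them. Simplified from kernel/bpf/syscall.c.)

    static void bpf_map_save_memcg(struct bpf_map *map)
    {
            /* Remember the memcg of the task that creates the map. */
            map->memcg = get_mem_cgroup_from_mm(current->mm);
    }

    void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size,
                               gfp_t flags, int node)
    {
            struct mem_cgroup *old_memcg;
            void *ptr;

            /* Charge to the map's memcg, not the caller's. */
            old_memcg = set_active_memcg(map->memcg);
            ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
            set_active_memcg(old_memcg);

            return ptr;
    }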

> 
> > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > complexity. Especially because a similar behaviour can be achieved simple
> > by placing the task which creates the map into the desired cgroup.
> 
> Are you serious ?
> Have you ever read the cgroup doc? Which clearly describe the "No
> Internal Process Constraint".[1]
> Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.

But you can place it into another leaf cgroup. You can delete this leaf cgroup
and your memcg will get reparented. You can attach this process to the parent
cgroup and create the bpf map there before it gets child cgroups.
You can revisit the idea of shared bpf maps that outlive specific cgroups.
Lots of options.

> 
> [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> 
> > Beatiful? Not. Neither is the proposed solution.
> >
> 
> Is it really hard to admit a fault?

Yafang, you posted several versions and so far I haven't seen much support
or excitement from anyone (please correct me if I'm wrong). It's not like I'm
nacking a patchset with many acks, reviews and supporters.

Still think you're solving an important problem in a reasonable way?
It seems like not many are convinced yet. I'd recommend focusing on this
instead of blaming me.

Thanks!


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-08 16:13       ` Roman Gushchin
@ 2022-09-13  6:15         ` Yafang Shao
  2022-09-16 16:53           ` Roman Gushchin
  0 siblings, 1 reply; 33+ messages in thread
From: Yafang Shao @ 2022-09-13  6:15 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin Lau, Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
	Cgroups, netdev, bpf, Linux MM

On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > ...
> > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > Possible use cases of the selectable memcg as follows,
> > > >
> > > > As discussed in the following thread, there are clear downsides to an
> > > > interface which requires the users to specify the cgroups directly.
> > > >
> > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > >
> > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > thread and continue there?
> > >
> >
> > Hi Roman,
> >
> > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > bpf-specific problem in a bpf-specific way.
> > >
> >
> > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > issue after so many discussions?
> > Do you charge the bpf-map's memory the same way as you charge the page
> > caches or slabs ?
> > No, you don't. You charge it in a bpf-specific way.
>

Hi Roman,

Sorry for the late response.
I've been on vacation in the past few days.

> The only difference is that we charge the cgroup of the processes who
> created a map, not a process who is doing a specific allocation.

This means the bpf-map can be independent of the process; IOW, the memcg
of the bpf-map can be independent of the memcg of the processes.
This is the fundamental difference between bpf-map and page caches, then...

> Your patchset doesn't change this.

We can make this behavior reasonable by introducing an independent
memcg, as I did in the previous version.

> There are pros and cons with this approach, we've discussed it back
> to the times when bpf memcg accounting was developed. If you want
> to revisit this, it's maybe possible (given there is a really strong and likely
> new motivation appears), but I haven't seen any complaints yet except from you.
>

memcg-based bpf accounting is a new feature, which may not be widely used yet.

> >
> > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > complexity. Especially because a similar behaviour can be achieved simple
> > > by placing the task which creates the map into the desired cgroup.
> >
> > Are you serious ?
> > Have you ever read the cgroup doc? Which clearly describe the "No
> > Internal Process Constraint".[1]
> > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
>
> But you can place it into another leaf cgroup. You can delete this leaf cgroup
> and your memcg will get reparented. You can attach this process and create
> a bpf map to the parent cgroup before it gets child cgroups.

If the process doesn't exit after it creates the bpf-map, we have to
migrate it around memcgs...
The added deployment complexity can easily introduce unexpected issues.

> You can revisit the idea of shared bpf maps and outlive specific cgroups.
> Lof of options.
>
> >
> > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> >
> > > Beatiful? Not. Neither is the proposed solution.
> > >
> >
> > Is it really hard to admit a fault?
>
> Yafang, you posted several versions and so far I haven't seen much of support
> or excitement from anyone (please, fix me if I'm wrong). It's not like I'm
> nacking a patchset with many acks, reviews and supporters.
>
> Still think you're solving an important problem in a reasonable way?
> It seems like not many are convinced yet. I'd recommend to focus on this instead
> of blaming me.
>

The best way so far is to introduce a specific memcg for specific resources,
because not only does a process own its memcg, but specific resources also
own their memcgs, for example a bpf-map or a socket.

struct bpf_map {                                 <<<< memcg owner
    struct mem_cgroup *memcg;
};

struct sock {                                       <<<< memcg owner
    struct mem_cgroup *sk_memcg;
};

These resources already have their own memcgs, so we should make this
behavior formal.

The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
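
(In code terms, the selectable memcg in this series boils down to
something like the sketch below at map creation time; the helper names
here are illustrative, not necessarily the ones used in the patches.)

    static int bpf_map_select_memcg(struct bpf_map *map, int memcg_fd)
    {
            struct mem_cgroup *self = get_mem_cgroup_from_mm(current->mm);
            struct mem_cgroup *memcg;
            int err = 0;

            memcg = mem_cgroup_get_from_fd(memcg_fd);   /* illustrative helper */
            if (IS_ERR(memcg)) {
                    err = PTR_ERR(memcg);
                    goto out;
            }
            /* Only an ancestor of the creator's memcg may be selected,
             * so the memcg hierarchy isn't broken any further. */
            if (!mem_cgroup_is_descendant(self, memcg)) {
                    mem_cgroup_put(memcg);
                    err = -EINVAL;
                    goto out;
            }
            map->memcg = memcg;     /* instead of the creator's own memcg */
    out:
            mem_cgroup_put(self);
            return err;
    }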

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-13  6:15         ` Yafang Shao
@ 2022-09-16 16:53           ` Roman Gushchin
  2022-09-18  3:44             ` Yafang Shao
  0 siblings, 1 reply; 33+ messages in thread
From: Roman Gushchin @ 2022-09-16 16:53 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Tejun Heo, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin Lau, Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
	Cgroups, netdev, bpf, Linux MM

On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote:
> On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > > Hello,
> > > > >
> > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > > ...
> > > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > > Possible use cases of the selectable memcg as follows,
> > > > >
> > > > > As discussed in the following thread, there are clear downsides to an
> > > > > interface which requires the users to specify the cgroups directly.
> > > > >
> > > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > > >
> > > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > > thread and continue there?
> > > >
> > >
> > > Hi Roman,
> > >
> > > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > > bpf-specific problem in a bpf-specific way.
> > > >
> > >
> > > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > > issue after so many discussions?
> > > Do you charge the bpf-map's memory the same way as you charge the page
> > > caches or slabs ?
> > > No, you don't. You charge it in a bpf-specific way.
> >
> 
> Hi Roman,
> 
> Sorry for the late response.
> I've been on vacation in the past few days.
> 
> > The only difference is that we charge the cgroup of the processes who
> > created a map, not a process who is doing a specific allocation.
> 
> This means the bpf-map can be indepent of process, IOW, the memcg of
> bpf-map can be indepent of the memcg of the processes.
> This is the fundamental difference between bpf-map and page caches, then...
> 
> > Your patchset doesn't change this.
> 
> We can make this behavior reasonable by introducing an independent
> memcg, as what I did in the previous version.
> 
> > There are pros and cons with this approach, we've discussed it back
> > to the times when bpf memcg accounting was developed. If you want
> > to revisit this, it's maybe possible (given there is a really strong and likely
> > new motivation appears), but I haven't seen any complaints yet except from you.
> >
> 
> memcg-base bpf accounting is a new feature, which may not be used widely.
> 
> > >
> > > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > > complexity. Especially because a similar behaviour can be achieved simple
> > > > by placing the task which creates the map into the desired cgroup.
> > >
> > > Are you serious ?
> > > Have you ever read the cgroup doc? Which clearly describe the "No
> > > Internal Process Constraint".[1]
> > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> >
> > But you can place it into another leaf cgroup. You can delete this leaf cgroup
> > and your memcg will get reparented. You can attach this process and create
> > a bpf map to the parent cgroup before it gets child cgroups.
> 
> If the process doesn't exit after it created bpf-map, we have to
> migrate it around memcgs....
> The complexity in deployment can introduce unexpected issues easily.
> 
> > You can revisit the idea of shared bpf maps and outlive specific cgroups.
> > Lof of options.
> >
> > >
> > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > >
> > > > Beatiful? Not. Neither is the proposed solution.
> > > >
> > >
> > > Is it really hard to admit a fault?
> >
> > Yafang, you posted several versions and so far I haven't seen much of support
> > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm
> > nacking a patchset with many acks, reviews and supporters.
> >
> > Still think you're solving an important problem in a reasonable way?
> > It seems like not many are convinced yet. I'd recommend to focus on this instead
> > of blaming me.
> >
> 
> The best way so far is to introduce specific memcg for specific resources.
> Because not only the process owns its memcg, but also specific
> resources own their memcgs, for example bpf-map, or socket.
> 
> struct bpf_map {                                 <<<< memcg owner
>     struct memcg_cgroup *memcg;
> };
> 
> struct sock {                                       <<<< memcg owner
>     struct mem_cgroup *sk_memcg;
> };
> 
> These resources already have their own memcgs, so we should make this
> behavior formal.
> 
> The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.

This is a fundamental change: cgroups have always been hierarchical groups
of processes/threads. You're basically suggesting extending them to
hierarchical groups of processes and some other objects (what's a good
definition?). It's a huge change and its scope is definitely larger
than bpf and even memory cgroups. It will raise a lot of questions:
e.g. what does it mean for other controllers (cpu, io, etc)?
Which objects can have dedicated cgroups and which not? What will the
interface look like? How will the oom handling work? Etc.

History has shown that starting small with one controller and/or a specific
use case doesn't work well for cgroups, because different resources and
controllers do not live independently. So if you really want to go this way,
you need to present the whole picture and convince many people that it's worth
it. I doubt this specific bpf map accounting issue can justify it.

Personally I know some examples where such functionality could be useful,
but I'm not yet convinced it's worth the effort and potential problems.

Thanks!


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-16 16:53           ` Roman Gushchin
@ 2022-09-18  3:44             ` Yafang Shao
  2022-09-20  2:40               ` Roman Gushchin
  0 siblings, 1 reply; 33+ messages in thread
From: Yafang Shao @ 2022-09-18  3:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin Lau, Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
	Cgroups, netdev, bpf, Linux MM

On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote:
> > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > >
> > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > > > Hello,
> > > > > >
> > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > > > ...
> > > > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > > > Possible use cases of the selectable memcg as follows,
> > > > > >
> > > > > > As discussed in the following thread, there are clear downsides to an
> > > > > > interface which requires the users to specify the cgroups directly.
> > > > > >
> > > > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > > > >
> > > > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > > > thread and continue there?
> > > > >
> > > >
> > > > Hi Roman,
> > > >
> > > > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > > > bpf-specific problem in a bpf-specific way.
> > > > >
> > > >
> > > > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > > > issue after so many discussions?
> > > > Do you charge the bpf-map's memory the same way as you charge the page
> > > > caches or slabs ?
> > > > No, you don't. You charge it in a bpf-specific way.
> > >
> >
> > Hi Roman,
> >
> > Sorry for the late response.
> > I've been on vacation in the past few days.
> >
> > > The only difference is that we charge the cgroup of the processes who
> > > created a map, not a process who is doing a specific allocation.
> >
> > This means the bpf-map can be indepent of process, IOW, the memcg of
> > bpf-map can be indepent of the memcg of the processes.
> > This is the fundamental difference between bpf-map and page caches, then...
> >
> > > Your patchset doesn't change this.
> >
> > We can make this behavior reasonable by introducing an independent
> > memcg, as what I did in the previous version.
> >
> > > There are pros and cons with this approach, we've discussed it back
> > > to the times when bpf memcg accounting was developed. If you want
> > > to revisit this, it's maybe possible (given there is a really strong and likely
> > > new motivation appears), but I haven't seen any complaints yet except from you.
> > >
> >
> > memcg-base bpf accounting is a new feature, which may not be used widely.
> >
> > > >
> > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > > > complexity. Especially because a similar behaviour can be achieved simple
> > > > > by placing the task which creates the map into the desired cgroup.
> > > >
> > > > Are you serious ?
> > > > Have you ever read the cgroup doc? Which clearly describe the "No
> > > > Internal Process Constraint".[1]
> > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> > >
> > > But you can place it into another leaf cgroup. You can delete this leaf cgroup
> > > and your memcg will get reparented. You can attach this process and create
> > > a bpf map to the parent cgroup before it gets child cgroups.
> >
> > If the process doesn't exit after it created bpf-map, we have to
> > migrate it around memcgs....
> > The complexity in deployment can introduce unexpected issues easily.
> >
> > > You can revisit the idea of shared bpf maps and outlive specific cgroups.
> > > Lof of options.
> > >
> > > >
> > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > > >
> > > > > Beatiful? Not. Neither is the proposed solution.
> > > > >
> > > >
> > > > Is it really hard to admit a fault?
> > >
> > > Yafang, you posted several versions and so far I haven't seen much of support
> > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm
> > > nacking a patchset with many acks, reviews and supporters.
> > >
> > > Still think you're solving an important problem in a reasonable way?
> > > It seems like not many are convinced yet. I'd recommend to focus on this instead
> > > of blaming me.
> > >
> >
> > The best way so far is to introduce specific memcg for specific resources.
> > Because not only the process owns its memcg, but also specific
> > resources own their memcgs, for example bpf-map, or socket.
> >
> > struct bpf_map {                                 <<<< memcg owner
> >     struct memcg_cgroup *memcg;
> > };
> >
> > struct sock {                                       <<<< memcg owner
> >     struct mem_cgroup *sk_memcg;
> > };
> >
> > These resources already have their own memcgs, so we should make this
> > behavior formal.
> >
> > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
>
> This is a fundamental change: cgroups were always hierarchical groups
> of processes/threads. You're basically suggesting to extend it to
> hierarchical groups of processes and some other objects (what's a good
> definition?).

Kind of, but not exactly.
We can do it without breaking the cgroup hierarchy. Under the current
cgroup hierarchy, the user can only echo processes/threads into a
cgroup, and that won't change in the future. The specific resources
are not exposed to the user; the user can only control them by
controlling their associated processes/threads.
For example,

                Memcg-A
                       |---- Memcg-A1
                       |---- Memcg-A2

We can introduce a new file memory.owner into each memcg. Each bit of
memory.owner represents a specific resource,

 memory.owner: | bit31 | bitN | ... | bit1 | bit0 |

     bit0: bpf memory
     bit1: socket memory
     bitN: a specific resource

There won't be too many specific resources which have to own their
memcgs, so I think 32 bits is enough.

                Memcg-A : memory.owner == 0x1
                       |---- Memcg-A1 : memory.owner == 0
                       |---- Memcg-A2 : memory.owner == 0x1

Then the bpf memory created by processes in Memcg-A1 will be charged
into Memcg-A directly, without being charged into Memcg-A1.
But the bpf memory created by processes in Memcg-A2 will be charged into
Memcg-A2, as its memory.owner is 0x1.
That said, these specific resources are not fully independent of
processes; they are still associated with the processes which
create them.
Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we
don't need to worry about the possible migration issue.
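
(A sketch of the lookup this implies at charge time; memory.owner and the
owner_mask field are hypothetical here, the other helpers exist today.)

    /* Find the nearest self-or-ancestor memcg that claims ownership of
     * the given resource bit; fall back to the task's own memcg. */
    static struct mem_cgroup *memcg_resource_owner(struct mem_cgroup *memcg,
                                                   unsigned int bit)
    {
            struct mem_cgroup *iter;

            for (iter = memcg; iter; iter = parent_mem_cgroup(iter)) {
                    if (READ_ONCE(iter->owner_mask) & BIT(bit)) /* hypothetical */
                            return iter;
            }
            return memcg;
    }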

I think we may also apply it to shared page caches.  For example,
      struct inode {
          struct mem_cgroup *memcg;          <<<< add a new member
      };

We define struct inode as a memcg owner, and use scope-based charging to
charge its pages into inode->memcg.
We then put all memcgs which share these resources under the same
parent, and the page caches of this inode will be charged into the parent
directly.
The shared page cache is more complicated than bpf memory, so I'm not
quite sure if it can apply to shared page cache, but it can work well
for bpf memory.
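
(Just to illustrate the idea: with a hypothetical inode->memcg field, the
scope-based pattern would look like the snippet below; whether the page
cache charge path would actually honour the override is exactly the part
I'm unsure about.)

    static struct folio *inode_alloc_folio(struct inode *inode, gfp_t gfp)
    {
            struct mem_cgroup *old_memcg;
            struct folio *folio;

            old_memcg = set_active_memcg(inode->memcg);  /* hypothetical field */
            folio = filemap_alloc_folio(gfp, 0);
            set_active_memcg(old_memcg);

            return folio;
    }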

Regarding observability, we can introduce a specific item into
memory.stat for this specific memory, for example a new item 'bpf' for
bpf memory.
It can be accounted/unaccounted in the same way as we do with
set_active_memcg(), for example,

    struct task_struct {
        struct mem_cgroup  *active_memcg;
        int                 active_memcg_item;   <<<< introduce a new member
    };

    bpf_memory_alloc()
    {
             old_memcg = set_active_memcg(memcg);
             old_item = set_active_memcg_item(MEMCG_BPF);
             alloc();
             set_active_memcg_item(old_item);
             set_active_memcg(old_memcg);
    }

    bpf_memory_free()
    {
             old = set_active_memcg_item(MEMCG_BPF);
             free();
             set_active_memcg_item(old);
    }

Then we can know the bpf memory size in each memcg, and possibly the
system-wide bpf memory usage if it can be supported in the root memcg.
(Currently kmem is not charged into the root memcg.)
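
(For completeness, set_active_memcg_item() in the sketch above would
mirror set_active_memcg(); the active_memcg_item field is of course
hypothetical.)

    static inline int set_active_memcg_item(int item)
    {
            int old = current->active_memcg_item;    /* hypothetical field */

            current->active_memcg_item = item;
            return old;
    }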

> It's a huge change and it's scope is definetely larger
> than bpf and even memory cgroups. It will raise a lot of questions:
> e.g. what does it mean for other controllers (cpu, io, etc)?
> Which objects can have dedicated cgroups and which not? How the interface will
> look like? How the oom handling will work? Etc.
>

With the above hierarchy, I think the oom handling can work as-is.

> The history showed that starting small with one controller and/or specific
> use case isn't working well for cgroups, because different resources and
> controllers are not living independently.

Agreed. That is why I still associate the specific resources with the
processes which create them.
If all the resources remain associated with processes, I think it
can work well.

> So if you really want to go this way
> you need to present the whole picture and convince many people that it's worth
> it. I doubt this specific bpf map accounting thing can justify it.
>
> Personally I know some examples where such functionality could be useful,
> but I'm not yet convinced it's worth the effort and potential problems.
>

Thanks for your reply.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-18  3:44             ` Yafang Shao
@ 2022-09-20  2:40               ` Roman Gushchin
  2022-09-20 12:42                 ` Yafang Shao
  0 siblings, 1 reply; 33+ messages in thread
From: Roman Gushchin @ 2022-09-20  2:40 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Tejun Heo, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin Lau, Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
	Cgroups, netdev, bpf, Linux MM

On Sun, Sep 18, 2022 at 11:44:48AM +0800, Yafang Shao wrote:
> On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> >
> > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote:
> > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > >
> > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > > > > ...
> > > > > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > > > > Possible use cases of the selectable memcg as follows,
> > > > > > >
> > > > > > > As discussed in the following thread, there are clear downsides to an
> > > > > > > interface which requires the users to specify the cgroups directly.
> > > > > > >
> > > > > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > > > > >
> > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > > > > thread and continue there?
> > > > > >
> > > > >
> > > > > Hi Roman,
> > > > >
> > > > > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > > > > bpf-specific problem in a bpf-specific way.
> > > > > >
> > > > >
> > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > > > > issue after so many discussions?
> > > > > Do you charge the bpf-map's memory the same way as you charge the page
> > > > > caches or slabs ?
> > > > > No, you don't. You charge it in a bpf-specific way.
> > > >
> > >
> > > Hi Roman,
> > >
> > > Sorry for the late response.
> > > I've been on vacation in the past few days.
> > >
> > > > The only difference is that we charge the cgroup of the processes who
> > > > created a map, not a process who is doing a specific allocation.
> > >
> > > This means the bpf-map can be indepent of process, IOW, the memcg of
> > > bpf-map can be indepent of the memcg of the processes.
> > > This is the fundamental difference between bpf-map and page caches, then...
> > >
> > > > Your patchset doesn't change this.
> > >
> > > We can make this behavior reasonable by introducing an independent
> > > memcg, as what I did in the previous version.
> > >
> > > > There are pros and cons with this approach, we've discussed it back
> > > > to the times when bpf memcg accounting was developed. If you want
> > > > to revisit this, it's maybe possible (given there is a really strong and likely
> > > > new motivation appears), but I haven't seen any complaints yet except from you.
> > > >
> > >
> > > memcg-base bpf accounting is a new feature, which may not be used widely.
> > >
> > > > >
> > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > > > > complexity. Especially because a similar behaviour can be achieved simple
> > > > > > by placing the task which creates the map into the desired cgroup.
> > > > >
> > > > > Are you serious ?
> > > > > Have you ever read the cgroup doc? Which clearly describe the "No
> > > > > Internal Process Constraint".[1]
> > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> > > >
> > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup
> > > > and your memcg will get reparented. You can attach this process and create
> > > > a bpf map to the parent cgroup before it gets child cgroups.
> > >
> > > If the process doesn't exit after it created bpf-map, we have to
> > > migrate it around memcgs....
> > > The complexity in deployment can introduce unexpected issues easily.
> > >
> > > > You can revisit the idea of shared bpf maps and outlive specific cgroups.
> > > > Lof of options.
> > > >
> > > > >
> > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > > > >
> > > > > > Beatiful? Not. Neither is the proposed solution.
> > > > > >
> > > > >
> > > > > Is it really hard to admit a fault?
> > > >
> > > > Yafang, you posted several versions and so far I haven't seen much of support
> > > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm
> > > > nacking a patchset with many acks, reviews and supporters.
> > > >
> > > > Still think you're solving an important problem in a reasonable way?
> > > > It seems like not many are convinced yet. I'd recommend to focus on this instead
> > > > of blaming me.
> > > >
> > >
> > > The best way so far is to introduce specific memcg for specific resources.
> > > Because not only the process owns its memcg, but also specific
> > > resources own their memcgs, for example bpf-map, or socket.
> > >
> > > struct bpf_map {                                 <<<< memcg owner
> > >     struct memcg_cgroup *memcg;
> > > };
> > >
> > > struct sock {                                       <<<< memcg owner
> > >     struct mem_cgroup *sk_memcg;
> > > };
> > >
> > > These resources already have their own memcgs, so we should make this
> > > behavior formal.
> > >
> > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
> >
> > This is a fundamental change: cgroups were always hierarchical groups
> > of processes/threads. You're basically suggesting to extend it to
> > hierarchical groups of processes and some other objects (what's a good
> > definition?).
> 
> Kind of, but not exactly.
> We can do it without breaking the cgroup hierarchy. Under current
> cgroup hierarchy, the user can only echo processes/threads into a
> cgroup, that won't be changed in the future. The specific resources
> are not exposed to the user, the user can only control these specific
> resources by controlling their associated processes/threads.
> For example,
> 
>                 Memcg-A
>                        |---- Memcg-A1
>                        |---- Memcg-A2
> 
> We can introduce a new file memory.owner into each memcg. Each bit of
> memory.owner represents a specific resources,
> 
>  memory.owner: | bit31 | bitN | ... | bit1 | bit0 |
> 
>      bit0: bpf memory
>      bit1: socket memory
>      bitN: a specific resource
> 
> There won't be too many specific resources which have to own their
> memcgs, so I think 32bits is enough.
> 
>                 Memcg-A : memory.owner == 0x1
>                        |---- Memcg-A1 : memory.owner == 0
>                        |---- Memcg-A2 : memory.owner == 0x1
> 
> Then the bpf created by processes in Memcg-A1 will be charged into
> Memcg-A directly without charging into Memcg-A1.
> But the bpf created by processes in Memcg-A2 will be charged into
> Memcg-A2 as its memory.owner is 0x1.
> That said, these specific resources are not fully independent of
> process, while they are still associated with the processes which
> create them.
> Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we
> don't need to care about the possible migration issue.
> 
> I think we may also apply it to shared page caches.  For example,
>       struct inode {
>           struct mem_cgroup *memcg;          <<<< add a new member
>       };
> 
> We define struct inode as a memcg owner, and use scope-based charge to
> charge its pages into inode->memcg.
> And then put all memcgs which shared these resources under the same
> parent. The page caches of this inode will be charged into the parent
> directly.

Ok, so it's something like premature selective reparenting.

> The shared page cache is more complicated than bpf memory, so I'm not
> quite sure if it can apply to shared page cache, but it can work well
> for bpf memory.

Yeah, this is the problem. It feels like it's a problem very specific
to bpf maps and the exact way you use them. I don't think you can successfully
advocate for changes of this calibre without a more generic problem. I might
be wrong.

> 
> Regarding the observability, we can introduce a specific item into
> memory.stat for this specific memory. For example a new item 'bpf' for
> bpf memory.
> That can be accounted/unaccounted for in the same way as we do in
> set_active_memcg(). for example,
> 
>     struct task_struct {
>         struct mem_cgroup  *active_memcg;
>         int                 active_memcg_item;   <<<< introduce a new member
>     };
> 
>     bpf_memory_alloc()
>     {
>              old_memcg = set_active_memcg(memcg);
>              old_item = set_active_memcg_item(MEMCG_BPF);

I thought about something like this but for a different purpose:
to track the amount of memory consumed by bpf.

>              alloc();
>              set_active_memcg_item(old_item);
>              set_active_memcg(old_memcg);
>     }
> 
>     bpf_memory_free()
>     {
>              old = set_active_memcg_item(MEMCG_BPF);
>              free();
>              set_active_memcg_item(old);
>     }

But the problem is that we should very carefully mark all allocations and
releases, which is very error-prone. Interfaces which don't require annotating
releases are generally better, but require additional memory.

Thanks!


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-20  2:40               ` Roman Gushchin
@ 2022-09-20 12:42                 ` Yafang Shao
  2022-09-20 23:15                   ` Roman Gushchin
  0 siblings, 1 reply; 33+ messages in thread
From: Yafang Shao @ 2022-09-20 12:42 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin Lau, Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
	Cgroups, netdev, bpf, Linux MM

On Tue, Sep 20, 2022 at 10:40 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sun, Sep 18, 2022 at 11:44:48AM +0800, Yafang Shao wrote:
> > On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> > >
> > > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote:
> > > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > >
> > > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > >
> > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > > > > > ...
> > > > > > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > > > > > Possible use cases of the selectable memcg as follows,
> > > > > > > >
> > > > > > > > As discussed in the following thread, there are clear downsides to an
> > > > > > > > interface which requires the users to specify the cgroups directly.
> > > > > > > >
> > > > > > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > > > > > >
> > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > > > > > thread and continue there?
> > > > > > >
> > > > > >
> > > > > > Hi Roman,
> > > > > >
> > > > > > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > > > > > bpf-specific problem in a bpf-specific way.
> > > > > > >
> > > > > >
> > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > > > > > issue after so many discussions?
> > > > > > Do you charge the bpf-map's memory the same way as you charge the page
> > > > > > caches or slabs ?
> > > > > > No, you don't. You charge it in a bpf-specific way.
> > > > >
> > > >
> > > > Hi Roman,
> > > >
> > > > Sorry for the late response.
> > > > I've been on vacation in the past few days.
> > > >
> > > > > The only difference is that we charge the cgroup of the processes who
> > > > > created a map, not a process who is doing a specific allocation.
> > > >
> > > > This means the bpf-map can be indepent of process, IOW, the memcg of
> > > > bpf-map can be indepent of the memcg of the processes.
> > > > This is the fundamental difference between bpf-map and page caches, then...
> > > >
> > > > > Your patchset doesn't change this.
> > > >
> > > > We can make this behavior reasonable by introducing an independent
> > > > memcg, as what I did in the previous version.
> > > >
> > > > > There are pros and cons with this approach, we've discussed it back
> > > > > to the times when bpf memcg accounting was developed. If you want
> > > > > to revisit this, it's maybe possible (given there is a really strong and likely
> > > > > new motivation appears), but I haven't seen any complaints yet except from you.
> > > > >
> > > >
> > > > memcg-base bpf accounting is a new feature, which may not be used widely.
> > > >
> > > > > >
> > > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > > > > > complexity. Especially because a similar behaviour can be achieved simple
> > > > > > > by placing the task which creates the map into the desired cgroup.
> > > > > >
> > > > > > Are you serious ?
> > > > > > Have you ever read the cgroup doc? Which clearly describe the "No
> > > > > > Internal Process Constraint".[1]
> > > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> > > > >
> > > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup
> > > > > and your memcg will get reparented. You can attach this process and create
> > > > > a bpf map to the parent cgroup before it gets child cgroups.
> > > >
> > > > If the process doesn't exit after it created bpf-map, we have to
> > > > migrate it around memcgs....
> > > > The complexity in deployment can introduce unexpected issues easily.
> > > >
> > > > > You can revisit the idea of shared bpf maps and outlive specific cgroups.
> > > > > Lof of options.
> > > > >
> > > > > >
> > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > > > > >
> > > > > > > Beatiful? Not. Neither is the proposed solution.
> > > > > > >
> > > > > >
> > > > > > Is it really hard to admit a fault?
> > > > >
> > > > > Yafang, you posted several versions and so far I haven't seen much of support
> > > > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm
> > > > > nacking a patchset with many acks, reviews and supporters.
> > > > >
> > > > > Still think you're solving an important problem in a reasonable way?
> > > > > It seems like not many are convinced yet. I'd recommend to focus on this instead
> > > > > of blaming me.
> > > > >
> > > >
> > > > The best way so far is to introduce specific memcg for specific resources.
> > > > Because not only the process owns its memcg, but also specific
> > > > resources own their memcgs, for example bpf-map, or socket.
> > > >
> > > > struct bpf_map {                                 <<<< memcg owner
> > > >     struct memcg_cgroup *memcg;
> > > > };
> > > >
> > > > struct sock {                                       <<<< memcg owner
> > > >     struct mem_cgroup *sk_memcg;
> > > > };
> > > >
> > > > These resources already have their own memcgs, so we should make this
> > > > behavior formal.
> > > >
> > > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
> > >
> > > This is a fundamental change: cgroups were always hierarchical groups
> > > of processes/threads. You're basically suggesting to extend it to
> > > hierarchical groups of processes and some other objects (what's a good
> > > definition?).
> >
> > Kind of, but not exactly.
> > We can do it without breaking the cgroup hierarchy. Under current
> > cgroup hierarchy, the user can only echo processes/threads into a
> > cgroup, that won't be changed in the future. The specific resources
> > are not exposed to the user, the user can only control these specific
> > resources by controlling their associated processes/threads.
> > For example,
> >
> >                 Memcg-A
> >                        |---- Memcg-A1
> >                        |---- Memcg-A2
> >
> > We can introduce a new file memory.owner into each memcg. Each bit of
> > memory.owner represents a specific resources,
> >
> >  memory.owner: | bit31 | bitN | ... | bit1 | bit0 |
> > 
> >      bit0: bpf memory
> >      bit1: socket memory
> >      bitN: a specific resource
> >
> > There won't be too many specific resources which have to own their
> > memcgs, so I think 32bits is enough.
> >
> >                 Memcg-A : memory.owner == 0x1
> >                        |---- Memcg-A1 : memory.owner == 0
> >                        |---- Memcg-A2 : memory.owner == 0x1
> >
> > Then the bpf created by processes in Memcg-A1 will be charged into
> > Memcg-A directly without charging into Memcg-A1.
> > But the bpf created by processes in Memcg-A2 will be charged into
> > Memcg-A2 as its memory.owner is 0x1.
> > That said, these specific resources are not fully independent of
> > process, while they are still associated with the processes which
> > create them.
> > Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we
> > don't need to care about the possible migration issue.
> >
> > I think we may also apply it to shared page caches.  For example,
> >       struct inode {
> >           struct mem_cgroup *memcg;          <<<< add a new member
> >       };
> >
> > We define struct inode as a memcg owner, and use scope-based charge to
> > charge its pages into inode->memcg.
> > And then put all memcgs which shared these resources under the same
> > parent. The page caches of this inode will be charged into the parent
> > directly.
>
> Ok, so it's something like premature selective reparenting.
>

Right. I think it may be a good way to handle resources which may
outlive the process.

> > The shared page cache is more complicated than bpf memory, so I'm not
> > quite sure if it can apply to shared page cache, but it can work well
> > for bpf memory.
>
> Yeah, this is the problem. It feels like it's a problem very specific
> to bpf maps and an exact way you use them. I don't think you can successfully
> advocate for changes of these calibre without a more generic problem. I might
> be wrong.
>

What is your concern about this method? Are there any potential issues?

> >
> > Regarding the observability, we can introduce a specific item into
> > memory.stat for this specific memory. For example a new item 'bpf' for
> > bpf memory.
> > That can be accounted/unaccounted for in the same way as we do in
> > set_active_memcg(). for example,
> >
> >     struct task_struct {
> >         struct mem_cgroup  *active_memcg;
> >         int                 active_memcg_item;   <<<< introduce a new member
> >     };
> >
> >     bpf_memory_alloc()
> >     {
> >              old_memcg = set_active_memcg(memcg);
> >              old_item = set_active_memcg_item(MEMCG_BPF);
>
> I thought about something like this but for a different purpose:
> to track the amount of memory consumed by bpf.
>

Right, we can use it to track bpf memory consumption.

> >              alloc();
> >              set_active_memcg_item(old_item);
> >              set_active_memcg(old_memcg);
> >     }
> >
> >     bpf_memory_free()
> >     {
> >              old = set_active_memcg_item(MEMCG_BPF);
> >              free();
> >              set_active_memcg_item(old);
> >     }
>
> But the problem is that we shoud very carefully mark all allocations and
> releases, which is very error-prone. Interfaces which don't require annotating
> releases are generally better, but require additional memory.
>

If we don't annotate the releases, we have to add something to struct
page, which may not be worth it.
It is clear how the bpf memory is allocated and freed, so I think we
can start with bpf memory.
If in the future we can figure out a lightweight way to avoid
annotating the releases, then we can remove the annotations from the
bpf memory release paths.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-20 12:42                 ` Yafang Shao
@ 2022-09-20 23:15                   ` Roman Gushchin
  2022-09-21  9:36                     ` Yafang Shao
  0 siblings, 1 reply; 33+ messages in thread
From: Roman Gushchin @ 2022-09-20 23:15 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Tejun Heo, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin Lau, Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
	Cgroups, netdev, bpf, Linux MM

On Tue, Sep 20, 2022 at 08:42:36PM +0800, Yafang Shao wrote:
> On Tue, Sep 20, 2022 at 10:40 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> >
> > On Sun, Sep 18, 2022 at 11:44:48AM +0800, Yafang Shao wrote:
> > > On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin
> > > <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote:
> > > > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > >
> > > > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > > > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > > >
> > > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > > > > > > ...
> > > > > > > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > > > > > > Possible use cases of the selectable memcg as follows,
> > > > > > > > >
> > > > > > > > > As discussed in the following thread, there are clear downsides to an
> > > > > > > > > interface which requires the users to specify the cgroups directly.
> > > > > > > > >
> > > > > > > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > > > > > > >
> > > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > > > > > > thread and continue there?
> > > > > > > >
> > > > > > >
> > > > > > > Hi Roman,
> > > > > > >
> > > > > > > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > > > > > > bpf-specific problem in a bpf-specific way.
> > > > > > > >
> > > > > > >
> > > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > > > > > > issue after so many discussions?
> > > > > > > Do you charge the bpf-map's memory the same way as you charge the page
> > > > > > > caches or slabs ?
> > > > > > > No, you don't. You charge it in a bpf-specific way.
> > > > > >
> > > > >
> > > > > Hi Roman,
> > > > >
> > > > > Sorry for the late response.
> > > > > I've been on vacation in the past few days.
> > > > >
> > > > > > The only difference is that we charge the cgroup of the processes who
> > > > > > created a map, not a process who is doing a specific allocation.
> > > > >
> > > > > This means the bpf-map can be indepent of process, IOW, the memcg of
> > > > > bpf-map can be indepent of the memcg of the processes.
> > > > > This is the fundamental difference between bpf-map and page caches, then...
> > > > >
> > > > > > Your patchset doesn't change this.
> > > > >
> > > > > We can make this behavior reasonable by introducing an independent
> > > > > memcg, as what I did in the previous version.
> > > > >
> > > > > > There are pros and cons with this approach; we discussed them back
> > > > > > when bpf memcg accounting was developed. If you want to revisit this,
> > > > > > it's maybe possible (given that a really strong and likely new
> > > > > > motivation appears), but I haven't seen any complaints yet except from you.
> > > > > >
> > > > >
> > > > > memcg-based bpf accounting is a new feature, which may not be used widely.
> > > > >
> > > > > > >
> > > > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > > > > > > complexity. Especially because a similar behaviour can be achieved simply
> > > > > > > > by placing the task which creates the map into the desired cgroup.
> > > > > > >
> > > > > > > Are you serious ?
> > > > > > > Have you ever read the cgroup doc, which clearly describes the "No
> > > > > > > Internal Process Constraint"?[1]
> > > > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> > > > > >
> > > > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup
> > > > > > and your memcg will get reparented. You can attach this process and create
> > > > > > a bpf map to the parent cgroup before it gets child cgroups.
> > > > >
> > > > > If the process doesn't exit after it has created the bpf-map, we have to
> > > > > migrate it around memcgs....
> > > > > The complexity in deployment can easily introduce unexpected issues.
> > > > >
> > > > > > You can revisit the idea of bpf maps being shared and outliving specific cgroups.
> > > > > > Lots of options.
> > > > > >
> > > > > > >
> > > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > > > > > >
> > > > > > > > Beautiful? Not. Neither is the proposed solution.
> > > > > > > >
> > > > > > >
> > > > > > > Is it really hard to admit a fault?
> > > > > >
> > > > > > Yafang, you posted several versions and so far I haven't seen much support
> > > > > > or excitement from anyone (please correct me if I'm wrong). It's not like I'm
> > > > > > nacking a patchset with many acks, reviews and supporters.
> > > > > >
> > > > > > Still think you're solving an important problem in a reasonable way?
> > > > > > It seems like not many are convinced yet. I'd recommend to focus on this instead
> > > > > > of blaming me.
> > > > > >
> > > > >
> > > > > The best way so far is to introduce a specific memcg for specific resources,
> > > > > because not only does the process own its memcg, but specific
> > > > > resources also own their memcgs, for example a bpf-map or a socket.
> > > > >
> > > > > struct bpf_map {                                 <<<< memcg owner
> > > > >     struct mem_cgroup *memcg;
> > > > > };
> > > > >
> > > > > struct sock {                                       <<<< memcg owner
> > > > >     struct mem_cgroup *sk_memcg;
> > > > > };
> > > > >
> > > > > These resources already have their own memcgs, so we should make this
> > > > > behavior formal.
> > > > >
> > > > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
> > > >
> > > > This is a fundamental change: cgroups were always hierarchical groups
> > > > of processes/threads. You're basically suggesting to extend it to
> > > > hierarchical groups of processes and some other objects (what's a good
> > > > definition?).
> > >
> > > Kind of, but not exactly.
> > > We can do it without breaking the cgroup hierarchy. Under the current
> > > cgroup hierarchy, the user can only echo processes/threads into a
> > > cgroup, and that won't change in the future. The specific resources
> > > are not exposed to the user; the user can only control these specific
> > > resources by controlling their associated processes/threads.
> > > For example,
> > >
> > >                 Memcg-A
> > >                        |---- Memcg-A1
> > >                        |---- Memcg-A2
> > >
> > > We can introduce a new file memory.owner into each memcg. Each bit of
> > > memory.owner represents a specific resource,
> > >
> > >  memory.owner: | bit31 | bitN | ... | bit1 | bit0 |
> > >                            |            |      |
> > >                            |            |      |------ bit0: bpf memory
> > >                            |            |------------- bit1: socket memory
> > >                            |-------------------------- bitN: a specific resource
> > >
> > > There won't be too many specific resources which have to own their
> > > memcgs, so I think 32 bits is enough.
> > >
> > >                 Memcg-A : memory.owner == 0x1
> > >                        |---- Memcg-A1 : memory.owner == 0
> > >                        |---- Memcg-A2 : memory.owner == 0x1
> > >
> > > Then the bpf created by processes in Memcg-A1 will be charged into
> > > Memcg-A directly without charging into Memcg-A1.
> > > But the bpf created by processes in Memcg-A2 will be charged into
> > > Memcg-A2 as its memory.owner is 0x1.
> > > That is, these specific resources are not fully independent of
> > > processes; they are still associated with the processes which
> > > create them.
> > > Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we
> > > don't need to care about the possible migration issue.
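
A minimal sketch of the charging rule described above, assuming a
hypothetical owner_mask field in struct mem_cgroup that mirrors the
proposed memory.owner file (the field and the helper below do not
exist; mem_cgroup_is_root() and parent_mem_cgroup() are existing
kernel helpers):

    #include <linux/bits.h>
    #include <linux/memcontrol.h>

    #define MEMCG_OWNER_BPF   BIT(0)  /* bit0: bpf memory */
    #define MEMCG_OWNER_SOCK  BIT(1)  /* bit1: socket memory */

    /* Walk towards the root and return the nearest memcg owning @bit. */
    static struct mem_cgroup *memcg_select_owner(struct mem_cgroup *memcg,
                                                 unsigned int bit)
    {
            while (memcg && !mem_cgroup_is_root(memcg)) {
                    if (memcg->owner_mask & bit)    /* hypothetical field */
                            return memcg;
                    memcg = parent_mem_cgroup(memcg);
            }
            return memcg;   /* assumption: no owner set, fall back to root */
    }

With the example hierarchy above, memcg_select_owner() would return
Memcg-A for a bpf map created from Memcg-A1 (memory.owner == 0), and
Memcg-A2 for one created from Memcg-A2, matching the charging
behaviour described above.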
> > >
> > > I think we may also apply it to shared page caches.  For example,
> > >       struct inode {
> > >           struct mem_cgroup *memcg;          <<<< add a new member
> > >       };
> > >
> > > We define struct inode as a memcg owner, and use scope-based charge to
> > > charge its pages into inode->memcg.
> > > And then put all memcgs which shared these resources under the same
> > > parent. The page caches of this inode will be charged into the parent
> > > directly.
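
The "scope-based charge" mentioned here could, in principle, reuse the
existing set_active_memcg() interface. The snippet below is only a
sketch under that assumption; the memcg member of struct inode is
hypothetical and does not exist today:

    #include <linux/fs.h>
    #include <linux/sched/mm.h>
    #include <linux/slab.h>

    /* Charge an allocation to the inode's memcg instead of the caller's. */
    static void *inode_scoped_alloc(struct inode *inode, size_t size)
    {
            struct mem_cgroup *old;
            void *ptr;

            old = set_active_memcg(inode->memcg);    /* hypothetical field */
            ptr = kmalloc(size, GFP_KERNEL_ACCOUNT); /* charged to inode->memcg */
            set_active_memcg(old);

            return ptr;
    }

Whether the same scheme can be applied on the page-cache charging path
is exactly the open question raised below.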
> >
> > Ok, so it's something like premature selective reparenting.
> >
> 
> Right. I think it may be a good way to handle the resources which may
> outlive the process.
> 
> > > The shared page cache is more complicated than bpf memory, so I'm not
> > > quite sure if it can apply to shared page cache, but it can work well
> > > for bpf memory.
> >
> > Yeah, this is the problem. It feels like it's a problem very specific
> > to bpf maps and the exact way you use them. I don't think you can successfully
> > advocate for changes of this calibre without a more generic problem. I might
> > be wrong.
> >
> 
> What is your concern about this method? Are there any potential issues?

The issue is simple: nobody wants to support a new non-trivial cgroup interface
to solve a specific bpf accounting issue in one particular setup. Any new
interface will become an API and has to be supported for many many years,
so it has to be generic and future-proof.

If you want to go in this direction, please show that it solves a _generic_
problem, not one limited to the specific way you use bpf maps in your specific
setup. Accounting of a bpf map shared by many cgroups, which should outlive
the original memory cgroups... Idk, maybe that's how many users are using bpf
maps, but I haven't heard it yet.

There were some patches from Google folks about the tmpfs accounting, _maybe_
it's something to look at in order to get an idea about a more generic problem
and solution.

Thanks!


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map
  2022-09-20 23:15                   ` Roman Gushchin
@ 2022-09-21  9:36                     ` Yafang Shao
  0 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2022-09-21  9:36 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin Lau, Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
	Cgroups, netdev, bpf, Linux MM

On Wed, Sep 21, 2022 at 7:15 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Sep 20, 2022 at 08:42:36PM +0800, Yafang Shao wrote:
> > On Tue, Sep 20, 2022 at 10:40 AM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> > >
> > > On Sun, Sep 18, 2022 at 11:44:48AM +0800, Yafang Shao wrote:
> > > > On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin
> > > > <roman.gushchin@linux.dev> wrote:
> > > > >
> > > > > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote:
> > > > > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > >
> > > > > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > > > > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > > > > > > > ...
> > > > > > > > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > > > > > > > Possible use cases of the selectable memcg as follows,
> > > > > > > > > >
> > > > > > > > > > As discussed in the following thread, there are clear downsides to an
> > > > > > > > > > interface which requires the users to specify the cgroups directly.
> > > > > > > > > >
> > > > > > > > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > > > > > > > >
> > > > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > > > > > > > thread and continue there?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Hi Roman,
> > > > > > > >
> > > > > > > > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > > > > > > > bpf-specific problem in a bpf-specific way.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > > > > > > > issue after so many discussions?
> > > > > > > > Do you charge the bpf-map's memory the same way as you charge the page
> > > > > > > > caches or slabs ?
> > > > > > > > No, you don't. You charge it in a bpf-specific way.
> > > > > > >
> > > > > >
> > > > > > Hi Roman,
> > > > > >
> > > > > > Sorry for the late response.
> > > > > > I've been on vacation in the past few days.
> > > > > >
> > > > > > > The only difference is that we charge the cgroup of the processes who
> > > > > > > created a map, not a process who is doing a specific allocation.
> > > > > >
> > > > > > This means the bpf-map can be independent of the process; IOW, the memcg of
> > > > > > bpf-map can be independent of the memcg of the processes.
> > > > > > This is the fundamental difference between bpf-map and page caches, then...
> > > > > >
> > > > > > > Your patchset doesn't change this.
> > > > > >
> > > > > > We can make this behavior reasonable by introducing an independent
> > > > > > memcg, as I did in the previous version.
> > > > > >
> > > > > > > There are pros and cons with this approach; we discussed them back
> > > > > > > when bpf memcg accounting was developed. If you want to revisit this,
> > > > > > > it's maybe possible (given that a really strong and likely new
> > > > > > > motivation appears), but I haven't seen any complaints yet except from you.
> > > > > > >
> > > > > >
> > > > > > memcg-based bpf accounting is a new feature, which may not be used widely.
> > > > > >
> > > > > > > >
> > > > > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > > > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > > > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > > > > > > > complexity. Especially because a similar behaviour can be achieved simply
> > > > > > > > > by placing the task which creates the map into the desired cgroup.
> > > > > > > >
> > > > > > > > Are you serious ?
> > > > > > > > Have you ever read the cgroup doc, which clearly describes the "No
> > > > > > > > Internal Process Constraint"?[1]
> > > > > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> > > > > > >
> > > > > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup
> > > > > > > and your memcg will get reparented. You can attach this process and create
> > > > > > > a bpf map to the parent cgroup before it gets child cgroups.
> > > > > >
> > > > > > If the process doesn't exit after it has created the bpf-map, we have to
> > > > > > migrate it around memcgs....
> > > > > > The complexity in deployment can easily introduce unexpected issues.
> > > > > >
> > > > > > > You can revisit the idea of bpf maps being shared and outliving specific cgroups.
> > > > > > > Lots of options.
> > > > > > >
> > > > > > > >
> > > > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > > > > > > >
> > > > > > > > > Beautiful? Not. Neither is the proposed solution.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Is it really hard to admit a fault?
> > > > > > >
> > > > > > > Yafang, you posted several versions and so far I haven't seen much support
> > > > > > > or excitement from anyone (please correct me if I'm wrong). It's not like I'm
> > > > > > > nacking a patchset with many acks, reviews and supporters.
> > > > > > >
> > > > > > > Still think you're solving an important problem in a reasonable way?
> > > > > > > It seems like not many are convinced yet. I'd recommend to focus on this instead
> > > > > > > of blaming me.
> > > > > > >
> > > > > >
> > > > > > The best way so far is to introduce a specific memcg for specific resources,
> > > > > > because not only does the process own its memcg, but specific
> > > > > > resources also own their memcgs, for example a bpf-map or a socket.
> > > > > >
> > > > > > struct bpf_map {                                 <<<< memcg owner
> > > > > >     struct mem_cgroup *memcg;
> > > > > > };
> > > > > >
> > > > > > struct sock {                                       <<<< memcg owner
> > > > > >     struct mem_cgroup *sk_memcg;
> > > > > > };
> > > > > >
> > > > > > These resources already have their own memcgs, so we should make this
> > > > > > behavior formal.
> > > > > >
> > > > > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
> > > > >
> > > > > This is a fundamental change: cgroups were always hierarchical groups
> > > > > of processes/threads. You're basically suggesting to extend it to
> > > > > hierarchical groups of processes and some other objects (what's a good
> > > > > definition?).
> > > >
> > > > Kind of, but not exactly.
> > > > We can do it without breaking the cgroup hierarchy. Under the current
> > > > cgroup hierarchy, the user can only echo processes/threads into a
> > > > cgroup, and that won't change in the future. The specific resources
> > > > are not exposed to the user; the user can only control these specific
> > > > resources by controlling their associated processes/threads.
> > > > For example,
> > > >
> > > >                 Memcg-A
> > > >                        |---- Memcg-A1
> > > >                        |---- Memcg-A2
> > > >
> > > > We can introduce a new file memory.owner into each memcg. Each bit of
> > > > memory.owner represents a specific resource,
> > > >
> > > >  memory.owner: | bit31 | bitN | ... | bit1 | bit0 |
> > > >                            |            |      |
> > > >                            |            |      |------ bit0: bpf memory
> > > >                            |            |------------- bit1: socket memory
> > > >                            |-------------------------- bitN: a specific resource
> > > >
> > > > There won't be too many specific resources which have to own their
> > > > memcgs, so I think 32 bits is enough.
> > > >
> > > >                 Memcg-A : memory.owner == 0x1
> > > >                        |---- Memcg-A1 : memory.owner == 0
> > > >                        |---- Memcg-A2 : memory.owner == 0x1
> > > >
> > > > Then the bpf created by processes in Memcg-A1 will be charged into
> > > > Memcg-A directly without charging into Memcg-A1.
> > > > But the bpf created by processes in Memcg-A2 will be charged into
> > > > Memcg-A2 as its memory.owner is 0x1.
> > > > That is, these specific resources are not fully independent of
> > > > processes; they are still associated with the processes which
> > > > create them.
> > > > Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we
> > > > don't need to care about the possible migration issue.
> > > >
> > > > I think we may also apply it to shared page caches.  For example,
> > > >       struct inode {
> > > >           struct mem_cgroup *memcg;          <<<< add a new member
> > > >       };
> > > >
> > > > We define struct inode as a memcg owner, and use scope-based charge to
> > > > charge its pages into inode->memcg.
> > > > And then put all memcgs which shared these resources under the same
> > > > parent. The page caches of this inode will be charged into the parent
> > > > directly.
> > >
> > > Ok, so it's something like premature selective reparenting.
> > >
> >
> > Right. I think it may be a good way to handle the resources which may
> > outlive the process.
> >
> > > > The shared page cache is more complicated than bpf memory, so I'm not
> > > > quite sure if it can apply to shared page cache, but it can work well
> > > > for bpf memory.
> > >
> > > Yeah, this is the problem. It feels like it's a problem very specific
> > > to bpf maps and the exact way you use them. I don't think you can successfully
> > > advocate for changes of this calibre without a more generic problem. I might
> > > be wrong.
> > >
> >
> > What is your concern about this method? Are there any potential issues?
>
> The issue is simple: nobody wants to support a new non-trivial cgroup interface
> to solve a specific bpf accounting issue in one particular setup. Any new
> interface will become an API and has to be supported for many many years,
> so it has to be generic and future-proof.
>
> If you want to go in this direction, please show that it solves a _generic_
> problem, not one limited to the specific way you use bpf maps in your specific
> setup. Accounting of a bpf map shared by many cgroups, which should outlive
> the original memory cgroups... Idk, maybe that's how many users are using bpf
> maps, but I haven't heard it yet.
>
> There were some patches from Google folks about the tmpfs accounting, _maybe_
> it's something to look at in order to get an idea about a more generic problem
> and solution.
>

Hmm...
It seems that we are in a dilemma now.
We can't fix it in a memcg way, because the issue we are fixing is a
bpf-specific issue.
But we can't fix it in a bpf-specific way either...

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2022-09-21  9:37 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-02  2:29 [PATCH bpf-next v3 00/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
2022-09-02  2:29 ` [PATCH bpf-next v3 01/13] cgroup: Update the comment on cgroup_get_from_fd Yafang Shao
2022-09-02  2:29 ` [PATCH bpf-next v3 02/13] bpf: Introduce new helper bpf_map_put_memcg() Yafang Shao
2022-09-02  2:29 ` [PATCH bpf-next v3 03/13] bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM Yafang Shao
2022-09-02  2:29 ` [PATCH bpf-next v3 04/13] bpf: Call bpf_map_init_from_attr() immediately after map creation Yafang Shao
2022-09-02  2:29 ` [PATCH bpf-next v3 05/13] bpf: Save memcg in bpf_map_init_from_attr() Yafang Shao
2022-09-02  2:29 ` [PATCH bpf-next v3 06/13] bpf: Use scoped-based charge in bpf_map_area_alloc Yafang Shao
2022-09-02  2:29 ` [PATCH bpf-next v3 07/13] bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free} Yafang Shao
2022-09-02  2:29 ` [PATCH bpf-next v3 08/13] bpf: Use bpf_map_kzalloc in arraymap Yafang Shao
2022-09-02  2:29 ` [PATCH bpf-next v3 09/13] bpf: Use bpf_map_kvcalloc in bpf_local_storage Yafang Shao
2022-09-02  2:30 ` [PATCH bpf-next v3 10/13] mm, memcg: Add new helper get_obj_cgroup_from_cgroup Yafang Shao
2022-09-02  2:30 ` [PATCH bpf-next v3 11/13] mm, memcg: Add new helper task_under_memcg_hierarchy Yafang Shao
2022-09-02  2:30 ` [PATCH bpf-next v3 12/13] bpf: Add return value for bpf_map_init_from_attr Yafang Shao
2022-09-02  2:30 ` [PATCH bpf-next v3 13/13] bpf: Introduce selectable memcg for bpf map Yafang Shao
2022-09-07 15:43 ` [PATCH bpf-next v3 00/13] " Tejun Heo
2022-09-07 15:45   ` Tejun Heo
2022-09-07 16:13     ` Alexei Starovoitov
2022-09-07 16:18       ` Tejun Heo
2022-09-07 16:27         ` Alexei Starovoitov
2022-09-07 17:01           ` Tejun Heo
2022-09-08  2:44             ` Yafang Shao
2022-09-07 22:28   ` Roman Gushchin
2022-09-08  2:37     ` Yafang Shao
2022-09-08  2:43       ` Alexei Starovoitov
2022-09-08  2:48         ` Yafang Shao
2022-09-08 16:13       ` Roman Gushchin
2022-09-13  6:15         ` Yafang Shao
2022-09-16 16:53           ` Roman Gushchin
2022-09-18  3:44             ` Yafang Shao
2022-09-20  2:40               ` Roman Gushchin
2022-09-20 12:42                 ` Yafang Shao
2022-09-20 23:15                   ` Roman Gushchin
2022-09-21  9:36                     ` Yafang Shao
