* [PATCH bpf-next v4 0/7] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs
@ 2022-10-23 18:05 Yonghong Song
  2022-10-23 18:05 ` [PATCH bpf-next v4 1/7] bpf: Make struct cgroup btf id global Yonghong Song
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Yonghong Song @ 2022-10-23 18:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

There already exists a local storage implementation for cgroup-attached
bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases where non-cgroup-attached
bpf progs want to access cgroup local storage data. For example, a
tc egress prog has access to both the sk and the cgroup. It is possible to
emulate cgroup local storage with sk local storage by storing the data in the
socket, but this is wasteful since there could be many sockets belonging to a
particular cgroup. Alternatively, a separate map can be created with the cgroup
id as the key, but this introduces additional overhead to manipulate the new
map. A cgroup local storage, similar to the existing sk/inode/task storage,
should help with this use case.
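
To make the intended usage concrete, here is a minimal bpf-side sketch
(it mirrors the selftests and documentation added later in this series):

    struct {
            __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __type(key, int);
            __type(value, long);
    } cgrp_storage SEC(".maps");

    SEC("tp_btf/sys_enter")
    int BPF_PROG(on_enter, struct pt_regs *regs, long id)
    {
            struct task_struct *task = bpf_get_current_task_btf();
            long *ptr;

            /* look up (or create) this cgroup's storage for this map */
            ptr = bpf_cgrp_storage_get(&cgrp_storage, task->cgroups->dfl_cgrp, 0,
                                       BPF_LOCAL_STORAGE_GET_F_CREATE);
            if (ptr)
                    __sync_fetch_and_add(ptr, 1);
            return 0;
    }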

This patch set implements a new cgroup local storage available to
non-cgroup-attached bpf programs. In the series, Patches 1 and 2
are preparation patches. Patch 3 implements the new cgroup local storage
kernel support. Patches 4 and 5 implement libbpf and bpftool support.
Patch 6 adds two tests to validate the kernel/libbpf implementations.
Patch 7 adds documentation for the new BPF_MAP_TYPE_CGRP_STORAGE map type
and a comparison of the old and new cgroup local storage maps.

Changelogs:
  v3 -> v4:
    . fix a config guarding problem in kernel/cgroup/cgroup.c when
      the cgrp storage is freed (CONFIG_CGROUP_BPF => CONFIG_BPF_SYSCALL).
    . rename the selftest from cgroup_local_storage.c to cgrp_local_storage.c
      so the name better aligns with the map name.
    . fix a few misspellings.
  v2 -> v3:
    . fix a config-related kernel test complaint.
    . better descriptions/comments in uapi bpf.h and bpf_cgrp_storage.c.
    . factor out code for better reuse in map_alloc/map_free.
    . improved explanation in the map documentation.
  v1 -> v2:
    . change the map name from BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE to
      BPF_MAP_TYPE_CGRP_STORAGE.
    . removed support for sleepable programs.
    . changed the place of freeing cgrp local storage from put_css_set_locked()
      to css_free_rwork_fn().
    . added map documentation.

Yonghong Song (7):
  bpf: Make struct cgroup btf id global
  bpf: Refactor inode/task/sk storage map_{alloc,free}() for reuse
  bpf: Implement cgroup storage available to non-cgroup-attached bpf
    progs
  libbpf: Support new cgroup local storage
  bpftool: Support new cgroup local storage
  selftests/bpf: Add selftests for new cgroup local storage
  docs/bpf: Add documentation for new cgroup local storage

 Documentation/bpf/map_cgrp_storage.rst        | 109 +++++++
 include/linux/bpf.h                           |   3 +
 include/linux/bpf_local_storage.h             |  11 +-
 include/linux/bpf_types.h                     |   1 +
 include/linux/btf_ids.h                       |   1 +
 include/linux/cgroup-defs.h                   |   4 +
 include/uapi/linux/bpf.h                      |  50 +++-
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/bpf_cgrp_storage.c                 | 268 ++++++++++++++++++
 kernel/bpf/bpf_inode_storage.c                |  15 +-
 kernel/bpf/bpf_local_storage.c                |  39 ++-
 kernel/bpf/bpf_task_storage.c                 |  15 +-
 kernel/bpf/cgroup_iter.c                      |   2 +-
 kernel/bpf/helpers.c                          |   6 +
 kernel/bpf/syscall.c                          |   3 +-
 kernel/bpf/verifier.c                         |  13 +-
 kernel/cgroup/cgroup.c                        |   4 +
 kernel/trace/bpf_trace.c                      |   4 +
 net/core/bpf_sk_storage.c                     |  15 +-
 scripts/bpf_doc.py                            |   2 +
 .../bpf/bpftool/Documentation/bpftool-map.rst |   2 +-
 tools/bpf/bpftool/map.c                       |   2 +-
 tools/include/uapi/linux/bpf.h                |  50 +++-
 tools/lib/bpf/libbpf.c                        |   1 +
 tools/lib/bpf/libbpf_probes.c                 |   1 +
 .../bpf/prog_tests/cgrp_local_storage.c       |  92 ++++++
 .../selftests/bpf/progs/cgrp_local_storage.c  |  88 ++++++
 .../selftests/bpf/progs/cgrp_ls_recursion.c   |  70 +++++
 28 files changed, 813 insertions(+), 60 deletions(-)
 create mode 100644 Documentation/bpf/map_cgrp_storage.rst
 create mode 100644 kernel/bpf/bpf_cgrp_storage.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgrp_local_storage.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c

-- 
2.30.2



* [PATCH bpf-next v4 1/7] bpf: Make struct cgroup btf id global
  2022-10-23 18:05 [PATCH bpf-next v4 0/7] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
@ 2022-10-23 18:05 ` Yonghong Song
  2022-10-23 19:59   ` David Vernet
  2022-10-23 18:05 ` [PATCH bpf-next v4 2/7] bpf: Refactor inode/task/sk storage map_{alloc,free}() for reuse Yonghong Song
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Yonghong Song @ 2022-10-23 18:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

Make the struct cgroup btf id global so that a later patch can reuse
the same btf id.
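
For reference, a later patch in this series (patch 3) uses the now-global
id when defining the new helper protos:

    const struct bpf_func_proto bpf_cgrp_storage_get_proto = {
            ...
            .arg2_type      = ARG_PTR_TO_BTF_ID,
            .arg2_btf_id    = &bpf_cgroup_btf_id[0],
            ...
    };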

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/btf_ids.h  | 1 +
 kernel/bpf/cgroup_iter.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
index 2aea877d644f..c9744efd202f 100644
--- a/include/linux/btf_ids.h
+++ b/include/linux/btf_ids.h
@@ -265,5 +265,6 @@ MAX_BTF_TRACING_TYPE,
 };
 
 extern u32 btf_tracing_ids[];
+extern u32 bpf_cgroup_btf_id[];
 
 #endif
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
index 0d200a993489..c6ffc706d583 100644
--- a/kernel/bpf/cgroup_iter.c
+++ b/kernel/bpf/cgroup_iter.c
@@ -157,7 +157,7 @@ static const struct seq_operations cgroup_iter_seq_ops = {
 	.show   = cgroup_iter_seq_show,
 };
 
-BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+BTF_ID_LIST_GLOBAL_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
 
 static int cgroup_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
 {
-- 
2.30.2



* [PATCH bpf-next v4 2/7] bpf: Refactor inode/task/sk storage map_{alloc,free}() for reuse
  2022-10-23 18:05 [PATCH bpf-next v4 0/7] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
  2022-10-23 18:05 ` [PATCH bpf-next v4 1/7] bpf: Make struct cgroup btf id global Yonghong Song
@ 2022-10-23 18:05 ` Yonghong Song
  2022-10-24 18:02   ` sdf
  2022-10-24 20:34   ` Martin KaFai Lau
  2022-10-23 18:05 ` [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 31+ messages in thread
From: Yonghong Song @ 2022-10-23 18:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo, David Vernet

Refactor the code so that the inode/task/sk storage map_{alloc,free}()
implementations can maximally share the same code. There is no functionality
change.
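
After the refactor, each storage flavor's map_alloc/map_free callbacks
reduce to one-line wrappers around the shared helpers, e.g. for inode
storage (taken from the diff below):

    static struct bpf_map *inode_storage_map_alloc(union bpf_attr *attr)
    {
            return bpf_local_storage_map_alloc(attr, &inode_cache);
    }

    static void inode_storage_map_free(struct bpf_map *map)
    {
            bpf_local_storage_map_free(map, &inode_cache, NULL);
    }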

Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf_local_storage.h | 11 ++++-----
 kernel/bpf/bpf_inode_storage.c    | 15 ++----------
 kernel/bpf/bpf_local_storage.c    | 39 +++++++++++++++++++++++++------
 kernel/bpf/bpf_task_storage.c     | 15 ++----------
 net/core/bpf_sk_storage.c         | 15 ++----------
 5 files changed, 43 insertions(+), 52 deletions(-)

diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index 7ea18d4da84b..fdf753125778 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -116,21 +116,20 @@ static struct bpf_local_storage_cache name = {			\
 	.idx_lock = __SPIN_LOCK_UNLOCKED(name.idx_lock),	\
 }
 
-u16 bpf_local_storage_cache_idx_get(struct bpf_local_storage_cache *cache);
-void bpf_local_storage_cache_idx_free(struct bpf_local_storage_cache *cache,
-				      u16 idx);
-
 /* Helper functions for bpf_local_storage */
 int bpf_local_storage_map_alloc_check(union bpf_attr *attr);
 
-struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr);
+struct bpf_map *
+bpf_local_storage_map_alloc(union bpf_attr *attr,
+			    struct bpf_local_storage_cache *cache);
 
 struct bpf_local_storage_data *
 bpf_local_storage_lookup(struct bpf_local_storage *local_storage,
 			 struct bpf_local_storage_map *smap,
 			 bool cacheit_lockit);
 
-void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
+void bpf_local_storage_map_free(struct bpf_map *map,
+				struct bpf_local_storage_cache *cache,
 				int __percpu *busy_counter);
 
 int bpf_local_storage_map_check_btf(const struct bpf_map *map,
diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
index 5f7683b19199..34c315746d61 100644
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -226,23 +226,12 @@ static int notsupp_get_next_key(struct bpf_map *map, void *key,
 
 static struct bpf_map *inode_storage_map_alloc(union bpf_attr *attr)
 {
-	struct bpf_local_storage_map *smap;
-
-	smap = bpf_local_storage_map_alloc(attr);
-	if (IS_ERR(smap))
-		return ERR_CAST(smap);
-
-	smap->cache_idx = bpf_local_storage_cache_idx_get(&inode_cache);
-	return &smap->map;
+	return bpf_local_storage_map_alloc(attr, &inode_cache);
 }
 
 static void inode_storage_map_free(struct bpf_map *map)
 {
-	struct bpf_local_storage_map *smap;
-
-	smap = (struct bpf_local_storage_map *)map;
-	bpf_local_storage_cache_idx_free(&inode_cache, smap->cache_idx);
-	bpf_local_storage_map_free(smap, NULL);
+	bpf_local_storage_map_free(map, &inode_cache, NULL);
 }
 
 BTF_ID_LIST_SINGLE(inode_storage_map_btf_ids, struct,
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 9dc6de1cf185..f89b6d080e1f 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -346,7 +346,7 @@ int bpf_local_storage_alloc(void *owner,
 		/* Note that even first_selem was linked to smap's
 		 * bucket->list, first_selem can be freed immediately
 		 * (instead of kfree_rcu) because
-		 * bpf_local_storage_map_free() does a
+		 * __bpf_local_storage_map_free() does a
 		 * synchronize_rcu_mult (waiting for both sleepable and
 		 * normal programs) before walking the bucket->list.
 		 * Hence, no one is accessing selem from the
@@ -500,7 +500,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
 	return ERR_PTR(err);
 }
 
-u16 bpf_local_storage_cache_idx_get(struct bpf_local_storage_cache *cache)
+static u16 bpf_local_storage_cache_idx_get(struct bpf_local_storage_cache *cache)
 {
 	u64 min_usage = U64_MAX;
 	u16 i, res = 0;
@@ -524,16 +524,16 @@ u16 bpf_local_storage_cache_idx_get(struct bpf_local_storage_cache *cache)
 	return res;
 }
 
-void bpf_local_storage_cache_idx_free(struct bpf_local_storage_cache *cache,
-				      u16 idx)
+static void bpf_local_storage_cache_idx_free(struct bpf_local_storage_cache *cache,
+					     u16 idx)
 {
 	spin_lock(&cache->idx_lock);
 	cache->idx_usage_counts[idx]--;
 	spin_unlock(&cache->idx_lock);
 }
 
-void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
-				int __percpu *busy_counter)
+static void __bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
+					 int __percpu *busy_counter)
 {
 	struct bpf_local_storage_elem *selem;
 	struct bpf_local_storage_map_bucket *b;
@@ -613,7 +613,7 @@ int bpf_local_storage_map_alloc_check(union bpf_attr *attr)
 	return 0;
 }
 
-struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
+static struct bpf_local_storage_map *__bpf_local_storage_map_alloc(union bpf_attr *attr)
 {
 	struct bpf_local_storage_map *smap;
 	unsigned int i;
@@ -663,3 +663,28 @@ int bpf_local_storage_map_check_btf(const struct bpf_map *map,
 
 	return 0;
 }
+
+struct bpf_map *
+bpf_local_storage_map_alloc(union bpf_attr *attr,
+			    struct bpf_local_storage_cache *cache)
+{
+	struct bpf_local_storage_map *smap;
+
+	smap = __bpf_local_storage_map_alloc(attr);
+	if (IS_ERR(smap))
+		return ERR_CAST(smap);
+
+	smap->cache_idx = bpf_local_storage_cache_idx_get(cache);
+	return &smap->map;
+}
+
+void bpf_local_storage_map_free(struct bpf_map *map,
+				struct bpf_local_storage_cache *cache,
+				int __percpu *busy_counter)
+{
+	struct bpf_local_storage_map *smap;
+
+	smap = (struct bpf_local_storage_map *)map;
+	bpf_local_storage_cache_idx_free(cache, smap->cache_idx);
+	__bpf_local_storage_map_free(smap, busy_counter);
+}
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index 6f290623347e..020153902ef8 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -288,23 +288,12 @@ static int notsupp_get_next_key(struct bpf_map *map, void *key, void *next_key)
 
 static struct bpf_map *task_storage_map_alloc(union bpf_attr *attr)
 {
-	struct bpf_local_storage_map *smap;
-
-	smap = bpf_local_storage_map_alloc(attr);
-	if (IS_ERR(smap))
-		return ERR_CAST(smap);
-
-	smap->cache_idx = bpf_local_storage_cache_idx_get(&task_cache);
-	return &smap->map;
+	return bpf_local_storage_map_alloc(attr, &task_cache);
 }
 
 static void task_storage_map_free(struct bpf_map *map)
 {
-	struct bpf_local_storage_map *smap;
-
-	smap = (struct bpf_local_storage_map *)map;
-	bpf_local_storage_cache_idx_free(&task_cache, smap->cache_idx);
-	bpf_local_storage_map_free(smap, &bpf_task_storage_busy);
+	bpf_local_storage_map_free(map, &task_cache, &bpf_task_storage_busy);
 }
 
 BTF_ID_LIST_SINGLE(task_storage_map_btf_ids, struct, bpf_local_storage_map)
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index 94374d529ea4..3bfdc8834a5b 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -87,23 +87,12 @@ void bpf_sk_storage_free(struct sock *sk)
 
 static void bpf_sk_storage_map_free(struct bpf_map *map)
 {
-	struct bpf_local_storage_map *smap;
-
-	smap = (struct bpf_local_storage_map *)map;
-	bpf_local_storage_cache_idx_free(&sk_cache, smap->cache_idx);
-	bpf_local_storage_map_free(smap, NULL);
+	bpf_local_storage_map_free(map, &sk_cache, NULL);
 }
 
 static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr)
 {
-	struct bpf_local_storage_map *smap;
-
-	smap = bpf_local_storage_map_alloc(attr);
-	if (IS_ERR(smap))
-		return ERR_CAST(smap);
-
-	smap->cache_idx = bpf_local_storage_cache_idx_get(&sk_cache);
-	return &smap->map;
+	return bpf_local_storage_map_alloc(attr, &sk_cache);
 }
 
 static int notsupp_get_next_key(struct bpf_map *map, void *key,
-- 
2.30.2



* [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-23 18:05 [PATCH bpf-next v4 0/7] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
  2022-10-23 18:05 ` [PATCH bpf-next v4 1/7] bpf: Make struct cgroup btf id global Yonghong Song
  2022-10-23 18:05 ` [PATCH bpf-next v4 2/7] bpf: Refactor inode/task/sk storage map_{alloc,free}() for reuse Yonghong Song
@ 2022-10-23 18:05 ` Yonghong Song
  2022-10-23 20:02   ` David Vernet
  2022-10-24 19:19   ` Martin KaFai Lau
  2022-10-23 18:05 ` [PATCH bpf-next v4 4/7] libbpf: Support new cgroup local storage Yonghong Song
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 31+ messages in thread
From: Yonghong Song @ 2022-10-23 18:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

Similar to the existing sk/inode/task storage, implement cgroup local storage.

There already exists a local storage implementation for cgroup-attached
bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases where non-cgroup-attached
bpf progs want to access cgroup local storage data. For example, a
tc egress prog has access to both the sk and the cgroup. It is possible to
emulate cgroup local storage with sk local storage by storing the data in the
socket, but this is wasteful since there could be many sockets belonging to a
particular cgroup. Alternatively, a separate map can be created with the cgroup
id as the key, but this introduces additional overhead to manipulate the new
map. A cgroup local storage, similar to the existing sk/inode/task storage,
should help with this use case.

The life cycle of the storage is tied to the life cycle of the
cgroup struct, i.e., the storage is destroyed along with the owning cgroup
by a call to bpf_cgrp_storage_free() when the cgroup itself
is deleted.

The userspace map operations can be done by using a cgroup fd as a key
passed to the lookup, update and delete operations.
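
For example, a user-space sketch (not part of this patch; map_fd is assumed
to be the fd of a BPF_MAP_TYPE_CGRP_STORAGE map with a 'long' value, and
cgrp_fd an open cgroup fd):

    int err;
    long value = 1;

    /* create or overwrite the storage for this cgroup */
    err = bpf_map_update_elem(map_fd, &cgrp_fd, &value, BPF_ANY);

    /* read the value back */
    err = bpf_map_lookup_elem(map_fd, &cgrp_fd, &value);

    /* and remove it */
    err = bpf_map_delete_elem(map_fd, &cgrp_fd);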

Typically, the following code is used to get the current cgroup:
    struct task_struct *task = bpf_get_current_task_btf();
    ... task->cgroups->dfl_cgrp ...
and in the struct task_struct definition:
    struct task_struct {
        ....
        struct css_set __rcu            *cgroups;
        ....
    }
In a sleepable program, accessing task->cgroups is not protected by rcu_read_lock().
So the current implementation only supports non-sleepable programs; supporting
sleepable programs will be a next step, together with adding rcu_read_lock()
protection for rcu-tagged structures.

Since the map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for the old
cgroup local storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is
used for cgroup storage available to non-cgroup-attached bpf programs. The old
cgroup storage supports the bpf_get_local_storage() helper to get the cgroup
data. The new cgroup storage helper bpf_cgrp_storage_get() provides similar
functionality. While the old cgroup storage pre-allocates storage memory, the
new mechanism can also pre-allocate with a user-space bpf_map_update_elem()
call to avoid a potential run-time memory allocation failure.
Therefore, the new cgroup storage provides all the functionality of the
old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is aliased to
BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate that the old cgroup storage
can be deprecated, since the new one provides the same functionality.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf.h            |   3 +
 include/linux/bpf_types.h      |   1 +
 include/linux/cgroup-defs.h    |   4 +
 include/uapi/linux/bpf.h       |  50 +++++-
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/bpf_cgrp_storage.c  | 268 +++++++++++++++++++++++++++++++++
 kernel/bpf/helpers.c           |   6 +
 kernel/bpf/syscall.c           |   3 +-
 kernel/bpf/verifier.c          |  13 +-
 kernel/cgroup/cgroup.c         |   4 +
 kernel/trace/bpf_trace.c       |   4 +
 scripts/bpf_doc.py             |   2 +
 tools/include/uapi/linux/bpf.h |  50 +++++-
 13 files changed, 405 insertions(+), 5 deletions(-)
 create mode 100644 kernel/bpf/bpf_cgrp_storage.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9e7d46d16032..674da3129248 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id);
 
 const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id);
 void bpf_task_storage_free(struct task_struct *task);
+void bpf_cgrp_storage_free(struct cgroup *cgroup);
 bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog);
 const struct btf_func_model *
 bpf_jit_find_kfunc_model(const struct bpf_prog *prog,
@@ -2537,6 +2538,8 @@ extern const struct bpf_func_proto bpf_copy_from_user_task_proto;
 extern const struct bpf_func_proto bpf_set_retval_proto;
 extern const struct bpf_func_proto bpf_get_retval_proto;
 extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
+extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
+extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
 
 const struct bpf_func_proto *tracing_prog_func_proto(
   enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2c6a4f2562a7..d4ee3ccd3753 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -86,6 +86,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_PROG_ARRAY, prog_array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERF_EVENT_ARRAY, perf_event_array_map_ops)
 #ifdef CONFIG_CGROUPS
 BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)
 #endif
 #ifdef CONFIG_CGROUP_BPF
 BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 8f481d1b159a..c466fdc3a32a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -504,6 +504,10 @@ struct cgroup {
 	/* Used to store internal freezer state */
 	struct cgroup_freezer_state freezer;
 
+#ifdef CONFIG_BPF_SYSCALL
+	struct bpf_local_storage __rcu  *bpf_cgrp_storage;
+#endif
+
 	/* All ancestors including self */
 	struct cgroup *ancestors[];
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 17f61338f8f8..94659f6b3395 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -922,7 +922,14 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_CPUMAP,
 	BPF_MAP_TYPE_XSKMAP,
 	BPF_MAP_TYPE_SOCKHASH,
-	BPF_MAP_TYPE_CGROUP_STORAGE,
+	BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED,
+	/* BPF_MAP_TYPE_CGROUP_STORAGE is available to bpf programs attaching
+	 * to a cgroup. The newer BPF_MAP_TYPE_CGRP_STORAGE is available to
+	 * both cgroup-attached and other progs and supports all functionality
+	 * provided by BPF_MAP_TYPE_CGROUP_STORAGE. So mark
+	 * BPF_MAP_TYPE_CGROUP_STORAGE deprecated.
+	 */
+	BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED,
 	BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
 	BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
 	BPF_MAP_TYPE_QUEUE,
@@ -935,6 +942,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
+	BPF_MAP_TYPE_CGRP_STORAGE,
 };
 
 /* Note that tracing related programs such as
@@ -5435,6 +5443,44 @@ union bpf_attr {
  *		**-E2BIG** if user-space has tried to publish a sample which is
  *		larger than the size of the ring buffer, or which cannot fit
  *		within a struct bpf_dynptr.
+ *
+ * void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
+ *	Description
+ *		Get a bpf_local_storage from the *cgroup*.
+ *
+ *		Logically, it could be thought of as getting the value from
+ *		a *map* with *cgroup* as the **key**.  From this
+ *		perspective, the usage is not much different from
+ *		**bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
+ *		helper enforces that the key must be a cgroup struct and the map
+ *		must be a **BPF_MAP_TYPE_CGRP_STORAGE**.
+ *
+ *		In reality, the local-storage value is embedded directly inside of the
+ *		*cgroup* object itself, rather than being located in the
+ *		**BPF_MAP_TYPE_CGRP_STORAGE** map. When the local-storage value is
+ *		queried for some *map* on a *cgroup* object, the kernel will perform an
+ *		O(n) iteration over all of the live local-storage values for that
+ *		*cgroup* object until the local-storage value for the *map* is found.
+ *
+ *		An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
+ *		used such that a new bpf_local_storage will be
+ *		created if one does not exist.  *value* can be used
+ *		together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
+ *		the initial value of a bpf_local_storage.  If *value* is
+ *		**NULL**, the new bpf_local_storage will be zero initialized.
+ *	Return
+ *		A bpf_local_storage pointer is returned on success.
+ *
+ *		**NULL** if not found or there was an error in adding
+ *		a new bpf_local_storage.
+ *
+ * long bpf_cgrp_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
+ *	Description
+ *		Delete a bpf_local_storage from a *cgroup*.
+ *	Return
+ *		0 on success.
+ *
+ *		**-ENOENT** if the bpf_local_storage cannot be found.
  */
 #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
 	FN(unspec, 0, ##ctx)				\
@@ -5647,6 +5693,8 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx)	\
 	FN(ktime_get_tai_ns, 208, ##ctx)		\
 	FN(user_ringbuf_drain, 209, ##ctx)		\
+	FN(cgrp_storage_get, 210, ##ctx)		\
+	FN(cgrp_storage_delete, 211, ##ctx)		\
 	/* */
 
 /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 341c94f208f4..3a12e6b400a2 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
 ifeq ($(CONFIG_CGROUPS),y)
-obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
+obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgrp_storage.o
 endif
 obj-$(CONFIG_CGROUP_BPF) += cgroup.o
 ifeq ($(CONFIG_INET),y)
diff --git a/kernel/bpf/bpf_cgrp_storage.c b/kernel/bpf/bpf_cgrp_storage.c
new file mode 100644
index 000000000000..770c9c28215a
--- /dev/null
+++ b/kernel/bpf/bpf_cgrp_storage.c
@@ -0,0 +1,268 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ */
+
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <linux/bpf_local_storage.h>
+#include <uapi/linux/btf.h>
+#include <linux/btf_ids.h>
+
+DEFINE_BPF_STORAGE_CACHE(cgroup_cache);
+
+static DEFINE_PER_CPU(int, bpf_cgrp_storage_busy);
+
+static void bpf_cgrp_storage_lock(void)
+{
+	migrate_disable();
+	this_cpu_inc(bpf_cgrp_storage_busy);
+}
+
+static void bpf_cgrp_storage_unlock(void)
+{
+	this_cpu_dec(bpf_cgrp_storage_busy);
+	migrate_enable();
+}
+
+static bool bpf_cgrp_storage_trylock(void)
+{
+	migrate_disable();
+	if (unlikely(this_cpu_inc_return(bpf_cgrp_storage_busy) != 1)) {
+		this_cpu_dec(bpf_cgrp_storage_busy);
+		migrate_enable();
+		return false;
+	}
+	return true;
+}
+
+static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner)
+{
+	struct cgroup *cg = owner;
+
+	return &cg->bpf_cgrp_storage;
+}
+
+void bpf_cgrp_storage_free(struct cgroup *cgroup)
+{
+	struct bpf_local_storage *local_storage;
+	struct bpf_local_storage_elem *selem;
+	bool free_cgroup_storage = false;
+	struct hlist_node *n;
+	unsigned long flags;
+
+	rcu_read_lock();
+	local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
+	if (!local_storage) {
+		rcu_read_unlock();
+		return;
+	}
+
+	/* Neither the bpf_prog nor the bpf_map's syscall
+	 * could be modifying the local_storage->list now.
+	 * Thus, no elem can be added to or deleted from the
+	 * local_storage->list by the bpf_prog or by the bpf_map's syscall.
+	 *
+	 * It is racing with __bpf_local_storage_map_free() alone
+	 * when unlinking elem from the local_storage->list and
+	 * the map's bucket->list.
+	 */
+	bpf_cgrp_storage_lock();
+	raw_spin_lock_irqsave(&local_storage->lock, flags);
+	hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
+		bpf_selem_unlink_map(selem);
+		/* If local_storage list has only one element, the
+		 * bpf_selem_unlink_storage_nolock() will return true.
+		 * Otherwise, it will return false. The current loop iteration
+		 * intends to remove all local storage. So the last iteration
+		 * of the loop will set the free_cgroup_storage to true.
+		 */
+		free_cgroup_storage =
+			bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
+	}
+	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
+	bpf_cgrp_storage_unlock();
+	rcu_read_unlock();
+
+	if (free_cgroup_storage)
+		kfree_rcu(local_storage, rcu);
+}
+
+static struct bpf_local_storage_data *
+cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
+{
+	struct bpf_local_storage *cgroup_storage;
+	struct bpf_local_storage_map *smap;
+
+	cgroup_storage = rcu_dereference_check(cgroup->bpf_cgrp_storage,
+					       bpf_rcu_lock_held());
+	if (!cgroup_storage)
+		return NULL;
+
+	smap = (struct bpf_local_storage_map *)map;
+	return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
+}
+
+static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_local_storage_data *sdata;
+	struct cgroup *cgroup;
+	int fd;
+
+	fd = *(int *)key;
+	cgroup = cgroup_get_from_fd(fd);
+	if (IS_ERR(cgroup))
+		return ERR_CAST(cgroup);
+
+	bpf_cgrp_storage_lock();
+	sdata = cgroup_storage_lookup(cgroup, map, true);
+	bpf_cgrp_storage_unlock();
+	cgroup_put(cgroup);
+	return sdata ? sdata->data : NULL;
+}
+
+static int bpf_cgrp_storage_update_elem(struct bpf_map *map, void *key,
+					  void *value, u64 map_flags)
+{
+	struct bpf_local_storage_data *sdata;
+	struct cgroup *cgroup;
+	int fd;
+
+	fd = *(int *)key;
+	cgroup = cgroup_get_from_fd(fd);
+	if (IS_ERR(cgroup))
+		return PTR_ERR(cgroup);
+
+	bpf_cgrp_storage_lock();
+	sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
+					 value, map_flags, GFP_ATOMIC);
+	bpf_cgrp_storage_unlock();
+	cgroup_put(cgroup);
+	return PTR_ERR_OR_ZERO(sdata);
+}
+
+static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map *map)
+{
+	struct bpf_local_storage_data *sdata;
+
+	sdata = cgroup_storage_lookup(cgroup, map, false);
+	if (!sdata)
+		return -ENOENT;
+
+	bpf_selem_unlink(SELEM(sdata), true);
+	return 0;
+}
+
+static int bpf_cgrp_storage_delete_elem(struct bpf_map *map, void *key)
+{
+	struct cgroup *cgroup;
+	int err, fd;
+
+	fd = *(int *)key;
+	cgroup = cgroup_get_from_fd(fd);
+	if (IS_ERR(cgroup))
+		return PTR_ERR(cgroup);
+
+	bpf_cgrp_storage_lock();
+	err = cgroup_storage_delete(cgroup, map);
+	bpf_cgrp_storage_unlock();
+	cgroup_put(cgroup);
+	return err;
+}
+
+static int notsupp_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	return -ENOTSUPP;
+}
+
+static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
+{
+	return bpf_local_storage_map_alloc(attr, &cgroup_cache);
+}
+
+static void cgroup_storage_map_free(struct bpf_map *map)
+{
+	bpf_local_storage_map_free(map, &cgroup_cache, NULL);
+}
+
+/* *gfp_flags* is a hidden argument provided by the verifier */
+BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
+	   void *, value, u64, flags, gfp_t, gfp_flags)
+{
+	struct bpf_local_storage_data *sdata;
+
+	WARN_ON_ONCE(!bpf_rcu_lock_held());
+	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
+		return (unsigned long)NULL;
+
+	if (!cgroup)
+		return (unsigned long)NULL;
+
+	if (!bpf_cgrp_storage_trylock())
+		return (unsigned long)NULL;
+
+	sdata = cgroup_storage_lookup(cgroup, map, true);
+	if (sdata)
+		goto unlock;
+
+	/* only allocate new storage, when the cgroup is refcounted */
+	if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
+	    (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
+		sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
+						 value, BPF_NOEXIST, gfp_flags);
+
+unlock:
+	bpf_cgrp_storage_unlock();
+	return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
+}
+
+BPF_CALL_2(bpf_cgrp_storage_delete, struct bpf_map *, map, struct cgroup *, cgroup)
+{
+	int ret;
+
+	WARN_ON_ONCE(!bpf_rcu_lock_held());
+	if (!cgroup)
+		return -EINVAL;
+
+	if (!bpf_cgrp_storage_trylock())
+		return -EBUSY;
+
+	ret = cgroup_storage_delete(cgroup, map);
+	bpf_cgrp_storage_unlock();
+	return ret;
+}
+
+BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
+const struct bpf_map_ops cgrp_storage_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc_check = bpf_local_storage_map_alloc_check,
+	.map_alloc = cgroup_storage_map_alloc,
+	.map_free = cgroup_storage_map_free,
+	.map_get_next_key = notsupp_get_next_key,
+	.map_lookup_elem = bpf_cgrp_storage_lookup_elem,
+	.map_update_elem = bpf_cgrp_storage_update_elem,
+	.map_delete_elem = bpf_cgrp_storage_delete_elem,
+	.map_check_btf = bpf_local_storage_map_check_btf,
+	.map_btf_id = &cgroup_storage_map_btf_ids[0],
+	.map_owner_storage_ptr = cgroup_storage_ptr,
+};
+
+const struct bpf_func_proto bpf_cgrp_storage_get_proto = {
+	.func		= bpf_cgrp_storage_get,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_BTF_ID,
+	.arg2_btf_id	= &bpf_cgroup_btf_id[0],
+	.arg3_type	= ARG_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg4_type	= ARG_ANYTHING,
+};
+
+const struct bpf_func_proto bpf_cgrp_storage_delete_proto = {
+	.func		= bpf_cgrp_storage_delete,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_BTF_ID,
+	.arg2_btf_id	= &bpf_cgroup_btf_id[0],
+};
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a6b04faed282..124fd199ce5c 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1663,6 +1663,12 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_dynptr_write_proto;
 	case BPF_FUNC_dynptr_data:
 		return &bpf_dynptr_data_proto;
+#ifdef CONFIG_CGROUPS
+	case BPF_FUNC_cgrp_storage_get:
+		return &bpf_cgrp_storage_get_proto;
+	case BPF_FUNC_cgrp_storage_delete:
+		return &bpf_cgrp_storage_delete_proto;
+#endif
 	default:
 		break;
 	}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7b373a5e861f..b95c276f92e3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1016,7 +1016,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 		    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
 		    map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
 		    map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
-		    map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
+		    map->map_type != BPF_MAP_TYPE_TASK_STORAGE &&
+		    map->map_type != BPF_MAP_TYPE_CGRP_STORAGE)
 			return -ENOTSUPP;
 		if (map->spin_lock_off + sizeof(struct bpf_spin_lock) >
 		    map->value_size) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ddc1452cf023..8f849a763b79 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6350,6 +6350,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    func_id != BPF_FUNC_task_storage_delete)
 			goto error;
 		break;
+	case BPF_MAP_TYPE_CGRP_STORAGE:
+		if (func_id != BPF_FUNC_cgrp_storage_get &&
+		    func_id != BPF_FUNC_cgrp_storage_delete)
+			goto error;
+		break;
 	case BPF_MAP_TYPE_BLOOM_FILTER:
 		if (func_id != BPF_FUNC_map_peek_elem &&
 		    func_id != BPF_FUNC_map_push_elem)
@@ -6462,6 +6467,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
 			goto error;
 		break;
+	case BPF_FUNC_cgrp_storage_get:
+	case BPF_FUNC_cgrp_storage_delete:
+		if (map->map_type != BPF_MAP_TYPE_CGRP_STORAGE)
+			goto error;
+		break;
 	default:
 		break;
 	}
@@ -14139,7 +14149,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 
 		if (insn->imm == BPF_FUNC_task_storage_get ||
 		    insn->imm == BPF_FUNC_sk_storage_get ||
-		    insn->imm == BPF_FUNC_inode_storage_get) {
+		    insn->imm == BPF_FUNC_inode_storage_get ||
+		    insn->imm == BPF_FUNC_cgrp_storage_get) {
 			if (env->prog->aux->sleepable)
 				insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL);
 			else
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 764bdd5fd8d1..32145d066a09 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
 
+#ifdef CONFIG_BPF_SYSCALL
+	bpf_cgrp_storage_free(cgrp);
+#endif
+
 	percpu_ref_exit(&css->refcnt);
 
 	if (ss) {
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 49fb9ec8366d..0ddf0834b1b8 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1454,6 +1454,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_current_cgroup_id_proto;
 	case BPF_FUNC_get_current_ancestor_cgroup_id:
 		return &bpf_get_current_ancestor_cgroup_id_proto;
+	case BPF_FUNC_cgrp_storage_get:
+		return &bpf_cgrp_storage_get_proto;
+	case BPF_FUNC_cgrp_storage_delete:
+		return &bpf_cgrp_storage_delete_proto;
 #endif
 	case BPF_FUNC_send_signal:
 		return &bpf_send_signal_proto;
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index c0e6690be82a..fdb0aff8cb5a 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -685,6 +685,7 @@ class PrinterHelpers(Printer):
             'struct udp6_sock',
             'struct unix_sock',
             'struct task_struct',
+            'struct cgroup',
 
             'struct __sk_buff',
             'struct sk_msg_md',
@@ -742,6 +743,7 @@ class PrinterHelpers(Printer):
             'struct udp6_sock',
             'struct unix_sock',
             'struct task_struct',
+            'struct cgroup',
             'struct path',
             'struct btf_ptr',
             'struct inode',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 17f61338f8f8..94659f6b3395 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -922,7 +922,14 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_CPUMAP,
 	BPF_MAP_TYPE_XSKMAP,
 	BPF_MAP_TYPE_SOCKHASH,
-	BPF_MAP_TYPE_CGROUP_STORAGE,
+	BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED,
+	/* BPF_MAP_TYPE_CGROUP_STORAGE is available to bpf programs attaching
+	 * to a cgroup. The newer BPF_MAP_TYPE_CGRP_STORAGE is available to
+	 * both cgroup-attached and other progs and supports all functionality
+	 * provided by BPF_MAP_TYPE_CGROUP_STORAGE. So mark
+	 * BPF_MAP_TYPE_CGROUP_STORAGE deprecated.
+	 */
+	BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED,
 	BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
 	BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
 	BPF_MAP_TYPE_QUEUE,
@@ -935,6 +942,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
+	BPF_MAP_TYPE_CGRP_STORAGE,
 };
 
 /* Note that tracing related programs such as
@@ -5435,6 +5443,44 @@ union bpf_attr {
  *		**-E2BIG** if user-space has tried to publish a sample which is
  *		larger than the size of the ring buffer, or which cannot fit
  *		within a struct bpf_dynptr.
+ *
+ * void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
+ *	Description
+ *		Get a bpf_local_storage from the *cgroup*.
+ *
+ *		Logically, it could be thought of as getting the value from
+ *		a *map* with *cgroup* as the **key**.  From this
+ *		perspective, the usage is not much different from
+ *		**bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
+ *		helper enforces that the key must be a cgroup struct and the map
+ *		must be a **BPF_MAP_TYPE_CGRP_STORAGE**.
+ *
+ *		In reality, the local-storage value is embedded directly inside of the
+ *		*cgroup* object itself, rather than being located in the
+ *		**BPF_MAP_TYPE_CGRP_STORAGE** map. When the local-storage value is
+ *		queried for some *map* on a *cgroup* object, the kernel will perform an
+ *		O(n) iteration over all of the live local-storage values for that
+ *		*cgroup* object until the local-storage value for the *map* is found.
+ *
+ *		An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
+ *		used such that a new bpf_local_storage will be
+ *		created if one does not exist.  *value* can be used
+ *		together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
+ *		the initial value of a bpf_local_storage.  If *value* is
+ *		**NULL**, the new bpf_local_storage will be zero initialized.
+ *	Return
+ *		A bpf_local_storage pointer is returned on success.
+ *
+ *		**NULL** if not found or there was an error in adding
+ *		a new bpf_local_storage.
+ *
+ * long bpf_cgrp_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
+ *	Description
+ *		Delete a bpf_local_storage from a *cgroup*.
+ *	Return
+ *		0 on success.
+ *
+ *		**-ENOENT** if the bpf_local_storage cannot be found.
  */
 #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
 	FN(unspec, 0, ##ctx)				\
@@ -5647,6 +5693,8 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx)	\
 	FN(ktime_get_tai_ns, 208, ##ctx)		\
 	FN(user_ringbuf_drain, 209, ##ctx)		\
+	FN(cgrp_storage_get, 210, ##ctx)		\
+	FN(cgrp_storage_delete, 211, ##ctx)		\
 	/* */
 
 /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
-- 
2.30.2



* [PATCH bpf-next v4 4/7] libbpf: Support new cgroup local storage
  2022-10-23 18:05 [PATCH bpf-next v4 0/7] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
                   ` (2 preceding siblings ...)
  2022-10-23 18:05 ` [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
@ 2022-10-23 18:05 ` Yonghong Song
  2022-10-23 20:03   ` David Vernet
  2022-10-23 18:05 ` [PATCH bpf-next v4 5/7] bpftool: " Yonghong Song
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Yonghong Song @ 2022-10-23 18:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

Add support for the new cgroup local storage.
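
With this change, kernel support for the new map type can also be probed
through the existing libbpf API, e.g.:

    /* returns 1 if the running kernel supports BPF_MAP_TYPE_CGRP_STORAGE */
    int ret = libbpf_probe_bpf_map_type(BPF_MAP_TYPE_CGRP_STORAGE, NULL);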

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/lib/bpf/libbpf.c        | 1 +
 tools/lib/bpf/libbpf_probes.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 027fd9565c16..5d7819edf074 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -164,6 +164,7 @@ static const char * const map_type_name[] = {
 	[BPF_MAP_TYPE_TASK_STORAGE]		= "task_storage",
 	[BPF_MAP_TYPE_BLOOM_FILTER]		= "bloom_filter",
 	[BPF_MAP_TYPE_USER_RINGBUF]             = "user_ringbuf",
+	[BPF_MAP_TYPE_CGRP_STORAGE]		= "cgrp_storage",
 };
 
 static const char * const prog_type_name[] = {
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index f3a8e8e74eb8..bdb83d467f9a 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -221,6 +221,7 @@ static int probe_map_create(enum bpf_map_type map_type)
 	case BPF_MAP_TYPE_SK_STORAGE:
 	case BPF_MAP_TYPE_INODE_STORAGE:
 	case BPF_MAP_TYPE_TASK_STORAGE:
+	case BPF_MAP_TYPE_CGRP_STORAGE:
 		btf_key_type_id = 1;
 		btf_value_type_id = 3;
 		value_size = 8;
-- 
2.30.2



* [PATCH bpf-next v4 5/7] bpftool: Support new cgroup local storage
  2022-10-23 18:05 [PATCH bpf-next v4 0/7] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
                   ` (3 preceding siblings ...)
  2022-10-23 18:05 ` [PATCH bpf-next v4 4/7] libbpf: Support new cgroup local storage Yonghong Song
@ 2022-10-23 18:05 ` Yonghong Song
  2022-10-23 18:05 ` [PATCH bpf-next v4 6/7] selftests/bpf: Add selftests for " Yonghong Song
  2022-10-23 18:05 ` [PATCH bpf-next v4 7/7] docs/bpf: Add documentation " Yonghong Song
  6 siblings, 0 replies; 31+ messages in thread
From: Yonghong Song @ 2022-10-23 18:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo,
	Quentin Monnet

Add support for the new cgroup local storage.

Acked-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/bpf/bpftool/Documentation/bpftool-map.rst | 2 +-
 tools/bpf/bpftool/map.c                         | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index 7f3b67a8b48f..11250c4734fe 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -55,7 +55,7 @@ MAP COMMANDS
 |		| **devmap** | **devmap_hash** | **sockmap** | **cpumap** | **xskmap** | **sockhash**
 |		| **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage**
 |		| **queue** | **stack** | **sk_storage** | **struct_ops** | **ringbuf** | **inode_storage**
-|		| **task_storage** | **bloom_filter** | **user_ringbuf** }
+|		| **task_storage** | **bloom_filter** | **user_ringbuf** | **cgrp_storage** }
 
 DESCRIPTION
 ===========
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 9a6ca9f31133..3a2a89912637 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -1459,7 +1459,7 @@ static int do_help(int argc, char **argv)
 		"                 devmap | devmap_hash | sockmap | cpumap | xskmap | sockhash |\n"
 		"                 cgroup_storage | reuseport_sockarray | percpu_cgroup_storage |\n"
 		"                 queue | stack | sk_storage | struct_ops | ringbuf | inode_storage |\n"
-		"                 task_storage | bloom_filter | user_ringbuf }\n"
+		"                 task_storage | bloom_filter | user_ringbuf | cgrp_storage }\n"
 		"       " HELP_SPEC_OPTIONS " |\n"
 		"                    {-f|--bpffs} | {-n|--nomount} }\n"
 		"",
-- 
2.30.2



* [PATCH bpf-next v4 6/7] selftests/bpf: Add selftests for new cgroup local storage
  2022-10-23 18:05 [PATCH bpf-next v4 0/7] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
                   ` (4 preceding siblings ...)
  2022-10-23 18:05 ` [PATCH bpf-next v4 5/7] bpftool: " Yonghong Song
@ 2022-10-23 18:05 ` Yonghong Song
  2022-10-23 20:14   ` David Vernet
  2022-10-24 20:30   ` Martin KaFai Lau
  2022-10-23 18:05 ` [PATCH bpf-next v4 7/7] docs/bpf: Add documentation " Yonghong Song
  6 siblings, 2 replies; 31+ messages in thread
From: Yonghong Song @ 2022-10-23 18:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

Add two tests for the new cgroup local storage: one to test bpf program helpers
and user-space map APIs, and the other to test that recursive fentry
triggering won't deadlock.
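
The no-deadlock property being tested relies on the per-cpu busy counter
introduced in patch 3: a recursive entry fails the trylock and the helper
bails out instead of deadlocking. For reference:

    static bool bpf_cgrp_storage_trylock(void)
    {
            migrate_disable();
            if (unlikely(this_cpu_inc_return(bpf_cgrp_storage_busy) != 1)) {
                    /* already inside a cgrp storage operation on this cpu */
                    this_cpu_dec(bpf_cgrp_storage_busy);
                    migrate_enable();
                    return false;
            }
            return true;
    }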

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 .../bpf/prog_tests/cgrp_local_storage.c       | 92 +++++++++++++++++++
 .../selftests/bpf/progs/cgrp_local_storage.c  | 88 ++++++++++++++++++
 .../selftests/bpf/progs/cgrp_ls_recursion.c   | 70 ++++++++++++++
 3 files changed, 250 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgrp_local_storage.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c

diff --git a/tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c b/tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c
new file mode 100644
index 000000000000..7ee21d598195
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.*/
+
+#define _GNU_SOURCE
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <test_progs.h>
+#include "cgrp_local_storage.skel.h"
+#include "cgrp_ls_recursion.skel.h"
+
+static void test_sys_enter_exit(int cgroup_fd)
+{
+	struct cgrp_local_storage *skel;
+	long val1 = 1, val2 = 0;
+	int err;
+
+	skel = cgrp_local_storage__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		return;
+
+	/* populate a value in cgrp_storage_2 */
+	err = bpf_map_update_elem(bpf_map__fd(skel->maps.cgrp_storage_2), &cgroup_fd, &val1, BPF_ANY);
+	if (!ASSERT_OK(err, "map_update_elem"))
+		goto out;
+
+	/* check value */
+	err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.cgrp_storage_2), &cgroup_fd, &val2);
+	if (!ASSERT_OK(err, "map_lookup_elem"))
+		goto out;
+	if (!ASSERT_EQ(val2, 1, "map_lookup_elem, invalid val"))
+		goto out;
+
+	/* delete value */
+	err = bpf_map_delete_elem(bpf_map__fd(skel->maps.cgrp_storage_2), &cgroup_fd);
+	if (!ASSERT_OK(err, "map_delete_elem"))
+		goto out;
+
+	skel->bss->target_pid = syscall(SYS_gettid);
+
+	err = cgrp_local_storage__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto out;
+
+	syscall(SYS_gettid);
+	syscall(SYS_gettid);
+
+	skel->bss->target_pid = 0;
+
+	/* 3x syscalls: 1x attach and 2x gettid */
+	ASSERT_EQ(skel->bss->enter_cnt, 3, "enter_cnt");
+	ASSERT_EQ(skel->bss->exit_cnt, 3, "exit_cnt");
+	ASSERT_EQ(skel->bss->mismatch_cnt, 0, "mismatch_cnt");
+out:
+	cgrp_local_storage__destroy(skel);
+}
+
+static void test_recursion(int cgroup_fd)
+{
+	struct cgrp_ls_recursion *skel;
+	int err;
+
+	skel = cgrp_ls_recursion__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		return;
+
+	err = cgrp_ls_recursion__attach(skel);
+	if (!ASSERT_OK(err, "skel_attach"))
+		goto out;
+
+	/* trigger sys_enter, make sure it does not cause deadlock */
+	syscall(SYS_gettid);
+
+out:
+	cgrp_ls_recursion__destroy(skel);
+}
+
+void test_cgrp_local_storage(void)
+{
+	int cgroup_fd;
+
+	cgroup_fd = test__join_cgroup("/cgrp_local_storage");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup /cgrp_local_storage"))
+		return;
+
+	if (test__start_subtest("sys_enter_exit"))
+		test_sys_enter_exit(cgroup_fd);
+	if (test__start_subtest("recursion"))
+		test_recursion(cgroup_fd);
+
+	close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/cgrp_local_storage.c b/tools/testing/selftests/bpf/progs/cgrp_local_storage.c
new file mode 100644
index 000000000000..dee63d4c1512
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgrp_local_storage.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} cgrp_storage_1 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} cgrp_storage_2 SEC(".maps");
+
+#define MAGIC_VALUE 0xabcd1234
+
+pid_t target_pid = 0;
+int mismatch_cnt = 0;
+int enter_cnt = 0;
+int exit_cnt = 0;
+
+SEC("tp_btf/sys_enter")
+int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+{
+	struct task_struct *task;
+	long *ptr;
+	int err;
+
+	task = bpf_get_current_task_btf();
+	if (task->pid != target_pid)
+		return 0;
+
+	/* populate value 0 */
+	ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+
+	/* delete value 0 */
+	err = bpf_cgrp_storage_delete(&cgrp_storage_1, task->cgroups->dfl_cgrp);
+	if (err)
+		return 0;
+
+	/* value is not available */
+	ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0, 0);
+	if (ptr)
+		return 0;
+
+	/* re-populate the value */
+	ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+	__sync_fetch_and_add(&enter_cnt, 1);
+	*ptr = MAGIC_VALUE + enter_cnt;
+
+	return 0;
+}
+
+SEC("tp_btf/sys_exit")
+int BPF_PROG(on_exit, struct pt_regs *regs, long id)
+{
+	struct task_struct *task;
+	long *ptr;
+
+	task = bpf_get_current_task_btf();
+	if (task->pid != target_pid)
+		return 0;
+
+	ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+
+	__sync_fetch_and_add(&exit_cnt, 1);
+	if (*ptr != MAGIC_VALUE + exit_cnt)
+		__sync_fetch_and_add(&mismatch_cnt, 1);
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
new file mode 100644
index 000000000000..a043d8fefdac
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} map_a SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, long);
+} map_b SEC(".maps");
+
+SEC("fentry/bpf_local_storage_lookup")
+int BPF_PROG(on_lookup)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+
+	bpf_cgrp_storage_delete(&map_a, task->cgroups->dfl_cgrp);
+	bpf_cgrp_storage_delete(&map_b, task->cgroups->dfl_cgrp);
+	return 0;
+}
+
+SEC("fentry/bpf_local_storage_update")
+int BPF_PROG(on_update)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	long *ptr;
+
+	ptr = bpf_cgrp_storage_get(&map_a, task->cgroups->dfl_cgrp, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (ptr)
+		*ptr += 1;
+
+	ptr = bpf_cgrp_storage_get(&map_b, task->cgroups->dfl_cgrp, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (ptr)
+		*ptr += 1;
+
+	return 0;
+}
+
+SEC("tp_btf/sys_enter")
+int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+{
+	struct task_struct *task;
+	long *ptr;
+
+	task = bpf_get_current_task_btf();
+	ptr = bpf_cgrp_storage_get(&map_a, task->cgroups->dfl_cgrp, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (ptr)
+		*ptr = 200;
+
+	ptr = bpf_cgrp_storage_get(&map_b, task->cgroups->dfl_cgrp, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (ptr)
+		*ptr = 100;
+	return 0;
+}
-- 
2.30.2



* [PATCH bpf-next v4 7/7] docs/bpf: Add documentation for new cgroup local storage
  2022-10-23 18:05 [PATCH bpf-next v4 0/7] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
                   ` (5 preceding siblings ...)
  2022-10-23 18:05 ` [PATCH bpf-next v4 6/7] selftests/bpf: Add selftests for " Yonghong Song
@ 2022-10-23 18:05 ` Yonghong Song
  2022-10-23 20:26   ` David Vernet
  6 siblings, 1 reply; 31+ messages in thread
From: Yonghong Song @ 2022-10-23 18:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

Add some descriptions and examples for BPF_MAP_TYPE_CGRP_STORAGE.
Also illustrate the major differences between BPF_MAP_TYPE_CGRP_STORAGE
and BPF_MAP_TYPE_CGROUP_STORAGE, and recommend using
BPF_MAP_TYPE_CGRP_STORAGE instead of BPF_MAP_TYPE_CGROUP_STORAGE
at the end.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 Documentation/bpf/map_cgrp_storage.rst | 109 +++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/bpf/map_cgrp_storage.rst

diff --git a/Documentation/bpf/map_cgrp_storage.rst b/Documentation/bpf/map_cgrp_storage.rst
new file mode 100644
index 000000000000..4dfc7770da7e
--- /dev/null
+++ b/Documentation/bpf/map_cgrp_storage.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Meta Platforms, Inc. and affiliates.
+
+===========================
+BPF_MAP_TYPE_CGRP_STORAGE
+===========================
+
+The ``BPF_MAP_TYPE_CGRP_STORAGE`` map type represents a local fixed-size
+storage for cgroups. It is only available with ``CONFIG_CGROUPS``; the
+helpers that access it are guarded by the same Kconfig option. The
+data for a particular cgroup can be retrieved by looking up the map
+with that cgroup.
+
+This document describes the usage and semantics of the
+``BPF_MAP_TYPE_CGRP_STORAGE`` map type.
+
+Usage
+=====
+
+The map key must be ``sizeof(int)`` representing a cgroup fd.
+To access the storage in a program, use ``bpf_cgrp_storage_get``::
+
+    void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
+
+``flags`` could be 0 or ``BPF_LOCAL_STORAGE_GET_F_CREATE`` which indicates that
+a new local storage will be created if one does not exist.
+
+The local storage can be removed with ``bpf_cgrp_storage_delete``::
+
+    long bpf_cgrp_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
+
+The map is available to all program types.
+
+Examples
+========
+
+A bpf program example with BPF_MAP_TYPE_CGRP_STORAGE::
+
+    #include <vmlinux.h>
+    #include <bpf/bpf_helpers.h>
+    #include <bpf/bpf_tracing.h>
+
+    struct {
+            __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+            __uint(map_flags, BPF_F_NO_PREALLOC);
+            __type(key, int);
+            __type(value, long);
+    } cgrp_storage SEC(".maps");
+
+    SEC("tp_btf/sys_enter")
+    int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+    {
+            struct task_struct *task = bpf_get_current_task_btf();
+            long *ptr;
+
+            ptr = bpf_cgrp_storage_get(&cgrp_storage, task->cgroups->dfl_cgrp, 0,
+                                       BPF_LOCAL_STORAGE_GET_F_CREATE);
+            if (ptr)
+                __sync_fetch_and_add(ptr, 1);
+
+            return 0;
+    }
+
+Userspace accessing map declared above::
+
+    #include <bpf/bpf.h>
+    #include <bpf/libbpf.h>
+
+    long map_lookup(struct bpf_map *map, int cgrp_fd)
+    {
+            long value;
+
+            if (!bpf_map_lookup_elem(bpf_map__fd(map), &cgrp_fd, &value))
+                    return value;
+            return 0;
+    }
+
+Difference Between BPF_MAP_TYPE_CGRP_STORAGE and BPF_MAP_TYPE_CGROUP_STORAGE
+============================================================================
+
+The old cgroup storage map ``BPF_MAP_TYPE_CGROUP_STORAGE`` has been marked as
+deprecated (renamed to ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``). The new
+``BPF_MAP_TYPE_CGRP_STORAGE`` map should be used instead. The following
+illustrates the main difference between ``BPF_MAP_TYPE_CGRP_STORAGE`` and
+``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.
+
+(1). ``BPF_MAP_TYPE_CGRP_STORAGE`` can be used by all program types while
+     ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` is available only to cgroup program types
+     like BPF_CGROUP_INET_INGRESS or BPF_CGROUP_SOCK_OPS, etc.
+
+(2). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports local storage for more than one
+     cgroup while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only support one cgroup
+     which is attached by a bpf program.
+
+(3). ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` allocates local storage at attach time so
+     ``bpf_get_local_storage()`` always returns non-NULL local storage.
+     ``BPF_MAP_TYPE_CGRP_STORAGE`` allocates local storage at runtime so
+     it is possible that ``bpf_cgrp_storage_get()`` may return null local storage.
+     To avoid such null local storage issue, user space can do
+     ``bpf_map_update_elem()`` to pre-allocate local storage before a bpf program
+     is attached.
+
+(4). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports deleting local storage by a bpf program
+     while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only deletes storage during
+     prog detach time.
+
+So overall, ``BPF_MAP_TYPE_CGRP_STORAGE`` supports all ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``
+functionality and beyond. It is recommended to use ``BPF_MAP_TYPE_CGRP_STORAGE``
+instead of ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 1/7] bpf: Make struct cgroup btf id global
  2022-10-23 18:05 ` [PATCH bpf-next v4 1/7] bpf: Make struct cgroup btf id global Yonghong Song
@ 2022-10-23 19:59   ` David Vernet
  0 siblings, 0 replies; 31+ messages in thread
From: David Vernet @ 2022-10-23 19:59 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On Sun, Oct 23, 2022 at 11:05:19AM -0700, Yonghong Song wrote:
> Make struct cgroup btf id global so later patch can reuse
> the same btf id.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  include/linux/btf_ids.h  | 1 +
>  kernel/bpf/cgroup_iter.c | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
> index 2aea877d644f..c9744efd202f 100644
> --- a/include/linux/btf_ids.h
> +++ b/include/linux/btf_ids.h
> @@ -265,5 +265,6 @@ MAX_BTF_TRACING_TYPE,
>  };
>  
>  extern u32 btf_tracing_ids[];
> +extern u32 bpf_cgroup_btf_id[];
>  
>  #endif
> diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
> index 0d200a993489..c6ffc706d583 100644
> --- a/kernel/bpf/cgroup_iter.c
> +++ b/kernel/bpf/cgroup_iter.c
> @@ -157,7 +157,7 @@ static const struct seq_operations cgroup_iter_seq_ops = {
>  	.show   = cgroup_iter_seq_show,
>  };
>  
> -BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
> +BTF_ID_LIST_GLOBAL_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
>  
>  static int cgroup_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
>  {
> -- 
> 2.30.2
> 

Acked-by: David Vernet <void@manifault.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-23 18:05 ` [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
@ 2022-10-23 20:02   ` David Vernet
  2022-10-24 19:19   ` Martin KaFai Lau
  1 sibling, 0 replies; 31+ messages in thread
From: David Vernet @ 2022-10-23 20:02 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On Sun, Oct 23, 2022 at 11:05:30AM -0700, Yonghong Song wrote:
> Similar to sk/inode/task storage, implement similar cgroup local storage.
> 
> There already exists a local storage implementation for cgroup-attached
> bpf programs.  See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
> bpf_get_local_storage(). But there are use cases such that non-cgroup
> attached bpf progs wants to access cgroup local storage data. For example,
> tc egress prog has access to sk and cgroup. It is possible to use
> sk local storage to emulate cgroup local storage by storing data in socket.
> But this is a waste as it could be lots of sockets belonging to a particular
> cgroup. Alternatively, a separate map can be created with cgroup id as the key.
> But this will introduce additional overhead to manipulate the new map.
> A cgroup local storage, similar to existing sk/inode/task storage,
> should help for this use case.
> 
> The life-cycle of storage is managed with the life-cycle of the
> cgroup struct.  i.e. the storage is destroyed along with the owning cgroup
> with a call to bpf_cgrp_storage_free() when cgroup itself
> is deleted.
> 
> The userspace map operations can be done by using a cgroup fd as a key
> passed to the lookup, update and delete operations.
> 
> Typically, the following code is used to get the current cgroup:
>     struct task_struct *task = bpf_get_current_task_btf();
>     ... task->cgroups->dfl_cgrp ...
> and in structure task_struct definition:
>     struct task_struct {
>         ....
>         struct css_set __rcu            *cgroups;
>         ....
>     }
> With sleepable program, accessing task->cgroups is not protected by rcu_read_lock.
> So the current implementation only supports non-sleepable program and supporting
> sleepable program will be the next step together with adding rcu_read_lock
> protection for rcu tagged structures.
> 
> Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
> storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used
> for cgroup storage available to non-cgroup-attached bpf programs. The old
> cgroup storage supports bpf_get_local_storage() helper to get the cgroup data.
> The new cgroup storage helper bpf_cgrp_storage_get() can provide similar
> functionality. While old cgroup storage pre-allocates storage memory, the new
> mechanism can also pre-allocate with a user space bpf_map_update_elem() call
> to avoid potential run-time memory allocation failure.
> Therefore, the new cgroup storage can provide all functionality w.r.t.
> the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is alias to
> BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate the old cgroup storage can
> be deprecated since the new one can provide the same functionality.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---

Looks great, thanks. Only other thing I'll mention is that I think Tejun
had pointed out in [0] that cg was usually more used as an abbreviation.
I don't feel strongly one way or the other, so here's my ack either way.

[0]: https://lore.kernel.org/all/Y1NByH+suY2s65Kh@slm.duckdns.org/

Acked-by: David Vernet <void@manifault.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 4/7] libbpf: Support new cgroup local storage
  2022-10-23 18:05 ` [PATCH bpf-next v4 4/7] libbpf: Support new cgroup local storage Yonghong Song
@ 2022-10-23 20:03   ` David Vernet
  0 siblings, 0 replies; 31+ messages in thread
From: David Vernet @ 2022-10-23 20:03 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On Sun, Oct 23, 2022 at 11:05:35AM -0700, Yonghong Song wrote:
> Add support for new cgroup local storage.
> 
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---

Acked-by: David Vernet <void@manifault.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 6/7] selftests/bpf: Add selftests for new cgroup local storage
  2022-10-23 18:05 ` [PATCH bpf-next v4 6/7] selftests/bpf: Add selftests for " Yonghong Song
@ 2022-10-23 20:14   ` David Vernet
  2022-10-24 19:03     ` Yonghong Song
  2022-10-24 20:30   ` Martin KaFai Lau
  1 sibling, 1 reply; 31+ messages in thread
From: David Vernet @ 2022-10-23 20:14 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On Sun, Oct 23, 2022 at 11:05:46AM -0700, Yonghong Song wrote:
> Add two tests for new cgroup local storage, one to test bpf program helpers
> and user space map APIs, and the other to test recursive fentry
> triggering won't deadlock.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  .../bpf/prog_tests/cgrp_local_storage.c       | 92 +++++++++++++++++++
>  .../selftests/bpf/progs/cgrp_local_storage.c  | 88 ++++++++++++++++++
>  .../selftests/bpf/progs/cgrp_ls_recursion.c   | 70 ++++++++++++++
>  3 files changed, 250 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c
>  create mode 100644 tools/testing/selftests/bpf/progs/cgrp_local_storage.c
>  create mode 100644 tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c b/tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c
> new file mode 100644
> index 000000000000..7ee21d598195
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c
> @@ -0,0 +1,92 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.*/
> +
> +#define _GNU_SOURCE
> +#include <unistd.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <test_progs.h>
> +#include "cgrp_local_storage.skel.h"
> +#include "cgrp_ls_recursion.skel.h"
> +
> +static void test_sys_enter_exit(int cgroup_fd)
> +{
> +	struct cgrp_local_storage *skel;
> +	long val1 = 1, val2 = 0;
> +	int err;
> +
> +	skel = cgrp_local_storage__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
> +		return;
> +
> +	/* populate a value in cgrp_storage_2 */
> +	err = bpf_map_update_elem(bpf_map__fd(skel->maps.cgrp_storage_2), &cgroup_fd, &val1, BPF_ANY);
> +	if (!ASSERT_OK(err, "map_update_elem"))
> +		goto out;
> +
> +	/* check value */
> +	err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.cgrp_storage_2), &cgroup_fd, &val2);
> +	if (!ASSERT_OK(err, "map_lookup_elem"))
> +		goto out;
> +	if (!ASSERT_EQ(val2, 1, "map_lookup_elem, invalid val"))
> +		goto out;
> +
> +	/* delete value */
> +	err = bpf_map_delete_elem(bpf_map__fd(skel->maps.cgrp_storage_2), &cgroup_fd);
> +	if (!ASSERT_OK(err, "map_delete_elem"))
> +		goto out;
> +
> +	skel->bss->target_pid = syscall(SYS_gettid);
> +
> +	err = cgrp_local_storage__attach(skel);
> +	if (!ASSERT_OK(err, "skel_attach"))
> +		goto out;
> +
> +	syscall(SYS_gettid);
> +	syscall(SYS_gettid);
> +
> +	skel->bss->target_pid = 0;
> +
> +	/* 3x syscalls: 1x attach and 2x gettid */
> +	ASSERT_EQ(skel->bss->enter_cnt, 3, "enter_cnt");
> +	ASSERT_EQ(skel->bss->exit_cnt, 3, "exit_cnt");
> +	ASSERT_EQ(skel->bss->mismatch_cnt, 0, "mismatch_cnt");
> +out:
> +	cgrp_local_storage__destroy(skel);
> +}
> +
> +static void test_recursion(int cgroup_fd)
> +{
> +	struct cgrp_ls_recursion *skel;
> +	int err;
> +
> +	skel = cgrp_ls_recursion__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
> +		return;
> +
> +	err = cgrp_ls_recursion__attach(skel);
> +	if (!ASSERT_OK(err, "skel_attach"))
> +		goto out;
> +
> +	/* trigger sys_enter, make sure it does not cause deadlock */
> +	syscall(SYS_gettid);
> +
> +out:
> +	cgrp_ls_recursion__destroy(skel);
> +}
> +
> +void test_cgrp_local_storage(void)
> +{
> +	int cgroup_fd;
> +
> +	cgroup_fd = test__join_cgroup("/cgrp_local_storage");
> +	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup /cgrp_local_storage"))
> +		return;
> +
> +	if (test__start_subtest("sys_enter_exit"))
> +		test_sys_enter_exit(cgroup_fd);
> +	if (test__start_subtest("recursion"))
> +		test_recursion(cgroup_fd);
> +
> +	close(cgroup_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/cgrp_local_storage.c b/tools/testing/selftests/bpf/progs/cgrp_local_storage.c
> new file mode 100644
> index 000000000000..dee63d4c1512
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/cgrp_local_storage.c
> @@ -0,0 +1,88 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
> +	__uint(map_flags, BPF_F_NO_PREALLOC);
> +	__type(key, int);
> +	__type(value, long);
> +} cgrp_storage_1 SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
> +	__uint(map_flags, BPF_F_NO_PREALLOC);
> +	__type(key, int);
> +	__type(value, long);
> +} cgrp_storage_2 SEC(".maps");
> +
> +#define MAGIC_VALUE 0xabcd1234
> +
> +pid_t target_pid = 0;
> +int mismatch_cnt = 0;
> +int enter_cnt = 0;
> +int exit_cnt = 0;
> +
> +SEC("tp_btf/sys_enter")
> +int BPF_PROG(on_enter, struct pt_regs *regs, long id)
> +{
> +	struct task_struct *task;
> +	long *ptr;
> +	int err;
> +
> +	task = bpf_get_current_task_btf();
> +	if (task->pid != target_pid)
> +		return 0;
> +
> +	/* populate value 0 */
> +	ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0,
> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	if (!ptr)
> +		return 0;
> +
> +	/* delete value 0 */
> +	err = bpf_cgrp_storage_delete(&cgrp_storage_1, task->cgroups->dfl_cgrp);
> +	if (err)
> +		return 0;
> +
> +	/* value is not available */
> +	ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0, 0);
> +	if (ptr)
> +		return 0;
> +
> +	/* re-populate the value */
> +	ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0,
> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	if (!ptr)
> +		return 0;

Should we also add a global int err variable to this program that we set
any time we can't fetch an instance of the local storage, etc., and then
check it in the user space test logic?
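
Just to sketch what I mean (untested, the "err" global and its check are
hypothetical additions to this test):

    /* BPF prog side: global flag, stays 0 unless something failed */
    int err = 0;

    /* ...then in on_enter(), instead of silently returning: */
    ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0,
                               BPF_LOCAL_STORAGE_GET_F_CREATE);
    if (!ptr) {
            err = 1;
            return 0;
    }

    /* user space side, after triggering the syscalls: */
    ASSERT_EQ(skel->bss->err, 0, "err");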

> +	__sync_fetch_and_add(&enter_cnt, 1);
> +	*ptr = MAGIC_VALUE + enter_cnt;
> +
> +	return 0;
> +}
> +
> +SEC("tp_btf/sys_exit")
> +int BPF_PROG(on_exit, struct pt_regs *regs, long id)
> +{
> +	struct task_struct *task;
> +	long *ptr;
> +
> +	task = bpf_get_current_task_btf();
> +	if (task->pid != target_pid)
> +		return 0;
> +
> +	ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0,
> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	if (!ptr)
> +		return 0;
> +
> +	__sync_fetch_and_add(&exit_cnt, 1);
> +	if (*ptr != MAGIC_VALUE + exit_cnt)
> +		__sync_fetch_and_add(&mismatch_cnt, 1);
> +	return 0;
> +}
> diff --git a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
> new file mode 100644
> index 000000000000..a043d8fefdac
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
> @@ -0,0 +1,70 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
> +	__uint(map_flags, BPF_F_NO_PREALLOC);
> +	__type(key, int);
> +	__type(value, long);
> +} map_a SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
> +	__uint(map_flags, BPF_F_NO_PREALLOC);
> +	__type(key, int);
> +	__type(value, long);
> +} map_b SEC(".maps");
> +
> +SEC("fentry/bpf_local_storage_lookup")
> +int BPF_PROG(on_lookup)
> +{
> +	struct task_struct *task = bpf_get_current_task_btf();
> +
> +	bpf_cgrp_storage_delete(&map_a, task->cgroups->dfl_cgrp);
> +	bpf_cgrp_storage_delete(&map_b, task->cgroups->dfl_cgrp);
> +	return 0;
> +}
> +
> +SEC("fentry/bpf_local_storage_update")
> +int BPF_PROG(on_update)
> +{
> +	struct task_struct *task = bpf_get_current_task_btf();
> +	long *ptr;
> +
> +	ptr = bpf_cgrp_storage_get(&map_a, task->cgroups->dfl_cgrp, 0,
> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	if (ptr)
> +		*ptr += 1;
> +
> +	ptr = bpf_cgrp_storage_get(&map_b, task->cgroups->dfl_cgrp, 0,
> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	if (ptr)
> +		*ptr += 1;
> +
> +	return 0;
> +}
> +
> +SEC("tp_btf/sys_enter")
> +int BPF_PROG(on_enter, struct pt_regs *regs, long id)
> +{
> +	struct task_struct *task;
> +	long *ptr;
> +
> +	task = bpf_get_current_task_btf();
> +	ptr = bpf_cgrp_storage_get(&map_a, task->cgroups->dfl_cgrp, 0,
> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	if (ptr)
> +		*ptr = 200;
> +
> +	ptr = bpf_cgrp_storage_get(&map_b, task->cgroups->dfl_cgrp, 0,
> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	if (ptr)
> +		*ptr = 100;
> +	return 0;
> +}
> -- 
> 2.30.2
> 


Looks reasonable overall. Should we also include any negative tests,
e.g. trying to invoke bpf_cgrp_storage_get() with a struct other
than a cgroup?
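
For example, an (untested) sketch that the verifier should reject at load
time, since a task_struct is passed where the helper expects a cgroup:

    SEC("tp_btf/sys_enter")
    int BPF_PROG(on_enter, struct pt_regs *regs, long id)
    {
            struct task_struct *task = bpf_get_current_task_btf();
            long *ptr;

            /* wrong owner type: task instead of task->cgroups->dfl_cgrp,
             * so loading this prog is expected to fail
             */
            ptr = bpf_cgrp_storage_get(&map_a, task, 0,
                                       BPF_LOCAL_STORAGE_GET_F_CREATE);
            return 0;
    }

with the user space side asserting that open_and_load() fails.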


Thanks,
David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 7/7] docs/bpf: Add documentation for new cgroup local storage
  2022-10-23 18:05 ` [PATCH bpf-next v4 7/7] docs/bpf: Add documentation " Yonghong Song
@ 2022-10-23 20:26   ` David Vernet
  2022-10-24 19:05     ` Yonghong Song
  0 siblings, 1 reply; 31+ messages in thread
From: David Vernet @ 2022-10-23 20:26 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo

On Sun, Oct 23, 2022 at 11:05:52AM -0700, Yonghong Song wrote:
> Add some descriptions and examples for BPF_MAP_TYPE_CGRP_STROAGE.

s/STROAGE/STORAGE here and elsewhere

> Also illustate the major difference between BPF_MAP_TYPE_CGRP_STROAGE
> and BPF_MAP_TYPE_CGROUP_STORAGE and recommend to use
> BPF_MAP_TYPE_CGRP_STROAGE instead of BPF_MAP_TYPE_CGROUP_STORAGE
> in the end.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  Documentation/bpf/map_cgrp_storage.rst | 109 +++++++++++++++++++++++++
>  1 file changed, 109 insertions(+)
>  create mode 100644 Documentation/bpf/map_cgrp_storage.rst
> 
> diff --git a/Documentation/bpf/map_cgrp_storage.rst b/Documentation/bpf/map_cgrp_storage.rst
> new file mode 100644
> index 000000000000..4dfc7770da7e
> --- /dev/null
> +++ b/Documentation/bpf/map_cgrp_storage.rst
> @@ -0,0 +1,109 @@
> +.. SPDX-License-Identifier: GPL-2.0-only
> +.. Copyright (C) 2022 Meta Platforms, Inc. and affiliates.
> +
> +===========================
> +BPF_MAP_TYPE_CGRP_STORAGE
> +===========================

Small nit, can you trim the == border so it matches the width of
BPF_MAP_TYPE_CGRP_STORAGE?

> +
> +The ``BPF_MAP_TYPE_CGRP_STORAGE`` map type represents a local fixed-size
> +storage for cgroups. It is only available with ``CONFIG_CGROUP_BPF``.

This is no longer accurate and should say, "It is only available with
``CONFIG_CGROUPS``."

> +The programs are made available by the same Kconfig. The
> +data for a particular cgroup can be retrieved by looking up the map
> +with that cgroup.
> +
> +This document describes the usage and semantics of the
> +``BPF_MAP_TYPE_CGRP_STORAGE`` map type.
> +
> +Usage
> +=====
> +
> +The map key must be ``sizeof(int)`` representing a cgroup fd.
> +To access the storage in a program, use ``bpf_cgrp_storage_get``::
> +
> +    void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
> +
> +``flags`` could be 0 or ``BPF_LOCAL_STORAGE_GET_F_CREATE`` which indicates that
> +a new local storage will be created if one does not exist.
> +
> +The local storage can be removed with ``bpf_cgrp_storage_delete``::
> +
> +    long bpf_cgrp_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
> +
> +The map is available to all program types.
> +
> +Examples
> +========
> +
> +A bpf program example with BPF_MAP_TYPE_CGRP_STORAGE::
> +
> +    #include <vmlinux.h>
> +    #include <bpf/bpf_helpers.h>
> +    #include <bpf/bpf_tracing.h>
> +
> +    struct {
> +            __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
> +            __uint(map_flags, BPF_F_NO_PREALLOC);
> +            __type(key, int);
> +            __type(value, long);
> +    } cgrp_storage SEC(".maps");
> +
> +    SEC("tp_btf/sys_enter")
> +    int BPF_PROG(on_enter, struct pt_regs *regs, long id)
> +    {
> +            struct task_struct *task = bpf_get_current_task_btf();
> +            long *ptr;
> +
> +            ptr = bpf_cgrp_storage_get(&cgrp_storage, task->cgroups->dfl_cgrp, 0,
> +                                       BPF_LOCAL_STORAGE_GET_F_CREATE);
> +            if (ptr)
> +                __sync_fetch_and_add(ptr, 1);
> +
> +            return 0;
> +    }
> +
> +Userspace accessing map declared above::
> +
> +    #include <bpf/bpf.h>
> +    #include <bpf/libbpf.h>
> +
> +    long map_lookup(struct bpf_map *map, int cgrp_fd)
> +    {
> +            long value;
> +
> +            if (!bpf_map_lookup_elem(bpf_map__fd(map), &cgrp_fd, &value))
> +                    return value;
> +            return 0;
> +    }
> +
> +Difference Between BPF_MAP_TYPE_CGRP_STORAGE and BPF_MAP_TYPE_CGROUP_STORAGE
> +============================================================================
> +
> +The old cgroup storage map ``BPF_MAP_TYPE_CGROUP_STORAGE`` has been marked as
> +deprecated (renamed to ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``). The new
> +``BPF_MAP_TYPE_CGRP_STORAGE`` map should be used instead. The following
> +illustrates the main difference between ``BPF_MAP_TYPE_CGRP_STORAGE`` and
> +``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.
> +
> +(1). ``BPF_MAP_TYPE_CGRP_STORAGE`` can be used by all program types while
> +     ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` is available only to cgroup program types
> +     like BPF_CGROUP_INET_INGRESS or BPF_CGROUP_SOCK_OPS, etc.
> +
> +(2). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports local storage for more than one
> +     cgroup while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only support one cgroup

s/only support/only supports

> +     which is attached by a bpf program.
> +
> +(3). ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` allocates local storage at attach time so
> +     ``bpf_get_local_storage()`` always returns non-NULL local storage.
> +     ``BPF_MAP_TYPE_CGRP_STORAGE`` allocates local storage at runtime so
> +     it is possible that ``bpf_cgrp_storage_get()`` may return null local storage.
> +     To avoid such null local storage issue, user space can do
> +     ``bpf_map_update_elem()`` to pre-allocate local storage before a bpf program
> +     is attached.
> +
> +(4). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports deleting local storage by a bpf program
> +     while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only deletes storage during
> +     prog detach time.
> +
> +So overall, ``BPF_MAP_TYPE_CGRP_STORAGE`` supports all ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``
> +functionality and beyond. It is recommended to use ``BPF_MAP_TYPE_CGRP_STORAGE``
> +instead of ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.
> -- 
> 2.30.2
> 

One more thought / consideration which you can of course feel free to
ignore. If we're using the acronym "bpf" in documentation (acronym
meaning it's used on its own rather than e.g. in a variable name), IMO
we should capitalize it to "BPF". Very minor thing, but I think it makes
the docs look a bit more polished. Up to you, and not a big deal either
way.

Anyways, this looks great, thanks again for writing it up! Everything I
pointed out was a minor typo / fix so:

Acked-by: David Vernet <void@manifault.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 2/7] bpf: Refactor inode/task/sk storage map_{alloc,free}() for reuse
  2022-10-23 18:05 ` [PATCH bpf-next v4 2/7] bpf: Refactor inode/task/sk storage map_{alloc,free}() for reuse Yonghong Song
@ 2022-10-24 18:02   ` sdf
  2022-10-24 19:08     ` Yonghong Song
  2022-10-24 20:34   ` Martin KaFai Lau
  1 sibling, 1 reply; 31+ messages in thread
From: sdf @ 2022-10-24 18:02 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo, David Vernet

On 10/23, Yonghong Song wrote:
> Refactor codes so that inode/task/sk storage map_{alloc,free}
> can maximally share the same code. There is no functionality change.

Does it make sense to also do the following? (see below, untested)
We aren't saving much code-wise here, but at least we won't have three
copies of the same long comment.


diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index 7ea18d4da84b..e4b0b04d081b 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -138,6 +138,8 @@ int bpf_local_storage_map_check_btf(const struct bpf_map *map,
  				    const struct btf_type *key_type,
  				    const struct btf_type *value_type);

+bool bpf_local_storage_unlink_nolock(struct bpf_local_storage *local_storage);
+
  void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
  				   struct bpf_local_storage_elem *selem);

diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
index 5f7683b19199..5313cb0b7181 100644
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -56,11 +56,9 @@ static struct bpf_local_storage_data *inode_storage_lookup(struct inode *inode,

  void bpf_inode_storage_free(struct inode *inode)
  {
-	struct bpf_local_storage_elem *selem;
  	struct bpf_local_storage *local_storage;
  	bool free_inode_storage = false;
  	struct bpf_storage_blob *bsb;
-	struct hlist_node *n;

  	bsb = bpf_inode(inode);
  	if (!bsb)
@@ -74,30 +72,11 @@ void bpf_inode_storage_free(struct inode *inode)
  		return;
  	}

-	/* Neither the bpf_prog nor the bpf-map's syscall
-	 * could be modifying the local_storage->list now.
-	 * Thus, no elem can be added-to or deleted-from the
-	 * local_storage->list by the bpf_prog or by the bpf-map's syscall.
-	 *
-	 * It is racing with bpf_local_storage_map_free() alone
-	 * when unlinking elem from the local_storage->list and
-	 * the map's bucket->list.
-	 */
  	raw_spin_lock_bh(&local_storage->lock);
-	hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
-		/* Always unlink from map before unlinking from
-		 * local_storage.
-		 */
-		bpf_selem_unlink_map(selem);
-		free_inode_storage = bpf_selem_unlink_storage_nolock(
-			local_storage, selem, false, false);
-	}
+	free_inode_storage = bpf_local_storage_unlink_nolock(local_storage);
  	raw_spin_unlock_bh(&local_storage->lock);
  	rcu_read_unlock();

-	/* free_inoode_storage should always be true as long as
-	 * local_storage->list was non-empty.
-	 */
  	if (free_inode_storage)
  		kfree_rcu(local_storage, rcu);
  }
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 9dc6de1cf185..0ea754953242 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -98,6 +98,36 @@ void bpf_local_storage_free_rcu(struct rcu_head *rcu)
  		kfree_rcu(local_storage, rcu);
  }

+bool bpf_local_storage_unlink_nolock(struct bpf_local_storage *local_storage)
+{
+	struct bpf_local_storage_elem *selem;
+	bool free_storage = false;
+	struct hlist_node *n;
+
+	/* Neither the bpf_prog nor the bpf-map's syscall
+	 * could be modifying the local_storage->list now.
+	 * Thus, no elem can be added-to or deleted-from the
+	 * local_storage->list by the bpf_prog or by the bpf-map's syscall.
+	 *
+	 * It is racing with bpf_local_storage_map_free() alone
+	 * when unlinking elem from the local_storage->list and
+	 * the map's bucket->list.
+	 */
+	hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
+		/* Always unlink from map before unlinking from
+		 * local_storage.
+		 */
+		bpf_selem_unlink_map(selem);
+		free_storage = bpf_selem_unlink_storage_nolock(
+			local_storage, selem, false, false);
+	}
+
+	/* free_storage should always be true as long as
+	 * local_storage->list was non-empty.
+	 */
+	return free_storage;
+}
+
  static void bpf_selem_free_rcu(struct rcu_head *rcu)
  {
  	struct bpf_local_storage_elem *selem;
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index 6f290623347e..60887c25504b 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -71,10 +71,8 @@ task_storage_lookup(struct task_struct *task, struct bpf_map *map,

  void bpf_task_storage_free(struct task_struct *task)
  {
-	struct bpf_local_storage_elem *selem;
  	struct bpf_local_storage *local_storage;
  	bool free_task_storage = false;
-	struct hlist_node *n;
  	unsigned long flags;

  	rcu_read_lock();
@@ -85,32 +83,13 @@ void bpf_task_storage_free(struct task_struct *task)
  		return;
  	}

-	/* Neither the bpf_prog nor the bpf-map's syscall
-	 * could be modifying the local_storage->list now.
-	 * Thus, no elem can be added-to or deleted-from the
-	 * local_storage->list by the bpf_prog or by the bpf-map's syscall.
-	 *
-	 * It is racing with bpf_local_storage_map_free() alone
-	 * when unlinking elem from the local_storage->list and
-	 * the map's bucket->list.
-	 */
  	bpf_task_storage_lock();
  	raw_spin_lock_irqsave(&local_storage->lock, flags);
-	hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
-		/* Always unlink from map before unlinking from
-		 * local_storage.
-		 */
-		bpf_selem_unlink_map(selem);
-		free_task_storage = bpf_selem_unlink_storage_nolock(
-			local_storage, selem, false, false);
-	}
+	free_task_storage = bpf_local_storage_unlink_nolock(local_storage);
  	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
  	bpf_task_storage_unlock();
  	rcu_read_unlock();

-	/* free_task_storage should always be true as long as
-	 * local_storage->list was non-empty.
-	 */
  	if (free_task_storage)
  		kfree_rcu(local_storage, rcu);
  }

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 6/7] selftests/bpf: Add selftests for new cgroup local storage
  2022-10-23 20:14   ` David Vernet
@ 2022-10-24 19:03     ` Yonghong Song
  0 siblings, 0 replies; 31+ messages in thread
From: Yonghong Song @ 2022-10-24 19:03 UTC (permalink / raw)
  To: David Vernet, Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo



On 10/23/22 1:14 PM, David Vernet wrote:
> On Sun, Oct 23, 2022 at 11:05:46AM -0700, Yonghong Song wrote:
>> Add two tests for new cgroup local storage, one to test bpf program helpers
>> and user space map APIs, and the other to test recursive fentry
>> triggering won't deadlock.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   .../bpf/prog_tests/cgrp_local_storage.c       | 92 +++++++++++++++++++
>>   .../selftests/bpf/progs/cgrp_local_storage.c  | 88 ++++++++++++++++++
>>   .../selftests/bpf/progs/cgrp_ls_recursion.c   | 70 ++++++++++++++
>>   3 files changed, 250 insertions(+)
>>   create mode 100644 tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/cgrp_local_storage.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c b/tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c
>> new file mode 100644
>> index 000000000000..7ee21d598195
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/cgrp_local_storage.c
>> @@ -0,0 +1,92 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.*/
>> +
>> +#define _GNU_SOURCE
>> +#include <unistd.h>
>> +#include <sys/syscall.h>
>> +#include <sys/types.h>
>> +#include <test_progs.h>
>> +#include "cgrp_local_storage.skel.h"
>> +#include "cgrp_ls_recursion.skel.h"
>> +
>> +static void test_sys_enter_exit(int cgroup_fd)
>> +{
>> +	struct cgrp_local_storage *skel;
>> +	long val1 = 1, val2 = 0;
>> +	int err;
>> +
>> +	skel = cgrp_local_storage__open_and_load();
>> +	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
>> +		return;
>> +
>> +	/* populate a value in cgrp_storage_2 */
>> +	err = bpf_map_update_elem(bpf_map__fd(skel->maps.cgrp_storage_2), &cgroup_fd, &val1, BPF_ANY);
>> +	if (!ASSERT_OK(err, "map_update_elem"))
>> +		goto out;
>> +
>> +	/* check value */
>> +	err = bpf_map_lookup_elem(bpf_map__fd(skel->maps.cgrp_storage_2), &cgroup_fd, &val2);
>> +	if (!ASSERT_OK(err, "map_lookup_elem"))
>> +		goto out;
>> +	if (!ASSERT_EQ(val2, 1, "map_lookup_elem, invalid val"))
>> +		goto out;
>> +
>> +	/* delete value */
>> +	err = bpf_map_delete_elem(bpf_map__fd(skel->maps.cgrp_storage_2), &cgroup_fd);
>> +	if (!ASSERT_OK(err, "map_delete_elem"))
>> +		goto out;
>> +
>> +	skel->bss->target_pid = syscall(SYS_gettid);
>> +
>> +	err = cgrp_local_storage__attach(skel);
>> +	if (!ASSERT_OK(err, "skel_attach"))
>> +		goto out;
>> +
>> +	syscall(SYS_gettid);
>> +	syscall(SYS_gettid);
>> +
>> +	skel->bss->target_pid = 0;
>> +
>> +	/* 3x syscalls: 1x attach and 2x gettid */
>> +	ASSERT_EQ(skel->bss->enter_cnt, 3, "enter_cnt");
>> +	ASSERT_EQ(skel->bss->exit_cnt, 3, "exit_cnt");
>> +	ASSERT_EQ(skel->bss->mismatch_cnt, 0, "mismatch_cnt");
>> +out:
>> +	cgrp_local_storage__destroy(skel);
>> +}
>> +
>> +static void test_recursion(int cgroup_fd)
>> +{
>> +	struct cgrp_ls_recursion *skel;
>> +	int err;
>> +
>> +	skel = cgrp_ls_recursion__open_and_load();
>> +	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
>> +		return;
>> +
>> +	err = cgrp_ls_recursion__attach(skel);
>> +	if (!ASSERT_OK(err, "skel_attach"))
>> +		goto out;
>> +
>> +	/* trigger sys_enter, make sure it does not cause deadlock */
>> +	syscall(SYS_gettid);
>> +
>> +out:
>> +	cgrp_ls_recursion__destroy(skel);
>> +}
>> +
>> +void test_cgrp_local_storage(void)
>> +{
>> +	int cgroup_fd;
>> +
>> +	cgroup_fd = test__join_cgroup("/cgrp_local_storage");
>> +	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup /cgrp_local_storage"))
>> +		return;
>> +
>> +	if (test__start_subtest("sys_enter_exit"))
>> +		test_sys_enter_exit(cgroup_fd);
>> +	if (test__start_subtest("recursion"))
>> +		test_recursion(cgroup_fd);
>> +
>> +	close(cgroup_fd);
>> +}
>> diff --git a/tools/testing/selftests/bpf/progs/cgrp_local_storage.c b/tools/testing/selftests/bpf/progs/cgrp_local_storage.c
>> new file mode 100644
>> index 000000000000..dee63d4c1512
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/cgrp_local_storage.c
>> @@ -0,0 +1,88 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
>> +
>> +#include "vmlinux.h"
>> +#include <bpf/bpf_helpers.h>
>> +#include <bpf/bpf_tracing.h>
>> +
>> +char _license[] SEC("license") = "GPL";
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
>> +	__uint(map_flags, BPF_F_NO_PREALLOC);
>> +	__type(key, int);
>> +	__type(value, long);
>> +} cgrp_storage_1 SEC(".maps");
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
>> +	__uint(map_flags, BPF_F_NO_PREALLOC);
>> +	__type(key, int);
>> +	__type(value, long);
>> +} cgrp_storage_2 SEC(".maps");
>> +
>> +#define MAGIC_VALUE 0xabcd1234
>> +
>> +pid_t target_pid = 0;
>> +int mismatch_cnt = 0;
>> +int enter_cnt = 0;
>> +int exit_cnt = 0;
>> +
>> +SEC("tp_btf/sys_enter")
>> +int BPF_PROG(on_enter, struct pt_regs *regs, long id)
>> +{
>> +	struct task_struct *task;
>> +	long *ptr;
>> +	int err;
>> +
>> +	task = bpf_get_current_task_btf();
>> +	if (task->pid != target_pid)
>> +		return 0;
>> +
>> +	/* populate value 0 */
>> +	ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0,
>> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
>> +	if (!ptr)
>> +		return 0;
>> +
>> +	/* delete value 0 */
>> +	err = bpf_cgrp_storage_delete(&cgrp_storage_1, task->cgroups->dfl_cgrp);
>> +	if (err)
>> +		return 0;
>> +
>> +	/* value is not available */
>> +	ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0, 0);
>> +	if (ptr)
>> +		return 0;
>> +
>> +	/* re-populate the value */
>> +	ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0,
>> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
>> +	if (!ptr)
>> +		return 0;
> 
> Should we also add a global int err variable to this program that we set
> any time we can't fetch an instance of the local storage, etc., and then
> check it in the user space test logic?

I think we are fine here. The enter_cnt increment below ensures that none
of the above paths can return prematurely without the test noticing;
otherwise, the test will fail.
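
For reference, the user space part of the test already has:

    /* 3x syscalls: 1x attach and 2x gettid */
    ASSERT_EQ(skel->bss->enter_cnt, 3, "enter_cnt");
    ASSERT_EQ(skel->bss->exit_cnt, 3, "exit_cnt");
    ASSERT_EQ(skel->bss->mismatch_cnt, 0, "mismatch_cnt");

so any premature return in on_enter() shows up as a wrong enter_cnt.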

> 
>> +	__sync_fetch_and_add(&enter_cnt, 1);
>> +	*ptr = MAGIC_VALUE + enter_cnt;
>> +
>> +	return 0;
>> +}
>> +
>> +SEC("tp_btf/sys_exit")
>> +int BPF_PROG(on_exit, struct pt_regs *regs, long id)
>> +{
>> +	struct task_struct *task;
>> +	long *ptr;
>> +
>> +	task = bpf_get_current_task_btf();
>> +	if (task->pid != target_pid)
>> +		return 0;
>> +
>> +	ptr = bpf_cgrp_storage_get(&cgrp_storage_1, task->cgroups->dfl_cgrp, 0,
>> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
>> +	if (!ptr)
>> +		return 0;
>> +
>> +	__sync_fetch_and_add(&exit_cnt, 1);
>> +	if (*ptr != MAGIC_VALUE + exit_cnt)
>> +		__sync_fetch_and_add(&mismatch_cnt, 1);
>> +	return 0;
>> +}
>> diff --git a/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
>> new file mode 100644
>> index 000000000000..a043d8fefdac
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/cgrp_ls_recursion.c
>> @@ -0,0 +1,70 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
>> +
>> +#include "vmlinux.h"
>> +#include <bpf/bpf_helpers.h>
>> +#include <bpf/bpf_tracing.h>
>> +
>> +char _license[] SEC("license") = "GPL";
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
>> +	__uint(map_flags, BPF_F_NO_PREALLOC);
>> +	__type(key, int);
>> +	__type(value, long);
>> +} map_a SEC(".maps");
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
>> +	__uint(map_flags, BPF_F_NO_PREALLOC);
>> +	__type(key, int);
>> +	__type(value, long);
>> +} map_b SEC(".maps");
>> +
>> +SEC("fentry/bpf_local_storage_lookup")
>> +int BPF_PROG(on_lookup)
>> +{
>> +	struct task_struct *task = bpf_get_current_task_btf();
>> +
>> +	bpf_cgrp_storage_delete(&map_a, task->cgroups->dfl_cgrp);
>> +	bpf_cgrp_storage_delete(&map_b, task->cgroups->dfl_cgrp);
>> +	return 0;
>> +}
>> +
>> +SEC("fentry/bpf_local_storage_update")
>> +int BPF_PROG(on_update)
>> +{
>> +	struct task_struct *task = bpf_get_current_task_btf();
>> +	long *ptr;
>> +
>> +	ptr = bpf_cgrp_storage_get(&map_a, task->cgroups->dfl_cgrp, 0,
>> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
>> +	if (ptr)
>> +		*ptr += 1;
>> +
>> +	ptr = bpf_cgrp_storage_get(&map_b, task->cgroups->dfl_cgrp, 0,
>> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
>> +	if (ptr)
>> +		*ptr += 1;
>> +
>> +	return 0;
>> +}
>> +
>> +SEC("tp_btf/sys_enter")
>> +int BPF_PROG(on_enter, struct pt_regs *regs, long id)
>> +{
>> +	struct task_struct *task;
>> +	long *ptr;
>> +
>> +	task = bpf_get_current_task_btf();
>> +	ptr = bpf_cgrp_storage_get(&map_a, task->cgroups->dfl_cgrp, 0,
>> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
>> +	if (ptr)
>> +		*ptr = 200;
>> +
>> +	ptr = bpf_cgrp_storage_get(&map_b, task->cgroups->dfl_cgrp, 0,
>> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
>> +	if (ptr)
>> +		*ptr = 100;
>> +	return 0;
>> +}
>> -- 
>> 2.30.2
>>
> 
> 
> Looks reasonable overall. Should we also include any negative tests,
> e.g. trying to invoke bpf_cgrp_storage_get() with a struct other
> than a cgroup?

I can add one.

> 
> 
> Thanks,
> David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 7/7] docs/bpf: Add documentation for new cgroup local storage
  2022-10-23 20:26   ` David Vernet
@ 2022-10-24 19:05     ` Yonghong Song
  0 siblings, 0 replies; 31+ messages in thread
From: Yonghong Song @ 2022-10-24 19:05 UTC (permalink / raw)
  To: David Vernet, Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo



On 10/23/22 1:26 PM, David Vernet wrote:
> On Sun, Oct 23, 2022 at 11:05:52AM -0700, Yonghong Song wrote:
>> Add some descriptions and examples for BPF_MAP_TYPE_CGRP_STROAGE.
> 
> s/STROAGE/STORAGE here and elsewhere

Thanks. Will make corresponding changes for this and other suggestions
below.

> 
>> Also illustate the major difference between BPF_MAP_TYPE_CGRP_STROAGE
>> and BPF_MAP_TYPE_CGROUP_STORAGE and recommend to use
>> BPF_MAP_TYPE_CGRP_STROAGE instead of BPF_MAP_TYPE_CGROUP_STORAGE
>> in the end.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   Documentation/bpf/map_cgrp_storage.rst | 109 +++++++++++++++++++++++++
>>   1 file changed, 109 insertions(+)
>>   create mode 100644 Documentation/bpf/map_cgrp_storage.rst
>>
>> diff --git a/Documentation/bpf/map_cgrp_storage.rst b/Documentation/bpf/map_cgrp_storage.rst
>> new file mode 100644
>> index 000000000000..4dfc7770da7e
>> --- /dev/null
>> +++ b/Documentation/bpf/map_cgrp_storage.rst
>> @@ -0,0 +1,109 @@
>> +.. SPDX-License-Identifier: GPL-2.0-only
>> +.. Copyright (C) 2022 Meta Platforms, Inc. and affiliates.
>> +
>> +===========================
>> +BPF_MAP_TYPE_CGRP_STORAGE
>> +===========================
> 
> Small nit, can you trim the == border so it matches the width of
> BPF_MAP_TYPE_CGRP_STORAGE?
> 
>> +
>> +The ``BPF_MAP_TYPE_CGRP_STORAGE`` map type represents a local fixed-size
>> +storage for cgroups. It is only available with ``CONFIG_CGROUP_BPF``.
> 
> This is no longer accurate and should say, "It is only available with
> ``CONFIG_CGROUPS``."
> 
[...]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 2/7] bpf: Refactor inode/task/sk storage map_{alloc,free}() for reuse
  2022-10-24 18:02   ` sdf
@ 2022-10-24 19:08     ` Yonghong Song
  0 siblings, 0 replies; 31+ messages in thread
From: Yonghong Song @ 2022-10-24 19:08 UTC (permalink / raw)
  To: sdf, Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo, David Vernet



On 10/24/22 11:02 AM, sdf@google.com wrote:
> On 10/23, Yonghong Song wrote:
>> Refactor codes so that inode/task/sk storage map_{alloc,free}
>> can maximally share the same code. There is no functionality change.
> 
> Does it make sense to also do the following? (see below, untested)
> We aren't saving much code-wise here, but at least we won't have three
> copies of the same long comment.

Sounds good. Let me do this refactoring as well.

> 
> 
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index 7ea18d4da84b..e4b0b04d081b 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -138,6 +138,8 @@ int bpf_local_storage_map_check_btf(const struct bpf_map *map,
>                       const struct btf_type *key_type,
>                       const struct btf_type *value_type);
> 
> +bool bpf_local_storage_unlink_nolock(struct bpf_local_storage *local_storage);
> +
>   void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
>                      struct bpf_local_storage_elem *selem);
> 
> diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
> index 5f7683b19199..5313cb0b7181 100644
> --- a/kernel/bpf/bpf_inode_storage.c
> +++ b/kernel/bpf/bpf_inode_storage.c
> @@ -56,11 +56,9 @@ static struct bpf_local_storage_data *inode_storage_lookup(struct inode *inode,
> 
>   void bpf_inode_storage_free(struct inode *inode)
>   {
> -    struct bpf_local_storage_elem *selem;
>       struct bpf_local_storage *local_storage;
>       bool free_inode_storage = false;
>       struct bpf_storage_blob *bsb;
> -    struct hlist_node *n;
> 
>       bsb = bpf_inode(inode);
>       if (!bsb)
> @@ -74,30 +72,11 @@ void bpf_inode_storage_free(struct inode *inode)
>           return;
>       }
> 
> -    /* Neither the bpf_prog nor the bpf-map's syscall
> -     * could be modifying the local_storage->list now.
> -     * Thus, no elem can be added-to or deleted-from the
> -     * local_storage->list by the bpf_prog or by the bpf-map's syscall.
> -     *
> -     * It is racing with bpf_local_storage_map_free() alone
> -     * when unlinking elem from the local_storage->list and
> -     * the map's bucket->list.
> -     */
>       raw_spin_lock_bh(&local_storage->lock);
> -    hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> -        /* Always unlink from map before unlinking from
> -         * local_storage.
> -         */
> -        bpf_selem_unlink_map(selem);
> -        free_inode_storage = bpf_selem_unlink_storage_nolock(
> -            local_storage, selem, false, false);
> -    }
> +    free_inode_storage = bpf_local_storage_unlink_nolock(local_storage);
>       raw_spin_unlock_bh(&local_storage->lock);
>       rcu_read_unlock();
> 
> -    /* free_inoode_storage should always be true as long as
> -     * local_storage->list was non-empty.
> -     */
>       if (free_inode_storage)
>           kfree_rcu(local_storage, rcu);
>   }
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 9dc6de1cf185..0ea754953242 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -98,6 +98,36 @@ void bpf_local_storage_free_rcu(struct rcu_head *rcu)
>           kfree_rcu(local_storage, rcu);
>   }
> 
> +bool bpf_local_storage_unlink_nolock(struct bpf_local_storage *local_storage)
> +{
> +    struct bpf_local_storage_elem *selem;
> +    bool free_storage = false;
> +    struct hlist_node *n;
> +
> +    /* Neither the bpf_prog nor the bpf-map's syscall
> +     * could be modifying the local_storage->list now.
> +     * Thus, no elem can be added-to or deleted-from the
> +     * local_storage->list by the bpf_prog or by the bpf-map's syscall.
> +     *
> +     * It is racing with bpf_local_storage_map_free() alone
> +     * when unlinking elem from the local_storage->list and
> +     * the map's bucket->list.
> +     */
> +    hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> +        /* Always unlink from map before unlinking from
> +         * local_storage.
> +         */
> +        bpf_selem_unlink_map(selem);
> +        free_storage = bpf_selem_unlink_storage_nolock(
> +            local_storage, selem, false, false);
> +    }
> +
> +    /* free_storage should always be true as long as
> +     * local_storage->list was non-empty.
> +     */
> +    return free_storage;
> +}
> +
>   static void bpf_selem_free_rcu(struct rcu_head *rcu)
>   {
>       struct bpf_local_storage_elem *selem;
> diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
> index 6f290623347e..60887c25504b 100644
> --- a/kernel/bpf/bpf_task_storage.c
> +++ b/kernel/bpf/bpf_task_storage.c
> @@ -71,10 +71,8 @@ task_storage_lookup(struct task_struct *task, struct bpf_map *map,
> 
>   void bpf_task_storage_free(struct task_struct *task)
>   {
> -    struct bpf_local_storage_elem *selem;
>       struct bpf_local_storage *local_storage;
>       bool free_task_storage = false;
> -    struct hlist_node *n;
>       unsigned long flags;
> 
>       rcu_read_lock();
> @@ -85,32 +83,13 @@ void bpf_task_storage_free(struct task_struct *task)
>           return;
>       }
> 
> -    /* Neither the bpf_prog nor the bpf-map's syscall
> -     * could be modifying the local_storage->list now.
> -     * Thus, no elem can be added-to or deleted-from the
> -     * local_storage->list by the bpf_prog or by the bpf-map's syscall.
> -     *
> -     * It is racing with bpf_local_storage_map_free() alone
> -     * when unlinking elem from the local_storage->list and
> -     * the map's bucket->list.
> -     */
>       bpf_task_storage_lock();
>       raw_spin_lock_irqsave(&local_storage->lock, flags);
> -    hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> -        /* Always unlink from map before unlinking from
> -         * local_storage.
> -         */
> -        bpf_selem_unlink_map(selem);
> -        free_task_storage = bpf_selem_unlink_storage_nolock(
> -            local_storage, selem, false, false);
> -    }
> +    free_task_storage = bpf_local_storage_unlink_nolock(local_storage);
>       raw_spin_unlock_irqrestore(&local_storage->lock, flags);
>       bpf_task_storage_unlock();
>       rcu_read_unlock();
> 
> -    /* free_task_storage should always be true as long as
> -     * local_storage->list was non-empty.
> -     */
>       if (free_task_storage)
>           kfree_rcu(local_storage, rcu);
>   }

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-23 18:05 ` [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
  2022-10-23 20:02   ` David Vernet
@ 2022-10-24 19:19   ` Martin KaFai Lau
  2022-10-25  0:21     ` Yosry Ahmed
  1 sibling, 1 reply; 31+ messages in thread
From: Martin KaFai Lau @ 2022-10-24 19:19 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo, bpf

On 10/23/22 11:05 AM, Yonghong Song wrote:
> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
> +{
> +	struct bpf_local_storage *local_storage;
> +	struct bpf_local_storage_elem *selem;
> +	bool free_cgroup_storage = false;
> +	struct hlist_node *n;
> +	unsigned long flags;
> +
> +	rcu_read_lock();
> +	local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
> +	if (!local_storage) {
> +		rcu_read_unlock();
> +		return;
> +	}
> +
> +	/* Neither the bpf_prog nor the bpf_map's syscall
> +	 * could be modifying the local_storage->list now.
> +	 * Thus, no elem can be added to or deleted from the
> +	 * local_storage->list by the bpf_prog or by the bpf_map's syscall.
> +	 *
> +	 * It is racing with __bpf_local_storage_map_free() alone
> +	 * when unlinking elem from the local_storage->list and
> +	 * the map's bucket->list.
> +	 */
> +	bpf_cgrp_storage_lock();
> +	raw_spin_lock_irqsave(&local_storage->lock, flags);
> +	hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> +		bpf_selem_unlink_map(selem);
> +		/* If local_storage list has only one element, the
> +		 * bpf_selem_unlink_storage_nolock() will return true.
> +		 * Otherwise, it will return false. The current loop iteration
> +		 * intends to remove all local storage. So the last iteration
> +		 * of the loop will set the free_cgroup_storage to true.
> +		 */
> +		free_cgroup_storage =
> +			bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> +	}
> +	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> +	bpf_cgrp_storage_unlock();
> +	rcu_read_unlock();
> +
> +	if (free_cgroup_storage)
> +		kfree_rcu(local_storage, rcu);
> +}

[ ... ]

> +/* *gfp_flags* is a hidden argument provided by the verifier */
> +BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
> +	   void *, value, u64, flags, gfp_t, gfp_flags)
> +{
> +	struct bpf_local_storage_data *sdata;
> +
> +	WARN_ON_ONCE(!bpf_rcu_lock_held());
> +	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
> +		return (unsigned long)NULL;
> +
> +	if (!cgroup)
> +		return (unsigned long)NULL;
> +
> +	if (!bpf_cgrp_storage_trylock())
> +		return (unsigned long)NULL;
> +
> +	sdata = cgroup_storage_lookup(cgroup, map, true);
> +	if (sdata)
> +		goto unlock;
> +
> +	/* only allocate new storage, when the cgroup is refcounted */
> +	if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
> +	    (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
> +		sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
> +						 value, BPF_NOEXIST, gfp_flags);
> +
> +unlock:
> +	bpf_cgrp_storage_unlock();
> +	return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
> +}

[ ... ]

> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 764bdd5fd8d1..32145d066a09 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
>   	struct cgroup_subsys *ss = css->ss;
>   	struct cgroup *cgrp = css->cgroup;
>   
> +#ifdef CONFIG_BPF_SYSCALL
> +	bpf_cgrp_storage_free(cgrp);
> +#endif


After revisiting commit 4bfc0bb2c60e, part of its commit message came to mind:

" ...... it blocks a possibility to implement
   the memcg-based memory accounting for bpf objects, because a circular
   reference dependency will occur. Charged memory pages are pinning the
   corresponding memory cgroup, and if the memory cgroup is pinning
   the attached bpf program, nothing will be ever released."

Considering that bpf_map_kzalloc() is used in bpf_local_storage_map.c and it can
charge the memcg, I wonder if the cgrp_local_storage will have a similar refcnt
loop issue here.

If this is the right place to free the cgrp_local_storage and it is enough to
break this refcnt loop, it would be useful to add some explanation of this
consideration in the commit message.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 6/7] selftests/bpf: Add selftests for new cgroup local storage
  2022-10-23 18:05 ` [PATCH bpf-next v4 6/7] selftests/bpf: Add selftests for " Yonghong Song
  2022-10-23 20:14   ` David Vernet
@ 2022-10-24 20:30   ` Martin KaFai Lau
  2022-10-25  2:26     ` Yonghong Song
  1 sibling, 1 reply; 31+ messages in thread
From: Martin KaFai Lau @ 2022-10-24 20:30 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo, bpf

On 10/23/22 11:05 AM, Yonghong Song wrote:
> Add two tests for new cgroup local storage, one to test bpf program helpers
> and user space map APIs, and the other to test recursive fentry
> triggering won't deadlock.

Other than tracing, it would be very useful to add a bpf_cgrp_storage_get() usage
in a cgroup-bpf prog. Exercising this helper in the existing
SEC(cgroup)/SEC(sockops) tests should be pretty easy, e.g. the
SEC("cgroup/connect6") prog in socket_cookie_prog.c.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 2/7] bpf: Refactor inode/task/sk storage map_{alloc,free}() for reuse
  2022-10-23 18:05 ` [PATCH bpf-next v4 2/7] bpf: Refactor inode/task/sk storage map_{alloc,free}() for reuse Yonghong Song
  2022-10-24 18:02   ` sdf
@ 2022-10-24 20:34   ` Martin KaFai Lau
  2022-10-25  2:28     ` Yonghong Song
  1 sibling, 1 reply; 31+ messages in thread
From: Martin KaFai Lau @ 2022-10-24 20:34 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo, David Vernet,
	bpf

On 10/23/22 11:05 AM, Yonghong Song wrote
> -void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
> -				int __percpu *busy_counter)
> +static void __bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
> +					 int __percpu *busy_counter)

nit.

This map_free does not look like it requires a separate "__" version since it is 
not reused.  Probably just put everything into bpf_local_storage_map_free() 
instead?  (A sketch of the folded shape follows the quoted code below.)

>   {
>   	struct bpf_local_storage_elem *selem;
>   	struct bpf_local_storage_map_bucket *b;
> @@ -613,7 +613,7 @@ int bpf_local_storage_map_alloc_check(union bpf_attr *attr)
>   	return 0;
>   }
>   
> -struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr)
> +static struct bpf_local_storage_map *__bpf_local_storage_map_alloc(union bpf_attr *attr)
>   {
>   	struct bpf_local_storage_map *smap;
>   	unsigned int i;
> @@ -663,3 +663,28 @@ int bpf_local_storage_map_check_btf(const struct bpf_map *map,
>   
>   	return 0;
>   }

[ ... ]

> +void bpf_local_storage_map_free(struct bpf_map *map,
> +				struct bpf_local_storage_cache *cache,
> +				int __percpu *busy_counter)
> +{
> +	struct bpf_local_storage_map *smap;
> +
> +	smap = (struct bpf_local_storage_map *)map;
> +	bpf_local_storage_cache_idx_free(cache, smap->cache_idx);
> +	__bpf_local_storage_map_free(smap, busy_counter);
> +}


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-24 19:19   ` Martin KaFai Lau
@ 2022-10-25  0:21     ` Yosry Ahmed
  2022-10-25  0:32       ` Martin KaFai Lau
  0 siblings, 1 reply; 31+ messages in thread
From: Yosry Ahmed @ 2022-10-25  0:21 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau,
	Tejun Heo, bpf

On Mon, Oct 24, 2022 at 2:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/23/22 11:05 AM, Yonghong Song wrote:
> > +void bpf_cgrp_storage_free(struct cgroup *cgroup)
> > +{
> > +     struct bpf_local_storage *local_storage;
> > +     struct bpf_local_storage_elem *selem;
> > +     bool free_cgroup_storage = false;
> > +     struct hlist_node *n;
> > +     unsigned long flags;
> > +
> > +     rcu_read_lock();
> > +     local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
> > +     if (!local_storage) {
> > +             rcu_read_unlock();
> > +             return;
> > +     }
> > +
> > +     /* Neither the bpf_prog nor the bpf_map's syscall
> > +      * could be modifying the local_storage->list now.
> > +      * Thus, no elem can be added to or deleted from the
> > +      * local_storage->list by the bpf_prog or by the bpf_map's syscall.
> > +      *
> > +      * It is racing with __bpf_local_storage_map_free() alone
> > +      * when unlinking elem from the local_storage->list and
> > +      * the map's bucket->list.
> > +      */
> > +     bpf_cgrp_storage_lock();
> > +     raw_spin_lock_irqsave(&local_storage->lock, flags);
> > +     hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> > +             bpf_selem_unlink_map(selem);
> > +             /* If local_storage list has only one element, the
> > +              * bpf_selem_unlink_storage_nolock() will return true.
> > +              * Otherwise, it will return false. The current loop iteration
> > +              * intends to remove all local storage. So the last iteration
> > +              * of the loop will set the free_cgroup_storage to true.
> > +              */
> > +             free_cgroup_storage =
> > +                     bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> > +     }
> > +     raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> > +     bpf_cgrp_storage_unlock();
> > +     rcu_read_unlock();
> > +
> > +     if (free_cgroup_storage)
> > +             kfree_rcu(local_storage, rcu);
> > +}
>
> [ ... ]
>
> > +/* *gfp_flags* is a hidden argument provided by the verifier */
> > +BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
> > +        void *, value, u64, flags, gfp_t, gfp_flags)
> > +{
> > +     struct bpf_local_storage_data *sdata;
> > +
> > +     WARN_ON_ONCE(!bpf_rcu_lock_held());
> > +     if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
> > +             return (unsigned long)NULL;
> > +
> > +     if (!cgroup)
> > +             return (unsigned long)NULL;
> > +
> > +     if (!bpf_cgrp_storage_trylock())
> > +             return (unsigned long)NULL;
> > +
> > +     sdata = cgroup_storage_lookup(cgroup, map, true);
> > +     if (sdata)
> > +             goto unlock;
> > +
> > +     /* only allocate new storage, when the cgroup is refcounted */
> > +     if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
> > +         (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
> > +             sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
> > +                                              value, BPF_NOEXIST, gfp_flags);
> > +
> > +unlock:
> > +     bpf_cgrp_storage_unlock();
> > +     return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
> > +}
>
> [ ... ]
>
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index 764bdd5fd8d1..32145d066a09 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
> >       struct cgroup_subsys *ss = css->ss;
> >       struct cgroup *cgrp = css->cgroup;
> >
> > +#ifdef CONFIG_BPF_SYSCALL
> > +     bpf_cgrp_storage_free(cgrp);
> > +#endif
>
>
> After revisiting comment 4bfc0bb2c60e, some of the commit message came to my mind:
>
> " ...... it blocks a possibility to implement
>    the memcg-based memory accounting for bpf objects, because a circular
>    reference dependency will occur. Charged memory pages are pinning the
>    corresponding memory cgroup, and if the memory cgroup is pinning
>    the attached bpf program, nothing will be ever released."
>
> Considering the bpf_map_kzalloc() is used in bpf_local_storage_map.c and it can
> charge the memcg, I wonder if the cgrp_local_storage will have similar refcnt
> loop issue here.
>
> If here is the right place to free the cgrp_local_storage() and enough to break
> this refcnt loop, it will be useful to add some explanation and its
> consideration in the commit message.
>

I think a similar refcount loop issue can happen here as well. IIUC,
this function will only be run when the css is released after all
references are dropped. So if memcg charging is enabled, the cgroup
will never be removed. This one might be trickier to handle, though...

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-25  0:21     ` Yosry Ahmed
@ 2022-10-25  0:32       ` Martin KaFai Lau
  2022-10-25  0:48         ` Yosry Ahmed
  0 siblings, 1 reply; 31+ messages in thread
From: Martin KaFai Lau @ 2022-10-25  0:32 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau,
	Tejun Heo, bpf

On 10/24/22 5:21 PM, Yosry Ahmed wrote:
> On Mon, Oct 24, 2022 at 2:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/23/22 11:05 AM, Yonghong Song wrote:
>>> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
>>> +{
>>> +     struct bpf_local_storage *local_storage;
>>> +     struct bpf_local_storage_elem *selem;
>>> +     bool free_cgroup_storage = false;
>>> +     struct hlist_node *n;
>>> +     unsigned long flags;
>>> +
>>> +     rcu_read_lock();
>>> +     local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
>>> +     if (!local_storage) {
>>> +             rcu_read_unlock();
>>> +             return;
>>> +     }
>>> +
>>> +     /* Neither the bpf_prog nor the bpf_map's syscall
>>> +      * could be modifying the local_storage->list now.
>>> +      * Thus, no elem can be added to or deleted from the
>>> +      * local_storage->list by the bpf_prog or by the bpf_map's syscall.
>>> +      *
>>> +      * It is racing with __bpf_local_storage_map_free() alone
>>> +      * when unlinking elem from the local_storage->list and
>>> +      * the map's bucket->list.
>>> +      */
>>> +     bpf_cgrp_storage_lock();
>>> +     raw_spin_lock_irqsave(&local_storage->lock, flags);
>>> +     hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
>>> +             bpf_selem_unlink_map(selem);
>>> +             /* If local_storage list has only one element, the
>>> +              * bpf_selem_unlink_storage_nolock() will return true.
>>> +              * Otherwise, it will return false. The current loop iteration
>>> +              * intends to remove all local storage. So the last iteration
>>> +              * of the loop will set the free_cgroup_storage to true.
>>> +              */
>>> +             free_cgroup_storage =
>>> +                     bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
>>> +     }
>>> +     raw_spin_unlock_irqrestore(&local_storage->lock, flags);
>>> +     bpf_cgrp_storage_unlock();
>>> +     rcu_read_unlock();
>>> +
>>> +     if (free_cgroup_storage)
>>> +             kfree_rcu(local_storage, rcu);
>>> +}
>>
>> [ ... ]
>>
>>> +/* *gfp_flags* is a hidden argument provided by the verifier */
>>> +BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
>>> +        void *, value, u64, flags, gfp_t, gfp_flags)
>>> +{
>>> +     struct bpf_local_storage_data *sdata;
>>> +
>>> +     WARN_ON_ONCE(!bpf_rcu_lock_held());
>>> +     if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
>>> +             return (unsigned long)NULL;
>>> +
>>> +     if (!cgroup)
>>> +             return (unsigned long)NULL;
>>> +
>>> +     if (!bpf_cgrp_storage_trylock())
>>> +             return (unsigned long)NULL;
>>> +
>>> +     sdata = cgroup_storage_lookup(cgroup, map, true);
>>> +     if (sdata)
>>> +             goto unlock;
>>> +
>>> +     /* only allocate new storage, when the cgroup is refcounted */
>>> +     if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
>>> +         (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
>>> +             sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
>>> +                                              value, BPF_NOEXIST, gfp_flags);
>>> +
>>> +unlock:
>>> +     bpf_cgrp_storage_unlock();
>>> +     return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
>>> +}
>>
>> [ ... ]
>>
>>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>>> index 764bdd5fd8d1..32145d066a09 100644
>>> --- a/kernel/cgroup/cgroup.c
>>> +++ b/kernel/cgroup/cgroup.c
>>> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
>>>        struct cgroup_subsys *ss = css->ss;
>>>        struct cgroup *cgrp = css->cgroup;
>>>
>>> +#ifdef CONFIG_BPF_SYSCALL
>>> +     bpf_cgrp_storage_free(cgrp);
>>> +#endif
>>
>>
>> After revisiting comment 4bfc0bb2c60e, some of the commit message came to my mind:
>>
>> " ...... it blocks a possibility to implement
>>     the memcg-based memory accounting for bpf objects, because a circular
>>     reference dependency will occur. Charged memory pages are pinning the
>>     corresponding memory cgroup, and if the memory cgroup is pinning
>>     the attached bpf program, nothing will be ever released."
>>
>> Considering the bpf_map_kzalloc() is used in bpf_local_storage_map.c and it can
>> charge the memcg, I wonder if the cgrp_local_storage will have similar refcnt
>> loop issue here.
>>
>> If here is the right place to free the cgrp_local_storage() and enough to break
>> this refcnt loop, it will be useful to add some explanation and its
>> consideration in the commit message.
>>
> 
> I think a similar refcount loop issue can happen here as well. IIUC,
> this function will only be run when the css is released after all
> references are dropped. So if memcg charging is enabled the cgroup
> will never be removed. This one might be trickier to handle though..

How about removing all storage from cgrp->bpf_cgrp_storage in 
cgroup_destroy_locked()?
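
ie. something like this (untested, context elided):

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ static int cgroup_destroy_locked(struct cgroup *cgrp)
+	/* drop cgrp local storage at rmdir time instead of waiting
+	 * for the last cgroup reference to go away
+	 */
+#ifdef CONFIG_BPF_SYSCALL
+	bpf_cgrp_storage_free(cgrp);
+#endif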

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-25  0:32       ` Martin KaFai Lau
@ 2022-10-25  0:48         ` Yosry Ahmed
  2022-10-25  0:55           ` Yosry Ahmed
  2022-10-25  1:36           ` Martin KaFai Lau
  0 siblings, 2 replies; 31+ messages in thread
From: Yosry Ahmed @ 2022-10-25  0:48 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau,
	Tejun Heo, bpf

On Mon, Oct 24, 2022 at 5:32 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/24/22 5:21 PM, Yosry Ahmed wrote:
> > On Mon, Oct 24, 2022 at 2:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 10/23/22 11:05 AM, Yonghong Song wrote:
> >>> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
> >>> +{
> >>> +     struct bpf_local_storage *local_storage;
> >>> +     struct bpf_local_storage_elem *selem;
> >>> +     bool free_cgroup_storage = false;
> >>> +     struct hlist_node *n;
> >>> +     unsigned long flags;
> >>> +
> >>> +     rcu_read_lock();
> >>> +     local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
> >>> +     if (!local_storage) {
> >>> +             rcu_read_unlock();
> >>> +             return;
> >>> +     }
> >>> +
> >>> +     /* Neither the bpf_prog nor the bpf_map's syscall
> >>> +      * could be modifying the local_storage->list now.
> >>> +      * Thus, no elem can be added to or deleted from the
> >>> +      * local_storage->list by the bpf_prog or by the bpf_map's syscall.
> >>> +      *
> >>> +      * It is racing with __bpf_local_storage_map_free() alone
> >>> +      * when unlinking elem from the local_storage->list and
> >>> +      * the map's bucket->list.
> >>> +      */
> >>> +     bpf_cgrp_storage_lock();
> >>> +     raw_spin_lock_irqsave(&local_storage->lock, flags);
> >>> +     hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> >>> +             bpf_selem_unlink_map(selem);
> >>> +             /* If local_storage list has only one element, the
> >>> +              * bpf_selem_unlink_storage_nolock() will return true.
> >>> +              * Otherwise, it will return false. The current loop iteration
> >>> +              * intends to remove all local storage. So the last iteration
> >>> +              * of the loop will set the free_cgroup_storage to true.
> >>> +              */
> >>> +             free_cgroup_storage =
> >>> +                     bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> >>> +     }
> >>> +     raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> >>> +     bpf_cgrp_storage_unlock();
> >>> +     rcu_read_unlock();
> >>> +
> >>> +     if (free_cgroup_storage)
> >>> +             kfree_rcu(local_storage, rcu);
> >>> +}
> >>
> >> [ ... ]
> >>
> >>> +/* *gfp_flags* is a hidden argument provided by the verifier */
> >>> +BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
> >>> +        void *, value, u64, flags, gfp_t, gfp_flags)
> >>> +{
> >>> +     struct bpf_local_storage_data *sdata;
> >>> +
> >>> +     WARN_ON_ONCE(!bpf_rcu_lock_held());
> >>> +     if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
> >>> +             return (unsigned long)NULL;
> >>> +
> >>> +     if (!cgroup)
> >>> +             return (unsigned long)NULL;
> >>> +
> >>> +     if (!bpf_cgrp_storage_trylock())
> >>> +             return (unsigned long)NULL;
> >>> +
> >>> +     sdata = cgroup_storage_lookup(cgroup, map, true);
> >>> +     if (sdata)
> >>> +             goto unlock;
> >>> +
> >>> +     /* only allocate new storage, when the cgroup is refcounted */
> >>> +     if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
> >>> +         (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
> >>> +             sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
> >>> +                                              value, BPF_NOEXIST, gfp_flags);
> >>> +
> >>> +unlock:
> >>> +     bpf_cgrp_storage_unlock();
> >>> +     return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
> >>> +}
> >>
> >> [ ... ]
> >>
> >>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> >>> index 764bdd5fd8d1..32145d066a09 100644
> >>> --- a/kernel/cgroup/cgroup.c
> >>> +++ b/kernel/cgroup/cgroup.c
> >>> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
> >>>        struct cgroup_subsys *ss = css->ss;
> >>>        struct cgroup *cgrp = css->cgroup;
> >>>
> >>> +#ifdef CONFIG_BPF_SYSCALL
> >>> +     bpf_cgrp_storage_free(cgrp);
> >>> +#endif
> >>
> >>
> >> After revisiting comment 4bfc0bb2c60e, some of the commit message came to my mind:
> >>
> >> " ...... it blocks a possibility to implement
> >>     the memcg-based memory accounting for bpf objects, because a circular
> >>     reference dependency will occur. Charged memory pages are pinning the
> >>     corresponding memory cgroup, and if the memory cgroup is pinning
> >>     the attached bpf program, nothing will be ever released."
> >>
> >> Considering the bpf_map_kzalloc() is used in bpf_local_storage_map.c and it can
> >> charge the memcg, I wonder if the cgrp_local_storage will have similar refcnt
> >> loop issue here.
> >>
> >> If here is the right place to free the cgrp_local_storage() and enough to break
> >> this refcnt loop, it will be useful to add some explanation and its
> >> consideration in the commit message.
> >>
> >
> > I think a similar refcount loop issue can happen here as well. IIUC,
> > this function will only be run when the css is released after all
> > references are dropped. So if memcg charging is enabled the cgroup
> > will never be removed. This one might be trickier to handle though..
>
> How about removing all storage from cgrp->bpf_cgrp_storage in
> cgroup_destroy_locked()?

The problem here is that you lose information for cgroups that have
been rmdir'd but still exist in the kernel (i.e., offline cgroups). The
commit log of 4bfc0bb2c60e mentions that such cgroups can have live
sockets attached, so this might be a problem? From a memory
perspective, offline memcgs can still undergo memory operations like
reclaim. If we are using BPF to collect cgroup statistics for memory
reclaim, we can't do so for offline memcgs, which is not the end of
the world, but the cgroup storages become slightly less powerful. We
might also lose data that we have already stored for such offline
memcgs. Also, BPF programs would now need to handle the case where
they have a valid cgroup pointer but cannot retrieve a cgroup storage
for it because the cgroup went offline.

We ideally want to be able to charge the memory to the cgroup without
holding a ref to it, which is against the cgroup memory charging
model.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-25  0:48         ` Yosry Ahmed
@ 2022-10-25  0:55           ` Yosry Ahmed
  2022-10-25  2:38             ` Roman Gushchin
  2022-10-25  1:36           ` Martin KaFai Lau
  1 sibling, 1 reply; 31+ messages in thread
From: Yosry Ahmed @ 2022-10-25  0:55 UTC (permalink / raw)
  To: Martin KaFai Lau, Roman Gushchin
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau,
	Tejun Heo, bpf

On Mon, Oct 24, 2022 at 5:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Oct 24, 2022 at 5:32 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 10/24/22 5:21 PM, Yosry Ahmed wrote:
> > > On Mon, Oct 24, 2022 at 2:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >>
> > >> On 10/23/22 11:05 AM, Yonghong Song wrote:
> > >>> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
> > >>> +{
> > >>> +     struct bpf_local_storage *local_storage;
> > >>> +     struct bpf_local_storage_elem *selem;
> > >>> +     bool free_cgroup_storage = false;
> > >>> +     struct hlist_node *n;
> > >>> +     unsigned long flags;
> > >>> +
> > >>> +     rcu_read_lock();
> > >>> +     local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
> > >>> +     if (!local_storage) {
> > >>> +             rcu_read_unlock();
> > >>> +             return;
> > >>> +     }
> > >>> +
> > >>> +     /* Neither the bpf_prog nor the bpf_map's syscall
> > >>> +      * could be modifying the local_storage->list now.
> > >>> +      * Thus, no elem can be added to or deleted from the
> > >>> +      * local_storage->list by the bpf_prog or by the bpf_map's syscall.
> > >>> +      *
> > >>> +      * It is racing with __bpf_local_storage_map_free() alone
> > >>> +      * when unlinking elem from the local_storage->list and
> > >>> +      * the map's bucket->list.
> > >>> +      */
> > >>> +     bpf_cgrp_storage_lock();
> > >>> +     raw_spin_lock_irqsave(&local_storage->lock, flags);
> > >>> +     hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> > >>> +             bpf_selem_unlink_map(selem);
> > >>> +             /* If local_storage list has only one element, the
> > >>> +              * bpf_selem_unlink_storage_nolock() will return true.
> > >>> +              * Otherwise, it will return false. The current loop iteration
> > >>> +              * intends to remove all local storage. So the last iteration
> > >>> +              * of the loop will set the free_cgroup_storage to true.
> > >>> +              */
> > >>> +             free_cgroup_storage =
> > >>> +                     bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> > >>> +     }
> > >>> +     raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> > >>> +     bpf_cgrp_storage_unlock();
> > >>> +     rcu_read_unlock();
> > >>> +
> > >>> +     if (free_cgroup_storage)
> > >>> +             kfree_rcu(local_storage, rcu);
> > >>> +}
> > >>
> > >> [ ... ]
> > >>
> > >>> +/* *gfp_flags* is a hidden argument provided by the verifier */
> > >>> +BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
> > >>> +        void *, value, u64, flags, gfp_t, gfp_flags)
> > >>> +{
> > >>> +     struct bpf_local_storage_data *sdata;
> > >>> +
> > >>> +     WARN_ON_ONCE(!bpf_rcu_lock_held());
> > >>> +     if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
> > >>> +             return (unsigned long)NULL;
> > >>> +
> > >>> +     if (!cgroup)
> > >>> +             return (unsigned long)NULL;
> > >>> +
> > >>> +     if (!bpf_cgrp_storage_trylock())
> > >>> +             return (unsigned long)NULL;
> > >>> +
> > >>> +     sdata = cgroup_storage_lookup(cgroup, map, true);
> > >>> +     if (sdata)
> > >>> +             goto unlock;
> > >>> +
> > >>> +     /* only allocate new storage, when the cgroup is refcounted */
> > >>> +     if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
> > >>> +         (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
> > >>> +             sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
> > >>> +                                              value, BPF_NOEXIST, gfp_flags);
> > >>> +
> > >>> +unlock:
> > >>> +     bpf_cgrp_storage_unlock();
> > >>> +     return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
> > >>> +}
> > >>
> > >> [ ... ]
> > >>
> > >>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > >>> index 764bdd5fd8d1..32145d066a09 100644
> > >>> --- a/kernel/cgroup/cgroup.c
> > >>> +++ b/kernel/cgroup/cgroup.c
> > >>> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
> > >>>        struct cgroup_subsys *ss = css->ss;
> > >>>        struct cgroup *cgrp = css->cgroup;
> > >>>
> > >>> +#ifdef CONFIG_BPF_SYSCALL
> > >>> +     bpf_cgrp_storage_free(cgrp);
> > >>> +#endif
> > >>
> > >>
> > >> After revisiting comment 4bfc0bb2c60e, some of the commit message came to my mind:
> > >>
> > >> " ...... it blocks a possibility to implement
> > >>     the memcg-based memory accounting for bpf objects, because a circular
> > >>     reference dependency will occur. Charged memory pages are pinning the
> > >>     corresponding memory cgroup, and if the memory cgroup is pinning
> > >>     the attached bpf program, nothing will be ever released."
> > >>
> > >> Considering the bpf_map_kzalloc() is used in bpf_local_storage_map.c and it can
> > >> charge the memcg, I wonder if the cgrp_local_storage will have similar refcnt
> > >> loop issue here.
> > >>
> > >> If here is the right place to free the cgrp_local_storage() and enough to break
> > >> this refcnt loop, it will be useful to add some explanation and its
> > >> consideration in the commit message.
> > >>
> > >
> > > I think a similar refcount loop issue can happen here as well. IIUC,
> > > this function will only be run when the css is released after all
> > > references are dropped. So if memcg charging is enabled the cgroup
> > > will never be removed. This one might be trickier to handle though..
> >
> > How about removing all storage from cgrp->bpf_cgrp_storage in
> > cgroup_destroy_locked()?
>
> The problem here is that you lose information for cgroups that went
> offline but still exist in the kernel (i.e offline cgroups). The
> commit log 4bfc0bb2c60e mentions that such cgroups can have live
> sockets attached, so this might be a problem? From a memory
> perspective, offline memcgs can still undergo memory operations like
> reclaim. If we are using BPF to collect cgroup statistics for memory
> reclaim, we can't do so for offline memcgs, which is not the end of
> the world, but the cgroup storages become slightly less powerful. We
> might also lose some data that we have already stored for such offline
> memcgs. Also BPF programs now need to handle the case where they have
> a valid cgroup pointer but they cannot retrieve a cgroup storage for
> it because it went offline.
>
> We ideally want to be able to charge the memory to the cgroup without
> holding a ref to it, which is against the cgroup memory charging
> model.

+Roman Gushchin

Wait a second. I think Roman implemented reparenting of BPF map memcg
charges when a memcg is offlined. In this case, when a cgroup goes
offline (aka rmdir()'d), the charges will actually move to the parent,
and the local storage will no longer be holding any refs to the
cgroup. In that case, the current freeing location should be fine and
the cgroup local storage can continue to work with offlined cgroups.

Roman, could you please confirm or deny my thesis?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-25  0:48         ` Yosry Ahmed
  2022-10-25  0:55           ` Yosry Ahmed
@ 2022-10-25  1:36           ` Martin KaFai Lau
  2022-10-25  5:44             ` Yosry Ahmed
  1 sibling, 1 reply; 31+ messages in thread
From: Martin KaFai Lau @ 2022-10-25  1:36 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau,
	Tejun Heo, bpf

On 10/24/22 5:48 PM, Yosry Ahmed wrote:
> On Mon, Oct 24, 2022 at 5:32 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/24/22 5:21 PM, Yosry Ahmed wrote:
>>> On Mon, Oct 24, 2022 at 2:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 10/23/22 11:05 AM, Yonghong Song wrote:
>>>>> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
>>>>> +{
>>>>> +     struct bpf_local_storage *local_storage;
>>>>> +     struct bpf_local_storage_elem *selem;
>>>>> +     bool free_cgroup_storage = false;
>>>>> +     struct hlist_node *n;
>>>>> +     unsigned long flags;
>>>>> +
>>>>> +     rcu_read_lock();
>>>>> +     local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
>>>>> +     if (!local_storage) {
>>>>> +             rcu_read_unlock();
>>>>> +             return;
>>>>> +     }
>>>>> +
>>>>> +     /* Neither the bpf_prog nor the bpf_map's syscall
>>>>> +      * could be modifying the local_storage->list now.
>>>>> +      * Thus, no elem can be added to or deleted from the
>>>>> +      * local_storage->list by the bpf_prog or by the bpf_map's syscall.
>>>>> +      *
>>>>> +      * It is racing with __bpf_local_storage_map_free() alone
>>>>> +      * when unlinking elem from the local_storage->list and
>>>>> +      * the map's bucket->list.
>>>>> +      */
>>>>> +     bpf_cgrp_storage_lock();
>>>>> +     raw_spin_lock_irqsave(&local_storage->lock, flags);
>>>>> +     hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
>>>>> +             bpf_selem_unlink_map(selem);
>>>>> +             /* If local_storage list has only one element, the
>>>>> +              * bpf_selem_unlink_storage_nolock() will return true.
>>>>> +              * Otherwise, it will return false. The current loop iteration
>>>>> +              * intends to remove all local storage. So the last iteration
>>>>> +              * of the loop will set the free_cgroup_storage to true.
>>>>> +              */
>>>>> +             free_cgroup_storage =
>>>>> +                     bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
>>>>> +     }
>>>>> +     raw_spin_unlock_irqrestore(&local_storage->lock, flags);
>>>>> +     bpf_cgrp_storage_unlock();
>>>>> +     rcu_read_unlock();
>>>>> +
>>>>> +     if (free_cgroup_storage)
>>>>> +             kfree_rcu(local_storage, rcu);
>>>>> +}
>>>>
>>>> [ ... ]
>>>>
>>>>> +/* *gfp_flags* is a hidden argument provided by the verifier */
>>>>> +BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
>>>>> +        void *, value, u64, flags, gfp_t, gfp_flags)
>>>>> +{
>>>>> +     struct bpf_local_storage_data *sdata;
>>>>> +
>>>>> +     WARN_ON_ONCE(!bpf_rcu_lock_held());
>>>>> +     if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
>>>>> +             return (unsigned long)NULL;
>>>>> +
>>>>> +     if (!cgroup)
>>>>> +             return (unsigned long)NULL;
>>>>> +
>>>>> +     if (!bpf_cgrp_storage_trylock())
>>>>> +             return (unsigned long)NULL;
>>>>> +
>>>>> +     sdata = cgroup_storage_lookup(cgroup, map, true);
>>>>> +     if (sdata)
>>>>> +             goto unlock;
>>>>> +
>>>>> +     /* only allocate new storage, when the cgroup is refcounted */
>>>>> +     if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
>>>>> +         (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
>>>>> +             sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
>>>>> +                                              value, BPF_NOEXIST, gfp_flags);
>>>>> +
>>>>> +unlock:
>>>>> +     bpf_cgrp_storage_unlock();
>>>>> +     return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
>>>>> +}
>>>>
>>>> [ ... ]
>>>>
>>>>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>>>>> index 764bdd5fd8d1..32145d066a09 100644
>>>>> --- a/kernel/cgroup/cgroup.c
>>>>> +++ b/kernel/cgroup/cgroup.c
>>>>> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
>>>>>         struct cgroup_subsys *ss = css->ss;
>>>>>         struct cgroup *cgrp = css->cgroup;
>>>>>
>>>>> +#ifdef CONFIG_BPF_SYSCALL
>>>>> +     bpf_cgrp_storage_free(cgrp);
>>>>> +#endif
>>>>
>>>>
>>>> After revisiting comment 4bfc0bb2c60e, some of the commit message came to my mind:
>>>>
>>>> " ...... it blocks a possibility to implement
>>>>      the memcg-based memory accounting for bpf objects, because a circular
>>>>      reference dependency will occur. Charged memory pages are pinning the
>>>>      corresponding memory cgroup, and if the memory cgroup is pinning
>>>>      the attached bpf program, nothing will be ever released."
>>>>
>>>> Considering the bpf_map_kzalloc() is used in bpf_local_storage_map.c and it can
>>>> charge the memcg, I wonder if the cgrp_local_storage will have similar refcnt
>>>> loop issue here.
>>>>
>>>> If here is the right place to free the cgrp_local_storage() and enough to break
>>>> this refcnt loop, it will be useful to add some explanation and its
>>>> consideration in the commit message.
>>>>
>>>
>>> I think a similar refcount loop issue can happen here as well. IIUC,
>>> this function will only be run when the css is released after all
>>> references are dropped. So if memcg charging is enabled the cgroup
>>> will never be removed. This one might be trickier to handle though..
>>
>> How about removing all storage from cgrp->bpf_cgrp_storage in
>> cgroup_destroy_locked()?
> 
> The problem here is that you lose information for cgroups that went
> offline but still exist in the kernel (i.e offline cgroups). The
> commit log 4bfc0bb2c60e mentions that such cgroups can have live
> sockets attached, so this might be a problem?

Keeping the cgrp_storage around is useful for the cgroup-bpf prog that will be 
called upon some sk events (e.g. ingress/egress).  The cgrp_storage cleanup could 
instead be done in cgroup_bpf_release_fn(), so that it waits until all sk usage 
is done.
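
Something like this (untested; the two existing lines are from the
current cgroup_bpf_release_fn(), and whether the free would need to be
deferred to the release work item instead is left open):

@@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
 	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
 
+	bpf_cgrp_storage_free(cgrp);
 	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
 	queue_work(system_wq, &cgrp->bpf.release_work);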

> From a memory perspective, offline memcgs can still undergo memory operations like
> reclaim. If we are using BPF to collect cgroup statistics for memory
> reclaim, we can't do so for offline memcgs, which is not the end of
> the world, but the cgroup storages become slightly less powerful. We
> might also lose some data that we have already stored for such offline
> memcgs. Also BPF programs now need to handle the case where they have
> a valid cgroup pointer but they cannot retrieve a cgroup storage for
> it because it went offline.

iiuc, the use case is to be able to use the cgrp_storage at some earlier stage 
of the cgroup destruction.  A noob question: I wonder if there is a cgroup that 
will never go away, e.g. the root cgrp?  Then the cgrp_storage cleanup could be 
more selective and avoid cleaning up the cgrp storage charged to the root cgroup.

> We ideally want to be able to charge the memory to the cgroup without
> holding a ref to it, which is against the cgroup memory charging
> model.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 6/7] selftests/bpf: Add selftests for new cgroup local storage
  2022-10-24 20:30   ` Martin KaFai Lau
@ 2022-10-25  2:26     ` Yonghong Song
  0 siblings, 0 replies; 31+ messages in thread
From: Yonghong Song @ 2022-10-25  2:26 UTC (permalink / raw)
  To: Martin KaFai Lau, Yonghong Song
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo, bpf



On 10/24/22 1:30 PM, Martin KaFai Lau wrote:
> On 10/23/22 11:05 AM, Yonghong Song wrote:
>> Add two tests for new cgroup local storage, one to test bpf program 
>> helpers
>> and user space map APIs, and the other to test recursive fentry
>> triggering won't deadlock.
> 
> Other than tracing, it will be very useful to add a 
> bpf_cgrp_storage_get() usage in a cgroup-bpf prog.  Exercising this 
> helper in the existing SEC(cgroup)/SEC(sockops) tests should be pretty 
> easy.  eg. The SEC("cgroup/connect6") in socket_cookie_prog.c.

Will do.
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 2/7] bpf: Refactor inode/task/sk storage map_{alloc,free}() for reuse
  2022-10-24 20:34   ` Martin KaFai Lau
@ 2022-10-25  2:28     ` Yonghong Song
  0 siblings, 0 replies; 31+ messages in thread
From: Yonghong Song @ 2022-10-25  2:28 UTC (permalink / raw)
  To: Martin KaFai Lau, Yonghong Song
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, KP Singh, Martin KaFai Lau, Tejun Heo, David Vernet,
	bpf



On 10/24/22 1:34 PM, Martin KaFai Lau wrote:
> On 10/23/22 11:05 AM, Yonghong Song wrote
>> -void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
>> -                int __percpu *busy_counter)
>> +static void __bpf_local_storage_map_free(struct bpf_local_storage_map 
>> *smap,
>> +                     int __percpu *busy_counter)
> 
> nit.
> 
> This map_free does not look like it requires a separate "__" version 
> since it is not reused.  probably just put everything into the 
> bpf_local_storage_map_free() instead?

Okay, will inline __bpf_local_storage_map_free() into
bpf_local_storage_map_free().

> 
>>   {
>>       struct bpf_local_storage_elem *selem;
>>       struct bpf_local_storage_map_bucket *b;
>> @@ -613,7 +613,7 @@ int bpf_local_storage_map_alloc_check(union 
>> bpf_attr *attr)
>>       return 0;
>>   }
>> -struct bpf_local_storage_map *bpf_local_storage_map_alloc(union 
>> bpf_attr *attr)
>> +static struct bpf_local_storage_map 
>> *__bpf_local_storage_map_alloc(union bpf_attr *attr)
>>   {
>>       struct bpf_local_storage_map *smap;
>>       unsigned int i;
>> @@ -663,3 +663,28 @@ int bpf_local_storage_map_check_btf(const struct 
>> bpf_map *map,
>>       return 0;
>>   }
> 
> [ ... ]
> 
>> +void bpf_local_storage_map_free(struct bpf_map *map,
>> +                struct bpf_local_storage_cache *cache,
>> +                int __percpu *busy_counter)
>> +{
>> +    struct bpf_local_storage_map *smap;
>> +
>> +    smap = (struct bpf_local_storage_map *)map;
>> +    bpf_local_storage_cache_idx_free(cache, smap->cache_idx);
>> +    __bpf_local_storage_map_free(smap, busy_counter);
>> +}
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-25  0:55           ` Yosry Ahmed
@ 2022-10-25  2:38             ` Roman Gushchin
  2022-10-25  5:46               ` Yosry Ahmed
  0 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2022-10-25  2:38 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Martin KaFai Lau, Yonghong Song, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh,
	Martin KaFai Lau, Tejun Heo, bpf

On Mon, Oct 24, 2022 at 05:55:17PM -0700, Yosry Ahmed wrote:
> On Mon, Oct 24, 2022 at 5:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Oct 24, 2022 at 5:32 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >
> > > On 10/24/22 5:21 PM, Yosry Ahmed wrote:
> > > > On Mon, Oct 24, 2022 at 2:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > > >>
> > > >> On 10/23/22 11:05 AM, Yonghong Song wrote:
> > > >>> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
> > > >>> +{
> > > >>> +     struct bpf_local_storage *local_storage;
> > > >>> +     struct bpf_local_storage_elem *selem;
> > > >>> +     bool free_cgroup_storage = false;
> > > >>> +     struct hlist_node *n;
> > > >>> +     unsigned long flags;
> > > >>> +
> > > >>> +     rcu_read_lock();
> > > >>> +     local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
> > > >>> +     if (!local_storage) {
> > > >>> +             rcu_read_unlock();
> > > >>> +             return;
> > > >>> +     }
> > > >>> +
> > > >>> +     /* Neither the bpf_prog nor the bpf_map's syscall
> > > >>> +      * could be modifying the local_storage->list now.
> > > >>> +      * Thus, no elem can be added to or deleted from the
> > > >>> +      * local_storage->list by the bpf_prog or by the bpf_map's syscall.
> > > >>> +      *
> > > >>> +      * It is racing with __bpf_local_storage_map_free() alone
> > > >>> +      * when unlinking elem from the local_storage->list and
> > > >>> +      * the map's bucket->list.
> > > >>> +      */
> > > >>> +     bpf_cgrp_storage_lock();
> > > >>> +     raw_spin_lock_irqsave(&local_storage->lock, flags);
> > > >>> +     hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> > > >>> +             bpf_selem_unlink_map(selem);
> > > >>> +             /* If local_storage list has only one element, the
> > > >>> +              * bpf_selem_unlink_storage_nolock() will return true.
> > > >>> +              * Otherwise, it will return false. The current loop iteration
> > > >>> +              * intends to remove all local storage. So the last iteration
> > > >>> +              * of the loop will set the free_cgroup_storage to true.
> > > >>> +              */
> > > >>> +             free_cgroup_storage =
> > > >>> +                     bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> > > >>> +     }
> > > >>> +     raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> > > >>> +     bpf_cgrp_storage_unlock();
> > > >>> +     rcu_read_unlock();
> > > >>> +
> > > >>> +     if (free_cgroup_storage)
> > > >>> +             kfree_rcu(local_storage, rcu);
> > > >>> +}
> > > >>
> > > >> [ ... ]
> > > >>
> > > >>> +/* *gfp_flags* is a hidden argument provided by the verifier */
> > > >>> +BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
> > > >>> +        void *, value, u64, flags, gfp_t, gfp_flags)
> > > >>> +{
> > > >>> +     struct bpf_local_storage_data *sdata;
> > > >>> +
> > > >>> +     WARN_ON_ONCE(!bpf_rcu_lock_held());
> > > >>> +     if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
> > > >>> +             return (unsigned long)NULL;
> > > >>> +
> > > >>> +     if (!cgroup)
> > > >>> +             return (unsigned long)NULL;
> > > >>> +
> > > >>> +     if (!bpf_cgrp_storage_trylock())
> > > >>> +             return (unsigned long)NULL;
> > > >>> +
> > > >>> +     sdata = cgroup_storage_lookup(cgroup, map, true);
> > > >>> +     if (sdata)
> > > >>> +             goto unlock;
> > > >>> +
> > > >>> +     /* only allocate new storage, when the cgroup is refcounted */
> > > >>> +     if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
> > > >>> +         (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
> > > >>> +             sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
> > > >>> +                                              value, BPF_NOEXIST, gfp_flags);
> > > >>> +
> > > >>> +unlock:
> > > >>> +     bpf_cgrp_storage_unlock();
> > > >>> +     return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
> > > >>> +}
> > > >>
> > > >> [ ... ]
> > > >>
> > > >>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > >>> index 764bdd5fd8d1..32145d066a09 100644
> > > >>> --- a/kernel/cgroup/cgroup.c
> > > >>> +++ b/kernel/cgroup/cgroup.c
> > > >>> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
> > > >>>        struct cgroup_subsys *ss = css->ss;
> > > >>>        struct cgroup *cgrp = css->cgroup;
> > > >>>
> > > >>> +#ifdef CONFIG_BPF_SYSCALL
> > > >>> +     bpf_cgrp_storage_free(cgrp);
> > > >>> +#endif
> > > >>
> > > >>
> > > >> After revisiting comment 4bfc0bb2c60e, some of the commit message came to my mind:
> > > >>
> > > >> " ...... it blocks a possibility to implement
> > > >>     the memcg-based memory accounting for bpf objects, because a circular
> > > >>     reference dependency will occur. Charged memory pages are pinning the
> > > >>     corresponding memory cgroup, and if the memory cgroup is pinning
> > > >>     the attached bpf program, nothing will be ever released."
> > > >>
> > > >> Considering the bpf_map_kzalloc() is used in bpf_local_storage_map.c and it can
> > > >> charge the memcg, I wonder if the cgrp_local_storage will have similar refcnt
> > > >> loop issue here.
> > > >>
> > > >> If here is the right place to free the cgrp_local_storage() and enough to break
> > > >> this refcnt loop, it will be useful to add some explanation and its
> > > >> consideration in the commit message.
> > > >>
> > > >
> > > > I think a similar refcount loop issue can happen here as well. IIUC,
> > > > this function will only be run when the css is released after all
> > > > references are dropped. So if memcg charging is enabled the cgroup
> > > > will never be removed. This one might be trickier to handle though..
> > >
> > > How about removing all storage from cgrp->bpf_cgrp_storage in
> > > cgroup_destroy_locked()?
> >
> > The problem here is that you lose information for cgroups that went
> > offline but still exist in the kernel (i.e offline cgroups). The
> > commit log 4bfc0bb2c60e mentions that such cgroups can have live
> > sockets attached, so this might be a problem? From a memory
> > perspective, offline memcgs can still undergo memory operations like
> > reclaim. If we are using BPF to collect cgroup statistics for memory
> > reclaim, we can't do so for offline memcgs, which is not the end of
> > the world, but the cgroup storages become slightly less powerful. We
> > might also lose some data that we have already stored for such offline
> > memcgs. Also BPF programs now need to handle the case where they have
> > a valid cgroup pointer but they cannot retrieve a cgroup storage for
> > it because it went offline.
> >
> > We ideally want to be able to charge the memory to the cgroup without
> > holding a ref to it, which is against the cgroup memory charging
> > model.
> 
> +Roman Gushchin
> 
> Wait a second. I think Roman implemented reparenting of BPF map memcg
> charges when a memcg is offlined. In this case, when a cgroup goes
> offline (aka rmdir()'d), the charges will actually move to the parent,
> and the local storage will no longer be holding any refs to the
> cgroup. In that case, the current freeing location should be fine and
> the cgroup local storage can continue to work with offlined cgroups.
> 
> Roman, could you please confirm or deny my thesis?

Yes, it shouldn't be a problem anymore since we've implemented the reparenting
of all main kernel memory types (slabs, large kmallocs, percpu).

Thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-25  1:36           ` Martin KaFai Lau
@ 2022-10-25  5:44             ` Yosry Ahmed
  2022-10-25 19:53               ` Yonghong Song
  0 siblings, 1 reply; 31+ messages in thread
From: Yosry Ahmed @ 2022-10-25  5:44 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau,
	Tejun Heo, bpf, Roman Gushchin

On Mon, Oct 24, 2022 at 6:36 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/24/22 5:48 PM, Yosry Ahmed wrote:
> > On Mon, Oct 24, 2022 at 5:32 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 10/24/22 5:21 PM, Yosry Ahmed wrote:
> >>> On Mon, Oct 24, 2022 at 2:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>
> >>>> On 10/23/22 11:05 AM, Yonghong Song wrote:
> >>>>> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
> >>>>> +{
> >>>>> +     struct bpf_local_storage *local_storage;
> >>>>> +     struct bpf_local_storage_elem *selem;
> >>>>> +     bool free_cgroup_storage = false;
> >>>>> +     struct hlist_node *n;
> >>>>> +     unsigned long flags;
> >>>>> +
> >>>>> +     rcu_read_lock();
> >>>>> +     local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
> >>>>> +     if (!local_storage) {
> >>>>> +             rcu_read_unlock();
> >>>>> +             return;
> >>>>> +     }
> >>>>> +
> >>>>> +     /* Neither the bpf_prog nor the bpf_map's syscall
> >>>>> +      * could be modifying the local_storage->list now.
> >>>>> +      * Thus, no elem can be added to or deleted from the
> >>>>> +      * local_storage->list by the bpf_prog or by the bpf_map's syscall.
> >>>>> +      *
> >>>>> +      * It is racing with __bpf_local_storage_map_free() alone
> >>>>> +      * when unlinking elem from the local_storage->list and
> >>>>> +      * the map's bucket->list.
> >>>>> +      */
> >>>>> +     bpf_cgrp_storage_lock();
> >>>>> +     raw_spin_lock_irqsave(&local_storage->lock, flags);
> >>>>> +     hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> >>>>> +             bpf_selem_unlink_map(selem);
> >>>>> +             /* If local_storage list has only one element, the
> >>>>> +              * bpf_selem_unlink_storage_nolock() will return true.
> >>>>> +              * Otherwise, it will return false. The current loop iteration
> >>>>> +              * intends to remove all local storage. So the last iteration
> >>>>> +              * of the loop will set the free_cgroup_storage to true.
> >>>>> +              */
> >>>>> +             free_cgroup_storage =
> >>>>> +                     bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> >>>>> +     }
> >>>>> +     raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> >>>>> +     bpf_cgrp_storage_unlock();
> >>>>> +     rcu_read_unlock();
> >>>>> +
> >>>>> +     if (free_cgroup_storage)
> >>>>> +             kfree_rcu(local_storage, rcu);
> >>>>> +}
> >>>>
> >>>> [ ... ]
> >>>>
> >>>>> +/* *gfp_flags* is a hidden argument provided by the verifier */
> >>>>> +BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
> >>>>> +        void *, value, u64, flags, gfp_t, gfp_flags)
> >>>>> +{
> >>>>> +     struct bpf_local_storage_data *sdata;
> >>>>> +
> >>>>> +     WARN_ON_ONCE(!bpf_rcu_lock_held());
> >>>>> +     if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
> >>>>> +             return (unsigned long)NULL;
> >>>>> +
> >>>>> +     if (!cgroup)
> >>>>> +             return (unsigned long)NULL;
> >>>>> +
> >>>>> +     if (!bpf_cgrp_storage_trylock())
> >>>>> +             return (unsigned long)NULL;
> >>>>> +
> >>>>> +     sdata = cgroup_storage_lookup(cgroup, map, true);
> >>>>> +     if (sdata)
> >>>>> +             goto unlock;
> >>>>> +
> >>>>> +     /* only allocate new storage, when the cgroup is refcounted */
> >>>>> +     if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
> >>>>> +         (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
> >>>>> +             sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
> >>>>> +                                              value, BPF_NOEXIST, gfp_flags);
> >>>>> +
> >>>>> +unlock:
> >>>>> +     bpf_cgrp_storage_unlock();
> >>>>> +     return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
> >>>>> +}
> >>>>
> >>>> [ ... ]
> >>>>
> >>>>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> >>>>> index 764bdd5fd8d1..32145d066a09 100644
> >>>>> --- a/kernel/cgroup/cgroup.c
> >>>>> +++ b/kernel/cgroup/cgroup.c
> >>>>> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
> >>>>>         struct cgroup_subsys *ss = css->ss;
> >>>>>         struct cgroup *cgrp = css->cgroup;
> >>>>>
> >>>>> +#ifdef CONFIG_BPF_SYSCALL
> >>>>> +     bpf_cgrp_storage_free(cgrp);
> >>>>> +#endif
> >>>>
> >>>>
> >>>> After revisiting comment 4bfc0bb2c60e, some of the commit message came to my mind:
> >>>>
> >>>> " ...... it blocks a possibility to implement
> >>>>      the memcg-based memory accounting for bpf objects, because a circular
> >>>>      reference dependency will occur. Charged memory pages are pinning the
> >>>>      corresponding memory cgroup, and if the memory cgroup is pinning
> >>>>      the attached bpf program, nothing will be ever released."
> >>>>
> >>>> Considering the bpf_map_kzalloc() is used in bpf_local_storage_map.c and it can
> >>>> charge the memcg, I wonder if the cgrp_local_storage will have similar refcnt
> >>>> loop issue here.
> >>>>
> >>>> If here is the right place to free the cgrp_local_storage() and enough to break
> >>>> this refcnt loop, it will be useful to add some explanation and its
> >>>> consideration in the commit message.
> >>>>
> >>>
> >>> I think a similar refcount loop issue can happen here as well. IIUC,
> >>> this function will only be run when the css is released after all
> >>> references are dropped. So if memcg charging is enabled the cgroup
> >>> will never be removed. This one might be trickier to handle though..
> >>
> >> How about removing all storage from cgrp->bpf_cgrp_storage in
> >> cgroup_destroy_locked()?
> >
> > The problem here is that you lose information for cgroups that went
> > offline but still exist in the kernel (i.e offline cgroups). The
> > commit log 4bfc0bb2c60e mentions that such cgroups can have live
> > sockets attached, so this might be a problem?
>
> Keeping the cgrp_storage around is useful for the cgroup-bpf prog that will be
> called upon some sk events (eg ingress/egress).  The cgrp_storage cleanup could
> be done in cgroup_bpf_release_fn() also such that it will wait till all sk is done.
>
> > From a memory perspective, offline memcgs can still undergo memory operations like
> > reclaim. If we are using BPF to collect cgroup statistics for memory
> > reclaim, we can't do so for offline memcgs, which is not the end of
> > the world, but the cgroup storages become slightly less powerful. We
> > might also lose some data that we have already stored for such offline
> > memcgs. Also BPF programs now need to handle the case where they have
> > a valid cgroup pointer but they cannot retrieve a cgroup storage for
> > it because it went offline.
>
> iiuc, the use case is to be able to use the cgrp_storage at some earlier stage
> of the cgroup destruction.

Yes, exactly. The cgroup gets "offlined" when the user removes the
directory, but is not actually freed until all references are dropped.
An offline memcg can still undergo some operations.

> A noob question, I wonder if there is a cgroup that
> it will never go away, the root cgrp?  Then the cgrp_storage cleanup could be
> more selective and avoid cleaning up the cgrp storage charged to the root cgroup.

Yes, root cgrp doesn't go away, but I am not sure I understand how
this fixes the problem.

In any case, Roman confirmed my theory. BPF map charges are now
charged through objcg, not directly to memcgs, which means that when
the cgroup is offline, those charges are moved to the parent. IIUC,
this means there should not be a refcount loop. When the cgroup
directory is removed and the cgroup is offlined, the memory charges of
the cgrp_storage will move to the parent. The cgrp_storage will remain
accessible until all refs to the cgroup are dropped and it is actually
freed by css_free_rwork_fn(), at which point the cgrp_storage is freed
along with it.

I just realized, I *think* the call to bpf_cgrp_storage_free(cgrp) is
actually misplaced within css_free_rwork_fn(). Reading the code, it
seems like css_free_rwork_fn() can be called in two cases:
(a) We are freeing a css (cgroup_subsys_state), but not the cgroup
itself (e.g. we are disabling a subsystem controller on a cgroup).
(b) We are freeing the cgroup itself.

I think we want to free the cgrp_storage only in (b), which would
correspond to the else statement (but before the nested if).
Basically, something like this *I think*:

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2319946715e0..f1e6058089f5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5349,6 +5349,7 @@ static void css_free_rwork_fn(struct work_struct *work)
                atomic_dec(&cgrp->root->nr_cgrps);
                cgroup1_pidlist_destroy_all(cgrp);
                cancel_work_sync(&cgrp->release_agent_work);
+               bpf_cgrp_storage_free(cgrp);

                if (cgroup_parent(cgrp)) {
                        /*

Tejun would know better, so please correct me if I am wrong.

(FWIW, I think it would be nicer to have an empty stub for
bpf_cgrp_storage_free() when !CONFIG_BPF_SYSCALL instead of the
#ifdef at the call site, similar to cgroup_bpf_offline(); see the
sketch below.)
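
Something like this (an untested sketch; header placement assumed, e.g.
include/linux/bpf.h):

#ifdef CONFIG_BPF_SYSCALL
void bpf_cgrp_storage_free(struct cgroup *cgroup);
#else
/* empty stub so callers don't need #ifdef at the call site */
static inline void bpf_cgrp_storage_free(struct cgroup *cgroup)
{
}
#endif

Then css_free_rwork_fn() could call bpf_cgrp_storage_free(cgrp)
unconditionally.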

>
> > We ideally want to be able to charge the memory to the cgroup without
> > holding a ref to it, which is against the cgroup memory charging
> > model.
>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-25  2:38             ` Roman Gushchin
@ 2022-10-25  5:46               ` Yosry Ahmed
  0 siblings, 0 replies; 31+ messages in thread
From: Yosry Ahmed @ 2022-10-25  5:46 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Martin KaFai Lau, Yonghong Song, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, kernel-team, KP Singh,
	Martin KaFai Lau, Tejun Heo, bpf

On Mon, Oct 24, 2022 at 7:38 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Mon, Oct 24, 2022 at 05:55:17PM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 24, 2022 at 5:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Mon, Oct 24, 2022 at 5:32 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > > >
> > > > On 10/24/22 5:21 PM, Yosry Ahmed wrote:
> > > > > On Mon, Oct 24, 2022 at 2:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > > > >>
> > > > >> On 10/23/22 11:05 AM, Yonghong Song wrote:
> > > > >>> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
> > > > >>> +{
> > > > >>> +     struct bpf_local_storage *local_storage;
> > > > >>> +     struct bpf_local_storage_elem *selem;
> > > > >>> +     bool free_cgroup_storage = false;
> > > > >>> +     struct hlist_node *n;
> > > > >>> +     unsigned long flags;
> > > > >>> +
> > > > >>> +     rcu_read_lock();
> > > > >>> +     local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
> > > > >>> +     if (!local_storage) {
> > > > >>> +             rcu_read_unlock();
> > > > >>> +             return;
> > > > >>> +     }
> > > > >>> +
> > > > >>> +     /* Neither the bpf_prog nor the bpf_map's syscall
> > > > >>> +      * could be modifying the local_storage->list now.
> > > > >>> +      * Thus, no elem can be added to or deleted from the
> > > > >>> +      * local_storage->list by the bpf_prog or by the bpf_map's syscall.
> > > > >>> +      *
> > > > >>> +      * It is racing with __bpf_local_storage_map_free() alone
> > > > >>> +      * when unlinking elem from the local_storage->list and
> > > > >>> +      * the map's bucket->list.
> > > > >>> +      */
> > > > >>> +     bpf_cgrp_storage_lock();
> > > > >>> +     raw_spin_lock_irqsave(&local_storage->lock, flags);
> > > > >>> +     hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> > > > >>> +             bpf_selem_unlink_map(selem);
> > > > >>> +             /* If local_storage list has only one element, the
> > > > >>> +              * bpf_selem_unlink_storage_nolock() will return true.
> > > > >>> +              * Otherwise, it will return false. The current loop iteration
> > > > >>> +              * intends to remove all local storage. So the last iteration
> > > > >>> +              * of the loop will set the free_cgroup_storage to true.
> > > > >>> +              */
> > > > >>> +             free_cgroup_storage =
> > > > >>> +                     bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> > > > >>> +     }
> > > > >>> +     raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> > > > >>> +     bpf_cgrp_storage_unlock();
> > > > >>> +     rcu_read_unlock();
> > > > >>> +
> > > > >>> +     if (free_cgroup_storage)
> > > > >>> +             kfree_rcu(local_storage, rcu);
> > > > >>> +}
> > > > >>
> > > > >> [ ... ]
> > > > >>
> > > > >>> +/* *gfp_flags* is a hidden argument provided by the verifier */
> > > > >>> +BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
> > > > >>> +        void *, value, u64, flags, gfp_t, gfp_flags)
> > > > >>> +{
> > > > >>> +     struct bpf_local_storage_data *sdata;
> > > > >>> +
> > > > >>> +     WARN_ON_ONCE(!bpf_rcu_lock_held());
> > > > >>> +     if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
> > > > >>> +             return (unsigned long)NULL;
> > > > >>> +
> > > > >>> +     if (!cgroup)
> > > > >>> +             return (unsigned long)NULL;
> > > > >>> +
> > > > >>> +     if (!bpf_cgrp_storage_trylock())
> > > > >>> +             return (unsigned long)NULL;
> > > > >>> +
> > > > >>> +     sdata = cgroup_storage_lookup(cgroup, map, true);
> > > > >>> +     if (sdata)
> > > > >>> +             goto unlock;
> > > > >>> +
> > > > >>> +     /* only allocate new storage, when the cgroup is refcounted */
> > > > >>> +     if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
> > > > >>> +         (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
> > > > >>> +             sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
> > > > >>> +                                              value, BPF_NOEXIST, gfp_flags);
> > > > >>> +
> > > > >>> +unlock:
> > > > >>> +     bpf_cgrp_storage_unlock();
> > > > >>> +     return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
> > > > >>> +}
> > > > >>
> > > > >> [ ... ]
> > > > >>
> > > > >>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > > >>> index 764bdd5fd8d1..32145d066a09 100644
> > > > >>> --- a/kernel/cgroup/cgroup.c
> > > > >>> +++ b/kernel/cgroup/cgroup.c
> > > > >>> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
> > > > >>>        struct cgroup_subsys *ss = css->ss;
> > > > >>>        struct cgroup *cgrp = css->cgroup;
> > > > >>>
> > > > >>> +#ifdef CONFIG_BPF_SYSCALL
> > > > >>> +     bpf_cgrp_storage_free(cgrp);
> > > > >>> +#endif
> > > > >>
> > > > >>
> > > > >> After revisiting commit 4bfc0bb2c60e, some of its commit message came to mind:
> > > > >>
> > > > >> " ...... it blocks a possibility to implement
> > > > >>     the memcg-based memory accounting for bpf objects, because a circular
> > > > >>     reference dependency will occur. Charged memory pages are pinning the
> > > > >>     corresponding memory cgroup, and if the memory cgroup is pinning
> > > > >>     the attached bpf program, nothing will be ever released."
> > > > >>
> > > > >> Considering that bpf_map_kzalloc() is used in bpf_local_storage_map.c and can
> > > > >> charge the memcg, I wonder if the cgrp_local_storage will have a similar refcnt
> > > > >> loop issue here.
> > > > >>
> > > > >> If this is the right place to free the cgrp_local_storage and it is enough to
> > > > >> break the refcnt loop, it would be useful to add some explanation of this
> > > > >> consideration to the commit message.
> > > > >>
> > > > >
> > > > > I think a similar refcount loop issue can happen here as well. IIUC,
> > > > > this function will only be run when the css is released after all
> > > > > references are dropped. So if memcg charging is enabled, the cgroup
> > > > > will never be removed. This one might be trickier to handle though...
> > > >
> > > > How about removing all storage from cgrp->bpf_cgrp_storage in
> > > > cgroup_destroy_locked()?
> > >
> > > The problem here is that you lose information for cgroups that went
> > > offline but still exist in the kernel (i.e. offline cgroups). The
> > > commit log of 4bfc0bb2c60e mentions that such cgroups can have live
> > > sockets attached, so this might be a problem? From a memory
> > > perspective, offline memcgs can still undergo memory operations like
> > > reclaim. If we are using BPF to collect cgroup statistics for memory
> > > reclaim, we can't do so for offline memcgs, which is not the end of
> > > the world, but the cgroup storages become slightly less powerful. We
> > > might also lose some data that we have already stored for such offline
> > > memcgs. Also, BPF programs would now need to handle the case where they
> > > have a valid cgroup pointer but cannot retrieve a cgroup storage for
> > > it because it went offline.
> > >
> > > We ideally want to be able to charge the memory to the cgroup without
> > > holding a ref to it, which is against the cgroup memory charging
> > > model.
> >
> > +Roman Gushchin
> >
> > Wait a second. I think Roman implemented reparenting of BPF map memcg
> > charges when a memcg is offlined. In that case, when a cgroup goes
> > offline (i.e. is rmdir()'d), the charges will actually move to the parent,
> > and the local storage will no longer be holding any refs to the
> > cgroup. If so, the current freeing location should be fine, and
> > the cgroup local storage can continue to work with offlined cgroups.
> >
> > Roman, could you please confirm or deny my thesis?
>
> Yes, it shouldn't be a problem anymore since we've implemented the reparenting
> of all main kernel memory types (slabs, large kmallocs, percpu).

Thanks for the confirmation and the fast response :)
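
(Side note for anyone following along: progs still need to handle a
NULL return from bpf_cgrp_storage_get(), e.g. when the trylock fails
or the cgroup is dying. A rough sketch of a caller, loosely based on
the selftests in this series -- the map and prog names are mine:)

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

struct {
        __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, long);
} cgrp_counts SEC(".maps");

SEC("tp_btf/sys_enter")
int BPF_PROG(count_enter, struct pt_regs *regs, long id)
{
        struct task_struct *task = bpf_get_current_task_btf();
        long *ptr;

        /* per-cgroup counter, created on first access */
        ptr = bpf_cgrp_storage_get(&cgrp_counts, task->cgroups->dfl_cgrp,
                                   0, BPF_LOCAL_STORAGE_GET_F_CREATE);
        if (ptr)        /* can be NULL, e.g. for a dying cgroup */
                __sync_fetch_and_add(ptr, 1);
        return 0;
}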

>
> Thanks!


* Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  2022-10-25  5:44             ` Yosry Ahmed
@ 2022-10-25 19:53               ` Yonghong Song
  0 siblings, 0 replies; 31+ messages in thread
From: Yonghong Song @ 2022-10-25 19:53 UTC (permalink / raw)
  To: Yosry Ahmed, Martin KaFai Lau
  Cc: Yonghong Song, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, kernel-team, KP Singh, Martin KaFai Lau,
	Tejun Heo, bpf, Roman Gushchin



On 10/24/22 10:44 PM, Yosry Ahmed wrote:
> On Mon, Oct 24, 2022 at 6:36 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/24/22 5:48 PM, Yosry Ahmed wrote:
>>> On Mon, Oct 24, 2022 at 5:32 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 10/24/22 5:21 PM, Yosry Ahmed wrote:
>>>>> On Mon, Oct 24, 2022 at 2:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>>>
>>>>>> On 10/23/22 11:05 AM, Yonghong Song wrote:
>>>>>>> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
>>>>>>> +{
>>>>>>> +     struct bpf_local_storage *local_storage;
>>>>>>> +     struct bpf_local_storage_elem *selem;
>>>>>>> +     bool free_cgroup_storage = false;
>>>>>>> +     struct hlist_node *n;
>>>>>>> +     unsigned long flags;
>>>>>>> +
>>>>>>> +     rcu_read_lock();
>>>>>>> +     local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
>>>>>>> +     if (!local_storage) {
>>>>>>> +             rcu_read_unlock();
>>>>>>> +             return;
>>>>>>> +     }
>>>>>>> +
>>>>>>> +     /* Neither the bpf_prog nor the bpf_map's syscall
>>>>>>> +      * could be modifying the local_storage->list now.
>>>>>>> +      * Thus, no elem can be added to or deleted from the
>>>>>>> +      * local_storage->list by the bpf_prog or by the bpf_map's syscall.
>>>>>>> +      *
>>>>>>> +      * It is racing with __bpf_local_storage_map_free() alone
>>>>>>> +      * when unlinking elem from the local_storage->list and
>>>>>>> +      * the map's bucket->list.
>>>>>>> +      */
>>>>>>> +     bpf_cgrp_storage_lock();
>>>>>>> +     raw_spin_lock_irqsave(&local_storage->lock, flags);
>>>>>>> +     hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
>>>>>>> +             bpf_selem_unlink_map(selem);
>>>>>>> +             /* If local_storage list has only one element, the
>>>>>>> +              * bpf_selem_unlink_storage_nolock() will return true.
>>>>>>> +              * Otherwise, it will return false. The current loop iteration
>>>>>>> +              * intends to remove all local storage. So the last iteration
>>>>>>> +              * of the loop will set the free_cgroup_storage to true.
>>>>>>> +              */
>>>>>>> +             free_cgroup_storage =
>>>>>>> +                     bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
>>>>>>> +     }
>>>>>>> +     raw_spin_unlock_irqrestore(&local_storage->lock, flags);
>>>>>>> +     bpf_cgrp_storage_unlock();
>>>>>>> +     rcu_read_unlock();
>>>>>>> +
>>>>>>> +     if (free_cgroup_storage)
>>>>>>> +             kfree_rcu(local_storage, rcu);
>>>>>>> +}
>>>>>>
>>>>>> [ ... ]
>>>>>>
>>>>>>> +/* *gfp_flags* is a hidden argument provided by the verifier */
>>>>>>> +BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
>>>>>>> +        void *, value, u64, flags, gfp_t, gfp_flags)
>>>>>>> +{
>>>>>>> +     struct bpf_local_storage_data *sdata;
>>>>>>> +
>>>>>>> +     WARN_ON_ONCE(!bpf_rcu_lock_held());
>>>>>>> +     if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
>>>>>>> +             return (unsigned long)NULL;
>>>>>>> +
>>>>>>> +     if (!cgroup)
>>>>>>> +             return (unsigned long)NULL;
>>>>>>> +
>>>>>>> +     if (!bpf_cgrp_storage_trylock())
>>>>>>> +             return (unsigned long)NULL;
>>>>>>> +
>>>>>>> +     sdata = cgroup_storage_lookup(cgroup, map, true);
>>>>>>> +     if (sdata)
>>>>>>> +             goto unlock;
>>>>>>> +
>>>>>>> +     /* only allocate new storage, when the cgroup is refcounted */
>>>>>>> +     if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
>>>>>>> +         (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
>>>>>>> +             sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
>>>>>>> +                                              value, BPF_NOEXIST, gfp_flags);
>>>>>>> +
>>>>>>> +unlock:
>>>>>>> +     bpf_cgrp_storage_unlock();
>>>>>>> +     return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
>>>>>>> +}
>>>>>>
>>>>>> [ ... ]
>>>>>>
>>>>>>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>>>>>>> index 764bdd5fd8d1..32145d066a09 100644
>>>>>>> --- a/kernel/cgroup/cgroup.c
>>>>>>> +++ b/kernel/cgroup/cgroup.c
>>>>>>> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
>>>>>>>          struct cgroup_subsys *ss = css->ss;
>>>>>>>          struct cgroup *cgrp = css->cgroup;
>>>>>>>
>>>>>>> +#ifdef CONFIG_BPF_SYSCALL
>>>>>>> +     bpf_cgrp_storage_free(cgrp);
>>>>>>> +#endif
>>>>>>
>>>>>>
>>>>>> After revisiting commit 4bfc0bb2c60e, some of its commit message came to mind:
>>>>>>
>>>>>> " ...... it blocks a possibility to implement
>>>>>>       the memcg-based memory accounting for bpf objects, because a circular
>>>>>>       reference dependency will occur. Charged memory pages are pinning the
>>>>>>       corresponding memory cgroup, and if the memory cgroup is pinning
>>>>>>       the attached bpf program, nothing will be ever released."
>>>>>>
>>>>>> Considering that bpf_map_kzalloc() is used in bpf_local_storage_map.c and can
>>>>>> charge the memcg, I wonder if the cgrp_local_storage will have a similar refcnt
>>>>>> loop issue here.
>>>>>>
>>>>>> If this is the right place to free the cgrp_local_storage and it is enough to
>>>>>> break the refcnt loop, it would be useful to add some explanation of this
>>>>>> consideration to the commit message.
>>>>>>
>>>>>
>>>>> I think a similar refcount loop issue can happen here as well. IIUC,
>>>>> this function will only be run when the css is released after all
>>>>> references are dropped. So if memcg charging is enabled, the cgroup
>>>>> will never be removed. This one might be trickier to handle though...
>>>>
>>>> How about removing all storage from cgrp->bpf_cgrp_storage in
>>>> cgroup_destroy_locked()?
>>>
>>> The problem here is that you lose information for cgroups that went
>>> offline but still exist in the kernel (i.e. offline cgroups). The
>>> commit log of 4bfc0bb2c60e mentions that such cgroups can have live
>>> sockets attached, so this might be a problem?
>>
>> Keeping the cgrp_storage around is useful for the cgroup-bpf prog that will be
>> called upon some sk events (e.g. ingress/egress).  The cgrp_storage cleanup could
>> also be done in cgroup_bpf_release_fn(), so that it waits until all sockets are done.
>>
>>> From a memory perspective, offline memcgs can still undergo memory
>>> operations like reclaim. If we are using BPF to collect cgroup statistics
>>> for memory reclaim, we can't do so for offline memcgs, which is not the end
>>> of the world, but the cgroup storages become slightly less powerful. We
>>> might also lose some data that we have already stored for such offline
>>> memcgs. Also, BPF programs would now need to handle the case where they
>>> have a valid cgroup pointer but cannot retrieve a cgroup storage for it
>>> because it went offline.
>>
>> IIUC, the use case is to be able to use the cgrp_storage at some earlier
>> stage of cgroup destruction.
> 
> Yes, exactly. The cgroup gets "offlined" when the user removes the
> directory, but is not actually freed until all references are dropped.
> An offline memcg can still undergo some operations.
> 
>> A noob question: I wonder if there is a cgroup that will never go away,
>> e.g. the root cgrp?  Then the cgrp_storage cleanup could be more selective
>> and avoid cleaning up the cgrp storage charged to the root cgroup.
> 
> Yes, the root cgrp doesn't go away, but I am not sure I understand how
> this fixes the problem.
> 
> In any case, Roman confirmed my theory. BPF map memory is now
> charged through objcg, not directly to memcgs, which means that when
> a cgroup goes offline, its charges are moved to the parent. IIUC,
> this means there should not be a refcount loop: when the cgroup
> directory is removed and the cgroup is offlined, the memory charges of
> the cgrp_storage move to the parent, and the cgrp_storage remains
> accessible until all refs to the cgroup are dropped and it is actually
> freed by css_free_rwork_fn(), at which point the cgrp_storage is freed
> as well.
> 
> I just realized: I *think* the call to bpf_cgrp_storage_free(cgrp) is
> actually misplaced within css_free_rwork_fn(). Reading the code, it
> seems like css_free_rwork_fn() can be called in two cases:
> (a) We are freeing a css (cgroup_subsys_state), but not the cgroup
> itself (e.g. we are disabling a subsystem controller on a cgroup).
> (b) We are freeing the cgroup itself.
> 
> I think we want to free the cgrp_storage only in case (b), which would
> correspond to the else branch of the if (ss) check (but before the
> nested cgroup_parent() check). Basically, something like this, *I think*:
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 2319946715e0..f1e6058089f5 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5349,6 +5349,7 @@ static void css_free_rwork_fn(struct work_struct *work)
>                  atomic_dec(&cgrp->root->nr_cgrps);
>                  cgroup1_pidlist_destroy_all(cgrp);
>                  cancel_work_sync(&cgrp->release_agent_work);
> +               bpf_cgrp_storage_free(cgrp);
> 
>                  if (cgroup_parent(cgrp)) {
>                          /*
> 
> Tejun would know better, so please correct me if I am wrong.

Good suggestion. Calling bpf_cgrp_storage_free() after
cancel_work_sync() seems like a good idea, to ensure all pending
work is done first.
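
So the call site would end up looking roughly like this (sketch only,
following your diff above):

        /* the !ss path of css_free_rwork_fn(), i.e. the cgroup itself
         * is being freed
         */
        atomic_dec(&cgrp->root->nr_cgrps);
        cgroup1_pidlist_destroy_all(cgrp);
        /* flush any pending release_agent work on the cgroup before
         * tearing down its bpf storage
         */
        cancel_work_sync(&cgrp->release_agent_work);
        bpf_cgrp_storage_free(cgrp);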

> 
> (FWIW, I think it would be nicer if we had an empty stub for
> bpf_cgrp_storage_free() when !CONFIG_BPF_SYSCALL, instead of the
> #ifdef at the call site, similar to cgroup_bpf_offline().)

Okay, let me give it a try.


> 
>>
>>> We ideally want to be able to charge the memory to the cgroup without
>>> holding a ref to it, which is against the cgroup memory charging
>>> model.
>>



Thread overview: 31+ messages
2022-10-23 18:05 [PATCH bpf-next v4 0/7] bpf: Implement cgroup local storage available to non-cgroup-attached bpf progs Yonghong Song
2022-10-23 18:05 ` [PATCH bpf-next v4 1/7] bpf: Make struct cgroup btf id global Yonghong Song
2022-10-23 19:59   ` David Vernet
2022-10-23 18:05 ` [PATCH bpf-next v4 2/7] bpf: Refactor inode/task/sk storage map_{alloc,free}() for reuse Yonghong Song
2022-10-24 18:02   ` sdf
2022-10-24 19:08     ` Yonghong Song
2022-10-24 20:34   ` Martin KaFai Lau
2022-10-25  2:28     ` Yonghong Song
2022-10-23 18:05 ` [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs Yonghong Song
2022-10-23 20:02   ` David Vernet
2022-10-24 19:19   ` Martin KaFai Lau
2022-10-25  0:21     ` Yosry Ahmed
2022-10-25  0:32       ` Martin KaFai Lau
2022-10-25  0:48         ` Yosry Ahmed
2022-10-25  0:55           ` Yosry Ahmed
2022-10-25  2:38             ` Roman Gushchin
2022-10-25  5:46               ` Yosry Ahmed
2022-10-25  1:36           ` Martin KaFai Lau
2022-10-25  5:44             ` Yosry Ahmed
2022-10-25 19:53               ` Yonghong Song
2022-10-23 18:05 ` [PATCH bpf-next v4 4/7] libbpf: Support new cgroup local storage Yonghong Song
2022-10-23 20:03   ` David Vernet
2022-10-23 18:05 ` [PATCH bpf-next v4 5/7] bpftool: " Yonghong Song
2022-10-23 18:05 ` [PATCH bpf-next v4 6/7] selftests/bpf: Add selftests for " Yonghong Song
2022-10-23 20:14   ` David Vernet
2022-10-24 19:03     ` Yonghong Song
2022-10-24 20:30   ` Martin KaFai Lau
2022-10-25  2:26     ` Yonghong Song
2022-10-23 18:05 ` [PATCH bpf-next v4 7/7] docs/bpf: Add documentation " Yonghong Song
2022-10-23 20:26   ` David Vernet
2022-10-24 19:05     ` Yonghong Song
