netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats
@ 2022-05-20  1:21 Yosry Ahmed
  2022-05-20  1:21 ` [PATCH bpf-next v1 1/5] cgroup: bpf: add a hook for bpf progs to attach to rstat flushing Yosry Ahmed
                   ` (5 more replies)
  0 siblings, 6 replies; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20  1:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

This patch series allows for using bpf to collect hierarchical cgroup
stats efficiently by integrating with the rstat framework. The rstat
framework provides an efficient way to collect cgroup stats and
propagate them through the cgroup hierarchy.

* Background on rstat (I am using a subscriber analogy that is not
commonly used):

The rstat framework maintains a tree of cgroups that have updates and
which cpus have updates. A subscriber to the rstat framework maintains
their own stats. The framework is used to tell the subscriber when
and what to flush, for the most efficient stats propagation. The
workflow is as follows:

- When a subscriber updates a cgroup on a cpu, it informs the rstat
  framework by calling cgroup_rstat_updated(cgrp, cpu).

- When a subscriber wants to read some stats for a cgroup, it asks
  the rstat framework to initiate a stats flush (propagation) by calling
  cgroup_rstat_flush(cgrp).

- When the rstat framework initiates a flush, it makes callbacks to
  subscribers to aggregate stats on cpus that have updates, and
  propagate updates to their parent.

Currently, the main subscribers to the rstat framework are cgroup
subsystems (e.g. memory, block). This patch series allow bpf programs to
become subscribers as well.

Patches in this series are based off two patches in the mailing list:
- bpf/btf: also allow kfunc in tracing and syscall programs
- btf: Add a new kfunc set which allows to mark a function to be
  sleepable

Both by Benjamin Tissoires, from different versions of his HID patch
series (the second patch seems to have been dropped in the last
version).

Patches in this series are organized as follows:
* The first patch adds a hook point, bpf_rstat_flush(), that is called
during rstat flushing. This allows bpf fentry programs to attach to it
to be called during rstat flushing (effectively registering themselves
as rstat flush callbacks).

* The second patch adds cgroup_rstat_updated() and cgorup_rstat_flush()
kfuncs, to allow bpf stat collectors and readers to communicate with rstat.

* The third patch is actually v2 of a previously submitted patch [1]
by Hao Luo. We agreed that it fits better as a part of this series. It
introduces cgroup_iter programs that can dump stats for cgroups to
userspace.
v1 - > v2:
- Getting the cgroup's reference at the time at attaching, instead of
  at the time when iterating. (Yonghong) (context [1])
- Remove .init_seq_private and .fini_seq_private callbacks for
  cgroup_iter. They are not needed now. (Yonghong)

* The fourth patch extends bpf selftests cgroup helpers, as necessary
for the following patch.

* The fifth  patch is a selftest that demonstrates the entire workflow.
It includes programs that collect, aggregate, and dump per-cgroup stats
by fully integrating with the rstat framework.

[1]https://lore.kernel.org/lkml/20220225234339.2386398-9-haoluo@google.com/

RFC v2 -> v1:
- Instead of introducing a new program type for rstat flushing, add an
  empty hook point, bpf_rstat_flush(), and use fentry bpf programs to
  attach to it and flush bpf stats.
- Instead of using helpers, use kfuncs for rstat functions.
- These changes simplify the patchset greatly, with minimal changes to
  uapi.

RFC v1 -> RFC v2:
- Instead of rstat flush programs attach to subsystems, they now attach
  to rstat (global flushers, not per-subsystem), based on discussions
  with Tejun. The first patch is entirely rewritten.
- Pass cgroup pointers to rstat flushers instead of cgroup ids. This is
  much more flexibility and less likely to need a uapi update later.
- rstat helpers are now only defined if CGROUP_CONFIG.
- Most of the code is now only defined if CGROUP_CONFIG and
  CONFIG_BPF_SYSCALL.
- Move rstat helper protos from bpf_base_func_proto() to
  tracing_prog_func_proto().
- rstat helpers argument (cgroup pointer) is now ARG_PTR_TO_BTF_ID, not
  ARG_ANYTHING.
- Rewrote the selftest to use the cgroup helpers.
- Dropped bpf_map_lookup_percpu_elem (already added by Feng).
- Dropped patch to support cgroup v1 for cgroup_iter.
- Dropped patch to define some cgroup_put() when !CONFIG_CGROUP. The
  code that calls it is no longer compiled when !CONFIG_CGROUP.


Hao Luo (1):
  bpf: Introduce cgroup iter

Yosry Ahmed (4):
  cgroup: bpf: add a hook for bpf progs to attach to rstat flushing
  cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush()
    kfuncs
  selftests/bpf: extend cgroup helpers
  bpf: add a selftest for cgroup hierarchical stats collection

 include/linux/bpf.h                           |   2 +
 include/uapi/linux/bpf.h                      |   6 +
 kernel/bpf/Makefile                           |   3 +
 kernel/bpf/cgroup_iter.c                      | 148 ++++++++
 kernel/cgroup/rstat.c                         |  40 +++
 tools/include/uapi/linux/bpf.h                |   6 +
 tools/testing/selftests/bpf/cgroup_helpers.c  | 159 +++++---
 tools/testing/selftests/bpf/cgroup_helpers.h  |  14 +-
 .../test_cgroup_hierarchical_stats.c          | 339 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../selftests/bpf/progs/cgroup_vmscan.c       | 221 ++++++++++++
 11 files changed, 899 insertions(+), 46 deletions(-)
 create mode 100644 kernel/bpf/cgroup_iter.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_vmscan.c

-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH bpf-next v1 1/5] cgroup: bpf: add a hook for bpf progs to attach to rstat flushing
  2022-05-20  1:21 [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
@ 2022-05-20  1:21 ` Yosry Ahmed
  2022-05-21 11:16   ` kernel test robot
                     ` (2 more replies)
  2022-05-20  1:21 ` [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs Yosry Ahmed
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20  1:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Add an empty bpf_rstat_flush() hook that is called during rstat
flushing. bpf programs that make use of rstat and want to flush their
stats can attach to bpf_rstat_flush().

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 kernel/cgroup/rstat.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 24b5c2ab5598..e7a88d2600bd 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -141,6 +141,12 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 	return pos;
 }
 
+/* A hook for bpf stat collectors to attach to and flush their stats */
+__weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
+				     struct cgroup *parent, int cpu)
+{
+}
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -168,6 +174,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 			struct cgroup_subsys_state *css;
 
 			cgroup_base_stat_flush(pos, cpu);
+			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
 
 			rcu_read_lock();
 			list_for_each_entry_rcu(css, &pos->rstat_css_list,
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs
  2022-05-20  1:21 [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
  2022-05-20  1:21 ` [PATCH bpf-next v1 1/5] cgroup: bpf: add a hook for bpf progs to attach to rstat flushing Yosry Ahmed
@ 2022-05-20  1:21 ` Yosry Ahmed
  2022-05-20  7:24   ` Tejun Heo
                     ` (2 more replies)
  2022-05-20  1:21 ` [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter Yosry Ahmed
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20  1:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs to bpf
tracing programs. bpf programs that make use of rstat can use these
functions to inform rstat when they update stats for a cgroup, and when
they need to flush the stats.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 kernel/cgroup/rstat.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index e7a88d2600bd..a16a851bc0a1 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -3,6 +3,11 @@
 
 #include <linux/sched/cputime.h>
 
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+
+
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
 
@@ -141,7 +146,12 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 	return pos;
 }
 
-/* A hook for bpf stat collectors to attach to and flush their stats */
+/*
+ * A hook for bpf stat collectors to attach to and flush their stats.
+ * Together with providing bpf kfuncs for cgroup_rstat_updated() and
+ * cgroup_rstat_flush(), this enables a complete workflow where bpf progs that
+ * collect cgroup stats can integrate with rstat for efficient flushing.
+ */
 __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
 				     struct cgroup *parent, int cpu)
 {
@@ -476,3 +486,26 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 		   "system_usec %llu\n",
 		   usage, utime, stime);
 }
+
+/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
+BTF_SET_START(bpf_rstat_check_kfunc_ids)
+BTF_ID(func, cgroup_rstat_updated)
+BTF_ID(func, cgroup_rstat_flush)
+BTF_SET_END(bpf_rstat_check_kfunc_ids)
+
+BTF_SET_START(bpf_rstat_sleepable_kfunc_ids)
+BTF_ID(func, cgroup_rstat_flush)
+BTF_SET_END(bpf_rstat_sleepable_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
+	.owner		= THIS_MODULE,
+	.check_set	= &bpf_rstat_check_kfunc_ids,
+	.sleepable_set	= &bpf_rstat_sleepable_kfunc_ids,
+};
+
+static int __init bpf_rstat_kfunc_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+					 &bpf_rstat_kfunc_set);
+}
+late_initcall(bpf_rstat_kfunc_init);
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20  1:21 [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
  2022-05-20  1:21 ` [PATCH bpf-next v1 1/5] cgroup: bpf: add a hook for bpf progs to attach to rstat flushing Yosry Ahmed
  2022-05-20  1:21 ` [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs Yosry Ahmed
@ 2022-05-20  1:21 ` Yosry Ahmed
  2022-05-20  7:41   ` Tejun Heo
  2022-05-20  1:21 ` [PATCH bpf-next v1 4/5] selftests/bpf: extend cgroup helpers Yosry Ahmed
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20  1:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

From: Hao Luo <haoluo@google.com>

Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
iter doesn't iterate a set of kernel objects. Instead, it is supposed to
be parameterized by a cgroup id and prints only that cgroup. So one
needs to specify a target cgroup id when attaching this iter. The target
cgroup's state can be read out via a link of this iter.

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/bpf.h            |   2 +
 include/uapi/linux/bpf.h       |   6 ++
 kernel/bpf/Makefile            |   3 +
 kernel/bpf/cgroup_iter.c       | 148 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |   6 ++
 5 files changed, 165 insertions(+)
 create mode 100644 kernel/bpf/cgroup_iter.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c107392b0ba7..74c30fe20c23 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -44,6 +44,7 @@ struct kobject;
 struct mem_cgroup;
 struct module;
 struct bpf_func_state;
+struct cgroup;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1581,6 +1582,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
 
 struct bpf_iter_aux_info {
 	struct bpf_map *map;
+	struct cgroup *cgroup;
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0210f85131b3..e5bc40d4bccc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5965,6 +5968,9 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 057ba8e01e70..3e563b163d49 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -36,6 +36,9 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_struct_ops.o
 obj-${CONFIG_BPF_LSM} += bpf_lsm.o
 endif
 obj-$(CONFIG_BPF_PRELOAD) += preload/
+ifeq ($(CONFIG_CGROUPS),y)
+obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
+endif
 
 obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
 $(obj)/relo_core.o: $(srctree)/tools/lib/bpf/relo_core.c FORCE
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
new file mode 100644
index 000000000000..86bdfe135d24
--- /dev/null
+++ b/kernel/bpf/cgroup_iter.c
@@ -0,0 +1,148 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/cgroup.h>
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+
+struct bpf_iter__cgroup {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	/* Only one session is supported. */
+	if (*pos > 0)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+
+	return *(struct cgroup **)seq->private;
+}
+
+static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter__cgroup ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = v;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, false);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	return ret;
+}
+
+static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
+{
+}
+
+static const struct seq_operations cgroup_iter_seq_ops = {
+	.start  = cgroup_iter_seq_start,
+	.next   = cgroup_iter_seq_next,
+	.stop   = cgroup_iter_seq_stop,
+	.show   = cgroup_iter_seq_show,
+};
+
+BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+
+static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	*(struct cgroup **)priv_data = aux->cgroup;
+	return 0;
+}
+
+static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
+	.seq_ops                = &cgroup_iter_seq_ops,
+	.init_seq_private       = cgroup_iter_seq_init,
+	.seq_priv_size          = sizeof(struct cgroup *),
+};
+
+static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
+				  union bpf_iter_link_info *linfo,
+				  struct bpf_iter_aux_info *aux)
+{
+	struct cgroup *cgroup;
+
+	cgroup = cgroup_get_from_id(linfo->cgroup.cgroup_id);
+	if (!cgroup)
+		return -EBUSY;
+
+	aux->cgroup = cgroup;
+	return 0;
+}
+
+static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
+{
+	if (aux->cgroup)
+		cgroup_put(aux->cgroup);
+}
+
+static void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
+					struct seq_file *seq)
+{
+	char *buf;
+
+	seq_printf(seq, "cgroup_id:\t%llu\n", cgroup_id(aux->cgroup));
+
+	buf = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!buf) {
+		seq_puts(seq, "cgroup_path:\n");
+		return;
+	}
+
+	/* If cgroup_path_ns() fails, buf will be an empty string, cgroup_path
+	 * will print nothing.
+	 *
+	 * Cgroup_path is the path in the calliing process's cgroup namespace.
+	 */
+	cgroup_path_ns(aux->cgroup, buf, sizeof(buf),
+		       current->nsproxy->cgroup_ns);
+	seq_printf(seq, "cgroup_path:\t%s\n", buf);
+	kfree(buf);
+}
+
+static int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
+					  struct bpf_link_info *info)
+{
+	info->iter.cgroup.cgroup_id = cgroup_id(aux->cgroup);
+	return 0;
+}
+
+DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
+		     struct cgroup *cgroup)
+
+static struct bpf_iter_reg bpf_cgroup_reg_info = {
+	.target			= "cgroup",
+	.attach_target		= bpf_iter_attach_cgroup,
+	.detach_target		= bpf_iter_detach_cgroup,
+	.show_fdinfo		= bpf_iter_cgroup_show_fdinfo,
+	.fill_link_info		= bpf_iter_cgroup_fill_link_info,
+	.ctx_arg_info_size	= 1,
+	.ctx_arg_info		= {
+		{ offsetof(struct bpf_iter__cgroup, cgroup),
+		  PTR_TO_BTF_ID },
+	},
+	.seq_info		= &cgroup_iter_seq_info,
+};
+
+static int __init bpf_cgroup_iter_init(void)
+{
+	bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
+	return bpf_iter_reg_target(&bpf_cgroup_reg_info);
+}
+
+late_initcall(bpf_cgroup_iter_init);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0210f85131b3..e5bc40d4bccc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5965,6 +5968,9 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next v1 4/5] selftests/bpf: extend cgroup helpers
  2022-05-20  1:21 [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
                   ` (2 preceding siblings ...)
  2022-05-20  1:21 ` [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter Yosry Ahmed
@ 2022-05-20  1:21 ` Yosry Ahmed
  2022-05-20  1:21 ` [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection Yosry Ahmed
  2022-06-03 16:22 ` [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats Michal Koutný
  5 siblings, 0 replies; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20  1:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

This patch extends bpf selftests cgroup helpers in various ways:
- Expose enable_controllers() that allows tests to enable all or a
  subset of controllers for a specific cgroup.
- Add write_cgroup_file().
- Add join_cgroup_parent(). The cgroup workdir is based on the pid,
  therefore a spawned child cannot join the same cgroup hierarchy of the
  test through join_cgroup(). join_cgroup_parent() is used in child
  processes to join a cgroup under the parent's workdir.
- Distinguish relative and absolute cgroup paths in function arguments.
  Now relative paths are called relative_path, and absolute paths are
  called cgroup_path.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 tools/testing/selftests/bpf/cgroup_helpers.c | 159 ++++++++++++++-----
 tools/testing/selftests/bpf/cgroup_helpers.h |  14 +-
 2 files changed, 127 insertions(+), 46 deletions(-)

diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index 9d59c3990ca8..48c8f794a347 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -33,49 +33,51 @@
 #define CGROUP_MOUNT_DFLT		"/sys/fs/cgroup"
 #define NETCLS_MOUNT_PATH		CGROUP_MOUNT_DFLT "/net_cls"
 #define CGROUP_WORK_DIR			"/cgroup-test-work-dir"
-#define format_cgroup_path(buf, path) \
+
+#define format_cgroup_path_pid(buf, path, pid) \
 	snprintf(buf, sizeof(buf), "%s%s%d%s", CGROUP_MOUNT_PATH, \
-	CGROUP_WORK_DIR, getpid(), path)
+	CGROUP_WORK_DIR, pid, path)
+
+#define format_cgroup_path(buf, path) \
+	format_cgroup_path_pid(buf, path, getpid())
+
+#define format_parent_cgroup_path(buf, path) \
+	format_cgroup_path_pid(buf, path, getppid())
 
 #define format_classid_path(buf)				\
 	snprintf(buf, sizeof(buf), "%s%s", NETCLS_MOUNT_PATH,	\
 		 CGROUP_WORK_DIR)
 
-/**
- * enable_all_controllers() - Enable all available cgroup v2 controllers
- *
- * Enable all available cgroup v2 controllers in order to increase
- * the code coverage.
- *
- * If successful, 0 is returned.
- */
-static int enable_all_controllers(char *cgroup_path)
+
+static int __enable_controllers(const char *cgroup_path, const char *controllers)
 {
 	char path[PATH_MAX + 1];
-	char buf[PATH_MAX];
+	char enable[PATH_MAX + 1];
 	char *c, *c2;
 	int fd, cfd;
 	ssize_t len;
 
-	snprintf(path, sizeof(path), "%s/cgroup.controllers", cgroup_path);
-	fd = open(path, O_RDONLY);
-	if (fd < 0) {
-		log_err("Opening cgroup.controllers: %s", path);
-		return 1;
-	}
+	/* If not controllers are passed, enable all available controllers */
+	if (!controllers) {
+		snprintf(path, sizeof(path), "%s/cgroup.controllers",
+			 cgroup_path);
+		fd = open(path, O_RDONLY);
+		if (fd < 0) {
+			log_err("Opening cgroup.controllers: %s", path);
+			return 1;
+		}
 
-	len = read(fd, buf, sizeof(buf) - 1);
-	if (len < 0) {
+		len = read(fd, enable, sizeof(enable) - 1);
+		if (len < 0) {
+			close(fd);
+			log_err("Reading cgroup.controllers: %s", path);
+			return 1;
+		} else if (len == 0) /* No controllers to enable */
+			return 0;
+		enable[len] = 0;
 		close(fd);
-		log_err("Reading cgroup.controllers: %s", path);
-		return 1;
-	}
-	buf[len] = 0;
-	close(fd);
-
-	/* No controllers available? We're probably on cgroup v1. */
-	if (len == 0)
-		return 0;
+	} else
+		strncpy(enable, controllers, sizeof(enable));
 
 	snprintf(path, sizeof(path), "%s/cgroup.subtree_control", cgroup_path);
 	cfd = open(path, O_RDWR);
@@ -84,7 +86,7 @@ static int enable_all_controllers(char *cgroup_path)
 		return 1;
 	}
 
-	for (c = strtok_r(buf, " ", &c2); c; c = strtok_r(NULL, " ", &c2)) {
+	for (c = strtok_r(enable, " ", &c2); c; c = strtok_r(NULL, " ", &c2)) {
 		if (dprintf(cfd, "+%s\n", c) <= 0) {
 			log_err("Enabling controller %s: %s", c, path);
 			close(cfd);
@@ -95,6 +97,63 @@ static int enable_all_controllers(char *cgroup_path)
 	return 0;
 }
 
+/**
+ * enable_controllers() - Enable cgroup v2 controllers
+ * @relative_path: The cgroup path, relative to the workdir
+ * @controllers: List of controllers to enable in cgroup.controllers format
+ *
+ *
+ * Enable given cgroup v2 controllers, if @controllers is NULL, enable all
+ * available controllers.
+ *
+ * If successful, 0 is returned.
+ */
+int enable_controllers(const char *relative_path, const char *controllers)
+{
+	char cgroup_path[PATH_MAX + 1];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return __enable_controllers(cgroup_path, controllers);
+}
+
+
+
+/**
+ * write_cgroup_file() - Write to a cgroup file
+ * @relative_path: The cgroup path, relative to the workdir
+ * @buf: Buffer to write to the file
+ *
+ * Write to a file in the given cgroup's directory.
+ *
+ * If successful, 0 is returned.
+ */
+int write_cgroup_file(const char *relative_path, const char *file,
+		      const char *buf)
+{
+	char cgroup_path[PATH_MAX - 24];
+	char file_path[PATH_MAX + 1];
+	int fd;
+
+	format_cgroup_path(cgroup_path, relative_path);
+
+	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
+	fd = open(file_path, O_RDWR);
+	if (fd < 0) {
+		log_err("Opening cgroup.subtree_control: %s", file_path);
+		return 1;
+	}
+
+	if (dprintf(fd, "%s", buf) <= 0) {
+		log_err("Writing to %s", file_path);
+		close(fd);
+		return 1;
+	}
+	close(fd);
+	return 0;
+}
+
+
+
 /**
  * setup_cgroup_environment() - Setup the cgroup environment
  *
@@ -133,7 +192,9 @@ int setup_cgroup_environment(void)
 		return 1;
 	}
 
-	if (enable_all_controllers(cgroup_workdir))
+	/* Enable all available controllers to increase test coverage */
+	if (__enable_controllers(CGROUP_MOUNT_PATH, NULL) ||
+	    __enable_controllers(cgroup_workdir, NULL))
 		return 1;
 
 	return 0;
@@ -173,7 +234,7 @@ static int join_cgroup_from_top(const char *cgroup_path)
 
 /**
  * join_cgroup() - Join a cgroup
- * @path: The cgroup path, relative to the workdir, to join
+ * @relative_path: The cgroup path, relative to the workdir, to join
  *
  * This function expects a cgroup to already be created, relative to the cgroup
  * work dir, and it joins it. For example, passing "/my-cgroup" as the path
@@ -182,11 +243,27 @@ static int join_cgroup_from_top(const char *cgroup_path)
  *
  * On success, it returns 0, otherwise on failure it returns 1.
  */
-int join_cgroup(const char *path)
+int join_cgroup(const char *relative_path)
+{
+	char cgroup_path[PATH_MAX + 1];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return join_cgroup_from_top(cgroup_path);
+}
+
+/**
+ * join_parent_cgroup() - Join a cgroup in the parent process workdir
+ * @relative_path: The cgroup path, relative to parent process workdir, to join
+ *
+ * See join_cgroup().
+ *
+ * On success, it returns 0, otherwise on failure it returns 1.
+ */
+int join_parent_cgroup(const char *relative_path)
 {
 	char cgroup_path[PATH_MAX + 1];
 
-	format_cgroup_path(cgroup_path, path);
+	format_parent_cgroup_path(cgroup_path, relative_path);
 	return join_cgroup_from_top(cgroup_path);
 }
 
@@ -214,7 +291,7 @@ void cleanup_cgroup_environment(void)
 
 /**
  * create_and_get_cgroup() - Create a cgroup, relative to workdir, and get the FD
- * @path: The cgroup path, relative to the workdir, to join
+ * @relative_path: The cgroup path, relative to the workdir, to join
  *
  * This function creates a cgroup under the top level workdir and returns the
  * file descriptor. It is idempotent.
@@ -222,14 +299,14 @@ void cleanup_cgroup_environment(void)
  * On success, it returns the file descriptor. On failure it returns -1.
  * If there is a failure, it prints the error to stderr.
  */
-int create_and_get_cgroup(const char *path)
+int create_and_get_cgroup(const char *relative_path)
 {
 	char cgroup_path[PATH_MAX + 1];
 	int fd;
 
-	format_cgroup_path(cgroup_path, path);
+	format_cgroup_path(cgroup_path, relative_path);
 	if (mkdir(cgroup_path, 0777) && errno != EEXIST) {
-		log_err("mkdiring cgroup %s .. %s", path, cgroup_path);
+		log_err("mkdiring cgroup %s .. %s", relative_path, cgroup_path);
 		return -1;
 	}
 
@@ -244,13 +321,13 @@ int create_and_get_cgroup(const char *path)
 
 /**
  * get_cgroup_id() - Get cgroup id for a particular cgroup path
- * @path: The cgroup path, relative to the workdir, to join
+ * @relative_path: The cgroup path, relative to the workdir, to join
  *
  * On success, it returns the cgroup id. On failure it returns 0,
  * which is an invalid cgroup id.
  * If there is a failure, it prints the error to stderr.
  */
-unsigned long long get_cgroup_id(const char *path)
+unsigned long long get_cgroup_id(const char *relative_path)
 {
 	int dirfd, err, flags, mount_id, fhsize;
 	union {
@@ -261,7 +338,7 @@ unsigned long long get_cgroup_id(const char *path)
 	struct file_handle *fhp, *fhp2;
 	unsigned long long ret = 0;
 
-	format_cgroup_path(cgroup_workdir, path);
+	format_cgroup_path(cgroup_workdir, relative_path);
 
 	dirfd = AT_FDCWD;
 	flags = 0;
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index fcc9cb91b211..6b1d905557c7 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -10,11 +10,15 @@
 	__FILE__, __LINE__, clean_errno(), ##__VA_ARGS__)
 
 /* cgroupv2 related */
-int cgroup_setup_and_join(const char *path);
-int create_and_get_cgroup(const char *path);
-unsigned long long get_cgroup_id(const char *path);
+int enable_controllers(const char *relative_path, const char *controllers);
+int write_cgroup_file(const char *relative_path, const char *file,
+		      const char *buf);
+int cgroup_setup_and_join(const char *relative_path);
+int create_and_get_cgroup(const char *relative_path);
+unsigned long long get_cgroup_id(const char *relative_path);
 
-int join_cgroup(const char *path);
+int join_cgroup(const char *relative_path);
+int join_parent_cgroup(const char *relative_path);
 
 int setup_cgroup_environment(void);
 void cleanup_cgroup_environment(void);
@@ -26,4 +30,4 @@ int join_classid(void);
 int setup_classid_environment(void);
 void cleanup_classid_environment(void);
 
-#endif /* __CGROUP_HELPERS_H */
\ No newline at end of file
+#endif /* __CGROUP_HELPERS_H */
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection
  2022-05-20  1:21 [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
                   ` (3 preceding siblings ...)
  2022-05-20  1:21 ` [PATCH bpf-next v1 4/5] selftests/bpf: extend cgroup helpers Yosry Ahmed
@ 2022-05-20  1:21 ` Yosry Ahmed
  2022-05-20 16:09   ` Yonghong Song
  2022-06-03 16:23   ` Michal Koutný
  2022-06-03 16:22 ` [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats Michal Koutný
  5 siblings, 2 replies; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20  1:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Add a selftest that tests the whole workflow for collecting,
aggregating, and display cgroup hierarchical stats.

TL;DR:
- Whenever reclaim happens, vmscan_start and vmscan_end update
  per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
  have updates.
- When userspace tries to read the stats, vmscan_dump calls rstat to flush
  the stats.
- rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
  updates, vmscan_flush aggregates cpu readings and propagates updates
  to parents.

Detailed explanation:
- The test loads tracing bpf programs, vmscan_start and vmscan_end, to
  measure the latency of cgroup reclaim. Per-cgroup ratings are stored in
  percpu maps for efficiency. When a cgroup reading is updated on a cpu,
  cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
  rstat updated tree on that cpu.

- A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
  each cgroup. Reading this file invokes the program, which calls
  cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
  cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
  the stats are exposed to the user.

- An ftrace program, vmscan_flush, is also loaded and attached to
  bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
  once for each (cgroup, cpu) pair that has updates. cgroups are popped
  from the rstat tree in a bottom-up fashion, so calls will always be
  made for cgroups that have updates before their parents. The program
  aggregates percpu readings to a total per-cgroup reading, and also
  propagates them to the parent cgroup. After rstat flushing is over, all
  cgroups will have correct updated hierarchical readings (including all
  cpus and all their descendants).

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 .../test_cgroup_hierarchical_stats.c          | 339 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../selftests/bpf/progs/cgroup_vmscan.c       | 221 ++++++++++++
 3 files changed, 567 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_vmscan.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
new file mode 100644
index 000000000000..e560c1f6291f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+#include <test_progs.h>
+
+#include "cgroup_helpers.h"
+#include "cgroup_vmscan.skel.h"
+
+#define PAGE_SIZE 4096
+#define MB(x) (x << 20)
+
+#define BPFFS_ROOT "/sys/fs/bpf/"
+#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
+
+#define CG_ROOT_NAME "root"
+#define CG_ROOT_ID 1
+
+#define CGROUP_PATH(p, n) {.name = #n, .path = #p"/"#n}
+
+static struct {
+	const char *name, *path;
+	unsigned long long id;
+	int fd;
+} cgroups[] = {
+	CGROUP_PATH(/, test),
+	CGROUP_PATH(/test, child1),
+	CGROUP_PATH(/test, child2),
+	CGROUP_PATH(/test/child1, child1_1),
+	CGROUP_PATH(/test/child1, child1_2),
+	CGROUP_PATH(/test/child2, child2_1),
+	CGROUP_PATH(/test/child2, child2_2),
+};
+
+#define N_CGROUPS ARRAY_SIZE(cgroups)
+#define N_NON_LEAF_CGROUPS 3
+
+bool mounted_bpffs;
+static int duration;
+
+static int read_from_file(const char *path, char *buf, size_t size)
+{
+	int fd, len;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		log_err("Open %s", path);
+		return -errno;
+	}
+	len = read(fd, buf, size);
+	if (len < 0)
+		log_err("Read %s", path);
+	else
+		buf[len] = 0;
+	close(fd);
+	return len < 0 ? -errno : 0;
+}
+
+static int setup_bpffs(void)
+{
+	int err;
+
+	/* Mount bpffs */
+	err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
+	mounted_bpffs = !err;
+	if (CHECK(err && errno != EBUSY, "mount bpffs",
+	      "failed to mount bpffs at %s (%s)\n", BPFFS_ROOT,
+	      strerror(errno)))
+		return err;
+
+	/* Create a directory to contain stat files in bpffs */
+	err = mkdir(BPFFS_VMSCAN, 0755);
+	CHECK(err, "mkdir bpffs", "failed to mkdir %s (%s)\n",
+	      BPFFS_VMSCAN, strerror(errno));
+	return err;
+}
+
+static void cleanup_bpffs(void)
+{
+	/* Remove created directory in bpffs */
+	CHECK(rmdir(BPFFS_VMSCAN), "rmdir", "failed to rmdir %s (%s)\n",
+	      BPFFS_VMSCAN, strerror(errno));
+
+	/* Unmount bpffs, if it wasn't already mounted when we started */
+	if (mounted_bpffs)
+		return;
+	CHECK(umount(BPFFS_ROOT), "umount", "failed to unmount bpffs (%s)\n",
+	      strerror(errno));
+}
+
+static int setup_cgroups(void)
+{
+	int i, err;
+
+	err = setup_cgroup_environment();
+	if (CHECK(err, "setup_cgroup_environment", "failed: %d\n", err))
+		return err;
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		int fd;
+
+		fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(fd, 0, "create_and_get_cgroup"))
+			return fd;
+
+		cgroups[i].fd = fd;
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+		if (i < N_NON_LEAF_CGROUPS) {
+			err = enable_controllers(cgroups[i].path, "memory");
+			if (!ASSERT_OK(err, "enable_controllers"))
+				return err;
+		}
+	}
+	return 0;
+}
+
+static void cleanup_cgroups(void)
+{
+	for (int i = 0; i < N_CGROUPS; i++)
+		close(cgroups[i].fd);
+	cleanup_cgroup_environment();
+}
+
+
+static int setup_hierarchy(void)
+{
+	return setup_bpffs() || setup_cgroups();
+}
+
+static void destroy_hierarchy(void)
+{
+	cleanup_cgroups();
+	cleanup_bpffs();
+}
+
+static void alloc_anon(size_t size)
+{
+	char *buf, *ptr;
+
+	buf = malloc(size);
+	for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+		*ptr = 0;
+	free(buf);
+}
+
+static int induce_vmscan(void)
+{
+	char size[128];
+	int i, err;
+
+	/*
+	 * Set memory.high for test parent cgroup to 1 MB to throttle
+	 * allocations and invoke reclaim in children.
+	 */
+	snprintf(size, 128, "%d", MB(1));
+	err = write_cgroup_file(cgroups[0].path, "memory.high",	size);
+	if (!ASSERT_OK(err, "write memory.high"))
+		return err;
+	/*
+	 * In every leaf cgroup, run a memory hog for a few seconds to induce
+	 * reclaim then kill it.
+	 */
+	for (i = N_NON_LEAF_CGROUPS; i < N_CGROUPS; i++) {
+		pid_t pid = fork();
+
+		if (pid == 0) {
+			/* Join cgroup in the parent process workdir */
+			join_parent_cgroup(cgroups[i].path);
+
+			/* Allocate more memory than memory.high */
+			alloc_anon(MB(2));
+			exit(0);
+		} else {
+			/* Wait for child to cause reclaim then kill it */
+			if (!ASSERT_GT(pid, 0, "fork"))
+				return pid;
+			sleep(2);
+			kill(pid, SIGKILL);
+			waitpid(pid, NULL, 0);
+		}
+	}
+	return 0;
+}
+
+static unsigned long long get_cgroup_vmscan(unsigned long long cgroup_id,
+					    const char *file_name)
+{
+	char buf[128], path[128];
+	unsigned long long vmscan = 0, id = 0;
+	int err;
+
+	/* For every cgroup, read the file generated by cgroup_iter */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
+	err = read_from_file(path, buf, 128);
+	if (CHECK(err, "read", "failed to read from %s (%s)\n",
+		   path, strerror(errno)))
+		return 0;
+
+	/* Check the output file formatting */
+	ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
+			 &id, &vmscan), 2, "output format");
+
+	/* Check that the cgroup_id is displayed correctly */
+	ASSERT_EQ(cgroup_id, id, "cgroup_id");
+	/* Check that the vmscan reading is non-zero */
+	ASSERT_NEQ(vmscan, 0, "vmscan_reading");
+	return vmscan;
+}
+
+static void check_vmscan_stats(void)
+{
+	int i;
+	unsigned long long vmscan_readings[N_CGROUPS], vmscan_root;
+
+	for (i = 0; i < N_CGROUPS; i++)
+		vmscan_readings[i] = get_cgroup_vmscan(cgroups[i].id,
+						       cgroups[i].name);
+
+	/* Read stats for root too */
+	vmscan_root = get_cgroup_vmscan(CG_ROOT_ID, CG_ROOT_NAME);
+
+	/* Check that child1 == child1_1 + child1_2 */
+	ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + vmscan_readings[4],
+		  "child1_vmscan");
+	/* Check that child2 == child2_1 + child2_2 */
+	ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + vmscan_readings[6],
+		  "child2_vmscan");
+	/* Check that test == child1 + child2 */
+	ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + vmscan_readings[2],
+		  "test_vmscan");
+	/* Check that root >= test */
+	ASSERT_GE(vmscan_root, vmscan_readings[1], "root_vmscan");
+}
+
+static int setup_cgroup_iter(struct cgroup_vmscan *obj,
+			     unsigned long long cgroup_id,
+			     const char *file_name)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo = {};
+	struct bpf_link *link;
+	char path[128];
+	int err;
+
+	/* Create an iter link, parameterized by cgroup id */
+	linfo.cgroup.cgroup_id = cgroup_id;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+	link = bpf_program__attach_iter(obj->progs.dump_vmscan, &opts);
+	if (!ASSERT_OK_PTR(link, "attach iter"))
+		return libbpf_get_error(link);
+
+	/* Pin the link to a bpffs file */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
+	err = bpf_link__pin(link, path);
+	CHECK(err, "pin iter", "failed to pin iter at %s", path);
+	return err;
+}
+
+static int setup_progs(struct cgroup_vmscan **skel)
+{
+	int i;
+	struct bpf_link *link;
+	struct cgroup_vmscan *obj;
+
+	obj = cgroup_vmscan__open_and_load();
+	if (!ASSERT_OK_PTR(obj, "open_and_load"))
+		return libbpf_get_error(obj);
+
+	/* Attach cgroup_iter program that will dump the stats to cgroups */
+	for (i = 0; i < N_CGROUPS; i++)
+		setup_cgroup_iter(obj, cgroups[i].id, cgroups[i].name);
+	/* Also dump stats for root */
+	setup_cgroup_iter(obj, CG_ROOT_ID, CG_ROOT_NAME);
+
+	/* Attach rstat flusher */
+	link = bpf_program__attach(obj->progs.vmscan_flush);
+	if (!ASSERT_OK_PTR(link, "attach rstat"))
+		return libbpf_get_error(link);
+
+	/* Attach tracing programs that will calculate vmscan delays */
+	link = bpf_program__attach(obj->progs.vmscan_start);
+	if (!ASSERT_OK_PTR(obj, "attach raw_tracepoint"))
+		return libbpf_get_error(obj);
+
+	link = bpf_program__attach(obj->progs.vmscan_end);
+	if (!ASSERT_OK_PTR(obj, "attach raw_tracepoint"))
+		return libbpf_get_error(obj);
+
+	*skel = obj;
+	return 0;
+}
+
+void destroy_progs(struct cgroup_vmscan *skel)
+{
+	char path[128];
+	int i;
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		/* Delete files in bpffs that cgroup_iters are pinned in */
+		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
+			 cgroups[i].name);
+		CHECK(remove(path), "remove", "failed to remove %s (%s)\n",
+		      path, strerror(errno));
+	}
+
+	/* Delete root file in bpffs */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, CG_ROOT_NAME);
+	CHECK(remove(path), "remove", "failed to remove %s (%s)\n", path,
+	      strerror(errno));
+	cgroup_vmscan__destroy(skel);
+}
+
+void test_cgroup_hierarchical_stats(void)
+{
+	struct cgroup_vmscan *skel = NULL;
+
+	if (setup_hierarchy())
+		goto hierarchy_cleanup;
+	if (setup_progs(&skel))
+		goto cleanup;
+	if (induce_vmscan())
+		goto cleanup;
+	check_vmscan_stats();
+cleanup:
+	destroy_progs(skel);
+hierarchy_cleanup:
+	destroy_hierarchy();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h
index 97ec8bc76ae6..df91f1daf74d 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter.h
+++ b/tools/testing/selftests/bpf/progs/bpf_iter.h
@@ -17,6 +17,7 @@
 #define bpf_iter__bpf_sk_storage_map bpf_iter__bpf_sk_storage_map___not_used
 #define bpf_iter__sockmap bpf_iter__sockmap___not_used
 #define bpf_iter__bpf_link bpf_iter__bpf_link___not_used
+#define bpf_iter__cgroup bpf_iter__cgroup__not_used
 #define btf_ptr btf_ptr___not_used
 #define BTF_F_COMPACT BTF_F_COMPACT___not_used
 #define BTF_F_NONAME BTF_F_NONAME___not_used
@@ -39,6 +40,7 @@
 #undef bpf_iter__bpf_sk_storage_map
 #undef bpf_iter__sockmap
 #undef bpf_iter__bpf_link
+#undef bpf_iter__cgroup
 #undef btf_ptr
 #undef BTF_F_COMPACT
 #undef BTF_F_NONAME
@@ -139,6 +141,11 @@ struct bpf_iter__bpf_link {
 	struct bpf_link *link;
 };
 
+struct bpf_iter__cgroup {
+	struct bpf_iter_meta *meta;
+	struct cgroup *cgroup;
+} __attribute((preserve_access_index));
+
 struct btf_ptr {
 	void *ptr;
 	__u32 type_id;
diff --git a/tools/testing/selftests/bpf/progs/cgroup_vmscan.c b/tools/testing/selftests/bpf/progs/cgroup_vmscan.c
new file mode 100644
index 000000000000..9d7c72c213ad
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_vmscan.c
@@ -0,0 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+/*
+ * Start times are stored per-task, not per-cgroup, as multiple tasks in one
+ * cgroup can perform reclain concurrently.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, __u64);
+} vmscan_start_time SEC(".maps");
+
+struct vmscan_percpu {
+	/* Previous percpu state, to figure out if we have new updates */
+	__u64 prev;
+	/* Current percpu state */
+	__u64 state;
+};
+
+struct vmscan {
+	/* State propagated through children, pending aggregation */
+	__u64 pending;
+	/* Total state, including all cpus and all children */
+	__u64 state;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
+	__uint(max_entries, 10);
+	__type(key, __u64);
+	__type(value, struct vmscan_percpu);
+} pcpu_cgroup_vmscan_elapsed SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__type(key, __u64);
+	__type(value, struct vmscan);
+} cgroup_vmscan_elapsed SEC(".maps");
+
+extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
+extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
+
+static inline bool memory_subsys_enabled(struct cgroup *cgrp)
+{
+	return cgrp->subsys[memory_cgrp_id] != NULL;
+}
+
+static inline struct cgroup *task_memcg(struct task_struct *task)
+{
+	return task->cgroups->subsys[memory_cgrp_id]->cgroup;
+}
+
+static inline uint64_t cgroup_id(struct cgroup *cgrp)
+{
+	return cgrp->kn->id;
+}
+
+static inline int create_vmscan_percpu_elem(__u64 cg_id, __u64 state)
+{
+	struct vmscan_percpu pcpu_init = {.state = state, .prev = 0};
+
+	if (bpf_map_update_elem(&pcpu_cgroup_vmscan_elapsed, &cg_id,
+				&pcpu_init, BPF_NOEXIST)) {
+		bpf_printk("failed to create pcpu entry for cgroup %llu\n"
+			   , cg_id);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int create_vmscan_elem(__u64 cg_id, __u64 state, __u64 pending)
+{
+	struct vmscan init = {.state = state, .pending = pending};
+
+	if (bpf_map_update_elem(&cgroup_vmscan_elapsed, &cg_id,
+				&init, BPF_NOEXIST)) {
+		bpf_printk("failed to create entry for cgroup %llu\n"
+			   , cg_id);
+		return 1;
+	}
+	return 0;
+}
+
+SEC("tp_btf/mm_vmscan_memcg_reclaim_begin")
+int BPF_PROG(vmscan_start, struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	__u64 *start_time_ptr;
+
+	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, task, 0,
+					  BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!start_time_ptr) {
+		bpf_printk("error retrieving storage\n");
+		return 0;
+	}
+
+	*start_time_ptr = bpf_ktime_get_ns();
+	return 0;
+}
+
+SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
+int BPF_PROG(vmscan_end, struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct task_struct *current = bpf_get_current_task_btf();
+	struct cgroup *cgrp = task_memcg(current);
+	__u64 *start_time_ptr;
+	__u64 current_elapsed, cg_id;
+	__u64 end_time = bpf_ktime_get_ns();
+
+	/* cgrp may not have memory controller enabled */
+	if (!cgrp)
+		return 0;
+
+	cg_id = cgroup_id(cgrp);
+	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, current, 0,
+					      BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!start_time_ptr) {
+		bpf_printk("error retrieving storage local storage\n");
+		return 0;
+	}
+
+	current_elapsed = end_time - *start_time_ptr;
+	pcpu_stat = bpf_map_lookup_elem(&pcpu_cgroup_vmscan_elapsed,
+					&cg_id);
+	if (pcpu_stat)
+		__sync_fetch_and_add(&pcpu_stat->state, current_elapsed);
+	else
+		create_vmscan_percpu_elem(cg_id, current_elapsed);
+
+	cgroup_rstat_updated(cgrp, bpf_get_smp_processor_id());
+	return 0;
+}
+
+SEC("fentry/bpf_rstat_flush")
+int BPF_PROG(vmscan_flush, struct cgroup *cgrp, struct cgroup *parent, int cpu)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct vmscan *total_stat, *parent_stat;
+	__u64 cg_id = cgroup_id(cgrp);
+	__u64 parent_cg_id = parent ? cgroup_id(parent) : 0;
+	__u64 *pcpu_vmscan;
+	__u64 state;
+	__u64 delta = 0;
+
+	if (!memory_subsys_enabled(cgrp))
+		return 0;
+
+	/* Add CPU changes on this level since the last flush */
+	pcpu_stat = bpf_map_lookup_percpu_elem(&pcpu_cgroup_vmscan_elapsed,
+					       &cg_id, cpu);
+	if (pcpu_stat) {
+		state = pcpu_stat->state;
+		delta += state - pcpu_stat->prev;
+		pcpu_stat->prev = state;
+	}
+
+	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		create_vmscan_elem(cg_id, delta, 0);
+		goto update_parent;
+	}
+
+	/* Collect pending stats from subtree */
+	if (total_stat->pending) {
+		delta += total_stat->pending;
+		total_stat->pending = 0;
+	}
+
+	/* Propagate changes to this cgroup's total */
+	total_stat->state += delta;
+
+update_parent:
+	/* Skip if there are no changes to propagate, or no parent */
+	if (!delta || !parent_cg_id)
+		return 0;
+
+	/* Propagate changes to cgroup's parent */
+	parent_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed,
+					  &parent_cg_id);
+	if (parent_stat)
+		parent_stat->pending += delta;
+	else
+		create_vmscan_elem(parent_cg_id, 0, delta);
+
+	return 0;
+}
+
+SEC("iter.s/cgroup")
+int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp)
+{
+	struct seq_file *seq = meta->seq;
+	struct vmscan *total_stat;
+	__u64 cg_id = cgroup_id(cgrp);
+
+	/* Flush the stats to make sure we get the most updated numbers */
+	cgroup_rstat_flush(cgrp);
+
+	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		bpf_printk("error finding stats for cgroup %llu\n", cg_id);
+		BPF_SEQ_PRINTF(seq, "cg_id: -1, total_vmscan_delay: -1\n");
+		return 0;
+	}
+	BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
+		       cg_id, total_stat->state);
+	return 0;
+}
+
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs
  2022-05-20  1:21 ` [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs Yosry Ahmed
@ 2022-05-20  7:24   ` Tejun Heo
  2022-05-20  9:13     ` Yosry Ahmed
  2022-05-20 15:14   ` Yonghong Song
  2022-05-21 11:47   ` kernel test robot
  2 siblings, 1 reply; 58+ messages in thread
From: Tejun Heo @ 2022-05-20  7:24 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

On Fri, May 20, 2022 at 01:21:30AM +0000, Yosry Ahmed wrote:
> Add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs to bpf
> tracing programs. bpf programs that make use of rstat can use these
> functions to inform rstat when they update stats for a cgroup, and when
> they need to flush the stats.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Do patch 1 and 2 need to be separate? Also, can you explain and comment why
it's __weak?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20  1:21 ` [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter Yosry Ahmed
@ 2022-05-20  7:41   ` Tejun Heo
  2022-05-20  7:58     ` Yosry Ahmed
  0 siblings, 1 reply; 58+ messages in thread
From: Tejun Heo @ 2022-05-20  7:41 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote:
> From: Hao Luo <haoluo@google.com>
> 
> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> be parameterized by a cgroup id and prints only that cgroup. So one
> needs to specify a target cgroup id when attaching this iter. The target
> cgroup's state can be read out via a link of this iter.
> 
> Signed-off-by: Hao Luo <haoluo@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

This could be me not understanding why it's structured this way but it keeps
bothering me that this is adding a cgroup iterator which doesn't iterate
cgroups. If all that's needed is extracting information from a specific
cgroup, why does this need to be an iterator? e.g. why can't I use
BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes
rstat, retrieves whatever information necessary and returns that as the
result?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20  7:41   ` Tejun Heo
@ 2022-05-20  7:58     ` Yosry Ahmed
  2022-05-20  8:11       ` Tejun Heo
  0 siblings, 1 reply; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20  7:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups

On Fri, May 20, 2022 at 12:41 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote:
> > From: Hao Luo <haoluo@google.com>
> >
> > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> > iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> > be parameterized by a cgroup id and prints only that cgroup. So one
> > needs to specify a target cgroup id when attaching this iter. The target
> > cgroup's state can be read out via a link of this iter.
> >
> > Signed-off-by: Hao Luo <haoluo@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> This could be me not understanding why it's structured this way but it keeps
> bothering me that this is adding a cgroup iterator which doesn't iterate
> cgroups. If all that's needed is extracting information from a specific
> cgroup, why does this need to be an iterator? e.g. why can't I use
> BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes
> rstat, retrieves whatever information necessary and returns that as the
> result?

I will let Hao and Yonghong reply here as they have a lot more
context, and they had previous discussions about cgroup_iter. I just
want to say that exposing the stats in a file is extremely convenient
for userspace apps. It becomes very similar to reading stats from
cgroupfs. It also makes migrating cgroup stats that we have
implemented in the kernel to BPF a lot easier.

AFAIK there are also discussions about using overlayfs to have links
to the bpffs files in cgroupfs, which makes it even better. So I would
really prefer keeping the approach we have here of reading stats
through a file from userspace. As for how we go about this (and why a
cgroup iterator doesn't iterate cgroups) I will leave this for Hao and
Yonghong to explain the rationale behind it. Ideally we can keep the
same functionality under a more descriptive name/type.

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20  7:58     ` Yosry Ahmed
@ 2022-05-20  8:11       ` Tejun Heo
  2022-05-20 11:27         ` Tejun Heo
                           ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Tejun Heo @ 2022-05-20  8:11 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups

Hello,

On Fri, May 20, 2022 at 12:58:52AM -0700, Yosry Ahmed wrote:
> On Fri, May 20, 2022 at 12:41 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote:
> > > From: Hao Luo <haoluo@google.com>
> > >
> > > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> > > iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> > > be parameterized by a cgroup id and prints only that cgroup. So one
> > > needs to specify a target cgroup id when attaching this iter. The target
> > > cgroup's state can be read out via a link of this iter.
> > >
> > > Signed-off-by: Hao Luo <haoluo@google.com>
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > This could be me not understanding why it's structured this way but it keeps
> > bothering me that this is adding a cgroup iterator which doesn't iterate
> > cgroups. If all that's needed is extracting information from a specific
> > cgroup, why does this need to be an iterator? e.g. why can't I use
> > BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes
> > rstat, retrieves whatever information necessary and returns that as the
> > result?
> 
> I will let Hao and Yonghong reply here as they have a lot more
> context, and they had previous discussions about cgroup_iter. I just
> want to say that exposing the stats in a file is extremely convenient
> for userspace apps. It becomes very similar to reading stats from
> cgroupfs. It also makes migrating cgroup stats that we have
> implemented in the kernel to BPF a lot easier.

So, if it were upto me, I'd rather direct energy towards making retrieving
information through TEST_RUN_PROG easier rather than clinging to making
kernel output text. I get that text interface is familiar but it kinda
sucks in many ways.

> AFAIK there are also discussions about using overlayfs to have links
> to the bpffs files in cgroupfs, which makes it even better. So I would
> really prefer keeping the approach we have here of reading stats
> through a file from userspace. As for how we go about this (and why a
> cgroup iterator doesn't iterate cgroups) I will leave this for Hao and
> Yonghong to explain the rationale behind it. Ideally we can keep the
> same functionality under a more descriptive name/type.

My answer would be the same here. You guys seem dead set on making the
kernel emulate cgroup1. I'm not gonna explicitly block that but would
strongly suggest having a longer term view.

If you *must* do the iterator, can you at least make it a proper iterator
which supports seeking? AFAICS there's nothing fundamentally preventing bpf
iterators from supporting seeking. Or is it that you need something which is
pinned to a cgroup so that you can emulate the directory structure?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs
  2022-05-20  7:24   ` Tejun Heo
@ 2022-05-20  9:13     ` Yosry Ahmed
  2022-05-20  9:36       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20  9:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups

On Fri, May 20, 2022 at 12:24 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Fri, May 20, 2022 at 01:21:30AM +0000, Yosry Ahmed wrote:
> > Add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs to bpf
> > tracing programs. bpf programs that make use of rstat can use these
> > functions to inform rstat when they update stats for a cgroup, and when
> > they need to flush the stats.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> Do patch 1 and 2 need to be separate? Also, can you explain and comment why
> it's __weak?

I will squash them in the next version.

As for the declaration, I took the __weak annotation from Alexei's
reply to the previous version. I thought it had something to do with
how fentry progs attach to functions with BTF and all.
When I try the same code with a static noinline declaration instead,
fentry attachment fails to find the BTF type ID of bpf_rstat_flush.
When I try it with just noinline (without __weak), the fentry program
attaches, but is never invoked. I tried looking at the attach code but
I couldn't figure out why this happens.

In retrospect, I should have given this more thought. It would be
great if Alexei could shed some light on this.

>
> Thanks.


>
> --
> tejun

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs
  2022-05-20  9:13     ` Yosry Ahmed
@ 2022-05-20  9:36       ` Kumar Kartikeya Dwivedi
  2022-05-20 11:16         ` Tejun Heo
  0 siblings, 1 reply; 58+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-05-20  9:36 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups

On Fri, May 20, 2022 at 02:43:03PM IST, Yosry Ahmed wrote:
> On Fri, May 20, 2022 at 12:24 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > On Fri, May 20, 2022 at 01:21:30AM +0000, Yosry Ahmed wrote:
> > > Add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs to bpf
> > > tracing programs. bpf programs that make use of rstat can use these
> > > functions to inform rstat when they update stats for a cgroup, and when
> > > they need to flush the stats.
> > >
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > Do patch 1 and 2 need to be separate? Also, can you explain and comment why
> > it's __weak?
>
> I will squash them in the next version.
>
> As for the declaration, I took the __weak annotation from Alexei's
> reply to the previous version. I thought it had something to do with
> how fentry progs attach to functions with BTF and all.
> When I try the same code with a static noinline declaration instead,
> fentry attachment fails to find the BTF type ID of bpf_rstat_flush.
> When I try it with just noinline (without __weak), the fentry program
> attaches, but is never invoked. I tried looking at the attach code but
> I couldn't figure out why this happens.
>

With static noinline, the compiler will optimize away the function. With global
noinline, it can still optimize away the call site, but will keep the function
definition, so attach works. Therefore __weak is needed to ensure call is still
emitted. With GCC __attribute__((noipa)) might have been more appropritate, but
LLVM doesn't support it, so __weak is the next best thing supported by both with
the same side effect.

> In retrospect, I should have given this more thought. It would be
> great if Alexei could shed some light on this.
>
> >
> > Thanks.
>
>
> >
> > --
> > tejun

--
Kartikeya

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs
  2022-05-20  9:36       ` Kumar Kartikeya Dwivedi
@ 2022-05-20 11:16         ` Tejun Heo
  2022-05-20 16:06           ` Yosry Ahmed
  0 siblings, 1 reply; 58+ messages in thread
From: Tejun Heo @ 2022-05-20 11:16 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Fri, May 20, 2022 at 03:06:07PM +0530, Kumar Kartikeya Dwivedi wrote:
> With static noinline, the compiler will optimize away the function. With global
> noinline, it can still optimize away the call site, but will keep the function
> definition, so attach works. Therefore __weak is needed to ensure call is still
> emitted. With GCC __attribute__((noipa)) might have been more appropritate, but
> LLVM doesn't support it, so __weak is the next best thing supported by both with
> the same side effect.

Ah, okay, so it's to prevent compiler from optimizing away call to a noop
function by telling it that we don't know what the function might eventually
be. Thanks for the explanation. Yosry, can you please add a comment
explaining what's going on?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20  8:11       ` Tejun Heo
@ 2022-05-20 11:27         ` Tejun Heo
  2022-05-20 16:29         ` Yonghong Song
  2022-05-20 17:30         ` Hao Luo
  2 siblings, 0 replies; 58+ messages in thread
From: Tejun Heo @ 2022-05-20 11:27 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups

On Thu, May 19, 2022 at 10:11:26PM -1000, Tejun Heo wrote:
> If you *must* do the iterator, can you at least make it a proper iterator
> which supports seeking? AFAICS there's nothing fundamentally preventing bpf
> iterators from supporting seeking. Or is it that you need something which is
> pinned to a cgroup so that you can emulate the directory structure?

Or, alternatively, would it be possible to make a TEST_RUN_PROG to output a
text file in bpffs? There just doesn't seem to be anything cgroup specific
that the iterator is doing that can't be done with exposing a couple kfuncs.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs
  2022-05-20  1:21 ` [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs Yosry Ahmed
  2022-05-20  7:24   ` Tejun Heo
@ 2022-05-20 15:14   ` Yonghong Song
  2022-05-20 16:08     ` Yosry Ahmed
  2022-05-21 11:47   ` kernel test robot
  2 siblings, 1 reply; 58+ messages in thread
From: Yonghong Song @ 2022-05-20 15:14 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups



On 5/19/22 6:21 PM, Yosry Ahmed wrote:
> Add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs to bpf
> tracing programs. bpf programs that make use of rstat can use these
> functions to inform rstat when they update stats for a cgroup, and when
> they need to flush the stats.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>   kernel/cgroup/rstat.c | 35 ++++++++++++++++++++++++++++++++++-
>   1 file changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index e7a88d2600bd..a16a851bc0a1 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -3,6 +3,11 @@
>   
>   #include <linux/sched/cputime.h>
>   
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/btf_ids.h>
> +
> +
>   static DEFINE_SPINLOCK(cgroup_rstat_lock);
>   static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
>   
> @@ -141,7 +146,12 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
>   	return pos;
>   }
>   
> -/* A hook for bpf stat collectors to attach to and flush their stats */
> +/*
> + * A hook for bpf stat collectors to attach to and flush their stats.
> + * Together with providing bpf kfuncs for cgroup_rstat_updated() and
> + * cgroup_rstat_flush(), this enables a complete workflow where bpf progs that
> + * collect cgroup stats can integrate with rstat for efficient flushing.
> + */
>   __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
>   				     struct cgroup *parent, int cpu)
>   {
> @@ -476,3 +486,26 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
>   		   "system_usec %llu\n",
>   		   usage, utime, stime);
>   }
> +
> +/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
> +BTF_SET_START(bpf_rstat_check_kfunc_ids)
> +BTF_ID(func, cgroup_rstat_updated)
> +BTF_ID(func, cgroup_rstat_flush)
> +BTF_SET_END(bpf_rstat_check_kfunc_ids)
> +
> +BTF_SET_START(bpf_rstat_sleepable_kfunc_ids)
> +BTF_ID(func, cgroup_rstat_flush)
> +BTF_SET_END(bpf_rstat_sleepable_kfunc_ids)
> +
> +static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
> +	.owner		= THIS_MODULE,
> +	.check_set	= &bpf_rstat_check_kfunc_ids,
> +	.sleepable_set	= &bpf_rstat_sleepable_kfunc_ids,

There is a compilation error here:

kernel/cgroup/rstat.c:503:3: error: ‘const struct btf_kfunc_id_set’ has 
no member named ‘sleepable_set’; did you mean ‘release_set’?
     503 |  .sleepable_set = &bpf_rstat_sleepable_kfunc_ids,
         |   ^~~~~~~~~~~~~
         |   release_set
   kernel/cgroup/rstat.c:503:19: warning: excess elements in struct 
initializer
     503 |  .sleepable_set = &bpf_rstat_sleepable_kfunc_ids,
         |                   ^
   kernel/cgroup/rstat.c:503:19: note: (near initialization for 
‘bpf_rstat_kfunc_set’)
   make[3]: *** [scripts/Makefile.build:288: kernel/cgroup/rstat.o] Error 1

Please fix.

> +};
> +
> +static int __init bpf_rstat_kfunc_init(void)
> +{
> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
> +					 &bpf_rstat_kfunc_set);
> +}
> +late_initcall(bpf_rstat_kfunc_init);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs
  2022-05-20 11:16         ` Tejun Heo
@ 2022-05-20 16:06           ` Yosry Ahmed
  0 siblings, 0 replies; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20 16:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Fri, May 20, 2022 at 4:16 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Fri, May 20, 2022 at 03:06:07PM +0530, Kumar Kartikeya Dwivedi wrote:
> > With static noinline, the compiler will optimize away the function. With global
> > noinline, it can still optimize away the call site, but will keep the function
> > definition, so attach works. Therefore __weak is needed to ensure call is still
> > emitted. With GCC __attribute__((noipa)) might have been more appropritate, but
> > LLVM doesn't support it, so __weak is the next best thing supported by both with
> > the same side effect.

Thanks a lot for the explanation!

>
> Ah, okay, so it's to prevent compiler from optimizing away call to a noop
> function by telling it that we don't know what the function might eventually
> be. Thanks for the explanation. Yosry, can you please add a comment
> explaining what's going on?

Will add a comment explaining things in the next section.  Thanks for
reviewing this, Tejun!

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs
  2022-05-20 15:14   ` Yonghong Song
@ 2022-05-20 16:08     ` Yosry Ahmed
  2022-05-20 16:16       ` Yonghong Song
  0 siblings, 1 reply; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20 16:08 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Roman Gushchin,
	Michal Hocko, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Fri, May 20, 2022 at 8:15 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 5/19/22 6:21 PM, Yosry Ahmed wrote:
> > Add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs to bpf
> > tracing programs. bpf programs that make use of rstat can use these
> > functions to inform rstat when they update stats for a cgroup, and when
> > they need to flush the stats.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >   kernel/cgroup/rstat.c | 35 ++++++++++++++++++++++++++++++++++-
> >   1 file changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > index e7a88d2600bd..a16a851bc0a1 100644
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> > @@ -3,6 +3,11 @@
> >
> >   #include <linux/sched/cputime.h>
> >
> > +#include <linux/bpf.h>
> > +#include <linux/btf.h>
> > +#include <linux/btf_ids.h>
> > +
> > +
> >   static DEFINE_SPINLOCK(cgroup_rstat_lock);
> >   static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
> >
> > @@ -141,7 +146,12 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
> >       return pos;
> >   }
> >
> > -/* A hook for bpf stat collectors to attach to and flush their stats */
> > +/*
> > + * A hook for bpf stat collectors to attach to and flush their stats.
> > + * Together with providing bpf kfuncs for cgroup_rstat_updated() and
> > + * cgroup_rstat_flush(), this enables a complete workflow where bpf progs that
> > + * collect cgroup stats can integrate with rstat for efficient flushing.
> > + */
> >   __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
> >                                    struct cgroup *parent, int cpu)
> >   {
> > @@ -476,3 +486,26 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
> >                  "system_usec %llu\n",
> >                  usage, utime, stime);
> >   }
> > +
> > +/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
> > +BTF_SET_START(bpf_rstat_check_kfunc_ids)
> > +BTF_ID(func, cgroup_rstat_updated)
> > +BTF_ID(func, cgroup_rstat_flush)
> > +BTF_SET_END(bpf_rstat_check_kfunc_ids)
> > +
> > +BTF_SET_START(bpf_rstat_sleepable_kfunc_ids)
> > +BTF_ID(func, cgroup_rstat_flush)
> > +BTF_SET_END(bpf_rstat_sleepable_kfunc_ids)
> > +
> > +static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
> > +     .owner          = THIS_MODULE,
> > +     .check_set      = &bpf_rstat_check_kfunc_ids,
> > +     .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
>
> There is a compilation error here:
>
> kernel/cgroup/rstat.c:503:3: error: ‘const struct btf_kfunc_id_set’ has
> no member named ‘sleepable_set’; did you mean ‘release_set’?
>      503 |  .sleepable_set = &bpf_rstat_sleepable_kfunc_ids,
>          |   ^~~~~~~~~~~~~
>          |   release_set
>    kernel/cgroup/rstat.c:503:19: warning: excess elements in struct
> initializer
>      503 |  .sleepable_set = &bpf_rstat_sleepable_kfunc_ids,
>          |                   ^
>    kernel/cgroup/rstat.c:503:19: note: (near initialization for
> ‘bpf_rstat_kfunc_set’)
>    make[3]: *** [scripts/Makefile.build:288: kernel/cgroup/rstat.o] Error 1
>
> Please fix.

This patch series is rebased on top of 2 patches in the mailing list:
- bpf/btf: also allow kfunc in tracing and syscall programs
- btf: Add a new kfunc set which allows to mark a function to be
  sleepable

I specified this in the cover letter, do I need to do something else
in this situation? Re-send the patches as part of my series?



>
> > +};
> > +
> > +static int __init bpf_rstat_kfunc_init(void)
> > +{
> > +     return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
> > +                                      &bpf_rstat_kfunc_set);
> > +}
> > +late_initcall(bpf_rstat_kfunc_init);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection
  2022-05-20  1:21 ` [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection Yosry Ahmed
@ 2022-05-20 16:09   ` Yonghong Song
  2022-05-20 16:18     ` Yosry Ahmed
  2022-06-03 16:23   ` Michal Koutný
  1 sibling, 1 reply; 58+ messages in thread
From: Yonghong Song @ 2022-05-20 16:09 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups



On 5/19/22 6:21 PM, Yosry Ahmed wrote:
> Add a selftest that tests the whole workflow for collecting,
> aggregating, and display cgroup hierarchical stats.
> 
> TL;DR:
> - Whenever reclaim happens, vmscan_start and vmscan_end update
>    per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
>    have updates.
> - When userspace tries to read the stats, vmscan_dump calls rstat to flush
>    the stats.
> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
>    updates, vmscan_flush aggregates cpu readings and propagates updates
>    to parents.
> 
> Detailed explanation:
> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
>    measure the latency of cgroup reclaim. Per-cgroup ratings are stored in
>    percpu maps for efficiency. When a cgroup reading is updated on a cpu,
>    cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
>    rstat updated tree on that cpu.
> 
> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
>    each cgroup. Reading this file invokes the program, which calls
>    cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
>    cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
>    the stats are exposed to the user.
> 
> - An ftrace program, vmscan_flush, is also loaded and attached to
>    bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
>    once for each (cgroup, cpu) pair that has updates. cgroups are popped
>    from the rstat tree in a bottom-up fashion, so calls will always be
>    made for cgroups that have updates before their parents. The program
>    aggregates percpu readings to a total per-cgroup reading, and also
>    propagates them to the parent cgroup. After rstat flushing is over, all
>    cgroups will have correct updated hierarchical readings (including all
>    cpus and all their descendants).
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>   .../test_cgroup_hierarchical_stats.c          | 339 ++++++++++++++++++
>   tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
>   .../selftests/bpf/progs/cgroup_vmscan.c       | 221 ++++++++++++
>   3 files changed, 567 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
>   create mode 100644 tools/testing/selftests/bpf/progs/cgroup_vmscan.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
> new file mode 100644
> index 000000000000..e560c1f6291f
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
> @@ -0,0 +1,339 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Functions to manage eBPF programs attached to cgroup subsystems
> + *
> + * Copyright 2022 Google LLC.
> + */
> +#include <errno.h>
> +#include <sys/types.h>
> +#include <sys/mount.h>
> +#include <sys/stat.h>
> +#include <unistd.h>
> +
> +#include <bpf/libbpf.h>
> +#include <bpf/bpf.h>
> +#include <test_progs.h>
> +
> +#include "cgroup_helpers.h"
> +#include "cgroup_vmscan.skel.h"
> +
> +#define PAGE_SIZE 4096
> +#define MB(x) (x << 20)
> +
> +#define BPFFS_ROOT "/sys/fs/bpf/"
> +#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
> +
> +#define CG_ROOT_NAME "root"
> +#define CG_ROOT_ID 1
> +
> +#define CGROUP_PATH(p, n) {.name = #n, .path = #p"/"#n}
> +
> +static struct {
> +	const char *name, *path;
> +	unsigned long long id;
> +	int fd;
> +} cgroups[] = {
> +	CGROUP_PATH(/, test),
> +	CGROUP_PATH(/test, child1),
> +	CGROUP_PATH(/test, child2),
> +	CGROUP_PATH(/test/child1, child1_1),
> +	CGROUP_PATH(/test/child1, child1_2),
> +	CGROUP_PATH(/test/child2, child2_1),
> +	CGROUP_PATH(/test/child2, child2_2),
> +};
> +
> +#define N_CGROUPS ARRAY_SIZE(cgroups)
> +#define N_NON_LEAF_CGROUPS 3
> +
> +bool mounted_bpffs;
> +static int duration;
> +
> +static int read_from_file(const char *path, char *buf, size_t size)
> +{
> +	int fd, len;
> +
> +	fd = open(path, O_RDONLY);
> +	if (fd < 0) {
> +		log_err("Open %s", path);
> +		return -errno;
> +	}
> +	len = read(fd, buf, size);
> +	if (len < 0)
> +		log_err("Read %s", path);
> +	else
> +		buf[len] = 0;
> +	close(fd);
> +	return len < 0 ? -errno : 0;
> +}
> +
> +static int setup_bpffs(void)
> +{
> +	int err;
> +
> +	/* Mount bpffs */
> +	err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
> +	mounted_bpffs = !err;
> +	if (CHECK(err && errno != EBUSY, "mount bpffs",

Please use ASSERT_* macros instead of CHECK.
There are similar instances below as well.

> +	      "failed to mount bpffs at %s (%s)\n", BPFFS_ROOT,
> +	      strerror(errno)))
> +		return err;
> +
> +	/* Create a directory to contain stat files in bpffs */
> +	err = mkdir(BPFFS_VMSCAN, 0755);
> +	CHECK(err, "mkdir bpffs", "failed to mkdir %s (%s)\n",
> +	      BPFFS_VMSCAN, strerror(errno));
> +	return err;
> +}
> +
> +static void cleanup_bpffs(void)
> +{
> +	/* Remove created directory in bpffs */
> +	CHECK(rmdir(BPFFS_VMSCAN), "rmdir", "failed to rmdir %s (%s)\n",
> +	      BPFFS_VMSCAN, strerror(errno));
> +
> +	/* Unmount bpffs, if it wasn't already mounted when we started */
> +	if (mounted_bpffs)
> +		return;
> +	CHECK(umount(BPFFS_ROOT), "umount", "failed to unmount bpffs (%s)\n",
> +	      strerror(errno));
> +}
> +
> +static int setup_cgroups(void)
> +{
> +	int i, err;
> +
> +	err = setup_cgroup_environment();
> +	if (CHECK(err, "setup_cgroup_environment", "failed: %d\n", err))
> +		return err;
> +
> +	for (i = 0; i < N_CGROUPS; i++) {
> +		int fd;

You can put this to the top declaration 'int i, err'.

> +
> +		fd = create_and_get_cgroup(cgroups[i].path);
> +		if (!ASSERT_GE(fd, 0, "create_and_get_cgroup"))
> +			return fd;
> +
> +		cgroups[i].fd = fd;
> +		cgroups[i].id = get_cgroup_id(cgroups[i].path);
> +		if (i < N_NON_LEAF_CGROUPS) {
> +			err = enable_controllers(cgroups[i].path, "memory");
> +			if (!ASSERT_OK(err, "enable_controllers"))
> +				return err;
> +		}
> +	}
> +	return 0;
> +}
> +
> +static void cleanup_cgroups(void)
> +{
> +	for (int i = 0; i < N_CGROUPS; i++)
> +		close(cgroups[i].fd);
> +	cleanup_cgroup_environment();
> +}
> +
> +
> +static int setup_hierarchy(void)
> +{
> +	return setup_bpffs() || setup_cgroups();
> +}
> +
> +static void destroy_hierarchy(void)
> +{
> +	cleanup_cgroups();
> +	cleanup_bpffs();
> +}
> +
[...]
> +
> +SEC("iter.s/cgroup")
> +int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp)
> +{
> +	struct seq_file *seq = meta->seq;
> +	struct vmscan *total_stat;
> +	__u64 cg_id = cgroup_id(cgrp);
> +
> +	/* Flush the stats to make sure we get the most updated numbers */
> +	cgroup_rstat_flush(cgrp);
> +
> +	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
> +	if (!total_stat) {
> +		bpf_printk("error finding stats for cgroup %llu\n", cg_id);
> +		BPF_SEQ_PRINTF(seq, "cg_id: -1, total_vmscan_delay: -1\n");
> +		return 0;
> +	}
> +	BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
> +		       cg_id, total_stat->state);
> +	return 0;
> +}
> +

Empty line here.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs
  2022-05-20 16:08     ` Yosry Ahmed
@ 2022-05-20 16:16       ` Yonghong Song
  2022-05-20 16:20         ` Yosry Ahmed
  0 siblings, 1 reply; 58+ messages in thread
From: Yonghong Song @ 2022-05-20 16:16 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Roman Gushchin,
	Michal Hocko, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups



On 5/20/22 9:08 AM, Yosry Ahmed wrote:
> On Fri, May 20, 2022 at 8:15 AM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 5/19/22 6:21 PM, Yosry Ahmed wrote:
>>> Add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs to bpf
>>> tracing programs. bpf programs that make use of rstat can use these
>>> functions to inform rstat when they update stats for a cgroup, and when
>>> they need to flush the stats.
>>>
>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>>> ---
>>>    kernel/cgroup/rstat.c | 35 ++++++++++++++++++++++++++++++++++-
>>>    1 file changed, 34 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>>> index e7a88d2600bd..a16a851bc0a1 100644
>>> --- a/kernel/cgroup/rstat.c
>>> +++ b/kernel/cgroup/rstat.c
>>> @@ -3,6 +3,11 @@
>>>
>>>    #include <linux/sched/cputime.h>
>>>
>>> +#include <linux/bpf.h>
>>> +#include <linux/btf.h>
>>> +#include <linux/btf_ids.h>
>>> +
>>> +
>>>    static DEFINE_SPINLOCK(cgroup_rstat_lock);
>>>    static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
>>>
>>> @@ -141,7 +146,12 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
>>>        return pos;
>>>    }
>>>
>>> -/* A hook for bpf stat collectors to attach to and flush their stats */
>>> +/*
>>> + * A hook for bpf stat collectors to attach to and flush their stats.
>>> + * Together with providing bpf kfuncs for cgroup_rstat_updated() and
>>> + * cgroup_rstat_flush(), this enables a complete workflow where bpf progs that
>>> + * collect cgroup stats can integrate with rstat for efficient flushing.
>>> + */
>>>    __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
>>>                                     struct cgroup *parent, int cpu)
>>>    {
>>> @@ -476,3 +486,26 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
>>>                   "system_usec %llu\n",
>>>                   usage, utime, stime);
>>>    }
>>> +
>>> +/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
>>> +BTF_SET_START(bpf_rstat_check_kfunc_ids)
>>> +BTF_ID(func, cgroup_rstat_updated)
>>> +BTF_ID(func, cgroup_rstat_flush)
>>> +BTF_SET_END(bpf_rstat_check_kfunc_ids)
>>> +
>>> +BTF_SET_START(bpf_rstat_sleepable_kfunc_ids)
>>> +BTF_ID(func, cgroup_rstat_flush)
>>> +BTF_SET_END(bpf_rstat_sleepable_kfunc_ids)
>>> +
>>> +static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
>>> +     .owner          = THIS_MODULE,
>>> +     .check_set      = &bpf_rstat_check_kfunc_ids,
>>> +     .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
>>
>> There is a compilation error here:
>>
>> kernel/cgroup/rstat.c:503:3: error: ‘const struct btf_kfunc_id_set’ has
>> no member named ‘sleepable_set’; did you mean ‘release_set’?
>>       503 |  .sleepable_set = &bpf_rstat_sleepable_kfunc_ids,
>>           |   ^~~~~~~~~~~~~
>>           |   release_set
>>     kernel/cgroup/rstat.c:503:19: warning: excess elements in struct
>> initializer
>>       503 |  .sleepable_set = &bpf_rstat_sleepable_kfunc_ids,
>>           |                   ^
>>     kernel/cgroup/rstat.c:503:19: note: (near initialization for
>> ‘bpf_rstat_kfunc_set’)
>>     make[3]: *** [scripts/Makefile.build:288: kernel/cgroup/rstat.o] Error 1
>>
>> Please fix.
> 
> This patch series is rebased on top of 2 patches in the mailing list:
> - bpf/btf: also allow kfunc in tracing and syscall programs
> - btf: Add a new kfunc set which allows to mark a function to be
>    sleepable
> 
> I specified this in the cover letter, do I need to do something else
> in this situation? Re-send the patches as part of my series?

At least put a link in the cover letter for the above two patches?
This way, people can easily find them to double check.

> 
> 
> 
>>
>>> +};
>>> +
>>> +static int __init bpf_rstat_kfunc_init(void)
>>> +{
>>> +     return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
>>> +                                      &bpf_rstat_kfunc_set);
>>> +}
>>> +late_initcall(bpf_rstat_kfunc_init);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection
  2022-05-20 16:09   ` Yonghong Song
@ 2022-05-20 16:18     ` Yosry Ahmed
  2022-05-24  0:01       ` Andrii Nakryiko
  0 siblings, 1 reply; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20 16:18 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Roman Gushchin,
	Michal Hocko, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Fri, May 20, 2022 at 9:09 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 5/19/22 6:21 PM, Yosry Ahmed wrote:
> > Add a selftest that tests the whole workflow for collecting,
> > aggregating, and display cgroup hierarchical stats.
> >
> > TL;DR:
> > - Whenever reclaim happens, vmscan_start and vmscan_end update
> >    per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
> >    have updates.
> > - When userspace tries to read the stats, vmscan_dump calls rstat to flush
> >    the stats.
> > - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
> >    updates, vmscan_flush aggregates cpu readings and propagates updates
> >    to parents.
> >
> > Detailed explanation:
> > - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
> >    measure the latency of cgroup reclaim. Per-cgroup ratings are stored in
> >    percpu maps for efficiency. When a cgroup reading is updated on a cpu,
> >    cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
> >    rstat updated tree on that cpu.
> >
> > - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
> >    each cgroup. Reading this file invokes the program, which calls
> >    cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
> >    cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
> >    the stats are exposed to the user.
> >
> > - An ftrace program, vmscan_flush, is also loaded and attached to
> >    bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
> >    once for each (cgroup, cpu) pair that has updates. cgroups are popped
> >    from the rstat tree in a bottom-up fashion, so calls will always be
> >    made for cgroups that have updates before their parents. The program
> >    aggregates percpu readings to a total per-cgroup reading, and also
> >    propagates them to the parent cgroup. After rstat flushing is over, all
> >    cgroups will have correct updated hierarchical readings (including all
> >    cpus and all their descendants).
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >   .../test_cgroup_hierarchical_stats.c          | 339 ++++++++++++++++++
> >   tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
> >   .../selftests/bpf/progs/cgroup_vmscan.c       | 221 ++++++++++++
> >   3 files changed, 567 insertions(+)
> >   create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
> >   create mode 100644 tools/testing/selftests/bpf/progs/cgroup_vmscan.c
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
> > new file mode 100644
> > index 000000000000..e560c1f6291f
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
> > @@ -0,0 +1,339 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Functions to manage eBPF programs attached to cgroup subsystems
> > + *
> > + * Copyright 2022 Google LLC.
> > + */
> > +#include <errno.h>
> > +#include <sys/types.h>
> > +#include <sys/mount.h>
> > +#include <sys/stat.h>
> > +#include <unistd.h>
> > +
> > +#include <bpf/libbpf.h>
> > +#include <bpf/bpf.h>
> > +#include <test_progs.h>
> > +
> > +#include "cgroup_helpers.h"
> > +#include "cgroup_vmscan.skel.h"
> > +
> > +#define PAGE_SIZE 4096
> > +#define MB(x) (x << 20)
> > +
> > +#define BPFFS_ROOT "/sys/fs/bpf/"
> > +#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
> > +
> > +#define CG_ROOT_NAME "root"
> > +#define CG_ROOT_ID 1
> > +
> > +#define CGROUP_PATH(p, n) {.name = #n, .path = #p"/"#n}
> > +
> > +static struct {
> > +     const char *name, *path;
> > +     unsigned long long id;
> > +     int fd;
> > +} cgroups[] = {
> > +     CGROUP_PATH(/, test),
> > +     CGROUP_PATH(/test, child1),
> > +     CGROUP_PATH(/test, child2),
> > +     CGROUP_PATH(/test/child1, child1_1),
> > +     CGROUP_PATH(/test/child1, child1_2),
> > +     CGROUP_PATH(/test/child2, child2_1),
> > +     CGROUP_PATH(/test/child2, child2_2),
> > +};
> > +
> > +#define N_CGROUPS ARRAY_SIZE(cgroups)
> > +#define N_NON_LEAF_CGROUPS 3
> > +
> > +bool mounted_bpffs;
> > +static int duration;
> > +
> > +static int read_from_file(const char *path, char *buf, size_t size)
> > +{
> > +     int fd, len;
> > +
> > +     fd = open(path, O_RDONLY);
> > +     if (fd < 0) {
> > +             log_err("Open %s", path);
> > +             return -errno;
> > +     }
> > +     len = read(fd, buf, size);
> > +     if (len < 0)
> > +             log_err("Read %s", path);
> > +     else
> > +             buf[len] = 0;
> > +     close(fd);
> > +     return len < 0 ? -errno : 0;
> > +}
> > +
> > +static int setup_bpffs(void)
> > +{
> > +     int err;
> > +
> > +     /* Mount bpffs */
> > +     err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
> > +     mounted_bpffs = !err;
> > +     if (CHECK(err && errno != EBUSY, "mount bpffs",
>
> Please use ASSERT_* macros instead of CHECK.
> There are similar instances below as well.

CHECK is more flexible in providing a parameterized failure message,
but I guess we ideally shouldn't see those a lot anyway. Will change
them to ASSERTs in the next version.

>
> > +           "failed to mount bpffs at %s (%s)\n", BPFFS_ROOT,
> > +           strerror(errno)))
> > +             return err;
> > +
> > +     /* Create a directory to contain stat files in bpffs */
> > +     err = mkdir(BPFFS_VMSCAN, 0755);
> > +     CHECK(err, "mkdir bpffs", "failed to mkdir %s (%s)\n",
> > +           BPFFS_VMSCAN, strerror(errno));
> > +     return err;
> > +}
> > +
> > +static void cleanup_bpffs(void)
> > +{
> > +     /* Remove created directory in bpffs */
> > +     CHECK(rmdir(BPFFS_VMSCAN), "rmdir", "failed to rmdir %s (%s)\n",
> > +           BPFFS_VMSCAN, strerror(errno));
> > +
> > +     /* Unmount bpffs, if it wasn't already mounted when we started */
> > +     if (mounted_bpffs)
> > +             return;
> > +     CHECK(umount(BPFFS_ROOT), "umount", "failed to unmount bpffs (%s)\n",
> > +           strerror(errno));
> > +}
> > +
> > +static int setup_cgroups(void)
> > +{
> > +     int i, err;
> > +
> > +     err = setup_cgroup_environment();
> > +     if (CHECK(err, "setup_cgroup_environment", "failed: %d\n", err))
> > +             return err;
> > +
> > +     for (i = 0; i < N_CGROUPS; i++) {
> > +             int fd;
>
> You can put this to the top declaration 'int i, err'.

Will do in the next version. I thought declaring variables in the
innermost block that uses them is preferable.

>
> > +
> > +             fd = create_and_get_cgroup(cgroups[i].path);
> > +             if (!ASSERT_GE(fd, 0, "create_and_get_cgroup"))
> > +                     return fd;
> > +
> > +             cgroups[i].fd = fd;
> > +             cgroups[i].id = get_cgroup_id(cgroups[i].path);
> > +             if (i < N_NON_LEAF_CGROUPS) {
> > +                     err = enable_controllers(cgroups[i].path, "memory");
> > +                     if (!ASSERT_OK(err, "enable_controllers"))
> > +                             return err;
> > +             }
> > +     }
> > +     return 0;
> > +}
> > +
> > +static void cleanup_cgroups(void)
> > +{
> > +     for (int i = 0; i < N_CGROUPS; i++)
> > +             close(cgroups[i].fd);
> > +     cleanup_cgroup_environment();
> > +}
> > +
> > +
> > +static int setup_hierarchy(void)
> > +{
> > +     return setup_bpffs() || setup_cgroups();
> > +}
> > +
> > +static void destroy_hierarchy(void)
> > +{
> > +     cleanup_cgroups();
> > +     cleanup_bpffs();
> > +}
> > +
> [...]
> > +
> > +SEC("iter.s/cgroup")
> > +int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp)
> > +{
> > +     struct seq_file *seq = meta->seq;
> > +     struct vmscan *total_stat;
> > +     __u64 cg_id = cgroup_id(cgrp);
> > +
> > +     /* Flush the stats to make sure we get the most updated numbers */
> > +     cgroup_rstat_flush(cgrp);
> > +
> > +     total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
> > +     if (!total_stat) {
> > +             bpf_printk("error finding stats for cgroup %llu\n", cg_id);
> > +             BPF_SEQ_PRINTF(seq, "cg_id: -1, total_vmscan_delay: -1\n");
> > +             return 0;
> > +     }
> > +     BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
> > +                    cg_id, total_stat->state);
> > +     return 0;
> > +}
> > +
>
> Empty line here.

Will remove this in the next version.
Thanks for taking a look at this!

>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs
  2022-05-20 16:16       ` Yonghong Song
@ 2022-05-20 16:20         ` Yosry Ahmed
  0 siblings, 0 replies; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20 16:20 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Roman Gushchin,
	Michal Hocko, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Fri, May 20, 2022 at 9:16 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 5/20/22 9:08 AM, Yosry Ahmed wrote:
> > On Fri, May 20, 2022 at 8:15 AM Yonghong Song <yhs@fb.com> wrote:
> >>
> >>
> >>
> >> On 5/19/22 6:21 PM, Yosry Ahmed wrote:
> >>> Add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs to bpf
> >>> tracing programs. bpf programs that make use of rstat can use these
> >>> functions to inform rstat when they update stats for a cgroup, and when
> >>> they need to flush the stats.
> >>>
> >>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >>> ---
> >>>    kernel/cgroup/rstat.c | 35 ++++++++++++++++++++++++++++++++++-
> >>>    1 file changed, 34 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> >>> index e7a88d2600bd..a16a851bc0a1 100644
> >>> --- a/kernel/cgroup/rstat.c
> >>> +++ b/kernel/cgroup/rstat.c
> >>> @@ -3,6 +3,11 @@
> >>>
> >>>    #include <linux/sched/cputime.h>
> >>>
> >>> +#include <linux/bpf.h>
> >>> +#include <linux/btf.h>
> >>> +#include <linux/btf_ids.h>
> >>> +
> >>> +
> >>>    static DEFINE_SPINLOCK(cgroup_rstat_lock);
> >>>    static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
> >>>
> >>> @@ -141,7 +146,12 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
> >>>        return pos;
> >>>    }
> >>>
> >>> -/* A hook for bpf stat collectors to attach to and flush their stats */
> >>> +/*
> >>> + * A hook for bpf stat collectors to attach to and flush their stats.
> >>> + * Together with providing bpf kfuncs for cgroup_rstat_updated() and
> >>> + * cgroup_rstat_flush(), this enables a complete workflow where bpf progs that
> >>> + * collect cgroup stats can integrate with rstat for efficient flushing.
> >>> + */
> >>>    __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
> >>>                                     struct cgroup *parent, int cpu)
> >>>    {
> >>> @@ -476,3 +486,26 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
> >>>                   "system_usec %llu\n",
> >>>                   usage, utime, stime);
> >>>    }
> >>> +
> >>> +/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
> >>> +BTF_SET_START(bpf_rstat_check_kfunc_ids)
> >>> +BTF_ID(func, cgroup_rstat_updated)
> >>> +BTF_ID(func, cgroup_rstat_flush)
> >>> +BTF_SET_END(bpf_rstat_check_kfunc_ids)
> >>> +
> >>> +BTF_SET_START(bpf_rstat_sleepable_kfunc_ids)
> >>> +BTF_ID(func, cgroup_rstat_flush)
> >>> +BTF_SET_END(bpf_rstat_sleepable_kfunc_ids)
> >>> +
> >>> +static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
> >>> +     .owner          = THIS_MODULE,
> >>> +     .check_set      = &bpf_rstat_check_kfunc_ids,
> >>> +     .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
> >>
> >> There is a compilation error here:
> >>
> >> kernel/cgroup/rstat.c:503:3: error: ‘const struct btf_kfunc_id_set’ has
> >> no member named ‘sleepable_set’; did you mean ‘release_set’?
> >>       503 |  .sleepable_set = &bpf_rstat_sleepable_kfunc_ids,
> >>           |   ^~~~~~~~~~~~~
> >>           |   release_set
> >>     kernel/cgroup/rstat.c:503:19: warning: excess elements in struct
> >> initializer
> >>       503 |  .sleepable_set = &bpf_rstat_sleepable_kfunc_ids,
> >>           |                   ^
> >>     kernel/cgroup/rstat.c:503:19: note: (near initialization for
> >> ‘bpf_rstat_kfunc_set’)
> >>     make[3]: *** [scripts/Makefile.build:288: kernel/cgroup/rstat.o] Error 1
> >>
> >> Please fix.
> >
> > This patch series is rebased on top of 2 patches in the mailing list:
> > - bpf/btf: also allow kfunc in tracing and syscall programs
> > - btf: Add a new kfunc set which allows to mark a function to be
> >    sleepable
> >
> > I specified this in the cover letter, do I need to do something else
> > in this situation? Re-send the patches as part of my series?
>
> At least put a link in the cover letter for the above two patches?
> This way, people can easily find them to double check.

Right. Will do this in the next version. Sorry for the inconvenience.

>
> >
> >
> >
> >>
> >>> +};
> >>> +
> >>> +static int __init bpf_rstat_kfunc_init(void)
> >>> +{
> >>> +     return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
> >>> +                                      &bpf_rstat_kfunc_set);
> >>> +}
> >>> +late_initcall(bpf_rstat_kfunc_init);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20  8:11       ` Tejun Heo
  2022-05-20 11:27         ` Tejun Heo
@ 2022-05-20 16:29         ` Yonghong Song
  2022-05-20 16:45           ` Tejun Heo
  2022-05-20 17:30         ` Hao Luo
  2 siblings, 1 reply; 58+ messages in thread
From: Yonghong Song @ 2022-05-20 16:29 UTC (permalink / raw)
  To: Tejun Heo, Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Zefan Li, Johannes Weiner, Shuah Khan, Roman Gushchin,
	Michal Hocko, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups



On 5/20/22 1:11 AM, Tejun Heo wrote:
> Hello,
> 
> On Fri, May 20, 2022 at 12:58:52AM -0700, Yosry Ahmed wrote:
>> On Fri, May 20, 2022 at 12:41 AM Tejun Heo <tj@kernel.org> wrote:
>>>
>>> On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote:
>>>> From: Hao Luo <haoluo@google.com>
>>>>
>>>> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
>>>> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
>>>> be parameterized by a cgroup id and prints only that cgroup. So one
>>>> needs to specify a target cgroup id when attaching this iter. The target
>>>> cgroup's state can be read out via a link of this iter.
>>>>
>>>> Signed-off-by: Hao Luo <haoluo@google.com>
>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>>>
>>> This could be me not understanding why it's structured this way but it keeps
>>> bothering me that this is adding a cgroup iterator which doesn't iterate
>>> cgroups. If all that's needed is extracting information from a specific
>>> cgroup, why does this need to be an iterator? e.g. why can't I use
>>> BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes
>>> rstat, retrieves whatever information necessary and returns that as the
>>> result?
>>
>> I will let Hao and Yonghong reply here as they have a lot more
>> context, and they had previous discussions about cgroup_iter. I just
>> want to say that exposing the stats in a file is extremely convenient
>> for userspace apps. It becomes very similar to reading stats from
>> cgroupfs. It also makes migrating cgroup stats that we have
>> implemented in the kernel to BPF a lot easier.
> 
> So, if it were upto me, I'd rather direct energy towards making retrieving
> information through TEST_RUN_PROG easier rather than clinging to making
> kernel output text. I get that text interface is familiar but it kinda
> sucks in many ways.
> 
>> AFAIK there are also discussions about using overlayfs to have links
>> to the bpffs files in cgroupfs, which makes it even better. So I would
>> really prefer keeping the approach we have here of reading stats
>> through a file from userspace. As for how we go about this (and why a
>> cgroup iterator doesn't iterate cgroups) I will leave this for Hao and
>> Yonghong to explain the rationale behind it. Ideally we can keep the
>> same functionality under a more descriptive name/type.
> 
> My answer would be the same here. You guys seem dead set on making the
> kernel emulate cgroup1. I'm not gonna explicitly block that but would
> strongly suggest having a longer term view.
> 
> If you *must* do the iterator, can you at least make it a proper iterator
> which supports seeking? AFAICS there's nothing fundamentally preventing bpf
> iterators from supporting seeking. Or is it that you need something which is
> pinned to a cgroup so that you can emulate the directory structure?

The current bpf_iter for cgroup is for the google use case
per previous discussion. But I think a generic cgroup bpf iterator
should help as well.

Maybe you can have a bpf program signature like below:

int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup 
*cgrp, struct cgroup *parent_cgrp)

parent_cgrp is NULL when cgrp is the root cgroup.

I would like the bpf program should send the following information to
user space:
    <parent cgroup dir name> <current cgroup dir name>
    <various stats interested by the user>

This way, user space can easily construct the cgroup hierarchy stat like
                            cpu   mem   cpu pressure   mem pressure ...
    cgroup1                 ...
       child1               ...
         grandchild1        ...
       child2               ...
    cgroup 2                ...
       child 3              ...
         ...                ...

the bpf iterator can have additional parameter like
cgroup_id = ... to only call bpf program once with that
cgroup_id if specified.

The kernel part of cgroup_iter can call cgroup_rstat_flush()
before calling cgroup_iter bpf program.

WDYT?

> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20 16:29         ` Yonghong Song
@ 2022-05-20 16:45           ` Tejun Heo
  2022-05-20 19:42             ` Hao Luo
  2022-05-21  0:52             ` Yonghong Song
  0 siblings, 2 replies; 58+ messages in thread
From: Tejun Heo @ 2022-05-20 16:45 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups

Hello, Yonghong.

On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
> Maybe you can have a bpf program signature like below:
> 
> int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp,
> struct cgroup *parent_cgrp)
> 
> parent_cgrp is NULL when cgrp is the root cgroup.
> 
> I would like the bpf program should send the following information to
> user space:
>    <parent cgroup dir name> <current cgroup dir name>

I don't think parent cgroup dir name would be sufficient to reconstruct the
path given that multiple cgroups in different subtrees can have the same
name. For live cgroups, userspace can find the path from id (or ino) without
traversing anything by constructing the fhandle, open it open_by_handle_at()
and then reading /proc/self/fd/$FD symlink -
https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups
but I'm not sure how much that'd matter given that they aren't visible from
userspace anyway.

>    <various stats interested by the user>
> 
> This way, user space can easily construct the cgroup hierarchy stat like
>                            cpu   mem   cpu pressure   mem pressure ...
>    cgroup1                 ...
>       child1               ...
>         grandchild1        ...
>       child2               ...
>    cgroup 2                ...
>       child 3              ...
>         ...                ...
> 
> the bpf iterator can have additional parameter like
> cgroup_id = ... to only call bpf program once with that
> cgroup_id if specified.
> 
> The kernel part of cgroup_iter can call cgroup_rstat_flush()
> before calling cgroup_iter bpf program.
> 
> WDYT?

Would it work to just pass in @cgrp and provide a group of helpers so that
the program can do whatever it wanna do including looking up the full path
and passing that to userspace?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20  8:11       ` Tejun Heo
  2022-05-20 11:27         ` Tejun Heo
  2022-05-20 16:29         ` Yonghong Song
@ 2022-05-20 17:30         ` Hao Luo
  2 siblings, 0 replies; 58+ messages in thread
From: Hao Luo @ 2022-05-20 17:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups

Hi Tejun,

On Fri, May 20, 2022 at 1:11 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, May 20, 2022 at 12:58:52AM -0700, Yosry Ahmed wrote:
> > On Fri, May 20, 2022 at 12:41 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote:
> > > > From: Hao Luo <haoluo@google.com>
> > > >
> > > > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> > > > iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> > > > be parameterized by a cgroup id and prints only that cgroup. So one
> > > > needs to specify a target cgroup id when attaching this iter. The target
> > > > cgroup's state can be read out via a link of this iter.
> > > >
> > > > Signed-off-by: Hao Luo <haoluo@google.com>
> > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > >
> > > This could be me not understanding why it's structured this way but it keeps
> > > bothering me that this is adding a cgroup iterator which doesn't iterate
> > > cgroups. If all that's needed is extracting information from a specific
> > > cgroup, why does this need to be an iterator? e.g. why can't I use
> > > BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes
> > > rstat, retrieves whatever information necessary and returns that as the
> > > result?
> >
> > I will let Hao and Yonghong reply here as they have a lot more
> > context, and they had previous discussions about cgroup_iter. I just
> > want to say that exposing the stats in a file is extremely convenient
> > for userspace apps. It becomes very similar to reading stats from
> > cgroupfs. It also makes migrating cgroup stats that we have
> > implemented in the kernel to BPF a lot easier.
>
> So, if it were upto me, I'd rather direct energy towards making retrieving
> information through TEST_RUN_PROG easier rather than clinging to making
> kernel output text. I get that text interface is familiar but it kinda
> sucks in many ways.
>

Tejun, could you explain more about the downside of text interfaces
and why TEST_RUN_PROG would address the problems in text output? From
the discussion we had last time, I understand that your concern was
the unstable interface if we introduce bpf files in cgroupfs, so we
are moving toward replicating the directory structure in bpffs. But I
am not sure about the issue of text format output

> > AFAIK there are also discussions about using overlayfs to have links
> > to the bpffs files in cgroupfs, which makes it even better. So I would
> > really prefer keeping the approach we have here of reading stats
> > through a file from userspace. As for how we go about this (and why a
> > cgroup iterator doesn't iterate cgroups) I will leave this for Hao and
> > Yonghong to explain the rationale behind it. Ideally we can keep the
> > same functionality under a more descriptive name/type.
>
> My answer would be the same here. You guys seem dead set on making the
> kernel emulate cgroup1. I'm not gonna explicitly block that but would
> strongly suggest having a longer term view.
>

The reason why Yosry and I are still pushing toward this direction is
that our user space applications rely heavily on extracting
information from text output for cgroups. Please understand that
migrating them from the traditional model to a new model is a bigger
pain. But I agree that if we have a better, concrete solution (for
example, maybe TEST_RUN_PROG) to convince them and help them migrate,
I really would love to contribute and work on it.

> If you *must* do the iterator, can you at least make it a proper iterator
> which supports seeking? AFAICS there's nothing fundamentally preventing bpf
> iterators from supporting seeking. Or is it that you need something which is
> pinned to a cgroup so that you can emulate the directory structure?
>

Yonghong may comment on adding seek for bpf_iter. I would love to
contribute if we are in need of that. Right now, we don't have a use
case that needs seek for bpf_iter, I think. My thought: for cgroups,
we can seek using cgroup id. Maybe, not all kernel objects are
indexable, so seeking doesn't apply there?

Hao

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20 16:45           ` Tejun Heo
@ 2022-05-20 19:42             ` Hao Luo
  2022-05-20 21:18               ` Yosry Ahmed
  2022-05-20 21:49               ` Hao Luo
  2022-05-21  0:52             ` Yonghong Song
  1 sibling, 2 replies; 58+ messages in thread
From: Hao Luo @ 2022-05-20 19:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yonghong Song, Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Zefan Li, Johannes Weiner, Shuah Khan, Roman Gushchin,
	Michal Hocko, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

Hi Tejun and Yonghong,

On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote:
> On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
> > Maybe you can have a bpf program signature like below:
> >
> > int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp,
> > struct cgroup *parent_cgrp)
> >
> > parent_cgrp is NULL when cgrp is the root cgroup.
> >
> > I would like the bpf program should send the following information to
> > user space:
> >    <parent cgroup dir name> <current cgroup dir name>
>
> I don't think parent cgroup dir name would be sufficient to reconstruct the
> path given that multiple cgroups in different subtrees can have the same
> name. For live cgroups, userspace can find the path from id (or ino) without
> traversing anything by constructing the fhandle, open it open_by_handle_at()
> and then reading /proc/self/fd/$FD symlink -
> https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups
> but I'm not sure how much that'd matter given that they aren't visible from
> userspace anyway.
>

Sending cgroup id is better than cgroup dir name, also because IIUC
the path obtained from cgroup id depends on the namespace of the
userspace process. So if the dump file may be potentially read by
processes within a container, it's better to have the output
namespaced IMO.

> >    <various stats interested by the user>
> >
> > This way, user space can easily construct the cgroup hierarchy stat like
> >                            cpu   mem   cpu pressure   mem pressure ...
> >    cgroup1                 ...
> >       child1               ...
> >         grandchild1        ...
> >       child2               ...
> >    cgroup 2                ...
> >       child 3              ...
> >         ...                ...
> >
> > the bpf iterator can have additional parameter like
> > cgroup_id = ... to only call bpf program once with that
> > cgroup_id if specified.

Yep, this should work. We just need to make the cgroup_id parameter
optional. If it is specified when creating bpf_iter_link, we print for
that cgroup only. If it is not specified, we iterate over all cgroups.
If I understand correctly, sounds doable.

> > The kernel part of cgroup_iter can call cgroup_rstat_flush()
> > before calling cgroup_iter bpf program.

Sounds good to me as well. But my knowledge on rstat_flush is limited.
Yosry can give this a try.

>
> Would it work to just pass in @cgrp and provide a group of helpers so that
> the program can do whatever it wanna do including looking up the full path
> and passing that to userspace?
>

My understanding is, yes, doable. If we need the full path information
of a cgroup, helpers or kfuncs are needed.

The userspace needs to specify the identity of the cgroup, when
creating bpf_iter. This identity could be cgroup id or fd. This
identity needs to be converted to cgroup object somewhere before
passing into bpf program to use.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20 19:42             ` Hao Luo
@ 2022-05-20 21:18               ` Yosry Ahmed
  2022-05-20 22:19                 ` Alexei Starovoitov
  2022-05-20 21:49               ` Hao Luo
  1 sibling, 1 reply; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20 21:18 UTC (permalink / raw)
  To: Hao Luo
  Cc: Tejun Heo, Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Zefan Li, Johannes Weiner, Shuah Khan, Roman Gushchin,
	Michal Hocko, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Fri, May 20, 2022 at 12:43 PM Hao Luo <haoluo@google.com> wrote:
>
> Hi Tejun and Yonghong,
>
> On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote:
> > On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
> > > Maybe you can have a bpf program signature like below:
> > >
> > > int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp,
> > > struct cgroup *parent_cgrp)
> > >
> > > parent_cgrp is NULL when cgrp is the root cgroup.
> > >
> > > I would like the bpf program should send the following information to
> > > user space:
> > >    <parent cgroup dir name> <current cgroup dir name>
> >
> > I don't think parent cgroup dir name would be sufficient to reconstruct the
> > path given that multiple cgroups in different subtrees can have the same
> > name. For live cgroups, userspace can find the path from id (or ino) without
> > traversing anything by constructing the fhandle, open it open_by_handle_at()
> > and then reading /proc/self/fd/$FD symlink -
> > https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups
> > but I'm not sure how much that'd matter given that they aren't visible from
> > userspace anyway.
> >
>
> Sending cgroup id is better than cgroup dir name, also because IIUC
> the path obtained from cgroup id depends on the namespace of the
> userspace process. So if the dump file may be potentially read by
> processes within a container, it's better to have the output
> namespaced IMO.
>
> > >    <various stats interested by the user>
> > >
> > > This way, user space can easily construct the cgroup hierarchy stat like
> > >                            cpu   mem   cpu pressure   mem pressure ...
> > >    cgroup1                 ...
> > >       child1               ...
> > >         grandchild1        ...
> > >       child2               ...
> > >    cgroup 2                ...
> > >       child 3              ...
> > >         ...                ...
> > >
> > > the bpf iterator can have additional parameter like
> > > cgroup_id = ... to only call bpf program once with that
> > > cgroup_id if specified.
>
> Yep, this should work. We just need to make the cgroup_id parameter
> optional. If it is specified when creating bpf_iter_link, we print for
> that cgroup only. If it is not specified, we iterate over all cgroups.
> If I understand correctly, sounds doable.
>
> > > The kernel part of cgroup_iter can call cgroup_rstat_flush()
> > > before calling cgroup_iter bpf program.
>
> Sounds good to me as well. But my knowledge on rstat_flush is limited.
> Yosry can give this a try.
>
> >
> > Would it work to just pass in @cgrp and provide a group of helpers so that
> > the program can do whatever it wanna do including looking up the full path
> > and passing that to userspace?
> >
>
> My understanding is, yes, doable. If we need the full path information
> of a cgroup, helpers or kfuncs are needed.
>
> The userspace needs to specify the identity of the cgroup, when
> creating bpf_iter. This identity could be cgroup id or fd. This
> identity needs to be converted to cgroup object somewhere before
> passing into bpf program to use.


Let's sum up the discussion here, I feel like we are losing track of
the main problem. IIUC the main concern is that cgroup_iter is not
effectively an iterator, it rather dumps information for one cgroup. I
like the suggestion to make it iterate cgroups by default, and an
optional cgroup_id parameter to make it only "iterate" this one
cgroup. IIUC, this cgroup_id parameter would be a link parameter,
similar to the current approach. Basically, we extend the current
patch so that if cgroup_id is not specified the iterator gets called
for all cgroups instead of one. This fixes the problem for our use
case and also keeps cgroup_iter generic enough. Is my understanding
correct? If yes, I don't see a need to flush rstat in the kernel on
behalf of cgroup_iter progs.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20 19:42             ` Hao Luo
  2022-05-20 21:18               ` Yosry Ahmed
@ 2022-05-20 21:49               ` Hao Luo
  2022-05-21  0:58                 ` Yonghong Song
  1 sibling, 1 reply; 58+ messages in thread
From: Hao Luo @ 2022-05-20 21:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yonghong Song, Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Zefan Li, Johannes Weiner, Shuah Khan, Roman Gushchin,
	Michal Hocko, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

Hi Tejun and Yonghong,

On Fri, May 20, 2022 at 12:42 PM Hao Luo <haoluo@google.com> wrote:
>
> Hi Tejun and Yonghong,
>
> On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote:
> > On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
> > >    <various stats interested by the user>
> > >
> > > This way, user space can easily construct the cgroup hierarchy stat like
> > >                            cpu   mem   cpu pressure   mem pressure ...
> > >    cgroup1                 ...
> > >       child1               ...
> > >         grandchild1        ...
> > >       child2               ...
> > >    cgroup 2                ...
> > >       child 3              ...
> > >         ...                ...
> > >
> > > the bpf iterator can have additional parameter like
> > > cgroup_id = ... to only call bpf program once with that
> > > cgroup_id if specified.
>
> Yep, this should work. We just need to make the cgroup_id parameter
> optional. If it is specified when creating bpf_iter_link, we print for
> that cgroup only. If it is not specified, we iterate over all cgroups.
> If I understand correctly, sounds doable.
>

Yonghong, I realized that seek() which Tejun has been calling out, can
be used to specify the target cgroup, rather than adding a new
parameter. Maybe, we can pass cgroup_id to seek() on cgroup bpf_iter,
which will instruct read() to return the corresponding cgroup's stats.
On the other hand, reading without calling seek() beforehand will
return all the cgroups.

WDYT?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20 21:18               ` Yosry Ahmed
@ 2022-05-20 22:19                 ` Alexei Starovoitov
  2022-05-20 22:36                   ` Yosry Ahmed
  2022-05-20 22:57                   ` Tejun Heo
  0 siblings, 2 replies; 58+ messages in thread
From: Alexei Starovoitov @ 2022-05-20 22:19 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Hao Luo, Tejun Heo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	John Fastabend, KP Singh, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups

On Fri, May 20, 2022 at 02:18:42PM -0700, Yosry Ahmed wrote:
> >
> > The userspace needs to specify the identity of the cgroup, when
> > creating bpf_iter. This identity could be cgroup id or fd. This
> > identity needs to be converted to cgroup object somewhere before
> > passing into bpf program to use.
> 
> 
> Let's sum up the discussion here, I feel like we are losing track of
> the main problem. IIUC the main concern is that cgroup_iter is not
> effectively an iterator, it rather dumps information for one cgroup. I
> like the suggestion to make it iterate cgroups by default, and an
> optional cgroup_id parameter to make it only "iterate" this one
> cgroup.

We have bpf_map iterator that walks all bpf maps.
When map iterator is parametrized with map_fd the iterator walks
all elements of that map.
cgroup iterator should have similar semantics.
When non-parameterized it will walk all cgroups and their descendent
depth first way. I believe that's what Yonghong is proposing.
When parametrized it will start from that particular cgroup and
walk all descendant of that cgroup only.
The bpf prog can stop the iteration right away with ret 1.
Maybe we can add two parameters. One -> cgroup_fd to use and another ->
the order of iteration css_for_each_descendant_pre vs _post.
wdyt?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20 22:19                 ` Alexei Starovoitov
@ 2022-05-20 22:36                   ` Yosry Ahmed
  2022-05-20 22:57                   ` Tejun Heo
  1 sibling, 0 replies; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-20 22:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Hao Luo, Tejun Heo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	John Fastabend, KP Singh, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups

On Fri, May 20, 2022 at 3:19 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, May 20, 2022 at 02:18:42PM -0700, Yosry Ahmed wrote:
> > >
> > > The userspace needs to specify the identity of the cgroup, when
> > > creating bpf_iter. This identity could be cgroup id or fd. This
> > > identity needs to be converted to cgroup object somewhere before
> > > passing into bpf program to use.
> >
> >
> > Let's sum up the discussion here, I feel like we are losing track of
> > the main problem. IIUC the main concern is that cgroup_iter is not
> > effectively an iterator, it rather dumps information for one cgroup. I
> > like the suggestion to make it iterate cgroups by default, and an
> > optional cgroup_id parameter to make it only "iterate" this one
> > cgroup.
>
> We have bpf_map iterator that walks all bpf maps.
> When map iterator is parametrized with map_fd the iterator walks
> all elements of that map.
> cgroup iterator should have similar semantics.
> When non-parameterized it will walk all cgroups and their descendent
> depth first way. I believe that's what Yonghong is proposing.
> When parametrized it will start from that particular cgroup and
> walk all descendant of that cgroup only.
> The bpf prog can stop the iteration right away with ret 1.
> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
> the order of iteration css_for_each_descendant_pre vs _post.
> wdyt?

So basically extend the current patch so that cgroup_id (or cgroup_fd)
is optional, and it specifies where the iteration starts. If not
provided, then we start at root. For our use case where we want the
iterator to only be invoked for one cgroup we make it return 1 to stop
after the first iteration.

I assume an order parameter is also needed to specify "pre" for our
use case to make sure we are starting iteration at the top cgroup (the
one whose cgroup_id is the parameter of the iterator).

Is my understanding correct? If yes, then this sounds very good. It is
generic enough, actually iterates cgroups, and works for our use case.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20 22:19                 ` Alexei Starovoitov
  2022-05-20 22:36                   ` Yosry Ahmed
@ 2022-05-20 22:57                   ` Tejun Heo
  2022-05-21  0:59                     ` Yonghong Song
  1 sibling, 1 reply; 58+ messages in thread
From: Tejun Heo @ 2022-05-20 22:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yosry Ahmed, Hao Luo, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	John Fastabend, KP Singh, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups

Hello,

On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote:
> We have bpf_map iterator that walks all bpf maps.
> When map iterator is parametrized with map_fd the iterator walks
> all elements of that map.
> cgroup iterator should have similar semantics.
> When non-parameterized it will walk all cgroups and their descendent
> depth first way. I believe that's what Yonghong is proposing.
> When parametrized it will start from that particular cgroup and
> walk all descendant of that cgroup only.
> The bpf prog can stop the iteration right away with ret 1.
> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
> the order of iteration css_for_each_descendant_pre vs _post.
> wdyt?

Sounds perfectly reasonable to me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20 16:45           ` Tejun Heo
  2022-05-20 19:42             ` Hao Luo
@ 2022-05-21  0:52             ` Yonghong Song
  1 sibling, 0 replies; 58+ messages in thread
From: Yonghong Song @ 2022-05-21  0:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups



On 5/20/22 9:45 AM, Tejun Heo wrote:
> Hello, Yonghong.
> 
> On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
>> Maybe you can have a bpf program signature like below:
>>
>> int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp,
>> struct cgroup *parent_cgrp)
>>
>> parent_cgrp is NULL when cgrp is the root cgroup.
>>
>> I would like the bpf program should send the following information to
>> user space:
>>     <parent cgroup dir name> <current cgroup dir name>
> 
> I don't think parent cgroup dir name would be sufficient to reconstruct the
> path given that multiple cgroups in different subtrees can have the same
> name. For live cgroups, userspace can find the path from id (or ino) without
> traversing anything by constructing the fhandle, open it open_by_handle_at()
> and then reading /proc/self/fd/$FD symlink -
> https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups
> but I'm not sure how much that'd matter given that they aren't visible from
> userspace anyway.

passing id/ino to user space and then get directory name in userspace
should work just fine.

> 
>>     <various stats interested by the user>
>>
>> This way, user space can easily construct the cgroup hierarchy stat like
>>                             cpu   mem   cpu pressure   mem pressure ...
>>     cgroup1                 ...
>>        child1               ...
>>          grandchild1        ...
>>        child2               ...
>>     cgroup 2                ...
>>        child 3              ...
>>          ...                ...
>>
>> the bpf iterator can have additional parameter like
>> cgroup_id = ... to only call bpf program once with that
>> cgroup_id if specified.
>>
>> The kernel part of cgroup_iter can call cgroup_rstat_flush()
>> before calling cgroup_iter bpf program.
>>
>> WDYT?
> 
> Would it work to just pass in @cgrp and provide a group of helpers so that
> the program can do whatever it wanna do including looking up the full path
> and passing that to userspace?

I am not super familiar with cgroup internals, I guess with cgroup + 
helpers to retrieve stats, or directly expose stats data structure
to bpf program. Either one is okay to me as long as we can get
desired results.

> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20 21:49               ` Hao Luo
@ 2022-05-21  0:58                 ` Yonghong Song
  2022-05-21  2:43                   ` Hao Luo
  0 siblings, 1 reply; 58+ messages in thread
From: Yonghong Song @ 2022-05-21  0:58 UTC (permalink / raw)
  To: Hao Luo, Tejun Heo
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Zefan Li, Johannes Weiner, Shuah Khan, Roman Gushchin,
	Michal Hocko, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups



On 5/20/22 2:49 PM, Hao Luo wrote:
> Hi Tejun and Yonghong,
> 
> On Fri, May 20, 2022 at 12:42 PM Hao Luo <haoluo@google.com> wrote:
>>
>> Hi Tejun and Yonghong,
>>
>> On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote:
>>> On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
>>>>     <various stats interested by the user>
>>>>
>>>> This way, user space can easily construct the cgroup hierarchy stat like
>>>>                             cpu   mem   cpu pressure   mem pressure ...
>>>>     cgroup1                 ...
>>>>        child1               ...
>>>>          grandchild1        ...
>>>>        child2               ...
>>>>     cgroup 2                ...
>>>>        child 3              ...
>>>>          ...                ...
>>>>
>>>> the bpf iterator can have additional parameter like
>>>> cgroup_id = ... to only call bpf program once with that
>>>> cgroup_id if specified.
>>
>> Yep, this should work. We just need to make the cgroup_id parameter
>> optional. If it is specified when creating bpf_iter_link, we print for
>> that cgroup only. If it is not specified, we iterate over all cgroups.
>> If I understand correctly, sounds doable.
>>
> 
> Yonghong, I realized that seek() which Tejun has been calling out, can
> be used to specify the target cgroup, rather than adding a new
> parameter. Maybe, we can pass cgroup_id to seek() on cgroup bpf_iter,
> which will instruct read() to return the corresponding cgroup's stats.
> On the other hand, reading without calling seek() beforehand will
> return all the cgroups.

Currently, seek is not supported for bpf_iter.

const struct file_operations bpf_iter_fops = {
         .open           = iter_open,
         .llseek         = no_llseek,
         .read           = bpf_seq_read,
         .release        = iter_release,
};

But if seek() works, I don't mind to remove this restriction.
But not sure what to seek. Do you mean to provide a cgroup_fd/cgroup_id
as the seek() syscall parameter? This may work.

But considering we have parameterized example (map_fd) and
in the future, we may have other parameterized bpf_iter
(e.g., for one task). Maybe parameter-based approach is better.

> 
> WDYT?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-20 22:57                   ` Tejun Heo
@ 2022-05-21  0:59                     ` Yonghong Song
  2022-05-21  2:34                       ` Hao Luo
  0 siblings, 1 reply; 58+ messages in thread
From: Yonghong Song @ 2022-05-21  0:59 UTC (permalink / raw)
  To: Tejun Heo, Alexei Starovoitov
  Cc: Yosry Ahmed, Hao Luo, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Zefan Li, Johannes Weiner, Shuah Khan, Roman Gushchin,
	Michal Hocko, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups



On 5/20/22 3:57 PM, Tejun Heo wrote:
> Hello,
> 
> On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote:
>> We have bpf_map iterator that walks all bpf maps.
>> When map iterator is parametrized with map_fd the iterator walks
>> all elements of that map.
>> cgroup iterator should have similar semantics.
>> When non-parameterized it will walk all cgroups and their descendent
>> depth first way. I believe that's what Yonghong is proposing.
>> When parametrized it will start from that particular cgroup and
>> walk all descendant of that cgroup only.
>> The bpf prog can stop the iteration right away with ret 1.
>> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
>> the order of iteration css_for_each_descendant_pre vs _post.
>> wdyt?
> 
> Sounds perfectly reasonable to me.

This works for me too. Thanks!

> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-21  0:59                     ` Yonghong Song
@ 2022-05-21  2:34                       ` Hao Luo
  2022-05-23 23:58                         ` Andrii Nakryiko
  0 siblings, 1 reply; 58+ messages in thread
From: Hao Luo @ 2022-05-21  2:34 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Tejun Heo, Alexei Starovoitov, Yosry Ahmed, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	John Fastabend, KP Singh, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups

On Fri, May 20, 2022 at 5:59 PM Yonghong Song <yhs@fb.com> wrote:
> On 5/20/22 3:57 PM, Tejun Heo wrote:
> > Hello,
> >
> > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote:
> >> We have bpf_map iterator that walks all bpf maps.
> >> When map iterator is parametrized with map_fd the iterator walks
> >> all elements of that map.
> >> cgroup iterator should have similar semantics.
> >> When non-parameterized it will walk all cgroups and their descendent
> >> depth first way. I believe that's what Yonghong is proposing.
> >> When parametrized it will start from that particular cgroup and
> >> walk all descendant of that cgroup only.
> >> The bpf prog can stop the iteration right away with ret 1.
> >> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
> >> the order of iteration css_for_each_descendant_pre vs _post.
> >> wdyt?
> >
> > Sounds perfectly reasonable to me.
>
> This works for me too. Thanks!
>

This sounds good to me. Thanks. Let's try to do it in the next iteration.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-21  0:58                 ` Yonghong Song
@ 2022-05-21  2:43                   ` Hao Luo
  2022-05-21  4:53                     ` Tejun Heo
  0 siblings, 1 reply; 58+ messages in thread
From: Hao Luo @ 2022-05-21  2:43 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Tejun Heo, Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Zefan Li, Johannes Weiner, Shuah Khan, Roman Gushchin,
	Michal Hocko, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Fri, May 20, 2022 at 5:58 PM Yonghong Song <yhs@fb.com> wrote:
> On 5/20/22 2:49 PM, Hao Luo wrote:
> > Hi Tejun and Yonghong,
> >
> > On Fri, May 20, 2022 at 12:42 PM Hao Luo <haoluo@google.com> wrote:
> >>
> >> Hi Tejun and Yonghong,
> >>
> >> On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote:
> >>> On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
> >>>>     <various stats interested by the user>
> >>>>
> >>>> This way, user space can easily construct the cgroup hierarchy stat like
> >>>>                             cpu   mem   cpu pressure   mem pressure ...
> >>>>     cgroup1                 ...
> >>>>        child1               ...
> >>>>          grandchild1        ...
> >>>>        child2               ...
> >>>>     cgroup 2                ...
> >>>>        child 3              ...
> >>>>          ...                ...
> >>>>
> >>>> the bpf iterator can have additional parameter like
> >>>> cgroup_id = ... to only call bpf program once with that
> >>>> cgroup_id if specified.
> >>
> >> Yep, this should work. We just need to make the cgroup_id parameter
> >> optional. If it is specified when creating bpf_iter_link, we print for
> >> that cgroup only. If it is not specified, we iterate over all cgroups.
> >> If I understand correctly, sounds doable.
> >>
> >
> > Yonghong, I realized that seek() which Tejun has been calling out, can
> > be used to specify the target cgroup, rather than adding a new
> > parameter. Maybe, we can pass cgroup_id to seek() on cgroup bpf_iter,
> > which will instruct read() to return the corresponding cgroup's stats.
> > On the other hand, reading without calling seek() beforehand will
> > return all the cgroups.
>
> Currently, seek is not supported for bpf_iter.
>
> const struct file_operations bpf_iter_fops = {
>          .open           = iter_open,
>          .llseek         = no_llseek,
>          .read           = bpf_seq_read,
>          .release        = iter_release,
> };
>
> But if seek() works, I don't mind to remove this restriction.
> But not sure what to seek. Do you mean to provide a cgroup_fd/cgroup_id
> as the seek() syscall parameter? This may work.

Yes, passing a cgroup_id as the seek() syscall parameter was what I meant.

Tejun previously requested us to support seek() for a proper iterator.
Since Alexei has a nice solution that all of us have ack'ed, I am not
sure whether we still want to add seek() for bpf_iter as Tejun asked.
I guess not.

>
> But considering we have parameterized example (map_fd) and
> in the future, we may have other parameterized bpf_iter
> (e.g., for one task). Maybe parameter-based approach is better.
>

Acknowledged.

> >
> > WDYT?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-21  2:43                   ` Hao Luo
@ 2022-05-21  4:53                     ` Tejun Heo
  0 siblings, 0 replies; 58+ messages in thread
From: Tejun Heo @ 2022-05-21  4:53 UTC (permalink / raw)
  To: Hao Luo
  Cc: Yonghong Song, Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Zefan Li, Johannes Weiner, Shuah Khan, Roman Gushchin,
	Michal Hocko, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Fri, May 20, 2022 at 07:43:12PM -0700, Hao Luo wrote:
> Yes, passing a cgroup_id as the seek() syscall parameter was what I meant.
> 
> Tejun previously requested us to support seek() for a proper iterator.
> Since Alexei has a nice solution that all of us have ack'ed, I am not
> sure whether we still want to add seek() for bpf_iter as Tejun asked.
> I guess not.

Yeah, I meant seeking with the ID but it's better to follow the same
convention as other iterators.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 1/5] cgroup: bpf: add a hook for bpf progs to attach to rstat flushing
  2022-05-20  1:21 ` [PATCH bpf-next v1 1/5] cgroup: bpf: add a hook for bpf progs to attach to rstat flushing Yosry Ahmed
@ 2022-05-21 11:16   ` kernel test robot
  2022-05-21 11:26   ` kernel test robot
  2022-05-21 11:26   ` kernel test robot
  2 siblings, 0 replies; 58+ messages in thread
From: kernel test robot @ 2022-05-21 11:16 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: llvm, kbuild-all, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups,
	Yosry Ahmed

Hi Yosry,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220520-093041
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: hexagon-randconfig-r041-20220519 (https://download.01.org/0day-ci/archive/20220521/202205211949.sJimC9kh-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project e00cbbec06c08dc616a0d52a20f678b8fbd4e304)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/23c4c48fb35b084dc1173c7b9d23d4e6e1a084a3
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220520-093041
        git checkout 23c4c48fb35b084dc1173c7b9d23d4e6e1a084a3
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash kernel/cgroup/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/cgroup/rstat.c:145:22: warning: no previous prototype for function 'bpf_rstat_flush' [-Wmissing-prototypes]
   __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
                        ^
   kernel/cgroup/rstat.c:145:17: note: declare 'static' if the function is not intended to be used outside of this translation unit
   __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
                   ^
                   static 
   1 warning generated.


vim +/bpf_rstat_flush +145 kernel/cgroup/rstat.c

   143	
   144	/* A hook for bpf stat collectors to attach to and flush their stats */
 > 145	__weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
   146					     struct cgroup *parent, int cpu)
   147	{
   148	}
   149	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 1/5] cgroup: bpf: add a hook for bpf progs to attach to rstat flushing
  2022-05-20  1:21 ` [PATCH bpf-next v1 1/5] cgroup: bpf: add a hook for bpf progs to attach to rstat flushing Yosry Ahmed
  2022-05-21 11:16   ` kernel test robot
@ 2022-05-21 11:26   ` kernel test robot
  2022-05-21 11:26   ` kernel test robot
  2 siblings, 0 replies; 58+ messages in thread
From: kernel test robot @ 2022-05-21 11:26 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: kbuild-all, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Hi Yosry,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220520-093041
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: parisc-randconfig-r026-20220519 (https://download.01.org/0day-ci/archive/20220521/202205211930.7xTXJTBH-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/23c4c48fb35b084dc1173c7b9d23d4e6e1a084a3
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220520-093041
        git checkout 23c4c48fb35b084dc1173c7b9d23d4e6e1a084a3
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=parisc SHELL=/bin/bash kernel/cgroup/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/cgroup/rstat.c:145:22: warning: no previous prototype for 'bpf_rstat_flush' [-Wmissing-prototypes]
     145 | __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
         |                      ^~~~~~~~~~~~~~~


vim +/bpf_rstat_flush +145 kernel/cgroup/rstat.c

   143	
   144	/* A hook for bpf stat collectors to attach to and flush their stats */
 > 145	__weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
   146					     struct cgroup *parent, int cpu)
   147	{
   148	}
   149	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 1/5] cgroup: bpf: add a hook for bpf progs to attach to rstat flushing
  2022-05-20  1:21 ` [PATCH bpf-next v1 1/5] cgroup: bpf: add a hook for bpf progs to attach to rstat flushing Yosry Ahmed
  2022-05-21 11:16   ` kernel test robot
  2022-05-21 11:26   ` kernel test robot
@ 2022-05-21 11:26   ` kernel test robot
  2 siblings, 0 replies; 58+ messages in thread
From: kernel test robot @ 2022-05-21 11:26 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: kbuild-all, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Hi Yosry,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220520-093041
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20220521/202205211932.e4nkPC3e-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/23c4c48fb35b084dc1173c7b9d23d4e6e1a084a3
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220520-093041
        git checkout 23c4c48fb35b084dc1173c7b9d23d4e6e1a084a3
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash kernel/cgroup/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/cgroup/rstat.c:145:22: warning: no previous prototype for 'bpf_rstat_flush' [-Wmissing-prototypes]
     145 | __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
         |                      ^~~~~~~~~~~~~~~


vim +/bpf_rstat_flush +145 kernel/cgroup/rstat.c

   143	
   144	/* A hook for bpf stat collectors to attach to and flush their stats */
 > 145	__weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
   146					     struct cgroup *parent, int cpu)
   147	{
   148	}
   149	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs
  2022-05-20  1:21 ` [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs Yosry Ahmed
  2022-05-20  7:24   ` Tejun Heo
  2022-05-20 15:14   ` Yonghong Song
@ 2022-05-21 11:47   ` kernel test robot
  2 siblings, 0 replies; 58+ messages in thread
From: kernel test robot @ 2022-05-21 11:47 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: kbuild-all, Stanislav Fomichev, David Rientjes, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Hi Yosry,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220520-093041
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20220521/202205211913.wPnVDaPm-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/203797424b1159b12702cea9d9a20acc24ea92e0
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220520-093041
        git checkout 203797424b1159b12702cea9d9a20acc24ea92e0
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash kernel/cgroup/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   kernel/cgroup/rstat.c:155:22: warning: no previous prototype for 'bpf_rstat_flush' [-Wmissing-prototypes]
     155 | __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
         |                      ^~~~~~~~~~~~~~~
   kernel/cgroup/rstat.c:503:10: error: 'const struct btf_kfunc_id_set' has no member named 'sleepable_set'; did you mean 'release_set'?
     503 |         .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
         |          ^~~~~~~~~~~~~
         |          release_set
>> kernel/cgroup/rstat.c:503:27: warning: excess elements in struct initializer
     503 |         .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
         |                           ^
   kernel/cgroup/rstat.c:503:27: note: (near initialization for 'bpf_rstat_kfunc_set')


vim +503 kernel/cgroup/rstat.c

   499	
   500	static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
   501		.owner		= THIS_MODULE,
   502		.check_set	= &bpf_rstat_check_kfunc_ids,
 > 503		.sleepable_set	= &bpf_rstat_sleepable_kfunc_ids,
   504	};
   505	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-21  2:34                       ` Hao Luo
@ 2022-05-23 23:58                         ` Andrii Nakryiko
  2022-05-24  0:53                           ` Hao Luo
  0 siblings, 1 reply; 58+ messages in thread
From: Andrii Nakryiko @ 2022-05-23 23:58 UTC (permalink / raw)
  To: Hao Luo
  Cc: Yonghong Song, Tejun Heo, Alexei Starovoitov, Yosry Ahmed,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Zefan Li,
	Johannes Weiner, Shuah Khan, Roman Gushchin, Michal Hocko,
	Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Fri, May 20, 2022 at 7:35 PM Hao Luo <haoluo@google.com> wrote:
>
> On Fri, May 20, 2022 at 5:59 PM Yonghong Song <yhs@fb.com> wrote:
> > On 5/20/22 3:57 PM, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote:
> > >> We have bpf_map iterator that walks all bpf maps.
> > >> When map iterator is parametrized with map_fd the iterator walks
> > >> all elements of that map.
> > >> cgroup iterator should have similar semantics.
> > >> When non-parameterized it will walk all cgroups and their descendent
> > >> depth first way. I believe that's what Yonghong is proposing.
> > >> When parametrized it will start from that particular cgroup and
> > >> walk all descendant of that cgroup only.
> > >> The bpf prog can stop the iteration right away with ret 1.
> > >> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
> > >> the order of iteration css_for_each_descendant_pre vs _post.
> > >> wdyt?
> > >
> > > Sounds perfectly reasonable to me.
> >
> > This works for me too. Thanks!
> >
>
> This sounds good to me. Thanks. Let's try to do it in the next iteration.

Can we, in addition to descendant_pre and descendant_post walk
algorithms also add the one that does ascendants walk (i.e., start
from specified cgroup and walk up to the root cgroup)? I don't have
specific example, but it seems natural to include it for "cgroup
iterator" in general. Hopefully it won't add much code to the
implementation.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection
  2022-05-20 16:18     ` Yosry Ahmed
@ 2022-05-24  0:01       ` Andrii Nakryiko
  2022-05-24  2:35         ` Yosry Ahmed
  0 siblings, 1 reply; 58+ messages in thread
From: Andrii Nakryiko @ 2022-05-24  0:01 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Fri, May 20, 2022 at 9:19 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, May 20, 2022 at 9:09 AM Yonghong Song <yhs@fb.com> wrote:
> >
> >
> >
> > On 5/19/22 6:21 PM, Yosry Ahmed wrote:
> > > Add a selftest that tests the whole workflow for collecting,
> > > aggregating, and display cgroup hierarchical stats.
> > >
> > > TL;DR:
> > > - Whenever reclaim happens, vmscan_start and vmscan_end update
> > >    per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
> > >    have updates.
> > > - When userspace tries to read the stats, vmscan_dump calls rstat to flush
> > >    the stats.
> > > - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
> > >    updates, vmscan_flush aggregates cpu readings and propagates updates
> > >    to parents.
> > >
> > > Detailed explanation:
> > > - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
> > >    measure the latency of cgroup reclaim. Per-cgroup ratings are stored in
> > >    percpu maps for efficiency. When a cgroup reading is updated on a cpu,
> > >    cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
> > >    rstat updated tree on that cpu.
> > >
> > > - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
> > >    each cgroup. Reading this file invokes the program, which calls
> > >    cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
> > >    cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
> > >    the stats are exposed to the user.
> > >
> > > - An ftrace program, vmscan_flush, is also loaded and attached to
> > >    bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
> > >    once for each (cgroup, cpu) pair that has updates. cgroups are popped
> > >    from the rstat tree in a bottom-up fashion, so calls will always be
> > >    made for cgroups that have updates before their parents. The program
> > >    aggregates percpu readings to a total per-cgroup reading, and also
> > >    propagates them to the parent cgroup. After rstat flushing is over, all
> > >    cgroups will have correct updated hierarchical readings (including all
> > >    cpus and all their descendants).
> > >
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > > ---
> > >   .../test_cgroup_hierarchical_stats.c          | 339 ++++++++++++++++++
> > >   tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
> > >   .../selftests/bpf/progs/cgroup_vmscan.c       | 221 ++++++++++++
> > >   3 files changed, 567 insertions(+)
> > >   create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
> > >   create mode 100644 tools/testing/selftests/bpf/progs/cgroup_vmscan.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
> > > new file mode 100644
> > > index 000000000000..e560c1f6291f
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
> > > @@ -0,0 +1,339 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * Functions to manage eBPF programs attached to cgroup subsystems
> > > + *
> > > + * Copyright 2022 Google LLC.
> > > + */
> > > +#include <errno.h>
> > > +#include <sys/types.h>
> > > +#include <sys/mount.h>
> > > +#include <sys/stat.h>
> > > +#include <unistd.h>
> > > +
> > > +#include <bpf/libbpf.h>
> > > +#include <bpf/bpf.h>
> > > +#include <test_progs.h>
> > > +
> > > +#include "cgroup_helpers.h"
> > > +#include "cgroup_vmscan.skel.h"
> > > +
> > > +#define PAGE_SIZE 4096
> > > +#define MB(x) (x << 20)
> > > +
> > > +#define BPFFS_ROOT "/sys/fs/bpf/"
> > > +#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
> > > +
> > > +#define CG_ROOT_NAME "root"
> > > +#define CG_ROOT_ID 1
> > > +
> > > +#define CGROUP_PATH(p, n) {.name = #n, .path = #p"/"#n}
> > > +
> > > +static struct {
> > > +     const char *name, *path;
> > > +     unsigned long long id;
> > > +     int fd;
> > > +} cgroups[] = {
> > > +     CGROUP_PATH(/, test),
> > > +     CGROUP_PATH(/test, child1),
> > > +     CGROUP_PATH(/test, child2),
> > > +     CGROUP_PATH(/test/child1, child1_1),
> > > +     CGROUP_PATH(/test/child1, child1_2),
> > > +     CGROUP_PATH(/test/child2, child2_1),
> > > +     CGROUP_PATH(/test/child2, child2_2),
> > > +};
> > > +
> > > +#define N_CGROUPS ARRAY_SIZE(cgroups)
> > > +#define N_NON_LEAF_CGROUPS 3
> > > +
> > > +bool mounted_bpffs;
> > > +static int duration;
> > > +
> > > +static int read_from_file(const char *path, char *buf, size_t size)
> > > +{
> > > +     int fd, len;
> > > +
> > > +     fd = open(path, O_RDONLY);
> > > +     if (fd < 0) {
> > > +             log_err("Open %s", path);
> > > +             return -errno;
> > > +     }
> > > +     len = read(fd, buf, size);
> > > +     if (len < 0)
> > > +             log_err("Read %s", path);
> > > +     else
> > > +             buf[len] = 0;
> > > +     close(fd);
> > > +     return len < 0 ? -errno : 0;
> > > +}
> > > +
> > > +static int setup_bpffs(void)
> > > +{
> > > +     int err;
> > > +
> > > +     /* Mount bpffs */
> > > +     err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
> > > +     mounted_bpffs = !err;
> > > +     if (CHECK(err && errno != EBUSY, "mount bpffs",
> >
> > Please use ASSERT_* macros instead of CHECK.
> > There are similar instances below as well.
>
> CHECK is more flexible in providing a parameterized failure message,
> but I guess we ideally shouldn't see those a lot anyway. Will change
> them to ASSERTs in the next version.

The idea with ASSERT_xxx() is that you express semantically meaningful
assertion/condition/check and the macro provides helpful and
meaningful information for you. E.g., ASSERT_EQ(bla, 123, "bla_value")
will emit something along the lines: "unexpected value of 'bla_value':
345, expected 123". It provides useful info when check fails without
requiring to type all the extra format strings and parameters.

And also CHECK() has an inverted condition which is extremely
confusing. We don't use CHECK() for new code anymore.

>
> >
> > > +           "failed to mount bpffs at %s (%s)\n", BPFFS_ROOT,
> > > +           strerror(errno)))
> > > +             return err;
> > > +
> > > +     /* Create a directory to contain stat files in bpffs */
> > > +     err = mkdir(BPFFS_VMSCAN, 0755);
> > > +     CHECK(err, "mkdir bpffs", "failed to mkdir %s (%s)\n",
> > > +           BPFFS_VMSCAN, strerror(errno));
> > > +     return err;
> > > +}
> > > +

[...]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-23 23:58                         ` Andrii Nakryiko
@ 2022-05-24  0:53                           ` Hao Luo
  2022-05-24  1:30                             ` Andrii Nakryiko
  0 siblings, 1 reply; 58+ messages in thread
From: Hao Luo @ 2022-05-24  0:53 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, Tejun Heo, Alexei Starovoitov, Yosry Ahmed,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Zefan Li,
	Johannes Weiner, Shuah Khan, Roman Gushchin, Michal Hocko,
	Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Mon, May 23, 2022 at 4:58 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, May 20, 2022 at 7:35 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Fri, May 20, 2022 at 5:59 PM Yonghong Song <yhs@fb.com> wrote:
> > > On 5/20/22 3:57 PM, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote:
> > > >> We have bpf_map iterator that walks all bpf maps.
> > > >> When map iterator is parametrized with map_fd the iterator walks
> > > >> all elements of that map.
> > > >> cgroup iterator should have similar semantics.
> > > >> When non-parameterized it will walk all cgroups and their descendent
> > > >> depth first way. I believe that's what Yonghong is proposing.
> > > >> When parametrized it will start from that particular cgroup and
> > > >> walk all descendant of that cgroup only.
> > > >> The bpf prog can stop the iteration right away with ret 1.
> > > >> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
> > > >> the order of iteration css_for_each_descendant_pre vs _post.
> > > >> wdyt?
> > > >
> > > > Sounds perfectly reasonable to me.
> > >
> > > This works for me too. Thanks!
> > >
> >
> > This sounds good to me. Thanks. Let's try to do it in the next iteration.
>
> Can we, in addition to descendant_pre and descendant_post walk
> algorithms also add the one that does ascendants walk (i.e., start
> from specified cgroup and walk up to the root cgroup)? I don't have
> specific example, but it seems natural to include it for "cgroup
> iterator" in general. Hopefully it won't add much code to the
> implementation.

Yep. Sounds reasonable and doable. It's just adding a flag to specify
traversal order, like:

{
  WALK_DESCENDANT_PRE,
  WALK_DESCENDANT_POST,
  WALK_PARENT_UP,
};

In bpf_iter's seq_next(), change the algorithm to yield the parent of
the current cgroup.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter
  2022-05-24  0:53                           ` Hao Luo
@ 2022-05-24  1:30                             ` Andrii Nakryiko
  0 siblings, 0 replies; 58+ messages in thread
From: Andrii Nakryiko @ 2022-05-24  1:30 UTC (permalink / raw)
  To: Hao Luo
  Cc: Yonghong Song, Tejun Heo, Alexei Starovoitov, Yosry Ahmed,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Zefan Li,
	Johannes Weiner, Shuah Khan, Roman Gushchin, Michal Hocko,
	Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Mon, May 23, 2022 at 5:53 PM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, May 23, 2022 at 4:58 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, May 20, 2022 at 7:35 PM Hao Luo <haoluo@google.com> wrote:
> > >
> > > On Fri, May 20, 2022 at 5:59 PM Yonghong Song <yhs@fb.com> wrote:
> > > > On 5/20/22 3:57 PM, Tejun Heo wrote:
> > > > > Hello,
> > > > >
> > > > > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote:
> > > > >> We have bpf_map iterator that walks all bpf maps.
> > > > >> When map iterator is parametrized with map_fd the iterator walks
> > > > >> all elements of that map.
> > > > >> cgroup iterator should have similar semantics.
> > > > >> When non-parameterized it will walk all cgroups and their descendent
> > > > >> depth first way. I believe that's what Yonghong is proposing.
> > > > >> When parametrized it will start from that particular cgroup and
> > > > >> walk all descendant of that cgroup only.
> > > > >> The bpf prog can stop the iteration right away with ret 1.
> > > > >> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
> > > > >> the order of iteration css_for_each_descendant_pre vs _post.
> > > > >> wdyt?
> > > > >
> > > > > Sounds perfectly reasonable to me.
> > > >
> > > > This works for me too. Thanks!
> > > >
> > >
> > > This sounds good to me. Thanks. Let's try to do it in the next iteration.
> >
> > Can we, in addition to descendant_pre and descendant_post walk
> > algorithms also add the one that does ascendants walk (i.e., start
> > from specified cgroup and walk up to the root cgroup)? I don't have
> > specific example, but it seems natural to include it for "cgroup
> > iterator" in general. Hopefully it won't add much code to the
> > implementation.
>
> Yep. Sounds reasonable and doable. It's just adding a flag to specify
> traversal order, like:
>
> {
>   WALK_DESCENDANT_PRE,
>   WALK_DESCENDANT_POST,
>   WALK_PARENT_UP,

Probably something more like BPF_CG_WALK_DESCENDANT_PRE and so on?

> };
>
> In bpf_iter's seq_next(), change the algorithm to yield the parent of
> the current cgroup.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection
  2022-05-24  0:01       ` Andrii Nakryiko
@ 2022-05-24  2:35         ` Yosry Ahmed
  0 siblings, 0 replies; 58+ messages in thread
From: Yosry Ahmed @ 2022-05-24  2:35 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yonghong Song, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Mon, May 23, 2022 at 5:01 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, May 20, 2022 at 9:19 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Fri, May 20, 2022 at 9:09 AM Yonghong Song <yhs@fb.com> wrote:
> > >
> > >
> > >
> > > On 5/19/22 6:21 PM, Yosry Ahmed wrote:
> > > > Add a selftest that tests the whole workflow for collecting,
> > > > aggregating, and display cgroup hierarchical stats.
> > > >
> > > > TL;DR:
> > > > - Whenever reclaim happens, vmscan_start and vmscan_end update
> > > >    per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
> > > >    have updates.
> > > > - When userspace tries to read the stats, vmscan_dump calls rstat to flush
> > > >    the stats.
> > > > - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
> > > >    updates, vmscan_flush aggregates cpu readings and propagates updates
> > > >    to parents.
> > > >
> > > > Detailed explanation:
> > > > - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
> > > >    measure the latency of cgroup reclaim. Per-cgroup ratings are stored in
> > > >    percpu maps for efficiency. When a cgroup reading is updated on a cpu,
> > > >    cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
> > > >    rstat updated tree on that cpu.
> > > >
> > > > - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
> > > >    each cgroup. Reading this file invokes the program, which calls
> > > >    cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
> > > >    cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
> > > >    the stats are exposed to the user.
> > > >
> > > > - An ftrace program, vmscan_flush, is also loaded and attached to
> > > >    bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
> > > >    once for each (cgroup, cpu) pair that has updates. cgroups are popped
> > > >    from the rstat tree in a bottom-up fashion, so calls will always be
> > > >    made for cgroups that have updates before their parents. The program
> > > >    aggregates percpu readings to a total per-cgroup reading, and also
> > > >    propagates them to the parent cgroup. After rstat flushing is over, all
> > > >    cgroups will have correct updated hierarchical readings (including all
> > > >    cpus and all their descendants).
> > > >
> > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > > > ---
> > > >   .../test_cgroup_hierarchical_stats.c          | 339 ++++++++++++++++++
> > > >   tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
> > > >   .../selftests/bpf/progs/cgroup_vmscan.c       | 221 ++++++++++++
> > > >   3 files changed, 567 insertions(+)
> > > >   create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
> > > >   create mode 100644 tools/testing/selftests/bpf/progs/cgroup_vmscan.c
> > > >
> > > > diff --git a/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
> > > > new file mode 100644
> > > > index 000000000000..e560c1f6291f
> > > > --- /dev/null
> > > > +++ b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
> > > > @@ -0,0 +1,339 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > +/*
> > > > + * Functions to manage eBPF programs attached to cgroup subsystems
> > > > + *
> > > > + * Copyright 2022 Google LLC.
> > > > + */
> > > > +#include <errno.h>
> > > > +#include <sys/types.h>
> > > > +#include <sys/mount.h>
> > > > +#include <sys/stat.h>
> > > > +#include <unistd.h>
> > > > +
> > > > +#include <bpf/libbpf.h>
> > > > +#include <bpf/bpf.h>
> > > > +#include <test_progs.h>
> > > > +
> > > > +#include "cgroup_helpers.h"
> > > > +#include "cgroup_vmscan.skel.h"
> > > > +
> > > > +#define PAGE_SIZE 4096
> > > > +#define MB(x) (x << 20)
> > > > +
> > > > +#define BPFFS_ROOT "/sys/fs/bpf/"
> > > > +#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
> > > > +
> > > > +#define CG_ROOT_NAME "root"
> > > > +#define CG_ROOT_ID 1
> > > > +
> > > > +#define CGROUP_PATH(p, n) {.name = #n, .path = #p"/"#n}
> > > > +
> > > > +static struct {
> > > > +     const char *name, *path;
> > > > +     unsigned long long id;
> > > > +     int fd;
> > > > +} cgroups[] = {
> > > > +     CGROUP_PATH(/, test),
> > > > +     CGROUP_PATH(/test, child1),
> > > > +     CGROUP_PATH(/test, child2),
> > > > +     CGROUP_PATH(/test/child1, child1_1),
> > > > +     CGROUP_PATH(/test/child1, child1_2),
> > > > +     CGROUP_PATH(/test/child2, child2_1),
> > > > +     CGROUP_PATH(/test/child2, child2_2),
> > > > +};
> > > > +
> > > > +#define N_CGROUPS ARRAY_SIZE(cgroups)
> > > > +#define N_NON_LEAF_CGROUPS 3
> > > > +
> > > > +bool mounted_bpffs;
> > > > +static int duration;
> > > > +
> > > > +static int read_from_file(const char *path, char *buf, size_t size)
> > > > +{
> > > > +     int fd, len;
> > > > +
> > > > +     fd = open(path, O_RDONLY);
> > > > +     if (fd < 0) {
> > > > +             log_err("Open %s", path);
> > > > +             return -errno;
> > > > +     }
> > > > +     len = read(fd, buf, size);
> > > > +     if (len < 0)
> > > > +             log_err("Read %s", path);
> > > > +     else
> > > > +             buf[len] = 0;
> > > > +     close(fd);
> > > > +     return len < 0 ? -errno : 0;
> > > > +}
> > > > +
> > > > +static int setup_bpffs(void)
> > > > +{
> > > > +     int err;
> > > > +
> > > > +     /* Mount bpffs */
> > > > +     err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
> > > > +     mounted_bpffs = !err;
> > > > +     if (CHECK(err && errno != EBUSY, "mount bpffs",
> > >
> > > Please use ASSERT_* macros instead of CHECK.
> > > There are similar instances below as well.
> >
> > CHECK is more flexible in providing a parameterized failure message,
> > but I guess we ideally shouldn't see those a lot anyway. Will change
> > them to ASSERTs in the next version.
>
> The idea with ASSERT_xxx() is that you express semantically meaningful
> assertion/condition/check and the macro provides helpful and
> meaningful information for you. E.g., ASSERT_EQ(bla, 123, "bla_value")
> will emit something along the lines: "unexpected value of 'bla_value':
> 345, expected 123". It provides useful info when check fails without
> requiring to type all the extra format strings and parameters.
>
> And also CHECK() has an inverted condition which is extremely
> confusing. We don't use CHECK() for new code anymore.

I agree with this point. Especially that my test had some ASSERTs and
some CHECKs so the if conditions ended up being confusing. I am
changing them all to ASSERTs in the next version. Thanks for the
insights!

>
> >
> > >
> > > > +           "failed to mount bpffs at %s (%s)\n", BPFFS_ROOT,
> > > > +           strerror(errno)))
> > > > +             return err;
> > > > +
> > > > +     /* Create a directory to contain stat files in bpffs */
> > > > +     err = mkdir(BPFFS_VMSCAN, 0755);
> > > > +     CHECK(err, "mkdir bpffs", "failed to mkdir %s (%s)\n",
> > > > +           BPFFS_VMSCAN, strerror(errno));
> > > > +     return err;
> > > > +}
> > > > +
>
> [...]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats
  2022-05-20  1:21 [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
                   ` (4 preceding siblings ...)
  2022-05-20  1:21 ` [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection Yosry Ahmed
@ 2022-06-03 16:22 ` Michal Koutný
  2022-06-03 19:47   ` Yosry Ahmed
  5 siblings, 1 reply; 58+ messages in thread
From: Michal Koutný @ 2022-06-03 16:22 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt, linux-kernel, netdev,
	bpf, cgroups

Hello Yosry et al.

This is an interesting piece of work, I'll add some questions and
comments.

On Fri, May 20, 2022 at 01:21:28AM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> This patch series allows for using bpf to collect hierarchical cgroup
> stats efficiently by integrating with the rstat framework. The rstat
> framework provides an efficient way to collect cgroup stats and
> propagate them through the cgroup hierarchy.

About the efficiency. Do you have any numbers or examples?
IIUC the idea is to utilize the cgroup's rstat subgraph of full tree
when flushing.
I was looking at your selftest example and the measuring hooks call
cgroup_rstat_updated() and they also allocate an entry bpf_map[cg_id].
The flush callback then looks up the cg_id for cgroups in the rstat
subgraph.
(I'm not familiar with bpf_map implementation or performance but I
imagine, you're potentially one step away from erasing bpf_map[cg_id] in
the flush callback.)
It seems to me that you're building a parallel structure (inside
bpf_map(s)) with similar purpose to the rstat subgraph.

So I wonder whether there remains any benefit of coupling this with
rstat?


Also, I'd expect the custom-processed data are useful in the
structured form (within bpf_maps) but then there's the cgroup iter thing
that takes available data and "flattens" them into text files.
I see this was discussed in subthreads already so it's not necessary to
return to it. IIUC you somehow intend to provide the custom info via the
text files. If that's true, I'd include that in the next cover message
(the purpose of the iterator).


> * The second patch adds cgroup_rstat_updated() and cgorup_rstat_flush()
> kfuncs, to allow bpf stat collectors and readers to communicate with rstat.

kfunc means that it can be just called from any BPF program?
(I'm thinking of an unprivileged user who issues cgroup_rstat_updated()
deep down in the hierarchy repeatedly just to "spam" the rstat subgraph
(which slows down flushers above). Arguably, this can be done already
e.g. by causing certain MM events, so I'd like to just clarify if this
can be a new source of such arbitrary updates.)

> * The third patch is actually v2 of a previously submitted patch [1]
> by Hao Luo. We agreed that it fits better as a part of this series. It
> introduces cgroup_iter programs that can dump stats for cgroups to
> userspace.
> v1 - > v2:
> - Getting the cgroup's reference at the time at attaching, instead of
>   at the time when iterating. (Yonghong) (context [1])

I noticed you take the reference to cgroup, that's fine.
But the demo program also accesses via RCU pointers
(memory_subsys_enabled():cgroup->subsys).
Again, my BPF ignorance here, does the iterator framework somehow take
care of RCU locks?


Thanks,
Michal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection
  2022-05-20  1:21 ` [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection Yosry Ahmed
  2022-05-20 16:09   ` Yonghong Song
@ 2022-06-03 16:23   ` Michal Koutný
  2022-06-03 19:52     ` Yosry Ahmed
  1 sibling, 1 reply; 58+ messages in thread
From: Michal Koutný @ 2022-06-03 16:23 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt, linux-kernel, netdev,
	bpf, cgroups

On Fri, May 20, 2022 at 01:21:33AM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> +#define CGROUP_PATH(p, n) {.name = #n, .path = #p"/"#n}
> +
> +static struct {
> +	const char *name, *path;

Please unify the order of path and name with the macro (slightly
confusing ;-).

> +SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
> +int BPF_PROG(vmscan_end, struct lruvec *lruvec, struct scan_control *sc)
> +{
> [...]
> +	struct cgroup *cgrp = task_memcg(current);
> [...]
> +	/* cgrp may not have memory controller enabled */
> +	if (!cgrp)
> +		return 0;

Yes, the controller may not be enabled (for a cgroup).
Just noting that the task_memcg() implementation will fall back to
root_mem_cgroup in such a case (or nearest ancestor), you may want to
use cgroup_ss_mask() for proper detection.

Regards,
Michal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats
  2022-06-03 16:22 ` [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats Michal Koutný
@ 2022-06-03 19:47   ` Yosry Ahmed
  2022-06-06 12:32     ` Michal Koutný
  0 siblings, 1 reply; 58+ messages in thread
From: Yosry Ahmed @ 2022-06-03 19:47 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Fri, Jun 3, 2022 at 9:22 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello Yosry et al.
>
> This is an interesting piece of work, I'll add some questions and
> comments.
>
> On Fri, May 20, 2022 at 01:21:28AM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > This patch series allows for using bpf to collect hierarchical cgroup
> > stats efficiently by integrating with the rstat framework. The rstat
> > framework provides an efficient way to collect cgroup stats and
> > propagate them through the cgroup hierarchy.
>
> About the efficiency. Do you have any numbers or examples?
> IIUC the idea is to utilize the cgroup's rstat subgraph of full tree
> when flushing.
> I was looking at your selftest example and the measuring hooks call
> cgroup_rstat_updated() and they also allocate an entry bpf_map[cg_id].
> The flush callback then looks up the cg_id for cgroups in the rstat
> subgraph.
> (I'm not familiar with bpf_map implementation or performance but I
> imagine, you're potentially one step away from erasing bpf_map[cg_id] in
> the flush callback.)
> It seems to me that you're building a parallel structure (inside
> bpf_map(s)) with similar purpose to the rstat subgraph.
>
> So I wonder whether there remains any benefit of coupling this with
> rstat?

Hi Michal,

Thanks for taking a look at this!

The bpf_map[cg_id] is not a similar structure to the rstat flush
subgraph. This is where the stats are stored. These are long running
numbers for (virtually) all cgroups on the system, they do not get
allocated every time we call cgroup_rstat_updated(), only the first
time. They are actually not erased at all in the whole selftest
(except when the map is deleted at the end). In a production
environment, we might have "setup" and "destroy" bpf programs that run
when cgroups are created/destroyed, and allocate/delete these map
entries then, to avoid the overhead in the first stat update/flush if
necessary.

The only reason I didn't do this in the demo selftest is because it
was complex/long enough as-is, and for the purposes of showcasing and
testing it seemed enough to allocate entries on demand on the first
stat update. I can add a comment about this in the selftest if you
think it's not obvious.

In short, think of these bpf maps as equivalents to "struct
memcg_vmstats" and "struct memcg_vmstats_percpu" in the memory
controller. They are just containers to store the stats in, they do
not have any subgraph structure and they have no use beyond storing
percpu and total stats.

I run small microbenchmarks that are not worth posting, they compared
the latency of bpf stats collection vs. in-kernel code that adds stats
to struct memcg_vmstats[_percpu] and flushes them accordingly, the
difference was marginal. If the map lookups are deemed expensive and a
bottleneck in the future, I have some ideas about improving that. We
can rewrite the cgroup storage map to use the generic bpf local
storage code, and have it be accessible from all programs by a cgroup
key (like task_storage for e.g.) rather than only programs attached to
that cgroup. However, this discussion is a tangent here.

>
>
> Also, I'd expect the custom-processed data are useful in the
> structured form (within bpf_maps) but then there's the cgroup iter thing
> that takes available data and "flattens" them into text files.
> I see this was discussed in subthreads already so it's not necessary to
> return to it. IIUC you somehow intend to provide the custom info via the
> text files. If that's true, I'd include that in the next cover message
> (the purpose of the iterator).

The main reason for this is to provide data in a similar fashion to
cgroupfs, in text file per-cgroup. I will include this clearly in the
next cover message. You can always not use the cgroup_iter and access
the data directly from bpf maps.

>
>
> > * The second patch adds cgroup_rstat_updated() and cgorup_rstat_flush()
> > kfuncs, to allow bpf stat collectors and readers to communicate with rstat.
>
> kfunc means that it can be just called from any BPF program?
> (I'm thinking of an unprivileged user who issues cgroup_rstat_updated()
> deep down in the hierarchy repeatedly just to "spam" the rstat subgraph
> (which slows down flushers above). Arguably, this can be done already
> e.g. by causing certain MM events, so I'd like to just clarify if this
> can be a new source of such arbitrary updates.)

AFAIK loading bpf programs requires a privileged user, so someone has
to approve such a program. Am I missing something?

>
> > * The third patch is actually v2 of a previously submitted patch [1]
> > by Hao Luo. We agreed that it fits better as a part of this series. It
> > introduces cgroup_iter programs that can dump stats for cgroups to
> > userspace.
> > v1 - > v2:
> > - Getting the cgroup's reference at the time at attaching, instead of
> >   at the time when iterating. (Yonghong) (context [1])
>
> I noticed you take the reference to cgroup, that's fine.
> But the demo program also accesses via RCU pointers
> (memory_subsys_enabled():cgroup->subsys).
> Again, my BPF ignorance here, does the iterator framework somehow take
> care of RCU locks?

bpf_iter_run_prog() is used to run bpf iterator programs, and it grabs
rcu read lock before doing so. So AFAICT we are good on that front.

Thanks a lot for this great discussion!

>
>
> Thanks,
> Michal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-03 16:23   ` Michal Koutný
@ 2022-06-03 19:52     ` Yosry Ahmed
  2022-06-06 12:32       ` Michal Koutný
  0 siblings, 1 reply; 58+ messages in thread
From: Yosry Ahmed @ 2022-06-03 19:52 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

Thanks for taking a look at this!

On Fri, Jun 3, 2022 at 9:23 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, May 20, 2022 at 01:21:33AM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > +#define CGROUP_PATH(p, n) {.name = #n, .path = #p"/"#n}
> > +
> > +static struct {
> > +     const char *name, *path;
>
> Please unify the order of path and name with the macro (slightly
> confusing ;-).

Totally agree, will do.

>
> > +SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
> > +int BPF_PROG(vmscan_end, struct lruvec *lruvec, struct scan_control *sc)
> > +{
> > [...]
> > +     struct cgroup *cgrp = task_memcg(current);
> > [...]
> > +     /* cgrp may not have memory controller enabled */
> > +     if (!cgrp)
> > +             return 0;
>
> Yes, the controller may not be enabled (for a cgroup).
> Just noting that the task_memcg() implementation will fall back to
> root_mem_cgroup in such a case (or nearest ancestor), you may want to
> use cgroup_ss_mask() for proper detection.

Good catch. I get confused between cgrp->subsys and
task->cgroups->subsys sometimes because of different fallback
behavior. IIUC cgrp->subsys should have NULL if the memory controller
is not enabled (no nearest ancestor fallback), and hence I can use
memory_subsys_enabled() that I defined just above task_memcg() to test
for this (I have no idea why I am not already using it here). Is my
understanding correct?

>
> Regards,
> Michal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats
  2022-06-03 19:47   ` Yosry Ahmed
@ 2022-06-06 12:32     ` Michal Koutný
  2022-06-06 19:32       ` Yosry Ahmed
  0 siblings, 1 reply; 58+ messages in thread
From: Michal Koutný @ 2022-06-06 12:32 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Fri, Jun 03, 2022 at 12:47:19PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> In short, think of these bpf maps as equivalents to "struct
> memcg_vmstats" and "struct memcg_vmstats_percpu" in the memory
> controller. They are just containers to store the stats in, they do
> not have any subgraph structure and they have no use beyond storing
> percpu and total stats.

Thanks for the explanation.

> I run small microbenchmarks that are not worth posting, they compared
> the latency of bpf stats collection vs. in-kernel code that adds stats
> to struct memcg_vmstats[_percpu] and flushes them accordingly, the
> difference was marginal.

OK, that's a reasonable comparison.

> The main reason for this is to provide data in a similar fashion to
> cgroupfs, in text file per-cgroup. I will include this clearly in the
> next cover message.

Thanks, it'd be great to have that use-case captured there.

> AFAIK loading bpf programs requires a privileged user, so someone has
> to approve such a program. Am I missing something?

A sysctl unprivileged_bpf_disabled somehow stuck in my head. But as I
wrote, this adds a way how to call cgroup_rstat_updated() directly, it's
not reserved for privilged users anyhow.

> bpf_iter_run_prog() is used to run bpf iterator programs, and it grabs
> rcu read lock before doing so. So AFAICT we are good on that front.

Thanks for the clarification.


Michal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-03 19:52     ` Yosry Ahmed
@ 2022-06-06 12:32       ` Michal Koutný
  2022-06-06 19:41         ` Yosry Ahmed
  0 siblings, 1 reply; 58+ messages in thread
From: Michal Koutný @ 2022-06-06 12:32 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Fri, Jun 03, 2022 at 12:52:27PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> Good catch. I get confused between cgrp->subsys and
> task->cgroups->subsys sometimes because of different fallback
> behavior. IIUC cgrp->subsys should have NULL if the memory controller
> is not enabled (no nearest ancestor fallback), and hence I can use
> memory_subsys_enabled() that I defined just above task_memcg() to test
> for this (I have no idea why I am not already using it here). Is my
> understanding correct?

You're correct, css_set (task->cgroups) has a css (memcg) always defined
(be it root only (or even a css from v1 hierarchy but that should not
relevant here)). A particular cgroup can have the css set to NULL.

When I think about your stats collecting example now, task_memcg() looks
more suitable to achieve proper hierarchical counting in the end (IOW
you'd lose info from tasks who don't reside in memcg-enabled leaf).

(It's just that task_memcg won't return NULL. Unless the kernel is
compiled without memcg support completely, which makes me think how do
the config-dependent values propagate to BPF programs?)

Thanks,
Michal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats
  2022-06-06 12:32     ` Michal Koutný
@ 2022-06-06 19:32       ` Yosry Ahmed
  2022-06-06 19:54         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 58+ messages in thread
From: Yosry Ahmed @ 2022-06-06 19:32 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Mon, Jun 6, 2022 at 5:32 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Jun 03, 2022 at 12:47:19PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > In short, think of these bpf maps as equivalents to "struct
> > memcg_vmstats" and "struct memcg_vmstats_percpu" in the memory
> > controller. They are just containers to store the stats in, they do
> > not have any subgraph structure and they have no use beyond storing
> > percpu and total stats.
>
> Thanks for the explanation.
>
> > I run small microbenchmarks that are not worth posting, they compared
> > the latency of bpf stats collection vs. in-kernel code that adds stats
> > to struct memcg_vmstats[_percpu] and flushes them accordingly, the
> > difference was marginal.
>
> OK, that's a reasonable comparison.
>
> > The main reason for this is to provide data in a similar fashion to
> > cgroupfs, in text file per-cgroup. I will include this clearly in the
> > next cover message.
>
> Thanks, it'd be great to have that use-case captured there.
>
> > AFAIK loading bpf programs requires a privileged user, so someone has
> > to approve such a program. Am I missing something?
>
> A sysctl unprivileged_bpf_disabled somehow stuck in my head. But as I
> wrote, this adds a way how to call cgroup_rstat_updated() directly, it's
> not reserved for privilged users anyhow.

I am not sure if kfuncs have different privilege requirements or if
there is a way to mark a kfunc as privileged. Maybe someone with more
bpf knowledge can help here. But I assume if unprivileged_bpf_disabled
is not set then there is a certain amount of risk/trust that you are
taking anyway?

>
> > bpf_iter_run_prog() is used to run bpf iterator programs, and it grabs
> > rcu read lock before doing so. So AFAICT we are good on that front.
>
> Thanks for the clarification.
>
>
> Michal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-06 12:32       ` Michal Koutný
@ 2022-06-06 19:41         ` Yosry Ahmed
  2022-06-07 12:12           ` Michal Koutný
  0 siblings, 1 reply; 58+ messages in thread
From: Yosry Ahmed @ 2022-06-06 19:41 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Mon, Jun 6, 2022 at 5:32 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Jun 03, 2022 at 12:52:27PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > Good catch. I get confused between cgrp->subsys and
> > task->cgroups->subsys sometimes because of different fallback
> > behavior. IIUC cgrp->subsys should have NULL if the memory controller
> > is not enabled (no nearest ancestor fallback), and hence I can use
> > memory_subsys_enabled() that I defined just above task_memcg() to test
> > for this (I have no idea why I am not already using it here). Is my
> > understanding correct?
>
> You're correct, css_set (task->cgroups) has a css (memcg) always defined
> (be it root only (or even a css from v1 hierarchy but that should not
> relevant here)). A particular cgroup can have the css set to NULL.
>
> When I think about your stats collecting example now, task_memcg() looks
> more suitable to achieve proper hierarchical counting in the end (IOW
> you'd lose info from tasks who don't reside in memcg-enabled leaf).

I guess it depends on how userspace reasons about this, and whether or
not you want to collect stats from leaves that don't reside in a
memcg-enabled leaf. I will go through all the memcg-enabled checks and
make sure they make sense and are consistent, maybe add some comments
to make the userspace policy here clear.

>
> (It's just that task_memcg won't return NULL. Unless the kernel is
> compiled without memcg support completely, which makes me think how do
> the config-dependent values propagate to BPF programs?)

I don't know if there is a standard way to handle this, but I think
you should know the configs of your kernel when you are loading a bpf
program? In this particular case, if CONFIG_CGROUPS=0 then the bpf
programs will not even load due to lack of hook points or kfuncs won't
exist. If the CONFIG_CGROUPS=1 but CONFIG_MEMCG=0 I think everything
will work normally except that task_memcg() will always return NULL so
no stats will be collected, which makes sense. There will be some
overhead to running bpf programs that will always do nothing, but I
would argue that it's the userspace's fault here for loading bpf
programs on a non-compatible kernel.

>
> Thanks,
> Michal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats
  2022-06-06 19:32       ` Yosry Ahmed
@ 2022-06-06 19:54         ` Kumar Kartikeya Dwivedi
  2022-06-06 20:00           ` Yosry Ahmed
  0 siblings, 1 reply; 58+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-06-06 19:54 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Michal Koutný,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Tue, Jun 07, 2022 at 01:02:04AM IST, Yosry Ahmed wrote:
> On Mon, Jun 6, 2022 at 5:32 AM Michal Koutný <mkoutny@suse.com> wrote:
> >
> > On Fri, Jun 03, 2022 at 12:47:19PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > > In short, think of these bpf maps as equivalents to "struct
> > > memcg_vmstats" and "struct memcg_vmstats_percpu" in the memory
> > > controller. They are just containers to store the stats in, they do
> > > not have any subgraph structure and they have no use beyond storing
> > > percpu and total stats.
> >
> > Thanks for the explanation.
> >
> > > I run small microbenchmarks that are not worth posting, they compared
> > > the latency of bpf stats collection vs. in-kernel code that adds stats
> > > to struct memcg_vmstats[_percpu] and flushes them accordingly, the
> > > difference was marginal.
> >
> > OK, that's a reasonable comparison.
> >
> > > The main reason for this is to provide data in a similar fashion to
> > > cgroupfs, in text file per-cgroup. I will include this clearly in the
> > > next cover message.
> >
> > Thanks, it'd be great to have that use-case captured there.
> >
> > > AFAIK loading bpf programs requires a privileged user, so someone has
> > > to approve such a program. Am I missing something?
> >
> > A sysctl unprivileged_bpf_disabled somehow stuck in my head. But as I
> > wrote, this adds a way how to call cgroup_rstat_updated() directly, it's
> > not reserved for privilged users anyhow.
>
> I am not sure if kfuncs have different privilege requirements or if
> there is a way to mark a kfunc as privileged. Maybe someone with more
> bpf knowledge can help here. But I assume if unprivileged_bpf_disabled
> is not set then there is a certain amount of risk/trust that you are
> taking anyway?
>

It requires CAP_BPF or CAP_SYS_ADMIN, see verifier.c:add_subprog_or_kfunc.

> >
> > > bpf_iter_run_prog() is used to run bpf iterator programs, and it grabs
> > > rcu read lock before doing so. So AFAICT we are good on that front.
> >
> > Thanks for the clarification.
> >
> >
> > Michal

--
Kartikeya

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats
  2022-06-06 19:54         ` Kumar Kartikeya Dwivedi
@ 2022-06-06 20:00           ` Yosry Ahmed
  0 siblings, 0 replies; 58+ messages in thread
From: Yosry Ahmed @ 2022-06-06 20:00 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Michal Koutný,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Mon, Jun 6, 2022 at 12:55 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, Jun 07, 2022 at 01:02:04AM IST, Yosry Ahmed wrote:
> > On Mon, Jun 6, 2022 at 5:32 AM Michal Koutný <mkoutny@suse.com> wrote:
> > >
> > > On Fri, Jun 03, 2022 at 12:47:19PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > In short, think of these bpf maps as equivalents to "struct
> > > > memcg_vmstats" and "struct memcg_vmstats_percpu" in the memory
> > > > controller. They are just containers to store the stats in, they do
> > > > not have any subgraph structure and they have no use beyond storing
> > > > percpu and total stats.
> > >
> > > Thanks for the explanation.
> > >
> > > > I run small microbenchmarks that are not worth posting, they compared
> > > > the latency of bpf stats collection vs. in-kernel code that adds stats
> > > > to struct memcg_vmstats[_percpu] and flushes them accordingly, the
> > > > difference was marginal.
> > >
> > > OK, that's a reasonable comparison.
> > >
> > > > The main reason for this is to provide data in a similar fashion to
> > > > cgroupfs, in text file per-cgroup. I will include this clearly in the
> > > > next cover message.
> > >
> > > Thanks, it'd be great to have that use-case captured there.
> > >
> > > > AFAIK loading bpf programs requires a privileged user, so someone has
> > > > to approve such a program. Am I missing something?
> > >
> > > A sysctl unprivileged_bpf_disabled somehow stuck in my head. But as I
> > > wrote, this adds a way how to call cgroup_rstat_updated() directly, it's
> > > not reserved for privilged users anyhow.
> >
> > I am not sure if kfuncs have different privilege requirements or if
> > there is a way to mark a kfunc as privileged. Maybe someone with more
> > bpf knowledge can help here. But I assume if unprivileged_bpf_disabled
> > is not set then there is a certain amount of risk/trust that you are
> > taking anyway?
> >
>
> It requires CAP_BPF or CAP_SYS_ADMIN, see verifier.c:add_subprog_or_kfunc.

Thanks for the clarification!

>
> > >
> > > > bpf_iter_run_prog() is used to run bpf iterator programs, and it grabs
> > > > rcu read lock before doing so. So AFAICT we are good on that front.
> > >
> > > Thanks for the clarification.
> > >
> > >
> > > Michal
>
> --
> Kartikeya

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-06 19:41         ` Yosry Ahmed
@ 2022-06-07 12:12           ` Michal Koutný
  2022-06-07 17:43             ` Yosry Ahmed
  0 siblings, 1 reply; 58+ messages in thread
From: Michal Koutný @ 2022-06-07 12:12 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Mon, Jun 06, 2022 at 12:41:06PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> I don't know if there is a standard way to handle this, but I think
> you should know the configs of your kernel when you are loading a bpf
> program?

Isn't this one of purposes of BTF? (I don't know, I'm genuinely asking.)

> If the CONFIG_CGROUPS=1 but CONFIG_MEMCG=0 I think everything will
> work normally except that task_memcg() will always return NULL so no
> stats will be collected, which makes sense.

I was not able to track down what is the include chain to
tools/testing/selftests/bpf/progs/cgroup_vmscan.c, i.e. how is the enum
value memory_cgrp_id defined.

(A custom kernel module build requires target kernel's header files, I
could understand that compiling a BPF program requires them likewise and
that's how this could work.
Although, it goes against my undestanding of the CO-RE principle.)

> There will be some overhead to running bpf programs that will always
> do nothing, but I would argue that it's the userspace's fault here for
> loading bpf programs on a non-compatible kernel.

Yeah, running an empty program is non-issue in my eyes, I was rather
considering whether the program uses proper offsets.

Michal


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-07 12:12           ` Michal Koutný
@ 2022-06-07 17:43             ` Yosry Ahmed
  2022-06-08 11:17               ` Michal Koutný
  0 siblings, 1 reply; 58+ messages in thread
From: Yosry Ahmed @ 2022-06-07 17:43 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Tue, Jun 7, 2022 at 5:12 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Mon, Jun 06, 2022 at 12:41:06PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > I don't know if there is a standard way to handle this, but I think
> > you should know the configs of your kernel when you are loading a bpf
> > program?
>
> Isn't this one of purposes of BTF? (I don't know, I'm genuinely asking.)
>
> > If the CONFIG_CGROUPS=1 but CONFIG_MEMCG=0 I think everything will
> > work normally except that task_memcg() will always return NULL so no
> > stats will be collected, which makes sense.
>
> I was not able to track down what is the include chain to
> tools/testing/selftests/bpf/progs/cgroup_vmscan.c, i.e. how is the enum
> value memory_cgrp_id defined.

memory_cgrp_id is defined in "vmlinux.h" (generated from BTF) which is
included through "bpf_iter.h". If the kernel is not compiled with
CONFIG_MEMCG then this enum value will not be defined and the bpf prog
should not compile.

>
> (A custom kernel module build requires target kernel's header files, I
> could understand that compiling a BPF program requires them likewise and
> that's how this could work.
> Although, it goes against my undestanding of the CO-RE principle.)
>
> > There will be some overhead to running bpf programs that will always
> > do nothing, but I would argue that it's the userspace's fault here for
> > loading bpf programs on a non-compatible kernel.
>
> Yeah, running an empty program is non-issue in my eyes, I was rather
> considering whether the program uses proper offsets.
>
> Michal
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-07 17:43             ` Yosry Ahmed
@ 2022-06-08 11:17               ` Michal Koutný
  0 siblings, 0 replies; 58+ messages in thread
From: Michal Koutný @ 2022-06-08 11:17 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Tue, Jun 07, 2022 at 10:43:35AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> memory_cgrp_id is defined in "vmlinux.h" (generated from BTF) which is
> included through "bpf_iter.h". If the kernel is not compiled with
> CONFIG_MEMCG then this enum value will not be defined and the bpf prog
> should not compile.

Cool. Then it works as I would have expected.

Michal

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2022-06-08 11:17 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-20  1:21 [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
2022-05-20  1:21 ` [PATCH bpf-next v1 1/5] cgroup: bpf: add a hook for bpf progs to attach to rstat flushing Yosry Ahmed
2022-05-21 11:16   ` kernel test robot
2022-05-21 11:26   ` kernel test robot
2022-05-21 11:26   ` kernel test robot
2022-05-20  1:21 ` [PATCH bpf-next v1 2/5] cgroup: bpf: add cgroup_rstat_updated() and cgroup_rstat_flush() kfuncs Yosry Ahmed
2022-05-20  7:24   ` Tejun Heo
2022-05-20  9:13     ` Yosry Ahmed
2022-05-20  9:36       ` Kumar Kartikeya Dwivedi
2022-05-20 11:16         ` Tejun Heo
2022-05-20 16:06           ` Yosry Ahmed
2022-05-20 15:14   ` Yonghong Song
2022-05-20 16:08     ` Yosry Ahmed
2022-05-20 16:16       ` Yonghong Song
2022-05-20 16:20         ` Yosry Ahmed
2022-05-21 11:47   ` kernel test robot
2022-05-20  1:21 ` [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter Yosry Ahmed
2022-05-20  7:41   ` Tejun Heo
2022-05-20  7:58     ` Yosry Ahmed
2022-05-20  8:11       ` Tejun Heo
2022-05-20 11:27         ` Tejun Heo
2022-05-20 16:29         ` Yonghong Song
2022-05-20 16:45           ` Tejun Heo
2022-05-20 19:42             ` Hao Luo
2022-05-20 21:18               ` Yosry Ahmed
2022-05-20 22:19                 ` Alexei Starovoitov
2022-05-20 22:36                   ` Yosry Ahmed
2022-05-20 22:57                   ` Tejun Heo
2022-05-21  0:59                     ` Yonghong Song
2022-05-21  2:34                       ` Hao Luo
2022-05-23 23:58                         ` Andrii Nakryiko
2022-05-24  0:53                           ` Hao Luo
2022-05-24  1:30                             ` Andrii Nakryiko
2022-05-20 21:49               ` Hao Luo
2022-05-21  0:58                 ` Yonghong Song
2022-05-21  2:43                   ` Hao Luo
2022-05-21  4:53                     ` Tejun Heo
2022-05-21  0:52             ` Yonghong Song
2022-05-20 17:30         ` Hao Luo
2022-05-20  1:21 ` [PATCH bpf-next v1 4/5] selftests/bpf: extend cgroup helpers Yosry Ahmed
2022-05-20  1:21 ` [PATCH bpf-next v1 5/5] bpf: add a selftest for cgroup hierarchical stats collection Yosry Ahmed
2022-05-20 16:09   ` Yonghong Song
2022-05-20 16:18     ` Yosry Ahmed
2022-05-24  0:01       ` Andrii Nakryiko
2022-05-24  2:35         ` Yosry Ahmed
2022-06-03 16:23   ` Michal Koutný
2022-06-03 19:52     ` Yosry Ahmed
2022-06-06 12:32       ` Michal Koutný
2022-06-06 19:41         ` Yosry Ahmed
2022-06-07 12:12           ` Michal Koutný
2022-06-07 17:43             ` Yosry Ahmed
2022-06-08 11:17               ` Michal Koutný
2022-06-03 16:22 ` [PATCH bpf-next v1 0/5] bpf: rstat: cgroup hierarchical stats Michal Koutný
2022-06-03 19:47   ` Yosry Ahmed
2022-06-06 12:32     ` Michal Koutný
2022-06-06 19:32       ` Yosry Ahmed
2022-06-06 19:54         ` Kumar Kartikeya Dwivedi
2022-06-06 20:00           ` Yosry Ahmed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).