* [PATCH bpf-next v2 0/8] bpf: rstat: cgroup hierarchical stats
From: Yosry Ahmed @ 2022-06-10 19:44 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

This patch series allows bpf programs to collect hierarchical cgroup
stats efficiently by integrating with the rstat framework. The rstat
framework provides an efficient way to collect cgroup stats per cpu and
propagate them through the cgroup hierarchy.

The stats are exposed to userspace in textual form by reading files in
bpffs, similar to cgroupfs stats, using a cgroup_iter program.
cgroup_iter is a type of bpf_iter. It walks over cgroups in two modes:

 - walking a cgroup's descendants.
 - walking a cgroup's ancestors.

When attaching a cgroup_iter, one needs to set a cgroup for the
iter_link created from the attachment. This cgroup is passed as a file
descriptor and serves as the starting point of the walk.

For walking descendants, one can specify the order: either pre-order or
post-order. For walking ancestors, the walk starts at the specified
cgroup and ends at the root.
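
For illustration, attaching and reading such an iterator from userspace
might look like the following sketch (it mirrors the selftest added in
patch 5; error handling is elided and the cgroup path is just an
example):

  #include <fcntl.h>
  #include <unistd.h>
  #include <bpf/libbpf.h>

  /* "prog" is a loaded SEC("iter/cgroup") program */
  static void read_cgroup_iter(struct bpf_program *prog)
  {
          DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
          union bpf_iter_link_info linfo = {};
          struct bpf_link *link;
          int cgroup_fd, iter_fd, len;
          char buf[64];

          /* the cgroup fd is the starting point of the walk */
          cgroup_fd = open("/sys/fs/cgroup/parent", O_RDONLY);
          linfo.cgroup.cgroup_fd = cgroup_fd;
          linfo.cgroup.traversal_order = BPF_ITER_CGROUP_PRE;
          opts.link_info = &linfo;
          opts.link_info_len = sizeof(linfo);

          link = bpf_program__attach_iter(prog, &opts);
          iter_fd = bpf_iter_create(bpf_link__fd(link));

          /* each read invokes the iter prog over the walked cgroups */
          while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
                  ;

          close(iter_fd);
          bpf_link__destroy(link);
          close(cgroup_fd);
  }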

One can also terminate the walk early by returning 1 from the iter
program.

Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
program is called with cgroup_mutex held.
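
The iter program itself is passed a (meta, cgroup) context. A condensed
sketch of what one looks like (essentially the selftest program from
patch 5):

  #include "bpf_iter.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  SEC("iter/cgroup")
  int cgroup_id_printer(struct bpf_iter__cgroup *ctx)
  {
          struct seq_file *seq = ctx->meta->seq;
          struct cgroup *cgrp = ctx->cgroup;

          /* a NULL cgroup marks the end of the walk (epilogue) */
          if (!cgrp)
                  return 0;

          BPF_SEQ_PRINTF(seq, "%llu\n", cgrp->kn->id);

          /* returning 1 here instead would terminate the walk early */
          return 0;
  }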

** Background on rstat for stats collection **
(I am using a subscriber analogy that is not commonly used)

The rstat framework maintains a tree of cgroups that have updates and
which cpus have updates. A subscriber to the rstat framework maintains
its own stats. The framework is used to tell the subscriber when
and what to flush, for the most efficient stats propagation. The
workflow is as follows:

- When a subscriber updates a cgroup on a cpu, it informs the rstat
  framework by calling cgroup_rstat_updated(cgrp, cpu).

- When a subscriber wants to read some stats for a cgroup, it asks
  the rstat framework to initiate a stats flush (propagation) by calling
  cgroup_rstat_flush(cgrp).

- When the rstat framework initiates a flush, it makes callbacks to
  subscribers to aggregate stats on cpus that have updates, and
  propagate updates to the parent cgroup.

Currently, the main subscribers to the rstat framework are cgroup
subsystems (e.g. memory, block). This patch series allows bpf programs
to become subscribers as well.
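
For the reading side, a bpf subscriber can request a flush through the
cgroup_rstat_flush() kfunc added later in this series (patch 6). A
minimal sketch, assuming a sleepable iterator program (the "iter.s"
section name is an assumption here), since cgroup_rstat_flush() is in
the sleepable kfunc set:

  #include "bpf_iter.h"
  #include <bpf/bpf_helpers.h>

  char _license[] SEC("license") = "GPL";

  extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;

  SEC("iter.s/cgroup")
  int dump_stats(struct bpf_iter__cgroup *ctx)
  {
          struct cgroup *cgrp = ctx->cgroup;

          if (cgrp)
                  /* propagate pending updates in this subtree first */
                  cgroup_rstat_flush(cgrp);

          /* ... then print the flushed readings with BPF_SEQ_PRINTF ... */
          return 0;
  }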

Patches in this series are based on a patch in the mailing
list which adds a new kfunc set for sleepable functions:
"btf: Add a new kfunc set which allows to mark a function to be
sleepable" [1].

Patches in this series are organized as follows:
* Patch 1 enables the use of cgroup_get_from_file() in cgroup1.
  This is useful because it enables cgroup_iter to work with cgroup1, and
  allows the entire stat collection workflow to be cgroup1-compatible.
* Patches 2-5 introduce the cgroup_iter prog and a selftest.
* Patches 6-8 allow bpf programs to integrate with rstat by adding the
  necessary hook points and kfunc. A comprehensive selftest that
  demonstrates the entire workflow for using bpf and rstat to
  efficiently collect and output cgroup stats is added.

v1 -> v2:
- Redesign of cgroup_iter from v1, based on Alexei's idea [2]:
  - supports walking cgroup subtree.
  - supports walking ancestors of a cgroup. (Andrii)
  - supports terminating the walk early.
  - uses an fd instead of a cgroup_id as the parameter for iter_link;
    using fds is a convention in bpf.
  - takes a reference on the cgroup at attach time and releases it at
    detach time.
  - brought back cgroup1 support for cgroup_iter.
- Squashed the patches adding the rstat flush hook points and kfuncs
  (Tejun).
- Added a comment explaining why bpf_rstat_flush() needs to be weak
  (Tejun).
- Updated the final selftest with the new cgroup_iter design.
- Replaced CHECKs in the selftest with ASSERTs (Yonghong, Andrii).
- Removed empty line at the end of the selftest (Yonghong).
- Renamed test files to cgroup_hierarchical_stats.c.
- Reordered CGROUP_PATH params order to match struct declaration
  in the selftest (Michal).
- Removed memory_subsys_enabled() and made sure memcg controller
  enablement checks make sense and are documented (Michal).

RFC v2 -> v1:
- Instead of introducing a new program type for rstat flushing, add an
  empty hook point, bpf_rstat_flush(), and use fentry bpf programs to
  attach to it and flush bpf stats.
- Instead of using helpers, use kfuncs for rstat functions.
- These changes simplify the patchset greatly, with minimal changes to
  uapi.

RFC v1 -> RFC v2:
- Instead of rstat flush programs attaching to subsystems, they now
  attach to rstat (global flushers, not per-subsystem), based on
  discussions with Tejun. The first patch is entirely rewritten.
- Pass cgroup pointers to rstat flushers instead of cgroup ids. This
  gives much more flexibility and is less likely to need a uapi update
  later.
- rstat helpers are now only defined if CONFIG_CGROUPS.
- Most of the code is now only defined if CONFIG_CGROUPS and
  CONFIG_BPF_SYSCALL.
- Move rstat helper protos from bpf_base_func_proto() to
  tracing_prog_func_proto().
- rstat helpers argument (cgroup pointer) is now ARG_PTR_TO_BTF_ID, not
  ARG_ANYTHING.
- Rewrote the selftest to use the cgroup helpers.
- Dropped bpf_map_lookup_percpu_elem (already added by Feng).
- Dropped patch to support cgroup v1 for cgroup_iter.
- Dropped patch to define some cgroup_put() when !CONFIG_CGROUPS. The
  code that calls it is no longer compiled when !CONFIG_CGROUPS.

cgroup_iter was originally introduced in a different patch series[3].
Hao and I agreed that it fits better as part of this series.
RFC v1 of this patch series had the following changes from [3]:
- Getting the cgroup's reference at attach time, instead of when
  iterating. (Yonghong)
- Removing the .init_seq_private and .fini_seq_private callbacks for
  cgroup_iter; they are no longer needed. (Yonghong)

[1] https://lore.kernel.org/bpf/20220421140740.459558-5-benjamin.tissoires@redhat.com/
[2] https://lore.kernel.org/bpf/20220520221919.jnqgv52k4ajlgzcl@MBP-98dd607d3435.dhcp.thefacebook.com/
[3] https://lore.kernel.org/lkml/20220225234339.2386398-9-haoluo@google.com/

Hao Luo (4):
  cgroup: Add cgroup_put() in !CONFIG_CGROUPS case
  bpf, iter: Fix the condition on p when calling stop.
  bpf: Introduce cgroup iter
  selftests/bpf: Test cgroup_iter.

Yosry Ahmed (4):
  cgroup: enable cgroup_get_from_file() on cgroup1
  cgroup: bpf: enable bpf programs to integrate with rstat
  selftests/bpf: extend cgroup helpers
  bpf: add a selftest for cgroup hierarchical stats collection

 include/linux/bpf.h                           |   8 +
 include/linux/cgroup.h                        |   3 +
 include/uapi/linux/bpf.h                      |  21 ++
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/bpf_iter.c                         |   5 +
 kernel/bpf/cgroup_iter.c                      | 235 ++++++++++++
 kernel/cgroup/cgroup.c                        |   5 -
 kernel/cgroup/rstat.c                         |  46 +++
 tools/include/uapi/linux/bpf.h                |  21 ++
 tools/testing/selftests/bpf/cgroup_helpers.c  | 173 +++++++--
 tools/testing/selftests/bpf/cgroup_helpers.h  |  15 +-
 .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
 .../selftests/bpf/prog_tests/cgroup_iter.c    | 190 ++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
 .../testing/selftests/bpf/progs/cgroup_iter.c |  39 ++
 16 files changed, 1303 insertions(+), 52 deletions(-)
 create mode 100644 kernel/bpf/cgroup_iter.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter.c

-- 
2.36.1.476.g0c4daa206d-goog


* [PATCH bpf-next v2 1/8] cgroup: enable cgroup_get_from_file() on cgroup1
From: Yosry Ahmed @ 2022-06-10 19:44 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

cgroup_get_from_file() currently fails with -EBADF if called on cgroup
v1. However, the current implementation works on cgroup v1 as well, so
the restriction is unnecessary.

This enables cgroup_get_from_fd() to work on cgroup v1, which was the
only thing stopping bpf cgroup_iter from supporting cgroup v1.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 kernel/cgroup/cgroup.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734d..9943fcb1e574d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6090,11 +6090,6 @@ static struct cgroup *cgroup_get_from_file(struct file *f)
 		return ERR_CAST(css);
 
 	cgrp = css->cgroup;
-	if (!cgroup_on_dfl(cgrp)) {
-		cgroup_put(cgrp);
-		return ERR_PTR(-EBADF);
-	}
-
 	return cgrp;
 }
 
-- 
2.36.1.476.g0c4daa206d-goog


* [PATCH bpf-next v2 2/8] cgroup: Add cgroup_put() in !CONFIG_CGROUPS case
From: Yosry Ahmed @ 2022-06-10 19:44 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

From: Hao Luo <haoluo@google.com>

There is already a cgroup_get_from_id() in the !CONFIG_CGROUPS case,
let's have a matching cgroup_put() in !CONFIG_CGROUPS too.

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/cgroup.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0d1ada8968d75..7485a2f939119 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -757,6 +757,9 @@ static inline struct cgroup *cgroup_get_from_id(u64 id)
 {
 	return NULL;
 }
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{}
 #endif /* !CONFIG_CGROUPS */
 
 #ifdef CONFIG_CGROUPS
-- 
2.36.1.476.g0c4daa206d-goog


* [PATCH bpf-next v2 3/8] bpf, iter: Fix the condition on p when calling stop.
From: Yosry Ahmed @ 2022-06-10 19:44 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

From: Hao Luo <haoluo@google.com>

In bpf_seq_read, seq->op->next() could return an ERR pointer and jump
to the label stop. However, the existing code in stop does not handle
the case when p (returned from next()) is an ERR pointer. Add handling
for this case by converting p into an error and jumping to done.

Because none of the current implementations return an ERR pointer from
next(), this patch does not change behavior right now.

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 kernel/bpf/bpf_iter.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index d5d96ceca1058..1585caf7c7200 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -198,6 +198,11 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size,
 	}
 stop:
 	offs = seq->count;
+	if (IS_ERR(p)) {
+		seq->op->stop(seq, NULL);
+		err = PTR_ERR(p);
+		goto done;
+	}
 	/* bpf program called if !p */
 	seq->op->stop(seq, p);
 	if (!p) {
-- 
2.36.1.476.g0c4daa206d-goog


* [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
From: Yosry Ahmed @ 2022-06-10 19:44 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

From: Hao Luo <haoluo@google.com>

Cgroup_iter is a type of bpf_iter. It walks over cgroups in two modes:

 - walking a cgroup's descendants.
 - walking a cgroup's ancestors.

When attaching a cgroup_iter, one can set a cgroup for the iter_link
created from the attachment. This cgroup is passed as a file descriptor
and serves as the starting point of the walk. If no cgroup is specified,
the starting point will be the root cgroup.

For walking descendants, one can specify the order: either pre-order or
post-order. For walking ancestors, the walk starts at the specified
cgroup and ends at the root.

One can also terminate the walk early by returning 1 from the iter
program.

Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
program is called with cgroup_mutex held.

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/bpf.h            |   8 ++
 include/uapi/linux/bpf.h       |  21 +++
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/cgroup_iter.c       | 235 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  21 +++
 5 files changed, 286 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/cgroup_iter.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8e6092d0ea956..48d8e836b9748 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -44,6 +44,7 @@ struct kobject;
 struct mem_cgroup;
 struct module;
 struct bpf_func_state;
+struct cgroup;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1590,7 +1591,14 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
 	int __init bpf_iter_ ## target(args) { return 0; }
 
 struct bpf_iter_aux_info {
+	/* for map_elem iter */
 	struct bpf_map *map;
+
+	/* for cgroup iter */
+	struct {
+		struct cgroup *start; /* starting cgroup */
+		int order;
+	} cgroup;
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f4009dbdf62da..4fd05cde19116 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -87,10 +87,27 @@ struct bpf_cgroup_storage_key {
 	__u32	attach_type;		/* program attach type (enum bpf_attach_type) */
 };
 
+enum bpf_iter_cgroup_traversal_order {
+	BPF_ITER_CGROUP_PRE = 0,	/* pre-order traversal */
+	BPF_ITER_CGROUP_POST,		/* post-order traversal */
+	BPF_ITER_CGROUP_PARENT_UP,	/* traversal of ancestors up to the root */
+};
+
 union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+
+	/* cgroup_iter walks either the live descendants of a cgroup subtree, or the ancestors
+	 * of a given cgroup.
+	 */
+	struct {
+	/* Cgroup file descriptor. This is the root of the subtree when
+	 * walking the descendants, and the starting cgroup when walking
+	 * the ancestors. */
+		__u32	cgroup_fd;
+		__u32	traversal_order;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -6050,6 +6067,10 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u32 traversal_order;
+					__aligned_u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 057ba8e01e70f..9741b9314fb46 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
new file mode 100644
index 0000000000000..88deb655efa71
--- /dev/null
+++ b/kernel/bpf/cgroup_iter.c
@@ -0,0 +1,235 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/cgroup.h>
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+
+#include "../cgroup/cgroup-internal.h"  /* cgroup_mutex and cgroup_is_dead */
+
+/* cgroup_iter provides two modes of traversal to the cgroup hierarchy.
+ *
+ *  1. Walk the descendants of a cgroup.
+ *  2. Walk the ancestors of a cgroup.
+ *
+ * For walking descendants, cgroup_iter can walk in either pre-order or
+ * post-order. For walking ancestors, the iter walks up from a cgroup to
+ * the root.
+ *
+ * The iter program can terminate the walk early by returning 1. Walk
+ * continues if prog returns 0.
+ *
+ * The prog can check (seq->num == 0) to determine whether this is
+ * the first element. The prog may also be passed a NULL cgroup,
+ * which means the walk has completed and the prog has a chance to
+ * do post-processing, such as outputting an epilogue.
+ *
+ * Note: the iter_prog is called with cgroup_mutex held.
+ */
+
+struct bpf_iter__cgroup {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+struct cgroup_iter_priv {
+	struct cgroup_subsys_state *start_css;
+	bool terminate;
+	int order;
+};
+
+static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct cgroup_iter_priv *p = seq->private;
+
+	mutex_lock(&cgroup_mutex);
+
+	/* support only one session */
+	if (*pos > 0)
+		return NULL;
+
+	++*pos;
+	p->terminate = false;
+	if (p->order == BPF_ITER_CGROUP_PRE)
+		return css_next_descendant_pre(NULL, p->start_css);
+	else if (p->order == BPF_ITER_CGROUP_POST)
+		return css_next_descendant_post(NULL, p->start_css);
+	else /* BPF_ITER_CGROUP_PARENT_UP */
+		return p->start_css;
+}
+
+static int __cgroup_iter_seq_show(struct seq_file *seq,
+				  struct cgroup_subsys_state *css, int in_stop);
+
+static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
+{
+	/* pass NULL to the prog for post-processing */
+	if (!v)
+		__cgroup_iter_seq_show(seq, NULL, true);
+	mutex_unlock(&cgroup_mutex);
+}
+
+static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct cgroup_subsys_state *curr = (struct cgroup_subsys_state *)v;
+	struct cgroup_iter_priv *p = seq->private;
+
+	++*pos;
+	if (p->terminate)
+		return NULL;
+
+	if (p->order == BPF_ITER_CGROUP_PRE)
+		return css_next_descendant_pre(curr, p->start_css);
+	else if (p->order == BPF_ITER_CGROUP_POST)
+		return css_next_descendant_post(curr, p->start_css);
+	else
+		return curr->parent;
+}
+
+static int __cgroup_iter_seq_show(struct seq_file *seq,
+				  struct cgroup_subsys_state *css, int in_stop)
+{
+	struct cgroup_iter_priv *p = seq->private;
+	struct bpf_iter__cgroup ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	/* cgroup is dead, skip this element */
+	if (css && cgroup_is_dead(css->cgroup))
+		return 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = css ? css->cgroup : NULL;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, in_stop);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	/* if prog returns non-zero, terminate after this element. */
+	if (ret != 0)
+		p->terminate = true;
+
+	return 0;
+}
+
+static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
+{
+	return __cgroup_iter_seq_show(seq, (struct cgroup_subsys_state *)v,
+				      false);
+}
+
+static const struct seq_operations cgroup_iter_seq_ops = {
+	.start  = cgroup_iter_seq_start,
+	.next   = cgroup_iter_seq_next,
+	.stop   = cgroup_iter_seq_stop,
+	.show   = cgroup_iter_seq_show,
+};
+
+BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+
+static int cgroup_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
+{
+	struct cgroup_iter_priv *p = (struct cgroup_iter_priv *)priv;
+	struct cgroup *cgrp = aux->cgroup.start;
+
+	p->start_css = &cgrp->self;
+	p->terminate = false;
+	p->order = aux->cgroup.order;
+	return 0;
+}
+
+static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
+	.seq_ops                = &cgroup_iter_seq_ops,
+	.init_seq_private       = cgroup_iter_seq_init,
+	.seq_priv_size          = sizeof(struct cgroup_iter_priv),
+};
+
+static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
+				  union bpf_iter_link_info *linfo,
+				  struct bpf_iter_aux_info *aux)
+{
+	int fd = linfo->cgroup.cgroup_fd;
+	struct cgroup *cgrp;
+
+	if (fd)
+		cgrp = cgroup_get_from_fd(fd);
+	else /* walk the entire hierarchy by default. */
+		cgrp = cgroup_get_from_path("/");
+
+	if (IS_ERR(cgrp))
+		return PTR_ERR(cgrp);
+
+	aux->cgroup.start = cgrp;
+	aux->cgroup.order = linfo->cgroup.traversal_order;
+	return 0;
+}
+
+static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
+{
+	cgroup_put(aux->cgroup.start);
+}
+
+static void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
+					struct seq_file *seq)
+{
+	char *buf;
+
+	buf = kzalloc(PATH_MAX, GFP_KERNEL);
+	if (!buf) {
+		seq_puts(seq, "cgroup_path:\n");
+		goto show_order;
+	}
+
+	/* If cgroup_path_ns() fails, buf will be an empty string, cgroup_path
+	 * will print nothing.
+	 *
+	 * Path is in the calling process's cgroup namespace.
+	 */
+	cgroup_path_ns(aux->cgroup.start, buf, PATH_MAX,
+		       current->nsproxy->cgroup_ns);
+	seq_printf(seq, "cgroup_path:\t%s\n", buf);
+	kfree(buf);
+
+show_order:
+	if (aux->cgroup.order == BPF_ITER_CGROUP_PRE)
+		seq_puts(seq, "traversal_order: pre\n");
+	else if (aux->cgroup.order == BPF_ITER_CGROUP_POST)
+		seq_puts(seq, "traversal_order: post\n");
+	else /* BPF_ITER_CGROUP_PARENT_UP */
+		seq_puts(seq, "traversal_order: parent_up\n");
+}
+
+static int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
+					  struct bpf_link_info *info)
+{
+	info->iter.cgroup.traversal_order = aux->cgroup.order;
+	info->iter.cgroup.cgroup_id = cgroup_id(aux->cgroup.start);
+	return 0;
+}
+
+DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
+		     struct cgroup *cgroup)
+
+static struct bpf_iter_reg bpf_cgroup_reg_info = {
+	.target			= "cgroup",
+	.attach_target		= bpf_iter_attach_cgroup,
+	.detach_target		= bpf_iter_detach_cgroup,
+	.show_fdinfo		= bpf_iter_cgroup_show_fdinfo,
+	.fill_link_info		= bpf_iter_cgroup_fill_link_info,
+	.ctx_arg_info_size	= 1,
+	.ctx_arg_info		= {
+		{ offsetof(struct bpf_iter__cgroup, cgroup),
+		  PTR_TO_BTF_ID },
+	},
+	.seq_info		= &cgroup_iter_seq_info,
+};
+
+static int __init bpf_cgroup_iter_init(void)
+{
+	bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
+	return bpf_iter_reg_target(&bpf_cgroup_reg_info);
+}
+
+late_initcall(bpf_cgroup_iter_init);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f4009dbdf62da..4fd05cde19116 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -87,10 +87,27 @@ struct bpf_cgroup_storage_key {
 	__u32	attach_type;		/* program attach type (enum bpf_attach_type) */
 };
 
+enum bpf_iter_cgroup_traversal_order {
+	BPF_ITER_CGROUP_PRE = 0,	/* pre-order traversal */
+	BPF_ITER_CGROUP_POST,		/* post-order traversal */
+	BPF_ITER_CGROUP_PARENT_UP,	/* traversal of ancestors up to the root */
+};
+
 union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+
+	/* cgroup_iter walks either the live descendants of a cgroup subtree, or the ancestors
+	 * of a given cgroup.
+	 */
+	struct {
+	/* Cgroup file descriptor. This is the root of the subtree when
+	 * walking the descendants, and the starting cgroup when walking
+	 * the ancestors. */
+		__u32	cgroup_fd;
+		__u32	traversal_order;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -6050,6 +6067,10 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u32 traversal_order;
+					__aligned_u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {
-- 
2.36.1.476.g0c4daa206d-goog


* [PATCH bpf-next v2 5/8] selftests/bpf: Test cgroup_iter.
From: Yosry Ahmed @ 2022-06-10 19:44 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

From: Hao Luo <haoluo@google.com>

Add a selftest for cgroup_iter. The selftest creates a mini cgroup tree
of the following structure:

    ROOT (working cgroup)
     |
   PARENT
  /      \
CHILD1  CHILD2

and tests the following scenarios:

 - invalid cgroup fd.
 - pre-order walk over descendants from PARENT.
 - post-order walk over descendants from PARENT.
 - walk of ancestors from PARENT.
 - early termination.
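
For example, given the tree above, the pre-order walk from PARENT is
expected to produce output of the form:

    prologue
    <PARENT's cgroup id>
    <CHILD1's cgroup id>
    <CHILD2's cgroup id>
    epilogue

while the post-order walk prints the children's ids before PARENT's.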

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 .../selftests/bpf/prog_tests/cgroup_iter.c    | 190 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../testing/selftests/bpf/progs/cgroup_iter.c |  39 ++++
 3 files changed, 236 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter.c

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c b/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c
new file mode 100644
index 0000000000000..4c8f11f784915
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+
+#include <test_progs.h>
+#include <bpf/libbpf.h>
+#include <bpf/btf.h>
+#include "cgroup_iter.skel.h"
+#include "cgroup_helpers.h"
+
+#define ROOT		0
+#define PARENT		1
+#define CHILD1		2
+#define CHILD2		3
+#define NUM_CGROUPS	4
+
+#define PROLOGUE	"prologue\n"
+#define EPILOGUE	"epilogue\n"
+
+#define format_expected_output1(cg_id1) \
+	snprintf(expected_output, sizeof(expected_output), \
+		 PROLOGUE "%8llu\n" EPILOGUE, (cg_id1))
+
+#define format_expected_output2(cg_id1, cg_id2) \
+	snprintf(expected_output, sizeof(expected_output), \
+		 PROLOGUE "%8llu\n%8llu\n" EPILOGUE, \
+		 (cg_id1), (cg_id2))
+
+#define format_expected_output3(cg_id1, cg_id2, cg_id3) \
+	snprintf(expected_output, sizeof(expected_output), \
+		 PROLOGUE "%8llu\n%8llu\n%8llu\n" EPILOGUE, \
+		 (cg_id1), (cg_id2), (cg_id3))
+
+const char *cg_path[] = {
+	"/", "/parent", "/parent/child1", "/parent/child2"
+};
+
+static int cg_fd[] = {-1, -1, -1, -1};
+static unsigned long long cg_id[] = {0, 0, 0, 0};
+static char expected_output[64];
+
+int setup_cgroups(void)
+{
+	int fd, i = 0;
+
+	for (i = 0; i < NUM_CGROUPS; i++) {
+		fd = create_and_get_cgroup(cg_path[i]);
+		if (fd < 0)
+			return fd;
+
+		cg_fd[i] = fd;
+		cg_id[i] = get_cgroup_id(cg_path[i]);
+	}
+	return 0;
+}
+
+void cleanup_cgroups(void)
+{
+	int i;
+
+	for (i = 0; i < NUM_CGROUPS; i++)
+		close(cg_fd[i]);
+}
+
+static void read_from_cgroup_iter(struct bpf_program *prog, int cgroup_fd,
+				  int order, const char *testname)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo;
+	struct bpf_link *link;
+	int len, iter_fd;
+	static char buf[64];
+
+	memset(&linfo, 0, sizeof(linfo));
+	linfo.cgroup.cgroup_fd = cgroup_fd;
+	linfo.cgroup.traversal_order = order;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	link = bpf_program__attach_iter(prog, &opts);
+	if (!ASSERT_OK_PTR(link, "attach_iter"))
+		return;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (iter_fd < 0)
+		goto free_link;
+
+	memset(buf, 0, sizeof(buf));
+	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
+		;
+
+	ASSERT_STREQ(buf, expected_output, testname);
+
+	close(iter_fd);
+free_link:
+	bpf_link__destroy(link);
+}
+
+/* Invalid cgroup. */
+static void test_invalid_cgroup(struct cgroup_iter *skel)
+{
+
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo;
+	struct bpf_link *link;
+
+	memset(&linfo, 0, sizeof(linfo));
+	linfo.cgroup.cgroup_fd = (__u32)-1;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	link = bpf_program__attach_iter(skel->progs.cgroup_id_printer, &opts);
+	if (!ASSERT_ERR_PTR(link, "attach_iter"))
+		bpf_link__destroy(link);
+}
+
+/* Preorder walk prints parent and child in order. */
+static void test_walk_preorder(struct cgroup_iter *skel)
+{
+	format_expected_output3(cg_id[PARENT], cg_id[CHILD1], cg_id[CHILD2]);
+
+	read_from_cgroup_iter(skel->progs.cgroup_id_printer, cg_fd[PARENT],
+			      BPF_ITER_CGROUP_PRE, "preorder");
+}
+
+/* Postorder walk prints child and parent in order. */
+static void test_walk_postorder(struct cgroup_iter *skel)
+{
+	format_expected_output3(cg_id[CHILD1], cg_id[CHILD2], cg_id[PARENT]);
+
+	read_from_cgroup_iter(skel->progs.cgroup_id_printer, cg_fd[PARENT],
+			      BPF_ITER_CGROUP_POST, "postorder");
+}
+
+/* Walking parents prints parent and then root. */
+static void test_walk_parent_up(struct cgroup_iter *skel)
+{
+	/* terminate the walk when ROOT is met. */
+	skel->bss->terminal_cgroup = cg_id[ROOT];
+
+	format_expected_output2(cg_id[PARENT], cg_id[ROOT]);
+
+	read_from_cgroup_iter(skel->progs.cgroup_id_printer, cg_fd[PARENT],
+			      BPF_ITER_CGROUP_PARENT_UP, "parent_up");
+
+	skel->bss->terminal_cgroup = 0;
+}
+
+/* Early termination prints parent only. */
+static void test_early_termination(struct cgroup_iter *skel)
+{
+	/* terminate the walk after the first element is processed. */
+	skel->bss->terminate_early = 1;
+
+	format_expected_output1(cg_id[PARENT]);
+
+	read_from_cgroup_iter(skel->progs.cgroup_id_printer, cg_fd[PARENT],
+			      BPF_ITER_CGROUP_PRE, "early_termination");
+
+	skel->bss->terminate_early = 0;
+}
+
+void test_cgroup_iter(void)
+{
+	struct cgroup_iter *skel = NULL;
+
+	if (setup_cgroup_environment() < 0)
+		return;
+
+	if (setup_cgroups() < 0)
+		goto out;
+
+	skel = cgroup_iter__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "cgroup_iter__open_and_load"))
+		goto out;
+
+	if (test__start_subtest("cgroup_iter__invalid_cgroup"))
+		test_invalid_cgroup(skel);
+	if (test__start_subtest("cgroup_iter__preorder"))
+		test_walk_preorder(skel);
+	if (test__start_subtest("cgroup_iter__postorder"))
+		test_walk_postorder(skel);
+	if (test__start_subtest("cgroup_iter__parent_up_walk"))
+		test_walk_parent_up(skel);
+	if (test__start_subtest("cgroup_iter__early_termination"))
+		test_early_termination(skel);
+out:
+	cgroup_iter__destroy(skel);
+	cleanup_cgroups();
+	cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h
index 97ec8bc76ae62..df91f1daf74d3 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter.h
+++ b/tools/testing/selftests/bpf/progs/bpf_iter.h
@@ -17,6 +17,7 @@
 #define bpf_iter__bpf_sk_storage_map bpf_iter__bpf_sk_storage_map___not_used
 #define bpf_iter__sockmap bpf_iter__sockmap___not_used
 #define bpf_iter__bpf_link bpf_iter__bpf_link___not_used
+#define bpf_iter__cgroup bpf_iter__cgroup___not_used
 #define btf_ptr btf_ptr___not_used
 #define BTF_F_COMPACT BTF_F_COMPACT___not_used
 #define BTF_F_NONAME BTF_F_NONAME___not_used
@@ -39,6 +40,7 @@
 #undef bpf_iter__bpf_sk_storage_map
 #undef bpf_iter__sockmap
 #undef bpf_iter__bpf_link
+#undef bpf_iter__cgroup
 #undef btf_ptr
 #undef BTF_F_COMPACT
 #undef BTF_F_NONAME
@@ -139,6 +141,11 @@ struct bpf_iter__bpf_link {
 	struct bpf_link *link;
 };
 
+struct bpf_iter__cgroup {
+	struct bpf_iter_meta *meta;
+	struct cgroup *cgroup;
+} __attribute__((preserve_access_index));
+
 struct btf_ptr {
 	void *ptr;
 	__u32 type_id;
diff --git a/tools/testing/selftests/bpf/progs/cgroup_iter.c b/tools/testing/selftests/bpf/progs/cgroup_iter.c
new file mode 100644
index 0000000000000..2a34d146d6df0
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_iter.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+volatile int terminate_early = 0;
+volatile u64 terminal_cgroup = 0;
+
+static inline u64 cgroup_id(struct cgroup *cgrp)
+{
+	return cgrp->kn->id;
+}
+
+SEC("iter/cgroup")
+int cgroup_id_printer(struct bpf_iter__cgroup *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct cgroup *cgrp = ctx->cgroup;
+
+	/* epilogue */
+	if (cgrp == NULL) {
+		BPF_SEQ_PRINTF(seq, "epilogue\n");
+		return 0;
+	}
+
+	/* prologue */
+	if (ctx->meta->seq_num == 0)
+		BPF_SEQ_PRINTF(seq, "prologue\n");
+
+	BPF_SEQ_PRINTF(seq, "%8llu\n", cgroup_id(cgrp));
+
+	if (terminal_cgroup == cgroup_id(cgrp))
+		return 1;
+
+	return terminate_early ? 1 : 0;
+}
-- 
2.36.1.476.g0c4daa206d-goog


* [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat
From: Yosry Ahmed @ 2022-06-10 19:44 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Enable bpf programs to make use of rstat to collect cgroup hierarchical
stats efficiently:
- Add cgroup_rstat_updated() kfunc, for bpf progs that collect stats.
- Add cgroup_rstat_flush() kfunc, for bpf progs that read stats.
- Add an empty bpf_rstat_flush() hook that is called during rstat
  flushing, for bpf progs that flush stats to attach to. Attaching a bpf
  prog to this hook effectively registers it as a flush callback.
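
On the bpf side, the kfuncs would be declared as extern ksyms. A sketch
of the collection side, where the surrounding tracing program and map
bookkeeping are elided:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
  extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;

  /* called after bumping a per-cgroup percpu reading */
  static __always_inline void report_update(struct cgroup *cgrp)
  {
          /* put cgrp on the rstat updated tree for this cpu */
          cgroup_rstat_updated(cgrp, bpf_get_smp_processor_id());
  }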

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 kernel/cgroup/rstat.c | 46 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 24b5c2ab55983..94140bd0d02a4 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -3,6 +3,11 @@
 
 #include <linux/sched/cputime.h>
 
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+
+
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
 
@@ -141,6 +146,23 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 	return pos;
 }
 
+/*
+ * A hook for bpf stat collectors to attach to and flush their stats.
+ * Together with providing bpf kfuncs for cgroup_rstat_updated() and
+ * cgroup_rstat_flush(), this enables a complete workflow where bpf progs that
+ * collect cgroup stats can integrate with rstat for efficient flushing.
+ *
+ * A static noinline declaration here could cause the compiler to optimize away
+ * the function. A global noinline declaration will keep the definition, but may
+ * optimize away the callsite. Therefore, __weak is needed to ensure that the
+ * call is still emitted, by telling the compiler that we don't know what the
+ * function might eventually be.
+ */
+__weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
+				     struct cgroup *parent, int cpu)
+{
+}
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -168,6 +190,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 			struct cgroup_subsys_state *css;
 
 			cgroup_base_stat_flush(pos, cpu);
+			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
 
 			rcu_read_lock();
 			list_for_each_entry_rcu(css, &pos->rstat_css_list,
@@ -469,3 +492,26 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 		   "system_usec %llu\n",
 		   usage, utime, stime);
 }
+
+/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
+BTF_SET_START(bpf_rstat_check_kfunc_ids)
+BTF_ID(func, cgroup_rstat_updated)
+BTF_ID(func, cgroup_rstat_flush)
+BTF_SET_END(bpf_rstat_check_kfunc_ids)
+
+BTF_SET_START(bpf_rstat_sleepable_kfunc_ids)
+BTF_ID(func, cgroup_rstat_flush)
+BTF_SET_END(bpf_rstat_sleepable_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
+	.owner		= THIS_MODULE,
+	.check_set	= &bpf_rstat_check_kfunc_ids,
+	.sleepable_set	= &bpf_rstat_sleepable_kfunc_ids,
+};
+
+static int __init bpf_rstat_kfunc_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+					 &bpf_rstat_kfunc_set);
+}
+late_initcall(bpf_rstat_kfunc_init);
-- 
2.36.1.476.g0c4daa206d-goog


* [PATCH bpf-next v2 7/8] selftests/bpf: extend cgroup helpers
From: Yosry Ahmed @ 2022-06-10 19:44 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

This patch extends bpf selftests cgroup helpers in various ways:
- Add enable_controllers() that allows tests to enable all or a
  subset of controllers for a specific cgroup.
- Add write_cgroup_file().
- Add join_parent_cgroup(). The cgroup workdir is based on the pid,
  therefore a spawned child cannot join the same cgroup hierarchy of the
  test through join_cgroup(). join_parent_cgroup() is used in child
  processes to join a cgroup under the parent's workdir (see the usage
  sketch after this list).
- Distinguish relative and absolute cgroup paths in function arguments.
  Now relative paths are called relative_path, and absolute paths are
  called cgroup_path.
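
A sketch of the join_parent_cgroup() usage pattern (the cgroup path is
illustrative):

  pid_t pid = fork();

  if (pid == 0) {
          /* child: join "<parent workdir>/test", then do the work
           * that should be accounted to that cgroup.
           */
          if (join_parent_cgroup("/test"))
                  exit(1);
          /* ... */
          exit(0);
  }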

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 tools/testing/selftests/bpf/cgroup_helpers.c | 173 ++++++++++++++-----
 tools/testing/selftests/bpf/cgroup_helpers.h |  15 +-
 2 files changed, 142 insertions(+), 46 deletions(-)

diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index 9d59c3990ca8d..98c5f2f3d3c60 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -33,49 +33,51 @@
 #define CGROUP_MOUNT_DFLT		"/sys/fs/cgroup"
 #define NETCLS_MOUNT_PATH		CGROUP_MOUNT_DFLT "/net_cls"
 #define CGROUP_WORK_DIR			"/cgroup-test-work-dir"
-#define format_cgroup_path(buf, path) \
+
+#define format_cgroup_path_pid(buf, path, pid) \
 	snprintf(buf, sizeof(buf), "%s%s%d%s", CGROUP_MOUNT_PATH, \
-	CGROUP_WORK_DIR, getpid(), path)
+	CGROUP_WORK_DIR, pid, path)
+
+#define format_cgroup_path(buf, path) \
+	format_cgroup_path_pid(buf, path, getpid())
+
+#define format_parent_cgroup_path(buf, path) \
+	format_cgroup_path_pid(buf, path, getppid())
 
 #define format_classid_path(buf)				\
 	snprintf(buf, sizeof(buf), "%s%s", NETCLS_MOUNT_PATH,	\
 		 CGROUP_WORK_DIR)
 
-/**
- * enable_all_controllers() - Enable all available cgroup v2 controllers
- *
- * Enable all available cgroup v2 controllers in order to increase
- * the code coverage.
- *
- * If successful, 0 is returned.
- */
-static int enable_all_controllers(char *cgroup_path)
+
+static int __enable_controllers(const char *cgroup_path, const char *controllers)
 {
 	char path[PATH_MAX + 1];
-	char buf[PATH_MAX];
+	char enable[PATH_MAX + 1];
 	char *c, *c2;
 	int fd, cfd;
 	ssize_t len;
 
-	snprintf(path, sizeof(path), "%s/cgroup.controllers", cgroup_path);
-	fd = open(path, O_RDONLY);
-	if (fd < 0) {
-		log_err("Opening cgroup.controllers: %s", path);
-		return 1;
-	}
+	/* If no controllers are passed, enable all available controllers */
+	if (!controllers) {
+		snprintf(path, sizeof(path), "%s/cgroup.controllers",
+			 cgroup_path);
+		fd = open(path, O_RDONLY);
+		if (fd < 0) {
+			log_err("Opening cgroup.controllers: %s", path);
+			return 1;
+		}
 
-	len = read(fd, buf, sizeof(buf) - 1);
-	if (len < 0) {
+		len = read(fd, enable, sizeof(enable) - 1);
+		if (len < 0) {
+			close(fd);
+			log_err("Reading cgroup.controllers: %s", path);
+			return 1;
+		} else if (len == 0) /* No controllers to enable */
+			return 0;
+		enable[len] = 0;
 		close(fd);
-		log_err("Reading cgroup.controllers: %s", path);
-		return 1;
-	}
-	buf[len] = 0;
-	close(fd);
-
-	/* No controllers available? We're probably on cgroup v1. */
-	if (len == 0)
-		return 0;
+	} else
+		strncpy(enable, controllers, sizeof(enable));
 
 	snprintf(path, sizeof(path), "%s/cgroup.subtree_control", cgroup_path);
 	cfd = open(path, O_RDWR);
@@ -84,7 +86,7 @@ static int enable_all_controllers(char *cgroup_path)
 		return 1;
 	}
 
-	for (c = strtok_r(buf, " ", &c2); c; c = strtok_r(NULL, " ", &c2)) {
+	for (c = strtok_r(enable, " ", &c2); c; c = strtok_r(NULL, " ", &c2)) {
 		if (dprintf(cfd, "+%s\n", c) <= 0) {
 			log_err("Enabling controller %s: %s", c, path);
 			close(cfd);
@@ -95,6 +97,59 @@ static int enable_all_controllers(char *cgroup_path)
 	return 0;
 }
 
+/**
+ * enable_controllers() - Enable cgroup v2 controllers
+ * @relative_path: The cgroup path, relative to the workdir
+ * @controllers: List of controllers to enable in cgroup.controllers format
+ *
+ *
+ * Enable given cgroup v2 controllers, if @controllers is NULL, enable all
+ * available controllers.
+ *
+ * If successful, 0 is returned.
+ */
+int enable_controllers(const char *relative_path, const char *controllers)
+{
+	char cgroup_path[PATH_MAX + 1];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return __enable_controllers(cgroup_path, controllers);
+}
+
+/**
+ * write_cgroup_file() - Write to a cgroup file
+ * @relative_path: The cgroup path, relative to the workdir
+ * @buf: Buffer to write to the file
+ *
+ * Write to a file in the given cgroup's directory.
+ *
+ * If successful, 0 is returned.
+ */
+int write_cgroup_file(const char *relative_path, const char *file,
+		      const char *buf)
+{
+	char cgroup_path[PATH_MAX - 24];
+	char file_path[PATH_MAX + 1];
+	int fd;
+
+	format_cgroup_path(cgroup_path, relative_path);
+
+	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
+	fd = open(file_path, O_RDWR);
+	if (fd < 0) {
+		log_err("Opening %s", file_path);
+		return 1;
+	}
+
+	if (dprintf(fd, "%s", buf) <= 0) {
+		log_err("Writing to %s", file_path);
+		close(fd);
+		return 1;
+	}
+	close(fd);
+	return 0;
+}
+
 /**
  * setup_cgroup_environment() - Setup the cgroup environment
  *
@@ -133,7 +188,9 @@ int setup_cgroup_environment(void)
 		return 1;
 	}
 
-	if (enable_all_controllers(cgroup_workdir))
+	/* Enable all available controllers to increase test coverage */
+	if (__enable_controllers(CGROUP_MOUNT_PATH, NULL) ||
+	    __enable_controllers(cgroup_workdir, NULL))
 		return 1;
 
 	return 0;
@@ -173,7 +230,7 @@ static int join_cgroup_from_top(const char *cgroup_path)
 
 /**
  * join_cgroup() - Join a cgroup
- * @path: The cgroup path, relative to the workdir, to join
+ * @relative_path: The cgroup path, relative to the workdir, to join
  *
  * This function expects a cgroup to already be created, relative to the cgroup
  * work dir, and it joins it. For example, passing "/my-cgroup" as the path
@@ -182,11 +239,27 @@ static int join_cgroup_from_top(const char *cgroup_path)
  *
  * On success, it returns 0, otherwise on failure it returns 1.
  */
-int join_cgroup(const char *path)
+int join_cgroup(const char *relative_path)
+{
+	char cgroup_path[PATH_MAX + 1];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return join_cgroup_from_top(cgroup_path);
+}
+
+/**
+ * join_parent_cgroup() - Join a cgroup in the parent process workdir
+ * @relative_path: The cgroup path, relative to parent process workdir, to join
+ *
+ * See join_cgroup().
+ *
+ * On success, it returns 0, otherwise on failure it returns 1.
+ */
+int join_parent_cgroup(const char *relative_path)
 {
 	char cgroup_path[PATH_MAX + 1];
 
-	format_cgroup_path(cgroup_path, path);
+	format_parent_cgroup_path(cgroup_path, relative_path);
 	return join_cgroup_from_top(cgroup_path);
 }
 
@@ -212,9 +285,27 @@ void cleanup_cgroup_environment(void)
 	nftw(cgroup_workdir, nftwfunc, WALK_FD_LIMIT, FTW_DEPTH | FTW_MOUNT);
 }
 
+/**
+ * get_root_cgroup() - Get the FD of the root cgroup
+ *
+ * On success, it returns the file descriptor. On failure, it returns -1.
+ * If there is a failure, it prints the error to stderr.
+ */
+int get_root_cgroup(void)
+{
+	int fd;
+
+	fd = open(CGROUP_MOUNT_PATH, O_RDONLY);
+	if (fd < 0) {
+		log_err("Opening root cgroup");
+		return -1;
+	}
+	return fd;
+}
+
 /**
  * create_and_get_cgroup() - Create a cgroup, relative to workdir, and get the FD
- * @path: The cgroup path, relative to the workdir, to join
+ * @relative_path: The cgroup path, relative to the workdir, to join
  *
  * This function creates a cgroup under the top level workdir and returns the
  * file descriptor. It is idempotent.
@@ -222,14 +313,14 @@ void cleanup_cgroup_environment(void)
  * On success, it returns the file descriptor. On failure it returns -1.
  * If there is a failure, it prints the error to stderr.
  */
-int create_and_get_cgroup(const char *path)
+int create_and_get_cgroup(const char *relative_path)
 {
 	char cgroup_path[PATH_MAX + 1];
 	int fd;
 
-	format_cgroup_path(cgroup_path, path);
+	format_cgroup_path(cgroup_path, relative_path);
 	if (mkdir(cgroup_path, 0777) && errno != EEXIST) {
-		log_err("mkdiring cgroup %s .. %s", path, cgroup_path);
+		log_err("mkdiring cgroup %s .. %s", relative_path, cgroup_path);
 		return -1;
 	}
 
@@ -244,13 +335,13 @@ int create_and_get_cgroup(const char *path)
 
 /**
  * get_cgroup_id() - Get cgroup id for a particular cgroup path
- * @path: The cgroup path, relative to the workdir, to join
+ * @relative_path: The cgroup path, relative to the workdir, to join
  *
  * On success, it returns the cgroup id. On failure it returns 0,
  * which is an invalid cgroup id.
  * If there is a failure, it prints the error to stderr.
  */
-unsigned long long get_cgroup_id(const char *path)
+unsigned long long get_cgroup_id(const char *relative_path)
 {
 	int dirfd, err, flags, mount_id, fhsize;
 	union {
@@ -261,7 +352,7 @@ unsigned long long get_cgroup_id(const char *path)
 	struct file_handle *fhp, *fhp2;
 	unsigned long long ret = 0;
 
-	format_cgroup_path(cgroup_workdir, path);
+	format_cgroup_path(cgroup_workdir, relative_path);
 
 	dirfd = AT_FDCWD;
 	flags = 0;
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index fcc9cb91b2111..895e4de1174c9 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -10,11 +10,16 @@
 	__FILE__, __LINE__, clean_errno(), ##__VA_ARGS__)
 
 /* cgroupv2 related */
-int cgroup_setup_and_join(const char *path);
-int create_and_get_cgroup(const char *path);
-unsigned long long get_cgroup_id(const char *path);
+int enable_controllers(const char *relative_path, const char *controllers);
+int write_cgroup_file(const char *relative_path, const char *file,
+		      const char *buf);
+int cgroup_setup_and_join(const char *relative_path);
+int get_root_cgroup(void);
+int create_and_get_cgroup(const char *relative_path);
+unsigned long long get_cgroup_id(const char *relative_path);
 
-int join_cgroup(const char *path);
+int join_cgroup(const char *relative_path);
+int join_parent_cgroup(const char *relative_path);
 
 int setup_cgroup_environment(void);
 void cleanup_cgroup_environment(void);
@@ -26,4 +31,4 @@ int join_classid(void);
 int setup_classid_environment(void);
 void cleanup_classid_environment(void);
 
-#endif /* __CGROUP_HELPERS_H */
\ No newline at end of file
+#endif /* __CGROUP_HELPERS_H */
-- 
2.36.1.476.g0c4daa206d-goog


* [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
From: Yosry Ahmed @ 2022-06-10 19:44 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Add a selftest that tests the whole workflow for collecting,
aggregating (flushing), and displaying cgroup hierarchical stats.

TL;DR:
- Whenever reclaim happens, vmscan_start and vmscan_end update
  per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
  have updates.
- When userspace tries to read the stats, vmscan_dump calls rstat to flush
  the stats, and outputs the stats in text format to userspace (similar
  to cgroupfs stats).
- rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
  updates, vmscan_flush aggregates cpu readings and propagates updates
  to parents.

Detailed explanation:
- The test loads tracing bpf programs, vmscan_start and vmscan_end, to
  measure the latency of cgroup reclaim. Per-cgroup readings are stored
  in percpu maps for efficiency. When a cgroup reading is updated on a
  cpu, cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to
  the rstat updated tree on that cpu.

- A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
  each cgroup. Reading this file invokes the program, which calls
  cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
  cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
  the stats are exposed to the user. vmscan_dump returns 1 to terminate
  iteration early, so that we only expose stats for one cgroup per read.

- An fentry program, vmscan_flush, is also loaded and attached to
  bpf_rstat_flush (a sketch of its shape follows this list). When rstat
  flushing is ongoing, vmscan_flush is invoked
  once for each (cgroup, cpu) pair that has updates. cgroups are popped
  from the rstat tree in a bottom-up fashion, so calls will always be
  made for cgroups that have updates before their parents. The program
  aggregates percpu readings to a total per-cgroup reading, and also
  propagates them to the parent cgroup. After rstat flushing is over, all
  cgroups will have correct updated hierarchical readings (including all
  cpus and all their descendants).
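
A sketch of the shape of such a flush program (the maps and the
pending/total split are illustrative, not the exact selftest code; note
the use of bpf_map_lookup_percpu_elem() to read the given cpu's slot,
since the flusher may be running on a different cpu):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  /* per-cpu pending deltas, keyed by cgroup id (illustrative) */
  struct {
          __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
          __uint(max_entries, 64);
          __type(key, __u64);
          __type(value, __u64);
  } pending SEC(".maps");

  /* aggregated hierarchical totals, keyed by cgroup id (illustrative) */
  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 64);
          __type(key, __u64);
          __type(value, __u64);
  } total SEC(".maps");

  SEC("fentry/bpf_rstat_flush")
  int BPF_PROG(vmscan_flush, struct cgroup *cgrp, struct cgroup *parent,
               int cpu)
  {
          __u64 cg_id = cgrp->kn->id;
          __u64 *delta, *sum;

          delta = bpf_map_lookup_percpu_elem(&pending, &cg_id, cpu);
          if (!delta || !*delta)
                  return 0;

          /* fold this cpu's delta into the cgroup's total */
          sum = bpf_map_lookup_elem(&total, &cg_id);
          if (sum)
                  *sum += *delta;

          /* propagate to the parent, flushed after us (bottom-up) */
          if (parent) {
                  __u64 parent_id = parent->kn->id;
                  __u64 *parent_delta;

                  parent_delta = bpf_map_lookup_percpu_elem(&pending,
                                                            &parent_id,
                                                            cpu);
                  if (parent_delta)
                          *parent_delta += *delta;
          }
          *delta = 0;
          return 0;
  }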

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
 .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
 2 files changed, 585 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
new file mode 100644
index 0000000000000..b78a4043da49a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include <test_progs.h>
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+#include "cgroup_hierarchical_stats.skel.h"
+
+#define PAGE_SIZE 4096
+#define MB(x) ((x) << 20)
+
+#define BPFFS_ROOT "/sys/fs/bpf/"
+#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
+
+#define CG_ROOT_NAME "root"
+#define CG_ROOT_ID 1
+
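+/* e.g. CGROUP_PATH(/test, child1) => {.path = "/test/child1", .name = "child1"} */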
+#define CGROUP_PATH(p, n) {.path = #p"/"#n, .name = #n}
+
+static struct {
+	const char *path, *name;
+	unsigned long long id;
+	int fd;
+} cgroups[] = {
+	CGROUP_PATH(/, test),
+	CGROUP_PATH(/test, child1),
+	CGROUP_PATH(/test, child2),
+	CGROUP_PATH(/test/child1, child1_1),
+	CGROUP_PATH(/test/child1, child1_2),
+	CGROUP_PATH(/test/child2, child2_1),
+	CGROUP_PATH(/test/child2, child2_2),
+};
+
+#define N_CGROUPS ARRAY_SIZE(cgroups)
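+/* The first 3 entries above (test, child1, child2) have children */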
+#define N_NON_LEAF_CGROUPS 3
+
+static int root_cgroup_fd;
+static bool mounted_bpffs;
+
+static int read_from_file(const char *path, char *buf, size_t size)
+{
+	int fd, len;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		log_err("Open %s", path);
+		return -errno;
+	}
+	len = read(fd, buf, size - 1);
+	if (len < 0)
+		log_err("Read %s", path);
+	else
+		buf[len] = 0;
+	close(fd);
+	return len < 0 ? -errno : 0;
+}
+
+static int setup_bpffs(void)
+{
+	int err;
+
+	/* Mount bpffs */
+	err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
+	mounted_bpffs = !err;
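+	/* An EBUSY error means bpffs was already mounted; tolerate it */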
+	if (!ASSERT_OK(err && errno != EBUSY, "mount bpffs"))
+		return err;
+
+	/* Create a directory to contain stat files in bpffs */
+	err = mkdir(BPFFS_VMSCAN, 0755);
+	ASSERT_OK(err, "mkdir bpffs");
+	return err;
+}
+
+static void cleanup_bpffs(void)
+{
+	/* Remove created directory in bpffs */
+	ASSERT_OK(rmdir(BPFFS_VMSCAN), "rmdir "BPFFS_VMSCAN);
+
+	/* Unmount bpffs, if it wasn't already mounted when we started */
+	if (mounted_bpffs)
+		return;
+	ASSERT_OK(umount(BPFFS_ROOT), "unmount bpffs");
+}
+
+static int setup_cgroups(void)
+{
+	int i, fd, err;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		return err;
+
+	root_cgroup_fd = get_root_cgroup();
+	if (!ASSERT_GE(root_cgroup_fd, 0, "get_root_cgroup"))
+		return root_cgroup_fd;
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(fd, 0, "create_and_get_cgroup"))
+			return fd;
+
+		cgroups[i].fd = fd;
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+
+		/*
+		 * Enable memcg controller for the entire hierarchy.
+		 * Note that stats are collected for all cgroups in a hierarchy
+		 * with memcg enabled anyway, but are only exposed for cgroups
+		 * that have memcg enabled.
+		 */
+		if (i < N_NON_LEAF_CGROUPS) {
+			err = enable_controllers(cgroups[i].path, "memory");
+			if (!ASSERT_OK(err, "enable_controllers"))
+				return err;
+		}
+	}
+	return 0;
+}
+
+static void cleanup_cgroups(void)
+{
+	close(root_cgroup_fd);
+	for (int i = 0; i < N_CGROUPS; i++)
+		close(cgroups[i].fd);
+	cleanup_cgroup_environment();
+}
+
+static int setup_hierarchy(void)
+{
+	return setup_bpffs() || setup_cgroups();
+}
+
+static void destroy_hierarchy(void)
+{
+	cleanup_cgroups();
+	cleanup_bpffs();
+}
+
+static void alloc_anon(size_t size)
+{
+	char *buf, *ptr;
+
+	buf = malloc(size);
+	if (!buf)
+		return;
+	/* Touch one byte in every page to commit the anonymous memory */
+	for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+		*ptr = 0;
+	free(buf);
+}
+
+static int induce_vmscan(void)
+{
+	char size[128];
+	int i, err;
+
+	/*
+	 * Set memory.high for test parent cgroup to 1 MB to throttle
+	 * allocations and invoke reclaim in children.
+	 */
+	snprintf(size, 128, "%d", MB(1));
+	err = write_cgroup_file(cgroups[0].path, "memory.high", size);
+	if (!ASSERT_OK(err, "write memory.high"))
+		return err;
+	/*
+	 * In every leaf cgroup, run a memory hog for a few seconds to induce
+	 * reclaim then kill it.
+	 */
+	for (i = N_NON_LEAF_CGROUPS; i < N_CGROUPS; i++) {
+		pid_t pid = fork();
+
+		if (pid == 0) {
+			/* Join cgroup in the parent process workdir */
+			join_parent_cgroup(cgroups[i].path);
+
+			/* Allocate more memory than memory.high */
+			alloc_anon(MB(2));
+			exit(0);
+		} else {
+			/* Wait for child to cause reclaim then kill it */
+			if (!ASSERT_GT(pid, 0, "fork"))
+				return pid;
+			sleep(2);
+			kill(pid, SIGKILL);
+			waitpid(pid, NULL, 0);
+		}
+	}
+	return 0;
+}
+
+static unsigned long long get_cgroup_vmscan_delay(unsigned long long cgroup_id,
+						  const char *file_name)
+{
+	char buf[128], path[128];
+	unsigned long long vmscan = 0, id = 0;
+	int err;
+
+	/* For every cgroup, read the file generated by cgroup_iter */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
+	err = read_from_file(path, buf, 128);
+	if (!ASSERT_OK(err, "read cgroup_iter"))
+		return 0;
+
+	/* Check the output file formatting */
+	ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
+			 &id, &vmscan), 2, "output format");
+
+	/* Check that the cgroup_id is displayed correctly */
+	ASSERT_EQ(id, cgroup_id, "cgroup_id");
+	/* Check that the vmscan reading is non-zero */
+	ASSERT_GT(vmscan, 0, "vmscan_reading");
+	return vmscan;
+}
+
+static void check_vmscan_stats(void)
+{
+	int i;
+	unsigned long long vmscan_readings[N_CGROUPS], vmscan_root;
+
+	for (i = 0; i < N_CGROUPS; i++)
+		vmscan_readings[i] = get_cgroup_vmscan_delay(cgroups[i].id,
+							     cgroups[i].name);
+
+	/* Read stats for root too */
+	vmscan_root = get_cgroup_vmscan_delay(CG_ROOT_ID, CG_ROOT_NAME);
+
+	/* Check that child1 == child1_1 + child1_2 */
+	ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + vmscan_readings[4],
+		  "child1_vmscan");
+	/* Check that child2 == child2_1 + child2_2 */
+	ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + vmscan_readings[6],
+		  "child2_vmscan");
+	/* Check that test == child1 + child2 */
+	ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + vmscan_readings[2],
+		  "test_vmscan");
+	/* Check that root >= test */
+	ASSERT_GE(vmscan_root, vmscan_readings[0], "root_vmscan");
+}
+
+static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj, int cgroup_fd,
+			     const char *file_name)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo = {};
+	struct bpf_link *link;
+	char path[128];
+	int err;
+
+	/*
+	 * Create an iter link, parameterized by cgroup_fd.
+	 * We only want to traverse one cgroup, so set the traversal order to
+	 * "pre", and return 1 from dump_vmscan to stop iteration after the
+	 * first cgroup.
+	 */
+	linfo.cgroup.cgroup_fd = cgroup_fd;
+	linfo.cgroup.traversal_order = BPF_ITER_CGROUP_PRE;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+	link = bpf_program__attach_iter(obj->progs.dump_vmscan, &opts);
+	if (!ASSERT_OK_PTR(link, "attach iter"))
+		return libbpf_get_error(link);
+
+	/* Pin the link to a bpffs file */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
+	err = bpf_link__pin(link, path);
+	ASSERT_OK(err, "pin cgroup_iter");
+	return err;
+}
+
+static int setup_progs(struct cgroup_hierarchical_stats **skel)
+{
+	int i, err;
+	struct bpf_link *link;
+	struct cgroup_hierarchical_stats *obj;
+
+	obj = cgroup_hierarchical_stats__open_and_load();
+	if (!ASSERT_OK_PTR(obj, "open_and_load"))
+		return libbpf_get_error(obj);
+
+	/* Attach cgroup_iter program that will dump the stats to cgroups */
+	for (i = 0; i < N_CGROUPS; i++) {
+		err = setup_cgroup_iter(obj, cgroups[i].fd, cgroups[i].name);
+		if (!ASSERT_OK(err, "setup_cgroup_iter"))
+			return err;
+	}
+	/* Also dump stats for root */
+	err = setup_cgroup_iter(obj, root_cgroup_fd, CG_ROOT_NAME);
+	if (!ASSERT_OK(err, "setup_cgroup_iter"))
+		return err;
+
+	/* Attach rstat flusher */
+	link = bpf_program__attach(obj->progs.vmscan_flush);
+	if (!ASSERT_OK_PTR(link, "attach rstat"))
+		return libbpf_get_error(link);
+
+	/* Attach tracing programs that will calculate vmscan delays */
+	link = bpf_program__attach(obj->progs.vmscan_start);
+	if (!ASSERT_OK_PTR(link, "attach raw_tracepoint"))
+		return libbpf_get_error(link);
+
+	link = bpf_program__attach(obj->progs.vmscan_end);
+	if (!ASSERT_OK_PTR(link, "attach raw_tracepoint"))
+		return libbpf_get_error(link);
+
+	*skel = obj;
+	return 0;
+}
+
+static void destroy_progs(struct cgroup_hierarchical_stats *skel)
+{
+	char path[128];
+	int i;
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		/* Delete files in bpffs that cgroup_iters are pinned in */
+		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
+			 cgroups[i].name);
+		ASSERT_OK(remove(path), "remove cgroup_iter pin");
+	}
+
+	/* Delete root file in bpffs */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, CG_ROOT_NAME);
+	ASSERT_OK(remove(path), "remove cgroup_iter root pin");
+	cgroup_hierarchical_stats__destroy(skel);
+}
+
+void test_cgroup_hierarchical_stats(void)
+{
+	struct cgroup_hierarchical_stats *skel = NULL;
+
+	if (setup_hierarchy())
+		goto hierarchy_cleanup;
+	if (setup_progs(&skel))
+		goto cleanup;
+	if (induce_vmscan())
+		goto cleanup;
+	check_vmscan_stats();
+cleanup:
+	destroy_progs(skel);
+hierarchy_cleanup:
+	destroy_hierarchy();
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
new file mode 100644
index 0000000000000..fd2028f1ed70b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
@@ -0,0 +1,234 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * bpf programs for collecting and dumping cgroup hierarchical vmscan stats
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+/*
+ * Start times are stored per-task, not per-cgroup, as multiple tasks in one
+ * cgroup can perform reclaim concurrently.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, __u64);
+} vmscan_start_time SEC(".maps");
+
+struct vmscan_percpu {
+	/* Previous percpu state, to figure out if we have new updates */
+	__u64 prev;
+	/* Current percpu state */
+	__u64 state;
+};
+
+struct vmscan {
+	/* State propagated through children, pending aggregation */
+	__u64 pending;
+	/* Total state, including all cpus and all children */
+	__u64 state;
+};
+
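+/* Per-cpu map of vmscan delay readings, keyed by cgroup id */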
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
+	__uint(max_entries, 10);
+	__type(key, __u64);
+	__type(value, struct vmscan_percpu);
+} pcpu_cgroup_vmscan_elapsed SEC(".maps");
+
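+/* Map of aggregated hierarchical vmscan delay readings, keyed by cgroup id */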
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__type(key, __u64);
+	__type(value, struct vmscan);
+} cgroup_vmscan_elapsed SEC(".maps");
+
+extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
+extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
+
+static inline struct cgroup *task_memcg(struct task_struct *task)
+{
+	return task->cgroups->subsys[memory_cgrp_id]->cgroup;
+}
+
+static inline __u64 cgroup_id(struct cgroup *cgrp)
+{
+	return cgrp->kn->id;
+}
+
+static inline int create_vmscan_percpu_elem(__u64 cg_id, __u64 state)
+{
+	struct vmscan_percpu pcpu_init = {.state = state, .prev = 0};
+
+	if (bpf_map_update_elem(&pcpu_cgroup_vmscan_elapsed, &cg_id,
+				&pcpu_init, BPF_NOEXIST)) {
+		bpf_printk("failed to create pcpu entry for cgroup %llu\n",
+			   cg_id);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int create_vmscan_elem(__u64 cg_id, __u64 state, __u64 pending)
+{
+	struct vmscan init = {.state = state, .pending = pending};
+
+	if (bpf_map_update_elem(&cgroup_vmscan_elapsed, &cg_id,
+				&init, BPF_NOEXIST)) {
+		bpf_printk("failed to create entry for cgroup %llu\n",
+			   cg_id);
+		return 1;
+	}
+	return 0;
+}
+
+SEC("tp_btf/mm_vmscan_memcg_reclaim_begin")
+int BPF_PROG(vmscan_start, struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	__u64 *start_time_ptr;
+
+	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, task, 0,
+					      BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!start_time_ptr) {
+		bpf_printk("error retrieving storage\n");
+		return 0;
+	}
+
+	*start_time_ptr = bpf_ktime_get_ns();
+	return 0;
+}
+
+SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
+int BPF_PROG(vmscan_end, struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct task_struct *current = bpf_get_current_task_btf();
+	struct cgroup *cgrp;
+	__u64 *start_time_ptr;
+	__u64 current_elapsed, cg_id;
+	__u64 end_time = bpf_ktime_get_ns();
+
+	/*
+	 * cgrp is the first parent cgroup of current that has memcg enabled in
+	 * its subtree_control, or NULL if memcg is disabled in the entire tree.
+	 * In a cgroup hierarchy like this:
+	 *                               a
+	 *                              / \
+	 *                             b   c
+	 *  If "a" has memcg enabled, while "b" doesn't, then processes in "b"
+	 *  will accumulate their stats directly to "a". This makes sure that no
+	 *  stats are lost from processes in leaf cgroups that don't have memcg
+	 *  enabled, but only exposes stats for cgroups that have memcg enabled.
+	 */
+	cgrp = task_memcg(current);
+	if (!cgrp)
+		return 0;
+
+	cg_id = cgroup_id(cgrp);
+	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, current, 0,
+					      BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!start_time_ptr) {
+		bpf_printk("error retrieving local storage\n");
+		return 0;
+	}
+
+	current_elapsed = end_time - *start_time_ptr;
+	pcpu_stat = bpf_map_lookup_elem(&pcpu_cgroup_vmscan_elapsed,
+					&cg_id);
+	if (pcpu_stat)
+		__sync_fetch_and_add(&pcpu_stat->state, current_elapsed);
+	else
+		create_vmscan_percpu_elem(cg_id, current_elapsed);
+
+	cgroup_rstat_updated(cgrp, bpf_get_smp_processor_id());
+	return 0;
+}
+
+SEC("fentry/bpf_rstat_flush")
+int BPF_PROG(vmscan_flush, struct cgroup *cgrp, struct cgroup *parent, int cpu)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct vmscan *total_stat, *parent_stat;
+	__u64 cg_id = cgroup_id(cgrp);
+	__u64 parent_cg_id = parent ? cgroup_id(parent) : 0;
+	__u64 state;
+	__u64 delta = 0;
+
+	/* Add CPU changes on this level since the last flush */
+	pcpu_stat = bpf_map_lookup_percpu_elem(&pcpu_cgroup_vmscan_elapsed,
+					       &cg_id, cpu);
+	if (pcpu_stat) {
+		state = pcpu_stat->state;
+		delta += state - pcpu_stat->prev;
+		pcpu_stat->prev = state;
+	}
+
+	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		create_vmscan_elem(cg_id, delta, 0);
+		goto update_parent;
+	}
+
+	/* Collect pending stats from subtree */
+	if (total_stat->pending) {
+		delta += total_stat->pending;
+		total_stat->pending = 0;
+	}
+
+	/* Propagate changes to this cgroup's total */
+	total_stat->state += delta;
+
+update_parent:
+	/* Skip if there are no changes to propagate, or no parent */
+	if (!delta || !parent_cg_id)
+		return 0;
+
+	/* Propagate changes to cgroup's parent */
+	parent_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed,
+					  &parent_cg_id);
+	if (parent_stat)
+		parent_stat->pending += delta;
+	else
+		create_vmscan_elem(parent_cg_id, 0, delta);
+
+	return 0;
+}
+
+SEC("iter.s/cgroup")
+int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp)
+{
+	struct seq_file *seq = meta->seq;
+	struct vmscan *total_stat;
+	__u64 cg_id;
+
+	/* Do nothing for the terminal call */
+	if (!cgrp)
+		return 1;
+
+	cg_id = cgroup_id(cgrp);
+
+	/* Flush the stats to make sure we get the most updated numbers */
+	cgroup_rstat_flush(cgrp);
+
+	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		bpf_printk("error finding stats for cgroup %llu\n", cg_id);
+		BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: -1\n",
+			       cg_id);
+		return 1;
+	}
+	BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
+		       cg_id, total_stat->state);
+
+	/*
+	 * We only dump stats for one cgroup here, so return 1 to stop
+	 * iteration after the first cgroup.
+	 */
+	return 1;
+}
-- 
2.36.1.476.g0c4daa206d-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 0/8] bpf: rstat: cgroup hierarchical stats
  2022-06-10 19:44 [PATCH bpf-next v2 0/8] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
                   ` (7 preceding siblings ...)
  2022-06-10 19:44 ` [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection Yosry Ahmed
@ 2022-06-10 19:48 ` Yosry Ahmed
  8 siblings, 0 replies; 46+ messages in thread
From: Yosry Ahmed @ 2022-06-10 19:48 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups, Michal Koutný

+cc Michal Koutný <mkoutny@suse.com>


On Fri, Jun 10, 2022 at 12:44 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> This patch series allows for using bpf to collect hierarchical cgroup
> stats efficiently by integrating with the rstat framework. The rstat
> framework provides an efficient way to collect cgroup stats percpu and
> propagate them through the cgroup hierarchy.
>
> The stats are exposed to userspace in textual form by reading files in
> bpffs, similar to cgroupfs stats by using a cgroup_iter program.
> cgroup_iter is a type of bpf_iter. It walks over cgroups in two modes:
>
>  - walking a cgroup's descendants.
>  - walking a cgroup's ancestors.
>
> When attaching cgroup_iter, one needs to set a cgroup to the iter_link
> created from attaching. This cgroup is passed as a file descriptor and
> serves as the starting point of the walk.
>
> For walking descendants, one can specify the order: either pre-order or
> post-order. For walking ancestors, the walk starts at the specified
> cgroup and ends at the root.
>
> One can also terminate the walk early by returning 1 from the iter
> program.
>
> Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
> program is called with cgroup_mutex held.
>
> ** Background on rstat for stats collection **
> (I am using a subscriber analogy that is not commonly used)
>
> The rstat framework maintains a tree of cgroups that have updates and
> which cpus have updates. A subscriber to the rstat framework maintains
> their own stats. The framework is used to tell the subscriber when
> and what to flush, for the most efficient stats propagation. The
> workflow is as follows:
>
> - When a subscriber updates a cgroup on a cpu, it informs the rstat
>   framework by calling cgroup_rstat_updated(cgrp, cpu).
>
> - When a subscriber wants to read some stats for a cgroup, it asks
>   the rstat framework to initiate a stats flush (propagation) by calling
>   cgroup_rstat_flush(cgrp).
>
> - When the rstat framework initiates a flush, it makes callbacks to
>   subscribers to aggregate stats on cpus that have updates, and
>   propagate updates to their parent.
>
> Currently, the main subscribers to the rstat framework are cgroup
> subsystems (e.g. memory, block). This patch series allow bpf programs to
> become subscribers as well.
>
> Patches in this series are based off a patch in the mailing
> list which adds a new kfunc set for sleepable functions:
> "btf: Add a new kfunc set which allows to mark a function to be
> sleepable" [1].
>
> Patches in this series are organized as follows:
> * Patch 1 enables the use of cgroup_get_from_file() in cgroup1.
>   This is useful because it enables cgroup_iter to work with cgroup1, and
>   allows the entire stat collection workflow to be cgroup1-compatible.
> * Patches 2-5 introduce cgroup_iter prog, and a selftest.
> * Patches 6-8 allow bpf programs to integrate with rstat by adding the
>   necessary hook points and kfunc. A comprehensive selftest that
>   demonstrates the entire workflow for using bpf and rstat to
>   efficiently collect and output cgroup stats is added.
>
> v1 -> v2:
> - Redesign of cgroup_iter from v1, based on Alexei's idea [2]:
>   - supports walking cgroup subtree.
>   - supports walking ancestors of a cgroup. (Andrii)
>   - supports terminating the walk early.
>   - uses fd instead of cgroup_id as parameter for iter_link. Using fd is
>     a convention in bpf.
>   - gets cgroup's ref at attach time and deref at detach.
>   - brought back cgroup1 support for cgroup_iter.
> - Squashed the patches adding the rstat flush hook points and kfuncs
>   (Tejun).
> - Added a comment explaining why bpf_rstat_flush() needs to be weak
>   (Tejun).
> - Updated the final selftest with the new cgroup_iter design.
> - Changed CHECKs in the selftest with ASSERTs (Yonghong, Andrii).
> - Removed empty line at the end of the selftest (Yonghong).
> - Renamed test files to cgroup_hierarchical_stats.c.
> - Reordered CGROUP_PATH params order to match struct declaration
>   in the selftest (Michal).
> - Removed memory_subsys_enabled() and made sure memcg controller
>   enablement checks make sense and are documented (Michal).
>
> RFC v2 -> v1:
> - Instead of introducing a new program type for rstat flushing, add an
>   empty hook point, bpf_rstat_flush(), and use fentry bpf programs to
>   attach to it and flush bpf stats.
> - Instead of using helpers, use kfuncs for rstat functions.
> - These changes simplify the patchset greatly, with minimal changes to
>   uapi.
>
> RFC v1 -> RFC v2:
> - Instead of rstat flush programs attach to subsystems, they now attach
>   to rstat (global flushers, not per-subsystem), based on discussions
>   with Tejun. The first patch is entirely rewritten.
> - Pass cgroup pointers to rstat flushers instead of cgroup ids. This is
>   much more flexibility and less likely to need a uapi update later.
> - rstat helpers are now only defined if CGROUP_CONFIG.
> - Most of the code is now only defined if CGROUP_CONFIG and
>   CONFIG_BPF_SYSCALL.
> - Move rstat helper protos from bpf_base_func_proto() to
>   tracing_prog_func_proto().
> - rstat helpers argument (cgroup pointer) is now ARG_PTR_TO_BTF_ID, not
>   ARG_ANYTHING.
> - Rewrote the selftest to use the cgroup helpers.
> - Dropped bpf_map_lookup_percpu_elem (already added by Feng).
> - Dropped patch to support cgroup v1 for cgroup_iter.
> - Dropped patch to define some cgroup_put() when !CONFIG_CGROUP. The
>   code that calls it is no longer compiled when !CONFIG_CGROUP.
>
> cgroup_iter was originally introduced in a different patch series[3].
> Hao and I agreed that it fits better as part of this series.
> RFC v1 of this patch series had the following changes from [3]:
> - Getting the cgroup's reference at the time at attaching, instead of
>   at the time when iterating. (Yonghong)
> - Remove .init_seq_private and .fini_seq_private callbacks for
>   cgroup_iter. They are not needed now. (Yonghong)
>
> [1] https://lore.kernel.org/bpf/20220421140740.459558-5-benjamin.tissoires@redhat.com/
> [2] https://lore.kernel.org/bpf/20220520221919.jnqgv52k4ajlgzcl@MBP-98dd607d3435.dhcp.thefacebook.com/
> [3] https://lore.kernel.org/lkml/20220225234339.2386398-9-haoluo@google.com/
>
> Hao Luo (4):
>   cgroup: Add cgroup_put() in !CONFIG_CGROUPS case
>   bpf, iter: Fix the condition on p when calling stop.
>   bpf: Introduce cgroup iter
>   selftests/bpf: Test cgroup_iter.
>
> Yosry Ahmed (4):
>   cgroup: enable cgroup_get_from_file() on cgroup1
>   cgroup: bpf: enable bpf programs to integrate with rstat
>   selftests/bpf: extend cgroup helpers
>   bpf: add a selftest for cgroup hierarchical stats collection
>
>  include/linux/bpf.h                           |   8 +
>  include/linux/cgroup.h                        |   3 +
>  include/uapi/linux/bpf.h                      |  21 ++
>  kernel/bpf/Makefile                           |   2 +-
>  kernel/bpf/bpf_iter.c                         |   5 +
>  kernel/bpf/cgroup_iter.c                      | 235 ++++++++++++
>  kernel/cgroup/cgroup.c                        |   5 -
>  kernel/cgroup/rstat.c                         |  46 +++
>  tools/include/uapi/linux/bpf.h                |  21 ++
>  tools/testing/selftests/bpf/cgroup_helpers.c  | 173 +++++++--
>  tools/testing/selftests/bpf/cgroup_helpers.h  |  15 +-
>  .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
>  .../selftests/bpf/prog_tests/cgroup_iter.c    | 190 ++++++++++
>  tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
>  .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
>  .../testing/selftests/bpf/progs/cgroup_iter.c |  39 ++
>  16 files changed, 1303 insertions(+), 52 deletions(-)
>  create mode 100644 kernel/bpf/cgroup_iter.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter.c
>  create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
>  create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter.c
>
> --
> 2.36.1.476.g0c4daa206d-goog
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat
  2022-06-10 19:44 ` [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat Yosry Ahmed
@ 2022-06-10 20:52   ` kernel test robot
  2022-06-10 21:22   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: kernel test robot @ 2022-06-10 20:52 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Michal Hocko
  Cc: kbuild-all, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups,
	Yosry Ahmed

Hi Yosry,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20220611/202206110457.uD5lLvbh-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/83f297e2b47dc41b511f071b9eadf38339387b41
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
        git checkout 83f297e2b47dc41b511f071b9eadf38339387b41
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash kernel/cgroup/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/cgroup/rstat.c:161:22: warning: no previous prototype for 'bpf_rstat_flush' [-Wmissing-prototypes]
     161 | __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
         |                      ^~~~~~~~~~~~~~~
   kernel/cgroup/rstat.c:509:10: error: 'const struct btf_kfunc_id_set' has no member named 'sleepable_set'; did you mean 'release_set'?
     509 |         .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
         |          ^~~~~~~~~~~~~
         |          release_set
>> kernel/cgroup/rstat.c:509:27: warning: excess elements in struct initializer
     509 |         .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
         |                           ^
   kernel/cgroup/rstat.c:509:27: note: (near initialization for 'bpf_rstat_kfunc_set')


vim +/bpf_rstat_flush +161 kernel/cgroup/rstat.c

   148	
   149	/*
   150	 * A hook for bpf stat collectors to attach to and flush their stats.
   151	 * Together with providing bpf kfuncs for cgroup_rstat_updated() and
   152	 * cgroup_rstat_flush(), this enables a complete workflow where bpf progs that
   153	 * collect cgroup stats can integrate with rstat for efficient flushing.
   154	 *
   155	 * A static noinline declaration here could cause the compiler to optimize away
   156	 * the function. A global noinline declaration will keep the definition, but may
   157	 * optimize away the callsite. Therefore, __weak is needed to ensure that the
   158	 * call is still emitted, by telling the compiler that we don't know what the
   159	 * function might eventually be.
   160	 */
 > 161	__weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
   162					     struct cgroup *parent, int cpu)
   163	{
   164	}
   165	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat
  2022-06-10 19:44 ` [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat Yosry Ahmed
  2022-06-10 20:52   ` kernel test robot
@ 2022-06-10 21:22   ` kernel test robot
  2022-06-10 21:30     ` Yosry Ahmed
  2022-06-11 10:22   ` kernel test robot
  2022-06-28  6:12   ` Yonghong Song
  3 siblings, 1 reply; 46+ messages in thread
From: kernel test robot @ 2022-06-10 21:22 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Michal Hocko
  Cc: kbuild-all, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups,
	Yosry Ahmed

Hi Yosry,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20220611/202206110544.D5cTU0WQ-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/83f297e2b47dc41b511f071b9eadf38339387b41
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
        git checkout 83f297e2b47dc41b511f071b9eadf38339387b41
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   kernel/cgroup/rstat.c:161:22: warning: no previous prototype for 'bpf_rstat_flush' [-Wmissing-prototypes]
     161 | __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
         |                      ^~~~~~~~~~~~~~~
>> kernel/cgroup/rstat.c:509:10: error: 'const struct btf_kfunc_id_set' has no member named 'sleepable_set'; did you mean 'release_set'?
     509 |         .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
         |          ^~~~~~~~~~~~~
         |          release_set
   kernel/cgroup/rstat.c:509:27: warning: excess elements in struct initializer
     509 |         .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
         |                           ^
   kernel/cgroup/rstat.c:509:27: note: (near initialization for 'bpf_rstat_kfunc_set')


vim +509 kernel/cgroup/rstat.c

   505	
   506	static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
   507		.owner		= THIS_MODULE,
   508		.check_set	= &bpf_rstat_check_kfunc_ids,
 > 509		.sleepable_set	= &bpf_rstat_sleepable_kfunc_ids,
   510	};
   511	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat
  2022-06-10 21:22   ` kernel test robot
@ 2022-06-10 21:30     ` Yosry Ahmed
  2022-06-11 19:57       ` Alexei Starovoitov
  0 siblings, 1 reply; 46+ messages in thread
From: Yosry Ahmed @ 2022-06-10 21:30 UTC (permalink / raw)
  To: kernel test robot
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko, kbuild-all, Roman Gushchin,
	David Rientjes, Stanislav Fomichev, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, Cgroups

On Fri, Jun 10, 2022 at 2:23 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi Yosry,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on bpf-next/master]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
> config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20220611/202206110544.D5cTU0WQ-lkp@intel.com/config)
> compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
> reproduce (this is a W=1 build):
>         # https://github.com/intel-lab-lkp/linux/commit/83f297e2b47dc41b511f071b9eadf38339387b41
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
>         git checkout 83f297e2b47dc41b511f071b9eadf38339387b41
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash
>
> If you fix the issue, kindly add following tag where applicable
> Reported-by: kernel test robot <lkp@intel.com>
>
> All errors (new ones prefixed by >>):
>
>    kernel/cgroup/rstat.c:161:22: warning: no previous prototype for 'bpf_rstat_flush' [-Wmissing-prototypes]
>      161 | __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
>          |                      ^~~~~~~~~~~~~~~
> >> kernel/cgroup/rstat.c:509:10: error: 'const struct btf_kfunc_id_set' has no member named 'sleepable_set'; did you mean 'release_set'?
>      509 |         .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
>          |          ^~~~~~~~~~~~~
>          |          release_set
>    kernel/cgroup/rstat.c:509:27: warning: excess elements in struct initializer
>      509 |         .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
>          |                           ^
>    kernel/cgroup/rstat.c:509:27: note: (near initialization for 'bpf_rstat_kfunc_set')
>
>
> vim +509 kernel/cgroup/rstat.c
>
>    505
>    506  static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
>    507          .owner          = THIS_MODULE,
>    508          .check_set      = &bpf_rstat_check_kfunc_ids,
>  > 509          .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
>    510  };
>    511
>
> --
> 0-DAY CI Kernel Test Service
> https://01.org/lkp

AFAICT these failures are because the patch series depends on a patch
in the mailing list [1] that is not in bpf-next, as explained in the
cover letter.

[1] https://lore.kernel.org/bpf/20220421140740.459558-5-benjamin.tissoires@redhat.com/

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
  2022-06-10 19:44 ` [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter Yosry Ahmed
@ 2022-06-11  6:23   ` kernel test robot
  2022-06-11  7:34   ` kernel test robot
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 46+ messages in thread
From: kernel test robot @ 2022-06-11  6:23 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Michal Hocko
  Cc: kbuild-all, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups,
	Yosry Ahmed

Hi Yosry,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: m68k-defconfig (https://download.01.org/0day-ci/archive/20220611/202206111453.xWWh2wMK-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/619857fd1ec4f351376ffcaaec20acc9aae9486f
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
        git checkout 619857fd1ec4f351376ffcaaec20acc9aae9486f
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash kernel/bpf/ kernel/cgroup/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from kernel/bpf/cgroup_iter.c:9:
   kernel/bpf/../cgroup/cgroup-internal.h:78:41: error: field 'iter' has incomplete type
      78 |                 struct css_task_iter    iter;
         |                                         ^~~~
   kernel/bpf/../cgroup/cgroup-internal.h: In function 'cgroup_is_dead':
   kernel/bpf/../cgroup/cgroup-internal.h:188:22: error: invalid use of undefined type 'const struct cgroup'
     188 |         return !(cgrp->self.flags & CSS_ONLINE);
         |                      ^~
   kernel/bpf/../cgroup/cgroup-internal.h:188:37: error: 'CSS_ONLINE' undeclared (first use in this function); did you mean 'N_ONLINE'?
     188 |         return !(cgrp->self.flags & CSS_ONLINE);
         |                                     ^~~~~~~~~~
         |                                     N_ONLINE
   kernel/bpf/../cgroup/cgroup-internal.h:188:37: note: each undeclared identifier is reported only once for each function it appears in
   kernel/bpf/../cgroup/cgroup-internal.h: In function 'notify_on_release':
   kernel/bpf/../cgroup/cgroup-internal.h:193:25: error: 'CGRP_NOTIFY_ON_RELEASE' undeclared (first use in this function)
     193 |         return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/../cgroup/cgroup-internal.h:193:54: error: invalid use of undefined type 'const struct cgroup'
     193 |         return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
         |                                                      ^~
   kernel/bpf/../cgroup/cgroup-internal.h: In function 'put_css_set':
   kernel/bpf/../cgroup/cgroup-internal.h:207:39: error: invalid use of undefined type 'struct css_set'
     207 |         if (refcount_dec_not_one(&cset->refcount))
         |                                       ^~
   kernel/bpf/../cgroup/cgroup-internal.h: In function 'get_css_set':
   kernel/bpf/../cgroup/cgroup-internal.h:220:27: error: invalid use of undefined type 'struct css_set'
     220 |         refcount_inc(&cset->refcount);
         |                           ^~
   kernel/bpf/../cgroup/cgroup-internal.h: At top level:
   kernel/bpf/../cgroup/cgroup-internal.h:284:22: error: array type has incomplete element type 'struct cftype'
     284 | extern struct cftype cgroup1_base_files[];
         |                      ^~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_start':
   kernel/bpf/cgroup_iter.c:55:24: error: implicit declaration of function 'css_next_descendant_pre' [-Werror=implicit-function-declaration]
      55 |                 return css_next_descendant_pre(NULL, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~
>> kernel/bpf/cgroup_iter.c:55:24: warning: returning 'int' from a function with return type 'void *' makes pointer from integer without a cast [-Wint-conversion]
      55 |                 return css_next_descendant_pre(NULL, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:57:24: error: implicit declaration of function 'css_next_descendant_post' [-Werror=implicit-function-declaration]
      57 |                 return css_next_descendant_post(NULL, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:57:24: warning: returning 'int' from a function with return type 'void *' makes pointer from integer without a cast [-Wint-conversion]
      57 |                 return css_next_descendant_post(NULL, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_next':
   kernel/bpf/cgroup_iter.c:83:24: warning: returning 'int' from a function with return type 'void *' makes pointer from integer without a cast [-Wint-conversion]
      83 |                 return css_next_descendant_pre(curr, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:85:24: warning: returning 'int' from a function with return type 'void *' makes pointer from integer without a cast [-Wint-conversion]
      85 |                 return css_next_descendant_post(curr, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:87:28: error: invalid use of undefined type 'struct cgroup_subsys_state'
      87 |                 return curr->parent;
         |                            ^~
   kernel/bpf/cgroup_iter.c: In function '__cgroup_iter_seq_show':
   kernel/bpf/cgroup_iter.c:100:38: error: invalid use of undefined type 'struct cgroup_subsys_state'
     100 |         if (css && cgroup_is_dead(css->cgroup))
         |                                      ^~
   kernel/bpf/cgroup_iter.c:104:31: error: invalid use of undefined type 'struct cgroup_subsys_state'
     104 |         ctx.cgroup = css ? css->cgroup : NULL;
         |                               ^~
   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_init':
   kernel/bpf/cgroup_iter.c:137:29: error: invalid use of undefined type 'struct cgroup'
     137 |         p->start_css = &cgrp->self;
         |                             ^~
   kernel/bpf/cgroup_iter.c: In function 'bpf_iter_attach_cgroup':
   kernel/bpf/cgroup_iter.c:157:24: error: implicit declaration of function 'cgroup_get_from_fd'; did you mean 'cgroup_get_from_id'? [-Werror=implicit-function-declaration]
     157 |                 cgrp = cgroup_get_from_fd(fd);
         |                        ^~~~~~~~~~~~~~~~~~
         |                        cgroup_get_from_id
>> kernel/bpf/cgroup_iter.c:157:22: warning: assignment to 'struct cgroup *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
     157 |                 cgrp = cgroup_get_from_fd(fd);
         |                      ^
   kernel/bpf/cgroup_iter.c:159:24: error: implicit declaration of function 'cgroup_get_from_path'; did you mean 'cgroup_get_from_id'? [-Werror=implicit-function-declaration]
     159 |                 cgrp = cgroup_get_from_path("/");
         |                        ^~~~~~~~~~~~~~~~~~~~
         |                        cgroup_get_from_id
   kernel/bpf/cgroup_iter.c:159:22: warning: assignment to 'struct cgroup *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
     159 |                 cgrp = cgroup_get_from_path("/");
         |                      ^
   kernel/bpf/cgroup_iter.c: In function 'bpf_iter_cgroup_show_fdinfo':
   kernel/bpf/cgroup_iter.c:190:9: error: implicit declaration of function 'cgroup_path_ns'; did you mean 'cgroup_parent'? [-Werror=implicit-function-declaration]
     190 |         cgroup_path_ns(aux->cgroup.start, buf, PATH_MAX,
         |         ^~~~~~~~~~~~~~
         |         cgroup_parent
   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_next':
   kernel/bpf/cgroup_iter.c:88:1: error: control reaches end of non-void function [-Werror=return-type]
      88 | }
         | ^
   cc1: some warnings being treated as errors


vim +55 kernel/bpf/cgroup_iter.c

    41	
    42	static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
    43	{
    44		struct cgroup_iter_priv *p = seq->private;
    45	
    46		mutex_lock(&cgroup_mutex);
    47	
    48		/* support only one session */
    49		if (*pos > 0)
    50			return NULL;
    51	
    52		++*pos;
    53		p->terminate = false;
    54		if (p->order == BPF_ITER_CGROUP_PRE)
  > 55			return css_next_descendant_pre(NULL, p->start_css);
    56		else if (p->order == BPF_ITER_CGROUP_POST)
    57			return css_next_descendant_post(NULL, p->start_css);
    58		else /* BPF_ITER_CGROUP_PARENT_UP */
    59			return p->start_css;
    60	}
    61	
    62	static int __cgroup_iter_seq_show(struct seq_file *seq,
    63					  struct cgroup_subsys_state *css, int in_stop);
    64	
    65	static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
    66	{
    67		/* pass NULL to the prog for post-processing */
    68		if (!v)
    69			__cgroup_iter_seq_show(seq, NULL, true);
    70		mutex_unlock(&cgroup_mutex);
    71	}
    72	
    73	static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
    74	{
    75		struct cgroup_subsys_state *curr = (struct cgroup_subsys_state *)v;
    76		struct cgroup_iter_priv *p = seq->private;
    77	
    78		++*pos;
    79		if (p->terminate)
    80			return NULL;
    81	
    82		if (p->order == BPF_ITER_CGROUP_PRE)
    83			return css_next_descendant_pre(curr, p->start_css);
    84		else if (p->order == BPF_ITER_CGROUP_POST)
    85			return css_next_descendant_post(curr, p->start_css);
    86		else
    87			return curr->parent;
    88	}
    89	
    90	static int __cgroup_iter_seq_show(struct seq_file *seq,
    91					  struct cgroup_subsys_state *css, int in_stop)
    92	{
    93		struct cgroup_iter_priv *p = seq->private;
    94		struct bpf_iter__cgroup ctx;
    95		struct bpf_iter_meta meta;
    96		struct bpf_prog *prog;
    97		int ret = 0;
    98	
    99		/* cgroup is dead, skip this element */
   100		if (css && cgroup_is_dead(css->cgroup))
   101			return 0;
   102	
   103		ctx.meta = &meta;
   104		ctx.cgroup = css ? css->cgroup : NULL;
   105		meta.seq = seq;
   106		prog = bpf_iter_get_info(&meta, in_stop);
   107		if (prog)
   108			ret = bpf_iter_run_prog(prog, &ctx);
   109	
   110		/* if prog returns > 0, terminate after this element. */
   111		if (ret != 0)
   112			p->terminate = true;
   113	
   114		return 0;
   115	}
   116	
   117	static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
   118	{
   119		return __cgroup_iter_seq_show(seq, (struct cgroup_subsys_state *)v,
   120					      false);
   121	}
   122	
   123	static const struct seq_operations cgroup_iter_seq_ops = {
   124		.start  = cgroup_iter_seq_start,
   125		.next   = cgroup_iter_seq_next,
   126		.stop   = cgroup_iter_seq_stop,
   127		.show   = cgroup_iter_seq_show,
   128	};
   129	
   130	BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
   131	
   132	static int cgroup_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
   133	{
   134		struct cgroup_iter_priv *p = (struct cgroup_iter_priv *)priv;
   135		struct cgroup *cgrp = aux->cgroup.start;
   136	
   137		p->start_css = &cgrp->self;
   138		p->terminate = false;
   139		p->order = aux->cgroup.order;
   140		return 0;
   141	}
   142	
   143	static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
   144		.seq_ops                = &cgroup_iter_seq_ops,
   145		.init_seq_private       = cgroup_iter_seq_init,
   146		.seq_priv_size          = sizeof(struct cgroup_iter_priv),
   147	};
   148	
   149	static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
   150					  union bpf_iter_link_info *linfo,
   151					  struct bpf_iter_aux_info *aux)
   152	{
   153		int fd = linfo->cgroup.cgroup_fd;
   154		struct cgroup *cgrp;
   155	
   156		if (fd)
 > 157			cgrp = cgroup_get_from_fd(fd);
   158		else /* walk the entire hierarchy by default. */
   159			cgrp = cgroup_get_from_path("/");
   160	
   161		if (IS_ERR(cgrp))
   162			return PTR_ERR(cgrp);
   163	
   164		aux->cgroup.start = cgrp;
   165		aux->cgroup.order = linfo->cgroup.traversal_order;
   166		return 0;
   167	}
   168	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
  2022-06-10 19:44 ` [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter Yosry Ahmed
  2022-06-11  6:23   ` kernel test robot
@ 2022-06-11  7:34   ` kernel test robot
  2022-06-11 12:44   ` kernel test robot
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 46+ messages in thread
From: kernel test robot @ 2022-06-11  7:34 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Michal Hocko
  Cc: kbuild-all, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups,
	Yosry Ahmed

Hi Yosry,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: m68k-defconfig (https://download.01.org/0day-ci/archive/20220611/202206111529.2okIVRo9-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/619857fd1ec4f351376ffcaaec20acc9aae9486f
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
        git checkout 619857fd1ec4f351376ffcaaec20acc9aae9486f
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from kernel/bpf/cgroup_iter.c:9:
>> kernel/bpf/../cgroup/cgroup-internal.h:78:41: error: field 'iter' has incomplete type
      78 |                 struct css_task_iter    iter;
         |                                         ^~~~
   kernel/bpf/../cgroup/cgroup-internal.h: In function 'cgroup_is_dead':
   kernel/bpf/../cgroup/cgroup-internal.h:188:22: error: invalid use of undefined type 'const struct cgroup'
     188 |         return !(cgrp->self.flags & CSS_ONLINE);
         |                      ^~
   kernel/bpf/../cgroup/cgroup-internal.h:188:37: error: 'CSS_ONLINE' undeclared (first use in this function); did you mean 'N_ONLINE'?
     188 |         return !(cgrp->self.flags & CSS_ONLINE);
         |                                     ^~~~~~~~~~
         |                                     N_ONLINE
   kernel/bpf/../cgroup/cgroup-internal.h:188:37: note: each undeclared identifier is reported only once for each function it appears in
   kernel/bpf/../cgroup/cgroup-internal.h: In function 'notify_on_release':
   kernel/bpf/../cgroup/cgroup-internal.h:193:25: error: 'CGRP_NOTIFY_ON_RELEASE' undeclared (first use in this function)
     193 |         return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/../cgroup/cgroup-internal.h:193:54: error: invalid use of undefined type 'const struct cgroup'
     193 |         return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
         |                                                      ^~
   kernel/bpf/../cgroup/cgroup-internal.h: In function 'put_css_set':
   kernel/bpf/../cgroup/cgroup-internal.h:207:39: error: invalid use of undefined type 'struct css_set'
     207 |         if (refcount_dec_not_one(&cset->refcount))
         |                                       ^~
   kernel/bpf/../cgroup/cgroup-internal.h: In function 'get_css_set':
   kernel/bpf/../cgroup/cgroup-internal.h:220:27: error: invalid use of undefined type 'struct css_set'
     220 |         refcount_inc(&cset->refcount);
         |                           ^~
   kernel/bpf/../cgroup/cgroup-internal.h: At top level:
   kernel/bpf/../cgroup/cgroup-internal.h:284:22: error: array type has incomplete element type 'struct cftype'
     284 | extern struct cftype cgroup1_base_files[];
         |                      ^~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_start':
>> kernel/bpf/cgroup_iter.c:55:24: error: implicit declaration of function 'css_next_descendant_pre' [-Werror=implicit-function-declaration]
      55 |                 return css_next_descendant_pre(NULL, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:55:24: warning: returning 'int' from a function with return type 'void *' makes pointer from integer without a cast [-Wint-conversion]
      55 |                 return css_next_descendant_pre(NULL, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/bpf/cgroup_iter.c:57:24: error: implicit declaration of function 'css_next_descendant_post' [-Werror=implicit-function-declaration]
      57 |                 return css_next_descendant_post(NULL, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:57:24: warning: returning 'int' from a function with return type 'void *' makes pointer from integer without a cast [-Wint-conversion]
      57 |                 return css_next_descendant_post(NULL, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_next':
   kernel/bpf/cgroup_iter.c:83:24: warning: returning 'int' from a function with return type 'void *' makes pointer from integer without a cast [-Wint-conversion]
      83 |                 return css_next_descendant_pre(curr, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:85:24: warning: returning 'int' from a function with return type 'void *' makes pointer from integer without a cast [-Wint-conversion]
      85 |                 return css_next_descendant_post(curr, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/bpf/cgroup_iter.c:87:28: error: invalid use of undefined type 'struct cgroup_subsys_state'
      87 |                 return curr->parent;
         |                            ^~
   kernel/bpf/cgroup_iter.c: In function '__cgroup_iter_seq_show':
   kernel/bpf/cgroup_iter.c:100:38: error: invalid use of undefined type 'struct cgroup_subsys_state'
     100 |         if (css && cgroup_is_dead(css->cgroup))
         |                                      ^~
   kernel/bpf/cgroup_iter.c:104:31: error: invalid use of undefined type 'struct cgroup_subsys_state'
     104 |         ctx.cgroup = css ? css->cgroup : NULL;
         |                               ^~
   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_init':
>> kernel/bpf/cgroup_iter.c:137:29: error: invalid use of undefined type 'struct cgroup'
     137 |         p->start_css = &cgrp->self;
         |                             ^~
   kernel/bpf/cgroup_iter.c: In function 'bpf_iter_attach_cgroup':
>> kernel/bpf/cgroup_iter.c:157:24: error: implicit declaration of function 'cgroup_get_from_fd'; did you mean 'cgroup_get_from_id'? [-Werror=implicit-function-declaration]
     157 |                 cgrp = cgroup_get_from_fd(fd);
         |                        ^~~~~~~~~~~~~~~~~~
         |                        cgroup_get_from_id
   kernel/bpf/cgroup_iter.c:157:22: warning: assignment to 'struct cgroup *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
     157 |                 cgrp = cgroup_get_from_fd(fd);
         |                      ^
>> kernel/bpf/cgroup_iter.c:159:24: error: implicit declaration of function 'cgroup_get_from_path'; did you mean 'cgroup_get_from_id'? [-Werror=implicit-function-declaration]
     159 |                 cgrp = cgroup_get_from_path("/");
         |                        ^~~~~~~~~~~~~~~~~~~~
         |                        cgroup_get_from_id
   kernel/bpf/cgroup_iter.c:159:22: warning: assignment to 'struct cgroup *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
     159 |                 cgrp = cgroup_get_from_path("/");
         |                      ^
   kernel/bpf/cgroup_iter.c: In function 'bpf_iter_cgroup_show_fdinfo':
>> kernel/bpf/cgroup_iter.c:190:9: error: implicit declaration of function 'cgroup_path_ns'; did you mean 'cgroup_parent'? [-Werror=implicit-function-declaration]
     190 |         cgroup_path_ns(aux->cgroup.start, buf, PATH_MAX,
         |         ^~~~~~~~~~~~~~
         |         cgroup_parent
   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_next':
   kernel/bpf/cgroup_iter.c:88:1: error: control reaches end of non-void function [-Werror=return-type]
      88 | }
         | ^
   cc1: some warnings being treated as errors


vim +/iter +78 kernel/bpf/../cgroup/cgroup-internal.h

0d2b5955b36250 Tejun Heo 2022-01-06  68  
0d2b5955b36250 Tejun Heo 2022-01-06  69  struct cgroup_file_ctx {
e57457641613fe Tejun Heo 2022-01-06  70  	struct cgroup_namespace	*ns;
e57457641613fe Tejun Heo 2022-01-06  71  
0d2b5955b36250 Tejun Heo 2022-01-06  72  	struct {
0d2b5955b36250 Tejun Heo 2022-01-06  73  		void			*trigger;
0d2b5955b36250 Tejun Heo 2022-01-06  74  	} psi;
0d2b5955b36250 Tejun Heo 2022-01-06  75  
0d2b5955b36250 Tejun Heo 2022-01-06  76  	struct {
0d2b5955b36250 Tejun Heo 2022-01-06  77  		bool			started;
0d2b5955b36250 Tejun Heo 2022-01-06 @78  		struct css_task_iter	iter;
0d2b5955b36250 Tejun Heo 2022-01-06  79  	} procs;
0d2b5955b36250 Tejun Heo 2022-01-06  80  
0d2b5955b36250 Tejun Heo 2022-01-06  81  	struct {
0d2b5955b36250 Tejun Heo 2022-01-06  82  		struct cgroup_pidlist	*pidlist;
0d2b5955b36250 Tejun Heo 2022-01-06  83  	} procs1;
0d2b5955b36250 Tejun Heo 2022-01-06  84  };
0d2b5955b36250 Tejun Heo 2022-01-06  85  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat
  2022-06-10 19:44 ` [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat Yosry Ahmed
  2022-06-10 20:52   ` kernel test robot
  2022-06-10 21:22   ` kernel test robot
@ 2022-06-11 10:22   ` kernel test robot
  2022-06-28  6:12   ` Yonghong Song
  3 siblings, 0 replies; 46+ messages in thread
From: kernel test robot @ 2022-06-11 10:22 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Michal Hocko
  Cc: llvm, kbuild-all, Roman Gushchin, David Rientjes,
	Stanislav Fomichev, Greg Thelen, Shakeel Butt, linux-kernel,
	netdev, bpf, cgroups, Yosry Ahmed

Hi Yosry,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: hexagon-randconfig-r012-20220611 (https://download.01.org/0day-ci/archive/20220611/202206111842.O3viR9gq-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project ff4abe755279a3a47cc416ef80dbc900d9a98a19)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/83f297e2b47dc41b511f071b9eadf38339387b41
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
        git checkout 83f297e2b47dc41b511f071b9eadf38339387b41
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash kernel/cgroup/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

>> kernel/cgroup/rstat.c:161:22: warning: no previous prototype for function 'bpf_rstat_flush' [-Wmissing-prototypes]
   __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
                        ^
   kernel/cgroup/rstat.c:161:17: note: declare 'static' if the function is not intended to be used outside of this translation unit
   __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
                   ^
                   static 
>> kernel/cgroup/rstat.c:509:3: error: field designator 'sleepable_set' does not refer to any field in type 'const struct btf_kfunc_id_set'
           .sleepable_set  = &bpf_rstat_sleepable_kfunc_ids,
            ^
   1 warning and 1 error generated.


vim +509 kernel/cgroup/rstat.c

   505	
   506	static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
   507		.owner		= THIS_MODULE,
   508		.check_set	= &bpf_rstat_check_kfunc_ids,
 > 509		.sleepable_set	= &bpf_rstat_sleepable_kfunc_ids,
   510	};
   511	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
  2022-06-10 19:44 ` [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter Yosry Ahmed
  2022-06-11  6:23   ` kernel test robot
  2022-06-11  7:34   ` kernel test robot
@ 2022-06-11 12:44   ` kernel test robot
  2022-06-11 12:55   ` kernel test robot
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 46+ messages in thread
From: kernel test robot @ 2022-06-11 12:44 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Michal Hocko
  Cc: llvm, kbuild-all, Roman Gushchin, David Rientjes,
	Stanislav Fomichev, Greg Thelen, Shakeel Butt, linux-kernel,
	netdev, bpf, cgroups, Yosry Ahmed

Hi Yosry,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: s390-randconfig-r035-20220611 (https://download.01.org/0day-ci/archive/20220611/202206112009.sycCJKhv-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project ff4abe755279a3a47cc416ef80dbc900d9a98a19)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install s390 cross compiling tool for clang build
        # apt-get install binutils-s390x-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/619857fd1ec4f351376ffcaaec20acc9aae9486f
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
        git checkout 619857fd1ec4f351376ffcaaec20acc9aae9486f
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash kernel/bpf/ kernel/cgroup/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from kernel/bpf/cgroup_iter.c:9:
   kernel/bpf/../cgroup/cgroup-internal.h:78:24: error: field has incomplete type 'struct css_task_iter'
                   struct css_task_iter    iter;
                                           ^
   kernel/bpf/../cgroup/cgroup-internal.h:78:10: note: forward declaration of 'struct css_task_iter'
                   struct css_task_iter    iter;
                          ^
   kernel/bpf/../cgroup/cgroup-internal.h:188:15: error: incomplete definition of type 'struct cgroup'
           return !(cgrp->self.flags & CSS_ONLINE);
                    ~~~~^
   include/linux/sched/task.h:35:9: note: forward declaration of 'struct cgroup'
           struct cgroup *cgrp;
                  ^
   In file included from kernel/bpf/cgroup_iter.c:9:
   kernel/bpf/../cgroup/cgroup-internal.h:188:30: error: use of undeclared identifier 'CSS_ONLINE'; did you mean 'N_ONLINE'?
           return !(cgrp->self.flags & CSS_ONLINE);
                                       ^~~~~~~~~~
                                       N_ONLINE
   include/linux/nodemask.h:392:2: note: 'N_ONLINE' declared here
           N_ONLINE,               /* The node is online */
           ^
   In file included from kernel/bpf/cgroup_iter.c:9:
   kernel/bpf/../cgroup/cgroup-internal.h:193:47: error: incomplete definition of type 'struct cgroup'
           return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
                                                    ~~~~^
   include/linux/sched/task.h:35:9: note: forward declaration of 'struct cgroup'
           struct cgroup *cgrp;
                  ^
   In file included from kernel/bpf/cgroup_iter.c:9:
   kernel/bpf/../cgroup/cgroup-internal.h:193:18: error: use of undeclared identifier 'CGRP_NOTIFY_ON_RELEASE'
           return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
                           ^
   kernel/bpf/../cgroup/cgroup-internal.h:207:32: error: incomplete definition of type 'struct css_set'
           if (refcount_dec_not_one(&cset->refcount))
                                     ~~~~^
   include/linux/sched/task.h:16:8: note: forward declaration of 'struct css_set'
   struct css_set;
          ^
   In file included from kernel/bpf/cgroup_iter.c:9:
   kernel/bpf/../cgroup/cgroup-internal.h:220:20: error: incomplete definition of type 'struct css_set'
           refcount_inc(&cset->refcount);
                         ~~~~^
   include/linux/sched/task.h:16:8: note: forward declaration of 'struct css_set'
   struct css_set;
          ^
   In file included from kernel/bpf/cgroup_iter.c:9:
   kernel/bpf/../cgroup/cgroup-internal.h:284:40: error: array has incomplete element type 'struct cftype'
   extern struct cftype cgroup1_base_files[];
                                          ^
   kernel/bpf/../cgroup/cgroup-internal.h:284:15: note: forward declaration of 'struct cftype'
   extern struct cftype cgroup1_base_files[];
                 ^
   kernel/bpf/cgroup_iter.c:55:10: error: call to undeclared function 'css_next_descendant_pre'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                   return css_next_descendant_pre(NULL, p->start_css);
                          ^
>> kernel/bpf/cgroup_iter.c:55:10: warning: incompatible integer to pointer conversion returning 'int' from a function with result type 'void *' [-Wint-conversion]
                   return css_next_descendant_pre(NULL, p->start_css);
                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:57:10: error: call to undeclared function 'css_next_descendant_post'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                   return css_next_descendant_post(NULL, p->start_css);
                          ^
   kernel/bpf/cgroup_iter.c:57:10: warning: incompatible integer to pointer conversion returning 'int' from a function with result type 'void *' [-Wint-conversion]
                   return css_next_descendant_post(NULL, p->start_css);
                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:83:10: error: call to undeclared function 'css_next_descendant_pre'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                   return css_next_descendant_pre(curr, p->start_css);
                          ^
   kernel/bpf/cgroup_iter.c:83:10: warning: incompatible integer to pointer conversion returning 'int' from a function with result type 'void *' [-Wint-conversion]
                   return css_next_descendant_pre(curr, p->start_css);
                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:85:10: error: call to undeclared function 'css_next_descendant_post'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                   return css_next_descendant_post(curr, p->start_css);
                          ^
   kernel/bpf/cgroup_iter.c:85:10: warning: incompatible integer to pointer conversion returning 'int' from a function with result type 'void *' [-Wint-conversion]
                   return css_next_descendant_post(curr, p->start_css);
                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:87:14: error: incomplete definition of type 'struct cgroup_subsys_state'
                   return curr->parent;
                          ~~~~^
   include/linux/kthread.h:218:8: note: forward declaration of 'struct cgroup_subsys_state'
   struct cgroup_subsys_state;
          ^
   kernel/bpf/cgroup_iter.c:100:31: error: incomplete definition of type 'struct cgroup_subsys_state'
           if (css && cgroup_is_dead(css->cgroup))
                                     ~~~^
   include/linux/kthread.h:218:8: note: forward declaration of 'struct cgroup_subsys_state'
   struct cgroup_subsys_state;
          ^
   kernel/bpf/cgroup_iter.c:104:24: error: incomplete definition of type 'struct cgroup_subsys_state'
           ctx.cgroup = css ? css->cgroup : NULL;
                              ~~~^
   include/linux/kthread.h:218:8: note: forward declaration of 'struct cgroup_subsys_state'
   struct cgroup_subsys_state;
          ^
   kernel/bpf/cgroup_iter.c:137:22: error: incomplete definition of type 'struct cgroup'
           p->start_css = &cgrp->self;
                           ~~~~^
   include/linux/sched/task.h:35:9: note: forward declaration of 'struct cgroup'
           struct cgroup *cgrp;
                  ^
   kernel/bpf/cgroup_iter.c:157:10: error: call to undeclared function 'cgroup_get_from_fd'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                   cgrp = cgroup_get_from_fd(fd);
                          ^
   kernel/bpf/cgroup_iter.c:157:10: note: did you mean 'cgroup_get_from_id'?
   include/linux/cgroup.h:756:30: note: 'cgroup_get_from_id' declared here
   static inline struct cgroup *cgroup_get_from_id(u64 id)
                                ^
>> kernel/bpf/cgroup_iter.c:157:8: warning: incompatible integer to pointer conversion assigning to 'struct cgroup *' from 'int' [-Wint-conversion]
                   cgrp = cgroup_get_from_fd(fd);
                        ^ ~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:159:10: error: call to undeclared function 'cgroup_get_from_path'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                   cgrp = cgroup_get_from_path("/");
                          ^
   kernel/bpf/cgroup_iter.c:159:8: warning: incompatible integer to pointer conversion assigning to 'struct cgroup *' from 'int' [-Wint-conversion]
                   cgrp = cgroup_get_from_path("/");
                        ^ ~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:190:2: error: call to undeclared function 'cgroup_path_ns'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
           cgroup_path_ns(aux->cgroup.start, buf, PATH_MAX,
           ^
   kernel/bpf/cgroup_iter.c:190:2: note: did you mean 'cgroup_parent'?
   include/linux/cgroup.h:732:30: note: 'cgroup_parent' declared here
   static inline struct cgroup *cgroup_parent(struct cgroup *cgrp)
                                ^
   6 warnings and 19 errors generated.


vim +55 kernel/bpf/cgroup_iter.c

    41	
    42	static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
    43	{
    44		struct cgroup_iter_priv *p = seq->private;
    45	
    46		mutex_lock(&cgroup_mutex);
    47	
    48		/* support only one session */
    49		if (*pos > 0)
    50			return NULL;
    51	
    52		++*pos;
    53		p->terminate = false;
    54		if (p->order == BPF_ITER_CGROUP_PRE)
  > 55			return css_next_descendant_pre(NULL, p->start_css);
    56		else if (p->order == BPF_ITER_CGROUP_POST)
    57			return css_next_descendant_post(NULL, p->start_css);
    58		else /* BPF_ITER_CGROUP_PARENT_UP */
    59			return p->start_css;
    60	}
    61	
    62	static int __cgroup_iter_seq_show(struct seq_file *seq,
    63					  struct cgroup_subsys_state *css, int in_stop);
    64	
    65	static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
    66	{
    67		/* pass NULL to the prog for post-processing */
    68		if (!v)
    69			__cgroup_iter_seq_show(seq, NULL, true);
    70		mutex_unlock(&cgroup_mutex);
    71	}
    72	
    73	static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
    74	{
    75		struct cgroup_subsys_state *curr = (struct cgroup_subsys_state *)v;
    76		struct cgroup_iter_priv *p = seq->private;
    77	
    78		++*pos;
    79		if (p->terminate)
    80			return NULL;
    81	
    82		if (p->order == BPF_ITER_CGROUP_PRE)
    83			return css_next_descendant_pre(curr, p->start_css);
    84		else if (p->order == BPF_ITER_CGROUP_POST)
    85			return css_next_descendant_post(curr, p->start_css);
    86		else
    87			return curr->parent;
    88	}
    89	
    90	static int __cgroup_iter_seq_show(struct seq_file *seq,
    91					  struct cgroup_subsys_state *css, int in_stop)
    92	{
    93		struct cgroup_iter_priv *p = seq->private;
    94		struct bpf_iter__cgroup ctx;
    95		struct bpf_iter_meta meta;
    96		struct bpf_prog *prog;
    97		int ret = 0;
    98	
    99		/* cgroup is dead, skip this element */
   100		if (css && cgroup_is_dead(css->cgroup))
   101			return 0;
   102	
   103		ctx.meta = &meta;
   104		ctx.cgroup = css ? css->cgroup : NULL;
   105		meta.seq = seq;
   106		prog = bpf_iter_get_info(&meta, in_stop);
   107		if (prog)
   108			ret = bpf_iter_run_prog(prog, &ctx);
   109	
   110		/* if prog returns > 0, terminate after this element. */
   111		if (ret != 0)
   112			p->terminate = true;
   113	
   114		return 0;
   115	}
   116	
   117	static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
   118	{
   119		return __cgroup_iter_seq_show(seq, (struct cgroup_subsys_state *)v,
   120					      false);
   121	}
   122	
   123	static const struct seq_operations cgroup_iter_seq_ops = {
   124		.start  = cgroup_iter_seq_start,
   125		.next   = cgroup_iter_seq_next,
   126		.stop   = cgroup_iter_seq_stop,
   127		.show   = cgroup_iter_seq_show,
   128	};
   129	
   130	BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
   131	
   132	static int cgroup_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
   133	{
   134		struct cgroup_iter_priv *p = (struct cgroup_iter_priv *)priv;
   135		struct cgroup *cgrp = aux->cgroup.start;
   136	
   137		p->start_css = &cgrp->self;
   138		p->terminate = false;
   139		p->order = aux->cgroup.order;
   140		return 0;
   141	}
   142	
   143	static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
   144		.seq_ops                = &cgroup_iter_seq_ops,
   145		.init_seq_private       = cgroup_iter_seq_init,
   146		.seq_priv_size          = sizeof(struct cgroup_iter_priv),
   147	};
   148	
   149	static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
   150					  union bpf_iter_link_info *linfo,
   151					  struct bpf_iter_aux_info *aux)
   152	{
   153		int fd = linfo->cgroup.cgroup_fd;
   154		struct cgroup *cgrp;
   155	
   156		if (fd)
 > 157			cgrp = cgroup_get_from_fd(fd);
   158		else /* walk the entire hierarchy by default. */
   159			cgrp = cgroup_get_from_path("/");
   160	
   161		if (IS_ERR(cgrp))
   162			return PTR_ERR(cgrp);
   163	
   164		aux->cgroup.start = cgrp;
   165		aux->cgroup.order = linfo->cgroup.traversal_order;
   166		return 0;
   167	}
   168	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
  2022-06-10 19:44 ` [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter Yosry Ahmed
                     ` (2 preceding siblings ...)
  2022-06-11 12:44   ` kernel test robot
@ 2022-06-11 12:55   ` kernel test robot
  2022-06-28  4:09   ` Yonghong Song
  2022-06-28  4:14   ` Yonghong Song
  5 siblings, 0 replies; 46+ messages in thread
From: kernel test robot @ 2022-06-11 12:55 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Michal Hocko
  Cc: kbuild-all, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups,
	Yosry Ahmed

Hi Yosry,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: m68k-randconfig-r004-20220611 (https://download.01.org/0day-ci/archive/20220611/202206112000.LRgcxlpN-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/619857fd1ec4f351376ffcaaec20acc9aae9486f
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Yosry-Ahmed/bpf-rstat-cgroup-hierarchical-stats/20220611-034720
        git checkout 619857fd1ec4f351376ffcaaec20acc9aae9486f
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash kernel/bpf/ kernel/cgroup/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   In file included from kernel/bpf/cgroup_iter.c:9:
>> kernel/bpf/../cgroup/cgroup-internal.h:78:41: error: field 'iter' has incomplete type
      78 |                 struct css_task_iter    iter;
         |                                         ^~~~
   kernel/bpf/../cgroup/cgroup-internal.h: In function 'cgroup_is_dead':
>> kernel/bpf/../cgroup/cgroup-internal.h:188:22: error: invalid use of undefined type 'const struct cgroup'
     188 |         return !(cgrp->self.flags & CSS_ONLINE);
         |                      ^~
>> kernel/bpf/../cgroup/cgroup-internal.h:188:37: error: 'CSS_ONLINE' undeclared (first use in this function); did you mean 'N_ONLINE'?
     188 |         return !(cgrp->self.flags & CSS_ONLINE);
         |                                     ^~~~~~~~~~
         |                                     N_ONLINE
   kernel/bpf/../cgroup/cgroup-internal.h:188:37: note: each undeclared identifier is reported only once for each function it appears in
   kernel/bpf/../cgroup/cgroup-internal.h: In function 'notify_on_release':
>> kernel/bpf/../cgroup/cgroup-internal.h:193:25: error: 'CGRP_NOTIFY_ON_RELEASE' undeclared (first use in this function)
     193 |         return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/../cgroup/cgroup-internal.h:193:54: error: invalid use of undefined type 'const struct cgroup'
     193 |         return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
         |                                                      ^~
   kernel/bpf/../cgroup/cgroup-internal.h: In function 'put_css_set':
>> kernel/bpf/../cgroup/cgroup-internal.h:207:39: error: invalid use of undefined type 'struct css_set'
     207 |         if (refcount_dec_not_one(&cset->refcount))
         |                                       ^~
   kernel/bpf/../cgroup/cgroup-internal.h: In function 'get_css_set':
   kernel/bpf/../cgroup/cgroup-internal.h:220:27: error: invalid use of undefined type 'struct css_set'
     220 |         refcount_inc(&cset->refcount);
         |                           ^~
   kernel/bpf/../cgroup/cgroup-internal.h: At top level:
>> kernel/bpf/../cgroup/cgroup-internal.h:284:22: error: array type has incomplete element type 'struct cftype'
     284 | extern struct cftype cgroup1_base_files[];
         |                      ^~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_start':
>> kernel/bpf/cgroup_iter.c:55:24: error: implicit declaration of function 'css_next_descendant_pre' [-Werror=implicit-function-declaration]
      55 |                 return css_next_descendant_pre(NULL, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~
>> kernel/bpf/cgroup_iter.c:55:24: warning: returning 'int' from a function with return type 'void *' makes pointer from integer without a cast [-Wint-conversion]
      55 |                 return css_next_descendant_pre(NULL, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/bpf/cgroup_iter.c:57:24: error: implicit declaration of function 'css_next_descendant_post' [-Werror=implicit-function-declaration]
      57 |                 return css_next_descendant_post(NULL, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:57:24: warning: returning 'int' from a function with return type 'void *' makes pointer from integer without a cast [-Wint-conversion]
      57 |                 return css_next_descendant_post(NULL, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_next':
   kernel/bpf/cgroup_iter.c:83:24: warning: returning 'int' from a function with return type 'void *' makes pointer from integer without a cast [-Wint-conversion]
      83 |                 return css_next_descendant_pre(curr, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/bpf/cgroup_iter.c:85:24: warning: returning 'int' from a function with return type 'void *' makes pointer from integer without a cast [-Wint-conversion]
      85 |                 return css_next_descendant_post(curr, p->start_css);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/bpf/cgroup_iter.c:87:28: error: invalid use of undefined type 'struct cgroup_subsys_state'
      87 |                 return curr->parent;
         |                            ^~
   kernel/bpf/cgroup_iter.c: In function '__cgroup_iter_seq_show':
   kernel/bpf/cgroup_iter.c:100:38: error: invalid use of undefined type 'struct cgroup_subsys_state'
     100 |         if (css && cgroup_is_dead(css->cgroup))
         |                                      ^~
   kernel/bpf/cgroup_iter.c:104:31: error: invalid use of undefined type 'struct cgroup_subsys_state'
     104 |         ctx.cgroup = css ? css->cgroup : NULL;
         |                               ^~
   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_init':
>> kernel/bpf/cgroup_iter.c:137:29: error: invalid use of undefined type 'struct cgroup'
     137 |         p->start_css = &cgrp->self;
         |                             ^~
   kernel/bpf/cgroup_iter.c: In function 'bpf_iter_attach_cgroup':
>> kernel/bpf/cgroup_iter.c:157:24: error: implicit declaration of function 'cgroup_get_from_fd'; did you mean 'cgroup_get_from_id'? [-Werror=implicit-function-declaration]
     157 |                 cgrp = cgroup_get_from_fd(fd);
         |                        ^~~~~~~~~~~~~~~~~~
         |                        cgroup_get_from_id
>> kernel/bpf/cgroup_iter.c:157:22: warning: assignment to 'struct cgroup *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
     157 |                 cgrp = cgroup_get_from_fd(fd);
         |                      ^
>> kernel/bpf/cgroup_iter.c:159:24: error: implicit declaration of function 'cgroup_get_from_path'; did you mean 'cgroup_get_from_id'? [-Werror=implicit-function-declaration]
     159 |                 cgrp = cgroup_get_from_path("/");
         |                        ^~~~~~~~~~~~~~~~~~~~
         |                        cgroup_get_from_id
   kernel/bpf/cgroup_iter.c:159:22: warning: assignment to 'struct cgroup *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
     159 |                 cgrp = cgroup_get_from_path("/");
         |                      ^
   kernel/bpf/cgroup_iter.c: In function 'bpf_iter_cgroup_show_fdinfo':
>> kernel/bpf/cgroup_iter.c:190:9: error: implicit declaration of function 'cgroup_path_ns'; did you mean 'cgroup_parent'? [-Werror=implicit-function-declaration]
     190 |         cgroup_path_ns(aux->cgroup.start, buf, PATH_MAX,
         |         ^~~~~~~~~~~~~~
         |         cgroup_parent
   kernel/bpf/cgroup_iter.c: In function 'cgroup_iter_seq_next':
   kernel/bpf/cgroup_iter.c:88:1: error: control reaches end of non-void function [-Werror=return-type]
      88 | }
         | ^
   cc1: some warnings being treated as errors


vim +/iter +78 kernel/bpf/../cgroup/cgroup-internal.h

0d2b5955b36250 Tejun Heo       2022-01-06   68  
0d2b5955b36250 Tejun Heo       2022-01-06   69  struct cgroup_file_ctx {
e57457641613fe Tejun Heo       2022-01-06   70  	struct cgroup_namespace	*ns;
e57457641613fe Tejun Heo       2022-01-06   71  
0d2b5955b36250 Tejun Heo       2022-01-06   72  	struct {
0d2b5955b36250 Tejun Heo       2022-01-06   73  		void			*trigger;
0d2b5955b36250 Tejun Heo       2022-01-06   74  	} psi;
0d2b5955b36250 Tejun Heo       2022-01-06   75  
0d2b5955b36250 Tejun Heo       2022-01-06   76  	struct {
0d2b5955b36250 Tejun Heo       2022-01-06   77  		bool			started;
0d2b5955b36250 Tejun Heo       2022-01-06  @78  		struct css_task_iter	iter;
0d2b5955b36250 Tejun Heo       2022-01-06   79  	} procs;
0d2b5955b36250 Tejun Heo       2022-01-06   80  
0d2b5955b36250 Tejun Heo       2022-01-06   81  	struct {
0d2b5955b36250 Tejun Heo       2022-01-06   82  		struct cgroup_pidlist	*pidlist;
0d2b5955b36250 Tejun Heo       2022-01-06   83  	} procs1;
0d2b5955b36250 Tejun Heo       2022-01-06   84  };
0d2b5955b36250 Tejun Heo       2022-01-06   85  
0a268dbd7932c7 Tejun Heo       2016-12-27   86  /*
0a268dbd7932c7 Tejun Heo       2016-12-27   87   * A cgroup can be associated with multiple css_sets as different tasks may
0a268dbd7932c7 Tejun Heo       2016-12-27   88   * belong to different cgroups on different hierarchies.  In the other
0a268dbd7932c7 Tejun Heo       2016-12-27   89   * direction, a css_set is naturally associated with multiple cgroups.
0a268dbd7932c7 Tejun Heo       2016-12-27   90   * This M:N relationship is represented by the following link structure
0a268dbd7932c7 Tejun Heo       2016-12-27   91   * which exists for each association and allows traversing the associations
0a268dbd7932c7 Tejun Heo       2016-12-27   92   * from both sides.
0a268dbd7932c7 Tejun Heo       2016-12-27   93   */
0a268dbd7932c7 Tejun Heo       2016-12-27   94  struct cgrp_cset_link {
0a268dbd7932c7 Tejun Heo       2016-12-27   95  	/* the cgroup and css_set this link associates */
0a268dbd7932c7 Tejun Heo       2016-12-27   96  	struct cgroup		*cgrp;
0a268dbd7932c7 Tejun Heo       2016-12-27   97  	struct css_set		*cset;
0a268dbd7932c7 Tejun Heo       2016-12-27   98  
0a268dbd7932c7 Tejun Heo       2016-12-27   99  	/* list of cgrp_cset_links anchored at cgrp->cset_links */
0a268dbd7932c7 Tejun Heo       2016-12-27  100  	struct list_head	cset_link;
0a268dbd7932c7 Tejun Heo       2016-12-27  101  
0a268dbd7932c7 Tejun Heo       2016-12-27  102  	/* list of cgrp_cset_links anchored at css_set->cgrp_links */
0a268dbd7932c7 Tejun Heo       2016-12-27  103  	struct list_head	cgrp_link;
0a268dbd7932c7 Tejun Heo       2016-12-27  104  };
0a268dbd7932c7 Tejun Heo       2016-12-27  105  
e595cd706982bf Tejun Heo       2017-01-15  106  /* used to track tasks and csets during migration */
e595cd706982bf Tejun Heo       2017-01-15  107  struct cgroup_taskset {
e595cd706982bf Tejun Heo       2017-01-15  108  	/* the src and dst cset list running through cset->mg_node */
e595cd706982bf Tejun Heo       2017-01-15  109  	struct list_head	src_csets;
e595cd706982bf Tejun Heo       2017-01-15  110  	struct list_head	dst_csets;
e595cd706982bf Tejun Heo       2017-01-15  111  
610467270fb368 Tejun Heo       2017-07-08  112  	/* the number of tasks in the set */
610467270fb368 Tejun Heo       2017-07-08  113  	int			nr_tasks;
610467270fb368 Tejun Heo       2017-07-08  114  
e595cd706982bf Tejun Heo       2017-01-15  115  	/* the subsys currently being processed */
e595cd706982bf Tejun Heo       2017-01-15  116  	int			ssid;
e595cd706982bf Tejun Heo       2017-01-15  117  
e595cd706982bf Tejun Heo       2017-01-15  118  	/*
e595cd706982bf Tejun Heo       2017-01-15  119  	 * Fields for cgroup_taskset_*() iteration.
e595cd706982bf Tejun Heo       2017-01-15  120  	 *
e595cd706982bf Tejun Heo       2017-01-15  121  	 * Before migration is committed, the target migration tasks are on
e595cd706982bf Tejun Heo       2017-01-15  122  	 * ->mg_tasks of the csets on ->src_csets.  After, on ->mg_tasks of
e595cd706982bf Tejun Heo       2017-01-15  123  	 * the csets on ->dst_csets.  ->csets point to either ->src_csets
e595cd706982bf Tejun Heo       2017-01-15  124  	 * or ->dst_csets depending on whether migration is committed.
e595cd706982bf Tejun Heo       2017-01-15  125  	 *
e595cd706982bf Tejun Heo       2017-01-15  126  	 * ->cur_csets and ->cur_task point to the current task position
e595cd706982bf Tejun Heo       2017-01-15  127  	 * during iteration.
e595cd706982bf Tejun Heo       2017-01-15  128  	 */
e595cd706982bf Tejun Heo       2017-01-15  129  	struct list_head	*csets;
e595cd706982bf Tejun Heo       2017-01-15  130  	struct css_set		*cur_cset;
e595cd706982bf Tejun Heo       2017-01-15  131  	struct task_struct	*cur_task;
e595cd706982bf Tejun Heo       2017-01-15  132  };
e595cd706982bf Tejun Heo       2017-01-15  133  
e595cd706982bf Tejun Heo       2017-01-15  134  /* migration context also tracks preloading */
e595cd706982bf Tejun Heo       2017-01-15  135  struct cgroup_mgctx {
e595cd706982bf Tejun Heo       2017-01-15  136  	/*
e595cd706982bf Tejun Heo       2017-01-15  137  	 * Preloaded source and destination csets.  Used to guarantee
e595cd706982bf Tejun Heo       2017-01-15  138  	 * atomic success or failure on actual migration.
e595cd706982bf Tejun Heo       2017-01-15  139  	 */
e595cd706982bf Tejun Heo       2017-01-15  140  	struct list_head	preloaded_src_csets;
e595cd706982bf Tejun Heo       2017-01-15  141  	struct list_head	preloaded_dst_csets;
e595cd706982bf Tejun Heo       2017-01-15  142  
e595cd706982bf Tejun Heo       2017-01-15  143  	/* tasks and csets to migrate */
e595cd706982bf Tejun Heo       2017-01-15  144  	struct cgroup_taskset	tset;
bfc2cf6f61fcea Tejun Heo       2017-01-15  145  
bfc2cf6f61fcea Tejun Heo       2017-01-15  146  	/* subsystems affected by migration */
bfc2cf6f61fcea Tejun Heo       2017-01-15  147  	u16			ss_mask;
e595cd706982bf Tejun Heo       2017-01-15  148  };
e595cd706982bf Tejun Heo       2017-01-15  149  
e595cd706982bf Tejun Heo       2017-01-15  150  #define CGROUP_TASKSET_INIT(tset)						\
e595cd706982bf Tejun Heo       2017-01-15  151  {										\
e595cd706982bf Tejun Heo       2017-01-15  152  	.src_csets		= LIST_HEAD_INIT(tset.src_csets),		\
e595cd706982bf Tejun Heo       2017-01-15  153  	.dst_csets		= LIST_HEAD_INIT(tset.dst_csets),		\
e595cd706982bf Tejun Heo       2017-01-15  154  	.csets			= &tset.src_csets,				\
e595cd706982bf Tejun Heo       2017-01-15  155  }
e595cd706982bf Tejun Heo       2017-01-15  156  
e595cd706982bf Tejun Heo       2017-01-15  157  #define CGROUP_MGCTX_INIT(name)							\
e595cd706982bf Tejun Heo       2017-01-15  158  {										\
e595cd706982bf Tejun Heo       2017-01-15  159  	LIST_HEAD_INIT(name.preloaded_src_csets),				\
e595cd706982bf Tejun Heo       2017-01-15  160  	LIST_HEAD_INIT(name.preloaded_dst_csets),				\
e595cd706982bf Tejun Heo       2017-01-15  161  	CGROUP_TASKSET_INIT(name.tset),						\
e595cd706982bf Tejun Heo       2017-01-15  162  }
e595cd706982bf Tejun Heo       2017-01-15  163  
e595cd706982bf Tejun Heo       2017-01-15  164  #define DEFINE_CGROUP_MGCTX(name)						\
e595cd706982bf Tejun Heo       2017-01-15  165  	struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
e595cd706982bf Tejun Heo       2017-01-15  166  
0a268dbd7932c7 Tejun Heo       2016-12-27  167  extern struct mutex cgroup_mutex;
0a268dbd7932c7 Tejun Heo       2016-12-27  168  extern spinlock_t css_set_lock;
0a268dbd7932c7 Tejun Heo       2016-12-27  169  extern struct cgroup_subsys *cgroup_subsys[];
0a268dbd7932c7 Tejun Heo       2016-12-27  170  extern struct list_head cgroup_roots;
0a268dbd7932c7 Tejun Heo       2016-12-27  171  extern struct file_system_type cgroup_fs_type;
0a268dbd7932c7 Tejun Heo       2016-12-27  172  
0a268dbd7932c7 Tejun Heo       2016-12-27  173  /* iterate across the hierarchies */
0a268dbd7932c7 Tejun Heo       2016-12-27  174  #define for_each_root(root)						\
0a268dbd7932c7 Tejun Heo       2016-12-27  175  	list_for_each_entry((root), &cgroup_roots, root_list)
0a268dbd7932c7 Tejun Heo       2016-12-27  176  
0a268dbd7932c7 Tejun Heo       2016-12-27  177  /**
0a268dbd7932c7 Tejun Heo       2016-12-27  178   * for_each_subsys - iterate all enabled cgroup subsystems
0a268dbd7932c7 Tejun Heo       2016-12-27  179   * @ss: the iteration cursor
0a268dbd7932c7 Tejun Heo       2016-12-27  180   * @ssid: the index of @ss, CGROUP_SUBSYS_COUNT after reaching the end
0a268dbd7932c7 Tejun Heo       2016-12-27  181   */
0a268dbd7932c7 Tejun Heo       2016-12-27  182  #define for_each_subsys(ss, ssid)					\
0a268dbd7932c7 Tejun Heo       2016-12-27  183  	for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT &&		\
0a268dbd7932c7 Tejun Heo       2016-12-27  184  	     (((ss) = cgroup_subsys[ssid]) || true); (ssid)++)
0a268dbd7932c7 Tejun Heo       2016-12-27  185  
0a268dbd7932c7 Tejun Heo       2016-12-27  186  static inline bool cgroup_is_dead(const struct cgroup *cgrp)
0a268dbd7932c7 Tejun Heo       2016-12-27  187  {
0a268dbd7932c7 Tejun Heo       2016-12-27 @188  	return !(cgrp->self.flags & CSS_ONLINE);
0a268dbd7932c7 Tejun Heo       2016-12-27  189  }
0a268dbd7932c7 Tejun Heo       2016-12-27  190  
0a268dbd7932c7 Tejun Heo       2016-12-27  191  static inline bool notify_on_release(const struct cgroup *cgrp)
0a268dbd7932c7 Tejun Heo       2016-12-27  192  {
0a268dbd7932c7 Tejun Heo       2016-12-27 @193  	return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
0a268dbd7932c7 Tejun Heo       2016-12-27  194  }
0a268dbd7932c7 Tejun Heo       2016-12-27  195  
dcfe149b9f45aa Tejun Heo       2016-12-27  196  void put_css_set_locked(struct css_set *cset);
dcfe149b9f45aa Tejun Heo       2016-12-27  197  
dcfe149b9f45aa Tejun Heo       2016-12-27  198  static inline void put_css_set(struct css_set *cset)
dcfe149b9f45aa Tejun Heo       2016-12-27  199  {
dcfe149b9f45aa Tejun Heo       2016-12-27  200  	unsigned long flags;
dcfe149b9f45aa Tejun Heo       2016-12-27  201  
dcfe149b9f45aa Tejun Heo       2016-12-27  202  	/*
dcfe149b9f45aa Tejun Heo       2016-12-27  203  	 * Ensure that the refcount doesn't hit zero while any readers
dcfe149b9f45aa Tejun Heo       2016-12-27  204  	 * can see it. Similar to atomic_dec_and_lock(), but for an
dcfe149b9f45aa Tejun Heo       2016-12-27  205  	 * rwlock
dcfe149b9f45aa Tejun Heo       2016-12-27  206  	 */
4b9502e63b5e2b Elena Reshetova 2017-03-08 @207  	if (refcount_dec_not_one(&cset->refcount))
dcfe149b9f45aa Tejun Heo       2016-12-27  208  		return;
dcfe149b9f45aa Tejun Heo       2016-12-27  209  
dcfe149b9f45aa Tejun Heo       2016-12-27  210  	spin_lock_irqsave(&css_set_lock, flags);
dcfe149b9f45aa Tejun Heo       2016-12-27  211  	put_css_set_locked(cset);
dcfe149b9f45aa Tejun Heo       2016-12-27  212  	spin_unlock_irqrestore(&css_set_lock, flags);
dcfe149b9f45aa Tejun Heo       2016-12-27  213  }
dcfe149b9f45aa Tejun Heo       2016-12-27  214  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat
  2022-06-10 21:30     ` Yosry Ahmed
@ 2022-06-11 19:57       ` Alexei Starovoitov
  2022-06-13 17:05         ` Yosry Ahmed
  0 siblings, 1 reply; 46+ messages in thread
From: Alexei Starovoitov @ 2022-06-11 19:57 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: kernel test robot, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Michal Hocko, kbuild-all,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Fri, Jun 10, 2022 at 02:30:00PM -0700, Yosry Ahmed wrote:
> 
> AFAICT these failures are because the patch series depends on a patch
> in the mailing list [1] that is not in bpf-next, as explained by the
> cover letter.
> 
> [1] https://lore.kernel.org/bpf/20220421140740.459558-5-benjamin.tissoires@redhat.com/

You probably want to rebase and include that patch as patch 1 in your series
preserving Benjamin's SOB and cc-ing him on the series.
Otherwise we cannot land the set, BPF CI cannot test it, and review is hard to do.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat
  2022-06-11 19:57       ` Alexei Starovoitov
@ 2022-06-13 17:05         ` Yosry Ahmed
  0 siblings, 0 replies; 46+ messages in thread
From: Yosry Ahmed @ 2022-06-13 17:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: kernel test robot, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
	Johannes Weiner, Shuah Khan, Michal Hocko, kbuild-all,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Sat, Jun 11, 2022 at 12:57 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jun 10, 2022 at 02:30:00PM -0700, Yosry Ahmed wrote:
> >
> > AFAICT these failures are because the patch series depends on a patch
> > in the mailing list [1] that is not in bpf-next, as explained by the
> > cover letter.
> >
> > [1] https://lore.kernel.org/bpf/20220421140740.459558-5-benjamin.tissoires@redhat.com/
>
> You probably want to rebase and include that patch as patch 1 in your series
> preserving Benjamin's SOB and cc-ing him on the series.
> Otherwise we cannot land the set, BPF CI cannot test it, and review is hard to do.

Sounds good. Will rebase, do that, and send a v3.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 3/8] bpf, iter: Fix the condition on p when calling stop.
  2022-06-10 19:44 ` [PATCH bpf-next v2 3/8] bpf, iter: Fix the condition on p when calling stop Yosry Ahmed
@ 2022-06-20 18:48   ` Yonghong Song
  2022-06-21  7:25     ` Hao Luo
  0 siblings, 1 reply; 46+ messages in thread
From: Yonghong Song @ 2022-06-20 18:48 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups



On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> From: Hao Luo <haoluo@google.com>
> 
> In bpf_seq_read, seq->op->next() could return an ERR and jump to
> the label stop. However, the existing code in stop does not handle
> the case when p (returned from next()) is an ERR. Add handling of
> an ERR p by converting p into an error and jumping to done.
> 
> Because none of the current implementations return an ERR from
> next(), this patch doesn't change behavior right now.
> 
> Signed-off-by: Hao Luo <haoluo@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Yonghong Song <yhs@fb.com>
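
For reference, the stop-path change described above amounts to roughly
the following in bpf_seq_read() (a sketch only; the exact hunk is not
quoted in this message, and the names around it are assumed from the
existing kernel/bpf/bpf_iter.c code):

stop:
	/* p may be an ERR pointer returned from seq->op->next(); convert
	 * it into an error and skip ->stop() processing for it, instead
	 * of handing it to ->stop() as if it were a valid element.
	 */
	if (IS_ERR(p)) {
		seq->op->stop(seq, NULL);
		err = PTR_ERR(p);
		goto done;
	}
	seq->op->stop(seq, p);	/* prog sees NULL for post-processing if !p */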

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 3/8] bpf, iter: Fix the condition on p when calling stop.
  2022-06-20 18:48   ` Yonghong Song
@ 2022-06-21  7:25     ` Hao Luo
  2022-06-24 17:46       ` Yonghong Song
  0 siblings, 1 reply; 46+ messages in thread
From: Hao Luo @ 2022-06-21  7:25 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan,
	Michal Hocko, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

On Mon, Jun 20, 2022 at 11:48 AM Yonghong Song <yhs@fb.com> wrote:
>
> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> > From: Hao Luo <haoluo@google.com>
> >
> > In bpf_seq_read, seq->op->next() could return an ERR and jump to
> > the label stop. However, the existing code in stop does not handle
> > the case when p (returned from next()) is an ERR. Add handling of
> > an ERR p by converting p into an error and jumping to done.
> >
> > Because none of the current implementations return an ERR from
> > next(), this patch doesn't change behavior right now.
> >
> > Signed-off-by: Hao Luo <haoluo@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> Acked-by: Yonghong Song <yhs@fb.com>

Yonghong, do you want to get this change in now, or do you want to wait
for the whole patchset? This fix is straightforward and independent of
other parts. Yosry and I can rebase.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 3/8] bpf, iter: Fix the condition on p when calling stop.
  2022-06-21  7:25     ` Hao Luo
@ 2022-06-24 17:46       ` Yonghong Song
  2022-06-24 18:23         ` Yosry Ahmed
  0 siblings, 1 reply; 46+ messages in thread
From: Yonghong Song @ 2022-06-24 17:46 UTC (permalink / raw)
  To: Hao Luo
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan,
	Michal Hocko, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups



On 6/21/22 12:25 AM, Hao Luo wrote:
> On Mon, Jun 20, 2022 at 11:48 AM Yonghong Song <yhs@fb.com> wrote:
>>
>> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
>>> From: Hao Luo <haoluo@google.com>
>>>
>>> In bpf_seq_read, seq->op->next() could return an ERR and jump to
>>> the label stop. However, the existing code in stop does not handle
>>> the case when p (returned from next()) is an ERR. Add handling of
>>> an ERR p by converting p into an error and jumping to done.
>>>
>>> Because none of the current implementations return an ERR from
>>> next(), this patch doesn't change behavior right now.
>>>
>>> Signed-off-by: Hao Luo <haoluo@google.com>
>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>>
>> Acked-by: Yonghong Song <yhs@fb.com>
> 
> Yonghong, do you want to get this change in now, or do you want to wait
> for the whole patchset? This fix is straightforward and independent of
> other parts. Yosry and I can rebase.

Sorry for the delay. Let me review the other patches as well before your
next version.

BTW, it would be great if you just put the prerequisite patch
 
https://lore.kernel.org/bpf/20220421140740.459558-5-benjamin.tissoires@redhat.com/
as the first patch so at least BPF CI will be able to test
your patch set. It looks like KP's bpf_getxattr patch set already did this.
 
https://lore.kernel.org/bpf/20220624045636.3668195-2-kpsingh@kernel.org/T/#u


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 3/8] bpf, iter: Fix the condition on p when calling stop.
  2022-06-24 17:46       ` Yonghong Song
@ 2022-06-24 18:23         ` Yosry Ahmed
  0 siblings, 0 replies; 46+ messages in thread
From: Yosry Ahmed @ 2022-06-24 18:23 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Hao Luo, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Tejun Heo,
	Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Fri, Jun 24, 2022 at 10:46 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 6/21/22 12:25 AM, Hao Luo wrote:
> > On Mon, Jun 20, 2022 at 11:48 AM Yonghong Song <yhs@fb.com> wrote:
> >>
> >> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> >>> From: Hao Luo <haoluo@google.com>
> >>>
> >>> In bpf_seq_read, seq->op->next() could return an ERR and jump to
> >>> the label stop. However, the existing code in stop does not handle
> >>> the case when p (returned from next()) is an ERR. Add handling of
> >>> an ERR p by converting p into an error and jumping to done.
> >>>
> >>> Because none of the current implementations return an ERR from
> >>> next(), this patch doesn't change behavior right now.
> >>>
> >>> Signed-off-by: Hao Luo <haoluo@google.com>
> >>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >>
> >> Acked-by: Yonghong Song <yhs@fb.com>
> >
> > Yonghong, do you want to get this change in now, or do you want to wait
> > for the whole patchset? This fix is straightforward and independent of
> > other parts. Yosry and I can rebase.
>
> Sorry for delay. Let me review other patches as well before your next
> version.

Thanks!

>
> BTW, it would be great if you just put the prerequisite patch

I am intending to do that in the next version if KP's patchset doesn't
land in bpf-next.

>
> https://lore.kernel.org/bpf/20220421140740.459558-5-benjamin.tissoires@redhat.com/
> as the first patch so at least BPF CI will be able to test
> your patch set. It looks like KP's bpf_getxattr patch set already did this.
>
> https://lore.kernel.org/bpf/20220624045636.3668195-2-kpsingh@kernel.org/T/#u
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
  2022-06-10 19:44 ` [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter Yosry Ahmed
                     ` (3 preceding siblings ...)
  2022-06-11 12:55   ` kernel test robot
@ 2022-06-28  4:09   ` Yonghong Song
  2022-06-28  6:06     ` Yosry Ahmed
  2022-07-07 23:33     ` Hao Luo
  2022-06-28  4:14   ` Yonghong Song
  5 siblings, 2 replies; 46+ messages in thread
From: Yonghong Song @ 2022-06-28  4:09 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups



On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> From: Hao Luo <haoluo@google.com>
> 
> Cgroup_iter is a type of bpf_iter. It walks over cgroups in two modes:
> 
>   - walking a cgroup's descendants.
>   - walking a cgroup's ancestors.

The implementation has another choice, BPF_ITER_CGROUP_PARENT_UP.
We should add it here as well.

> 
> When attaching cgroup_iter, one can set a cgroup to the iter_link
> created from attaching. This cgroup is passed as a file descriptor and
> serves as the starting point of the walk. If no cgroup is specified,
> the starting point will be the root cgroup.
> 
> For walking descendants, one can specify the order: either pre-order or
> post-order. For walking ancestors, the walk starts at the specified
> cgroup and ends at the root.
> 
> One can also terminate the walk early by returning 1 from the iter
> program.
> 
> Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
> program is called with cgroup_mutex held.

Overall looks good to me with a few nits below.

Acked-by: Yonghong Song <yhs@fb.com>
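
To make the intended usage concrete, a cgroup_iter program under this
proposal would look roughly like the sketch below. This is illustrative
only: it assumes the bpf_iter__cgroup context and BPF_ITER_CGROUP_*
values added by this patch, plus standard vmlinux.h/libbpf scaffolding;
the cgroup field access relies on CO-RE relocations.

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("iter/cgroup")
int dump_cgroup_ids(struct bpf_iter__cgroup *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct cgroup *cgrp = ctx->cgroup;

	/* A NULL cgroup means the walk has completed; emit an epilogue. */
	if (!cgrp) {
		BPF_SEQ_PRINTF(seq, "epilogue\n");
		return 0;
	}

	/* seq_num == 0 marks the first element of the walk. */
	if (ctx->meta->seq_num == 0)
		BPF_SEQ_PRINTF(seq, "prologue\n");

	BPF_SEQ_PRINTF(seq, "cgroup id: %llu\n", cgrp->kn->id);

	return 0;	/* returning 1 would terminate the walk early */
}

Attaching from userspace would then pass the starting cgroup and the
traversal order through the link info (again a sketch; skel and
cgroup_fd are assumed, and cgroup_fd == 0 falls back to walking from
the root):

	union bpf_iter_link_info linfo = {};
	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	struct bpf_link *link;

	linfo.cgroup.cgroup_fd = cgroup_fd;
	linfo.cgroup.traversal_order = BPF_ITER_CGROUP_PRE;
	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);
	link = bpf_program__attach_iter(skel->progs.dump_cgroup_ids, &opts);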

> 
> Signed-off-by: Hao Luo <haoluo@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>   include/linux/bpf.h            |   8 ++
>   include/uapi/linux/bpf.h       |  21 +++
>   kernel/bpf/Makefile            |   2 +-
>   kernel/bpf/cgroup_iter.c       | 235 +++++++++++++++++++++++++++++++++
>   tools/include/uapi/linux/bpf.h |  21 +++
>   5 files changed, 286 insertions(+), 1 deletion(-)
>   create mode 100644 kernel/bpf/cgroup_iter.c
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 8e6092d0ea956..48d8e836b9748 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -44,6 +44,7 @@ struct kobject;
>   struct mem_cgroup;
>   struct module;
>   struct bpf_func_state;
> +struct cgroup;
>   
>   extern struct idr btf_idr;
>   extern spinlock_t btf_idr_lock;
> @@ -1590,7 +1591,14 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>   	int __init bpf_iter_ ## target(args) { return 0; }
>   
>   struct bpf_iter_aux_info {
> +	/* for map_elem iter */
>   	struct bpf_map *map;
> +
> +	/* for cgroup iter */
> +	struct {
> +		struct cgroup *start; /* starting cgroup */
> +		int order;
> +	} cgroup;
>   };
>   
>   typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f4009dbdf62da..4fd05cde19116 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -87,10 +87,27 @@ struct bpf_cgroup_storage_key {
>   	__u32	attach_type;		/* program attach type (enum bpf_attach_type) */
>   };
>   
> +enum bpf_iter_cgroup_traversal_order {
> +	BPF_ITER_CGROUP_PRE = 0,	/* pre-order traversal */
> +	BPF_ITER_CGROUP_POST,		/* post-order traversal */
> +	BPF_ITER_CGROUP_PARENT_UP,	/* traversal of ancestors up to the root */
> +};
> +
>   union bpf_iter_link_info {
>   	struct {
>   		__u32	map_fd;
>   	} map;
> +
> +	/* cgroup_iter walks either the live descendants of a cgroup subtree, or the ancestors
> +	 * of a given cgroup.
> +	 */
> +	struct {
> +		/* Cgroup file descriptor. This is the root of the subtree when walking
> +		 * descendants; it is the starting cgroup when walking ancestors.
> +		 */
> +		__u32	cgroup_fd;
> +		__u32	traversal_order;
> +	} cgroup;
>   };
>   
>   /* BPF syscall commands, see bpf(2) man-page for more details. */
> @@ -6050,6 +6067,10 @@ struct bpf_link_info {
>   				struct {
>   					__u32 map_id;
>   				} map;
> +				struct {
> +					__u32 traversal_order;
> +					__aligned_u64 cgroup_id;
> +				} cgroup;
>   			};
>   		} iter;
>   		struct  {
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 057ba8e01e70f..9741b9314fb46 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
>   
>   obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
>   obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
> -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
> +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_iter.o
>   obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
>   obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
>   obj-$(CONFIG_BPF_SYSCALL) += disasm.o
> diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
> new file mode 100644
> index 0000000000000..88deb655efa71
> --- /dev/null
> +++ b/kernel/bpf/cgroup_iter.c
> @@ -0,0 +1,235 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2022 Google */
> +#include <linux/bpf.h>
> +#include <linux/btf_ids.h>
> +#include <linux/cgroup.h>
> +#include <linux/kernel.h>
> +#include <linux/seq_file.h>
> +
> +#include "../cgroup/cgroup-internal.h"  /* cgroup_mutex and cgroup_is_dead */
> +
> +/* cgroup_iter provides two modes of traversal to the cgroup hierarchy.
> + *
> + *  1. Walk the descendants of a cgroup.
> + *  2. Walk the ancestors of a cgroup.

three modes here?

> + *
> + * For walking descendants, cgroup_iter can walk in either pre-order or
> + * post-order. For walking ancestors, the iter walks up from a cgroup to
> + * the root.
> + *
> + * The iter program can terminate the walk early by returning 1. Walk
> + * continues if prog returns 0.
> + *
> + * The prog can check (seq->num == 0) to determine whether this is
> + * the first element. The prog may also be passed a NULL cgroup,
> + * which means the walk has completed and the prog has a chance to
> + * do post-processing, such as outputting an epilogue.
> + *
> + * Note: the iter_prog is called with cgroup_mutex held.
> + */
> +
> +struct bpf_iter__cgroup {
> +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
> +	__bpf_md_ptr(struct cgroup *, cgroup);
> +};
> +
> +struct cgroup_iter_priv {
> +	struct cgroup_subsys_state *start_css;
> +	bool terminate;
> +	int order;
> +};
> +
> +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	struct cgroup_iter_priv *p = seq->private;
> +
> +	mutex_lock(&cgroup_mutex);
> +
> +	/* support only one session */
> +	if (*pos > 0)
> +		return NULL;
> +
> +	++*pos;
> +	p->terminate = false;
> +	if (p->order == BPF_ITER_CGROUP_PRE)
> +		return css_next_descendant_pre(NULL, p->start_css);
> +	else if (p->order == BPF_ITER_CGROUP_POST)
> +		return css_next_descendant_post(NULL, p->start_css);
> +	else /* BPF_ITER_CGROUP_PARENT_UP */
> +		return p->start_css;
> +}
> +
> +static int __cgroup_iter_seq_show(struct seq_file *seq,
> +				  struct cgroup_subsys_state *css, int in_stop);
> +
> +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
> +{
> +	/* pass NULL to the prog for post-processing */
> +	if (!v)
> +		__cgroup_iter_seq_show(seq, NULL, true);
> +	mutex_unlock(&cgroup_mutex);
> +}
> +
> +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct cgroup_subsys_state *curr = (struct cgroup_subsys_state *)v;
> +	struct cgroup_iter_priv *p = seq->private;
> +
> +	++*pos;
> +	if (p->terminate)
> +		return NULL;
> +
> +	if (p->order == BPF_ITER_CGROUP_PRE)
> +		return css_next_descendant_pre(curr, p->start_css);
> +	else if (p->order == BPF_ITER_CGROUP_POST)
> +		return css_next_descendant_post(curr, p->start_css);
> +	else
> +		return curr->parent;
> +}
> +
> +static int __cgroup_iter_seq_show(struct seq_file *seq,
> +				  struct cgroup_subsys_state *css, int in_stop)
> +{
> +	struct cgroup_iter_priv *p = seq->private;
> +	struct bpf_iter__cgroup ctx;
> +	struct bpf_iter_meta meta;
> +	struct bpf_prog *prog;
> +	int ret = 0;
> +
> +	/* cgroup is dead, skip this element */
> +	if (css && cgroup_is_dead(css->cgroup))
> +		return 0;
> +
> +	ctx.meta = &meta;
> +	ctx.cgroup = css ? css->cgroup : NULL;
> +	meta.seq = seq;
> +	prog = bpf_iter_get_info(&meta, in_stop);
> +	if (prog)
> +		ret = bpf_iter_run_prog(prog, &ctx);
> +
> +	/* if prog returns > 0, terminate after this element. */
> +	if (ret != 0)
> +		p->terminate = true;
> +
> +	return 0;
> +}
> +
> +static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
> +{
> +	return __cgroup_iter_seq_show(seq, (struct cgroup_subsys_state *)v,
> +				      false);
> +}
> +
> +static const struct seq_operations cgroup_iter_seq_ops = {
> +	.start  = cgroup_iter_seq_start,
> +	.next   = cgroup_iter_seq_next,
> +	.stop   = cgroup_iter_seq_stop,
> +	.show   = cgroup_iter_seq_show,
> +};
> +
> +BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
> +
> +static int cgroup_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
> +{
> +	struct cgroup_iter_priv *p = (struct cgroup_iter_priv *)priv;
> +	struct cgroup *cgrp = aux->cgroup.start;
> +
> +	p->start_css = &cgrp->self;
> +	p->terminate = false;
> +	p->order = aux->cgroup.order;
> +	return 0;
> +}
> +
> +static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
> +	.seq_ops                = &cgroup_iter_seq_ops,
> +	.init_seq_private       = cgroup_iter_seq_init,
> +	.seq_priv_size          = sizeof(struct cgroup_iter_priv),
> +};
> +
> +static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
> +				  union bpf_iter_link_info *linfo,
> +				  struct bpf_iter_aux_info *aux)
> +{
> +	int fd = linfo->cgroup.cgroup_fd;
> +	struct cgroup *cgrp;
> +
> +	if (fd)
> +		cgrp = cgroup_get_from_fd(fd);
> +	else /* walk the entire hierarchy by default. */
> +		cgrp = cgroup_get_from_path("/");
> +
> +	if (IS_ERR(cgrp))
> +		return PTR_ERR(cgrp);
> +
> +	aux->cgroup.start = cgrp;
> +	aux->cgroup.order = linfo->cgroup.traversal_order;

The legality of traversal_order should be checked.
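
Something along these lines at the top of bpf_iter_attach_cgroup(),
before the cgroup reference is taken, would do (just a sketch of the
idea, not a tested patch):

	/* Reject unknown traversal orders up front, before any
	 * reference is taken, so the error path needs no cleanup.
	 */
	switch (linfo->cgroup.traversal_order) {
	case BPF_ITER_CGROUP_PRE:
	case BPF_ITER_CGROUP_POST:
	case BPF_ITER_CGROUP_PARENT_UP:
		break;
	default:
		return -EINVAL;
	}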

> +	return 0;
> +}
> +
> +static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
> +{
> +	cgroup_put(aux->cgroup.start);
> +}
> +
> +static void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
> +					struct seq_file *seq)
> +{
> +	char *buf;
> +
> +	buf = kzalloc(PATH_MAX, GFP_KERNEL);
> +	if (!buf) {
> +		seq_puts(seq, "cgroup_path:\n");

This is a really unlikely case. Maybe "cgroup_path:<unknown>"?

> +		goto show_order;
> +	}
> +
> +	/* If cgroup_path_ns() fails, buf will be an empty string, cgroup_path
> +	 * will print nothing.
> +	 *
> +	 * Path is in the calling process's cgroup namespace.
> +	 */
> +	cgroup_path_ns(aux->cgroup.start, buf, PATH_MAX,
> +		       current->nsproxy->cgroup_ns);
> +	seq_printf(seq, "cgroup_path:\t%s\n", buf);
> +	kfree(buf);
> +
> +show_order:
> +	if (aux->cgroup.order == BPF_ITER_CGROUP_PRE)
> +		seq_puts(seq, "traversal_order: pre\n");
> +	else if (aux->cgroup.order == BPF_ITER_CGROUP_POST)
> +		seq_puts(seq, "traversal_order: post\n");
> +	else /* BPF_ITER_CGROUP_PARENT_UP */
> +		seq_puts(seq, "traversal_order: parent_up\n");
> +}
> +
[...]

* Re: [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
  2022-06-10 19:44 ` [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter Yosry Ahmed
                     ` (4 preceding siblings ...)
  2022-06-28  4:09   ` Yonghong Song
@ 2022-06-28  4:14   ` Yonghong Song
  2022-06-28  6:03     ` Yosry Ahmed
  2022-07-07 23:36     ` Hao Luo
  5 siblings, 2 replies; 46+ messages in thread
From: Yonghong Song @ 2022-06-28  4:14 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups



On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> From: Hao Luo <haoluo@google.com>
> 
> Cgroup_iter is a type of bpf_iter. It walks over cgroups in two modes:
> 
>   - walking a cgroup's descendants.
>   - walking a cgroup's ancestors.
> 
> When attaching cgroup_iter, one can set a cgroup to the iter_link
> created from attaching. This cgroup is passed as a file descriptor and
> serves as the starting point of the walk. If no cgroup is specified,
> the starting point will be the root cgroup.
> 
> For walking descendants, one can specify the order: either pre-order or
> post-order. For walking ancestors, the walk starts at the specified
> cgroup and ends at the root.
> 
> One can also terminate the walk early by returning 1 from the iter
> program.
> 
> Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
> program is called with cgroup_mutex held.
> 
> Signed-off-by: Hao Luo <haoluo@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>   include/linux/bpf.h            |   8 ++
>   include/uapi/linux/bpf.h       |  21 +++
>   kernel/bpf/Makefile            |   2 +-
>   kernel/bpf/cgroup_iter.c       | 235 +++++++++++++++++++++++++++++++++
>   tools/include/uapi/linux/bpf.h |  21 +++
>   5 files changed, 286 insertions(+), 1 deletion(-)
>   create mode 100644 kernel/bpf/cgroup_iter.c
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 8e6092d0ea956..48d8e836b9748 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -44,6 +44,7 @@ struct kobject;
>   struct mem_cgroup;
>   struct module;
>   struct bpf_func_state;
> +struct cgroup;
>   
>   extern struct idr btf_idr;
>   extern spinlock_t btf_idr_lock;
> @@ -1590,7 +1591,14 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>   	int __init bpf_iter_ ## target(args) { return 0; }
>   
>   struct bpf_iter_aux_info {
> +	/* for map_elem iter */
>   	struct bpf_map *map;
> +
> +	/* for cgroup iter */
> +	struct {
> +		struct cgroup *start; /* starting cgroup */
> +		int order;
> +	} cgroup;
>   };
>   
[...]
> +
> +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	struct cgroup_iter_priv *p = seq->private;
> +
> +	mutex_lock(&cgroup_mutex);
> +
> +	/* support only one session */
> +	if (*pos > 0)
> +		return NULL;
> +
> +	++*pos;
> +	p->terminate = false;
> +	if (p->order == BPF_ITER_CGROUP_PRE)
> +		return css_next_descendant_pre(NULL, p->start_css);
> +	else if (p->order == BPF_ITER_CGROUP_POST)
> +		return css_next_descendant_post(NULL, p->start_css);
> +	else /* BPF_ITER_CGROUP_PARENT_UP */
> +		return p->start_css;
> +}
> +
> +static int __cgroup_iter_seq_show(struct seq_file *seq,
> +				  struct cgroup_subsys_state *css, int in_stop);
> +
> +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
> +{
> +	/* pass NULL to the prog for post-processing */
> +	if (!v)
> +		__cgroup_iter_seq_show(seq, NULL, true);
> +	mutex_unlock(&cgroup_mutex);
> +}
> +
> +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct cgroup_subsys_state *curr = (struct cgroup_subsys_state *)v;
> +	struct cgroup_iter_priv *p = seq->private;
> +
> +	++*pos;
> +	if (p->terminate)
> +		return NULL;
> +
> +	if (p->order == BPF_ITER_CGROUP_PRE)
> +		return css_next_descendant_pre(curr, p->start_css);
> +	else if (p->order == BPF_ITER_CGROUP_POST)
> +		return css_next_descendant_post(curr, p->start_css);
> +	else
> +		return curr->parent;
> +}
> +
> +static int __cgroup_iter_seq_show(struct seq_file *seq,
> +				  struct cgroup_subsys_state *css, int in_stop)
> +{
> +	struct cgroup_iter_priv *p = seq->private;
> +	struct bpf_iter__cgroup ctx;
> +	struct bpf_iter_meta meta;
> +	struct bpf_prog *prog;
> +	int ret = 0;
> +
> +	/* cgroup is dead, skip this element */
> +	if (css && cgroup_is_dead(css->cgroup))
> +		return 0;
> +
> +	ctx.meta = &meta;
> +	ctx.cgroup = css ? css->cgroup : NULL;
> +	meta.seq = seq;
> +	prog = bpf_iter_get_info(&meta, in_stop);
> +	if (prog)
> +		ret = bpf_iter_run_prog(prog, &ctx);

Do we need to do anything special to ensure the bpf program gets
up-to-date stats from ctx.cgroup?

> +
> +	/* if prog returns > 0, terminate after this element. */
> +	if (ret != 0)
> +		p->terminate = true;
> +
> +	return 0;
> +}
> +
[...]

* Re: [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
  2022-06-28  4:14   ` Yonghong Song
@ 2022-06-28  6:03     ` Yosry Ahmed
  2022-06-28 17:03       ` Yonghong Song
  2022-07-07 23:36     ` Hao Luo
  1 sibling, 1 reply; 46+ messages in thread
From: Yosry Ahmed @ 2022-06-28  6:03 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Mon, Jun 27, 2022 at 9:14 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> > From: Hao Luo <haoluo@google.com>
> >
> > Cgroup_iter is a type of bpf_iter. It walks over cgroups in two modes:
> >
> >   - walking a cgroup's descendants.
> >   - walking a cgroup's ancestors.
> >
> > When attaching cgroup_iter, one can set a cgroup to the iter_link
> > created from attaching. This cgroup is passed as a file descriptor and
> > serves as the starting point of the walk. If no cgroup is specified,
> > the starting point will be the root cgroup.
> >
> > For walking descendants, one can specify the order: either pre-order or
> > post-order. For walking ancestors, the walk starts at the specified
> > cgroup and ends at the root.
> >
> > One can also terminate the walk early by returning 1 from the iter
> > program.
> >
> > Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
> > program is called with cgroup_mutex held.
> >
> > Signed-off-by: Hao Luo <haoluo@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >   include/linux/bpf.h            |   8 ++
> >   include/uapi/linux/bpf.h       |  21 +++
> >   kernel/bpf/Makefile            |   2 +-
> >   kernel/bpf/cgroup_iter.c       | 235 +++++++++++++++++++++++++++++++++
> >   tools/include/uapi/linux/bpf.h |  21 +++
> >   5 files changed, 286 insertions(+), 1 deletion(-)
> >   create mode 100644 kernel/bpf/cgroup_iter.c
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 8e6092d0ea956..48d8e836b9748 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -44,6 +44,7 @@ struct kobject;
> >   struct mem_cgroup;
> >   struct module;
> >   struct bpf_func_state;
> > +struct cgroup;
> >
> >   extern struct idr btf_idr;
> >   extern spinlock_t btf_idr_lock;
> > @@ -1590,7 +1591,14 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> >       int __init bpf_iter_ ## target(args) { return 0; }
> >
> >   struct bpf_iter_aux_info {
> > +     /* for map_elem iter */
> >       struct bpf_map *map;
> > +
> > +     /* for cgroup iter */
> > +     struct {
> > +             struct cgroup *start; /* starting cgroup */
> > +             int order;
> > +     } cgroup;
> >   };
> >
> [...]
> > +
> > +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > +     struct cgroup_iter_priv *p = seq->private;
> > +
> > +     mutex_lock(&cgroup_mutex);
> > +
> > +     /* support only one session */
> > +     if (*pos > 0)
> > +             return NULL;
> > +
> > +     ++*pos;
> > +     p->terminate = false;
> > +     if (p->order == BPF_ITER_CGROUP_PRE)
> > +             return css_next_descendant_pre(NULL, p->start_css);
> > +     else if (p->order == BPF_ITER_CGROUP_POST)
> > +             return css_next_descendant_post(NULL, p->start_css);
> > +     else /* BPF_ITER_CGROUP_PARENT_UP */
> > +             return p->start_css;
> > +}
> > +
> > +static int __cgroup_iter_seq_show(struct seq_file *seq,
> > +                               struct cgroup_subsys_state *css, int in_stop);
> > +
> > +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
> > +{
> > +     /* pass NULL to the prog for post-processing */
> > +     if (!v)
> > +             __cgroup_iter_seq_show(seq, NULL, true);
> > +     mutex_unlock(&cgroup_mutex);
> > +}
> > +
> > +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > +     struct cgroup_subsys_state *curr = (struct cgroup_subsys_state *)v;
> > +     struct cgroup_iter_priv *p = seq->private;
> > +
> > +     ++*pos;
> > +     if (p->terminate)
> > +             return NULL;
> > +
> > +     if (p->order == BPF_ITER_CGROUP_PRE)
> > +             return css_next_descendant_pre(curr, p->start_css);
> > +     else if (p->order == BPF_ITER_CGROUP_POST)
> > +             return css_next_descendant_post(curr, p->start_css);
> > +     else
> > +             return curr->parent;
> > +}
> > +
> > +static int __cgroup_iter_seq_show(struct seq_file *seq,
> > +                               struct cgroup_subsys_state *css, int in_stop)
> > +{
> > +     struct cgroup_iter_priv *p = seq->private;
> > +     struct bpf_iter__cgroup ctx;
> > +     struct bpf_iter_meta meta;
> > +     struct bpf_prog *prog;
> > +     int ret = 0;
> > +
> > +     /* cgroup is dead, skip this element */
> > +     if (css && cgroup_is_dead(css->cgroup))
> > +             return 0;
> > +
> > +     ctx.meta = &meta;
> > +     ctx.cgroup = css ? css->cgroup : NULL;
> > +     meta.seq = seq;
> > +     prog = bpf_iter_get_info(&meta, in_stop);
> > +     if (prog)
> > +             ret = bpf_iter_run_prog(prog, &ctx);
>
> Do we need to do anything special to ensure the bpf program gets
> up-to-date stats from ctx.cgroup?

Later patches in the series add a cgroup_rstat_flush() kfunc which
flushes cgroup stats that use rstat (e.g. memcg stats). It can be
called directly from the bpf program if needed.

It is better to leave this to the bpf program: flushing unconditionally
for every cgroup_iter program would be an unnecessary toll, since a
program may not access stats at all, or may use stats that are not
maintained through rstat.
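
For what it's worth, a minimal sketch of what that looks like from the
program side, assuming the kfunc lands as described in the later
patches (the program name here is made up):

	/* Declared as a kfunc by a later patch in this series. */
	extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;

	SEC("iter.s/cgroup")
	int BPF_PROG(dump_stats, struct bpf_iter_meta *meta, struct cgroup *cgrp)
	{
		/* A NULL cgroup marks the end of the walk. */
		if (!cgrp)
			return 1;

		/* Propagate pending per-cpu updates in cgrp's subtree so
		 * the readings consumed below are current.
		 */
		cgroup_rstat_flush(cgrp);

		/* ... look up this cgroup's readings and print them ... */
		return 0;
	}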

>
> > +
> > +     /* if prog returns > 0, terminate after this element. */
> > +     if (ret != 0)
> > +             p->terminate = true;
> > +
> > +     return 0;
> > +}
> > +
> [...]

* Re: [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
  2022-06-28  4:09   ` Yonghong Song
@ 2022-06-28  6:06     ` Yosry Ahmed
  2022-07-07 23:33     ` Hao Luo
  1 sibling, 0 replies; 46+ messages in thread
From: Yosry Ahmed @ 2022-06-28  6:06 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Mon, Jun 27, 2022 at 9:09 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> > From: Hao Luo <haoluo@google.com>
> >
> > Cgroup_iter is a type of bpf_iter. It walks over cgroups in two modes:
> >
> >   - walking a cgroup's descendants.
> >   - walking a cgroup's ancestors.
>
> The implementation has another choice, BPF_ITER_CGROUP_PARENT_UP.
> We should add it here as well.
>

BPF_ITER_CGROUP_PARENT_UP is expressed here, I think what's actually
missing here (and down below where only 2 modes are specified again)
is that walking descendants is broken down into two separate modes,
pre and post order traversals.

> >
> > When attaching cgroup_iter, one can set a cgroup to the iter_link
> > created from attaching. This cgroup is passed as a file descriptor and
> > serves as the starting point of the walk. If no cgroup is specified,
> > the starting point will be the root cgroup.
> >
> > For walking descendants, one can specify the order: either pre-order or
> > post-order. For walking ancestors, the walk starts at the specified
> > cgroup and ends at the root.
> >
> > One can also terminate the walk early by returning 1 from the iter
> > program.
> >
> > Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
> > program is called with cgroup_mutex held.
>
> Overall looks good to me with a few nits below.
>
> Acked-by: Yonghong Song <yhs@fb.com>
>
> >
> > Signed-off-by: Hao Luo <haoluo@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >   include/linux/bpf.h            |   8 ++
> >   include/uapi/linux/bpf.h       |  21 +++
> >   kernel/bpf/Makefile            |   2 +-
> >   kernel/bpf/cgroup_iter.c       | 235 +++++++++++++++++++++++++++++++++
> >   tools/include/uapi/linux/bpf.h |  21 +++
> >   5 files changed, 286 insertions(+), 1 deletion(-)
> >   create mode 100644 kernel/bpf/cgroup_iter.c
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 8e6092d0ea956..48d8e836b9748 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -44,6 +44,7 @@ struct kobject;
> >   struct mem_cgroup;
> >   struct module;
> >   struct bpf_func_state;
> > +struct cgroup;
> >
> >   extern struct idr btf_idr;
> >   extern spinlock_t btf_idr_lock;
> > @@ -1590,7 +1591,14 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> >       int __init bpf_iter_ ## target(args) { return 0; }
> >
> >   struct bpf_iter_aux_info {
> > +     /* for map_elem iter */
> >       struct bpf_map *map;
> > +
> > +     /* for cgroup iter */
> > +     struct {
> > +             struct cgroup *start; /* starting cgroup */
> > +             int order;
> > +     } cgroup;
> >   };
> >
> >   typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index f4009dbdf62da..4fd05cde19116 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -87,10 +87,27 @@ struct bpf_cgroup_storage_key {
> >       __u32   attach_type;            /* program attach type (enum bpf_attach_type) */
> >   };
> >
> > +enum bpf_iter_cgroup_traversal_order {
> > +     BPF_ITER_CGROUP_PRE = 0,        /* pre-order traversal */
> > +     BPF_ITER_CGROUP_POST,           /* post-order traversal */
> > +     BPF_ITER_CGROUP_PARENT_UP,      /* traversal of ancestors up to the root */
> > +};
> > +
> >   union bpf_iter_link_info {
> >       struct {
> >               __u32   map_fd;
> >       } map;
> > +
> > +     /* cgroup_iter walks either the live descendants of a cgroup subtree, or the ancestors
> > +      * of a given cgroup.
> > +      */
> > +     struct {
> > +             /* Cgroup file descriptor. This is the root of the subtree when
> > +              * walking descendants, and the starting cgroup when walking ancestors.
> > +              */
> > +             __u32   cgroup_fd;
> > +             __u32   traversal_order;
> > +     } cgroup;
> >   };
> >
> >   /* BPF syscall commands, see bpf(2) man-page for more details. */
> > @@ -6050,6 +6067,10 @@ struct bpf_link_info {
> >                               struct {
> >                                       __u32 map_id;
> >                               } map;
> > +                             struct {
> > +                                     __u32 traversal_order;
> > +                                     __aligned_u64 cgroup_id;
> > +                             } cgroup;
> >                       };
> >               } iter;
> >               struct  {
> > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > index 057ba8e01e70f..9741b9314fb46 100644
> > --- a/kernel/bpf/Makefile
> > +++ b/kernel/bpf/Makefile
> > @@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
> >
> >   obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
> >   obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
> > -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
> > +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_iter.o
> >   obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
> >   obj-${CONFIG_BPF_LSM}         += bpf_inode_storage.o
> >   obj-$(CONFIG_BPF_SYSCALL) += disasm.o
> > diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
> > new file mode 100644
> > index 0000000000000..88deb655efa71
> > --- /dev/null
> > +++ b/kernel/bpf/cgroup_iter.c
> > @@ -0,0 +1,235 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright (c) 2022 Google */
> > +#include <linux/bpf.h>
> > +#include <linux/btf_ids.h>
> > +#include <linux/cgroup.h>
> > +#include <linux/kernel.h>
> > +#include <linux/seq_file.h>
> > +
> > +#include "../cgroup/cgroup-internal.h"  /* cgroup_mutex and cgroup_is_dead */
> > +
> > +/* cgroup_iter provides two modes of traversal to the cgroup hierarchy.
> > + *
> > + *  1. Walk the descendants of a cgroup.
> > + *  2. Walk the ancestors of a cgroup.
>
> three modes here?
>
> > + *
> > + * For walking descendants, cgroup_iter can walk in either pre-order or
> > + * post-order. For walking ancestors, the iter walks up from a cgroup to
> > + * the root.
> > + *
> > + * The iter program can terminate the walk early by returning 1. Walk
> > + * continues if prog returns 0.
> > + *
> > + * The prog can check (seq->num == 0) to determine whether this is
> > + * the first element. The prog may also be passed a NULL cgroup,
> > + * which means the walk has completed and the prog has a chance to
> > + * do post-processing, such as outputting an epilogue.
> > + *
> > + * Note: the iter_prog is called with cgroup_mutex held.
> > + */
> > +
> > +struct bpf_iter__cgroup {
> > +     __bpf_md_ptr(struct bpf_iter_meta *, meta);
> > +     __bpf_md_ptr(struct cgroup *, cgroup);
> > +};
> > +
> > +struct cgroup_iter_priv {
> > +     struct cgroup_subsys_state *start_css;
> > +     bool terminate;
> > +     int order;
> > +};
> > +
> > +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > +     struct cgroup_iter_priv *p = seq->private;
> > +
> > +     mutex_lock(&cgroup_mutex);
> > +
> > +     /* support only one session */
> > +     if (*pos > 0)
> > +             return NULL;
> > +
> > +     ++*pos;
> > +     p->terminate = false;
> > +     if (p->order == BPF_ITER_CGROUP_PRE)
> > +             return css_next_descendant_pre(NULL, p->start_css);
> > +     else if (p->order == BPF_ITER_CGROUP_POST)
> > +             return css_next_descendant_post(NULL, p->start_css);
> > +     else /* BPF_ITER_CGROUP_PARENT_UP */
> > +             return p->start_css;
> > +}
> > +
> > +static int __cgroup_iter_seq_show(struct seq_file *seq,
> > +                               struct cgroup_subsys_state *css, int in_stop);
> > +
> > +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
> > +{
> > +     /* pass NULL to the prog for post-processing */
> > +     if (!v)
> > +             __cgroup_iter_seq_show(seq, NULL, true);
> > +     mutex_unlock(&cgroup_mutex);
> > +}
> > +
> > +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > +     struct cgroup_subsys_state *curr = (struct cgroup_subsys_state *)v;
> > +     struct cgroup_iter_priv *p = seq->private;
> > +
> > +     ++*pos;
> > +     if (p->terminate)
> > +             return NULL;
> > +
> > +     if (p->order == BPF_ITER_CGROUP_PRE)
> > +             return css_next_descendant_pre(curr, p->start_css);
> > +     else if (p->order == BPF_ITER_CGROUP_POST)
> > +             return css_next_descendant_post(curr, p->start_css);
> > +     else
> > +             return curr->parent;
> > +}
> > +
> > +static int __cgroup_iter_seq_show(struct seq_file *seq,
> > +                               struct cgroup_subsys_state *css, int in_stop)
> > +{
> > +     struct cgroup_iter_priv *p = seq->private;
> > +     struct bpf_iter__cgroup ctx;
> > +     struct bpf_iter_meta meta;
> > +     struct bpf_prog *prog;
> > +     int ret = 0;
> > +
> > +     /* cgroup is dead, skip this element */
> > +     if (css && cgroup_is_dead(css->cgroup))
> > +             return 0;
> > +
> > +     ctx.meta = &meta;
> > +     ctx.cgroup = css ? css->cgroup : NULL;
> > +     meta.seq = seq;
> > +     prog = bpf_iter_get_info(&meta, in_stop);
> > +     if (prog)
> > +             ret = bpf_iter_run_prog(prog, &ctx);
> > +
> > +     /* if prog returns > 0, terminate after this element. */
> > +     if (ret != 0)
> > +             p->terminate = true;
> > +
> > +     return 0;
> > +}
> > +
> > +static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
> > +{
> > +     return __cgroup_iter_seq_show(seq, (struct cgroup_subsys_state *)v,
> > +                                   false);
> > +}
> > +
> > +static const struct seq_operations cgroup_iter_seq_ops = {
> > +     .start  = cgroup_iter_seq_start,
> > +     .next   = cgroup_iter_seq_next,
> > +     .stop   = cgroup_iter_seq_stop,
> > +     .show   = cgroup_iter_seq_show,
> > +};
> > +
> > +BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
> > +
> > +static int cgroup_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
> > +{
> > +     struct cgroup_iter_priv *p = (struct cgroup_iter_priv *)priv;
> > +     struct cgroup *cgrp = aux->cgroup.start;
> > +
> > +     p->start_css = &cgrp->self;
> > +     p->terminate = false;
> > +     p->order = aux->cgroup.order;
> > +     return 0;
> > +}
> > +
> > +static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
> > +     .seq_ops                = &cgroup_iter_seq_ops,
> > +     .init_seq_private       = cgroup_iter_seq_init,
> > +     .seq_priv_size          = sizeof(struct cgroup_iter_priv),
> > +};
> > +
> > +static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
> > +                               union bpf_iter_link_info *linfo,
> > +                               struct bpf_iter_aux_info *aux)
> > +{
> > +     int fd = linfo->cgroup.cgroup_fd;
> > +     struct cgroup *cgrp;
> > +
> > +     if (fd)
> > +             cgrp = cgroup_get_from_fd(fd);
> > +     else /* walk the entire hierarchy by default. */
> > +             cgrp = cgroup_get_from_path("/");
> > +
> > +     if (IS_ERR(cgrp))
> > +             return PTR_ERR(cgrp);
> > +
> > +     aux->cgroup.start = cgrp;
> > +     aux->cgroup.order = linfo->cgroup.traversal_order;
>
> The legality of traversal_order should be checked.
>
> > +     return 0;
> > +}
> > +
> > +static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
> > +{
> > +     cgroup_put(aux->cgroup.start);
> > +}
> > +
> > +static void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
> > +                                     struct seq_file *seq)
> > +{
> > +     char *buf;
> > +
> > +     buf = kzalloc(PATH_MAX, GFP_KERNEL);
> > +     if (!buf) {
> > +             seq_puts(seq, "cgroup_path:\n");
>
> This is a really unlikely case. Maybe "cgroup_path:<unknown>"?
>
> > +             goto show_order;
> > +     }
> > +
> > +     /* If cgroup_path_ns() fails, buf will be an empty string, cgroup_path
> > +      * will print nothing.
> > +      *
> > +      * Path is in the calling process's cgroup namespace.
> > +      */
> > +     cgroup_path_ns(aux->cgroup.start, buf, PATH_MAX,
> > +                    current->nsproxy->cgroup_ns);
> > +     seq_printf(seq, "cgroup_path:\t%s\n", buf);
> > +     kfree(buf);
> > +
> > +show_order:
> > +     if (aux->cgroup.order == BPF_ITER_CGROUP_PRE)
> > +             seq_puts(seq, "traversal_order: pre\n");
> > +     else if (aux->cgroup.order == BPF_ITER_CGROUP_POST)
> > +             seq_puts(seq, "traversal_order: post\n");
> > +     else /* BPF_ITER_CGROUP_PARENT_UP */
> > +             seq_puts(seq, "traversal_order: parent_up\n");
> > +}
> > +
> [...]

* Re: [PATCH bpf-next v2 5/8] selftests/bpf: Test cgroup_iter.
  2022-06-10 19:44 ` [PATCH bpf-next v2 5/8] selftests/bpf: Test cgroup_iter Yosry Ahmed
@ 2022-06-28  6:11   ` Yonghong Song
  0 siblings, 0 replies; 46+ messages in thread
From: Yonghong Song @ 2022-06-28  6:11 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups



On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> From: Hao Luo <haoluo@google.com>
> 
> Add a selftest for cgroup_iter. The selftest creates a mini cgroup tree
> of the following structure:
> 
>      ROOT (working cgroup)
>       |
>     PARENT
>    /      \
> CHILD1  CHILD2
> 
> and tests the following scenarios:
> 
>   - invalid cgroup fd.
>   - pre-order walk over descendants from PARENT.
>   - post-order walk over descendants from PARENT.
>   - walk of ancestors from PARENT.
>   - early termination.
> 
> Signed-off-by: Hao Luo <haoluo@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Yonghong Song <yhs@fb.com>
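
For readers following along, parameterizing the iter at attach time
looks roughly like this with libbpf (skeleton and program names here
are made up; see the selftest itself for the real thing):

	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	union bpf_iter_link_info linfo = {};
	struct bpf_link *link;

	/* Walk PARENT's subtree in pre-order. parent_cg_fd is an open
	 * fd for the PARENT cgroup directory.
	 */
	linfo.cgroup.cgroup_fd = parent_cg_fd;
	linfo.cgroup.traversal_order = BPF_ITER_CGROUP_PRE;
	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);

	link = bpf_program__attach_iter(skel->progs.cgroup_iter_prog, &opts);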

* Re: [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat
  2022-06-10 19:44 ` [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat Yosry Ahmed
                     ` (2 preceding siblings ...)
  2022-06-11 10:22   ` kernel test robot
@ 2022-06-28  6:12   ` Yonghong Song
  3 siblings, 0 replies; 46+ messages in thread
From: Yonghong Song @ 2022-06-28  6:12 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups



On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> Enable bpf programs to make use of rstat to collect cgroup hierarchical
> stats efficiently:
> - Add cgroup_rstat_updated() kfunc, for bpf progs that collect stats.
> - Add cgroup_rstat_flush() kfunc, for bpf progs that read stats.
> - Add an empty bpf_rstat_flush() hook that is called during rstat
>    flushing, for bpf progs that flush stats to attach to. Attaching a bpf
>    prog to this hook effectively registers it as a flush callback.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Yonghong Song <yhs@fb.com>
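
Registering a flush callback then amounts to an fentry program on the
empty hook, along these lines (a sketch; the aggregation logic is
whatever the subscriber needs, see the selftest in patch 8):

	SEC("fentry/bpf_rstat_flush")
	int BPF_PROG(my_flush, struct cgroup *cgrp, struct cgroup *parent, int cpu)
	{
		/* Invoked once per (cgroup, cpu) pair that has updates,
		 * bottom-up, so a child is always flushed before its
		 * parent. Aggregate this cpu's delta for cgrp here and
		 * propagate it to the parent's pending state.
		 */
		return 0;
	}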

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-10 19:44 ` [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection Yosry Ahmed
@ 2022-06-28  6:14   ` Yonghong Song
  2022-06-28  6:47     ` Yosry Ahmed
  0 siblings, 1 reply; 46+ messages in thread
From: Yonghong Song @ 2022-06-28  6:14 UTC (permalink / raw)
  To: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Michal Hocko
  Cc: Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, linux-kernel, netdev, bpf, cgroups



On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> Add a selftest that tests the whole workflow for collecting,
> aggregating (flushing), and displaying cgroup hierarchical stats.
> 
> TL;DR:
> - Whenever reclaim happens, vmscan_start and vmscan_end update
>    per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
>    have updates.
> - When userspace tries to read the stats, vmscan_dump calls rstat to flush
>    the stats, and outputs the stats in text format to userspace (similar
>    to cgroupfs stats).
> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
>    updates, vmscan_flush aggregates cpu readings and propagates updates
>    to parents.
> 
> Detailed explanation:
> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
>    measure the latency of cgroup reclaim. Per-cgroup readings are stored in
>    percpu maps for efficiency. When a cgroup reading is updated on a cpu,
>    cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
>    rstat updated tree on that cpu.
> 
> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
>    each cgroup. Reading this file invokes the program, which calls
>    cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
>    cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
>    the stats are exposed to the user. vmscan_dump returns 1 to terminate
>    iteration early, so that we only expose stats for one cgroup per read.
> 
> - An ftrace program, vmscan_flush, is also loaded and attached to
>    bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
>    once for each (cgroup, cpu) pair that has updates. cgroups are popped
>    from the rstat tree in a bottom-up fashion, so calls will always be
>    made for cgroups that have updates before their parents. The program
>    aggregates percpu readings to a total per-cgroup reading, and also
>    propagates them to the parent cgroup. After rstat flushing is over, all
>    cgroups will have correct updated hierarchical readings (including all
>    cpus and all their descendants).
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

There is a selftest failure with this test:

get_cgroup_vmscan_delay:PASS:output format 0 nsec
get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
get_cgroup_vmscan_delay:PASS:output format 0 nsec
get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading: actual 0 <= expected 0
check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual 781874 != expected 382092
check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual -1 != expected -2
check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual 781874 != expected 781873
check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 < expected 781874
destroy_progs:PASS:remove cgroup_iter pin 0 nsec
destroy_progs:PASS:remove cgroup_iter pin 0 nsec
destroy_progs:PASS:remove cgroup_iter pin 0 nsec
destroy_progs:PASS:remove cgroup_iter pin 0 nsec
destroy_progs:PASS:remove cgroup_iter pin 0 nsec
destroy_progs:PASS:remove cgroup_iter pin 0 nsec
destroy_progs:PASS:remove cgroup_iter pin 0 nsec
destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
#33      cgroup_hierarchical_stats:FAIL


An existing test also failed.

btf_dump_data:PASS:find type id 0 nsec
btf_dump_data:PASS:failed/unexpected type_sz 0 nsec
btf_dump_data:FAIL:ensure expected/actual match unexpected ensure expected/actual match: actual '(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,},.cgroup '
test_btf_dump_struct_data:PASS:find struct sk_buff 0 nsec
test_btf_dump_struct_data:PASS:unexpected return value dumping sk_buff 0 nsec
btf_dump_data:PASS:verify prefix match 0 nsec
btf_dump_data:PASS:find type id 0 nsec
btf_dump_data:PASS:failed to return -E2BIG 0 nsec
btf_dump_data:PASS:ensure expected/actual match 0 nsec
btf_dump_data:PASS:verify prefix match 0 nsec
btf_dump_data:PASS:find type id 0 nsec
btf_dump_data:PASS:failed to return -E2BIG 0 nsec
btf_dump_data:PASS:ensure expected/actual match 0 nsec
#21/14   btf_dump/btf_dump: struct_data:FAIL

Please take a look.

> ---
>   .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
>   .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
>   2 files changed, 585 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
>   create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> new file mode 100644
> index 0000000000000..b78a4043da49a
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> @@ -0,0 +1,351 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Functions to manage eBPF programs attached to cgroup subsystems
> + *
> + * Copyright 2022 Google LLC.
> + */
> +#include <errno.h>
> +#include <sys/types.h>
> +#include <sys/mount.h>
> +#include <sys/stat.h>
> +#include <unistd.h>
> +
> +#include <test_progs.h>
> +#include <bpf/libbpf.h>
> +#include <bpf/bpf.h>
> +
> +#include "cgroup_helpers.h"
> +#include "cgroup_hierarchical_stats.skel.h"
> +
> +#define PAGE_SIZE 4096
> +#define MB(x) (x << 20)
> +
> +#define BPFFS_ROOT "/sys/fs/bpf/"
> +#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
> +
> +#define CG_ROOT_NAME "root"
> +#define CG_ROOT_ID 1
> +
> +#define CGROUP_PATH(p, n) {.path = #p"/"#n, .name = #n}
> +
> +static struct {
> +	const char *path, *name;
> +	unsigned long long id;
> +	int fd;
> +} cgroups[] = {
> +	CGROUP_PATH(/, test),
> +	CGROUP_PATH(/test, child1),
> +	CGROUP_PATH(/test, child2),
> +	CGROUP_PATH(/test/child1, child1_1),
> +	CGROUP_PATH(/test/child1, child1_2),
> +	CGROUP_PATH(/test/child2, child2_1),
> +	CGROUP_PATH(/test/child2, child2_2),
> +};
> +
> +#define N_CGROUPS ARRAY_SIZE(cgroups)
> +#define N_NON_LEAF_CGROUPS 3
> +
> +int root_cgroup_fd;
> +bool mounted_bpffs;
> +
> +static int read_from_file(const char *path, char *buf, size_t size)
> +{
> +	int fd, len;
> +
> +	fd = open(path, O_RDONLY);
> +	if (fd < 0) {
> +		log_err("Open %s", path);
> +		return -errno;
> +	}
> +	len = read(fd, buf, size);
> +	if (len < 0)
> +		log_err("Read %s", path);
> +	else
> +		buf[len] = 0;
> +	close(fd);
> +	return len < 0 ? -errno : 0;
> +}
> +
> +static int setup_bpffs(void)
> +{
> +	int err;
> +
> +	/* Mount bpffs */
> +	err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
> +	mounted_bpffs = !err;
> +	if (!ASSERT_OK(err && errno != EBUSY, "mount bpffs"))
> +		return err;
> +
> +	/* Create a directory to contain stat files in bpffs */
> +	err = mkdir(BPFFS_VMSCAN, 0755);
> +	ASSERT_OK(err, "mkdir bpffs");
> +	return err;
> +}
> +
> +static void cleanup_bpffs(void)
> +{
> +	/* Remove created directory in bpffs */
> +	ASSERT_OK(rmdir(BPFFS_VMSCAN), "rmdir "BPFFS_VMSCAN);
> +
> +	/* Unmount bpffs, if it wasn't already mounted when we started */
> +	if (mounted_bpffs)
> +		return;
> +	ASSERT_OK(umount(BPFFS_ROOT), "unmount bpffs");
> +}
> +
> +static int setup_cgroups(void)
> +{
> +	int i, fd, err;
> +
> +	err = setup_cgroup_environment();
> +	if (!ASSERT_OK(err, "setup_cgroup_environment"))
> +		return err;
> +
> +	root_cgroup_fd = get_root_cgroup();
> +	if (!ASSERT_GE(root_cgroup_fd, 0, "get_root_cgroup"))
> +		return root_cgroup_fd;
> +
> +	for (i = 0; i < N_CGROUPS; i++) {
> +		fd = create_and_get_cgroup(cgroups[i].path);
> +		if (!ASSERT_GE(fd, 0, "create_and_get_cgroup"))
> +			return fd;
> +
> +		cgroups[i].fd = fd;
> +		cgroups[i].id = get_cgroup_id(cgroups[i].path);
> +
> +		/*
> +		 * Enable memcg controller for the entire hierarchy.
> +		 * Note that stats are collected for all cgroups in a hierarchy
> +		 * with memcg enabled anyway, but are only exposed for cgroups
> +		 * that have memcg enabled.
> +		 */
> +		if (i < N_NON_LEAF_CGROUPS) {
> +			err = enable_controllers(cgroups[i].path, "memory");
> +			if (!ASSERT_OK(err, "enable_controllers"))
> +				return err;
> +		}
> +	}
> +	return 0;
> +}
> +
> +static void cleanup_cgroups(void)
> +{
> +	close(root_cgroup_fd);
> +	for (int i = 0; i < N_CGROUPS; i++)
> +		close(cgroups[i].fd);
> +	cleanup_cgroup_environment();
> +}
> +
> +
> +static int setup_hierarchy(void)
> +{
> +	return setup_bpffs() || setup_cgroups();
> +}
> +
> +static void destroy_hierarchy(void)
> +{
> +	cleanup_cgroups();
> +	cleanup_bpffs();
> +}
> +
> +static void alloc_anon(size_t size)
> +{
> +	char *buf, *ptr;
> +
> +	buf = malloc(size);
> +	for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
> +		*ptr = 0;
> +	free(buf);
> +}
> +
> +static int induce_vmscan(void)
> +{
> +	char size[128];
> +	int i, err;
> +
> +	/*
> +	 * Set memory.high for test parent cgroup to 1 MB to throttle
> +	 * allocations and invoke reclaim in children.
> +	 */
> +	snprintf(size, 128, "%d", MB(1));
> +	err = write_cgroup_file(cgroups[0].path, "memory.high",	size);
> +	if (!ASSERT_OK(err, "write memory.high"))
> +		return err;
> +	/*
> +	 * In every leaf cgroup, run a memory hog for a few seconds to induce
> +	 * reclaim then kill it.
> +	 */
> +	for (i = N_NON_LEAF_CGROUPS; i < N_CGROUPS; i++) {
> +		pid_t pid = fork();
> +
> +		if (pid == 0) {
> +			/* Join cgroup in the parent process workdir */
> +			join_parent_cgroup(cgroups[i].path);
> +
> +			/* Allocate more memory than memory.high */
> +			alloc_anon(MB(2));
> +			exit(0);
> +		} else {
> +			/* Wait for child to cause reclaim then kill it */
> +			if (!ASSERT_GT(pid, 0, "fork"))
> +				return pid;
> +			sleep(2);
> +			kill(pid, SIGKILL);
> +			waitpid(pid, NULL, 0);
> +		}
> +	}
> +	return 0;
> +}
> +
> +static unsigned long long get_cgroup_vmscan_delay(unsigned long long cgroup_id,
> +						  const char *file_name)
> +{
> +	char buf[128], path[128];
> +	unsigned long long vmscan = 0, id = 0;
> +	int err;
> +
> +	/* For every cgroup, read the file generated by cgroup_iter */
> +	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
> +	err = read_from_file(path, buf, 128);
> +	if (!ASSERT_OK(err, "read cgroup_iter"))
> +		return 0;
> +
> +	/* Check the output file formatting */
> +	ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
> +			 &id, &vmscan), 2, "output format");
> +
> +	/* Check that the cgroup_id is displayed correctly */
> +	ASSERT_EQ(id, cgroup_id, "cgroup_id");
> +	/* Check that the vmscan reading is non-zero */
> +	ASSERT_GT(vmscan, 0, "vmscan_reading");
> +	return vmscan;
> +}
> +
> +static void check_vmscan_stats(void)
> +{
> +	int i;
> +	unsigned long long vmscan_readings[N_CGROUPS], vmscan_root;
> +
> +	for (i = 0; i < N_CGROUPS; i++)
> +		vmscan_readings[i] = get_cgroup_vmscan_delay(cgroups[i].id,
> +							     cgroups[i].name);
> +
> +	/* Read stats for root too */
> +	vmscan_root = get_cgroup_vmscan_delay(CG_ROOT_ID, CG_ROOT_NAME);
> +
> +	/* Check that child1 == child1_1 + child1_2 */
> +	ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + vmscan_readings[4],
> +		  "child1_vmscan");
> +	/* Check that child2 == child2_1 + child2_2 */
> +	ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + vmscan_readings[6],
> +		  "child2_vmscan");
> +	/* Check that test == child1 + child2 */
> +	ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + vmscan_readings[2],
> +		  "test_vmscan");
> +	/* Check that root >= test */
> +	ASSERT_GE(vmscan_root, vmscan_readings[1], "root_vmscan");
> +}
> +
> +static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj, int cgroup_fd,
> +			     const char *file_name)
> +{
> +	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> +	union bpf_iter_link_info linfo = {};
> +	struct bpf_link *link;
> +	char path[128];
> +	int err;
> +
> +	/*
> +	 * Create an iter link, parameterized by cgroup_fd.
> +	 * We only want to traverse one cgroup, so set the traversal order to
> +	 * "pre", and return 1 from dump_vmscan to stop iteration after the
> +	 * first cgroup.
> +	 */
> +	linfo.cgroup.cgroup_fd = cgroup_fd;
> +	linfo.cgroup.traversal_order = BPF_ITER_CGROUP_PRE;
> +	opts.link_info = &linfo;
> +	opts.link_info_len = sizeof(linfo);
> +	link = bpf_program__attach_iter(obj->progs.dump_vmscan, &opts);
> +	if (!ASSERT_OK_PTR(link, "attach iter"))
> +		return libbpf_get_error(link);
> +
> +	/* Pin the link to a bpffs file */
> +	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
> +	err = bpf_link__pin(link, path);
> +	ASSERT_OK(err, "pin cgroup_iter");
> +	return err;
> +}
> +
> +static int setup_progs(struct cgroup_hierarchical_stats **skel)
> +{
> +	int i, err;
> +	struct bpf_link *link;
> +	struct cgroup_hierarchical_stats *obj;
> +
> +	obj = cgroup_hierarchical_stats__open_and_load();
> +	if (!ASSERT_OK_PTR(obj, "open_and_load"))
> +		return libbpf_get_error(obj);
> +
> +	/* Attach cgroup_iter program that will dump the stats to cgroups */
> +	for (i = 0; i < N_CGROUPS; i++) {
> +		err = setup_cgroup_iter(obj, cgroups[i].fd, cgroups[i].name);
> +		if (!ASSERT_OK(err, "setup_cgroup_iter"))
> +			return err;
> +	}
> +	/* Also dump stats for root */
> +	err = setup_cgroup_iter(obj, root_cgroup_fd, CG_ROOT_NAME);
> +	if (!ASSERT_OK(err, "setup_cgroup_iter"))
> +		return err;
> +
> +	/* Attach rstat flusher */
> +	link = bpf_program__attach(obj->progs.vmscan_flush);
> +	if (!ASSERT_OK_PTR(link, "attach rstat"))
> +		return libbpf_get_error(link);
> +
> +	/* Attach tracing programs that will calculate vmscan delays */
> +	link = bpf_program__attach(obj->progs.vmscan_start);
> +	if (!ASSERT_OK_PTR(obj, "attach raw_tracepoint"))
> +		return libbpf_get_error(obj);
> +
> +	link = bpf_program__attach(obj->progs.vmscan_end);
> +	if (!ASSERT_OK_PTR(obj, "attach raw_tracepoint"))
> +		return libbpf_get_error(obj);
> +
> +	*skel = obj;
> +	return 0;
> +}
> +
> +void destroy_progs(struct cgroup_hierarchical_stats *skel)
> +{
> +	char path[128];
> +	int i;
> +
> +	for (i = 0; i < N_CGROUPS; i++) {
> +		/* Delete files in bpffs that cgroup_iters are pinned in */
> +		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
> +			 cgroups[i].name);
> +		ASSERT_OK(remove(path), "remove cgroup_iter pin");
> +	}
> +
> +	/* Delete root file in bpffs */
> +	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, CG_ROOT_NAME);
> +	ASSERT_OK(remove(path), "remove cgroup_iter root pin");
> +	cgroup_hierarchical_stats__destroy(skel);
> +}
> +
> +void test_cgroup_hierarchical_stats(void)
> +{
> +	struct cgroup_hierarchical_stats *skel = NULL;
> +
> +	if (setup_hierarchy())
> +		goto hierarchy_cleanup;
> +	if (setup_progs(&skel))
> +		goto cleanup;
> +	if (induce_vmscan())
> +		goto cleanup;
> +	check_vmscan_stats();
> +cleanup:
> +	destroy_progs(skel);
> +hierarchy_cleanup:
> +	destroy_hierarchy();
> +}
> diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> new file mode 100644
> index 0000000000000..fd2028f1ed70b
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> @@ -0,0 +1,234 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Functions to manage eBPF programs attached to cgroup subsystems
> + *
> + * Copyright 2022 Google LLC.
> + */
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +/*
> + * Start times are stored per-task, not per-cgroup, as multiple tasks in one
> + * cgroup can perform reclaim concurrently.
> + */
> +struct {
> +	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> +	__uint(map_flags, BPF_F_NO_PREALLOC);
> +	__type(key, int);
> +	__type(value, __u64);
> +} vmscan_start_time SEC(".maps");
> +
> +struct vmscan_percpu {
> +	/* Previous percpu state, to figure out if we have new updates */
> +	__u64 prev;
> +	/* Current percpu state */
> +	__u64 state;
> +};
> +
> +struct vmscan {
> +	/* State propagated through children, pending aggregation */
> +	__u64 pending;
> +	/* Total state, including all cpus and all children */
> +	__u64 state;
> +};
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
> +	__uint(max_entries, 10);
> +	__type(key, __u64);
> +	__type(value, struct vmscan_percpu);
> +} pcpu_cgroup_vmscan_elapsed SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(max_entries, 10);
> +	__type(key, __u64);
> +	__type(value, struct vmscan);
> +} cgroup_vmscan_elapsed SEC(".maps");
> +
> +extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
> +extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
> +
> +static inline struct cgroup *task_memcg(struct task_struct *task)
> +{
> +	return task->cgroups->subsys[memory_cgrp_id]->cgroup;
> +}
> +
> +static inline uint64_t cgroup_id(struct cgroup *cgrp)
> +{
> +	return cgrp->kn->id;
> +}
> +
> +static inline int create_vmscan_percpu_elem(__u64 cg_id, __u64 state)
> +{
> +	struct vmscan_percpu pcpu_init = {.state = state, .prev = 0};
> +
> +	if (bpf_map_update_elem(&pcpu_cgroup_vmscan_elapsed, &cg_id,
> +				&pcpu_init, BPF_NOEXIST)) {
> +		bpf_printk("failed to create pcpu entry for cgroup %llu\n"
> +			   , cg_id);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static inline int create_vmscan_elem(__u64 cg_id, __u64 state, __u64 pending)
> +{
> +	struct vmscan init = {.state = state, .pending = pending};
> +
> +	if (bpf_map_update_elem(&cgroup_vmscan_elapsed, &cg_id,
> +				&init, BPF_NOEXIST)) {
> +		bpf_printk("failed to create entry for cgroup %llu\n"
> +			   , cg_id);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +SEC("tp_btf/mm_vmscan_memcg_reclaim_begin")
> +int BPF_PROG(vmscan_start, struct lruvec *lruvec, struct scan_control *sc)
> +{
> +	struct task_struct *task = bpf_get_current_task_btf();
> +	__u64 *start_time_ptr;
> +
> +	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, task, 0,
> +					  BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	if (!start_time_ptr) {
> +		bpf_printk("error retrieving storage\n");
> +		return 0;
> +	}
> +
> +	*start_time_ptr = bpf_ktime_get_ns();
> +	return 0;
> +}
> +
> +SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
> +int BPF_PROG(vmscan_end, struct lruvec *lruvec, struct scan_control *sc)
> +{
> +	struct vmscan_percpu *pcpu_stat;
> +	struct task_struct *current = bpf_get_current_task_btf();
> +	struct cgroup *cgrp;
> +	__u64 *start_time_ptr;
> +	__u64 current_elapsed, cg_id;
> +	__u64 end_time = bpf_ktime_get_ns();
> +
> +	/*
> +	 * cgrp is the first parent cgroup of current that has memcg enabled in
> +	 * its subtree_control, or NULL if memcg is disabled in the entire tree.
> +	 * In a cgroup hierarchy like this:
> +	 *                               a
> +	 *                              / \
> +	 *                             b   c
> +	 *  If "a" has memcg enabled, while "b" doesn't, then processes in "b"
> +	 *  will accumulate their stats directly to "a". This makes sure that no
> +	 *  stats are lost from processes in leaf cgroups that don't have memcg
> +	 *  enabled, but only exposes stats for cgroups that have memcg enabled.
> +	 */
> +	cgrp = task_memcg(current);
> +	if (!cgrp)
> +		return 0;
> +
> +	cg_id = cgroup_id(cgrp);
> +	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, current, 0,
> +					      BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	if (!start_time_ptr) {
> +		bpf_printk("error retrieving local storage\n");
> +		return 0;
> +	}
> +
> +	current_elapsed = end_time - *start_time_ptr;
> +	pcpu_stat = bpf_map_lookup_elem(&pcpu_cgroup_vmscan_elapsed,
> +					&cg_id);
> +	if (pcpu_stat)
> +		__sync_fetch_and_add(&pcpu_stat->state, current_elapsed);
> +	else
> +		create_vmscan_percpu_elem(cg_id, current_elapsed);
> +
> +	cgroup_rstat_updated(cgrp, bpf_get_smp_processor_id());
> +	return 0;
> +}
> +
> +SEC("fentry/bpf_rstat_flush")
> +int BPF_PROG(vmscan_flush, struct cgroup *cgrp, struct cgroup *parent, int cpu)
> +{
> +	struct vmscan_percpu *pcpu_stat;
> +	struct vmscan *total_stat, *parent_stat;
> +	__u64 cg_id = cgroup_id(cgrp);
> +	__u64 parent_cg_id = parent ? cgroup_id(parent) : 0;
> +	__u64 *pcpu_vmscan;
> +	__u64 state;
> +	__u64 delta = 0;
> +
> +	/* Add CPU changes on this level since the last flush */
> +	pcpu_stat = bpf_map_lookup_percpu_elem(&pcpu_cgroup_vmscan_elapsed,
> +					       &cg_id, cpu);
> +	if (pcpu_stat) {
> +		state = pcpu_stat->state;
> +		delta += state - pcpu_stat->prev;
> +		pcpu_stat->prev = state;
> +	}
> +
> +	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
> +	if (!total_stat) {
> +		create_vmscan_elem(cg_id, delta, 0);
> +		goto update_parent;
> +	}
> +
> +	/* Collect pending stats from subtree */
> +	if (total_stat->pending) {
> +		delta += total_stat->pending;
> +		total_stat->pending = 0;
> +	}
> +
> +	/* Propagate changes to this cgroup's total */
> +	total_stat->state += delta;
> +
> +update_parent:
> +	/* Skip if there are no changes to propagate, or no parent */
> +	if (!delta || !parent_cg_id)
> +		return 0;
> +
> +	/* Propagate changes to cgroup's parent */
> +	parent_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed,
> +					  &parent_cg_id);
> +	if (parent_stat)
> +		parent_stat->pending += delta;
> +	else
> +		create_vmscan_elem(parent_cg_id, 0, delta);
> +
> +	return 0;
> +}
> +
> +SEC("iter.s/cgroup")
> +int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp)
> +{
> +	struct seq_file *seq = meta->seq;
> +	struct vmscan *total_stat;
> +	__u64 cg_id = cgroup_id(cgrp);
> +
> +	/* Do nothing for the terminal call */
> +	if (!cgrp)
> +		return 1;
> +
> +	/* Flush the stats to make sure we get the most updated numbers */
> +	cgroup_rstat_flush(cgrp);
> +
> +	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
> +	if (!total_stat) {
> +		bpf_printk("error finding stats for cgroup %llu\n", cg_id);
> +		BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: -1\n",
> +			       cg_id);
> +		return 1;
> +	}
> +	BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
> +		       cg_id, total_stat->state);
> +
> +	/*
> +	 * We only dump stats for one cgroup here, so return 1 to stop
> +	 * iteration after the first cgroup.
> +	 */
> +	return 1;
> +}

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-28  6:14   ` Yonghong Song
@ 2022-06-28  6:47     ` Yosry Ahmed
  2022-06-28  7:14       ` Yosry Ahmed
  2022-06-28  7:43       ` Yosry Ahmed
  0 siblings, 2 replies; 46+ messages in thread
From: Yosry Ahmed @ 2022-06-28  6:47 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> > Add a selftest that tests the whole workflow for collecting,
> > aggregating (flushing), and displaying cgroup hierarchical stats.
> >
> > TL;DR:
> > - Whenever reclaim happens, vmscan_start and vmscan_end update
> >    per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
> >    have updates.
> > - When userspace tries to read the stats, vmscan_dump calls rstat to flush
> >    the stats, and outputs the stats in text format to userspace (similar
> >    to cgroupfs stats).
> > - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
> >    updates, vmscan_flush aggregates cpu readings and propagates updates
> >    to parents.
> >
> > Detailed explanation:
> > - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
> >    measure the latency of cgroup reclaim. Per-cgroup readings are stored in
> >    percpu maps for efficiency. When a cgroup reading is updated on a cpu,
> >    cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
> >    rstat updated tree on that cpu.
> >
> > - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
> >    each cgroup. Reading this file invokes the program, which calls
> >    cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
> >    cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
> >    the stats are exposed to the user. vmscan_dump returns 1 to terminate
> >    iteration early, so that we only expose stats for one cgroup per read.
> >
> > - An ftrace program, vmscan_flush, is also loaded and attached to
> >    bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
> >    once for each (cgroup, cpu) pair that has updates. cgroups are popped
> >    from the rstat tree in a bottom-up fashion, so calls will always be
> >    made for cgroups that have updates before their parents. The program
> >    aggregates percpu readings to a total per-cgroup reading, and also
> >    propagates them to the parent cgroup. After rstat flushing is over, all
> >    cgroups will have correct updated hierarchical readings (including all
> >    cpus and all their descendants).
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> There is a selftest failure with this test:
>
> get_cgroup_vmscan_delay:PASS:output format 0 nsec
> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
> get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
> get_cgroup_vmscan_delay:PASS:output format 0 nsec
> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> actual 0 <= expected 0
> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
> 781874 != expected 382092
> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
> -1 != expected -2
> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
> 781874 != expected 781873
> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
> expected 781874
> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
> cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
> #33      cgroup_hierarchical_stats:FAIL
>

The test is passing on my setup. I am trying to figure out if there is
something outside the setup done by the test that can cause the test
to fail.
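
(Side note for anyone reproducing this: the bpf programs in the test
report failures via bpf_printk(), so watching
/sys/kernel/debug/tracing/trace_pipe during a run should show whether
any of the map updates or local storage lookups are failing.)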

>
> An existing test also failed.
>
> btf_dump_data:PASS:find type id 0 nsec
>
>
> btf_dump_data:PASS:failed/unexpected type_sz 0 nsec
>
>
> btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
> expected/actual match: actual '(union bpf_iter_link_info){.map =
> (struct){.map_fd = (__u32)1,},.cgroup '
> test_btf_dump_struct_data:PASS:find struct sk_buff 0 nsec
>

Yeah, I see what happened there. bpf_iter_link_info was changed by the
patch that introduced cgroup_iter, and the btf_dump_data test uses this
specific union to exercise dumping of a union with a nested struct. I
will add a patch in the next version that updates the btf_dump_data
test accordingly. Thanks.
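
For reference, here is roughly what the union looks like after the
cgroup_iter patch. The member layout below is my reconstruction from
how the selftest fills it in (linfo.cgroup.cgroup_fd and
linfo.cgroup.traversal_order), so treat it as a sketch rather than the
authoritative uapi definition:

    union bpf_iter_link_info {
            struct {
                    __u32 map_fd;
            } map;
            struct {
                    __u32 traversal_order;
                    __u32 cgroup_fd;
            } cgroup;
    };

The expected string in btf_dump_data only covers the "map" member,
which is why the dumped ".cgroup" part shows up as unexpected; the fix
is to extend the expected output to cover the new member.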

>
> test_btf_dump_struct_data:PASS:unexpected return value dumping sk_buff 0
> nsec
>
> btf_dump_data:PASS:verify prefix match 0 nsec
>
>
> btf_dump_data:PASS:find type id 0 nsec
>
>
> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
>
>
> btf_dump_data:PASS:ensure expected/actual match 0 nsec
>
>
> btf_dump_data:PASS:verify prefix match 0 nsec
>
>
> btf_dump_data:PASS:find type id 0 nsec
>
>
> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
>
>
> btf_dump_data:PASS:ensure expected/actual match 0 nsec
>
>
> #21/14   btf_dump/btf_dump: struct_data:FAIL
>
> please take a look.
>
> > ---
> >   .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
> >   .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
> >   2 files changed, 585 insertions(+)
> >   create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> >   create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> > new file mode 100644
> > index 0000000000000..b78a4043da49a
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> > @@ -0,0 +1,351 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Functions to manage eBPF programs attached to cgroup subsystems
> > + *
> > + * Copyright 2022 Google LLC.
> > + */
> > +#include <errno.h>
> > +#include <sys/types.h>
> > +#include <sys/mount.h>
> > +#include <sys/stat.h>
> > +#include <unistd.h>
> > +
> > +#include <test_progs.h>
> > +#include <bpf/libbpf.h>
> > +#include <bpf/bpf.h>
> > +
> > +#include "cgroup_helpers.h"
> > +#include "cgroup_hierarchical_stats.skel.h"
> > +
> > +#define PAGE_SIZE 4096
> > +#define MB(x) ((x) << 20)
> > +
> > +#define BPFFS_ROOT "/sys/fs/bpf/"
> > +#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
> > +
> > +#define CG_ROOT_NAME "root"
> > +#define CG_ROOT_ID 1
> > +
> > +#define CGROUP_PATH(p, n) {.path = #p"/"#n, .name = #n}
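> > +/* e.g. CGROUP_PATH(/test, child1) expands to {.path = "/test/child1", .name = "child1"} */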
> > +
> > +static struct {
> > +     const char *path, *name;
> > +     unsigned long long id;
> > +     int fd;
> > +} cgroups[] = {
> > +     CGROUP_PATH(/, test),
> > +     CGROUP_PATH(/test, child1),
> > +     CGROUP_PATH(/test, child2),
> > +     CGROUP_PATH(/test/child1, child1_1),
> > +     CGROUP_PATH(/test/child1, child1_2),
> > +     CGROUP_PATH(/test/child2, child2_1),
> > +     CGROUP_PATH(/test/child2, child2_2),
> > +};
> > +
> > +#define N_CGROUPS ARRAY_SIZE(cgroups)
> > +#define N_NON_LEAF_CGROUPS 3
> > +
> > +int root_cgroup_fd;
> > +bool mounted_bpffs;
> > +
> > +static int read_from_file(const char *path, char *buf, size_t size)
> > +{
> > +     int fd, len;
> > +
> > +     fd = open(path, O_RDONLY);
> > +     if (fd < 0) {
> > +             log_err("Open %s", path);
> > +             return -errno;
> > +     }
> > +     /* Reserve one byte so the NUL terminator below cannot overflow buf */
> > +     len = read(fd, buf, size - 1);
> > +     if (len < 0)
> > +             log_err("Read %s", path);
> > +     else
> > +             buf[len] = 0;
> > +     close(fd);
> > +     return len < 0 ? -errno : 0;
> > +}
> > +
> > +static int setup_bpffs(void)
> > +{
> > +     int err;
> > +
> > +     /* Mount bpffs */
> > +     err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
> > +     mounted_bpffs = !err;
> > +     if (!ASSERT_OK(err && errno != EBUSY, "mount bpffs"))
> > +             return err;
> > +
> > +     /* Create a directory to contain stat files in bpffs */
> > +     err = mkdir(BPFFS_VMSCAN, 0755);
> > +     ASSERT_OK(err, "mkdir bpffs");
> > +     return err;
> > +}
> > +
> > +static void cleanup_bpffs(void)
> > +{
> > +     /* Remove created directory in bpffs */
> > +     ASSERT_OK(rmdir(BPFFS_VMSCAN), "rmdir "BPFFS_VMSCAN);
> > +
> > +     /* Unmount bpffs, if it wasn't already mounted when we started */
> > +     if (mounted_bpffs)
> > +             return;
> > +     ASSERT_OK(umount(BPFFS_ROOT), "unmount bpffs");
> > +}
> > +
> > +static int setup_cgroups(void)
> > +{
> > +     int i, fd, err;
> > +
> > +     err = setup_cgroup_environment();
> > +     if (!ASSERT_OK(err, "setup_cgroup_environment"))
> > +             return err;
> > +
> > +     root_cgroup_fd = get_root_cgroup();
> > +     if (!ASSERT_GE(root_cgroup_fd, 0, "get_root_cgroup"))
> > +             return root_cgroup_fd;
> > +
> > +     for (i = 0; i < N_CGROUPS; i++) {
> > +             fd = create_and_get_cgroup(cgroups[i].path);
> > +             if (!ASSERT_GE(fd, 0, "create_and_get_cgroup"))
> > +                     return fd;
> > +
> > +             cgroups[i].fd = fd;
> > +             cgroups[i].id = get_cgroup_id(cgroups[i].path);
> > +
> > +             /*
> > +              * Enable memcg controller for the entire hierarchy.
> > +              * Note that stats are collected for all cgroups in a hierarchy
> > +              * with memcg enabled anyway, but are only exposed for cgroups
> > +              * that have memcg enabled.
> > +              */
> > +             if (i < N_NON_LEAF_CGROUPS) {
> > +                     err = enable_controllers(cgroups[i].path, "memory");
> > +                     if (!ASSERT_OK(err, "enable_controllers"))
> > +                             return err;
> > +             }
> > +     }
> > +     return 0;
> > +}
> > +
> > +static void cleanup_cgroups(void)
> > +{
> > +     close(root_cgroup_fd);
> > +     for (int i = 0; i < N_CGROUPS; i++)
> > +             close(cgroups[i].fd);
> > +     cleanup_cgroup_environment();
> > +}
> > +
> > +
> > +static int setup_hierarchy(void)
> > +{
> > +     return setup_bpffs() || setup_cgroups();
> > +}
> > +
> > +static void destroy_hierarchy(void)
> > +{
> > +     cleanup_cgroups();
> > +     cleanup_bpffs();
> > +}
> > +
> > +static void alloc_anon(size_t size)
> > +{
> > +     char *buf, *ptr;
> > +
> > +     buf = malloc(size);
> > +     if (!buf)
> > +             return;
> > +     for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
> > +             *ptr = 0;
> > +     free(buf);
> > +}
> > +
> > +static int induce_vmscan(void)
> > +{
> > +     char size[128];
> > +     int i, err;
> > +
> > +     /*
> > +      * Set memory.high for test parent cgroup to 1 MB to throttle
> > +      * allocations and invoke reclaim in children.
> > +      */
> > +     snprintf(size, 128, "%d", MB(1));
> > +     err = write_cgroup_file(cgroups[0].path, "memory.high", size);
> > +     if (!ASSERT_OK(err, "write memory.high"))
> > +             return err;
> > +     /*
> > +      * In every leaf cgroup, run a memory hog for a few seconds to induce
> > +      * reclaim then kill it.
> > +      */
> > +     for (i = N_NON_LEAF_CGROUPS; i < N_CGROUPS; i++) {
> > +             pid_t pid = fork();
> > +
> > +             if (pid == 0) {
> > +                     /* Join cgroup in the parent process workdir */
> > +                     join_parent_cgroup(cgroups[i].path);
> > +
> > +                     /* Allocate more memory than memory.high */
> > +                     alloc_anon(MB(2));
> > +                     exit(0);
> > +             } else {
> > +                     /* Wait for child to cause reclaim then kill it */
> > +                     if (!ASSERT_GT(pid, 0, "fork"))
> > +                             return pid;
> > +                     sleep(2);
> > +                     kill(pid, SIGKILL);
> > +                     waitpid(pid, NULL, 0);
> > +             }
> > +     }
> > +     return 0;
> > +}
> > +
> > +static unsigned long long get_cgroup_vmscan_delay(unsigned long long cgroup_id,
> > +                                               const char *file_name)
> > +{
> > +     char buf[128], path[128];
> > +     unsigned long long vmscan = 0, id = 0;
> > +     int err;
> > +
> > +     /* For every cgroup, read the file generated by cgroup_iter */
> > +     snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
> > +     err = read_from_file(path, buf, 128);
> > +     if (!ASSERT_OK(err, "read cgroup_iter"))
> > +             return 0;
> > +
> > +     /* Check the output file formatting */
> > +     ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
> > +                      &id, &vmscan), 2, "output format");
> > +
> > +     /* Check that the cgroup_id is displayed correctly */
> > +     ASSERT_EQ(id, cgroup_id, "cgroup_id");
> > +     /* Check that the vmscan reading is non-zero */
> > +     ASSERT_GT(vmscan, 0, "vmscan_reading");
> > +     return vmscan;
> > +}
> > +
> > +static void check_vmscan_stats(void)
> > +{
> > +     int i;
> > +     unsigned long long vmscan_readings[N_CGROUPS], vmscan_root;
> > +
> > +     for (i = 0; i < N_CGROUPS; i++)
> > +             vmscan_readings[i] = get_cgroup_vmscan_delay(cgroups[i].id,
> > +                                                          cgroups[i].name);
> > +
> > +     /* Read stats for root too */
> > +     vmscan_root = get_cgroup_vmscan_delay(CG_ROOT_ID, CG_ROOT_NAME);
> > +
> > +     /* Check that child1 == child1_1 + child1_2 */
> > +     ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + vmscan_readings[4],
> > +               "child1_vmscan");
> > +     /* Check that child2 == child2_1 + child2_2 */
> > +     ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + vmscan_readings[6],
> > +               "child2_vmscan");
> > +     /* Check that test == child1 + child2 */
> > +     ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + vmscan_readings[2],
> > +               "test_vmscan");
> > +     /* Check that root >= test */
> > +     ASSERT_GE(vmscan_root, vmscan_readings[0], "root_vmscan");
> > +}
> > +
> > +static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj, int cgroup_fd,
> > +                          const char *file_name)
> > +{
> > +     DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> > +     union bpf_iter_link_info linfo = {};
> > +     struct bpf_link *link;
> > +     char path[128];
> > +     int err;
> > +
> > +     /*
> > +      * Create an iter link, parameterized by cgroup_fd.
> > +      * We only want to traverse one cgroup, so set the traversal order to
> > +      * "pre", and return 1 from dump_vmscan to stop iteration after the
> > +      * first cgroup.
> > +      */
> > +     linfo.cgroup.cgroup_fd = cgroup_fd;
> > +     linfo.cgroup.traversal_order = BPF_ITER_CGROUP_PRE;
> > +     opts.link_info = &linfo;
> > +     opts.link_info_len = sizeof(linfo);
> > +     link = bpf_program__attach_iter(obj->progs.dump_vmscan, &opts);
> > +     if (!ASSERT_OK_PTR(link, "attach iter"))
> > +             return libbpf_get_error(link);
> > +
> > +     /* Pin the link to a bpffs file */
> > +     snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
> > +     err = bpf_link__pin(link, path);
> > +     ASSERT_OK(err, "pin cgroup_iter");
> > +     return err;
> > +}
> > +
> > +static int setup_progs(struct cgroup_hierarchical_stats **skel)
> > +{
> > +     int i, err;
> > +     struct bpf_link *link;
> > +     struct cgroup_hierarchical_stats *obj;
> > +
> > +     obj = cgroup_hierarchical_stats__open_and_load();
> > +     if (!ASSERT_OK_PTR(obj, "open_and_load"))
> > +             return libbpf_get_error(obj);
> > +
> > +     /* Attach cgroup_iter program that will dump the stats to cgroups */
> > +     for (i = 0; i < N_CGROUPS; i++) {
> > +             err = setup_cgroup_iter(obj, cgroups[i].fd, cgroups[i].name);
> > +             if (!ASSERT_OK(err, "setup_cgroup_iter"))
> > +                     return err;
> > +     }
> > +     /* Also dump stats for root */
> > +     err = setup_cgroup_iter(obj, root_cgroup_fd, CG_ROOT_NAME);
> > +     if (!ASSERT_OK(err, "setup_cgroup_iter"))
> > +             return err;
> > +
> > +     /* Attach rstat flusher */
> > +     link = bpf_program__attach(obj->progs.vmscan_flush);
> > +     if (!ASSERT_OK_PTR(link, "attach rstat"))
> > +             return libbpf_get_error(link);
> > +
> > +     /* Attach tracing programs that will calculate vmscan delays */
> > +     link = bpf_program__attach(obj->progs.vmscan_start);
> > +     if (!ASSERT_OK_PTR(link, "attach raw_tracepoint"))
> > +             return libbpf_get_error(link);
> > +
> > +     link = bpf_program__attach(obj->progs.vmscan_end);
> > +     if (!ASSERT_OK_PTR(link, "attach raw_tracepoint"))
> > +             return libbpf_get_error(link);
> > +
> > +     *skel = obj;
> > +     return 0;
> > +}
> > +
> > +void destroy_progs(struct cgroup_hierarchical_stats *skel)
> > +{
> > +     char path[128];
> > +     int i;
> > +
> > +     for (i = 0; i < N_CGROUPS; i++) {
> > +             /* Delete files in bpffs that cgroup_iters are pinned in */
> > +             snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
> > +                      cgroups[i].name);
> > +             ASSERT_OK(remove(path), "remove cgroup_iter pin");
> > +     }
> > +
> > +     /* Delete root file in bpffs */
> > +     snprintf(path, 128, "%s%s", BPFFS_VMSCAN, CG_ROOT_NAME);
> > +     ASSERT_OK(remove(path), "remove cgroup_iter root pin");
> > +     cgroup_hierarchical_stats__destroy(skel);
> > +}
> > +
> > +void test_cgroup_hierarchical_stats(void)
> > +{
> > +     struct cgroup_hierarchical_stats *skel = NULL;
> > +
> > +     if (setup_hierarchy())
> > +             goto hierarchy_cleanup;
> > +     if (setup_progs(&skel))
> > +             goto cleanup;
> > +     if (induce_vmscan())
> > +             goto cleanup;
> > +     check_vmscan_stats();
> > +cleanup:
> > +     destroy_progs(skel);
> > +hierarchy_cleanup:
> > +     destroy_hierarchy();
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> > new file mode 100644
> > index 0000000000000..fd2028f1ed70b
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> > @@ -0,0 +1,234 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Functions to manage eBPF programs attached to cgroup subsystems
> > + *
> > + * Copyright 2022 Google LLC.
> > + */
> > +#include "vmlinux.h"
> > +#include <bpf/bpf_helpers.h>
> > +#include <bpf/bpf_tracing.h>
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +/*
> > + * Start times are stored per-task, not per-cgroup, as multiple tasks in one
> > + * cgroup can perform reclaim concurrently.
> > + */
> > +struct {
> > +     __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > +     __uint(map_flags, BPF_F_NO_PREALLOC);
> > +     __type(key, int);
> > +     __type(value, __u64);
> > +} vmscan_start_time SEC(".maps");
> > +
> > +struct vmscan_percpu {
> > +     /* Previous percpu state, to figure out if we have new updates */
> > +     __u64 prev;
> > +     /* Current percpu state */
> > +     __u64 state;
> > +};
> > +
> > +struct vmscan {
> > +     /* State propagated through children, pending aggregation */
> > +     __u64 pending;
> > +     /* Total state, including all cpus and all children */
> > +     __u64 state;
> > +};
> > +
> > +struct {
> > +     __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
> > +     __uint(max_entries, 10);
> > +     __type(key, __u64);
> > +     __type(value, struct vmscan_percpu);
> > +} pcpu_cgroup_vmscan_elapsed SEC(".maps");
> > +
> > +struct {
> > +     __uint(type, BPF_MAP_TYPE_HASH);
> > +     __uint(max_entries, 10);
> > +     __type(key, __u64);
> > +     __type(value, struct vmscan);
> > +} cgroup_vmscan_elapsed SEC(".maps");
> > +
> > +extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
> > +extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
> > +
> > +static inline struct cgroup *task_memcg(struct task_struct *task)
> > +{
> > +     return task->cgroups->subsys[memory_cgrp_id]->cgroup;
> > +}
> > +
> > +static inline uint64_t cgroup_id(struct cgroup *cgrp)
> > +{
> > +     return cgrp->kn->id;
> > +}
> > +
> > +static inline int create_vmscan_percpu_elem(__u64 cg_id, __u64 state)
> > +{
> > +     struct vmscan_percpu pcpu_init = {.state = state, .prev = 0};
> > +
> > +     if (bpf_map_update_elem(&pcpu_cgroup_vmscan_elapsed, &cg_id,
> > +                             &pcpu_init, BPF_NOEXIST)) {
> > +             bpf_printk("failed to create pcpu entry for cgroup %llu\n"
> > +                        , cg_id);
> > +             return 1;
> > +     }
> > +     return 0;
> > +}
> > +
> > +static inline int create_vmscan_elem(__u64 cg_id, __u64 state, __u64 pending)
> > +{
> > +     struct vmscan init = {.state = state, .pending = pending};
> > +
> > +     if (bpf_map_update_elem(&cgroup_vmscan_elapsed, &cg_id,
> > +                             &init, BPF_NOEXIST)) {
> > +             bpf_printk("failed to create entry for cgroup %llu\n"
> > +                        , cg_id);
> > +             return 1;
> > +     }
> > +     return 0;
> > +}
> > +
> > +SEC("tp_btf/mm_vmscan_memcg_reclaim_begin")
> > +int BPF_PROG(vmscan_start, struct lruvec *lruvec, struct scan_control *sc)
> > +{
> > +     struct task_struct *task = bpf_get_current_task_btf();
> > +     __u64 *start_time_ptr;
> > +
> > +     start_time_ptr = bpf_task_storage_get(&vmscan_start_time, task, 0,
> > +                                       BPF_LOCAL_STORAGE_GET_F_CREATE);
> > +     if (!start_time_ptr) {
> > +             bpf_printk("error retrieving storage\n");
> > +             return 0;
> > +     }
> > +
> > +     *start_time_ptr = bpf_ktime_get_ns();
> > +     return 0;
> > +}
> > +
> > +SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
> > +int BPF_PROG(vmscan_end, struct lruvec *lruvec, struct scan_control *sc)
> > +{
> > +     struct vmscan_percpu *pcpu_stat;
> > +     struct task_struct *current = bpf_get_current_task_btf();
> > +     struct cgroup *cgrp;
> > +     __u64 *start_time_ptr;
> > +     __u64 current_elapsed, cg_id;
> > +     __u64 end_time = bpf_ktime_get_ns();
> > +
> > +     /*
> > +      * cgrp is the first parent cgroup of current that has memcg enabled in
> > +      * its subtree_control, or NULL if memcg is disabled in the entire tree.
> > +      * In a cgroup hierarchy like this:
> > +      *                               a
> > +      *                              / \
> > +      *                             b   c
> > +      *  If "a" has memcg enabled, while "b" doesn't, then processes in "b"
> > +      *  will accumulate their stats directly to "a". This makes sure that no
> > +      *  stats are lost from processes in leaf cgroups that don't have memcg
> > +      *  enabled, but only exposes stats for cgroups that have memcg enabled.
> > +      */
> > +     cgrp = task_memcg(current);
> > +     if (!cgrp)
> > +             return 0;
> > +
> > +     cg_id = cgroup_id(cgrp);
> > +     start_time_ptr = bpf_task_storage_get(&vmscan_start_time, current, 0,
> > +                                           BPF_LOCAL_STORAGE_GET_F_CREATE);
> > +     if (!start_time_ptr) {
> > +             bpf_printk("error retrieving local storage\n");
> > +             return 0;
> > +     }
> > +
> > +     current_elapsed = end_time - *start_time_ptr;
> > +     pcpu_stat = bpf_map_lookup_elem(&pcpu_cgroup_vmscan_elapsed,
> > +                                     &cg_id);
> > +     if (pcpu_stat)
> > +             __sync_fetch_and_add(&pcpu_stat->state, current_elapsed);
> > +     else
> > +             create_vmscan_percpu_elem(cg_id, current_elapsed);
> > +
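> > +     /* Tell rstat that this (cgroup, cpu) pair now has updates to flush */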
> > +     cgroup_rstat_updated(cgrp, bpf_get_smp_processor_id());
> > +     return 0;
> > +}
> > +
> > +SEC("fentry/bpf_rstat_flush")
> > +int BPF_PROG(vmscan_flush, struct cgroup *cgrp, struct cgroup *parent, int cpu)
> > +{
> > +     struct vmscan_percpu *pcpu_stat;
> > +     struct vmscan *total_stat, *parent_stat;
> > +     __u64 cg_id = cgroup_id(cgrp);
> > +     __u64 parent_cg_id = parent ? cgroup_id(parent) : 0;
> > +     __u64 state;
> > +     __u64 delta = 0;
> > +
> > +     /* Add CPU changes on this level since the last flush */
> > +     pcpu_stat = bpf_map_lookup_percpu_elem(&pcpu_cgroup_vmscan_elapsed,
> > +                                            &cg_id, cpu);
> > +     if (pcpu_stat) {
> > +             state = pcpu_stat->state;
> > +             delta += state - pcpu_stat->prev;
> > +             pcpu_stat->prev = state;
> > +     }
> > +
> > +     total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
> > +     if (!total_stat) {
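> > +             /* First flush for this cgroup: its total starts at delta */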
> > +             create_vmscan_elem(cg_id, delta, 0);
> > +             goto update_parent;
> > +     }
> > +
> > +     /* Collect pending stats from subtree */
> > +     if (total_stat->pending) {
> > +             delta += total_stat->pending;
> > +             total_stat->pending = 0;
> > +     }
> > +
> > +     /* Propagate changes to this cgroup's total */
> > +     total_stat->state += delta;
> > +
> > +update_parent:
> > +     /* Skip if there are no changes to propagate, or no parent */
> > +     if (!delta || !parent_cg_id)
> > +             return 0;
> > +
> > +     /* Propagate changes to cgroup's parent */
> > +     parent_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed,
> > +                                       &parent_cg_id);
> > +     if (parent_stat)
> > +             parent_stat->pending += delta;
> > +     else
> > +             create_vmscan_elem(parent_cg_id, 0, delta);
> > +
> > +     return 0;
> > +}
> > +
> > +SEC("iter.s/cgroup")
> > +int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp)
> > +{
> > +     struct seq_file *seq = meta->seq;
> > +     struct vmscan *total_stat;
> > +     __u64 cg_id;
> > +
> > +     /* Do nothing for the terminal call */
> > +     if (!cgrp)
> > +             return 1;
> > +
> > +     cg_id = cgroup_id(cgrp);
> > +
> > +     /* Flush the stats to make sure we get the most updated numbers */
> > +     cgroup_rstat_flush(cgrp);
> > +
> > +     total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
> > +     if (!total_stat) {
> > +             bpf_printk("error finding stats for cgroup %llu\n", cg_id);
> > +             BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: -1\n",
> > +                            cg_id);
> > +             return 1;
> > +     }
> > +     BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
> > +                    cg_id, total_stat->state);
> > +
> > +     /*
> > +      * We only dump stats for one cgroup here, so return 1 to stop
> > +      * iteration after the first cgroup.
> > +      */
> > +     return 1;
> > +}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-28  6:47     ` Yosry Ahmed
@ 2022-06-28  7:14       ` Yosry Ahmed
  2022-06-29  0:09         ` Yosry Ahmed
  2022-06-29  6:17         ` Yonghong Song
  2022-06-28  7:43       ` Yosry Ahmed
  1 sibling, 2 replies; 46+ messages in thread
From: Yosry Ahmed @ 2022-06-28  7:14 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

[-- Attachment #1: Type: text/plain, Size: 28802 bytes --]

On Mon, Jun 27, 2022 at 11:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song <yhs@fb.com> wrote:
> >
> >
> >
> > On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> > > Add a selftest that tests the whole workflow for collecting,
> > > aggregating (flushing), and displaying cgroup hierarchical stats.
> > >
> > > TL;DR:
> > > - Whenever reclaim happens, vmscan_start and vmscan_end update
> > >    per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
> > >    have updates.
> > > - When userspace tries to read the stats, vmscan_dump calls rstat to flush
> > >    the stats, and outputs the stats in text format to userspace (similar
> > >    to cgroupfs stats).
> > > - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
> > >    updates, vmscan_flush aggregates cpu readings and propagates updates
> > >    to parents.
> > >
> > > Detailed explanation:
> > > - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
> > >    measure the latency of cgroup reclaim. Per-cgroup readings are stored in
> > >    percpu maps for efficiency. When a cgroup reading is updated on a cpu,
> > >    cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
> > >    rstat updated tree on that cpu.
> > >
> > > - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
> > >    each cgroup. Reading this file invokes the program, which calls
> > >    cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
> > >    cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
> > >    the stats are exposed to the user. vmscan_dump returns 1 to terminate
> > >    iteration early, so that we only expose stats for one cgroup per read.
> > >
> > > - An ftrace program, vmscan_flush, is also loaded and attached to
> > >    bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
> > >    once for each (cgroup, cpu) pair that has updates. cgroups are popped
> > >    from the rstat tree in a bottom-up fashion, so calls will always be
> > >    made for cgroups that have updates before their parents. The program
> > >    aggregates percpu readings to a total per-cgroup reading, and also
> > >    propagates them to the parent cgroup. After rstat flushing is over, all
> > >    cgroups will have correct updated hierarchical readings (including all
> > >    cpus and all their descendants).
> > >
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > There is a selftest failure with this test:
> >
> > get_cgroup_vmscan_delay:PASS:output format 0 nsec
> > get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> > get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
> > get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
> > get_cgroup_vmscan_delay:PASS:output format 0 nsec
> > get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> > get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> > actual 0 <= expected 0
> > check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
> > 781874 != expected 382092
> > check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
> > -1 != expected -2
> > check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
> > 781874 != expected 781873
> > check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
> > expected 781874
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
> > cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
> > #33      cgroup_hierarchical_stats:FAIL
> >
>
> The test is passing on my setup. I am trying to figure out if there is
> something outside the setup done by the test that can cause the test
> to fail.
>

I can't reproduce the failure on my machine. It seems like, for some
reason, reclaim is not invoked in one of the test cgroups, which leaves
the expected stats missing. I have a few suspicions as to what might
cause this, but I am not sure.
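
(One way to check on a failing machine: each child cgroup's
memory.events file should show a non-zero "high" count if the
memory.high throttling actually triggered reclaim there.)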

If you have the capacity, do you mind re-running the test with the
attached diff1.patch? (And maybe diff2.patch if that fails; the latter
will cause OOMs in the test cgroup, so you might see some process-kill
warnings.)
Thanks!


[...]

[-- Attachment #2: diff1.patch --]
[-- Type: application/octet-stream, Size: 577 bytes --]

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
index b78a4043da49a..bc0998fe255f0 100644
--- a/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
@@ -184,7 +184,6 @@ static int induce_vmscan(void)
 
 			/* Allocate more memory than memory.high */
 			alloc_anon(MB(2));
-			exit(0);
 		} else {
 			/* Wait for child to cause reclaim then kill it */
 			if (!ASSERT_GT(pid, 0, "fork"))

[-- Attachment #3: diff2.patch --]
[-- Type: application/octet-stream, Size: 1226 bytes --]

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
index b78a4043da49a..ac2390f8f40b0 100644
--- a/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
@@ -164,12 +164,12 @@ static int induce_vmscan(void)
 	int i, err;
 
 	/*
-	 * Set memory.high for test parent cgroup to 1 MB to throttle
+	 * Set memory.max for test parent cgroup to 1 MB to throttle
 	 * allocations and invoke reclaim in children.
 	 */
 	snprintf(size, 128, "%d", MB(1));
-	err = write_cgroup_file(cgroups[0].path, "memory.high",	size);
-	if (!ASSERT_OK(err, "write memory.high"))
+	err = write_cgroup_file(cgroups[0].path, "memory.max",	size);
+	if (!ASSERT_OK(err, "write memory.max"))
 		return err;
 	/*
 	 * In every leaf cgroup, run a memory hog for a few seconds to induce
@@ -182,7 +182,7 @@ static int induce_vmscan(void)
 			/* Join cgroup in the parent process workdir */
 			join_parent_cgroup(cgroups[i].path);
 
-			/* Allocate more memory than memory.high */
+			/* Allocate more memory than memory.max */
 			alloc_anon(MB(2));
 			exit(0);
 		} else {

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-28  6:47     ` Yosry Ahmed
  2022-06-28  7:14       ` Yosry Ahmed
@ 2022-06-28  7:43       ` Yosry Ahmed
  2022-06-29  6:26         ` Yonghong Song
  1 sibling, 1 reply; 46+ messages in thread
From: Yosry Ahmed @ 2022-06-28  7:43 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

[-- Attachment #1: Type: text/plain, Size: 29033 bytes --]

On Mon, Jun 27, 2022 at 11:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song <yhs@fb.com> wrote:
> >
> >
> >
> > On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> > > Add a selftest that tests the whole workflow for collecting,
> > > aggregating (flushing), and displaying cgroup hierarchical stats.
> > >
> > > TL;DR:
> > > - Whenever reclaim happens, vmscan_start and vmscan_end update
> > >    per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
> > >    have updates.
> > > - When userspace tries to read the stats, vmscan_dump calls rstat to flush
> > >    the stats, and outputs the stats in text format to userspace (similar
> > >    to cgroupfs stats).
> > > - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
> > >    updates, vmscan_flush aggregates cpu readings and propagates updates
> > >    to parents.
> > >
> > > Detailed explanation:
> > > - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
> > >    measure the latency of cgroup reclaim. Per-cgroup readings are stored in
> > >    percpu maps for efficiency. When a cgroup reading is updated on a cpu,
> > >    cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
> > >    rstat updated tree on that cpu.
> > >
> > > - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
> > >    each cgroup. Reading this file invokes the program, which calls
> > >    cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
> > >    cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
> > >    the stats are exposed to the user. vmscan_dump returns 1 to terminate
> > >    iteration early, so that we only expose stats for one cgroup per read.
> > >
> > > - An ftrace program, vmscan_flush, is also loaded and attached to
> > >    bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
> > >    once for each (cgroup, cpu) pair that has updates. cgroups are popped
> > >    from the rstat tree in a bottom-up fashion, so calls will always be
> > >    made for cgroups that have updates before their parents. The program
> > >    aggregates percpu readings to a total per-cgroup reading, and also
> > >    propagates them to the parent cgroup. After rstat flushing is over, all
> > >    cgroups will have correct updated hierarchical readings (including all
> > >    cpus and all their descendants).
> > >
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > There is a selftest failure with the test:
> >
> > get_cgroup_vmscan_delay:PASS:output format 0 nsec
> > get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> > get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
> > get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
> > get_cgroup_vmscan_delay:PASS:output format 0 nsec
> > get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> > get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> > actual 0 <= expected 0
> > check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
> > 781874 != expected 382092
> > check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
> > -1 != expected -2
> > check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
> > 781874 != expected 781873
> > check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
> > expected 781874
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
> > cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
> > #33      cgroup_hierarchical_stats:FAIL
> >
>
> The test is passing on my setup. I am trying to figure out if there is
> something outside the setup done by the test that can cause the test
> to fail.
>
> >
> > Also an existing test also failed.
> >
> > btf_dump_data:PASS:find type id 0 nsec
> >
> >
> > btf_dump_data:PASS:failed/unexpected type_sz 0 nsec
> >
> >
> > btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
> > expected/actual match: actual '(union bpf_iter_link_info){.map =
> > (struct){.map_fd = (__u32)1,},.cgroup '
> > test_btf_dump_struct_data:PASS:find struct sk_buff 0 nsec
> >
>
> Yeah I see what happened there. bpf_iter_link_info was changed by the
> patch that introduced cgroup_iter, and this specific union is used by
> the test to test the "union with nested struct" btf dumping. I will
> add a patch in the next version that updates the btf_dump_data test
> accordingly. Thanks.
>
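
(For reference, judging from the selftest's use of linfo.cgroup and the
expected string in the attached diff, the bpf_iter_link_info union after
the cgroup_iter patch presumably looks like this:

	union bpf_iter_link_info {
		struct {
			__u32	map_fd;
		} map;
		struct {
			__u32	cgroup_fd;
			__u32	traversal_order;
		} cgroup;
	};

so any btf dump of this type now picks up the new cgroup member.)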

So I actually tried the attached diff to update the expected dump of
bpf_iter_link_info in this test, but the test still failed:

btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
expected/actual match: actual '(union bpf_iter_link_info){.map =
(struct){.map_fd = (__u32)1,},.cgroup = (struct){.cgroup_fd =
(__u32)1,},}'  != expected '(union bpf_iter_link_info){.map =
(struct){.map_fd = (__u32)1,},.cgroup = (struct){.cgroup_fd =
(__u32)1,.traversal_order = (__u32)1},}'

It seems to me that the actual output in this case is not right; it is
missing traversal_order. Did we accidentally find a bug in btf dumping
of unions with nested structs, or am I missing something here?
Thanks!
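
(Two details that may be relevant to whoever digs into this: in C, when
an initializer list names more than one member of a union, the last
initializer wins and all members alias the same storage; and, if I am
reading libbpf right, the type data dumper skips zero-valued members
unless the emit_zeroes option is set. A minimal standalone sketch of the
initializer behavior, not tied to the test harness:

	#include <stdio.h>

	union u {
		struct { unsigned int a; } x;
		struct { unsigned int b; unsigned int c; } y;
	};

	int main(void)
	{
		/* .y overrides .x; the union's storage holds b = 1, c = 2 */
		union u v = { .x = { .a = 1 }, .y = { .b = 1, .c = 2 } };

		printf("%u %u %u\n", v.x.a, v.y.b, v.y.c); /* prints: 1 1 2 */
		return 0;
	}

If the data being dumped ends up with traversal_order == 0, a
zero-skipping dumper would not print it, which could make the "missing"
field above expected behavior rather than a bug.)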

> >
> > test_btf_dump_struct_data:PASS:unexpected return value dumping sk_buff 0
> > nsec
> >
> > btf_dump_data:PASS:verify prefix match 0 nsec
> >
> >
> > btf_dump_data:PASS:find type id 0 nsec
> >
> >
> > btf_dump_data:PASS:failed to return -E2BIG 0 nsec
> >
> >
> > btf_dump_data:PASS:ensure expected/actual match 0 nsec
> >
> >
> > btf_dump_data:PASS:verify prefix match 0 nsec
> >
> >
> > btf_dump_data:PASS:find type id 0 nsec
> >
> >
> > btf_dump_data:PASS:failed to return -E2BIG 0 nsec
> >
> >
> > btf_dump_data:PASS:ensure expected/actual match 0 nsec
> >
> >
> > #21/14   btf_dump/btf_dump: struct_data:FAIL
> >
> > please take a look.
> >
> > > ---
> > >   .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
> > >   .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
> > >   2 files changed, 585 insertions(+)
> > >   create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> > >   create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> > > new file mode 100644
> > > index 0000000000000..b78a4043da49a
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> > > @@ -0,0 +1,351 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * Functions to manage eBPF programs attached to cgroup subsystems
> > > + *
> > > + * Copyright 2022 Google LLC.
> > > + */
> > > +#include <errno.h>
> > > +#include <sys/types.h>
> > > +#include <sys/mount.h>
> > > +#include <sys/stat.h>
> > > +#include <unistd.h>
> > > +
> > > +#include <test_progs.h>
> > > +#include <bpf/libbpf.h>
> > > +#include <bpf/bpf.h>
> > > +
> > > +#include "cgroup_helpers.h"
> > > +#include "cgroup_hierarchical_stats.skel.h"
> > > +
> > > +#define PAGE_SIZE 4096
> > > +#define MB(x) (x << 20)
> > > +
> > > +#define BPFFS_ROOT "/sys/fs/bpf/"
> > > +#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
> > > +
> > > +#define CG_ROOT_NAME "root"
> > > +#define CG_ROOT_ID 1
> > > +
> > > +#define CGROUP_PATH(p, n) {.path = #p"/"#n, .name = #n}
> > > +
> > > +static struct {
> > > +     const char *path, *name;
> > > +     unsigned long long id;
> > > +     int fd;
> > > +} cgroups[] = {
> > > +     CGROUP_PATH(/, test),
> > > +     CGROUP_PATH(/test, child1),
> > > +     CGROUP_PATH(/test, child2),
> > > +     CGROUP_PATH(/test/child1, child1_1),
> > > +     CGROUP_PATH(/test/child1, child1_2),
> > > +     CGROUP_PATH(/test/child2, child2_1),
> > > +     CGROUP_PATH(/test/child2, child2_2),
> > > +};
> > > +
> > > +#define N_CGROUPS ARRAY_SIZE(cgroups)
> > > +#define N_NON_LEAF_CGROUPS 3
> > > +
> > > +int root_cgroup_fd;
> > > +bool mounted_bpffs;
> > > +
> > > +static int read_from_file(const char *path, char *buf, size_t size)
> > > +{
> > > +     int fd, len;
> > > +
> > > +     fd = open(path, O_RDONLY);
> > > +     if (fd < 0) {
> > > +             log_err("Open %s", path);
> > > +             return -errno;
> > > +     }
> > > +     len = read(fd, buf, size - 1);
> > > +     if (len < 0)
> > > +             log_err("Read %s", path);
> > > +     else
> > > +             buf[len] = 0;
> > > +     close(fd);
> > > +     return len < 0 ? -errno : 0;
> > > +}
> > > +
> > > +static int setup_bpffs(void)
> > > +{
> > > +     int err;
> > > +
> > > +     /* Mount bpffs */
> > > +     err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
> > > +     mounted_bpffs = !err;
> > > +     if (!ASSERT_OK(err && errno != EBUSY, "mount bpffs"))
> > > +             return err;
> > > +
> > > +     /* Create a directory to contain stat files in bpffs */
> > > +     err = mkdir(BPFFS_VMSCAN, 0755);
> > > +     ASSERT_OK(err, "mkdir bpffs");
> > > +     return err;
> > > +}
> > > +
> > > +static void cleanup_bpffs(void)
> > > +{
> > > +     /* Remove created directory in bpffs */
> > > +     ASSERT_OK(rmdir(BPFFS_VMSCAN), "rmdir "BPFFS_VMSCAN);
> > > +
> > > +     /* Unmount bpffs, if it wasn't already mounted when we started */
> > > +     if (mounted_bpffs)
> > > +             return;
> > > +     ASSERT_OK(umount(BPFFS_ROOT), "unmount bpffs");
> > > +}
> > > +
> > > +static int setup_cgroups(void)
> > > +{
> > > +     int i, fd, err;
> > > +
> > > +     err = setup_cgroup_environment();
> > > +     if (!ASSERT_OK(err, "setup_cgroup_environment"))
> > > +             return err;
> > > +
> > > +     root_cgroup_fd = get_root_cgroup();
> > > +     if (!ASSERT_GE(root_cgroup_fd, 0, "get_root_cgroup"))
> > > +             return root_cgroup_fd;
> > > +
> > > +     for (i = 0; i < N_CGROUPS; i++) {
> > > +             fd = create_and_get_cgroup(cgroups[i].path);
> > > +             if (!ASSERT_GE(fd, 0, "create_and_get_cgroup"))
> > > +                     return fd;
> > > +
> > > +             cgroups[i].fd = fd;
> > > +             cgroups[i].id = get_cgroup_id(cgroups[i].path);
> > > +
> > > +             /*
> > > +              * Enable memcg controller for the entire hierarchy.
> > > +              * Note that stats are collected for all cgroups in a hierarchy
> > > +              * with memcg enabled anyway, but are only exposed for cgroups
> > > +              * that have memcg enabled.
> > > +              */
> > > +             if (i < N_NON_LEAF_CGROUPS) {
> > > +                     err = enable_controllers(cgroups[i].path, "memory");
> > > +                     if (!ASSERT_OK(err, "enable_controllers"))
> > > +                             return err;
> > > +             }
> > > +     }
> > > +     return 0;
> > > +}
> > > +
> > > +static void cleanup_cgroups(void)
> > > +{
> > > +     close(root_cgroup_fd);
> > > +     for (int i = 0; i < N_CGROUPS; i++)
> > > +             close(cgroups[i].fd);
> > > +     cleanup_cgroup_environment();
> > > +}
> > > +
> > > +
> > > +static int setup_hierarchy(void)
> > > +{
> > > +     return setup_bpffs() || setup_cgroups();
> > > +}
> > > +
> > > +static void destroy_hierarchy(void)
> > > +{
> > > +     cleanup_cgroups();
> > > +     cleanup_bpffs();
> > > +}
> > > +
> > > +static void alloc_anon(size_t size)
> > > +{
> > > +     char *buf, *ptr;
> > > +
> > > +     buf = malloc(size);
> > > +     for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
> > > +             *ptr = 0;
> > > +     free(buf);
> > > +}
> > > +
> > > +static int induce_vmscan(void)
> > > +{
> > > +     char size[128];
> > > +     int i, err;
> > > +
> > > +     /*
> > > +      * Set memory.high for test parent cgroup to 1 MB to throttle
> > > +      * allocations and invoke reclaim in children.
> > > +      */
> > > +     snprintf(size, 128, "%d", MB(1));
> > > +     err = write_cgroup_file(cgroups[0].path, "memory.high", size);
> > > +     if (!ASSERT_OK(err, "write memory.high"))
> > > +             return err;
> > > +     /*
> > > +      * In every leaf cgroup, run a memory hog for a few seconds to induce
> > > +      * reclaim then kill it.
> > > +      */
> > > +     for (i = N_NON_LEAF_CGROUPS; i < N_CGROUPS; i++) {
> > > +             pid_t pid = fork();
> > > +
> > > +             if (pid == 0) {
> > > +                     /* Join cgroup in the parent process workdir */
> > > +                     join_parent_cgroup(cgroups[i].path);
> > > +
> > > +                     /* Allocate more memory than memory.high */
> > > +                     alloc_anon(MB(2));
> > > +                     exit(0);
> > > +             } else {
> > > +                     /* Wait for child to cause reclaim then kill it */
> > > +                     if (!ASSERT_GT(pid, 0, "fork"))
> > > +                             return pid;
> > > +                     sleep(2);
> > > +                     kill(pid, SIGKILL);
> > > +                     waitpid(pid, NULL, 0);
> > > +             }
> > > +     }
> > > +     return 0;
> > > +}
> > > +
> > > +static unsigned long long get_cgroup_vmscan_delay(unsigned long long cgroup_id,
> > > +                                               const char *file_name)
> > > +{
> > > +     char buf[128], path[128];
> > > +     unsigned long long vmscan = 0, id = 0;
> > > +     int err;
> > > +
> > > +     /* For every cgroup, read the file generated by cgroup_iter */
> > > +     snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
> > > +     err = read_from_file(path, buf, 128);
> > > +     if (!ASSERT_OK(err, "read cgroup_iter"))
> > > +             return 0;
> > > +
> > > +     /* Check the output file formatting */
> > > +     ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
> > > +                      &id, &vmscan), 2, "output format");
> > > +
> > > +     /* Check that the cgroup_id is displayed correctly */
> > > +     ASSERT_EQ(id, cgroup_id, "cgroup_id");
> > > +     /* Check that the vmscan reading is non-zero */
> > > +     ASSERT_GT(vmscan, 0, "vmscan_reading");
> > > +     return vmscan;
> > > +}
> > > +
> > > +static void check_vmscan_stats(void)
> > > +{
> > > +     int i;
> > > +     unsigned long long vmscan_readings[N_CGROUPS], vmscan_root;
> > > +
> > > +     for (i = 0; i < N_CGROUPS; i++)
> > > +             vmscan_readings[i] = get_cgroup_vmscan_delay(cgroups[i].id,
> > > +                                                          cgroups[i].name);
> > > +
> > > +     /* Read stats for root too */
> > > +     vmscan_root = get_cgroup_vmscan_delay(CG_ROOT_ID, CG_ROOT_NAME);
> > > +
> > > +     /* Check that child1 == child1_1 + child1_2 */
> > > +     ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + vmscan_readings[4],
> > > +               "child1_vmscan");
> > > +     /* Check that child2 == child2_1 + child2_2 */
> > > +     ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + vmscan_readings[6],
> > > +               "child2_vmscan");
> > > +     /* Check that test == child1 + child2 */
> > > +     ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + vmscan_readings[2],
> > > +               "test_vmscan");
> > > +     /* Check that root >= test */
> > > +     ASSERT_GE(vmscan_root, vmscan_readings[1], "root_vmscan");
> > > +}
> > > +
> > > +static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj, int cgroup_fd,
> > > +                          const char *file_name)
> > > +{
> > > +     DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> > > +     union bpf_iter_link_info linfo = {};
> > > +     struct bpf_link *link;
> > > +     char path[128];
> > > +     int err;
> > > +
> > > +     /*
> > > +      * Create an iter link, parameterized by cgroup_fd.
> > > +      * We only want to traverse one cgroup, so set the traversal order to
> > > +      * "pre", and return 1 from dump_vmscan to stop iteration after the
> > > +      * first cgroup.
> > > +      */
> > > +     linfo.cgroup.cgroup_fd = cgroup_fd;
> > > +     linfo.cgroup.traversal_order = BPF_ITER_CGROUP_PRE;
> > > +     opts.link_info = &linfo;
> > > +     opts.link_info_len = sizeof(linfo);
> > > +     link = bpf_program__attach_iter(obj->progs.dump_vmscan, &opts);
> > > +     if (!ASSERT_OK_PTR(link, "attach iter"))
> > > +             return libbpf_get_error(link);
> > > +
> > > +     /* Pin the link to a bpffs file */
> > > +     snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
> > > +     err = bpf_link__pin(link, path);
> > > +     ASSERT_OK(err, "pin cgroup_iter");
> > > +     return err;
> > > +}
> > > +
> > > +static int setup_progs(struct cgroup_hierarchical_stats **skel)
> > > +{
> > > +     int i, err;
> > > +     struct bpf_link *link;
> > > +     struct cgroup_hierarchical_stats *obj;
> > > +
> > > +     obj = cgroup_hierarchical_stats__open_and_load();
> > > +     if (!ASSERT_OK_PTR(obj, "open_and_load"))
> > > +             return libbpf_get_error(obj);
> > > +
> > > +     /* Attach cgroup_iter program that will dump the stats to cgroups */
> > > +     for (i = 0; i < N_CGROUPS; i++) {
> > > +             err = setup_cgroup_iter(obj, cgroups[i].fd, cgroups[i].name);
> > > +             if (!ASSERT_OK(err, "setup_cgroup_iter"))
> > > +                     return err;
> > > +     }
> > > +     /* Also dump stats for root */
> > > +     err = setup_cgroup_iter(obj, root_cgroup_fd, CG_ROOT_NAME);
> > > +     if (!ASSERT_OK(err, "setup_cgroup_iter"))
> > > +             return err;
> > > +
> > > +     /* Attach rstat flusher */
> > > +     link = bpf_program__attach(obj->progs.vmscan_flush);
> > > +     if (!ASSERT_OK_PTR(link, "attach rstat"))
> > > +             return libbpf_get_error(link);
> > > +
> > > +     /* Attach tracing programs that will calculate vmscan delays */
> > > +     link = bpf_program__attach(obj->progs.vmscan_start);
> > > +     if (!ASSERT_OK_PTR(link, "attach raw_tracepoint"))
> > > +             return libbpf_get_error(link);
> > > +
> > > +     link = bpf_program__attach(obj->progs.vmscan_end);
> > > +     if (!ASSERT_OK_PTR(link, "attach raw_tracepoint"))
> > > +             return libbpf_get_error(link);
> > > +
> > > +     *skel = obj;
> > > +     return 0;
> > > +}
> > > +
> > > +void destroy_progs(struct cgroup_hierarchical_stats *skel)
> > > +{
> > > +     char path[128];
> > > +     int i;
> > > +
> > > +     for (i = 0; i < N_CGROUPS; i++) {
> > > +             /* Delete files in bpffs that cgroup_iters are pinned in */
> > > +             snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
> > > +                      cgroups[i].name);
> > > +             ASSERT_OK(remove(path), "remove cgroup_iter pin");
> > > +     }
> > > +
> > > +     /* Delete root file in bpffs */
> > > +     snprintf(path, 128, "%s%s", BPFFS_VMSCAN, CG_ROOT_NAME);
> > > +     ASSERT_OK(remove(path), "remove cgroup_iter root pin");
> > > +     cgroup_hierarchical_stats__destroy(skel);
> > > +}
> > > +
> > > +void test_cgroup_hierarchical_stats(void)
> > > +{
> > > +     struct cgroup_hierarchical_stats *skel = NULL;
> > > +
> > > +     if (setup_hierarchy())
> > > +             goto hierarchy_cleanup;
> > > +     if (setup_progs(&skel))
> > > +             goto cleanup;
> > > +     if (induce_vmscan())
> > > +             goto cleanup;
> > > +     check_vmscan_stats();
> > > +cleanup:
> > > +     destroy_progs(skel);
> > > +hierarchy_cleanup:
> > > +     destroy_hierarchy();
> > > +}
> > > diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> > > new file mode 100644
> > > index 0000000000000..fd2028f1ed70b
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> > > @@ -0,0 +1,234 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * Functions to manage eBPF programs attached to cgroup subsystems
> > > + *
> > > + * Copyright 2022 Google LLC.
> > > + */
> > > +#include "vmlinux.h"
> > > +#include <bpf/bpf_helpers.h>
> > > +#include <bpf/bpf_tracing.h>
> > > +
> > > +char _license[] SEC("license") = "GPL";
> > > +
> > > +/*
> > > + * Start times are stored per-task, not per-cgroup, as multiple tasks in one
> > > + * cgroup can perform reclaim concurrently.
> > > + */
> > > +struct {
> > > +     __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > > +     __uint(map_flags, BPF_F_NO_PREALLOC);
> > > +     __type(key, int);
> > > +     __type(value, __u64);
> > > +} vmscan_start_time SEC(".maps");
> > > +
> > > +struct vmscan_percpu {
> > > +     /* Previous percpu state, to figure out if we have new updates */
> > > +     __u64 prev;
> > > +     /* Current percpu state */
> > > +     __u64 state;
> > > +};
> > > +
> > > +struct vmscan {
> > > +     /* State propagated through children, pending aggregation */
> > > +     __u64 pending;
> > > +     /* Total state, including all cpus and all children */
> > > +     __u64 state;
> > > +};
> > > +
> > > +struct {
> > > +     __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
> > > +     __uint(max_entries, 10);
> > > +     __type(key, __u64);
> > > +     __type(value, struct vmscan_percpu);
> > > +} pcpu_cgroup_vmscan_elapsed SEC(".maps");
> > > +
> > > +struct {
> > > +     __uint(type, BPF_MAP_TYPE_HASH);
> > > +     __uint(max_entries, 10);
> > > +     __type(key, __u64);
> > > +     __type(value, struct vmscan);
> > > +} cgroup_vmscan_elapsed SEC(".maps");
> > > +
> > > +extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
> > > +extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
> > > +
> > > +static inline struct cgroup *task_memcg(struct task_struct *task)
> > > +{
> > > +     return task->cgroups->subsys[memory_cgrp_id]->cgroup;
> > > +}
> > > +
> > > +static inline uint64_t cgroup_id(struct cgroup *cgrp)
> > > +{
> > > +     return cgrp->kn->id;
> > > +}
> > > +
> > > +static inline int create_vmscan_percpu_elem(__u64 cg_id, __u64 state)
> > > +{
> > > +     struct vmscan_percpu pcpu_init = {.state = state, .prev = 0};
> > > +
> > > +     if (bpf_map_update_elem(&pcpu_cgroup_vmscan_elapsed, &cg_id,
> > > +                             &pcpu_init, BPF_NOEXIST)) {
> > > +             bpf_printk("failed to create pcpu entry for cgroup %llu\n"
> > > +                        , cg_id);
> > > +             return 1;
> > > +     }
> > > +     return 0;
> > > +}
> > > +
> > > +static inline int create_vmscan_elem(__u64 cg_id, __u64 state, __u64 pending)
> > > +{
> > > +     struct vmscan init = {.state = state, .pending = pending};
> > > +
> > > +     if (bpf_map_update_elem(&cgroup_vmscan_elapsed, &cg_id,
> > > +                             &init, BPF_NOEXIST)) {
> > > +             bpf_printk("failed to create entry for cgroup %llu\n"
> > > +                        , cg_id);
> > > +             return 1;
> > > +     }
> > > +     return 0;
> > > +}
> > > +
> > > +SEC("tp_btf/mm_vmscan_memcg_reclaim_begin")
> > > +int BPF_PROG(vmscan_start, struct lruvec *lruvec, struct scan_control *sc)
> > > +{
> > > +     struct task_struct *task = bpf_get_current_task_btf();
> > > +     __u64 *start_time_ptr;
> > > +
> > > +     start_time_ptr = bpf_task_storage_get(&vmscan_start_time, task, 0,
> > > +                                       BPF_LOCAL_STORAGE_GET_F_CREATE);
> > > +     if (!start_time_ptr) {
> > > +             bpf_printk("error retrieving storage\n");
> > > +             return 0;
> > > +     }
> > > +
> > > +     *start_time_ptr = bpf_ktime_get_ns();
> > > +     return 0;
> > > +}
> > > +
> > > +SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
> > > +int BPF_PROG(vmscan_end, struct lruvec *lruvec, struct scan_control *sc)
> > > +{
> > > +     struct vmscan_percpu *pcpu_stat;
> > > +     struct task_struct *current = bpf_get_current_task_btf();
> > > +     struct cgroup *cgrp;
> > > +     __u64 *start_time_ptr;
> > > +     __u64 current_elapsed, cg_id;
> > > +     __u64 end_time = bpf_ktime_get_ns();
> > > +
> > > +     /*
> > > +      * cgrp is the first parent cgroup of current that has memcg enabled in
> > > +      * its subtree_control, or NULL if memcg is disabled in the entire tree.
> > > +      * In a cgroup hierarchy like this:
> > > +      *                               a
> > > +      *                              / \
> > > +      *                             b   c
> > > +      *  If "a" has memcg enabled, while "b" doesn't, then processes in "b"
> > > +      *  will accumulate their stats directly to "a". This makes sure that no
> > > +      *  stats are lost from processes in leaf cgroups that don't have memcg
> > > +      *  enabled, but only exposes stats for cgroups that have memcg enabled.
> > > +      */
> > > +     cgrp = task_memcg(current);
> > > +     if (!cgrp)
> > > +             return 0;
> > > +
> > > +     cg_id = cgroup_id(cgrp);
> > > +     start_time_ptr = bpf_task_storage_get(&vmscan_start_time, current, 0,
> > > +                                           BPF_LOCAL_STORAGE_GET_F_CREATE);
> > > +     if (!start_time_ptr) {
> > > +             bpf_printk("error retrieving storage local storage\n");
> > > +             return 0;
> > > +     }
> > > +
> > > +     current_elapsed = end_time - *start_time_ptr;
> > > +     pcpu_stat = bpf_map_lookup_elem(&pcpu_cgroup_vmscan_elapsed,
> > > +                                     &cg_id);
> > > +     if (pcpu_stat)
> > > +             __sync_fetch_and_add(&pcpu_stat->state, current_elapsed);
> > > +     else
> > > +             create_vmscan_percpu_elem(cg_id, current_elapsed);
> > > +
> > > +     cgroup_rstat_updated(cgrp, bpf_get_smp_processor_id());
> > > +     return 0;
> > > +}
> > > +
> > > +SEC("fentry/bpf_rstat_flush")
> > > +int BPF_PROG(vmscan_flush, struct cgroup *cgrp, struct cgroup *parent, int cpu)
> > > +{
> > > +     struct vmscan_percpu *pcpu_stat;
> > > +     struct vmscan *total_stat, *parent_stat;
> > > +     __u64 cg_id = cgroup_id(cgrp);
> > > +     __u64 parent_cg_id = parent ? cgroup_id(parent) : 0;
> > > +     __u64 state;
> > > +     __u64 delta = 0;
> > > +
> > > +     /* Add CPU changes on this level since the last flush */
> > > +     pcpu_stat = bpf_map_lookup_percpu_elem(&pcpu_cgroup_vmscan_elapsed,
> > > +                                            &cg_id, cpu);
> > > +     if (pcpu_stat) {
> > > +             state = pcpu_stat->state;
> > > +             delta += state - pcpu_stat->prev;
> > > +             pcpu_stat->prev = state;
> > > +     }
> > > +
> > > +     total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
> > > +     if (!total_stat) {
> > > +             create_vmscan_elem(cg_id, delta, 0);
> > > +             goto update_parent;
> > > +     }
> > > +
> > > +     /* Collect pending stats from subtree */
> > > +     if (total_stat->pending) {
> > > +             delta += total_stat->pending;
> > > +             total_stat->pending = 0;
> > > +     }
> > > +
> > > +     /* Propagate changes to this cgroup's total */
> > > +     total_stat->state += delta;
> > > +
> > > +update_parent:
> > > +     /* Skip if there are no changes to propagate, or no parent */
> > > +     if (!delta || !parent_cg_id)
> > > +             return 0;
> > > +
> > > +     /* Propagate changes to cgroup's parent */
> > > +     parent_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed,
> > > +                                       &parent_cg_id);
> > > +     if (parent_stat)
> > > +             parent_stat->pending += delta;
> > > +     else
> > > +             create_vmscan_elem(parent_cg_id, 0, delta);
> > > +
> > > +     return 0;
> > > +}
> > > +
> > > +SEC("iter.s/cgroup")
> > > +int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp)
> > > +{
> > > +     struct seq_file *seq = meta->seq;
> > > +     struct vmscan *total_stat;
> > > +     __u64 cg_id;
> > > +
> > > +     /* Do nothing for the terminal call */
> > > +     if (!cgrp)
> > > +             return 1;
> > > +
> > > +     cg_id = cgroup_id(cgrp);
> > > +
> > > +     /* Flush the stats to make sure we get the most updated numbers */
> > > +     cgroup_rstat_flush(cgrp);
> > > +
> > > +     total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
> > > +     if (!total_stat) {
> > > +             bpf_printk("error finding stats for cgroup %llu\n", cg_id);
> > > +             BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: -1\n",
> > > +                            cg_id);
> > > +             return 1;
> > > +     }
> > > +     BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
> > > +                    cg_id, total_stat->state);
> > > +
> > > +     /*
> > > +      * We only dump stats for one cgroup here, so return 1 to stop
> > > +      * iteration after the first cgroup.
> > > +      */
> > > +     return 1;
> > > +}

[-- Attachment #2: btf_dump_fix.patch --]
[-- Type: application/octet-stream, Size: 985 bytes --]

diff --git a/tools/testing/selftests/bpf/prog_tests/btf_dump.c b/tools/testing/selftests/bpf/prog_tests/btf_dump.c
index 5fce7008d1ff3..a7b7e008dd6f8 100644
--- a/tools/testing/selftests/bpf/prog_tests/btf_dump.c
+++ b/tools/testing/selftests/bpf/prog_tests/btf_dump.c
@@ -764,8 +764,8 @@ static void test_btf_dump_struct_data(struct btf *btf, struct btf_dump *d,
 
 	/* union with nested struct */
 	TEST_BTF_DUMP_DATA(btf, d, "union", str, union bpf_iter_link_info, BTF_F_COMPACT,
-			   "(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,},}",
-			   { .map = { .map_fd = 1 }});
+			   "(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,},.cgroup = (struct){.cgroup_fd = (__u32)1,.traversal_order = (__u32)1},}",
+			   { .map = { .map_fd = 1 }, .cgroup = {.cgroup_fd = 1, .traversal_order = BPF_ITER_CGROUP_PRE }});
 
 	/* struct skb with nested structs/unions; because type output is so
 	 * complex, we don't do a string comparison, just verify we return

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
  2022-06-28  6:03     ` Yosry Ahmed
@ 2022-06-28 17:03       ` Yonghong Song
  0 siblings, 0 replies; 46+ messages in thread
From: Yonghong Song @ 2022-06-28 17:03 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups



On 6/27/22 11:03 PM, Yosry Ahmed wrote:
> On Mon, Jun 27, 2022 at 9:14 PM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
>>> From: Hao Luo <haoluo@google.com>
>>>
>>> Cgroup_iter is a type of bpf_iter. It walks over cgroups in two modes:
>>>
>>>    - walking a cgroup's descendants.
>>>    - walking a cgroup's ancestors.
>>>
>>> When attaching cgroup_iter, one can set a cgroup to the iter_link
>>> created from attaching. This cgroup is passed as a file descriptor and
>>> serves as the starting point of the walk. If no cgroup is specified,
>>> the starting point will be the root cgroup.
>>>
>>> For walking descendants, one can specify the order: either pre-order or
>>> post-order. For walking ancestors, the walk starts at the specified
>>> cgroup and ends at the root.
>>>
>>> One can also terminate the walk early by returning 1 from the iter
>>> program.
>>>
>>> Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
>>> program is called with cgroup_mutex held.
>>>
>>> Signed-off-by: Hao Luo <haoluo@google.com>
>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>>> ---
>>>    include/linux/bpf.h            |   8 ++
>>>    include/uapi/linux/bpf.h       |  21 +++
>>>    kernel/bpf/Makefile            |   2 +-
>>>    kernel/bpf/cgroup_iter.c       | 235 +++++++++++++++++++++++++++++++++
>>>    tools/include/uapi/linux/bpf.h |  21 +++
>>>    5 files changed, 286 insertions(+), 1 deletion(-)
>>>    create mode 100644 kernel/bpf/cgroup_iter.c
>>>
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index 8e6092d0ea956..48d8e836b9748 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -44,6 +44,7 @@ struct kobject;
>>>    struct mem_cgroup;
>>>    struct module;
>>>    struct bpf_func_state;
>>> +struct cgroup;
>>>
>>>    extern struct idr btf_idr;
>>>    extern spinlock_t btf_idr_lock;
>>> @@ -1590,7 +1591,14 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>>>        int __init bpf_iter_ ## target(args) { return 0; }
>>>
>>>    struct bpf_iter_aux_info {
>>> +     /* for map_elem iter */
>>>        struct bpf_map *map;
>>> +
>>> +     /* for cgroup iter */
>>> +     struct {
>>> +             struct cgroup *start; /* starting cgroup */
>>> +             int order;
>>> +     } cgroup;
>>>    };
>>>
>> [...]
>>> +
>>> +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
>>> +{
>>> +     struct cgroup_iter_priv *p = seq->private;
>>> +
>>> +     mutex_lock(&cgroup_mutex);
>>> +
>>> +     /* support only one session */
>>> +     if (*pos > 0)
>>> +             return NULL;
>>> +
>>> +     ++*pos;
>>> +     p->terminate = false;
>>> +     if (p->order == BPF_ITER_CGROUP_PRE)
>>> +             return css_next_descendant_pre(NULL, p->start_css);
>>> +     else if (p->order == BPF_ITER_CGROUP_POST)
>>> +             return css_next_descendant_post(NULL, p->start_css);
>>> +     else /* BPF_ITER_CGROUP_PARENT_UP */
>>> +             return p->start_css;
>>> +}
>>> +
>>> +static int __cgroup_iter_seq_show(struct seq_file *seq,
>>> +                               struct cgroup_subsys_state *css, int in_stop);
>>> +
>>> +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
>>> +{
>>> +     /* pass NULL to the prog for post-processing */
>>> +     if (!v)
>>> +             __cgroup_iter_seq_show(seq, NULL, true);
>>> +     mutex_unlock(&cgroup_mutex);
>>> +}
>>> +
>>> +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>>> +{
>>> +     struct cgroup_subsys_state *curr = (struct cgroup_subsys_state *)v;
>>> +     struct cgroup_iter_priv *p = seq->private;
>>> +
>>> +     ++*pos;
>>> +     if (p->terminate)
>>> +             return NULL;
>>> +
>>> +     if (p->order == BPF_ITER_CGROUP_PRE)
>>> +             return css_next_descendant_pre(curr, p->start_css);
>>> +     else if (p->order == BPF_ITER_CGROUP_POST)
>>> +             return css_next_descendant_post(curr, p->start_css);
>>> +     else
>>> +             return curr->parent;
>>> +}
>>> +
>>> +static int __cgroup_iter_seq_show(struct seq_file *seq,
>>> +                               struct cgroup_subsys_state *css, int in_stop)
>>> +{
>>> +     struct cgroup_iter_priv *p = seq->private;
>>> +     struct bpf_iter__cgroup ctx;
>>> +     struct bpf_iter_meta meta;
>>> +     struct bpf_prog *prog;
>>> +     int ret = 0;
>>> +
>>> +     /* cgroup is dead, skip this element */
>>> +     if (css && cgroup_is_dead(css->cgroup))
>>> +             return 0;
>>> +
>>> +     ctx.meta = &meta;
>>> +     ctx.cgroup = css ? css->cgroup : NULL;
>>> +     meta.seq = seq;
>>> +     prog = bpf_iter_get_info(&meta, in_stop);
>>> +     if (prog)
>>> +             ret = bpf_iter_run_prog(prog, &ctx);
>>
>> Do we need to do anything special to ensure bpf program gets
>> up-to-date stat from ctx.cgroup?
> 
> Later patches in the series add a cgroup_rstat_flush() kfunc which
> flushes cgroup stats that use rstat (e.g. memcg stats). It can be
> called directly from the bpf program if needed.
> 
> It would be better to leave this to the bpf program; it would be an
> unnecessary cost to flush the stats for every cgroup_iter program,
> since a program may not access stats at all, or may access stats that
> are not maintained using rstat.
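>
> A minimal sketch of that, with the kfunc declared __ksym as in the
> selftest (the program name here is just for illustration):
>
>     extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
>
>     SEC("iter.s/cgroup")
>     int BPF_PROG(dump_stats, struct bpf_iter_meta *meta, struct cgroup *cgrp)
>     {
>             /* flush only because this program actually reads rstat-backed stats */
>             if (cgrp)
>                     cgroup_rstat_flush(cgrp);
>             return 0;
>     }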

Okay, this should work.

> 
>>
>>> +
>>> +     /* if prog returns nonzero, terminate after this element. */
>>> +     if (ret != 0)
>>> +             p->terminate = true;
>>> +
>>> +     return 0;
>>> +}
>>> +
>> [...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-28  7:14       ` Yosry Ahmed
@ 2022-06-29  0:09         ` Yosry Ahmed
  2022-06-29  6:48           ` Yonghong Song
  2022-06-29  6:17         ` Yonghong Song
  1 sibling, 1 reply; 46+ messages in thread
From: Yosry Ahmed @ 2022-06-29  0:09 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Tue, Jun 28, 2022 at 12:14 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Jun 27, 2022 at 11:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song <yhs@fb.com> wrote:
> > >
> > >
> > >
> > > On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> > > > Add a selftest that tests the whole workflow for collecting,
> > > > aggregating (flushing), and displaying cgroup hierarchical stats.
> > > >
> > > > TL;DR:
> > > > - Whenever reclaim happens, vmscan_start and vmscan_end update
> > > >    per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
> > > >    have updates.
> > > > - When userspace tries to read the stats, vmscan_dump calls rstat to flush
> > > >    the stats, and outputs the stats in text format to userspace (similar
> > > >    to cgroupfs stats).
> > > > - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
> > > >    updates, vmscan_flush aggregates cpu readings and propagates updates
> > > >    to parents.
> > > >
> > > > Detailed explanation:
> > > > - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
> > > >    measure the latency of cgroup reclaim. Per-cgroup readings are stored in
> > > >    percpu maps for efficiency. When a cgroup reading is updated on a cpu,
> > > >    cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
> > > >    rstat updated tree on that cpu.
> > > >
> > > > - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
> > > >    each cgroup. Reading this file invokes the program, which calls
> > > >    cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
> > > >    cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
> > > >    the stats are exposed to the user. vmscan_dump returns 1 to terminate
> > > >    iteration early, so that we only expose stats for one cgroup per read.
> > > >
> > > > - An ftrace program, vmscan_flush, is also loaded and attached to
> > > >    bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
> > > >    once for each (cgroup, cpu) pair that has updates. cgroups are popped
> > > >    from the rstat tree in a bottom-up fashion, so calls will always be
> > > >    made for cgroups that have updates before their parents. The program
> > > >    aggregates percpu readings to a total per-cgroup reading, and also
> > > >    propagates them to the parent cgroup. After rstat flushing is over, all
> > > >    cgroups will have correct updated hierarchical readings (including all
> > > >    cpus and all their descendants).
> > > >
> > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > >
> > > There is a selftest failure with the test:
> > >
> > > get_cgroup_vmscan_delay:PASS:output format 0 nsec
> > > get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> > > get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
> > > get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
> > > get_cgroup_vmscan_delay:PASS:output format 0 nsec
> > > get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> > > get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> > > actual 0 <= expected 0
> > > check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
> > > 781874 != expected 382092
> > > check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
> > > -1 != expected -2
> > > check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
> > > 781874 != expected 781873
> > > check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
> > > expected 781874
> > > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > > destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> > > destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
> > > cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
> > > #33      cgroup_hierarchical_stats:FAIL
> > >
> >
> > The test is passing on my setup. I am trying to figure out if there is
> > something outside the setup done by the test that can cause the test
> > to fail.
> >
>
> I can't reproduce the failure on my machine. It seems like, for some
> reason, reclaim is not invoked in one of the test cgroups, which
> results in the expected stats not being there. I have a few suspicions
> as to what might cause this, but I am not sure.
>
> If you have the capacity, do you mind re-running the test with the
> attached diff1.patch? (And maybe diff2.patch if that fails; it will
> cause OOMs in the test cgroup, so you might see some process-killed
> warnings.)
> Thanks!
>

In addition to that, it looks like one of the cgroups has a "0" stat,
which shouldn't happen unless one of the map update/lookup operations
failed, and those failures should have logged something through
bpf_printk. I need to reproduce the test failure to investigate this
properly. Did you observe this failure on your machine or in CI? Any
instructions on how to reproduce, or details of the system setup?
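
(For reproducing: bpf_printk() output from these programs goes to the
tracing ring buffer, so watching trace_pipe while the test runs should
surface any map update/lookup failures. A minimal sketch, assuming
tracefs is mounted at the usual debugfs location:

	#include <stdio.h>

	int main(void)
	{
		char line[512];
		FILE *f = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");

		if (!f)
			return 1;
		/* watch for the "error ..." / "failed to create ..." lines */
		while (fgets(line, sizeof(line), f))
			fputs(line, stdout);
		fclose(f);
		return 0;
	}

Reading trace_pipe blocks when the buffer is empty, so interrupt it once
the test run finishes.)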

>
> > >
> > > Also an existing test also failed.
> > >
> > > btf_dump_data:PASS:find type id 0 nsec
> > >
> > >
> > > btf_dump_data:PASS:failed/unexpected type_sz 0 nsec
> > >
> > >
> > > btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
> > > expected/actual match: actual '(union bpf_iter_link_info){.map =
> > > (struct){.map_fd = (__u32)1,},.cgroup '
> > > test_btf_dump_struct_data:PASS:find struct sk_buff 0 nsec
> > >
> >
> > Yeah I see what happened there. bpf_iter_link_info was changed by the
> > patch that introduced cgroup_iter, and this specific union is used by
> > the test to test the "union with nested struct" btf dumping. I will
> > add a patch in the next version that updates the btf_dump_data test
> > accordingly. Thanks.
> >
> > >
> > > test_btf_dump_struct_data:PASS:unexpected return value dumping sk_buff 0
> > > nsec
> > >
> > > btf_dump_data:PASS:verify prefix match 0 nsec
> > >
> > >
> > > btf_dump_data:PASS:find type id 0 nsec
> > >
> > >
> > > btf_dump_data:PASS:failed to return -E2BIG 0 nsec
> > >
> > >
> > > btf_dump_data:PASS:ensure expected/actual match 0 nsec
> > >
> > >
> > > btf_dump_data:PASS:verify prefix match 0 nsec
> > >
> > >
> > > btf_dump_data:PASS:find type id 0 nsec
> > >
> > >
> > > btf_dump_data:PASS:failed to return -E2BIG 0 nsec
> > >
> > >
> > > btf_dump_data:PASS:ensure expected/actual match 0 nsec
> > >
> > >
> > > #21/14   btf_dump/btf_dump: struct_data:FAIL
> > >
> > > please take a look.
> > >
> > > > ---
> > > >   .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
> > > >   .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
> > > >   2 files changed, 585 insertions(+)
> > > >   create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> > > >   create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> > > >
> > > > diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> > > > new file mode 100644
> > > > index 0000000000000..b78a4043da49a
> > > > --- /dev/null
> > > > +++ b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> > > > @@ -0,0 +1,351 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > +/*
> > > > + * Functions to manage eBPF programs attached to cgroup subsystems
> > > > + *
> > > > + * Copyright 2022 Google LLC.
> > > > + */
> > > > +#include <errno.h>
> > > > +#include <sys/types.h>
> > > > +#include <sys/mount.h>
> > > > +#include <sys/stat.h>
> > > > +#include <unistd.h>
> > > > +
> > > > +#include <test_progs.h>
> > > > +#include <bpf/libbpf.h>
> > > > +#include <bpf/bpf.h>
> > > > +
> > > > +#include "cgroup_helpers.h"
> > > > +#include "cgroup_hierarchical_stats.skel.h"
> > > > +
> > > > +#define PAGE_SIZE 4096
> > > > +#define MB(x) (x << 20)
> > > > +
> > > > +#define BPFFS_ROOT "/sys/fs/bpf/"
> > > > +#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
> > > > +
> > > > +#define CG_ROOT_NAME "root"
> > > > +#define CG_ROOT_ID 1
> > > > +
> > > > +#define CGROUP_PATH(p, n) {.path = #p"/"#n, .name = #n}
> > > > +
> > > > +static struct {
> > > > +     const char *path, *name;
> > > > +     unsigned long long id;
> > > > +     int fd;
> > > > +} cgroups[] = {
> > > > +     CGROUP_PATH(/, test),
> > > > +     CGROUP_PATH(/test, child1),
> > > > +     CGROUP_PATH(/test, child2),
> > > > +     CGROUP_PATH(/test/child1, child1_1),
> > > > +     CGROUP_PATH(/test/child1, child1_2),
> > > > +     CGROUP_PATH(/test/child2, child2_1),
> > > > +     CGROUP_PATH(/test/child2, child2_2),
> > > > +};
> > > > +
> > > > +#define N_CGROUPS ARRAY_SIZE(cgroups)
> > > > +#define N_NON_LEAF_CGROUPS 3
> > > > +
> > > > +int root_cgroup_fd;
> > > > +bool mounted_bpffs;
> > > > +
> > > > +static int read_from_file(const char *path, char *buf, size_t size)
> > > > +{
> > > > +     int fd, len;
> > > > +
> > > > +     fd = open(path, O_RDONLY);
> > > > +     if (fd < 0) {
> > > > +             log_err("Open %s", path);
> > > > +             return -errno;
> > > > +     }
> > > > +     len = read(fd, buf, size - 1);
> > > > +     if (len < 0)
> > > > +             log_err("Read %s", path);
> > > > +     else
> > > > +             buf[len] = 0;
> > > > +     close(fd);
> > > > +     return len < 0 ? -errno : 0;
> > > > +}
> > > > +
> > > > +static int setup_bpffs(void)
> > > > +{
> > > > +     int err;
> > > > +
> > > > +     /* Mount bpffs */
> > > > +     err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
> > > > +     mounted_bpffs = !err;
> > > > +     if (!ASSERT_OK(err && errno != EBUSY, "mount bpffs"))
> > > > +             return err;
> > > > +
> > > > +     /* Create a directory to contain stat files in bpffs */
> > > > +     err = mkdir(BPFFS_VMSCAN, 0755);
> > > > +     ASSERT_OK(err, "mkdir bpffs");
> > > > +     return err;
> > > > +}
> > > > +
> > > > +static void cleanup_bpffs(void)
> > > > +{
> > > > +     /* Remove created directory in bpffs */
> > > > +     ASSERT_OK(rmdir(BPFFS_VMSCAN), "rmdir "BPFFS_VMSCAN);
> > > > +
> > > > +     /* Unmount bpffs, if it wasn't already mounted when we started */
> > > > +     if (mounted_bpffs)
> > > > +             return;
> > > > +     ASSERT_OK(umount(BPFFS_ROOT), "unmount bpffs");
> > > > +}
> > > > +
> > > > +static int setup_cgroups(void)
> > > > +{
> > > > +     int i, fd, err;
> > > > +
> > > > +     err = setup_cgroup_environment();
> > > > +     if (!ASSERT_OK(err, "setup_cgroup_environment"))
> > > > +             return err;
> > > > +
> > > > +     root_cgroup_fd = get_root_cgroup();
> > > > +     if (!ASSERT_GE(root_cgroup_fd, 0, "get_root_cgroup"))
> > > > +             return root_cgroup_fd;
> > > > +
> > > > +     for (i = 0; i < N_CGROUPS; i++) {
> > > > +             fd = create_and_get_cgroup(cgroups[i].path);
> > > > +             if (!ASSERT_GE(fd, 0, "create_and_get_cgroup"))
> > > > +                     return fd;
> > > > +
> > > > +             cgroups[i].fd = fd;
> > > > +             cgroups[i].id = get_cgroup_id(cgroups[i].path);
> > > > +
> > > > +             /*
> > > > +              * Enable memcg controller for the entire hierarchy.
> > > > +              * Note that stats are collected for all cgroups in a hierarchy
> > > > +              * with memcg enabled anyway, but are only exposed for cgroups
> > > > +              * that have memcg enabled.
> > > > +              */
> > > > +             if (i < N_NON_LEAF_CGROUPS) {
> > > > +                     err = enable_controllers(cgroups[i].path, "memory");
> > > > +                     if (!ASSERT_OK(err, "enable_controllers"))
> > > > +                             return err;
> > > > +             }
> > > > +     }
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static void cleanup_cgroups(void)
> > > > +{
> > > > +     close(root_cgroup_fd);
> > > > +     for (int i = 0; i < N_CGROUPS; i++)
> > > > +             close(cgroups[i].fd);
> > > > +     cleanup_cgroup_environment();
> > > > +}
> > > > +
> > > > +
> > > > +static int setup_hierarchy(void)
> > > > +{
> > > > +     return setup_bpffs() || setup_cgroups();
> > > > +}
> > > > +
> > > > +static void destroy_hierarchy(void)
> > > > +{
> > > > +     cleanup_cgroups();
> > > > +     cleanup_bpffs();
> > > > +}
> > > > +
> > > > +static void alloc_anon(size_t size)
> > > > +{
> > > > +     char *buf, *ptr;
> > > > +
> > > > +     buf = malloc(size);
> > > > +     for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
> > > > +             *ptr = 0;
> > > > +     free(buf);
> > > > +}
> > > > +
> > > > +static int induce_vmscan(void)
> > > > +{
> > > > +     char size[128];
> > > > +     int i, err;
> > > > +
> > > > +     /*
> > > > +      * Set memory.high for test parent cgroup to 1 MB to throttle
> > > > +      * allocations and invoke reclaim in children.
> > > > +      */
> > > > +     snprintf(size, 128, "%d", MB(1));
> > > > +     err = write_cgroup_file(cgroups[0].path, "memory.high", size);
> > > > +     if (!ASSERT_OK(err, "write memory.high"))
> > > > +             return err;
> > > > +     /*
> > > > +      * In every leaf cgroup, run a memory hog for a few seconds to induce
> > > > +      * reclaim then kill it.
> > > > +      */
> > > > +     for (i = N_NON_LEAF_CGROUPS; i < N_CGROUPS; i++) {
> > > > +             pid_t pid = fork();
> > > > +
> > > > +             if (pid == 0) {
> > > > +                     /* Join cgroup in the parent process workdir */
> > > > +                     join_parent_cgroup(cgroups[i].path);
> > > > +
> > > > +                     /* Allocate more memory than memory.high */
> > > > +                     alloc_anon(MB(2));
> > > > +                     exit(0);
> > > > +             } else {
> > > > +                     /* Wait for child to cause reclaim then kill it */
> > > > +                     if (!ASSERT_GT(pid, 0, "fork"))
> > > > +                             return pid;
> > > > +                     sleep(2);
> > > > +                     kill(pid, SIGKILL);
> > > > +                     waitpid(pid, NULL, 0);
> > > > +             }
> > > > +     }
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static unsigned long long get_cgroup_vmscan_delay(unsigned long long cgroup_id,
> > > > +                                               const char *file_name)
> > > > +{
> > > > +     char buf[128], path[128];
> > > > +     unsigned long long vmscan = 0, id = 0;
> > > > +     int err;
> > > > +
> > > > +     /* For every cgroup, read the file generated by cgroup_iter */
> > > > +     snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
> > > > +     err = read_from_file(path, buf, 128);
> > > > +     if (!ASSERT_OK(err, "read cgroup_iter"))
> > > > +             return 0;
> > > > +
> > > > +     /* Check the output file formatting */
> > > > +     ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
> > > > +                      &id, &vmscan), 2, "output format");
> > > > +
> > > > +     /* Check that the cgroup_id is displayed correctly */
> > > > +     ASSERT_EQ(id, cgroup_id, "cgroup_id");
> > > > +     /* Check that the vmscan reading is non-zero */
> > > > +     ASSERT_GT(vmscan, 0, "vmscan_reading");
> > > > +     return vmscan;
> > > > +}
> > > > +
> > > > +static void check_vmscan_stats(void)
> > > > +{
> > > > +     int i;
> > > > +     unsigned long long vmscan_readings[N_CGROUPS], vmscan_root;
> > > > +
> > > > +     for (i = 0; i < N_CGROUPS; i++)
> > > > +             vmscan_readings[i] = get_cgroup_vmscan_delay(cgroups[i].id,
> > > > +                                                          cgroups[i].name);
> > > > +
> > > > +     /* Read stats for root too */
> > > > +     vmscan_root = get_cgroup_vmscan_delay(CG_ROOT_ID, CG_ROOT_NAME);
> > > > +
> > > > +     /* Check that child1 == child1_1 + child1_2 */
> > > > +     ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + vmscan_readings[4],
> > > > +               "child1_vmscan");
> > > > +     /* Check that child2 == child2_1 + child2_2 */
> > > > +     ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + vmscan_readings[6],
> > > > +               "child2_vmscan");
> > > > +     /* Check that test == child1 + child2 */
> > > > +     ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + vmscan_readings[2],
> > > > +               "test_vmscan");
> > > > +     /* Check that root >= test */
> > > > +     ASSERT_GE(vmscan_root, vmscan_readings[1], "root_vmscan");
> > > > +}
> > > > +
> > > > +static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj, int cgroup_fd,
> > > > +                          const char *file_name)
> > > > +{
> > > > +     DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> > > > +     union bpf_iter_link_info linfo = {};
> > > > +     struct bpf_link *link;
> > > > +     char path[128];
> > > > +     int err;
> > > > +
> > > > +     /*
> > > > +      * Create an iter link, parameterized by cgroup_fd.
> > > > +      * We only want to traverse one cgroup, so set the traversal order to
> > > > +      * "pre", and return 1 from dump_vmscan to stop iteration after the
> > > > +      * first cgroup.
> > > > +      */
> > > > +     linfo.cgroup.cgroup_fd = cgroup_fd;
> > > > +     linfo.cgroup.traversal_order = BPF_ITER_CGROUP_PRE;
> > > > +     opts.link_info = &linfo;
> > > > +     opts.link_info_len = sizeof(linfo);
> > > > +     link = bpf_program__attach_iter(obj->progs.dump_vmscan, &opts);
> > > > +     if (!ASSERT_OK_PTR(link, "attach iter"))
> > > > +             return libbpf_get_error(link);
> > > > +
> > > > +     /* Pin the link to a bpffs file */
> > > > +     snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
> > > > +     err = bpf_link__pin(link, path);
> > > > +     ASSERT_OK(err, "pin cgroup_iter");
> > > > +     return err;
> > > > +}
> > > > +
> > > > +static int setup_progs(struct cgroup_hierarchical_stats **skel)
> > > > +{
> > > > +     int i, err;
> > > > +     struct bpf_link *link;
> > > > +     struct cgroup_hierarchical_stats *obj;
> > > > +
> > > > +     obj = cgroup_hierarchical_stats__open_and_load();
> > > > +     if (!ASSERT_OK_PTR(obj, "open_and_load"))
> > > > +             return libbpf_get_error(obj);
> > > > +
> > > > +     /* Attach cgroup_iter program that will dump the stats to cgroups */
> > > > +     for (i = 0; i < N_CGROUPS; i++) {
> > > > +             err = setup_cgroup_iter(obj, cgroups[i].fd, cgroups[i].name);
> > > > +             if (!ASSERT_OK(err, "setup_cgroup_iter"))
> > > > +                     return err;
> > > > +     }
> > > > +     /* Also dump stats for root */
> > > > +     err = setup_cgroup_iter(obj, root_cgroup_fd, CG_ROOT_NAME);
> > > > +     if (!ASSERT_OK(err, "setup_cgroup_iter"))
> > > > +             return err;
> > > > +
> > > > +     /* Attach rstat flusher */
> > > > +     link = bpf_program__attach(obj->progs.vmscan_flush);
> > > > +     if (!ASSERT_OK_PTR(link, "attach rstat"))
> > > > +             return libbpf_get_error(link);
> > > > +
> > > > +     /* Attach tracing programs that will calculate vmscan delays */
> > > > +     link = bpf_program__attach(obj->progs.vmscan_start);
> > > > +     if (!ASSERT_OK_PTR(link, "attach raw_tracepoint"))
> > > > +             return libbpf_get_error(link);
> > > > +
> > > > +     link = bpf_program__attach(obj->progs.vmscan_end);
> > > > +     if (!ASSERT_OK_PTR(link, "attach raw_tracepoint"))
> > > > +             return libbpf_get_error(link);
> > > > +
> > > > +     *skel = obj;
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static void destroy_progs(struct cgroup_hierarchical_stats *skel)
> > > > +{
> > > > +     char path[128];
> > > > +     int i;
> > > > +
> > > > +     for (i = 0; i < N_CGROUPS; i++) {
> > > > +             /* Delete files in bpffs that cgroup_iters are pinned in */
> > > > +             snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
> > > > +                      cgroups[i].name);
> > > > +             ASSERT_OK(remove(path), "remove cgroup_iter pin");
> > > > +     }
> > > > +
> > > > +     /* Delete root file in bpffs */
> > > > +     snprintf(path, 128, "%s%s", BPFFS_VMSCAN, CG_ROOT_NAME);
> > > > +     ASSERT_OK(remove(path), "remove cgroup_iter root pin");
> > > > +     cgroup_hierarchical_stats__destroy(skel);
> > > > +}
> > > > +
> > > > +void test_cgroup_hierarchical_stats(void)
> > > > +{
> > > > +     struct cgroup_hierarchical_stats *skel = NULL;
> > > > +
> > > > +     if (setup_hierarchy())
> > > > +             goto hierarchy_cleanup;
> > > > +     if (setup_progs(&skel))
> > > > +             goto cleanup;
> > > > +     if (induce_vmscan())
> > > > +             goto cleanup;
> > > > +     check_vmscan_stats();
> > > > +cleanup:
> > > > +     destroy_progs(skel);
> > > > +hierarchy_cleanup:
> > > > +     destroy_hierarchy();
> > > > +}
> > > > diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> > > > new file mode 100644
> > > > index 0000000000000..fd2028f1ed70b
> > > > --- /dev/null
> > > > +++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> > > > @@ -0,0 +1,234 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > +/*
> > > > + * Functions to manage eBPF programs attached to cgroup subsystems
> > > > + *
> > > > + * Copyright 2022 Google LLC.
> > > > + */
> > > > +#include "vmlinux.h"
> > > > +#include <bpf/bpf_helpers.h>
> > > > +#include <bpf/bpf_tracing.h>
> > > > +
> > > > +char _license[] SEC("license") = "GPL";
> > > > +
> > > > +/*
> > > > + * Start times are stored per-task, not per-cgroup, as multiple tasks in one
> > > > + * cgroup can perform reclaim concurrently.
> > > > + */
> > > > +struct {
> > > > +     __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> > > > +     __uint(map_flags, BPF_F_NO_PREALLOC);
> > > > +     __type(key, int);
> > > > +     __type(value, __u64);
> > > > +} vmscan_start_time SEC(".maps");
> > > > +
> > > > +struct vmscan_percpu {
> > > > +     /* Previous percpu state, to figure out if we have new updates */
> > > > +     __u64 prev;
> > > > +     /* Current percpu state */
> > > > +     __u64 state;
> > > > +};
> > > > +
> > > > +struct vmscan {
> > > > +     /* State propagated through children, pending aggregation */
> > > > +     __u64 pending;
> > > > +     /* Total state, including all cpus and all children */
> > > > +     __u64 state;
> > > > +};
> > > > +
> > > > +struct {
> > > > +     __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
> > > > +     __uint(max_entries, 10);
> > > > +     __type(key, __u64);
> > > > +     __type(value, struct vmscan_percpu);
> > > > +} pcpu_cgroup_vmscan_elapsed SEC(".maps");
> > > > +
> > > > +struct {
> > > > +     __uint(type, BPF_MAP_TYPE_HASH);
> > > > +     __uint(max_entries, 10);
> > > > +     __type(key, __u64);
> > > > +     __type(value, struct vmscan);
> > > > +} cgroup_vmscan_elapsed SEC(".maps");
> > > > +
> > > > +extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
> > > > +extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
> > > > +
> > > > +static inline struct cgroup *task_memcg(struct task_struct *task)
> > > > +{
> > > > +     return task->cgroups->subsys[memory_cgrp_id]->cgroup;
> > > > +}
> > > > +
> > > > +static inline uint64_t cgroup_id(struct cgroup *cgrp)
> > > > +{
> > > > +     return cgrp->kn->id;
> > > > +}
> > > > +
> > > > +static inline int create_vmscan_percpu_elem(__u64 cg_id, __u64 state)
> > > > +{
> > > > +     struct vmscan_percpu pcpu_init = {.state = state, .prev = 0};
> > > > +
> > > > +     if (bpf_map_update_elem(&pcpu_cgroup_vmscan_elapsed, &cg_id,
> > > > +                             &pcpu_init, BPF_NOEXIST)) {
> > > > +             bpf_printk("failed to create pcpu entry for cgroup %llu\n"
> > > > +                        , cg_id);
> > > > +             return 1;
> > > > +     }
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static inline int create_vmscan_elem(__u64 cg_id, __u64 state, __u64 pending)
> > > > +{
> > > > +     struct vmscan init = {.state = state, .pending = pending};
> > > > +
> > > > +     if (bpf_map_update_elem(&cgroup_vmscan_elapsed, &cg_id,
> > > > +                             &init, BPF_NOEXIST)) {
> > > > +             bpf_printk("failed to create entry for cgroup %llu\n"
> > > > +                        , cg_id);
> > > > +             return 1;
> > > > +     }
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +SEC("tp_btf/mm_vmscan_memcg_reclaim_begin")
> > > > +int BPF_PROG(vmscan_start, struct lruvec *lruvec, struct scan_control *sc)
> > > > +{
> > > > +     struct task_struct *task = bpf_get_current_task_btf();
> > > > +     __u64 *start_time_ptr;
> > > > +
> > > > +     start_time_ptr = bpf_task_storage_get(&vmscan_start_time, task, 0,
> > > > +                                       BPF_LOCAL_STORAGE_GET_F_CREATE);
> > > > +     if (!start_time_ptr) {
> > > > +             bpf_printk("error retrieving storage\n");
> > > > +             return 0;
> > > > +     }
> > > > +
> > > > +     *start_time_ptr = bpf_ktime_get_ns();
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
> > > > +int BPF_PROG(vmscan_end, struct lruvec *lruvec, struct scan_control *sc)
> > > > +{
> > > > +     struct vmscan_percpu *pcpu_stat;
> > > > +     struct task_struct *current = bpf_get_current_task_btf();
> > > > +     struct cgroup *cgrp;
> > > > +     __u64 *start_time_ptr;
> > > > +     __u64 current_elapsed, cg_id;
> > > > +     __u64 end_time = bpf_ktime_get_ns();
> > > > +
> > > > +     /*
> > > > +      * cgrp is the first parent cgroup of current that has memcg enabled in
> > > > +      * its subtree_control, or NULL if memcg is disabled in the entire tree.
> > > > +      * In a cgroup hierarchy like this:
> > > > +      *                               a
> > > > +      *                              / \
> > > > +      *                             b   c
> > > > +      *  If "a" has memcg enabled, while "b" doesn't, then processes in "b"
> > > > +      *  will accumulate their stats directly to "a". This makes sure that no
> > > > +      *  stats are lost from processes in leaf cgroups that don't have memcg
> > > > +      *  enabled, but only exposes stats for cgroups that have memcg enabled.
> > > > +      */
> > > > +     cgrp = task_memcg(current);
> > > > +     if (!cgrp)
> > > > +             return 0;
> > > > +
> > > > +     cg_id = cgroup_id(cgrp);
> > > > +     start_time_ptr = bpf_task_storage_get(&vmscan_start_time, current, 0,
> > > > +                                           BPF_LOCAL_STORAGE_GET_F_CREATE);
> > > > +     if (!start_time_ptr) {
> > > > +             bpf_printk("error retrieving storage local storage\n");
> > > > +             return 0;
> > > > +     }
> > > > +
> > > > +     current_elapsed = end_time - *start_time_ptr;
> > > > +     pcpu_stat = bpf_map_lookup_elem(&pcpu_cgroup_vmscan_elapsed,
> > > > +                                     &cg_id);
> > > > +     if (pcpu_stat)
> > > > +             __sync_fetch_and_add(&pcpu_stat->state, current_elapsed);
> > > > +     else
> > > > +             create_vmscan_percpu_elem(cg_id, current_elapsed);
> > > > +
> > > > +     cgroup_rstat_updated(cgrp, bpf_get_smp_processor_id());
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +SEC("fentry/bpf_rstat_flush")
> > > > +int BPF_PROG(vmscan_flush, struct cgroup *cgrp, struct cgroup *parent, int cpu)
> > > > +{
> > > > +     struct vmscan_percpu *pcpu_stat;
> > > > +     struct vmscan *total_stat, *parent_stat;
> > > > +     __u64 cg_id = cgroup_id(cgrp);
> > > > +     __u64 parent_cg_id = parent ? cgroup_id(parent) : 0;
> > > > +     __u64 *pcpu_vmscan;
> > > > +     __u64 state;
> > > > +     __u64 delta = 0;
> > > > +
> > > > +     /* Add CPU changes on this level since the last flush */
> > > > +     pcpu_stat = bpf_map_lookup_percpu_elem(&pcpu_cgroup_vmscan_elapsed,
> > > > +                                            &cg_id, cpu);
> > > > +     if (pcpu_stat) {
> > > > +             state = pcpu_stat->state;
> > > > +             delta += state - pcpu_stat->prev;
> > > > +             pcpu_stat->prev = state;
> > > > +     }
> > > > +
> > > > +     total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
> > > > +     if (!total_stat) {
> > > > +             create_vmscan_elem(cg_id, delta, 0);
> > > > +             goto update_parent;
> > > > +     }
> > > > +
> > > > +     /* Collect pending stats from subtree */
> > > > +     if (total_stat->pending) {
> > > > +             delta += total_stat->pending;
> > > > +             total_stat->pending = 0;
> > > > +     }
> > > > +
> > > > +     /* Propagate changes to this cgroup's total */
> > > > +     total_stat->state += delta;
> > > > +
> > > > +update_parent:
> > > > +     /* Skip if there are no changes to propagate, or no parent */
> > > > +     if (!delta || !parent_cg_id)
> > > > +             return 0;
> > > > +
> > > > +     /* Propagate changes to cgroup's parent */
> > > > +     parent_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed,
> > > > +                                       &parent_cg_id);
> > > > +     if (parent_stat)
> > > > +             parent_stat->pending += delta;
> > > > +     else
> > > > +             create_vmscan_elem(parent_cg_id, 0, delta);
> > > > +
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +SEC("iter.s/cgroup")
> > > > +int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp)
> > > > +{
> > > > +     struct seq_file *seq = meta->seq;
> > > > +     struct vmscan *total_stat;
> > > > +     __u64 cg_id;
> > > > +
> > > > +     /* Do nothing for the terminal call */
> > > > +     if (!cgrp)
> > > > +             return 1;
> > > > +
> > > > +     cg_id = cgroup_id(cgrp);
> > > > +
> > > > +     /* Flush the stats to make sure we get the most up-to-date numbers */
> > > > +     cgroup_rstat_flush(cgrp);
> > > > +
> > > > +     total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
> > > > +     if (!total_stat) {
> > > > +             bpf_printk("error finding stats for cgroup %llu\n", cg_id);
> > > > +             BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: -1\n",
> > > > +                            cg_id);
> > > > +             return 1;
> > > > +     }
> > > > +     BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
> > > > +                    cg_id, total_stat->state);
> > > > +
> > > > +     /*
> > > > +      * We only dump stats for one cgroup here, so return 1 to stop
> > > > +      * iteration after the first cgroup.
> > > > +      */
> > > > +     return 1;
> > > > +}
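
For context on how these pinned files are consumed: each read of a
pinned cgroup_iter link re-runs dump_vmscan, so a userspace reader only
needs ordinary file I/O. A minimal sketch, assuming the BPFFS_VMSCAN
prefix defined earlier in the test resolves to /sys/fs/bpf/vmscan/ (as
the cleanup logs below suggest) and a pinned file named "test":

	#include <stdio.h>

	int main(void)
	{
		char buf[128];
		/* Path is an assumption: BPFFS_VMSCAN + cgroup name */
		FILE *f = fopen("/sys/fs/bpf/vmscan/test", "r");

		if (!f)
			return 1;
		/* Reading invokes dump_vmscan, which flushes rstat first */
		if (fgets(buf, sizeof(buf), f))
			printf("%s", buf); /* cg_id: <id>, total_vmscan_delay: <ns> */
		fclose(f);
		return 0;
	}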

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-28  7:14       ` Yosry Ahmed
  2022-06-29  0:09         ` Yosry Ahmed
@ 2022-06-29  6:17         ` Yonghong Song
  1 sibling, 0 replies; 46+ messages in thread
From: Yonghong Song @ 2022-06-29  6:17 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups



On 6/28/22 12:14 AM, Yosry Ahmed wrote:
> On Mon, Jun 27, 2022 at 11:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song <yhs@fb.com> wrote:
>>>
>>>
>>>
>>> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
>>>> Add a selftest that tests the whole workflow for collecting,
>>>> aggregating (flushing), and displaying cgroup hierarchical stats.
>>>>
>>>> TL;DR:
>>>> - Whenever reclaim happens, vmscan_start and vmscan_end update
>>>>     per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
>>>>     have updates.
>>>> - When userspace tries to read the stats, vmscan_dump calls rstat to flush
>>>>     the stats, and outputs the stats in text format to userspace (similar
>>>>     to cgroupfs stats).
>>>> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
>>>>     updates, vmscan_flush aggregates cpu readings and propagates updates
>>>>     to parents.
>>>>
>>>> Detailed explanation:
>>>> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
>>>>     measure the latency of cgroup reclaim. Per-cgroup readings are stored in
>>>>     percpu maps for efficiency. When a cgroup reading is updated on a cpu,
>>>>     cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
>>>>     rstat updated tree on that cpu.
>>>>
>>>> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
>>>>     each cgroup. Reading this file invokes the program, which calls
>>>>     cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
>>>>     cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
>>>>     the stats are exposed to the user. vmscan_dump returns 1 to terminate
>>>>     iteration early, so that we only expose stats for one cgroup per read.
>>>>
>>>> - An ftrace program, vmscan_flush, is also loaded and attached to
>>>>     bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
>>>>     once for each (cgroup, cpu) pair that has updates. cgroups are popped
>>>>     from the rstat tree in a bottom-up fashion, so calls will always be
>>>>     made for cgroups that have updates before their parents. The program
>>>>     aggregates percpu readings to a total per-cgroup reading, and also
>>>>     propagates them to the parent cgroup. After rstat flushing is over, all
>>>>     cgroups will have correct updated hierarchical readings (including all
>>>>     cpus and all their descendants).
>>>>
>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>>>
>>> There is a selftest failure with this test:
>>>
>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
>>> get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
>>> get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
>>> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
>>> actual 0 <= expected 0
>>> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
>>> 781874 != expected 382092
>>> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
>>> -1 != expected -2
>>> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
>>> 781874 != expected 781873
>>> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
>>> expected 781874
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
>>> cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
>>> #33      cgroup_hierarchical_stats:FAIL
>>>
>>
>> The test is passing on my setup. I am trying to figure out if there is
>> something outside the setup done by the test that can cause the test
>> to fail.
>>
> 
> I can't reproduce the failure on my machine. It seems like for some
> reason reclaim is not invoked in one of the test cgroups which results
> in the expected stats not being there. I have a few suspicions as to
> what might cause this but I am not sure.
> 
> If you have the capacity, do you mind re-running the test with the
> attached diff1.patch? (and maybe diff2.patch if that fails, this will
> cause OOMs in the test cgroup, you might see some process killed
> warnings).

The patch doesn't help. Still failed.

get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading: 
actual 0 <= expected 0
check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual 
676612 != expected 339142
check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual 
-1 != expected -2
check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual 
676612 != expected 676611
check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 < 
expected 676612
destroy_progs:PASS:remove cgroup_iter pin 0 nsec
destroy_progs:PASS:remove cgroup_iter pin 0 nsec


> Thanks!
> 
> 
>>>
>>> An existing test also failed.
[...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-28  7:43       ` Yosry Ahmed
@ 2022-06-29  6:26         ` Yonghong Song
  2022-06-29  8:03           ` Yosry Ahmed
  2022-07-01 23:28           ` Hao Luo
  0 siblings, 2 replies; 46+ messages in thread
From: Yonghong Song @ 2022-06-29  6:26 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups



On 6/28/22 12:43 AM, Yosry Ahmed wrote:
> On Mon, Jun 27, 2022 at 11:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song <yhs@fb.com> wrote:
>>>
>>>
>>>
>>> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
>>>> Add a selftest that tests the whole workflow for collecting,
>>>> aggregating (flushing), and displaying cgroup hierarchical stats.
>>>>
>>>> TL;DR:
>>>> - Whenever reclaim happens, vmscan_start and vmscan_end update
>>>>     per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
>>>>     have updates.
>>>> - When userspace tries to read the stats, vmscan_dump calls rstat to flush
>>>>     the stats, and outputs the stats in text format to userspace (similar
>>>>     to cgroupfs stats).
>>>> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
>>>>     updates, vmscan_flush aggregates cpu readings and propagates updates
>>>>     to parents.
>>>>
>>>> Detailed explanation:
>>>> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
>>>>     measure the latency of cgroup reclaim. Per-cgroup readings are stored in
>>>>     percpu maps for efficiency. When a cgroup reading is updated on a cpu,
>>>>     cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
>>>>     rstat updated tree on that cpu.
>>>>
>>>> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
>>>>     each cgroup. Reading this file invokes the program, which calls
>>>>     cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
>>>>     cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
>>>>     the stats are exposed to the user. vmscan_dump returns 1 to terminate
>>>>     iteration early, so that we only expose stats for one cgroup per read.
>>>>
>>>> - An ftrace program, vmscan_flush, is also loaded and attached to
>>>>     bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
>>>>     once for each (cgroup, cpu) pair that has updates. cgroups are popped
>>>>     from the rstat tree in a bottom-up fashion, so calls will always be
>>>>     made for cgroups that have updates before their parents. The program
>>>>     aggregates percpu readings to a total per-cgroup reading, and also
>>>>     propagates them to the parent cgroup. After rstat flushing is over, all
>>>>     cgroups will have correct updated hierarchical readings (including all
>>>>     cpus and all their descendants).
>>>>
>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>>>
>>> There is a selftest failure with this test:
>>>
>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
>>> get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
>>> get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
>>> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
>>> actual 0 <= expected 0
>>> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
>>> 781874 != expected 382092
>>> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
>>> -1 != expected -2
>>> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
>>> 781874 != expected 781873
>>> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
>>> expected 781874
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>> destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
>>> cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
>>> #33      cgroup_hierarchical_stats:FAIL
>>>
>>
>> The test is passing on my setup. I am trying to figure out if there is
>> something outside the setup done by the test that can cause the test
>> to fail.
>>
>>>
>>> An existing test also failed.
>>>
>>> btf_dump_data:PASS:find type id 0 nsec
>>>
>>>
>>> btf_dump_data:PASS:failed/unexpected type_sz 0 nsec
>>>
>>>
>>> btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
>>> expected/actual match: actual '(union bpf_iter_link_info){.map =
>>> (struct){.map_fd = (__u32)1,},.cgroup '
>>> test_btf_dump_struct_data:PASS:find struct sk_buff 0 nsec
>>>
>>
>> Yeah I see what happened there. bpf_iter_link_info was changed by the
>> patch that introduced cgroup_iter, and this specific union is used by
>> the test to test the "union with nested struct" btf dumping. I will
>> add a patch in the next version that updates the btf_dump_data test
>> accordingly. Thanks.
>>
> 
> So I actually tried the attached diff to update the expected dump of
> bpf_iter_link_info in this test, but the test still failed:
> 
> btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
> expected/actual match: actual '(union bpf_iter_link_info){.map =
> (struct){.map_fd = (__u32)1,},.cgroup = (struct){.cgroup_fd =
> (__u32)1,},}'  != expected '(union bpf_iter_link_info){.map =
> (struct){.map_fd = (__u32)1,},.cgroup = (struct){.cgroup_fd =
> (__u32)1,.traversal_order = (__u32)1},}'
> 
> It seems to me that the actual output in this case is not right; it is
> missing traversal_order. Did we accidentally find a bug in btf dumping
> of unions with nested structs, or am I missing something here?

Probably there is an issue in the btf_dump_data() function in
tools/lib/bpf/btf_dump.c. Could you take a look at it?

> Thanks!
> 
>>>
>>> test_btf_dump_struct_data:PASS:unexpected return value dumping sk_buff 0
>>> nsec
>>>
>>> btf_dump_data:PASS:verify prefix match 0 nsec
>>>
>>>
>>> btf_dump_data:PASS:find type id 0 nsec
>>>
>>>
>>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
>>>
>>>
>>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
>>>
>>>
>>> btf_dump_data:PASS:verify prefix match 0 nsec
>>>
>>>
>>> btf_dump_data:PASS:find type id 0 nsec
>>>
>>>
>>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
>>>
>>>
>>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
>>>
>>>
>>> #21/14   btf_dump/btf_dump: struct_data:FAIL
>>>
>>> please take a look.
>>>
>>>> ---
>>>>    .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
>>>>    .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
>>>>    2 files changed, 585 insertions(+)
>>>>    create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
>>>>    create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
>>>>
[...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-29  0:09         ` Yosry Ahmed
@ 2022-06-29  6:48           ` Yonghong Song
  2022-06-29  8:04             ` Yosry Ahmed
  0 siblings, 1 reply; 46+ messages in thread
From: Yonghong Song @ 2022-06-29  6:48 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups



On 6/28/22 5:09 PM, Yosry Ahmed wrote:
> On Tue, Jun 28, 2022 at 12:14 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Mon, Jun 27, 2022 at 11:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>
>>> On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>>
>>>>
>>>> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
>>>>> Add a selftest that tests the whole workflow for collecting,
>>>>> aggregating (flushing), and displaying cgroup hierarchical stats.
>>>>>
>>>>> TL;DR:
>>>>> - Whenever reclaim happens, vmscan_start and vmscan_end update
>>>>>     per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
>>>>>     have updates.
>>>>> - When userspace tries to read the stats, vmscan_dump calls rstat to flush
>>>>>     the stats, and outputs the stats in text format to userspace (similar
>>>>>     to cgroupfs stats).
>>>>> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
>>>>>     updates, vmscan_flush aggregates cpu readings and propagates updates
>>>>>     to parents.
>>>>>
>>>>> Detailed explanation:
>>>>> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
>>>>>     measure the latency of cgroup reclaim. Per-cgroup readings are stored in
>>>>>     percpu maps for efficiency. When a cgroup reading is updated on a cpu,
>>>>>     cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
>>>>>     rstat updated tree on that cpu.
>>>>>
>>>>> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
>>>>>     each cgroup. Reading this file invokes the program, which calls
>>>>>     cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
>>>>>     cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
>>>>>     the stats are exposed to the user. vmscan_dump returns 1 to terminate
>>>>>     iteration early, so that we only expose stats for one cgroup per read.
>>>>>
>>>>> - An ftrace program, vmscan_flush, is also loaded and attached to
>>>>>     bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
>>>>>     once for each (cgroup, cpu) pair that has updates. cgroups are popped
>>>>>     from the rstat tree in a bottom-up fashion, so calls will always be
>>>>>     made for cgroups that have updates before their parents. The program
>>>>>     aggregates percpu readings to a total per-cgroup reading, and also
>>>>>     propagates them to the parent cgroup. After rstat flushing is over, all
>>>>>     cgroups will have correct updated hierarchical readings (including all
>>>>>     cpus and all their descendants).
>>>>>
>>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>>>>
>>>> There is a selftest failure with this test:
>>>>
>>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
>>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
>>>> get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
>>>> get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
>>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
>>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
>>>> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
>>>> actual 0 <= expected 0
>>>> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
>>>> 781874 != expected 382092
>>>> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
>>>> -1 != expected -2
>>>> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
>>>> 781874 != expected 781873
>>>> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
>>>> expected 781874
>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>> destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
>>>> cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
>>>> #33      cgroup_hierarchical_stats:FAIL
>>>>
>>>
>>> The test is passing on my setup. I am trying to figure out if there is
>>> something outside the setup done by the test that can cause the test
>>> to fail.
>>>
>>
>> I can't reproduce the failure on my machine. It seems like for some
>> reason reclaim is not invoked in one of the test cgroups which results
>> in the expected stats not being there. I have a few suspicions as to
>> what might cause this but I am not sure.
>>
>> If you have the capacity, do you mind re-running the test with the
>> attached diff1.patch? (and maybe diff2.patch if that fails, this will
>> cause OOMs in the test cgroup, you might see some process killed
>> warnings).
>> Thanks!
>>
> 
> In addition to that, it looks like one of the cgroups has a "0" stat
> which shouldn't happen unless one of the map update/lookup operations
> failed, which should log something using bpf_printk. I need to
> reproduce the test failure to investigate this properly. Did you
> observe this failure on your machine or in CI? Any instructions on how
> to reproduce or system setup?

I got "0" as well.

get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading: 
actual 0 <= expected 0
check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual 
676612 != expected 339142
check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual 
-1 != expected -2
check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual 
676612 != expected 676611
check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 < 
expected 676612

I don't have a special config. I am running on a qemu vm, similar to
the CI environment, but it may have a slightly different config.

The CI for this patch set won't work since the sleepable kfunc support
patch is not available. Once you have that patch, bpf CI should be able
to compile the patch set and run the tests.

> 
>>
>>>>
>>>> An existing test also failed.
>>>>
>>>> btf_dump_data:PASS:find type id 0 nsec
>>>>
>>>>
>>>> btf_dump_data:PASS:failed/unexpected type_sz 0 nsec
>>>>
>>>>
>>>> btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
>>>> expected/actual match: actual '(union bpf_iter_link_info){.map =
>>>> (struct){.map_fd = (__u32)1,},.cgroup '
>>>> test_btf_dump_struct_data:PASS:find struct sk_buff 0 nsec
>>>>
>>>
>>> Yeah I see what happened there. bpf_iter_link_info was changed by the
>>> patch that introduced cgroup_iter, and this specific union is used by
>>> the test to test the "union with nested struct" btf dumping. I will
>>> add a patch in the next version that updates the btf_dump_data test
>>> accordingly. Thanks.
>>>
>>>>
>>>> test_btf_dump_struct_data:PASS:unexpected return value dumping sk_buff 0
>>>> nsec
>>>>
>>>> btf_dump_data:PASS:verify prefix match 0 nsec
>>>>
>>>>
>>>> btf_dump_data:PASS:find type id 0 nsec
>>>>
>>>>
>>>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
>>>>
>>>>
>>>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
>>>>
>>>>
>>>> btf_dump_data:PASS:verify prefix match 0 nsec
>>>>
>>>>
>>>> btf_dump_data:PASS:find type id 0 nsec
>>>>
>>>>
>>>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
>>>>
>>>>
>>>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
>>>>
>>>>
>>>> #21/14   btf_dump/btf_dump: struct_data:FAIL
>>>>
>>>> please take a look.
>>>>
>>>>> ---
>>>>>    .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
>>>>>    .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
>>>>>    2 files changed, 585 insertions(+)
>>>>>    create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
>>>>>    create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
>>>>>
[...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-29  6:26         ` Yonghong Song
@ 2022-06-29  8:03           ` Yosry Ahmed
  2022-07-01 23:28           ` Hao Luo
  1 sibling, 0 replies; 46+ messages in thread
From: Yosry Ahmed @ 2022-06-29  8:03 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Tue, Jun 28, 2022 at 11:27 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 6/28/22 12:43 AM, Yosry Ahmed wrote:
> > On Mon, Jun 27, 2022 at 11:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>
> >> On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song <yhs@fb.com> wrote:
> >>>
> >>>
> >>>
> >>> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> >>>> Add a selftest that tests the whole workflow for collecting,
> >>>> aggregating (flushing), and displaying cgroup hierarchical stats.
> >>>>
> >>>> TL;DR:
> >>>> - Whenever reclaim happens, vmscan_start and vmscan_end update
> >>>>     per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
> >>>>     have updates.
> >>>> - When userspace tries to read the stats, vmscan_dump calls rstat to flush
> >>>>     the stats, and outputs the stats in text format to userspace (similar
> >>>>     to cgroupfs stats).
> >>>> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
> >>>>     updates, vmscan_flush aggregates cpu readings and propagates updates
> >>>>     to parents.
> >>>>
> >>>> Detailed explanation:
> >>>> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
> >>>>     measure the latency of cgroup reclaim. Per-cgroup readings are stored in
> >>>>     percpu maps for efficiency. When a cgroup reading is updated on a cpu,
> >>>>     cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
> >>>>     rstat updated tree on that cpu.
> >>>>
> >>>> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
> >>>>     each cgroup. Reading this file invokes the program, which calls
> >>>>     cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
> >>>>     cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
> >>>>     the stats are exposed to the user. vmscan_dump returns 1 to terminate
> >>>>     iteration early, so that we only expose stats for one cgroup per read.
> >>>>
> >>>> - An ftrace program, vmscan_flush, is also loaded and attached to
> >>>>     bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
> >>>>     once for each (cgroup, cpu) pair that has updates. cgroups are popped
> >>>>     from the rstat tree in a bottom-up fashion, so calls will always be
> >>>>     made for cgroups that have updates before their parents. The program
> >>>>     aggregates percpu readings to a total per-cgroup reading, and also
> >>>>     propagates them to the parent cgroup. After rstat flushing is over, all
> >>>>     cgroups will have correct updated hierarchical readings (including all
> >>>>     cpus and all their descendants).
> >>>>
> >>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >>>
> >>> There is a selftest failure with this test:
> >>>
> >>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
> >>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> >>> get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
> >>> get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
> >>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
> >>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> >>> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> >>> actual 0 <= expected 0
> >>> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
> >>> 781874 != expected 382092
> >>> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
> >>> -1 != expected -2
> >>> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
> >>> 781874 != expected 781873
> >>> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
> >>> expected 781874
> >>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>> destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
> >>> cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
> >>> #33      cgroup_hierarchical_stats:FAIL
> >>>
> >>
> >> The test is passing on my setup. I am trying to figure out if there is
> >> something outside the setup done by the test that can cause the test
> >> to fail.
> >>
> >>>
> >>> An existing test also failed.
> >>>
> >>> btf_dump_data:PASS:find type id 0 nsec
> >>>
> >>>
> >>> btf_dump_data:PASS:failed/unexpected type_sz 0 nsec
> >>>
> >>>
> >>> btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
> >>> expected/actual match: actual '(union bpf_iter_link_info){.map =
> >>> (struct){.map_fd = (__u32)1,},.cgroup '
> >>> test_btf_dump_struct_data:PASS:find struct sk_buff 0 nsec
> >>>
> >>
> >> Yeah I see what happened there. bpf_iter_link_info was changed by the
> >> patch that introduced cgroup_iter, and this specific union is used by
> >> the test to test the "union with nested struct" btf dumping. I will
> >> add a patch in the next version that updates the btf_dump_data test
> >> accordingly. Thanks.
> >>
> >
> > So I actually tried the attached diff to update the expected dump of
> > bpf_iter_link_info in this test, but the test still failed:
> >
> > btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
> > expected/actual match: actual '(union bpf_iter_link_info){.map =
> > (struct){.map_fd = (__u32)1,},.cgroup = (struct){.cgroup_fd =
> > (__u32)1,},}'  != expected '(union bpf_iter_link_info){.map =
> > (struct){.map_fd = (__u32)1,},.cgroup = (struct){.cgroup_fd =
> > (__u32)1,.traversal_order = (__u32)1},}'
> >
> > It seems to me that the actual output in this case is not right; it is
> > missing traversal_order. Did we accidentally find a bug in btf dumping
> > of unions with nested structs, or am I missing something here?
>
> Probably there is an issue in the btf_dump_data() function in
> tools/lib/bpf/btf_dump.c. Could you take a look at it?

I will try to take a look, but only after I figure out why the selftest
added here always passes for me and always fails for you :(

>
> > Thanks!
> >
> >>>
> >>> test_btf_dump_struct_data:PASS:unexpected return value dumping sk_buff 0
> >>> nsec
> >>>
> >>> btf_dump_data:PASS:verify prefix match 0 nsec
> >>>
> >>>
> >>> btf_dump_data:PASS:find type id 0 nsec
> >>>
> >>>
> >>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
> >>>
> >>>
> >>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
> >>>
> >>>
> >>> btf_dump_data:PASS:verify prefix match 0 nsec
> >>>
> >>>
> >>> btf_dump_data:PASS:find type id 0 nsec
> >>>
> >>>
> >>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
> >>>
> >>>
> >>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
> >>>
> >>>
> >>> #21/14   btf_dump/btf_dump: struct_data:FAIL
> >>>
> >>> please take a look.
> >>>
> >>>> ---
> >>>>    .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
> >>>>    .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
> >>>>    2 files changed, 585 insertions(+)
> >>>>    create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> >>>>    create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> >>>>
> [...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-29  6:48           ` Yonghong Song
@ 2022-06-29  8:04             ` Yosry Ahmed
  2022-07-02  0:55               ` Yonghong Song
  0 siblings, 1 reply; 46+ messages in thread
From: Yosry Ahmed @ 2022-06-29  8:04 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Tue, Jun 28, 2022 at 11:48 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 6/28/22 5:09 PM, Yosry Ahmed wrote:
> > On Tue, Jun 28, 2022 at 12:14 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>
> >> On Mon, Jun 27, 2022 at 11:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>>
> >>> On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> >>>>> Add a selftest that tests the whole workflow for collecting,
> >>>>> aggregating (flushing), and displaying cgroup hierarchical stats.
> >>>>>
> >>>>> TL;DR:
> >>>>> - Whenever reclaim happens, vmscan_start and vmscan_end update
> >>>>>     per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
> >>>>>     have updates.
> >>>>> - When userspace tries to read the stats, vmscan_dump calls rstat to flush
> >>>>>     the stats, and outputs the stats in text format to userspace (similar
> >>>>>     to cgroupfs stats).
> >>>>> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
> >>>>>     updates, vmscan_flush aggregates cpu readings and propagates updates
> >>>>>     to parents.
> >>>>>
> >>>>> Detailed explanation:
> >>>>> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
> >>>>>     measure the latency of cgroup reclaim. Per-cgroup readings are stored in
> >>>>>     percpu maps for efficiency. When a cgroup reading is updated on a cpu,
> >>>>>     cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
> >>>>>     rstat updated tree on that cpu.
> >>>>>
> >>>>> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
> >>>>>     each cgroup. Reading this file invokes the program, which calls
> >>>>>     cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
> >>>>>     cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
> >>>>>     the stats are exposed to the user. vmscan_dump returns 1 to terminate
> >>>>>     iteration early, so that we only expose stats for one cgroup per read.
> >>>>>
> >>>>> - An ftrace program, vmscan_flush, is also loaded and attached to
> >>>>>     bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
> >>>>>     once for each (cgroup, cpu) pair that has updates. cgroups are popped
> >>>>>     from the rstat tree in a bottom-up fashion, so calls will always be
> >>>>>     made for cgroups that have updates before their parents. The program
> >>>>>     aggregates percpu readings to a total per-cgroup reading, and also
> >>>>>     propagates them to the parent cgroup. After rstat flushing is over, all
> >>>>>     cgroups will have correct updated hierarchical readings (including all
> >>>>>     cpus and all their descendants).
> >>>>>
> >>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >>>>
> >>>> There is a selftest failure with this test:
> >>>>
> >>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
> >>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> >>>> get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
> >>>> get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
> >>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
> >>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> >>>> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> >>>> actual 0 <= expected 0
> >>>> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
> >>>> 781874 != expected 382092
> >>>> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
> >>>> -1 != expected -2
> >>>> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
> >>>> 781874 != expected 781873
> >>>> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
> >>>> expected 781874
> >>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>> destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
> >>>> cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
> >>>> #33      cgroup_hierarchical_stats:FAIL
> >>>>
> >>>
> >>> The test is passing on my setup. I am trying to figure out if there is
> >>> something outside the setup done by the test that can cause the test
> >>> to fail.
> >>>
> >>
> >> I can't reproduce the failure on my machine. It seems like for some
> >> reason reclaim is not invoked in one of the test cgroups which results
> >> in the expected stats not being there. I have a few suspicions as to
> >> what might cause this but I am not sure.
> >>
> >> If you have the capacity, do you mind re-running the test with the
> >> attached diff1.patch? (and maybe diff2.patch if that fails, this will
> >> cause OOMs in the test cgroup, you might see some process killed
> >> warnings).
> >> Thanks!
> >>
> >
> > In addition to that, it looks like one of the cgroups has a "0" stat
> > which shouldn't happen unless one of the map update/lookup operations
> > failed, which should log something using bpf_printk. I need to
> > reproduce the test failure to investigate this properly. Did you
> > observe this failure on your machine or in CI? Any instructions on how
> > to reproduce or system setup?
>
> I got "0" as well.
>
> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> actual 0 <= expected 0
> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
> 676612 != expected 339142
> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
> -1 != expected -2
> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
> 676612 != expected 676611
> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
> expected 676612
>
> I don't have special config. I am running on qemu vm, similar to
> ci environment but may have a slightly different config.
>
> The CI for this patch set won't work since the sleepable kfunc support
> patch is not available. Once you have that patch, bpf CI should be able
> to compile the patch set and run the tests.
>

I will include this patch in the next version anyway, but I am trying
to find out why this selftest is failing for you before I send it out.
I am trying to reproduce the problem but no luck so far.

> >
> >>
> >>>>
> >>>> An existing test also failed.
> >>>>
> >>>> btf_dump_data:PASS:find type id 0 nsec
> >>>>
> >>>>
> >>>> btf_dump_data:PASS:failed/unexpected type_sz 0 nsec
> >>>>
> >>>>
> >>>> btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
> >>>> expected/actual match: actual '(union bpf_iter_link_info){.map =
> >>>> (struct){.map_fd = (__u32)1,},.cgroup '
> >>>> test_btf_dump_struct_data:PASS:find struct sk_buff 0 nsec
> >>>>
> >>>
> >>> Yeah I see what happened there. bpf_iter_link_info was changed by the
> >>> patch that introduced cgroup_iter, and this specific union is used by
> >>> the test to test the "union with nested struct" btf dumping. I will
> >>> add a patch in the next version that updates the btf_dump_data test
> >>> accordingly. Thanks.
> >>>
> >>>>
> >>>> test_btf_dump_struct_data:PASS:unexpected return value dumping sk_buff 0
> >>>> nsec
> >>>>
> >>>> btf_dump_data:PASS:verify prefix match 0 nsec
> >>>>
> >>>>
> >>>> btf_dump_data:PASS:find type id 0 nsec
> >>>>
> >>>>
> >>>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
> >>>>
> >>>>
> >>>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
> >>>>
> >>>>
> >>>> btf_dump_data:PASS:verify prefix match 0 nsec
> >>>>
> >>>>
> >>>> btf_dump_data:PASS:find type id 0 nsec
> >>>>
> >>>>
> >>>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
> >>>>
> >>>>
> >>>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
> >>>>
> >>>>
> >>>> #21/14   btf_dump/btf_dump: struct_data:FAIL
> >>>>
> >>>> please take a look.
> >>>>
> >>>>> ---
> >>>>>    .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
> >>>>>    .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
> >>>>>    2 files changed, 585 insertions(+)
> >>>>>    create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> >>>>>    create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> >>>>>
> [...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-29  6:26         ` Yonghong Song
  2022-06-29  8:03           ` Yosry Ahmed
@ 2022-07-01 23:28           ` Hao Luo
  1 sibling, 0 replies; 46+ messages in thread
From: Hao Luo @ 2022-07-01 23:28 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan,
	Michal Hocko, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, Cgroups

On Tue, Jun 28, 2022 at 11:27 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 6/28/22 12:43 AM, Yosry Ahmed wrote:
> > On Mon, Jun 27, 2022 at 11:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>
> >> On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song <yhs@fb.com> wrote:
[...]
> >>> btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
> >>> expected/actual match: actual '(union bpf_iter_link_info){.map =
> >>> (struct){.map_fd = (__u32)1,},.cgroup '
> >>> test_btf_dump_struct_data:PASS:find struct sk_buff 0 nsec
> >>>
> >>
> >> Yeah I see what happened there. bpf_iter_link_info was changed by the
> >> patch that introduced cgroup_iter, and this specific union is used by
> >> the test to test the "union with nested struct" btf dumping. I will
> >> add a patch in the next version that updates the btf_dump_data test
> >> accordingly. Thanks.
> >>
> >
> > So I actually tried the attached diff to update the expected dump of
> > bpf_iter_link_info in this test, but the test still failed:
> >
> > btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
> > expected/actual match: actual '(union bpf_iter_link_info){.map =
> > (struct){.map_fd = (__u32)1,},.cgroup = (struct){.cgroup_fd =
> > (__u32)1,},}'  != expected '(union bpf_iter_link_info){.map =
> > (struct){.map_fd = (__u32)1,},.cgroup = (struct){.cgroup_fd =
> > (__u32)1,.traversal_order = (__u32)1},}'
> >
> > It seems to me that the actual output in this case is not right; it is
> > missing traversal_order. Did we accidentally find a bug in btf dumping
> > of unions with nested structs, or am I missing something here?
>
> Probably there is an issue in the btf_dump_data() function in
> tools/lib/bpf/btf_dump.c. Could you take a look at it?
>

Regarding this btf_dump_data failure, the cause seems to be the
following:

I added a new struct to 'union bpf_iter_link_info' in this patch
series, which expanded bpf_iter_link_info's size from 32 bits to 64
bits. However, the test still initializes the union through the old,
smaller struct, which leaves a temporary stack variable (of type
bpf_iter_link_info) only partially initialized. If I instead initialize
the union through the larger new struct, btf_dump_data outputs the
correct content and the test passes.

Yosry, we need to fold this fix into the patch that introduced the
changes to bpf_iter_link_info, so that it won't break the test.

I haven't dug into why btf_dump_data() fails on a partially
initialized union. I need to look at the get_cgroup_vmscan_delay
selftest in this patch now.

Hao
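
A minimal sketch of the partial initialization described above. The
member names follow the dump output quoted earlier in the thread; the
exact layout is illustrative, not a copy of the uapi header:

	#include <string.h>

	typedef unsigned int __u32;

	/* .map predates the series; .cgroup is the new, larger member
	 * added by the cgroup_iter patch and now sets the union size. */
	union bpf_iter_link_info {
		struct {
			__u32 map_fd;
		} map;					/* 4 bytes */
		struct {
			__u32 cgroup_fd;
			__u32 traversal_order;
		} cgroup;				/* 8 bytes */
	};

	void example(void)
	{
		/* Old-style init names only the 4-byte member, so the
		 * union's trailing bytes (cgroup.traversal_order) may be
		 * left unspecified -- the garbage the dump test hit. */
		union bpf_iter_link_info partial = { .map = { .map_fd = 1 } };
		(void)partial;

		/* Zeroing first (or initializing through the larger
		 * member) defines every byte the dumper will read. */
		union bpf_iter_link_info full;

		memset(&full, 0, sizeof(full));
		full.cgroup.cgroup_fd = 1;
		full.cgroup.traversal_order = 1;
		(void)full;
	}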

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-06-29  8:04             ` Yosry Ahmed
@ 2022-07-02  0:55               ` Yonghong Song
  2022-07-06 21:29                 ` Yosry Ahmed
  0 siblings, 1 reply; 46+ messages in thread
From: Yonghong Song @ 2022-07-02  0:55 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups



On 6/29/22 1:04 AM, Yosry Ahmed wrote:
> On Tue, Jun 28, 2022 at 11:48 PM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 6/28/22 5:09 PM, Yosry Ahmed wrote:
>>> On Tue, Jun 28, 2022 at 12:14 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>
>>>> On Mon, Jun 27, 2022 at 11:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>>>
>>>>> On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song <yhs@fb.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
>>>>>>> Add a selftest that tests the whole workflow for collecting,
>>>>>>> aggregating (flushing), and displaying cgroup hierarchical stats.
>>>>>>>
>>>>>>> TL;DR:
>>>>>>> - Whenever reclaim happens, vmscan_start and vmscan_end update
>>>>>>>      per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
>>>>>>>      have updates.
>>>>>>> - When userspace tries to read the stats, vmscan_dump calls rstat to flush
>>>>>>>      the stats, and outputs the stats in text format to userspace (similar
>>>>>>>      to cgroupfs stats).
>>>>>>> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
>>>>>>>      updates, vmscan_flush aggregates cpu readings and propagates updates
>>>>>>>      to parents.
>>>>>>>
>>>>>>> Detailed explanation:
>>>>>>> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
>>>>>>>      measure the latency of cgroup reclaim. Per-cgroup readings are stored in
>>>>>>>      percpu maps for efficiency. When a cgroup reading is updated on a cpu,
>>>>>>>      cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
>>>>>>>      rstat updated tree on that cpu.
>>>>>>>
>>>>>>> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
>>>>>>>      each cgroup. Reading this file invokes the program, which calls
>>>>>>>      cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
>>>>>>>      cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
>>>>>>>      the stats are exposed to the user. vmscan_dump returns 1 to terminate
>>>>>>>      iteration early, so that we only expose stats for one cgroup per read.
>>>>>>>
>>>>>>> - An ftrace program, vmscan_flush, is also loaded and attached to
>>>>>>>      bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
>>>>>>>      once for each (cgroup, cpu) pair that has updates. cgroups are popped
>>>>>>>      from the rstat tree in a bottom-up fashion, so calls will always be
>>>>>>>      made for cgroups that have updates before their parents. The program
>>>>>>>      aggregates percpu readings to a total per-cgroup reading, and also
>>>>>>>      propagates them to the parent cgroup. After rstat flushing is over, all
>>>>>>>      cgroups will have correct updated hierarchical readings (including all
>>>>>>>      cpus and all their descendants).
>>>>>>>
>>>>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>>>>>>
>>>>>> There is a selftest failure with this test:
>>>>>>
>>>>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
>>>>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
>>>>>> get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
>>>>>> get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
>>>>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
>>>>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
>>>>>> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
>>>>>> actual 0 <= expected 0
>>>>>> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
>>>>>> 781874 != expected 382092
>>>>>> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
>>>>>> -1 != expected -2
>>>>>> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
>>>>>> 781874 != expected 781873
>>>>>> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
>>>>>> expected 781874
>>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
>>>>>> destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
>>>>>> cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
>>>>>> #33      cgroup_hierarchical_stats:FAIL
>>>>>>
>>>>>
>>>>> The test is passing on my setup. I am trying to figure out if there is
>>>>> something outside the setup done by the test that can cause the test
>>>>> to fail.
>>>>>
>>>>
>>>> I can't reproduce the failure on my machine. It seems like for some
>>>> reason reclaim is not invoked in one of the test cgroups which results
>>>> in the expected stats not being there. I have a few suspicions as to
>>>> what might cause this but I am not sure.
>>>>
>>>> If you have the capacity, do you mind re-running the test with the
>>>> attached diff1.patch? (and maybe diff2.patch if that fails, this will
>>>> cause OOMs in the test cgroup, you might see some process killed
>>>> warnings).
>>>> Thanks!
>>>>
>>>
>>> In addition to that, it looks like one of the cgroups has a "0" stat
>>> which shouldn't happen unless one of the map update/lookup operations
>>> failed, which should log something using bpf_printk. I need to
>>> reproduce the test failure to investigate this properly. Did you
>>> observe this failure on your machine or in CI? Any instructions on how
>>> to reproduce or system setup?
>>
>> I got "0" as well.
>>
>> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
>> actual 0 <= expected 0
>> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
>> 676612 != expected 339142
>> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
>> -1 != expected -2
>> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
>> 676612 != expected 676611
>> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
>> expected 676612
>>
>> I don't have a special config. I am running on a qemu vm, similar to
>> the ci environment but maybe with a slightly different config.
>>
>> The CI for this patch set won't work since the sleepable kfunc support
>> patch is not available. Once you have that patch, bpf CI should be able
>> to compile the patch set and run the tests.
>>
> 
> I will include this patch in the next version anyway, but I am trying
> to find out why this selftest is failing for you before I send it out.
> I am trying to reproduce the problem but no luck so far.

I debugged this a little bit and found that these two programs

SEC("tp_btf/mm_vmscan_memcg_reclaim_begin")
int BPF_PROG(vmscan_start, struct lruvec *lruvec, struct scan_control *sc)

and

SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
int BPF_PROG(vmscan_end, struct lruvec *lruvec, struct scan_control *sc)

are not triggered.

I do have CONFIG_MEMCG enabled in my config file:
...
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_KMEM=y
...

Maybe when cgroup_rstat_flush() is called, some code path won't trigger
mm_vmscan_memcg_reclaim_begin/end()?

> 
>>>
>>>>
>>>>>>
>>>>>> Also an existing test also failed.
>>>>>>
>>>>>> btf_dump_data:PASS:find type id 0 nsec
>>>>>>
>>>>>>
>>>>>> btf_dump_data:PASS:failed/unexpected type_sz 0 nsec
>>>>>>
>>>>>>
>>>>>> btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
>>>>>> expected/actual match: actual '(union bpf_iter_link_info){.map =
>>>>>> (struct){.map_fd = (__u32)1,},.cgroup '
>>>>>> test_btf_dump_struct_data:PASS:find struct sk_buff 0 nsec
>>>>>>
>>>>>
>>>>> Yeah I see what happened there. bpf_iter_link_info was changed by the
>>>>> patch that introduced cgroup_iter, and this specific union is used by
>>>>> the test to test the "union with nested struct" btf dumping. I will
>>>>> add a patch in the next version that updates the btf_dump_data test
>>>>> accordingly. Thanks.
>>>>>
>>>>>>
>>>>>> test_btf_dump_struct_data:PASS:unexpected return value dumping sk_buff 0
>>>>>> nsec
>>>>>>
>>>>>> btf_dump_data:PASS:verify prefix match 0 nsec
>>>>>>
>>>>>>
>>>>>> btf_dump_data:PASS:find type id 0 nsec
>>>>>>
>>>>>>
>>>>>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
>>>>>>
>>>>>>
>>>>>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
>>>>>>
>>>>>>
>>>>>> btf_dump_data:PASS:verify prefix match 0 nsec
>>>>>>
>>>>>>
>>>>>> btf_dump_data:PASS:find type id 0 nsec
>>>>>>
>>>>>>
>>>>>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
>>>>>>
>>>>>>
>>>>>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
>>>>>>
>>>>>>
>>>>>> #21/14   btf_dump/btf_dump: struct_data:FAIL
>>>>>>
>>>>>> please take a look.
>>>>>>
>>>>>>> ---
>>>>>>>     .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
>>>>>>>     .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
>>>>>>>     2 files changed, 585 insertions(+)
>>>>>>>     create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
>>>>>>>     create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
>>>>>>>
>> [...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection
  2022-07-02  0:55               ` Yonghong Song
@ 2022-07-06 21:29                 ` Yosry Ahmed
  0 siblings, 0 replies; 46+ messages in thread
From: Yosry Ahmed @ 2022-07-06 21:29 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, Hao Luo,
	Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan, Michal Hocko,
	Roman Gushchin, David Rientjes, Stanislav Fomichev, Greg Thelen,
	Shakeel Butt, Linux Kernel Mailing List, Networking, bpf,
	Cgroups

On Fri, Jul 1, 2022 at 5:55 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 6/29/22 1:04 AM, Yosry Ahmed wrote:
> > On Tue, Jun 28, 2022 at 11:48 PM Yonghong Song <yhs@fb.com> wrote:
> >>
> >>
> >>
> >> On 6/28/22 5:09 PM, Yosry Ahmed wrote:
> >>> On Tue, Jun 28, 2022 at 12:14 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>>>
> >>>> On Mon, Jun 27, 2022 at 11:47 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>>>>
> >>>>> On Mon, Jun 27, 2022 at 11:14 PM Yonghong Song <yhs@fb.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> >>>>>>> Add a selftest that tests the whole workflow for collecting,
> >>>>>>> aggregating (flushing), and displaying cgroup hierarchical stats.
> >>>>>>>
> >>>>>>> TL;DR:
> >>>>>>> - Whenever reclaim happens, vmscan_start and vmscan_end update
> >>>>>>>      per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
> >>>>>>>      have updates.
> >>>>>>> - When userspace tries to read the stats, vmscan_dump calls rstat to flush
> >>>>>>>      the stats, and outputs the stats in text format to userspace (similar
> >>>>>>>      to cgroupfs stats).
> >>>>>>> - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
> >>>>>>>      updates, vmscan_flush aggregates cpu readings and propagates updates
> >>>>>>>      to parents.
> >>>>>>>
> >>>>>>> Detailed explanation:
> >>>>>>> - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
> >>>>>>>      measure the latency of cgroup reclaim. Per-cgroup readings are stored in
> >>>>>>>      percpu maps for efficiency. When a cgroup reading is updated on a cpu,
> >>>>>>>      cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
> >>>>>>>      rstat updated tree on that cpu.
> >>>>>>>
> >>>>>>> - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
> >>>>>>>      each cgroup. Reading this file invokes the program, which calls
> >>>>>>>      cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
> >>>>>>>      cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
> >>>>>>>      the stats are exposed to the user. vmscan_dump returns 1 to terminate
> >>>>>>>      iteration early, so that we only expose stats for one cgroup per read.
> >>>>>>>
> >>>>>>> - An ftrace program, vmscan_flush, is also loaded and attached to
> >>>>>>>      bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
> >>>>>>>      once for each (cgroup, cpu) pair that has updates. cgroups are popped
> >>>>>>>      from the rstat tree in a bottom-up fashion, so calls will always be
> >>>>>>>      made for cgroups that have updates before their parents. The program
> >>>>>>>      aggregates percpu readings to a total per-cgroup reading, and also
> >>>>>>>      propagates them to the parent cgroup. After rstat flushing is over, all
> >>>>>>>      cgroups will have correct updated hierarchical readings (including all
> >>>>>>>      cpus and all their descendants).
> >>>>>>>
> >>>>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >>>>>>
> >>>>>> There is a selftest failure with this test:
> >>>>>>
> >>>>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
> >>>>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> >>>>>> get_cgroup_vmscan_delay:PASS:vmscan_reading 0 nsec
> >>>>>> get_cgroup_vmscan_delay:PASS:read cgroup_iter 0 nsec
> >>>>>> get_cgroup_vmscan_delay:PASS:output format 0 nsec
> >>>>>> get_cgroup_vmscan_delay:PASS:cgroup_id 0 nsec
> >>>>>> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> >>>>>> actual 0 <= expected 0
> >>>>>> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
> >>>>>> 781874 != expected 382092
> >>>>>> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
> >>>>>> -1 != expected -2
> >>>>>> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
> >>>>>> 781874 != expected 781873
> >>>>>> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
> >>>>>> expected 781874
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter pin 0 nsec
> >>>>>> destroy_progs:PASS:remove cgroup_iter root pin 0 nsec
> >>>>>> cleanup_bpffs:PASS:rmdir /sys/fs/bpf/vmscan/ 0 nsec
> >>>>>> #33      cgroup_hierarchical_stats:FAIL
> >>>>>>
> >>>>>
> >>>>> The test is passing on my setup. I am trying to figure out if there is
> >>>>> something outside the setup done by the test that can cause the test
> >>>>> to fail.
> >>>>>
> >>>>
> >>>> I can't reproduce the failure on my machine. It seems like for some
> >>>> reason reclaim is not invoked in one of the test cgroups which results
> >>>> in the expected stats not being there. I have a few suspicions as to
> >>>> what might cause this but I am not sure.
> >>>>
> >>>> If you have the capacity, do you mind re-running the test with the
> >>>> attached diff1.patch? (and maybe diff2.patch if that fails, this will
> >>>> cause OOMs in the test cgroup, you might see some process killed
> >>>> warnings).
> >>>> Thanks!
> >>>>
> >>>
> >>> In addition to that, it looks like one of the cgroups has a "0" stat
> >>> which shouldn't happen unless one of the map update/lookup operations
> >>> failed, which should log something using bpf_printk. I need to
> >>> reproduce the test failure to investigate this properly. Did you
> >>> observe this failure on your machine or in CI? Any instructions on how
> >>> to reproduce or system setup?
> >>
> >> I got "0" as well.
> >>
> >> get_cgroup_vmscan_delay:FAIL:vmscan_reading unexpected vmscan_reading:
> >> actual 0 <= expected 0
> >> check_vmscan_stats:FAIL:child1_vmscan unexpected child1_vmscan: actual
> >> 676612 != expected 339142
> >> check_vmscan_stats:FAIL:child2_vmscan unexpected child2_vmscan: actual
> >> -1 != expected -2
> >> check_vmscan_stats:FAIL:test_vmscan unexpected test_vmscan: actual
> >> 676612 != expected 676611
> >> check_vmscan_stats:FAIL:root_vmscan unexpected root_vmscan: actual 0 <
> >> expected 676612
> >>
> >> I don't have a special config. I am running on a qemu vm, similar to
> >> the ci environment but maybe with a slightly different config.
> >>
> >> The CI for this patch set won't work since the sleepable kfunc support
> >> patch is not available. Once you have that patch, bpf CI should be able
> >> to compile the patch set and run the tests.
> >>
> >
> > I will include this patch in the next version anyway, but I am trying
> > to find out why this selftest is failing for you before I send it out.
> > I am trying to reproduce the problem but no luck so far.
>
> I debugged this a little bit and found that these two programs
>
> SEC("tp_btf/mm_vmscan_memcg_reclaim_begin")
> int BPF_PROG(vmscan_start, struct lruvec *lruvec, struct scan_control *sc)
>
> and
>
> SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
> int BPF_PROG(vmscan_end, struct lruvec *lruvec, struct scan_control *sc)
>
> are not triggered.

Thanks so much for doing this. I am still failing to reproduce the
problem, so this is very useful. I believe that if those programs are
not triggered at all, then we are not walking the memcg reclaim path.
That shouldn't happen, since we set memory.high to a limit and then
allocate more memory, which should trigger memcg reclaim.

I am looking at the code now, and there are some conditions that will
cause memory.high not to invoke reclaim (at least not synchronously).
Did you try diff2.patch attached in the previous email? It changes the
test to use memory.max instead of memory.high; this will cause an OOM
kill of the test child process, but it provides a stronger guarantee
that reclaim happens and that we hit
mm_vmscan_memcg_reclaim_begin/end(). If diff2.patch works, is it okay
to keep it? Is it okay to have some test processes OOM-killed during
testing?
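
For reference, the gist of the memory.max approach is roughly the
following (an illustrative sketch, not the actual diff2.patch; the
cgroup path and limit value are made up):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Write a hard limit to memory.max so that allocating past it forces
   * synchronous memcg reclaim (and, if reclaim cannot free enough, an
   * OOM kill), instead of relying on memory.high's softer semantics.
   */
  static int set_memory_max(const char *cgroup_path, const char *limit)
  {
          char path[256];
          ssize_t ret;
          int fd;

          snprintf(path, sizeof(path), "%s/memory.max", cgroup_path);
          fd = open(path, O_WRONLY);
          if (fd < 0)
                  return -1;
          ret = write(fd, limit, strlen(limit));
          close(fd);
          return ret < 0 ? -1 : 0;
  }

  /* e.g. set_memory_max("/sys/fs/cgroup/test", "1M"), then allocate
   * more than 1M in a process inside that cgroup.
   */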

>
> I do have CONFIG_MEMCG enabled in my config file:
> ...
> CONFIG_MEMCG=y
> CONFIG_MEMCG_SWAP=y
> CONFIG_MEMCG_KMEM=y
> ...
>
> Maybe when cgroup_rstat_flush() is called, some code path won't trigger
> mm_vmscan_memcg_reclaim_begin/end()?
>

cgroup_rstat_flush() should be completely separate in this regard, and
should not affect the code path that triggers
mm_vmscan_memcg_reclaim_begin/end().

> >
> >>>
> >>>>
> >>>>>>
> >>>>>> Also an existing test also failed.
> >>>>>>
> >>>>>> btf_dump_data:PASS:find type id 0 nsec
> >>>>>>
> >>>>>>
> >>>>>> btf_dump_data:PASS:failed/unexpected type_sz 0 nsec
> >>>>>>
> >>>>>>
> >>>>>> btf_dump_data:FAIL:ensure expected/actual match unexpected ensure
> >>>>>> expected/actual match: actual '(union bpf_iter_link_info){.map =
> >>>>>> (struct){.map_fd = (__u32)1,},.cgroup '
> >>>>>> test_btf_dump_struct_data:PASS:find struct sk_buff 0 nsec
> >>>>>>
> >>>>>
> >>>>> Yeah I see what happened there. bpf_iter_link_info was changed by the
> >>>>> patch that introduced cgroup_iter, and this specific union is used by
> >>>>> the test to test the "union with nested struct" btf dumping. I will
> >>>>> add a patch in the next version that updates the btf_dump_data test
> >>>>> accordingly. Thanks.
> >>>>>
> >>>>>>
> >>>>>> test_btf_dump_struct_data:PASS:unexpected return value dumping sk_buff 0
> >>>>>> nsec
> >>>>>>
> >>>>>> btf_dump_data:PASS:verify prefix match 0 nsec
> >>>>>>
> >>>>>>
> >>>>>> btf_dump_data:PASS:find type id 0 nsec
> >>>>>>
> >>>>>>
> >>>>>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
> >>>>>>
> >>>>>>
> >>>>>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
> >>>>>>
> >>>>>>
> >>>>>> btf_dump_data:PASS:verify prefix match 0 nsec
> >>>>>>
> >>>>>>
> >>>>>> btf_dump_data:PASS:find type id 0 nsec
> >>>>>>
> >>>>>>
> >>>>>> btf_dump_data:PASS:failed to return -E2BIG 0 nsec
> >>>>>>
> >>>>>>
> >>>>>> btf_dump_data:PASS:ensure expected/actual match 0 nsec
> >>>>>>
> >>>>>>
> >>>>>> #21/14   btf_dump/btf_dump: struct_data:FAIL
> >>>>>>
> >>>>>> please take a look.
> >>>>>>
> >>>>>>> ---
> >>>>>>>     .../prog_tests/cgroup_hierarchical_stats.c    | 351 ++++++++++++++++++
> >>>>>>>     .../bpf/progs/cgroup_hierarchical_stats.c     | 234 ++++++++++++
> >>>>>>>     2 files changed, 585 insertions(+)
> >>>>>>>     create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
> >>>>>>>     create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
> >>>>>>>
> >> [...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
  2022-06-28  4:09   ` Yonghong Song
  2022-06-28  6:06     ` Yosry Ahmed
@ 2022-07-07 23:33     ` Hao Luo
  1 sibling, 0 replies; 46+ messages in thread
From: Hao Luo @ 2022-07-07 23:33 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan,
	Michal Hocko, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

On Mon, Jun 27, 2022 at 9:09 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> > From: Hao Luo <haoluo@google.com>
> >
> > Cgroup_iter is a type of bpf_iter. It walks over cgroups in two modes:
> >
> >   - walking a cgroup's descendants.
> >   - walking a cgroup's ancestors.
>
> The implementation has another choice, BPF_ITER_CGROUP_PARENT_UP.
> We should add it here as well.
>

Sorry about the late reply. I meant to describe two modes: walking up
and walking down, with two sub-modes when walking down: pre-order and
post-order.

But it seems this is confusing. I'm going to use three modes in the
next version: UP, PRE and POST.
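
For illustration, the uapi side might end up looking roughly like this
(a sketch with tentative names; the final spelling will be whatever
lands in the next version):

  /* Tentative cgroup_iter traversal modes. */
  enum bpf_iter_cgroup_traversal_order {
          BPF_ITER_CGROUP_UP,     /* walk ancestors, ending at the root */
          BPF_ITER_CGROUP_PRE,    /* walk descendants, pre-order */
          BPF_ITER_CGROUP_POST,   /* walk descendants, post-order */
  };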

Besides, since this patch modifies bpf_iter_link_info, which breaks
the btf_dump selftest, I need to include the selftest fix in this
patch.

> >
> > When attaching cgroup_iter, one can set a cgroup to the iter_link
> > created from attaching. This cgroup is passed as a file descriptor and
> > serves as the starting point of the walk. If no cgroup is specified,
> > the starting point will be the root cgroup.
> >
> > For walking descendants, one can specify the order: either pre-order or
> > post-order. For walking ancestors, the walk starts at the specified
> > cgroup and ends at the root.
> >
> > One can also terminate the walk early by returning 1 from the iter
> > program.
> >
> > Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
> > program is called with cgroup_mutex held.
>
> Overall looks good to me with a few nits below.
>
> Acked-by: Yonghong Song <yhs@fb.com>
>
> >
> > Signed-off-by: Hao Luo <haoluo@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
[...]
> > +
> > +/* cgroup_iter provides two modes of traversal to the cgroup hierarchy.
> > + *
> > + *  1. Walk the descendants of a cgroup.
> > + *  2. Walk the ancestors of a cgroup.
>
> three modes here?
>

Same here. Will use 'three modes' in the next version.

> > + *
[...]
> > +
> > +static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
> > +                               union bpf_iter_link_info *linfo,
> > +                               struct bpf_iter_aux_info *aux)
> > +{
> > +     int fd = linfo->cgroup.cgroup_fd;
> > +     struct cgroup *cgrp;
> > +
> > +     if (fd)
> > +             cgrp = cgroup_get_from_fd(fd);
> > +     else /* walk the entire hierarchy by default. */
> > +             cgrp = cgroup_get_from_path("/");
> > +
> > +     if (IS_ERR(cgrp))
> > +             return PTR_ERR(cgrp);
> > +
> > +     aux->cgroup.start = cgrp;
> > +     aux->cgroup.order = linfo->cgroup.traversal_order;
>
> The legality of traversal_order should be checked.
>

Sounds good. Will do.
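
Something along these lines, presumably (a sketch only; the mode
constants assume the tentative three-mode names from this thread):

  int order = linfo->cgroup.traversal_order;

  /* Reject unknown traversal orders before taking a reference on
   * the starting cgroup, so there is nothing to unwind on failure.
   */
  if (order != BPF_ITER_CGROUP_UP &&
      order != BPF_ITER_CGROUP_PRE &&
      order != BPF_ITER_CGROUP_POST)
          return -EINVAL;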

> > +     return 0;
> > +}
> > +
[...]
> > +static void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
> > +                                     struct seq_file *seq)
> > +{
> > +     char *buf;
> > +
> > +     buf = kzalloc(PATH_MAX, GFP_KERNEL);
> > +     if (!buf) {
> > +             seq_puts(seq, "cgroup_path:\n");
>
> This is a really unlikely case. Maybe "cgroup_path:<unknown>"?
>

Sure, no problem. This is a path that is unlikely to be hit.
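
So the error branch would become something like this trivial sketch:

  buf = kzalloc(PATH_MAX, GFP_KERNEL);
  if (!buf) {
          seq_puts(seq, "cgroup_path:<unknown>\n");
          goto show_order;
  }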

> > +             goto show_order;
[...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
  2022-06-28  4:14   ` Yonghong Song
  2022-06-28  6:03     ` Yosry Ahmed
@ 2022-07-07 23:36     ` Hao Luo
  1 sibling, 0 replies; 46+ messages in thread
From: Hao Luo @ 2022-07-07 23:36 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan,
	Michal Hocko, Roman Gushchin, David Rientjes, Stanislav Fomichev,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

On Mon, Jun 27, 2022 at 9:14 PM Yonghong Song <yhs@fb.com> wrote:
>
> > +static int __cgroup_iter_seq_show(struct seq_file *seq,
> > +                               struct cgroup_subsys_state *css, int in_stop)
> > +{
> > +     struct cgroup_iter_priv *p = seq->private;
> > +     struct bpf_iter__cgroup ctx;
> > +     struct bpf_iter_meta meta;
> > +     struct bpf_prog *prog;
> > +     int ret = 0;
> > +
> > +     /* cgroup is dead, skip this element */
> > +     if (css && cgroup_is_dead(css->cgroup))
> > +             return 0;
> > +
> > +     ctx.meta = &meta;
> > +     ctx.cgroup = css ? css->cgroup : NULL;
> > +     meta.seq = seq;
> > +     prog = bpf_iter_get_info(&meta, in_stop);
> > +     if (prog)
> > +             ret = bpf_iter_run_prog(prog, &ctx);
>
> Do we need to do anything special to ensure the bpf program gets
> up-to-date stats from ctx.cgroup?
>

Let's leave that to be handled by the bpf program. If rstat is being
used, the rstat_flush kfunc can be called to sync the stats.
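
For instance, an iter program that wants fresh rstat-backed stats
could do something like the following (a sketch; the section name and
kfunc declaration are my guesses at how the pieces of this series fit
together):

  /* The flush kfunc added by this series, declared for bpf. */
  extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;

  SEC("iter.s/cgroup")    /* sleepable, since flushing may sleep */
  int dump_stats(struct bpf_iter__cgroup *ctx)
  {
          struct cgroup *cgrp = ctx->cgroup;

          if (!cgrp)
                  return 0;

          /* Propagate pending per-cpu updates for this subtree. */
          cgroup_rstat_flush(cgrp);

          /* ... look up and print the now-synced stats ... */
          return 0;
  }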

> > +
> > +     /* if prog returns > 0, terminate after this element. */
> > +     if (ret != 0)
> > +             p->terminate = true;
> > +
> > +     return 0;
> > +}
> > +
> [...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2022-07-07 23:36 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-10 19:44 [PATCH bpf-next v2 0/8] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
2022-06-10 19:44 ` [PATCH bpf-next v2 1/8] cgroup: enable cgroup_get_from_file() on cgroup1 Yosry Ahmed
2022-06-10 19:44 ` [PATCH bpf-next v2 2/8] cgroup: Add cgroup_put() in !CONFIG_CGROUPS case Yosry Ahmed
2022-06-10 19:44 ` [PATCH bpf-next v2 3/8] bpf, iter: Fix the condition on p when calling stop Yosry Ahmed
2022-06-20 18:48   ` Yonghong Song
2022-06-21  7:25     ` Hao Luo
2022-06-24 17:46       ` Yonghong Song
2022-06-24 18:23         ` Yosry Ahmed
2022-06-10 19:44 ` [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter Yosry Ahmed
2022-06-11  6:23   ` kernel test robot
2022-06-11  7:34   ` kernel test robot
2022-06-11 12:44   ` kernel test robot
2022-06-11 12:55   ` kernel test robot
2022-06-28  4:09   ` Yonghong Song
2022-06-28  6:06     ` Yosry Ahmed
2022-07-07 23:33     ` Hao Luo
2022-06-28  4:14   ` Yonghong Song
2022-06-28  6:03     ` Yosry Ahmed
2022-06-28 17:03       ` Yonghong Song
2022-07-07 23:36     ` Hao Luo
2022-06-10 19:44 ` [PATCH bpf-next v2 5/8] selftests/bpf: Test cgroup_iter Yosry Ahmed
2022-06-28  6:11   ` Yonghong Song
2022-06-10 19:44 ` [PATCH bpf-next v2 6/8] cgroup: bpf: enable bpf programs to integrate with rstat Yosry Ahmed
2022-06-10 20:52   ` kernel test robot
2022-06-10 21:22   ` kernel test robot
2022-06-10 21:30     ` Yosry Ahmed
2022-06-11 19:57       ` Alexei Starovoitov
2022-06-13 17:05         ` Yosry Ahmed
2022-06-11 10:22   ` kernel test robot
2022-06-28  6:12   ` Yonghong Song
2022-06-10 19:44 ` [PATCH bpf-next v2 7/8] selftests/bpf: extend cgroup helpers Yosry Ahmed
2022-06-10 19:44 ` [PATCH bpf-next v2 8/8] bpf: add a selftest for cgroup hierarchical stats collection Yosry Ahmed
2022-06-28  6:14   ` Yonghong Song
2022-06-28  6:47     ` Yosry Ahmed
2022-06-28  7:14       ` Yosry Ahmed
2022-06-29  0:09         ` Yosry Ahmed
2022-06-29  6:48           ` Yonghong Song
2022-06-29  8:04             ` Yosry Ahmed
2022-07-02  0:55               ` Yonghong Song
2022-07-06 21:29                 ` Yosry Ahmed
2022-06-29  6:17         ` Yonghong Song
2022-06-28  7:43       ` Yosry Ahmed
2022-06-29  6:26         ` Yonghong Song
2022-06-29  8:03           ` Yosry Ahmed
2022-07-01 23:28           ` Hao Luo
2022-06-10 19:48 ` [PATCH bpf-next v2 0/8] bpf: rstat: cgroup hierarchical stats Yosry Ahmed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).