linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats
@ 2022-05-15  2:34 Yosry Ahmed
  2022-05-15  2:34 ` [RFC PATCH bpf-next v2 1/7] bpf: introduce RSTAT_FLUSH program type Yosry Ahmed
                   ` (7 more replies)
  0 siblings, 8 replies; 12+ messages in thread
From: Yosry Ahmed @ 2022-05-15  2:34 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

This patch series allows for using bpf to collect hierarchical cgroup
stats efficiently by integrating with the rstat framework. The rstat
framework provides an efficient way to collect cgroup stats and
propagate them through the cgroup hierarchy.

* Background on rstat (I am using a subscriber analogy that is not
commonly used):

The rstat framework maintains a tree of cgroups that have updates and
which cpus have updates. A subscriber to the rstat framework maintains
their own stats. The framework is used to tell the subscriber when
and what to flush, for the most efficient stats propagation. The
workflow is as follows:

- When a subscriber updates a cgroup on a cpu, it informs the rstat
  framework by calling cgroup_rstat_updated(cgrp, cpu).

- When a subscriber wants to read some stats for a cgroup, it asks
  the rstat framework to initiate a stats flush (propagation) by calling
  cgroup_rstat_flush(cgrp).

- When the rstat framework initiates a flush, it makes callbacks to
  subscribers to aggregate stats on cpus that have updates, and
  propagate updates to their parent.

Currently, the main subscribers to the rstat framework are cgroup
subsystems (e.g. memory, block). This patch series allow bpf programs to
become subscribers as well.

The first three patches introduce a new bpf program type, RSTAT_FLUSH,
which is a callback that rstat makes to bpf when a stats flush is
ongoing.

The fourth patch adds bpf_cgroup_rstat_updated() and
bpf_cgroup_rstat_flush() helpers, to allow bpf stat collectors and
readers to communicate with rstat.

The fifth patch is actually v2 of a previously submitted patch [1]
by Hao Luo. We agreed that it fits better as a part of this series. It
introduces cgroup_iter programs that can dump stats for cgroups to
userspace.
v1 - > v2:
- Getting the cgroup's reference at the time at attaching, instead of
  at the time when iterating. (Yonghong) (context [1])
- Remove .init_seq_private and .fini_seq_private callbacks for
  cgroup_iter. They are not needed now. (Yonghong)

The sixth patch extends bpf selftests cgroup helpers, as necessary for
the following patch.

The seventh patch is a selftest that demonstrates the entire workflow.
It includes programs that collect, aggregate, and dump per-cgroup stats
by fully integrating with the rstat framework.

[1]https://lore.kernel.org/lkml/20220225234339.2386398-9-haoluo@google.com/

RFC v1 -> RFC v2:
- Instead of rstat flush programs attach to subsystems, they now attach
  to rstat (global flushers, not per-subsystem), based on discussions
  with Tejun. The first patch is entirely rewritten.
- Pass cgroup pointers to rstat flushers instead of cgroup ids. This is
  much more flexibility and less likely to need a uapi update later.
- rstat helpers are now only defined if CGROUP_CONFIG.
- Most of the code is now only defined if CGROUP_CONFIG and
  CONFIG_BPF_SYSCALL.
- Move rstat helper protos from bpf_base_func_proto() to
  tracing_prog_func_proto().
- rstat helpers argument (cgroup pointer) is now ARG_PTR_TO_BTF_ID, not
  ARG_ANYTHING.
- Rewrote the selftest to use the cgroup helpers.
- Dropped bpf_map_lookup_percpu_elem (already added by Feng).
- Dropped patch to support cgroup v1 for cgroup_iter.
- Dropped patch to define some cgroup_put() when !CONFIG_CGROUP. The
  code that calls it is no longer compiled when !CONFIG_CGROUP.


Hao Luo (1):
  bpf: Introduce cgroup iter

Yosry Ahmed (6):
  bpf: introduce RSTAT_FLUSH program type
  cgroup: bpf: flush bpf stats on rstat flush
  libbpf: Add support for rstat flush progs
  bpf: add bpf rstat helpers
  selftests/bpf: extend cgroup helpers
  bpf: add a selftest for cgroup hierarchical stats collection

 include/linux/bpf-rstat.h                     |  31 ++
 include/linux/bpf.h                           |   4 +
 include/linux/bpf_types.h                     |   4 +
 include/uapi/linux/bpf.h                      |  33 ++
 kernel/bpf/Makefile                           |   3 +
 kernel/bpf/cgroup_iter.c                      | 148 ++++++++
 kernel/bpf/helpers.c                          |  30 ++
 kernel/bpf/rstat.c                            | 187 ++++++++++
 kernel/bpf/syscall.c                          |   6 +
 kernel/cgroup/rstat.c                         |   2 +
 kernel/trace/bpf_trace.c                      |   4 +
 scripts/bpf_doc.py                            |   2 +
 tools/include/uapi/linux/bpf.h                |  33 ++
 tools/lib/bpf/bpf.c                           |   1 -
 tools/lib/bpf/libbpf.c                        |  40 +++
 tools/lib/bpf/libbpf.h                        |   3 +
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/testing/selftests/bpf/cgroup_helpers.c  | 158 +++++---
 tools/testing/selftests/bpf/cgroup_helpers.h  |  14 +-
 .../test_cgroup_hierarchical_stats.c          | 339 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../selftests/bpf/progs/cgroup_vmscan.c       | 222 ++++++++++++
 22 files changed, 1225 insertions(+), 47 deletions(-)
 create mode 100644 include/linux/bpf-rstat.h
 create mode 100644 kernel/bpf/cgroup_iter.c
 create mode 100644 kernel/bpf/rstat.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_vmscan.c

-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH bpf-next v2 1/7] bpf: introduce RSTAT_FLUSH program type
  2022-05-15  2:34 [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
@ 2022-05-15  2:34 ` Yosry Ahmed
  2022-05-15  2:34 ` [RFC PATCH bpf-next v2 2/7] cgroup: bpf: flush bpf stats on rstat flush Yosry Ahmed
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Yosry Ahmed @ 2022-05-15  2:34 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

This patch introduces a new bpf program type, RSTAT_FLUSH,
with new corresponding link and attach types.

These programs acts as a callback for the rstat framework to call when a
stats flush is ongoing. It allows BPF programs to collect and maintain
hierarchical stats cgroup stats efficiently by integrating with the rstat
framework.

See the selftest in the final patch for a practical example.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/bpf-rstat.h      |  25 +++++
 include/linux/bpf_types.h      |   4 +
 include/uapi/linux/bpf.h       |   9 ++
 kernel/bpf/Makefile            |   3 +
 kernel/bpf/rstat.c             | 166 +++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |   6 ++
 tools/include/uapi/linux/bpf.h |   9 ++
 7 files changed, 222 insertions(+)
 create mode 100644 include/linux/bpf-rstat.h
 create mode 100644 kernel/bpf/rstat.c

diff --git a/include/linux/bpf-rstat.h b/include/linux/bpf-rstat.h
new file mode 100644
index 000000000000..23cad23b5fc2
--- /dev/null
+++ b/include/linux/bpf-rstat.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2022 Google LLC.
+ */
+#ifndef _BPF_RSTAT_H_
+#define _BPF_RSTAT_H_
+
+#include <linux/bpf.h>
+
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_CGROUPS)
+
+int bpf_rstat_link_attach(const union bpf_attr *attr,
+				 struct bpf_prog *prog);
+
+#else /* defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_CGROUPS) */
+
+static inline int bpf_rstat_link_attach(const union bpf_attr *attr,
+					struct bpf_prog *prog)
+{
+	return -ENOTSUPP;
+}
+
+#endif /* defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_CGROUPS) */
+
+#endif  /* _BPF_RSTAT */
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b9112b80171..ff92299f76a9 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -77,6 +77,10 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LSM, lsm,
 	       void *, void *)
 #endif /* CONFIG_BPF_LSM */
 #endif
+#ifdef CONFIG_CGROUPS
+BPF_PROG_TYPE(BPF_PROG_TYPE_RSTAT_FLUSH, rstat_flush,
+	      struct bpf_rstat_flush_ctx, struct bpf_rstat_flush_ctx)
+#endif /* CONFIG_CGROUPS */
 BPF_PROG_TYPE(BPF_PROG_TYPE_SYSCALL, bpf_syscall,
 	      void *, void *)
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0210f85131b3..968e3cb02580 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -952,6 +952,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_RSTAT_FLUSH,
 };
 
 enum bpf_attach_type {
@@ -998,6 +999,7 @@ enum bpf_attach_type {
 	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
 	BPF_PERF_EVENT,
 	BPF_TRACE_KPROBE_MULTI,
+	BPF_RSTAT_FLUSH,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1014,6 +1016,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_PERF_EVENT = 7,
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
 	BPF_LINK_TYPE_STRUCT_OPS = 9,
+	BPF_LINK_TYPE_RSTAT = 10,
 
 	MAX_BPF_LINK_TYPE,
 };
@@ -6359,6 +6362,12 @@ struct bpf_cgroup_dev_ctx {
 	__u32 minor;
 };
 
+struct bpf_rstat_flush_ctx {
+	__bpf_md_ptr(struct cgroup *, cgrp);
+	__bpf_md_ptr(struct cgroup *, parent);
+	__s32 cpu;
+};
+
 struct bpf_raw_tracepoint_args {
 	__u64 args[0];
 };
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 057ba8e01e70..0487133b799f 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -36,6 +36,9 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_struct_ops.o
 obj-${CONFIG_BPF_LSM} += bpf_lsm.o
 endif
 obj-$(CONFIG_BPF_PRELOAD) += preload/
+ifeq ($(CONFIG_CGROUPS),y)
+obj-$(CONFIG_BPF_SYSCALL) += rstat.o
+endif
 
 obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
 $(obj)/relo_core.o: $(srctree)/tools/lib/bpf/relo_core.c FORCE
diff --git a/kernel/bpf/rstat.c b/kernel/bpf/rstat.c
new file mode 100644
index 000000000000..5f529002d4b9
--- /dev/null
+++ b/kernel/bpf/rstat.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+
+#include <linux/bpf-rstat.h>
+#include <linux/btf_ids.h>
+#include <linux/cgroup.h>
+#include <linux/filter.h>
+
+static LIST_HEAD(bpf_rstat_flushers);
+static DEFINE_SPINLOCK(bpf_rstat_flushers_lock);
+
+
+struct bpf_rstat_flusher {
+	struct bpf_prog *prog;
+	/* List of BPF rtstat flushers, anchored at subsys->bpf */
+	struct list_head list;
+};
+
+struct bpf_rstat_link {
+	struct bpf_link link;
+	struct bpf_rstat_flusher *flusher;
+};
+
+static int bpf_rstat_flush_attach(struct bpf_prog *prog,
+				  struct bpf_rstat_link *rlink)
+{
+	struct bpf_rstat_flusher *flusher;
+
+	flusher = kmalloc(sizeof(*flusher), GFP_KERNEL);
+	if (!flusher)
+		return -ENOMEM;
+
+	flusher->prog = prog;
+	rlink->flusher = flusher;
+
+	spin_lock(&bpf_rstat_flushers_lock);
+	list_add(&flusher->list, &bpf_rstat_flushers);
+	spin_unlock(&bpf_rstat_flushers_lock);
+
+	return 0;
+}
+
+static void bpf_rstat_flush_detach(struct bpf_rstat_link *rstat_link)
+{
+	struct bpf_rstat_flusher *flusher = rstat_link->flusher;
+
+	if (!flusher)
+		return;
+
+	spin_lock(&bpf_rstat_flushers_lock);
+	list_del(&flusher->list);
+	bpf_prog_put(flusher->prog);
+	kfree(flusher);
+	spin_unlock(&bpf_rstat_flushers_lock);
+}
+
+static const struct bpf_func_proto *
+bpf_rstat_flush_func_proto(enum bpf_func_id func_id,
+			   const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id);
+}
+
+BTF_ID_LIST_SINGLE(bpf_cgroup_btf_ids, struct, cgroup)
+
+static bool bpf_rstat_flush_is_valid_access(int off, int size,
+					    enum bpf_access_type type,
+					    const struct bpf_prog *prog,
+					    struct bpf_insn_access_aux *info)
+{
+	if (type == BPF_WRITE)
+		return false;
+
+	if (off < 0 || off + size > sizeof(struct bpf_rstat_flush_ctx))
+		return false;
+	/* The verifier guarantees that size > 0 */
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range_ptr(struct bpf_rstat_flush_ctx, cgrp):
+		info->reg_type = PTR_TO_BTF_ID;
+		info->btf_id = bpf_cgroup_btf_ids[0];
+		info->btf = bpf_get_btf_vmlinux();
+		return !IS_ERR(info->btf) && info->btf && size == sizeof(__u64);
+	case bpf_ctx_range_ptr(struct bpf_rstat_flush_ctx, parent):
+		info->reg_type = PTR_TO_BTF_ID_OR_NULL;
+		info->btf_id = bpf_cgroup_btf_ids[0];
+		info->btf = bpf_get_btf_vmlinux();
+		return !IS_ERR(info->btf) && info->btf && size == sizeof(__u64);
+	case bpf_ctx_range(struct bpf_rstat_flush_ctx, cpu):
+		return size == sizeof(__s32);
+	default:
+		return false;
+	}
+}
+
+const struct bpf_prog_ops rstat_flush_prog_ops = {
+};
+
+const struct bpf_verifier_ops rstat_flush_verifier_ops = {
+	.get_func_proto         = bpf_rstat_flush_func_proto,
+	.is_valid_access        = bpf_rstat_flush_is_valid_access,
+};
+
+static void bpf_rstat_link_release(struct bpf_link *link)
+{
+	struct bpf_rstat_link *rlink;
+
+	rlink = container_of(link,
+			     struct bpf_rstat_link,
+			     link);
+
+	/* rstat flushers are currently the only supported rstat programs */
+	bpf_rstat_flush_detach(rlink);
+}
+
+static void bpf_rstat_link_dealloc(struct bpf_link *link)
+{
+	struct bpf_rstat_link *rlink = container_of(link,
+						    struct bpf_rstat_link,
+						    link);
+	kfree(rlink);
+}
+
+static const struct bpf_link_ops bpf_rstat_link_lops = {
+	.release = bpf_rstat_link_release,
+	.dealloc = bpf_rstat_link_dealloc,
+};
+
+int bpf_rstat_link_attach(const union bpf_attr *attr,
+			  struct bpf_prog *prog)
+{
+	struct bpf_link_primer link_primer;
+	struct bpf_rstat_link *link;
+	int err;
+
+	if (attr->link_create.target_fd || attr->link_create.flags)
+		return -EINVAL;
+
+	link = kzalloc(sizeof(*link), GFP_USER);
+	if (!link)
+		return -ENOMEM;
+
+	bpf_link_init(&link->link, BPF_LINK_TYPE_RSTAT,
+		      &bpf_rstat_link_lops, prog);
+
+	err = bpf_link_prime(&link->link, &link_primer);
+	if (err) {
+		kfree(link);
+		return err;
+	}
+
+	/* rstat flushers are currently the only supported rstat programs */
+	err = bpf_rstat_flush_attach(prog, link);
+	if (err) {
+		bpf_link_cleanup(&link_primer);
+		return err;
+	}
+
+	return bpf_link_settle(&link_primer);
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 72e53489165d..ffeed8379b35 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3,6 +3,7 @@
  */
 #include <linux/bpf.h>
 #include <linux/bpf-cgroup.h>
+#include <linux/bpf-rstat.h>
 #include <linux/bpf_trace.h>
 #include <linux/bpf_lirc.h>
 #include <linux/bpf_verifier.h>
@@ -3416,6 +3417,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_SK_LOOKUP;
 	case BPF_XDP:
 		return BPF_PROG_TYPE_XDP;
+	case BPF_RSTAT_FLUSH:
+		return BPF_PROG_TYPE_RSTAT_FLUSH;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
@@ -4564,6 +4567,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 		else
 			ret = bpf_kprobe_multi_link_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_RSTAT_FLUSH:
+		ret = bpf_rstat_link_attach(attr, prog);
+		break;
 	default:
 		ret = -EINVAL;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0210f85131b3..968e3cb02580 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -952,6 +952,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_RSTAT_FLUSH,
 };
 
 enum bpf_attach_type {
@@ -998,6 +999,7 @@ enum bpf_attach_type {
 	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
 	BPF_PERF_EVENT,
 	BPF_TRACE_KPROBE_MULTI,
+	BPF_RSTAT_FLUSH,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1014,6 +1016,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_PERF_EVENT = 7,
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
 	BPF_LINK_TYPE_STRUCT_OPS = 9,
+	BPF_LINK_TYPE_RSTAT = 10,
 
 	MAX_BPF_LINK_TYPE,
 };
@@ -6359,6 +6362,12 @@ struct bpf_cgroup_dev_ctx {
 	__u32 minor;
 };
 
+struct bpf_rstat_flush_ctx {
+	__bpf_md_ptr(struct cgroup *, cgrp);
+	__bpf_md_ptr(struct cgroup *, parent);
+	__s32 cpu;
+};
+
 struct bpf_raw_tracepoint_args {
 	__u64 args[0];
 };
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH bpf-next v2 2/7] cgroup: bpf: flush bpf stats on rstat flush
  2022-05-15  2:34 [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
  2022-05-15  2:34 ` [RFC PATCH bpf-next v2 1/7] bpf: introduce RSTAT_FLUSH program type Yosry Ahmed
@ 2022-05-15  2:34 ` Yosry Ahmed
  2022-05-17  2:08   ` Alexei Starovoitov
  2022-05-15  2:35 ` [RFC PATCH bpf-next v2 3/7] libbpf: Add support for rstat flush progs Yosry Ahmed
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 12+ messages in thread
From: Yosry Ahmed @ 2022-05-15  2:34 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

When an rstat flush is ongoing for a cgroup, also flush bpf stats by
running any attached rstat flush programs.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/bpf-rstat.h |  6 ++++++
 kernel/bpf/rstat.c        | 21 +++++++++++++++++++++
 kernel/cgroup/rstat.c     |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/include/linux/bpf-rstat.h b/include/linux/bpf-rstat.h
index 23cad23b5fc2..55e000fe0f47 100644
--- a/include/linux/bpf-rstat.h
+++ b/include/linux/bpf-rstat.h
@@ -12,6 +12,8 @@
 int bpf_rstat_link_attach(const union bpf_attr *attr,
 				 struct bpf_prog *prog);
 
+void bpf_rstat_flush(struct cgroup *cgrp, int cpu);
+
 #else /* defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_CGROUPS) */
 
 static inline int bpf_rstat_link_attach(const union bpf_attr *attr,
@@ -20,6 +22,10 @@ static inline int bpf_rstat_link_attach(const union bpf_attr *attr,
 	return -ENOTSUPP;
 }
 
+static inline void bpf_rstat_flush(struct cgroup *cgrp, int cpu)
+{
+}
+
 #endif /* defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_CGROUPS) */
 
 #endif  /* _BPF_RSTAT */
diff --git a/kernel/bpf/rstat.c b/kernel/bpf/rstat.c
index 5f529002d4b9..e96bc080f4b9 100644
--- a/kernel/bpf/rstat.c
+++ b/kernel/bpf/rstat.c
@@ -164,3 +164,24 @@ int bpf_rstat_link_attach(const union bpf_attr *attr,
 
 	return bpf_link_settle(&link_primer);
 }
+
+void bpf_rstat_flush(struct cgroup *cgrp, int cpu)
+{
+	struct bpf_rstat_flusher *flusher;
+	struct bpf_rstat_flush_ctx ctx = {
+		.cgrp = cgrp,
+		.parent = cgroup_parent(cgrp),
+		.cpu = cpu,
+	};
+
+	rcu_read_lock();
+	migrate_disable();
+	spin_lock(&bpf_rstat_flushers_lock);
+
+	list_for_each_entry(flusher, &bpf_rstat_flushers, list)
+		(void) bpf_prog_run(flusher->prog, &ctx);
+
+	spin_unlock(&bpf_rstat_flushers_lock);
+	migrate_enable();
+	rcu_read_unlock();
+}
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 24b5c2ab5598..0285d496e807 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -2,6 +2,7 @@
 #include "cgroup-internal.h"
 
 #include <linux/sched/cputime.h>
+#include <linux/bpf-rstat.h>
 
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
@@ -168,6 +169,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 			struct cgroup_subsys_state *css;
 
 			cgroup_base_stat_flush(pos, cpu);
+			bpf_rstat_flush(pos, cpu);
 
 			rcu_read_lock();
 			list_for_each_entry_rcu(css, &pos->rstat_css_list,
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH bpf-next v2 3/7] libbpf: Add support for rstat flush progs
  2022-05-15  2:34 [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
  2022-05-15  2:34 ` [RFC PATCH bpf-next v2 1/7] bpf: introduce RSTAT_FLUSH program type Yosry Ahmed
  2022-05-15  2:34 ` [RFC PATCH bpf-next v2 2/7] cgroup: bpf: flush bpf stats on rstat flush Yosry Ahmed
@ 2022-05-15  2:35 ` Yosry Ahmed
  2022-05-15  9:07   ` Yosry Ahmed
  2022-05-15  2:35 ` [RFC PATCH bpf-next v2 4/7] bpf: add bpf rstat helpers Yosry Ahmed
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 12+ messages in thread
From: Yosry Ahmed @ 2022-05-15  2:35 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Add support to attach RSTAT_FLUSH programs.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 tools/lib/bpf/bpf.c      |  1 -
 tools/lib/bpf/libbpf.c   | 40 ++++++++++++++++++++++++++++++++++++++++
 tools/lib/bpf/libbpf.h   |  3 +++
 tools/lib/bpf/libbpf.map |  1 +
 4 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 5660268e103f..9e3cb0d1eb99 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -870,7 +870,6 @@ int bpf_link_create(int prog_fd, int target_fd,
 		attr.link_create.tracing.cookie = OPTS_GET(opts, tracing.cookie, 0);
 		if (!OPTS_ZEROED(opts, tracing))
 			return libbpf_err(-EINVAL);
-		break;
 	default:
 		if (!OPTS_ZEROED(opts, flags))
 			return libbpf_err(-EINVAL);
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 4867a930628b..b7fc64ebf8dd 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8998,6 +8998,7 @@ static int attach_trace(const struct bpf_program *prog, long cookie, struct bpf_
 static int attach_kprobe_multi(const struct bpf_program *prog, long cookie, struct bpf_link **link);
 static int attach_lsm(const struct bpf_program *prog, long cookie, struct bpf_link **link);
 static int attach_iter(const struct bpf_program *prog, long cookie, struct bpf_link **link);
+static int attach_rstat(const struct bpf_program *prog, long cookie, struct bpf_link **link);
 
 static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("socket",		SOCKET_FILTER, 0, SEC_NONE | SEC_SLOPPY_PFX),
@@ -9078,6 +9079,7 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("cgroup/setsockopt",	CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
 	SEC_DEF("struct_ops+",		STRUCT_OPS, 0, SEC_NONE),
 	SEC_DEF("sk_lookup",		SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
+	SEC_DEF("rstat/flush",		RSTAT_FLUSH, 0, SEC_NONE, attach_rstat),
 };
 
 static size_t custom_sec_def_cnt;
@@ -11784,6 +11786,44 @@ static int attach_iter(const struct bpf_program *prog, long cookie, struct bpf_l
 	return libbpf_get_error(*link);
 }
 
+struct bpf_link *bpf_program__attach_rstat(const struct bpf_program *prog)
+{
+	struct bpf_link *link = NULL;
+	char errmsg[STRERR_BUFSIZE];
+	int err, prog_fd, link_fd;
+
+	prog_fd = bpf_program__fd(prog);
+	if (prog_fd < 0) {
+		pr_warn("prog '%s': can't attach before loaded\n", prog->name);
+		return libbpf_err_ptr(-EINVAL);
+	}
+
+	link = calloc(1, sizeof(*link));
+	if (!link)
+		return libbpf_err_ptr(-ENOMEM);
+	link->detach = &bpf_link__detach_fd;
+
+	/* rstat flushers are currently the only supported rstat programs */
+	link_fd = bpf_link_create(prog_fd, 0, BPF_RSTAT_FLUSH, NULL);
+	if (link_fd < 0) {
+		err = -errno;
+		pr_warn("prog '%s': failed to attach: %s\n",
+			prog->name, libbpf_strerror_r(err, errmsg,
+						      sizeof(errmsg)));
+		free(link);
+		return libbpf_err_ptr(err);
+	}
+
+	link->fd = link_fd;
+	return link;
+}
+
+static int attach_rstat(const struct bpf_program *prog, long cookie, struct bpf_link **link)
+{
+	*link = bpf_program__attach_rstat(prog);
+	return libbpf_get_error(*link);
+}
+
 struct bpf_link *bpf_program__attach(const struct bpf_program *prog)
 {
 	struct bpf_link *link = NULL;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 21984dcd6dbe..f8b6827d5550 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -662,6 +662,9 @@ LIBBPF_API struct bpf_link *
 bpf_program__attach_iter(const struct bpf_program *prog,
 			 const struct bpf_iter_attach_opts *opts);
 
+LIBBPF_API struct bpf_link *
+bpf_program__attach_rstat(const struct bpf_program *prog);
+
 /*
  * Libbpf allows callers to adjust BPF programs before being loaded
  * into kernel. One program in an object file can be transformed into
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 008da8db1d94..f945c6265cb5 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -449,6 +449,7 @@ LIBBPF_0.8.0 {
 		bpf_program__attach_kprobe_multi_opts;
 		bpf_program__attach_trace_opts;
 		bpf_program__attach_usdt;
+		bpf_program__attach_rstat;
 		bpf_program__set_insns;
 		libbpf_register_prog_handler;
 		libbpf_unregister_prog_handler;
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH bpf-next v2 4/7] bpf: add bpf rstat helpers
  2022-05-15  2:34 [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
                   ` (2 preceding siblings ...)
  2022-05-15  2:35 ` [RFC PATCH bpf-next v2 3/7] libbpf: Add support for rstat flush progs Yosry Ahmed
@ 2022-05-15  2:35 ` Yosry Ahmed
  2022-05-15  2:35 ` [RFC PATCH bpf-next v2 5/7] bpf: Introduce cgroup iter Yosry Ahmed
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Yosry Ahmed @ 2022-05-15  2:35 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Add bpf_cgroup_rstat_updated() and bpf_cgroup_rstat_flush() helpers
to enable  bpf programs that collect and output cgroup stats
to communicate with the rstat frameworkto add a cgroup to the rstat
updated tree or trigger an rstat flush before reading stats.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/bpf.h            |  2 ++
 include/uapi/linux/bpf.h       | 18 ++++++++++++++++++
 kernel/bpf/helpers.c           | 30 ++++++++++++++++++++++++++++++
 kernel/trace/bpf_trace.c       |  4 ++++
 scripts/bpf_doc.py             |  2 ++
 tools/include/uapi/linux/bpf.h | 18 ++++++++++++++++++
 6 files changed, 74 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5061ccd8b2dc..ca908a731cb4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2205,6 +2205,8 @@ extern const struct bpf_func_proto bpf_sock_map_update_proto;
 extern const struct bpf_func_proto bpf_sock_hash_update_proto;
 extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto;
 extern const struct bpf_func_proto bpf_get_current_ancestor_cgroup_id_proto;
+extern const struct bpf_func_proto bpf_cgroup_rstat_updated_proto;
+extern const struct bpf_func_proto bpf_cgroup_rstat_flush_proto;
 extern const struct bpf_func_proto bpf_msg_redirect_hash_proto;
 extern const struct bpf_func_proto bpf_msg_redirect_map_proto;
 extern const struct bpf_func_proto bpf_sk_redirect_hash_proto;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 968e3cb02580..022522174286 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5175,6 +5175,22 @@ union bpf_attr {
  * 	Return
  * 		Map value associated to *key* on *cpu*, or **NULL** if no entry
  * 		was found or *cpu* is invalid.
+ *
+ * void bpf_cgroup_rstat_updated(struct cgroup *cgrp)
+ *	Description
+ *		Notify the rstat framework that bpf stats were updated for
+ *		*cgrp* on the current cpu. Directly calls cgroup_rstat_updated
+ *		with the given *cgrp* and the current cpu.
+ *	Return
+ *		0
+ *
+ * void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
+ *	Description
+ *		Collect all per-cpu stats in *cgrp*'s subtree into global
+ *		counters and propagate them upwards. Directly calls
+ *		cgroup_rstat_flush_irqsafe with the given *cgrp*.
+ *	Return
+ *		0
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5373,6 +5389,8 @@ union bpf_attr {
 	FN(ima_file_hash),		\
 	FN(kptr_xchg),			\
 	FN(map_lookup_percpu_elem),     \
+	FN(cgroup_rstat_updated),	\
+	FN(cgroup_rstat_flush),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index d5f104a39092..88ed26cf45e2 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -416,6 +416,36 @@ const struct bpf_func_proto bpf_get_current_ancestor_cgroup_id_proto = {
 	.arg1_type	= ARG_ANYTHING,
 };
 
+BTF_ID_LIST_SINGLE(bpf_cgroup_btf_ids, struct, cgroup)
+
+BPF_CALL_1(bpf_cgroup_rstat_updated, struct cgroup *, cgrp)
+{
+	cgroup_rstat_updated(cgrp, smp_processor_id());
+	return 0;
+}
+
+const struct bpf_func_proto bpf_cgroup_rstat_updated_proto = {
+	.func		= bpf_cgroup_rstat_updated,
+	.gpl_only	= false,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg1_btf_id	= &bpf_cgroup_btf_ids[0],
+};
+
+BPF_CALL_1(bpf_cgroup_rstat_flush, struct cgroup *, cgrp)
+{
+	cgroup_rstat_flush_irqsafe(cgrp);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_cgroup_rstat_flush_proto = {
+	.func		= bpf_cgroup_rstat_flush,
+	.gpl_only	= false,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg1_btf_id	= &bpf_cgroup_btf_ids[0],
+};
+
 #ifdef CONFIG_CGROUP_BPF
 
 BPF_CALL_2(bpf_get_local_storage, struct bpf_map *, map, u64, flags)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 7141ca8a1c2d..e5a4f1b6e00d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1255,6 +1255,10 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_current_cgroup_id_proto;
 	case BPF_FUNC_get_current_ancestor_cgroup_id:
 		return &bpf_get_current_ancestor_cgroup_id_proto;
+	case BPF_FUNC_cgroup_rstat_updated:
+		return &bpf_cgroup_rstat_updated_proto;
+	case BPF_FUNC_cgroup_rstat_flush:
+		return &bpf_cgroup_rstat_flush_proto;
 #endif
 	case BPF_FUNC_send_signal:
 		return &bpf_send_signal_proto;
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 096625242475..9e2b08557a6f 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -633,6 +633,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct cgroup',
     ]
     known_types = {
             '...',
@@ -682,6 +683,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct cgroup',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 968e3cb02580..022522174286 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5175,6 +5175,22 @@ union bpf_attr {
  * 	Return
  * 		Map value associated to *key* on *cpu*, or **NULL** if no entry
  * 		was found or *cpu* is invalid.
+ *
+ * void bpf_cgroup_rstat_updated(struct cgroup *cgrp)
+ *	Description
+ *		Notify the rstat framework that bpf stats were updated for
+ *		*cgrp* on the current cpu. Directly calls cgroup_rstat_updated
+ *		with the given *cgrp* and the current cpu.
+ *	Return
+ *		0
+ *
+ * void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
+ *	Description
+ *		Collect all per-cpu stats in *cgrp*'s subtree into global
+ *		counters and propagate them upwards. Directly calls
+ *		cgroup_rstat_flush_irqsafe with the given *cgrp*.
+ *	Return
+ *		0
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5373,6 +5389,8 @@ union bpf_attr {
 	FN(ima_file_hash),		\
 	FN(kptr_xchg),			\
 	FN(map_lookup_percpu_elem),     \
+	FN(cgroup_rstat_updated),	\
+	FN(cgroup_rstat_flush),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH bpf-next v2 5/7] bpf: Introduce cgroup iter
  2022-05-15  2:34 [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
                   ` (3 preceding siblings ...)
  2022-05-15  2:35 ` [RFC PATCH bpf-next v2 4/7] bpf: add bpf rstat helpers Yosry Ahmed
@ 2022-05-15  2:35 ` Yosry Ahmed
  2022-05-15  2:35 ` [RFC PATCH bpf-next v2 6/7] selftests/bpf: extend cgroup helpers Yosry Ahmed
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Yosry Ahmed @ 2022-05-15  2:35 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

From: Hao Luo <haoluo@google.com>

Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
iter doesn't iterate a set of kernel objects. Instead, it is supposed to
be parameterized by a cgroup id and prints only that cgroup. So one
needs to specify a target cgroup id when attaching this iter. The target
cgroup's state can be read out via a link of this iter.

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/bpf.h            |   2 +
 include/uapi/linux/bpf.h       |   6 ++
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/cgroup_iter.c       | 148 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |   6 ++
 5 files changed, 163 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/cgroup_iter.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ca908a731cb4..45076c581f24 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -44,6 +44,7 @@ struct kobject;
 struct mem_cgroup;
 struct module;
 struct bpf_func_state;
+struct cgroup;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1579,6 +1580,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
 
 struct bpf_iter_aux_info {
 	struct bpf_map *map;
+	struct cgroup *cgroup;
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 022522174286..9a93b72bf39c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5986,6 +5989,9 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 0487133b799f..f2a6fd0633d6 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -37,7 +37,7 @@ obj-${CONFIG_BPF_LSM} += bpf_lsm.o
 endif
 obj-$(CONFIG_BPF_PRELOAD) += preload/
 ifeq ($(CONFIG_CGROUPS),y)
-obj-$(CONFIG_BPF_SYSCALL) += rstat.o
+obj-$(CONFIG_BPF_SYSCALL) += rstat.o cgroup_iter.o
 endif
 
 obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
new file mode 100644
index 000000000000..86bdfe135d24
--- /dev/null
+++ b/kernel/bpf/cgroup_iter.c
@@ -0,0 +1,148 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/cgroup.h>
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+
+struct bpf_iter__cgroup {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	/* Only one session is supported. */
+	if (*pos > 0)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+
+	return *(struct cgroup **)seq->private;
+}
+
+static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter__cgroup ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = v;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, false);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	return ret;
+}
+
+static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
+{
+}
+
+static const struct seq_operations cgroup_iter_seq_ops = {
+	.start  = cgroup_iter_seq_start,
+	.next   = cgroup_iter_seq_next,
+	.stop   = cgroup_iter_seq_stop,
+	.show   = cgroup_iter_seq_show,
+};
+
+BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+
+static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	*(struct cgroup **)priv_data = aux->cgroup;
+	return 0;
+}
+
+static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
+	.seq_ops                = &cgroup_iter_seq_ops,
+	.init_seq_private       = cgroup_iter_seq_init,
+	.seq_priv_size          = sizeof(struct cgroup *),
+};
+
+static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
+				  union bpf_iter_link_info *linfo,
+				  struct bpf_iter_aux_info *aux)
+{
+	struct cgroup *cgroup;
+
+	cgroup = cgroup_get_from_id(linfo->cgroup.cgroup_id);
+	if (!cgroup)
+		return -EBUSY;
+
+	aux->cgroup = cgroup;
+	return 0;
+}
+
+static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
+{
+	if (aux->cgroup)
+		cgroup_put(aux->cgroup);
+}
+
+static void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
+					struct seq_file *seq)
+{
+	char *buf;
+
+	seq_printf(seq, "cgroup_id:\t%llu\n", cgroup_id(aux->cgroup));
+
+	buf = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!buf) {
+		seq_puts(seq, "cgroup_path:\n");
+		return;
+	}
+
+	/* If cgroup_path_ns() fails, buf will be an empty string, cgroup_path
+	 * will print nothing.
+	 *
+	 * Cgroup_path is the path in the calliing process's cgroup namespace.
+	 */
+	cgroup_path_ns(aux->cgroup, buf, sizeof(buf),
+		       current->nsproxy->cgroup_ns);
+	seq_printf(seq, "cgroup_path:\t%s\n", buf);
+	kfree(buf);
+}
+
+static int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
+					  struct bpf_link_info *info)
+{
+	info->iter.cgroup.cgroup_id = cgroup_id(aux->cgroup);
+	return 0;
+}
+
+DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
+		     struct cgroup *cgroup)
+
+static struct bpf_iter_reg bpf_cgroup_reg_info = {
+	.target			= "cgroup",
+	.attach_target		= bpf_iter_attach_cgroup,
+	.detach_target		= bpf_iter_detach_cgroup,
+	.show_fdinfo		= bpf_iter_cgroup_show_fdinfo,
+	.fill_link_info		= bpf_iter_cgroup_fill_link_info,
+	.ctx_arg_info_size	= 1,
+	.ctx_arg_info		= {
+		{ offsetof(struct bpf_iter__cgroup, cgroup),
+		  PTR_TO_BTF_ID },
+	},
+	.seq_info		= &cgroup_iter_seq_info,
+};
+
+static int __init bpf_cgroup_iter_init(void)
+{
+	bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
+	return bpf_iter_reg_target(&bpf_cgroup_reg_info);
+}
+
+late_initcall(bpf_cgroup_iter_init);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 022522174286..9a93b72bf39c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5986,6 +5989,9 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH bpf-next v2 6/7] selftests/bpf: extend cgroup helpers
  2022-05-15  2:34 [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
                   ` (4 preceding siblings ...)
  2022-05-15  2:35 ` [RFC PATCH bpf-next v2 5/7] bpf: Introduce cgroup iter Yosry Ahmed
@ 2022-05-15  2:35 ` Yosry Ahmed
  2022-05-15  2:35 ` [RFC PATCH bpf-next v2 7/7] bpf: add a selftest for cgroup hierarchical stats collection Yosry Ahmed
  2022-05-16 19:39 ` [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats Tejun Heo
  7 siblings, 0 replies; 12+ messages in thread
From: Yosry Ahmed @ 2022-05-15  2:35 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

This patch extends bpf selftests cgroup helpers in various ways:
- Expose enable_controllers() that allows tests to enable all or a
  subset of controllers for a specific cgroup.
- Add write_cgroup_file().
- Add join_cgroup_parent(). The cgroup workdir is based on the pid,
  therefore a spawned child cannot join the same cgroup hierarchy of the
  test through join_cgroup(). join_cgroup_parent() is used in child
  processes to join a cgroup under the parent's workdir.
- Distinguish relative and absolute cgroup paths in function arguments.
  Now relative paths are called relative_path, and absolute paths are
  called cgroup_path.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 tools/testing/selftests/bpf/cgroup_helpers.c | 158 ++++++++++++++-----
 tools/testing/selftests/bpf/cgroup_helpers.h |  14 +-
 2 files changed, 126 insertions(+), 46 deletions(-)

diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index 9d59c3990ca8..065270b2387f 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -33,49 +33,51 @@
 #define CGROUP_MOUNT_DFLT		"/sys/fs/cgroup"
 #define NETCLS_MOUNT_PATH		CGROUP_MOUNT_DFLT "/net_cls"
 #define CGROUP_WORK_DIR			"/cgroup-test-work-dir"
-#define format_cgroup_path(buf, path) \
+
+#define format_cgroup_path_pid(buf, path, pid) \
 	snprintf(buf, sizeof(buf), "%s%s%d%s", CGROUP_MOUNT_PATH, \
-	CGROUP_WORK_DIR, getpid(), path)
+	CGROUP_WORK_DIR, pid, path)
+
+#define format_cgroup_path(buf, path) \
+	format_cgroup_path_pid(buf, path, getpid())
+
+#define format_parent_cgroup_path(buf, path) \
+	format_cgroup_path_pid(buf, path, getppid())
 
 #define format_classid_path(buf)				\
 	snprintf(buf, sizeof(buf), "%s%s", NETCLS_MOUNT_PATH,	\
 		 CGROUP_WORK_DIR)
 
-/**
- * enable_all_controllers() - Enable all available cgroup v2 controllers
- *
- * Enable all available cgroup v2 controllers in order to increase
- * the code coverage.
- *
- * If successful, 0 is returned.
- */
-static int enable_all_controllers(char *cgroup_path)
+
+static int __enable_controllers(const char *cgroup_path, const char *controllers)
 {
 	char path[PATH_MAX + 1];
-	char buf[PATH_MAX];
+	char enable[PATH_MAX + 1];
 	char *c, *c2;
 	int fd, cfd;
 	ssize_t len;
 
-	snprintf(path, sizeof(path), "%s/cgroup.controllers", cgroup_path);
-	fd = open(path, O_RDONLY);
-	if (fd < 0) {
-		log_err("Opening cgroup.controllers: %s", path);
-		return 1;
-	}
+	/* If not controllers are passed, enable all available controllers */
+	if (!controllers) {
+		snprintf(path, sizeof(path), "%s/cgroup.controllers",
+			 cgroup_path);
+		fd = open(path, O_RDONLY);
+		if (fd < 0) {
+			log_err("Opening cgroup.controllers: %s", path);
+			return 1;
+		}
 
-	len = read(fd, buf, sizeof(buf) - 1);
-	if (len < 0) {
+		len = read(fd, enable, sizeof(enable) - 1);
+		if (len < 0) {
+			close(fd);
+			log_err("Reading cgroup.controllers: %s", path);
+			return 1;
+		} else if (len == 0) /* No controllers to enable */
+			return 0;
+		enable[len] = 0;
 		close(fd);
-		log_err("Reading cgroup.controllers: %s", path);
-		return 1;
-	}
-	buf[len] = 0;
-	close(fd);
-
-	/* No controllers available? We're probably on cgroup v1. */
-	if (len == 0)
-		return 0;
+	} else
+		strncpy(enable, controllers, sizeof(enable));
 
 	snprintf(path, sizeof(path), "%s/cgroup.subtree_control", cgroup_path);
 	cfd = open(path, O_RDWR);
@@ -84,7 +86,7 @@ static int enable_all_controllers(char *cgroup_path)
 		return 1;
 	}
 
-	for (c = strtok_r(buf, " ", &c2); c; c = strtok_r(NULL, " ", &c2)) {
+	for (c = strtok_r(enable, " ", &c2); c; c = strtok_r(NULL, " ", &c2)) {
 		if (dprintf(cfd, "+%s\n", c) <= 0) {
 			log_err("Enabling controller %s: %s", c, path);
 			close(cfd);
@@ -95,6 +97,63 @@ static int enable_all_controllers(char *cgroup_path)
 	return 0;
 }
 
+/**
+ * enable_controllers() - Enable cgroup v2 controllers
+ * @relative_path: The cgroup path, relative to the workdir
+ * @controllers: List of controllers to enable in cgroup.controllers format
+ *
+ *
+ * Enable given cgroup v2 controllers, if @controllers is NULL, enable all
+ * available controllers.
+ *
+ * If successful, 0 is returned.
+ */
+int enable_controllers(const char *relative_path, const char *controllers)
+{
+	char cgroup_path[PATH_MAX + 1];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return __enable_controllers(cgroup_path, controllers);
+}
+
+
+
+/**
+ * write_cgroup_file() - Write to a cgroup file
+ * @relative_path: The cgroup path, relative to the workdir
+ * @buf: Buffer to write to the file
+ *
+ * Write to a file in the given cgroup's directory.
+ *
+ * If successful, 0 is returned.
+ */
+int write_cgroup_file(const char *relative_path, const char *file,
+		      const char *buf)
+{
+	char cgroup_path[PATH_MAX - 24];
+	char file_path[PATH_MAX + 1];
+	int fd;
+
+	format_cgroup_path(cgroup_path, relative_path);
+
+	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
+	fd = open(file_path, O_RDWR);
+	if (fd < 0) {
+		log_err("Opening cgroup.subtree_control: %s", file_path);
+		return 1;
+	}
+
+	if (dprintf(fd, "%s", buf) <= 0) {
+		log_err("Writing to %s", file_path);
+		close(fd);
+		return 1;
+	}
+	close(fd);
+	return 0;
+}
+
+
+
 /**
  * setup_cgroup_environment() - Setup the cgroup environment
  *
@@ -133,7 +192,8 @@ int setup_cgroup_environment(void)
 		return 1;
 	}
 
-	if (enable_all_controllers(cgroup_workdir))
+	/* Enable all available controllers to increase test coverage */
+	if (__enable_controllers(cgroup_workdir, NULL))
 		return 1;
 
 	return 0;
@@ -173,7 +233,7 @@ static int join_cgroup_from_top(const char *cgroup_path)
 
 /**
  * join_cgroup() - Join a cgroup
- * @path: The cgroup path, relative to the workdir, to join
+ * @relative_path: The cgroup path, relative to the workdir, to join
  *
  * This function expects a cgroup to already be created, relative to the cgroup
  * work dir, and it joins it. For example, passing "/my-cgroup" as the path
@@ -182,11 +242,27 @@ static int join_cgroup_from_top(const char *cgroup_path)
  *
  * On success, it returns 0, otherwise on failure it returns 1.
  */
-int join_cgroup(const char *path)
+int join_cgroup(const char *relative_path)
+{
+	char cgroup_path[PATH_MAX + 1];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return join_cgroup_from_top(cgroup_path);
+}
+
+/**
+ * join_parent_cgroup() - Join a cgroup in the parent process workdir
+ * @relative_path: The cgroup path, relative to parent process workdir, to join
+ *
+ * See join_cgroup().
+ *
+ * On success, it returns 0, otherwise on failure it returns 1.
+ */
+int join_parent_cgroup(const char *relative_path)
 {
 	char cgroup_path[PATH_MAX + 1];
 
-	format_cgroup_path(cgroup_path, path);
+	format_parent_cgroup_path(cgroup_path, relative_path);
 	return join_cgroup_from_top(cgroup_path);
 }
 
@@ -214,7 +290,7 @@ void cleanup_cgroup_environment(void)
 
 /**
  * create_and_get_cgroup() - Create a cgroup, relative to workdir, and get the FD
- * @path: The cgroup path, relative to the workdir, to join
+ * @relative_path: The cgroup path, relative to the workdir, to join
  *
  * This function creates a cgroup under the top level workdir and returns the
  * file descriptor. It is idempotent.
@@ -222,14 +298,14 @@ void cleanup_cgroup_environment(void)
  * On success, it returns the file descriptor. On failure it returns -1.
  * If there is a failure, it prints the error to stderr.
  */
-int create_and_get_cgroup(const char *path)
+int create_and_get_cgroup(const char *relative_path)
 {
 	char cgroup_path[PATH_MAX + 1];
 	int fd;
 
-	format_cgroup_path(cgroup_path, path);
+	format_cgroup_path(cgroup_path, relative_path);
 	if (mkdir(cgroup_path, 0777) && errno != EEXIST) {
-		log_err("mkdiring cgroup %s .. %s", path, cgroup_path);
+		log_err("mkdiring cgroup %s .. %s", relative_path, cgroup_path);
 		return -1;
 	}
 
@@ -244,13 +320,13 @@ int create_and_get_cgroup(const char *path)
 
 /**
  * get_cgroup_id() - Get cgroup id for a particular cgroup path
- * @path: The cgroup path, relative to the workdir, to join
+ * @relative_path: The cgroup path, relative to the workdir, to join
  *
  * On success, it returns the cgroup id. On failure it returns 0,
  * which is an invalid cgroup id.
  * If there is a failure, it prints the error to stderr.
  */
-unsigned long long get_cgroup_id(const char *path)
+unsigned long long get_cgroup_id(const char *relative_path)
 {
 	int dirfd, err, flags, mount_id, fhsize;
 	union {
@@ -261,7 +337,7 @@ unsigned long long get_cgroup_id(const char *path)
 	struct file_handle *fhp, *fhp2;
 	unsigned long long ret = 0;
 
-	format_cgroup_path(cgroup_workdir, path);
+	format_cgroup_path(cgroup_workdir, relative_path);
 
 	dirfd = AT_FDCWD;
 	flags = 0;
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index fcc9cb91b211..6b1d905557c7 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -10,11 +10,15 @@
 	__FILE__, __LINE__, clean_errno(), ##__VA_ARGS__)
 
 /* cgroupv2 related */
-int cgroup_setup_and_join(const char *path);
-int create_and_get_cgroup(const char *path);
-unsigned long long get_cgroup_id(const char *path);
+int enable_controllers(const char *relative_path, const char *controllers);
+int write_cgroup_file(const char *relative_path, const char *file,
+		      const char *buf);
+int cgroup_setup_and_join(const char *relative_path);
+int create_and_get_cgroup(const char *relative_path);
+unsigned long long get_cgroup_id(const char *relative_path);
 
-int join_cgroup(const char *path);
+int join_cgroup(const char *relative_path);
+int join_parent_cgroup(const char *relative_path);
 
 int setup_cgroup_environment(void);
 void cleanup_cgroup_environment(void);
@@ -26,4 +30,4 @@ int join_classid(void);
 int setup_classid_environment(void);
 void cleanup_classid_environment(void);
 
-#endif /* __CGROUP_HELPERS_H */
\ No newline at end of file
+#endif /* __CGROUP_HELPERS_H */
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH bpf-next v2 7/7] bpf: add a selftest for cgroup hierarchical stats collection
  2022-05-15  2:34 [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
                   ` (5 preceding siblings ...)
  2022-05-15  2:35 ` [RFC PATCH bpf-next v2 6/7] selftests/bpf: extend cgroup helpers Yosry Ahmed
@ 2022-05-15  2:35 ` Yosry Ahmed
  2022-05-16 19:39 ` [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats Tejun Heo
  7 siblings, 0 replies; 12+ messages in thread
From: Yosry Ahmed @ 2022-05-15  2:35 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Add a selftest that tests the whole workflow for collecting,
aggregating, and display cgroup hierarchical stats.

The test loads tracing bpf programs at the beginning and ending of
direct reclaim to measure the vmscan latency. Per-cgroup readings are
stored in percpu maps for efficiency. When a cgroup reading is updated,
bpf_cgroup_rstat_updated() is called to add the cgroup (and the current
cpu) to the rstat updated tree. When a cgroup is added to the rstat
updated tree, all its parents are added as well. rstat makes sure
cgroups are popped in a bottom up fashion.

When an rstat flush is invoked, an rstat flusher program is called for
per-cgroup per-cpu pairs on the updated tree. The program aggregates
percpu readings to a total reading, and also propagates them to the
parent. After rstat flushing is over, the program will have been invoked
for all (cgroup, cpu) pairs that have updates as well as their parents,
so the whole hierarchy will have updated (flushed) stats.

Finally, a cgroup_iter program is pinned to a file for each cgroup.
Reading this file invokes the cgroup_iter program to flush the stats and
display them to the user.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 .../test_cgroup_hierarchical_stats.c          | 339 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../selftests/bpf/progs/cgroup_vmscan.c       | 222 ++++++++++++
 3 files changed, 568 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_vmscan.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
new file mode 100644
index 000000000000..feb325e7fc39
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+#include <test_progs.h>
+
+#include "cgroup_helpers.h"
+#include "cgroup_vmscan.skel.h"
+
+#define PAGE_SIZE 4096
+#define MB(x) (x << 20)
+
+#define BPFFS_ROOT "/sys/fs/bpf/"
+#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
+
+#define CG_ROOT_NAME "root"
+#define CG_ROOT_ID 1
+
+#define CGROUP_PATH(p, n) {.name = #n, .path = #p"/"#n}
+
+static struct {
+	const char *name, *path;
+	unsigned long long id;
+	int fd;
+} cgroups[] = {
+	CGROUP_PATH(/, test),
+	CGROUP_PATH(/test, child1),
+	CGROUP_PATH(/test, child2),
+	CGROUP_PATH(/test/child1, child1_1),
+	CGROUP_PATH(/test/child1, child1_2),
+	CGROUP_PATH(/test/child2, child2_1),
+	CGROUP_PATH(/test/child2, child2_2),
+};
+
+#define N_CGROUPS (sizeof(cgroups)/sizeof(cgroups[0]))
+#define N_NON_LEAF_CGROUPS 3
+
+bool mounted_bpffs;
+static int duration;
+
+static int read_from_file(const char *path, char *buf, size_t size)
+{
+	int fd, len;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		log_err("Open %s", path);
+		return -errno;
+	}
+	len = read(fd, buf, size);
+	if (len < 0)
+		log_err("Read %s", path);
+	else
+		buf[len] = 0;
+	close(fd);
+	return len < 0 ? -errno : 0;
+}
+
+static int setup_bpffs(void)
+{
+	int err;
+
+	/* Mount bpffs */
+	err = mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
+	mounted_bpffs = !err;
+	if (CHECK(err && errno != EBUSY, "mount bpffs",
+	      "failed to mount bpffs at %s (%s)\n", BPFFS_ROOT,
+	      strerror(errno)))
+		return err;
+
+	/* Create a directory to contain stat files in bpffs */
+	err = mkdir(BPFFS_VMSCAN, 0755);
+	CHECK(err, "mkdir bpffs", "failed to mkdir %s (%s)\n",
+	      BPFFS_VMSCAN, strerror(errno));
+	return err;
+}
+
+static void cleanup_bpffs(void)
+{
+	/* Remove created directory in bpffs */
+	CHECK(rmdir(BPFFS_VMSCAN), "rmdir", "failed to rmdir %s (%s)\n",
+	      BPFFS_VMSCAN, strerror(errno));
+
+	/* Unmount bpffs, if it wasn't already mounted when we started */
+	if (mounted_bpffs)
+		return;
+	CHECK(umount(BPFFS_ROOT), "umount", "failed to unmount bpffs (%s)\n",
+	      strerror(errno));
+}
+
+static int setup_cgroups(void)
+{
+	int i, err;
+
+	err = setup_cgroup_environment();
+	if (CHECK(err, "setup_cgroup_environment", "failed: %d\n", err))
+		return err;
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		int fd;
+
+		fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(fd, 0, "create_and_get_cgroup"))
+			return fd;
+
+		cgroups[i].fd = fd;
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+		if (i < N_NON_LEAF_CGROUPS) {
+			err = enable_controllers(cgroups[i].path, "memory");
+			if (!ASSERT_OK(err, "enable_controllers"))
+				return err;
+		}
+	}
+	return 0;
+}
+
+static void cleanup_cgroups(void)
+{
+	for (int i = 0; i < N_CGROUPS; i++)
+		close(cgroups[i].fd);
+	cleanup_cgroup_environment();
+}
+
+
+static int setup_hierarchy(void)
+{
+	return setup_bpffs() || setup_cgroups();
+}
+
+static void destroy_hierarchy(void)
+{
+	cleanup_cgroups();
+	cleanup_bpffs();
+}
+
+static void alloc_anon(size_t size)
+{
+	char *buf, *ptr;
+
+	buf = malloc(size);
+	for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+		*ptr = 0;
+	free(buf);
+}
+
+static int induce_vmscan(void)
+{
+	char size[128];
+	int i, err;
+
+	/*
+	 * Set memory.high for test parent cgroup to 1 MB to throttle
+	 * allocations and invoke reclaim in children.
+	 */
+	snprintf(size, 128, "%d", MB(1));
+	err = write_cgroup_file(cgroups[0].path, "memory.high",	size);
+	if (!ASSERT_OK(err, "write memory.high"))
+		return err;
+	/*
+	 * In every leaf cgroup, run a memory hog for a few seconds to induce
+	 * reclaim then kill it.
+	 */
+	for (i = N_NON_LEAF_CGROUPS; i < N_CGROUPS; i++) {
+		pid_t pid = fork();
+
+		if (pid == 0) {
+			/* Join cgroup in the parent process workdir */
+			join_parent_cgroup(cgroups[i].path);
+
+			/* Allocate more memory than memory.high */
+			alloc_anon(MB(2));
+			exit(0);
+		} else {
+			/* Wait for child to cause reclaim then kill it */
+			if (!ASSERT_GT(pid, 0, "fork"))
+				return pid;
+			sleep(2);
+			kill(pid, SIGKILL);
+			waitpid(pid, NULL, 0);
+		}
+	}
+	return 0;
+}
+
+static unsigned long long get_cgroup_vmscan(unsigned long long cgroup_id,
+					    const char *file_name)
+{
+	char buf[128], path[128];
+	unsigned long long vmscan = 0, id = 0;
+	int err;
+
+	/* For every cgroup, read the file generated by cgroup_iter */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
+	err = read_from_file(path, buf, 128);
+	if (CHECK(err, "read", "failed to read from %s (%s)\n",
+		   path, strerror(errno)))
+		return 0;
+
+	/* Check the output file formatting */
+	ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
+			 &id, &vmscan), 2, "output format");
+
+	/* Check that the cgroup_id is displayed correctly */
+	ASSERT_EQ(cgroup_id, id, "cgroup_id");
+	/* Check that the vmscan reading is non-zero */
+	ASSERT_NEQ(vmscan, 0, "vmscan_reading");
+	return vmscan;
+}
+
+static void check_vmscan_stats(void)
+{
+	int i;
+	unsigned long long vmscan_readings[N_CGROUPS], vmscan_root;
+
+	for (i = 0; i < N_CGROUPS; i++)
+		vmscan_readings[i] = get_cgroup_vmscan(cgroups[i].id,
+						       cgroups[i].name);
+
+	/* Read stats for root too */
+	vmscan_root = get_cgroup_vmscan(CG_ROOT_ID, CG_ROOT_NAME);
+
+	/* Check that child1 == child1_1 + child1_2 */
+	ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + vmscan_readings[4],
+		  "child1_vmscan");
+	/* Check that child2 == child2_1 + child2_2 */
+	ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + vmscan_readings[6],
+		  "child2_vmscan");
+	/* Check that test == child1 + child2 */
+	ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + vmscan_readings[2],
+		  "test_vmscan");
+	/* Check that root >= test */
+	ASSERT_GE(vmscan_root, vmscan_readings[1], "root_vmscan");
+}
+
+static int setup_cgroup_iter(struct cgroup_vmscan *obj,
+			     unsigned long long cgroup_id,
+			     const char *file_name)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo = {};
+	struct bpf_link *link;
+	char path[128];
+	int err;
+
+	/* Create an iter link, parameterized by cgroup id */
+	linfo.cgroup.cgroup_id = cgroup_id;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+	link = bpf_program__attach_iter(obj->progs.dump_vmscan, &opts);
+	if (!ASSERT_OK_PTR(link, "attach iter"))
+		return libbpf_get_error(link);
+
+	/* Pin the link to a bpffs file */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
+	err = bpf_link__pin(link, path);
+	CHECK(err, "pin iter", "failed to pin iter at %s", path);
+	return err;
+}
+
+static int setup_progs(struct cgroup_vmscan **skel)
+{
+	int i;
+	struct bpf_link *link;
+	struct cgroup_vmscan *obj;
+
+	obj = cgroup_vmscan__open_and_load();
+	if (!ASSERT_OK_PTR(obj, "open_and_load"))
+		return libbpf_get_error(obj);
+
+	/* Attach cgroup_iter program that will dump the stats to cgroups */
+	for (i = 0; i < N_CGROUPS; i++)
+		setup_cgroup_iter(obj, cgroups[i].id, cgroups[i].name);
+	/* Also dump stats for root */
+	setup_cgroup_iter(obj, CG_ROOT_ID, CG_ROOT_NAME);
+
+	/* Attach rstat flusher */
+	link = bpf_program__attach(obj->progs.vmscan_flush);
+	if (!ASSERT_OK_PTR(link, "attach rstat"))
+		return libbpf_get_error(link);
+
+	/* Attach tracing programs that will calculate vmscan delays */
+	link = bpf_program__attach(obj->progs.vmscan_start);
+	if (!ASSERT_OK_PTR(obj, "attach raw_tracepoint"))
+		return libbpf_get_error(obj);
+
+	link = bpf_program__attach(obj->progs.vmscan_end);
+	if (!ASSERT_OK_PTR(obj, "attach raw_tracepoint"))
+		return libbpf_get_error(obj);
+
+	*skel = obj;
+	return 0;
+}
+
+void destroy_progs(struct cgroup_vmscan *skel)
+{
+	char path[128];
+	int i;
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		/* Delete files in bpffs that cgroup_iters are pinned in */
+		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
+			 cgroups[i].name);
+		CHECK(remove(path), "remove", "failed to remove %s (%s)\n",
+		      path, strerror(errno));
+	}
+
+	/* Delete root file in bpffs */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, CG_ROOT_NAME);
+	CHECK(remove(path), "remove", "failed to remove %s (%s)\n", path,
+	      strerror(errno));
+	cgroup_vmscan__destroy(skel);
+}
+
+void test_cgroup_hierarchical_stats(void)
+{
+	struct cgroup_vmscan *skel = NULL;
+
+	if (setup_hierarchy())
+		goto hierarchy_cleanup;
+	if (setup_progs(&skel))
+		goto cleanup;
+	if (induce_vmscan())
+		goto cleanup;
+	check_vmscan_stats();
+cleanup:
+	destroy_progs(skel);
+hierarchy_cleanup:
+	destroy_hierarchy();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h
index 97ec8bc76ae6..df91f1daf74d 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter.h
+++ b/tools/testing/selftests/bpf/progs/bpf_iter.h
@@ -17,6 +17,7 @@
 #define bpf_iter__bpf_sk_storage_map bpf_iter__bpf_sk_storage_map___not_used
 #define bpf_iter__sockmap bpf_iter__sockmap___not_used
 #define bpf_iter__bpf_link bpf_iter__bpf_link___not_used
+#define bpf_iter__cgroup bpf_iter__cgroup__not_used
 #define btf_ptr btf_ptr___not_used
 #define BTF_F_COMPACT BTF_F_COMPACT___not_used
 #define BTF_F_NONAME BTF_F_NONAME___not_used
@@ -39,6 +40,7 @@
 #undef bpf_iter__bpf_sk_storage_map
 #undef bpf_iter__sockmap
 #undef bpf_iter__bpf_link
+#undef bpf_iter__cgroup
 #undef btf_ptr
 #undef BTF_F_COMPACT
 #undef BTF_F_NONAME
@@ -139,6 +141,11 @@ struct bpf_iter__bpf_link {
 	struct bpf_link *link;
 };
 
+struct bpf_iter__cgroup {
+	struct bpf_iter_meta *meta;
+	struct cgroup *cgroup;
+} __attribute((preserve_access_index));
+
 struct btf_ptr {
 	void *ptr;
 	__u32 type_id;
diff --git a/tools/testing/selftests/bpf/progs/cgroup_vmscan.c b/tools/testing/selftests/bpf/progs/cgroup_vmscan.c
new file mode 100644
index 000000000000..96aa62f7b260
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_vmscan.c
@@ -0,0 +1,222 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+
+char _license[] SEC("license") = "GPL";
+
+/*
+ * Start times are stored per-task, not per-cgroup, as multiple tasks in one
+ * cgroup can perform reclain concurrently.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, __u64);
+} vmscan_start_time SEC(".maps");
+
+struct vmscan_percpu {
+	/* Previous percpu state, to figure out if we have new updates */
+	__u64 prev;
+	/* Current percpu state */
+	__u64 state;
+};
+
+struct vmscan {
+	/* State propagated through children, pending aggregation */
+	__u64 pending;
+	/* Total state, including all cpus and all children */
+	__u64 state;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
+	__uint(max_entries, 10);
+	__type(key, __u64);
+	__type(value, struct vmscan_percpu);
+} pcpu_cgroup_vmscan_elapsed SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__type(key, __u64);
+	__type(value, struct vmscan);
+} cgroup_vmscan_elapsed SEC(".maps");
+
+
+static inline bool memory_subsys_enabled(struct cgroup *cgrp)
+{
+	return cgrp->subsys[memory_cgrp_id] != NULL;
+}
+
+static inline struct cgroup *task_memcg(struct task_struct *task)
+{
+	return task->cgroups->subsys[memory_cgrp_id]->cgroup;
+}
+
+static inline uint64_t cgroup_id(struct cgroup *cgrp)
+{
+	return cgrp->kn->id;
+}
+
+static inline int create_vmscan_percpu_elem(__u64 cg_id, __u64 state)
+{
+	struct vmscan_percpu pcpu_init = {.state = state, .prev = 0};
+
+	if (bpf_map_update_elem(&pcpu_cgroup_vmscan_elapsed, &cg_id,
+				&pcpu_init, BPF_NOEXIST)) {
+		bpf_printk("failed to create pcpu entry for cgroup %llu\n"
+			   , cg_id);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int create_vmscan_elem(__u64 cg_id, __u64 state, __u64 pending)
+{
+	struct vmscan init = {.state = state, .pending = pending};
+
+	if (bpf_map_update_elem(&cgroup_vmscan_elapsed, &cg_id,
+				&init, BPF_NOEXIST)) {
+		bpf_printk("failed to create entry for cgroup %llu\n"
+			   , cg_id);
+		return 1;
+	}
+	return 0;
+}
+
+SEC("raw_tp/mm_vmscan_memcg_reclaim_begin")
+int vmscan_start(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	__u64 *start_time_ptr;
+
+	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, task, 0,
+					  BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!start_time_ptr) {
+		bpf_printk("error retrieving storage\n");
+		return 0;
+	}
+
+	*start_time_ptr = bpf_ktime_get_ns();
+	return 0;
+}
+
+SEC("raw_tp/mm_vmscan_memcg_reclaim_end")
+int vmscan_end(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct task_struct *current = bpf_get_current_task_btf();
+	struct cgroup *cgrp = task_memcg(current);
+	__u64 *start_time_ptr;
+	__u64 current_elapsed, cg_id;
+	__u64 end_time = bpf_ktime_get_ns();
+
+	/* cgrp may not have memory controller enabled */
+	if (!cgrp)
+		return 0;
+
+	cg_id = cgroup_id(cgrp);
+	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, current, 0,
+					      BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!start_time_ptr) {
+		bpf_printk("error retrieving storage local storage\n");
+		return 0;
+	}
+
+	current_elapsed = end_time - *start_time_ptr;
+	pcpu_stat = bpf_map_lookup_elem(&pcpu_cgroup_vmscan_elapsed,
+					&cg_id);
+	if (pcpu_stat)
+		__sync_fetch_and_add(&pcpu_stat->state, current_elapsed);
+	else
+		create_vmscan_percpu_elem(cg_id, current_elapsed);
+
+	bpf_cgroup_rstat_updated(cgrp);
+	return 0;
+}
+
+SEC("rstat/flush")
+int vmscan_flush(struct bpf_rstat_flush_ctx *ctx)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct vmscan *total_stat, *parent_stat;
+	struct cgroup *cgrp = ctx->cgrp, *parent = ctx->parent;
+	__u64 cg_id = cgroup_id(ctx->cgrp);
+	__u64 parent_cg_id = parent ? cgroup_id(parent) : 0;
+	__s32 cpu = ctx->cpu;
+	__u64 *pcpu_vmscan;
+	__u64 state;
+	__u64 delta = 0;
+
+	if (!memory_subsys_enabled(cgrp))
+		return 0;
+
+	/* Add CPU changes on this level since the last flush */
+	pcpu_stat = bpf_map_lookup_percpu_elem(&pcpu_cgroup_vmscan_elapsed,
+					       &cg_id, cpu);
+	if (pcpu_stat) {
+		state = pcpu_stat->state;
+		delta += state - pcpu_stat->prev;
+		pcpu_stat->prev = state;
+	}
+
+	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		create_vmscan_elem(cg_id, delta, 0);
+		goto update_parent;
+	}
+
+	/* Collect pending stats from subtree */
+	if (total_stat->pending) {
+		delta += total_stat->pending;
+		total_stat->pending = 0;
+	}
+
+	/* Propagate changes to this cgroup's total */
+	total_stat->state += delta;
+
+update_parent:
+	/* Skip if there are no changes to propagate, or no parent */
+	if (!delta || !parent_cg_id)
+		return 0;
+
+	/* Propagate changes to cgroup's parent */
+	parent_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed,
+					  &parent_cg_id);
+	if (parent_stat)
+		parent_stat->pending += delta;
+	else
+		create_vmscan_elem(parent_cg_id, 0, delta);
+
+	return 0;
+}
+
+SEC("iter/cgroup")
+int dump_vmscan(struct bpf_iter__cgroup *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct cgroup *cgrp = ctx->cgroup;
+	struct vmscan *total_stat;
+	__u64 cg_id = cgroup_id(cgrp);
+
+	/* Flush the stats to make sure we get the most updated numbers */
+	bpf_cgroup_rstat_flush(cgrp);
+
+	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		bpf_printk("error finding stats for cgroup %llu\n", cg_id);
+		BPF_SEQ_PRINTF(seq, "cg_id: -1, total_vmscan_delay: -1\n");
+		return 0;
+	}
+	BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
+		       cg_id, total_stat->state);
+	return 0;
+}
+
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH bpf-next v2 3/7] libbpf: Add support for rstat flush progs
  2022-05-15  2:35 ` [RFC PATCH bpf-next v2 3/7] libbpf: Add support for rstat flush progs Yosry Ahmed
@ 2022-05-15  9:07   ` Yosry Ahmed
  0 siblings, 0 replies; 12+ messages in thread
From: Yosry Ahmed @ 2022-05-15  9:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, cgroups

On Sat, May 14, 2022 at 7:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Add support to attach RSTAT_FLUSH programs.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  tools/lib/bpf/bpf.c      |  1 -
>  tools/lib/bpf/libbpf.c   | 40 ++++++++++++++++++++++++++++++++++++++++
>  tools/lib/bpf/libbpf.h   |  3 +++
>  tools/lib/bpf/libbpf.map |  1 +
>  4 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 5660268e103f..9e3cb0d1eb99 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -870,7 +870,6 @@ int bpf_link_create(int prog_fd, int target_fd,
>                 attr.link_create.tracing.cookie = OPTS_GET(opts, tracing.cookie, 0);
>                 if (!OPTS_ZEROED(opts, tracing))
>                         return libbpf_err(-EINVAL);
> -               break;

This is a mistake, a remnant of RFC V1. Will remove it in the next version.

>         default:
>                 if (!OPTS_ZEROED(opts, flags))
>                         return libbpf_err(-EINVAL);
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 4867a930628b..b7fc64ebf8dd 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -8998,6 +8998,7 @@ static int attach_trace(const struct bpf_program *prog, long cookie, struct bpf_
>  static int attach_kprobe_multi(const struct bpf_program *prog, long cookie, struct bpf_link **link);
>  static int attach_lsm(const struct bpf_program *prog, long cookie, struct bpf_link **link);
>  static int attach_iter(const struct bpf_program *prog, long cookie, struct bpf_link **link);
> +static int attach_rstat(const struct bpf_program *prog, long cookie, struct bpf_link **link);
>
>  static const struct bpf_sec_def section_defs[] = {
>         SEC_DEF("socket",               SOCKET_FILTER, 0, SEC_NONE | SEC_SLOPPY_PFX),
> @@ -9078,6 +9079,7 @@ static const struct bpf_sec_def section_defs[] = {
>         SEC_DEF("cgroup/setsockopt",    CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
>         SEC_DEF("struct_ops+",          STRUCT_OPS, 0, SEC_NONE),
>         SEC_DEF("sk_lookup",            SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
> +       SEC_DEF("rstat/flush",          RSTAT_FLUSH, 0, SEC_NONE, attach_rstat),
>  };
>
>  static size_t custom_sec_def_cnt;
> @@ -11784,6 +11786,44 @@ static int attach_iter(const struct bpf_program *prog, long cookie, struct bpf_l
>         return libbpf_get_error(*link);
>  }
>
> +struct bpf_link *bpf_program__attach_rstat(const struct bpf_program *prog)
> +{
> +       struct bpf_link *link = NULL;
> +       char errmsg[STRERR_BUFSIZE];
> +       int err, prog_fd, link_fd;
> +
> +       prog_fd = bpf_program__fd(prog);
> +       if (prog_fd < 0) {
> +               pr_warn("prog '%s': can't attach before loaded\n", prog->name);
> +               return libbpf_err_ptr(-EINVAL);
> +       }
> +
> +       link = calloc(1, sizeof(*link));
> +       if (!link)
> +               return libbpf_err_ptr(-ENOMEM);
> +       link->detach = &bpf_link__detach_fd;
> +
> +       /* rstat flushers are currently the only supported rstat programs */
> +       link_fd = bpf_link_create(prog_fd, 0, BPF_RSTAT_FLUSH, NULL);
> +       if (link_fd < 0) {
> +               err = -errno;
> +               pr_warn("prog '%s': failed to attach: %s\n",
> +                       prog->name, libbpf_strerror_r(err, errmsg,
> +                                                     sizeof(errmsg)));
> +               free(link);
> +               return libbpf_err_ptr(err);
> +       }
> +
> +       link->fd = link_fd;
> +       return link;
> +}
> +
> +static int attach_rstat(const struct bpf_program *prog, long cookie, struct bpf_link **link)
> +{
> +       *link = bpf_program__attach_rstat(prog);
> +       return libbpf_get_error(*link);
> +}
> +
>  struct bpf_link *bpf_program__attach(const struct bpf_program *prog)
>  {
>         struct bpf_link *link = NULL;
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 21984dcd6dbe..f8b6827d5550 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -662,6 +662,9 @@ LIBBPF_API struct bpf_link *
>  bpf_program__attach_iter(const struct bpf_program *prog,
>                          const struct bpf_iter_attach_opts *opts);
>
> +LIBBPF_API struct bpf_link *
> +bpf_program__attach_rstat(const struct bpf_program *prog);
> +
>  /*
>   * Libbpf allows callers to adjust BPF programs before being loaded
>   * into kernel. One program in an object file can be transformed into
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 008da8db1d94..f945c6265cb5 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -449,6 +449,7 @@ LIBBPF_0.8.0 {
>                 bpf_program__attach_kprobe_multi_opts;
>                 bpf_program__attach_trace_opts;
>                 bpf_program__attach_usdt;
> +               bpf_program__attach_rstat;
>                 bpf_program__set_insns;
>                 libbpf_register_prog_handler;
>                 libbpf_unregister_prog_handler;
> --
> 2.36.0.550.gb090851708-goog
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats
  2022-05-15  2:34 [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
                   ` (6 preceding siblings ...)
  2022-05-15  2:35 ` [RFC PATCH bpf-next v2 7/7] bpf: add a selftest for cgroup hierarchical stats collection Yosry Ahmed
@ 2022-05-16 19:39 ` Tejun Heo
  7 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2022-05-16 19:39 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

Hello,

Looks good from cgroup side.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH bpf-next v2 2/7] cgroup: bpf: flush bpf stats on rstat flush
  2022-05-15  2:34 ` [RFC PATCH bpf-next v2 2/7] cgroup: bpf: flush bpf stats on rstat flush Yosry Ahmed
@ 2022-05-17  2:08   ` Alexei Starovoitov
  2022-05-17 21:51     ` Yosry Ahmed
  0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2022-05-17  2:08 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt, linux-kernel, netdev,
	bpf, cgroups

On Sun, May 15, 2022 at 02:34:59AM +0000, Yosry Ahmed wrote:
> +
> +void bpf_rstat_flush(struct cgroup *cgrp, int cpu)
> +{
> +	struct bpf_rstat_flusher *flusher;
> +	struct bpf_rstat_flush_ctx ctx = {
> +		.cgrp = cgrp,
> +		.parent = cgroup_parent(cgrp),
> +		.cpu = cpu,
> +	};
> +
> +	rcu_read_lock();
> +	migrate_disable();
> +	spin_lock(&bpf_rstat_flushers_lock);
> +
> +	list_for_each_entry(flusher, &bpf_rstat_flushers, list)
> +		(void) bpf_prog_run(flusher->prog, &ctx);
> +
> +	spin_unlock(&bpf_rstat_flushers_lock);
> +	migrate_enable();
> +	rcu_read_unlock();
> +}
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 24b5c2ab5598..0285d496e807 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -2,6 +2,7 @@
>  #include "cgroup-internal.h"
>  
>  #include <linux/sched/cputime.h>
> +#include <linux/bpf-rstat.h>
>  
>  static DEFINE_SPINLOCK(cgroup_rstat_lock);
>  static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
> @@ -168,6 +169,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
>  			struct cgroup_subsys_state *css;
>  
>  			cgroup_base_stat_flush(pos, cpu);
> +			bpf_rstat_flush(pos, cpu);

Please use the following approach instead:

__weak noinline void bpf_rstat_flush(struct cgroup *cgrp, struct cgroup *parent, int cpu)
{
}

and change above line to:
  bpf_rstat_flush(pos, cgroup_parent(pos), cpu);

Then tracing bpf fentry progs will be able to attach to bpf_rstat_flush.
Pretty much the patches 1, 2, 3 are not necessary.
In patch 4 add bpf_cgroup_rstat_updated/flush as two kfuncs instead of stable helpers.

This way patches 1,2,3,4 will become 2 trivial patches and we will be
able to extend the interface between cgroup rstat and bpf whenever we need
without worrying about uapi stability.

We had similar discusison with HID subsystem that plans to use bpf in HID
with the same approach.
See this patch set:
https://lore.kernel.org/bpf/20220421140740.459558-2-benjamin.tissoires@redhat.com/
You'd need patch 1 from it to enable kfuncs for tracing.

Your patch 5 is needed as-is.
Yonghong,
please review it.
Different approach for patch 1-4 won't affect patch 5.
Patches 6 and 7 look good.

With this approach that patch 7 will mostly stay as-is. Instead of:
+SEC("rstat/flush")
+int vmscan_flush(struct bpf_rstat_flush_ctx *ctx)
+{
+       struct vmscan_percpu *pcpu_stat;
+       struct vmscan *total_stat, *parent_stat;
+       struct cgroup *cgrp = ctx->cgrp, *parent = ctx->parent;

it will become

SEC("fentry/bpf_rstat_flush")
int BPF_PROG(vmscan_flush, struct cgroup *cgrp, struct cgroup *parent, int cpu)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH bpf-next v2 2/7] cgroup: bpf: flush bpf stats on rstat flush
  2022-05-17  2:08   ` Alexei Starovoitov
@ 2022-05-17 21:51     ` Yosry Ahmed
  0 siblings, 0 replies; 12+ messages in thread
From: Yosry Ahmed @ 2022-05-17 21:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko, Stanislav Fomichev,
	David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, cgroups

On Mon, May 16, 2022 at 7:08 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, May 15, 2022 at 02:34:59AM +0000, Yosry Ahmed wrote:
> > +
> > +void bpf_rstat_flush(struct cgroup *cgrp, int cpu)
> > +{
> > +     struct bpf_rstat_flusher *flusher;
> > +     struct bpf_rstat_flush_ctx ctx = {
> > +             .cgrp = cgrp,
> > +             .parent = cgroup_parent(cgrp),
> > +             .cpu = cpu,
> > +     };
> > +
> > +     rcu_read_lock();
> > +     migrate_disable();
> > +     spin_lock(&bpf_rstat_flushers_lock);
> > +
> > +     list_for_each_entry(flusher, &bpf_rstat_flushers, list)
> > +             (void) bpf_prog_run(flusher->prog, &ctx);
> > +
> > +     spin_unlock(&bpf_rstat_flushers_lock);
> > +     migrate_enable();
> > +     rcu_read_unlock();
> > +}
> > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > index 24b5c2ab5598..0285d496e807 100644
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> > @@ -2,6 +2,7 @@
> >  #include "cgroup-internal.h"
> >
> >  #include <linux/sched/cputime.h>
> > +#include <linux/bpf-rstat.h>
> >
> >  static DEFINE_SPINLOCK(cgroup_rstat_lock);
> >  static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
> > @@ -168,6 +169,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
> >                       struct cgroup_subsys_state *css;
> >
> >                       cgroup_base_stat_flush(pos, cpu);
> > +                     bpf_rstat_flush(pos, cpu);
>
> Please use the following approach instead:
>
> __weak noinline void bpf_rstat_flush(struct cgroup *cgrp, struct cgroup *parent, int cpu)
> {
> }
>
> and change above line to:
>   bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
>
> Then tracing bpf fentry progs will be able to attach to bpf_rstat_flush.
> Pretty much the patches 1, 2, 3 are not necessary.
> In patch 4 add bpf_cgroup_rstat_updated/flush as two kfuncs instead of stable helpers.
>
> This way patches 1,2,3,4 will become 2 trivial patches and we will be
> able to extend the interface between cgroup rstat and bpf whenever we need
> without worrying about uapi stability.
>
> We had similar discusison with HID subsystem that plans to use bpf in HID
> with the same approach.
> See this patch set:
> https://lore.kernel.org/bpf/20220421140740.459558-2-benjamin.tissoires@redhat.com/
> You'd need patch 1 from it to enable kfuncs for tracing.
>
> Your patch 5 is needed as-is.
> Yonghong,
> please review it.
> Different approach for patch 1-4 won't affect patch 5.
> Patches 6 and 7 look good.
>
> With this approach that patch 7 will mostly stay as-is. Instead of:
> +SEC("rstat/flush")
> +int vmscan_flush(struct bpf_rstat_flush_ctx *ctx)
> +{
> +       struct vmscan_percpu *pcpu_stat;
> +       struct vmscan *total_stat, *parent_stat;
> +       struct cgroup *cgrp = ctx->cgrp, *parent = ctx->parent;
>
> it will become
>
> SEC("fentry/bpf_rstat_flush")
> int BPF_PROG(vmscan_flush, struct cgroup *cgrp, struct cgroup *parent, int cpu)

Thanks so much for taking the time to look into this.

Indeed, this approach looks cleaner and simpler. I will incorporate
that into a V1 and send it. Thanks!

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-05-17 21:52 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-15  2:34 [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats Yosry Ahmed
2022-05-15  2:34 ` [RFC PATCH bpf-next v2 1/7] bpf: introduce RSTAT_FLUSH program type Yosry Ahmed
2022-05-15  2:34 ` [RFC PATCH bpf-next v2 2/7] cgroup: bpf: flush bpf stats on rstat flush Yosry Ahmed
2022-05-17  2:08   ` Alexei Starovoitov
2022-05-17 21:51     ` Yosry Ahmed
2022-05-15  2:35 ` [RFC PATCH bpf-next v2 3/7] libbpf: Add support for rstat flush progs Yosry Ahmed
2022-05-15  9:07   ` Yosry Ahmed
2022-05-15  2:35 ` [RFC PATCH bpf-next v2 4/7] bpf: add bpf rstat helpers Yosry Ahmed
2022-05-15  2:35 ` [RFC PATCH bpf-next v2 5/7] bpf: Introduce cgroup iter Yosry Ahmed
2022-05-15  2:35 ` [RFC PATCH bpf-next v2 6/7] selftests/bpf: extend cgroup helpers Yosry Ahmed
2022-05-15  2:35 ` [RFC PATCH bpf-next v2 7/7] bpf: add a selftest for cgroup hierarchical stats collection Yosry Ahmed
2022-05-16 19:39 ` [RFC PATCH bpf-next v2 0/7] bpf: rstat: cgroup hierarchical stats Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).