linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection
@ 2022-05-10  0:17 Yosry Ahmed
  2022-05-10  0:17 ` [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type Yosry Ahmed
                   ` (9 more replies)
  0 siblings, 10 replies; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10  0:17 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

This patch series allows for using bpf to collect hierarchical cgroup
stats efficiently by integrating with the rstat framework. The rstat
framework provides an efficient way to collect cgroup stats and
propagate them through the cgroup hierarchy.

The last patch is a selftest that demonastrates the entire workflow.
The workflow consists of:
- bpf programs that collect per-cpu per-cgroup stats (tracing progs).
- bpf rstat flusher that contains the logic for aggregating stats
  across cpus and across the cgroup hierarchy.
- bpf cgroup_iter responsible for outputting the stats to userspace
  through reading a file in bpffs.

The first 3 patches include the new bpf rstat flusher program type and
the needed support in rstat code and libbpf. The rstat flusher program
is a callback that the rstat framework makes to bpf when a stat flush is
ongoing, similar to the css_rstat_flush() callback that rstat makes to
cgroup controllers. Each callback is parameterized by a (cgroup, cpu)
pair that has been updated. The program contains the logic for
aggregating the stats across cpus and across the cgroup hierarchy.
These programs can be attached to any cgroup subsystem, not only the
ones that implement the css_rstat_flush() callback in the kernel. This
gives bpf programs more flexibility, and more isolation from the kernel
implementation.

The following 2 patches add necessary helpers for the stats collection
workflow. Helpers that call into cgroup_rstat_updated() and
cgroup_rstat_flush() are added to allow bpf programs collecting stats to
tell the rstat framework that a cgroup has been updated, and to allow
bpf programs outputting stats to tell the rstat framework to flush the
stats before they are displayed to the user. An additional
bpf_map_lookup_percpu_elem is introduced to allow rstat flusher programs
to access percpu stats of the cpu being flushed.

The following 3 patches add the cgroup_iter program type (v2). This was
originally introduced by Hao as a part of a different series [1].
Their usecase is better showcased as part of this patch series. We also
make cgroup_get_from_id() cgroup v1 friendly to allow cgroup_iter programs
to display stats for cgroup v1 as well. This small change makes the
entire workflow cgroup v1 friendly without any other dedicated changes.

The final patch is a selftest demonstrating the entire workflow with a
set of bpf programs that collect per-cgroup latency of memcg reclaim.

[1]https://lore.kernel.org/lkml/20220225234339.2386398-9-haoluo@google.com/


Hao Luo (2):
  cgroup: Add cgroup_put() in !CONFIG_CGROUPS case
  bpf: Introduce cgroup iter

Yosry Ahmed (7):
  bpf: introduce CGROUP_SUBSYS_RSTAT program type
  cgroup: bpf: flush bpf stats on rstat flush
  libbpf: Add support for rstat progs and links
  bpf: add bpf rstat helpers
  bpf: add bpf_map_lookup_percpu_elem() helper
  cgroup: add v1 support to cgroup_get_from_id()
  bpf: add a selftest for cgroup hierarchical stats collection

 include/linux/bpf-cgroup-subsys.h             |  35 ++
 include/linux/bpf.h                           |   4 +
 include/linux/bpf_types.h                     |   2 +
 include/linux/cgroup-defs.h                   |   4 +
 include/linux/cgroup.h                        |   5 +
 include/uapi/linux/bpf.h                      |  45 +++
 kernel/bpf/Makefile                           |   3 +-
 kernel/bpf/arraymap.c                         |  11 +-
 kernel/bpf/cgroup_iter.c                      | 148 ++++++++
 kernel/bpf/cgroup_subsys.c                    | 212 +++++++++++
 kernel/bpf/hashtab.c                          |  25 +-
 kernel/bpf/helpers.c                          |  56 +++
 kernel/bpf/syscall.c                          |   6 +
 kernel/bpf/verifier.c                         |   6 +
 kernel/cgroup/cgroup.c                        |  16 +-
 kernel/cgroup/rstat.c                         |  11 +
 scripts/bpf_doc.py                            |   2 +
 tools/include/uapi/linux/bpf.h                |  45 +++
 tools/lib/bpf/bpf.c                           |   3 +
 tools/lib/bpf/bpf.h                           |   3 +
 tools/lib/bpf/libbpf.c                        |  35 ++
 tools/lib/bpf/libbpf.h                        |   3 +
 tools/lib/bpf/libbpf.map                      |   1 +
 .../test_cgroup_hierarchical_stats.c          | 335 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../selftests/bpf/progs/cgroup_vmscan.c       | 211 +++++++++++
 26 files changed, 1212 insertions(+), 22 deletions(-)
 create mode 100644 include/linux/bpf-cgroup-subsys.h
 create mode 100644 kernel/bpf/cgroup_iter.c
 create mode 100644 kernel/bpf/cgroup_subsys.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_vmscan.c

-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type
  2022-05-10  0:17 [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
@ 2022-05-10  0:17 ` Yosry Ahmed
  2022-05-10 18:07   ` Yosry Ahmed
  2022-05-10 18:44   ` Tejun Heo
  2022-05-10  0:18 ` [RFC PATCH bpf-next 2/9] cgroup: bpf: flush bpf stats on rstat flush Yosry Ahmed
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10  0:17 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

This patch introduces a new bpf program type CGROUP_SUBSYS_RSTAT,
with new corresponding link and attach types.

The main purpose of these programs is to allow BPF programs to collect
and maintain hierarchical cgroup stats easily and efficiently by making
using of the rstat framework in the kernel.

Those programs attach to a cgroup subsystem. They typically contain logic
to aggregate per-cpu and per-cgroup stats collected by other BPF programs.

Currently, only rstat flusher programs can be attached to cgroup
subsystems, but this can be extended later if a use-case arises.

See the selftest in the final patch for a practical example.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/bpf-cgroup-subsys.h |  30 ++++++
 include/linux/bpf_types.h         |   2 +
 include/linux/cgroup-defs.h       |   4 +
 include/uapi/linux/bpf.h          |  12 +++
 kernel/bpf/Makefile               |   1 +
 kernel/bpf/cgroup_subsys.c        | 166 ++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c              |   6 ++
 kernel/cgroup/cgroup.c            |   1 +
 tools/include/uapi/linux/bpf.h    |  12 +++
 9 files changed, 234 insertions(+)
 create mode 100644 include/linux/bpf-cgroup-subsys.h
 create mode 100644 kernel/bpf/cgroup_subsys.c

diff --git a/include/linux/bpf-cgroup-subsys.h b/include/linux/bpf-cgroup-subsys.h
new file mode 100644
index 000000000000..4dcde06b5599
--- /dev/null
+++ b/include/linux/bpf-cgroup-subsys.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2022 Google LLC.
+ */
+#ifndef _BPF_CGROUP_SUBSYS_H_
+#define _BPF_CGROUP_SUBSYS_H_
+
+#include <linux/bpf.h>
+
+struct cgroup_subsys_bpf {
+	/* Head of the list of BPF rstat flushers attached to this subsystem */
+	struct list_head rstat_flushers;
+	spinlock_t flushers_lock;
+};
+
+struct bpf_subsys_rstat_flusher {
+	struct bpf_prog *prog;
+	/* List of BPF rtstat flushers, anchored at subsys->bpf */
+	struct list_head list;
+};
+
+struct bpf_cgroup_subsys_link {
+	struct bpf_link link;
+	struct cgroup_subsys *ss;
+};
+
+int cgroup_subsys_bpf_link_attach(const union bpf_attr *attr,
+				  struct bpf_prog *prog);
+
+#endif  // _BPF_CGROUP_SUBSYS_H_
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 3e24ad0c4b3c..854ee958b0e4 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -56,6 +56,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl,
 	      struct bpf_sysctl, struct bpf_sysctl_kern)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt,
 	      struct bpf_sockopt, struct bpf_sockopt_kern)
+BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT, cgroup_subsys_rstat,
+	      struct bpf_rstat_ctx, struct bpf_rstat_ctx)
 #endif
 #ifdef CONFIG_BPF_LIRC_MODE2
 BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..3bd6eed1fa13 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -20,6 +20,7 @@
 #include <linux/u64_stats_sync.h>
 #include <linux/workqueue.h>
 #include <linux/bpf-cgroup-defs.h>
+#include <linux/bpf-cgroup-subsys.h>
 #include <linux/psi_types.h>
 
 #ifdef CONFIG_CGROUPS
@@ -706,6 +707,9 @@ struct cgroup_subsys {
 	 * specifies the mask of subsystems that this one depends on.
 	 */
 	unsigned int depends_on;
+
+	/* used to store bpf programs.*/
+	struct cgroup_subsys_bpf bpf;
 };
 
 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d14b10b85e51..0f4855fa85db 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -952,6 +952,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT,
 };
 
 enum bpf_attach_type {
@@ -998,6 +999,7 @@ enum bpf_attach_type {
 	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
 	BPF_PERF_EVENT,
 	BPF_TRACE_KPROBE_MULTI,
+	BPF_CGROUP_SUBSYS_RSTAT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1013,6 +1015,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_XDP = 6,
 	BPF_LINK_TYPE_PERF_EVENT = 7,
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
+	BPF_LINK_TYPE_CGROUP_SUBSYS = 9,
 
 	MAX_BPF_LINK_TYPE,
 };
@@ -1482,6 +1485,9 @@ union bpf_attr {
 				 */
 				__u64		bpf_cookie;
 			} perf_event;
+			struct {
+				__u64		name;
+			} cgroup_subsys;
 			struct {
 				__u32		flags;
 				__u32		cnt;
@@ -6324,6 +6330,12 @@ struct bpf_cgroup_dev_ctx {
 	__u32 minor;
 };
 
+struct bpf_rstat_ctx {
+	__u64 cgroup_id;
+	__u64 parent_cgroup_id; /* 0 if root */
+	__s32 cpu;
+};
+
 struct bpf_raw_tracepoint_args {
 	__u64 args[0];
 };
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index c1a9be6a4b9f..6caf4a61e543 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -25,6 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
 obj-$(CONFIG_CGROUP_BPF) += cgroup.o
+obj-$(CONFIG_CGROUP_BPF) += cgroup_subsys.o
 ifeq ($(CONFIG_INET),y)
 obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
 endif
diff --git a/kernel/bpf/cgroup_subsys.c b/kernel/bpf/cgroup_subsys.c
new file mode 100644
index 000000000000..9673ce6aa84a
--- /dev/null
+++ b/kernel/bpf/cgroup_subsys.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+
+#include <linux/bpf-cgroup-subsys.h>
+#include <linux/filter.h>
+
+#include "../cgroup/cgroup-internal.h"
+
+
+static int cgroup_subsys_bpf_attach(struct cgroup_subsys *ss, struct bpf_prog *prog)
+{
+	struct bpf_subsys_rstat_flusher *rstat_flusher;
+
+	rstat_flusher = kmalloc(sizeof(*rstat_flusher), GFP_KERNEL);
+	if (!rstat_flusher)
+		return -ENOMEM;
+	rstat_flusher->prog = prog;
+
+	spin_lock(&ss->bpf.flushers_lock);
+	list_add(&rstat_flusher->list, &ss->bpf.rstat_flushers);
+	spin_unlock(&ss->bpf.flushers_lock);
+
+	return 0;
+}
+
+static void cgroup_subsys_bpf_detach(struct cgroup_subsys *ss, struct bpf_prog *prog)
+{
+	struct bpf_subsys_rstat_flusher *rstat_flusher = NULL;
+
+	spin_lock(&ss->bpf.flushers_lock);
+	list_for_each_entry(rstat_flusher, &ss->bpf.rstat_flushers, list)
+		if (rstat_flusher->prog == prog)
+			break;
+
+	if (rstat_flusher) {
+		list_del(&rstat_flusher->list);
+		bpf_prog_put(rstat_flusher->prog);
+		kfree(rstat_flusher);
+	}
+	spin_unlock(&ss->bpf.flushers_lock);
+}
+
+static void bpf_cgroup_subsys_link_release(struct bpf_link *link)
+{
+	struct bpf_cgroup_subsys_link *ss_link = container_of(link,
+						       struct bpf_cgroup_subsys_link,
+						       link);
+	if (ss_link->ss) {
+		cgroup_subsys_bpf_detach(ss_link->ss, ss_link->link.prog);
+		ss_link->ss = NULL;
+	}
+}
+
+static int bpf_cgroup_subsys_link_detach(struct bpf_link *link)
+{
+	bpf_cgroup_subsys_link_release(link);
+	return 0;
+}
+
+static void bpf_cgroup_subsys_link_dealloc(struct bpf_link *link)
+{
+	struct bpf_cgroup_subsys_link *ss_link = container_of(link,
+						       struct bpf_cgroup_subsys_link,
+						       link);
+	kfree(ss_link);
+}
+
+static const struct bpf_link_ops bpf_cgroup_subsys_link_lops = {
+	.detach = bpf_cgroup_subsys_link_detach,
+	.release = bpf_cgroup_subsys_link_release,
+	.dealloc = bpf_cgroup_subsys_link_dealloc,
+};
+
+int cgroup_subsys_bpf_link_attach(const union bpf_attr *attr,
+				  struct bpf_prog *prog)
+{
+	struct bpf_link_primer link_primer;
+	struct bpf_cgroup_subsys_link *link;
+	struct cgroup_subsys *ss, *attach_ss = NULL;
+	const char __user *ss_name_user;
+	char ss_name[MAX_CGROUP_TYPE_NAMELEN];
+	int ssid, err;
+
+	if (attr->link_create.target_fd || attr->link_create.flags)
+		return -EINVAL;
+
+	ss_name_user = u64_to_user_ptr(attr->link_create.cgroup_subsys.name);
+	if (strncpy_from_user(ss_name, ss_name_user, sizeof(ss_name) - 1) < 0)
+		return -EFAULT;
+
+	for_each_subsys(ss, ssid)
+		if (!strcmp(ss_name, ss->name) ||
+		    !strcmp(ss_name, ss->legacy_name))
+			attach_ss = ss;
+
+	if (!attach_ss)
+		return -EINVAL;
+
+	link = kzalloc(sizeof(*link), GFP_USER);
+	if (!link)
+		return -ENOMEM;
+
+	bpf_link_init(&link->link, BPF_LINK_TYPE_CGROUP_SUBSYS,
+		      &bpf_cgroup_subsys_link_lops,
+		      prog);
+	link->ss = attach_ss;
+
+	err = bpf_link_prime(&link->link, &link_primer);
+	if (err) {
+		kfree(link);
+		return err;
+	}
+
+	err = cgroup_subsys_bpf_attach(attach_ss, prog);
+	if (err) {
+		bpf_link_cleanup(&link_primer);
+		return err;
+	}
+
+	return bpf_link_settle(&link_primer);
+}
+
+static const struct bpf_func_proto *
+cgroup_subsys_rstat_func_proto(enum bpf_func_id func_id,
+			       const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id);
+}
+
+static bool cgroup_subsys_rstat_is_valid_access(int off, int size,
+					   enum bpf_access_type type,
+					   const struct bpf_prog *prog,
+					   struct bpf_insn_access_aux *info)
+{
+	if (type == BPF_WRITE)
+		return false;
+
+	if (off < 0 || off + size > sizeof(struct bpf_rstat_ctx))
+		return false;
+	/* The verifier guarantees that size > 0 */
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case offsetof(struct bpf_rstat_ctx, cgroup_id):
+		return size == sizeof(__u64);
+	case offsetof(struct bpf_rstat_ctx, parent_cgroup_id):
+		return size == sizeof(__u64);
+	case offsetof(struct bpf_rstat_ctx, cpu):
+		return size == sizeof(__s32);
+	default:
+		return false;
+	}
+}
+
+const struct bpf_prog_ops cgroup_subsys_rstat_prog_ops = {
+};
+
+const struct bpf_verifier_ops cgroup_subsys_rstat_verifier_ops = {
+	.get_func_proto         = cgroup_subsys_rstat_func_proto,
+	.is_valid_access        = cgroup_subsys_rstat_is_valid_access,
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cdaa1152436a..48149c54d969 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3,6 +3,7 @@
  */
 #include <linux/bpf.h>
 #include <linux/bpf-cgroup.h>
+#include <linux/bpf-cgroup-subsys.h>
 #include <linux/bpf_trace.h>
 #include <linux/bpf_lirc.h>
 #include <linux/bpf_verifier.h>
@@ -3194,6 +3195,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_SK_LOOKUP;
 	case BPF_XDP:
 		return BPF_PROG_TYPE_XDP;
+	case BPF_CGROUP_SUBSYS_RSTAT:
+		return BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
@@ -4341,6 +4344,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 		else
 			ret = bpf_kprobe_multi_link_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT:
+		ret = cgroup_subsys_bpf_link_attach(attr, prog);
+		break;
 	default:
 		ret = -EINVAL;
 	}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..7b1448013009 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5745,6 +5745,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 
 	idr_init(&ss->css_idr);
 	INIT_LIST_HEAD(&ss->cfts);
+	INIT_LIST_HEAD(&ss->bpf.rstat_flushers);
 
 	/* Create the root cgroup state for this subsystem */
 	ss->root = &cgrp_dfl_root;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d14b10b85e51..0f4855fa85db 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -952,6 +952,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT,
 };
 
 enum bpf_attach_type {
@@ -998,6 +999,7 @@ enum bpf_attach_type {
 	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
 	BPF_PERF_EVENT,
 	BPF_TRACE_KPROBE_MULTI,
+	BPF_CGROUP_SUBSYS_RSTAT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1013,6 +1015,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_XDP = 6,
 	BPF_LINK_TYPE_PERF_EVENT = 7,
 	BPF_LINK_TYPE_KPROBE_MULTI = 8,
+	BPF_LINK_TYPE_CGROUP_SUBSYS = 9,
 
 	MAX_BPF_LINK_TYPE,
 };
@@ -1482,6 +1485,9 @@ union bpf_attr {
 				 */
 				__u64		bpf_cookie;
 			} perf_event;
+			struct {
+				__u64		name;
+			} cgroup_subsys;
 			struct {
 				__u32		flags;
 				__u32		cnt;
@@ -6324,6 +6330,12 @@ struct bpf_cgroup_dev_ctx {
 	__u32 minor;
 };
 
+struct bpf_rstat_ctx {
+	__u64 cgroup_id;
+	__u64 parent_cgroup_id; /* 0 if root */
+	__s32 cpu;
+};
+
 struct bpf_raw_tracepoint_args {
 	__u64 args[0];
 };
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH bpf-next 2/9] cgroup: bpf: flush bpf stats on rstat flush
  2022-05-10  0:17 [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
  2022-05-10  0:17 ` [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type Yosry Ahmed
@ 2022-05-10  0:18 ` Yosry Ahmed
  2022-05-10 18:45   ` Tejun Heo
  2022-05-10  0:18 ` [RFC PATCH bpf-next 3/9] libbpf: Add support for rstat progs and links Yosry Ahmed
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10  0:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

When a cgroup is popped from the rstat updated tree, subsystems rstat
flushers are run through the css_rstat_flush() callback. Also run bpf
flushers for all subsystems that have at least one bpf rstat flusher
attached, and are enabled for this cgroup.

A list of subsystems that have attached rstat flushers is maintained to
avoid looping through all subsystems for all cpus for every cgroup that
is being popped from the updated tree. Since we introduce a lock here to
protect this list, also use it to protect rstat_flushers lists inside
each subsystem (since they both need to locked together anyway), and get
read of the locks in struct cgroup_subsys_bpf.

rstat flushers are run for any enabled subsystem that has flushers
attached, even if it does not subscribe to css flushing through
css_rstat_flush(). This gives flexibility for bpf programs to collect
stats for any subsystem, regardless of the implementation changes in the
kernel.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/bpf-cgroup-subsys.h |  7 +++-
 include/linux/cgroup.h            |  2 ++
 kernel/bpf/cgroup_subsys.c        | 60 +++++++++++++++++++++++++++----
 kernel/cgroup/cgroup.c            |  5 +--
 kernel/cgroup/rstat.c             | 11 ++++++
 5 files changed, 75 insertions(+), 10 deletions(-)

diff --git a/include/linux/bpf-cgroup-subsys.h b/include/linux/bpf-cgroup-subsys.h
index 4dcde06b5599..e977b9ef5754 100644
--- a/include/linux/bpf-cgroup-subsys.h
+++ b/include/linux/bpf-cgroup-subsys.h
@@ -10,7 +10,11 @@
 struct cgroup_subsys_bpf {
 	/* Head of the list of BPF rstat flushers attached to this subsystem */
 	struct list_head rstat_flushers;
-	spinlock_t flushers_lock;
+	/*
+	 * A list that runs through subsystems that have at least one rstat
+	 * flusher.
+	 */
+	struct list_head rstat_subsys_node;
 };
 
 struct bpf_subsys_rstat_flusher {
@@ -26,5 +30,6 @@ struct bpf_cgroup_subsys_link {
 
 int cgroup_subsys_bpf_link_attach(const union bpf_attr *attr,
 				  struct bpf_prog *prog);
+void bpf_run_rstat_flushers(struct cgroup *cgrp, int cpu);
 
 #endif  // _BPF_CGROUP_SUBSYS_H_
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0d1ada8968d7..5408c74d5c44 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -97,6 +97,8 @@ extern struct css_set init_css_set;
 
 bool css_has_online_children(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss);
+struct cgroup_subsys_state *cgroup_css(struct cgroup *cgrp,
+				       struct cgroup_subsys *ss);
 struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgroup,
 					 struct cgroup_subsys *ss);
 struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
diff --git a/kernel/bpf/cgroup_subsys.c b/kernel/bpf/cgroup_subsys.c
index 9673ce6aa84a..1d10319a34e9 100644
--- a/kernel/bpf/cgroup_subsys.c
+++ b/kernel/bpf/cgroup_subsys.c
@@ -6,10 +6,46 @@
  */
 
 #include <linux/bpf-cgroup-subsys.h>
+#include <linux/cgroup.h>
 #include <linux/filter.h>
 
 #include "../cgroup/cgroup-internal.h"
 
+/* List of subsystems that have rstat flushers attached */
+static LIST_HEAD(bpf_rstat_subsys_list);
+/* Protects the above list, and the lists of rstat flushers in each subsys */
+static DEFINE_SPINLOCK(bpf_rstat_subsys_lock);
+
+
+void bpf_run_rstat_flushers(struct cgroup *cgrp, int cpu)
+{
+	struct cgroup_subsys_bpf *ss_bpf;
+	struct cgroup *parent = cgroup_parent(cgrp);
+	struct bpf_rstat_ctx ctx = {
+		.cgroup_id = cgroup_id(cgrp),
+		.parent_cgroup_id = parent ? cgroup_id(parent) : 0,
+		.cpu = cpu,
+	};
+
+	rcu_read_lock();
+	migrate_disable();
+	spin_lock(&bpf_rstat_subsys_lock);
+	list_for_each_entry(ss_bpf, &bpf_rstat_subsys_list, rstat_subsys_node) {
+		struct bpf_subsys_rstat_flusher *rstat_flusher;
+		struct cgroup_subsys *ss = container_of(ss_bpf,
+							struct cgroup_subsys,
+							bpf);
+
+		/* Subsystem ss is not enabled for cgrp */
+		if (!cgroup_css(cgrp, ss))
+			continue;
+		list_for_each_entry(rstat_flusher, &ss_bpf->rstat_flushers, list)
+			(void) bpf_prog_run(rstat_flusher->prog, &ctx);
+	}
+	spin_unlock(&bpf_rstat_subsys_lock);
+	migrate_enable();
+	rcu_read_unlock();
+}
 
 static int cgroup_subsys_bpf_attach(struct cgroup_subsys *ss, struct bpf_prog *prog)
 {
@@ -20,28 +56,38 @@ static int cgroup_subsys_bpf_attach(struct cgroup_subsys *ss, struct bpf_prog *p
 		return -ENOMEM;
 	rstat_flusher->prog = prog;
 
-	spin_lock(&ss->bpf.flushers_lock);
+	spin_lock(&bpf_rstat_subsys_lock);
+	/* Add ss to bpf_rstat_subsys_list when we attach the first flusher */
+	if (list_empty(&ss->bpf.rstat_flushers))
+		list_add(&ss->bpf.rstat_subsys_node, &bpf_rstat_subsys_list);
 	list_add(&rstat_flusher->list, &ss->bpf.rstat_flushers);
-	spin_unlock(&ss->bpf.flushers_lock);
+	spin_unlock(&bpf_rstat_subsys_lock);
 
 	return 0;
 }
 
 static void cgroup_subsys_bpf_detach(struct cgroup_subsys *ss, struct bpf_prog *prog)
 {
-	struct bpf_subsys_rstat_flusher *rstat_flusher = NULL;
+	struct bpf_subsys_rstat_flusher *iter, *rstat_flusher = NULL;
 
-	spin_lock(&ss->bpf.flushers_lock);
-	list_for_each_entry(rstat_flusher, &ss->bpf.rstat_flushers, list)
-		if (rstat_flusher->prog == prog)
+	spin_lock(&bpf_rstat_subsys_lock);
+	list_for_each_entry(iter, &ss->bpf.rstat_flushers, list)
+		if (iter->prog == prog) {
+			rstat_flusher = iter;
 			break;
+		}
 
 	if (rstat_flusher) {
 		list_del(&rstat_flusher->list);
 		bpf_prog_put(rstat_flusher->prog);
 		kfree(rstat_flusher);
 	}
-	spin_unlock(&ss->bpf.flushers_lock);
+	/*
+	 * Remove ss from bpf_rstat_subsys_list when we detach the last flusher
+	 */
+	if (list_empty(&ss->bpf.rstat_flushers))
+		list_del(&ss->bpf.rstat_subsys_node);
+	spin_unlock(&bpf_rstat_subsys_lock);
 }
 
 static void bpf_cgroup_subsys_link_release(struct bpf_link *link)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7b1448013009..af703cfcb9d2 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -478,8 +478,8 @@ static u16 cgroup_ss_mask(struct cgroup *cgrp)
  * keep accessing it outside the said locks.  This function may return
  * %NULL if @cgrp doesn't have @subsys_id enabled.
  */
-static struct cgroup_subsys_state *cgroup_css(struct cgroup *cgrp,
-					      struct cgroup_subsys *ss)
+struct cgroup_subsys_state *cgroup_css(struct cgroup *cgrp,
+				       struct cgroup_subsys *ss)
 {
 	if (CGROUP_HAS_SUBSYS_CONFIG && ss)
 		return rcu_dereference_check(cgrp->subsys[ss->id],
@@ -5746,6 +5746,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	idr_init(&ss->css_idr);
 	INIT_LIST_HEAD(&ss->cfts);
 	INIT_LIST_HEAD(&ss->bpf.rstat_flushers);
+	INIT_LIST_HEAD(&ss->bpf.rstat_subsys_node);
 
 	/* Create the root cgroup state for this subsystem */
 	ss->root = &cgrp_dfl_root;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 24b5c2ab5598..af553a0ccc0d 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -2,6 +2,7 @@
 #include "cgroup-internal.h"
 
 #include <linux/sched/cputime.h>
+#include <linux/bpf-cgroup-subsys.h>
 
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
@@ -173,6 +174,16 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 			list_for_each_entry_rcu(css, &pos->rstat_css_list,
 						rstat_css_node)
 				css->ss->css_rstat_flush(css, cpu);
+			/*
+			 * We run bpf flushers in a separate loop in
+			 * bpf_run_rstat_flushers()  as the above
+			 * loop only goes through subsystems that have rstat
+			 * flushing registered in the kernel.
+			 *
+			 * This gives flexibility for BPF programs to utilize
+			 * rstat to collect stats for any subsystem.
+			 */
+			bpf_run_rstat_flushers(pos, cpu);
 			rcu_read_unlock();
 		}
 		raw_spin_unlock_irqrestore(cpu_lock, flags);
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH bpf-next 3/9] libbpf: Add support for rstat progs and links
  2022-05-10  0:17 [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
  2022-05-10  0:17 ` [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type Yosry Ahmed
  2022-05-10  0:18 ` [RFC PATCH bpf-next 2/9] cgroup: bpf: flush bpf stats on rstat flush Yosry Ahmed
@ 2022-05-10  0:18 ` Yosry Ahmed
  2022-05-10  0:18 ` [RFC PATCH bpf-next 4/9] bpf: add bpf rstat helpers Yosry Ahmed
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10  0:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Add support to attach "cgroup_subsys/rstat" programs to a subsystem by
calling bpf_program__attach_subsys. Currently, only CGROUP_SUBSYS_RSTAT
programs are supported for attachment to subsystems.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 tools/lib/bpf/bpf.c      |  3 +++
 tools/lib/bpf/bpf.h      |  3 +++
 tools/lib/bpf/libbpf.c   | 35 +++++++++++++++++++++++++++++++++++
 tools/lib/bpf/libbpf.h   |  3 +++
 tools/lib/bpf/libbpf.map |  1 +
 5 files changed, 45 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index cf27251adb92..abfff17cfa07 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -863,6 +863,9 @@ int bpf_link_create(int prog_fd, int target_fd,
 		if (!OPTS_ZEROED(opts, kprobe_multi))
 			return libbpf_err(-EINVAL);
 		break;
+	case BPF_CGROUP_SUBSYS_RSTAT:
+		attr.link_create.cgroup_subsys.name = ptr_to_u64(OPTS_GET(opts, cgroup_subsys.name, 0));
+		break;
 	default:
 		if (!OPTS_ZEROED(opts, flags))
 			return libbpf_err(-EINVAL);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index f4b4afb6d4ba..384767a9ffd3 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -413,6 +413,9 @@ struct bpf_link_create_opts {
 		struct {
 			__u64 bpf_cookie;
 		} perf_event;
+		struct {
+			const char *name;
+		} cgroup_subsys;
 		struct {
 			__u32 flags;
 			__u32 cnt;
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 809fe209cdcc..56380953df55 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8715,6 +8715,7 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("cgroup/setsockopt",	CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
 	SEC_DEF("struct_ops+",		STRUCT_OPS, 0, SEC_NONE),
 	SEC_DEF("sk_lookup",		SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE | SEC_SLOPPY_PFX),
+	SEC_DEF("cgroup_subsys/rstat",	CGROUP_SUBSYS_RSTAT, 0, SEC_NONE),
 };
 
 static size_t custom_sec_def_cnt;
@@ -10957,6 +10958,40 @@ static int attach_iter(const struct bpf_program *prog, long cookie, struct bpf_l
 	return libbpf_get_error(*link);
 }
 
+struct bpf_link *bpf_program__attach_subsys(const struct bpf_program *prog,
+					     const char *subsys_name)
+{
+	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, lopts,
+			    .cgroup_subsys.name = subsys_name);
+	struct bpf_link *link = NULL;
+	char errmsg[STRERR_BUFSIZE];
+	int err, prog_fd, link_fd;
+
+	prog_fd = bpf_program__fd(prog);
+	if (prog_fd < 0) {
+		pr_warn("prog '%s': can't attach before loaded\n", prog->name);
+		return libbpf_err_ptr(-EINVAL);
+	}
+
+	link = calloc(1, sizeof(*link));
+	if (!link)
+		return libbpf_err_ptr(-ENOMEM);
+	link->detach = &bpf_link__detach_fd;
+
+	link_fd = bpf_link_create(prog_fd, 0, BPF_CGROUP_SUBSYS_RSTAT, &lopts);
+	if (link_fd < 0) {
+		err = -errno;
+		pr_warn("prog '%s': failed to attach: %s\n",
+			prog->name, libbpf_strerror_r(err, errmsg,
+						      sizeof(errmsg)));
+		free(link);
+		return libbpf_err_ptr(err);
+	}
+
+	link->fd = link_fd;
+	return link;
+}
+
 struct bpf_link *bpf_program__attach(const struct bpf_program *prog)
 {
 	struct bpf_link *link = NULL;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 05dde85e19a6..eddbffcd39f7 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -537,6 +537,9 @@ bpf_program__attach_xdp(const struct bpf_program *prog, int ifindex);
 LIBBPF_API struct bpf_link *
 bpf_program__attach_freplace(const struct bpf_program *prog,
 			     int target_fd, const char *attach_func_name);
+LIBBPF_API struct bpf_link *
+bpf_program__attach_subsys(const struct bpf_program *prog,
+			   const char *subsys_name);
 
 struct bpf_map;
 
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index dd35ee58bfaa..5583a2dbfb7c 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -447,4 +447,5 @@ LIBBPF_0.8.0 {
 		libbpf_register_prog_handler;
 		libbpf_unregister_prog_handler;
 		bpf_program__attach_kprobe_multi_opts;
+		bpf_program__attach_subsys;
 } LIBBPF_0.7.0;
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH bpf-next 4/9] bpf: add bpf rstat helpers
  2022-05-10  0:17 [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
                   ` (2 preceding siblings ...)
  2022-05-10  0:18 ` [RFC PATCH bpf-next 3/9] libbpf: Add support for rstat progs and links Yosry Ahmed
@ 2022-05-10  0:18 ` Yosry Ahmed
  2022-05-10  0:18 ` [RFC PATCH bpf-next 5/9] bpf: add bpf_map_lookup_percpu_elem() helper Yosry Ahmed
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10  0:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Add bpf_cgroup_rstat_updated() and bpf_cgroup_rstat_flush() helpers
to enable  bpf programs that collect and output cgroup stats
to communicate with the rstat frameworkto add a cgroup to the rstat
updated tree or trigger an rstat flush before reading stats.

ARG_ANYTHING is used here for the struct *cgroup parameter. Would it be
better to add a task_cgroup(subsys_id) helper that returns a cgroup
pointer so that we can use a BTF argument instead?

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/uapi/linux/bpf.h       | 18 ++++++++++++++++++
 kernel/bpf/helpers.c           | 30 ++++++++++++++++++++++++++++++
 scripts/bpf_doc.py             |  2 ++
 tools/include/uapi/linux/bpf.h | 18 ++++++++++++++++++
 4 files changed, 68 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0f4855fa85db..fce5535579d6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5149,6 +5149,22 @@ union bpf_attr {
  *		The **hash_algo** is returned on success,
  *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
  *		invalid arguments are passed.
+ *
+ * void bpf_cgroup_rstat_updated(struct cgroup *cgrp)
+ *	Description
+ *		Notify the rstat framework that bpf stats were updated for
+ *		*cgrp* on the current cpu. Directly calls cgroup_rstat_updated
+ *		with the given *cgrp* and the current cpu.
+ *	Return
+ *		0
+ *
+ * void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
+ *	Description
+ *		Collect all per-cpu stats in *cgrp*'s subtree into global
+ *		counters and propagate them upwards. Directly calls
+ *		cgroup_rstat_flush_irqsafe with the given *cgrp*.
+ *	Return
+ *		0
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5345,6 +5361,8 @@ union bpf_attr {
 	FN(copy_from_user_task),	\
 	FN(skb_set_tstamp),		\
 	FN(ima_file_hash),		\
+	FN(cgroup_rstat_updated),	\
+	FN(cgroup_rstat_flush),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 315053ef6a75..d124eed97ad7 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1374,6 +1374,32 @@ void bpf_timer_cancel_and_free(void *val)
 	kfree(t);
 }
 
+BPF_CALL_1(bpf_cgroup_rstat_updated, struct cgroup *, cgrp)
+{
+	cgroup_rstat_updated(cgrp, smp_processor_id());
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_cgroup_rstat_updated_proto = {
+	.func		= bpf_cgroup_rstat_updated,
+	.gpl_only	= false,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_1(bpf_cgroup_rstat_flush, struct cgroup *, cgrp)
+{
+	cgroup_rstat_flush_irqsafe(cgrp);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_cgroup_rstat_flush_proto = {
+	.func		= bpf_cgroup_rstat_flush,
+	.gpl_only	= false,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_ANYTHING,
+};
+
 const struct bpf_func_proto bpf_get_current_task_proto __weak;
 const struct bpf_func_proto bpf_get_current_task_btf_proto __weak;
 const struct bpf_func_proto bpf_probe_read_user_proto __weak;
@@ -1426,6 +1452,10 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_loop_proto;
 	case BPF_FUNC_strncmp:
 		return &bpf_strncmp_proto;
+	case BPF_FUNC_cgroup_rstat_updated:
+		return &bpf_cgroup_rstat_updated_proto;
+	case BPF_FUNC_cgroup_rstat_flush:
+		return &bpf_cgroup_rstat_flush_proto;
 	default:
 		break;
 	}
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 096625242475..9e2b08557a6f 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -633,6 +633,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct cgroup',
     ]
     known_types = {
             '...',
@@ -682,6 +683,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct cgroup',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0f4855fa85db..fce5535579d6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5149,6 +5149,22 @@ union bpf_attr {
  *		The **hash_algo** is returned on success,
  *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
  *		invalid arguments are passed.
+ *
+ * void bpf_cgroup_rstat_updated(struct cgroup *cgrp)
+ *	Description
+ *		Notify the rstat framework that bpf stats were updated for
+ *		*cgrp* on the current cpu. Directly calls cgroup_rstat_updated
+ *		with the given *cgrp* and the current cpu.
+ *	Return
+ *		0
+ *
+ * void bpf_cgroup_rstat_flush(struct cgroup *cgrp)
+ *	Description
+ *		Collect all per-cpu stats in *cgrp*'s subtree into global
+ *		counters and propagate them upwards. Directly calls
+ *		cgroup_rstat_flush_irqsafe with the given *cgrp*.
+ *	Return
+ *		0
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5345,6 +5361,8 @@ union bpf_attr {
 	FN(copy_from_user_task),	\
 	FN(skb_set_tstamp),		\
 	FN(ima_file_hash),		\
+	FN(cgroup_rstat_updated),	\
+	FN(cgroup_rstat_flush),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH bpf-next 5/9] bpf: add bpf_map_lookup_percpu_elem() helper
  2022-05-10  0:17 [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
                   ` (3 preceding siblings ...)
  2022-05-10  0:18 ` [RFC PATCH bpf-next 4/9] bpf: add bpf rstat helpers Yosry Ahmed
@ 2022-05-10  0:18 ` Yosry Ahmed
  2022-05-10  0:18 ` [RFC PATCH bpf-next 6/9] cgroup: add v1 support to cgroup_get_from_id() Yosry Ahmed
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10  0:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Add a helper for bpf programs to lookup a percpu element for a cpu other
than the current one. This is useful for rstat flusher programs as they
get called to aggregate stats from different cpus, regardless of the
current cpu.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/bpf.h            |  2 ++
 include/uapi/linux/bpf.h       |  9 +++++++++
 kernel/bpf/arraymap.c          | 11 ++++++++---
 kernel/bpf/hashtab.c           | 25 +++++++++++--------------
 kernel/bpf/helpers.c           | 26 ++++++++++++++++++++++++++
 kernel/bpf/verifier.c          |  6 ++++++
 tools/include/uapi/linux/bpf.h |  9 +++++++++
 7 files changed, 71 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bdb5298735ce..f6fa35ffe311 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1665,6 +1665,8 @@ int map_set_for_each_callback_args(struct bpf_verifier_env *env,
 				   struct bpf_func_state *caller,
 				   struct bpf_func_state *callee);
 
+void *bpf_percpu_hash_lookup(struct bpf_map *map, void *key, int cpu);
+void *bpf_percpu_array_lookup(struct bpf_map *map, void *key, int cpu);
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value);
 int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fce5535579d6..015ed402c642 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1553,6 +1553,14 @@ union bpf_attr {
  * 		Map value associated to *key*, or **NULL** if no entry was
  * 		found.
  *
+ * void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, int cpu)
+ *	Description
+ *		Perform a lookup in percpu *map* for an entry associated to
+ *		*key* for the given *cpu*.
+ *	Return
+ *		Map value associated to *key* per *cpu*, or **NULL** if no entry
+ *		was found.
+ *
  * long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
  * 	Description
  * 		Add or update the value of the entry associated to *key* in
@@ -5169,6 +5177,7 @@ union bpf_attr {
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
 	FN(map_lookup_elem),		\
+	FN(map_lookup_percpu_elem),	\
 	FN(map_update_elem),		\
 	FN(map_delete_elem),		\
 	FN(probe_read),			\
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 7f145aefbff8..945dae4c20eb 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -230,8 +230,7 @@ static int array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
 	return insn - insn_buf;
 }
 
-/* Called from eBPF program */
-static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
+void *bpf_percpu_array_lookup(struct bpf_map *map, void *key, int cpu)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	u32 index = *(u32 *)key;
@@ -239,7 +238,13 @@ static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
 	if (unlikely(index >= array->map.max_entries))
 		return NULL;
 
-	return this_cpu_ptr(array->pptrs[index & array->index_mask]);
+	return per_cpu_ptr(array->pptrs[index & array->index_mask], cpu);
+}
+
+/* Called from eBPF program */
+static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return bpf_percpu_array_lookup(map, key, smp_processor_id());
 }
 
 int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 65877967f414..c6d4699d65e8 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -2150,27 +2150,24 @@ const struct bpf_map_ops htab_lru_map_ops = {
 	.iter_seq_info = &iter_seq_info,
 };
 
-/* Called from eBPF program */
-static void *htab_percpu_map_lookup_elem(struct bpf_map *map, void *key)
+void *bpf_percpu_hash_lookup(struct bpf_map *map, void *key, int cpu)
 {
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
 	struct htab_elem *l = __htab_map_lookup_elem(map, key);
 
-	if (l)
-		return this_cpu_ptr(htab_elem_get_ptr(l, map->key_size));
+	if (l) {
+		if (htab_is_lru(htab))
+			bpf_lru_node_set_ref(&l->lru_node);
+		return per_cpu_ptr(htab_elem_get_ptr(l, map->key_size), cpu);
+	}
 	else
 		return NULL;
 }
 
-static void *htab_lru_percpu_map_lookup_elem(struct bpf_map *map, void *key)
+/* Called from eBPF program */
+static void *htab_percpu_map_lookup_elem(struct bpf_map *map, void *key)
 {
-	struct htab_elem *l = __htab_map_lookup_elem(map, key);
-
-	if (l) {
-		bpf_lru_node_set_ref(&l->lru_node);
-		return this_cpu_ptr(htab_elem_get_ptr(l, map->key_size));
-	}
-
-	return NULL;
+	return bpf_percpu_hash_lookup(map, key, smp_processor_id());
 }
 
 int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value)
@@ -2279,7 +2276,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = {
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
 	.map_get_next_key = htab_map_get_next_key,
-	.map_lookup_elem = htab_lru_percpu_map_lookup_elem,
+	.map_lookup_elem = htab_percpu_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_lru_percpu_map_lookup_and_delete_elem,
 	.map_update_elem = htab_lru_percpu_map_update_elem,
 	.map_delete_elem = htab_lru_map_delete_elem,
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index d124eed97ad7..abed4e1737f6 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -45,6 +45,30 @@ const struct bpf_func_proto bpf_map_lookup_elem_proto = {
 	.arg2_type	= ARG_PTR_TO_MAP_KEY,
 };
 
+BPF_CALL_3(bpf_map_lookup_percpu_elem, struct bpf_map *, map, void *, key,
+	   int, cpu)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_PERCPU_ARRAY:
+		return (unsigned long) bpf_percpu_array_lookup(map, key, cpu);
+	case BPF_MAP_TYPE_PERCPU_HASH:
+	case BPF_MAP_TYPE_LRU_PERCPU_HASH:
+		return (unsigned long) bpf_percpu_hash_lookup(map, key, cpu);
+	default:
+		return (unsigned long) NULL;
+	}
+}
+
+const struct bpf_func_proto bpf_map_lookup_percpu_elem_proto = {
+	.func		= bpf_map_lookup_percpu_elem,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_PTR_TO_MAP_KEY,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
 	   void *, value, u64, flags)
 {
@@ -1414,6 +1438,8 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 	switch (func_id) {
 	case BPF_FUNC_map_lookup_elem:
 		return &bpf_map_lookup_elem_proto;
+	case BPF_FUNC_map_lookup_percpu_elem:
+		return &bpf_map_lookup_percpu_elem_proto;
 	case BPF_FUNC_map_update_elem:
 		return &bpf_map_update_elem_proto;
 	case BPF_FUNC_map_delete_elem:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d175b70067b3..2d7f7c9a970d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5879,6 +5879,12 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
 			goto error;
 		break;
+	case BPF_FUNC_map_lookup_percpu_elem:
+		if (map->map_type != BPF_MAP_TYPE_PERCPU_HASH &&
+		    map->map_type != BPF_MAP_TYPE_LRU_PERCPU_HASH &&
+		    map->map_type != BPF_MAP_TYPE_PERCPU_ARRAY)
+			goto error;
+		break;
 	default:
 		break;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index fce5535579d6..015ed402c642 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1553,6 +1553,14 @@ union bpf_attr {
  * 		Map value associated to *key*, or **NULL** if no entry was
  * 		found.
  *
+ * void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, int cpu)
+ *	Description
+ *		Perform a lookup in percpu *map* for an entry associated to
+ *		*key* for the given *cpu*.
+ *	Return
+ *		Map value associated to *key* per *cpu*, or **NULL** if no entry
+ *		was found.
+ *
  * long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
  * 	Description
  * 		Add or update the value of the entry associated to *key* in
@@ -5169,6 +5177,7 @@ union bpf_attr {
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
 	FN(map_lookup_elem),		\
+	FN(map_lookup_percpu_elem),	\
 	FN(map_update_elem),		\
 	FN(map_delete_elem),		\
 	FN(probe_read),			\
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH bpf-next 6/9] cgroup: add v1 support to cgroup_get_from_id()
  2022-05-10  0:17 [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
                   ` (4 preceding siblings ...)
  2022-05-10  0:18 ` [RFC PATCH bpf-next 5/9] bpf: add bpf_map_lookup_percpu_elem() helper Yosry Ahmed
@ 2022-05-10  0:18 ` Yosry Ahmed
  2022-05-10 18:33   ` Tejun Heo
  2022-05-10  0:18 ` [RFC PATCH bpf-next 7/9] cgroup: Add cgroup_put() in !CONFIG_CGROUPS case Yosry Ahmed
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10  0:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

The current implementation of cgroup_get_from_id() only searches the
default hierarchy for the given id. Make it compatible with cgroup v1 by
looking through all the roots instead.

cgrp_dfl_root should be the first element in the list so there shouldn't
be a performance impact for cgroup v2 users (in the case of a valid id).

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 kernel/cgroup/cgroup.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index af703cfcb9d2..12700cd21973 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5970,10 +5970,16 @@ void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
  */
 struct cgroup *cgroup_get_from_id(u64 id)
 {
-	struct kernfs_node *kn;
+	struct kernfs_node *kn = NULL;
 	struct cgroup *cgrp = NULL;
+	struct cgroup_root *root;
+
+	for_each_root(root) {
+		kn = kernfs_find_and_get_node_by_id(root->kf_root, id);
+		if (kn)
+			break;
+	}
 
-	kn = kernfs_find_and_get_node_by_id(cgrp_dfl_root.kf_root, id);
 	if (!kn)
 		goto out;
 
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH bpf-next 7/9] cgroup: Add cgroup_put() in !CONFIG_CGROUPS case
  2022-05-10  0:17 [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
                   ` (5 preceding siblings ...)
  2022-05-10  0:18 ` [RFC PATCH bpf-next 6/9] cgroup: add v1 support to cgroup_get_from_id() Yosry Ahmed
@ 2022-05-10  0:18 ` Yosry Ahmed
  2022-05-10 18:25   ` Hao Luo
  2022-05-10  0:18 ` [RFC PATCH bpf-next 8/9] bpf: Introduce cgroup iter Yosry Ahmed
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10  0:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

From: Hao Luo <haoluo@google.com>

There is already a cgroup_get_from_id() in the !CONFIG_CGROUPS case,
let's have a matching cgroup_put() in !CONFIG_CGROUPS too.

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/cgroup.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5408c74d5c44..4f1d8febb9fd 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -759,6 +759,9 @@ static inline struct cgroup *cgroup_get_from_id(u64 id)
 {
 	return NULL;
 }
+
+static inline struct cgroup *cgroup_put(void)
+{}
 #endif /* !CONFIG_CGROUPS */
 
 #ifdef CONFIG_CGROUPS
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH bpf-next 8/9] bpf: Introduce cgroup iter
  2022-05-10  0:17 [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
                   ` (6 preceding siblings ...)
  2022-05-10  0:18 ` [RFC PATCH bpf-next 7/9] cgroup: Add cgroup_put() in !CONFIG_CGROUPS case Yosry Ahmed
@ 2022-05-10  0:18 ` Yosry Ahmed
  2022-05-10 18:25   ` Hao Luo
  2022-05-10 18:54   ` Tejun Heo
  2022-05-10  0:18 ` [RFC PATCH bpf-next 9/9] selftest/bpf: add a selftest for cgroup hierarchical stats Yosry Ahmed
  2022-05-13  7:16 ` [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
  9 siblings, 2 replies; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10  0:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

From: Hao Luo <haoluo@google.com>

Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
iter doesn't iterate a set of kernel objects. Instead, it is supposed to
be parameterized by a cgroup id and prints only that cgroup. So one
needs to specify a target cgroup id when attaching this iter. The target
cgroup's state can be read out via a link of this iter.

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/bpf.h            |   2 +
 include/uapi/linux/bpf.h       |   6 ++
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/cgroup_iter.c       | 148 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |   6 ++
 5 files changed, 163 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/cgroup_iter.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f6fa35ffe311..f472f43521d2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -43,6 +43,7 @@ struct kobject;
 struct mem_cgroup;
 struct module;
 struct bpf_func_state;
+struct cgroup;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1601,6 +1602,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
 
 struct bpf_iter_aux_info {
 	struct bpf_map *map;
+	struct cgroup *cgroup;
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 015ed402c642..096c521e34de 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5963,6 +5966,9 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 6caf4a61e543..07a715b54190 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
new file mode 100644
index 000000000000..86bdfe135d24
--- /dev/null
+++ b/kernel/bpf/cgroup_iter.c
@@ -0,0 +1,148 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/cgroup.h>
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+
+struct bpf_iter__cgroup {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	/* Only one session is supported. */
+	if (*pos > 0)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+
+	return *(struct cgroup **)seq->private;
+}
+
+static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter__cgroup ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = v;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, false);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	return ret;
+}
+
+static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
+{
+}
+
+static const struct seq_operations cgroup_iter_seq_ops = {
+	.start  = cgroup_iter_seq_start,
+	.next   = cgroup_iter_seq_next,
+	.stop   = cgroup_iter_seq_stop,
+	.show   = cgroup_iter_seq_show,
+};
+
+BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+
+static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	*(struct cgroup **)priv_data = aux->cgroup;
+	return 0;
+}
+
+static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
+	.seq_ops                = &cgroup_iter_seq_ops,
+	.init_seq_private       = cgroup_iter_seq_init,
+	.seq_priv_size          = sizeof(struct cgroup *),
+};
+
+static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
+				  union bpf_iter_link_info *linfo,
+				  struct bpf_iter_aux_info *aux)
+{
+	struct cgroup *cgroup;
+
+	cgroup = cgroup_get_from_id(linfo->cgroup.cgroup_id);
+	if (!cgroup)
+		return -EBUSY;
+
+	aux->cgroup = cgroup;
+	return 0;
+}
+
+static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
+{
+	if (aux->cgroup)
+		cgroup_put(aux->cgroup);
+}
+
+static void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
+					struct seq_file *seq)
+{
+	char *buf;
+
+	seq_printf(seq, "cgroup_id:\t%llu\n", cgroup_id(aux->cgroup));
+
+	buf = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!buf) {
+		seq_puts(seq, "cgroup_path:\n");
+		return;
+	}
+
+	/* If cgroup_path_ns() fails, buf will be an empty string, cgroup_path
+	 * will print nothing.
+	 *
+	 * Cgroup_path is the path in the calliing process's cgroup namespace.
+	 */
+	cgroup_path_ns(aux->cgroup, buf, sizeof(buf),
+		       current->nsproxy->cgroup_ns);
+	seq_printf(seq, "cgroup_path:\t%s\n", buf);
+	kfree(buf);
+}
+
+static int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
+					  struct bpf_link_info *info)
+{
+	info->iter.cgroup.cgroup_id = cgroup_id(aux->cgroup);
+	return 0;
+}
+
+DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
+		     struct cgroup *cgroup)
+
+static struct bpf_iter_reg bpf_cgroup_reg_info = {
+	.target			= "cgroup",
+	.attach_target		= bpf_iter_attach_cgroup,
+	.detach_target		= bpf_iter_detach_cgroup,
+	.show_fdinfo		= bpf_iter_cgroup_show_fdinfo,
+	.fill_link_info		= bpf_iter_cgroup_fill_link_info,
+	.ctx_arg_info_size	= 1,
+	.ctx_arg_info		= {
+		{ offsetof(struct bpf_iter__cgroup, cgroup),
+		  PTR_TO_BTF_ID },
+	},
+	.seq_info		= &cgroup_iter_seq_info,
+};
+
+static int __init bpf_cgroup_iter_init(void)
+{
+	bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
+	return bpf_iter_reg_target(&bpf_cgroup_reg_info);
+}
+
+late_initcall(bpf_cgroup_iter_init);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 015ed402c642..096c521e34de 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5963,6 +5966,9 @@ struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH bpf-next 9/9] selftest/bpf: add a selftest for cgroup hierarchical stats
  2022-05-10  0:17 [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
                   ` (7 preceding siblings ...)
  2022-05-10  0:18 ` [RFC PATCH bpf-next 8/9] bpf: Introduce cgroup iter Yosry Ahmed
@ 2022-05-10  0:18 ` Yosry Ahmed
  2022-05-13  7:16 ` [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
  9 siblings, 0 replies; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10  0:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	linux-kernel, netdev, bpf, cgroups, Yosry Ahmed

Add a selftest that tests the whole workflow for collecting,
aggregating, and display cgroup hierarchical stats.

The test loads tracing bpf programs at the beginning and ending of
direct reclaim to measure the vmscan latency. Per-cgroup readings are
stored in percpu maps for efficiency. When a cgroup reading is updated,
bpf_cgroup_rstat_updated() is called to add the cgroup (and the current
cpu) to the rstat updated tree. When a cgroup is added to the rstat
updated tree, all its parents are added as well. rstat makes sure
cgroups are popped in a bottom up fashion.

When an rstat flush is invoked, an rstat flusher program is called for
per-cgroup per-cpu pairs on the updated tree. The program aggregates
percpu readings to a total reading, and also propagates them to the
parent. After rstat flushing is over, the program will have been invoked
for all (cgroup, cpu) pairs that have updates as well as their parents,
so the whole hierarchy will have updated (flushed) stats.

Finally, a cgroup_iter program is pinned to a file for each cgroup.
Reading this file invokes the cgroup_iter program to flush the stats and
display them to the user.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 .../test_cgroup_hierarchical_stats.c          | 335 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../selftests/bpf/progs/cgroup_vmscan.c       | 211 +++++++++++
 3 files changed, 553 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_vmscan.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
new file mode 100644
index 000000000000..7c4d199967d7
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
@@ -0,0 +1,335 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mount.h>
+#include <unistd.h>
+#include <errno.h>
+
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+#include <test_progs.h>
+
+#include "cgroup_vmscan.skel.h"
+
+#define PAGE_SIZE 4096
+#define MB(x) (x << 20)
+
+#define BPFFS_ROOT "/sys/fs/bpf/"
+#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
+#define CGROUP_ROOT "/sys/fs/cgroup/"
+
+#define RET_IF_ERR(exp, t, f...) ({		\
+	int ___res = (exp);			\
+	if (CHECK(___res, t, f))		\
+		return ___res;			\
+})
+
+struct cgroup_path {
+	const char *name, *path;
+};
+
+#define CGROUP_PATH(p, n) {.name = #n, .path = CGROUP_ROOT#p"/"#n}
+#define CGROUP_ROOT_PATH {.name = "root", .path = CGROUP_ROOT}
+
+static struct cgroup_path cgroup_hierarchy[] = {
+	CGROUP_ROOT_PATH,
+	CGROUP_PATH(, test),
+	CGROUP_PATH(test, child1),
+	CGROUP_PATH(test, child2),
+	CGROUP_PATH(test/child1, child1_1),
+	CGROUP_PATH(test/child1, child1_2),
+	CGROUP_PATH(test/child2, child2_1),
+	CGROUP_PATH(test/child2, child2_2),
+};
+
+#define N_CGROUPS (sizeof(cgroup_hierarchy)/sizeof(struct cgroup_path))
+
+static const int non_leaf_cgroups = 4;
+static __u64 cgroup_ids[N_CGROUPS];
+
+static int duration;
+
+static __u64 cgroup_id_from_path(const char *cgroup_path)
+{
+	struct stat file_stat;
+
+	if (stat(cgroup_path, &file_stat))
+		return -1;
+	return file_stat.st_ino;
+}
+
+int write_to_file(const char *path, const char *buf, size_t size)
+{
+	int fd, len, err = 0;
+
+	fd = open(path, O_WRONLY);
+	if (fd < 0)
+		return -errno;
+	len = write(fd, buf, size);
+	if (len < 0)
+		err = -errno;
+	else if (len < size)
+		err = -1;
+	close(fd);
+	return err;
+}
+
+int read_from_file(const char *path, char *buf, size_t size)
+{
+	int fd, len;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return -errno;
+	len = read(fd, buf, size);
+	if (len >= 0)
+		buf[len] = 0;
+	close(fd);
+	return len < 0 ? -errno : 0;
+}
+
+int setup_hierarchy(void)
+{
+	int i, len;
+	char path[128];
+
+	/* Mount bpffs, and create a directory to pin cgroup_iters in */
+	RET_IF_ERR(mount("bpf", BPFFS_ROOT, "bpf", 0, NULL), "mount",
+		   "failed to mount bpffs at %s (%s)\n", BPFFS_ROOT,
+		   strerror(errno));
+	RET_IF_ERR(mkdir(BPFFS_VMSCAN, 0755), "mkdir",
+		   "failed to mkdir %s (%s)\n", BPFFS_VMSCAN, strerror(errno));
+
+	/* Mount cgroup v2 */
+	RET_IF_ERR(mount("none", CGROUP_ROOT, "cgroup2", 0, NULL),
+		   "mount", "failed to mount cgroup2 at %s (%s)\n",
+		   CGROUP_ROOT, strerror(errno));
+
+	/* Enable memory controller in cgroup v2 root */
+	len = snprintf(path, 128, "%scgroup.subtree_control", CGROUP_ROOT);
+	RET_IF_ERR(write_to_file(path, "+memory", len), "+memory",
+		   "+memory failed in root (%s)\n",
+		   strerror(errno));
+	/* Root cgroup id is 1 in v2*/
+	cgroup_ids[0] = 1;
+
+	for (i = 1; i < N_CGROUPS; i++) {
+		/* Create cgroup */
+		RET_IF_ERR(mkdir(cgroup_hierarchy[i].path, 0666),
+			   "mkdir", "failed to mkdir %s (%s)\n",
+			   cgroup_hierarchy[i].path, strerror(errno));
+
+		cgroup_ids[i] = cgroup_id_from_path(cgroup_hierarchy[i].path);
+
+		/* Enable memory controller non-leaf cgroups */
+		if (i < non_leaf_cgroups)  {
+			len = snprintf(path, 128, "%s/cgroup.subtree_control",
+				       cgroup_hierarchy[i].path);
+			RET_IF_ERR(write_to_file(path, "+memory", len),
+				   "+memory", "+memory failed in %s (%s)\n",
+				   cgroup_hierarchy[i].name, strerror(errno));
+		}
+	}
+	return 0;
+}
+
+void destroy_hierarchy(void)
+{
+	int i;
+	char path[128];
+
+	for (i = N_CGROUPS - 1; i >= 0; i--) {
+		/* Delete files in bpffs that cgroup_iters are pinned in */
+		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
+			 cgroup_hierarchy[i].name);
+		CHECK(remove(path), "remove", "failed to remove %s (%s)\n",
+		      path, strerror(errno));
+
+		if (i == 0)
+			break;
+
+		/* Delete cgroup */
+		CHECK(rmdir(cgroup_hierarchy[i].path), "rmdir",
+		      "failed to rmdir %s (%s)\n", cgroup_hierarchy[i].path,
+		      strerror(errno));
+	}
+	/* Remove created directory in bpffs */
+	CHECK(rmdir(BPFFS_VMSCAN), "rmdir", "failed to rmdir %s (%s)\n",
+	      BPFFS_VMSCAN, strerror(errno));
+	/* Unmount bpffs */
+	CHECK(umount(BPFFS_ROOT), "umount", "failed to unmount bpffs (%s)\n",
+	      strerror(errno));
+	/* Unmount cgroup v2 */
+	CHECK(umount(CGROUP_ROOT), "umount", "failed to unmount cgroup2 (%s)\n",
+	      strerror(errno));
+}
+
+void alloc_anon(size_t size)
+{
+	char *buf, *ptr;
+
+	buf = malloc(size);
+	for (ptr = buf; ptr < buf + size; ptr += PAGE_SIZE)
+		*ptr = 0;
+	free(buf);
+}
+
+int induce_vmscan(void)
+{
+	char cmd[128], path[128];
+	int i, pid, len;
+
+	/*
+	 * Set memory.high for test parent cgroup to 1 MB to throttle
+	 * allocations and invoke reclaim in children.
+	 */
+	snprintf(path, 128, "%s/memory.high", cgroup_hierarchy[1].path);
+	len = snprintf(cmd, 128, "%d", MB(1));
+	RET_IF_ERR(write_to_file(path, cmd, len), "memory.high",
+		   "failed to write to %s (%s)\n", path, strerror(errno));
+
+	/*
+	 * In every leaf cgroup, run a memory hog for a few seconds to induce
+	 * reclaim then kill it.
+	 */
+	for (i = non_leaf_cgroups; i < N_CGROUPS; i++) {
+		pid = fork();
+		if (pid == 0) {
+			pid = getpid();
+
+			/* Add child to leaf cgroup */
+			snprintf(path, 128, "%s/cgroup.procs",
+				 cgroup_hierarchy[i].path);
+			len = snprintf(cmd, 128, "%d", pid);
+			RET_IF_ERR(write_to_file(path, cmd, len),
+				   "cgroup.procs",
+				   "failed to add pid %d to cgroup %s (%s)\n",
+				   pid, cgroup_hierarchy[i].name,
+				   strerror(errno));
+
+			/* Allocate 2 MB  */
+			alloc_anon(MB(2));
+			exit(0);
+		} else {
+			/* Wait for child to cause reclaim then kill it */
+			sleep(3);
+			kill(pid, SIGKILL);
+			waitpid(pid, NULL, 0);
+		}
+	}
+	return 0;
+}
+
+int check_vmscan_stats(void)
+{
+	char buf[128], path[128];
+	int i;
+	__u64 vmscan_readings[N_CGROUPS];
+
+	for (i = 0; i < N_CGROUPS; i++) {
+		__u64 id;
+
+		/* For every cgroup, read the file generated by cgroup_iter */
+		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
+			cgroup_hierarchy[i].name);
+		RET_IF_ERR(read_from_file(path, buf, 128), "read",
+			   "failed to read from %s (%s)\n",
+			   path, strerror(errno));
+		/* Check the output file formatting */
+		ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
+				 &id, &vmscan_readings[i]), 2, "output format");
+
+		/* Check that the cgroup_id is displayed correctly */
+		ASSERT_EQ(cgroup_ids[i], id, "cgroup_id");
+		/* Check that the vmscan reading is non-zerp */
+		ASSERT_NEQ(vmscan_readings[i], 0, "vmscan_reading");
+	}
+
+	/* Check that child1 == child1_1 + child1_2 */
+	ASSERT_EQ(vmscan_readings[2], vmscan_readings[4] + vmscan_readings[5],
+		  "child1_vmscan");
+	/* Check that child2 == child2_1 + child2_2 */
+	ASSERT_EQ(vmscan_readings[3], vmscan_readings[6] + vmscan_readings[7],
+		  "child2_vmscan");
+	/* Check that test == child1 + child2 */
+	ASSERT_EQ(vmscan_readings[1], vmscan_readings[2] + vmscan_readings[3],
+		  "test_vmscan");
+	/* Check that root >= test */
+	ASSERT_GE(vmscan_readings[0], vmscan_readings[1], "root_vmscan");
+
+	return 0;
+}
+
+int setup_progs(struct cgroup_vmscan **skel)
+{
+	int i;
+	struct bpf_link *link;
+	struct cgroup_vmscan *obj;
+
+	obj = cgroup_vmscan__open_and_load();
+	if (!ASSERT_OK_PTR(obj, "open_and_load"))
+		return libbpf_get_error(obj);
+
+	/* Attach rstat flusher to memory subsystem */
+	link = bpf_program__attach_subsys(obj->progs.vmscan_flush, "memory");
+	if (!ASSERT_OK_PTR(link, "attach_subsys"))
+		return libbpf_get_error(link);
+
+	/* Attach cgroup_iter program that will dump the stats to cgroups */
+	for (i = 0; i < N_CGROUPS; i++) {
+		DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+		union bpf_iter_link_info linfo = {};
+		char path[128];
+
+		/* Create an iter link, parameterized by cgroup id */
+		linfo.cgroup.cgroup_id = cgroup_ids[i];
+		opts.link_info = &linfo;
+		opts.link_info_len = sizeof(linfo);
+		link = bpf_program__attach_iter(obj->progs.dump_vmscan, &opts);
+		if (!ASSERT_OK_PTR(link, "attach_iter"))
+			return libbpf_get_error(link);
+
+		/* Pin the link to a bpffs file */
+		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
+			 cgroup_hierarchy[i].name);
+		bpf_link__pin(link, path);
+	}
+
+	/* Attach tracing programs that will calculate vmscan delays */
+	link = bpf_program__attach(obj->progs.vmscan_start);
+	if (!ASSERT_OK_PTR(obj, "attach"))
+		return libbpf_get_error(obj);
+
+	link = bpf_program__attach(obj->progs.vmscan_end);
+	if (!ASSERT_OK_PTR(obj, "attach"))
+		return libbpf_get_error(obj);
+
+	*skel = obj;
+	return 0;
+}
+
+void destroy_progs(struct cgroup_vmscan *skel)
+{
+	cgroup_vmscan__destroy(skel);
+}
+
+void test_cgroup_hierarchical_stats(void)
+{
+	struct cgroup_vmscan *skel = NULL;
+
+	if (setup_hierarchy())
+		goto cleanup;
+	if (setup_progs(&skel))
+		goto cleanup;
+	if (induce_vmscan())
+		goto cleanup;
+	check_vmscan_stats();
+cleanup:
+	destroy_progs(skel);
+	destroy_hierarchy();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h
index 8cfaeba1ddbf..b10ad01e878a 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter.h
+++ b/tools/testing/selftests/bpf/progs/bpf_iter.h
@@ -16,6 +16,7 @@
 #define bpf_iter__bpf_map_elem bpf_iter__bpf_map_elem___not_used
 #define bpf_iter__bpf_sk_storage_map bpf_iter__bpf_sk_storage_map___not_used
 #define bpf_iter__sockmap bpf_iter__sockmap___not_used
+#define bpf_iter__cgroup bpf_iter__cgroup__not_used
 #define btf_ptr btf_ptr___not_used
 #define BTF_F_COMPACT BTF_F_COMPACT___not_used
 #define BTF_F_NONAME BTF_F_NONAME___not_used
@@ -37,6 +38,7 @@
 #undef bpf_iter__bpf_map_elem
 #undef bpf_iter__bpf_sk_storage_map
 #undef bpf_iter__sockmap
+#undef bpf_iter__cgroup
 #undef btf_ptr
 #undef BTF_F_COMPACT
 #undef BTF_F_NONAME
@@ -132,6 +134,11 @@ struct bpf_iter__sockmap {
 	struct sock *sk;
 };
 
+struct bpf_iter__cgroup {
+	struct bpf_iter_meta *meta;
+	struct cgroup *cgroup;
+} __attribute((preserve_access_index));
+
 struct btf_ptr {
 	void *ptr;
 	__u32 type_id;
diff --git a/tools/testing/selftests/bpf/progs/cgroup_vmscan.c b/tools/testing/selftests/bpf/progs/cgroup_vmscan.c
new file mode 100644
index 000000000000..41516f8263b3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_vmscan.c
@@ -0,0 +1,211 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+
+char _license[] SEC("license") = "GPL";
+
+/*
+ * Start times are stored per-task, not per-cgroup, as multiple tasks in one
+ * cgroup can perform reclain concurrently.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, __u64);
+} vmscan_start_time SEC(".maps");
+
+struct vmscan_percpu {
+	/* Previous percpu state, to figure out if we have new updates */
+	__u64 prev;
+	/* Current percpu state */
+	__u64 state;
+};
+
+struct vmscan {
+	/* State propagated through children, pending aggregation */
+	__u64 pending;
+	/* Total state, including all cpus and all children */
+	__u64 state;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
+	__uint(max_entries, 10);
+	__type(key, __u64);
+	__type(value, struct vmscan_percpu);
+} pcpu_cgroup_vmscan_elapsed SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__type(key, __u64);
+	__type(value, struct vmscan);
+} cgroup_vmscan_elapsed SEC(".maps");
+
+static inline struct cgroup *task_memcg(struct task_struct *task)
+{
+	return BPF_CORE_READ(task, cgroups, subsys[memory_cgrp_id], cgroup);
+}
+
+static inline uint64_t cgroup_id(struct cgroup *cgrp)
+{
+	return BPF_CORE_READ(cgrp, kn, id);
+}
+
+static inline int create_vmscan_percpu_elem(__u64 cg_id, __u64 state)
+{
+	struct vmscan_percpu pcpu_init = {.state = state, .prev = 0};
+
+	if (bpf_map_update_elem(&pcpu_cgroup_vmscan_elapsed, &cg_id,
+				&pcpu_init, BPF_NOEXIST)) {
+		bpf_printk("failed to create pcpu entry for cgroup %llu\n"
+			   , cg_id);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int create_vmscan_elem(__u64 cg_id, __u64 state, __u64 pending)
+{
+	struct vmscan init = {.state = state, .pending = pending};
+
+	if (bpf_map_update_elem(&cgroup_vmscan_elapsed, &cg_id,
+				&init, BPF_NOEXIST)) {
+		bpf_printk("failed to create entry for cgroup %llu\n"
+			   , cg_id);
+		return 1;
+	}
+	return 0;
+}
+
+SEC("raw_tp/mm_vmscan_memcg_reclaim_begin")
+int vmscan_start(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct task_struct *task = bpf_get_current_task_btf();
+	__u64 *start_time_ptr;
+
+	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, task, 0,
+					  BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!start_time_ptr) {
+		bpf_printk("error retrieving storage\n");
+		return 0;
+	}
+
+	*start_time_ptr = bpf_ktime_get_ns();
+	return 0;
+}
+
+SEC("raw_tp/mm_vmscan_memcg_reclaim_end")
+int vmscan_end(struct lruvec *lruvec, struct scan_control *sc)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct task_struct *current = bpf_get_current_task_btf();
+	struct cgroup *cgrp = task_memcg(current);
+	__u64 *start_time_ptr;
+	__u64 current_elapsed, cg_id;
+	__u64 end_time = bpf_ktime_get_ns();
+
+	/* cgrp may not have memory controller enabled */
+	if (!cgrp)
+		return 0;
+
+	cg_id = cgroup_id(cgrp);
+	start_time_ptr = bpf_task_storage_get(&vmscan_start_time, current, 0,
+					      BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!start_time_ptr) {
+		bpf_printk("error retrieving storage local storage\n");
+		return 0;
+	}
+
+	current_elapsed = end_time - *start_time_ptr;
+	pcpu_stat = bpf_map_lookup_elem(&pcpu_cgroup_vmscan_elapsed,
+					&cg_id);
+	if (pcpu_stat)
+		__sync_fetch_and_add(&pcpu_stat->state, current_elapsed);
+	else
+		create_vmscan_percpu_elem(cg_id, current_elapsed);
+
+	bpf_cgroup_rstat_updated(cgrp);
+	return 0;
+}
+
+SEC("cgroup_subsys/rstat")
+int vmscan_flush(struct bpf_rstat_ctx *ctx)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct vmscan *total_stat, *parent_stat;
+	__u64 *pcpu_vmscan;
+	__u64 state;
+	__u64 delta = 0;
+	__u64 cg_id = ctx->cgroup_id;
+	__u64 parent_cg_id = ctx->parent_cgroup_id;
+	__s32 cpu = ctx->cpu;
+
+	/* Add CPU changes on this level since the last flush */
+	pcpu_stat = bpf_map_lookup_percpu_elem(&pcpu_cgroup_vmscan_elapsed,
+					       &cg_id, cpu);
+	if (pcpu_stat) {
+		state = pcpu_stat->state;
+		delta += state - pcpu_stat->prev;
+		pcpu_stat->prev = state;
+	}
+
+	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		create_vmscan_elem(cg_id, delta, 0);
+		goto update_parent;
+	}
+
+	/* Collect pending stats from subtree */
+	if (total_stat->pending) {
+		delta += total_stat->pending;
+		total_stat->pending = 0;
+	}
+
+	/* Propagate changes to this cgroup's total */
+	total_stat->state += delta;
+
+update_parent:
+	/* Skip if there are no changes to propagate, or no parent */
+	if (!delta || !parent_cg_id)
+		return 0;
+
+	/* Propagate changes to cgroup's parent */
+	parent_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed,
+					  &parent_cg_id);
+	if (parent_stat)
+		parent_stat->pending += delta;
+	else
+		create_vmscan_elem(parent_cg_id, 0, delta);
+
+	return 0;
+}
+
+SEC("iter/cgroup")
+int dump_vmscan(struct bpf_iter__cgroup *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct cgroup *cgroup = ctx->cgroup;
+	struct vmscan *total_stat;
+	__u64 cg_id = cgroup_id(cgroup);
+
+	/* Flush the stats to make sure we get the most updated numbers */
+	bpf_cgroup_rstat_flush(cgroup);
+
+	total_stat = bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		bpf_printk("error finding stats for cgroup %llu\n", cg_id);
+		return 0;
+	}
+	BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
+		       cg_id, total_stat->state);
+	return 0;
+}
+
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type
  2022-05-10  0:17 ` [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type Yosry Ahmed
@ 2022-05-10 18:07   ` Yosry Ahmed
  2022-05-10 19:21     ` Yosry Ahmed
  2022-05-10 18:44   ` Tejun Heo
  1 sibling, 1 reply; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10 18:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, cgroups

On Mon, May 9, 2022 at 5:18 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> This patch introduces a new bpf program type CGROUP_SUBSYS_RSTAT,
> with new corresponding link and attach types.
>
> The main purpose of these programs is to allow BPF programs to collect
> and maintain hierarchical cgroup stats easily and efficiently by making
> using of the rstat framework in the kernel.
>
> Those programs attach to a cgroup subsystem. They typically contain logic
> to aggregate per-cpu and per-cgroup stats collected by other BPF programs.
>
> Currently, only rstat flusher programs can be attached to cgroup
> subsystems, but this can be extended later if a use-case arises.
>
> See the selftest in the final patch for a practical example.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  include/linux/bpf-cgroup-subsys.h |  30 ++++++
>  include/linux/bpf_types.h         |   2 +
>  include/linux/cgroup-defs.h       |   4 +
>  include/uapi/linux/bpf.h          |  12 +++
>  kernel/bpf/Makefile               |   1 +
>  kernel/bpf/cgroup_subsys.c        | 166 ++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c              |   6 ++
>  kernel/cgroup/cgroup.c            |   1 +
>  tools/include/uapi/linux/bpf.h    |  12 +++
>  9 files changed, 234 insertions(+)
>  create mode 100644 include/linux/bpf-cgroup-subsys.h
>  create mode 100644 kernel/bpf/cgroup_subsys.c
>
> diff --git a/include/linux/bpf-cgroup-subsys.h b/include/linux/bpf-cgroup-subsys.h
> new file mode 100644
> index 000000000000..4dcde06b5599
> --- /dev/null
> +++ b/include/linux/bpf-cgroup-subsys.h
> @@ -0,0 +1,30 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright 2022 Google LLC.
> + */
> +#ifndef _BPF_CGROUP_SUBSYS_H_
> +#define _BPF_CGROUP_SUBSYS_H_
> +
> +#include <linux/bpf.h>
> +
> +struct cgroup_subsys_bpf {
> +       /* Head of the list of BPF rstat flushers attached to this subsystem */
> +       struct list_head rstat_flushers;
> +       spinlock_t flushers_lock;
> +};
> +
> +struct bpf_subsys_rstat_flusher {
> +       struct bpf_prog *prog;
> +       /* List of BPF rtstat flushers, anchored at subsys->bpf */
> +       struct list_head list;
> +};
> +
> +struct bpf_cgroup_subsys_link {
> +       struct bpf_link link;
> +       struct cgroup_subsys *ss;
> +};
> +
> +int cgroup_subsys_bpf_link_attach(const union bpf_attr *attr,
> +                                 struct bpf_prog *prog);
> +

In the next version I will make sure everything here is also defined
for when CONFIG_BPF_SYSCALL is not set, and move the structs that can
be moved to the cc file there.

> +#endif  // _BPF_CGROUP_SUBSYS_H_
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 3e24ad0c4b3c..854ee958b0e4 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -56,6 +56,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl,
>               struct bpf_sysctl, struct bpf_sysctl_kern)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt,
>               struct bpf_sockopt, struct bpf_sockopt_kern)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT, cgroup_subsys_rstat,
> +             struct bpf_rstat_ctx, struct bpf_rstat_ctx)
>  #endif
>  #ifdef CONFIG_BPF_LIRC_MODE2
>  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 1bfcfb1af352..3bd6eed1fa13 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -20,6 +20,7 @@
>  #include <linux/u64_stats_sync.h>
>  #include <linux/workqueue.h>
>  #include <linux/bpf-cgroup-defs.h>
> +#include <linux/bpf-cgroup-subsys.h>
>  #include <linux/psi_types.h>
>
>  #ifdef CONFIG_CGROUPS
> @@ -706,6 +707,9 @@ struct cgroup_subsys {
>          * specifies the mask of subsystems that this one depends on.
>          */
>         unsigned int depends_on;
> +
> +       /* used to store bpf programs.*/
> +       struct cgroup_subsys_bpf bpf;
>  };
>
>  extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d14b10b85e51..0f4855fa85db 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -952,6 +952,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_LSM,
>         BPF_PROG_TYPE_SK_LOOKUP,
>         BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> +       BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT,
>  };
>
>  enum bpf_attach_type {
> @@ -998,6 +999,7 @@ enum bpf_attach_type {
>         BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
>         BPF_PERF_EVENT,
>         BPF_TRACE_KPROBE_MULTI,
> +       BPF_CGROUP_SUBSYS_RSTAT,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1013,6 +1015,7 @@ enum bpf_link_type {
>         BPF_LINK_TYPE_XDP = 6,
>         BPF_LINK_TYPE_PERF_EVENT = 7,
>         BPF_LINK_TYPE_KPROBE_MULTI = 8,
> +       BPF_LINK_TYPE_CGROUP_SUBSYS = 9,
>
>         MAX_BPF_LINK_TYPE,
>  };
> @@ -1482,6 +1485,9 @@ union bpf_attr {
>                                  */
>                                 __u64           bpf_cookie;
>                         } perf_event;
> +                       struct {
> +                               __u64           name;
> +                       } cgroup_subsys;
>                         struct {
>                                 __u32           flags;
>                                 __u32           cnt;
> @@ -6324,6 +6330,12 @@ struct bpf_cgroup_dev_ctx {
>         __u32 minor;
>  };
>
> +struct bpf_rstat_ctx {
> +       __u64 cgroup_id;
> +       __u64 parent_cgroup_id; /* 0 if root */
> +       __s32 cpu;
> +};
> +
>  struct bpf_raw_tracepoint_args {
>         __u64 args[0];
>  };
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index c1a9be6a4b9f..6caf4a61e543 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -25,6 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
>  obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
>  endif
>  obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> +obj-$(CONFIG_CGROUP_BPF) += cgroup_subsys.o

In the next version I will replace this with:
ifeq ($(CONFIG_CGROUP),y)
obj-$(CONFIG_BPF_SYSCALL) += cgroup_subsys.o
endif

, as this program type doesn't attach to cgroups and does not depend
on CONFIG_CGROUP_BPF, only CONFIG_CGROUP and CONFIG_BPF_SYSCALL.

>  ifeq ($(CONFIG_INET),y)
>  obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
>  endif
> diff --git a/kernel/bpf/cgroup_subsys.c b/kernel/bpf/cgroup_subsys.c
> new file mode 100644
> index 000000000000..9673ce6aa84a
> --- /dev/null
> +++ b/kernel/bpf/cgroup_subsys.c
> @@ -0,0 +1,166 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Functions to manage eBPF programs attached to cgroup subsystems
> + *
> + * Copyright 2022 Google LLC.
> + */
> +
> +#include <linux/bpf-cgroup-subsys.h>
> +#include <linux/filter.h>
> +
> +#include "../cgroup/cgroup-internal.h"
> +
> +
> +static int cgroup_subsys_bpf_attach(struct cgroup_subsys *ss, struct bpf_prog *prog)
> +{
> +       struct bpf_subsys_rstat_flusher *rstat_flusher;
> +
> +       rstat_flusher = kmalloc(sizeof(*rstat_flusher), GFP_KERNEL);
> +       if (!rstat_flusher)
> +               return -ENOMEM;
> +       rstat_flusher->prog = prog;
> +
> +       spin_lock(&ss->bpf.flushers_lock);
> +       list_add(&rstat_flusher->list, &ss->bpf.rstat_flushers);
> +       spin_unlock(&ss->bpf.flushers_lock);
> +
> +       return 0;
> +}
> +
> +static void cgroup_subsys_bpf_detach(struct cgroup_subsys *ss, struct bpf_prog *prog)
> +{
> +       struct bpf_subsys_rstat_flusher *rstat_flusher = NULL;
> +
> +       spin_lock(&ss->bpf.flushers_lock);
> +       list_for_each_entry(rstat_flusher, &ss->bpf.rstat_flushers, list)
> +               if (rstat_flusher->prog == prog)
> +                       break;
> +
> +       if (rstat_flusher) {
> +               list_del(&rstat_flusher->list);
> +               bpf_prog_put(rstat_flusher->prog);
> +               kfree(rstat_flusher);
> +       }
> +       spin_unlock(&ss->bpf.flushers_lock);
> +}
> +
> +static void bpf_cgroup_subsys_link_release(struct bpf_link *link)
> +{
> +       struct bpf_cgroup_subsys_link *ss_link = container_of(link,
> +                                                      struct bpf_cgroup_subsys_link,
> +                                                      link);
> +       if (ss_link->ss) {
> +               cgroup_subsys_bpf_detach(ss_link->ss, ss_link->link.prog);
> +               ss_link->ss = NULL;
> +       }
> +}
> +
> +static int bpf_cgroup_subsys_link_detach(struct bpf_link *link)
> +{
> +       bpf_cgroup_subsys_link_release(link);
> +       return 0;
> +}
> +
> +static void bpf_cgroup_subsys_link_dealloc(struct bpf_link *link)
> +{
> +       struct bpf_cgroup_subsys_link *ss_link = container_of(link,
> +                                                      struct bpf_cgroup_subsys_link,
> +                                                      link);
> +       kfree(ss_link);
> +}
> +
> +static const struct bpf_link_ops bpf_cgroup_subsys_link_lops = {
> +       .detach = bpf_cgroup_subsys_link_detach,
> +       .release = bpf_cgroup_subsys_link_release,
> +       .dealloc = bpf_cgroup_subsys_link_dealloc,
> +};
> +
> +int cgroup_subsys_bpf_link_attach(const union bpf_attr *attr,
> +                                 struct bpf_prog *prog)
> +{
> +       struct bpf_link_primer link_primer;
> +       struct bpf_cgroup_subsys_link *link;
> +       struct cgroup_subsys *ss, *attach_ss = NULL;
> +       const char __user *ss_name_user;
> +       char ss_name[MAX_CGROUP_TYPE_NAMELEN];
> +       int ssid, err;
> +
> +       if (attr->link_create.target_fd || attr->link_create.flags)
> +               return -EINVAL;
> +
> +       ss_name_user = u64_to_user_ptr(attr->link_create.cgroup_subsys.name);
> +       if (strncpy_from_user(ss_name, ss_name_user, sizeof(ss_name) - 1) < 0)
> +               return -EFAULT;
> +
> +       for_each_subsys(ss, ssid)
> +               if (!strcmp(ss_name, ss->name) ||
> +                   !strcmp(ss_name, ss->legacy_name))
> +                       attach_ss = ss;
> +
> +       if (!attach_ss)
> +               return -EINVAL;
> +
> +       link = kzalloc(sizeof(*link), GFP_USER);
> +       if (!link)
> +               return -ENOMEM;
> +
> +       bpf_link_init(&link->link, BPF_LINK_TYPE_CGROUP_SUBSYS,
> +                     &bpf_cgroup_subsys_link_lops,
> +                     prog);
> +       link->ss = attach_ss;
> +
> +       err = bpf_link_prime(&link->link, &link_primer);
> +       if (err) {
> +               kfree(link);
> +               return err;
> +       }
> +
> +       err = cgroup_subsys_bpf_attach(attach_ss, prog);
> +       if (err) {
> +               bpf_link_cleanup(&link_primer);
> +               return err;
> +       }
> +
> +       return bpf_link_settle(&link_primer);
> +}
> +
> +static const struct bpf_func_proto *
> +cgroup_subsys_rstat_func_proto(enum bpf_func_id func_id,
> +                              const struct bpf_prog *prog)
> +{
> +       return bpf_base_func_proto(func_id);
> +}
> +
> +static bool cgroup_subsys_rstat_is_valid_access(int off, int size,
> +                                          enum bpf_access_type type,
> +                                          const struct bpf_prog *prog,
> +                                          struct bpf_insn_access_aux *info)
> +{
> +       if (type == BPF_WRITE)
> +               return false;
> +
> +       if (off < 0 || off + size > sizeof(struct bpf_rstat_ctx))
> +               return false;
> +       /* The verifier guarantees that size > 0 */
> +       if (off % size != 0)
> +               return false;
> +
> +       switch (off) {
> +       case offsetof(struct bpf_rstat_ctx, cgroup_id):
> +               return size == sizeof(__u64);
> +       case offsetof(struct bpf_rstat_ctx, parent_cgroup_id):
> +               return size == sizeof(__u64);
> +       case offsetof(struct bpf_rstat_ctx, cpu):
> +               return size == sizeof(__s32);
> +       default:
> +               return false;
> +       }
> +}
> +
> +const struct bpf_prog_ops cgroup_subsys_rstat_prog_ops = {
> +};
> +
> +const struct bpf_verifier_ops cgroup_subsys_rstat_verifier_ops = {
> +       .get_func_proto         = cgroup_subsys_rstat_func_proto,
> +       .is_valid_access        = cgroup_subsys_rstat_is_valid_access,
> +};
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index cdaa1152436a..48149c54d969 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -3,6 +3,7 @@
>   */
>  #include <linux/bpf.h>
>  #include <linux/bpf-cgroup.h>
> +#include <linux/bpf-cgroup-subsys.h>
>  #include <linux/bpf_trace.h>
>  #include <linux/bpf_lirc.h>
>  #include <linux/bpf_verifier.h>
> @@ -3194,6 +3195,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
>                 return BPF_PROG_TYPE_SK_LOOKUP;
>         case BPF_XDP:
>                 return BPF_PROG_TYPE_XDP;
> +       case BPF_CGROUP_SUBSYS_RSTAT:
> +               return BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT;
>         default:
>                 return BPF_PROG_TYPE_UNSPEC;
>         }
> @@ -4341,6 +4344,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>                 else
>                         ret = bpf_kprobe_multi_link_attach(attr, prog);
>                 break;
> +       case BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT:
> +               ret = cgroup_subsys_bpf_link_attach(attr, prog);
> +               break;
>         default:
>                 ret = -EINVAL;
>         }
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index adb820e98f24..7b1448013009 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5745,6 +5745,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
>
>         idr_init(&ss->css_idr);
>         INIT_LIST_HEAD(&ss->cfts);
> +       INIT_LIST_HEAD(&ss->bpf.rstat_flushers);
>
>         /* Create the root cgroup state for this subsystem */
>         ss->root = &cgrp_dfl_root;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index d14b10b85e51..0f4855fa85db 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -952,6 +952,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_LSM,
>         BPF_PROG_TYPE_SK_LOOKUP,
>         BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> +       BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT,
>  };
>
>  enum bpf_attach_type {
> @@ -998,6 +999,7 @@ enum bpf_attach_type {
>         BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
>         BPF_PERF_EVENT,
>         BPF_TRACE_KPROBE_MULTI,
> +       BPF_CGROUP_SUBSYS_RSTAT,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -1013,6 +1015,7 @@ enum bpf_link_type {
>         BPF_LINK_TYPE_XDP = 6,
>         BPF_LINK_TYPE_PERF_EVENT = 7,
>         BPF_LINK_TYPE_KPROBE_MULTI = 8,
> +       BPF_LINK_TYPE_CGROUP_SUBSYS = 9,
>
>         MAX_BPF_LINK_TYPE,
>  };
> @@ -1482,6 +1485,9 @@ union bpf_attr {
>                                  */
>                                 __u64           bpf_cookie;
>                         } perf_event;
> +                       struct {
> +                               __u64           name;
> +                       } cgroup_subsys;
>                         struct {
>                                 __u32           flags;
>                                 __u32           cnt;
> @@ -6324,6 +6330,12 @@ struct bpf_cgroup_dev_ctx {
>         __u32 minor;
>  };
>
> +struct bpf_rstat_ctx {
> +       __u64 cgroup_id;
> +       __u64 parent_cgroup_id; /* 0 if root */
> +       __s32 cpu;
> +};
> +
>  struct bpf_raw_tracepoint_args {
>         __u64 args[0];
>  };
> --
> 2.36.0.512.ge40c2bad7a-goog
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 8/9] bpf: Introduce cgroup iter
  2022-05-10  0:18 ` [RFC PATCH bpf-next 8/9] bpf: Introduce cgroup iter Yosry Ahmed
@ 2022-05-10 18:25   ` Hao Luo
  2022-05-10 18:54   ` Tejun Heo
  1 sibling, 0 replies; 30+ messages in thread
From: Hao Luo @ 2022-05-10 18:25 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

On Mon, May 9, 2022 at 5:18 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> From: Hao Luo <haoluo@google.com>
>
> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> be parameterized by a cgroup id and prints only that cgroup. So one
> needs to specify a target cgroup id when attaching this iter. The target
> cgroup's state can be read out via a link of this iter.
>
> Signed-off-by: Hao Luo <haoluo@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  include/linux/bpf.h            |   2 +
>  include/uapi/linux/bpf.h       |   6 ++
>  kernel/bpf/Makefile            |   2 +-
>  kernel/bpf/cgroup_iter.c       | 148 +++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |   6 ++
>  5 files changed, 163 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/cgroup_iter.c
>

Thanks Yosry for posting this patch! Dear reviewers, this is v2 of the
cgroup_iter change I sent previously at

https://lore.kernel.org/bpf/20220225234339.2386398-9-haoluo@google.com/

v1 - > v2:
- Getting the cgroup's reference at the time at attaching, instead of
at the time when iterating. (Yonghong) (context [1])
- Remove .init_seq_private and .fini_seq_private callbacks for
cgroup_iter. They are not needed now. (Yonghong)

[1] https://lore.kernel.org/bpf/f780fc3a-dbc2-986c-d5a0-6b0ef1c4311f@fb.com/

Hao

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 7/9] cgroup: Add cgroup_put() in !CONFIG_CGROUPS case
  2022-05-10  0:18 ` [RFC PATCH bpf-next 7/9] cgroup: Add cgroup_put() in !CONFIG_CGROUPS case Yosry Ahmed
@ 2022-05-10 18:25   ` Hao Luo
  0 siblings, 0 replies; 30+ messages in thread
From: Hao Luo @ 2022-05-10 18:25 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Tejun Heo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

On Mon, May 9, 2022 at 5:18 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> From: Hao Luo <haoluo@google.com>
>
> There is already a cgroup_get_from_id() in the !CONFIG_CGROUPS case,
> let's have a matching cgroup_put() in !CONFIG_CGROUPS too.
>
> Signed-off-by: Hao Luo <haoluo@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  include/linux/cgroup.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 5408c74d5c44..4f1d8febb9fd 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
[...]
> +
> +static inline struct cgroup *cgroup_put(void)
> +{}

Sorry Yosry, the return type and parameter type are mixed up. I will
fix it and send you an updated version.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 6/9] cgroup: add v1 support to cgroup_get_from_id()
  2022-05-10  0:18 ` [RFC PATCH bpf-next 6/9] cgroup: add v1 support to cgroup_get_from_id() Yosry Ahmed
@ 2022-05-10 18:33   ` Tejun Heo
  2022-05-10 18:36     ` Yosry Ahmed
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2022-05-10 18:33 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

On Tue, May 10, 2022 at 12:18:04AM +0000, Yosry Ahmed wrote:
> The current implementation of cgroup_get_from_id() only searches the
> default hierarchy for the given id. Make it compatible with cgroup v1 by
> looking through all the roots instead.
> 
> cgrp_dfl_root should be the first element in the list so there shouldn't
> be a performance impact for cgroup v2 users (in the case of a valid id).
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  kernel/cgroup/cgroup.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index af703cfcb9d2..12700cd21973 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5970,10 +5970,16 @@ void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
>   */
>  struct cgroup *cgroup_get_from_id(u64 id)
>  {
> -	struct kernfs_node *kn;
> +	struct kernfs_node *kn = NULL;
>  	struct cgroup *cgrp = NULL;
> +	struct cgroup_root *root;
> +
> +	for_each_root(root) {
> +		kn = kernfs_find_and_get_node_by_id(root->kf_root, id);
> +		if (kn)
> +			break;
> +	}

I can't see how this can work. You're smashing together separate namespaces
and the same IDs can exist across multiple of these hierarchies. You'd need
a bigger surgery to make this work for cgroup1 which would prolly involve
complications around 32bit ino's and file handle support too, which I'm not
likely to ack, so please give it up on adding these things to cgroup1.

Nacked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 6/9] cgroup: add v1 support to cgroup_get_from_id()
  2022-05-10 18:33   ` Tejun Heo
@ 2022-05-10 18:36     ` Yosry Ahmed
  0 siblings, 0 replies; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10 18:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, cgroups

On Tue, May 10, 2022 at 11:34 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Tue, May 10, 2022 at 12:18:04AM +0000, Yosry Ahmed wrote:
> > The current implementation of cgroup_get_from_id() only searches the
> > default hierarchy for the given id. Make it compatible with cgroup v1 by
> > looking through all the roots instead.
> >
> > cgrp_dfl_root should be the first element in the list so there shouldn't
> > be a performance impact for cgroup v2 users (in the case of a valid id).
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >  kernel/cgroup/cgroup.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index af703cfcb9d2..12700cd21973 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -5970,10 +5970,16 @@ void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
> >   */
> >  struct cgroup *cgroup_get_from_id(u64 id)
> >  {
> > -     struct kernfs_node *kn;
> > +     struct kernfs_node *kn = NULL;
> >       struct cgroup *cgrp = NULL;
> > +     struct cgroup_root *root;
> > +
> > +     for_each_root(root) {
> > +             kn = kernfs_find_and_get_node_by_id(root->kf_root, id);
> > +             if (kn)
> > +                     break;
> > +     }
>
> I can't see how this can work. You're smashing together separate namespaces
> and the same IDs can exist across multiple of these hierarchies. You'd need
> a bigger surgery to make this work for cgroup1 which would prolly involve
> complications around 32bit ino's and file handle support too, which I'm not
> likely to ack, so please give it up on adding these things to cgroup1.
>
> Nacked-by: Tejun Heo <tj@kernel.org>
>
> Thanks.

Completely understandable. I sent this patch knowing that it likely
will not be accepted, with hopes of hearing feedback on whether this
can be done in a simple way or not. Looks like I got my answer, so
thanks for the info!

Will drop this patch in the incoming versions.

>
> --
> tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type
  2022-05-10  0:17 ` [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type Yosry Ahmed
  2022-05-10 18:07   ` Yosry Ahmed
@ 2022-05-10 18:44   ` Tejun Heo
  2022-05-10 19:34     ` Yosry Ahmed
  1 sibling, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2022-05-10 18:44 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

Hello,

On Tue, May 10, 2022 at 12:17:59AM +0000, Yosry Ahmed wrote:
> @@ -706,6 +707,9 @@ struct cgroup_subsys {
>  	 * specifies the mask of subsystems that this one depends on.
>  	 */
>  	unsigned int depends_on;
> +
> +	/* used to store bpf programs.*/
> +	struct cgroup_subsys_bpf bpf;
>  };

Care to elaborate on rationales around associating this with a specific
cgroup_subsys rather than letting it walk cgroups and access whatever csses
as needed? I don't think it's a wrong approach or anything but I can think
of plenty of things that would be interesting without being associated with
a specific subsystem - even all the cpu usage statistics are built to in the
cgroup core and given how e.g. systemd uses cgroup to organize the
applications in the system whether resource control is active or not, there
are a lot of info one can gather about those without being associated with a
specific subsystem.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 2/9] cgroup: bpf: flush bpf stats on rstat flush
  2022-05-10  0:18 ` [RFC PATCH bpf-next 2/9] cgroup: bpf: flush bpf stats on rstat flush Yosry Ahmed
@ 2022-05-10 18:45   ` Tejun Heo
  0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2022-05-10 18:45 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

On Tue, May 10, 2022 at 12:18:00AM +0000, Yosry Ahmed wrote:
> When a cgroup is popped from the rstat updated tree, subsystems rstat
> flushers are run through the css_rstat_flush() callback. Also run bpf
> flushers for all subsystems that have at least one bpf rstat flusher
> attached, and are enabled for this cgroup.
> 
> A list of subsystems that have attached rstat flushers is maintained to
> avoid looping through all subsystems for all cpus for every cgroup that
> is being popped from the updated tree. Since we introduce a lock here to
> protect this list, also use it to protect rstat_flushers lists inside
> each subsystem (since they both need to locked together anyway), and get
> read of the locks in struct cgroup_subsys_bpf.
> 
> rstat flushers are run for any enabled subsystem that has flushers
> attached, even if it does not subscribe to css flushing through
> css_rstat_flush(). This gives flexibility for bpf programs to collect
> stats for any subsystem, regardless of the implementation changes in the
> kernel.

Yeah, again, the fact that these things are associated with a speicfic
subsystem feels a bit jarring to me. Let's get that resolved first.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 8/9] bpf: Introduce cgroup iter
  2022-05-10  0:18 ` [RFC PATCH bpf-next 8/9] bpf: Introduce cgroup iter Yosry Ahmed
  2022-05-10 18:25   ` Hao Luo
@ 2022-05-10 18:54   ` Tejun Heo
  2022-05-10 21:12     ` Hao Luo
  1 sibling, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2022-05-10 18:54 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

Hello,

On Tue, May 10, 2022 at 12:18:06AM +0000, Yosry Ahmed wrote:
> From: Hao Luo <haoluo@google.com>
> 
> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> be parameterized by a cgroup id and prints only that cgroup. So one
> needs to specify a target cgroup id when attaching this iter. The target
> cgroup's state can be read out via a link of this iter.

Is there a reason why this can't be a proper iterator which supports
lseek64() to locate a specific cgroup?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type
  2022-05-10 18:07   ` Yosry Ahmed
@ 2022-05-10 19:21     ` Yosry Ahmed
  0 siblings, 0 replies; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10 19:21 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, cgroups

On Tue, May 10, 2022 at 11:07 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, May 9, 2022 at 5:18 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > This patch introduces a new bpf program type CGROUP_SUBSYS_RSTAT,
> > with new corresponding link and attach types.
> >
> > The main purpose of these programs is to allow BPF programs to collect
> > and maintain hierarchical cgroup stats easily and efficiently by making
> > using of the rstat framework in the kernel.
> >
> > Those programs attach to a cgroup subsystem. They typically contain logic
> > to aggregate per-cpu and per-cgroup stats collected by other BPF programs.
> >
> > Currently, only rstat flusher programs can be attached to cgroup
> > subsystems, but this can be extended later if a use-case arises.
> >
> > See the selftest in the final patch for a practical example.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >  include/linux/bpf-cgroup-subsys.h |  30 ++++++
> >  include/linux/bpf_types.h         |   2 +
> >  include/linux/cgroup-defs.h       |   4 +
> >  include/uapi/linux/bpf.h          |  12 +++
> >  kernel/bpf/Makefile               |   1 +
> >  kernel/bpf/cgroup_subsys.c        | 166 ++++++++++++++++++++++++++++++
> >  kernel/bpf/syscall.c              |   6 ++
> >  kernel/cgroup/cgroup.c            |   1 +
> >  tools/include/uapi/linux/bpf.h    |  12 +++
> >  9 files changed, 234 insertions(+)
> >  create mode 100644 include/linux/bpf-cgroup-subsys.h
> >  create mode 100644 kernel/bpf/cgroup_subsys.c
> >
> > diff --git a/include/linux/bpf-cgroup-subsys.h b/include/linux/bpf-cgroup-subsys.h
> > new file mode 100644
> > index 000000000000..4dcde06b5599
> > --- /dev/null
> > +++ b/include/linux/bpf-cgroup-subsys.h
> > @@ -0,0 +1,30 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright 2022 Google LLC.
> > + */
> > +#ifndef _BPF_CGROUP_SUBSYS_H_
> > +#define _BPF_CGROUP_SUBSYS_H_
> > +
> > +#include <linux/bpf.h>
> > +
> > +struct cgroup_subsys_bpf {
> > +       /* Head of the list of BPF rstat flushers attached to this subsystem */
> > +       struct list_head rstat_flushers;
> > +       spinlock_t flushers_lock;
> > +};
> > +
> > +struct bpf_subsys_rstat_flusher {
> > +       struct bpf_prog *prog;
> > +       /* List of BPF rtstat flushers, anchored at subsys->bpf */
> > +       struct list_head list;
> > +};
> > +
> > +struct bpf_cgroup_subsys_link {
> > +       struct bpf_link link;
> > +       struct cgroup_subsys *ss;
> > +};
> > +
> > +int cgroup_subsys_bpf_link_attach(const union bpf_attr *attr,
> > +                                 struct bpf_prog *prog);
> > +
>
> In the next version I will make sure everything here is also defined
> for when CONFIG_BPF_SYSCALL is not set, and move the structs that can
> be moved to the cc file there.
>
> > +#endif  // _BPF_CGROUP_SUBSYS_H_
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index 3e24ad0c4b3c..854ee958b0e4 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -56,6 +56,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl,
> >               struct bpf_sysctl, struct bpf_sysctl_kern)
> >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt,
> >               struct bpf_sockopt, struct bpf_sockopt_kern)
> > +BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT, cgroup_subsys_rstat,
> > +             struct bpf_rstat_ctx, struct bpf_rstat_ctx)
> >  #endif
> >  #ifdef CONFIG_BPF_LIRC_MODE2
> >  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
> > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> > index 1bfcfb1af352..3bd6eed1fa13 100644
> > --- a/include/linux/cgroup-defs.h
> > +++ b/include/linux/cgroup-defs.h
> > @@ -20,6 +20,7 @@
> >  #include <linux/u64_stats_sync.h>
> >  #include <linux/workqueue.h>
> >  #include <linux/bpf-cgroup-defs.h>
> > +#include <linux/bpf-cgroup-subsys.h>
> >  #include <linux/psi_types.h>
> >
> >  #ifdef CONFIG_CGROUPS
> > @@ -706,6 +707,9 @@ struct cgroup_subsys {
> >          * specifies the mask of subsystems that this one depends on.
> >          */
> >         unsigned int depends_on;
> > +
> > +       /* used to store bpf programs.*/
> > +       struct cgroup_subsys_bpf bpf;
> >  };
> >
> >  extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index d14b10b85e51..0f4855fa85db 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -952,6 +952,7 @@ enum bpf_prog_type {
> >         BPF_PROG_TYPE_LSM,
> >         BPF_PROG_TYPE_SK_LOOKUP,
> >         BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> > +       BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT,
> >  };
> >
> >  enum bpf_attach_type {
> > @@ -998,6 +999,7 @@ enum bpf_attach_type {
> >         BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> >         BPF_PERF_EVENT,
> >         BPF_TRACE_KPROBE_MULTI,
> > +       BPF_CGROUP_SUBSYS_RSTAT,
> >         __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > @@ -1013,6 +1015,7 @@ enum bpf_link_type {
> >         BPF_LINK_TYPE_XDP = 6,
> >         BPF_LINK_TYPE_PERF_EVENT = 7,
> >         BPF_LINK_TYPE_KPROBE_MULTI = 8,
> > +       BPF_LINK_TYPE_CGROUP_SUBSYS = 9,
> >
> >         MAX_BPF_LINK_TYPE,
> >  };
> > @@ -1482,6 +1485,9 @@ union bpf_attr {
> >                                  */
> >                                 __u64           bpf_cookie;
> >                         } perf_event;
> > +                       struct {
> > +                               __u64           name;
> > +                       } cgroup_subsys;
> >                         struct {
> >                                 __u32           flags;
> >                                 __u32           cnt;
> > @@ -6324,6 +6330,12 @@ struct bpf_cgroup_dev_ctx {
> >         __u32 minor;
> >  };
> >
> > +struct bpf_rstat_ctx {
> > +       __u64 cgroup_id;
> > +       __u64 parent_cgroup_id; /* 0 if root */
> > +       __s32 cpu;
> > +};
> > +
> >  struct bpf_raw_tracepoint_args {
> >         __u64 args[0];
> >  };
> > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > index c1a9be6a4b9f..6caf4a61e543 100644
> > --- a/kernel/bpf/Makefile
> > +++ b/kernel/bpf/Makefile
> > @@ -25,6 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
> >  obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> >  endif
> >  obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> > +obj-$(CONFIG_CGROUP_BPF) += cgroup_subsys.o
>
> In the next version I will replace this with:
> ifeq ($(CONFIG_CGROUP),y)
> obj-$(CONFIG_BPF_SYSCALL) += cgroup_subsys.o
> endif
>
> , as this program type doesn't attach to cgroups and does not depend
> on CONFIG_CGROUP_BPF, only CONFIG_CGROUP and CONFIG_BPF_SYSCALL.

On second thought it might be simpler and cleaner to leave this code
under CONFIG_CGROUP_BPF.

>
> >  ifeq ($(CONFIG_INET),y)
> >  obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o
> >  endif
> > diff --git a/kernel/bpf/cgroup_subsys.c b/kernel/bpf/cgroup_subsys.c
> > new file mode 100644
> > index 000000000000..9673ce6aa84a
> > --- /dev/null
> > +++ b/kernel/bpf/cgroup_subsys.c
> > @@ -0,0 +1,166 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Functions to manage eBPF programs attached to cgroup subsystems
> > + *
> > + * Copyright 2022 Google LLC.
> > + */
> > +
> > +#include <linux/bpf-cgroup-subsys.h>
> > +#include <linux/filter.h>
> > +
> > +#include "../cgroup/cgroup-internal.h"
> > +
> > +
> > +static int cgroup_subsys_bpf_attach(struct cgroup_subsys *ss, struct bpf_prog *prog)
> > +{
> > +       struct bpf_subsys_rstat_flusher *rstat_flusher;
> > +
> > +       rstat_flusher = kmalloc(sizeof(*rstat_flusher), GFP_KERNEL);
> > +       if (!rstat_flusher)
> > +               return -ENOMEM;
> > +       rstat_flusher->prog = prog;
> > +
> > +       spin_lock(&ss->bpf.flushers_lock);
> > +       list_add(&rstat_flusher->list, &ss->bpf.rstat_flushers);
> > +       spin_unlock(&ss->bpf.flushers_lock);
> > +
> > +       return 0;
> > +}
> > +
> > +static void cgroup_subsys_bpf_detach(struct cgroup_subsys *ss, struct bpf_prog *prog)
> > +{
> > +       struct bpf_subsys_rstat_flusher *rstat_flusher = NULL;
> > +
> > +       spin_lock(&ss->bpf.flushers_lock);
> > +       list_for_each_entry(rstat_flusher, &ss->bpf.rstat_flushers, list)
> > +               if (rstat_flusher->prog == prog)
> > +                       break;
> > +
> > +       if (rstat_flusher) {
> > +               list_del(&rstat_flusher->list);
> > +               bpf_prog_put(rstat_flusher->prog);
> > +               kfree(rstat_flusher);
> > +       }
> > +       spin_unlock(&ss->bpf.flushers_lock);
> > +}
> > +
> > +static void bpf_cgroup_subsys_link_release(struct bpf_link *link)
> > +{
> > +       struct bpf_cgroup_subsys_link *ss_link = container_of(link,
> > +                                                      struct bpf_cgroup_subsys_link,
> > +                                                      link);
> > +       if (ss_link->ss) {
> > +               cgroup_subsys_bpf_detach(ss_link->ss, ss_link->link.prog);
> > +               ss_link->ss = NULL;
> > +       }
> > +}
> > +
> > +static int bpf_cgroup_subsys_link_detach(struct bpf_link *link)
> > +{
> > +       bpf_cgroup_subsys_link_release(link);
> > +       return 0;
> > +}
> > +
> > +static void bpf_cgroup_subsys_link_dealloc(struct bpf_link *link)
> > +{
> > +       struct bpf_cgroup_subsys_link *ss_link = container_of(link,
> > +                                                      struct bpf_cgroup_subsys_link,
> > +                                                      link);
> > +       kfree(ss_link);
> > +}
> > +
> > +static const struct bpf_link_ops bpf_cgroup_subsys_link_lops = {
> > +       .detach = bpf_cgroup_subsys_link_detach,
> > +       .release = bpf_cgroup_subsys_link_release,
> > +       .dealloc = bpf_cgroup_subsys_link_dealloc,
> > +};
> > +
> > +int cgroup_subsys_bpf_link_attach(const union bpf_attr *attr,
> > +                                 struct bpf_prog *prog)
> > +{
> > +       struct bpf_link_primer link_primer;
> > +       struct bpf_cgroup_subsys_link *link;
> > +       struct cgroup_subsys *ss, *attach_ss = NULL;
> > +       const char __user *ss_name_user;
> > +       char ss_name[MAX_CGROUP_TYPE_NAMELEN];
> > +       int ssid, err;
> > +
> > +       if (attr->link_create.target_fd || attr->link_create.flags)
> > +               return -EINVAL;
> > +
> > +       ss_name_user = u64_to_user_ptr(attr->link_create.cgroup_subsys.name);
> > +       if (strncpy_from_user(ss_name, ss_name_user, sizeof(ss_name) - 1) < 0)
> > +               return -EFAULT;
> > +
> > +       for_each_subsys(ss, ssid)
> > +               if (!strcmp(ss_name, ss->name) ||
> > +                   !strcmp(ss_name, ss->legacy_name))
> > +                       attach_ss = ss;
> > +
> > +       if (!attach_ss)
> > +               return -EINVAL;
> > +
> > +       link = kzalloc(sizeof(*link), GFP_USER);
> > +       if (!link)
> > +               return -ENOMEM;
> > +
> > +       bpf_link_init(&link->link, BPF_LINK_TYPE_CGROUP_SUBSYS,
> > +                     &bpf_cgroup_subsys_link_lops,
> > +                     prog);
> > +       link->ss = attach_ss;
> > +
> > +       err = bpf_link_prime(&link->link, &link_primer);
> > +       if (err) {
> > +               kfree(link);
> > +               return err;
> > +       }
> > +
> > +       err = cgroup_subsys_bpf_attach(attach_ss, prog);
> > +       if (err) {
> > +               bpf_link_cleanup(&link_primer);
> > +               return err;
> > +       }
> > +
> > +       return bpf_link_settle(&link_primer);
> > +}
> > +
> > +static const struct bpf_func_proto *
> > +cgroup_subsys_rstat_func_proto(enum bpf_func_id func_id,
> > +                              const struct bpf_prog *prog)
> > +{
> > +       return bpf_base_func_proto(func_id);
> > +}
> > +
> > +static bool cgroup_subsys_rstat_is_valid_access(int off, int size,
> > +                                          enum bpf_access_type type,
> > +                                          const struct bpf_prog *prog,
> > +                                          struct bpf_insn_access_aux *info)
> > +{
> > +       if (type == BPF_WRITE)
> > +               return false;
> > +
> > +       if (off < 0 || off + size > sizeof(struct bpf_rstat_ctx))
> > +               return false;
> > +       /* The verifier guarantees that size > 0 */
> > +       if (off % size != 0)
> > +               return false;
> > +
> > +       switch (off) {
> > +       case offsetof(struct bpf_rstat_ctx, cgroup_id):
> > +               return size == sizeof(__u64);
> > +       case offsetof(struct bpf_rstat_ctx, parent_cgroup_id):
> > +               return size == sizeof(__u64);
> > +       case offsetof(struct bpf_rstat_ctx, cpu):
> > +               return size == sizeof(__s32);
> > +       default:
> > +               return false;
> > +       }
> > +}
> > +
> > +const struct bpf_prog_ops cgroup_subsys_rstat_prog_ops = {
> > +};
> > +
> > +const struct bpf_verifier_ops cgroup_subsys_rstat_verifier_ops = {
> > +       .get_func_proto         = cgroup_subsys_rstat_func_proto,
> > +       .is_valid_access        = cgroup_subsys_rstat_is_valid_access,
> > +};
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index cdaa1152436a..48149c54d969 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -3,6 +3,7 @@
> >   */
> >  #include <linux/bpf.h>
> >  #include <linux/bpf-cgroup.h>
> > +#include <linux/bpf-cgroup-subsys.h>
> >  #include <linux/bpf_trace.h>
> >  #include <linux/bpf_lirc.h>
> >  #include <linux/bpf_verifier.h>
> > @@ -3194,6 +3195,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
> >                 return BPF_PROG_TYPE_SK_LOOKUP;
> >         case BPF_XDP:
> >                 return BPF_PROG_TYPE_XDP;
> > +       case BPF_CGROUP_SUBSYS_RSTAT:
> > +               return BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT;
> >         default:
> >                 return BPF_PROG_TYPE_UNSPEC;
> >         }
> > @@ -4341,6 +4344,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
> >                 else
> >                         ret = bpf_kprobe_multi_link_attach(attr, prog);
> >                 break;
> > +       case BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT:
> > +               ret = cgroup_subsys_bpf_link_attach(attr, prog);
> > +               break;
> >         default:
> >                 ret = -EINVAL;
> >         }
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index adb820e98f24..7b1448013009 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -5745,6 +5745,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
> >
> >         idr_init(&ss->css_idr);
> >         INIT_LIST_HEAD(&ss->cfts);
> > +       INIT_LIST_HEAD(&ss->bpf.rstat_flushers);
> >
> >         /* Create the root cgroup state for this subsystem */
> >         ss->root = &cgrp_dfl_root;
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index d14b10b85e51..0f4855fa85db 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -952,6 +952,7 @@ enum bpf_prog_type {
> >         BPF_PROG_TYPE_LSM,
> >         BPF_PROG_TYPE_SK_LOOKUP,
> >         BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> > +       BPF_PROG_TYPE_CGROUP_SUBSYS_RSTAT,
> >  };
> >
> >  enum bpf_attach_type {
> > @@ -998,6 +999,7 @@ enum bpf_attach_type {
> >         BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> >         BPF_PERF_EVENT,
> >         BPF_TRACE_KPROBE_MULTI,
> > +       BPF_CGROUP_SUBSYS_RSTAT,
> >         __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > @@ -1013,6 +1015,7 @@ enum bpf_link_type {
> >         BPF_LINK_TYPE_XDP = 6,
> >         BPF_LINK_TYPE_PERF_EVENT = 7,
> >         BPF_LINK_TYPE_KPROBE_MULTI = 8,
> > +       BPF_LINK_TYPE_CGROUP_SUBSYS = 9,
> >
> >         MAX_BPF_LINK_TYPE,
> >  };
> > @@ -1482,6 +1485,9 @@ union bpf_attr {
> >                                  */
> >                                 __u64           bpf_cookie;
> >                         } perf_event;
> > +                       struct {
> > +                               __u64           name;
> > +                       } cgroup_subsys;
> >                         struct {
> >                                 __u32           flags;
> >                                 __u32           cnt;
> > @@ -6324,6 +6330,12 @@ struct bpf_cgroup_dev_ctx {
> >         __u32 minor;
> >  };
> >
> > +struct bpf_rstat_ctx {
> > +       __u64 cgroup_id;
> > +       __u64 parent_cgroup_id; /* 0 if root */
> > +       __s32 cpu;
> > +};
> > +
> >  struct bpf_raw_tracepoint_args {
> >         __u64 args[0];
> >  };
> > --
> > 2.36.0.512.ge40c2bad7a-goog
> >

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type
  2022-05-10 18:44   ` Tejun Heo
@ 2022-05-10 19:34     ` Yosry Ahmed
  2022-05-10 19:59       ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10 19:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, cgroups

On Tue, May 10, 2022 at 11:44 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, May 10, 2022 at 12:17:59AM +0000, Yosry Ahmed wrote:
> > @@ -706,6 +707,9 @@ struct cgroup_subsys {
> >        * specifies the mask of subsystems that this one depends on.
> >        */
> >       unsigned int depends_on;
> > +
> > +     /* used to store bpf programs.*/
> > +     struct cgroup_subsys_bpf bpf;
> >  };
>
> Care to elaborate on rationales around associating this with a specific
> cgroup_subsys rather than letting it walk cgroups and access whatever csses
> as needed? I don't think it's a wrong approach or anything but I can think
> of plenty of things that would be interesting without being associated with
> a specific subsystem - even all the cpu usage statistics are built to in the
> cgroup core and given how e.g. systemd uses cgroup to organize the
> applications in the system whether resource control is active or not, there
> are a lot of info one can gather about those without being associated with a
> specific subsystem.

Hi Tejun,

Thanks so much for taking the time to look into this!

The rationale behind associating this work with cgroup_subsys is that
usually the stats are associated with a resource (e.g. memory, cpu,
etc). For example, if the memory controller is only enabled for a
subtree in a big hierarchy, it would be more efficient to only run BPF
rstat programs for those cgroups, not the entire hierarchy. It
provides a way to control what part of the hierarchy you want to
collect stats for. This is also semantically similar to the
css_rstat_flush() callback.

However, I do see your point about the benefits of collecting stats
that are not associated with any controller. I think there are
multiple options here, and I would love to hear what you prefer:
1. In addition to subsystems, support an "all" or "cgroup" attach
point that loads BPF rstat flush programs that will run for all
cgroups.
2. Simplify the interface so that all BPF rstat flush programs run for
all cgroups, and add the subsystem association later if a need arises.
3. Instead of attaching BPF programs to a subsystem, attach them to a
cgroup. This gives more flexibility, but also makes lifetime handling
of programs more complicated and error-prone. I can also see most use
cases (including ours) attaching programs to the root cgroup anyway.
In this case, we waste space by storing pointers to the same program
in every cgroup, and have unnecessary complexity in the code.

Let me know what you think!

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type
  2022-05-10 19:34     ` Yosry Ahmed
@ 2022-05-10 19:59       ` Tejun Heo
  2022-05-10 20:43         ` Yosry Ahmed
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2022-05-10 19:59 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, cgroups

Hello,

On Tue, May 10, 2022 at 12:34:42PM -0700, Yosry Ahmed wrote:
> The rationale behind associating this work with cgroup_subsys is that
> usually the stats are associated with a resource (e.g. memory, cpu,
> etc). For example, if the memory controller is only enabled for a
> subtree in a big hierarchy, it would be more efficient to only run BPF
> rstat programs for those cgroups, not the entire hierarchy. It
> provides a way to control what part of the hierarchy you want to
> collect stats for. This is also semantically similar to the
> css_rstat_flush() callback.

Hmm... one major point of rstat is not having to worry about these things
because we iterate what's been active rather than what exists. Now, this
isn't entirely true because we share the same updated list for all sources.
This is a trade-off which makes sense because 1. the number of cgroups to
iterate each cycle is generally really low anyway 2. different controllers
often get enabled together. If the balance tilts towards "we're walking too
many due to the sharing of updated list across different sources", the
solution would be splitting the updated list so that we make the walk finer
grained.

Note that the above doesn't really affect the conceptual model. It's purely
an optimization decision. Tying these things to a cgroup_subsys does affect
the conceptual model and, in this case, the userland API for a performance
consideration which can be solved otherwise.

So, let's please keep this simple and in the (unlikely) case that the
overhead becomes an issue, solve it from rstat operation side.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type
  2022-05-10 19:59       ` Tejun Heo
@ 2022-05-10 20:43         ` Yosry Ahmed
  2022-05-10 21:01           ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10 20:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, cgroups

On Tue, May 10, 2022 at 12:59 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, May 10, 2022 at 12:34:42PM -0700, Yosry Ahmed wrote:
> > The rationale behind associating this work with cgroup_subsys is that
> > usually the stats are associated with a resource (e.g. memory, cpu,
> > etc). For example, if the memory controller is only enabled for a
> > subtree in a big hierarchy, it would be more efficient to only run BPF
> > rstat programs for those cgroups, not the entire hierarchy. It
> > provides a way to control what part of the hierarchy you want to
> > collect stats for. This is also semantically similar to the
> > css_rstat_flush() callback.
>
> Hmm... one major point of rstat is not having to worry about these things
> because we iterate what's been active rather than what exists. Now, this
> isn't entirely true because we share the same updated list for all sources.
> This is a trade-off which makes sense because 1. the number of cgroups to
> iterate each cycle is generally really low anyway 2. different controllers
> often get enabled together. If the balance tilts towards "we're walking too
> many due to the sharing of updated list across different sources", the
> solution would be splitting the updated list so that we make the walk finer
> grained.
>
> Note that the above doesn't really affect the conceptual model. It's purely
> an optimization decision. Tying these things to a cgroup_subsys does affect
> the conceptual model and, in this case, the userland API for a performance
> consideration which can be solved otherwise.
>
> So, let's please keep this simple and in the (unlikely) case that the
> overhead becomes an issue, solve it from rstat operation side.
>
> Thanks.

I assume if we do this optimization, and have separate updated lists
for controllers, we will still have a "core" updated list that is not
tied to any controller. Is this correct?

If yes, then we can make the interface controller-agnostic (a global
list of BPF flushers). If we do the optimization later, we tie BPF
stats to the "core" updated list. We can even extend the userland
interface then to allow for controller-specific BPF stats if found
useful.

If not, and there will only be controller-specific updated lists then,
then we might need to maintain a "core" updated list just for the sake
of BPF programs, which I don't think would be favorable.

What do you think? Either-way, I will try to document our discussion
outcome in the commit message (and maybe the code), so that
if-and-when this optimization is made, we can come back to it.


>
> --
> tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type
  2022-05-10 20:43         ` Yosry Ahmed
@ 2022-05-10 21:01           ` Tejun Heo
  2022-05-10 21:55             ` Yosry Ahmed
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2022-05-10 21:01 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, cgroups

Hello,

On Tue, May 10, 2022 at 01:43:46PM -0700, Yosry Ahmed wrote:
> I assume if we do this optimization, and have separate updated lists
> for controllers, we will still have a "core" updated list that is not
> tied to any controller. Is this correct?

Or we can create a dedicated updated list for the bpf progs, or even
multiple for groups of them and so on.

> If yes, then we can make the interface controller-agnostic (a global
> list of BPF flushers). If we do the optimization later, we tie BPF
> stats to the "core" updated list. We can even extend the userland
> interface then to allow for controller-specific BPF stats if found
> useful.

We'll need that anyway as cpustats are tied to the cgroup themselves rather
than the cpu controller.

> If not, and there will only be controller-specific updated lists then,
> then we might need to maintain a "core" updated list just for the sake
> of BPF programs, which I don't think would be favorable.

If needed, that's fine actually.

> What do you think? Either-way, I will try to document our discussion
> outcome in the commit message (and maybe the code), so that
> if-and-when this optimization is made, we can come back to it.

So, the main focus is keeping the userspace interface as simple as possible
and solving performance issues on the rstat side. If we need however many
updated lists to do that, that's all fine. FWIW, the experience up until now
has been consistent with the assumptions that the current implementation
makes and I haven't seen real any world cases where the shared updated list
are problematic.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 8/9] bpf: Introduce cgroup iter
  2022-05-10 18:54   ` Tejun Heo
@ 2022-05-10 21:12     ` Hao Luo
  2022-05-10 22:07       ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Hao Luo @ 2022-05-10 21:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

Hello Tejun,

On Tue, May 10, 2022 at 11:54 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, May 10, 2022 at 12:18:06AM +0000, Yosry Ahmed wrote:
> > From: Hao Luo <haoluo@google.com>
> >
> > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> > iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> > be parameterized by a cgroup id and prints only that cgroup. So one
> > needs to specify a target cgroup id when attaching this iter. The target
> > cgroup's state can be read out via a link of this iter.
>
> Is there a reason why this can't be a proper iterator which supports
> lseek64() to locate a specific cgroup?
>

There are two reasons:

- Bpf_iter assumes no_llseek. I haven't looked closely on why this is
so and whether we can add its support.

- Second, the name 'iter' in this patch is misleading. What this patch
really does is reusing the functionality of dumping in bpf_iter.
'Dumper' is a better name. We want to create one file in bpffs for
each cgroup. We are essentially just iterating a set of one single
element.

> Thanks.

>
> --
> tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type
  2022-05-10 21:01           ` Tejun Heo
@ 2022-05-10 21:55             ` Yosry Ahmed
  2022-05-10 22:09               ` Tejun Heo
  0 siblings, 1 reply; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10 21:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, cgroups

On Tue, May 10, 2022 at 2:01 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, May 10, 2022 at 01:43:46PM -0700, Yosry Ahmed wrote:
> > I assume if we do this optimization, and have separate updated lists
> > for controllers, we will still have a "core" updated list that is not
> > tied to any controller. Is this correct?
>
> Or we can create a dedicated updated list for the bpf progs, or even
> multiple for groups of them and so on.
>
> > If yes, then we can make the interface controller-agnostic (a global
> > list of BPF flushers). If we do the optimization later, we tie BPF
> > stats to the "core" updated list. We can even extend the userland
> > interface then to allow for controller-specific BPF stats if found
> > useful.
>
> We'll need that anyway as cpustats are tied to the cgroup themselves rather
> than the cpu controller.
>
> > If not, and there will only be controller-specific updated lists then,
> > then we might need to maintain a "core" updated list just for the sake
> > of BPF programs, which I don't think would be favorable.
>
> If needed, that's fine actually.
>
> > What do you think? Either-way, I will try to document our discussion
> > outcome in the commit message (and maybe the code), so that
> > if-and-when this optimization is made, we can come back to it.
>
> So, the main focus is keeping the userspace interface as simple as possible
> and solving performance issues on the rstat side. If we need however many
> updated lists to do that, that's all fine. FWIW, the experience up until now
> has been consistent with the assumptions that the current implementation
> makes and I haven't seen real any world cases where the shared updated list
> are problematic.
>

Thanks again for your insights and time!

That's great to hear. I am all in for making the userspace interface
simpler. I will rework this patch series so that the BPF programs just
attach to "rstat" and send a V1.
Any other concerns you have that you think I should address in V1?

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 8/9] bpf: Introduce cgroup iter
  2022-05-10 21:12     ` Hao Luo
@ 2022-05-10 22:07       ` Tejun Heo
  2022-05-10 22:49         ` Hao Luo
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2022-05-10 22:07 UTC (permalink / raw)
  To: Hao Luo
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

Hello,

On Tue, May 10, 2022 at 02:12:16PM -0700, Hao Luo wrote:
> > Is there a reason why this can't be a proper iterator which supports
> > lseek64() to locate a specific cgroup?
> >
> 
> There are two reasons:
> 
> - Bpf_iter assumes no_llseek. I haven't looked closely on why this is
> so and whether we can add its support.
> 
> - Second, the name 'iter' in this patch is misleading. What this patch
> really does is reusing the functionality of dumping in bpf_iter.
> 'Dumper' is a better name. We want to create one file in bpffs for
> each cgroup. We are essentially just iterating a set of one single
> element.

I see. I'm just shooting in the dark without context but at least in
principle there's no reason why cgroups wouldn't be iterable, so it might be
something worth at least thinking about before baking in the interface.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type
  2022-05-10 21:55             ` Yosry Ahmed
@ 2022-05-10 22:09               ` Tejun Heo
  2022-05-10 22:10                 ` Yosry Ahmed
  0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2022-05-10 22:09 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, cgroups

Hello,

On Tue, May 10, 2022 at 02:55:32PM -0700, Yosry Ahmed wrote:
> That's great to hear. I am all in for making the userspace interface
> simpler. I will rework this patch series so that the BPF programs just
> attach to "rstat" and send a V1.
> Any other concerns you have that you think I should address in V1?

Not that I can think of right now but my bpf side insight is really limited,
so it might be worthwhile to wait for some bpf folks to chime in?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type
  2022-05-10 22:09               ` Tejun Heo
@ 2022-05-10 22:10                 ` Yosry Ahmed
  0 siblings, 0 replies; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-10 22:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, Linux Kernel Mailing List, Networking,
	bpf, cgroups

On Tue, May 10, 2022 at 3:09 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, May 10, 2022 at 02:55:32PM -0700, Yosry Ahmed wrote:
> > That's great to hear. I am all in for making the userspace interface
> > simpler. I will rework this patch series so that the BPF programs just
> > attach to "rstat" and send a V1.
> > Any other concerns you have that you think I should address in V1?
>
> Not that I can think of right now but my bpf side insight is really limited,
> so it might be worthwhile to wait for some bpf folks to chime in?
>

Sounds good. Will wait for feedback on the BPF side of things before I
send a V1.

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 8/9] bpf: Introduce cgroup iter
  2022-05-10 22:07       ` Tejun Heo
@ 2022-05-10 22:49         ` Hao Luo
  0 siblings, 0 replies; 30+ messages in thread
From: Hao Luo @ 2022-05-10 22:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Zefan Li, Johannes Weiner, Shuah Khan,
	Roman Gushchin, Michal Hocko, Stanislav Fomichev, David Rientjes,
	Greg Thelen, Shakeel Butt, linux-kernel, netdev, bpf, cgroups

On Tue, May 10, 2022 at 3:07 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, May 10, 2022 at 02:12:16PM -0700, Hao Luo wrote:
> > > Is there a reason why this can't be a proper iterator which supports
> > > lseek64() to locate a specific cgroup?
> > >
> >
> > There are two reasons:
> >
> > - Bpf_iter assumes no_llseek. I haven't looked closely on why this is
> > so and whether we can add its support.
> >
> > - Second, the name 'iter' in this patch is misleading. What this patch
> > really does is reusing the functionality of dumping in bpf_iter.
> > 'Dumper' is a better name. We want to create one file in bpffs for
> > each cgroup. We are essentially just iterating a set of one single
> > element.
>
> I see. I'm just shooting in the dark without context but at least in
> principle there's no reason why cgroups wouldn't be iterable, so it might be
> something worth at least thinking about before baking in the interface.
>

Yep. Conceptually there should be no problem to iterate cgroups in the
system. It may be better to have two independent bpf objects: bpf_iter
and bpf_dumper. In our use case, we want bpf_dumper, which just
exports data out through fs interface.

Hao

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection
  2022-05-10  0:17 [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
                   ` (8 preceding siblings ...)
  2022-05-10  0:18 ` [RFC PATCH bpf-next 9/9] selftest/bpf: add a selftest for cgroup hierarchical stats Yosry Ahmed
@ 2022-05-13  7:16 ` Yosry Ahmed
  9 siblings, 0 replies; 30+ messages in thread
From: Yosry Ahmed @ 2022-05-13  7:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Tejun Heo, Zefan Li, Johannes Weiner,
	Shuah Khan, Roman Gushchin, Michal Hocko
  Cc: Stanislav Fomichev, David Rientjes, Greg Thelen, Shakeel Butt,
	Linux Kernel Mailing List, Networking, bpf, cgroups

I have done some significant changes on the BPF side of this. I will
send a RFC V2 soon with those changes and incorporating the feedback
on the cgroup side that I got from Tejun. Hold off on reviewing this
version.


On Mon, May 9, 2022 at 5:18 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> This patch series allows for using bpf to collect hierarchical cgroup
> stats efficiently by integrating with the rstat framework. The rstat
> framework provides an efficient way to collect cgroup stats and
> propagate them through the cgroup hierarchy.
>
> The last patch is a selftest that demonastrates the entire workflow.
> The workflow consists of:
> - bpf programs that collect per-cpu per-cgroup stats (tracing progs).
> - bpf rstat flusher that contains the logic for aggregating stats
>   across cpus and across the cgroup hierarchy.
> - bpf cgroup_iter responsible for outputting the stats to userspace
>   through reading a file in bpffs.
>
> The first 3 patches include the new bpf rstat flusher program type and
> the needed support in rstat code and libbpf. The rstat flusher program
> is a callback that the rstat framework makes to bpf when a stat flush is
> ongoing, similar to the css_rstat_flush() callback that rstat makes to
> cgroup controllers. Each callback is parameterized by a (cgroup, cpu)
> pair that has been updated. The program contains the logic for
> aggregating the stats across cpus and across the cgroup hierarchy.
> These programs can be attached to any cgroup subsystem, not only the
> ones that implement the css_rstat_flush() callback in the kernel. This
> gives bpf programs more flexibility, and more isolation from the kernel
> implementation.
>
> The following 2 patches add necessary helpers for the stats collection
> workflow. Helpers that call into cgroup_rstat_updated() and
> cgroup_rstat_flush() are added to allow bpf programs collecting stats to
> tell the rstat framework that a cgroup has been updated, and to allow
> bpf programs outputting stats to tell the rstat framework to flush the
> stats before they are displayed to the user. An additional
> bpf_map_lookup_percpu_elem is introduced to allow rstat flusher programs
> to access percpu stats of the cpu being flushed.
>
> The following 3 patches add the cgroup_iter program type (v2). This was
> originally introduced by Hao as a part of a different series [1].
> Their usecase is better showcased as part of this patch series. We also
> make cgroup_get_from_id() cgroup v1 friendly to allow cgroup_iter programs
> to display stats for cgroup v1 as well. This small change makes the
> entire workflow cgroup v1 friendly without any other dedicated changes.
>
> The final patch is a selftest demonstrating the entire workflow with a
> set of bpf programs that collect per-cgroup latency of memcg reclaim.
>
> [1]https://lore.kernel.org/lkml/20220225234339.2386398-9-haoluo@google.com/
>
>
> Hao Luo (2):
>   cgroup: Add cgroup_put() in !CONFIG_CGROUPS case
>   bpf: Introduce cgroup iter
>
> Yosry Ahmed (7):
>   bpf: introduce CGROUP_SUBSYS_RSTAT program type
>   cgroup: bpf: flush bpf stats on rstat flush
>   libbpf: Add support for rstat progs and links
>   bpf: add bpf rstat helpers
>   bpf: add bpf_map_lookup_percpu_elem() helper
>   cgroup: add v1 support to cgroup_get_from_id()
>   bpf: add a selftest for cgroup hierarchical stats collection
>
>  include/linux/bpf-cgroup-subsys.h             |  35 ++
>  include/linux/bpf.h                           |   4 +
>  include/linux/bpf_types.h                     |   2 +
>  include/linux/cgroup-defs.h                   |   4 +
>  include/linux/cgroup.h                        |   5 +
>  include/uapi/linux/bpf.h                      |  45 +++
>  kernel/bpf/Makefile                           |   3 +-
>  kernel/bpf/arraymap.c                         |  11 +-
>  kernel/bpf/cgroup_iter.c                      | 148 ++++++++
>  kernel/bpf/cgroup_subsys.c                    | 212 +++++++++++
>  kernel/bpf/hashtab.c                          |  25 +-
>  kernel/bpf/helpers.c                          |  56 +++
>  kernel/bpf/syscall.c                          |   6 +
>  kernel/bpf/verifier.c                         |   6 +
>  kernel/cgroup/cgroup.c                        |  16 +-
>  kernel/cgroup/rstat.c                         |  11 +
>  scripts/bpf_doc.py                            |   2 +
>  tools/include/uapi/linux/bpf.h                |  45 +++
>  tools/lib/bpf/bpf.c                           |   3 +
>  tools/lib/bpf/bpf.h                           |   3 +
>  tools/lib/bpf/libbpf.c                        |  35 ++
>  tools/lib/bpf/libbpf.h                        |   3 +
>  tools/lib/bpf/libbpf.map                      |   1 +
>  .../test_cgroup_hierarchical_stats.c          | 335 ++++++++++++++++++
>  tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
>  .../selftests/bpf/progs/cgroup_vmscan.c       | 211 +++++++++++
>  26 files changed, 1212 insertions(+), 22 deletions(-)
>  create mode 100644 include/linux/bpf-cgroup-subsys.h
>  create mode 100644 kernel/bpf/cgroup_iter.c
>  create mode 100644 kernel/bpf/cgroup_subsys.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cgroup_hierarchical_stats.c
>  create mode 100644 tools/testing/selftests/bpf/progs/cgroup_vmscan.c
>
> --
> 2.36.0.512.ge40c2bad7a-goog
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2022-05-13  7:17 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-10  0:17 [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed
2022-05-10  0:17 ` [RFC PATCH bpf-next 1/9] bpf: introduce CGROUP_SUBSYS_RSTAT program type Yosry Ahmed
2022-05-10 18:07   ` Yosry Ahmed
2022-05-10 19:21     ` Yosry Ahmed
2022-05-10 18:44   ` Tejun Heo
2022-05-10 19:34     ` Yosry Ahmed
2022-05-10 19:59       ` Tejun Heo
2022-05-10 20:43         ` Yosry Ahmed
2022-05-10 21:01           ` Tejun Heo
2022-05-10 21:55             ` Yosry Ahmed
2022-05-10 22:09               ` Tejun Heo
2022-05-10 22:10                 ` Yosry Ahmed
2022-05-10  0:18 ` [RFC PATCH bpf-next 2/9] cgroup: bpf: flush bpf stats on rstat flush Yosry Ahmed
2022-05-10 18:45   ` Tejun Heo
2022-05-10  0:18 ` [RFC PATCH bpf-next 3/9] libbpf: Add support for rstat progs and links Yosry Ahmed
2022-05-10  0:18 ` [RFC PATCH bpf-next 4/9] bpf: add bpf rstat helpers Yosry Ahmed
2022-05-10  0:18 ` [RFC PATCH bpf-next 5/9] bpf: add bpf_map_lookup_percpu_elem() helper Yosry Ahmed
2022-05-10  0:18 ` [RFC PATCH bpf-next 6/9] cgroup: add v1 support to cgroup_get_from_id() Yosry Ahmed
2022-05-10 18:33   ` Tejun Heo
2022-05-10 18:36     ` Yosry Ahmed
2022-05-10  0:18 ` [RFC PATCH bpf-next 7/9] cgroup: Add cgroup_put() in !CONFIG_CGROUPS case Yosry Ahmed
2022-05-10 18:25   ` Hao Luo
2022-05-10  0:18 ` [RFC PATCH bpf-next 8/9] bpf: Introduce cgroup iter Yosry Ahmed
2022-05-10 18:25   ` Hao Luo
2022-05-10 18:54   ` Tejun Heo
2022-05-10 21:12     ` Hao Luo
2022-05-10 22:07       ` Tejun Heo
2022-05-10 22:49         ` Hao Luo
2022-05-10  0:18 ` [RFC PATCH bpf-next 9/9] selftest/bpf: add a selftest for cgroup hierarchical stats Yosry Ahmed
2022-05-13  7:16 ` [RFC PATCH bpf-next 0/9] bpf: cgroup hierarchical stats collection Yosry Ahmed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).