[PATCHSET v2 0/3] perf stat: Enable BPF counters with --for-each-cgroup

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCHSET v2 0/3] perf stat: Enable BPF counters with --for-each-cgroup
@ 2021-06-15  1:17 Namhyung Kim
  2021-06-15  1:17 ` [PATCH 1/3] perf tools: Add read_cgroup_id() function Namhyung Kim
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Namhyung Kim @ 2021-06-15  1:17 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Stephane Eranian, Song Liu

Hello,

This is to add BPF support for --for-each-cgroup to handle many cgroup
events on big machines.  You can use the --bpf-counters to enable the
new behavior.

 * changes v2
  - remove incorrect use of BPF_F_PRESERVE_ELEMS
  - add missing map elements after lookup
  - handle cgroup v1

Basic idea is to use a single set of per-cpu events to count
interested events and aggregate them to each cgroup.  I used bperf
mechanism to use a BPF program for cgroup-switches and save the
results in a matching map element for given cgroups.

Without this, we need to have separate events for cgroups, and it
creates unnecessary multiplexing overhead (and PMU programming) when
tasks in different cgroups are switched.  I saw this makes a big
difference on 256 cpu machines with hundreds of cgroups.

Actually this is what I wanted to do it in the kernel [1], but IIUC
with some limitations we can do the job using BPF.

Current limitations are:
 * it doesn't support cgroup hierarchy
 * there's no reliable way to trigger running the BPF program

Thanks,
Namhyung

[1] https://lore.kernel.org/lkml/20210413155337.644993-1-namhyung@kernel.org/

Namhyung Kim (3):
  perf tools: Add read_cgroup_id() function
  perf tools: Add cgroup_is_v2() helper
  perf stat: Enable BPF counter with --for-each-cgroup

 tools/perf/Makefile.perf                    |   1 +
 tools/perf/util/Build                       |   1 +
 tools/perf/util/bpf_counter.c               |   5 +
 tools/perf/util/bpf_counter_cgroup.c        | 319 ++++++++++++++++++++
 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c | 146 +++++++++
 tools/perf/util/cgroup.c                    |  46 +++
 tools/perf/util/cgroup.h                    |  12 +
 7 files changed, 530 insertions(+)
 create mode 100644 tools/perf/util/bpf_counter_cgroup.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c

-- 
2.32.0.272.g935e593368-goog

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/3] perf tools: Add read_cgroup_id() function
  2021-06-15  1:17 [PATCHSET v2 0/3] perf stat: Enable BPF counters with --for-each-cgroup Namhyung Kim
@ 2021-06-15  1:17 ` Namhyung Kim
  2021-06-15  1:17 ` [PATCH 2/3] perf tools: Add cgroup_is_v2() helper Namhyung Kim
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Namhyung Kim @ 2021-06-15  1:17 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Stephane Eranian, Song Liu

The read_cgroup_id() is to read a cgroup id from a file handle using
name_to_handle_at(2) for the given cgroup.  It'll be used by bperf
cgroup stat later.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/cgroup.c | 25 +++++++++++++++++++++++++
 tools/perf/util/cgroup.h |  9 +++++++++
 2 files changed, 34 insertions(+)

diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c
index f24ab4585553..ef18c988c681 100644
--- a/tools/perf/util/cgroup.c
+++ b/tools/perf/util/cgroup.c
@@ -45,6 +45,31 @@ static int open_cgroup(const char *name)
 	return fd;
 }
 
+#ifdef HAVE_FILE_HANDLE
+int read_cgroup_id(struct cgroup *cgrp)
+{
+	char path[PATH_MAX + 1];
+	char mnt[PATH_MAX + 1];
+	struct {
+		struct file_handle fh;
+		uint64_t cgroup_id;
+	} handle;
+	int mount_id;
+
+	if (cgroupfs_find_mountpoint(mnt, PATH_MAX + 1, "perf_event"))
+		return -1;
+
+	scnprintf(path, PATH_MAX, "%s/%s", mnt, cgrp->name);
+
+	handle.fh.handle_bytes = sizeof(handle.cgroup_id);
+	if (name_to_handle_at(AT_FDCWD, path, &handle.fh, &mount_id, 0) < 0)
+		return -1;
+
+	cgrp->id = handle.cgroup_id;
+	return 0;
+}
+#endif  /* HAVE_FILE_HANDLE */
+
 static struct cgroup *evlist__find_cgroup(struct evlist *evlist, const char *str)
 {
 	struct evsel *counter;
diff --git a/tools/perf/util/cgroup.h b/tools/perf/util/cgroup.h
index 162906f3412a..707adbe25123 100644
--- a/tools/perf/util/cgroup.h
+++ b/tools/perf/util/cgroup.h
@@ -38,4 +38,13 @@ struct cgroup *cgroup__find(struct perf_env *env, uint64_t id);
 
 void perf_env__purge_cgroups(struct perf_env *env);
 
+#ifdef HAVE_FILE_HANDLE
+int read_cgroup_id(struct cgroup *cgrp);
+#else
+int read_cgroup_id(struct cgroup *cgrp)
+{
+	return -1;
+}
+#endif  /* HAVE_FILE_HANDLE */
+
 #endif /* __CGROUP_H__ */
-- 
2.32.0.272.g935e593368-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/3] perf tools: Add cgroup_is_v2() helper
  2021-06-15  1:17 [PATCHSET v2 0/3] perf stat: Enable BPF counters with --for-each-cgroup Namhyung Kim
  2021-06-15  1:17 ` [PATCH 1/3] perf tools: Add read_cgroup_id() function Namhyung Kim
@ 2021-06-15  1:17 ` Namhyung Kim
  2021-06-15  1:17 ` [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup Namhyung Kim
  2021-06-16 15:14 ` [PATCHSET v2 0/3] perf stat: Enable BPF counters " Peter Zijlstra
  3 siblings, 0 replies; 9+ messages in thread
From: Namhyung Kim @ 2021-06-15  1:17 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Stephane Eranian, Song Liu

The cgroup_is_v2() is to check if the given subsystem is mounted on
cgroup v2 or not.  It'll be used by BPF cgroup code later.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/cgroup.c | 19 +++++++++++++++++++
 tools/perf/util/cgroup.h |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c
index ef18c988c681..48ec79211270 100644
--- a/tools/perf/util/cgroup.c
+++ b/tools/perf/util/cgroup.c
@@ -9,6 +9,7 @@
 #include <linux/zalloc.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/statfs.h>
 #include <fcntl.h>
 #include <stdlib.h>
 #include <string.h>
@@ -70,6 +71,24 @@ int read_cgroup_id(struct cgroup *cgrp)
 }
 #endif  /* HAVE_FILE_HANDLE */
 
+#ifndef CGROUP2_SUPER_MAGIC
+#define CGROUP2_SUPER_MAGIC  0x63677270
+#endif
+
+int cgroup_is_v2(const char *subsys)
+{
+	char mnt[PATH_MAX + 1];
+	struct statfs stbuf;
+
+	if (cgroupfs_find_mountpoint(mnt, PATH_MAX + 1, subsys))
+		return -1;
+
+	if (statfs(mnt, stbuf) < 0)
+		return -1;
+
+	return (stbuf.f_type == CGROUP2_SUPER_MAGIC);
+}
+
 static struct cgroup *evlist__find_cgroup(struct evlist *evlist, const char *str)
 {
 	struct evsel *counter;
diff --git a/tools/perf/util/cgroup.h b/tools/perf/util/cgroup.h
index 707adbe25123..1549ec2fd348 100644
--- a/tools/perf/util/cgroup.h
+++ b/tools/perf/util/cgroup.h
@@ -47,4 +47,6 @@ int read_cgroup_id(struct cgroup *cgrp)
 }
 #endif  /* HAVE_FILE_HANDLE */
 
+int cgroup_is_v2(const char *subsys);
+
 #endif /* __CGROUP_H__ */
-- 
2.32.0.272.g935e593368-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup
  2021-06-15  1:17 [PATCHSET v2 0/3] perf stat: Enable BPF counters with --for-each-cgroup Namhyung Kim
  2021-06-15  1:17 ` [PATCH 1/3] perf tools: Add read_cgroup_id() function Namhyung Kim
  2021-06-15  1:17 ` [PATCH 2/3] perf tools: Add cgroup_is_v2() helper Namhyung Kim
@ 2021-06-15  1:17 ` Namhyung Kim
  2021-06-16 15:14 ` [PATCHSET v2 0/3] perf stat: Enable BPF counters " Peter Zijlstra
  3 siblings, 0 replies; 9+ messages in thread
From: Namhyung Kim @ 2021-06-15  1:17 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Stephane Eranian, Song Liu

Recently bperf was added to use BPF to count perf events for various
purposes.  This is an extension for the approach and targetting to
cgroup usages.

Unlike the other bperf, it doesn't share the events with other
processes but it'd reduce unnecessary events (and the overhead of
multiplexing) for each monitored cgroup within the perf session.

When --for-each-cgroup is used with --bpf-counters, it will open
cgroup-switches event per cpu internally and attach the new BPF
program to read given perf_events and to aggregate the results for
cgroups.  It's only called when task is switched to a task in a
different cgroup.

It seems the bpf test run mechanism doens't work for programs attached
the perf_event.  So it's not guaranteed to get the accurate results
using this.  But in practice, perf tools tends to be in a separate
cgroup than the actual works, so having an affinity setting loop at
the end triggers cgroup switches on each cpu.

Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/Makefile.perf                    |   1 +
 tools/perf/util/Build                       |   1 +
 tools/perf/util/bpf_counter.c               |   5 +
 tools/perf/util/bpf_counter_cgroup.c        | 319 ++++++++++++++++++++
 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c | 146 +++++++++
 tools/perf/util/cgroup.c                    |   2 +
 tools/perf/util/cgroup.h                    |   1 +
 7 files changed, 475 insertions(+)
 create mode 100644 tools/perf/util/bpf_counter_cgroup.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index e47f04e5b51e..b67099dd2456 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1015,6 +1015,7 @@ SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
 SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
 SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
+SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h
 
 ifdef BUILD_BPF_SKEL
 BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 95e15d1035ab..700d635448ff 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -140,6 +140,7 @@ perf-y += clockid.o
 perf-$(CONFIG_LIBBPF) += bpf-loader.o
 perf-$(CONFIG_LIBBPF) += bpf_map.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
+perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
 perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
 perf-$(CONFIG_LIBELF) += symbol-elf.o
 perf-$(CONFIG_LIBELF) += probe-file.o
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 974f10e356f0..7812c5d9b826 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -22,6 +22,7 @@
 #include "evsel.h"
 #include "evlist.h"
 #include "target.h"
+#include "cgroup.h"
 #include "cpumap.h"
 #include "thread_map.h"
 
@@ -792,6 +793,8 @@ struct bpf_counter_ops bperf_ops = {
 	.destroy    = bperf__destroy,
 };
 
+extern struct bpf_counter_ops bperf_cgrp_ops;
+
 static inline bool bpf_counter_skip(struct evsel *evsel)
 {
 	return list_empty(&evsel->bpf_counter_list) &&
@@ -809,6 +812,8 @@ int bpf_counter__load(struct evsel *evsel, struct target *target)
 {
 	if (target->bpf_str)
 		evsel->bpf_counter_ops = &bpf_program_profiler_ops;
+	else if (cgrp_event_expanded && target->use_bpf)
+		evsel->bpf_counter_ops = &bperf_cgrp_ops;
 	else if (target->use_bpf || evsel->bpf_counter ||
 		 evsel__match_bpf_counter_events(evsel->name))
 		evsel->bpf_counter_ops = &bperf_ops;
diff --git a/tools/perf/util/bpf_counter_cgroup.c b/tools/perf/util/bpf_counter_cgroup.c
new file mode 100644
index 000000000000..0ec5be75b860
--- /dev/null
+++ b/tools/perf/util/bpf_counter_cgroup.c
@@ -0,0 +1,319 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Copyright (c) 2019 Facebook */
+/* Copyright (c) 2021 Google */
+
+#include <assert.h>
+#include <limits.h>
+#include <unistd.h>
+#include <sys/file.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <linux/err.h>
+#include <linux/zalloc.h>
+#include <linux/perf_event.h>
+#include <bpf/bpf.h>
+#include <bpf/btf.h>
+#include <bpf/libbpf.h>
+#include <api/fs/fs.h>
+#include <perf/bpf_perf.h>
+
+#include "affinity.h"
+#include "bpf_counter.h"
+#include "cgroup.h"
+#include "counts.h"
+#include "debug.h"
+#include "evsel.h"
+#include "evlist.h"
+#include "target.h"
+#include "cpumap.h"
+#include "thread_map.h"
+
+#include "bpf_skel/bperf_cgroup.skel.h"
+
+static struct perf_event_attr cgrp_switch_attr = {
+	.type = PERF_TYPE_SOFTWARE,
+	.config = PERF_COUNT_SW_CGROUP_SWITCHES,
+	.size = sizeof(cgrp_switch_attr),
+	.sample_period = 1,
+	.disabled = 1,
+};
+
+static struct evsel *cgrp_switch;
+static struct xyarray *cgrp_prog_fds;
+static struct bperf_cgroup_bpf *skel;
+
+#define FD(evt, cpu) (*(int *)xyarray__entry(evt->core.fd, cpu, 0))
+#define PROG(cpu)    (*(int *)xyarray__entry(cgrp_prog_fds, cpu, 0))
+
+static void set_max_rlimit(void)
+{
+	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
+
+	setrlimit(RLIMIT_MEMLOCK, &rinf);
+}
+
+static __u32 bpf_link_get_prog_id(int fd)
+{
+	struct bpf_link_info link_info = {0};
+	__u32 link_info_len = sizeof(link_info);
+
+	bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
+	return link_info.prog_id;
+}
+
+static int bperf_load_program(struct evlist *evlist)
+{
+	struct bpf_link *link;
+	struct evsel *evsel;
+	struct cgroup *cgrp, *leader_cgrp;
+	__u32 i, cpu, prog_id;
+	int nr_cpus = evlist->core.all_cpus->nr;
+	int map_size, map_fd, err;
+
+	skel = bperf_cgroup_bpf__open();
+	if (!skel) {
+		pr_err("Failed to open cgroup skeleton\n");
+		return -1;
+	}
+
+	skel->rodata->num_cpus = nr_cpus;
+	skel->rodata->num_events = evlist->core.nr_entries / nr_cgroups;
+
+	BUG_ON(evlist->core.nr_entries % nr_cgroups != 0);
+
+	/* we need one copy of events per cpu for reading */
+	map_size = nr_cpus * evlist->core.nr_entries / nr_cgroups;
+	bpf_map__resize(skel->maps.events, map_size);
+	bpf_map__resize(skel->maps.cpu_idx, nr_cpus);
+	bpf_map__resize(skel->maps.cgrp_idx, nr_cgroups);
+	/* previous result is saved in a per-cpu array */
+	map_size = evlist->core.nr_entries / nr_cgroups;
+	bpf_map__resize(skel->maps.prev_readings, map_size);
+	/* cgroup result needs all events */
+	map_size = nr_cpus * evlist->core.nr_entries;
+	bpf_map__resize(skel->maps.cgrp_readings, map_size);
+
+	set_max_rlimit();
+
+	err = bperf_cgroup_bpf__load(skel);
+	if (err) {
+		pr_err("Failed to load cgroup skeleton\n");
+		goto out;
+	}
+
+	if (cgroup_is_v2("perf_event") > 0)
+		skel->bss->use_cgroup_v2 = 1;
+
+	err = -1;
+
+	cgrp_switch = evsel__new(&cgrp_switch_attr);
+	if (evsel__open_per_cpu(cgrp_switch, evlist->core.all_cpus, -1) < 0) {
+		pr_err("Failed to open cgroup switches event\n");
+		goto out;
+	}
+
+	map_fd = bpf_map__fd(skel->maps.cpu_idx);
+	if (map_fd < 0) {
+		pr_err("cannot get cpu idx map\n");
+		goto out;
+	}
+
+	cgrp_prog_fds = xyarray__new(nr_cpus, 1, sizeof(int));
+	if (!cgrp_prog_fds) {
+		pr_err("Failed to allocate cgroup switch prog fd\n");
+		goto out;
+	}
+
+	for (i = 0; i < nr_cpus; i++) {
+		link = bpf_program__attach_perf_event(skel->progs.on_switch,
+						      FD(cgrp_switch, i));
+		if (IS_ERR(link)) {
+			pr_err("Failed to attach cgroup program\n");
+			err = PTR_ERR(link);
+			goto out;
+		}
+
+		/* update cpu index in case there are missing cpus */
+		cpu = evlist->core.all_cpus->map[i];
+		bpf_map_update_elem(map_fd, &cpu, &i, BPF_ANY);
+
+		prog_id = bpf_link_get_prog_id(bpf_link__fd(link));
+		PROG(i) = bpf_prog_get_fd_by_id(prog_id);
+	}
+
+	/*
+	 * Update cgrp_idx map from cgroup-id to event index.
+	 */
+	cgrp = NULL;
+	i = 0;
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (cgrp == NULL || evsel->cgrp == leader_cgrp) {
+			leader_cgrp = evsel->cgrp;
+			evsel->cgrp = NULL;
+
+			/* open single copy of the events w/o cgroup */
+			err = evsel__open_per_cpu(evsel, evlist->core.all_cpus, -1);
+			if (err) {
+				pr_err("Failed to open first cgroup events\n");
+				goto out;
+			}
+
+			map_fd = bpf_map__fd(skel->maps.events);
+			for (cpu = 0; cpu < nr_cpus; cpu++) {
+				__u32 idx = evsel->idx * nr_cpus + cpu;
+				int fd = FD(evsel, cpu);
+
+				bpf_map_update_elem(map_fd, &idx, &fd, BPF_ANY);
+			}
+
+			evsel->cgrp = leader_cgrp;
+		}
+		evsel->supported = true;
+
+		if (evsel->cgrp == cgrp)
+			continue;
+
+		cgrp = evsel->cgrp;
+
+		if (read_cgroup_id(cgrp) < 0) {
+			pr_debug("Failed to get cgroup id\n");
+			err = -1;
+			goto out;
+		}
+
+		map_fd = bpf_map__fd(skel->maps.cgrp_idx);
+		bpf_map_update_elem(map_fd, &cgrp->id, &i, BPF_ANY);
+
+		i++;
+	}
+
+	pr_debug("The kernel does not support test_run for perf_event BPF programs.\n"
+		 "Therefore, --for-each-cgroup might show inaccurate readings\n");
+	err = 0;
+
+out:
+	return err;
+}
+
+static int bperf_cgrp__load(struct evsel *evsel, struct target *target)
+{
+	static bool bperf_loaded = false;
+
+	evsel->bperf_leader_prog_fd = -1;
+	evsel->bperf_leader_link_fd = -1;
+
+	if (!bperf_loaded && bperf_load_program(evsel->evlist))
+		return -1;
+
+	bperf_loaded = true;
+	/* just to bypass bpf_counter_skip() */
+	evsel->follower_skel = (struct bperf_follower_bpf *)skel;
+
+	return 0;
+}
+
+static int bperf_cgrp__install_pe(struct evsel *evsel, int cpu, int fd)
+{
+	/* nothing to do */
+	return 0;
+}
+
+/*
+ * trigger the leader prog on each cpu, so the cgrp_reading map could get
+ * the latest results.
+ */
+static int bperf_sync_counters(struct evlist *evlist)
+{
+	struct affinity affinity;
+	int i, cpu;
+
+	/* change affinity to rotate all cpus to trigger cgroup-switches (hopefully) */
+	if (affinity__setup(&affinity) < 0)
+		return -1;
+
+	evlist__for_each_cpu(evlist, i, cpu)
+		affinity__set(&affinity, cpu);
+
+	affinity__cleanup(&affinity);
+
+	return 0;
+}
+
+static int bperf_cgrp__enable(struct evsel *evsel)
+{
+	skel->bss->enabled = 1;
+	return 0;
+}
+
+static int bperf_cgrp__disable(struct evsel *evsel)
+{
+	if (evsel->idx)
+		return 0;
+
+	bperf_sync_counters(evsel->evlist);
+
+	skel->bss->enabled = 0;
+	return 0;
+}
+
+static int bperf_cgrp__read(struct evsel *evsel)
+{
+	struct evlist *evlist = evsel->evlist;
+	int i, nr_cpus = evlist->core.all_cpus->nr;
+	struct perf_counts_values *counts;
+	struct bpf_perf_event_value values;
+	struct cgroup *cgrp = NULL;
+	int cgrp_idx = -1;
+	int reading_map_fd, err = 0;
+	__u32 idx;
+
+	if (evsel->idx)
+		return 0;
+
+	reading_map_fd = bpf_map__fd(skel->maps.cgrp_readings);
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (cgrp != evsel->cgrp) {
+			cgrp = evsel->cgrp;
+			cgrp_idx++;
+		}
+
+		for (i = 0; i < nr_cpus; i++) {
+			idx = evsel->idx * nr_cpus + i;
+			err = bpf_map_lookup_elem(reading_map_fd, &idx, &values);
+			if (err)
+				goto out;
+
+			counts = perf_counts(evsel->counts, i, 0);
+			counts->val = values.counter;
+			counts->ena = values.enabled;
+			counts->run = values.running;
+		}
+	}
+
+out:
+	return err;
+}
+
+static int bperf_cgrp__destroy(struct evsel *evsel)
+{
+	if (evsel->idx)
+		return 0;
+
+	bperf_cgroup_bpf__destroy(skel);
+	evsel__delete(cgrp_switch);  // it'll destroy on_switch progs too
+	free(cgrp_prog_fds);
+
+	return 0;
+}
+
+struct bpf_counter_ops bperf_cgrp_ops = {
+	.load       = bperf_cgrp__load,
+	.enable     = bperf_cgrp__enable,
+	.disable    = bperf_cgrp__disable,
+	.read       = bperf_cgrp__read,
+	.install_pe = bperf_cgrp__install_pe,
+	.destroy    = bperf_cgrp__destroy,
+};
diff --git a/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
new file mode 100644
index 000000000000..5b520821c4b7
--- /dev/null
+++ b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (c) 2021 Facebook
+// Copyright (c) 2021 Google
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+#define MAX_EVENTS  32  // max events per cgroup: arbitrary
+
+// NOTE: many of map and global data will be modified before loading
+//       from the userspace (perf tool) using the skeleton helpers.
+
+// single set of global perf events to measure
+struct {
+	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(int));
+	__uint(max_entries, 1);
+} events SEC(".maps");
+
+// from logical cpu number to event index
+// useful when user wants to count subset of cpus
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, 1);
+} cpu_idx SEC(".maps");
+
+// from cgroup id to event index
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u64));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, 1);
+} cgrp_idx SEC(".maps");
+
+// per-cpu event snapshots to calculate delta
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct bpf_perf_event_value));
+} prev_readings SEC(".maps");
+
+// aggregated event values for each cgroup
+// will be read from the user-space
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct bpf_perf_event_value));
+} cgrp_readings SEC(".maps");
+
+const volatile __u32 num_events = 1;
+const volatile __u32 num_cpus = 1;
+
+int enabled = 0;
+int use_cgroup_v2 = 0;
+
+static inline get_current_cgroup_v1_id(void)
+{
+	struct task_struct *p = (void *)bpf_get_current_task();
+
+	return BPF_CORE_READ(p, cgroups, subsys[perf_event_cgrp_id], cgroup, kn, id);
+}
+
+// This will be attached to cgroup-switches event for each cpu
+SEC("perf_events")
+int BPF_PROG(on_switch)
+{
+	register __u32 idx = 0;  // to have it in a register to pass BPF verifier
+	struct bpf_perf_event_value val, *prev_val, *cgrp_val;
+	__u32 cpu = bpf_get_smp_processor_id();
+	__u64 cgrp;
+	__u32 evt_idx, key;
+	__u32 *elem;
+	long err;
+
+	// map the current CPU to a CPU index, particularly necessary if there
+	// are fewer CPUs profiled on than all CPUs.
+	elem = bpf_map_lookup_elem(&cpu_idx, &cpu);
+	if (!elem)
+		return 0;
+	cpu = *elem;
+
+	if (use_cgroup_v2)
+		cgrp = bpf_get_current_cgroup_id();
+	else
+		cgrp = get_current_cgroup_v1_id();
+
+	elem = bpf_map_lookup_elem(&cgrp_idx, &cgrp);
+	if (elem)
+		cgrp = *elem;
+	else
+		cgrp = ~0ULL;
+
+	for ( ; idx < MAX_EVENTS; idx++) {
+		if (idx == num_events)
+			break;
+
+		// XXX: do not pass idx directly (for verifier)
+		key = idx;
+		// this is per-cpu array for diff
+		prev_val = bpf_map_lookup_elem(&prev_readings, &key);
+		if (!prev_val) {
+			val.counter = val.enabled = val.running = 0;
+			bpf_map_update_elem(&prev_readings, &key, val, BPF_ANY);
+
+			prev_val = bpf_map_lookup_elem(&prev_readings, &key);
+			if (!prev_val)
+				continue;
+		}
+
+		// read from global event array
+		evt_idx = idx * num_cpus + cpu;
+		err = bpf_perf_event_read_value(&events, evt_idx, &val, sizeof(val));
+		if (err)
+			continue;
+
+		if (enabled && cgrp != ~0ULL) {
+			// aggregate the result by cgroup
+			evt_idx += cgrp * num_cpus * num_events;
+			cgrp_val = bpf_map_lookup_elem(&cgrp_readings, &evt_idx);
+			if (cgrp_val) {
+				cgrp_val->counter += val.counter - prev_val->counter;
+				cgrp_val->enabled += val.enabled - prev_val->enabled;
+				cgrp_val->running += val.running - prev_val->running;
+			} else {
+				val->counter -= prev_val->counter;
+				val->enabled -= prev_val->enabled;
+				val->running -= prev_val->running;
+
+				bpf_map_update_elem(&cgrp_readings, &evt_idx, &val, BPF_ANY);
+
+				val->counter += prev_val->counter;
+				val->enabled += prev_val->enabled;
+				val->running += prev_val->running;
+			}
+		}
+
+		*prev_val = val;
+	}
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c
index 48ec79211270..f7d07b365401 100644
--- a/tools/perf/util/cgroup.c
+++ b/tools/perf/util/cgroup.c
@@ -18,6 +18,7 @@
 #include <regex.h>
 
 int nr_cgroups;
+bool cgrp_event_expanded;
 
 /* used to match cgroup name with patterns */
 struct cgroup_name {
@@ -484,6 +485,7 @@ int evlist__expand_cgroup(struct evlist *evlist, const char *str,
 	}
 
 	ret = 0;
+	cgrp_event_expanded = true;
 
 out_err:
 	evlist__delete(orig_list);
diff --git a/tools/perf/util/cgroup.h b/tools/perf/util/cgroup.h
index 1549ec2fd348..21f7ccc566e1 100644
--- a/tools/perf/util/cgroup.h
+++ b/tools/perf/util/cgroup.h
@@ -17,6 +17,7 @@ struct cgroup {
 };
 
 extern int nr_cgroups; /* number of explicit cgroups defined */
+extern bool cgrp_event_expanded;
 
 struct cgroup *cgroup__get(struct cgroup *cgroup);
 void cgroup__put(struct cgroup *cgroup);
-- 
2.32.0.272.g935e593368-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCHSET v2 0/3] perf stat: Enable BPF counters with --for-each-cgroup
  2021-06-15  1:17 [PATCHSET v2 0/3] perf stat: Enable BPF counters with --for-each-cgroup Namhyung Kim
                   ` (2 preceding siblings ...)
  2021-06-15  1:17 ` [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup Namhyung Kim
@ 2021-06-16 15:14 ` Peter Zijlstra
  2021-06-16 16:33   ` Namhyung Kim
  3 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2021-06-16 15:14 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, LKML,
	Andi Kleen, Ian Rogers, Stephane Eranian, Song Liu

On Mon, Jun 14, 2021 at 06:17:21PM -0700, Namhyung Kim wrote:
> Current limitations are:
>  * it doesn't support cgroup hierarchy

That seems unfortunate; there's no bpf helper to iterate cgroup
hierarchy?

>  * there's no reliable way to trigger running the BPF program

You can't attach to the PERF_COUNT_SW_CGROUP_SWITCHES event?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCHSET v2 0/3] perf stat: Enable BPF counters with --for-each-cgroup
  2021-06-16 15:14 ` [PATCHSET v2 0/3] perf stat: Enable BPF counters " Peter Zijlstra
@ 2021-06-16 16:33   ` Namhyung Kim
  2021-06-16 22:32     ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Namhyung Kim @ 2021-06-16 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, LKML,
	Andi Kleen, Ian Rogers, Stephane Eranian, Song Liu

Hi Peter,

On Wed, Jun 16, 2021 at 8:14 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jun 14, 2021 at 06:17:21PM -0700, Namhyung Kim wrote:
> > Current limitations are:
> >  * it doesn't support cgroup hierarchy
>
> That seems unfortunate; there's no bpf helper to iterate cgroup
> hierarchy?

I couldn't find one..

>
> >  * there's no reliable way to trigger running the BPF program
>
> You can't attach to the PERF_COUNT_SW_CGROUP_SWITCHES event?

I did it.  But the BPF test run seems not to work with perf_event.
So it needs to trigger a cgroup switch manually..

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCHSET v2 0/3] perf stat: Enable BPF counters with --for-each-cgroup
  2021-06-16 16:33   ` Namhyung Kim
@ 2021-06-16 22:32     ` Peter Zijlstra
  2021-06-17  6:33       ` Song Liu
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2021-06-16 22:32 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, LKML,
	Andi Kleen, Ian Rogers, Stephane Eranian, Song Liu

On Wed, Jun 16, 2021 at 09:33:42AM -0700, Namhyung Kim wrote:

> > That seems unfortunate; there's no bpf helper to iterate cgroup
> > hierarchy?
> 
> I couldn't find one..

Song, is that something that would make sense to have?

> > >  * there's no reliable way to trigger running the BPF program
> >
> > You can't attach to the PERF_COUNT_SW_CGROUP_SWITCHES event?
> 
> I did it.  But the BPF test run seems not to work with perf_event.
> So it needs to trigger a cgroup switch manually..

AFAICT it should be possible to set a bpf prog on a software event.
perf_event_set_bpf_prog() will take the first branch
(!perf_event_is_tracing()) and call perf_event_set_bpf_handler().

That should then result in running the bpf program every time the event
would generate a sample.

So if you configure the event to sample on every single event, it should
then run your program every time.

This is all from looking at the code, because I really can't operate any
of that for real. I suspect Song can help out.

The alternative is to attach a BPF program to the sched_switch
tracepoint and do the cgroup filter in BPF.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCHSET v2 0/3] perf stat: Enable BPF counters with --for-each-cgroup
  2021-06-16 22:32     ` Peter Zijlstra
@ 2021-06-17  6:33       ` Song Liu
  2021-06-22  1:43         ` Namhyung Kim
  0 siblings, 1 reply; 9+ messages in thread
From: Song Liu @ 2021-06-17  6:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar,
	LKML, Andi Kleen, Ian Rogers, Stephane Eranian



> On Jun 16, 2021, at 3:32 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Jun 16, 2021 at 09:33:42AM -0700, Namhyung Kim wrote:
> 
>>> That seems unfortunate; there's no bpf helper to iterate cgroup
>>> hierarchy?
>> 
>> I couldn't find one..
> 
> Song, is that something that would make sense to have?

I think we can solve this with bpf_get_current_ancestor_cgroup_id and 
a bounded loop. Like:

	/* get diff_reading, which is reading - prev_reading */

	for (i = 0; i < 10 /* at most 10 levels */; i++) {
		__u64 cgroup_id = bpf_get_current_ancestor_cgroup_id(i);
		if (!cgroup_id)
			break;
		/* add diff_reading to cgroup_id */
	}

> 
>>>> * there's no reliable way to trigger running the BPF program
>>> 
>>> You can't attach to the PERF_COUNT_SW_CGROUP_SWITCHES event?
>> 
>> I did it.  But the BPF test run seems not to work with perf_event.
>> So it needs to trigger a cgroup switch manually..
> 
> AFAICT it should be possible to set a bpf prog on a software event.
> perf_event_set_bpf_prog() will take the first branch
> (!perf_event_is_tracing()) and call perf_event_set_bpf_handler().
> 
> That should then result in running the bpf program every time the event
> would generate a sample.
> 
> So if you configure the event to sample on every single event, it should
> then run your program every time.
> 
> This is all from looking at the code, because I really can't operate any
> of that for real. I suspect Song can help out.
> 
> The alternative is to attach a BPF program to the sched_switch
> tracepoint and do the cgroup filter in BPF.

We can create a raw_tp BPF program just for BPF_PROG_TEST_RUN (now also called
BPF_PROG_RUN). The program should be the same as current on_switch program. 
We don't have to attach the program, just use BPF_PROG_RUN to trigger it. 

Would something like this work?

Thanks,
Song 





^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCHSET v2 0/3] perf stat: Enable BPF counters with --for-each-cgroup
  2021-06-17  6:33       ` Song Liu
@ 2021-06-22  1:43         ` Namhyung Kim
  0 siblings, 0 replies; 9+ messages in thread
From: Namhyung Kim @ 2021-06-22  1:43 UTC (permalink / raw)
  To: Song Liu
  Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar,
	LKML, Andi Kleen, Ian Rogers, Stephane Eranian

Hi Song,

On Wed, Jun 16, 2021 at 11:33 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jun 16, 2021, at 3:32 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Jun 16, 2021 at 09:33:42AM -0700, Namhyung Kim wrote:
> >
> >>> That seems unfortunate; there's no bpf helper to iterate cgroup
> >>> hierarchy?
> >>
> >> I couldn't find one..
> >
> > Song, is that something that would make sense to have?
>
> I think we can solve this with bpf_get_current_ancestor_cgroup_id and
> a bounded loop. Like:
>
>         /* get diff_reading, which is reading - prev_reading */
>
>         for (i = 0; i < 10 /* at most 10 levels */; i++) {
>                 __u64 cgroup_id = bpf_get_current_ancestor_cgroup_id(i);
>                 if (!cgroup_id)
>                         break;
>                 /* add diff_reading to cgroup_id */
>         }

OK, but I'm not sure 0 id is guaranteed.

>
> >
> >>>> * there's no reliable way to trigger running the BPF program
> >>>
> >>> You can't attach to the PERF_COUNT_SW_CGROUP_SWITCHES event?
> >>
> >> I did it.  But the BPF test run seems not to work with perf_event.
> >> So it needs to trigger a cgroup switch manually..
> >
> > AFAICT it should be possible to set a bpf prog on a software event.
> > perf_event_set_bpf_prog() will take the first branch
> > (!perf_event_is_tracing()) and call perf_event_set_bpf_handler().
> >
> > That should then result in running the bpf program every time the event
> > would generate a sample.
> >
> > So if you configure the event to sample on every single event, it should
> > then run your program every time.
> >
> > This is all from looking at the code, because I really can't operate any
> > of that for real. I suspect Song can help out.
> >
> > The alternative is to attach a BPF program to the sched_switch
> > tracepoint and do the cgroup filter in BPF.
>
> We can create a raw_tp BPF program just for BPF_PROG_TEST_RUN (now also called
> BPF_PROG_RUN). The program should be the same as current on_switch program.
> We don't have to attach the program, just use BPF_PROG_RUN to trigger it.
>
> Would something like this work?

Oh, I think it'd work.  Thanks for the suggestion!

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-06-22  1:44 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-15  1:17 [PATCHSET v2 0/3] perf stat: Enable BPF counters with --for-each-cgroup Namhyung Kim
2021-06-15  1:17 ` [PATCH 1/3] perf tools: Add read_cgroup_id() function Namhyung Kim
2021-06-15  1:17 ` [PATCH 2/3] perf tools: Add cgroup_is_v2() helper Namhyung Kim
2021-06-15  1:17 ` [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup Namhyung Kim
2021-06-16 15:14 ` [PATCHSET v2 0/3] perf stat: Enable BPF counters " Peter Zijlstra
2021-06-16 16:33   ` Namhyung Kim
2021-06-16 22:32     ` Peter Zijlstra
2021-06-17  6:33       ` Song Liu
2021-06-22  1:43         ` Namhyung Kim

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.