[PATCHSET v3 0/3] perf stat: Enable BPF counters with --for-each-cgroup

* [PATCHSET v3 0/3] perf stat: Enable BPF counters with --for-each-cgroup
@ 2021-06-22  7:12 Namhyung Kim
  2021-06-22  7:12 ` [PATCH 1/3] perf tools: Add read_cgroup_id() function Namhyung Kim
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Namhyung Kim @ 2021-06-22  7:12 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Stephane Eranian, Song Liu

Hello,

This patchset adds BPF support for --for-each-cgroup so that perf stat
can handle many cgroup events on big machines.  Use the --bpf-counters
option to enable the new behavior.

 * changes in v3
  - support cgroup hierarchy with ancestor ids
  - add and trigger raw_tp BPF program
  - add a build rule for vmlinux.h

 * changes in v2
  - remove incorrect use of BPF_F_PRESERVE_ELEMS
  - add missing map elements after lookup
  - handle cgroup v1

The basic idea is to use a single set of per-cpu events to count the
events of interest and aggregate the results for each cgroup.  I used
the bperf mechanism: a BPF program runs on cgroup-switches and saves
the results in the matching map element for the given cgroups.

Without this, we need separate events for each cgroup, which creates
unnecessary multiplexing overhead (and PMU programming) when tasks in
different cgroups are switched.  I saw this make a big difference on
256-cpu machines with hundreds of cgroups.
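
To illustrate the accounting idea, here is a small user-space simulation
in plain C.  It is only a sketch, not the BPF program from this series:
it assumes one free-running counter per cpu and a hypothetical hook,
on_cgroup_switch(), that is called with the index of the cgroup that was
running until now.  All names and sizes are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS    2	/* illustrative sizes, not taken from the series */
#define NR_CGROUPS 3

/* last raw counter value seen on each cpu */
static uint64_t prev_reading[NR_CPUS];
/* per-cgroup, per-cpu totals built from the deltas */
static uint64_t cgrp_total[NR_CGROUPS][NR_CPUS];

/*
 * Hypothetical hook: called on 'cpu' when a task of cgroup 'prev_cgrp' is
 * switched out.  'raw' is the current value of the single free-running
 * counter on that cpu.
 */
static void on_cgroup_switch(int cpu, int prev_cgrp, uint64_t raw)
{
	uint64_t delta;

	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(prev_cgrp >= 0 && prev_cgrp < NR_CGROUPS);

	/* everything counted since the last switch belongs to prev_cgrp */
	delta = raw - prev_reading[cpu];
	cgrp_total[prev_cgrp][cpu] += delta;
	prev_reading[cpu] = raw;
}

/* sum the per-cpu totals of one cgroup */
static uint64_t cgroup_sum(int cgrp)
{
	uint64_t sum = 0;

	assert(cgrp >= 0 && cgrp < NR_CGROUPS);

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += cgrp_total[cgrp][cpu];
	return sum;
}
```

The point of the sketch is that the number of hardware events stays at
one set per cpu regardless of NR_CGROUPS; only the size of the result
table grows with the number of cgroups.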

Actually, this is what I wanted to do in the kernel [1], but we can
do the job using BPF!

Thanks,
Namhyung

[1] https://lore.kernel.org/lkml/20210413155337.644993-1-namhyung@kernel.org/

Namhyung Kim (3):
  perf tools: Add read_cgroup_id() function
  perf tools: Add cgroup_is_v2() helper
  perf stat: Enable BPF counter with --for-each-cgroup

 tools/perf/Makefile.perf                    |   7 +-
 tools/perf/util/Build                       |   1 +
 tools/perf/util/bpf_counter.c               |   5 +
 tools/perf/util/bpf_counter_cgroup.c        | 337 ++++++++++++++++++++
 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c | 207 ++++++++++++
 tools/perf/util/cgroup.c                    |  46 +++
 tools/perf/util/cgroup.h                    |  12 +
 7 files changed, 614 insertions(+), 1 deletion(-)
 create mode 100644 tools/perf/util/bpf_counter_cgroup.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c

-- 
2.32.0.288.g62a8d224e6-goog

* [PATCH 1/3] perf tools: Add read_cgroup_id() function
  2021-06-22  7:12 [PATCHSET v3 0/3] perf stat: Enable BPF counters with --for-each-cgroup Namhyung Kim
@ 2021-06-22  7:12 ` Namhyung Kim
  2021-06-22  7:12 ` [PATCH 2/3] perf tools: Add cgroup_is_v2() helper Namhyung Kim
  2021-06-22  7:12 ` [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup Namhyung Kim
  2 siblings, 0 replies; 13+ messages in thread
From: Namhyung Kim @ 2021-06-22  7:12 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Stephane Eranian, Song Liu

read_cgroup_id() reads the cgroup id of the given cgroup from a file
handle obtained with name_to_handle_at(2).  It'll be used by the bperf
cgroup stat code later.
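
For readers unfamiliar with name_to_handle_at(2), the following
standalone sketch shows the same technique outside of perf.  It assumes,
as the patch does, that the handle of a cgroupfs directory carries the
64-bit cgroup id as its payload.  The helper name cgroup_id_from_path()
is hypothetical and not part of this patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* name_to_handle_at() and struct file_handle */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: store the 64-bit id of the cgroup directory at
 * 'path' in *id.  Returns 0 on success and -1 on failure.
 */
static int cgroup_id_from_path(const char *path, uint64_t *id)
{
	/* room for the handle header plus an 8-byte payload */
	union {
		struct file_handle fh;
		unsigned char buf[sizeof(struct file_handle) + sizeof(uint64_t)];
	} handle;
	int mount_id;

	assert(path != NULL && id != NULL);

	memset(&handle, 0, sizeof(handle));
	handle.fh.handle_bytes = sizeof(uint64_t);

	if (name_to_handle_at(AT_FDCWD, path, &handle.fh, &mount_id, 0) < 0)
		return -1;

	/* anything other than an 8-byte payload is not what we expect */
	if (handle.fh.handle_bytes != sizeof(uint64_t))
		return -1;

	memcpy(id, handle.buf + offsetof(struct file_handle, f_handle),
	       sizeof(*id));
	return 0;
}
```

A caller would pass the full path of a cgroup directory below the
perf_event mount point; where that mount point lives varies by system.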

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/cgroup.c | 25 +++++++++++++++++++++++++
 tools/perf/util/cgroup.h |  9 +++++++++
 2 files changed, 34 insertions(+)

diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c
index f24ab4585553..ef18c988c681 100644
--- a/tools/perf/util/cgroup.c
+++ b/tools/perf/util/cgroup.c
@@ -45,6 +45,31 @@ static int open_cgroup(const char *name)
 	return fd;
 }
 
+#ifdef HAVE_FILE_HANDLE
+int read_cgroup_id(struct cgroup *cgrp)
+{
+	char path[PATH_MAX + 1];
+	char mnt[PATH_MAX + 1];
+	struct {
+		struct file_handle fh;
+		uint64_t cgroup_id;
+	} handle;
+	int mount_id;
+
+	if (cgroupfs_find_mountpoint(mnt, PATH_MAX + 1, "perf_event"))
+		return -1;
+
+	scnprintf(path, PATH_MAX, "%s/%s", mnt, cgrp->name);
+
+	handle.fh.handle_bytes = sizeof(handle.cgroup_id);
+	if (name_to_handle_at(AT_FDCWD, path, &handle.fh, &mount_id, 0) < 0)
+		return -1;
+
+	cgrp->id = handle.cgroup_id;
+	return 0;
+}
+#endif  /* HAVE_FILE_HANDLE */
+
 static struct cgroup *evlist__find_cgroup(struct evlist *evlist, const char *str)
 {
 	struct evsel *counter;
diff --git a/tools/perf/util/cgroup.h b/tools/perf/util/cgroup.h
index 162906f3412a..707adbe25123 100644
--- a/tools/perf/util/cgroup.h
+++ b/tools/perf/util/cgroup.h
@@ -38,4 +38,13 @@ struct cgroup *cgroup__find(struct perf_env *env, uint64_t id);
 
 void perf_env__purge_cgroups(struct perf_env *env);
 
+#ifdef HAVE_FILE_HANDLE
+int read_cgroup_id(struct cgroup *cgrp);
+#else
+static inline int read_cgroup_id(struct cgroup *cgrp)
+{
+	return -1;
+}
+#endif  /* HAVE_FILE_HANDLE */
+
 #endif /* __CGROUP_H__ */
-- 
2.32.0.288.g62a8d224e6-goog


* [PATCH 2/3] perf tools: Add cgroup_is_v2() helper
  2021-06-22  7:12 [PATCHSET v3 0/3] perf stat: Enable BPF counters with --for-each-cgroup Namhyung Kim
  2021-06-22  7:12 ` [PATCH 1/3] perf tools: Add read_cgroup_id() function Namhyung Kim
@ 2021-06-22  7:12 ` Namhyung Kim
  2021-06-22  7:12 ` [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup Namhyung Kim
  2 siblings, 0 replies; 13+ messages in thread
From: Namhyung Kim @ 2021-06-22  7:12 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Stephane Eranian, Song Liu

cgroup_is_v2() checks whether the given subsystem is mounted on
cgroup v2.  It'll be used by the BPF cgroup code later.
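
The check itself boils down to comparing the filesystem type reported by
statfs(2) with the cgroup2 magic number.  A minimal standalone sketch,
assuming the caller already knows the mount point (the helper name
mount_is_cgroup2() is hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/statfs.h>

#ifndef CGROUP2_SUPER_MAGIC
#define CGROUP2_SUPER_MAGIC 0x63677270
#endif

/*
 * Hypothetical helper: returns 1 if 'mnt' is on a cgroup v2 filesystem,
 * 0 if it is on some other filesystem, and -1 if statfs(2) fails.
 */
static int mount_is_cgroup2(const char *mnt)
{
	struct statfs stbuf;

	assert(mnt != NULL);

	/* statfs() takes a pointer to the buffer it fills in */
	if (statfs(mnt, &stbuf) < 0)
		return -1;

	return stbuf.f_type == CGROUP2_SUPER_MAGIC;
}
```

On a system that mounts the unified hierarchy at /sys/fs/cgroup, calling
mount_is_cgroup2("/sys/fs/cgroup") would be expected to return 1.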

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/cgroup.c | 19 +++++++++++++++++++
 tools/perf/util/cgroup.h |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c
index ef18c988c681..48ec79211270 100644
--- a/tools/perf/util/cgroup.c
+++ b/tools/perf/util/cgroup.c
@@ -9,6 +9,7 @@
 #include <linux/zalloc.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/statfs.h>
 #include <fcntl.h>
 #include <stdlib.h>
 #include <string.h>
@@ -70,6 +71,24 @@ int read_cgroup_id(struct cgroup *cgrp)
 }
 #endif  /* HAVE_FILE_HANDLE */
 
+#ifndef CGROUP2_SUPER_MAGIC
+#define CGROUP2_SUPER_MAGIC  0x63677270
+#endif
+
+int cgroup_is_v2(const char *subsys)
+{
+	char mnt[PATH_MAX + 1];
+	struct statfs stbuf;
+
+	if (cgroupfs_find_mountpoint(mnt, PATH_MAX + 1, subsys))
+		return -1;
+
+	if (statfs(mnt, &stbuf) < 0)
+		return -1;
+
+	return (stbuf.f_type == CGROUP2_SUPER_MAGIC);
+}
+
 static struct cgroup *evlist__find_cgroup(struct evlist *evlist, const char *str)
 {
 	struct evsel *counter;
diff --git a/tools/perf/util/cgroup.h b/tools/perf/util/cgroup.h
index 707adbe25123..1549ec2fd348 100644
--- a/tools/perf/util/cgroup.h
+++ b/tools/perf/util/cgroup.h
@@ -47,4 +47,6 @@ int read_cgroup_id(struct cgroup *cgrp)
 }
 #endif  /* HAVE_FILE_HANDLE */
 
+int cgroup_is_v2(const char *subsys);
+
 #endif /* __CGROUP_H__ */
-- 
2.32.0.288.g62a8d224e6-goog


* [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup
  2021-06-22  7:12 [PATCHSET v3 0/3] perf stat: Enable BPF counters with --for-each-cgroup Namhyung Kim
  2021-06-22  7:12 ` [PATCH 1/3] perf tools: Add read_cgroup_id() function Namhyung Kim
  2021-06-22  7:12 ` [PATCH 2/3] perf tools: Add cgroup_is_v2() helper Namhyung Kim
@ 2021-06-22  7:12 ` Namhyung Kim
  2021-06-24  4:54   ` Song Liu
  2 siblings, 1 reply; 13+ messages in thread
From: Namhyung Kim @ 2021-06-22  7:12 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Stephane Eranian, Song Liu

Recently bperf was added to use BPF to count perf events for various
purposes.  This is an extension of that approach, targeting cgroup
usage.

Unlike the other bperf, it doesn't share the events with other
processes, but it reduces unnecessary events (and the overhead of
multiplexing) for each monitored cgroup within the perf session.

When --for-each-cgroup is used with --bpf-counters, it opens a
cgroup-switches event per cpu internally and attaches the new BPF
program, which reads the given perf_events and aggregates the results
for cgroups.  The program is only called when a task is switched to a
task in a different cgroup.
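
Because cgroups are hierarchical, a delta has to be credited to every
monitored ancestor of the current cgroup, which is what
get_cgroup_v1_idx() and get_cgroup_v2_idx() in the BPF program collect.
The walk can be sketched in plain user-space C as follows.  The table
type and the lookup helper are hypothetical stand-ins for the cgrp_idx
BPF map; this is an illustration, not the BPF code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LEVELS 10	/* same arbitrary bound as in the patch */

/* hypothetical stand-in for the cgrp_idx map: cgroup id -> result index */
struct cgrp_entry {
	uint64_t id;
	uint32_t idx;
};

static const uint32_t *lookup_idx(const struct cgrp_entry *tbl, size_t n,
				  uint64_t id)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].id == id)
			return &tbl[i].idx;
	return NULL;
}

/*
 * Walk the ancestor ids of the current cgroup (root first, the cgroup
 * itself at position 'level') and collect the result index of every
 * monitored one.  Returns how many indices were stored in 'out'.
 */
static int monitored_ancestors(const struct cgrp_entry *tbl, size_t n,
			       const uint64_t *ancestor_ids, int level,
			       uint32_t *out, int size)
{
	int cnt = 0;

	assert(tbl != NULL && ancestor_ids != NULL && out != NULL);

	for (int i = 0; i < MAX_LEVELS && i <= level && cnt < size; i++) {
		const uint32_t *elem = lookup_idx(tbl, n, ancestor_ids[i]);

		if (!elem)
			continue;	/* this ancestor is not monitored */
		out[cnt++] = *elem;
	}
	return cnt;
}
```

With cgroups 10 and 30 monitored and a task whose ancestor chain is
1, 10, 20, 30, both result indices are returned, so the same delta is
added to both cgroups.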

Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/Makefile.perf                    |   7 +-
 tools/perf/util/Build                       |   1 +
 tools/perf/util/bpf_counter.c               |   5 +
 tools/perf/util/bpf_counter_cgroup.c        | 337 ++++++++++++++++++++
 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c | 207 ++++++++++++
 tools/perf/util/cgroup.c                    |   2 +
 tools/perf/util/cgroup.h                    |   1 +
 7 files changed, 559 insertions(+), 1 deletion(-)
 create mode 100644 tools/perf/util/bpf_counter_cgroup.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index e47f04e5b51e..786cba8f3798 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1015,6 +1015,7 @@ SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
 SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
 SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
+SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h
 
 ifdef BUILD_BPF_SKEL
 BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
@@ -1032,7 +1033,11 @@ $(SKEL_TMP_OUT)/%.bpf.o: util/bpf_skel/%.bpf.c $(LIBBPF) | $(SKEL_TMP_OUT)
 	$(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -Wall -Werror $(BPF_INCLUDE) \
 	  -c $(filter util/bpf_skel/%.bpf.c,$^) -o $@ && $(LLVM_STRIP) -g $@
 
-$(SKEL_OUT)/%.skel.h: $(SKEL_TMP_OUT)/%.bpf.o | $(BPFTOOL)
+$(SKEL_OUT)/vmlinux.h:
+	$(MAKE) -C ../bpf/bpftool OUTPUT=$(SKEL_TMP_OUT)/ $(SKEL_TMP_OUT)/vmlinux.h
+	$(Q)mv $(SKEL_TMP_OUT)/vmlinux.h $(SKEL_OUT)/vmlinux.h
+
+$(SKEL_OUT)/%.skel.h: $(SKEL_TMP_OUT)/%.bpf.o $(SKEL_OUT)/vmlinux.h | $(BPFTOOL)
 	$(QUIET_GENSKEL)$(BPFTOOL) gen skeleton $< > $@
 
 bpf-skel: $(SKELETONS)
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 95e15d1035ab..700d635448ff 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -140,6 +140,7 @@ perf-y += clockid.o
 perf-$(CONFIG_LIBBPF) += bpf-loader.o
 perf-$(CONFIG_LIBBPF) += bpf_map.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
+perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
 perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
 perf-$(CONFIG_LIBELF) += symbol-elf.o
 perf-$(CONFIG_LIBELF) += probe-file.o
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 974f10e356f0..7812c5d9b826 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -22,6 +22,7 @@
 #include "evsel.h"
 #include "evlist.h"
 #include "target.h"
+#include "cgroup.h"
 #include "cpumap.h"
 #include "thread_map.h"
 
@@ -792,6 +793,8 @@ struct bpf_counter_ops bperf_ops = {
 	.destroy    = bperf__destroy,
 };
 
+extern struct bpf_counter_ops bperf_cgrp_ops;
+
 static inline bool bpf_counter_skip(struct evsel *evsel)
 {
 	return list_empty(&evsel->bpf_counter_list) &&
@@ -809,6 +812,8 @@ int bpf_counter__load(struct evsel *evsel, struct target *target)
 {
 	if (target->bpf_str)
 		evsel->bpf_counter_ops = &bpf_program_profiler_ops;
+	else if (cgrp_event_expanded && target->use_bpf)
+		evsel->bpf_counter_ops = &bperf_cgrp_ops;
 	else if (target->use_bpf || evsel->bpf_counter ||
 		 evsel__match_bpf_counter_events(evsel->name))
 		evsel->bpf_counter_ops = &bperf_ops;
diff --git a/tools/perf/util/bpf_counter_cgroup.c b/tools/perf/util/bpf_counter_cgroup.c
new file mode 100644
index 000000000000..86cf26d712b8
--- /dev/null
+++ b/tools/perf/util/bpf_counter_cgroup.c
@@ -0,0 +1,337 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Copyright (c) 2019 Facebook */
+/* Copyright (c) 2021 Google */
+
+#include <assert.h>
+#include <limits.h>
+#include <unistd.h>
+#include <sys/file.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <linux/err.h>
+#include <linux/zalloc.h>
+#include <linux/perf_event.h>
+#include <bpf/bpf.h>
+#include <bpf/btf.h>
+#include <bpf/libbpf.h>
+#include <api/fs/fs.h>
+#include <perf/bpf_perf.h>
+
+#include "affinity.h"
+#include "bpf_counter.h"
+#include "cgroup.h"
+#include "counts.h"
+#include "debug.h"
+#include "evsel.h"
+#include "evlist.h"
+#include "target.h"
+#include "cpumap.h"
+#include "thread_map.h"
+
+#include "bpf_skel/bperf_cgroup.skel.h"
+
+static struct perf_event_attr cgrp_switch_attr = {
+	.type = PERF_TYPE_SOFTWARE,
+	.config = PERF_COUNT_SW_CGROUP_SWITCHES,
+	.size = sizeof(cgrp_switch_attr),
+	.sample_period = 1,
+	.disabled = 1,
+};
+
+static struct evsel *cgrp_switch;
+static struct xyarray *cgrp_prog_fds;
+static struct bperf_cgroup_bpf *skel;
+
+#define FD(evt, cpu) (*(int *)xyarray__entry(evt->core.fd, cpu, 0))
+#define PROG(cpu)    (*(int *)xyarray__entry(cgrp_prog_fds, cpu, 0))
+
+static void set_max_rlimit(void)
+{
+	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
+
+	setrlimit(RLIMIT_MEMLOCK, &rinf);
+}
+
+static __u32 bpf_link_get_prog_id(int fd)
+{
+	struct bpf_link_info link_info = {0};
+	__u32 link_info_len = sizeof(link_info);
+
+	bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
+	return link_info.prog_id;
+}
+
+static int bperf_load_program(struct evlist *evlist)
+{
+	struct bpf_link *link;
+	struct evsel *evsel;
+	struct cgroup *cgrp, *leader_cgrp;
+	__u32 i, cpu, prog_id;
+	int nr_cpus = evlist->core.all_cpus->nr;
+	int map_size, map_fd;
+	int prog_fd, err;
+
+	skel = bperf_cgroup_bpf__open();
+	if (!skel) {
+		pr_err("Failed to open cgroup skeleton\n");
+		return -1;
+	}
+
+	skel->rodata->num_cpus = nr_cpus;
+	skel->rodata->num_events = evlist->core.nr_entries / nr_cgroups;
+
+	BUG_ON(evlist->core.nr_entries % nr_cgroups != 0);
+
+	/* we need one copy of events per cpu for reading */
+	map_size = nr_cpus * evlist->core.nr_entries / nr_cgroups;
+	bpf_map__resize(skel->maps.events, map_size);
+	bpf_map__resize(skel->maps.cpu_idx, nr_cpus);
+	bpf_map__resize(skel->maps.cgrp_idx, nr_cgroups);
+	/* previous result is saved in a per-cpu array */
+	map_size = evlist->core.nr_entries / nr_cgroups;
+	bpf_map__resize(skel->maps.prev_readings, map_size);
+	/* cgroup result needs all events */
+	map_size = nr_cpus * evlist->core.nr_entries;
+	bpf_map__resize(skel->maps.cgrp_readings, map_size);
+
+	set_max_rlimit();
+
+	err = bperf_cgroup_bpf__load(skel);
+	if (err) {
+		pr_err("Failed to load cgroup skeleton\n");
+		goto out;
+	}
+
+	if (cgroup_is_v2("perf_event") > 0)
+		skel->bss->use_cgroup_v2 = 1;
+
+	err = -1;
+
+	cgrp_switch = evsel__new(&cgrp_switch_attr);
+	if (evsel__open_per_cpu(cgrp_switch, evlist->core.all_cpus, -1) < 0) {
+		pr_err("Failed to open cgroup switches event\n");
+		goto out;
+	}
+
+	map_fd = bpf_map__fd(skel->maps.cpu_idx);
+	if (map_fd < 0) {
+		pr_err("cannot get cpu idx map\n");
+		goto out;
+	}
+
+	cgrp_prog_fds = xyarray__new(nr_cpus, 1, sizeof(int));
+	if (!cgrp_prog_fds) {
+		pr_err("Failed to allocate cgroup switch prog fd\n");
+		goto out;
+	}
+
+	for (i = 0; i < nr_cpus; i++) {
+		link = bpf_program__attach_perf_event(skel->progs.on_cgrp_switch,
+						      FD(cgrp_switch, i));
+		if (IS_ERR(link)) {
+			pr_err("Failed to attach cgroup program\n");
+			err = PTR_ERR(link);
+			goto out;
+		}
+
+		/* update cpu index in case there are missing cpus */
+		cpu = evlist->core.all_cpus->map[i];
+		bpf_map_update_elem(map_fd, &cpu, &i, BPF_ANY);
+
+		prog_id = bpf_link_get_prog_id(bpf_link__fd(link));
+		PROG(i) = bpf_prog_get_fd_by_id(prog_id);
+	}
+
+	/*
+	 * Update cgrp_idx map from cgroup-id to event index.
+	 */
+	cgrp = NULL;
+	i = 0;
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (cgrp == NULL || evsel->cgrp == leader_cgrp) {
+			leader_cgrp = evsel->cgrp;
+			evsel->cgrp = NULL;
+
+			/* open single copy of the events w/o cgroup */
+			err = evsel__open_per_cpu(evsel, evlist->core.all_cpus, -1);
+			if (err) {
+				pr_err("Failed to open first cgroup events\n");
+				goto out;
+			}
+
+			map_fd = bpf_map__fd(skel->maps.events);
+			for (cpu = 0; cpu < nr_cpus; cpu++) {
+				__u32 idx = evsel->idx * nr_cpus + cpu;
+				int fd = FD(evsel, cpu);
+
+				bpf_map_update_elem(map_fd, &idx, &fd, BPF_ANY);
+			}
+
+			evsel->cgrp = leader_cgrp;
+		}
+		evsel->supported = true;
+
+		if (evsel->cgrp == cgrp)
+			continue;
+
+		cgrp = evsel->cgrp;
+
+		if (read_cgroup_id(cgrp) < 0) {
+			pr_debug("Failed to get cgroup id\n");
+			err = -1;
+			goto out;
+		}
+
+		map_fd = bpf_map__fd(skel->maps.cgrp_idx);
+		bpf_map_update_elem(map_fd, &cgrp->id, &i, BPF_ANY);
+
+		i++;
+	}
+
+	/*
+	 * bperf uses BPF_PROG_TEST_RUN to get accurate reading. Check
+	 * whether the kernel supports it
+	 */
+	prog_fd = bpf_program__fd(skel->progs.trigger_read);
+	err = bperf_trigger_reading(prog_fd, 0);
+	if (err) {
+		pr_debug("The kernel does not support test_run for raw_tp BPF programs.\n"
+			 "Therefore, --for-each-cgroup might show inaccurate readings\n");
+	}
+
+out:
+	return err;
+}
+
+static int bperf_cgrp__load(struct evsel *evsel, struct target *target)
+{
+	static bool bperf_loaded = false;
+
+	evsel->bperf_leader_prog_fd = -1;
+	evsel->bperf_leader_link_fd = -1;
+
+	if (!bperf_loaded && bperf_load_program(evsel->evlist))
+		return -1;
+
+	bperf_loaded = true;
+	/* just to bypass bpf_counter_skip() */
+	evsel->follower_skel = (struct bperf_follower_bpf *)skel;
+
+	return 0;
+}
+
+static int bperf_cgrp__install_pe(struct evsel *evsel, int cpu, int fd)
+{
+	/* nothing to do */
+	return 0;
+}
+
+/* trigger the leader program on a cpu */
+static int bperf_trigger_reading(int prog_fd, int cpu)
+{
+	DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
+			    .ctx_in = NULL,
+			    .ctx_size_in = 0,
+			    .flags = BPF_F_TEST_RUN_ON_CPU,
+			    .cpu = cpu,
+			    .retval = 0,
+		);
+
+	return bpf_prog_test_run_opts(prog_fd, &opts);
+}
+/*
+ * trigger the leader prog on each cpu, so the cgrp_readings map could get
+ * the latest results.
+ */
+static int bperf_sync_counters(struct evlist *evlist)
+{
+	int i, cpu;
+	int nr_cpus = evlist->core.all_cpus->nr;
+	int prog_fd = bpf_program__fd(skel->progs.trigger_read);
+
+	for (i = 0; i < nr_cpus; i++) {
+		cpu = evlist->core.all_cpus->map[i];
+		bperf_trigger_reading(prog_fd, cpu);
+	}
+
+	return 0;
+}
+
+static int bperf_cgrp__enable(struct evsel *evsel)
+{
+	skel->bss->enabled = 1;
+	return 0;
+}
+
+static int bperf_cgrp__disable(struct evsel *evsel)
+{
+	if (evsel->idx)
+		return 0;
+
+	bperf_sync_counters(evsel->evlist);
+
+	skel->bss->enabled = 0;
+	return 0;
+}
+
+static int bperf_cgrp__read(struct evsel *evsel)
+{
+	struct evlist *evlist = evsel->evlist;
+	int i, nr_cpus = evlist->core.all_cpus->nr;
+	struct perf_counts_values *counts;
+	struct bpf_perf_event_value values;
+	struct cgroup *cgrp = NULL;
+	int cgrp_idx = -1;
+	int reading_map_fd, err = 0;
+	__u32 idx;
+
+	if (evsel->idx)
+		return 0;
+
+	reading_map_fd = bpf_map__fd(skel->maps.cgrp_readings);
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (cgrp != evsel->cgrp) {
+			cgrp = evsel->cgrp;
+			cgrp_idx++;
+		}
+
+		for (i = 0; i < nr_cpus; i++) {
+			idx = evsel->idx * nr_cpus + i;
+			err = bpf_map_lookup_elem(reading_map_fd, &idx, &values);
+			if (err)
+				goto out;
+
+			counts = perf_counts(evsel->counts, i, 0);
+			counts->val = values.counter;
+			counts->ena = values.enabled;
+			counts->run = values.running;
+		}
+	}
+
+out:
+	return err;
+}
+
+static int bperf_cgrp__destroy(struct evsel *evsel)
+{
+	if (evsel->idx)
+		return 0;
+
+	bperf_cgroup_bpf__destroy(skel);
+	evsel__delete(cgrp_switch);  // it'll destroy on_cgrp_switch progs too
+	free(cgrp_prog_fds);
+
+	return 0;
+}
+
+struct bpf_counter_ops bperf_cgrp_ops = {
+	.load       = bperf_cgrp__load,
+	.enable     = bperf_cgrp__enable,
+	.disable    = bperf_cgrp__disable,
+	.read       = bperf_cgrp__read,
+	.install_pe = bperf_cgrp__install_pe,
+	.destroy    = bperf_cgrp__destroy,
+};
diff --git a/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
new file mode 100644
index 000000000000..6d74e93dd1f5
--- /dev/null
+++ b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
@@ -0,0 +1,207 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (c) 2021 Facebook
+// Copyright (c) 2021 Google
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+#define MAX_LEVELS  10  // max cgroup hierarchy level: arbitrary
+#define MAX_EVENTS  32  // max events per cgroup: arbitrary
+
+// NOTE: many of the maps and global data will be modified from userspace
+//       (the perf tool) using the skeleton helpers before loading.
+
+// single set of global perf events to measure
+struct {
+	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(int));
+	__uint(max_entries, 1);
+} events SEC(".maps");
+
+// from logical cpu number to event index
+// useful when user wants to count subset of cpus
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, 1);
+} cpu_idx SEC(".maps");
+
+// from cgroup id to event index
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u64));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, 1);
+} cgrp_idx SEC(".maps");
+
+// per-cpu event snapshots to calculate delta
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct bpf_perf_event_value));
+} prev_readings SEC(".maps");
+
+// aggregated event values for each cgroup
+// will be read from the user-space
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct bpf_perf_event_value));
+} cgrp_readings SEC(".maps");
+
+const volatile __u32 num_events = 1;
+const volatile __u32 num_cpus = 1;
+
+int enabled = 0;
+int use_cgroup_v2 = 0;
+
+static inline int get_cgroup_v1_idx(__u32 *cgrps, int size)
+{
+	struct task_struct *p = (void *)bpf_get_current_task();
+	struct cgroup *cgrp;
+	register int i = 0;
+	__u32 *elem;
+	int level;
+	int cnt;
+
+	cgrp = BPF_CORE_READ(p, cgroups, subsys[perf_event_cgrp_id], cgroup);
+	level = BPF_CORE_READ(cgrp, level);
+
+	for (cnt = 0; i < MAX_LEVELS; i++) {
+		__u64 cgrp_id;
+
+		if (i > level)
+			break;
+
+		// convert cgroup-id to a map index
+		cgrp_id = BPF_CORE_READ(cgrp, ancestor_ids[i]);
+		elem = bpf_map_lookup_elem(&cgrp_idx, &cgrp_id);
+		if (!elem)
+			continue;
+
+		cgrps[cnt++] = *elem;
+		if (cnt == size)
+			break;
+	}
+
+	return cnt;
+}
+
+static inline int get_cgroup_v2_idx(__u32 *cgrps, int size)
+{
+	register int i = 0;
+	__u32 *elem;
+	int cnt;
+
+	for (cnt = 0; i < MAX_LEVELS; i++) {
+		__u64 cgrp_id = bpf_get_current_ancestor_cgroup_id(i);
+
+		if (cgrp_id == 0)
+			break;
+
+		// convert cgroup-id to a map index
+		elem = bpf_map_lookup_elem(&cgrp_idx, &cgrp_id);
+		if (!elem)
+			continue;
+
+		cgrps[cnt++] = *elem;
+		if (cnt == size)
+			break;
+	}
+
+	return cnt;
+}
+
+static int bperf_cgroup_count(void)
+{
+	register __u32 idx = 0;  // to have it in a register to pass BPF verifier
+	register int c = 0;
+	struct bpf_perf_event_value val, delta, *prev_val, *cgrp_val;
+	__u32 cpu = bpf_get_smp_processor_id();
+	__u32 cgrp_idx[MAX_LEVELS];
+	int cgrp_cnt;
+	__u32 evt_idx, key, cgrp;
+	__u32 *elem;
+	long err;
+
+	// map the current CPU to a CPU index, particularly necessary if there
+	// are fewer CPUs profiled on than all CPUs.
+	elem = bpf_map_lookup_elem(&cpu_idx, &cpu);
+	if (!elem)
+		return 0;
+	cpu = *elem;
+
+	if (use_cgroup_v2)
+		cgrp_cnt = get_cgroup_v2_idx(cgrp_idx, MAX_LEVELS);
+	else
+		cgrp_cnt = get_cgroup_v1_idx(cgrp_idx, MAX_LEVELS);
+
+	for ( ; idx < MAX_EVENTS; idx++) {
+		if (idx == num_events)
+			break;
+
+		// XXX: do not pass idx directly (for verifier)
+		key = idx;
+		// this is per-cpu array for diff
+		prev_val = bpf_map_lookup_elem(&prev_readings, &key);
+		if (!prev_val) {
+			val.counter = val.enabled = val.running = 0;
+			bpf_map_update_elem(&prev_readings, &key, &val, BPF_ANY);
+
+			prev_val = bpf_map_lookup_elem(&prev_readings, &key);
+			if (!prev_val)
+				continue;
+		}
+
+		// read from global event array
+		evt_idx = idx * num_cpus + cpu;
+		err = bpf_perf_event_read_value(&events, evt_idx, &val, sizeof(val));
+		if (err)
+			continue;
+
+		if (enabled) {
+			delta.counter = val.counter - prev_val->counter;
+			delta.enabled = val.enabled - prev_val->enabled;
+			delta.running = val.running - prev_val->running;
+
+			for (c = 0; c < MAX_LEVELS; c++) {
+				if (c == cgrp_cnt)
+					break;
+
+				cgrp = cgrp_idx[c];
+
+				// aggregate the result by cgroup
+				key = cgrp * num_cpus * num_events + evt_idx;
+				cgrp_val = bpf_map_lookup_elem(&cgrp_readings, &key);
+				if (cgrp_val) {
+					cgrp_val->counter += delta.counter;
+					cgrp_val->enabled += delta.enabled;
+					cgrp_val->running += delta.running;
+				} else {
+					bpf_map_update_elem(&cgrp_readings, &key, &delta, BPF_ANY);
+				}
+			}
+		}
+
+		*prev_val = val;
+	}
+	return 0;
+}
+
+// This will be attached to cgroup-switches event for each cpu
+SEC("perf_events")
+int BPF_PROG(on_cgrp_switch)
+{
+	return bperf_cgroup_count();
+}
+
+SEC("raw_tp/sched_switch")
+int BPF_PROG(trigger_read)
+{
+	return bperf_cgroup_count();
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c
index 48ec79211270..f7d07b365401 100644
--- a/tools/perf/util/cgroup.c
+++ b/tools/perf/util/cgroup.c
@@ -18,6 +18,7 @@
 #include <regex.h>
 
 int nr_cgroups;
+bool cgrp_event_expanded;
 
 /* used to match cgroup name with patterns */
 struct cgroup_name {
@@ -484,6 +485,7 @@ int evlist__expand_cgroup(struct evlist *evlist, const char *str,
 	}
 
 	ret = 0;
+	cgrp_event_expanded = true;
 
 out_err:
 	evlist__delete(orig_list);
diff --git a/tools/perf/util/cgroup.h b/tools/perf/util/cgroup.h
index 1549ec2fd348..21f7ccc566e1 100644
--- a/tools/perf/util/cgroup.h
+++ b/tools/perf/util/cgroup.h
@@ -17,6 +17,7 @@ struct cgroup {
 };
 
 extern int nr_cgroups; /* number of explicit cgroups defined */
+extern bool cgrp_event_expanded;
 
 struct cgroup *cgroup__get(struct cgroup *cgroup);
 void cgroup__put(struct cgroup *cgroup);
-- 
2.32.0.288.g62a8d224e6-goog


* Re: [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup
  2021-06-22  7:12 ` [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup Namhyung Kim
@ 2021-06-24  4:54   ` Song Liu
  2021-06-24  5:49     ` Namhyung Kim
  0 siblings, 1 reply; 13+ messages in thread
From: Song Liu @ 2021-06-24  4:54 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Stephane Eranian



> On Jun 22, 2021, at 12:12 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> Recently bperf was added to use BPF to count perf events for various
> purposes.  This is an extension for the approach and targeting to
> cgroup usages.
> 
> Unlike the other bperf, it doesn't share the events with other
> processes but it'd reduce unnecessary events (and the overhead of
> multiplexing) for each monitored cgroup within the perf session.
> 
> When --for-each-cgroup is used with --bpf-counters, it will open
> cgroup-switches event per cpu internally and attach the new BPF
> program to read given perf_events and to aggregate the results for
> cgroups.  It's only called when task is switched to a task in a
> different cgroup.
> 
> Cc: Song Liu <songliubraving@fb.com>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
> tools/perf/Makefile.perf                    |   7 +-
> tools/perf/util/Build                       |   1 +
> tools/perf/util/bpf_counter.c               |   5 +
> tools/perf/util/bpf_counter_cgroup.c        | 337 ++++++++++++++++++++
> tools/perf/util/bpf_skel/bperf_cgroup.bpf.c | 207 ++++++++++++
> tools/perf/util/cgroup.c                    |   2 +
> tools/perf/util/cgroup.h                    |   1 +
> 7 files changed, 559 insertions(+), 1 deletion(-)
> create mode 100644 tools/perf/util/bpf_counter_cgroup.c
> create mode 100644 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
> 
> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> index e47f04e5b51e..786cba8f3798 100644
> --- a/tools/perf/Makefile.perf
> +++ b/tools/perf/Makefile.perf
> @@ -1015,6 +1015,7 @@ SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
> SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
> SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
> SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
> +SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h
> 
> ifdef BUILD_BPF_SKEL
> BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
> @@ -1032,7 +1033,11 @@ $(SKEL_TMP_OUT)/%.bpf.o: util/bpf_skel/%.bpf.c $(LIBBPF) | $(SKEL_TMP_OUT)
> 	$(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -Wall -Werror $(BPF_INCLUDE) \
> 	  -c $(filter util/bpf_skel/%.bpf.c,$^) -o $@ && $(LLVM_STRIP) -g $@
> 
> -$(SKEL_OUT)/%.skel.h: $(SKEL_TMP_OUT)/%.bpf.o | $(BPFTOOL)
> +$(SKEL_OUT)/vmlinux.h:
> +	$(MAKE) -C ../bpf/bpftool OUTPUT=$(SKEL_TMP_OUT)/ $(SKEL_TMP_OUT)/vmlinux.h

We build bpftool with $(BPFTOOL), which is a few lines above. 
Can we reuse some of that? 

> +	$(Q)mv $(SKEL_TMP_OUT)/vmlinux.h $(SKEL_OUT)/vmlinux.h
> +
> +$(SKEL_OUT)/%.skel.h: $(SKEL_TMP_OUT)/%.bpf.o $(SKEL_OUT)/vmlinux.h | $(BPFTOOL)
> 	$(QUIET_GENSKEL)$(BPFTOOL) gen skeleton $< > $@
> 
> bpf-skel: $(SKELETONS)
> diff --git a/tools/perf/util/Build b/tools/perf/util/Build
> index 95e15d1035ab..700d635448ff 100644
> --- a/tools/perf/util/Build
> +++ b/tools/perf/util/Build
> @@ -140,6 +140,7 @@ perf-y += clockid.o
> perf-$(CONFIG_LIBBPF) += bpf-loader.o
> perf-$(CONFIG_LIBBPF) += bpf_map.o
> perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
> +perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
> perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
> perf-$(CONFIG_LIBELF) += symbol-elf.o
> perf-$(CONFIG_LIBELF) += probe-file.o
> diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
> index 974f10e356f0..7812c5d9b826 100644
> --- a/tools/perf/util/bpf_counter.c
> +++ b/tools/perf/util/bpf_counter.c
> @@ -22,6 +22,7 @@
> #include "evsel.h"
> #include "evlist.h"
> #include "target.h"
> +#include "cgroup.h"
> #include "cpumap.h"
> #include "thread_map.h"
> 
> @@ -792,6 +793,8 @@ struct bpf_counter_ops bperf_ops = {
> 	.destroy    = bperf__destroy,
> };
> 
> +extern struct bpf_counter_ops bperf_cgrp_ops;
> +
> static inline bool bpf_counter_skip(struct evsel *evsel)
> {
> 	return list_empty(&evsel->bpf_counter_list) &&
> @@ -809,6 +812,8 @@ int bpf_counter__load(struct evsel *evsel, struct target *target)
> {
> 	if (target->bpf_str)
> 		evsel->bpf_counter_ops = &bpf_program_profiler_ops;
> +	else if (cgrp_event_expanded && target->use_bpf)
> +		evsel->bpf_counter_ops = &bperf_cgrp_ops;
> 	else if (target->use_bpf || evsel->bpf_counter ||
> 		 evsel__match_bpf_counter_events(evsel->name))
> 		evsel->bpf_counter_ops = &bperf_ops;

[...]


> +
> +#include "bpf_skel/bperf_cgroup.skel.h"
> +
> +static struct perf_event_attr cgrp_switch_attr = {
> +	.type = PERF_TYPE_SOFTWARE,
> +	.config = PERF_COUNT_SW_CGROUP_SWITCHES,
> +	.size = sizeof(cgrp_switch_attr),
> +	.sample_period = 1,
> +	.disabled = 1,
> +};
> +
> +static struct evsel *cgrp_switch;
> +static struct xyarray *cgrp_prog_fds;
> +static struct bperf_cgroup_bpf *skel;
> +
> +#define FD(evt, cpu) (*(int *)xyarray__entry(evt->core.fd, cpu, 0))
> +#define PROG(cpu)    (*(int *)xyarray__entry(cgrp_prog_fds, cpu, 0))
> +
> +static void set_max_rlimit(void)
> +{
> +	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> +
> +	setrlimit(RLIMIT_MEMLOCK, &rinf);
> +}
> +
> +static __u32 bpf_link_get_prog_id(int fd)
> +{
> +	struct bpf_link_info link_info = {0};
> +	__u32 link_info_len = sizeof(link_info);
> +
> +	bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
> +	return link_info.prog_id;
> +}

How about we move set_max_rlimit() and bpf_link_get_prog_id() to
a header so we don't have to duplicate them?

> +
> +static int bperf_load_program(struct evlist *evlist)
> +{
> +	struct bpf_link *link;
> +	struct evsel *evsel;
> +	struct cgroup *cgrp, *leader_cgrp;
> +	__u32 i, cpu, prog_id;
> +	int nr_cpus = evlist->core.all_cpus->nr;
> +	int map_size, map_fd;
> +	int prog_fd, err;
> +
> +	skel = bperf_cgroup_bpf__open();
> +	if (!skel) {
> +		pr_err("Failed to open cgroup skeleton\n");
> +		return -1;
> +	}
> +
> +	skel->rodata->num_cpus = nr_cpus;
> +	skel->rodata->num_events = evlist->core.nr_entries / nr_cgroups;
> +
> +	BUG_ON(evlist->core.nr_entries % nr_cgroups != 0);
> +
> +	/* we need one copy of events per cpu for reading */
> +	map_size = nr_cpus * evlist->core.nr_entries / nr_cgroups;
> +	bpf_map__resize(skel->maps.events, map_size);
> +	bpf_map__resize(skel->maps.cpu_idx, nr_cpus);
> +	bpf_map__resize(skel->maps.cgrp_idx, nr_cgroups);
> +	/* previous result is saved in a per-cpu array */
> +	map_size = evlist->core.nr_entries / nr_cgroups;
> +	bpf_map__resize(skel->maps.prev_readings, map_size);
> +	/* cgroup result needs all events */
> +	map_size = nr_cpus * evlist->core.nr_entries;
> +	bpf_map__resize(skel->maps.cgrp_readings, map_size);

We are setting map_size back and forth here. 

[...]


> diff --git a/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
> new file mode 100644
> index 000000000000..6d74e93dd1f5
> --- /dev/null
> +++ b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
> @@ -0,0 +1,207 @@
> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +// Copyright (c) 2021 Facebook
> +// Copyright (c) 2021 Google
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_core_read.h>
> +
> +#define MAX_LEVELS  10  // max cgroup hierarchy level: arbitrary
> +#define MAX_EVENTS  32  // max events per cgroup: arbitrary
> +
> +// NOTE: many of map and global data will be modified before loading
> +//       from the userspace (perf tool) using the skeleton helpers.
> +
> +// single set of global perf events to measure
> +struct {
> +	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, sizeof(int));
> +	__uint(max_entries, 1);
> +} events SEC(".maps");
> +
> +// from logical cpu number to event index
> +// useful when user wants to count subset of cpus
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, sizeof(__u32));
> +	__uint(max_entries, 1);
> +} cpu_idx SEC(".maps");

How about we make cpu_idx a percpu array and use 0,1 for 
disable/enable profiling on this cpu? 

> +
> +// from cgroup id to event index
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(key_size, sizeof(__u64));
> +	__uint(value_size, sizeof(__u32));
> +	__uint(max_entries, 1);
> +} cgrp_idx SEC(".maps");
> +
> +// per-cpu event snapshots to calculate delta
> +struct {
> +	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, sizeof(struct bpf_perf_event_value));
> +} prev_readings SEC(".maps");
> +
> +// aggregated event values for each cgroup
> +// will be read from the user-space
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, sizeof(struct bpf_perf_event_value));
> +} cgrp_readings SEC(".maps");

Maybe also make this a percpu array? This should make the BPF program
faster. 

> +
> +const volatile __u32 num_events = 1;
> +const volatile __u32 num_cpus = 1;
> +
> +int enabled = 0;
> +int use_cgroup_v2 = 0;
> +
[...]



* Re: [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup
  2021-06-24  4:54   ` Song Liu
@ 2021-06-24  5:49     ` Namhyung Kim
  2021-06-24 16:20       ` Song Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Namhyung Kim @ 2021-06-24  5:49 UTC (permalink / raw)
  To: Song Liu
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Stephane Eranian

Hi Song,

On Wed, Jun 23, 2021 at 9:54 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jun 22, 2021, at 12:12 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Recently bperf was added to use BPF to count perf events for various
> > purposes.  This is an extension of that approach, targeting
> > cgroup usage.
> >
> > Unlike the other bperf, it doesn't share the events with other
> > processes but it'd reduce unnecessary events (and the overhead of
> > multiplexing) for each monitored cgroup within the perf session.
> >
> > When --for-each-cgroup is used with --bpf-counters, it will open
> > cgroup-switches event per cpu internally and attach the new BPF
> > program to read given perf_events and to aggregate the results for
> > cgroups.  It's only called when task is switched to a task in a
> > different cgroup.
> >
> > Cc: Song Liu <songliubraving@fb.com>
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---
> > tools/perf/Makefile.perf                    |   7 +-
> > tools/perf/util/Build                       |   1 +
> > tools/perf/util/bpf_counter.c               |   5 +
> > tools/perf/util/bpf_counter_cgroup.c        | 337 ++++++++++++++++++++
> > tools/perf/util/bpf_skel/bperf_cgroup.bpf.c | 207 ++++++++++++
> > tools/perf/util/cgroup.c                    |   2 +
> > tools/perf/util/cgroup.h                    |   1 +
> > 7 files changed, 559 insertions(+), 1 deletion(-)
> > create mode 100644 tools/perf/util/bpf_counter_cgroup.c
> > create mode 100644 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
> >
> > diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> > index e47f04e5b51e..786cba8f3798 100644
> > --- a/tools/perf/Makefile.perf
> > +++ b/tools/perf/Makefile.perf
> > @@ -1015,6 +1015,7 @@ SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
> > SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
> > SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
> > SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
> > +SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h
> >
> > ifdef BUILD_BPF_SKEL
> > BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
> > @@ -1032,7 +1033,11 @@ $(SKEL_TMP_OUT)/%.bpf.o: util/bpf_skel/%.bpf.c $(LIBBPF) | $(SKEL_TMP_OUT)
> >       $(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -Wall -Werror $(BPF_INCLUDE) \
> >         -c $(filter util/bpf_skel/%.bpf.c,$^) -o $@ && $(LLVM_STRIP) -g $@
> >
> > -$(SKEL_OUT)/%.skel.h: $(SKEL_TMP_OUT)/%.bpf.o | $(BPFTOOL)
> > +$(SKEL_OUT)/vmlinux.h:
> > +     $(MAKE) -C ../bpf/bpftool OUTPUT=$(SKEL_TMP_OUT)/ $(SKEL_TMP_OUT)/vmlinux.h
>
> We build bpftool with $(BPFTOOL), which is a few lines above.
> Can we reuse some of that?

Ok, will do.

>
> > +     $(Q)mv $(SKEL_TMP_OUT)/vmlinux.h $(SKEL_OUT)/vmlinux.h
> > +
> > +$(SKEL_OUT)/%.skel.h: $(SKEL_TMP_OUT)/%.bpf.o $(SKEL_OUT)/vmlinux.h | $(BPFTOOL)
> >       $(QUIET_GENSKEL)$(BPFTOOL) gen skeleton $< > $@
> >
> > bpf-skel: $(SKELETONS)
> > diff --git a/tools/perf/util/Build b/tools/perf/util/Build
> > index 95e15d1035ab..700d635448ff 100644
> > --- a/tools/perf/util/Build
> > +++ b/tools/perf/util/Build
> > @@ -140,6 +140,7 @@ perf-y += clockid.o
> > perf-$(CONFIG_LIBBPF) += bpf-loader.o
> > perf-$(CONFIG_LIBBPF) += bpf_map.o
> > perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
> > +perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
> > perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
> > perf-$(CONFIG_LIBELF) += symbol-elf.o
> > perf-$(CONFIG_LIBELF) += probe-file.o
> > diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
> > index 974f10e356f0..7812c5d9b826 100644
> > --- a/tools/perf/util/bpf_counter.c
> > +++ b/tools/perf/util/bpf_counter.c
> > @@ -22,6 +22,7 @@
> > #include "evsel.h"
> > #include "evlist.h"
> > #include "target.h"
> > +#include "cgroup.h"
> > #include "cpumap.h"
> > #include "thread_map.h"
> >
> > @@ -792,6 +793,8 @@ struct bpf_counter_ops bperf_ops = {
> >       .destroy    = bperf__destroy,
> > };
> >
> > +extern struct bpf_counter_ops bperf_cgrp_ops;
> > +
> > static inline bool bpf_counter_skip(struct evsel *evsel)
> > {
> >       return list_empty(&evsel->bpf_counter_list) &&
> > @@ -809,6 +812,8 @@ int bpf_counter__load(struct evsel *evsel, struct target *target)
> > {
> >       if (target->bpf_str)
> >               evsel->bpf_counter_ops = &bpf_program_profiler_ops;
> > +     else if (cgrp_event_expanded && target->use_bpf)
> > +             evsel->bpf_counter_ops = &bperf_cgrp_ops;
> >       else if (target->use_bpf || evsel->bpf_counter ||
> >                evsel__match_bpf_counter_events(evsel->name))
> >               evsel->bpf_counter_ops = &bperf_ops;
>
> [...]
>
>
> > +
> > +#include "bpf_skel/bperf_cgroup.skel.h"
> > +
> > +static struct perf_event_attr cgrp_switch_attr = {
> > +     .type = PERF_TYPE_SOFTWARE,
> > +     .config = PERF_COUNT_SW_CGROUP_SWITCHES,
> > +     .size = sizeof(cgrp_switch_attr),
> > +     .sample_period = 1,
> > +     .disabled = 1,
> > +};
> > +
> > +static struct evsel *cgrp_switch;
> > +static struct xyarray *cgrp_prog_fds;
> > +static struct bperf_cgroup_bpf *skel;
> > +
> > +#define FD(evt, cpu) (*(int *)xyarray__entry(evt->core.fd, cpu, 0))
> > +#define PROG(cpu)    (*(int *)xyarray__entry(cgrp_prog_fds, cpu, 0))
> > +
> > +static void set_max_rlimit(void)
> > +{
> > +     struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> > +
> > +     setrlimit(RLIMIT_MEMLOCK, &rinf);
> > +}
> > +
> > +static __u32 bpf_link_get_prog_id(int fd)
> > +{
> > +     struct bpf_link_info link_info = {0};
> > +     __u32 link_info_len = sizeof(link_info);
> > +
> > +     bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
> > +     return link_info.prog_id;
> > +}
>
> How about we move set_max_rlimit() and bpf_link_get_prog_id() to
> a header so we don't have to duplicate them?

Sounds good.

>
> > +
> > +static int bperf_load_program(struct evlist *evlist)
> > +{
> > +     struct bpf_link *link;
> > +     struct evsel *evsel;
> > +     struct cgroup *cgrp, *leader_cgrp;
> > +     __u32 i, cpu, prog_id;
> > +     int nr_cpus = evlist->core.all_cpus->nr;
> > +     int map_size, map_fd;
> > +     int prog_fd, err;
> > +
> > +     skel = bperf_cgroup_bpf__open();
> > +     if (!skel) {
> > +             pr_err("Failed to open cgroup skeleton\n");
> > +             return -1;
> > +     }
> > +
> > +     skel->rodata->num_cpus = nr_cpus;
> > +     skel->rodata->num_events = evlist->core.nr_entries / nr_cgroups;
> > +
> > +     BUG_ON(evlist->core.nr_entries % nr_cgroups != 0);
> > +
> > +     /* we need one copy of events per cpu for reading */
> > +     map_size = nr_cpus * evlist->core.nr_entries / nr_cgroups;
> > +     bpf_map__resize(skel->maps.events, map_size);
> > +     bpf_map__resize(skel->maps.cpu_idx, nr_cpus);
> > +     bpf_map__resize(skel->maps.cgrp_idx, nr_cgroups);
> > +     /* previous result is saved in a per-cpu array */
> > +     map_size = evlist->core.nr_entries / nr_cgroups;
> > +     bpf_map__resize(skel->maps.prev_readings, map_size);
> > +     /* cgroup result needs all events */
> > +     map_size = nr_cpus * evlist->core.nr_entries;
> > +     bpf_map__resize(skel->maps.cgrp_readings, map_size);
>
> We are setting map_size back and forth here.

But they are all different sizes.
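
To make the three sizes concrete, here is a small standalone sketch of the
arithmetic in the quoted hunk.  The struct and function names are made up
for illustration; only the formulas come from the patch.

```c
#include <assert.h>

/* Hypothetical container for illustration; not part of the patch. */
struct cgrp_map_sizes {
	int events;        /* one perf event fd per (cpu, event) pair */
	int prev_readings; /* per-cpu array: one slot per event */
	int cgrp_readings; /* plain array: one slot per (cpu, event, cgroup) */
};

static struct cgrp_map_sizes calc_map_sizes(int nr_cpus, int nr_entries,
					    int nr_cgroups)
{
	struct cgrp_map_sizes s;
	int nr_events;

	/* the evlist holds one copy of every event per cgroup */
	assert(nr_cgroups > 0 && nr_entries % nr_cgroups == 0);
	nr_events = nr_entries / nr_cgroups;

	s.events = nr_cpus * nr_events;
	s.prev_readings = nr_events;
	s.cgrp_readings = nr_cpus * nr_entries;
	return s;
}
```

With 4 cpus, 2 events and 3 cgroups (so 6 evlist entries), this gives 8, 2
and 24 entries respectively.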

>
> [...]
>
>
> > diff --git a/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
> > new file mode 100644
> > index 000000000000..6d74e93dd1f5
> > --- /dev/null
> > +++ b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
> > @@ -0,0 +1,207 @@
> > +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> > +// Copyright (c) 2021 Facebook
> > +// Copyright (c) 2021 Google
> > +#include "vmlinux.h"
> > +#include <bpf/bpf_helpers.h>
> > +#include <bpf/bpf_tracing.h>
> > +#include <bpf/bpf_core_read.h>
> > +
> > +#define MAX_LEVELS  10  // max cgroup hierarchy level: arbitrary
> > +#define MAX_EVENTS  32  // max events per cgroup: arbitrary
> > +
> > +// NOTE: many of map and global data will be modified before loading
> > +//       from the userspace (perf tool) using the skeleton helpers.
> > +
> > +// single set of global perf events to measure
> > +struct {
> > +     __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> > +     __uint(key_size, sizeof(__u32));
> > +     __uint(value_size, sizeof(int));
> > +     __uint(max_entries, 1);
> > +} events SEC(".maps");
> > +
> > +// from logical cpu number to event index
> > +// useful when user wants to count subset of cpus
> > +struct {
> > +     __uint(type, BPF_MAP_TYPE_HASH);
> > +     __uint(key_size, sizeof(__u32));
> > +     __uint(value_size, sizeof(__u32));
> > +     __uint(max_entries, 1);
> > +} cpu_idx SEC(".maps");
>
> How about we make cpu_idx a percpu array and use 0,1 for
> disable/enable profiling on this cpu?

No, it's used to calculate an index into the cgrp_readings map, which
has (event x cpu x cgroup) elements.

Enabling the events is controlled separately, with a global (bss) variable.
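
As a rough illustration of what such an index could look like: the quoted
hunks do not show the real computation, so the ordering below (cgroup, then
cpu, then event) and the function name are assumptions for this sketch only.

```c
#include <assert.h>

/*
 * Hypothetical flat index into an array holding one slot per
 * (cgroup, cpu, event) triple.  The ordering is an assumption made for
 * this sketch, not taken from the patch.  cpu_index is the dense index
 * looked up from the cpu number, not the cpu number itself.
 */
static unsigned int cgrp_reading_idx(unsigned int cgrp, unsigned int cpu_index,
				     unsigned int event,
				     unsigned int num_cpus,
				     unsigned int num_events)
{
	assert(cpu_index < num_cpus && event < num_events);
	return (cgrp * num_cpus + cpu_index) * num_events + event;
}
```

With 4 cpus, 2 events and 3 cgroups the last slot is index 23, which matches
a map of nr_cpus * nr_entries = 24 elements.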

>
> > +
> > +// from cgroup id to event index
> > +struct {
> > +     __uint(type, BPF_MAP_TYPE_HASH);
> > +     __uint(key_size, sizeof(__u64));
> > +     __uint(value_size, sizeof(__u32));
> > +     __uint(max_entries, 1);
> > +} cgrp_idx SEC(".maps");
> > +
> > +// per-cpu event snapshots to calculate delta
> > +struct {
> > +     __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> > +     __uint(key_size, sizeof(__u32));
> > +     __uint(value_size, sizeof(struct bpf_perf_event_value));
> > +} prev_readings SEC(".maps");
> > +
> > +// aggregated event values for each cgroup
> > +// will be read from the user-space
> > +struct {
> > +     __uint(type, BPF_MAP_TYPE_ARRAY);
> > +     __uint(key_size, sizeof(__u32));
> > +     __uint(value_size, sizeof(struct bpf_perf_event_value));
> > +} cgrp_readings SEC(".maps");
>
> Maybe also make this a percpu array? This should make the BPF program
> faster.

Maybe.  But I don't know how to access the elements
in a per-cpu map from userspace.

Thanks,
Namhyung


>
> > +
> > +const volatile __u32 num_events = 1;
> > +const volatile __u32 num_cpus = 1;
> > +
> > +int enabled = 0;
> > +int use_cgroup_v2 = 0;
> > +
> [...]
>


* Re: [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup
  2021-06-24  5:49     ` Namhyung Kim
@ 2021-06-24 16:20       ` Song Liu
  2021-06-24 21:01         ` Namhyung Kim
  0 siblings, 1 reply; 13+ messages in thread
From: Song Liu @ 2021-06-24 16:20 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Stephane Eranian



> On Jun 23, 2021, at 10:49 PM, Namhyung Kim <namhyung@kernel.org> wrote:
[...]

>>> +
>>> +     skel->rodata->num_cpus = nr_cpus;
>>> +     skel->rodata->num_events = evlist->core.nr_entries / nr_cgroups;
>>> +
>>> +     BUG_ON(evlist->core.nr_entries % nr_cgroups != 0);
>>> +
>>> +     /* we need one copy of events per cpu for reading */
>>> +     map_size = nr_cpus * evlist->core.nr_entries / nr_cgroups;
>>> +     bpf_map__resize(skel->maps.events, map_size);
>>> +     bpf_map__resize(skel->maps.cpu_idx, nr_cpus);
>>> +     bpf_map__resize(skel->maps.cgrp_idx, nr_cgroups);
>>> +     /* previous result is saved in a per-cpu array */
>>> +     map_size = evlist->core.nr_entries / nr_cgroups;
>>> +     bpf_map__resize(skel->maps.prev_readings, map_size);
>>> +     /* cgroup result needs all events */
>>> +     map_size = nr_cpus * evlist->core.nr_entries;
>>> +     bpf_map__resize(skel->maps.cgrp_readings, map_size);
>> 
>> We are setting map_size back and forth here.
> 
> But they are all different sizes.

Right. I misread the code. 

> 
>> 
>> [...]
>> 
>> 
>>> diff --git a/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
>>> new file mode 100644
>>> index 000000000000..6d74e93dd1f5
>>> --- /dev/null
>>> +++ b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
>>> @@ -0,0 +1,207 @@
>>> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
>>> +// Copyright (c) 2021 Facebook
>>> +// Copyright (c) 2021 Google
>>> +#include "vmlinux.h"
>>> +#include <bpf/bpf_helpers.h>
>>> +#include <bpf/bpf_tracing.h>
>>> +#include <bpf/bpf_core_read.h>
>>> +
>>> +#define MAX_LEVELS  10  // max cgroup hierarchy level: arbitrary
>>> +#define MAX_EVENTS  32  // max events per cgroup: arbitrary
>>> +
>>> +// NOTE: many of map and global data will be modified before loading
>>> +//       from the userspace (perf tool) using the skeleton helpers.
>>> +
>>> +// single set of global perf events to measure
>>> +struct {
>>> +     __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
>>> +     __uint(key_size, sizeof(__u32));
>>> +     __uint(value_size, sizeof(int));
>>> +     __uint(max_entries, 1);
>>> +} events SEC(".maps");
>>> +
>>> +// from logical cpu number to event index
>>> +// useful when user wants to count subset of cpus
>>> +struct {
>>> +     __uint(type, BPF_MAP_TYPE_HASH);
>>> +     __uint(key_size, sizeof(__u32));
>>> +     __uint(value_size, sizeof(__u32));
>>> +     __uint(max_entries, 1);
>>> +} cpu_idx SEC(".maps");
>> 
>> How about we make cpu_idx a percpu array and use 0,1 for
>> disable/enable profiling on this cpu?
> 
> No, it's to calculate an index to the cgrp_readings map which
> has the event x cpu x cgroup number of elements.
> 
> It controls enabling events with a global (bss) variable.

If we make cgrp_idx a per cpu array, we probably don't need the
cpu_idx map? 

> 
>> 
>>> +
>>> +// from cgroup id to event index
>>> +struct {
>>> +     __uint(type, BPF_MAP_TYPE_HASH);
>>> +     __uint(key_size, sizeof(__u64));
>>> +     __uint(value_size, sizeof(__u32));
>>> +     __uint(max_entries, 1);
>>> +} cgrp_idx SEC(".maps");
>>> +
>>> +// per-cpu event snapshots to calculate delta
>>> +struct {
>>> +     __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
>>> +     __uint(key_size, sizeof(__u32));
>>> +     __uint(value_size, sizeof(struct bpf_perf_event_value));
>>> +} prev_readings SEC(".maps");
>>> +
>>> +// aggregated event values for each cgroup
>>> +// will be read from the user-space
>>> +struct {
>>> +     __uint(type, BPF_MAP_TYPE_ARRAY);
>>> +     __uint(key_size, sizeof(__u32));
>>> +     __uint(value_size, sizeof(struct bpf_perf_event_value));
>>> +} cgrp_readings SEC(".maps");
>> 
>> Maybe also make this a percpu array? This should make the BPF program
>> faster.
> 
> Maybe.  But I don't know how to access the elements
> in a per-cpu map from userspace.

Please refer to how bperf__read() reads accum_readings. Basically, we read
one index for all CPUs with a single bpf_map_lookup_elem() call.

Thanks,
Song


* Re: [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup
  2021-06-24 16:20       ` Song Liu
@ 2021-06-24 21:01         ` Namhyung Kim
  2021-06-24 21:40           ` Song Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Namhyung Kim @ 2021-06-24 21:01 UTC (permalink / raw)
  To: Song Liu
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Stephane Eranian

On Thu, Jun 24, 2021 at 9:20 AM Song Liu <songliubraving@fb.com> wrote:
> >>> +
> >>> +// single set of global perf events to measure
> >>> +struct {
> >>> +     __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> >>> +     __uint(key_size, sizeof(__u32));
> >>> +     __uint(value_size, sizeof(int));
> >>> +     __uint(max_entries, 1);
> >>> +} events SEC(".maps");
> >>> +
> >>> +// from logical cpu number to event index
> >>> +// useful when user wants to count subset of cpus
> >>> +struct {
> >>> +     __uint(type, BPF_MAP_TYPE_HASH);
> >>> +     __uint(key_size, sizeof(__u32));
> >>> +     __uint(value_size, sizeof(__u32));
> >>> +     __uint(max_entries, 1);
> >>> +} cpu_idx SEC(".maps");
> >>
> >> How about we make cpu_idx a percpu array and use 0,1 for
> >> disable/enable profiling on this cpu?
> >
> > No, it's to calculate an index to the cgrp_readings map which
> > has the event x cpu x cgroup number of elements.
> >
> > It controls enabling events with a global (bss) variable.
>
> If we make cgrp_idx a per cpu array, we probably don't need the
> cpu_idx map?

Right.

>
> >
> >>
> >>> +
> >>> +// from cgroup id to event index
> >>> +struct {
> >>> +     __uint(type, BPF_MAP_TYPE_HASH);
> >>> +     __uint(key_size, sizeof(__u64));
> >>> +     __uint(value_size, sizeof(__u32));
> >>> +     __uint(max_entries, 1);
> >>> +} cgrp_idx SEC(".maps");
> >>> +
> >>> +// per-cpu event snapshots to calculate delta
> >>> +struct {
> >>> +     __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> >>> +     __uint(key_size, sizeof(__u32));
> >>> +     __uint(value_size, sizeof(struct bpf_perf_event_value));
> >>> +} prev_readings SEC(".maps");
> >>> +
> >>> +// aggregated event values for each cgroup
> >>> +// will be read from the user-space
> >>> +struct {
> >>> +     __uint(type, BPF_MAP_TYPE_ARRAY);
> >>> +     __uint(key_size, sizeof(__u32));
> >>> +     __uint(value_size, sizeof(struct bpf_perf_event_value));
> >>> +} cgrp_readings SEC(".maps");
> >>
> >> Maybe also make this a percpu array? This should make the BPF program
> >> faster.
> >
> > Maybe.  But I don't know how to access the elements
> > in a per-cpu map from userspace.
>
> Please refer to bperf__read() reading accum_readings. Basically, we read
> one index of all CPUs with one bpf_map_lookup_elem().

Thanks!  So when I use a per-cpu array with 3 elements, I can access
the cpu/elem entries in a row like below, right?

  0/0, 0/1, 0/2, 1/0, 1/1, 1/2, 2/0, 2/1, 2/2, 3/0, ...

Thanks,
Namhyung


* Re: [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup
  2021-06-24 21:01         ` Namhyung Kim
@ 2021-06-24 21:40           ` Song Liu
  2021-06-24 22:06             ` Namhyung Kim
  0 siblings, 1 reply; 13+ messages in thread
From: Song Liu @ 2021-06-24 21:40 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Stephane Eranian



> On Jun 24, 2021, at 2:01 PM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> On Thu, Jun 24, 2021 at 9:20 AM Song Liu <songliubraving@fb.com> wrote:
>>>>> +
>>>>> +// single set of global perf events to measure
>>>>> +struct {
>>>>> +     __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
>>>>> +     __uint(key_size, sizeof(__u32));
>>>>> +     __uint(value_size, sizeof(int));
>>>>> +     __uint(max_entries, 1);
>>>>> +} events SEC(".maps");
>>>>> +
>>>>> +// from logical cpu number to event index
>>>>> +// useful when user wants to count subset of cpus
>>>>> +struct {
>>>>> +     __uint(type, BPF_MAP_TYPE_HASH);
>>>>> +     __uint(key_size, sizeof(__u32));
>>>>> +     __uint(value_size, sizeof(__u32));
>>>>> +     __uint(max_entries, 1);
>>>>> +} cpu_idx SEC(".maps");
>>>> 
>>>> How about we make cpu_idx a percpu array and use 0,1 for
>>>> disable/enable profiling on this cpu?
>>> 
>>> No, it's to calculate an index to the cgrp_readings map which
>>> has the event x cpu x cgroup number of elements.
>>> 
>>> It controls enabling events with a global (bss) variable.
>> 
>> If we make cgrp_idx a per cpu array, we probably don't need the
>> cpu_idx map?
> 
> Right.
> 
>> 
>>> 
>>>> 
>>>>> +
>>>>> +// from cgroup id to event index
>>>>> +struct {
>>>>> +     __uint(type, BPF_MAP_TYPE_HASH);
>>>>> +     __uint(key_size, sizeof(__u64));
>>>>> +     __uint(value_size, sizeof(__u32));
>>>>> +     __uint(max_entries, 1);
>>>>> +} cgrp_idx SEC(".maps");
>>>>> +
>>>>> +// per-cpu event snapshots to calculate delta
>>>>> +struct {
>>>>> +     __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
>>>>> +     __uint(key_size, sizeof(__u32));
>>>>> +     __uint(value_size, sizeof(struct bpf_perf_event_value));
>>>>> +} prev_readings SEC(".maps");
>>>>> +
>>>>> +// aggregated event values for each cgroup
>>>>> +// will be read from the user-space
>>>>> +struct {
>>>>> +     __uint(type, BPF_MAP_TYPE_ARRAY);
>>>>> +     __uint(key_size, sizeof(__u32));
>>>>> +     __uint(value_size, sizeof(struct bpf_perf_event_value));
>>>>> +} cgrp_readings SEC(".maps");
>>>> 
>>>> Maybe also make this a percpu array? This should make the BPF program
>>>> faster.
>>> 
>>> Maybe.  But I don't know how to access the elements
>>> in a per-cpu map from userspace.
>> 
>> Please refer to bperf__read() reading accum_readings. Basically, we read
>> one index of all CPUs with one bpf_map_lookup_elem().
> 
> Thanks!  So when I use a per-cpu array with 3 elements, I can access
> to cpu/elem entries in a row like below, right?
> 
>  0/0, 0/1, 0/2, 1/0, 1/1, 1/2, 2/0, 2/1, 2/2, 3/0, ...

I am not sure I am following here. 

Say the system has 10 cpus, and the array has 3 elements. We can do:

	__u32 values[10];  /* assuming both key and value are __u32 */
	__u32 elem;
	int cpu;

	for (elem = 0; elem < 3; elem++) {
		bpf_map_lookup_elem(map_fd, &elem, values);
		for (cpu = 0; cpu < 10; cpu++)
			values[cpu] /* this is the value for cpu/elem */
	}

Thanks,
Song


* Re: [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup
  2021-06-24 21:40           ` Song Liu
@ 2021-06-24 22:06             ` Namhyung Kim
  2021-06-24 22:15               ` Song Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Namhyung Kim @ 2021-06-24 22:06 UTC (permalink / raw)
  To: Song Liu
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Stephane Eranian

On Thu, Jun 24, 2021 at 2:41 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jun 24, 2021, at 2:01 PM, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Thu, Jun 24, 2021 at 9:20 AM Song Liu <songliubraving@fb.com> wrote:
> >>>>> +
> >>>>> +// single set of global perf events to measure
> >>>>> +struct {
> >>>>> +     __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> >>>>> +     __uint(key_size, sizeof(__u32));
> >>>>> +     __uint(value_size, sizeof(int));
> >>>>> +     __uint(max_entries, 1);
> >>>>> +} events SEC(".maps");
> >>>>> +
> >>>>> +// from logical cpu number to event index
> >>>>> +// useful when user wants to count subset of cpus
> >>>>> +struct {
> >>>>> +     __uint(type, BPF_MAP_TYPE_HASH);
> >>>>> +     __uint(key_size, sizeof(__u32));
> >>>>> +     __uint(value_size, sizeof(__u32));
> >>>>> +     __uint(max_entries, 1);
> >>>>> +} cpu_idx SEC(".maps");
> >>>>
> >>>> How about we make cpu_idx a percpu array and use 0,1 for
> >>>> disable/enable profiling on this cpu?
> >>>
> >>> No, it's to calculate an index to the cgrp_readings map which
> >>> has the event x cpu x cgroup number of elements.
> >>>
> >>> It controls enabling events with a global (bss) variable.
> >>
> >> If we make cgrp_idx a per cpu array, we probably don't need the
> >> cpu_idx map?
> >
> > Right.

Maybe not.  Sometimes we want to profile only a subset of cpus.
In that case cpu != idx, so I think we still need this.
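
A minimal sketch of that mapping, assuming the monitored cpus are given as a
plain list.  The helper name and the NO_IDX marker are hypothetical; in the
patch the same information lives in the cpu_idx hash map.

```c
#include <assert.h>

#define NO_IDX (-1)

/*
 * Fill idx_of_cpu[] so that each monitored cpu maps to its position in
 * cpus[], and every other cpu maps to NO_IDX.  Names are hypothetical.
 */
static void build_cpu_idx(const int *cpus, int nr_cpus,
			  int *idx_of_cpu, int nr_possible)
{
	int i;

	for (i = 0; i < nr_possible; i++)
		idx_of_cpu[i] = NO_IDX;

	for (i = 0; i < nr_cpus; i++) {
		assert(cpus[i] >= 0 && cpus[i] < nr_possible);
		idx_of_cpu[cpus[i]] = i;
	}
}
```

For cpus 1, 3, 5 this yields idx 0, 1, 2, so the cpu number cannot be used
directly as the index.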


> >
> >>
> >>>
> >>>>
> >>>>> +
> >>>>> +// from cgroup id to event index
> >>>>> +struct {
> >>>>> +     __uint(type, BPF_MAP_TYPE_HASH);
> >>>>> +     __uint(key_size, sizeof(__u64));
> >>>>> +     __uint(value_size, sizeof(__u32));
> >>>>> +     __uint(max_entries, 1);
> >>>>> +} cgrp_idx SEC(".maps");
> >>>>> +
> >>>>> +// per-cpu event snapshots to calculate delta
> >>>>> +struct {
> >>>>> +     __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> >>>>> +     __uint(key_size, sizeof(__u32));
> >>>>> +     __uint(value_size, sizeof(struct bpf_perf_event_value));
> >>>>> +} prev_readings SEC(".maps");
> >>>>> +
> >>>>> +// aggregated event values for each cgroup
> >>>>> +// will be read from the user-space
> >>>>> +struct {
> >>>>> +     __uint(type, BPF_MAP_TYPE_ARRAY);
> >>>>> +     __uint(key_size, sizeof(__u32));
> >>>>> +     __uint(value_size, sizeof(struct bpf_perf_event_value));
> >>>>> +} cgrp_readings SEC(".maps");
> >>>>
> >>>> Maybe also make this a percpu array? This should make the BPF program
> >>>> faster.
> >>>
> >>> Maybe.  But I don't know how to access the elements
> >>> in a per-cpu map from userspace.
> >>
> >> Please refer to bperf__read() reading accum_readings. Basically, we read
> >> one index of all CPUs with one bpf_map_lookup_elem().
> >
> > Thanks!  So when I use a per-cpu array with 3 elements, I can access
> > to cpu/elem entries in a row like below, right?
> >
> >  0/0, 0/1, 0/2, 1/0, 1/1, 1/2, 2/0, 2/1, 2/2, 3/0, ...
>
> I am not sure I am following here.
>
> Say the system have 10 cpus, and the array has 3 elements. We can do:
>
>         __u32 values[10];  /* assuming both key and value are __u32 */
>         __u32 elem;
>         int cpu;
>
>         for (elem = 0; elem < 3; elem++) {
>                 bpf_map_lookup_elem(map_fd, &elem, values);
>                 for (cpu = 0; cpu < 10; cpu++)
>                         values[cpu] /* this is the value for cpu/elem */
>         }

Thanks for the explanation, I hadn't thought of it that way.
I thought it worked like below:

    __u32 elem, value;

    for (elem = 0; elem < 3 * 10; elem++) {
        bpf_map_lookup_elem(map_fd, &elem, &value);
    }

So in this case, the actual value size is like below, right?

  value-size = map-value-size * number-of-cpu

Thanks,
Namhyung


* Re: [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup
  2021-06-24 22:06             ` Namhyung Kim
@ 2021-06-24 22:15               ` Song Liu
  2021-06-24 22:21                 ` Namhyung Kim
  0 siblings, 1 reply; 13+ messages in thread
From: Song Liu @ 2021-06-24 22:15 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Stephane Eranian



> On Jun 24, 2021, at 3:06 PM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> On Thu, Jun 24, 2021 at 2:41 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Jun 24, 2021, at 2:01 PM, Namhyung Kim <namhyung@kernel.org> wrote:
>>> 
>>> On Thu, Jun 24, 2021 at 9:20 AM Song Liu <songliubraving@fb.com> wrote:
>>>>>>> +
>>>>>>> +// single set of global perf events to measure
>>>>>>> +struct {
>>>>>>> +     __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
>>>>>>> +     __uint(key_size, sizeof(__u32));
>>>>>>> +     __uint(value_size, sizeof(int));
>>>>>>> +     __uint(max_entries, 1);
>>>>>>> +} events SEC(".maps");
>>>>>>> +
>>>>>>> +// from logical cpu number to event index
>>>>>>> +// useful when user wants to count subset of cpus
>>>>>>> +struct {
>>>>>>> +     __uint(type, BPF_MAP_TYPE_HASH);
>>>>>>> +     __uint(key_size, sizeof(__u32));
>>>>>>> +     __uint(value_size, sizeof(__u32));
>>>>>>> +     __uint(max_entries, 1);
>>>>>>> +} cpu_idx SEC(".maps");
>>>>>> 
>>>>>> How about we make cpu_idx a percpu array and use 0,1 for
>>>>>> disable/enable profiling on this cpu?
>>>>> 
>>>>> No, it's to calculate an index to the cgrp_readings map which
>>>>> has the event x cpu x cgroup number of elements.
>>>>> 
>>>>> It controls enabling events with a global (bss) variable.
>>>> 
>>>> If we make cgrp_idx a per cpu array, we probably don't need the
>>>> cpu_idx map?
>>> 
>>> Right.
> 
> Maybe not.  Sometimes we want to profile a subset of cpus only.
> In that case, cpu != idx then I think we still need this.

We can attach the bpf program only on the selected CPUs. Say, we want
CPUs 1, 3, 5. We just do 

	int cpus[] = { 1, 3, 5 };

	for (i = 0; i < ARRAY_SIZE(cpus); i++) {
		link = bpf_program__attach_perf_event(skel->progs.on_switch,
						      FD(cgrp_switch, cpus[i]));
		/* */
	}

The value arrays still cover all CPUs, but they will just report zero
for CPUs 0, 2, 4, ....

Would this work? 
	
[...]


>>>>> Maybe.  But I don't know how to access the elements
>>>>> in a per-cpu map from userspace.
>>>> 
>>>> Please refer to bperf__read() reading accum_readings. Basically, we read
>>>> one index of all CPUs with one bpf_map_lookup_elem().
>>> 
>>> Thanks!  So when I use a per-cpu array with 3 elements, I can access
>>> cpu/elem entries in a row like below, right?
>>> 
>>> 0/0, 0/1, 0/2, 1/0, 1/1, 1/2, 2/0, 2/1, 2/2, 3/0, ...
>> 
>> I am not sure I am following here.
>> 
>> Say the system has 10 cpus, and the array has 3 elements. We can do:
>> 
>>        __u32 values[10];  /* assuming both key and value are __u32 */
>>        __u32 elem;
>>        int cpu;
>> 
>>        for (elem = 0; elem < 3; elem++) {
>>                bpf_map_lookup_elem(map_fd, &elem, values);
>>                for (cpu = 0; cpu < 10; cpu++)
>>                        values[cpu] /* this is the value for cpu/elem */
>>        }
> 
> Thanks for the explanation, I didn't think of it that way.
> I thought of it like below:
> 
>    __u32 elem, value;
> 
>    for (elem = 0; elem < 3 * 10; elem++) {
>        bpf_map_lookup_elem(map_fd, &elem, &value);
>    }
> 
> So in this case, the actual value size is like below, right?
> 
>  value-size = map-value-size * number-of-cpu

This is right (for user space). 

Thanks,
Song


^ permalink raw reply	[flat|nested] 13+ messages in thread
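
A minimal sketch of the user-space buffer layout discussed above,
assuming the usual kernel behaviour for per-cpu maps: one lookup fills
one slot per possible CPU, and each slot is the map's value size
rounded up to 8 bytes.  The helper names (percpu_slot_size,
percpu_buf_size, percpu_slot) are hypothetical and are not part of
libbpf or perf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stride of one per-cpu slot in the user-space buffer: value_size rounded up to 8. */
static size_t percpu_slot_size(size_t value_size)
{
	return (value_size + 7) & ~(size_t)7;
}

/* Total bytes that a single lookup on a per-cpu map writes to user space. */
static size_t percpu_buf_size(size_t value_size, unsigned int nr_possible_cpus)
{
	return percpu_slot_size(value_size) * nr_possible_cpus;
}

/* Address of the slot that belongs to 'cpu' inside a buffer filled by one lookup. */
static const void *percpu_slot(const void *buf, size_t value_size, unsigned int cpu)
{
	assert(buf != NULL);
	return (const uint8_t *)buf + percpu_slot_size(value_size) * cpu;
}
```

With a 24-byte struct bpf_perf_event_value and 10 possible CPUs this
gives a 240-byte buffer.  With a __u32 value the stride is still 8
bytes, so the buffer would need 80 bytes rather than 40.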

* Re: [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup
  2021-06-24 22:15               ` Song Liu
@ 2021-06-24 22:21                 ` Namhyung Kim
  0 siblings, 0 replies; 13+ messages in thread
From: Namhyung Kim @ 2021-06-24 22:21 UTC (permalink / raw)
  To: Song Liu
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Stephane Eranian

On Thu, Jun 24, 2021 at 3:16 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jun 24, 2021, at 3:06 PM, Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > On Thu, Jun 24, 2021 at 2:41 PM Song Liu <songliubraving@fb.com> wrote:
> >>
> >>
> >>
> >>> On Jun 24, 2021, at 2:01 PM, Namhyung Kim <namhyung@kernel.org> wrote:
> >>>
> >>> On Thu, Jun 24, 2021 at 9:20 AM Song Liu <songliubraving@fb.com> wrote:
> >>>>>>> +
> >>>>>>> +// single set of global perf events to measure
> >>>>>>> +struct {
> >>>>>>> +     __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
> >>>>>>> +     __uint(key_size, sizeof(__u32));
> >>>>>>> +     __uint(value_size, sizeof(int));
> >>>>>>> +     __uint(max_entries, 1);
> >>>>>>> +} events SEC(".maps");
> >>>>>>> +
> >>>>>>> +// from logical cpu number to event index
> >>>>>>> +// useful when user wants to count subset of cpus
> >>>>>>> +struct {
> >>>>>>> +     __uint(type, BPF_MAP_TYPE_HASH);
> >>>>>>> +     __uint(key_size, sizeof(__u32));
> >>>>>>> +     __uint(value_size, sizeof(__u32));
> >>>>>>> +     __uint(max_entries, 1);
> >>>>>>> +} cpu_idx SEC(".maps");
> >>>>>>
> >>>>>> How about we make cpu_idx a percpu array and use 0,1 for
> >>>>>> disable/enable profiling on this cpu?
> >>>>>
> >>>>> No, it's to calculate an index to the cgrp_readings map which
> >>>>> has the event x cpu x cgroup number of elements.
> >>>>>
> >>>>> It controls enabling events with a global (bss) variable.
> >>>>
> >>>> If we make cgrp_idx a per cpu array, we probably don't need the
> >>>> cpu_idx map?
> >>>
> >>> Right.
> >
> > Maybe not.  Sometimes we want to profile a subset of cpus only.
> > In that case, cpu != idx then I think we still need this.
>
> We can attach the bpf program only on the selected CPUs. Say, we want
> CPUs 1, 3, 5. We just do
>
>         for (i in [1, 3, 5]) {
>                 link = bpf_program__attach_perf_event(skel->progs.on_switch,
>                                                       FD(cgrp_switch, i));
>                 /* */
>         }
>
> The value arrays still cover all CPUs, but they will just report zero
> for CPUs 0, 2, 4, ....
>
> Would this work?

Yeah, that's exactly what I do, but I wanted a compact map that
eliminates the unused entries (cpus).  Now I think I can keep
entries for all cpus and just leave the unused ones alone.


>
> >>>>> Maybe.  But I don't know how to access the elements
> >>>>> in a per-cpu map from userspace.
> >>>>
> >>>> Please refer to bperf__read() reading accum_readings. Basically, we read
> >>>> one index of all CPUs with one bpf_map_lookup_elem().
> >>>
> >>> Thanks!  So when I use a per-cpu array with 3 elements, I can access
> >>> cpu/elem entries in a row like below, right?
> >>>
> >>> 0/0, 0/1, 0/2, 1/0, 1/1, 1/2, 2/0, 2/1, 2/2, 3/0, ...
> >>
> >> I am not sure I am following here.
> >>
> >> Say the system has 10 cpus, and the array has 3 elements. We can do:
> >>
> >>        __u32 values[10];  /* assuming both key and value are __u32 */
> >>        __u32 elem;
> >>        int cpu;
> >>
> >>        for (elem = 0; elem < 3; elem++) {
> >>                bpf_map_lookup_elem(map_fd, &elem, values);
> >>                for (cpu = 0; cpu < 10; cpu++)
> >>                        values[cpu] /* this is the value for cpu/elem */
> >>        }
> >
> > Thanks for the explanation, I didn't think of it that way.
> > I thought of it like below:
> >
> >    __u32 elem, value;
> >
> >    for (elem = 0; elem < 3 * 10; elem++) {
> >        bpf_map_lookup_elem(map_fd, &elem, &value);
> >    }
> >
> > So in this case, the actual value size is like below, right?
> >
> >  value-size = map-value-size * number-of-cpu
>
> This is right (for user space).

Thanks for your clarification!

Namhyung

^ permalink raw reply	[flat|nested] 13+ messages in thread
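
The two layouts weighed in this exchange can be sketched in plain C.
This is only an illustration under assumed names: SKETCH_MAX_CPUS,
build_cpu_idx(), compact_slots() and full_slots() are hypothetical and
do not appear in the patch.  The compact layout needs a cpu -> index
translation (the role of the cpu_idx map); the full layout indexes by
cpu number directly and leaves the slots of unprofiled cpus at zero.

```c
#include <assert.h>

#define SKETCH_MAX_CPUS 8	/* hypothetical machine size, not from the patch */

/*
 * Fill cpu_idx[] so that cpu_idx[cpu] is the compact index of a profiled
 * cpu, and -1 for a cpu that is not profiled.
 */
static void build_cpu_idx(const int *cpus, int nr, int cpu_idx[SKETCH_MAX_CPUS])
{
	int i;

	for (i = 0; i < SKETCH_MAX_CPUS; i++)
		cpu_idx[i] = -1;

	for (i = 0; i < nr; i++) {
		assert(cpus[i] >= 0 && cpus[i] < SKETCH_MAX_CPUS);
		cpu_idx[cpus[i]] = i;
	}
}

/* Per-event slots needed when readings are indexed by compact index. */
static int compact_slots(int nr_profiled)
{
	return nr_profiled;
}

/* Per-event slots needed when readings are indexed by cpu number. */
static int full_slots(void)
{
	return SKETCH_MAX_CPUS;
}
```

For cpus 1, 3 and 5 on this 8-cpu sketch, the compact layout uses 3
slots per event plus the translation table, while the full layout uses
8 slots per event and needs no translation.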

* [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup
  2021-06-15  1:17 [PATCHSET v2 0/3] perf stat: Enable BPF counters " Namhyung Kim
@ 2021-06-15  1:17 ` Namhyung Kim
  0 siblings, 0 replies; 13+ messages in thread
From: Namhyung Kim @ 2021-06-15  1:17 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Stephane Eranian, Song Liu

Recently bperf was added to use BPF to count perf events for various
purposes.  This extends that approach and targets cgroup usage.

Unlike the other bperf, it doesn't share the events with other
processes but it'd reduce unnecessary events (and the overhead of
multiplexing) for each monitored cgroup within the perf session.

When --for-each-cgroup is used with --bpf-counters, it opens a
cgroup-switches event per cpu internally and attaches the new BPF
program to read the given perf_events and aggregate the results per
cgroup.  The program is only called when a task is switched to a task
in a different cgroup.

It seems the bpf test run mechanism doesn't work for programs attached
to a perf_event, so the results are not guaranteed to be accurate.
But in practice, the perf tool tends to be in a different cgroup from
the actual workload, so having an affinity setting loop at the end
triggers cgroup switches on each cpu.

Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/Makefile.perf                    |   1 +
 tools/perf/util/Build                       |   1 +
 tools/perf/util/bpf_counter.c               |   5 +
 tools/perf/util/bpf_counter_cgroup.c        | 319 ++++++++++++++++++++
 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c | 146 +++++++++
 tools/perf/util/cgroup.c                    |   2 +
 tools/perf/util/cgroup.h                    |   1 +
 7 files changed, 475 insertions(+)
 create mode 100644 tools/perf/util/bpf_counter_cgroup.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index e47f04e5b51e..b67099dd2456 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1015,6 +1015,7 @@ SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
 SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
 SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
+SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h
 
 ifdef BUILD_BPF_SKEL
 BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 95e15d1035ab..700d635448ff 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -140,6 +140,7 @@ perf-y += clockid.o
 perf-$(CONFIG_LIBBPF) += bpf-loader.o
 perf-$(CONFIG_LIBBPF) += bpf_map.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
+perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
 perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
 perf-$(CONFIG_LIBELF) += symbol-elf.o
 perf-$(CONFIG_LIBELF) += probe-file.o
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 974f10e356f0..7812c5d9b826 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -22,6 +22,7 @@
 #include "evsel.h"
 #include "evlist.h"
 #include "target.h"
+#include "cgroup.h"
 #include "cpumap.h"
 #include "thread_map.h"
 
@@ -792,6 +793,8 @@ struct bpf_counter_ops bperf_ops = {
 	.destroy    = bperf__destroy,
 };
 
+extern struct bpf_counter_ops bperf_cgrp_ops;
+
 static inline bool bpf_counter_skip(struct evsel *evsel)
 {
 	return list_empty(&evsel->bpf_counter_list) &&
@@ -809,6 +812,8 @@ int bpf_counter__load(struct evsel *evsel, struct target *target)
 {
 	if (target->bpf_str)
 		evsel->bpf_counter_ops = &bpf_program_profiler_ops;
+	else if (cgrp_event_expanded && target->use_bpf)
+		evsel->bpf_counter_ops = &bperf_cgrp_ops;
 	else if (target->use_bpf || evsel->bpf_counter ||
 		 evsel__match_bpf_counter_events(evsel->name))
 		evsel->bpf_counter_ops = &bperf_ops;
diff --git a/tools/perf/util/bpf_counter_cgroup.c b/tools/perf/util/bpf_counter_cgroup.c
new file mode 100644
index 000000000000..0ec5be75b860
--- /dev/null
+++ b/tools/perf/util/bpf_counter_cgroup.c
@@ -0,0 +1,319 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Copyright (c) 2019 Facebook */
+/* Copyright (c) 2021 Google */
+
+#include <assert.h>
+#include <limits.h>
+#include <unistd.h>
+#include <sys/file.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <linux/err.h>
+#include <linux/zalloc.h>
+#include <linux/perf_event.h>
+#include <bpf/bpf.h>
+#include <bpf/btf.h>
+#include <bpf/libbpf.h>
+#include <api/fs/fs.h>
+#include <perf/bpf_perf.h>
+
+#include "affinity.h"
+#include "bpf_counter.h"
+#include "cgroup.h"
+#include "counts.h"
+#include "debug.h"
+#include "evsel.h"
+#include "evlist.h"
+#include "target.h"
+#include "cpumap.h"
+#include "thread_map.h"
+
+#include "bpf_skel/bperf_cgroup.skel.h"
+
+static struct perf_event_attr cgrp_switch_attr = {
+	.type = PERF_TYPE_SOFTWARE,
+	.config = PERF_COUNT_SW_CGROUP_SWITCHES,
+	.size = sizeof(cgrp_switch_attr),
+	.sample_period = 1,
+	.disabled = 1,
+};
+
+static struct evsel *cgrp_switch;
+static struct xyarray *cgrp_prog_fds;
+static struct bperf_cgroup_bpf *skel;
+
+#define FD(evt, cpu) (*(int *)xyarray__entry(evt->core.fd, cpu, 0))
+#define PROG(cpu)    (*(int *)xyarray__entry(cgrp_prog_fds, cpu, 0))
+
+static void set_max_rlimit(void)
+{
+	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
+
+	setrlimit(RLIMIT_MEMLOCK, &rinf);
+}
+
+static __u32 bpf_link_get_prog_id(int fd)
+{
+	struct bpf_link_info link_info = {0};
+	__u32 link_info_len = sizeof(link_info);
+
+	bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len);
+	return link_info.prog_id;
+}
+
+static int bperf_load_program(struct evlist *evlist)
+{
+	struct bpf_link *link;
+	struct evsel *evsel;
+	struct cgroup *cgrp, *leader_cgrp;
+	__u32 i, cpu, prog_id;
+	int nr_cpus = evlist->core.all_cpus->nr;
+	int map_size, map_fd, err;
+
+	skel = bperf_cgroup_bpf__open();
+	if (!skel) {
+		pr_err("Failed to open cgroup skeleton\n");
+		return -1;
+	}
+
+	skel->rodata->num_cpus = nr_cpus;
+	skel->rodata->num_events = evlist->core.nr_entries / nr_cgroups;
+
+	BUG_ON(evlist->core.nr_entries % nr_cgroups != 0);
+
+	/* we need one copy of events per cpu for reading */
+	map_size = nr_cpus * evlist->core.nr_entries / nr_cgroups;
+	bpf_map__resize(skel->maps.events, map_size);
+	bpf_map__resize(skel->maps.cpu_idx, nr_cpus);
+	bpf_map__resize(skel->maps.cgrp_idx, nr_cgroups);
+	/* previous result is saved in a per-cpu array */
+	map_size = evlist->core.nr_entries / nr_cgroups;
+	bpf_map__resize(skel->maps.prev_readings, map_size);
+	/* cgroup result needs all events */
+	map_size = nr_cpus * evlist->core.nr_entries;
+	bpf_map__resize(skel->maps.cgrp_readings, map_size);
+
+	set_max_rlimit();
+
+	err = bperf_cgroup_bpf__load(skel);
+	if (err) {
+		pr_err("Failed to load cgroup skeleton\n");
+		goto out;
+	}
+
+	if (cgroup_is_v2("perf_event") > 0)
+		skel->bss->use_cgroup_v2 = 1;
+
+	err = -1;
+
+	cgrp_switch = evsel__new(&cgrp_switch_attr);
+	if (evsel__open_per_cpu(cgrp_switch, evlist->core.all_cpus, -1) < 0) {
+		pr_err("Failed to open cgroup switches event\n");
+		goto out;
+	}
+
+	map_fd = bpf_map__fd(skel->maps.cpu_idx);
+	if (map_fd < 0) {
+		pr_err("cannot get cpu idx map\n");
+		goto out;
+	}
+
+	cgrp_prog_fds = xyarray__new(nr_cpus, 1, sizeof(int));
+	if (!cgrp_prog_fds) {
+		pr_err("Failed to allocate cgroup switch prog fd\n");
+		goto out;
+	}
+
+	for (i = 0; i < nr_cpus; i++) {
+		link = bpf_program__attach_perf_event(skel->progs.on_switch,
+						      FD(cgrp_switch, i));
+		if (IS_ERR(link)) {
+			pr_err("Failed to attach cgroup program\n");
+			err = PTR_ERR(link);
+			goto out;
+		}
+
+		/* update cpu index in case there are missing cpus */
+		cpu = evlist->core.all_cpus->map[i];
+		bpf_map_update_elem(map_fd, &cpu, &i, BPF_ANY);
+
+		prog_id = bpf_link_get_prog_id(bpf_link__fd(link));
+		PROG(i) = bpf_prog_get_fd_by_id(prog_id);
+	}
+
+	/*
+	 * Update cgrp_idx map from cgroup-id to event index.
+	 */
+	cgrp = NULL;
+	i = 0;
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (cgrp == NULL || evsel->cgrp == leader_cgrp) {
+			leader_cgrp = evsel->cgrp;
+			evsel->cgrp = NULL;
+
+			/* open single copy of the events w/o cgroup */
+			err = evsel__open_per_cpu(evsel, evlist->core.all_cpus, -1);
+			if (err) {
+				pr_err("Failed to open first cgroup events\n");
+				goto out;
+			}
+
+			map_fd = bpf_map__fd(skel->maps.events);
+			for (cpu = 0; cpu < nr_cpus; cpu++) {
+				__u32 idx = evsel->idx * nr_cpus + cpu;
+				int fd = FD(evsel, cpu);
+
+				bpf_map_update_elem(map_fd, &idx, &fd, BPF_ANY);
+			}
+
+			evsel->cgrp = leader_cgrp;
+		}
+		evsel->supported = true;
+
+		if (evsel->cgrp == cgrp)
+			continue;
+
+		cgrp = evsel->cgrp;
+
+		if (read_cgroup_id(cgrp) < 0) {
+			pr_debug("Failed to get cgroup id\n");
+			err = -1;
+			goto out;
+		}
+
+		map_fd = bpf_map__fd(skel->maps.cgrp_idx);
+		bpf_map_update_elem(map_fd, &cgrp->id, &i, BPF_ANY);
+
+		i++;
+	}
+
+	pr_debug("The kernel does not support test_run for perf_event BPF programs.\n"
+		 "Therefore, --for-each-cgroup might show inaccurate readings\n");
+	err = 0;
+
+out:
+	return err;
+}
+
+static int bperf_cgrp__load(struct evsel *evsel, struct target *target)
+{
+	static bool bperf_loaded = false;
+
+	evsel->bperf_leader_prog_fd = -1;
+	evsel->bperf_leader_link_fd = -1;
+
+	if (!bperf_loaded && bperf_load_program(evsel->evlist))
+		return -1;
+
+	bperf_loaded = true;
+	/* just to bypass bpf_counter_skip() */
+	evsel->follower_skel = (struct bperf_follower_bpf *)skel;
+
+	return 0;
+}
+
+static int bperf_cgrp__install_pe(struct evsel *evsel, int cpu, int fd)
+{
+	/* nothing to do */
+	return 0;
+}
+
+/*
+ * trigger the on_switch prog on each cpu, so the cgrp_readings map can get
+ * the latest results.
+ */
+static int bperf_sync_counters(struct evlist *evlist)
+{
+	struct affinity affinity;
+	int i, cpu;
+
+	/* change affinity to rotate all cpus to trigger cgroup-switches (hopefully) */
+	if (affinity__setup(&affinity) < 0)
+		return -1;
+
+	evlist__for_each_cpu(evlist, i, cpu)
+		affinity__set(&affinity, cpu);
+
+	affinity__cleanup(&affinity);
+
+	return 0;
+}
+
+static int bperf_cgrp__enable(struct evsel *evsel)
+{
+	skel->bss->enabled = 1;
+	return 0;
+}
+
+static int bperf_cgrp__disable(struct evsel *evsel)
+{
+	if (evsel->idx)
+		return 0;
+
+	bperf_sync_counters(evsel->evlist);
+
+	skel->bss->enabled = 0;
+	return 0;
+}
+
+static int bperf_cgrp__read(struct evsel *evsel)
+{
+	struct evlist *evlist = evsel->evlist;
+	int i, nr_cpus = evlist->core.all_cpus->nr;
+	struct perf_counts_values *counts;
+	struct bpf_perf_event_value values;
+	struct cgroup *cgrp = NULL;
+	int cgrp_idx = -1;
+	int reading_map_fd, err = 0;
+	__u32 idx;
+
+	if (evsel->idx)
+		return 0;
+
+	reading_map_fd = bpf_map__fd(skel->maps.cgrp_readings);
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (cgrp != evsel->cgrp) {
+			cgrp = evsel->cgrp;
+			cgrp_idx++;
+		}
+
+		for (i = 0; i < nr_cpus; i++) {
+			idx = evsel->idx * nr_cpus + i;
+			err = bpf_map_lookup_elem(reading_map_fd, &idx, &values);
+			if (err)
+				goto out;
+
+			counts = perf_counts(evsel->counts, i, 0);
+			counts->val = values.counter;
+			counts->ena = values.enabled;
+			counts->run = values.running;
+		}
+	}
+
+out:
+	return err;
+}
+
+static int bperf_cgrp__destroy(struct evsel *evsel)
+{
+	if (evsel->idx)
+		return 0;
+
+	bperf_cgroup_bpf__destroy(skel);
+	evsel__delete(cgrp_switch);  // it'll destroy on_switch progs too
+	free(cgrp_prog_fds);
+
+	return 0;
+}
+
+struct bpf_counter_ops bperf_cgrp_ops = {
+	.load       = bperf_cgrp__load,
+	.enable     = bperf_cgrp__enable,
+	.disable    = bperf_cgrp__disable,
+	.read       = bperf_cgrp__read,
+	.install_pe = bperf_cgrp__install_pe,
+	.destroy    = bperf_cgrp__destroy,
+};
diff --git a/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
new file mode 100644
index 000000000000..5b520821c4b7
--- /dev/null
+++ b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (c) 2021 Facebook
+// Copyright (c) 2021 Google
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+#define MAX_EVENTS  32  // max events per cgroup: arbitrary
+
+// NOTE: many of the maps and global data will be modified from userspace
+//       (the perf tool) using the skeleton helpers before loading.
+
+// single set of global perf events to measure
+struct {
+	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(int));
+	__uint(max_entries, 1);
+} events SEC(".maps");
+
+// from logical cpu number to event index
+// useful when user wants to count subset of cpus
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, 1);
+} cpu_idx SEC(".maps");
+
+// from cgroup id to event index
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u64));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, 1);
+} cgrp_idx SEC(".maps");
+
+// per-cpu event snapshots to calculate delta
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct bpf_perf_event_value));
+} prev_readings SEC(".maps");
+
+// aggregated event values for each cgroup
+// will be read from the user-space
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct bpf_perf_event_value));
+} cgrp_readings SEC(".maps");
+
+const volatile __u32 num_events = 1;
+const volatile __u32 num_cpus = 1;
+
+int enabled = 0;
+int use_cgroup_v2 = 0;
+
+static inline __u64 get_current_cgroup_v1_id(void)
+{
+	struct task_struct *p = (void *)bpf_get_current_task();
+
+	return BPF_CORE_READ(p, cgroups, subsys[perf_event_cgrp_id], cgroup, kn, id);
+}
+
+// This will be attached to cgroup-switches event for each cpu
+SEC("perf_event")
+int BPF_PROG(on_switch)
+{
+	register __u32 idx = 0;  // to have it in a register to pass BPF verifier
+	struct bpf_perf_event_value val, *prev_val, *cgrp_val;
+	__u32 cpu = bpf_get_smp_processor_id();
+	__u64 cgrp;
+	__u32 evt_idx, key;
+	__u32 *elem;
+	long err;
+
+	// map the current CPU to a CPU index, which is particularly necessary
+	// when only a subset of all CPUs is profiled.
+	elem = bpf_map_lookup_elem(&cpu_idx, &cpu);
+	if (!elem)
+		return 0;
+	cpu = *elem;
+
+	if (use_cgroup_v2)
+		cgrp = bpf_get_current_cgroup_id();
+	else
+		cgrp = get_current_cgroup_v1_id();
+
+	elem = bpf_map_lookup_elem(&cgrp_idx, &cgrp);
+	if (elem)
+		cgrp = *elem;
+	else
+		cgrp = ~0ULL;
+
+	for ( ; idx < MAX_EVENTS; idx++) {
+		if (idx == num_events)
+			break;
+
+		// XXX: do not pass idx directly (for verifier)
+		key = idx;
+		// this is per-cpu array for diff
+		prev_val = bpf_map_lookup_elem(&prev_readings, &key);
+		if (!prev_val) {
+			val.counter = val.enabled = val.running = 0;
+			bpf_map_update_elem(&prev_readings, &key, &val, BPF_ANY);
+
+			prev_val = bpf_map_lookup_elem(&prev_readings, &key);
+			if (!prev_val)
+				continue;
+		}
+
+		// read from global event array
+		evt_idx = idx * num_cpus + cpu;
+		err = bpf_perf_event_read_value(&events, evt_idx, &val, sizeof(val));
+		if (err)
+			continue;
+
+		if (enabled && cgrp != ~0ULL) {
+			// aggregate the result by cgroup
+			evt_idx += cgrp * num_cpus * num_events;
+			cgrp_val = bpf_map_lookup_elem(&cgrp_readings, &evt_idx);
+			if (cgrp_val) {
+				cgrp_val->counter += val.counter - prev_val->counter;
+				cgrp_val->enabled += val.enabled - prev_val->enabled;
+				cgrp_val->running += val.running - prev_val->running;
+			} else {
+				val.counter -= prev_val->counter;
+				val.enabled -= prev_val->enabled;
+				val.running -= prev_val->running;
+
+				bpf_map_update_elem(&cgrp_readings, &evt_idx, &val, BPF_ANY);
+
+				val.counter += prev_val->counter;
+				val.enabled += prev_val->enabled;
+				val.running += prev_val->running;
+			}
+		}
+
+		*prev_val = val;
+	}
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c
index 48ec79211270..f7d07b365401 100644
--- a/tools/perf/util/cgroup.c
+++ b/tools/perf/util/cgroup.c
@@ -18,6 +18,7 @@
 #include <regex.h>
 
 int nr_cgroups;
+bool cgrp_event_expanded;
 
 /* used to match cgroup name with patterns */
 struct cgroup_name {
@@ -484,6 +485,7 @@ int evlist__expand_cgroup(struct evlist *evlist, const char *str,
 	}
 
 	ret = 0;
+	cgrp_event_expanded = true;
 
 out_err:
 	evlist__delete(orig_list);
diff --git a/tools/perf/util/cgroup.h b/tools/perf/util/cgroup.h
index 1549ec2fd348..21f7ccc566e1 100644
--- a/tools/perf/util/cgroup.h
+++ b/tools/perf/util/cgroup.h
@@ -17,6 +17,7 @@ struct cgroup {
 };
 
 extern int nr_cgroups; /* number of explicit cgroups defined */
+extern bool cgrp_event_expanded;
 
 struct cgroup *cgroup__get(struct cgroup *cgroup);
 void cgroup__put(struct cgroup *cgroup);
-- 
2.32.0.272.g935e593368-goog
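
The delta accounting done in on_switch above can be sketched in plain
C, outside of BPF.  This is a simplified illustration, not the patch's
code: struct reading and account_switch() are hypothetical names, and
the struct only mirrors the three fields of struct
bpf_perf_event_value.  On each cgroup switch the difference since the
previous snapshot is credited to the outgoing cgroup (if it is
monitored), and the snapshot is always refreshed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the counter/enabled/running fields of struct bpf_perf_event_value. */
struct reading {
	uint64_t counter;
	uint64_t enabled;
	uint64_t running;
};

/*
 * Credit (now - prev) to cgrp_total when the cgroup is monitored
 * (cgrp_total != NULL), then refresh the per-cpu snapshot so that the
 * next switch only sees what happened after this one.
 */
static void account_switch(struct reading *cgrp_total, struct reading *prev,
			   const struct reading *now)
{
	assert(prev != NULL && now != NULL);

	if (cgrp_total) {
		cgrp_total->counter += now->counter - prev->counter;
		cgrp_total->enabled += now->enabled - prev->enabled;
		cgrp_total->running += now->running - prev->running;
	}

	*prev = *now;
}
```

Refreshing the snapshot even for an unmonitored cgroup matters: without
it, the time spent in that cgroup would be credited to whichever
monitored cgroup is switched out next.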


^ permalink raw reply related	[flat|nested] 13+ messages in thread
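
The flat index into cgrp_readings used by the patch (event x cpu x
cgroup elements) can be checked with a short sketch.  The function
names below are hypothetical; the arithmetic follows the BPF side
(evt_idx = idx * num_cpus + cpu, plus cgrp * num_cpus * num_events)
and the user-space side (evsel->idx * nr_cpus + cpu), assuming evsels
are laid out cgroup by cgroup so that evsel->idx equals
cgrp * num_events + evt.

```c
#include <assert.h>
#include <stdint.h>

/* Index as computed on the BPF side from (cgroup, event, cpu). */
static uint32_t reading_idx(uint32_t cgrp, uint32_t evt, uint32_t cpu,
			    uint32_t num_events, uint32_t num_cpus)
{
	assert(evt < num_events && cpu < num_cpus);
	return cgrp * num_cpus * num_events + evt * num_cpus + cpu;
}

/* Index as computed on the user-space side from the evsel index. */
static uint32_t reading_idx_from_evsel(uint32_t evsel_idx, uint32_t cpu,
				       uint32_t num_cpus)
{
	assert(cpu < num_cpus);
	return evsel_idx * num_cpus + cpu;
}
```

With 2 events and 4 cpus, cgroup 1 / event 1 / cpu 3 maps to element
15 on both sides, since the corresponding evsel index is 1 * 2 + 1 = 3.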

end of thread, other threads:[~2021-06-24 22:21 UTC | newest]

Thread overview: 13+ messages
2021-06-22  7:12 [PATCHSET v3 0/3] perf stat: Enable BPF counters with --for-each-cgroup Namhyung Kim
2021-06-22  7:12 ` [PATCH 1/3] perf tools: Add read_cgroup_id() function Namhyung Kim
2021-06-22  7:12 ` [PATCH 2/3] perf tools: Add cgroup_is_v2() helper Namhyung Kim
2021-06-22  7:12 ` [PATCH 3/3] perf stat: Enable BPF counter with --for-each-cgroup Namhyung Kim
2021-06-24  4:54   ` Song Liu
2021-06-24  5:49     ` Namhyung Kim
2021-06-24 16:20       ` Song Liu
2021-06-24 21:01         ` Namhyung Kim
2021-06-24 21:40           ` Song Liu
2021-06-24 22:06             ` Namhyung Kim
2021-06-24 22:15               ` Song Liu
2021-06-24 22:21                 ` Namhyung Kim
  -- strict thread matches above, loose matches on Subject: below --
2021-06-15  1:17 [PATCHSET v2 0/3] perf stat: Enable BPF counters " Namhyung Kim
2021-06-15  1:17 ` [PATCH 3/3] perf stat: Enable BPF counter " Namhyung Kim
