* [RFC 0/6] perf record: Implement off-cpu profiling with BPF (v3)
@ 2022-05-18 22:47 Namhyung Kim
  2022-05-18 22:47 ` [PATCH 1/6] perf report: Do not extend sample type of bpf-output event Namhyung Kim
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Namhyung Kim @ 2022-05-18 22:47 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, Milian Wolff, bpf, linux-perf-users,
	Blake Jones

Hello,

This is the third version of the off-cpu profiling support.  Together with
(PMU-based) cpu profiling, it can show a holistic view of the performance
characteristics of your application or system.

Changes in v3)
 * address most of review comments from Arnaldo
 * handle the new sched_switch argument ordering
 * add cgroup filter/sampling support
 * add a shell test

Changes in v2)
 * change sched_switch argument handling (Andrii)
 * use task local storage (Hao)
 * fix build error on !BUILD_BPF_SKEL (kernel test robot)
 * add documentation regarding fp callstack (Milian)
 
 
With BPF, it can aggregate scheduling stats for the tasks and/or states
of interest and convert the data into perf sample records.  I chose the
bpf-output event, which is a software event meant to be consumed by BPF
programs, and renamed it to "offcpu-time".  So it requires no change on
the perf report side except for setting the sample type of the
bpf-output event.

Basically it collects the userspace callstack for tasks, as that's what
users mostly want.  Maybe we can add support for kernel stacks, but I'm
afraid it'd cause more overhead.  So the offcpu-time event will always
have callchains regardless of the command line options, and it enables
the children mode in perf report by default.
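
To give a rough idea of the data model (the actual definitions are in
the second patch), the BPF program accumulates the slept time per task,
user stack and state; the key looks roughly like below and the value is
the total off-cpu time in nanoseconds summed at sched_switch:

  /* sketch based on the BPF skeleton in patch 2 (__u32 as in <linux/types.h>) */
  struct offcpu_key {
          __u32 pid;
          __u32 tgid;
          __u32 stack_id;   /* id in a BPF stack trace map */
          __u32 state;      /* TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE */
  };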

It adds the --off-cpu option to perf record like below:

  $ sudo perf record -a --off-cpu -- perf bench sched messaging -l 1000
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

     Total time: 1.518 [sec]
  [ perf record: Woken up 9 times to write data ]
  [ perf record: Captured and wrote 5.313 MB perf.data (53341 samples) ]

Then we can run perf report as usual.  The options below are just to
skip the less important parts.

  $ sudo perf report --stdio --call-graph=no --percent-limit=2
  # To display the perf.data header info, please use --header/--header-only options.
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 52K of event 'cycles'
  # Event count (approx.): 42522453276
  #
  # Children      Self  Command          Shared Object     Symbol                            
  # ........  ........  ...............  ................  ..................................
  #
       9.58%     9.58%  sched-messaging  [kernel.vmlinux]  [k] audit_filter_rules.constprop.0
       8.46%     8.46%  sched-messaging  [kernel.vmlinux]  [k] audit_filter_syscall
       4.54%     4.54%  sched-messaging  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
       2.94%     2.94%  sched-messaging  [kernel.vmlinux]  [k] unix_stream_read_generic
       2.45%     2.45%  sched-messaging  [kernel.vmlinux]  [k] memcg_slab_free_hook
  
  
  # Samples: 983  of event 'offcpu-time'
  # Event count (approx.): 684538813464
  #
  # Children      Self  Command          Shared Object         Symbol                    
  # ........  ........  ...............  ....................  ..........................
  #
      83.86%     0.00%  sched-messaging  libc-2.33.so          [.] __libc_start_main
      83.86%     0.00%  sched-messaging  perf                  [.] cmd_bench
      83.86%     0.00%  sched-messaging  perf                  [.] main
      83.86%     0.00%  sched-messaging  perf                  [.] run_builtin
      83.64%     0.00%  sched-messaging  perf                  [.] bench_sched_messaging
      41.35%    41.35%  sched-messaging  libpthread-2.33.so    [.] __read
      38.88%    38.88%  sched-messaging  libpthread-2.33.so    [.] __write
       3.41%     3.41%  sched-messaging  libc-2.33.so          [.] __poll

The perf bench sched messaging created 400 processes to send/receive
messages through unix sockets.  It spent a large portion of cpu cycles
in the audit filter and in reading/copying the messages, while most of
the offcpu-time was in the read and write calls.

You can get the code from 'perf/offcpu-v3' branch in my tree at

  git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Enjoy! :)

Thanks,
Namhyung


Namhyung Kim (6):
  perf report: Do not extend sample type of bpf-output event
  perf record: Enable off-cpu analysis with BPF
  perf record: Implement basic filtering for off-cpu
  perf record: Handle argument change in sched_switch
  perf record: Add cgroup support for off-cpu profiling
  perf test: Add a basic offcpu profiling test

 tools/perf/Documentation/perf-record.txt |  10 +
 tools/perf/Makefile.perf                 |   1 +
 tools/perf/builtin-record.c              |  25 ++
 tools/perf/tests/shell/record_offcpu.sh  |  60 ++++
 tools/perf/util/Build                    |   1 +
 tools/perf/util/bpf_off_cpu.c            | 338 +++++++++++++++++++++++
 tools/perf/util/bpf_skel/off_cpu.bpf.c   | 229 +++++++++++++++
 tools/perf/util/evsel.c                  |   4 +-
 tools/perf/util/off_cpu.h                |  29 ++
 9 files changed, 695 insertions(+), 2 deletions(-)
 create mode 100755 tools/perf/tests/shell/record_offcpu.sh
 create mode 100644 tools/perf/util/bpf_off_cpu.c
 create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
 create mode 100644 tools/perf/util/off_cpu.h


base-commit: 6a973e291978bfd1ff8bb3184e337309acc16d69
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/6] perf report: Do not extend sample type of bpf-output event
  2022-05-18 22:47 [RFC 0/6] perf record: Implement off-cpu profiling with BPF (v3) Namhyung Kim
@ 2022-05-18 22:47 ` Namhyung Kim
  2022-05-18 22:47 ` [PATCH 2/6] perf record: Enable off-cpu analysis with BPF Namhyung Kim
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Namhyung Kim @ 2022-05-18 22:47 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, Milian Wolff, bpf, linux-perf-users,
	Blake Jones

Currently evsel__new_idx() sets more sample_type bits when it finds a
BPF-output event.  But it should honor what's recorded in the perf
data file rather than blindly setting the bits.  Otherwise it could
lead to a parse error when the data was recorded with a modified
sample_type.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/evsel.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 1cf967d689aa..d3c8ebdc6d43 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -296,8 +296,8 @@ struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
 		return NULL;
 	evsel__init(evsel, attr, idx);
 
-	if (evsel__is_bpf_output(evsel)) {
-		evsel->core.attr.sample_type |= (PERF_SAMPLE_RAW | PERF_SAMPLE_TIME |
+	if (evsel__is_bpf_output(evsel) && !attr->sample_type) {
+		evsel->core.attr.sample_type = (PERF_SAMPLE_RAW | PERF_SAMPLE_TIME |
 					    PERF_SAMPLE_CPU | PERF_SAMPLE_PERIOD),
 		evsel->core.attr.sample_period = 1;
 	}
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/6] perf record: Enable off-cpu analysis with BPF
  2022-05-18 22:47 [RFC 0/6] perf record: Implement off-cpu profiling with BPF (v3) Namhyung Kim
  2022-05-18 22:47 ` [PATCH 1/6] perf report: Do not extend sample type of bpf-output event Namhyung Kim
@ 2022-05-18 22:47 ` Namhyung Kim
  2022-05-19  3:58   ` Ian Rogers
  2022-06-01  0:00   ` Ian Rogers
  2022-05-18 22:47 ` [PATCH 3/6] perf record: Implement basic filtering for off-cpu Namhyung Kim
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Namhyung Kim @ 2022-05-18 22:47 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, Milian Wolff, bpf, linux-perf-users,
	Blake Jones

Add the --off-cpu option to enable off-cpu profiling with BPF.  It'd
use a bpf-output event and rename it to "offcpu-time".  Samples will
be synthesized at the end of the record session using data from a BPF
map which contains the aggregated off-cpu time at context switches.
So it needs root privilege to do the off-cpu profiling.

Each sample will have a separate user stacktrace so it will skip
kernel threads.  The sample ip will be set from the stacktrace and
other sample data will be updated accordingly.  Currently it only
handles some basic sample types.

The sample timestamp is set to a dummy value so as not to interfere
with other events during sorting.  So it has a very big initial value
which is increased while processing each sample.

A good thing is that it can be used together with regular profiling
like cpu cycles.  If you don't want that, you can use a dummy event to
enable off-cpu profiling only.
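
For example, an invocation like below should enable off-cpu profiling
only (the shell test added later in this series uses a similar command
line):

  $ sudo perf record -e dummy --off-cpu -- sleep 1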

Example output:
  $ sudo perf record --off-cpu perf bench sched messaging -l 1000

  $ sudo perf report --stdio --call-graph=no
  # Total Lost Samples: 0
  #
  # Samples: 41K of event 'cycles'
  # Event count (approx.): 42137343851
  ...

  # Samples: 1K of event 'offcpu-time'
  # Event count (approx.): 587990831640
  #
  # Children      Self  Command          Shared Object       Symbol
  # ........  ........  ...............  ..................  .........................
  #
      81.66%     0.00%  sched-messaging  libc-2.33.so        [.] __libc_start_main
      81.66%     0.00%  sched-messaging  perf                [.] cmd_bench
      81.66%     0.00%  sched-messaging  perf                [.] main
      81.66%     0.00%  sched-messaging  perf                [.] run_builtin
      81.43%     0.00%  sched-messaging  perf                [.] bench_sched_messaging
      40.86%    40.86%  sched-messaging  libpthread-2.33.so  [.] __read
      37.66%    37.66%  sched-messaging  libpthread-2.33.so  [.] __write
       2.91%     2.91%  sched-messaging  libc-2.33.so        [.] __poll
  ...

As you can see, it spent most of the off-cpu time in read and write in
bench_sched_messaging().  The --call-graph=no option was added just to
make the output concise here.

It uses the perf hooks facility to control the BPF program during the
record session rather than adding new BPF/off-cpu specific calls.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/Documentation/perf-record.txt |  10 ++
 tools/perf/Makefile.perf                 |   1 +
 tools/perf/builtin-record.c              |  25 +++
 tools/perf/util/Build                    |   1 +
 tools/perf/util/bpf_off_cpu.c            | 204 +++++++++++++++++++++++
 tools/perf/util/bpf_skel/off_cpu.bpf.c   | 139 +++++++++++++++
 tools/perf/util/off_cpu.h                |  24 +++
 7 files changed, 404 insertions(+)
 create mode 100644 tools/perf/util/bpf_off_cpu.c
 create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
 create mode 100644 tools/perf/util/off_cpu.h

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 465be4e62a17..b4e9ef7edfef 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -758,6 +758,16 @@ include::intel-hybrid.txt[]
 	If the URLs is not specified, the value of DEBUGINFOD_URLS
 	system environment variable is used.
 
+--off-cpu::
+	Enable off-cpu profiling with BPF.  The BPF program will collect
+	task scheduling information with (user) stacktrace and save them
+	as sample data of a software event named "offcpu-time".  The
+	sample period will have the time the task slept in nanoseconds.
+
+	Note that BPF can collect stack traces using frame pointer ("fp")
+	only, as of now.  So the applications built without the frame
+	pointer might see bogus addresses.
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-list[1], linkperf:perf-intel-pt[1]
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 6e5aded855cc..8f738e11356d 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1038,6 +1038,7 @@ SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
 SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h
+SKELETONS += $(SKEL_OUT)/off_cpu.skel.h
 
 $(SKEL_TMP_OUT) $(LIBBPF_OUTPUT):
 	$(Q)$(MKDIR) -p $@
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a5cf6a99d67f..91f88501412e 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -49,6 +49,7 @@
 #include "util/clockid.h"
 #include "util/pmu-hybrid.h"
 #include "util/evlist-hybrid.h"
+#include "util/off_cpu.h"
 #include "asm/bug.h"
 #include "perf.h"
 #include "cputopo.h"
@@ -162,6 +163,7 @@ struct record {
 	bool			buildid_mmap;
 	bool			timestamp_filename;
 	bool			timestamp_boundary;
+	bool			off_cpu;
 	struct switch_output	switch_output;
 	unsigned long long	samples;
 	unsigned long		output_max_size;	/* = 0: unlimited */
@@ -903,6 +905,11 @@ static int record__config_text_poke(struct evlist *evlist)
 	return 0;
 }
 
+static int record__config_off_cpu(struct record *rec)
+{
+	return off_cpu_prepare(rec->evlist);
+}
+
 static bool record__kcore_readable(struct machine *machine)
 {
 	char kcore[PATH_MAX];
@@ -2600,6 +2607,9 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	} else
 		status = err;
 
+	if (rec->off_cpu)
+		rec->bytes_written += off_cpu_write(rec->session);
+
 	record__synthesize(rec, true);
 	/* this will be recalculated during process_buildids() */
 	rec->samples = 0;
@@ -3324,6 +3334,7 @@ static struct option __record_options[] = {
 	OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
 			    "write collected trace data into several data files using parallel threads",
 			    record__parse_threads),
+	OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
 	OPT_END()
 };
 
@@ -3743,6 +3754,12 @@ int cmd_record(int argc, const char **argv)
 	set_nobuild('\0', "vmlinux", true);
 # undef set_nobuild
 # undef REASON
+#endif
+
+#ifndef HAVE_BPF_SKEL
+# define set_nobuild(s, l, m, c) set_option_nobuild(record_options, s, l, m, c)
+	set_nobuild('\0', "off-cpu", "no BUILD_BPF_SKEL=1", true);
+# undef set_nobuild
 #endif
 
 	rec->opts.affinity = PERF_AFFINITY_SYS;
@@ -3981,6 +3998,14 @@ int cmd_record(int argc, const char **argv)
 		}
 	}
 
+	if (rec->off_cpu) {
+		err = record__config_off_cpu(rec);
+		if (err) {
+			pr_err("record__config_off_cpu failed, error %d\n", err);
+			goto out;
+		}
+	}
+
 	if (record_opts__config(&rec->opts)) {
 		err = -EINVAL;
 		goto out;
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 9a7209a99e16..a51267d88ca9 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -147,6 +147,7 @@ perf-$(CONFIG_LIBBPF) += bpf_map.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_ftrace.o
+perf-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o
 perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
 perf-$(CONFIG_LIBELF) += symbol-elf.o
 perf-$(CONFIG_LIBELF) += probe-file.o
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
new file mode 100644
index 000000000000..9ed7aca3f4ac
--- /dev/null
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -0,0 +1,204 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "util/bpf_counter.h"
+#include "util/debug.h"
+#include "util/evsel.h"
+#include "util/evlist.h"
+#include "util/off_cpu.h"
+#include "util/perf-hooks.h"
+#include "util/session.h"
+#include <bpf/bpf.h>
+
+#include "bpf_skel/off_cpu.skel.h"
+
+#define MAX_STACKS  32
+/* we don't need actual timestamp, just want to put the samples at last */
+#define OFF_CPU_TIMESTAMP  (~0ull << 32)
+
+static struct off_cpu_bpf *skel;
+
+struct off_cpu_key {
+	u32 pid;
+	u32 tgid;
+	u32 stack_id;
+	u32 state;
+};
+
+union off_cpu_data {
+	struct perf_event_header hdr;
+	u64 array[1024 / sizeof(u64)];
+};
+
+static int off_cpu_config(struct evlist *evlist)
+{
+	struct evsel *evsel;
+	struct perf_event_attr attr = {
+		.type	= PERF_TYPE_SOFTWARE,
+		.config = PERF_COUNT_SW_BPF_OUTPUT,
+		.size	= sizeof(attr), /* to capture ABI version */
+	};
+	char *evname = strdup(OFFCPU_EVENT);
+
+	if (evname == NULL)
+		return -ENOMEM;
+
+	evsel = evsel__new(&attr);
+	if (!evsel) {
+		free(evname);
+		return -ENOMEM;
+	}
+
+	evsel->core.attr.freq = 1;
+	evsel->core.attr.sample_period = 1;
+	/* off-cpu analysis depends on stack trace */
+	evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
+
+	evlist__add(evlist, evsel);
+
+	free(evsel->name);
+	evsel->name = evname;
+
+	return 0;
+}
+
+static void off_cpu_start(void *arg __maybe_unused)
+{
+	skel->bss->enabled = 1;
+}
+
+static void off_cpu_finish(void *arg __maybe_unused)
+{
+	skel->bss->enabled = 0;
+	off_cpu_bpf__destroy(skel);
+}
+
+int off_cpu_prepare(struct evlist *evlist)
+{
+	int err;
+
+	if (off_cpu_config(evlist) < 0) {
+		pr_err("Failed to config off-cpu BPF event\n");
+		return -1;
+	}
+
+	set_max_rlimit();
+
+	skel = off_cpu_bpf__open_and_load();
+	if (!skel) {
+		pr_err("Failed to open off-cpu BPF skeleton\n");
+		return -1;
+	}
+
+	err = off_cpu_bpf__attach(skel);
+	if (err) {
+		pr_err("Failed to attach off-cpu BPF skeleton\n");
+		goto out;
+	}
+
+	if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
+	    perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
+		pr_err("Failed to attach off-cpu skeleton\n");
+		goto out;
+	}
+
+	return 0;
+
+out:
+	off_cpu_bpf__destroy(skel);
+	return -1;
+}
+
+int off_cpu_write(struct perf_session *session)
+{
+	int bytes = 0, size;
+	int fd, stack;
+	u64 sample_type, val, sid = 0;
+	struct evsel *evsel;
+	struct perf_data_file *file = &session->data->file;
+	struct off_cpu_key prev, key;
+	union off_cpu_data data = {
+		.hdr = {
+			.type = PERF_RECORD_SAMPLE,
+			.misc = PERF_RECORD_MISC_USER,
+		},
+	};
+	u64 tstamp = OFF_CPU_TIMESTAMP;
+
+	skel->bss->enabled = 0;
+
+	evsel = evlist__find_evsel_by_str(session->evlist, OFFCPU_EVENT);
+	if (evsel == NULL) {
+		pr_err("%s evsel not found\n", OFFCPU_EVENT);
+		return 0;
+	}
+
+	sample_type = evsel->core.attr.sample_type;
+
+	if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
+		if (evsel->core.id)
+			sid = evsel->core.id[0];
+	}
+
+	fd = bpf_map__fd(skel->maps.off_cpu);
+	stack = bpf_map__fd(skel->maps.stacks);
+	memset(&prev, 0, sizeof(prev));
+
+	while (!bpf_map_get_next_key(fd, &prev, &key)) {
+		int n = 1;  /* start from perf_event_header */
+		int ip_pos = -1;
+
+		bpf_map_lookup_elem(fd, &key, &val);
+
+		if (sample_type & PERF_SAMPLE_IDENTIFIER)
+			data.array[n++] = sid;
+		if (sample_type & PERF_SAMPLE_IP) {
+			ip_pos = n;
+			data.array[n++] = 0;  /* will be updated */
+		}
+		if (sample_type & PERF_SAMPLE_TID)
+			data.array[n++] = (u64)key.pid << 32 | key.tgid;
+		if (sample_type & PERF_SAMPLE_TIME)
+			data.array[n++] = tstamp;
+		if (sample_type & PERF_SAMPLE_ID)
+			data.array[n++] = sid;
+		if (sample_type & PERF_SAMPLE_CPU)
+			data.array[n++] = 0;
+		if (sample_type & PERF_SAMPLE_PERIOD)
+			data.array[n++] = val;
+		if (sample_type & PERF_SAMPLE_CALLCHAIN) {
+			int len = 0;
+
+			/* data.array[n] is callchain->nr (updated later) */
+			data.array[n + 1] = PERF_CONTEXT_USER;
+			data.array[n + 2] = 0;
+
+			bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
+			while (data.array[n + 2 + len])
+				len++;
+
+			/* update length of callchain */
+			data.array[n] = len + 1;
+
+			/* update sample ip with the first callchain entry */
+			if (ip_pos >= 0)
+				data.array[ip_pos] = data.array[n + 2];
+
+			/* calculate sample callchain data array length */
+			n += len + 2;
+		}
+		/* TODO: handle more sample types */
+
+		size = n * sizeof(u64);
+		data.hdr.size = size;
+		bytes += size;
+
+		if (perf_data_file__write(file, &data, size) < 0) {
+			pr_err("failed to write perf data, error: %m\n");
+			return bytes;
+		}
+
+		prev = key;
+		/* increase dummy timestamp to sort later samples */
+		tstamp++;
+	}
+	return bytes;
+}
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
new file mode 100644
index 000000000000..5173ed882fdf
--- /dev/null
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (c) 2022 Google
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+/* task->flags for off-cpu analysis */
+#define PF_KTHREAD   0x00200000  /* I am a kernel thread */
+
+/* task->state for off-cpu analysis */
+#define TASK_INTERRUPTIBLE	0x0001
+#define TASK_UNINTERRUPTIBLE	0x0002
+
+#define MAX_STACKS   32
+#define MAX_ENTRIES  102400
+
+struct tstamp_data {
+	__u32 stack_id;
+	__u32 state;
+	__u64 timestamp;
+};
+
+struct offcpu_key {
+	__u32 pid;
+	__u32 tgid;
+	__u32 stack_id;
+	__u32 state;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, MAX_STACKS * sizeof(__u64));
+	__uint(max_entries, MAX_ENTRIES);
+} stacks SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct tstamp_data);
+} tstamp SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(struct offcpu_key));
+	__uint(value_size, sizeof(__u64));
+	__uint(max_entries, MAX_ENTRIES);
+} off_cpu SEC(".maps");
+
+/* old kernel task_struct definition */
+struct task_struct___old {
+	long state;
+} __attribute__((preserve_access_index));
+
+int enabled = 0;
+
+/*
+ * Old kernel used to call it task_struct->state and now it's '__state'.
+ * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
+ *
+ * https://nakryiko.com/posts/bpf-core-reference-guide/#handling-incompatible-field-and-type-changes
+ */
+static inline int get_task_state(struct task_struct *t)
+{
+	if (bpf_core_field_exists(t->__state))
+		return BPF_CORE_READ(t, __state);
+
+	/* recast pointer to capture task_struct___old type for compiler */
+	struct task_struct___old *t_old = (void *)t;
+
+	/* now use old "state" name of the field */
+	return BPF_CORE_READ(t_old, state);
+}
+
+SEC("tp_btf/sched_switch")
+int on_switch(u64 *ctx)
+{
+	__u64 ts;
+	int state;
+	__u32 stack_id;
+	struct task_struct *prev, *next;
+	struct tstamp_data *pelem;
+
+	if (!enabled)
+		return 0;
+
+	prev = (struct task_struct *)ctx[1];
+	next = (struct task_struct *)ctx[2];
+	state = get_task_state(prev);
+
+	ts = bpf_ktime_get_ns();
+
+	if (prev->flags & PF_KTHREAD)
+		goto next;
+	if (state != TASK_INTERRUPTIBLE &&
+	    state != TASK_UNINTERRUPTIBLE)
+		goto next;
+
+	stack_id = bpf_get_stackid(ctx, &stacks,
+				   BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
+
+	pelem = bpf_task_storage_get(&tstamp, prev, NULL,
+				     BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!pelem)
+		goto next;
+
+	pelem->timestamp = ts;
+	pelem->state = state;
+	pelem->stack_id = stack_id;
+
+next:
+	pelem = bpf_task_storage_get(&tstamp, next, NULL, 0);
+
+	if (pelem && pelem->timestamp) {
+		struct offcpu_key key = {
+			.pid = next->pid,
+			.tgid = next->tgid,
+			.stack_id = pelem->stack_id,
+			.state = pelem->state,
+		};
+		__u64 delta = ts - pelem->timestamp;
+		__u64 *total;
+
+		total = bpf_map_lookup_elem(&off_cpu, &key);
+		if (total)
+			*total += delta;
+		else
+			bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
+
+		/* prevent to reuse the timestamp later */
+		pelem->timestamp = 0;
+	}
+
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
new file mode 100644
index 000000000000..375d03c424ea
--- /dev/null
+++ b/tools/perf/util/off_cpu.h
@@ -0,0 +1,24 @@
+#ifndef PERF_UTIL_OFF_CPU_H
+#define PERF_UTIL_OFF_CPU_H
+
+struct evlist;
+struct perf_session;
+
+#define OFFCPU_EVENT  "offcpu-time"
+
+#ifdef HAVE_BPF_SKEL
+int off_cpu_prepare(struct evlist *evlist);
+int off_cpu_write(struct perf_session *session);
+#else
+static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused)
+{
+	return -1;
+}
+
+static inline int off_cpu_write(struct perf_session *session __maybe_unused)
+{
+	return -1;
+}
+#endif
+
+#endif  /* PERF_UTIL_OFF_CPU_H */
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/6] perf record: Implement basic filtering for off-cpu
  2022-05-18 22:47 [RFC 0/6] perf record: Implement off-cpu profiling with BPF (v3) Namhyung Kim
  2022-05-18 22:47 ` [PATCH 1/6] perf report: Do not extend sample type of bpf-output event Namhyung Kim
  2022-05-18 22:47 ` [PATCH 2/6] perf record: Enable off-cpu analysis with BPF Namhyung Kim
@ 2022-05-18 22:47 ` Namhyung Kim
  2022-05-19  4:02   ` Ian Rogers
  2022-05-18 22:47 ` [PATCH 4/6] perf record: Handle argument change in sched_switch Namhyung Kim
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Namhyung Kim @ 2022-05-18 22:47 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, Milian Wolff, bpf, linux-perf-users,
	Blake Jones

It should honor the cpu and task filtering given by the -a, -C or
-p, -t options.
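
For example (an illustrative command line, not taken from this series),
something like below should limit the off-cpu profiling to a single
process:

  $ sudo perf record --off-cpu -p <pid> -- sleep 10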

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-record.c            |  2 +-
 tools/perf/util/bpf_off_cpu.c          | 78 +++++++++++++++++++++++---
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 52 +++++++++++++++--
 tools/perf/util/off_cpu.h              |  6 +-
 4 files changed, 123 insertions(+), 15 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 91f88501412e..7f60d2eac0b4 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -907,7 +907,7 @@ static int record__config_text_poke(struct evlist *evlist)
 
 static int record__config_off_cpu(struct record *rec)
 {
-	return off_cpu_prepare(rec->evlist);
+	return off_cpu_prepare(rec->evlist, &rec->opts.target);
 }
 
 static bool record__kcore_readable(struct machine *machine)
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 9ed7aca3f4ac..b5e2d038da50 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -6,6 +6,9 @@
 #include "util/off_cpu.h"
 #include "util/perf-hooks.h"
 #include "util/session.h"
+#include "util/target.h"
+#include "util/cpumap.h"
+#include "util/thread_map.h"
 #include <bpf/bpf.h>
 
 #include "bpf_skel/off_cpu.skel.h"
@@ -60,8 +63,23 @@ static int off_cpu_config(struct evlist *evlist)
 	return 0;
 }
 
-static void off_cpu_start(void *arg __maybe_unused)
+static void off_cpu_start(void *arg)
 {
+	struct evlist *evlist = arg;
+
+	/* update task filter for the given workload */
+	if (!skel->bss->has_cpu && !skel->bss->has_task &&
+	    perf_thread_map__pid(evlist->core.threads, 0) != -1) {
+		int fd;
+		u32 pid;
+		u8 val = 1;
+
+		skel->bss->has_task = 1;
+		fd = bpf_map__fd(skel->maps.task_filter);
+		pid = perf_thread_map__pid(evlist->core.threads, 0);
+		bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
+	}
+
 	skel->bss->enabled = 1;
 }
 
@@ -71,31 +89,75 @@ static void off_cpu_finish(void *arg __maybe_unused)
 	off_cpu_bpf__destroy(skel);
 }
 
-int off_cpu_prepare(struct evlist *evlist)
+int off_cpu_prepare(struct evlist *evlist, struct target *target)
 {
-	int err;
+	int err, fd, i;
+	int ncpus = 1, ntasks = 1;
 
 	if (off_cpu_config(evlist) < 0) {
 		pr_err("Failed to config off-cpu BPF event\n");
 		return -1;
 	}
 
-	set_max_rlimit();
-
-	skel = off_cpu_bpf__open_and_load();
+	skel = off_cpu_bpf__open();
 	if (!skel) {
 		pr_err("Failed to open off-cpu BPF skeleton\n");
 		return -1;
 	}
 
+	/* don't need to set cpu filter for system-wide mode */
+	if (target->cpu_list) {
+		ncpus = perf_cpu_map__nr(evlist->core.user_requested_cpus);
+		bpf_map__set_max_entries(skel->maps.cpu_filter, ncpus);
+	}
+
+	if (target__has_task(target)) {
+		ntasks = perf_thread_map__nr(evlist->core.threads);
+		bpf_map__set_max_entries(skel->maps.task_filter, ntasks);
+	}
+
+	set_max_rlimit();
+
+	err = off_cpu_bpf__load(skel);
+	if (err) {
+		pr_err("Failed to load off-cpu skeleton\n");
+		goto out;
+	}
+
+	if (target->cpu_list) {
+		u32 cpu;
+		u8 val = 1;
+
+		skel->bss->has_cpu = 1;
+		fd = bpf_map__fd(skel->maps.cpu_filter);
+
+		for (i = 0; i < ncpus; i++) {
+			cpu = perf_cpu_map__cpu(evlist->core.user_requested_cpus, i).cpu;
+			bpf_map_update_elem(fd, &cpu, &val, BPF_ANY);
+		}
+	}
+
+	if (target__has_task(target)) {
+		u32 pid;
+		u8 val = 1;
+
+		skel->bss->has_task = 1;
+		fd = bpf_map__fd(skel->maps.task_filter);
+
+		for (i = 0; i < ntasks; i++) {
+			pid = perf_thread_map__pid(evlist->core.threads, i);
+			bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
+		}
+	}
+
 	err = off_cpu_bpf__attach(skel);
 	if (err) {
 		pr_err("Failed to attach off-cpu BPF skeleton\n");
 		goto out;
 	}
 
-	if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
-	    perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
+	if (perf_hooks__set_hook("record_start", off_cpu_start, evlist) ||
+	    perf_hooks__set_hook("record_end", off_cpu_finish, evlist)) {
 		pr_err("Failed to attach off-cpu skeleton\n");
 		goto out;
 	}
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 5173ed882fdf..78cdcc8ff863 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -49,12 +49,28 @@ struct {
 	__uint(max_entries, MAX_ENTRIES);
 } off_cpu SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u8));
+	__uint(max_entries, 1);
+} cpu_filter SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u8));
+	__uint(max_entries, 1);
+} task_filter SEC(".maps");
+
 /* old kernel task_struct definition */
 struct task_struct___old {
 	long state;
 } __attribute__((preserve_access_index));
 
 int enabled = 0;
+int has_cpu = 0;
+int has_task = 0;
 
 /*
  * Old kernel used to call it task_struct->state and now it's '__state'.
@@ -74,6 +90,37 @@ static inline int get_task_state(struct task_struct *t)
 	return BPF_CORE_READ(t_old, state);
 }
 
+static inline int can_record(struct task_struct *t, int state)
+{
+	/* kernel threads don't have user stack */
+	if (t->flags & PF_KTHREAD)
+		return 0;
+
+	if (state != TASK_INTERRUPTIBLE &&
+	    state != TASK_UNINTERRUPTIBLE)
+		return 0;
+
+	if (has_cpu) {
+		__u32 cpu = bpf_get_smp_processor_id();
+		__u8 *ok;
+
+		ok = bpf_map_lookup_elem(&cpu_filter, &cpu);
+		if (!ok)
+			return 0;
+	}
+
+	if (has_task) {
+		__u8 *ok;
+		__u32 pid = t->pid;
+
+		ok = bpf_map_lookup_elem(&task_filter, &pid);
+		if (!ok)
+			return 0;
+	}
+
+	return 1;
+}
+
 SEC("tp_btf/sched_switch")
 int on_switch(u64 *ctx)
 {
@@ -92,10 +139,7 @@ int on_switch(u64 *ctx)
 
 	ts = bpf_ktime_get_ns();
 
-	if (prev->flags & PF_KTHREAD)
-		goto next;
-	if (state != TASK_INTERRUPTIBLE &&
-	    state != TASK_UNINTERRUPTIBLE)
+	if (!can_record(prev, state))
 		goto next;
 
 	stack_id = bpf_get_stackid(ctx, &stacks,
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
index 375d03c424ea..f47af0232e55 100644
--- a/tools/perf/util/off_cpu.h
+++ b/tools/perf/util/off_cpu.h
@@ -2,15 +2,17 @@
 #define PERF_UTIL_OFF_CPU_H
 
 struct evlist;
+struct target;
 struct perf_session;
 
 #define OFFCPU_EVENT  "offcpu-time"
 
 #ifdef HAVE_BPF_SKEL
-int off_cpu_prepare(struct evlist *evlist);
+int off_cpu_prepare(struct evlist *evlist, struct target *target);
 int off_cpu_write(struct perf_session *session);
 #else
-static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused)
+static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused,
+				  struct target *target __maybe_unused)
 {
 	return -1;
 }
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/6] perf record: Handle argument change in sched_switch
  2022-05-18 22:47 [RFC 0/6] perf record: Implement off-cpu profiling with BPF (v3) Namhyung Kim
                   ` (2 preceding siblings ...)
  2022-05-18 22:47 ` [PATCH 3/6] perf record: Implement basic filtering for off-cpu Namhyung Kim
@ 2022-05-18 22:47 ` Namhyung Kim
  2022-05-19  4:04   ` Ian Rogers
  2022-05-18 22:47 ` [PATCH 5/6] perf record: Add cgroup support for off-cpu profiling Namhyung Kim
  2022-05-18 22:47 ` [PATCH 6/6] perf test: Add a basic offcpu profiling test Namhyung Kim
  5 siblings, 1 reply; 20+ messages in thread
From: Namhyung Kim @ 2022-05-18 22:47 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, Milian Wolff, bpf, linux-perf-users,
	Blake Jones

Recently the sched_switch tracepoint added a new argument for
prev_state, but it's hard to handle the change in a BPF program.
Instead, we can check the function prototype in BTF before loading
the program to see which format it should use.
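
For reference, a rough sketch of the two argument layouts the BPF
program has to handle, based on how the ctx array is used in the patch
below:

  /* both formats: ctx[1] = prev task, ctx[2] = next task              */
  /* new format:   ctx[3] = prev_state, passed as the 4th argument     */
  /* old format:   prev_state is read from prev->__state (or the older */
  /*               'state' field) using BPF CO-RE                      */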

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/bpf_off_cpu.c          | 28 +++++++++++++++++++++
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 35 ++++++++++++++++++--------
 2 files changed, 52 insertions(+), 11 deletions(-)

diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index b5e2d038da50..874856c55101 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -89,6 +89,33 @@ static void off_cpu_finish(void *arg __maybe_unused)
 	off_cpu_bpf__destroy(skel);
 }
 
+/* v5.18 kernel added prev_state arg, so it needs to check the signature */
+static void check_sched_switch_args(void)
+{
+	const struct btf *btf = bpf_object__btf(skel->obj);
+	const struct btf_type *t1, *t2, *t3;
+	u32 type_id;
+
+	type_id = btf__find_by_name_kind(btf, "bpf_trace_sched_switch",
+					 BTF_KIND_TYPEDEF);
+	if ((s32)type_id < 0)
+		return;
+
+	t1 = btf__type_by_id(btf, type_id);
+	if (t1 == NULL)
+		return;
+
+	t2 = btf__type_by_id(btf, t1->type);
+	if (t2 == NULL || !btf_is_ptr(t2))
+		return;
+
+	t3 = btf__type_by_id(btf, t2->type);
+	if (t3 && btf_is_func_proto(t3) && btf_vlen(t3) == 4) {
+		/* new format: pass prev_state as 4th arg */
+		skel->rodata->has_prev_state = true;
+	}
+}
+
 int off_cpu_prepare(struct evlist *evlist, struct target *target)
 {
 	int err, fd, i;
@@ -117,6 +144,7 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target)
 	}
 
 	set_max_rlimit();
+	check_sched_switch_args();
 
 	err = off_cpu_bpf__load(skel);
 	if (err) {
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 78cdcc8ff863..986d7db6e75d 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -72,6 +72,8 @@ int enabled = 0;
 int has_cpu = 0;
 int has_task = 0;
 
+const volatile bool has_prev_state = false;
+
 /*
  * Old kernel used to call it task_struct->state and now it's '__state'.
  * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
@@ -121,22 +123,13 @@ static inline int can_record(struct task_struct *t, int state)
 	return 1;
 }
 
-SEC("tp_btf/sched_switch")
-int on_switch(u64 *ctx)
+static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
+			struct task_struct *next, int state)
 {
 	__u64 ts;
-	int state;
 	__u32 stack_id;
-	struct task_struct *prev, *next;
 	struct tstamp_data *pelem;
 
-	if (!enabled)
-		return 0;
-
-	prev = (struct task_struct *)ctx[1];
-	next = (struct task_struct *)ctx[2];
-	state = get_task_state(prev);
-
 	ts = bpf_ktime_get_ns();
 
 	if (!can_record(prev, state))
@@ -180,4 +173,24 @@ int on_switch(u64 *ctx)
 	return 0;
 }
 
+SEC("tp_btf/sched_switch")
+int on_switch(u64 *ctx)
+{
+	struct task_struct *prev, *next;
+	int prev_state;
+
+	if (!enabled)
+		return 0;
+
+	prev = (struct task_struct *)ctx[1];
+	next = (struct task_struct *)ctx[2];
+
+	if (has_prev_state)
+		prev_state = (int)ctx[3];
+	else
+		prev_state = get_task_state(prev);
+
+	return off_cpu_stat(ctx, prev, next, prev_state);
+}
+
 char LICENSE[] SEC("license") = "Dual BSD/GPL";
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 5/6] perf record: Add cgroup support for off-cpu profiling
  2022-05-18 22:47 [RFC 0/6] perf record: Implement off-cpu profiling with BPF (v3) Namhyung Kim
                   ` (3 preceding siblings ...)
  2022-05-18 22:47 ` [PATCH 4/6] perf record: Handle argument change in sched_switch Namhyung Kim
@ 2022-05-18 22:47 ` Namhyung Kim
  2022-05-19  4:07   ` Ian Rogers
  2022-05-18 22:47 ` [PATCH 6/6] perf test: Add a basic offcpu profiling test Namhyung Kim
  5 siblings, 1 reply; 20+ messages in thread
From: Namhyung Kim @ 2022-05-18 22:47 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, Milian Wolff, bpf, linux-perf-users,
	Blake Jones

This covers two different use cases.  The first one is cgroup
filtering given by the -G/--cgroup option, which limits the off-cpu
profiling to tasks in the given cgroups only.

The other use case is cgroup sampling, which is enabled by the
--all-cgroups option and adds PERF_SAMPLE_CGROUP to the sample_type
so that the cgroup id of the task is set in the sample data.

Example output:

  $ sudo perf record -a --off-cpu --all-cgroups sleep 1

  $ sudo perf report --stdio -s comm,cgroup --call-graph=no
  ...
  # Samples: 144  of event 'offcpu-time'
  # Event count (approx.): 48452045427
  #
  # Children      Self  Command          Cgroup
  # ........  ........  ...............  ..........................................
  #
      61.57%     5.60%  Chrome_ChildIOT  /user.slice/user-657345.slice/user@657345.service/app.slice/...
      29.51%     7.38%  Web Content      /user.slice/user-657345.slice/user@657345.service/app.slice/...
      17.48%     1.59%  Chrome_IOThread  /user.slice/user-657345.slice/user@657345.service/app.slice/...
      16.48%     4.12%  pipewire-pulse   /user.slice/user-657345.slice/user@657345.service/session.slice/...
      14.48%     2.07%  perf             /user.slice/user-657345.slice/user@657345.service/app.slice/...
      14.30%     7.15%  CompositorTileW  /user.slice/user-657345.slice/user@657345.service/app.slice/...
      13.33%     6.67%  Timer            /user.slice/user-657345.slice/user@657345.service/app.slice/...
  ...

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-record.c            |  2 +-
 tools/perf/util/bpf_off_cpu.c          | 48 ++++++++++++++++++++++++--
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 33 ++++++++++++++++++
 tools/perf/util/off_cpu.h              |  7 ++--
 4 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 7f60d2eac0b4..77fa21c2c69f 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -907,7 +907,7 @@ static int record__config_text_poke(struct evlist *evlist)
 
 static int record__config_off_cpu(struct record *rec)
 {
-	return off_cpu_prepare(rec->evlist, &rec->opts.target);
+	return off_cpu_prepare(rec->evlist, &rec->opts.target, &rec->opts);
 }
 
 static bool record__kcore_readable(struct machine *machine)
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 874856c55101..b73e84a02264 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -5,10 +5,12 @@
 #include "util/evlist.h"
 #include "util/off_cpu.h"
 #include "util/perf-hooks.h"
+#include "util/record.h"
 #include "util/session.h"
 #include "util/target.h"
 #include "util/cpumap.h"
 #include "util/thread_map.h"
+#include "util/cgroup.h"
 #include <bpf/bpf.h>
 
 #include "bpf_skel/off_cpu.skel.h"
@@ -24,6 +26,7 @@ struct off_cpu_key {
 	u32 tgid;
 	u32 stack_id;
 	u32 state;
+	u64 cgroup_id;
 };
 
 union off_cpu_data {
@@ -116,10 +119,11 @@ static void check_sched_switch_args(void)
 	}
 }
 
-int off_cpu_prepare(struct evlist *evlist, struct target *target)
+int off_cpu_prepare(struct evlist *evlist, struct target *target,
+		    struct record_opts *opts)
 {
 	int err, fd, i;
-	int ncpus = 1, ntasks = 1;
+	int ncpus = 1, ntasks = 1, ncgrps = 1;
 
 	if (off_cpu_config(evlist) < 0) {
 		pr_err("Failed to config off-cpu BPF event\n");
@@ -143,6 +147,21 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target)
 		bpf_map__set_max_entries(skel->maps.task_filter, ntasks);
 	}
 
+	if (evlist__first(evlist)->cgrp) {
+		ncgrps = evlist->core.nr_entries - 1; /* excluding a dummy */
+		bpf_map__set_max_entries(skel->maps.cgroup_filter, ncgrps);
+
+		if (!cgroup_is_v2("perf_event"))
+			skel->rodata->uses_cgroup_v1 = true;
+	}
+
+	if (opts->record_cgroup) {
+		skel->rodata->needs_cgroup = true;
+
+		if (!cgroup_is_v2("perf_event"))
+			skel->rodata->uses_cgroup_v1 = true;
+	}
+
 	set_max_rlimit();
 	check_sched_switch_args();
 
@@ -178,6 +197,29 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target)
 		}
 	}
 
+	if (evlist__first(evlist)->cgrp) {
+		struct evsel *evsel;
+		u8 val = 1;
+
+		skel->bss->has_cgroup = 1;
+		fd = bpf_map__fd(skel->maps.cgroup_filter);
+
+		evlist__for_each_entry(evlist, evsel) {
+			struct cgroup *cgrp = evsel->cgrp;
+
+			if (cgrp == NULL)
+				continue;
+
+			if (!cgrp->id && read_cgroup_id(cgrp) < 0) {
+				pr_err("Failed to read cgroup id of %s\n",
+				       cgrp->name);
+				goto out;
+			}
+
+			bpf_map_update_elem(fd, &cgrp->id, &val, BPF_ANY);
+		}
+	}
+
 	err = off_cpu_bpf__attach(skel);
 	if (err) {
 		pr_err("Failed to attach off-cpu BPF skeleton\n");
@@ -275,6 +317,8 @@ int off_cpu_write(struct perf_session *session)
 			/* calculate sample callchain data array length */
 			n += len + 2;
 		}
+		if (sample_type & PERF_SAMPLE_CGROUP)
+			data.array[n++] = key.cgroup_id;
 		/* TODO: handle more sample types */
 
 		size = n * sizeof(u64);
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 986d7db6e75d..792ae2847080 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -26,6 +26,7 @@ struct offcpu_key {
 	__u32 tgid;
 	__u32 stack_id;
 	__u32 state;
+	__u64 cgroup_id;
 };
 
 struct {
@@ -63,6 +64,13 @@ struct {
 	__uint(max_entries, 1);
 } task_filter SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u64));
+	__uint(value_size, sizeof(__u8));
+	__uint(max_entries, 1);
+} cgroup_filter SEC(".maps");
+
 /* old kernel task_struct definition */
 struct task_struct___old {
 	long state;
@@ -71,8 +79,11 @@ struct task_struct___old {
 int enabled = 0;
 int has_cpu = 0;
 int has_task = 0;
+int has_cgroup = 0;
 
 const volatile bool has_prev_state = false;
+const volatile bool needs_cgroup = false;
+const volatile bool uses_cgroup_v1 = false;
 
 /*
  * Old kernel used to call it task_struct->state and now it's '__state'.
@@ -92,6 +103,18 @@ static inline int get_task_state(struct task_struct *t)
 	return BPF_CORE_READ(t_old, state);
 }
 
+static inline __u64 get_cgroup_id(struct task_struct *t)
+{
+	struct cgroup *cgrp;
+
+	if (uses_cgroup_v1)
+		cgrp = BPF_CORE_READ(t, cgroups, subsys[perf_event_cgrp_id], cgroup);
+	else
+		cgrp = BPF_CORE_READ(t, cgroups, dfl_cgrp);
+
+	return BPF_CORE_READ(cgrp, kn, id);
+}
+
 static inline int can_record(struct task_struct *t, int state)
 {
 	/* kernel threads don't have user stack */
@@ -120,6 +143,15 @@ static inline int can_record(struct task_struct *t, int state)
 			return 0;
 	}
 
+	if (has_cgroup) {
+		__u8 *ok;
+		__u64 cgrp_id = get_cgroup_id(t);
+
+		ok = bpf_map_lookup_elem(&cgroup_filter, &cgrp_id);
+		if (!ok)
+			return 0;
+	}
+
 	return 1;
 }
 
@@ -156,6 +188,7 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
 			.tgid = next->tgid,
 			.stack_id = pelem->stack_id,
 			.state = pelem->state,
+			.cgroup_id = needs_cgroup ? get_cgroup_id(next) : 0,
 		};
 		__u64 delta = ts - pelem->timestamp;
 		__u64 *total;
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
index f47af0232e55..548008f74d42 100644
--- a/tools/perf/util/off_cpu.h
+++ b/tools/perf/util/off_cpu.h
@@ -4,15 +4,18 @@
 struct evlist;
 struct target;
 struct perf_session;
+struct record_opts;
 
 #define OFFCPU_EVENT  "offcpu-time"
 
 #ifdef HAVE_BPF_SKEL
-int off_cpu_prepare(struct evlist *evlist, struct target *target);
+int off_cpu_prepare(struct evlist *evlist, struct target *target,
+		    struct record_opts *opts);
 int off_cpu_write(struct perf_session *session);
 #else
 static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused,
-				  struct target *target __maybe_unused)
+				  struct target *target __maybe_unused,
+				  struct record_opts *opts __maybe_unused)
 {
 	return -1;
 }
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 6/6] perf test: Add a basic offcpu profiling test
  2022-05-18 22:47 [RFC 0/6] perf record: Implement off-cpu profiling with BPF (v3) Namhyung Kim
                   ` (4 preceding siblings ...)
  2022-05-18 22:47 ` [PATCH 5/6] perf record: Add cgroup support for off-cpu profiling Namhyung Kim
@ 2022-05-18 22:47 ` Namhyung Kim
  2022-05-19  4:08   ` Ian Rogers
  5 siblings, 1 reply; 20+ messages in thread
From: Namhyung Kim @ 2022-05-18 22:47 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, Milian Wolff, bpf, linux-perf-users,
	Blake Jones

  $ sudo ./perf test -v offcpu
   88: perf record offcpu profiling tests                              :
  --- start ---
  test child forked, pid 685966
  Basic off-cpu test
  Basic off-cpu test [Success]
  test child finished with 0
  ---- end ----
  perf record offcpu profiling tests: Ok

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/tests/shell/record_offcpu.sh | 60 +++++++++++++++++++++++++
 1 file changed, 60 insertions(+)
 create mode 100755 tools/perf/tests/shell/record_offcpu.sh

diff --git a/tools/perf/tests/shell/record_offcpu.sh b/tools/perf/tests/shell/record_offcpu.sh
new file mode 100755
index 000000000000..96e0739f7478
--- /dev/null
+++ b/tools/perf/tests/shell/record_offcpu.sh
@@ -0,0 +1,60 @@
+#!/bin/sh
+# perf record offcpu profiling tests
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+err=0
+perfdata=$(mktemp /tmp/__perf_test.perf.data.XXXXX)
+
+cleanup() {
+  rm -f ${perfdata}
+  rm -f ${perfdata}.old
+  trap - exit term int
+}
+
+trap_cleanup() {
+  cleanup
+  exit 1
+}
+trap trap_cleanup exit term int
+
+test_offcpu() {
+  echo "Basic off-cpu test"
+  if [ `id -u` != 0 ]
+  then
+    echo "Basic off-cpu test [Skipped permission]"
+    err=2
+    return
+  fi
+  if perf record --off-cpu -o ${perfdata} --quiet true 2>&1 | grep BUILD_BPF_SKEL
+  then
+    echo "Basic off-cpu test [Skipped missing BPF support]"
+    err=2
+    return
+  fi
+  if ! perf record --off-cpu -e dummy -o ${perfdata} sleep 1 2> /dev/null
+  then
+    echo "Basic off-cpu test [Failed record]"
+    err=1
+    return
+  fi
+  if ! perf evlist -i ${perfdata} | grep -q "offcpu-time"
+  then
+    echo "Basic off-cpu test [Failed record]"
+    err=1
+    return
+  fi
+  if ! perf report -i ${perfdata} -q --percent-limit=90 | egrep -q sleep
+  then
+    echo "Basic off-cpu test [Failed missing output]"
+    err=1
+    return
+  fi
+  echo "Basic off-cpu test [Success]"
+}
+
+test_offcpu
+
+cleanup
+exit $err
-- 
2.36.1.124.g0e6072fb45-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/6] perf record: Enable off-cpu analysis with BPF
  2022-05-18 22:47 ` [PATCH 2/6] perf record: Enable off-cpu analysis with BPF Namhyung Kim
@ 2022-05-19  3:58   ` Ian Rogers
  2022-05-19 21:01     ` Namhyung Kim
  2022-06-01  0:00   ` Ian Rogers
  1 sibling, 1 reply; 20+ messages in thread
From: Ian Rogers @ 2022-05-19  3:58 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Song Liu, Hao Luo, Milian Wolff, bpf,
	linux-perf-users, Blake Jones

On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Add --off-cpu option to enable the off-cpu profiling with BPF.  It'd
> use a bpf_output event and rename it to "offcpu-time".  Samples will
> be synthesized at the end of the record session using data from a BPF
> map which contains the aggregated off-cpu time at context switches.
> So it needs root privilege to get the off-cpu profiling.
>
> Each sample will have a separate user stacktrace so it will skip
> kernel threads.  The sample ip will be set from the stacktrace and
> other sample data will be updated accordingly.  Currently it only
> handles some basic sample types.
>
> The sample timestamp is set to a dummy value just not to bother with
> other events during the sorting.  So it has a very big initial value
> and increase it on processing each samples.
>
> Good thing is that it can be used together with regular profiling like
> cpu cycles.  If you don't want to that, you can use a dummy event to
> enable off-cpu profiling only.
>
> Example output:
>   $ sudo perf record --off-cpu perf bench sched messaging -l 1000
>
>   $ sudo perf report --stdio --call-graph=no
>   # Total Lost Samples: 0
>   #
>   # Samples: 41K of event 'cycles'
>   # Event count (approx.): 42137343851
>   ...
>
>   # Samples: 1K of event 'offcpu-time'
>   # Event count (approx.): 587990831640
>   #
>   # Children      Self  Command          Shared Object       Symbol
>   # ........  ........  ...............  ..................  .........................
>   #
>       81.66%     0.00%  sched-messaging  libc-2.33.so        [.] __libc_start_main
>       81.66%     0.00%  sched-messaging  perf                [.] cmd_bench
>       81.66%     0.00%  sched-messaging  perf                [.] main
>       81.66%     0.00%  sched-messaging  perf                [.] run_builtin
>       81.43%     0.00%  sched-messaging  perf                [.] bench_sched_messaging
>       40.86%    40.86%  sched-messaging  libpthread-2.33.so  [.] __read
>       37.66%    37.66%  sched-messaging  libpthread-2.33.so  [.] __write
>        2.91%     2.91%  sched-messaging  libc-2.33.so        [.] __poll
>   ...
>
> As you can see it spent most of off-cpu time in read and write in
> bench_sched_messaging().  The --call-graph=no was added just to make
> the output concise here.
>
> It uses perf hooks facility to control BPF program during the record
> session rather than adding new BPF/off-cpu specific calls.
>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Acked-by: Ian Rogers <irogers@google.com>

> ---
> [...]
> +
> +       if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
> +           perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {

An off-topic thought here. I was looking at tool events and thinking
that rather than have global state and the counter reading getting
those global values, it would be nice if the tool events had private
state that was created on open, freed on close and then did the
appropriate thing for enable/read/disable. If we were object oriented
I was thinking tool events could be a subclass of evsel that would
override the appropriate functions. Something that strikes me as silly
is that tool events have a dummy event file descriptor from
perf_event_open, in part because the evsel code paths are shared.
Anyway, if we made evsels have something like a virtual method table,
we could do similar tricks with off-cpu and BPF created events.
Possibly this could lead to better structured code and more reuse.
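Very roughly, the shape I have in mind (just a sketch, nothing below
exists in perf today and all of the names are made up):

  /* hypothetical per-evsel operation table */
  struct evsel;

  struct evsel_ops {
          int  (*open)(struct evsel *evsel);    /* allocate private state */
          int  (*enable)(struct evsel *evsel);
          int  (*read)(struct evsel *evsel, unsigned long long *count);
          int  (*disable)(struct evsel *evsel);
          void (*close)(struct evsel *evsel);   /* free private state */
  };

  struct evsel {
          const struct evsel_ops *ops;  /* NULL: regular perf_event_open() path */
          void                   *priv; /* private state for tool/BPF events */
          /* ... */
  };

  static int evsel__read_one(struct evsel *evsel, unsigned long long *count)
  {
          /* tool and BPF-created events dispatch through the table... */
          if (evsel->ops && evsel->ops->read)
                  return evsel->ops->read(evsel, count);

          /* ...everything else keeps reading its perf_event_open() fd */
          return -1;
  }

Then tool events wouldn't need the dummy file descriptor and the
offcpu-time event could hide its BPF setup behind the same interface.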

Thanks,
Ian

> +               pr_err("Failed to attach off-cpu skeleton\n");
> +               goto out;
> +       }
> +
> +       return 0;
> +
> +out:
> +       off_cpu_bpf__destroy(skel);
> +       return -1;
> +}
> +
> +int off_cpu_write(struct perf_session *session)
> +{
> +       int bytes = 0, size;
> +       int fd, stack;
> +       u64 sample_type, val, sid = 0;
> +       struct evsel *evsel;
> +       struct perf_data_file *file = &session->data->file;
> +       struct off_cpu_key prev, key;
> +       union off_cpu_data data = {
> +               .hdr = {
> +                       .type = PERF_RECORD_SAMPLE,
> +                       .misc = PERF_RECORD_MISC_USER,
> +               },
> +       };
> +       u64 tstamp = OFF_CPU_TIMESTAMP;
> +
> +       skel->bss->enabled = 0;
> +
> +       evsel = evlist__find_evsel_by_str(session->evlist, OFFCPU_EVENT);
> +       if (evsel == NULL) {
> +               pr_err("%s evsel not found\n", OFFCPU_EVENT);
> +               return 0;
> +       }
> +
> +       sample_type = evsel->core.attr.sample_type;
> +
> +       if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
> +               if (evsel->core.id)
> +                       sid = evsel->core.id[0];
> +       }
> +
> +       fd = bpf_map__fd(skel->maps.off_cpu);
> +       stack = bpf_map__fd(skel->maps.stacks);
> +       memset(&prev, 0, sizeof(prev));
> +
> +       while (!bpf_map_get_next_key(fd, &prev, &key)) {
> +               int n = 1;  /* start from perf_event_header */
> +               int ip_pos = -1;
> +
> +               bpf_map_lookup_elem(fd, &key, &val);
> +
> +               if (sample_type & PERF_SAMPLE_IDENTIFIER)
> +                       data.array[n++] = sid;
> +               if (sample_type & PERF_SAMPLE_IP) {
> +                       ip_pos = n;
> +                       data.array[n++] = 0;  /* will be updated */
> +               }
> +               if (sample_type & PERF_SAMPLE_TID)
> +                       data.array[n++] = (u64)key.pid << 32 | key.tgid;
> +               if (sample_type & PERF_SAMPLE_TIME)
> +                       data.array[n++] = tstamp;
> +               if (sample_type & PERF_SAMPLE_ID)
> +                       data.array[n++] = sid;
> +               if (sample_type & PERF_SAMPLE_CPU)
> +                       data.array[n++] = 0;
> +               if (sample_type & PERF_SAMPLE_PERIOD)
> +                       data.array[n++] = val;
> +               if (sample_type & PERF_SAMPLE_CALLCHAIN) {
> +                       int len = 0;
> +
> +                       /* data.array[n] is callchain->nr (updated later) */
> +                       data.array[n + 1] = PERF_CONTEXT_USER;
> +                       data.array[n + 2] = 0;
> +
> +                       bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
> +                       while (data.array[n + 2 + len])
> +                               len++;
> +
> +                       /* update length of callchain */
> +                       data.array[n] = len + 1;
> +
> +                       /* update sample ip with the first callchain entry */
> +                       if (ip_pos >= 0)
> +                               data.array[ip_pos] = data.array[n + 2];
> +
> +                       /* calculate sample callchain data array length */
> +                       n += len + 2;
> +               }
> +               /* TODO: handle more sample types */
> +
> +               size = n * sizeof(u64);
> +               data.hdr.size = size;
> +               bytes += size;
> +
> +               if (perf_data_file__write(file, &data, size) < 0) {
> +                       pr_err("failed to write perf data, error: %m\n");
> +                       return bytes;
> +               }
> +
> +               prev = key;
> +               /* increase dummy timestamp to sort later samples */
> +               tstamp++;
> +       }
> +       return bytes;
> +}
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> new file mode 100644
> index 000000000000..5173ed882fdf
> --- /dev/null
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -0,0 +1,139 @@
> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +// Copyright (c) 2022 Google
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_core_read.h>
> +
> +/* task->flags for off-cpu analysis */
> +#define PF_KTHREAD   0x00200000  /* I am a kernel thread */
> +
> +/* task->state for off-cpu analysis */
> +#define TASK_INTERRUPTIBLE     0x0001
> +#define TASK_UNINTERRUPTIBLE   0x0002
> +
> +#define MAX_STACKS   32
> +#define MAX_ENTRIES  102400
> +
> +struct tstamp_data {
> +       __u32 stack_id;
> +       __u32 state;
> +       __u64 timestamp;
> +};
> +
> +struct offcpu_key {
> +       __u32 pid;
> +       __u32 tgid;
> +       __u32 stack_id;
> +       __u32 state;
> +};
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_STACK_TRACE);
> +       __uint(key_size, sizeof(__u32));
> +       __uint(value_size, MAX_STACKS * sizeof(__u64));
> +       __uint(max_entries, MAX_ENTRIES);
> +} stacks SEC(".maps");
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> +       __uint(map_flags, BPF_F_NO_PREALLOC);
> +       __type(key, int);
> +       __type(value, struct tstamp_data);
> +} tstamp SEC(".maps");
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_HASH);
> +       __uint(key_size, sizeof(struct offcpu_key));
> +       __uint(value_size, sizeof(__u64));
> +       __uint(max_entries, MAX_ENTRIES);
> +} off_cpu SEC(".maps");
> +
> +/* old kernel task_struct definition */
> +struct task_struct___old {
> +       long state;
> +} __attribute__((preserve_access_index));
> +
> +int enabled = 0;
> +
> +/*
> + * Old kernel used to call it task_struct->state and now it's '__state'.
> + * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
> + *
> + * https://nakryiko.com/posts/bpf-core-reference-guide/#handling-incompatible-field-and-type-changes
> + */
> +static inline int get_task_state(struct task_struct *t)
> +{
> +       if (bpf_core_field_exists(t->__state))
> +               return BPF_CORE_READ(t, __state);
> +
> +       /* recast pointer to capture task_struct___old type for compiler */
> +       struct task_struct___old *t_old = (void *)t;
> +
> +       /* now use old "state" name of the field */
> +       return BPF_CORE_READ(t_old, state);
> +}
> +
> +SEC("tp_btf/sched_switch")
> +int on_switch(u64 *ctx)
> +{
> +       __u64 ts;
> +       int state;
> +       __u32 stack_id;
> +       struct task_struct *prev, *next;
> +       struct tstamp_data *pelem;
> +
> +       if (!enabled)
> +               return 0;
> +
> +       prev = (struct task_struct *)ctx[1];
> +       next = (struct task_struct *)ctx[2];
> +       state = get_task_state(prev);
> +
> +       ts = bpf_ktime_get_ns();
> +
> +       if (prev->flags & PF_KTHREAD)
> +               goto next;
> +       if (state != TASK_INTERRUPTIBLE &&
> +           state != TASK_UNINTERRUPTIBLE)
> +               goto next;
> +
> +       stack_id = bpf_get_stackid(ctx, &stacks,
> +                                  BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
> +
> +       pelem = bpf_task_storage_get(&tstamp, prev, NULL,
> +                                    BPF_LOCAL_STORAGE_GET_F_CREATE);
> +       if (!pelem)
> +               goto next;
> +
> +       pelem->timestamp = ts;
> +       pelem->state = state;
> +       pelem->stack_id = stack_id;
> +
> +next:
> +       pelem = bpf_task_storage_get(&tstamp, next, NULL, 0);
> +
> +       if (pelem && pelem->timestamp) {
> +               struct offcpu_key key = {
> +                       .pid = next->pid,
> +                       .tgid = next->tgid,
> +                       .stack_id = pelem->stack_id,
> +                       .state = pelem->state,
> +               };
> +               __u64 delta = ts - pelem->timestamp;
> +               __u64 *total;
> +
> +               total = bpf_map_lookup_elem(&off_cpu, &key);
> +               if (total)
> +                       *total += delta;
> +               else
> +                       bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
> +
> +               /* prevent the timestamp from being reused later */
> +               pelem->timestamp = 0;
> +       }
> +
> +       return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
> new file mode 100644
> index 000000000000..375d03c424ea
> --- /dev/null
> +++ b/tools/perf/util/off_cpu.h
> @@ -0,0 +1,24 @@
> +#ifndef PERF_UTIL_OFF_CPU_H
> +#define PERF_UTIL_OFF_CPU_H
> +
> +struct evlist;
> +struct perf_session;
> +
> +#define OFFCPU_EVENT  "offcpu-time"
> +
> +#ifdef HAVE_BPF_SKEL
> +int off_cpu_prepare(struct evlist *evlist);
> +int off_cpu_write(struct perf_session *session);
> +#else
> +static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused)
> +{
> +       return -1;
> +}
> +
> +static inline int off_cpu_write(struct perf_session *session __maybe_unused)
> +{
> +       return -1;
> +}
> +#endif
> +
> +#endif  /* PERF_UTIL_OFF_CPU_H */
> --
> 2.36.1.124.g0e6072fb45-goog
>


* Re: [PATCH 3/6] perf record: Implement basic filtering for off-cpu
  2022-05-18 22:47 ` [PATCH 3/6] perf record: Implement basic filtering for off-cpu Namhyung Kim
@ 2022-05-19  4:02   ` Ian Rogers
  2022-05-19 21:02     ` Namhyung Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Ian Rogers @ 2022-05-19  4:02 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Song Liu, Hao Luo, Milian Wolff, bpf,
	linux-perf-users, Blake Jones

On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> It should honor cpu and task filtering with -a, -C or -p, -t options.
>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
>  tools/perf/builtin-record.c            |  2 +-
>  tools/perf/util/bpf_off_cpu.c          | 78 +++++++++++++++++++++++---
>  tools/perf/util/bpf_skel/off_cpu.bpf.c | 52 +++++++++++++++--
>  tools/perf/util/off_cpu.h              |  6 +-
>  4 files changed, 123 insertions(+), 15 deletions(-)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 91f88501412e..7f60d2eac0b4 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -907,7 +907,7 @@ static int record__config_text_poke(struct evlist *evlist)
>
>  static int record__config_off_cpu(struct record *rec)
>  {
> -       return off_cpu_prepare(rec->evlist);
> +       return off_cpu_prepare(rec->evlist, &rec->opts.target);
>  }
>
>  static bool record__kcore_readable(struct machine *machine)
> diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> index 9ed7aca3f4ac..b5e2d038da50 100644
> --- a/tools/perf/util/bpf_off_cpu.c
> +++ b/tools/perf/util/bpf_off_cpu.c
> @@ -6,6 +6,9 @@
>  #include "util/off_cpu.h"
>  #include "util/perf-hooks.h"
>  #include "util/session.h"
> +#include "util/target.h"
> +#include "util/cpumap.h"
> +#include "util/thread_map.h"
>  #include <bpf/bpf.h>
>
>  #include "bpf_skel/off_cpu.skel.h"
> @@ -60,8 +63,23 @@ static int off_cpu_config(struct evlist *evlist)
>         return 0;
>  }
>
> -static void off_cpu_start(void *arg __maybe_unused)
> +static void off_cpu_start(void *arg)
>  {
> +       struct evlist *evlist = arg;
> +
> +       /* update task filter for the given workload */
> +       if (!skel->bss->has_cpu && !skel->bss->has_task &&
> +           perf_thread_map__pid(evlist->core.threads, 0) != -1) {
> +               int fd;
> +               u32 pid;
> +               u8 val = 1;
> +
> +               skel->bss->has_task = 1;
> +               fd = bpf_map__fd(skel->maps.task_filter);
> +               pid = perf_thread_map__pid(evlist->core.threads, 0);
> +               bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
> +       }
> +
>         skel->bss->enabled = 1;
>  }
>
> @@ -71,31 +89,75 @@ static void off_cpu_finish(void *arg __maybe_unused)
>         off_cpu_bpf__destroy(skel);
>  }
>
> -int off_cpu_prepare(struct evlist *evlist)
> +int off_cpu_prepare(struct evlist *evlist, struct target *target)
>  {
> -       int err;
> +       int err, fd, i;
> +       int ncpus = 1, ntasks = 1;
>
>         if (off_cpu_config(evlist) < 0) {
>                 pr_err("Failed to config off-cpu BPF event\n");
>                 return -1;
>         }
>
> -       set_max_rlimit();
> -
> -       skel = off_cpu_bpf__open_and_load();
> +       skel = off_cpu_bpf__open();
>         if (!skel) {
>                 pr_err("Failed to open off-cpu BPF skeleton\n");
>                 return -1;
>         }
>
> +       /* don't need to set cpu filter for system-wide mode */
> +       if (target->cpu_list) {
> +               ncpus = perf_cpu_map__nr(evlist->core.user_requested_cpus);
> +               bpf_map__set_max_entries(skel->maps.cpu_filter, ncpus);
> +       }
> +
> +       if (target__has_task(target)) {
> +               ntasks = perf_thread_map__nr(evlist->core.threads);
> +               bpf_map__set_max_entries(skel->maps.task_filter, ntasks);
> +       }
> +
> +       set_max_rlimit();
> +
> +       err = off_cpu_bpf__load(skel);
> +       if (err) {
> +               pr_err("Failed to load off-cpu skeleton\n");
> +               goto out;
> +       }
> +
> +       if (target->cpu_list) {
> +               u32 cpu;
> +               u8 val = 1;
> +
> +               skel->bss->has_cpu = 1;
> +               fd = bpf_map__fd(skel->maps.cpu_filter);
> +
> +               for (i = 0; i < ncpus; i++) {
> +                       cpu = perf_cpu_map__cpu(evlist->core.user_requested_cpus, i).cpu;
> +                       bpf_map_update_elem(fd, &cpu, &val, BPF_ANY);

Perhaps more concise with a for_each:

perf_cpu_map__for_each_cpu(cpu, idx, evlist->core.user_requested_cpus)
  bpf_map_update_elem(fd, &cpu.cpu, &val, BPF_ANY);

> +               }
> +       }
> +
> +       if (target__has_task(target)) {
> +               u32 pid;
> +               u8 val = 1;
> +
> +               skel->bss->has_task = 1;
> +               fd = bpf_map__fd(skel->maps.task_filter);
> +
> +               for (i = 0; i < ntasks; i++) {
> +                       pid = perf_thread_map__pid(evlist->core.threads, i);
> +                       bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
> +               }
> +       }
> +
>         err = off_cpu_bpf__attach(skel);
>         if (err) {
>                 pr_err("Failed to attach off-cpu BPF skeleton\n");
>                 goto out;
>         }
>
> -       if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
> -           perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
> +       if (perf_hooks__set_hook("record_start", off_cpu_start, evlist) ||
> +           perf_hooks__set_hook("record_end", off_cpu_finish, evlist)) {
>                 pr_err("Failed to attach off-cpu skeleton\n");
>                 goto out;
>         }
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> index 5173ed882fdf..78cdcc8ff863 100644
> --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -49,12 +49,28 @@ struct {
>         __uint(max_entries, MAX_ENTRIES);
>  } off_cpu SEC(".maps");
>
> +struct {
> +       __uint(type, BPF_MAP_TYPE_HASH);
> +       __uint(key_size, sizeof(__u32));
> +       __uint(value_size, sizeof(__u8));
> +       __uint(max_entries, 1);
> +} cpu_filter SEC(".maps");
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_HASH);
> +       __uint(key_size, sizeof(__u32));
> +       __uint(value_size, sizeof(__u8));
> +       __uint(max_entries, 1);
> +} task_filter SEC(".maps");
> +
>  /* old kernel task_struct definition */
>  struct task_struct___old {
>         long state;
>  } __attribute__((preserve_access_index));
>
>  int enabled = 0;
> +int has_cpu = 0;
> +int has_task = 0;
>
>  /*
>   * Old kernel used to call it task_struct->state and now it's '__state'.
> @@ -74,6 +90,37 @@ static inline int get_task_state(struct task_struct *t)
>         return BPF_CORE_READ(t_old, state);
>  }
>
> +static inline int can_record(struct task_struct *t, int state)
> +{
> +       /* kernel threads don't have user stack */
> +       if (t->flags & PF_KTHREAD)
> +               return 0;
> +
> +       if (state != TASK_INTERRUPTIBLE &&
> +           state != TASK_UNINTERRUPTIBLE)
> +               return 0;
> +
> +       if (has_cpu) {
> +               __u32 cpu = bpf_get_smp_processor_id();
> +               __u8 *ok;
> +
> +               ok = bpf_map_lookup_elem(&cpu_filter, &cpu);
> +               if (!ok)
> +                       return 0;
> +       }
> +
> +       if (has_task) {
> +               __u8 *ok;
> +               __u32 pid = t->pid;
> +
> +               ok = bpf_map_lookup_elem(&task_filter, &pid);
> +               if (!ok)
> +                       return 0;
> +       }
> +
> +       return 1;
> +}
> +
>  SEC("tp_btf/sched_switch")
>  int on_switch(u64 *ctx)
>  {
> @@ -92,10 +139,7 @@ int on_switch(u64 *ctx)
>
>         ts = bpf_ktime_get_ns();
>
> -       if (prev->flags & PF_KTHREAD)
> -               goto next;
> -       if (state != TASK_INTERRUPTIBLE &&
> -           state != TASK_UNINTERRUPTIBLE)
> +       if (!can_record(prev, state))
>                 goto next;
>
>         stack_id = bpf_get_stackid(ctx, &stacks,
> diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
> index 375d03c424ea..f47af0232e55 100644
> --- a/tools/perf/util/off_cpu.h
> +++ b/tools/perf/util/off_cpu.h
> @@ -2,15 +2,17 @@
>  #define PERF_UTIL_OFF_CPU_H
>
>  struct evlist;
> +struct target;
>  struct perf_session;
>
>  #define OFFCPU_EVENT  "offcpu-time"
>
>  #ifdef HAVE_BPF_SKEL
> -int off_cpu_prepare(struct evlist *evlist);
> +int off_cpu_prepare(struct evlist *evlist, struct target *target);
>  int off_cpu_write(struct perf_session *session);
>  #else
> -static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused)
> +static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused,
> +                                 struct target *target __maybe_unused)
>  {
>         return -1;
>  }
> --
> 2.36.1.124.g0e6072fb45-goog
>


* Re: [PATCH 4/6] perf record: Handle argument change in sched_switch
  2022-05-18 22:47 ` [PATCH 4/6] perf record: Handle argument change in sched_switch Namhyung Kim
@ 2022-05-19  4:04   ` Ian Rogers
  0 siblings, 0 replies; 20+ messages in thread
From: Ian Rogers @ 2022-05-19  4:04 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Song Liu, Hao Luo, Milian Wolff, bpf,
	linux-perf-users, Blake Jones

On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Recently the sched_switch tracepoint added a new argument for prev_state,
> but it's hard to handle the change in a BPF program.  Instead, we can
> check the function prototype in BTF before loading the program.
>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

> ---
>  tools/perf/util/bpf_off_cpu.c          | 28 +++++++++++++++++++++
>  tools/perf/util/bpf_skel/off_cpu.bpf.c | 35 ++++++++++++++++++--------
>  2 files changed, 52 insertions(+), 11 deletions(-)
>
> diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> index b5e2d038da50..874856c55101 100644
> --- a/tools/perf/util/bpf_off_cpu.c
> +++ b/tools/perf/util/bpf_off_cpu.c
> @@ -89,6 +89,33 @@ static void off_cpu_finish(void *arg __maybe_unused)
>         off_cpu_bpf__destroy(skel);
>  }
>
> +/* v5.18 kernel added prev_state arg, so it needs to check the signature */
> +static void check_sched_switch_args(void)
> +{
> +       const struct btf *btf = bpf_object__btf(skel->obj);
> +       const struct btf_type *t1, *t2, *t3;
> +       u32 type_id;
> +
> +       type_id = btf__find_by_name_kind(btf, "bpf_trace_sched_switch",
> +                                        BTF_KIND_TYPEDEF);
> +       if ((s32)type_id < 0)
> +               return;
> +
> +       t1 = btf__type_by_id(btf, type_id);
> +       if (t1 == NULL)
> +               return;
> +
> +       t2 = btf__type_by_id(btf, t1->type);
> +       if (t2 == NULL || !btf_is_ptr(t2))
> +               return;
> +
> +       t3 = btf__type_by_id(btf, t2->type);
> +       if (t3 && btf_is_func_proto(t3) && btf_vlen(t3) == 4) {
> +               /* new format: pass prev_state as 4th arg */
> +               skel->rodata->has_prev_state = true;
> +       }
> +}
> +
>  int off_cpu_prepare(struct evlist *evlist, struct target *target)
>  {
>         int err, fd, i;
> @@ -117,6 +144,7 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target)
>         }
>
>         set_max_rlimit();
> +       check_sched_switch_args();
>
>         err = off_cpu_bpf__load(skel);
>         if (err) {
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> index 78cdcc8ff863..986d7db6e75d 100644
> --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -72,6 +72,8 @@ int enabled = 0;
>  int has_cpu = 0;
>  int has_task = 0;
>
> +const volatile bool has_prev_state = false;
> +
>  /*
>   * Old kernel used to call it task_struct->state and now it's '__state'.
>   * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
> @@ -121,22 +123,13 @@ static inline int can_record(struct task_struct *t, int state)
>         return 1;
>  }
>
> -SEC("tp_btf/sched_switch")
> -int on_switch(u64 *ctx)
> +static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
> +                       struct task_struct *next, int state)
>  {
>         __u64 ts;
> -       int state;
>         __u32 stack_id;
> -       struct task_struct *prev, *next;
>         struct tstamp_data *pelem;
>
> -       if (!enabled)
> -               return 0;
> -
> -       prev = (struct task_struct *)ctx[1];
> -       next = (struct task_struct *)ctx[2];
> -       state = get_task_state(prev);
> -
>         ts = bpf_ktime_get_ns();
>
>         if (!can_record(prev, state))
> @@ -180,4 +173,24 @@ int on_switch(u64 *ctx)
>         return 0;
>  }
>
> +SEC("tp_btf/sched_switch")
> +int on_switch(u64 *ctx)
> +{
> +       struct task_struct *prev, *next;
> +       int prev_state;
> +
> +       if (!enabled)
> +               return 0;
> +
> +       prev = (struct task_struct *)ctx[1];
> +       next = (struct task_struct *)ctx[2];
> +
> +       if (has_prev_state)
> +               prev_state = (int)ctx[3];
> +       else
> +               prev_state = get_task_state(prev);
> +
> +       return off_cpu_stat(ctx, prev, next, prev_state);
> +}
> +
>  char LICENSE[] SEC("license") = "Dual BSD/GPL";
> --
> 2.36.1.124.g0e6072fb45-goog
>


* Re: [PATCH 5/6] perf record: Add cgroup support for off-cpu profiling
  2022-05-18 22:47 ` [PATCH 5/6] perf record: Add cgroup support for off-cpu profiling Namhyung Kim
@ 2022-05-19  4:07   ` Ian Rogers
  0 siblings, 0 replies; 20+ messages in thread
From: Ian Rogers @ 2022-05-19  4:07 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Song Liu, Hao Luo, Milian Wolff, bpf,
	linux-perf-users, Blake Jones

On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> This covers two different use cases.  The first one is cgroup
> filtering, given by the -G/--cgroup option, which restricts off-cpu
> profiling to tasks in the given cgroups only.
>
> The other use case is cgroup sampling, which is enabled by the
> --all-cgroups option and adds PERF_SAMPLE_CGROUP to the sample_type
> so that the cgroup id of the task is set in the sample data.
>
> Example output.
>
>   $ sudo perf record -a --off-cpu --all-cgroups sleep 1
>
>   $ sudo perf report --stdio -s comm,cgroup --call-graph=no
>   ...
>   # Samples: 144  of event 'offcpu-time'
>   # Event count (approx.): 48452045427
>   #
>   # Children      Self  Command          Cgroup
>   # ........  ........  ...............  ..........................................
>   #
>       61.57%     5.60%  Chrome_ChildIOT  /user.slice/user-657345.slice/user@657345.service/app.slice/...
>       29.51%     7.38%  Web Content      /user.slice/user-657345.slice/user@657345.service/app.slice/...
>       17.48%     1.59%  Chrome_IOThread  /user.slice/user-657345.slice/user@657345.service/app.slice/...
>       16.48%     4.12%  pipewire-pulse   /user.slice/user-657345.slice/user@657345.service/session.slice/...
>       14.48%     2.07%  perf             /user.slice/user-657345.slice/user@657345.service/app.slice/...
>       14.30%     7.15%  CompositorTileW  /user.slice/user-657345.slice/user@657345.service/app.slice/...
>       13.33%     6.67%  Timer            /user.slice/user-657345.slice/user@657345.service/app.slice/...
>   ...
>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

> ---
>  tools/perf/builtin-record.c            |  2 +-
>  tools/perf/util/bpf_off_cpu.c          | 48 ++++++++++++++++++++++++--
>  tools/perf/util/bpf_skel/off_cpu.bpf.c | 33 ++++++++++++++++++
>  tools/perf/util/off_cpu.h              |  7 ++--
>  4 files changed, 85 insertions(+), 5 deletions(-)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 7f60d2eac0b4..77fa21c2c69f 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -907,7 +907,7 @@ static int record__config_text_poke(struct evlist *evlist)
>
>  static int record__config_off_cpu(struct record *rec)
>  {
> -       return off_cpu_prepare(rec->evlist, &rec->opts.target);
> +       return off_cpu_prepare(rec->evlist, &rec->opts.target, &rec->opts);
>  }
>
>  static bool record__kcore_readable(struct machine *machine)
> diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> index 874856c55101..b73e84a02264 100644
> --- a/tools/perf/util/bpf_off_cpu.c
> +++ b/tools/perf/util/bpf_off_cpu.c
> @@ -5,10 +5,12 @@
>  #include "util/evlist.h"
>  #include "util/off_cpu.h"
>  #include "util/perf-hooks.h"
> +#include "util/record.h"
>  #include "util/session.h"
>  #include "util/target.h"
>  #include "util/cpumap.h"
>  #include "util/thread_map.h"
> +#include "util/cgroup.h"
>  #include <bpf/bpf.h>
>
>  #include "bpf_skel/off_cpu.skel.h"
> @@ -24,6 +26,7 @@ struct off_cpu_key {
>         u32 tgid;
>         u32 stack_id;
>         u32 state;
> +       u64 cgroup_id;
>  };
>
>  union off_cpu_data {
> @@ -116,10 +119,11 @@ static void check_sched_switch_args(void)
>         }
>  }
>
> -int off_cpu_prepare(struct evlist *evlist, struct target *target)
> +int off_cpu_prepare(struct evlist *evlist, struct target *target,
> +                   struct record_opts *opts)
>  {
>         int err, fd, i;
> -       int ncpus = 1, ntasks = 1;
> +       int ncpus = 1, ntasks = 1, ncgrps = 1;
>
>         if (off_cpu_config(evlist) < 0) {
>                 pr_err("Failed to config off-cpu BPF event\n");
> @@ -143,6 +147,21 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target)
>                 bpf_map__set_max_entries(skel->maps.task_filter, ntasks);
>         }
>
> +       if (evlist__first(evlist)->cgrp) {
> +               ncgrps = evlist->core.nr_entries - 1; /* excluding a dummy */
> +               bpf_map__set_max_entries(skel->maps.cgroup_filter, ncgrps);
> +
> +               if (!cgroup_is_v2("perf_event"))
> +                       skel->rodata->uses_cgroup_v1 = true;
> +       }
> +
> +       if (opts->record_cgroup) {
> +               skel->rodata->needs_cgroup = true;
> +
> +               if (!cgroup_is_v2("perf_event"))
> +                       skel->rodata->uses_cgroup_v1 = true;
> +       }
> +
>         set_max_rlimit();
>         check_sched_switch_args();
>
> @@ -178,6 +197,29 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target)
>                 }
>         }
>
> +       if (evlist__first(evlist)->cgrp) {
> +               struct evsel *evsel;
> +               u8 val = 1;
> +
> +               skel->bss->has_cgroup = 1;
> +               fd = bpf_map__fd(skel->maps.cgroup_filter);
> +
> +               evlist__for_each_entry(evlist, evsel) {
> +                       struct cgroup *cgrp = evsel->cgrp;
> +
> +                       if (cgrp == NULL)
> +                               continue;
> +
> +                       if (!cgrp->id && read_cgroup_id(cgrp) < 0) {
> +                               pr_err("Failed to read cgroup id of %s\n",
> +                                      cgrp->name);
> +                               goto out;
> +                       }
> +
> +                       bpf_map_update_elem(fd, &cgrp->id, &val, BPF_ANY);
> +               }
> +       }
> +
>         err = off_cpu_bpf__attach(skel);
>         if (err) {
>                 pr_err("Failed to attach off-cpu BPF skeleton\n");
> @@ -275,6 +317,8 @@ int off_cpu_write(struct perf_session *session)
>                         /* calculate sample callchain data array length */
>                         n += len + 2;
>                 }
> +               if (sample_type & PERF_SAMPLE_CGROUP)
> +                       data.array[n++] = key.cgroup_id;
>                 /* TODO: handle more sample types */
>
>                 size = n * sizeof(u64);
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> index 986d7db6e75d..792ae2847080 100644
> --- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -26,6 +26,7 @@ struct offcpu_key {
>         __u32 tgid;
>         __u32 stack_id;
>         __u32 state;
> +       __u64 cgroup_id;
>  };
>
>  struct {
> @@ -63,6 +64,13 @@ struct {
>         __uint(max_entries, 1);
>  } task_filter SEC(".maps");
>
> +struct {
> +       __uint(type, BPF_MAP_TYPE_HASH);
> +       __uint(key_size, sizeof(__u64));
> +       __uint(value_size, sizeof(__u8));
> +       __uint(max_entries, 1);
> +} cgroup_filter SEC(".maps");
> +
>  /* old kernel task_struct definition */
>  struct task_struct___old {
>         long state;
> @@ -71,8 +79,11 @@ struct task_struct___old {
>  int enabled = 0;
>  int has_cpu = 0;
>  int has_task = 0;
> +int has_cgroup = 0;
>
>  const volatile bool has_prev_state = false;
> +const volatile bool needs_cgroup = false;
> +const volatile bool uses_cgroup_v1 = false;
>
>  /*
>   * Old kernel used to call it task_struct->state and now it's '__state'.
> @@ -92,6 +103,18 @@ static inline int get_task_state(struct task_struct *t)
>         return BPF_CORE_READ(t_old, state);
>  }
>
> +static inline __u64 get_cgroup_id(struct task_struct *t)
> +{
> +       struct cgroup *cgrp;
> +
> +       if (uses_cgroup_v1)
> +               cgrp = BPF_CORE_READ(t, cgroups, subsys[perf_event_cgrp_id], cgroup);
> +       else
> +               cgrp = BPF_CORE_READ(t, cgroups, dfl_cgrp);
> +
> +       return BPF_CORE_READ(cgrp, kn, id);
> +}
> +
>  static inline int can_record(struct task_struct *t, int state)
>  {
>         /* kernel threads don't have user stack */
> @@ -120,6 +143,15 @@ static inline int can_record(struct task_struct *t, int state)
>                         return 0;
>         }
>
> +       if (has_cgroup) {
> +               __u8 *ok;
> +               __u64 cgrp_id = get_cgroup_id(t);
> +
> +               ok = bpf_map_lookup_elem(&cgroup_filter, &cgrp_id);
> +               if (!ok)
> +                       return 0;
> +       }
> +
>         return 1;
>  }
>
> @@ -156,6 +188,7 @@ static int off_cpu_stat(u64 *ctx, struct task_struct *prev,
>                         .tgid = next->tgid,
>                         .stack_id = pelem->stack_id,
>                         .state = pelem->state,
> +                       .cgroup_id = needs_cgroup ? get_cgroup_id(next) : 0,
>                 };
>                 __u64 delta = ts - pelem->timestamp;
>                 __u64 *total;
> diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
> index f47af0232e55..548008f74d42 100644
> --- a/tools/perf/util/off_cpu.h
> +++ b/tools/perf/util/off_cpu.h
> @@ -4,15 +4,18 @@
>  struct evlist;
>  struct target;
>  struct perf_session;
> +struct record_opts;
>
>  #define OFFCPU_EVENT  "offcpu-time"
>
>  #ifdef HAVE_BPF_SKEL
> -int off_cpu_prepare(struct evlist *evlist, struct target *target);
> +int off_cpu_prepare(struct evlist *evlist, struct target *target,
> +                   struct record_opts *opts);
>  int off_cpu_write(struct perf_session *session);
>  #else
>  static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused,
> -                                 struct target *target __maybe_unused)
> +                                 struct target *target __maybe_unused,
> +                                 struct record_opts *opts __maybe_unused)
>  {
>         return -1;
>  }
> --
> 2.36.1.124.g0e6072fb45-goog
>


* Re: [PATCH 6/6] perf test: Add a basic offcpu profiling test
  2022-05-18 22:47 ` [PATCH 6/6] perf test: Add a basic offcpu profiling test Namhyung Kim
@ 2022-05-19  4:08   ` Ian Rogers
  0 siblings, 0 replies; 20+ messages in thread
From: Ian Rogers @ 2022-05-19  4:08 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Song Liu, Hao Luo, Milian Wolff, bpf,
	linux-perf-users, Blake Jones

On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
>   $ sudo ./perf test -v offcpu
>    88: perf record offcpu profiling tests                              :
>   --- start ---
>   test child forked, pid 685966
>   Basic off-cpu test
>   Basic off-cpu test [Success]
>   test child finished with 0
>   ---- end ----
>   perf record offcpu profiling tests: Ok
>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

> ---
>  tools/perf/tests/shell/record_offcpu.sh | 60 +++++++++++++++++++++++++
>  1 file changed, 60 insertions(+)
>  create mode 100755 tools/perf/tests/shell/record_offcpu.sh
>
> diff --git a/tools/perf/tests/shell/record_offcpu.sh b/tools/perf/tests/shell/record_offcpu.sh
> new file mode 100755
> index 000000000000..96e0739f7478
> --- /dev/null
> +++ b/tools/perf/tests/shell/record_offcpu.sh
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# perf record offcpu profiling tests
> +# SPDX-License-Identifier: GPL-2.0
> +
> +set -e
> +
> +err=0
> +perfdata=$(mktemp /tmp/__perf_test.perf.data.XXXXX)
> +
> +cleanup() {
> +  rm -f ${perfdata}
> +  rm -f ${perfdata}.old
> +  trap - exit term int
> +}
> +
> +trap_cleanup() {
> +  cleanup
> +  exit 1
> +}
> +trap trap_cleanup exit term int
> +
> +test_offcpu() {
> +  echo "Basic off-cpu test"
> +  if [ `id -u` != 0 ]
> +  then
> +    echo "Basic off-cpu test [Skipped permission]"
> +    err=2
> +    return
> +  fi
> +  if perf record --off-cpu -o ${perfdata} --quiet true 2>&1 | grep BUILD_BPF_SKEL
> +  then
> +    echo "Basic off-cpu test [Skipped missing BPF support]"
> +    err=2
> +    return
> +  fi
> +  if ! perf record --off-cpu -e dummy -o ${perfdata} sleep 1 2> /dev/null
> +  then
> +    echo "Basic off-cpu test [Failed record]"
> +    err=1
> +    return
> +  fi
> +  if ! perf evlist -i ${perfdata} | grep -q "offcpu-time"
> +  then
> +    echo "Basic off-cpu test [Failed record]"
> +    err=1
> +    return
> +  fi
> +  if ! perf report -i ${perfdata} -q --percent-limit=90 | egrep -q sleep
> +  then
> +    echo "Basic off-cpu test [Failed missing output]"
> +    err=1
> +    return
> +  fi
> +  echo "Basic off-cpu test [Success]"
> +}
> +
> +test_offcpu
> +
> +cleanup
> +exit $err
> --
> 2.36.1.124.g0e6072fb45-goog
>


* Re: [PATCH 2/6] perf record: Enable off-cpu analysis with BPF
  2022-05-19  3:58   ` Ian Rogers
@ 2022-05-19 21:01     ` Namhyung Kim
  0 siblings, 0 replies; 20+ messages in thread
From: Namhyung Kim @ 2022-05-19 21:01 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Song Liu, Hao Luo, Milian Wolff, bpf,
	linux-perf-users, Blake Jones

Hi Ian,

On Wed, May 18, 2022 at 8:58 PM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Add --off-cpu option to enable the off-cpu profiling with BPF.  It'd
> > use a bpf_output event and rename it to "offcpu-time".  Samples will
> > be synthesized at the end of the record session using data from a BPF
> > map which contains the aggregated off-cpu time at context switches.
> > So it needs root privilege to get the off-cpu profiling.
> >
> > Each sample will have a separate user stacktrace so it will skip
> > kernel threads.  The sample ip will be set from the stacktrace and
> > other sample data will be updated accordingly.  Currently it only
> > handles some basic sample types.
> >
> > The sample timestamp is set to a dummy value just so it doesn't
> > interfere with other events during sorting.  So it starts from a very
> > big initial value which is increased on processing each sample.
> >
> > A good thing is that it can be used together with regular profiling
> > like cpu cycles.  If you don't want that, you can use a dummy event
> > to enable off-cpu profiling only.
> >
> > Example output:
> >   $ sudo perf record --off-cpu perf bench sched messaging -l 1000
> >
> >   $ sudo perf report --stdio --call-graph=no
> >   # Total Lost Samples: 0
> >   #
> >   # Samples: 41K of event 'cycles'
> >   # Event count (approx.): 42137343851
> >   ...
> >
> >   # Samples: 1K of event 'offcpu-time'
> >   # Event count (approx.): 587990831640
> >   #
> >   # Children      Self  Command          Shared Object       Symbol
> >   # ........  ........  ...............  ..................  .........................
> >   #
> >       81.66%     0.00%  sched-messaging  libc-2.33.so        [.] __libc_start_main
> >       81.66%     0.00%  sched-messaging  perf                [.] cmd_bench
> >       81.66%     0.00%  sched-messaging  perf                [.] main
> >       81.66%     0.00%  sched-messaging  perf                [.] run_builtin
> >       81.43%     0.00%  sched-messaging  perf                [.] bench_sched_messaging
> >       40.86%    40.86%  sched-messaging  libpthread-2.33.so  [.] __read
> >       37.66%    37.66%  sched-messaging  libpthread-2.33.so  [.] __write
> >        2.91%     2.91%  sched-messaging  libc-2.33.so        [.] __poll
> >   ...
> >
> > As you can see it spent most of off-cpu time in read and write in
> > bench_sched_messaging().  The --call-graph=no was added just to make
> > the output concise here.
> >
> > It uses perf hooks facility to control BPF program during the record
> > session rather than adding new BPF/off-cpu specific calls.
> >
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
>
> Acked-by: Ian Rogers <irogers@google.com>
>
> > ---
> >  tools/perf/Documentation/perf-record.txt |  10 ++
> >  tools/perf/Makefile.perf                 |   1 +
> >  tools/perf/builtin-record.c              |  25 +++
> >  tools/perf/util/Build                    |   1 +
> >  tools/perf/util/bpf_off_cpu.c            | 204 +++++++++++++++++++++++
> >  tools/perf/util/bpf_skel/off_cpu.bpf.c   | 139 +++++++++++++++
> >  tools/perf/util/off_cpu.h                |  24 +++
> >  7 files changed, 404 insertions(+)
> >  create mode 100644 tools/perf/util/bpf_off_cpu.c
> >  create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
> >  create mode 100644 tools/perf/util/off_cpu.h
> >
> > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > index 465be4e62a17..b4e9ef7edfef 100644
> > --- a/tools/perf/Documentation/perf-record.txt
> > +++ b/tools/perf/Documentation/perf-record.txt
> > @@ -758,6 +758,16 @@ include::intel-hybrid.txt[]
> >         If the URLs is not specified, the value of DEBUGINFOD_URLS
> >         system environment variable is used.
> >
> > +--off-cpu::
> > +       Enable off-cpu profiling with BPF.  The BPF program will collect
> > +       task scheduling information with (user) stacktrace and save them
> > +       as sample data of a software event named "offcpu-time".  The
> > +       sample period will have the time the task slept in nanoseconds.
> > +
> > +       Note that BPF can collect stack traces using frame pointer ("fp")
> > +       only, as of now.  So the applications built without the frame
> > +       pointer might see bogus addresses.
> > +
> >  SEE ALSO
> >  --------
> >  linkperf:perf-stat[1], linkperf:perf-list[1], linkperf:perf-intel-pt[1]
> > diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> > index 6e5aded855cc..8f738e11356d 100644
> > --- a/tools/perf/Makefile.perf
> > +++ b/tools/perf/Makefile.perf
> > @@ -1038,6 +1038,7 @@ SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
> >  SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
> >  SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
> >  SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h
> > +SKELETONS += $(SKEL_OUT)/off_cpu.skel.h
> >
> >  $(SKEL_TMP_OUT) $(LIBBPF_OUTPUT):
> >         $(Q)$(MKDIR) -p $@
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index a5cf6a99d67f..91f88501412e 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -49,6 +49,7 @@
> >  #include "util/clockid.h"
> >  #include "util/pmu-hybrid.h"
> >  #include "util/evlist-hybrid.h"
> > +#include "util/off_cpu.h"
> >  #include "asm/bug.h"
> >  #include "perf.h"
> >  #include "cputopo.h"
> > @@ -162,6 +163,7 @@ struct record {
> >         bool                    buildid_mmap;
> >         bool                    timestamp_filename;
> >         bool                    timestamp_boundary;
> > +       bool                    off_cpu;
> >         struct switch_output    switch_output;
> >         unsigned long long      samples;
> >         unsigned long           output_max_size;        /* = 0: unlimited */
> > @@ -903,6 +905,11 @@ static int record__config_text_poke(struct evlist *evlist)
> >         return 0;
> >  }
> >
> > +static int record__config_off_cpu(struct record *rec)
> > +{
> > +       return off_cpu_prepare(rec->evlist);
> > +}
> > +
> >  static bool record__kcore_readable(struct machine *machine)
> >  {
> >         char kcore[PATH_MAX];
> > @@ -2600,6 +2607,9 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> >         } else
> >                 status = err;
> >
> > +       if (rec->off_cpu)
> > +               rec->bytes_written += off_cpu_write(rec->session);
> > +
> >         record__synthesize(rec, true);
> >         /* this will be recalculated during process_buildids() */
> >         rec->samples = 0;
> > @@ -3324,6 +3334,7 @@ static struct option __record_options[] = {
> >         OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
> >                             "write collected trace data into several data files using parallel threads",
> >                             record__parse_threads),
> > +       OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
> >         OPT_END()
> >  };
> >
> > @@ -3743,6 +3754,12 @@ int cmd_record(int argc, const char **argv)
> >         set_nobuild('\0', "vmlinux", true);
> >  # undef set_nobuild
> >  # undef REASON
> > +#endif
> > +
> > +#ifndef HAVE_BPF_SKEL
> > +# define set_nobuild(s, l, m, c) set_option_nobuild(record_options, s, l, m, c)
> > +       set_nobuild('\0', "off-cpu", "no BUILD_BPF_SKEL=1", true);
> > +# undef set_nobuild
> >  #endif
> >
> >         rec->opts.affinity = PERF_AFFINITY_SYS;
> > @@ -3981,6 +3998,14 @@ int cmd_record(int argc, const char **argv)
> >                 }
> >         }
> >
> > +       if (rec->off_cpu) {
> > +               err = record__config_off_cpu(rec);
> > +               if (err) {
> > +                       pr_err("record__config_off_cpu failed, error %d\n", err);
> > +                       goto out;
> > +               }
> > +       }
> > +
> >         if (record_opts__config(&rec->opts)) {
> >                 err = -EINVAL;
> >                 goto out;
> > diff --git a/tools/perf/util/Build b/tools/perf/util/Build
> > index 9a7209a99e16..a51267d88ca9 100644
> > --- a/tools/perf/util/Build
> > +++ b/tools/perf/util/Build
> > @@ -147,6 +147,7 @@ perf-$(CONFIG_LIBBPF) += bpf_map.o
> >  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
> >  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
> >  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_ftrace.o
> > +perf-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o
> >  perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
> >  perf-$(CONFIG_LIBELF) += symbol-elf.o
> >  perf-$(CONFIG_LIBELF) += probe-file.o
> > diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> > new file mode 100644
> > index 000000000000..9ed7aca3f4ac
> > --- /dev/null
> > +++ b/tools/perf/util/bpf_off_cpu.c
> > @@ -0,0 +1,204 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "util/bpf_counter.h"
> > +#include "util/debug.h"
> > +#include "util/evsel.h"
> > +#include "util/evlist.h"
> > +#include "util/off_cpu.h"
> > +#include "util/perf-hooks.h"
> > +#include "util/session.h"
> > +#include <bpf/bpf.h>
> > +
> > +#include "bpf_skel/off_cpu.skel.h"
> > +
> > +#define MAX_STACKS  32
> > +/* we don't need actual timestamp, just want to put the samples at last */
> > +#define OFF_CPU_TIMESTAMP  (~0ull << 32)
> > +
> > +static struct off_cpu_bpf *skel;
> > +
> > +struct off_cpu_key {
> > +       u32 pid;
> > +       u32 tgid;
> > +       u32 stack_id;
> > +       u32 state;
> > +};
> > +
> > +union off_cpu_data {
> > +       struct perf_event_header hdr;
> > +       u64 array[1024 / sizeof(u64)];
> > +};
> > +
> > +static int off_cpu_config(struct evlist *evlist)
> > +{
> > +       struct evsel *evsel;
> > +       struct perf_event_attr attr = {
> > +               .type   = PERF_TYPE_SOFTWARE,
> > +               .config = PERF_COUNT_SW_BPF_OUTPUT,
> > +               .size   = sizeof(attr), /* to capture ABI version */
> > +       };
> > +       char *evname = strdup(OFFCPU_EVENT);
> > +
> > +       if (evname == NULL)
> > +               return -ENOMEM;
> > +
> > +       evsel = evsel__new(&attr);
> > +       if (!evsel) {
> > +               free(evname);
> > +               return -ENOMEM;
> > +       }
> > +
> > +       evsel->core.attr.freq = 1;
> > +       evsel->core.attr.sample_period = 1;
> > +       /* off-cpu analysis depends on stack trace */
> > +       evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
> > +
> > +       evlist__add(evlist, evsel);
> > +
> > +       free(evsel->name);
> > +       evsel->name = evname;
> > +
> > +       return 0;
> > +}
> > +
> > +static void off_cpu_start(void *arg __maybe_unused)
> > +{
> > +       skel->bss->enabled = 1;
> > +}
> > +
> > +static void off_cpu_finish(void *arg __maybe_unused)
> > +{
> > +       skel->bss->enabled = 0;
> > +       off_cpu_bpf__destroy(skel);
> > +}
> > +
> > +int off_cpu_prepare(struct evlist *evlist)
> > +{
> > +       int err;
> > +
> > +       if (off_cpu_config(evlist) < 0) {
> > +               pr_err("Failed to config off-cpu BPF event\n");
> > +               return -1;
> > +       }
> > +
> > +       set_max_rlimit();
> > +
> > +       skel = off_cpu_bpf__open_and_load();
> > +       if (!skel) {
> > +               pr_err("Failed to open off-cpu BPF skeleton\n");
> > +               return -1;
> > +       }
> > +
> > +       err = off_cpu_bpf__attach(skel);
> > +       if (err) {
> > +               pr_err("Failed to attach off-cpu BPF skeleton\n");
> > +               goto out;
> > +       }
> > +
> > +       if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
> > +           perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
>
> An off-topic thought here. I was looking at tool events and thinking
> that rather than have global state and the counter reading getting
> those global values, it would be nice if the tool events had private
> state that was created on open, freed on close and then did the
> appropriate thing for enable/read/disable. If we were object oriented
> I was thinking tool events could be a subclass of evsel that would
> override the appropriate functions. Something that strikes me as silly
> is that tool events have a dummy event file descriptor from
> perf_event_open, in part because the evsel code paths are shared.
> Anyway, if we made evsels have something like a virtual method table,
> we could do similar tricks with off-cpu and BPF created events.
> Possibly this could lead to better structured code and more reuse.

Thanks for your review.

Yeah, I think it makes sense to change the offcpu-time event into a
tool event and to skip the actual open/read/close operations.  It
would be good follow-up work.

Using a dummy event still has a syscall overhead and tool events
don't need the syscalls.  Maybe we can add some logic in the evlist
code to skip the tool events.
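
Something along these lines maybe (only a rough sketch; evsel__is_tool()
is a hypothetical helper, not an existing one):

  static int evlist__open_skip_tool(struct evlist *evlist)
  {
          struct evsel *evsel;

          evlist__for_each_entry(evlist, evsel) {
                  /* tool events have nothing to open in the kernel */
                  if (evsel__is_tool(evsel))
                          continue;

                  if (evsel__open(evsel, evsel->core.cpus,
                                  evsel->core.threads) < 0)
                          return -1;
          }
          return 0;
  }

Then the offcpu-time event wouldn't pay for the dummy fd either.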

Thanks,
Namhyung


* Re: [PATCH 3/6] perf record: Implement basic filtering for off-cpu
  2022-05-19  4:02   ` Ian Rogers
@ 2022-05-19 21:02     ` Namhyung Kim
  2022-05-25 11:27       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 20+ messages in thread
From: Namhyung Kim @ 2022-05-19 21:02 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Song Liu, Hao Luo, Milian Wolff, bpf,
	linux-perf-users, Blake Jones

On Wed, May 18, 2022 at 9:02 PM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > It should honor cpu and task filtering with -a, -C or -p, -t options.
> >
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---
> >  tools/perf/builtin-record.c            |  2 +-
> >  tools/perf/util/bpf_off_cpu.c          | 78 +++++++++++++++++++++++---
> >  tools/perf/util/bpf_skel/off_cpu.bpf.c | 52 +++++++++++++++--
> >  tools/perf/util/off_cpu.h              |  6 +-
> >  4 files changed, 123 insertions(+), 15 deletions(-)
> >
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index 91f88501412e..7f60d2eac0b4 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -907,7 +907,7 @@ static int record__config_text_poke(struct evlist *evlist)
> >
> >  static int record__config_off_cpu(struct record *rec)
> >  {
> > -       return off_cpu_prepare(rec->evlist);
> > +       return off_cpu_prepare(rec->evlist, &rec->opts.target);
> >  }
> >
> >  static bool record__kcore_readable(struct machine *machine)
> > diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> > index 9ed7aca3f4ac..b5e2d038da50 100644
> > --- a/tools/perf/util/bpf_off_cpu.c
> > +++ b/tools/perf/util/bpf_off_cpu.c
> > @@ -6,6 +6,9 @@
> >  #include "util/off_cpu.h"
> >  #include "util/perf-hooks.h"
> >  #include "util/session.h"
> > +#include "util/target.h"
> > +#include "util/cpumap.h"
> > +#include "util/thread_map.h"
> >  #include <bpf/bpf.h>
> >
> >  #include "bpf_skel/off_cpu.skel.h"
> > @@ -60,8 +63,23 @@ static int off_cpu_config(struct evlist *evlist)
> >         return 0;
> >  }
> >
> > -static void off_cpu_start(void *arg __maybe_unused)
> > +static void off_cpu_start(void *arg)
> >  {
> > +       struct evlist *evlist = arg;
> > +
> > +       /* update task filter for the given workload */
> > +       if (!skel->bss->has_cpu && !skel->bss->has_task &&
> > +           perf_thread_map__pid(evlist->core.threads, 0) != -1) {
> > +               int fd;
> > +               u32 pid;
> > +               u8 val = 1;
> > +
> > +               skel->bss->has_task = 1;
> > +               fd = bpf_map__fd(skel->maps.task_filter);
> > +               pid = perf_thread_map__pid(evlist->core.threads, 0);
> > +               bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
> > +       }
> > +
> >         skel->bss->enabled = 1;
> >  }
> >
> > @@ -71,31 +89,75 @@ static void off_cpu_finish(void *arg __maybe_unused)
> >         off_cpu_bpf__destroy(skel);
> >  }
> >
> > -int off_cpu_prepare(struct evlist *evlist)
> > +int off_cpu_prepare(struct evlist *evlist, struct target *target)
> >  {
> > -       int err;
> > +       int err, fd, i;
> > +       int ncpus = 1, ntasks = 1;
> >
> >         if (off_cpu_config(evlist) < 0) {
> >                 pr_err("Failed to config off-cpu BPF event\n");
> >                 return -1;
> >         }
> >
> > -       set_max_rlimit();
> > -
> > -       skel = off_cpu_bpf__open_and_load();
> > +       skel = off_cpu_bpf__open();
> >         if (!skel) {
> >                 pr_err("Failed to open off-cpu BPF skeleton\n");
> >                 return -1;
> >         }
> >
> > +       /* don't need to set cpu filter for system-wide mode */
> > +       if (target->cpu_list) {
> > +               ncpus = perf_cpu_map__nr(evlist->core.user_requested_cpus);
> > +               bpf_map__set_max_entries(skel->maps.cpu_filter, ncpus);
> > +       }
> > +
> > +       if (target__has_task(target)) {
> > +               ntasks = perf_thread_map__nr(evlist->core.threads);
> > +               bpf_map__set_max_entries(skel->maps.task_filter, ntasks);
> > +       }
> > +
> > +       set_max_rlimit();
> > +
> > +       err = off_cpu_bpf__load(skel);
> > +       if (err) {
> > +               pr_err("Failed to load off-cpu skeleton\n");
> > +               goto out;
> > +       }
> > +
> > +       if (target->cpu_list) {
> > +               u32 cpu;
> > +               u8 val = 1;
> > +
> > +               skel->bss->has_cpu = 1;
> > +               fd = bpf_map__fd(skel->maps.cpu_filter);
> > +
> > +               for (i = 0; i < ncpus; i++) {
> > +                       cpu = perf_cpu_map__cpu(evlist->core.user_requested_cpus, i).cpu;
> > +                       bpf_map_update_elem(fd, &cpu, &val, BPF_ANY);
>
> Perhaps more concise with a for_each:
>
> perf_cpu_map__for_each_cpu(cpu, idx, evlist->core.user_requested_cpus)
>   bpf_map_update_elem(fd, &cpu.cpu, &val, BPF_ANY);

Will change.

Thanks,
Namhyung
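
For reference, a minimal sketch of what the CPU-filter hunk in
off_cpu_prepare() could look like once reworked around the suggested
iterator.  This is only a sketch, not necessarily the exact code that
ended up upstream; it assumes perf_cpu_map__for_each_cpu() iterates
struct perf_cpu values with an int index, as in Ian's snippet above:

```
	if (target->cpu_list) {
		struct perf_cpu pcpu;
		int idx;
		u8 val = 1;

		skel->bss->has_cpu = 1;
		fd = bpf_map__fd(skel->maps.cpu_filter);

		/* mark every user-requested CPU in the filter map */
		perf_cpu_map__for_each_cpu(pcpu, idx, evlist->core.user_requested_cpus)
			bpf_map_update_elem(fd, &pcpu.cpu, &val, BPF_ANY);
	}
```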


* Re: [PATCH 3/6] perf record: Implement basic filtering for off-cpu
  2022-05-19 21:02     ` Namhyung Kim
@ 2022-05-25 11:27       ` Arnaldo Carvalho de Melo
  2022-05-25 11:44         ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 20+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-05-25 11:27 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ian Rogers, Jiri Olsa, Ingo Molnar, Peter Zijlstra, LKML,
	Andi Kleen, Song Liu, Hao Luo, Milian Wolff, bpf,
	linux-perf-users, Blake Jones

Em Thu, May 19, 2022 at 02:02:28PM -0700, Namhyung Kim escreveu:
> On Wed, May 18, 2022 at 9:02 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > It should honor cpu and task filtering with -a, -C or -p, -t options.
> > >
> > > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > > ---
> > >  tools/perf/builtin-record.c            |  2 +-
> > >  tools/perf/util/bpf_off_cpu.c          | 78 +++++++++++++++++++++++---
> > >  tools/perf/util/bpf_skel/off_cpu.bpf.c | 52 +++++++++++++++--
> > >  tools/perf/util/off_cpu.h              |  6 +-
> > >  4 files changed, 123 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > > index 91f88501412e..7f60d2eac0b4 100644
> > > --- a/tools/perf/builtin-record.c
> > > +++ b/tools/perf/builtin-record.c
> > > @@ -907,7 +907,7 @@ static int record__config_text_poke(struct evlist *evlist)
> > >
> > >  static int record__config_off_cpu(struct record *rec)
> > >  {
> > > -       return off_cpu_prepare(rec->evlist);
> > > +       return off_cpu_prepare(rec->evlist, &rec->opts.target);
> > >  }
> > >
> > >  static bool record__kcore_readable(struct machine *machine)
> > > diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> > > index 9ed7aca3f4ac..b5e2d038da50 100644
> > > --- a/tools/perf/util/bpf_off_cpu.c
> > > +++ b/tools/perf/util/bpf_off_cpu.c
> > > @@ -6,6 +6,9 @@
> > >  #include "util/off_cpu.h"
> > >  #include "util/perf-hooks.h"
> > >  #include "util/session.h"
> > > +#include "util/target.h"
> > > +#include "util/cpumap.h"
> > > +#include "util/thread_map.h"
> > >  #include <bpf/bpf.h>
> > >
> > >  #include "bpf_skel/off_cpu.skel.h"
> > > @@ -60,8 +63,23 @@ static int off_cpu_config(struct evlist *evlist)
> > >         return 0;
> > >  }
> > >
> > > -static void off_cpu_start(void *arg __maybe_unused)
> > > +static void off_cpu_start(void *arg)
> > >  {
> > > +       struct evlist *evlist = arg;
> > > +
> > > +       /* update task filter for the given workload */
> > > +       if (!skel->bss->has_cpu && !skel->bss->has_task &&
> > > +           perf_thread_map__pid(evlist->core.threads, 0) != -1) {
> > > +               int fd;
> > > +               u32 pid;
> > > +               u8 val = 1;
> > > +
> > > +               skel->bss->has_task = 1;
> > > +               fd = bpf_map__fd(skel->maps.task_filter);
> > > +               pid = perf_thread_map__pid(evlist->core.threads, 0);
> > > +               bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
> > > +       }
> > > +
> > >         skel->bss->enabled = 1;
> > >  }
> > >
> > > @@ -71,31 +89,75 @@ static void off_cpu_finish(void *arg __maybe_unused)
> > >         off_cpu_bpf__destroy(skel);
> > >  }
> > >
> > > -int off_cpu_prepare(struct evlist *evlist)
> > > +int off_cpu_prepare(struct evlist *evlist, struct target *target)
> > >  {
> > > -       int err;
> > > +       int err, fd, i;
> > > +       int ncpus = 1, ntasks = 1;
> > >
> > >         if (off_cpu_config(evlist) < 0) {
> > >                 pr_err("Failed to config off-cpu BPF event\n");
> > >                 return -1;
> > >         }
> > >
> > > -       set_max_rlimit();
> > > -
> > > -       skel = off_cpu_bpf__open_and_load();
> > > +       skel = off_cpu_bpf__open();
> > >         if (!skel) {
> > >                 pr_err("Failed to open off-cpu BPF skeleton\n");
> > >                 return -1;
> > >         }
> > >
> > > +       /* don't need to set cpu filter for system-wide mode */
> > > +       if (target->cpu_list) {
> > > +               ncpus = perf_cpu_map__nr(evlist->core.user_requested_cpus);
> > > +               bpf_map__set_max_entries(skel->maps.cpu_filter, ncpus);
> > > +       }
> > > +
> > > +       if (target__has_task(target)) {
> > > +               ntasks = perf_thread_map__nr(evlist->core.threads);
> > > +               bpf_map__set_max_entries(skel->maps.task_filter, ntasks);
> > > +       }
> > > +
> > > +       set_max_rlimit();
> > > +
> > > +       err = off_cpu_bpf__load(skel);
> > > +       if (err) {
> > > +               pr_err("Failed to load off-cpu skeleton\n");
> > > +               goto out;
> > > +       }
> > > +
> > > +       if (target->cpu_list) {
> > > +               u32 cpu;
> > > +               u8 val = 1;
> > > +
> > > +               skel->bss->has_cpu = 1;
> > > +               fd = bpf_map__fd(skel->maps.cpu_filter);
> > > +
> > > +               for (i = 0; i < ncpus; i++) {
> > > +                       cpu = perf_cpu_map__cpu(evlist->core.user_requested_cpus, i).cpu;
> > > +                       bpf_map_update_elem(fd, &cpu, &val, BPF_ANY);
> >
> > Perhaps more concise with a for_each:
> >
> > perf_cpu_map__for_each_cpu(cpu, idx, evlist->core.user_requested_cpus)
> >   bpf_map_update_elem(fd, &cpu.cpu, &val, BPF_ANY);

So I'll wait for a new version of this patchset.

- Arnaldo


* Re: [PATCH 3/6] perf record: Implement basic filtering for off-cpu
  2022-05-25 11:27       ` Arnaldo Carvalho de Melo
@ 2022-05-25 11:44         ` Arnaldo Carvalho de Melo
  2022-05-25 14:06           ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 20+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-05-25 11:44 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ian Rogers, Jiri Olsa, Ingo Molnar, Peter Zijlstra, LKML,
	Andi Kleen, Song Liu, Hao Luo, Milian Wolff, bpf,
	linux-perf-users, Blake Jones

Em Wed, May 25, 2022 at 08:27:38AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Thu, May 19, 2022 at 02:02:28PM -0700, Namhyung Kim escreveu:
> > On Wed, May 18, 2022 at 9:02 PM Ian Rogers <irogers@google.com> wrote:
> > >
> > > On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > > >
> > > > It should honor cpu and task filtering with -a, -C or -p, -t options.
> > > >
> > > > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > > > ---
> > > >  tools/perf/builtin-record.c            |  2 +-
> > > >  tools/perf/util/bpf_off_cpu.c          | 78 +++++++++++++++++++++++---
> > > >  tools/perf/util/bpf_skel/off_cpu.bpf.c | 52 +++++++++++++++--
> > > >  tools/perf/util/off_cpu.h              |  6 +-
> > > >  4 files changed, 123 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > > > index 91f88501412e..7f60d2eac0b4 100644
> > > > --- a/tools/perf/builtin-record.c
> > > > +++ b/tools/perf/builtin-record.c
> > > > @@ -907,7 +907,7 @@ static int record__config_text_poke(struct evlist *evlist)
> > > >
> > > >  static int record__config_off_cpu(struct record *rec)
> > > >  {
> > > > -       return off_cpu_prepare(rec->evlist);
> > > > +       return off_cpu_prepare(rec->evlist, &rec->opts.target);
> > > >  }
> > > >
> > > >  static bool record__kcore_readable(struct machine *machine)
> > > > diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> > > > index 9ed7aca3f4ac..b5e2d038da50 100644
> > > > --- a/tools/perf/util/bpf_off_cpu.c
> > > > +++ b/tools/perf/util/bpf_off_cpu.c
> > > > @@ -6,6 +6,9 @@
> > > >  #include "util/off_cpu.h"
> > > >  #include "util/perf-hooks.h"
> > > >  #include "util/session.h"
> > > > +#include "util/target.h"
> > > > +#include "util/cpumap.h"
> > > > +#include "util/thread_map.h"
> > > >  #include <bpf/bpf.h>
> > > >
> > > >  #include "bpf_skel/off_cpu.skel.h"
> > > > @@ -60,8 +63,23 @@ static int off_cpu_config(struct evlist *evlist)
> > > >         return 0;
> > > >  }
> > > >
> > > > -static void off_cpu_start(void *arg __maybe_unused)
> > > > +static void off_cpu_start(void *arg)
> > > >  {
> > > > +       struct evlist *evlist = arg;
> > > > +
> > > > +       /* update task filter for the given workload */
> > > > +       if (!skel->bss->has_cpu && !skel->bss->has_task &&
> > > > +           perf_thread_map__pid(evlist->core.threads, 0) != -1) {
> > > > +               int fd;
> > > > +               u32 pid;
> > > > +               u8 val = 1;
> > > > +
> > > > +               skel->bss->has_task = 1;
> > > > +               fd = bpf_map__fd(skel->maps.task_filter);
> > > > +               pid = perf_thread_map__pid(evlist->core.threads, 0);
> > > > +               bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
> > > > +       }
> > > > +
> > > >         skel->bss->enabled = 1;
> > > >  }
> > > >
> > > > @@ -71,31 +89,75 @@ static void off_cpu_finish(void *arg __maybe_unused)
> > > >         off_cpu_bpf__destroy(skel);
> > > >  }
> > > >
> > > > -int off_cpu_prepare(struct evlist *evlist)
> > > > +int off_cpu_prepare(struct evlist *evlist, struct target *target)
> > > >  {
> > > > -       int err;
> > > > +       int err, fd, i;
> > > > +       int ncpus = 1, ntasks = 1;
> > > >
> > > >         if (off_cpu_config(evlist) < 0) {
> > > >                 pr_err("Failed to config off-cpu BPF event\n");
> > > >                 return -1;
> > > >         }
> > > >
> > > > -       set_max_rlimit();
> > > > -
> > > > -       skel = off_cpu_bpf__open_and_load();
> > > > +       skel = off_cpu_bpf__open();
> > > >         if (!skel) {
> > > >                 pr_err("Failed to open off-cpu BPF skeleton\n");
> > > >                 return -1;
> > > >         }
> > > >
> > > > +       /* don't need to set cpu filter for system-wide mode */
> > > > +       if (target->cpu_list) {
> > > > +               ncpus = perf_cpu_map__nr(evlist->core.user_requested_cpus);
> > > > +               bpf_map__set_max_entries(skel->maps.cpu_filter, ncpus);
> > > > +       }
> > > > +
> > > > +       if (target__has_task(target)) {
> > > > +               ntasks = perf_thread_map__nr(evlist->core.threads);
> > > > +               bpf_map__set_max_entries(skel->maps.task_filter, ntasks);
> > > > +       }
> > > > +
> > > > +       set_max_rlimit();
> > > > +
> > > > +       err = off_cpu_bpf__load(skel);
> > > > +       if (err) {
> > > > +               pr_err("Failed to load off-cpu skeleton\n");
> > > > +               goto out;
> > > > +       }
> > > > +
> > > > +       if (target->cpu_list) {
> > > > +               u32 cpu;
> > > > +               u8 val = 1;
> > > > +
> > > > +               skel->bss->has_cpu = 1;
> > > > +               fd = bpf_map__fd(skel->maps.cpu_filter);
> > > > +
> > > > +               for (i = 0; i < ncpus; i++) {
> > > > +                       cpu = perf_cpu_map__cpu(evlist->core.user_requested_cpus, i).cpu;
> > > > +                       bpf_map_update_elem(fd, &cpu, &val, BPF_ANY);
> > >
> > > Perhaps more concise with a for_each:
> > >
> > > perf_cpu_map__for_each_cpu(cpu, idx, evlist->core.user_requested_cpus)
> > >   bpf_map_update_elem(fd, &cpu.cpu, &val, BPF_ANY);
> 
> So I'll wait for a new version of this patchset.

I take that back, I'll apply it and this can be a follow-up patch, right?

- Arnaldo


* Re: [PATCH 3/6] perf record: Implement basic filtering for off-cpu
  2022-05-25 11:44         ` Arnaldo Carvalho de Melo
@ 2022-05-25 14:06           ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 20+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-05-25 14:06 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ian Rogers, Jiri Olsa, Ingo Molnar, Peter Zijlstra, LKML,
	Andi Kleen, Song Liu, Hao Luo, Milian Wolff, bpf,
	linux-perf-users, Blake Jones

Em Wed, May 25, 2022 at 08:44:38AM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Wed, May 25, 2022 at 08:27:38AM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Thu, May 19, 2022 at 02:02:28PM -0700, Namhyung Kim escreveu:
> > > On Wed, May 18, 2022 at 9:02 PM Ian Rogers <irogers@google.com> wrote:
> > > > Perhaps more concise with a for_each:

> > > > perf_cpu_map__for_each_cpu(cpu, idx, evlist->core.user_requested_cpus)
> > > >   bpf_map_update_elem(fd, &cpu.cpu, &val, BPF_ANY);

> > So I'll wait for a new version of this patchset.

> I take that back, will apply and this can be a follow up patch, right?

I tested it and added some committer notes; everything is now in my
tmp.perf/core branch and will transition to perf/core on its way to 5.19
later today, after the tests finish.

- Arnaldo
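
As a side note for readers: the BPF-side half of this patch (the
off_cpu.bpf.c hunk listed in the diffstat but not quoted in the replies
above) is what consumes the has_cpu/has_task flags and the two filter
maps.  Roughly, and only as a sketch assuming the map and flag names
match the user-space code quoted above and that it sits in off_cpu.bpf.c
next to the existing maps (so vmlinux.h and the bpf helpers are already
included), the check looks like this:

```
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u8));
	__uint(max_entries, 1);	/* resized from user space before load */
} cpu_filter SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u8));
	__uint(max_entries, 1);	/* resized from user space before load */
} task_filter SEC(".maps");

int has_cpu = 0;
int has_task = 0;

/* return 1 if the sched_switch event should be recorded for @t */
static inline int can_record(struct task_struct *t)
{
	if (has_cpu) {
		__u32 cpu = bpf_get_smp_processor_id();
		__u8 *ok;

		ok = bpf_map_lookup_elem(&cpu_filter, &cpu);
		if (!ok)
			return 0;
	}

	if (has_task) {
		__u32 pid = t->pid;
		__u8 *ok;

		ok = bpf_map_lookup_elem(&task_filter, &pid);
		if (!ok)
			return 0;
	}

	return 1;
}
```

on_switch() would then bail out early when can_record() returns 0, so
filtered-out CPUs and tasks never touch the off_cpu accounting map.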


* Re: [PATCH 2/6] perf record: Enable off-cpu analysis with BPF
  2022-05-18 22:47 ` [PATCH 2/6] perf record: Enable off-cpu analysis with BPF Namhyung Kim
  2022-05-19  3:58   ` Ian Rogers
@ 2022-06-01  0:00   ` Ian Rogers
  2022-06-01  6:01     ` Namhyung Kim
  1 sibling, 1 reply; 20+ messages in thread
From: Ian Rogers @ 2022-06-01  0:00 UTC (permalink / raw)
  To: Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	Andrii Nakryiko, Peter Zijlstra, Hao Luo, bpf
  Cc: Ingo Molnar, LKML, Andi Kleen, Song Liu, Milian Wolff,
	linux-perf-users, Blake Jones

On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Add the --off-cpu option to enable off-cpu profiling with BPF.  It uses
> a bpf_output event renamed to "offcpu-time".  Samples will be
> synthesized at the end of the record session using data from a BPF map
> which contains the aggregated off-cpu time at context switches.  So it
> needs root privilege to do off-cpu profiling.
>
> Each sample will have a separate user stacktrace so it will skip
> kernel threads.  The sample ip will be set from the stacktrace and
> other sample data will be updated accordingly.  Currently it only
> handles some basic sample types.
>
> The sample timestamp is set to a dummy value so that it doesn't
> interfere with other events during sorting.  It starts from a very big
> initial value and is increased while processing each sample.
>
> The good thing is that it can be used together with regular profiling
> like cpu cycles.  If you don't want that, you can use a dummy event to
> enable off-cpu profiling only.
>
> Example output:
>   $ sudo perf record --off-cpu perf bench sched messaging -l 1000
>
>   $ sudo perf report --stdio --call-graph=no
>   # Total Lost Samples: 0
>   #
>   # Samples: 41K of event 'cycles'
>   # Event count (approx.): 42137343851
>   ...
>
>   # Samples: 1K of event 'offcpu-time'
>   # Event count (approx.): 587990831640
>   #
>   # Children      Self  Command          Shared Object       Symbol
>   # ........  ........  ...............  ..................  .........................
>   #
>       81.66%     0.00%  sched-messaging  libc-2.33.so        [.] __libc_start_main
>       81.66%     0.00%  sched-messaging  perf                [.] cmd_bench
>       81.66%     0.00%  sched-messaging  perf                [.] main
>       81.66%     0.00%  sched-messaging  perf                [.] run_builtin
>       81.43%     0.00%  sched-messaging  perf                [.] bench_sched_messaging
>       40.86%    40.86%  sched-messaging  libpthread-2.33.so  [.] __read
>       37.66%    37.66%  sched-messaging  libpthread-2.33.so  [.] __write
>        2.91%     2.91%  sched-messaging  libc-2.33.so        [.] __poll
>   ...
>
> As you can see, most of the off-cpu time was spent in read and write in
> bench_sched_messaging().  The --call-graph=no option was added just to
> keep the output concise here.
>
> It uses perf hooks facility to control BPF program during the record
> session rather than adding new BPF/off-cpu specific calls.
>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
>  tools/perf/Documentation/perf-record.txt |  10 ++
>  tools/perf/Makefile.perf                 |   1 +
>  tools/perf/builtin-record.c              |  25 +++
>  tools/perf/util/Build                    |   1 +
>  tools/perf/util/bpf_off_cpu.c            | 204 +++++++++++++++++++++++
>  tools/perf/util/bpf_skel/off_cpu.bpf.c   | 139 +++++++++++++++
>  tools/perf/util/off_cpu.h                |  24 +++
>  7 files changed, 404 insertions(+)
>  create mode 100644 tools/perf/util/bpf_off_cpu.c
>  create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
>  create mode 100644 tools/perf/util/off_cpu.h
>
> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> index 465be4e62a17..b4e9ef7edfef 100644
> --- a/tools/perf/Documentation/perf-record.txt
> +++ b/tools/perf/Documentation/perf-record.txt
> @@ -758,6 +758,16 @@ include::intel-hybrid.txt[]
>         If the URLs is not specified, the value of DEBUGINFOD_URLS
>         system environment variable is used.
>
> +--off-cpu::
> +       Enable off-cpu profiling with BPF.  The BPF program will collect
> +       task scheduling information with (user) stacktrace and save them
> +       as sample data of a software event named "offcpu-time".  The
> +       sample period will have the time the task slept in nanoseconds.
> +
> +       Note that BPF can collect stack traces using frame pointer ("fp")
> +       only, as of now.  So the applications built without the frame
> +       pointer might see bogus addresses.
> +
>  SEE ALSO
>  --------
>  linkperf:perf-stat[1], linkperf:perf-list[1], linkperf:perf-intel-pt[1]
> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> index 6e5aded855cc..8f738e11356d 100644
> --- a/tools/perf/Makefile.perf
> +++ b/tools/perf/Makefile.perf
> @@ -1038,6 +1038,7 @@ SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
>  SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
>  SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
>  SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h
> +SKELETONS += $(SKEL_OUT)/off_cpu.skel.h
>
>  $(SKEL_TMP_OUT) $(LIBBPF_OUTPUT):
>         $(Q)$(MKDIR) -p $@
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index a5cf6a99d67f..91f88501412e 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -49,6 +49,7 @@
>  #include "util/clockid.h"
>  #include "util/pmu-hybrid.h"
>  #include "util/evlist-hybrid.h"
> +#include "util/off_cpu.h"
>  #include "asm/bug.h"
>  #include "perf.h"
>  #include "cputopo.h"
> @@ -162,6 +163,7 @@ struct record {
>         bool                    buildid_mmap;
>         bool                    timestamp_filename;
>         bool                    timestamp_boundary;
> +       bool                    off_cpu;
>         struct switch_output    switch_output;
>         unsigned long long      samples;
>         unsigned long           output_max_size;        /* = 0: unlimited */
> @@ -903,6 +905,11 @@ static int record__config_text_poke(struct evlist *evlist)
>         return 0;
>  }
>
> +static int record__config_off_cpu(struct record *rec)
> +{
> +       return off_cpu_prepare(rec->evlist);
> +}
> +
>  static bool record__kcore_readable(struct machine *machine)
>  {
>         char kcore[PATH_MAX];
> @@ -2600,6 +2607,9 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>         } else
>                 status = err;
>
> +       if (rec->off_cpu)
> +               rec->bytes_written += off_cpu_write(rec->session);
> +
>         record__synthesize(rec, true);
>         /* this will be recalculated during process_buildids() */
>         rec->samples = 0;
> @@ -3324,6 +3334,7 @@ static struct option __record_options[] = {
>         OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
>                             "write collected trace data into several data files using parallel threads",
>                             record__parse_threads),
> +       OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
>         OPT_END()
>  };
>
> @@ -3743,6 +3754,12 @@ int cmd_record(int argc, const char **argv)
>         set_nobuild('\0', "vmlinux", true);
>  # undef set_nobuild
>  # undef REASON
> +#endif
> +
> +#ifndef HAVE_BPF_SKEL
> +# define set_nobuild(s, l, m, c) set_option_nobuild(record_options, s, l, m, c)
> +       set_nobuild('\0', "off-cpu", "no BUILD_BPF_SKEL=1", true);
> +# undef set_nobuild
>  #endif
>
>         rec->opts.affinity = PERF_AFFINITY_SYS;
> @@ -3981,6 +3998,14 @@ int cmd_record(int argc, const char **argv)
>                 }
>         }
>
> +       if (rec->off_cpu) {
> +               err = record__config_off_cpu(rec);
> +               if (err) {
> +                       pr_err("record__config_off_cpu failed, error %d\n", err);
> +                       goto out;
> +               }
> +       }
> +
>         if (record_opts__config(&rec->opts)) {
>                 err = -EINVAL;
>                 goto out;
> diff --git a/tools/perf/util/Build b/tools/perf/util/Build
> index 9a7209a99e16..a51267d88ca9 100644
> --- a/tools/perf/util/Build
> +++ b/tools/perf/util/Build
> @@ -147,6 +147,7 @@ perf-$(CONFIG_LIBBPF) += bpf_map.o
>  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
>  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
>  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_ftrace.o
> +perf-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o
>  perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
>  perf-$(CONFIG_LIBELF) += symbol-elf.o
>  perf-$(CONFIG_LIBELF) += probe-file.o
> diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> new file mode 100644
> index 000000000000..9ed7aca3f4ac
> --- /dev/null
> +++ b/tools/perf/util/bpf_off_cpu.c
> @@ -0,0 +1,204 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "util/bpf_counter.h"
> +#include "util/debug.h"
> +#include "util/evsel.h"
> +#include "util/evlist.h"
> +#include "util/off_cpu.h"
> +#include "util/perf-hooks.h"
> +#include "util/session.h"
> +#include <bpf/bpf.h>
> +
> +#include "bpf_skel/off_cpu.skel.h"
> +
> +#define MAX_STACKS  32
> +/* we don't need actual timestamp, just want to put the samples at last */
> +#define OFF_CPU_TIMESTAMP  (~0ull << 32)
> +
> +static struct off_cpu_bpf *skel;
> +
> +struct off_cpu_key {
> +       u32 pid;
> +       u32 tgid;
> +       u32 stack_id;
> +       u32 state;
> +};
> +
> +union off_cpu_data {
> +       struct perf_event_header hdr;
> +       u64 array[1024 / sizeof(u64)];
> +};
> +
> +static int off_cpu_config(struct evlist *evlist)
> +{
> +       struct evsel *evsel;
> +       struct perf_event_attr attr = {
> +               .type   = PERF_TYPE_SOFTWARE,
> +               .config = PERF_COUNT_SW_BPF_OUTPUT,
> +               .size   = sizeof(attr), /* to capture ABI version */
> +       };
> +       char *evname = strdup(OFFCPU_EVENT);
> +
> +       if (evname == NULL)
> +               return -ENOMEM;
> +
> +       evsel = evsel__new(&attr);
> +       if (!evsel) {
> +               free(evname);
> +               return -ENOMEM;
> +       }
> +
> +       evsel->core.attr.freq = 1;
> +       evsel->core.attr.sample_period = 1;
> +       /* off-cpu analysis depends on stack trace */
> +       evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
> +
> +       evlist__add(evlist, evsel);
> +
> +       free(evsel->name);
> +       evsel->name = evname;
> +
> +       return 0;
> +}
> +
> +static void off_cpu_start(void *arg __maybe_unused)
> +{
> +       skel->bss->enabled = 1;
> +}
> +
> +static void off_cpu_finish(void *arg __maybe_unused)
> +{
> +       skel->bss->enabled = 0;
> +       off_cpu_bpf__destroy(skel);
> +}
> +
> +int off_cpu_prepare(struct evlist *evlist)
> +{
> +       int err;
> +
> +       if (off_cpu_config(evlist) < 0) {
> +               pr_err("Failed to config off-cpu BPF event\n");
> +               return -1;
> +       }
> +
> +       set_max_rlimit();
> +
> +       skel = off_cpu_bpf__open_and_load();
> +       if (!skel) {
> +               pr_err("Failed to open off-cpu BPF skeleton\n");
> +               return -1;
> +       }
> +
> +       err = off_cpu_bpf__attach(skel);
> +       if (err) {
> +               pr_err("Failed to attach off-cpu BPF skeleton\n");
> +               goto out;
> +       }
> +
> +       if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
> +           perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
> +               pr_err("Failed to attach off-cpu skeleton\n");
> +               goto out;
> +       }
> +
> +       return 0;
> +
> +out:
> +       off_cpu_bpf__destroy(skel);
> +       return -1;
> +}
> +
> +int off_cpu_write(struct perf_session *session)
> +{
> +       int bytes = 0, size;
> +       int fd, stack;
> +       u64 sample_type, val, sid = 0;
> +       struct evsel *evsel;
> +       struct perf_data_file *file = &session->data->file;
> +       struct off_cpu_key prev, key;
> +       union off_cpu_data data = {
> +               .hdr = {
> +                       .type = PERF_RECORD_SAMPLE,
> +                       .misc = PERF_RECORD_MISC_USER,
> +               },
> +       };
> +       u64 tstamp = OFF_CPU_TIMESTAMP;
> +
> +       skel->bss->enabled = 0;
> +
> +       evsel = evlist__find_evsel_by_str(session->evlist, OFFCPU_EVENT);
> +       if (evsel == NULL) {
> +               pr_err("%s evsel not found\n", OFFCPU_EVENT);
> +               return 0;
> +       }
> +
> +       sample_type = evsel->core.attr.sample_type;
> +
> +       if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
> +               if (evsel->core.id)
> +                       sid = evsel->core.id[0];
> +       }
> +
> +       fd = bpf_map__fd(skel->maps.off_cpu);
> +       stack = bpf_map__fd(skel->maps.stacks);
> +       memset(&prev, 0, sizeof(prev));
> +
> +       while (!bpf_map_get_next_key(fd, &prev, &key)) {
> +               int n = 1;  /* start from perf_event_header */
> +               int ip_pos = -1;
> +
> +               bpf_map_lookup_elem(fd, &key, &val);
> +
> +               if (sample_type & PERF_SAMPLE_IDENTIFIER)
> +                       data.array[n++] = sid;
> +               if (sample_type & PERF_SAMPLE_IP) {
> +                       ip_pos = n;
> +                       data.array[n++] = 0;  /* will be updated */
> +               }
> +               if (sample_type & PERF_SAMPLE_TID)
> +                       data.array[n++] = (u64)key.pid << 32 | key.tgid;
> +               if (sample_type & PERF_SAMPLE_TIME)
> +                       data.array[n++] = tstamp;
> +               if (sample_type & PERF_SAMPLE_ID)
> +                       data.array[n++] = sid;
> +               if (sample_type & PERF_SAMPLE_CPU)
> +                       data.array[n++] = 0;
> +               if (sample_type & PERF_SAMPLE_PERIOD)
> +                       data.array[n++] = val;
> +               if (sample_type & PERF_SAMPLE_CALLCHAIN) {
> +                       int len = 0;
> +
> +                       /* data.array[n] is callchain->nr (updated later) */
> +                       data.array[n + 1] = PERF_CONTEXT_USER;
> +                       data.array[n + 2] = 0;
> +
> +                       bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
> +                       while (data.array[n + 2 + len])
> +                               len++;
> +
> +                       /* update length of callchain */
> +                       data.array[n] = len + 1;
> +
> +                       /* update sample ip with the first callchain entry */
> +                       if (ip_pos >= 0)
> +                               data.array[ip_pos] = data.array[n + 2];
> +
> +                       /* calculate sample callchain data array length */
> +                       n += len + 2;
> +               }
> +               /* TODO: handle more sample types */
> +
> +               size = n * sizeof(u64);
> +               data.hdr.size = size;
> +               bytes += size;
> +
> +               if (perf_data_file__write(file, &data, size) < 0) {
> +                       pr_err("failed to write perf data, error: %m\n");
> +                       return bytes;
> +               }
> +
> +               prev = key;
> +               /* increase dummy timestamp to sort later samples */
> +               tstamp++;
> +       }
> +       return bytes;
> +}
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> new file mode 100644
> index 000000000000..5173ed882fdf
> --- /dev/null
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -0,0 +1,139 @@
> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +// Copyright (c) 2022 Google
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_core_read.h>
> +
> +/* task->flags for off-cpu analysis */
> +#define PF_KTHREAD   0x00200000  /* I am a kernel thread */
> +
> +/* task->state for off-cpu analysis */
> +#define TASK_INTERRUPTIBLE     0x0001
> +#define TASK_UNINTERRUPTIBLE   0x0002
> +
> +#define MAX_STACKS   32
> +#define MAX_ENTRIES  102400
> +
> +struct tstamp_data {
> +       __u32 stack_id;
> +       __u32 state;
> +       __u64 timestamp;
> +};
> +
> +struct offcpu_key {
> +       __u32 pid;
> +       __u32 tgid;
> +       __u32 stack_id;
> +       __u32 state;
> +};
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_STACK_TRACE);
> +       __uint(key_size, sizeof(__u32));
> +       __uint(value_size, MAX_STACKS * sizeof(__u64));
> +       __uint(max_entries, MAX_ENTRIES);
> +} stacks SEC(".maps");
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> +       __uint(map_flags, BPF_F_NO_PREALLOC);
> +       __type(key, int);
> +       __type(value, struct tstamp_data);
> +} tstamp SEC(".maps");
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_HASH);
> +       __uint(key_size, sizeof(struct offcpu_key));
> +       __uint(value_size, sizeof(__u64));
> +       __uint(max_entries, MAX_ENTRIES);
> +} off_cpu SEC(".maps");
> +
> +/* old kernel task_struct definition */
> +struct task_struct___old {
> +       long state;
> +} __attribute__((preserve_access_index));
> +
> +int enabled = 0;
> +
> +/*
> + * Old kernel used to call it task_struct->state and now it's '__state'.
> + * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
> + *
> + * https://nakryiko.com/posts/bpf-core-reference-guide/#handling-incompatible-field-and-type-changes
> + */
> +static inline int get_task_state(struct task_struct *t)
> +{
> +       if (bpf_core_field_exists(t->__state))
> +               return BPF_CORE_READ(t, __state);
> +

When building against a pre-5.14 kernel I'm running into a build issue here:

tools/perf/util/bpf_skel/off_cpu.bpf.c:96:31: error: no member named '__
state' in 'struct task_struct'; did you mean 'state'?
       if (bpf_core_field_exists(t->__state))
                                    ^~~~~~~
                                    state

This isn't covered by Andrii's BPF CO-RE reference guide. I have an
#iffy workaround below, but this will be brittle if the 5.14+ kernel
code is backported. Suggestions welcomed :-)

Thanks,
Ian

```
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c
b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 792ae2847080..fb4fd3fbedd6 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -71,10 +71,23 @@ struct {
       __uint(max_entries, 1);
} cgroup_filter SEC(".maps");

+#if LINUX_VERSION_CODE >= KERNEL_VERSION(5,14,0)
/* old kernel task_struct definition */
struct task_struct___old {
       long state;
} __attribute__((preserve_access_index));
+#define TASK_STATE__STATE __state
+#define ALT_TASK_STATE task_struct___old
+#define ALT_TASK_STATE__STATE state
+#else
+/* new kernel task_struct definition */
+struct task_struct___new {
+       long __state;
+} __attribute__((preserve_access_index));
+#define TASK_STATE__STATE state
+#define ALT_TASK_STATE task_struct___new
+#define ALT_TASK_STATE__STATE __state
+#endif

int enabled = 0;
int has_cpu = 0;
@@ -93,14 +106,14 @@ const volatile bool uses_cgroup_v1 = false;
 */
static inline int get_task_state(struct task_struct *t)
{
-       if (bpf_core_field_exists(t->__state))
-               return BPF_CORE_READ(t, __state);
+       if (bpf_core_field_exists(t->TASK_STATE__STATE))
+               return BPF_CORE_READ(t, TASK_STATE__STATE);

-       /* recast pointer to capture task_struct___old type for compiler */
-       struct task_struct___old *t_old = (void *)t;
+       /* recast pointer to capture task_struct___new/old type for compiler */
+       struct ALT_TASK_STATE *t_alt = (void *)t;

-       /* now use old "state" name of the field */
-       return BPF_CORE_READ(t_old, state);
+       /* now use new/old "state" name of the field */
+       return BPF_CORE_READ(t_alt, ALT_TASK_STATE__STATE);
}

static inline __u64 get_cgroup_id(struct task_struct *t)
```




> +       /* recast pointer to capture task_struct___old type for compiler */
> +       struct task_struct___old *t_old = (void *)t;
> +
> +       /* now use old "state" name of the field */
> +       return BPF_CORE_READ(t_old, state);
> +}
> +
> +SEC("tp_btf/sched_switch")
> +int on_switch(u64 *ctx)
> +{
> +       __u64 ts;
> +       int state;
> +       __u32 stack_id;
> +       struct task_struct *prev, *next;
> +       struct tstamp_data *pelem;
> +
> +       if (!enabled)
> +               return 0;
> +
> +       prev = (struct task_struct *)ctx[1];
> +       next = (struct task_struct *)ctx[2];
> +       state = get_task_state(prev);
> +
> +       ts = bpf_ktime_get_ns();
> +
> +       if (prev->flags & PF_KTHREAD)
> +               goto next;
> +       if (state != TASK_INTERRUPTIBLE &&
> +           state != TASK_UNINTERRUPTIBLE)
> +               goto next;
> +
> +       stack_id = bpf_get_stackid(ctx, &stacks,
> +                                  BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
> +
> +       pelem = bpf_task_storage_get(&tstamp, prev, NULL,
> +                                    BPF_LOCAL_STORAGE_GET_F_CREATE);
> +       if (!pelem)
> +               goto next;
> +
> +       pelem->timestamp = ts;
> +       pelem->state = state;
> +       pelem->stack_id = stack_id;
> +
> +next:
> +       pelem = bpf_task_storage_get(&tstamp, next, NULL, 0);
> +
> +       if (pelem && pelem->timestamp) {
> +               struct offcpu_key key = {
> +                       .pid = next->pid,
> +                       .tgid = next->tgid,
> +                       .stack_id = pelem->stack_id,
> +                       .state = pelem->state,
> +               };
> +               __u64 delta = ts - pelem->timestamp;
> +               __u64 *total;
> +
> +               total = bpf_map_lookup_elem(&off_cpu, &key);
> +               if (total)
> +                       *total += delta;
> +               else
> +                       bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
> +
> +               /* prevent to reuse the timestamp later */
> +               pelem->timestamp = 0;
> +       }
> +
> +       return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
> new file mode 100644
> index 000000000000..375d03c424ea
> --- /dev/null
> +++ b/tools/perf/util/off_cpu.h
> @@ -0,0 +1,24 @@
> +#ifndef PERF_UTIL_OFF_CPU_H
> +#define PERF_UTIL_OFF_CPU_H
> +
> +struct evlist;
> +struct perf_session;
> +
> +#define OFFCPU_EVENT  "offcpu-time"
> +
> +#ifdef HAVE_BPF_SKEL
> +int off_cpu_prepare(struct evlist *evlist);
> +int off_cpu_write(struct perf_session *session);
> +#else
> +static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused)
> +{
> +       return -1;
> +}
> +
> +static inline int off_cpu_write(struct perf_session *session __maybe_unused)
> +{
> +       return -1;
> +}
> +#endif
> +
> +#endif  /* PERF_UTIL_OFF_CPU_H */
> --
> 2.36.1.124.g0e6072fb45-goog
>


* Re: [PATCH 2/6] perf record: Enable off-cpu analysis with BPF
  2022-06-01  0:00   ` Ian Rogers
@ 2022-06-01  6:01     ` Namhyung Kim
  2022-06-02 22:36       ` Namhyung Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Namhyung Kim @ 2022-06-01  6:01 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Andrii Nakryiko,
	Peter Zijlstra, Hao Luo, bpf, Ingo Molnar, LKML, Andi Kleen,
	Song Liu, Milian Wolff, linux-perf-users, Blake Jones

On Tue, May 31, 2022 at 5:00 PM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
[SNIP]
> > +
> > +/*
> > + * Old kernel used to call it task_struct->state and now it's '__state'.
> > + * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
> > + *
> > + * https://nakryiko.com/posts/bpf-core-reference-guide/#handling-incompatible-field-and-type-changes
> > + */
> > +static inline int get_task_state(struct task_struct *t)
> > +{
> > +       if (bpf_core_field_exists(t->__state))
> > +               return BPF_CORE_READ(t, __state);
> > +
>
> When building against a pre-5.14 kernel I'm running into a build issue here:
>
> tools/perf/util/bpf_skel/off_cpu.bpf.c:96:31: error: no member named '__
> state' in 'struct task_struct'; did you mean 'state'?
>        if (bpf_core_field_exists(t->__state))
>                                     ^~~~~~~
>                                     state
>
> This isn't covered by Andrii's BPF CO-RE reference guide. I have an
> #iffy workaround below, but this will be brittle if the 5.14+ kernel
> code is backported. Suggestions welcomed :-)

Thanks for the fix.  I think we should not guess the field name in the
current task_struct, and instead check both versions separately.
I'm afraid the version check won't work with some backported
kernels.  But do we care?

Thanks,
Namhyung


* Re: [PATCH 2/6] perf record: Enable off-cpu analysis with BPF
  2022-06-01  6:01     ` Namhyung Kim
@ 2022-06-02 22:36       ` Namhyung Kim
  0 siblings, 0 replies; 20+ messages in thread
From: Namhyung Kim @ 2022-06-02 22:36 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Andrii Nakryiko,
	Peter Zijlstra, Hao Luo, bpf, Ingo Molnar, LKML, Andi Kleen,
	Song Liu, Milian Wolff, linux-perf-users, Blake Jones

On Tue, May 31, 2022 at 11:01:14PM -0700, Namhyung Kim wrote:
> On Tue, May 31, 2022 at 5:00 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Wed, May 18, 2022 at 3:47 PM Namhyung Kim <namhyung@kernel.org> wrote:
> [SNIP]
> > > +
> > > +/*
> > > + * Old kernel used to call it task_struct->state and now it's '__state'.
> > > + * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
> > > + *
> > > + * https://nakryiko.com/posts/bpf-core-reference-guide/#handling-incompatible-field-and-type-changes
> > > + */
> > > +static inline int get_task_state(struct task_struct *t)
> > > +{
> > > +       if (bpf_core_field_exists(t->__state))
> > > +               return BPF_CORE_READ(t, __state);
> > > +
> >
> > When building against a pre-5.14 kernel I'm running into a build issue here:
> >
> > tools/perf/util/bpf_skel/off_cpu.bpf.c:96:31: error: no member named '__
> > state' in 'struct task_struct'; did you mean 'state'?
> >        if (bpf_core_field_exists(t->__state))
> >                                     ^~~~~~~
> >                                     state
> >
> > This isn't covered by Andrii's BPF CO-RE reference guide. I have an
> > #iffy workaround below, but this will be brittle if the 5.14+ kernel
> > code is backported. Suggestions welcomed :-)
> 
> Thanks for the fix.  I think we should not guess the field name in the
> current task_struct, and instead check both versions separately.
> I'm afraid the version check won't work with some backported
> kernels.  But do we care?


What about this instead?

----8<----
From a621f836f00e11942e5d39a735ec8f7a21962d6a Mon Sep 17 00:00:00 2001
From: Namhyung Kim <namhyung@kernel.org>
Date: Tue, 31 May 2022 14:25:05 -0700
Subject: [PATCH] perf tools: Fix a build failure in off-cpu BPF program on old kernels

Old kernels have a task_struct which contains a "state" field.  While
get_task_state() in the BPF code handles that, it assumed the kernel it
was built against had the new definition, which caused a build error on
old kernels.

We should not assume either definition and should access the field
carefully.

Reported-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 792ae2847080..cc6d7fd55118 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -71,6 +71,11 @@ struct {
 	__uint(max_entries, 1);
 } cgroup_filter SEC(".maps");
 
+/* new kernel task_struct definition */
+struct task_struct___new {
+	long __state;
+} __attribute__((preserve_access_index));
+
 /* old kernel task_struct definition */
 struct task_struct___old {
 	long state;
@@ -93,14 +98,17 @@ const volatile bool uses_cgroup_v1 = false;
  */
 static inline int get_task_state(struct task_struct *t)
 {
-	if (bpf_core_field_exists(t->__state))
-		return BPF_CORE_READ(t, __state);
+	/* recast pointer to capture new type for compiler */
+	struct task_struct___new *t_new = (void *)t;
 
-	/* recast pointer to capture task_struct___old type for compiler */
-	struct task_struct___old *t_old = (void *)t;
+	if (bpf_core_field_exists(t_new->__state)) {
+		return BPF_CORE_READ(t_new, __state);
+	} else {
+		/* recast pointer to capture old type for compiler */
+		struct task_struct___old *t_old = (void *)t;
 
-	/* now use old "state" name of the field */
-	return BPF_CORE_READ(t_old, state);
+		return BPF_CORE_READ(t_old, state);
+	}
 }
 
 static inline __u64 get_cgroup_id(struct task_struct *t)
-- 
2.36.1.255.ge46751e96f-goog



Thread overview: 20+ messages
2022-05-18 22:47 [RFC 0/6] perf record: Implement off-cpu profiling with BPF (v3) Namhyung Kim
2022-05-18 22:47 ` [PATCH 1/6] perf report: Do not extend sample type of bpf-output event Namhyung Kim
2022-05-18 22:47 ` [PATCH 2/6] perf record: Enable off-cpu analysis with BPF Namhyung Kim
2022-05-19  3:58   ` Ian Rogers
2022-05-19 21:01     ` Namhyung Kim
2022-06-01  0:00   ` Ian Rogers
2022-06-01  6:01     ` Namhyung Kim
2022-06-02 22:36       ` Namhyung Kim
2022-05-18 22:47 ` [PATCH 3/6] perf record: Implement basic filtering for off-cpu Namhyung Kim
2022-05-19  4:02   ` Ian Rogers
2022-05-19 21:02     ` Namhyung Kim
2022-05-25 11:27       ` Arnaldo Carvalho de Melo
2022-05-25 11:44         ` Arnaldo Carvalho de Melo
2022-05-25 14:06           ` Arnaldo Carvalho de Melo
2022-05-18 22:47 ` [PATCH 4/6] perf record: Handle argument change in sched_switch Namhyung Kim
2022-05-19  4:04   ` Ian Rogers
2022-05-18 22:47 ` [PATCH 5/6] perf record: Add cgroup support for off-cpu profiling Namhyung Kim
2022-05-19  4:07   ` Ian Rogers
2022-05-18 22:47 ` [PATCH 6/6] perf test: Add a basic offcpu profiling test Namhyung Kim
2022-05-19  4:08   ` Ian Rogers
