* [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1)
@ 2022-04-22  5:33 Namhyung Kim
  2022-04-22  5:33 ` [PATCH 1/4] perf report: Do not extend sample type of bpf-output event Namhyung Kim
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Namhyung Kim @ 2022-04-22  5:33 UTC
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, bpf, linux-perf-users, Blake Jones

Hello,

This is the first version of off-cpu profiling support.  Together with
(PMU-based) cpu profiling, it can show a holistic view of the performance
characteristics of your application or system.

With BPF, it can aggregate scheduling stats for the tasks and/or states
of interest and convert the data into perf sample records.  I chose the
bpf-output event, which is a software event meant to be consumed by BPF
programs, and renamed it to "offcpu-time".  So it requires no change on
the perf report side except for setting the sample type of the
bpf-output event.
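
Concretely, the event is set up like this (excerpted from the
off_cpu_config() function in patch 2 below):

  struct perf_event_attr attr = {
  	.type	= PERF_TYPE_SOFTWARE,
  	.config	= PERF_COUNT_SW_BPF_OUTPUT,
  	.size	= sizeof(attr), /* to capture ABI version */
  };
  ...
  /* off-cpu analysis depends on stack trace */
  evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;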

Basically it collects the userspace callstack for tasks, as that is
what users mostly want.  Maybe we can add support for kernel stacks,
but I'm afraid it'd cause more overhead.  So the offcpu-time event will
always have callchains regardless of the command line option, and it
enables the children mode in perf report by default.

It adds the --off-cpu option to perf record like below:

  $ sudo perf record -a --off-cpu -- perf bench sched messaging -l 1000
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

     Total time: 1.518 [sec]
  [ perf record: Woken up 9 times to write data ]
  [ perf record: Captured and wrote 5.313 MB perf.data (53341 samples) ]

Then we can run perf report as usual.  The options below are just to
skip the less important parts.

  $ sudo perf report --stdio --call-graph=no --percent-limit=2
  # To display the perf.data header info, please use --header/--header-only options.
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 52K of event 'cycles'
  # Event count (approx.): 42522453276
  #
  # Children      Self  Command          Shared Object     Symbol                            
  # ........  ........  ...............  ................  ..................................
  #
       9.58%     9.58%  sched-messaging  [kernel.vmlinux]  [k] audit_filter_rules.constprop.0
       8.46%     8.46%  sched-messaging  [kernel.vmlinux]  [k] audit_filter_syscall
       4.54%     4.54%  sched-messaging  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
       2.94%     2.94%  sched-messaging  [kernel.vmlinux]  [k] unix_stream_read_generic
       2.45%     2.45%  sched-messaging  [kernel.vmlinux]  [k] memcg_slab_free_hook
  
  
  # Samples: 983  of event 'offcpu-time'
  # Event count (approx.): 684538813464
  #
  # Children      Self  Command          Shared Object         Symbol                    
  # ........  ........  ...............  ....................  ..........................
  #
      83.86%     0.00%  sched-messaging  libc-2.33.so          [.] __libc_start_main
      83.86%     0.00%  sched-messaging  perf                  [.] cmd_bench
      83.86%     0.00%  sched-messaging  perf                  [.] main
      83.86%     0.00%  sched-messaging  perf                  [.] run_builtin
      83.64%     0.00%  sched-messaging  perf                  [.] bench_sched_messaging
      41.35%    41.35%  sched-messaging  libpthread-2.33.so    [.] __read
      38.88%    38.88%  sched-messaging  libpthread-2.33.so    [.] __write
       3.41%     3.41%  sched-messaging  libc-2.33.so          [.] __poll

The perf bench sched messaging workload created 400 processes to
send/receive messages through unix sockets.  It spent a large portion
of cpu cycles on the audit filter and on reading/copying the messages,
while most of the offcpu-time was in read and write calls.

You can get the code from 'perf/offcpu-v1' branch in my tree at

  git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Enjoy! :)

Thanks,
Namhyung


Namhyung Kim (4):
  perf report: Do not extend sample type of bpf-output event
  perf record: Enable off-cpu analysis with BPF
  perf record: Implement basic filtering for off-cpu
  perf record: Handle argument change in sched_switch

 tools/perf/Makefile.perf               |   1 +
 tools/perf/builtin-record.c            |  21 ++
 tools/perf/util/Build                  |   1 +
 tools/perf/util/bpf_off_cpu.c          | 301 +++++++++++++++++++++++++
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 214 ++++++++++++++++++
 tools/perf/util/evsel.c                |   4 +-
 6 files changed, 540 insertions(+), 2 deletions(-)
 create mode 100644 tools/perf/util/bpf_off_cpu.c
 create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c


base-commit: 41204da4c16071be9090940b18f566832d46becc
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog



* [PATCH 1/4] perf report: Do not extend sample type of bpf-output event
  2022-04-22  5:33 [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1) Namhyung Kim
@ 2022-04-22  5:33 ` Namhyung Kim
  2022-04-22  5:33 ` [PATCH 2/4] perf record: Enable off-cpu analysis with BPF Namhyung Kim
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Namhyung Kim @ 2022-04-22  5:33 UTC
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, bpf, linux-perf-users, Blake Jones

Currently evsel__new_idx() sets more sample_type bits when it finds a
BPF-output event.  But it should honor what's recorded in the perf
data file rather than blindly setting the bits.  Otherwise it could
lead to a parse error when the data was recorded with a modified
sample_type (for example, the offcpu-time event in this series uses
its own sample_type, so unconditionally adding bits at report time
would mismatch the on-disk records).

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/evsel.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 2a1729e7aee4..5f947adc16cb 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -269,8 +269,8 @@ struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
 		return NULL;
 	evsel__init(evsel, attr, idx);
 
-	if (evsel__is_bpf_output(evsel)) {
-		evsel->core.attr.sample_type |= (PERF_SAMPLE_RAW | PERF_SAMPLE_TIME |
+	if (evsel__is_bpf_output(evsel) && !attr->sample_type) {
+		evsel->core.attr.sample_type = (PERF_SAMPLE_RAW | PERF_SAMPLE_TIME |
 					    PERF_SAMPLE_CPU | PERF_SAMPLE_PERIOD),
 		evsel->core.attr.sample_period = 1;
 	}
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog



* [PATCH 2/4] perf record: Enable off-cpu analysis with BPF
  2022-04-22  5:33 [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1) Namhyung Kim
  2022-04-22  5:33 ` [PATCH 1/4] perf report: Do not extend sample type of bpf-output event Namhyung Kim
@ 2022-04-22  5:33 ` Namhyung Kim
  2022-04-22  5:34 ` [PATCH 3/4] perf record: Implement basic filtering for off-cpu Namhyung Kim
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Namhyung Kim @ 2022-04-22  5:33 UTC
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, bpf, linux-perf-users, Blake Jones

Add the --off-cpu option to enable off-cpu profiling with BPF.  It
uses a bpf-output event renamed to "offcpu-time".  Samples will be
synthesized at the end of the record session using data from a BPF
map which contains the aggregated off-cpu time at context switches.
So it needs root privileges to do the off-cpu profiling.

Each sample will have a separate user stacktrace, so kernel threads
are skipped.  The sample ip will be set from the stacktrace and other
sample data will be updated accordingly.  Currently it only handles
some basic sample types.

The sample timestamp is set to a dummy value just so it doesn't
interfere with other events during sorting.  It has a very big initial
value which is incremented while processing each sample.
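
The initial value used in bpf_off_cpu.c below is:

  #define OFF_CPU_TIMESTAMP  (~0ull << 32)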

A good thing is that it can be used together with regular profiling
like cpu cycles.  If you don't want that, you can use a dummy event to
enable off-cpu profiling only.
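
For example (an illustrative command, not from a tested session):

  $ sudo perf record -e dummy --off-cpu -- ./myapp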

Example output:
  $ sudo perf record --off-cpu perf bench sched messaging -l 1000

  $ sudo perf report --stdio --call-graph=no
  # Total Lost Samples: 0
  #
  # Samples: 41K of event 'cycles'
  # Event count (approx.): 42137343851
  ...

  # Samples: 1K of event 'offcpu-time'
  # Event count (approx.): 587990831640
  #
  # Children      Self  Command          Shared Object       Symbol
  # ........  ........  ...............  ..................  .........................
  #
      81.66%     0.00%  sched-messaging  libc-2.33.so        [.] __libc_start_main
      81.66%     0.00%  sched-messaging  perf                [.] cmd_bench
      81.66%     0.00%  sched-messaging  perf                [.] main
      81.66%     0.00%  sched-messaging  perf                [.] run_builtin
      81.43%     0.00%  sched-messaging  perf                [.] bench_sched_messaging
      40.86%    40.86%  sched-messaging  libpthread-2.33.so  [.] __read
      37.66%    37.66%  sched-messaging  libpthread-2.33.so  [.] __write
       2.91%     2.91%  sched-messaging  libc-2.33.so        [.] __poll
  ...

As you can see, it spent most of the off-cpu time in read and write in
bench_sched_messaging().  The --call-graph=no option was added just to
make the output concise here.

It uses the perf hooks facility to control the BPF program during the
record session rather than adding new BPF/off-cpu specific calls.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/Makefile.perf               |   1 +
 tools/perf/builtin-record.c            |  21 +++
 tools/perf/util/Build                  |   1 +
 tools/perf/util/bpf_off_cpu.c          | 208 +++++++++++++++++++++++++
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 137 ++++++++++++++++
 5 files changed, 368 insertions(+)
 create mode 100644 tools/perf/util/bpf_off_cpu.c
 create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 69473a836bae..ce333327182a 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1041,6 +1041,7 @@ SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
 SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h
+SKELETONS += $(SKEL_OUT)/off_cpu.skel.h
 
 $(SKEL_TMP_OUT) $(LIBBPF_OUTPUT):
 	$(Q)$(MKDIR) -p $@
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ba74fab02e62..3d24d528ba8e 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -49,6 +49,7 @@
 #include "util/clockid.h"
 #include "util/pmu-hybrid.h"
 #include "util/evlist-hybrid.h"
+#include "util/off_cpu.h"
 #include "asm/bug.h"
 #include "perf.h"
 #include "cputopo.h"
@@ -162,6 +163,7 @@ struct record {
 	bool			buildid_mmap;
 	bool			timestamp_filename;
 	bool			timestamp_boundary;
+	bool			off_cpu;
 	struct switch_output	switch_output;
 	unsigned long long	samples;
 	unsigned long		output_max_size;	/* = 0: unlimited */
@@ -903,6 +905,11 @@ static int record__config_text_poke(struct evlist *evlist)
 	return 0;
 }
 
+static int record__config_off_cpu(struct record *rec)
+{
+	return off_cpu_prepare(rec->evlist);
+}
+
 static bool record__kcore_readable(struct machine *machine)
 {
 	char kcore[PATH_MAX];
@@ -2596,6 +2603,9 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	} else
 		status = err;
 
+	if (rec->off_cpu)
+		rec->bytes_written += off_cpu_write(rec->session);
+
 	record__synthesize(rec, true);
 	/* this will be recalculated during process_buildids() */
 	rec->samples = 0;
@@ -3320,6 +3330,9 @@ static struct option __record_options[] = {
 	OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
 			    "write collected trace data into several data files using parallel threads",
 			    record__parse_threads),
+#ifdef HAVE_BPF_SKEL
+	OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
+#endif
 	OPT_END()
 };
 
@@ -3968,6 +3981,14 @@ int cmd_record(int argc, const char **argv)
 		}
 	}
 
+	if (rec->off_cpu) {
+		err = record__config_off_cpu(rec);
+		if (err) {
+			pr_err("record__config_off_cpu failed, error %d\n", err);
+			goto out;
+		}
+	}
+
 	if (record_opts__config(&rec->opts)) {
 		err = -EINVAL;
 		goto out;
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 9a7209a99e16..a51267d88ca9 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -147,6 +147,7 @@ perf-$(CONFIG_LIBBPF) += bpf_map.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_ftrace.o
+perf-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o
 perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
 perf-$(CONFIG_LIBELF) += symbol-elf.o
 perf-$(CONFIG_LIBELF) += probe-file.o
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
new file mode 100644
index 000000000000..1f87d2a9b86d
--- /dev/null
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -0,0 +1,208 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "util/bpf_counter.h"
+#include "util/debug.h"
+#include "util/evsel.h"
+#include "util/evlist.h"
+#include "util/off_cpu.h"
+#include "util/perf-hooks.h"
+#include "util/session.h"
+#include <bpf/bpf.h>
+
+#include "bpf_skel/off_cpu.skel.h"
+
+#define MAX_STACKS  32
+/* we don't need an actual timestamp, just want to put the samples last */
+#define OFF_CPU_TIMESTAMP  (~0ull << 32)
+
+static struct off_cpu_bpf *skel;
+
+struct off_cpu_key {
+	u32 pid;
+	u32 tgid;
+	u32 stack_id;
+	u32 state;
+	u64 cgroup_id;
+};
+
+union off_cpu_data {
+	struct perf_event_header hdr;
+	u64 array[1024 / sizeof(u64)];
+};
+
+static int off_cpu_config(struct evlist *evlist)
+{
+	struct evsel *evsel;
+	struct perf_event_attr attr = {
+		.type	= PERF_TYPE_SOFTWARE,
+		.config = PERF_COUNT_SW_BPF_OUTPUT,
+		.size	= sizeof(attr), /* to capture ABI version */
+	};
+
+	evsel = evsel__new(&attr);
+	if (!evsel)
+		return -ENOMEM;
+
+	evsel->core.attr.freq = 1;
+	evsel->core.attr.sample_period = 1;
+	/* off-cpu analysis depends on stack trace */
+	evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
+
+	evlist__add(evlist, evsel);
+
+	free(evsel->name);
+	evsel->name = strdup("offcpu-time");
+	if (evsel->name == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void off_cpu_start(void *arg __maybe_unused)
+{
+	skel->bss->enabled = 1;
+}
+
+static void off_cpu_finish(void *arg __maybe_unused)
+{
+	skel->bss->enabled = 0;
+	off_cpu_bpf__destroy(skel);
+}
+
+int off_cpu_prepare(struct evlist *evlist)
+{
+	int err;
+
+	if (off_cpu_config(evlist) < 0) {
+		pr_err("Failed to config off-cpu BPF event\n");
+		return -1;
+	}
+
+	set_max_rlimit();
+
+	skel = off_cpu_bpf__open_and_load();
+	if (!skel) {
+		pr_err("Failed to open off-cpu skeleton\n");
+		return -1;
+	}
+
+	err = off_cpu_bpf__attach(skel);
+	if (err) {
+		pr_err("Failed to attach off-cpu skeleton\n");
+		goto out;
+	}
+
+	if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
+	    perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
+		pr_err("Failed to set off-cpu perf hooks\n");
+		goto out;
+	}
+
+	return 0;
+
+out:
+	off_cpu_bpf__destroy(skel);
+	return -1;
+}
+
+int off_cpu_write(struct perf_session *session)
+{
+	int bytes = 0, size;
+	int fd, stack;
+	bool found = false;
+	u64 sample_type, val, sid = 0;
+	struct evsel *evsel;
+	struct perf_data_file *file = &session->data->file;
+	struct off_cpu_key prev, key;
+	union off_cpu_data data = {
+		.hdr = {
+			.type = PERF_RECORD_SAMPLE,
+			.misc = PERF_RECORD_MISC_USER,
+		},
+	};
+	u64 tstamp = OFF_CPU_TIMESTAMP;
+
+	skel->bss->enabled = 0;
+
+	evlist__for_each_entry(session->evlist, evsel) {
+		if (!strcmp(evsel__name(evsel), "offcpu-time")) {
+			found = true;
+			break;
+		}
+	}
+
+	if (!found) {
+		pr_err("offcpu-time evsel not found\n");
+		return 0;
+	}
+
+	sample_type = evsel->core.attr.sample_type;
+
+	if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
+		if (evsel->core.id)
+			sid = evsel->core.id[0];
+	}
+
+	fd = bpf_map__fd(skel->maps.off_cpu);
+	stack = bpf_map__fd(skel->maps.stacks);
+	memset(&prev, 0, sizeof(prev));
+
+	while (!bpf_map_get_next_key(fd, &prev, &key)) {
+		int n = 1;  /* start from perf_event_header */
+		int ip_pos = -1;
+
+		bpf_map_lookup_elem(fd, &key, &val);
+
+		if (sample_type & PERF_SAMPLE_IDENTIFIER)
+			data.array[n++] = sid;
+		if (sample_type & PERF_SAMPLE_IP) {
+			ip_pos = n;
+			data.array[n++] = 0;  /* will be updated */
+		}
+		if (sample_type & PERF_SAMPLE_TID)
+			data.array[n++] = (u64)key.pid << 32 | key.tgid;
+		if (sample_type & PERF_SAMPLE_TIME)
+			data.array[n++] = tstamp;
+		if (sample_type & PERF_SAMPLE_ID)
+			data.array[n++] = sid;
+		if (sample_type & PERF_SAMPLE_CPU)
+			data.array[n++] = 0;
+		if (sample_type & PERF_SAMPLE_PERIOD)
+			data.array[n++] = val;
+		if (sample_type & PERF_SAMPLE_CALLCHAIN) {
+			int len = 0;
+
+			/* data.array[n] is callchain->nr (updated later) */
+			data.array[n + 1] = PERF_CONTEXT_USER;
+			data.array[n + 2] = 0;
+
+			bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
+			while (data.array[n + 2 + len])
+				len++;
+
+			/* update length of callchain */
+			data.array[n] = len + 1;
+
+			/* update sample ip with the first callchain entry */
+			if (ip_pos >= 0)
+				data.array[ip_pos] = data.array[n + 2];
+
+			/* calculate sample callchain data array length */
+			n += len + 2;
+		}
+		/* TODO: handle more sample types */
+
+		size = n * sizeof(u64);
+		data.hdr.size = size;
+		bytes += size;
+
+		if (perf_data_file__write(file, &data, size) < 0) {
+			pr_err("failed to write perf data, error: %m\n");
+			return bytes;
+		}
+
+		prev = key;
+		/* increase dummy timestamp to sort later samples */
+		tstamp++;
+	}
+	return bytes;
+}
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
new file mode 100644
index 000000000000..2bc6f7cc59ea
--- /dev/null
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -0,0 +1,137 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (c) 2022 Google
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+/* task->flags for off-cpu analysis */
+#define PF_KTHREAD   0x00200000  /* I am a kernel thread */
+
+/* task->state for off-cpu analysis */
+#define TASK_INTERRUPTIBLE	0x0001
+#define TASK_UNINTERRUPTIBLE	0x0002
+
+#define MAX_STACKS   32
+#define MAX_ENTRIES  102400
+
+struct tstamp_data {
+	__u32 stack_id;
+	__u32 state;
+	__u64 timestamp;
+};
+
+struct offcpu_key {
+	__u32 pid;
+	__u32 tgid;
+	__u32 stack_id;
+	__u32 state;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, MAX_STACKS * sizeof(__u64));
+	__uint(max_entries, MAX_ENTRIES);
+} stacks SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct tstamp_data));
+	__uint(max_entries, MAX_ENTRIES);
+} tstamp SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(struct offcpu_key));
+	__uint(value_size, sizeof(__u64));
+	__uint(max_entries, MAX_ENTRIES);
+} off_cpu SEC(".maps");
+
+/* old kernel task_struct definition */
+struct task_struct___old {
+	long state;
+} __attribute__((preserve_access_index));
+
+int enabled = 0;
+
+/*
+ * task_struct->state was recently renamed to __state, which is an
+ * incompatible change.  Use the BPF CO-RE "ignored suffix rule" to handle it:
+ *
+ * https://nakryiko.com/posts/bpf-core-reference-guide/#handling-incompatible-field-and-type-changes
+ */
+static inline int get_task_state(struct task_struct *t)
+{
+	if (bpf_core_field_exists(t->__state))
+		return BPF_CORE_READ(t, __state);
+
+	/* recast pointer to capture task_struct___old type for compiler */
+	struct task_struct___old *t_old = (void *)t;
+
+	/* now use old "state" name of the field */
+	return BPF_CORE_READ(t_old, state);
+}
+
+SEC("tp_btf/sched_switch")
+int on_switch(u64 *ctx)
+{
+	__u64 ts;
+	int state;
+	__u32 pid, stack_id;
+	struct task_struct *prev, *next;
+	struct tstamp_data elem, *pelem;
+
+	if (!enabled)
+		return 0;
+
+	prev = (struct task_struct *)ctx[1];
+	next = (struct task_struct *)ctx[2];
+	state = get_task_state(prev);
+
+	ts = bpf_ktime_get_ns();
+
+	if (prev->flags & PF_KTHREAD)
+		goto next;
+	if (state != TASK_INTERRUPTIBLE &&
+	    state != TASK_UNINTERRUPTIBLE)
+		goto next;
+
+	stack_id = bpf_get_stackid(ctx, &stacks,
+				   BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
+
+	elem.timestamp = ts;
+	elem.state = state;
+	elem.stack_id = stack_id;
+
+	pid = prev->pid;
+	bpf_map_update_elem(&tstamp, &pid, &elem, BPF_ANY);
+
+next:
+	pid = next->pid;
+	pelem = bpf_map_lookup_elem(&tstamp, &pid);
+
+	if (pelem) {
+		struct offcpu_key key = {
+			.pid = next->pid,
+			.tgid = next->tgid,
+			.stack_id = pelem->stack_id,
+			.state = pelem->state,
+		};
+		__u64 delta = ts - pelem->timestamp;
+		__u64 *total;
+
+		total = bpf_map_lookup_elem(&off_cpu, &key);
+		if (total)
+			*total += delta;
+		else
+			bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
+
+		bpf_map_delete_elem(&tstamp, &pid);
+	}
+
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog



* [PATCH 3/4] perf record: Implement basic filtering for off-cpu
  2022-04-22  5:33 [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1) Namhyung Kim
  2022-04-22  5:33 ` [PATCH 1/4] perf report: Do not extend sample type of bpf-output event Namhyung Kim
  2022-04-22  5:33 ` [PATCH 2/4] perf record: Enable off-cpu analysis with BPF Namhyung Kim
@ 2022-04-22  5:34 ` Namhyung Kim
  2022-04-22  5:34 ` [PATCH 4/4] perf record: Handle argument change in sched_switch Namhyung Kim
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Namhyung Kim @ 2022-04-22  5:34 UTC
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, bpf, linux-perf-users, Blake Jones

It should honor cpu and task filtering with the -a, -C or -p, -t options.
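
For example (illustrative commands, not captured output):

  # limit off-cpu profiling to CPUs 2 and 3
  $ sudo perf record -C 2,3 --off-cpu -- sleep 1

  # limit off-cpu profiling to an existing process
  $ sudo perf record -p $(pidof myapp) --off-cpu -- sleep 1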

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-record.c            |  2 +-
 tools/perf/util/bpf_off_cpu.c          | 78 +++++++++++++++++++++++---
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 52 +++++++++++++++--
 3 files changed, 119 insertions(+), 13 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 3d24d528ba8e..592384e058c3 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -907,7 +907,7 @@ static int record__config_text_poke(struct evlist *evlist)
 
 static int record__config_off_cpu(struct record *rec)
 {
-	return off_cpu_prepare(rec->evlist);
+	return off_cpu_prepare(rec->evlist, &rec->opts.target);
 }
 
 static bool record__kcore_readable(struct machine *machine)
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 1f87d2a9b86d..89f36229041d 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -6,6 +6,9 @@
 #include "util/off_cpu.h"
 #include "util/perf-hooks.h"
 #include "util/session.h"
+#include "util/target.h"
+#include "util/cpumap.h"
+#include "util/thread_map.h"
 #include <bpf/bpf.h>
 
 #include "bpf_skel/off_cpu.skel.h"
@@ -57,8 +60,23 @@ static int off_cpu_config(struct evlist *evlist)
 	return 0;
 }
 
-static void off_cpu_start(void *arg __maybe_unused)
+static void off_cpu_start(void *arg)
 {
+	struct evlist *evlist = arg;
+
+	/* update task filter for the given workload */
+	if (!skel->bss->has_cpu && !skel->bss->has_task &&
+	    perf_thread_map__pid(evlist->core.threads, 0) != -1) {
+		int fd;
+		u32 pid;
+		u8 val = 1;
+
+		skel->bss->has_task = 1;
+		fd = bpf_map__fd(skel->maps.task_filter);
+		pid = perf_thread_map__pid(evlist->core.threads, 0);
+		bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
+	}
+
 	skel->bss->enabled = 1;
 }
 
@@ -68,31 +86,75 @@ static void off_cpu_finish(void *arg __maybe_unused)
 	off_cpu_bpf__destroy(skel);
 }
 
-int off_cpu_prepare(struct evlist *evlist)
+int off_cpu_prepare(struct evlist *evlist, struct target *target)
 {
-	int err;
+	int err, fd, i;
+	int ncpus = 1, ntasks = 1;
 
 	if (off_cpu_config(evlist) < 0) {
 		pr_err("Failed to config off-cpu BPF event\n");
 		return -1;
 	}
 
-	set_max_rlimit();
-
-	skel = off_cpu_bpf__open_and_load();
+	skel = off_cpu_bpf__open();
 	if (!skel) {
 		pr_err("Failed to open off-cpu skeleton\n");
 		return -1;
 	}
 
+	/* don't need to set cpu filter for system-wide mode */
+	if (target->cpu_list) {
+		ncpus = perf_cpu_map__nr(evlist->core.user_requested_cpus);
+		bpf_map__set_max_entries(skel->maps.cpu_filter, ncpus);
+	}
+
+	if (target__has_task(target)) {
+		ntasks = perf_thread_map__nr(evlist->core.threads);
+		bpf_map__set_max_entries(skel->maps.task_filter, ntasks);
+	}
+
+	set_max_rlimit();
+
+	err = off_cpu_bpf__load(skel);
+	if (err) {
+		pr_err("Failed to load off-cpu skeleton\n");
+		goto out;
+	}
+
+	if (target->cpu_list) {
+		u32 cpu;
+		u8 val = 1;
+
+		skel->bss->has_cpu = 1;
+		fd = bpf_map__fd(skel->maps.cpu_filter);
+
+		for (i = 0; i < ncpus; i++) {
+			cpu = perf_cpu_map__cpu(evlist->core.user_requested_cpus, i).cpu;
+			bpf_map_update_elem(fd, &cpu, &val, BPF_ANY);
+		}
+	}
+
+	if (target__has_task(target)) {
+		u32 pid;
+		u8 val = 1;
+
+		skel->bss->has_task = 1;
+		fd = bpf_map__fd(skel->maps.task_filter);
+
+		for (i = 0; i < ntasks; i++) {
+			pid = perf_thread_map__pid(evlist->core.threads, i);
+			bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
+		}
+	}
+
 	err = off_cpu_bpf__attach(skel);
 	if (err) {
 		pr_err("Failed to attach off-cpu skeleton\n");
 		goto out;
 	}
 
-	if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
-	    perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
+	if (perf_hooks__set_hook("record_start", off_cpu_start, evlist) ||
+	    perf_hooks__set_hook("record_end", off_cpu_finish, evlist)) {
 		pr_err("Failed to attach off-cpu skeleton\n");
 		goto out;
 	}
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 2bc6f7cc59ea..27425fe361e2 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -49,12 +49,28 @@ struct {
 	__uint(max_entries, MAX_ENTRIES);
 } off_cpu SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u8));
+	__uint(max_entries, 1);
+} cpu_filter SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u8));
+	__uint(max_entries, 1);
+} task_filter SEC(".maps");
+
 /* old kernel task_struct definition */
 struct task_struct___old {
 	long state;
 } __attribute__((preserve_access_index));
 
 int enabled = 0;
+int has_cpu = 0;
+int has_task = 0;
 
 /*
  * task_struct->state was recently renamed to __state, which is an
@@ -74,6 +90,37 @@ static inline int get_task_state(struct task_struct *t)
 	return BPF_CORE_READ(t_old, state);
 }
 
+static inline int can_record(struct task_struct *t, int state)
+{
+	if (has_cpu) {
+		__u32 cpu = bpf_get_smp_processor_id();
+		__u8 *ok;
+
+		ok = bpf_map_lookup_elem(&cpu_filter, &cpu);
+		if (!ok)
+			return 0;
+	}
+
+	if (has_task) {
+		__u8 *ok;
+		__u32 pid = t->pid;
+
+		ok = bpf_map_lookup_elem(&task_filter, &pid);
+		if (!ok)
+			return 0;
+	}
+
+	/* kernel threads don't have user stack */
+	if (t->flags & PF_KTHREAD)
+		return 0;
+
+	if (state != TASK_INTERRUPTIBLE &&
+	    state != TASK_UNINTERRUPTIBLE)
+		return 0;
+
+	return 1;
+}
+
 SEC("tp_btf/sched_switch")
 int on_switch(u64 *ctx)
 {
@@ -92,10 +139,7 @@ int on_switch(u64 *ctx)
 
 	ts = bpf_ktime_get_ns();
 
-	if (prev->flags & PF_KTHREAD)
-		goto next;
-	if (state != TASK_INTERRUPTIBLE &&
-	    state != TASK_UNINTERRUPTIBLE)
+	if (!can_record(prev, state))
 		goto next;
 
 	stack_id = bpf_get_stackid(ctx, &stacks,
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog



* [PATCH 4/4] perf record: Handle argument change in sched_switch
  2022-04-22  5:33 [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1) Namhyung Kim
                   ` (2 preceding siblings ...)
  2022-04-22  5:34 ` [PATCH 3/4] perf record: Implement basic filtering for off-cpu Namhyung Kim
@ 2022-04-22  5:34 ` Namhyung Kim
  2022-04-22 10:11 ` [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1) Jiri Olsa
  2022-04-22 10:20 ` Milian Wolff
  5 siblings, 0 replies; 19+ messages in thread
From: Namhyung Kim @ 2022-04-22  5:34 UTC
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, bpf, linux-perf-users, Blake Jones

Recently the sched_switch tracepoint added a new argument for
prev_state, but it's hard to handle the change in a BPF program.
Instead, we can check the function prototype in BTF before loading
the program.

Thus I made two copies of the tracepoint handler and select one based
on the BTF info.
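
For reference, the two prototypes (also quoted in comments in the code
below) are:

  /* old */
  TP_PROTO(bool preempt, struct task_struct *prev,
           struct task_struct *next)

  /* new */
  TP_PROTO(bool preempt, int prev_state,
           struct task_struct *prev, struct task_struct *next)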

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/bpf_off_cpu.c          | 32 +++++++++++++++
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 55 ++++++++++++++++++++------
 2 files changed, 76 insertions(+), 11 deletions(-)

diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
index 89f36229041d..38aeb13d3d25 100644
--- a/tools/perf/util/bpf_off_cpu.c
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -86,6 +86,37 @@ static void off_cpu_finish(void *arg __maybe_unused)
 	off_cpu_bpf__destroy(skel);
 }
 
+/* recent kernels added a prev_state arg, so we need to use the proper function */
+static void check_sched_switch_args(void)
+{
+	const struct btf *btf = bpf_object__btf(skel->obj);
+	const struct btf_type *t1, *t2, *t3;
+	u32 type_id;
+
+	type_id = btf__find_by_name_kind(btf, "bpf_trace_sched_switch",
+					 BTF_KIND_TYPEDEF);
+	if ((s32)type_id < 0)
+		goto old_format;
+
+	t1 = btf__type_by_id(btf, type_id);
+	if (t1 == NULL)
+		goto old_format;
+
+	t2 = btf__type_by_id(btf, t1->type);
+	if (t2 == NULL || !btf_is_ptr(t2))
+		goto old_format;
+
+	t3 = btf__type_by_id(btf, t2->type);
+	if (t3 && btf_is_func_proto(t3) && btf_vlen(t3) == 4) {
+		/* new format: disable old functions */
+		bpf_program__set_autoload(skel->progs.on_switch3, false);
+		return;
+	}
+
+old_format:
+	bpf_program__set_autoload(skel->progs.on_switch4, false);
+}
+
 int off_cpu_prepare(struct evlist *evlist, struct target *target)
 {
 	int err, fd, i;
@@ -114,6 +145,7 @@ int off_cpu_prepare(struct evlist *evlist, struct target *target)
 	}
 
 	set_max_rlimit();
+	check_sched_switch_args();
 
 	err = off_cpu_bpf__load(skel);
 	if (err) {
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
index 27425fe361e2..e11e198af86f 100644
--- a/tools/perf/util/bpf_skel/off_cpu.bpf.c
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -121,22 +121,13 @@ static inline int can_record(struct task_struct *t, int state)
 	return 1;
 }
 
-SEC("tp_btf/sched_switch")
-int on_switch(u64 *ctx)
+static int on_switch(u64 *ctx, struct task_struct *prev,
+		     struct task_struct *next, int state)
 {
 	__u64 ts;
-	int state;
 	__u32 pid, stack_id;
-	struct task_struct *prev, *next;
 	struct tstamp_data elem, *pelem;
 
-	if (!enabled)
-		return 0;
-
-	prev = (struct task_struct *)ctx[1];
-	next = (struct task_struct *)ctx[2];
-	state = get_task_state(prev);
-
 	ts = bpf_ktime_get_ns();
 
 	if (!can_record(prev, state))
@@ -178,4 +169,46 @@ int on_switch(u64 *ctx)
 	return 0;
 }
 
+SEC("tp_btf/sched_switch")
+int on_switch3(u64 *ctx)
+{
+	struct task_struct *prev, *next;
+	int state;
+
+	if (!enabled)
+		return 0;
+
+	/*
+	 * TP_PROTO(bool preempt, struct task_struct *prev,
+	 *          struct task_struct *next)
+	 */
+	prev = (struct task_struct *)ctx[1];
+	next = (struct task_struct *)ctx[2];
+
+	state = get_task_state(prev);
+
+	return on_switch(ctx, prev, next, state);
+}
+
+SEC("tp_btf/sched_switch")
+int on_switch4(u64 *ctx)
+{
+	struct task_struct *prev, *next;
+	int prev_state;
+
+	if (!enabled)
+		return 0;
+
+	/*
+	 * TP_PROTO(bool preempt, int prev_state,
+	 *          struct task_struct *prev,
+	 *          struct task_struct *next)
+	 */
+	prev = (struct task_struct *)ctx[2];
+	next = (struct task_struct *)ctx[3];
+	prev_state = (int)ctx[1];
+
+	return on_switch(ctx, prev, next, prev_state);
+}
+
 char LICENSE[] SEC("license") = "Dual BSD/GPL";
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog



* Re: [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1)
  2022-04-22  5:33 [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1) Namhyung Kim
                   ` (3 preceding siblings ...)
  2022-04-22  5:34 ` [PATCH 4/4] perf record: Handle argument change in sched_switch Namhyung Kim
@ 2022-04-22 10:11 ` Jiri Olsa
  2022-04-22 14:53   ` Namhyung Kim
  2022-04-22 10:20 ` Milian Wolff
  5 siblings, 1 reply; 19+ messages in thread
From: Jiri Olsa @ 2022-04-22 10:11 UTC
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Song Liu, Hao Luo, bpf,
	linux-perf-users, Blake Jones

On Thu, Apr 21, 2022 at 10:33:57PM -0700, Namhyung Kim wrote:

SNIP

> The perf bench sched messaging created 400 processes to send/receive
> messages through unix sockets.  It spent a large portion of cpu cycles
> for audit filter and read/copy the messages while most of the
> offcpu-time was in read and write calls.
> 
> You can get the code from 'perf/offcpu-v1' branch in my tree at
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
> 
> Enjoy! :)

  CC      builtin-record.o
builtin-record.c:52:10: fatal error: util/off_cpu.h: No such file or directory
   52 | #include "util/off_cpu.h"

forgot to add util/off_cpu.h ?

jirka

> 
> Thanks,
> Namhyung
> 
> 
> Namhyung Kim (4):
>   perf report: Do not extend sample type of bpf-output event
>   perf record: Enable off-cpu analysis with BPF
>   perf record: Implement basic filtering for off-cpu
>   perf record: Handle argument change in sched_switch
> 
>  tools/perf/Makefile.perf               |   1 +
>  tools/perf/builtin-record.c            |  21 ++
>  tools/perf/util/Build                  |   1 +
>  tools/perf/util/bpf_off_cpu.c          | 301 +++++++++++++++++++++++++
>  tools/perf/util/bpf_skel/off_cpu.bpf.c | 214 ++++++++++++++++++
>  tools/perf/util/evsel.c                |   4 +-
>  6 files changed, 540 insertions(+), 2 deletions(-)
>  create mode 100644 tools/perf/util/bpf_off_cpu.c
>  create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
> 
> 
> base-commit: 41204da4c16071be9090940b18f566832d46becc
> -- 
> 2.36.0.rc2.479.g8af0fa9b8e-goog
> 


* Re: [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1)
  2022-04-22  5:33 [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1) Namhyung Kim
                   ` (4 preceding siblings ...)
  2022-04-22 10:11 ` [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1) Jiri Olsa
@ 2022-04-22 10:20 ` Milian Wolff
  2022-04-22 15:01   ` Namhyung Kim
  5 siblings, 1 reply; 19+ messages in thread
From: Milian Wolff @ 2022-04-22 10:20 UTC
  To: Arnaldo Carvalho de Melo, Jiri Olsa, Namhyung Kim
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, bpf, linux-perf-users, Blake Jones


On Friday, April 22, 2022 07:33:57 CEST Namhyung Kim wrote:
> Hello,
> 
> This is the first version of off-cpu profiling support.  Together with
> (PMU-based) cpu profiling, it can show holistic view of the performance
> characteristics of your application or system.

Hey Namhyung,

this is awesome news! In hotspot, I've long done off-cpu profiling manually by 
looking at the time between --switch-events. The downside is that we also need 
to track the sched:sched_switch event to get a call stack. But this approach 
also works with dwarf based unwinding, and also includes kernel stacks.
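
Roughly, that manual setup looks like this (a sketch, not the literal
hotspot invocation):

  $ perf record --switch-events -e cycles -e sched:sched_switch \
        --call-graph dwarf -- ./myapp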

> With BPF, it can aggregate scheduling stats for interested tasks
> and/or states and convert the data into a form of perf sample records.
> I chose the bpf-output event which is a software event supposed to be
> consumed by BPF programs and renamed it as "offcpu-time".  So it
> requires no change on the perf report side except for setting sample
> types of bpf-output event.
> 
> Basically it collects userspace callstack for tasks as it's what users
> want mostly.  Maybe we can add support for the kernel stacks but I'm
> afraid that it'd cause more overhead.  So the offcpu-time event will
> always have callchains regardless of the command line option, and it
> enables the children mode in perf report by default.

Has anything changed wrt perf/bpf and user applications not compiled with
`-fno-omit-frame-pointer`? I.e. does this new utility only work for specially
compiled applications, or do we also get backtraces for "normal" binaries that 
we can install through package managers?

Thanks
-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


* Re: [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1)
  2022-04-22 10:11 ` [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1) Jiri Olsa
@ 2022-04-22 14:53   ` Namhyung Kim
  0 siblings, 0 replies; 19+ messages in thread
From: Namhyung Kim @ 2022-04-22 14:53 UTC
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Song Liu, Hao Luo, bpf,
	linux-perf-users, Blake Jones

Hi Jiri,

On Fri, Apr 22, 2022 at 3:11 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Apr 21, 2022 at 10:33:57PM -0700, Namhyung Kim wrote:
>
> SNIP
>
> > The perf bench sched messaging created 400 processes to send/receive
> > messages through unix sockets.  It spent a large portion of cpu cycles
> > for audit filter and read/copy the messages while most of the
> > offcpu-time was in read and write calls.
> >
> > You can get the code from 'perf/offcpu-v1' branch in my tree at
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
> >
> > Enjoy! :)
>
>   CC      builtin-record.o
> builtin-record.c:52:10: fatal error: util/off_cpu.h: No such file or directory
>    52 | #include "util/off_cpu.h"
>
> forgot to add util/off_cpu.h ?

Oops, you're right.  Will resend soon.

Thanks,
Namhyung


* Re: [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1)
  2022-04-22 10:20 ` Milian Wolff
@ 2022-04-22 15:01   ` Namhyung Kim
  2022-04-22 19:04     ` Arnaldo Carvalho de Melo
  2022-04-25 12:42     ` Milian Wolff
  0 siblings, 2 replies; 19+ messages in thread
From: Namhyung Kim @ 2022-04-22 15:01 UTC
  To: Milian Wolff
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Song Liu, Hao Luo, bpf,
	linux-perf-users, Blake Jones

Hi Milian,

On Fri, Apr 22, 2022 at 3:21 AM Milian Wolff <milian.wolff@kdab.com> wrote:
>
> On Friday, April 22, 2022 07:33:57 CEST Namhyung Kim wrote:
> > Hello,
> >
> > This is the first version of off-cpu profiling support.  Together with
> > (PMU-based) cpu profiling, it can show holistic view of the performance
> > characteristics of your application or system.
>
> Hey Namhyung,
>
> this is awesome news! In hotspot, I've long done off-cpu profiling manually by
> looking at the time between --switch-events. The downside is that we also need
> to track the sched:sched_switch event to get a call stack. But this approach
> also works with dwarf based unwinding, and also includes kernel stacks.

Thanks, I've also briefly thought about the switch event based off-cpu
profiling as it doesn't require root.  But collecting call stacks is hard and
I'd like to do it in kernel/bpf to reduce the overhead.

>
> > With BPF, it can aggregate scheduling stats for interested tasks
> > and/or states and convert the data into a form of perf sample records.
> > I chose the bpf-output event which is a software event supposed to be
> > consumed by BPF programs and renamed it as "offcpu-time".  So it
> > requires no change on the perf report side except for setting sample
> > types of bpf-output event.
> >
> > Basically it collects userspace callstack for tasks as it's what users
> > want mostly.  Maybe we can add support for the kernel stacks but I'm
> > afraid that it'd cause more overhead.  So the offcpu-time event will
> > always have callchains regardless of the command line option, and it
> > enables the children mode in perf report by default.
>
> Has anything changed wrt perf/bpf and user applications not compiled with `-
> > `-fno-omit-frame-pointer`? I.e. does this new utility only work for specially
> compiled applications, or do we also get backtraces for "normal" binaries that
> we can install through package managers?

I am not aware of such changes, it still needs a frame pointer to get
backtraces.

Thanks,
Namhyung


* Re: [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1)
  2022-04-22 15:01   ` Namhyung Kim
@ 2022-04-22 19:04     ` Arnaldo Carvalho de Melo
  2022-04-25 12:42     ` Milian Wolff
  1 sibling, 0 replies; 19+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-04-22 19:04 UTC
  To: Namhyung Kim
  Cc: Milian Wolff, Jiri Olsa, Ingo Molnar, Peter Zijlstra, LKML,
	Andi Kleen, Ian Rogers, Song Liu, Hao Luo, bpf, linux-perf-users,
	Blake Jones

Em Fri, Apr 22, 2022 at 08:01:15AM -0700, Namhyung Kim escreveu:
> Hi Milian,
 
> On Fri, Apr 22, 2022 at 3:21 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> > On Friday, April 22, 2022 07:33:57 CEST Namhyung Kim wrote:
> > > This is the first version of off-cpu profiling support.  Together with
> > > (PMU-based) cpu profiling, it can show holistic view of the performance
> > > characteristics of your application or system.

> > Hey Namhyung,

> > this is awesome news! In hotspot, I've long done off-cpu profiling manually by
> > looking at the time between --switch-events. The downside is that we also need
> > to track the sched:sched_switch event to get a call stack. But this approach
> > also works with dwarf based unwinding, and also includes kernel stacks.
> 
> Thanks, I've also briefly thought about the switch event based off-cpu
> profiling as it doesn't require root.  But collecting call stacks is hard and
> I'd like to do it in kernel/bpf to reduce the overhead.

It would be great to have both in perf. Right now, since we have a
working one in hotspot, perfecting the other method (Namhyung's, which
uses BPF to reduce the amount of data to postprocess in userspace)
looks great.
 
> > > With BPF, it can aggregate scheduling stats for interested tasks
> > > and/or states and convert the data into a form of perf sample records.
> > > I chose the bpf-output event which is a software event supposed to be
> > > consumed by BPF programs and renamed it as "offcpu-time".  So it
> > > requires no change on the perf report side except for setting sample
> > > types of bpf-output event.
> > >
> > > Basically it collects userspace callstack for tasks as it's what users
> > > want mostly.  Maybe we can add support for the kernel stacks but I'm
> > > afraid that it'd cause more overhead.  So the offcpu-time event will
> > > always have callchains regardless of the command line option, and it
> > > enables the children mode in perf report by default.
> >
> > Has anything changed wrt perf/bpf and user applications not compiled with
> > `-fno-omit-frame-pointer`? I.e. does this new utility only work for specially
> > compiled applications, or do we also get backtraces for "normal" binaries that
> > we can install through package managers?
> 
> I am not aware of such changes, it still needs a frame pointer to get
> backtraces.

I see this as an initial limitation, one that we can lift later?

- Arnaldo


* Re: [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1)
  2022-04-22 15:01   ` Namhyung Kim
  2022-04-22 19:04     ` Arnaldo Carvalho de Melo
@ 2022-04-25 12:42     ` Milian Wolff
  2022-04-25 16:49       ` Ian Rogers
  2022-04-25 18:58       ` Namhyung Kim
  1 sibling, 2 replies; 19+ messages in thread
From: Milian Wolff @ 2022-04-25 12:42 UTC
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Song Liu, Hao Luo, bpf,
	linux-perf-users, Blake Jones


On Friday, April 22, 2022 17:01:15 CEST Namhyung Kim wrote:
> Hi Milian,
> 
> On Fri, Apr 22, 2022 at 3:21 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> > On Friday, April 22, 2022 07:33:57 CEST Namhyung Kim wrote:
> > > Hello,
> > > 
> > > This is the first version of off-cpu profiling support.  Together with
> > > (PMU-based) cpu profiling, it can show holistic view of the performance
> > > characteristics of your application or system.
> > 
> > Hey Namhyung,
> > 
> > this is awesome news! In hotspot, I've long done off-cpu profiling
> > manually by looking at the time between --switch-events. The downside is
> > that we also need to track the sched:sched_switch event to get a call
> > stack. But this approach also works with dwarf based unwinding, and also
> > includes kernel stacks.
>
> Thanks, I've also briefly thought about the switch event based off-cpu
> profiling as it doesn't require root.  But collecting call stacks is hard
> and I'd like to do it in kernel/bpf to reduce the overhead.

I'm all for reducing the overhead, I just wonder about the practicality. At 
the very least, please make sure to note this limitation explicitly to end 
users. As a preacher for perf, I have come across lots of people stumbling 
over `perf record -g` not producing any sensible output because they are 
simply not aware that this requires frame pointers, which are basically
non-existent on most "normal" distributions. Nowadays `man perf record`
tries to educate people; please do the same for the new `--off-cpu` switch.
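
(I.e. to get sensible off-cpu stacks, users currently have to build
with something like

  $ gcc -O2 -g -fno-omit-frame-pointer -o myapp myapp.c

and the same goes for any libraries the off-cpu time ends up in.)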

> > > With BPF, it can aggregate scheduling stats for interested tasks
> > > and/or states and convert the data into a form of perf sample records.
> > > I chose the bpf-output event which is a software event supposed to be
> > > consumed by BPF programs and renamed it as "offcpu-time".  So it
> > > requires no change on the perf report side except for setting sample
> > > types of bpf-output event.
> > > 
> > > Basically it collects userspace callstack for tasks as it's what users
> > > want mostly.  Maybe we can add support for the kernel stacks but I'm
> > > afraid that it'd cause more overhead.  So the offcpu-time event will
> > > always have callchains regardless of the command line option, and it
> > > enables the children mode in perf report by default.
> > 
> > Has anything changed wrt perf/bpf and user applications not compiled with
> > > `-fno-omit-frame-pointer`? I.e. does this new utility only work for
> > specially compiled applications, or do we also get backtraces for
> > "normal" binaries that we can install through package managers?
> 
> I am not aware of such changes, it still needs a frame pointer to get
> backtraces.

May I ask what kind of setup you are using this on? Do you use something like 
Gentoo or yocto where you compile your whole system with
`-fno-omit-frame-pointer`? Because otherwise, any kind of off-cpu time
in system libraries will
not be resolved properly, no?

Thanks
-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


* Re: [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1)
  2022-04-25 12:42     ` Milian Wolff
@ 2022-04-25 16:49       ` Ian Rogers
  2022-04-25 18:58       ` Namhyung Kim
  1 sibling, 0 replies; 19+ messages in thread
From: Ian Rogers @ 2022-04-25 16:49 UTC
  To: Milian Wolff
  Cc: Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar,
	Peter Zijlstra, LKML, Andi Kleen, Song Liu, Hao Luo, bpf,
	linux-perf-users, Blake Jones, michael

On Mon, Apr 25, 2022 at 5:42 AM Milian Wolff <milian.wolff@kdab.com> wrote:
>
> On Friday, April 22, 2022 17:01:15 CEST Namhyung Kim wrote:
> > Hi Milian,
> >
> > On Fri, Apr 22, 2022 at 3:21 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> > > On Friday, April 22, 2022 07:33:57 CEST Namhyung Kim wrote:
> > > > Hello,
> > > >
> > > > This is the first version of off-cpu profiling support.  Together with
> > > > (PMU-based) cpu profiling, it can show holistic view of the performance
> > > > characteristics of your application or system.
> > >
> > > Hey Namhyung,
> > >
> > > this is awesome news! In hotspot, I've long done off-cpu profiling
> > > manually by looking at the time between --switch-events. The downside is
> > > that we also need to track the sched:sched_switch event to get a call
> > > stack. But this approach also works with dwarf based unwinding, and also
> > > includes kernel stacks.
> >
> > Thanks, I've also briefly thought about the switch event based off-cpu
> > profiling as it doesn't require root.  But collecting call stacks is hard
> > and I'd like to do it in kernel/bpf to reduce the overhead.
>
> I'm all for reducing the overhead, I just wonder about the practicality. At
> the very least, please make sure to note this limitation explicitly to end
> users. As a preacher for perf, I have come across lots of people stumbling
> over `perf record -g` not producing any sensible output because they are
> simply not aware that this requires frame pointers which are basically
> non-existent on most "normal" distributions. Nowadays `man perf record` tries to
> educate people, please do the same for the new `--off-cpu` switch.

I think documenting that off-cpu has a dependency on frame pointers
makes sense. There has been work to make LBR work:
https://lore.kernel.org/bpf/20210818012937.2522409-1-songliubraving@fb.com/
DWARF unwinding is problematic and is probably something best kept in
user land. There is also Intel's CET that may provide an alternate
source of backtraces.

More recent Intel and AMD cpus have techniques to turn memory
locations into registers, an approach generally called memory
renaming. There is some description here:
https://www.agner.org/forum/viewtopic.php?t=41
In LLVM there is a pass to promote memory locations into registers
called mem2reg. Freeing up the frame pointer as an extra register will
help this pass as there will be one more register to replace something
from memory. The memory renaming optimization is similar to mem2reg
except done in the CPU's front-end. It would be interesting to see
benchmark results on modern CPUs with and without omit-frame-pointer.
My expectation is that the performance wins aren't as great, if any,
as they used to be (cc-ed Michael Larabel as I love phoronix and it'd
be awesome if someone could do an omit-frame-pointer shoot-out).
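
Such a shoot-out could be as simple as (hypothetical commands and file
names):

  $ gcc -O2 -fomit-frame-pointer -o bench_nofp bench.c
  $ gcc -O2 -fno-omit-frame-pointer -o bench_fp bench.c
  $ perf stat -r 10 ./bench_nofp
  $ perf stat -r 10 ./bench_fp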

> > > > With BPF, it can aggregate scheduling stats for interested tasks
> > > > and/or states and convert the data into a form of perf sample records.
> > > > I chose the bpf-output event which is a software event supposed to be
> > > > consumed by BPF programs and renamed it as "offcpu-time".  So it
> > > > requires no change on the perf report side except for setting sample
> > > > types of bpf-output event.
> > > >
> > > > Basically it collects userspace callstack for tasks as it's what users
> > > > want mostly.  Maybe we can add support for the kernel stacks but I'm
> > > > afraid that it'd cause more overhead.  So the offcpu-time event will
> > > > always have callchains regardless of the command line option, and it
> > > > enables the children mode in perf report by default.
> > >
> > > Has anything changed wrt perf/bpf and user applications not compiled with
> > > `-fno-omit-frame-pointer`? I.e. does this new utility only work for
> > > specially compiled applications, or do we also get backtraces for
> > > "normal" binaries that we can install through package managers?
> >
> > I am not aware of such changes, it still needs a frame pointer to get
> > backtraces.
>
> May I ask what kind of setup you are using this on? Do you use something like
> Gentoo or yocto where you compile your whole system with
> `-fno-omit-frame-pointer`? Because otherwise, any kind of off-cpu time in system libraries will
> not be resolved properly, no?

I agree with your point. Often in cloud environments binaries are
static blobs linking in all their dependencies. This can aid
deployment, bug compatibility, etc. Fwiw, all backtraces gathered in
Google's profiling are frame pointer based. A large motivation for
this is the security aspect of having a privileged application able to
snapshot other threads' stacks, which happens with dwarf based unwinding.

In summary, your point is that frame pointer based unwinding is
largely broken on all major distributions today, limiting the utility
of off-CPU profiling as it is here. I agree; memory renaming in
hardware could hopefully mean that this isn't the case in
distributions in the future. Even if it isn't, there are alternate
backtrace sources like LBR and CET that mean we can fix this in other
ways.

Thanks,
Ian

> Thanks
> --
> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt, C++ and OpenGL Experts


* Re: [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1)
  2022-04-25 12:42     ` Milian Wolff
  2022-04-25 16:49       ` Ian Rogers
@ 2022-04-25 18:58       ` Namhyung Kim
  1 sibling, 0 replies; 19+ messages in thread
From: Namhyung Kim @ 2022-04-25 18:58 UTC
  To: Milian Wolff
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Song Liu, Hao Luo, bpf,
	linux-perf-users, Blake Jones

On Mon, Apr 25, 2022 at 5:42 AM Milian Wolff <milian.wolff@kdab.com> wrote:
>
> On Friday, April 22, 2022 17:01:15 CEST Namhyung Kim wrote:
> > Hi Milian,
> >
> > On Fri, Apr 22, 2022 at 3:21 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> > > On Friday, April 22, 2022 07:33:57 CEST Namhyung Kim wrote:
> > > > Hello,
> > > >
> > > > This is the first version of off-cpu profiling support.  Together with
> > > > (PMU-based) cpu profiling, it can show holistic view of the performance
> > > > characteristics of your application or system.
> > >
> > > Hey Namhyung,
> > >
> > > this is awesome news! In hotspot, I've long done off-cpu profiling
> > > manually by looking at the time between --switch-events. The downside is
> > > that we also need to track the sched:sched_switch event to get a call
> > > stack. But this approach also works with dwarf based unwinding, and also
> > > includes kernel stacks.
> >
> > Thanks, I've also briefly thought about the switch event based off-cpu
> > profiling as it doesn't require root.  But collecting call stacks is hard
> > and I'd like to do it in kernel/bpf to reduce the overhead.
>
> I'm all for reducing the overhead, I just wonder about the practicality. At
> the very least, please make sure to note this limitation explicitly to end
> users. As a preacher for perf, I have come across lots of people stumbling
> over `perf record -g` not producing any sensible output because they are
> simply not aware that this requires frame pointers, which are basically
> non-existent on most "normal" distributions. Nowadays `man perf record`
> tries to educate people; please do the same for the new `--off-cpu` switch.

Good point, will add it.

>
> > > > With BPF, it can aggregate scheduling stats for interested tasks
> > > > and/or states and convert the data into a form of perf sample records.
> > > > I chose the bpf-output event which is a software event supposed to be
> > > > consumed by BPF programs and renamed it as "offcpu-time".  So it
> > > > requires no change on the perf report side except for setting sample
> > > > types of bpf-output event.
> > > >
> > > > Basically it collects userspace callstack for tasks as it's what users
> > > > want mostly.  Maybe we can add support for the kernel stacks but I'm
> > > > afraid that it'd cause more overhead.  So the offcpu-time event will
> > > > always have callchains regardless of the command line option, and it
> > > > enables the children mode in perf report by default.
> > >
> > > Has anything changed wrt perf/bpf and user applications not compiled with
> > > `-fno-omit-frame-pointer`? I.e. does this new utility only work for
> > > specially compiled applications, or do we also get backtraces for
> > > "normal" binaries that we can install through package managers?
> >
> > I am not aware of such changes, it still needs a frame pointer to get
> > backtraces.
>
> May I ask what kind of setup you are using this on? Do you use something like
> Gentoo or yocto where you compile your whole system with `-fno-omit-frame-
> pointer`? Because otherwise, any kind of off-cpu time in system libraries will
> not be resolved properly, no?

In my work environment, everything is built with the frame pointer.
It's unfortunate most distros build without it, but as Ian said, I hope
we can lift the limitation with recent technologies soon.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] perf record: Enable off-cpu analysis with BPF
  2022-05-10 17:02   ` Arnaldo Carvalho de Melo
@ 2022-05-12  6:13     ` Namhyung Kim
  0 siblings, 0 replies; 19+ messages in thread
From: Namhyung Kim @ 2022-05-12  6:13 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen,
	Ian Rogers, Song Liu, Hao Luo, Milian Wolff, bpf,
	linux-perf-users, Blake Jones

On Tue, May 10, 2022 at 10:02 AM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> On Fri, May 06, 2022 at 01:16:25PM -0700, Namhyung Kim wrote:
[SNIP]
> > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> > index 465be4e62a17..8084e3fb73a1 100644
> > --- a/tools/perf/Documentation/perf-record.txt
> > +++ b/tools/perf/Documentation/perf-record.txt
> > @@ -758,6 +758,16 @@ include::intel-hybrid.txt[]
> >       If the URLs is not specified, the value of DEBUGINFOD_URLS
> >       system environment variable is used.
> >
> > +--off-cpu::
> > +     Enable off-cpu profiling with BPF.  The BPF program will collect
> > +     task scheduling information with (user) stacktrace and save them
> > +     as sample data of a software event named "offcpu-time".  The
> > +     sample period will have the time the task slept in nano-second.
>
>                                                         nanoseconds.

ok

> > +
> > +     Note that BPF can collect stacktrace using frame pointer ("fp")
>
>                                 stack traces

ok

>
> > +     only, as of now.  So the applications built without the frame
> > +     pointer might see bogus addresses.
>
> One possible improvement is to check if debugging info is available in
> the traced program and check if -fno-omit-frame-pointer is present:
>
> ⬢[acme@toolbox perf]$ readelf -wi ~/bin/perf | grep DW_AT_producer -m1
>     <d>   DW_AT_producer    : (indirect string, offset: 0x6627): GNU C99 11.2.1 20220127 (Red Hat 11.2.1-9) -mtune=generic -march=x86-64 -ggdb3 -O6 -std=gnu99 -fno-omit-frame-pointer -funwind-tables -fstack-protector-all
> ⬢[acme@toolbox perf]$
>
> If debugging info is available and -fno-omit-frame-pointer isn't in the
> DW_AT_producer string, then we can print a flashing warning and even
> not use the stack traces.

Yeah, but checking only the binary would not be enough, as it will
call into other libraries.  Also, libc would be the most frequent
user of sleeping syscalls, I guess.
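A rough sketch of such a check extended to the shared libraries a
binary loads could look like this (illustrative only, not part of this
series; ./myapp is a placeholder and parsing ldd output is fragile):

  $ for f in ./myapp $(ldd ./myapp | awk '/=>/ { print $3 }'); do
        readelf -wi "$f" 2>/dev/null | grep -m1 DW_AT_producer | \
            grep -q fno-omit-frame-pointer \
            && echo "$f: built with frame pointers" \
            || echo "$f: no -fno-omit-frame-pointer (or no debug info)"
    done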

>
> That or some other more robust method to detect that frame pointers are
> valid.
>
> Some more comments below.
>
> > +
> >  SEE ALSO
> >  --------
> >  linkperf:perf-stat[1], linkperf:perf-list[1], linkperf:perf-intel-pt[1]
> > diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> > index 6e5aded855cc..8f738e11356d 100644
> > --- a/tools/perf/Makefile.perf
> > +++ b/tools/perf/Makefile.perf
> > @@ -1038,6 +1038,7 @@ SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
> >  SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
> >  SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
> >  SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h
> > +SKELETONS += $(SKEL_OUT)/off_cpu.skel.h
> >
> >  $(SKEL_TMP_OUT) $(LIBBPF_OUTPUT):
> >       $(Q)$(MKDIR) -p $@
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index a5cf6a99d67f..4e0285ca9a2d 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -49,6 +49,7 @@
> >  #include "util/clockid.h"
> >  #include "util/pmu-hybrid.h"
> >  #include "util/evlist-hybrid.h"
> > +#include "util/off_cpu.h"
> >  #include "asm/bug.h"
> >  #include "perf.h"
> >  #include "cputopo.h"
> > @@ -162,6 +163,7 @@ struct record {
> >       bool                    buildid_mmap;
> >       bool                    timestamp_filename;
> >       bool                    timestamp_boundary;
> > +     bool                    off_cpu;
> >       struct switch_output    switch_output;
> >       unsigned long long      samples;
> >       unsigned long           output_max_size;        /* = 0: unlimited */
> > @@ -903,6 +905,11 @@ static int record__config_text_poke(struct evlist *evlist)
> >       return 0;
> >  }
> >
> > +static int record__config_off_cpu(struct record *rec)
> > +{
> > +     return off_cpu_prepare(rec->evlist);
> > +}
> > +
> >  static bool record__kcore_readable(struct machine *machine)
> >  {
> >       char kcore[PATH_MAX];
> > @@ -2600,6 +2607,9 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> >       } else
> >               status = err;
> >
> > +     if (rec->off_cpu)
> > +             rec->bytes_written += off_cpu_write(rec->session);
> > +
> >       record__synthesize(rec, true);
> >       /* this will be recalculated during process_buildids() */
> >       rec->samples = 0;
> > @@ -3324,6 +3334,9 @@ static struct option __record_options[] = {
> >       OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
> >                           "write collected trace data into several data files using parallel threads",
> >                           record__parse_threads),
> > +#ifdef HAVE_BPF_SKEL
> > +     OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
> > +#endif
>
>
> Since this is in the documentation, I think we should have an #else
> clause and state that --off-cpu needs building with some specific make
> command line.

I'd rather remove the #ifdef here and put a check after parse_options
to show the warning.
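Something along these lines, perhaps (a sketch of the idea only, not
the final patch):

	/* in __record_options[], registered unconditionally */
	OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),

	/* in cmd_record(), after parse_options(): */
#ifndef HAVE_BPF_SKEL
	if (rec->off_cpu) {
		pr_err("--off-cpu requires a build with BPF skeletons (BUILD_BPF_SKEL=1)\n");
		err = -EINVAL;
		goto out;
	}
#endif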

>
> >       OPT_END()
> >  };
> >
> > @@ -3981,6 +3994,14 @@ int cmd_record(int argc, const char **argv)
> >               }
> >       }
> >
> > +     if (rec->off_cpu) {
> > +             err = record__config_off_cpu(rec);
> > +             if (err) {
> > +                     pr_err("record__config_off_cpu failed, error %d\n", err);
> > +                     goto out;
> > +             }
> > +     }
> > +
> >       if (record_opts__config(&rec->opts)) {
> >               err = -EINVAL;
> >               goto out;
> > diff --git a/tools/perf/util/Build b/tools/perf/util/Build
> > index 9a7209a99e16..a51267d88ca9 100644
> > --- a/tools/perf/util/Build
> > +++ b/tools/perf/util/Build
> > @@ -147,6 +147,7 @@ perf-$(CONFIG_LIBBPF) += bpf_map.o
> >  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
> >  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
> >  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_ftrace.o
> > +perf-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o
> >  perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
> >  perf-$(CONFIG_LIBELF) += symbol-elf.o
> >  perf-$(CONFIG_LIBELF) += probe-file.o
> > diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> > new file mode 100644
> > index 000000000000..1f87d2a9b86d
> > --- /dev/null
> > +++ b/tools/perf/util/bpf_off_cpu.c
> > @@ -0,0 +1,208 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "util/bpf_counter.h"
> > +#include "util/debug.h"
> > +#include "util/evsel.h"
> > +#include "util/evlist.h"
> > +#include "util/off_cpu.h"
> > +#include "util/perf-hooks.h"
> > +#include "util/session.h"
> > +#include <bpf/bpf.h>
> > +
> > +#include "bpf_skel/off_cpu.skel.h"
> > +
> > +#define MAX_STACKS  32
> > +/* we don't need actual timestamp, just want to put the samples at last */
> > +#define OFF_CPU_TIMESTAMP  (~0ull << 32)
> > +
> > +static struct off_cpu_bpf *skel;
> > +
> > +struct off_cpu_key {
> > +     u32 pid;
> > +     u32 tgid;
> > +     u32 stack_id;
> > +     u32 state;
> > +     u64 cgroup_id;
> > +};
> > +
> > +union off_cpu_data {
> > +     struct perf_event_header hdr;
> > +     u64 array[1024 / sizeof(u64)];
> > +};
> > +
> > +static int off_cpu_config(struct evlist *evlist)
> > +{
> > +     struct evsel *evsel;
> > +     struct perf_event_attr attr = {
> > +             .type   = PERF_TYPE_SOFTWARE,
> > +             .config = PERF_COUNT_SW_BPF_OUTPUT,
> > +             .size   = sizeof(attr), /* to capture ABI version */
> > +     };
> > +
> > +     evsel = evsel__new(&attr);
> > +     if (!evsel)
> > +             return -ENOMEM;
> > +
> > +     evsel->core.attr.freq = 1;
> > +     evsel->core.attr.sample_period = 1;
> > +     /* off-cpu analysis depends on stack trace */
> > +     evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
> > +
> > +     evlist__add(evlist, evsel);
> > +
> > +     free(evsel->name);
> > +     evsel->name = strdup("offcpu-time");
> > +     if (evsel->name == NULL)
> > +             return -ENOMEM;
>
> But at this point the evsel was left in the evlist? Shouldn't we start
> it all by trying to allocate this string and not bother to call
> evsel__new() on failure, etc?

Sounds good, will change.
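i.e. allocate the name first and create the evsel only afterwards; a
sketch of the reordered off_cpu_config() (same fields as in the patch,
just a different order) could be:

	static int off_cpu_config(struct evlist *evlist)
	{
		struct evsel *evsel;
		struct perf_event_attr attr = {
			.type	= PERF_TYPE_SOFTWARE,
			.config = PERF_COUNT_SW_BPF_OUTPUT,
			.size	= sizeof(attr), /* to capture ABI version */
		};
		char *evname = strdup("offcpu-time");

		if (evname == NULL)
			return -ENOMEM;

		evsel = evsel__new(&attr);
		if (!evsel) {
			free(evname);
			return -ENOMEM;
		}

		evsel->core.attr.freq = 1;
		evsel->core.attr.sample_period = 1;
		/* off-cpu analysis depends on stack trace */
		evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;

		evlist__add(evlist, evsel);

		free(evsel->name);
		evsel->name = evname;
		return 0;
	}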

>
> > +
> > +     return 0;
> > +}
> > +
> > +static void off_cpu_start(void *arg __maybe_unused)
> > +{
> > +     skel->bss->enabled = 1;
> > +}
> > +
> > +static void off_cpu_finish(void *arg __maybe_unused)
> > +{
> > +     skel->bss->enabled = 0;
> > +     off_cpu_bpf__destroy(skel);
> > +}
> > +
> > +int off_cpu_prepare(struct evlist *evlist)
> > +{
> > +     int err;
> > +
> > +     if (off_cpu_config(evlist) < 0) {
> > +             pr_err("Failed to config off-cpu BPF event\n");
> > +             return -1;
> > +     }
> > +
> > +     set_max_rlimit();
> > +
> > +     skel = off_cpu_bpf__open_and_load();
> > +     if (!skel) {
> > +             pr_err("Failed to open off-cpu skeleton\n");
>
> "BPF sleketon", to give a bit more context to users?

ok

>
> > +             return -1;
> > +     }
> > +
> > +     err = off_cpu_bpf__attach(skel);
> > +     if (err) {
> > +             pr_err("Failed to attach off-cpu skeleton\n");
>
> Ditto.

ok

>
> > +             goto out;
> > +     }
> > +
> > +     if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
> > +         perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
> > +             pr_err("Failed to attach off-cpu skeleton\n");
> > +             goto out;
> > +     }
> > +
> > +     return 0;
> > +
> > +out:
> > +     off_cpu_bpf__destroy(skel);
> > +     return -1;
> > +}
> > +
> > +int off_cpu_write(struct perf_session *session)
> > +{
> > +     int bytes = 0, size;
> > +     int fd, stack;
> > +     bool found = false;
> > +     u64 sample_type, val, sid = 0;
> > +     struct evsel *evsel;
> > +     struct perf_data_file *file = &session->data->file;
> > +     struct off_cpu_key prev, key;
> > +     union off_cpu_data data = {
> > +             .hdr = {
> > +                     .type = PERF_RECORD_SAMPLE,
> > +                     .misc = PERF_RECORD_MISC_USER,
> > +             },
> > +     };
> > +     u64 tstamp = OFF_CPU_TIMESTAMP;
> > +
> > +     skel->bss->enabled = 0;
> > +
> > +     evlist__for_each_entry(session->evlist, evsel) {
> > +             if (!strcmp(evsel__name(evsel), "offcpu-time")) {
> > +                     found = true;
> > +                     break;
> > +             }
> > +     }
>
> Don't we have some evlist__find_evsel function?
>
> struct evsel *evlist__find_evsel_by_str(struct evlist *evlist, const char *str)
> {
>         struct evsel *evsel;
>
>         evlist__for_each_entry(evlist, evsel) {
>                 if (!evsel->name)
>                         continue;
>                 if (strcmp(str, evsel->name) == 0)
>                         return evsel;
>         }
>
>         return NULL;
> }

Nice, I'll use it instead.
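i.e. the open-coded loop plus the 'found' flag in off_cpu_write()
collapse to something like:

	evsel = evlist__find_evsel_by_str(session->evlist, "offcpu-time");
	if (evsel == NULL) {
		pr_err("offcpu-time evsel not found\n");
		return 0;
	}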

Thanks for your review!
Namhyung

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] perf record: Enable off-cpu analysis with BPF
  2022-05-06 20:16 ` [PATCH 2/4] perf record: Enable off-cpu analysis with BPF Namhyung Kim
@ 2022-05-10 17:02   ` Arnaldo Carvalho de Melo
  2022-05-12  6:13     ` Namhyung Kim
  0 siblings, 1 reply; 19+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-05-10 17:02 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Jiri Olsa, Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen,
	Ian Rogers, Song Liu, Hao Luo, Milian Wolff, bpf,
	linux-perf-users, Blake Jones

On Fri, May 06, 2022 at 01:16:25PM -0700, Namhyung Kim wrote:
> Add the --off-cpu option to enable off-cpu profiling with BPF.  It'd
> use a bpf_output event and rename it to "offcpu-time".  Samples will
> be synthesized at the end of the record session using data from a BPF
> map which contains the aggregated off-cpu time at context switches.
> So off-cpu profiling needs root privileges.
> 
> Each sample will have a separate user stacktrace, so kernel threads
> are skipped.  The sample ip will be set from the stacktrace and
> other sample data will be updated accordingly.  Currently it only
> handles some basic sample types.
> 
> The sample timestamp is set to a dummy value so as not to interfere
> with other events during sorting.  It starts from a very large
> initial value which is incremented while processing each sample.
> 
> A good thing is that it can be used together with regular profiling
> like cpu cycles.  If you don't want that, you can use a dummy event
> to enable off-cpu profiling only.
> 
> Example output:
>   $ sudo perf record --off-cpu perf bench sched messaging -l 1000
> 
>   $ sudo perf report --stdio --call-graph=no
>   # Total Lost Samples: 0
>   #
>   # Samples: 41K of event 'cycles'
>   # Event count (approx.): 42137343851
>   ...
> 
>   # Samples: 1K of event 'offcpu-time'
>   # Event count (approx.): 587990831640
>   #
>   # Children      Self  Command          Shared Object       Symbol
>   # ........  ........  ...............  ..................  .........................
>   #
>       81.66%     0.00%  sched-messaging  libc-2.33.so        [.] __libc_start_main
>       81.66%     0.00%  sched-messaging  perf                [.] cmd_bench
>       81.66%     0.00%  sched-messaging  perf                [.] main
>       81.66%     0.00%  sched-messaging  perf                [.] run_builtin
>       81.43%     0.00%  sched-messaging  perf                [.] bench_sched_messaging
>       40.86%    40.86%  sched-messaging  libpthread-2.33.so  [.] __read
>       37.66%    37.66%  sched-messaging  libpthread-2.33.so  [.] __write
>        2.91%     2.91%  sched-messaging  libc-2.33.so        [.] __poll
>   ...
> 
> As you can see, it spent most of the off-cpu time in read and write
> in bench_sched_messaging().  The --call-graph=no was added just to
> make the output concise here.
> 
> It uses perf hooks facility to control BPF program during the record
> session rather than adding new BPF/off-cpu specific calls.
> 
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
>  tools/perf/Documentation/perf-record.txt |  10 ++
>  tools/perf/Makefile.perf                 |   1 +
>  tools/perf/builtin-record.c              |  21 +++
>  tools/perf/util/Build                    |   1 +
>  tools/perf/util/bpf_off_cpu.c            | 208 +++++++++++++++++++++++
>  tools/perf/util/bpf_skel/off_cpu.bpf.c   | 139 +++++++++++++++
>  tools/perf/util/off_cpu.h                |  22 +++
>  7 files changed, 402 insertions(+)
>  create mode 100644 tools/perf/util/bpf_off_cpu.c
>  create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
>  create mode 100644 tools/perf/util/off_cpu.h
> 
> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> index 465be4e62a17..8084e3fb73a1 100644
> --- a/tools/perf/Documentation/perf-record.txt
> +++ b/tools/perf/Documentation/perf-record.txt
> @@ -758,6 +758,16 @@ include::intel-hybrid.txt[]
>  	If the URLs is not specified, the value of DEBUGINFOD_URLS
>  	system environment variable is used.
>  
> +--off-cpu::
> +	Enable off-cpu profiling with BPF.  The BPF program will collect
> +	task scheduling information with (user) stacktrace and save them
> +	as sample data of a software event named "offcpu-time".  The
> +	sample period will have the time the task slept in nano-second.

							nanoseconds.
> +
> +	Note that BPF can collect stacktrace using frame pointer ("fp")

				stack traces

> +	only, as of now.  So the applications built without the frame
> +	pointer might see bogus addresses.

One possible improvement is to check if debugging info is available in
the traced program and check if -fno-omit-frame-pointer is present:

⬢[acme@toolbox perf]$ readelf -wi ~/bin/perf | grep DW_AT_producer -m1
    <d>   DW_AT_producer    : (indirect string, offset: 0x6627): GNU C99 11.2.1 20220127 (Red Hat 11.2.1-9) -mtune=generic -march=x86-64 -ggdb3 -O6 -std=gnu99 -fno-omit-frame-pointer -funwind-tables -fstack-protector-all
⬢[acme@toolbox perf]$ 

If debugging info is available and -fno-omit-frame-pointer isn't in the
DW_AT_producer string, then we can print a flashing warning and even
not use the stack traces.

That or some other more robust method to detect that frame pointers are
valid.

Some more comments below.

> +
>  SEE ALSO
>  --------
>  linkperf:perf-stat[1], linkperf:perf-list[1], linkperf:perf-intel-pt[1]
> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> index 6e5aded855cc..8f738e11356d 100644
> --- a/tools/perf/Makefile.perf
> +++ b/tools/perf/Makefile.perf
> @@ -1038,6 +1038,7 @@ SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
>  SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
>  SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
>  SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h
> +SKELETONS += $(SKEL_OUT)/off_cpu.skel.h
>  
>  $(SKEL_TMP_OUT) $(LIBBPF_OUTPUT):
>  	$(Q)$(MKDIR) -p $@
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index a5cf6a99d67f..4e0285ca9a2d 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -49,6 +49,7 @@
>  #include "util/clockid.h"
>  #include "util/pmu-hybrid.h"
>  #include "util/evlist-hybrid.h"
> +#include "util/off_cpu.h"
>  #include "asm/bug.h"
>  #include "perf.h"
>  #include "cputopo.h"
> @@ -162,6 +163,7 @@ struct record {
>  	bool			buildid_mmap;
>  	bool			timestamp_filename;
>  	bool			timestamp_boundary;
> +	bool			off_cpu;
>  	struct switch_output	switch_output;
>  	unsigned long long	samples;
>  	unsigned long		output_max_size;	/* = 0: unlimited */
> @@ -903,6 +905,11 @@ static int record__config_text_poke(struct evlist *evlist)
>  	return 0;
>  }
>  
> +static int record__config_off_cpu(struct record *rec)
> +{
> +	return off_cpu_prepare(rec->evlist);
> +}
> +
>  static bool record__kcore_readable(struct machine *machine)
>  {
>  	char kcore[PATH_MAX];
> @@ -2600,6 +2607,9 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  	} else
>  		status = err;
>  
> +	if (rec->off_cpu)
> +		rec->bytes_written += off_cpu_write(rec->session);
> +
>  	record__synthesize(rec, true);
>  	/* this will be recalculated during process_buildids() */
>  	rec->samples = 0;
> @@ -3324,6 +3334,9 @@ static struct option __record_options[] = {
>  	OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
>  			    "write collected trace data into several data files using parallel threads",
>  			    record__parse_threads),
> +#ifdef HAVE_BPF_SKEL
> +	OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
> +#endif


Since this is in the documentation, I think we should have an #else
clause and state that --off-cpu needs building with some specific make
command line.

>  	OPT_END()
>  };
>  
> @@ -3981,6 +3994,14 @@ int cmd_record(int argc, const char **argv)
>  		}
>  	}
>  
> +	if (rec->off_cpu) {
> +		err = record__config_off_cpu(rec);
> +		if (err) {
> +			pr_err("record__config_off_cpu failed, error %d\n", err);
> +			goto out;
> +		}
> +	}
> +
>  	if (record_opts__config(&rec->opts)) {
>  		err = -EINVAL;
>  		goto out;
> diff --git a/tools/perf/util/Build b/tools/perf/util/Build
> index 9a7209a99e16..a51267d88ca9 100644
> --- a/tools/perf/util/Build
> +++ b/tools/perf/util/Build
> @@ -147,6 +147,7 @@ perf-$(CONFIG_LIBBPF) += bpf_map.o
>  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
>  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
>  perf-$(CONFIG_PERF_BPF_SKEL) += bpf_ftrace.o
> +perf-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o
>  perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
>  perf-$(CONFIG_LIBELF) += symbol-elf.o
>  perf-$(CONFIG_LIBELF) += probe-file.o
> diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
> new file mode 100644
> index 000000000000..1f87d2a9b86d
> --- /dev/null
> +++ b/tools/perf/util/bpf_off_cpu.c
> @@ -0,0 +1,208 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "util/bpf_counter.h"
> +#include "util/debug.h"
> +#include "util/evsel.h"
> +#include "util/evlist.h"
> +#include "util/off_cpu.h"
> +#include "util/perf-hooks.h"
> +#include "util/session.h"
> +#include <bpf/bpf.h>
> +
> +#include "bpf_skel/off_cpu.skel.h"
> +
> +#define MAX_STACKS  32
> +/* we don't need actual timestamp, just want to put the samples at last */
> +#define OFF_CPU_TIMESTAMP  (~0ull << 32)
> +
> +static struct off_cpu_bpf *skel;
> +
> +struct off_cpu_key {
> +	u32 pid;
> +	u32 tgid;
> +	u32 stack_id;
> +	u32 state;
> +	u64 cgroup_id;
> +};
> +
> +union off_cpu_data {
> +	struct perf_event_header hdr;
> +	u64 array[1024 / sizeof(u64)];
> +};
> +
> +static int off_cpu_config(struct evlist *evlist)
> +{
> +	struct evsel *evsel;
> +	struct perf_event_attr attr = {
> +		.type	= PERF_TYPE_SOFTWARE,
> +		.config = PERF_COUNT_SW_BPF_OUTPUT,
> +		.size	= sizeof(attr), /* to capture ABI version */
> +	};
> +
> +	evsel = evsel__new(&attr);
> +	if (!evsel)
> +		return -ENOMEM;
> +
> +	evsel->core.attr.freq = 1;
> +	evsel->core.attr.sample_period = 1;
> +	/* off-cpu analysis depends on stack trace */
> +	evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
> +
> +	evlist__add(evlist, evsel);
> +
> +	free(evsel->name);
> +	evsel->name = strdup("offcpu-time");
> +	if (evsel->name == NULL)
> +		return -ENOMEM;

But at this point the evsel was left in the evlist? Shouldn't we start
it all by trying to allocate this string and not bother to call
evsel__new() on failure, etc?

> +
> +	return 0;
> +}
> +
> +static void off_cpu_start(void *arg __maybe_unused)
> +{
> +	skel->bss->enabled = 1;
> +}
> +
> +static void off_cpu_finish(void *arg __maybe_unused)
> +{
> +	skel->bss->enabled = 0;
> +	off_cpu_bpf__destroy(skel);
> +}
> +
> +int off_cpu_prepare(struct evlist *evlist)
> +{
> +	int err;
> +
> +	if (off_cpu_config(evlist) < 0) {
> +		pr_err("Failed to config off-cpu BPF event\n");
> +		return -1;
> +	}
> +
> +	set_max_rlimit();
> +
> +	skel = off_cpu_bpf__open_and_load();
> +	if (!skel) {
> +		pr_err("Failed to open off-cpu skeleton\n");

"BPF sleketon", to give a bit more context to users?

> +		return -1;
> +	}
> +
> +	err = off_cpu_bpf__attach(skel);
> +	if (err) {
> +		pr_err("Failed to attach off-cpu skeleton\n");

Ditto.

> +		goto out;
> +	}
> +
> +	if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
> +	    perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
> +		pr_err("Failed to attach off-cpu skeleton\n");
> +		goto out;
> +	}
> +
> +	return 0;
> +
> +out:
> +	off_cpu_bpf__destroy(skel);
> +	return -1;
> +}
> +
> +int off_cpu_write(struct perf_session *session)
> +{
> +	int bytes = 0, size;
> +	int fd, stack;
> +	bool found = false;
> +	u64 sample_type, val, sid = 0;
> +	struct evsel *evsel;
> +	struct perf_data_file *file = &session->data->file;
> +	struct off_cpu_key prev, key;
> +	union off_cpu_data data = {
> +		.hdr = {
> +			.type = PERF_RECORD_SAMPLE,
> +			.misc = PERF_RECORD_MISC_USER,
> +		},
> +	};
> +	u64 tstamp = OFF_CPU_TIMESTAMP;
> +
> +	skel->bss->enabled = 0;
> +
> +	evlist__for_each_entry(session->evlist, evsel) {
> +		if (!strcmp(evsel__name(evsel), "offcpu-time")) {
> +			found = true;
> +			break;
> +		}
> +	}

Don't we have some evlist__find_evsel function?

struct evsel *evlist__find_evsel_by_str(struct evlist *evlist, const char *str)
{
        struct evsel *evsel;

        evlist__for_each_entry(evlist, evsel) {
                if (!evsel->name)
                        continue;
                if (strcmp(str, evsel->name) == 0)
                        return evsel;
        }

        return NULL;
}

> +
> +	if (!found) {
> +		pr_err("offcpu-time evsel not found\n");
> +		return 0;
> +	}
> +
> +	sample_type = evsel->core.attr.sample_type;
> +
> +	if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
> +		if (evsel->core.id)
> +			sid = evsel->core.id[0];
> +	}
> +
> +	fd = bpf_map__fd(skel->maps.off_cpu);
> +	stack = bpf_map__fd(skel->maps.stacks);
> +	memset(&prev, 0, sizeof(prev));
> +
> +	while (!bpf_map_get_next_key(fd, &prev, &key)) {
> +		int n = 1;  /* start from perf_event_header */
> +		int ip_pos = -1;
> +
> +		bpf_map_lookup_elem(fd, &key, &val);
> +
> +		if (sample_type & PERF_SAMPLE_IDENTIFIER)
> +			data.array[n++] = sid;
> +		if (sample_type & PERF_SAMPLE_IP) {
> +			ip_pos = n;
> +			data.array[n++] = 0;  /* will be updated */
> +		}
> +		if (sample_type & PERF_SAMPLE_TID)
> +			data.array[n++] = (u64)key.pid << 32 | key.tgid;
> +		if (sample_type & PERF_SAMPLE_TIME)
> +			data.array[n++] = tstamp;
> +		if (sample_type & PERF_SAMPLE_ID)
> +			data.array[n++] = sid;
> +		if (sample_type & PERF_SAMPLE_CPU)
> +			data.array[n++] = 0;
> +		if (sample_type & PERF_SAMPLE_PERIOD)
> +			data.array[n++] = val;
> +		if (sample_type & PERF_SAMPLE_CALLCHAIN) {
> +			int len = 0;
> +
> +			/* data.array[n] is callchain->nr (updated later) */
> +			data.array[n + 1] = PERF_CONTEXT_USER;
> +			data.array[n + 2] = 0;
> +
> +			bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
> +			while (data.array[n + 2 + len])
> +				len++;
> +
> +			/* update length of callchain */
> +			data.array[n] = len + 1;
> +
> +			/* update sample ip with the first callchain entry */
> +			if (ip_pos >= 0)
> +				data.array[ip_pos] = data.array[n + 2];
> +
> +			/* calculate sample callchain data array length */
> +			n += len + 2;
> +		}
> +		/* TODO: handle more sample types */
> +
> +		size = n * sizeof(u64);
> +		data.hdr.size = size;
> +		bytes += size;
> +
> +		if (perf_data_file__write(file, &data, size) < 0) {
> +			pr_err("failed to write perf data, error: %m\n");
> +			return bytes;
> +		}
> +
> +		prev = key;
> +		/* increase dummy timestamp to sort later samples */
> +		tstamp++;
> +	}
> +	return bytes;
> +}
> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> new file mode 100644
> index 000000000000..5173ed882fdf
> --- /dev/null
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> @@ -0,0 +1,139 @@
> +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +// Copyright (c) 2022 Google
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_core_read.h>
> +
> +/* task->flags for off-cpu analysis */
> +#define PF_KTHREAD   0x00200000  /* I am a kernel thread */
> +
> +/* task->state for off-cpu analysis */
> +#define TASK_INTERRUPTIBLE	0x0001
> +#define TASK_UNINTERRUPTIBLE	0x0002
> +
> +#define MAX_STACKS   32
> +#define MAX_ENTRIES  102400
> +
> +struct tstamp_data {
> +	__u32 stack_id;
> +	__u32 state;
> +	__u64 timestamp;
> +};
> +
> +struct offcpu_key {
> +	__u32 pid;
> +	__u32 tgid;
> +	__u32 stack_id;
> +	__u32 state;
> +};
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
> +	__uint(key_size, sizeof(__u32));
> +	__uint(value_size, MAX_STACKS * sizeof(__u64));
> +	__uint(max_entries, MAX_ENTRIES);
> +} stacks SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> +	__uint(map_flags, BPF_F_NO_PREALLOC);
> +	__type(key, int);
> +	__type(value, struct tstamp_data);
> +} tstamp SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(key_size, sizeof(struct offcpu_key));
> +	__uint(value_size, sizeof(__u64));
> +	__uint(max_entries, MAX_ENTRIES);
> +} off_cpu SEC(".maps");
> +
> +/* old kernel task_struct definition */
> +struct task_struct___old {
> +	long state;
> +} __attribute__((preserve_access_index));
> +
> +int enabled = 0;
> +
> +/*
> + * Old kernel used to call it task_struct->state and now it's '__state'.
> + * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
> + *
> + * https://nakryiko.com/posts/bpf-core-reference-guide/#handling-incompatible-field-and-type-changes
> + */
> +static inline int get_task_state(struct task_struct *t)
> +{
> +	if (bpf_core_field_exists(t->__state))
> +		return BPF_CORE_READ(t, __state);
> +
> +	/* recast pointer to capture task_struct___old type for compiler */
> +	struct task_struct___old *t_old = (void *)t;
> +
> +	/* now use old "state" name of the field */
> +	return BPF_CORE_READ(t_old, state);
> +}
> +
> +SEC("tp_btf/sched_switch")
> +int on_switch(u64 *ctx)
> +{
> +	__u64 ts;
> +	int state;
> +	__u32 stack_id;
> +	struct task_struct *prev, *next;
> +	struct tstamp_data *pelem;
> +
> +	if (!enabled)
> +		return 0;
> +
> +	prev = (struct task_struct *)ctx[1];
> +	next = (struct task_struct *)ctx[2];
> +	state = get_task_state(prev);
> +
> +	ts = bpf_ktime_get_ns();
> +
> +	if (prev->flags & PF_KTHREAD)
> +		goto next;
> +	if (state != TASK_INTERRUPTIBLE &&
> +	    state != TASK_UNINTERRUPTIBLE)
> +		goto next;
> +
> +	stack_id = bpf_get_stackid(ctx, &stacks,
> +				   BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
> +
> +	pelem = bpf_task_storage_get(&tstamp, prev, NULL,
> +				     BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	if (!pelem)
> +		goto next;
> +
> +	pelem->timestamp = ts;
> +	pelem->state = state;
> +	pelem->stack_id = stack_id;
> +
> +next:
> +	pelem = bpf_task_storage_get(&tstamp, next, NULL, 0);
> +
> +	if (pelem && pelem->timestamp) {
> +		struct offcpu_key key = {
> +			.pid = next->pid,
> +			.tgid = next->tgid,
> +			.stack_id = pelem->stack_id,
> +			.state = pelem->state,
> +		};
> +		__u64 delta = ts - pelem->timestamp;
> +		__u64 *total;
> +
> +		total = bpf_map_lookup_elem(&off_cpu, &key);
> +		if (total)
> +			*total += delta;
> +		else
> +			bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
> +
> +		/* prevent to reuse the timestamp later */
> +		pelem->timestamp = 0;
> +	}
> +
> +	return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
> new file mode 100644
> index 000000000000..4113bd19a2e4
> --- /dev/null
> +++ b/tools/perf/util/off_cpu.h
> @@ -0,0 +1,22 @@
> +#ifndef PERF_UTIL_OFF_CPU_H
> +#define PERF_UTIL_OFF_CPU_H
> +
> +struct evlist;
> +struct perf_session;
> +
> +#ifdef HAVE_BPF_SKEL
> +int off_cpu_prepare(struct evlist *evlist);
> +int off_cpu_write(struct perf_session *session);
> +#else
> +static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused)
> +{
> +	return -1;
> +}
> +
> +static inline int off_cpu_write(struct perf_session *session __maybe_unused)
> +{
> +	return -1;
> +}
> +#endif
> +
> +#endif  /* PERF_UTIL_OFF_CPU_H */
> -- 
> 2.36.0.512.ge40c2bad7a-goog

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 2/4] perf record: Enable off-cpu analysis with BPF
  2022-05-06 20:16 [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v2) Namhyung Kim
@ 2022-05-06 20:16 ` Namhyung Kim
  2022-05-10 17:02   ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 19+ messages in thread
From: Namhyung Kim @ 2022-05-06 20:16 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, Milian Wolff, bpf, linux-perf-users,
	Blake Jones

Add the --off-cpu option to enable off-cpu profiling with BPF.  It'd
use a bpf_output event and rename it to "offcpu-time".  Samples will
be synthesized at the end of the record session using data from a BPF
map which contains the aggregated off-cpu time at context switches.
So off-cpu profiling needs root privileges.

Each sample will have a separate user stacktrace, so kernel threads
are skipped.  The sample ip will be set from the stacktrace and
other sample data will be updated accordingly.  Currently it only
handles some basic sample types.

The sample timestamp is set to a dummy value so as not to interfere
with other events during sorting.  It starts from a very large
initial value which is incremented while processing each sample.
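(The dummy base is OFF_CPU_TIMESTAMP, i.e. ~0ull << 32 =
0xffffffff00000000, far beyond any realistic monotonic clock value,
so the synthesized samples sort after all regular samples.)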

A good thing is that it can be used together with regular profiling
like cpu cycles.  If you don't want that, you can use a dummy event
to enable off-cpu profiling only.
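For example, something like the following hypothetical invocation
would record off-cpu data only ("dummy" is the software dummy event):

  $ sudo perf record --off-cpu -e dummy -a sleep 1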

Example output:
  $ sudo perf record --off-cpu perf bench sched messaging -l 1000

  $ sudo perf report --stdio --call-graph=no
  # Total Lost Samples: 0
  #
  # Samples: 41K of event 'cycles'
  # Event count (approx.): 42137343851
  ...

  # Samples: 1K of event 'offcpu-time'
  # Event count (approx.): 587990831640
  #
  # Children      Self  Command          Shared Object       Symbol
  # ........  ........  ...............  ..................  .........................
  #
      81.66%     0.00%  sched-messaging  libc-2.33.so        [.] __libc_start_main
      81.66%     0.00%  sched-messaging  perf                [.] cmd_bench
      81.66%     0.00%  sched-messaging  perf                [.] main
      81.66%     0.00%  sched-messaging  perf                [.] run_builtin
      81.43%     0.00%  sched-messaging  perf                [.] bench_sched_messaging
      40.86%    40.86%  sched-messaging  libpthread-2.33.so  [.] __read
      37.66%    37.66%  sched-messaging  libpthread-2.33.so  [.] __write
       2.91%     2.91%  sched-messaging  libc-2.33.so        [.] __poll
  ...

As you can see, it spent most of the off-cpu time in read and write
in bench_sched_messaging().  The --call-graph=no was added just to
make the output concise here.

It uses perf hooks facility to control BPF program during the record
session rather than adding new BPF/off-cpu specific calls.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/Documentation/perf-record.txt |  10 ++
 tools/perf/Makefile.perf                 |   1 +
 tools/perf/builtin-record.c              |  21 +++
 tools/perf/util/Build                    |   1 +
 tools/perf/util/bpf_off_cpu.c            | 208 +++++++++++++++++++++++
 tools/perf/util/bpf_skel/off_cpu.bpf.c   | 139 +++++++++++++++
 tools/perf/util/off_cpu.h                |  22 +++
 7 files changed, 402 insertions(+)
 create mode 100644 tools/perf/util/bpf_off_cpu.c
 create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
 create mode 100644 tools/perf/util/off_cpu.h

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 465be4e62a17..8084e3fb73a1 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -758,6 +758,16 @@ include::intel-hybrid.txt[]
 	If the URLs is not specified, the value of DEBUGINFOD_URLS
 	system environment variable is used.
 
+--off-cpu::
+	Enable off-cpu profiling with BPF.  The BPF program will collect
+	task scheduling information with (user) stacktrace and save them
+	as sample data of a software event named "offcpu-time".  The
+	sample period will have the time the task slept in nano-second.
+
+	Note that BPF can collect stacktrace using frame pointer ("fp")
+	only, as of now.  So the applications built without the frame
+	pointer might see bogus addresses.
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-list[1], linkperf:perf-intel-pt[1]
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 6e5aded855cc..8f738e11356d 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1038,6 +1038,7 @@ SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
 SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h
+SKELETONS += $(SKEL_OUT)/off_cpu.skel.h
 
 $(SKEL_TMP_OUT) $(LIBBPF_OUTPUT):
 	$(Q)$(MKDIR) -p $@
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a5cf6a99d67f..4e0285ca9a2d 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -49,6 +49,7 @@
 #include "util/clockid.h"
 #include "util/pmu-hybrid.h"
 #include "util/evlist-hybrid.h"
+#include "util/off_cpu.h"
 #include "asm/bug.h"
 #include "perf.h"
 #include "cputopo.h"
@@ -162,6 +163,7 @@ struct record {
 	bool			buildid_mmap;
 	bool			timestamp_filename;
 	bool			timestamp_boundary;
+	bool			off_cpu;
 	struct switch_output	switch_output;
 	unsigned long long	samples;
 	unsigned long		output_max_size;	/* = 0: unlimited */
@@ -903,6 +905,11 @@ static int record__config_text_poke(struct evlist *evlist)
 	return 0;
 }
 
+static int record__config_off_cpu(struct record *rec)
+{
+	return off_cpu_prepare(rec->evlist);
+}
+
 static bool record__kcore_readable(struct machine *machine)
 {
 	char kcore[PATH_MAX];
@@ -2600,6 +2607,9 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	} else
 		status = err;
 
+	if (rec->off_cpu)
+		rec->bytes_written += off_cpu_write(rec->session);
+
 	record__synthesize(rec, true);
 	/* this will be recalculated during process_buildids() */
 	rec->samples = 0;
@@ -3324,6 +3334,9 @@ static struct option __record_options[] = {
 	OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
 			    "write collected trace data into several data files using parallel threads",
 			    record__parse_threads),
+#ifdef HAVE_BPF_SKEL
+	OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
+#endif
 	OPT_END()
 };
 
@@ -3981,6 +3994,14 @@ int cmd_record(int argc, const char **argv)
 		}
 	}
 
+	if (rec->off_cpu) {
+		err = record__config_off_cpu(rec);
+		if (err) {
+			pr_err("record__config_off_cpu failed, error %d\n", err);
+			goto out;
+		}
+	}
+
 	if (record_opts__config(&rec->opts)) {
 		err = -EINVAL;
 		goto out;
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 9a7209a99e16..a51267d88ca9 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -147,6 +147,7 @@ perf-$(CONFIG_LIBBPF) += bpf_map.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_ftrace.o
+perf-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o
 perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
 perf-$(CONFIG_LIBELF) += symbol-elf.o
 perf-$(CONFIG_LIBELF) += probe-file.o
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
new file mode 100644
index 000000000000..1f87d2a9b86d
--- /dev/null
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -0,0 +1,208 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "util/bpf_counter.h"
+#include "util/debug.h"
+#include "util/evsel.h"
+#include "util/evlist.h"
+#include "util/off_cpu.h"
+#include "util/perf-hooks.h"
+#include "util/session.h"
+#include <bpf/bpf.h>
+
+#include "bpf_skel/off_cpu.skel.h"
+
+#define MAX_STACKS  32
+/* we don't need actual timestamp, just want to put the samples at last */
+#define OFF_CPU_TIMESTAMP  (~0ull << 32)
+
+static struct off_cpu_bpf *skel;
+
+struct off_cpu_key {
+	u32 pid;
+	u32 tgid;
+	u32 stack_id;
+	u32 state;
+	u64 cgroup_id;
+};
+
+union off_cpu_data {
+	struct perf_event_header hdr;
+	u64 array[1024 / sizeof(u64)];
+};
+
+static int off_cpu_config(struct evlist *evlist)
+{
+	struct evsel *evsel;
+	struct perf_event_attr attr = {
+		.type	= PERF_TYPE_SOFTWARE,
+		.config = PERF_COUNT_SW_BPF_OUTPUT,
+		.size	= sizeof(attr), /* to capture ABI version */
+	};
+
+	evsel = evsel__new(&attr);
+	if (!evsel)
+		return -ENOMEM;
+
+	evsel->core.attr.freq = 1;
+	evsel->core.attr.sample_period = 1;
+	/* off-cpu analysis depends on stack trace */
+	evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
+
+	evlist__add(evlist, evsel);
+
+	free(evsel->name);
+	evsel->name = strdup("offcpu-time");
+	if (evsel->name == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void off_cpu_start(void *arg __maybe_unused)
+{
+	skel->bss->enabled = 1;
+}
+
+static void off_cpu_finish(void *arg __maybe_unused)
+{
+	skel->bss->enabled = 0;
+	off_cpu_bpf__destroy(skel);
+}
+
+int off_cpu_prepare(struct evlist *evlist)
+{
+	int err;
+
+	if (off_cpu_config(evlist) < 0) {
+		pr_err("Failed to config off-cpu BPF event\n");
+		return -1;
+	}
+
+	set_max_rlimit();
+
+	skel = off_cpu_bpf__open_and_load();
+	if (!skel) {
+		pr_err("Failed to open off-cpu skeleton\n");
+		return -1;
+	}
+
+	err = off_cpu_bpf__attach(skel);
+	if (err) {
+		pr_err("Failed to attach off-cpu skeleton\n");
+		goto out;
+	}
+
+	if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
+	    perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
+		pr_err("Failed to attach off-cpu skeleton\n");
+		goto out;
+	}
+
+	return 0;
+
+out:
+	off_cpu_bpf__destroy(skel);
+	return -1;
+}
+
+int off_cpu_write(struct perf_session *session)
+{
+	int bytes = 0, size;
+	int fd, stack;
+	bool found = false;
+	u64 sample_type, val, sid = 0;
+	struct evsel *evsel;
+	struct perf_data_file *file = &session->data->file;
+	struct off_cpu_key prev, key;
+	union off_cpu_data data = {
+		.hdr = {
+			.type = PERF_RECORD_SAMPLE,
+			.misc = PERF_RECORD_MISC_USER,
+		},
+	};
+	u64 tstamp = OFF_CPU_TIMESTAMP;
+
+	skel->bss->enabled = 0;
+
+	evlist__for_each_entry(session->evlist, evsel) {
+		if (!strcmp(evsel__name(evsel), "offcpu-time")) {
+			found = true;
+			break;
+		}
+	}
+
+	if (!found) {
+		pr_err("offcpu-time evsel not found\n");
+		return 0;
+	}
+
+	sample_type = evsel->core.attr.sample_type;
+
+	if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
+		if (evsel->core.id)
+			sid = evsel->core.id[0];
+	}
+
+	fd = bpf_map__fd(skel->maps.off_cpu);
+	stack = bpf_map__fd(skel->maps.stacks);
+	memset(&prev, 0, sizeof(prev));
+
+	while (!bpf_map_get_next_key(fd, &prev, &key)) {
+		int n = 1;  /* start from perf_event_header */
+		int ip_pos = -1;
+
+		bpf_map_lookup_elem(fd, &key, &val);
+
+		if (sample_type & PERF_SAMPLE_IDENTIFIER)
+			data.array[n++] = sid;
+		if (sample_type & PERF_SAMPLE_IP) {
+			ip_pos = n;
+			data.array[n++] = 0;  /* will be updated */
+		}
+		if (sample_type & PERF_SAMPLE_TID)
+			data.array[n++] = (u64)key.pid << 32 | key.tgid;
+		if (sample_type & PERF_SAMPLE_TIME)
+			data.array[n++] = tstamp;
+		if (sample_type & PERF_SAMPLE_ID)
+			data.array[n++] = sid;
+		if (sample_type & PERF_SAMPLE_CPU)
+			data.array[n++] = 0;
+		if (sample_type & PERF_SAMPLE_PERIOD)
+			data.array[n++] = val;
+		if (sample_type & PERF_SAMPLE_CALLCHAIN) {
+			int len = 0;
+
+			/* data.array[n] is callchain->nr (updated later) */
+			data.array[n + 1] = PERF_CONTEXT_USER;
+			data.array[n + 2] = 0;
+
+			bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
+			while (data.array[n + 2 + len])
+				len++;
+
+			/* update length of callchain */
+			data.array[n] = len + 1;
+
+			/* update sample ip with the first callchain entry */
+			if (ip_pos >= 0)
+				data.array[ip_pos] = data.array[n + 2];
+
+			/* calculate sample callchain data array length */
+			n += len + 2;
+		}
+		/* TODO: handle more sample types */
+
+		size = n * sizeof(u64);
+		data.hdr.size = size;
+		bytes += size;
+
+		if (perf_data_file__write(file, &data, size) < 0) {
+			pr_err("failed to write perf data, error: %m\n");
+			return bytes;
+		}
+
+		prev = key;
+		/* increase dummy timestamp to sort later samples */
+		tstamp++;
+	}
+	return bytes;
+}
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
new file mode 100644
index 000000000000..5173ed882fdf
--- /dev/null
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (c) 2022 Google
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+/* task->flags for off-cpu analysis */
+#define PF_KTHREAD   0x00200000  /* I am a kernel thread */
+
+/* task->state for off-cpu analysis */
+#define TASK_INTERRUPTIBLE	0x0001
+#define TASK_UNINTERRUPTIBLE	0x0002
+
+#define MAX_STACKS   32
+#define MAX_ENTRIES  102400
+
+struct tstamp_data {
+	__u32 stack_id;
+	__u32 state;
+	__u64 timestamp;
+};
+
+struct offcpu_key {
+	__u32 pid;
+	__u32 tgid;
+	__u32 stack_id;
+	__u32 state;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, MAX_STACKS * sizeof(__u64));
+	__uint(max_entries, MAX_ENTRIES);
+} stacks SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct tstamp_data);
+} tstamp SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(struct offcpu_key));
+	__uint(value_size, sizeof(__u64));
+	__uint(max_entries, MAX_ENTRIES);
+} off_cpu SEC(".maps");
+
+/* old kernel task_struct definition */
+struct task_struct___old {
+	long state;
+} __attribute__((preserve_access_index));
+
+int enabled = 0;
+
+/*
+ * Old kernel used to call it task_struct->state and now it's '__state'.
+ * Use BPF CO-RE "ignored suffix rule" to deal with it like below:
+ *
+ * https://nakryiko.com/posts/bpf-core-reference-guide/#handling-incompatible-field-and-type-changes
+ */
+static inline int get_task_state(struct task_struct *t)
+{
+	if (bpf_core_field_exists(t->__state))
+		return BPF_CORE_READ(t, __state);
+
+	/* recast pointer to capture task_struct___old type for compiler */
+	struct task_struct___old *t_old = (void *)t;
+
+	/* now use old "state" name of the field */
+	return BPF_CORE_READ(t_old, state);
+}
+
+SEC("tp_btf/sched_switch")
+int on_switch(u64 *ctx)
+{
+	__u64 ts;
+	int state;
+	__u32 stack_id;
+	struct task_struct *prev, *next;
+	struct tstamp_data *pelem;
+
+	if (!enabled)
+		return 0;
+
+	prev = (struct task_struct *)ctx[1];
+	next = (struct task_struct *)ctx[2];
+	state = get_task_state(prev);
+
+	ts = bpf_ktime_get_ns();
+
+	if (prev->flags & PF_KTHREAD)
+		goto next;
+	if (state != TASK_INTERRUPTIBLE &&
+	    state != TASK_UNINTERRUPTIBLE)
+		goto next;
+
+	stack_id = bpf_get_stackid(ctx, &stacks,
+				   BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
+
+	pelem = bpf_task_storage_get(&tstamp, prev, NULL,
+				     BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!pelem)
+		goto next;
+
+	pelem->timestamp = ts;
+	pelem->state = state;
+	pelem->stack_id = stack_id;
+
+next:
+	pelem = bpf_task_storage_get(&tstamp, next, NULL, 0);
+
+	if (pelem && pelem->timestamp) {
+		struct offcpu_key key = {
+			.pid = next->pid,
+			.tgid = next->tgid,
+			.stack_id = pelem->stack_id,
+			.state = pelem->state,
+		};
+		__u64 delta = ts - pelem->timestamp;
+		__u64 *total;
+
+		total = bpf_map_lookup_elem(&off_cpu, &key);
+		if (total)
+			*total += delta;
+		else
+			bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
+
+		/* prevent to reuse the timestamp later */
+		pelem->timestamp = 0;
+	}
+
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
new file mode 100644
index 000000000000..4113bd19a2e4
--- /dev/null
+++ b/tools/perf/util/off_cpu.h
@@ -0,0 +1,22 @@
+#ifndef PERF_UTIL_OFF_CPU_H
+#define PERF_UTIL_OFF_CPU_H
+
+struct evlist;
+struct perf_session;
+
+#ifdef HAVE_BPF_SKEL
+int off_cpu_prepare(struct evlist *evlist);
+int off_cpu_write(struct perf_session *session);
+#else
+static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused)
+{
+	return -1;
+}
+
+static inline int off_cpu_write(struct perf_session *session __maybe_unused)
+{
+	return -1;
+}
+#endif
+
+#endif  /* PERF_UTIL_OFF_CPU_H */
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] perf record: Enable off-cpu analysis with BPF
  2022-04-27 23:07   ` Hao Luo
@ 2022-04-29 17:27     ` Namhyung Kim
  0 siblings, 0 replies; 19+ messages in thread
From: Namhyung Kim @ 2022-04-29 17:27 UTC (permalink / raw)
  To: Hao Luo
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Song Liu, bpf, linux-perf-users,
	Blake Jones

Hi Hao,

On Wed, Apr 27, 2022 at 4:07 PM Hao Luo <haoluo@google.com> wrote:
>
> Hi Namhyung,
>
> On Fri, Apr 22, 2022 at 8:05 AM Namhyung Kim <namhyung@kernel.org> wrote:
> >
>
> [...]
>
> >
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> > ---
> >  tools/perf/Makefile.perf               |   1 +
> >  tools/perf/builtin-record.c            |  21 +++
> >  tools/perf/util/Build                  |   1 +
> >  tools/perf/util/bpf_off_cpu.c          | 208 +++++++++++++++++++++++++
> >  tools/perf/util/bpf_skel/off_cpu.bpf.c | 137 ++++++++++++++++
> >  tools/perf/util/off_cpu.h              |  22 +++
> >  6 files changed, 390 insertions(+)
> >  create mode 100644 tools/perf/util/bpf_off_cpu.c
> >  create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
> >  create mode 100644 tools/perf/util/off_cpu.h
> >
>
> [...]
>
> > diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> > new file mode 100644
> > index 000000000000..2bc6f7cc59ea
> > --- /dev/null
> > +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> >
> > +struct {
> > +       __uint(type, BPF_MAP_TYPE_HASH);
> > +       __uint(key_size, sizeof(__u32));
> > +       __uint(value_size, sizeof(struct tstamp_data));
> > +       __uint(max_entries, MAX_ENTRIES);
> > +} tstamp SEC(".maps");
>
> I think using task local storage for this tstamp would be more
> efficient. There is an example in
> tools/bpf/runqslower/runqslower.bpf.c

Right, will change in v2.
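For reference, the task local storage variant is what the v2 posting
above ended up using; roughly:

	struct {
		__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
		__uint(map_flags, BPF_F_NO_PREALLOC);
		__type(key, int);
		__type(value, struct tstamp_data);
	} tstamp SEC(".maps");

	/* in the sched_switch handler: create-or-get for the prev task */
	pelem = bpf_task_storage_get(&tstamp, prev, NULL,
				     BPF_LOCAL_STORAGE_GET_F_CREATE);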

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] perf record: Enable off-cpu analysis with BPF
  2022-04-22 15:05 ` [PATCH 2/4] perf record: Enable off-cpu analysis with BPF Namhyung Kim
@ 2022-04-27 23:07   ` Hao Luo
  2022-04-29 17:27     ` Namhyung Kim
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Luo @ 2022-04-27 23:07 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Peter Zijlstra,
	LKML, Andi Kleen, Ian Rogers, Song Liu, bpf, linux-perf-users,
	Blake Jones

Hi Namhyung,

On Fri, Apr 22, 2022 at 8:05 AM Namhyung Kim <namhyung@kernel.org> wrote:
>

[...]

>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
>  tools/perf/Makefile.perf               |   1 +
>  tools/perf/builtin-record.c            |  21 +++
>  tools/perf/util/Build                  |   1 +
>  tools/perf/util/bpf_off_cpu.c          | 208 +++++++++++++++++++++++++
>  tools/perf/util/bpf_skel/off_cpu.bpf.c | 137 ++++++++++++++++
>  tools/perf/util/off_cpu.h              |  22 +++
>  6 files changed, 390 insertions(+)
>  create mode 100644 tools/perf/util/bpf_off_cpu.c
>  create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
>  create mode 100644 tools/perf/util/off_cpu.h
>

[...]

> diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
> new file mode 100644
> index 000000000000..2bc6f7cc59ea
> --- /dev/null
> +++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
>
> +struct {
> +       __uint(type, BPF_MAP_TYPE_HASH);
> +       __uint(key_size, sizeof(__u32));
> +       __uint(value_size, sizeof(struct tstamp_data));
> +       __uint(max_entries, MAX_ENTRIES);
> +} tstamp SEC(".maps");

I think using task local storage for this tstamp would be more
efficient. There is an example in
tools/bpf/runqslower/runqslower.bpf.c

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 2/4] perf record: Enable off-cpu analysis with BPF
  2022-04-22 15:05 [RFC RESEND " Namhyung Kim
@ 2022-04-22 15:05 ` Namhyung Kim
  2022-04-27 23:07   ` Hao Luo
  0 siblings, 1 reply; 19+ messages in thread
From: Namhyung Kim @ 2022-04-22 15:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Andi Kleen, Ian Rogers,
	Song Liu, Hao Luo, bpf, linux-perf-users, Blake Jones

Add the --off-cpu option to enable off-cpu profiling with BPF.  It'd
use a bpf_output event and rename it to "offcpu-time".  Samples will
be synthesized at the end of the record session using data from a BPF
map which contains the aggregated off-cpu time at context switches.
So off-cpu profiling needs root privileges.

Each sample will have a separate user stacktrace, so kernel threads
are skipped.  The sample ip will be set from the stacktrace and
other sample data will be updated accordingly.  Currently it only
handles some basic sample types.

The sample timestamp is set to a dummy value so as not to interfere
with other events during sorting.  It starts from a very large
initial value which is incremented while processing each sample.

A good thing is that it can be used together with regular profiling
like cpu cycles.  If you don't want that, you can use a dummy event
to enable off-cpu profiling only.

Example output:
  $ sudo perf record --off-cpu perf bench sched messaging -l 1000

  $ sudo perf report --stdio --call-graph=no
  # Total Lost Samples: 0
  #
  # Samples: 41K of event 'cycles'
  # Event count (approx.): 42137343851
  ...

  # Samples: 1K of event 'offcpu-time'
  # Event count (approx.): 587990831640
  #
  # Children      Self  Command          Shared Object       Symbol
  # ........  ........  ...............  ..................  .........................
  #
      81.66%     0.00%  sched-messaging  libc-2.33.so        [.] __libc_start_main
      81.66%     0.00%  sched-messaging  perf                [.] cmd_bench
      81.66%     0.00%  sched-messaging  perf                [.] main
      81.66%     0.00%  sched-messaging  perf                [.] run_builtin
      81.43%     0.00%  sched-messaging  perf                [.] bench_sched_messaging
      40.86%    40.86%  sched-messaging  libpthread-2.33.so  [.] __read
      37.66%    37.66%  sched-messaging  libpthread-2.33.so  [.] __write
       2.91%     2.91%  sched-messaging  libc-2.33.so        [.] __poll
  ...

As you can see, it spent most of the off-cpu time in read and write
in bench_sched_messaging().  The --call-graph=no was added just to
make the output concise here.

It uses perf hooks facility to control BPF program during the record
session rather than adding new BPF/off-cpu specific calls.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/Makefile.perf               |   1 +
 tools/perf/builtin-record.c            |  21 +++
 tools/perf/util/Build                  |   1 +
 tools/perf/util/bpf_off_cpu.c          | 208 +++++++++++++++++++++++++
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 137 ++++++++++++++++
 tools/perf/util/off_cpu.h              |  22 +++
 6 files changed, 390 insertions(+)
 create mode 100644 tools/perf/util/bpf_off_cpu.c
 create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
 create mode 100644 tools/perf/util/off_cpu.h

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 69473a836bae..ce333327182a 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1041,6 +1041,7 @@ SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
 SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h
+SKELETONS += $(SKEL_OUT)/off_cpu.skel.h
 
 $(SKEL_TMP_OUT) $(LIBBPF_OUTPUT):
 	$(Q)$(MKDIR) -p $@
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ba74fab02e62..3d24d528ba8e 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -49,6 +49,7 @@
 #include "util/clockid.h"
 #include "util/pmu-hybrid.h"
 #include "util/evlist-hybrid.h"
+#include "util/off_cpu.h"
 #include "asm/bug.h"
 #include "perf.h"
 #include "cputopo.h"
@@ -162,6 +163,7 @@ struct record {
 	bool			buildid_mmap;
 	bool			timestamp_filename;
 	bool			timestamp_boundary;
+	bool			off_cpu;
 	struct switch_output	switch_output;
 	unsigned long long	samples;
 	unsigned long		output_max_size;	/* = 0: unlimited */
@@ -903,6 +905,11 @@ static int record__config_text_poke(struct evlist *evlist)
 	return 0;
 }
 
+static int record__config_off_cpu(struct record *rec)
+{
+	return off_cpu_prepare(rec->evlist);
+}
+
 static bool record__kcore_readable(struct machine *machine)
 {
 	char kcore[PATH_MAX];
@@ -2596,6 +2603,9 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	} else
 		status = err;
 
+	if (rec->off_cpu)
+		rec->bytes_written += off_cpu_write(rec->session);
+
 	record__synthesize(rec, true);
 	/* this will be recalculated during process_buildids() */
 	rec->samples = 0;
@@ -3320,6 +3330,9 @@ static struct option __record_options[] = {
 	OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
 			    "write collected trace data into several data files using parallel threads",
 			    record__parse_threads),
+#ifdef HAVE_BPF_SKEL
+	OPT_BOOLEAN(0, "off-cpu", &record.off_cpu, "Enable off-cpu analysis"),
+#endif
 	OPT_END()
 };
 
@@ -3968,6 +3981,14 @@ int cmd_record(int argc, const char **argv)
 		}
 	}
 
+	if (rec->off_cpu) {
+		err = record__config_off_cpu(rec);
+		if (err) {
+			pr_err("record__config_off_cpu failed, error %d\n", err);
+			goto out;
+		}
+	}
+
 	if (record_opts__config(&rec->opts)) {
 		err = -EINVAL;
 		goto out;
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index 9a7209a99e16..a51267d88ca9 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -147,6 +147,7 @@ perf-$(CONFIG_LIBBPF) += bpf_map.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o
 perf-$(CONFIG_PERF_BPF_SKEL) += bpf_ftrace.o
+perf-$(CONFIG_PERF_BPF_SKEL) += bpf_off_cpu.o
 perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
 perf-$(CONFIG_LIBELF) += symbol-elf.o
 perf-$(CONFIG_LIBELF) += probe-file.o
diff --git a/tools/perf/util/bpf_off_cpu.c b/tools/perf/util/bpf_off_cpu.c
new file mode 100644
index 000000000000..1f87d2a9b86d
--- /dev/null
+++ b/tools/perf/util/bpf_off_cpu.c
@@ -0,0 +1,208 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "util/bpf_counter.h"
+#include "util/debug.h"
+#include "util/evsel.h"
+#include "util/evlist.h"
+#include "util/off_cpu.h"
+#include "util/perf-hooks.h"
+#include "util/session.h"
+#include <bpf/bpf.h>
+
+#include "bpf_skel/off_cpu.skel.h"
+
+#define MAX_STACKS  32
+/* we don't need an actual timestamp, just want to put the samples last */
+#define OFF_CPU_TIMESTAMP  (~0ull << 32)
+
+static struct off_cpu_bpf *skel;
+
+struct off_cpu_key {
+	u32 pid;
+	u32 tgid;
+	u32 stack_id;
+	u32 state;
+	u64 cgroup_id;
+};
+
+union off_cpu_data {
+	struct perf_event_header hdr;
+	u64 array[1024 / sizeof(u64)];
+};
+
+static int off_cpu_config(struct evlist *evlist)
+{
+	struct evsel *evsel;
+	struct perf_event_attr attr = {
+		.type	= PERF_TYPE_SOFTWARE,
+		.config = PERF_COUNT_SW_BPF_OUTPUT,
+		.size	= sizeof(attr), /* to capture ABI version */
+	};
+
+	evsel = evsel__new(&attr);
+	if (!evsel)
+		return -ENOMEM;
+
+	evsel->core.attr.freq = 1;
+	evsel->core.attr.sample_period = 1;
+	/* off-cpu analysis depends on stack trace */
+	evsel->core.attr.sample_type = PERF_SAMPLE_CALLCHAIN;
+
+	evlist__add(evlist, evsel);
+
+	free(evsel->name);
+	evsel->name = strdup("offcpu-time");
+	if (evsel->name == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void off_cpu_start(void *arg __maybe_unused)
+{
+	skel->bss->enabled = 1;
+}
+
+static void off_cpu_finish(void *arg __maybe_unused)
+{
+	skel->bss->enabled = 0;
+	off_cpu_bpf__destroy(skel);
+}
+
+int off_cpu_prepare(struct evlist *evlist)
+{
+	int err;
+
+	if (off_cpu_config(evlist) < 0) {
+		pr_err("Failed to config off-cpu BPF event\n");
+		return -1;
+	}
+
+	set_max_rlimit();
+
+	skel = off_cpu_bpf__open_and_load();
+	if (!skel) {
+		pr_err("Failed to open off-cpu skeleton\n");
+		return -1;
+	}
+
+	err = off_cpu_bpf__attach(skel);
+	if (err) {
+		pr_err("Failed to attach off-cpu skeleton\n");
+		goto out;
+	}
+
+	if (perf_hooks__set_hook("record_start", off_cpu_start, NULL) ||
+	    perf_hooks__set_hook("record_end", off_cpu_finish, NULL)) {
+		pr_err("Failed to attach off-cpu skeleton\n");
+		goto out;
+	}
+
+	return 0;
+
+out:
+	off_cpu_bpf__destroy(skel);
+	return -1;
+}
+
+int off_cpu_write(struct perf_session *session)
+{
+	int bytes = 0, size;
+	int fd, stack;
+	bool found = false;
+	u64 sample_type, val, sid = 0;
+	struct evsel *evsel;
+	struct perf_data_file *file = &session->data->file;
+	struct off_cpu_key prev, key;
+	union off_cpu_data data = {
+		.hdr = {
+			.type = PERF_RECORD_SAMPLE,
+			.misc = PERF_RECORD_MISC_USER,
+		},
+	};
+	u64 tstamp = OFF_CPU_TIMESTAMP;
+
+	skel->bss->enabled = 0;
+
+	evlist__for_each_entry(session->evlist, evsel) {
+		if (!strcmp(evsel__name(evsel), "offcpu-time")) {
+			found = true;
+			break;
+		}
+	}
+
+	if (!found) {
+		pr_err("offcpu-time evsel not found\n");
+		return 0;
+	}
+
+	sample_type = evsel->core.attr.sample_type;
+
+	if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER)) {
+		if (evsel->core.id)
+			sid = evsel->core.id[0];
+	}
+
+	fd = bpf_map__fd(skel->maps.off_cpu);
+	stack = bpf_map__fd(skel->maps.stacks);
+	memset(&prev, 0, sizeof(prev));
+
+	while (!bpf_map_get_next_key(fd, &prev, &key)) {
+		int n = 1;  /* start from perf_event_header */
+		int ip_pos = -1;
+
+		bpf_map_lookup_elem(fd, &key, &val);
+
+		if (sample_type & PERF_SAMPLE_IDENTIFIER)
+			data.array[n++] = sid;
+		if (sample_type & PERF_SAMPLE_IP) {
+			ip_pos = n;
+			data.array[n++] = 0;  /* will be updated */
+		}
+		if (sample_type & PERF_SAMPLE_TID)
+			data.array[n++] = (u64)key.pid << 32 | key.tgid;
+		if (sample_type & PERF_SAMPLE_TIME)
+			data.array[n++] = tstamp;
+		if (sample_type & PERF_SAMPLE_ID)
+			data.array[n++] = sid;
+		if (sample_type & PERF_SAMPLE_CPU)
+			data.array[n++] = 0;
+		if (sample_type & PERF_SAMPLE_PERIOD)
+			data.array[n++] = val;
+		if (sample_type & PERF_SAMPLE_CALLCHAIN) {
+			int len = 0;
+
+			/* data.array[n] is callchain->nr (updated later) */
+			data.array[n + 1] = PERF_CONTEXT_USER;
+			data.array[n + 2] = 0;
+
+			bpf_map_lookup_elem(stack, &key.stack_id, &data.array[n + 2]);
+			while (data.array[n + 2 + len])
+				len++;
+
+			/* update length of callchain */
+			data.array[n] = len + 1;
+
+			/* update sample ip with the first callchain entry */
+			if (ip_pos >= 0)
+				data.array[ip_pos] = data.array[n + 2];
+
+			/* calculate sample callchain data array length */
+			n += len + 2;
+		}
+		/* TODO: handle more sample types */
+
+		size = n * sizeof(u64);
+		data.hdr.size = size;
+		bytes += size;
+
+		if (perf_data_file__write(file, &data, size) < 0) {
+			pr_err("failed to write perf data, error: %m\n");
+			return bytes;
+		}
+
+		prev = key;
+		/* increase dummy timestamp to sort later samples */
+		tstamp++;
+	}
+	return bytes;
+}
diff --git a/tools/perf/util/bpf_skel/off_cpu.bpf.c b/tools/perf/util/bpf_skel/off_cpu.bpf.c
new file mode 100644
index 000000000000..2bc6f7cc59ea
--- /dev/null
+++ b/tools/perf/util/bpf_skel/off_cpu.bpf.c
@@ -0,0 +1,137 @@
+// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+// Copyright (c) 2022 Google
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+/* task->flags for off-cpu analysis */
+#define PF_KTHREAD   0x00200000  /* I am a kernel thread */
+
+/* task->state for off-cpu analysis */
+#define TASK_INTERRUPTIBLE	0x0001
+#define TASK_UNINTERRUPTIBLE	0x0002
+
+#define MAX_STACKS   32
+#define MAX_ENTRIES  102400
+
+struct tstamp_data {
+	__u32 stack_id;
+	__u32 state;
+	__u64 timestamp;
+};
+
+struct offcpu_key {
+	__u32 pid;
+	__u32 tgid;
+	__u32 stack_id;
+	__u32 state;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, MAX_STACKS * sizeof(__u64));
+	__uint(max_entries, MAX_ENTRIES);
+} stacks SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(struct tstamp_data));
+	__uint(max_entries, MAX_ENTRIES);
+} tstamp SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(key_size, sizeof(struct offcpu_key));
+	__uint(value_size, sizeof(__u64));
+	__uint(max_entries, MAX_ENTRIES);
+} off_cpu SEC(".maps");
+
+/* old kernel task_struct definition */
+struct task_struct___old {
+	long state;
+} __attribute__((preserve_access_index));
+
+int enabled = 0;
+
+/*
+ * task_struct->state was recently renamed to __state, an incompatible
+ * change.  Use the BPF CO-RE "ignored suffix rule" to deal with it:
+ *
+ * https://nakryiko.com/posts/bpf-core-reference-guide/#handling-incompatible-field-and-type-changes
+ */
+static inline int get_task_state(struct task_struct *t)
+{
+	if (bpf_core_field_exists(t->__state))
+		return BPF_CORE_READ(t, __state);
+
+	/* recast pointer to capture task_struct___old type for compiler */
+	struct task_struct___old *t_old = (void *)t;
+
+	/* now use old "state" name of the field */
+	return BPF_CORE_READ(t_old, state);
+}
+
+SEC("tp_btf/sched_switch")
+int on_switch(u64 *ctx)
+{
+	__u64 ts;
+	int state;
+	__u32 pid, stack_id;
+	struct task_struct *prev, *next;
+	struct tstamp_data elem, *pelem;
+
+	if (!enabled)
+		return 0;
+
+	prev = (struct task_struct *)ctx[1];
+	next = (struct task_struct *)ctx[2];
+	state = get_task_state(prev);
+
+	ts = bpf_ktime_get_ns();
+
+	if (prev->flags & PF_KTHREAD)
+		goto next;
+	if (state != TASK_INTERRUPTIBLE &&
+	    state != TASK_UNINTERRUPTIBLE)
+		goto next;
+
+	stack_id = bpf_get_stackid(ctx, &stacks,
+				   BPF_F_FAST_STACK_CMP | BPF_F_USER_STACK);
+
+	elem.timestamp = ts;
+	elem.state = state;
+	elem.stack_id = stack_id;
+
+	pid = prev->pid;
+	bpf_map_update_elem(&tstamp, &pid, &elem, BPF_ANY);
+
+next:
+	pid = next->pid;
+	pelem = bpf_map_lookup_elem(&tstamp, &pid);
+
+	if (pelem) {
+		struct offcpu_key key = {
+			.pid = next->pid,
+			.tgid = next->tgid,
+			.stack_id = pelem->stack_id,
+			.state = pelem->state,
+		};
+		__u64 delta = ts - pelem->timestamp;
+		__u64 *total;
+
+		total = bpf_map_lookup_elem(&off_cpu, &key);
+		if (total)
+			*total += delta;
+		else
+			bpf_map_update_elem(&off_cpu, &key, &delta, BPF_ANY);
+
+		bpf_map_delete_elem(&tstamp, &pid);
+	}
+
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
diff --git a/tools/perf/util/off_cpu.h b/tools/perf/util/off_cpu.h
new file mode 100644
index 000000000000..d5e063083bae
--- /dev/null
+++ b/tools/perf/util/off_cpu.h
@@ -0,0 +1,22 @@
+#ifndef PERF_UTIL_OFF_CPU_H
+#define PERF_UTIL_OFF_CPU_H
+
+struct evlist;
+struct perf_session;
+
+#ifdef HAVE_BPF_SKEL
+int off_cpu_prepare(struct evlist *evlist);
+int off_cpu_write(struct perf_session *session);
+#else
+static inline int off_cpu_prepare(struct evlist *evlist __maybe_unused)
+{
+	return -1;
+}
+
+static inline int off_cpu_write(struct perf_session *session __maybe_unused)
+{
+	return -1;
+}
+#endif
+
+#endif  /* PERF_UTIL_OFF_CPU_H */
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


Thread overview: 19+ messages
2022-04-22  5:33 [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1) Namhyung Kim
2022-04-22  5:33 ` [PATCH 1/4] perf report: Do not extend sample type of bpf-output event Namhyung Kim
2022-04-22  5:33 ` [PATCH 2/4] perf record: Enable off-cpu analysis with BPF Namhyung Kim
2022-04-22  5:34 ` [PATCH 3/4] perf record: Implement basic filtering for off-cpu Namhyung Kim
2022-04-22  5:34 ` [PATCH 4/4] perf record: Handle argument change in sched_switch Namhyung Kim
2022-04-22 10:11 ` [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v1) Jiri Olsa
2022-04-22 14:53   ` Namhyung Kim
2022-04-22 10:20 ` Milian Wolff
2022-04-22 15:01   ` Namhyung Kim
2022-04-22 19:04     ` Arnaldo Carvalho de Melo
2022-04-25 12:42     ` Milian Wolff
2022-04-25 16:49       ` Ian Rogers
2022-04-25 18:58       ` Namhyung Kim
2022-04-22 15:05 [RFC RESEND " Namhyung Kim
2022-04-22 15:05 ` [PATCH 2/4] perf record: Enable off-cpu analysis with BPF Namhyung Kim
2022-04-27 23:07   ` Hao Luo
2022-04-29 17:27     ` Namhyung Kim
2022-05-06 20:16 [RFC 0/4] perf record: Implement off-cpu profiling with BPF (v2) Namhyung Kim
2022-05-06 20:16 ` [PATCH 2/4] perf record: Enable off-cpu analysis with BPF Namhyung Kim
2022-05-10 17:02   ` Arnaldo Carvalho de Melo
2022-05-12  6:13     ` Namhyung Kim
