All of lore.kernel.org
* [PATCH v2 0/3] Add a tracepoint for BPF perf sampling
@ 2016-08-03  2:47 Brendan Gregg
  2016-08-03  2:47 ` [PATCH v2 1/3] perf/core: Add a tracepoint for " Brendan Gregg
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Brendan Gregg @ 2016-08-03  2:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin
  Cc: linux-kernel, Alexei Starovoitov, Wang Nan, Brendan Gregg

This patchset adds a tracepoint for perf sampling, perf:perf_hrtimer, and
includes a complete example in samples/bpf for using it to frequency
count sampled instruction pointers in a BPF map.

Signed-off-by: Brendan Gregg <bgregg@netflix.com>
---
Changes in v2:
- added samples/bpf/sampleip* example for perf:perf_hrtimer
- added bpf-output to perf script usage message

Brendan Gregg (3):
  perf/core: Add a tracepoint for perf sampling
  samples/bpf: Add a sampling BPF example
  perf script: add bpf-output field to usage message

 include/trace/events/perf.h              |  29 +++++
 kernel/events/core.c                     |   5 +
 samples/bpf/Makefile                     |   4 +
 samples/bpf/sampleip_kern.c              |  48 ++++++++
 samples/bpf/sampleip_user.c              | 189 +++++++++++++++++++++++++++++++
 tools/perf/Documentation/perf-script.txt |   4 +-
 tools/perf/builtin-script.c              |   2 +-
 7 files changed, 278 insertions(+), 3 deletions(-)
 create mode 100644 include/trace/events/perf.h
 create mode 100644 samples/bpf/sampleip_kern.c
 create mode 100644 samples/bpf/sampleip_user.c

-- 
2.7.4

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
  2016-08-03  2:47 [PATCH v2 0/3] Add a tracepoint for BPF perf sampling Brendan Gregg
@ 2016-08-03  2:47 ` Brendan Gregg
  2016-08-03  9:48   ` Peter Zijlstra
  2016-08-03  2:47 ` [PATCH v2 2/3] samples/bpf: Add a sampling BPF example Brendan Gregg
  2016-08-03  2:47 ` [PATCH v2 3/3] perf script: add bpf-output field to usage message Brendan Gregg
  2 siblings, 1 reply; 16+ messages in thread
From: Brendan Gregg @ 2016-08-03  2:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin
  Cc: linux-kernel, Alexei Starovoitov, Wang Nan, Brendan Gregg

When perf is performing hrtimer-based sampling, this tracepoint can be used
by BPF to run additional logic on each sample. For example, BPF can fetch
stack traces and frequency count them in kernel context, for an efficient
profiler.

Signed-off-by: Brendan Gregg <bgregg@netflix.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
---
 include/trace/events/perf.h | 29 +++++++++++++++++++++++++++++
 kernel/events/core.c        |  5 +++++
 2 files changed, 34 insertions(+)
 create mode 100644 include/trace/events/perf.h

diff --git a/include/trace/events/perf.h b/include/trace/events/perf.h
new file mode 100644
index 0000000..461770d
--- /dev/null
+++ b/include/trace/events/perf.h
@@ -0,0 +1,29 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM perf
+
+#if !defined(_TRACE_PERF_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PERF_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(perf_hrtimer,
+	TP_PROTO(struct pt_regs *regs, struct perf_event *event),
+
+	TP_ARGS(regs, event),
+
+	TP_STRUCT__entry(
+		__field(struct pt_regs *, regs)
+		__field(struct perf_event *, event)
+	),
+
+	TP_fast_assign(
+		__entry->regs = regs;
+		__entry->event = event;
+	),
+
+	TP_printk("regs=%p evt=%p", __entry->regs, __entry->event)
+);
+#endif /* _TRACE_PERF_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 356a6c7..78bce19 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -51,6 +51,9 @@
 
 #include <asm/irq_regs.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/perf.h>
+
 typedef int (*remote_function_f)(void *);
 
 struct remote_function_call {
@@ -8062,6 +8065,8 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	perf_sample_data_init(&data, 0, event->hw.last_period);
 	regs = get_irq_regs();
 
+	trace_perf_hrtimer(regs, event);
+
 	if (regs && !perf_exclude_event(event, regs)) {
 		if (!(event->attr.exclude_idle && is_idle_task(current)))
 			if (__perf_event_overflow(event, 1, &data, regs))
-- 
2.7.4


* [PATCH v2 2/3] samples/bpf: Add a sampling BPF example
  2016-08-03  2:47 [PATCH v2 0/3] Add a tracepoint for BPF perf sampling Brendan Gregg
  2016-08-03  2:47 ` [PATCH v2 1/3] perf/core: Add a tracepoint for " Brendan Gregg
@ 2016-08-03  2:47 ` Brendan Gregg
  2016-08-03  2:47 ` [PATCH v2 3/3] perf script: add bpf-output field to usage message Brendan Gregg
  2 siblings, 0 replies; 16+ messages in thread
From: Brendan Gregg @ 2016-08-03  2:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin
  Cc: linux-kernel, Alexei Starovoitov, Wang Nan, Brendan Gregg

This example samples the instruction pointer at a timed interval, and
frequency counts it in a BPF map. It is an example of summarizing sampled
data in-kernel for passing to user space. It uses the perf:perf_hrtimer
tracepoint with perf_events sampling.

Example output:

Sampling at 99 Hertz for 5 seconds. Ctrl-C also ends.
ADDR                KSYM                             COUNT
0xffffffff81257088  __fsnotify_parent                1
0x7f20b792d9b0      (user)                           1
0xffffffff8121469e  __vfs_read                       1
0xffffffff81214afd  rw_verify_area                   1
0xffffffff8123327e  __fget_light                     1
0x7fc0965a6e2c      (user)                           1
0xffffffff81233c04  __fdget_pos                      1
0xffffffff81378528  common_file_perm                 1
0x404d90            (user)                           1
0xffffffff81214c13  vfs_read                         1
[...]
0xffffffff813d9e97  copy_user_enhanced_fast_string   3
0xffffffff817e310c  _raw_spin_lock_irqsave           4
0xffffffff817e31a0  entry_SYSCALL_64_fastpath        4
0xffffffff814fb96c  extract_crng                     6
0xffffffff813d9e95  copy_user_enhanced_fast_string   7
0xffffffff814fb8a3  _extract_crng                    7
0xffffffff817e2d55  _raw_spin_unlock_irqrestore      1399
0xffffffff8105fb46  native_safe_halt                 2190

It also has basic options:

USAGE: sampleip [-F freq] [duration]
       -F freq    # sample frequency (Hertz), default 99
       duration   # sampling duration (seconds), default 5

Signed-off-by: Brendan Gregg <bgregg@netflix.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
---
 samples/bpf/Makefile        |   4 +
 samples/bpf/sampleip_kern.c |  48 +++++++++++
 samples/bpf/sampleip_user.c | 189 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 241 insertions(+)
 create mode 100644 samples/bpf/sampleip_kern.c
 create mode 100644 samples/bpf/sampleip_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 90ebf7d..dc88a1e 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -24,6 +24,7 @@ hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += xdp1
 hostprogs-y += xdp2
+hostprogs-y += sampleip
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -49,6 +50,7 @@ test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
+sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -74,6 +76,7 @@ always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
 always += xdp1_kern.o
 always += xdp2_kern.o
+always += sampleip_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -97,6 +100,7 @@ HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
 HOSTLOADLIBES_xdp1 += -lelf
 HOSTLOADLIBES_xdp2 += -lelf
+HOSTLOADLIBES_sampleip += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/sampleip_kern.c b/samples/bpf/sampleip_kern.c
new file mode 100644
index 0000000..afec3fe
--- /dev/null
+++ b/samples/bpf/sampleip_kern.c
@@ -0,0 +1,48 @@
+/* Copyright 2016 Netflix, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include <linux/ptrace.h>
+#include "bpf_helpers.h"
+
+#define MAX_IPS		8192
+
+#define _(P) ({typeof(P) val; bpf_probe_read(&val, sizeof(val), &P); val;})
+
+struct bpf_map_def SEC("maps") ip_map = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(u64),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_IPS,
+};
+
+/* from /sys/kernel/debug/tracing/events/perf/perf_hrtimer/format */
+struct perf_hrtimer_args {
+	unsigned long long pad;
+	struct pt_regs *regs;
+	struct perf_event *event;
+};
+SEC("tracepoint/perf/perf_hrtimer")
+int do_sample(struct perf_hrtimer_args *args)
+{
+	struct pt_regs *regs;
+	u64 ip;
+	u32 *value, init_val = 1;
+
+	regs = _(args->regs);
+	ip = _(regs->ip);
+	value = bpf_map_lookup_elem(&ip_map, &ip);
+	if (value)
+		*value += 1;
+	else
+		/* E2BIG not tested for this example only */
+		bpf_map_update_elem(&ip_map, &ip, &init_val, BPF_ANY);
+
+	return 0;
+}
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/sampleip_user.c b/samples/bpf/sampleip_user.c
new file mode 100644
index 0000000..da0727d
--- /dev/null
+++ b/samples/bpf/sampleip_user.c
@@ -0,0 +1,189 @@
+/*
+ * sampleip: sample instruction pointer and frequency count in a BPF map.
+ *
+ * Copyright 2016 Netflix, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <errno.h>
+#include <signal.h>
+#include <string.h>
+#include <linux/perf_event.h>
+#include <linux/ptrace.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define DEFAULT_FREQ	99
+#define DEFAULT_SECS	5
+#define MAX_IPS		8192
+#define PAGE_OFFSET	0xffff880000000000
+
+static int nr_cpus;
+
+static void usage(void)
+{
+	printf("USAGE: sampleip [-F freq] [duration]\n");
+	printf("       -F freq    # sample frequency (Hertz), default 99\n");
+	printf("       duration   # sampling duration (seconds), default 5\n");
+}
+
+static int sampling_start(int *pmu_fd, int freq)
+{
+	int i;
+
+	struct perf_event_attr pe_sample_attr = {
+		.type = PERF_TYPE_SOFTWARE,
+		.freq = 1,
+		.sample_period = freq,
+	};
+
+	for (i = 0; i < nr_cpus; i++) {
+		pmu_fd[i] = perf_event_open(&pe_sample_attr, -1 /* pid */, i,
+					    -1 /* group_fd */, 0 /* flags */);
+		if (pmu_fd[i] < 0) {
+			fprintf(stderr, "ERROR: Initializing perf sampling\n");
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
+static void sampling_end(int *pmu_fd)
+{
+	int i;
+
+	for (i = 0; i < nr_cpus; i++)
+		close(pmu_fd[i]);
+}
+
+struct ipcount {
+	__u64 ip;
+	__u32 count;
+};
+
+/* used for sorting */
+struct ipcount counts[MAX_IPS];
+
+static int count_cmp(const void *p1, const void *p2)
+{
+	return ((struct ipcount *)p1)->count - ((struct ipcount *)p2)->count;
+}
+
+static void print_ip_map(int fd)
+{
+	struct ksym *sym;
+	__u64 key, next_key;
+	__u32 value;
+	int i, max;
+
+	printf("%-19s %-32s %s\n", "ADDR", "KSYM", "COUNT");
+
+	/* fetch IPs and counts */
+	key = 0, i = 0;
+	while (bpf_get_next_key(fd, &key, &next_key) == 0) {
+		bpf_lookup_elem(fd, &next_key, &value);
+		counts[i].ip = next_key;
+		counts[i++].count = value;
+		key = next_key;
+	}
+	max = i;
+
+	/* sort and print */
+	qsort(counts, max, sizeof(struct ipcount), count_cmp);
+	for (i = 0; i < max; i++) {
+		if (counts[i].ip > PAGE_OFFSET) {
+			sym = ksym_search(counts[i].ip);
+			printf("0x%-17llx %-32s %u\n", counts[i].ip, sym->name,
+			       counts[i].count);
+		} else {
+			printf("0x%-17llx %-32s %u\n", counts[i].ip, "(user)",
+			       counts[i].count);
+		}
+	}
+
+	if (max == MAX_IPS) {
+		printf("WARNING: IP hash was full (max %d entries); ", max);
+		printf("may have dropped samples\n");
+	}
+}
+
+static void int_exit(int sig)
+{
+	printf("\n");
+	print_ip_map(map_fd[0]);
+	exit(0);
+}
+
+int main(int argc, char **argv)
+{
+	char filename[256];
+	int *pmu_fd, opt, freq = DEFAULT_FREQ, secs = DEFAULT_SECS;
+
+	/* process arguments */
+	while ((opt = getopt(argc, argv, "F:h")) != -1) {
+		switch (opt) {
+		case 'F':
+			freq = atoi(optarg);
+			break;
+		case 'h':
+		default:
+			usage();
+			return 0;
+		}
+	}
+	if (argc - optind == 1)
+		secs = atoi(argv[optind]);
+	if (freq == 0 || secs == 0) {
+		usage();
+		return 1;
+	}
+
+	/* initialize kernel symbol translation */
+	if (load_kallsyms()) {
+		fprintf(stderr, "ERROR: loading /proc/kallsyms\n");
+		return 2;
+	}
+
+	/* create perf FDs for each CPU */
+	nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
+	pmu_fd = malloc(nr_cpus * sizeof(int));
+	if (pmu_fd == NULL) {
+		fprintf(stderr, "ERROR: malloc of pmu_fd\n");
+		return 1;
+	}
+
+	/* load BPF program */
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	if (load_bpf_file(filename)) {
+		fprintf(stderr, "ERROR: loading BPF program (errno %d):\n",
+			errno);
+		if (strcmp(bpf_log_buf, "") == 0)
+			fprintf(stderr, "Try: ulimit -l unlimited\n");
+		else
+			fprintf(stderr, "%s", bpf_log_buf);
+		return 1;
+	}
+	signal(SIGINT, int_exit);
+
+	/* do sampling */
+	printf("Sampling at %d Hertz for %d seconds. Ctrl-C also ends.\n",
+	       freq, secs);
+	if (sampling_start(pmu_fd, freq) != 0)
+		return 1;
+	sleep(secs);
+	sampling_end(pmu_fd);
+	free(pmu_fd);
+
+	/* output sample counts */
+	print_ip_map(map_fd[0]);
+
+	return 0;
+}
-- 
2.7.4


* [PATCH v2 3/3] perf script: add bpf-output field to usage message
  2016-08-03  2:47 [PATCH v2 0/3] Add a tracepoint for BPF perf sampling Brendan Gregg
  2016-08-03  2:47 ` [PATCH v2 1/3] perf/core: Add a tracepoint for " Brendan Gregg
  2016-08-03  2:47 ` [PATCH v2 2/3] samples/bpf: Add a sampling BPF example Brendan Gregg
@ 2016-08-03  2:47 ` Brendan Gregg
  2016-08-09 19:16   ` [tip:perf/urgent] perf script: Add 'bpf-output' " tip-bot for Brendan Gregg
  2 siblings, 1 reply; 16+ messages in thread
From: Brendan Gregg @ 2016-08-03  2:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin
  Cc: linux-kernel, Alexei Starovoitov, Wang Nan, Brendan Gregg

This adds the bpf-output field to the perf script usage message, and docs.

Signed-off-by: Brendan Gregg <bgregg@netflix.com>
Cc: Wang Nan <wangnan0@huawei.com>
---
 tools/perf/Documentation/perf-script.txt | 4 ++--
 tools/perf/builtin-script.c              | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index 1f6c705..053bbbd 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -116,8 +116,8 @@ OPTIONS
 --fields::
         Comma separated list of fields to print. Options are:
         comm, tid, pid, time, cpu, event, trace, ip, sym, dso, addr, symoff,
-	srcline, period, iregs, brstack, brstacksym, flags.
-        Field list can be prepended with the type, trace, sw or hw,
+        srcline, period, iregs, brstack, brstacksym, flags, bpf-output,
+        callindent. Field list can be prepended with the type, trace, sw or hw,
         to indicate to which event type the field list applies.
         e.g., -F sw:comm,tid,time,ip,sym  and -F trace:time,cpu,trace
 
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 971ff91..9c640a8 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -2116,7 +2116,7 @@ int cmd_script(int argc, const char **argv, const char *prefix __maybe_unused)
 		     "Valid types: hw,sw,trace,raw. "
 		     "Fields: comm,tid,pid,time,cpu,event,trace,ip,sym,dso,"
 		     "addr,symoff,period,iregs,brstack,brstacksym,flags,"
-		     "callindent", parse_output_fields),
+		     "bpf-output,callindent", parse_output_fields),
 	OPT_BOOLEAN('a', "all-cpus", &system_wide,
 		    "system-wide collection from all CPUs"),
 	OPT_STRING('S', "symbols", &symbol_conf.sym_list_str, "symbol[,symbol...]",
-- 
2.7.4


* Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
  2016-08-03  2:47 ` [PATCH v2 1/3] perf/core: Add a tracepoint for " Brendan Gregg
@ 2016-08-03  9:48   ` Peter Zijlstra
  2016-08-03 18:57     ` Brendan Gregg
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-08-03  9:48 UTC (permalink / raw)
  To: Brendan Gregg
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	linux-kernel, Alexei Starovoitov, Wang Nan

On Wed, Aug 03, 2016 at 02:47:47AM +0000, Brendan Gregg wrote:
> When perf is performing hrtimer-based sampling, this tracepoint can be used
> by BPF to run additional logic on each sample. For example, BPF can fetch
> stack traces and frequency count them in kernel context, for an efficient
> profiler.

Urgh, no like.

And then next week we'll add a tracepoint to some other random pmu or
whatnot.


* Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
  2016-08-03  9:48   ` Peter Zijlstra
@ 2016-08-03 18:57     ` Brendan Gregg
  2016-08-04 14:28       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Brendan Gregg @ 2016-08-03 18:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	linux-kernel, Alexei Starovoitov, Wang Nan

On Wed, Aug 3, 2016 at 2:48 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Aug 03, 2016 at 02:47:47AM +0000, Brendan Gregg wrote:
> > When perf is performing hrtimer-based sampling, this tracepoint can be used
> > by BPF to run additional logic on each sample. For example, BPF can fetch
> > stack traces and frequency count them in kernel context, for an efficient
> > profiler.
>
> Urgh, no like.
>
> And then next week we'll add a tracepoint to some other random pmu or
> whatnot.

I wouldn't want random pmu tracepoints either, but I can see how it
looks with a new tracepoint category of "perf:", and maybe I shouldn't
have called it that. Would it be better if the tracepoint was called
"timer:perf_hrtimer"?

As for pmu tracepoints: if I were to instrument it (although I wasn't
planning to), I'd put a tracepoint in perf_event_overflow() called
"perf:perf_overflow", with the same arguments. That could then be used
for all PMU overflow events, without needing to add specific
tracepoints. The "perf:" category would then have two tracepoints, and
would not need to grow. Does that option sound better than
"timer:perf_hrtimer"? (As in, keeping the category "perf" so that we
have a later option of adding a perf:perf_overflow tracepoint if need
be?)

It's really perf_hrtimer that we'll use daily, for CPU profiling with
in-kernel aggregations of instruction pointers and stack traces, and
other sampling needs. It's one tracepoint that will add much value,
and will mean that BPF programs can then be attached to kprobes,
uprobes, tracepoints, and timed samples.

Thanks for your consideration,

Brendan


* Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
  2016-08-03 18:57     ` Brendan Gregg
@ 2016-08-04 14:28       ` Peter Zijlstra
  2016-08-05  1:43         ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-08-04 14:28 UTC (permalink / raw)
  To: Brendan Gregg
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	linux-kernel, Alexei Starovoitov, Wang Nan

On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:

> As for pmu tracepoints: if I were to instrument it (although I wasn't
> planning to), I'd put a tracepoint in perf_event_overflow() called
> "perf:perf_overflow", with the same arguments. That could then be used
> for all PMU overflow events, without needing to add specific
> tracepoints. 

Could we not teach BPF to replace event->overflow_handler and inject
itself there?

We don't currently have nice interfaces for doing that, but it should be
possible to do I think. We already have the indirect function call, so
injecting ourself there has 0 overhead.


* Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
  2016-08-04 14:28       ` Peter Zijlstra
@ 2016-08-05  1:43         ` Alexei Starovoitov
  2016-08-05  4:13           ` Brendan Gregg
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2016-08-05  1:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Brendan Gregg, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, linux-kernel, Alexei Starovoitov, Wang Nan

On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
> 
> > As for pmu tracepoints: if I were to instrument it (although I wasn't
> > planning to), I'd put a tracepoint in perf_event_overflow() called
> > "perf:perf_overflow", with the same arguments. That could then be used
> > for all PMU overflow events, without needing to add specific
> > tracepoints. 
> 
> Could we not teach BPF to replace event->overflow_handler and inject
> itself there?
> 
> We don't currently have nice interfaces for doing that, but it should be
> possible to do I think. We already have the indirect function call, so
> injecting ourself there has 0 overhead.

you're right. All makes sense. I guess I was too lazy to look into
how to do it properly. Adding a tracepoint looked like a quick and
easy way to achieve the same.
As far as api goes probably existing IOC_SET_BPF ioctl will do too.
Currently overflow_handler is set at event alloc time. If we start
changing it on the fly with atomic xchg(), afaik things shouldn't
break, since each overflow_handler is run to completion and doesn't
change global state, right?


* Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
  2016-08-05  1:43         ` Alexei Starovoitov
@ 2016-08-05  4:13           ` Brendan Gregg
  2016-08-05  5:24             ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Brendan Gregg @ 2016-08-05  4:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, linux-kernel, Alexei Starovoitov, Wang Nan

On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
>> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
>>
>> > As for pmu tracepoints: if I were to instrument it (although I wasn't
>> > planning to), I'd put a tracepoint in perf_event_overflow() called
>> > "perf:perf_overflow", with the same arguments. That could then be used
>> > for all PMU overflow events, without needing to add specific
>> > tracepoints.
>>
>> Could we not teach BPF to replace event->overflow_handler and inject
>> itself there?
>>
>> We don't currently have nice interfaces for doing that, but it should be
>> possible to do I think. We already have the indirect function call, so
>> injecting ourself there has 0 overhead.

Sounds like a good idea, especially for things like struct
file_operations so that we can statically instrument file system
read/writes with zero non-enabled overhead, and not worry about high
frequency workloads (>10M events/sec).

These perf probes aren't high frequency, though, and the code is not
normally in use, so overhead should be much less of a concern.
Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the
tracepoint code is still adding a mem read, conditional, and branch,
then that's not many instructions, especially considering the normal
use case of these perf functions: creating records and writing to a
perf ring buffer, then picking that up in user space by perf, then
either processing it live or writing to perf.data, back to the file
system, etc. It would be hard to benchmark the effect of adding a few
instructions to that path (and any results may be more sensitive to
cache line placement than the instructions).

The perf:perf_hrtimer probe point is also reading state mid-way
through a function, so it's not quite as simple as wrapping the
function pointer. I do like that idea, though, but for things like
struct file_operations.

>
> you're right. All makes sense. I guess I was too lazy to look into
> how to do it properly. Adding a tracepoint looked like a quick and
> easy way to achieve the same.
> As far as api goes probably existing IOC_SET_BPF ioctl will do too.
> Currently overflow_handler is set at event alloc time. If we start
> changing it on the fly with atomic xchg(), afaik things shouldn't
> break, since each overflow_handler is run to completion and doesn't
> change global state, right?
>

How would it be implemented? I was thinking of adding explicit wrappers, eg:

static ssize_t
__ext4_file_write_iter_tracepoint(struct kiocb *iocb, struct iov_iter *from)
{
        trace_ext4_write_iter(iocb, from);
        return ext4_file_write_iter(iocb, from);
}

Then switching in __ext4_file_write_iter_tracepoint() as needed.

I was wondering about doing this dynamically, but wouldn't that then
create another problem of maintenance -- we'd need to decorate such
code with a comment, at least, to say "this function is exposed as a
tracepoint".

If a dynamic approach is still desired, and the goal is zero
non-enabled overhead, then I'd also wonder if we could leverage
kprobes to do this. Imagine walking a file_operations to find the
addresses, and then kprobe-ing the addresses. Not the same (or
probably as efficient) as setting the function pointer, but, using
existing kprobes.

Brendan


* Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
  2016-08-05  4:13           ` Brendan Gregg
@ 2016-08-05  5:24             ` Alexei Starovoitov
  2016-08-05 10:52               ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2016-08-05  5:24 UTC (permalink / raw)
  To: Brendan Gregg
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, linux-kernel, Alexei Starovoitov, Wang Nan

On Thu, Aug 04, 2016 at 09:13:16PM -0700, Brendan Gregg wrote:
> On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
> >> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
> >>
> >> > As for pmu tracepoints: if I were to instrument it (although I wasn't
> >> > planning to), I'd put a tracepoint in perf_event_overflow() called
> >> > "perf:perf_overflow", with the same arguments. That could then be used
> >> > for all PMU overflow events, without needing to add specific
> >> > tracepoints.
> >>
> >> Could we not teach BPF to replace event->overflow_handler and inject
> >> itself there?
> >>
> >> We don't currently have nice interfaces for doing that, but it should be
> >> possible to do I think. We already have the indirect function call, so
> >> injecting ourself there has 0 overhead.
> 
> Sounds like a good idea, especially for things like struct
> file_operations so that we can statically instrument file system
> read/writes with zero non-enabled overhead, and not worry about high
> frequency workloads (>10M events/sec).
> 
> These perf probes aren't high frequency, though, and the code is not
> normally in use, so overhead should be much less of a concern.
> Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the
> tracepoint code is still adding a mem read, conditional, and branch,
> then that's not many instructions, especially considering the normal
> use case of these perf functions: creating records and writing to a
> perf ring buffer, then picking that up in user space by perf, then
> either processing it live or writing to perf.data, back to the file
> system, etc. It would be hard to benchmark the effect of adding a few
> instructions to that path (and any results may be more sensitive to
> cache line placement than the instructions).

tracepoints are actually zero overhead already via the static-key mechanism.
I don't think Peter's objection for the tracepoint was due to overhead.

> The perf:perf_hrtimer probe point is also reading state mid-way
> through a function, so it's not quite as simple as wrapping the
> function pointer. I do like that idea, though, but for things like
> struct file_operations.
> 
> >
> > you're right. All makes sense. I guess I was too lazy to look into
> > how to do it properly. Adding a tracepoint looked like a quick and
> > easy way to achieve the same.
> > As far as api goes probably existing IOC_SET_BPF ioctl will do too.
> > Currently overflow_handler is set at event alloc time. If we start
> > changing it on the fly with atomic xchg(), afaik things shouldn't
> > break, since each overflow_handler is run to completion and doesn't
> > change global state, right?
> >
> 
> How would it be implemented? I was thinking of adding explicit wrappers, eg:

instead of adding a tracepoint to perf_swevent_hrtimer we can replace
overflow_handler for that particular event with some form of bpf wrapper.
(probably new bpf program type). Then not only periodic events
will be triggering bpf prog, but pmu events as well.
So instead of normal __perf_event_output() writing into ringbuffer,
a bpf prog will be called that can optionally write into different
rb via bpf_perf_event_output. The question is what to pass into the
program to make the most use out of it. 'struct pt_regs' is a done deal.
but perf_sample_data we cannot pass as-is, since it's kernel internal.
Probably something similar to __sk_buff mirror would be needed.
Another nice benefit of doing via overflow_handler instead of tracepoint
is that exclude_idle, exclude_user, exclude_kernel flags of the perf event
will all magically work and program will be event specific.
So two parallel 'perf record'-like sampling won't conflict.


* Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
  2016-08-05  5:24             ` Alexei Starovoitov
@ 2016-08-05 10:52               ` Peter Zijlstra
  2016-08-05 17:22                 ` Brendan Gregg
  2016-08-05 19:45                 ` Alexei Starovoitov
  0 siblings, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2016-08-05 10:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brendan Gregg, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, linux-kernel, Alexei Starovoitov, Wang Nan

On Thu, Aug 04, 2016 at 10:24:06PM -0700, Alexei Starovoitov wrote:
> tracepoints are actually zero overhead already via the static-key mechanism.
> I don't think Peter's objection for the tracepoint was due to overhead.

Almost 0, they still have some I$ footprint, but yes. My main worry is
that we can feed tracepoints into perf, so having tracepoints in perf is
tricky.

I also don't much like this tracepoint being specific to the hrtimer
bits, I can well imagine people wanting to do the same thing for
hardware based samples or whatnot.

> > The perf:perf_hrtimer probe point is also reading state mid-way
> > through a function, so it's not quite as simple as wrapping the
> > function pointer. I do like that idea, though, but for things like
> > struct file_operations.

So what additional state do you need?

> > > Currently overflow_handler is set at event alloc time. If we start
> > > changing it on the fly with atomic xchg(), afaik things shouldn't
> > > break, since each overflow_handler is run to completion and doesn't
> > > change global state, right?

Yes, or even a simple WRITE_ONCE() to replace it, as long as we make
sure to use a READ_ONCE() to load the pointer.

As long as we're sure to limit this poking to a single user its fairly
simple to get right. The moment there can be concurrency a lot of fail
can happen.

> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
> overflow_handler for that particular event with some form of bpf wrapper.
> (probably new bpf program type). Then not only periodic events
> will be triggering bpf prog, but pmu events as well.

Exactly.

> So instead of normal __perf_event_output() writing into ringbuffer,
> a bpf prog will be called that can optionally write into different
> rb via bpf_perf_event_output. 

It could even chain and call into the original function once it's done
and have both outputs.

> The question is what to pass into the
> > program to make the most use out of it. 'struct pt_regs' is a done deal,
> but perf_sample_data we cannot pass as-is, since it's kernel internal.

Urgh, does it have to be stable API? Can't we simply rely on the kernel
headers to provide the right structure definition?


* Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
  2016-08-05 10:52               ` Peter Zijlstra
@ 2016-08-05 17:22                 ` Brendan Gregg
  2016-08-08  9:57                   ` Peter Zijlstra
  2016-08-05 19:45                 ` Alexei Starovoitov
  1 sibling, 1 reply; 16+ messages in thread
From: Brendan Gregg @ 2016-08-05 17:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexei Starovoitov, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, linux-kernel, Alexei Starovoitov, Wang Nan

On Fri, Aug 5, 2016 at 3:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Aug 04, 2016 at 10:24:06PM -0700, Alexei Starovoitov wrote:
>> tracepoints are actually zero overhead already via static-key mechanism.
>> I don't think Peter's objection for the tracepoint was due to overhead.
>
> Almost 0, they still have some I$ footprint, but yes. My main worry is
> that we can feed tracepoints into perf, so having tracepoints in perf is
> tricky.

Coincidentally, I$ footprint was my most recent use case for needing
this: I have an I$-busting workload, and want to profile
instructions at a very high rate to get a breakdown of I$ population.
(Normally I'd use I$ miss overflow, but none of our Linux systems have
PMCs: cloud.)

> I also don't much like this tracepoint being specific to the hrtimer
> bits, I can well imagine people wanting to do the same thing for
> hardware based samples or whatnot.

Sure, which is why I thought we'd have two in a perf category. I'm all
for PMC events, even though we can't currently use them!

>
>> > The perf:perf_hrtimer probe point is also reading state mid-way
>> > through a function, so it's not quite as simple as wrapping the
>> > function pointer. I do like that idea, though, but for things like
>> > struct file_operations.
>
> So what additional state do you need?

I was pulling in regs after get_irq_regs(), struct perf_event *event
after it's populated. Not that hard to duplicate. Just noting it
didn't map directly to the function entry.

I wanted perf_event just for event->ctx->task->pid, so that a BPF
program can differentiate between its samples and other concurrent
sessions.

(I was thinking of changing my patch to expose pid_t instead of
perf_event, since I was noticing it didn't add many instructions.)

[...]
>> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
>> overflow_handler for that particular event with some form of bpf wrapper.
>> (probably new bpf program type). Then not only periodic events
>> will be triggering bpf prog, but pmu events as well.
>
> Exactly.

Although the timer use case is a bit different, and is via
hwc->hrtimer.function = perf_swevent_hrtimer.

[...]
>> The question is what to pass into the
>> program to make the most use out of it. 'struct pt_regs' is a done deal,
>> but perf_sample_data we cannot pass as-is, since it's kernel internal.
>
> Urgh, does it have to be stable API? Can't we simply rely on the kernel
> headers to provide the right structure definition?

For timer it can be: struct pt_regs, pid_t.

So that would restrict your BPF program to one timer, since if you had
two (from one pid) you couldn't tell them apart. But I'm not sure of a
use case for two in-kernel timers. If there were, we could also add
struct perf_event_attr, which has enough info to tell things apart,
and is already exposed to user space.

I haven't looked into the PMU arguments, but perhaps that could be:
struct pt_regs, pid_t, struct perf_event_attr.

Thanks,

Brendan


* Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
  2016-08-05 10:52               ` Peter Zijlstra
  2016-08-05 17:22                 ` Brendan Gregg
@ 2016-08-05 19:45                 ` Alexei Starovoitov
  2016-08-08 10:03                   ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2016-08-05 19:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Brendan Gregg, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, linux-kernel, Alexei Starovoitov, Wang Nan

On Fri, Aug 05, 2016 at 12:52:09PM +0200, Peter Zijlstra wrote:
> > > > Currently overflow_handler is set at event alloc time. If we start
> > > > changing it on the fly with atomic xchg(), afaik things shouldn't
> > > > break, since each overflow_handler is run to completion and doesn't
> > > > change global state, right?
> 
> Yes, or even a simple WRITE_ONCE() to replace it, as long as we make
> sure to use a READ_ONCE() to load the pointer.
> 
> As long as we're sure to limit this poking to a single user its fairly
> simple to get right. The moment there can be concurrency a lot of fail
> can happen.

agreed.

> > So instead of normal __perf_event_output() writing into ringbuffer,
> > a bpf prog will be called that can optionally write into different
> > rb via bpf_perf_event_output. 
> 
> It could even chain and call into the original function once it's done
> and have both outputs.

interesting idea. makes sense.
Also thinking about concurrency and the need to remember the original
handler somewhere, would it be a cleaner API to add a bit to perf_event_attr
and use attr.config1 as the bpf_fd?
Then perf_event_open at event creation time will use the bpf prog as
overflow_handler. That solves concurrency concerns and potential semantic
issues if we go with the ioctl() approach.
Like if we perf_event_open() an event for a task, then bpf attach to it,
what are child tasks and their corresponding inherited events supposed to do?
Inherit the overflow_handler, right? But then detaching bpf in the parent
is supposed to clear it in the inherited events as well. A bit complicated.
I guess we can define it that way.
Just seems easier to do bpf attach at perf_event_open time only.

> > The question is what to pass into the
> > program to make the most use out of it. 'struct pt_regs' is a done deal,
> > but perf_sample_data we cannot pass as-is, since it's kernel internal.
> 
> Urgh, does it have to be stable API? Can't we simply rely on the kernel
> headers to provide the right structure definition?

yes we can. The concern is about assumptions people will make about
perf_sample_data and the speed of access to it. From bpf program point
of view the pointer to perf_sample_data will be opaque unsafe pointer,
so any access to fields would have to be done via bpf_probe_read which
has non-trivial overhead.
If we go with the uapi mirror of perf_sample_data approach, it will be
fast, since mirror is not an actual struct. Like the 'struct __sk_buff' we
have in uapi/linux/bpf.h is a meta structure. It's not allocated anywhere
and no fields are copied. When bpf program does 'skb->vlan_present'
the verifier rewrites it at load time into corresponding access to
kernel internal 'struct sk_buff' fields with bitmask, shifts and such.
For this case we can define something like
struct bpf_perf_sample_data {
  __u64 period;
};
then bpf prog will only be able to access that single field, which the verifier
will translate into 'data->period' where data is 'struct perf_sample_data *'
Later we can add other fields if necessary. The kernel is free to mess
around with perf_sample_data whichever way without impacting bpf progs.


* Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
  2016-08-05 17:22                 ` Brendan Gregg
@ 2016-08-08  9:57                   ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2016-08-08  9:57 UTC (permalink / raw)
  To: Brendan Gregg
  Cc: Alexei Starovoitov, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, linux-kernel, Alexei Starovoitov, Wang Nan

On Fri, Aug 05, 2016 at 10:22:08AM -0700, Brendan Gregg wrote:

> (Normally I'd use I$ miss overflow, but none of our Linux systems have
> PMCs: cloud.)

I think I had better not comment on that ;-)

> >> > The perf:perf_hrtimer probe point is also reading state mid-way
> >> > through a function, so it's not quite as simple as wrapping the
> >> > function pointer. I do like that idea, though, but for things like
> >> > struct file_operations.
> >
> > So what additional state do you need?
> 
> I was pulling in regs after get_irq_regs(), struct perf_event *event
> after it's populated. Not that hard to duplicate. Just noting it
> didn't map directly to the function entry.

Right, both of which are available to the overflow handler.

> I wanted perf_event just for event->ctx->task->pid, so that a BPF
> program can differentiate between its samples and other concurrent
> sessions.
> 
> (I  was thinking of changing my patch to expose pid_t instead of
> perf_event, since I was noticing it didn't add many instructions.)

Slightly confused, event->ctx->task == current, no? We flip that pointer
when we flip the contexts.

At which point, it should be the same as SAMPLE_TID.

!?

> [...]
> >> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
> >> overflow_handler for that particular event with some form of bpf wrapper.
> >> (probably new bpf program type). Then not only periodic events
> >> will be triggering bpf prog, but pmu events as well.
> >
> > Exactly.
> 
> Although the timer use case is a bit different, and is via
> hwc->hrtimer.function = perf_swevent_hrtimer.

Still not entirely sure why you could not hook into
event->overflow_handler.


* Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
  2016-08-05 19:45                 ` Alexei Starovoitov
@ 2016-08-08 10:03                   ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2016-08-08 10:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brendan Gregg, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, linux-kernel, Alexei Starovoitov, Wang Nan

On Fri, Aug 05, 2016 at 12:45:07PM -0700, Alexei Starovoitov wrote:

> Also thinking about concurrency and the need to remember the original
> handler somewhere, would it be a cleaner API to add a bit to perf_event_attr
> and use attr.config1 as the bpf_fd?

attr.config[12] are in use already, typically uncore events need them.
We cannot rely on those bits being unused.

> Then perf_event_open at event creation time will use bpf prog as
> overflow_handler. That solves concurrency concerns and potential semantic
> issues if we go with the ioctl() approach.

> Like if we perf_event_open() an event for a task, then bpf attach to it,
> what are child tasks and their corresponding inherited events supposed to do?
> Inherit the overflow_handler, right? But then detaching bpf in the parent
> is supposed to clear it in the inherited events as well. A bit complicated.
> I guess we can define it that way.
> Just seems easier to do bpf attach at perf_event_open time only.

Which is why I would've liked BPF to create its own events, instead of
userspace passing them in through this array.

Because now you have a chicken'n'egg issue in that you want BPF to use
the overflow handler but need BPF running before you have an actual
handler to link to.

> > Urgh, does it have to be stable API? Can't we simply rely on the kernel
> > headers to provide the right structure definition?
> 
> yes we can. The concern is about assumptions people will make about
> perf_sample_data and the speed of access to it. From bpf program point
> of view the pointer to perf_sample_data will be opaque unsafe pointer,
> so any access to fields would have to be done via bpf_probe_read which
> has non-trivial overhead.
> If we go with the uapi mirror of perf_sample_data approach, it will be
> fast, since mirror is not an actual struct. Like the 'struct __sk_buff' we
> have in uapi/linux/bpf.h is a meta structure. It's not allocated anywhere
> and no fields are copied. When bpf program does 'skb->vlan_present'
> the verifier rewrites it at load time into corresponding access to
> kernel internal 'struct sk_buff' fields with bitmask, shifts and such.
> For this case we can define something like
> struct bpf_perf_sample_data {
>   __u64 period;
> };
> then bpf prog will only be able to access that single field, which the verifier
> will translate into 'data->period' where data is 'struct perf_sample_data *'
> Later we can add other fields if necessary. The kernel is free to mess
> around with perf_sample_data whichever way without impacting bpf progs.

Hmm, I was not aware of that. Should be doable indeed.


* [tip:perf/urgent] perf script: Add 'bpf-output' field to usage message
  2016-08-03  2:47 ` [PATCH v2 3/3] perf script: add bpf-output field to usage message Brendan Gregg
@ 2016-08-09 19:16   ` tip-bot for Brendan Gregg
  0 siblings, 0 replies; 16+ messages in thread
From: tip-bot for Brendan Gregg @ 2016-08-09 19:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: alexander.shishkin, acme, wangnan0, hpa, mingo, bgregg, ast,
	linux-kernel, tglx, peterz

Commit-ID:  bcdc09af3ef30ef071677544ce23a1c8873a2dda
Gitweb:     http://git.kernel.org/tip/bcdc09af3ef30ef071677544ce23a1c8873a2dda
Author:     Brendan Gregg <bgregg@netflix.com>
AuthorDate: Wed, 3 Aug 2016 02:47:49 +0000
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Tue, 9 Aug 2016 10:46:43 -0300

perf script: Add 'bpf-output' field to usage message

This adds the 'bpf-output' field to the perf script usage message, and docs.

Signed-off-by: Brendan Gregg <bgregg@netflix.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/r/1470192469-11910-4-git-send-email-bgregg@netflix.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-script.txt | 4 ++--
 tools/perf/builtin-script.c              | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index 1f6c705..053bbbd 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -116,8 +116,8 @@ OPTIONS
 --fields::
         Comma separated list of fields to print. Options are:
         comm, tid, pid, time, cpu, event, trace, ip, sym, dso, addr, symoff,
-	srcline, period, iregs, brstack, brstacksym, flags.
-        Field list can be prepended with the type, trace, sw or hw,
+        srcline, period, iregs, brstack, brstacksym, flags, bpf-output,
+        callindent. Field list can be prepended with the type, trace, sw or hw,
         to indicate to which event type the field list applies.
         e.g., -F sw:comm,tid,time,ip,sym  and -F trace:time,cpu,trace
 
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 971ff91..9c640a8 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -2116,7 +2116,7 @@ int cmd_script(int argc, const char **argv, const char *prefix __maybe_unused)
 		     "Valid types: hw,sw,trace,raw. "
 		     "Fields: comm,tid,pid,time,cpu,event,trace,ip,sym,dso,"
 		     "addr,symoff,period,iregs,brstack,brstacksym,flags,"
-		     "callindent", parse_output_fields),
+		     "bpf-output,callindent", parse_output_fields),
 	OPT_BOOLEAN('a', "all-cpus", &system_wide,
 		    "system-wide collection from all CPUs"),
 	OPT_STRING('S', "symbols", &symbol_conf.sym_list_str, "symbol[,symbol...]",


end of thread, other threads:[~2016-08-09 19:19 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-03  2:47 [PATCH v2 0/3] Add a tracepoint for BPF perf sampling Brendan Gregg
2016-08-03  2:47 ` [PATCH v2 1/3] perf/core: Add a tracepoint for " Brendan Gregg
2016-08-03  9:48   ` Peter Zijlstra
2016-08-03 18:57     ` Brendan Gregg
2016-08-04 14:28       ` Peter Zijlstra
2016-08-05  1:43         ` Alexei Starovoitov
2016-08-05  4:13           ` Brendan Gregg
2016-08-05  5:24             ` Alexei Starovoitov
2016-08-05 10:52               ` Peter Zijlstra
2016-08-05 17:22                 ` Brendan Gregg
2016-08-08  9:57                   ` Peter Zijlstra
2016-08-05 19:45                 ` Alexei Starovoitov
2016-08-08 10:03                   ` Peter Zijlstra
2016-08-03  2:47 ` [PATCH v2 2/3] samples/bpf: Add a sampling BPF example Brendan Gregg
2016-08-03  2:47 ` [PATCH v2 3/3] perf script: add bpf-output field to usage message Brendan Gregg
2016-08-09 19:16   ` [tip:perf/urgent] perf script: Add 'bpf-output' " tip-bot for Brendan Gregg
