[RFC PATCH v3 0/2] Make eBPF programs output data to perf event

* [RFC PATCH v3 0/2] Make eBPF programs output data to perf event
@ 2015-07-07 11:43 He Kuang
  2015-07-07 11:43 ` [RFC PATCH v3 1/2] tracing: Add new trace type for bpf data output He Kuang
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: He Kuang @ 2015-07-07 11:43 UTC (permalink / raw)
  To: rostedt, ast, masami.hiramatsu.pt, acme, a.p.zijlstra, mingo,
	namhyung, jolsa
  Cc: wangnan0, linux-kernel, hekuang

Hi,

The two previous versions tried to combine the bpf output data with the
sample event of the attached kprobe point, which led to problems with
perf_trace_buf.

After discussion we found that it is not necessary to combine those two
pieces of information; we do not even need the original kprobe output
event at all. Based on this idea, the implementation becomes simple.
Just like what perf does with ftrace:function, we set up a bpf ftrace
entry for perf tools to poll and collect data from, and the eBPF program
uses a helper function to submit data to the ring buffer. That's all.
This implementation also leaves all issues such as sample types to the
perf command line.

Currently, we just use raw data in the format fields so as not to
interfere with the perf sample parser, because the raw data can be
parsed easily by a perf script plugin.

The sample from patch v1, slightly modified:

  SEC("generic_perform_write=generic_perform_write")
  int NODE_generic_perform_write(struct pt_regs *ctx)
  {
          char fmt[] = "generic_perform_write, cur=0x%llx, del=0x%llx\n";
          u64 cur_time, del_time;
          int ind = 0;
          struct time_table output, *last = bpf_map_lookup_elem(&global_time_table, &ind);
          if (!last)
                  return 0;

          cur_time = bpf_ktime_get_ns();
          if (!last->last_time)
                  del_time = 0;
          else
                  del_time = cur_time - last->last_time;

          /* For debug */
          bpf_trace_printk(fmt, sizeof(fmt), cur_time, del_time);

          /* Update time table */
          output.last_time = cur_time;
          bpf_map_update_elem(&global_time_table, &ind, &output, BPF_ANY);

          /* This is an arbitrary condition to demonstrate the function */
          if (del_time < 1000)
                  return 0;

          bpf_output_sample(&del_time, sizeof(del_time));

          return 0;
  }

Record bpf events:

  $ perf record -e ftrace:bpf -e sample.o -- dd if=/dev/zero of=test bs=4k count=3

The results as shown by perf script:

  $ perf script
  dd   994 [000]   166.686779: ftrace:bpf: 8: (000000000542b426, ...)
  dd   994 [000]   166.686779: ftrace:bpf: 8: (00000000001011ef, ...)
  dd   994 [000]   166.686779: ftrace:bpf: 8: (000000000007a2b6, ...)
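The text after "ftrace:bpf: " follows the F_printk format from patch 1,
"%ld: (%016llx, ...)", so a script or plugin can recover the size and the
first raw value with a plain scan. A minimal user-space sketch, assuming
that format stays as posted; parse_bpf_sample() is a hypothetical helper
name and not part of perf:

```c
#include <stdio.h>

/*
 * Parse the text that follows "ftrace:bpf: " in a perf script line,
 * for example "8: (000000000542b426, ...)".
 *
 * On success, stores the payload size and the first 64-bit raw value
 * and returns 0. Returns -1 if the text does not match the format.
 * Hypothetical helper for illustration only.
 */
static int parse_bpf_sample(const char *text, long *size,
			    unsigned long long *first)
{
	/* "%ld: (%llx" mirrors the leading part of the F_printk format. */
	if (sscanf(text, "%ld: (%llx", size, first) != 2)
		return -1;
	return 0;
}
```

In the sample above, the first value would be the del_time passed to the
helper, in nanoseconds.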

Thank you.

He Kuang (2):
  tracing: Add new trace type for bpf data output
  bpf: Introduce function for outputing data to perf event

 include/uapi/linux/bpf.h     |  3 +++
 kernel/trace/bpf_trace.c     | 43 +++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h         |  6 ++++++
 kernel/trace/trace_entries.h | 18 ++++++++++++++++++
 samples/bpf/bpf_helpers.h    |  2 ++
 5 files changed, 72 insertions(+)

-- 
1.8.5.2



* [RFC PATCH v3 1/2] tracing: Add new trace type for bpf data output
  2015-07-07 11:43 [RFC PATCH v3 0/2] Make eBPF programs output data to perf event He Kuang
@ 2015-07-07 11:43 ` He Kuang
  2015-07-07 22:47   ` Peter Zijlstra
  2015-07-08  1:35   ` Alexei Starovoitov
  2015-07-07 11:43 ` [RFC PATCH v3 2/2] bpf: Introduce function for outputing data to perf event He Kuang
  2015-07-08  1:31 ` [RFC PATCH v3 0/2] Make eBPF programs output " Alexei Starovoitov
  2 siblings, 2 replies; 7+ messages in thread
From: He Kuang @ 2015-07-07 11:43 UTC (permalink / raw)
  To: rostedt, ast, masami.hiramatsu.pt, acme, a.p.zijlstra, mingo,
	namhyung, jolsa
  Cc: wangnan0, linux-kernel, hekuang

Add TRACE_BPF as a new trace type to establish the infrastructure for
outputting bpf data to perf. This new trace type creates a static
singleton ftrace entry in the kernel; userspace perf tools can detect
and use this new ftrace entry just like the existing tracepoint events.

This adds a new bpf ftrace entry in debugfs:

     /sys/kernel/debug/tracing/events/ftrace/bpf

Userspace perf tools detect the new tracepoint event as:

     ftrace:bpf                          [Tracepoint event]

Data in the ring buffer of perf events attached to this ftrace:bpf event
can be polled out, and sample types and other attributes can be set on
those events directly without touching the original kprobe events.

Signed-off-by: He Kuang <hekuang@huawei.com>
---
 kernel/trace/trace.h         |  1 +
 kernel/trace/trace_entries.h | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index d261201..d135f55 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -38,6 +38,7 @@ enum trace_type {
 	TRACE_USER_STACK,
 	TRACE_BLK,
 	TRACE_BPUTS,
+	TRACE_BPF,
 
 	__TRACE_LAST_TYPE,
 };
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index ee7b94a..c237212 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -322,3 +322,21 @@ FTRACE_ENTRY(branch, trace_branch,
 	FILTER_OTHER
 );
 
+#define TRACE_BPF_MAX_ENTRY	64
+#define TRACE_BPF_MAX_SIZE	(TRACE_BPF_MAX_ENTRY * sizeof(u64))
+
+FTRACE_ENTRY_REG(bpf, trace_bpf,
+
+	TRACE_BPF,
+
+	F_STRUCT(
+		__field(long,	size)
+		__array(u64,	raw_data,	TRACE_BPF_MAX_ENTRY)
+	),
+
+	F_printk("%ld: (%016llx, ...)", __entry->size, __entry->raw_data[0]),
+
+	FILTER_OTHER,
+
+	perf_ftrace_event_register
+);
-- 
1.8.5.2



* [RFC PATCH v3 2/2] bpf: Introduce function for outputing data to perf event
  2015-07-07 11:43 [RFC PATCH v3 0/2] Make eBPF programs output data to perf event He Kuang
  2015-07-07 11:43 ` [RFC PATCH v3 1/2] tracing: Add new trace type for bpf data output He Kuang
@ 2015-07-07 11:43 ` He Kuang
  2015-07-08  1:45   ` Alexei Starovoitov
  2015-07-08  1:31 ` [RFC PATCH v3 0/2] Make eBPF programs output " Alexei Starovoitov
  2 siblings, 1 reply; 7+ messages in thread
From: He Kuang @ 2015-07-07 11:43 UTC (permalink / raw)
  To: rostedt, ast, masami.hiramatsu.pt, acme, a.p.zijlstra, mingo,
	namhyung, jolsa
  Cc: wangnan0, linux-kernel, hekuang

There are scenarios in which we need an eBPF program to record not only
kprobe point args, but also PMU counters, time latencies or cache miss
counts between two probe points, and other information available when
the probe point is entered.

This helper function gives eBPF programs the ability to output data as a
perf sample event. The function works like kprobe_perf_func(): it packs
the data from bpf stack space into a sample record and submits it to the
ring buffer of the perf_events which are bound to the BPF ftrace entry.
Userspace perf tools can record the BPF ftrace event to collect those
records.
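
The record size is computed the same way in the patch below: the header
plus payload is padded so that, once a u32 is added in front of it, the
total is a multiple of sizeof(u64). As far as I understand, that u32 is
the size field perf places before raw sample data. A user-space sketch of
the arithmetic only; ALIGN_UP() and bpf_record_size() are hypothetical
names, and the 16-byte header used in the example is an assumption for a
64-bit build:

```c
#include <stddef.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a)	(((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/*
 * Size of the buffer to reserve for a record made of a head_size-byte
 * header followed by dsize bytes of payload. The result plus one u32
 * is a multiple of sizeof(u64).
 * Hypothetical helper mirroring the arithmetic in bpf_output_data().
 */
static size_t bpf_record_size(size_t head_size, size_t dsize)
{
	size_t raw = head_size + dsize;

	return ALIGN_UP(raw + sizeof(uint32_t), sizeof(uint64_t)) -
	       sizeof(uint32_t);
}
```

With a 16-byte header and the 8-byte del_time payload from the cover
letter, this reserves 28 bytes, so 28 + 4 = 32 is u64 aligned.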

Signed-off-by: He Kuang <hekuang@huawei.com>
---
 include/uapi/linux/bpf.h  |  3 +++
 kernel/trace/bpf_trace.c  | 43 +++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h      |  5 +++++
 samples/bpf/bpf_helpers.h |  2 ++
 4 files changed, 53 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a9ebdf5..f44b0aa 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -210,6 +210,9 @@ enum bpf_func_id {
 	 * Return: 0 on success
 	 */
 	BPF_FUNC_l4_csum_replace,
+
+	/* int bpf_output_data(void *src, int size, void *regs) */
+	BPF_FUNC_output_data,
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 2d56ce5..45dbeab 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -79,6 +79,47 @@ static const struct bpf_func_proto bpf_probe_read_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+static u64 bpf_output_data(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	void *src = (void *) (long) r1;
+	int dsize = (int) r2, __size, size;
+	void *regs = (void *) (long) r3;
+	struct bpf_trace_entry_head *entry;
+	struct hlist_head *head;
+	int rctx;
+
+	if (dsize > TRACE_BPF_MAX_SIZE)
+		return -ENOMEM;
+
+	head = this_cpu_ptr(event_bpf.perf_events);
+	if (hlist_empty(head))
+		return -ENOENT;
+
+	__size = sizeof(*entry) + dsize;
+	size = ALIGN(__size + sizeof(u32), sizeof(u64));
+	size -= sizeof(u32);
+
+	entry = perf_trace_buf_prepare(size, TRACE_BPF, NULL, &rctx);
+	if (!entry)
+		return -ENOMEM;
+
+	entry->size = dsize;
+	memcpy(&entry[1], src, dsize);
+
+	perf_tp_event(0, 1, entry, size, regs, head, rctx, NULL);
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_output_data_proto = {
+	.func		= bpf_output_data,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_STACK,
+	.arg2_type	= ARG_CONST_STACK_SIZE,
+	.arg3_type	= ARG_PTR_TO_CTX,
+};
+
 static u64 bpf_ktime_get_ns(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
 {
 	/* NMI safe access to clock monotonic */
@@ -170,6 +211,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
 		return &bpf_map_delete_elem_proto;
 	case BPF_FUNC_probe_read:
 		return &bpf_probe_read_proto;
+	case BPF_FUNC_output_data:
+		return &bpf_output_data_proto;
 	case BPF_FUNC_ktime_get_ns:
 		return &bpf_ktime_get_ns_proto;
 
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index d135f55..8d9100d 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -113,6 +113,11 @@ struct kretprobe_trace_entry_head {
 	unsigned long		ret_ip;
 };
 
+struct bpf_trace_entry_head {
+	struct trace_entry	ent;
+	unsigned long		size;
+};
+
 /*
  * trace_flag_type is an enumeration that holds different
  * states when a trace occurs. These are:
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index f960b5f..bc7f13c 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -49,5 +49,7 @@ static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int flag
 	(void *) BPF_FUNC_l3_csum_replace;
 static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flags) =
 	(void *) BPF_FUNC_l4_csum_replace;
+static int (*bpf_output_data)(void *src, int size, void *regs) =
+	(void *) BPF_FUNC_output_data;
 
 #endif
-- 
1.8.5.2



* Re: [RFC PATCH v3 1/2] tracing: Add new trace type for bpf data output
  2015-07-07 11:43 ` [RFC PATCH v3 1/2] tracing: Add new trace type for bpf data output He Kuang
@ 2015-07-07 22:47   ` Peter Zijlstra
  2015-07-08  1:35   ` Alexei Starovoitov
  1 sibling, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2015-07-07 22:47 UTC (permalink / raw)
  To: He Kuang
  Cc: rostedt, ast, masami.hiramatsu.pt, acme, mingo, namhyung, jolsa,
	wangnan0, linux-kernel

On Tue, Jul 07, 2015 at 11:43:05AM +0000, He Kuang wrote:
> +FTRACE_ENTRY_REG(bpf, trace_bpf,
> +
> +	TRACE_BPF,
> +
> +	F_STRUCT(
> +		__field(long,	size)

I would suggest using either u32 or u64, not a variable size type like
long.
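
To illustrate the concern with a user-space sketch: the width of long
depends on the ABI (4 bytes on typical 32-bit builds, 8 on LP64), so a
record layout that starts with long cannot be decoded without knowing
which kernel produced it. The struct names below are hypothetical
stand-ins for the entry layout, not the structures FTRACE_ENTRY_REG
generates:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRY 64	/* stands in for TRACE_BPF_MAX_ENTRY */

/* Layout as posted: the width of 'size' follows the ABI's long. */
struct entry_long {
	long		size;
	uint64_t	raw_data[MAX_ENTRY];
};

/* Fixed-width alternative: 'size' is 8 bytes on every ABI. */
struct entry_u64 {
	uint64_t	size;
	uint64_t	raw_data[MAX_ENTRY];
};
```

With a u64 size field, raw_data starts at offset 8 regardless of the
build. A u32 field also has a fixed width, though the padding before the
u64 array can still depend on the ABI's alignment of 64-bit types.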

> +		__array(u64,	raw_data,	TRACE_BPF_MAX_ENTRY)
> +	),
> +
> +	F_printk("%ld: (%016llx, ...)", __entry->size, __entry->raw_data[0]),
> +
> +	FILTER_OTHER,
> +
> +	perf_ftrace_event_register
> +);
> -- 
> 1.8.5.2
> 


* Re: [RFC PATCH v3 0/2] Make eBPF programs output data to perf event
  2015-07-07 11:43 [RFC PATCH v3 0/2] Make eBPF programs output data to perf event He Kuang
  2015-07-07 11:43 ` [RFC PATCH v3 1/2] tracing: Add new trace type for bpf data output He Kuang
  2015-07-07 11:43 ` [RFC PATCH v3 2/2] bpf: Introduce function for outputing data to perf event He Kuang
@ 2015-07-08  1:31 ` Alexei Starovoitov
  2 siblings, 0 replies; 7+ messages in thread
From: Alexei Starovoitov @ 2015-07-08  1:31 UTC (permalink / raw)
  To: He Kuang, rostedt, masami.hiramatsu.pt, acme, a.p.zijlstra,
	mingo, namhyung, jolsa
  Cc: wangnan0, linux-kernel

On 7/7/15 4:43 AM, He Kuang wrote:
> Hi,
>
> The two previous versions tried to combine bpf output data with the
> sample event of the attached kprobe point, which leads to problems
> about perf_trace_buf.
>
> After discussion we found it's not necessary to combine those two
> parts of information, even we do not need the orignial kprobe output
> event at all. Based on this idea, the implementation becomes simple,
> just like what perf do with ftrace:functions, we set up a bpf ftrace
> entry for perf tools to poll and collect data on it, eBpf program use
> a helper function to submit data to ring-buffer, that's all. This
> implementation also leaves all issues such as sample-types to perf
> commandline.
>
> Currently, we just use raw data in the format fields to not interfere
> perf sample parser, because the raw-data can be parsed by perf script
> plugin easily.

Looks much better!
In general I think splitting it into two patches is confusing,
since 1st patch is meaningless without 2nd. I would squash it.
Other comments inline.

>            bpf_output_sample(&del_time, sizeof(del_time));

typo?
You meant bpf_output_data(&del_time, sizeof(del_time), ctx) ?

To match the rest of the helpers, please make ctx the first argument.
Also I think bpf_output_trace_data() is a better name.
The bpf_output_data name doesn't indicate that it's a tracing-only
helper and might be confused with the networking helpers.

> Record bpf events:
>
>    $ perf record -e ftrace:bpf -e sample.o -- dd if=/dev/zero of=test bs=4k count=3
>
> The results showed in perf-script:
>
>    $ perf script
>    dd   994 [000]   166.686779: ftrace:bpf: 8: (000000000542b426, ...)
>    dd   994 [000]   166.686779: ftrace:bpf: 8: (00000000001011ef, ...)
>    dd   994 [000]   166.686779: ftrace:bpf: 8: (000000000007a2b6, ...)

nice!



* Re: [RFC PATCH v3 1/2] tracing: Add new trace type for bpf data output
  2015-07-07 11:43 ` [RFC PATCH v3 1/2] tracing: Add new trace type for bpf data output He Kuang
  2015-07-07 22:47   ` Peter Zijlstra
@ 2015-07-08  1:35   ` Alexei Starovoitov
  1 sibling, 0 replies; 7+ messages in thread
From: Alexei Starovoitov @ 2015-07-08  1:35 UTC (permalink / raw)
  To: He Kuang, rostedt, masami.hiramatsu.pt, acme, a.p.zijlstra,
	mingo, namhyung, jolsa
  Cc: wangnan0, linux-kernel

On 7/7/15 4:43 AM, He Kuang wrote:
> +	F_STRUCT(
> +		__field(long,	size)

as Peter said please use u32 to avoid 32 vs 64-bit issues.

> +		__array(u64,	raw_data,	TRACE_BPF_MAX_ENTRY)
> +	),
> +
> +	F_printk("%ld: (%016llx, ...)", __entry->size, __entry->raw_data[0]),

can we conditionally print raw_data[0], raw_data[1], ..[3] when
values are non-zero? '...' as part of print is kinda useless.
Also I would drop '()' and ',' they don't add much value.
Remember we won't be able to change this format once it goes in.
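
A user-space sketch of the output this suggestion would produce, only to
make the format concrete: the size, then up to four raw values, each
printed only when non-zero, with no parentheses or commas. The exact
spacing and the name format_bpf_entry() are assumptions, and this says
nothing about how it would be expressed in F_printk:

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format "<size>:" followed by each non-zero value among the first
 * four raw_data words, as " %016llx". Returns the number of characters
 * written (excluding the terminating NUL), or -1 on error or if buf
 * is too small. Hypothetical helper for illustration only.
 */
static int format_bpf_entry(char *buf, size_t len, uint32_t size,
			    const uint64_t raw_data[4])
{
	size_t off;
	int i, n;

	n = snprintf(buf, len, "%" PRIu32 ":", size);
	if (n < 0 || (size_t)n >= len)
		return -1;
	off = (size_t)n;

	for (i = 0; i < 4; i++) {
		if (raw_data[i] == 0)
			continue;	/* skip zero words */
		n = snprintf(buf + off, len - off, " %016" PRIx64,
			     raw_data[i]);
		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += (size_t)n;
	}
	return (int)off;
}
```

For the first sample in the cover letter this would give
"8: 000000000542b426".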




* Re: [RFC PATCH v3 2/2] bpf: Introduce function for outputing data to perf event
  2015-07-07 11:43 ` [RFC PATCH v3 2/2] bpf: Introduce function for outputing data to perf event He Kuang
@ 2015-07-08  1:45   ` Alexei Starovoitov
  0 siblings, 0 replies; 7+ messages in thread
From: Alexei Starovoitov @ 2015-07-08  1:45 UTC (permalink / raw)
  To: He Kuang, rostedt, masami.hiramatsu.pt, acme, a.p.zijlstra,
	mingo, namhyung, jolsa
  Cc: wangnan0, linux-kernel

On 7/7/15 4:43 AM, He Kuang wrote:

> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -210,6 +210,9 @@ enum bpf_func_id {
>   	 * Return: 0 on success
>   	 */
>   	BPF_FUNC_l4_csum_replace,
> +
> +	/* int bpf_output_data(void *src, int size, void *regs) */

bpf_output_trace_data(struct pt_regs *ctx, void *data, int data_size)

> +	BPF_FUNC_output_data,
>   	__BPF_FUNC_MAX_ID,
>   };
>
> +static u64 bpf_output_data(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> +{
> +	void *src = (void *) (long) r1;
> +	int dsize = (int) r2, __size, size;
> +	void *regs = (void *) (long) r3;

please cast to 'struct pt_regs *', since that's what it is.

> +	__size = sizeof(*entry) + dsize;
> +	size = ALIGN(__size + sizeof(u32), sizeof(u64));
> +	size -= sizeof(u32);
> +
> +	entry = perf_trace_buf_prepare(size, TRACE_BPF, NULL, &rctx);
> +	if (!entry)
> +		return -ENOMEM;
> +
> +	entry->size = dsize;

Something is wrong here. Either the 'size' from bpf_trace_entry_head
should be used, or the one from trace_bpf.

>   	(void *) BPF_FUNC_l4_csum_replace;
> +static int (*bpf_output_data)(void *src, int size, void *regs) =
> +	(void *) BPF_FUNC_output_data;

'struct pt_regs *ctx' here as well.

