* [RFC][PATCH 00/10] Add trace event support to eBPF
From: Tom Zanussi @ 2016-02-12 16:11 UTC (permalink / raw)
  To: ast, rostedt
  Cc: masami.hiramatsu.pt, namhyung, peterz, linux-kernel, Tom Zanussi

Hi,

As promised in previous threads, this patchset shares some common
functionality with the hist triggers code and enables trace events to
be accessed from eBPF programs.

It needs to be applied on top of current tracing/for-next with the
hist triggers v13 patches applied:

  https://lkml.org/lkml/2015/12/10/640

Any input welcome, especially things that could be done better wrt
eBPF, since I'm not as familiar with that code...

Thanks,

Tom

The following changes since commit def7919edab980525728a6ee9728c23ececdaf52:

  tracing: Add hist trigger 'log2' modifier (2016-02-10 09:51:54 -0600)

are available in the git repository at:

  git://git.yoctoproject.org/linux-yocto-contrib.git tzanussi/ebpf-trace-events-v0
  http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/log/?h=tzanussi/ebpf-trace-events-v0

Tom Zanussi (10):
  tracing: Move some constants to ring_buffer.h
  eBPF: Add BPF_PROG_TYPE_TRACE_EVENT prog type
  tracing: Add an 'accessor' function to ftrace_event_field
  tracing: Add trace event accessor functions
  eBPF/tracing: Add eBPF trace event field access helpers
  tracing: Add kprobe/uprobe support for TRACE_EVENT eBPF progs
  tracing: Add eBPF program support to static trace events
  samples/bpf: Add support for trace events to the eBPF loading code
  samples/bpf: Add readcounts-by-pid example
  samples/bpf: Add kprobe-event-fields example

 include/linux/ring_buffer.h            |  35 ++++++++
 include/linux/trace_events.h           |   7 ++
 include/trace/perf.h                   |  10 +++
 include/uapi/linux/bpf.h               |  19 +++++
 kernel/bpf/verifier.c                  |   2 +-
 kernel/events/core.c                   |  12 +--
 kernel/trace/bpf_trace.c               | 149 +++++++++++++++++++++++++++++++++
 kernel/trace/ring_buffer.c             |  31 -------
 kernel/trace/trace.h                   |   5 ++
 kernel/trace/trace_events.c            | 111 ++++++++++++++++++++++++
 kernel/trace/trace_events_hist.c       | 103 +++--------------------
 kernel/trace/trace_kprobe.c            |  20 ++++-
 kernel/trace/trace_syscalls.c          |  18 ++++
 kernel/trace/trace_uprobe.c            |  11 ++-
 samples/bpf/Makefile                   |   8 ++
 samples/bpf/bpf_helpers.h              |   4 +
 samples/bpf/bpf_load.c                 |  46 ++++++++--
 samples/bpf/kprobe-event-fields_kern.c |  56 +++++++++++++
 samples/bpf/kprobe-event-fields_user.c |  25 ++++++
 samples/bpf/readcounts-by-pid_kern.c   |  57 +++++++++++++
 samples/bpf/readcounts-by-pid_user.c   |  66 +++++++++++++++
 21 files changed, 657 insertions(+), 138 deletions(-)
 create mode 100644 samples/bpf/kprobe-event-fields_kern.c
 create mode 100644 samples/bpf/kprobe-event-fields_user.c
 create mode 100644 samples/bpf/readcounts-by-pid_kern.c
 create mode 100644 samples/bpf/readcounts-by-pid_user.c

-- 
1.9.3


* [RFC][PATCH 01/10] tracing: Move some constants to ring_buffer.h
From: Tom Zanussi @ 2016-02-12 16:11 UTC (permalink / raw)
  To: ast, rostedt
  Cc: masami.hiramatsu.pt, namhyung, peterz, linux-kernel, Tom Zanussi

For eBPF trace event support, we need to be able to inform the eBPF
verifier about legal accesses to the event buffer, which means we need
some of the ring_buffer constants.  This moves what we need along with
related constants to the ring_buffer header file.
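
As an illustration of the intended use (sketch only - the actual
verifier hook is added later in this series), the moved
BUF_MAX_DATA_SIZE constant is what bounds legal reads of the event
record in an is_valid_access() callback:

/* sketch: bound ctx reads by the maximum ring buffer payload size */
static bool example_is_valid_access(int off, int size,
                                    enum bpf_access_type type)
{
        if (off < 0 || off >= BUF_MAX_DATA_SIZE)
                return false;

        return type == BPF_READ;
}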

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/linux/ring_buffer.h | 35 +++++++++++++++++++++++++++++++++++
 kernel/trace/ring_buffer.c  | 31 -------------------------------
 2 files changed, 35 insertions(+), 31 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 4acc552..6da902f 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include <linux/seq_file.h>
 #include <linux/poll.h>
 
+#include <asm/local.h>
+
 struct ring_buffer;
 struct ring_buffer_iter;
 
@@ -198,4 +200,37 @@ enum ring_buffer_flags {
 	RB_FL_OVERWRITE		= 1 << 0,
 };
 
+/* Used for individual buffers (after the counter) */
+#define RB_BUFFER_OFF		(1 << 20)
+
+#define BUF_PAGE_HDR_SIZE offsetof(struct buffer_data_page, data)
+
+#define RB_EVNT_HDR_SIZE (offsetof(struct ring_buffer_event, array))
+#define RB_ALIGNMENT		4U
+#define RB_MAX_SMALL_DATA	(RB_ALIGNMENT * RINGBUF_TYPE_DATA_TYPE_LEN_MAX)
+#define RB_EVNT_MIN_SIZE	8U	/* two 32bit words */
+
+#ifndef CONFIG_HAVE_64BIT_ALIGNED_ACCESS
+# define RB_FORCE_8BYTE_ALIGNMENT	0
+# define RB_ARCH_ALIGNMENT		RB_ALIGNMENT
+#else
+# define RB_FORCE_8BYTE_ALIGNMENT	1
+# define RB_ARCH_ALIGNMENT		8U
+#endif
+
+#define RB_ALIGN_DATA		__aligned(RB_ARCH_ALIGNMENT)
+
+struct buffer_data_page {
+	u64		 time_stamp;	/* page time stamp */
+	local_t		 commit;	/* write committed index */
+	unsigned char	 data[] RB_ALIGN_DATA;	/* data of buffer page */
+};
+
+#define BUF_PAGE_HDR_SIZE offsetof(struct buffer_data_page, data)
+
+#define BUF_PAGE_SIZE (PAGE_SIZE - BUF_PAGE_HDR_SIZE)
+
+/* Max payload is BUF_PAGE_SIZE - header (8bytes) */
+#define BUF_MAX_DATA_SIZE (BUF_PAGE_SIZE - (sizeof(u32) * 2))
+
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9c6045a..1af61d6 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -115,26 +115,6 @@ int ring_buffer_print_entry_header(struct trace_seq *s)
  *
  */
 
-/* Used for individual buffers (after the counter) */
-#define RB_BUFFER_OFF		(1 << 20)
-
-#define BUF_PAGE_HDR_SIZE offsetof(struct buffer_data_page, data)
-
-#define RB_EVNT_HDR_SIZE (offsetof(struct ring_buffer_event, array))
-#define RB_ALIGNMENT		4U
-#define RB_MAX_SMALL_DATA	(RB_ALIGNMENT * RINGBUF_TYPE_DATA_TYPE_LEN_MAX)
-#define RB_EVNT_MIN_SIZE	8U	/* two 32bit words */
-
-#ifndef CONFIG_HAVE_64BIT_ALIGNED_ACCESS
-# define RB_FORCE_8BYTE_ALIGNMENT	0
-# define RB_ARCH_ALIGNMENT		RB_ALIGNMENT
-#else
-# define RB_FORCE_8BYTE_ALIGNMENT	1
-# define RB_ARCH_ALIGNMENT		8U
-#endif
-
-#define RB_ALIGN_DATA		__aligned(RB_ARCH_ALIGNMENT)
-
 /* define RINGBUF_TYPE_DATA for 'case RINGBUF_TYPE_DATA:' */
 #define RINGBUF_TYPE_DATA 0 ... RINGBUF_TYPE_DATA_TYPE_LEN_MAX
 
@@ -280,12 +260,6 @@ EXPORT_SYMBOL_GPL(ring_buffer_event_data);
 /* Missed count stored at end */
 #define RB_MISSED_STORED	(1 << 30)
 
-struct buffer_data_page {
-	u64		 time_stamp;	/* page time stamp */
-	local_t		 commit;	/* write committed index */
-	unsigned char	 data[] RB_ALIGN_DATA;	/* data of buffer page */
-};
-
 /*
  * Note, the buffer_page list must be first. The buffer pages
  * are allocated in cache lines, which means that each buffer
@@ -355,11 +329,6 @@ static inline int test_time_stamp(u64 delta)
 	return 0;
 }
 
-#define BUF_PAGE_SIZE (PAGE_SIZE - BUF_PAGE_HDR_SIZE)
-
-/* Max payload is BUF_PAGE_SIZE - header (8bytes) */
-#define BUF_MAX_DATA_SIZE (BUF_PAGE_SIZE - (sizeof(u32) * 2))
-
 int ring_buffer_print_page_header(struct trace_seq *s)
 {
 	struct buffer_data_page field;
-- 
1.9.3


* [RFC][PATCH 02/10] eBPF: Add BPF_PROG_TYPE_TRACE_EVENT prog type
From: Tom Zanussi @ 2016-02-12 16:11 UTC (permalink / raw)
  To: ast, rostedt
  Cc: masami.hiramatsu.pt, namhyung, peterz, linux-kernel, Tom Zanussi

Add a new BPF_PROG_TYPE_TRACE_EVENT prog type that allows the eBPF
program to access the fields defined for a trace event.  Trace event
fields are defined and available on both static and kprobes-based
trace events, and so TRACE_EVENT eBPF progs can be used on both types.

A new prog type is needed because BPF_PROG_TYPE_KPROBE progs expect a
pt_regs * ctx, while a TRACE_EVENT prog needs a trace record * ctx.
It would have been nice to have a single prog type that could handle
both, but the existing KPROBE ctx can't be changed to something more
self-descriptive without breaking existing eBPF programs.

In any case, mixing the two kinds of access in a single program
probably isn't a common thing to want to do - if you're grabbing probe
params and chasing pointers in your probe, you're typically not also
interested in accessing event fields, and vice versa.
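
To make the distinction concrete, the two program shapes would look
roughly like this on the samples/bpf side (sketch only - the field-read
helper and the "event/" section prefix are added by later patches in
this series, and the event/field names are just examples):

SEC("kprobe/__netif_receive_skb_core")		/* BPF_PROG_TYPE_KPROBE */
int kprobe_prog(struct pt_regs *ctx)
{
	/* ctx is the saved register state at the probe point */
	return 0;
}

SEC("event/net:netif_receive_skb")		/* BPF_PROG_TYPE_TRACE_EVENT */
int trace_event_prog(void *ctx)
{
	/* ctx wraps the trace record; fields are read by name */
	char len_field[] = "len";
	int len = bpf_trace_event_field_read(ctx, len_field);

	return len > 0;		/* returning 0 drops the event */
}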

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/linux/trace_events.h |  7 +++++
 include/uapi/linux/bpf.h     |  1 +
 kernel/events/core.c         | 12 +++++----
 kernel/trace/bpf_trace.c     | 62 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 70f8fc4..f7f12f3 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -15,6 +15,13 @@ struct tracer;
 struct dentry;
 struct bpf_prog;
 
+struct trace_event_context {
+	struct trace_event_call *call;
+	void *record;
+};
+
+#define TRACE_EVENT_CTX_HDR_SIZE offsetof(struct trace_event_context, record)
+
 struct trace_print_flags {
 	unsigned long		mask;
 	const char		*name;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9ea2d22..df6a7ff 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -89,6 +89,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_KPROBE,
 	BPF_PROG_TYPE_SCHED_CLS,
 	BPF_PROG_TYPE_SCHED_ACT,
+	BPF_PROG_TYPE_TRACE_EVENT,
 };
 
 #define BPF_PSEUDO_MAP_FD	1
diff --git a/kernel/events/core.c b/kernel/events/core.c
index cfc227c..c366e6e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7081,15 +7081,17 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
 	if (event->tp_event->prog)
 		return -EEXIST;
 
-	if (!(event->tp_event->flags & TRACE_EVENT_FL_UKPROBE))
-		/* bpf programs can only be attached to u/kprobes */
-		return -EINVAL;
-
 	prog = bpf_prog_get(prog_fd);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
 
-	if (prog->type != BPF_PROG_TYPE_KPROBE) {
+	if ((prog->type == BPF_PROG_TYPE_KPROBE) &&
+	    !(event->tp_event->flags & TRACE_EVENT_FL_UKPROBE))
+		/* KPROBE bpf programs can only be attached to u/kprobes */
+		return -EINVAL;
+
+	if (prog->type != BPF_PROG_TYPE_KPROBE &&
+	    prog->type != BPF_PROG_TYPE_TRACE_EVENT) {
 		/* valid fd, but invalid bpf program type */
 		bpf_prog_put(prog);
 		return -EINVAL;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 4228fd3..78dbac0 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -326,9 +326,71 @@ static struct bpf_prog_type_list kprobe_tl = {
 	.type	= BPF_PROG_TYPE_KPROBE,
 };
 
+static const struct bpf_func_proto *
+trace_event_prog_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_map_lookup_elem:
+		return &bpf_map_lookup_elem_proto;
+	case BPF_FUNC_map_update_elem:
+		return &bpf_map_update_elem_proto;
+	case BPF_FUNC_map_delete_elem:
+		return &bpf_map_delete_elem_proto;
+	case BPF_FUNC_probe_read:
+		return &bpf_probe_read_proto;
+	case BPF_FUNC_ktime_get_ns:
+		return &bpf_ktime_get_ns_proto;
+	case BPF_FUNC_tail_call:
+		return &bpf_tail_call_proto;
+	case BPF_FUNC_get_current_pid_tgid:
+		return &bpf_get_current_pid_tgid_proto;
+	case BPF_FUNC_get_current_uid_gid:
+		return &bpf_get_current_uid_gid_proto;
+	case BPF_FUNC_get_current_comm:
+		return &bpf_get_current_comm_proto;
+	case BPF_FUNC_trace_printk:
+		return bpf_get_trace_printk_proto();
+	case BPF_FUNC_get_smp_processor_id:
+		return &bpf_get_smp_processor_id_proto;
+	case BPF_FUNC_perf_event_read:
+		return &bpf_perf_event_read_proto;
+	case BPF_FUNC_perf_event_output:
+		return &bpf_perf_event_output_proto;
+	default:
+		return NULL;
+	}
+}
+
+/* trace_event programs can access fields of trace event in rec */
+static bool trace_event_prog_is_valid_access(int off, int size,
+					     enum bpf_access_type type)
+{
+	/* check bounds */
+	if (off < 0 || off >= TRACE_EVENT_CTX_HDR_SIZE + BUF_MAX_DATA_SIZE)
+		return false;
+
+	/* only read is allowed */
+	if (type != BPF_READ)
+		return false;
+
+	return true;
+}
+
+static const struct bpf_verifier_ops trace_event_prog_ops = {
+	.get_func_proto  = trace_event_prog_func_proto,
+	.is_valid_access = trace_event_prog_is_valid_access,
+};
+
+static struct bpf_prog_type_list trace_event_tl = {
+	.ops	= &trace_event_prog_ops,
+	.type	= BPF_PROG_TYPE_TRACE_EVENT,
+};
+
 static int __init register_kprobe_prog_ops(void)
 {
 	bpf_register_prog_type(&kprobe_tl);
+	bpf_register_prog_type(&trace_event_tl);
+
 	return 0;
 }
 late_initcall(register_kprobe_prog_ops);
-- 
1.9.3


* [RFC][PATCH 03/10] tracing: Add an 'accessor' function to ftrace_event_field
From: Tom Zanussi @ 2016-02-12 16:11 UTC (permalink / raw)
  To: ast, rostedt
  Cc: masami.hiramatsu.pt, namhyung, peterz, linux-kernel, Tom Zanussi

For grabbing the contents of an event field from a trace record, it
would be useful to associate a field-specific accessor function with
an event field.  This defines an accessor prototype and adds an
accessor field to struct ftrace_event_field for that purpose.
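
The intended call pattern is simply (illustrative only - the actual
accessor implementations are added in the next patch, and "len" is
just an example field name):

	struct ftrace_event_field *field;
	u64 val = 0;

	field = trace_find_event_field(call, "len");
	if (field)
		val = field->accessor(field, rec);	/* rec = trace record */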

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 kernel/trace/trace.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index fed2ae0..7b48a3c 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1030,11 +1030,16 @@ static inline void trace_branch_disable(void)
 /* set ring buffers to default size if not already done so */
 int tracing_update_buffers(void);
 
+struct ftrace_event_field;
+
+typedef u64 (*ftrace_event_field_fn_t) (struct ftrace_event_field *field, void *event);
+
 struct ftrace_event_field {
 	struct list_head	link;
 	const char		*name;
 	const char		*type;
 	int			filter_type;
+	ftrace_event_field_fn_t	accessor;
 	int			offset;
 	int			size;
 	int			is_signed;
-- 
1.9.3


* [RFC][PATCH 04/10] tracing: Add trace event accessor functions
From: Tom Zanussi @ 2016-02-12 16:11 UTC (permalink / raw)
  To: ast, rostedt
  Cc: masami.hiramatsu.pt, namhyung, peterz, linux-kernel, Tom Zanussi

This adds a set of accessor functions for integer and string trace
event fields.  When a trace event field is defined, the appropriate
accessor function for that field is chosen and set; clients can access
it via the ftrace_event_field's 'accessor' field.

This code is moved directly from the hist triggers code and the hist
triggers fixed up to make use of it.

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 kernel/trace/trace_events.c      | 111 +++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_events_hist.c | 103 ++++--------------------------------
 2 files changed, 121 insertions(+), 93 deletions(-)

diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 8b54f21..47cb12c 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -70,6 +70,115 @@ static int system_refcount_dec(struct event_subsystem *system)
 #define while_for_each_event_file()		\
 	}
 
+static u64 ftrace_event_field_fn_none(struct ftrace_event_field *field,
+				      void *event)
+{
+	return 0;
+}
+
+static u64 ftrace_event_field_fn_string(struct ftrace_event_field *field,
+					void *event)
+{
+	char *addr = (char *)(event + field->offset);
+
+	return (u64)(unsigned long)addr;
+}
+
+static u64 ftrace_event_field_fn_dynstring(struct ftrace_event_field *field,
+					   void *event)
+{
+	u32 str_item = *(u32 *)(event + field->offset);
+	int str_loc = str_item & 0xffff;
+	char *addr = (char *)(event + str_loc);
+
+	return (u64)(unsigned long)addr;
+}
+
+static u64 ftrace_event_field_fn_pstring(struct ftrace_event_field *field,
+					 void *event)
+{
+	char **addr = (char **)(event + field->offset);
+
+	return (u64)(unsigned long)*addr;
+}
+
+#define DEFINE_FTRACE_EVENT_FIELD_FN(type)			\
+static u64							\
+ftrace_event_field_fn_##type(struct ftrace_event_field *field,	\
+			     void *event)			\
+{								\
+	type *addr = (type *)(event + field->offset);		\
+								\
+	return (u64)*addr;					\
+}
+
+DEFINE_FTRACE_EVENT_FIELD_FN(s64);
+DEFINE_FTRACE_EVENT_FIELD_FN(u64);
+DEFINE_FTRACE_EVENT_FIELD_FN(s32);
+DEFINE_FTRACE_EVENT_FIELD_FN(u32);
+DEFINE_FTRACE_EVENT_FIELD_FN(s16);
+DEFINE_FTRACE_EVENT_FIELD_FN(u16);
+DEFINE_FTRACE_EVENT_FIELD_FN(s8);
+DEFINE_FTRACE_EVENT_FIELD_FN(u8);
+
+static ftrace_event_field_fn_t select_integer_accessor_fn(int field_size,
+							  int field_is_signed)
+{
+	ftrace_event_field_fn_t fn = NULL;
+
+	switch (field_size) {
+	case 8:
+		if (field_is_signed)
+			fn = ftrace_event_field_fn_s64;
+		else
+			fn = ftrace_event_field_fn_u64;
+		break;
+	case 4:
+		if (field_is_signed)
+			fn = ftrace_event_field_fn_s32;
+		else
+			fn = ftrace_event_field_fn_u32;
+		break;
+	case 2:
+		if (field_is_signed)
+			fn = ftrace_event_field_fn_s16;
+		else
+			fn = ftrace_event_field_fn_u16;
+		break;
+	case 1:
+		if (field_is_signed)
+			fn = ftrace_event_field_fn_s8;
+		else
+			fn = ftrace_event_field_fn_u8;
+		break;
+	}
+
+	return fn;
+}
+
+static void set_field_accessor(struct ftrace_event_field *field)
+{
+	field->accessor = ftrace_event_field_fn_none;
+
+	if (field && is_function_field(field))
+		return;
+
+	if (is_string_field(field)) {
+		if (field->filter_type == FILTER_STATIC_STRING)
+			field->accessor = ftrace_event_field_fn_string;
+		else if (field->filter_type == FILTER_DYN_STRING)
+			field->accessor = ftrace_event_field_fn_dynstring;
+		else
+			field->accessor = ftrace_event_field_fn_pstring;
+	} else {
+		ftrace_event_field_fn_t fn;
+
+		fn = select_integer_accessor_fn(field->size, field->is_signed);
+		if (fn)
+			field->accessor = fn;
+	}
+}
+
 static struct list_head *
 trace_get_fields(struct trace_event_call *event_call)
 {
@@ -131,6 +240,8 @@ static int __trace_define_field(struct list_head *head, const char *type,
 	field->size = size;
 	field->is_signed = is_signed;
 
+	set_field_accessor(field);
+
 	list_add(&field->link, head);
 
 	return 0;
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 6e12388..66ca35e 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -30,7 +30,7 @@ typedef u64 (*hist_field_fn_t) (struct hist_field *field, void *event);
 struct hist_field {
 	struct ftrace_event_field	*field;
 	unsigned long			flags;
-	hist_field_fn_t			fn;
+	hist_field_fn_t                 fn;
 	unsigned int			size;
 	unsigned int			offset;
 };
@@ -45,29 +45,6 @@ static u64 hist_field_counter(struct hist_field *field, void *event)
 	return 1;
 }
 
-static u64 hist_field_string(struct hist_field *hist_field, void *event)
-{
-	char *addr = (char *)(event + hist_field->field->offset);
-
-	return (u64)(unsigned long)addr;
-}
-
-static u64 hist_field_dynstring(struct hist_field *hist_field, void *event)
-{
-	u32 str_item = *(u32 *)(event + hist_field->field->offset);
-	int str_loc = str_item & 0xffff;
-	char *addr = (char *)(event + str_loc);
-
-	return (u64)(unsigned long)addr;
-}
-
-static u64 hist_field_pstring(struct hist_field *hist_field, void *event)
-{
-	char **addr = (char **)(event + hist_field->field->offset);
-
-	return (u64)(unsigned long)*addr;
-}
-
 static u64 hist_field_log2(struct hist_field *hist_field, void *event)
 {
 	u64 val = *(u64 *)(event + hist_field->field->offset);
@@ -75,23 +52,6 @@ static u64 hist_field_log2(struct hist_field *hist_field, void *event)
 	return (u64) ilog2(roundup_pow_of_two(val));
 }
 
-#define DEFINE_HIST_FIELD_FN(type)					\
-static u64 hist_field_##type(struct hist_field *hist_field, void *event)\
-{									\
-	type *addr = (type *)(event + hist_field->field->offset);	\
-									\
-	return (u64)*addr;						\
-}
-
-DEFINE_HIST_FIELD_FN(s64);
-DEFINE_HIST_FIELD_FN(u64);
-DEFINE_HIST_FIELD_FN(s32);
-DEFINE_HIST_FIELD_FN(u32);
-DEFINE_HIST_FIELD_FN(s16);
-DEFINE_HIST_FIELD_FN(u16);
-DEFINE_HIST_FIELD_FN(s8);
-DEFINE_HIST_FIELD_FN(u8);
-
 #define for_each_hist_field(i, hist_data)	\
 	for (i = 0; i < hist_data->n_fields; i++)
 
@@ -146,40 +106,6 @@ struct hist_trigger_data {
 	struct tracing_map		*map;
 };
 
-static hist_field_fn_t select_value_fn(int field_size, int field_is_signed)
-{
-	hist_field_fn_t fn = NULL;
-
-	switch (field_size) {
-	case 8:
-		if (field_is_signed)
-			fn = hist_field_s64;
-		else
-			fn = hist_field_u64;
-		break;
-	case 4:
-		if (field_is_signed)
-			fn = hist_field_s32;
-		else
-			fn = hist_field_u32;
-		break;
-	case 2:
-		if (field_is_signed)
-			fn = hist_field_s16;
-		else
-			fn = hist_field_u16;
-		break;
-	case 1:
-		if (field_is_signed)
-			fn = hist_field_s8;
-		else
-			fn = hist_field_u8;
-		break;
-	}
-
-	return fn;
-}
-
 static int parse_map_size(char *str)
 {
 	unsigned long size, map_bits;
@@ -372,23 +298,8 @@ static struct hist_field *create_hist_field(struct ftrace_event_field *field,
 		goto out;
 	}
 
-	if (is_string_field(field)) {
+	if (is_string_field(field))
 		flags |= HIST_FIELD_STRING;
-
-		if (field->filter_type == FILTER_STATIC_STRING)
-			hist_field->fn = hist_field_string;
-		else if (field->filter_type == FILTER_DYN_STRING)
-			hist_field->fn = hist_field_dynstring;
-		else
-			hist_field->fn = hist_field_pstring;
-	} else {
-		hist_field->fn = select_value_fn(field->size,
-						 field->is_signed);
-		if (!hist_field->fn) {
-			destroy_hist_field(hist_field);
-			return NULL;
-		}
-	}
  out:
 	hist_field->field = field;
 	hist_field->flags = flags;
@@ -828,7 +739,10 @@ static void hist_trigger_elt_update(struct hist_trigger_data *hist_data,
 
 	for_each_hist_val_field(i, hist_data) {
 		hist_field = hist_data->fields[i];
-		hist_val = hist_field->fn(hist_field, rec);
+		if (hist_field->fn)
+			hist_val = hist_field->fn(hist_field, rec);
+		else
+			hist_val = hist_field->field->accessor(hist_field->field, rec);
 		tracing_map_update_sum(elt, i, hist_val);
 	}
 }
@@ -886,7 +800,10 @@ static void event_hist_trigger(struct event_trigger_data *data, void *rec)
 
 			key = entries;
 		} else {
-			field_contents = key_field->fn(key_field, rec);
+			if (key_field->fn)
+				field_contents = key_field->fn(key_field, rec);
+			else
+				field_contents = key_field->field->accessor(key_field->field, rec);
 			if (key_field->flags & HIST_FIELD_STRING) {
 				key = (void *)(unsigned long)field_contents;
 				use_compound_key = true;
-- 
1.9.3


* [RFC][PATCH 05/10] eBPF/tracing: Add eBPF trace event field access helpers
From: Tom Zanussi @ 2016-02-12 16:11 UTC (permalink / raw)
  To: ast, rostedt
  Cc: masami.hiramatsu.pt, namhyung, peterz, linux-kernel, Tom Zanussi

Add two eBPF helper functions for accessing trace event contents:

 - bpf_trace_event_field_read() - grab integer field contents from trace events
 - bpf_trace_event_field_read_string() - grab string contents from trace events
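
From the program's point of view the calling convention looks like
this (sketch only - the sample-side declarations come later in the
series, and the field names are just examples from
net:netif_receive_skb):

	/* field names must live on the BPF stack (ARG_PTR_TO_STACK),
	 * hence the local char arrays */
	char len_field[] = "len";
	int len = bpf_trace_event_field_read(ctx, len_field);

	char name[16] = {};
	char name_field[] = "name";
	bpf_trace_event_field_read_string(ctx, name_field, name, sizeof(name));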

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/uapi/linux/bpf.h | 18 ++++++++++
 kernel/bpf/verifier.c    |  2 +-
 kernel/trace/bpf_trace.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index df6a7ff..4045ce9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -270,6 +270,24 @@ enum bpf_func_id {
 	 * Return: 0 on success
 	 */
 	BPF_FUNC_perf_event_output,
+
+	/**
+	 * bpf_trace_event_field_read(ctx, field_name) - read trace event field contents
+	 * @ctx: struct trace_event_context*
+	 * @field_name: ftrace_event_field name
+	 * Return: value of field in ctx->record
+	 */
+	BPF_FUNC_trace_event_field_read,
+
+	/**
+	 * bpf_trace_event_field_read_string(ctx, field_name, dest, size) - read trace event field string
+	 * @ctx: struct trace_event_context*
+	 * @field_name: ftrace_event_field name
+	 * @dest: destination string buffer
+	 * @size: destination string buffer size
+	 * Return: value of field in ctx->record
+	 */
+	BPF_FUNC_trace_event_field_read_string,
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a7945d1..2b877ff 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -245,7 +245,7 @@ static const struct {
 } func_limit[] = {
 	{BPF_MAP_TYPE_PROG_ARRAY, BPF_FUNC_tail_call},
 	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_read},
-	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_output},
+	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_trace_event_field_read_string},
 };
 
 static void print_verifier_state(struct verifier_env *env)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 78dbac0..eb78c28 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -326,6 +326,89 @@ static struct bpf_prog_type_list kprobe_tl = {
 	.type	= BPF_PROG_TYPE_KPROBE,
 };
 
+static u64 bpf_trace_event_field_read(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	struct trace_event_context *ctx;
+	struct ftrace_event_field *field;
+	char *field_name;
+	u64 val;
+
+	ctx = (struct trace_event_context *) (long) r1;
+	field_name = (char *) (long) r2;
+
+	field = trace_find_event_field(ctx->call, field_name);
+
+	if (unlikely(!field))
+		return -ENOENT;
+
+	val = field->accessor(field, ctx->record);
+
+	return val;
+}
+
+static const struct bpf_func_proto bpf_trace_event_field_read_proto = {
+	.func		= bpf_trace_event_field_read,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_STACK,
+};
+
+static int event_field_strlen(struct ftrace_event_field *field, void *rec)
+{
+	int size;
+
+	if (field->filter_type == FILTER_DYN_STRING)
+		size = *(u32 *)(rec + field->offset) >> 16;
+	else if (field->filter_type == FILTER_PTR_STRING)
+		size = strlen((char *)(rec + field->offset));
+	else
+		size = field->size;
+
+	return size;
+}
+
+static u64 bpf_trace_event_field_read_string(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	struct ftrace_event_field *field;
+	struct trace_event_context *ctx;
+	char *field_name;
+	int size, len;
+	void *dst;
+	u64 val;
+
+	ctx = (struct trace_event_context *) (long) r1;
+	field_name = (char *) (long) r2;
+
+	field = trace_find_event_field(ctx->call, field_name);
+	if (unlikely(!field))
+		return -ENOENT;
+
+	val = field->accessor(field, ctx->record);
+
+	dst = (void *)(long)r3;
+	size = (int)r4;
+
+	len = event_field_strlen(field, ctx->record);
+	if (len > size - 1)
+		len = size - 1;
+
+	memset(dst, '\0', size);
+	memcpy(dst, (char *)val, len);
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_trace_event_field_read_string_proto = {
+	.func		= bpf_trace_event_field_read_string,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_STACK,
+	.arg3_type	= ARG_PTR_TO_STACK,
+	.arg4_type	= ARG_CONST_STACK_SIZE,
+};
+
 static const struct bpf_func_proto *
 trace_event_prog_func_proto(enum bpf_func_id func_id)
 {
@@ -356,6 +439,10 @@ trace_event_prog_func_proto(enum bpf_func_id func_id)
 		return &bpf_perf_event_read_proto;
 	case BPF_FUNC_perf_event_output:
 		return &bpf_perf_event_output_proto;
+	case BPF_FUNC_trace_event_field_read:
+		return &bpf_trace_event_field_read_proto;
+	case BPF_FUNC_trace_event_field_read_string:
+		return &bpf_trace_event_field_read_string_proto;
 	default:
 		return NULL;
 	}
-- 
1.9.3


* [RFC][PATCH 06/10] tracing: Add kprobe/uprobe support for TRACE_EVENT eBPF progs
From: Tom Zanussi @ 2016-02-12 16:11 UTC (permalink / raw)
  To: ast, rostedt
  Cc: masami.hiramatsu.pt, namhyung, peterz, linux-kernel, Tom Zanussi

Add a call at the end of kprobe/kretprobe/uprobe probe functions for
BPF_PROG_TYPE_TRACE_EVENT programs.  These new calls are made after
the trace event fields have been stored into the trace buffer, where
they can then be accessed by the eBPF program.

The existing BPF_PROG_TYPE_KPROBE calls at the beginning remain, but
the prog type needs to be checked to see whether it's a kprobe or a
trace event prog.

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 kernel/trace/trace_kprobe.c | 20 ++++++++++++++++++--
 kernel/trace/trace_uprobe.c | 11 ++++++++++-
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index c995644..34231f1 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -19,6 +19,8 @@
 
 #include <linux/module.h>
 #include <linux/uaccess.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
 
 #include "trace_probe.h"
 
@@ -1129,7 +1131,8 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
 	int size, __size, dsize;
 	int rctx;
 
-	if (prog && !trace_call_bpf(prog, regs))
+	if (prog && prog->type == BPF_PROG_TYPE_KPROBE &&
+	    !trace_call_bpf(prog, regs))
 		return;
 
 	head = this_cpu_ptr(call->perf_events);
@@ -1148,6 +1151,12 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
 	entry->ip = (unsigned long)tk->rp.kp.addr;
 	memset(&entry[1], 0, dsize);
 	store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
+	if (prog && prog->type == BPF_PROG_TYPE_TRACE_EVENT) {
+		struct trace_event_context ctx = { call, entry };
+
+		if (!trace_call_bpf(prog, &ctx))
+			return;
+	}
 	perf_trace_buf_submit(entry, size, rctx, 0, 1, regs, head, NULL);
 }
 NOKPROBE_SYMBOL(kprobe_perf_func);
@@ -1164,7 +1173,8 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
 	int size, __size, dsize;
 	int rctx;
 
-	if (prog && !trace_call_bpf(prog, regs))
+	if (prog && prog->type == BPF_PROG_TYPE_KPROBE &&
+	    !trace_call_bpf(prog, regs))
 		return;
 
 	head = this_cpu_ptr(call->perf_events);
@@ -1183,6 +1193,12 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
 	entry->func = (unsigned long)tk->rp.kp.addr;
 	entry->ret_ip = (unsigned long)ri->ret_addr;
 	store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
+	if (prog && prog->type == BPF_PROG_TYPE_TRACE_EVENT) {
+		struct trace_event_context ctx = { call, entry };
+
+		if (!trace_call_bpf(prog, &ctx))
+			return;
+	}
 	perf_trace_buf_submit(entry, size, rctx, 0, 1, regs, head, NULL);
 }
 NOKPROBE_SYMBOL(kretprobe_perf_func);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index d2f6d0b..990bc71 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -23,6 +23,8 @@
 #include <linux/uprobes.h>
 #include <linux/namei.h>
 #include <linux/string.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
 
 #include "trace_probe.h"
 
@@ -1116,7 +1118,8 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 	int size, esize;
 	int rctx;
 
-	if (prog && !trace_call_bpf(prog, regs))
+	if (prog && prog->type == BPF_PROG_TYPE_KPROBE &&
+	    !trace_call_bpf(prog, regs))
 		return;
 
 	esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
@@ -1152,6 +1155,12 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 		memset(data + len, 0, size - esize - len);
 	}
 
+	if (prog && prog->type == BPF_PROG_TYPE_TRACE_EVENT) {
+		struct trace_event_context ctx = { call, entry };
+
+		if (!trace_call_bpf(prog, &ctx))
+			return;
+	}
 	perf_trace_buf_submit(entry, size, rctx, 0, 1, regs, head, NULL);
  out:
 	preempt_enable();
-- 
1.9.3


* [RFC][PATCH 07/10] tracing: Add eBPF program support to static trace events
From: Tom Zanussi @ 2016-02-12 16:11 UTC (permalink / raw)
  To: ast, rostedt
  Cc: masami.hiramatsu.pt, namhyung, peterz, linux-kernel, Tom Zanussi

Have perf_trace_*() and perf_syscall_enter()/exit() make calls to
trace_call_bpf() if there's a TRACE_EVENT program associated with a
trace event.

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/trace/perf.h          | 10 ++++++++++
 kernel/trace/trace_syscalls.c | 18 ++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/trace/perf.h b/include/trace/perf.h
index 26486fc..0d8069c 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -1,3 +1,5 @@
+#include <linux/filter.h>
+#include <linux/bpf.h>
 
 #undef TRACE_SYSTEM_VAR
 
@@ -35,6 +37,7 @@ static notrace void							\
 perf_trace_##call(void *__data, proto)					\
 {									\
 	struct trace_event_call *event_call = __data;			\
+	struct bpf_prog *prog = event_call->prog;			\
 	struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
 	struct trace_event_raw_##call *entry;				\
 	struct pt_regs *__regs;						\
@@ -67,6 +70,13 @@ perf_trace_##call(void *__data, proto)					\
 									\
 	{ assign; }							\
 									\
+	if (prog && prog->type == BPF_PROG_TYPE_TRACE_EVENT) {		\
+		struct trace_event_context ctx = { event_call, entry };	\
+									\
+		if (!trace_call_bpf(prog, &ctx))			\
+			return;						\
+	}								\
+									\
 	perf_trace_buf_submit(entry, __entry_size, rctx, __addr,	\
 		__count, __regs, head, __task);				\
 }
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 50be560..feea899 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -6,6 +6,8 @@
 #include <linux/module.h>	/* for MODULE_NAME_LEN via KSYM_SYMBOL_LEN */
 #include <linux/ftrace.h>
 #include <linux/perf_event.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
 #include <asm/syscall.h>
 
 #include "trace_output.h"
@@ -562,6 +564,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	struct syscall_metadata *sys_data;
 	struct syscall_trace_enter *rec;
 	struct hlist_head *head;
+	struct bpf_prog *prog;
 	int syscall_nr;
 	int rctx;
 	int size;
@@ -576,6 +579,8 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	if (!sys_data)
 		return;
 
+	prog = sys_data->enter_event->prog;
+
 	head = this_cpu_ptr(sys_data->enter_event->perf_events);
 	if (hlist_empty(head))
 		return;
@@ -593,6 +598,11 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	rec->nr = syscall_nr;
 	syscall_get_arguments(current, regs, 0, sys_data->nb_args,
 			       (unsigned long *)&rec->args);
+	if (prog && prog->type == BPF_PROG_TYPE_TRACE_EVENT) {
+		struct trace_event_context ctx = { sys_data->enter_event, rec };
+		if (!trace_call_bpf(prog, &ctx))
+			return;
+	}
 	perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head, NULL);
 }
 
@@ -636,6 +646,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	struct syscall_metadata *sys_data;
 	struct syscall_trace_exit *rec;
 	struct hlist_head *head;
+	struct bpf_prog *prog;
 	int syscall_nr;
 	int rctx;
 	int size;
@@ -650,6 +661,8 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	if (!sys_data)
 		return;
 
+	prog = sys_data->exit_event->prog;
+
 	head = this_cpu_ptr(sys_data->exit_event->perf_events);
 	if (hlist_empty(head))
 		return;
@@ -665,6 +678,11 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 
 	rec->nr = syscall_nr;
 	rec->ret = syscall_get_return_value(current, regs);
+	if (prog && prog->type == BPF_PROG_TYPE_TRACE_EVENT) {
+		struct trace_event_context ctx = { sys_data->exit_event, rec };
+		if (!trace_call_bpf(prog, &ctx))
+			return;
+	}
 	perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head, NULL);
 }
 
-- 
1.9.3


* [RFC][PATCH 08/10] samples/bpf: Add support for trace events to the eBPF loading code
From: Tom Zanussi @ 2016-02-12 16:11 UTC (permalink / raw)
  To: ast, rostedt
  Cc: masami.hiramatsu.pt, namhyung, peterz, linux-kernel, Tom Zanussi

Because we now have two types of eBPF programs that can attach to
kprobes, we need to add a way to distinguish them.

BPF_PROG_TYPE_KPROBE programs are defined the same as they were
before:

SEC("kprobe/...")

BPF_PROG_TYPE_TRACE_EVENT programs always start with 'event/'.  A
TRACE_EVENT program on a kprobe is defined in the same way as a
KPROBE program, but with the 'event/' prefix prepended:

SEC("event/kprobe/...")
SEC("event/kretprobe/...")

A BPF_PROG_TYPE_TRACE_EVENT program on a static trace event simply
specifies 'event/' plus the colon-separated subsys:event name:

SEC("event/subsys:event")

Also, add a couple helper declarations for the new field-reading
helper functions.
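
Side by side, using the event names from the later sample patches
(illustrative only), the three forms are:

SEC("kprobe/__netif_receive_skb_core")		/* BPF_PROG_TYPE_KPROBE, as before */
SEC("event/kprobe/__netif_receive_skb_core")	/* BPF_PROG_TYPE_TRACE_EVENT via a kprobe */
SEC("event/net:netif_receive_skb")		/* BPF_PROG_TYPE_TRACE_EVENT on a static event */

For the static event case, the loader turns the subsys:event name into
the usual debugfs id path, e.g.
/sys/kernel/debug/tracing/events/net/netif_receive_skb/id.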

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 samples/bpf/bpf_helpers.h |  4 ++++
 samples/bpf/bpf_load.c    | 46 +++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 7ad19e1..a2fe8df 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -39,6 +39,10 @@ static int (*bpf_redirect)(int ifindex, int flags) =
 	(void *) BPF_FUNC_redirect;
 static int (*bpf_perf_event_output)(void *ctx, void *map, int index, void *data, int size) =
 	(void *) BPF_FUNC_perf_event_output;
+static int (*bpf_trace_event_field_read)(void *ctx, const char *field_name) =
+	(void *) BPF_FUNC_trace_event_field_read;
+static int (*bpf_trace_event_field_read_string)(void *ctx, const char *field_name, void *buf, int size) =
+	(void *) BPF_FUNC_trace_event_field_read_string;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index da86a8e..006a139 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -46,14 +46,39 @@ static int populate_prog_array(const char *event, int prog_fd)
 
 static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 {
+	bool is_trace_event = strncmp(event, "event/", 6) == 0;
 	bool is_socket = strncmp(event, "socket", 6) == 0;
-	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
-	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
+	bool is_kretprobe = false;
+	bool is_kprobe = false;
+
 	enum bpf_prog_type prog_type;
-	char buf[256];
+	char buf[256], event_buf[256];
 	int fd, efd, err, id;
 	struct perf_event_attr attr = {};
 
+	char *event_name = NULL;
+	char *subsys_name = NULL;
+
+	if (is_trace_event)
+		event += 6;
+
+	is_kprobe = strncmp(event, "kprobe/", 7) == 0;
+	is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
+
+	if (is_trace_event) {
+		if (!(is_kprobe || is_kretprobe)) {
+			strncpy(event_buf, event, 255);
+			event_buf[255] = '\0';
+
+			event_name = event_buf;
+			subsys_name = strsep(&event_name, ":");
+			if (!subsys_name || !event_name) {
+				printf("Unknown trace event '%s'\n", event);
+				return -1;
+			}
+		}
+	}
+
 	attr.type = PERF_TYPE_TRACEPOINT;
 	attr.sample_type = PERF_SAMPLE_RAW;
 	attr.sample_period = 1;
@@ -61,6 +86,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 
 	if (is_socket) {
 		prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+	} else if (is_trace_event) {
+		prog_type = BPF_PROG_TYPE_TRACE_EVENT;
 	} else if (is_kprobe || is_kretprobe) {
 		prog_type = BPF_PROG_TYPE_KPROBE;
 	} else {
@@ -114,8 +141,15 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	}
 
 	strcpy(buf, DEBUGFS);
-	strcat(buf, "events/kprobes/");
-	strcat(buf, event);
+	if (is_kprobe || is_kretprobe) {
+		strcat(buf, "events/kprobes/");
+		strcat(buf, event);
+	} else {
+		strcat(buf, "events/");
+		strcat(buf, subsys_name);
+		strcat(buf, "/");
+		strcat(buf, event_name);
+	}
 	strcat(buf, "/id");
 
 	efd = open(buf, O_RDONLY, 0);
@@ -300,6 +334,7 @@ int load_bpf_file(char *path)
 
 			if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
 			    memcmp(shname_prog, "kretprobe/", 10) == 0 ||
+			    memcmp(shname_prog, "event/", 6) == 0 ||
 			    memcmp(shname_prog, "socket", 6) == 0)
 				load_and_attach(shname_prog, insns, data_prog->d_size);
 		}
@@ -316,6 +351,7 @@ int load_bpf_file(char *path)
 
 		if (memcmp(shname, "kprobe/", 7) == 0 ||
 		    memcmp(shname, "kretprobe/", 10) == 0 ||
+		    memcmp(shname, "event/", 6) == 0 ||
 		    memcmp(shname, "socket", 6) == 0)
 			load_and_attach(shname, data->d_buf, data->d_size);
 	}
-- 
1.9.3


* [RFC][PATCH 09/10] samples/bpf: Add readcounts-by-pid example
From: Tom Zanussi @ 2016-02-12 16:11 UTC (permalink / raw)
  To: ast, rostedt
  Cc: masami.hiramatsu.pt, namhyung, peterz, linux-kernel, Tom Zanussi

This is a simple demonstration of an eBPF program attached to a static
trace event ("event/subsys:event").  The count and pid values are
grabbed from each event hit and aggregated in a hash map.

Example output:

  # ./readcounts-by-pid
  ^C
  pid     4143    comm uname               count          832    hitcount            1
  pid     2755    comm gdbus               count           32    hitcount            2
  pid      315    comm systemd-journal     count        17408    hitcount           16
  pid     2415    comm dbus-daemon         count         8242    hitcount            5
  pid     4164    comm gdbus               count          288    hitcount           18
  pid     4139    comm firefox             count       384245    hitcount           61
  pid     2660    comm gnome-shell         count        42672    hitcount          117
  pid      774    comm Xorg                count      4621105    hitcount         1259
  pid     2072    comm upowerd             count           32    hitcount            2

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 samples/bpf/Makefile                 |  4 +++
 samples/bpf/readcounts-by-pid_kern.c | 57 +++++++++++++++++++++++++++++++
 samples/bpf/readcounts-by-pid_user.c | 66 ++++++++++++++++++++++++++++++++++++
 3 files changed, 127 insertions(+)
 create mode 100644 samples/bpf/readcounts-by-pid_kern.c
 create mode 100644 samples/bpf/readcounts-by-pid_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index edd638b..d7af8d5 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -16,6 +16,7 @@ hostprogs-y += tracex5
 hostprogs-y += tracex6
 hostprogs-y += trace_output
 hostprogs-y += lathist
+hostprogs-y += readcounts-by-pid
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -32,6 +33,7 @@ tracex5-objs := bpf_load.o libbpf.o tracex5_user.o
 tracex6-objs := bpf_load.o libbpf.o tracex6_user.o
 trace_output-objs := bpf_load.o libbpf.o trace_output_user.o
 lathist-objs := bpf_load.o libbpf.o lathist_user.o
+readcounts-by-pid-objs := bpf_load.o libbpf.o readcounts-by-pid_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -47,6 +49,7 @@ always += tracex6_kern.o
 always += trace_output_kern.o
 always += tcbpf1_kern.o
 always += lathist_kern.o
+always += readcounts-by-pid_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -63,6 +66,7 @@ HOSTLOADLIBES_tracex5 += -lelf
 HOSTLOADLIBES_tracex6 += -lelf
 HOSTLOADLIBES_trace_output += -lelf -lrt
 HOSTLOADLIBES_lathist += -lelf
+HOSTLOADLIBES_readcounts-by-pid += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/readcounts-by-pid_kern.c b/samples/bpf/readcounts-by-pid_kern.c
new file mode 100644
index 0000000..5967781
--- /dev/null
+++ b/samples/bpf/readcounts-by-pid_kern.c
@@ -0,0 +1,57 @@
+/* Copyright (c) 2016 Intel Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+struct hist_key {
+	char comm[16];
+	u64 pid;
+};
+
+struct hist_val {
+	u64 count;
+	u64 hitcount;
+};
+
+struct bpf_map_def SEC("maps") counts_map = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(struct hist_key),
+	.value_size = sizeof(struct hist_val),
+	.max_entries = 1024,
+};
+
+SEC("event/syscalls:sys_enter_read")
+int bpf_prog(void *ctx)
+{
+	struct hist_key key = {};
+	struct hist_val init_val;
+	struct hist_val *val;
+	u64 count;
+
+	char common_pid_field_name1[] = "common_pid";
+	key.pid = bpf_trace_event_field_read(ctx, common_pid_field_name1);
+
+	bpf_get_current_comm(&key.comm, sizeof(key.comm));
+
+	char count_field_name1[] = "count";
+	count = bpf_trace_event_field_read(ctx, count_field_name1);
+
+	val = bpf_map_lookup_elem(&counts_map, &key);
+	if (val) {
+		val->count += count;
+		val->hitcount += 1;
+	} else {
+		init_val.count = count;
+		init_val.hitcount = 1;
+		bpf_map_update_elem(&counts_map, &key, &init_val, BPF_ANY);
+	}
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/readcounts-by-pid_user.c b/samples/bpf/readcounts-by-pid_user.c
new file mode 100644
index 0000000..d08b867
--- /dev/null
+++ b/samples/bpf/readcounts-by-pid_user.c
@@ -0,0 +1,66 @@
+/* Copyright (c) 2016 Intel Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+struct hist_key {
+	char comm[16];
+	__u64 pid;
+};
+
+struct hist_val {
+	__u64 count;
+	__u64 hitcount;
+};
+
+static void print_hist(int fd)
+{
+	struct hist_key key = {}, next_key;
+	struct hist_val val;
+
+	printf("\n");
+
+	while (bpf_get_next_key(fd, &key, &next_key) == 0) {
+		bpf_lookup_elem(fd, &next_key, &val);
+		printf("pid %8llu    comm %-16s    count %12llu    hitcount %12llu\n",
+		       next_key.pid, next_key.comm, val.count, val.hitcount);
+		key = next_key;
+	}
+}
+
+static void int_exit(int sig)
+{
+	print_hist(map_fd[0]);
+
+	exit(0);
+}
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	signal(SIGINT, int_exit);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	for (;;) {
+		sleep(60);
+	}
+
+	return 0;
+}
-- 
1.9.3


* [RFC][PATCH 10/10] samples/bpf: Add kprobe-event-fields example
From: Tom Zanussi @ 2016-02-12 16:11 UTC (permalink / raw)
  To: ast, rostedt
  Cc: masami.hiramatsu.pt, namhyung, peterz, linux-kernel, Tom Zanussi

This is a simple demonstration of eBPF programs attached to both a
kprobe trace event ("event/kprobe/...") and a static trace event
("event/subsys:event").  The common_pid, name, and len fields of the
netif_receive_skb static trace event are grabbed from the event and
printed, and the common_pid value of the __netif_receive_skb_core
kprobe event is likewise grabbed from the kprobe trace event.

Example output:

  # ./kprobe-event-fields
              ping-4074  [000] d.s1   131.098630: : __netif_receive_skb_core kprobe fields:  common_pid = 4074
              ping-4074  [000] ..s1   131.098653: : netif_receive_skb trace event fields:  common_pid = 4074, name = lo, len 84

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 samples/bpf/Makefile                   |  4 +++
 samples/bpf/kprobe-event-fields_kern.c | 56 ++++++++++++++++++++++++++++++++++
 samples/bpf/kprobe-event-fields_user.c | 25 +++++++++++++++
 3 files changed, 85 insertions(+)
 create mode 100644 samples/bpf/kprobe-event-fields_kern.c
 create mode 100644 samples/bpf/kprobe-event-fields_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index d7af8d5..6b9ceae 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -17,6 +17,7 @@ hostprogs-y += tracex6
 hostprogs-y += trace_output
 hostprogs-y += lathist
 hostprogs-y += readcounts-by-pid
+hostprogs-y += kprobe-event-fields
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -34,6 +35,7 @@ tracex6-objs := bpf_load.o libbpf.o tracex6_user.o
 trace_output-objs := bpf_load.o libbpf.o trace_output_user.o
 lathist-objs := bpf_load.o libbpf.o lathist_user.o
 readcounts-by-pid-objs := bpf_load.o libbpf.o readcounts-by-pid_user.o
+kprobe-event-fields-objs := bpf_load.o libbpf.o kprobe-event-fields_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -50,6 +52,7 @@ always += trace_output_kern.o
 always += tcbpf1_kern.o
 always += lathist_kern.o
 always += readcounts-by-pid_kern.o
+always += kprobe-event-fields_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -67,6 +70,7 @@ HOSTLOADLIBES_tracex6 += -lelf
 HOSTLOADLIBES_trace_output += -lelf -lrt
 HOSTLOADLIBES_lathist += -lelf
 HOSTLOADLIBES_readcounts-by-pid += -lelf
+HOSTLOADLIBES_kprobe-event-fields += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/kprobe-event-fields_kern.c b/samples/bpf/kprobe-event-fields_kern.c
new file mode 100644
index 0000000..3d01e08
--- /dev/null
+++ b/samples/bpf/kprobe-event-fields_kern.c
@@ -0,0 +1,56 @@
+/* Copyright (c) 2016 Intel Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+#include <uapi/linux/bpf.h>
+#include <linux/version.h>
+#include "bpf_helpers.h"
+
+/*
+ * With kprobes and event/kprobe/xxx, we can access the common trace
+ * event fields:
+ */
+SEC("event/kprobe/__netif_receive_skb_core")
+int bpf_prog1(void *ctx)
+{
+	int common_pid;
+
+	char common_pid_field[] = "common_pid";
+	common_pid = bpf_trace_event_field_read(ctx, common_pid_field);
+
+	char fmt[] = "__netif_receive_skb_core kprobe fields:  common_pid = %d\n";
+	bpf_trace_printk(fmt, sizeof(fmt), common_pid);
+
+	return 1;
+}
+
+/*
+ * Without the event/kprobe, we can access all the static trace event
+ * fields:
+ */
+SEC("event/net:netif_receive_skb")
+int bpf_prog2(void *ctx)
+{
+	char name[256] = {};
+	int len, common_pid;
+
+	char len_field[] = "len";
+	len = bpf_trace_event_field_read(ctx, len_field);
+
+	char name_field[] = "name";
+	bpf_trace_event_field_read_string(ctx, name_field, name, sizeof(name));
+
+	char common_pid_field[] = "common_pid";
+	common_pid = bpf_trace_event_field_read(ctx, common_pid_field);
+
+	char fmt[] = "netif_receive_skb trace event fields:  common_pid = %d, name = %s, len %d\n";
+	bpf_trace_printk(fmt, sizeof(fmt), common_pid, name, len);
+
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/kprobe-event-fields_user.c b/samples/bpf/kprobe-event-fields_user.c
new file mode 100644
index 0000000..31a4818
--- /dev/null
+++ b/samples/bpf/kprobe-event-fields_user.c
@@ -0,0 +1,25 @@
+#include <stdio.h>
+#include <linux/bpf.h>
+#include <unistd.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int ac, char **argv)
+{
+	FILE *f;
+	char filename[256];
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	f = popen("taskset 1 ping -c5 localhost", "r");
+	(void) f;
+
+	read_trace_pipe();
+
+	return 0;
+}
-- 
1.9.3


* Re: [RFC][PATCH 00/10] Add trace event support to eBPF
From: Alexei Starovoitov @ 2016-02-14  0:02 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: rostedt, masami.hiramatsu.pt, namhyung, peterz, linux-kernel,
	Daniel Borkmann, netdev

On Fri, Feb 12, 2016 at 10:11:18AM -0600, Tom Zanussi wrote:
> Hi,
> 
> As promised in previous threads, this patchset shares some common
> functionality with the hist triggers code and enables trace events to
> be accessed from eBPF programs.

great that you've started working on BPF!

> It needs to be applied on top of current tracing/for-next with the
> hist triggers v13 patches applied:
> 
>   https://lkml.org/lkml/2015/12/10/640
> 
> Any input welcome, especially things that could be done better wrt
> eBPF, since I'm not as familiar with that code...

this dependency looks odd. As I was saying a year ago I don't think
these hist triggers belong in the kernel. BPF already can do
way more complex aggregation and histograms.
Take a look at all the tools written on top of it:
https://github.com/iovisor/bcc/tree/master/tools
I don't buy into the reasoning that hist triggers are needed to do
performance analysis in the busybox. If not in busybox,
just use perf+bpf or bcc.

that aside we do need to be able to attach bpf to tracepoints.
The key goal of bpf+tracepoints is to be faster than bpf+kprobe.
kprobes were good enough for almost all use cases so far except
when we started processing millions of events per second via
kprobe+bpf. At that point even optimized kprobe adds too much
overhead.

Patch 9:
+       char common_pid_field_name1[] = "common_pid";
+       key.pid = bpf_trace_event_field_read(ctx, common_pid_field_name1);

this is already achieved by bpf_get_current_pid_tgid()
and it is several times faster.

+       char count_field_name1[] = "count";
+       count = bpf_trace_event_field_read(ctx, count_field_name1);

Already possible by simple PT_REGS_PARM3(ctx) which is faster as well.
And if you're using bcc you can write
sys_read(struct pt_regs *ctx, int fd, char *buf, size_t count)
{ .. use 'count' ...}
and bcc will automatically convert 'count' to ctx->dx.
The latest perf can do this as well with slightly different syntax.
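
For concreteness, here is a rough sketch of a per-pid byte counter written
only with the existing helpers; the map name and size are illustrative, not
taken from the patchset:

#include <linux/ptrace.h>
#include <linux/version.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

/* bytes requested per pid; sizing is arbitrary for the sketch */
struct bpf_map_def SEC("maps") read_bytes = {
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(u32),
	.value_size = sizeof(u64),
	.max_entries = 1024,
};

SEC("kprobe/sys_read")
int bpf_prog1(struct pt_regs *ctx)
{
	/* lower 32 bits of the helper's return value are the pid */
	u32 pid = bpf_get_current_pid_tgid();
	/* 3rd syscall argument ('count') taken straight from the registers */
	u64 count = PT_REGS_PARM3(ctx);
	u64 *val, init_val;

	val = bpf_map_lookup_elem(&read_bytes, &pid);
	if (val) {
		*val += count;
	} else {
		init_val = count;
		bpf_map_update_elem(&read_bytes, &pid, &init_val, BPF_ANY);
	}

	return 0;
}

char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;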

Patch 10:
+       char len_field[] = "len";
+       len = bpf_trace_event_field_read(ctx, len_field);
+
+       char name_field[] = "name";
+       bpf_trace_event_field_read_string(ctx, name_field, name, sizeof(name));
+
+       char common_pid_field[] = "common_pid";
+       common_pid = bpf_trace_event_field_read(ctx, common_pid_field);

all of the above can be done already and it's even demoed
in samples/bpf/tracex1_kern.c
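
A rough equivalent of the Patch 10 kprobe program using only what's already
there, along the lines of tracex1_kern.c: the skb fields are pulled out with
bpf_probe_read() instead of the new field-read helpers.

#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <uapi/linux/bpf.h>
#include <linux/version.h>
#include "bpf_helpers.h"

SEC("kprobe/__netif_receive_skb_core")
int bpf_prog1(struct pt_regs *ctx)
{
	/* the first argument of __netif_receive_skb_core() is the skb */
	struct sk_buff *skb = (struct sk_buff *)PT_REGS_PARM1(ctx);
	struct net_device *dev = NULL;
	char devname[IFNAMSIZ] = {};
	unsigned int len = 0;

	bpf_probe_read(&len, sizeof(len), &skb->len);
	bpf_probe_read(&dev, sizeof(dev), &skb->dev);
	bpf_probe_read(devname, sizeof(devname), dev->name);

	char fmt[] = "skb on %s, len %u\n";
	bpf_trace_printk(fmt, sizeof(fmt), devname, len);

	return 0;
}

char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;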

The main problem with this patch set is in patch 5:
+       field_name = (char *) (long) r2;
+       field = trace_find_event_field(ctx->call, field_name);

This is very slow, since it's doing for_each_field(){strcmp()}
That's the reason that simple ctx->register or bpf_probe_read()
are much faster.
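
For reference, the lookup behind that helper is essentially a linear scan
with a strcmp() per field, roughly like the fragment below; the real code is
trace_find_event_field()/__find_event_field() in kernel/trace/trace_events.c:

static struct ftrace_event_field *
__find_field(struct list_head *head, const char *name)
{
	struct ftrace_event_field *field;

	list_for_each_entry(field, head, link) {
		if (!strcmp(field->name, name))
			return field;
	}

	return NULL;
}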

There are a few other issues with this set:
Patch 2:
+       if (off < 0 || off >= TRACE_EVENT_CTX_HDR_SIZE + BUF_MAX_DATA_SIZE)
+               return false;
that will allow the bpf program to read a page worth of data from some
stack pointer, since in patch 6:
+               struct trace_event_context ctx = { call, entry };
+               if (!trace_call_bpf(prog, &ctx))
the kernel probably won't crash, but it shouldn't be allowed.

Patch 6 is odd.
We already have prog_type_kprobe to work with kprobes.
Having prog_type_trace_event there is unnecessary.

The syscall part of Patch 7 is probably unnecessary too.
When we benchmarked bpf+syscall it turned out that adding
kprobe to sys_xx is actually faster than using syscall_enter() logic.
Maybe things have changed. Need to re-measure.

The tracepoint part of Patch 7 makes sense, but it has too much
overhead to be competitive with kprobe.
perf_trace_buf_prepare() does local_save_flags(), which is quite an
expensive instruction when the tracepoint is called a million times a second.
perf_fetch_caller_regs() is heavy too, since it zeros pt_regs which
will be unused by the bpf program.

I'm also working on bpf+tracepoint.
My plan was to add a variant of perf_trace_buf_prepare() that
doesn't populate 'struct trace_entry' and just returns a pointer.
Then perf_trace_##call() does tstruct and {assign;} into that
'entry' pointer and then calls the bpf prog with
'struct ctx {regs, entry_ptr, __data_size}'.
Then bpf prog will call a helper to copy the data into its stack
and will access it there as normal.
It's going to be significantly faster than the for_each_field+strcmp
approach and a little bit faster than kprobe, but I'm not happy
with that speed yet.
I'm still thinking about how to avoid the 2nd copy and extra helper.
The program should be able to access the entry_ptr directly,
but some verifier magic needs to be done, so it's still wip.
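
Very rough sketch of the shape described above; all of these names are made
up for illustration and do not exist in any kernel:

/* the context a tracepoint program would receive in this scheme */
struct bpf_tp_ctx {
	struct pt_regs *regs;	/* caller registers, possibly unused */
	void *entry_ptr;	/* event record after tstruct + {assign;} */
	__u32 data_size;	/* __data_size of that record */
};

/* hypothetical helper: copy the record onto the program's own stack so the
 * verifier can reason about subsequent accesses
 *
 *	int bpf_trace_entry_read(struct bpf_tp_ctx *ctx, void *dst, int size);
 */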

btw, ast@plumgrid is no longer functional and
please cc netdev every time kernel/bpf/ is touched.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH 00/10] Add trace event support to eBPF
  2016-02-14  0:02 ` [RFC][PATCH 00/10] Add trace event support to eBPF Alexei Starovoitov
@ 2016-02-16 22:35   ` Tom Zanussi
  2016-02-17  4:51     ` Alexei Starovoitov
  0 siblings, 1 reply; 17+ messages in thread
From: Tom Zanussi @ 2016-02-16 22:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: rostedt, masami.hiramatsu.pt, namhyung, peterz, linux-kernel,
	Daniel Borkmann, netdev

On Sun, 2016-02-14 at 01:02 +0100, Alexei Starovoitov wrote:
> On Fri, Feb 12, 2016 at 10:11:18AM -0600, Tom Zanussi wrote:
> > Hi,
> > 
> > As promised in previous threads, this patchset shares some common
> > functionality with the hist triggers code and enables trace events to
> > be accessed from eBPF programs.
> 
> great that you've started working on BPF!
> 
> > It needs to be applied on top of current tracing/for-next with the
> > hist triggers v13 patches applied:
> > 
> >   https://lkml.org/lkml/2015/12/10/640
> > 
> > Any input welcome, especially things that could be done better wrt
> > eBPF, since I'm not as familiar with that code...
> 
> this dependency looks odd. As I was saying a year ago I don't think

It's not actually a dependency, just a consequence of trying to share
code for this RFC that happened to already be in hist triggers.  I could
just as well have removed the hunk that deletes that code from hist
triggers, but one of the main points was to show how they could be
shared.

> this hist triggers belong in the kernel. BPF already can do
> way more complex aggregation and histograms.

Way more?  I still can't accomplish with eBPF some of the most basic and
helpful use cases that I can with hist triggers, such as using
stacktraces as hash keys.  And the locking in the eBPF hashmap
implementation prevents anything like the function_hist [1] tracer from
being implemented on top of it:

.
.
.
ip: [ffffffffa0166350] gen7_render_get_cmd_length_mask     hitcount: 18551173520
ip: [ffffffff817589b0] _cond_resched                       hitcount: 19242157574
ip: [ffffffff8175bd60] _raw_spin_lock                      hitcount: 19275222963
ip: [ffffffff81232c60] __fdget                             hitcount: 22865629762
ip: [ffffffff81232c00] __fget_light                        hitcount: 24594932641
ip: [ffffffff8175bed0] _raw_spin_lock_irqsave              hitcount: 26510496751
ip: [ffffffff8175bb30] _raw_spin_unlock_irqrestore         hitcount: 26525916042
ip: [ffffffffa018d6c0] gen6_ring_get_seqno                 hitcount: 95888137753

Totals:
    Hits: 1020596753338
    Entries: 7102
    Dropped: 0

[1]
http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/commit/?h=tzanussi/func-hist-v0&id=fa123181a068574bd9e300416cc947aa7424d808

> Take a look at all the tools written on top of it:
> https://github.com/iovisor/bcc/tree/master/tools

That's great, but it's all out-of-tree.  Supporting out-of-tree users
has never been justification for merging in-kernel code (or for blocking
it from being merged). 

> I don't buy into reasoning that hist triggers are needed to do
> performance analysis in the busybox. If not in busybox,
> just use perf+bpf or bcc.
> 

I have systems with tiny amounts of memory and storage that have zero
chance of ever having a compiler or Python running on them.  It's
indispensable on those systems to have access to all and nothing more
than the ftrace-based tools and trace event files directly.  I even have
systems that don't have a shell at all, but can nonetheless access all
the ftrace and trace event functionality through a different channel.
hist triggers is just another one of those tools, a logical progression
towards allowing users in such a situation (among others) to make more
efficient and concise use of the trace event data.  I don't see what
your problem with that is.

> that aside we do need to be able to attach bpf to tracepoints.
> The key goal of bpf+tracepoints is to be faster than bpf+kprobe.
> kprobes were good enough for almost all use cases so far except
> when we started processing millions of events per second via
> kprobe+bpf. At that point even optimized kprobe adds too much
> overhead.
> 

It sounds from this like you're mainly interested in hooking into the
trace events but not the trace event data, which is fine, but other
tools might find some value in the trace event data, even if it may not
be the most optimal path.

I haven't measured the overhead of accessing data from the
trace buffers, but I do know that the hist triggers have no problem
logging millions of events per second.

In the past I have measured the basic hist triggers mechanism and found
it to be somewhat faster than the ftrace function tracer itself:

    over a kernel compile:

    no tracing:

      real    110m47.817s
      user    99m34.144s
      sys    7m19.928s

    function tracer enabled:

      real    138m6.854s
      user    100m57.692s
      sys    36m27.724s

    function_graph tracer enabled:

      real    194m38.052s
      user    102m32.192s
      sys    97m47.056s

    function_hist tracer enabled:

      real    128m44.682s
      user    100m29.080s
      sys    26m52.880s

Given the fact that I haven't heard people complaining about the
overhead of even the function_graph tracer, I don't think the overhead
would be a show-stopper in most cases.

> Patch 9:
> +       char common_pid_field_name1[] = "common_pid";
> +       key.pid = bpf_trace_event_field_read(ctx, common_pid_field_name1);
> 
> this is already achieved by bpf_get_current_pid_tgid()
> and it is several times faster.
> 

Right, but the point of the examples was just to show that the mechanism
worked.  I guess I should have named them tracexX_... instead, to make
that clear.

One point I would make about this though is that while it might be
slower to access this particular field that way, the user who's just
trying to get something done doesn't need to know about
bpf_get_current_pid_tgid() and can just look at the available fields in
the trace event format file and use them directly - trading off
efficiency for ease-of-use.

I also wonder about the trace event fields that don't boil down to
simple assignments - does each of them require a 'helper' function to be
written in order to access the same data, if you're not interested in
using the trace event data? 

> +       char count_field_name1[] = "count";
> +       count = bpf_trace_event_field_read(ctx, count_field_name1);
> 
> Already possible by simple PT_REGS_PARM3(ctx) which is faster as well.

It may be 'simple' in that it's only 20 chars or so, but that seems to
imply a certain kind of deep knowledge on the part of the user in order
to be able to write even a simple eBPF program.  Maybe some users would
like to just say, 'just get me the count value, I don't really want to
know all about this pt_regs thing'.  BTW, where is PT_REGS_PARM3() and
how to use it documented for a new user, other than via the examples? 

> And if you're using bcc you can write
> sys_read(struct pt_regs *ctx, int fd, char *buf, size_t count)
> { .. use 'count' ...}
> and bcc will automatically convert 'count' to ctx->dx.
> The latest perf can do this as well with slightly different syntax.
> 
> Patch 10:
> +       char len_field[] = "len";
> +       len = bpf_trace_event_field_read(ctx, len_field);
> +
> +       char name_field[] = "name";
> +       bpf_trace_event_field_read_string(ctx, name_field, name, sizeof(name));
> +
> +       char common_pid_field[] = "common_pid";
> +       common_pid = bpf_trace_event_field_read(ctx, common_pid_field);
> 
> all of the above can be done already and it's even demoed
> in samples/bpf/tracex1_kern.c

Right, and again, just an example to show that it works, and in this
case that you're getting the same data from the kprobe and the static
trace event.

> 
> The main problem with this patch set is in patch 5:
> +       field_name = (char *) (long) r2;
> +       field = trace_find_event_field(ctx->call, field_name);
> 
> This is very slow, since it's doing for_each_field(){strcmp()}
> That's the reason that simple ctx->register or bpf_probe_read()
> are much faster.
> 

Yes, this and the next thing were the main reasons for making this an
RFC patch - I actually was hoping for someone to point out the 'right'
way to do this in eBPF.

Obviously there's no reason to do this lookup more than once - it should
be done once and then saved, but I ran into a few questions about that:

 - should I have a helper that returns a pointer to the field, then save
it in a 'global hashmap' with one entry, like I've seen in other eBPF
code, and pass it in whenever accessing the field contents?
 - I tried that but the top half was masked off, why?
 - given that I only want something like the field lookup to happen
once, in eBPF what's the accepted way to do a 'begin probe' i.e. some
code like populating a global field array before the 'main' program
starts?
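
To make the first question above concrete, the arrangement would look roughly
like the sketch below.  bpf_trace_event_field_lookup() is hypothetical (it is
not in the posted patches), and whether a 64-bit kernel pointer survives the
round trip through the map value is exactly the "top half masked off" problem:

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

/* hypothetical helper, shown only to make the shape of the program clear */
static u64 (*bpf_trace_event_field_lookup)(void *ctx, char *name) =
	(void *) 63;	/* placeholder id, made up */

/* one-slot array map used as a poor man's global, caching the field pointer;
 * the value is a u64 so the whole 64-bit kernel pointer should fit */
struct bpf_map_def SEC("maps") field_cache = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(u32),
	.value_size = sizeof(u64),
	.max_entries = 1,
};

SEC("event/net:netif_receive_skb")
int bpf_prog(void *ctx)
{
	char len_field[] = "len";
	u32 idx = 0;
	u64 *cached;

	cached = bpf_map_lookup_elem(&field_cache, &idx);
	if (!cached)
		return 0;

	if (!*cached)	/* first call: do the name lookup once, cache result */
		*cached = bpf_trace_event_field_lookup(ctx, len_field);

	/* later reads would hand *cached back to a read-by-cookie helper
	 * instead of repeating the strcmp() walk */
	return 1;
}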

> There are few other issues with this set:
> Patch 2:
> +       if (off < 0 || off >= TRACE_EVENT_CTX_HDR_SIZE + BUF_MAX_DATA_SIZE)
> +               return false;
> that will allow bpf program to read a page worth of data from some
> stack pointer, since in patch 6:
> +               struct trace_event_context ctx = { call, entry };
> +               if (!trace_call_bpf(prog, &ctx))
> the kernel probably won't crash, but it shouldn't be allowed.
> 

Right, the intent was to allow the eBPF program to somehow access a
pointer to the trace record, and grab the field directly.  Obviously I
don't have a clue about how to make that properly happen in the eBPF
interpreter, and I don't see anything explaining all the voodoo
surrounding that even in the comments.  I guess I'd have to spend a few
hours reading the BPF code and the verifier even, to understand that.
It would be nice if an interface that seems like it would be used by
other people wanting to extend eBPF could provide some hints in the
comments if not be thoroughly explained in the documentation.

> Patch 6 is odd.
> We already have prog_type_kprobe to work with kprobes.
> Having prog_type_trace_event there is unnecessary.
> 

Not if you intend on making use of the trace event data.  Which I think
I've realized that you don't, and therefore this whole patchset is
indeed pointless.

> The syscall part of Patch 7 is probably unnecessary too.
> When we benchmarked bpf+syscall it turned out that adding
> kprobe to sys_xx is actually faster than using syscall_enter() logic.
> May be things have changed. Need to re-measure.
> 
> The tracepoint part of Patch 7 makes sense, but it has too much
> overhead to be competitive with kprobe.
> perf_trace_buf_prepare() does local_save_flags() which is quite
> expensive instruction when tracepoint is called million times a second.
> perf_fetch_caller_regs() is heavy too, since it zeros pt_regs which
> will be unused the bpf program.
> 
> I'm also working on bpf+tracepoint.
> My plan was to add a variant of perf_trace_buf_prepare() that
> doesn't populate 'struct trace_entry' and just returns a pointer.
> Then perf_trace_##call() does tstruct and {assign;} into that
> 'entry' pointer and then call bpf prog with
> 'struct ctx {regs, entry_ptr, __data_size}'

Ah, I think this is what I was missing with the Patch 2 thing.  Well, I
guess I'll find out what was the right way to do it when I see your
patch.

> Then bpf prog will call a helper to copy the data into its stack
> and will access it there as normal.
> It's going to be significantly faster than for_each_field+strcmp
> approach and little bit faster than kprobe, but I'm not happy
> with that speed yet.
> I'm still thinking on how to avoid 2nd copy and extra helper.
> The program should be able to access the entry_ptr directly,
> but some verifier magic needs to be done. so it's still wip.
> 
> btw, ast@plumgrid is no longer functional and
> please cc netdev every time kernel/bpf/ is touched.
> 

Why netdev?  This has nothing to do with networking.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH 00/10] Add trace event support to eBPF
  2016-02-16 22:35   ` Tom Zanussi
@ 2016-02-17  4:51     ` Alexei Starovoitov
  2016-02-18 21:27       ` Tom Zanussi
  0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2016-02-17  4:51 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: rostedt, masami.hiramatsu.pt, namhyung, peterz, linux-kernel,
	Daniel Borkmann, netdev

On Tue, Feb 16, 2016 at 04:35:27PM -0600, Tom Zanussi wrote:
> On Sun, 2016-02-14 at 01:02 +0100, Alexei Starovoitov wrote:
> > On Fri, Feb 12, 2016 at 10:11:18AM -0600, Tom Zanussi wrote:

> > this hist triggers belong in the kernel. BPF already can do
> > way more complex aggregation and histograms.
> 
> Way more?  I still can't accomplish with eBPF some of the most basic and
> helpful use cases that I can with hist triggers, such as using
> stacktraces as hash keys.  And the locking in the eBPF hashmap
> implementation prevents anything like the function_hist [1] tracer from
> being implemented on top of it:

Both statements are not true.
In the link from previous email take a look at funccount_example.txt:
# ./funccount 'vfs_*'
Tracing... Ctrl-C to end.
^C
ADDR             FUNC                          COUNT
ffffffff811efe81 vfs_create                        1
ffffffff811f24a1 vfs_rename                        1
ffffffff81215191 vfs_fsync_range                   2
ffffffff81231df1 vfs_lock_file                    30
ffffffff811e8dd1 vfs_fstatat                     152
ffffffff811e8d71 vfs_fstat                       154
ffffffff811e4381 vfs_write                       166
ffffffff811e8c71 vfs_getattr_nosec               262
ffffffff811e8d41 vfs_getattr                     262
ffffffff811e3221 vfs_open                        264
ffffffff811e4251 vfs_read                        470
Detaching...

And this is done without adding new code to the kernel.

Another example is offwaketime that uses two stack traces as
part of a single key.

> > Take a look at all the tools written on top of it:
> > https://github.com/iovisor/bcc/tree/master/tools
> 
> That's great, but it's all out-of-tree.  Supporting out-of-tree users
> has never been justification for merging in-kernel code (or for blocking
> it from being merged). 

huh? perf is the only in-tree user space project.
All other tools and libraries are out-of-tree and that makes sense.

Actually would be great to merge bcc with perf eventually, but choice
of C++ isn't going to make it easy. The only real difference
between perf+bpf and bcc is that bcc integrates clang/llvm
as a library whereas perf+bpf deals with elf files and standalone compiler.
There are pros and cons for both and it's great that both are actively
growing and gaining user traction.

> I have systems with tiny amounts of memory and storage that have zero
> chance of ever having a compiler or Python running on them.  It's

if your system is so short on memory, then you don't want to bloat
the kernel with hist triggers, especially since they're not
going to be used 24/7 due to the overhead.

> I haven't measured the overhead of the cost of accessing data from the
> trace buffers, but I do know that the hist triggers have no problem
> logging millions of events per second.
> 
> In the past I have measured the basic hist triggers mechanism and found
> it to be somewhat faster than the ftrace function tracer itself:
> 
>     over a kernel compile:
> 
>     no tracing:
> 
>       real   110m47.817s
>       user    99m34.144s
>       sys      7m19.928s
... 
>     function_hist tracer enabled:
>       real   128m44.682s
>       user   100m29.080s
>       sys     26m52.880s

78% of cpu time is in user space. Not a great test of kernel
datapath, but 'sys 7m19.928s vs 26m52.880s' actually means that
the kernel part is 3 times slower. That is your enormous overhead.

2 hours to compile the kernel. ouch. that must be a very low-end device.
For comparison full kernel build on my box:
real    2m49.693s
user    66m44.204s
sys     5m29.257s

> One point I would make about this though is that while it might be
> slower to access this particular field that way, the user who's just
> trying to get something done doesn't need to know about
> bpf_get_current_pid_tgid() and can just look at the available fields in
> the trace event format file and use them directly - trading off
> efficiency for ease-of-use.

sorry, but nack for such 'ease-of-use'.
We're not going to sacrifice performance even if there are a few raw edges
in the user space. User tools can be improved, new compilers and
front-ends written, but kernel API will stay fixed and must be fast
from the start.

> surrounding that even in the comments.  I guess I'd have to spend a few
> hours reading the BPF code and the verifier even, to understand that.

not sure what your goal is. Runtime lookup via field name is not acceptable
whether it's cached or not. There is no place for strcmp in the critical path.

> > please cc netdev every time kernel/bpf/ is touched.
> > 
> 
> Why netdev?  This has nothing to do with networking.

because that's what the MAINTAINERS file says.
kernel/bpf/* is used for both tracing and networking and all significant
changes should be going through net-next to avoid conflicts and
to make sure that active developers can do a thorough review.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH 00/10] Add trace event support to eBPF
  2016-02-17  4:51     ` Alexei Starovoitov
@ 2016-02-18 21:27       ` Tom Zanussi
  2016-02-18 22:40         ` Daniel Borkmann
  2016-02-19  4:16         ` Alexei Starovoitov
  0 siblings, 2 replies; 17+ messages in thread
From: Tom Zanussi @ 2016-02-18 21:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: rostedt, masami.hiramatsu.pt, namhyung, peterz, linux-kernel,
	Daniel Borkmann, netdev

On Tue, 2016-02-16 at 20:51 -0800, Alexei Starovoitov wrote:
> On Tue, Feb 16, 2016 at 04:35:27PM -0600, Tom Zanussi wrote:
> > On Sun, 2016-02-14 at 01:02 +0100, Alexei Starovoitov wrote:
> > > On Fri, Feb 12, 2016 at 10:11:18AM -0600, Tom Zanussi wrote:
> 
> > > this hist triggers belong in the kernel. BPF already can do
> > > way more complex aggregation and histograms.
> > 
> > Way more?  I still can't accomplish with eBPF some of the most basic and
> > helpful use cases that I can with hist triggers, such as using
> > stacktraces as hash keys.  And the locking in the eBPF hashmap
> > implementation prevents anything like the function_hist [1] tracer from
> > being implemented on top of it:
> 
> Both statements are not true.

Erm, not exactly...

> In the link from previous email take a look at funccount_example.txt:
> # ./funccount 'vfs_*'
> Tracing... Ctrl-C to end.
> ^C
> ADDR             FUNC                          COUNT
> ffffffff811efe81 vfs_create                        1
> ffffffff811f24a1 vfs_rename                        1
> ffffffff81215191 vfs_fsync_range                   2
> ffffffff81231df1 vfs_lock_file                    30
> ffffffff811e8dd1 vfs_fstatat                     152
> ffffffff811e8d71 vfs_fstat                       154
> ffffffff811e4381 vfs_write                       166
> ffffffff811e8c71 vfs_getattr_nosec               262
> ffffffff811e8d41 vfs_getattr                     262
> ffffffff811e3221 vfs_open                        264
> ffffffff811e4251 vfs_read                        470
> Detaching...
> 

When I tried to use it to trace all functions:

  # ./funccount.py '*'

I got about 5 minutes worth of these kinds of error messages, which I
couldn't break out of:

write of "p:kprobes/p_rootfs_mount rootfs_mount" into kprobe_events failed: Device or resource busy
open(/sys/kernel/debug/tracing/events/kprobes/p_legacy_pic_int_noop/id): Too many open files
open(/sys/kernel/debug/tracing/events/kprobes/p_legacy_pic_irq_pending_noop/id): Too many open files
open(/sys/kernel/debug/tracing/events/kprobes/p_legacy_pic_probe/id): Too many open files
open(/sys/kernel/debug/tracing/events/kprobes/p_unmask_8259A_irq/id): Too many open files
open(/sys/kernel/debug/tracing/events/kprobes/p_enable_8259A_irq/id): Too many open files
open(/sys/kernel/debug/tracing/events/kprobes/p_mask_8259A_irq/id): Too many open files
open(/sys/kernel/debug/tracing/events/kprobes/p_disable_8259A_irq/id): Too many open files
open(/sys/kernel/debug/tracing/events/kprobes/p_i8259A_irq_pending/id): Too many open files
...

Which is probably a good thing, since that's a relatively common thing
for someone to try, whereas this subset probably isn't:

  # ./funccount.py '*spin*'

Which on my machine resulted in a hard lockup on all CPUs.  I'm not set
up to grab serial output on that machine, but here's a screenshot of
most of the stacktrace, all I could get:

  http://picpaste.com/bcc-spinstar-hardlock-fobQbcuG.JPG

There's nothing special about that machine and it's running a stock
4.4.0 kernel, so it should be pretty easy to reproduce on anything with
just a BPF-enabled config.  If not, let me know and I'll send more
specific info.

> And this is done without adding new code to the kernel.
> 
> Another example is offwaketime that uses two stack traces as
> part of single key.
> 

Which has the below code in the script itself implementing a stack
walker:

static u64 get_frame(u64 *bp) {
    if (*bp) {
        // The following stack walker is x86_64 specific
        u64 ret = 0;
        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
            return 0;
        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
            *bp = 0;
        if (ret < __START_KERNEL_map)
            return 0;
        return ret;
    }
    return 0;
}

int waker(struct pt_regs *ctx, struct task_struct *p) {
    u32 pid = p->pid;

    if (!(FILTER))
        return 0;

    u64 bp = 0;
    struct wokeby_t woke = {};
    int depth = 0;
    bpf_get_current_comm(&woke.name, sizeof(woke.name));
    bp = ctx->bp;

    // unrolled loop (MAXWDEPTH):
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    woke.ret[depth] = get_frame(&bp);

I think one call to this pretty much cancels out any performance gains
you might have made by fastidiously avoiding trace events.

So technically users can use stacktraces as hash keys, but they're
expected to use the bpf_probe_read() kernel primitives to write a stack
walker themselves.  Not exactly what I was hoping for...

> > > Take a look at all the tools written on top of it:
> > > https://github.com/iovisor/bcc/tree/master/tools
> > 
> > That's great, but it's all out-of-tree.  Supporting out-of-tree users
> > has never been justification for merging in-kernel code (or for blocking
> > it from being merged). 
> 
> huh? perf is the only in-tree user space project.
> All others tools and libraries are out-of-tree and that makes sense.
> 

What about all the other things under tools/?

> Actually would be great to merge bcc with perf eventually, but choice
> of C++ isn't going to make it easy. The only real difference
> between perf+bpf and bcc is that bcc integrates clang/llvm
> as a library whereas perf+bpf deals with elf files and standalone compiler.
> There are pros and cons for both and it's great that both are actively
> growing and gaining user traction.
> 

Why worry about merging bcc with perf?  Why not a tools/bcc?

> > I have systems with tiny amounts of memory and storage that have zero
> > chance of ever having a compiler or Python running on them.  It's
> 
> if your system is so short on memory, then you don't want to bloat
> the kernel with histtriggers especially since they're not
> going to be used 24/7 due to the overhead.
> 
> > I haven't measured the overhead of the cost of accessing data from the
> > trace buffers, but I do know that the hist triggers have no problem
> > logging millions of events per second.
> > 
> > In the past I have measured the basic hist triggers mechanism and found
> > it to be somewhat faster than the ftrace function tracer itself:
> > 
> >     over a kernel compile:
> > 
> >     no tracing:
> > 
> >       real   110m47.817s
> >       user    99m34.144s
> >       sys      7m19.928s
> ... 
> >     function_hist tracer enabled:
> >       real   128m44.682s
> >       user   100m29.080s
> >       sys     26m52.880s
> 
> 78% of cpu time is in user space. Not a great test of kernel
> datapath, but 'sys 7m19.928s vs 26m52.880s' actually means that
> the kernel part is 3 times slower. That is your enormous overhead.
> 

Like I said, I didn't drill down on any measurements to see which part
was due to hist triggers code vs the ftrace calls - my point was only to
show that the hist triggers are actually faster than the function
tracer.

I guess that anything that does useful work on every single function
call is going to add some measurable amount of overhead.  I'm still
interested in seeing what that would be for eBPF.

> 2 hours to compile the kernel. ouch. that must be very low end device.
> For comparison full kernel build on my box:
> real    2m49.693s
> user    66m44.204s
> sys     5m29.257s
> 
> > One point I would make about this though is that while it might be
> > slower to access this particular field that way, the user who's just
> > trying to get something done doesn't need to know about
> > bpf_get_current_pid_tgid() and can just look at the available fields in
> > the trace event format file and use them directly - trading off
> > efficiency for ease-of-use.
> 
> sorry, but nack for such 'ease-of-use'.
> We're not going to sacrifice performance even if there are few raw edges
> in the user space. User tools can be improved, new compilers and
> front-ends written, but kernel API will stay fixed and must be fast
> from the start.
> 

OK, so it sounds like eBPF will never support access to trace events.
Good to know, that officially makes our work completely orthogonal.

> > surrounding that even in the comments.  I guess I'd have to spend a few
> > hours reading the BPF code and the verifier even, to understand that.
> 
> not sure what is your goal. Runtime lookup via field name is not acceptable
> whether it's cached or not. There is no place for strcmp in the critical path.
> 

Exactly - that's why I was asking about a 'begin probe', in order to do
the lookup once, in a non-critical path.

> > > please cc netdev every time kernel/bpf/ is touched.
> > > 
> > 
> > Why netdev?  This has nothing to do with networking.
> 
> because that's what MAINTAINERS file says.
> kernel/bpf/* is used for both tracing and networking and all significant
> changes should be going through net-next to avoid conflicts and
> to make sure that active developers can do a thorough review.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH 00/10] Add trace event support to eBPF
  2016-02-18 21:27       ` Tom Zanussi
@ 2016-02-18 22:40         ` Daniel Borkmann
  2016-02-19  4:16         ` Alexei Starovoitov
  1 sibling, 0 replies; 17+ messages in thread
From: Daniel Borkmann @ 2016-02-18 22:40 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Alexei Starovoitov, rostedt, masami.hiramatsu.pt, namhyung,
	peterz, linux-kernel, netdev

On 02/18/2016 10:27 PM, Tom Zanussi wrote:
> On Tue, 2016-02-16 at 20:51 -0800, Alexei Starovoitov wrote:
>> On Tue, Feb 16, 2016 at 04:35:27PM -0600, Tom Zanussi wrote:
>>> On Sun, 2016-02-14 at 01:02 +0100, Alexei Starovoitov wrote:
[...]
>>>> Take a look at all the tools written on top of it:
>>>> https://github.com/iovisor/bcc/tree/master/tools
>>>
>>> That's great, but it's all out-of-tree.  Supporting out-of-tree users
>>> has never been justification for merging in-kernel code (or for blocking
>>> it from being merged).
>>
>> huh? perf is the only in-tree user space project.
>> All others tools and libraries are out-of-tree and that makes sense.
>
> What about all the other things under tools/?
>
>> Actually would be great to merge bcc with perf eventually, but choice
>> of C++ isn't going to make it easy. The only real difference
>> between perf+bpf and bcc is that bcc integrates clang/llvm
>> as a library whereas perf+bpf deals with elf files and standalone compiler.
>> There are pros and cons for both and it's great that both are actively
>> growing and gaining user traction.
>
> Why worry about merging bcc with perf?  Why not a tools/bcc?

It would indeed be great to mid-term have bcc internals to some degree
merged (or rewritten) into perf. tools/bcc doesn't make much sense to me,
as it really should be perf, where the rest of the eBPF front-end logic
that Wang et al initially integrated already resides. So efforts could
consolidate from that side.

The user could be given a choice whether his use-case is to load the
object file, or directly pass in a C file where perf does the rest. And
f.e. Brendan's tools could ship natively as perf "built-ins" that users
can go and try out from there directly and/or use the code as a starting
point for their own proglets.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC][PATCH 00/10] Add trace event support to eBPF
  2016-02-18 21:27       ` Tom Zanussi
  2016-02-18 22:40         ` Daniel Borkmann
@ 2016-02-19  4:16         ` Alexei Starovoitov
  1 sibling, 0 replies; 17+ messages in thread
From: Alexei Starovoitov @ 2016-02-19  4:16 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: rostedt, masami.hiramatsu.pt, namhyung, peterz, linux-kernel,
	Daniel Borkmann, netdev

On Thu, Feb 18, 2016 at 03:27:18PM -0600, Tom Zanussi wrote:
> On Tue, 2016-02-16 at 20:51 -0800, Alexei Starovoitov wrote:
> > On Tue, Feb 16, 2016 at 04:35:27PM -0600, Tom Zanussi wrote:
> > > On Sun, 2016-02-14 at 01:02 +0100, Alexei Starovoitov wrote:
> > > > On Fri, Feb 12, 2016 at 10:11:18AM -0600, Tom Zanussi wrote:
> > 
> 
>   # ./funccount.py '*spin*'
> 
> Which on my machine resulted in a hard lockup on all CPUs.  I'm not set

thanks for the report. looks like something got broken. After:
# ./funccount.par '*spin*'
Tracing 12 functions for "*spin*"... Hit Ctrl-C to end.
^C
ADDR             FUNC                          COUNT
ffffffff810aeb91 mutex_spin_on_owner             530
ffffffff8177f241 _raw_spin_unlock_bh            1325
ffffffff810aebe1 mutex_optimistic_spin          1696
ffffffff8177f581 _raw_spin_lock_bh              1985
ffffffff8177f511 _raw_spin_trylock             55337
ffffffff8177f3c1 _raw_spin_lock_irq           787875
ffffffff8177f551 _raw_spin_lock              2211324
ffffffff8177f331 _raw_spin_lock_irqsave      3556740
ffffffff8177f1c1 __lock_text_start           3593983
[  275.175524] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 11
it seems kprobe cleanup is racing with bpf cleanup...

> > > surrounding that even in the comments.  I guess I'd have to spend a few
> > > hours reading the BPF code and the verifier even, to understand that.
> > 
> > not sure what is your goal. Runtime lookup via field name is not acceptable
> > whether it's cached or not. There is no place for strcmp in the critical path.
> 
> Exactly - that's why I was asking about a 'begin probe', in order to do
> the lookup once, in an non-critical path.

It is a critical path. When a program is called a million times, a for_each{strcmp}
at the beginning of every program run is unacceptable overhead.
In the crash above, Ctrl-C was pressed in a split second, yet bpf
already processed 2.2M + 3.5M + 3.5M events and then hung while unloading.
In the upcoming tracepoint+bpf patches the programs will have
direct access to tracepoint data without wasting time on strcmp.
The steps to do that were already outlined in the previous email.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-02-19  4:16 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-12 16:11 [RFC][PATCH 00/10] Add trace event support to eBPF Tom Zanussi
2016-02-12 16:11 ` [RFC][PATCH 01/10] tracing: Move some constants to ring_buffer.h Tom Zanussi
2016-02-12 16:11 ` [RFC][PATCH 02/10] eBPF: Add BPF_PROG_TYPE_TRACE_EVENT prog type Tom Zanussi
2016-02-12 16:11 ` [RFC][PATCH 03/10] tracing: Add an 'accessor' function to ftrace_event_field Tom Zanussi
2016-02-12 16:11 ` [RFC][PATCH 04/10] tracing: Add trace event accessor functions Tom Zanussi
2016-02-12 16:11 ` [RFC][PATCH 05/10] eBPF/tracing: Add eBPF trace event field access helpers Tom Zanussi
2016-02-12 16:11 ` [RFC][PATCH 06/10] tracing: Add kprobe/uprobe support for TRACE_EVENT eBPF progs Tom Zanussi
2016-02-12 16:11 ` [RFC][PATCH 07/10] tracing: Add eBPF program support to static trace events Tom Zanussi
2016-02-12 16:11 ` [RFC][PATCH 08/10] samples/bpf: Add support for trace events to the eBPF loading code Tom Zanussi
2016-02-12 16:11 ` [RFC][PATCH 09/10] samples/bpf: Add readcounts-by-pid example Tom Zanussi
2016-02-12 16:11 ` [RFC][PATCH 10/10] samples/bpf: Add kprobe-event-fields example Tom Zanussi
2016-02-14  0:02 ` [RFC][PATCH 00/10] Add trace event support to eBPF Alexei Starovoitov
2016-02-16 22:35   ` Tom Zanussi
2016-02-17  4:51     ` Alexei Starovoitov
2016-02-18 21:27       ` Tom Zanussi
2016-02-18 22:40         ` Daniel Borkmann
2016-02-19  4:16         ` Alexei Starovoitov
