* [PATCH v2 00/15] tracing: 'hist' triggers
@ 2015-03-02 16:00 Tom Zanussi
  2015-03-02 16:00 ` [PATCH 01/15] tracing: Make ftrace_event_field checking functions available Tom Zanussi
                   ` (15 more replies)
  0 siblings, 16 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:00 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

This is v2 of my previously posted 'hashtriggers' patchset [1], but
renamed to 'hist triggers' following feedback from v1.

Since then, the kernel has gained a tracing map implementation in the
form of bpf_map, which this patchset makes a bit more generic, exports
and uses (as tracing_map_*, still in the bpf syscall file however).

A large part of the initial hash triggers implementation was devoted
to a map implementation and general-purpose hashing functions, which
have now been subsumed by the bpf maps.  I've completely redone the
trigger patches themselves to work on top of tracing_map.  The result
is a much simpler and easier-to-review patchset that's able to focus
more directly on the problem at hand.

The new version addresses all the comments from the previous review,
including changing the name from hash->hist, adding separate 'hist'
files for the output, and moving the examples into Documentation.

This patchset also includes a couple of other new and related
triggers, enable_hist and disable_hist.  These are very similar to the
existing enable_event/disable_event triggers, which automatically
enable and disable events based on a triggering condition, but in this
case they allow hist triggers to be enabled and disabled in the same
way.

The only problem with using the bpf_map implementation for this is
that it uses kmalloc internally, which causes problems when trying to
trace kmalloc itself.  I'm guessing the ebpf tracing code would also
share this problem e.g. when using bpf_maps from probes on kmalloc().
This patchset attempts a solution to that problem (by adding a gfp
flag and changing the kmem memory allocation tracepoints to
conditional variants that check for it), but I'm not sure it's the
best way to address it.
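
To sketch the idea (hypothetical handler code, not from this
patchset; it uses the _notrace helpers added later in the series): an
allocation made from within a kmalloc tracepoint handler is tagged so
that the now-conditional tracepoint skips it, breaking the recursion:

  struct my_rec {
          size_t bytes;
  };

  /* hypothetical kmalloc tracepoint handler: its own allocation is
   * flagged ___GFP_NOTRACE (via kmalloc_notrace()), so the
   * conditional kmalloc tracepoint skips it rather than recursing
   * back into this handler */
  static void probe_kmalloc(unsigned long call_site, const void *ptr,
                            size_t bytes_req, size_t bytes_alloc,
                            gfp_t gfp_flags)
  {
          struct my_rec *rec = kmalloc_notrace(sizeof(*rec), GFP_ATOMIC);

          if (!rec)
                  return;

          rec->bytes = bytes_alloc;
          /* ... stash rec in a map keyed on call_site, etc ... */
  }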

There are a couple of important bits of functionality that were
present in v1 but dropped in v2 mainly because I'm still trying to
figure out the best way to accomplish those things using the bpf_map
implementation.

The first is support for compound keys.  Currently, maps can only be
keyed on a single event field, whereas in v1 they could be keyed on
multiple fields.  With support for compound keys, you can create much
more interesting output, such as per-pid lists of syscalls or read
counts, e.g.:

  # echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount' > \
        /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger

  # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist

  key: common_pid:bash[3112], id:sys_write		       vals: count:69
  key: common_pid:bash[3112], id:sys_rt_sigprocmask	       vals: count:218

  key: common_pid:update-notifier[3164], id:sys_poll	       vals: count:37
  key: common_pid:update-notifier[3164], id:sys_recvfrom       vals: count:118

  key: common_pid:deja-dup-monito[3194], id:sys_sendto	       vals: count:1
  key: common_pid:deja-dup-monito[3194], id:sys_read	       vals: count:4
  key: common_pid:deja-dup-monito[3194], id:sys_poll	       vals: count:8
  key: common_pid:deja-dup-monito[3194], id:sys_recvmsg	       vals: count:8
  key: common_pid:deja-dup-monito[3194], id:sys_getegid	       vals: count:8

  key: common_pid:emacs[3275], id:sys_fsync		       vals: count:1
  key: common_pid:emacs[3275], id:sys_open		       vals: count:1
  key: common_pid:emacs[3275], id:sys_symlink		       vals: count:2
  key: common_pid:emacs[3275], id:sys_poll		       vals: count:23
  key: common_pid:emacs[3275], id:sys_select		       vals: count:23
  key: common_pid:emacs[3275], id:unknown_syscall	       vals: count:34
  key: common_pid:emacs[3275], id:sys_ioctl		       vals: count:60
  key: common_pid:emacs[3275], id:sys_rt_sigprocmask	       vals: count:116

  key: common_pid:cat[3323], id:sys_munmap		       vals: count:1
  key: common_pid:cat[3323], id:sys_fadvise64		       vals: count:1

Related to that is support for sorting on multiple fields.
Currently, you can sort using only a primary key.  Being able to sort
on multiple keys, or at least a secondary key, is indispensable for
seeing trends when displaying multiple values.

[1] http://thread.gmane.org/gmane.linux.kernel/1673551

Changes from v1:
 - completely rewritten on top of tracing_map (renamed and exported bpf_map)
 - added map clearing and client ops to tracing_map
 - changed the name from 'hash' triggers to 'hist' triggers
 - added new trigger 'pause' feature
 - added new enable_hist and disable_hist triggers
 - added usage for hist/enable_hist/disable_hist to tracing/README
 - moved examples into Documentation/trace/events.txt
 - added ___GFP_NOTRACE, kmalloc/kfree macros, and conditional kmem tracepoints

The following changes since commit 49058038a12cfd9044146a1bf4b286781268d5c9:

  ring-buffer: Do not wake up a splice waiter when page is not full (2015-02-24 14:00:41 -0600)

are available in the git repository at:

  git://git.yoctoproject.org/linux-yocto-contrib.git tzanussi/hist-triggers-v2
  http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/log/?h=tzanussi/hist-triggers-v2

Tom Zanussi (15):
  tracing: Make ftrace_event_field checking functions available
  tracing: Add event record param to trigger_ops.func()
  tracing: Add get_syscall_name()
  bpf: Export bpf map functionality as tracing_map_*
  bpf: Export a map-clearing function
  bpf: Add tracing_map client ops
  mm: Add ___GFP_NOTRACE
  tracing: Make kmem memory allocation tracepoints conditional
  tracing: Add kmalloc/kfree macros
  bpf: Make tracing_map use kmalloc/kfree_notrace()
  tracing: Add a per-event-trigger 'paused' field
  tracing: Add 'hist' event trigger command
  tracing: Add sorting to hist triggers
  tracing: Add enable_hist/disable_hist triggers
  tracing: Add 'hist' trigger Documentation

 Documentation/trace/events.txt      |  870 +++++++++++++++++++++
 include/linux/bpf.h                 |   15 +
 include/linux/ftrace_event.h        |    9 +-
 include/linux/gfp.h                 |    3 +-
 include/linux/slab.h                |   61 +-
 include/trace/events/kmem.h         |   28 +-
 kernel/bpf/arraymap.c               |   16 +
 kernel/bpf/hashtab.c                |   39 +-
 kernel/bpf/syscall.c                |  193 ++++-
 kernel/trace/trace.c                |   48 ++
 kernel/trace/trace.h                |   25 +-
 kernel/trace/trace_events.c         |    3 +
 kernel/trace/trace_events_filter.c  |   15 +-
 kernel/trace/trace_events_trigger.c | 1466 ++++++++++++++++++++++++++++++++++-
 kernel/trace/trace_syscalls.c       |   11 +
 mm/slab.c                           |   45 +-
 mm/slob.c                           |   45 +-
 mm/slub.c                           |   47 +-
 18 files changed, 2795 insertions(+), 144 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 01/15] tracing: Make ftrace_event_field checking functions available
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
@ 2015-03-02 16:00 ` Tom Zanussi
  2015-03-02 16:00 ` [PATCH 02/15] tracing: Add event record param to trigger_ops.func() Tom Zanussi
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:00 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

Make is_string_field() and is_function_field() accessible outside of
trace_events_filter.c for other users of ftrace_event_fields.
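
For instance, a hist trigger (a hypothetical caller here; 'file' is
assumed to be the event's ftrace_event_file) could check a field's
type while parsing its keys:

  struct ftrace_event_field *field;
  unsigned int key_len = 0;

  field = trace_find_event_field(file->event_call, "filename");
  if (field && is_string_field(field))
          key_len = MAX_FILTER_STR_VAL;   /* key is the string contents */
  else if (field)
          key_len = field->size;          /* key is the raw field bytes */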

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 kernel/trace/trace.h               | 12 ++++++++++++
 kernel/trace/trace_events_filter.c | 12 ------------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index d951ded..6177486 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1050,6 +1050,18 @@ struct filter_pred {
 	unsigned short		right;
 };
 
+static inline bool is_string_field(struct ftrace_event_field *field)
+{
+	return field->filter_type == FILTER_DYN_STRING ||
+	       field->filter_type == FILTER_STATIC_STRING ||
+	       field->filter_type == FILTER_PTR_STRING;
+}
+
+static inline bool is_function_field(struct ftrace_event_field *field)
+{
+	return field->filter_type == FILTER_TRACE_FN;
+}
+
 extern enum regex_type
 filter_parse_regex(char *buff, int len, char **search, int *not);
 extern void print_event_filter(struct ftrace_event_file *file,
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index ced69da..aa8ab85 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -917,18 +917,6 @@ int filter_assign_type(const char *type)
 	return FILTER_OTHER;
 }
 
-static bool is_function_field(struct ftrace_event_field *field)
-{
-	return field->filter_type == FILTER_TRACE_FN;
-}
-
-static bool is_string_field(struct ftrace_event_field *field)
-{
-	return field->filter_type == FILTER_DYN_STRING ||
-	       field->filter_type == FILTER_STATIC_STRING ||
-	       field->filter_type == FILTER_PTR_STRING;
-}
-
 static int is_legal_op(struct ftrace_event_field *field, int op)
 {
 	if (is_string_field(field) &&
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 02/15] tracing: Add event record param to trigger_ops.func()
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
  2015-03-02 16:00 ` [PATCH 01/15] tracing: Make ftrace_event_field checking functions available Tom Zanussi
@ 2015-03-02 16:00 ` Tom Zanussi
  2015-03-02 16:00 ` [PATCH 03/15] tracing: Add get_syscall_name() Tom Zanussi
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:00 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

Some triggers may need access to the trace event record, so pass it
in.  Also fix up the existing trigger funcs and their callers.
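
As a sketch of what this enables (a hypothetical trigger, not part of
this patch), a trigger function can now look inside the record that
fired it:

  static void
  example_trigger(struct event_trigger_data *data, void *rec)
  {
          struct trace_entry *entry = rec;

          /* rec can be NULL, e.g. when no record is available */
          if (!entry)
                  return;

          pr_debug("triggered by pid %d\n", entry->pid);
  }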

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/linux/ftrace_event.h        |  7 ++++---
 kernel/trace/trace.h                |  6 ++++--
 kernel/trace/trace_events_trigger.c | 35 ++++++++++++++++++-----------------
 3 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 5aa4a92..c967526 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -429,7 +429,8 @@ extern int call_filter_check_discard(struct ftrace_event_call *call, void *rec,
 extern enum event_trigger_type event_triggers_call(struct ftrace_event_file *file,
 						   void *rec);
 extern void event_triggers_post_call(struct ftrace_event_file *file,
-				     enum event_trigger_type tt);
+				     enum event_trigger_type tt,
+				     void *rec);
 
 /**
  * ftrace_trigger_soft_disabled - do triggers and test if soft disabled
@@ -512,7 +513,7 @@ event_trigger_unlock_commit(struct ftrace_event_file *file,
 		trace_buffer_unlock_commit(buffer, event, irq_flags, pc);
 
 	if (tt)
-		event_triggers_post_call(file, tt);
+		event_triggers_post_call(file, tt, entry);
 }
 
 /**
@@ -545,7 +546,7 @@ event_trigger_unlock_commit_regs(struct ftrace_event_file *file,
 						irq_flags, pc, regs);
 
 	if (tt)
-		event_triggers_post_call(file, tt);
+		event_triggers_post_call(file, tt, entry);
 }
 
 enum {
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 6177486..89dd87a 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1125,7 +1125,8 @@ struct event_trigger_data {
  * @func: The trigger 'probe' function called when the triggering
  *	event occurs.  The data passed into this callback is the data
  *	that was supplied to the event_command @reg() function that
- *	registered the trigger (see struct event_command).
+ *	registered the trigger (see struct event_command) along with
+ *	the trace record, rec.
  *
  * @init: An optional initialization function called for the trigger
  *	when the trigger is registered (via the event_command reg()
@@ -1150,7 +1151,8 @@ struct event_trigger_data {
  *	(see trace_event_triggers.c).
  */
 struct event_trigger_ops {
-	void			(*func)(struct event_trigger_data *data);
+	void			(*func)(struct event_trigger_data *data,
+					void *rec);
 	int			(*init)(struct event_trigger_ops *ops,
 					struct event_trigger_data *data);
 	void			(*free)(struct event_trigger_ops *ops,
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index 8712df9..a8dfd4e 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -74,7 +74,7 @@ event_triggers_call(struct ftrace_event_file *file, void *rec)
 
 	list_for_each_entry_rcu(data, &file->triggers, list) {
 		if (!rec) {
-			data->ops->func(data);
+			data->ops->func(data, rec);
 			continue;
 		}
 		filter = rcu_dereference_sched(data->filter);
@@ -84,7 +84,7 @@ event_triggers_call(struct ftrace_event_file *file, void *rec)
 			tt |= data->cmd_ops->trigger_type;
 			continue;
 		}
-		data->ops->func(data);
+		data->ops->func(data, rec);
 	}
 	return tt;
 }
@@ -104,13 +104,14 @@ EXPORT_SYMBOL_GPL(event_triggers_call);
  */
 void
 event_triggers_post_call(struct ftrace_event_file *file,
-			 enum event_trigger_type tt)
+			 enum event_trigger_type tt,
+			 void *rec)
 {
 	struct event_trigger_data *data;
 
 	list_for_each_entry_rcu(data, &file->triggers, list) {
 		if (data->cmd_ops->trigger_type & tt)
-			data->ops->func(data);
+			data->ops->func(data, rec);
 	}
 }
 EXPORT_SYMBOL_GPL(event_triggers_post_call);
@@ -751,7 +752,7 @@ static int set_trigger_filter(char *filter_str,
 }
 
 static void
-traceon_trigger(struct event_trigger_data *data)
+traceon_trigger(struct event_trigger_data *data, void *rec)
 {
 	if (tracing_is_on())
 		return;
@@ -760,7 +761,7 @@ traceon_trigger(struct event_trigger_data *data)
 }
 
 static void
-traceon_count_trigger(struct event_trigger_data *data)
+traceon_count_trigger(struct event_trigger_data *data, void *rec)
 {
 	if (tracing_is_on())
 		return;
@@ -775,7 +776,7 @@ traceon_count_trigger(struct event_trigger_data *data)
 }
 
 static void
-traceoff_trigger(struct event_trigger_data *data)
+traceoff_trigger(struct event_trigger_data *data, void *rec)
 {
 	if (!tracing_is_on())
 		return;
@@ -784,7 +785,7 @@ traceoff_trigger(struct event_trigger_data *data)
 }
 
 static void
-traceoff_count_trigger(struct event_trigger_data *data)
+traceoff_count_trigger(struct event_trigger_data *data, void *rec)
 {
 	if (!tracing_is_on())
 		return;
@@ -880,13 +881,13 @@ static struct event_command trigger_traceoff_cmd = {
 
 #ifdef CONFIG_TRACER_SNAPSHOT
 static void
-snapshot_trigger(struct event_trigger_data *data)
+snapshot_trigger(struct event_trigger_data *data, void *rec)
 {
 	tracing_snapshot();
 }
 
 static void
-snapshot_count_trigger(struct event_trigger_data *data)
+snapshot_count_trigger(struct event_trigger_data *data, void *rec)
 {
 	if (!data->count)
 		return;
@@ -894,7 +895,7 @@ snapshot_count_trigger(struct event_trigger_data *data)
 	if (data->count != -1)
 		(data->count)--;
 
-	snapshot_trigger(data);
+	snapshot_trigger(data, rec);
 }
 
 static int
@@ -973,13 +974,13 @@ static __init int register_trigger_snapshot_cmd(void) { return 0; }
 #define STACK_SKIP 3
 
 static void
-stacktrace_trigger(struct event_trigger_data *data)
+stacktrace_trigger(struct event_trigger_data *data, void *rec)
 {
 	trace_dump_stack(STACK_SKIP);
 }
 
 static void
-stacktrace_count_trigger(struct event_trigger_data *data)
+stacktrace_count_trigger(struct event_trigger_data *data, void *rec)
 {
 	if (!data->count)
 		return;
@@ -987,7 +988,7 @@ stacktrace_count_trigger(struct event_trigger_data *data)
 	if (data->count != -1)
 		(data->count)--;
 
-	stacktrace_trigger(data);
+	stacktrace_trigger(data, rec);
 }
 
 static int
@@ -1058,7 +1059,7 @@ struct enable_trigger_data {
 };
 
 static void
-event_enable_trigger(struct event_trigger_data *data)
+event_enable_trigger(struct event_trigger_data *data, void *rec)
 {
 	struct enable_trigger_data *enable_data = data->private_data;
 
@@ -1069,7 +1070,7 @@ event_enable_trigger(struct event_trigger_data *data)
 }
 
 static void
-event_enable_count_trigger(struct event_trigger_data *data)
+event_enable_count_trigger(struct event_trigger_data *data, void *rec)
 {
 	struct enable_trigger_data *enable_data = data->private_data;
 
@@ -1083,7 +1084,7 @@ event_enable_count_trigger(struct event_trigger_data *data)
 	if (data->count != -1)
 		(data->count)--;
 
-	event_enable_trigger(data);
+	event_enable_trigger(data, rec);
 }
 
 static int
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 03/15] tracing: Add get_syscall_name()
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
  2015-03-02 16:00 ` [PATCH 01/15] tracing: Make ftrace_event_field checking functions available Tom Zanussi
  2015-03-02 16:00 ` [PATCH 02/15] tracing: Add event record param to trigger_ops.func() Tom Zanussi
@ 2015-03-02 16:00 ` Tom Zanussi
  2015-03-02 16:00 ` [PATCH 04/15] bpf: Export bpf map functionality as tracing_map_* Tom Zanussi
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:00 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

Add a utility function to grab the syscall name from the syscall
metadata, given a syscall id.
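
For example (a hypothetical caller, along the lines of what the hist
trigger output code needs):

  static void print_syscall_id(struct seq_file *m, int syscall_id)
  {
          /* ids without syscall metadata map to NULL */
          const char *name = get_syscall_name(syscall_id);

          seq_printf(m, "id:%s", name ? name : "unknown_syscall");
  }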

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 kernel/trace/trace.h          |  5 +++++
 kernel/trace/trace_syscalls.c | 11 +++++++++++
 2 files changed, 16 insertions(+)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 89dd87a..3d38c2e 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1317,8 +1317,13 @@ int perf_ftrace_event_register(struct ftrace_event_call *call,
 
 #ifdef CONFIG_FTRACE_SYSCALLS
 void init_ftrace_syscalls(void);
+const char *get_syscall_name(int syscall);
 #else
 static inline void init_ftrace_syscalls(void) { }
+static inline const char *get_syscall_name(int syscall)
+{
+	return NULL;
+}
 #endif
 
 #ifdef CONFIG_EVENT_TRACING
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index c6ee36f..35dfcf3 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -106,6 +106,17 @@ static struct syscall_metadata *syscall_nr_to_meta(int nr)
 	return syscalls_metadata[nr];
 }
 
+const char *get_syscall_name(int syscall)
+{
+	struct syscall_metadata *entry;
+
+	entry = syscall_nr_to_meta(syscall);
+	if (!entry)
+		return NULL;
+
+	return entry->name;
+}
+
 static enum print_line_t
 print_syscall_enter(struct trace_iterator *iter, int flags,
 		    struct trace_event *event)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 04/15] bpf: Export bpf map functionality as tracing_map_*
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (2 preceding siblings ...)
  2015-03-02 16:00 ` [PATCH 03/15] tracing: Add get_syscall_name() Tom Zanussi
@ 2015-03-02 16:00 ` Tom Zanussi
  2015-03-02 16:00 ` [PATCH 05/15] bpf: Export a map-clearing function Tom Zanussi
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:00 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

Make the bpf map implementation available outside of kernel/bpf - as a
general-purpose map implementation for tracing, it should be available
to any tracing facility that can make use of it, not just the bpf
syscall.

Create a set of exported tracing_map_* functions and have the bpf
implementation use them.

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/linux/bpf.h  |   7 +++
 kernel/bpf/syscall.c | 173 ++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 163 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bbfceb7..900405bf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -142,4 +142,11 @@ extern struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern struct bpf_func_proto bpf_map_update_elem_proto;
 extern struct bpf_func_proto bpf_map_delete_elem_proto;
 
+struct bpf_map *tracing_map_create(union bpf_attr *attr);
+void tracing_map_destroy(struct bpf_map *map);
+int tracing_map_update_elem(struct bpf_map *map, void *key, void *value,
+			    union bpf_attr *attr);
+int tracing_map_lookup_elem(struct bpf_map *map, void *key, void *uvalue);
+int tracing_map_delete_elem(struct bpf_map *map, void *key);
+int tracing_map_get_next_key(struct bpf_map *map, void *key, void *next_key);
 #endif /* _LINUX_BPF_H */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 088ac0b..cac8df6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -63,11 +63,24 @@ void bpf_map_put(struct bpf_map *map)
 	}
 }
 
+/**
+ * tracing_map_destroy - destroy (release) a bpf_map instance
+ * @map: The bpf_map to destroy
+ *
+ * Release a bpf_map.  This decrements the map's refcount and
+ * schedules it for deletion.  For all intents and purposes, the user
+ * can assume the map has been deleted after this call.
+ */
+void tracing_map_destroy(struct bpf_map *map)
+{
+	bpf_map_put(map);
+}
+
 static int bpf_map_release(struct inode *inode, struct file *filp)
 {
 	struct bpf_map *map = filp->private_data;
 
-	bpf_map_put(map);
+	tracing_map_destroy(map);
 	return 0;
 }
 
@@ -84,6 +97,32 @@ static const struct file_operations bpf_map_fops = {
 		   sizeof(attr->CMD##_LAST_FIELD)) != NULL
 
 #define BPF_MAP_CREATE_LAST_FIELD max_entries
+
+/**
+ * tracing_map_create - Create a bpf_map instance
+ * @attr: The bpf_attr for the operation
+ *
+ * Create a bpf_map of the type specified by attr.
+ *
+ * Note that it's assumed that unused attr fields have been zeroed and
+ * that kernel code can be trusted to have done so before calling this
+ * function.
+ *
+ * Return: the map created on success, ERR_PTR otherwise
+ */
+struct bpf_map *tracing_map_create(union bpf_attr *attr)
+{
+	struct bpf_map *map;
+
+	/* find map type and init map: hashtable vs rbtree vs bloom vs ... */
+	map = find_and_alloc_map(attr);
+	if (!IS_ERR(map))
+		atomic_set(&map->refcnt, 1);
+
+	return map;
+}
+EXPORT_SYMBOL_GPL(tracing_map_create);
+
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
 {
@@ -94,13 +133,10 @@ static int map_create(union bpf_attr *attr)
 	if (err)
 		return -EINVAL;
 
-	/* find map type and init map: hashtable vs rbtree vs bloom vs ... */
-	map = find_and_alloc_map(attr);
+	map = tracing_map_create(attr);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
 
-	atomic_set(&map->refcnt, 1);
-
 	err = anon_inode_getfd("bpf-map", &bpf_map_fops, map, O_RDWR | O_CLOEXEC);
 
 	if (err < 0)
@@ -140,6 +176,40 @@ static void __user *u64_to_ptr(__u64 val)
 	return (void __user *) (unsigned long) val;
 }
 
+/**
+ * tracing_map_lookup_elem - Perform a bpf_map lookup
+ * @map: The bpf_map to perform the lookup on
+ * @key: The key to look up
+ * @uvalue: The caller-owned memory to be filled in with the found value
+ *
+ * Look up the specified key in the bpf_map and if found copy the
+ * value into the uvalue specified by the caller.  The uvalue must be
+ * at least as large as the map->value_size.
+ *
+ * Return: 0 on success, -ENOENT if key not found
+ */
+int tracing_map_lookup_elem(struct bpf_map *map, void *key, void *uvalue)
+{
+	int err = -ENOENT;
+	void *value;
+
+	rcu_read_lock();
+
+	value = map->ops->map_lookup_elem(map, key);
+	if (!value)
+		goto err_unlock;
+
+	memcpy(uvalue, value, map->value_size);
+
+	err = 0;
+
+err_unlock:
+	rcu_read_unlock();
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(tracing_map_lookup_elem);
+
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_MAP_LOOKUP_ELEM_LAST_FIELD value
 
@@ -190,6 +260,39 @@ err_put:
 	return err;
 }
 
+/**
+ * tracing_map_update_elem - Update or add a bpf_map element
+ * @map: The bpf_map to perform the lookup on
+ * @key: The key to update or add
+ * @value: The value to update or add
+ * @attr: The bpf_attr for the operation
+ *
+ * Update or add a value for the specified key in the bpf_map.  Both
+ * the key and value are user-owned memory, copies of which will be
+ * added to the map.
+ *
+ * Note that it's assumed that unused attr fields have been zeroed and
+ * that kernel code can be trusted to have done so before calling this
+ * function.
+ *
+ * Return: 0 on success, errno otherwise
+ */
+int tracing_map_update_elem(struct bpf_map *map, void *key, void *value,
+			    union bpf_attr *attr)
+{
+	int err;
+
+	/* eBPF program that use maps are running under rcu_read_lock(),
+	 * therefore all map accessors rely on this fact, so do the same here
+	 */
+	rcu_read_lock();
+	err = map->ops->map_update_elem(map, key, value, attr->flags);
+	rcu_read_unlock();
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(tracing_map_update_elem);
+
 #define BPF_MAP_UPDATE_ELEM_LAST_FIELD flags
 
 static int map_update_elem(union bpf_attr *attr)
@@ -227,12 +330,7 @@ static int map_update_elem(union bpf_attr *attr)
 	if (copy_from_user(value, uvalue, map->value_size) != 0)
 		goto free_value;
 
-	/* eBPF program that use maps are running under rcu_read_lock(),
-	 * therefore all map accessors rely on this fact, so do the same here
-	 */
-	rcu_read_lock();
-	err = map->ops->map_update_elem(map, key, value, attr->flags);
-	rcu_read_unlock();
+	err = tracing_map_update_elem(map, key, value, attr);
 
 free_value:
 	kfree(value);
@@ -243,6 +341,27 @@ err_put:
 	return err;
 }
 
+/**
+ * tracing_map_delete_elem - Delete an element from a bpf_map
+ * @map: The bpf_map to perform the lookup on
+ * @key: The key to delete
+ *
+ * Delete the bpf_map element with the specified key.
+ *
+ * Return: 0 on success, -ENOENT if key not found
+ */
+int tracing_map_delete_elem(struct bpf_map *map, void *key)
+{
+	int err;
+
+	rcu_read_lock();
+	err = map->ops->map_delete_elem(map, key);
+	rcu_read_unlock();
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(tracing_map_delete_elem);
+
 #define BPF_MAP_DELETE_ELEM_LAST_FIELD key
 
 static int map_delete_elem(union bpf_attr *attr)
@@ -270,9 +389,7 @@ static int map_delete_elem(union bpf_attr *attr)
 	if (copy_from_user(key, ukey, map->key_size) != 0)
 		goto free_key;
 
-	rcu_read_lock();
-	err = map->ops->map_delete_elem(map, key);
-	rcu_read_unlock();
+	err = tracing_map_delete_elem(map, key);
 
 free_key:
 	kfree(key);
@@ -281,6 +398,30 @@ err_put:
 	return err;
 }
 
+/**
+ * tracing_map_get_next_key - Get the next key in the bpf_map
+ * @map: The bpf_map to perform the lookup on
+ * @key: The key to look up
+ * @next_key: The caller-owned memory to be filled in with the next key
+ *
+ * Look up the specified key in the bpf_map and if found copy the next
+ * key into the next_key specified by the caller.  If the key isn't
+ * found, just copy in the first key found.
+ *
+ * Return: 0 on success, -ENOENT if the map is empty
+ */
+int tracing_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	int err;
+
+	rcu_read_lock();
+	err = map->ops->map_get_next_key(map, key, next_key);
+	rcu_read_unlock();
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(tracing_map_get_next_key);
+
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_MAP_GET_NEXT_KEY_LAST_FIELD next_key
 
@@ -315,9 +456,7 @@ static int map_get_next_key(union bpf_attr *attr)
 	if (!next_key)
 		goto free_key;
 
-	rcu_read_lock();
-	err = map->ops->map_get_next_key(map, key, next_key);
-	rcu_read_unlock();
+	err = tracing_map_get_next_key(map, key, next_key);
 	if (err)
 		goto free_next_key;
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 05/15] bpf: Export a map-clearing function
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (3 preceding siblings ...)
  2015-03-02 16:00 ` [PATCH 04/15] bpf: Export bpf map functionality as tracing_map_* Tom Zanussi
@ 2015-03-02 16:00 ` Tom Zanussi
  2015-03-02 16:00 ` [PATCH 06/15] bpf: Add tracing_map client ops Tom Zanussi
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:00 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

Add a new map_clear() function to bpf_map_ops along with a
tracing_map_clear() export for external users.

Map implementations for which clearing makes sense should implement
it; otherwise it's not required.  The bpf hashtab implementation does
implement a clear operation, but since clearing doesn't make sense for
bpf arraymaps, the arraymap implementation doesn't.
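
Usage is then simply (hypothetical client code):

  /* drop all accumulated elements, e.g. when the user clears a
   * hist trigger; a no-op for maps without a map_clear op */
  tracing_map_clear(map);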

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/linux/bpf.h  |  2 ++
 kernel/bpf/hashtab.c |  8 ++++++++
 kernel/bpf/syscall.c | 17 +++++++++++++++++
 3 files changed, 27 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 900405bf..f7f95d7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -18,6 +18,7 @@ struct bpf_map_ops {
 	/* funcs callable from userspace (via syscall) */
 	struct bpf_map *(*map_alloc)(union bpf_attr *attr);
 	void (*map_free)(struct bpf_map *);
+	void (*map_clear)(struct bpf_map *);
 	int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
 
 	/* funcs callable from userspace and from eBPF programs */
@@ -144,6 +145,7 @@ extern struct bpf_func_proto bpf_map_delete_elem_proto;
 
 struct bpf_map *tracing_map_create(union bpf_attr *attr);
 void tracing_map_destroy(struct bpf_map *map);
+void tracing_map_clear(struct bpf_map *map);
 int tracing_map_update_elem(struct bpf_map *map, void *key, void *value,
 			    union bpf_attr *attr);
 int tracing_map_lookup_elem(struct bpf_map *map, void *key, void *uvalue);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index b3ba436..addf3a8 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -325,6 +325,13 @@ static void delete_all_elements(struct bpf_htab *htab)
 	}
 }
 
+static void htab_map_clear(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+	delete_all_elements(htab);
+}
+
 /* Called when map->refcnt goes to zero, either from workqueue or from syscall */
 static void htab_map_free(struct bpf_map *map)
 {
@@ -348,6 +355,7 @@ static void htab_map_free(struct bpf_map *map)
 static struct bpf_map_ops htab_ops = {
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_clear = htab_map_clear,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_map_lookup_elem,
 	.map_update_elem = htab_map_update_elem,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cac8df6..0f28904 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -362,6 +362,23 @@ int tracing_map_delete_elem(struct bpf_map *map, void *key)
 }
 EXPORT_SYMBOL_GPL(tracing_map_delete_elem);
 
+/**
+ * tracing_map_clear - Clear a bpf_map
+ * @map: The bpf_map to clear
+ *
+ * Clear the bpf_map.
+ *
+ * Return: nothing, map clearing always succeeds
+ */
+void tracing_map_clear(struct bpf_map *map)
+{
+	rcu_read_lock();
+	if (map->ops->map_clear)
+		map->ops->map_clear(map);
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(tracing_map_clear);
+
 #define BPF_MAP_DELETE_ELEM_LAST_FIELD key
 
 static int map_delete_elem(union bpf_attr *attr)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 06/15] bpf: Add tracing_map client ops
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (4 preceding siblings ...)
  2015-03-02 16:00 ` [PATCH 05/15] bpf: Export a map-clearing function Tom Zanussi
@ 2015-03-02 16:00 ` Tom Zanussi
  2015-03-02 16:01 ` [PATCH 07/15] mm: Add ___GFP_NOTRACE Tom Zanussi
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:00 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

External users of the tracing_map API may customize their map usage
and need to be notified of certain events occurring during the
lifetime of a map.

One example would be element deletion - if private data has been
associated with an element's value, the client should be notified when
that element is being deleted in order to give it a chance to delete
the private data.

struct bpf_map_client_ops defines a set of client callbacks, a pointer
to an instance of which can be passed by the client to tracing_map_create().

Initially, the only callback defined is free_elem(), which is called
when a map element is freed and gives clients the opportunity to free
any client-specific data associated with it.  Other callbacks, such as
a client-defined sort() function, are anticipated, which is why this
is a set of ops rather than a standalone op.
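
A sketch of how a client might use this (hypothetical client code,
assuming each map value begins with a pointer to client-private
data):

  struct my_priv;       /* client-private per-element data */

  static void my_free_elem(void *value)
  {
          struct my_priv *priv = *(struct my_priv **)value;

          kfree(priv);
  }

  static struct bpf_map_client_ops my_client_ops = {
          .free_elem      = my_free_elem,
  };

  /* then: map = tracing_map_create(&attr, &my_client_ops); */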

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/linux/bpf.h   |  8 +++++++-
 kernel/bpf/arraymap.c | 16 ++++++++++++++++
 kernel/bpf/hashtab.c  | 12 ++++++++++++
 kernel/bpf/syscall.c  |  9 ++++++---
 4 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f7f95d7..09095ec 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -27,6 +27,10 @@ struct bpf_map_ops {
 	int (*map_delete_elem)(struct bpf_map *map, void *key);
 };
 
+struct bpf_map_client_ops {
+	void (*free_elem)(void *value);
+};
+
 struct bpf_map {
 	atomic_t refcnt;
 	enum bpf_map_type map_type;
@@ -35,6 +39,7 @@ struct bpf_map {
 	u32 max_entries;
 	struct bpf_map_ops *ops;
 	struct work_struct work;
+	struct bpf_map_client_ops *client_ops;
 };
 
 struct bpf_map_type_list {
@@ -143,7 +148,8 @@ extern struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern struct bpf_func_proto bpf_map_update_elem_proto;
 extern struct bpf_func_proto bpf_map_delete_elem_proto;
 
-struct bpf_map *tracing_map_create(union bpf_attr *attr);
+struct bpf_map *tracing_map_create(union bpf_attr *attr,
+				   struct bpf_map_client_ops *client_ops);
 void tracing_map_destroy(struct bpf_map *map);
 void tracing_map_clear(struct bpf_map *map);
 int tracing_map_update_elem(struct bpf_map *map, void *key, void *value,
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 9eb4d8a..333c959 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -119,6 +119,20 @@ static int array_map_delete_elem(struct bpf_map *map, void *key)
 	return -EINVAL;
 }
 
+static void free_client_elems(struct bpf_array *array)
+{
+	void *val;
+	u32 index;
+
+	if (!array->map.client_ops || !array->map.client_ops->free_elem)
+		return;
+
+	for (index = 0; index < array->map.max_entries; index++) {
+		val = array->value + array->elem_size * index;
+		array->map.client_ops->free_elem(val);
+	}
+}
+
 /* Called when map->refcnt goes to zero, either from workqueue or from syscall */
 static void array_map_free(struct bpf_map *map)
 {
@@ -131,6 +145,8 @@ static void array_map_free(struct bpf_map *map)
 	 */
 	synchronize_rcu();
 
+	free_client_elems(array);
+
 	kvfree(array);
 }
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index addf3a8..6f349ad 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -275,6 +275,16 @@ err:
 	return ret;
 }
 
+static void free_client_elem(struct bpf_htab *htab, struct htab_elem *l)
+{
+	void *val;
+
+	if (htab->map.client_ops && htab->map.client_ops->free_elem) {
+		val = l->key + round_up(htab->map.key_size, 8);
+		htab->map.client_ops->free_elem(val);
+	}
+}
+
 /* Called from syscall or from eBPF program */
 static int htab_map_delete_elem(struct bpf_map *map, void *key)
 {
@@ -298,6 +308,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 	l = lookup_elem_raw(head, hash, key, key_size);
 
 	if (l) {
+		free_client_elem(htab, l);
 		hlist_del_rcu(&l->hash_node);
 		htab->count--;
 		kfree_rcu(l, rcu);
@@ -318,6 +329,7 @@ static void delete_all_elements(struct bpf_htab *htab)
 		struct htab_elem *l;
 
 		hlist_for_each_entry_safe(l, n, head, hash_node) {
+			free_client_elem(htab, l);
 			hlist_del_rcu(&l->hash_node);
 			htab->count--;
 			kfree(l);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0f28904..85735e6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -110,7 +110,8 @@ static const struct file_operations bpf_map_fops = {
  *
  * Return: the map created on success, ERR_PTR otherwise
  */
-struct bpf_map *tracing_map_create(union bpf_attr *attr)
+struct bpf_map *tracing_map_create(union bpf_attr *attr,
+				   struct bpf_map_client_ops *client_ops)
 {
 	struct bpf_map *map;
 
@@ -119,6 +120,8 @@ struct bpf_map *tracing_map_create(union bpf_attr *attr)
-	if (!IS_ERR(map))
+	if (!IS_ERR(map)) {
 		atomic_set(&map->refcnt, 1);
+		map->client_ops = client_ops;
+	}
 
 	return map;
 }
 EXPORT_SYMBOL_GPL(tracing_map_create);
@@ -133,7 +136,7 @@ static int map_create(union bpf_attr *attr)
 	if (err)
 		return -EINVAL;
 
-	map = tracing_map_create(attr);
+	map = tracing_map_create(attr, NULL);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 07/15] mm: Add ___GFP_NOTRACE
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (5 preceding siblings ...)
  2015-03-02 16:00 ` [PATCH 06/15] bpf: Add tracing_map client ops Tom Zanussi
@ 2015-03-02 16:01 ` Tom Zanussi
  2015-03-02 16:37   ` Steven Rostedt
  2015-03-02 16:01 ` [PATCH 08/15] tracing: Make kmem memory allocation tracepoints conditional Tom Zanussi
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:01 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

Add a gfp flag that allows kmalloc() et al to be used in tracing
functions.

The problem with using kmalloc for tracing is that the tracing
subsystem should be able to trace kmalloc itself, which it can't do
directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
or kmalloc()->trace_mm_page_alloc()->kmalloc().

With this flag, tracing code could use a special version of kmalloc()
that sets ___GFP_NOTRACE on every allocation it does, while leaving
the normal kmalloc() path untouched.

This would allow any tracepoint in the kmalloc path to be avoided via
DEFINE_EVENT_CONDITION() redefinitions of those events, which check
for ___GFP_NOTRACE immediately in their execution and bail out if
it's set, thereby avoiding the recursion.
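
In other words (a sketch of the intent only; the _notrace helpers
that set this flag for you arrive later in the series):

  /* an allocation tagged this way is invisible to the kmem
   * tracepoints once they are made conditional */
  elem = kmalloc(size, GFP_ATOMIC | ___GFP_NOTRACE);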

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/linux/gfp.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b840e3b..acbbc7c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -34,6 +34,7 @@ struct vm_area_struct;
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
+#define ___GFP_NOTRACE		0x2000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -97,7 +98,7 @@ struct vm_area_struct;
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 08/15] tracing: Make kmem memory allocation tracepoints conditional
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (6 preceding siblings ...)
  2015-03-02 16:01 ` [PATCH 07/15] mm: Add ___GFP_NOTRACE Tom Zanussi
@ 2015-03-02 16:01 ` Tom Zanussi
  2015-03-02 16:01 ` [PATCH 09/15] tracing: Add kmalloc/kfree macros Tom Zanussi
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:01 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

Making the kmem memory allocation tracepoints conditional allows the
functions containing those tracepoints to be used by the code that
traces those same events, e.g. a handler for the kmalloc tracepoint
can itself use kmalloc() in the trace path, which otherwise wouldn't
be possible.

The TP_CONDITION simply tests gfp_flags for the newly introduced
___GFP_NOTRACE and if true, the tracepoint simply returns without
tracing.

This allows _notrace versions of the memory allocation functions to be
defined, which do the same thing as the normal versions but in
addition set the ___GFP_NOTRACE flag, and thus avoid the (possibly
indirect) recursive calls to themselves.

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/trace/events/kmem.h | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index aece134..6d34eed 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -42,20 +42,24 @@ DECLARE_EVENT_CLASS(kmem_alloc,
 		show_gfp_flags(__entry->gfp_flags))
 );
 
-DEFINE_EVENT(kmem_alloc, kmalloc,
+DEFINE_EVENT_CONDITION(kmem_alloc, kmalloc,
 
 	TP_PROTO(unsigned long call_site, const void *ptr,
 		 size_t bytes_req, size_t bytes_alloc, gfp_t gfp_flags),
 
-	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags)
+	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags),
+
+	TP_CONDITION(!(gfp_flags & ___GFP_NOTRACE))
 );
 
-DEFINE_EVENT(kmem_alloc, kmem_cache_alloc,
+DEFINE_EVENT_CONDITION(kmem_alloc, kmem_cache_alloc,
 
 	TP_PROTO(unsigned long call_site, const void *ptr,
 		 size_t bytes_req, size_t bytes_alloc, gfp_t gfp_flags),
 
-	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags)
+	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags),
+
+	TP_CONDITION(!(gfp_flags & ___GFP_NOTRACE))
 );
 
 DECLARE_EVENT_CLASS(kmem_alloc_node,
@@ -96,22 +100,26 @@ DECLARE_EVENT_CLASS(kmem_alloc_node,
 		__entry->node)
 );
 
-DEFINE_EVENT(kmem_alloc_node, kmalloc_node,
+DEFINE_EVENT_CONDITION(kmem_alloc_node, kmalloc_node,
 
 	TP_PROTO(unsigned long call_site, const void *ptr,
 		 size_t bytes_req, size_t bytes_alloc,
 		 gfp_t gfp_flags, int node),
 
-	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags, node)
+	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags, node),
+
+	TP_CONDITION(!(gfp_flags & ___GFP_NOTRACE))
 );
 
-DEFINE_EVENT(kmem_alloc_node, kmem_cache_alloc_node,
+DEFINE_EVENT_CONDITION(kmem_alloc_node, kmem_cache_alloc_node,
 
 	TP_PROTO(unsigned long call_site, const void *ptr,
 		 size_t bytes_req, size_t bytes_alloc,
 		 gfp_t gfp_flags, int node),
 
-	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags, node)
+	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags, node),
+
+	TP_CONDITION(!(gfp_flags & ___GFP_NOTRACE))
 );
 
 DECLARE_EVENT_CLASS(kmem_free,
@@ -191,13 +199,15 @@ TRACE_EVENT(mm_page_free_batched,
 			__entry->cold)
 );
 
-TRACE_EVENT(mm_page_alloc,
+TRACE_EVENT_CONDITION(mm_page_alloc,
 
 	TP_PROTO(struct page *page, unsigned int order,
 			gfp_t gfp_flags, int migratetype),
 
 	TP_ARGS(page, order, gfp_flags, migratetype),
 
+	TP_CONDITION(!(gfp_flags & ___GFP_NOTRACE)),
+
 	TP_STRUCT__entry(
 		__field(	struct page *,	page		)
 		__field(	unsigned int,	order		)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 09/15] tracing: Add kmalloc/kfree macros
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (7 preceding siblings ...)
  2015-03-02 16:01 ` [PATCH 08/15] tracing: Make kmem memory allocation tracepoints conditional Tom Zanussi
@ 2015-03-02 16:01 ` Tom Zanussi
  2015-03-02 16:01 ` [PATCH 10/15] bpf: Make tracing_map use kmalloc/kfree_notrace() Tom Zanussi
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:01 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

Make the kmalloc function in slab.h into a macro, and define a normal
and a _notrace version.

The _notrace version is for tracing code that wants to be able to use
kmalloc for its purposes but at the same time be able to trace kmalloc
and friends.  Examples would be the bpf map and hist triggers code.

The reason for doing this as a macro is that we can avoid any change
at all to the normal kmalloc(), since its performance is obviously
critical.  It allows us to define a _notrace version that reuses the
kmalloc code but additionally sets the ___GFP_NOTRACE flag.  This
allows any downstream call to a tracepoint function to be avoided, as
the DEFINE_EVENT_CONDITION() TP_CONDITION will simply cause the trace
call to exit when it sees the ___GFP_NOTRACE flag.

Because the #ifdef CONFIG_SLOB in the original causes problems if it's
inside a macro, it was explicitly broken out into two different macros
for those cases.

Though kfree() doesn't suffer from the same recursion problems that
motivate the kmalloc macro, we need to define a _notrace version of
it as well, in order to allow for proper accounting.  Users of
kmalloc_notrace() should use kfree_notrace() to make sure the kfrees
corresponding to the untraced kmallocs don't appear in the trace
stream.
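
Code allocating from a trace path would then pair the two
(hypothetical usage):

  buf = kmalloc_notrace(size, GFP_ATOMIC);
  if (!buf)
          return -ENOMEM;

  /* ... use buf in the tracepoint handler ... */

  kfree_notrace(buf);   /* keep the matching free untraced too */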

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/linux/slab.h | 61 ++++++++++++++++++++++++++++++++++++++--------------
 mm/slab.c            | 45 ++++++++++++++++++++++----------------
 mm/slob.c            | 45 ++++++++++++++++++++++----------------
 mm/slub.c            | 47 +++++++++++++++++++++++-----------------
 4 files changed, 124 insertions(+), 74 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9a139b6..7519aaa 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -142,6 +142,7 @@ void kmem_cache_free(struct kmem_cache *, void *);
 void * __must_check __krealloc(const void *, size_t, gfp_t);
 void * __must_check krealloc(const void *, size_t, gfp_t);
 void kfree(const void *);
+void kfree_notrace(const void *);
 void kzfree(const void *);
 size_t ksize(const void *);
 
@@ -409,25 +410,53 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
  * for general use, and so are not documented here. For a full list of
  * potential flags, always refer to linux/gfp.h.
  */
-static __always_inline void *kmalloc(size_t size, gfp_t flags)
-{
-	if (__builtin_constant_p(size)) {
-		if (size > KMALLOC_MAX_CACHE_SIZE)
-			return kmalloc_large(size, flags);
-#ifndef CONFIG_SLOB
-		if (!(flags & GFP_DMA)) {
-			int index = kmalloc_index(size);
 
-			if (!index)
-				return ZERO_SIZE_PTR;
+#define set_gfp_notrace_flag(flags)
+#define set_gfp_notrace_flag_notrace(flags) (flags |= ___GFP_NOTRACE)
 
-			return kmem_cache_alloc_trace(kmalloc_caches[index],
-					flags, size);
-		}
-#endif
-	}
-	return __kmalloc(size, flags);
+#ifndef CONFIG_SLOB
+#define DEFINE_KMALLOC(_suffix)						\
+static __always_inline void *kmalloc##_suffix(size_t size, gfp_t flags)	\
+{									\
+	set_gfp_notrace_flag##_suffix(flags);				\
+									\
+	if (__builtin_constant_p(size)) {				\
+		if (size > KMALLOC_MAX_CACHE_SIZE)			\
+			return kmalloc_large(size, flags);		\
+									\
+		if (!(flags & GFP_DMA)) {				\
+			int index = kmalloc_index(size);		\
+									\
+			if (!index)					\
+				return ZERO_SIZE_PTR;			\
+									\
+			return kmem_cache_alloc_trace(kmalloc_caches[index],\
+					flags, size);			\
+		}							\
+	}								\
+	return __kmalloc(size, flags);					\
 }
+#else
+#define DEFINE_KMALLOC(_suffix)						\
+static __always_inline void *kmalloc##_suffix(size_t size, gfp_t flags)	\
+{									\
+	set_gfp_notrace_flag##_suffix(flags);				\
+									\
+	if (__builtin_constant_p(size)) {				\
+		if (size > KMALLOC_MAX_CACHE_SIZE)			\
+			return kmalloc_large(size, flags);		\
+	}								\
+	return __kmalloc(size, flags);					\
+}
+#endif /* !CONFIG_SLOB */
+
+DEFINE_KMALLOC()
+#ifdef CONFIG_TRACING
+DEFINE_KMALLOC(_notrace)
+#else
+#define kmalloc_notrace kmalloc
+#define kfree_notrace kfree
+#endif
 
 /*
  * Determine size used for the nth kmalloc cache.
diff --git a/mm/slab.c b/mm/slab.c
index 65b5dcb..c51c96a 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3553,25 +3553,32 @@ EXPORT_SYMBOL(kmem_cache_free);
  * Don't free memory not originally allocated by kmalloc()
  * or you will run into trouble.
  */
-void kfree(const void *objp)
-{
-	struct kmem_cache *c;
-	unsigned long flags;
-
-	trace_kfree(_RET_IP_, objp);
-
-	if (unlikely(ZERO_OR_NULL_PTR(objp)))
-		return;
-	local_irq_save(flags);
-	kfree_debugcheck(objp);
-	c = virt_to_cache(objp);
-	debug_check_no_locks_freed(objp, c->object_size);
-
-	debug_check_no_obj_freed(objp, c->object_size);
-	__cache_free(c, (void *)objp, _RET_IP_);
-	local_irq_restore(flags);
-}
-EXPORT_SYMBOL(kfree);
+#define trace_kfree_notrace
+#define DEFINE_KFREE(_suffix)						\
+void kfree##_suffix(const void *objp)					\
+{									\
+	struct kmem_cache *c;						\
+	unsigned long flags;						\
+									\
+	trace_kfree##_suffix(_RET_IP_, objp);				\
+									\
+	if (unlikely(ZERO_OR_NULL_PTR(objp)))				\
+		return;							\
+	local_irq_save(flags);						\
+	kfree_debugcheck(objp);						\
+	c = virt_to_cache(objp);					\
+	debug_check_no_locks_freed(objp, c->object_size);		\
+									\
+	debug_check_no_obj_freed(objp, c->object_size);			\
+	__cache_free(c, (void *)objp, _RET_IP_);			\
+	local_irq_restore(flags);					\
+}									\
+EXPORT_SYMBOL(kfree##_suffix);
+
+DEFINE_KFREE()
+#ifdef CONFIG_TRACING
+DEFINE_KFREE(_notrace)
+#endif
 
 /*
  * This initializes kmem_cache_node or resizes various caches for all nodes.
diff --git a/mm/slob.c b/mm/slob.c
index 96a8620..b3d37c4 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -481,25 +481,32 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfp,
 }
 #endif
 
-void kfree(const void *block)
-{
-	struct page *sp;
-
-	trace_kfree(_RET_IP_, block);
-
-	if (unlikely(ZERO_OR_NULL_PTR(block)))
-		return;
-	kmemleak_free(block);
-
-	sp = virt_to_page(block);
-	if (PageSlab(sp)) {
-		int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
-		unsigned int *m = (unsigned int *)(block - align);
-		slob_free(m, *m + align);
-	} else
-		__free_pages(sp, compound_order(sp));
-}
-EXPORT_SYMBOL(kfree);
+#define trace_kfree_notrace
+#define DEFINE_KFREE(_suffix)						\
+void kfree##_suffix(const void *block)					\
+{									\
+	struct page *sp;						\
+									\
+	trace_kfree##_suffix(_RET_IP_, block);				\
+									\
+	if (unlikely(ZERO_OR_NULL_PTR(block)))				\
+		return;							\
+	kmemleak_free(block);						\
+									\
+	sp = virt_to_page(block);					\
+	if (PageSlab(sp)) {						\
+		int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);	\
+		unsigned int *m = (unsigned int *)(block - align);	\
+		slob_free(m, *m + align);				\
+	} else								\
+		__free_pages(sp, compound_order(sp));			\
+}									\
+EXPORT_SYMBOL(kfree##_suffix);
+
+DEFINE_KFREE()
+#ifdef CONFIG_TRACING
+DEFINE_KFREE(_notrace)
+#endif
 
 /* can't use ksize for kmem_cache_alloc memory, only kmalloc */
 size_t ksize(const void *block)
diff --git a/mm/slub.c b/mm/slub.c
index fe376fe..93d4442 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3326,26 +3326,33 @@ size_t ksize(const void *object)
 }
 EXPORT_SYMBOL(ksize);
 
-void kfree(const void *x)
-{
-	struct page *page;
-	void *object = (void *)x;
-
-	trace_kfree(_RET_IP_, x);
-
-	if (unlikely(ZERO_OR_NULL_PTR(x)))
-		return;
-
-	page = virt_to_head_page(x);
-	if (unlikely(!PageSlab(page))) {
-		BUG_ON(!PageCompound(page));
-		kfree_hook(x);
-		__free_kmem_pages(page, compound_order(page));
-		return;
-	}
-	slab_free(page->slab_cache, page, object, _RET_IP_);
-}
-EXPORT_SYMBOL(kfree);
+#define trace_kfree_notrace
+#define DEFINE_KFREE(_suffix)						\
+void kfree##_suffix(const void *x)					\
+{									\
+	struct page *page;						\
+	void *object = (void *)x;					\
+									\
+	trace_kfree##_suffix(_RET_IP_, x);				\
+									\
+	if (unlikely(ZERO_OR_NULL_PTR(x)))				\
+		return;							\
+									\
+	page = virt_to_head_page(x);					\
+	if (unlikely(!PageSlab(page))) {				\
+		BUG_ON(!PageCompound(page));				\
+		kfree_hook(x);						\
+		__free_kmem_pages(page, compound_order(page));		\
+		return;							\
+	}								\
+	slab_free(page->slab_cache, page, object, _RET_IP_);		\
+}									\
+EXPORT_SYMBOL(kfree##_suffix);
+
+DEFINE_KFREE()
+#ifdef CONFIG_TRACING
+DEFINE_KFREE(_notrace)
+#endif
 
 /*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 10/15] bpf: Make tracing_map use kmalloc/kfree_notrace()
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (8 preceding siblings ...)
  2015-03-02 16:01 ` [PATCH 09/15] tracing: Add kmalloc/kfree macros Tom Zanussi
@ 2015-03-02 16:01 ` Tom Zanussi
  2015-03-02 16:01 ` [PATCH 11/15] tracing: Add a per-event-trigger 'paused' field Tom Zanussi
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:01 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

We need to prevent any kmallocs that could be invoked from within a
tracepoint handler from being traced, in order to prevent recursion.

For tracing maps, this means the allocation of the map elements (maps
themselves along with their associated buckets, etc, are never
allocated in the context of a tracepoint handler, so we don't need to
worry about them).

We want the matching kfrees to remain untraced as well.

With this, we should be able to use maps when tracing kmalloc, which
otherwise would lock up the machine.

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 kernel/bpf/hashtab.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 6f349ad..308ef47 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -200,6 +200,15 @@ find_first_elem:
 	return -ENOENT;
 }
 
+/* Use this instead of kfree_rcu() for notrace-kfreeing
+ * kmalloc_notrace()'d elements, to avoid creating a
+ * kfree_rcu_notrace() */
+static void kfree_htab_elem(struct rcu_head *head)
+{
+	struct htab_elem *elem = container_of(head, struct htab_elem, rcu);
+	kfree_notrace(elem);
+}
+
 /* Called from syscall or from eBPF program */
 static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 				u64 map_flags)
@@ -218,7 +227,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
 	/* allocate new element outside of lock */
-	l_new = kmalloc(htab->elem_size, GFP_ATOMIC);
+	l_new = kmalloc_notrace(htab->elem_size, GFP_ATOMIC);
 	if (!l_new)
 		return -ENOMEM;
 
@@ -262,7 +271,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	hlist_add_head_rcu(&l_new->hash_node, head);
 	if (l_old) {
 		hlist_del_rcu(&l_old->hash_node);
-		kfree_rcu(l_old, rcu);
+		call_rcu(&l_old->rcu, kfree_htab_elem);
 	} else {
 		htab->count++;
 	}
@@ -271,7 +280,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	return 0;
 err:
 	spin_unlock_irqrestore(&htab->lock, flags);
-	kfree(l_new);
+	kfree_notrace(l_new);
 	return ret;
 }
 
@@ -311,7 +320,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key)
 		free_client_elem(htab, l);
 		hlist_del_rcu(&l->hash_node);
 		htab->count--;
-		kfree_rcu(l, rcu);
+		call_rcu(&l->rcu, kfree_htab_elem);
 		ret = 0;
 	}
 
@@ -332,7 +341,7 @@ static void delete_all_elements(struct bpf_htab *htab)
 			free_client_elem(htab, l);
 			hlist_del_rcu(&l->hash_node);
 			htab->count--;
-			kfree(l);
+			kfree_notrace(l);
 		}
 	}
 }
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 11/15] tracing: Add a per-event-trigger 'paused' field
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (9 preceding siblings ...)
  2015-03-02 16:01 ` [PATCH 10/15] bpf: Make tracing_map use kmalloc/kfree_notrace() Tom Zanussi
@ 2015-03-02 16:01 ` Tom Zanussi
  2015-03-02 16:01 ` [PATCH 12/15] tracing: Add 'hist' event trigger command Tom Zanussi
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:01 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

Add a simple per-trigger 'paused' flag, allowing individual triggers
to be paused.  We could leave it to the individual triggers that need
this functionality to implement it themselves, but we also want to
allow other events to control pausing, so add it to the common
trigger data.

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 kernel/trace/trace.h                | 1 +
 kernel/trace/trace_events_trigger.c | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 3d38c2e..5bc1752 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1110,6 +1110,7 @@ struct event_trigger_data {
 	struct event_filter __rcu	*filter;
 	char				*filter_str;
 	void				*private_data;
+	bool				paused;
 	struct list_head		list;
 };
 
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index a8dfd4e..010ce30 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -73,6 +73,8 @@ event_triggers_call(struct ftrace_event_file *file, void *rec)
 		return tt;
 
 	list_for_each_entry_rcu(data, &file->triggers, list) {
+		if (data->paused)
+			continue;
 		if (!rec) {
 			data->ops->func(data, rec);
 			continue;
@@ -110,6 +112,8 @@ event_triggers_post_call(struct ftrace_event_file *file,
 	struct event_trigger_data *data;
 
 	list_for_each_entry_rcu(data, &file->triggers, list) {
+		if (data->paused)
+			continue;
 		if (data->cmd_ops->trigger_type & tt)
 			data->ops->func(data, rec);
 	}
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 12/15] tracing: Add 'hist' event trigger command
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (10 preceding siblings ...)
  2015-03-02 16:01 ` [PATCH 11/15] tracing: Add a per-event-trigger 'paused' field Tom Zanussi
@ 2015-03-02 16:01 ` Tom Zanussi
  2015-03-02 16:01 ` [PATCH 13/15] tracing: Add sorting to hist triggers Tom Zanussi
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:01 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

'hist' triggers allow users to continually aggregate trace events,
which can then be viewed afterwards by simply reading a 'hist' file
containing the aggregation in a human-readable format.

The basic idea is very simple and boils down to a mechanism whereby
trace events, rather than being exhaustively dumped in raw form and
viewed directly, are automatically 'compressed' into meaningful tables
completely defined by the user.

This is done strictly via simple single-line commands, without the
aid of any kind of programming language or interpreter.

A surprising number of typical use cases can be accomplished by users
via this simple mechanism.  In fact, a large number of the tasks that
users typically do using the more complicated script-based tracing
tools, at least during the initial stages of an investigation, can be
accomplished by simply specifying a set of keys and values to be used
in the creation of a hash table.

The Linux kernel trace event subsystem happens to provide an extensive
list of keys and values ready-made for such a purpose in the form of
the event format files associated with each trace event.  By simply
consulting the format file for field names of interest and by plugging
them into the hist trigger command, users can create an endless number
of useful aggregations to help with investigating various properties
of the system.  See Documentation/trace/events.txt for examples.

hist triggers are implemented on top of the existing event trigger
infrastructure, and as such are consistent with the existing triggers
from a user's perspective as well.

The basic syntax follows the existing trigger syntax.  Users start an
aggregation by writing a 'hist' trigger to the event of interest's
trigger file:

  # echo hist:keys=xxx:values=yyy [ if filter] > event/trigger

Once a hist trigger has been set up, by default it continually
aggregates every matching event into a hash table using the event key
and value fields specified.

To view the aggregation at any point in time, simply read the 'hist'
file in the same directory as the 'trigger' file:

  # cat event/hist

The detailed syntax provides additional options for user control, and
is described exhaustively in Documentation/trace/events.txt and in the
virtual tracing/README file in the tracing subsystem.
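As an illustrative example (real kmem:kmalloc field names, invented
output values), aggregating bytes requested per kmalloc call site:

  # echo 'hist:keys=call_site.sym:vals=bytes_req' > \
        /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
  # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist

  call_site: [ffffffff811dd7e8] load_elf_binary   hitcount: 25  bytes_req: 4375
  ...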

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/linux/ftrace_event.h        |    1 +
 kernel/trace/trace.c                |   39 ++
 kernel/trace/trace.h                |    1 +
 kernel/trace/trace_events.c         |    3 +
 kernel/trace/trace_events_filter.c  |    3 +-
 kernel/trace/trace_events_trigger.c | 1014 +++++++++++++++++++++++++++++++++++
 6 files changed, 1060 insertions(+), 1 deletion(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index c967526..483e011 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -416,6 +416,7 @@ enum event_trigger_type {
 	ETT_SNAPSHOT		= (1 << 1),
 	ETT_STACKTRACE		= (1 << 2),
 	ETT_EVENT_ENABLE	= (1 << 3),
+	ETT_EVENT_HIST		= (1 << 4),
 };
 
 extern int filter_match_preds(struct event_filter *filter, void *rec);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 60fa9e1..683048f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3730,6 +3730,7 @@ static const char readme_msg[] =
 #ifdef CONFIG_TRACER_SNAPSHOT
 	"\t\t    snapshot\n"
 #endif
+	"\t\t    hist (see below)\n"
 	"\t   example: echo traceoff > events/block/block_unplug/trigger\n"
 	"\t            echo traceoff:3 > events/block/block_unplug/trigger\n"
 	"\t            echo 'enable_event:kmem:kmalloc:3 if nr_rq > 1' > \\\n"
@@ -3745,6 +3746,44 @@ static const char readme_msg[] =
 	"\t   To remove a trigger with a count:\n"
 	"\t     echo '!<trigger>:0 > <system>/<event>/trigger\n"
 	"\t   Filters can be ignored when removing a trigger.\n"
+	"      hist trigger\t- If set, event hits are aggregated into a hash table\n"
+	"\t    Format: hist:keys=<field1>:values=<field1[,field2,...]>\n"
+	"\t            [:size=#entries][:sort=field1][:pause][:continue]\n"
+	"\t            [:clear] [if <filter>]\n\n"
+	"\t    When a matching event is hit, an entry is added to a hash\n"
+	"\t    table using the key(s) and value(s) named.  Keys and values\n"
+	"\t    correspond to fields in the event's format description.\n"
+	"\t    Values must correspond to numeric fields - on an event hit,\n"
+	"\t    the value(s) will be added to a sum kept for that field.\n"
+	"\t    The special string 'hitcount' can be used in place of an\n"
+	"\t    explicit value field - this is simply a count of event hits.\n"
+	"\t    Keys can be any field, or the special string 'stacktrace',\n"
+	"\t    which will use the event's kernel stacktrace as the key.\n\n"
+	"\t    Reading the 'hist' file for the event will dump the hash\n"
+	"\t    table in its entirety to stdout.  By default, numeric fields\n"
+	"\t    are displayed as base-10 integers.  This can be modified by\n"
+	"\t    appending any of the following modifiers to the field name:\n\n"
+	"\t            .hex       display a number as a hex value\n"
+	"\t            .sym       display an address as a symbol\n"
+	"\t            .syscall   display a syscall id as a system call name\n"
+	"\t            .execname  display a common_pid as a program name\n\n"
+	"\t    By default, the size of the hash table is 2048 entries.  The\n"
+	"\t    'size' param can be used to specify more or fewer than that.\n"
+	"\t    The units are in terms of hashtable entries - if a run uses\n"
+	"\t    more entries than specified, the results will show the number\n"
+	"\t    of 'drops', the number of hits that were ignored.  The size\n"
+	"\t    should be a power of 2 between 128 and 131072 (any non-\n"
+	"\t    power-of-2 number specified will be rounded up).\n\n"
+	"\t    The 'sort' param can be used to specify a value field to sort\n"
+	"\t    on.  The default if unspecified is 'hitcount' and the.\n"
+	"\t    default sort order is 'ascending'.  To sort in the opposite\n"
+	"\t    direction, append .descending' to the sort key.\n"
+	"\t    The 'pause' param can be used to pause an existing hist\n"
+	"\t    trigger or to start a hist trigger but not log any events\n"
+	"\t    until told to do so.  'continue' can be used to start or\n"
+	"\t    restart a paused hist trigger.\n\n"
+	"\t    The 'clear' param will clear the contents of a running hist\n"
+	"\t    trigger and leave its current paused/active state.\n"
 ;
 
 static ssize_t
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 5bc1752..f2efd6a 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1098,6 +1098,7 @@ extern struct mutex event_mutex;
 extern struct list_head ftrace_events;
 
 extern const struct file_operations event_trigger_fops;
+extern const struct file_operations event_hist_fops;
 
 extern int register_trigger_cmds(void);
 extern void clear_event_triggers(struct trace_array *tr);
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 0d2e473..b906f40 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1621,6 +1621,9 @@ event_create_dir(struct dentry *parent, struct ftrace_event_file *file)
 	trace_create_file("trigger", 0644, file->dir, file,
 			  &event_trigger_fops);
 
+	trace_create_file("hist", 0444, file->dir, file,
+			  &event_hist_fops);
+
 	trace_create_file("format", 0444, file->dir, call,
 			  &ftrace_event_format_fops);
 
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index aa8ab85..15b1ffa 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -911,7 +911,8 @@ int filter_assign_type(const char *type)
 	if (strstr(type, "__data_loc") && strstr(type, "char"))
 		return FILTER_DYN_STRING;
 
-	if (strchr(type, '[') && strstr(type, "char"))
+	if (strchr(type, '[') &&
+	    (strstr(type, "char") || strstr(type, "u8")))
 		return FILTER_STATIC_STRING;
 
 	return FILTER_OTHER;
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index 010ce30..80805f3 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -22,6 +22,9 @@
 #include <linux/ctype.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
+#include <linux/stacktrace.h>
+
+#include <linux/bpf.h>
 
 #include "trace.h"
 
@@ -139,6 +142,7 @@ static void *trigger_start(struct seq_file *m, loff_t *pos)
 	/* ->stop() is called even if ->start() fails */
 	mutex_lock(&event_mutex);
 	event_file = event_file_data(m->private);
+
 	if (unlikely(!event_file))
 		return ERR_PTR(-ENODEV);
 
@@ -255,6 +259,7 @@ static ssize_t event_trigger_regex_write(struct file *file,
 
 	mutex_lock(&event_mutex);
 	event_file = event_file_data(file);
+
 	if (unlikely(!event_file)) {
 		mutex_unlock(&event_mutex);
 		free_page((unsigned long)buf);
@@ -1431,12 +1436,1021 @@ static __init int register_trigger_traceon_traceoff_cmds(void)
 	return ret;
 }
 
+struct hist_field;
+
+typedef u64 (*hist_field_fn_t) (struct hist_field *field, void *event);
+
+struct hist_field {
+	struct ftrace_event_field	*field;
+	unsigned long			flags;
+	hist_field_fn_t			fn;
+};
+
+static u64 hist_field_none(struct hist_field *field, void *event)
+{
+	return 0;
+}
+
+static u64 hist_field_string(struct hist_field *hist_field, void *event)
+{
+	char *addr = (char *)(event + hist_field->field->offset);
+
+	return (u64)addr;
+}
+
+#define DEFINE_HIST_FIELD_FN(type)					\
+static u64 hist_field_##type(struct hist_field *hist_field, void *event)\
+{									\
+	type *addr = (type *)(event + hist_field->field->offset);	\
+									\
+	return (u64)*addr;						\
+}
+
+DEFINE_HIST_FIELD_FN(s64);
+DEFINE_HIST_FIELD_FN(u64);
+DEFINE_HIST_FIELD_FN(s32);
+DEFINE_HIST_FIELD_FN(u32);
+DEFINE_HIST_FIELD_FN(s16);
+DEFINE_HIST_FIELD_FN(u16);
+DEFINE_HIST_FIELD_FN(s8);
+DEFINE_HIST_FIELD_FN(u8);
+
+#define HIST_TRIGGER_BITS	11
+#define HIST_TRIGGER_BITS_MAX	17
+#define HIST_TRIGGER_BITS_MIN	7
+#define HIST_VALS_MAX		8
+
+#define HIST_KEY_STRING_MAX	64
+
+enum hist_field_flags {
+	HIST_FIELD_SYM		= 1,
+	HIST_FIELD_HEX		= 2,
+	HIST_FIELD_STACKTRACE	= 4,
+	HIST_FIELD_STRING	= 8,
+	HIST_FIELD_EXECNAME	= 16,
+	HIST_FIELD_SYSCALL	= 32,
+};
+
+struct hist_trigger_attrs {
+	char		*keys_str;
+	char		*vals_str;
+	bool		pause;
+	bool		cont;
+	bool		clear;
+	unsigned int	map_bits;
+};
+
+struct hist_trigger_data {
+	atomic_t			next_entry;
+	atomic64_t			total_hits;
+	struct hist_field		*key;
+	struct hist_field		*vals[HIST_VALS_MAX];
+	unsigned int			n_vals;
+	struct ftrace_event_file	*event_file;
+	atomic64_t			drops;
+	struct hist_trigger_attrs	*attrs;
+	unsigned int			map_bits;
+	struct bpf_map			*map;
+	union bpf_attr			map_attr;
+};
+
+struct hist_trigger_entry {
+	struct hist_trigger_data	*hist_data;
+	atomic64_t			hitcount;
+	atomic64_t			sums[HIST_VALS_MAX];
+	char				*comm;
+};
+
+#define HIST_STACKTRACE_DEPTH 16
+#define HIST_STACKTRACE_SKIP 5
+
+static hist_field_fn_t select_value_fn(int field_size, int field_is_signed)
+{
+	hist_field_fn_t fn = NULL;
+
+	switch (field_size) {
+	case 8:
+		if (field_is_signed)
+			fn = hist_field_s64;
+		else
+			fn = hist_field_u64;
+		break;
+	case 4:
+		if (field_is_signed)
+			fn = hist_field_s32;
+		else
+			fn = hist_field_u32;
+		break;
+	case 2:
+		if (field_is_signed)
+			fn = hist_field_s16;
+		else
+			fn = hist_field_u16;
+		break;
+	case 1:
+		if (field_is_signed)
+			fn = hist_field_s8;
+		else
+			fn = hist_field_u8;
+		break;
+	}
+
+	return fn;
+}
+
+static inline void save_comm(char *comm, struct task_struct *task)
+{
+	if (!task->pid) {
+		strcpy(comm, "<idle>");
+		return;
+	}
+
+	if (WARN_ON_ONCE(task->pid < 0)) {
+		strcpy(comm, "<XXX>");
+		return;
+	}
+
+	if (task->pid > PID_MAX_DEFAULT) {
+		strcpy(comm, "<...>");
+		return;
+	}
+
+	memcpy(comm, task->comm, TASK_COMM_LEN);
+}
+
+static void hist_trigger_entry_free(struct hist_trigger_entry *entry)
+{
+	kfree_notrace(entry->comm);
+	kfree_notrace(entry);
+}
+
+static struct hist_trigger_entry *
+hist_trigger_entry_create(struct hist_trigger_data *hist_data)
+{
+	struct hist_trigger_entry *entry = NULL;
+
+	if (atomic_read(&hist_data->next_entry) >=
+	    hist_data->map_attr.max_entries)
+		return NULL;
+
+	atomic_inc(&hist_data->next_entry);
+
+	entry = kmalloc_notrace(sizeof(*entry), GFP_ATOMIC);
+	if (!entry)
+		return entry;
+	memset(entry, 0, sizeof(*entry));
+
+	if (hist_data->key->flags & HIST_FIELD_EXECNAME) {
+		entry->comm = kmalloc_notrace(TASK_COMM_LEN + 1, GFP_ATOMIC);
+		if (!entry->comm) {
+			hist_trigger_entry_free(entry);
+			return NULL;
+		}
+		save_comm(entry->comm, current);
+	}
+
+	entry->hist_data = hist_data;
+
+	return entry;
+}
+
+static void destroy_hist_field(struct hist_field *hist_field)
+{
+	kfree(hist_field);
+}
+
+static struct hist_field *create_hist_field(struct ftrace_event_field *field,
+					    unsigned long flags)
+{
+	hist_field_fn_t fn = hist_field_none;
+	struct hist_field *hist_field;
+
+	hist_field = kzalloc(sizeof(struct hist_field), GFP_KERNEL);
+	if (!hist_field)
+		return NULL;
+
+	if (flags & HIST_FIELD_STACKTRACE) {
+		hist_field->flags = flags;
+		goto out;
+	}
+
+	if (is_function_field(field))
+		goto free;
+
+	if (is_string_field(field)) {
+		flags |= HIST_FIELD_STRING;
+		fn = hist_field_string;
+	} else {
+		fn = select_value_fn(field->size, field->is_signed);
+		if (!fn)
+			goto free;
+	}
+
+	hist_field->field = field;
+	hist_field->flags = flags;
+	hist_field->fn = fn;
+ out:
+	return hist_field;
+ free:
+	kfree(hist_field);
+	hist_field = NULL;
+
+	goto out;
+}
+
+static void destroy_hist_fields(struct hist_trigger_data *hist_data)
+{
+	unsigned int i;
+
+	destroy_hist_field(hist_data->key);
+	hist_data->key = NULL;
+
+	for (i = 0; i < hist_data->n_vals; i++) {
+		destroy_hist_field(hist_data->vals[i]);
+		hist_data->vals[i] = NULL;
+	}
+}
+
+static int create_key_field(struct hist_trigger_data *hist_data,
+			    struct ftrace_event_file *file,
+			    char *field_str)
+{
+	struct ftrace_event_field *field = NULL;
+	unsigned long flags = 0;
+	char *field_name;
+	int ret = 0;
+
+	if (!strcmp(field_str, "stacktrace"))
+		flags |= HIST_FIELD_STACKTRACE;
+	else {
+		field_name = strsep(&field_str, ".");
+		if (field_str) {
+			if (!strcmp(field_str, "sym"))
+				flags |= HIST_FIELD_SYM;
+			else if (!strcmp(field_str, "hex"))
+				flags |= HIST_FIELD_HEX;
+			else if (!strcmp(field_str, "execname") &&
+				 !strcmp(field_name, "common_pid"))
+				flags |= HIST_FIELD_EXECNAME;
+			else if (!strcmp(field_str, "syscall"))
+				flags |= HIST_FIELD_SYSCALL;
+		}
+
+		field = trace_find_event_field(file->event_call, field_name);
+		if (!field) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	hist_data->key = create_hist_field(field, flags);
+	if (!hist_data->key)
+		ret = -ENOMEM;
+ out:
+	return ret;
+}
+
+static int create_val_field(struct hist_trigger_data *hist_data,
+			    unsigned int val,
+			    struct ftrace_event_file *file,
+			    char *field_str)
+{
+	struct ftrace_event_field *field = NULL;
+	unsigned long flags = 0;
+	char *field_name;
+	int ret = 0;
+
+	if (!strcmp(field_str, "hitcount"))
+		return ret;
+
+	field_name = strsep(&field_str, ".");
+	if (field_str) {
+		if (!strcmp(field_str, "sym"))
+			flags |= HIST_FIELD_SYM;
+		else if (!strcmp(field_str, "hex"))
+			flags |= HIST_FIELD_HEX;
+	}
+
+	field = trace_find_event_field(file->event_call, field_name);
+	if (!field) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	hist_data->vals[val] = create_hist_field(field, flags);
+	if (!hist_data->vals[val]) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	hist_data->n_vals++;
+ out:
+	return ret;
+}
+
+static int create_hist_fields(struct hist_trigger_data *hist_data,
+			      struct ftrace_event_file *file)
+{
+	char *fields_str, *field_str;
+	unsigned int i;
+	int ret;
+
+	fields_str = hist_data->attrs->keys_str;
+	strsep(&fields_str, "=");
+	if (!fields_str) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = create_key_field(hist_data, file, fields_str);
+	if (ret)
+		goto out;
+
+	fields_str = hist_data->attrs->vals_str;
+	strsep(&fields_str, "=");
+	if (!fields_str) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	for (i = 0; i < HIST_VALS_MAX; i++) {
+		field_str = strsep(&fields_str, ",");
+		if (!field_str) {
+			if (i == 0) {
+				ret = -EINVAL;
+				goto out;
+			} else
+				break;
+		}
+		ret = create_val_field(hist_data, i, file, field_str);
+		if (ret)
+			goto out;
+	}
+ out:
+	return ret;
+}
+
+static u32 get_key_size(struct hist_trigger_data *hist_data)
+{
+	u32 key_size;
+
+	if (hist_data->key->flags & HIST_FIELD_STACKTRACE)
+		key_size = sizeof(unsigned long) * HIST_STACKTRACE_DEPTH;
+	else
+		key_size = hist_data->key->field->size;
+
+	return key_size;
+}
+
+static void destroy_hist_trigger_attrs(struct hist_trigger_attrs *attrs)
+{
+	kfree(attrs->keys_str);
+	kfree(attrs->vals_str);
+	kfree(attrs);
+}
+
+static int parse_map_size(char *str)
+{
+	unsigned long size, map_bits;
+	int ret;
+
+	strsep(&str, "=");
+	if (!str) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = kstrtoul(str, 0, &size);
+	if (ret)
+		goto out;
+
+	map_bits = ilog2(roundup_pow_of_two(size));
+	if (map_bits < HIST_TRIGGER_BITS_MIN ||
+	    map_bits > HIST_TRIGGER_BITS_MAX)
+		ret = -EINVAL;
+	else
+		ret = map_bits;
+ out:
+	return ret;
+}
+
+static struct hist_trigger_attrs *parse_hist_trigger_attrs(char *trigger_str)
+{
+	struct hist_trigger_attrs *attrs;
+	int ret = 0;
+
+	attrs = kzalloc(sizeof(*attrs), GFP_KERNEL);
+	if (!attrs)
+		return ERR_PTR(-ENOMEM);
+
+	while (trigger_str) {
+		char *str = strsep(&trigger_str, ":");
+
+		if (!strncmp(str, "keys", strlen("keys")) ||
+		    !strncmp(str, "key", strlen("key")))
+			attrs->keys_str = kstrdup(str, GFP_KERNEL);
+		else if (!strncmp(str, "values", strlen("values")) ||
+			 !strncmp(str, "vals", strlen("vals")) ||
+			 !strncmp(str, "val", strlen("val")))
+			attrs->vals_str = kstrdup(str, GFP_KERNEL);
+		else if (!strncmp(str, "pause", strlen("pause")))
+			attrs->pause = true;
+		else if (!strncmp(str, "continue", strlen("continue")) ||
+			 !strncmp(str, "cont", strlen("cont")))
+			attrs->cont = true;
+		else if (!strncmp(str, "clear", strlen("clear")))
+			attrs->clear = true;
+		else if (!strncmp(str, "size", strlen("size"))) {
+			int map_bits = parse_map_size(str);
+			if (map_bits < 0) {
+				ret = map_bits;
+				goto free;
+			}
+			attrs->map_bits = map_bits;
+		} else {
+			ret = -EINVAL;
+			goto free;
+		}
+	}
+
+	return attrs;
+ free:
+	destroy_hist_trigger_attrs(attrs);
+
+	return ERR_PTR(ret);
+}
+
+static void destroy_hist_data(struct hist_trigger_data *hist_data)
+{
+	synchronize_sched();
+
+	destroy_hist_trigger_attrs(hist_data->attrs);
+
+	destroy_hist_fields(hist_data);
+
+	if (hist_data->map)
+		tracing_map_destroy(hist_data->map);
+
+	kfree(hist_data);
+}
+
+static void hist_trigger_free_elem(void *value)
+{
+	hist_trigger_entry_free(*(struct hist_trigger_entry **)value);
+}
+
+static struct bpf_map_client_ops hist_map_client_ops = {
+       .free_elem              = hist_trigger_free_elem,
+};
+
+static struct hist_trigger_data *
+create_hist_data(unsigned int map_bits,
+		 struct hist_trigger_attrs *attrs,
+		 struct ftrace_event_file *file)
+{
+	unsigned int hist_map_size = (1 << map_bits);
+	struct hist_trigger_data *hist_data;
+	int ret = 0;
+
+	hist_data = kzalloc(sizeof(*hist_data), GFP_KERNEL);
+	if (!hist_data)
+		return NULL;
+
+	hist_data->map_attr.map_type = BPF_MAP_TYPE_HASH;
+	hist_data->map_attr.max_entries = hist_map_size;
+
+	hist_data->attrs = attrs;
+
+	ret = create_hist_fields(hist_data, file);
+	if (ret < 0)
+		goto free;
+
+	hist_data->map_attr.key_size = get_key_size(hist_data);
+	hist_data->map_attr.value_size = sizeof(struct hist_trigger_entry *);
+
+	hist_data->map = tracing_map_create(&hist_data->map_attr,
+					    &hist_map_client_ops);
+	if (IS_ERR(hist_data->map)) {
+		ret = PTR_ERR(hist_data->map);
+		goto free;
+	}
+
+	hist_data->map_bits = map_bits;
+	hist_data->event_file = file;
+ out:
+	return hist_data;
+ free:
+	destroy_hist_data(hist_data);
+	if (ret)
+		hist_data = ERR_PTR(ret);
+	else
+		hist_data = NULL;
+
+	goto out;
+}
+
+static int hist_trigger_entry_insert(struct hist_trigger_data *hist_data,
+				     void *key,
+				     struct hist_trigger_entry *val)
+{
+	return tracing_map_update_elem(hist_data->map, key, &val,
+				       &hist_data->map_attr);
+}
+
+static void hist_trigger_entry_update(struct hist_trigger_data *hist_data,
+				      struct hist_trigger_entry *entry,
+				      void *rec)
+{
+	struct hist_field *hist_field;
+	unsigned int i;
+	u64 hist_val;
+
+	for (i = 0; i < hist_data->n_vals; i++) {
+		hist_field = hist_data->vals[i];
+		hist_val = hist_field->fn(hist_field, rec);
+		atomic64_add(hist_val, &entry->sums[i]);
+	}
+
+	atomic64_inc(&entry->hitcount);
+}
+
+static void event_hist_trigger(struct event_trigger_data *data, void *rec)
+{
+	struct hist_trigger_data *hist_data = data->private_data;
+	unsigned long entries[HIST_STACKTRACE_DEPTH];
+	struct hist_trigger_entry *entry;
+	struct stack_trace stacktrace;
+	u64 field_contents;
+	void *key;
+
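+	/* if the table has already overflowed, just count drops */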
+	if (atomic64_read(&hist_data->drops)) {
+		atomic64_inc(&hist_data->drops);
+		return;
+	}
+
+	if (hist_data->key->flags & HIST_FIELD_STACKTRACE) {
+		stacktrace.max_entries = HIST_STACKTRACE_DEPTH;
+		stacktrace.entries = entries;
+		stacktrace.nr_entries = 0;
+		stacktrace.skip = HIST_STACKTRACE_SKIP;
+
+		memset(stacktrace.entries, 0, hist_data->map_attr.key_size);
+		save_stack_trace(&stacktrace);
+
+		key = entries;
+	} else {
+		field_contents = hist_data->key->fn(hist_data->key, rec);
+		if (hist_data->key->flags & HIST_FIELD_STRING)
+			key = (void *)field_contents;
+		else
+			key = (void *)&field_contents;
+	}
+
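+	/* look up the entry for this key, creating one on first hit */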
+	if (tracing_map_lookup_elem(hist_data->map, key, &entry) != 0) {
+		entry = hist_trigger_entry_create(hist_data);
+		if (!entry) {
+			atomic64_inc(&hist_data->drops);
+			return;
+		}
+		if (WARN_ON_ONCE(hist_trigger_entry_insert(hist_data,
+							   key, entry))) {
+			hist_trigger_entry_free(entry);
+			return;
+		}
+	}
+	hist_trigger_entry_update(hist_data, entry, rec);
+
+	atomic64_inc(&hist_data->total_hits);
+}
+
+static void hist_trigger_stacktrace_print(struct seq_file *m,
+					  unsigned long *stacktrace_entries,
+					  unsigned int max_entries)
+{
+	char str[KSYM_SYMBOL_LEN];
+	unsigned int spaces = 8;
+	unsigned int i;
+
+	for (i = 0; i < max_entries; i++) {
+		if (stacktrace_entries[i] == ULONG_MAX)
+			return;
+
+		seq_printf(m, "%*c", 1 + spaces, ' ');
+		sprint_symbol(str, stacktrace_entries[i]);
+		seq_printf(m, "%s\n", str);
+	}
+}
+
+static void hist_trigger_entry_print(struct seq_file *m,
+				     struct hist_trigger_data *hist_data,
+				     void *key,
+				     struct hist_trigger_entry *entry)
+{
+	char str[KSYM_SYMBOL_LEN];
+	unsigned int i;
+	u64 uval;
+
+	if (entry->hist_data->key->flags & HIST_FIELD_SYM) {
+		uval = *(u64 *)key;
+		kallsyms_lookup(uval, NULL, NULL, NULL, str);
+		seq_printf(m, "%s: [%llx] %-35s", hist_data->key->field->name,
+			   uval, str);
+	} else if (entry->hist_data->key->flags & HIST_FIELD_HEX) {
+		uval = *(u64 *)key;
+		seq_printf(m, "%s: %llx", hist_data->key->field->name, uval);
+	} else if (entry->hist_data->key->flags & HIST_FIELD_STACKTRACE) {
+		seq_puts(m, "stacktrace:\n");
+		hist_trigger_stacktrace_print(m, key, HIST_STACKTRACE_DEPTH);
+	} else if (entry->hist_data->key->flags & HIST_FIELD_STRING) {
+		seq_printf(m, "%s: %-35s", hist_data->key->field->name,
+			   (char *)key);
+	} else if (entry->hist_data->key->flags & HIST_FIELD_EXECNAME) {
+		uval = *(u64 *)key;
+		seq_printf(m, "%s: %-16s[%10llu]", hist_data->key->field->name,
+			   entry->comm, uval);
+	} else if (entry->hist_data->key->flags & HIST_FIELD_SYSCALL) {
+		const char *syscall_name;
+
+		uval = *(u64 *)key;
+		syscall_name = get_syscall_name(uval);
+		if (!syscall_name)
+			syscall_name = "unknown_syscall";
+
+		seq_printf(m, "%s: %-30s", hist_data->key->field->name,
+			   syscall_name);
+	} else {
+		uval = *(u64 *)key;
+		seq_printf(m, "%s: %10llu", hist_data->key->field->name, uval);
+	}
+
+	seq_printf(m, " hitcount: %10llu",
+		   (u64)atomic64_read(&entry->hitcount));
+
+	for (i = 0; i < hist_data->n_vals; i++) {
+		seq_printf(m, "  %s: %10llu", hist_data->vals[i]->field->name,
+			   (u64)atomic64_read(&entry->sums[i]));
+	}
+
+	seq_puts(m, "\n");
+}
+
+static int print_entries(struct seq_file *m,
+			 struct hist_trigger_data *hist_data)
+{
+	void *key, *next_key, *tmp_key = NULL;
+	struct hist_trigger_entry *entry;
+	int ret = 0;
+
+	key = kzalloc(hist_data->map_attr.key_size, GFP_KERNEL);
+	if (!key)
+		return -ENOMEM;
+
+	next_key = kzalloc(hist_data->map_attr.key_size, GFP_KERNEL);
+	if (!next_key) {
+		ret = -ENOMEM;
+		goto free;
+	}
+
+	while (tracing_map_get_next_key(hist_data->map, key, next_key) == 0) {
+		tracing_map_lookup_elem(hist_data->map, next_key, &entry);
+		hist_trigger_entry_print(m, hist_data, next_key, entry);
+		tmp_key = key;
+		key = next_key;
+		next_key = tmp_key;
+	}
+free:
+	kfree(key);
+	kfree(next_key);
+
+	return ret;
+}
+
+static int hist_show(struct seq_file *m, void *v)
+{
+	struct event_trigger_data *test, *data = NULL;
+	struct ftrace_event_file *event_file;
+	struct hist_trigger_data *hist_data;
+	int ret = 0;
+
+	mutex_lock(&event_mutex);
+
+	event_file = event_file_data(m->private);
+
+	if (unlikely(!event_file)) {
+		ret = -ENODEV;
+		goto out_unlock;
+	}
+
+	list_for_each_entry_rcu(test, &event_file->triggers, list) {
+		if (test->cmd_ops->trigger_type == ETT_EVENT_HIST) {
+			data = test;
+			break;
+		}
+	}
+
+	if (!data)
+		goto out_unlock;
+
+	seq_puts(m, "# trigger info: ");
+	data->ops->print(m, data->ops, data);
+	seq_puts(m, "\n");
+
+	hist_data = data->private_data;
+
+	ret = print_entries(m, hist_data);
+
+	seq_printf(m, "\nTotals:\n    Hits: %lu\n    Entries: %u\n    Dropped: %lu\n",
+		   atomic64_read(&hist_data->total_hits),
+		   atomic_read(&hist_data->next_entry),
+		   atomic64_read(&hist_data->drops));
+ out_unlock:
+	mutex_unlock(&event_mutex);
+
+	return ret;
+}
+
+static int event_hist_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, hist_show, file);
+}
+
+const struct file_operations event_hist_fops = {
+	.open = event_hist_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+static const char *get_hist_field_flags(struct hist_field *hist_field)
+{
+	const char *flags_str = NULL;
+
+	if (hist_field->flags & HIST_FIELD_SYM)
+		flags_str = "sym";
+	else if (hist_field->flags & HIST_FIELD_HEX)
+		flags_str = "hex";
+	else if (hist_field->flags & HIST_FIELD_SYSCALL)
+		flags_str = "syscall";
+	else if (hist_field->flags & HIST_FIELD_EXECNAME)
+		flags_str = "execname";
+
+	return flags_str;
+}
+
+static void hist_field_print(struct seq_file *m, struct hist_field *hist_field)
+{
+	seq_printf(m, "%s", hist_field->field->name);
+	if (hist_field->flags) {
+		const char *flags_str = get_hist_field_flags(hist_field);
+
+		if (flags_str)
+			seq_printf(m, ".%s", flags_str);
+	}
+}
+
+static int event_hist_trigger_print(struct seq_file *m,
+				    struct event_trigger_ops *ops,
+				    struct event_trigger_data *data)
+{
+	struct hist_trigger_data *hist_data = data->private_data;
+	unsigned int i;
+
+	seq_puts(m, "hist:keys=");
+
+	if (hist_data->key->flags & HIST_FIELD_STACKTRACE)
+		seq_puts(m, "stacktrace");
+	else
+		hist_field_print(m, hist_data->key);
+
+	seq_puts(m, ":vals=");
+
+	for (i = 0; i < hist_data->n_vals; i++) {
+		if (i > 0)
+			seq_puts(m, ",");
+		hist_field_print(m, hist_data->vals[i]);
+	}
+
+	seq_printf(m, ":size=%u", (1 << hist_data->map_bits));
+
+	if (data->filter_str)
+		seq_printf(m, " if %s", data->filter_str);
+
+	if (data->paused)
+		seq_puts(m, " [paused]");
+	else
+		seq_puts(m, " [active]");
+
+	seq_putc(m, '\n');
+
+	return 0;
+}
+
+static void event_hist_trigger_free(struct event_trigger_ops *ops,
+				    struct event_trigger_data *data)
+{
+	struct hist_trigger_data *hist_data = data->private_data;
+
+	if (WARN_ON_ONCE(data->ref <= 0))
+		return;
+
+	data->ref--;
+	if (!data->ref) {
+		destroy_hist_data(hist_data);
+		trigger_data_free(data);
+	}
+}
+
+static struct event_trigger_ops event_hist_trigger_ops = {
+	.func			= event_hist_trigger,
+	.print			= event_hist_trigger_print,
+	.init			= event_trigger_init,
+	.free			= event_hist_trigger_free,
+};
+
+static struct event_trigger_ops *event_hist_get_trigger_ops(char *cmd,
+							    char *param)
+{
+	return &event_hist_trigger_ops;
+}
+
+static void hist_clear(struct event_trigger_data *data)
+{
+	struct hist_trigger_data *hist_data = data->private_data;
+	bool paused;
+
+	paused = data->paused;
+	data->paused = true;
+
+	tracing_map_clear(hist_data->map);
+
+	atomic_set(&hist_data->next_entry, 0);
+	atomic64_set(&hist_data->total_hits, 0);
+	atomic64_set(&hist_data->drops, 0);
+
+	data->paused = paused;
+}
+
+static int hist_register_trigger(char *glob, struct event_trigger_ops *ops,
+				 struct event_trigger_data *data,
+				 struct ftrace_event_file *file)
+{
+	struct hist_trigger_data *hist_data = data->private_data;
+	struct event_trigger_data *test;
+	int ret = 0;
+
+	list_for_each_entry_rcu(test, &file->triggers, list) {
+		if (test->cmd_ops->trigger_type == ETT_EVENT_HIST) {
+			if (hist_data->attrs->clear)
+				hist_clear(test);
+			else if (hist_data->attrs->pause)
+				test->paused = true;
+			else if (hist_data->attrs->cont)
+				test->paused = false;
+			else
+				ret = -EEXIST;
+			goto out;
+		}
+	}
+
+	if (hist_data->attrs->pause)
+		data->paused = true;
+
+	if (data->ops->init) {
+		ret = data->ops->init(data->ops, data);
+		if (ret < 0)
+			goto out;
+	}
+
+	list_add_rcu(&data->list, &file->triggers);
+	ret++;
+
+	if (trace_event_trigger_enable_disable(file, 1) < 0) {
+		list_del_rcu(&data->list);
+		ret--;
+	}
+	update_cond_flag(file);
+out:
+	return ret;
+}
+
+static int event_hist_trigger_func(struct event_command *cmd_ops,
+				   struct ftrace_event_file *file,
+				   char *glob, char *cmd, char *param)
+{
+	unsigned int hist_trigger_bits = HIST_TRIGGER_BITS;
+	struct event_trigger_data *trigger_data;
+	struct hist_trigger_attrs *attrs;
+	struct event_trigger_ops *trigger_ops;
+	struct hist_trigger_data *hist_data;
+	char *trigger;
+	int ret = 0;
+
+	if (!param)
+		return -EINVAL;
+
+	/* separate the trigger from the filter (k:v [if filter]) */
+	trigger = strsep(&param, " \t");
+	if (!trigger)
+		return -EINVAL;
+
+	attrs = parse_hist_trigger_attrs(trigger);
+	if (IS_ERR(attrs))
+		return PTR_ERR(attrs);
+
+	if (!attrs->keys_str || !attrs->vals_str) {
+		destroy_hist_trigger_attrs(attrs);
+		return -EINVAL;
+	}
+
+	if (attrs->map_bits)
+		hist_trigger_bits = attrs->map_bits;
+
+	hist_data = create_hist_data(hist_trigger_bits, attrs, file);
+	if (IS_ERR(hist_data))
+		return PTR_ERR(hist_data);
+
+	trigger_ops = cmd_ops->get_trigger_ops(cmd, trigger);
+
+	ret = -ENOMEM;
+	trigger_data = kzalloc(sizeof(*trigger_data), GFP_KERNEL);
+	if (!trigger_data)
+		goto out;
+
+	trigger_data->count = -1;
+	trigger_data->ops = trigger_ops;
+	trigger_data->cmd_ops = cmd_ops;
+
+	INIT_LIST_HEAD(&trigger_data->list);
+	RCU_INIT_POINTER(trigger_data->filter, NULL);
+
+	trigger_data->private_data = hist_data;
+
+	if (glob[0] == '!') {
+		cmd_ops->unreg(glob+1, trigger_ops, trigger_data, file);
+		ret = 0;
+		goto out_free;
+	}
+
+	if (!param) /* if param is non-empty, it's supposed to be a filter */
+		goto out_reg;
+
+	if (!cmd_ops->set_filter)
+		goto out_reg;
+
+	ret = cmd_ops->set_filter(param, trigger_data, file);
+	if (ret < 0)
+		goto out_free;
+ out_reg:
+	ret = cmd_ops->reg(glob, trigger_ops, trigger_data, file);
+	/*
+	 * The above returns on success the # of triggers registered,
+	 * but if it didn't register any it returns zero.  Consider no
+	 * triggers registered a failure too.
+	 */
+	if (!ret) {
+		if (!(attrs->pause || attrs->cont || attrs->clear))
+			ret = -ENOENT;
+		goto out_free;
+	} else if (ret < 0)
+		goto out_free;
+	/* Just return zero, not the number of registered triggers */
+	ret = 0;
+ out:
+	return ret;
+ out_free:
+	if (cmd_ops->set_filter)
+		cmd_ops->set_filter(NULL, trigger_data, NULL);
+	kfree(trigger_data);
+	destroy_hist_data(hist_data);
+
+	goto out;
+}
+
+static struct event_command trigger_hist_cmd = {
+	.name			= "hist",
+	.trigger_type		= ETT_EVENT_HIST,
+	.post_trigger		= true, /* need non-NULL rec */
+	.func			= event_hist_trigger_func,
+	.reg			= hist_register_trigger,
+	.unreg			= unregister_trigger,
+	.get_trigger_ops	= event_hist_get_trigger_ops,
+	.set_filter		= set_trigger_filter,
+};
+
+static __init int register_trigger_hist_cmd(void)
+{
+	int ret;
+
+	ret = register_event_command(&trigger_hist_cmd);
+	WARN_ON(ret < 0);
+
+	return ret;
+}
+
 __init int register_trigger_cmds(void)
 {
 	register_trigger_traceon_traceoff_cmds();
 	register_trigger_snapshot_cmd();
 	register_trigger_stacktrace_cmd();
 	register_trigger_enable_disable_cmds();
+	register_trigger_hist_cmd();
 
 	return 0;
 }
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 13/15] tracing: Add sorting to hist triggers
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (11 preceding siblings ...)
  2015-03-02 16:01 ` [PATCH 12/15] tracing: Add 'hist' event trigger command Tom Zanussi
@ 2015-03-02 16:01 ` Tom Zanussi
  2015-03-02 16:01 ` [PATCH 14/15] tracing: Add enable_hist/disable_hist triggers Tom Zanussi
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:01 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

Add support for sorting to hist triggers; without it, entries are
displayed in hash order, which is to say essentially randomly.

Currently, support is implemented for just a primary sort key which is
by default ascending.

To specify the sort key for a trigger, append ':sort=<sort_key>',
where sort_key can be any field specified in the values= param, or
the special value 'hitcount', which is the number of times each
entry in the hashtable was hit.

With these changes, even if the user doesn't explicitly specify a sort
key, the table will be sorted ascending on hitcount.  To sort in
descending order, append the '.descending' modifier to the sort
field, like so:

  ':sort=<sort_key>.descending'
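
For example, the following (illustrative) trigger sorts kmalloc call
sites by total bytes requested, largest first:

  # echo 'hist:keys=call_site:vals=bytes_req:sort=bytes_req.descending' > \
        /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger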

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 kernel/trace/trace.c                |   2 +-
 kernel/trace/trace_events_trigger.c | 279 +++++++++++++++++++++++++++++++++++-
 2 files changed, 278 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 683048f..2fa41ef 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3777,7 +3777,7 @@ static const char readme_msg[] =
 	"\t    The 'sort' param can be used to specify a value field to sort\n"
 	"\t    on.  The default if unspecified is 'hitcount' and the.\n"
 	"\t    default sort order is 'ascending'.  To sort in the opposite\n"
-	"\t    direction, append .descending' to the sort key.\n"
+	"\t    direction, append .descending' to the sort key.\n\n"
 	"\t    The 'pause' param can be used to pause an existing hist\n"
 	"\t    trigger or to start a hist trigger but not log any events\n"
 	"\t    until told to do so.  'continue' can be used to start or\n"
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index 80805f3..ae75528 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -23,6 +23,7 @@
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include <linux/stacktrace.h>
+#include <linux/sort.h>
 
 #include <linux/bpf.h>
 
@@ -1479,6 +1480,7 @@ DEFINE_HIST_FIELD_FN(u8);
 #define HIST_TRIGGER_BITS_MAX	17
 #define HIST_TRIGGER_BITS_MIN	7
 #define HIST_VALS_MAX		8
+#define HIST_SORT_KEYS_MAX	2
 
 #define HIST_KEY_STRING_MAX	64
 
@@ -1491,9 +1493,16 @@ enum hist_field_flags {
 	HIST_FIELD_SYSCALL	= 32,
 };
 
+struct hist_trigger_sort_key {
+	bool		use_hitcount;
+	bool		descending;
+	unsigned int	idx;
+};
+
 struct hist_trigger_attrs {
 	char		*keys_str;
 	char		*vals_str;
+	char		*sort_keys_str;
 	bool		pause;
 	bool		cont;
 	bool		clear;
@@ -1509,6 +1518,8 @@ struct hist_trigger_data {
 	struct ftrace_event_file	*event_file;
 	atomic64_t			drops;
 	struct hist_trigger_attrs	*attrs;
+	struct hist_trigger_sort_key	*sort_keys[HIST_SORT_KEYS_MAX];
+	struct hist_trigger_sort_key	*sort_key_cur;
 	unsigned int			map_bits;
 	struct bpf_map			*map;
 	union bpf_attr			map_attr;
@@ -1521,6 +1532,11 @@ struct hist_trigger_entry {
 	char				*comm;
 };
 
+struct hist_trigger_sort_entry {
+	void				*key;
+	struct hist_trigger_entry	*entry;
+};
+
 #define HIST_STACKTRACE_DEPTH 16
 #define HIST_STACKTRACE_SKIP 5
 
@@ -1671,6 +1687,106 @@ static void destroy_hist_fields(struct hist_trigger_data *hist_data)
 	}
 }
 
+static inline struct hist_trigger_sort_key *create_default_sort_key(void)
+{
+	struct hist_trigger_sort_key *sort_key;
+
+	sort_key = kzalloc(sizeof(*sort_key), GFP_KERNEL);
+	if (!sort_key)
+		return ERR_PTR(-ENOMEM);
+
+	sort_key->use_hitcount = true;
+
+	return sort_key;
+}
+
+static inline struct hist_trigger_sort_key *
+create_sort_key(char *field_name, struct hist_trigger_data *hist_data)
+{
+	struct hist_trigger_sort_key *sort_key;
+	unsigned int i;
+
+	if (!strcmp(field_name, "hitcount"))
+		return create_default_sort_key();
+
+	for (i = 0; i < hist_data->n_vals; i++)
+		if (!strcmp(field_name, hist_data->vals[i]->field->name))
+			goto out;
+
+	return ERR_PTR(-EINVAL);
+ out:
+	sort_key = kzalloc(sizeof(*sort_key), GFP_KERNEL);
+	if (!sort_key)
+		return ERR_PTR(-ENOMEM);
+
+	sort_key->idx = i;
+
+	return sort_key;
+}
+
+static int create_sort_keys(struct hist_trigger_data *hist_data)
+{
+	char *fields_str = hist_data->attrs->sort_keys_str;
+	struct hist_trigger_sort_key *sort_key;
+	char *field_str, *field_name;
+	unsigned int i;
+	int ret = 0;
+
+	if (!fields_str) {
+		sort_key = create_default_sort_key();
+		if (IS_ERR(sort_key)) {
+			ret = PTR_ERR(sort_key);
+			goto out;
+		}
+		hist_data->sort_keys[0] = sort_key;
+		goto out;
+	}
+
+	strsep(&fields_str, "=");
+	if (!fields_str) {
+		ret = -EINVAL;
+		goto free;
+	}
+
+	for (i = 0; i < HIST_SORT_KEYS_MAX; i++) {
+		field_str = strsep(&fields_str, ",");
+		if (!field_str) {
+			if (i == 0) {
+				ret = -EINVAL;
+				goto free;
+			} else
+				break;
+		}
+
+		field_name = strsep(&field_str, ".");
+		sort_key = create_sort_key(field_name, hist_data);
+		if (IS_ERR(sort_key)) {
+			ret = PTR_ERR(sort_key);
+			goto free;
+		}
+
+		if (field_str) {
+			if (!strcmp(field_str, "descending"))
+				sort_key->descending = true;
+			else if (strcmp(field_str, "ascending")) {
+				ret = -EINVAL;
+				goto free;
+			}
+		}
+		hist_data->sort_keys[i] = sort_key;
+	}
+out:
+	return ret;
+free:
+	for (i = 0; i < HIST_SORT_KEYS_MAX; i++) {
+		if (!hist_data->sort_keys[i])
+			break;
+		kfree(hist_data->sort_keys[i]);
+		hist_data->sort_keys[i] = NULL;
+	}
+	goto out;
+}
+
 static int create_key_field(struct hist_trigger_data *hist_data,
 			    struct ftrace_event_file *file,
 			    char *field_str)
@@ -1785,6 +1901,8 @@ static int create_hist_fields(struct hist_trigger_data *hist_data,
 		if (ret)
 			goto out;
 	}
+
+	ret = create_sort_keys(hist_data);
  out:
 	return ret;
 }
@@ -1803,6 +1921,7 @@ static u32 get_key_size(struct hist_trigger_data *hist_data)
 
 static void destroy_hist_trigger_attrs(struct hist_trigger_attrs *attrs)
 {
+	kfree(attrs->sort_keys_str);
 	kfree(attrs->keys_str);
 	kfree(attrs->vals_str);
 	kfree(attrs);
@@ -1852,6 +1971,8 @@ static struct hist_trigger_attrs *parse_hist_trigger_attrs(char *trigger_str)
 			 !strncmp(str, "vals", strlen("vals")) ||
 			 !strncmp(str, "val", strlen("val")))
 			attrs->vals_str = kstrdup(str, GFP_KERNEL);
+		else if (!strncmp(str, "sort", strlen("sort")))
+			attrs->sort_keys_str = kstrdup(str, GFP_KERNEL);
 		else if (!strncmp(str, "pause", strlen("pause")))
 			attrs->pause = true;
 		else if (!strncmp(str, "continue", strlen("continue")) ||
@@ -2093,8 +2214,42 @@ static void hist_trigger_entry_print(struct seq_file *m,
 	seq_puts(m, "\n");
 }
 
-static int print_entries(struct seq_file *m,
-			 struct hist_trigger_data *hist_data)
+static int cmp_entries(const struct hist_trigger_sort_entry **a,
+		       const struct hist_trigger_sort_entry **b)
+{
+	const struct hist_trigger_entry *entry_a, *entry_b;
+	struct hist_trigger_sort_key *sort_key;
+	struct hist_trigger_data *hist_data;
+	u64 val_a, val_b;
+	int ret = 0;
+
+	entry_a = (*a)->entry;
+	entry_b = (*b)->entry;
+
+	hist_data = entry_a->hist_data;
+	sort_key = hist_data->sort_key_cur;
+
+	if (sort_key->use_hitcount) {
+		val_a = atomic64_read(&entry_a->hitcount);
+		val_b = atomic64_read(&entry_b->hitcount);
+	} else {
+		val_a = atomic64_read(&entry_a->sums[sort_key->idx]);
+		val_b = atomic64_read(&entry_b->sums[sort_key->idx]);
+	}
+
+	if (val_a > val_b)
+		ret = 1;
+	else if (val_a < val_b)
+		ret = -1;
+
+	if (sort_key->descending)
+		ret = -ret;
+
+	return ret;
+}
+
+static int print_entries_unsorted(struct seq_file *m,
+				  struct hist_trigger_data *hist_data)
 {
 	void *key, *next_key, *tmp_key = NULL;
 	struct hist_trigger_entry *entry;
@@ -2124,6 +2279,106 @@ free:
 	return ret;
 }
 
+static void destroy_sort_entry(struct hist_trigger_sort_entry *entry)
+{
+	if (!entry)
+		return;
+
+	kfree(entry->key);
+	kfree(entry);
+}
+
+static void destroy_sort_entries(struct hist_trigger_sort_entry **entries,
+				 unsigned int entries_size)
+{
+	unsigned int i;
+
+	for (i = 0; i < entries_size; i++)
+		destroy_sort_entry(entries[i]);
+}
+
+static struct hist_trigger_sort_entry *
+create_sort_entry(void *key, struct hist_trigger_entry *entry)
+{
+	struct hist_trigger_sort_entry *sort_entry;
+
+	sort_entry = kmalloc(sizeof(*sort_entry), GFP_KERNEL);
+	if (!sort_entry)
+		return ERR_PTR(-ENOMEM);
+
+	sort_entry->key = kmalloc(entry->hist_data->map_attr.key_size,
+				  GFP_KERNEL);
+	if (!sort_entry->key) {
+		destroy_sort_entry(sort_entry);
+		return ERR_PTR(-ENOMEM);
+	}
+	memcpy(sort_entry->key, key, entry->hist_data->map_attr.key_size);
+
+	sort_entry->entry = entry;
+
+	return sort_entry;
+}
+
+static int print_entries(struct seq_file *m,
+			 struct hist_trigger_data *hist_data)
+{
+	struct hist_trigger_sort_entry **sort_entries;
+	void *key, *next_key = NULL, *tmp_key = NULL;
+	struct hist_trigger_sort_entry *sort_entry;
+	struct hist_trigger_entry *entry;
+	unsigned int sort_entries_size;
+	unsigned int i = 0, j;
+	int ret = 0;
+
+	hist_data->map_attr.key_size = get_key_size(hist_data);
+
+	sort_entries_size = hist_data->map_attr.max_entries;
+	sort_entries_size *= sizeof(sort_entry);
+	sort_entries = kzalloc(sort_entries_size, GFP_KERNEL);
+
+	if (!sort_entries)
+		return -ENOMEM;
+
+	key = kzalloc(hist_data->map_attr.key_size, GFP_KERNEL);
+	if (!key) {
+		ret = -ENOMEM;
+		goto free;
+	}
+
+	next_key = kzalloc(hist_data->map_attr.key_size, GFP_KERNEL);
+	if (!next_key) {
+		ret = -ENOMEM;
+		goto free;
+	}
+
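+	/* copy the entries out of the map so they can be sorted */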
+	while (tracing_map_get_next_key(hist_data->map, key, next_key) == 0) {
+		tracing_map_lookup_elem(hist_data->map, next_key, &entry);
+		sort_entries[i] = create_sort_entry(next_key, entry);
+		if (IS_ERR(sort_entries[i])) {
+			ret = PTR_ERR(sort_entries[i]);
+			sort_entries[i] = NULL;
+			goto free;
+		}
+		i++;
+
+		tmp_key = key;
+		key = next_key;
+		next_key = tmp_key;
+	}
+
+	hist_data->sort_key_cur = hist_data->sort_keys[0];
+	sort(sort_entries, i, sizeof(struct hist_trigger_sort_entry *),
+	     (int (*)(const void *, const void *))cmp_entries, NULL);
+
+	for (j = 0; j < i; j++) {
+		hist_trigger_entry_print(m, hist_data, sort_entries[j]->key,
+					 sort_entries[j]->entry);
+	}
+free:
+	destroy_sort_entries(sort_entries, hist_data->map_attr.max_entries);
+	kfree(sort_entries);
+
+	kfree(key);
+	kfree(next_key);
+
+	return ret;
+}
+
 static int hist_show(struct seq_file *m, void *v)
 {
 	struct event_trigger_data *test, *data = NULL;
@@ -2157,6 +2412,8 @@ static int hist_show(struct seq_file *m, void *v)
 	hist_data = data->private_data;
 
 	ret = print_entries(m, hist_data);
+	if (ret)
+		print_entries_unsorted(m, hist_data);
 
 	seq_printf(m, "\nTotals:\n    Hits: %lu\n    Entries: %u\n    Dropped: %lu\n",
 		   atomic64_read(&hist_data->total_hits),
@@ -2229,6 +2486,24 @@ static int event_hist_trigger_print(struct seq_file *m,
 		hist_field_print(m, hist_data->vals[i]);
 	}
 
+	for (i = 0; i < HIST_SORT_KEYS_MAX; i++) {
+		if (!hist_data->sort_keys[i])
+			break;
+
+		if (i == 0)
+			seq_puts(m, ":sort=");
+		else
+			seq_puts(m, ",");
+
+		if (hist_data->sort_keys[i]->use_hitcount)
+			seq_puts(m, "hitcount");
+		else {
+			unsigned int idx = hist_data->sort_keys[i]->idx;
+
+			hist_field_print(m, hist_data->vals[idx]);
+		}
+	}
+
 	seq_printf(m, ":size=%u", (1 << hist_data->map_bits));
 
 	if (data->filter_str)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 14/15] tracing: Add enable_hist/disable_hist triggers
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (12 preceding siblings ...)
  2015-03-02 16:01 ` [PATCH 13/15] tracing: Add sorting to hist triggers Tom Zanussi
@ 2015-03-02 16:01 ` Tom Zanussi
  2015-03-02 16:01 ` [PATCH 15/15] tracing: Add 'hist' trigger Documentation Tom Zanussi
  2015-03-03  2:25 ` [PATCH v2 00/15] tracing: 'hist' triggers Masami Hiramatsu
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:01 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

Similar to enable_event/disable_event triggers, these triggers enable
and disable the aggregation of events into maps rather than enabling
and disabling their writing into the trace buffer.

They can be used to automatically start and stop hist triggers based
on a matching filter condition.

If there's a paused hist trigger on system:event, the following would
start it when the filter condition was hit:

  # echo enable_hist:system:event [ if filter] > event/trigger

And the following would disable a running system:event hist trigger:

  # echo disable_hist:system:event [ if filter] > event/trigger

See Documentation/trace/events.txt for real examples.
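A sketch of a complete (hypothetical) session, kicking off a paused
kmalloc hist trigger only when a particular program execs:

  # echo 'hist:keys=call_site:vals=bytes_req:pause' > \
        /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
  # echo 'enable_hist:kmem:kmalloc if filename=="/usr/bin/wget"' > \
        /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger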

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 include/linux/ftrace_event.h        |   1 +
 kernel/trace/trace.c                |  11 ++-
 kernel/trace/trace_events_trigger.c | 138 +++++++++++++++++++++++++++++++++++-
 3 files changed, 146 insertions(+), 4 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 483e011..8679996 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -417,6 +417,7 @@ enum event_trigger_type {
 	ETT_STACKTRACE		= (1 << 2),
 	ETT_EVENT_ENABLE	= (1 << 3),
 	ETT_EVENT_HIST		= (1 << 4),
+	ETT_HIST_ENABLE		= (1 << 5),
 };
 
 extern int filter_match_preds(struct event_filter *filter, void *rec);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2fa41ef..b537a37 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3724,6 +3724,8 @@ static const char readme_msg[] =
 	"\t   trigger: traceon, traceoff\n"
 	"\t            enable_event:<system>:<event>\n"
 	"\t            disable_event:<system>:<event>\n"
+	"\t            enable_hist:<system>:<event>\n"
+	"\t            disable_hist:<system>:<event>\n"
 #ifdef CONFIG_STACKTRACE
 	"\t\t    stacktrace\n"
 #endif
@@ -3783,7 +3785,14 @@ static const char readme_msg[] =
 	"\t    until told to do so.  'continue' can be used to start or\n"
 	"\t    restart a paused hist trigger.\n\n"
 	"\t    The 'clear' param will clear the contents of a running hist\n"
-	"\t    trigger and leave its current paused/active state.\n"
+	"\t    trigger and leave its current paused/active state.\n\n"
+	"\t    The enable_hist and disable_hist triggers can be used to\n"
+	"\t    have one event conditionally start and stop another event's\n"
+	"\t    already-attached hist trigger.  Any number of enable_hist\n"
+	"\t    and disable_hist triggers can be attached to a given event,\n"
+	"\t    allowing that event to kick off and stop aggregations on\n"
+	"\t    a host of other events.  See Documentation/trace/events.txt\n"
+	"\t    for examples.\n"
 ;
 
 static ssize_t
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index ae75528..1111ca5 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -1062,10 +1062,13 @@ static __init void unregister_trigger_traceon_traceoff_cmds(void)
 /* Avoid typos */
 #define ENABLE_EVENT_STR	"enable_event"
 #define DISABLE_EVENT_STR	"disable_event"
+#define ENABLE_HIST_STR		"enable_hist"
+#define DISABLE_HIST_STR	"disable_hist"
 
 struct enable_trigger_data {
 	struct ftrace_event_file	*file;
 	bool				enable;
+	bool				hist;
 };
 
 static void
@@ -1104,7 +1107,9 @@ event_enable_trigger_print(struct seq_file *m, struct event_trigger_ops *ops,
 	struct enable_trigger_data *enable_data = data->private_data;
 
 	seq_printf(m, "%s:%s:%s",
-		   enable_data->enable ? ENABLE_EVENT_STR : DISABLE_EVENT_STR,
+		   enable_data->hist ?
+		   (enable_data->enable ? ENABLE_HIST_STR : DISABLE_HIST_STR) :
+		   (enable_data->enable ? ENABLE_EVENT_STR : DISABLE_EVENT_STR),
 		   enable_data->file->event_call->class->system,
 		   ftrace_event_name(enable_data->file->event_call));
 
@@ -1183,6 +1188,7 @@ event_enable_trigger_func(struct event_command *cmd_ops,
 	char *trigger;
 	char *number;
 	bool enable;
+	bool hist;
 	int ret;
 
 	if (!param)
@@ -1204,7 +1210,11 @@ event_enable_trigger_func(struct event_command *cmd_ops,
 	if (!event_enable_file)
 		goto out;
 
-	enable = strcmp(cmd, ENABLE_EVENT_STR) == 0;
+	hist = ((strcmp(cmd, ENABLE_HIST_STR) == 0) ||
+		(strcmp(cmd, DISABLE_HIST_STR) == 0));
+
+	enable = ((strcmp(cmd, ENABLE_EVENT_STR) == 0) ||
+		  (strcmp(cmd, ENABLE_HIST_STR) == 0));
 
 	trigger_ops = cmd_ops->get_trigger_ops(cmd, trigger);
 
@@ -1225,6 +1235,7 @@ event_enable_trigger_func(struct event_command *cmd_ops,
 	INIT_LIST_HEAD(&trigger_data->list);
 	RCU_INIT_POINTER(trigger_data->filter, NULL);
 
+	enable_data->hist = hist;
 	enable_data->enable = enable;
 	enable_data->file = event_enable_file;
 	trigger_data->private_data = enable_data;
@@ -1315,6 +1326,8 @@ static int event_enable_register_trigger(char *glob,
 	list_for_each_entry_rcu(test, &file->triggers, list) {
 		test_enable_data = test->private_data;
 		if (test_enable_data &&
+		    (test->cmd_ops->trigger_type ==
+		     data->cmd_ops->trigger_type) &&
 		    (test_enable_data->file == enable_data->file)) {
 			ret = -EEXIST;
 			goto out;
@@ -1352,6 +1365,8 @@ static void event_enable_unregister_trigger(char *glob,
 	list_for_each_entry_rcu(data, &file->triggers, list) {
 		enable_data = data->private_data;
 		if (enable_data &&
+		    (data->cmd_ops->trigger_type ==
+		     test->cmd_ops->trigger_type) &&
 		    (enable_data->file == test_enable_data->file)) {
 			unregistered = true;
 			list_del_rcu(&data->list);
@@ -1371,7 +1386,8 @@ event_enable_get_trigger_ops(char *cmd, char *param)
 	struct event_trigger_ops *ops;
 	bool enable;
 
-	enable = strcmp(cmd, ENABLE_EVENT_STR) == 0;
+	enable = ((strcmp(cmd, ENABLE_EVENT_STR) == 0) ||
+		  (strcmp(cmd, ENABLE_HIST_STR) == 0));
 
 	if (enable)
 		ops = param ? &event_enable_count_trigger_ops :
@@ -1423,6 +1439,121 @@ static __init int register_trigger_enable_disable_cmds(void)
 	return ret;
 }
 
+static void
+hist_enable_trigger(struct event_trigger_data *data, void *rec)
+{
+	struct enable_trigger_data *enable_data = data->private_data;
+	struct event_trigger_data *test;
+
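+	/* find the hist trigger on the target event and (un)pause it */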
+	list_for_each_entry_rcu(test, &enable_data->file->triggers, list) {
+		if (test->cmd_ops->trigger_type == ETT_EVENT_HIST) {
+			if (enable_data->enable)
+				test->paused = false;
+			else
+				test->paused = true;
+			break;
+		}
+	}
+}
+
+static void
+hist_enable_count_trigger(struct event_trigger_data *data, void *rec)
+{
+	if (!data->count)
+		return;
+
+	if (data->count != -1)
+		(data->count)--;
+
+	hist_enable_trigger(data, rec);
+}
+
+static struct event_trigger_ops hist_enable_trigger_ops = {
+	.func			= hist_enable_trigger,
+	.print			= event_enable_trigger_print,
+	.init			= event_trigger_init,
+	.free			= event_enable_trigger_free,
+};
+
+static struct event_trigger_ops hist_enable_count_trigger_ops = {
+	.func			= hist_enable_count_trigger,
+	.print			= event_enable_trigger_print,
+	.init			= event_trigger_init,
+	.free			= event_enable_trigger_free,
+};
+
+static struct event_trigger_ops hist_disable_trigger_ops = {
+	.func			= hist_enable_trigger,
+	.print			= event_enable_trigger_print,
+	.init			= event_trigger_init,
+	.free			= event_enable_trigger_free,
+};
+
+static struct event_trigger_ops hist_disable_count_trigger_ops = {
+	.func			= hist_enable_count_trigger,
+	.print			= event_enable_trigger_print,
+	.init			= event_trigger_init,
+	.free			= event_enable_trigger_free,
+};
+
+static struct event_trigger_ops *
+hist_enable_get_trigger_ops(char *cmd, char *param)
+{
+	struct event_trigger_ops *ops;
+	bool enable;
+
+	enable = (strcmp(cmd, ENABLE_HIST_STR) == 0);
+
+	if (enable)
+		ops = param ? &hist_enable_count_trigger_ops :
+			&hist_enable_trigger_ops;
+	else
+		ops = param ? &hist_disable_count_trigger_ops :
+			&hist_disable_trigger_ops;
+
+	return ops;
+}
+
+static struct event_command trigger_hist_enable_cmd = {
+	.name			= ENABLE_HIST_STR,
+	.trigger_type		= ETT_HIST_ENABLE,
+	.func			= event_enable_trigger_func,
+	.reg			= event_enable_register_trigger,
+	.unreg			= event_enable_unregister_trigger,
+	.get_trigger_ops	= hist_enable_get_trigger_ops,
+	.set_filter		= set_trigger_filter,
+};
+
+static struct event_command trigger_hist_disable_cmd = {
+	.name			= DISABLE_HIST_STR,
+	.trigger_type		= ETT_HIST_ENABLE,
+	.func			= event_enable_trigger_func,
+	.reg			= event_enable_register_trigger,
+	.unreg			= event_enable_unregister_trigger,
+	.get_trigger_ops	= hist_enable_get_trigger_ops,
+	.set_filter		= set_trigger_filter,
+};
+
+static __init void unregister_trigger_hist_enable_disable_cmds(void)
+{
+	unregister_event_command(&trigger_hist_enable_cmd);
+	unregister_event_command(&trigger_hist_disable_cmd);
+}
+
+static __init int register_trigger_hist_enable_disable_cmds(void)
+{
+	int ret;
+
+	ret = register_event_command(&trigger_hist_enable_cmd);
+	if (WARN_ON(ret < 0))
+		return ret;
+	ret = register_event_command(&trigger_hist_disable_cmd);
+	if (WARN_ON(ret < 0))
+		unregister_trigger_hist_enable_disable_cmds();
+
+	return ret;
+}
+
 static __init int register_trigger_traceon_traceoff_cmds(void)
 {
 	int ret;
@@ -2725,6 +2856,7 @@ __init int register_trigger_cmds(void)
 	register_trigger_snapshot_cmd();
 	register_trigger_stacktrace_cmd();
 	register_trigger_enable_disable_cmds();
+	register_trigger_hist_enable_disable_cmds();
 	register_trigger_hist_cmd();
 
 	return 0;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 15/15] tracing: Add 'hist' trigger Documentation
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (13 preceding siblings ...)
  2015-03-02 16:01 ` [PATCH 14/15] tracing: Add enable_hist/disable_hist triggers Tom Zanussi
@ 2015-03-02 16:01 ` Tom Zanussi
  2015-03-03  2:25 ` [PATCH v2 00/15] tracing: 'hist' triggers Masami Hiramatsu
  15 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:01 UTC (permalink / raw)
  To: rostedt
  Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel, Tom Zanussi

Add documentation and usage examples for 'hist' triggers.

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
---
 Documentation/trace/events.txt | 870 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 870 insertions(+)

diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt
index 75d25a1..0e7a27f 100644
--- a/Documentation/trace/events.txt
+++ b/Documentation/trace/events.txt
@@ -494,3 +494,873 @@ The following commands are supported:
 
   Note that there can be only one traceon or traceoff trigger per
   triggering event.
+
+- hist
+
+  This command aggregates event hits into a hash table keyed on a
+  trace event format field (or stacktrace), with a set of running
+  totals derived from one or more trace event format fields and/or
+  event counts (hitcount).
+
+  The format of a hist trigger is as follows:
+
+        hist:keys=<field1>:values=<field1[,field2,...]>
+          [:size=#entries][:sort=field1][:pause][:continue]
+          [:clear] [if <filter>]
+
+  When a matching event is hit, an entry is added to a hash table
+  using the key(s) and value(s) named.  Keys and values correspond to
+  fields in the event's format description.  Values must correspond to
+  numeric fields - on an event hit, the value(s) will be added to a
+  sum kept for that field.  The special string 'hitcount' can be used
+  in place of an explicit value field - this is simply a count of
+  event hits.  Keys can be any field, or the special string
+  'stacktrace', which will use the event's kernel stacktrace as the
+  key.  The keywords 'keys' or 'key' can be used to specify keys, and
+  the keywords 'values', 'vals', or 'val' can be used to specify
+  values.  For the time being, only a single key can be used -
+  compound keys aren't yet supported.
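+
+  For example, given an event with numeric fields 'skbaddr' and 'len'
+  (the netif_rx event used below is one such event), a spec of
+  'keys=skbaddr:vals=len,hitcount' would keep a sum of skb lengths
+  and a count of hits for each unique skb address seen.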
+
+  'hist' triggers add a 'hist' file to each event's subdirectory.
+  Reading the 'hist' file for the event will dump the hash table in
+  its entirety to stdout.  By default, numeric fields are displayed as
+  base-10 integers.  This can be modified by appending any of the
+  following modifiers to the field name:
+
+        .hex       display a number as a hex value
+        .sym       display an address as a symbol
+        .syscall   display a syscall id as a system call name
+        .execname  display a common_pid as a program name
+
+  A typical usage scenario would be the following to enable a hist
+  trigger, read its current contents, and then turn it off:
+
+  # echo 'hist:keys=skbaddr.hex:vals=len' > \
+    /sys/kernel/debug/tracing/events/net/netif_rx/trigger
+
+  # cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
+
+  # echo '!hist:keys=skbaddr.hex:vals=len' > \
+    /sys/kernel/debug/tracing/events/net/netif_rx/trigger
+
+  The trigger file itself can be read to show the details of the
+  currently attached hist trigger.  This information is also displayed
+  at the top of the 'hist' file when read.
+
+  By default, the size of the hash table is 2048 entries.  The 'size'
+  param can be used to specify more or fewer than that.  The units are
+  in terms of hashtable entries - if a run uses more entries than
+  specified, the results will show the number of 'drops', the number
+  of hits that were ignored.  The size should be a power of 2 between
+  128 and 131072 (any non-power-of-2 number specified will be rounded
+  up).
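+
+  For example, the following variant of the netif_rx trigger shown
+  earlier would create a table with 4096 entries:
+
+  # echo 'hist:keys=skbaddr.hex:vals=len:size=4096' > \
+    /sys/kernel/debug/tracing/events/net/netif_rx/trigger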
+
+  The 'sort' param can be used to specify a value field to sort on.
+  The default if unspecified is 'hitcount' and the default sort order
+  is 'ascending'.  To sort in the opposite direction, append
+  '.descending' to the sort key.
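+
+  For example, to sort the netif_rx table above by total skb length,
+  largest first:
+
+  # echo 'hist:keys=skbaddr.hex:vals=len:sort=len.descending' > \
+    /sys/kernel/debug/tracing/events/net/netif_rx/trigger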
+
+  The 'pause' param can be used to pause an existing hist trigger or
+  to start a hist trigger but not log any events until told to do so.
+  'continue' or 'cont' can be used to start or restart a paused hist
+  trigger.
+
+  The 'clear' param will clear the contents of a running hist trigger
+  and leave its current paused/active state unchanged.
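+
+  For example, assuming the netif_rx trigger above is already
+  attached, re-echoing its spec with ':pause', ':cont', or ':clear'
+  appended would pause, resume, or reset it in place:
+
+  # echo 'hist:keys=skbaddr.hex:vals=len:pause' > \
+    /sys/kernel/debug/tracing/events/net/netif_rx/trigger
+
+  # echo 'hist:keys=skbaddr.hex:vals=len:clear' > \
+    /sys/kernel/debug/tracing/events/net/netif_rx/trigger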
+
+- enable_hist/disable_hist
+
+  The enable_hist and disable_hist triggers can be used to have one
+  event conditionally start and stop another event's already-attached
+  hist trigger.  Any number of enable_hist and disable_hist triggers
+  can be attached to a given event, allowing that event to kick off
+  and stop aggregations on a host of other events.
+
+  The format is very similar to the enable/disable_event triggers:
+
+      enable_hist:<system>:<event>[:count]
+      disable_hist:<system>:<event>[:count]
+
+  Instead of enabling or disabling the tracing of the target event
+  into the trace buffer as the enable/disable_event triggers do, the
+  enable/disable_hist triggers enable or disable the aggregation of
+  the target event into a hash table.
+
+  A typical usage scenario for the enable_hist/disable_hist triggers
+  would be to first set up a paused hist trigger on some event,
+  followed by an enable_hist/disable_hist pair that turns the hist
+  aggregation on and off when conditions of interest are hit:
+
+  # echo 'hist:keys=skbaddr.hex:vals=len:pause' > \
+    /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+
+  # echo 'enable_hist:net:netif_receive_skb if filename==/usr/bin/wget' > \
+    /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
+
+  # echo 'disable_hist:net:netif_receive_skb if comm==wget' > \
+    /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
+
+  The above sets up an initially paused hist trigger which is unpaused
+  and starts aggregating events when a given program is executed, and
+  which stops aggregating when the process exits and the hist trigger
+  is paused again.
+
+  The examples below provide a more concrete illustration of the
+  concepts and typical usage patterns discussed above.
+
+
+6.2 'hist' trigger examples
+---------------------------
+
+  The first set of examples creates aggregations using the kmalloc
+  event.  The fields that can be used for the hist trigger are listed
+  in the kmalloc event's format file:
+
+    # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/format
+    name: kmalloc
+    ID: 374
+    format:
+	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
+	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
+	field:unsigned char common_preempt_count;		offset:3;	size:1;	signed:0;
+	field:int common_pid;					offset:4;	size:4;	signed:1;
+
+	field:unsigned long call_site;				offset:8;	size:8;	signed:0;
+	field:const void * ptr;					offset:16;	size:8;	signed:0;
+	field:size_t bytes_req;					offset:24;	size:8;	signed:0;
+	field:size_t bytes_alloc;				offset:32;	size:8;	signed:0;
+	field:gfp_t gfp_flags;					offset:40;	size:4;	signed:0;
+
+  We'll start by creating a hist trigger that generates a simple table
+  that lists the total number of bytes requested for each function in
+  the kernel that made one or more calls to kmalloc:
+
+    # echo 'hist:key=call_site:val=bytes_req' > \
+            /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+  This tells the tracing system to create a 'hist' trigger using the
+  call_site field of the kmalloc event as the key for the table, which
+  just means that each unique call_site address will have an entry
+  created for it in the table.  The 'val=bytes_req' parameter tells
+  the hist trigger that for each unique entry (call_site) in the
+  table, it should keep a running total of the number of bytes
+  requested by that call_site.
+
+  We'll let it run for a while and then dump the contents of the 'hist'
+  file in the kmalloc event's subdirectory (for readability, a number
+  of entries have been omitted):
+
+    # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+    trigger info: hist:keys=call_site:vals=bytes_req:sort=hitcount:size=2048 [active]
+
+    call_site: 18446744071581750326 hitcount:          1  bytes_req:         24
+    call_site: 18446744071583151255 hitcount:          1  bytes_req:         32
+    call_site: 18446744071582443167 hitcount:          1  bytes_req:        264
+    call_site: 18446744072104099935 hitcount:          2  bytes_req:        464
+    call_site: 18446744071579323550 hitcount:          3  bytes_req:        168
+    call_site: 18446744071580558850 hitcount:          4  bytes_req:      65536
+    .
+    .
+    .
+    call_site: 18446744072101362738 hitcount:       4115  bytes_req:     362120
+    call_site: 18446744072103235682 hitcount:       4115  bytes_req:     427960
+    call_site: 18446744072102962280 hitcount:       4342  bytes_req:    1637584
+    call_site: 18446744072102962328 hitcount:       4973  bytes_req:     360624
+    call_site: 18446744072101400062 hitcount:       6465  bytes_req:     258600
+    call_site: 18446744071582063959 hitcount:       7818  bytes_req:      15636
+    call_site: 18446744072102968908 hitcount:       9315  bytes_req:    8912400
+    call_site: 18446744072103122419 hitcount:       9315  bytes_req:     819720
+    call_site: 18446744072101402850 hitcount:      10240  bytes_req:     573440
+    call_site: 18446744072099471334 hitcount:      23820  bytes_req:     476512
+
+    Totals:
+        Hits: 109976
+        Entries: 76
+        Dropped: 0
+
+  The output displays a line for each entry, beginning with the key
+  specified in the trigger, followed by the value(s) also specified in
+  the trigger.  At the beginning of the output is a line that displays
+  the trigger info, which can also be displayed by reading the
+  'trigger' file:
+
+    # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+    hist:keys=call_site:vals=bytes_req:sort=hitcount:size=2048 [active]
+
+  At the end of the output are a few lines that display the overall
+  totals for the run.  The 'Hits' field shows the total number of
+  times the event trigger was hit, the 'Entries' field shows the
+  total number of used entries in the hash table, and the 'Dropped'
+  field shows the number of hits that were dropped because the number
+  of used entries for the run exceeded the maximum number of entries
+  allowed for the table (normally 0, but if not a hint that you may
+  want to increase the size of the table using the 'size' param).
+
+  Notice in the above output that there's an extra field, 'hitcount',
+  that wasn't specified in the trigger.  Also notice that the trigger
+  info shows a param, 'sort=hitcount', which wasn't specified in the
+  trigger either.  The reason is that every trigger implicitly keeps a
+  count of the total number of hits attributed to a given entry,
+  called the 'hitcount', and that in the absence of a user-specified
+  sort param, the hitcount is used as the default sort field.
+
+  The value 'hitcount' can be used in place of an explicit value in
+  the 'values' param if you don't really need to have any particular
+  field summed and are mainly interested in hit frequencies.
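+
+  For example, 'hist:key=call_site:val=hitcount' would produce the
+  same table as the above minus the bytes_req totals - just a count
+  of kmalloc calls per call_site.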
+
+  To turn the hist trigger off, simply call up the trigger in command
+  history and re-execute it with a '!' prepended:
+
+    # echo '!hist:key=call_site:val=bytes_req' > \
+           /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+  Finally, notice that the call_site as displayed in the output above
+  isn't really very useful.  It's an address, but normally addresses
+  are displayed in hex.  To have a numeric field displayed as hex
+  values, simply append '.hex' to the field name in the trigger:
+
+    # echo 'hist:key=call_site.hex:val=bytes_req' > \
+           /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+    # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+    trigger info: hist:keys=call_site.hex:vals=bytes_req:sort=hitcount:size=2048 [active]
+
+    call_site: ffffffffa04e85cb hitcount:          1  bytes_req:        511
+    call_site: ffffffff810a66a9 hitcount:          1  bytes_req:       1024
+    call_site: ffffffff8152db82 hitcount:          1  bytes_req:          8
+    call_site: ffffffffa008829e hitcount:          1  bytes_req:          7
+    call_site: ffffffffa0087d6a hitcount:          1  bytes_req:          7
+    call_site: ffffffffa02eb7e1 hitcount:          1  bytes_req:        433
+    call_site: ffffffff8152cbde hitcount:          1  bytes_req:        192
+    call_site: ffffffff811a2602 hitcount:          2  bytes_req:      32768
+    .
+    .
+    .
+    call_site: ffffffffa0419062 hitcount:        724  bytes_req:      75296
+    call_site: ffffffffa024fc32 hitcount:        724  bytes_req:      63712
+    call_site: ffffffffa03d6468 hitcount:        852  bytes_req:     416736
+    call_site: ffffffffa03d6498 hitcount:        944  bytes_req:      70984
+    call_site: ffffffff81311d57 hitcount:        961  bytes_req:       1922
+    call_site: ffffffffa0258dfe hitcount:       1429  bytes_req:      57160
+    call_site: ffffffffa03fd5f3 hitcount:       1796  bytes_req:     158048
+    call_site: ffffffffa03d7e4c hitcount:       1796  bytes_req:    2206848
+    call_site: ffffffffa02598e2 hitcount:       3196  bytes_req:     178976
+    call_site: ffffffffa0081fe6 hitcount:       9529  bytes_req:     190584
+
+    Totals:
+        Hits: 24542
+        Entries: 60
+        Dropped: 0
+
+  Even that's only marginally more useful - while hex values do look
+  more like addresses, what users are typically more interested in
+  when looking at text addresses are the corresponding symbols
+  instead.  To have an address displayed as a symbolic value,
+  simply append '.sym' to the field name in the trigger:
+
+    # echo 'hist:key=call_site.sym:val=bytes_req' > \
+           /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+    # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+    trigger info: hist:keys=call_site.sym:vals=bytes_req:sort=hitcount:size=2048 [active]
+
+    call_site: [ffffffff8152db82] usb_control_msg                     hitcount:          1  bytes_req:          8
+    call_site: [ffffffffa04e85cb] __ieee80211_start_scan              hitcount:          1  bytes_req:        511
+    call_site: [ffffffff815d7f6b] sk_prot_alloc                       hitcount:          1  bytes_req:        976
+    call_site: [ffffffff8152cbde] usb_alloc_urb                       hitcount:          1  bytes_req:        192
+    call_site: [ffffffff815d2c1b] sock_alloc_inode                    hitcount:          1  bytes_req:         64
+    call_site: [ffffffffa02eb7e1] nl80211_trigger_scan                hitcount:          1  bytes_req:        433
+    call_site: [ffffffffa0087d6a] hidraw_report_event                 hitcount:          1  bytes_req:          7
+    call_site: [ffffffff810a66a9] syslog_print_all                    hitcount:          1  bytes_req:       1024
+    call_site: [ffffffffa008829e] hidraw_send_report                  hitcount:          1  bytes_req:          7
+    call_site: [ffffffff811f1316] mounts_open_common                  hitcount:          2  bytes_req:        368
+    call_site: [ffffffff811d0825] alloc_fdtable                       hitcount:          2  bytes_req:         96
+    call_site: [ffffffffa042106f] __intel_framebuffer_create          hitcount:          3  bytes_req:        432
+    .
+    .
+    .
+    call_site: [ffffffff8136b079] sg_kmalloc                          hitcount:        461  bytes_req:     153696
+    call_site: [ffffffffa03dbc5e] i915_gem_obj_lookup_or_create_vma   hitcount:        462  bytes_req:      92400
+    call_site: [ffffffffa03e213a] i915_gem_object_get_pages_gtt       hitcount:        462  bytes_req:       7392
+    call_site: [ffffffffa03d6498] i915_gem_do_execbuffer.isra.22      hitcount:        697  bytes_req:      52064
+    call_site: [ffffffff81311d57] apparmor_file_alloc_security        hitcount:        926  bytes_req:       1852
+    call_site: [ffffffffa0258dfe] drm_vma_node_allow                  hitcount:       1050  bytes_req:      42000
+    call_site: [ffffffffa03fd5f3] intel_ring_begin                    hitcount:       1116  bytes_req:      98208
+    call_site: [ffffffffa03d7e4c] i915_gem_execbuffer2                hitcount:       1116  bytes_req:     956984
+    call_site: [ffffffffa02598e2] drm_modeset_lock_crtc               hitcount:       2055  bytes_req:     115080
+    call_site: [ffffffffa0081fe6] hid_report_raw_event                hitcount:       6433  bytes_req:     128664
+
+    Totals:
+        Hits: 17384
+        Entries: 62
+        Dropped: 0
+
+  Because the default sort key above is 'hitcount', the above shows
+  the list of call_sites by increasing hitcount, so that at the bottom
+  we see the functions that made the most kmalloc calls during the
+  run.  If instead we wanted to see the top kmalloc callers in terms
+  of the number of bytes requested rather than the number of calls,
+  and we wanted the top caller to appear at the top, we can use the
+  'sort' param along with the '.descending' modifier:
+
+    # echo 'hist:key=call_site.sym:val=bytes_req:sort=bytes_req.descending' > \
+           /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+    # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+    trigger info: hist:keys=call_site.sym:vals=bytes_req:sort=bytes_req:size=2048 [active]
+
+    call_site: [ffffffffa03d7e4c] i915_gem_execbuffer2                hitcount:        682  bytes_req:     766360
+    call_site: [ffffffff811d6c3b] seq_buf_alloc                       hitcount:         46  bytes_req:     186176
+    call_site: [ffffffffa03d6468] i915_gem_do_execbuffer.isra.22      hitcount:        377  bytes_req:     142808
+    call_site: [ffffffffa02598e2] drm_modeset_lock_crtc               hitcount:       1107  bytes_req:      61992
+    call_site: [ffffffffa03fd5f3] intel_ring_begin                    hitcount:        682  bytes_req:      60016
+    call_site: [ffffffffa0081fe6] hid_report_raw_event                hitcount:       2993  bytes_req:      59864
+    call_site: [ffffffff8136b079] sg_kmalloc                          hitcount:        206  bytes_req:      48096
+    call_site: [ffffffffa03dbc5e] i915_gem_obj_lookup_or_create_vma   hitcount:        206  bytes_req:      41200
+    call_site: [ffffffff811a2602] __kmalloc                           hitcount:          2  bytes_req:      32768
+    call_site: [ffffffffa0419062] intel_crtc_page_flip                hitcount:        305  bytes_req:      31720
+    call_site: [ffffffffa024fc32] drm_mode_page_flip_ioctl            hitcount:        305  bytes_req:      26840
+    call_site: [ffffffffa0258dfe] drm_vma_node_allow                  hitcount:        603  bytes_req:      24120
+    call_site: [ffffffffa03d6498] i915_gem_do_execbuffer.isra.22      hitcount:        305  bytes_req:      23072
+    call_site: [ffffffff812775da] ext4_find_extent                    hitcount:         79  bytes_req:       7584
+    .
+    .
+    .
+    call_site: [ffffffff811f3b07] inotify_handle_event                hitcount:          2  bytes_req:         96
+    call_site: [ffffffff811bfb6e] vfs_rename                          hitcount:          6  bytes_req:         66
+    call_site: [ffffffff815d2c1b] sock_alloc_inode                    hitcount:          1  bytes_req:         64
+    call_site: [ffffffff811d0825] alloc_fdtable                       hitcount:          1  bytes_req:         48
+    call_site: [ffffffff81220376] proc_self_follow_link               hitcount:          2  bytes_req:         24
+    call_site: [ffffffff8112e1ec] hist_show                           hitcount:          2  bytes_req:         16
+    call_site: [ffffffff8112e1d0] hist_show                           hitcount:          2  bytes_req:         16
+    call_site: [ffffffff8152db82] usb_control_msg                     hitcount:          1  bytes_req:          8
+    call_site: [ffffffffa008829e] hidraw_send_report                  hitcount:          1  bytes_req:          7
+    call_site: [ffffffffa0087d6a] hidraw_report_event                 hitcount:          1  bytes_req:          7
+
+    Totals:
+        Hits: 9331
+        Entries: 57
+        Dropped: 0
+
+  We can also add multiple fields to the 'values' param.  For example,
+  we might want to see the total number of bytes allocated alongside
+  bytes requested, and display the result sorted by bytes allocated
+  in descending order:
+
+    # echo 'hist:keys=call_site.sym:values=bytes_req,bytes_alloc:sort=bytes_alloc.descending' > \
+           /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+    # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+    trigger info: hist:keys=call_site.sym:vals=bytes_req,bytes_alloc:sort=bytes_alloc:size=2048 [active]
+
+    call_site: [ffffffffa03d7e4c] i915_gem_execbuffer2                hitcount:       4427  bytes_req:    4190032  bytes_alloc:    5967744
+    call_site: [ffffffff811d6c3b] seq_buf_alloc                       hitcount:        427  bytes_req:    1706768  bytes_alloc:    1765376
+    call_site: [ffffffffa03d6468] i915_gem_do_execbuffer.isra.22      hitcount:       1978  bytes_req:     765936  bytes_alloc:    1434304
+    call_site: [ffffffffa03fd5f3] intel_ring_begin                    hitcount:       4427  bytes_req:     389576  bytes_alloc:     424992
+    call_site: [ffffffff8136b079] sg_kmalloc                          hitcount:        993  bytes_req:     305824  bytes_alloc:     376576
+    call_site: [ffffffffa02598e2] drm_modeset_lock_crtc               hitcount:       4319  bytes_req:     241864  bytes_alloc:     276416
+    call_site: [ffffffffa03dbc5e] i915_gem_obj_lookup_or_create_vma   hitcount:        993  bytes_req:     198600  bytes_alloc:     254208
+    call_site: [ffffffffa0419062] intel_crtc_page_flip                hitcount:       1806  bytes_req:     187824  bytes_alloc:     231168
+    call_site: [ffffffff81424d78] __tty_buffer_request_room           hitcount:         61  bytes_req:     128928  bytes_alloc:     229376
+    call_site: [ffffffffa0081fe6] hid_report_raw_event                hitcount:       9550  bytes_req:     191024  bytes_alloc:     210160
+    call_site: [ffffffffa03d6498] i915_gem_do_execbuffer.isra.22      hitcount:       2449  bytes_req:     184720  bytes_alloc:     209216
+    call_site: [ffffffffa0258dfe] drm_vma_node_allow                  hitcount:       3256  bytes_req:     130240  bytes_alloc:     208384
+    .
+    .
+    .
+    call_site: [ffffffffa04e85cb] __ieee80211_start_scan              hitcount:          3  bytes_req:       1533  bytes_alloc:       1536
+    call_site: [ffffffffa02eb7e1] nl80211_trigger_scan                hitcount:          3  bytes_req:       1299  bytes_alloc:       1536
+    call_site: [ffffffff8152cbde] usb_alloc_urb                       hitcount:          6  bytes_req:       1152  bytes_alloc:       1152
+    call_site: [ffffffff811f3b07] inotify_handle_event                hitcount:         14  bytes_req:        744  bytes_alloc:        896
+    call_site: [ffffffff81206b81] load_elf_binary                     hitcount:         24  bytes_req:        672  bytes_alloc:        768
+    call_site: [ffffffff811bfb6e] vfs_rename                          hitcount:         38  bytes_req:        436  bytes_alloc:        640
+    call_site: [ffffffff81220376] proc_self_follow_link               hitcount:         24  bytes_req:        288  bytes_alloc:        384
+    call_site: [ffffffff815d2c1b] sock_alloc_inode                    hitcount:          5  bytes_req:        320  bytes_alloc:        320
+    call_site: [ffffffff811d0825] alloc_fdtable                       hitcount:          4  bytes_req:        192  bytes_alloc:        256
+    call_site: [ffffffffa04ec05f] ieee80211_start_tx_ba_session       hitcount:          1  bytes_req:        232  bytes_alloc:        256
+    call_site: [ffffffffa0244701] drm_vm_open_locked                  hitcount:          5  bytes_req:        160  bytes_alloc:        160
+    call_site: [ffffffff8112e1d0] hist_show                           hitcount:         10  bytes_req:         80  bytes_alloc:         80
+    call_site: [ffffffff8112e1ec] hist_show                           hitcount:         10  bytes_req:         80  bytes_alloc:         80
+    call_site: [ffffffff81074c9e] kthread_create_on_node              hitcount:          1  bytes_req:         56  bytes_alloc:         64
+    call_site: [ffffffff8152db82] usb_control_msg                     hitcount:          6  bytes_req:         48  bytes_alloc:         48
+    call_site: [ffffffffa008829e] hidraw_send_report                  hitcount:          6  bytes_req:         42  bytes_alloc:         48
+    call_site: [ffffffffa0087d6a] hidraw_report_event                 hitcount:          6  bytes_req:         42  bytes_alloc:         48
+
+    Totals:
+        Hits: 45576
+        Entries: 66
+        Dropped: 0
+
+  Finally, to finish off our kmalloc example, instead of simply having
+  the hist trigger display symbolic call_sites, we can have the hist
+  trigger additionally display the complete set of kernel stack traces
+  that led to each call_site.  To do that, we simply use the special
+  value 'stacktrace' for the key param:
+
+    # echo 'hist:keys=stacktrace:values=bytes_req,bytes_alloc:sort=bytes_alloc' > \
+           /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+
+  The above trigger will use the kernel stack trace in effect when an
+  event is triggered as the key for the hash table.  This allows the
+  enumeration of every kernel callpath that led up to a particular
+  event, along with a running total of any of the event fields for
+  that event.  Here we tally bytes requested and bytes allocated for
+  every callpath in the system that led up to a kmalloc (in this case
+  every callpath to a kmalloc for a kernel compile):
+
+    # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+    trigger info: hist:keys=stacktrace:vals=bytes_req,bytes_alloc:sort=bytes_alloc:size=2048 [active]
+
+    stacktrace:
+         __kmalloc_track_caller+0x10b/0x190
+         kstrdup+0x35/0x70
+         __kernfs_new_node+0x34/0xf0
+         kernfs_new_node+0x26/0x50
+         kernfs_create_dir_ns+0x29/0x80
+         sysfs_create_dir_ns+0x40/0xa0
+         kobject_add_internal+0xbd/0x3a0
+         kobject_add+0x60/0xb0
+         device_add+0xff/0x5a0
+         device_register+0x1e/0x30
+         usb_create_ep_devs+0x81/0xd0
+         usb_new_device+0x357/0x410
+         hub_event+0xd68/0x11a0
+         process_one_work+0x149/0x3d0
+         worker_thread+0x121/0x4a0
+         kthread+0xd2/0xf0
+     hitcount:          1  bytes_req:          6  bytes_alloc:          8
+    stacktrace:
+         kmem_cache_alloc_trace+0xfb/0x160
+         usb_control_msg+0x42/0x110
+         hub_port_status+0x84/0x120
+         hub_port_reset+0x263/0x430
+         hub_port_init+0x77/0xbb0
+         hub_event+0x944/0x11a0
+         process_one_work+0x149/0x3d0
+         worker_thread+0x121/0x4a0
+         kthread+0xd2/0xf0
+         ret_from_fork+0x7c/0xb0
+     hitcount:          1  bytes_req:          8  bytes_alloc:          8
+    .
+    .
+    .
+    stacktrace:
+         __kmalloc+0x11b/0x1a0
+         load_elf_phdrs+0x76/0xa0
+         load_elf_binary+0x102/0x1780
+         search_binary_handler+0x97/0x1d0
+         do_execveat_common.isra.32+0x551/0x6e0
+         SyS_execve+0x3a/0x50
+         stub_execve+0x69/0xa0
+     hitcount:     131791  bytes_req:   67822944  bytes_alloc:   80279552
+    stacktrace:
+         __kmalloc+0x11b/0x1a0
+         i915_gem_execbuffer2+0x6c/0x2c0 [i915]
+         drm_ioctl+0x1a4/0x630 [drm]
+         do_vfs_ioctl+0x2f0/0x4f0
+         SyS_ioctl+0x81/0xa0
+         system_call_fastpath+0x12/0x17
+     hitcount:     164540  bytes_req:  134464456  bytes_alloc:  167693376
+    stacktrace:
+         kmem_cache_alloc_trace+0xfb/0x160
+         apparmor_file_alloc_security+0x27/0x40
+         security_file_alloc+0x16/0x20
+         get_empty_filp+0x93/0x1c0
+         path_openat+0x31/0x620
+         do_filp_open+0x3a/0x90
+         do_sys_open+0x128/0x220
+         SyS_open+0x1e/0x20
+         system_call_fastpath+0x12/0x17
+     hitcount:   22118413  bytes_req:   44236888  bytes_alloc:  176947328
+    stacktrace:
+         __kmalloc+0x11b/0x1a0
+         seq_buf_alloc+0x1b/0x50
+         seq_read+0x2cc/0x370
+         proc_reg_read+0x3d/0x80
+         __vfs_read+0x18/0x50
+         vfs_read+0x86/0x140
+         SyS_read+0x46/0xb0
+         system_call_fastpath+0x12/0x17
+     hitcount:      66168  bytes_req:  271015944  bytes_alloc:  271015952
+
+    Totals:
+        Hits: 26780801
+        Entries: 633
+        Dropped: 0
+
+  If you key a hist trigger on pid, for example, to gather and display
+  sorted totals for each process, you can use the special .execname
+  modifier to display the executable names for the processes in the
+  table rather than raw pids.  The example below keeps a per-process
+  sum of total bytes read:
+
+    # echo 'hist:key=common_pid.execname:val=count:sort=count.descending' > \
+           /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
+
+    # cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/hist
+    trigger info: hist:keys=common_pid.execname:vals=count:sort=count:size=2048 [active]
+
+    common_pid: Xorg            [18446612132314219843] hitcount:        166  count:     159744
+    common_pid: gnome-terminal  [18446612132314221635] hitcount:         44  count:     113321
+    common_pid: compiz          [18446612132314221491] hitcount:         13  count:      81968
+    common_pid: bash            [18446612132314231931] hitcount:          3  count:      66369
+    common_pid: gdbus           [18446612132314221135] hitcount:         91  count:       1456
+    common_pid: acpid           [18446612132314219701] hitcount:         48  count:       1152
+    common_pid: ibus-daemon     [18446612132314221129] hitcount:         71  count:       1136
+    common_pid: gdbus           [18446612132314221638] hitcount:         42  count:        672
+    common_pid: gdbus           [18446612132314221345] hitcount:         42  count:        672
+    common_pid: ibus-engine-sim [18446612132314221344] hitcount:         14  count:        224
+    common_pid: gdbus           [18446612132314221262] hitcount:         12  count:        192
+    common_pid: gdbus           [18446612132314221249] hitcount:          6  count:         96
+    common_pid: ibus-ui-gtk3    [18446612132314221245] hitcount:          6  count:         96
+    common_pid: gdbus           [18446612132314221504] hitcount:          5  count:         80
+    common_pid: unity-panel-ser [18446612132314221243] hitcount:          3  count:         48
+    common_pid: postgres        [18446612132314220339] hitcount:          1  count:         16
+    common_pid: gdbus           [18446612132314221395] hitcount:          1  count:         16
+    common_pid: rtkit-daemon    [18446612132314220509] hitcount:          1  count:          8
+    common_pid: upstart-dbus-br [18446612132314221100] hitcount:          4  count:          4
+    common_pid: bash            [18446612132314221924] hitcount:          4  count:          4
+    common_pid: firefox         [18446612132314222258] hitcount:          1  count:          1
+
+    Totals:
+        Hits: 578
+        Entries: 21
+        Dropped: 0
+
+
+  Similarly, if you key a hist trigger on syscall id, for example to
+  gather and display a list of systemwide syscall hits, you can use
+  the special .syscall modifier to display the syscall names rather
+  than raw ids.  The example below keeps a running total of syscall
+  counts for the system during the run:
+
+    # echo 'hist:key=id.syscall:val=hitcount' > \
+           /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
+
+    # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
+    trigger info: hist:keys=id.syscall:vals=:sort=hitcount:size=2048 [active]
+
+    id: sys_fadvise64                  hitcount:          1
+    id: sys_bind                       hitcount:          1
+    id: sys_fsync                      hitcount:          1
+    id: sys_getsockname                hitcount:          1
+    id: sys_readlink                   hitcount:          3
+    id: sys_set_tid_address            hitcount:          4
+    .
+    .
+    .
+    id: sys_mprotect                   hitcount:         57
+    id: sys_waitid                     hitcount:         64
+    id: sys_rt_sigaction               hitcount:        121
+    id: sys_inotify_add_watch          hitcount:        234
+    id: sys_mmap                       hitcount:        258
+    id: sys_nanosleep                  hitcount:        310
+    id: sys_close                      hitcount:        323
+    id: sys_rt_sigprocmask             hitcount:       3520
+    id: sys_futex                      hitcount:       4819
+    id: sys_write                      hitcount:       8907
+    id: sys_setitimer                  hitcount:      39039
+    id: sys_writev                     hitcount:      64528
+    id: sys_poll                       hitcount:      77493
+    id: sys_recvmsg                    hitcount:     168240
+    id: sys_ioctl                      hitcount:     448946
+
+    Totals:
+        Hits: 869971
+        Entries: 75
+        Dropped: 0
+
+  The next example uses a string field as the hash key and
+  demonstrates how you can manually pause and continue a hist trigger.
+  In this example, we'll aggregate fork counts; since we don't expect
+  a large number of entries in the hash table, we'll drop the table
+  size from the default 2048 to a much smaller number, say 256:
+
+    # echo 'hist:key=child_comm:val=hitcount:size=256' > \
+           /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
+
+    # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
+    trigger info: hist:keys=child_comm:vals=:sort=hitcount:size=256 [active]
+
+    child_comm: gdbus                               hitcount:          1
+    child_comm: smbd                                hitcount:          1
+    child_comm: ibus-daemon                         hitcount:          1
+    child_comm: Cache2 I/O                          hitcount:          1
+    child_comm: dconf worker                        hitcount:          1
+    child_comm: postgres                            hitcount:          1
+    child_comm: unity-panel-ser                     hitcount:          2
+    child_comm: compiz                              hitcount:          3
+    child_comm: bash                                hitcount:          3
+    child_comm: firefox                             hitcount:         60
+
+    Totals:
+        Hits: 74
+        Entries: 10
+        Dropped: 0
+
+    # echo 'hist:key=child_comm:val=hitcount:size=256:pause' > \
+           /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
+
+  If we want to pause the hist trigger, we can simply append :pause to
+  the command that started the trigger.  Notice that the trigger info
+  displays as [paused]:
+
+    # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
+    trigger info: hist:keys=child_comm:vals=:sort=hitcount:size=256 [paused]
+
+    child_comm: dconf worker                        hitcount:          1
+    child_comm: gdbus                               hitcount:          1
+    child_comm: Cache2 I/O                          hitcount:          1
+    child_comm: ibus-daemon                         hitcount:          1
+    child_comm: unity-panel-ser                     hitcount:          2
+    child_comm: smbd                                hitcount:          2
+    child_comm: postgres                            hitcount:          3
+    child_comm: compiz                              hitcount:          3
+    child_comm: bash                                hitcount:          3
+    child_comm: emacs                               hitcount:          8
+    child_comm: firefox                             hitcount:         60
+
+    Totals:
+        Hits: 85
+        Entries: 11
+        Dropped: 0
+
+  To manually continue having the trigger aggregate events, append
+  :cont instead.  Notice that the trigger info displays as [active]
+  again, and the data has changed:
+
+    # echo 'hist:key=child_comm:val=hitcount:size=256:cont' > \
+           /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
+
+    # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
+    trigger info: hist:keys=child_comm:vals=:sort=hitcount:size=256 [active]
+
+    child_comm: Cache2 I/O                          hitcount:          1
+    child_comm: ibus-daemon                         hitcount:          1
+    child_comm: kthreadd                            hitcount:          1
+    child_comm: gdbus                               hitcount:          1
+    child_comm: dconf worker                        hitcount:          1
+    child_comm: unity-panel-ser                     hitcount:          2
+    child_comm: smbd                                hitcount:          2
+    child_comm: compiz                              hitcount:          3
+    child_comm: postgres                            hitcount:          4
+    child_comm: bash                                hitcount:          4
+    child_comm: emacs                               hitcount:          8
+    child_comm: firefox                             hitcount:         60
+
+    Totals:
+        Hits: 88
+        Entries: 12
+        Dropped: 0
+
+  The previous example showed how to start and stop a hist trigger by
+  appending 'pause' and 'continue' to the hist trigger command.  A
+  hist trigger can also be started in a paused state by initially
+  starting the trigger with ':pause' appended.  This allows you to
+  start the trigger only when you're ready to start collecting data
+  and not before.  For example, start the trigger in a paused state,
+  then unpause it and do something you want to measure, then pause the
+  trigger when done.
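+
+  A minimal sketch of that pattern, reusing the fork trigger from the
+  previous example:
+
+    # echo 'hist:key=child_comm:val=hitcount:size=256:pause' > \
+           /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
+
+    (when ready to measure, unpause it, then run the workload)
+
+    # echo 'hist:key=child_comm:val=hitcount:size=256:cont' > \
+           /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
+
+    (pause it again before reading the 'hist' file)
+
+    # echo 'hist:key=child_comm:val=hitcount:size=256:pause' > \
+           /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger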
+
+  Of course, doing this manually can be difficult and error-prone, but
+  it is possible to automatically start and stop a hist trigger based
+  on some condition, via the enable_hist and disable_hist triggers.
+
+  For example, suppose we wanted to take a look at the relative
+  weights in terms of skb length for each callpath that leads to a
+  netif_receive_skb event when downloading a decent-sized file using
+  wget.
+
+  First we set up an initially paused stacktrace trigger on the
+  netif_receive_skb event:
+
+    # echo 'hist:key=stacktrace:vals=len:pause' > \
+           /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+
+  Next, we set up an 'enable_hist' trigger on the sched_process_exec
+  event, with an 'if filename==/usr/bin/wget' filter.  The effect of
+  this new trigger is that it will 'unpause' the hist trigger we just
+  set up on netif_receive_skb if and only if it sees a
+  sched_process_exec event with a filename of '/usr/bin/wget'.  When
+  that happens, all netif_receive_skb events are aggregated into a
+  hash table keyed on stacktrace:
+
+    # echo 'enable_hist:net:netif_receive_skb if filename==/usr/bin/wget' > \
+           /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
+
+  The aggregation continues until the netif_receive_skb hist trigger
+  is paused again, which is what the following disable_hist trigger
+  does by creating a similar setup on the sched_process_exit event,
+  using the filter 'comm==wget':
+
+    # echo 'disable_hist:net:netif_receive_skb if comm==wget' > \
+           /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
+
+  Whenever a process exits and its comm matches the disable_hist
+  trigger's 'comm==wget' filter, the netif_receive_skb hist trigger
+  is disabled.
+
+  The overall effect is that netif_receive_skb events are aggregated
+  into the hash table for only the duration of the wget.  Executing a
+  wget command and then listing the 'hist' file will display the
+  output generated by the wget command:
+
+    $ wget https://www.kernel.org/pub/linux/kernel/v3.x/patch-3.19.xz
+
+    # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
+    trigger info: hist:keys=stacktrace:vals=len:sort=hitcount:size=2048 [paused]
+
+    stacktrace:
+         __netif_receive_skb_core+0x4ad/0x780
+         __netif_receive_skb+0x18/0x60
+         process_backlog+0xa8/0x150
+         net_rx_action+0x15d/0x340
+         __do_softirq+0x114/0x2c0
+         do_softirq_own_stack+0x1c/0x30
+         do_softirq+0x65/0x70
+         __local_bh_enable_ip+0xb5/0xc0
+         ip_finish_output+0x1f4/0x810
+         ip_output+0x68/0xa0
+         ip_local_out_sk+0x30/0x40
+         ip_send_skb+0x1a/0x50
+         udp_send_skb+0x173/0x2a0
+         udp_sendmsg+0x2bf/0x9f0
+         inet_sendmsg+0x63/0xb0
+         do_sock_sendmsg+0x8c/0x100
+     hitcount:          4  len:        388
+    stacktrace:
+         __netif_receive_skb_core+0x4ad/0x780
+         __netif_receive_skb+0x18/0x60
+         netif_receive_skb_internal+0x23/0x90
+         napi_gro_receive+0xc8/0x100
+         ieee80211_deliver_skb+0xba/0x250 [mac80211]
+         ieee80211_rx_handlers+0xcc5/0x2240 [mac80211]
+         ieee80211_prepare_and_rx_handle+0x4e7/0xc30 [mac80211]
+         ieee80211_rx+0x31d/0x920 [mac80211]
+         iwlagn_rx_reply_rx+0x3db/0x6f0 [iwldvm]
+         iwl_rx_dispatch+0x8e/0xf0 [iwldvm]
+         iwl_pcie_irq_handler+0xe1c/0x12c0 [iwlwifi]
+         irq_thread_fn+0x20/0x50
+         irq_thread+0x11f/0x150
+         kthread+0xd2/0xf0
+         ret_from_fork+0x7c/0xb0
+     hitcount:         35  len:      17415
+    stacktrace:
+         __netif_receive_skb_core+0x4ad/0x780
+         __netif_receive_skb+0x18/0x60
+         netif_receive_skb_internal+0x23/0x90
+         napi_gro_complete+0xa4/0xe0
+         dev_gro_receive+0x233/0x360
+         napi_gro_receive+0x30/0x100
+         ieee80211_deliver_skb+0xba/0x250 [mac80211]
+         ieee80211_rx_handlers+0xcc5/0x2240 [mac80211]
+         ieee80211_prepare_and_rx_handle+0x4e7/0xc30 [mac80211]
+         ieee80211_rx+0x31d/0x920 [mac80211]
+         iwlagn_rx_reply_rx+0x3db/0x6f0 [iwldvm]
+         iwl_rx_dispatch+0x8e/0xf0 [iwldvm]
+         iwl_pcie_irq_handler+0xe1c/0x12c0 [iwlwifi]
+         irq_thread_fn+0x20/0x50
+         irq_thread+0x11f/0x150
+         kthread+0xd2/0xf0
+     hitcount:        155  len:     636342
+    stacktrace:
+         __netif_receive_skb_core+0x4ad/0x780
+         __netif_receive_skb+0x18/0x60
+         netif_receive_skb_internal+0x23/0x90
+         napi_gro_complete+0xa4/0xe0
+         napi_gro_flush+0x6d/0x90
+         iwl_pcie_irq_handler+0x90f/0x12c0 [iwlwifi]
+         irq_thread_fn+0x20/0x50
+         irq_thread+0x11f/0x150
+         kthread+0xd2/0xf0
+         ret_from_fork+0x7c/0xb0
+     hitcount:       1013  len:    5531908
+
+    Totals:
+        Hits: 1207
+        Entries: 4
+        Dropped: 0
+
+  The above shows all the netif_receive_skb callpaths and their total
+  lengths for the duration of the wget command.
+
+  The 'clear' hist trigger param can be used to clear the hash table.
+  Suppose we wanted to try another run of the previous example but
+  this time also wanted to see the complete list of events that went
+  into the histogram.  In order to avoid having to set everything up
+  again, we can just clear the histogram first:
+
+    # echo 'hist:key=stacktrace:vals=len:clear' > \
+           /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+
+  Just to verify that it is in fact cleared, here's what we now see in
+  the hist file:
+
+    # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
+    trigger info: hist:keys=stacktrace:vals=len:sort=hitcount:size=2048 [paused]
+
+    Totals:
+        Hits: 0
+        Entries: 0
+        Dropped: 0
+
+  Since we want to see the detailed list of every netif_receive_skb
+  event occurring during the new run, which are in fact the same
+  events being aggregated into the hash table, we add 'enable_event'
+  and 'disable_event' triggers to the triggering sched_process_exec
+  and sched_process_exit events, like so:
+
+    # echo 'enable_event:net:netif_receive_skb if filename==/usr/bin/wget' > \
+           /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
+
+    # echo 'disable_event:net:netif_receive_skb if comm==wget' > \
+           /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
+
+  If you read the trigger files for the sched_process_exec and
+  sched_process_exit triggers, you should see two triggers for each:
+  one enabling/disabling the hist aggregation and the other
+  enabling/disabling the logging of events:
+
+    # cat /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
+    enable_event:net:netif_receive_skb:unlimited if filename==/usr/bin/wget
+    enable_hist:net:netif_receive_skb:unlimited if filename==/usr/bin/wget
+
+    # cat /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
+    disable_event:net:netif_receive_skb:unlimited if comm==wget
+    disable_hist:net:netif_receive_skb:unlimited if comm==wget
+
+  In other words, whenever either of the sched_process_exec or
+  sched_process_exit events is hit and matches 'wget', it enables or
+  disables both the histogram and the event log, and what you end up
+  with is a hash table and set of events just covering the specified
+  duration.
+
+  Displaying the 'hist' file should show something similar to what you
+  saw in the last run, but this time you should also see the
+  individual events in the trace file:
+
+    # cat /sys/kernel/debug/tracing/trace
+    # tracer: nop
+    #
+    # entries-in-buffer/entries-written: 183/1426   #P:4
+    #
+    #                              _-----=> irqs-off
+    #                             / _----=> need-resched
+    #                            | / _---=> hardirq/softirq
+    #                            || / _--=> preempt-depth
+    #                            ||| /     delay
+    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+    #              | |       |   ||||       |         |
+                wget-15108 [000] ..s1 31769.606929: netif_receive_skb: dev=lo skbaddr=ffff88009c353100 len=60
+                wget-15108 [000] ..s1 31769.606999: netif_receive_skb: dev=lo skbaddr=ffff88009c353200 len=60
+             dnsmasq-1382  [000] ..s1 31769.677652: netif_receive_skb: dev=lo skbaddr=ffff88009c352b00 len=130
+             dnsmasq-1382  [000] ..s1 31769.685917: netif_receive_skb: dev=lo skbaddr=ffff88009c352200 len=138
+    ##### CPU 2 buffer started ####
+      irq/29-iwlwifi-559   [002] ..s. 31772.031529: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d433d00 len=2948
+      irq/29-iwlwifi-559   [002] ..s. 31772.031572: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d432200 len=1500
+      irq/29-iwlwifi-559   [002] ..s. 31772.032196: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d433100 len=2948
+      irq/29-iwlwifi-559   [002] ..s. 31772.032761: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d433000 len=2948
+      irq/29-iwlwifi-559   [002] ..s. 31772.033220: netif_receive_skb: dev=wlan0 skbaddr=ffff88009d432e00 len=1500
+    .
+    .
+    .
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
  2015-03-02 16:01 ` [PATCH 07/15] mm: Add ___GFP_NOTRACE Tom Zanussi
@ 2015-03-02 16:37   ` Steven Rostedt
  2015-03-02 16:46     ` Tom Zanussi
  0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2015-03-02 16:37 UTC (permalink / raw)
  To: Tom Zanussi; +Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel

On Mon,  2 Mar 2015 10:01:00 -0600
Tom Zanussi <tom.zanussi@linux.intel.com> wrote:

> Add a gfp flag that allows kmalloc() et al to be used in tracing
> functions.
> 
> The problem with using kmalloc for tracing is that the tracing
> subsystem should be able to trace kmalloc itself, which it can't do
> directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
> or kmalloc()->trace_mm_page_alloc()->kmalloc().

This part I don't like at all. Why can't the memory be preallocated
when the hist is created (the echo 'hist:...')?

kmalloc must never be called from any tracepoint callback.

This change is currently a showstopper.

-- Steve


> 
> With this flag, tracing code could use a special version of kmalloc()
> that sets __GFP_NOTRACE on every allocation it does, while leaving the
> normal kmalloc() path untouched.
> 
> This would allow any tracepoint in the kmalloc path to be avoided via
> DEFINE_EVENT_CONDITION() redefinitions of those events, which check
> for ___GFP_NOTRACE immediately in their execution and break if set,
> thereby avoiding the recursion.
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
  2015-03-02 16:37   ` Steven Rostedt
@ 2015-03-02 16:46     ` Tom Zanussi
  2015-03-02 17:58       ` Alexei Starovoitov
  0 siblings, 1 reply; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 16:46 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: masami.hiramatsu.pt, namhyung, andi, ast, linux-kernel

On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
> On Mon,  2 Mar 2015 10:01:00 -0600
> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> 
> > Add a gfp flag that allows kmalloc() et al to be used in tracing
> > functions.
> > 
> > The problem with using kmalloc for tracing is that the tracing
> > subsystem should be able to trace kmalloc itself, which it can't do
> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
> 
> This part I don't like at all. Why can't the memory be preallocated
> when the hist is created (the echo 'hist:...')?
> 

Yeah, I didn't like it either.  My original version did exactly what you
suggest and preallocated an array of entries to 'allocate' from in order
to avoid the problem.

But I wanted to attempt to use the bpf_map directly, which already uses
kmalloc internally.  My fallback in case this wouldn't fly, which it
obviously won't, would be to add an option to have the bpf_map code
preallocate a maximum number of entries or pass in a client-owned array
for the purpose.  I'll do that if I don't hear any better ideas..

Tom

> kmalloc must never be called from any tracepoint callback.
> 
> This change is currently a showstopper.
> 
> -- Steve
> 
> 
> > 
> > With this flag, tracing code could use a special version of kmalloc()
> > that sets __GFP_NOTRACE on every allocation it does, while leaving the
> > normal kmalloc() path untouched.
> > 
> > This would allow any tracepoint in the kmalloc path to be avoided via
> > DEFINE_EVENT_CONDITION() redefinitions of those events, which check
> > for ___GFP_NOTRACE immediately in their execution and break if set,
> > thereby avoiding the recursion.
> > 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
  2015-03-02 16:46     ` Tom Zanussi
@ 2015-03-02 17:58       ` Alexei Starovoitov
  2015-03-02 18:03         ` Tom Zanussi
  0 siblings, 1 reply; 45+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 17:58 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML

On Mon, Mar 2, 2015 at 8:46 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
>> On Mon,  2 Mar 2015 10:01:00 -0600
>> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>>
>> > Add a gfp flag that allows kmalloc() et al to be used in tracing
>> > functions.
>> >
>> > The problem with using kmalloc for tracing is that the tracing
>> > subsystem should be able to trace kmalloc itself, which it can't do
>> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
>> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
>>
>> This part I don't like at all. Why can't the memory be preallocated
>> when the hist is created (the echo 'hist:...')?
>>
>
> Yeah, I didn't like it either.  My original version did exactly what you
> suggest and preallocated an array of entries to 'allocate' from in order
> to avoid the problem.
>
> But I wanted to attempt to use the bpf_map directly, which already uses
> kmalloc internally.  My fallback in case this wouldn't fly, which it
> obviously won't, would be to add an option to have the bpf_map code
> preallocate a maximum number of entries or pass in a client-owned array
> for the purpose.  I'll do that if I don't hear any better ideas..

Tom, I'm still reading through the patch set.
Quick comment for the above.
Currently there are two map types: array and hash.
array type is pre-allocating all memory at map creation time.
hash is allocating on demand.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
  2015-03-02 17:58       ` Alexei Starovoitov
@ 2015-03-02 18:03         ` Tom Zanussi
  2015-03-02 18:12           ` Alexei Starovoitov
  0 siblings, 1 reply; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 18:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML

On Mon, 2015-03-02 at 09:58 -0800, Alexei Starovoitov wrote:
> On Mon, Mar 2, 2015 at 8:46 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> > On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
> >> On Mon,  2 Mar 2015 10:01:00 -0600
> >> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> >>
> >> > Add a gfp flag that allows kmalloc() et al to be used in tracing
> >> > functions.
> >> >
> >> > The problem with using kmalloc for tracing is that the tracing
> >> > subsystem should be able to trace kmalloc itself, which it can't do
> >> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
> >> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
> >>
> >> This part I don't like at all. Why can't the memory be preallocated
> >> when the hist is created (the echo 'hist:...')?
> >>
> >
> > Yeah, I didn't like it either.  My original version did exactly what you
> > suggest and preallocated an array of entries to 'allocate' from in order
> > to avoid the problem.
> >
> > But I wanted to attempt to use the bpf_map directly, which already uses
> > kmalloc internally.  My fallback in case this wouldn't fly, which it
> > obviously won't, would be to add an option to have the bpf_map code
> > preallocate a maximum number of entries or pass in a client-owned array
> > for the purpose.  I'll do that if I don't hear any better ideas..
> 
> Tom, I'm still reading through the patch set.
> Quick comment for the above.
> Currently there are two map types: array and hash.
> array type is pre-allocating all memory at map creation time.
> hash is allocating on demand.

OK, so would it make sense to do the same for the hash type, or at least
add an option that does that?

Tom



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
  2015-03-02 18:03         ` Tom Zanussi
@ 2015-03-02 18:12           ` Alexei Starovoitov
  2015-03-02 18:25             ` Tom Zanussi
  2015-03-02 18:43             ` Steven Rostedt
  0 siblings, 2 replies; 45+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 18:12 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML

On Mon, Mar 2, 2015 at 10:03 AM, Tom Zanussi
<tom.zanussi@linux.intel.com> wrote:
> On Mon, 2015-03-02 at 09:58 -0800, Alexei Starovoitov wrote:
>> On Mon, Mar 2, 2015 at 8:46 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>> > On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
>> >> On Mon,  2 Mar 2015 10:01:00 -0600
>> >> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>> >>
>> >> > Add a gfp flag that allows kmalloc() et al to be used in tracing
>> >> > functions.
>> >> >
>> >> > The problem with using kmalloc for tracing is that the tracing
>> >> > subsystem should be able to trace kmalloc itself, which it can't do
>> >> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
>> >> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
>> >>
>> >> This part I don't like at all. Why can't the memory be preallocated
>> >> when the hist is created (the echo 'hist:...')?
>> >>
>> >
>> > Yeah, I didn't like it either.  My original version did exactly what you
>> > suggest and preallocated an array of entries to 'allocate' from in order
>> > to avoid the problem.
>> >
>> > But I wanted to attempt to use the bpf_map directly, which already uses
>> > kmalloc internally.  My fallback in case this wouldn't fly, which it
>> > obviously won't, would be to add an option to have the bpf_map code
>> > preallocate a maximum number of entries or pass in a client-owned array
>> > for the purpose.  I'll do that if I don't hear any better ideas..
>>
>> Tom, I'm still reading through the patch set.
>> Quick comment for the above.
>> Currently there are two map types: array and hash.
>> array type is pre-allocating all memory at map creation time.
>> hash is allocating on demand.
>
> OK, so would it make sense to do the same for the hash type, or at least
> add an option that does that?

I'm not sure what would be the meaning of hash map that has all
elements pre-allocated...
As I'm reading your cover letter, I agree, we need to find a way
to call kmalloc_notrace-like from tracepoints.
Not sure that patch 8 style of duplicating the functions is clean.
Can we keep kmalloc/kfree as-is and do something like
if (in_tracepoint()) check inside ftrace_raw_kmalloc*  ?
so that kmalloc will be traced but calls to kmalloc from inside
tracepoints will be automatically suppressed ?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
  2015-03-02 18:12           ` Alexei Starovoitov
@ 2015-03-02 18:25             ` Tom Zanussi
  2015-03-02 18:43             ` Steven Rostedt
  1 sibling, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 18:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML

On Mon, 2015-03-02 at 10:12 -0800, Alexei Starovoitov wrote:
> On Mon, Mar 2, 2015 at 10:03 AM, Tom Zanussi
> <tom.zanussi@linux.intel.com> wrote:
> > On Mon, 2015-03-02 at 09:58 -0800, Alexei Starovoitov wrote:
> >> On Mon, Mar 2, 2015 at 8:46 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> >> > On Mon, 2015-03-02 at 11:37 -0500, Steven Rostedt wrote:
> >> >> On Mon,  2 Mar 2015 10:01:00 -0600
> >> >> Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> >> >>
> >> >> > Add a gfp flag that allows kmalloc() et al to be used in tracing
> >> >> > functions.
> >> >> >
> >> >> > The problem with using kmalloc for tracing is that the tracing
> >> >> > subsystem should be able to trace kmalloc itself, which it can't do
> >> >> > directly because of paths like kmalloc()->trace_kmalloc()->kmalloc()
> >> >> > or kmalloc()->trace_mm_page_alloc()->kmalloc().
> >> >>
> >> >> This part I don't like at all. Why can't the memory be preallocated
> >> >> when the hist is created (the echo 'hist:...')?
> >> >>
> >> >
> >> > Yeah, I didn't like it either.  My original version did exactly what you
> >> > suggest and preallocated an array of entries to 'allocate' from in order
> >> > to avoid the problem.
> >> >
> >> > But I wanted to attempt to use the bpf_map directly, which already uses
> >> > kmalloc internally.  My fallback in case this wouldn't fly, which it
> >> > obviously won't, would be to add an option to have the bpf_map code
> >> > preallocate a maximum number of entries or pass in a client-owned array
> >> > for the purpose.  I'll do that if I don't hear any better ideas..
> >>
> >> Tom, I'm still reading through the patch set.
> >> Quick comment for the above.
> >> Currently there are two map types: array and hash.
> >> array type is pre-allocating all memory at map creation time.
> >> hash is allocating on demand.
> >
> > OK, so would it make sense to do the same for the hash type, or at least
> > add an option that does that?
> 
> I'm not sure what would be the meaning of hash map that has all
> elements pre-allocated...

The idea would be that instead of getting your individually kmalloc'ed
elements on-demand from kmalloc while in the handler, you'd get them
from a pool you've pre-allocated when you set up the table.  This could
be from a list of individual entries you've already kmalloc'ed ahead of
time, or from an array of n * sizeof(entry).

This would also allow you to avoid GFP_ATOMIC for those.
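
Something like this, say (a hypothetical sketch; none of these names
are from the patchset):

  struct entry_pool {
          void            *entries;      /* n_entries * entry_size,
                                            kmalloc'ed at trigger setup */
          unsigned int    entry_size;
          unsigned int    n_entries;
          atomic_t        next_free;
  };

  static void *pool_alloc(struct entry_pool *pool)
  {
          unsigned int idx = atomic_inc_return(&pool->next_free) - 1;

          if (idx >= pool->n_entries)
                  return NULL;            /* pool exhausted: drop the event */
          return pool->entries + idx * pool->entry_size;
  }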

> As I'm reading your cover letter, I agree, we need to find a way
> to call kmalloc_notrace-like from tracepoints.
> Not sure that patch 8 style of duplicating the functions is clean.

No, it's horrible, but it does the job without changing the normal path
at all.

> Can we keep kmalloc/kfree as-is and do something like
> if (in_tracepoint()) check inside ftrace_raw_kmalloc*  ?

Yeah, that's essentially what TP_CONDITION() in patch 8 (Make kmem
memory allocation tracepoints conditional) does.
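
Roughly like this (a sketch of the approach, not the literal patch
text):

  /* include/trace/events/kmem.h: make the event conditional, so it
     fires only for allocations not flagged as coming from tracing */
  DEFINE_EVENT_CONDITION(kmem_alloc, kmalloc,
          TP_PROTO(unsigned long call_site, const void *ptr,
                   size_t bytes_req, size_t bytes_alloc, gfp_t gfp_flags),
          TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags),
          TP_CONDITION(!(gfp_flags & ___GFP_NOTRACE))
  );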

Tom

> so that kmalloc will be traced but calls to kmalloc from inside
> tracepoints will be automatically suppressed ?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/15] mm: Add ___GFP_NOTRACE
  2015-03-02 18:12           ` Alexei Starovoitov
  2015-03-02 18:25             ` Tom Zanussi
@ 2015-03-02 18:43             ` Steven Rostedt
  1 sibling, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2015-03-02 18:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML

On Mon, 2 Mar 2015 10:12:32 -0800
Alexei Starovoitov <ast@plumgrid.com> wrote:

> I'm not sure what would be the meaning of hash map that has all
> elements pre-allocated...
> As I'm reading your cover letter, I agree, we need to find a way
> to call kmalloc_notrace-like from tracepoints.
> Not sure that patch 8 style of duplicating the functions is clean.
> Can we keep kmalloc/kfree as-is and do something like
> if (in_tracepoint()) check inside ftrace_raw_kmalloc*  ?

I'm strongly against this. You should not be doing anything in a
tracepoint that you can't do from NMI context. And calling kmalloc
happens to be one of them.

Not to mention, kmalloc is a hot path, and tracing must not have any
impact on it (no extra if statements).

-- Steve

> so that kmalloc will be traced but calls to kmalloc from inside
> tracepoints will be automatically suppressed ?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
                   ` (14 preceding siblings ...)
  2015-03-02 16:01 ` [PATCH 15/15] tracing: Add 'hist' trigger Documentation Tom Zanussi
@ 2015-03-03  2:25 ` Masami Hiramatsu
  2015-03-03 14:47   ` Tom Zanussi
  15 siblings, 1 reply; 45+ messages in thread
From: Masami Hiramatsu @ 2015-03-03  2:25 UTC (permalink / raw)
  To: Tom Zanussi; +Cc: rostedt, namhyung, andi, ast, linux-kernel

(2015/03/03 1:00), Tom Zanussi wrote:
> This is v2 of my previously posted 'hashtriggers' patchset [1], but
> renamed to 'hist triggers' following feedback from v1.

This is what I need :) The trigger interface gives us better flexibility
for any environment. With this series I believe 80% of the uses of
"scripting tracing" can be replaced with just "echo'ing tracing" via tracefs :)

> 
> Since then, the kernel has gained a tracing map implementation in the
> form of bpf_map, which this patchset makes a bit more generic, exports
> and uses (as tracing_map_*, still in the bpf syscall file however).
> 
> A large part of the initial hash triggers implementation was devoted
> to a map implementation and general-purpose hashing functions, which
> have now been subsumed by the bpf maps.  I've completely redone the
> trigger patches themselves to work on top of tracing_map.  The result
> is a much simpler and easier-to-review patchset that's able to focus
> more directly on the problem at hand.
> 
> The new version addresses all the comments from the previous review,
> including changing the name from hash->hist, adding separate 'hist'
> files for the output, and moving the examples into Documentation.
> 
> This patchset also includes a couple other new and related triggers,
> enable_hist and disable_hist, very similar to the existing
> enable_event/disable_event triggers used to automatically enable and
> disable events based on a triggering condition, but in this case
> allowing hist triggers to be enabled and disabled in the same way.
> 
> The only problem with using the bpf_map implementation for this is
> that it uses kmalloc internally, which causes problems when trying to
> trace kmalloc itself.  I'm guessing the ebpf tracing code would also
> share this problem e.g. when using bpf_maps from probes on kmalloc().
> This patchset attempts a solution to that problem (by adding a
> gfp_flag and changing the kmem memory allocation tracepoints to
> conditional variants), but I'm not sure it's the best way to address it.

That is not a solution for kprobe-based events, nor for events in
interrupt context.
Can we reserve some amount of memory for bpf_map? If the reserved
memory is exceeded, we can choose to (A) disable the hist or (B)
continue with kmalloc.
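
Something along these lines, for example (hypothetical names, just to
illustrate option (B)):

  static void *hist_entry_alloc(struct hist_pool *pool, size_t size)
  {
          void *entry = reserve_pool_get(pool);   /* preallocated reserve */

          if (!entry)
                  entry = kmalloc(size, GFP_ATOMIC);      /* option (B) */
          /* option (A) would return NULL here and disable the hist */
          return entry;
  }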

> 
> There are a couple of important bits of functionality that were
> present in v1 but dropped in v2 mainly because I'm still trying to
> figure out the best way to accomplish those things using the bpf_map
> implementation.
> 
> The first is support for compound keys.  Currently, maps can only be
> keyed on a single event field, whereas in v1 they could be keyed on
> multiple keys.  With support for compound keys, you can create much
> more interesting output, such as for example per-pid lists of
> syscalls or read counts e.g.:
> 
>   # echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount' > \
>         /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
> 
>   # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
> 
>   key: common_pid:bash[3112], id:sys_write		       vals: count:69
>   key: common_pid:bash[3112], id:sys_rt_sigprocmask	       vals: count:218
> 
>   key: common_pid:update-notifier[3164], id:sys_poll	       vals: count:37
>   key: common_pid:update-notifier[3164], id:sys_recvfrom       vals: count:118
> 
>   key: common_pid:deja-dup-monito[3194], id:sys_sendto	       vals: count:1
>   key: common_pid:deja-dup-monito[3194], id:sys_read	       vals: count:4
>   key: common_pid:deja-dup-monito[3194], id:sys_poll	       vals: count:8
>   key: common_pid:deja-dup-monito[3194], id:sys_recvmsg	       vals: count:8
>   key: common_pid:deja-dup-monito[3194], id:sys_getegid	       vals: count:8
> 
>   key: common_pid:emacs[3275], id:sys_fsync		       vals: count:1
>   key: common_pid:emacs[3275], id:sys_open		       vals: count:1
>   key: common_pid:emacs[3275], id:sys_symlink		       vals: count:2
>   key: common_pid:emacs[3275], id:sys_poll		       vals: count:23
>   key: common_pid:emacs[3275], id:sys_select		       vals: count:23
>   key: common_pid:emacs[3275], id:unknown_syscall	       vals: count:34
>   key: common_pid:emacs[3275], id:sys_ioctl		       vals: count:60
>   key: common_pid:emacs[3275], id:sys_rt_sigprocmask	       vals: count:116
> 
>   key: common_pid:cat[3323], id:sys_munmap		       vals: count:1
>   key: common_pid:cat[3323], id:sys_fadvise64		       vals: count:1

Very impressive! :)

Thank you,

> 
> Related to that is support for sorting on multiple fields.  Currently,
> you can sort using only a primary key.  Being able to sort on multiple
> or at least a secondary key is indispensable for seeing trends when
> displaying multiple values.
> 
> [1] http://thread.gmane.org/gmane.linux.kernel/1673551
> 
> Changes from v1:
>  - completely rewritten on top of tracing_map (renamed and exported bpf_map)
>  - added map clearing and client ops to tracing_map
>  - changed the name from 'hash' triggers to 'hist' triggers
>  - added new trigger 'pause' feature
>  - added new enable_hist and disable_hist triggers
>  - added usage for hist/enable_hist/disable_hist to tracing/README
>  - moved examples into Documentation/trace/events.txt
>  - added ___GFP_NOTRACE, kmalloc/kfree macros, and conditional kmem tracepoints
> 
> The following changes since commit 49058038a12cfd9044146a1bf4b286781268d5c9:
> 
>   ring-buffer: Do not wake up a splice waiter when page is not full (2015-02-24 14:00:41 -0600)
> 
> are available in the git repository at:
> 
>   git://git.yoctoproject.org/linux-yocto-contrib.git tzanussi/hist-triggers-v2
>   http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/log/?h=tzanussi/hist-triggers-v2
> 
> Tom Zanussi (15):
>   tracing: Make ftrace_event_field checking functions available
>   tracing: Add event record param to trigger_ops.func()
>   tracing: Add get_syscall_name()
>   bpf: Export bpf map functionality as trace_map_*
>   bpf: Export a map-clearing function
>   bpf: Add tracing_map client ops
>   mm: Add ___GFP_NOTRACE
>   tracing: Make kmem memory allocation tracepoints conditional
>   tracing: Add kmalloc/kfree macros
>   bpf: Make tracing_map use kmalloc/kfree_notrace()
>   tracing: Add a per-event-trigger 'paused' field
>   tracing: Add 'hist' event trigger command
>   tracing: Add sorting to hist triggers
>   tracing: Add enable_hist/disable_hist triggers
>   tracing: Add 'hist' trigger Documentation
> 
>  Documentation/trace/events.txt      |  870 +++++++++++++++++++++
>  include/linux/bpf.h                 |   15 +
>  include/linux/ftrace_event.h        |    9 +-
>  include/linux/gfp.h                 |    3 +-
>  include/linux/slab.h                |   61 +-
>  include/trace/events/kmem.h         |   28 +-
>  kernel/bpf/arraymap.c               |   16 +
>  kernel/bpf/hashtab.c                |   39 +-
>  kernel/bpf/syscall.c                |  193 ++++-
>  kernel/trace/trace.c                |   48 ++
>  kernel/trace/trace.h                |   25 +-
>  kernel/trace/trace_events.c         |    3 +
>  kernel/trace/trace_events_filter.c  |   15 +-
>  kernel/trace/trace_events_trigger.c | 1466 ++++++++++++++++++++++++++++++++++-
>  kernel/trace/trace_syscalls.c       |   11 +
>  mm/slab.c                           |   45 +-
>  mm/slob.c                           |   45 +-
>  mm/slub.c                           |   47 +-
>  18 files changed, 2795 insertions(+), 144 deletions(-)
> 


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-03  2:25 ` [PATCH v2 00/15] tracing: 'hist' triggers Masami Hiramatsu
@ 2015-03-03 14:47   ` Tom Zanussi
  0 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-03 14:47 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: rostedt, namhyung, andi, ast, linux-kernel

On Tue, 2015-03-03 at 11:25 +0900, Masami Hiramatsu wrote:
> (2015/03/03 1:00), Tom Zanussi wrote:
> > This is v2 of my previously posted 'hashtriggers' patchset [1], but
> > renamed to 'hist triggers' following feedback from v1.
> 
> This is what I need :) The trigger interface gives us better flexibility
> for any environment. With this series I believe 80% of the uses of
> "scripting tracing" can be replaced with just "echo'ing tracing" via tracefs :)
> 

Glad you like it, thanks!

> > 
> > Since then, the kernel has gained a tracing map implementation in the
> > form of bpf_map, which this patchset makes a bit more generic, exports
> > and uses (as tracing_map_*, still in the bpf syscall file however).
> > 
> > A large part of the initial hash triggers implementation was devoted
> > to a map implementation and general-purpose hashing functions, which
> > have now been subsumed by the bpf maps.  I've completely redone the
> > trigger patches themselves to work on top of tracing_map.  The result
> > is a much simpler and easier-to-review patchset that's able to focus
> > more directly on the problem at hand.
> > 
> > The new version addresses all the comments from the previous review,
> > including changing the name from hash->hist, adding separate 'hist'
> > files for the output, and moving the examples into Documentation.
> > 
> > This patchset also includes a couple other new and related triggers,
> > enable_hist and disable_hist, very similar to the existing
> > enable_event/disable_event triggers used to automatically enable and
> > disable events based on a triggering condition, but in this case
> > allowing hist triggers to be enabled and disabled in the same way.
> > 
> > The only problem with using the bpf_map implementation for this is
> > that it uses kmalloc internally, which causes problems when trying to
> > trace kmalloc itself.  I'm guessing the ebpf tracing code would also
> > share this problem e.g. when using bpf_maps from probes on kmalloc().
> > This patchset attempts a solution to that problem (by adding a
> > gfp_flag and changing the kmem memory allocation tracepoints to
> > conditional variants), but I'm not sure it's the best way to address it.
> 
> That is not a solution for kprobe-based events, nor for events in
> interrupt context.
> Can we reserve some amount of memory for bpf_map? If the reserved
> memory is exceeded, we can choose to (A) disable the hist or (B)
> continue with kmalloc.
> 

Yeah, the non-bpf_map v1 did (A) with reserved memory.  I'll take a look
at doing that again for the next version.   

> > 
> > There are a couple of important bits of functionality that were
> > present in v1 but dropped in v2 mainly because I'm still trying to
> > figure out the best way to accomplish those things using the bpf_map
> > implementation.
> > 
> > The first is support for compound keys.  Currently, maps can only be
> > keyed on a single event field, whereas in v1 they could be keyed on
> > multiple keys.  With support for compound keys, you can create much
> > more interesting output, such as for example per-pid lists of
> > syscalls or read counts e.g.:
> > 
> >   # echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount' > \
> >         /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
> > 
> >   # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
> > 
> >   key: common_pid:bash[3112], id:sys_write		       vals: count:69
> >   key: common_pid:bash[3112], id:sys_rt_sigprocmask	       vals: count:218
> > 
> >   key: common_pid:update-notifier[3164], id:sys_poll	       vals: count:37
> >   key: common_pid:update-notifier[3164], id:sys_recvfrom       vals: count:118
> > 
> >   key: common_pid:deja-dup-monito[3194], id:sys_sendto	       vals: count:1
> >   key: common_pid:deja-dup-monito[3194], id:sys_read	       vals: count:4
> >   key: common_pid:deja-dup-monito[3194], id:sys_poll	       vals: count:8
> >   key: common_pid:deja-dup-monito[3194], id:sys_recvmsg	       vals: count:8
> >   key: common_pid:deja-dup-monito[3194], id:sys_getegid	       vals: count:8
> > 
> >   key: common_pid:emacs[3275], id:sys_fsync		       vals: count:1
> >   key: common_pid:emacs[3275], id:sys_open		       vals: count:1
> >   key: common_pid:emacs[3275], id:sys_symlink		       vals: count:2
> >   key: common_pid:emacs[3275], id:sys_poll		       vals: count:23
> >   key: common_pid:emacs[3275], id:sys_select		       vals: count:23
> >   key: common_pid:emacs[3275], id:unknown_syscall	       vals: count:34
> >   key: common_pid:emacs[3275], id:sys_ioctl		       vals: count:60
> >   key: common_pid:emacs[3275], id:sys_rt_sigprocmask	       vals: count:116
> > 
> >   key: common_pid:cat[3323], id:sys_munmap		       vals: count:1
> >   key: common_pid:cat[3323], id:sys_fadvise64		       vals: count:1
> 
> Very impressive! :)
> 

Thanks!

Tom

> Thank you,
> 
> > 
> > Related to that is support for sorting on multiple fields.  Currently,
> > you can sort using only a primary key.  Being able to sort on multiple
> > or at least a secondary key is indispensable for seeing trends when
> > displaying multiple values.
> > 
> > [1] http://thread.gmane.org/gmane.linux.kernel/1673551
> > 
> > Changes from v1:
> >  - completely rewritten on top of tracing_map (renamed and exported bpf_map)
> >  - added map clearing and client ops to tracing_map
> >  - changed the name from 'hash' triggers to 'hist' triggers
> >  - added new trigger 'pause' feature
> >  - added new enable_hist and disable_hist triggers
> >  - added usage for hist/enable_hist/disable_hist to tracing/README
> >  - moved examples into Documentation/trace/events.txt
> >  - added ___GFP_NOTRACE, kmalloc/kfree macros, and conditional kmem tracepoints
> > 
> > The following changes since commit 49058038a12cfd9044146a1bf4b286781268d5c9:
> > 
> >   ring-buffer: Do not wake up a splice waiter when page is not full (2015-02-24 14:00:41 -0600)
> > 
> > are available in the git repository at:
> > 
> >   git://git.yoctoproject.org/linux-yocto-contrib.git tzanussi/hist-triggers-v2
> >   http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/log/?h=tzanussi/hist-triggers-v2
> > 
> > Tom Zanussi (15):
> >   tracing: Make ftrace_event_field checking functions available
> >   tracing: Add event record param to trigger_ops.func()
> >   tracing: Add get_syscall_name()
> >   bpf: Export bpf map functionality as trace_map_*
> >   bpf: Export a map-clearing function
> >   bpf: Add tracing_map client ops
> >   mm: Add ___GFP_NOTRACE
> >   tracing: Make kmem memory allocation tracepoints conditional
> >   tracing: Add kmalloc/kfree macros
> >   bpf: Make tracing_map use kmalloc/kfree_notrace()
> >   tracing: Add a per-event-trigger 'paused' field
> >   tracing: Add 'hist' event trigger command
> >   tracing: Add sorting to hist triggers
> >   tracing: Add enable_hist/disable_hist triggers
> >   tracing: Add 'hist' trigger Documentation
> > 
> >  Documentation/trace/events.txt      |  870 +++++++++++++++++++++
> >  include/linux/bpf.h                 |   15 +
> >  include/linux/ftrace_event.h        |    9 +-
> >  include/linux/gfp.h                 |    3 +-
> >  include/linux/slab.h                |   61 +-
> >  include/trace/events/kmem.h         |   28 +-
> >  kernel/bpf/arraymap.c               |   16 +
> >  kernel/bpf/hashtab.c                |   39 +-
> >  kernel/bpf/syscall.c                |  193 ++++-
> >  kernel/trace/trace.c                |   48 ++
> >  kernel/trace/trace.h                |   25 +-
> >  kernel/trace/trace_events.c         |    3 +
> >  kernel/trace/trace_events_filter.c  |   15 +-
> >  kernel/trace/trace_events_trigger.c | 1466 ++++++++++++++++++++++++++++++++++-
> >  kernel/trace/trace_syscalls.c       |   11 +
> >  mm/slab.c                           |   45 +-
> >  mm/slob.c                           |   45 +-
> >  mm/slub.c                           |   47 +-
> >  18 files changed, 2795 insertions(+), 144 deletions(-)
> > 
> 
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 19:55 ` Tom Zanussi
@ 2015-03-09 11:38   ` He Kuang
  0 siblings, 0 replies; 45+ messages in thread
From: He Kuang @ 2015-03-09 11:38 UTC (permalink / raw)
  To: Tom Zanussi, Alexei Starovoitov
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

hi, Tom

On 2015/3/3 3:55, Tom Zanussi wrote:
> BTW, I've actually tried to play around with the BPF samples/, but it 
> seems they're not actually hooked into the system i.e. the samples 
> Makefile doesn't build them, and it even looks for tools/llvm that's 
> not there. I got as far as getting the latest llvm from the location 
> mentioned in one of the bpf commit messages, but gave up after it
> told me 'invalid target bpf'. And I couldn't find any documentation on 
> how to set it all up - did I just miss that? Tom 

I'd also encountered the same problem:

Configure with --enable-experimental-targets=BPF; the 'BPF' letters must
be uppercase. Give it a try.
> Thanks
>



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-04  2:22   ` Alexei Starovoitov
@ 2015-03-04  5:01     ` Tom Zanussi
  0 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-04  5:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Tue, 2015-03-03 at 18:22 -0800, Alexei Starovoitov wrote:
> On 3/3/15 7:47 AM, Tom Zanussi wrote:
> > On Mon, 2015-03-02 at 18:31 -0800, Alexei Starovoitov wrote:
> >> On Mon, Mar 2, 2015 at 5:18 PM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> 
> >> I'm not proposing to use asm or C for this 'hist->bpf' tool.
> >> Keep proposed 'hist:keys=...:vals=...' syntax and generate
> >> bpf program on the fly based on this string.
> >> So this user tool will take string, generate program, load
> >> and attach it. Everything based on this single string input.
> >> With the examples you mentioned in docs, it's not hard.
> >> It will be more involved than patch 12, but way more generic
> >> and easily extensible when 'hist:keys=...' string would need
> >> to be extended.
> >>
> >
> > Hmm, this seems silly to me - doing all that work just to get back to
> > where we already are.
> 
> your 'hist->bpf' tool could have been the first to solve the 'bpf hard
> to use' problem, and over time it could have evolved into a full dtrace
> alternative.
> Whereas by adding 'hist:keys=..' parsing to kernel you'll be stuck
> with it and somebody else's dtrace-like tool will supersede it.
> 

I think there's some misunderstanding there - it was never my intent to
create a full dtrace alternative, really the idea was (and has been,
even before there was any such thing as ebpf in the kernel) only to
provide access to some low-hanging fruit via the standard read/write
interfaces people are used to with ftrace.  The dtrace-like tools were
what I thought ebpf was all about, where users could go after exhausting
the possibilities of a simple table-driven first approach to a problem. 

> > I don't think you can start requiring all new tracing functionality to
> > embed a code generator/compiler, or as a result force any particular
> > interface on users.
> 
> force interface on users?!
> I think I've explained enough that bpf would not be a user visible
> interface. It's kernel to user land interface. In my proposal of
> 'hist->bpf' tool users won't even see that bpf is being generated
> and run underneath. Same with dtrace-like scripting. Users will
> see scripting language and won't care how it's compiled and
> executed inside kernel. Multiple different languages and interfaces
> are possible. Including hist-like text that's not a language.
> 

What about the kernel to user land interface that users have already
been asking for, pseudo-files in debugfs/tracefs?  If you can't do that,
then you're forcing an interface on users.

> > If it's possible to factor out a common
> > infrastructure that can accommodate different user interfaces and
> > preferences, don't you think it makes sense to do that?
> >
> > In any case, outside of the boilerplate tracing infrastructure code, it
> > seems to me that a lot of the code in the main hist trigger patches is
> > code that you'd also need for a tracepoint/bpf interface.  I think we've
> > already agreed that it would be good to work towards being able to share
> > that, so for the next version, I'll see what I can come up with.
> 
> yes. let's see what pieces can be shared. Though I don't want to
> sacrifice long term goals for short term solution.
> In case of bpf_maps, the objective is to share data between
> kernel and user space by single abstraction with different
> implementations underneath. In some use cases kernel programs only
> do lookups into maps, while user space concurrently adding/deleting
> entries (so no allocations from programs)
> In your patch 4 you're adding kernel only access to maps but planning
> to use hash type only and also not exposing these maps to user space
> at all... that defeats the bpf_map purpose and you're only reusing ~200
> lines of hashtab.c code, but also need to hack it for pre-allocate
> and make config_bpf_syscall a strong dependency for 'hist'. why?
> 'hist' patch set would have been much shorter if you just used
> hashtable.h or rhashtable.h or even rbtree.h (so you don't need
> to do sorting in patch 13). The actual hash insert/walk code
> is tiny and trivial.

Well, I kind of hinted in the cover letter that ultimately I thought the
tracing_map common code would move out of bpf/syscall.c, and thus break
the dependency on bpf_syscall.  I didn't want to spend time actually
doing that before knowing how the general idea would fly, and am glad
now that I didn't.

But the reason I thought of using it in the first place was that I saw a
map implementation specially designed for tracing and thought it would
make sense to reuse it if possible for what I thought was a
complementary tracing tool.  I did find it useful, and using it did
simplify my overall code, but it seems it's more trouble than it's worth
and not particularly welcome, so yeah, I probably will just look at
something else.

Thanks for your input,

Tom

> I've looked through the patches again and they seem to have serious
> bugs when it comes to object life time:
> - patch 5 that adds clear_map is completely broken, calling
> delete_all_elements() by wrapping it in rcu_read_lock() ?!
> - patch 6 adds 'free_elem' callback that is called right from
> htab_map_delete_elem() while we're still under rcu.. ?!
> - patch 12 does
> struct hist_trigger_entry *entry;
> tracing_map_lookup_elem(hist_data->map, next_key, &entry);
> hist_trigger_entry_print(m, hist_data, next_key, entry);
> so when you're storing a pointer inside the map the rcu protection
> of map_lookup protects only the value of 'entry' pointer.
> The actual contents of *entry are not protected by anything,
> so even if you fix 5 and 6, you'll still have severe memory corruptions,
> since nothing guarantees that contents of *entry are valid when
> hist_trigger_entry_print() is trying to access it.
> 
> I think you'd be better off using vanilla hashtable.h
> 
> Thanks
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-03 15:47 ` Tom Zanussi
@ 2015-03-04  2:22   ` Alexei Starovoitov
  2015-03-04  5:01     ` Tom Zanussi
  0 siblings, 1 reply; 45+ messages in thread
From: Alexei Starovoitov @ 2015-03-04  2:22 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On 3/3/15 7:47 AM, Tom Zanussi wrote:
> On Mon, 2015-03-02 at 18:31 -0800, Alexei Starovoitov wrote:
>> On Mon, Mar 2, 2015 at 5:18 PM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:

>> I'm not proposing to use asm or C for this 'hist->bpf' tool.
>> Keep proposed 'hist:keys=...:vals=...' syntax and generate
>> bpf program on the fly based on this string.
>> So this user tool will take string, generate program, load
>> and attach it. Everything based on this single string input.
>> With the examples you mentioned in docs, it's not hard.
>> It will be more involved than patch 12, but way more generic
>> and easily extensible when 'hist:keys=...' string would need
>> to be extended.
>>
>
> Hmm, this seems silly to me - doing all that work just to get back to
> where we already are.

your 'hist->bpf' tool could have been the first to solve the 'bpf hard
to use' problem, and over time it could have evolved into a full dtrace
alternative.
Whereas by adding 'hist:keys=..' parsing to kernel you'll be stuck
with it and somebody else's dtrace-like tool will supersede it.

> I don't think you can start requiring all new tracing functionality to
> embed a code generator/compiler, or as a result force any particular
> interface on users.

force interface on users?!
I think I've explained enough that bpf would not be a user visible
interface. It's kernel to user land interface. In my proposal of
'hist->bpf' tool users won't even see that bpf is being generated
and run underneath. Same with dtrace-like scripting. Users will
see scripting language and won't care how it's compiled and
executed inside kernel. Multiple different languages and interfaces
are possible. Including hist-like text that's not a language.

> If it's possible to factor out a common
> infrastructure that can accommodate different user interfaces and
> preferences, don't you think it makes sense to do that?
>
> In any case, outside of the boilerplate tracing infrastructure code, it
> seems to me that a lot of the code in the main hist trigger patches is
> code that you'd also need for a tracepoint/bpf interface.  I think we've
> already agreed that it would be good to work towards being able to share
> that, so for the next version, I'll see what I can come up with.

yes. let's see what pieces can be shared. Though I don't want to
sacrifice long term goals for short term solution.
In case of bpf_maps, the objective is to share data between
kernel and user space by single abstraction with different
implementations underneath. In some use cases kernel programs only
do lookups into maps, while user space concurrently adding/deleting
entries (so no allocations from programs)
In your patch 4 you're adding kernel only access to maps but planning
to use hash type only and also not exposing these maps to user space
at all... that defeats the bpf_map purpose and you're only reusing ~200
lines of hashtab.c code, but also need to hack it for pre-allocate
and make config_bpf_syscall a strong dependency for 'hist'. why?
'hist' patch set would have been much shorter if you just used
hashtable.h or rhashtable.h or even rbtree.h (so you don't need
to do sorting in patch 13). The actual hash insert/walk code
is tiny and trivial.

I've looked through the patches again and they seem to have serious
bugs when it comes to object life time:
- patch 5 that adds clear_map is completely broken, calling
delete_all_elements() by wrapping it in rcu_read_lock() ?!
- patch 6 adds 'free_elem' callback that is called right from
htab_map_delete_elem() while we're still under rcu.. ?!
- patch 12 does
struct hist_trigger_entry *entry;
tracing_map_lookup_elem(hist_data->map, next_key, &entry);
hist_trigger_entry_print(m, hist_data, next_key, entry);
so when you're storing a pointer inside the map the rcu protection
of map_lookup protects only the value of 'entry' pointer.
The actual contents of *entry are not protected by anything,
so even if you fix 5 and 6, you'll still have severe memory corruptions,
since nothing guarantees that contents of *entry are valid when
hist_trigger_entry_print() is trying to access it.
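
To make that print safe, the entry's contents would have to be copied
out while still under the rcu read lock, something like (a sketch,
assuming deletions go through call_rcu()):

  struct hist_trigger_entry *entry, copy;

  rcu_read_lock();
  tracing_map_lookup_elem(hist_data->map, next_key, &entry);
  if (entry)
          copy = *entry;  /* *entry stays valid until rcu_read_unlock() */
  rcu_read_unlock();
  if (entry)
          hist_trigger_entry_print(m, hist_data, next_key, &copy);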

I think you'd be better off using vanilla hashtable.h

Thanks


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-03  2:31 Alexei Starovoitov
@ 2015-03-03 15:47 ` Tom Zanussi
  2015-03-04  2:22   ` Alexei Starovoitov
  0 siblings, 1 reply; 45+ messages in thread
From: Tom Zanussi @ 2015-03-03 15:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, 2015-03-02 at 18:31 -0800, Alexei Starovoitov wrote:
> On Mon, Mar 2, 2015 at 5:18 PM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> >>
> >> I'm saying keep the command line version of hist, but let
> >> user space process it.
> >> I don't buy the argument that you must run it in busybox
> >> without any extra tools.
> >> If you're in busybox, the system is likely idle, so nothing
> >> to trace/analyze. If you have some user space apps,
> >> then it's equally easy to add a 'hist->bpf' tool.
> >>
> >
> > How about systems that run a single statically linked process with no
> > shell (but a service that can read and write files like/event/trigger
> > and event/hist)?  We'd still like to be able to trace those systems.
> 
> hmm, there is no shell and one cannot do
> 'echo hist.. > trigger; cat hist', but there is a daemon
> that can accept remote commands?
> Then it would make even more sense to pass bpf programs
> from the remote host and consume aggregated data remotely.
> This on-host daemon will be a tiny wrapper on top of the bpf syscall.
> Quite a bit more flexible :)
> 
> > Well, I'd say writing BPF 'assembly' to do anything isn't something more
> > than a few users in the world would even consider, so that's completely
> > out. Which means the only practical way to use it is via the C
> > interface.  But getting that set up properly doesn't seem
> > straightforward either - it isn't something the Makefile will help with,
> > and there's no documentation on how one might do it.
> 
> I'm not proposing to use asm or C for this 'hist->bpf' tool.
> Keep proposed 'hist:keys=...:vals=...' syntax and generate
> bpf program on the fly based on this string.
> So this user tool will take string, generate program, load
> and attach it. Everything based on this single string input.
> With the examples you mentioned in docs, it's not hard.
> It will be more involved than patch 12, but way more generic
> and easily extensible when 'hist:keys=...' string would need
> to be extended.
> 

Hmm, this seems silly to me - doing all that work just to get back to
where we already are.

I don't think you can start requiring all new tracing functionality to
embed a code generator/compiler, or as a result force any particular
interface on users.  If it's possible to factor out a common
infrastructure that can accommodate different user interfaces and
preferences, don't you think it makes sense to do that?

In any case, outside of the boilerplate tracing infrastructure code, it
seems to me that a lot of the code in the main hist trigger patches is
code that you'd also need for a tracepoint/bpf interface.  I think we've
already agreed that it would be good to work towards being able to share
that, so for the next version, I'll see what I can come up with.

Tom



> > So I tweaked the Makefile to get samples/bpf in the build (I mean the
> > directory is there under samples/, so why do I need to add it to the
> > Makefile myself?) and tried building which failed until I tweaked
> > something else to get it to find the right headers, etc.  Finally I got
> > it building the userspace stuff but then found out I needed my own llvm
> > to get the kernel modules built, so searched and found your llvm tree
> > which I thought would configure the bpf backend automatically, but
> > apparently not, since it then failed with llc: invalid target 'bpf'
> > which is where I gave up.  Do I need to configure with --target=bpf or
> > something like that?  I don't know, and know nothing about llvm, so am
> > kind of stuck.
> 
> hmm. 'make samples/bpf/' should work out of the box.
> only llvm needs to be installed.
> To install llvm:
> $git clone http://llvm.org/git/llvm.git
> $mkdir llvm/bld; cd llvm/bld
> if you like cmake:
> $cmake -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=BPF ..
> $make
> if you like autoconf:
> $../configure --enable-experimental-targets=BPF
> $make
> 
> That's typical for any new backend. Hopefully soon it will lose
> 'experimental' status.
> 
> > I really do want to try doing something with it, and I understand that
> > you're working on improving the user experience, but at this point it
> > seems users have to jump through a lot of hoops just to get a minimally
> > working setup.  Even a small paragraph with some basic instructions
> > would help.  Or maybe it's just me, and it works for everyone else out
> > of the box.
> 
> Thank you for feedback. We'll add llvm build instructions to the doc.
> The goal for llvm is to be able to do
> 'apt-get install llvm-3.7' and bpf will be part of default install.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
@ 2015-03-03  2:31 Alexei Starovoitov
  2015-03-03 15:47 ` Tom Zanussi
  0 siblings, 1 reply; 45+ messages in thread
From: Alexei Starovoitov @ 2015-03-03  2:31 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Mar 2, 2015 at 5:18 PM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>>
>> I'm saying keep the command line version of hist, but let
>> user space process it.
>> I don't buy the argument that you must run it in busybox
>> without any extra tools.
>> If you're in busybox, the system is likely idle, so nothing
>> to trace/analyze. If you have some user space apps,
>> then it's equally easy to add a 'hist->bpf' tool.
>>
>
> How about systems that run a single statically linked process with no
> shell (but a service that can read and write files like event/trigger
> and event/hist)?  We'd still like to be able to trace those systems.

hmm, there is no shell and one cannot do
'echo hist.. > trigger; cat hist', but there is a daemon
that can accept remote commands?
Then it would make even more sense to pass bpf programs
from the remote host and consume aggregated data remotely.
This on-host daemon will be a tiny wrapper on top of the bpf syscall.
Quite a bit more flexible :)

> Well, I'd say writing BPF 'assembly' to do anything isn't something more
> than a few users in the world would even consider, so that's completely
> out. Which means the only practical way to use it is via the C
> interface.  But getting that set up properly doesn't seem
> straightforward either - it isn't something the Makefile will help with,
> and there's no documentation on how one might do it.

I'm not proposing to use asm or C for this 'hist->bpf' tool.
Keep proposed 'hist:keys=...:vals=...' syntax and generate
bpf program on the fly based on this string.
So this user tool will take string, generate program, load
and attach it. Everything based on this single string input.
With the examples you mentioned in docs, it's not hard.
It will be more involved than patch 12, but way more generic
and easily extensible when 'hist:keys=...' string would need
to be extended.
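
The parse itself is the easy half; as a standalone sketch (program
generation, the involved part, would start from these tokens):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char cmd[] = "hist:keys=common_pid.execname,id.syscall:vals=hitcount";
          char *tok, *save;

          for (tok = strtok_r(cmd, ":", &save); tok;
               tok = strtok_r(NULL, ":", &save)) {
                  if (!strncmp(tok, "keys=", 5))
                          printf("keys: %s\n", tok + 5);  /* -> map key layout */
                  else if (!strncmp(tok, "vals=", 5))
                          printf("vals: %s\n", tok + 5);  /* -> update insns */
          }
          return 0;
  }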

> So I tweaked the Makefile to get samples/bpf in the build (I mean the
> directory is there under samples/, so why do I need to add it to the
> Makefile myself?) and tried building which failed until I tweaked
> something else to get it to find the right headers, etc.  Finally I got
> it building the userspace stuff but then found out I needed my own llvm
> to get the kernel modules built, so searched and found your llvm tree
> which I thought would configure the bpf backend automatically, but
> apparently not, since it then failed with llc: invalid target 'bpf'
> which is where I gave up.  Do I need to configure with --target=bpf or
> something like that?  I don't know, and know nothing about llvm, so am
> kind of stuck.

hmm. 'make samples/bpf/' should work out of the box.
only llvm needs to be installed.
To install llvm:
$git clone http://llvm.org/git/llvm.git
$mkdir llvm/bld; cd llvm/bld
if you like cmake:
$cmake -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=BPF ..
$make
if you like autoconf:
$../configure --enable-experimental-targets=BPF
$make

That's typical for any new backend. Hopefully soon it will lose
'experimental' status.

> I really do want to try doing something with it, and I understand that
> you're working on improving the user experience, but at this point it
> seems users have to jump through a lot of hoops just to get a minimally
> working setup.  Even a small paragraph with some basic instructions
> would help.  Or maybe it's just me, and it works for everyone else out
> of the box.

Thank you for feedback. We'll add llvm build instructions to the doc.
The goal for llvm is to be able to do
'apt-get install llvm-3.7' and bpf will be part of default install.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-03  0:01 Alexei Starovoitov
@ 2015-03-03  1:18 ` Tom Zanussi
  0 siblings, 0 replies; 45+ messages in thread
From: Tom Zanussi @ 2015-03-03  1:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, 2015-03-02 at 16:01 -0800, Alexei Starovoitov wrote:
> On Mon, Mar 2, 2015 at 11:55 AM, Tom Zanussi
> <tom.zanussi@linux.intel.com> wrote:
> >
> > I disagree that it would be rarely used.  In fact, it would probably
> > cover about 80% of the use cases that people initially use things like
> > systemtap or dtrace for, which I guess is what ebpf is shooting for.
> 
> 'hist' style won't solve any of the use cases I'm targeting with bpf.
> So, imo, 'hist' being 80% of dtrace is far from reality...
> but let's agree to disagree. it's not that important.
> I'm not saying don't do 'hist' at all.
> I'm only suggesting to do it differently.
> 
> > I'm also looking at systems that have very little memory and 8Mb of
> > storage to work with, so streaming it all to userspace and
> > post-processing won't really work on those systems.
> 
> I'm not suggesting to post-process. Quite the opposite.
> Let programs do ++ in the kernel, since that's what
> your patch 12 is doing, but in a hard coded way.
> 
> > With some thought, though, I think the ebpf system/interpreter could be
> > made smart enough to recognize the simple patterns represented by the
> > hist triggers, and reuse them internally.  So ftrace users get their
> > command-line version and it's also something ebpf can reuse.
> 
> I'm saying keep the command line version of hist, but let
> user space process it.
> I don't buy the argument that you must run it in busybox
> without any extra tools.
> If you're in busybox, the system is likely idle, so nothing
> to trace/analyze. If you have some user space apps,
> then it's equally easy to add a 'hist->bpf' tool.
> 

How about systems that run a single statically linked process with no
shell (but a service that can read and write files like event/trigger
and event/hist)?  We'd still like to be able to trace those systems.

> >> to embedded argument. Why add this to kernel
> >> when bpf programs can do the same on demand?
> >
> > Because this demonstrates that all those things can be done without
> > introducing an interpreter into the mix, so why bother with the
> > interpreter?
> 
> because interpreter is done once for all use cases,
> whereas custom 'hist' code is doing only one thing for one use case.
> 

I agree that the hist functionality is a subset of what can be done with
a full-blown interpreter, but it's not doing just one thing for one use
case - it covers a whole set of use cases.

> >> the kernel ABI exposure is much smaller.
> >> So would you consider working together on adding
> >> clean bpf+tracepoints infra and corresponding
> >> user space bits?
> >> We can have small user space parser/compiler for
> >> 'hist:keys=common_pid.execname,id.syscall:vals=hitcount'
> >> strings that will convert it into bpf program and you'll
> >> be able to use it in embedded setups ?
> >
> > Yeah, wouldn't be averse to working together to create a clean bpf
> > +tracepoints infrastructure - I think creating a reusable component like
> > this would be a good first step.
> 
> great.
> From the program you can emit the same text format
> as in your 'cat hist' example.
> But it will not be a part of stable kernel ABI, which I think is
> one of the main advantages to do such printing from programs
> instead of kernel C code.
> If you decide to extend what is being printed, you can
> tweak 'hist->bpf' tool and print something else.
> No one will complain, whereas when you would want
> to extend the format of 'hist' file printed by kernel, you'd
> need to consider all user tools that are parsing it.
> Like we saw in systrace example...
> 
> > BTW, I've actually tried to play around with the BPF samples/, but it
> > seems they're not actually hooked into the system i.e. the samples
> > Makefile doesn't build them, and it even looks for tools/llvm that's not
> > there.  I got as far as getting the latest llvm from the location
> mentioned in one of the bpf commit messages, but gave up after it told
> > me 'invalid target bpf'.  And I couldn't find any documentation on how
> > to set it all up - did I just miss that?
> 
> the comment next to 'tool/llvm' says 'point to your llvm' :)
> so yes, to build the C examples one needs to install the latest llvm trunk.
> If you're saying that existing bpf stuff is hard to use, then yes.

Well, I'd say writing BPF 'assembly' to do anything isn't something more
than a few users in the world would even consider, so that's completely
out. Which means the only practical way to use it is via the C
interface.  But getting that set up properly doesn't seem
straightforward either - it isn't something the Makefile will help with,
and there's no documentation on how one might do it.

So I tweaked the Makefile to get samples/bpf in the build (I mean the
directory is there under samples/, so why do I need to add it to the
Makefile myself?) and tried building which failed until I tweaked
something else to get it to find the right headers, etc.  Finally I got
it building the userspace stuff but then found out I needed my own llvm
to get the kernel modules built, so searched and found your llvm tree
which I thought would configure the bpf backend automatically, but
apparently not, since it then failed with llc: invalid target 'bpf'
which is where I gave up.  Do I need to configure with --target=bpf or
something like that?  I don't know, and know nothing about llvm, so am
kind of stuck.

I really do want to try doing something with it, and I understand that
you're working on improving the user experience, but at this point it
seems users have to jump through a lot of hoops just to get a minimally
working setup.  Even a small paragraph with some basic instructions
would help.  Or maybe it's just me, and it works for everyone else out
of the box.

Tom

> I completely agree. It is hard to use. We're working on it.
> The user bits can be improved gradually, unlike the kernel/user
> boundary. Once you set the 'hist' file format, it will stay
> forever.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
@ 2015-03-03  0:01 Alexei Starovoitov
  2015-03-03  1:18 ` Tom Zanussi
  0 siblings, 1 reply; 45+ messages in thread
From: Alexei Starovoitov @ 2015-03-03  0:01 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Mar 2, 2015 at 11:55 AM, Tom Zanussi
<tom.zanussi@linux.intel.com> wrote:
>
> I disagree that it would be rarely used.  In fact, it would probably
> cover about 80% of the use cases that people initially use things like
> systemtap or dtrace for, which I guess is what ebpf is shooting for.

'hist' style won't solve any of the use cases I'm targeting with bpf.
So, imo, 'hist' being 80% of dtrace is far from reality...
but let's agree to disagree. it's not that important.
I'm not saying don't do 'hist' at all.
I'm only suggesting to do it differently.

> I'm also looking at systems that have very little memory and 8Mb of
> storage to work with, so streaming it all to userspace and
> post-processing won't really work on those systems.

I'm not suggesting to post-process. Quite the opposite.
Let programs do ++ in the kernel, since that's what
your patch 12 is doing, but in a hard coded way.
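
I.e. the aggregation itself stays in the kernel; in restricted C the
whole thing is just (a sketch using the SEC()/map conventions from
samples/bpf and helper names from the bpf tracing patches still in
flight, so illustrative only):

  struct bpf_map_def SEC("maps") counts = {
          .type = BPF_MAP_TYPE_HASH,
          .key_size = sizeof(u32),
          .value_size = sizeof(u64),
          .max_entries = 1024,
  };

  SEC("kprobe/SyS_write")
  int count_writes(struct pt_regs *ctx)
  {
          u32 key = bpf_get_current_pid_tgid() >> 32;     /* tgid */
          u64 one = 1, *val;

          val = bpf_map_lookup_elem(&counts, &key);
          if (val)
                  (*val)++;               /* the '++' in the kernel */
          else
                  bpf_map_update_elem(&counts, &key, &one, BPF_ANY);
          return 0;
  }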

> With some thought, though, I think the ebpf system/interpreter could be
> made smart enough to recognize the simple patterns represented by the
> hist triggers, and reuse them internally.  So ftrace users get their
> command-line version and it's also something ebpf can reuse.

I'm saying keep the command line version of hist, but let
user space process it.
I don't buy the argument that you must run it in busybox
without any extra tools.
If you're in busybox, the system is likely idle, so nothing
to trace/analyze. If you have some user space apps,
then it's equally easy to add a 'hist->bpf' tool.

>> to embedded argument. Why add this to kernel
>> when bpf programs can do the same on demand?
>
> Because this demonstrates that all those things can be done without
> introducing an interpreter into the mix, so why bother with the
> interpreter?

because interpreter is done once for all use cases,
whereas custom 'hist' code is doing only one thing for one use case.

>> the kernel ABI exposure is much smaller.
>> So would you consider working together on adding
>> clean bpf+tracepoints infra and corresponding
>> user space bits?
>> We can have small user space parser/compiler for
>> 'hist:keys=common_pid.execname,id.syscall:vals=hitcount'
>> strings that will convert it into bpf program and you'll
>> be able to use it in embedded setups ?
>
> Yeah, wouldn't be averse to working together to create a clean bpf
> +tracepoints infrastructure - I think creating a reusable component like
> this would be a good first step.

great.
From the program you can emit the same text format
as in your 'cat hist' example.
But it will not be a part of stable kernel ABI, which I think is
one of the main advantages to do such printing from programs
instead of kernel C code.
If you decide to extend what is being printed, you can
tweak 'hist->bpf' tool and print something else.
No one will complain, whereas when you would want
to extend the format of 'hist' file printed by kernel, you'd
need to consider all user tools that are parsing it.
Like we saw in systrace example...

> BTW, I've actually tried to play around with the BPF samples/, but it
> seems they're not actually hooked into the system i.e. the samples
> Makefile doesn't build them, and it even looks for tools/llvm that's not
> there.  I got as far as getting the latest llvm from the location
> mentioened in one of the bpf commit messages, but gave up after it told
> me 'invalid target bpf'.  And I couldn't find any documentation on how
> to set it all up - did I just miss that?

the comment next to 'tool/llvm' says 'point to your llvm' :)
so yes, to build the C examples one needs to install the latest llvm trunk.
If you're saying that existing bpf stuff is hard to use, then yes.
I completely agree. It is hard to use. We're working on it.
The user bits can be improved gradually, unlike the kernel/user
boundary. Once you set the 'hist' file format, it will stay
forever.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 20:48           ` Alexei Starovoitov
  2015-03-02 20:52             ` Karim Yaghmour
@ 2015-03-02 20:54             ` Karim Yaghmour
  1 sibling, 0 replies; 45+ messages in thread
From: Karim Yaghmour @ 2015-03-02 20:54 UTC (permalink / raw)
  To: Alexei Starovoitov, Steven Rostedt
  Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo


On 15-03-02 03:48 PM, Alexei Starovoitov wrote:
> good. thanks for explaining.
> all makes sense now.
> 
> btw, that fancy systrace seems to be parsing text from trace_pipe
> https://android.googlesource.com/platform/external/chromium-trace/+/jb-dev/src/tracing/linux_perf_importer.js
> with a bunch of regex...
> including sched_switch: next_prio...

Oh, and you can't use Firefox to view the traces it generates. You can
only use Chrome ... even though the documentation says: "... you can
open the report using a web browser." Which I guess means that "browser
= chrome" at Google.

-- 
Karim Yaghmour
CEO - Opersys inc. / www.opersys.com
http://twitter.com/karimyaghmour


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 20:48           ` Alexei Starovoitov
@ 2015-03-02 20:52             ` Karim Yaghmour
  2015-03-02 20:54             ` Karim Yaghmour
  1 sibling, 0 replies; 45+ messages in thread
From: Karim Yaghmour @ 2015-03-02 20:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Steven Rostedt
  Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo


On 15-03-02 03:48 PM, Alexei Starovoitov wrote:
> good. thanks for explaining.
> all makes sense now.
> 
> btw, that fancy systrace seems to be parsing text from trace_pipe
> https://android.googlesource.com/platform/external/chromium-trace/+/jb-dev/src/tracing/linux_perf_importer.js
> with a bunch of regexes...
> including sched_switch: next_prio...

Yes, it does. This is why it's not meant for analyzing large traces.

-- 
Karim Yaghmour
CEO - Opersys inc. / www.opersys.com
http://twitter.com/karimyaghmour


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 20:43         ` Steven Rostedt
@ 2015-03-02 20:48           ` Alexei Starovoitov
  2015-03-02 20:52             ` Karim Yaghmour
  2015-03-02 20:54             ` Karim Yaghmour
  0 siblings, 2 replies; 45+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 20:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Karim Yaghmour, Tom Zanussi, Masami Hiramatsu, Namhyung Kim,
	Andi Kleen, LKML, Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Mar 2, 2015 at 12:43 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2 Mar 2015 15:39:48 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> On Mon, 2 Mar 2015 12:33:34 -0800
>> Alexei Starovoitov <ast@plumgrid.com> wrote:
>>
>> > On Mon, Mar 2, 2015 at 11:49 AM, Karim Yaghmour
>> > <karim.yaghmour@opersys.com> wrote:
>> > >
>> > > On 15-03-02 02:45 PM, Steven Rostedt wrote:
>> > >> Interesting. The Android devices I have still have it enabled (rooted,
>> > >> but still running the stock system).
>> > >
>> > > I don't know that there's any policy to disable tracing on Android. The
>> > > Android framework in fact has generally been instrumented by Google
>> > > itself to output trace info into trace_marker. And the systrace/atrace
>> > > tools made available to app developers need to get access to this
>> > > tracing info. So, if Android had tracing disabled, systrace/atrace
>> > > wouldn't work.
>> > > https://developer.android.com/tools/debugging/systrace.html
>> >
>> > that's interesting. thanks for the link.
>> >
>> > I don't see tracing being explicitly enabled in defconfig:
>> > https://source.android.com/devices/tech/kernel.html
>>
>> CONFIG_SCHED_TRACER=y
>>
>> That alone will enable tracing.
>>
>> -- Steve
>>
>> > or here:
>> > https://android.googlesource.com/kernel/common/+/android-3.10/android/configs/android-recommended.cfg
>>
>
> CONFIG_ENABLE_DEFAULT_TRACERS=y
>
> And so will that.

good. thanks for explaining.
all makes sense now.

btw, that fancy systrace seems to be parsing text from trace_pipe
https://android.googlesource.com/platform/external/chromium-trace/+/jb-dev/src/tracing/linux_perf_importer.js
with a bunch of regexes...
including sched_switch: next_prio...
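
For reference, the trace_pipe lines it matches look roughly like:

  sched_switch: prev_comm=bash prev_pid=3112 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120

and the importer's patterns are variations on something like the
following (illustrative only, not the literal regex from that file):

  /prev_comm=(.+) prev_pid=(\d+) prev_prio=(\d+) prev_state=(\S+) ==> next_comm=(.+) next_pid=(\d+) next_prio=(\d+)/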

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 20:39       ` Steven Rostedt
@ 2015-03-02 20:43         ` Steven Rostedt
  2015-03-02 20:48           ` Alexei Starovoitov
  0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2015-03-02 20:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Karim Yaghmour, Tom Zanussi, Masami Hiramatsu, Namhyung Kim,
	Andi Kleen, LKML, Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, 2 Mar 2015 15:39:48 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 2 Mar 2015 12:33:34 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
> 
> > On Mon, Mar 2, 2015 at 11:49 AM, Karim Yaghmour
> > <karim.yaghmour@opersys.com> wrote:
> > >
> > > On 15-03-02 02:45 PM, Steven Rostedt wrote:
> > >> Interesting. The Android devices I have still have it enabled (rooted,
> > >> but still running the stock system).
> > >
> > > I don't know that there's any policy to disable tracing on Android. The
> > > Android framework in fact has generally been instrumented by Google
> > > itself to output trace info into trace_marker. And the systrace/atrace
> > > tools made available to app developers need to get access to this
> > > tracing info. So, if Android had tracing disabled, systrace/atrace
> > > wouldn't work.
> > > https://developer.android.com/tools/debugging/systrace.html
> > 
> > that's interesting. thanks for the link.
> > 
> > I don't see tracing being explicitly enabled in defconfig:
> > https://source.android.com/devices/tech/kernel.html
> 
> CONFIG_SCHED_TRACER=y
> 
> That alone will enable tracing.
> 
> -- Steve
> 
> > or here:
> > https://android.googlesource.com/kernel/common/+/android-3.10/android/configs/android-recommended.cfg
> 

CONFIG_ENABLE_DEFAULT_TRACERS=y

And so will that.
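
A quick way to check what a given device actually has (assuming
CONFIG_IKCONFIG_PROC is enabled; the output here is illustrative):

  # zcat /proc/config.gz | grep -E 'SCHED_TRACER|ENABLE_DEFAULT_TRACERS'
  CONFIG_SCHED_TRACER=y
  CONFIG_ENABLE_DEFAULT_TRACERS=y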

-- Steve


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 20:33     ` Alexei Starovoitov
  2015-03-02 20:37       ` Karim Yaghmour
@ 2015-03-02 20:39       ` Steven Rostedt
  2015-03-02 20:43         ` Steven Rostedt
  1 sibling, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2015-03-02 20:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Karim Yaghmour, Tom Zanussi, Masami Hiramatsu, Namhyung Kim,
	Andi Kleen, LKML, Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, 2 Mar 2015 12:33:34 -0800
Alexei Starovoitov <ast@plumgrid.com> wrote:

> On Mon, Mar 2, 2015 at 11:49 AM, Karim Yaghmour
> <karim.yaghmour@opersys.com> wrote:
> >
> > On 15-03-02 02:45 PM, Steven Rostedt wrote:
> >> Interesting. The Android devices I have still have it enabled (rooted,
> >> but still running the stock system).
> >
> > I don't know that there's any policy to disable tracing on Android. The
> > Android framework in fact has generally been instrumented by Google
> > itself to output trace info into trace_marker. And the systrace/atrace
> > tools made available to app developers need to get access to this
> > tracing info. So, if Android had tracing disabled, systrace/atrace
> > wouldn't work.
> > https://developer.android.com/tools/debugging/systrace.html
> 
> that's interesting. thanks for the link.
> 
> I don't see tracing being explicitly enabled in defconfig:
> https://source.android.com/devices/tech/kernel.html

CONFIG_SCHED_TRACER=y

That alone will enable tracing.

-- Steve

> or here:
> https://android.googlesource.com/kernel/common/+/android-3.10/android/configs/android-recommended.cfg


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 20:33     ` Alexei Starovoitov
@ 2015-03-02 20:37       ` Karim Yaghmour
  2015-03-02 20:39       ` Steven Rostedt
  1 sibling, 0 replies; 45+ messages in thread
From: Karim Yaghmour @ 2015-03-02 20:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Tom Zanussi, Masami Hiramatsu, Namhyung Kim,
	Andi Kleen, LKML, Ingo Molnar, Arnaldo Carvalho de Melo


On 15-03-02 03:33 PM, Alexei Starovoitov wrote:
> that's interesting. thanks for the link.
> 
> I don't see tracing being explicitly enabled in defconfig:
> https://source.android.com/devices/tech/kernel.html
> or here:
> https://android.googlesource.com/kernel/common/+/android-3.10/android/configs/android-recommended.cfg

I don't know that either of these is "authoritative". I know of both of
them, but I've never looked at them as the reference for what
manufacturers ship. Instead, most manufacturers get their default
kernels from SoC vendors, so it's much likelier that an Androidized
kernel tree from Qualcomm or Intel is closer to what actually ships
than the two links above.

-- 
Karim Yaghmour
CEO - Opersys inc. / www.opersys.com
http://twitter.com/karimyaghmour


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 19:49   ` Karim Yaghmour
@ 2015-03-02 20:33     ` Alexei Starovoitov
  2015-03-02 20:37       ` Karim Yaghmour
  2015-03-02 20:39       ` Steven Rostedt
  0 siblings, 2 replies; 45+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 20:33 UTC (permalink / raw)
  To: Karim Yaghmour
  Cc: Steven Rostedt, Tom Zanussi, Masami Hiramatsu, Namhyung Kim,
	Andi Kleen, LKML, Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Mar 2, 2015 at 11:49 AM, Karim Yaghmour
<karim.yaghmour@opersys.com> wrote:
>
> On 15-03-02 02:45 PM, Steven Rostedt wrote:
>> Interesting. The Android devices I have still have it enabled (rooted,
>> but still running the stock system).
>
> I don't know that there's any policy to disable tracing on Android. The
> Android framework in fact has generally been instrumented by Google
> itself to output trace info into trace_marker. And the systrace/atrace
> tools made available to app developers need to get access to this
> tracing info. So, if Android had tracing disabled, systrace/atrace
> wouldn't work.
> https://developer.android.com/tools/debugging/systrace.html

that's interesting. thanks for the link.

I don't see tracing being explicitly enabled in defconfig:
https://source.android.com/devices/tech/kernel.html
or here:
https://android.googlesource.com/kernel/common/+/android-3.10/android/configs/android-recommended.cfg

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 19:14 Alexei Starovoitov
  2015-03-02 19:31 ` Steven Rostedt
@ 2015-03-02 19:55 ` Tom Zanussi
  2015-03-09 11:38   ` He Kuang
  1 sibling, 1 reply; 45+ messages in thread
From: Tom Zanussi @ 2015-03-02 19:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

Hi Alexei,

On Mon, 2015-03-02 at 11:14 -0800, Alexei Starovoitov wrote:
> On Mon, Mar 2, 2015 at 8:00 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
> >
> >   # echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount' > \
> >         /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
> >
> >   # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
> >
> >   key: common_pid:bash[3112], id:sys_write                     vals: count:69
> >   key: common_pid:bash[3112], id:sys_rt_sigprocmask            vals: count:218
> 
> Hi Tom,
> 
> I think we both want to see in-kernel aggregation.
> This 'hist' stuff is trying to do counting and even map sorting
> in the kernel, whereas with bpf programs I'm moving
> all of these decisions to user space.
> I understand your desire to avoid any user level scripts
> and do everything via 'cat' and debugfs, but imo that's
> very limiting. I think it's better to do a slim user space

It's consistent with the whole host of other ftrace tools, so I don't
know why that model would be any more limiting for this.

> scripting language that can translate to bpf even in
> embedded setups. Then users will be able to aggregate
> whatever they like, whereas with the 'hist' approach
> they're limited to simple counters.
> trace_events_trigger.c - 1466 lines - that's quite a bit
> of code that will be rarely used. Kinda goes counter

I disagree that it would be rarely used.  In fact, it would probably
cover about 80% of the use cases that people initially use things like
systemtap or dtrace for, which I guess is what ebpf is shooting for.

I'm also looking at systems that have very little memory and 8MB of
storage to work with, so streaming it all to userspace and
post-processing won't really work on those systems.

With some thought, though, I think the ebpf system/interpreter could be
made smart enough to recognize the simple patterns represented by the
hist triggers, and reuse them internally.  So ftrace users get their
command-line version and it's also something ebpf can reuse.

> to the embedded argument. Why add this to the kernel
> when bpf programs can do the same on demand?

Because this demonstrates that all those things can be done without
introducing an interpreter into the mix, so why bother with the
interpreter?

> The arguments about stable ABI apply as well.
> The format of the 'hist' file would need to be stable, so it will
> be hard to extend. With bpf programs doing aggregation,

Well, the format is very regular - keys and values and summary lines,
not much to break there.

> the kernel ABI exposure is much smaller.
> So would you consider working together on adding
> clean bpf+tracepoints infra and corresponding
> user space bits?
> We can have a small user space parser/compiler for
> 'hist:keys=common_pid.execname,id.syscall:vals=hitcount'
> strings that will convert them into bpf programs, and you'll
> be able to use them in embedded setups?

Yeah, wouldn't be averse to working together to create a clean
bpf+tracepoints infrastructure - I think creating a reusable component
like this would be a good first step.

BTW, I've actually tried to play around with the BPF samples/, but it
seems they're not actually hooked into the system, i.e. the samples
Makefile doesn't build them, and it even looks for tools/llvm that's not
there.  I got as far as getting the latest llvm from the location
mentioned in one of the bpf commit messages, but gave up after it told
me 'invalid target bpf'.  And I couldn't find any documentation on how
to set it all up - did I just miss that?

Tom

> 
> Thanks



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 19:45 ` Steven Rostedt
@ 2015-03-02 19:49   ` Karim Yaghmour
  2015-03-02 20:33     ` Alexei Starovoitov
  0 siblings, 1 reply; 45+ messages in thread
From: Karim Yaghmour @ 2015-03-02 19:49 UTC (permalink / raw)
  To: Steven Rostedt, Alexei Starovoitov
  Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo


On 15-03-02 02:45 PM, Steven Rostedt wrote:
> Interesting. The Android devices I have still have it enabled (rooted,
> but still running the stock system).

I don't know that there's any policy to disable tracing on Android. The
Android framework in fact has generally been instrumented by Google
itself to output trace info into trace_marker. And the systrace/atrace
tools made available to app developers need to get access to this
tracing info. So, if Android had tracing disabled, systrace/atrace
wouldn't work.
https://developer.android.com/tools/debugging/systrace.html

-- 
Karim Yaghmour
CEO - Opersys inc. / www.opersys.com
http://twitter.com/karimyaghmour


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 19:36 Alexei Starovoitov
@ 2015-03-02 19:45 ` Steven Rostedt
  2015-03-02 19:49   ` Karim Yaghmour
  0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2015-03-02 19:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo, Karim Yaghmour

On Mon, 2 Mar 2015 11:36:36 -0800
Alexei Starovoitov <ast@plumgrid.com> wrote:

> > Make sure it also works on android.
> 
> the sad part is that tracing is off on android...
> so all of tracefs is actually compiled out there...
> and probably will stay that way if we keep adding rarely used code to it.

Interesting. The Android devices I have still have it enabled (rooted,
but still running the stock system).

-- Steve


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
@ 2015-03-02 19:36 Alexei Starovoitov
  2015-03-02 19:45 ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 19:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Mar 2, 2015 at 11:31 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2 Mar 2015 11:14:54 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> I think we both want to see in-kernel aggregation.
>> This 'hist' stuff is trying to do counting and even map sorting
>> in the kernel, whereas with bpf programs I'm moving
>> all of these decisions to user space.
>> I understand your desire to avoid any user level scripts
>> and do everything via 'cat' and debugfs, but imo that's
>> very limiting. I think it's better to do a slim user space
>> scripting language that can translate to bpf even in
>> embedded setups. Then users will be able to aggregate
>> whatever they like, whereas with the 'hist' approach
>> they're limited to simple counters.
>> trace_events_trigger.c - 1466 lines - that's quite a bit
>> of code that will be rarely used. Kinda goes counter
>> to the embedded argument. Why add this to the kernel
>> when bpf programs can do the same on demand?
>
> At Collab, a lot of people came to me telling me they love the debugfs
> system, because it's everywhere they go. If you remove that, you remove
> most users (including myself).
>
>
>> The arguments about stable ABI apply as well.
>> The format of the 'hist' file would need to be stable, so it will
>> be hard to extend. With bpf programs doing aggregation,
>> the kernel ABI exposure is much smaller.
>
> I disagree with this statement.
>
>> So would you consider working together on adding
>> clean bpf+tracepoints infra and corresponding
>> user space bits?
>> We can have a small user space parser/compiler for
>> 'hist:keys=common_pid.execname,id.syscall:vals=hitcount'
>> strings that will convert them into bpf programs, and you'll
>> be able to use them in embedded setups?
>
> Make sure it also works on android.

the sad part is that tracing is off on android...
so all of tracefs is actually compiled out there...
and probably will stay that way if we keep adding rarely used code to it.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
  2015-03-02 19:14 Alexei Starovoitov
@ 2015-03-02 19:31 ` Steven Rostedt
  2015-03-02 19:55 ` Tom Zanussi
  1 sibling, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2015-03-02 19:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Zanussi, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, 2 Mar 2015 11:14:54 -0800
Alexei Starovoitov <ast@plumgrid.com> wrote:
 
> I think we both want to see in-kernel aggregation.
> This 'hist' stuff is trying to do counting and even map sorting
> in the kernel, whereas with bpf programs I'm moving
> all of these decisions to user space.
> I understand your desire to avoid any user level scripts
> and do everything via 'cat' and debugfs, but imo that's
> very limiting. I think it's better to do a slim user space
> scripting language that can translate to bpf even in
> embedded setups. Then users will be able to aggregate
> whatever they like, whereas with the 'hist' approach
> they're limited to simple counters.
> trace_events_trigger.c - 1466 lines - that's quite a bit
> of code that will be rarely used. Kinda goes counter
> to the embedded argument. Why add this to the kernel
> when bpf programs can do the same on demand?

At Collab, a lot of people came to me telling me they love the debugfs
system, because it's everywhere they go. If you remove that, you remove
most users (including myself).


> The arguments about stable ABI apply as well.
> The format of the 'hist' file would need to be stable, so it will
> be hard to extend. With bpf programs doing aggregation,
> the kernel ABI exposure is much smaller.

I disagree with this statement.

> So would you consider working together on adding
> clean bpf+tracepoints infra and corresponding
> user space bits?
> We can have a small user space parser/compiler for
> 'hist:keys=common_pid.execname,id.syscall:vals=hitcount'
> strings that will convert them into bpf programs, and you'll
> be able to use them in embedded setups?

Make sure it also works on android.

-- Steve

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/15] tracing: 'hist' triggers
@ 2015-03-02 19:14 Alexei Starovoitov
  2015-03-02 19:31 ` Steven Rostedt
  2015-03-02 19:55 ` Tom Zanussi
  0 siblings, 2 replies; 45+ messages in thread
From: Alexei Starovoitov @ 2015-03-02 19:14 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Steven Rostedt, Masami Hiramatsu, Namhyung Kim, Andi Kleen, LKML,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Mar 2, 2015 at 8:00 AM, Tom Zanussi <tom.zanussi@linux.intel.com> wrote:
>
>   # echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount' > \
>         /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
>
>   # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
>
>   key: common_pid:bash[3112], id:sys_write                     vals: count:69
>   key: common_pid:bash[3112], id:sys_rt_sigprocmask            vals: count:218

Hi Tom,

I think we both want to see in-kernel aggregation.
This 'hist' stuff is trying to do counting and even map sorting
in the kernel, whereas with bpf programs I'm moving
all of these decisions to user space.
I understand your desire to avoid any user level scripts
and do everything via 'cat' and debugfs, but imo that's
very limiting. I think it's better to do a slim user space
scripting language that can translate to bpf even in
embedded setups. Then users will be able to aggregate
whatever they like, whereas with the 'hist' approach
they're limited to simple counters.
trace_events_trigger.c - 1466 lines - that's quite a bit
of code that will be rarely used. Kinda goes counter
to the embedded argument. Why add this to the kernel
when bpf programs can do the same on demand?
The arguments about stable ABI apply as well.
The format of the 'hist' file would need to be stable, so it will
be hard to extend. With bpf programs doing aggregation,
the kernel ABI exposure is much smaller.
So would you consider working together on adding
clean bpf+tracepoints infra and corresponding
user space bits?
We can have a small user space parser/compiler for
'hist:keys=common_pid.execname,id.syscall:vals=hitcount'
strings that will convert them into bpf programs, and you'll
be able to use them in embedded setups?
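
A rough sketch of the parsing half of such a translator (purely
illustrative - it only splits the trigger string into its key and value
fields; emitting the actual bpf program is not shown):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* split 'hist:keys=...:vals=...' into its individual fields */
  static void parse_hist_spec(const char *spec)
  {
      char *s = strdup(spec), *sect, *sp1;

      for (sect = strtok_r(s, ":", &sp1); sect;
           sect = strtok_r(NULL, ":", &sp1)) {
          if (!strncmp(sect, "keys=", 5) || !strncmp(sect, "vals=", 5)) {
              char *field, *sp2;

              /* each field would map to a bpf map key or value slot */
              for (field = strtok_r(sect + 5, ",", &sp2); field;
                   field = strtok_r(NULL, ",", &sp2))
                  printf("%.4s field: %s\n", sect, field);
          }
      }
      free(s);
  }

  int main(void)
  {
      parse_hist_spec("hist:keys=common_pid.execname,id.syscall:vals=hitcount");
      return 0;
  }

Running it on the string above yields the two key fields and one value
field the generated program would need to aggregate on.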

Thanks

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2015-03-09 11:41 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-02 16:00 [PATCH v2 00/15] tracing: 'hist' triggers Tom Zanussi
2015-03-02 16:00 ` [PATCH 01/15] tracing: Make ftrace_event_field checking functions available Tom Zanussi
2015-03-02 16:00 ` [PATCH 02/15] tracing: Add event record param to trigger_ops.func() Tom Zanussi
2015-03-02 16:00 ` [PATCH 03/15] tracing: Add get_syscall_name() Tom Zanussi
2015-03-02 16:00 ` [PATCH 04/15] bpf: Export bpf map functionality as trace_map_* Tom Zanussi
2015-03-02 16:00 ` [PATCH 05/15] bpf: Export a map-clearing function Tom Zanussi
2015-03-02 16:00 ` [PATCH 06/15] bpf: Add tracing_map client ops Tom Zanussi
2015-03-02 16:01 ` [PATCH 07/15] mm: Add ___GFP_NOTRACE Tom Zanussi
2015-03-02 16:37   ` Steven Rostedt
2015-03-02 16:46     ` Tom Zanussi
2015-03-02 17:58       ` Alexei Starovoitov
2015-03-02 18:03         ` Tom Zanussi
2015-03-02 18:12           ` Alexei Starovoitov
2015-03-02 18:25             ` Tom Zanussi
2015-03-02 18:43             ` Steven Rostedt
2015-03-02 16:01 ` [PATCH 08/15] tracing: Make kmem memory allocation tracepoints conditional Tom Zanussi
2015-03-02 16:01 ` [PATCH 09/15] tracing: Add kmalloc/kfree macros Tom Zanussi
2015-03-02 16:01 ` [PATCH 10/15] bpf: Make tracing_map use kmalloc/kfree_notrace() Tom Zanussi
2015-03-02 16:01 ` [PATCH 11/15] tracing: Add a per-event-trigger 'paused' field Tom Zanussi
2015-03-02 16:01 ` [PATCH 12/15] tracing: Add 'hist' event trigger command Tom Zanussi
2015-03-02 16:01 ` [PATCH 13/15] tracing: Add sorting to hist triggers Tom Zanussi
2015-03-02 16:01 ` [PATCH 14/15] tracing: Add enable_hist/disable_hist triggers Tom Zanussi
2015-03-02 16:01 ` [PATCH 15/15] tracing: Add 'hist' trigger Documentation Tom Zanussi
2015-03-03  2:25 ` [PATCH v2 00/15] tracing: 'hist' triggers Masami Hiramatsu
2015-03-03 14:47   ` Tom Zanussi
2015-03-02 19:14 Alexei Starovoitov
2015-03-02 19:31 ` Steven Rostedt
2015-03-02 19:55 ` Tom Zanussi
2015-03-09 11:38   ` He Kuang
2015-03-02 19:36 Alexei Starovoitov
2015-03-02 19:45 ` Steven Rostedt
2015-03-02 19:49   ` Karim Yaghmour
2015-03-02 20:33     ` Alexei Starovoitov
2015-03-02 20:37       ` Karim Yaghmour
2015-03-02 20:39       ` Steven Rostedt
2015-03-02 20:43         ` Steven Rostedt
2015-03-02 20:48           ` Alexei Starovoitov
2015-03-02 20:52             ` Karim Yaghmour
2015-03-02 20:54             ` Karim Yaghmour
2015-03-03  0:01 Alexei Starovoitov
2015-03-03  1:18 ` Tom Zanussi
2015-03-03  2:31 Alexei Starovoitov
2015-03-03 15:47 ` Tom Zanussi
2015-03-04  2:22   ` Alexei Starovoitov
2015-03-04  5:01     ` Tom Zanussi
