* [for-next][PATCH 00/20] tracing: function triggers, stack tracer fixes, clocks and documentation
From: Steven Rostedt @ 2013-03-15 19:39 UTC
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker

Here's another set of patches.

Mostly I updated the function triggers. You know the kind:

 echo bad_area_nosemaphore:traceoff > set_ftrace_filter

I cleaned up the code and also added a few more triggers.
Previously the only triggers were traceon and traceoff, which turn
tracing on and off when a function is hit.

The new triggers are:

 snapshot:  Take a snapshot of the buffer (swap the live buffer with
	    the "snapshot" buffer) when a function is hit. Note, it
	    will not snapshot in NMI context.

 stacktrace: Write the current kernel stacktrace into the ring buffer.

 enable_event: Enable a specified event to start tracing.

 disable_event: Disable a specified event to stop it from tracing.
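
For example (assuming debugfs is mounted at /debug, as in the patches
below; the stacktrace trigger follows the same function:command form):

 echo 'schedule:snapshot:1' > /debug/tracing/set_ftrace_filter
 echo 'schedule:stacktrace' > /debug/tracing/set_ftrace_filter
 echo 'schedule:enable_event:sched:sched_switch' > /debug/tracing/set_ftrace_filter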

I also added two new trace_clocks:

  uptime - counts jiffies since boot up and reports in nanoseconds.

  perf   - uses perf's local_clock(). This will help later when
	   perf can read ftrace buffers and can interleave them.
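
A clock is selected by writing its name into the trace_clock file
(the output below is illustrative; the bracketed name is the active
clock):

 cat /debug/tracing/trace_clock
 [local] global counter uptime perf
 echo uptime > /debug/tracing/trace_clock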

Finally, I updated Documentation/trace/ftrace.txt to reflect all these
changes, as well as changes that have been done since it was last updated.
That file was suffering from some major bit rot.

I'll keep these in linux-next for a while before doing my pull request
to Ingo.

Enjoy,

-- Steve


Steven Rostedt (Red Hat) (20):
      tracing: Consolidate updating of count for traceon/off
      tracing: Consolidate ftrace_trace_onoff_unreg() into callback
      ftrace: Separate unlimited probes from count limited probes
      ftrace: Fix function probe to only enable needed functions
      tracing: Add alloc/free_snapshot() to replace duplicate code
      tracing: Add snapshot trigger to function probes
      tracing: Fix comments for ftrace_event_file/call flags
      ftrace: Clean up function probe methods
      ftrace: Use manual free after synchronize_sched() not call_rcu_sched()
      tracing: Add a way to soft disable trace events
      tracing: Add function probe triggers to enable/disable events
      tracing: Add skip argument to trace_dump_stack()
      tracing: Add function probe to trigger stack traces
      tracing: Use stack of calling function for stack tracer
      tracing: Fix stack tracer with fentry use
      tracing: Remove most or all of stack tracer stack size from stack_max_size
      tracing: Add function-trace option to disable function tracing of latency tracers
      tracing: Add "uptime" trace clock that uses jiffies
      tracing: Add "perf" trace_clock
      tracing: Bring Documentation/trace/ftrace.txt up to date

----
 Documentation/trace/ftrace.txt    | 2097 ++++++++++++++++++++++++++-----------
 include/linux/ftrace.h            |    6 +-
 include/linux/ftrace_event.h      |   37 +-
 include/linux/kernel.h            |    2 +-
 include/linux/trace_clock.h       |    1 +
 include/trace/ftrace.h            |    8 +
 kernel/trace/ftrace.c             |   73 +-
 kernel/trace/trace.c              |  208 +++-
 kernel/trace/trace.h              |    1 +
 kernel/trace/trace_clock.c        |   10 +
 kernel/trace/trace_events.c       |  354 ++++++-
 kernel/trace/trace_functions.c    |  199 +++-
 kernel/trace/trace_irqsoff.c      |   67 +-
 kernel/trace/trace_sched_wakeup.c |   63 +-
 kernel/trace/trace_stack.c        |   74 +-
 15 files changed, 2418 insertions(+), 782 deletions(-)


* [for-next][PATCH 01/20] tracing: Consolidate updating of count for traceon/off
From: Steven Rostedt @ 2013-03-15 19:39 UTC
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Remove some duplicate code and replace it with a helper function.
This makes the code a bit cleaner.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_functions.c |   33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index e467c0c..38cfb29 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -214,38 +214,37 @@ static struct tracer function_trace __read_mostly =
 };
 
 #ifdef CONFIG_DYNAMIC_FTRACE
-static void
-ftrace_traceon(unsigned long ip, unsigned long parent_ip, void **data)
+static int update_count(void **data)
 {
-	long *count = (long *)data;
-
-	if (tracing_is_on())
-		return;
+	unsigned long *count = (long *)data;
 
 	if (!*count)
-		return;
+		return 0;
 
 	if (*count != -1)
 		(*count)--;
 
-	tracing_on();
+	return 1;
 }
 
 static void
-ftrace_traceoff(unsigned long ip, unsigned long parent_ip, void **data)
+ftrace_traceon(unsigned long ip, unsigned long parent_ip, void **data)
 {
-	long *count = (long *)data;
-
-	if (!tracing_is_on())
+	if (tracing_is_on())
 		return;
 
-	if (!*count)
-		return;
+	if (update_count(data))
+		tracing_on();
+}
 
-	if (*count != -1)
-		(*count)--;
+static void
+ftrace_traceoff(unsigned long ip, unsigned long parent_ip, void **data)
+{
+	if (!tracing_is_on())
+		return;
 
-	tracing_off();
+	if (update_count(data))
+		tracing_off();
 }
 
 static int
-- 
1.7.10.4




* [for-next][PATCH 02/20] tracing: Consolidate ftrace_trace_onoff_unreg() into callback
From: Steven Rostedt @ 2013-03-15 19:39 UTC
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

The only thing ftrace_trace_onoff_unreg() does is a strcmp() against
the cmd parameter to determine which op to unregister. But the same
compare is done again at the call site after this function returns.
By moving the check for '!' (unregister) to after the strcmp(), the
callback function itself can do the unregister and we can get rid of
the helper function.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_functions.c |   24 +++++-------------------
 1 file changed, 5 insertions(+), 19 deletions(-)

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 38cfb29..a88a3e0 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -283,22 +283,6 @@ ftrace_trace_onoff_print(struct seq_file *m, unsigned long ip,
 }
 
 static int
-ftrace_trace_onoff_unreg(char *glob, char *cmd, char *param)
-{
-	struct ftrace_probe_ops *ops;
-
-	/* we register both traceon and traceoff to this callback */
-	if (strcmp(cmd, "traceon") == 0)
-		ops = &traceon_probe_ops;
-	else
-		ops = &traceoff_probe_ops;
-
-	unregister_ftrace_function_probe_func(glob, ops);
-
-	return 0;
-}
-
-static int
 ftrace_trace_onoff_callback(struct ftrace_hash *hash,
 			    char *glob, char *cmd, char *param, int enable)
 {
@@ -311,15 +295,17 @@ ftrace_trace_onoff_callback(struct ftrace_hash *hash,
 	if (!enable)
 		return -EINVAL;
 
-	if (glob[0] == '!')
-		return ftrace_trace_onoff_unreg(glob+1, cmd, param);
-
 	/* we register both traceon and traceoff to this callback */
 	if (strcmp(cmd, "traceon") == 0)
 		ops = &traceon_probe_ops;
 	else
 		ops = &traceoff_probe_ops;
 
+	if (glob[0] == '!') {
+		unregister_ftrace_function_probe_func(glob+1, ops);
+		return 0;
+	}
+
 	if (!param)
 		goto out_reg;
 
-- 
1.7.10.4




* [for-next][PATCH 03/20] ftrace: Separate unlimited probes from count limited probes
From: Steven Rostedt @ 2013-03-15 19:39 UTC
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

The function tracing probes that trigger traceon or traceoff can
either run unlimited or be given a count of the number of times to
execute.

By separating these two types of probes, we can use the dynamic
ftrace function filtering directly, and remove the brute force
"check if this called function is my probe" routines in ftrace.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_functions.c |   38 +++++++++++++++++++++++++++++++++-----
 1 file changed, 33 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index a88a3e0..043b242 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -228,7 +228,7 @@ static int update_count(void **data)
 }
 
 static void
-ftrace_traceon(unsigned long ip, unsigned long parent_ip, void **data)
+ftrace_traceon_count(unsigned long ip, unsigned long parent_ip, void **data)
 {
 	if (tracing_is_on())
 		return;
@@ -238,7 +238,7 @@ ftrace_traceon(unsigned long ip, unsigned long parent_ip, void **data)
 }
 
 static void
-ftrace_traceoff(unsigned long ip, unsigned long parent_ip, void **data)
+ftrace_traceoff_count(unsigned long ip, unsigned long parent_ip, void **data)
 {
 	if (!tracing_is_on())
 		return;
@@ -247,10 +247,38 @@ ftrace_traceoff(unsigned long ip, unsigned long parent_ip, void **data)
 		tracing_off();
 }
 
+static void
+ftrace_traceon(unsigned long ip, unsigned long parent_ip, void **data)
+{
+	if (tracing_is_on())
+		return;
+
+	tracing_on();
+}
+
+static void
+ftrace_traceoff(unsigned long ip, unsigned long parent_ip, void **data)
+{
+	if (!tracing_is_on())
+		return;
+
+	tracing_off();
+}
+
 static int
 ftrace_trace_onoff_print(struct seq_file *m, unsigned long ip,
 			 struct ftrace_probe_ops *ops, void *data);
 
+static struct ftrace_probe_ops traceon_count_probe_ops = {
+	.func			= ftrace_traceon_count,
+	.print			= ftrace_trace_onoff_print,
+};
+
+static struct ftrace_probe_ops traceoff_count_probe_ops = {
+	.func			= ftrace_traceoff_count,
+	.print			= ftrace_trace_onoff_print,
+};
+
 static struct ftrace_probe_ops traceon_probe_ops = {
 	.func			= ftrace_traceon,
 	.print			= ftrace_trace_onoff_print,
@@ -269,7 +297,7 @@ ftrace_trace_onoff_print(struct seq_file *m, unsigned long ip,
 
 	seq_printf(m, "%ps:", (void *)ip);
 
-	if (ops == &traceon_probe_ops)
+	if (ops == &traceon_probe_ops || ops == &traceon_count_probe_ops)
 		seq_printf(m, "traceon");
 	else
 		seq_printf(m, "traceoff");
@@ -297,9 +325,9 @@ ftrace_trace_onoff_callback(struct ftrace_hash *hash,
 
 	/* we register both traceon and traceoff to this callback */
 	if (strcmp(cmd, "traceon") == 0)
-		ops = &traceon_probe_ops;
+		ops = param ? &traceon_count_probe_ops : &traceon_probe_ops;
 	else
-		ops = &traceoff_probe_ops;
+		ops = param ? &traceoff_count_probe_ops : &traceoff_probe_ops;
 
 	if (glob[0] == '!') {
 		unregister_ftrace_function_probe_func(glob+1, ops);
-- 
1.7.10.4




* [for-next][PATCH 04/20] ftrace: Fix function probe to only enable needed functions
From: Steven Rostedt @ 2013-03-15 19:39 UTC
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Currently the function probe enables all functions and runs a "hash"
against every function call to see if it should call a probe. This
is extremely wasteful.

Note, a probe is something like:

  echo schedule:traceoff > /debug/tracing/set_ftrace_filter

When schedule is called, the probe will disable tracing. But currently,
the probe code registers a callback for *all* functions, and checks
whether the called function is one that has a probe attached.

The probe code was created before ftrace was rewritten to allow for
more than one "op" to be registered by the function tracer. Back then,
probes couldn't limit the functions traced without also limiting
normal function calls. But now we can, so it's about time to update
the probe code.

Todo: have separate ops for different entries. That is, assign an
ftrace_ops per probe, instead of one op for all probes. But as there
aren't many probes assigned, this may not be that urgent.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/ftrace.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index e6effd0..dab031f 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -2988,18 +2988,20 @@ static void ftrace_free_entry_rcu(struct rcu_head *rhp)
 	kfree(entry);
 }
 
-
 int
 register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
 			      void *data)
 {
 	struct ftrace_func_probe *entry;
+	struct ftrace_hash **orig_hash = &trace_probe_ops.filter_hash;
+	struct ftrace_hash *hash;
 	struct ftrace_page *pg;
 	struct dyn_ftrace *rec;
 	int type, len, not;
 	unsigned long key;
 	int count = 0;
 	char *search;
+	int ret;
 
 	type = filter_parse_regex(glob, strlen(glob), &search, &not);
 	len = strlen(search);
@@ -3010,8 +3012,16 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
 
 	mutex_lock(&ftrace_lock);
 
-	if (unlikely(ftrace_disabled))
+	hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, *orig_hash);
+	if (!hash) {
+		count = -ENOMEM;
+		goto out_unlock;
+	}
+
+	if (unlikely(ftrace_disabled)) {
+		count = -ENODEV;
 		goto out_unlock;
+	}
 
 	do_for_each_ftrace_rec(pg, rec) {
 
@@ -3043,6 +3053,13 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
 			}
 		}
 
+		ret = enter_record(hash, rec, 0);
+		if (ret < 0) {
+			kfree(entry);
+			count = ret;
+			goto out_unlock;
+		}
+
 		entry->ops = ops;
 		entry->ip = rec->ip;
 
@@ -3050,10 +3067,16 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
 		hlist_add_head_rcu(&entry->node, &ftrace_func_hash[key]);
 
 	} while_for_each_ftrace_rec();
+
+	ret = ftrace_hash_move(&trace_probe_ops, 1, orig_hash, hash);
+	if (ret < 0)
+		count = ret;
+
 	__enable_ftrace_function_probe();
 
  out_unlock:
 	mutex_unlock(&ftrace_lock);
+	free_ftrace_hash(hash);
 
 	return count;
 }
@@ -3067,7 +3090,10 @@ static void
 __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
 				  void *data, int flags)
 {
+	struct ftrace_func_entry *rec_entry;
 	struct ftrace_func_probe *entry;
+	struct ftrace_hash **orig_hash = &trace_probe_ops.filter_hash;
+	struct ftrace_hash *hash;
 	struct hlist_node *n, *tmp;
 	char str[KSYM_SYMBOL_LEN];
 	int type = MATCH_FULL;
@@ -3088,6 +3114,12 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
 	}
 
 	mutex_lock(&ftrace_lock);
+
+	hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, *orig_hash);
+	if (!hash)
+		/* Hmm, should report this somehow */
+		goto out_unlock;
+
 	for (i = 0; i < FTRACE_FUNC_HASHSIZE; i++) {
 		struct hlist_head *hhd = &ftrace_func_hash[i];
 
@@ -3108,12 +3140,24 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
 					continue;
 			}
 
+			rec_entry = ftrace_lookup_ip(hash, entry->ip);
+			/* It is possible more than one entry had this ip */
+			if (rec_entry)
+				free_hash_entry(hash, rec_entry);
+
 			hlist_del_rcu(&entry->node);
 			call_rcu_sched(&entry->rcu, ftrace_free_entry_rcu);
 		}
 	}
 	__disable_ftrace_function_probe();
+	/*
+	 * Remove after the disable is called. Otherwise, if the last
+	 * probe is removed, a null hash means *all enabled*.
+	 */
+	ftrace_hash_move(&trace_probe_ops, 1, orig_hash, hash);
+ out_unlock:
 	mutex_unlock(&ftrace_lock);
+	free_ftrace_hash(hash);
 }
 
 void
-- 
1.7.10.4




* [for-next][PATCH 05/20] tracing: Add alloc/free_snapshot() to replace duplicate code
From: Steven Rostedt @ 2013-03-15 19:39 UTC
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Add alloc_snapshot() and free_snapshot() to allocate and free the
snapshot buffer respectively, and use these to remove duplicate
code.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace.c |   79 +++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 37 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 5c53e40..906049c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -149,14 +149,14 @@ static int __init set_ftrace_dump_on_oops(char *str)
 }
 __setup("ftrace_dump_on_oops", set_ftrace_dump_on_oops);
 
-static int __init alloc_snapshot(char *str)
+static int __init boot_alloc_snapshot(char *str)
 {
 	allocate_snapshot = true;
 	/* We also need the main ring buffer expanded */
 	ring_buffer_expanded = true;
 	return 1;
 }
-__setup("alloc_snapshot", alloc_snapshot);
+__setup("alloc_snapshot", boot_alloc_snapshot);
 
 
 static char trace_boot_options_buf[MAX_TRACER_SIZE] __initdata;
@@ -470,6 +470,38 @@ EXPORT_SYMBOL_GPL(tracing_snapshot);
 
 static int resize_buffer_duplicate_size(struct trace_buffer *trace_buf,
 					struct trace_buffer *size_buf, int cpu_id);
+static void set_buffer_entries(struct trace_buffer *buf, unsigned long val);
+
+static int alloc_snapshot(struct trace_array *tr)
+{
+	int ret;
+
+	if (!tr->allocated_snapshot) {
+
+		/* allocate spare buffer */
+		ret = resize_buffer_duplicate_size(&tr->max_buffer,
+				   &tr->trace_buffer, RING_BUFFER_ALL_CPUS);
+		if (ret < 0)
+			return ret;
+
+		tr->allocated_snapshot = true;
+	}
+
+	return 0;
+}
+
+void free_snapshot(struct trace_array *tr)
+{
+	/*
+	 * We don't free the ring buffer. instead, resize it because
+	 * The max_tr ring buffer has some state (e.g. ring->clock) and
+	 * we want preserve it.
+	 */
+	ring_buffer_resize(tr->max_buffer.buffer, 1, RING_BUFFER_ALL_CPUS);
+	set_buffer_entries(&tr->max_buffer, 1);
+	tracing_reset_online_cpus(&tr->max_buffer);
+	tr->allocated_snapshot = false;
+}
 
 /**
  * trace_snapshot_alloc - allocate and take a snapshot of the current buffer.
@@ -487,16 +519,9 @@ void tracing_snapshot_alloc(void)
 	struct trace_array *tr = &global_trace;
 	int ret;
 
-	if (!tr->allocated_snapshot) {
-
-		/* allocate spare buffer */
-		ret = resize_buffer_duplicate_size(&tr->max_buffer,
-				   &tr->trace_buffer, RING_BUFFER_ALL_CPUS);
-		if (WARN_ON(ret < 0))
-			return;
-
-		tr->allocated_snapshot = true;
-	}
+	ret = alloc_snapshot(tr);
+	if (WARN_ON(ret < 0))
+		return;
 
 	tracing_snapshot();
 }
@@ -3581,15 +3606,7 @@ static int tracing_set_tracer(const char *buf)
 		 * so a synchronized_sched() is sufficient.
 		 */
 		synchronize_sched();
-		/*
-		 * We don't free the ring buffer. instead, resize it because
-		 * The max_tr ring buffer has some state (e.g. ring->clock) and
-		 * we want preserve it.
-		 */
-		ring_buffer_resize(tr->max_buffer.buffer, 1, RING_BUFFER_ALL_CPUS);
-		set_buffer_entries(&tr->max_buffer, 1);
-		tracing_reset_online_cpus(&tr->max_buffer);
-		tr->allocated_snapshot = false;
+		free_snapshot(tr);
 	}
 #endif
 	destroy_trace_option_files(topts);
@@ -3598,12 +3615,9 @@ static int tracing_set_tracer(const char *buf)
 
 #ifdef CONFIG_TRACER_MAX_TRACE
 	if (t->use_max_tr && !had_max_tr) {
-		/* we need to make per cpu buffer sizes equivalent */
-		ret = resize_buffer_duplicate_size(&tr->max_buffer, &tr->trace_buffer,
-						   RING_BUFFER_ALL_CPUS);
+		ret = alloc_snapshot(tr);
 		if (ret < 0)
 			goto out;
-		tr->allocated_snapshot = true;
 	}
 #endif
 
@@ -4475,14 +4489,8 @@ tracing_snapshot_write(struct file *filp, const char __user *ubuf, size_t cnt,
 			ret = -EINVAL;
 			break;
 		}
-		if (tr->allocated_snapshot) {
-			/* free spare buffer */
-			ring_buffer_resize(tr->max_buffer.buffer, 1,
-					   RING_BUFFER_ALL_CPUS);
-			set_buffer_entries(&tr->max_buffer, 1);
-			tracing_reset_online_cpus(&tr->max_buffer);
-			tr->allocated_snapshot = false;
-		}
+		if (tr->allocated_snapshot)
+			free_snapshot(tr);
 		break;
 	case 1:
 /* Only allow per-cpu swap if the ring buffer supports it */
@@ -4493,12 +4501,9 @@ tracing_snapshot_write(struct file *filp, const char __user *ubuf, size_t cnt,
 		}
 #endif
 		if (!tr->allocated_snapshot) {
-			/* allocate spare buffer */
-			ret = resize_buffer_duplicate_size(&tr->max_buffer,
-					&tr->trace_buffer, RING_BUFFER_ALL_CPUS);
+			ret = alloc_snapshot(tr);
 			if (ret < 0)
 				break;
-			tr->allocated_snapshot = true;
 		}
 		local_irq_disable();
 		/* Now, we're going to swap */
-- 
1.7.10.4




* [for-next][PATCH 06/20] tracing: Add snapshot trigger to function probes
From: Steven Rostedt @ 2013-03-15 19:39 UTC
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

 echo 'schedule:snapshot:1' > /debug/tracing/set_ftrace_filter

This will cause the scheduler to trigger a snapshot the next time
it's called (you can use any function that's not called by NMI).

Even though it triggers only once, you still need to remove it with:

 echo '!schedule:snapshot:0' > /debug/tracing/set_ftrace_filter

The :1 can be left off for the first command:

 echo 'schedule:snapshot' > /debug/tracing/set_ftrace_filter

But this will cause every call to schedule to trigger a snapshot,
and it must be removed without the ':0':

 echo '!schedule:snapshot' > /debug/tracing/set_ftrace_filter

as adding a "count" makes it a different operation internally.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace.c |  111 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 110 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 906049c..c5b8446 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5086,7 +5086,114 @@ static const struct file_operations tracing_dyn_info_fops = {
 	.read		= tracing_read_dyn_info,
 	.llseek		= generic_file_llseek,
 };
-#endif
+#endif /* CONFIG_DYNAMIC_FTRACE */
+
+#if defined(CONFIG_TRACER_SNAPSHOT) && defined(CONFIG_DYNAMIC_FTRACE)
+static void
+ftrace_snapshot(unsigned long ip, unsigned long parent_ip, void **data)
+{
+	tracing_snapshot();
+}
+
+static void
+ftrace_count_snapshot(unsigned long ip, unsigned long parent_ip, void **data)
+{
+	unsigned long *count = (long *)data;
+
+	if (!*count)
+		return;
+
+	if (*count != -1)
+		(*count)--;
+
+	tracing_snapshot();
+}
+
+static int
+ftrace_snapshot_print(struct seq_file *m, unsigned long ip,
+		      struct ftrace_probe_ops *ops, void *data)
+{
+	long count = (long)data;
+
+	seq_printf(m, "%ps:", (void *)ip);
+
+	seq_printf(m, "snapshot");
+
+	if (count == -1)
+		seq_printf(m, ":unlimited\n");
+	else
+		seq_printf(m, ":count=%ld\n", count);
+
+	return 0;
+}
+
+static struct ftrace_probe_ops snapshot_probe_ops = {
+	.func			= ftrace_snapshot,
+	.print			= ftrace_snapshot_print,
+};
+
+static struct ftrace_probe_ops snapshot_count_probe_ops = {
+	.func			= ftrace_count_snapshot,
+	.print			= ftrace_snapshot_print,
+};
+
+static int
+ftrace_trace_snapshot_callback(struct ftrace_hash *hash,
+			       char *glob, char *cmd, char *param, int enable)
+{
+	struct ftrace_probe_ops *ops;
+	void *count = (void *)-1;
+	char *number;
+	int ret;
+
+	/* hash funcs only work with set_ftrace_filter */
+	if (!enable)
+		return -EINVAL;
+
+	ops = param ? &snapshot_count_probe_ops :  &snapshot_probe_ops;
+
+	if (glob[0] == '!') {
+		unregister_ftrace_function_probe_func(glob+1, ops);
+		return 0;
+	}
+
+	if (!param)
+		goto out_reg;
+
+	number = strsep(&param, ":");
+
+	if (!strlen(number))
+		goto out_reg;
+
+	/*
+	 * We use the callback data field (which is a pointer)
+	 * as our counter.
+	 */
+	ret = kstrtoul(number, 0, (unsigned long *)&count);
+	if (ret)
+		return ret;
+
+ out_reg:
+	ret = register_ftrace_function_probe(glob, ops, count);
+
+	if (ret >= 0)
+		alloc_snapshot(&global_trace);
+
+	return ret < 0 ? ret : 0;
+}
+
+static struct ftrace_func_command ftrace_snapshot_cmd = {
+	.name			= "snapshot",
+	.func			= ftrace_trace_snapshot_callback,
+};
+
+static int register_snapshot_cmd(void)
+{
+	return register_ftrace_command(&ftrace_snapshot_cmd);
+}
+#else
+static inline int register_snapshot_cmd(void) { return 0; }
+#endif /* defined(CONFIG_TRACER_SNAPSHOT) && defined(CONFIG_DYNAMIC_FTRACE) */
 
 struct dentry *tracing_init_dentry_tr(struct trace_array *tr)
 {
@@ -6076,6 +6183,8 @@ __init static int tracer_alloc_buffers(void)
 		trace_set_options(&global_trace, option);
 	}
 
+	register_snapshot_cmd();
+
 	return 0;
 
 out_free_cpumask:
-- 
1.7.10.4




* [for-next][PATCH 07/20] tracing: Fix comments for ftrace_event_file/call flags
From: Steven Rostedt @ 2013-03-15 19:39 UTC
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Most of the flags for the struct ftrace_event_file were moved over
to the flags of the struct ftrace_event_call, but the comments were
never updated.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/ftrace_event.h |   17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index d84c4a5..4cb6cd8 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -230,6 +230,13 @@ struct ftrace_event_call {
 	struct list_head	*files;
 	void			*mod;
 	void			*data;
+	/*
+	 *   bit 0:		filter_active
+	 *   bit 1:		allow trace by non root (cap any)
+	 *   bit 2:		failed to apply filter
+	 *   bit 3:		ftrace internal event (do not enable)
+	 *   bit 4:		Event was enabled by module
+	 */
 	int			flags; /* static flags of different events */
 
 #ifdef CONFIG_PERF_EVENTS
@@ -248,7 +255,7 @@ enum {
 
 /*
  * Ftrace event file flags:
- *  ENABELD	  - The event is enabled
+ *  ENABLED	  - The event is enabled
  *  RECORDED_CMD  - The comms should be recorded at sched_switch
  */
 enum {
@@ -265,12 +272,8 @@ struct ftrace_event_file {
 
 	/*
 	 * 32 bit flags:
-	 *   bit 1:		enabled
-	 *   bit 2:		filter_active
-	 *   bit 3:		enabled cmd record
-	 *   bit 4:		allow trace by non root (cap any)
-	 *   bit 5:		failed to apply filter
-	 *   bit 6:		ftrace internal event (do not enable)
+	 *   bit 0:		enabled
+	 *   bit 1:		enabled cmd record
 	 *
 	 * Changes to flags must hold the event_mutex.
 	 *
-- 
1.7.10.4




* [for-next][PATCH 08/20] ftrace: Clean up function probe methods
From: Steven Rostedt @ 2013-03-15 19:39 UTC
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

When a function probe is created, a "callback" method is called for
each function that the probe is attached to. On release of the probe,
the "free" method is called for each function entry.

First, "callback" is a confusing name and does not really match what
it does. Callback sounds like it will be called when the probe
triggers. But that's not the case. This is really an "init" function,
so lets rename it as such.

Secondly, both "init" and "free" do not pass enough information back
to the handlers. Pass back the ops, ip and data for each time the
method is called. We have the information, might as well use it.
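
A hypothetical probe using the new method signatures would look like
this (the my_* names are made up for illustration):

static void my_func(unsigned long ip, unsigned long parent_ip,
		    void **data)
{
	/* the actual probe action when the probed function is hit */
}

static int my_init(struct ftrace_probe_ops *ops, unsigned long ip,
		   void **data)
{
	/* per-function setup; ops and ip are now available */
	return 0;	/* a negative return rejects this function */
}

static void my_free(struct ftrace_probe_ops *ops, unsigned long ip,
		    void **data)
{
	/* release whatever my_init set up for this ip */
}

static struct ftrace_probe_ops my_probe_ops = {
	.func	= my_func,
	.init	= my_init,
	.free	= my_free,
};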

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/ftrace.h |    6 ++++--
 kernel/trace/ftrace.c  |    6 +++---
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index e5ca8ef..832422d 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -259,8 +259,10 @@ struct ftrace_probe_ops {
 	void			(*func)(unsigned long ip,
 					unsigned long parent_ip,
 					void **data);
-	int			(*callback)(unsigned long ip, void **data);
-	void			(*free)(void **data);
+	int			(*init)(struct ftrace_probe_ops *ops,
+					unsigned long ip, void **data);
+	void			(*free)(struct ftrace_probe_ops *ops,
+					unsigned long ip, void **data);
 	int			(*print)(struct seq_file *m,
 					 unsigned long ip,
 					 struct ftrace_probe_ops *ops,
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index dab031f..ff0ef41 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -2984,7 +2984,7 @@ static void ftrace_free_entry_rcu(struct rcu_head *rhp)
 		container_of(rhp, struct ftrace_func_probe, rcu);
 
 	if (entry->ops->free)
-		entry->ops->free(&entry->data);
+		entry->ops->free(entry->ops, entry->ip, &entry->data);
 	kfree(entry);
 }
 
@@ -3045,8 +3045,8 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
 		 * for each function we find. We call the callback
 		 * to give the caller an opportunity to do so.
 		 */
-		if (ops->callback) {
-			if (ops->callback(rec->ip, &entry->data) < 0) {
+		if (ops->init) {
+			if (ops->init(ops, rec->ip, &entry->data) < 0) {
 				/* caller does not like this func */
 				kfree(entry);
 				continue;
-- 
1.7.10.4




* [for-next][PATCH 09/20] ftrace: Use manual free after synchronize_sched() not call_rcu_sched()
From: Steven Rostedt @ 2013-03-15 19:39 UTC
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, Paul McKenney

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

The entries to the probe hash must be freed after a synchronize_sched()
after the entry has been removed from the hash.

As the entries are registered with ops that may have their own
callbacks, and these callbacks may sleep, we cannot use call_rcu_sched(),
because the RCU callbacks registered with it run from softirq context.

Instead of using call_rcu_sched(), manually save the entries on a free_list
and at the end of the loop that removes the entries, do a synchronize_sched()
and then go through the free_list, freeing the entries.

Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/ftrace.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index ff0ef41..2577082 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1068,7 +1068,7 @@ struct ftrace_func_probe {
 	unsigned long		flags;
 	unsigned long		ip;
 	void			*data;
-	struct rcu_head		rcu;
+	struct list_head	free_list;
 };
 
 struct ftrace_func_entry {
@@ -2978,11 +2978,8 @@ static void __disable_ftrace_function_probe(void)
 }
 
 
-static void ftrace_free_entry_rcu(struct rcu_head *rhp)
+static void ftrace_free_entry(struct ftrace_func_probe *entry)
 {
-	struct ftrace_func_probe *entry =
-		container_of(rhp, struct ftrace_func_probe, rcu);
-
 	if (entry->ops->free)
 		entry->ops->free(entry->ops, entry->ip, &entry->data);
 	kfree(entry);
@@ -3092,7 +3089,9 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
 {
 	struct ftrace_func_entry *rec_entry;
 	struct ftrace_func_probe *entry;
+	struct ftrace_func_probe *p;
 	struct ftrace_hash **orig_hash = &trace_probe_ops.filter_hash;
+	struct list_head free_list;
 	struct ftrace_hash *hash;
 	struct hlist_node *n, *tmp;
 	char str[KSYM_SYMBOL_LEN];
@@ -3120,6 +3119,8 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
 		/* Hmm, should report this somehow */
 		goto out_unlock;
 
+	INIT_LIST_HEAD(&free_list);
+
 	for (i = 0; i < FTRACE_FUNC_HASHSIZE; i++) {
 		struct hlist_head *hhd = &ftrace_func_hash[i];
 
@@ -3146,7 +3147,7 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
 				free_hash_entry(hash, rec_entry);
 
 			hlist_del_rcu(&entry->node);
-			call_rcu_sched(&entry->rcu, ftrace_free_entry_rcu);
+			list_add(&entry->free_list, &free_list);
 		}
 	}
 	__disable_ftrace_function_probe();
@@ -3155,6 +3156,12 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
 	 * probe is removed, a null hash means *all enabled*.
 	 */
 	ftrace_hash_move(&trace_probe_ops, 1, orig_hash, hash);
+	synchronize_sched();
+	list_for_each_entry_safe(entry, p, &free_list, free_list) {
+		list_del(&entry->free_list);
+		ftrace_free_entry(entry);
+	}
+		
  out_unlock:
 	mutex_unlock(&ftrace_lock);
 	free_ftrace_hash(hash);
-- 
1.7.10.4




* [for-next][PATCH 10/20] tracing: Add a way to soft disable trace events
From: Steven Rostedt @ 2013-03-15 19:39 UTC
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, Dan Carpenter,
	Tom Zanussi

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

In order to let triggers enable or disable events, we need a 'soft'
method for doing so. For example, if a function probe is added that
lets a user enable or disable events when a function is called, that
change must be done without taking locks or a mutex, and it definitely
can't sleep. But the full enabling of a tracepoint is expensive.

By adding a 'SOFT_DISABLE' flag, and converting the flags to be updated
without the protection of a mutex (using set/clear_bit()), this soft
disable flag can be used to allow critical sections to enable or disable
events from being traced (after the event has been placed into "SOFT_MODE").

Some caveats though: the comm recorder (which maps pids to comms)
cannot be soft disabled (yet). If you disable an event with a "soft"
disable and wait a while before reading the trace, the comm cache may
be replaced and you'll get a bunch of <...> for comms in the trace.

Reading the "enable" file for an event that is disabled will now give
you "0*" where the '*' denotes that the tracepoint is still active but
the event itself is "disabled".

[ fixed _BIT used in & operation : thanks to Dan Carpenter and smatch ]

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/ftrace_event.h |   20 +++++++----
 include/trace/ftrace.h       |    8 +++++
 kernel/trace/trace_events.c  |   75 ++++++++++++++++++++++++++++++++++--------
 3 files changed, 84 insertions(+), 19 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 4cb6cd8..4e28b01 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -251,16 +251,23 @@ struct ftrace_subsystem_dir;
 enum {
 	FTRACE_EVENT_FL_ENABLED_BIT,
 	FTRACE_EVENT_FL_RECORDED_CMD_BIT,
+	FTRACE_EVENT_FL_SOFT_MODE_BIT,
+	FTRACE_EVENT_FL_SOFT_DISABLED_BIT,
 };
 
 /*
  * Ftrace event file flags:
  *  ENABLED	  - The event is enabled
  *  RECORDED_CMD  - The comms should be recorded at sched_switch
+ *  SOFT_MODE     - The event is enabled/disabled by SOFT_DISABLED
+ *  SOFT_DISABLED - When set, do not trace the event (even though its
+ *                   tracepoint may be enabled)
  */
 enum {
 	FTRACE_EVENT_FL_ENABLED		= (1 << FTRACE_EVENT_FL_ENABLED_BIT),
 	FTRACE_EVENT_FL_RECORDED_CMD	= (1 << FTRACE_EVENT_FL_RECORDED_CMD_BIT),
+	FTRACE_EVENT_FL_SOFT_MODE	= (1 << FTRACE_EVENT_FL_SOFT_MODE_BIT),
+	FTRACE_EVENT_FL_SOFT_DISABLED	= (1 << FTRACE_EVENT_FL_SOFT_DISABLED_BIT),
 };
 
 struct ftrace_event_file {
@@ -274,17 +281,18 @@ struct ftrace_event_file {
 	 * 32 bit flags:
 	 *   bit 0:		enabled
 	 *   bit 1:		enabled cmd record
+	 *   bit 2:		enable/disable with the soft disable bit
+	 *   bit 3:		soft disabled
 	 *
-	 * Changes to flags must hold the event_mutex.
-	 *
-	 * Note: Reads of flags do not hold the event_mutex since
-	 * they occur in critical sections. But the way flags
+	 * Note: The bits must be set atomically to prevent races
+	 * from other writers. Reads of flags do not need to be in
+	 * sync as they occur in critical sections. But the way flags
 	 * is currently used, these changes do not affect the code
 	 * except that when a change is made, it may have a slight
 	 * delay in propagating the changes to other CPUs due to
-	 * caching and such.
+	 * caching and such. Which is mostly OK ;-)
 	 */
-	unsigned int		flags;
+	unsigned long		flags;
 };
 
 #define __TRACE_EVENT_FLAGS(name, value)				\
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index bbf09c2..4bda044 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -413,6 +413,10 @@ static inline notrace int ftrace_get_offsets_##call(			\
  *	int __data_size;
  *	int pc;
  *
+ *	if (test_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT,
+ *		     &ftrace_file->flags))
+ *		return;
+ *
  *	local_save_flags(irq_flags);
  *	pc = preempt_count();
  *
@@ -518,6 +522,10 @@ ftrace_raw_event_##call(void *__data, proto)				\
 	int __data_size;						\
 	int pc;								\
 									\
+	if (test_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT,			\
+		     &ftrace_file->flags))				\
+		return;							\
+									\
 	local_save_flags(irq_flags);					\
 	pc = preempt_count();						\
 									\
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 38b54c5..106640b 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -205,37 +205,77 @@ void trace_event_enable_cmd_record(bool enable)
 
 		if (enable) {
 			tracing_start_cmdline_record();
-			file->flags |= FTRACE_EVENT_FL_RECORDED_CMD;
+			set_bit(FTRACE_EVENT_FL_RECORDED_CMD_BIT, &file->flags);
 		} else {
 			tracing_stop_cmdline_record();
-			file->flags &= ~FTRACE_EVENT_FL_RECORDED_CMD;
+			clear_bit(FTRACE_EVENT_FL_RECORDED_CMD_BIT, &file->flags);
 		}
 	} while_for_each_event_file();
 	mutex_unlock(&event_mutex);
 }
 
-static int ftrace_event_enable_disable(struct ftrace_event_file *file,
-				       int enable)
+static int __ftrace_event_enable_disable(struct ftrace_event_file *file,
+					 int enable, int soft_disable)
 {
 	struct ftrace_event_call *call = file->event_call;
 	int ret = 0;
+	int disable;
 
 	switch (enable) {
 	case 0:
-		if (file->flags & FTRACE_EVENT_FL_ENABLED) {
-			file->flags &= ~FTRACE_EVENT_FL_ENABLED;
+		/*
+		 * When soft_disable is set and enable is cleared, we want
+		 * to clear the SOFT_DISABLED flag but leave the event in the
+		 * state that it was. That is, if the event was enabled and
+		 * SOFT_DISABLED isn't set, then do nothing. But if SOFT_DISABLED
+		 * is set we do not want the event to be enabled before we
+		 * clear the bit.
+		 *
+		 * When soft_disable is not set but the SOFT_MODE flag is,
+		 * we do nothing. Do not disable the tracepoint, otherwise
+		 * "soft enable"s (clearing the SOFT_DISABLED bit) wont work.
+		 */
+		if (soft_disable) {
+			disable = file->flags & FTRACE_EVENT_FL_SOFT_DISABLED;
+			clear_bit(FTRACE_EVENT_FL_SOFT_MODE_BIT, &file->flags);
+		} else
+			disable = !(file->flags & FTRACE_EVENT_FL_SOFT_MODE);
+
+		if (disable && (file->flags & FTRACE_EVENT_FL_ENABLED)) {
+			clear_bit(FTRACE_EVENT_FL_ENABLED_BIT, &file->flags);
 			if (file->flags & FTRACE_EVENT_FL_RECORDED_CMD) {
 				tracing_stop_cmdline_record();
-				file->flags &= ~FTRACE_EVENT_FL_RECORDED_CMD;
+				clear_bit(FTRACE_EVENT_FL_RECORDED_CMD_BIT, &file->flags);
 			}
 			call->class->reg(call, TRACE_REG_UNREGISTER, file);
 		}
+		/* If in SOFT_MODE, just set the SOFT_DISABLE_BIT */
+		if (file->flags & FTRACE_EVENT_FL_SOFT_MODE)
+			set_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);
 		break;
 	case 1:
+		/*
+		 * When soft_disable is set and enable is set, we want to
+		 * register the tracepoint for the event, but leave the event
+		 * as is. That means, if the event was already enabled, we do
+		 * nothing (but set SOFT_MODE). If the event is disabled, we
+		 * set SOFT_DISABLED before enabling the event tracepoint, so
+		 * it still seems to be disabled.
+		 */
+		if (!soft_disable)
+			clear_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);
+		else
+			set_bit(FTRACE_EVENT_FL_SOFT_MODE_BIT, &file->flags);
+
 		if (!(file->flags & FTRACE_EVENT_FL_ENABLED)) {
+
+			/* Keep the event disabled, when going to SOFT_MODE. */
+			if (soft_disable)
+				set_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);
+
 			if (trace_flags & TRACE_ITER_RECORD_CMD) {
 				tracing_start_cmdline_record();
-				file->flags |= FTRACE_EVENT_FL_RECORDED_CMD;
+				set_bit(FTRACE_EVENT_FL_RECORDED_CMD_BIT, &file->flags);
 			}
 			ret = call->class->reg(call, TRACE_REG_REGISTER, file);
 			if (ret) {
@@ -244,7 +284,7 @@ static int ftrace_event_enable_disable(struct ftrace_event_file *file,
 					"%s\n", call->name);
 				break;
 			}
-			file->flags |= FTRACE_EVENT_FL_ENABLED;
+			set_bit(FTRACE_EVENT_FL_ENABLED_BIT, &file->flags);
 
 			/* WAS_ENABLED gets set but never cleared. */
 			call->flags |= TRACE_EVENT_FL_WAS_ENABLED;
@@ -255,6 +295,12 @@ static int ftrace_event_enable_disable(struct ftrace_event_file *file,
 	return ret;
 }
 
+static int ftrace_event_enable_disable(struct ftrace_event_file *file,
+				       int enable)
+{
+	return __ftrace_event_enable_disable(file, enable, 0);
+}
+
 static void ftrace_clear_events(struct trace_array *tr)
 {
 	struct ftrace_event_file *file;
@@ -547,12 +593,15 @@ event_enable_read(struct file *filp, char __user *ubuf, size_t cnt,
 	struct ftrace_event_file *file = filp->private_data;
 	char *buf;
 
-	if (file->flags & FTRACE_EVENT_FL_ENABLED)
-		buf = "1\n";
-	else
+	if (file->flags & FTRACE_EVENT_FL_ENABLED) {
+		if (file->flags & FTRACE_EVENT_FL_SOFT_DISABLED)
+			buf = "0*\n";
+		else
+			buf = "1\n";
+	} else
 		buf = "0\n";
 
-	return simple_read_from_buffer(ubuf, cnt, ppos, buf, 2);
+	return simple_read_from_buffer(ubuf, cnt, ppos, buf, strlen(buf));
 }
 
 static ssize_t
-- 
1.7.10.4




* [for-next][PATCH 11/20] tracing: Add function probe triggers to enable/disable events
From: Steven Rostedt @ 2013-03-15 19:39 UTC
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, Dan Carpenter,
	Tom Zanussi

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Add triggers to the function tracer that let an event be enabled or
disabled when a function is called:

format is:

 <function>:enable_event:<system>:<event>[:<count>]
 <function>:disable_event:<system>:<event>[:<count>]

 echo 'schedule:enable_event:sched:sched_switch' > /debug/tracing/set_ftrace_filter

Every time schedule is called, it will enable the sched_switch event.

 echo 'schedule:disable_event:sched:sched_switch:2' > /debug/tracing/set_ftrace_filter

The first two times schedule is called while the sched_switch event
is enabled, the probe will disable it. A call is not counted when the
event is already disabled (or already enabled, for enable_event).

[ fixed return without mutex_unlock() - thanks to Dan Carpenter and smatch ]

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_events.c |  279 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 279 insertions(+)

diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 106640b..c636523 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1798,6 +1798,283 @@ __trace_add_event_dirs(struct trace_array *tr)
 	}
 }
 
+#ifdef CONFIG_DYNAMIC_FTRACE
+
+/* Avoid typos */
+#define ENABLE_EVENT_STR	"enable_event"
+#define DISABLE_EVENT_STR	"disable_event"
+
+struct event_probe_data {
+	struct ftrace_event_file	*file;
+	unsigned long			count;
+	int				ref;
+	bool				enable;
+};
+
+static struct ftrace_event_file *
+find_event_file(struct trace_array *tr, const char *system,  const char *event)
+{
+	struct ftrace_event_file *file;
+	struct ftrace_event_call *call;
+
+	list_for_each_entry(file, &tr->events, list) {
+
+		call = file->event_call;
+
+		if (!call->name || !call->class || !call->class->reg)
+			continue;
+
+		if (call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)
+			continue;
+
+		if (strcmp(event, call->name) == 0 &&
+		    strcmp(system, call->class->system) == 0)
+			return file;
+	}
+	return NULL;
+}
+
+static void
+event_enable_probe(unsigned long ip, unsigned long parent_ip, void **_data)
+{
+	struct event_probe_data **pdata = (struct event_probe_data **)_data;
+	struct event_probe_data *data = *pdata;
+
+	if (!data)
+		return;
+
+	if (data->enable)
+		clear_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &data->file->flags);
+	else
+		set_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &data->file->flags);
+}
+
+static void
+event_enable_count_probe(unsigned long ip, unsigned long parent_ip, void **_data)
+{
+	struct event_probe_data **pdata = (struct event_probe_data **)_data;
+	struct event_probe_data *data = *pdata;
+
+	if (!data)
+		return;
+
+	if (!data->count)
+		return;
+
+	/* Skip if the event is in a state we want to switch to */
+	if (data->enable == !(data->file->flags & FTRACE_EVENT_FL_SOFT_DISABLED))
+		return;
+
+	if (data->count != -1)
+		(data->count)--;
+
+	event_enable_probe(ip, parent_ip, _data);
+}
+
+static int
+event_enable_print(struct seq_file *m, unsigned long ip,
+		      struct ftrace_probe_ops *ops, void *_data)
+{
+	struct event_probe_data *data = _data;
+
+	seq_printf(m, "%ps:", (void *)ip);
+
+	seq_printf(m, "%s:%s:%s",
+		   data->enable ? ENABLE_EVENT_STR : DISABLE_EVENT_STR,
+		   data->file->event_call->class->system,
+		   data->file->event_call->name);
+
+	if (data->count == -1)
+		seq_printf(m, ":unlimited\n");
+	else
+		seq_printf(m, ":count=%ld\n", data->count);
+
+	return 0;
+}
+
+static int
+event_enable_init(struct ftrace_probe_ops *ops, unsigned long ip,
+		  void **_data)
+{
+	struct event_probe_data **pdata = (struct event_probe_data **)_data;
+	struct event_probe_data *data = *pdata;
+
+	data->ref++;
+	return 0;
+}
+
+static void
+event_enable_free(struct ftrace_probe_ops *ops, unsigned long ip,
+		  void **_data)
+{
+	struct event_probe_data **pdata = (struct event_probe_data **)_data;
+	struct event_probe_data *data = *pdata;
+
+	if (WARN_ON_ONCE(data->ref <= 0))
+		return;
+
+	data->ref--;
+	if (!data->ref) {
+		/* Remove the SOFT_MODE flag */
+		__ftrace_event_enable_disable(data->file, 0, 1);
+		module_put(data->file->event_call->mod);
+		kfree(data);
+	}
+	*pdata = NULL;
+}
+
+static struct ftrace_probe_ops event_enable_probe_ops = {
+	.func			= event_enable_probe,
+	.print			= event_enable_print,
+	.init			= event_enable_init,
+	.free			= event_enable_free,
+};
+
+static struct ftrace_probe_ops event_enable_count_probe_ops = {
+	.func			= event_enable_count_probe,
+	.print			= event_enable_print,
+	.init			= event_enable_init,
+	.free			= event_enable_free,
+};
+
+static struct ftrace_probe_ops event_disable_probe_ops = {
+	.func			= event_enable_probe,
+	.print			= event_enable_print,
+	.init			= event_enable_init,
+	.free			= event_enable_free,
+};
+
+static struct ftrace_probe_ops event_disable_count_probe_ops = {
+	.func			= event_enable_count_probe,
+	.print			= event_enable_print,
+	.init			= event_enable_init,
+	.free			= event_enable_free,
+};
+
+static int
+event_enable_func(struct ftrace_hash *hash,
+		  char *glob, char *cmd, char *param, int enabled)
+{
+	struct trace_array *tr = top_trace_array();
+	struct ftrace_event_file *file;
+	struct ftrace_probe_ops *ops;
+	struct event_probe_data *data;
+	const char *system;
+	const char *event;
+	char *number;
+	bool enable;
+	int ret;
+
+	/* hash funcs only work with set_ftrace_filter */
+	if (!enabled)
+		return -EINVAL;
+
+	if (!param)
+		return -EINVAL;
+
+	system = strsep(&param, ":");
+	if (!param)
+		return -EINVAL;
+
+	event = strsep(&param, ":");
+
+	mutex_lock(&event_mutex);
+
+	ret = -EINVAL;
+	file = find_event_file(tr, system, event);
+	if (!file)
+		goto out;
+
+	enable = strcmp(cmd, ENABLE_EVENT_STR) == 0;
+
+	if (enable)
+		ops = param ? &event_enable_count_probe_ops : &event_enable_probe_ops;
+	else
+		ops = param ? &event_disable_count_probe_ops : &event_disable_probe_ops;
+
+	if (glob[0] == '!') {
+		unregister_ftrace_function_probe_func(glob+1, ops);
+		ret = 0;
+		goto out;
+	}
+
+	ret = -ENOMEM;
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		goto out;
+
+	data->enable = enable;
+	data->count = -1;
+	data->file = file;
+
+	if (!param)
+		goto out_reg;
+
+	number = strsep(&param, ":");
+
+	ret = -EINVAL;
+	if (!strlen(number))
+		goto out_free;
+
+	/*
+	 * We use the callback data field (which is a pointer)
+	 * as our counter.
+	 */
+	ret = kstrtoul(number, 0, &data->count);
+	if (ret)
+		goto out_free;
+
+ out_reg:
+	/* Don't let event modules unload while probe registered */
+	ret = try_module_get(file->event_call->mod);
+	if (!ret)
+		goto out_free;
+
+	ret = __ftrace_event_enable_disable(file, 1, 1);
+	if (ret < 0)
+		goto out_put;
+	ret = register_ftrace_function_probe(glob, ops, data);
+	if (!ret)
+		goto out_disable;
+ out:
+	mutex_unlock(&event_mutex);
+	return ret;
+
+ out_disable:
+	__ftrace_event_enable_disable(file, 0, 1);
+ out_put:
+	module_put(file->event_call->mod);
+ out_free:
+	kfree(data);
+	goto out;
+}
+
+static struct ftrace_func_command event_enable_cmd = {
+	.name			= ENABLE_EVENT_STR,
+	.func			= event_enable_func,
+};
+
+static struct ftrace_func_command event_disable_cmd = {
+	.name			= DISABLE_EVENT_STR,
+	.func			= event_enable_func,
+};
+
+static __init int register_event_cmds(void)
+{
+	int ret;
+
+	ret = register_ftrace_command(&event_enable_cmd);
+	if (WARN_ON(ret < 0))
+		return ret;
+	ret = register_ftrace_command(&event_disable_cmd);
+	if (WARN_ON(ret < 0))
+		unregister_ftrace_command(&event_enable_cmd);
+	return ret;
+}
+#else
+static inline int register_event_cmds(void) { return 0; }
+#endif /* CONFIG_DYNAMIC_FTRACE */
+
 /*
  * The top level array has already had its ftrace_event_file
  * descriptors created in order to allow for early events to
@@ -2058,6 +2335,8 @@ static __init int event_trace_enable(void)
 
 	trace_printk_start_comm();
 
+	register_event_cmds();
+
 	return 0;
 }
 
-- 
1.7.10.4



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [for-next][PATCH 12/20] tracing: Add skip argument to trace_dump_stack()
  2013-03-15 19:39 [for-next][PATCH 00/20] tracing: function triggers, stack tracer fixes, clocks and documenation Steven Rostedt
                   ` (10 preceding siblings ...)
  2013-03-15 19:39 ` [for-next][PATCH 11/20] tracing: Add function probe triggers to enable/disable events Steven Rostedt
@ 2013-03-15 19:39 ` Steven Rostedt
  2013-03-15 19:39 ` [for-next][PATCH 13/20] tracing: Add function probe to trigger stack traces Steven Rostedt
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2013-03-15 19:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker

[-- Attachment #1: Type: text/plain, Size: 2103 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Although trace_dump_stack() already skips three functions in
the call to the stack trace, which gets the stack trace to start
at the caller of the function, the caller may want to skip even
more (as it may have helper functions of its own).

Add a skip argument to trace_dump_stack() that lets the caller
skip functions at the start of the back trace that it doesn't
care about.
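
A hedged sketch (not from this patch) of a caller with its own
helper, where the helper wants the reported trace to start at its
caller rather than at the helper itself:

 /* hypothetical helper: skip its own frame in the back trace */
 static void dump_request_context(void)
 {
	trace_dump_stack(1);
 }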

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/kernel.h |    2 +-
 kernel/trace/trace.c   |   13 +++++++++----
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index d0a16fe..239dbb9 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -597,7 +597,7 @@ extern int __trace_puts(unsigned long ip, const char *str, int size);
 		__trace_puts(_THIS_IP_, str, strlen(str));		\
 })
 
-extern void trace_dump_stack(void);
+extern void trace_dump_stack(int skip);
 
 /*
  * The double __builtin_constant_p is because gcc will give us an error
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c5b8446..8aa5321 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1657,8 +1657,9 @@ void __trace_stack(struct trace_array *tr, unsigned long flags, int skip,
 
 /**
  * trace_dump_stack - record a stack back trace in the trace buffer
+ * @skip: Number of functions to skip (helper handlers)
  */
-void trace_dump_stack(void)
+void trace_dump_stack(int skip)
 {
 	unsigned long flags;
 
@@ -1667,9 +1668,13 @@ void trace_dump_stack(void)
 
 	local_save_flags(flags);
 
-	/* skipping 3 traces, seems to get us at the caller of this function */
-	__ftrace_trace_stack(global_trace.trace_buffer.buffer, flags, 3,
-			     preempt_count(), NULL);
+	/*
+	 * Skip 3 more, seems to get us at the caller of
+	 * this function.
+	 */
+	skip += 3;
+	__ftrace_trace_stack(global_trace.trace_buffer.buffer,
+			     flags, skip, preempt_count(), NULL);
 }
 
 static DEFINE_PER_CPU(int, user_stack_count);
-- 
1.7.10.4



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [for-next][PATCH 13/20] tracing: Add function probe to trigger stack traces
  2013-03-15 19:39 [for-next][PATCH 00/20] tracing: function triggers, stack tracer fixes, clocks and documenation Steven Rostedt
                   ` (11 preceding siblings ...)
  2013-03-15 19:39 ` [for-next][PATCH 12/20] tracing: Add skip argument to trace_dump_stack() Steven Rostedt
@ 2013-03-15 19:39 ` Steven Rostedt
  2013-03-15 19:39 ` [for-next][PATCH 14/20] tracing: Use stack of calling function for stack tracer Steven Rostedt
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2013-03-15 19:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker

[-- Attachment #1: Type: text/plain, Size: 7565 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Add a function probe that will cause a stack trace to be recorded in
the ring buffer when the given function(s) are called.

The format is:

 <function>:stacktrace[:<count>]

 echo 'schedule:stacktrace' > /debug/tracing/set_ftrace_filter
 cat /debug/tracing/trace_pipe
     kworker/2:0-4329  [002] ...2  2933.558007: <stack trace>
 => kthread
 => ret_from_fork
          <idle>-0     [000] .N.2  2933.558019: <stack trace>
 => rest_init
 => start_kernel
 => x86_64_start_reservations
 => x86_64_start_kernel
     kworker/2:0-4329  [002] ...2  2933.558109: <stack trace>
 => kthread
 => ret_from_fork
[...]

This can be set to trigger only a specific number of times:

 echo 'schedule:stacktrace:3' > /debug/tracing/set_ftrace_filter
 cat /debug/tracing/trace_pipe
           <...>-58    [003] ...2   841.801694: <stack trace>
 => kthread
 => ret_from_fork
          <idle>-0     [001] .N.2   841.801697: <stack trace>
 => start_secondary
           <...>-2059  [001] ...2   841.801736: <stack trace>
 => wait_for_common
 => wait_for_completion
 => flush_work
 => tty_flush_to_ldisc
 => input_available_p
 => n_tty_poll
 => tty_poll
 => do_select
 => core_sys_select
 => sys_select
 => system_call_fastpath

To remove these:

 echo '!schedule:stacktrace' > /debug/tracing/set_ftrace_filter
 echo '!schedule:stacktrace:0' > /debug/tracing/set_ftrace_filter

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_functions.c |  150 ++++++++++++++++++++++++++++++----------
 1 file changed, 115 insertions(+), 35 deletions(-)

diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 043b242..c4d6d71 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -265,56 +265,103 @@ ftrace_traceoff(unsigned long ip, unsigned long parent_ip, void **data)
 	tracing_off();
 }
 
+/*
+ * Skip 4:
+ *   ftrace_stacktrace()
+ *   function_trace_probe_call()
+ *   ftrace_ops_list_func()
+ *   ftrace_call()
+ */
+#define STACK_SKIP 4
+
+static void
+ftrace_stacktrace(unsigned long ip, unsigned long parent_ip, void **data)
+{
+	trace_dump_stack(STACK_SKIP);
+}
+
+static void
+ftrace_stacktrace_count(unsigned long ip, unsigned long parent_ip, void **data)
+{
+	if (!tracing_is_on())
+		return;
+
+	if (update_count(data))
+		trace_dump_stack(STACK_SKIP);
+}
+
 static int
-ftrace_trace_onoff_print(struct seq_file *m, unsigned long ip,
-			 struct ftrace_probe_ops *ops, void *data);
+ftrace_probe_print(const char *name, struct seq_file *m,
+		   unsigned long ip, void *data)
+{
+	long count = (long)data;
+
+	seq_printf(m, "%ps:%s", (void *)ip, name);
+
+	if (count == -1)
+		seq_printf(m, ":unlimited\n");
+	else
+		seq_printf(m, ":count=%ld\n", count);
+
+	return 0;
+}
+
+static int
+ftrace_traceon_print(struct seq_file *m, unsigned long ip,
+			 struct ftrace_probe_ops *ops, void *data)
+{
+	return ftrace_probe_print("traceon", m, ip, data);
+}
+
+static int
+ftrace_traceoff_print(struct seq_file *m, unsigned long ip,
+			 struct ftrace_probe_ops *ops, void *data)
+{
+	return ftrace_probe_print("traceoff", m, ip, data);
+}
+
+static int
+ftrace_stacktrace_print(struct seq_file *m, unsigned long ip,
+			struct ftrace_probe_ops *ops, void *data)
+{
+	return ftrace_probe_print("stacktrace", m, ip, data);
+}
 
 static struct ftrace_probe_ops traceon_count_probe_ops = {
 	.func			= ftrace_traceon_count,
-	.print			= ftrace_trace_onoff_print,
+	.print			= ftrace_traceon_print,
 };
 
 static struct ftrace_probe_ops traceoff_count_probe_ops = {
 	.func			= ftrace_traceoff_count,
-	.print			= ftrace_trace_onoff_print,
+	.print			= ftrace_traceoff_print,
+};
+
+static struct ftrace_probe_ops stacktrace_count_probe_ops = {
+	.func			= ftrace_stacktrace_count,
+	.print			= ftrace_stacktrace_print,
 };
 
 static struct ftrace_probe_ops traceon_probe_ops = {
 	.func			= ftrace_traceon,
-	.print			= ftrace_trace_onoff_print,
+	.print			= ftrace_traceon_print,
 };
 
 static struct ftrace_probe_ops traceoff_probe_ops = {
 	.func			= ftrace_traceoff,
-	.print			= ftrace_trace_onoff_print,
+	.print			= ftrace_traceoff_print,
 };
 
-static int
-ftrace_trace_onoff_print(struct seq_file *m, unsigned long ip,
-			 struct ftrace_probe_ops *ops, void *data)
-{
-	long count = (long)data;
-
-	seq_printf(m, "%ps:", (void *)ip);
-
-	if (ops == &traceon_probe_ops || ops == &traceon_count_probe_ops)
-		seq_printf(m, "traceon");
-	else
-		seq_printf(m, "traceoff");
-
-	if (count == -1)
-		seq_printf(m, ":unlimited\n");
-	else
-		seq_printf(m, ":count=%ld\n", count);
-
-	return 0;
-}
+static struct ftrace_probe_ops stacktrace_probe_ops = {
+	.func			= ftrace_stacktrace,
+	.print			= ftrace_stacktrace_print,
+};
 
 static int
-ftrace_trace_onoff_callback(struct ftrace_hash *hash,
-			    char *glob, char *cmd, char *param, int enable)
+ftrace_trace_probe_callback(struct ftrace_probe_ops *ops,
+			    struct ftrace_hash *hash, char *glob,
+			    char *cmd, char *param, int enable)
 {
-	struct ftrace_probe_ops *ops;
 	void *count = (void *)-1;
 	char *number;
 	int ret;
@@ -323,12 +370,6 @@ ftrace_trace_onoff_callback(struct ftrace_hash *hash,
 	if (!enable)
 		return -EINVAL;
 
-	/* we register both traceon and traceoff to this callback */
-	if (strcmp(cmd, "traceon") == 0)
-		ops = param ? &traceon_count_probe_ops : &traceon_probe_ops;
-	else
-		ops = param ? &traceoff_count_probe_ops : &traceoff_probe_ops;
-
 	if (glob[0] == '!') {
 		unregister_ftrace_function_probe_func(glob+1, ops);
 		return 0;
@@ -356,6 +397,34 @@ ftrace_trace_onoff_callback(struct ftrace_hash *hash,
 	return ret < 0 ? ret : 0;
 }
 
+static int
+ftrace_trace_onoff_callback(struct ftrace_hash *hash,
+			    char *glob, char *cmd, char *param, int enable)
+{
+	struct ftrace_probe_ops *ops;
+
+	/* we register both traceon and traceoff to this callback */
+	if (strcmp(cmd, "traceon") == 0)
+		ops = param ? &traceon_count_probe_ops : &traceon_probe_ops;
+	else
+		ops = param ? &traceoff_count_probe_ops : &traceoff_probe_ops;
+
+	return ftrace_trace_probe_callback(ops, hash, glob, cmd,
+					   param, enable);
+}
+
+static int
+ftrace_stacktrace_callback(struct ftrace_hash *hash,
+			   char *glob, char *cmd, char *param, int enable)
+{
+	struct ftrace_probe_ops *ops;
+
+	ops = param ? &stacktrace_count_probe_ops : &stacktrace_probe_ops;
+
+	return ftrace_trace_probe_callback(ops, hash, glob, cmd,
+					   param, enable);
+}
+
 static struct ftrace_func_command ftrace_traceon_cmd = {
 	.name			= "traceon",
 	.func			= ftrace_trace_onoff_callback,
@@ -366,6 +435,11 @@ static struct ftrace_func_command ftrace_traceoff_cmd = {
 	.func			= ftrace_trace_onoff_callback,
 };
 
+static struct ftrace_func_command ftrace_stacktrace_cmd = {
+	.name			= "stacktrace",
+	.func			= ftrace_stacktrace_callback,
+};
+
 static int __init init_func_cmd_traceon(void)
 {
 	int ret;
@@ -377,6 +451,12 @@ static int __init init_func_cmd_traceon(void)
 	ret = register_ftrace_command(&ftrace_traceon_cmd);
 	if (ret)
 		unregister_ftrace_command(&ftrace_traceoff_cmd);
+
+	ret = register_ftrace_command(&ftrace_stacktrace_cmd);
+	if (ret) {
+		unregister_ftrace_command(&ftrace_traceoff_cmd);
+		unregister_ftrace_command(&ftrace_traceon_cmd);
+	}
 	return ret;
 }
 #else
-- 
1.7.10.4



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [for-next][PATCH 14/20] tracing: Use stack of calling function for stack tracer
  2013-03-15 19:39 [for-next][PATCH 00/20] tracing: function triggers, stack tracer fixes, clocks and documenation Steven Rostedt
                   ` (12 preceding siblings ...)
  2013-03-15 19:39 ` [for-next][PATCH 13/20] tracing: Add function probe to trigger stack traces Steven Rostedt
@ 2013-03-15 19:39 ` Steven Rostedt
  2013-03-15 19:39 ` [for-next][PATCH 15/20] tracing: Fix stack tracer with fentry use Steven Rostedt
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2013-03-15 19:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, stable

[-- Attachment #1: Type: text/plain, Size: 2040 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Use the stack of stack_trace_call() instead of check_stack() as
the test pointer for the max stack size. This makes the code a bit
cleaner and the measurement a little more accurate.
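
As a rough sketch of the measurement itself (it mirrors the
arithmetic in the patch; stacks grow down, so the bytes in use at
'stack' are the distance from 'stack' to the top of the
THREAD_SIZE-aligned thread stack area):

 /* sketch only: bytes of stack in use at the given pointer */
 static unsigned long stack_in_use(unsigned long *stack)
 {
	return THREAD_SIZE - ((unsigned long)stack & (THREAD_SIZE - 1));
 }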

Adding stable, as a later fix depends on this patch.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_stack.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index 42ca822..dc02e29 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -39,20 +39,21 @@ static DEFINE_MUTEX(stack_sysctl_mutex);
 int stack_tracer_enabled;
 static int last_stack_tracer_enabled;
 
-static inline void check_stack(void)
+static inline void
+check_stack(unsigned long *stack)
 {
 	unsigned long this_size, flags;
 	unsigned long *p, *top, *start;
 	int i;
 
-	this_size = ((unsigned long)&this_size) & (THREAD_SIZE-1);
+	this_size = ((unsigned long)stack) & (THREAD_SIZE-1);
 	this_size = THREAD_SIZE - this_size;
 
 	if (this_size <= max_stack_size)
 		return;
 
 	/* we do not handle interrupt stacks yet */
-	if (!object_is_on_stack(&this_size))
+	if (!object_is_on_stack(stack))
 		return;
 
 	local_irq_save(flags);
@@ -73,7 +74,7 @@ static inline void check_stack(void)
 	 * Now find where in the stack these are.
 	 */
 	i = 0;
-	start = &this_size;
+	start = stack;
 	top = (unsigned long *)
 		(((unsigned long)start & ~(THREAD_SIZE-1)) + THREAD_SIZE);
 
@@ -113,6 +114,7 @@ static void
 stack_trace_call(unsigned long ip, unsigned long parent_ip,
 		 struct ftrace_ops *op, struct pt_regs *pt_regs)
 {
+	unsigned long stack;
 	int cpu;
 
 	preempt_disable_notrace();
@@ -122,7 +124,7 @@ stack_trace_call(unsigned long ip, unsigned long parent_ip,
 	if (per_cpu(trace_active, cpu)++ != 0)
 		goto out;
 
-	check_stack();
+	check_stack(&stack);
 
  out:
 	per_cpu(trace_active, cpu)--;
-- 
1.7.10.4



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [for-next][PATCH 15/20] tracing: Fix stack tracer with fentry use
  2013-03-15 19:39 [for-next][PATCH 00/20] tracing: function triggers, stack tracer fixes, clocks and documenation Steven Rostedt
                   ` (13 preceding siblings ...)
  2013-03-15 19:39 ` [for-next][PATCH 14/20] tracing: Use stack of calling function for stack tracer Steven Rostedt
@ 2013-03-15 19:39 ` Steven Rostedt
  2013-03-15 19:39 ` [for-next][PATCH 16/20] tracing: Remove most or all of stack tracer stack size from stack_max_size Steven Rostedt
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2013-03-15 19:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, stable

[-- Attachment #1: Type: text/plain, Size: 4237 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

When gcc 4.6 on x86 is used, the function tracer will use the new
option -mfentry which does a call to "fentry" at every function
instead of "mcount". The significance of this is that fentry is
called as the first operation of the function, instead of the mcount
usage of being called after the stack frame has been set up.

This causes the stack tracer to show some bogus results for the size
of the last function traced, as well as showing "ftrace_call" instead
of the function. This is due to the stack frame not being set up
by the function that is about to be traced.
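
As a rough sketch (x86-flavored pseudo-assembly, not from this
patch) of the code gcc generates in the two cases:

 /*
  * with mcount (-pg):             with -mfentry:
  *   foo:                           foo:
  *     push %rbp                      call __fentry__  <- no frame yet
  *     mov  %rsp,%rbp                 push %rbp
  *     call mcount  <- frame set up   mov  %rsp,%rbp
  */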

 # cat stack_trace
        Depth    Size   Location    (48 entries)
        -----    ----   --------
  0)     4824     216   ftrace_call+0x5/0x2f
  1)     4608     112   ____cache_alloc+0xb7/0x22d
  2)     4496      80   kmem_cache_alloc+0x63/0x12f

The 216 size for ftrace_call includes both the ftrace_call stack
(which includes the saving of registers it does), as well as the
stack size of the parent.

To fix this, if CC_USING_FENTRY is defined, then the stack tracer
will reserve the first item in the stack_dump_trace[] array when
calling save_stack_trace(), and it will fill it in with the parent ip.
Then the code will look for the parent pointer on the stack and
report the real stack size of the parent:

 # cat stack_trace
        Depth    Size   Location    (14 entries)
        -----    ----   --------
  0)     2640      48   update_group_power+0x26/0x187
  1)     2592     224   update_sd_lb_stats+0x2a5/0x4ac
  2)     2368     160   find_busiest_group+0x31/0x1f1
  3)     2208     256   load_balance+0xd9/0x662

I'm Cc'ing stable, although it's not urgent, as it only shows a bogus
size for item #0; the rest of the trace is legit. It should still be
corrected in previous stable releases.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_stack.c |   33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index dc02e29..ea28e4b 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -20,13 +20,27 @@
 
 #define STACK_TRACE_ENTRIES 500
 
+/*
+ * If fentry is used, then the function being traced will
+ * jump to fentry directly before it sets up its stack frame.
+ * We need to ignore that one and record the parent. Since
+ * the stack frame for the traced function wasn't set up yet,
+ * the stack_trace won't see the parent. That needs to be added
+ * manually to stack_dump_trace[] as the first element.
+ */
+#ifdef CC_USING_FENTRY
+# define add_func	1
+#else
+# define add_func	0
+#endif
+
 static unsigned long stack_dump_trace[STACK_TRACE_ENTRIES+1] =
 	 { [0 ... (STACK_TRACE_ENTRIES)] = ULONG_MAX };
 static unsigned stack_dump_index[STACK_TRACE_ENTRIES];
 
 static struct stack_trace max_stack_trace = {
-	.max_entries		= STACK_TRACE_ENTRIES,
-	.entries		= stack_dump_trace,
+	.max_entries		= STACK_TRACE_ENTRIES - add_func,
+	.entries		= &stack_dump_trace[add_func],
 };
 
 static unsigned long max_stack_size;
@@ -40,7 +54,7 @@ int stack_tracer_enabled;
 static int last_stack_tracer_enabled;
 
 static inline void
-check_stack(unsigned long *stack)
+check_stack(unsigned long ip, unsigned long *stack)
 {
 	unsigned long this_size, flags;
 	unsigned long *p, *top, *start;
@@ -71,6 +85,17 @@ check_stack(unsigned long *stack)
 	save_stack_trace(&max_stack_trace);
 
 	/*
+	 * When fentry is used, the traced function does not get
+	 * its stack frame set up, and we lose the parent.
+	 * Add that one in manually. We set up save_stack_trace()
+	 * to not touch the first element in this case.
+	 */
+	if (add_func) {
+		stack_dump_trace[0] = ip;
+		max_stack_trace.nr_entries++;
+	}
+
+	/*
 	 * Now find where in the stack these are.
 	 */
 	i = 0;
@@ -124,7 +149,7 @@ stack_trace_call(unsigned long ip, unsigned long parent_ip,
 	if (per_cpu(trace_active, cpu)++ != 0)
 		goto out;
 
-	check_stack(&stack);
+	check_stack(parent_ip, &stack);
 
  out:
 	per_cpu(trace_active, cpu)--;
-- 
1.7.10.4



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [for-next][PATCH 16/20] tracing: Remove most or all of stack tracer stack size from stack_max_size
  2013-03-15 19:39 [for-next][PATCH 00/20] tracing: function triggers, stack tracer fixes, clocks and documenation Steven Rostedt
                   ` (14 preceding siblings ...)
  2013-03-15 19:39 ` [for-next][PATCH 15/20] tracing: Fix stack tracer with fentry use Steven Rostedt
@ 2013-03-15 19:39 ` Steven Rostedt
  2013-03-15 19:39 ` [for-next][PATCH 17/20] tracing: Add function-trace option to disable function tracing of latency tracers Steven Rostedt
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2013-03-15 19:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, stable

[-- Attachment #1: Type: text/plain, Size: 5746 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Currently, the depth reported in the stack tracer stack_trace file
does not match the stack_max_size file. This is because stack_max_size
includes the overhead of the stack tracer itself while the depth does not.

The first time a max is triggered, a calculation is now performed that
figures out the overhead of the stack tracer and subtracts it from
the stack_max_size variable. The overhead is stored and is subtracted
from the reported stack size when comparing for a new max.
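
A worked example with made-up numbers (p and stack are the variables
in the patch's check_stack()): if the traced function's return
address is found six longwords above where the tracer entered, the
one-time calculation amounts to:

 tracer_frame   = (p - stack) * sizeof(unsigned long) = 6 * 8 = 48
 max_stack_size = 4688 - 48 = 4640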

Now the stack_max_size corresponds to the reported depth:

 # cat stack_max_size
4640

 # cat stack_trace
        Depth    Size   Location    (48 entries)
        -----    ----   --------
  0)     4640      32   _raw_spin_lock+0x18/0x24
  1)     4608     112   ____cache_alloc+0xb7/0x22d
  2)     4496      80   kmem_cache_alloc+0x63/0x12f
  3)     4416      16   mempool_alloc_slab+0x15/0x17
[...]

While testing against an older gcc on x86 that uses mcount instead
of fentry, I found that passing in ip + MCOUNT_INSN_SIZE let the
stack trace show one more function deep which was missing before.

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_stack.c |   75 +++++++++++++++++++++++++++++++-------------
 1 file changed, 54 insertions(+), 21 deletions(-)

diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index ea28e4b..aab277b 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -20,27 +20,24 @@
 
 #define STACK_TRACE_ENTRIES 500
 
-/*
- * If fentry is used, then the function being traced will
- * jump to fentry directly before it sets up its stack frame.
- * We need to ignore that one and record the parent. Since
- * the stack frame for the traced function wasn't set up yet,
- * the stack_trace won't see the parent. That needs to be added
- * manually to stack_dump_trace[] as the first element.
- */
 #ifdef CC_USING_FENTRY
-# define add_func	1
+# define fentry		1
 #else
-# define add_func	0
+# define fentry		0
 #endif
 
 static unsigned long stack_dump_trace[STACK_TRACE_ENTRIES+1] =
 	 { [0 ... (STACK_TRACE_ENTRIES)] = ULONG_MAX };
 static unsigned stack_dump_index[STACK_TRACE_ENTRIES];
 
+/*
+ * Reserve one entry for the passed in ip. This will allow
+ * us to remove most or all of the stack size overhead
+ * added by the stack tracer itself.
+ */
 static struct stack_trace max_stack_trace = {
-	.max_entries		= STACK_TRACE_ENTRIES - add_func,
-	.entries		= &stack_dump_trace[add_func],
+	.max_entries		= STACK_TRACE_ENTRIES - 1,
+	.entries		= &stack_dump_trace[1],
 };
 
 static unsigned long max_stack_size;
@@ -58,10 +55,14 @@ check_stack(unsigned long ip, unsigned long *stack)
 {
 	unsigned long this_size, flags;
 	unsigned long *p, *top, *start;
+	static int tracer_frame;
+	int frame_size = ACCESS_ONCE(tracer_frame);
 	int i;
 
 	this_size = ((unsigned long)stack) & (THREAD_SIZE-1);
 	this_size = THREAD_SIZE - this_size;
+	/* Remove the frame of the tracer */
+	this_size -= frame_size;
 
 	if (this_size <= max_stack_size)
 		return;
@@ -73,6 +74,10 @@ check_stack(unsigned long ip, unsigned long *stack)
 	local_irq_save(flags);
 	arch_spin_lock(&max_stack_lock);
 
+	/* In case another CPU set the tracer_frame on us */
+	if (unlikely(!frame_size))
+		this_size -= tracer_frame;
+
 	/* a race could have already updated it */
 	if (this_size <= max_stack_size)
 		goto out;
@@ -85,15 +90,12 @@ check_stack(unsigned long ip, unsigned long *stack)
 	save_stack_trace(&max_stack_trace);
 
 	/*
-	 * When fentry is used, the traced function does not get
-	 * its stack frame set up, and we lose the parent.
-	 * Add that one in manually. We set up save_stack_trace()
-	 * to not touch the first element in this case.
+	 * Add the passed in ip from the function tracer.
+	 * Searching for this on the stack will skip over
+	 * most of the overhead from the stack tracer itself.
 	 */
-	if (add_func) {
-		stack_dump_trace[0] = ip;
-		max_stack_trace.nr_entries++;
-	}
+	stack_dump_trace[0] = ip;
+	max_stack_trace.nr_entries++;
 
 	/*
 	 * Now find where in the stack these are.
@@ -123,6 +125,18 @@ check_stack(unsigned long ip, unsigned long *stack)
 				found = 1;
 				/* Start the search from here */
 				start = p + 1;
+				/*
+				 * We do not want to show the overhead
+				 * of the stack tracer stack in the
+				 * max stack. If we haven't figured
+				 * out what that is, then figure it out
+				 * now.
+				 */
+				if (unlikely(!tracer_frame) && i == 1) {
+					tracer_frame = (p - stack) *
+						sizeof(unsigned long);
+					max_stack_size -= tracer_frame;
+				}
 			}
 		}
 
@@ -149,7 +163,26 @@ stack_trace_call(unsigned long ip, unsigned long parent_ip,
 	if (per_cpu(trace_active, cpu)++ != 0)
 		goto out;
 
-	check_stack(parent_ip, &stack);
+	/*
+	 * When fentry is used, the traced function does not get
+	 * its stack frame set up, and we lose the parent.
+	 * The ip is pretty useless because the function tracer
+	 * was called before that function set up its stack frame.
+	 * In this case, we use the parent ip.
+	 *
+	 * By adding the return address of either the parent ip
+	 * or the current ip we can disregard most of the stack usage
+	 * caused by the stack tracer itself.
+	 *
+	 * The function tracer always reports the address of where the
+	 * mcount call was, but the stack will hold the return address.
+	 */
+	if (fentry)
+		ip = parent_ip;
+	else
+		ip += MCOUNT_INSN_SIZE;
+
+	check_stack(ip, &stack);
 
  out:
 	per_cpu(trace_active, cpu)--;
-- 
1.7.10.4



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [for-next][PATCH 17/20] tracing: Add function-trace option to disable function tracing of latency tracers
  2013-03-15 19:39 [for-next][PATCH 00/20] tracing: function triggers, stack tracer fixes, clocks and documenation Steven Rostedt
                   ` (15 preceding siblings ...)
  2013-03-15 19:39 ` [for-next][PATCH 16/20] tracing: Remove most or all of stack tracer stack size from stack_max_size Steven Rostedt
@ 2013-03-15 19:39 ` Steven Rostedt
  2013-03-15 19:39 ` [for-next][PATCH 18/20] tracing: Add "uptime" trace clock that uses jiffies Steven Rostedt
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2013-03-15 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, Thomas Gleixner,
	Peter Zijlstra, Clark Williams, John Kacur

[-- Attachment #1: Type: text/plain, Size: 8756 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Currently, the only way to stop the latency tracers from doing function
tracing is to fully disable the function tracer from the proc file
system:

  echo 0 > /proc/sys/kernel/ftrace_enabled

This is a big hammer approach as it disables function tracing for
all users. This includes kprobes, perf, stack tracer, etc.

Instead, create a function-trace option that the latency tracers can
check to determine whether they should enable function tracing or not.
This option can be set or cleared even while the tracer is active
and the tracers will disable or enable function tracing depending
on how the option was set.

Instead of using the proc file, disable latency function tracing with

  echo 0 > /debug/tracing/options/function-trace
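
A hedged usage sketch (paths assume debugfs is mounted at /debug,
as in the example above):

  echo irqsoff > /debug/tracing/current_tracer
  echo 0 > /debug/tracing/options/function-trace
  cat /debug/tracing/trace

The irqsoff tracer keeps measuring latencies, but no longer records
every function call while doing so; kprobes, perf and the stack
tracer are unaffected.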

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace.c              |    3 +-
 kernel/trace/trace.h              |    1 +
 kernel/trace/trace_irqsoff.c      |   67 ++++++++++++++++++++++++++++++-------
 kernel/trace/trace_sched_wakeup.c |   63 ++++++++++++++++++++++++++++------
 4 files changed, 111 insertions(+), 23 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8aa5321..f90ca16 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -328,7 +328,7 @@ static inline void trace_access_lock_init(void)
 unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
 	TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO | TRACE_ITER_SLEEP_TIME |
 	TRACE_ITER_GRAPH_TIME | TRACE_ITER_RECORD_CMD | TRACE_ITER_OVERWRITE |
-	TRACE_ITER_IRQ_INFO | TRACE_ITER_MARKERS;
+	TRACE_ITER_IRQ_INFO | TRACE_ITER_MARKERS | TRACE_ITER_FUNCTION;
 
 /**
  * tracing_on - enable tracing buffers
@@ -635,6 +635,7 @@ static const char *trace_options[] = {
 	"disable_on_free",
 	"irq-info",
 	"markers",
+	"function-trace",
 	NULL
 };
 
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 0e430b4..5cc5236 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -867,6 +867,7 @@ enum trace_iterator_flags {
 	TRACE_ITER_STOP_ON_FREE		= 0x400000,
 	TRACE_ITER_IRQ_INFO		= 0x800000,
 	TRACE_ITER_MARKERS		= 0x1000000,
+	TRACE_ITER_FUNCTION		= 0x2000000,
 };
 
 /*
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index 5aa40ab..b19d065 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -33,6 +33,7 @@ enum {
 static int trace_type __read_mostly;
 
 static int save_flags;
+static bool function_enabled;
 
 static void stop_irqsoff_tracer(struct trace_array *tr, int graph);
 static int start_irqsoff_tracer(struct trace_array *tr, int graph);
@@ -528,15 +529,60 @@ void trace_preempt_off(unsigned long a0, unsigned long a1)
 }
 #endif /* CONFIG_PREEMPT_TRACER */
 
-static int start_irqsoff_tracer(struct trace_array *tr, int graph)
+static int register_irqsoff_function(int graph, int set)
 {
-	int ret = 0;
+	int ret;
 
-	if (!graph)
-		ret = register_ftrace_function(&trace_ops);
-	else
+	/* 'set' is set if TRACE_ITER_FUNCTION is about to be set */
+	if (function_enabled || (!set && !(trace_flags & TRACE_ITER_FUNCTION)))
+		return 0;
+
+	if (graph)
 		ret = register_ftrace_graph(&irqsoff_graph_return,
 					    &irqsoff_graph_entry);
+	else
+		ret = register_ftrace_function(&trace_ops);
+
+	if (!ret)
+		function_enabled = true;
+
+	return ret;
+}
+
+static void unregister_irqsoff_function(int graph)
+{
+	if (!function_enabled)
+		return;
+
+	if (graph)
+		unregister_ftrace_graph();
+	else
+		unregister_ftrace_function(&trace_ops);
+
+	function_enabled = false;
+}
+
+static void irqsoff_function_set(int set)
+{
+	if (set)
+		register_irqsoff_function(is_graph(), 1);
+	else
+		unregister_irqsoff_function(is_graph());
+}
+
+static int irqsoff_flag_changed(struct tracer *tracer, u32 mask, int set)
+{
+	if (mask & TRACE_ITER_FUNCTION)
+		irqsoff_function_set(set);
+
+	return trace_keep_overwrite(tracer, mask, set);
+}
+
+static int start_irqsoff_tracer(struct trace_array *tr, int graph)
+{
+	int ret;
+
+	ret = register_irqsoff_function(graph, 0);
 
 	if (!ret && tracing_is_enabled())
 		tracer_enabled = 1;
@@ -550,10 +596,7 @@ static void stop_irqsoff_tracer(struct trace_array *tr, int graph)
 {
 	tracer_enabled = 0;
 
-	if (!graph)
-		unregister_ftrace_function(&trace_ops);
-	else
-		unregister_ftrace_graph();
+	unregister_irqsoff_function(graph);
 }
 
 static void __irqsoff_tracer_init(struct trace_array *tr)
@@ -615,7 +658,7 @@ static struct tracer irqsoff_tracer __read_mostly =
 	.print_line     = irqsoff_print_line,
 	.flags		= &tracer_flags,
 	.set_flag	= irqsoff_set_flag,
-	.flag_changed	= trace_keep_overwrite,
+	.flag_changed	= irqsoff_flag_changed,
 #ifdef CONFIG_FTRACE_SELFTEST
 	.selftest    = trace_selftest_startup_irqsoff,
 #endif
@@ -649,7 +692,7 @@ static struct tracer preemptoff_tracer __read_mostly =
 	.print_line     = irqsoff_print_line,
 	.flags		= &tracer_flags,
 	.set_flag	= irqsoff_set_flag,
-	.flag_changed	= trace_keep_overwrite,
+	.flag_changed	= irqsoff_flag_changed,
 #ifdef CONFIG_FTRACE_SELFTEST
 	.selftest    = trace_selftest_startup_preemptoff,
 #endif
@@ -685,7 +728,7 @@ static struct tracer preemptirqsoff_tracer __read_mostly =
 	.print_line     = irqsoff_print_line,
 	.flags		= &tracer_flags,
 	.set_flag	= irqsoff_set_flag,
-	.flag_changed	= trace_keep_overwrite,
+	.flag_changed	= irqsoff_flag_changed,
 #ifdef CONFIG_FTRACE_SELFTEST
 	.selftest    = trace_selftest_startup_preemptirqsoff,
 #endif
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index c16f8cd..fee77e1 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -37,6 +37,7 @@ static int wakeup_graph_entry(struct ftrace_graph_ent *trace);
 static void wakeup_graph_return(struct ftrace_graph_ret *trace);
 
 static int save_flags;
+static bool function_enabled;
 
 #define TRACE_DISPLAY_GRAPH     1
 
@@ -134,15 +135,60 @@ static struct ftrace_ops trace_ops __read_mostly =
 };
 #endif /* CONFIG_FUNCTION_TRACER */
 
-static int start_func_tracer(int graph)
+static int register_wakeup_function(int graph, int set)
 {
 	int ret;
 
-	if (!graph)
-		ret = register_ftrace_function(&trace_ops);
-	else
+	/* 'set' is set if TRACE_ITER_FUNCTION is about to be set */
+	if (function_enabled || (!set && !(trace_flags & TRACE_ITER_FUNCTION)))
+		return 0;
+
+	if (graph)
 		ret = register_ftrace_graph(&wakeup_graph_return,
 					    &wakeup_graph_entry);
+	else
+		ret = register_ftrace_function(&trace_ops);
+
+	if (!ret)
+		function_enabled = true;
+
+	return ret;
+}
+
+static void unregister_wakeup_function(int graph)
+{
+	if (!function_enabled)
+		return;
+
+	if (graph)
+		unregister_ftrace_graph();
+	else
+		unregister_ftrace_function(&trace_ops);
+
+	function_enabled = false;
+}
+
+static void wakeup_function_set(int set)
+{
+	if (set)
+		register_wakeup_function(is_graph(), 1);
+	else
+		unregister_wakeup_function(is_graph());
+}
+
+static int wakeup_flag_changed(struct tracer *tracer, u32 mask, int set)
+{
+	if (mask & TRACE_ITER_FUNCTION)
+		wakeup_function_set(set);
+
+	return trace_keep_overwrite(tracer, mask, set);
+}
+
+static int start_func_tracer(int graph)
+{
+	int ret;
+
+	ret = register_wakeup_function(graph, 0);
 
 	if (!ret && tracing_is_enabled())
 		tracer_enabled = 1;
@@ -156,10 +202,7 @@ static void stop_func_tracer(int graph)
 {
 	tracer_enabled = 0;
 
-	if (!graph)
-		unregister_ftrace_function(&trace_ops);
-	else
-		unregister_ftrace_graph();
+	unregister_wakeup_function(graph);
 }
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -600,7 +643,7 @@ static struct tracer wakeup_tracer __read_mostly =
 	.print_line	= wakeup_print_line,
 	.flags		= &tracer_flags,
 	.set_flag	= wakeup_set_flag,
-	.flag_changed	= trace_keep_overwrite,
+	.flag_changed	= wakeup_flag_changed,
 #ifdef CONFIG_FTRACE_SELFTEST
 	.selftest    = trace_selftest_startup_wakeup,
 #endif
@@ -622,7 +665,7 @@ static struct tracer wakeup_rt_tracer __read_mostly =
 	.print_line	= wakeup_print_line,
 	.flags		= &tracer_flags,
 	.set_flag	= wakeup_set_flag,
-	.flag_changed	= trace_keep_overwrite,
+	.flag_changed	= wakeup_flag_changed,
 #ifdef CONFIG_FTRACE_SELFTEST
 	.selftest    = trace_selftest_startup_wakeup,
 #endif
-- 
1.7.10.4



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [for-next][PATCH 18/20] tracing: Add "uptime" trace clock that uses jiffies
  2013-03-15 19:39 [for-next][PATCH 00/20] tracing: function triggers, stack tracer fixes, clocks and documenation Steven Rostedt
                   ` (16 preceding siblings ...)
  2013-03-15 19:39 ` [for-next][PATCH 17/20] tracing: Add function-trace option to disable function tracing of latency tracers Steven Rostedt
@ 2013-03-15 19:39 ` Steven Rostedt
  2013-03-15 19:39 ` [for-next][PATCH 19/20] tracing: Add "perf" trace_clock Steven Rostedt
  2013-03-15 19:39 ` [for-next][PATCH 20/20] tracing: Bring Documentation/trace/ftrace.txt up to date Steven Rostedt
  19 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2013-03-15 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Mauro Carvalho Chehab, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 2114 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Add a simple trace clock called "uptime" for those that are
interested in timestamps relative to boot. It uses jiffies as that's
the safest method; other uptime clocks grab seqlocks, which could
cause a deadlock if taken from an event or function tracer.
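
A minimal usage sketch (path assumes debugfs mounted at /debug):

  echo uptime > /debug/tracing/trace_clock

Timestamps are then derived from jiffies, so their resolution is
limited to the HZ tick.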

Requested-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/trace_clock.h |    1 +
 kernel/trace/trace.c        |    1 +
 kernel/trace/trace_clock.c  |   10 ++++++++++
 3 files changed, 12 insertions(+)

diff --git a/include/linux/trace_clock.h b/include/linux/trace_clock.h
index d563f37..1d7ca27 100644
--- a/include/linux/trace_clock.h
+++ b/include/linux/trace_clock.h
@@ -16,6 +16,7 @@
 
 extern u64 notrace trace_clock_local(void);
 extern u64 notrace trace_clock(void);
+extern u64 notrace trace_clock_jiffies(void);
 extern u64 notrace trace_clock_global(void);
 extern u64 notrace trace_clock_counter(void);
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index f90ca16..8eabfbb 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -647,6 +647,7 @@ static struct {
 	{ trace_clock_local,	"local",	1 },
 	{ trace_clock_global,	"global",	1 },
 	{ trace_clock_counter,	"counter",	0 },
+	{ trace_clock_jiffies,	"uptime",	1 },
 	ARCH_TRACE_CLOCKS
 };
 
diff --git a/kernel/trace/trace_clock.c b/kernel/trace/trace_clock.c
index aa8f5f4..26dc348 100644
--- a/kernel/trace/trace_clock.c
+++ b/kernel/trace/trace_clock.c
@@ -57,6 +57,16 @@ u64 notrace trace_clock(void)
 	return local_clock();
 }
 
+/*
+ * trace_clock_jiffies(): Simply use jiffies as a clock counter.
+ */
+u64 notrace trace_clock_jiffies(void)
+{
+	u64 jiffy = jiffies - INITIAL_JIFFIES;
+
+	/* Return nsecs */
+	return (u64)jiffies_to_usecs(jiffy) * 1000ULL;
+}
 
 /*
  * trace_clock_global(): special globally coherent trace clock
-- 
1.7.10.4



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [for-next][PATCH 19/20] tracing: Add "perf" trace_clock
  2013-03-15 19:39 [for-next][PATCH 00/20] tracing: function triggers, stack tracer fixes, clocks and documenation Steven Rostedt
                   ` (17 preceding siblings ...)
  2013-03-15 19:39 ` [for-next][PATCH 18/20] tracing: Add "uptime" trace clock that uses jiffies Steven Rostedt
@ 2013-03-15 19:39 ` Steven Rostedt
  2013-03-15 19:39 ` [for-next][PATCH 20/20] tracing: Bring Documentation/trace/ftrace.txt up to date Steven Rostedt
  19 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2013-03-15 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, Thomas Gleixner,
	Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 1224 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

The function trace_clock() calls "local_clock()" which is exactly
the same clock that perf uses. I'm not sure why perf doesn't call
trace_clock(), as trace_clock() doesn't have any users.

But now it does. As trace_clock() calls local_clock() like perf does,
I added the trace_clock "perf" option that uses trace_clock().

Now the ftrace buffers can use the same clock as perf uses. This
will be useful when perf starts reading the ftrace buffers, and will
be able to interleave them with the same clock data.
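
Selecting it is like selecting any other trace clock (path assumes
debugfs mounted at /debug):

  echo perf > /debug/tracing/trace_clock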

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8eabfbb..7f0e7fa 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -648,6 +648,7 @@ static struct {
 	{ trace_clock_global,	"global",	1 },
 	{ trace_clock_counter,	"counter",	0 },
 	{ trace_clock_jiffies,	"uptime",	1 },
+	{ trace_clock,		"perf",		1 },
 	ARCH_TRACE_CLOCKS
 };
 
-- 
1.7.10.4



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [for-next][PATCH 20/20] tracing: Bring Documentation/trace/ftrace.txt up to date
  2013-03-15 19:39 [for-next][PATCH 00/20] tracing: function triggers, stack tracer fixes, clocks and documenation Steven Rostedt
                   ` (18 preceding siblings ...)
  2013-03-15 19:39 ` [for-next][PATCH 19/20] tracing: Add "perf" trace_clock Steven Rostedt
@ 2013-03-15 19:39 ` Steven Rostedt
  19 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2013-03-15 19:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker

[-- Attachment #1: Type: text/plain, Size: 113221 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

The ftrace.txt document has been suffering from some serious bit rot.

Updated the current content to how things are as of v3.10.
Remove things that no longer exist.
Add documentation about new features:
  per_cpu
  stats
  instances
  stack trace
  etc.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 Documentation/trace/ftrace.txt | 2097 ++++++++++++++++++++++++++++------------
 1 file changed, 1480 insertions(+), 617 deletions(-)

diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index a372304..bfe8c29 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -8,6 +8,7 @@ Copyright 2008 Red Hat Inc.
 Reviewers:   Elias Oltmanns, Randy Dunlap, Andrew Morton,
 	     John Kacur, and David Teigland.
 Written for: 2.6.28-rc2
+Updated for: 3.10
 
 Introduction
 ------------
@@ -17,13 +18,16 @@ designers of systems to find what is going on inside the kernel.
 It can be used for debugging or analyzing latencies and
 performance issues that take place outside of user-space.
 
-Although ftrace is the function tracer, it also includes an
-infrastructure that allows for other types of tracing. Some of
-the tracers that are currently in ftrace include a tracer to
-trace context switches, the time it takes for a high priority
-task to run after it was woken up, the time interrupts are
-disabled, and more (ftrace allows for tracer plugins, which
-means that the list of tracers can always grow).
+Although ftrace is typically considered the function tracer, it
+is really a framework of several assorted tracing utilities.
+There is latency tracing to examine what occurs between the time
+interrupts are disabled and enabled, as well as for preemption,
+and from the time a task is woken to the time it is scheduled in.
+
+One of the most common uses of ftrace is event tracing.
+Throughout the kernel are hundreds of static event points that
+can be enabled via the debugfs file system to see what is
+going on in certain parts of the kernel.
 
 
 Implementation Details
@@ -61,7 +65,7 @@ the extended "/sys/kernel/debug/tracing" path name.
 
 That's it! (assuming that you have ftrace configured into your kernel)
 
-After mounting the debugfs, you can see a directory called
+After mounting debugfs, you can see a directory called
 "tracing".  This directory contains the control and output files
 of ftrace. Here is a list of some of the key files:
 
@@ -84,7 +88,9 @@ of ftrace. Here is a list of some of the key files:
 
 	This sets or displays whether writing to the trace
 	ring buffer is enabled. Echo 0 into this file to disable
-	the tracer or 1 to enable it.
+	the tracer or 1 to enable it. Note, this only disables
+	writing to the ring buffer; the tracing overhead may
+	still be incurred.
 
   trace:
 
@@ -109,7 +115,15 @@ of ftrace. Here is a list of some of the key files:
 
 	This file lets the user control the amount of data
 	that is displayed in one of the above output
-	files.
+	files. Options also exist to modify how a tracer
+	or events work (stack traces, timestamps, etc).
+
+  options:
+
+	This is a directory that has a file for every available
+	trace option (also in trace_options). Options may also be set
+	or cleared by writing a "1" or "0" respectively into the
+	corresponding file with the option name.
 
   tracing_max_latency:
 
@@ -121,10 +135,17 @@ of ftrace. Here is a list of some of the key files:
 	latency is greater than the value in this
 	file. (in microseconds)
 
+  tracing_thresh:
+
+	Some latency tracers will record a trace whenever the
+	latency is greater than the number in this file.
+	Only active when the file contains a number greater than 0.
+	(in microseconds)
+
   buffer_size_kb:
 
 	This sets or displays the number of kilobytes each CPU
-	buffer can hold. The tracer buffers are the same size
+	buffer holds. By default, the trace buffers are the same size
 	for each CPU. The displayed number is the size of the
 	CPU buffer and not total size of all buffers. The
 	trace buffers are allocated in pages (blocks of memory
@@ -133,16 +154,30 @@ of ftrace. Here is a list of some of the key files:
 	than requested, the rest of the page will be used,
 	making the actual allocation bigger than requested.
 	( Note, the size may not be a multiple of the page size
-	  due to buffer management overhead. )
+	  due to buffer management meta-data. )
 
-	This can only be updated when the current_tracer
-	is set to "nop".
+  buffer_total_size_kb:
+
+	This displays the total combined size of all the trace buffers.
+
+  free_buffer:
+
+	If a process is performing tracing, and the ring buffer
+	should be shrunk ("freed") when the process is finished,
+	even if it were to be killed by a signal, this file can
+	be used for that purpose. On close of this file, the ring
+	buffer will be resized to its minimum size. If the tracing
+	process also has this file open, then when that process
+	exits, its file descriptor will be closed, and in doing
+	so, the ring buffer will be "freed".
+
+	It may also stop tracing if disable_on_free option is set.
 
   tracing_cpumask:
 
 	This is a mask that lets the user only trace
-	on specified CPUS. The format is a hex string
-	representing the CPUS.
+	on specified CPUs. The format is a hex string
+	representing the CPUs.
 
   set_ftrace_filter:
 
@@ -183,6 +218,261 @@ of ftrace. Here is a list of some of the key files:
 	"set_ftrace_notrace". (See the section "dynamic ftrace"
 	below for more details.)
 
+  enabled_functions:
+
+	This file is more for debugging ftrace, but can also be useful
+	in seeing if any function has a callback attached to it.
+	Not only does the trace infrastructure use the ftrace function
+	trace utility, but other subsystems might too. This file
+	displays all functions that have a callback attached to them
+	as well as the number of callbacks that have been attached.
+	Note, a callback may also call multiple functions which will
+	not be listed in this count.
+
+	If the callback was registered to trace a function with
+	the "save regs" attribute (thus adding even more overhead),
+	an 'R' will be displayed on the same line as the function
+	that is returning registers.
+
+  function_profile_enabled:
+
+	When set it will enable profiling of all functions with either
+	the function tracer, or if configured, the function graph
+	tracer. It will keep a histogram of the number of times each
+	function was called and, if run with the function graph tracer,
+	it will also keep track of the time spent in those functions.
+	The histogram content can be displayed in the files:
+
+	trace_stats/function<cpu> ( function0, function1, etc).
+
+  trace_stats:
+
+	A directory that holds different tracing stats.
+
+  kprobe_events:
+ 
+	Enable dynamic trace points. See kprobetrace.txt.
+
+  kprobe_profile:
+
+	Dynamic trace points stats. See kprobetrace.txt.
+
+  max_graph_depth:
+
+	Used with the function graph tracer. This is the max depth
+	it will trace into a function. Setting this to a value of
+	one will show only the first kernel function that is called
+	from user space.
+
+  printk_formats:
+
+	This is for tools that read the raw format files. If an event in
+	the ring buffer references a string (currently only trace_printk()
+	does this), only a pointer to the string is recorded into the buffer
+	and not the string itself. This prevents tools from knowing what
+	that string was. This file displays the string and address for
+	the string allowing tools to map the pointers to what the
+	strings were.
+
+  saved_cmdlines:
+
+	Only the pid of the task is recorded in a trace event unless
+	the event specifically saves the task comm as well. Ftrace
+	makes a cache of pid mappings to comms to try to display
+	comms for events. If a pid for a comm is not listed, then
+	"<...>" is displayed in the output.
+
+  snapshot:
+
+	This displays the "snapshot" buffer and also lets the user
+	take a snapshot of the currently running trace.
+	See the "Snapshot" section below for more details.
+
+  stack_max_size:
+
+	When the stack tracer is activated, this will display the
+	maximum stack size it has encountered.
+	See the "Stack Trace" section below.
+
+  stack_trace:
+
+	This displays the stack back trace of the largest stack
+	that was encountered when the stack tracer is activated.
+	See the "Stack Trace" section below.
+
+  stack_trace_filter:
+
+	This is similar to "set_ftrace_filter" but it limits what
+	functions the stack tracer will check.
+
+  trace_clock:
+
+	Whenever an event is recorded into the ring buffer, a
+	"timestamp" is added. This stamp comes from a specified
+	clock. By default, ftrace uses the "local" clock. This
+	clock is very fast and strictly per cpu, but on some
+	systems it may not be monotonic with respect to other
+	CPUs. In other words, the local clocks may not be in sync
+	with local clocks on other CPUs.
+
+	Usual clocks for tracing:
+
+	  # cat trace_clock
+	  [local] global counter x86-tsc
+
+	  local: Default clock, but may not be in sync across CPUs
+
+	  global: This clock is in sync with all CPUs but may
+	  	  be a bit slower than the local clock.
+
+	  counter: This is not a clock at all, but literally an atomic
+	  	   counter. It counts up one by one, but is in sync
+		   with all CPUs. This is useful when you need to
+		   know exactly the order events occurred with respect to
+		   each other on different CPUs.
+
+	  uptime: This uses the jiffies counter and the time stamp
+	  	  is relative to the time since boot up.
+
+	  perf: This makes ftrace use the same clock that perf uses.
+	  	Eventually perf will be able to read ftrace buffers
+		and this will help out in interleaving the data.
+
+	  x86-tsc: Architectures may define their own clocks. For
+	  	   example, x86 uses its own TSC cycle clock here.
+
+	To set a clock, simply echo the clock name into this file.
+
+	  echo global > trace_clock
+
+  trace_marker:
+
+	This is a very useful file for synchronizing user space
+	with events happening in the kernel. Writing strings into
+	this file will be written into the ftrace buffer.
+
+	It is useful in applications to open this file at the start
+	of the application and just reference the file descriptor
+	for the file.
+
+	void trace_write(const char *fmt, ...)
+	{
+		va_list ap;
+		char buf[256];
+		int n;
+
+		if (trace_fd < 0)
+			return;
+
+		va_start(ap, fmt);
+		n = vsnprintf(buf, 256, fmt, ap);
+		va_end(ap);
+
+		write(trace_fd, buf, n < 256 ? n : 255);
+	}
+
+	start:
+
+		trace_fd = open("trace_marker", O_WRONLY);
+
+  uprobe_events:
+ 
+	Add dynamic tracepoints in programs.
+	See uprobetracer.txt
+
+  uprobe_profile:
+
+	Uprobe statistics. See uprobetracer.txt
+
+  instances:
+
+	This is a way to make multiple trace buffers where different
+	events can be recorded in different buffers.
+	See "Instances" section below.
+
+  events:
+
+	This is the trace event directory. It holds event tracepoints
+	(also known as static tracepoints) that have been compiled
+	into the kernel. It shows what event tracepoints exist
+	and how they are grouped by system. There are "enable"
+	files at various levels that can enable the tracepoints
+	when a "1" is written to them.
+
+	See events.txt for more information.
+
+  per_cpu:
+
+	This is a directory that contains the trace per_cpu information.
+
+  per_cpu/cpu0/buffer_size_kb:
+
+	The ftrace buffer is defined per_cpu. That is, there's a separate
+	buffer for each CPU to allow writes to be done atomically,
+	and free from cache bouncing. These buffers may each have a
+	different size. This file is similar to the buffer_size_kb
+	file, but it only displays or sets the buffer size for the
+	specific CPU (here, cpu0).
+
+  per_cpu/cpu0/trace:
+
+	This is similar to the "trace" file, but it will only display
+	the data specific for the CPU. If written to, it only clears
+	the specific CPU buffer.
+
+  per_cpu/cpu0/trace_pipe
+
+	This is similar to the "trace_pipe" file, and is a consuming
+	read, but it will only display (and consume) the data specific
+	for the CPU.
+
+  per_cpu/cpu0/trace_pipe_raw
+
+	For tools that can parse the ftrace ring buffer binary format,
+	the trace_pipe_raw file can be used to extract the data
+	from the ring buffer directly. With the use of the splice()
+	system call, the buffer data can be quickly transferred to
+	a file or to the network where a server is collecting the
+	data.
+
+	Like trace_pipe, this is a consuming reader, where multiple
+	reads will always produce different data.
+
+  per_cpu/cpu0/snapshot:
+
+	This is similar to the main "snapshot" file, but will only
+	snapshot the current CPU (if supported). It only displays
+	the content of the snapshot for a given CPU, and if
+	written to, only clears this CPU buffer.
+
+  per_cpu/cpu0/snapshot_raw:
+
+	Similar to the trace_pipe_raw, but will read the binary format
+	from the snapshot buffer for the given CPU.
+
+  per_cpu/cpu0/stats:
+
+	This displays certain stats about the ring buffer:
+
+	 entries: The number of events that are still in the buffer.
+
+	 overrun: The number of lost events due to overwriting when
+	 	  the buffer was full.
+
+	 commit overrun: Should always be zero.
+	 	This gets set if so many events happened within a nested
+		event (ring buffer is re-entrant), that it fills the
+		buffer and starts dropping events.
+
+	 bytes: Bytes actually read (not overwritten).
+
+	 oldest event ts: The oldest timestamp in the buffer
+
+	 now ts: The current timestamp
+
+	 dropped events: Events lost due to overwrite option being off.
+
+	 read events: The number of events read.
 
 The Tracers
 -----------
@@ -234,11 +524,6 @@ Here is the list of current tracers that may be configured.
         RT tasks (as the current "wakeup" does). This is useful
         for those interested in wake up timings of RT tasks.
 
-  "hw-branch-tracer"
-
-	Uses the BTS CPU feature on x86 CPUs to traces all
-	branches executed.
-
   "nop"
 
 	This is the "trace nothing" tracer. To remove all
@@ -261,70 +546,100 @@ Here is an example of the output format of the file "trace"
                              --------
 # tracer: function
 #
-#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
-#              | |      |          |         |
-            bash-4251  [01] 10152.583854: path_put <-path_walk
-            bash-4251  [01] 10152.583855: dput <-path_put
-            bash-4251  [01] 10152.583855: _atomic_dec_and_lock <-dput
+# entries-in-buffer/entries-written: 140080/250280   #P:4
+#
+#                              _-----=> irqs-off
+#                             / _----=> need-resched
+#                            | / _---=> hardirq/softirq
+#                            || / _--=> preempt-depth
+#                            ||| /     delay
+#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+#              | |       |   ||||       |         |
+            bash-1977  [000] .... 17284.993652: sys_close <-system_call_fastpath
+            bash-1977  [000] .... 17284.993653: __close_fd <-sys_close
+            bash-1977  [000] .... 17284.993653: _raw_spin_lock <-__close_fd
+            sshd-1974  [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
+            bash-1977  [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
+            bash-1977  [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
+            bash-1977  [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
+            bash-1977  [000] .... 17284.993657: filp_close <-__close_fd
+            bash-1977  [000] .... 17284.993657: dnotify_flush <-filp_close
+            sshd-1974  [003] .... 17284.993658: sys_select <-system_call_fastpath
                              --------
 
 A header is printed with the tracer name that is represented by
-the trace. In this case the tracer is "function". Then a header
-showing the format. Task name "bash", the task PID "4251", the
-CPU that it was running on "01", the timestamp in <secs>.<usecs>
-format, the function name that was traced "path_put" and the
-parent function that called this function "path_walk". The
-timestamp is the time at which the function was entered.
+the trace. In this case the tracer is "function". Then it shows the
+number of events in the buffer as well as the total number of entries
+that were written. The difference is the number of entries that were
+lost due to the buffer filling up (250280 - 140080 = 110200 events
+lost).
+
+The header explains the content of the events. Task name "bash", the task
+PID "1977", the CPU that it was running on "000", the latency format
+(explained below), the timestamp in <secs>.<usecs> format, the
+function name that was traced "sys_close" and the parent function that
+called this function "system_call_fastpath". The timestamp is the time
+at which the function was entered.
 
 Latency trace format
 --------------------
 
-When the latency-format option is enabled, the trace file gives
-somewhat more information to see why a latency happened.
-Here is a typical trace.
+When the latency-format option is enabled or when one of the latency
+tracers is set, the trace file gives somewhat more information to see
+why a latency happened. Here is a typical trace.
 
 # tracer: irqsoff
 #
-irqsoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
-    -----------------
-    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
-    -----------------
- => started at: apic_timer_interrupt
- => ended at:   do_softirq
-
-#                _------=> CPU#
-#               / _-----=> irqs-off
-#              | / _----=> need-resched
-#              || / _---=> hardirq/softirq
-#              ||| / _--=> preempt-depth
-#              |||| /
-#              |||||     delay
-#  cmd     pid ||||| time  |   caller
-#     \   /    |||||   \   |   /
-  <idle>-0     0d..1    0us+: trace_hardirqs_off_thunk (apic_timer_interrupt)
-  <idle>-0     0d.s.   97us : __do_softirq (do_softirq)
-  <idle>-0     0d.s1   98us : trace_hardirqs_on (do_softirq)
+# irqsoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+#    -----------------
+#    | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
+#    -----------------
+#  => started at: __lock_task_sighand
+#  => ended at:   _raw_spin_unlock_irqrestore
+#
+#
+#                  _------=> CPU#            
+#                 / _-----=> irqs-off        
+#                | / _----=> need-resched    
+#                || / _---=> hardirq/softirq 
+#                ||| / _--=> preempt-depth   
+#                |||| /     delay             
+#  cmd     pid   ||||| time  |   caller      
+#     \   /      |||||  \    |   /           
+      ps-6143    2d...    0us!: trace_hardirqs_off <-__lock_task_sighand
+      ps-6143    2d..1  259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
+      ps-6143    2d..1  263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
+      ps-6143    2d..1  306us : <stack trace>
+ => trace_hardirqs_on_caller
+ => trace_hardirqs_on
+ => _raw_spin_unlock_irqrestore
+ => do_task_stat
+ => proc_tgid_stat
+ => proc_single_show
+ => seq_read
+ => vfs_read
+ => sys_read
+ => system_call_fastpath
 
 
 This shows that the current tracer is "irqsoff" tracing the time
-for which interrupts were disabled. It gives the trace version
-and the version of the kernel upon which this was executed on
-(2.6.26-rc8). Then it displays the max latency in microsecs (97
-us). The number of trace entries displayed and the total number
-recorded (both are three: #3/3). The type of preemption that was
-used (PREEMPT). VP, KP, SP, and HP are always zero and are
-reserved for later use. #P is the number of online CPUS (#P:2).
+for which interrupts were disabled. It gives the trace version (which
+never changes) and the version of the kernel on which this was
+executed (3.8.0-test+). Then it displays the max latency in
+microseconds (259 us). The number of trace entries displayed and the
+total number (both are four: #4/4). VP, KP, SP, and HP are always
+zero and are reserved for later use. #P is the number of online
+CPUs (#P:4).
 
 The task is the process that was running when the latency
-occurred. (swapper pid: 0).
+occurred. (ps pid: 6143).
 
 The start and stop (the functions in which the interrupts were
 disabled and enabled respectively) that caused the latencies:
 
-  apic_timer_interrupt is where the interrupts were disabled.
-  do_softirq is where they were enabled again.
+ __lock_task_sighand is where the interrupts were disabled.
+ _raw_spin_unlock_irqrestore is where they were enabled again.
 
 The next lines after the header are the trace itself. The header
 explains which is which.
@@ -367,16 +682,43 @@ The above is mostly meaningful for kernel developers.
 
   The rest is the same as the 'trace' file.
 
+  Note, the latency tracers will usually end with a back trace
+  to easily find where the latency occurred.
 
 trace_options
 -------------
 
-The trace_options file is used to control what gets printed in
-the trace output. To see what is available, simply cat the file:
+The trace_options file (or the options directory) is used to control
+what gets printed in the trace output, or to manipulate the tracers.
+To see what is available, simply cat the file:
 
   cat trace_options
-  print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
-  noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj
+print-parent
+nosym-offset
+nosym-addr
+noverbose
+noraw
+nohex
+nobin
+noblock
+nostacktrace
+trace_printk
+noftrace_preempt
+nobranch
+annotate
+nouserstacktrace
+nosym-userobj
+noprintk-msg-only
+context-info
+latency-format
+sleep-time
+graph-time
+record-cmd
+overwrite
+nodisable_on_free
+irq-info
+markers
+function-trace
 
 To disable one of the options, echo in the option prepended with
 "no".
@@ -428,13 +770,34 @@ Here are the available options:
 
   bin - This will print out the formats in raw binary.
 
-  block - TBD (needs update)
+  block - When set, reading trace_pipe will not block when polled.
 
   stacktrace - This is one of the options that changes the trace
 	       itself. When a trace is recorded, so is the stack
 	       of functions. This allows for back traces of
 	       trace sites.
 
+  trace_printk - When disabled, trace_printk() will not write into the buffer.
+
+  branch - Enable branch tracing with the tracer.
+
+  annotate - It is sometimes confusing when the CPU buffers are full
+  	     and one CPU buffer had a lot of events recently, thus
+	     a shorter time frame, where another CPU may have only had
+	     a few events, which lets it have older events. When
+	     the trace is reported, it shows the oldest events first,
+	     and it may look like only one CPU ran (the one with the
+	     oldest events). When the annotate option is set, it will
+	     display when a new CPU buffer started:
+
+          <idle>-0     [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
+          <idle>-0     [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
+          <idle>-0     [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
+##### CPU 2 buffer started ####
+          <idle>-0     [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
+          <idle>-0     [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
+          <idle>-0     [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
+
   userstacktrace - This option changes the trace. It records a
 		   stacktrace of the current userspace thread.
 
@@ -451,9 +814,13 @@ Here are the available options:
 		a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
 x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
 
-  sched-tree - trace all tasks that are on the runqueue, at
-	       every scheduling event. Will add overhead if
-	       there's a lot of tasks running at once.
+
+  printk-msg-only - When set, trace_printk()s will only show the format
+  		    and not their parameters (if trace_bprintk() or
+		    trace_bputs() was used to save the trace_printk()).
+
+  context-info - Show the comm, PID, timestamp, CPU, and other
+  	         context data with each event. When disabled, only
+		 the event data itself is shown.
 
   latency-format - This option changes the trace. When
                    it is enabled, the trace displays
@@ -461,31 +828,61 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
                    latencies, as described in "Latency
                    trace format".
 
+  sleep-time - When running the function graph tracer, include
+  	       the time a task schedules out in its function.
+	       When enabled, the time a task spends scheduled out
+	       is accounted as part of its function call.
+
+  graph-time - When running the function graph tracer, include the
+  	       time spent in nested functions. When this is not set,
+	       the time reported for the function will only include
+	       the time the function itself executed for, not the time
+	       for functions that it called.
+
+  record-cmd - When any event or tracer is enabled, a hook is enabled
+  	       in the sched_switch trace point to fill the comm cache
+	       with mapped pids and comms. But this may cause some
+	       overhead, and if you only care about pids and not the
+	       name of the task, disabling this option can lower the
+	       impact of tracing.
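+
+	       For example, if only PIDs are needed:
+
+	        # echo 0 > options/record-cmd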
+
   overwrite - This controls what happens when the trace buffer is
               full. If "1" (default), the oldest events are
               discarded and overwritten. If "0", then the newest
               events are discarded.
+	        (see per_cpu/cpu0/stats for overrun and dropped)
 
-ftrace_enabled
---------------
+  disable_on_free - When the free_buffer is closed, tracing will
+  		    stop (tracing_on set to 0).
 
-The following tracers (listed below) give different output
-depending on whether or not the sysctl ftrace_enabled is set. To
-set ftrace_enabled, one can either use the sysctl function or
-set it via the proc file system interface.
+  irq-info - Shows the interrupt, preempt count, need resched data.
+  	     When disabled, the trace looks like:
 
-  sysctl kernel.ftrace_enabled=1
+# tracer: function
+#
+# entries-in-buffer/entries-written: 144405/9452052   #P:4
+#
+#           TASK-PID   CPU#      TIMESTAMP  FUNCTION
+#              | |       |          |         |
+          <idle>-0     [002]  23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
+          <idle>-0     [002]  23636.756054: activate_task <-ttwu_do_activate.constprop.89
+          <idle>-0     [002]  23636.756055: enqueue_task <-activate_task
 
- or
 
-  echo 1 > /proc/sys/kernel/ftrace_enabled
+  markers - When set, the trace_marker is writable (only by root).
+  	    When disabled, the trace_marker will error with EINVAL
+	    on write.
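+
+	    For example, with markers disabled a write will fail
+	    with something like:
+
+	     # echo 0 > options/markers
+	     # echo hello > trace_marker
+	     -bash: echo: write error: Invalid argument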
+
+
+  function-trace - The latency tracers will enable function tracing
+  	    if this option is enabled (which it is by default). When
+	    it is disabled, the latency tracers do not trace
+	    functions. This keeps the overhead of the tracer down
+	    when performing latency tests.
 
-To disable ftrace_enabled simply replace the '1' with '0' in the
-above commands.
+ Note: Some tracers have their own options. They only appear
+       when the tracer is active.
 
-When ftrace_enabled is set the tracers will also record the
-functions that are within the trace. The descriptions of the
-tracers will also show an example with ftrace enabled.
 
 
 irqsoff
@@ -506,95 +903,133 @@ new trace is saved.
 To reset the maximum, echo 0 into tracing_max_latency. Here is
 an example:
 
+ # echo 0 > options/function-trace
  # echo irqsoff > current_tracer
- # echo latency-format > trace_options
- # echo 0 > tracing_max_latency
  # echo 1 > tracing_on
+ # echo 0 > tracing_max_latency
  # ls -ltr
  [...]
  # echo 0 > tracing_on
  # cat trace
 # tracer: irqsoff
 #
-irqsoff latency trace v1.1.5 on 2.6.26
---------------------------------------------------------------------
- latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
-    -----------------
-    | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0)
-    -----------------
- => started at: sys_setpgid
- => ended at:   sys_setpgid
-
-#                _------=> CPU#
-#               / _-----=> irqs-off
-#              | / _----=> need-resched
-#              || / _---=> hardirq/softirq
-#              ||| / _--=> preempt-depth
-#              |||| /
-#              |||||     delay
-#  cmd     pid ||||| time  |   caller
-#     \   /    |||||   \   |   /
-    bash-3730  1d...    0us : _write_lock_irq (sys_setpgid)
-    bash-3730  1d..1    1us+: _write_unlock_irq (sys_setpgid)
-    bash-3730  1d..2   14us : trace_hardirqs_on (sys_setpgid)
-
-
-Here we see that that we had a latency of 12 microsecs (which is
-very good). The _write_lock_irq in sys_setpgid disabled
-interrupts. The difference between the 12 and the displayed
-timestamp 14us occurred because the clock was incremented
+# irqsoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+#    -----------------
+#    | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
+#    -----------------
+#  => started at: run_timer_softirq
+#  => ended at:   run_timer_softirq
+#
+#
+#                  _------=> CPU#            
+#                 / _-----=> irqs-off        
+#                | / _----=> need-resched    
+#                || / _---=> hardirq/softirq 
+#                ||| / _--=> preempt-depth   
+#                |||| /     delay             
+#  cmd     pid   ||||| time  |   caller      
+#     \   /      |||||  \    |   /           
+  <idle>-0       0d.s2    0us+: _raw_spin_lock_irq <-run_timer_softirq
+  <idle>-0       0dNs3   17us : _raw_spin_unlock_irq <-run_timer_softirq
+  <idle>-0       0dNs3   17us+: trace_hardirqs_on <-run_timer_softirq
+  <idle>-0       0dNs3   25us : <stack trace>
+ => _raw_spin_unlock_irq
+ => run_timer_softirq
+ => __do_softirq
+ => call_softirq
+ => do_softirq
+ => irq_exit
+ => smp_apic_timer_interrupt
+ => apic_timer_interrupt
+ => rcu_idle_exit
+ => cpu_idle
+ => rest_init
+ => start_kernel
+ => x86_64_start_reservations
+ => x86_64_start_kernel
+
+Here we see that we had a latency of 16 microseconds (which is
+very good). The _raw_spin_lock_irq in run_timer_softirq disabled
+interrupts. The difference between the 16 and the displayed
+timestamp 25us occurred because the clock was incremented
 between the time of recording the max latency and the time of
 recording the function that had that latency.
 
-Note the above example had ftrace_enabled not set. If we set the
-ftrace_enabled, we get a much larger output:
+Note the above example had function-trace not set. If we set
+function-trace, we get a much larger output:
+
+ with echo 1 > options/function-trace
 
 # tracer: irqsoff
 #
-irqsoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
-    -----------------
-    | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0)
-    -----------------
- => started at: __alloc_pages_internal
- => ended at:   __alloc_pages_internal
-
-#                _------=> CPU#
-#               / _-----=> irqs-off
-#              | / _----=> need-resched
-#              || / _---=> hardirq/softirq
-#              ||| / _--=> preempt-depth
-#              |||| /
-#              |||||     delay
-#  cmd     pid ||||| time  |   caller
-#     \   /    |||||   \   |   /
-      ls-4339  0...1    0us+: get_page_from_freelist (__alloc_pages_internal)
-      ls-4339  0d..1    3us : rmqueue_bulk (get_page_from_freelist)
-      ls-4339  0d..1    3us : _spin_lock (rmqueue_bulk)
-      ls-4339  0d..1    4us : add_preempt_count (_spin_lock)
-      ls-4339  0d..2    4us : __rmqueue (rmqueue_bulk)
-      ls-4339  0d..2    5us : __rmqueue_smallest (__rmqueue)
-      ls-4339  0d..2    5us : __mod_zone_page_state (__rmqueue_smallest)
-      ls-4339  0d..2    6us : __rmqueue (rmqueue_bulk)
-      ls-4339  0d..2    6us : __rmqueue_smallest (__rmqueue)
-      ls-4339  0d..2    7us : __mod_zone_page_state (__rmqueue_smallest)
-      ls-4339  0d..2    7us : __rmqueue (rmqueue_bulk)
-      ls-4339  0d..2    8us : __rmqueue_smallest (__rmqueue)
+# irqsoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+#    -----------------
+#    | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
+#    -----------------
+#  => started at: ata_scsi_queuecmd
+#  => ended at:   ata_scsi_queuecmd
+#
+#
+#                  _------=> CPU#            
+#                 / _-----=> irqs-off        
+#                | / _----=> need-resched    
+#                || / _---=> hardirq/softirq 
+#                ||| / _--=> preempt-depth   
+#                |||| /     delay             
+#  cmd     pid   ||||| time  |   caller      
+#     \   /      |||||  \    |   /           
+    bash-2042    3d...    0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
+    bash-2042    3d...    0us : add_preempt_count <-_raw_spin_lock_irqsave
+    bash-2042    3d..1    1us : ata_scsi_find_dev <-ata_scsi_queuecmd
+    bash-2042    3d..1    1us : __ata_scsi_find_dev <-ata_scsi_find_dev
+    bash-2042    3d..1    2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
+    bash-2042    3d..1    2us : ata_qc_new_init <-__ata_scsi_queuecmd
+    bash-2042    3d..1    3us : ata_sg_init <-__ata_scsi_queuecmd
+    bash-2042    3d..1    4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
+    bash-2042    3d..1    4us : ata_build_rw_tf <-ata_scsi_rw_xlat
 [...]
-      ls-4339  0d..2   46us : __rmqueue_smallest (__rmqueue)
-      ls-4339  0d..2   47us : __mod_zone_page_state (__rmqueue_smallest)
-      ls-4339  0d..2   47us : __rmqueue (rmqueue_bulk)
-      ls-4339  0d..2   48us : __rmqueue_smallest (__rmqueue)
-      ls-4339  0d..2   48us : __mod_zone_page_state (__rmqueue_smallest)
-      ls-4339  0d..2   49us : _spin_unlock (rmqueue_bulk)
-      ls-4339  0d..2   49us : sub_preempt_count (_spin_unlock)
-      ls-4339  0d..1   50us : get_page_from_freelist (__alloc_pages_internal)
-      ls-4339  0d..2   51us : trace_hardirqs_on (__alloc_pages_internal)
-
-
-
-Here we traced a 50 microsecond latency. But we also see all the
+    bash-2042    3d..1   67us : delay_tsc <-__delay
+    bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
+    bash-2042    3d..2   67us : sub_preempt_count <-delay_tsc
+    bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
+    bash-2042    3d..2   68us : sub_preempt_count <-delay_tsc
+    bash-2042    3d..1   68us+: ata_bmdma_start <-ata_bmdma_qc_issue
+    bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
+    bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
+    bash-2042    3d..1   72us+: trace_hardirqs_on <-ata_scsi_queuecmd
+    bash-2042    3d..1  120us : <stack trace>
+ => _raw_spin_unlock_irqrestore
+ => ata_scsi_queuecmd
+ => scsi_dispatch_cmd
+ => scsi_request_fn
+ => __blk_run_queue_uncond
+ => __blk_run_queue
+ => blk_queue_bio
+ => generic_make_request
+ => submit_bio
+ => submit_bh
+ => __ext3_get_inode_loc
+ => ext3_iget
+ => ext3_lookup
+ => lookup_real
+ => __lookup_hash
+ => walk_component
+ => lookup_last
+ => path_lookupat
+ => filename_lookup
+ => user_path_at_empty
+ => user_path_at
+ => vfs_fstatat
+ => vfs_stat
+ => sys_newstat
+ => system_call_fastpath
+
+
+Here we traced a 71 microsecond latency. But we also see all the
 functions that were called during that time. Note that by
 enabling function tracing, we incur an added overhead. This
 overhead may extend the latency times. But nevertheless, this
@@ -614,120 +1049,122 @@ Like the irqsoff tracer, it records the maximum latency for
 which preemption was disabled. The control of preemptoff tracer
 is much like the irqsoff tracer.
 
+ # echo 0 > options/function-trace
  # echo preemptoff > current_tracer
- # echo latency-format > trace_options
- # echo 0 > tracing_max_latency
  # echo 1 > tracing_on
+ # echo 0 > tracing_max_latency
  # ls -ltr
  [...]
  # echo 0 > tracing_on
  # cat trace
 # tracer: preemptoff
 #
-preemptoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
-    -----------------
-    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
-    -----------------
- => started at: do_IRQ
- => ended at:   __do_softirq
-
-#                _------=> CPU#
-#               / _-----=> irqs-off
-#              | / _----=> need-resched
-#              || / _---=> hardirq/softirq
-#              ||| / _--=> preempt-depth
-#              |||| /
-#              |||||     delay
-#  cmd     pid ||||| time  |   caller
-#     \   /    |||||   \   |   /
-    sshd-4261  0d.h.    0us+: irq_enter (do_IRQ)
-    sshd-4261  0d.s.   29us : _local_bh_enable (__do_softirq)
-    sshd-4261  0d.s1   30us : trace_preempt_on (__do_softirq)
+# preemptoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+#    -----------------
+#    | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
+#    -----------------
+#  => started at: do_IRQ
+#  => ended at:   do_IRQ
+#
+#
+#                  _------=> CPU#            
+#                 / _-----=> irqs-off        
+#                | / _----=> need-resched    
+#                || / _---=> hardirq/softirq 
+#                ||| / _--=> preempt-depth   
+#                |||| /     delay             
+#  cmd     pid   ||||| time  |   caller      
+#     \   /      |||||  \    |   /           
+    sshd-1991    1d.h.    0us+: irq_enter <-do_IRQ
+    sshd-1991    1d..1   46us : irq_exit <-do_IRQ
+    sshd-1991    1d..1   47us+: trace_preempt_on <-do_IRQ
+    sshd-1991    1d..1   52us : <stack trace>
+ => sub_preempt_count
+ => irq_exit
+ => do_IRQ
+ => ret_from_intr
 
 
 This has some more changes. Preemption was disabled when an
-interrupt came in (notice the 'h'), and was enabled while doing
-a softirq. (notice the 's'). But we also see that interrupts
-have been disabled when entering the preempt off section and
-leaving it (the 'd'). We do not know if interrupts were enabled
-in the mean time.
+interrupt came in (notice the 'h'), and was enabled on exit.
+But we also see that interrupts have been disabled when entering
+the preempt off section and leaving it (the 'd'). We do not know if
+interrupts were enabled in the meantime or shortly after this
+was over.
 
 # tracer: preemptoff
 #
-preemptoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
-    -----------------
-    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
-    -----------------
- => started at: remove_wait_queue
- => ended at:   __do_softirq
-
-#                _------=> CPU#
-#               / _-----=> irqs-off
-#              | / _----=> need-resched
-#              || / _---=> hardirq/softirq
-#              ||| / _--=> preempt-depth
-#              |||| /
-#              |||||     delay
-#  cmd     pid ||||| time  |   caller
-#     \   /    |||||   \   |   /
-    sshd-4261  0d..1    0us : _spin_lock_irqsave (remove_wait_queue)
-    sshd-4261  0d..1    1us : _spin_unlock_irqrestore (remove_wait_queue)
-    sshd-4261  0d..1    2us : do_IRQ (common_interrupt)
-    sshd-4261  0d..1    2us : irq_enter (do_IRQ)
-    sshd-4261  0d..1    2us : idle_cpu (irq_enter)
-    sshd-4261  0d..1    3us : add_preempt_count (irq_enter)
-    sshd-4261  0d.h1    3us : idle_cpu (irq_enter)
-    sshd-4261  0d.h.    4us : handle_fasteoi_irq (do_IRQ)
+# preemptoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+#    -----------------
+#    | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
+#    -----------------
+#  => started at: wake_up_new_task
+#  => ended at:   task_rq_unlock
+#
+#
+#                  _------=> CPU#            
+#                 / _-----=> irqs-off        
+#                | / _----=> need-resched    
+#                || / _---=> hardirq/softirq 
+#                ||| / _--=> preempt-depth   
+#                |||| /     delay             
+#  cmd     pid   ||||| time  |   caller      
+#     \   /      |||||  \    |   /           
+    bash-1994    1d..1    0us : _raw_spin_lock_irqsave <-wake_up_new_task
+    bash-1994    1d..1    0us : select_task_rq_fair <-select_task_rq
+    bash-1994    1d..1    1us : __rcu_read_lock <-select_task_rq_fair
+    bash-1994    1d..1    1us : source_load <-select_task_rq_fair
+    bash-1994    1d..1    1us : source_load <-select_task_rq_fair
 [...]
-    sshd-4261  0d.h.   12us : add_preempt_count (_spin_lock)
-    sshd-4261  0d.h1   12us : ack_ioapic_quirk_irq (handle_fasteoi_irq)
-    sshd-4261  0d.h1   13us : move_native_irq (ack_ioapic_quirk_irq)
-    sshd-4261  0d.h1   13us : _spin_unlock (handle_fasteoi_irq)
-    sshd-4261  0d.h1   14us : sub_preempt_count (_spin_unlock)
-    sshd-4261  0d.h1   14us : irq_exit (do_IRQ)
-    sshd-4261  0d.h1   15us : sub_preempt_count (irq_exit)
-    sshd-4261  0d..2   15us : do_softirq (irq_exit)
-    sshd-4261  0d...   15us : __do_softirq (do_softirq)
-    sshd-4261  0d...   16us : __local_bh_disable (__do_softirq)
-    sshd-4261  0d...   16us+: add_preempt_count (__local_bh_disable)
-    sshd-4261  0d.s4   20us : add_preempt_count (__local_bh_disable)
-    sshd-4261  0d.s4   21us : sub_preempt_count (local_bh_enable)
-    sshd-4261  0d.s5   21us : sub_preempt_count (local_bh_enable)
+    bash-1994    1d..1   12us : irq_enter <-smp_apic_timer_interrupt
+    bash-1994    1d..1   12us : rcu_irq_enter <-irq_enter
+    bash-1994    1d..1   13us : add_preempt_count <-irq_enter
+    bash-1994    1d.h1   13us : exit_idle <-smp_apic_timer_interrupt
+    bash-1994    1d.h1   13us : hrtimer_interrupt <-smp_apic_timer_interrupt
+    bash-1994    1d.h1   13us : _raw_spin_lock <-hrtimer_interrupt
+    bash-1994    1d.h1   14us : add_preempt_count <-_raw_spin_lock
+    bash-1994    1d.h2   14us : ktime_get_update_offsets <-hrtimer_interrupt
 [...]
-    sshd-4261  0d.s6   41us : add_preempt_count (__local_bh_disable)
-    sshd-4261  0d.s6   42us : sub_preempt_count (local_bh_enable)
-    sshd-4261  0d.s7   42us : sub_preempt_count (local_bh_enable)
-    sshd-4261  0d.s5   43us : add_preempt_count (__local_bh_disable)
-    sshd-4261  0d.s5   43us : sub_preempt_count (local_bh_enable_ip)
-    sshd-4261  0d.s6   44us : sub_preempt_count (local_bh_enable_ip)
-    sshd-4261  0d.s5   44us : add_preempt_count (__local_bh_disable)
-    sshd-4261  0d.s5   45us : sub_preempt_count (local_bh_enable)
+    bash-1994    1d.h1   35us : lapic_next_event <-clockevents_program_event
+    bash-1994    1d.h1   35us : irq_exit <-smp_apic_timer_interrupt
+    bash-1994    1d.h1   36us : sub_preempt_count <-irq_exit
+    bash-1994    1d..2   36us : do_softirq <-irq_exit
+    bash-1994    1d..2   36us : __do_softirq <-call_softirq
+    bash-1994    1d..2   36us : __local_bh_disable <-__do_softirq
+    bash-1994    1d.s2   37us : add_preempt_count <-_raw_spin_lock_irq
+    bash-1994    1d.s3   38us : _raw_spin_unlock <-run_timer_softirq
+    bash-1994    1d.s3   39us : sub_preempt_count <-_raw_spin_unlock
+    bash-1994    1d.s2   39us : call_timer_fn <-run_timer_softirq
 [...]
-    sshd-4261  0d.s.   63us : _local_bh_enable (__do_softirq)
-    sshd-4261  0d.s1   64us : trace_preempt_on (__do_softirq)
+    bash-1994    1dNs2   81us : cpu_needs_another_gp <-rcu_process_callbacks
+    bash-1994    1dNs2   82us : __local_bh_enable <-__do_softirq
+    bash-1994    1dNs2   82us : sub_preempt_count <-__local_bh_enable
+    bash-1994    1dN.2   82us : idle_cpu <-irq_exit
+    bash-1994    1dN.2   83us : rcu_irq_exit <-irq_exit
+    bash-1994    1dN.2   83us : sub_preempt_count <-irq_exit
+    bash-1994    1.N.1   84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
+    bash-1994    1.N.1   84us+: trace_preempt_on <-task_rq_unlock
+    bash-1994    1.N.1  104us : <stack trace>
+ => sub_preempt_count
+ => _raw_spin_unlock_irqrestore
+ => task_rq_unlock
+ => wake_up_new_task
+ => do_fork
+ => sys_clone
+ => stub_clone
 
 
 The above is an example of the preemptoff trace with
-ftrace_enabled set. Here we see that interrupts were disabled
+function-trace set. Here we see that interrupts were not disabled
 the entire time. The irq_enter code lets us know that we entered
 an interrupt 'h'. Before that, the functions being traced still
 show that it is not in an interrupt, but we can see from the
 functions themselves that this is not the case.
 
-Notice that __do_softirq when called does not have a
-preempt_count. It may seem that we missed a preempt enabling.
-What really happened is that the preempt count is held on the
-thread's stack and we switched to the softirq stack (4K stacks
-in effect). The code does not copy the preempt count, but
-because interrupts are disabled, we do not need to worry about
-it. Having a tracer like this is good for letting people know
-what really happens inside the kernel.
-
-
 preemptirqsoff
 --------------
 
@@ -762,38 +1199,57 @@ tracer.
 Again, using this trace is much like the irqsoff and preemptoff
 tracers.
 
+ # echo 0 > options/function-trace
  # echo preemptirqsoff > current_tracer
- # echo latency-format > trace_options
- # echo 0 > tracing_max_latency
  # echo 1 > tracing_on
+ # echo 0 > tracing_max_latency
  # ls -ltr
  [...]
  # echo 0 > tracing_on
  # cat trace
 # tracer: preemptirqsoff
 #
-preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
-    -----------------
-    | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0)
-    -----------------
- => started at: apic_timer_interrupt
- => ended at:   __do_softirq
-
-#                _------=> CPU#
-#               / _-----=> irqs-off
-#              | / _----=> need-resched
-#              || / _---=> hardirq/softirq
-#              ||| / _--=> preempt-depth
-#              |||| /
-#              |||||     delay
-#  cmd     pid ||||| time  |   caller
-#     \   /    |||||   \   |   /
-      ls-4860  0d...    0us!: trace_hardirqs_off_thunk (apic_timer_interrupt)
-      ls-4860  0d.s.  294us : _local_bh_enable (__do_softirq)
-      ls-4860  0d.s1  294us : trace_preempt_on (__do_softirq)
-
+# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+#    -----------------
+#    | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
+#    -----------------
+#  => started at: ata_scsi_queuecmd
+#  => ended at:   ata_scsi_queuecmd
+#
+#
+#                  _------=> CPU#            
+#                 / _-----=> irqs-off        
+#                | / _----=> need-resched    
+#                || / _---=> hardirq/softirq 
+#                ||| / _--=> preempt-depth   
+#                |||| /     delay             
+#  cmd     pid   ||||| time  |   caller      
+#     \   /      |||||  \    |   /           
+      ls-2230    3d...    0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
+      ls-2230    3...1  100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
+      ls-2230    3...1  101us+: trace_preempt_on <-ata_scsi_queuecmd
+      ls-2230    3...1  111us : <stack trace>
+ => sub_preempt_count
+ => _raw_spin_unlock_irqrestore
+ => ata_scsi_queuecmd
+ => scsi_dispatch_cmd
+ => scsi_request_fn
+ => __blk_run_queue_uncond
+ => __blk_run_queue
+ => blk_queue_bio
+ => generic_make_request
+ => submit_bio
+ => submit_bh
+ => ext3_bread
+ => ext3_dir_bread
+ => htree_dirblock_to_tree
+ => ext3_htree_fill_tree
+ => ext3_readdir
+ => vfs_readdir
+ => sys_getdents
+ => system_call_fastpath
 
 
 The trace_hardirqs_off_thunk is called from assembly on x86 when
@@ -802,105 +1258,158 @@ function tracing, we do not know if interrupts were enabled
 within the preemption points. We do see that it started with
 preemption enabled.
 
-Here is a trace with ftrace_enabled set:
-
+Here is a trace with function-trace set:
 
 # tracer: preemptirqsoff
 #
-preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
-    -----------------
-    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
-    -----------------
- => started at: write_chan
- => ended at:   __do_softirq
-
-#                _------=> CPU#
-#               / _-----=> irqs-off
-#              | / _----=> need-resched
-#              || / _---=> hardirq/softirq
-#              ||| / _--=> preempt-depth
-#              |||| /
-#              |||||     delay
-#  cmd     pid ||||| time  |   caller
-#     \   /    |||||   \   |   /
-      ls-4473  0.N..    0us : preempt_schedule (write_chan)
-      ls-4473  0dN.1    1us : _spin_lock (schedule)
-      ls-4473  0dN.1    2us : add_preempt_count (_spin_lock)
-      ls-4473  0d..2    2us : put_prev_task_fair (schedule)
-[...]
-      ls-4473  0d..2   13us : set_normalized_timespec (ktime_get_ts)
-      ls-4473  0d..2   13us : __switch_to (schedule)
-    sshd-4261  0d..2   14us : finish_task_switch (schedule)
-    sshd-4261  0d..2   14us : _spin_unlock_irq (finish_task_switch)
-    sshd-4261  0d..1   15us : add_preempt_count (_spin_lock_irqsave)
-    sshd-4261  0d..2   16us : _spin_unlock_irqrestore (hrtick_set)
-    sshd-4261  0d..2   16us : do_IRQ (common_interrupt)
-    sshd-4261  0d..2   17us : irq_enter (do_IRQ)
-    sshd-4261  0d..2   17us : idle_cpu (irq_enter)
-    sshd-4261  0d..2   18us : add_preempt_count (irq_enter)
-    sshd-4261  0d.h2   18us : idle_cpu (irq_enter)
-    sshd-4261  0d.h.   18us : handle_fasteoi_irq (do_IRQ)
-    sshd-4261  0d.h.   19us : _spin_lock (handle_fasteoi_irq)
-    sshd-4261  0d.h.   19us : add_preempt_count (_spin_lock)
-    sshd-4261  0d.h1   20us : _spin_unlock (handle_fasteoi_irq)
-    sshd-4261  0d.h1   20us : sub_preempt_count (_spin_unlock)
-[...]
-    sshd-4261  0d.h1   28us : _spin_unlock (handle_fasteoi_irq)
-    sshd-4261  0d.h1   29us : sub_preempt_count (_spin_unlock)
-    sshd-4261  0d.h2   29us : irq_exit (do_IRQ)
-    sshd-4261  0d.h2   29us : sub_preempt_count (irq_exit)
-    sshd-4261  0d..3   30us : do_softirq (irq_exit)
-    sshd-4261  0d...   30us : __do_softirq (do_softirq)
-    sshd-4261  0d...   31us : __local_bh_disable (__do_softirq)
-    sshd-4261  0d...   31us+: add_preempt_count (__local_bh_disable)
-    sshd-4261  0d.s4   34us : add_preempt_count (__local_bh_disable)
+# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+#    -----------------
+#    | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
+#    -----------------
+#  => started at: schedule
+#  => ended at:   mutex_unlock
+#
+#
+#                  _------=> CPU#            
+#                 / _-----=> irqs-off        
+#                | / _----=> need-resched    
+#                || / _---=> hardirq/softirq 
+#                ||| / _--=> preempt-depth   
+#                |||| /     delay             
+#  cmd     pid   ||||| time  |   caller      
+#     \   /      |||||  \    |   /           
+kworker/-59      3...1    0us : __schedule <-schedule
+kworker/-59      3d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
+kworker/-59      3d..1    1us : add_preempt_count <-_raw_spin_lock_irq
+kworker/-59      3d..2    1us : deactivate_task <-__schedule
+kworker/-59      3d..2    1us : dequeue_task <-deactivate_task
+kworker/-59      3d..2    2us : update_rq_clock <-dequeue_task
+kworker/-59      3d..2    2us : dequeue_task_fair <-dequeue_task
+kworker/-59      3d..2    2us : update_curr <-dequeue_task_fair
+kworker/-59      3d..2    2us : update_min_vruntime <-update_curr
+kworker/-59      3d..2    3us : cpuacct_charge <-update_curr
+kworker/-59      3d..2    3us : __rcu_read_lock <-cpuacct_charge
+kworker/-59      3d..2    3us : __rcu_read_unlock <-cpuacct_charge
+kworker/-59      3d..2    3us : update_cfs_rq_blocked_load <-dequeue_task_fair
+kworker/-59      3d..2    4us : clear_buddies <-dequeue_task_fair
+kworker/-59      3d..2    4us : account_entity_dequeue <-dequeue_task_fair
+kworker/-59      3d..2    4us : update_min_vruntime <-dequeue_task_fair
+kworker/-59      3d..2    4us : update_cfs_shares <-dequeue_task_fair
+kworker/-59      3d..2    5us : hrtick_update <-dequeue_task_fair
+kworker/-59      3d..2    5us : wq_worker_sleeping <-__schedule
+kworker/-59      3d..2    5us : kthread_data <-wq_worker_sleeping
+kworker/-59      3d..2    5us : put_prev_task_fair <-__schedule
+kworker/-59      3d..2    6us : pick_next_task_fair <-pick_next_task
+kworker/-59      3d..2    6us : clear_buddies <-pick_next_task_fair
+kworker/-59      3d..2    6us : set_next_entity <-pick_next_task_fair
+kworker/-59      3d..2    6us : update_stats_wait_end <-set_next_entity
+      ls-2269    3d..2    7us : finish_task_switch <-__schedule
+      ls-2269    3d..2    7us : _raw_spin_unlock_irq <-finish_task_switch
+      ls-2269    3d..2    8us : do_IRQ <-ret_from_intr
+      ls-2269    3d..2    8us : irq_enter <-do_IRQ
+      ls-2269    3d..2    8us : rcu_irq_enter <-irq_enter
+      ls-2269    3d..2    9us : add_preempt_count <-irq_enter
+      ls-2269    3d.h2    9us : exit_idle <-do_IRQ
 [...]
-    sshd-4261  0d.s3   43us : sub_preempt_count (local_bh_enable_ip)
-    sshd-4261  0d.s4   44us : sub_preempt_count (local_bh_enable_ip)
-    sshd-4261  0d.s3   44us : smp_apic_timer_interrupt (apic_timer_interrupt)
-    sshd-4261  0d.s3   45us : irq_enter (smp_apic_timer_interrupt)
-    sshd-4261  0d.s3   45us : idle_cpu (irq_enter)
-    sshd-4261  0d.s3   46us : add_preempt_count (irq_enter)
-    sshd-4261  0d.H3   46us : idle_cpu (irq_enter)
-    sshd-4261  0d.H3   47us : hrtimer_interrupt (smp_apic_timer_interrupt)
-    sshd-4261  0d.H3   47us : ktime_get (hrtimer_interrupt)
+      ls-2269    3d.h3   20us : sub_preempt_count <-_raw_spin_unlock
+      ls-2269    3d.h2   20us : irq_exit <-do_IRQ
+      ls-2269    3d.h2   21us : sub_preempt_count <-irq_exit
+      ls-2269    3d..3   21us : do_softirq <-irq_exit
+      ls-2269    3d..3   21us : __do_softirq <-call_softirq
+      ls-2269    3d..3   21us+: __local_bh_disable <-__do_softirq
+      ls-2269    3d.s4   29us : sub_preempt_count <-_local_bh_enable_ip
+      ls-2269    3d.s5   29us : sub_preempt_count <-_local_bh_enable_ip
+      ls-2269    3d.s5   31us : do_IRQ <-ret_from_intr
+      ls-2269    3d.s5   31us : irq_enter <-do_IRQ
+      ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
 [...]
-    sshd-4261  0d.H3   81us : tick_program_event (hrtimer_interrupt)
-    sshd-4261  0d.H3   82us : ktime_get (tick_program_event)
-    sshd-4261  0d.H3   82us : ktime_get_ts (ktime_get)
-    sshd-4261  0d.H3   83us : getnstimeofday (ktime_get_ts)
-    sshd-4261  0d.H3   83us : set_normalized_timespec (ktime_get_ts)
-    sshd-4261  0d.H3   84us : clockevents_program_event (tick_program_event)
-    sshd-4261  0d.H3   84us : lapic_next_event (clockevents_program_event)
-    sshd-4261  0d.H3   85us : irq_exit (smp_apic_timer_interrupt)
-    sshd-4261  0d.H3   85us : sub_preempt_count (irq_exit)
-    sshd-4261  0d.s4   86us : sub_preempt_count (irq_exit)
-    sshd-4261  0d.s3   86us : add_preempt_count (__local_bh_disable)
+      ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
+      ls-2269    3d.s5   32us : add_preempt_count <-irq_enter
+      ls-2269    3d.H5   32us : exit_idle <-do_IRQ
+      ls-2269    3d.H5   32us : handle_irq <-do_IRQ
+      ls-2269    3d.H5   32us : irq_to_desc <-handle_irq
+      ls-2269    3d.H5   33us : handle_fasteoi_irq <-handle_irq
 [...]
-    sshd-4261  0d.s1   98us : sub_preempt_count (net_rx_action)
-    sshd-4261  0d.s.   99us : add_preempt_count (_spin_lock_irq)
-    sshd-4261  0d.s1   99us+: _spin_unlock_irq (run_timer_softirq)
-    sshd-4261  0d.s.  104us : _local_bh_enable (__do_softirq)
-    sshd-4261  0d.s.  104us : sub_preempt_count (_local_bh_enable)
-    sshd-4261  0d.s.  105us : _local_bh_enable (__do_softirq)
-    sshd-4261  0d.s1  105us : trace_preempt_on (__do_softirq)
-
-
-This is a very interesting trace. It started with the preemption
-of the ls task. We see that the task had the "need_resched" bit
-set via the 'N' in the trace.  Interrupts were disabled before
-the spin_lock at the beginning of the trace. We see that a
-schedule took place to run sshd.  When the interrupts were
-enabled, we took an interrupt. On return from the interrupt
-handler, the softirq ran. We took another interrupt while
-running the softirq as we see from the capital 'H'.
+      ls-2269    3d.s5  158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
+      ls-2269    3d.s3  158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
+      ls-2269    3d.s3  159us : __local_bh_enable <-__do_softirq
+      ls-2269    3d.s3  159us : sub_preempt_count <-__local_bh_enable
+      ls-2269    3d..3  159us : idle_cpu <-irq_exit
+      ls-2269    3d..3  159us : rcu_irq_exit <-irq_exit
+      ls-2269    3d..3  160us : sub_preempt_count <-irq_exit
+      ls-2269    3d...  161us : __mutex_unlock_slowpath <-mutex_unlock
+      ls-2269    3d...  162us+: trace_hardirqs_on <-mutex_unlock
+      ls-2269    3d...  186us : <stack trace>
+ => __mutex_unlock_slowpath
+ => mutex_unlock
+ => process_output
+ => n_tty_write
+ => tty_write
+ => vfs_write
+ => sys_write
+ => system_call_fastpath
+
+This is an interesting trace. It started with kworker running and
+scheduling out and ls taking over. But as soon as ls released the
+rq lock and enabled interrupts (but not preemption), an interrupt
+triggered. When the interrupt finished, it started running softirqs.
+But while the softirq was running, another interrupt triggered.
+When an interrupt is running inside a softirq, the annotation is 'H'.
 
 
 wakeup
 ------
 
+One common case that people are interested in tracing is the
+time it takes for a task that is woken to actually wake up.
+Now for non-Real-Time tasks, this can be arbitrary. But tracing
+it nonetheless can be interesting.
+
+Without function tracing:
+
+ # echo 0 > options/function-trace
+ # echo wakeup > current_tracer
+ # echo 1 > tracing_on
+ # echo 0 > tracing_max_latency
+ # chrt -f 5 sleep 1
+ # echo 0 > tracing_on
+ # cat trace
+# tracer: wakeup
+#
+# wakeup latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+#    -----------------
+#    | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
+#    -----------------
+#
+#                  _------=> CPU#            
+#                 / _-----=> irqs-off        
+#                | / _----=> need-resched    
+#                || / _---=> hardirq/softirq 
+#                ||| / _--=> preempt-depth   
+#                |||| /     delay             
+#  cmd     pid   ||||| time  |   caller      
+#     \   /      |||||  \    |   /           
+  <idle>-0       3dNs7    0us :      0:120:R   + [003]   312:100:R kworker/3:1H
+  <idle>-0       3dNs7    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
+  <idle>-0       3d..3   15us : __schedule <-schedule
+  <idle>-0       3d..3   15us :      0:120:R ==> [003]   312:100:R kworker/3:1H
+
+The tracer only traces the highest priority task in the system
+to avoid tracing the normal circumstances. Here we see that
+the kworker with a nice priority of -20 (not very nice) took
+just 15 microseconds from the time it woke up to the time it
+ran.
+
+Non-Real-Time tasks are not that interesting. A more interesting
+trace is to concentrate only on Real-Time tasks.
+
+wakeup_rt
+---------
+
 In a Real-Time environment it is very important to know the
 wakeup time it takes for the highest priority task that is woken
 up to the time that it executes. This is also known as "schedule
@@ -914,124 +1423,229 @@ Real-Time environments are interested in the worst case latency.
 That is the longest latency it takes for something to happen,
 and not the average. We can have a very fast scheduler that may
 only have a large latency once in a while, but that would not
-work well with Real-Time tasks.  The wakeup tracer was designed
+work well with Real-Time tasks.  The wakeup_rt tracer was designed
 to record the worst case wakeups of RT tasks. Non-RT tasks are
 not recorded because the tracer only records one worst case and
 tracing non-RT tasks that are unpredictable will overwrite the
-worst case latency of RT tasks.
+worst case latency of RT tasks (just run the normal wakeup
+tracer for a while to see that effect).
 
 Since this tracer only deals with RT tasks, we will run this
 slightly differently than we did with the previous tracers.
 Instead of performing an 'ls', we will run 'sleep 1' under
 'chrt' which changes the priority of the task.
 
- # echo wakeup > current_tracer
- # echo latency-format > trace_options
- # echo 0 > tracing_max_latency
+ # echo 0 > options/function-trace
+ # echo wakeup_rt > current_tracer
  # echo 1 > tracing_on
+ # echo 0 > tracing_max_latency
  # chrt -f 5 sleep 1
  # echo 0 > tracing_on
  # cat trace
 # tracer: wakeup
 #
-wakeup latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
-    -----------------
-    | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5)
-    -----------------
-
-#                _------=> CPU#
-#               / _-----=> irqs-off
-#              | / _----=> need-resched
-#              || / _---=> hardirq/softirq
-#              ||| / _--=> preempt-depth
-#              |||| /
-#              |||||     delay
-#  cmd     pid ||||| time  |   caller
-#     \   /    |||||   \   |   /
-  <idle>-0     1d.h4    0us+: try_to_wake_up (wake_up_process)
-  <idle>-0     1d..4    4us : schedule (cpu_idle)
-
-
-Running this on an idle system, we see that it only took 4
-microseconds to perform the task switch.  Note, since the trace
-marker in the schedule is before the actual "switch", we stop
-the tracing when the recorded task is about to schedule in. This
-may change if we add a new marker at the end of the scheduler.
-
-Notice that the recorded task is 'sleep' with the PID of 4901
+# tracer: wakeup_rt
+#
+# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+#    -----------------
+#    | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
+#    -----------------
+#
+#                  _------=> CPU#            
+#                 / _-----=> irqs-off        
+#                | / _----=> need-resched    
+#                || / _---=> hardirq/softirq 
+#                ||| / _--=> preempt-depth   
+#                |||| /     delay             
+#  cmd     pid   ||||| time  |   caller      
+#     \   /      |||||  \    |   /           
+  <idle>-0       3d.h4    0us :      0:120:R   + [003]  2389: 94:R sleep
+  <idle>-0       3d.h4    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
+  <idle>-0       3d..3    5us : __schedule <-schedule
+  <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
+
+
+Running this on an idle system, we see that it only took 5 microseconds
+to perform the task switch.  Note, since the trace point in the schedule
+is before the actual "switch", we stop the tracing when the recorded task
+is about to schedule in. This may change if we add a new marker at the
+end of the scheduler.
+
+Notice that the recorded task is 'sleep' with the PID of 2389
 and it has an rt_prio of 5. This priority is user-space priority
 and not the internal kernel priority. The policy is 1 for
 SCHED_FIFO and 2 for SCHED_RR.
 
-Doing the same with chrt -r 5 and ftrace_enabled set.
+Note that the trace data shows the internal priority (99 - rtprio).
 
-# tracer: wakeup
+  <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
+
+The 0:120:R means idle was running with a nice priority of 0 (the
+internal priority of 120 maps to nice 0, as nice is the internal
+priority minus 120) and in the running state 'R'. The sleep task was
+scheduled in with 2389: 94:R. That is, the priority is the kernel
+rtprio (99 - 5 = 94) and it too is in the running state.
+
+Doing the same with chrt -r 5 and function-trace set.
+
+  echo 1 > options/function-trace
+
+# tracer: wakeup_rt
 #
-wakeup latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
-    -----------------
-    | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5)
-    -----------------
-
-#                _------=> CPU#
-#               / _-----=> irqs-off
-#              | / _----=> need-resched
-#              || / _---=> hardirq/softirq
-#              ||| / _--=> preempt-depth
-#              |||| /
-#              |||||     delay
-#  cmd     pid ||||| time  |   caller
-#     \   /    |||||   \   |   /
-ksoftirq-7     1d.H3    0us : try_to_wake_up (wake_up_process)
-ksoftirq-7     1d.H4    1us : sub_preempt_count (marker_probe_cb)
-ksoftirq-7     1d.H3    2us : check_preempt_wakeup (try_to_wake_up)
-ksoftirq-7     1d.H3    3us : update_curr (check_preempt_wakeup)
-ksoftirq-7     1d.H3    4us : calc_delta_mine (update_curr)
-ksoftirq-7     1d.H3    5us : __resched_task (check_preempt_wakeup)
-ksoftirq-7     1d.H3    6us : task_wake_up_rt (try_to_wake_up)
-ksoftirq-7     1d.H3    7us : _spin_unlock_irqrestore (try_to_wake_up)
-[...]
-ksoftirq-7     1d.H2   17us : irq_exit (smp_apic_timer_interrupt)
-ksoftirq-7     1d.H2   18us : sub_preempt_count (irq_exit)
-ksoftirq-7     1d.s3   19us : sub_preempt_count (irq_exit)
-ksoftirq-7     1..s2   20us : rcu_process_callbacks (__do_softirq)
-[...]
-ksoftirq-7     1..s2   26us : __rcu_process_callbacks (rcu_process_callbacks)
-ksoftirq-7     1d.s2   27us : _local_bh_enable (__do_softirq)
-ksoftirq-7     1d.s2   28us : sub_preempt_count (_local_bh_enable)
-ksoftirq-7     1.N.3   29us : sub_preempt_count (ksoftirqd)
-ksoftirq-7     1.N.2   30us : _cond_resched (ksoftirqd)
-ksoftirq-7     1.N.2   31us : __cond_resched (_cond_resched)
-ksoftirq-7     1.N.2   32us : add_preempt_count (__cond_resched)
-ksoftirq-7     1.N.2   33us : schedule (__cond_resched)
-ksoftirq-7     1.N.2   33us : add_preempt_count (schedule)
-ksoftirq-7     1.N.3   34us : hrtick_clear (schedule)
-ksoftirq-7     1dN.3   35us : _spin_lock (schedule)
-ksoftirq-7     1dN.3   36us : add_preempt_count (_spin_lock)
-ksoftirq-7     1d..4   37us : put_prev_task_fair (schedule)
-ksoftirq-7     1d..4   38us : update_curr (put_prev_task_fair)
-[...]
-ksoftirq-7     1d..5   47us : _spin_trylock (tracing_record_cmdline)
-ksoftirq-7     1d..5   48us : add_preempt_count (_spin_trylock)
-ksoftirq-7     1d..6   49us : _spin_unlock (tracing_record_cmdline)
-ksoftirq-7     1d..6   49us : sub_preempt_count (_spin_unlock)
-ksoftirq-7     1d..4   50us : schedule (__cond_resched)
-
-The interrupt went off while running ksoftirqd. This task runs
-at SCHED_OTHER. Why did not we see the 'N' set early? This may
-be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K
-stacks configured, the interrupt and softirq run with their own
-stack. Some information is held on the top of the task's stack
-(need_resched and preempt_count are both stored there). The
-setting of the NEED_RESCHED bit is done directly to the task's
-stack, but the reading of the NEED_RESCHED is done by looking at
-the current stack, which in this case is the stack for the hard
-interrupt. This hides the fact that NEED_RESCHED has been set.
-We do not see the 'N' until we switch back to the task's
-assigned stack.
+# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+#    -----------------
+#    | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
+#    -----------------
+#
+#                  _------=> CPU#            
+#                 / _-----=> irqs-off        
+#                | / _----=> need-resched    
+#                || / _---=> hardirq/softirq 
+#                ||| / _--=> preempt-depth   
+#                |||| /     delay             
+#  cmd     pid   ||||| time  |   caller      
+#     \   /      |||||  \    |   /           
+  <idle>-0       3d.h4    1us+:      0:120:R   + [003]  2448: 94:R sleep
+  <idle>-0       3d.h4    2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
+  <idle>-0       3d.h3    3us : check_preempt_curr <-ttwu_do_wakeup
+  <idle>-0       3d.h3    3us : resched_task <-check_preempt_curr
+  <idle>-0       3dNh3    4us : task_woken_rt <-ttwu_do_wakeup
+  <idle>-0       3dNh3    4us : _raw_spin_unlock <-try_to_wake_up
+  <idle>-0       3dNh3    4us : sub_preempt_count <-_raw_spin_unlock
+  <idle>-0       3dNh2    5us : ttwu_stat <-try_to_wake_up
+  <idle>-0       3dNh2    5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
+  <idle>-0       3dNh2    6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
+  <idle>-0       3dNh1    6us : _raw_spin_lock <-__run_hrtimer
+  <idle>-0       3dNh1    6us : add_preempt_count <-_raw_spin_lock
+  <idle>-0       3dNh2    7us : _raw_spin_unlock <-hrtimer_interrupt
+  <idle>-0       3dNh2    7us : sub_preempt_count <-_raw_spin_unlock
+  <idle>-0       3dNh1    7us : tick_program_event <-hrtimer_interrupt
+  <idle>-0       3dNh1    7us : clockevents_program_event <-tick_program_event
+  <idle>-0       3dNh1    8us : ktime_get <-clockevents_program_event
+  <idle>-0       3dNh1    8us : lapic_next_event <-clockevents_program_event
+  <idle>-0       3dNh1    8us : irq_exit <-smp_apic_timer_interrupt
+  <idle>-0       3dNh1    9us : sub_preempt_count <-irq_exit
+  <idle>-0       3dN.2    9us : idle_cpu <-irq_exit
+  <idle>-0       3dN.2    9us : rcu_irq_exit <-irq_exit
+  <idle>-0       3dN.2   10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
+  <idle>-0       3dN.2   10us : sub_preempt_count <-irq_exit
+  <idle>-0       3.N.1   11us : rcu_idle_exit <-cpu_idle
+  <idle>-0       3dN.1   11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
+  <idle>-0       3.N.1   11us : tick_nohz_idle_exit <-cpu_idle
+  <idle>-0       3dN.1   12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
+  <idle>-0       3dN.1   12us : ktime_get <-tick_nohz_idle_exit
+  <idle>-0       3dN.1   12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
+  <idle>-0       3dN.1   13us : update_cpu_load_nohz <-tick_nohz_idle_exit
+  <idle>-0       3dN.1   13us : _raw_spin_lock <-update_cpu_load_nohz
+  <idle>-0       3dN.1   13us : add_preempt_count <-_raw_spin_lock
+  <idle>-0       3dN.2   13us : __update_cpu_load <-update_cpu_load_nohz
+  <idle>-0       3dN.2   14us : sched_avg_update <-__update_cpu_load
+  <idle>-0       3dN.2   14us : _raw_spin_unlock <-update_cpu_load_nohz
+  <idle>-0       3dN.2   14us : sub_preempt_count <-_raw_spin_unlock
+  <idle>-0       3dN.1   15us : calc_load_exit_idle <-tick_nohz_idle_exit
+  <idle>-0       3dN.1   15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
+  <idle>-0       3dN.1   15us : hrtimer_cancel <-tick_nohz_idle_exit
+  <idle>-0       3dN.1   15us : hrtimer_try_to_cancel <-hrtimer_cancel
+  <idle>-0       3dN.1   16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
+  <idle>-0       3dN.1   16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
+  <idle>-0       3dN.1   16us : add_preempt_count <-_raw_spin_lock_irqsave
+  <idle>-0       3dN.2   17us : __remove_hrtimer <-remove_hrtimer.part.16
+  <idle>-0       3dN.2   17us : hrtimer_force_reprogram <-__remove_hrtimer
+  <idle>-0       3dN.2   17us : tick_program_event <-hrtimer_force_reprogram
+  <idle>-0       3dN.2   18us : clockevents_program_event <-tick_program_event
+  <idle>-0       3dN.2   18us : ktime_get <-clockevents_program_event
+  <idle>-0       3dN.2   18us : lapic_next_event <-clockevents_program_event
+  <idle>-0       3dN.2   19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
+  <idle>-0       3dN.2   19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
+  <idle>-0       3dN.1   19us : hrtimer_forward <-tick_nohz_idle_exit
+  <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
+  <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
+  <idle>-0       3dN.1   20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
+  <idle>-0       3dN.1   20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
+  <idle>-0       3dN.1   21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
+  <idle>-0       3dN.1   21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
+  <idle>-0       3dN.1   21us : add_preempt_count <-_raw_spin_lock_irqsave
+  <idle>-0       3dN.2   22us : ktime_add_safe <-__hrtimer_start_range_ns
+  <idle>-0       3dN.2   22us : enqueue_hrtimer <-__hrtimer_start_range_ns
+  <idle>-0       3dN.2   22us : tick_program_event <-__hrtimer_start_range_ns
+  <idle>-0       3dN.2   23us : clockevents_program_event <-tick_program_event
+  <idle>-0       3dN.2   23us : ktime_get <-clockevents_program_event
+  <idle>-0       3dN.2   23us : lapic_next_event <-clockevents_program_event
+  <idle>-0       3dN.2   24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
+  <idle>-0       3dN.2   24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
+  <idle>-0       3dN.1   24us : account_idle_ticks <-tick_nohz_idle_exit
+  <idle>-0       3dN.1   24us : account_idle_time <-account_idle_ticks
+  <idle>-0       3.N.1   25us : sub_preempt_count <-cpu_idle
+  <idle>-0       3.N..   25us : schedule <-cpu_idle
+  <idle>-0       3.N..   25us : __schedule <-preempt_schedule
+  <idle>-0       3.N..   26us : add_preempt_count <-__schedule
+  <idle>-0       3.N.1   26us : rcu_note_context_switch <-__schedule
+  <idle>-0       3.N.1   26us : rcu_sched_qs <-rcu_note_context_switch
+  <idle>-0       3dN.1   27us : rcu_preempt_qs <-rcu_note_context_switch
+  <idle>-0       3.N.1   27us : _raw_spin_lock_irq <-__schedule
+  <idle>-0       3dN.1   27us : add_preempt_count <-_raw_spin_lock_irq
+  <idle>-0       3dN.2   28us : put_prev_task_idle <-__schedule
+  <idle>-0       3dN.2   28us : pick_next_task_stop <-pick_next_task
+  <idle>-0       3dN.2   28us : pick_next_task_rt <-pick_next_task
+  <idle>-0       3dN.2   29us : dequeue_pushable_task <-pick_next_task_rt
+  <idle>-0       3d..3   29us : __schedule <-preempt_schedule
+  <idle>-0       3d..3   30us :      0:120:R ==> [003]  2448: 94:R sleep
+
+This isn't that big of a trace, even with function tracing enabled,
+so I included the entire trace.
+
+The interrupt went off while the system was idle. Somewhere
+before task_woken_rt() was called, the NEED_RESCHED flag was set;
+this is indicated by the first occurrence of the 'N' flag.
+
+Latency tracing and events
+--------------------------
+Function tracing can induce a much larger latency, but without
+seeing what happens within the latency it is hard to know what
+caused it. There is a middle ground: enabling events.
+
+ # echo 0 > options/function-trace
+ # echo wakeup_rt > current_tracer
+ # echo 1 > events/enable
+ # echo 1 > tracing_on
+ # echo 0 > tracing_max_latency
+ # chrt -f 5 sleep 1
+ # echo 0 > tracing_on
+ # cat trace
+# tracer: wakeup_rt
+#
+# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
+# --------------------------------------------------------------------
+# latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
+#    -----------------
+#    | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
+#    -----------------
+#
+#                  _------=> CPU#            
+#                 / _-----=> irqs-off        
+#                | / _----=> need-resched    
+#                || / _---=> hardirq/softirq 
+#                ||| / _--=> preempt-depth   
+#                |||| /     delay             
+#  cmd     pid   ||||| time  |   caller      
+#     \   /      |||||  \    |   /           
+  <idle>-0       2d.h4    0us :      0:120:R   + [002]  5882: 94:R sleep
+  <idle>-0       2d.h4    0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
+  <idle>-0       2d.h4    1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
+  <idle>-0       2dNh2    1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
+  <idle>-0       2.N.2    2us : power_end: cpu_id=2
+  <idle>-0       2.N.2    3us : cpu_idle: state=4294967295 cpu_id=2
+  <idle>-0       2dN.3    4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
+  <idle>-0       2dN.3    4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
+  <idle>-0       2.N.2    5us : rcu_utilization: Start context switch
+  <idle>-0       2.N.2    5us : rcu_utilization: End context switch
+  <idle>-0       2d..3    6us : __schedule <-schedule
+  <idle>-0       2d..3    6us :      0:120:R ==> [002]  5882: 94:R sleep
+
 
 function
 --------
@@ -1039,6 +1653,7 @@ function
 This tracer is the function tracer. Enabling the function tracer
 can be done from the debug file system. Make sure the
 ftrace_enabled is set; otherwise this tracer is a nop.
+See the "ftrace_enabled" section below.
 
  # sysctl kernel.ftrace_enabled=1
  # echo function > current_tracer
@@ -1048,23 +1663,23 @@ ftrace_enabled is set; otherwise this tracer is a nop.
  # cat trace
 # tracer: function
 #
-#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
-#              | |      |          |         |
-            bash-4003  [00]   123.638713: finish_task_switch <-schedule
-            bash-4003  [00]   123.638714: _spin_unlock_irq <-finish_task_switch
-            bash-4003  [00]   123.638714: sub_preempt_count <-_spin_unlock_irq
-            bash-4003  [00]   123.638715: hrtick_set <-schedule
-            bash-4003  [00]   123.638715: _spin_lock_irqsave <-hrtick_set
-            bash-4003  [00]   123.638716: add_preempt_count <-_spin_lock_irqsave
-            bash-4003  [00]   123.638716: _spin_unlock_irqrestore <-hrtick_set
-            bash-4003  [00]   123.638717: sub_preempt_count <-_spin_unlock_irqrestore
-            bash-4003  [00]   123.638717: hrtick_clear <-hrtick_set
-            bash-4003  [00]   123.638718: sub_preempt_count <-schedule
-            bash-4003  [00]   123.638718: sub_preempt_count <-preempt_schedule
-            bash-4003  [00]   123.638719: wait_for_completion <-__stop_machine_run
-            bash-4003  [00]   123.638719: wait_for_common <-wait_for_completion
-            bash-4003  [00]   123.638720: _spin_lock_irq <-wait_for_common
-            bash-4003  [00]   123.638720: add_preempt_count <-_spin_lock_irq
+# entries-in-buffer/entries-written: 24799/24799   #P:4
+#
+#                              _-----=> irqs-off
+#                             / _----=> need-resched
+#                            | / _---=> hardirq/softirq
+#                            || / _--=> preempt-depth
+#                            ||| /     delay
+#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+#              | |       |   ||||       |         |
+            bash-1994  [002] ....  3082.063030: mutex_unlock <-rb_simple_write
+            bash-1994  [002] ....  3082.063031: __mutex_unlock_slowpath <-mutex_unlock
+            bash-1994  [002] ....  3082.063031: __fsnotify_parent <-fsnotify_modify
+            bash-1994  [002] ....  3082.063032: fsnotify <-fsnotify_modify
+            bash-1994  [002] ....  3082.063032: __srcu_read_lock <-fsnotify
+            bash-1994  [002] ....  3082.063032: add_preempt_count <-__srcu_read_lock
+            bash-1994  [002] ...1  3082.063032: sub_preempt_count <-__srcu_read_lock
+            bash-1994  [002] ....  3082.063033: __srcu_read_unlock <-fsnotify
 [...]
 
 
@@ -1214,79 +1829,19 @@ int main (int argc, char **argv)
         return 0;
 }
 
+Or this simple script!
 
-hw-branch-tracer (x86 only)
----------------------------
-
-This tracer uses the x86 last branch tracing hardware feature to
-collect a branch trace on all cpus with relatively low overhead.
-
-The tracer uses a fixed-size circular buffer per cpu and only
-traces ring 0 branches. The trace file dumps that buffer in the
-following format:
-
-# tracer: hw-branch-tracer
-#
-# CPU#        TO  <-  FROM
-   0  scheduler_tick+0xb5/0x1bf	  <-  task_tick_idle+0x5/0x6
-   2  run_posix_cpu_timers+0x2b/0x72a	  <-  run_posix_cpu_timers+0x25/0x72a
-   0  scheduler_tick+0x139/0x1bf	  <-  scheduler_tick+0xed/0x1bf
-   0  scheduler_tick+0x17c/0x1bf	  <-  scheduler_tick+0x148/0x1bf
-   2  run_posix_cpu_timers+0x9e/0x72a	  <-  run_posix_cpu_timers+0x5e/0x72a
-   0  scheduler_tick+0x1b6/0x1bf	  <-  scheduler_tick+0x1aa/0x1bf
-
-
-The tracer may be used to dump the trace for the oops'ing cpu on
-a kernel oops into the system log. To enable this,
-ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one
-can either use the sysctl function or set it via the proc system
-interface.
-
-  sysctl kernel.ftrace_dump_on_oops=n
-
-or
-
-  echo n > /proc/sys/kernel/ftrace_dump_on_oops
-
-If n = 1, ftrace will dump buffers of all CPUs, if n = 2 ftrace will
-only dump the buffer of the CPU that triggered the oops.
-
-Here's an example of such a dump after a null pointer
-dereference in a kernel module:
-
-[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
-[57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops]
-[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0
-[57848.106019] Oops: 0002 [#1] SMP
-[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus
-[57848.106019] Dumping ftrace buffer:
-[57848.106019] ---------------------------------
-[...]
-[57848.106019]    0  chrdev_open+0xe6/0x165	  <-  cdev_put+0x23/0x24
-[57848.106019]    0  chrdev_open+0x117/0x165	  <-  chrdev_open+0xfa/0x165
-[57848.106019]    0  chrdev_open+0x120/0x165	  <-  chrdev_open+0x11c/0x165
-[57848.106019]    0  chrdev_open+0x134/0x165	  <-  chrdev_open+0x12b/0x165
-[57848.106019]    0  open+0x0/0x14 [oops]	  <-  chrdev_open+0x144/0x165
-[57848.106019]    0  page_fault+0x0/0x30	  <-  open+0x6/0x14 [oops]
-[57848.106019]    0  error_entry+0x0/0x5b	  <-  page_fault+0x4/0x30
-[57848.106019]    0  error_kernelspace+0x0/0x31	  <-  error_entry+0x59/0x5b
-[57848.106019]    0  error_sti+0x0/0x1	  <-  error_kernelspace+0x2d/0x31
-[57848.106019]    0  page_fault+0x9/0x30	  <-  error_sti+0x0/0x1
-[57848.106019]    0  do_page_fault+0x0/0x881	  <-  page_fault+0x1a/0x30
-[...]
-[57848.106019]    0  do_page_fault+0x66b/0x881	  <-  is_prefetch+0x1ee/0x1f2
-[57848.106019]    0  do_page_fault+0x6e0/0x881	  <-  do_page_fault+0x67a/0x881
-[57848.106019]    0  oops_begin+0x0/0x96	  <-  do_page_fault+0x6e0/0x881
-[57848.106019]    0  trace_hw_branch_oops+0x0/0x2d	  <-  oops_begin+0x9/0x96
-[...]
-[57848.106019]    0  ds_suspend_bts+0x2a/0xe3	  <-  ds_suspend_bts+0x1a/0xe3
-[57848.106019] ---------------------------------
-[57848.106019] CPU 0
-[57848.106019] Modules linked in: oops
-[57848.106019] Pid: 5542, comm: cat Tainted: G        W  2.6.28 #23
-[57848.106019] RIP: 0010:[<ffffffffa0000006>]  [<ffffffffa0000006>] open+0x6/0x14 [oops]
-[57848.106019] RSP: 0018:ffff880235457d48  EFLAGS: 00010246
-[...]
+------
+#!/bin/bash
+
+# Find where debugfs is mounted (usually /sys/kernel/debug).
+debugfs=`sed -ne 's/^debugfs \(.*\) debugfs.*/\1/p' /proc/mounts`
+# Reset the tracer and stop tracing while setting up.
+echo nop > $debugfs/tracing/current_tracer
+echo 0 > $debugfs/tracing/tracing_on
+# Limit tracing to this shell's PID; the exec below keeps that PID.
+echo $$ > $debugfs/tracing/set_ftrace_pid
+echo function > $debugfs/tracing/current_tracer
+echo 1 > $debugfs/tracing/tracing_on
+# Replace the shell with the command to be traced.
+exec "$@"
+------
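+
+For example, saving the above as a script and passing it the
+command to trace (the script name here is just an illustration):
+
+ # ./ftrace-exec ls > /dev/null
+ # cat trace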
 
 
 function graph tracer
@@ -1473,16 +2028,18 @@ starts of pointing to a simple return. (Enabling FTRACE will
 include the -pg switch in the compiling of the kernel.)
 
 At compile time every C file object is run through the
-recordmcount.pl script (located in the scripts directory). This
-script will process the C object using objdump to find all the
-locations in the .text section that call mcount. (Note, only the
-.text section is processed, since processing other sections like
-.init.text may cause races due to those sections being freed).
+recordmcount program (located in the scripts directory). This
+program will parse the ELF headers in the C object to find all
+the locations in the .text section that call mcount. (Note, only
+whitelisted .text sections are processed, since processing other
+sections like .init.text may cause races due to those sections
+being freed unexpectedly).
 
 A new section called "__mcount_loc" is created that holds
 references to all the mcount call sites in the .text section.
-This section is compiled back into the original object. The
-final linker will add all these references into a single table.
+The recordmcount program re-links this section back into the
+original object. The final linking stage of the kernel will add all these
+references into a single table.
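+
+As a rough illustration, the new section can be seen on an object
+that recordmcount has processed, using objdump (the object path
+here is just an example):
+
+ # objdump -h kernel/sched/core.o | grep mcount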
 
 On boot up, before SMP is initialized, the dynamic ftrace code
 scans this table and updates all the locations into nops. It
@@ -1493,13 +2050,25 @@ unloaded, it also removes its functions from the ftrace function
 list. This is automatic in the module unload code, and the
 module author does not need to worry about it.
 
-When tracing is enabled, kstop_machine is called to prevent
-races with the CPUS executing code being modified (which can
-cause the CPU to do undesirable things), and the nops are
+When tracing is enabled, the process of modifying the function
+tracepoints is dependent on architecture. The old method is to use
+kstop_machine to prevent races with the CPUs executing code being
+modified (which can cause the CPU to do undesirable things, especially
+if the modified code crosses cache (or page) boundaries), and the nops are
 patched back to calls. But this time, they do not call mcount
 (which is just a function stub). They now call into the ftrace
 infrastructure.
 
+The new method of modifying the function tracepoints is to place
+a breakpoint at the location to be modified and sync all CPUs,
+then modify the rest of the instruction not covered by the
+breakpoint and sync all CPUs again, and finally remove the
+breakpoint by writing in the first byte of the finished ftrace
+call site.
+
+Some archs do not even need to monkey around with the synchronization,
+and can just slap the new code on top of the old without any
+problems with other CPUs executing it at the same time.
+
 One special side-effect to the recording of the functions being
 traced is that we can now selectively choose which functions we
 wish to trace and which ones we want the mcount calls to remain
@@ -1530,20 +2099,28 @@ mutex_lock
 
 If I am only interested in sys_nanosleep and hrtimer_interrupt:
 
- # echo sys_nanosleep hrtimer_interrupt \
-		> set_ftrace_filter
+ # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
  # echo function > current_tracer
  # echo 1 > tracing_on
  # usleep 1
  # echo 0 > tracing_on
  # cat trace
-# tracer: ftrace
+# tracer: function
+#
+# entries-in-buffer/entries-written: 5/5   #P:4
 #
-#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
-#              | |      |          |         |
-          usleep-4134  [00]  1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt
-          usleep-4134  [00]  1317.070111: sys_nanosleep <-syscall_call
-          <idle>-0     [00]  1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt
+#                              _-----=> irqs-off
+#                             / _----=> need-resched
+#                            | / _---=> hardirq/softirq
+#                            || / _--=> preempt-depth
+#                            ||| /     delay
+#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+#              | |       |   ||||       |         |
+          usleep-2665  [001] ....  4186.475355: sys_nanosleep <-system_call_fastpath
+          <idle>-0     [001] d.h1  4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
+          usleep-2665  [001] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
+          <idle>-0     [003] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
+          <idle>-0     [002] d.h1  4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
 
 To see which functions are being traced, you can cat the file:
 
@@ -1571,20 +2148,25 @@ Note: It is better to use quotes to enclose the wild cards,
 
 Produces:
 
-# tracer: ftrace
+# tracer: function
 #
-#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
-#              | |      |          |         |
-            bash-4003  [00]  1480.611794: hrtimer_init <-copy_process
-            bash-4003  [00]  1480.611941: hrtimer_start <-hrtick_set
-            bash-4003  [00]  1480.611956: hrtimer_cancel <-hrtick_clear
-            bash-4003  [00]  1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel
-          <idle>-0     [00]  1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt
-          <idle>-0     [00]  1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt
-          <idle>-0     [00]  1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt
-          <idle>-0     [00]  1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt
-          <idle>-0     [00]  1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt
-
+# entries-in-buffer/entries-written: 897/897   #P:4
+#
+#                              _-----=> irqs-off
+#                             / _----=> need-resched
+#                            | / _---=> hardirq/softirq
+#                            || / _--=> preempt-depth
+#                            ||| /     delay
+#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+#              | |       |   ||||       |         |
+          <idle>-0     [003] dN.1  4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
+          <idle>-0     [003] dN.1  4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
+          <idle>-0     [003] dN.2  4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
+          <idle>-0     [003] dN.1  4228.547805: hrtimer_forward <-tick_nohz_idle_exit
+          <idle>-0     [003] dN.1  4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
+          <idle>-0     [003] d..1  4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
+          <idle>-0     [003] d..1  4228.547859: hrtimer_start <-__tick_nohz_idle_enter
+          <idle>-0     [003] d..2  4228.547860: hrtimer_force_reprogram <-__rem
 
 Notice that we lost the sys_nanosleep.
 
@@ -1651,19 +2233,29 @@ traced.
 
 Produces:
 
-# tracer: ftrace
+# tracer: function
+#
+# entries-in-buffer/entries-written: 39608/39608   #P:4
 #
-#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
-#              | |      |          |         |
-            bash-4043  [01]   115.281644: finish_task_switch <-schedule
-            bash-4043  [01]   115.281645: hrtick_set <-schedule
-            bash-4043  [01]   115.281645: hrtick_clear <-hrtick_set
-            bash-4043  [01]   115.281646: wait_for_completion <-__stop_machine_run
-            bash-4043  [01]   115.281647: wait_for_common <-wait_for_completion
-            bash-4043  [01]   115.281647: kthread_stop <-stop_machine_run
-            bash-4043  [01]   115.281648: init_waitqueue_head <-kthread_stop
-            bash-4043  [01]   115.281648: wake_up_process <-kthread_stop
-            bash-4043  [01]   115.281649: try_to_wake_up <-wake_up_process
+#                              _-----=> irqs-off
+#                             / _----=> need-resched
+#                            | / _---=> hardirq/softirq
+#                            || / _--=> preempt-depth
+#                            ||| /     delay
+#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+#              | |       |   ||||       |         |
+            bash-1994  [000] ....  4342.324896: file_ra_state_init <-do_dentry_open
+            bash-1994  [000] ....  4342.324897: open_check_o_direct <-do_last
+            bash-1994  [000] ....  4342.324897: ima_file_check <-do_last
+            bash-1994  [000] ....  4342.324898: process_measurement <-ima_file_check
+            bash-1994  [000] ....  4342.324898: ima_get_action <-process_measurement
+            bash-1994  [000] ....  4342.324898: ima_match_policy <-ima_get_action
+            bash-1994  [000] ....  4342.324899: do_truncate <-do_last
+            bash-1994  [000] ....  4342.324899: should_remove_suid <-do_truncate
+            bash-1994  [000] ....  4342.324899: notify_change <-do_truncate
+            bash-1994  [000] ....  4342.324900: current_fs_time <-notify_change
+            bash-1994  [000] ....  4342.324900: current_kernel_time <-current_fs_time
+            bash-1994  [000] ....  4342.324900: timespec_trunc <-current_fs_time
 
 We can see that there's no more lock or preempt tracing.
 
@@ -1729,6 +2321,28 @@ this special filter via:
  echo > set_graph_function
 
 
+ftrace_enabled
+--------------
+
+Note, the proc sysctl ftrace_enabled is a big on/off switch for the
+function tracer. By default it is enabled (when function tracing is
+enabled in the kernel). If it is disabled, all function tracing is
+disabled. This includes not only the function tracers for ftrace, but
+also for any other uses (perf, kprobes, stack tracing, profiling, etc).
+
+Please disable this with care.
+
+This can be disabled (and enabled) with:
+
+  sysctl kernel.ftrace_enabled=0
+  sysctl kernel.ftrace_enabled=1
+
+ or
+
+  echo 0 > /proc/sys/kernel/ftrace_enabled
+  echo 1 > /proc/sys/kernel/ftrace_enabled
+
+
 Filter commands
 ---------------
 
@@ -1763,12 +2377,58 @@ The following commands are supported:
 
    echo '__schedule_bug:traceoff:5' > set_ftrace_filter
 
+  To always disable tracing when __schedule_bug is hit:
+
+   echo '__schedule_bug:traceoff' > set_ftrace_filter
+
   These commands are cumulative whether or not they are appended
   to set_ftrace_filter. To remove a command, prepend it by '!'
   and drop the parameter:
 
+   echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
+
+    The above removes the traceoff command for __schedule_bug
+    that has a counter. To remove commands without counters:
+
    echo '!__schedule_bug:traceoff' > set_ftrace_filter
 
+- snapshot
+  Will cause a snapshot to be triggered when the function is hit.
+
+   echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
+
+  To only snapshot once:
+
+   echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
+
+  To remove the above commands:
+
+   echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
+   echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
+
+- enable_event/disable_event
+  These commands can enable or disable a trace event. Note, because
+  function tracing callbacks are very sensitive, when these commands
+  are registered, the tracepoint is activated, but disabled in
+  a "soft" mode. That is, the tracepoint will be called, but
+  will not be traced. The event tracepoint stays in this mode
+  as long as there's a command that triggers it.
+
+   echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
+   	 set_ftrace_filter
+
+  The format is:
+
+    <function>:enable_event:<system>:<event>[:count]
+    <function>:disable_event:<system>:<event>[:count]
+
+  To remove the event commands:
+
+   echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
+   	 set_ftrace_filter
+   echo '!schedule:disable_event:sched:sched_switch' > \
+   	 set_ftrace_filter
 
 trace_pipe
 ----------
@@ -1787,28 +2447,31 @@ different. The trace is live.
  # cat trace
 # tracer: function
 #
-#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
-#              | |      |          |         |
+# entries-in-buffer/entries-written: 0/0   #P:4
+#
+#                              _-----=> irqs-off
+#                             / _----=> need-resched
+#                            | / _---=> hardirq/softirq
+#                            || / _--=> preempt-depth
+#                            ||| /     delay
+#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+#              | |       |   ||||       |         |
 
  #
  # cat /tmp/trace.out
-            bash-4043  [00] 41.267106: finish_task_switch <-schedule
-            bash-4043  [00] 41.267106: hrtick_set <-schedule
-            bash-4043  [00] 41.267107: hrtick_clear <-hrtick_set
-            bash-4043  [00] 41.267108: wait_for_completion <-__stop_machine_run
-            bash-4043  [00] 41.267108: wait_for_common <-wait_for_completion
-            bash-4043  [00] 41.267109: kthread_stop <-stop_machine_run
-            bash-4043  [00] 41.267109: init_waitqueue_head <-kthread_stop
-            bash-4043  [00] 41.267110: wake_up_process <-kthread_stop
-            bash-4043  [00] 41.267110: try_to_wake_up <-wake_up_process
-            bash-4043  [00] 41.267111: select_task_rq_rt <-try_to_wake_up
+            bash-1994  [000] ....  5281.568961: mutex_unlock <-rb_simple_write
+            bash-1994  [000] ....  5281.568963: __mutex_unlock_slowpath <-mutex_unlock
+            bash-1994  [000] ....  5281.568963: __fsnotify_parent <-fsnotify_modify
+            bash-1994  [000] ....  5281.568964: fsnotify <-fsnotify_modify
+            bash-1994  [000] ....  5281.568964: __srcu_read_lock <-fsnotify
+            bash-1994  [000] ....  5281.568964: add_preempt_count <-__srcu_read_lock
+            bash-1994  [000] ...1  5281.568965: sub_preempt_count <-__srcu_read_lock
+            bash-1994  [000] ....  5281.568965: __srcu_read_unlock <-fsnotify
+            bash-1994  [000] ....  5281.568967: sys_dup2 <-system_call_fastpath
 
 
 Note, reading the trace_pipe file will block until more input is
-added. By changing the tracer, trace_pipe will issue an EOF. We
-needed to set the function tracer _before_ we "cat" the
-trace_pipe file.
-
+added.
 
 trace entries
 -------------
@@ -1817,31 +2480,50 @@ Having too much or not enough data can be troublesome in
 diagnosing an issue in the kernel. The file buffer_size_kb is
 used to modify the size of the internal trace buffers. The
 number listed is the number of entries that can be recorded per
-CPU. To know the full size, multiply the number of possible CPUS
+CPU. To know the full size, multiply the number of possible CPUs
 with the number of entries.
 
  # cat buffer_size_kb
 1408 (units kilobytes)
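+
+With 4 CPUs (#P:4), that gives a total of 4 * 1408 = 5632 KB.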
 
-Note, to modify this, you must have tracing completely disabled.
-To do that, echo "nop" into the current_tracer. If the
-current_tracer is not set to "nop", an EINVAL error will be
-returned.
+Or simply read buffer_total_size_kb:
+
+ # cat buffer_total_size_kb 
+5632
+
+To modify the buffer, simply echo in a number (in 1024-byte segments).
 
- # echo nop > current_tracer
  # echo 10000 > buffer_size_kb
  # cat buffer_size_kb
 10000 (units kilobytes)
 
-The number of pages which will be allocated is limited to a
-percentage of available memory. Allocating too much will produce
-an error.
+It will try to allocate as much as possible. If you ask for too
+much, it can trigger the Out-Of-Memory killer.
 
  # echo 1000000000000 > buffer_size_kb
 -bash: echo: write error: Cannot allocate memory
  # cat buffer_size_kb
 85
 
+The per_cpu buffers can be changed individually as well:
+
+ # echo 10000 > per_cpu/cpu0/buffer_size_kb
+ # echo 100 > per_cpu/cpu1/buffer_size_kb
+
+When the per_cpu buffers are not the same, the buffer_size_kb
+at the top level will just show an X:
+
+ # cat buffer_size_kb
+X
+
+This is where the buffer_total_size_kb is useful:
+
+ # cat buffer_total_size_kb 
+12916
+
+Writing to the top level buffer_size_kb will reset all the buffers
+to be the same again.
+
 Snapshot
 --------
 CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
@@ -1925,7 +2607,188 @@ bash: echo: write error: Device or resource busy
  # cat snapshot
 cat: snapshot: Device or resource busy
 
+
+Instances
+---------
+In the debugfs tracing directory is a directory called "instances".
+New directories can be created inside of it using mkdir, and
+removed with rmdir. A directory created with mkdir here will
+already contain files and other directories after it is created.
+
+ # mkdir instances/foo
+ # ls instances/foo
+buffer_size_kb  buffer_total_size_kb  events  free_buffer  per_cpu
+set_event  snapshot  trace  trace_clock  trace_marker  trace_options
+trace_pipe  tracing_on
+
+As you can see, the new directory looks similar to the tracing
+directory itself. In fact, it is very similar, except that the
+buffer and events are independent of the main directory, and of
+any other instances that are created.
+
+The files in the new directory work just like the files with the
+same name in the tracing directory except the buffer that is used
+is a separate and new buffer. The files affect that buffer but do not
+affect the main buffer with the exception of trace_options. Currently,
+the trace_options affect all instances and the top level buffer
+the same, but this may change in future releases. That is, options
+may become specific to the instance they reside in.
+
+Notice that none of the function tracer files are there, nor are
+current_tracer and available_tracers. This is because the buffers
+can currently only have events enabled for them.
+
+ # mkdir instances/foo
+ # mkdir instances/bar
+ # mkdir instances/zoot
+ # echo 100000 > buffer_size_kb
+ # echo 1000 > instances/foo/buffer_size_kb
+ # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
+ # echo function > current_tracer
+ # echo 1 > instances/foo/events/sched/sched_wakeup/enable
+ # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
+ # echo 1 > instances/foo/events/sched/sched_switch/enable
+ # echo 1 > instances/bar/events/irq/enable
+ # echo 1 > instances/zoot/events/syscalls/enable
+ # cat trace_pipe
+CPU:2 [LOST 11745 EVENTS]
+            bash-2044  [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
+            bash-2044  [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
+            bash-2044  [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
+            bash-2044  [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
+            bash-2044  [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
+            bash-2044  [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
+            bash-2044  [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
+            bash-2044  [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
+            bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
+            bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
+            bash-2044  [002] .... 10594.481035: arch_dup_task_struct <-copy_process
+[...]
+
+ # cat instances/foo/trace_pipe
+            bash-1998  [000] d..4   136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
+            bash-1998  [000] dN.4   136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
+          <idle>-0     [003] d.h3   136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
+          <idle>-0     [003] d..3   136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
+     rcu_preempt-9     [003] d..3   136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
+            bash-1998  [000] d..4   136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
+            bash-1998  [000] dN.4   136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
+            bash-1998  [000] d..3   136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
+     kworker/0:1-59    [000] d..4   136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
+     kworker/0:1-59    [000] d..3   136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
+[...]
+
+ # cat instances/bar/trace_pipe
+     migration/1-14    [001] d.h3   138.732674: softirq_raise: vec=3 [action=NET_RX]
+          <idle>-0     [001] dNh3   138.732725: softirq_raise: vec=3 [action=NET_RX]
+            bash-1998  [000] d.h1   138.733101: softirq_raise: vec=1 [action=TIMER]
+            bash-1998  [000] d.h1   138.733102: softirq_raise: vec=9 [action=RCU]
+            bash-1998  [000] ..s2   138.733105: softirq_entry: vec=1 [action=TIMER]
+            bash-1998  [000] ..s2   138.733106: softirq_exit: vec=1 [action=TIMER]
+            bash-1998  [000] ..s2   138.733106: softirq_entry: vec=9 [action=RCU]
+            bash-1998  [000] ..s2   138.733109: softirq_exit: vec=9 [action=RCU]
+            sshd-1995  [001] d.h1   138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
+            sshd-1995  [001] d.h1   138.733280: irq_handler_exit: irq=21 ret=unhandled
+            sshd-1995  [001] d.h1   138.733281: irq_handler_entry: irq=21 name=eth0
+            sshd-1995  [001] d.h1   138.733283: irq_handler_exit: irq=21 ret=handled
+[...]
+
+ # cat instances/zoot/trace
+# tracer: nop
+#
+# entries-in-buffer/entries-written: 18996/18996   #P:4
+#
+#                              _-----=> irqs-off
+#                             / _----=> need-resched
+#                            | / _---=> hardirq/softirq
+#                            || / _--=> preempt-depth
+#                            ||| /     delay
+#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+#              | |       |   ||||       |         |
+            bash-1998  [000] d...   140.733501: sys_write -> 0x2
+            bash-1998  [000] d...   140.733504: sys_dup2(oldfd: a, newfd: 1)
+            bash-1998  [000] d...   140.733506: sys_dup2 -> 0x1
+            bash-1998  [000] d...   140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
+            bash-1998  [000] d...   140.733509: sys_fcntl -> 0x1
+            bash-1998  [000] d...   140.733510: sys_close(fd: a)
+            bash-1998  [000] d...   140.733510: sys_close -> 0x0
+            bash-1998  [000] d...   140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
+            bash-1998  [000] d...   140.733515: sys_rt_sigprocmask -> 0x0
+            bash-1998  [000] d...   140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
+            bash-1998  [000] d...   140.733516: sys_rt_sigaction -> 0x0
+
+You can see that the trace of the topmost trace buffer shows only
+the function tracing. The foo instance displays wakeups and task
+switches, the bar instance shows the interrupt events, and the
+zoot instance shows the syscall entries and exits.
+
+To remove the instances, simply delete their directories:
+
+ # rmdir instances/foo
+ # rmdir instances/bar
+ # rmdir instances/zoot
+
+Note, if a process has a trace file open in one of the instance
+directories, the rmdir will fail with EBUSY.
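+
+For example (the error text below comes from the rmdir utility
+and may vary):
+
+ # cat instances/foo/trace_pipe &
+ # rmdir instances/foo
+rmdir: failed to remove 'instances/foo': Device or resource busy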
+
+
+Stack trace
 -----------
+Since the kernel has a fixed size stack, it is important not to
+waste it in functions. A kernel developer must be conscious of
+what they allocate on the stack. If they add too much, the system
+can be in danger of a stack overflow, and corruption will occur,
+usually leading to a system panic.
+
+There are some tools that check this, usually with interrupts
+periodically checking usage. But if you can perform a check
+at every function call, that becomes very useful. As ftrace
+provides a function tracer, it is convenient to check the stack
+size at every function call. This is enabled via the stack tracer.
+
+CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
+To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
+
+ # echo 1 > /proc/sys/kernel/stack_tracer_enabled
+
+You can also enable it from the kernel command line to trace
+the stack size of the kernel during boot up, by adding "stacktrace"
+to the kernel command line.
+
+After running it for a few minutes, the output looks like:
+
+ # cat stack_max_size
+2928
+
+ # cat stack_trace
+        Depth    Size   Location    (18 entries)
+        -----    ----   --------
+  0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
+  1)     2704     160   find_busiest_group+0x31/0x1f1
+  2)     2544     256   load_balance+0xd9/0x662
+  3)     2288      80   idle_balance+0xbb/0x130
+  4)     2208     128   __schedule+0x26e/0x5b9
+  5)     2080      16   schedule+0x64/0x66
+  6)     2064     128   schedule_timeout+0x34/0xe0
+  7)     1936     112   wait_for_common+0x97/0xf1
+  8)     1824      16   wait_for_completion+0x1d/0x1f
+  9)     1808     128   flush_work+0xfe/0x119
+ 10)     1680      16   tty_flush_to_ldisc+0x1e/0x20
+ 11)     1664      48   input_available_p+0x1d/0x5c
+ 12)     1616      48   n_tty_poll+0x6d/0x134
+ 13)     1568      64   tty_poll+0x64/0x7f
+ 14)     1504     880   do_select+0x31e/0x511
+ 15)      624     400   core_sys_select+0x177/0x216
+ 16)      224      96   sys_select+0x91/0xb9
+ 17)      128     128   system_call_fastpath+0x16/0x1b
+
+Note, if -mfentry is being used by gcc, functions get traced before
+they set up the stack frame. This means that leaf level functions
+are not tested by the stack tracer when -mfentry is used.
+
+Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
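+
+A quick way to check which call your gcc emits (a sketch; the
+exact assembly varies by target and gcc version):
+
+ # echo 'void f(void) { }' | gcc -x c -S -pg -mfentry -o - - | grep fentry
+	call	__fentry__
+ # echo 'void f(void) { }' | gcc -x c -S -pg -o - - | grep mcount
+	call	mcount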
+
+---------
 
 More details can be found in the source code, in the
 kernel/trace/*.c files.
-- 
1.7.10.4



