[PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed
@ 2013-10-18 14:38 Waiman Long
  2013-10-18 14:38 ` [PATCH v2 1/4] perf: Fix potential compilation error with some compilers Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Waiman Long @ 2013-10-18 14:38 UTC (permalink / raw)
  To: Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton, Waiman Long

v1->v2:
 - Include a compilation fix patch and a code streamlining patch 
   into the patch set.
 - Use the __stringify() macro in stringify.h instead of adding a
   duplicate macro.
 - Add the --max-stack option to perf-top as well.
 
This perf patch set contains the following changes:

Patch 1 - Fix a perf tool compilation error that happens on SLES 11
          sp3 system.
Patch 2 - Streamline the append_chain() function to make it run a bit
          faster.
Patch 3 - Add a --max-stack option to perf-report to speed up its
          processing at the expense of less backtrace information
          available.
Patch 4 - Add a similar --max-stack option to perf-top.

Waiman Long (4):
  perf: Fix potential compilation error with some compilers
  perf: streamline append_chain() function
  perf-report: add --max-stack option to limit callchain stack scan
  perf-top: add --max-stack option to limit callchain stack scan

 tools/perf/Documentation/perf-report.txt           |    8 +++++++
 tools/perf/Documentation/perf-top.txt              |    8 +++++++
 tools/perf/builtin-report.c                        |   22 +++++++++++++++----
 tools/perf/builtin-top.c                           |    9 ++++++-
 tools/perf/util/callchain.c                        |    9 +++----
 tools/perf/util/machine.c                          |   14 ++++++++----
 tools/perf/util/machine.h                          |    3 +-
 .../perf/util/scripting-engines/trace-event-perl.c |    6 ++++-
 tools/perf/util/session.c                          |    3 +-
 tools/perf/util/top.h                              |    1 +
 10 files changed, 63 insertions(+), 20 deletions(-)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/4] perf: Fix potential compilation error with some compilers
  2013-10-18 14:38 [PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed Waiman Long
@ 2013-10-18 14:38 ` Waiman Long
  2013-10-18 14:38 ` [PATCH v2 2/4] perf: streamline append_chain() function Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2013-10-18 14:38 UTC (permalink / raw)
  To: Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton, Waiman Long

The building of the perf tool failed in a SLES11 sp3 system with the
following compilation error:

cc1: warnings being treated as errors
util/scripting-engines/trace-event-perl.c: In function
‘perl_process_tracepoint’:
util/scripting-engines/trace-event-perl.c:285: error: format ‘%lu’
expects type ‘long unsigned int’, but argument 2 has type ‘__u64’

This patch replaces PRIu64 which is "lu" by the explicit "llu" to
fix this problem as __u64 is of type "long long unsigned".

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 .../perf/util/scripting-engines/trace-event-perl.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/tools/perf/util/scripting-engines/trace-event-perl.c b/tools/perf/util/scripting-engines/trace-event-perl.c
index a85e4ae..d6eb9c5 100644
--- a/tools/perf/util/scripting-engines/trace-event-perl.c
+++ b/tools/perf/util/scripting-engines/trace-event-perl.c
@@ -281,8 +281,12 @@ static void perl_process_tracepoint(union perf_event *perf_event __maybe_unused,
 		return;
 
 	event = find_cache_event(evsel);
+	/*
+	 * attr.config is a __u64 which requires "%llu" to avoid compilation
+	 * error/warning with some compilers.
+	 */
 	if (!event)
-		die("ug! no event found for type %" PRIu64, evsel->attr.config);
+		die("ug! no event found for type %llu", evsel->attr.config);
 
 	pid = raw_field_value(event, "common_pid", data);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/4] perf: streamline append_chain() function
  2013-10-18 14:38 [PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed Waiman Long
  2013-10-18 14:38 ` [PATCH v2 1/4] perf: Fix potential compilation error with some compilers Waiman Long
@ 2013-10-18 14:38 ` Waiman Long
  2013-10-20  0:29   ` Andi Kleen
  2013-10-18 14:38 ` [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan Waiman Long
  2013-10-18 14:38 ` [PATCH v2 4/4] perf-top: add " Waiman Long
  3 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2013-10-18 14:38 UTC (permalink / raw)
  To: Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton, Waiman Long

When callgraph is enabled, the append_chain() function consumes a major
portion of the total CPU time. This patch tries to streamline the
append_chain() function by removing unneeded conditional test as well as
using ?: statement which can be more efficient than the regular if
statement in some architectures.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 tools/perf/util/callchain.c |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 482f680..1e79001 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -315,6 +315,7 @@ append_chain(struct callchain_node *root,
 	struct callchain_list *cnode;
 	u64 start = cursor->pos;
 	bool found = false;
+	bool func_mode = (callchain_param.key == CCKEY_FUNCTION);
 	u64 matches;
 
 	/*
@@ -331,17 +332,15 @@ append_chain(struct callchain_node *root,
 		if (!node)
 			break;
 
-		sym = node->sym;
+		sym = func_mode ? node->sym : NULL;
 
-		if (cnode->ms.sym && sym &&
-		    callchain_param.key == CCKEY_FUNCTION) {
+		if (cnode->ms.sym && sym) {
 			if (cnode->ms.sym->start != sym->start)
 				break;
 		} else if (cnode->ip != node->ip)
 			break;
 
-		if (!found)
-			found = true;
+		found = true;
 
 		callchain_cursor_advance(cursor);
 	}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 [PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed Waiman Long
  2013-10-18 14:38 ` [PATCH v2 1/4] perf: Fix potential compilation error with some compilers Waiman Long
  2013-10-18 14:38 ` [PATCH v2 2/4] perf: streamline append_chain() function Waiman Long
@ 2013-10-18 14:38 ` Waiman Long
  2013-10-18 17:17   ` Arnaldo Carvalho de Melo
                     ` (2 more replies)
  2013-10-18 14:38 ` [PATCH v2 4/4] perf-top: add " Waiman Long
  3 siblings, 3 replies; 14+ messages in thread
From: Waiman Long @ 2013-10-18 14:38 UTC (permalink / raw)
  To: Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton, Waiman Long

When callgraph data was included in the perf data file, it may take a
long time to scan all those data and merge them together especially
if the stored callchains are long and the perf data file itself is
large, like a Gbyte or so.

The callchain stack is currently limited to PERF_MAX_STACK_DEPTH (127).
This is a large value. Usually the callgraph data that developers are
most interested in are the first few levels, the rests are usually
not looked at.

This patch adds a new --max-stack option to perf-report to limit the
depth of callchain stack data to look at to reduce the time it takes
for perf-report to finish its processing. It trades the presence of
trailing stack information with faster speed.

The following table shows the elapsed time of doing perf-report on a
perf.data file of size 985,531,828 bytes.

--max_stack	Elapsed Time	Output data size
-----------	------------	----------------
not set		   88.0s	124,422,651
64		   87.5s	116,303,213
32		   87.2s	112,023,804
16		   86.6s	 94,326,380
8		   59.9s	 33,697,248
4		   40.7s	 10,116,637
-g none		   27.1s	  2,555,810

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 tools/perf/Documentation/perf-report.txt |    8 ++++++++
 tools/perf/builtin-report.c              |   22 +++++++++++++++++-----
 tools/perf/builtin-top.c                 |    3 ++-
 tools/perf/util/machine.c                |   14 +++++++++-----
 tools/perf/util/machine.h                |    3 ++-
 tools/perf/util/session.c                |    3 ++-
 6 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 2b8097e..be3f196 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -135,6 +135,14 @@ OPTIONS
 
 	Default: fractal,0.5,callee,function.
 
+--max-stack::
+	Set the stack depth limit when parsing the callchain, anything
+	beyond the specified depth will be ignored. This is a trade-off
+	between information loss and faster processing especially for
+	workloads that can have a very long callchain stack.
+
+	Default: 127
+
 -G::
 --inverted::
         alias for inverted caller based call graph.
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 72eae74..d0c9504 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -47,6 +47,7 @@ struct perf_report {
 	bool			show_threads;
 	bool			inverted_callchain;
 	bool			mem_mode;
+	int			max_stack;
 	struct perf_read_values	show_threads_values;
 	const char		*pretty_printing_style;
 	const char		*cpu_list;
@@ -88,7 +89,8 @@ static int perf_report__add_mem_hist_entry(struct perf_tool *tool,
 	if ((sort__has_parent || symbol_conf.use_callchain) &&
 	    sample->callchain) {
 		err = machine__resolve_callchain(machine, evsel, al->thread,
-						 sample, &parent, al);
+						 sample, &parent, al,
+						 rep->max_stack);
 		if (err)
 			return err;
 	}
@@ -179,7 +181,8 @@ static int perf_report__add_branch_hist_entry(struct perf_tool *tool,
 	if ((sort__has_parent || symbol_conf.use_callchain)
 	    && sample->callchain) {
 		err = machine__resolve_callchain(machine, evsel, al->thread,
-						 sample, &parent, al);
+						 sample, &parent, al,
+						 rep->max_stack);
 		if (err)
 			return err;
 	}
@@ -242,18 +245,21 @@ out:
 	return err;
 }
 
-static int perf_evsel__add_hist_entry(struct perf_evsel *evsel,
+static int perf_evsel__add_hist_entry(struct perf_tool *tool,
+				      struct perf_evsel *evsel,
 				      struct addr_location *al,
 				      struct perf_sample *sample,
 				      struct machine *machine)
 {
+	struct perf_report *rep = container_of(tool, struct perf_report, tool);
 	struct symbol *parent = NULL;
 	int err = 0;
 	struct hist_entry *he;
 
 	if ((sort__has_parent || symbol_conf.use_callchain) && sample->callchain) {
 		err = machine__resolve_callchain(machine, evsel, al->thread,
-						 sample, &parent, al);
+						 sample, &parent, al,
+						 rep->max_stack);
 		if (err)
 			return err;
 	}
@@ -330,7 +336,8 @@ static int process_sample_event(struct perf_tool *tool,
 		if (al.map != NULL)
 			al.map->dso->hit = 1;
 
-		ret = perf_evsel__add_hist_entry(evsel, &al, sample, machine);
+		ret = perf_evsel__add_hist_entry(tool, evsel, &al, sample,
+						 machine);
 		if (ret < 0)
 			pr_debug("problem incrementing symbol period, skipping event\n");
 	}
@@ -757,6 +764,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 			.ordered_samples = true,
 			.ordering_requires_timestamps = true,
 		},
+		.max_stack		 = PERF_MAX_STACK_DEPTH,
 		.pretty_printing_style	 = "normal",
 	};
 	const struct option options[] = {
@@ -797,6 +805,10 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 	OPT_CALLBACK_DEFAULT('g', "call-graph", &report, "output_type,min_percent[,print_limit],call_order",
 		     "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
 		     "Default: fractal,0.5,callee,function", &parse_callchain_opt, callchain_default_opt),
+	OPT_INTEGER(0, "max-stack", &report.max_stack,
+		    "Set the maximum stack depth when parsing the callchain, "
+		    "anything beyond the specified depth will be ignored. "
+		    "Default: " __stringify(PERF_MAX_STACK_DEPTH)),
 	OPT_BOOLEAN('G', "inverted", &report.inverted_callchain,
 		    "alias for inverted call graph"),
 	OPT_CALLBACK(0, "ignore-callees", NULL, "regex",
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 2122141..2725aca 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -771,7 +771,8 @@ static void perf_event__process_sample(struct perf_tool *tool,
 		    sample->callchain) {
 			err = machine__resolve_callchain(machine, evsel,
 							 al.thread, sample,
-							 &parent, &al);
+							 &parent, &al,
+							 PERF_MAX_STACK_DEPTH);
 			if (err)
 				return;
 		}
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 6188d28..9617c4a 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1267,10 +1267,12 @@ static int machine__resolve_callchain_sample(struct machine *machine,
 					     struct thread *thread,
 					     struct ip_callchain *chain,
 					     struct symbol **parent,
-					     struct addr_location *root_al)
+					     struct addr_location *root_al,
+					     int max_stack)
 {
 	u8 cpumode = PERF_RECORD_MISC_USER;
-	unsigned int i;
+	int chain_nr = min(max_stack, (int)chain->nr);
+	int i;
 	int err;
 
 	callchain_cursor_reset(&callchain_cursor);
@@ -1280,7 +1282,7 @@ static int machine__resolve_callchain_sample(struct machine *machine,
 		return 0;
 	}
 
-	for (i = 0; i < chain->nr; i++) {
+	for (i = 0; i < chain_nr; i++) {
 		u64 ip;
 		struct addr_location al;
 
@@ -1352,12 +1354,14 @@ int machine__resolve_callchain(struct machine *machine,
 			       struct thread *thread,
 			       struct perf_sample *sample,
 			       struct symbol **parent,
-			       struct addr_location *root_al)
+			       struct addr_location *root_al,
+			       int max_stack)
 {
 	int ret;
 
 	ret = machine__resolve_callchain_sample(machine, thread,
-						sample->callchain, parent, root_al);
+						sample->callchain, parent,
+						root_al, max_stack);
 	if (ret)
 		return ret;
 
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 58a6be1..d09cce0 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -91,7 +91,8 @@ int machine__resolve_callchain(struct machine *machine,
 			       struct thread *thread,
 			       struct perf_sample *sample,
 			       struct symbol **parent,
-			       struct addr_location *root_al);
+			       struct addr_location *root_al,
+			       int max_stack);
 
 /*
  * Default guest kernel is defined by parameter --guestkallsyms
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 568b750..96e5449 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1525,7 +1525,8 @@ void perf_evsel__print_ip(struct perf_evsel *evsel, union perf_event *event,
 	if (symbol_conf.use_callchain && sample->callchain) {
 
 		if (machine__resolve_callchain(machine, evsel, al.thread,
-					       sample, NULL, NULL) != 0) {
+					       sample, NULL, NULL,
+					       PERF_MAX_STACK_DEPTH) != 0) {
 			if (verbose)
 				error("Failed to resolve callchain. Skipping\n");
 			return;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 4/4] perf-top: add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 [PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed Waiman Long
                   ` (2 preceding siblings ...)
  2013-10-18 14:38 ` [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan Waiman Long
@ 2013-10-18 14:38 ` Waiman Long
  2013-10-18 17:31   ` David Ahern
                     ` (2 more replies)
  3 siblings, 3 replies; 14+ messages in thread
From: Waiman Long @ 2013-10-18 14:38 UTC (permalink / raw)
  To: Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton, Waiman Long

When the callgraph function is enabled (-G), it may take a long time to
scan all the stack data and merge them accordingly.

This patch adds a new --max-stack option to perf-top to limit the depth
of callchain stack data to look at to reduce the time it takes for
perf-top to finish its processing. It reduces the amount of information
provided to the user in exchange for faster speed.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 tools/perf/Documentation/perf-top.txt |    8 ++++++++
 tools/perf/builtin-top.c              |    8 ++++++--
 tools/perf/util/top.h                 |    1 +
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index 58d6598..3fd911c 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -155,6 +155,14 @@ Default is to monitor all CPUS.
 
 	Default: fractal,0.5,callee.
 
+--max-stack::
+	Set the stack depth limit when parsing the callchain, anything
+	beyond the specified depth will be ignored. This is a trade-off
+	between information loss and faster processing especially for
+	workloads that can have a very long callchain stack.
+
+	Default: 127
+
 --ignore-callees=<regex>::
         Ignore callees of the function(s) matching the given regex.
         This has the effect of collecting the callers of each such
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 2725aca..14902b0 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -772,7 +772,7 @@ static void perf_event__process_sample(struct perf_tool *tool,
 			err = machine__resolve_callchain(machine, evsel,
 							 al.thread, sample,
 							 &parent, &al,
-							 PERF_MAX_STACK_DEPTH);
+							 top->max_stack);
 			if (err)
 				return;
 		}
@@ -1052,10 +1052,11 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
 			.user_freq	= UINT_MAX,
 			.user_interval	= ULLONG_MAX,
 			.freq		= 4000, /* 4 KHz */
-			.target		     = {
+			.target		= {
 				.uses_mmap   = true,
 			},
 		},
+		.max_stack	     = PERF_MAX_STACK_DEPTH,
 		.sym_pcnt_filter     = 5,
 	};
 	struct perf_record_opts *opts = &top.record_opts;
@@ -1110,6 +1111,9 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
 	OPT_CALLBACK_DEFAULT('G', "call-graph", &top.record_opts,
 			     "mode[,dump_size]", record_callchain_help,
 			     &parse_callchain_opt, "fp"),
+	OPT_INTEGER(0, "max-stack", &top.max_stack,
+		    "Set the maximum stack depth when parsing the callchain. "
+		    "Default: " __stringify(PERF_MAX_STACK_DEPTH)),
 	OPT_CALLBACK(0, "ignore-callees", NULL, "regex",
 		   "ignore callees of these functions in call graphs",
 		   report_parse_ignore_callees_opt),
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index b554ffc..88cfeaf 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -24,6 +24,7 @@ struct perf_top {
 	u64		   exact_samples;
 	u64		   guest_us_samples, guest_kernel_samples;
 	int		   print_entries, count_filter, delay_secs;
+	int		   max_stack;
 	bool		   hide_kernel_symbols, hide_user_symbols, zero;
 	bool		   use_tui, use_stdio;
 	bool		   kptr_restrict_warned;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 ` [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan Waiman Long
@ 2013-10-18 17:17   ` Arnaldo Carvalho de Melo
  2013-10-21 14:51     ` Waiman Long
  2013-10-18 17:30   ` David Ahern
  2013-10-23  7:55   ` [tip:perf/core] perf report: Add " tip-bot for Waiman Long
  2 siblings, 1 reply; 14+ messages in thread
From: Arnaldo Carvalho de Melo @ 2013-10-18 17:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Paul Mackerras, Namhyung Kim,
	Jiri Olsa, Adrian Hunter, David Ahern, Stephane Eranian,
	linux-kernel, Aswin Chandramouleeswaran, Scott J Norton

Em Fri, Oct 18, 2013 at 10:38:48AM -0400, Waiman Long escreveu:
> When callgraph data was included in the perf data file, it may take a
> long time to scan all those data and merge them together especially
> if the stored callchains are long and the perf data file itself is
> large, like a Gbyte or so.
> 
> The callchain stack is currently limited to PERF_MAX_STACK_DEPTH (127).
> This is a large value. Usually the callgraph data that developers are
> most interested in are the first few levels, the rests are usually
> not looked at.
> 
> This patch adds a new --max-stack option to perf-report to limit the
> depth of callchain stack data to look at to reduce the time it takes
> for perf-report to finish its processing. It trades the presence of
> trailing stack information with faster speed.
> 
> The following table shows the elapsed time of doing perf-report on a
> perf.data file of size 985,531,828 bytes.
> 
> --max_stack	Elapsed Time	Output data size
> -----------	------------	----------------

Please prefix lines like this (------) with a space, otherwise 'git am'
will chop off everything from that line onwards. Fixing it up now.

- Arnaldo

> not set		   88.0s	124,422,651
> 64		   87.5s	116,303,213
> 32		   87.2s	112,023,804
> 16		   86.6s	 94,326,380
> 8		   59.9s	 33,697,248
> 4		   40.7s	 10,116,637
> -g none		   27.1s	  2,555,810
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>  tools/perf/Documentation/perf-report.txt |    8 ++++++++
>  tools/perf/builtin-report.c              |   22 +++++++++++++++++-----
>  tools/perf/builtin-top.c                 |    3 ++-
>  tools/perf/util/machine.c                |   14 +++++++++-----
>  tools/perf/util/machine.h                |    3 ++-
>  tools/perf/util/session.c                |    3 ++-
>  6 files changed, 40 insertions(+), 13 deletions(-)
> 
> diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> index 2b8097e..be3f196 100644
> --- a/tools/perf/Documentation/perf-report.txt
> +++ b/tools/perf/Documentation/perf-report.txt
> @@ -135,6 +135,14 @@ OPTIONS
>  
>  	Default: fractal,0.5,callee,function.
>  
> +--max-stack::
> +	Set the stack depth limit when parsing the callchain, anything
> +	beyond the specified depth will be ignored. This is a trade-off
> +	between information loss and faster processing especially for
> +	workloads that can have a very long callchain stack.
> +
> +	Default: 127
> +
>  -G::
>  --inverted::
>          alias for inverted caller based call graph.
> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> index 72eae74..d0c9504 100644
> --- a/tools/perf/builtin-report.c
> +++ b/tools/perf/builtin-report.c
> @@ -47,6 +47,7 @@ struct perf_report {
>  	bool			show_threads;
>  	bool			inverted_callchain;
>  	bool			mem_mode;
> +	int			max_stack;
>  	struct perf_read_values	show_threads_values;
>  	const char		*pretty_printing_style;
>  	const char		*cpu_list;
> @@ -88,7 +89,8 @@ static int perf_report__add_mem_hist_entry(struct perf_tool *tool,
>  	if ((sort__has_parent || symbol_conf.use_callchain) &&
>  	    sample->callchain) {
>  		err = machine__resolve_callchain(machine, evsel, al->thread,
> -						 sample, &parent, al);
> +						 sample, &parent, al,
> +						 rep->max_stack);
>  		if (err)
>  			return err;
>  	}
> @@ -179,7 +181,8 @@ static int perf_report__add_branch_hist_entry(struct perf_tool *tool,
>  	if ((sort__has_parent || symbol_conf.use_callchain)
>  	    && sample->callchain) {
>  		err = machine__resolve_callchain(machine, evsel, al->thread,
> -						 sample, &parent, al);
> +						 sample, &parent, al,
> +						 rep->max_stack);
>  		if (err)
>  			return err;
>  	}
> @@ -242,18 +245,21 @@ out:
>  	return err;
>  }
>  
> -static int perf_evsel__add_hist_entry(struct perf_evsel *evsel,
> +static int perf_evsel__add_hist_entry(struct perf_tool *tool,
> +				      struct perf_evsel *evsel,
>  				      struct addr_location *al,
>  				      struct perf_sample *sample,
>  				      struct machine *machine)
>  {
> +	struct perf_report *rep = container_of(tool, struct perf_report, tool);
>  	struct symbol *parent = NULL;
>  	int err = 0;
>  	struct hist_entry *he;
>  
>  	if ((sort__has_parent || symbol_conf.use_callchain) && sample->callchain) {
>  		err = machine__resolve_callchain(machine, evsel, al->thread,
> -						 sample, &parent, al);
> +						 sample, &parent, al,
> +						 rep->max_stack);
>  		if (err)
>  			return err;
>  	}
> @@ -330,7 +336,8 @@ static int process_sample_event(struct perf_tool *tool,
>  		if (al.map != NULL)
>  			al.map->dso->hit = 1;
>  
> -		ret = perf_evsel__add_hist_entry(evsel, &al, sample, machine);
> +		ret = perf_evsel__add_hist_entry(tool, evsel, &al, sample,
> +						 machine);
>  		if (ret < 0)
>  			pr_debug("problem incrementing symbol period, skipping event\n");
>  	}
> @@ -757,6 +764,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
>  			.ordered_samples = true,
>  			.ordering_requires_timestamps = true,
>  		},
> +		.max_stack		 = PERF_MAX_STACK_DEPTH,
>  		.pretty_printing_style	 = "normal",
>  	};
>  	const struct option options[] = {
> @@ -797,6 +805,10 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
>  	OPT_CALLBACK_DEFAULT('g', "call-graph", &report, "output_type,min_percent[,print_limit],call_order",
>  		     "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
>  		     "Default: fractal,0.5,callee,function", &parse_callchain_opt, callchain_default_opt),
> +	OPT_INTEGER(0, "max-stack", &report.max_stack,
> +		    "Set the maximum stack depth when parsing the callchain, "
> +		    "anything beyond the specified depth will be ignored. "
> +		    "Default: " __stringify(PERF_MAX_STACK_DEPTH)),
>  	OPT_BOOLEAN('G', "inverted", &report.inverted_callchain,
>  		    "alias for inverted call graph"),
>  	OPT_CALLBACK(0, "ignore-callees", NULL, "regex",
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index 2122141..2725aca 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -771,7 +771,8 @@ static void perf_event__process_sample(struct perf_tool *tool,
>  		    sample->callchain) {
>  			err = machine__resolve_callchain(machine, evsel,
>  							 al.thread, sample,
> -							 &parent, &al);
> +							 &parent, &al,
> +							 PERF_MAX_STACK_DEPTH);
>  			if (err)
>  				return;
>  		}
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index 6188d28..9617c4a 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -1267,10 +1267,12 @@ static int machine__resolve_callchain_sample(struct machine *machine,
>  					     struct thread *thread,
>  					     struct ip_callchain *chain,
>  					     struct symbol **parent,
> -					     struct addr_location *root_al)
> +					     struct addr_location *root_al,
> +					     int max_stack)
>  {
>  	u8 cpumode = PERF_RECORD_MISC_USER;
> -	unsigned int i;
> +	int chain_nr = min(max_stack, (int)chain->nr);
> +	int i;
>  	int err;
>  
>  	callchain_cursor_reset(&callchain_cursor);
> @@ -1280,7 +1282,7 @@ static int machine__resolve_callchain_sample(struct machine *machine,
>  		return 0;
>  	}
>  
> -	for (i = 0; i < chain->nr; i++) {
> +	for (i = 0; i < chain_nr; i++) {
>  		u64 ip;
>  		struct addr_location al;
>  
> @@ -1352,12 +1354,14 @@ int machine__resolve_callchain(struct machine *machine,
>  			       struct thread *thread,
>  			       struct perf_sample *sample,
>  			       struct symbol **parent,
> -			       struct addr_location *root_al)
> +			       struct addr_location *root_al,
> +			       int max_stack)
>  {
>  	int ret;
>  
>  	ret = machine__resolve_callchain_sample(machine, thread,
> -						sample->callchain, parent, root_al);
> +						sample->callchain, parent,
> +						root_al, max_stack);
>  	if (ret)
>  		return ret;
>  
> diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
> index 58a6be1..d09cce0 100644
> --- a/tools/perf/util/machine.h
> +++ b/tools/perf/util/machine.h
> @@ -91,7 +91,8 @@ int machine__resolve_callchain(struct machine *machine,
>  			       struct thread *thread,
>  			       struct perf_sample *sample,
>  			       struct symbol **parent,
> -			       struct addr_location *root_al);
> +			       struct addr_location *root_al,
> +			       int max_stack);
>  
>  /*
>   * Default guest kernel is defined by parameter --guestkallsyms
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 568b750..96e5449 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -1525,7 +1525,8 @@ void perf_evsel__print_ip(struct perf_evsel *evsel, union perf_event *event,
>  	if (symbol_conf.use_callchain && sample->callchain) {
>  
>  		if (machine__resolve_callchain(machine, evsel, al.thread,
> -					       sample, NULL, NULL) != 0) {
> +					       sample, NULL, NULL,
> +					       PERF_MAX_STACK_DEPTH) != 0) {
>  			if (verbose)
>  				error("Failed to resolve callchain. Skipping\n");
>  			return;
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 ` [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan Waiman Long
  2013-10-18 17:17   ` Arnaldo Carvalho de Melo
@ 2013-10-18 17:30   ` David Ahern
  2013-10-23  7:55   ` [tip:perf/core] perf report: Add " tip-bot for Waiman Long
  2 siblings, 0 replies; 14+ messages in thread
From: David Ahern @ 2013-10-18 17:30 UTC (permalink / raw)
  To: Waiman Long, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton

On 10/18/13 8:38 AM, Waiman Long wrote:
> When callgraph data was included in the perf data file, it may take a
> long time to scan all those data and merge them together especially
> if the stored callchains are long and the perf data file itself is
> large, like a Gbyte or so.
>
> The callchain stack is currently limited to PERF_MAX_STACK_DEPTH (127).
> This is a large value. Usually the callgraph data that developers are
> most interested in are the first few levels, the rests are usually
> not looked at.
>
> This patch adds a new --max-stack option to perf-report to limit the
> depth of callchain stack data to look at to reduce the time it takes
> for perf-report to finish its processing. It trades the presence of
> trailing stack information with faster speed.
>
> The following table shows the elapsed time of doing perf-report on a
> perf.data file of size 985,531,828 bytes.
>
> --max_stack	Elapsed Time	Output data size
> -----------	------------	----------------
> not set		   88.0s	124,422,651
> 64		   87.5s	116,303,213
> 32		   87.2s	112,023,804
> 16		   86.6s	 94,326,380
> 8		   59.9s	 33,697,248
> 4		   40.7s	 10,116,637
> -g none		   27.1s	  2,555,810
>
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>   tools/perf/Documentation/perf-report.txt |    8 ++++++++
>   tools/perf/builtin-report.c              |   22 +++++++++++++++++-----
>   tools/perf/builtin-top.c                 |    3 ++-
>   tools/perf/util/machine.c                |   14 +++++++++-----
>   tools/perf/util/machine.h                |    3 ++-
>   tools/perf/util/session.c                |    3 ++-
>   6 files changed, 40 insertions(+), 13 deletions(-)
>

Looks good to me. Acked-by: David Ahern <dsahern@gmail.com>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] perf-top: add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 ` [PATCH v2 4/4] perf-top: add " Waiman Long
@ 2013-10-18 17:31   ` David Ahern
  2013-10-20 22:35   ` Davidlohr Bueso
  2013-10-23  7:55   ` [tip:perf/core] perf top: Add " tip-bot for Waiman Long
  2 siblings, 0 replies; 14+ messages in thread
From: David Ahern @ 2013-10-18 17:31 UTC (permalink / raw)
  To: Waiman Long, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton

On 10/18/13 8:38 AM, Waiman Long wrote:
> When the callgraph function is enabled (-G), it may take a long time to
> scan all the stack data and merge them accordingly.
>
> This patch adds a new --max-stack option to perf-top to limit the depth
> of callchain stack data to look at to reduce the time it takes for
> perf-top to finish its processing. It reduces the amount of information
> provided to the user in exchange for faster speed.
>
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>   tools/perf/Documentation/perf-top.txt |    8 ++++++++
>   tools/perf/builtin-top.c              |    8 ++++++--
>   tools/perf/util/top.h                 |    1 +
>   3 files changed, 15 insertions(+), 2 deletions(-)
>

Looks good to me. Acked-by: David Ahern <dsahern@gmail.com>



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/4] perf: streamline append_chain() function
  2013-10-18 14:38 ` [PATCH v2 2/4] perf: streamline append_chain() function Waiman Long
@ 2013-10-20  0:29   ` Andi Kleen
  2013-10-21 14:50     ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2013-10-20  0:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Paul Mackerras, Namhyung Kim, Jiri Olsa, Adrian Hunter,
	David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton

Waiman Long <Waiman.Long@hp.com> writes:

> as well as
> using ?: statement which can be more efficient than the regular if
> statement in some architectures.

I don't think that's true, the compiler does if conversion anyways for both.

But change seems reasonable.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] perf-top: add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 ` [PATCH v2 4/4] perf-top: add " Waiman Long
  2013-10-18 17:31   ` David Ahern
@ 2013-10-20 22:35   ` Davidlohr Bueso
  2013-10-23  7:55   ` [tip:perf/core] perf top: Add " tip-bot for Waiman Long
  2 siblings, 0 replies; 14+ messages in thread
From: Davidlohr Bueso @ 2013-10-20 22:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Paul Mackerras, Namhyung Kim, Jiri Olsa, Adrian Hunter,
	David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton

On Fri, 2013-10-18 at 10:38 -0400, Waiman Long wrote:
> When the callgraph function is enabled (-G), it may take a long time to
> scan all the stack data and merge them accordingly.
> 
> This patch adds a new --max-stack option to perf-top to limit the depth
> of callchain stack data to look at to reduce the time it takes for
> perf-top to finish its processing. It reduces the amount of information
> provided to the user in exchange for faster speed.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>

Tested-by: Davidlohr Bueso <davidlohr@hp.com>

> ---
>  tools/perf/Documentation/perf-top.txt |    8 ++++++++
>  tools/perf/builtin-top.c              |    8 ++++++--
>  tools/perf/util/top.h                 |    1 +
>  3 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
> index 58d6598..3fd911c 100644
> --- a/tools/perf/Documentation/perf-top.txt
> +++ b/tools/perf/Documentation/perf-top.txt
> @@ -155,6 +155,14 @@ Default is to monitor all CPUS.
>  
>  	Default: fractal,0.5,callee.
>  
> +--max-stack::
> +	Set the stack depth limit when parsing the callchain, anything
> +	beyond the specified depth will be ignored. This is a trade-off
> +	between information loss and faster processing especially for
> +	workloads that can have a very long callchain stack.
> +
> +	Default: 127
> +
>  --ignore-callees=<regex>::
>          Ignore callees of the function(s) matching the given regex.
>          This has the effect of collecting the callers of each such
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index 2725aca..14902b0 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -772,7 +772,7 @@ static void perf_event__process_sample(struct perf_tool *tool,
>  			err = machine__resolve_callchain(machine, evsel,
>  							 al.thread, sample,
>  							 &parent, &al,
> -							 PERF_MAX_STACK_DEPTH);
> +							 top->max_stack);
>  			if (err)
>  				return;
>  		}
> @@ -1052,10 +1052,11 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
>  			.user_freq	= UINT_MAX,
>  			.user_interval	= ULLONG_MAX,
>  			.freq		= 4000, /* 4 KHz */
> -			.target		     = {
> +			.target		= {
>  				.uses_mmap   = true,
>  			},
>  		},
> +		.max_stack	     = PERF_MAX_STACK_DEPTH,
>  		.sym_pcnt_filter     = 5,
>  	};
>  	struct perf_record_opts *opts = &top.record_opts;
> @@ -1110,6 +1111,9 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
>  	OPT_CALLBACK_DEFAULT('G', "call-graph", &top.record_opts,
>  			     "mode[,dump_size]", record_callchain_help,
>  			     &parse_callchain_opt, "fp"),
> +	OPT_INTEGER(0, "max-stack", &top.max_stack,
> +		    "Set the maximum stack depth when parsing the callchain. "
> +		    "Default: " __stringify(PERF_MAX_STACK_DEPTH)),
>  	OPT_CALLBACK(0, "ignore-callees", NULL, "regex",
>  		   "ignore callees of these functions in call graphs",
>  		   report_parse_ignore_callees_opt),
> diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
> index b554ffc..88cfeaf 100644
> --- a/tools/perf/util/top.h
> +++ b/tools/perf/util/top.h
> @@ -24,6 +24,7 @@ struct perf_top {
>  	u64		   exact_samples;
>  	u64		   guest_us_samples, guest_kernel_samples;
>  	int		   print_entries, count_filter, delay_secs;
> +	int		   max_stack;
>  	bool		   hide_kernel_symbols, hide_user_symbols, zero;
>  	bool		   use_tui, use_stdio;
>  	bool		   kptr_restrict_warned;



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/4] perf: streamline append_chain() function
  2013-10-20  0:29   ` Andi Kleen
@ 2013-10-21 14:50     ` Waiman Long
  0 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2013-10-21 14:50 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Paul Mackerras, Namhyung Kim, Jiri Olsa, Adrian Hunter,
	David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton

On 10/19/2013 08:29 PM, Andi Kleen wrote:
> Waiman Long<Waiman.Long@hp.com>  writes:
>
>> as well as
>> using ?: statement which can be more efficient than the regular if
>> statement in some architectures.
> I don't think that's true, the compiler does if conversion anyways for both.
>
> But change seems reasonable.
>
> -Andi
>
>

That may be true for a simple if statement. However, the condition was 
checked as the last of 3 tests. I doubt if the compiler is able to 
optimize that effectively.

-Longman

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan
  2013-10-18 17:17   ` Arnaldo Carvalho de Melo
@ 2013-10-21 14:51     ` Waiman Long
  0 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2013-10-21 14:51 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ingo Molnar, Peter Zijlstra, Paul Mackerras, Namhyung Kim,
	Jiri Olsa, Adrian Hunter, David Ahern, Stephane Eranian,
	linux-kernel, Aswin Chandramouleeswaran, Scott J Norton

On 10/18/2013 01:17 PM, Arnaldo Carvalho de Melo wrote:
> Em Fri, Oct 18, 2013 at 10:38:48AM -0400, Waiman Long escreveu:
>> When callgraph data was included in the perf data file, it may take a
>> long time to scan all those data and merge them together especially
>> if the stored callchains are long and the perf data file itself is
>> large, like a Gbyte or so.
>>
>> The callchain stack is currently limited to PERF_MAX_STACK_DEPTH (127).
>> This is a large value. Usually the callgraph data that developers are
>> most interested in are the first few levels, the rests are usually
>> not looked at.
>>
>> This patch adds a new --max-stack option to perf-report to limit the
>> depth of callchain stack data to look at to reduce the time it takes
>> for perf-report to finish its processing. It trades the presence of
>> trailing stack information with faster speed.
>>
>> The following table shows the elapsed time of doing perf-report on a
>> perf.data file of size 985,531,828 bytes.
>>
>> --max_stack	Elapsed Time	Output data size
>> -----------	------------	----------------
> Please prefix lines like this (------) with a space, otherwise 'git am'
> will chop off everything from that line onwards. Fixing it up now.
>
> - Arnaldo
>
>

Thank for spotting the problem, will fix that in the next version.

-Longman

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tip:perf/core] perf report: Add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 ` [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan Waiman Long
  2013-10-18 17:17   ` Arnaldo Carvalho de Melo
  2013-10-18 17:30   ` David Ahern
@ 2013-10-23  7:55   ` tip-bot for Waiman Long
  2 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Waiman Long @ 2013-10-23  7:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: acme, eranian, mingo, mingo, a.p.zijlstra, jolsa, Waiman.Long,
	dsahern, tglx, scott.norton, hpa, paulus, linux-kernel, namhyung,
	adrian.hunter, aswin

Commit-ID:  91e95617429cb272fd908b1928a1915b37b9655f
Gitweb:     http://git.kernel.org/tip/91e95617429cb272fd908b1928a1915b37b9655f
Author:     Waiman Long <Waiman.Long@hp.com>
AuthorDate: Fri, 18 Oct 2013 10:38:48 -0400
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 21 Oct 2013 17:36:25 -0300

perf report: Add --max-stack option to limit callchain stack scan

When callgraph data was included in the perf data file, it may take a
long time to scan all those data and merge them together especially if
the stored callchains are long and the perf data file itself is large,
like a Gbyte or so.

The callchain stack is currently limited to PERF_MAX_STACK_DEPTH (127).
This is a large value. Usually the callgraph data that developers are
most interested in are the first few levels, the rests are usually not
looked at.

This patch adds a new --max-stack option to perf-report to limit the
depth of callchain stack data to look at to reduce the time it takes for
perf-report to finish its processing. It trades the presence of trailing
stack information with faster speed.

The following table shows the elapsed time of doing perf-report on a
perf.data file of size 985,531,828 bytes.

  --max_stack   Elapsed Time    Output data size
  -----------   ------------    ----------------
  not set        88.0s          124,422,651
  64             87.5s          116,303,213
  32             87.2s          112,023,804
  16             86.6s           94,326,380
  8              59.9s           33,697,248
  4              40.7s           10,116,637
  -g none        27.1s            2,555,810

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: David Ahern <dsahern@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1382107129-2010-4-git-send-email-Waiman.Long@hp.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-report.txt |  8 ++++++++
 tools/perf/builtin-report.c              | 22 +++++++++++++++++-----
 tools/perf/builtin-top.c                 |  3 ++-
 tools/perf/util/machine.c                | 14 +++++++++-----
 tools/perf/util/machine.h                |  3 ++-
 tools/perf/util/session.c                |  3 ++-
 6 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index be5ad87..10a2798 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -141,6 +141,14 @@ OPTIONS
 
 	Default: fractal,0.5,callee,function.
 
+--max-stack::
+	Set the stack depth limit when parsing the callchain, anything
+	beyond the specified depth will be ignored. This is a trade-off
+	between information loss and faster processing especially for
+	workloads that can have a very long callchain stack.
+
+	Default: 127
+
 -G::
 --inverted::
         alias for inverted caller based call graph.
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index fa68a36..81addca 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -49,6 +49,7 @@ struct perf_report {
 	bool			show_threads;
 	bool			inverted_callchain;
 	bool			mem_mode;
+	int			max_stack;
 	struct perf_read_values	show_threads_values;
 	const char		*pretty_printing_style;
 	const char		*cpu_list;
@@ -90,7 +91,8 @@ static int perf_report__add_mem_hist_entry(struct perf_tool *tool,
 	if ((sort__has_parent || symbol_conf.use_callchain) &&
 	    sample->callchain) {
 		err = machine__resolve_callchain(machine, evsel, al->thread,
-						 sample, &parent, al);
+						 sample, &parent, al,
+						 rep->max_stack);
 		if (err)
 			return err;
 	}
@@ -181,7 +183,8 @@ static int perf_report__add_branch_hist_entry(struct perf_tool *tool,
 	if ((sort__has_parent || symbol_conf.use_callchain)
 	    && sample->callchain) {
 		err = machine__resolve_callchain(machine, evsel, al->thread,
-						 sample, &parent, al);
+						 sample, &parent, al,
+						 rep->max_stack);
 		if (err)
 			return err;
 	}
@@ -244,18 +247,21 @@ out:
 	return err;
 }
 
-static int perf_evsel__add_hist_entry(struct perf_evsel *evsel,
+static int perf_evsel__add_hist_entry(struct perf_tool *tool,
+				      struct perf_evsel *evsel,
 				      struct addr_location *al,
 				      struct perf_sample *sample,
 				      struct machine *machine)
 {
+	struct perf_report *rep = container_of(tool, struct perf_report, tool);
 	struct symbol *parent = NULL;
 	int err = 0;
 	struct hist_entry *he;
 
 	if ((sort__has_parent || symbol_conf.use_callchain) && sample->callchain) {
 		err = machine__resolve_callchain(machine, evsel, al->thread,
-						 sample, &parent, al);
+						 sample, &parent, al,
+						 rep->max_stack);
 		if (err)
 			return err;
 	}
@@ -332,7 +338,8 @@ static int process_sample_event(struct perf_tool *tool,
 		if (al.map != NULL)
 			al.map->dso->hit = 1;
 
-		ret = perf_evsel__add_hist_entry(evsel, &al, sample, machine);
+		ret = perf_evsel__add_hist_entry(tool, evsel, &al, sample,
+						 machine);
 		if (ret < 0)
 			pr_debug("problem incrementing symbol period, skipping event\n");
 	}
@@ -772,6 +779,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 			.ordered_samples = true,
 			.ordering_requires_timestamps = true,
 		},
+		.max_stack		 = PERF_MAX_STACK_DEPTH,
 		.pretty_printing_style	 = "normal",
 	};
 	const struct option options[] = {
@@ -812,6 +820,10 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 	OPT_CALLBACK_DEFAULT('g', "call-graph", &report, "output_type,min_percent[,print_limit],call_order",
 		     "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
 		     "Default: fractal,0.5,callee,function", &parse_callchain_opt, callchain_default_opt),
+	OPT_INTEGER(0, "max-stack", &report.max_stack,
+		    "Set the maximum stack depth when parsing the callchain, "
+		    "anything beyond the specified depth will be ignored. "
+		    "Default: " __stringify(PERF_MAX_STACK_DEPTH)),
 	OPT_BOOLEAN('G', "inverted", &report.inverted_callchain,
 		    "alias for inverted call graph"),
 	OPT_CALLBACK(0, "ignore-callees", NULL, "regex",
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index d934f70..112cb7d 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -770,7 +770,8 @@ static void perf_event__process_sample(struct perf_tool *tool,
 		    sample->callchain) {
 			err = machine__resolve_callchain(machine, evsel,
 							 al.thread, sample,
-							 &parent, &al);
+							 &parent, &al,
+							 PERF_MAX_STACK_DEPTH);
 			if (err)
 				return;
 		}
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 6b861ae..ea93425 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1253,10 +1253,12 @@ static int machine__resolve_callchain_sample(struct machine *machine,
 					     struct thread *thread,
 					     struct ip_callchain *chain,
 					     struct symbol **parent,
-					     struct addr_location *root_al)
+					     struct addr_location *root_al,
+					     int max_stack)
 {
 	u8 cpumode = PERF_RECORD_MISC_USER;
-	unsigned int i;
+	int chain_nr = min(max_stack, (int)chain->nr);
+	int i;
 	int err;
 
 	callchain_cursor_reset(&callchain_cursor);
@@ -1266,7 +1268,7 @@ static int machine__resolve_callchain_sample(struct machine *machine,
 		return 0;
 	}
 
-	for (i = 0; i < chain->nr; i++) {
+	for (i = 0; i < chain_nr; i++) {
 		u64 ip;
 		struct addr_location al;
 
@@ -1338,12 +1340,14 @@ int machine__resolve_callchain(struct machine *machine,
 			       struct thread *thread,
 			       struct perf_sample *sample,
 			       struct symbol **parent,
-			       struct addr_location *root_al)
+			       struct addr_location *root_al,
+			       int max_stack)
 {
 	int ret;
 
 	ret = machine__resolve_callchain_sample(machine, thread,
-						sample->callchain, parent, root_al);
+						sample->callchain, parent,
+						root_al, max_stack);
 	if (ret)
 		return ret;
 
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index d44c09b..4c1f5d5 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -92,7 +92,8 @@ int machine__resolve_callchain(struct machine *machine,
 			       struct thread *thread,
 			       struct perf_sample *sample,
 			       struct symbol **parent,
-			       struct addr_location *root_al);
+			       struct addr_location *root_al,
+			       int max_stack);
 
 /*
  * Default guest kernel is defined by parameter --guestkallsyms
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 19fc716..854c5aa 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1512,7 +1512,8 @@ void perf_evsel__print_ip(struct perf_evsel *evsel, union perf_event *event,
 	if (symbol_conf.use_callchain && sample->callchain) {
 
 		if (machine__resolve_callchain(machine, evsel, al.thread,
-					       sample, NULL, NULL) != 0) {
+					       sample, NULL, NULL,
+					       PERF_MAX_STACK_DEPTH) != 0) {
 			if (verbose)
 				error("Failed to resolve callchain. Skipping\n");
 			return;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [tip:perf/core] perf top: Add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 ` [PATCH v2 4/4] perf-top: add " Waiman Long
  2013-10-18 17:31   ` David Ahern
  2013-10-20 22:35   ` Davidlohr Bueso
@ 2013-10-23  7:55   ` tip-bot for Waiman Long
  2 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Waiman Long @ 2013-10-23  7:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: acme, eranian, mingo, mingo, a.p.zijlstra, jolsa, Waiman.Long,
	dsahern, tglx, scott.norton, davidlohr, hpa, paulus,
	linux-kernel, namhyung, adrian.hunter, aswin

Commit-ID:  5dbb6e81d85e55ee2b4cf523c1738e16f63e5400
Gitweb:     http://git.kernel.org/tip/5dbb6e81d85e55ee2b4cf523c1738e16f63e5400
Author:     Waiman Long <Waiman.Long@hp.com>
AuthorDate: Fri, 18 Oct 2013 10:38:49 -0400
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 21 Oct 2013 17:36:25 -0300

perf top: Add --max-stack option to limit callchain stack scan

When the callgraph function is enabled (-G), it may take a long time to
scan all the stack data and merge them accordingly.

This patch adds a new --max-stack option to perf-top to limit the depth
of callchain stack data to look at to reduce the time it takes for
perf-top to finish its processing. It reduces the amount of information
provided to the user in exchange for faster speed.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: David Ahern <dsahern@gmail.com>
Tested-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1382107129-2010-5-git-send-email-Waiman.Long@hp.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-top.txt | 8 ++++++++
 tools/perf/builtin-top.c              | 8 ++++++--
 tools/perf/util/top.h                 | 1 +
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index f65777c..c16a09e 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -158,6 +158,14 @@ Default is to monitor all CPUS.
 
 	Default: fractal,0.5,callee.
 
+--max-stack::
+	Set the stack depth limit when parsing the callchain, anything
+	beyond the specified depth will be ignored. This is a trade-off
+	between information loss and faster processing especially for
+	workloads that can have a very long callchain stack.
+
+	Default: 127
+
 --ignore-callees=<regex>::
         Ignore callees of the function(s) matching the given regex.
         This has the effect of collecting the callers of each such
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 112cb7d..386d833 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -771,7 +771,7 @@ static void perf_event__process_sample(struct perf_tool *tool,
 			err = machine__resolve_callchain(machine, evsel,
 							 al.thread, sample,
 							 &parent, &al,
-							 PERF_MAX_STACK_DEPTH);
+							 top->max_stack);
 			if (err)
 				return;
 		}
@@ -1048,10 +1048,11 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
 			.user_freq	= UINT_MAX,
 			.user_interval	= ULLONG_MAX,
 			.freq		= 4000, /* 4 KHz */
-			.target		     = {
+			.target		= {
 				.uses_mmap   = true,
 			},
 		},
+		.max_stack	     = PERF_MAX_STACK_DEPTH,
 		.sym_pcnt_filter     = 5,
 	};
 	struct perf_record_opts *opts = &top.record_opts;
@@ -1110,6 +1111,9 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
 	OPT_CALLBACK_DEFAULT('G', "call-graph", &top.record_opts,
 			     "mode[,dump_size]", record_callchain_help,
 			     &parse_callchain_opt, "fp"),
+	OPT_INTEGER(0, "max-stack", &top.max_stack,
+		    "Set the maximum stack depth when parsing the callchain. "
+		    "Default: " __stringify(PERF_MAX_STACK_DEPTH)),
 	OPT_CALLBACK(0, "ignore-callees", NULL, "regex",
 		   "ignore callees of these functions in call graphs",
 		   report_parse_ignore_callees_opt),
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index b554ffc..88cfeaf 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -24,6 +24,7 @@ struct perf_top {
 	u64		   exact_samples;
 	u64		   guest_us_samples, guest_kernel_samples;
 	int		   print_entries, count_filter, delay_secs;
+	int		   max_stack;
 	bool		   hide_kernel_symbols, hide_user_symbols, zero;
 	bool		   use_tui, use_stdio;
 	bool		   kptr_restrict_warned;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2013-10-23  7:56 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-18 14:38 [PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed Waiman Long
2013-10-18 14:38 ` [PATCH v2 1/4] perf: Fix potential compilation error with some compilers Waiman Long
2013-10-18 14:38 ` [PATCH v2 2/4] perf: streamline append_chain() function Waiman Long
2013-10-20  0:29   ` Andi Kleen
2013-10-21 14:50     ` Waiman Long
2013-10-18 14:38 ` [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan Waiman Long
2013-10-18 17:17   ` Arnaldo Carvalho de Melo
2013-10-21 14:51     ` Waiman Long
2013-10-18 17:30   ` David Ahern
2013-10-23  7:55   ` [tip:perf/core] perf report: Add " tip-bot for Waiman Long
2013-10-18 14:38 ` [PATCH v2 4/4] perf-top: add " Waiman Long
2013-10-18 17:31   ` David Ahern
2013-10-20 22:35   ` Davidlohr Bueso
2013-10-23  7:55   ` [tip:perf/core] perf top: Add " tip-bot for Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).