[PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed

* [PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed
@ 2013-10-18 14:38 Waiman Long
  2013-10-18 14:38 ` [PATCH v2 1/4] perf: Fix potential compilation error with some compilers Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Waiman Long @ 2013-10-18 14:38 UTC
  To: Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton, Waiman Long

v1->v2:
 - Include a compilation fix patch and a code streamlining patch 
   into the patch set.
 - Use the __stringify() macro in stringify.h instead of adding a
   duplicate macro.
 - Add the --max-stack option to perf-top as well.
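
As an illustration of the __stringify() item above, here is a minimal
sketch of how a two-level stringification macro lets a numeric constant
be pasted into a help string at compile time. The names STRINGIFY,
STRINGIFY_1 and MAX_STACK_DEPTH are local stand-ins chosen for this
example, not the definitions from stringify.h.

```c
/*
 * Two-level expansion: STRINGIFY() expands its argument first, so
 * STRINGIFY_1() sees the value (127) rather than the macro name.
 * These are illustrative stand-ins, not the kernel's macros.
 */
#define STRINGIFY_1(x)	#x
#define STRINGIFY(x)	STRINGIFY_1(x)

#define MAX_STACK_DEPTH	127	/* stand-in for PERF_MAX_STACK_DEPTH */

/* Adjacent string literals are concatenated by the compiler. */
static const char max_stack_help[] =
	"Set the maximum stack depth. Default: " STRINGIFY(MAX_STACK_DEPTH);
```

Using the single-level macro directly would produce the text
"MAX_STACK_DEPTH" instead of "127", which is why the extra level of
expansion is needed.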
 
This perf patch set contains the following changes:

Patch 1 - Fix a perf tool compilation error that occurs on a SLES 11
          SP3 system.
Patch 2 - Streamline the append_chain() function to make it run a bit
          faster.
Patch 3 - Add a --max-stack option to perf-report to speed up its
          processing at the expense of providing less backtrace
          information.
Patch 4 - Add a similar --max-stack option to perf-top.

Waiman Long (4):
  perf: Fix potential compilation error with some compilers
  perf: streamline append_chain() function
  perf-report: add --max-stack option to limit callchain stack scan
  perf-top: add --max-stack option to limit callchain stack scan

 tools/perf/Documentation/perf-report.txt           |    8 +++++++
 tools/perf/Documentation/perf-top.txt              |    8 +++++++
 tools/perf/builtin-report.c                        |   22 +++++++++++++++----
 tools/perf/builtin-top.c                           |    9 ++++++-
 tools/perf/util/callchain.c                        |    9 +++----
 tools/perf/util/machine.c                          |   14 ++++++++----
 tools/perf/util/machine.h                          |    3 +-
 .../perf/util/scripting-engines/trace-event-perl.c |    6 ++++-
 tools/perf/util/session.c                          |    3 +-
 tools/perf/util/top.h                              |    1 +
 10 files changed, 63 insertions(+), 20 deletions(-)



* [PATCH v2 1/4] perf: Fix potential compilation error with some compilers
  2013-10-18 14:38 [PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed Waiman Long
@ 2013-10-18 14:38 ` Waiman Long
  2013-10-18 14:38 ` [PATCH v2 2/4] perf: streamline append_chain() function Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2013-10-18 14:38 UTC
  To: Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton, Waiman Long

Building the perf tool failed on a SLES 11 SP3 system with the
following compilation error:

cc1: warnings being treated as errors
util/scripting-engines/trace-event-perl.c: In function
‘perl_process_tracepoint’:
util/scripting-engines/trace-event-perl.c:285: error: format ‘%lu’
expects type ‘long unsigned int’, but argument 2 has type ‘__u64’

This patch replaces PRIu64, which expands to "lu" on this system, with
an explicit "llu" to fix the problem, as __u64 is of type
"unsigned long long".
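
For comparison, one commonly used portable way to print a 64-bit value
is to pair "%llu" with an explicit cast, so the format matches no matter
which underlying type the 64-bit typedef maps to. The sketch below is
not what this patch does (the patch only changes the format string); the
function name format_event_type() is hypothetical and uint64_t stands in
for __u64.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 64-bit event config value into buf. The cast guarantees
 * that the argument type matches "%llu" whether the 64-bit typedef is
 * "unsigned long" or "unsigned long long" on the build system.
 * Returns the snprintf() result.
 */
static int format_event_type(char *buf, size_t len, uint64_t config)
{
	return snprintf(buf, len, "no event found for type %llu",
			(unsigned long long)config);
}
```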

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 .../perf/util/scripting-engines/trace-event-perl.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/tools/perf/util/scripting-engines/trace-event-perl.c b/tools/perf/util/scripting-engines/trace-event-perl.c
index a85e4ae..d6eb9c5 100644
--- a/tools/perf/util/scripting-engines/trace-event-perl.c
+++ b/tools/perf/util/scripting-engines/trace-event-perl.c
@@ -281,8 +281,12 @@ static void perl_process_tracepoint(union perf_event *perf_event __maybe_unused,
 		return;
 
 	event = find_cache_event(evsel);
+	/*
+	 * attr.config is a __u64 which requires "%llu" to avoid compilation
+	 * error/warning with some compilers.
+	 */
 	if (!event)
-		die("ug! no event found for type %" PRIu64, evsel->attr.config);
+		die("ug! no event found for type %llu", evsel->attr.config);
 
 	pid = raw_field_value(event, "common_pid", data);
 
-- 
1.7.1



* [PATCH v2 2/4] perf: streamline append_chain() function
  2013-10-18 14:38 [PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed Waiman Long
  2013-10-18 14:38 ` [PATCH v2 1/4] perf: Fix potential compilation error with some compilers Waiman Long
@ 2013-10-18 14:38 ` Waiman Long
  2013-10-20  0:29   ` Andi Kleen
  2013-10-18 14:38 ` [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan Waiman Long
  2013-10-18 14:38 ` [PATCH v2 4/4] perf-top: add " Waiman Long
  3 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2013-10-18 14:38 UTC
  To: Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton, Waiman Long

When callgraph is enabled, the append_chain() function consumes a major
portion of the total CPU time. This patch tries to streamline the
append_chain() function by removing an unneeded conditional test and by
using the ?: operator, which can be more efficient than a regular if
statement on some architectures.
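
The idea can be sketched in isolation as follows. This is a simplified
model, not the perf code: the struct and function names (struct sym,
struct entry, match_prefix) are hypothetical stand-ins. The mode test is
evaluated once by the caller, and inside the loop it is folded into a
single ?: select so the comparison itself only has to check two
pointers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for perf's callchain types. */
struct sym   { uint64_t start; };
struct entry { uint64_t ip; const struct sym *sym; };

/*
 * Return how many leading entries of 'a' match 'b'. In function mode
 * two entries match when both have symbols with the same start
 * address; otherwise they are compared by raw instruction pointer.
 */
static size_t match_prefix(const struct entry *a, const struct entry *b,
			   size_t n, bool func_mode)
{
	size_t i;

	for (i = 0; i < n; i++) {
		/* Mode test folded into one select per iteration. */
		const struct sym *sym = func_mode ? b[i].sym : NULL;

		if (a[i].sym && sym) {
			if (a[i].sym->start != sym->start)
				break;
		} else if (a[i].ip != b[i].ip) {
			break;
		}
	}
	return i;
}
```

Whether the ?: form is actually faster depends on the compiler and the
target, so any gain should be confirmed by measurement.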

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 tools/perf/util/callchain.c |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 482f680..1e79001 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -315,6 +315,7 @@ append_chain(struct callchain_node *root,
 	struct callchain_list *cnode;
 	u64 start = cursor->pos;
 	bool found = false;
+	bool func_mode = (callchain_param.key == CCKEY_FUNCTION);
 	u64 matches;
 
 	/*
@@ -331,17 +332,15 @@ append_chain(struct callchain_node *root,
 		if (!node)
 			break;
 
-		sym = node->sym;
+		sym = func_mode ? node->sym : NULL;
 
-		if (cnode->ms.sym && sym &&
-		    callchain_param.key == CCKEY_FUNCTION) {
+		if (cnode->ms.sym && sym) {
 			if (cnode->ms.sym->start != sym->start)
 				break;
 		} else if (cnode->ip != node->ip)
 			break;
 
-		if (!found)
-			found = true;
+		found = true;
 
 		callchain_cursor_advance(cursor);
 	}
-- 
1.7.1



* [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 [PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed Waiman Long
  2013-10-18 14:38 ` [PATCH v2 1/4] perf: Fix potential compilation error with some compilers Waiman Long
  2013-10-18 14:38 ` [PATCH v2 2/4] perf: streamline append_chain() function Waiman Long
@ 2013-10-18 14:38 ` Waiman Long
  2013-10-18 17:17   ` Arnaldo Carvalho de Melo
                     ` (2 more replies)
  2013-10-18 14:38 ` [PATCH v2 4/4] perf-top: add " Waiman Long
  3 siblings, 3 replies; 14+ messages in thread
From: Waiman Long @ 2013-10-18 14:38 UTC
  To: Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton, Waiman Long

When callgraph data is included in the perf data file, it may take a
long time to scan and merge all of that data, especially if the stored
callchains are long and the perf data file itself is large, such as a
gigabyte or so.

The callchain stack is currently limited to PERF_MAX_STACK_DEPTH (127).
This is a large value. Developers are usually most interested in the
first few levels of the callgraph; the rest is rarely looked at.

This patch adds a new --max-stack option to perf-report to limit the
depth of callchain stack data it examines, reducing the time
perf-report takes to finish its processing. It trades the presence of
trailing stack information for faster processing.
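
The core of the change is a clamp on the number of callchain entries
that get walked. A minimal standalone sketch of that clamp follows; the
names struct chain, entries_to_scan() and last_scanned_ip() are
hypothetical and only model the behaviour, they are not perf's types or
functions.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for perf's struct ip_callchain. */
struct chain {
	uint64_t nr;		/* number of recorded entries */
	const uint64_t *ips;	/* recorded instruction pointers */
};

/* Entries to process: the smaller of the limit and chain->nr. */
static int entries_to_scan(const struct chain *chain, int max_stack)
{
	int nr = (int)chain->nr;

	return max_stack < nr ? max_stack : nr;
}

/* Walk only the clamped prefix; deeper frames are never touched. */
static uint64_t last_scanned_ip(const struct chain *chain, int max_stack)
{
	uint64_t ip = 0;
	int i, n = entries_to_scan(chain, max_stack);

	for (i = 0; i < n; i++)
		ip = chain->ips[i];
	return ip;
}
```

With a signed limit, a zero or negative value simply results in no
entries being walked in this sketch.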

The following table shows the elapsed time of doing perf-report on a
perf.data file of size 985,531,828 bytes.

--max-stack	Elapsed Time	Output data size
-----------	------------	----------------
not set		   88.0s	124,422,651
64		   87.5s	116,303,213
32		   87.2s	112,023,804
16		   86.6s	 94,326,380
8		   59.9s	 33,697,248
4		   40.7s	 10,116,637
-g none		   27.1s	  2,555,810

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 tools/perf/Documentation/perf-report.txt |    8 ++++++++
 tools/perf/builtin-report.c              |   22 +++++++++++++++++-----
 tools/perf/builtin-top.c                 |    3 ++-
 tools/perf/util/machine.c                |   14 +++++++++-----
 tools/perf/util/machine.h                |    3 ++-
 tools/perf/util/session.c                |    3 ++-
 6 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 2b8097e..be3f196 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -135,6 +135,14 @@ OPTIONS
 
 	Default: fractal,0.5,callee,function.
 
+--max-stack::
+	Set the stack depth limit when parsing the callchain, anything
+	beyond the specified depth will be ignored. This is a trade-off
+	between information loss and faster processing especially for
+	workloads that can have a very long callchain stack.
+
+	Default: 127
+
 -G::
 --inverted::
         alias for inverted caller based call graph.
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 72eae74..d0c9504 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -47,6 +47,7 @@ struct perf_report {
 	bool			show_threads;
 	bool			inverted_callchain;
 	bool			mem_mode;
+	int			max_stack;
 	struct perf_read_values	show_threads_values;
 	const char		*pretty_printing_style;
 	const char		*cpu_list;
@@ -88,7 +89,8 @@ static int perf_report__add_mem_hist_entry(struct perf_tool *tool,
 	if ((sort__has_parent || symbol_conf.use_callchain) &&
 	    sample->callchain) {
 		err = machine__resolve_callchain(machine, evsel, al->thread,
-						 sample, &parent, al);
+						 sample, &parent, al,
+						 rep->max_stack);
 		if (err)
 			return err;
 	}
@@ -179,7 +181,8 @@ static int perf_report__add_branch_hist_entry(struct perf_tool *tool,
 	if ((sort__has_parent || symbol_conf.use_callchain)
 	    && sample->callchain) {
 		err = machine__resolve_callchain(machine, evsel, al->thread,
-						 sample, &parent, al);
+						 sample, &parent, al,
+						 rep->max_stack);
 		if (err)
 			return err;
 	}
@@ -242,18 +245,21 @@ out:
 	return err;
 }
 
-static int perf_evsel__add_hist_entry(struct perf_evsel *evsel,
+static int perf_evsel__add_hist_entry(struct perf_tool *tool,
+				      struct perf_evsel *evsel,
 				      struct addr_location *al,
 				      struct perf_sample *sample,
 				      struct machine *machine)
 {
+	struct perf_report *rep = container_of(tool, struct perf_report, tool);
 	struct symbol *parent = NULL;
 	int err = 0;
 	struct hist_entry *he;
 
 	if ((sort__has_parent || symbol_conf.use_callchain) && sample->callchain) {
 		err = machine__resolve_callchain(machine, evsel, al->thread,
-						 sample, &parent, al);
+						 sample, &parent, al,
+						 rep->max_stack);
 		if (err)
 			return err;
 	}
@@ -330,7 +336,8 @@ static int process_sample_event(struct perf_tool *tool,
 		if (al.map != NULL)
 			al.map->dso->hit = 1;
 
-		ret = perf_evsel__add_hist_entry(evsel, &al, sample, machine);
+		ret = perf_evsel__add_hist_entry(tool, evsel, &al, sample,
+						 machine);
 		if (ret < 0)
 			pr_debug("problem incrementing symbol period, skipping event\n");
 	}
@@ -757,6 +764,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 			.ordered_samples = true,
 			.ordering_requires_timestamps = true,
 		},
+		.max_stack		 = PERF_MAX_STACK_DEPTH,
 		.pretty_printing_style	 = "normal",
 	};
 	const struct option options[] = {
@@ -797,6 +805,10 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 	OPT_CALLBACK_DEFAULT('g', "call-graph", &report, "output_type,min_percent[,print_limit],call_order",
 		     "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
 		     "Default: fractal,0.5,callee,function", &parse_callchain_opt, callchain_default_opt),
+	OPT_INTEGER(0, "max-stack", &report.max_stack,
+		    "Set the maximum stack depth when parsing the callchain, "
+		    "anything beyond the specified depth will be ignored. "
+		    "Default: " __stringify(PERF_MAX_STACK_DEPTH)),
 	OPT_BOOLEAN('G', "inverted", &report.inverted_callchain,
 		    "alias for inverted call graph"),
 	OPT_CALLBACK(0, "ignore-callees", NULL, "regex",
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 2122141..2725aca 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -771,7 +771,8 @@ static void perf_event__process_sample(struct perf_tool *tool,
 		    sample->callchain) {
 			err = machine__resolve_callchain(machine, evsel,
 							 al.thread, sample,
-							 &parent, &al);
+							 &parent, &al,
+							 PERF_MAX_STACK_DEPTH);
 			if (err)
 				return;
 		}
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 6188d28..9617c4a 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1267,10 +1267,12 @@ static int machine__resolve_callchain_sample(struct machine *machine,
 					     struct thread *thread,
 					     struct ip_callchain *chain,
 					     struct symbol **parent,
-					     struct addr_location *root_al)
+					     struct addr_location *root_al,
+					     int max_stack)
 {
 	u8 cpumode = PERF_RECORD_MISC_USER;
-	unsigned int i;
+	int chain_nr = min(max_stack, (int)chain->nr);
+	int i;
 	int err;
 
 	callchain_cursor_reset(&callchain_cursor);
@@ -1280,7 +1282,7 @@ static int machine__resolve_callchain_sample(struct machine *machine,
 		return 0;
 	}
 
-	for (i = 0; i < chain->nr; i++) {
+	for (i = 0; i < chain_nr; i++) {
 		u64 ip;
 		struct addr_location al;
 
@@ -1352,12 +1354,14 @@ int machine__resolve_callchain(struct machine *machine,
 			       struct thread *thread,
 			       struct perf_sample *sample,
 			       struct symbol **parent,
-			       struct addr_location *root_al)
+			       struct addr_location *root_al,
+			       int max_stack)
 {
 	int ret;
 
 	ret = machine__resolve_callchain_sample(machine, thread,
-						sample->callchain, parent, root_al);
+						sample->callchain, parent,
+						root_al, max_stack);
 	if (ret)
 		return ret;
 
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 58a6be1..d09cce0 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -91,7 +91,8 @@ int machine__resolve_callchain(struct machine *machine,
 			       struct thread *thread,
 			       struct perf_sample *sample,
 			       struct symbol **parent,
-			       struct addr_location *root_al);
+			       struct addr_location *root_al,
+			       int max_stack);
 
 /*
  * Default guest kernel is defined by parameter --guestkallsyms
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 568b750..96e5449 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1525,7 +1525,8 @@ void perf_evsel__print_ip(struct perf_evsel *evsel, union perf_event *event,
 	if (symbol_conf.use_callchain && sample->callchain) {
 
 		if (machine__resolve_callchain(machine, evsel, al.thread,
-					       sample, NULL, NULL) != 0) {
+					       sample, NULL, NULL,
+					       PERF_MAX_STACK_DEPTH) != 0) {
 			if (verbose)
 				error("Failed to resolve callchain. Skipping\n");
 			return;
-- 
1.7.1



* [PATCH v2 4/4] perf-top: add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 [PATCH v2 0/4] perf: add option to limit callchain stack scan to increase speed Waiman Long
                   ` (2 preceding siblings ...)
  2013-10-18 14:38 ` [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan Waiman Long
@ 2013-10-18 14:38 ` Waiman Long
  2013-10-18 17:31   ` David Ahern
                     ` (2 more replies)
  3 siblings, 3 replies; 14+ messages in thread
From: Waiman Long @ 2013-10-18 14:38 UTC
  To: Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton, Waiman Long

When the callgraph function is enabled (-G), it may take a long time to
scan all the stack data and merge them accordingly.

This patch adds a new --max-stack option to perf-top to limit the depth
of callchain stack data to look at to reduce the time it takes for
perf-top to finish its processing. It reduces the amount of information
provided to the user in exchange for faster speed.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 tools/perf/Documentation/perf-top.txt |    8 ++++++++
 tools/perf/builtin-top.c              |    8 ++++++--
 tools/perf/util/top.h                 |    1 +
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index 58d6598..3fd911c 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -155,6 +155,14 @@ Default is to monitor all CPUS.
 
 	Default: fractal,0.5,callee.
 
+--max-stack::
+	Set the stack depth limit when parsing the callchain, anything
+	beyond the specified depth will be ignored. This is a trade-off
+	between information loss and faster processing especially for
+	workloads that can have a very long callchain stack.
+
+	Default: 127
+
 --ignore-callees=<regex>::
         Ignore callees of the function(s) matching the given regex.
         This has the effect of collecting the callers of each such
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 2725aca..14902b0 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -772,7 +772,7 @@ static void perf_event__process_sample(struct perf_tool *tool,
 			err = machine__resolve_callchain(machine, evsel,
 							 al.thread, sample,
 							 &parent, &al,
-							 PERF_MAX_STACK_DEPTH);
+							 top->max_stack);
 			if (err)
 				return;
 		}
@@ -1052,10 +1052,11 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
 			.user_freq	= UINT_MAX,
 			.user_interval	= ULLONG_MAX,
 			.freq		= 4000, /* 4 KHz */
-			.target		     = {
+			.target		= {
 				.uses_mmap   = true,
 			},
 		},
+		.max_stack	     = PERF_MAX_STACK_DEPTH,
 		.sym_pcnt_filter     = 5,
 	};
 	struct perf_record_opts *opts = &top.record_opts;
@@ -1110,6 +1111,9 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
 	OPT_CALLBACK_DEFAULT('G', "call-graph", &top.record_opts,
 			     "mode[,dump_size]", record_callchain_help,
 			     &parse_callchain_opt, "fp"),
+	OPT_INTEGER(0, "max-stack", &top.max_stack,
+		    "Set the maximum stack depth when parsing the callchain. "
+		    "Default: " __stringify(PERF_MAX_STACK_DEPTH)),
 	OPT_CALLBACK(0, "ignore-callees", NULL, "regex",
 		   "ignore callees of these functions in call graphs",
 		   report_parse_ignore_callees_opt),
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index b554ffc..88cfeaf 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -24,6 +24,7 @@ struct perf_top {
 	u64		   exact_samples;
 	u64		   guest_us_samples, guest_kernel_samples;
 	int		   print_entries, count_filter, delay_secs;
+	int		   max_stack;
 	bool		   hide_kernel_symbols, hide_user_symbols, zero;
 	bool		   use_tui, use_stdio;
 	bool		   kptr_restrict_warned;
-- 
1.7.1



* Re: [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 ` [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan Waiman Long
@ 2013-10-18 17:17   ` Arnaldo Carvalho de Melo
  2013-10-21 14:51     ` Waiman Long
  2013-10-18 17:30   ` David Ahern
  2013-10-23  7:55   ` [tip:perf/core] perf report: Add " tip-bot for Waiman Long
  2 siblings, 1 reply; 14+ messages in thread
From: Arnaldo Carvalho de Melo @ 2013-10-18 17:17 UTC
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, Paul Mackerras, Namhyung Kim,
	Jiri Olsa, Adrian Hunter, David Ahern, Stephane Eranian,
	linux-kernel, Aswin Chandramouleeswaran, Scott J Norton

On Fri, Oct 18, 2013 at 10:38:48AM -0400, Waiman Long wrote:
> When callgraph data was included in the perf data file, it may take a
> long time to scan all those data and merge them together especially
> if the stored callchains are long and the perf data file itself is
> large, like a Gbyte or so.
> 
> The callchain stack is currently limited to PERF_MAX_STACK_DEPTH (127).
> This is a large value. Usually the callgraph data that developers are
> most interested in are the first few levels, the rests are usually
> not looked at.
> 
> This patch adds a new --max-stack option to perf-report to limit the
> depth of callchain stack data to look at to reduce the time it takes
> for perf-report to finish its processing. It trades the presence of
> trailing stack information with faster speed.
> 
> The following table shows the elapsed time of doing perf-report on a
> perf.data file of size 985,531,828 bytes.
> 
> --max_stack	Elapsed Time	Output data size
> -----------	------------	----------------

Please prefix lines like this (------) with a space, otherwise 'git am'
will chop off everything from that line onwards. Fixing it up now.

- Arnaldo

> not set		   88.0s	124,422,651
> 64		   87.5s	116,303,213
> 32		   87.2s	112,023,804
> 16		   86.6s	 94,326,380
> 8		   59.9s	 33,697,248
> 4		   40.7s	 10,116,637
> -g none		   27.1s	  2,555,810
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>  tools/perf/Documentation/perf-report.txt |    8 ++++++++
>  tools/perf/builtin-report.c              |   22 +++++++++++++++++-----
>  tools/perf/builtin-top.c                 |    3 ++-
>  tools/perf/util/machine.c                |   14 +++++++++-----
>  tools/perf/util/machine.h                |    3 ++-
>  tools/perf/util/session.c                |    3 ++-
>  6 files changed, 40 insertions(+), 13 deletions(-)
> 
> diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> index 2b8097e..be3f196 100644
> --- a/tools/perf/Documentation/perf-report.txt
> +++ b/tools/perf/Documentation/perf-report.txt
> @@ -135,6 +135,14 @@ OPTIONS
>  
>  	Default: fractal,0.5,callee,function.
>  
> +--max-stack::
> +	Set the stack depth limit when parsing the callchain, anything
> +	beyond the specified depth will be ignored. This is a trade-off
> +	between information loss and faster processing especially for
> +	workloads that can have a very long callchain stack.
> +
> +	Default: 127
> +
>  -G::
>  --inverted::
>          alias for inverted caller based call graph.
> diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
> index 72eae74..d0c9504 100644
> --- a/tools/perf/builtin-report.c
> +++ b/tools/perf/builtin-report.c
> @@ -47,6 +47,7 @@ struct perf_report {
>  	bool			show_threads;
>  	bool			inverted_callchain;
>  	bool			mem_mode;
> +	int			max_stack;
>  	struct perf_read_values	show_threads_values;
>  	const char		*pretty_printing_style;
>  	const char		*cpu_list;
> @@ -88,7 +89,8 @@ static int perf_report__add_mem_hist_entry(struct perf_tool *tool,
>  	if ((sort__has_parent || symbol_conf.use_callchain) &&
>  	    sample->callchain) {
>  		err = machine__resolve_callchain(machine, evsel, al->thread,
> -						 sample, &parent, al);
> +						 sample, &parent, al,
> +						 rep->max_stack);
>  		if (err)
>  			return err;
>  	}
> @@ -179,7 +181,8 @@ static int perf_report__add_branch_hist_entry(struct perf_tool *tool,
>  	if ((sort__has_parent || symbol_conf.use_callchain)
>  	    && sample->callchain) {
>  		err = machine__resolve_callchain(machine, evsel, al->thread,
> -						 sample, &parent, al);
> +						 sample, &parent, al,
> +						 rep->max_stack);
>  		if (err)
>  			return err;
>  	}
> @@ -242,18 +245,21 @@ out:
>  	return err;
>  }
>  
> -static int perf_evsel__add_hist_entry(struct perf_evsel *evsel,
> +static int perf_evsel__add_hist_entry(struct perf_tool *tool,
> +				      struct perf_evsel *evsel,
>  				      struct addr_location *al,
>  				      struct perf_sample *sample,
>  				      struct machine *machine)
>  {
> +	struct perf_report *rep = container_of(tool, struct perf_report, tool);
>  	struct symbol *parent = NULL;
>  	int err = 0;
>  	struct hist_entry *he;
>  
>  	if ((sort__has_parent || symbol_conf.use_callchain) && sample->callchain) {
>  		err = machine__resolve_callchain(machine, evsel, al->thread,
> -						 sample, &parent, al);
> +						 sample, &parent, al,
> +						 rep->max_stack);
>  		if (err)
>  			return err;
>  	}
> @@ -330,7 +336,8 @@ static int process_sample_event(struct perf_tool *tool,
>  		if (al.map != NULL)
>  			al.map->dso->hit = 1;
>  
> -		ret = perf_evsel__add_hist_entry(evsel, &al, sample, machine);
> +		ret = perf_evsel__add_hist_entry(tool, evsel, &al, sample,
> +						 machine);
>  		if (ret < 0)
>  			pr_debug("problem incrementing symbol period, skipping event\n");
>  	}
> @@ -757,6 +764,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
>  			.ordered_samples = true,
>  			.ordering_requires_timestamps = true,
>  		},
> +		.max_stack		 = PERF_MAX_STACK_DEPTH,
>  		.pretty_printing_style	 = "normal",
>  	};
>  	const struct option options[] = {
> @@ -797,6 +805,10 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
>  	OPT_CALLBACK_DEFAULT('g', "call-graph", &report, "output_type,min_percent[,print_limit],call_order",
>  		     "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
>  		     "Default: fractal,0.5,callee,function", &parse_callchain_opt, callchain_default_opt),
> +	OPT_INTEGER(0, "max-stack", &report.max_stack,
> +		    "Set the maximum stack depth when parsing the callchain, "
> +		    "anything beyond the specified depth will be ignored. "
> +		    "Default: " __stringify(PERF_MAX_STACK_DEPTH)),
>  	OPT_BOOLEAN('G', "inverted", &report.inverted_callchain,
>  		    "alias for inverted call graph"),
>  	OPT_CALLBACK(0, "ignore-callees", NULL, "regex",
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index 2122141..2725aca 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -771,7 +771,8 @@ static void perf_event__process_sample(struct perf_tool *tool,
>  		    sample->callchain) {
>  			err = machine__resolve_callchain(machine, evsel,
>  							 al.thread, sample,
> -							 &parent, &al);
> +							 &parent, &al,
> +							 PERF_MAX_STACK_DEPTH);
>  			if (err)
>  				return;
>  		}
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index 6188d28..9617c4a 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -1267,10 +1267,12 @@ static int machine__resolve_callchain_sample(struct machine *machine,
>  					     struct thread *thread,
>  					     struct ip_callchain *chain,
>  					     struct symbol **parent,
> -					     struct addr_location *root_al)
> +					     struct addr_location *root_al,
> +					     int max_stack)
>  {
>  	u8 cpumode = PERF_RECORD_MISC_USER;
> -	unsigned int i;
> +	int chain_nr = min(max_stack, (int)chain->nr);
> +	int i;
>  	int err;
>  
>  	callchain_cursor_reset(&callchain_cursor);
> @@ -1280,7 +1282,7 @@ static int machine__resolve_callchain_sample(struct machine *machine,
>  		return 0;
>  	}
>  
> -	for (i = 0; i < chain->nr; i++) {
> +	for (i = 0; i < chain_nr; i++) {
>  		u64 ip;
>  		struct addr_location al;
>  
> @@ -1352,12 +1354,14 @@ int machine__resolve_callchain(struct machine *machine,
>  			       struct thread *thread,
>  			       struct perf_sample *sample,
>  			       struct symbol **parent,
> -			       struct addr_location *root_al)
> +			       struct addr_location *root_al,
> +			       int max_stack)
>  {
>  	int ret;
>  
>  	ret = machine__resolve_callchain_sample(machine, thread,
> -						sample->callchain, parent, root_al);
> +						sample->callchain, parent,
> +						root_al, max_stack);
>  	if (ret)
>  		return ret;
>  
> diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
> index 58a6be1..d09cce0 100644
> --- a/tools/perf/util/machine.h
> +++ b/tools/perf/util/machine.h
> @@ -91,7 +91,8 @@ int machine__resolve_callchain(struct machine *machine,
>  			       struct thread *thread,
>  			       struct perf_sample *sample,
>  			       struct symbol **parent,
> -			       struct addr_location *root_al);
> +			       struct addr_location *root_al,
> +			       int max_stack);
>  
>  /*
>   * Default guest kernel is defined by parameter --guestkallsyms
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 568b750..96e5449 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -1525,7 +1525,8 @@ void perf_evsel__print_ip(struct perf_evsel *evsel, union perf_event *event,
>  	if (symbol_conf.use_callchain && sample->callchain) {
>  
>  		if (machine__resolve_callchain(machine, evsel, al.thread,
> -					       sample, NULL, NULL) != 0) {
> +					       sample, NULL, NULL,
> +					       PERF_MAX_STACK_DEPTH) != 0) {
>  			if (verbose)
>  				error("Failed to resolve callchain. Skipping\n");
>  			return;
> -- 
> 1.7.1


* Re: [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 ` [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan Waiman Long
  2013-10-18 17:17   ` Arnaldo Carvalho de Melo
@ 2013-10-18 17:30   ` David Ahern
  2013-10-23  7:55   ` [tip:perf/core] perf report: Add " tip-bot for Waiman Long
  2 siblings, 0 replies; 14+ messages in thread
From: David Ahern @ 2013-10-18 17:30 UTC
  To: Waiman Long, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton

On 10/18/13 8:38 AM, Waiman Long wrote:
> When callgraph data was included in the perf data file, it may take a
> long time to scan all those data and merge them together especially
> if the stored callchains are long and the perf data file itself is
> large, like a Gbyte or so.
>
> The callchain stack is currently limited to PERF_MAX_STACK_DEPTH (127).
> This is a large value. Usually the callgraph data that developers are
> most interested in are the first few levels, the rests are usually
> not looked at.
>
> This patch adds a new --max-stack option to perf-report to limit the
> depth of callchain stack data to look at to reduce the time it takes
> for perf-report to finish its processing. It trades the presence of
> trailing stack information with faster speed.
>
> The following table shows the elapsed time of doing perf-report on a
> perf.data file of size 985,531,828 bytes.
>
> --max_stack	Elapsed Time	Output data size
> -----------	------------	----------------
> not set		   88.0s	124,422,651
> 64		   87.5s	116,303,213
> 32		   87.2s	112,023,804
> 16		   86.6s	 94,326,380
> 8		   59.9s	 33,697,248
> 4		   40.7s	 10,116,637
> -g none		   27.1s	  2,555,810
>
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>   tools/perf/Documentation/perf-report.txt |    8 ++++++++
>   tools/perf/builtin-report.c              |   22 +++++++++++++++++-----
>   tools/perf/builtin-top.c                 |    3 ++-
>   tools/perf/util/machine.c                |   14 +++++++++-----
>   tools/perf/util/machine.h                |    3 ++-
>   tools/perf/util/session.c                |    3 ++-
>   6 files changed, 40 insertions(+), 13 deletions(-)
>

Looks good to me. Acked-by: David Ahern <dsahern@gmail.com>



* Re: [PATCH v2 4/4] perf-top: add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 ` [PATCH v2 4/4] perf-top: add " Waiman Long
@ 2013-10-18 17:31   ` David Ahern
  2013-10-20 22:35   ` Davidlohr Bueso
  2013-10-23  7:55   ` [tip:perf/core] perf top: Add " tip-bot for Waiman Long
  2 siblings, 0 replies; 14+ messages in thread
From: David Ahern @ 2013-10-18 17:31 UTC (permalink / raw)
  To: Waiman Long, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Paul Mackerras, Namhyung Kim, Jiri Olsa,
	Adrian Hunter, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton

On 10/18/13 8:38 AM, Waiman Long wrote:
> When the callgraph function is enabled (-G), it may take a long time to
> scan all the stack data and merge them accordingly.
>
> This patch adds a new --max-stack option to perf-top to limit the depth
> of callchain stack data to look at to reduce the time it takes for
> perf-top to finish its processing. It reduces the amount of information
> provided to the user in exchange for faster speed.
>
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
>   tools/perf/Documentation/perf-top.txt |    8 ++++++++
>   tools/perf/builtin-top.c              |    8 ++++++--
>   tools/perf/util/top.h                 |    1 +
>   3 files changed, 15 insertions(+), 2 deletions(-)
>

Looks good to me. Acked-by: David Ahern <dsahern@gmail.com>




* Re: [PATCH v2 2/4] perf: streamline append_chain() function
  2013-10-18 14:38 ` [PATCH v2 2/4] perf: streamline append_chain() function Waiman Long
@ 2013-10-20  0:29   ` Andi Kleen
  2013-10-21 14:50     ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2013-10-20  0:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Paul Mackerras, Namhyung Kim, Jiri Olsa, Adrian Hunter,
	David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton

Waiman Long <Waiman.Long@hp.com> writes:

> as well as
> using ?: statement which can be more efficient than the regular if
> statement in some architectures.

I don't think that's true; the compiler does if-conversion for both anyway.

But the change seems reasonable.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH v2 4/4] perf-top: add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 ` [PATCH v2 4/4] perf-top: add " Waiman Long
  2013-10-18 17:31   ` David Ahern
@ 2013-10-20 22:35   ` Davidlohr Bueso
  2013-10-23  7:55   ` [tip:perf/core] perf top: Add " tip-bot for Waiman Long
  2 siblings, 0 replies; 14+ messages in thread
From: Davidlohr Bueso @ 2013-10-20 22:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Paul Mackerras, Namhyung Kim, Jiri Olsa, Adrian Hunter,
	David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton

On Fri, 2013-10-18 at 10:38 -0400, Waiman Long wrote:
> When the callgraph function is enabled (-G), it may take a long time to
> scan all the stack data and merge them accordingly.
> 
> This patch adds a new --max-stack option to perf-top to limit the depth
> of callchain stack data to look at to reduce the time it takes for
> perf-top to finish its processing. It reduces the amount of information
> provided to the user in exchange for faster speed.
> 
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>

Tested-by: Davidlohr Bueso <davidlohr@hp.com>

> ---
>  tools/perf/Documentation/perf-top.txt |    8 ++++++++
>  tools/perf/builtin-top.c              |    8 ++++++--
>  tools/perf/util/top.h                 |    1 +
>  3 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
> index 58d6598..3fd911c 100644
> --- a/tools/perf/Documentation/perf-top.txt
> +++ b/tools/perf/Documentation/perf-top.txt
> @@ -155,6 +155,14 @@ Default is to monitor all CPUS.
>  
>  	Default: fractal,0.5,callee.
>  
> +--max-stack::
> +	Set the stack depth limit when parsing the callchain, anything
> +	beyond the specified depth will be ignored. This is a trade-off
> +	between information loss and faster processing especially for
> +	workloads that can have a very long callchain stack.
> +
> +	Default: 127
> +
>  --ignore-callees=<regex>::
>          Ignore callees of the function(s) matching the given regex.
>          This has the effect of collecting the callers of each such
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index 2725aca..14902b0 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -772,7 +772,7 @@ static void perf_event__process_sample(struct perf_tool *tool,
>  			err = machine__resolve_callchain(machine, evsel,
>  							 al.thread, sample,
>  							 &parent, &al,
> -							 PERF_MAX_STACK_DEPTH);
> +							 top->max_stack);
>  			if (err)
>  				return;
>  		}
> @@ -1052,10 +1052,11 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
>  			.user_freq	= UINT_MAX,
>  			.user_interval	= ULLONG_MAX,
>  			.freq		= 4000, /* 4 KHz */
> -			.target		     = {
> +			.target		= {
>  				.uses_mmap   = true,
>  			},
>  		},
> +		.max_stack	     = PERF_MAX_STACK_DEPTH,
>  		.sym_pcnt_filter     = 5,
>  	};
>  	struct perf_record_opts *opts = &top.record_opts;
> @@ -1110,6 +1111,9 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
>  	OPT_CALLBACK_DEFAULT('G', "call-graph", &top.record_opts,
>  			     "mode[,dump_size]", record_callchain_help,
>  			     &parse_callchain_opt, "fp"),
> +	OPT_INTEGER(0, "max-stack", &top.max_stack,
> +		    "Set the maximum stack depth when parsing the callchain. "
> +		    "Default: " __stringify(PERF_MAX_STACK_DEPTH)),
>  	OPT_CALLBACK(0, "ignore-callees", NULL, "regex",
>  		   "ignore callees of these functions in call graphs",
>  		   report_parse_ignore_callees_opt),
> diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
> index b554ffc..88cfeaf 100644
> --- a/tools/perf/util/top.h
> +++ b/tools/perf/util/top.h
> @@ -24,6 +24,7 @@ struct perf_top {
>  	u64		   exact_samples;
>  	u64		   guest_us_samples, guest_kernel_samples;
>  	int		   print_entries, count_filter, delay_secs;
> +	int		   max_stack;
>  	bool		   hide_kernel_symbols, hide_user_symbols, zero;
>  	bool		   use_tui, use_stdio;
>  	bool		   kptr_restrict_warned;




* Re: [PATCH v2 2/4] perf: streamline append_chain() function
  2013-10-20  0:29   ` Andi Kleen
@ 2013-10-21 14:50     ` Waiman Long
  0 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2013-10-21 14:50 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Paul Mackerras, Namhyung Kim, Jiri Olsa, Adrian Hunter,
	David Ahern, Stephane Eranian, linux-kernel,
	Aswin Chandramouleeswaran, Scott J Norton

On 10/19/2013 08:29 PM, Andi Kleen wrote:
> Waiman Long<Waiman.Long@hp.com>  writes:
>
>> as well as
>> using ?: statement which can be more efficient than the regular if
>> statement in some architectures.
> I don't think that's true, the compiler does if conversion anyways for both.
>
> But change seems reasonable.
>
> -Andi
>
>

That may be true for a simple if statement. However, the condition was
checked as the last of 3 tests. I doubt that the compiler is able to
optimize that effectively.

-Longman
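
For readers following this exchange, the two forms under discussion can
be sketched as below. This is only an illustration, not the code from
append_chain(): the names pick_if() and pick_ternary(), their
parameters, and the idea of a condition built from three tests are all
assumptions made for the example. Both functions return the same value;
whether they compile to the same instructions depends on the compiler
and the optimization level, which is the point being debated.

```c
#include <stdbool.h>

/* Hypothetical helper: select a value with a regular if statement. */
static int pick_if(bool a, bool b, bool c, int on_true, int on_false)
{
	int ret = on_false;

	/* Assumed shape: the selection depends on three combined tests. */
	if (a && b && c)
		ret = on_true;
	return ret;
}

/* The same selection written with the conditional operator. */
static int pick_ternary(bool a, bool b, bool c, int on_true, int on_false)
{
	return (a && b && c) ? on_true : on_false;
}
```

Comparing the generated assembly for both functions at the optimization
level in use is the most direct way to check the claim on a given
architecture.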


* Re: [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan
  2013-10-18 17:17   ` Arnaldo Carvalho de Melo
@ 2013-10-21 14:51     ` Waiman Long
  0 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2013-10-21 14:51 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ingo Molnar, Peter Zijlstra, Paul Mackerras, Namhyung Kim,
	Jiri Olsa, Adrian Hunter, David Ahern, Stephane Eranian,
	linux-kernel, Aswin Chandramouleeswaran, Scott J Norton

On 10/18/2013 01:17 PM, Arnaldo Carvalho de Melo wrote:
> Em Fri, Oct 18, 2013 at 10:38:48AM -0400, Waiman Long escreveu:
>> When callgraph data was included in the perf data file, it may take a
>> long time to scan all those data and merge them together especially
>> if the stored callchains are long and the perf data file itself is
>> large, like a Gbyte or so.
>>
>> The callchain stack is currently limited to PERF_MAX_STACK_DEPTH (127).
>> This is a large value. Usually the callgraph data that developers are
>> most interested in are the first few levels, the rests are usually
>> not looked at.
>>
>> This patch adds a new --max-stack option to perf-report to limit the
>> depth of callchain stack data to look at to reduce the time it takes
>> for perf-report to finish its processing. It trades the presence of
>> trailing stack information with faster speed.
>>
>> The following table shows the elapsed time of doing perf-report on a
>> perf.data file of size 985,531,828 bytes.
>>
>> --max_stack	Elapsed Time	Output data size
>> -----------	------------	----------------
> Please prefix lines like this (------) with a space, otherwise 'git am'
> will chop off everything from that line onwards. Fixing it up now.
>
> - Arnaldo
>
>

Thanks for spotting the problem. I will fix that in the next version.

-Longman


* [tip:perf/core] perf report: Add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 ` [PATCH v2 3/4] perf-report: add --max-stack option to limit callchain stack scan Waiman Long
  2013-10-18 17:17   ` Arnaldo Carvalho de Melo
  2013-10-18 17:30   ` David Ahern
@ 2013-10-23  7:55   ` tip-bot for Waiman Long
  2 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Waiman Long @ 2013-10-23  7:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: acme, eranian, mingo, mingo, a.p.zijlstra, jolsa, Waiman.Long,
	dsahern, tglx, scott.norton, hpa, paulus, linux-kernel, namhyung,
	adrian.hunter, aswin

Commit-ID:  91e95617429cb272fd908b1928a1915b37b9655f
Gitweb:     http://git.kernel.org/tip/91e95617429cb272fd908b1928a1915b37b9655f
Author:     Waiman Long <Waiman.Long@hp.com>
AuthorDate: Fri, 18 Oct 2013 10:38:48 -0400
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 21 Oct 2013 17:36:25 -0300

perf report: Add --max-stack option to limit callchain stack scan

When callgraph data is included in the perf data file, it may take a
long time to scan all that data and merge it together, especially if
the stored callchains are long and the perf data file itself is large,
like a Gbyte or so.

The callchain stack is currently limited to PERF_MAX_STACK_DEPTH (127).
This is a large value. Usually the callgraph data that developers are
most interested in is the first few levels; the rest is usually not
looked at.

This patch adds a new --max-stack option to perf-report to limit the
depth of callchain stack data to look at to reduce the time it takes for
perf-report to finish its processing. It trades the presence of trailing
stack information for faster speed.

The following table shows the elapsed time of doing perf-report on a
perf.data file of size 985,531,828 bytes.

  --max-stack   Elapsed Time    Output data size
  -----------   ------------    ----------------
  not set        88.0s          124,422,651
  64             87.5s          116,303,213
  32             87.2s          112,023,804
  16             86.6s           94,326,380
  8              59.9s           33,697,248
  4              40.7s           10,116,637
  -g none        27.1s            2,555,810

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: David Ahern <dsahern@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1382107129-2010-4-git-send-email-Waiman.Long@hp.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-report.txt |  8 ++++++++
 tools/perf/builtin-report.c              | 22 +++++++++++++++++-----
 tools/perf/builtin-top.c                 |  3 ++-
 tools/perf/util/machine.c                | 14 +++++++++-----
 tools/perf/util/machine.h                |  3 ++-
 tools/perf/util/session.c                |  3 ++-
 6 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index be5ad87..10a2798 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -141,6 +141,14 @@ OPTIONS
 
 	Default: fractal,0.5,callee,function.
 
+--max-stack::
+	Set the stack depth limit when parsing the callchain, anything
+	beyond the specified depth will be ignored. This is a trade-off
+	between information loss and faster processing especially for
+	workloads that can have a very long callchain stack.
+
+	Default: 127
+
 -G::
 --inverted::
         alias for inverted caller based call graph.
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index fa68a36..81addca 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -49,6 +49,7 @@ struct perf_report {
 	bool			show_threads;
 	bool			inverted_callchain;
 	bool			mem_mode;
+	int			max_stack;
 	struct perf_read_values	show_threads_values;
 	const char		*pretty_printing_style;
 	const char		*cpu_list;
@@ -90,7 +91,8 @@ static int perf_report__add_mem_hist_entry(struct perf_tool *tool,
 	if ((sort__has_parent || symbol_conf.use_callchain) &&
 	    sample->callchain) {
 		err = machine__resolve_callchain(machine, evsel, al->thread,
-						 sample, &parent, al);
+						 sample, &parent, al,
+						 rep->max_stack);
 		if (err)
 			return err;
 	}
@@ -181,7 +183,8 @@ static int perf_report__add_branch_hist_entry(struct perf_tool *tool,
 	if ((sort__has_parent || symbol_conf.use_callchain)
 	    && sample->callchain) {
 		err = machine__resolve_callchain(machine, evsel, al->thread,
-						 sample, &parent, al);
+						 sample, &parent, al,
+						 rep->max_stack);
 		if (err)
 			return err;
 	}
@@ -244,18 +247,21 @@ out:
 	return err;
 }
 
-static int perf_evsel__add_hist_entry(struct perf_evsel *evsel,
+static int perf_evsel__add_hist_entry(struct perf_tool *tool,
+				      struct perf_evsel *evsel,
 				      struct addr_location *al,
 				      struct perf_sample *sample,
 				      struct machine *machine)
 {
+	struct perf_report *rep = container_of(tool, struct perf_report, tool);
 	struct symbol *parent = NULL;
 	int err = 0;
 	struct hist_entry *he;
 
 	if ((sort__has_parent || symbol_conf.use_callchain) && sample->callchain) {
 		err = machine__resolve_callchain(machine, evsel, al->thread,
-						 sample, &parent, al);
+						 sample, &parent, al,
+						 rep->max_stack);
 		if (err)
 			return err;
 	}
@@ -332,7 +338,8 @@ static int process_sample_event(struct perf_tool *tool,
 		if (al.map != NULL)
 			al.map->dso->hit = 1;
 
-		ret = perf_evsel__add_hist_entry(evsel, &al, sample, machine);
+		ret = perf_evsel__add_hist_entry(tool, evsel, &al, sample,
+						 machine);
 		if (ret < 0)
 			pr_debug("problem incrementing symbol period, skipping event\n");
 	}
@@ -772,6 +779,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 			.ordered_samples = true,
 			.ordering_requires_timestamps = true,
 		},
+		.max_stack		 = PERF_MAX_STACK_DEPTH,
 		.pretty_printing_style	 = "normal",
 	};
 	const struct option options[] = {
@@ -812,6 +820,10 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 	OPT_CALLBACK_DEFAULT('g', "call-graph", &report, "output_type,min_percent[,print_limit],call_order",
 		     "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
 		     "Default: fractal,0.5,callee,function", &parse_callchain_opt, callchain_default_opt),
+	OPT_INTEGER(0, "max-stack", &report.max_stack,
+		    "Set the maximum stack depth when parsing the callchain, "
+		    "anything beyond the specified depth will be ignored. "
+		    "Default: " __stringify(PERF_MAX_STACK_DEPTH)),
 	OPT_BOOLEAN('G', "inverted", &report.inverted_callchain,
 		    "alias for inverted call graph"),
 	OPT_CALLBACK(0, "ignore-callees", NULL, "regex",
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index d934f70..112cb7d 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -770,7 +770,8 @@ static void perf_event__process_sample(struct perf_tool *tool,
 		    sample->callchain) {
 			err = machine__resolve_callchain(machine, evsel,
 							 al.thread, sample,
-							 &parent, &al);
+							 &parent, &al,
+							 PERF_MAX_STACK_DEPTH);
 			if (err)
 				return;
 		}
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 6b861ae..ea93425 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1253,10 +1253,12 @@ static int machine__resolve_callchain_sample(struct machine *machine,
 					     struct thread *thread,
 					     struct ip_callchain *chain,
 					     struct symbol **parent,
-					     struct addr_location *root_al)
+					     struct addr_location *root_al,
+					     int max_stack)
 {
 	u8 cpumode = PERF_RECORD_MISC_USER;
-	unsigned int i;
+	int chain_nr = min(max_stack, (int)chain->nr);
+	int i;
 	int err;
 
 	callchain_cursor_reset(&callchain_cursor);
@@ -1266,7 +1268,7 @@ static int machine__resolve_callchain_sample(struct machine *machine,
 		return 0;
 	}
 
-	for (i = 0; i < chain->nr; i++) {
+	for (i = 0; i < chain_nr; i++) {
 		u64 ip;
 		struct addr_location al;
 
@@ -1338,12 +1340,14 @@ int machine__resolve_callchain(struct machine *machine,
 			       struct thread *thread,
 			       struct perf_sample *sample,
 			       struct symbol **parent,
-			       struct addr_location *root_al)
+			       struct addr_location *root_al,
+			       int max_stack)
 {
 	int ret;
 
 	ret = machine__resolve_callchain_sample(machine, thread,
-						sample->callchain, parent, root_al);
+						sample->callchain, parent,
+						root_al, max_stack);
 	if (ret)
 		return ret;
 
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index d44c09b..4c1f5d5 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -92,7 +92,8 @@ int machine__resolve_callchain(struct machine *machine,
 			       struct thread *thread,
 			       struct perf_sample *sample,
 			       struct symbol **parent,
-			       struct addr_location *root_al);
+			       struct addr_location *root_al,
+			       int max_stack);
 
 /*
  * Default guest kernel is defined by parameter --guestkallsyms
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 19fc716..854c5aa 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1512,7 +1512,8 @@ void perf_evsel__print_ip(struct perf_evsel *evsel, union perf_event *event,
 	if (symbol_conf.use_callchain && sample->callchain) {
 
 		if (machine__resolve_callchain(machine, evsel, al.thread,
-					       sample, NULL, NULL) != 0) {
+					       sample, NULL, NULL,
+					       PERF_MAX_STACK_DEPTH) != 0) {
 			if (verbose)
 				error("Failed to resolve callchain. Skipping\n");
 			return;
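
The core of the change in machine.c is the clamp applied before the
loop over the callchain entries. A minimal sketch of that idea follows.
struct demo_callchain, min_int() and demo_walk() are hypothetical
stand-ins written for this example; they are not perf's struct
ip_callchain, the min() macro, or machine__resolve_callchain_sample().

```c
#include <stdint.h>

/* Hypothetical stand-in for the callchain record used in the patch. */
struct demo_callchain {
	uint64_t nr;      /* number of recorded entries */
	uint64_t ips[8];  /* recorded instruction pointers */
};

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Return how many entries would be walked under the given limit. */
static int demo_walk(const struct demo_callchain *chain, int max_stack)
{
	/* Same clamp as in the patch: never walk past max_stack entries. */
	int chain_nr = min_int(max_stack, (int)chain->nr);
	int visited = 0;
	int i;

	for (i = 0; i < chain_nr; i++)
		visited++;  /* the real code resolves chain->ips[i] here */
	return visited;
}
```

With five recorded entries, a limit of 3 walks three of them and a limit
of 127 walks all five. In this sketch a zero or negative limit walks
nothing, because the signed comparison in the loop fails immediately.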


* [tip:perf/core] perf top: Add --max-stack option to limit callchain stack scan
  2013-10-18 14:38 ` [PATCH v2 4/4] perf-top: add " Waiman Long
  2013-10-18 17:31   ` David Ahern
  2013-10-20 22:35   ` Davidlohr Bueso
@ 2013-10-23  7:55   ` tip-bot for Waiman Long
  2 siblings, 0 replies; 14+ messages in thread
From: tip-bot for Waiman Long @ 2013-10-23  7:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: acme, eranian, mingo, mingo, a.p.zijlstra, jolsa, Waiman.Long,
	dsahern, tglx, scott.norton, davidlohr, hpa, paulus,
	linux-kernel, namhyung, adrian.hunter, aswin

Commit-ID:  5dbb6e81d85e55ee2b4cf523c1738e16f63e5400
Gitweb:     http://git.kernel.org/tip/5dbb6e81d85e55ee2b4cf523c1738e16f63e5400
Author:     Waiman Long <Waiman.Long@hp.com>
AuthorDate: Fri, 18 Oct 2013 10:38:49 -0400
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 21 Oct 2013 17:36:25 -0300

perf top: Add --max-stack option to limit callchain stack scan

When the callgraph function is enabled (-G), it may take a long time to
scan all the stack data and merge them accordingly.

This patch adds a new --max-stack option to perf-top to limit the depth
of callchain stack data to look at to reduce the time it takes for
perf-top to finish its processing. It reduces the amount of information
provided to the user in exchange for faster speed.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: David Ahern <dsahern@gmail.com>
Tested-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1382107129-2010-5-git-send-email-Waiman.Long@hp.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-top.txt | 8 ++++++++
 tools/perf/builtin-top.c              | 8 ++++++--
 tools/perf/util/top.h                 | 1 +
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index f65777c..c16a09e 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -158,6 +158,14 @@ Default is to monitor all CPUS.
 
 	Default: fractal,0.5,callee.
 
+--max-stack::
+	Set the stack depth limit when parsing the callchain, anything
+	beyond the specified depth will be ignored. This is a trade-off
+	between information loss and faster processing especially for
+	workloads that can have a very long callchain stack.
+
+	Default: 127
+
 --ignore-callees=<regex>::
         Ignore callees of the function(s) matching the given regex.
         This has the effect of collecting the callers of each such
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 112cb7d..386d833 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -771,7 +771,7 @@ static void perf_event__process_sample(struct perf_tool *tool,
 			err = machine__resolve_callchain(machine, evsel,
 							 al.thread, sample,
 							 &parent, &al,
-							 PERF_MAX_STACK_DEPTH);
+							 top->max_stack);
 			if (err)
 				return;
 		}
@@ -1048,10 +1048,11 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
 			.user_freq	= UINT_MAX,
 			.user_interval	= ULLONG_MAX,
 			.freq		= 4000, /* 4 KHz */
-			.target		     = {
+			.target		= {
 				.uses_mmap   = true,
 			},
 		},
+		.max_stack	     = PERF_MAX_STACK_DEPTH,
 		.sym_pcnt_filter     = 5,
 	};
 	struct perf_record_opts *opts = &top.record_opts;
@@ -1110,6 +1111,9 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
 	OPT_CALLBACK_DEFAULT('G', "call-graph", &top.record_opts,
 			     "mode[,dump_size]", record_callchain_help,
 			     &parse_callchain_opt, "fp"),
+	OPT_INTEGER(0, "max-stack", &top.max_stack,
+		    "Set the maximum stack depth when parsing the callchain. "
+		    "Default: " __stringify(PERF_MAX_STACK_DEPTH)),
 	OPT_CALLBACK(0, "ignore-callees", NULL, "regex",
 		   "ignore callees of these functions in call graphs",
 		   report_parse_ignore_callees_opt),
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index b554ffc..88cfeaf 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -24,6 +24,7 @@ struct perf_top {
 	u64		   exact_samples;
 	u64		   guest_us_samples, guest_kernel_samples;
 	int		   print_entries, count_filter, delay_secs;
+	int		   max_stack;
 	bool		   hide_kernel_symbols, hide_user_symbols, zero;
 	bool		   use_tui, use_stdio;
 	bool		   kptr_restrict_warned;
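
The help strings in both option tables build their "Default:" suffix
with __stringify(PERF_MAX_STACK_DEPTH). The sketch below shows why a
two-level macro is needed for that to print the number rather than the
macro name. The DEMO_* names are local re-implementations made up for
this example; they are not the definitions from stringify.h.

```c
/* Two-level stringification: the outer macro expands its argument
 * first, the inner one turns the expanded tokens into a string. */
#define DEMO_STRINGIFY_1(x) #x
#define DEMO_STRINGIFY(x)   DEMO_STRINGIFY_1(x)

#define DEMO_MAX_STACK_DEPTH 127

/* Adjacent string literals are concatenated at compile time. */
static const char demo_help[] =
	"Set the maximum stack depth when parsing the callchain. "
	"Default: " DEMO_STRINGIFY(DEMO_MAX_STACK_DEPTH);

/* Using only the inner macro stringifies the unexpanded name. */
static const char demo_single[] = DEMO_STRINGIFY_1(DEMO_MAX_STACK_DEPTH);
```

Here demo_help ends in "Default: 127", while demo_single holds the text
"DEMO_MAX_STACK_DEPTH". Keeping the default in one macro means the help
text cannot drift from the value used to initialize max_stack.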


