All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] perf: fix symbol processing bug and greatly improve performance
@ 2013-05-06 13:43 Waiman Long
  2013-05-07  7:01 ` Ingo Molnar
  2013-05-07  9:30 ` Jiri Olsa
  0 siblings, 2 replies; 13+ messages in thread
From: Waiman Long @ 2013-05-06 13:43 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa, Stephane Eranian, Namhyung Kim
  Cc: Waiman Long, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J

When "perf record" was used on a large machine with a lot of CPUs,
the perf post-processing time could take a lot of minutes and even
hours depending on how large the resulting perf.data file was.

While running AIM7 1500-user high_systime workload on a 80-core x86-64
system with a 3.9 kernel, the workload itself took about 2 minutes
to run and the perf.data file had a size of 1108.746 MB. However,
the post-processing step took more than 10 minutes.

With a gprof-profiled perf binary, the time spent by perf was as
follows:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 96.90    822.10   822.10   192156     0.00     0.00  dsos__find
  0.81    828.96     6.86 172089958     0.00     0.00  rb_next
  0.41    832.44     3.48 48539289     0.00     0.00  rb_erase

So 97% (822 seconds) of the time was spent in a single dsos_find()
function. After analyzing the call-graph data below:

-----------------------------------------------
                0.00  822.12  192156/192156      map__new [6]
[7]     96.9    0.00  822.12  192156         vdso__dso_findnew [7]
              822.10    0.00  192156/192156      dsos__find [8]
                0.01    0.00  192156/192156      dsos__add [62]
                0.01    0.00  192156/192366      dso__new [61]
                0.00    0.00       1/45282525     memdup [31]
                0.00    0.00  192156/192230      dso__set_long_name [91]
-----------------------------------------------
              822.10    0.00  192156/192156      vdso__dso_findnew [7]
[8]     96.9  822.10    0.00  192156         dsos__find [8]
-----------------------------------------------

It was found that the vdso__dso_findnew() function failed to locate
VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
entry at the end for 192156 times. This problem is due to the fact that
there are 2 types of name in the dso entry - short name and long name.
The initial dso__new() adds "[vdso]" to both the short and long names.
After that, vdso__dso_findnew() modifies the long name to something
like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
the long name. As a result, the same vdso entry is duplicated many
time in the dso list. This bug increases memory consumption as well
as slows the symbol processing time to a crawl.

To resolve this problem, the dsos__find() function interface was
modified to enable searching either the long name or the short
name. The vdso__dso_findnew() will now search only the short name
while the other call sites search for the long name as before.

With this change, the cpu time of perf was reduced from 848.38s to
15.77s and dsos__find() only accounted for 0.06% of the total time.

  0.06     15.73     0.01   192151     0.00     0.00  dsos__find

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 tools/perf/util/dso.c  |   10 ++++++++--
 tools/perf/util/dso.h  |    2 +-
 tools/perf/util/vdso.c |    2 +-
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
index 6f7d5a9..3d80f92 100644
--- a/tools/perf/util/dso.c
+++ b/tools/perf/util/dso.c
@@ -513,10 +513,16 @@ void dsos__add(struct list_head *head, struct dso *dso)
 	list_add_tail(&dso->node, head);
 }
 
-struct dso *dsos__find(struct list_head *head, const char *name)
+struct dso *dsos__find(struct list_head *head, const char *name, bool cmp_short)
 {
 	struct dso *pos;
 
+	if (cmp_short) {
+		list_for_each_entry(pos, head, node)
+			if (strcmp(pos->short_name, name) == 0)
+				return pos;
+		return NULL;
+	}
 	list_for_each_entry(pos, head, node)
 		if (strcmp(pos->long_name, name) == 0)
 			return pos;
@@ -525,7 +531,7 @@ struct dso *dsos__find(struct list_head *head, const char *name)
 
 struct dso *__dsos__findnew(struct list_head *head, const char *name)
 {
-	struct dso *dso = dsos__find(head, name);
+	struct dso *dso = dsos__find(head, name, FALSE);
 
 	if (!dso) {
 		dso = dso__new(name);
diff --git a/tools/perf/util/dso.h b/tools/perf/util/dso.h
index 450199a..44c9fdd 100644
--- a/tools/perf/util/dso.h
+++ b/tools/perf/util/dso.h
@@ -133,7 +133,7 @@ struct dso *dso__kernel_findnew(struct machine *machine, const char *name,
 				const char *short_name, int dso_type);
 
 void dsos__add(struct list_head *head, struct dso *dso);
-struct dso *dsos__find(struct list_head *head, const char *name);
+struct dso *dsos__find(struct list_head *head, const char *name, bool cmp_short);
 struct dso *__dsos__findnew(struct list_head *head, const char *name);
 bool __dsos__read_build_ids(struct list_head *head, bool with_hits);
 
diff --git a/tools/perf/util/vdso.c b/tools/perf/util/vdso.c
index e60951f..a8fd73d 100644
--- a/tools/perf/util/vdso.c
+++ b/tools/perf/util/vdso.c
@@ -91,7 +91,7 @@ void vdso__exit(void)
 
 struct dso *vdso__dso_findnew(struct list_head *head)
 {
-	struct dso *dso = dsos__find(head, VDSO__MAP_NAME);
+	struct dso *dso = dsos__find(head, VDSO__MAP_NAME, TRUE);
 
 	if (!dso) {
 		char *file;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
  2013-05-06 13:43 [PATCH] perf: fix symbol processing bug and greatly improve performance Waiman Long
@ 2013-05-07  7:01 ` Ingo Molnar
  2013-05-07 14:19   ` Waiman Long
  2013-05-07  9:30 ` Jiri Olsa
  1 sibling, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2013-05-07  7:01 UTC (permalink / raw)
  To: Waiman Long
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Stephane Eranian,
	Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
	Fr??d??ric Weisbecker


* Waiman Long <Waiman.Long@hp.com> wrote:

> When "perf record" was used on a large machine with a lot of CPUs,
> the perf post-processing time could take a lot of minutes and even
> hours depending on how large the resulting perf.data file was.
> 
> While running AIM7 1500-user high_systime workload on a 80-core x86-64
> system with a 3.9 kernel, the workload itself took about 2 minutes
> to run and the perf.data file had a size of 1108.746 MB. However,
> the post-processing step took more than 10 minutes.
> 
> With a gprof-profiled perf binary, the time spent by perf was as
> follows:
> 
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  96.90    822.10   822.10   192156     0.00     0.00  dsos__find
>   0.81    828.96     6.86 172089958     0.00     0.00  rb_next
>   0.41    832.44     3.48 48539289     0.00     0.00  rb_erase
> 
> So 97% (822 seconds) of the time was spent in a single dsos_find()
> function. After analyzing the call-graph data below:
> 
> -----------------------------------------------
>                 0.00  822.12  192156/192156      map__new [6]
> [7]     96.9    0.00  822.12  192156         vdso__dso_findnew [7]
>               822.10    0.00  192156/192156      dsos__find [8]
>                 0.01    0.00  192156/192156      dsos__add [62]
>                 0.01    0.00  192156/192366      dso__new [61]
>                 0.00    0.00       1/45282525     memdup [31]
>                 0.00    0.00  192156/192230      dso__set_long_name [91]
> -----------------------------------------------
>               822.10    0.00  192156/192156      vdso__dso_findnew [7]
> [8]     96.9  822.10    0.00  192156         dsos__find [8]
> -----------------------------------------------
> 
> It was found that the vdso__dso_findnew() function failed to locate
> VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
> entry at the end for 192156 times. This problem is due to the fact that
> there are 2 types of name in the dso entry - short name and long name.
> The initial dso__new() adds "[vdso]" to both the short and long names.
> After that, vdso__dso_findnew() modifies the long name to something
> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
> the long name. As a result, the same vdso entry is duplicated many
> time in the dso list. This bug increases memory consumption as well
> as slows the symbol processing time to a crawl.
> 
> To resolve this problem, the dsos__find() function interface was
> modified to enable searching either the long name or the short
> name. The vdso__dso_findnew() will now search only the short name
> while the other call sites search for the long name as before.
> 
> With this change, the cpu time of perf was reduced from 848.38s to
> 15.77s and dsos__find() only accounted for 0.06% of the total time.
> 
>   0.06     15.73     0.01   192151     0.00     0.00  dsos__find

Very nice!

I noticed that you used gprof to instrument perf itself on a call graph 
level.

Does this method of profiling perf via perf:

  perf record -g perf report
  perf report

... produce similarly useful call-graph instrumentation for you?

If not or not quite then could you describe the differences? We could use 
that to further improve perf call-graph profiling.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
  2013-05-06 13:43 [PATCH] perf: fix symbol processing bug and greatly improve performance Waiman Long
  2013-05-07  7:01 ` Ingo Molnar
@ 2013-05-07  9:30 ` Jiri Olsa
  2013-05-07 14:40   ` Waiman Long
  2013-05-08 15:44   ` Waiman Long
  1 sibling, 2 replies; 13+ messages in thread
From: Jiri Olsa @ 2013-05-07  9:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Arnaldo Carvalho de Melo, Stephane Eranian, Namhyung Kim,
	Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
	Chandramouleeswaran, Aswin, Norton, Scott J

On Mon, May 06, 2013 at 09:43:53AM -0400, Waiman Long wrote:
> When "perf record" was used on a large machine with a lot of CPUs,
> the perf post-processing time could take a lot of minutes and even
> hours depending on how large the resulting perf.data file was.
> 
> While running AIM7 1500-user high_systime workload on a 80-core x86-64
> system with a 3.9 kernel, the workload itself took about 2 minutes
> to run and the perf.data file had a size of 1108.746 MB. However,
> the post-processing step took more than 10 minutes.
> 
> With a gprof-profiled perf binary, the time spent by perf was as
> follows:
> 
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  96.90    822.10   822.10   192156     0.00     0.00  dsos__find
>   0.81    828.96     6.86 172089958     0.00     0.00  rb_next
>   0.41    832.44     3.48 48539289     0.00     0.00  rb_erase
> 
> So 97% (822 seconds) of the time was spent in a single dsos_find()
> function. After analyzing the call-graph data below:
> 
> -----------------------------------------------
>                 0.00  822.12  192156/192156      map__new [6]
> [7]     96.9    0.00  822.12  192156         vdso__dso_findnew [7]
>               822.10    0.00  192156/192156      dsos__find [8]
>                 0.01    0.00  192156/192156      dsos__add [62]
>                 0.01    0.00  192156/192366      dso__new [61]
>                 0.00    0.00       1/45282525     memdup [31]
>                 0.00    0.00  192156/192230      dso__set_long_name [91]
> -----------------------------------------------
>               822.10    0.00  192156/192156      vdso__dso_findnew [7]
> [8]     96.9  822.10    0.00  192156         dsos__find [8]
> -----------------------------------------------
> 
> It was found that the vdso__dso_findnew() function failed to locate
> VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
> entry at the end for 192156 times. This problem is due to the fact that
> there are 2 types of name in the dso entry - short name and long name.
> The initial dso__new() adds "[vdso]" to both the short and long names.
> After that, vdso__dso_findnew() modifies the long name to something
> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
> the long name. As a result, the same vdso entry is duplicated many
> time in the dso list. This bug increases memory consumption as well
> as slows the symbol processing time to a crawl.

hi,
the issue is there and fix looks ok, thanks!

though I'm not able to get vdso callchains to pop out
even by investigating report with vdso heavy workload.

I'll have a closer look..

thanks,
jirka

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
  2013-05-07  7:01 ` Ingo Molnar
@ 2013-05-07 14:19   ` Waiman Long
  2013-05-07 14:52     ` Waiman Long
  2013-05-07 17:15     ` Ingo Molnar
  0 siblings, 2 replies; 13+ messages in thread
From: Waiman Long @ 2013-05-07 14:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Stephane Eranian,
	Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
	Fr??d??ric Weisbecker

On 05/07/2013 03:01 AM, Ingo Molnar wrote:
> * Waiman Long<Waiman.Long@hp.com>  wrote:
>
>> When "perf record" was used on a large machine with a lot of CPUs,
>> the perf post-processing time could take a lot of minutes and even
>> hours depending on how large the resulting perf.data file was.
>>
>> While running AIM7 1500-user high_systime workload on a 80-core x86-64
>> system with a 3.9 kernel, the workload itself took about 2 minutes
>> to run and the perf.data file had a size of 1108.746 MB. However,
>> the post-processing step took more than 10 minutes.
>>
>> With a gprof-profiled perf binary, the time spent by perf was as
>> follows:
>>
>>    %   cumulative   self              self     total
>>   time   seconds   seconds    calls   s/call   s/call  name
>>   96.90    822.10   822.10   192156     0.00     0.00  dsos__find
>>    0.81    828.96     6.86 172089958     0.00     0.00  rb_next
>>    0.41    832.44     3.48 48539289     0.00     0.00  rb_erase
>>
>> So 97% (822 seconds) of the time was spent in a single dsos_find()
>> function. After analyzing the call-graph data below:
>>
>> -----------------------------------------------
>>                  0.00  822.12  192156/192156      map__new [6]
>> [7]     96.9    0.00  822.12  192156         vdso__dso_findnew [7]
>>                822.10    0.00  192156/192156      dsos__find [8]
>>                  0.01    0.00  192156/192156      dsos__add [62]
>>                  0.01    0.00  192156/192366      dso__new [61]
>>                  0.00    0.00       1/45282525     memdup [31]
>>                  0.00    0.00  192156/192230      dso__set_long_name [91]
>> -----------------------------------------------
>>                822.10    0.00  192156/192156      vdso__dso_findnew [7]
>> [8]     96.9  822.10    0.00  192156         dsos__find [8]
>> -----------------------------------------------
>>
>> It was found that the vdso__dso_findnew() function failed to locate
>> VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
>> entry at the end for 192156 times. This problem is due to the fact that
>> there are 2 types of name in the dso entry - short name and long name.
>> The initial dso__new() adds "[vdso]" to both the short and long names.
>> After that, vdso__dso_findnew() modifies the long name to something
>> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
>> the long name. As a result, the same vdso entry is duplicated many
>> time in the dso list. This bug increases memory consumption as well
>> as slows the symbol processing time to a crawl.
>>
>> To resolve this problem, the dsos__find() function interface was
>> modified to enable searching either the long name or the short
>> name. The vdso__dso_findnew() will now search only the short name
>> while the other call sites search for the long name as before.
>>
>> With this change, the cpu time of perf was reduced from 848.38s to
>> 15.77s and dsos__find() only accounted for 0.06% of the total time.
>>
>>    0.06     15.73     0.01   192151     0.00     0.00  dsos__find
> Very nice!
>
> I noticed that you used gprof to instrument perf itself on a call graph
> level.
>
> Does this method of profiling perf via perf:
>
>    perf record -g perf report
>    perf report
>
> ... produce similarly useful call-graph instrumentation for you?
>
> If not or not quite then could you describe the differences? We could use
> that to further improve perf call-graph profiling.

Thank for the comment.

The slowdown that I was trying to fix was in the "perf record" part of 
the profiling process, not the "perf report" part. I didn't try 
perf-record on perf-record as the performance counters are limited 
resources and I don't want resource conflicts to affect the results. As 
the slow-down was entirely in the user space, I decided to use gprof to 
do the profiling.

One thing I noticed about the gprof instrumentation and perf call-graph 
was that inlined functions were properly identified by gprof, but not 
perf. This makes sense as the compiler produces additional 
instrumentation code and data into the binary that can be used by gprof. 
That is not the case for perf.

I also looked into possible performance bottleneck in the perf-report 
part of the process. Most of the time was spent in sorting and merging 
the call-chain data and there didn't seem to have low hanging fruit for 
performance tuning opportunity.

Please let me know what other tests or data would you like to me to gather.

Best regards,
Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
  2013-05-07  9:30 ` Jiri Olsa
@ 2013-05-07 14:40   ` Waiman Long
  2013-05-08 15:44   ` Waiman Long
  1 sibling, 0 replies; 13+ messages in thread
From: Waiman Long @ 2013-05-07 14:40 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Stephane Eranian, Namhyung Kim,
	Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
	Chandramouleeswaran, Aswin, Norton, Scott J

On 05/07/2013 05:30 AM, Jiri Olsa wrote:
> On Mon, May 06, 2013 at 09:43:53AM -0400, Waiman Long wrote:
>> When "perf record" was used on a large machine with a lot of CPUs,
>> the perf post-processing time could take a lot of minutes and even
>> hours depending on how large the resulting perf.data file was.
>>
>> While running AIM7 1500-user high_systime workload on a 80-core x86-64
>> system with a 3.9 kernel, the workload itself took about 2 minutes
>> to run and the perf.data file had a size of 1108.746 MB. However,
>> the post-processing step took more than 10 minutes.
>>
>> With a gprof-profiled perf binary, the time spent by perf was as
>> follows:
>>
>>    %   cumulative   self              self     total
>>   time   seconds   seconds    calls   s/call   s/call  name
>>   96.90    822.10   822.10   192156     0.00     0.00  dsos__find
>>    0.81    828.96     6.86 172089958     0.00     0.00  rb_next
>>    0.41    832.44     3.48 48539289     0.00     0.00  rb_erase
>>
>> So 97% (822 seconds) of the time was spent in a single dsos_find()
>> function. After analyzing the call-graph data below:
>>
>> -----------------------------------------------
>>                  0.00  822.12  192156/192156      map__new [6]
>> [7]     96.9    0.00  822.12  192156         vdso__dso_findnew [7]
>>                822.10    0.00  192156/192156      dsos__find [8]
>>                  0.01    0.00  192156/192156      dsos__add [62]
>>                  0.01    0.00  192156/192366      dso__new [61]
>>                  0.00    0.00       1/45282525     memdup [31]
>>                  0.00    0.00  192156/192230      dso__set_long_name [91]
>> -----------------------------------------------
>>                822.10    0.00  192156/192156      vdso__dso_findnew [7]
>> [8]     96.9  822.10    0.00  192156         dsos__find [8]
>> -----------------------------------------------
>>
>> It was found that the vdso__dso_findnew() function failed to locate
>> VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
>> entry at the end for 192156 times. This problem is due to the fact that
>> there are 2 types of name in the dso entry - short name and long name.
>> The initial dso__new() adds "[vdso]" to both the short and long names.
>> After that, vdso__dso_findnew() modifies the long name to something
>> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
>> the long name. As a result, the same vdso entry is duplicated many
>> time in the dso list. This bug increases memory consumption as well
>> as slows the symbol processing time to a crawl.
> hi,
> the issue is there and fix looks ok, thanks!
>
> though I'm not able to get vdso callchains to pop out
> even by investigating report with vdso heavy workload.
>
> I'll have a closer look..

The test machine that I used have RHEL 6.4 installed in it with a 
upstream 3.9 kernel layered on top. The kernel config is based on the 
6.4 configuration file with modification to enable the X2APIC option 
needed by the machine. Other than that, I didn't make too much 
modification to the base configuration. I used the "-a -s" option when 
running perf-record.

I don't think the vdso callchains were major part of the workload that I 
tested. I think it is the high number of CPU cores plus the high number 
of users (1500) that cause the performance bottleneck to surface. In a 
smaller machine, those bottlenecks may be much less noticeable. The vdso 
call-chain dominates the post-processsing time because of the need to 
search through the while DSO list for the vdso library which can grow to 
2M+ in my test case.

Regards,
Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
  2013-05-07 14:19   ` Waiman Long
@ 2013-05-07 14:52     ` Waiman Long
  2013-05-07 17:15     ` Ingo Molnar
  1 sibling, 0 replies; 13+ messages in thread
From: Waiman Long @ 2013-05-07 14:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Stephane Eranian,
	Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
	Fr??d??ric Weisbecker

On 05/07/2013 10:19 AM, Waiman Long wrote:
> The slowdown that I was trying to fix was in the "perf record" part of 
> the profiling process, not the "perf report" part. I didn't try 
> perf-record on perf-record as the performance counters are limited 
> resources and I don't want resource conflicts to affect the results. 
> As the slow-down was entirely in the user space, I decided to use 
> gprof to do the profiling.

Just to clarify that the post-processing time that I referred to in the 
patch is the time spent after the workload finishes in the perf-record 
session and the perf command itself quits. I could take a really long 
time for perf to quit without the patch.

Regards,
Longman


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
  2013-05-07 14:19   ` Waiman Long
  2013-05-07 14:52     ` Waiman Long
@ 2013-05-07 17:15     ` Ingo Molnar
  1 sibling, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2013-05-07 17:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Stephane Eranian,
	Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
	Fr??d??ric Weisbecker


* Waiman Long <Waiman.Long@hp.com> wrote:

> On 05/07/2013 03:01 AM, Ingo Molnar wrote:
> >* Waiman Long<Waiman.Long@hp.com>  wrote:
> >
> >>When "perf record" was used on a large machine with a lot of CPUs,
> >>the perf post-processing time could take a lot of minutes and even
> >>hours depending on how large the resulting perf.data file was.
> >>
> >>While running AIM7 1500-user high_systime workload on a 80-core x86-64
> >>system with a 3.9 kernel, the workload itself took about 2 minutes
> >>to run and the perf.data file had a size of 1108.746 MB. However,
> >>the post-processing step took more than 10 minutes.
> >>
> >>With a gprof-profiled perf binary, the time spent by perf was as
> >>follows:
> >>
> >>   %   cumulative   self              self     total
> >>  time   seconds   seconds    calls   s/call   s/call  name
> >>  96.90    822.10   822.10   192156     0.00     0.00  dsos__find
> >>   0.81    828.96     6.86 172089958     0.00     0.00  rb_next
> >>   0.41    832.44     3.48 48539289     0.00     0.00  rb_erase
> >>
> >>So 97% (822 seconds) of the time was spent in a single dsos_find()
> >>function. After analyzing the call-graph data below:
> >>
> >>-----------------------------------------------
> >>                 0.00  822.12  192156/192156      map__new [6]
> >>[7]     96.9    0.00  822.12  192156         vdso__dso_findnew [7]
> >>               822.10    0.00  192156/192156      dsos__find [8]
> >>                 0.01    0.00  192156/192156      dsos__add [62]
> >>                 0.01    0.00  192156/192366      dso__new [61]
> >>                 0.00    0.00       1/45282525     memdup [31]
> >>                 0.00    0.00  192156/192230      dso__set_long_name [91]
> >>-----------------------------------------------
> >>               822.10    0.00  192156/192156      vdso__dso_findnew [7]
> >>[8]     96.9  822.10    0.00  192156         dsos__find [8]
> >>-----------------------------------------------
> >>
> >>It was found that the vdso__dso_findnew() function failed to locate
> >>VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
> >>entry at the end for 192156 times. This problem is due to the fact that
> >>there are 2 types of name in the dso entry - short name and long name.
> >>The initial dso__new() adds "[vdso]" to both the short and long names.
> >>After that, vdso__dso_findnew() modifies the long name to something
> >>like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
> >>the long name. As a result, the same vdso entry is duplicated many
> >>time in the dso list. This bug increases memory consumption as well
> >>as slows the symbol processing time to a crawl.
> >>
> >>To resolve this problem, the dsos__find() function interface was
> >>modified to enable searching either the long name or the short
> >>name. The vdso__dso_findnew() will now search only the short name
> >>while the other call sites search for the long name as before.
> >>
> >>With this change, the cpu time of perf was reduced from 848.38s to
> >>15.77s and dsos__find() only accounted for 0.06% of the total time.
> >>
> >>   0.06     15.73     0.01   192151     0.00     0.00  dsos__find
> >Very nice!
> >
> >I noticed that you used gprof to instrument perf itself on a call graph
> >level.
> >
> >Does this method of profiling perf via perf:
> >
> >   perf record -g perf report
> >   perf report
> >
> >... produce similarly useful call-graph instrumentation for you?
> >
> >If not or not quite then could you describe the differences? We could use
> >that to further improve perf call-graph profiling.
> 
> Thank for the comment.
> 
> The slowdown that I was trying to fix was in the "perf record" part of 
> the profiling process, not the "perf report" part. I didn't try 
> perf-record on perf-record as the performance counters are limited 
> resources and I don't want resource conflicts to affect the results.

ah, ok. In general two instances of cycles-profiling should work just fine 
without any resource conflicts.

But yeah, with "perf record" that's a valid worry and I can see how you 
wanted to not worry about that.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
  2013-05-07  9:30 ` Jiri Olsa
  2013-05-07 14:40   ` Waiman Long
@ 2013-05-08 15:44   ` Waiman Long
  2013-05-09 10:19     ` Jiri Olsa
  1 sibling, 1 reply; 13+ messages in thread
From: Waiman Long @ 2013-05-08 15:44 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Stephane Eranian, Namhyung Kim,
	Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
	Chandramouleeswaran, Aswin, Norton, Scott J

On 05/07/2013 05:30 AM, Jiri Olsa wrote:
> On Mon, May 06, 2013 at 09:43:53AM -0400, Waiman Long wrote:
>> When "perf record" was used on a large machine with a lot of CPUs,
>> the perf post-processing time could take a lot of minutes and even
>> hours depending on how large the resulting perf.data file was.
>>
>> While running AIM7 1500-user high_systime workload on a 80-core x86-64
>> system with a 3.9 kernel, the workload itself took about 2 minutes
>> to run and the perf.data file had a size of 1108.746 MB. However,
>> the post-processing step took more than 10 minutes.
>>
>> With a gprof-profiled perf binary, the time spent by perf was as
>> follows:
>>
>>    %   cumulative   self              self     total
>>   time   seconds   seconds    calls   s/call   s/call  name
>>   96.90    822.10   822.10   192156     0.00     0.00  dsos__find
>>    0.81    828.96     6.86 172089958     0.00     0.00  rb_next
>>    0.41    832.44     3.48 48539289     0.00     0.00  rb_erase
>>
>> So 97% (822 seconds) of the time was spent in a single dsos_find()
>> function. After analyzing the call-graph data below:
>>
>> -----------------------------------------------
>>                  0.00  822.12  192156/192156      map__new [6]
>> [7]     96.9    0.00  822.12  192156         vdso__dso_findnew [7]
>>                822.10    0.00  192156/192156      dsos__find [8]
>>                  0.01    0.00  192156/192156      dsos__add [62]
>>                  0.01    0.00  192156/192366      dso__new [61]
>>                  0.00    0.00       1/45282525     memdup [31]
>>                  0.00    0.00  192156/192230      dso__set_long_name [91]
>> -----------------------------------------------
>>                822.10    0.00  192156/192156      vdso__dso_findnew [7]
>> [8]     96.9  822.10    0.00  192156         dsos__find [8]
>> -----------------------------------------------
>>
>> It was found that the vdso__dso_findnew() function failed to locate
>> VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
>> entry at the end for 192156 times. This problem is due to the fact that
>> there are 2 types of name in the dso entry - short name and long name.
>> The initial dso__new() adds "[vdso]" to both the short and long names.
>> After that, vdso__dso_findnew() modifies the long name to something
>> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
>> the long name. As a result, the same vdso entry is duplicated many
>> time in the dso list. This bug increases memory consumption as well
>> as slows the symbol processing time to a crawl.
> hi,
> the issue is there and fix looks ok, thanks!
>
> though I'm not able to get vdso callchains to pop out
> even by investigating report with vdso heavy workload.
>
> I'll have a closer look..

Is there a chance that the fix will go to v3.10 or have to wait for v3.11?

Regards,
Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
  2013-05-08 15:44   ` Waiman Long
@ 2013-05-09 10:19     ` Jiri Olsa
  2013-05-09 12:52       ` Ingo Molnar
  2013-05-09 14:46       ` Waiman Long
  0 siblings, 2 replies; 13+ messages in thread
From: Jiri Olsa @ 2013-05-09 10:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Arnaldo Carvalho de Melo, Stephane Eranian, Namhyung Kim,
	Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
	Chandramouleeswaran, Aswin, Norton, Scott J

On Wed, May 08, 2013 at 11:44:35AM -0400, Waiman Long wrote:
> On 05/07/2013 05:30 AM, Jiri Olsa wrote:
> >On Mon, May 06, 2013 at 09:43:53AM -0400, Waiman Long wrote:
> >>When "perf record" was used on a large machine with a lot of CPUs,
> >>the perf post-processing time could take a lot of minutes and even
> >>hours depending on how large the resulting perf.data file was.
> >>
> >>While running AIM7 1500-user high_systime workload on a 80-core x86-64
> >>system with a 3.9 kernel, the workload itself took about 2 minutes
> >>to run and the perf.data file had a size of 1108.746 MB. However,
> >>the post-processing step took more than 10 minutes.
> >>
> >>With a gprof-profiled perf binary, the time spent by perf was as
> >>follows:
> >>
> >>   %   cumulative   self              self     total
> >>  time   seconds   seconds    calls   s/call   s/call  name
> >>  96.90    822.10   822.10   192156     0.00     0.00  dsos__find
> >>   0.81    828.96     6.86 172089958     0.00     0.00  rb_next
> >>   0.41    832.44     3.48 48539289     0.00     0.00  rb_erase
> >>
> >>So 97% (822 seconds) of the time was spent in a single dsos_find()
> >>function. After analyzing the call-graph data below:
> >>
> >>-----------------------------------------------
> >>                 0.00  822.12  192156/192156      map__new [6]
> >>[7]     96.9    0.00  822.12  192156         vdso__dso_findnew [7]
> >>               822.10    0.00  192156/192156      dsos__find [8]
> >>                 0.01    0.00  192156/192156      dsos__add [62]
> >>                 0.01    0.00  192156/192366      dso__new [61]
> >>                 0.00    0.00       1/45282525     memdup [31]
> >>                 0.00    0.00  192156/192230      dso__set_long_name [91]
> >>-----------------------------------------------
> >>               822.10    0.00  192156/192156      vdso__dso_findnew [7]
> >>[8]     96.9  822.10    0.00  192156         dsos__find [8]
> >>-----------------------------------------------
> >>
> >>It was found that the vdso__dso_findnew() function failed to locate
> >>VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
> >>entry at the end for 192156 times. This problem is due to the fact that
> >>there are 2 types of name in the dso entry - short name and long name.
> >>The initial dso__new() adds "[vdso]" to both the short and long names.
> >>After that, vdso__dso_findnew() modifies the long name to something
> >>like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
> >>the long name. As a result, the same vdso entry is duplicated many
> >>time in the dso list. This bug increases memory consumption as well
> >>as slows the symbol processing time to a crawl.
> >hi,
> >the issue is there and fix looks ok, thanks!
> >
> >though I'm not able to get vdso callchains to pop out
> >even by investigating report with vdso heavy workload.
> >
> >I'll have a closer look..
> 
> Is there a chance that the fix will go to v3.10 or have to wait for v3.11?

I got this from scripts/checkpatch.pl:

WARNING: line over 80 characters
#104: FILE: tools/perf/util/dso.h:136:
+struct dso *dsos__find(struct list_head *head, const char *name, bool
cmp_short);

otherwise it looks ok:

Acked-by: Jiri Olsa <jolsa@redhat.com>

thanks,
jirka

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
  2013-05-09 10:19     ` Jiri Olsa
@ 2013-05-09 12:52       ` Ingo Molnar
  2013-05-09 14:46       ` Waiman Long
  1 sibling, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2013-05-09 12:52 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Waiman Long, Arnaldo Carvalho de Melo, Stephane Eranian,
	Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J


* Jiri Olsa <jolsa@redhat.com> wrote:

> On Wed, May 08, 2013 at 11:44:35AM -0400, Waiman Long wrote:
> > On 05/07/2013 05:30 AM, Jiri Olsa wrote:
> > >On Mon, May 06, 2013 at 09:43:53AM -0400, Waiman Long wrote:
> > >>When "perf record" was used on a large machine with a lot of CPUs,
> > >>the perf post-processing time could take a lot of minutes and even
> > >>hours depending on how large the resulting perf.data file was.
> > >>
> > >>While running AIM7 1500-user high_systime workload on a 80-core x86-64
> > >>system with a 3.9 kernel, the workload itself took about 2 minutes
> > >>to run and the perf.data file had a size of 1108.746 MB. However,
> > >>the post-processing step took more than 10 minutes.
> > >>
> > >>With a gprof-profiled perf binary, the time spent by perf was as
> > >>follows:
> > >>
> > >>   %   cumulative   self              self     total
> > >>  time   seconds   seconds    calls   s/call   s/call  name
> > >>  96.90    822.10   822.10   192156     0.00     0.00  dsos__find
> > >>   0.81    828.96     6.86 172089958     0.00     0.00  rb_next
> > >>   0.41    832.44     3.48 48539289     0.00     0.00  rb_erase
> > >>
> > >>So 97% (822 seconds) of the time was spent in a single dsos_find()
> > >>function. After analyzing the call-graph data below:
> > >>
> > >>-----------------------------------------------
> > >>                 0.00  822.12  192156/192156      map__new [6]
> > >>[7]     96.9    0.00  822.12  192156         vdso__dso_findnew [7]
> > >>               822.10    0.00  192156/192156      dsos__find [8]
> > >>                 0.01    0.00  192156/192156      dsos__add [62]
> > >>                 0.01    0.00  192156/192366      dso__new [61]
> > >>                 0.00    0.00       1/45282525     memdup [31]
> > >>                 0.00    0.00  192156/192230      dso__set_long_name [91]
> > >>-----------------------------------------------
> > >>               822.10    0.00  192156/192156      vdso__dso_findnew [7]
> > >>[8]     96.9  822.10    0.00  192156         dsos__find [8]
> > >>-----------------------------------------------
> > >>
> > >>It was found that the vdso__dso_findnew() function failed to locate
> > >>VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
> > >>entry at the end for 192156 times. This problem is due to the fact that
> > >>there are 2 types of name in the dso entry - short name and long name.
> > >>The initial dso__new() adds "[vdso]" to both the short and long names.
> > >>After that, vdso__dso_findnew() modifies the long name to something
> > >>like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
> > >>the long name. As a result, the same vdso entry is duplicated many
> > >>time in the dso list. This bug increases memory consumption as well
> > >>as slows the symbol processing time to a crawl.
> > >hi,
> > >the issue is there and fix looks ok, thanks!
> > >
> > >though I'm not able to get vdso callchains to pop out
> > >even by investigating report with vdso heavy workload.
> > >
> > >I'll have a closer look..
> > 
> > Is there a chance that the fix will go to v3.10 or have to wait for v3.11?
> 
> I got this from scripts/checkpatch.pl:
> 
> WARNING: line over 80 characters
> #104: FILE: tools/perf/util/dso.h:136:
> +struct dso *dsos__find(struct list_head *head, const char *name, bool
> cmp_short);

For a prototype line that's OK I think.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
  2013-05-09 10:19     ` Jiri Olsa
  2013-05-09 12:52       ` Ingo Molnar
@ 2013-05-09 14:46       ` Waiman Long
  2013-05-09 15:05         ` David Ahern
  1 sibling, 1 reply; 13+ messages in thread
From: Waiman Long @ 2013-05-09 14:46 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Stephane Eranian, Namhyung Kim,
	Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
	Chandramouleeswaran, Aswin, Norton, Scott J

On 05/09/2013 06:19 AM, Jiri Olsa wrote:
> On Wed, May 08, 2013 at 11:44:35AM -0400, Waiman Long wrote:
>> On 05/07/2013 05:30 AM, Jiri Olsa wrote:
>>> On Mon, May 06, 2013 at 09:43:53AM -0400, Waiman Long wrote:
>>>> When "perf record" was used on a large machine with a lot of CPUs,
>>>> the perf post-processing time could take a lot of minutes and even
>>>> hours depending on how large the resulting perf.data file was.
>>>>
>>>> While running AIM7 1500-user high_systime workload on a 80-core x86-64
>>>> system with a 3.9 kernel, the workload itself took about 2 minutes
>>>> to run and the perf.data file had a size of 1108.746 MB. However,
>>>> the post-processing step took more than 10 minutes.
>>>>
>>>> With a gprof-profiled perf binary, the time spent by perf was as
>>>> follows:
>>>>
>>>>    %   cumulative   self              self     total
>>>>   time   seconds   seconds    calls   s/call   s/call  name
>>>>   96.90    822.10   822.10   192156     0.00     0.00  dsos__find
>>>>    0.81    828.96     6.86 172089958     0.00     0.00  rb_next
>>>>    0.41    832.44     3.48 48539289     0.00     0.00  rb_erase
>>>>
>>>> So 97% (822 seconds) of the time was spent in a single dsos_find()
>>>> function. After analyzing the call-graph data below:
>>>>
>>>> -----------------------------------------------
>>>>                  0.00  822.12  192156/192156      map__new [6]
>>>> [7]     96.9    0.00  822.12  192156         vdso__dso_findnew [7]
>>>>                822.10    0.00  192156/192156      dsos__find [8]
>>>>                  0.01    0.00  192156/192156      dsos__add [62]
>>>>                  0.01    0.00  192156/192366      dso__new [61]
>>>>                  0.00    0.00       1/45282525     memdup [31]
>>>>                  0.00    0.00  192156/192230      dso__set_long_name [91]
>>>> -----------------------------------------------
>>>>                822.10    0.00  192156/192156      vdso__dso_findnew [7]
>>>> [8]     96.9  822.10    0.00  192156         dsos__find [8]
>>>> -----------------------------------------------
>>>>
>>>> It was found that the vdso__dso_findnew() function failed to locate
>>>> VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
>>>> entry at the end for 192156 times. This problem is due to the fact that
>>>> there are 2 types of name in the dso entry - short name and long name.
>>>> The initial dso__new() adds "[vdso]" to both the short and long names.
>>>> After that, vdso__dso_findnew() modifies the long name to something
>>>> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
>>>> the long name. As a result, the same vdso entry is duplicated many
>>>> time in the dso list. This bug increases memory consumption as well
>>>> as slows the symbol processing time to a crawl.
>>> hi,
>>> the issue is there and fix looks ok, thanks!
>>>
>>> though I'm not able to get vdso callchains to pop out
>>> even by investigating report with vdso heavy workload.
>>>
>>> I'll have a closer look..
>> Is there a chance that the fix will go to v3.10 or have to wait for v3.11?
> I got this from scripts/checkpatch.pl:
>
> WARNING: line over 80 characters
> #104: FILE: tools/perf/util/dso.h:136:
> +struct dso *dsos__find(struct list_head *head, const char *name, bool
> cmp_short);
>
> otherwise it looks ok:
>
> Acked-by: Jiri Olsa<jolsa@redhat.com>

I am sorry that I forgot to rerun checkpatch.pl again after changing the 
argument type from int to bool. So I missed this warning.

I had resent an updated patch with the warning fixed. I also made some 
update to the patch description.

Regards,
Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
  2013-05-09 14:46       ` Waiman Long
@ 2013-05-09 15:05         ` David Ahern
  2013-05-09 15:31           ` Waiman Long
  0 siblings, 1 reply; 13+ messages in thread
From: David Ahern @ 2013-05-09 15:05 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Stephane Eranian,
	Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J

On 5/9/13 8:46 AM, Waiman Long wrote:
> I am sorry that I forgot to rerun checkpatch.pl again after changing the
> argument type from int to bool.

You can edit/create .git/hooks/pre-commit and have it run checkpatch.pl 
so you don't forget:

#!/bin/bash
exec git diff --cached | scripts/checkpatch.pl --no-signoff -q -

If it spits out warnings that are ok, add -n to git-commit to bypass.

David


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
  2013-05-09 15:05         ` David Ahern
@ 2013-05-09 15:31           ` Waiman Long
  0 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2013-05-09 15:31 UTC (permalink / raw)
  To: David Ahern
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Stephane Eranian,
	Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J

On 05/09/2013 11:05 AM, David Ahern wrote:
> On 5/9/13 8:46 AM, Waiman Long wrote:
>> I am sorry that I forgot to rerun checkpatch.pl again after changing the
>> argument type from int to bool.
>
> You can edit/create .git/hooks/pre-commit and have it run 
> checkpatch.pl so you don't forget:
>
> #!/bin/bash
> exec git diff --cached | scripts/checkpatch.pl --no-signoff -q -
>
> If it spits out warnings that are ok, add -n to git-commit to bypass.

Thanks for the tip! I will implement that in my development environment.

Regards,
Longman

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2013-05-09 15:31 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-06 13:43 [PATCH] perf: fix symbol processing bug and greatly improve performance Waiman Long
2013-05-07  7:01 ` Ingo Molnar
2013-05-07 14:19   ` Waiman Long
2013-05-07 14:52     ` Waiman Long
2013-05-07 17:15     ` Ingo Molnar
2013-05-07  9:30 ` Jiri Olsa
2013-05-07 14:40   ` Waiman Long
2013-05-08 15:44   ` Waiman Long
2013-05-09 10:19     ` Jiri Olsa
2013-05-09 12:52       ` Ingo Molnar
2013-05-09 14:46       ` Waiman Long
2013-05-09 15:05         ` David Ahern
2013-05-09 15:31           ` Waiman Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.