* [PATCH] perf: fix symbol processing bug and greatly improve performance
@ 2013-05-06 13:43 Waiman Long
2013-05-07 7:01 ` Ingo Molnar
2013-05-07 9:30 ` Jiri Olsa
0 siblings, 2 replies; 13+ messages in thread
From: Waiman Long @ 2013-05-06 13:43 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Jiri Olsa, Stephane Eranian, Namhyung Kim
Cc: Waiman Long, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J
When "perf record" was used on a large machine with a lot of CPUs,
the perf post-processing time could take a lot of minutes and even
hours depending on how large the resulting perf.data file was.
While running AIM7 1500-user high_systime workload on a 80-core x86-64
system with a 3.9 kernel, the workload itself took about 2 minutes
to run and the perf.data file had a size of 1108.746 MB. However,
the post-processing step took more than 10 minutes.
With a gprof-profiled perf binary, the time spent by perf was as
follows:
% cumulative self self total
time seconds seconds calls s/call s/call name
96.90 822.10 822.10 192156 0.00 0.00 dsos__find
0.81 828.96 6.86 172089958 0.00 0.00 rb_next
0.41 832.44 3.48 48539289 0.00 0.00 rb_erase
So 97% (822 seconds) of the time was spent in a single dsos_find()
function. After analyzing the call-graph data below:
-----------------------------------------------
0.00 822.12 192156/192156 map__new [6]
[7] 96.9 0.00 822.12 192156 vdso__dso_findnew [7]
822.10 0.00 192156/192156 dsos__find [8]
0.01 0.00 192156/192156 dsos__add [62]
0.01 0.00 192156/192366 dso__new [61]
0.00 0.00 1/45282525 memdup [31]
0.00 0.00 192156/192230 dso__set_long_name [91]
-----------------------------------------------
822.10 0.00 192156/192156 vdso__dso_findnew [7]
[8] 96.9 822.10 0.00 192156 dsos__find [8]
-----------------------------------------------
It was found that the vdso__dso_findnew() function failed to locate
VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
entry at the end for 192156 times. This problem is due to the fact that
there are 2 types of name in the dso entry - short name and long name.
The initial dso__new() adds "[vdso]" to both the short and long names.
After that, vdso__dso_findnew() modifies the long name to something
like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
the long name. As a result, the same vdso entry is duplicated many
time in the dso list. This bug increases memory consumption as well
as slows the symbol processing time to a crawl.
To resolve this problem, the dsos__find() function interface was
modified to enable searching either the long name or the short
name. The vdso__dso_findnew() will now search only the short name
while the other call sites search for the long name as before.
With this change, the cpu time of perf was reduced from 848.38s to
15.77s and dsos__find() only accounted for 0.06% of the total time.
0.06 15.73 0.01 192151 0.00 0.00 dsos__find
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
tools/perf/util/dso.c | 10 ++++++++--
tools/perf/util/dso.h | 2 +-
tools/perf/util/vdso.c | 2 +-
3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
index 6f7d5a9..3d80f92 100644
--- a/tools/perf/util/dso.c
+++ b/tools/perf/util/dso.c
@@ -513,10 +513,16 @@ void dsos__add(struct list_head *head, struct dso *dso)
list_add_tail(&dso->node, head);
}
-struct dso *dsos__find(struct list_head *head, const char *name)
+struct dso *dsos__find(struct list_head *head, const char *name, bool cmp_short)
{
struct dso *pos;
+ if (cmp_short) {
+ list_for_each_entry(pos, head, node)
+ if (strcmp(pos->short_name, name) == 0)
+ return pos;
+ return NULL;
+ }
list_for_each_entry(pos, head, node)
if (strcmp(pos->long_name, name) == 0)
return pos;
@@ -525,7 +531,7 @@ struct dso *dsos__find(struct list_head *head, const char *name)
struct dso *__dsos__findnew(struct list_head *head, const char *name)
{
- struct dso *dso = dsos__find(head, name);
+ struct dso *dso = dsos__find(head, name, FALSE);
if (!dso) {
dso = dso__new(name);
diff --git a/tools/perf/util/dso.h b/tools/perf/util/dso.h
index 450199a..44c9fdd 100644
--- a/tools/perf/util/dso.h
+++ b/tools/perf/util/dso.h
@@ -133,7 +133,7 @@ struct dso *dso__kernel_findnew(struct machine *machine, const char *name,
const char *short_name, int dso_type);
void dsos__add(struct list_head *head, struct dso *dso);
-struct dso *dsos__find(struct list_head *head, const char *name);
+struct dso *dsos__find(struct list_head *head, const char *name, bool cmp_short);
struct dso *__dsos__findnew(struct list_head *head, const char *name);
bool __dsos__read_build_ids(struct list_head *head, bool with_hits);
diff --git a/tools/perf/util/vdso.c b/tools/perf/util/vdso.c
index e60951f..a8fd73d 100644
--- a/tools/perf/util/vdso.c
+++ b/tools/perf/util/vdso.c
@@ -91,7 +91,7 @@ void vdso__exit(void)
struct dso *vdso__dso_findnew(struct list_head *head)
{
- struct dso *dso = dsos__find(head, VDSO__MAP_NAME);
+ struct dso *dso = dsos__find(head, VDSO__MAP_NAME, TRUE);
if (!dso) {
char *file;
--
1.7.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
2013-05-06 13:43 [PATCH] perf: fix symbol processing bug and greatly improve performance Waiman Long
@ 2013-05-07 7:01 ` Ingo Molnar
2013-05-07 14:19 ` Waiman Long
2013-05-07 9:30 ` Jiri Olsa
1 sibling, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2013-05-07 7:01 UTC (permalink / raw)
To: Waiman Long
Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Stephane Eranian,
Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
Fr??d??ric Weisbecker
* Waiman Long <Waiman.Long@hp.com> wrote:
> When "perf record" was used on a large machine with a lot of CPUs,
> the perf post-processing time could take a lot of minutes and even
> hours depending on how large the resulting perf.data file was.
>
> While running AIM7 1500-user high_systime workload on a 80-core x86-64
> system with a 3.9 kernel, the workload itself took about 2 minutes
> to run and the perf.data file had a size of 1108.746 MB. However,
> the post-processing step took more than 10 minutes.
>
> With a gprof-profiled perf binary, the time spent by perf was as
> follows:
>
> % cumulative self self total
> time seconds seconds calls s/call s/call name
> 96.90 822.10 822.10 192156 0.00 0.00 dsos__find
> 0.81 828.96 6.86 172089958 0.00 0.00 rb_next
> 0.41 832.44 3.48 48539289 0.00 0.00 rb_erase
>
> So 97% (822 seconds) of the time was spent in a single dsos_find()
> function. After analyzing the call-graph data below:
>
> -----------------------------------------------
> 0.00 822.12 192156/192156 map__new [6]
> [7] 96.9 0.00 822.12 192156 vdso__dso_findnew [7]
> 822.10 0.00 192156/192156 dsos__find [8]
> 0.01 0.00 192156/192156 dsos__add [62]
> 0.01 0.00 192156/192366 dso__new [61]
> 0.00 0.00 1/45282525 memdup [31]
> 0.00 0.00 192156/192230 dso__set_long_name [91]
> -----------------------------------------------
> 822.10 0.00 192156/192156 vdso__dso_findnew [7]
> [8] 96.9 822.10 0.00 192156 dsos__find [8]
> -----------------------------------------------
>
> It was found that the vdso__dso_findnew() function failed to locate
> VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
> entry at the end for 192156 times. This problem is due to the fact that
> there are 2 types of name in the dso entry - short name and long name.
> The initial dso__new() adds "[vdso]" to both the short and long names.
> After that, vdso__dso_findnew() modifies the long name to something
> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
> the long name. As a result, the same vdso entry is duplicated many
> time in the dso list. This bug increases memory consumption as well
> as slows the symbol processing time to a crawl.
>
> To resolve this problem, the dsos__find() function interface was
> modified to enable searching either the long name or the short
> name. The vdso__dso_findnew() will now search only the short name
> while the other call sites search for the long name as before.
>
> With this change, the cpu time of perf was reduced from 848.38s to
> 15.77s and dsos__find() only accounted for 0.06% of the total time.
>
> 0.06 15.73 0.01 192151 0.00 0.00 dsos__find
Very nice!
I noticed that you used gprof to instrument perf itself on a call graph
level.
Does this method of profiling perf via perf:
perf record -g perf report
perf report
... produce similarly useful call-graph instrumentation for you?
If not or not quite then could you describe the differences? We could use
that to further improve perf call-graph profiling.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
2013-05-06 13:43 [PATCH] perf: fix symbol processing bug and greatly improve performance Waiman Long
2013-05-07 7:01 ` Ingo Molnar
@ 2013-05-07 9:30 ` Jiri Olsa
2013-05-07 14:40 ` Waiman Long
2013-05-08 15:44 ` Waiman Long
1 sibling, 2 replies; 13+ messages in thread
From: Jiri Olsa @ 2013-05-07 9:30 UTC (permalink / raw)
To: Waiman Long
Cc: Arnaldo Carvalho de Melo, Stephane Eranian, Namhyung Kim,
Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
Chandramouleeswaran, Aswin, Norton, Scott J
On Mon, May 06, 2013 at 09:43:53AM -0400, Waiman Long wrote:
> When "perf record" was used on a large machine with a lot of CPUs,
> the perf post-processing time could take a lot of minutes and even
> hours depending on how large the resulting perf.data file was.
>
> While running AIM7 1500-user high_systime workload on a 80-core x86-64
> system with a 3.9 kernel, the workload itself took about 2 minutes
> to run and the perf.data file had a size of 1108.746 MB. However,
> the post-processing step took more than 10 minutes.
>
> With a gprof-profiled perf binary, the time spent by perf was as
> follows:
>
> % cumulative self self total
> time seconds seconds calls s/call s/call name
> 96.90 822.10 822.10 192156 0.00 0.00 dsos__find
> 0.81 828.96 6.86 172089958 0.00 0.00 rb_next
> 0.41 832.44 3.48 48539289 0.00 0.00 rb_erase
>
> So 97% (822 seconds) of the time was spent in a single dsos_find()
> function. After analyzing the call-graph data below:
>
> -----------------------------------------------
> 0.00 822.12 192156/192156 map__new [6]
> [7] 96.9 0.00 822.12 192156 vdso__dso_findnew [7]
> 822.10 0.00 192156/192156 dsos__find [8]
> 0.01 0.00 192156/192156 dsos__add [62]
> 0.01 0.00 192156/192366 dso__new [61]
> 0.00 0.00 1/45282525 memdup [31]
> 0.00 0.00 192156/192230 dso__set_long_name [91]
> -----------------------------------------------
> 822.10 0.00 192156/192156 vdso__dso_findnew [7]
> [8] 96.9 822.10 0.00 192156 dsos__find [8]
> -----------------------------------------------
>
> It was found that the vdso__dso_findnew() function failed to locate
> VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
> entry at the end for 192156 times. This problem is due to the fact that
> there are 2 types of name in the dso entry - short name and long name.
> The initial dso__new() adds "[vdso]" to both the short and long names.
> After that, vdso__dso_findnew() modifies the long name to something
> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
> the long name. As a result, the same vdso entry is duplicated many
> time in the dso list. This bug increases memory consumption as well
> as slows the symbol processing time to a crawl.
hi,
the issue is there and fix looks ok, thanks!
though I'm not able to get vdso callchains to pop out
even by investigating report with vdso heavy workload.
I'll have a closer look..
thanks,
jirka
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
2013-05-07 7:01 ` Ingo Molnar
@ 2013-05-07 14:19 ` Waiman Long
2013-05-07 14:52 ` Waiman Long
2013-05-07 17:15 ` Ingo Molnar
0 siblings, 2 replies; 13+ messages in thread
From: Waiman Long @ 2013-05-07 14:19 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Stephane Eranian,
Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
Fr??d??ric Weisbecker
On 05/07/2013 03:01 AM, Ingo Molnar wrote:
> * Waiman Long<Waiman.Long@hp.com> wrote:
>
>> When "perf record" was used on a large machine with a lot of CPUs,
>> the perf post-processing time could take a lot of minutes and even
>> hours depending on how large the resulting perf.data file was.
>>
>> While running AIM7 1500-user high_systime workload on a 80-core x86-64
>> system with a 3.9 kernel, the workload itself took about 2 minutes
>> to run and the perf.data file had a size of 1108.746 MB. However,
>> the post-processing step took more than 10 minutes.
>>
>> With a gprof-profiled perf binary, the time spent by perf was as
>> follows:
>>
>> % cumulative self self total
>> time seconds seconds calls s/call s/call name
>> 96.90 822.10 822.10 192156 0.00 0.00 dsos__find
>> 0.81 828.96 6.86 172089958 0.00 0.00 rb_next
>> 0.41 832.44 3.48 48539289 0.00 0.00 rb_erase
>>
>> So 97% (822 seconds) of the time was spent in a single dsos_find()
>> function. After analyzing the call-graph data below:
>>
>> -----------------------------------------------
>> 0.00 822.12 192156/192156 map__new [6]
>> [7] 96.9 0.00 822.12 192156 vdso__dso_findnew [7]
>> 822.10 0.00 192156/192156 dsos__find [8]
>> 0.01 0.00 192156/192156 dsos__add [62]
>> 0.01 0.00 192156/192366 dso__new [61]
>> 0.00 0.00 1/45282525 memdup [31]
>> 0.00 0.00 192156/192230 dso__set_long_name [91]
>> -----------------------------------------------
>> 822.10 0.00 192156/192156 vdso__dso_findnew [7]
>> [8] 96.9 822.10 0.00 192156 dsos__find [8]
>> -----------------------------------------------
>>
>> It was found that the vdso__dso_findnew() function failed to locate
>> VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
>> entry at the end for 192156 times. This problem is due to the fact that
>> there are 2 types of name in the dso entry - short name and long name.
>> The initial dso__new() adds "[vdso]" to both the short and long names.
>> After that, vdso__dso_findnew() modifies the long name to something
>> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
>> the long name. As a result, the same vdso entry is duplicated many
>> time in the dso list. This bug increases memory consumption as well
>> as slows the symbol processing time to a crawl.
>>
>> To resolve this problem, the dsos__find() function interface was
>> modified to enable searching either the long name or the short
>> name. The vdso__dso_findnew() will now search only the short name
>> while the other call sites search for the long name as before.
>>
>> With this change, the cpu time of perf was reduced from 848.38s to
>> 15.77s and dsos__find() only accounted for 0.06% of the total time.
>>
>> 0.06 15.73 0.01 192151 0.00 0.00 dsos__find
> Very nice!
>
> I noticed that you used gprof to instrument perf itself on a call graph
> level.
>
> Does this method of profiling perf via perf:
>
> perf record -g perf report
> perf report
>
> ... produce similarly useful call-graph instrumentation for you?
>
> If not or not quite then could you describe the differences? We could use
> that to further improve perf call-graph profiling.
Thank for the comment.
The slowdown that I was trying to fix was in the "perf record" part of
the profiling process, not the "perf report" part. I didn't try
perf-record on perf-record as the performance counters are limited
resources and I don't want resource conflicts to affect the results. As
the slow-down was entirely in the user space, I decided to use gprof to
do the profiling.
One thing I noticed about the gprof instrumentation and perf call-graph
was that inlined functions were properly identified by gprof, but not
perf. This makes sense as the compiler produces additional
instrumentation code and data into the binary that can be used by gprof.
That is not the case for perf.
I also looked into possible performance bottleneck in the perf-report
part of the process. Most of the time was spent in sorting and merging
the call-chain data and there didn't seem to have low hanging fruit for
performance tuning opportunity.
Please let me know what other tests or data would you like to me to gather.
Best regards,
Longman
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
2013-05-07 9:30 ` Jiri Olsa
@ 2013-05-07 14:40 ` Waiman Long
2013-05-08 15:44 ` Waiman Long
1 sibling, 0 replies; 13+ messages in thread
From: Waiman Long @ 2013-05-07 14:40 UTC (permalink / raw)
To: Jiri Olsa
Cc: Arnaldo Carvalho de Melo, Stephane Eranian, Namhyung Kim,
Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
Chandramouleeswaran, Aswin, Norton, Scott J
On 05/07/2013 05:30 AM, Jiri Olsa wrote:
> On Mon, May 06, 2013 at 09:43:53AM -0400, Waiman Long wrote:
>> When "perf record" was used on a large machine with a lot of CPUs,
>> the perf post-processing time could take a lot of minutes and even
>> hours depending on how large the resulting perf.data file was.
>>
>> While running AIM7 1500-user high_systime workload on a 80-core x86-64
>> system with a 3.9 kernel, the workload itself took about 2 minutes
>> to run and the perf.data file had a size of 1108.746 MB. However,
>> the post-processing step took more than 10 minutes.
>>
>> With a gprof-profiled perf binary, the time spent by perf was as
>> follows:
>>
>> % cumulative self self total
>> time seconds seconds calls s/call s/call name
>> 96.90 822.10 822.10 192156 0.00 0.00 dsos__find
>> 0.81 828.96 6.86 172089958 0.00 0.00 rb_next
>> 0.41 832.44 3.48 48539289 0.00 0.00 rb_erase
>>
>> So 97% (822 seconds) of the time was spent in a single dsos_find()
>> function. After analyzing the call-graph data below:
>>
>> -----------------------------------------------
>> 0.00 822.12 192156/192156 map__new [6]
>> [7] 96.9 0.00 822.12 192156 vdso__dso_findnew [7]
>> 822.10 0.00 192156/192156 dsos__find [8]
>> 0.01 0.00 192156/192156 dsos__add [62]
>> 0.01 0.00 192156/192366 dso__new [61]
>> 0.00 0.00 1/45282525 memdup [31]
>> 0.00 0.00 192156/192230 dso__set_long_name [91]
>> -----------------------------------------------
>> 822.10 0.00 192156/192156 vdso__dso_findnew [7]
>> [8] 96.9 822.10 0.00 192156 dsos__find [8]
>> -----------------------------------------------
>>
>> It was found that the vdso__dso_findnew() function failed to locate
>> VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
>> entry at the end for 192156 times. This problem is due to the fact that
>> there are 2 types of name in the dso entry - short name and long name.
>> The initial dso__new() adds "[vdso]" to both the short and long names.
>> After that, vdso__dso_findnew() modifies the long name to something
>> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
>> the long name. As a result, the same vdso entry is duplicated many
>> time in the dso list. This bug increases memory consumption as well
>> as slows the symbol processing time to a crawl.
> hi,
> the issue is there and fix looks ok, thanks!
>
> though I'm not able to get vdso callchains to pop out
> even by investigating report with vdso heavy workload.
>
> I'll have a closer look..
The test machine that I used have RHEL 6.4 installed in it with a
upstream 3.9 kernel layered on top. The kernel config is based on the
6.4 configuration file with modification to enable the X2APIC option
needed by the machine. Other than that, I didn't make too much
modification to the base configuration. I used the "-a -s" option when
running perf-record.
I don't think the vdso callchains were major part of the workload that I
tested. I think it is the high number of CPU cores plus the high number
of users (1500) that cause the performance bottleneck to surface. In a
smaller machine, those bottlenecks may be much less noticeable. The vdso
call-chain dominates the post-processsing time because of the need to
search through the while DSO list for the vdso library which can grow to
2M+ in my test case.
Regards,
Longman
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
2013-05-07 14:19 ` Waiman Long
@ 2013-05-07 14:52 ` Waiman Long
2013-05-07 17:15 ` Ingo Molnar
1 sibling, 0 replies; 13+ messages in thread
From: Waiman Long @ 2013-05-07 14:52 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Stephane Eranian,
Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
Fr??d??ric Weisbecker
On 05/07/2013 10:19 AM, Waiman Long wrote:
> The slowdown that I was trying to fix was in the "perf record" part of
> the profiling process, not the "perf report" part. I didn't try
> perf-record on perf-record as the performance counters are limited
> resources and I don't want resource conflicts to affect the results.
> As the slow-down was entirely in the user space, I decided to use
> gprof to do the profiling.
Just to clarify that the post-processing time that I referred to in the
patch is the time spent after the workload finishes in the perf-record
session and the perf command itself quits. I could take a really long
time for perf to quit without the patch.
Regards,
Longman
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
2013-05-07 14:19 ` Waiman Long
2013-05-07 14:52 ` Waiman Long
@ 2013-05-07 17:15 ` Ingo Molnar
1 sibling, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2013-05-07 17:15 UTC (permalink / raw)
To: Waiman Long
Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Stephane Eranian,
Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
Fr??d??ric Weisbecker
* Waiman Long <Waiman.Long@hp.com> wrote:
> On 05/07/2013 03:01 AM, Ingo Molnar wrote:
> >* Waiman Long<Waiman.Long@hp.com> wrote:
> >
> >>When "perf record" was used on a large machine with a lot of CPUs,
> >>the perf post-processing time could take a lot of minutes and even
> >>hours depending on how large the resulting perf.data file was.
> >>
> >>While running AIM7 1500-user high_systime workload on a 80-core x86-64
> >>system with a 3.9 kernel, the workload itself took about 2 minutes
> >>to run and the perf.data file had a size of 1108.746 MB. However,
> >>the post-processing step took more than 10 minutes.
> >>
> >>With a gprof-profiled perf binary, the time spent by perf was as
> >>follows:
> >>
> >> % cumulative self self total
> >> time seconds seconds calls s/call s/call name
> >> 96.90 822.10 822.10 192156 0.00 0.00 dsos__find
> >> 0.81 828.96 6.86 172089958 0.00 0.00 rb_next
> >> 0.41 832.44 3.48 48539289 0.00 0.00 rb_erase
> >>
> >>So 97% (822 seconds) of the time was spent in a single dsos_find()
> >>function. After analyzing the call-graph data below:
> >>
> >>-----------------------------------------------
> >> 0.00 822.12 192156/192156 map__new [6]
> >>[7] 96.9 0.00 822.12 192156 vdso__dso_findnew [7]
> >> 822.10 0.00 192156/192156 dsos__find [8]
> >> 0.01 0.00 192156/192156 dsos__add [62]
> >> 0.01 0.00 192156/192366 dso__new [61]
> >> 0.00 0.00 1/45282525 memdup [31]
> >> 0.00 0.00 192156/192230 dso__set_long_name [91]
> >>-----------------------------------------------
> >> 822.10 0.00 192156/192156 vdso__dso_findnew [7]
> >>[8] 96.9 822.10 0.00 192156 dsos__find [8]
> >>-----------------------------------------------
> >>
> >>It was found that the vdso__dso_findnew() function failed to locate
> >>VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
> >>entry at the end for 192156 times. This problem is due to the fact that
> >>there are 2 types of name in the dso entry - short name and long name.
> >>The initial dso__new() adds "[vdso]" to both the short and long names.
> >>After that, vdso__dso_findnew() modifies the long name to something
> >>like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
> >>the long name. As a result, the same vdso entry is duplicated many
> >>time in the dso list. This bug increases memory consumption as well
> >>as slows the symbol processing time to a crawl.
> >>
> >>To resolve this problem, the dsos__find() function interface was
> >>modified to enable searching either the long name or the short
> >>name. The vdso__dso_findnew() will now search only the short name
> >>while the other call sites search for the long name as before.
> >>
> >>With this change, the cpu time of perf was reduced from 848.38s to
> >>15.77s and dsos__find() only accounted for 0.06% of the total time.
> >>
> >> 0.06 15.73 0.01 192151 0.00 0.00 dsos__find
> >Very nice!
> >
> >I noticed that you used gprof to instrument perf itself on a call graph
> >level.
> >
> >Does this method of profiling perf via perf:
> >
> > perf record -g perf report
> > perf report
> >
> >... produce similarly useful call-graph instrumentation for you?
> >
> >If not or not quite then could you describe the differences? We could use
> >that to further improve perf call-graph profiling.
>
> Thank for the comment.
>
> The slowdown that I was trying to fix was in the "perf record" part of
> the profiling process, not the "perf report" part. I didn't try
> perf-record on perf-record as the performance counters are limited
> resources and I don't want resource conflicts to affect the results.
ah, ok. In general two instances of cycles-profiling should work just fine
without any resource conflicts.
But yeah, with "perf record" that's a valid worry and I can see how you
wanted to not worry about that.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
2013-05-07 9:30 ` Jiri Olsa
2013-05-07 14:40 ` Waiman Long
@ 2013-05-08 15:44 ` Waiman Long
2013-05-09 10:19 ` Jiri Olsa
1 sibling, 1 reply; 13+ messages in thread
From: Waiman Long @ 2013-05-08 15:44 UTC (permalink / raw)
To: Jiri Olsa
Cc: Arnaldo Carvalho de Melo, Stephane Eranian, Namhyung Kim,
Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
Chandramouleeswaran, Aswin, Norton, Scott J
On 05/07/2013 05:30 AM, Jiri Olsa wrote:
> On Mon, May 06, 2013 at 09:43:53AM -0400, Waiman Long wrote:
>> When "perf record" was used on a large machine with a lot of CPUs,
>> the perf post-processing time could take a lot of minutes and even
>> hours depending on how large the resulting perf.data file was.
>>
>> While running AIM7 1500-user high_systime workload on a 80-core x86-64
>> system with a 3.9 kernel, the workload itself took about 2 minutes
>> to run and the perf.data file had a size of 1108.746 MB. However,
>> the post-processing step took more than 10 minutes.
>>
>> With a gprof-profiled perf binary, the time spent by perf was as
>> follows:
>>
>> % cumulative self self total
>> time seconds seconds calls s/call s/call name
>> 96.90 822.10 822.10 192156 0.00 0.00 dsos__find
>> 0.81 828.96 6.86 172089958 0.00 0.00 rb_next
>> 0.41 832.44 3.48 48539289 0.00 0.00 rb_erase
>>
>> So 97% (822 seconds) of the time was spent in a single dsos_find()
>> function. After analyzing the call-graph data below:
>>
>> -----------------------------------------------
>> 0.00 822.12 192156/192156 map__new [6]
>> [7] 96.9 0.00 822.12 192156 vdso__dso_findnew [7]
>> 822.10 0.00 192156/192156 dsos__find [8]
>> 0.01 0.00 192156/192156 dsos__add [62]
>> 0.01 0.00 192156/192366 dso__new [61]
>> 0.00 0.00 1/45282525 memdup [31]
>> 0.00 0.00 192156/192230 dso__set_long_name [91]
>> -----------------------------------------------
>> 822.10 0.00 192156/192156 vdso__dso_findnew [7]
>> [8] 96.9 822.10 0.00 192156 dsos__find [8]
>> -----------------------------------------------
>>
>> It was found that the vdso__dso_findnew() function failed to locate
>> VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
>> entry at the end for 192156 times. This problem is due to the fact that
>> there are 2 types of name in the dso entry - short name and long name.
>> The initial dso__new() adds "[vdso]" to both the short and long names.
>> After that, vdso__dso_findnew() modifies the long name to something
>> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
>> the long name. As a result, the same vdso entry is duplicated many
>> time in the dso list. This bug increases memory consumption as well
>> as slows the symbol processing time to a crawl.
> hi,
> the issue is there and fix looks ok, thanks!
>
> though I'm not able to get vdso callchains to pop out
> even by investigating report with vdso heavy workload.
>
> I'll have a closer look..
Is there a chance that the fix will go to v3.10 or have to wait for v3.11?
Regards,
Longman
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
2013-05-08 15:44 ` Waiman Long
@ 2013-05-09 10:19 ` Jiri Olsa
2013-05-09 12:52 ` Ingo Molnar
2013-05-09 14:46 ` Waiman Long
0 siblings, 2 replies; 13+ messages in thread
From: Jiri Olsa @ 2013-05-09 10:19 UTC (permalink / raw)
To: Waiman Long
Cc: Arnaldo Carvalho de Melo, Stephane Eranian, Namhyung Kim,
Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
Chandramouleeswaran, Aswin, Norton, Scott J
On Wed, May 08, 2013 at 11:44:35AM -0400, Waiman Long wrote:
> On 05/07/2013 05:30 AM, Jiri Olsa wrote:
> >On Mon, May 06, 2013 at 09:43:53AM -0400, Waiman Long wrote:
> >>When "perf record" was used on a large machine with a lot of CPUs,
> >>the perf post-processing time could take a lot of minutes and even
> >>hours depending on how large the resulting perf.data file was.
> >>
> >>While running AIM7 1500-user high_systime workload on a 80-core x86-64
> >>system with a 3.9 kernel, the workload itself took about 2 minutes
> >>to run and the perf.data file had a size of 1108.746 MB. However,
> >>the post-processing step took more than 10 minutes.
> >>
> >>With a gprof-profiled perf binary, the time spent by perf was as
> >>follows:
> >>
> >> % cumulative self self total
> >> time seconds seconds calls s/call s/call name
> >> 96.90 822.10 822.10 192156 0.00 0.00 dsos__find
> >> 0.81 828.96 6.86 172089958 0.00 0.00 rb_next
> >> 0.41 832.44 3.48 48539289 0.00 0.00 rb_erase
> >>
> >>So 97% (822 seconds) of the time was spent in a single dsos_find()
> >>function. After analyzing the call-graph data below:
> >>
> >>-----------------------------------------------
> >> 0.00 822.12 192156/192156 map__new [6]
> >>[7] 96.9 0.00 822.12 192156 vdso__dso_findnew [7]
> >> 822.10 0.00 192156/192156 dsos__find [8]
> >> 0.01 0.00 192156/192156 dsos__add [62]
> >> 0.01 0.00 192156/192366 dso__new [61]
> >> 0.00 0.00 1/45282525 memdup [31]
> >> 0.00 0.00 192156/192230 dso__set_long_name [91]
> >>-----------------------------------------------
> >> 822.10 0.00 192156/192156 vdso__dso_findnew [7]
> >>[8] 96.9 822.10 0.00 192156 dsos__find [8]
> >>-----------------------------------------------
> >>
> >>It was found that the vdso__dso_findnew() function failed to locate
> >>VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
> >>entry at the end for 192156 times. This problem is due to the fact that
> >>there are 2 types of name in the dso entry - short name and long name.
> >>The initial dso__new() adds "[vdso]" to both the short and long names.
> >>After that, vdso__dso_findnew() modifies the long name to something
> >>like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
> >>the long name. As a result, the same vdso entry is duplicated many
> >>time in the dso list. This bug increases memory consumption as well
> >>as slows the symbol processing time to a crawl.
> >hi,
> >the issue is there and fix looks ok, thanks!
> >
> >though I'm not able to get vdso callchains to pop out
> >even by investigating report with vdso heavy workload.
> >
> >I'll have a closer look..
>
> Is there a chance that the fix will go to v3.10 or have to wait for v3.11?
I got this from scripts/checkpatch.pl:
WARNING: line over 80 characters
#104: FILE: tools/perf/util/dso.h:136:
+struct dso *dsos__find(struct list_head *head, const char *name, bool
cmp_short);
otherwise it looks ok:
Acked-by: Jiri Olsa <jolsa@redhat.com>
thanks,
jirka
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
2013-05-09 10:19 ` Jiri Olsa
@ 2013-05-09 12:52 ` Ingo Molnar
2013-05-09 14:46 ` Waiman Long
1 sibling, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2013-05-09 12:52 UTC (permalink / raw)
To: Jiri Olsa
Cc: Waiman Long, Arnaldo Carvalho de Melo, Stephane Eranian,
Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J
* Jiri Olsa <jolsa@redhat.com> wrote:
> On Wed, May 08, 2013 at 11:44:35AM -0400, Waiman Long wrote:
> > On 05/07/2013 05:30 AM, Jiri Olsa wrote:
> > >On Mon, May 06, 2013 at 09:43:53AM -0400, Waiman Long wrote:
> > >>When "perf record" was used on a large machine with a lot of CPUs,
> > >>the perf post-processing time could take a lot of minutes and even
> > >>hours depending on how large the resulting perf.data file was.
> > >>
> > >>While running AIM7 1500-user high_systime workload on a 80-core x86-64
> > >>system with a 3.9 kernel, the workload itself took about 2 minutes
> > >>to run and the perf.data file had a size of 1108.746 MB. However,
> > >>the post-processing step took more than 10 minutes.
> > >>
> > >>With a gprof-profiled perf binary, the time spent by perf was as
> > >>follows:
> > >>
> > >> % cumulative self self total
> > >> time seconds seconds calls s/call s/call name
> > >> 96.90 822.10 822.10 192156 0.00 0.00 dsos__find
> > >> 0.81 828.96 6.86 172089958 0.00 0.00 rb_next
> > >> 0.41 832.44 3.48 48539289 0.00 0.00 rb_erase
> > >>
> > >>So 97% (822 seconds) of the time was spent in a single dsos_find()
> > >>function. After analyzing the call-graph data below:
> > >>
> > >>-----------------------------------------------
> > >> 0.00 822.12 192156/192156 map__new [6]
> > >>[7] 96.9 0.00 822.12 192156 vdso__dso_findnew [7]
> > >> 822.10 0.00 192156/192156 dsos__find [8]
> > >> 0.01 0.00 192156/192156 dsos__add [62]
> > >> 0.01 0.00 192156/192366 dso__new [61]
> > >> 0.00 0.00 1/45282525 memdup [31]
> > >> 0.00 0.00 192156/192230 dso__set_long_name [91]
> > >>-----------------------------------------------
> > >> 822.10 0.00 192156/192156 vdso__dso_findnew [7]
> > >>[8] 96.9 822.10 0.00 192156 dsos__find [8]
> > >>-----------------------------------------------
> > >>
> > >>It was found that the vdso__dso_findnew() function failed to locate
> > >>VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
> > >>entry at the end for 192156 times. This problem is due to the fact that
> > >>there are 2 types of name in the dso entry - short name and long name.
> > >>The initial dso__new() adds "[vdso]" to both the short and long names.
> > >>After that, vdso__dso_findnew() modifies the long name to something
> > >>like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
> > >>the long name. As a result, the same vdso entry is duplicated many
> > >>time in the dso list. This bug increases memory consumption as well
> > >>as slows the symbol processing time to a crawl.
> > >hi,
> > >the issue is there and fix looks ok, thanks!
> > >
> > >though I'm not able to get vdso callchains to pop out
> > >even by investigating report with vdso heavy workload.
> > >
> > >I'll have a closer look..
> >
> > Is there a chance that the fix will go to v3.10 or have to wait for v3.11?
>
> I got this from scripts/checkpatch.pl:
>
> WARNING: line over 80 characters
> #104: FILE: tools/perf/util/dso.h:136:
> +struct dso *dsos__find(struct list_head *head, const char *name, bool
> cmp_short);
For a prototype line that's OK I think.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
2013-05-09 10:19 ` Jiri Olsa
2013-05-09 12:52 ` Ingo Molnar
@ 2013-05-09 14:46 ` Waiman Long
2013-05-09 15:05 ` David Ahern
1 sibling, 1 reply; 13+ messages in thread
From: Waiman Long @ 2013-05-09 14:46 UTC (permalink / raw)
To: Jiri Olsa
Cc: Arnaldo Carvalho de Melo, Stephane Eranian, Namhyung Kim,
Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
Chandramouleeswaran, Aswin, Norton, Scott J
On 05/09/2013 06:19 AM, Jiri Olsa wrote:
> On Wed, May 08, 2013 at 11:44:35AM -0400, Waiman Long wrote:
>> On 05/07/2013 05:30 AM, Jiri Olsa wrote:
>>> On Mon, May 06, 2013 at 09:43:53AM -0400, Waiman Long wrote:
>>>> When "perf record" was used on a large machine with a lot of CPUs,
>>>> the perf post-processing time could take a lot of minutes and even
>>>> hours depending on how large the resulting perf.data file was.
>>>>
>>>> While running AIM7 1500-user high_systime workload on a 80-core x86-64
>>>> system with a 3.9 kernel, the workload itself took about 2 minutes
>>>> to run and the perf.data file had a size of 1108.746 MB. However,
>>>> the post-processing step took more than 10 minutes.
>>>>
>>>> With a gprof-profiled perf binary, the time spent by perf was as
>>>> follows:
>>>>
>>>> % cumulative self self total
>>>> time seconds seconds calls s/call s/call name
>>>> 96.90 822.10 822.10 192156 0.00 0.00 dsos__find
>>>> 0.81 828.96 6.86 172089958 0.00 0.00 rb_next
>>>> 0.41 832.44 3.48 48539289 0.00 0.00 rb_erase
>>>>
>>>> So 97% (822 seconds) of the time was spent in a single dsos_find()
>>>> function. After analyzing the call-graph data below:
>>>>
>>>> -----------------------------------------------
>>>> 0.00 822.12 192156/192156 map__new [6]
>>>> [7] 96.9 0.00 822.12 192156 vdso__dso_findnew [7]
>>>> 822.10 0.00 192156/192156 dsos__find [8]
>>>> 0.01 0.00 192156/192156 dsos__add [62]
>>>> 0.01 0.00 192156/192366 dso__new [61]
>>>> 0.00 0.00 1/45282525 memdup [31]
>>>> 0.00 0.00 192156/192230 dso__set_long_name [91]
>>>> -----------------------------------------------
>>>> 822.10 0.00 192156/192156 vdso__dso_findnew [7]
>>>> [8] 96.9 822.10 0.00 192156 dsos__find [8]
>>>> -----------------------------------------------
>>>>
>>>> It was found that the vdso__dso_findnew() function failed to locate
>>>> VDSO__MAP_NAME ("[vdso]") in the dso list and have to insert a new
>>>> entry at the end for 192156 times. This problem is due to the fact that
>>>> there are 2 types of name in the dso entry - short name and long name.
>>>> The initial dso__new() adds "[vdso]" to both the short and long names.
>>>> After that, vdso__dso_findnew() modifies the long name to something
>>>> like /tmp/perf-vdso.so-NoXkDj. The dsos__find() function only compares
>>>> the long name. As a result, the same vdso entry is duplicated many
>>>> time in the dso list. This bug increases memory consumption as well
>>>> as slows the symbol processing time to a crawl.
>>> hi,
>>> the issue is there and fix looks ok, thanks!
>>>
>>> though I'm not able to get vdso callchains to pop out
>>> even by investigating report with vdso heavy workload.
>>>
>>> I'll have a closer look..
>> Is there a chance that the fix will go to v3.10 or have to wait for v3.11?
> I got this from scripts/checkpatch.pl:
>
> WARNING: line over 80 characters
> #104: FILE: tools/perf/util/dso.h:136:
> +struct dso *dsos__find(struct list_head *head, const char *name, bool
> cmp_short);
>
> otherwise it looks ok:
>
> Acked-by: Jiri Olsa<jolsa@redhat.com>
I am sorry that I forgot to rerun checkpatch.pl again after changing the
argument type from int to bool. So I missed this warning.
I had resent an updated patch with the warning fixed. I also made some
update to the patch description.
Regards,
Longman
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
2013-05-09 14:46 ` Waiman Long
@ 2013-05-09 15:05 ` David Ahern
2013-05-09 15:31 ` Waiman Long
0 siblings, 1 reply; 13+ messages in thread
From: David Ahern @ 2013-05-09 15:05 UTC (permalink / raw)
To: Waiman Long
Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Stephane Eranian,
Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J
On 5/9/13 8:46 AM, Waiman Long wrote:
> I am sorry that I forgot to rerun checkpatch.pl again after changing the
> argument type from int to bool.
You can edit/create .git/hooks/pre-commit and have it run checkpatch.pl
so you don't forget:
#!/bin/bash
exec git diff --cached | scripts/checkpatch.pl --no-signoff -q -
If it spits out warnings that are ok, add -n to git-commit to bypass.
David
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] perf: fix symbol processing bug and greatly improve performance
2013-05-09 15:05 ` David Ahern
@ 2013-05-09 15:31 ` Waiman Long
0 siblings, 0 replies; 13+ messages in thread
From: Waiman Long @ 2013-05-09 15:31 UTC (permalink / raw)
To: David Ahern
Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Stephane Eranian,
Namhyung Kim, Peter Zijlstra, Paul Mackerras, Ingo Molnar,
linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J
On 05/09/2013 11:05 AM, David Ahern wrote:
> On 5/9/13 8:46 AM, Waiman Long wrote:
>> I am sorry that I forgot to rerun checkpatch.pl again after changing the
>> argument type from int to bool.
>
> You can edit/create .git/hooks/pre-commit and have it run
> checkpatch.pl so you don't forget:
>
> #!/bin/bash
> exec git diff --cached | scripts/checkpatch.pl --no-signoff -q -
>
> If it spits out warnings that are ok, add -n to git-commit to bypass.
Thanks for the tip! I will implement that in my development environment.
Regards,
Longman
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2013-05-09 15:31 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-06 13:43 [PATCH] perf: fix symbol processing bug and greatly improve performance Waiman Long
2013-05-07 7:01 ` Ingo Molnar
2013-05-07 14:19 ` Waiman Long
2013-05-07 14:52 ` Waiman Long
2013-05-07 17:15 ` Ingo Molnar
2013-05-07 9:30 ` Jiri Olsa
2013-05-07 14:40 ` Waiman Long
2013-05-08 15:44 ` Waiman Long
2013-05-09 10:19 ` Jiri Olsa
2013-05-09 12:52 ` Ingo Molnar
2013-05-09 14:46 ` Waiman Long
2013-05-09 15:05 ` David Ahern
2013-05-09 15:31 ` Waiman Long
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.