* [PATCH v2 0/1] perf mem: improves DSO long names search speed with RB tree
@ 2014-09-16 19:08 Waiman Long
2014-09-16 19:08 ` [PATCH v2] " Waiman Long
0 siblings, 1 reply; 6+ messages in thread
From: Waiman Long @ 2014-09-16 19:08 UTC (permalink / raw)
To: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo
Cc: linux-kernel, Scott J Norton, Don Zickus, Jiri Olsa,
Adrian Hunter, Waiman Long
v1->v2:
- Rename DSO longname RBtree find function to segregate its two different
uses of searching and linking DSO into RB tree.
This patch addes a RBtree to sort the DSO structures of the perf tool
by its long name so that searching for a match will be much quicker
especially if a large number of DSOs are beining profiled.
Waiman Long (1):
perf mem: improves DSO long names search speed with RB tree
tools/perf/util/dso.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++--
tools/perf/util/dso.h | 1 +
2 files changed, 84 insertions(+), 4 deletions(-)
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH v2] perf mem: improves DSO long names search speed with RB tree
2014-09-16 19:08 [PATCH v2 0/1] perf mem: improves DSO long names search speed with RB tree Waiman Long
@ 2014-09-16 19:08 ` Waiman Long
2014-09-17 6:22 ` Adrian Hunter
2014-09-17 14:51 ` Arnaldo Carvalho de Melo
0 siblings, 2 replies; 6+ messages in thread
From: Waiman Long @ 2014-09-16 19:08 UTC (permalink / raw)
To: Peter Zijlstra, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo
Cc: linux-kernel, Scott J Norton, Don Zickus, Jiri Olsa,
Adrian Hunter, Waiman Long
With workload that spawns and destroys many threads and processes,
it was found that perf-mem could took a long time to post-process
the perf data after the target workload had completed its operation.
The performance bottleneck was found to be searching and insertion
of the new DSO structures (thousands of them in this case).
In a dual-socket Ivy-Bridge E7-4890 v2 machine (30-core, 60-thread),
the perf profile below shows what perf was doing after the profiled
AIM7 shared workload completed:
- 83.94% perf libc-2.11.3.so [.] __strcmp_sse42
- __strcmp_sse42
- 99.82% map__new
machine__process_mmap_event
perf_session_deliver_event
perf_session__process_event
__perf_session__process_events
cmd_record
cmd_mem
run_builtin
main
__libc_start_main
- 13.17% perf perf [.] __dsos__findnew
__dsos__findnew
map__new
machine__process_mmap_event
perf_session_deliver_event
perf_session__process_event
__perf_session__process_events
cmd_record
cmd_mem
run_builtin
main
__libc_start_main
So about 97% of CPU times were spent in the map__new() function
trying to insert new DSO entry into the DSO linked list. The whole
post-processing step took about 9 minutes.
The DSO structures are currently searched linearly. So the total
processing time will be proportional to n^2.
To overcome this performance problem, the DSO code is modified to
also put the DSO structures in a RB tree sorted by its long name
in additional to being in a simple linked list. With this change,
the processing time will become proportional to n*log(n) which will
be much quicker for large n. However, the short name will still
be searched using the old linear searching method which is slow.
With that patch in place, the same perf-mem post-processing step took
less than 30 seconds to complete.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
tools/perf/util/dso.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++--
tools/perf/util/dso.h | 1 +
2 files changed, 84 insertions(+), 4 deletions(-)
diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
index 819f104..fccb2f0 100644
--- a/tools/perf/util/dso.c
+++ b/tools/perf/util/dso.c
@@ -611,17 +611,93 @@ struct dso *dso__kernel_findnew(struct machine *machine, const char *name,
return dso;
}
+/*
+ * RB root of DSOs sorted by the long name
+ */
+static struct rb_root dso__longname_root = { NULL };
+
+/*
+ * Find a matching entry and/or link current entry to RB tree.
+ * Either one of the dso or name parameter must be non-NULL or the
+ * function will not work.
+ */
+static struct dso *
+dso__findlink_by_longname(struct dso *dso, const char *name)
+{
+ struct rb_node **p = &dso__longname_root.rb_node;
+ struct rb_node *parent = NULL;
+ int warned = false;
+
+ if (!name)
+ name = dso->long_name;
+ /*
+ * Find node with the matching name
+ */
+ while (*p) {
+ struct dso *this = rb_entry(*p, struct dso, longname_rb_node);
+ long rc = (long)strcmp(name, this->long_name);
+
+ parent = *p;
+ if (rc == 0) {
+ /*
+ * In case the new DSO is a duplicate of an existing
+ * one, print an one-time warning & sort the entry
+ * by its DSO address.
+ */
+ if (!dso || (dso == this))
+ return this; /* Find matching dso */
+ if (!warned) {
+ pr_warning("Duplicated dso long name: %s\n",
+ name);
+ warned = true;
+ }
+ rc = (long)dso - (long)this;
+ }
+ if (rc < 0)
+ p = &parent->rb_left;
+ else
+ p = &parent->rb_right;
+ }
+ if (dso) {
+ /* Add new node and rebalance tree */
+ rb_link_node(&dso->longname_rb_node, parent, p);
+ rb_insert_color(&dso->longname_rb_node, &dso__longname_root);
+ }
+ return NULL;
+}
+
+static inline struct dso *
+dso__find_by_longname(const char *name)
+{
+ return dso__findlink_by_longname(NULL, name);
+}
+
+/*
+ * Unlink the longname-sorted RB tree node
+ */
+static inline void dso__unlink_longname_node(struct dso *dso)
+{
+ rb_erase(&dso->longname_rb_node, &dso__longname_root);
+}
+
void dso__set_long_name(struct dso *dso, const char *name, bool name_allocated)
{
if (name == NULL)
return;
+ if (dso->long_name) {
+ if (!strcmp(dso->long_name, name))
+ return;
+ dso__unlink_longname_node(dso);
+ }
+
if (dso->long_name_allocated)
free((char *)dso->long_name);
dso->long_name = name;
dso->long_name_len = strlen(name);
dso->long_name_allocated = name_allocated;
+ (void)dso__findlink_by_longname(dso, name);
}
void dso__set_short_name(struct dso *dso, const char *name, bool name_allocated)
@@ -695,6 +771,8 @@ struct dso *dso__new(const char *name)
if (dso != NULL) {
int i;
strcpy(dso->name, name);
+ RB_CLEAR_NODE(&dso->longname_rb_node);
+ dso->long_name = NULL;
dso__set_long_name(dso, dso->name, false);
dso__set_short_name(dso, dso->name, false);
for (i = 0; i < MAP__NR_TYPES; ++i)
@@ -733,6 +811,10 @@ void dso__delete(struct dso *dso)
zfree((char **)&dso->long_name);
dso->long_name_allocated = false;
}
+ if (dso->long_name) {
+ dso__unlink_longname_node(dso);
+ dso->long_name = NULL;
+ }
dso__data_close(dso);
dso_cache__free(&dso->data.cache);
@@ -822,10 +904,7 @@ struct dso *dsos__find(const struct list_head *head, const char *name, bool cmp_
return pos;
return NULL;
}
- list_for_each_entry(pos, head, node)
- if (strcmp(pos->long_name, name) == 0)
- return pos;
- return NULL;
+ return dso__find_by_longname(name);
}
struct dso *__dsos__findnew(struct list_head *head, const char *name)
diff --git a/tools/perf/util/dso.h b/tools/perf/util/dso.h
index ad553ba..81a75ee 100644
--- a/tools/perf/util/dso.h
+++ b/tools/perf/util/dso.h
@@ -76,6 +76,7 @@ struct dso {
struct list_head node;
struct rb_root symbols[MAP__NR_TYPES];
struct rb_root symbol_names[MAP__NR_TYPES];
+ struct rb_node longname_rb_node;
void *a2l;
char *symsrc_filename;
unsigned int a2l_fails;
--
1.7.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v2] perf mem: improves DSO long names search speed with RB tree
2014-09-16 19:08 ` [PATCH v2] " Waiman Long
@ 2014-09-17 6:22 ` Adrian Hunter
2014-09-17 14:08 ` Waiman Long
2014-09-17 14:51 ` Arnaldo Carvalho de Melo
1 sibling, 1 reply; 6+ messages in thread
From: Adrian Hunter @ 2014-09-17 6:22 UTC (permalink / raw)
To: Waiman Long, Arnaldo Carvalho de Melo
Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
Scott J Norton, Don Zickus, Jiri Olsa
On 09/16/2014 10:08 PM, Waiman Long wrote:
> With workload that spawns and destroys many threads and processes,
> it was found that perf-mem could took a long time to post-process
> the perf data after the target workload had completed its operation.
> The performance bottleneck was found to be searching and insertion
> of the new DSO structures (thousands of them in this case).
>
> In a dual-socket Ivy-Bridge E7-4890 v2 machine (30-core, 60-thread),
> the perf profile below shows what perf was doing after the profiled
> AIM7 shared workload completed:
>
> - 83.94% perf libc-2.11.3.so [.] __strcmp_sse42
> - __strcmp_sse42
> - 99.82% map__new
> machine__process_mmap_event
> perf_session_deliver_event
> perf_session__process_event
> __perf_session__process_events
> cmd_record
> cmd_mem
> run_builtin
> main
> __libc_start_main
> - 13.17% perf perf [.] __dsos__findnew
> __dsos__findnew
> map__new
> machine__process_mmap_event
> perf_session_deliver_event
> perf_session__process_event
> __perf_session__process_events
> cmd_record
> cmd_mem
> run_builtin
> main
> __libc_start_main
>
> So about 97% of CPU times were spent in the map__new() function
> trying to insert new DSO entry into the DSO linked list. The whole
> post-processing step took about 9 minutes.
>
> The DSO structures are currently searched linearly. So the total
> processing time will be proportional to n^2.
>
> To overcome this performance problem, the DSO code is modified to
> also put the DSO structures in a RB tree sorted by its long name
> in additional to being in a simple linked list. With this change,
> the processing time will become proportional to n*log(n) which will
> be much quicker for large n. However, the short name will still
> be searched using the old linear searching method which is slow.
> With that patch in place, the same perf-mem post-processing step took
> less than 30 seconds to complete.
>
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
> tools/perf/util/dso.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++--
> tools/perf/util/dso.h | 1 +
> 2 files changed, 84 insertions(+), 4 deletions(-)
>
> diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
> index 819f104..fccb2f0 100644
> --- a/tools/perf/util/dso.c
> +++ b/tools/perf/util/dso.c
> @@ -611,17 +611,93 @@ struct dso *dso__kernel_findnew(struct machine *machine, const char *name,
> return dso;
> }
>
> +/*
> + * RB root of DSOs sorted by the long name
> + */
> +static struct rb_root dso__longname_root = { NULL };
Global variable!
Why not just change the lists to rbtrees i.e.
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index 6a6bcc1..fa30780 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -32,8 +32,8 @@ struct machine {
struct list_head dead_threads;
struct thread *last_match;
struct vdso_info *vdso_info;
- struct list_head user_dsos;
- struct list_head kernel_dsos;
+ struct rb_root user_dsos;
+ struct rb_root kernel_dsos;
struct map_groups kmaps;
struct map *vmlinux_maps[MAP__NR_TYPES];
u64 kernel_start;
And make all the resulting adjustments.
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v2] perf mem: improves DSO long names search speed with RB tree
2014-09-17 6:22 ` Adrian Hunter
@ 2014-09-17 14:08 ` Waiman Long
0 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2014-09-17 14:08 UTC (permalink / raw)
To: Adrian Hunter
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Paul Mackerras,
Ingo Molnar, linux-kernel, Scott J Norton, Don Zickus, Jiri Olsa
On 09/17/2014 02:22 AM, Adrian Hunter wrote:
> On 09/16/2014 10:08 PM, Waiman Long wrote:
>> With workload that spawns and destroys many threads and processes,
>> it was found that perf-mem could took a long time to post-process
>> the perf data after the target workload had completed its operation.
>> The performance bottleneck was found to be searching and insertion
>> of the new DSO structures (thousands of them in this case).
>>
>> In a dual-socket Ivy-Bridge E7-4890 v2 machine (30-core, 60-thread),
>> the perf profile below shows what perf was doing after the profiled
>> AIM7 shared workload completed:
>>
>> - 83.94% perf libc-2.11.3.so [.] __strcmp_sse42
>> - __strcmp_sse42
>> - 99.82% map__new
>> machine__process_mmap_event
>> perf_session_deliver_event
>> perf_session__process_event
>> __perf_session__process_events
>> cmd_record
>> cmd_mem
>> run_builtin
>> main
>> __libc_start_main
>> - 13.17% perf perf [.] __dsos__findnew
>> __dsos__findnew
>> map__new
>> machine__process_mmap_event
>> perf_session_deliver_event
>> perf_session__process_event
>> __perf_session__process_events
>> cmd_record
>> cmd_mem
>> run_builtin
>> main
>> __libc_start_main
>>
>> So about 97% of CPU times were spent in the map__new() function
>> trying to insert new DSO entry into the DSO linked list. The whole
>> post-processing step took about 9 minutes.
>>
>> The DSO structures are currently searched linearly. So the total
>> processing time will be proportional to n^2.
>>
>> To overcome this performance problem, the DSO code is modified to
>> also put the DSO structures in a RB tree sorted by its long name
>> in additional to being in a simple linked list. With this change,
>> the processing time will become proportional to n*log(n) which will
>> be much quicker for large n. However, the short name will still
>> be searched using the old linear searching method which is slow.
>> With that patch in place, the same perf-mem post-processing step took
>> less than 30 seconds to complete.
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hp.com>
>> ---
>> tools/perf/util/dso.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++--
>> tools/perf/util/dso.h | 1 +
>> 2 files changed, 84 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/perf/util/dso.c b/tools/perf/util/dso.c
>> index 819f104..fccb2f0 100644
>> --- a/tools/perf/util/dso.c
>> +++ b/tools/perf/util/dso.c
>> @@ -611,17 +611,93 @@ struct dso *dso__kernel_findnew(struct machine *machine, const char *name,
>> return dso;
>> }
>>
>> +/*
>> + * RB root of DSOs sorted by the long name
>> + */
>> +static struct rb_root dso__longname_root = { NULL };
> Global variable!
>
> Why not just change the lists to rbtrees i.e.
The DSOs can be searched either by its short name or its long name. The
patch enables long name search with RB tree. The short name search is
still using the linked list as the sort order will be different from the
long name.
I think I can remove the linked list and do the short name search
linearly by scanning from the leftmost to the rightmost in the RBtree.
In that way, we can remove the redundant list head field in the DSO
structure.
Cheers,
Longman
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] perf mem: improves DSO long names search speed with RB tree
2014-09-16 19:08 ` [PATCH v2] " Waiman Long
2014-09-17 6:22 ` Adrian Hunter
@ 2014-09-17 14:51 ` Arnaldo Carvalho de Melo
2014-09-17 17:55 ` Waiman Long
1 sibling, 1 reply; 6+ messages in thread
From: Arnaldo Carvalho de Melo @ 2014-09-17 14:51 UTC (permalink / raw)
To: Waiman Long
Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
Scott J Norton, Don Zickus, Jiri Olsa, Adrian Hunter
Em Tue, Sep 16, 2014 at 03:08:19PM -0400, Waiman Long escreveu:
> +++ b/tools/perf/util/dso.c
> @@ -611,17 +611,93 @@ struct dso *dso__kernel_findnew(struct machine *machine, const char *name,
> return dso;
> }
>
> +/*
> + * RB root of DSOs sorted by the long name
> + */
> +static struct rb_root dso__longname_root = { NULL };
Use RB_ROOT, like in:
[acme@zoo linux]$ grep -w rb_root mm/vmalloc.c
static struct rb_root vmap_area_root = RB_ROOT;
[acme@zoo linux]$
But then can't this be made non-static, i.e. at the 'struct machine'
level? I.e. it is more likely that DSOs with the same long name are
really the same thing on a single 'struct machine', not accross multiple
ones or even multiple sessions (i.e. across multiple 'struct machines').
IIRC Adrian also pointed this out.
- Arnaldo
> +/*
> + * Find a matching entry and/or link current entry to RB tree.
> + * Either one of the dso or name parameter must be non-NULL or the
> + * function will not work.
> + */
> +static struct dso *
> +dso__findlink_by_longname(struct dso *dso, const char *name)
> +{
> + struct rb_node **p = &dso__longname_root.rb_node;
> + struct rb_node *parent = NULL;
> + int warned = false;
> +
> + if (!name)
> + name = dso->long_name;
> + /*
> + * Find node with the matching name
> + */
> + while (*p) {
> + struct dso *this = rb_entry(*p, struct dso, longname_rb_node);
> + long rc = (long)strcmp(name, this->long_name);
> +
> + parent = *p;
> + if (rc == 0) {
> + /*
> + * In case the new DSO is a duplicate of an existing
> + * one, print an one-time warning & sort the entry
> + * by its DSO address.
> + */
> + if (!dso || (dso == this))
> + return this; /* Find matching dso */
> + if (!warned) {
> + pr_warning("Duplicated dso long name: %s\n",
> + name);
> + warned = true;
> + }
> + rc = (long)dso - (long)this;
> + }
> + if (rc < 0)
> + p = &parent->rb_left;
> + else
> + p = &parent->rb_right;
> + }
> + if (dso) {
> + /* Add new node and rebalance tree */
> + rb_link_node(&dso->longname_rb_node, parent, p);
> + rb_insert_color(&dso->longname_rb_node, &dso__longname_root);
> + }
> + return NULL;
> +}
> +
> +static inline struct dso *
> +dso__find_by_longname(const char *name)
> +{
> + return dso__findlink_by_longname(NULL, name);
> +}
> +
> +/*
> + * Unlink the longname-sorted RB tree node
> + */
> +static inline void dso__unlink_longname_node(struct dso *dso)
> +{
> + rb_erase(&dso->longname_rb_node, &dso__longname_root);
> +}
> +
> void dso__set_long_name(struct dso *dso, const char *name, bool name_allocated)
> {
> if (name == NULL)
> return;
>
> + if (dso->long_name) {
> + if (!strcmp(dso->long_name, name))
> + return;
> + dso__unlink_longname_node(dso);
> + }
> +
> if (dso->long_name_allocated)
> free((char *)dso->long_name);
>
> dso->long_name = name;
> dso->long_name_len = strlen(name);
> dso->long_name_allocated = name_allocated;
> + (void)dso__findlink_by_longname(dso, name);
> }
>
> void dso__set_short_name(struct dso *dso, const char *name, bool name_allocated)
> @@ -695,6 +771,8 @@ struct dso *dso__new(const char *name)
> if (dso != NULL) {
> int i;
> strcpy(dso->name, name);
> + RB_CLEAR_NODE(&dso->longname_rb_node);
> + dso->long_name = NULL;
> dso__set_long_name(dso, dso->name, false);
> dso__set_short_name(dso, dso->name, false);
> for (i = 0; i < MAP__NR_TYPES; ++i)
> @@ -733,6 +811,10 @@ void dso__delete(struct dso *dso)
> zfree((char **)&dso->long_name);
> dso->long_name_allocated = false;
> }
> + if (dso->long_name) {
> + dso__unlink_longname_node(dso);
> + dso->long_name = NULL;
> + }
>
> dso__data_close(dso);
> dso_cache__free(&dso->data.cache);
> @@ -822,10 +904,7 @@ struct dso *dsos__find(const struct list_head *head, const char *name, bool cmp_
> return pos;
> return NULL;
> }
> - list_for_each_entry(pos, head, node)
> - if (strcmp(pos->long_name, name) == 0)
> - return pos;
> - return NULL;
> + return dso__find_by_longname(name);
> }
>
> struct dso *__dsos__findnew(struct list_head *head, const char *name)
> diff --git a/tools/perf/util/dso.h b/tools/perf/util/dso.h
> index ad553ba..81a75ee 100644
> --- a/tools/perf/util/dso.h
> +++ b/tools/perf/util/dso.h
> @@ -76,6 +76,7 @@ struct dso {
> struct list_head node;
> struct rb_root symbols[MAP__NR_TYPES];
> struct rb_root symbol_names[MAP__NR_TYPES];
> + struct rb_node longname_rb_node;
> void *a2l;
> char *symsrc_filename;
> unsigned int a2l_fails;
> --
> 1.7.1
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] perf mem: improves DSO long names search speed with RB tree
2014-09-17 14:51 ` Arnaldo Carvalho de Melo
@ 2014-09-17 17:55 ` Waiman Long
0 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2014-09-17 17:55 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel,
Scott J Norton, Don Zickus, Jiri Olsa, Adrian Hunter
On 09/17/2014 10:51 AM, Arnaldo Carvalho de Melo wrote:
> Em Tue, Sep 16, 2014 at 03:08:19PM -0400, Waiman Long escreveu:
>> +++ b/tools/perf/util/dso.c
>> @@ -611,17 +611,93 @@ struct dso *dso__kernel_findnew(struct machine *machine, const char *name,
>> return dso;
>> }
>>
>> +/*
>> + * RB root of DSOs sorted by the long name
>> + */
>> +static struct rb_root dso__longname_root = { NULL };
> Use RB_ROOT, like in:
>
> [acme@zoo linux]$ grep -w rb_root mm/vmalloc.c
> static struct rb_root vmap_area_root = RB_ROOT;
> [acme@zoo linux]$
I don't use RB_ROOT here because it gave compilation error.
> But then can't this be made non-static, i.e. at the 'struct machine'
> level? I.e. it is more likely that DSOs with the same long name are
> really the same thing on a single 'struct machine', not accross multiple
> ones or even multiple sessions (i.e. across multiple 'struct machines').
>
> IIRC Adrian also pointed this out.
>
> - Arnaldo
>
>
Yes, I am going to put the rb_root in the machine structure level as
suggested by Adrian. So the static rb_root will be gone. I also found
out that the dso__load_sym() function in util/symbol-elf.c may create
DSOs of the same long name which can be either "[kernel.allsyms]" or
"/lib/modules/.../vmlinux".
Cheers,
Longman
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2014-09-17 17:55 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-16 19:08 [PATCH v2 0/1] perf mem: improves DSO long names search speed with RB tree Waiman Long
2014-09-16 19:08 ` [PATCH v2] " Waiman Long
2014-09-17 6:22 ` Adrian Hunter
2014-09-17 14:08 ` Waiman Long
2014-09-17 14:51 ` Arnaldo Carvalho de Melo
2014-09-17 17:55 ` Waiman Long
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.