* [PATCH 1/2] perf callchain: Create an address space per thread
@ 2014-09-23 6:30 Namhyung Kim
2014-09-23 6:30 ` [PATCH 2/2] perf callchain: Use global caching provided by libunwind Namhyung Kim
2014-09-23 12:24 ` [PATCH 1/2] perf callchain: Create an address space per thread Jiri Olsa
0 siblings, 2 replies; 14+ messages in thread
From: Namhyung Kim @ 2014-09-23 6:30 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: Peter Zijlstra, Ingo Molnar, Paul Mackerras, Namhyung Kim,
Namhyung Kim, LKML, Jiri Olsa, Jean Pihet, Arun Sharma
The unw_addr_space_t in libunwind represents an address space to be
used for stack unwinding. It doesn't need to be created/destroyed
every time a callchain is unwound (as in get_entries); it can have the
same lifetime as the thread (unless exec is called).
So move the address space construction/destruction logic to the thread
lifetime handling functions. This is a preparation to enable caching
in the unwind library.
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
tools/perf/util/thread.c | 8 ++++++++
tools/perf/util/unwind-libunwind.c | 30 +++++++++++++++++++++++++-----
tools/perf/util/unwind.h | 17 +++++++++++++++++
3 files changed, 50 insertions(+), 5 deletions(-)
diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index a9df7f2c6dc9..c1fa4a3597ea 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -7,6 +7,7 @@
#include "util.h"
#include "debug.h"
#include "comm.h"
+#include "unwind.h"
int thread__init_map_groups(struct thread *thread, struct machine *machine)
{
@@ -48,6 +49,12 @@ struct thread *thread__new(pid_t pid, pid_t tid)
goto err_thread;
list_add(&comm->list, &thread->comm_list);
+
+ if (unwind__prepare_access(thread) < 0) {
+ list_del(&comm->list);
+ free(comm);
+ goto err_thread;
+ }
}
return thread;
@@ -69,6 +76,7 @@ void thread__delete(struct thread *thread)
list_del(&comm->list);
comm__free(comm);
}
+ unwind__finish_access(thread);
free(thread);
}
diff --git a/tools/perf/util/unwind-libunwind.c b/tools/perf/util/unwind-libunwind.c
index 92b56db52471..76ec25663c95 100644
--- a/tools/perf/util/unwind-libunwind.c
+++ b/tools/perf/util/unwind-libunwind.c
@@ -525,12 +525,9 @@ static unw_accessors_t accessors = {
.get_proc_name = get_proc_name,
};
-static int get_entries(struct unwind_info *ui, unwind_entry_cb_t cb,
- void *arg, int max_stack)
+int unwind__prepare_access(struct thread *thread)
{
unw_addr_space_t addr_space;
- unw_cursor_t c;
- int ret;
addr_space = unw_create_addr_space(&accessors, 0);
if (!addr_space) {
@@ -538,6 +535,30 @@ static int get_entries(struct unwind_info *ui, unwind_entry_cb_t cb,
return -ENOMEM;
}
+ thread__set_priv(thread, addr_space);
+
+ return 0;
+}
+
+void unwind__finish_access(struct thread *thread)
+{
+ unw_addr_space_t addr_space;
+
+ addr_space = thread__priv(thread);
+ unw_destroy_addr_space(addr_space);
+}
+
+static int get_entries(struct unwind_info *ui, unwind_entry_cb_t cb,
+ void *arg, int max_stack)
+{
+ unw_addr_space_t addr_space;
+ unw_cursor_t c;
+ int ret;
+
+ addr_space = thread__priv(ui->thread);
+ if (addr_space == NULL)
+ return -1;
+
ret = unw_init_remote(&c, addr_space, ui);
if (ret)
display_error(ret);
@@ -549,7 +570,6 @@ static int get_entries(struct unwind_info *ui, unwind_entry_cb_t cb,
ret = ip ? entry(ip, ui->thread, ui->machine, cb, arg) : 0;
}
- unw_destroy_addr_space(addr_space);
return ret;
}
diff --git a/tools/perf/util/unwind.h b/tools/perf/util/unwind.h
index f03061260b4e..4b99c6280c2a 100644
--- a/tools/perf/util/unwind.h
+++ b/tools/perf/util/unwind.h
@@ -4,6 +4,7 @@
#include <linux/types.h>
#include "event.h"
#include "symbol.h"
+#include "thread.h"
struct unwind_entry {
struct map *map;
@@ -21,6 +22,15 @@ int unwind__get_entries(unwind_entry_cb_t cb, void *arg,
/* libunwind specific */
#ifdef HAVE_LIBUNWIND_SUPPORT
int libunwind__arch_reg_id(int regnum);
+int unwind__prepare_access(struct thread *thread);
+void unwind__finish_access(struct thread *thread);
+#else
+static inline int unwind__prepare_access(struct thread *thread)
+{
+ return 0;
+}
+
+static inline void unwind__finish_access(struct thread *thread) {}
#endif
#else
static inline int
@@ -33,5 +43,12 @@ unwind__get_entries(unwind_entry_cb_t cb __maybe_unused,
{
return 0;
}
+
+static inline int unwind__prepare_access(struct thread *thread)
+{
+ return 0;
+}
+
+static inline void unwind__finish_access(struct thread *thread) {}
#endif /* HAVE_DWARF_UNWIND_SUPPORT */
#endif /* __UNWIND_H */
--
2.1.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/2] perf callchain: Use global caching provided by libunwind
2014-09-23 6:30 [PATCH 1/2] perf callchain: Create an address space per thread Namhyung Kim
@ 2014-09-23 6:30 ` Namhyung Kim
2014-09-23 12:28 ` Jiri Olsa
2014-09-23 14:01 ` Arun Sharma
2014-09-23 12:24 ` [PATCH 1/2] perf callchain: Create an address space per thread Jiri Olsa
1 sibling, 2 replies; 14+ messages in thread
From: Namhyung Kim @ 2014-09-23 6:30 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: Peter Zijlstra, Ingo Molnar, Paul Mackerras, Namhyung Kim,
Namhyung Kim, LKML, Jiri Olsa, Jean Pihet, Arun Sharma
libunwind provides two caching policies: global and per-thread. As
perf unwinds callchains in a single thread, it is sufficient to use
global caching.
This speeds up my perf report from 14s to 7s on a ~260MB data file,
although the output contains a slight difference (~0.01% in terms of
the number of lines printed) on callchains which were not resolved.
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
tools/perf/util/thread.c | 3 +++
tools/perf/util/unwind-libunwind.c | 9 +++++++++
tools/perf/util/unwind.h | 3 +++
3 files changed, 15 insertions(+)
diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index c1fa4a3597ea..e67d4ca6de44 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -119,6 +119,9 @@ int __thread__set_comm(struct thread *thread, const char *str, u64 timestamp,
if (!new)
return -ENOMEM;
list_add(&new->list, &thread->comm_list);
+
+ if (exec)
+ unwind__flush_access(thread);
}
thread->comm_set = true;
diff --git a/tools/perf/util/unwind-libunwind.c b/tools/perf/util/unwind-libunwind.c
index 76ec25663c95..6df06f0cd177 100644
--- a/tools/perf/util/unwind-libunwind.c
+++ b/tools/perf/util/unwind-libunwind.c
@@ -535,11 +535,20 @@ int unwind__prepare_access(struct thread *thread)
return -ENOMEM;
}
+ unw_set_caching_policy(addr_space, UNW_CACHE_GLOBAL);
thread__set_priv(thread, addr_space);
return 0;
}
+void unwind__flush_access(struct thread *thread)
+{
+ unw_addr_space_t addr_space;
+
+ addr_space = thread__priv(thread);
+ unw_flush_cache(addr_space, 0, 0);
+}
+
void unwind__finish_access(struct thread *thread)
{
unw_addr_space_t addr_space;
diff --git a/tools/perf/util/unwind.h b/tools/perf/util/unwind.h
index 4b99c6280c2a..d68f24d4f01b 100644
--- a/tools/perf/util/unwind.h
+++ b/tools/perf/util/unwind.h
@@ -23,6 +23,7 @@ int unwind__get_entries(unwind_entry_cb_t cb, void *arg,
#ifdef HAVE_LIBUNWIND_SUPPORT
int libunwind__arch_reg_id(int regnum);
int unwind__prepare_access(struct thread *thread);
+void unwind__flush_access(struct thread *thread);
void unwind__finish_access(struct thread *thread);
#else
static inline int unwind__prepare_access(struct thread *thread)
@@ -30,6 +31,7 @@ static inline int unwind__prepare_access(struct thread *thread)
return 0;
}
+static inline void unwind__flush_access(struct thread *thread) {}
static inline void unwind__finish_access(struct thread *thread) {}
#endif
#else
@@ -49,6 +51,7 @@ static inline int unwind__prepare_access(struct thread *thread)
return 0;
}
+static inline void unwind__flush_access(struct thread *thread) {}
static inline void unwind__finish_access(struct thread *thread) {}
#endif /* HAVE_DWARF_UNWIND_SUPPORT */
#endif /* __UNWIND_H */
--
2.1.0
* Re: [PATCH 1/2] perf callchain: Create an address space per thread
2014-09-23 6:30 [PATCH 1/2] perf callchain: Create an address space per thread Namhyung Kim
2014-09-23 6:30 ` [PATCH 2/2] perf callchain: Use global caching provided by libunwind Namhyung Kim
@ 2014-09-23 12:24 ` Jiri Olsa
2014-09-23 12:49 ` Namhyung Kim
1 sibling, 1 reply; 14+ messages in thread
From: Jiri Olsa @ 2014-09-23 12:24 UTC (permalink / raw)
To: Namhyung Kim
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
Paul Mackerras, Namhyung Kim, LKML, Jean Pihet, Arun Sharma
On Tue, Sep 23, 2014 at 03:30:27PM +0900, Namhyung Kim wrote:
> The unw_addr_space_t in libunwind represents an address space to be
> used for stack unwinding. It doesn't need to be created/destroyed
> every time a callchain is unwound (as in get_entries); it can have the
> same lifetime as the thread (unless exec is called).
>
> So move the address space construction/destruction logic to the thread
> lifetime handling functions. This is a preparation to enable caching
> in the unwind library.
>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Jean Pihet <jean.pihet@linaro.org>
> Cc: Arun Sharma <asharma@fb.com>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
> tools/perf/util/thread.c | 8 ++++++++
> tools/perf/util/unwind-libunwind.c | 30 +++++++++++++++++++++++++-----
> tools/perf/util/unwind.h | 17 +++++++++++++++++
> 3 files changed, 50 insertions(+), 5 deletions(-)
>
> diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
> index a9df7f2c6dc9..c1fa4a3597ea 100644
> --- a/tools/perf/util/thread.c
> +++ b/tools/perf/util/thread.c
> @@ -7,6 +7,7 @@
> #include "util.h"
> #include "debug.h"
> #include "comm.h"
> +#include "unwind.h"
>
> int thread__init_map_groups(struct thread *thread, struct machine *machine)
> {
> @@ -48,6 +49,12 @@ struct thread *thread__new(pid_t pid, pid_t tid)
> goto err_thread;
>
> list_add(&comm->list, &thread->comm_list);
> +
> + if (unwind__prepare_access(thread) < 0) {
you could call list_add below this call and thus save
the list_del call below in error path
looks like you don't need comm to be on comm_list within
the unwind__prepare_access call
> + list_del(&comm->list);
> + free(comm);
> + goto err_thread;
> + }
> }
SNIP
jirka
* Re: [PATCH 2/2] perf callchain: Use global caching provided by libunwind
2014-09-23 6:30 ` [PATCH 2/2] perf callchain: Use global caching provided by libunwind Namhyung Kim
@ 2014-09-23 12:28 ` Jiri Olsa
2014-09-23 12:53 ` Namhyung Kim
2014-09-23 14:01 ` Arun Sharma
1 sibling, 1 reply; 14+ messages in thread
From: Jiri Olsa @ 2014-09-23 12:28 UTC (permalink / raw)
To: Namhyung Kim
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
Paul Mackerras, Namhyung Kim, LKML, Jean Pihet, Arun Sharma
On Tue, Sep 23, 2014 at 03:30:28PM +0900, Namhyung Kim wrote:
> libunwind provides two caching policies: global and per-thread. As
> perf unwinds callchains in a single thread, it is sufficient to use
> global caching.
>
> This speeds up my perf report from 14s to 7s on a ~260MB data file,
> although the output contains a slight difference (~0.01% in terms of
> the number of lines printed) on callchains which were not resolved.
hum, the speedup is nice, but what was the diff output.. any example?
The new version does not print some lines or different ones?
thanks,
jirka
* Re: [PATCH 1/2] perf callchain: Create an address space per thread
2014-09-23 12:24 ` [PATCH 1/2] perf callchain: Create an address space per thread Jiri Olsa
@ 2014-09-23 12:49 ` Namhyung Kim
2014-09-26 15:35 ` Arnaldo Carvalho de Melo
0 siblings, 1 reply; 14+ messages in thread
From: Namhyung Kim @ 2014-09-23 12:49 UTC (permalink / raw)
To: Jiri Olsa
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
Paul Mackerras, Namhyung Kim, LKML, Jean Pihet, Arun Sharma
Hi Jiri,
On Tue, Sep 23, 2014 at 9:24 PM, Jiri Olsa <jolsa@redhat.com> wrote:
> On Tue, Sep 23, 2014 at 03:30:27PM +0900, Namhyung Kim wrote:
>> The unw_addr_space_t in libunwind represents an address space to be
>> used for stack unwinding. It doesn't need to be created/destroyed
>> every time a callchain is unwound (as in get_entries); it can have the
>> same lifetime as the thread (unless exec is called).
>>
>> So move the address space construction/destruction logic to the thread
>> lifetime handling functions. This is a preparation to enable caching
>> in the unwind library.
>>
>> Cc: Jiri Olsa <jolsa@redhat.com>
>> Cc: Jean Pihet <jean.pihet@linaro.org>
>> Cc: Arun Sharma <asharma@fb.com>
>> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
>> ---
>> tools/perf/util/thread.c | 8 ++++++++
>> tools/perf/util/unwind-libunwind.c | 30 +++++++++++++++++++++++++-----
>> tools/perf/util/unwind.h | 17 +++++++++++++++++
>> 3 files changed, 50 insertions(+), 5 deletions(-)
>>
>> diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
>> index a9df7f2c6dc9..c1fa4a3597ea 100644
>> --- a/tools/perf/util/thread.c
>> +++ b/tools/perf/util/thread.c
>> @@ -7,6 +7,7 @@
>> #include "util.h"
>> #include "debug.h"
>> #include "comm.h"
>> +#include "unwind.h"
>>
>> int thread__init_map_groups(struct thread *thread, struct machine *machine)
>> {
>> @@ -48,6 +49,12 @@ struct thread *thread__new(pid_t pid, pid_t tid)
>> goto err_thread;
>>
>> list_add(&comm->list, &thread->comm_list);
>> +
>> + if (unwind__prepare_access(thread) < 0) {
>
> you could call list_add below this call and thus save
> the list_del call below in error path
>
> looks like you don't need comm to be on comm_list within
> the unwind__prepare_access call
Right. Will change.
Thanks,
Namhyung
>
>> + list_del(&comm->list);
>> + free(comm);
>> + goto err_thread;
>> + }
>> }
>
> SNIP
>
> jirka
* Re: [PATCH 2/2] perf callchain: Use global caching provided by libunwind
2014-09-23 12:28 ` Jiri Olsa
@ 2014-09-23 12:53 ` Namhyung Kim
2014-09-24 1:04 ` Namhyung Kim
0 siblings, 1 reply; 14+ messages in thread
From: Namhyung Kim @ 2014-09-23 12:53 UTC (permalink / raw)
To: Jiri Olsa
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
Paul Mackerras, Namhyung Kim, LKML, Jean Pihet, Arun Sharma
On Tue, Sep 23, 2014 at 9:28 PM, Jiri Olsa <jolsa@redhat.com> wrote:
> On Tue, Sep 23, 2014 at 03:30:28PM +0900, Namhyung Kim wrote:
>> libunwind provides two caching policies: global and per-thread. As
>> perf unwinds callchains in a single thread, it is sufficient to use
>> global caching.
>>
>> This speeds up my perf report from 14s to 7s on a ~260MB data file,
>> although the output contains a slight difference (~0.01% in terms of
>> the number of lines printed) on callchains which were not resolved.
>
> hum, the speedup is nice, but what was the diff output.. any example?
> The new version does not print some lines or different ones?
I don't have the result now - will post the diff when I go to the
office tomorrow. But IIRC new version only adds new lines..
Thanks,
Namhyung
* Re: [PATCH 2/2] perf callchain: Use global caching provided by libunwind
2014-09-23 6:30 ` [PATCH 2/2] perf callchain: Use global caching provided by libunwind Namhyung Kim
2014-09-23 12:28 ` Jiri Olsa
@ 2014-09-23 14:01 ` Arun Sharma
2014-09-24 2:24 ` Namhyung Kim
1 sibling, 1 reply; 14+ messages in thread
From: Arun Sharma @ 2014-09-23 14:01 UTC (permalink / raw)
To: Namhyung Kim, Arnaldo Carvalho de Melo
Cc: Peter Zijlstra, Ingo Molnar, Paul Mackerras, Namhyung Kim, LKML,
Jiri Olsa, Jean Pihet
On 9/23/14, 12:00 PM, Namhyung Kim wrote:
> + unw_set_caching_policy(addr_space, UNW_CACHE_GLOBAL);
The result is a bit surprising for me. In micro benchmarking (eg:
Lperf-simple), the per-thread policy is generally faster because it
doesn't involve locking.
libunwind/tests/Lperf-simple
unw_getcontext : cold avg= 109.673 nsec, warm avg= 28.610 nsec
unw_init_local : cold avg= 259.876 nsec, warm avg= 9.537 nsec
no cache : unw_step : 1st= 3258.387 min= 2922.331 avg= 3002.384 nsec
global cache : unw_step : 1st= 1192.093 min= 960.486 avg= 982.208 nsec
per-thread cache: unw_step : 1st= 429.153 min= 113.533 avg= 121.762 nsec
I can see how the global policy would involve less memory allocation
because of shared data structures. Curious about the reason for the
speedup (specifically if libunwind should change the defaults for the
non-local unwinding case).
-Arun
* Re: [PATCH 2/2] perf callchain: Use global caching provided by libunwind
2014-09-23 12:53 ` Namhyung Kim
@ 2014-09-24 1:04 ` Namhyung Kim
0 siblings, 0 replies; 14+ messages in thread
From: Namhyung Kim @ 2014-09-24 1:04 UTC (permalink / raw)
To: Jiri Olsa
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
Paul Mackerras, Namhyung Kim, LKML, Jean Pihet, Arun Sharma
On Tue, 23 Sep 2014 21:53:43 +0900, Namhyung Kim wrote:
> On Tue, Sep 23, 2014 at 9:28 PM, Jiri Olsa <jolsa@redhat.com> wrote:
>> On Tue, Sep 23, 2014 at 03:30:28PM +0900, Namhyung Kim wrote:
>>> The libunwind provides two caching policy which are global and
>>> per-thread. As perf unwinds callchains in a single thread, it is
>>> sufficient to use global caching.
>>>
>>> This speeds up my perf report from 14s to 7s on a ~260MB data file.
>>> Although the output contains a slight difference (~0.01% in terms of
>>> the number of lines printed) on callchains which were not resolved.
>>
>> hum, the speedup is nice, but what was the diff output.. any example?
>> The new version does not print some lines or different ones?
>
> I don't have the result now - will post the diff when I go to the
> office tomorrow. But IIRC new version only adds new lines..
Okay, this is the result. But it doesn't always make a difference. I
can see it produces exactly the same output for other (even bigger) data files.
$ diff -U0 callchain-result.{old,new}
--- callchain-result.old 2014-09-23 14:52:29.630711402 +0900
+++ callchain-result.new 2014-09-23 14:52:52.709505500 +0900
@@ -5322,0 +5323,9 @@
+ | | |
+ | | --0.00%-- 0x406520
+ | | 0x4064f0
+ | | |
+ | | |--0.00%-- 0x1ad52a0
+ | | |
+ | | |--0.00%-- 0x1a0f2e0
+ | | |
+ | | --0.00%-- 0x1a9e0c0
@@ -108899,0 +108909,3 @@
+ | | 0x406520
+ | | 0x4064f0
+ | | 0x1a9e0c0
@@ -180410,0 +180423,9 @@
+ | |
+ | --0.00%-- 0x406520
+ | 0x4064f0
+ | |
+ | |--0.00%-- 0x1ad52a0
+ | |
+ | |--0.00%-- 0x1a0f2e0
+ | |
+ | --0.00%-- 0x1a9e0c0
$ wc -l callchain-result.{old,new}
191412 callchain-result.old
191433 callchain-result.new
382845 total
Thanks,
Namhyung
* Re: [PATCH 2/2] perf callchain: Use global caching provided by libunwind
2014-09-23 14:01 ` Arun Sharma
@ 2014-09-24 2:24 ` Namhyung Kim
2014-09-24 13:45 ` Jean Pihet
0 siblings, 1 reply; 14+ messages in thread
From: Namhyung Kim @ 2014-09-24 2:24 UTC (permalink / raw)
To: Arun Sharma
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
Paul Mackerras, Namhyung Kim, LKML, Jiri Olsa, Jean Pihet
Hi Arun,
On Tue, 23 Sep 2014 14:01:22 +0000, Arun Sharma wrote:
> On 9/23/14, 12:00 PM, Namhyung Kim wrote:
>
>> + unw_set_caching_policy(addr_space, UNW_CACHE_GLOBAL);
>
> The result is a bit surprising for me. In micro benchmarking (eg:
> Lperf-simple), the per-thread policy is generally faster because it
> doesn't involve locking.
>
> libunwind/tests/Lperf-simple
> unw_getcontext : cold avg= 109.673 nsec, warm avg= 28.610 nsec
> unw_init_local : cold avg= 259.876 nsec, warm avg= 9.537 nsec
> no cache : unw_step : 1st= 3258.387 min= 2922.331 avg= 3002.384 nsec
> global cache : unw_step : 1st= 1192.093 min= 960.486 avg= 982.208 nsec
> per-thread cache: unw_step : 1st= 429.153 min= 113.533 avg= 121.762 nsec
Yes, per-thread policy is faster than global caching policy. Below is my
test result. Note that I had already run this several times beforehand
to remove the effect of file contents being loaded into the page cache.
Performance counter stats for
'perf report -i /home/namhyung/tmp/perf-testing/perf.data.kbuild.dwarf --stdio' (3 runs):
UNW_CACHE_NONE UNW_CACHE_GLOBAL UNW_CACHE_PER_THREAD
-----------------------------------------------------------------------------------------------
task-clock (msec) 14298.911947 7112.171928 6913.244797
context-switches 1,507 762 742
cpu-migrations 1 2 1
page-faults 2,924,889 1,101,380 1,101,380
cycles 53,895,784,665 26,798,627,423 26,070,728,349
stalled-cycles-frontend 24,472,506,687 12,577,760,746 12,435,320,081
stalled-cycles-backend 17,550,483,726 9,075,054,009 9,035,478,957
instructions 73,544,039,490 34,352,889,707 33,283,120,736
branches 14,969,890,371 7,139,469,848 6,926,994,151
branch-misses 193,852,116 100,455,431 99,757,213
time elapsed 14.905719730 7.455597356 7.242275972
>
> I can see how the global policy would involve less memory allocation
> because of shared data structures. Curious about the reason for the
> speedup (specifically if libunwind should change the defaults for the
> non-local unwinding case).
I don't see much difference between global and per-thread caching for
remote unwind (besides the rs_cache->lock you mentioned). Also I'm curious
how rs_new() is protected from concurrent accesses with per-thread
caching. That's why I chose the global caching - yeah, it probably
doesn't matter to a single thread, but... :)
Thanks
Namhyung
* Re: [PATCH 2/2] perf callchain: Use global caching provided by libunwind
2014-09-24 2:24 ` Namhyung Kim
@ 2014-09-24 13:45 ` Jean Pihet
2014-09-26 5:50 ` Namhyung Kim
0 siblings, 1 reply; 14+ messages in thread
From: Jean Pihet @ 2014-09-24 13:45 UTC (permalink / raw)
To: Namhyung Kim
Cc: Arun Sharma, Arnaldo Carvalho de Melo, Peter Zijlstra,
Ingo Molnar, Paul Mackerras, Namhyung Kim, LKML, Jiri Olsa
Hi!
Here are the test results on ARMv7 for the 2 patches. The speedup is
about x2.1 for identical unwinding output data.
'perf record --call-graph dwarf -- stress --cpu 2 --io 2 --vm 2
--timeout 10s' generates a 365 MB perf.data file.
time perf.orig report --sort symbol --call-graph --stdio 2&>1 > /dev/null
average on 3 runs
real 36.736
user 14.79
sys 21.91
time perf.libunwind.speedup report --sort symbol --call-graph --stdio
2&>1 > /dev/null
average on 3 runs
real 17.41 x2.11
user 6.42 x2.3
sys 10.97 x2
So the patches definitely speedup the unwinding.
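As a quick sanity check on the factors above (an illustration only,
recomputed from the averaged timings quoted in this message):

```shell
# Recompute the speedup factors from the averaged timings above.
awk 'BEGIN {
	printf "real x%.2f\n", 36.736 / 17.41
	printf "user x%.2f\n", 14.79  / 6.42
	printf "sys  x%.2f\n", 21.91  / 10.97
}'
# real x2.11
# user x2.30
# sys  x2.00
```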
FWIW: Acked-by: Jean Pihet <jean.pihet@linaro.org>
For info, unwinding using libdw is about 5x faster:
time perf.libdw.speedup report --sort symbol --call-graph --stdio 2&>1
> /dev/null
real 0m3.484s
user 0m2.360s
sys 0m1.070s
Thanks,
Jean
On 24 September 2014 04:24, Namhyung Kim <namhyung@kernel.org> wrote:
> Hi Arun,
>
> On Tue, 23 Sep 2014 14:01:22 +0000, Arun Sharma wrote:
>> On 9/23/14, 12:00 PM, Namhyung Kim wrote:
>>
>>> + unw_set_caching_policy(addr_space, UNW_CACHE_GLOBAL);
>>
>> The result is a bit surprising for me. In micro benchmarking (eg:
>> Lperf-simple), the per-thread policy is generally faster because it
>> doesn't involve locking.
>>
>> libunwind/tests/Lperf-simple
>> unw_getcontext : cold avg= 109.673 nsec, warm avg= 28.610 nsec
>> unw_init_local : cold avg= 259.876 nsec, warm avg= 9.537 nsec
>> no cache : unw_step : 1st= 3258.387 min= 2922.331 avg= 3002.384 nsec
>> global cache : unw_step : 1st= 1192.093 min= 960.486 avg= 982.208 nsec
>> per-thread cache: unw_step : 1st= 429.153 min= 113.533 avg= 121.762 nsec
>
> Yes, per-thread policy is faster than global caching policy. Below is my
> test result. Note that I had already run this several times beforehand
> to remove the effect of file contents being loaded into the page cache.
>
> Performance counter stats for
> 'perf report -i /home/namhyung/tmp/perf-testing/perf.data.kbuild.dwarf --stdio' (3 runs):
>
> UNW_CACHE_NONE UNW_CACHE_GLOBAL UNW_CACHE_PER_THREAD
> -----------------------------------------------------------------------------------------------
> task-clock (msec) 14298.911947 7112.171928 6913.244797
> context-switches 1,507 762 742
> cpu-migrations 1 2 1
> page-faults 2,924,889 1,101,380 1,101,380
> cycles 53,895,784,665 26,798,627,423 26,070,728,349
> stalled-cycles-frontend 24,472,506,687 12,577,760,746 12,435,320,081
> stalled-cycles-backend 17,550,483,726 9,075,054,009 9,035,478,957
> instructions 73,544,039,490 34,352,889,707 33,283,120,736
> branches 14,969,890,371 7,139,469,848 6,926,994,151
> branch-misses 193,852,116 100,455,431 99,757,213
> time elapsed 14.905719730 7.455597356 7.242275972
>
>
>>
>> I can see how the global policy would involve less memory allocation
>> because of shared data structures. Curious about the reason for the
>> speedup (specifically if libunwind should change the defaults for the
>> non-local unwinding case).
>
> I don't see much difference between global and per-thread caching for
> remote unwind (besides the rs_cache->lock you mentioned). Also I'm curious
> how rs_new() is protected from concurrent accesses with per-thread
> caching. That's why I chose the global caching - yeah, it probably
> doesn't matter to a single thread, but... :)
>
> Thanks
> Namhyung
* Re: [PATCH 2/2] perf callchain: Use global caching provided by libunwind
2014-09-24 13:45 ` Jean Pihet
@ 2014-09-26 5:50 ` Namhyung Kim
2014-09-26 7:14 ` Jean Pihet
0 siblings, 1 reply; 14+ messages in thread
From: Namhyung Kim @ 2014-09-26 5:50 UTC (permalink / raw)
To: Jean Pihet
Cc: Arun Sharma, Arnaldo Carvalho de Melo, Peter Zijlstra,
Ingo Molnar, Paul Mackerras, Namhyung Kim, LKML, Jiri Olsa
Hi Jean,
On Wed, 24 Sep 2014 15:45:57 +0200, Jean Pihet wrote:
> Hi!
>
> Here are the test results on ARMv7 for the 2 patches. The speedup is
> about x2.1 for identical unwinding output data.
>
> 'perf record --call-graph dwarf -- stress --cpu 2 --io 2 --vm 2
> --timeout 10s' generates a 365 MB perf.data file.
>
> time perf.orig report --sort symbol --call-graph --stdio 2&>1 > /dev/null
> average on 3 runs
> real 36.736
> user 14.79
> sys 21.91
>
> time perf.libunwind.speedup report --sort symbol --call-graph --stdio
> 2&>1 > /dev/null
> average on 3 runs
> real 17.41 x2.11
> user 6.42 x2.3
> sys 10.97 x2
>
> So the patches definitely speedup the unwinding.
> FWIW: Acked-by: Jean Pihet <jean.pihet@linaro.org>
Thanks for your test!
Btw, have you checked the difference of the output before and after the
caching?
>
> For info unwinding using libdw is about 5x faster:
> time perf.libdw.speedup report --sort symbol --call-graph --stdio 2&>1
>> /dev/null
> real 0m3.484s
> user 0m2.360s
> sys 0m1.070s
Wow, it's pretty nice. I'll take a look at the libdw unwinding later.
Thanks,
Namhyung
* Re: [PATCH 2/2] perf callchain: Use global caching provided by libunwind
2014-09-26 5:50 ` Namhyung Kim
@ 2014-09-26 7:14 ` Jean Pihet
2014-09-29 2:35 ` Namhyung Kim
0 siblings, 1 reply; 14+ messages in thread
From: Jean Pihet @ 2014-09-26 7:14 UTC (permalink / raw)
To: Namhyung Kim
Cc: Arun Sharma, Arnaldo Carvalho de Melo, Peter Zijlstra,
Ingo Molnar, Paul Mackerras, Namhyung Kim, LKML, Jiri Olsa
Hi,
On 26 September 2014 07:50, Namhyung Kim <namhyung@kernel.org> wrote:
> Hi Jean,
>
> On Wed, 24 Sep 2014 15:45:57 +0200, Jean Pihet wrote:
>> Hi!
>>
>> Here are the test results on ARMv7 for the 2 patches. The speedup is
>> about x2.1 for identical unwinding output data.
>>
>> 'perf record --call-graph dwarf -- stress --cpu 2 --io 2 --vm 2
>> --timeout 10s' generates a 365 MB perf.data file.
>>
>> time perf.orig report --sort symbol --call-graph --stdio 2&>1 > /dev/null
>> average on 3 runs
>> real 36.736
>> user 14.79
>> sys 21.91
>>
>> time perf.libunwind.speedup report --sort symbol --call-graph --stdio
>> 2&>1 > /dev/null
>> average on 3 runs
>> real 17.41 x2.11
>> user 6.42 x2.3
>> sys 10.97 x2
>>
>> So the patches definitely speedup the unwinding.
>> FWIW: Acked-by: Jean Pihet <jean.pihet@linaro.org>
>
> Thanks for your test!
>
> Btw, have you checked the difference of the output before and after the
> caching?
Yes, the output is identical (as stated above).
>
>>
>> For info unwinding using libdw is about 5x faster:
>> time perf.libdw.speedup report --sort symbol --call-graph --stdio 2&>1
>>> /dev/null
>> real 0m3.484s
>> user 0m2.360s
>> sys 0m1.070s
>
> Wow, it's pretty nice. I'll take a look at the libdw unwinding later.
>
> Thanks,
> Namhyung
Thanks for the patches!
Jean
* Re: [PATCH 1/2] perf callchain: Create an address space per thread
2014-09-23 12:49 ` Namhyung Kim
@ 2014-09-26 15:35 ` Arnaldo Carvalho de Melo
0 siblings, 0 replies; 14+ messages in thread
From: Arnaldo Carvalho de Melo @ 2014-09-26 15:35 UTC (permalink / raw)
To: Namhyung Kim
Cc: Jiri Olsa, Peter Zijlstra, Ingo Molnar, Paul Mackerras,
Namhyung Kim, LKML, Jean Pihet, Arun Sharma
Em Tue, Sep 23, 2014 at 09:49:22PM +0900, Namhyung Kim escreveu:
> On Tue, Sep 23, 2014 at 9:24 PM, Jiri Olsa <jolsa@redhat.com> wrote:
> >> list_add(&comm->list, &thread->comm_list);
> >> +
> >> + if (unwind__prepare_access(thread) < 0) {
> >
> > you could call list_add below this call and thus save
> > the list_del call below in error path
> >
> > looks like you don't need comm to be on comm_list within
> > the unwind__prepare_access call
>
> Right. Will change.
Ok, waiting for v2 then.
- Arnaldo
* Re: [PATCH 2/2] perf callchain: Use global caching provided by libunwind
2014-09-26 7:14 ` Jean Pihet
@ 2014-09-29 2:35 ` Namhyung Kim
0 siblings, 0 replies; 14+ messages in thread
From: Namhyung Kim @ 2014-09-29 2:35 UTC (permalink / raw)
To: Jean Pihet
Cc: Arun Sharma, Arnaldo Carvalho de Melo, Peter Zijlstra,
Ingo Molnar, Paul Mackerras, Namhyung Kim, LKML, Jiri Olsa
Hi Jean,
On Fri, 26 Sep 2014 09:14:41 +0200, Jean Pihet wrote:
> Hi,
>
> On 26 September 2014 07:50, Namhyung Kim <namhyung@kernel.org> wrote:
>> Hi Jean,
>>
>> On Wed, 24 Sep 2014 15:45:57 +0200, Jean Pihet wrote:
>>> Hi!
>>>
>>> Here are the test results on ARMv7 for the 2 patches. The speedup is
>>> about x2.1 for identical unwinding output data.
>>>
>>> 'perf record --call-graph dwarf -- stress --cpu 2 --io 2 --vm 2
>>> --timeout 10s' generates a 365 MB perf.data file.
>>>
>>> time perf.orig report --sort symbol --call-graph --stdio 2&>1 > /dev/null
>>> average on 3 runs
>>> real 36.736
>>> user 14.79
>>> sys 21.91
>>>
>>> time perf.libunwind.speedup report --sort symbol --call-graph --stdio
>>> 2&>1 > /dev/null
>>> average on 3 runs
>>> real 17.41 x2.11
>>> user 6.42 x2.3
>>> sys 10.97 x2
>>>
>>> So the patches definitely speedup the unwinding.
>>> FWIW: Acked-by: Jean Pihet <jean.pihet@linaro.org>
>>
>> Thanks for your test!
>>
>> Btw, have you checked the difference of the output before and after the
>> caching?
> Yes the output is identical (as stated here above).
Ah, I misunderstood it. Thanks for your confirmation. :)
Thanks,
Namhyung