linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] perf thread-stack: x86 retpolines
@ 2019-01-09  9:18 Adrian Hunter
  2019-01-09  9:18 ` [PATCH 1/6] perf tools: Fix split_kallsyms_for_kcore for trampoline symbols Adrian Hunter
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Adrian Hunter @ 2019-01-09  9:18 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Jiri Olsa, linux-kernel

Hi

Here are some patches to improve the exported call graph, primarily to deal
with x86 retpolines.


Adrian Hunter (6):
      perf tools: Fix split_kallsyms_for_kcore for trampoline symbols
      perf thread-stack: Tidy thread_stack__push_cp() usage
      perf thread-stack: Tidy thread_stack__no_call_return() by adding more local variables
      perf thread-stack: Represent jmps to the start of a different symbol
      perf thread-stack: Improve thread_stack__no_call_return()
      perf thread-stack: Hide x86 retpolines

 tools/perf/scripts/python/export-to-postgresql.py |   2 +-
 tools/perf/scripts/python/export-to-sqlite.py     |   2 +-
 tools/perf/util/symbol.c                          |   2 +
 tools/perf/util/thread-stack.c                    | 235 ++++++++++++++++++----
 tools/perf/util/thread-stack.h                    |   3 +
 5 files changed, 208 insertions(+), 36 deletions(-)


Regards
Adrian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/6] perf tools: Fix split_kallsyms_for_kcore for trampoline symbols
  2019-01-09  9:18 [PATCH 0/6] perf thread-stack: x86 retpolines Adrian Hunter
@ 2019-01-09  9:18 ` Adrian Hunter
  2019-02-09 12:57   ` [tip:perf/core] perf tools: Fix split_kallsyms_for_kcore() " tip-bot for Adrian Hunter
  2019-01-09  9:18 ` [PATCH 2/6] perf thread-stack: Tidy thread_stack__push_cp() usage Adrian Hunter
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Adrian Hunter @ 2019-01-09  9:18 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Jiri Olsa, linux-kernel

Kallsyms symbols do not have a size, so the size becomes the distance to
the next symbol. Consequently the recently added trampoline symbols end up
with large sizes because the trampolines are some distance from one another
and the main kernel map. However, symbols that end outside their map can
disrupt the symbol tree because, after mapping, it can appear incorrectly
that they overlap other symbols. Add logic to truncate symbol size to the
end of the corresponding map.

Fixes: d83212d5dd67 ("kallsyms, x86: Export addresses of PTI entry trampolines")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/util/symbol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 01f2c7385e38..baf6498d1835 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -709,6 +709,8 @@ static int map_groups__split_kallsyms_for_kcore(struct map_groups *kmaps, struct
 		}
 
 		pos->start -= curr_map->start - curr_map->pgoff;
+		if (pos->end > curr_map->end)
+			pos->end = curr_map->end;
 		if (pos->end)
 			pos->end -= curr_map->start - curr_map->pgoff;
 		symbols__insert(&curr_map->dso->symbols, pos);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/6] perf thread-stack: Tidy thread_stack__push_cp() usage
  2019-01-09  9:18 [PATCH 0/6] perf thread-stack: x86 retpolines Adrian Hunter
  2019-01-09  9:18 ` [PATCH 1/6] perf tools: Fix split_kallsyms_for_kcore for trampoline symbols Adrian Hunter
@ 2019-01-09  9:18 ` Adrian Hunter
  2019-02-09 12:58   ` [tip:perf/core] " tip-bot for Adrian Hunter
  2019-01-09  9:18 ` [PATCH 3/6] perf thread-stack: Tidy thread_stack__no_call_return() by adding more local variables Adrian Hunter
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Adrian Hunter @ 2019-01-09  9:18 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Jiri Olsa, linux-kernel

If 'cp' is checked in thread_stack__push_cp() a number of error checks can
be removed, reducing code size and improving readability.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/util/thread-stack.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
index d52f27f373ce..93ba3ab6a602 100644
--- a/tools/perf/util/thread-stack.c
+++ b/tools/perf/util/thread-stack.c
@@ -493,6 +493,9 @@ static int thread_stack__push_cp(struct thread_stack *ts, u64 ret_addr,
 	struct thread_stack_entry *tse;
 	int err;
 
+	if (!cp)
+		return -ENOMEM;
+
 	if (ts->cnt == ts->sz) {
 		err = thread_stack__grow(ts);
 		if (err)
@@ -576,8 +579,6 @@ static int thread_stack__bottom(struct thread_stack *ts,
 
 	cp = call_path__findnew(cpr, &cpr->call_path, sym, ip,
 				ts->kernel_start);
-	if (!cp)
-		return -ENOMEM;
 
 	return thread_stack__push_cp(ts, ip, sample->time, ref, cp,
 				     true, false);
@@ -609,8 +610,6 @@ static int thread_stack__no_call_return(struct thread *thread,
 			cp = call_path__findnew(cpr, &cpr->call_path,
 						to_al->sym, sample->addr,
 						ts->kernel_start);
-			if (!cp)
-				return -ENOMEM;
 			return thread_stack__push_cp(ts, 0, sample->time, ref,
 						     cp, true, false);
 		}
@@ -633,8 +632,6 @@ static int thread_stack__no_call_return(struct thread *thread,
 	/* This 'return' had no 'call', so push and pop top of stack */
 	cp = call_path__findnew(cpr, parent, from_al->sym, sample->ip,
 				ts->kernel_start);
-	if (!cp)
-		return -ENOMEM;
 
 	err = thread_stack__push_cp(ts, sample->addr, sample->time, ref, cp,
 				    true, false);
@@ -680,8 +677,6 @@ static int thread_stack__trace_end(struct thread_stack *ts,
 
 	cp = call_path__findnew(cpr, ts->stack[ts->cnt - 1].cp, NULL, 0,
 				ts->kernel_start);
-	if (!cp)
-		return -ENOMEM;
 
 	ret_addr = sample->ip + sample->insn_len;
 
@@ -745,8 +740,6 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
 		cp = call_path__findnew(cpr, ts->stack[ts->cnt - 1].cp,
 					to_al->sym, sample->addr,
 					ts->kernel_start);
-		if (!cp)
-			return -ENOMEM;
 		err = thread_stack__push_cp(ts, ret_addr, sample->time, ref,
 					    cp, false, trace_end);
 	} else if (sample->flags & PERF_IP_FLAG_RETURN) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/6] perf thread-stack: Tidy thread_stack__no_call_return() by adding more local variables
  2019-01-09  9:18 [PATCH 0/6] perf thread-stack: x86 retpolines Adrian Hunter
  2019-01-09  9:18 ` [PATCH 1/6] perf tools: Fix split_kallsyms_for_kcore for trampoline symbols Adrian Hunter
  2019-01-09  9:18 ` [PATCH 2/6] perf thread-stack: Tidy thread_stack__push_cp() usage Adrian Hunter
@ 2019-01-09  9:18 ` Adrian Hunter
  2019-02-09 12:58   ` [tip:perf/core] " tip-bot for Adrian Hunter
  2019-01-09  9:18 ` [PATCH 4/6] perf thread-stack: Represent jmps to the start of a different symbol Adrian Hunter
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Adrian Hunter @ 2019-01-09  9:18 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Jiri Olsa, linux-kernel

Make thread_stack__no_call_return() more readable by adding more local
variables.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/util/thread-stack.c | 35 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
index 93ba3ab6a602..7f8eff018c16 100644
--- a/tools/perf/util/thread-stack.c
+++ b/tools/perf/util/thread-stack.c
@@ -591,34 +591,36 @@ static int thread_stack__no_call_return(struct thread *thread,
 					struct addr_location *to_al, u64 ref)
 {
 	struct call_path_root *cpr = ts->crp->cpr;
+	struct call_path *root = &cpr->call_path;
+	struct symbol *fsym = from_al->sym;
+	struct symbol *tsym = to_al->sym;
 	struct call_path *cp, *parent;
 	u64 ks = ts->kernel_start;
+	u64 addr = sample->addr;
+	u64 tm = sample->time;
+	u64 ip = sample->ip;
 	int err;
 
-	if (sample->ip >= ks && sample->addr < ks) {
+	if (ip >= ks && addr < ks) {
 		/* Return to userspace, so pop all kernel addresses */
 		while (thread_stack__in_kernel(ts)) {
 			err = thread_stack__call_return(thread, ts, --ts->cnt,
-							sample->time, ref,
-							true);
+							tm, ref, true);
 			if (err)
 				return err;
 		}
 
 		/* If the stack is empty, push the userspace address */
 		if (!ts->cnt) {
-			cp = call_path__findnew(cpr, &cpr->call_path,
-						to_al->sym, sample->addr,
-						ts->kernel_start);
-			return thread_stack__push_cp(ts, 0, sample->time, ref,
-						     cp, true, false);
+			cp = call_path__findnew(cpr, root, tsym, addr, ks);
+			return thread_stack__push_cp(ts, 0, tm, ref, cp, true,
+						     false);
 		}
-	} else if (thread_stack__in_kernel(ts) && sample->ip < ks) {
+	} else if (thread_stack__in_kernel(ts) && ip < ks) {
 		/* Return to userspace, so pop all kernel addresses */
 		while (thread_stack__in_kernel(ts)) {
 			err = thread_stack__call_return(thread, ts, --ts->cnt,
-							sample->time, ref,
-							true);
+							tm, ref, true);
 			if (err)
 				return err;
 		}
@@ -627,19 +629,16 @@ static int thread_stack__no_call_return(struct thread *thread,
 	if (ts->cnt)
 		parent = ts->stack[ts->cnt - 1].cp;
 	else
-		parent = &cpr->call_path;
+		parent = root;
 
 	/* This 'return' had no 'call', so push and pop top of stack */
-	cp = call_path__findnew(cpr, parent, from_al->sym, sample->ip,
-				ts->kernel_start);
+	cp = call_path__findnew(cpr, parent, fsym, ip, ks);
 
-	err = thread_stack__push_cp(ts, sample->addr, sample->time, ref, cp,
-				    true, false);
+	err = thread_stack__push_cp(ts, addr, tm, ref, cp, true, false);
 	if (err)
 		return err;
 
-	return thread_stack__pop_cp(thread, ts, sample->addr, sample->time, ref,
-				    to_al->sym);
+	return thread_stack__pop_cp(thread, ts, addr, tm, ref, tsym);
 }
 
 static int thread_stack__trace_begin(struct thread *thread,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/6] perf thread-stack: Represent jmps to the start of a different symbol
  2019-01-09  9:18 [PATCH 0/6] perf thread-stack: x86 retpolines Adrian Hunter
                   ` (2 preceding siblings ...)
  2019-01-09  9:18 ` [PATCH 3/6] perf thread-stack: Tidy thread_stack__no_call_return() by adding more local variables Adrian Hunter
@ 2019-01-09  9:18 ` Adrian Hunter
  2019-02-06 12:39   ` Arnaldo Carvalho de Melo
  2019-02-09 12:59   ` [tip:perf/core] " tip-bot for Adrian Hunter
  2019-01-09  9:18 ` [PATCH 5/6] perf thread-stack: Improve thread_stack__no_call_return() Adrian Hunter
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 23+ messages in thread
From: Adrian Hunter @ 2019-01-09  9:18 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Jiri Olsa, linux-kernel

The compiler might optimize a call/ret combination by making it a jmp.
However the thread-stack does not presently cater for that, so that such
control flow is not visible in the call graph. Make it visible by recording
on the stack a branch to the start of a different symbol. Note, that means
when a ret pops the stack, all jmps must be popped off first.

Example:

$ cat jmp-to-fn.c
__attribute__((noinline)) int bar(void)
{
        return -1;
}

__attribute__((noinline)) int foo(void)
{
        return bar() + 1;
}

int main()
{
        return foo();
}
$ gcc -ggdb3 -Wall -Wextra -O2 -o jmp-to-fn jmp-to-fn.c
$ objdump -d jmp-to-fn
<SNIP>
0000000000001040 <main>:
    1040:       31 c0                   xor    %eax,%eax
    1042:       e9 09 01 00 00          jmpq   1150 <foo>
<SNIP>
0000000000001140 <bar>:
    1140:       b8 ff ff ff ff          mov    $0xffffffff,%eax
    1145:       c3                      retq
<SNIP>
0000000000001150 <foo>:
    1150:       31 c0                   xor    %eax,%eax
    1152:       e8 e9 ff ff ff          callq  1140 <bar>
    1157:       83 c0 01                add    $0x1,%eax
    115a:       c3                      retq
<SNIP>
$ perf record -o jmp-to-fn.perf.data -e intel_pt/cyc/u ./jmp-to-fn
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0,017 MB jmp-to-fn.perf.data ]
$ perf script -i jmp-to-fn.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py jmp-to-fn.db branches calls
2019-01-08 13:24:58.783069 Creating database...
2019-01-08 13:24:58.794650 Writing records...
2019-01-08 13:24:59.008050 Adding indexes
2019-01-08 13:24:59.015802 Done
$  ~/libexec/perf-core/scripts/python/exported-sql-viewer.py jmp-to-fn.db

Before:

    main
        -> bar

After:

    main
        -> foo
            -> bar

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 .../scripts/python/export-to-postgresql.py    |  2 +-
 tools/perf/scripts/python/export-to-sqlite.py |  2 +-
 tools/perf/util/thread-stack.c                | 30 +++++++++++++++++--
 tools/perf/util/thread-stack.h                |  3 ++
 4 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/tools/perf/scripts/python/export-to-postgresql.py b/tools/perf/scripts/python/export-to-postgresql.py
index 0564dd7377f2..30130213da7e 100644
--- a/tools/perf/scripts/python/export-to-postgresql.py
+++ b/tools/perf/scripts/python/export-to-postgresql.py
@@ -478,7 +478,7 @@ if perf_db_export_calls:
 			'branch_count,'
 			'call_id,'
 			'return_id,'
-			'CASE WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' ELSE \'\' END AS flags,'
+			'CASE WHEN flags=0 THEN \'\' WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' WHEN flags=6 THEN \'jump\' ELSE flags END AS flags,'
 			'parent_call_path_id'
 		' FROM calls INNER JOIN call_paths ON call_paths.id = call_path_id')
 
diff --git a/tools/perf/scripts/python/export-to-sqlite.py b/tools/perf/scripts/python/export-to-sqlite.py
index 245caf2643ed..ed237f2ed03f 100644
--- a/tools/perf/scripts/python/export-to-sqlite.py
+++ b/tools/perf/scripts/python/export-to-sqlite.py
@@ -320,7 +320,7 @@ if perf_db_export_calls:
 			'branch_count,'
 			'call_id,'
 			'return_id,'
-			'CASE WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' ELSE \'\' END AS flags,'
+			'CASE WHEN flags=0 THEN \'\' WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' WHEN flags=6 THEN \'jump\' ELSE flags END AS flags,'
 			'parent_call_path_id'
 		' FROM calls INNER JOIN call_paths ON call_paths.id = call_path_id')
 
diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
index 7f8eff018c16..f52c0f90915d 100644
--- a/tools/perf/util/thread-stack.c
+++ b/tools/perf/util/thread-stack.c
@@ -38,6 +38,7 @@
  * @cp: call path
  * @no_call: a 'call' was not seen
  * @trace_end: a 'call' but trace ended
+ * @non_call: a branch but not a 'call' to the start of a different symbol
  */
 struct thread_stack_entry {
 	u64 ret_addr;
@@ -47,6 +48,7 @@ struct thread_stack_entry {
 	struct call_path *cp;
 	bool no_call;
 	bool trace_end;
+	bool non_call;
 };
 
 /**
@@ -268,6 +270,8 @@ static int thread_stack__call_return(struct thread *thread,
 		cr.flags |= CALL_RETURN_NO_CALL;
 	if (no_return)
 		cr.flags |= CALL_RETURN_NO_RETURN;
+	if (tse->non_call)
+		cr.flags |= CALL_RETURN_NON_CALL;
 
 	return crp->process(&cr, crp->data);
 }
@@ -510,6 +514,7 @@ static int thread_stack__push_cp(struct thread_stack *ts, u64 ret_addr,
 	tse->cp = cp;
 	tse->no_call = no_call;
 	tse->trace_end = trace_end;
+	tse->non_call = false;
 
 	return 0;
 }
@@ -531,14 +536,16 @@ static int thread_stack__pop_cp(struct thread *thread, struct thread_stack *ts,
 							 timestamp, ref, false);
 	}
 
-	if (ts->stack[ts->cnt - 1].ret_addr == ret_addr) {
+	if (ts->stack[ts->cnt - 1].ret_addr == ret_addr &&
+	    !ts->stack[ts->cnt - 1].non_call) {
 		return thread_stack__call_return(thread, ts, --ts->cnt,
 						 timestamp, ref, false);
 	} else {
 		size_t i = ts->cnt - 1;
 
 		while (i--) {
-			if (ts->stack[i].ret_addr != ret_addr)
+			if (ts->stack[i].ret_addr != ret_addr ||
+			    ts->stack[i].non_call)
 				continue;
 			i += 1;
 			while (ts->cnt > i) {
@@ -757,6 +764,25 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
 		err = thread_stack__trace_begin(thread, ts, sample->time, ref);
 	} else if (sample->flags & PERF_IP_FLAG_TRACE_END) {
 		err = thread_stack__trace_end(ts, sample, ref);
+	} else if (sample->flags & PERF_IP_FLAG_BRANCH &&
+		   from_al->sym != to_al->sym && to_al->sym &&
+		   to_al->addr == to_al->sym->start) {
+		struct call_path_root *cpr = ts->crp->cpr;
+		struct call_path *cp;
+
+		/*
+		 * The compiler might optimize a call/ret combination by making
+		 * it a jmp. Make that visible by recording on the stack a
+		 * branch to the start of a different symbol. Note, that means
+		 * when a ret pops the stack, all jmps must be popped off first.
+		 */
+		cp = call_path__findnew(cpr, ts->stack[ts->cnt - 1].cp,
+					to_al->sym, sample->addr,
+					ts->kernel_start);
+		err = thread_stack__push_cp(ts, 0, sample->time, ref, cp, false,
+					    false);
+		if (!err)
+			ts->stack[ts->cnt - 1].non_call = true;
 	}
 
 	return err;
diff --git a/tools/perf/util/thread-stack.h b/tools/perf/util/thread-stack.h
index 1f626f4a1c40..b7c04e19ad41 100644
--- a/tools/perf/util/thread-stack.h
+++ b/tools/perf/util/thread-stack.h
@@ -35,10 +35,13 @@ struct call_path;
  *
  * CALL_RETURN_NO_CALL: 'return' but no matching 'call'
  * CALL_RETURN_NO_RETURN: 'call' but no matching 'return'
+ * CALL_RETURN_NON_CALL: a branch but not a 'call' to the start of a different
+ *                       symbol
  */
 enum {
 	CALL_RETURN_NO_CALL	= 1 << 0,
 	CALL_RETURN_NO_RETURN	= 1 << 1,
+	CALL_RETURN_NON_CALL	= 1 << 2,
 };
 
 /**
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 5/6] perf thread-stack: Improve thread_stack__no_call_return()
  2019-01-09  9:18 [PATCH 0/6] perf thread-stack: x86 retpolines Adrian Hunter
                   ` (3 preceding siblings ...)
  2019-01-09  9:18 ` [PATCH 4/6] perf thread-stack: Represent jmps to the start of a different symbol Adrian Hunter
@ 2019-01-09  9:18 ` Adrian Hunter
  2019-02-22  9:48   ` Adrian Hunter
  2019-02-28  7:49   ` [tip:perf/core] " tip-bot for Adrian Hunter
  2019-01-09  9:18 ` [PATCH 6/6] perf thread-stack: Hide x86 retpolines Adrian Hunter
  2019-01-10  9:55 ` [PATCH 0/6] perf thread-stack: " Jiri Olsa
  6 siblings, 2 replies; 23+ messages in thread
From: Adrian Hunter @ 2019-01-09  9:18 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Jiri Olsa, linux-kernel

Improve thread_stack__no_call_return() to better handle 'returns' that do
not match the stack i.e. 'no call'. See code comments for details.
The example below shows how retpolines are affected:

Example:

$ cat simple-retpoline.c
__attribute__((noinline)) int bar(void)
{
        return -1;
}

int foo(void)
{
        return bar() + 1;
}

__attribute__((indirect_branch("thunk"))) int main()
{
        int (*volatile fn)(void) = foo;

        fn();
        return fn();
}
$ gcc -ggdb3 -Wall -Wextra -O2 -o simple-retpoline simple-retpoline.c
$ objdump -d simple-retpoline
<SNIP>
0000000000001040 <main>:
    1040:       48 83 ec 18             sub    $0x18,%rsp
    1044:       48 8d 05 25 01 00 00    lea    0x125(%rip),%rax        # 1170 <foo>
    104b:       48 89 44 24 08          mov    %rax,0x8(%rsp)
    1050:       48 8b 44 24 08          mov    0x8(%rsp),%rax
    1055:       e8 1f 01 00 00          callq  1179 <__x86_indirect_thunk_rax>
    105a:       48 8b 44 24 08          mov    0x8(%rsp),%rax
    105f:       48 83 c4 18             add    $0x18,%rsp
    1063:       e9 11 01 00 00          jmpq   1179 <__x86_indirect_thunk_rax>
<SNIP>
0000000000001160 <bar>:
    1160:       b8 ff ff ff ff          mov    $0xffffffff,%eax
    1165:       c3                      retq
<SNIP>
0000000000001170 <foo>:
    1170:       e8 eb ff ff ff          callq  1160 <bar>
    1175:       83 c0 01                add    $0x1,%eax
    1178:       c3                      retq
0000000000001179 <__x86_indirect_thunk_rax>:
    1179:       e8 07 00 00 00          callq  1185 <__x86_indirect_thunk_rax+0xc>
    117e:       f3 90                   pause
    1180:       0f ae e8                lfence
    1183:       eb f9                   jmp    117e <__x86_indirect_thunk_rax+0x5>
    1185:       48 89 04 24             mov    %rax,(%rsp)
    1189:       c3                      retq
<SNIP>
$ perf record -o simple-retpoline.perf.data -e intel_pt/cyc/u ./simple-retpoline
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0,017 MB simple-retpoline.perf.data ]
$ perf script -i simple-retpoline.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py simple-retpoline.db branches calls
2019-01-08 14:03:37.851655 Creating database...
2019-01-08 14:03:37.863256 Writing records...
2019-01-08 14:03:38.069750 Adding indexes
2019-01-08 14:03:38.078799 Done
$ ~/libexec/perf-core/scripts/python/exported-sql-viewer.py simple-retpoline.db

Before:

    main
        -> __x86_indirect_thunk_rax
            -> __x86_indirect_thunk_rax
                -> __x86_indirect_thunk_rax
                    -> bar

After:

    main
        -> __x86_indirect_thunk_rax
            -> __x86_indirect_thunk_rax
                -> foo
                    -> bar

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/util/thread-stack.c | 49 +++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
index f52c0f90915d..632c07a125ab 100644
--- a/tools/perf/util/thread-stack.c
+++ b/tools/perf/util/thread-stack.c
@@ -638,14 +638,57 @@ static int thread_stack__no_call_return(struct thread *thread,
 	else
 		parent = root;
 
-	/* This 'return' had no 'call', so push and pop top of stack */
-	cp = call_path__findnew(cpr, parent, fsym, ip, ks);
+	if (parent->sym == from_al->sym) {
+		/*
+		 * At the bottom of the stack, assume the missing 'call' was
+		 * before the trace started. So, pop the current symbol and push
+		 * the 'to' symbol.
+		 */
+		if (ts->cnt == 1) {
+			err = thread_stack__call_return(thread, ts, --ts->cnt,
+							tm, ref, false);
+			if (err)
+				return err;
+		}
+
+		if (!ts->cnt) {
+			cp = call_path__findnew(cpr, root, tsym, addr, ks);
+
+			return thread_stack__push_cp(ts, addr, tm, ref, cp,
+						     true, false);
+		}
+
+		/*
+		 * Otherwise assume the 'return' is being used as a jump (e.g.
+		 * retpoline) and just push the 'to' symbol.
+		 */
+		cp = call_path__findnew(cpr, parent, tsym, addr, ks);
+
+		err = thread_stack__push_cp(ts, 0, tm, ref, cp, true, false);
+		if (!err)
+			ts->stack[ts->cnt - 1].non_call = true;
+
+		return err;
+	}
+
+	/*
+	 * Assume 'parent' has not yet returned, so push 'to', and then push and
+	 * pop 'from'.
+	 */
+
+	cp = call_path__findnew(cpr, parent, tsym, addr, ks);
 
 	err = thread_stack__push_cp(ts, addr, tm, ref, cp, true, false);
 	if (err)
 		return err;
 
-	return thread_stack__pop_cp(thread, ts, addr, tm, ref, tsym);
+	cp = call_path__findnew(cpr, cp, fsym, ip, ks);
+
+	err = thread_stack__push_cp(ts, ip, tm, ref, cp, true, false);
+	if (err)
+		return err;
+
+	return thread_stack__call_return(thread, ts, --ts->cnt, tm, ref, false);
 }
 
 static int thread_stack__trace_begin(struct thread *thread,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 6/6] perf thread-stack: Hide x86 retpolines
  2019-01-09  9:18 [PATCH 0/6] perf thread-stack: x86 retpolines Adrian Hunter
                   ` (4 preceding siblings ...)
  2019-01-09  9:18 ` [PATCH 5/6] perf thread-stack: Improve thread_stack__no_call_return() Adrian Hunter
@ 2019-01-09  9:18 ` Adrian Hunter
  2019-01-09 15:38   ` Jiri Olsa
                     ` (2 more replies)
  2019-01-10  9:55 ` [PATCH 0/6] perf thread-stack: " Jiri Olsa
  6 siblings, 3 replies; 23+ messages in thread
From: Adrian Hunter @ 2019-01-09  9:18 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Jiri Olsa, linux-kernel

x86 retpoline functions pollute the call graph by showing up everywhere
there is an indirect branch, but they do not really mean anything. Make
changes so that the default retpoline functions will no longer appear in
the call graph. Note this only affects the call graph, since all the
original branches are left unchanged.

This does not handle function return thunks, nor is there any improvement
for the handling of inline thunks or extern thunks.

Example:

$ cat simple-retpoline.c
__attribute__((noinline)) int bar(void)
{
        return -1;
}

int foo(void)
{
        return bar() + 1;
}

__attribute__((indirect_branch("thunk"))) int main()
{
        int (*volatile fn)(void) = foo;

        fn();
        return fn();
}
$ gcc -ggdb3 -Wall -Wextra -O2 -o simple-retpoline simple-retpoline.c
$ objdump -d simple-retpoline
<SNIP>
0000000000001040 <main>:
    1040:       48 83 ec 18             sub    $0x18,%rsp
    1044:       48 8d 05 25 01 00 00    lea    0x125(%rip),%rax        # 1170 <foo>
    104b:       48 89 44 24 08          mov    %rax,0x8(%rsp)
    1050:       48 8b 44 24 08          mov    0x8(%rsp),%rax
    1055:       e8 1f 01 00 00          callq  1179 <__x86_indirect_thunk_rax>
    105a:       48 8b 44 24 08          mov    0x8(%rsp),%rax
    105f:       48 83 c4 18             add    $0x18,%rsp
    1063:       e9 11 01 00 00          jmpq   1179 <__x86_indirect_thunk_rax>
<SNIP>
0000000000001160 <bar>:
    1160:       b8 ff ff ff ff          mov    $0xffffffff,%eax
    1165:       c3                      retq
<SNIP>
0000000000001170 <foo>:
    1170:       e8 eb ff ff ff          callq  1160 <bar>
    1175:       83 c0 01                add    $0x1,%eax
    1178:       c3                      retq
0000000000001179 <__x86_indirect_thunk_rax>:
    1179:       e8 07 00 00 00          callq  1185 <__x86_indirect_thunk_rax+0xc>
    117e:       f3 90                   pause
    1180:       0f ae e8                lfence
    1183:       eb f9                   jmp    117e <__x86_indirect_thunk_rax+0x5>
    1185:       48 89 04 24             mov    %rax,(%rsp)
    1189:       c3                      retq
<SNIP>
$ perf record -o simple-retpoline.perf.data -e intel_pt/cyc/u ./simple-retpoline
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0,017 MB simple-retpoline.perf.data ]
$ perf script -i simple-retpoline.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py simple-retpoline.db branches calls
2019-01-08 14:03:37.851655 Creating database...
2019-01-08 14:03:37.863256 Writing records...
2019-01-08 14:03:38.069750 Adding indexes
2019-01-08 14:03:38.078799 Done
$ ~/libexec/perf-core/scripts/python/exported-sql-viewer.py simple-retpoline.db

Before:

    main
        -> __x86_indirect_thunk_rax
            -> __x86_indirect_thunk_rax
                -> foo
                    -> bar

After:

    main
        -> foo
            -> bar

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 tools/perf/util/thread-stack.c | 112 ++++++++++++++++++++++++++++++++-
 1 file changed, 109 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
index 632c07a125ab..805e30836460 100644
--- a/tools/perf/util/thread-stack.c
+++ b/tools/perf/util/thread-stack.c
@@ -20,6 +20,7 @@
 #include "thread.h"
 #include "event.h"
 #include "machine.h"
+#include "env.h"
 #include "util.h"
 #include "debug.h"
 #include "symbol.h"
@@ -29,6 +30,19 @@
 
 #define STACK_GROWTH 2048
 
+/*
+ * State of retpoline detection.
+ *
+ * RETPOLINE_NONE: no retpoline detection
+ * X86_RETPOLINE_POSSIBLE: x86 retpoline possible
+ * X86_RETPOLINE_DETECTED: x86 retpoline detected
+ */
+enum retpoline_state_t {
+	RETPOLINE_NONE,
+	X86_RETPOLINE_POSSIBLE,
+	X86_RETPOLINE_DETECTED,
+};
+
 /**
  * struct thread_stack_entry - thread stack entry.
  * @ret_addr: return address
@@ -64,6 +78,7 @@ struct thread_stack_entry {
  * @crp: call/return processor
  * @comm: current comm
  * @arr_sz: size of array if this is the first element of an array
+ * @rstate: used to detect retpolines
  */
 struct thread_stack {
 	struct thread_stack_entry *stack;
@@ -76,6 +91,7 @@ struct thread_stack {
 	struct call_return_processor *crp;
 	struct comm *comm;
 	unsigned int arr_sz;
+	enum retpoline_state_t rstate;
 };
 
 /*
@@ -115,10 +131,16 @@ static int thread_stack__init(struct thread_stack *ts, struct thread *thread,
 	if (err)
 		return err;
 
-	if (thread->mg && thread->mg->machine)
-		ts->kernel_start = machine__kernel_start(thread->mg->machine);
-	else
+	if (thread->mg && thread->mg->machine) {
+		struct machine *machine = thread->mg->machine;
+		const char *arch = perf_env__arch(machine->env);
+
+		ts->kernel_start = machine__kernel_start(machine);
+		if (!strcmp(arch, "x86"))
+			ts->rstate = X86_RETPOLINE_POSSIBLE;
+	} else {
 		ts->kernel_start = 1ULL << 63;
+	}
 	ts->crp = crp;
 
 	return 0;
@@ -733,6 +755,70 @@ static int thread_stack__trace_end(struct thread_stack *ts,
 				     false, true);
 }
 
+static bool is_x86_retpoline(const char *name)
+{
+	const char *p = strstr(name, "__x86_indirect_thunk_");
+
+	return p == name || !strcmp(name, "__indirect_thunk_start");
+}
+
+/*
+ * x86 retpoline functions pollute the call graph. This function removes them.
+ * This does not handle function return thunks, nor is there any improvement
+ * for the handling of inline thunks or extern thunks.
+ */
+static int thread_stack__x86_retpoline(struct thread_stack *ts,
+				       struct perf_sample *sample,
+				       struct addr_location *to_al)
+{
+	struct thread_stack_entry *tse = &ts->stack[ts->cnt - 1];
+	struct call_path_root *cpr = ts->crp->cpr;
+	struct symbol *sym = tse->cp->sym;
+	struct symbol *tsym = to_al->sym;
+	struct call_path *cp;
+
+	if (sym && sym->name && is_x86_retpoline(sym->name)) {
+		/*
+		 * This is a x86 retpoline fn. It pollutes the call graph by
+		 * showing up everywhere there is an indirect branch, but does
+		 * not itself mean anything. Here the top-of-stack is removed,
+		 * by decrementing the stack count, and then further down, the
+		 * resulting top-of-stack is replaced with the actual target.
+		 * The result is that the retpoline functions will no longer
+		 * appear in the call graph. Note this only affects the call
+		 * graph, since all the original branches are left unchanged.
+		 */
+		ts->cnt -= 1;
+		sym = ts->stack[ts->cnt - 2].cp->sym;
+		if (sym && sym == tsym && to_al->addr != tsym->start) {
+			/*
+			 * Target is back to the middle of the symbol we came
+			 * from so assume it is an indirect jmp and forget it
+			 * altogether.
+			 */
+			ts->cnt -= 1;
+			return 0;
+		}
+	} else if (sym && sym == tsym) {
+		/*
+		 * Target is back to the symbol we came from so assume it is an
+		 * indirect jmp and forget it altogether.
+		 */
+		ts->cnt -= 1;
+		return 0;
+	}
+
+	cp = call_path__findnew(cpr, ts->stack[ts->cnt - 2].cp, tsym,
+				sample->addr, ts->kernel_start);
+	if (!cp)
+		return -ENOMEM;
+
+	/* Replace the top-of-stack with the actual target */
+	ts->stack[ts->cnt - 1].cp = cp;
+
+	return 0;
+}
+
 int thread_stack__process(struct thread *thread, struct comm *comm,
 			  struct perf_sample *sample,
 			  struct addr_location *from_al,
@@ -740,6 +826,7 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
 			  struct call_return_processor *crp)
 {
 	struct thread_stack *ts = thread__stack(thread, sample->cpu);
+	enum retpoline_state_t rstate;
 	int err = 0;
 
 	if (ts && !ts->crp) {
@@ -755,6 +842,10 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
 		ts->comm = comm;
 	}
 
+	rstate = ts->rstate;
+	if (rstate == X86_RETPOLINE_DETECTED)
+		ts->rstate = X86_RETPOLINE_POSSIBLE;
+
 	/* Flush stack on exec */
 	if (ts->comm != comm && thread->pid_ == thread->tid) {
 		err = __thread_stack__flush(thread, ts);
@@ -791,10 +882,25 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
 					ts->kernel_start);
 		err = thread_stack__push_cp(ts, ret_addr, sample->time, ref,
 					    cp, false, trace_end);
+
+		/*
+		 * A call to the same symbol but not the start of the symbol,
+		 * may be the start of a x86 retpoline.
+		 */
+		if (!err && rstate == X86_RETPOLINE_POSSIBLE && to_al->sym &&
+		    from_al->sym == to_al->sym &&
+		    to_al->addr != to_al->sym->start)
+			ts->rstate = X86_RETPOLINE_DETECTED;
+
 	} else if (sample->flags & PERF_IP_FLAG_RETURN) {
 		if (!sample->ip || !sample->addr)
 			return 0;
 
+		/* x86 retpoline 'return' doesn't match the stack */
+		if (rstate == X86_RETPOLINE_DETECTED && ts->cnt > 2 &&
+		    ts->stack[ts->cnt - 1].ret_addr != sample->addr)
+			return thread_stack__x86_retpoline(ts, sample, to_al);
+
 		err = thread_stack__pop_cp(thread, ts, sample->addr,
 					   sample->time, ref, from_al->sym);
 		if (err) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/6] perf thread-stack: Hide x86 retpolines
  2019-01-09  9:18 ` [PATCH 6/6] perf thread-stack: Hide x86 retpolines Adrian Hunter
@ 2019-01-09 15:38   ` Jiri Olsa
  2019-01-10  7:52     ` Adrian Hunter
  2019-02-22 19:42   ` Arnaldo Carvalho de Melo
  2019-02-28  7:49   ` [tip:perf/core] " tip-bot for Adrian Hunter
  2 siblings, 1 reply; 23+ messages in thread
From: Jiri Olsa @ 2019-01-09 15:38 UTC (permalink / raw)
  To: Adrian Hunter; +Cc: Arnaldo Carvalho de Melo, linux-kernel

On Wed, Jan 09, 2019 at 11:18:35AM +0200, Adrian Hunter wrote:
> x86 retpoline functions pollute the call graph by showing up everywhere
> there is an indirect branch, but they do not really mean anything. Make
> changes so that the default retpoline functions will no longer appear in
> the call graph. Note this only affects the call graph, since all the
> original branches are left unchanged.
> 
> This does not handle function return thunks, nor is there any improvement
> for the handling of inline thunks or extern thunks.
> 
> Example:
> 
> $ cat simple-retpoline.c
> __attribute__((noinline)) int bar(void)
> {
>         return -1;
> }
> 
> int foo(void)
> {
>         return bar() + 1;
> }
> 
> __attribute__((indirect_branch("thunk"))) int main()
> {
>         int (*volatile fn)(void) = foo;
> 
>         fn();
>         return fn();
> }
> $ gcc -ggdb3 -Wall -Wextra -O2 -o simple-retpoline simple-retpoline.c
> $ objdump -d simple-retpoline
> <SNIP>
> 0000000000001040 <main>:
>     1040:       48 83 ec 18             sub    $0x18,%rsp
>     1044:       48 8d 05 25 01 00 00    lea    0x125(%rip),%rax        # 1170 <foo>
>     104b:       48 89 44 24 08          mov    %rax,0x8(%rsp)
>     1050:       48 8b 44 24 08          mov    0x8(%rsp),%rax
>     1055:       e8 1f 01 00 00          callq  1179 <__x86_indirect_thunk_rax>
>     105a:       48 8b 44 24 08          mov    0x8(%rsp),%rax
>     105f:       48 83 c4 18             add    $0x18,%rsp
>     1063:       e9 11 01 00 00          jmpq   1179 <__x86_indirect_thunk_rax>
> <SNIP>
> 0000000000001160 <bar>:
>     1160:       b8 ff ff ff ff          mov    $0xffffffff,%eax
>     1165:       c3                      retq
> <SNIP>
> 0000000000001170 <foo>:
>     1170:       e8 eb ff ff ff          callq  1160 <bar>
>     1175:       83 c0 01                add    $0x1,%eax
>     1178:       c3                      retq
> 0000000000001179 <__x86_indirect_thunk_rax>:
>     1179:       e8 07 00 00 00          callq  1185 <__x86_indirect_thunk_rax+0xc>
>     117e:       f3 90                   pause
>     1180:       0f ae e8                lfence
>     1183:       eb f9                   jmp    117e <__x86_indirect_thunk_rax+0x5>
>     1185:       48 89 04 24             mov    %rax,(%rsp)
>     1189:       c3                      retq
> <SNIP>
> $ perf record -o simple-retpoline.perf.data -e intel_pt/cyc/u ./simple-retpoline
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0,017 MB simple-retpoline.perf.data ]
> $ perf script -i simple-retpoline.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py simple-retpoline.db branches calls
> 2019-01-08 14:03:37.851655 Creating database...
> 2019-01-08 14:03:37.863256 Writing records...
> 2019-01-08 14:03:38.069750 Adding indexes
> 2019-01-08 14:03:38.078799 Done
> $ ~/libexec/perf-core/scripts/python/exported-sql-viewer.py simple-retpoline.db
> 
> Before:
> 
>     main
>         -> __x86_indirect_thunk_rax
>             -> __x86_indirect_thunk_rax
>                 -> foo
>                     -> bar
> 
> After:
> 
>     main
>         -> foo
>             -> bar
> 

hi,
I still see them in kernel report.. after a while the output is shifted
too far right out of the screen due to (I guess) this call/ret imbalance


[root@krava perf]# ./perf-with-kcore record pt -e intel_pt/cyc/k -a -- sleep 1
[root@krava perf]# ./perf-with-kcore script pt --call-trace -C 0
            Xorg  2115 [000]  5850.032452321:  cbr: 36 freq: 3621 MHz (171%)
            Xorg  2115 [000]  5850.032452351: ([kernel.kallsyms])           __list_add_valid
            Xorg  2115 [000]  5850.032452380: ([kernel.kallsyms])           perf_pmu_enable.part.101
            Xorg  2115 [000]  5850.032452380: ([kernel.kallsyms])               __indirect_thunk_start
            Xorg  2115 [000]  5850.032452415: ([kernel.kallsyms])           __indirect_thunk_start
            Xorg  2115 [000]  5850.032452415: ([kernel.kallsyms])               __indirect_thunk_start
            Xorg  2115 [000]  5850.032452442: ([kernel.kallsyms])           __list_add_valid
            Xorg  2115 [000]  5850.032452457: ([kernel.kallsyms])           rb_next
            Xorg  2115 [000]  5850.032452494: ([kernel.kallsyms])           __x86_indirect_thunk_r12
            Xorg  2115 [000]  5850.032452494: ([kernel.kallsyms])               __x86_indirect_thunk_r12
            Xorg  2115 [000]  5850.032452535: ([kernel.kallsyms])                   group_sched_in
            Xorg  2115 [000]  5850.032452535: ([kernel.kallsyms])                       __indirect_thunk_start
            Xorg  2115 [000]  5850.032452535: ([kernel.kallsyms])                           __indirect_thunk_start
            Xorg  2115 [000]  5850.032452548: ([kernel.kallsyms])                       event_sched_in.isra.113
            Xorg  2115 [000]  5850.032452548: ([kernel.kallsyms])                           perf_event_set_state.part.70
            Xorg  2115 [000]  5850.032452548: ([kernel.kallsyms])                               perf_event_update_time
            Xorg  2115 [000]  5850.032452562: ([kernel.kallsyms])                           perf_log_itrace_start
            Xorg  2115 [000]  5850.032452562: ([kernel.kallsyms])                           __indirect_thunk_start
            Xorg  2115 [000]  5850.032452562: ([kernel.kallsyms])                               __indirect_thunk_start
            Xorg  2115 [000]  5850.032452600: ([kernel.kallsyms])                                   perf_event_update_userpage
            Xorg  2115 [000]  5850.032452600: ([kernel.kallsyms])                                       calc_timer_values
            Xorg  2115 [000]  5850.032452600: ([kernel.kallsyms])                                           sched_clock_cpu
            Xorg  2115 [000]  5850.032452600: ([kernel.kallsyms])                                               sched_clock
            Xorg  2115 [000]  5850.032452600: ([kernel.kallsyms])                                                   native_sched_clock
            Xorg  2115 [000]  5850.032452793: ([kernel.kallsyms])                                       __indirect_thunk_start
            Xorg  2115 [000]  5850.032452793: ([kernel.kallsyms])                                           __indirect_thunk_start
            Xorg  2115 [000]  5850.032452867: ([kernel.kallsyms])                                       arch_perf_update_userpage
            Xorg  2115 [000]  5850.032452867: ([kernel.kallsyms])                                           using_native_sched_clock
            Xorg  2115 [000]  5850.032452867: ([kernel.kallsyms])                                           sched_clock_stable
            Xorg  2115 [000]  5850.032452867: ([kernel.kallsyms])                                           cyc2ns_read_begin
            Xorg  2115 [000]  5850.032453157: ([kernel.kallsyms])                                           cyc2ns_read_end
            Xorg  2115 [000]  5850.032453194: ([kernel.kallsyms])                       __indirect_thunk_start
            Xorg  2115 [000]  5850.032453194: ([kernel.kallsyms])                           __indirect_thunk_start
            Xorg  2115 [000]  5850.032453206: ([kernel.kallsyms])                   __list_add_valid


jirka

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/6] perf thread-stack: Hide x86 retpolines
  2019-01-09 15:38   ` Jiri Olsa
@ 2019-01-10  7:52     ` Adrian Hunter
  0 siblings, 0 replies; 23+ messages in thread
From: Adrian Hunter @ 2019-01-10  7:52 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Arnaldo Carvalho de Melo, linux-kernel

On 9/01/19 5:38 PM, Jiri Olsa wrote:
> On Wed, Jan 09, 2019 at 11:18:35AM +0200, Adrian Hunter wrote:
>> x86 retpoline functions pollute the call graph by showing up everywhere
>> there is an indirect branch, but they do not really mean anything. Make
>> changes so that the default retpoline functions will no longer appear in
>> the call graph. Note this only affects the call graph, since all the
>> original branches are left unchanged.
>>
>> This does not handle function return thunks, nor is there any improvement
>> for the handling of inline thunks or extern thunks.
>>
>> Example:
>>
>> $ cat simple-retpoline.c
>> __attribute__((noinline)) int bar(void)
>> {
>>         return -1;
>> }
>>
>> int foo(void)
>> {
>>         return bar() + 1;
>> }
>>
>> __attribute__((indirect_branch("thunk"))) int main()
>> {
>>         int (*volatile fn)(void) = foo;
>>
>>         fn();
>>         return fn();
>> }
>> $ gcc -ggdb3 -Wall -Wextra -O2 -o simple-retpoline simple-retpoline.c
>> $ objdump -d simple-retpoline
>> <SNIP>
>> 0000000000001040 <main>:
>>     1040:       48 83 ec 18             sub    $0x18,%rsp
>>     1044:       48 8d 05 25 01 00 00    lea    0x125(%rip),%rax        # 1170 <foo>
>>     104b:       48 89 44 24 08          mov    %rax,0x8(%rsp)
>>     1050:       48 8b 44 24 08          mov    0x8(%rsp),%rax
>>     1055:       e8 1f 01 00 00          callq  1179 <__x86_indirect_thunk_rax>
>>     105a:       48 8b 44 24 08          mov    0x8(%rsp),%rax
>>     105f:       48 83 c4 18             add    $0x18,%rsp
>>     1063:       e9 11 01 00 00          jmpq   1179 <__x86_indirect_thunk_rax>
>> <SNIP>
>> 0000000000001160 <bar>:
>>     1160:       b8 ff ff ff ff          mov    $0xffffffff,%eax
>>     1165:       c3                      retq
>> <SNIP>
>> 0000000000001170 <foo>:
>>     1170:       e8 eb ff ff ff          callq  1160 <bar>
>>     1175:       83 c0 01                add    $0x1,%eax
>>     1178:       c3                      retq
>> 0000000000001179 <__x86_indirect_thunk_rax>:
>>     1179:       e8 07 00 00 00          callq  1185 <__x86_indirect_thunk_rax+0xc>
>>     117e:       f3 90                   pause
>>     1180:       0f ae e8                lfence
>>     1183:       eb f9                   jmp    117e <__x86_indirect_thunk_rax+0x5>
>>     1185:       48 89 04 24             mov    %rax,(%rsp)
>>     1189:       c3                      retq
>> <SNIP>
>> $ perf record -o simple-retpoline.perf.data -e intel_pt/cyc/u ./simple-retpoline
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0,017 MB simple-retpoline.perf.data ]
>> $ perf script -i simple-retpoline.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py simple-retpoline.db branches calls
>> 2019-01-08 14:03:37.851655 Creating database...
>> 2019-01-08 14:03:37.863256 Writing records...
>> 2019-01-08 14:03:38.069750 Adding indexes
>> 2019-01-08 14:03:38.078799 Done
>> $ ~/libexec/perf-core/scripts/python/exported-sql-viewer.py simple-retpoline.db
>>
>> Before:
>>
>>     main
>>         -> __x86_indirect_thunk_rax
>>             -> __x86_indirect_thunk_rax
>>                 -> foo
>>                     -> bar
>>
>> After:
>>
>>     main
>>         -> foo
>>             -> bar
>>
> 
> hi,
> I still see them in kernel report.. after a while the output is shifted
> too far right out of the screen due to (I guess) this call/ret imbalance
> 
> 
> [root@krava perf]# ./perf-with-kcore record pt -e intel_pt/cyc/k -a -- sleep 1
> [root@krava perf]# ./perf-with-kcore script pt --call-trace -C 0
>             Xorg  2115 [000]  5850.032452321:  cbr: 36 freq: 3621 MHz (171%)
>             Xorg  2115 [000]  5850.032452351: ([kernel.kallsyms])           __list_add_valid
>             Xorg  2115 [000]  5850.032452380: ([kernel.kallsyms])           perf_pmu_enable.part.101
>             Xorg  2115 [000]  5850.032452380: ([kernel.kallsyms])               __indirect_thunk_start
>             Xorg  2115 [000]  5850.032452415: ([kernel.kallsyms])           __indirect_thunk_start
>             Xorg  2115 [000]  5850.032452415: ([kernel.kallsyms])               __indirect_thunk_start
>             Xorg  2115 [000]  5850.032452442: ([kernel.kallsyms])           __list_add_valid
>             Xorg  2115 [000]  5850.032452457: ([kernel.kallsyms])           rb_next
>             Xorg  2115 [000]  5850.032452494: ([kernel.kallsyms])           __x86_indirect_thunk_r12
>             Xorg  2115 [000]  5850.032452494: ([kernel.kallsyms])               __x86_indirect_thunk_r12
>             Xorg  2115 [000]  5850.032452535: ([kernel.kallsyms])                   group_sched_in
>             Xorg  2115 [000]  5850.032452535: ([kernel.kallsyms])                       __indirect_thunk_start
>             Xorg  2115 [000]  5850.032452535: ([kernel.kallsyms])                           __indirect_thunk_start
>             Xorg  2115 [000]  5850.032452548: ([kernel.kallsyms])                       event_sched_in.isra.113
>             Xorg  2115 [000]  5850.032452548: ([kernel.kallsyms])                           perf_event_set_state.part.70
>             Xorg  2115 [000]  5850.032452548: ([kernel.kallsyms])                               perf_event_update_time
>             Xorg  2115 [000]  5850.032452562: ([kernel.kallsyms])                           perf_log_itrace_start
>             Xorg  2115 [000]  5850.032452562: ([kernel.kallsyms])                           __indirect_thunk_start
>             Xorg  2115 [000]  5850.032452562: ([kernel.kallsyms])                               __indirect_thunk_start
>             Xorg  2115 [000]  5850.032452600: ([kernel.kallsyms])                                   perf_event_update_userpage
>             Xorg  2115 [000]  5850.032452600: ([kernel.kallsyms])                                       calc_timer_values
>             Xorg  2115 [000]  5850.032452600: ([kernel.kallsyms])                                           sched_clock_cpu
>             Xorg  2115 [000]  5850.032452600: ([kernel.kallsyms])                                               sched_clock
>             Xorg  2115 [000]  5850.032452600: ([kernel.kallsyms])                                                   native_sched_clock
>             Xorg  2115 [000]  5850.032452793: ([kernel.kallsyms])                                       __indirect_thunk_start
>             Xorg  2115 [000]  5850.032452793: ([kernel.kallsyms])                                           __indirect_thunk_start
>             Xorg  2115 [000]  5850.032452867: ([kernel.kallsyms])                                       arch_perf_update_userpage
>             Xorg  2115 [000]  5850.032452867: ([kernel.kallsyms])                                           using_native_sched_clock
>             Xorg  2115 [000]  5850.032452867: ([kernel.kallsyms])                                           sched_clock_stable
>             Xorg  2115 [000]  5850.032452867: ([kernel.kallsyms])                                           cyc2ns_read_begin
>             Xorg  2115 [000]  5850.032453157: ([kernel.kallsyms])                                           cyc2ns_read_end
>             Xorg  2115 [000]  5850.032453194: ([kernel.kallsyms])                       __indirect_thunk_start
>             Xorg  2115 [000]  5850.032453194: ([kernel.kallsyms])                           __indirect_thunk_start
>             Xorg  2115 [000]  5850.032453206: ([kernel.kallsyms])                   __list_add_valid
> 
> 
> jirka
> 

These patches are for db export and the context-sensitive call chain report
in the exported-sql-viewer.py python script.  The callchains you see above
are done at a lower level, before symbol resolution.  If you don't want to
see the thunks you can grep them out.  But more work is needed on them too.
I'll add it to the TODO list.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/6] perf thread-stack: x86 retpolines
  2019-01-09  9:18 [PATCH 0/6] perf thread-stack: x86 retpolines Adrian Hunter
                   ` (5 preceding siblings ...)
  2019-01-09  9:18 ` [PATCH 6/6] perf thread-stack: Hide x86 retpolines Adrian Hunter
@ 2019-01-10  9:55 ` Jiri Olsa
  2019-02-06  9:10   ` Adrian Hunter
  6 siblings, 1 reply; 23+ messages in thread
From: Jiri Olsa @ 2019-01-10  9:55 UTC (permalink / raw)
  To: Adrian Hunter; +Cc: Arnaldo Carvalho de Melo, linux-kernel

On Wed, Jan 09, 2019 at 11:18:29AM +0200, Adrian Hunter wrote:
> Hi
> 
> Here are some patches to improve the exported call graph, primarily to deal
> with x86 retpolines.
> 
> 
> Adrian Hunter (6):
>       perf tools: Fix split_kallsyms_for_kcore for trampoline symbols
>       perf thread-stack: Tidy thread_stack__push_cp() usage
>       perf thread-stack: Tidy thread_stack__no_call_return() by adding more local variables
>       perf thread-stack: Represent jmps to the start of a different symbol
>       perf thread-stack: Improve thread_stack__no_call_return()
>       perf thread-stack: Hide x86 retpolines

Acked-by: Jiri Olsa <jolsa@kernel.org>

thanks,
jirka

> 
>  tools/perf/scripts/python/export-to-postgresql.py |   2 +-
>  tools/perf/scripts/python/export-to-sqlite.py     |   2 +-
>  tools/perf/util/symbol.c                          |   2 +
>  tools/perf/util/thread-stack.c                    | 235 ++++++++++++++++++----
>  tools/perf/util/thread-stack.h                    |   3 +
>  5 files changed, 208 insertions(+), 36 deletions(-)
> 
> 
> Regards
> Adrian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/6] perf thread-stack: x86 retpolines
  2019-01-10  9:55 ` [PATCH 0/6] perf thread-stack: " Jiri Olsa
@ 2019-02-06  9:10   ` Adrian Hunter
  0 siblings, 0 replies; 23+ messages in thread
From: Adrian Hunter @ 2019-02-06  9:10 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Arnaldo Carvalho de Melo, linux-kernel

On 10/01/19 11:55 AM, Jiri Olsa wrote:
> On Wed, Jan 09, 2019 at 11:18:29AM +0200, Adrian Hunter wrote:
>> Hi
>>
>> Here are some patches to improve the exported call graph, primarily to deal
>> with x86 retpolines.
>>
>>
>> Adrian Hunter (6):
>>       perf tools: Fix split_kallsyms_for_kcore for trampoline symbols
>>       perf thread-stack: Tidy thread_stack__push_cp() usage
>>       perf thread-stack: Tidy thread_stack__no_call_return() by adding more local variables
>>       perf thread-stack: Represent jmps to the start of a different symbol
>>       perf thread-stack: Improve thread_stack__no_call_return()
>>       perf thread-stack: Hide x86 retpolines
> 
> Acked-by: Jiri Olsa <jolsa@kernel.org>

Can these be applied?

> 
> thanks,
> jirka
> 
>>
>>  tools/perf/scripts/python/export-to-postgresql.py |   2 +-
>>  tools/perf/scripts/python/export-to-sqlite.py     |   2 +-
>>  tools/perf/util/symbol.c                          |   2 +
>>  tools/perf/util/thread-stack.c                    | 235 ++++++++++++++++++----
>>  tools/perf/util/thread-stack.h                    |   3 +
>>  5 files changed, 208 insertions(+), 36 deletions(-)
>>
>>
>> Regards
>> Adrian
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/6] perf thread-stack: Represent jmps to the start of a different symbol
  2019-01-09  9:18 ` [PATCH 4/6] perf thread-stack: Represent jmps to the start of a different symbol Adrian Hunter
@ 2019-02-06 12:39   ` Arnaldo Carvalho de Melo
  2019-02-06 13:25     ` Adrian Hunter
  2019-02-09 12:59   ` [tip:perf/core] " tip-bot for Adrian Hunter
  1 sibling, 1 reply; 23+ messages in thread
From: Arnaldo Carvalho de Melo @ 2019-02-06 12:39 UTC (permalink / raw)
  To: Adrian Hunter; +Cc: Jiri Olsa, linux-kernel

Em Wed, Jan 09, 2019 at 11:18:33AM +0200, Adrian Hunter escreveu:
> The compiler might optimize a call/ret combination by making it a jmp.
> However the thread-stack does not presently cater for that, so that such
> control flow is not visible in the call graph. Make it visible by recording
> on the stack a branch to the start of a different symbol. Note, that means
> when a ret pops the stack, all jmps must be popped off first.
> 
> Example:
> 
> $ cat jmp-to-fn.c
> __attribute__((noinline)) int bar(void)
> {
>         return -1;
> }
> 
> __attribute__((noinline)) int foo(void)
> {
>         return bar() + 1;
> }
> 
> int main()
> {
>         return foo();
> }
> $ gcc -ggdb3 -Wall -Wextra -O2 -o jmp-to-fn jmp-to-fn.c
> $ objdump -d jmp-to-fn
> <SNIP>
> 0000000000001040 <main>:
>     1040:       31 c0                   xor    %eax,%eax
>     1042:       e9 09 01 00 00          jmpq   1150 <foo>
> <SNIP>
> 0000000000001140 <bar>:
>     1140:       b8 ff ff ff ff          mov    $0xffffffff,%eax
>     1145:       c3                      retq
> <SNIP>
> 0000000000001150 <foo>:
>     1150:       31 c0                   xor    %eax,%eax
>     1152:       e8 e9 ff ff ff          callq  1140 <bar>
>     1157:       83 c0 01                add    $0x1,%eax
>     115a:       c3                      retq
> <SNIP>
> $ perf record -o jmp-to-fn.perf.data -e intel_pt/cyc/u ./jmp-to-fn
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0,017 MB jmp-to-fn.perf.data ]
> $ perf script -i jmp-to-fn.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py jmp-to-fn.db branches calls
> 2019-01-08 13:24:58.783069 Creating database...
> 2019-01-08 13:24:58.794650 Writing records...
> 2019-01-08 13:24:59.008050 Adding indexes
> 2019-01-08 13:24:59.015802 Done
> $  ~/libexec/perf-core/scripts/python/exported-sql-viewer.py jmp-to-fn.db

So, while testing this I notice that we need to do the python3
conversions for these scripts as well:

[acme@quaco adrian]$ objdump -d jmp-to-fn > /tmp/bla
[acme@quaco adrian]$ vim /tmp/bla
[acme@quaco adrian]$ perf record -o jmp-to-fn.perf.data -e intel_pt/cyc/u ./jmp-to-fn
Couldn't synthesize bpf events.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB jmp-to-fn.perf.data ]
[acme@quaco adrian]$ perf script -i jmp-to-fn.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py jmp-to-fn.db branches calls
  File "/home/acme/libexec/perf-core/scripts/python/export-to-sqlite.py", line 103
    print datetime.datetime.today(), "Creating database..."
                 ^
SyntaxError: invalid syntax
Error running python script /home/acme/libexec/perf-core/scripts/python/export-to-sqlite.py
[acme@quaco adrian]$

Mostly are simple, i.e. in this case we need something like:

  print(str(datetime.datetime.today()) + "Creating database...")


for a really crude conversion, but one that would make this work with
both python2 and python3 (make PYTHON=python3 when building perf).

 
> Before:
> 
>     main
>         -> bar
> 
> After:
> 
>     main
>         -> foo
>             -> bar
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>  .../scripts/python/export-to-postgresql.py    |  2 +-
>  tools/perf/scripts/python/export-to-sqlite.py |  2 +-
>  tools/perf/util/thread-stack.c                | 30 +++++++++++++++++--
>  tools/perf/util/thread-stack.h                |  3 ++
>  4 files changed, 33 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/perf/scripts/python/export-to-postgresql.py b/tools/perf/scripts/python/export-to-postgresql.py
> index 0564dd7377f2..30130213da7e 100644
> --- a/tools/perf/scripts/python/export-to-postgresql.py
> +++ b/tools/perf/scripts/python/export-to-postgresql.py
> @@ -478,7 +478,7 @@ if perf_db_export_calls:
>  			'branch_count,'
>  			'call_id,'
>  			'return_id,'
> -			'CASE WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' ELSE \'\' END AS flags,'
> +			'CASE WHEN flags=0 THEN \'\' WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' WHEN flags=6 THEN \'jump\' ELSE flags END AS flags,'
>  			'parent_call_path_id'
>  		' FROM calls INNER JOIN call_paths ON call_paths.id = call_path_id')
>  
> diff --git a/tools/perf/scripts/python/export-to-sqlite.py b/tools/perf/scripts/python/export-to-sqlite.py
> index 245caf2643ed..ed237f2ed03f 100644
> --- a/tools/perf/scripts/python/export-to-sqlite.py
> +++ b/tools/perf/scripts/python/export-to-sqlite.py
> @@ -320,7 +320,7 @@ if perf_db_export_calls:
>  			'branch_count,'
>  			'call_id,'
>  			'return_id,'
> -			'CASE WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' ELSE \'\' END AS flags,'
> +			'CASE WHEN flags=0 THEN \'\' WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' WHEN flags=6 THEN \'jump\' ELSE flags END AS flags,'
>  			'parent_call_path_id'
>  		' FROM calls INNER JOIN call_paths ON call_paths.id = call_path_id')
>  
> diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
> index 7f8eff018c16..f52c0f90915d 100644
> --- a/tools/perf/util/thread-stack.c
> +++ b/tools/perf/util/thread-stack.c
> @@ -38,6 +38,7 @@
>   * @cp: call path
>   * @no_call: a 'call' was not seen
>   * @trace_end: a 'call' but trace ended
> + * @non_call: a branch but not a 'call' to the start of a different symbol
>   */
>  struct thread_stack_entry {
>  	u64 ret_addr;
> @@ -47,6 +48,7 @@ struct thread_stack_entry {
>  	struct call_path *cp;
>  	bool no_call;
>  	bool trace_end;
> +	bool non_call;
>  };
>  
>  /**
> @@ -268,6 +270,8 @@ static int thread_stack__call_return(struct thread *thread,
>  		cr.flags |= CALL_RETURN_NO_CALL;
>  	if (no_return)
>  		cr.flags |= CALL_RETURN_NO_RETURN;
> +	if (tse->non_call)
> +		cr.flags |= CALL_RETURN_NON_CALL;
>  
>  	return crp->process(&cr, crp->data);
>  }
> @@ -510,6 +514,7 @@ static int thread_stack__push_cp(struct thread_stack *ts, u64 ret_addr,
>  	tse->cp = cp;
>  	tse->no_call = no_call;
>  	tse->trace_end = trace_end;
> +	tse->non_call = false;
>  
>  	return 0;
>  }
> @@ -531,14 +536,16 @@ static int thread_stack__pop_cp(struct thread *thread, struct thread_stack *ts,
>  							 timestamp, ref, false);
>  	}
>  
> -	if (ts->stack[ts->cnt - 1].ret_addr == ret_addr) {
> +	if (ts->stack[ts->cnt - 1].ret_addr == ret_addr &&
> +	    !ts->stack[ts->cnt - 1].non_call) {
>  		return thread_stack__call_return(thread, ts, --ts->cnt,
>  						 timestamp, ref, false);
>  	} else {
>  		size_t i = ts->cnt - 1;
>  
>  		while (i--) {
> -			if (ts->stack[i].ret_addr != ret_addr)
> +			if (ts->stack[i].ret_addr != ret_addr ||
> +			    ts->stack[i].non_call)
>  				continue;
>  			i += 1;
>  			while (ts->cnt > i) {
> @@ -757,6 +764,25 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
>  		err = thread_stack__trace_begin(thread, ts, sample->time, ref);
>  	} else if (sample->flags & PERF_IP_FLAG_TRACE_END) {
>  		err = thread_stack__trace_end(ts, sample, ref);
> +	} else if (sample->flags & PERF_IP_FLAG_BRANCH &&
> +		   from_al->sym != to_al->sym && to_al->sym &&
> +		   to_al->addr == to_al->sym->start) {
> +		struct call_path_root *cpr = ts->crp->cpr;
> +		struct call_path *cp;
> +
> +		/*
> +		 * The compiler might optimize a call/ret combination by making
> +		 * it a jmp. Make that visible by recording on the stack a
> +		 * branch to the start of a different symbol. Note, that means
> +		 * when a ret pops the stack, all jmps must be popped off first.
> +		 */
> +		cp = call_path__findnew(cpr, ts->stack[ts->cnt - 1].cp,
> +					to_al->sym, sample->addr,
> +					ts->kernel_start);
> +		err = thread_stack__push_cp(ts, 0, sample->time, ref, cp, false,
> +					    false);
> +		if (!err)
> +			ts->stack[ts->cnt - 1].non_call = true;
>  	}
>  
>  	return err;
> diff --git a/tools/perf/util/thread-stack.h b/tools/perf/util/thread-stack.h
> index 1f626f4a1c40..b7c04e19ad41 100644
> --- a/tools/perf/util/thread-stack.h
> +++ b/tools/perf/util/thread-stack.h
> @@ -35,10 +35,13 @@ struct call_path;
>   *
>   * CALL_RETURN_NO_CALL: 'return' but no matching 'call'
>   * CALL_RETURN_NO_RETURN: 'call' but no matching 'return'
> + * CALL_RETURN_NON_CALL: a branch but not a 'call' to the start of a different
> + *                       symbol
>   */
>  enum {
>  	CALL_RETURN_NO_CALL	= 1 << 0,
>  	CALL_RETURN_NO_RETURN	= 1 << 1,
> +	CALL_RETURN_NON_CALL	= 1 << 2,
>  };
>  
>  /**
> -- 
> 2.17.1

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/6] perf thread-stack: Represent jmps to the start of a different symbol
  2019-02-06 12:39   ` Arnaldo Carvalho de Melo
@ 2019-02-06 13:25     ` Adrian Hunter
  0 siblings, 0 replies; 23+ messages in thread
From: Adrian Hunter @ 2019-02-06 13:25 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Jiri Olsa, linux-kernel

On 6/02/19 2:39 PM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Jan 09, 2019 at 11:18:33AM +0200, Adrian Hunter escreveu:
>> The compiler might optimize a call/ret combination by making it a jmp.
>> However the thread-stack does not presently cater for that, so that such
>> control flow is not visible in the call graph. Make it visible by recording
>> on the stack a branch to the start of a different symbol. Note, that means
>> when a ret pops the stack, all jmps must be popped off first.
>>
>> Example:
>>
>> $ cat jmp-to-fn.c
>> __attribute__((noinline)) int bar(void)
>> {
>>         return -1;
>> }
>>
>> __attribute__((noinline)) int foo(void)
>> {
>>         return bar() + 1;
>> }
>>
>> int main()
>> {
>>         return foo();
>> }
>> $ gcc -ggdb3 -Wall -Wextra -O2 -o jmp-to-fn jmp-to-fn.c
>> $ objdump -d jmp-to-fn
>> <SNIP>
>> 0000000000001040 <main>:
>>     1040:       31 c0                   xor    %eax,%eax
>>     1042:       e9 09 01 00 00          jmpq   1150 <foo>
>> <SNIP>
>> 0000000000001140 <bar>:
>>     1140:       b8 ff ff ff ff          mov    $0xffffffff,%eax
>>     1145:       c3                      retq
>> <SNIP>
>> 0000000000001150 <foo>:
>>     1150:       31 c0                   xor    %eax,%eax
>>     1152:       e8 e9 ff ff ff          callq  1140 <bar>
>>     1157:       83 c0 01                add    $0x1,%eax
>>     115a:       c3                      retq
>> <SNIP>
>> $ perf record -o jmp-to-fn.perf.data -e intel_pt/cyc/u ./jmp-to-fn
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0,017 MB jmp-to-fn.perf.data ]
>> $ perf script -i jmp-to-fn.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py jmp-to-fn.db branches calls
>> 2019-01-08 13:24:58.783069 Creating database...
>> 2019-01-08 13:24:58.794650 Writing records...
>> 2019-01-08 13:24:59.008050 Adding indexes
>> 2019-01-08 13:24:59.015802 Done
>> $  ~/libexec/perf-core/scripts/python/exported-sql-viewer.py jmp-to-fn.db
> 
> So, while testing this I notice that we need to do the python3
> conversions for these scripts as well:


I will look at it, but an you take these patches as they are for now?


> 
> [acme@quaco adrian]$ objdump -d jmp-to-fn > /tmp/bla
> [acme@quaco adrian]$ vim /tmp/bla
> [acme@quaco adrian]$ perf record -o jmp-to-fn.perf.data -e intel_pt/cyc/u ./jmp-to-fn
> Couldn't synthesize bpf events.
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.017 MB jmp-to-fn.perf.data ]
> [acme@quaco adrian]$ perf script -i jmp-to-fn.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py jmp-to-fn.db branches calls
>   File "/home/acme/libexec/perf-core/scripts/python/export-to-sqlite.py", line 103
>     print datetime.datetime.today(), "Creating database..."
>                  ^
> SyntaxError: invalid syntax
> Error running python script /home/acme/libexec/perf-core/scripts/python/export-to-sqlite.py
> [acme@quaco adrian]$
> 
> Mostly are simple, i.e. in this case we need something like:
> 
>   print(str(datetime.datetime.today()) + "Creating database...")
> 
> 
> for a really crude conversion, but one that would make this work with
> both python2 and python3 (make PYTHON=python3 when building perf).
> 
>  
>> Before:
>>
>>     main
>>         -> bar
>>
>> After:
>>
>>     main
>>         -> foo
>>             -> bar
>>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>> ---
>>  .../scripts/python/export-to-postgresql.py    |  2 +-
>>  tools/perf/scripts/python/export-to-sqlite.py |  2 +-
>>  tools/perf/util/thread-stack.c                | 30 +++++++++++++++++--
>>  tools/perf/util/thread-stack.h                |  3 ++
>>  4 files changed, 33 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/perf/scripts/python/export-to-postgresql.py b/tools/perf/scripts/python/export-to-postgresql.py
>> index 0564dd7377f2..30130213da7e 100644
>> --- a/tools/perf/scripts/python/export-to-postgresql.py
>> +++ b/tools/perf/scripts/python/export-to-postgresql.py
>> @@ -478,7 +478,7 @@ if perf_db_export_calls:
>>  			'branch_count,'
>>  			'call_id,'
>>  			'return_id,'
>> -			'CASE WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' ELSE \'\' END AS flags,'
>> +			'CASE WHEN flags=0 THEN \'\' WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' WHEN flags=6 THEN \'jump\' ELSE flags END AS flags,'
>>  			'parent_call_path_id'
>>  		' FROM calls INNER JOIN call_paths ON call_paths.id = call_path_id')
>>  
>> diff --git a/tools/perf/scripts/python/export-to-sqlite.py b/tools/perf/scripts/python/export-to-sqlite.py
>> index 245caf2643ed..ed237f2ed03f 100644
>> --- a/tools/perf/scripts/python/export-to-sqlite.py
>> +++ b/tools/perf/scripts/python/export-to-sqlite.py
>> @@ -320,7 +320,7 @@ if perf_db_export_calls:
>>  			'branch_count,'
>>  			'call_id,'
>>  			'return_id,'
>> -			'CASE WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' ELSE \'\' END AS flags,'
>> +			'CASE WHEN flags=0 THEN \'\' WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' WHEN flags=6 THEN \'jump\' ELSE flags END AS flags,'
>>  			'parent_call_path_id'
>>  		' FROM calls INNER JOIN call_paths ON call_paths.id = call_path_id')
>>  
>> diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
>> index 7f8eff018c16..f52c0f90915d 100644
>> --- a/tools/perf/util/thread-stack.c
>> +++ b/tools/perf/util/thread-stack.c
>> @@ -38,6 +38,7 @@
>>   * @cp: call path
>>   * @no_call: a 'call' was not seen
>>   * @trace_end: a 'call' but trace ended
>> + * @non_call: a branch but not a 'call' to the start of a different symbol
>>   */
>>  struct thread_stack_entry {
>>  	u64 ret_addr;
>> @@ -47,6 +48,7 @@ struct thread_stack_entry {
>>  	struct call_path *cp;
>>  	bool no_call;
>>  	bool trace_end;
>> +	bool non_call;
>>  };
>>  
>>  /**
>> @@ -268,6 +270,8 @@ static int thread_stack__call_return(struct thread *thread,
>>  		cr.flags |= CALL_RETURN_NO_CALL;
>>  	if (no_return)
>>  		cr.flags |= CALL_RETURN_NO_RETURN;
>> +	if (tse->non_call)
>> +		cr.flags |= CALL_RETURN_NON_CALL;
>>  
>>  	return crp->process(&cr, crp->data);
>>  }
>> @@ -510,6 +514,7 @@ static int thread_stack__push_cp(struct thread_stack *ts, u64 ret_addr,
>>  	tse->cp = cp;
>>  	tse->no_call = no_call;
>>  	tse->trace_end = trace_end;
>> +	tse->non_call = false;
>>  
>>  	return 0;
>>  }
>> @@ -531,14 +536,16 @@ static int thread_stack__pop_cp(struct thread *thread, struct thread_stack *ts,
>>  							 timestamp, ref, false);
>>  	}
>>  
>> -	if (ts->stack[ts->cnt - 1].ret_addr == ret_addr) {
>> +	if (ts->stack[ts->cnt - 1].ret_addr == ret_addr &&
>> +	    !ts->stack[ts->cnt - 1].non_call) {
>>  		return thread_stack__call_return(thread, ts, --ts->cnt,
>>  						 timestamp, ref, false);
>>  	} else {
>>  		size_t i = ts->cnt - 1;
>>  
>>  		while (i--) {
>> -			if (ts->stack[i].ret_addr != ret_addr)
>> +			if (ts->stack[i].ret_addr != ret_addr ||
>> +			    ts->stack[i].non_call)
>>  				continue;
>>  			i += 1;
>>  			while (ts->cnt > i) {
>> @@ -757,6 +764,25 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
>>  		err = thread_stack__trace_begin(thread, ts, sample->time, ref);
>>  	} else if (sample->flags & PERF_IP_FLAG_TRACE_END) {
>>  		err = thread_stack__trace_end(ts, sample, ref);
>> +	} else if (sample->flags & PERF_IP_FLAG_BRANCH &&
>> +		   from_al->sym != to_al->sym && to_al->sym &&
>> +		   to_al->addr == to_al->sym->start) {
>> +		struct call_path_root *cpr = ts->crp->cpr;
>> +		struct call_path *cp;
>> +
>> +		/*
>> +		 * The compiler might optimize a call/ret combination by making
>> +		 * it a jmp. Make that visible by recording on the stack a
>> +		 * branch to the start of a different symbol. Note, that means
>> +		 * when a ret pops the stack, all jmps must be popped off first.
>> +		 */
>> +		cp = call_path__findnew(cpr, ts->stack[ts->cnt - 1].cp,
>> +					to_al->sym, sample->addr,
>> +					ts->kernel_start);
>> +		err = thread_stack__push_cp(ts, 0, sample->time, ref, cp, false,
>> +					    false);
>> +		if (!err)
>> +			ts->stack[ts->cnt - 1].non_call = true;
>>  	}
>>  
>>  	return err;
>> diff --git a/tools/perf/util/thread-stack.h b/tools/perf/util/thread-stack.h
>> index 1f626f4a1c40..b7c04e19ad41 100644
>> --- a/tools/perf/util/thread-stack.h
>> +++ b/tools/perf/util/thread-stack.h
>> @@ -35,10 +35,13 @@ struct call_path;
>>   *
>>   * CALL_RETURN_NO_CALL: 'return' but no matching 'call'
>>   * CALL_RETURN_NO_RETURN: 'call' but no matching 'return'
>> + * CALL_RETURN_NON_CALL: a branch but not a 'call' to the start of a different
>> + *                       symbol
>>   */
>>  enum {
>>  	CALL_RETURN_NO_CALL	= 1 << 0,
>>  	CALL_RETURN_NO_RETURN	= 1 << 1,
>> +	CALL_RETURN_NON_CALL	= 1 << 2,
>>  };
>>  
>>  /**
>> -- 
>> 2.17.1
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [tip:perf/core] perf tools: Fix split_kallsyms_for_kcore() for trampoline symbols
  2019-01-09  9:18 ` [PATCH 1/6] perf tools: Fix split_kallsyms_for_kcore for trampoline symbols Adrian Hunter
@ 2019-02-09 12:57   ` tip-bot for Adrian Hunter
  0 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Adrian Hunter @ 2019-02-09 12:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mingo, adrian.hunter, hpa, jolsa, tglx, acme

Commit-ID:  d6d457451eb94fa747dc202765592eb8885a7352
Gitweb:     https://git.kernel.org/tip/d6d457451eb94fa747dc202765592eb8885a7352
Author:     Adrian Hunter <adrian.hunter@intel.com>
AuthorDate: Wed, 9 Jan 2019 11:18:30 +0200
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Wed, 6 Feb 2019 10:00:40 -0300

perf tools: Fix split_kallsyms_for_kcore() for trampoline symbols

Kallsyms symbols do not have a size, so the size becomes the distance to
the next symbol.

Consequently the recently added trampoline symbols end up with large
sizes because the trampolines are some distance from one another and the
main kernel map.

However, symbols that end outside their map can disrupt the symbol tree
because, after mapping, it can appear incorrectly that they overlap
other symbols.

Add logic to truncate symbol size to the end of the corresponding map.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: stable@vger.kernel.org
Fixes: d83212d5dd67 ("kallsyms, x86: Export addresses of PTI entry trampolines")
Link: http://lkml.kernel.org/r/20190109091835.5570-2-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/symbol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 7b194cf53cc0..758bf5f74e6e 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -718,6 +718,8 @@ static int map_groups__split_kallsyms_for_kcore(struct map_groups *kmaps, struct
 		}
 
 		pos->start -= curr_map->start - curr_map->pgoff;
+		if (pos->end > curr_map->end)
+			pos->end = curr_map->end;
 		if (pos->end)
 			pos->end -= curr_map->start - curr_map->pgoff;
 		symbols__insert(&curr_map->dso->symbols, pos);

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:perf/core] perf thread-stack: Tidy thread_stack__push_cp() usage
  2019-01-09  9:18 ` [PATCH 2/6] perf thread-stack: Tidy thread_stack__push_cp() usage Adrian Hunter
@ 2019-02-09 12:58   ` tip-bot for Adrian Hunter
  0 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Adrian Hunter @ 2019-02-09 12:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jolsa, acme, mingo, tglx, linux-kernel, hpa, adrian.hunter

Commit-ID:  e7a3a055f2b88ebd0bdae8b0aade1e7d80c8a81e
Gitweb:     https://git.kernel.org/tip/e7a3a055f2b88ebd0bdae8b0aade1e7d80c8a81e
Author:     Adrian Hunter <adrian.hunter@intel.com>
AuthorDate: Wed, 9 Jan 2019 11:18:31 +0200
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Wed, 6 Feb 2019 10:00:40 -0300

perf thread-stack: Tidy thread_stack__push_cp() usage

If 'cp' is checked in thread_stack__push_cp() a number of error checks
can be removed, reducing code size and improving readability.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/20190109091835.5570-3-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/thread-stack.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
index d52f27f373ce..93ba3ab6a602 100644
--- a/tools/perf/util/thread-stack.c
+++ b/tools/perf/util/thread-stack.c
@@ -493,6 +493,9 @@ static int thread_stack__push_cp(struct thread_stack *ts, u64 ret_addr,
 	struct thread_stack_entry *tse;
 	int err;
 
+	if (!cp)
+		return -ENOMEM;
+
 	if (ts->cnt == ts->sz) {
 		err = thread_stack__grow(ts);
 		if (err)
@@ -576,8 +579,6 @@ static int thread_stack__bottom(struct thread_stack *ts,
 
 	cp = call_path__findnew(cpr, &cpr->call_path, sym, ip,
 				ts->kernel_start);
-	if (!cp)
-		return -ENOMEM;
 
 	return thread_stack__push_cp(ts, ip, sample->time, ref, cp,
 				     true, false);
@@ -609,8 +610,6 @@ static int thread_stack__no_call_return(struct thread *thread,
 			cp = call_path__findnew(cpr, &cpr->call_path,
 						to_al->sym, sample->addr,
 						ts->kernel_start);
-			if (!cp)
-				return -ENOMEM;
 			return thread_stack__push_cp(ts, 0, sample->time, ref,
 						     cp, true, false);
 		}
@@ -633,8 +632,6 @@ static int thread_stack__no_call_return(struct thread *thread,
 	/* This 'return' had no 'call', so push and pop top of stack */
 	cp = call_path__findnew(cpr, parent, from_al->sym, sample->ip,
 				ts->kernel_start);
-	if (!cp)
-		return -ENOMEM;
 
 	err = thread_stack__push_cp(ts, sample->addr, sample->time, ref, cp,
 				    true, false);
@@ -680,8 +677,6 @@ static int thread_stack__trace_end(struct thread_stack *ts,
 
 	cp = call_path__findnew(cpr, ts->stack[ts->cnt - 1].cp, NULL, 0,
 				ts->kernel_start);
-	if (!cp)
-		return -ENOMEM;
 
 	ret_addr = sample->ip + sample->insn_len;
 
@@ -745,8 +740,6 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
 		cp = call_path__findnew(cpr, ts->stack[ts->cnt - 1].cp,
 					to_al->sym, sample->addr,
 					ts->kernel_start);
-		if (!cp)
-			return -ENOMEM;
 		err = thread_stack__push_cp(ts, ret_addr, sample->time, ref,
 					    cp, false, trace_end);
 	} else if (sample->flags & PERF_IP_FLAG_RETURN) {

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:perf/core] perf thread-stack: Tidy thread_stack__no_call_return() by adding more local variables
  2019-01-09  9:18 ` [PATCH 3/6] perf thread-stack: Tidy thread_stack__no_call_return() by adding more local variables Adrian Hunter
@ 2019-02-09 12:58   ` tip-bot for Adrian Hunter
  0 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Adrian Hunter @ 2019-02-09 12:58 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, tglx, hpa, adrian.hunter, jolsa, acme, mingo

Commit-ID:  90c2cda7056e3a7555d874a27aae12fd46ca802e
Gitweb:     https://git.kernel.org/tip/90c2cda7056e3a7555d874a27aae12fd46ca802e
Author:     Adrian Hunter <adrian.hunter@intel.com>
AuthorDate: Wed, 9 Jan 2019 11:18:32 +0200
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Wed, 6 Feb 2019 10:00:40 -0300

perf thread-stack: Tidy thread_stack__no_call_return() by adding more local variables

Make thread_stack__no_call_return() more readable by adding more local
variables.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/20190109091835.5570-4-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/thread-stack.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
index 93ba3ab6a602..7f8eff018c16 100644
--- a/tools/perf/util/thread-stack.c
+++ b/tools/perf/util/thread-stack.c
@@ -591,34 +591,36 @@ static int thread_stack__no_call_return(struct thread *thread,
 					struct addr_location *to_al, u64 ref)
 {
 	struct call_path_root *cpr = ts->crp->cpr;
+	struct call_path *root = &cpr->call_path;
+	struct symbol *fsym = from_al->sym;
+	struct symbol *tsym = to_al->sym;
 	struct call_path *cp, *parent;
 	u64 ks = ts->kernel_start;
+	u64 addr = sample->addr;
+	u64 tm = sample->time;
+	u64 ip = sample->ip;
 	int err;
 
-	if (sample->ip >= ks && sample->addr < ks) {
+	if (ip >= ks && addr < ks) {
 		/* Return to userspace, so pop all kernel addresses */
 		while (thread_stack__in_kernel(ts)) {
 			err = thread_stack__call_return(thread, ts, --ts->cnt,
-							sample->time, ref,
-							true);
+							tm, ref, true);
 			if (err)
 				return err;
 		}
 
 		/* If the stack is empty, push the userspace address */
 		if (!ts->cnt) {
-			cp = call_path__findnew(cpr, &cpr->call_path,
-						to_al->sym, sample->addr,
-						ts->kernel_start);
-			return thread_stack__push_cp(ts, 0, sample->time, ref,
-						     cp, true, false);
+			cp = call_path__findnew(cpr, root, tsym, addr, ks);
+			return thread_stack__push_cp(ts, 0, tm, ref, cp, true,
+						     false);
 		}
-	} else if (thread_stack__in_kernel(ts) && sample->ip < ks) {
+	} else if (thread_stack__in_kernel(ts) && ip < ks) {
 		/* Return to userspace, so pop all kernel addresses */
 		while (thread_stack__in_kernel(ts)) {
 			err = thread_stack__call_return(thread, ts, --ts->cnt,
-							sample->time, ref,
-							true);
+							tm, ref, true);
 			if (err)
 				return err;
 		}
@@ -627,19 +629,16 @@ static int thread_stack__no_call_return(struct thread *thread,
 	if (ts->cnt)
 		parent = ts->stack[ts->cnt - 1].cp;
 	else
-		parent = &cpr->call_path;
+		parent = root;
 
 	/* This 'return' had no 'call', so push and pop top of stack */
-	cp = call_path__findnew(cpr, parent, from_al->sym, sample->ip,
-				ts->kernel_start);
+	cp = call_path__findnew(cpr, parent, fsym, ip, ks);
 
-	err = thread_stack__push_cp(ts, sample->addr, sample->time, ref, cp,
-				    true, false);
+	err = thread_stack__push_cp(ts, addr, tm, ref, cp, true, false);
 	if (err)
 		return err;
 
-	return thread_stack__pop_cp(thread, ts, sample->addr, sample->time, ref,
-				    to_al->sym);
+	return thread_stack__pop_cp(thread, ts, addr, tm, ref, tsym);
 }
 
 static int thread_stack__trace_begin(struct thread *thread,

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:perf/core] perf thread-stack: Represent jmps to the start of a different symbol
  2019-01-09  9:18 ` [PATCH 4/6] perf thread-stack: Represent jmps to the start of a different symbol Adrian Hunter
  2019-02-06 12:39   ` Arnaldo Carvalho de Melo
@ 2019-02-09 12:59   ` tip-bot for Adrian Hunter
  1 sibling, 0 replies; 23+ messages in thread
From: tip-bot for Adrian Hunter @ 2019-02-09 12:59 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, jolsa, linux-kernel, acme, tglx, hpa, adrian.hunter

Commit-ID:  f08046cb3082b313e7b08dc35838cf8bd902c36b
Gitweb:     https://git.kernel.org/tip/f08046cb3082b313e7b08dc35838cf8bd902c36b
Author:     Adrian Hunter <adrian.hunter@intel.com>
AuthorDate: Wed, 9 Jan 2019 11:18:33 +0200
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Wed, 6 Feb 2019 10:00:40 -0300

perf thread-stack: Represent jmps to the start of a different symbol

The compiler might optimize a call/ret combination by making it a jmp.
However the thread-stack does not presently cater for that, so that such
control flow is not visible in the call graph. Make it visible by
recording on the stack a branch to the start of a different symbol.
Note, that means when a ret pops the stack, all jmps must be popped off
first.

Example:

  $ cat jmp-to-fn.c
  __attribute__((noinline)) int bar(void)
  {
          return -1;
  }

  __attribute__((noinline)) int foo(void)
  {
          return bar() + 1;
  }

  int main()
  {
          return foo();
  }
  $ gcc -ggdb3 -Wall -Wextra -O2 -o jmp-to-fn jmp-to-fn.c
  $ objdump -d jmp-to-fn
  <SNIP>
  0000000000001040 <main>:
      1040:       31 c0                   xor    %eax,%eax
      1042:       e9 09 01 00 00          jmpq   1150 <foo>
  <SNIP>
  0000000000001140 <bar>:
      1140:       b8 ff ff ff ff          mov    $0xffffffff,%eax
      1145:       c3                      retq
  <SNIP>
  0000000000001150 <foo>:
      1150:       31 c0                   xor    %eax,%eax
      1152:       e8 e9 ff ff ff          callq  1140 <bar>
      1157:       83 c0 01                add    $0x1,%eax
      115a:       c3                      retq
  <SNIP>
  $ perf record -o jmp-to-fn.perf.data -e intel_pt/cyc/u ./jmp-to-fn
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0,017 MB jmp-to-fn.perf.data ]
  $ perf script -i jmp-to-fn.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py jmp-to-fn.db branches calls
  2019-01-08 13:24:58.783069 Creating database...
  2019-01-08 13:24:58.794650 Writing records...
  2019-01-08 13:24:59.008050 Adding indexes
  2019-01-08 13:24:59.015802 Done
  $  ~/libexec/perf-core/scripts/python/exported-sql-viewer.py jmp-to-fn.db

Before:

    main
        -> bar

After:

    main
        -> foo
            -> bar

Committer testing:

Install the python2-pyside package, then select these menu options
on the GUI:

   "Reports"
      "Context sensitive callgraphs"

Then go on expanding the symbols, to get, full picture when doing this
on a fedora:29 with gcc version 8.2.1 20181215 (Red Hat 8.2.1-6) (GCC):

jmp-to-fn
  PID:TID
    _start                (ld-2.28.so)
      __libc_start_main
        main
          foo
            bar

To verify that indeed, this fixes the problem.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/20190109091835.5570-5-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/scripts/python/export-to-postgresql.py |  2 +-
 tools/perf/scripts/python/export-to-sqlite.py     |  2 +-
 tools/perf/util/thread-stack.c                    | 30 +++++++++++++++++++++--
 tools/perf/util/thread-stack.h                    |  3 +++
 4 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/tools/perf/scripts/python/export-to-postgresql.py b/tools/perf/scripts/python/export-to-postgresql.py
index 0564dd7377f2..30130213da7e 100644
--- a/tools/perf/scripts/python/export-to-postgresql.py
+++ b/tools/perf/scripts/python/export-to-postgresql.py
@@ -478,7 +478,7 @@ if perf_db_export_calls:
 			'branch_count,'
 			'call_id,'
 			'return_id,'
-			'CASE WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' ELSE \'\' END AS flags,'
+			'CASE WHEN flags=0 THEN \'\' WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' WHEN flags=6 THEN \'jump\' ELSE flags END AS flags,'
 			'parent_call_path_id'
 		' FROM calls INNER JOIN call_paths ON call_paths.id = call_path_id')
 
diff --git a/tools/perf/scripts/python/export-to-sqlite.py b/tools/perf/scripts/python/export-to-sqlite.py
index 245caf2643ed..ed237f2ed03f 100644
--- a/tools/perf/scripts/python/export-to-sqlite.py
+++ b/tools/perf/scripts/python/export-to-sqlite.py
@@ -320,7 +320,7 @@ if perf_db_export_calls:
 			'branch_count,'
 			'call_id,'
 			'return_id,'
-			'CASE WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' ELSE \'\' END AS flags,'
+			'CASE WHEN flags=0 THEN \'\' WHEN flags=1 THEN \'no call\' WHEN flags=2 THEN \'no return\' WHEN flags=3 THEN \'no call/return\' WHEN flags=6 THEN \'jump\' ELSE flags END AS flags,'
 			'parent_call_path_id'
 		' FROM calls INNER JOIN call_paths ON call_paths.id = call_path_id')
 
diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
index 7f8eff018c16..f52c0f90915d 100644
--- a/tools/perf/util/thread-stack.c
+++ b/tools/perf/util/thread-stack.c
@@ -38,6 +38,7 @@
  * @cp: call path
  * @no_call: a 'call' was not seen
  * @trace_end: a 'call' but trace ended
+ * @non_call: a branch but not a 'call' to the start of a different symbol
  */
 struct thread_stack_entry {
 	u64 ret_addr;
@@ -47,6 +48,7 @@ struct thread_stack_entry {
 	struct call_path *cp;
 	bool no_call;
 	bool trace_end;
+	bool non_call;
 };
 
 /**
@@ -268,6 +270,8 @@ static int thread_stack__call_return(struct thread *thread,
 		cr.flags |= CALL_RETURN_NO_CALL;
 	if (no_return)
 		cr.flags |= CALL_RETURN_NO_RETURN;
+	if (tse->non_call)
+		cr.flags |= CALL_RETURN_NON_CALL;
 
 	return crp->process(&cr, crp->data);
 }
@@ -510,6 +514,7 @@ static int thread_stack__push_cp(struct thread_stack *ts, u64 ret_addr,
 	tse->cp = cp;
 	tse->no_call = no_call;
 	tse->trace_end = trace_end;
+	tse->non_call = false;
 
 	return 0;
 }
@@ -531,14 +536,16 @@ static int thread_stack__pop_cp(struct thread *thread, struct thread_stack *ts,
 							 timestamp, ref, false);
 	}
 
-	if (ts->stack[ts->cnt - 1].ret_addr == ret_addr) {
+	if (ts->stack[ts->cnt - 1].ret_addr == ret_addr &&
+	    !ts->stack[ts->cnt - 1].non_call) {
 		return thread_stack__call_return(thread, ts, --ts->cnt,
 						 timestamp, ref, false);
 	} else {
 		size_t i = ts->cnt - 1;
 
 		while (i--) {
-			if (ts->stack[i].ret_addr != ret_addr)
+			if (ts->stack[i].ret_addr != ret_addr ||
+			    ts->stack[i].non_call)
 				continue;
 			i += 1;
 			while (ts->cnt > i) {
@@ -757,6 +764,25 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
 		err = thread_stack__trace_begin(thread, ts, sample->time, ref);
 	} else if (sample->flags & PERF_IP_FLAG_TRACE_END) {
 		err = thread_stack__trace_end(ts, sample, ref);
+	} else if (sample->flags & PERF_IP_FLAG_BRANCH &&
+		   from_al->sym != to_al->sym && to_al->sym &&
+		   to_al->addr == to_al->sym->start) {
+		struct call_path_root *cpr = ts->crp->cpr;
+		struct call_path *cp;
+
+		/*
+		 * The compiler might optimize a call/ret combination by making
+		 * it a jmp. Make that visible by recording on the stack a
+		 * branch to the start of a different symbol. Note, that means
+		 * when a ret pops the stack, all jmps must be popped off first.
+		 */
+		cp = call_path__findnew(cpr, ts->stack[ts->cnt - 1].cp,
+					to_al->sym, sample->addr,
+					ts->kernel_start);
+		err = thread_stack__push_cp(ts, 0, sample->time, ref, cp, false,
+					    false);
+		if (!err)
+			ts->stack[ts->cnt - 1].non_call = true;
 	}
 
 	return err;
diff --git a/tools/perf/util/thread-stack.h b/tools/perf/util/thread-stack.h
index 1f626f4a1c40..b7c04e19ad41 100644
--- a/tools/perf/util/thread-stack.h
+++ b/tools/perf/util/thread-stack.h
@@ -35,10 +35,13 @@ struct call_path;
  *
  * CALL_RETURN_NO_CALL: 'return' but no matching 'call'
  * CALL_RETURN_NO_RETURN: 'call' but no matching 'return'
+ * CALL_RETURN_NON_CALL: a branch but not a 'call' to the start of a different
+ *                       symbol
  */
 enum {
 	CALL_RETURN_NO_CALL	= 1 << 0,
 	CALL_RETURN_NO_RETURN	= 1 << 1,
+	CALL_RETURN_NON_CALL	= 1 << 2,
 };
 
 /**

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/6] perf thread-stack: Improve thread_stack__no_call_return()
  2019-01-09  9:18 ` [PATCH 5/6] perf thread-stack: Improve thread_stack__no_call_return() Adrian Hunter
@ 2019-02-22  9:48   ` Adrian Hunter
  2019-02-22 14:39     ` Arnaldo Carvalho de Melo
  2019-02-28  7:49   ` [tip:perf/core] " tip-bot for Adrian Hunter
  1 sibling, 1 reply; 23+ messages in thread
From: Adrian Hunter @ 2019-02-22  9:48 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Jiri Olsa, linux-kernel

On 9/01/19 11:18 AM, Adrian Hunter wrote:
> Improve thread_stack__no_call_return() to better handle 'returns' that do
> not match the stack i.e. 'no call'. See code comments for details.

This patch and patch 6 of the series do not seem to have been applied.
Can they be applied?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/6] perf thread-stack: Improve thread_stack__no_call_return()
  2019-02-22  9:48   ` Adrian Hunter
@ 2019-02-22 14:39     ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 23+ messages in thread
From: Arnaldo Carvalho de Melo @ 2019-02-22 14:39 UTC (permalink / raw)
  To: Adrian Hunter; +Cc: Jiri Olsa, linux-kernel

Em Fri, Feb 22, 2019 at 11:48:51AM +0200, Adrian Hunter escreveu:
> On 9/01/19 11:18 AM, Adrian Hunter wrote:
> > Improve thread_stack__no_call_return() to better handle 'returns' that do
> > not match the stack i.e. 'no call'. See code comments for details.
> 
> This patch and patch 6 of the series do not seem to have been applied.
> Can they be applied?

I think there was some python3 oddity and I ended up letting this fell
thru the cracks, now I applied and tested it, see my results below, ints
in my perf/core branch now, will go together with your new batch in the
next pull req to Ingo,

Thanks,

- Arnaldo

commit 4ad79244f4899815a904f6e6adb71d24216b9b1d
Author: Adrian Hunter <adrian.hunter@intel.com>
Date:   Wed Jan 9 11:18:34 2019 +0200

    perf thread-stack: Improve thread_stack__no_call_return()
    
    Improve thread_stack__no_call_return() to better handle 'returns' that
    do not match the stack i.e. 'no call'. See code comments for details.
    The example below shows how retpolines are affected:
    
    Example:
    
      $ cat simple-retpoline.c
      __attribute__((noinline)) int bar(void)
      {
              return -1;
      }
    
      int foo(void)
      {
              return bar() + 1;
      }
    
      __attribute__((indirect_branch("thunk"))) int main()
      {
              int (*volatile fn)(void) = foo;
    
              fn();
              return fn();
      }
      $ gcc -ggdb3 -Wall -Wextra -O2 -o simple-retpoline simple-retpoline.c
      $ objdump -d simple-retpoline
      <SNIP>
      0000000000001040 <main>:
          1040:       48 83 ec 18             sub    $0x18,%rsp
          1044:       48 8d 05 25 01 00 00    lea    0x125(%rip),%rax        # 1170 <foo>
          104b:       48 89 44 24 08          mov    %rax,0x8(%rsp)
          1050:       48 8b 44 24 08          mov    0x8(%rsp),%rax
          1055:       e8 1f 01 00 00          callq  1179 <__x86_indirect_thunk_rax>
          105a:       48 8b 44 24 08          mov    0x8(%rsp),%rax
          105f:       48 83 c4 18             add    $0x18,%rsp
          1063:       e9 11 01 00 00          jmpq   1179 <__x86_indirect_thunk_rax>
      <SNIP>
      0000000000001160 <bar>:
          1160:       b8 ff ff ff ff          mov    $0xffffffff,%eax
          1165:       c3                      retq
      <SNIP>
      0000000000001170 <foo>:
          1170:       e8 eb ff ff ff          callq  1160 <bar>
          1175:       83 c0 01                add    $0x1,%eax
          1178:       c3                      retq
      0000000000001179 <__x86_indirect_thunk_rax>:
          1179:       e8 07 00 00 00          callq  1185 <__x86_indirect_thunk_rax+0xc>
          117e:       f3 90                   pause
          1180:       0f ae e8                lfence
          1183:       eb f9                   jmp    117e <__x86_indirect_thunk_rax+0x5>
          1185:       48 89 04 24             mov    %rax,(%rsp)
          1189:       c3                      retq
      <SNIP>
      $ perf record -o simple-retpoline.perf.data -e intel_pt/cyc/u ./simple-retpoline
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 0,017 MB simple-retpoline.perf.data ]
      $ perf script -i simple-retpoline.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py simple-retpoline.db branches calls
      2019-01-08 14:03:37.851655 Creating database...
      2019-01-08 14:03:37.863256 Writing records...
      2019-01-08 14:03:38.069750 Adding indexes
      2019-01-08 14:03:38.078799 Done
      $ ~/libexec/perf-core/scripts/python/exported-sql-viewer.py simple-retpoline.db
    
    Before:
    
        main
            -> __x86_indirect_thunk_rax
                -> __x86_indirect_thunk_rax
                    -> __x86_indirect_thunk_rax
                        -> bar
    
    After:
    
        main
            -> __x86_indirect_thunk_rax
                -> __x86_indirect_thunk_rax
                    -> foo
                        -> bar
    
    Committer testing:
    
    Chose "Reports", Then "Context-Sensitive Call Graph" and then go on
    expanding:
    
    Before:
    
    simple-retpolin
       PID:PID
          _start
             _start
                __libc_start_main
                   main
                       __x86_indirect_thunk_rax
                          __x86_indirect_thunk_rax
                          bar
    
    After:
    
    Remove the "simple.retpoline.db" file, run again the 'perf script' line
    to regenerate the .db file and run the exported-sql-viewer.py again to
    get the same all the way to 'main', then, from there, including 'main':
    
                   main
                       __x86_indirect_thunk_rax
                           __x86_indirect_thunk_rax
                               foo
                                   bar
    
    Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
    Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Link: http://lkml.kernel.org/r/20190109091835.5570-6-adrian.hunter@intel.com
    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
index f52c0f90915d..632c07a125ab 100644
--- a/tools/perf/util/thread-stack.c
+++ b/tools/perf/util/thread-stack.c
@@ -638,14 +638,57 @@ static int thread_stack__no_call_return(struct thread *thread,
 	else
 		parent = root;
 
-	/* This 'return' had no 'call', so push and pop top of stack */
-	cp = call_path__findnew(cpr, parent, fsym, ip, ks);
+	if (parent->sym == from_al->sym) {
+		/*
+		 * At the bottom of the stack, assume the missing 'call' was
+		 * before the trace started. So, pop the current symbol and push
+		 * the 'to' symbol.
+		 */
+		if (ts->cnt == 1) {
+			err = thread_stack__call_return(thread, ts, --ts->cnt,
+							tm, ref, false);
+			if (err)
+				return err;
+		}
+
+		if (!ts->cnt) {
+			cp = call_path__findnew(cpr, root, tsym, addr, ks);
+
+			return thread_stack__push_cp(ts, addr, tm, ref, cp,
+						     true, false);
+		}
+
+		/*
+		 * Otherwise assume the 'return' is being used as a jump (e.g.
+		 * retpoline) and just push the 'to' symbol.
+		 */
+		cp = call_path__findnew(cpr, parent, tsym, addr, ks);
+
+		err = thread_stack__push_cp(ts, 0, tm, ref, cp, true, false);
+		if (!err)
+			ts->stack[ts->cnt - 1].non_call = true;
+
+		return err;
+	}
+
+	/*
+	 * Assume 'parent' has not yet returned, so push 'to', and then push and
+	 * pop 'from'.
+	 */
+
+	cp = call_path__findnew(cpr, parent, tsym, addr, ks);
 
 	err = thread_stack__push_cp(ts, addr, tm, ref, cp, true, false);
 	if (err)
 		return err;
 
-	return thread_stack__pop_cp(thread, ts, addr, tm, ref, tsym);
+	cp = call_path__findnew(cpr, cp, fsym, ip, ks);
+
+	err = thread_stack__push_cp(ts, ip, tm, ref, cp, true, false);
+	if (err)
+		return err;
+
+	return thread_stack__call_return(thread, ts, --ts->cnt, tm, ref, false);
 }
 
 static int thread_stack__trace_begin(struct thread *thread,

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/6] perf thread-stack: Hide x86 retpolines
  2019-01-09  9:18 ` [PATCH 6/6] perf thread-stack: Hide x86 retpolines Adrian Hunter
  2019-01-09 15:38   ` Jiri Olsa
@ 2019-02-22 19:42   ` Arnaldo Carvalho de Melo
  2019-02-22 20:53     ` Arnaldo Carvalho de Melo
  2019-02-28  7:49   ` [tip:perf/core] " tip-bot for Adrian Hunter
  2 siblings, 1 reply; 23+ messages in thread
From: Arnaldo Carvalho de Melo @ 2019-02-22 19:42 UTC (permalink / raw)
  To: Adrian Hunter; +Cc: Jiri Olsa, linux-kernel

Em Wed, Jan 09, 2019 at 11:18:35AM +0200, Adrian Hunter escreveu:
> x86 retpoline functions pollute the call graph by showing up everywhere
> there is an indirect branch, but they do not really mean anything. Make
> changes so that the default retpoline functions will no longer appear in
> the call graph. Note this only affects the call graph, since all the
> original branches are left unchanged.
> 
> This does not handle function return thunks, nor is there any improvement
> for the handling of inline thunks or extern thunks.
> 
> Example:
> 
> $ cat simple-retpoline.c
> __attribute__((noinline)) int bar(void)
> {
>         return -1;
> }
> 
> int foo(void)
> {
>         return bar() + 1;
> }
> 
> __attribute__((indirect_branch("thunk"))) int main()
> {
>         int (*volatile fn)(void) = foo;
> 
>         fn();
>         return fn();
> }
> $ gcc -ggdb3 -Wall -Wextra -O2 -o simple-retpoline simple-retpoline.c
> $ objdump -d simple-retpoline
> <SNIP>
> 0000000000001040 <main>:
>     1040:       48 83 ec 18             sub    $0x18,%rsp
>     1044:       48 8d 05 25 01 00 00    lea    0x125(%rip),%rax        # 1170 <foo>
>     104b:       48 89 44 24 08          mov    %rax,0x8(%rsp)
>     1050:       48 8b 44 24 08          mov    0x8(%rsp),%rax
>     1055:       e8 1f 01 00 00          callq  1179 <__x86_indirect_thunk_rax>
>     105a:       48 8b 44 24 08          mov    0x8(%rsp),%rax
>     105f:       48 83 c4 18             add    $0x18,%rsp
>     1063:       e9 11 01 00 00          jmpq   1179 <__x86_indirect_thunk_rax>
> <SNIP>
> 0000000000001160 <bar>:
>     1160:       b8 ff ff ff ff          mov    $0xffffffff,%eax
>     1165:       c3                      retq
> <SNIP>
> 0000000000001170 <foo>:
>     1170:       e8 eb ff ff ff          callq  1160 <bar>
>     1175:       83 c0 01                add    $0x1,%eax
>     1178:       c3                      retq
> 0000000000001179 <__x86_indirect_thunk_rax>:
>     1179:       e8 07 00 00 00          callq  1185 <__x86_indirect_thunk_rax+0xc>
>     117e:       f3 90                   pause
>     1180:       0f ae e8                lfence
>     1183:       eb f9                   jmp    117e <__x86_indirect_thunk_rax+0x5>
>     1185:       48 89 04 24             mov    %rax,(%rsp)
>     1189:       c3                      retq
> <SNIP>
> $ perf record -o simple-retpoline.perf.data -e intel_pt/cyc/u ./simple-retpoline
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0,017 MB simple-retpoline.perf.data ]
> $ perf script -i simple-retpoline.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py simple-retpoline.db branches calls
> 2019-01-08 14:03:37.851655 Creating database...
> 2019-01-08 14:03:37.863256 Writing records...
> 2019-01-08 14:03:38.069750 Adding indexes
> 2019-01-08 14:03:38.078799 Done
> $ ~/libexec/perf-core/scripts/python/exported-sql-viewer.py simple-retpoline.db
> 
> Before:
> 
>     main
>         -> __x86_indirect_thunk_rax
>             -> __x86_indirect_thunk_rax
>                 -> foo
>                     -> bar
> 
> After:
> 
>     main
>         -> foo
>             -> bar
> 
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
>  tools/perf/util/thread-stack.c | 112 ++++++++++++++++++++++++++++++++-
>  1 file changed, 109 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
> index 632c07a125ab..805e30836460 100644
> --- a/tools/perf/util/thread-stack.c
> +++ b/tools/perf/util/thread-stack.c
> @@ -20,6 +20,7 @@
>  #include "thread.h"
>  #include "event.h"
>  #include "machine.h"
> +#include "env.h"
>  #include "util.h"
>  #include "debug.h"
>  #include "symbol.h"
> @@ -29,6 +30,19 @@
>  
>  #define STACK_GROWTH 2048
>  
> +/*
> + * State of retpoline detection.
> + *
> + * RETPOLINE_NONE: no retpoline detection
> + * X86_RETPOLINE_POSSIBLE: x86 retpoline possible
> + * X86_RETPOLINE_DETECTED: x86 retpoline detected
> + */
> +enum retpoline_state_t {
> +	RETPOLINE_NONE,
> +	X86_RETPOLINE_POSSIBLE,
> +	X86_RETPOLINE_DETECTED,
> +};
> +
>  /**
>   * struct thread_stack_entry - thread stack entry.
>   * @ret_addr: return address
> @@ -64,6 +78,7 @@ struct thread_stack_entry {
>   * @crp: call/return processor
>   * @comm: current comm
>   * @arr_sz: size of array if this is the first element of an array
> + * @rstate: used to detect retpolines
>   */
>  struct thread_stack {
>  	struct thread_stack_entry *stack;
> @@ -76,6 +91,7 @@ struct thread_stack {
>  	struct call_return_processor *crp;
>  	struct comm *comm;
>  	unsigned int arr_sz;
> +	enum retpoline_state_t rstate;
>  };
>  
>  /*
> @@ -115,10 +131,16 @@ static int thread_stack__init(struct thread_stack *ts, struct thread *thread,
>  	if (err)
>  		return err;
>  
> -	if (thread->mg && thread->mg->machine)
> -		ts->kernel_start = machine__kernel_start(thread->mg->machine);
> -	else
> +	if (thread->mg && thread->mg->machine) {
> +		struct machine *machine = thread->mg->machine;
> +		const char *arch = perf_env__arch(machine->env);
> +
> +		ts->kernel_start = machine__kernel_start(machine);
> +		if (!strcmp(arch, "x86"))
> +			ts->rstate = X86_RETPOLINE_POSSIBLE;
> +	} else {
>  		ts->kernel_start = 1ULL << 63;
> +	}
>  	ts->crp = crp;
>  
>  	return 0;
> @@ -733,6 +755,70 @@ static int thread_stack__trace_end(struct thread_stack *ts,
>  				     false, true);
>  }
>  
> +static bool is_x86_retpoline(const char *name)
> +{
> +	const char *p = strstr(name, "__x86_indirect_thunk_");
> +
> +	return p == name || !strcmp(name, "__indirect_thunk_start");
> +}
> +
> +/*
> + * x86 retpoline functions pollute the call graph. This function removes them.
> + * This does not handle function return thunks, nor is there any improvement
> + * for the handling of inline thunks or extern thunks.
> + */
> +static int thread_stack__x86_retpoline(struct thread_stack *ts,
> +				       struct perf_sample *sample,
> +				       struct addr_location *to_al)
> +{
> +	struct thread_stack_entry *tse = &ts->stack[ts->cnt - 1];
> +	struct call_path_root *cpr = ts->crp->cpr;
> +	struct symbol *sym = tse->cp->sym;
> +	struct symbol *tsym = to_al->sym;
> +	struct call_path *cp;
> +
> +	if (sym && sym->name && is_x86_retpoline(sym->name)) {


  CC       /tmp/build/perf/util/scripting-engines/trace-event-perl.o
  CC       /tmp/build/perf/util/intel-pt.o
  CC       /tmp/build/perf/util/intel-pt-decoder/intel-pt-log.o
util/thread-stack.c:780:18: error: address of array 'sym->name' will always evaluate to 'true' [-Werror,-Wpointer-bool-conversion]
        if (sym && sym->name && is_x86_retpoline(sym->name)) {
                ~~ ~~~~~^~~~
1 error generated.
mv: cannot stat '/tmp/build/perf/util/.thread-stack.o.tmp': No such file or directory
make[4]: *** [/git/linux/tools/build/Makefile.build:96: /tmp/build/perf/util/thread-stack.o] Error 1


[acme@quaco perf]$ pahole -C symbol ~/bin/perf
struct symbol {
	struct rb_node             rb_node;              /*     0    24 */
	u64                        start;                /*    24     8 */
	u64                        end;                  /*    32     8 */
	u16                        namelen;              /*    40     2 */
	u8                         type:4;               /*    42: 4  1 */
	u8                         binding:4;            /*    42: 0  1 */
	u8                         idle:1;               /*    43: 7  1 */
	u8                         ignore:1;             /*    43: 6  1 */
	u8                         inlined:1;            /*    43: 5  1 */

	/* XXX 5 bits hole, try to pack */

	u8                         arch_sym;             /*    44     1 */
	_Bool                      annotate2;            /*    45     1 */
	char                       name[0];              /*    46     0 */

	/* size: 48, cachelines: 1, members: 12 */
	/* bit holes: 1, sum bit holes: 5 bits */
	/* padding: 2 */
	/* last cacheline: 48 bytes */
};
[acme@quaco perf]$

I'm removing that sym->name test.

> +		/*
> +		 * This is a x86 retpoline fn. It pollutes the call graph by
> +		 * showing up everywhere there is an indirect branch, but does
> +		 * not itself mean anything. Here the top-of-stack is removed,
> +		 * by decrementing the stack count, and then further down, the
> +		 * resulting top-of-stack is replaced with the actual target.
> +		 * The result is that the retpoline functions will no longer
> +		 * appear in the call graph. Note this only affects the call
> +		 * graph, since all the original branches are left unchanged.
> +		 */
> +		ts->cnt -= 1;
> +		sym = ts->stack[ts->cnt - 2].cp->sym;
> +		if (sym && sym == tsym && to_al->addr != tsym->start) {
> +			/*
> +			 * Target is back to the middle of the symbol we came
> +			 * from so assume it is an indirect jmp and forget it
> +			 * altogether.
> +			 */
> +			ts->cnt -= 1;
> +			return 0;
> +		}
> +	} else if (sym && sym == tsym) {
> +		/*
> +		 * Target is back to the symbol we came from so assume it is an
> +		 * indirect jmp and forget it altogether.
> +		 */
> +		ts->cnt -= 1;
> +		return 0;
> +	}
> +
> +	cp = call_path__findnew(cpr, ts->stack[ts->cnt - 2].cp, tsym,
> +				sample->addr, ts->kernel_start);
> +	if (!cp)
> +		return -ENOMEM;
> +
> +	/* Replace the top-of-stack with the actual target */
> +	ts->stack[ts->cnt - 1].cp = cp;
> +
> +	return 0;
> +}
> +
>  int thread_stack__process(struct thread *thread, struct comm *comm,
>  			  struct perf_sample *sample,
>  			  struct addr_location *from_al,
> @@ -740,6 +826,7 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
>  			  struct call_return_processor *crp)
>  {
>  	struct thread_stack *ts = thread__stack(thread, sample->cpu);
> +	enum retpoline_state_t rstate;
>  	int err = 0;
>  
>  	if (ts && !ts->crp) {
> @@ -755,6 +842,10 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
>  		ts->comm = comm;
>  	}
>  
> +	rstate = ts->rstate;
> +	if (rstate == X86_RETPOLINE_DETECTED)
> +		ts->rstate = X86_RETPOLINE_POSSIBLE;
> +
>  	/* Flush stack on exec */
>  	if (ts->comm != comm && thread->pid_ == thread->tid) {
>  		err = __thread_stack__flush(thread, ts);
> @@ -791,10 +882,25 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
>  					ts->kernel_start);
>  		err = thread_stack__push_cp(ts, ret_addr, sample->time, ref,
>  					    cp, false, trace_end);
> +
> +		/*
> +		 * A call to the same symbol but not the start of the symbol,
> +		 * may be the start of a x86 retpoline.
> +		 */
> +		if (!err && rstate == X86_RETPOLINE_POSSIBLE && to_al->sym &&
> +		    from_al->sym == to_al->sym &&
> +		    to_al->addr != to_al->sym->start)
> +			ts->rstate = X86_RETPOLINE_DETECTED;
> +
>  	} else if (sample->flags & PERF_IP_FLAG_RETURN) {
>  		if (!sample->ip || !sample->addr)
>  			return 0;
>  
> +		/* x86 retpoline 'return' doesn't match the stack */
> +		if (rstate == X86_RETPOLINE_DETECTED && ts->cnt > 2 &&
> +		    ts->stack[ts->cnt - 1].ret_addr != sample->addr)
> +			return thread_stack__x86_retpoline(ts, sample, to_al);
> +
>  		err = thread_stack__pop_cp(thread, ts, sample->addr,
>  					   sample->time, ref, from_al->sym);
>  		if (err) {
> -- 
> 2.17.1

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/6] perf thread-stack: Hide x86 retpolines
  2019-02-22 19:42   ` Arnaldo Carvalho de Melo
@ 2019-02-22 20:53     ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 23+ messages in thread
From: Arnaldo Carvalho de Melo @ 2019-02-22 20:53 UTC (permalink / raw)
  To: Adrian Hunter; +Cc: Jiri Olsa, linux-kernel

Em Fri, Feb 22, 2019 at 04:42:44PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Wed, Jan 09, 2019 at 11:18:35AM +0200, Adrian Hunter escreveu:
> > +static int thread_stack__x86_retpoline(struct thread_stack *ts,
> > +				       struct perf_sample *sample,
> > +				       struct addr_location *to_al)
> > +{
> > +	struct thread_stack_entry *tse = &ts->stack[ts->cnt - 1];
> > +	struct call_path_root *cpr = ts->crp->cpr;
> > +	struct symbol *sym = tse->cp->sym;
> > +	struct symbol *tsym = to_al->sym;
> > +	struct call_path *cp;
> > +
> > +	if (sym && sym->name && is_x86_retpoline(sym->name)) {
> 
> 
>   CC       /tmp/build/perf/util/scripting-engines/trace-event-perl.o
>   CC       /tmp/build/perf/util/intel-pt.o
>   CC       /tmp/build/perf/util/intel-pt-decoder/intel-pt-log.o
> util/thread-stack.c:780:18: error: address of array 'sym->name' will always evaluate to 'true' [-Werror,-Wpointer-bool-conversion]
>         if (sym && sym->name && is_x86_retpoline(sym->name)) {
>                 ~~ ~~~~~^~~~
> 1 error generated.
> mv: cannot stat '/tmp/build/perf/util/.thread-stack.o.tmp': No such file or directory
> make[4]: *** [/git/linux/tools/build/Makefile.build:96: /tmp/build/perf/util/thread-stack.o] Error 1
> 
> 
> [acme@quaco perf]$ pahole -C symbol ~/bin/perf
> struct symbol {
> 	struct rb_node             rb_node;              /*     0    24 */
> 	u64                        start;                /*    24     8 */
> 	u64                        end;                  /*    32     8 */
> 	u16                        namelen;              /*    40     2 */
> 	u8                         type:4;               /*    42: 4  1 */
> 	u8                         binding:4;            /*    42: 0  1 */
> 	u8                         idle:1;               /*    43: 7  1 */
> 	u8                         ignore:1;             /*    43: 6  1 */
> 	u8                         inlined:1;            /*    43: 5  1 */
> 
> 	/* XXX 5 bits hole, try to pack */
> 
> 	u8                         arch_sym;             /*    44     1 */
> 	_Bool                      annotate2;            /*    45     1 */
> 	char                       name[0];              /*    46     0 */
> 
> 	/* size: 48, cachelines: 1, members: 12 */
> 	/* bit holes: 1, sum bit holes: 5 bits */
> 	/* padding: 2 */
> 	/* last cacheline: 48 bytes */
> };
> [acme@quaco perf]$
> 
> I'm removing that sym->name test.

For completeness sake, this was one of the compilers where this failed:

clang version 7.0.1 (Fedora 7.0.1-2.fc30)

The systems where it broke:

  18 debian:9                      : FAIL
  19 debian:experimental           : FAIL
  29 fedora:25                     : FAIL
  30 fedora:26                     : FAIL
  31 fedora:27                     : FAIL
  32 fedora:28                     : FAIL
  33 fedora:29                     : FAIL
  34 fedora:30                     : FAIL
  35 fedora:rawhide                : FAIL
  45 opensuse:tumbleweed           : FAIL
  51 ubuntu:16.04                  : FAIL
  58 ubuntu:17.10                  : FAIL
  59 ubuntu:18.04                  : FAIL
  70 ubuntu:18.10                  : FAIL
  71 ubuntu:19.04                  : FAIL

- Arnaldo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [tip:perf/core] perf thread-stack: Improve thread_stack__no_call_return()
  2019-01-09  9:18 ` [PATCH 5/6] perf thread-stack: Improve thread_stack__no_call_return() Adrian Hunter
  2019-02-22  9:48   ` Adrian Hunter
@ 2019-02-28  7:49   ` tip-bot for Adrian Hunter
  1 sibling, 0 replies; 23+ messages in thread
From: tip-bot for Adrian Hunter @ 2019-02-28  7:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jolsa, linux-kernel, acme, adrian.hunter, tglx, mingo, hpa

Commit-ID:  1f35cd65386e02430546727bd3cc2508d052e3ec
Gitweb:     https://git.kernel.org/tip/1f35cd65386e02430546727bd3cc2508d052e3ec
Author:     Adrian Hunter <adrian.hunter@intel.com>
AuthorDate: Wed, 9 Jan 2019 11:18:34 +0200
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Fri, 22 Feb 2019 11:42:34 -0300

perf thread-stack: Improve thread_stack__no_call_return()

Improve thread_stack__no_call_return() to better handle 'returns' that
do not match the stack i.e. 'no call'. See code comments for details.
The example below shows how retpolines are affected:

Example:

  $ cat simple-retpoline.c
  __attribute__((noinline)) int bar(void)
  {
          return -1;
  }

  int foo(void)
  {
          return bar() + 1;
  }

  __attribute__((indirect_branch("thunk"))) int main()
  {
          int (*volatile fn)(void) = foo;

          fn();
          return fn();
  }
  $ gcc -ggdb3 -Wall -Wextra -O2 -o simple-retpoline simple-retpoline.c
  $ objdump -d simple-retpoline
  <SNIP>
  0000000000001040 <main>:
      1040:       48 83 ec 18             sub    $0x18,%rsp
      1044:       48 8d 05 25 01 00 00    lea    0x125(%rip),%rax        # 1170 <foo>
      104b:       48 89 44 24 08          mov    %rax,0x8(%rsp)
      1050:       48 8b 44 24 08          mov    0x8(%rsp),%rax
      1055:       e8 1f 01 00 00          callq  1179 <__x86_indirect_thunk_rax>
      105a:       48 8b 44 24 08          mov    0x8(%rsp),%rax
      105f:       48 83 c4 18             add    $0x18,%rsp
      1063:       e9 11 01 00 00          jmpq   1179 <__x86_indirect_thunk_rax>
  <SNIP>
  0000000000001160 <bar>:
      1160:       b8 ff ff ff ff          mov    $0xffffffff,%eax
      1165:       c3                      retq
  <SNIP>
  0000000000001170 <foo>:
      1170:       e8 eb ff ff ff          callq  1160 <bar>
      1175:       83 c0 01                add    $0x1,%eax
      1178:       c3                      retq
  0000000000001179 <__x86_indirect_thunk_rax>:
      1179:       e8 07 00 00 00          callq  1185 <__x86_indirect_thunk_rax+0xc>
      117e:       f3 90                   pause
      1180:       0f ae e8                lfence
      1183:       eb f9                   jmp    117e <__x86_indirect_thunk_rax+0x5>
      1185:       48 89 04 24             mov    %rax,(%rsp)
      1189:       c3                      retq
  <SNIP>
  $ perf record -o simple-retpoline.perf.data -e intel_pt/cyc/u ./simple-retpoline
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0,017 MB simple-retpoline.perf.data ]
  $ perf script -i simple-retpoline.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py simple-retpoline.db branches calls
  2019-01-08 14:03:37.851655 Creating database...
  2019-01-08 14:03:37.863256 Writing records...
  2019-01-08 14:03:38.069750 Adding indexes
  2019-01-08 14:03:38.078799 Done
  $ ~/libexec/perf-core/scripts/python/exported-sql-viewer.py simple-retpoline.db

Before:

    main
        -> __x86_indirect_thunk_rax
            -> __x86_indirect_thunk_rax
                -> __x86_indirect_thunk_rax
                    -> bar

After:

    main
        -> __x86_indirect_thunk_rax
            -> __x86_indirect_thunk_rax
                -> foo
                    -> bar

Committer testing:

Chose "Reports", Then "Context-Sensitive Call Graph" and then go on
expanding:

Before:

simple-retpolin
   PID:PID
      _start
         _start
            __libc_start_main
               main
                   __x86_indirect_thunk_rax
                      __x86_indirect_thunk_rax
                      bar

After:

Remove the "simple.retpoline.db" file, run again the 'perf script' line
to regenerate the .db file and run the exported-sql-viewer.py again to
get the same all the way to 'main', then, from there, including 'main':

               main
                   __x86_indirect_thunk_rax
                       __x86_indirect_thunk_rax
                           foo
                               bar

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/20190109091835.5570-6-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/thread-stack.c | 49 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
index f52c0f90915d..632c07a125ab 100644
--- a/tools/perf/util/thread-stack.c
+++ b/tools/perf/util/thread-stack.c
@@ -638,14 +638,57 @@ static int thread_stack__no_call_return(struct thread *thread,
 	else
 		parent = root;
 
-	/* This 'return' had no 'call', so push and pop top of stack */
-	cp = call_path__findnew(cpr, parent, fsym, ip, ks);
+	if (parent->sym == from_al->sym) {
+		/*
+		 * At the bottom of the stack, assume the missing 'call' was
+		 * before the trace started. So, pop the current symbol and push
+		 * the 'to' symbol.
+		 */
+		if (ts->cnt == 1) {
+			err = thread_stack__call_return(thread, ts, --ts->cnt,
+							tm, ref, false);
+			if (err)
+				return err;
+		}
+
+		if (!ts->cnt) {
+			cp = call_path__findnew(cpr, root, tsym, addr, ks);
+
+			return thread_stack__push_cp(ts, addr, tm, ref, cp,
+						     true, false);
+		}
+
+		/*
+		 * Otherwise assume the 'return' is being used as a jump (e.g.
+		 * retpoline) and just push the 'to' symbol.
+		 */
+		cp = call_path__findnew(cpr, parent, tsym, addr, ks);
+
+		err = thread_stack__push_cp(ts, 0, tm, ref, cp, true, false);
+		if (!err)
+			ts->stack[ts->cnt - 1].non_call = true;
+
+		return err;
+	}
+
+	/*
+	 * Assume 'parent' has not yet returned, so push 'to', and then push and
+	 * pop 'from'.
+	 */
+
+	cp = call_path__findnew(cpr, parent, tsym, addr, ks);
 
 	err = thread_stack__push_cp(ts, addr, tm, ref, cp, true, false);
 	if (err)
 		return err;
 
-	return thread_stack__pop_cp(thread, ts, addr, tm, ref, tsym);
+	cp = call_path__findnew(cpr, cp, fsym, ip, ks);
+
+	err = thread_stack__push_cp(ts, ip, tm, ref, cp, true, false);
+	if (err)
+		return err;
+
+	return thread_stack__call_return(thread, ts, --ts->cnt, tm, ref, false);
 }
 
 static int thread_stack__trace_begin(struct thread *thread,

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:perf/core] perf thread-stack: Hide x86 retpolines
  2019-01-09  9:18 ` [PATCH 6/6] perf thread-stack: Hide x86 retpolines Adrian Hunter
  2019-01-09 15:38   ` Jiri Olsa
  2019-02-22 19:42   ` Arnaldo Carvalho de Melo
@ 2019-02-28  7:49   ` tip-bot for Adrian Hunter
  2 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Adrian Hunter @ 2019-02-28  7:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: acme, jolsa, linux-kernel, mingo, adrian.hunter, tglx, hpa

Commit-ID:  3c0cd952cf051903929cd57b89034cc5f71f451d
Gitweb:     https://git.kernel.org/tip/3c0cd952cf051903929cd57b89034cc5f71f451d
Author:     Adrian Hunter <adrian.hunter@intel.com>
AuthorDate: Wed, 9 Jan 2019 11:18:35 +0200
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Fri, 22 Feb 2019 16:49:49 -0300

perf thread-stack: Hide x86 retpolines

x86 retpoline functions pollute the call graph by showing up everywhere
there is an indirect branch, but they do not really mean anything. Make
changes so that the default retpoline functions will no longer appear in
the call graph. Note this only affects the call graph, since all the
original branches are left unchanged.

This does not handle function return thunks, nor is there any
improvement for the handling of inline thunks or extern thunks.

Example:

  $ cat simple-retpoline.c
  __attribute__((noinline)) int bar(void)
  {
          return -1;
  }

  int foo(void)
  {
          return bar() + 1;
  }

  __attribute__((indirect_branch("thunk"))) int main()
  {
          int (*volatile fn)(void) = foo;

          fn();
          return fn();
  }
  $ gcc -ggdb3 -Wall -Wextra -O2 -o simple-retpoline simple-retpoline.c
  $ objdump -d simple-retpoline
  <SNIP>
  0000000000001040 <main>:
      1040:       48 83 ec 18             sub    $0x18,%rsp
      1044:       48 8d 05 25 01 00 00    lea    0x125(%rip),%rax        # 1170 <foo>
      104b:       48 89 44 24 08          mov    %rax,0x8(%rsp)
      1050:       48 8b 44 24 08          mov    0x8(%rsp),%rax
      1055:       e8 1f 01 00 00          callq  1179 <__x86_indirect_thunk_rax>
      105a:       48 8b 44 24 08          mov    0x8(%rsp),%rax
      105f:       48 83 c4 18             add    $0x18,%rsp
      1063:       e9 11 01 00 00          jmpq   1179 <__x86_indirect_thunk_rax>
  <SNIP>
  0000000000001160 <bar>:
      1160:       b8 ff ff ff ff          mov    $0xffffffff,%eax
      1165:       c3                      retq
  <SNIP>
  0000000000001170 <foo>:
      1170:       e8 eb ff ff ff          callq  1160 <bar>
      1175:       83 c0 01                add    $0x1,%eax
      1178:       c3                      retq
  0000000000001179 <__x86_indirect_thunk_rax>:
      1179:       e8 07 00 00 00          callq  1185 <__x86_indirect_thunk_rax+0xc>
      117e:       f3 90                   pause
      1180:       0f ae e8                lfence
      1183:       eb f9                   jmp    117e <__x86_indirect_thunk_rax+0x5>
      1185:       48 89 04 24             mov    %rax,(%rsp)
      1189:       c3                      retq
  <SNIP>
  $ perf record -o simple-retpoline.perf.data -e intel_pt/cyc/u ./simple-retpoline
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0,017 MB simple-retpoline.perf.data ]
  $ perf script -i simple-retpoline.perf.data --itrace=be -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py simple-retpoline.db branches calls
  2019-01-08 14:03:37.851655 Creating database...
  2019-01-08 14:03:37.863256 Writing records...
  2019-01-08 14:03:38.069750 Adding indexes
  2019-01-08 14:03:38.078799 Done
  $ ~/libexec/perf-core/scripts/python/exported-sql-viewer.py simple-retpoline.db

Before:

    main
        -> __x86_indirect_thunk_rax
            -> __x86_indirect_thunk_rax
                -> foo
                    -> bar

After:

    main
        -> foo
            -> bar

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: http://lkml.kernel.org/r/20190109091835.5570-7-adrian.hunter@intel.com
[ Remove (sym->name != NULL) test, this is not a pointer and breaks the build with clang version 7.0.1 (Fedora 7.0.1-2.fc30) ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/thread-stack.c | 112 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 109 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c
index 632c07a125ab..a8b45168513c 100644
--- a/tools/perf/util/thread-stack.c
+++ b/tools/perf/util/thread-stack.c
@@ -20,6 +20,7 @@
 #include "thread.h"
 #include "event.h"
 #include "machine.h"
+#include "env.h"
 #include "util.h"
 #include "debug.h"
 #include "symbol.h"
@@ -29,6 +30,19 @@
 
 #define STACK_GROWTH 2048
 
+/*
+ * State of retpoline detection.
+ *
+ * RETPOLINE_NONE: no retpoline detection
+ * X86_RETPOLINE_POSSIBLE: x86 retpoline possible
+ * X86_RETPOLINE_DETECTED: x86 retpoline detected
+ */
+enum retpoline_state_t {
+	RETPOLINE_NONE,
+	X86_RETPOLINE_POSSIBLE,
+	X86_RETPOLINE_DETECTED,
+};
+
 /**
  * struct thread_stack_entry - thread stack entry.
  * @ret_addr: return address
@@ -64,6 +78,7 @@ struct thread_stack_entry {
  * @crp: call/return processor
  * @comm: current comm
  * @arr_sz: size of array if this is the first element of an array
+ * @rstate: used to detect retpolines
  */
 struct thread_stack {
 	struct thread_stack_entry *stack;
@@ -76,6 +91,7 @@ struct thread_stack {
 	struct call_return_processor *crp;
 	struct comm *comm;
 	unsigned int arr_sz;
+	enum retpoline_state_t rstate;
 };
 
 /*
@@ -115,10 +131,16 @@ static int thread_stack__init(struct thread_stack *ts, struct thread *thread,
 	if (err)
 		return err;
 
-	if (thread->mg && thread->mg->machine)
-		ts->kernel_start = machine__kernel_start(thread->mg->machine);
-	else
+	if (thread->mg && thread->mg->machine) {
+		struct machine *machine = thread->mg->machine;
+		const char *arch = perf_env__arch(machine->env);
+
+		ts->kernel_start = machine__kernel_start(machine);
+		if (!strcmp(arch, "x86"))
+			ts->rstate = X86_RETPOLINE_POSSIBLE;
+	} else {
 		ts->kernel_start = 1ULL << 63;
+	}
 	ts->crp = crp;
 
 	return 0;
@@ -733,6 +755,70 @@ static int thread_stack__trace_end(struct thread_stack *ts,
 				     false, true);
 }
 
+static bool is_x86_retpoline(const char *name)
+{
+	const char *p = strstr(name, "__x86_indirect_thunk_");
+
+	return p == name || !strcmp(name, "__indirect_thunk_start");
+}
+
+/*
+ * x86 retpoline functions pollute the call graph. This function removes them.
+ * This does not handle function return thunks, nor is there any improvement
+ * for the handling of inline thunks or extern thunks.
+ */
+static int thread_stack__x86_retpoline(struct thread_stack *ts,
+				       struct perf_sample *sample,
+				       struct addr_location *to_al)
+{
+	struct thread_stack_entry *tse = &ts->stack[ts->cnt - 1];
+	struct call_path_root *cpr = ts->crp->cpr;
+	struct symbol *sym = tse->cp->sym;
+	struct symbol *tsym = to_al->sym;
+	struct call_path *cp;
+
+	if (sym && is_x86_retpoline(sym->name)) {
+		/*
+		 * This is a x86 retpoline fn. It pollutes the call graph by
+		 * showing up everywhere there is an indirect branch, but does
+		 * not itself mean anything. Here the top-of-stack is removed,
+		 * by decrementing the stack count, and then further down, the
+		 * resulting top-of-stack is replaced with the actual target.
+		 * The result is that the retpoline functions will no longer
+		 * appear in the call graph. Note this only affects the call
+		 * graph, since all the original branches are left unchanged.
+		 */
+		ts->cnt -= 1;
+		sym = ts->stack[ts->cnt - 2].cp->sym;
+		if (sym && sym == tsym && to_al->addr != tsym->start) {
+			/*
+			 * Target is back to the middle of the symbol we came
+			 * from so assume it is an indirect jmp and forget it
+			 * altogether.
+			 */
+			ts->cnt -= 1;
+			return 0;
+		}
+	} else if (sym && sym == tsym) {
+		/*
+		 * Target is back to the symbol we came from so assume it is an
+		 * indirect jmp and forget it altogether.
+		 */
+		ts->cnt -= 1;
+		return 0;
+	}
+
+	cp = call_path__findnew(cpr, ts->stack[ts->cnt - 2].cp, tsym,
+				sample->addr, ts->kernel_start);
+	if (!cp)
+		return -ENOMEM;
+
+	/* Replace the top-of-stack with the actual target */
+	ts->stack[ts->cnt - 1].cp = cp;
+
+	return 0;
+}
+
 int thread_stack__process(struct thread *thread, struct comm *comm,
 			  struct perf_sample *sample,
 			  struct addr_location *from_al,
@@ -740,6 +826,7 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
 			  struct call_return_processor *crp)
 {
 	struct thread_stack *ts = thread__stack(thread, sample->cpu);
+	enum retpoline_state_t rstate;
 	int err = 0;
 
 	if (ts && !ts->crp) {
@@ -755,6 +842,10 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
 		ts->comm = comm;
 	}
 
+	rstate = ts->rstate;
+	if (rstate == X86_RETPOLINE_DETECTED)
+		ts->rstate = X86_RETPOLINE_POSSIBLE;
+
 	/* Flush stack on exec */
 	if (ts->comm != comm && thread->pid_ == thread->tid) {
 		err = __thread_stack__flush(thread, ts);
@@ -791,10 +882,25 @@ int thread_stack__process(struct thread *thread, struct comm *comm,
 					ts->kernel_start);
 		err = thread_stack__push_cp(ts, ret_addr, sample->time, ref,
 					    cp, false, trace_end);
+
+		/*
+		 * A call to the same symbol but not the start of the symbol,
+		 * may be the start of a x86 retpoline.
+		 */
+		if (!err && rstate == X86_RETPOLINE_POSSIBLE && to_al->sym &&
+		    from_al->sym == to_al->sym &&
+		    to_al->addr != to_al->sym->start)
+			ts->rstate = X86_RETPOLINE_DETECTED;
+
 	} else if (sample->flags & PERF_IP_FLAG_RETURN) {
 		if (!sample->ip || !sample->addr)
 			return 0;
 
+		/* x86 retpoline 'return' doesn't match the stack */
+		if (rstate == X86_RETPOLINE_DETECTED && ts->cnt > 2 &&
+		    ts->stack[ts->cnt - 1].ret_addr != sample->addr)
+			return thread_stack__x86_retpoline(ts, sample, to_al);
+
 		err = thread_stack__pop_cp(thread, ts, sample->addr,
 					   sample->time, ref, from_al->sym);
 		if (err) {

^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-02-28  7:49 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-09  9:18 [PATCH 0/6] perf thread-stack: x86 retpolines Adrian Hunter
2019-01-09  9:18 ` [PATCH 1/6] perf tools: Fix split_kallsyms_for_kcore for trampoline symbols Adrian Hunter
2019-02-09 12:57   ` [tip:perf/core] perf tools: Fix split_kallsyms_for_kcore() " tip-bot for Adrian Hunter
2019-01-09  9:18 ` [PATCH 2/6] perf thread-stack: Tidy thread_stack__push_cp() usage Adrian Hunter
2019-02-09 12:58   ` [tip:perf/core] " tip-bot for Adrian Hunter
2019-01-09  9:18 ` [PATCH 3/6] perf thread-stack: Tidy thread_stack__no_call_return() by adding more local variables Adrian Hunter
2019-02-09 12:58   ` [tip:perf/core] " tip-bot for Adrian Hunter
2019-01-09  9:18 ` [PATCH 4/6] perf thread-stack: Represent jmps to the start of a different symbol Adrian Hunter
2019-02-06 12:39   ` Arnaldo Carvalho de Melo
2019-02-06 13:25     ` Adrian Hunter
2019-02-09 12:59   ` [tip:perf/core] " tip-bot for Adrian Hunter
2019-01-09  9:18 ` [PATCH 5/6] perf thread-stack: Improve thread_stack__no_call_return() Adrian Hunter
2019-02-22  9:48   ` Adrian Hunter
2019-02-22 14:39     ` Arnaldo Carvalho de Melo
2019-02-28  7:49   ` [tip:perf/core] " tip-bot for Adrian Hunter
2019-01-09  9:18 ` [PATCH 6/6] perf thread-stack: Hide x86 retpolines Adrian Hunter
2019-01-09 15:38   ` Jiri Olsa
2019-01-10  7:52     ` Adrian Hunter
2019-02-22 19:42   ` Arnaldo Carvalho de Melo
2019-02-22 20:53     ` Arnaldo Carvalho de Melo
2019-02-28  7:49   ` [tip:perf/core] " tip-bot for Adrian Hunter
2019-01-10  9:55 ` [PATCH 0/6] perf thread-stack: " Jiri Olsa
2019-02-06  9:10   ` Adrian Hunter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).