linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Broken dwarf unwinding - wrong stack pointer register value?
@ 2018-10-20 22:39 Milian Wolff
  2018-10-21 20:32 ` Milian Wolff
  2018-10-22 10:35 ` Milian Wolff
  0 siblings, 2 replies; 33+ messages in thread
From: Milian Wolff @ 2018-10-20 22:39 UTC (permalink / raw)
  To: linux-kernel, Jiri Olsa, namhyung; +Cc: linux-perf-users, Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 7695 bytes --]

Hey all,

I'm on the quest to figure out why perf regularly fails to unwind (some) 
samples. I am seeing very strange behavior, where an apparently wrong stack 
pointer value is read from the register - see below for more information and 
the end of this (long) mail for my open questions. Any help would be greatly 
appreciated.

I am currently using this trivial C++ code to reproduce the issue:

```
#include <cmath>
#include <complex>
#include <iostream>
#include <random>

using namespace std;

int main()
{
    uniform_real_distribution<double> uniform(-1E5, 1E5);
    default_random_engine engine;
    double s = 0;
    for (int i = 0; i < 10000000; ++i) {
        s += norm(complex<double>(uniform(engine), uniform(engine)));
    }
    cout << s << '\n';
    return 0;
}
```

I compile it with `g++ -O2 -g` and then record it with `perf record --call-
graph dwarf`. Using perf script, I then see e.g.:

```
$ perf script -v --no-inline --time 90229.12668,90229.127158 --ns
...
# first frame (working unwinding from __hypot_finite):
unwind: reg 16, val 7faf7dca2696
unwind: reg 7, val 7ffc80811ca0
unwind: find_proc_info dso /usr/lib/libm-2.28.so
unwind: access_mem addr 0x7ffc80811ce8 val 7faf7dc88af9, offset 72
unwind: find_proc_info dso /usr/lib/libm-2.28.so
unwind: access_mem addr 0x7ffc80811d08 val 56382b0fc129, offset 104
unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/
test-clients/cpp-inlining/cpp-inlining
unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 184
unwind: find_proc_info dso /usr/lib/libc-2.28.so
unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 376
unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/
test-clients/cpp-inlining/cpp-inlining
unwind: __hypot_finite:ip = 0x7faf7dca2696 (0x29696)
unwind: hypotf32x:ip = 0x7faf7dc88af8 (0xfaf8)
unwind: main:ip = 0x56382b0fc128 (0x1128)
unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
# second frame (unrelated)
unwind: reg 16, val 56382b0fc114
unwind: reg 7, val 7ffc80811d10
unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/
test-clients/cpp-inlining/cpp-inlining
unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 72
unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 264
unwind: main:ip = 0x56382b0fc114 (0x1114)
unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
# third frame (broken unwinding from __hypot_finite)
unwind: reg 16, val 7faf7dca2688
unwind: reg 7, val 7ffc80811ca0
unwind: find_proc_info dso /usr/lib/libm-2.28.so
unwind: access_mem addr 0x7ffc80811cc0 val 0, offset 32
unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688)
unwind: '':ip = 0xffffffffffffffff (0x0)
cpp-inlining 24617 90229.126685606:     711026 cycles:uppp: 
	    7faf7dca2696 __hypot_finite+0x36 (/usr/lib/libm-2.28.so)
	    7faf7dc88af8 hypotf32x+0x18 (/usr/lib/libm-2.28.so)
	    56382b0fc128 main+0x88 (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)
	    7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
	    56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)

cpp-inlining 24617 90229.126921551:     714657 cycles:uppp: 
	    56382b0fc114 main+0x74 (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)
	    7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
	    56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)

cpp-inlining 24617 90229.127157818:     719976 cycles:uppp: 
	    7faf7dca2688 __hypot_finite+0x28 (/usr/lib/libm-2.28.so)
	ffffffffffffffff [unknown] ([unknown])
...
```

Now I'm trying to figure out why one __hypot_finite sample works but the other 
one breaks for no apparent reason. One thing I noticed so far is the following 
discrepancy in memory accesses, i.e. for the working sample we get:

```
unwind: reg 16, val 7faf7dca2696
unwind: reg 7, val 7ffc80811ca0
unwind: access_mem addr 0x7ffc80811ce8 val 7faf7dc88af9, offset 72
...
```

But the broken sample only produces:

```
unwind: reg 16, val 7faf7dca2688
unwind: reg 7, val 7ffc80811ca0
unwind: find_proc_info dso /usr/lib/libm-2.28.so
unwind: access_mem addr 0x7ffc80811cc0 val 0, offset 32
```

The user stack of the working sample starts with (in hex):

0000000000000000000000009c830441000000000000000000000000000000000000000000000000254a980000000000301d8180fc7f0000c0c10f2b3856000030

The user stack of the broken sample starts with (in hex):

0000000000000000000000002f64fd4000000000000000000000000000000000000000000000000073bb970000000000301d8180fc7f0000c0c10f2b3856000030

This is a 64bit machine, so word width is 8. So both user stacks start with 
`24 * 2 = 48` words of zeros. Hence offset 32 cannot possibly produce a 
meaningful value...

This offset is calculcated from LIBUNWIND__ARCH_REG_SP, cf. unwind-libunwind-
local.c. So is the stack pointer address in the register wrong? If I hackishly 
offset the value for the stack pointer by 40, i.e.:

```
diff --git a/tools/perf/util/unwind-libunwind-local.c b/tools/perf/util/
unwind-libunwind-local.c
index 79f521a552cf..a766ddaaa4dd 100644
--- a/tools/perf/util/unwind-libunwind-local.c
+++ b/tools/perf/util/unwind-libunwind-local.c
@@ -502,6 +502,7 @@ static int access_mem(unw_addr_space_t __maybe_unused as,
 	if (ret)
 		return ret;
 
+	start -= 40;
 	end = start + stack->size;
 
 	/* Check overflow. */
```

Then I can successfully unwind the broken sample:

```
$ perf script -v --no-inline --time 90229.127156,90229.127158 --ns
...
unwind: reg 16, val 7faf7dca2688
unwind: reg 7, val 7ffc80811ca0
unwind: find_proc_info dso /usr/lib/libm-2.28.so
unwind: access_mem addr 0x7ffc80811cc0 val 7faf7dc88af9, offset 72
unwind: find_proc_info dso /usr/lib/libm-2.28.so
unwind: access_mem addr 0x7ffc80811ce0 val 56382b0fc129, offset 104
unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/
test-clients/cpp-inlining/cpp-inlining
unwind: access_mem addr 0x7ffc80811d30 val 7faf7dabf223, offset 184
unwind: find_proc_info dso /usr/lib/libc-2.28.so
unwind: access_mem addr 0x7ffc80811df0 val 56382b0fc1ee, offset 376
unwind: find_proc_info dso /home/milian/projects/kdab/rnd/hotspot/build/tests/
test-clients/cpp-inlining/cpp-inlining
unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688)
unwind: hypotf32x:ip = 0x7faf7dc88af8 (0xfaf8)
unwind: main:ip = 0x56382b0fc128 (0x1128)
unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
cpp-inlining 24617 90229.127157818:     719976 cycles:uppp: 
	    7faf7dca2688 __hypot_finite+0x28 (/usr/lib/libm-2.28.so)
	    7faf7dc88af8 hypotf32x+0x18 (/usr/lib/libm-2.28.so)
	    56382b0fc128 main+0x88 (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)
	    7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
	    56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)

```

So, what now? Here are my open questions:

Is this just working now by chance, or is this the real reason? I.e. is the 
register value for the stack pointer incorrectly recorded?

Can this be fixed somehow during record time?

Can we detect this scenario at analysis time and correct the stack pointer 
address automatically somehow? Should the first frame always try to read from 
offset 72 maybe?

Any help would be greatly appreciated, many thanks

-- 
Milian Wolff
mail@milianw.de
http://milianw.de

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Broken dwarf unwinding - wrong stack pointer register value?
  2018-10-20 22:39 Broken dwarf unwinding - wrong stack pointer register value? Milian Wolff
@ 2018-10-21 20:32 ` Milian Wolff
  2018-10-22 10:35 ` Milian Wolff
  1 sibling, 0 replies; 33+ messages in thread
From: Milian Wolff @ 2018-10-21 20:32 UTC (permalink / raw)
  To: Milian Wolff
  Cc: linux-kernel, Jiri Olsa, namhyung, linux-perf-users, Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 2408 bytes --]

On Sonntag, 21. Oktober 2018 00:39:51 CEST Milian Wolff wrote:
> Hey all,
> 
> I'm on the quest to figure out why perf regularly fails to unwind (some)
> samples. I am seeing very strange behavior, where an apparently wrong stack
> pointer value is read from the register - see below for more information and
> the end of this (long) mail for my open questions. Any help would be
> greatly appreciated.
> 
> I am currently using this trivial C++ code to reproduce the issue:
> 
> ```
> #include <cmath>
> #include <complex>
> #include <iostream>
> #include <random>
> 
> using namespace std;
> 
> int main()
> {
>     uniform_real_distribution<double> uniform(-1E5, 1E5);
>     default_random_engine engine;
>     double s = 0;
>     for (int i = 0; i < 10000000; ++i) {
>         s += norm(complex<double>(uniform(engine), uniform(engine)));
>     }
>     cout << s << '\n';
>     return 0;
> }
> ```
> 
> I compile it with `g++ -O2 -g` and then record it with `perf record --call-
> graph dwarf`. Using perf script, I then see e.g.:

With my patch to regularly flush the perf script output buffer, we can now 
easily find all broken backtraces and the corresponding debug output via:

    $ perf script --ns -v |& awk -v RS='' '/\[unknown\]/ {print "\n"$0}'

I've pasted the output to the above command from my machine here:
https://paste.kde.org/pmyxwkk1k

This contains 139 samples with broken unwinding, out of 2350 samples in total, 
so about 6% of all samples are broken.

In many cases, the first accessed memory is 0 because a too-low offset into 
the stack is computed from the SP value, similar to the scenario I described 
in my initial mail. In other cases we read garbadge addresses such as 

unwind: access_mem addr 0x7ffc80811cf0 val 408195dfbda90580, offset 24

In all cases except for the the two samples at the very start and end of this 
log, the last offset encountered in access_mem is lower than 72. Remember what 
I wrote in the initial mail - if I manually hack the access_mem function to 
use 72 for one of the broken samples, it made unwinding magically work 
again...

With this addition of data - can anyone sched some light on what's potentially 
going on here? How can we improve this situation?

Thanks
-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Broken dwarf unwinding - wrong stack pointer register value?
  2018-10-20 22:39 Broken dwarf unwinding - wrong stack pointer register value? Milian Wolff
  2018-10-21 20:32 ` Milian Wolff
@ 2018-10-22 10:35 ` Milian Wolff
  2018-10-22 11:17   ` Milian Wolff
  1 sibling, 1 reply; 33+ messages in thread
From: Milian Wolff @ 2018-10-22 10:35 UTC (permalink / raw)
  To: Milian Wolff
  Cc: linux-kernel, Jiri Olsa, namhyung, linux-perf-users, Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 8460 bytes --]

On Sonntag, 21. Oktober 2018 00:39:51 CEST Milian Wolff wrote:
> Hey all,
> 
> I'm on the quest to figure out why perf regularly fails to unwind (some)
> samples. I am seeing very strange behavior, where an apparently wrong stack
> pointer value is read from the register - see below for more information and
> the end of this (long) mail for my open questions. Any help would be
> greatly appreciated.
> 
> I am currently using this trivial C++ code to reproduce the issue:
> 
> ```
> #include <cmath>
> #include <complex>
> #include <iostream>
> #include <random>
> 
> using namespace std;
> 
> int main()
> {
>     uniform_real_distribution<double> uniform(-1E5, 1E5);
>     default_random_engine engine;
>     double s = 0;
>     for (int i = 0; i < 10000000; ++i) {
>         s += norm(complex<double>(uniform(engine), uniform(engine)));
>     }
>     cout << s << '\n';
>     return 0;
> }
> ```
> 
> I compile it with `g++ -O2 -g` and then record it with `perf record --call-
> graph dwarf`. Using perf script, I then see e.g.:
> 
> ```
> $ perf script -v --no-inline --time 90229.12668,90229.127158 --ns
> ...
> # first frame (working unwinding from __hypot_finite):
> unwind: reg 16, val 7faf7dca2696
> unwind: reg 7, val 7ffc80811ca0
> unwind: find_proc_info dso /usr/lib/libm-2.28.so
> unwind: access_mem addr 0x7ffc80811ce8 val 7faf7dc88af9, offset 72
> unwind: find_proc_info dso /usr/lib/libm-2.28.so
> unwind: access_mem addr 0x7ffc80811d08 val 56382b0fc129, offset 104
> unwind: find_proc_info dso
> /home/milian/projects/kdab/rnd/hotspot/build/tests/
> test-clients/cpp-inlining/cpp-inlining
> unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 184
> unwind: find_proc_info dso /usr/lib/libc-2.28.so
> unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 376
> unwind: find_proc_info dso
> /home/milian/projects/kdab/rnd/hotspot/build/tests/
> test-clients/cpp-inlining/cpp-inlining
> unwind: __hypot_finite:ip = 0x7faf7dca2696 (0x29696)
> unwind: hypotf32x:ip = 0x7faf7dc88af8 (0xfaf8)
> unwind: main:ip = 0x56382b0fc128 (0x1128)
> unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
> unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
> # second frame (unrelated)
> unwind: reg 16, val 56382b0fc114
> unwind: reg 7, val 7ffc80811d10
> unwind: find_proc_info dso
> /home/milian/projects/kdab/rnd/hotspot/build/tests/
> test-clients/cpp-inlining/cpp-inlining
> unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 72
> unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 264
> unwind: main:ip = 0x56382b0fc114 (0x1114)
> unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
> unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
> # third frame (broken unwinding from __hypot_finite)
> unwind: reg 16, val 7faf7dca2688
> unwind: reg 7, val 7ffc80811ca0
> unwind: find_proc_info dso /usr/lib/libm-2.28.so
> unwind: access_mem addr 0x7ffc80811cc0 val 0, offset 32
> unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688)
> unwind: '':ip = 0xffffffffffffffff (0x0)
> cpp-inlining 24617 90229.126685606:     711026 cycles:uppp:
> 	    7faf7dca2696 __hypot_finite+0x36 (/usr/lib/libm-2.28.so)
> 	    7faf7dc88af8 hypotf32x+0x18 (/usr/lib/libm-2.28.so)
> 	    56382b0fc128 main+0x88 (/home/milian/projects/kdab/rnd/hotspot/
> build/tests/test-clients/cpp-inlining/cpp-inlining)
> 	    7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
> 	    56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
> build/tests/test-clients/cpp-inlining/cpp-inlining)
> 
> cpp-inlining 24617 90229.126921551:     714657 cycles:uppp:
> 	    56382b0fc114 main+0x74 (/home/milian/projects/kdab/rnd/hotspot/
> build/tests/test-clients/cpp-inlining/cpp-inlining)
> 	    7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
> 	    56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
> build/tests/test-clients/cpp-inlining/cpp-inlining)
> 
> cpp-inlining 24617 90229.127157818:     719976 cycles:uppp:
> 	    7faf7dca2688 __hypot_finite+0x28 (/usr/lib/libm-2.28.so)
> 	ffffffffffffffff [unknown] ([unknown])
> ...
> ```
> 
> Now I'm trying to figure out why one __hypot_finite sample works but the
> other one breaks for no apparent reason.

I've now collected some more background information, which is quite helpful I 
believe for the analysis of this issue:

Note how the broken sample has the IP pointing at __hypot_finite+0x28:

unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688)

When we run my reproducer code in GDB, we can see that obtaining a backtrace 
from that address works just fine there:

```
$ gdb ./cpp-inlining
GNU gdb (GDB) 8.2
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./cpp-inlining...done.
(gdb) break main
Breakpoint 1 at 0x10a0: file /home/milian/projects/kdab/rnd/hotspot/tests/
test-clients/cpp-inlining/main.cpp, line 34.
(gdb) run
Starting program: /home/milian/projects/kdab/rnd/hotspot/build/tests/test-
clients/cpp-inlining/cpp-inlining 

Breakpoint 1, main () at /home/milian/projects/kdab/rnd/hotspot/tests/test-
clients/cpp-inlining/main.cpp:34
34      {
(gdb) disassemble __hypot_finite 
Dump of assembler code for function __hypot_finite:
   0x00007ffff7c96660 <+0>:     endbr64 
   0x00007ffff7c96664 <+4>:     push   %r13
   0x00007ffff7c96666 <+6>:     push   %r12
   0x00007ffff7c96668 <+8>:     push   %rbp
   0x00007ffff7c96669 <+9>:     movq   %xmm1,%rbp
   0x00007ffff7c9666e <+14>:    push   %rbx
   0x00007ffff7c9666f <+15>:    movq   %xmm0,%rbx
   0x00007ffff7c96674 <+20>:    shr    $0x20,%rbp
   0x00007ffff7c96678 <+24>:    shr    $0x20,%rbx
   0x00007ffff7c9667c <+28>:    and    $0x7fffffff,%ebp
   0x00007ffff7c96682 <+34>:    and    $0x7fffffff,%ebx
   0x00007ffff7c96688 <+40>:    sub    $0x28,%rsp
   0x00007ffff7c9668c <+44>:    cmp    %ebp,%ebx
   0x00007ffff7c9668e <+46>:    jge    0x7ffff7c966a2 <__hypot_finite+66>
   0x00007ffff7c96690 <+48>:    mov    %ebx,%eax
   0x00007ffff7c96692 <+50>:    movapd %xmm0,%xmm2
   0x00007ffff7c96696 <+54>:    movapd %xmm1,%xmm0
   0x00007ffff7c9669a <+58>:    mov    %ebp,%ebx
   0x00007ffff7c9669c <+60>:    movapd %xmm2,%xmm1
   0x00007ffff7c966a0 <+64>:    mov    %eax,%ebp
   0x00007ffff7c966a2 <+66>:    movq   %xmm0,%r13
   0x00007ffff7c966a7 <+71>:    mov    %rbx,%rax
   0x00007ffff7c966aa <+74>:    movq   %xmm1,%r12
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb) break *0x00007ffff7c96688
Breakpoint 2 at 0x7ffff7c96688
(gdb) cont
Continuing.

Breakpoint 2, 0x00007ffff7c96688 in __hypot_finite () from /usr/lib/libm.so.6
(gdb) bt
#0  0x00007ffff7c96688 in __hypot_finite () from /usr/lib/libm.so.6
#1  0x00007ffff7c7caf9 in hypotf64 () from /usr/lib/libm.so.6
#2  0x0000555555555129 in std::__complex_abs (__z=<optimized out>) at /usr/
include/c++/8.2.1/complex:1362
#3  std::abs<double> (__z=...) at /usr/include/c++/8.2.1/complex:618
#4  std::_Norm_helper<true>::_S_do_it<double> (__z=...) at /usr/include/c++/
8.2.1/complex:675
#5  std::norm<double> (__z=...) at /usr/include/c++/8.2.1/complex:685
#6  main () at /home/milian/projects/kdab/rnd/hotspot/tests/test-clients/cpp-
inlining/main.cpp:39
```

But note how __hypot_finite+0x28 actually points at an instruction that 
modifies the stack pointer by 40, just like I did manually to "fix" unwinding 
for this sample (see my initial email):

   0x00007ffff7c96688 <+40>:    sub    $0x28,%rsp

So we are still in the function prologue, and can't properly unwind from there 
apparently! Any suggestions on how this should be handled? Can we figure out 
the size of the function prologue somehow and manually fix the stack pointer 
then?

Cheers
-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Broken dwarf unwinding - wrong stack pointer register value?
  2018-10-22 10:35 ` Milian Wolff
@ 2018-10-22 11:17   ` Milian Wolff
  2018-10-22 13:58     ` Andi Kleen
  0 siblings, 1 reply; 33+ messages in thread
From: Milian Wolff @ 2018-10-22 11:17 UTC (permalink / raw)
  To: Milian Wolff
  Cc: Milian Wolff, linux-kernel, Jiri Olsa, namhyung,
	linux-perf-users, Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 9387 bytes --]

On Montag, 22. Oktober 2018 12:35:39 CEST Milian Wolff wrote:
> On Sonntag, 21. Oktober 2018 00:39:51 CEST Milian Wolff wrote:
> > Hey all,
> > 
> > I'm on the quest to figure out why perf regularly fails to unwind (some)
> > samples. I am seeing very strange behavior, where an apparently wrong
> > stack
> > pointer value is read from the register - see below for more information
> > and the end of this (long) mail for my open questions. Any help would be
> > greatly appreciated.
> > 
> > I am currently using this trivial C++ code to reproduce the issue:
> > 
> > ```
> > #include <cmath>
> > #include <complex>
> > #include <iostream>
> > #include <random>
> > 
> > using namespace std;
> > 
> > int main()
> > {
> > 
> >     uniform_real_distribution<double> uniform(-1E5, 1E5);
> >     default_random_engine engine;
> >     double s = 0;
> >     for (int i = 0; i < 10000000; ++i) {
> >     
> >         s += norm(complex<double>(uniform(engine), uniform(engine)));
> >     
> >     }
> >     cout << s << '\n';
> >     return 0;
> > 
> > }
> > ```
> > 
> > I compile it with `g++ -O2 -g` and then record it with `perf record
> > --call-
> > graph dwarf`. Using perf script, I then see e.g.:
> > 
> > ```
> > $ perf script -v --no-inline --time 90229.12668,90229.127158 --ns
> > ...
> > # first frame (working unwinding from __hypot_finite):
> > unwind: reg 16, val 7faf7dca2696
> > unwind: reg 7, val 7ffc80811ca0
> > unwind: find_proc_info dso /usr/lib/libm-2.28.so
> > unwind: access_mem addr 0x7ffc80811ce8 val 7faf7dc88af9, offset 72
> > unwind: find_proc_info dso /usr/lib/libm-2.28.so
> > unwind: access_mem addr 0x7ffc80811d08 val 56382b0fc129, offset 104
> > unwind: find_proc_info dso
> > /home/milian/projects/kdab/rnd/hotspot/build/tests/
> > test-clients/cpp-inlining/cpp-inlining
> > unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 184
> > unwind: find_proc_info dso /usr/lib/libc-2.28.so
> > unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 376
> > unwind: find_proc_info dso
> > /home/milian/projects/kdab/rnd/hotspot/build/tests/
> > test-clients/cpp-inlining/cpp-inlining
> > unwind: __hypot_finite:ip = 0x7faf7dca2696 (0x29696)
> > unwind: hypotf32x:ip = 0x7faf7dc88af8 (0xfaf8)
> > unwind: main:ip = 0x56382b0fc128 (0x1128)
> > unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
> > unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
> > # second frame (unrelated)
> > unwind: reg 16, val 56382b0fc114
> > unwind: reg 7, val 7ffc80811d10
> > unwind: find_proc_info dso
> > /home/milian/projects/kdab/rnd/hotspot/build/tests/
> > test-clients/cpp-inlining/cpp-inlining
> > unwind: access_mem addr 0x7ffc80811d58 val 7faf7dabf223, offset 72
> > unwind: access_mem addr 0x7ffc80811e18 val 56382b0fc1ee, offset 264
> > unwind: main:ip = 0x56382b0fc114 (0x1114)
> > unwind: __libc_start_main:ip = 0x7faf7dabf222 (0x24222)
> > unwind: _start:ip = 0x56382b0fc1ed (0x11ed)
> > # third frame (broken unwinding from __hypot_finite)
> > unwind: reg 16, val 7faf7dca2688
> > unwind: reg 7, val 7ffc80811ca0
> > unwind: find_proc_info dso /usr/lib/libm-2.28.so
> > unwind: access_mem addr 0x7ffc80811cc0 val 0, offset 32
> > unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688)
> > unwind: '':ip = 0xffffffffffffffff (0x0)
> > 
> > cpp-inlining 24617 90229.126685606:     711026 cycles:uppp:
> > 	    7faf7dca2696 __hypot_finite+0x36 (/usr/lib/libm-2.28.so)
> > 	    7faf7dc88af8 hypotf32x+0x18 (/usr/lib/libm-2.28.so)
> > 	    56382b0fc128 main+0x88 (/home/milian/projects/kdab/rnd/hotspot/
> > 
> > build/tests/test-clients/cpp-inlining/cpp-inlining)
> > 
> > 	    7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
> > 	    56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
> > 
> > build/tests/test-clients/cpp-inlining/cpp-inlining)
> > 
> > cpp-inlining 24617 90229.126921551:     714657 cycles:uppp:
> > 	    56382b0fc114 main+0x74 (/home/milian/projects/kdab/rnd/hotspot/
> > 
> > build/tests/test-clients/cpp-inlining/cpp-inlining)
> > 
> > 	    7faf7dabf222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
> > 	    56382b0fc1ed _start+0x2d (/home/milian/projects/kdab/rnd/hotspot/
> > 
> > build/tests/test-clients/cpp-inlining/cpp-inlining)
> > 
> > cpp-inlining 24617 90229.127157818:     719976 cycles:uppp:
> > 	    7faf7dca2688 __hypot_finite+0x28 (/usr/lib/libm-2.28.so)
> > 	
> > 	ffffffffffffffff [unknown] ([unknown])
> > 
> > ...
> > ```
> > 
> > Now I'm trying to figure out why one __hypot_finite sample works but the
> > other one breaks for no apparent reason.
> 
> I've now collected some more background information, which is quite helpful
> I believe for the analysis of this issue:
> 
> Note how the broken sample has the IP pointing at __hypot_finite+0x28:
> 
> unwind: __hypot_finite:ip = 0x7faf7dca2688 (0x29688)
> 
> When we run my reproducer code in GDB, we can see that obtaining a backtrace
> from that address works just fine there:
> 
> ```
> $ gdb ./cpp-inlining
> GNU gdb (GDB) 8.2
> Copyright (C) 2018 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://gnu.org/licenses/gpl.html> This is free software: you are free to
> change and redistribute it. There is NO WARRANTY, to the extent permitted
> by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-pc-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
>     <http://www.gnu.org/software/gdb/documentation/>.
> 
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from ./cpp-inlining...done.
> (gdb) break main
> Breakpoint 1 at 0x10a0: file /home/milian/projects/kdab/rnd/hotspot/tests/
> test-clients/cpp-inlining/main.cpp, line 34.
> (gdb) run
> Starting program: /home/milian/projects/kdab/rnd/hotspot/build/tests/test-
> clients/cpp-inlining/cpp-inlining
> 
> Breakpoint 1, main () at /home/milian/projects/kdab/rnd/hotspot/tests/test-
> clients/cpp-inlining/main.cpp:34
> 34      {
> (gdb) disassemble __hypot_finite
> Dump of assembler code for function __hypot_finite:
>    0x00007ffff7c96660 <+0>:     endbr64
>    0x00007ffff7c96664 <+4>:     push   %r13
>    0x00007ffff7c96666 <+6>:     push   %r12
>    0x00007ffff7c96668 <+8>:     push   %rbp
>    0x00007ffff7c96669 <+9>:     movq   %xmm1,%rbp
>    0x00007ffff7c9666e <+14>:    push   %rbx
>    0x00007ffff7c9666f <+15>:    movq   %xmm0,%rbx
>    0x00007ffff7c96674 <+20>:    shr    $0x20,%rbp
>    0x00007ffff7c96678 <+24>:    shr    $0x20,%rbx
>    0x00007ffff7c9667c <+28>:    and    $0x7fffffff,%ebp
>    0x00007ffff7c96682 <+34>:    and    $0x7fffffff,%ebx
>    0x00007ffff7c96688 <+40>:    sub    $0x28,%rsp
>    0x00007ffff7c9668c <+44>:    cmp    %ebp,%ebx
>    0x00007ffff7c9668e <+46>:    jge    0x7ffff7c966a2 <__hypot_finite+66>
>    0x00007ffff7c96690 <+48>:    mov    %ebx,%eax
>    0x00007ffff7c96692 <+50>:    movapd %xmm0,%xmm2
>    0x00007ffff7c96696 <+54>:    movapd %xmm1,%xmm0
>    0x00007ffff7c9669a <+58>:    mov    %ebp,%ebx
>    0x00007ffff7c9669c <+60>:    movapd %xmm2,%xmm1
>    0x00007ffff7c966a0 <+64>:    mov    %eax,%ebp
>    0x00007ffff7c966a2 <+66>:    movq   %xmm0,%r13
>    0x00007ffff7c966a7 <+71>:    mov    %rbx,%rax
>    0x00007ffff7c966aa <+74>:    movq   %xmm1,%r12
> --Type <RET> for more, q to quit, c to continue without paging--q
> Quit
> (gdb) break *0x00007ffff7c96688
> Breakpoint 2 at 0x7ffff7c96688
> (gdb) cont
> Continuing.
> 
> Breakpoint 2, 0x00007ffff7c96688 in __hypot_finite () from
> /usr/lib/libm.so.6 (gdb) bt
> #0  0x00007ffff7c96688 in __hypot_finite () from /usr/lib/libm.so.6
> #1  0x00007ffff7c7caf9 in hypotf64 () from /usr/lib/libm.so.6
> #2  0x0000555555555129 in std::__complex_abs (__z=<optimized out>) at /usr/
> include/c++/8.2.1/complex:1362
> #3  std::abs<double> (__z=...) at /usr/include/c++/8.2.1/complex:618
> #4  std::_Norm_helper<true>::_S_do_it<double> (__z=...) at /usr/include/c++/
> 8.2.1/complex:675
> #5  std::norm<double> (__z=...) at /usr/include/c++/8.2.1/complex:685
> #6  main () at
> /home/milian/projects/kdab/rnd/hotspot/tests/test-clients/cpp-
> inlining/main.cpp:39
> ```
> 
> But note how __hypot_finite+0x28 actually points at an instruction that
> modifies the stack pointer by 40, just like I did manually to "fix"
> unwinding for this sample (see my initial email):
> 
>    0x00007ffff7c96688 <+40>:    sub    $0x28,%rsp
> 
> So we are still in the function prologue, and can't properly unwind from
> there apparently! Any suggestions on how this should be handled? Can we
> figure out the size of the function prologue somehow and manually fix the
> stack pointer then?

After more digging, it turns out that I've apparently chased a red herring. 
I'm running archlinux which isn't shipping debug symbols for libm. Without 
those, we are apparently out of luck to unwind from the function prologue.

I still wonder whether we could catch this somehow and handle it ourselves. I 
wonder what GDB does to unwind here, despite the lack of debug information in 
libm...

Cheers

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Broken dwarf unwinding - wrong stack pointer register value?
  2018-10-22 11:17   ` Milian Wolff
@ 2018-10-22 13:58     ` Andi Kleen
  2018-10-22 19:26       ` Milian Wolff
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2018-10-22 13:58 UTC (permalink / raw)
  To: Milian Wolff
  Cc: Milian Wolff, linux-kernel, Jiri Olsa, namhyung,
	linux-perf-users, Arnaldo Carvalho

Milian Wolff <milian.wolff@kdab.com> writes:
>
> After more digging, it turns out that I've apparently chased a red herring. 
> I'm running archlinux which isn't shipping debug symbols for libm.

64bit executables normally have unwind information even when stripped.
Unless someone forcefully stripped those too.

You can checkout with objdump --sections.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Broken dwarf unwinding - wrong stack pointer register value?
  2018-10-22 13:58     ` Andi Kleen
@ 2018-10-22 19:26       ` Milian Wolff
  2018-10-23  4:03         ` Andi Kleen
  0 siblings, 1 reply; 33+ messages in thread
From: Milian Wolff @ 2018-10-22 19:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Jiri Olsa, namhyung, linux-perf-users, Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 6193 bytes --]

On Montag, 22. Oktober 2018 15:58:17 CEST Andi Kleen wrote:
> Milian Wolff <milian.wolff@kdab.com> writes:
> > After more digging, it turns out that I've apparently chased a red
> > herring.
> > I'm running archlinux which isn't shipping debug symbols for libm.
> 
> 64bit executables normally have unwind information even when stripped.
> Unless someone forcefully stripped those too.
> 
> You can checkout with objdump --sections.

Right, we do have .eh_frame and .eh_frame_hdr according to readelf:

```
$ readelf  --sections /usr/lib/libm.so.6        
There are 26 section headers, starting at offset 0x183120:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .note.gnu.build-i NOTE             0000000000000270  00000270
       0000000000000024  0000000000000000   A       0     0     4
  [ 2] .note.ABI-tag     NOTE             0000000000000294  00000294
       0000000000000020  0000000000000000   A       0     0     4
  [ 3] .note.gnu.propert NOTE             00000000000002b8  000002b8
       0000000000000020  0000000000000000   A       0     0     8
  [ 4] .gnu.hash         GNU_HASH         00000000000002d8  000002d8
       00000000000024d0  0000000000000000   A       5     0     8
  [ 5] .dynsym           DYNSYM           00000000000027a8  000027a8
       00000000000066c0  0000000000000018   A       6     1     8
  [ 6] .dynstr           STRTAB           0000000000008e68  00008e68
       0000000000002352  0000000000000000   A       0     0     1
  [ 7] .gnu.version      VERSYM           000000000000b1ba  0000b1ba
       0000000000000890  0000000000000002   A       5     0     2
  [ 8] .gnu.version_d    VERDEF           000000000000ba50  0000ba50
       000000000000017c  0000000000000000   A       6    11     8
  [ 9] .gnu.version_r    VERNEED          000000000000bbd0  0000bbd0
       0000000000000060  0000000000000000   A       6     2     8
  [10] .rela.dyn         RELA             000000000000bc30  0000bc30
       0000000000000480  0000000000000018   A       5     0     8
  [11] .init             PROGBITS         000000000000d000  0000d000
       000000000000001b  0000000000000000  AX       0     0     4
  [12] .text             PROGBITS         000000000000d020  0000d020
       00000000000a063b  0000000000000000  AX       0     0     16
  [13] .fini             PROGBITS         00000000000ad65c  000ad65c
       000000000000000d  0000000000000000  AX       0     0     4
  [14] .rodata           PROGBITS         00000000000ae000  000ae000
       00000000000c76a4  0000000000000000   A       0     0     32
  [15] .eh_frame_hdr     PROGBITS         00000000001756a4  001756a4
       0000000000001c34  0000000000000000   A       0     0     4
  [16] .eh_frame         PROGBITS         00000000001772d8  001772d8
       00000000000093f0  0000000000000000   A       0     0     8
  [17] .hash             HASH             00000000001806c8  001806c8
       000000000000210c  0000000000000004   A       5     0     8
  [18] .init_array       INIT_ARRAY       0000000000183c80  00182c80
       0000000000000008  0000000000000008  WA       0     0     8
  [19] .fini_array       FINI_ARRAY       0000000000183c88  00182c88
       0000000000000008  0000000000000008  WA       0     0     8
  [20] .dynamic          DYNAMIC          0000000000183c90  00182c90
       00000000000001f0  0000000000000010  WA       6     0     8
  [21] .got              PROGBITS         0000000000183e80  00182e80
       0000000000000180  0000000000000008  WA       0     0     8
  [22] .data             PROGBITS         0000000000184000  00183000
       000000000000000c  0000000000000000  WA       0     0     8
  [23] .bss              NOBITS           000000000018400c  0018300c
       000000000000000c  0000000000000000  WA       0     0     4
  [24] .comment          PROGBITS         0000000000000000  0018300c
       000000000000001a  0000000000000001  MS       0     0     1
  [25] .shstrtab         STRTAB           0000000000000000  00183026
       00000000000000fa  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)
```

But should that be enough information to be able to unwind from a function 
prologue? I mean, it obviously seems to work when we unwind from the function 
body. But how would I know whether it should work from the prologue too?

Reading e.g. https://www.airs.com/blog/archives/460, I can find:

> There should be exactly one FDE covering each instruction which may be being 
executed when an exception occurs. By default an exception can only occur 
during a function call or a throw. When using the -fnon-call-exceptions gcc 
option, an exception can also occur on most memory references and floating 
point operations. When using -fasynchronous-unwind-tables, the FDE will cover 
every instruction, to permit unwinding from a signal handler.

So what if my libm wasn't compiled with -fasynchronous-unwind-tables? We 
probably cannot throw an exception in the function prologue, so potentially 
that range is simply not mapped? But this is just a shot in the dark, I have 
no clue how to get more information about what contents are actually stored in 
the .eh_frame section. I would love to find out though! Does anyone know a 
tool to sched some light into this section?

I found http://www.bitlackeys.org/#eh_frame which at least shows me that 
__hypot_finite is mentioned in the .eh_frame section:

$ nm -aD /usr/lib/libm.so.6 |& grep hypot_finite
0000000000029660 T __hypot_finite
$ ./eh_frame /usr/lib/libm.so.6 |& grep 29660
Function size: 878 Function Addr: 29660

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Broken dwarf unwinding - wrong stack pointer register value?
  2018-10-22 19:26       ` Milian Wolff
@ 2018-10-23  4:03         ` Andi Kleen
  2018-10-23 10:34           ` Milian Wolff
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2018-10-23  4:03 UTC (permalink / raw)
  To: Milian Wolff
  Cc: linux-kernel, Jiri Olsa, namhyung, linux-perf-users, Arnaldo Carvalho

> So what if my libm wasn't compiled with -fasynchronous-unwind-tables? We 

It's default (64bit since always and 32bit now too) Unless someone disabled it.

However libm might be partially written in assembler and hand written assembler
often has problems with unwind tables because the programmer has to get them
correct explicitely.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Broken dwarf unwinding - wrong stack pointer register value?
  2018-10-23  4:03         ` Andi Kleen
@ 2018-10-23 10:34           ` Milian Wolff
  2018-10-24 14:48             ` Andi Kleen
  0 siblings, 1 reply; 33+ messages in thread
From: Milian Wolff @ 2018-10-23 10:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Jiri Olsa, namhyung, linux-perf-users, Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 3681 bytes --]

On Dienstag, 23. Oktober 2018 06:03:56 CEST Andi Kleen wrote:
> > So what if my libm wasn't compiled with -fasynchronous-unwind-tables? We
> 
> It's default (64bit since always and 32bit now too) Unless someone disabled
> it.

Excellent, good to know. Since [1] doesn't explicitly disable it, I would 
assume the information should be available.

[1]: https://git.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?
h=packages/glibc

> However libm might be partially written in assembler and hand written
> assembler often has problems with unwind tables because the programmer has
> to get them correct explicitely.

Yes, that could be the case. I'm unsure about the glibc build system and what 
actually gets compiled, but I found a potential source at [2]:

[2]: https://github.com/bminor/glibc/blob/
43b1048ab9418e902aac8c834a7a9a88c501620a/sysdeps/ieee754/dbl-64/e_hypot.c

I believe this is what is used on my system, since I can spot calls to 
__issignaling@@GLIBC_2.18 etc. in the disassembly of __hypot_finite ([3]), 
which matches the sources referenced in [2].

[3]: https://paste.kde.org/poywa7y2z

If [2] is used, then it's not hand written assembler but code compiled by the 
compiler. So unwinding should work, even from the prologue? 

I have since also figured out how to dump the .eh_frame contents in a human 
readable format via readelf. Remember, __hypot_finite on my system is at 
offset 0x29660 of libm, so I think the following are the corresponding 
.eh_frame contents:

```
$ readelf --debug-dump=frames /usr/lib/libm.so.6 |& less
...
00002b60 000000000000004c 00002b64 FDE cie=00000000 
pc=0000000000029660..00000000000299ce
  DW_CFA_advance_loc: 6 to 0000000000029666
  DW_CFA_def_cfa_offset: 16
  DW_CFA_offset: r13 (r13) at cfa-16
  DW_CFA_advance_loc: 2 to 0000000000029668
  DW_CFA_def_cfa_offset: 24
  DW_CFA_offset: r12 (r12) at cfa-24
  DW_CFA_advance_loc: 1 to 0000000000029669
  DW_CFA_def_cfa_offset: 32
  DW_CFA_offset: r6 (rbp) at cfa-32
  DW_CFA_advance_loc: 6 to 000000000002966f
  DW_CFA_def_cfa_offset: 40
  DW_CFA_offset: r3 (rbx) at cfa-40
  DW_CFA_advance_loc: 29 to 000000000002968c
  DW_CFA_def_cfa_offset: 80
  DW_CFA_advance_loc2: 291 to 00000000000297af
  DW_CFA_remember_state
  DW_CFA_def_cfa_offset: 40
  DW_CFA_advance_loc: 5 to 00000000000297b4
  DW_CFA_def_cfa_offset: 32
  DW_CFA_advance_loc: 1 to 00000000000297b5
  DW_CFA_def_cfa_offset: 24
  DW_CFA_advance_loc: 2 to 00000000000297b7
  DW_CFA_def_cfa_offset: 16
  DW_CFA_advance_loc: 2 to 00000000000297b9
  DW_CFA_def_cfa_offset: 8
  DW_CFA_advance_loc: 7 to 00000000000297c0
  DW_CFA_restore_state
  DW_CFA_advance_loc1: 88 to 0000000000029818
  DW_CFA_remember_state
  DW_CFA_def_cfa_offset: 40
  DW_CFA_advance_loc: 1 to 0000000000029819
  DW_CFA_def_cfa_offset: 32
  DW_CFA_advance_loc: 1 to 000000000002981a
  DW_CFA_def_cfa_offset: 24
  DW_CFA_advance_loc: 2 to 000000000002981c
  DW_CFA_def_cfa_offset: 16
  DW_CFA_advance_loc: 2 to 000000000002981e
  DW_CFA_def_cfa_offset: 8
  DW_CFA_advance_loc: 18 to 0000000000029830
  DW_CFA_restore_state
  DW_CFA_nop
```

I notice that this does not touch the rsp register at all, even though it's 
mutated by the code, leading to the issue. See again this paste for the 
disassembly at [3], and note that the broken sample frame points at 

0x0000000000029688 <+40>:    sub    $0x28,%rsp

Can someone at least confirm whether unwinding from a function prologue via 
.eh_frame (but without .debug_frame) should actually be possible?

Thanks
-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Broken dwarf unwinding - wrong stack pointer register value?
  2018-10-23 10:34           ` Milian Wolff
@ 2018-10-24 14:48             ` Andi Kleen
  2018-10-30 22:34               ` Milian Wolff
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2018-10-24 14:48 UTC (permalink / raw)
  To: Milian Wolff
  Cc: linux-kernel, Jiri Olsa, namhyung, linux-perf-users, Arnaldo Carvalho

> 
> Can someone at least confirm whether unwinding from a function prologue via 
> .eh_frame (but without .debug_frame) should actually be possible?

Yes it should be possible. Asynchronous unwind tables should work
from any instruction.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Broken dwarf unwinding - wrong stack pointer register value?
  2018-10-24 14:48             ` Andi Kleen
@ 2018-10-30 22:34               ` Milian Wolff
  2018-11-01 22:08                 ` PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?] Milian Wolff
  0 siblings, 1 reply; 33+ messages in thread
From: Milian Wolff @ 2018-10-30 22:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Jiri Olsa, namhyung, linux-perf-users, Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 7491 bytes --]

On Mittwoch, 24. Oktober 2018 16:48:18 CET Andi Kleen wrote:
> > Can someone at least confirm whether unwinding from a function prologue
> > via
> > .eh_frame (but without .debug_frame) should actually be possible?
> 
> Yes it should be possible. Asynchronous unwind tables should work
> from any instruction.

I'm still trying to figure out what's going on here... I have remembered, that 
libunwind has a quite helpful debug mode, which I further extended to generate 
a more detailed report of the failed unwinding attempt. For one of the broken 
samples, I see:

```
$ UNW_DEBUG_LEVEL=15 perf script -v
...
cpp-inlining 30227 18486.357148:     788066 cycles:ppp:  
>_Ux86_64_init_remote: (cursor=0x7fffafa55c10)
unwind: reg 16, val 7f91345d77b4
unwind: reg 7, val 7ffd1e276f18
 >_Ux86_64_step: (cursor=0x7fffafa55c10, ip=0x00007f91345d77b4, 
cfa=0x00007ffd1e276f18)
unwind: find_proc_info dso /usr/lib/libm-2.28.so
               >_Ux86_64_dwarf_search_unwind_table: ip=0x7f91345d77b4, 
start_ip=0xffffffffffeb3fbc
 >_Ux86_64_dwarf_search_unwind_table: e->fde_offset = 4794, segbase = 
7f91347236a4, debug_frame_base = 0, fde_addr = 7f9134727e38
            >_Ux86_64_dwarf_extract_proc_info_from_fde: FDE @ 0x7f9134727e38
               >_Ux86_64_dwarf_extract_proc_info_from_fde: looking for CIE at 
address 7f91347252d8
               >parse_cie: CIE parsed OK, augmentation = "zR", handler=0x0
               >_Ux86_64_dwarf_extract_proc_info_from_fde: FDE covers IP 
0x7f91345d7660-0x7f91345d79ce, LSDA=0x0
     >_Ux86_64_fetch_frame: fetch frame ip=0x7f91345d77b4 cfa=0x7ffd1e276f18 
format=0
               >run_cfi_program: CFA_def_cfa r7+0x8
               >run_cfi_program: CFA_offset r16 at cfa+0xfffffffffffffff8
               >run_cfi_program: CFA_advance_loc to 0x7f91345d7666
               >run_cfi_program: CFA_def_cfa_offset 0x10
               >run_cfi_program: CFA_offset r13 at cfa+0xfffffffffffffff0
               >run_cfi_program: CFA_advance_loc to 0x7f91345d7668
               >run_cfi_program: CFA_def_cfa_offset 0x18
               >run_cfi_program: CFA_offset r12 at cfa+0xffffffffffffffe8
               >run_cfi_program: CFA_advance_loc to 0x7f91345d7669
               >run_cfi_program: CFA_def_cfa_offset 0x20
               >run_cfi_program: CFA_offset r6 at cfa+0xffffffffffffffe0
               >run_cfi_program: CFA_advance_loc to 0x7f91345d766f
               >run_cfi_program: CFA_def_cfa_offset 0x28
               >run_cfi_program: CFA_offset r3 at cfa+0xffffffffffffffd8
               >run_cfi_program: CFA_advance_loc to 0x7f91345d768c
               >run_cfi_program: CFA_def_cfa_offset 0x50
               >run_cfi_program: CFA_advance_loc2 to 0x7f91345d77af
               >run_cfi_program: CFA_remember_state
               >run_cfi_program: CFA_def_cfa_offset 0x28
               >run_cfi_program: CFA_advance_loc to 0x7f91345d77b4
               >run_cfi_program: CFA_def_cfa_offset 0x20
               >run_cfi_program: CFA_advance_loc to 0x7f91345d77b5
     >_Ux86_64_cache_frame: cache frame ip=0x7f91345d77b4 cfa=0x7ffd1e276f18 
format=0
     >_Ux86_64_reuse_frame: reuse frame ip=0x7f91345d77b4 cfa=0x7ffd1e276f18 
format=0 addr=0x0 offset=+0
               >apply_reg_state: CFA_REG_COLUMN[17] = 0x7, 
DWARF_CFA_OFF_COLUMN[18] = 0x20
               >apply_reg_state: CFA = 0x7ffd1e276f18 (pop) + 0x20 = 
0x7ffd1e276f38
               >apply_reg_state: 0: DWARF_WHERE_SAME = (0, 2)
               >apply_reg_state: 1: DWARF_WHERE_SAME = (1, 2)
               >apply_reg_state: 2: DWARF_WHERE_SAME = (2, 2)
               >apply_reg_state: 3: DWARF_WHERE_CFAREL = 7ffd1e276f38 + -40 = 
(7ffd1e276f10, 0)
               >apply_reg_state: 4: DWARF_WHERE_SAME = (4, 2)
               >apply_reg_state: 5: DWARF_WHERE_SAME = (5, 2)
               >apply_reg_state: 6: DWARF_WHERE_CFAREL = 7ffd1e276f38 + -32 = 
(7ffd1e276f18, 0)
               >apply_reg_state: 7: DWARF_WHERE_SAME = (7, 2)
               >apply_reg_state: 8: DWARF_WHERE_SAME = (8, 2)
               >apply_reg_state: 9: DWARF_WHERE_SAME = (9, 2)
               >apply_reg_state: 10: DWARF_WHERE_SAME = (a, 2)
               >apply_reg_state: 11: DWARF_WHERE_SAME = (b, 2)
               >apply_reg_state: 12: DWARF_WHERE_CFAREL = 7ffd1e276f38 + -24 = 
(7ffd1e276f20, 0)
               >apply_reg_state: 13: DWARF_WHERE_CFAREL = 7ffd1e276f38 + -16 = 
(7ffd1e276f28, 0)
               >apply_reg_state: 14: DWARF_WHERE_SAME = (e, 2)
               >apply_reg_state: 15: DWARF_WHERE_SAME = (f, 2)
               >apply_reg_state: 16: DWARF_WHERE_CFAREL = 7ffd1e276f38 + -8 = 
(7ffd1e276f30, 0)
unwind: access_mem addr 0x7ffd1e276f30 val c0d885722245b5e4, offset 24
               >apply_reg_state: ret_addr_column: 16, ip: c0d885722245b5e4
               >_Ux86_64_step: dwarf_step returned 1
  >_Ux86_64_step: returning 1
 >_Ux86_64_step: (cursor=0x7fffafa55c10, ip=0xc0d885722245b5e4, 
cfa=0x00007ffd1e276f38)
               >_Ux86_64_step: dwarf_step returned -22
  >_Ux86_64_step: returning -22
unwind: __hypot_finite:ip = 0x7f91345d77b4 (0x297b4)
unwind: '':ip = 0xc0d885722245b5e3 (0x0)

            7f91345d77b4 __hypot_finite+0x154 (/usr/lib/libm-2.28.so)
        c0d885722245b5e3 [unknown] ([unknown])
```


Now, I also tried the following:

```
$ perf probe -x /usr/lib/libm-2.28.so -a __hypot_finite+0x154
$ perf record -F 1000 --call-graph dwarf -e probe_libm:__hypot_finite ./cpp-
inlining
```

And all of the samples unwind correctly! This makes me believe that it's not 
the .eh_frame information which is wrong - otherwise unwinding would always 
fail from these locations, esp. when using the custom probe trace point. But 
since this is not happening, what else could it be? I only see two 
possibilities: the register values or the stack memory stored in in the sample 
by perf.

The register values is unlikely, since I now understand how the .eh_frame 
contents get analyzed. For __hypot_finite+0x154, we will always end up asking 
for the address at SP+24. access_mem thus will always look at the address at 
offset 24, independent of the actual value of SP.

So, what remains is that the stack dump is somehow wrong, i.e. its contents 
are moved by some offset. Note how I can "fix" the unwinding for such broken 
samples by manually applying some offset in access_mem. By looking at other 
samples where unwinding works from __hypot_finite, I could figure out that the 
correct address to be read for unwnding should be 7f91345bdaf8, e.g.:

```
            7f91345d76ed __hypot_finite+0x8d (/usr/lib/libm-2.28.so)
            7f91345bdaf8 hypotf32x+0x18 (/usr/lib/libm-2.28.so)
            5620579cb128 main+0x88 (/home/milian/projects/kdab/rnd/hotspot/
build/tests/test-clients/cpp-inlining/cpp-inlining)
```

This address indeed occurs in the user stack dump (starting at 0xe0 in the raw 
event data) for the broken sample, cf.:


```
.  00e0:  00 20 00 00 00 00 00 00 c0 b1 9c 57 20 56 00 00  . .........W V..
.  00f0:  70 70 27 1e fd 7f 00 00 f9 da 5b 34 91 7f 00 00  pp'.......[4....
.  0100:  e4 b5 45 22 72 85 d8 c0 c0 1d 16 84 43 30 bb c0  ..E"r.......C0..
.
```

We can find `7f91345bdaf8+1 = 7f91345bdaf9" at offset 16 (search for "f9 da 5b 
34 91 7f"). Using that address makes unwinding work for this sample. What 
could be the reason for this shift?

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3824 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-10-30 22:34               ` Milian Wolff
@ 2018-11-01 22:08                 ` Milian Wolff
  2018-11-02 11:26                   ` Jiri Olsa
  0 siblings, 1 reply; 33+ messages in thread
From: Milian Wolff @ 2018-11-01 22:08 UTC (permalink / raw)
  To: Milian Wolff
  Cc: Andi Kleen, linux-kernel, Jiri Olsa, namhyung, linux-perf-users,
	Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 2335 bytes --]

On Dienstag, 30. Oktober 2018 23:34:35 CET Milian Wolff wrote:
> On Mittwoch, 24. Oktober 2018 16:48:18 CET Andi Kleen wrote:
> > > Can someone at least confirm whether unwinding from a function prologue
> > > via
> > > .eh_frame (but without .debug_frame) should actually be possible?
> > 
> > Yes it should be possible. Asynchronous unwind tables should work
> > from any instruction.

<snip>

> We can find `7f91345bdaf8+1 = 7f91345bdaf9" at offset 16 (search for "f9 da
> 5b 34 91 7f"). Using that address makes unwinding work for this sample.
> What could be the reason for this shift?

I believe I have found the culprit: PEBS seems to be at fault here - i.e. the 
RIP/RSP and the ustack dump of the sample simply don't fit together.

Check this out:

```
$ for i in $(seq 10); do perf record -q -e "cycles:" --call-graph dwarf ./cpp-
inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
[unknown\]"; done
0
0
0
0
0
0
0
0
0
0

$ for i in $(seq 10); do perf record -q -e "cycles:p" --call-graph dwarf ./
cpp-inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
[unknown\]"; done
0
0
0
0
0
0
0
0
0
0

$ for i in $(seq 10); do perf record -q -e "cycles:pp" --call-graph dwarf ./
cpp-inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
[unknown\]"; done
37
39
35
28
40
39
29
37
31
26

$ for i in $(seq 10); do perf record -q -e "cycles:ppp" --call-graph dwarf ./
cpp-inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
[unknown\]"; done
79
70
76
77
70
90
64
78
86
74
```

Note how precise levels 0 and 1 do not produce any samples where unwinding 
fails. But precise level 2 produces some, and precise level 3 increases the 
amount (by ca. ~2x).

I can reproduce this pattern on two separate Intel CPUs and kernel versions 
currently:

Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz with 4.18.16-arch1-1-ARCH
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4.14.78-1-lts

Could someone else try this? What about AMD and IBS - is it also affected? 
What about newer/different Intel CPUs?

Better yet, can someone come up with a fix for this on Intel with maximum 
precise level?

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-01 22:08                 ` PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?] Milian Wolff
@ 2018-11-02 11:26                   ` Jiri Olsa
  2018-11-02 17:56                     ` Milian Wolff
  0 siblings, 1 reply; 33+ messages in thread
From: Jiri Olsa @ 2018-11-02 11:26 UTC (permalink / raw)
  To: Milian Wolff
  Cc: Andi Kleen, linux-kernel, Jiri Olsa, namhyung, linux-perf-users,
	Arnaldo Carvalho

On Thu, Nov 01, 2018 at 11:08:18PM +0100, Milian Wolff wrote:
> On Dienstag, 30. Oktober 2018 23:34:35 CET Milian Wolff wrote:
> > On Mittwoch, 24. Oktober 2018 16:48:18 CET Andi Kleen wrote:
> > > > Can someone at least confirm whether unwinding from a function prologue
> > > > via
> > > > .eh_frame (but without .debug_frame) should actually be possible?
> > > 
> > > Yes it should be possible. Asynchronous unwind tables should work
> > > from any instruction.
> 
> <snip>
> 
> > We can find `7f91345bdaf8+1 = 7f91345bdaf9" at offset 16 (search for "f9 da
> > 5b 34 91 7f"). Using that address makes unwinding work for this sample.
> > What could be the reason for this shift?
> 
> I believe I have found the culprit: PEBS seems to be at fault here - i.e. the 
> RIP/RSP and the ustack dump of the sample simply don't fit together.
> 
> Check this out:
> 
> ```
> $ for i in $(seq 10); do perf record -q -e "cycles:" --call-graph dwarf ./cpp-
> inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
> [unknown\]"; done
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 
> $ for i in $(seq 10); do perf record -q -e "cycles:p" --call-graph dwarf ./
> cpp-inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
> [unknown\]"; done
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 0
> 
> $ for i in $(seq 10); do perf record -q -e "cycles:pp" --call-graph dwarf ./
> cpp-inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
> [unknown\]"; done
> 37
> 39
> 35
> 28
> 40
> 39
> 29
> 37
> 31
> 26
> 
> $ for i in $(seq 10); do perf record -q -e "cycles:ppp" --call-graph dwarf ./
> cpp-inlining > /dev/null; perf script | pcre2grep -c -M "hypot_finite.*\n.*\
> [unknown\]"; done
> 79
> 70
> 76
> 77
> 70
> 90
> 64
> 78
> 86
> 74
> ```
> 
> Note how precise levels 0 and 1 do not produce any samples where unwinding 
> fails. But precise level 2 produces some, and precise level 3 increases the 
> amount (by ca. ~2x).
> 
> I can reproduce this pattern on two separate Intel CPUs and kernel versions 
> currently:
> 
> Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz with 4.18.16-arch1-1-ARCH
> Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4.14.78-1-lts
> 
> Could someone else try this? What about AMD and IBS - is it also affected? 
> What about newer/different Intel CPUs?

I tried on intel and can't actualy see that.. how do the failed samples
look like? like is there the stack dump attached, what's in the regs?

could you please paste the 'perf report -D' output for some of the
failed samples?

thanks,
jirka


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-02 11:26                   ` Jiri Olsa
@ 2018-11-02 17:56                     ` Milian Wolff
  2018-11-05 20:51                       ` Jiri Olsa
  0 siblings, 1 reply; 33+ messages in thread
From: Milian Wolff @ 2018-11-02 17:56 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Andi Kleen, linux-kernel, Jiri Olsa, namhyung, linux-perf-users,
	Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 3710 bytes --]

On Freitag, 2. November 2018 12:26:35 CET Jiri Olsa wrote:
> On Thu, Nov 01, 2018 at 11:08:18PM +0100, Milian Wolff wrote:
> > On Dienstag, 30. Oktober 2018 23:34:35 CET Milian Wolff wrote:
> > > On Mittwoch, 24. Oktober 2018 16:48:18 CET Andi Kleen wrote:
> > > > > Can someone at least confirm whether unwinding from a function
> > > > > prologue
> > > > > via
> > > > > .eh_frame (but without .debug_frame) should actually be possible?
> > > > 
> > > > Yes it should be possible. Asynchronous unwind tables should work
> > > > from any instruction.
> > 
> > <snip>
> > 
> > > We can find `7f91345bdaf8+1 = 7f91345bdaf9" at offset 16 (search for "f9
> > > da
> > > 5b 34 91 7f"). Using that address makes unwinding work for this sample.
> > > What could be the reason for this shift?
> > 
> > I believe I have found the culprit: PEBS seems to be at fault here - i.e.
> > the RIP/RSP and the ustack dump of the sample simply don't fit together.
> > 
> > Check this out:
> > 
> > ```
> > $ for i in $(seq 10); do perf record -q -e "cycles:" --call-graph dwarf
> > ./cpp- inlining > /dev/null; perf script | pcre2grep -c -M
> > "hypot_finite.*\n.*\ [unknown\]"; done
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 
> > $ for i in $(seq 10); do perf record -q -e "cycles:p" --call-graph dwarf
> > ./
> > cpp-inlining > /dev/null; perf script | pcre2grep -c -M
> > "hypot_finite.*\n.*\ [unknown\]"; done
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 0
> > 
> > $ for i in $(seq 10); do perf record -q -e "cycles:pp" --call-graph dwarf
> > ./ cpp-inlining > /dev/null; perf script | pcre2grep -c -M
> > "hypot_finite.*\n.*\ [unknown\]"; done
> > 37
> > 39
> > 35
> > 28
> > 40
> > 39
> > 29
> > 37
> > 31
> > 26
> > 
> > $ for i in $(seq 10); do perf record -q -e "cycles:ppp" --call-graph dwarf
> > ./ cpp-inlining > /dev/null; perf script | pcre2grep -c -M
> > "hypot_finite.*\n.*\ [unknown\]"; done
> > 79
> > 70
> > 76
> > 77
> > 70
> > 90
> > 64
> > 78
> > 86
> > 74
> > ```
> > 
> > Note how precise levels 0 and 1 do not produce any samples where unwinding
> > fails. But precise level 2 produces some, and precise level 3 increases
> > the
> > amount (by ca. ~2x).
> > 
> > I can reproduce this pattern on two separate Intel CPUs and kernel
> > versions
> > currently:
> > 
> > Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz with 4.18.16-arch1-1-ARCH
> > Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4.14.78-1-lts
> > 
> > Could someone else try this? What about AMD and IBS - is it also affected?
> > What about newer/different Intel CPUs?
> 
> I tried on intel and can't actualy see that.. how do the failed samples
> look like? like is there the stack dump attached, what's in the regs?
> 
> could you please paste the 'perf report -D' output for some of the
> failed samples?

See here for one case: https://paste.kde.org/prryvdilq

What Intel CPU did you use? What microcode version? Which kernel version?

Generally, isn't what I'm seeing actually a neccessary evil of the ustack 
based unwinding in perf? I mean, the general procedure is as follows if I'm 
not mistaken:

- PMU triggers interrupt and PEBS stores RIP etc.
- code continous to execute, possibly changing the stack
- PMU interrupt is handled, and a perf sample is generated
  - register values are used from "past" status stored in PEBS
  - but ustack dump is based on the "current" status

From this latter discrepancy, it must directly follow that *sometimes* the 
ustack dump represents a state that cannot be unwound from, no?

Cheers
-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-02 17:56                     ` Milian Wolff
@ 2018-11-05 20:51                       ` Jiri Olsa
  2018-11-05 22:54                         ` Milian Wolff
  0 siblings, 1 reply; 33+ messages in thread
From: Jiri Olsa @ 2018-11-05 20:51 UTC (permalink / raw)
  To: Milian Wolff
  Cc: Andi Kleen, linux-kernel, Jiri Olsa, namhyung, linux-perf-users,
	Arnaldo Carvalho

On Fri, Nov 02, 2018 at 06:56:50PM +0100, Milian Wolff wrote:

SNIP

> > > 
> > > Note how precise levels 0 and 1 do not produce any samples where unwinding
> > > fails. But precise level 2 produces some, and precise level 3 increases
> > > the
> > > amount (by ca. ~2x).
> > > 
> > > I can reproduce this pattern on two separate Intel CPUs and kernel
> > > versions
> > > currently:
> > > 
> > > Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz with 4.18.16-arch1-1-ARCH
> > > Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4.14.78-1-lts
> > > 
> > > Could someone else try this? What about AMD and IBS - is it also affected?
> > > What about newer/different Intel CPUs?
> > 
> > I tried on intel and can't actualy see that.. how do the failed samples
> > look like? like is there the stack dump attached, what's in the regs?
> > 
> > could you please paste the 'perf report -D' output for some of the
> > failed samples?
> 
> See here for one case: https://paste.kde.org/prryvdilq

we should really print some helpfull debug output
for this.. like to show some markers where the stack
data starts

> 
> What Intel CPU did you use? What microcode version? Which kernel version?
> 
> Generally, isn't what I'm seeing actually a neccessary evil of the ustack 
> based unwinding in perf? I mean, the general procedure is as follows if I'm 
> not mistaken:
> 
> - PMU triggers interrupt and PEBS stores RIP etc.
> - code continous to execute, possibly changing the stack

I dont think the code continues to execute.. the stack is ok

the problem I saw in past is that the copy from user is not
100% and sometimes you might not get full stack data you
asked for

have you tried with libdw unwinder? if one of the unwinder
shows more callchains, we need to fix the other one ;-)

jirka

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-05 20:51                       ` Jiri Olsa
@ 2018-11-05 22:54                         ` Milian Wolff
  2018-11-06  0:10                           ` Andi Kleen
  0 siblings, 1 reply; 33+ messages in thread
From: Milian Wolff @ 2018-11-05 22:54 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Andi Kleen, linux-kernel, Jiri Olsa, namhyung, linux-perf-users,
	Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 3898 bytes --]

On Montag, 5. November 2018 21:51:19 CET Jiri Olsa wrote:
> On Fri, Nov 02, 2018 at 06:56:50PM +0100, Milian Wolff wrote:
> 
> SNIP
> 
> > > > Note how precise levels 0 and 1 do not produce any samples where
> > > > unwinding
> > > > fails. But precise level 2 produces some, and precise level 3
> > > > increases
> > > > the
> > > > amount (by ca. ~2x).
> > > > 
> > > > I can reproduce this pattern on two separate Intel CPUs and kernel
> > > > versions
> > > > currently:
> > > > 
> > > > Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz with 4.18.16-arch1-1-ARCH
> > > > Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 4.14.78-1-lts
> > > > 
> > > > Could someone else try this? What about AMD and IBS - is it also
> > > > affected?
> > > > What about newer/different Intel CPUs?
> > > 
> > > I tried on intel and can't actualy see that.. how do the failed samples
> > > look like? like is there the stack dump attached, what's in the regs?
> > > 
> > > could you please paste the 'perf report -D' output for some of the
> > > failed samples?
> > 
> > See here for one case: https://paste.kde.org/prryvdilq
> 
> we should really print some helpfull debug output
> for this.. like to show some markers where the stack
> data starts

Further down below, the offset for the ustack start is given (0xe0). But yes, 
that would be welcome.

> > What Intel CPU did you use? What microcode version? Which kernel version?
> > 
> > Generally, isn't what I'm seeing actually a neccessary evil of the ustack
> > based unwinding in perf? I mean, the general procedure is as follows if
> > I'm
> > not mistaken:
> > 
> > - PMU triggers interrupt and PEBS stores RIP etc.
> > - code continous to execute, possibly changing the stack
> 
> I dont think the code continues to execute.. the stack is ok

Are you sure about this? I mean, isn't that the whole reason why we need PEBS? 
Generally, if you are sure about this, can you point me to some documentation 
on this to allow me to understand it better?

Also, how do you explain the scenario I am seeing, with `cycles:` and 
`cycles:p` not suffering from this issue, but `cycles:pp` and `cycles:ppp` 
leading to broken samples? It _has_ to be PEBS then, no? What else could 
explain this?

> the problem I saw in past is that the copy from user is not
> 100% and sometimes you might not get full stack data you
> asked for

But that would indicate missing data at the end of the ustack dump? In our 
case, the "problematic" data is always at the start.

Also note the apparent shift in the ustack copy which - in one case - directly 
correlatates with the code being executed, i.e. from objdump in libm I see:

0x0000000000029688 <+40>:    sub    $0x28,%rsp
(https://paste.kde.org/poywa7y2z)

The address of the expected parent frame is 7ffff7c7caf8 (hypotf32x+0x18). 
This can be found at offset 80 in the ustack dump (cf. https://paste.kde.org/
prryvdilq - ("f9 ca c7 f7 ff 7f" is found at 0x130, minus 0xe0 yields 0x50 or 
80).

From the libunwind (or libdw) debug output in perf, we see that the unwinder 
tries to access offset 32 (cf. https://paste.kde.org/prryvdilq#line-610), 
which is ofset by 48 from the desired value of 80. This offset is *veroy* 
close to the value of 40 we see in the libm disassembly for __hypot_function 
("$0x28,%rsp"). Is this really just a coincidence?

> have you tried with libdw unwinder? if one of the unwinder
> shows more callchains, we need to fix the other one ;-)

Yes, I've looked at both unwinders. Both try to access the same values, and 
both break due to seemingly wrong data being read from the stack. And if you 
look at my other patches, you may have seen that I've regularly fixed the 
libdw unwinder to bring it closer to libunwind.

Thanks
-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-05 22:54                         ` Milian Wolff
@ 2018-11-06  0:10                           ` Andi Kleen
  2018-11-06  8:39                             ` Jiri Olsa
  2018-11-10 21:42                             ` Travis Downs
  0 siblings, 2 replies; 33+ messages in thread
From: Andi Kleen @ 2018-11-06  0:10 UTC (permalink / raw)
  To: Milian Wolff
  Cc: Jiri Olsa, linux-kernel, Jiri Olsa, namhyung, linux-perf-users,
	Arnaldo Carvalho

> > > - PMU triggers interrupt and PEBS stores RIP etc.
> > > - code continous to execute, possibly changing the stack
> > 
> > I dont think the code continues to execute.. the stack is ok
> 
> Are you sure about this? I mean, isn't that the whole reason why we need PEBS? 
> Generally, if you are sure about this, can you point me to some documentation 
> on this to allow me to understand it better?

Milian is right.

There is a execution window from PEBS capturing registers to actually triggering
the PMU, and if there is stack manipulation in that window
the PEBS state might be out of sync with the real stack.

The right RIP/RSP to use for the stack unwinding is always the data
in the PMI's exception frame on the stack.

Probably would need to modify perf to report those too in addition
to the PEBS registers.

Of course it would still mean that the stack unwinding may not exactly
match the sample RIP, but at least it should be consistent.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-06  0:10                           ` Andi Kleen
@ 2018-11-06  8:39                             ` Jiri Olsa
  2018-11-06 17:26                               ` Andi Kleen
  2018-11-06 20:04                               ` Milian Wolff
  2018-11-10 21:42                             ` Travis Downs
  1 sibling, 2 replies; 33+ messages in thread
From: Jiri Olsa @ 2018-11-06  8:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Milian Wolff, linux-kernel, Jiri Olsa, namhyung,
	linux-perf-users, Arnaldo Carvalho

On Mon, Nov 05, 2018 at 04:10:37PM -0800, Andi Kleen wrote:
> > > > - PMU triggers interrupt and PEBS stores RIP etc.
> > > > - code continous to execute, possibly changing the stack
> > > 
> > > I dont think the code continues to execute.. the stack is ok
> > 
> > Are you sure about this? I mean, isn't that the whole reason why we need PEBS? 
> > Generally, if you are sure about this, can you point me to some documentation 
> > on this to allow me to understand it better?
> 
> Milian is right.
> 
> There is a execution window from PEBS capturing registers to actually triggering
> the PMU, and if there is stack manipulation in that window
> the PEBS state might be out of sync with the real stack.

hum, is this about having 'large pebs' or there's this window
if there's also only single pebs record allowed? which should
be case for dwarf unwind

> The right RIP/RSP to use for the stack unwinding is always the data
> in the PMI's exception frame on the stack.
> 
> Probably would need to modify perf to report those too in addition
> to the PEBS registers.

ok, should not be that hard

> 
> Of course it would still mean that the stack unwinding may not exactly
> match the sample RIP, but at least it should be consistent.

jirka

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-06  8:39                             ` Jiri Olsa
@ 2018-11-06 17:26                               ` Andi Kleen
  2018-11-06 20:04                               ` Milian Wolff
  1 sibling, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2018-11-06 17:26 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Milian Wolff, linux-kernel, Jiri Olsa, namhyung,
	linux-perf-users, Arnaldo Carvalho

> hum, is this about having 'large pebs' or there's this window
> if there's also only single pebs record allowed? which should
> be case for dwarf unwind

With large PEBS today there is never any stack unwind because
stack unwinding can be only done from a PMI.

The window happens even with single PEBS. It's related
to the CPU being pipelined, so complex events may not
happen fully atomically.

-Andi


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-06  8:39                             ` Jiri Olsa
  2018-11-06 17:26                               ` Andi Kleen
@ 2018-11-06 20:04                               ` Milian Wolff
  2018-11-06 20:24                                 ` Andi Kleen
  1 sibling, 1 reply; 33+ messages in thread
From: Milian Wolff @ 2018-11-06 20:04 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Andi Kleen, linux-kernel, Jiri Olsa, namhyung, linux-perf-users,
	Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 1517 bytes --]

On Dienstag, 6. November 2018 09:39:57 CET Jiri Olsa wrote:
> On Mon, Nov 05, 2018 at 04:10:37PM -0800, Andi Kleen wrote:
> > > > > - PMU triggers interrupt and PEBS stores RIP etc.
> > > > > - code continous to execute, possibly changing the stack
> > > > 
> > > > I dont think the code continues to execute.. the stack is ok
> > > 
> > > Are you sure about this? I mean, isn't that the whole reason why we need
> > > PEBS? Generally, if you are sure about this, can you point me to some
> > > documentation on this to allow me to understand it better?
> > 
> > Milian is right.
> > 
> > There is a execution window from PEBS capturing registers to actually
> > triggering the PMU, and if there is stack manipulation in that window
> > the PEBS state might be out of sync with the real stack.
> 
> hum, is this about having 'large pebs' or there's this window
> if there's also only single pebs record allowed? which should
> be case for dwarf unwind
> 
> > The right RIP/RSP to use for the stack unwinding is always the data
> > in the PMI's exception frame on the stack.
> > 
> > Probably would need to modify perf to report those too in addition
> > to the PEBS registers.
> 
> ok, should not be that hard

Where would I look for the source to change here? So far, I only concentrated 
on the userspace side of perf in tools/perf.

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-06 20:04                               ` Milian Wolff
@ 2018-11-06 20:24                                 ` Andi Kleen
  2018-11-07 22:41                                   ` Milian Wolff
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2018-11-06 20:24 UTC (permalink / raw)
  To: Milian Wolff
  Cc: Jiri Olsa, linux-kernel, Jiri Olsa, namhyung, linux-perf-users,
	Arnaldo Carvalho

> Where would I look for the source to change here? So far, I only concentrated 
> on the userspace side of perf in tools/perf.

Kind of similar to 

a405bad5ad20 perf/x86: Add Haswell specific transaction flag reporting
fdfbbd07e91f perf: Add generic transaction flags

Report the original (not overwritten) regs->ip and regs->sp

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-06 20:24                                 ` Andi Kleen
@ 2018-11-07 22:41                                   ` Milian Wolff
  2018-11-08 12:41                                     ` Milian Wolff
  2018-11-09  0:54                                     ` Andi Kleen
  0 siblings, 2 replies; 33+ messages in thread
From: Milian Wolff @ 2018-11-07 22:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jiri Olsa, linux-kernel, Jiri Olsa, namhyung, linux-perf-users,
	Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 2130 bytes --]

On Dienstag, 6. November 2018 21:24:11 CET Andi Kleen wrote:
> > Where would I look for the source to change here? So far, I only
> > concentrated on the userspace side of perf in tools/perf.
> 
> Kind of similar to
> 
> a405bad5ad20 perf/x86: Add Haswell specific transaction flag reporting
> fdfbbd07e91f perf: Add generic transaction flags
> 
> Report the original (not overwritten) regs->ip and regs->sp

Thanks a lot Andi! With your help, I have managed to find the exact issue for 
my scenario. Turns out, it really is "just" the instruction pointer that is 
wrong. I.e. originally we have IP = 0x7feda32ca68c, but with PEBS we correct 
that to IP = 7feda32ca688. The SP register value stays the same according to 
my printk output. Using the original IP value, we can unwind correctly since 
we point to the correct place in the .eh_frame section. The PEBS IP points to 
a different position in the .eh_frame section, which is "too early".

That brings up some questions:

- I noticed `perf record --intr-regs`, but the values recorded in the 
perf.data file are always the same. I.e. comparing uregs and iregs, I always 
see the same values printed by `perf script`. This smells like a bug to me, 
but so far I haven't figured out why this happens...

- Independently, when I add a custom printk manually in `arch/x86/events/
intel/ds.c` at the end of `setup_pebs_sample_data`, then I'm never seeing any 
differences between SP in iregs/pebs/regs. Shouldn't it also be recorded via 
PEBS? Or is it just chance that I'm never seeing any difference in 
setup_pebs_sample_data between iregs->sp and regs->sp?

- Generally, how do we want to handle this bug? If `--intr-regs` would 
actually record a different IP than stored in uregs in the perf.data file, 
then we could use that as a fallback for unwinding, when it fails the first 
time. Or should we always unwind from that IP? How do we mark the "actual" 
frame/IP then, if that differs?

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-07 22:41                                   ` Milian Wolff
@ 2018-11-08 12:41                                     ` Milian Wolff
  2018-11-09  0:55                                       ` Andi Kleen
  2018-11-09  0:54                                     ` Andi Kleen
  1 sibling, 1 reply; 33+ messages in thread
From: Milian Wolff @ 2018-11-08 12:41 UTC (permalink / raw)
  To: Milian Wolff
  Cc: Andi Kleen, Jiri Olsa, linux-kernel, Jiri Olsa, namhyung,
	linux-perf-users, Arnaldo Carvalho

[-- Attachment #1: Type: text/plain, Size: 3204 bytes --]

On Mittwoch, 7. November 2018 23:41:31 CET Milian Wolff wrote:
> On Dienstag, 6. November 2018 21:24:11 CET Andi Kleen wrote:
> > > Where would I look for the source to change here? So far, I only
> > > concentrated on the userspace side of perf in tools/perf.
> > 
> > Kind of similar to
> > 
> > a405bad5ad20 perf/x86: Add Haswell specific transaction flag reporting
> > fdfbbd07e91f perf: Add generic transaction flags
> > 
> > Report the original (not overwritten) regs->ip and regs->sp
> 
> Thanks a lot Andi! With your help, I have managed to find the exact issue
> for my scenario. Turns out, it really is "just" the instruction pointer
> that is wrong. I.e. originally we have IP = 0x7feda32ca68c, but with PEBS
> we correct that to IP = 7feda32ca688. The SP register value stays the same
> according to my printk output. Using the original IP value, we can unwind
> correctly since we point to the correct place in the .eh_frame section. The
> PEBS IP points to a different position in the .eh_frame section, which is
> "too early".
> 
> That brings up some questions:
> 
> - I noticed `perf record --intr-regs`, but the values recorded in the
> perf.data file are always the same. I.e. comparing uregs and iregs, I always
> see the same values printed by `perf script`. This smells like a bug to me,
> but so far I haven't figured out why this happens...

The reason seems to be that perf_event_output only takes one set of registers, 
which then gets handed down into perf_prepare_sample where it gets sampled. 
Thus if sample type has both PERF_SAMPLE_REGS_USER and PERF_SAMPLE_REGS_INTR 
set, then by design both will store the same values for user space samples.

Can we change this, such that perf_event_output also takes a second set of 
registers (iregs) that get sampled for PERF_SAMPLE_REGS_INTR? I'm very new to 
real kernel development, what kind of ABI/API stability guarantees exist for 
something like "perf_event_output"?

> - Independently, when I add a custom printk manually in `arch/x86/events/
> intel/ds.c` at the end of `setup_pebs_sample_data`, then I'm never seeing
> any differences between SP in iregs/pebs/regs. Shouldn't it also be
> recorded via PEBS? Or is it just chance that I'm never seeing any
> difference in setup_pebs_sample_data between iregs->sp and regs->sp?

The reason here seems to be that the registers stored in "pebs" are 
essentially the same as iregs for the setup for `perf record --call-graph 
dwarf`. The difference is the availability of `pebs->real_ip` which gets used 
on my system to fixup the IP. SP stays untouched and is thus only truly valid 
for the untouched IP (which is discarded currently - see above).

> - Generally, how do we want to handle this bug? If `--intr-regs` would
> actually record a different IP than stored in uregs in the perf.data file,
> then we could use that as a fallback for unwinding, when it fails the first
> time. Or should we always unwind from that IP? How do we mark the "actual"
> frame/IP then, if that differs?
> 
> Thanks


-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-07 22:41                                   ` Milian Wolff
  2018-11-08 12:41                                     ` Milian Wolff
@ 2018-11-09  0:54                                     ` Andi Kleen
  1 sibling, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2018-11-09  0:54 UTC (permalink / raw)
  To: Milian Wolff
  Cc: Jiri Olsa, linux-kernel, Jiri Olsa, namhyung, linux-perf-users,
	Arnaldo Carvalho

> - Independently, when I add a custom printk manually in `arch/x86/events/
> intel/ds.c` at the end of `setup_pebs_sample_data`, then I'm never seeing any 
> differences between SP in iregs/pebs/regs. Shouldn't it also be recorded via 
> PEBS? Or is it just chance that I'm never seeing any difference in 
> setup_pebs_sample_data between iregs->sp and regs->sp?

It will depend on the workload.

> 
> - Generally, how do we want to handle this bug? If `--intr-regs` would 
> actually record a different IP than stored in uregs in the perf.data file, 
> then we could use that as a fallback for unwinding, when it fails the first 
> time. Or should we always unwind from that IP? How do we mark the "actual" 
> frame/IP then, if that differs?

I don't think the regs include IP currently? Would need to add it.
Then could report everything from the original stack in intr-regs,
and this would address that problem.

Yes it might be useful to add some indication to the sample that
there is a potential divergence between the backtrace and the IP.
Would also need to report that in perf report then.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-08 12:41                                     ` Milian Wolff
@ 2018-11-09  0:55                                       ` Andi Kleen
  0 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2018-11-09  0:55 UTC (permalink / raw)
  To: Milian Wolff
  Cc: Jiri Olsa, linux-kernel, Jiri Olsa, namhyung, linux-perf-users,
	Arnaldo Carvalho

> Can we change this, such that perf_event_output also takes a second set of 
> registers (iregs) that get sampled for PERF_SAMPLE_REGS_INTR? I'm very new to 
> real kernel development, what kind of ABI/API stability guarantees exist for 
> something like "perf_event_output"?

Yes you can change it.

That's no API/ABI stability guarantee for kernel internal functions, 
as long as you change all callers in tree.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-06  0:10                           ` Andi Kleen
  2018-11-06  8:39                             ` Jiri Olsa
@ 2018-11-10 21:42                             ` Travis Downs
  2018-11-11  1:07                               ` Andi Kleen
  1 sibling, 1 reply; 33+ messages in thread
From: Travis Downs @ 2018-11-10 21:42 UTC (permalink / raw)
  To: ak
  Cc: milian.wolff, jolsa, linux-kernel, jolsa, namhyung,
	linux-perf-users, acme

On Mon, Nov 5, 2018 at 7:11 PM Andi Kleen <ak@linux.intel.com> wrote:
> Milian is right.
>
> There is a execution window from PEBS capturing registers to actually triggering
> the PMU, and if there is stack manipulation in that window
> the PEBS state might be out of sync with the real stack.

This explains some weird results I was always getting especially when
functions were small, including
failed unwindings when using dwarf unwinder.

I guess this problem doesn't occur for LBR unwinding since the LBR
records are captured at the same
moment in time as the PEBS record, so reflect the correct branch
sequence. Of course, LBR doesn't
always let you unwind fully, right?

>
> The right RIP/RSP to use for the stack unwinding is always the data
> in the PMI's exception frame on the stack.
>
> Probably would need to modify perf to report those too in addition
> to the PEBS registers.
>
> Of course it would still mean that the stack unwinding may not exactly
> match the sample RIP, but at least it should be consistent.

What would this fix mean for perf report when you use cycles:pp and
cycles:ppp (or any PEBS based events)? The unwinding should generally
work, but the IP at the top of that stack (from the PMI) will
generally be different than that recorded by PEBS. The tree view and
overhead calculations will be based on the captured stacks, I guess -
but when I annotate, will the values I see correspond to the PEBS IPs
or the PMI IPs?

If someone is using cycles:pp or :ppp they probably care about
instruction-level accuracy, so it would be a shame to throw it away.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-10 21:42                             ` Travis Downs
@ 2018-11-11  1:07                               ` Andi Kleen
       [not found]                                 ` <CAOBGo4zirLiKX8VcROAE=kAD0+qkF0E-cBv9DtBiQr=_obDv5w@mail.gmail.com>
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2018-11-11  1:07 UTC (permalink / raw)
  To: Travis Downs
  Cc: milian.wolff, jolsa, linux-kernel, jolsa, namhyung,
	linux-perf-users, acme

On Sat, Nov 10, 2018 at 04:42:48PM -0500, Travis Downs wrote:
> I guess this problem doesn't occur for LBR unwinding since the LBR
> records are captured at the same
> moment in time as the PEBS record, so reflect the correct branch
> sequence.

Actually it happens with LBRs too, but it always gives the backtrace
consistently at the PMI trigger point.
 
> What would this fix mean for perf report when you use cycles:pp and
> cycles:ppp (or any PEBS based events)? The unwinding should generally
> work, but the IP at the top of that stack (from the PMI) will
> generally be different than that recorded by PEBS. The tree view and
> overhead calculations will be based on the captured stacks, I guess -
> but when I annotate, will the values I see correspond to the PEBS IPs
> or the PMI IPs?

Based on PEBS IPs.

It would be a good idea to add a check to perf report
that the two IPs are different, and if they differ
add some indicator to the sample. This could be a new sort key,
although that would waste some space on the screen, or something
else.

It wouldn't be cover all cases, for example if you have recursion
on the same function it might report the same IP even though
it's a different instance, but I presume that should be rare
enough to not be a problem.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
       [not found]                                 ` <CAOBGo4zirLiKX8VcROAE=kAD0+qkF0E-cBv9DtBiQr=_obDv5w@mail.gmail.com>
@ 2018-11-11  2:54                                   ` Travis Downs
  2018-11-12  3:26                                   ` Andi Kleen
  1 sibling, 0 replies; 33+ messages in thread
From: Travis Downs @ 2018-11-11  2:54 UTC (permalink / raw)
  To: ak
  Cc: Milian Wolff, jolsa, linux-kernel, jolsa, namhyung,
	linux-perf-users, acme

On Sat, Nov 10, 2018 at 8:07 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> On Sat, Nov 10, 2018 at 04:42:48PM -0500, Travis Downs wrote:
> > I guess this problem doesn't occur for LBR unwinding since the LBR
> > records are captured at the same
> > moment in time as the PEBS record, so reflect the correct branch
> > sequence.
>
> Actually it happens with LBRs too, but it always gives the backtrace
> consistently at the PMI trigger point.


That's weird - so the LBR records are from the PMI point, but the rest
of the PEBS record comes from the PEBS trigger point? Or the LBR isn't
part of PEBS at all?

>
> > What would this fix mean for perf report when you use cycles:pp and
> > cycles:ppp (or any PEBS based events)? The unwinding should generally
> > work, but the IP at the top of that stack (from the PMI) will
> > generally be different than that recorded by PEBS. The tree view and
> > overhead calculations will be based on the captured stacks, I guess -
> > but when I annotate, will the values I see correspond to the PEBS IPs
> > or the PMI IPs?
>
> Based on PEBS IPs.
>
> It would be a good idea to add a check to perf report
> that the two IPs are different, and if they differ
> add some indicator to the sample. This could be a new sort key,
> although that would waste some space on the screen, or something
> else.


In the case that PEBS events are used, the IP will differ essentially
100% of the time, right? That is, there will always be *some* skid.

>
>
> It wouldn't be cover all cases, for example if you have recursion
> on the same function it might report the same IP even though
> it's a different instance, but I presume that should be rare
> enough to not be a problem.
>

Well the main problem I see is that "IP inconsistency" will be the
usual case, and it will be hard to resent in a reasonable way in the
report. For example, the backtrace-based displays/reports may indicate
that 80% of your samples are in function X, but based on the PEBS IP
records, only 50% may fall in that function, so you'll always have a
weird thing where when you are investigating within the stack-display
you might see 1234 samples in a function, but when you annotate only
789 samples are accounted for, or whatever.

I don't think this is 100% solvable, it's mostly an issue of
displaying it reasonably and managing expectations.

If the LBR record came from PEBS (as I had thought, but perhaps you
are indicating otherwise above), I could imagine a hybrid mode where
LBR is used to go back some number of calls and then dwarf or FP or
whatever unwinding takes over, because the further down the stack you
do the more likely the PEBS trigger point and PMI point are likely to
have a consistent stack.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
       [not found]                                 ` <CAOBGo4zirLiKX8VcROAE=kAD0+qkF0E-cBv9DtBiQr=_obDv5w@mail.gmail.com>
  2018-11-11  2:54                                   ` Travis Downs
@ 2018-11-12  3:26                                   ` Andi Kleen
  2018-11-14 13:20                                     ` Milian Wolff
  2018-11-15  2:15                                     ` Travis Downs
  1 sibling, 2 replies; 33+ messages in thread
From: Andi Kleen @ 2018-11-12  3:26 UTC (permalink / raw)
  To: Travis Downs
  Cc: Milian Wolff, jolsa, linux-kernel, jolsa, namhyung,
	linux-perf-users, acme

On Sat, Nov 10, 2018 at 09:50:05PM -0500, Travis Downs wrote:
>    On Sat, Nov 10, 2018 at 8:07 PM Andi Kleen <ak@linux.intel.com> wrote:
> 
>      On Sat, Nov 10, 2018 at 04:42:48PM -0500, Travis Downs wrote:
>      > I guess this problem doesn't occur for LBR unwinding since the LBR
>      > records are captured at the same
>      > moment in time as the PEBS record, so reflect the correct branch
>      > sequence.
> 
>      Actually it happens with LBRs too, but it always gives the backtrace
>      consistently at the PMI trigger point.
> 
>    That's weird - so the LBR records are from the PMI point, but the rest of
>    the PEBS record comes from the PEBS trigger point? Or the LBR isn't part
>    of PEBS at all?

LBR is not part of PEBS, but is collected separately in the PMI handler.

>      > overhead calculations will be based on the captured stacks, I guess -
>      > but when I annotate, will the values I see correspond to the PEBS IPs
>      > or the PMI IPs?
> 
>      Based on PEBS IPs.
> 
>      It would be a good idea to add a check to perf report
>      that the two IPs are different, and if they differ
>      add some indicator to the sample. This could be a new sort key,
>      although that would waste some space on the screen, or something
>      else.
> 
>    In the case that PEBS events are used, the IP will differ essentially 100%
>    of the time, right? That is, there will always be *some* skid.

I wouldn't say that.  It depends on what the CPU is doing and the IPC
of the code.

Also the backtrace inconsistency can only happen if the sample races with
function return. If you don't then the backtrace will point
to the correct function, even though the unwind IP is different. 

For example in the common case where you profile a long loop it
is unlikely to happen.


>    indicating otherwise above), I could imagine a hybrid mode where LBR is
>    used to go back some number of calls and then dwarf or FP or whatever
>    unwinding takes over, because the further down the stack you do the more
>    likely the PEBS trigger point and PMI point are likely to have a
>    consistent stack.

Could collect numbers how often it happens, but it would surprise
me if anything complicated is worth it. I would just do the minimum fixes
to address the unwinder errors, and perhaps add the "unwind ip differs"
indication.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-12  3:26                                   ` Andi Kleen
@ 2018-11-14 13:20                                     ` Milian Wolff
  2018-11-15  2:05                                       ` Travis Downs
  2018-11-15  2:15                                     ` Travis Downs
  1 sibling, 1 reply; 33+ messages in thread
From: Milian Wolff @ 2018-11-14 13:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Travis Downs, jolsa, linux-kernel, jolsa, namhyung,
	linux-perf-users, acme


[-- Attachment #1.1: Type: text/plain, Size: 6105 bytes --]

On Montag, 12. November 2018 04:26:37 CET Andi Kleen wrote:
> On Sat, Nov 10, 2018 at 09:50:05PM -0500, Travis Downs wrote:
> >    On Sat, Nov 10, 2018 at 8:07 PM Andi Kleen <ak@linux.intel.com> wrote:
> >      On Sat, Nov 10, 2018 at 04:42:48PM -0500, Travis Downs wrote:
> >      > I guess this problem doesn't occur for LBR unwinding since the LBR
> >      > records are captured at the same
> >      > moment in time as the PEBS record, so reflect the correct branch
> >      > sequence.
> >      
> >      Actually it happens with LBRs too, but it always gives the backtrace
> >      consistently at the PMI trigger point.
> >    
> >    That's weird - so the LBR records are from the PMI point, but the rest
> >    of
> >    the PEBS record comes from the PEBS trigger point? Or the LBR isn't
> >    part
> >    of PEBS at all?
> 
> LBR is not part of PEBS, but is collected separately in the PMI handler.
> 
> >      > overhead calculations will be based on the captured stacks, I guess
> >      > -
> >      > but when I annotate, will the values I see correspond to the PEBS
> >      > IPs
> >      > or the PMI IPs?
> >      
> >      Based on PEBS IPs.
> >      
> >      It would be a good idea to add a check to perf report
> >      that the two IPs are different, and if they differ
> >      add some indicator to the sample. This could be a new sort key,
> >      although that would waste some space on the screen, or something
> >      else.
> >    
> >    In the case that PEBS events are used, the IP will differ essentially
> >    100%
> >    of the time, right? That is, there will always be *some* skid.
> 
> I wouldn't say that.  It depends on what the CPU is doing and the IPC
> of the code.
> 
> Also the backtrace inconsistency can only happen if the sample races with
> function return. If you don't then the backtrace will point
> to the correct function, even though the unwind IP is different.
> 
> For example in the common case where you profile a long loop it
> is unlikely to happen.
> 
> >    indicating otherwise above), I could imagine a hybrid mode where LBR is
> >    used to go back some number of calls and then dwarf or FP or whatever
> >    unwinding takes over, because the further down the stack you do the
> >    more
> >    likely the PEBS trigger point and PMI point are likely to have a
> >    consistent stack.
> 
> Could collect numbers how often it happens, but it would surprise
> me if anything complicated is worth it. I would just do the minimum fixes
> to address the unwinder errors, and perhaps add the "unwind ip differs"
> indication.

I now have a preliminary WIP patch up and running (see attached), which works 
for my usecase and improves perf noticeably. All traces of "unknown" frames 
are eradicated, i.e. unwinding now works for 100% of the samples!

There are some remaining open questions on my side:

1) Do we really want to change the API of perf_event_overflow_* and 
perf_event_output_* and adapt all its users? To me, it seems as if only PEBS 
and IBS would want to pass distinct register sets for iregs and uregs. All 
other users of the API would continue to pass the same set. Changing the 
central API produces a lot of churn for no good reason. Does anyone see an 
alternative to this?

The only alternative idea I have right now is to temporarily change the 
sample_type in __intel_pmu_pebs_event before we call perf_event_output / 
perf_event_overflow. I.e. unset PERF_SAMPLE_REGS_INTR, then sample the regs 
manually from iregs before calling perf_event_{overflow,output}, then set 
PERF_SAPMLE_REGS_INTR again. Or we could introduce a custom flag similar to 
__PERF_SAMPLE_CALLCHAIN_EARLY here...

2) How do we want to do »the "unwind ip differs" indication« as Andi puts it? 
I.e. on the perf report/script side, how should we display the stacks? 
Something like the following annotation maybe?


```
cpp-inlining  2605 [-01]    57.870061:     701199 cycles:pppu: 
           7fc1042797b4 __hypot_finite+0x154 (/usr/lib/libm-2.28.so)
           7fc10425faf8 hypotf32x+0x18 (/usr/lib/libm-2.28.so) (unwind ip 
differs)
           5622c7452128 main+0x88 (/tmp/cpp-inlining)
           7fc104096222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
           5622c74521ed _start+0x2d (/tmp/cpp-inlining)
```

3) I suggest we always keep the first frame and sample IP from the user regs, 
i.e. the accurate PEBS/IBS IP. Then we add the following frames from unwinding 
the ustack with the iregs. But what do we do with the first iregs IP? If we 
add it, then we could see the same frame with slightly different IP, like in 
the following, which is undesired I believe:


```
cpp-inlining  2605 [-01]    57.870061:     701199 cycles:pppu: 
           7fc1042797b4 __hypot_finite+0x154 (/usr/lib/libm-2.28.so)
           7fc1042797b5 __hypot_finite+0x155 (/usr/lib/libm-2.28.so)
           7fc10425faf8 hypotf32x+0x18 (/usr/lib/libm-2.28.so) (unwind ip 
differs)
           5622c7452128 main+0x88 (/tmp/cpp-inlining)
           7fc104096222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
           5622c74521ed _start+0x2d (/tmp/cpp-inlining)
```

But always skipping the IP is also sometimes wrong, like in this case:

```
cpp-inlining  2605 [-01]    57.862313:     694984 cycles:pppu: 
           7fc1042797b9 __hypot_finite+0x159 (/usr/lib/libm-2.28.so)
           5622c7452128 main+0x88 (/tmp/cpp-inlining)
           7fc104096222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
           5622c74521ed _start+0x2d (/tmp/cpp-inlining)
```

Here, we are missing the hypotf32x call inbetween __hypot_finite and main.

Do we want to introduce some heuristic on how handle these scenarios? I.e. if 
uregs->ip and iregs->ip point to the same function symbol, then skip the frame 
for iregs->ip, otherwise add it?

Thanks
-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #1.2: 0001-WIP-perf-make-it-possible-to-collect-both-iregs-and-.patch --]
[-- Type: text/x-patch, Size: 13997 bytes --]

From 422d2a95eff344407ec425f0de55b264841d1757 Mon Sep 17 00:00:00 2001
From: Milian Wolff <milian.wolff@kdab.com>
Date: Wed, 14 Nov 2018 14:10:47 +0100
Subject: [PATCH 1/2] [WIP] perf: make it possible to collect both, iregs and
 uregs

Previously, only one set of registers was stored in the perf
data for both, user and interrupt registers. Now, two distinct
sets can be sampled.

Signed-off-by: Milian Wolff <milian.wolff@kdab.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
---
 arch/x86/events/amd/ibs.c        |  2 +-
 arch/x86/events/core.c           |  2 +-
 arch/x86/events/intel/core.c     |  2 +-
 arch/x86/events/intel/ds.c       |  7 +++----
 arch/x86/events/intel/knc.c      |  2 +-
 arch/x86/events/intel/p4.c       |  2 +-
 arch/x86/kernel/ptrace.c         |  2 +-
 arch/x86/kvm/pmu.c               |  4 ++--
 drivers/oprofile/nmi_timer_int.c |  2 +-
 include/linux/perf_event.h       | 18 +++++++++++------
 kernel/events/core.c             | 34 ++++++++++++++++----------------
 kernel/trace/bpf_trace.c         |  2 +-
 kernel/watchdog_hld.c            |  2 +-
 13 files changed, 43 insertions(+), 38 deletions(-)

diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index d50bb4dc0650..567db8878511 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -670,7 +670,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 		data.raw = &raw;
 	}
 
-	throttle = perf_event_overflow(event, &data, &regs);
+	throttle = perf_event_overflow(event, &data, &regs, iregs);
 out:
 	if (throttle)
 		perf_ibs_stop(event, 0);
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 106911b603bd..acdcafa57ca0 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1493,7 +1493,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
 		if (!x86_perf_event_set_period(event))
 			continue;
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, regs))
 			x86_pmu_stop(event, 0);
 	}
 
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 273c62e81546..2156620b3d9e 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2299,7 +2299,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 		if (has_branch_stack(event))
 			data.br_stack = &cpuc->lbr_stack;
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, regs))
 			x86_pmu_stop(event, 0);
 	}
 
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index b7b01d762d32..018fc0649033 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -639,7 +639,7 @@ int intel_pmu_drain_bts_buffer(void)
 	 * the sample.
 	 */
 	rcu_read_lock();
-	perf_prepare_sample(&header, &data, event, &regs);
+	perf_prepare_sample(&header, &data, event, &regs, &regs);
 
 	if (perf_output_begin(&handle, event, header.size *
 			      (top - base - skip)))
@@ -1273,7 +1273,6 @@ static void setup_pebs_sample_data(struct perf_event *event,
 		set_linear_ip(regs, pebs->ip);
 	}
 
-
 	if ((sample_type & (PERF_SAMPLE_ADDR | PERF_SAMPLE_PHYS_ADDR)) &&
 	    x86_pmu.intel_cap.pebs_format >= 1)
 		data->addr = pebs->dla;
@@ -1430,7 +1429,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 
 	while (count > 1) {
 		setup_pebs_sample_data(event, iregs, at, &data, &regs);
-		perf_event_output(event, &data, &regs);
+		perf_event_output(event, &data, &regs, iregs);
 		at += x86_pmu.pebs_record_size;
 		at = get_next_pebs_record_by_bit(at, top, bit);
 		count--;
@@ -1442,7 +1441,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	 * All but the last records are processed.
 	 * The last one is left to be able to call the overflow handler.
 	 */
-	if (perf_event_overflow(event, &data, &regs)) {
+	if (perf_event_overflow(event, &data, &regs, iregs)) {
 		x86_pmu_stop(event, 0);
 		return;
 	}
diff --git a/arch/x86/events/intel/knc.c b/arch/x86/events/intel/knc.c
index 618001c208e8..9ea5a13af83f 100644
--- a/arch/x86/events/intel/knc.c
+++ b/arch/x86/events/intel/knc.c
@@ -252,7 +252,7 @@ static int knc_pmu_handle_irq(struct pt_regs *regs)
 
 		perf_sample_data_init(&data, 0, event->hw.last_period);
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, regs))
 			x86_pmu_stop(event, 0);
 	}
 
diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
index d32c0eed38ca..704457b5f49a 100644
--- a/arch/x86/events/intel/p4.c
+++ b/arch/x86/events/intel/p4.c
@@ -1037,7 +1037,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
 			continue;
 
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, regs))
 			x86_pmu_stop(event, 0);
 	}
 
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index ffae9b9740fd..13b2230e5e9b 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -499,7 +499,7 @@ static int genregs_set(struct task_struct *target,
 
 static void ptrace_triggered(struct perf_event *bp,
 			     struct perf_sample_data *data,
-			     struct pt_regs *regs)
+			     struct pt_regs *regs, struct pt_regs *iregs)
 {
 	int i;
 	struct thread_struct *thread = &(current->thread);
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 58ead7db71a3..b556b2d467e1 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -57,7 +57,7 @@ static void kvm_pmi_trigger_fn(struct irq_work *irq_work)
 
 static void kvm_perf_overflow(struct perf_event *perf_event,
 			      struct perf_sample_data *data,
-			      struct pt_regs *regs)
+			      struct pt_regs *regs, struct pt_regs *iregs)
 {
 	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
@@ -71,7 +71,7 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
 
 static void kvm_perf_overflow_intr(struct perf_event *perf_event,
 				   struct perf_sample_data *data,
-				   struct pt_regs *regs)
+				   struct pt_regs *regs, struct pt_regs *iregs)
 {
 	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
diff --git a/drivers/oprofile/nmi_timer_int.c b/drivers/oprofile/nmi_timer_int.c
index f343bd96609a..110dfef21420 100644
--- a/drivers/oprofile/nmi_timer_int.c
+++ b/drivers/oprofile/nmi_timer_int.c
@@ -28,7 +28,7 @@ static struct perf_event_attr nmi_timer_attr = {
 
 static void nmi_timer_callback(struct perf_event *event,
 			       struct perf_sample_data *data,
-			       struct pt_regs *regs)
+			       struct pt_regs *regs, struct pt_regs *iregs)
 {
 	event->hw.interrupts = 0;       /* don't throttle interrupts */
 	oprofile_add_sample(regs, 0);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 53c500f0ca79..3a989c64c2c7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -506,7 +506,8 @@ struct perf_sample_data;
 
 typedef void (*perf_overflow_handler_t)(struct perf_event *,
 					struct perf_sample_data *,
-					struct pt_regs *regs);
+					struct pt_regs *regs,
+					struct pt_regs *iregs);
 
 /*
  * Event capabilities. For event_caps and groups caps.
@@ -966,21 +967,26 @@ extern void perf_output_sample(struct perf_output_handle *handle,
 extern void perf_prepare_sample(struct perf_event_header *header,
 				struct perf_sample_data *data,
 				struct perf_event *event,
-				struct pt_regs *regs);
+				struct pt_regs *regs,
+				struct pt_regs *iregs);
 
 extern int perf_event_overflow(struct perf_event *event,
 				 struct perf_sample_data *data,
-				 struct pt_regs *regs);
+				 struct pt_regs *regs,
+				 struct pt_regs *iregs);
 
 extern void perf_event_output_forward(struct perf_event *event,
 				     struct perf_sample_data *data,
-				     struct pt_regs *regs);
+				     struct pt_regs *regs,
+				     struct pt_regs *iregs);
 extern void perf_event_output_backward(struct perf_event *event,
 				       struct perf_sample_data *data,
-				       struct pt_regs *regs);
+				       struct pt_regs *regs,
+				       struct pt_regs *iregs);
 extern void perf_event_output(struct perf_event *event,
 			      struct perf_sample_data *data,
-			      struct pt_regs *regs);
+			      struct pt_regs *regs,
+			      struct pt_regs *iregs);
 
 static inline bool
 is_default_overflow_handler(struct perf_event *event)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 84530ab358c3..1b57602dc6d8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6369,7 +6369,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 void perf_prepare_sample(struct perf_event_header *header,
 			 struct perf_sample_data *data,
 			 struct perf_event *event,
-			 struct pt_regs *regs)
+			 struct pt_regs *regs, struct pt_regs *iregs)
 {
 	u64 sample_type = event->attr.sample_type;
 
@@ -6474,7 +6474,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 		/* regs dump ABI info */
 		int size = sizeof(u64);
 
-		perf_sample_regs_intr(&data->regs_intr, regs);
+		perf_sample_regs_intr(&data->regs_intr, iregs);
 
 		if (data->regs_intr.regs) {
 			u64 mask = event->attr.sample_regs_intr;
@@ -6492,7 +6492,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 static __always_inline void
 __perf_event_output(struct perf_event *event,
 		    struct perf_sample_data *data,
-		    struct pt_regs *regs,
+		    struct pt_regs *regs, struct pt_regs *iregs,
 		    int (*output_begin)(struct perf_output_handle *,
 					struct perf_event *,
 					unsigned int))
@@ -6503,7 +6503,7 @@ __perf_event_output(struct perf_event *event,
 	/* protect the callchain buffers */
 	rcu_read_lock();
 
-	perf_prepare_sample(&header, data, event, regs);
+	perf_prepare_sample(&header, data, event, regs, iregs);
 
 	if (output_begin(&handle, event, header.size))
 		goto exit;
@@ -6519,25 +6519,25 @@ __perf_event_output(struct perf_event *event,
 void
 perf_event_output_forward(struct perf_event *event,
 			 struct perf_sample_data *data,
-			 struct pt_regs *regs)
+			 struct pt_regs *regs, struct pt_regs *iregs)
 {
-	__perf_event_output(event, data, regs, perf_output_begin_forward);
+	__perf_event_output(event, data, regs, iregs, perf_output_begin_forward);
 }
 
 void
 perf_event_output_backward(struct perf_event *event,
 			   struct perf_sample_data *data,
-			   struct pt_regs *regs)
+			   struct pt_regs *regs, struct pt_regs *iregs)
 {
-	__perf_event_output(event, data, regs, perf_output_begin_backward);
+	__perf_event_output(event, data, regs, iregs, perf_output_begin_backward);
 }
 
 void
 perf_event_output(struct perf_event *event,
 		  struct perf_sample_data *data,
-		  struct pt_regs *regs)
+		  struct pt_regs *regs, struct pt_regs *iregs)
 {
-	__perf_event_output(event, data, regs, perf_output_begin);
+	__perf_event_output(event, data, regs, iregs, perf_output_begin);
 }
 
 /*
@@ -7738,7 +7738,7 @@ int perf_event_account_interrupt(struct perf_event *event)
 
 static int __perf_event_overflow(struct perf_event *event,
 				   int throttle, struct perf_sample_data *data,
-				   struct pt_regs *regs)
+				   struct pt_regs *regs, struct pt_regs *iregs)
 {
 	int events = atomic_read(&event->event_limit);
 	int ret = 0;
@@ -7765,7 +7765,7 @@ static int __perf_event_overflow(struct perf_event *event,
 		perf_event_disable_inatomic(event);
 	}
 
-	READ_ONCE(event->overflow_handler)(event, data, regs);
+	READ_ONCE(event->overflow_handler)(event, data, regs, iregs);
 
 	if (*perf_event_fasync(event) && event->pending_kill) {
 		event->pending_wakeup = 1;
@@ -7777,9 +7777,9 @@ static int __perf_event_overflow(struct perf_event *event,
 
 int perf_event_overflow(struct perf_event *event,
 			  struct perf_sample_data *data,
-			  struct pt_regs *regs)
+			  struct pt_regs *regs, struct pt_regs *iregs)
 {
-	return __perf_event_overflow(event, 1, data, regs);
+	return __perf_event_overflow(event, 1, data, regs, iregs);
 }
 
 /*
@@ -7842,7 +7842,7 @@ static void perf_swevent_overflow(struct perf_event *event, u64 overflow,
 
 	for (; overflow; overflow--) {
 		if (__perf_event_overflow(event, throttle,
-					    data, regs)) {
+					    data, regs, regs)) {
 			/*
 			 * We inhibit the overflow from happening when
 			 * hwc->interrupts == MAX_INTERRUPTS.
@@ -8550,7 +8550,7 @@ static void bpf_overflow_handler(struct perf_event *event,
 	if (!ret)
 		return;
 
-	event->orig_overflow_handler(event, data, regs);
+	event->orig_overflow_handler(event, data, regs, regs);
 }
 
 static int perf_event_set_bpf_handler(struct perf_event *event, u32 prog_fd)
@@ -9152,7 +9152,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 
 	if (regs && !perf_exclude_event(event, regs)) {
 		if (!(event->attr.exclude_idle && is_idle_task(current)))
-			if (__perf_event_overflow(event, 1, &data, regs))
+			if (__perf_event_overflow(event, 1, &data, regs, regs))
 				ret = HRTIMER_NORESTART;
 	}
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 08fcfe440c63..6faf12fd6114 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -392,7 +392,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
 	if (unlikely(event->oncpu != cpu))
 		return -EOPNOTSUPP;
 
-	perf_event_output(event, sd, regs);
+	perf_event_output(event, sd, regs, regs);
 	return 0;
 }
 
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 71381168dede..5f4e18d003bb 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -109,7 +109,7 @@ static struct perf_event_attr wd_hw_attr = {
 /* Callback function for perf event subsystem */
 static void watchdog_overflow_callback(struct perf_event *event,
 				       struct perf_sample_data *data,
-				       struct pt_regs *regs)
+				       struct pt_regs *regs, struct pt_regs *iregs)
 {
 	/* Ensure the watchdog never gets throttled */
 	event->hw.interrupts = 0;
-- 
2.19.1


[-- Attachment #1.3: 0002-WIP-perf-unwind-use-iregs-for-unwinding.patch --]
[-- Type: text/x-patch, Size: 2100 bytes --]

From 721bb20a8a7d1ff2f7b062f8d92f50c743883d35 Mon Sep 17 00:00:00 2001
From: Milian Wolff <milian.wolff@kdab.com>
Date: Wed, 14 Nov 2018 14:18:56 +0100
Subject: [PATCH 2/2] [WIP] perf unwind: use iregs for unwinding

TODO: only use it if available
TODO: figure out when to skip iregs->ip frame, and when
to use it (e.g. when function for iregs->ip and uregs->ip
differs?)

Signed-off-by: Milian Wolff <milian.wolff@kdab.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
---
 tools/perf/util/unwind-libunwind-local.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/unwind-libunwind-local.c b/tools/perf/util/unwind-libunwind-local.c
index 79f521a552cf..39f19673cc34 100644
--- a/tools/perf/util/unwind-libunwind-local.c
+++ b/tools/perf/util/unwind-libunwind-local.c
@@ -492,12 +492,12 @@ static int access_mem(unw_addr_space_t __maybe_unused as,
 	int ret;
 
 	/* Don't support write, probably not needed. */
-	if (__write || !stack || !ui->sample->user_regs.regs) {
+	if (__write || !stack || !ui->sample->intr_regs.regs) {
 		*valp = 0;
 		return 0;
 	}
 
-	ret = perf_reg_value(&start, &ui->sample->user_regs,
+	ret = perf_reg_value(&start, &ui->sample->intr_regs,
 			     LIBUNWIND__ARCH_REG_SP);
 	if (ret)
 		return ret;
@@ -541,7 +541,7 @@ static int access_reg(unw_addr_space_t __maybe_unused as,
 		return 0;
 	}
 
-	if (!ui->sample->user_regs.regs) {
+	if (!ui->sample->intr_regs.regs) {
 		*valp = 0;
 		return 0;
 	}
@@ -550,7 +550,7 @@ static int access_reg(unw_addr_space_t __maybe_unused as,
 	if (id < 0)
 		return -EINVAL;
 
-	ret = perf_reg_value(&val, &ui->sample->user_regs, id);
+	ret = perf_reg_value(&val, &ui->sample->intr_regs, id);
 	if (ret) {
 		pr_err("unwind: can't read reg %d\n", regnum);
 		return ret;
@@ -716,7 +716,7 @@ static int _unwind__get_entries(unwind_entry_cb_t cb, void *arg,
 		.machine      = thread->mg->machine,
 	};
 
-	if (!data->user_regs.regs)
+	if (!data->intr_regs.regs)
 		return -EINVAL;
 
 	if (max_stack <= 0)
-- 
2.19.1


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-14 13:20                                     ` Milian Wolff
@ 2018-11-15  2:05                                       ` Travis Downs
  2018-11-15  9:10                                         ` Milian Wolff
  0 siblings, 1 reply; 33+ messages in thread
From: Travis Downs @ 2018-11-15  2:05 UTC (permalink / raw)
  To: Milian Wolff
  Cc: ak, jolsa, linux-kernel, jolsa, namhyung, linux-perf-users, acme

On Wed, Nov 14, 2018 at 8:20 AM Milian Wolff <milian.wolff@kdab.com> wrote:

> 3) I suggest we always keep the first frame and sample IP from the user regs,
> i.e. the accurate PEBS/IBS IP. Then we add the following frames from unwinding
> the ustack with the iregs.

Does this mean that the displayed unwind will sometimes be
"impossible" to have actually be generated from a consistent execution
of the user program?

For example, the top frame (from PEBS) and second frame (from iregs)
may be inconsistent in that the latter function never calls the first.
At this point it would be good to have an indication at the top frame
is from a different source than the rest of the frames, lest the user
pull is hair out trying to determine how function X seems to call
function Y despite that not being the case in the source.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-12  3:26                                   ` Andi Kleen
  2018-11-14 13:20                                     ` Milian Wolff
@ 2018-11-15  2:15                                     ` Travis Downs
  1 sibling, 0 replies; 33+ messages in thread
From: Travis Downs @ 2018-11-15  2:15 UTC (permalink / raw)
  To: ak
  Cc: Milian Wolff, jolsa, linux-kernel, jolsa, namhyung,
	linux-perf-users, acme

On Sun, Nov 11, 2018 at 10:26 PM Andi Kleen <ak@linux.intel.com> wrote:
> On Sat, Nov 10, 2018 at 09:50:05PM -0500, Travis Downs wrote:

> LBR is not part of PEBS, but is collected separately in the PMI handler.

Thanks for clearing this up - so you can ignore any earlier
suggestions on my part of trying to use LBR to fix the unwinding
inconsistency.

> >    In the case that PEBS events are used, the IP will differ essentially 100%
> >    of the time, right? That is, there will always be *some* skid.
>
> I wouldn't say that.  It depends on what the CPU is doing and the IPC
> of the code.

Well other than say a long-latency cache miss, it is my experience
that the skid is generally never zero. That is, the PEBS and ireg IP
will usually differ. This is mostly moot though: what is important is
how often the ireg skip results in a different call chain (i.e., a
return occurred between the PEBS point and the interrupt), as you have
pointed out.

>
> Also the backtrace inconsistency can only happen if the sample races with
> function return. If you don't then the backtrace will point
> to the correct function, even though the unwind IP is different.
>
> For example in the common case where you profile a long loop it
> is unlikely to happen.

Agreed.


> Could collect numbers how often it happens, but it would surprise
> me if anything complicated is worth it. I would just do the minimum fixes
> to address the unwinder errors, and perhaps add the "unwind ip differs"
> indication.

As above, I think the most important UX problem is not when the IP
differs, but when the top frame of the IP unwind is different than the
function in which the PEBS sample occurred. I think the case where the
skid ends up with both in the same function doesn't pose any
presentation difficulties [1]. When they are different though, it
seems tough to present a consistent picture.

[1] Strictly speaking, this the "IPs are in the same function" is not
sufficient. Imagine a scenario where you have T->B->A (T calls B calls
A) and the PEBS sample happens in A, and then A and B return, and now
C then A are called (T->C->A) and the PMI happens. Now the PEBS IP and
the ireg IP are in the same function, but the stacks are still
inconsistent. It is probably fine to paper this over and show the user
the T->C->A stack, as this stack is somehow accurate (it really
happened), but the user might be confused when he looks at the
annotation for A, and sees code being executed (having PEBS samples)
that he knows can never execute when C calls A (for example) since the
annotations are based on the hidden T->B->A execution...

Bleh.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-15  2:05                                       ` Travis Downs
@ 2018-11-15  9:10                                         ` Milian Wolff
  2018-11-15 19:00                                           ` Andi Kleen
  0 siblings, 1 reply; 33+ messages in thread
From: Milian Wolff @ 2018-11-15  9:10 UTC (permalink / raw)
  To: Travis Downs
  Cc: ak, jolsa, linux-kernel, jolsa, namhyung, linux-perf-users, acme

[-- Attachment #1: Type: text/plain, Size: 1313 bytes --]

On Donnerstag, 15. November 2018 03:05:32 CET Travis Downs wrote:
> On Wed, Nov 14, 2018 at 8:20 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> > 3) I suggest we always keep the first frame and sample IP from the user
> > regs, i.e. the accurate PEBS/IBS IP. Then we add the following frames
> > from unwinding the ustack with the iregs.
> 
> Does this mean that the displayed unwind will sometimes be
> "impossible" to have actually be generated from a consistent execution
> of the user program?

Yes, that is exactly what I'm saying.

> For example, the top frame (from PEBS) and second frame (from iregs)
> may be inconsistent in that the latter function never calls the first.
> At this point it would be good to have an indication at the top frame
> is from a different source than the rest of the frames, lest the user
> pull is hair out trying to determine how function X seems to call
> function Y despite that not being the case in the source.

I agree. I personally like your suggested approach - only add an indication 
when the IP differs so much that it points to a different function. What do 
others say to this?

Cheers

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3826 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?]
  2018-11-15  9:10                                         ` Milian Wolff
@ 2018-11-15 19:00                                           ` Andi Kleen
  0 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2018-11-15 19:00 UTC (permalink / raw)
  To: Milian Wolff
  Cc: Travis Downs, jolsa, linux-kernel, jolsa, namhyung,
	linux-perf-users, acme

Milian Wolff <milian.wolff@kdab.com> writes:
>
> I agree. I personally like your suggested approach - only add an indication 
> when the IP differs so much that it points to a different function. What do 
> others say to this?

Yes that's fine.

-Andi


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2018-11-15 19:00 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-20 22:39 Broken dwarf unwinding - wrong stack pointer register value? Milian Wolff
2018-10-21 20:32 ` Milian Wolff
2018-10-22 10:35 ` Milian Wolff
2018-10-22 11:17   ` Milian Wolff
2018-10-22 13:58     ` Andi Kleen
2018-10-22 19:26       ` Milian Wolff
2018-10-23  4:03         ` Andi Kleen
2018-10-23 10:34           ` Milian Wolff
2018-10-24 14:48             ` Andi Kleen
2018-10-30 22:34               ` Milian Wolff
2018-11-01 22:08                 ` PEBS level 2/3 breaks dwarf unwinding! [WAS: Re: Broken dwarf unwinding - wrong stack pointer register value?] Milian Wolff
2018-11-02 11:26                   ` Jiri Olsa
2018-11-02 17:56                     ` Milian Wolff
2018-11-05 20:51                       ` Jiri Olsa
2018-11-05 22:54                         ` Milian Wolff
2018-11-06  0:10                           ` Andi Kleen
2018-11-06  8:39                             ` Jiri Olsa
2018-11-06 17:26                               ` Andi Kleen
2018-11-06 20:04                               ` Milian Wolff
2018-11-06 20:24                                 ` Andi Kleen
2018-11-07 22:41                                   ` Milian Wolff
2018-11-08 12:41                                     ` Milian Wolff
2018-11-09  0:55                                       ` Andi Kleen
2018-11-09  0:54                                     ` Andi Kleen
2018-11-10 21:42                             ` Travis Downs
2018-11-11  1:07                               ` Andi Kleen
     [not found]                                 ` <CAOBGo4zirLiKX8VcROAE=kAD0+qkF0E-cBv9DtBiQr=_obDv5w@mail.gmail.com>
2018-11-11  2:54                                   ` Travis Downs
2018-11-12  3:26                                   ` Andi Kleen
2018-11-14 13:20                                     ` Milian Wolff
2018-11-15  2:05                                       ` Travis Downs
2018-11-15  9:10                                         ` Milian Wolff
2018-11-15 19:00                                           ` Andi Kleen
2018-11-15  2:15                                     ` Travis Downs

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).