All of lore.kernel.org
 help / color / mirror / Atom feed
* [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
@ 2015-11-18  7:20 Wangnan (F)
  2015-11-18  8:00 ` Wangnan (F)
  2015-11-18  8:48 ` Jiri Olsa
  0 siblings, 2 replies; 20+ messages in thread
From: Wangnan (F) @ 2015-11-18  7:20 UTC (permalink / raw)
  To: Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Peter Zijlstra,
	Ingo Molnar, Milian Wolff
  Cc: linux-kernel, pi3orama, lizefan 00213767

Hi all,

When analysising Jiri's patchset [1] I found a dwarf unwind problem.
On x86 platform, when sample is at a 'callq' instruction, dwarf based
stack unwind always fail.

I compile a small C source file with debug information, turn off
frame pointer and disable optimization:

$ gcc -g -O0 -fomit-frame-pointer ./test_dwarf_unwind.c -o 
./test_dwarf_unwind

Here is part of asm code. Please have a look at 40057d. Also, have a
look at 400534 and 4005bc, from these two instructions we know this
"funcc" need 0x38 bytes on stack, so we know where is the return address.

$ objdump -S ./test_dwarf_unwind
...
0000000000400534 <funcc>:
#include <sys/time.h>

static volatile int x = 0;

int funcc(void)
{
   400534:       48 83 ec 38             sub    $0x38,%rsp
     struct timeval tv1, tv2;
     unsigned long us1, us2;

     gettimeofday(&tv1, NULL);
   400538:       48 8d 44 24 10          lea    0x10(%rsp),%rax
   ...
   400572:       48 89 e0                mov    %rsp,%rax
   400575:       be 00 00 00 00          mov    $0x0,%esi
   40057a:       48 89 c7                mov    %rax,%rdi
   40057d:       e8 7e fe ff ff          callq  400400 
<gettimeofday@plt>    <---!!!
         us2 = tv2.tv_sec * 1000000 + tv2.tv_usec;
   400582:       48 8b 04 24             mov    (%rsp),%rax
   400586:       48 69 d0 40 42 0f 00    imul   $0xf4240,%rax,%rdx
   ...
     return x;
   4005b6:       8b 05 74 04 20 00       mov 0x200474(%rip),%eax        
# 600a30 <x>
}
   4005bc:       48 83 c4 38             add    $0x38,%rsp
   4005c0:       c3                      retq

Then use perf to trace it:

# ./pref record  -g -F 9 --call-graph dwarf ./test_dwarf_unwind
# ./perf report  --no-children --stdio
      ...
      1.23%  test_dwarf_unwi  test_dwarf_unwind  [.] funcc
             |
             ---funcc
                |
                |--0.62%-- funcb
                |          funca
                |          main
                |          __libc_start_main
                |          _start
                |
                 --0.62%-- 0   <-- I applied Jiri's patch in [1] so see 
this 0,
                                   which indicates a broken call chain.

 From 'perf script' result I found the two samples inside funcc:

# ./perf script
...
test_dwarf_unwi 25232 2201855.424651:  443365670 cycles:pp:
                   4005a4 funcc (/home/w00229757/test_dwarf_unwind)
                   4005ca funcb (/home/w00229757/test_dwarf_unwind)
                   4005d8 funca (/home/w00229757/test_dwarf_unwind)
                   4005e6 main (/home/w00229757/test_dwarf_unwind)
             7facc5c2ebd5 __libc_start_main 
(/tmp/oxygen_root-w00229757/lib64/libc-2.18.so)
                   400459 _start (/home/w00229757/test_dwarf_unwind)
...
test_dwarf_unwi 25232 2201856.647088:  443268987 cycles:pp:
                   40057d funcc (/home/w00229757/test_dwarf_unwind)
                        0 [unknown] ([unknown])

With perf report -D I can manually check their stack information manually.


This is stack of sample at 4005a4:

# ./perf report -D
...
0x1d9c8 [0x20e8]: event: 9
.
. ... raw event: size 8424 bytes
.  0000:  09 00 00 00 02 40 e8 20 a4 05 40 00 00 00 00 00  .....@. ..@.....
...
.  00d0:  00 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00  ......... ......
.  00e0:  2b f6 4c 56 00 00 00 00 46 52 02 00 00 00 00 00 +.LV....FR......
.  00f0:  29 f6 4c 56 00 00 00 00 70 e4 0c 00 00 00 00 00 ).LV....p.......
.  0100:  06 f3 3b d7 d7 24 05 00 b0 00 28 d7 d7 24 05 00 ..;..$....(..$..
.  0110:  00 00 00 00 00 00 00 00 ca 05 40 00 00 00 00 00
...
.  20e0:  b0 06 00 00 00 00 00 00                          ........
.
2201855424651081 0x1d9c8 [0x20e8]: PERF_RECORD_SAMPLE(IP, 0x4002): 
25232/25232: 0x4005a4 period: 443365670 addr: 0
... FP chain: nr:0
... user regs: mask 0xff0fff ABI 64-bit
.......
.... IP    0x4005a4
.......
... ustack: size 1712, offset 0xd8
....

 From the above information we can check its stack from 0xe0 (stack size
reside at 0xd8 as a u64 so stack data start at 0xe0). Based on information
we get from the above asm code, we know the return address should at the
0x38 of the stack, which is 0x4005ca.

This is stack of sample at 40057d:

0x343c0 [0x20e8]: event: 9
.
. ... raw event: size 8424 bytes
.  0000:  09 00 00 00 02 40 e8 20 7d 05 40 00 00 00 00 00  .....@. }.@.....
...
.  00d0:  00 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00  ......... ......
.  00e0:  82 05 40 00 00 00 00 00 2c f6 4c 56 00 00 00 00 ..@.....,.LV....
.  00f0:  84 b7 05 00 00 00 00 00 29 f6 4c 56 00 00 00 00 ........).LV....
.  0100:  70 e4 0c 00 00 00 00 00 84 9a 4e d7 d7 24 05 00 p.........N..$..
.  0110:  b0 00 28 d7 d7 24 05 00 00 00 00 00 00 00 00 00 ..(..$..........
.  0120:  ca 05 40 00 00 00 00 00 f0 05 40 00 00 00 00 00 ..@.......@.....
...
.  20e0:  b8 06 00 00 00 00 00 00                          ........
.
2201856647088808 0x343c0 [0x20e8]: PERF_RECORD_SAMPLE(IP, 0x4002): 
25232/25232: 0x40057d period: 443268987 addr: 0
... FP chain: nr:0
... user regs: mask 0xff0fff ABI 64-bit
......
.... IP    0x40057d
......
... ustack: size 1720, offset 0xd8
.....

By manually checking user stack from 0xe0, we can see problem that: 
although the IP
is still at 'callq', that instruction seems already take effect that, it has
already pushes its return address (0x400582) onto stack.

So there must be somewhere the pt_regs or the stack is adjusted in kernel.
Does anyone can give some hint on it?

Thank you.

[1] 
http://lkml.kernel.org/r/1447772739-18471-1-git-send-email-jolsa@kernel.org


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-18  7:20 [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction Wangnan (F)
@ 2015-11-18  8:00 ` Wangnan (F)
  2015-11-18  8:20   ` Ingo Molnar
  2015-11-18  8:48 ` Jiri Olsa
  1 sibling, 1 reply; 20+ messages in thread
From: Wangnan (F) @ 2015-11-18  8:00 UTC (permalink / raw)
  To: Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Peter Zijlstra,
	Ingo Molnar, Milian Wolff
  Cc: linux-kernel, pi3orama, lizefan 00213767



On 2015/11/18 15:20, Wangnan (F) wrote:
> Hi all,
>
> When analysising Jiri's patchset [1] I found a dwarf unwind problem.
> On x86 platform, when sample is at a 'callq' instruction, dwarf based
> stack unwind always fail.
>
> I compile a small C source file with debug information, turn off
> frame pointer and disable optimization:
>
> $ gcc -g -O0 -fomit-frame-pointer ./test_dwarf_unwind.c -o 
> ./test_dwarf_unwind

For whom want to test it: here is the test code I used.

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

static volatile int x = 0;

int funcc(void)
{
     struct timeval tv1, tv2;
     unsigned long us1, us2;

     gettimeofday(&tv1, NULL);

     us1 = tv1.tv_sec * 1000000 + tv1.tv_usec;

     while(1) {
         x = x + 100;
         gettimeofday(&tv2, NULL);
         us2 = tv2.tv_sec * 1000000 + tv2.tv_usec;
         if (us2 - us1 >= 3000000)
             break;
     }
     return x;
}
int funcb(void) { return funcc();}
int funca(void) { return funcb();}
int main() { funca(); return 0;}




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-18  8:00 ` Wangnan (F)
@ 2015-11-18  8:20   ` Ingo Molnar
  2015-11-18  8:42     ` Wangnan (F)
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2015-11-18  8:20 UTC (permalink / raw)
  To: Wangnan (F)
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Peter Zijlstra,
	Milian Wolff, linux-kernel, pi3orama, lizefan 00213767


* Wangnan (F) <wangnan0@huawei.com> wrote:

> On 2015/11/18 15:20, Wangnan (F) wrote:
> >Hi all,
> >
> >When analysising Jiri's patchset [1] I found a dwarf unwind problem.
> >On x86 platform, when sample is at a 'callq' instruction, dwarf based
> >stack unwind always fail.
> >
> >I compile a small C source file with debug information, turn off
> >frame pointer and disable optimization:
> >
> >$ gcc -g -O0 -fomit-frame-pointer ./test_dwarf_unwind.c -o
> >./test_dwarf_unwind
> 
> For whom want to test it: here is the test code I used.
> 
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/time.h>
> 
> static volatile int x = 0;
> 
> int funcc(void)
> {
>     struct timeval tv1, tv2;
>     unsigned long us1, us2;
> 
>     gettimeofday(&tv1, NULL);
> 
>     us1 = tv1.tv_sec * 1000000 + tv1.tv_usec;
> 
>     while(1) {
>         x = x + 100;
>         gettimeofday(&tv2, NULL);
>         us2 = tv2.tv_sec * 1000000 + tv2.tv_usec;
>         if (us2 - us1 >= 3000000)
>             break;
>     }
>     return x;
> }
> int funcb(void) { return funcc();}
> int funca(void) { return funcb();}
> int main() { funca(); return 0;}

What CPU model is this, and what event was used - PEBS perhaps? This might be some 
sort of PMU sampling bug/quirk/misfeature - or perhaps a kernel side fixup that 
went bad?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-18  8:20   ` Ingo Molnar
@ 2015-11-18  8:42     ` Wangnan (F)
  2015-11-18  8:49       ` Wangnan (F)
  0 siblings, 1 reply; 20+ messages in thread
From: Wangnan (F) @ 2015-11-18  8:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Peter Zijlstra,
	Milian Wolff, linux-kernel, pi3orama, lizefan 00213767



On 2015/11/18 16:20, Ingo Molnar wrote:
> * Wangnan (F) <wangnan0@huawei.com> wrote:
>
>> On 2015/11/18 15:20, Wangnan (F) wrote:
>>> Hi all,
>>>
>>> When analysising Jiri's patchset [1] I found a dwarf unwind problem.
>>> On x86 platform, when sample is at a 'callq' instruction, dwarf based
>>> stack unwind always fail.
>>>
>>> I compile a small C source file with debug information, turn off
>>> frame pointer and disable optimization:
>>>
>>> $ gcc -g -O0 -fomit-frame-pointer ./test_dwarf_unwind.c -o
>>> ./test_dwarf_unwind
>> For whom want to test it: here is the test code I used.
>>
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <sys/time.h>
>>
>> static volatile int x = 0;
>>
>> int funcc(void)
>> {
>>      struct timeval tv1, tv2;
>>      unsigned long us1, us2;
>>
>>      gettimeofday(&tv1, NULL);
>>
>>      us1 = tv1.tv_sec * 1000000 + tv1.tv_usec;
>>
>>      while(1) {
>>          x = x + 100;
>>          gettimeofday(&tv2, NULL);
>>          us2 = tv2.tv_sec * 1000000 + tv2.tv_usec;
>>          if (us2 - us1 >= 3000000)
>>              break;
>>      }
>>      return x;
>> }
>> int funcb(void) { return funcc();}
>> int funca(void) { return funcb();}
>> int main() { funca(); return 0;}
> What CPU model is this, and what event was used - PEBS perhaps? This might be some
> sort of PMU sampling bug/quirk/misfeature - or perhaps a kernel side fixup that
> went bad?

$ cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 60
model name    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping    : 3
microcode    : 0x1c
cpu MHz        : 3600.000
cache size    : 8192 KB
physical id    : 0
siblings    : 8
core id        : 0
cpu cores    : 4
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl 
xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor 
ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 
x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm 
abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid 
fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bugs        :
bogomips    : 7183.88
clflush size    : 64
cache_alignment    : 64
address sizes    : 39 bits physical, 48 bits virtual
power management:


perf cmdline is

# ./pref record  -g -F 9 --call-graph dwarf ./test_dwarf_unwind

Use default events, precise_ip == 2 so uses PEBS.

Thank you.

>
> Thanks,
>
> 	Ingo



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-18  7:20 [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction Wangnan (F)
  2015-11-18  8:00 ` Wangnan (F)
@ 2015-11-18  8:48 ` Jiri Olsa
  2015-11-18  9:02   ` Wangnan (F)
  1 sibling, 1 reply; 20+ messages in thread
From: Jiri Olsa @ 2015-11-18  8:48 UTC (permalink / raw)
  To: Wangnan (F)
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Peter Zijlstra,
	Ingo Molnar, Milian Wolff, linux-kernel, pi3orama,
	lizefan 00213767

On Wed, Nov 18, 2015 at 03:20:36PM +0800, Wangnan (F) wrote:

SNIP

> 
> Then use perf to trace it:
> 
> # ./pref record  -g -F 9 --call-graph dwarf ./test_dwarf_unwind
> # ./perf report  --no-children --stdio
>      ...
>      1.23%  test_dwarf_unwi  test_dwarf_unwind  [.] funcc
>             |
>             ---funcc
>                |
>                |--0.62%-- funcb
>                |          funca
>                |          main
>                |          __libc_start_main
>                |          _start
>                |
>                 --0.62%-- 0   <-- I applied Jiri's patch in [1] so see this
> 0,
>                                   which indicates a broken call chain.

can't see this case after multiple running your test,
just the 'end-of-the-callchain' zeros

    49.14%  test_dwarf_unwi  [vdso]             [.] __vdso_gettimeofday  
            |
            ---__vdso_gettimeofday
               funcc
               funcb
               funca
               main
               __libc_start_main
               _start
               0

    40.78%  test_dwarf_unwi  [vdso]             [.] 0x0000000000000987   
            |
            ---0x7ffe28ab7987
               __vdso_gettimeofday
               funcc
               funcb
               funca
               main
               __libc_start_main
               _start
               0


what kernel do you have?

jirka

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-18  8:42     ` Wangnan (F)
@ 2015-11-18  8:49       ` Wangnan (F)
  2015-11-19  6:37         ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Wangnan (F) @ 2015-11-18  8:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Peter Zijlstra,
	Milian Wolff, linux-kernel, pi3orama, lizefan 00213767



On 2015/11/18 16:42, Wangnan (F) wrote:
>
>
> On 2015/11/18 16:20, Ingo Molnar wrote:
>> * Wangnan (F) <wangnan0@huawei.com> wrote:
>>
>>> On 2015/11/18 15:20, Wangnan (F) wrote:
>>>> Hi all,
>>>>
>>>> When analysising Jiri's patchset [1] I found a dwarf unwind problem.
>>>> On x86 platform, when sample is at a 'callq' instruction, dwarf based
>>>> stack unwind always fail.
>>>>
>>>> I compile a small C source file with debug information, turn off
>>>> frame pointer and disable optimization:
>>>>
>>>> $ gcc -g -O0 -fomit-frame-pointer ./test_dwarf_unwind.c -o
>>>> ./test_dwarf_unwind
>>> For whom want to test it: here is the test code I used.
>>>
>>> #include <stdio.h>
>>> #include <unistd.h>
>>> #include <sys/time.h>
>>>
>>> static volatile int x = 0;
>>>
>>> int funcc(void)
>>> {
>>>      struct timeval tv1, tv2;
>>>      unsigned long us1, us2;
>>>
>>>      gettimeofday(&tv1, NULL);
>>>
>>>      us1 = tv1.tv_sec * 1000000 + tv1.tv_usec;
>>>
>>>      while(1) {
>>>          x = x + 100;
>>>          gettimeofday(&tv2, NULL);
>>>          us2 = tv2.tv_sec * 1000000 + tv2.tv_usec;
>>>          if (us2 - us1 >= 3000000)
>>>              break;
>>>      }
>>>      return x;
>>> }
>>> int funcb(void) { return funcc();}
>>> int funca(void) { return funcb();}
>>> int main() { funca(); return 0;}
>> What CPU model is this, and what event was used - PEBS perhaps? This 
>> might be some
>> sort of PMU sampling bug/quirk/misfeature - or perhaps a kernel side 
>> fixup that
>> went bad?
>
> $ cat /proc/cpuinfo
> processor    : 0
> vendor_id    : GenuineIntel
> cpu family    : 6
> model        : 60
> model name    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
> stepping    : 3
> microcode    : 0x1c
> cpu MHz        : 3600.000
> cache size    : 8192 KB
> physical id    : 0
> siblings    : 8
> core id        : 0
> cpu cores    : 4
> apicid        : 0
> initial apicid    : 0
> fpu        : yes
> fpu_exception    : yes
> cpuid level    : 13
> wp        : yes
> flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts 
> rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq 
> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm 
> pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave 
> avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow 
> vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 
> erms invpcid xsaveopt
> bugs        :
> bogomips    : 7183.88
> clflush size    : 64
> cache_alignment    : 64
> address sizes    : 39 bits physical, 48 bits virtual
> power management:
>
>
> perf cmdline is
>
> # ./pref record  -g -F 9 --call-graph dwarf ./test_dwarf_unwind
>
> Use default events, precise_ip == 2 so uses PEBS.
>

Testetd 'cycles', 'cycles:p' and 'cycles:pp'. Only 'cycles:pp' captures
sample at callq. So maybe a PEBS problem?

Thank you.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-18  8:48 ` Jiri Olsa
@ 2015-11-18  9:02   ` Wangnan (F)
  0 siblings, 0 replies; 20+ messages in thread
From: Wangnan (F) @ 2015-11-18  9:02 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Peter Zijlstra,
	Ingo Molnar, Milian Wolff, linux-kernel, pi3orama,
	lizefan 00213767



On 2015/11/18 16:48, Jiri Olsa wrote:
> On Wed, Nov 18, 2015 at 03:20:36PM +0800, Wangnan (F) wrote:
>
> SNIP
>
>> Then use perf to trace it:
>>
>> # ./pref record  -g -F 9 --call-graph dwarf ./test_dwarf_unwind
>> # ./perf report  --no-children --stdio
>>       ...
>>       1.23%  test_dwarf_unwi  test_dwarf_unwind  [.] funcc
>>              |
>>              ---funcc
>>                 |
>>                 |--0.62%-- funcb
>>                 |          funca
>>                 |          main
>>                 |          __libc_start_main
>>                 |          _start
>>                 |
>>                  --0.62%-- 0   <-- I applied Jiri's patch in [1] so see this
>> 0,
>>                                    which indicates a broken call chain.
> can't see this case after multiple running your test,
> just the 'end-of-the-callchain' zeros
>
>      49.14%  test_dwarf_unwi  [vdso]             [.] __vdso_gettimeofday
>              |
>              ---__vdso_gettimeofday
>                 funcc
>                 funcb
>                 funca
>                 main
>                 __libc_start_main
>                 _start
>                 0
>
>      40.78%  test_dwarf_unwi  [vdso]             [.] 0x0000000000000987
>              |
>              ---0x7ffe28ab7987
>                 __vdso_gettimeofday
>                 funcc
>                 funcb
>                 funca
>                 main
>                 __libc_start_main
>                 _start
>                 0
>
>
> what kernel do you have?

My kernel is 4.3-rc4.

Please see my reply to Ingo. It is possible a problem related
to hardware and PEBS? Maybe on your system perf never capture
samples on 'callq'? I tested 'cycle' and 'cycle:p', can't get
such samples. Only 'cycles:pp' (the default event) have this
problem.

Thank you.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-18  8:49       ` Wangnan (F)
@ 2015-11-19  6:37         ` Ingo Molnar
  2015-11-19  6:45           ` Wangnan (F)
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2015-11-19  6:37 UTC (permalink / raw)
  To: Wangnan (F)
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Peter Zijlstra,
	Milian Wolff, linux-kernel, pi3orama, lizefan 00213767


* Wangnan (F) <wangnan0@huawei.com> wrote:

> >perf cmdline is
> >
> ># ./pref record  -g -F 9 --call-graph dwarf ./test_dwarf_unwind
> >
> >Use default events, precise_ip == 2 so uses PEBS.
> >
> 
> Testetd 'cycles', 'cycles:p' and 'cycles:pp'. Only 'cycles:pp' captures
> sample at callq. So maybe a PEBS problem?

Well, that's how our PEBS sampling works: we roll back the instruction pointer to 
point at the instruction generating the sample. The state itself is 
post-instruction.

So dwarf unwind has to take this into account. Should the kernel pass in the 
original (uncorrected) RIP as well, perhaps?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-19  6:37         ` Ingo Molnar
@ 2015-11-19  6:45           ` Wangnan (F)
  2015-11-19 10:23             ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Wangnan (F) @ 2015-11-19  6:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Peter Zijlstra,
	Milian Wolff, linux-kernel, pi3orama, lizefan 00213767



On 2015/11/19 14:37, Ingo Molnar wrote:
> * Wangnan (F) <wangnan0@huawei.com> wrote:
>
>>> perf cmdline is
>>>
>>> # ./pref record  -g -F 9 --call-graph dwarf ./test_dwarf_unwind
>>>
>>> Use default events, precise_ip == 2 so uses PEBS.
>>>
>> Testetd 'cycles', 'cycles:p' and 'cycles:pp'. Only 'cycles:pp' captures
>> sample at callq. So maybe a PEBS problem?
> Well, that's how our PEBS sampling works: we roll back the instruction pointer to
> point at the instruction generating the sample. The state itself is
> post-instruction.

Just for curiosity:

how the interrupted process continue to execute, when the PC
saved in pt_regs still pointed to 'callq' but SP and stack has
already changes? Do we fix it in kernel, or by hardware?

Thank you.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-19  6:45           ` Wangnan (F)
@ 2015-11-19 10:23             ` Ingo Molnar
  2015-11-19 10:43               ` Wangnan (F)
  2015-11-19 11:23               ` Peter Zijlstra
  0 siblings, 2 replies; 20+ messages in thread
From: Ingo Molnar @ 2015-11-19 10:23 UTC (permalink / raw)
  To: Wangnan (F)
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Peter Zijlstra,
	Milian Wolff, linux-kernel, pi3orama, lizefan 00213767


* Wangnan (F) <wangnan0@huawei.com> wrote:

> 
> 
> On 2015/11/19 14:37, Ingo Molnar wrote:
> >* Wangnan (F) <wangnan0@huawei.com> wrote:
> >
> >>>perf cmdline is
> >>>
> >>># ./pref record  -g -F 9 --call-graph dwarf ./test_dwarf_unwind
> >>>
> >>>Use default events, precise_ip == 2 so uses PEBS.
> >>>
> >>Testetd 'cycles', 'cycles:p' and 'cycles:pp'. Only 'cycles:pp' captures
> >>sample at callq. So maybe a PEBS problem?
> >Well, that's how our PEBS sampling works: we roll back the instruction pointer to
> >point at the instruction generating the sample. The state itself is
> >post-instruction.
> 
> Just for curiosity:
> 
> how the interrupted process continue to execute, when the PC
> saved in pt_regs still pointed to 'callq' but SP and stack has
> already changes? Do we fix it in kernel, or by hardware?

PEBS is an asynchronous hardware tracing mechanism, when batched PEBS is used it 
might not even result in any interruption of execution. The 'pt_regs' does not 
necessarily correspond to an interrupted, restartable context - we take the RIP 
from the PEBS machinery and also use LBR and disassembly to determine the previous 
instruction, before reporting it to user-space.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-19 10:23             ` Ingo Molnar
@ 2015-11-19 10:43               ` Wangnan (F)
  2015-11-19 11:28                 ` Peter Zijlstra
  2015-11-19 11:23               ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Wangnan (F) @ 2015-11-19 10:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Peter Zijlstra,
	Milian Wolff, linux-kernel, pi3orama, lizefan 00213767



On 2015/11/19 18:23, Ingo Molnar wrote:
> * Wangnan (F) <wangnan0@huawei.com> wrote:
>
>>
>> On 2015/11/19 14:37, Ingo Molnar wrote:
>>> * Wangnan (F) <wangnan0@huawei.com> wrote:
>>>
>>>>> perf cmdline is
>>>>>
>>>>> # ./pref record  -g -F 9 --call-graph dwarf ./test_dwarf_unwind
>>>>>
>>>>> Use default events, precise_ip == 2 so uses PEBS.
>>>>>
>>>> Testetd 'cycles', 'cycles:p' and 'cycles:pp'. Only 'cycles:pp' captures
>>>> sample at callq. So maybe a PEBS problem?
>>> Well, that's how our PEBS sampling works: we roll back the instruction pointer to
>>> point at the instruction generating the sample. The state itself is
>>> post-instruction.
>> Just for curiosity:
>>
>> how the interrupted process continue to execute, when the PC
>> saved in pt_regs still pointed to 'callq' but SP and stack has
>> already changes? Do we fix it in kernel, or by hardware?
> PEBS is an asynchronous hardware tracing mechanism, when batched PEBS is used it
> might not even result in any interruption of execution. The 'pt_regs' does not
> necessarily correspond to an interrupted, restartable context - we take the RIP
> from the PEBS machinery and also use LBR and disassembly to determine the previous
> instruction, before reporting it to user-space.

You mean __intel_pmu_pebs_event(), which generates many perf_events?
Then their output are based on a same user stack, and could be error,
because the instruction has finished, and user stack could be modified.
Right?

Also, why not fixing rsp in kernel if that instruction is a 'callq'?
For avoiding instruction decoding?

Thank you.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-19 10:23             ` Ingo Molnar
  2015-11-19 10:43               ` Wangnan (F)
@ 2015-11-19 11:23               ` Peter Zijlstra
  2015-11-27  8:38                 ` Ingo Molnar
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2015-11-19 11:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Wangnan (F),
	Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Milian Wolff,
	linux-kernel, pi3orama, lizefan 00213767

On Thu, Nov 19, 2015 at 11:23:00AM +0100, Ingo Molnar wrote:
> PEBS is an asynchronous hardware tracing mechanism, when batched PEBS is used it 
> might not even result in any interruption of execution. The 'pt_regs' does not 
> necessarily correspond to an interrupted, restartable context - we take the RIP 
> from the PEBS machinery and also use LBR and disassembly to determine the previous 
> instruction, before reporting it to user-space.

Note that modern PEBS hardware (hsw+) does the rollback in hardware.
Prior to that we indeed to it manually using the LBR.

As to pt_regs, we construct a franken pt_regs based on the actual PEBS
buffer overflow PMI and bits from the PEBS record (which also includes
some register state). See
arch/x86/kernel/cpu/perf_event_intel_ds.c:setup_pebs_sample_data().

We always copy the flags, ip, bp and sp from the PEBS record into the
interrupt pt_regs.

And note that the PEBS record is constructed at instruction retirement,
so it shows the state _after_ the instruction, with exception of the
(hsw+) real_ip field.

So the unwinder will have to be taught that if the IP points at a stack
altering instruction (call, push, etc.) it will have to 'undo' the
effects on the actual stack (I appreciate this might be 'interesting'
for things like: pop, ret, etc.).


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-19 10:43               ` Wangnan (F)
@ 2015-11-19 11:28                 ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2015-11-19 11:28 UTC (permalink / raw)
  To: Wangnan (F)
  Cc: Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern,
	Milian Wolff, linux-kernel, pi3orama, lizefan 00213767

On Thu, Nov 19, 2015 at 06:43:14PM +0800, Wangnan (F) wrote:
> Also, why not fixing rsp in kernel if that instruction is a 'callq'?
> For avoiding instruction decoding?

Why would the kernel do that? It doesn't need this.

Have userspace do it when trying to interpret the data.

If PERF_EFLAGS_EXACT is set, you know the IP points to the actual
instruction and you can attempt to undo whatever parts of it you need at
post processing time.



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-19 11:23               ` Peter Zijlstra
@ 2015-11-27  8:38                 ` Ingo Molnar
  2015-11-30  9:28                   ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2015-11-27  8:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wangnan (F),
	Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Milian Wolff,
	linux-kernel, pi3orama, lizefan 00213767


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Nov 19, 2015 at 11:23:00AM +0100, Ingo Molnar wrote:
> > PEBS is an asynchronous hardware tracing mechanism, when batched PEBS is used it 
> > might not even result in any interruption of execution. The 'pt_regs' does not 
> > necessarily correspond to an interrupted, restartable context - we take the RIP 
> > from the PEBS machinery and also use LBR and disassembly to determine the previous 
> > instruction, before reporting it to user-space.
> 
> Note that modern PEBS hardware (hsw+) does the rollback in hardware.
> Prior to that we indeed to it manually using the LBR.
> 
> As to pt_regs, we construct a franken pt_regs based on the actual PEBS
> buffer overflow PMI and bits from the PEBS record (which also includes
> some register state). See
> arch/x86/kernel/cpu/perf_event_intel_ds.c:setup_pebs_sample_data().
> 
> We always copy the flags, ip, bp and sp from the PEBS record into the
> interrupt pt_regs.
> 
> And note that the PEBS record is constructed at instruction retirement,
> so it shows the state _after_ the instruction, with exception of the
> (hsw+) real_ip field.
> 
> So the unwinder will have to be taught that if the IP points at a stack
> altering instruction (call, push, etc.) it will have to 'undo' the
> effects on the actual stack (I appreciate this might be 'interesting'
> for things like: pop, ret, etc.).

So do we dump both the 'real' and the actual RIP, to not force tooling into having 
to decode instructions and such? (Which is pretty hard and fragile and not always 
possible with instructions that destroy the original RIP, like JMP, etc.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-27  8:38                 ` Ingo Molnar
@ 2015-11-30  9:28                   ` Peter Zijlstra
  2015-12-01  7:28                     ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2015-11-30  9:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Wangnan (F),
	Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Milian Wolff,
	linux-kernel, pi3orama, lizefan 00213767

On Fri, Nov 27, 2015 at 09:38:11AM +0100, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, Nov 19, 2015 at 11:23:00AM +0100, Ingo Molnar wrote:
> > > PEBS is an asynchronous hardware tracing mechanism, when batched PEBS is used it 
> > > might not even result in any interruption of execution. The 'pt_regs' does not 
> > > necessarily correspond to an interrupted, restartable context - we take the RIP 
> > > from the PEBS machinery and also use LBR and disassembly to determine the previous 
> > > instruction, before reporting it to user-space.
> > 
> > Note that modern PEBS hardware (hsw+) does the rollback in hardware.
> > Prior to that we indeed to it manually using the LBR.
> > 
> > As to pt_regs, we construct a franken pt_regs based on the actual PEBS
> > buffer overflow PMI and bits from the PEBS record (which also includes
> > some register state). See
> > arch/x86/kernel/cpu/perf_event_intel_ds.c:setup_pebs_sample_data().
> > 
> > We always copy the flags, ip, bp and sp from the PEBS record into the
> > interrupt pt_regs.
> > 
> > And note that the PEBS record is constructed at instruction retirement,
> > so it shows the state _after_ the instruction, with exception of the
> > (hsw+) real_ip field.
> > 
> > So the unwinder will have to be taught that if the IP points at a stack
> > altering instruction (call, push, etc.) it will have to 'undo' the
> > effects on the actual stack (I appreciate this might be 'interesting'
> > for things like: pop, ret, etc.).
> 
> So do we dump both the 'real' and the actual RIP, to not force tooling into having 
> to decode instructions and such?

Nope, we only expose the corrected one.

> (Which is pretty hard and fragile and not always 
> possible with instructions that destroy the original RIP, like JMP, etc.)

Not sure what you're getting at here. We don't need the uncorrected
instruction.

But the problem here is that we rewind the instruction stream, but not
the stack. And the stack unwinder is (obviously) interested in the stack
state.

I'm not sure we want (or need) to go undo the specific instruction's
stack effect in-kernel. If the !DWARF unwinders are similarly confused
we might need to put it in kernel (expensive *groan*). If its only the
DWARF muck then its something that can be done in userspace just
fine, although we might need to copy slightly more of the stack than SP
is pointing at, such that we can undo RET/POP etc. which would have data
beyond the head of stack.

The easiest solution might be to figure out the biggest stack offset for
any instruction and always capture that much over the head of stack.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-11-30  9:28                   ` Peter Zijlstra
@ 2015-12-01  7:28                     ` Ingo Molnar
  2015-12-01  8:38                       ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2015-12-01  7:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wangnan (F),
	Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Milian Wolff,
	linux-kernel, pi3orama, lizefan 00213767


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Nov 27, 2015 at 09:38:11AM +0100, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Thu, Nov 19, 2015 at 11:23:00AM +0100, Ingo Molnar wrote:
> > > > PEBS is an asynchronous hardware tracing mechanism, when batched PEBS is used it 
> > > > might not even result in any interruption of execution. The 'pt_regs' does not 
> > > > necessarily correspond to an interrupted, restartable context - we take the RIP 
> > > > from the PEBS machinery and also use LBR and disassembly to determine the previous 
> > > > instruction, before reporting it to user-space.
> > > 
> > > Note that modern PEBS hardware (hsw+) does the rollback in hardware.
> > > Prior to that we indeed to it manually using the LBR.
> > > 
> > > As to pt_regs, we construct a franken pt_regs based on the actual PEBS
> > > buffer overflow PMI and bits from the PEBS record (which also includes
> > > some register state). See
> > > arch/x86/kernel/cpu/perf_event_intel_ds.c:setup_pebs_sample_data().
> > > 
> > > We always copy the flags, ip, bp and sp from the PEBS record into the
> > > interrupt pt_regs.
> > > 
> > > And note that the PEBS record is constructed at instruction retirement,
> > > so it shows the state _after_ the instruction, with exception of the
> > > (hsw+) real_ip field.
> > > 
> > > So the unwinder will have to be taught that if the IP points at a stack
> > > altering instruction (call, push, etc.) it will have to 'undo' the
> > > effects on the actual stack (I appreciate this might be 'interesting'
> > > for things like: pop, ret, etc.).
> > 
> > So do we dump both the 'real' and the actual RIP, to not force tooling into having 
> > to decode instructions and such?
> 
> Nope, we only expose the corrected one.
> 
> > (Which is pretty hard and fragile and not always 
> > possible with instructions that destroy the original RIP, like JMP, etc.)
> 
> Not sure what you're getting at here. We don't need the uncorrected
> instruction.

Well, we need it for stack unwinding, as you point it out:

> But the problem here is that we rewind the instruction stream, but not
> the stack. And the stack unwinder is (obviously) interested in the stack
> state.

Unwinding the stack state would fix it as well - but an equivalent solution would 
be to pass along the original RIP would fix it as well: we'd have a 
self-consistent pair of RIP/RSP.

Especially since unwinding the RSP is probably hard:

> I'm not sure we want (or need) to go undo the specific instruction's
> stack effect in-kernel. If the !DWARF unwinders are similarly confused
> we might need to put it in kernel (expensive *groan*). If its only the
> DWARF muck then its something that can be done in userspace just
> fine, although we might need to copy slightly more of the stack than SP
> is pointing at, such that we can undo RET/POP etc. which would have data
> beyond the head of stack.
> 
> The easiest solution might be to figure out the biggest stack offset for
> any instruction and always capture that much over the head of stack.

so I think the problem here is that the RSP does not match up to the RIP. We can 
either pass along the original RIP+RSP, or the fixed up one - but what we do 
currently is that we pass along only half of it - which corrupts dwarf unwinding 
state that doesn't tolerate such errors.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-12-01  7:28                     ` Ingo Molnar
@ 2015-12-01  8:38                       ` Peter Zijlstra
  2015-12-01 16:11                         ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2015-12-01  8:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Wangnan (F),
	Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Milian Wolff,
	linux-kernel, pi3orama, lizefan 00213767

On Tue, Dec 01, 2015 at 08:28:26AM +0100, Ingo Molnar wrote:
> > Not sure what you're getting at here. We don't need the uncorrected
> > instruction.
> 
> Well, we need it for stack unwinding, as you point it out:
> 
> > But the problem here is that we rewind the instruction stream, but not
> > the stack. And the stack unwinder is (obviously) interested in the stack
> > state.
> 
> Unwinding the stack state would fix it as well - but an equivalent solution would 
> be to pass along the original RIP would fix it as well: we'd have a 
> self-consistent pair of RIP/RSP.
> 
> Especially since unwinding the RSP is probably hard:
> 
> > I'm not sure we want (or need) to go undo the specific instruction's
> > stack effect in-kernel. If the !DWARF unwinders are similarly confused
> > we might need to put it in kernel (expensive *groan*). If its only the
> > DWARF muck then its something that can be done in userspace just
> > fine, although we might need to copy slightly more of the stack than SP
> > is pointing at, such that we can undo RET/POP etc. which would have data
> > beyond the head of stack.
> > 
> > The easiest solution might be to figure out the biggest stack offset for
> > any instruction and always capture that much over the head of stack.
> 
> so I think the problem here is that the RSP does not match up to the RIP. We can 
> either pass along the original RIP+RSP, or the fixed up one - but what we do 
> currently is that we pass along only half of it - which corrupts dwarf unwinding 
> state that doesn't tolerate such errors.

Still not sure what that gets you. Then you get a sample at a known
wrong location, why would you want that?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-12-01  8:38                       ` Peter Zijlstra
@ 2015-12-01 16:11                         ` Ingo Molnar
  2015-12-01 17:21                           ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2015-12-01 16:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wangnan (F),
	Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Milian Wolff,
	linux-kernel, pi3orama, lizefan 00213767


* Peter Zijlstra <peterz@infradead.org> wrote:

> > so I think the problem here is that the RSP does not match up to the RIP. We 
> > can either pass along the original RIP+RSP, or the fixed up one - but what we 
> > do currently is that we pass along only half of it - which corrupts dwarf 
> > unwinding state that doesn't tolerate such errors.
> 
> Still not sure what that gets you. Then you get a sample at a known wrong 
> location, why would you want that?

Well, we'd at least get a valid call trace - which the 'mixed' one isn't? I.e. 
this only matters with --call-graph.

But yeah, with my suggestion we'd essentially fall back from cycles:pp to 
cycles:p, ideally we'd want to have real_rsp. Does the hardware provide that?

User-space cannot compute that reliably I think, what if the 'real' instruction 
was manipulating RSP in more complex ways than doing a CALL?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-12-01 16:11                         ` Ingo Molnar
@ 2015-12-01 17:21                           ` Peter Zijlstra
  2015-12-02  9:55                             ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2015-12-01 17:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Wangnan (F),
	Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Milian Wolff,
	linux-kernel, pi3orama, lizefan 00213767

On Tue, Dec 01, 2015 at 05:11:25PM +0100, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > so I think the problem here is that the RSP does not match up to the RIP. We 
> > > can either pass along the original RIP+RSP, or the fixed up one - but what we 
> > > do currently is that we pass along only half of it - which corrupts dwarf 
> > > unwinding state that doesn't tolerate such errors.
> > 
> > Still not sure what that gets you. Then you get a sample at a known wrong 
> > location, why would you want that?
> 
> Well, we'd at least get a valid call trace - which the 'mixed' one isn't? I.e. 
> this only matters with --call-graph.
> 
> But yeah, with my suggestion we'd essentially fall back from cycles:pp to 
> cycles:p, ideally we'd want to have real_rsp. Does the hardware provide that?

No, no real_rsp.

> User-space cannot compute that reliably I think, what if the 'real' instruction 
> was manipulating RSP in more complex ways than doing a CALL?

I'm not really too aware of these asm details :/ 

Jiri, what is in PERF_SAMPLE_REGS_USER.IP ? from a quick reading that is
whatever is returned from task_pt_regs(current), not the
perf_data.regs.ip field which contains the corrected IP.

Should the uwinder then not use PERF_SAMPLE_REGS_USER.{IP,SP} for a
consistent unwind?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction
  2015-12-01 17:21                           ` Peter Zijlstra
@ 2015-12-02  9:55                             ` Ingo Molnar
  0 siblings, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2015-12-02  9:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wangnan (F),
	Jiri Olsa, Arnaldo Carvalho de Melo, David Ahern, Milian Wolff,
	linux-kernel, pi3orama, lizefan 00213767


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Dec 01, 2015 at 05:11:25PM +0100, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > > so I think the problem here is that the RSP does not match up to the RIP. We 
> > > > can either pass along the original RIP+RSP, or the fixed up one - but what we 
> > > > do currently is that we pass along only half of it - which corrupts dwarf 
> > > > unwinding state that doesn't tolerate such errors.
> > > 
> > > Still not sure what that gets you. Then you get a sample at a known wrong 
> > > location, why would you want that?
> > 
> > Well, we'd at least get a valid call trace - which the 'mixed' one isn't? I.e. 
> > this only matters with --call-graph.
> > 
> > But yeah, with my suggestion we'd essentially fall back from cycles:pp to 
> > cycles:p, ideally we'd want to have real_rsp. Does the hardware provide that?
> 
> No, no real_rsp.
> 
> > User-space cannot compute that reliably I think, what if the 'real' instruction 
> > was manipulating RSP in more complex ways than doing a CALL?
> 
> I'm not really too aware of these asm details :/ 

So the way I understand it is that dwarf unwind relies on having consistent pairs 
of {RIP,RSP}, to be able to look at the precise instruction, look up the debuginfo 
what the function-internal stack offset at that instruction is (as there's no RBP 
frame information available so the stack offset varies).

That method really relies on having precise {RIP,RSP} pairs - it may otherwise 
lose its notion of where the next return address on the stack is, and the 
backtrace goes off into la-la-land. I suppose such broken callchains were how 
Wangnan noticed the problem?

( As a comparison, RBP based backtraces only rely on having an exact RBP available 
  as a starting point, that will define the next backtrace link and the next RBP
  value. That is why we can do the backtrace walk in the kernel. )

There are other ways a dwarf unwind can fail: for example if there's no dwarf 
debuginfo available for the binary in question, or if the stack snapshot we did is 
too small to find the next link in the backtrace. But having an off-by-few RIP is 
enough to throw the backtrace off at the very first step.

> Jiri, what is in PERF_SAMPLE_REGS_USER.IP ? from a quick reading that is 
> whatever is returned from task_pt_regs(current), not the perf_data.regs.ip field 
> which contains the corrected IP.
> 
> Should the uwinder then not use PERF_SAMPLE_REGS_USER.{IP,SP} for a consistent 
> unwind?

Yeah, if we have the real RIP belonging to that RSP value, then all should be 
golden. That way the unwinder can find the next link. (If all the other conditions 
for a successful backtrace are also there.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-12-02  9:56 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-18  7:20 [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling taken at 'callq' instruction Wangnan (F)
2015-11-18  8:00 ` Wangnan (F)
2015-11-18  8:20   ` Ingo Molnar
2015-11-18  8:42     ` Wangnan (F)
2015-11-18  8:49       ` Wangnan (F)
2015-11-19  6:37         ` Ingo Molnar
2015-11-19  6:45           ` Wangnan (F)
2015-11-19 10:23             ` Ingo Molnar
2015-11-19 10:43               ` Wangnan (F)
2015-11-19 11:28                 ` Peter Zijlstra
2015-11-19 11:23               ` Peter Zijlstra
2015-11-27  8:38                 ` Ingo Molnar
2015-11-30  9:28                   ` Peter Zijlstra
2015-12-01  7:28                     ` Ingo Molnar
2015-12-01  8:38                       ` Peter Zijlstra
2015-12-01 16:11                         ` Ingo Molnar
2015-12-01 17:21                           ` Peter Zijlstra
2015-12-02  9:55                             ` Ingo Molnar
2015-11-18  8:48 ` Jiri Olsa
2015-11-18  9:02   ` Wangnan (F)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.