On Montag, 12. November 2018 04:26:37 CET Andi Kleen wrote:
> On Sat, Nov 10, 2018 at 09:50:05PM -0500, Travis Downs wrote:
> >    On Sat, Nov 10, 2018 at 8:07 PM Andi Kleen <ak@linux.intel.com> wrote:
> >      On Sat, Nov 10, 2018 at 04:42:48PM -0500, Travis Downs wrote:
> >      > I guess this problem doesn't occur for LBR unwinding since the LBR
> >      > records are captured at the same
> >      > moment in time as the PEBS record, so reflect the correct branch
> >      > sequence.
> >      
> >      Actually it happens with LBRs too, but it always gives the backtrace
> >      consistently at the PMI trigger point.
> >    
> >    That's weird - so the LBR records are from the PMI point, but the rest
> >    of
> >    the PEBS record comes from the PEBS trigger point? Or the LBR isn't
> >    part
> >    of PEBS at all?
> 
> LBR is not part of PEBS, but is collected separately in the PMI handler.
> 
> >      > overhead calculations will be based on the captured stacks, I guess
> >      > -
> >      > but when I annotate, will the values I see correspond to the PEBS
> >      > IPs
> >      > or the PMI IPs?
> >      
> >      Based on PEBS IPs.
> >      
> >      It would be a good idea to add a check to perf report
> >      that the two IPs are different, and if they differ
> >      add some indicator to the sample. This could be a new sort key,
> >      although that would waste some space on the screen, or something
> >      else.
> >    
> >    In the case that PEBS events are used, the IP will differ essentially
> >    100%
> >    of the time, right? That is, there will always be *some* skid.
> 
> I wouldn't say that.  It depends on what the CPU is doing and the IPC
> of the code.
> 
> Also the backtrace inconsistency can only happen if the sample races with
> function return. If you don't then the backtrace will point
> to the correct function, even though the unwind IP is different.
> 
> For example in the common case where you profile a long loop it
> is unlikely to happen.
> 
> >    indicating otherwise above), I could imagine a hybrid mode where LBR is
> >    used to go back some number of calls and then dwarf or FP or whatever
> >    unwinding takes over, because the further down the stack you do the
> >    more
> >    likely the PEBS trigger point and PMI point are likely to have a
> >    consistent stack.
> 
> Could collect numbers how often it happens, but it would surprise
> me if anything complicated is worth it. I would just do the minimum fixes
> to address the unwinder errors, and perhaps add the "unwind ip differs"
> indication.

I now have a preliminary WIP patch up and running (see attached), which works 
for my usecase and improves perf noticeably. All traces of "unknown" frames 
are eradicated, i.e. unwinding now works for 100% of the samples!

There are some remaining open questions on my side:

1) Do we really want to change the API of perf_event_overflow_* and 
perf_event_output_* and adapt all its users? To me, it seems as if only PEBS 
and IBS would want to pass distinct register sets for iregs and uregs. All 
other users of the API would continue to pass the same set. Changing the 
central API produces a lot of churn for no good reason. Does anyone see an 
alternative to this?

The only alternative idea I have right now is to temporarily change the 
sample_type in __intel_pmu_pebs_event before we call perf_event_output / 
perf_event_overflow. I.e. unset PERF_SAMPLE_REGS_INTR, then sample the regs 
manually from iregs before calling perf_event_{overflow,output}, then set 
PERF_SAPMLE_REGS_INTR again. Or we could introduce a custom flag similar to 
__PERF_SAMPLE_CALLCHAIN_EARLY here...

2) How do we want to do »the "unwind ip differs" indication« as Andi puts it? 
I.e. on the perf report/script side, how should we display the stacks? 
Something like the following annotation maybe?


```
cpp-inlining  2605 [-01]    57.870061:     701199 cycles:pppu: 
           7fc1042797b4 __hypot_finite+0x154 (/usr/lib/libm-2.28.so)
           7fc10425faf8 hypotf32x+0x18 (/usr/lib/libm-2.28.so) (unwind ip 
differs)
           5622c7452128 main+0x88 (/tmp/cpp-inlining)
           7fc104096222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
           5622c74521ed _start+0x2d (/tmp/cpp-inlining)
```

3) I suggest we always keep the first frame and sample IP from the user regs, 
i.e. the accurate PEBS/IBS IP. Then we add the following frames from unwinding 
the ustack with the iregs. But what do we do with the first iregs IP? If we 
add it, then we could see the same frame with slightly different IP, like in 
the following, which is undesired I believe:


```
cpp-inlining  2605 [-01]    57.870061:     701199 cycles:pppu: 
           7fc1042797b4 __hypot_finite+0x154 (/usr/lib/libm-2.28.so)
           7fc1042797b5 __hypot_finite+0x155 (/usr/lib/libm-2.28.so)
           7fc10425faf8 hypotf32x+0x18 (/usr/lib/libm-2.28.so) (unwind ip 
differs)
           5622c7452128 main+0x88 (/tmp/cpp-inlining)
           7fc104096222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
           5622c74521ed _start+0x2d (/tmp/cpp-inlining)
```

But always skipping the IP is also sometimes wrong, like in this case:

```
cpp-inlining  2605 [-01]    57.862313:     694984 cycles:pppu: 
           7fc1042797b9 __hypot_finite+0x159 (/usr/lib/libm-2.28.so)
           5622c7452128 main+0x88 (/tmp/cpp-inlining)
           7fc104096222 __libc_start_main+0xf2 (/usr/lib/libc-2.28.so)
           5622c74521ed _start+0x2d (/tmp/cpp-inlining)
```

Here, we are missing the hypotf32x call inbetween __hypot_finite and main.

Do we want to introduce some heuristic on how handle these scenarios? I.e. if 
uregs->ip and iregs->ip point to the same function symbol, then skip the frame 
for iregs->ip, otherwise add it?

Thanks
-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts