All of lore.kernel.org
 help / color / mirror / Atom feed
* Perf call graphs on ARM
@ 2021-04-14 13:22 Fontius Sebastian (XC-DA/ESV9)
  2021-04-14 13:38 ` Arnaldo Carvalho de Melo
  2021-04-16 11:16 ` James Clark
  0 siblings, 2 replies; 4+ messages in thread
From: Fontius Sebastian (XC-DA/ESV9) @ 2021-04-14 13:22 UTC (permalink / raw)
  To: linux-perf-users

Hi everyone,

we are having trouble to get perf to output a call graph using frame pointers on a Raspberry Pi 2 Model B Revision 1.1 (ARM Cortex-A7) running Raspbian on its official Kernel versions 4.9 and 5.10.

Let me illustrate the problem we're having using a small test program we will call aaa.cpp:

#include <iostream>
#include <cmath>
__attribute__((noinline)) double G(double aaa) {
  return sqrt(aaa);
}
__attribute__((noinline)) double doit(double aaa) {
  for(int i=0; i<1000; i++)
    aaa=G(aaa);
  return aaa;
}
int main() {
  double aaa = 12;
  for(int i=0; i<10000; i++){
    aaa = doit(aaa);
    for(int j=0; j<1000; j++)
      aaa++;
    }
  std::cout << aaa << std::endl;
}

This program gets compiled like this using GCC Raspbian 8.3.0-6+rpi1:

g++ -O2 -fno-omit-frame-pointer aaa.cpp

Then we run perf on it like this:

perf record -e cycles --call-graph fp -- ./a.out
perf report

In the output of perf there should be a tree like the following:

main
  + doit
    + G

Instead what we are getting is all of those functions attached to a.out (but we _do_ get the runtimes correctly).

It seems the frame pointers are written to the binary, but do not work. We can see the frame pointers in a disassembly output created by compiling with -save-temps like this:

g++ -O2 -fno-omit-frame-pointer -save-temps aaa.cpp

This gives the following output for the doit() function:

_Z4doitd:
.fnstart
.LFB1758:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 1, uses_anonymous_args = 0
vmov.f64 d7, d0
push {r4, r5, fp, lr}
mov r4, #1000
add fp, sp, #12
.L7:
vmov.f64 d0, d7
bl _Z1Gd
subs r4, r4, #1
vmov.f64 d7, d0
bne .L7
pop {r4, r5, fp, pc}
.cantunwind
.fnend

One thing we did try in addition to using frame pointers is to use the DWARF format, but that has some disadvantages like e.g. using roughly 20x the space of the FP format and being much slower to record. Also the recording itself seems unstable and can simply hang the whole Raspberry Pi completely requiring a hard reset. Using Kernel 5.10 the DWARF format also did exhibit the same 'disconnectedness' of the call stack (i.e. all function directly below a.out).

We also tried running Ubuntu 20.04 using its Kernel 5.4, but there both FP and DWARF were 'disconnected' again.

We're at a loss what is going wrong here. Does someone here have an idea what we could try to further debug or even understand the problem?

Mit freundlichen Grüßen / Best regards

Sebastian Fontius

Chassis Systems Control, Image Processing 9 (XC-DA/ESV9)
Robert Bosch GmbH | Postfach 16 61 | 71226 Leonberg | GERMANY | www.bosch.com

Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
Aufsichtsratsvorsitzender: Franz Fehrenbach; Geschäftsführung: Dr. Volkmar Denner, 
Prof. Dr. Stefan Asenkerschbaumer, Filiz Albrecht, Dr. Michael Bolle, Dr. Christian Fischer, 
Dr. Stefan Hartung, Dr. Markus Heyn, Harald Kröger, Rolf Najork, Uwe Raschke


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Perf call graphs on ARM
  2021-04-14 13:22 Perf call graphs on ARM Fontius Sebastian (XC-DA/ESV9)
@ 2021-04-14 13:38 ` Arnaldo Carvalho de Melo
  2021-04-16  8:19   ` Fontius Sebastian (XC-DA/ESV9)
  2021-04-16 11:16 ` James Clark
  1 sibling, 1 reply; 4+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-04-14 13:38 UTC (permalink / raw)
  To: Fontius Sebastian (XC-DA/ESV9); +Cc: linux-perf-users

Em Wed, Apr 14, 2021 at 01:22:32PM +0000, Fontius Sebastian (XC-DA/ESV9) escreveu:
> Hi everyone,
> 
> we are having trouble to get perf to output a call graph using frame pointers on a Raspberry Pi 2 Model B Revision 1.1 (ARM Cortex-A7) running Raspbian on its official Kernel versions 4.9 and 5.10.
> 
> Let me illustrate the problem we're having using a small test program we will call aaa.cpp:

First question is what perf version are you using? The one that comes
with the distro you're using? If so can you please try following the
instructions at:

https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/HOWTO.build.perf

And try with the latest released perf, i.e.:

https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/v5.1.0/perf-5.1.0.tar.xz

Now lemme look at your message:
 
> #include <iostream>
> #include <cmath>
> __attribute__((noinline)) double G(double aaa) {
>   return sqrt(aaa);
> }
> __attribute__((noinline)) double doit(double aaa) {
>   for(int i=0; i<1000; i++)
>     aaa=G(aaa);
>   return aaa;
> }
> int main() {
>   double aaa = 12;
>   for(int i=0; i<10000; i++){
>     aaa = doit(aaa);
>     for(int j=0; j<1000; j++)
>       aaa++;
>     }
>   std::cout << aaa << std::endl;
> }
> 
> This program gets compiled like this using GCC Raspbian 8.3.0-6+rpi1:
> 
> g++ -O2 -fno-omit-frame-pointer aaa.cpp
> 
> Then we run perf on it like this:
> 
> perf record -e cycles --call-graph fp -- ./a.out
> perf report
> 
> In the output of perf there should be a tree like the following:
> 
> main
>   + doit
>     + G
> 
> Instead what we are getting is all of those functions attached to a.out (but we _do_ get the runtimes correctly).

Lemme try this on a x86_64, using the --no-children option:

[acme@five c++]$ perf report --no-children --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 226  of event 'cycles:u'
# Event count (approx.): 230936728
#
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  ........................
#
    86.45%  a.out    a.out             [.] G
            |
            ---G
               main
               __libc_start_main
               0x49564100002adb3d

    12.96%  a.out    a.out             [.] main
            |
            ---main
               __libc_start_main
               0x49564100002adb3d

     0.36%  a.out    [unknown]         [k] 0xffffffffa7269345
     0.20%  a.out    ld-2.32.so        [.] _dl_relocate_object
     0.02%  a.out    [unknown]         [k] 0xffffffffa7555945
     0.00%  a.out    ld-2.32.so        [.] __GI___tunables_init
     0.00%  a.out    [unknown]         [k] 0xffffffffa7269206
     0.00%  a.out    ld-2.32.so        [.] _dl_start
     0.00%  a.out    [unknown]         [k] 0xffffffffa72da481
     0.00%  a.out    [unknown]         [k] 0xffffffffa7681d07
     0.00%  a.out    [unknown]         [k] 0xffffffffa76821b7


#
# (Tip: Save output of perf stat using: perf stat record <target workload>)
#
[acme@five c++]$

So this is the reverse, so we can:

[acme@five c++]$ perf report --call-graph=fractal,1,caller --no-children --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 226  of event 'cycles:u'
# Event count (approx.): 230936728
#
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  ........................
#
    86.45%  a.out    a.out             [.] G
            |
            ---0x49564100002adb3d
               __libc_start_main
               main
               G

    12.96%  a.out    a.out             [.] main
            |
            ---0x49564100002adb3d
               __libc_start_main
               main

doit doesn't appear... So I tried using -O0, i.e.:

$ g++ -O0 -fno-omit-frame-pointer aaa.cpp

And there it is:

[acme@five c++]$ perf report --call-graph=fractal,1,caller --no-children --stdio | head -25
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 510  of event 'cycles:u'
# Event count (approx.): 536177203
#
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  ............................
#
    30.17%  a.out    libm-2.32.so      [.] __sqrt_finite@GLIBC_2.15
            |
            ---0x49564100002aeb3d
               __libc_start_main
               main
               doit
               __sqrt_finite@GLIBC_2.15

    21.84%  a.out    a.out             [.] main
            |
            ---0x49564100002aeb3d
               __libc_start_main
               main

[acme@five c++]$

That first 0x49564100002aeb3d remains a mistery, but I ran out of time,
have to go drive to pick up my 5yo kid :-)

I'll try to read the remaining parts of your message later, but perhaps
the above will help,

Thanks,

- Arnaldo
 
> It seems the frame pointers are written to the binary, but do not work. We can see the frame pointers in a disassembly output created by compiling with -save-temps like this:
> 
> g++ -O2 -fno-omit-frame-pointer -save-temps aaa.cpp
> 
> This gives the following output for the doit() function:
> 
> _Z4doitd:
> .fnstart
> .LFB1758:
> @ args = 0, pretend = 0, frame = 0
> @ frame_needed = 1, uses_anonymous_args = 0
> vmov.f64 d7, d0
> push {r4, r5, fp, lr}
> mov r4, #1000
> add fp, sp, #12
> .L7:
> vmov.f64 d0, d7
> bl _Z1Gd
> subs r4, r4, #1
> vmov.f64 d7, d0
> bne .L7
> pop {r4, r5, fp, pc}
> .cantunwind
> .fnend
> 
> One thing we did try in addition to using frame pointers is to use the DWARF format, but that has some disadvantages like e.g. using roughly 20x the space of the FP format and being much slower to record. Also the recording itself seems unstable and can simply hang the whole Raspberry Pi completely requiring a hard reset. Using Kernel 5.10 the DWARF format also did exhibit the same 'disconnectedness' of the call stack (i.e. all function directly below a.out).
> 
> We also tried running Ubuntu 20.04 using its Kernel 5.4, but there both FP and DWARF were 'disconnected' again.
> 
> We're at a loss what is going wrong here. Does someone here have an idea what we could try to further debug or even understand the problem?
> 
> Mit freundlichen Grüßen / Best regards
> 
> Sebastian Fontius
> 
> Chassis Systems Control, Image Processing 9 (XC-DA/ESV9)
> Robert Bosch GmbH | Postfach 16 61 | 71226 Leonberg | GERMANY | www.bosch.com
> 
> Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
> Aufsichtsratsvorsitzender: Franz Fehrenbach; Geschäftsführung: Dr. Volkmar Denner, 
> Prof. Dr. Stefan Asenkerschbaumer, Filiz Albrecht, Dr. Michael Bolle, Dr. Christian Fischer, 
> Dr. Stefan Hartung, Dr. Markus Heyn, Harald Kröger, Rolf Najork, Uwe Raschke
> 

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: Perf call graphs on ARM
  2021-04-14 13:38 ` Arnaldo Carvalho de Melo
@ 2021-04-16  8:19   ` Fontius Sebastian (XC-DA/ESV9)
  0 siblings, 0 replies; 4+ messages in thread
From: Fontius Sebastian (XC-DA/ESV9) @ 2021-04-16  8:19 UTC (permalink / raw)
  To: linux-perf-users; +Cc: Arnaldo Carvalho de Melo

Em Wed, Apr 14, 2021 at 01:22:32PM +0000, Fontius Sebastian (XC-DA/ESV9) escreveu:
> > we are having trouble to get perf to output a call graph using frame
> > pointers on a Raspberry Pi 2 Model B Revision 1.1 (ARM Cortex-A7)
> > running Raspbian on its official Kernel versions 4.9 and 5.10.
> > 
> > Let me illustrate the problem we're having using a small test program
> > we will call aaa.cpp:

> First question is what perf version are you using? The one that comes
> with the distro you're using?

We tried the following combinations:

* Kernel 4.9 (the current version when we were installing Raspbian) with
  the perf binary as supplied by the 'linux-perf-4.9' Raspbian package.

* Kernel 5.10 with its accompanying perf version compiled from source on
  the Raspberry Pi 2 directly.

> If so can you please try following the instructions at:
>
> https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/HOWTO.build.perf
>
> And try with the latest released perf, i.e.:
>
> https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/v5.1.0/perf-5.1.0.tar.xz

I compiled this version as per your suggestion.  The results were exactly
the same as with the versions mentioned above.

> Lemme try this on a x86_64, using the --no-children option:

On x86_64 we also do not have any problems.  The problems we face are
solely on ARM, and there only on the Raspberry Pi 2 Model B revision 1.1
and 1.2.  We have tried the same thing on a device from a different
vendor which features an ARM A53 processor like the Pi2B r1.2, but not a
Broadcom SoC.  The OS on the other device is a Yocto-based OS featuring
kernel 4.14 built specifically for this device, in case that matters.

> [acme@five c++]$ perf report --no-children --stdio
>
> [...]

As far as we can tell the perf record option using frame pointers already
does produce incomplete output, which cannot be used to create call
graphs.  All the perf report options we tried did not make any
difference WRT call graphs.

> That first 0x49564100002aeb3d remains a mistery, but I ran out of time,
> have to go drive to pick up my 5yo kid :-)

Thanks for your input :).

Regards,

Sebastian

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Perf call graphs on ARM
  2021-04-14 13:22 Perf call graphs on ARM Fontius Sebastian (XC-DA/ESV9)
  2021-04-14 13:38 ` Arnaldo Carvalho de Melo
@ 2021-04-16 11:16 ` James Clark
  1 sibling, 0 replies; 4+ messages in thread
From: James Clark @ 2021-04-16 11:16 UTC (permalink / raw)
  To: Fontius Sebastian (XC-DA/ESV9), linux-perf-users



On 14/04/2021 16:22, Fontius Sebastian (XC-DA/ESV9) wrote:
> Hi everyone,
> 
> we are having trouble to get perf to output a call graph using frame pointers on a Raspberry Pi 2 Model B Revision 1.1 (ARM Cortex-A7) running Raspbian on its official Kernel versions 4.9 and 5.10.
> 
> Let me illustrate the problem we're having using a small test program we will call aaa.cpp:
> 
> #include <iostream>
> #include <cmath>
> __attribute__((noinline)) double G(double aaa) {
>   return sqrt(aaa);
> }
> __attribute__((noinline)) double doit(double aaa) {
>   for(int i=0; i<1000; i++)
>     aaa=G(aaa);
>   return aaa;
> }
> int main() {
>   double aaa = 12;
>   for(int i=0; i<10000; i++){
>     aaa = doit(aaa);
>     for(int j=0; j<1000; j++)
>       aaa++;
>     }
>   std::cout << aaa << std::endl;
> }
> 

Hi Sebastian,

Is it possible to simplify the example and hold the process in G(), with some
kind of while loop for example (as long as the compiler still produces frame
records for each function).

I tried this on 64bit Arm and got a sensible stack and the kernel's frame pointer
unwind for 64 bit is pretty much the same as the 32 bit one. Could it be that
samples are being taken in the function prologues before the frame pointer is
fully setup which is causing some unexpected results?

Here is my stack which looks mostly the same with --call-graph=fp and --call-graph=dwarf:

-   99.73%    99.73%  unwind-simple.o  unwind-simple.out    [.] G                                                                                                                     
     0xfffff67decfd                                                                                                                                                                   
     _start                                                                                                                                                                           
     __libc_start_main                                                                                                                                                                
     main                                                                                                                                                                             
     doit                                                                                                                                                                             
     G   

Thanks
James

> This program gets compiled like this using GCC Raspbian 8.3.0-6+rpi1:
> 
> g++ -O2 -fno-omit-frame-pointer aaa.cpp
> 
> Then we run perf on it like this:
> 
> perf record -e cycles --call-graph fp -- ./a.out
> perf report
> 
> In the output of perf there should be a tree like the following:
> 
> main
>   + doit
>     + G
> 
> Instead what we are getting is all of those functions attached to a.out (but we _do_ get the runtimes correctly).
> 
> It seems the frame pointers are written to the binary, but do not work. We can see the frame pointers in a disassembly output created by compiling with -save-temps like this:
> 
> g++ -O2 -fno-omit-frame-pointer -save-temps aaa.cpp
> 
> This gives the following output for the doit() function:
> 
> _Z4doitd:
> .fnstart
> .LFB1758:
> @ args = 0, pretend = 0, frame = 0
> @ frame_needed = 1, uses_anonymous_args = 0
> vmov.f64 d7, d0
> push {r4, r5, fp, lr}
> mov r4, #1000
> add fp, sp, #12
> .L7:
> vmov.f64 d0, d7
> bl _Z1Gd
> subs r4, r4, #1
> vmov.f64 d7, d0
> bne .L7
> pop {r4, r5, fp, pc}
> .cantunwind
> .fnend
> 
> One thing we did try in addition to using frame pointers is to use the DWARF format, but that has some disadvantages like e.g. using roughly 20x the space of the FP format and being much slower to record. Also the recording itself seems unstable and can simply hang the whole Raspberry Pi completely requiring a hard reset. Using Kernel 5.10 the DWARF format also did exhibit the same 'disconnectedness' of the call stack (i.e. all function directly below a.out).
> 
> We also tried running Ubuntu 20.04 using its Kernel 5.4, but there both FP and DWARF were 'disconnected' again.
> 
> We're at a loss what is going wrong here. Does someone here have an idea what we could try to further debug or even understand the problem?
> 
> Mit freundlichen Grüßen / Best regards
> 
> Sebastian Fontius
> 
> Chassis Systems Control, Image Processing 9 (XC-DA/ESV9)
> Robert Bosch GmbH | Postfach 16 61 | 71226 Leonberg | GERMANY | www.bosch.com
> 
> Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
> Aufsichtsratsvorsitzender: Franz Fehrenbach; Geschäftsführung: Dr. Volkmar Denner, 
> Prof. Dr. Stefan Asenkerschbaumer, Filiz Albrecht, Dr. Michael Bolle, Dr. Christian Fischer, 
> Dr. Stefan Hartung, Dr. Markus Heyn, Harald Kröger, Rolf Najork, Uwe Raschke
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-04-16 11:16 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-14 13:22 Perf call graphs on ARM Fontius Sebastian (XC-DA/ESV9)
2021-04-14 13:38 ` Arnaldo Carvalho de Melo
2021-04-16  8:19   ` Fontius Sebastian (XC-DA/ESV9)
2021-04-16 11:16 ` James Clark

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.