* Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?
@ 2021-09-09 14:08 Milian Wolff
  2021-09-09 19:28 ` Ian Rogers
  0 siblings, 1 reply; 4+ messages in thread
From: Milian Wolff @ 2021-09-09 14:08 UTC (permalink / raw)
  To: linux-perf-users; +Cc: Arnaldo Carvalho de Melo


Hey there!

I'm trying to profile an application that suffers from overcommit and 
excessive calls to `yield` (via OMP). Generally, it is putting a lot of strain 
on the system. Sadly, I cannot make `perf record` work without dropping (a lot 
of) chunks / events.

Usually I'm using this command line:

```
perf record --call-graph dwarf -z -m 16M ...
```

But this is not enough in this case. What other alternatives are there? Is 
there a `perf record` mode which never drops events, at the cost of maybe 
stalling the profiled application?
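
For reference, the fullest invocation I currently know to try looks roughly
like this (a sketch; `--aio` and the size-suffixed `-m` need a reasonably
recent perf):

```shell
# --call-graph dwarf,8192   DWARF unwinding, explicit 8 kB stack-dump size
# -z                        compress records with zstd
# -m 16M                    larger per-CPU ring buffer to absorb bursts
# --aio                     write trace data asynchronously
perf record --call-graph dwarf,8192 -z -m 16M --aio -o perf.data -- ./a.out
```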

Generally though, looking at the (partial) data I get from perf, it seems like 
it does not cope well in the face of excessive calls to `yield`. Take this 
example code here:

```
#include <thread>
#include <vector>
#include <algorithm>

void yieldALot()
{
    for (int i = 0; i < 1000000; ++i) {
        std::this_thread::yield();
    }
}

int main()
{
    std::vector<std::thread> threads(std::thread::hardware_concurrency() * 2);
    std::generate(threads.begin(), threads.end(), []() {
        return std::thread(&yieldALot);
    });

    for (auto &thread : threads)
        thread.join();

    return 0;
}
```

When I compile this with `g++ test.cpp -pthread -O2 -g` and run it on my 
system with an AMD Ryzen 9 3900X (12 cores, 24 SMT threads), I observe the 
following behavior:

```
perf stat -r 5 ./a.out 

 Performance counter stats for './a.out' (5 runs):

         18,570.71 msec task-clock                #   19.598 CPUs utilized             ( +-  1.27% )
         2,717,136      context-switches          #  145.447 K/sec                     ( +-  3.26% )
                55      cpu-migrations            #    2.944 /sec                      ( +-  3.79% )
               222      page-faults               #   11.884 /sec                      ( +-  0.49% )
    61,223,010,902      cycles                    #    3.277 GHz                       ( +-  0.54% )  (83.33%)
     1,259,821,397      stalled-cycles-frontend   #    2.07% frontend cycles idle      ( +-  1.54% )  (83.16%)
     9,001,394,333      stalled-cycles-backend    #   14.81% backend cycles idle       ( +-  0.79% )  (83.25%)
    65,746,848,695      instructions              #    1.08  insn per cycle
                                                  #    0.14  stalled cycles per insn   ( +-  0.26% )  (83.40%)
    14,474,862,454      branches                  #  774.830 M/sec                     ( +-  0.36% )  (83.49%)
       114,518,545      branch-misses             #    0.79% of all branches           ( +-  0.71% )  (83.67%)

           0.94761 +- 0.00935 seconds time elapsed  ( +-  0.99% )
```

Now compare this to what happens when I enable perf recording:

```
 Performance counter stats for 'perf record --call-graph dwarf -z ./a.out' (5 
runs):

         19,928.28 msec task-clock                #    9.405 CPUs utilized             ( +-  2.74% )
         3,026,081      context-switches          #  142.538 K/sec                     ( +-  7.48% )
               120      cpu-migrations            #    5.652 /sec                      ( +-  3.18% )
             5,509      page-faults               #  259.491 /sec                      ( +-  0.01% )
    65,393,592,529      cycles                    #    3.080 GHz                       ( +-  1.81% )  (71.81%)
     1,445,203,862      stalled-cycles-frontend   #    2.13% frontend cycles idle      ( +-  2.01% )  (71.60%)
     9,701,321,165      stalled-cycles-backend    #   14.29% backend cycles idle       ( +-  1.28% )  (71.62%)
    69,967,062,730      instructions              #    1.03  insn per cycle
                                                  #    0.14  stalled cycles per insn   ( +-  0.74% )  (71.83%)
    15,376,981,968      branches                  #  724.304 M/sec                     ( +-  0.83% )  (72.00%)
       127,934,194      branch-misses             #    0.81% of all branches           ( +-  1.29% )  (72.07%)

          2.118881 +- 0.000496 seconds time elapsed  ( +-  0.02% )
```

Note how the CPU utilization drops by roughly a factor of two, which matches 
the increase in overall runtime.

From the data obtained here I see:
```
    98.49%     0.00%  a.out    libc-2.33.so        [.] __GI___sched_yield (inlined)
            |
            ---__GI___sched_yield (inlined)
               |
               |--89.93%--entry_SYSCALL_64_after_hwframe
               |          |
               |           --89.32%--do_syscall_64
               |                     |
               |                     |--71.70%--__ia32_sys_sched_yield
               |                     |          |
               |                     |          |--64.36%--schedule
               |                     |          |          |
               |                     |          |           --63.42%--__schedule
               |                     |          |                     |
               |                     |          |                     |--32.56%--__perf_event_task_sched_out
               |                     |          |                     |          |
               |                     |          |                     |           --32.07%--amd_pmu_disable_all
               |                     |          |                     |                     |
               |                     |          |                     |                     |--23.54%--x86_pmu_disable_all
               |                     |          |                     |                     |          |
               |                     |          |                     |                     |          |--12.81%--native_write_msr
               |                     |          |                     |                     |          |
               |                     |          |                     |                     |          |--8.24%--native_read_msr
               |                     |          |                     |                     |          |
               |                     |          |                     |                     |           --1.25%--amd_pmu_addr_offset
               |                     |          |                     |                     |
               |                     |          |                     |                      --8.19%--amd_pmu_wait_on_overflow
               |                     |          |                     |                                |
               |                     |          |                     |                                |--5.00%--native_read_msr
               |                     |          |                     |                                |
               |                     |          |                     |                                |--2.20%--delay_halt
               |                     |          |                     |                                |          |
               |                     |          |                     |                                |           --1.99%--delay_halt_mwaitx
               |                     |          |                     |                                |
               |                     |          |                     |                                 --0.55%--amd_pmu_addr_offset

```

Is this a deficiency in the AMD perf subsystem, or is it a generic issue 
with perf? I understand that it has to enable/disable the PMU when 
switching tasks, but perhaps there are ways to optimize this behavior?

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts



* Re: Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?
  2021-09-09 14:08 Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield? Milian Wolff
@ 2021-09-09 19:28 ` Ian Rogers
  2021-09-09 20:22   ` Arnaldo Carvalho de Melo
  2021-09-09 20:38   ` Milian Wolff
  0 siblings, 2 replies; 4+ messages in thread
From: Ian Rogers @ 2021-09-09 19:28 UTC (permalink / raw)
  To: Milian Wolff; +Cc: linux-perf-users, Arnaldo Carvalho de Melo

On Thu, Sep 9, 2021 at 7:14 AM Milian Wolff <milian.wolff@kdab.com> wrote:
>
> Hey there!
>
> I'm trying to profile an application that suffers from overcommit and
> excessive calls to `yield` (via OMP). Generally, it is putting a lot of strain
> on the system. Sadly, I cannot make `perf record` work without dropping (a lot
> of) chunks / events.
>
> Usually I'm using this command line:
>
> ```
> perf record --call-graph dwarf -z -m 16M ...
> ```
>
> But this is not enough in this case. What other alternatives are there? Is
> there a `perf record` mode which never drops events, at the cost of maybe
> stalling the profiled application?
>
> Generally though, looking at the (partial) data I get from perf, it seems like
> it does not cope well in the face of excessive calls to `yield`. Take this
> example code here:
>
> ```
> #include <thread>
> #include <vector>
> #include <algorithm>
>
> void yieldALot()
> {
>     for (int i = 0; i < 1000000; ++i) {
>         std::this_thread::yield();
>     }
> }
>
> int main()
> {
>     std::vector<std::thread> threads(std::thread::hardware_concurrency() * 2);
>     std::generate(threads.begin(), threads.end(), []() {
>         return std::thread(&yieldALot);
>     });
>
>     for (auto &thread : threads)
>         thread.join();
>
>     return 0;
> }
> ```
>
> When I compile this with `g++ test.cpp -pthread -O2 -g` and run it on my
> system with an AMD Ryzen 9 3900X (12 cores, 24 SMT threads), I observe
> the following behavior:
>
> ```
> perf stat -r 5 ./a.out
>
>  Performance counter stats for './a.out' (5 runs):
>
>          18,570.71 msec task-clock                #   19.598 CPUs utilized             ( +-  1.27% )
>          2,717,136      context-switches          #  145.447 K/sec                     ( +-  3.26% )
>                 55      cpu-migrations            #    2.944 /sec                      ( +-  3.79% )
>                222      page-faults               #   11.884 /sec                      ( +-  0.49% )
>     61,223,010,902      cycles                    #    3.277 GHz                       ( +-  0.54% )  (83.33%)
>      1,259,821,397      stalled-cycles-frontend   #    2.07% frontend cycles idle      ( +-  1.54% )  (83.16%)
>      9,001,394,333      stalled-cycles-backend    #   14.81% backend cycles idle       ( +-  0.79% )  (83.25%)
>     65,746,848,695      instructions              #    1.08  insn per cycle
>                                                   #    0.14  stalled cycles per insn   ( +-  0.26% )  (83.40%)
>     14,474,862,454      branches                  #  774.830 M/sec                     ( +-  0.36% )  (83.49%)
>        114,518,545      branch-misses             #    0.79% of all branches           ( +-  0.71% )  (83.67%)
>
>            0.94761 +- 0.00935 seconds time elapsed  ( +-  0.99% )
> ```
>
> Now compare this to what happens when I enable perf recording:
>
> ```
>  Performance counter stats for 'perf record --call-graph dwarf -z ./a.out' (5
> runs):
>
>          19,928.28 msec task-clock                #    9.405 CPUs utilized             ( +-  2.74% )
>          3,026,081      context-switches          #  142.538 K/sec                     ( +-  7.48% )
>                120      cpu-migrations            #    5.652 /sec                      ( +-  3.18% )
>              5,509      page-faults               #  259.491 /sec                      ( +-  0.01% )
>     65,393,592,529      cycles                    #    3.080 GHz                       ( +-  1.81% )  (71.81%)
>      1,445,203,862      stalled-cycles-frontend   #    2.13% frontend cycles idle      ( +-  2.01% )  (71.60%)
>      9,701,321,165      stalled-cycles-backend    #   14.29% backend cycles idle       ( +-  1.28% )  (71.62%)
>     69,967,062,730      instructions              #    1.03  insn per cycle
>                                                   #    0.14  stalled cycles per insn   ( +-  0.74% )  (71.83%)
>     15,376,981,968      branches                  #  724.304 M/sec                     ( +-  0.83% )  (72.00%)
>        127,934,194      branch-misses             #    0.81% of all branches           ( +-  1.29% )  (72.07%)
>
>           2.118881 +- 0.000496 seconds time elapsed  ( +-  0.02% )
> ```
>
> Note how the CPU utilization drops by roughly a factor of two, which matches
> the increase in overall runtime.
>
> From the data obtained here I see:
> ```
>     98.49%     0.00%  a.out    libc-2.33.so        [.] __GI___sched_yield (inlined)
>             |
>             ---__GI___sched_yield (inlined)
>                |
>                |--89.93%--entry_SYSCALL_64_after_hwframe
>                |          |
>                |           --89.32%--do_syscall_64
>                |                     |
>                |                     |--71.70%--__ia32_sys_sched_yield
>                |                     |          |
>                |                     |          |--64.36%--schedule
>                |                     |          |          |
>                |                     |          |           --63.42%--__schedule
>                |                     |          |                     |
>                |                     |          |                     |--32.56%--__perf_event_task_sched_out
>                |                     |          |                     |          |
>                |                     |          |                     |           --32.07%--amd_pmu_disable_all
>                |                     |          |                     |                     |
>                |                     |          |                     |                     |--23.54%--x86_pmu_disable_all
>                |                     |          |                     |                     |          |
>                |                     |          |                     |                     |          |--12.81%--native_write_msr
>                |                     |          |                     |                     |          |
>                |                     |          |                     |                     |          |--8.24%--native_read_msr
>                |                     |          |                     |                     |          |
>                |                     |          |                     |                     |           --1.25%--amd_pmu_addr_offset
>                |                     |          |                     |                     |
>                |                     |          |                     |                      --8.19%--amd_pmu_wait_on_overflow
>                |                     |          |                     |                                |
>                |                     |          |                     |                                |--5.00%--native_read_msr
>                |                     |          |                     |                                |
>                |                     |          |                     |                                |--2.20%--delay_halt
>                |                     |          |                     |                                |          |
>                |                     |          |                     |                                |           --1.99%--delay_halt_mwaitx
>                |                     |          |                     |                                |
>                |                     |          |                     |                                 --0.55%--amd_pmu_addr_offset
>
> ```
>
> Is this a deficiency in the AMD perf subsystem, or is it a generic issue
> with perf? I understand that it has to enable/disable the PMU when
> switching tasks, but perhaps there are ways to optimize this behavior?

Hi Milian,

By using dwarf call graphs, your samples write a dump of the stack into
the perf event ring buffer, which is then processed in the userspace perf
command. The default stack dump size is 8kB and you can lower it, for
example with "--call-graph dwarf,4096". I suspect that most of the
overhead you are seeing comes from these stack dumps. There is a more
complete description in the man page:
https://man7.org/linux/man-pages/man1/perf-record.1.html

The docs are always worth improving:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/Documentation/perf-record.txt?h=perf/core
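
Concretely, that suggestion looks like this (4096 is just an illustrative
value; anything below the 8kB default shrinks the per-sample payload):

```shell
# Cap the per-sample stack dump at 4 kB instead of the default 8 kB,
# roughly halving the data each sample pushes through the ring buffer:
perf record --call-graph dwarf,4096 -z ./a.out
```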

Thanks,
Ian
> Thanks
>
> --
> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt, C++ and OpenGL Experts


* Re: Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?
  2021-09-09 19:28 ` Ian Rogers
@ 2021-09-09 20:22   ` Arnaldo Carvalho de Melo
  2021-09-09 20:38   ` Milian Wolff
  1 sibling, 0 replies; 4+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-09-09 20:22 UTC (permalink / raw)
  To: Ian Rogers; +Cc: Milian Wolff, linux-perf-users

Em Thu, Sep 09, 2021 at 12:28:10PM -0700, Ian Rogers escreveu:
> On Thu, Sep 9, 2021 at 7:14 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> >
> > Hey there!
> >
> > I'm trying to profile an application that suffers from overcommit and
> > excessive calls to `yield` (via OMP). Generally, it is putting a lot of strain
> > on the system. Sadly, I cannot make `perf record` work without dropping (a lot
> > of) chunks / events.
> >
> > Usually I'm using this command line:
> >
> > ```
> > perf record --call-graph dwarf -z -m 16M ...
> > ```
> >
> > But this is not enough in this case. What other alternatives are there? Is
> > there a `perf record` mode which never drops events, at the cost of maybe
> > stalling the profiled application?

There was discussion about such a mode, for use in a 'perf trace' that
works like 'strace', i.e. it doesn't lose records but stalls the
profiled/traced application, but I don't think that progressed.

> > Generally though, looking at the (partial) data I get from perf, it seems like
> > it does not cope well in the face of excessive calls to `yield`. Take this
> > example code here:
> >
> > ```
> > #include <thread>
> > #include <vector>
> > #include <algorithm>
> >
> > void yieldALot()
> > {
> >     for (int i = 0; i < 1000000; ++i) {
> >         std::this_thread::yield();
> >     }
> > }
> >
> > int main()
> > {
> >     std::vector<std::thread> threads(std::thread::hardware_concurrency() * 2);
> >     std::generate(threads.begin(), threads.end(), []() {
> >         return std::thread(&yieldALot);
> >     });
> >
> >     for (auto &thread : threads)
> >         thread.join();
> >
> >     return 0;
> > }
> > ```
> >
> > When I compile this with `g++ test.cpp -pthread -O2 -g` and run it on my
> > system with an AMD Ryzen 9 3900X (12 cores, 24 SMT threads), I observe
> > the following behavior:
> >
> > ```
> > perf stat -r 5 ./a.out
> >
> >  Performance counter stats for './a.out' (5 runs):
> >
> >          18,570.71 msec task-clock                #   19.598 CPUs utilized             ( +-  1.27% )
> >          2,717,136      context-switches          #  145.447 K/sec                     ( +-  3.26% )
> >                 55      cpu-migrations            #    2.944 /sec                      ( +-  3.79% )
> >                222      page-faults               #   11.884 /sec                      ( +-  0.49% )
> >     61,223,010,902      cycles                    #    3.277 GHz                       ( +-  0.54% )  (83.33%)
> >      1,259,821,397      stalled-cycles-frontend   #    2.07% frontend cycles idle      ( +-  1.54% )  (83.16%)
> >      9,001,394,333      stalled-cycles-backend    #   14.81% backend cycles idle       ( +-  0.79% )  (83.25%)
> >     65,746,848,695      instructions              #    1.08  insn per cycle
> >                                                   #    0.14  stalled cycles per insn   ( +-  0.26% )  (83.40%)
> >     14,474,862,454      branches                  #  774.830 M/sec                     ( +-  0.36% )  (83.49%)
> >        114,518,545      branch-misses             #    0.79% of all branches           ( +-  0.71% )  (83.67%)
> >
> >            0.94761 +- 0.00935 seconds time elapsed  ( +-  0.99% )
> > ```
> >
> > Now compare this to what happens when I enable perf recording:
> >
> > ```
> >  Performance counter stats for 'perf record --call-graph dwarf -z ./a.out' (5
> > runs):
> >
> >          19,928.28 msec task-clock                #    9.405 CPUs utilized             ( +-  2.74% )
> >          3,026,081      context-switches          #  142.538 K/sec                     ( +-  7.48% )
> >                120      cpu-migrations            #    5.652 /sec                      ( +-  3.18% )
> >              5,509      page-faults               #  259.491 /sec                      ( +-  0.01% )
> >     65,393,592,529      cycles                    #    3.080 GHz                       ( +-  1.81% )  (71.81%)
> >      1,445,203,862      stalled-cycles-frontend   #    2.13% frontend cycles idle      ( +-  2.01% )  (71.60%)
> >      9,701,321,165      stalled-cycles-backend    #   14.29% backend cycles idle       ( +-  1.28% )  (71.62%)
> >     69,967,062,730      instructions              #    1.03  insn per cycle
> >                                                   #    0.14  stalled cycles per insn   ( +-  0.74% )  (71.83%)
> >     15,376,981,968      branches                  #  724.304 M/sec                     ( +-  0.83% )  (72.00%)
> >        127,934,194      branch-misses             #    0.81% of all branches           ( +-  1.29% )  (72.07%)
> >
> >           2.118881 +- 0.000496 seconds time elapsed  ( +-  0.02% )
> > ```
> >
> > Note how the CPU utilization drops by roughly a factor of two, which matches
> > the increase in overall runtime.
> >
> > From the data obtained here I see:
> > ```
> >     98.49%     0.00%  a.out    libc-2.33.so        [.] __GI___sched_yield (inlined)
> >             |
> >             ---__GI___sched_yield (inlined)
> >                |
> >                |--89.93%--entry_SYSCALL_64_after_hwframe
> >                |          |
> >                |           --89.32%--do_syscall_64
> >                |                     |
> >                |                     |--71.70%--__ia32_sys_sched_yield
> >                |                     |          |
> >                |                     |          |--64.36%--schedule
> >                |                     |          |          |
> >                |                     |          |           --63.42%--__schedule
> >                |                     |          |                     |
> >                |                     |          |                     |--32.56%--__perf_event_task_sched_out
> >                |                     |          |                     |          |
> >                |                     |          |                     |           --32.07%--amd_pmu_disable_all
> >                |                     |          |                     |                     |
> >                |                     |          |                     |                     |--23.54%--x86_pmu_disable_all
> >                |                     |          |                     |                     |          |
> >                |                     |          |                     |                     |          |--12.81%--native_write_msr
> >                |                     |          |                     |                     |          |
> >                |                     |          |                     |                     |          |--8.24%--native_read_msr
> >                |                     |          |                     |                     |          |
> >                |                     |          |                     |                     |           --1.25%--amd_pmu_addr_offset
> >                |                     |          |                     |                     |
> >                |                     |          |                     |                      --8.19%--amd_pmu_wait_on_overflow
> >                |                     |          |                     |                                |
> >                |                     |          |                     |                                |--5.00%--native_read_msr
> >                |                     |          |                     |                                |
> >                |                     |          |                     |                                |--2.20%--delay_halt
> >                |                     |          |                     |                                |          |
> >                |                     |          |                     |                                |           --1.99%--delay_halt_mwaitx
> >                |                     |          |                     |                                |
> >                |                     |          |                     |                                 --0.55%--amd_pmu_addr_offset
> >
> > ```
> >
> > Is this a deficiency in the AMD perf subsystem, or is it a generic issue
> > with perf? I understand that it has to enable/disable the PMU when
> > switching tasks, but perhaps there are ways to optimize this behavior?
> 
> Hi Milian,
> 
> By using dwarf call graphs, your samples write a dump of the stack into
> the perf event ring buffer, which is then processed in the userspace perf
> command. The default stack dump size is 8kB and you can lower it, for
> example with "--call-graph dwarf,4096". I suspect that most of the
> overhead you are seeing comes from these stack dumps. There is a more
> complete description in the man page:
> https://man7.org/linux/man-pages/man1/perf-record.1.html
> 
> The docs are always worth improving:
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/Documentation/perf-record.txt?h=perf/core
> 
> Thanks,
> Ian
> > Thanks
> >
> > --
> > Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> > KDAB (Deutschland) GmbH, a KDAB Group company
> > Tel: +49-30-521325470
> > KDAB - The Qt, C++ and OpenGL Experts

-- 

- Arnaldo


* Re: Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?
  2021-09-09 19:28 ` Ian Rogers
  2021-09-09 20:22   ` Arnaldo Carvalho de Melo
@ 2021-09-09 20:38   ` Milian Wolff
  1 sibling, 0 replies; 4+ messages in thread
From: Milian Wolff @ 2021-09-09 20:38 UTC (permalink / raw)
  To: Ian Rogers; +Cc: linux-perf-users, Arnaldo Carvalho de Melo


On Donnerstag, 9. September 2021 21:28:10 CEST Ian Rogers wrote:
> On Thu, Sep 9, 2021 at 7:14 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> > Hey there!
> > 
> > I'm trying to profile an application that suffers from overcommit and
> > excessive calls to `yield` (via OMP). Generally, it is putting a lot of
> > strain on the system. Sadly, I cannot make `perf record` work without
> > dropping (a lot of) chunks / events.
> > 
> > Usually I'm using this command line:
> > 
> > ```
> > perf record --call-graph dwarf -z -m 16M ...
> > ```

<snip>

> > Is this a deficiency in the AMD perf subsystem, or is it a generic
> > issue with perf? I understand that it has to enable/disable the PMU
> > when switching tasks, but perhaps there are ways to optimize this
> > behavior?
> 
> Hi Milian,

Hey Ian!

> By using dwarf call graphs your samples are writing a dump of the
> stack into the perf event ring buffer that will be processed in the
> userspace perf command. The default stack dump size is 8kb and you can
> lower it - for example with "--call-graph dwarf,4096".

I'm well aware of the overhead imposed by `--call-graph dwarf`, but it is a 
requirement where I'm coming from. I don't know of a single "normal" Linux 
distribution that enables frame pointers for system libraries, for example.

> I suspect that
> most of the overhead you are seeing is from these stack dumps. There
> is a more complete description in the man page:
> https://man7.org/linux/man-pages/man1/perf-record.1.html
>
> The docs are always worth improving:
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/Documentation/perf-record.txt?h=perf/core

Reducing the size of the stack dumps is not really an option for me either. 
Rather, I would like to somehow make perf cope with the load.

As I said above, so far `-m 16M` has often helped to prevent lost chunks. 
`-z` and the implied `--aio` also drastically improve the situation 
compared to the not-so-distant past.

But in my situation, neither is sufficient. Are there really no other options 
available to me? As I said, I would even be willing to impose a runtime 
penalty on the profiled application. Ideally, I'd just let perf hog a full 
core or two. Zstd should easily crunch through ~500MB/s according to the 
benchmarks, meaning I should - in theory - be able to compress up to 
500MB / 8kB = 62500 samples per second. That should be more than enough to 
accommodate a sampling rate of 1000Hz across 24 threads, no?
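
(Spelling that arithmetic out; the 500MB/s figure is an assumption taken 
from the published zstd benchmarks, not a measurement on this machine:)

```shell
# Back-of-envelope: how many 8 kB dwarf stack dumps can zstd compress per second?
zstd_bytes_per_sec=$((500 * 1000 * 1000))   # assumed zstd throughput, ~500 MB/s
dump_bytes=$((8 * 1000))                    # default dwarf stack dump, 8 kB
echo $((zstd_bytes_per_sec / dump_bytes))   # 62500 samples/s achievable
echo $((1000 * 24))                         # 24000 samples/s needed: 1000 Hz x 24 threads
```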

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


