* Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?
@ 2021-09-09 14:08 Milian Wolff
  2021-09-09 19:28 ` Ian Rogers
  0 siblings, 1 reply; 4+ messages in thread
From: Milian Wolff @ 2021-09-09 14:08 UTC (permalink / raw)
  To: linux-perf-users; +Cc: Arnaldo Carvalho de Melo


Hey there!

I'm trying to profile an application that suffers from overcommit and 
excessive calls to `yield` (via OMP). Generally, it is putting a lot of strain 
on the system. Sadly, I cannot make `perf record` work without dropping (a lot 
of) chunks / events.

Usually I'm using this command line:

```
perf record --call-graph dwarf -z -m 16M ...
```

But this is not enough in this case. What other alternatives are there? Is 
there a `perf record` mode which never drops events, at the cost of maybe 
stalling the profiled application?
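
For reference, the fullest invocation I currently know to try looks roughly
like this (a sketch; `--aio` and the size-suffixed `-m` need a reasonably
recent perf):

```shell
# --call-graph dwarf,8192   DWARF unwinding, explicit 8 kB stack-dump size
# -z                        compress records with zstd
# -m 16M                    larger per-CPU ring buffer to absorb bursts
# --aio                     write trace data asynchronously
perf record --call-graph dwarf,8192 -z -m 16M --aio -o perf.data -- ./a.out
```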

Generally though, looking at the (partial) data I get from perf, it seems like 
it does not cope well in the face of excessive calls to `yield`. Take this 
example code here:

```
#include <thread>
#include <vector>
#include <algorithm>

void yieldALot()
{
    for (int i = 0; i < 1000000; ++i) {
        std::this_thread::yield();
    }
}

int main()
{
    std::vector<std::thread> threads(std::thread::hardware_concurrency() * 2);
    std::generate(threads.begin(), threads.end(), []() {
        return std::thread(&yieldALot);
    });

    for (auto &thread : threads)
        thread.join();

    return 0;
}
```

When I compile this with `g++ test.cpp -pthread -O2 -g` and run it on my 
system with an AMD Ryzen 9 3900X (12 cores, 24 SMT threads), I observe the 
following behavior:

```
perf stat -r 5 ./a.out 

 Performance counter stats for './a.out' (5 runs):

         18,570.71 msec task-clock                #   19.598 CPUs utilized             ( +-  1.27% )
         2,717,136      context-switches          #  145.447 K/sec                     ( +-  3.26% )
                55      cpu-migrations            #    2.944 /sec                      ( +-  3.79% )
               222      page-faults               #   11.884 /sec                      ( +-  0.49% )
    61,223,010,902      cycles                    #    3.277 GHz                       ( +-  0.54% )  (83.33%)
     1,259,821,397      stalled-cycles-frontend   #    2.07% frontend cycles idle      ( +-  1.54% )  (83.16%)
     9,001,394,333      stalled-cycles-backend    #   14.81% backend cycles idle       ( +-  0.79% )  (83.25%)
    65,746,848,695      instructions              #    1.08  insn per cycle
                                                  #    0.14  stalled cycles per insn   ( +-  0.26% )  (83.40%)
    14,474,862,454      branches                  #  774.830 M/sec                     ( +-  0.36% )  (83.49%)
       114,518,545      branch-misses             #    0.79% of all branches           ( +-  0.71% )  (83.67%)

           0.94761 +- 0.00935 seconds time elapsed  ( +-  0.99% )
```

Now compare this to what happens when I enable perf recording:

```
 Performance counter stats for 'perf record --call-graph dwarf -z ./a.out' (5 
runs):

         19,928.28 msec task-clock                #    9.405 CPUs utilized             ( +-  2.74% )
         3,026,081      context-switches          #  142.538 K/sec                     ( +-  7.48% )
               120      cpu-migrations            #    5.652 /sec                      ( +-  3.18% )
             5,509      page-faults               #  259.491 /sec                      ( +-  0.01% )
    65,393,592,529      cycles                    #    3.080 GHz                       ( +-  1.81% )  (71.81%)
     1,445,203,862      stalled-cycles-frontend   #    2.13% frontend cycles idle      ( +-  2.01% )  (71.60%)
     9,701,321,165      stalled-cycles-backend    #   14.29% backend cycles idle       ( +-  1.28% )  (71.62%)
    69,967,062,730      instructions              #    1.03  insn per cycle
                                                  #    0.14  stalled cycles per insn   ( +-  0.74% )  (71.83%)
    15,376,981,968      branches                  #  724.304 M/sec                     ( +-  0.83% )  (72.00%)
       127,934,194      branch-misses             #    0.81% of all branches           ( +-  1.29% )  (72.07%)

          2.118881 +- 0.000496 seconds time elapsed  ( +-  0.02% )
```

Note how the CPU utilization drops by roughly a factor of two, which matches 
the increase in overall runtime.

From the data obtained here I see:
```
    98.49%     0.00%  a.out    libc-2.33.so        [.] __GI___sched_yield (inlined)
            |
            ---__GI___sched_yield (inlined)
               |
               |--89.93%--entry_SYSCALL_64_after_hwframe
               |          |
               |           --89.32%--do_syscall_64
               |                     |
               |                     |--71.70%--__ia32_sys_sched_yield
               |                     |          |
               |                     |          |--64.36%--schedule
               |                     |          |          |
               |                     |          |           --63.42%--__schedule
               |                     |          |                     |
               |                     |          |                     |--32.56%--__perf_event_task_sched_out
               |                     |          |                     |          |
               |                     |          |                     |           --32.07%--amd_pmu_disable_all
               |                     |          |                     |                     |
               |                     |          |                     |                     |--23.54%--x86_pmu_disable_all
               |                     |          |                     |                     |          |
               |                     |          |                     |                     |          |--12.81%--native_write_msr
               |                     |          |                     |                     |          |
               |                     |          |                     |                     |          |--8.24%--native_read_msr
               |                     |          |                     |                     |          |
               |                     |          |                     |                     |           --1.25%--amd_pmu_addr_offset
               |                     |          |                     |                     |
               |                     |          |                     |                      --8.19%--amd_pmu_wait_on_overflow
               |                     |          |                     |                                |
               |                     |          |                     |                                |--5.00%--native_read_msr
               |                     |          |                     |                                |
               |                     |          |                     |                                |--2.20%--delay_halt
               |                     |          |                     |                                |          |
               |                     |          |                     |                                |           --1.99%--delay_halt_mwaitx
               |                     |          |                     |                                |
               |                     |          |                     |                                 --0.55%--amd_pmu_addr_offset

```

Is this a deficiency in the AMD perf subsystem, or is it a generic issue 
with perf? I understand that it has to enable/disable the PMU when 
switching tasks, but perhaps there are ways to optimize this behavior?

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts



* Re: Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?
  2021-09-09 14:08 Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield? Milian Wolff
@ 2021-09-09 19:28 ` Ian Rogers
  2021-09-09 20:22   ` Arnaldo Carvalho de Melo
  2021-09-09 20:38   ` Milian Wolff
  0 siblings, 2 replies; 4+ messages in thread
From: Ian Rogers @ 2021-09-09 19:28 UTC (permalink / raw)
  To: Milian Wolff; +Cc: linux-perf-users, Arnaldo Carvalho de Melo

On Thu, Sep 9, 2021 at 7:14 AM Milian Wolff <milian.wolff@kdab.com> wrote:
>
> Hey there!
>
> I'm trying to profile an application that suffers from overcommit and
> excessive calls to `yield` (via OMP). Generally, it is putting a lot of strain
> on the system. Sadly, I cannot make `perf record` work without dropping (a lot
> of) chunks / events.
>
> Usually I'm using this command line:
>
> ```
> perf record --call-graph dwarf -z -m 16M ...
> ```
>
> But this is not enough in this case. What other alternatives are there? Is
> there a `perf record` mode which never drops events, at the cost of maybe
> stalling the profiled application?
>
> Generally though, looking at the (partial) data I get from perf, it seems like
> it does not cope well in the face of excessive calls to `yield`. Take this
> example code here:
>
> ```
> #include <thread>
> #include <vector>
> #include <algorithm>
>
> void yieldALot()
> {
>     for (int i = 0; i < 1000000; ++i) {
>         std::this_thread::yield();
>     }
> }
>
> int main()
> {
>     std::vector<std::thread> threads(std::thread::hardware_concurrency() * 2);
>     std::generate(threads.begin(), threads.end(), []() {
>         return std::thread(&yieldALot);
>     });
>
>     for (auto &thread : threads)
>         thread.join();
>
>     return 0;
> }
> ```
>
> When I compile this with `g++ test.cpp -pthread -O2 -g` and run it on my
> system with an AMD Ryzen 9 3900X (12 cores, 24 SMT threads), I observe
> the following behavior:
>
> ```
> perf stat -r 5 ./a.out
>
>  Performance counter stats for './a.out' (5 runs):
>
>          18,570.71 msec task-clock                #   19.598 CPUs utilized             ( +-  1.27% )
>          2,717,136      context-switches          #  145.447 K/sec                     ( +-  3.26% )
>                 55      cpu-migrations            #    2.944 /sec                      ( +-  3.79% )
>                222      page-faults               #   11.884 /sec                      ( +-  0.49% )
>     61,223,010,902      cycles                    #    3.277 GHz                       ( +-  0.54% )  (83.33%)
>      1,259,821,397      stalled-cycles-frontend   #    2.07% frontend cycles idle      ( +-  1.54% )  (83.16%)
>      9,001,394,333      stalled-cycles-backend    #   14.81% backend cycles idle       ( +-  0.79% )  (83.25%)
>     65,746,848,695      instructions              #    1.08  insn per cycle
>                                                   #    0.14  stalled cycles per insn   ( +-  0.26% )  (83.40%)
>     14,474,862,454      branches                  #  774.830 M/sec                     ( +-  0.36% )  (83.49%)
>        114,518,545      branch-misses             #    0.79% of all branches           ( +-  0.71% )  (83.67%)
>
>            0.94761 +- 0.00935 seconds time elapsed  ( +-  0.99% )
> ```
>
> Now compare this to what happens when I enable perf recording:
>
> ```
>  Performance counter stats for 'perf record --call-graph dwarf -z ./a.out' (5
> runs):
>
>          19,928.28 msec task-clock                #    9.405 CPUs utilized             ( +-  2.74% )
>          3,026,081      context-switches          #  142.538 K/sec                     ( +-  7.48% )
>                120      cpu-migrations            #    5.652 /sec                      ( +-  3.18% )
>              5,509      page-faults               #  259.491 /sec                      ( +-  0.01% )
>     65,393,592,529      cycles                    #    3.080 GHz                       ( +-  1.81% )  (71.81%)
>      1,445,203,862      stalled-cycles-frontend   #    2.13% frontend cycles idle      ( +-  2.01% )  (71.60%)
>      9,701,321,165      stalled-cycles-backend    #   14.29% backend cycles idle       ( +-  1.28% )  (71.62%)
>     69,967,062,730      instructions              #    1.03  insn per cycle
>                                                   #    0.14  stalled cycles per insn   ( +-  0.74% )  (71.83%)
>     15,376,981,968      branches                  #  724.304 M/sec                     ( +-  0.83% )  (72.00%)
>        127,934,194      branch-misses             #    0.81% of all branches           ( +-  1.29% )  (72.07%)
>
>           2.118881 +- 0.000496 seconds time elapsed  ( +-  0.02% )
> ```
>
> Note how the CPU utilization drops by roughly a factor of two, which matches
> the increase in overall runtime.
>
> From the data obtained here I see:
> ```
>     98.49%     0.00%  a.out    libc-2.33.so        [.] __GI___sched_yield (inlined)
>             |
>             ---__GI___sched_yield (inlined)
>                |
>                |--89.93%--entry_SYSCALL_64_after_hwframe
>                |          |
>                |           --89.32%--do_syscall_64
>                |                     |
>                |                     |--71.70%--__ia32_sys_sched_yield
>                |                     |          |
>                |                     |          |--64.36%--schedule
>                |                     |          |          |
>                |                     |          |           --63.42%--__schedule
>                |                     |          |                     |
>                |                     |          |                     |--32.56%--__perf_event_task_sched_out
>                |                     |          |                     |          |
>                |                     |          |                     |           --32.07%--amd_pmu_disable_all
>                |                     |          |                     |                     |
>                |                     |          |                     |                     |--23.54%--x86_pmu_disable_all
>                |                     |          |                     |                     |          |
>                |                     |          |                     |                     |          |--12.81%--native_write_msr
>                |                     |          |                     |                     |          |
>                |                     |          |                     |                     |          |--8.24%--native_read_msr
>                |                     |          |                     |                     |          |
>                |                     |          |                     |                     |           --1.25%--amd_pmu_addr_offset
>                |                     |          |                     |                     |
>                |                     |          |                     |                      --8.19%--amd_pmu_wait_on_overflow
>                |                     |          |                     |                                |
>                |                     |          |                     |                                |--5.00%--native_read_msr
>                |                     |          |                     |                                |
>                |                     |          |                     |                                |--2.20%--delay_halt
>                |                     |          |                     |                                |          |
>                |                     |          |                     |                                |           --1.99%--delay_halt_mwaitx
>                |                     |          |                     |                                |
>                |                     |          |                     |                                 --0.55%--amd_pmu_addr_offset
>
> ```
>
> Is this a deficiency in the AMD perf subsystem, or is it a generic issue
> with perf? I understand that it has to enable/disable the PMU when
> switching tasks, but perhaps there are ways to optimize this behavior?

Hi Milian,

By using dwarf call graphs, your samples write a dump of the stack into
the perf event ring buffer, which is then processed in the userspace perf
command. The default stack dump size is 8kB and you can lower it, for
example with "--call-graph dwarf,4096". I suspect that most of the
overhead you are seeing comes from these stack dumps. There is a more
complete description in the man page:
https://man7.org/linux/man-pages/man1/perf-record.1.html

The docs are always worth improving:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/Documentation/perf-record.txt?h=perf/core
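
Concretely, that suggestion looks like this (4096 is just an illustrative
value; anything below the 8kB default shrinks the per-sample payload):

```shell
# Cap the per-sample stack dump at 4 kB instead of the default 8 kB,
# roughly halving the data each sample pushes through the ring buffer:
perf record --call-graph dwarf,4096 -z ./a.out
```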

Thanks,
Ian
> Thanks
>
> --
> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt, C++ and OpenGL Experts


* Re: Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?
  2021-09-09 19:28 ` Ian Rogers
@ 2021-09-09 20:22   ` Arnaldo Carvalho de Melo
  2021-09-09 20:38   ` Milian Wolff
  1 sibling, 0 replies; 4+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-09-09 20:22 UTC (permalink / raw)
  To: Ian Rogers; +Cc: Milian Wolff, linux-perf-users

Em Thu, Sep 09, 2021 at 12:28:10PM -0700, Ian Rogers escreveu:
> On Thu, Sep 9, 2021 at 7:14 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> >
> > Hey there!
> >
> > I'm trying to profile an application that suffers from overcommit and
> > excessive calls to `yield` (via OMP). Generally, it is putting a lot of strain
> > on the system. Sadly, I cannot make `perf record` work without dropping (a lot
> > of) chunks / events.
> >
> > Usually I'm using this command line:
> >
> > ```
> > perf record --call-graph dwarf -z -m 16M ...
> > ```
> >
> > But this is not enough in this case. What other alternatives are there? Is
> > there a `perf record` mode which never drops events, at the cost of maybe
> > stalling the profiled application?

There was discussion about such a mode, for use in a 'perf trace' that
works like 'strace', i.e. it doesn't lose records but stalls the
profiled/traced application, but I don't think that progressed.

> > Generally though, looking at the (partial) data I get from perf, it seems like
> > it does not cope well in the face of excessive calls to `yield`. Take this
> > example code here:
> >
> > ```
> > #include <thread>
> > #include <vector>
> > #include <algorithm>
> >
> > void yieldALot()
> > {
> >     for (int i = 0; i < 1000000; ++i) {
> >         std::this_thread::yield();
> >     }
> > }
> >
> > int main()
> > {
> >     std::vector<std::thread> threads(std::thread::hardware_concurrency() * 2);
> >     std::generate(threads.begin(), threads.end(), []() {
> >         return std::thread(&yieldALot);
> >     });
> >
> >     for (auto &thread : threads)
> >         thread.join();
> >
> >     return 0;
> > }
> > ```
> >
> > When I compile this with `g++ test.cpp -pthread -O2 -g` and run it on my
> > system with an AMD Ryzen 9 3900X (12 cores, 24 SMT threads), I observe
> > the following behavior:
> >
> > ```
> > perf stat -r 5 ./a.out
> >
> >  Performance counter stats for './a.out' (5 runs):
> >
> >          18,570.71 msec task-clock                #   19.598 CPUs utilized             ( +-  1.27% )
> >          2,717,136      context-switches          #  145.447 K/sec                     ( +-  3.26% )
> >                 55      cpu-migrations            #    2.944 /sec                      ( +-  3.79% )
> >                222      page-faults               #   11.884 /sec                      ( +-  0.49% )
> >     61,223,010,902      cycles                    #    3.277 GHz                       ( +-  0.54% )  (83.33%)
> >      1,259,821,397      stalled-cycles-frontend   #    2.07% frontend cycles idle      ( +-  1.54% )  (83.16%)
> >      9,001,394,333      stalled-cycles-backend    #   14.81% backend cycles idle       ( +-  0.79% )  (83.25%)
> >     65,746,848,695      instructions              #    1.08  insn per cycle
> >                                                   #    0.14  stalled cycles per insn   ( +-  0.26% )  (83.40%)
> >     14,474,862,454      branches                  #  774.830 M/sec                     ( +-  0.36% )  (83.49%)
> >        114,518,545      branch-misses             #    0.79% of all branches           ( +-  0.71% )  (83.67%)
> >
> >            0.94761 +- 0.00935 seconds time elapsed  ( +-  0.99% )
> > ```
> >
> > Now compare this to what happens when I enable perf recording:
> >
> > ```
> >  Performance counter stats for 'perf record --call-graph dwarf -z ./a.out' (5
> > runs):
> >
> >          19,928.28 msec task-clock                #    9.405 CPUs utilized             ( +-  2.74% )
> >          3,026,081      context-switches          #  142.538 K/sec                     ( +-  7.48% )
> >                120      cpu-migrations            #    5.652 /sec                      ( +-  3.18% )
> >              5,509      page-faults               #  259.491 /sec                      ( +-  0.01% )
> >     65,393,592,529      cycles                    #    3.080 GHz                       ( +-  1.81% )  (71.81%)
> >      1,445,203,862      stalled-cycles-frontend   #    2.13% frontend cycles idle      ( +-  2.01% )  (71.60%)
> >      9,701,321,165      stalled-cycles-backend    #   14.29% backend cycles idle       ( +-  1.28% )  (71.62%)
> >     69,967,062,730      instructions              #    1.03  insn per cycle
> >                                                   #    0.14  stalled cycles per insn   ( +-  0.74% )  (71.83%)
> >     15,376,981,968      branches                  #  724.304 M/sec                     ( +-  0.83% )  (72.00%)
> >        127,934,194      branch-misses             #    0.81% of all branches           ( +-  1.29% )  (72.07%)
> >
> >           2.118881 +- 0.000496 seconds time elapsed  ( +-  0.02% )
> > ```
> >
> > Note how the CPU utilization drops by roughly a factor of two, which matches
> > the increase in overall runtime.
> >
> > From the data obtained here I see:
> > ```
> >     98.49%     0.00%  a.out    libc-2.33.so        [.] __GI___sched_yield (inlined)
> >             |
> >             ---__GI___sched_yield (inlined)
> >                |
> >                |--89.93%--entry_SYSCALL_64_after_hwframe
> >                |          |
> >                |           --89.32%--do_syscall_64
> >                |                     |
> >                |                     |--71.70%--__ia32_sys_sched_yield
> >                |                     |          |
> >                |                     |          |--64.36%--schedule
> >                |                     |          |          |
> >                |                     |          |           --63.42%--__schedule
> >                |                     |          |                     |
> >                |                     |          |                     |--32.56%--__perf_event_task_sched_out
> >                |                     |          |                     |          |
> >                |                     |          |                     |           --32.07%--amd_pmu_disable_all
> >                |                     |          |                     |                     |
> >                |                     |          |                     |                     |--23.54%--x86_pmu_disable_all
> >                |                     |          |                     |                     |          |
> >                |                     |          |                     |                     |          |--12.81%--native_write_msr
> >                |                     |          |                     |                     |          |
> >                |                     |          |                     |                     |          |--8.24%--native_read_msr
> >                |                     |          |                     |                     |          |
> >                |                     |          |                     |                     |           --1.25%--amd_pmu_addr_offset
> >                |                     |          |                     |                     |
> >                |                     |          |                     |                      --8.19%--amd_pmu_wait_on_overflow
> >                |                     |          |                     |                                |
> >                |                     |          |                     |                                |--5.00%--native_read_msr
> >                |                     |          |                     |                                |
> >                |                     |          |                     |                                |--2.20%--delay_halt
> >                |                     |          |                     |                                |          |
> >                |                     |          |                     |                                |           --1.99%--delay_halt_mwaitx
> >                |                     |          |                     |                                |
> >                |                     |          |                     |                                 --0.55%--amd_pmu_addr_offset
> >
> > ```
> >
> > Is this a deficiency in the AMD perf subsystem, or is it a generic issue
> > with perf? I understand that it has to enable/disable the PMU when
> > switching tasks, but perhaps there are ways to optimize this behavior?
> 
> Hi Milian,
> 
> By using dwarf call graphs, your samples write a dump of the stack into
> the perf event ring buffer, which is then processed in the userspace perf
> command. The default stack dump size is 8kB and you can lower it, for
> example with "--call-graph dwarf,4096". I suspect that most of the
> overhead you are seeing comes from these stack dumps. There is a more
> complete description in the man page:
> https://man7.org/linux/man-pages/man1/perf-record.1.html
> 
> The docs are always worth improving:
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/Documentation/perf-record.txt?h=perf/core
> 
> Thanks,
> Ian
> > Thanks
> >
> > --
> > Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> > KDAB (Deutschland) GmbH, a KDAB Group company
> > Tel: +49-30-521325470
> > KDAB - The Qt, C++ and OpenGL Experts

-- 

- Arnaldo


* Re: Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?
  2021-09-09 19:28 ` Ian Rogers
  2021-09-09 20:22   ` Arnaldo Carvalho de Melo
@ 2021-09-09 20:38   ` Milian Wolff
  1 sibling, 0 replies; 4+ messages in thread
From: Milian Wolff @ 2021-09-09 20:38 UTC (permalink / raw)
  To: Ian Rogers; +Cc: linux-perf-users, Arnaldo Carvalho de Melo


On Donnerstag, 9. September 2021 21:28:10 CEST Ian Rogers wrote:
> On Thu, Sep 9, 2021 at 7:14 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> > Hey there!
> > 
> > I'm trying to profile an application that suffers from overcommit and
> > excessive calls to `yield` (via OMP). Generally, it is putting a lot of
> > strain on the system. Sadly, I cannot make `perf record` work without
> > dropping (a lot of) chunks / events.
> > 
> > Usually I'm using this command line:
> > 
> > ```
> > perf record --call-graph dwarf -z -m 16M ...
> > ```

<snip>

> > Is this a deficiency in the AMD perf subsystem, or is it a generic
> > issue with perf? I understand that it has to enable/disable the PMU
> > when switching tasks, but perhaps there are ways to optimize this
> > behavior?
> 
> Hi Milian,

Hey Ian!

> By using dwarf call graphs your samples are writing a dump of the
> stack into the perf event ring buffer that will be processed in the
> userspace perf command. The default stack dump size is 8kb and you can
> lower it - for example with "--call-graph dwarf,4096".

I'm well aware of the overhead imposed by `--call-graph dwarf`, but it is a 
requirement where I'm coming from. I don't know of a single "normal" Linux 
distribution that enables frame pointers for system libraries, for example.

> I suspect that
> most of the overhead you are seeing is from these stack dumps. There
> is a more complete description in the man page:
> https://man7.org/linux/man-pages/man1/perf-record.1.html
>
> The docs are always worth improving:
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/Documentation/perf-record.txt?h=perf/core

Reducing the size of the stack dumps is not really an option for me either. 
Rather, I would like to somehow make perf cope with the load.

As I said above, so far `-m 16M` has often helped to prevent lost chunks. 
`-z` and the implied `--aio` also drastically improve the situation 
compared to the not-so-distant past.

But in my situation, neither is sufficient. Are there really no other options 
available to me? As I said, I would even be willing to impose a runtime 
penalty on the profiled application. Ideally, I'd just let perf hog a full 
core or two. Zstd should easily crunch through ~500MB/s according to the 
benchmarks, meaning I should - in theory - be able to compress up to 
500MB / 8kB = 62500 samples per second. That should be more than enough to 
accommodate a sampling rate of 1000Hz across 24 threads, no?
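
(Spelling that arithmetic out; the 500MB/s figure is an assumption taken 
from the published zstd benchmarks, not a measurement on this machine:)

```shell
# Back-of-envelope: how many 8 kB dwarf stack dumps can zstd compress per second?
zstd_bytes_per_sec=$((500 * 1000 * 1000))   # assumed zstd throughput, ~500 MB/s
dump_bytes=$((8 * 1000))                    # default dwarf stack dump, 8 kB
echo $((zstd_bytes_per_sec / dump_bytes))   # 62500 samples/s achievable
echo $((1000 * 24))                         # 24000 samples/s needed: 1000 Hz x 24 threads
```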

Thanks

-- 
Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
KDAB (Deutschland) GmbH, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts


