* ftrace performance (sched events): cyclictest shows 25% more latency
From: Claudio @ 2018-07-06  6:22 UTC
To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

Hello all,

I have been experimenting with the idea of leaving ftrace enabled, with sched events,
on production systems.

The main concern that I am having at the moment is about the impact on the system.
Enabling the sched events that I currently need for the tracing application
seems to slow down context switches considerably, and make the system less responsive.

I have tested with cyclictest on the mainline kernel, and noticed an increase of
around 25% in min and avg latencies.

Is this expected?

Some initial investigation into ftrace seems to point at the reservation and commit
of the events into the ring buffer as the highest sources of overhead, while copying
of event parameters, including COMM, does not seem to have any noticeable effect
relative to those costs.

I have run the following test 20 times, and thrown away the first results:

$ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q

$ uname -a
Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux

For brevity, this is a comparison of one test's results. All other test results show
the same ~25% increase.

On the left side, the run without ftrace sched events; on the right side, with ftrace
sched events enabled.

CPU   Count   Min   Act   Avg     Max   Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
0    100000  2339  2936  2841  139478  100000        2900        3182        3566       93056
1     66742  2365  3386  2874   93639   66750        2959        3786        3646      154074
2     50080  2376  3058  2910  196221   50097        2997        4209        3655       18707
3     40076  2394  3461  2931   17914   40091        3006        4417        3750       17159
4     33404  2371  3612  2834   15336   33419        2997        3836        3594       23172
5     28635  2387  3313  2885   25863   28649        2995        3795        3647        9956
6     25058  2384  3428  2968   12162   25071        3051        4366        3719       18151
7     22275  2381  2859  2982   10706   22287        3046        5078        3825       10781

I would be thankful for any advice or comments on this, especially with the goal
in mind of lowering the runtime impact on the system as much as possible.

Thank you,

Claudio Fontana
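For reference, this kind of A/B comparison can be driven entirely through tracefs;
a minimal sketch follows (the tracefs mount point and enabling the whole sched
event group are assumptions; the tests above may have enabled only a subset of
events):

  # assumes tracefs is mounted at /sys/kernel/tracing
  # (on older setups: /sys/kernel/debug/tracing)
  cd /sys/kernel/tracing

  echo 0 > events/sched/enable        # baseline: sched events off
  ./cyclictest --smp -p95 -m -s -N -l 100000 -q

  echo 1 > events/sched/enable        # comparison: sched events on
  ./cyclictest --smp -p95 -m -s -N -l 100000 -q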
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Steven Rostedt @ 2018-07-06 21:24 UTC
To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Fri, 6 Jul 2018 08:22:01 +0200
Claudio <claudio.fontana@gliwa.com> wrote:

> Hello all,
>
> I have been experimenting with the idea of leaving ftrace enabled, with sched events,
> on production systems.
>
> The main concern that I am having at the moment is about the impact on the system.
> Enabling the sched events that I currently need for the tracing application
> seems to slow down context switches considerably, and make the system less responsive.
>
> I have tested with cyclictest on the mainline kernel, and noticed an increase of
> around 25% in min and avg latencies.
>
> Is this expected?

No it is not. And this is weird. I'm seeing it too.

> Some initial investigation into ftrace seems to point at the reservation and commit
> of the events into the ring buffer as the highest sources of overhead, while copying
> of event parameters, including COMM, does not seem to have any noticeable effect
> relative to those costs.

Did you use perf to see where the issues were? If so what perf command
did you use?

When running just:

  perf record ./cyclictest --smp -p95 -m -s -N -l 1000000 -q

I get this:

   6.08%  cyclictest  [kernel.vmlinux]  [k] syscall_return_via_sysret
   4.03%  cyclictest  cyclictest        [.] timerthread
   3.97%  cyclictest  [kernel.vmlinux]  [k] __schedule
   3.81%  cyclictest  [kernel.vmlinux]  [k] native_write_msr
   3.13%  cyclictest  [kernel.vmlinux]  [k] trace_save_cmdline
   2.71%  cyclictest  [kernel.vmlinux]  [k] do_nanosleep
   2.48%  cyclictest  [kernel.vmlinux]  [k] do_syscall_64
   2.11%  cyclictest  [kernel.vmlinux]  [k] pick_next_task_fair
   2.03%  cyclictest  [kernel.vmlinux]  [k] native_sched_clock
   1.99%  cyclictest  [kernel.vmlinux]  [k] cpuacct_charge
   1.93%  cyclictest  [vdso]            [.] __vdso_clock_gettime
   1.86%  cyclictest  [kernel.vmlinux]  [k] hrtimer_nanosleep
   1.84%  cyclictest  [kernel.vmlinux]  [k] cpupri_set
   1.66%  cyclictest  [kernel.vmlinux]  [k] update_curr_rt
   1.65%  cyclictest  [kernel.vmlinux]  [k] __perf_event_task_sched_out
   1.59%  cyclictest  [kernel.vmlinux]  [k] dequeue_rt_stack
   1.55%  cyclictest  [kernel.vmlinux]  [k] __rb_reserve_next

Which shows there's a lot being done before we even get to the ring
buffer reserve.

One thing that I notice too is that we have trace_save_cmdline that
also gets activated when enabling trace events. Perhaps I should try to
optimize that more.

> I have run the following test 20 times, and thrown away the first results:
>
> $ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q
>
> $ uname -a
> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> For brevity, this is a comparison of one test's results. All other test results show
> the same ~25% increase.
>
> On the left side, the run without ftrace sched events; on the right side, with ftrace
> sched events enabled.
> CPU   Count   Min   Act   Avg     Max   Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
> 0    100000  2339  2936  2841  139478  100000        2900        3182        3566       93056
> 1     66742  2365  3386  2874   93639   66750        2959        3786        3646      154074
> 2     50080  2376  3058  2910  196221   50097        2997        4209        3655       18707
> 3     40076  2394  3461  2931   17914   40091        3006        4417        3750       17159
> 4     33404  2371  3612  2834   15336   33419        2997        3836        3594       23172
> 5     28635  2387  3313  2885   25863   28649        2995        3795        3647        9956
> 6     25058  2384  3428  2968   12162   25071        3051        4366        3719       18151
> 7     22275  2381  2859  2982   10706   22287        3046        5078        3825       10781
>
> I would be thankful for any advice or comments on this, especially with the goal
> in mind of lowering the runtime impact on the system as much as possible.

But still, we use all sorts of trace events when dealing with real
time, and trace events hardly show anything in the PREEMPT_RT kernel.
There, cyclictest shows 20 microseconds, and sched events hardly touch
that. But I see this weird side effect on a distro kernel too. Maybe
there's something weird going on when we have CONFIG_PREEMPT_VOLUNTARY
set :-/

I'll investigate further.

Thanks for the report.

-- Steve
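A note on the trace_save_cmdline cost mentioned above: ftrace can be told not
to cache task comms at all, at the price of losing task names in the trace
output. A sketch, assuming the standard tracefs options directory:

  # skip comm (task name) caching on sched_switch; trace output will
  # then show <...> in place of task names
  echo 0 > /sys/kernel/tracing/options/record-cmd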
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Steven Rostedt @ 2018-07-06 21:39 UTC
To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Fri, 6 Jul 2018 17:24:28 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I'll investigate further.

Note, I enabled function tracing to trace the scheduler function:

  # trace-cmd start -p function -l schedule

And then ran cyclictest. That does not cause any noticeable increase
in latency (try it). So there is some kind of side effect with the
trace event itself, and not the recording of the event.

-- Steve
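For completeness, a sketch of the full check described above (subcommands as
documented in the trace-cmd man pages):

  # trace only the schedule() function with the function tracer
  trace-cmd start -p function -l schedule
  ./cyclictest --smp -p95 -m -s -N -l 100000 -q
  trace-cmd stop     # stop recording, keep the buffer for inspection
  trace-cmd show     # dump the captured trace
  trace-cmd reset    # disable tracing again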
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Steven Rostedt @ 2018-07-06 22:00 UTC
To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Fri, 6 Jul 2018 17:39:22 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 6 Jul 2018 17:24:28 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> I'll investigate further.
>
> Note, I enabled function tracing to trace the scheduler function:
>
>   # trace-cmd start -p function -l schedule
>
> And then ran cyclictest. That does not cause any noticeable increase
> in latency (try it). So there is some kind of side effect with the
> trace event itself, and not the recording of the event.

It does add some noise but not nearly as much as a trace event.

But, then I did function graph tracing:

First I did this, with the sched_switch event enabled:

  trace-cmd record -p function_graph -l schedule -e sched_switch \
      /work/git-local/rt-tests.git/cyclictest --smp -p95 -m -s -N -l 100000 -q

And got this:

    <idle>-0       [000]  3034.534021: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
 trace-cmd-1693    [000]  3034.534029: funcgraph_entry:                |  schedule() {
 trace-cmd-1693    [000]  3034.534043: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.535084: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
 trace-cmd-1693    [000]  3034.535086: funcgraph_exit:  # 1056.055 us  |  }
 trace-cmd-1693    [000]  3034.535090: funcgraph_entry:                |  schedule() {
 trace-cmd-1693    [000]  3034.535093: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.536146: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
 trace-cmd-1693    [000]  3034.536147: funcgraph_exit:  # 1056.563 us  |  }
 trace-cmd-1693    [000]  3034.536150: funcgraph_entry:                |  schedule() {
 trace-cmd-1693    [000]  3034.536153: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.537206: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
 trace-cmd-1693    [000]  3034.537207: funcgraph_exit:  # 1056.107 us  |  }
 trace-cmd-1693    [000]  3034.537210: funcgraph_entry:                |  schedule() {
 trace-cmd-1693    [000]  3034.537213: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.538262: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
 trace-cmd-1693    [000]  3034.538263: funcgraph_exit:  # 1053.077 us  |  }
 trace-cmd-1693    [000]  3034.538265: funcgraph_entry:                |  schedule() {
 trace-cmd-1693    [000]  3034.538266: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.538554: sched_switch:    swapper/0:0 [120] S ==> cyclictest:1698 [120]
 cyclictest-1698   [000]  3034.538565: funcgraph_entry:                |  schedule() {
 cyclictest-1698   [000]  3034.538566: sched_switch:    cyclictest:1698 [4] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.539318: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
 trace-cmd-1693    [000]  3034.539319: funcgraph_exit:  # 1053.723 us  |  }
 trace-cmd-1693    [000]  3034.539322: funcgraph_entry:                |  schedule() {
 trace-cmd-1693    [000]  3034.539324: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.539567: sched_switch:    swapper/0:0 [120] S ==> cyclictest:1698 [4]
 cyclictest-1698   [000]  3034.539567: funcgraph_exit:  # 1002.624 us  |  }

And that continued on as normal.
Then I ran it like this, without the sched events:

  trace-cmd record -p function_graph -l schedule \
      /work/git-local/rt-tests.git/cyclictest --smp -p95 -m -s -N -l 100000 -q

And this was my result:

 trace-cmd-1749    [000]  3321.110772: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.111826: funcgraph_exit:  # 1053.004 us  |  }
 trace-cmd-1749    [000]  3321.111830: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.112885: funcgraph_exit:  # 1054.734 us  |  }
 trace-cmd-1749    [000]  3321.112888: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.113930: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.113942: funcgraph_exit:  # 1054.240 us  |  }
 trace-cmd-1749    [000]  3321.113947: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.114933: funcgraph_exit:  # 1002.600 us  |  }
 cyclictest-1756   [000]  3321.114935: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.115000: funcgraph_exit:  # 1053.451 us  |  }
 trace-cmd-1749    [000]  3321.115004: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.115939: funcgraph_exit:  # 1003.116 us  |  }
 cyclictest-1756   [000]  3321.115941: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.116058: funcgraph_exit:  # 1054.142 us  |  }
 trace-cmd-1749    [000]  3321.116061: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.116945: funcgraph_exit:  # 1003.559 us  |  }
 cyclictest-1756   [000]  3321.116947: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.117115: funcgraph_exit:  # 1053.649 us  |  }
 trace-cmd-1749    [000]  3321.117119: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.117951: funcgraph_exit:  # 1003.586 us  |  }
 cyclictest-1756   [000]  3321.117953: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.118172: funcgraph_exit:  # 1053.264 us  |  }
 trace-cmd-1749    [000]  3321.118176: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.118957: funcgraph_exit:  # 1003.656 us  |  }
 cyclictest-1756   [000]  3321.118960: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.119230: funcgraph_exit:  # 1053.770 us  |  }
 trace-cmd-1749    [000]  3321.119234: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.119965: funcgraph_exit:  # 1004.830 us  |  }
 cyclictest-1756   [000]  3321.119967: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.120286: funcgraph_exit:  # 1052.469 us  |  }

No noticeable difference in the time the scheduler took (going to and
from idle).

-- Steve
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Claudio @ 2018-07-09 10:06 UTC
To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

Hello Steve,

thank you for your response,

On 07/06/2018 11:24 PM, Steven Rostedt wrote:
> On Fri, 6 Jul 2018 08:22:01 +0200
> Claudio <claudio.fontana@gliwa.com> wrote:
>
>> Hello all,
>>
>> I have been experimenting with the idea of leaving ftrace enabled, with sched events,
>> on production systems.
>>
>> The main concern that I am having at the moment is about the impact on the system.
>> Enabling the sched events that I currently need for the tracing application
>> seems to slow down context switches considerably, and make the system less responsive.
>>
>> I have tested with cyclictest on the mainline kernel, and noticed an increase of
>> around 25% in min and avg latencies.
>>
>> Is this expected?
>
> No it is not. And this is weird. I'm seeing it too.
>
>> Some initial investigation into ftrace seems to point at the reservation and commit
>> of the events into the ring buffer as the highest sources of overhead, while copying
>> of event parameters, including COMM, does not seem to have any noticeable effect
>> relative to those costs.
>
> Did you use perf to see where the issues were? If so what perf command
> did you use?

I did not use perf, I will now do it and see what I get.
You can discard my comment regarding reservation and commit, it was based
on a wrong premise.

Regarding parameter copying, I just brutally modified the code to not fill
fields (TP_fast_assign), and I just noticed that it did not change the
result significantly.

> When running just:
>
>   perf record ./cyclictest --smp -p95 -m -s -N -l 1000000 -q
>
> I get this:
>
>    6.08%  cyclictest  [kernel.vmlinux]  [k] syscall_return_via_sysret
>    4.03%  cyclictest  cyclictest        [.] timerthread
>    3.97%  cyclictest  [kernel.vmlinux]  [k] __schedule
>    3.81%  cyclictest  [kernel.vmlinux]  [k] native_write_msr
>    3.13%  cyclictest  [kernel.vmlinux]  [k] trace_save_cmdline
>    2.71%  cyclictest  [kernel.vmlinux]  [k] do_nanosleep
>    2.48%  cyclictest  [kernel.vmlinux]  [k] do_syscall_64
>    2.11%  cyclictest  [kernel.vmlinux]  [k] pick_next_task_fair
>    2.03%  cyclictest  [kernel.vmlinux]  [k] native_sched_clock
>    1.99%  cyclictest  [kernel.vmlinux]  [k] cpuacct_charge
>    1.93%  cyclictest  [vdso]            [.] __vdso_clock_gettime
>    1.86%  cyclictest  [kernel.vmlinux]  [k] hrtimer_nanosleep
>    1.84%  cyclictest  [kernel.vmlinux]  [k] cpupri_set
>    1.66%  cyclictest  [kernel.vmlinux]  [k] update_curr_rt
>    1.65%  cyclictest  [kernel.vmlinux]  [k] __perf_event_task_sched_out
>    1.59%  cyclictest  [kernel.vmlinux]  [k] dequeue_rt_stack
>    1.55%  cyclictest  [kernel.vmlinux]  [k] __rb_reserve_next
>
> Which shows there's a lot being done before we even get to the ring
> buffer reserve.

I get (CONFIG_PREEMPT_VOLUNTARY):

SCHED EVENTS DISABLED:

   8.57%  cyclictest  [kernel.vmlinux]    [k] syscall_return_via_sysret
   7.04%  cyclictest  [kernel.vmlinux]    [k] native_write_msr
   6.93%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_trampoline
   4.42%  cyclictest  cyclictest          [.] timerthread
   4.14%  cyclictest  [kernel.vmlinux]    [k] __schedule
   3.52%  cyclictest  [kernel.vmlinux]    [k] cpupri_set
   2.90%  cyclictest  [kernel.vmlinux]    [k] finish_task_switch
   2.42%  cyclictest  [kernel.vmlinux]    [k] update_curr_rt
   2.34%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock
   2.28%  cyclictest  [kernel.vmlinux]    [k] ktime_get
   2.14%  cyclictest  [kernel.vmlinux]    [k] timerqueue_add
   2.06%  cyclictest  [kernel.vmlinux]    [k] read_tsc
   2.06%  cyclictest  [kernel.vmlinux]    [k] native_sched_clock
   2.01%  cyclictest  [vdso]              [.] __vdso_clock_gettime
   1.95%  cyclictest  [kernel.vmlinux]    [k] do_nanosleep
   1.93%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_fair
   1.89%  cyclictest  [vdso]              [.] 0x0000000000000977
   1.75%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_rt
   1.64%  cyclictest  [kernel.vmlinux]    [k] x86_pmu_disable
   1.57%  cyclictest  [kernel.vmlinux]    [k] dequeue_rt_stack
   1.53%  cyclictest  [kernel.vmlinux]    [k] __x64_sys_nanosleep
   1.51%  cyclictest  [kernel.vmlinux]    [k] do_syscall_64
   1.41%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_rt
   1.34%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_dl
   1.28%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_rt
   1.25%  cyclictest  [kernel.vmlinux]    [k] cpuacct_charge
   1.25%  cyclictest  [kernel.vmlinux]    [k] hrtimer_start_range_ns
   1.24%  cyclictest  [kernel.vmlinux]    [k] __x86_indirect_thunk_rax
   1.07%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1b
   0.94%  cyclictest  [kernel.vmlinux]    [k] pull_rt_task
   0.93%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_stage2
   0.89%  cyclictest  [kernel.vmlinux]    [k] hrtimer_nanosleep
   0.89%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1d
   0.88%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64
   0.79%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
   0.79%  cyclictest  [kernel.vmlinux]    [k] ctx_sched_out
   0.78%  cyclictest  [kernel.vmlinux]    [k] copy_user_generic_unrolled
   0.77%  cyclictest  libpthread-2.23.so  [.] __pthread_enable_asynccancel
   0.71%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_out
   0.71%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_idle
   0.70%  cyclictest  libpthread-2.23.so  [.] __pthread_disable_asynccancel
   0.70%  cyclictest  [kernel.vmlinux]    [k] __hrtimer_init
   0.68%  cyclictest  [kernel.vmlinux]    [k] clear_page_erms
   0.64%  cyclictest  [kernel.vmlinux]    [k] intel_pmu_disable_all
   0.58%  cyclictest  [kernel.vmlinux]    [k] copy_user_enhanced_fast_string
   0.58%  cyclictest  [kernel.vmlinux]    [k] _copy_from_user
   0.58%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestore
   0.54%  cyclictest  [kernel.vmlinux]    [k] get_nohz_timer_target
   0.54%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_stop
   0.48%  cyclictest  [kernel.vmlinux]    [k] schedule
   0.47%  cyclictest  [kernel.vmlinux]    [k] lapic_next_deadline
   0.43%  cyclictest  [kernel.vmlinux]    [k] sched_avg_update
   0.42%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_in
   0.37%  cyclictest  libc-2.23.so        [.] __clock_gettime
   0.37%  cyclictest  [vdso]              [.] 0x0000000000000974
   0.35%  cyclictest  [kernel.vmlinux]    [k] get_timespec64
   0.32%  cyclictest  [kernel.vmlinux]    [k] dequeue_top_rt_rq
   0.31%  cyclictest  [kernel.vmlinux]    [k] try_to_unmap_one
   0.27%  cyclictest  [kernel.vmlinux]    [k] update_rq_clock
   0.24%  cyclictest  [kernel.vmlinux]    [k] rb_insert_color
   0.23%  cyclictest  [kernel.vmlinux]    [k] deactivate_task
   0.22%  cyclictest  [kernel.vmlinux]    [k] follow_page_pte
   0.21%  cyclictest  [kernel.vmlinux]    [k] page_not_mapped
   0.20%  cyclictest  [kernel.vmlinux]    [k] sched_clock_cpu
   0.18%  cyclictest  cyclictest          [.] main
   0.17%  cyclictest  [kernel.vmlinux]    [k] clockevents_program_event
   0.17%  cyclictest  [kernel.vmlinux]    [k] hrtimer_reprogram
   0.16%  cyclictest  [kernel.vmlinux]    [k] find_busiest_group
   0.15%  cyclictest  [kernel.vmlinux]    [k] enqueue_hrtimer
   0.15%  cyclictest  libc-2.23.so        [.] __nanosleep
   0.14%  cyclictest  cyclictest          [.] clock_gettime@plt
   0.14%  cyclictest  [kernel.vmlinux]    [k] rcu_note_context_switch
   0.14%  cyclictest  [kernel.vmlinux]    [k] __intel_pmu_enable_all
   0.13%  cyclictest  [kernel.vmlinux]    [k] hrtimer_active
   0.13%  cyclictest  libpthread-2.23.so  [.] __nanosleep
   0.13%  cyclictest  [kernel.vmlinux]    [k] __msecs_to_jiffies
   0.13%  cyclictest  [kernel.vmlinux]    [k] rcu_all_qs
   0.13%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c24
   0.13%  cyclictest  [kernel.vmlinux]    [k] hrtimer_try_to_cancel
   0.13%  cyclictest  cyclictest          [.] nanosleep@plt
   0.12%  cyclictest  [kernel.vmlinux]    [k] __intel_pmu_disable_all
   0.12%  cyclictest  [kernel.vmlinux]    [k] _cond_resched
   0.10%  cyclictest  cyclictest          [.] print_stat
   0.10%  cyclictest  [kernel.vmlinux]    [k] perf_ctx_unlock
   0.10%  cyclictest  [kernel.vmlinux]    [k] sched_clock
   0.10%  cyclictest  [kernel.vmlinux]    [k] free_unref_page_list
   0.09%  cyclictest  [vdso]              [.] 0x0000000000000987
   0.09%  cyclictest  [kernel.vmlinux]    [k] update_cfs_group
   0.08%  cyclictest  [kernel.vmlinux]    [k] copy_semundo
   0.08%  cyclictest  [kernel.vmlinux]    [k] update_blocked_averages
   0.07%  cyclictest  [kernel.vmlinux]    [k] tick_program_event
   0.07%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_fair
   0.07%  cyclictest  [kernel.vmlinux]    [k] __handle_mm_fault
   0.06%  cyclictest  [kernel.vmlinux]    [k] plist_del
   0.06%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c0d
   0.06%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_fair
   0.06%  cyclictest  [vdso]              [.] 0x0000000000000979
   0.06%  cyclictest  [kernel.vmlinux]    [k] munlock_vma_pages_range
   0.05%  cyclictest  [vdso]              [.] 0x000000000000097d
   0.05%  cyclictest  [kernel.vmlinux]    [k] __lru_cache_add
   0.05%  cyclictest  [kernel.vmlinux]    [k] update_load_avg
   0.05%  cyclictest  [kernel.vmlinux]    [k] pick_next_entity
   0.04%  cyclictest  [kernel.vmlinux]    [k] update_rt_migration
   0.04%  cyclictest  [kernel.vmlinux]    [k] pagevec_lru_move_fn
   0.04%  cyclictest  [kernel.vmlinux]    [k] reusable_anon_vma
   0.04%  cyclictest  [kernel.vmlinux]    [k] pmd_huge
   0.04%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c21
   0.04%  cyclictest  ld-2.23.so          [.] strcmp
   0.04%  cyclictest  libc-2.23.so        [.] __call_tls_dtors
   0.04%  cyclictest  [kernel.vmlinux]    [k] free_unref_page_commit
   0.04%  cyclictest  [kernel.vmlinux]    [k] unmap_page_range
   0.04%  cyclictest  [kernel.vmlinux]    [k] PageHeadHuge
   0.04%  cyclictest  [kernel.vmlinux]    [k] vma_interval_tree_iter_next
   0.04%  cyclictest  [kernel.vmlinux]    [k] rmap_walk_file
   0.04%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c12
   0.04%  cyclictest  [kernel.vmlinux]    [k] __update_load_avg_se.isra.41
   0.03%  cyclictest  [kernel.vmlinux]    [k] __munlock_pagevec
   0.03%  cyclictest  libc-2.23.so        [.] _itoa_word
   0.03%  cyclictest  [kernel.vmlinux]    [k] __pagevec_lru_add_fn
   0.03%  cyclictest  [kernel.vmlinux]    [k] ext4_file_mmap
   0.03%  cyclictest  [kernel.vmlinux]    [k] free_pcppages_bulk
   0.02%  cyclictest  [kernel.vmlinux]    [k] lock_hrtimer_base.isra.19
   0.02%  cyclictest  [kernel.vmlinux]    [k] __mod_node_page_state
   0.02%  cyclictest  [kernel.vmlinux]    [k] rb_erase
   0.02%  cyclictest  [kernel.vmlinux]    [k] release_pages
   0.02%  cyclictest  [kernel.vmlinux]    [k] clear_buddies
   0.02%  cyclictest  [kernel.vmlinux]    [k] __hrtimer_next_event_base
   0.02%  cyclictest  [kernel.vmlinux]    [k] swapgs_restore_regs_and_return_t
   0.02%  cyclictest  [kernel.vmlinux]    [k] reweight_entity
   0.02%  cyclictest  [kernel.vmlinux]    [k] update_curr
   0.02%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c29
   0.02%  cyclictest  libpthread-2.23.so  [.] start_thread
   0.02%  cyclictest  [kernel.vmlinux]    [k] xhci_queue_intr_tx
   0.02%  cyclictest  [kernel.vmlinux]    [k] __acct_update_integrals
   0.02%  cyclictest  [vdso]              [.] 0x0000000000000971
   0.02%  cyclictest  [kernel.vmlinux]    [k] xhci_check_args
   0.01%  cyclictest  [kernel.vmlinux]    [k] idle_cpu
   0.01%  cyclictest  [kernel.vmlinux]    [k] scheduler_tick
   0.01%  cyclictest  [kernel.vmlinux]    [k] __delayacct_add_tsk
   0.01%  cyclictest  [kernel.vmlinux]    [k] perf_pmu_disable.part.98
   0.01%  cyclictest  [kernel.vmlinux]    [k] end_repeat_nmi
   0.00%  cyclictest  [kernel.vmlinux]    [k] native_irq_return_iret
   0.00%  cyclictest  [kernel.vmlinux]    [k] nmi_restore
   0.00%  cyclictest  [kernel.vmlinux]    [k] do_nmi
   0.00%  cyclictest  [kernel.vmlinux]    [k] nmi_handle
   0.00%  perf        [kernel.vmlinux]    [k] perf_event_addr_filters_exec
   0.00%  cyclictest  [kernel.vmlinux]    [k] perf_event_nmi_handler
   0.00%  perf        [kernel.vmlinux]    [k] native_write_msr
   0.00%  cyclictest  [unknown]           [.] 0000000000000000

SCHED EVENTS ENABLED:

   7.33%  cyclictest  [kernel.vmlinux]    [k] syscall_return_via_sysret
   6.51%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_trampoline
   5.84%  cyclictest  [kernel.vmlinux]    [k] native_write_msr
   4.00%  cyclictest  cyclictest          [.] timerthread
   3.62%  cyclictest  [kernel.vmlinux]    [k] __schedule
   2.71%  cyclictest  [kernel.vmlinux]    [k] cpupri_set
   2.57%  cyclictest  [kernel.vmlinux]    [k] finish_task_switch
   2.46%  cyclictest  [kernel.vmlinux]    [k] native_sched_clock
   2.19%  cyclictest  [kernel.vmlinux]    [k] __x86_indirect_thunk_rax
   2.03%  cyclictest  [kernel.vmlinux]    [k] ktime_get
   2.01%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_fair
   1.93%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock
   1.90%  cyclictest  [kernel.vmlinux]    [k] timerqueue_add
   1.70%  cyclictest  [kernel.vmlinux]    [k] do_nanosleep
   1.67%  cyclictest  [kernel.vmlinux]    [k] read_tsc
   1.65%  cyclictest  [kernel.vmlinux]    [k] cpuacct_charge
   1.60%  cyclictest  [kernel.vmlinux]    [k] update_curr_rt
   1.49%  cyclictest  [kernel.vmlinux]    [k] probe_sched_switch
   1.40%  cyclictest  [kernel.vmlinux]    [k] __x64_sys_nanosleep
   1.34%  cyclictest  [kernel.vmlinux]    [k] dequeue_rt_stack
   1.28%  cyclictest  [kernel.vmlinux]    [k] do_syscall_64
   1.27%  cyclictest  [kernel.vmlinux]    [k] x86_pmu_disable
   1.24%  cyclictest  [kernel.vmlinux]    [k] __rb_reserve_next
   1.23%  cyclictest  [vdso]              [.] 0x0000000000000977
   1.21%  cyclictest  [vdso]              [.] __vdso_clock_gettime
   1.06%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1d
   1.06%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_stage2
   1.03%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_dl
   1.00%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_rt
   1.00%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_rt
   0.98%  cyclictest  [kernel.vmlinux]    [k] trace_event_raw_event_sched_swit
   0.97%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1b
   0.95%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_idle
   0.91%  cyclictest  [kernel.vmlinux]    [k] trace_save_cmdline
   0.90%  cyclictest  [kernel.vmlinux]    [k] copy_user_generic_unrolled
   0.90%  cyclictest  [kernel.vmlinux]    [k] ctx_sched_out
   0.90%  cyclictest  [kernel.vmlinux]    [k] lapic_next_deadline
   0.86%  cyclictest  [kernel.vmlinux]    [k] intel_pmu_disable_all
   0.85%  cyclictest  [kernel.vmlinux]    [k] rb_commit
   0.84%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64
   0.82%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
   0.80%  cyclictest  [kernel.vmlinux]    [k] hrtimer_nanosleep
   0.76%  cyclictest  [kernel.vmlinux]    [k] _copy_from_user
   0.75%  cyclictest  [kernel.vmlinux]    [k] __hrtimer_init
   0.72%  cyclictest  [kernel.vmlinux]    [k] ring_buffer_lock_reserve
   0.69%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_out
   0.66%  cyclictest  [kernel.vmlinux]    [k] hrtimer_start_range_ns
   0.63%  cyclictest  [kernel.vmlinux]    [k] get_nohz_timer_target
   0.62%  cyclictest  [kernel.vmlinux]    [k] clear_page_erms
   0.61%  cyclictest  libpthread-2.23.so  [.] __pthread_disable_asynccancel
   0.59%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_rt
   0.58%  cyclictest  [kernel.vmlinux]    [k] schedule
   0.56%  cyclictest  [kernel.vmlinux]    [k] pull_rt_task
   0.55%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestore
   0.54%  cyclictest  [kernel.vmlinux]    [k] rb_insert_color
   0.52%  cyclictest  [kernel.vmlinux]    [k] trace_event_buffer_reserve
   0.51%  cyclictest  [kernel.vmlinux]    [k] trace_clock_local
   0.50%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_stop
   0.48%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_in
   0.43%  cyclictest  [kernel.vmlinux]    [k] rcu_note_context_switch
   0.41%  cyclictest  [kernel.vmlinux]    [k] trace_event_buffer_commit
   0.41%  cyclictest  [kernel.vmlinux]    [k] deactivate_task
   0.39%  cyclictest  [kernel.vmlinux]    [k] hrtimer_reprogram
   0.36%  cyclictest  libpthread-2.23.so  [.] __pthread_enable_asynccancel
   0.36%  cyclictest  libc-2.23.so        [.] __clock_gettime
   0.34%  cyclictest  [kernel.vmlinux]    [k] tracing_generic_entry_update
   0.34%  cyclictest  [kernel.vmlinux]    [k] enqueue_hrtimer
   0.29%  cyclictest  [kernel.vmlinux]    [k] dequeue_top_rt_rq
   0.29%  cyclictest  [kernel.vmlinux]    [k] perf_ctx_unlock
   0.28%  cyclictest  [kernel.vmlinux]    [k] copy_user_enhanced_fast_string
   0.27%  cyclictest  libpthread-2.23.so  [.] __nanosleep
   0.26%  cyclictest  [kernel.vmlinux]    [k] ring_buffer_unlock_commit
   0.25%  cyclictest  [kernel.vmlinux]    [k] trace_event_buffer_lock_reserve
   0.24%  cyclictest  [kernel.vmlinux]    [k] get_timespec64
   0.23%  cyclictest  [kernel.vmlinux]    [k] find_busiest_group
   0.22%  cyclictest  cyclictest          [.] print_stat
   0.21%  cyclictest  [kernel.vmlinux]    [k] free_unref_page_list
   0.21%  cyclictest  [kernel.vmlinux]    [k] __intel_pmu_enable_all
   0.21%  cyclictest  [kernel.vmlinux]    [k] rb_move_tail
   0.20%  cyclictest  [kernel.vmlinux]    [k] ring_buffer_event_data
   0.20%  cyclictest  [kernel.vmlinux]    [k] vma_interval_tree_subtree_search
   0.20%  cyclictest  [kernel.vmlinux]    [k] sched_avg_update
   0.19%  cyclictest  [kernel.vmlinux]    [k] tick_program_event
   0.19%  cyclictest  [kernel.vmlinux]    [k] follow_page_mask
   0.19%  cyclictest  [kernel.vmlinux]    [k] sched_clock_cpu
   0.19%  cyclictest  [kernel.vmlinux]    [k] tracing_record_taskinfo_sched_sw
   0.16%  cyclictest  [kernel.vmlinux]    [k] get_page_from_freelist
   0.16%  cyclictest  [kernel.vmlinux]    [k] clockevents_program_event
   0.15%  cyclictest  [vdso]              [.] 0x0000000000000974
   0.15%  cyclictest  [kernel.vmlinux]    [k] hrtimer_try_to_cancel
   0.14%  cyclictest  [kernel.vmlinux]    [k] __alloc_pages_nodemask
   0.13%  cyclictest  cyclictest          [.] main
   0.13%  cyclictest  ld-2.23.so          [.] _dl_sysdep_start
   0.13%  cyclictest  [kernel.vmlinux]    [k] rmap_walk_file
   0.12%  cyclictest  [kernel.vmlinux]    [k] update_rq_clock
   0.12%  cyclictest  libc-2.23.so        [.] __libc_disable_asynccancel
   0.12%  cyclictest  [vdso]              [.] 0x0000000000000970
   0.12%  cyclictest  [kernel.vmlinux]    [k] hrtimer_active
   0.11%  cyclictest  [kernel.vmlinux]    [k] page_not_mapped
   0.11%  cyclictest  [kernel.vmlinux]    [k] sched_clock
   0.10%  cyclictest  [kernel.vmlinux]    [k] __mod_zone_page_state
   0.10%  cyclictest  [kernel.vmlinux]    [k] account_entity_dequeue
   0.09%  cyclictest  libc-2.23.so        [.] __nanosleep
   0.09%  cyclictest  [kernel.vmlinux]    [k] rcu_all_qs
   0.09%  cyclictest  [kernel.vmlinux]    [k] _cond_resched
   0.08%  cyclictest  [kernel.vmlinux]    [k] ring_buffer_record_is_on
   0.08%  cyclictest  [kernel.vmlinux]    [k] dequeue_entity
   0.08%  cyclictest  [kernel.vmlinux]    [k] lock_hrtimer_base.isra.19
   0.08%  cyclictest  [kernel.vmlinux]    [k] update_blocked_averages
   0.08%  cyclictest  [kernel.vmlinux]    [k] ftrace_trace_userstack
   0.07%  cyclictest  [kernel.vmlinux]    [k] try_to_unmap_one
   0.07%  cyclictest  [kernel.vmlinux]    [k] tracing_record_taskinfo_skip
   0.07%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c38
   0.07%  cyclictest  [kernel.vmlinux]    [k] __intel_pmu_disable_all
   0.07%  cyclictest  [kernel.vmlinux]    [k] free_pcppages_bulk
   0.07%  cyclictest  [kernel.vmlinux]    [k] task_ctx_sched_out
   0.07%  cyclictest  [kernel.vmlinux]    [k] __msecs_to_jiffies
   0.07%  cyclictest  [vdso]              [.] 0x0000000000000987
   0.07%  cyclictest  [kernel.vmlinux]    [k] update_rt_migration
   0.06%  cyclictest  [kernel.vmlinux]    [k] reweight_entity
   0.06%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_fair
   0.06%  cyclictest  [kernel.vmlinux]    [k] update_load_avg
   0.06%  cyclictest  [kernel.vmlinux]    [k] trace_buffer_unlock_commit_regs
   0.06%  cyclictest  [kernel.vmlinux]    [k] trace_event_raw_event_sched_stat
   0.05%  cyclictest  [kernel.vmlinux]    [k] update_curr
   0.05%  cyclictest  cyclictest          [.] clock_gettime@plt
   0.04%  cyclictest  [vdso]              [.] 0x000000000000097d
   0.04%  cyclictest  [kernel.vmlinux]    [k] rcu_check_callbacks
   0.04%  cyclictest  [kernel.vmlinux]    [k] load_balance
   0.03%  cyclictest  [kernel.vmlinux]    [k] perf_pmu_disable.part.98
   0.03%  cyclictest  [kernel.vmlinux]    [k] __update_load_avg_se.isra.41
   0.03%  cyclictest  [kernel.vmlinux]    [k] __calc_delta
   0.03%  cyclictest  [kernel.vmlinux]    [k] futex_wake
   0.03%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c0d
   0.03%  cyclictest  cyclictest          [.] nanosleep@plt
   0.02%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c09
   0.02%  cyclictest  [kernel.vmlinux]    [k] set_next_entity
   0.02%  cyclictest  [kernel.vmlinux]    [k] switch_mm_irqs_off
   0.02%  cyclictest  [kernel.vmlinux]    [k] balance_runtime
   0.02%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_fair
   0.02%  cyclictest  [kernel.vmlinux]    [k] find_task_by_vpid
   0.02%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c12
   0.02%  cyclictest  [kernel.vmlinux]    [k] kfree
   0.02%  cyclictest  [kernel.vmlinux]    [k] put_prev_entity
   0.01%  cyclictest  [kernel.vmlinux]    [k] rb_next
   0.01%  cyclictest  [kernel.vmlinux]    [k] ring_buffer_time_stamp_abs
   0.01%  cyclictest  [kernel.vmlinux]    [k] update_min_vruntime
   0.01%  cyclictest  [kernel.vmlinux]    [k] enqueue_top_rt_rq
   0.01%  cyclictest  [kernel.vmlinux]    [k] __kmalloc_node
   0.01%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c30
   0.01%  cyclictest  [kernel.vmlinux]    [k] ktime_get_update_offsets_now
   0.01%  cyclictest  [kernel.vmlinux]    [k] check_cfs_rq_runtime
   0.01%  cyclictest  [kernel.vmlinux]    [k] cpu_load_update
   0.01%  cyclictest  [kernel.vmlinux]    [k] flush_signal_handlers
   0.01%  cyclictest  [kernel.vmlinux]    [k] page_counter_uncharge
   0.00%  cyclictest  [kernel.vmlinux]    [k] end_repeat_nmi
   0.00%  cyclictest  [kernel.vmlinux]    [k] nmi_cpu_backtrace_handler
   0.00%  cyclictest  [kernel.vmlinux]    [k] nmi_restore
   0.00%  cyclictest  [kernel.vmlinux]    [k] nmi_cpu_backtrace
   0.00%  cyclictest  [kernel.vmlinux]    [k] nmi_handle
   0.00%  perf        [kernel.vmlinux]    [k] native_sched_clock
   0.00%  cyclictest  [kernel.vmlinux]    [k] intel_pmu_handle_irq
   0.00%  cyclictest  [kernel.vmlinux]    [k] perf_event_nmi_handler
   0.00%  cyclictest  [kernel.vmlinux]    [k] native_apic_msr_write
   0.00%  perf        [kernel.vmlinux]    [k] native_write_msr

> One thing that I notice too is that we have trace_save_cmdline that
> also gets activated when enabling trace events. Perhaps I should try to
> optimize that more.
>
>> I have run the following test 20 times, and thrown away the first results:
>>
>> $ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q
>>
>> $ uname -a
>> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> For brevity, this is a comparison of one test's results. All other test results show
>> the same ~25% increase.
>>
>> On the left side, the run without ftrace sched events; on the right side, with ftrace
>> sched events enabled.
>>
>> CPU   Count   Min   Act   Avg     Max   Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
>> 0    100000  2339  2936  2841  139478  100000        2900        3182        3566       93056
>> 1     66742  2365  3386  2874   93639   66750        2959        3786        3646      154074
>> 2     50080  2376  3058  2910  196221   50097        2997        4209        3655       18707
>> 3     40076  2394  3461  2931   17914   40091        3006        4417        3750       17159
>> 4     33404  2371  3612  2834   15336   33419        2997        3836        3594       23172
>> 5     28635  2387  3313  2885   25863   28649        2995        3795        3647        9956
>> 6     25058  2384  3428  2968   12162   25071        3051        4366        3719       18151
>> 7     22275  2381  2859  2982   10706   22287        3046        5078        3825       10781
>>
>> I would be thankful for any advice or comments on this, especially with the goal
>> in mind of lowering the runtime impact on the system as much as possible.
>
> But still, we use all sorts of trace events when dealing with real
> time, and trace events hardly show anything in the PREEMPT_RT kernel.
> There, cyclictest shows 20 microseconds, and sched events hardly touch
> that. But I see this weird side effect on a distro kernel too. Maybe
> there's something weird going on when we have CONFIG_PREEMPT_VOLUNTARY
> set :-/

If it helps, I tested also CONFIG_PREEMPT_NONE=y and CONFIG_PREEMPT=y,
with the same relative result (~25% increase).

> I'll investigate further.

If stress-ng --switch-ops is of any help, I have some results for that as well.
This is with CONFIG_PREEMPT_NONE:

SCHED EVENTS DISABLED

$ stress-ng --times -s 8 --switch-ops 20000000

Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
1          22.56     6.33    92.96     99.29      886525
2          23.09     6.34    92.34     98.68      866176
3          23.54     6.35    89.02     95.37      849618
4          22.19     6.56    92.73     99.29      901306
5          22.69     6.41    92.77     99.18      881445

SCHED EVENTS ENABLED

$ stress-ng --times -s 8 --switch-ops 20000000

Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
1          36.28     4.12    95.31     99.43      551268
2          37.61     4.07    95.18     99.25      531773
3          37.96     4.09    94.94     99.03      526870
4          38.17     4.01    94.97     98.99      523972
5          37.47     4.11    94.92     99.03      533760

This is with CONFIG_PREEMPT:

SCHED EVENTS DISABLED

$ stress-ng --times -s 8 --switch-ops 20000000

Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
1          25.73     5.65    93.28     98.93      777303
2          25.74     5.68    93.14     98.82      777001
3          24.84     5.88    92.95     98.83      805153
4          25.12     5.69    93.06     98.75      796178
5          26.04     5.58    92.79     98.37      768049

SCHED EVENTS ENABLED

Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
1          41.70     3.72    95.90     99.62      479616
2          42.91     3.59    95.52     99.11      466092
3          43.24     3.48    95.57     99.05      462535
4          43.04     3.46    95.83     99.30      464684
5          42.62     3.60    95.75     99.34      469263

This is with CONFIG_PREEMPT_VOLUNTARY:

SCHED EVENTS DISABLED

$ stress-ng --times -s 8 --switch-ops 20000000

Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
1          23.27     6.24    91.41     97.64      859476
2          23.51     6.16    92.70     98.86      850702
3          23.47     6.42    92.62     99.04      852152
4          23.92     6.34    92.47     98.81      836120
5          23.63     6.28    92.77     99.05      846382

SCHED EVENTS ENABLED

Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
1          38.65     3.82    95.12     98.94      517464
2          38.77     4.27    94.77     99.03      515863
3          38.70     3.96    95.29     99.25      516796
4          38.52     4.06    95.32     99.38      519211
5          38.58     4.24    94.71     98.95      518403

> Thanks for the report.
>
> -- Steve

Thank you,

I will try to get more familiar with the internals
and then investigate as well.

Claudio
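As a rough cross-check, the bogo ops/s figures above line up with the
cyclictest deltas. Taking the CONFIG_PREEMPT_NONE runs, and assuming one bogo
op corresponds to roughly one context switch:

  1 / 886525 ops/s  ~= 1.13 us per op   (sched events disabled)
  1 / 551268 ops/s  ~= 1.81 us per op   (sched events enabled)
  difference        ~= 0.68 us of added cost per op

which is the same order as the ~0.7-0.8 us increase in the cyclictest
averages.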
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Claudio @ 2018-07-09 14:53 UTC
To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On 07/09/2018 12:06 PM, Claudio wrote:
> Hello Steve,
>
> thank you for your response,
>
> On 07/06/2018 11:24 PM, Steven Rostedt wrote:
>> On Fri, 6 Jul 2018 08:22:01 +0200
>> Claudio <claudio.fontana@gliwa.com> wrote:
>>
>>> Hello all,
>>>
>>> I have been experimenting with the idea of leaving ftrace enabled, with sched events,
>>> on production systems.
>>>
>>> The main concern that I am having at the moment is about the impact on the system.
>>> Enabling the sched events that I currently need for the tracing application
>>> seems to slow down context switches considerably, and make the system less responsive.
>>>
>>> I have tested with cyclictest on the mainline kernel, and noticed an increase of
>>> around 25% in min and avg latencies.
>>>
>>> Is this expected?
>>
>> No it is not. And this is weird. I'm seeing it too.
>>
>>> Some initial investigation into ftrace seems to point at the reservation and commit
>>> of the events into the ring buffer as the highest sources of overhead, while copying
>>> of event parameters, including COMM, does not seem to have any noticeable effect
>>> relative to those costs.
>>
>> Did you use perf to see where the issues were? If so what perf command
>> did you use?
>
> I did not use perf, I will now do it and see what I get.
> You can discard my comment regarding reservation and commit, it was based
> on a wrong premise.
>
> Regarding parameter copying, I just brutally modified the code to not fill
> fields (TP_fast_assign), and I just noticed that it did not change the
> result significantly.
>
>> When running just:
>>
>>   perf record ./cyclictest --smp -p95 -m -s -N -l 1000000 -q
>>
>> I get this:
>>
>>    6.08%  cyclictest  [kernel.vmlinux]  [k] syscall_return_via_sysret
>>    4.03%  cyclictest  cyclictest        [.] timerthread
>>    3.97%  cyclictest  [kernel.vmlinux]  [k] __schedule
>>    3.81%  cyclictest  [kernel.vmlinux]  [k] native_write_msr
>>    3.13%  cyclictest  [kernel.vmlinux]  [k] trace_save_cmdline
>>    2.71%  cyclictest  [kernel.vmlinux]  [k] do_nanosleep
>>    2.48%  cyclictest  [kernel.vmlinux]  [k] do_syscall_64
>>    2.11%  cyclictest  [kernel.vmlinux]  [k] pick_next_task_fair
>>    2.03%  cyclictest  [kernel.vmlinux]  [k] native_sched_clock
>>    1.99%  cyclictest  [kernel.vmlinux]  [k] cpuacct_charge
>>    1.93%  cyclictest  [vdso]            [.] __vdso_clock_gettime
>>    1.86%  cyclictest  [kernel.vmlinux]  [k] hrtimer_nanosleep
>>    1.84%  cyclictest  [kernel.vmlinux]  [k] cpupri_set
>>    1.66%  cyclictest  [kernel.vmlinux]  [k] update_curr_rt
>>    1.65%  cyclictest  [kernel.vmlinux]  [k] __perf_event_task_sched_out
>>    1.59%  cyclictest  [kernel.vmlinux]  [k] dequeue_rt_stack
>>    1.55%  cyclictest  [kernel.vmlinux]  [k] __rb_reserve_next
>>
>> Which shows there's a lot being done before we even get to the ring
>> buffer reserve.
>
> I get (CONFIG_PREEMPT_VOLUNTARY):
>
> SCHED EVENTS DISABLED:
>
>    8.57%  cyclictest  [kernel.vmlinux]    [k] syscall_return_via_sysret
>    7.04%  cyclictest  [kernel.vmlinux]    [k] native_write_msr
>    6.93%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_trampoline
>    4.42%  cyclictest  cyclictest          [.] timerthread
>    4.14%  cyclictest  [kernel.vmlinux]    [k] __schedule
>    3.52%  cyclictest  [kernel.vmlinux]    [k] cpupri_set
>    2.90%  cyclictest  [kernel.vmlinux]    [k] finish_task_switch
>    2.42%  cyclictest  [kernel.vmlinux]    [k] update_curr_rt
>    2.34%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock
>    2.28%  cyclictest  [kernel.vmlinux]    [k] ktime_get
>    2.14%  cyclictest  [kernel.vmlinux]    [k] timerqueue_add
>    2.06%  cyclictest  [kernel.vmlinux]    [k] read_tsc
>    2.06%  cyclictest  [kernel.vmlinux]    [k] native_sched_clock
>    2.01%  cyclictest  [vdso]              [.] __vdso_clock_gettime
>    1.95%  cyclictest  [kernel.vmlinux]    [k] do_nanosleep
>    1.93%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_fair
>    1.89%  cyclictest  [vdso]              [.] 0x0000000000000977
>    1.75%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_rt
>    1.64%  cyclictest  [kernel.vmlinux]    [k] x86_pmu_disable
>    1.57%  cyclictest  [kernel.vmlinux]    [k] dequeue_rt_stack
>    1.53%  cyclictest  [kernel.vmlinux]    [k] __x64_sys_nanosleep
>    1.51%  cyclictest  [kernel.vmlinux]    [k] do_syscall_64
>    1.41%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_rt
>    1.34%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_dl
>    1.28%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_rt
>    1.25%  cyclictest  [kernel.vmlinux]    [k] cpuacct_charge
>    1.25%  cyclictest  [kernel.vmlinux]    [k] hrtimer_start_range_ns
>    1.24%  cyclictest  [kernel.vmlinux]    [k] __x86_indirect_thunk_rax
>    1.07%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1b
>    0.94%  cyclictest  [kernel.vmlinux]    [k] pull_rt_task
>    0.93%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_stage2
>    0.89%  cyclictest  [kernel.vmlinux]    [k] hrtimer_nanosleep
>    0.89%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1d
>    0.88%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64
>    0.79%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
>    0.79%  cyclictest  [kernel.vmlinux]    [k] ctx_sched_out
>    0.78%  cyclictest  [kernel.vmlinux]    [k] copy_user_generic_unrolled
>    0.77%  cyclictest  libpthread-2.23.so  [.] __pthread_enable_asynccancel
>    0.71%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_out
>    0.71%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_idle
>    0.70%  cyclictest  libpthread-2.23.so  [.] __pthread_disable_asynccancel
>    0.70%  cyclictest  [kernel.vmlinux]    [k] __hrtimer_init
>    0.68%  cyclictest  [kernel.vmlinux]    [k] clear_page_erms
>    0.64%  cyclictest  [kernel.vmlinux]    [k] intel_pmu_disable_all
>    0.58%  cyclictest  [kernel.vmlinux]    [k] copy_user_enhanced_fast_string
>    0.58%  cyclictest  [kernel.vmlinux]    [k] _copy_from_user
>    0.58%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestore
>    0.54%  cyclictest  [kernel.vmlinux]    [k] get_nohz_timer_target
>    0.54%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_stop
[...]
>
> SCHED EVENTS ENABLED:
>
>    7.33%  cyclictest  [kernel.vmlinux]    [k] syscall_return_via_sysret
>    6.51%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_trampoline
>    5.84%  cyclictest  [kernel.vmlinux]    [k] native_write_msr
>    4.00%  cyclictest  cyclictest          [.] timerthread
>    3.62%  cyclictest  [kernel.vmlinux]    [k] __schedule
>    2.71%  cyclictest  [kernel.vmlinux]    [k] cpupri_set
>    2.57%  cyclictest  [kernel.vmlinux]    [k] finish_task_switch
>    2.46%  cyclictest  [kernel.vmlinux]    [k] native_sched_clock
>    2.19%  cyclictest  [kernel.vmlinux]    [k] __x86_indirect_thunk_rax
>    2.03%  cyclictest  [kernel.vmlinux]    [k] ktime_get
>    2.01%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_fair
>    1.93%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock
>    1.90%  cyclictest  [kernel.vmlinux]    [k] timerqueue_add
>    1.70%  cyclictest  [kernel.vmlinux]    [k] do_nanosleep
>    1.67%  cyclictest  [kernel.vmlinux]    [k] read_tsc
>    1.65%  cyclictest  [kernel.vmlinux]    [k] cpuacct_charge
>    1.60%  cyclictest  [kernel.vmlinux]    [k] update_curr_rt
>    1.49%  cyclictest  [kernel.vmlinux]    [k] probe_sched_switch
>    1.40%  cyclictest  [kernel.vmlinux]    [k] __x64_sys_nanosleep
>    1.34%  cyclictest  [kernel.vmlinux]    [k] dequeue_rt_stack
>    1.28%  cyclictest  [kernel.vmlinux]    [k] do_syscall_64
>    1.27%  cyclictest  [kernel.vmlinux]    [k] x86_pmu_disable
>    1.24%  cyclictest  [kernel.vmlinux]    [k] __rb_reserve_next
>    1.23%  cyclictest  [vdso]              [.] 0x0000000000000977
>    1.21%  cyclictest  [vdso]              [.] __vdso_clock_gettime
>    1.06%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1d
>    1.06%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_stage2
>    1.03%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_dl
>    1.00%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_rt
>    1.00%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_rt
>    0.98%  cyclictest  [kernel.vmlinux]    [k] trace_event_raw_event_sched_swit
>    0.97%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1b
>    0.95%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_idle
>    0.91%  cyclictest  [kernel.vmlinux]    [k] trace_save_cmdline
>    0.90%  cyclictest  [kernel.vmlinux]    [k] copy_user_generic_unrolled
>    0.90%  cyclictest  [kernel.vmlinux]    [k] ctx_sched_out
>    0.90%  cyclictest  [kernel.vmlinux]    [k] lapic_next_deadline
>    0.86%  cyclictest  [kernel.vmlinux]    [k] intel_pmu_disable_all
>    0.85%  cyclictest  [kernel.vmlinux]    [k] rb_commit
>    0.84%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64
>    0.82%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
>    0.80%  cyclictest  [kernel.vmlinux]    [k] hrtimer_nanosleep
>    0.76%  cyclictest  [kernel.vmlinux]    [k] _copy_from_user
>    0.75%  cyclictest  [kernel.vmlinux]    [k] __hrtimer_init
>    0.72%  cyclictest  [kernel.vmlinux]    [k] ring_buffer_lock_reserve
>    0.69%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_out
>    0.66%  cyclictest  [kernel.vmlinux]    [k] hrtimer_start_range_ns
>    0.63%  cyclictest  [kernel.vmlinux]    [k] get_nohz_timer_target
>    0.62%  cyclictest  [kernel.vmlinux]    [k] clear_page_erms
>    0.61%  cyclictest  libpthread-2.23.so  [.] __pthread_disable_asynccancel
>    0.59%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_rt
>    0.58%  cyclictest  [kernel.vmlinux]    [k] schedule
>    0.56%  cyclictest  [kernel.vmlinux]    [k] pull_rt_task
>    0.55%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestore
>    0.54%  cyclictest  [kernel.vmlinux]    [k] rb_insert_color
>    0.52%  cyclictest  [kernel.vmlinux]    [k] trace_event_buffer_reserve
>    0.51%  cyclictest  [kernel.vmlinux]    [k] trace_clock_local
>    0.50%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_stop
[...]
>
>> One thing that I notice too is that we have trace_save_cmdline that
>> also gets activated when enabling trace events. Perhaps I should try to
>> optimize that more.
>>> I have run the following test 20 times, and thrown away the first results:
>>>
>>> $ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q
>>>
>>> $ uname -a
>>> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> For brevity, this is a comparison of one test's results. All other test results show
>>> the same ~25% increase.
>>>
>>> On the left side, the run without ftrace sched events; on the right side, with ftrace
>>> sched events enabled.
>>>
>>> CPU   Count   Min   Act   Avg     Max   Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
>>> 0    100000  2339  2936  2841  139478  100000        2900        3182        3566       93056
>>> 1     66742  2365  3386  2874   93639   66750        2959        3786        3646      154074
>>> 2     50080  2376  3058  2910  196221   50097        2997        4209        3655       18707
>>> 3     40076  2394  3461  2931   17914   40091        3006        4417        3750       17159
>>> 4     33404  2371  3612  2834   15336   33419        2997        3836        3594       23172
>>> 5     28635  2387  3313  2885   25863   28649        2995        3795        3647        9956
>>> 6     25058  2384  3428  2968   12162   25071        3051        4366        3719       18151
>>> 7     22275  2381  2859  2982   10706   22287        3046        5078        3825       10781
>>>
>>> I would be thankful for any advice or comments on this, especially with the goal
>>> in mind of lowering the runtime impact on the system as much as possible.
>>
>> But still, we use all sorts of trace events when dealing with real
>> time, and trace events hardly show anything in the PREEMPT_RT kernel.
>> There, cyclictest shows 20 microseconds, and sched events hardly touch
>> that. But I see this weird side effect on a distro kernel too. Maybe
>> there's something weird going on when we have CONFIG_PREEMPT_VOLUNTARY
>> set :-/
>
> If it helps, I tested also CONFIG_PREEMPT_NONE=y and CONFIG_PREEMPT=y,
> with the same relative result (~25% increase).
>
>> I'll investigate further.
>
> If stress-ng --switch-ops is of any help, I have some results for that as well.
> This is with CONFIG_PREEMPT_NONE:
>
> SCHED EVENTS DISABLED
>
> $ stress-ng --times -s 8 --switch-ops 20000000
>
> Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
> 1          22.56     6.33    92.96     99.29      886525
> 2          23.09     6.34    92.34     98.68      866176
> 3          23.54     6.35    89.02     95.37      849618
> 4          22.19     6.56    92.73     99.29      901306
> 5          22.69     6.41    92.77     99.18      881445
>
> SCHED EVENTS ENABLED
>
> $ stress-ng --times -s 8 --switch-ops 20000000
>
> Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
> 1          36.28     4.12    95.31     99.43      551268
> 2          37.61     4.07    95.18     99.25      531773
> 3          37.96     4.09    94.94     99.03      526870
> 4          38.17     4.01    94.97     98.99      523972
> 5          37.47     4.11    94.92     99.03      533760
>
> This is with CONFIG_PREEMPT:
>
> SCHED EVENTS DISABLED
>
> $ stress-ng --times -s 8 --switch-ops 20000000
>
> Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
> 1          25.73     5.65    93.28     98.93      777303
> 2          25.74     5.68    93.14     98.82      777001
> 3          24.84     5.88    92.95     98.83      805153
> 4          25.12     5.69    93.06     98.75      796178
> 5          26.04     5.58    92.79     98.37      768049
>
> SCHED EVENTS ENABLED
>
> Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
> 1          41.70     3.72    95.90     99.62      479616
> 2          42.91     3.59    95.52     99.11      466092
> 3          43.24     3.48    95.57     99.05      462535
> 4          43.04     3.46    95.83     99.30      464684
> 5          42.62     3.60    95.75     99.34      469263
>
> This is with CONFIG_PREEMPT_VOLUNTARY:
>
> SCHED EVENTS DISABLED
>
> $ stress-ng --times -s 8 --switch-ops 20000000
>
> Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
> 1          23.27     6.24    91.41     97.64      859476
> 2          23.51     6.16    92.70     98.86      850702
> 3          23.47     6.42    92.62     99.04      852152
> 4          23.92     6.34    92.47     98.81      836120
> 5          23.63     6.28    92.77     99.05      846382
>
> SCHED EVENTS ENABLED
>
> Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
> 1          38.65     3.82    95.12     98.94      517464
> 2          38.77     4.27    94.77     99.03      515863
> 3          38.70     3.96    95.29     99.25      516796
> 4          38.52     4.06    95.32     99.38      519211
> 5          38.58     4.24    94.71     98.95      518403
>
>> Thanks for the report.
>>
>> -- Steve
>
> Thank you,
>
> I will try to get more familiar with the internals
> and then investigate as well.
>
> Claudio

One additional data point, based on brute force again:

I applied this change, in order to understand if it was the
trace_event_raw_event_* (I suppose primarily trace_event_raw_event_switch)
that contained the latency "offenders":

diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 4ecdfe2..969467d 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -704,6 +704,8 @@ trace_event_raw_event_##call(void *__data, proto)
 	struct trace_event_raw_##call *entry;			\
 	int __data_size;					\
 								\
+	return;							\
+								\
 	if (trace_trigger_soft_disabled(trace_file))		\
 		return;						\
 								\

This reduces the latency overhead from 25% down to 6%.

Maybe obvious? Wanted to share in case it helps, and will dig further.

Ciao,

Claudio
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Steven Rostedt @ 2018-07-09 15:11 UTC
To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Mon, 9 Jul 2018 16:53:52 +0200
Claudio <claudio.fontana@gliwa.com> wrote:

> One additional data point, based on brute force again:
>
> I applied this change, in order to understand if it was the
> trace_event_raw_event_* (I suppose primarily trace_event_raw_event_switch)
> that contained the latency "offenders":
>
> diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
> index 4ecdfe2..969467d 100644
> --- a/include/trace/trace_events.h
> +++ b/include/trace/trace_events.h
> @@ -704,6 +704,8 @@ trace_event_raw_event_##call(void *__data, proto)
>  	struct trace_event_raw_##call *entry;			\
>  	int __data_size;					\
>  								\
> +	return;							\
> +								\
>  	if (trace_trigger_soft_disabled(trace_file))		\
>  		return;						\
>  								\
>
> This reduces the latency overhead from 25% down to 6%.
>
> Maybe obvious? Wanted to share in case it helps, and will dig further.

I noticed that just disabling tracing ("echo 0 > tracing_on") is very
similar. I'm now recording timings of various parts of the code. But the
most I've seen is 12us, which should not add this much overhead. So it's
triggering something else.

I'll be going on PTO next week, and there's things I must do this week,
thus I may not have much more time to look into this until I get back
from PTO (July 23rd).

-- Steve
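The comparison described here can be reproduced roughly as follows (tracefs
path assumed):

  cd /sys/kernel/tracing
  echo 1 > events/sched/enable
  echo 0 > tracing_on    # tracepoints still fire, but nothing is recorded
  ./cyclictest --smp -p95 -m -s -N -l 100000 -q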
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Steven Rostedt @ 2018-07-09 15:32 UTC
To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Fri, 6 Jul 2018 08:22:01 +0200
Claudio <claudio.fontana@gliwa.com> wrote:

> Hello all,
>
> I have been experimenting with the idea of leaving ftrace enabled, with sched events,
> on production systems.
>
> The main concern that I am having at the moment is about the impact on the system.
> Enabling the sched events that I currently need for the tracing application
> seems to slow down context switches considerably, and make the system less responsive.
>
> I have tested with cyclictest on the mainline kernel, and noticed an increase of
> around 25% in min and avg latencies.
>
> Is this expected?
>
> Some initial investigation into ftrace seems to point at the reservation and commit
> of the events into the ring buffer as the highest sources of overhead, while copying
> of event parameters, including COMM, does not seem to have any noticeable effect
> relative to those costs.
>
> I have run the following test 20 times, and thrown away the first results:
>
> $ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q

OK, I just noticed that you are using -N which means all numbers are in
nanoseconds.

> $ uname -a
> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> For brevity, this is a comparison of one test's results. All other test results show
> the same ~25% increase.
>
> On the left side, the run without ftrace sched events; on the right side, with ftrace
> sched events enabled.
>
> CPU   Count   Min   Act   Avg     Max   Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
> 0    100000  2339  2936  2841  139478  100000        2900        3182        3566       93056
> 1     66742  2365  3386  2874   93639   66750        2959        3786        3646      154074
> 2     50080  2376  3058  2910  196221   50097        2997        4209        3655       18707
> 3     40076  2394  3461  2931   17914   40091        3006        4417        3750       17159
> 4     33404  2371  3612  2834   15336   33419        2997        3836        3594       23172
> 5     28635  2387  3313  2885   25863   28649        2995        3795        3647        9956
> 6     25058  2384  3428  2968   12162   25071        3051        4366        3719       18151
> 7     22275  2381  2859  2982   10706   22287        3046        5078        3825       10781
>
> I would be thankful for any advice or comments on this, especially with the goal
> in mind of lowering the runtime impact on the system as much as possible.

Thus, the tracing is causing the wakeup time to be an average of 0.8us
longer.

Yes that is expected.

-- Steve
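That figure can be read directly off the table above:

  CPU 0: 3566 - 2841 = 725 ns
  CPU 7: 3825 - 2982 = 843 ns

with the other CPUs falling in between (roughly 720-850 ns), hence the ~0.8 us
of added average latency.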
* ftrace global trace_pipe_raw
From: Claudio @ 2018-07-24  9:58 UTC
To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

Hello Steven,

I am doing correlation of Linux sched events, following all tasks between CPUs,
and one thing that would be really convenient would be to have a global
trace_pipe_raw, in addition to the per-CPU ones, with already sorted events.

I would imagine the core functionality is already available, since trace_pipe
in the tracing directory already shows all events regardless of CPU, and so
it would be a matter of doing the same for trace_pipe_raw.

But is there a good reason why trace_pipe_raw is available only per-CPU?

Would work in the direction of adding a global trace_pipe_raw be considered
for inclusion?

Thank you,

Claudio

On 07/09/2018 05:32 PM, Steven Rostedt wrote:
> On Fri, 6 Jul 2018 08:22:01 +0200
> Claudio <claudio.fontana@gliwa.com> wrote:
>
>> Hello all,
>>
>> I have been experimenting with the idea of leaving ftrace enabled, with sched events,
>> on production systems.
>>
>> The main concern that I am having at the moment is about the impact on the system.
>> Enabling the sched events that I currently need for the tracing application
>> seems to slow down context switches considerably, and make the system less responsive.
>>
>> I have tested with cyclictest on the mainline kernel, and noticed an increase of
>> around 25% in min and avg latencies.
>>
>> Is this expected?
>>
>> Some initial investigation into ftrace seems to point at the reservation and commit
>> of the events into the ring buffer as the highest sources of overhead, while copying
>> of event parameters, including COMM, does not seem to have any noticeable effect
>> relative to those costs.
>>
>> I have run the following test 20 times, and thrown away the first results:
>>
>> $ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q
>
> OK, I just noticed that you are using -N which means all numbers are in
> nanoseconds.
>
>> $ uname -a
>> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> For brevity, this is a comparison of one test's results. All other test results show
>> the same ~25% increase.
>>
>> On the left side, the run without ftrace sched events; on the right side, with ftrace
>> sched events enabled.
>>
>> CPU   Count   Min   Act   Avg     Max   Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
>> 0    100000  2339  2936  2841  139478  100000        2900        3182        3566       93056
>> 1     66742  2365  3386  2874   93639   66750        2959        3786        3646      154074
>> 2     50080  2376  3058  2910  196221   50097        2997        4209        3655       18707
>> 3     40076  2394  3461  2931   17914   40091        3006        4417        3750       17159
>> 4     33404  2371  3612  2834   15336   33419        2997        3836        3594       23172
>> 5     28635  2387  3313  2885   25863   28649        2995        3795        3647        9956
>> 6     25058  2384  3428  2968   12162   25071        3051        4366        3719       18151
>> 7     22275  2381  2859  2982   10706   22287        3046        5078        3825       10781
>>
>> I would be thankful for any advice or comments on this, especially with the goal
>> in mind of lowering the runtime impact on the system as much as possible.
>
> Thus, the tracing is causing the wakeup time to be an average of 0.8us
> longer.
>
> Yes that is expected.
>
> -- Steve
* Re: ftrace global trace_pipe_raw 2018-07-24 9:58 ` ftrace global trace_pipe_raw Claudio @ 2018-07-24 14:23 ` Steven Rostedt 2018-07-24 14:25 ` Steven Rostedt 2018-12-19 11:32 ` Claudio 0 siblings, 2 replies; 21+ messages in thread From: Steven Rostedt @ 2018-07-24 14:23 UTC (permalink / raw) To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Tue, 24 Jul 2018 11:58:18 +0200 Claudio <claudio.fontana@gliwa.com> wrote:

> Hello Steven,
>
> I am doing correlation of linux sched events, following all tasks between cpus,
> and one thing that would be really convenient would be to have a global
> trace_pipe_raw, in addition to the per-cpu ones, with already sorted events.
>
> I would imagine the core functionality is already available, since trace_pipe
> in the tracing directory already shows all events regardless of CPU, and so
> it would be a matter of doing the same for trace_pipe_raw.

The difference between trace_pipe and trace_pipe_raw is that trace_pipe is post-processed: it reads the per-CPU buffers and interleaves them one event at a time. The trace_pipe_raw just sends you the raw, unprocessed data directly from the buffers, which are grouped per CPU.

>
> But is there a good reason why trace_pipe_raw is available only per-cpu?

Yes, because it maps the ring buffers themselves without any post-processing.

>
> Would work in the direction of adding a global trace_pipe_raw be considered
> for inclusion?

The design of the lockless ring buffer requires that the writer not be preempted, and that the data not be written from more than one location. To achieve this, we make a per-CPU buffer and disable preemption when writing. This means that we have only one writer at a time. It can handle interrupts and NMIs, because they finish their writes before returning to the interrupted context, and this doesn't break the algorithm. But having writers from multiple CPUs would require locking or other heavy synchronization operations that would greatly reduce the speed of writing to the buffers (not to mention the cache thrashing).

-- Steve

^ permalink raw reply [flat|nested] 21+ messages in thread
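To illustrate the argument rather than the kernel's actual code: below is a userspace analogy of the single-writer-per-CPU scheme, with a signal handler standing in for an interrupt that nests a write on the same "CPU". The real __rb_reserve_next() uses CPU-local operations on the tail pointer so nested writers reserve and commit safely; here a plain atomic bump plays that role, and overflow handling is deliberately simplistic (a real buffer would undo the bump):

    #include <signal.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* one buffer per CPU in the kernel; here, one buffer and a
     * signal handler modeling an interrupt on that same CPU */
    static char data[4096];
    static _Atomic size_t head;

    static char *reserve(size_t len)
    {
        /* single atomic bump: safe against the nested "interrupt"
         * below, with no locks; the kernel can use cheaper CPU-local
         * ops since only this CPU ever writes to this buffer */
        size_t at = atomic_fetch_add(&head, len);
        return at + len <= sizeof(data) ? data + at : NULL;
    }

    static void irq_handler(int sig)
    {
        /* nested writer: runs to completion before returning,
         * just like an interrupt, so the algorithm still holds */
        char *p = reserve(8);
        if (p)
            memcpy(p, "irq_evt", 8);
    }

    int main(void)
    {
        signal(SIGALRM, irq_handler);
        alarm(1);                     /* the "interrupt" arrives later */
        for (int i = 0; i < 400; i++) {
            char *p = reserve(8);     /* the normal writer on this CPU */
            if (p)
                memcpy(p, "sched_ev", 8);
        }
        sleep(2);                     /* let the handler fire and nest */
        printf("bytes reserved: %zu\n", atomic_load(&head));
        return 0;
    }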
* Re: ftrace global trace_pipe_raw 2018-07-24 14:23 ` Steven Rostedt @ 2018-07-24 14:25 ` Steven Rostedt 2018-07-24 15:30 ` Claudio 2018-10-08 16:07 ` Claudio 2018-12-19 11:32 ` Claudio 1 sibling, 2 replies; 21+ messages in thread From: Steven Rostedt @ 2018-07-24 14:25 UTC (permalink / raw) To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel On Tue, 24 Jul 2018 10:23:16 -0400 Steven Rostedt <rostedt@goodmis.org> wrote: > > > > Would work in the direction of adding a global trace_pipe_raw be considered > > for inclusion? > > The design of the lockless ring buffer requires not to be preempted, > and that the data cannot be written to from more than one location. To > do so, we make a per CPU buffer, and disable preemption when writing. > This means that we have only one writer at a time. It can handle > interrupts and NMIs, because they will finish before they return and > this doesn't break the algorithm. But having writers from multiple CPUs > would require locking or other heaving synchronization operations that > will greatly reduce the speed of writing to the buffers (not to mention > the cache thrashing). And why would you need a single buffer? Note, we are working on making libtracecmd.so that will allow applications to read the buffers and the library will take care of the interleaving of the raw data. This should hopefully be ready in about three months or so. -- Steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2018-07-24 14:25 ` Steven Rostedt @ 2018-07-24 15:30 ` Claudio 2018-10-08 16:07 ` Claudio 1 sibling, 0 replies; 21+ messages in thread From: Claudio @ 2018-07-24 15:30 UTC (permalink / raw) To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel Hello Steve, thank you for your answer, On 07/24/2018 04:25 PM, Steven Rostedt wrote: > On Tue, 24 Jul 2018 10:23:16 -0400 > Steven Rostedt <rostedt@goodmis.org> wrote: > >>> >>> Would work in the direction of adding a global trace_pipe_raw be considered >>> for inclusion? >> >> The design of the lockless ring buffer requires not to be preempted, >> and that the data cannot be written to from more than one location. To >> do so, we make a per CPU buffer, and disable preemption when writing. >> This means that we have only one writer at a time. It can handle >> interrupts and NMIs, because they will finish before they return and >> this doesn't break the algorithm. But having writers from multiple CPUs >> would require locking or other heaving synchronization operations that >> will greatly reduce the speed of writing to the buffers (not to mention >> the cache thrashing). > I understand, it is not a simple matter then. > And why would you need a single buffer? I am interested in all events that have to do with a specific task, regardless of the CPU they appear on. Having an already post-processed stream of binary data would be awesome I think. > Note, we are working on making > libtracecmd.so that will allow applications to read the buffers and the > library will take care of the interleaving of the raw data. This should > hopefully be ready in about three months or so. > > -- Steve That would be great! So the library could handle this kind of preprocessing, and create a single stream of sorted timestamps/events with the binary data? Thanks a lot, Claudio ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2018-07-24 14:25 ` Steven Rostedt 2018-07-24 15:30 ` Claudio @ 2018-10-08 16:07 ` Claudio 2018-10-08 16:16 ` Steven Rostedt 1 sibling, 1 reply; 21+ messages in thread From: Claudio @ 2018-10-08 16:07 UTC (permalink / raw) To: Steven Rostedt; +Cc: linux-trace-devel

Hello Steven,

On 07/24/2018 04:25 PM, Steven Rostedt wrote:
> On Tue, 24 Jul 2018 10:23:16 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>>>
>>> Would work in the direction of adding a global trace_pipe_raw be considered
>>> for inclusion?
>>
>> The design of the lockless ring buffer requires not to be preempted,
>> and that the data cannot be written to from more than one location. To
>> do so, we make a per CPU buffer, and disable preemption when writing.
>> This means that we have only one writer at a time. It can handle
>> interrupts and NMIs, because they will finish before they return and
>> this doesn't break the algorithm. But having writers from multiple CPUs
>> would require locking or other heaving synchronization operations that
>> will greatly reduce the speed of writing to the buffers (not to mention
>> the cache thrashing).
>
> And why would you need a single buffer? Note, we are working on making
> libtracecmd.so that will allow applications to read the buffers and the
> library will take care of the interleaving of the raw data. This should
> hopefully be ready in about three months or so.
>
> -- Steve
>

Is this something you will showcase at the Linux tracing summit? Is there a repo / branch I should be following?

The reason why we need to end up with a single stream of events is to be able to do "online" task state correlation and timing parameter calculations for all task-related events, independent of core.

Currently we have this on QNX, and we are trying to enable it for Linux as well.

Thank you,

Claudio

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2018-10-08 16:07 ` Claudio @ 2018-10-08 16:16 ` Steven Rostedt 2018-10-08 19:18 ` Claudio 0 siblings, 1 reply; 21+ messages in thread From: Steven Rostedt @ 2018-10-08 16:16 UTC (permalink / raw) To: Claudio; +Cc: linux-trace-devel

On Mon, 8 Oct 2018 18:07:49 +0200 Claudio <claudio.fontana@gliwa.com> wrote:

> Hello Steven,
>
> On 07/24/2018 04:25 PM, Steven Rostedt wrote:
> > On Tue, 24 Jul 2018 10:23:16 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> >>>
> >>> Would work in the direction of adding a global trace_pipe_raw be considered
> >>> for inclusion?
> >>
> >> The design of the lockless ring buffer requires not to be preempted,
> >> and that the data cannot be written to from more than one location. To
> >> do so, we make a per CPU buffer, and disable preemption when writing.
> >> This means that we have only one writer at a time. It can handle
> >> interrupts and NMIs, because they will finish before they return and
> >> this doesn't break the algorithm. But having writers from multiple CPUs
> >> would require locking or other heaving synchronization operations that
> >> will greatly reduce the speed of writing to the buffers (not to mention
> >> the cache thrashing).
> >
> > And why would you need a single buffer? Note, we are working on making
> > libtracecmd.so that will allow applications to read the buffers and the
> > library will take care of the interleaving of the raw data. This should
> > hopefully be ready in about three months or so.
> >
> > -- Steve
> >
>
> Is this something you will showcase in the linux tracing summit?
> Is there a repo / branch I should be following?

We are preparing the code in tools/lib/traceevent of the Linux kernel to turn that into a library.

At the same time, we are looking at making libtracecmd (or perhaps we'll call it libftrace?) to implement all the trace-cmd code as a library as well. But that's happening in the main trace-cmd repo:

git://git.kernel.org/pub/scm/utils/trace-cmd/trace-cmd.git

>
> The reason why we need to end up with a single stream of events is to be
> able to do "online" task state correlation and timing parameters calculations
> for all task-related events independent of cores.

Well, all the events are timestamped, and you can pick different clocks to use, and a simple merge sort gives all the information you need. Note, having per-CPU buffers makes things much more efficient, as you don't need to synchronize with atomics.

-- Steve

>
> Currently we have this on QNX, and we are trying to enable it for Linux as well.
>
> Thank you,
>
> Claudio

^ permalink raw reply [flat|nested] 21+ messages in thread
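The merge Steve describes can be a linear scan over the per-CPU streams, picking the smallest timestamp each round, which is essentially what the kernel's own text formatter does internally. A minimal userspace sketch, assuming the events have already been decoded into (timestamp, cpu) records:

    #include <stdint.h>
    #include <stdio.h>

    struct event { uint64_t ts; int cpu; /* plus decoded payload */ };

    /* one decoded stream per CPU */
    struct stream { const struct event *pos, *end; };

    static const struct event *peek(struct stream *s)
    {
        return s->pos < s->end ? s->pos : NULL;
    }

    /* pick the stream whose next event has the smallest timestamp */
    static const struct event *next_event(struct stream *st, int ncpus)
    {
        struct stream *best = NULL;
        for (int cpu = 0; cpu < ncpus; cpu++) {
            const struct event *e = peek(&st[cpu]);
            if (e && (!best || e->ts < peek(best)->ts))
                best = &st[cpu];
        }
        return best ? best->pos++ : NULL;
    }

    int main(void)
    {
        /* sample data standing in for two decoded per-CPU buffers */
        static const struct event c0[] = { {100, 0}, {300, 0} };
        static const struct event c1[] = { {50, 1}, {200, 1}, {400, 1} };
        struct stream st[] = { { c0, c0 + 2 }, { c1, c1 + 3 } };

        for (const struct event *e; (e = next_event(st, 2)); )
            printf("ts=%llu cpu=%d\n", (unsigned long long)e->ts, e->cpu);
        return 0;
    }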
* Re: ftrace global trace_pipe_raw 2018-10-08 16:16 ` Steven Rostedt @ 2018-10-08 19:18 ` Claudio 2018-11-16 13:23 ` Claudio 0 siblings, 1 reply; 21+ messages in thread From: Claudio @ 2018-10-08 19:18 UTC (permalink / raw) To: Steven Rostedt; +Cc: linux-trace-devel, jens Hi Steven, thank you for your answer, On 10/08/2018 06:16 PM, Steven Rostedt wrote: > On Mon, 8 Oct 2018 18:07:49 +0200 > Claudio <claudio.fontana@gliwa.com> wrote: > >> Hello Steven, >> >> On 07/24/2018 04:25 PM, Steven Rostedt wrote: >>> On Tue, 24 Jul 2018 10:23:16 -0400 >>> Steven Rostedt <rostedt@goodmis.org> wrote: >>> >>>>> >>>>> Would work in the direction of adding a global trace_pipe_raw be considered >>>>> for inclusion? >>>> >>>> The design of the lockless ring buffer requires not to be preempted, >>>> and that the data cannot be written to from more than one location. To >>>> do so, we make a per CPU buffer, and disable preemption when writing. >>>> This means that we have only one writer at a time. It can handle >>>> interrupts and NMIs, because they will finish before they return and >>>> this doesn't break the algorithm. But having writers from multiple CPUs >>>> would require locking or other heaving synchronization operations that >>>> will greatly reduce the speed of writing to the buffers (not to mention >>>> the cache thrashing). >>> >>> And why would you need a single buffer? Note, we are working on making >>> libtracecmd.so that will allow applications to read the buffers and the >>> library will take care of the interleaving of the raw data. This should >>> hopefully be ready in about three months or so. >>> >>> -- Steve >>> >> >> Is this something you will showcase in the linux tracing summit? >> Is there a repo / branch I should be following? > > We are preparing the code in tools/lib/traceevent of the Linux kernel > to turn that into a library. > > At the same time, we are looking at making libtracecmd or perhaps we'll > call it libftrace? to implement all the trace-cmd code as a library as > well. But that's happening in the main trace-cmd repo: > > git://git.kernel.org/pub/scm/utils/trace-cmd/trace-cmd.git > I think that the binary raw event streaming aspect will be part of this last component right? libftrace is way better as a name I think. >> >> The reason why we need to end up with a single stream of events is to be >> able to do "online" task state correlation and timing parameters calculations >> for all task-related events independent of cores. > > Well, all the events are timestamped, and you can pick different clocks > to use, and a simple merge sort gives all the information you need. Yes, we are about to do the merge sorting of the streams ourselves, but if the library does it, even better ;-) > Note, having per cpu buffers makes things much more efficient as you > don't need to do synchronizing with atomics. > > -- Steve > >> >> Currently we have this on QNX, and we are trying to enable it for Linux as well. >> >> Thank you, >> >> Claudio > Thanks, Claudio ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2018-10-08 19:18 ` Claudio @ 2018-11-16 13:23 ` Claudio 0 siblings, 0 replies; 21+ messages in thread From: Claudio @ 2018-11-16 13:23 UTC (permalink / raw) To: Steven Rostedt; +Cc: linux-trace-devel, jens

Hello, just a small update below:

On 10/08/2018 09:18 PM, Claudio wrote:
> Hi Steven,
>
> thank you for your answer,
>
> On 10/08/2018 06:16 PM, Steven Rostedt wrote:
>> On Mon, 8 Oct 2018 18:07:49 +0200
>> Claudio <claudio.fontana@gliwa.com> wrote:
>>
>>> Hello Steven,
>>>
>>> On 07/24/2018 04:25 PM, Steven Rostedt wrote:
>>>> On Tue, 24 Jul 2018 10:23:16 -0400
>>>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>>>
>>>>>>
>>>>>> Would work in the direction of adding a global trace_pipe_raw be considered
>>>>>> for inclusion?
>>>>>
>>>>> The design of the lockless ring buffer requires not to be preempted,
>>>>> and that the data cannot be written to from more than one location. To
>>>>> do so, we make a per CPU buffer, and disable preemption when writing.
>>>>> This means that we have only one writer at a time. It can handle
>>>>> interrupts and NMIs, because they will finish before they return and
>>>>> this doesn't break the algorithm. But having writers from multiple CPUs
>>>>> would require locking or other heaving synchronization operations that
>>>>> will greatly reduce the speed of writing to the buffers (not to mention
>>>>> the cache thrashing).
>>>>
>>>> And why would you need a single buffer? Note, we are working on making
>>>> libtracecmd.so that will allow applications to read the buffers and the
>>>> library will take care of the interleaving of the raw data. This should
>>>> hopefully be ready in about three months or so.
>>>>
>>>> -- Steve
>>>>
>>>
>>> Is this something you will showcase in the linux tracing summit?
>>> Is there a repo / branch I should be following?
>>
>> We are preparing the code in tools/lib/traceevent of the Linux kernel
>> to turn that into a library.
>>
>> At the same time, we are looking at making libtracecmd or perhaps we'll
>> call it libftrace? to implement all the trace-cmd code as a library as
>> well. But that's happening in the main trace-cmd repo:
>>
>> git://git.kernel.org/pub/scm/utils/trace-cmd/trace-cmd.git
>>
>
> I think that the binary raw event streaming aspect will be part of this last component right?
>
> libftrace is way better as a name I think.
>
>
>>>
>>> The reason why we need to end up with a single stream of events is to be
>>> able to do "online" task state correlation and timing parameters calculations
>>> for all task-related events independent of cores.
>>
>> Well, all the events are timestamped, and you can pick different clocks
>> to use, and a simple merge sort gives all the information you need.
>
> Yes, we are about to do the merge sorting of the streams ourselves,
> but if the library does it, even better ;-)

I think that with K streams, instead of a merge sort it would be optimal to use a min-heap for the partial, online sorting of the events coming from the per-CPU streams.

Do you think this could be part of the library, so that it would be able to provide a constant stream of already-sorted events regardless of the CPU, for full sched task state correlation? Or is it your view that this functionality should never be part of the library, and needs to be implemented by every user of the library separately?

For me, it would, I think, be the main factor in deciding whether to use the library or not.
Thank you, Claudio

>
>> Note, having per cpu buffers makes things much more efficient as you
>> don't need to do synchronizing with atomics.
>>
>> -- Steve
>>
>>>
>>> Currently we have this on QNX, and we are trying to enable it for Linux as well.
>>>
>>> Thank you,
>>>
>>> Claudio
>>
>

^ permalink raw reply [flat|nested] 21+ messages in thread
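A sketch of the min-heap idea proposed above, in userspace C: each non-empty per-CPU stream contributes its head event to a binary heap keyed by timestamp, so emitting the next event in global order costs O(log K) rather than the O(K) scan shown earlier. Refilling the streams from the raw pipes is left out; the stream type and sample data are stand-ins:

    #include <stdint.h>
    #include <stdio.h>

    struct event { uint64_t ts; int cpu; };
    struct stream { const struct event *pos, *end; };
    struct heap { struct stream *s[64]; int n; };

    static uint64_t key(struct stream *st) { return st->pos->ts; }

    static void sift_down(struct heap *h, int i)
    {
        for (;;) {
            int l = 2 * i + 1, r = l + 1, m = i;
            if (l < h->n && key(h->s[l]) < key(h->s[m])) m = l;
            if (r < h->n && key(h->s[r]) < key(h->s[m])) m = r;
            if (m == i) return;
            struct stream *t = h->s[i]; h->s[i] = h->s[m]; h->s[m] = t;
            i = m;
        }
    }

    static void heap_push(struct heap *h, struct stream *st)
    {
        int i = h->n++;
        h->s[i] = st;
        while (i && key(h->s[(i - 1) / 2]) > key(h->s[i])) {
            struct stream *t = h->s[i];
            h->s[i] = h->s[(i - 1) / 2]; h->s[(i - 1) / 2] = t;
            i = (i - 1) / 2;
        }
    }

    /* pop the globally earliest event; keep its stream if non-empty */
    static const struct event *heap_next(struct heap *h)
    {
        if (!h->n) return NULL;
        struct stream *st = h->s[0];
        const struct event *e = st->pos++;
        if (st->pos < st->end) {
            sift_down(h, 0);            /* root key changed in place */
        } else {
            h->s[0] = h->s[--h->n];     /* stream drained: drop it */
            if (h->n) sift_down(h, 0);
        }
        return e;
    }

    int main(void)
    {
        static const struct event c0[] = { {100, 0}, {300, 0} };
        static const struct event c1[] = { {50, 1}, {200, 1} };
        struct stream st[] = { { c0, c0 + 2 }, { c1, c1 + 2 } };
        struct heap h = { .n = 0 };

        for (int i = 0; i < 2; i++) heap_push(&h, &st[i]);
        for (const struct event *e; (e = heap_next(&h)); )
            printf("ts=%llu cpu=%d\n", (unsigned long long)e->ts, e->cpu);
        return 0;
    }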
* Re: ftrace global trace_pipe_raw 2018-07-24 14:23 ` Steven Rostedt 2018-07-24 14:25 ` Steven Rostedt @ 2018-12-19 11:32 ` Claudio 2018-12-19 16:37 ` Steven Rostedt 1 sibling, 1 reply; 21+ messages in thread From: Claudio @ 2018-12-19 11:32 UTC (permalink / raw) To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

Hi Steven,

going back to this old theme to clarify a bit what I was trying to achieve:

On 07/24/2018 04:23 PM, Steven Rostedt wrote:
> On Tue, 24 Jul 2018 11:58:18 +0200
> Claudio <claudio.fontana@gliwa.com> wrote:
>
>> Hello Steven,
>>
>> I am doing correlation of linux sched events, following all tasks between cpus,
>> and one thing that would be really convenient would be to have a global
>> trace_pipe_raw, in addition to the per-cpu ones, with already sorted events.

I think that I asked for the wrong thing, since I did not understand how the implementation worked, which led to your response; thank you for the clarification.

>>
>> I would imagine the core functionality is already available, since trace_pipe
>> in the tracing directory already shows all events regardless of CPU, and so
>> it would be a matter of doing the same for trace_pipe_raw.
>
> The difference between trace_pipe and trace_pipe_raw is that trace_pipe
> is post processed, and reads the per CPU buffers and interleaves them
> one event at a time. The trace_pipe_raw just sends you the raw
> unprocessed data directly from the buffers, which are grouped per CPU.

I think that what I am looking for, to improve the performance of our system, is a post-processed stream of binary entry data, already merged from all CPUs and sorted per timestamp, in the same way that it is done for the textual output in __find_next_entry:

	for_each_tracing_cpu(cpu) {

		if (ring_buffer_empty_cpu(buffer, cpu))
			continue;

		ent = peek_next_entry(iter, cpu, &ts, &lost_events);

		/*
		 * Pick the entry with the smallest timestamp:
		 */
		if (ent && (!next || ts < next_ts)) {
			next = ent;
			next_cpu = cpu;
			next_ts = ts;
			next_lost = lost_events;
			next_size = iter->ent_size;
		}
	}

We first tried to use the textual output directly, but this led to unacceptable overheads in parsing the text.

Please correct me if I do not understand, however it seems to me that it would be possible to do the same kind of post-processing, including generating a sorted stream of entries, while just avoiding the text output formatting and outputting the binary data of the entry directly, which would be far more efficient for user-space correlators to consume.

But maybe this is not a general enough requirement to be acceptable for implementing directly in the kernel?

We have the requirement of using the OS tracing events, including scheduling events, to react in software immediately (vs. doing after-the-fact analysis).

Thank you for your comment on this, and I wish you nice holidays.

Ciao,

Claudio

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2018-12-19 11:32 ` Claudio @ 2018-12-19 16:37 ` Steven Rostedt 2019-01-16 8:00 ` Claudio 0 siblings, 1 reply; 21+ messages in thread From: Steven Rostedt @ 2018-12-19 16:37 UTC (permalink / raw) To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel On Wed, 19 Dec 2018 12:32:41 +0100 Claudio <claudio.fontana@gliwa.com> wrote: > >> > >> I would imagine the core functionality is already available, since trace_pipe > >> in the tracing directory already shows all events regardless of CPU, and so > >> it would be a matter of doing the same for trace_pipe_raw. > > > > The difference between trace_pipe and trace_pipe_raw is that trace_pipe > > is post processed, and reads the per CPU buffers and interleaves them > > one event at a time. The trace_pipe_raw just sends you the raw > > unprocessed data directly from the buffers, which are grouped per CPU. > > I think that what I am looking for, to improve the performance of our system, > is a post processed stream of binary entry data, already merged from all CPUs > and sorted per timestamp, in the same way that it is done for textual output > in __find_next_entry: > > for_each_tracing_cpu(cpu) { > > if (ring_buffer_empty_cpu(buffer, cpu)) > continue; > > ent = peek_next_entry(iter, cpu, &ts, &lost_events); > > /* > * Pick the entry with the smallest timestamp: > */ > if (ent && (!next || ts < next_ts)) { > next = ent; > next_cpu = cpu; > next_ts = ts; > next_lost = lost_events; > next_size = iter->ent_size; > } > } > > We first tried to use the textual output directly, but this lead to > unacceptable overheads in parsing the text. > > Please correct me if I do not understand, however it seems to me that it > would be possible do the same kind of post processing including generating > a sorted stream of entries, just avoiding the text output formatting, > and outputting the binary data of the entry directly, which would be way > more efficient to consume directly from user space correlators. > > But maybe this is not a general enough requirement to be acceptable for > implementing directly into the kernel? > > We have the requirement of using the OS tracing events, including > scheduling events, to react from software immediately > (vs doing after-the-fact analysis). Have you looked at using the perf event interface? I believe it uses a single buffer for all events. At least for tracing a single process. -- Steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2018-12-19 16:37 ` Steven Rostedt @ 2019-01-16 8:00 ` Claudio 2019-01-16 13:47 ` Steven Rostedt 0 siblings, 1 reply; 21+ messages in thread From: Claudio @ 2019-01-16 8:00 UTC (permalink / raw) To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel Hi Steven, happy new year, On 12/19/2018 05:37 PM, Steven Rostedt wrote: > On Wed, 19 Dec 2018 12:32:41 +0100 > Claudio <claudio.fontana@gliwa.com> wrote: > >>>> >>>> I would imagine the core functionality is already available, since trace_pipe >>>> in the tracing directory already shows all events regardless of CPU, and so >>>> it would be a matter of doing the same for trace_pipe_raw. >>> >>> The difference between trace_pipe and trace_pipe_raw is that trace_pipe >>> is post processed, and reads the per CPU buffers and interleaves them >>> one event at a time. The trace_pipe_raw just sends you the raw >>> unprocessed data directly from the buffers, which are grouped per CPU. >> >> I think that what I am looking for, to improve the performance of our system, >> is a post processed stream of binary entry data, already merged from all CPUs >> and sorted per timestamp, in the same way that it is done for textual output >> in __find_next_entry: >> >> for_each_tracing_cpu(cpu) { >> >> if (ring_buffer_empty_cpu(buffer, cpu)) >> continue; >> >> ent = peek_next_entry(iter, cpu, &ts, &lost_events); >> >> /* >> * Pick the entry with the smallest timestamp: >> */ >> if (ent && (!next || ts < next_ts)) { >> next = ent; >> next_cpu = cpu; >> next_ts = ts; >> next_lost = lost_events; >> next_size = iter->ent_size; >> } >> } >> >> We first tried to use the textual output directly, but this lead to >> unacceptable overheads in parsing the text. >> >> Please correct me if I do not understand, however it seems to me that it >> would be possible do the same kind of post processing including generating >> a sorted stream of entries, just avoiding the text output formatting, >> and outputting the binary data of the entry directly, which would be way >> more efficient to consume directly from user space correlators. >> >> But maybe this is not a general enough requirement to be acceptable for >> implementing directly into the kernel? >> >> We have the requirement of using the OS tracing events, including >> scheduling events, to react from software immediately >> (vs doing after-the-fact analysis). > > Have you looked at using the perf event interface? I believe it uses a > single buffer for all events. At least for tracing a single process. > > -- Steve > Indeed the perf event interface would be awesome, if only it would support tracing all processes. Unfortunately for my use case, it can only trace one process on any cpus, or all processes on one (1) cpu. I guess for some kind of security concerns.. I'll take a look at how much work it would be to extend the interface for the any process/any cpu use case. Ciao and thank you, Claudio ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2019-01-16 8:00 ` Claudio @ 2019-01-16 13:47 ` Steven Rostedt 0 siblings, 0 replies; 21+ messages in thread From: Steven Rostedt @ 2019-01-16 13:47 UTC (permalink / raw) To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Wed, 16 Jan 2019 09:00:00 +0100 Claudio <claudio.fontana@gliwa.com> wrote:

> Indeed the perf event interface would be awesome, if only it would support tracing all processes.
>
> Unfortunately for my use case, it can only trace one process on any cpus, or all processes on one (1) cpu.
>
> I guess for some kind of security concerns..

Not security, but performance. Serialized writes are best done on serialized instances (per CPU or a single task). Having all tasks on all CPUs write to a single location is a huge detriment to performance, which is why the buffer you are asking for doesn't exist.

-- Steve

>
> I'll take a look at how much work it would be to extend the interface for the any process/any cpu use case.
>
> Ciao and thank you,
>
> Claudio
>

^ permalink raw reply [flat|nested] 21+ messages in thread
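Steve's performance point is easy to reproduce in userspace: threads bumping one shared counter serialize on a single cache line, while per-thread counters, the analogue of per-CPU buffers, do not. A quick illustrative sketch with pthreads (build with -pthread and time the two phases separately; exact numbers vary by machine):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS 10000000L

    static _Atomic long shared_ctr;

    /* pad each slot to a full cache line to avoid false sharing */
    static struct { _Atomic long v; char pad[56]; } local_ctr[NTHREADS];

    static void *hammer_shared(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++)
            atomic_fetch_add(&shared_ctr, 1);   /* cross-core contention */
        return NULL;
    }

    static void *hammer_local(void *arg)
    {
        _Atomic long *c = &local_ctr[(long)arg].v;
        for (long i = 0; i < ITERS; i++)
            atomic_fetch_add(c, 1);             /* line stays core-local */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        /* phase 1: everyone hammers the single shared counter */
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, hammer_shared, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        /* phase 2: one counter per thread; typically several times
         * faster, since no cache line bounces between cores */
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, hammer_local, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        printf("shared=%ld local0=%ld\n",
               atomic_load(&shared_ctr), atomic_load(&local_ctr[0].v));
        return 0;
    }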
* ftrace global trace_pipe_raw 2018-07-09 15:32 ` Steven Rostedt 2018-07-24 9:58 ` ftrace global trace_pipe_raw Claudio @ 2018-07-24 9:59 ` Claudio 1 sibling, 0 replies; 21+ messages in thread From: Claudio @ 2018-07-24 9:59 UTC (permalink / raw) To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel Hello Steven, I am doing correlation of linux sched events, following all tasks between cpus, and one thing that would be really convenient would be to have a global trace_pipe_raw, in addition to the per-cpu ones, with already sorted events. I would imagine the core functionality is already available, since trace_pipe in the tracing directory already shows all events regardless of CPU, and so it would be a matter of doing the same for trace_pipe_raw. But is there a good reason why trace_pipe_raw is available only per-cpu? Would work in the direction of adding a global trace_pipe_raw be considered for inclusion? Thank you, Claudio ^ permalink raw reply [flat|nested] 21+ messages in thread
Thread overview: 21+ messages
2018-07-06  6:22 ftrace performance (sched events): cyclictest shows 25% more latency Claudio
2018-07-06 21:24 ` Steven Rostedt
2018-07-06 21:39 ` Steven Rostedt
2018-07-06 22:00 ` Steven Rostedt
2018-07-09 10:06 ` Claudio
2018-07-09 14:53 ` Claudio
2018-07-09 15:11 ` Steven Rostedt
2018-07-09 15:32 ` Steven Rostedt
2018-07-24  9:58 ` ftrace global trace_pipe_raw Claudio
2018-07-24 14:23 ` Steven Rostedt
2018-07-24 14:25 ` Steven Rostedt
2018-07-24 15:30 ` Claudio
2018-10-08 16:07 ` Claudio
2018-10-08 16:16 ` Steven Rostedt
2018-10-08 19:18 ` Claudio
2018-11-16 13:23 ` Claudio
2018-12-19 11:32 ` Claudio
2018-12-19 16:37 ` Steven Rostedt
2019-01-16  8:00 ` Claudio
2019-01-16 13:47 ` Steven Rostedt
2018-07-24  9:59 ` Claudio