* ftrace performance (sched events): cyclictest shows 25% more latency
From: Claudio @ 2018-07-06  6:22 UTC
To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

Hello all,

I have been experimenting with the idea of leaving ftrace enabled, with sched events,
on production systems.

The main concern that I am having at the moment is about the impact on the system.
Enabling the sched events that I currently need for the tracing application
seems to slow down context switches considerably, and make the system less responsive.

I have tested with cyclictest on the mainline kernel, and noticed an increase of
around 25% in min and avg latencies.

Is this expected?

Some initial investigation into ftrace seems to point at the reservation and commit
of the events into the ring buffer as the highest sources of overhead, while copying
of event parameters, including COMM, does not seem to have any noticeable effect
relative to those costs.

I have run the following test 20 times, and thrown away the first results:

$ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q

$ uname -a
Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux

For brevity, this is a comparison of one test's results. All other test results show
the same ~25% increase.

On the left side, the run without ftrace sched events; on the right side, with ftrace
sched events enabled.

CPU   Count   Min   Act   Avg     Max   Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
0    100000  2339  2936  2841  139478  100000        2900        3182        3566       93056
1     66742  2365  3386  2874   93639   66750        2959        3786        3646      154074
2     50080  2376  3058  2910  196221   50097        2997        4209        3655       18707
3     40076  2394  3461  2931   17914   40091        3006        4417        3750       17159
4     33404  2371  3612  2834   15336   33419        2997        3836        3594       23172
5     28635  2387  3313  2885   25863   28649        2995        3795        3647        9956
6     25058  2384  3428  2968   12162   25071        3051        4366        3719       18151
7     22275  2381  2859  2982   10706   22287        3046        5078        3825       10781

I would be thankful for any advice or comments on this, especially with the goal
in mind of lowering the runtime impact on the system as much as possible.

Thank you,

Claudio Fontana
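For reference, this kind of A/B comparison can be driven entirely through tracefs;
a minimal sketch follows (the tracefs mount point and enabling the whole sched
event group are assumptions; the tests above may have enabled only a subset of
events):

  # assumes tracefs is mounted at /sys/kernel/tracing
  # (on older setups: /sys/kernel/debug/tracing)
  cd /sys/kernel/tracing

  echo 0 > events/sched/enable        # baseline: sched events off
  ./cyclictest --smp -p95 -m -s -N -l 100000 -q

  echo 1 > events/sched/enable        # comparison: sched events on
  ./cyclictest --smp -p95 -m -s -N -l 100000 -q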
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Steven Rostedt @ 2018-07-06 21:24 UTC
To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Fri, 6 Jul 2018 08:22:01 +0200
Claudio <claudio.fontana@gliwa.com> wrote:

> Hello all,
>
> I have been experimenting with the idea of leaving ftrace enabled, with sched events,
> on production systems.
>
> The main concern that I am having at the moment is about the impact on the system.
> Enabling the sched events that I currently need for the tracing application
> seems to slow down context switches considerably, and make the system less responsive.
>
> I have tested with cyclictest on the mainline kernel, and noticed an increase of
> around 25% in min and avg latencies.
>
> Is this expected?

No it is not. And this is weird. I'm seeing it too.

> Some initial investigation into ftrace seems to point at the reservation and commit
> of the events into the ring buffer as the highest sources of overhead, while copying
> of event parameters, including COMM, does not seem to have any noticeable effect
> relative to those costs.

Did you use perf to see where the issues were? If so what perf command
did you use?

When running just:

  perf record ./cyclictest --smp -p95 -m -s -N -l 1000000 -q

I get this:

   6.08%  cyclictest  [kernel.vmlinux]  [k] syscall_return_via_sysret
   4.03%  cyclictest  cyclictest        [.] timerthread
   3.97%  cyclictest  [kernel.vmlinux]  [k] __schedule
   3.81%  cyclictest  [kernel.vmlinux]  [k] native_write_msr
   3.13%  cyclictest  [kernel.vmlinux]  [k] trace_save_cmdline
   2.71%  cyclictest  [kernel.vmlinux]  [k] do_nanosleep
   2.48%  cyclictest  [kernel.vmlinux]  [k] do_syscall_64
   2.11%  cyclictest  [kernel.vmlinux]  [k] pick_next_task_fair
   2.03%  cyclictest  [kernel.vmlinux]  [k] native_sched_clock
   1.99%  cyclictest  [kernel.vmlinux]  [k] cpuacct_charge
   1.93%  cyclictest  [vdso]            [.] __vdso_clock_gettime
   1.86%  cyclictest  [kernel.vmlinux]  [k] hrtimer_nanosleep
   1.84%  cyclictest  [kernel.vmlinux]  [k] cpupri_set
   1.66%  cyclictest  [kernel.vmlinux]  [k] update_curr_rt
   1.65%  cyclictest  [kernel.vmlinux]  [k] __perf_event_task_sched_out
   1.59%  cyclictest  [kernel.vmlinux]  [k] dequeue_rt_stack
   1.55%  cyclictest  [kernel.vmlinux]  [k] __rb_reserve_next

Which shows there's a lot being done before we even get to the ring
buffer reserve.

One thing that I notice too is that we have trace_save_cmdline that
also gets activated when enabling trace events. Perhaps I should try to
optimize that more.

> I have run the following test 20 times, and thrown away the first results:
>
> $ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q
>
> $ uname -a
> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> For brevity, this is a comparison of one test's results. All other test results show
> the same ~25% increase.
>
> On the left side, the run without ftrace sched events; on the right side, with ftrace
> sched events enabled.
> CPU   Count   Min   Act   Avg     Max   Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
> 0    100000  2339  2936  2841  139478  100000        2900        3182        3566       93056
> 1     66742  2365  3386  2874   93639   66750        2959        3786        3646      154074
> 2     50080  2376  3058  2910  196221   50097        2997        4209        3655       18707
> 3     40076  2394  3461  2931   17914   40091        3006        4417        3750       17159
> 4     33404  2371  3612  2834   15336   33419        2997        3836        3594       23172
> 5     28635  2387  3313  2885   25863   28649        2995        3795        3647        9956
> 6     25058  2384  3428  2968   12162   25071        3051        4366        3719       18151
> 7     22275  2381  2859  2982   10706   22287        3046        5078        3825       10781
>
> I would be thankful for any advice or comments on this, especially with the goal
> in mind of lowering the runtime impact on the system as much as possible.

But still, we use all sorts of trace events when dealing with real
time, and trace events hardly show anything in the PREEMPT_RT kernel.
There, cyclictest shows 20 microseconds, and sched events hardly touch
that. But I see this weird side effect on a distro kernel too. Maybe
there's something weird going on when we have CONFIG_PREEMPT_VOLUNTARY
set :-/

I'll investigate further.

Thanks for the report.

-- Steve
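A note on the trace_save_cmdline cost mentioned above: ftrace can be told not
to cache task comms at all, at the price of losing task names in the trace
output. A sketch, assuming the standard tracefs options directory:

  # skip comm (task name) caching on sched_switch; trace output will
  # then show <...> in place of task names
  echo 0 > /sys/kernel/tracing/options/record-cmd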
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Steven Rostedt @ 2018-07-06 21:39 UTC
To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Fri, 6 Jul 2018 17:24:28 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I'll investigate further.

Note, I enabled function tracing to trace the scheduler function:

  # trace-cmd start -p function -l schedule

And then ran cyclictest. That does not cause any noticeable increase
in latency (try it). So there is some kind of side effect with the
trace event itself, and not the recording of the event.

-- Steve
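For completeness, a sketch of the full check described above (subcommands as
documented in the trace-cmd man pages):

  # trace only the schedule() function with the function tracer
  trace-cmd start -p function -l schedule
  ./cyclictest --smp -p95 -m -s -N -l 100000 -q
  trace-cmd stop     # stop recording, keep the buffer for inspection
  trace-cmd show     # dump the captured trace
  trace-cmd reset    # disable tracing again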
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Steven Rostedt @ 2018-07-06 22:00 UTC
To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Fri, 6 Jul 2018 17:39:22 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 6 Jul 2018 17:24:28 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> I'll investigate further.
>
> Note, I enabled function tracing to trace the scheduler function:
>
>   # trace-cmd start -p function -l schedule
>
> And then ran cyclictest. That does not cause any noticeable increase
> in latency (try it). So there is some kind of side effect with the
> trace event itself, and not the recording of the event.

It does add some noise but not nearly as much as a trace event.

But, then I did function graph tracing:

First I did this, with the sched_switch event enabled:

  trace-cmd record -p function_graph -l schedule -e sched_switch \
      /work/git-local/rt-tests.git/cyclictest --smp -p95 -m -s -N -l 100000 -q

And got this:

    <idle>-0       [000]  3034.534021: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
 trace-cmd-1693    [000]  3034.534029: funcgraph_entry:                |  schedule() {
 trace-cmd-1693    [000]  3034.534043: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.535084: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
 trace-cmd-1693    [000]  3034.535086: funcgraph_exit:  # 1056.055 us  |  }
 trace-cmd-1693    [000]  3034.535090: funcgraph_entry:                |  schedule() {
 trace-cmd-1693    [000]  3034.535093: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.536146: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
 trace-cmd-1693    [000]  3034.536147: funcgraph_exit:  # 1056.563 us  |  }
 trace-cmd-1693    [000]  3034.536150: funcgraph_entry:                |  schedule() {
 trace-cmd-1693    [000]  3034.536153: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.537206: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
 trace-cmd-1693    [000]  3034.537207: funcgraph_exit:  # 1056.107 us  |  }
 trace-cmd-1693    [000]  3034.537210: funcgraph_entry:                |  schedule() {
 trace-cmd-1693    [000]  3034.537213: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.538262: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
 trace-cmd-1693    [000]  3034.538263: funcgraph_exit:  # 1053.077 us  |  }
 trace-cmd-1693    [000]  3034.538265: funcgraph_entry:                |  schedule() {
 trace-cmd-1693    [000]  3034.538266: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.538554: sched_switch:    swapper/0:0 [120] S ==> cyclictest:1698 [120]
 cyclictest-1698   [000]  3034.538565: funcgraph_entry:                |  schedule() {
 cyclictest-1698   [000]  3034.538566: sched_switch:    cyclictest:1698 [4] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.539318: sched_switch:    swapper/0:0 [120] S ==> trace-cmd:1693 [120]
 trace-cmd-1693    [000]  3034.539319: funcgraph_exit:  # 1053.723 us  |  }
 trace-cmd-1693    [000]  3034.539322: funcgraph_entry:                |  schedule() {
 trace-cmd-1693    [000]  3034.539324: sched_switch:    trace-cmd:1693 [120] D ==> swapper/0:0 [120]
    <idle>-0       [000]  3034.539567: sched_switch:    swapper/0:0 [120] S ==> cyclictest:1698 [4]
 cyclictest-1698   [000]  3034.539567: funcgraph_exit:  # 1002.624 us  |  }

And that continued on as normal.
Then I ran it like this, without the sched events:

  trace-cmd record -p function_graph -l schedule \
      /work/git-local/rt-tests.git/cyclictest --smp -p95 -m -s -N -l 100000 -q

And this was my result:

 trace-cmd-1749    [000]  3321.110772: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.111826: funcgraph_exit:  # 1053.004 us  |  }
 trace-cmd-1749    [000]  3321.111830: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.112885: funcgraph_exit:  # 1054.734 us  |  }
 trace-cmd-1749    [000]  3321.112888: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.113930: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.113942: funcgraph_exit:  # 1054.240 us  |  }
 trace-cmd-1749    [000]  3321.113947: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.114933: funcgraph_exit:  # 1002.600 us  |  }
 cyclictest-1756   [000]  3321.114935: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.115000: funcgraph_exit:  # 1053.451 us  |  }
 trace-cmd-1749    [000]  3321.115004: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.115939: funcgraph_exit:  # 1003.116 us  |  }
 cyclictest-1756   [000]  3321.115941: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.116058: funcgraph_exit:  # 1054.142 us  |  }
 trace-cmd-1749    [000]  3321.116061: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.116945: funcgraph_exit:  # 1003.559 us  |  }
 cyclictest-1756   [000]  3321.116947: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.117115: funcgraph_exit:  # 1053.649 us  |  }
 trace-cmd-1749    [000]  3321.117119: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.117951: funcgraph_exit:  # 1003.586 us  |  }
 cyclictest-1756   [000]  3321.117953: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.118172: funcgraph_exit:  # 1053.264 us  |  }
 trace-cmd-1749    [000]  3321.118176: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.118957: funcgraph_exit:  # 1003.656 us  |  }
 cyclictest-1756   [000]  3321.118960: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.119230: funcgraph_exit:  # 1053.770 us  |  }
 trace-cmd-1749    [000]  3321.119234: funcgraph_entry:                |  schedule() {
 cyclictest-1756   [000]  3321.119965: funcgraph_exit:  # 1004.830 us  |  }
 cyclictest-1756   [000]  3321.119967: funcgraph_entry:                |  schedule() {
 trace-cmd-1749    [000]  3321.120286: funcgraph_exit:  # 1052.469 us  |  }

No noticeable difference in the time the scheduler took (going to and
from idle).

-- Steve
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Claudio @ 2018-07-09 10:06 UTC
To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

Hello Steve,

thank you for your response,

On 07/06/2018 11:24 PM, Steven Rostedt wrote:
> On Fri, 6 Jul 2018 08:22:01 +0200
> Claudio <claudio.fontana@gliwa.com> wrote:
>
>> Hello all,
>>
>> I have been experimenting with the idea of leaving ftrace enabled, with sched events,
>> on production systems.
>>
>> The main concern that I am having at the moment is about the impact on the system.
>> Enabling the sched events that I currently need for the tracing application
>> seems to slow down context switches considerably, and make the system less responsive.
>>
>> I have tested with cyclictest on the mainline kernel, and noticed an increase of
>> around 25% in min and avg latencies.
>>
>> Is this expected?
>
> No it is not. And this is weird. I'm seeing it too.
>
>> Some initial investigation into ftrace seems to point at the reservation and commit
>> of the events into the ring buffer as the highest sources of overhead, while copying
>> of event parameters, including COMM, does not seem to have any noticeable effect
>> relative to those costs.
>
> Did you use perf to see where the issues were? If so what perf command
> did you use?

I did not use perf, I will now do it and see what I get.
You can discard my comment regarding reservation and commit, it was based
on a wrong premise.

Regarding parameter copying, I just brutally modified the code to not fill
fields (TP_fast_assign), and I just noticed that it did not change the
result significantly.

> When running just:
>
>   perf record ./cyclictest --smp -p95 -m -s -N -l 1000000 -q
>
> I get this:
>
>    6.08%  cyclictest  [kernel.vmlinux]  [k] syscall_return_via_sysret
>    4.03%  cyclictest  cyclictest        [.] timerthread
>    3.97%  cyclictest  [kernel.vmlinux]  [k] __schedule
>    3.81%  cyclictest  [kernel.vmlinux]  [k] native_write_msr
>    3.13%  cyclictest  [kernel.vmlinux]  [k] trace_save_cmdline
>    2.71%  cyclictest  [kernel.vmlinux]  [k] do_nanosleep
>    2.48%  cyclictest  [kernel.vmlinux]  [k] do_syscall_64
>    2.11%  cyclictest  [kernel.vmlinux]  [k] pick_next_task_fair
>    2.03%  cyclictest  [kernel.vmlinux]  [k] native_sched_clock
>    1.99%  cyclictest  [kernel.vmlinux]  [k] cpuacct_charge
>    1.93%  cyclictest  [vdso]            [.] __vdso_clock_gettime
>    1.86%  cyclictest  [kernel.vmlinux]  [k] hrtimer_nanosleep
>    1.84%  cyclictest  [kernel.vmlinux]  [k] cpupri_set
>    1.66%  cyclictest  [kernel.vmlinux]  [k] update_curr_rt
>    1.65%  cyclictest  [kernel.vmlinux]  [k] __perf_event_task_sched_out
>    1.59%  cyclictest  [kernel.vmlinux]  [k] dequeue_rt_stack
>    1.55%  cyclictest  [kernel.vmlinux]  [k] __rb_reserve_next
>
> Which shows there's a lot being done before we even get to the ring
> buffer reserve.

I get (CONFIG_PREEMPT_VOLUNTARY):

SCHED EVENTS DISABLED:

   8.57%  cyclictest  [kernel.vmlinux]    [k] syscall_return_via_sysret
   7.04%  cyclictest  [kernel.vmlinux]    [k] native_write_msr
   6.93%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_trampoline
   4.42%  cyclictest  cyclictest          [.] timerthread
   4.14%  cyclictest  [kernel.vmlinux]    [k] __schedule
   3.52%  cyclictest  [kernel.vmlinux]    [k] cpupri_set
   2.90%  cyclictest  [kernel.vmlinux]    [k] finish_task_switch
   2.42%  cyclictest  [kernel.vmlinux]    [k] update_curr_rt
   2.34%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock
   2.28%  cyclictest  [kernel.vmlinux]    [k] ktime_get
   2.14%  cyclictest  [kernel.vmlinux]    [k] timerqueue_add
   2.06%  cyclictest  [kernel.vmlinux]    [k] read_tsc
   2.06%  cyclictest  [kernel.vmlinux]    [k] native_sched_clock
   2.01%  cyclictest  [vdso]              [.] __vdso_clock_gettime
   1.95%  cyclictest  [kernel.vmlinux]    [k] do_nanosleep
   1.93%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_fair
   1.89%  cyclictest  [vdso]              [.] 0x0000000000000977
   1.75%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_rt
   1.64%  cyclictest  [kernel.vmlinux]    [k] x86_pmu_disable
   1.57%  cyclictest  [kernel.vmlinux]    [k] dequeue_rt_stack
   1.53%  cyclictest  [kernel.vmlinux]    [k] __x64_sys_nanosleep
   1.51%  cyclictest  [kernel.vmlinux]    [k] do_syscall_64
   1.41%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_rt
   1.34%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_dl
   1.28%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_rt
   1.25%  cyclictest  [kernel.vmlinux]    [k] cpuacct_charge
   1.25%  cyclictest  [kernel.vmlinux]    [k] hrtimer_start_range_ns
   1.24%  cyclictest  [kernel.vmlinux]    [k] __x86_indirect_thunk_rax
   1.07%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1b
   0.94%  cyclictest  [kernel.vmlinux]    [k] pull_rt_task
   0.93%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_stage2
   0.89%  cyclictest  [kernel.vmlinux]    [k] hrtimer_nanosleep
   0.89%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1d
   0.88%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64
   0.79%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
   0.79%  cyclictest  [kernel.vmlinux]    [k] ctx_sched_out
   0.78%  cyclictest  [kernel.vmlinux]    [k] copy_user_generic_unrolled
   0.77%  cyclictest  libpthread-2.23.so  [.] __pthread_enable_asynccancel
   0.71%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_out
   0.71%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_idle
   0.70%  cyclictest  libpthread-2.23.so  [.] __pthread_disable_asynccancel
   0.70%  cyclictest  [kernel.vmlinux]    [k] __hrtimer_init
   0.68%  cyclictest  [kernel.vmlinux]    [k] clear_page_erms
   0.64%  cyclictest  [kernel.vmlinux]    [k] intel_pmu_disable_all
   0.58%  cyclictest  [kernel.vmlinux]    [k] copy_user_enhanced_fast_string
   0.58%  cyclictest  [kernel.vmlinux]    [k] _copy_from_user
   0.58%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestore
   0.54%  cyclictest  [kernel.vmlinux]    [k] get_nohz_timer_target
   0.54%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_stop
   0.48%  cyclictest  [kernel.vmlinux]    [k] schedule
   0.47%  cyclictest  [kernel.vmlinux]    [k] lapic_next_deadline
   0.43%  cyclictest  [kernel.vmlinux]    [k] sched_avg_update
   0.42%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_in
   0.37%  cyclictest  libc-2.23.so        [.] __clock_gettime
   0.37%  cyclictest  [vdso]              [.] 0x0000000000000974
   0.35%  cyclictest  [kernel.vmlinux]    [k] get_timespec64
   0.32%  cyclictest  [kernel.vmlinux]    [k] dequeue_top_rt_rq
   0.31%  cyclictest  [kernel.vmlinux]    [k] try_to_unmap_one
   0.27%  cyclictest  [kernel.vmlinux]    [k] update_rq_clock
   0.24%  cyclictest  [kernel.vmlinux]    [k] rb_insert_color
   0.23%  cyclictest  [kernel.vmlinux]    [k] deactivate_task
   0.22%  cyclictest  [kernel.vmlinux]    [k] follow_page_pte
   0.21%  cyclictest  [kernel.vmlinux]    [k] page_not_mapped
   0.20%  cyclictest  [kernel.vmlinux]    [k] sched_clock_cpu
   0.18%  cyclictest  cyclictest          [.] main
   0.17%  cyclictest  [kernel.vmlinux]    [k] clockevents_program_event
   0.17%  cyclictest  [kernel.vmlinux]    [k] hrtimer_reprogram
   0.16%  cyclictest  [kernel.vmlinux]    [k] find_busiest_group
   0.15%  cyclictest  [kernel.vmlinux]    [k] enqueue_hrtimer
   0.15%  cyclictest  libc-2.23.so        [.] __nanosleep
   0.14%  cyclictest  cyclictest          [.] clock_gettime@plt
   0.14%  cyclictest  [kernel.vmlinux]    [k] rcu_note_context_switch
   0.14%  cyclictest  [kernel.vmlinux]    [k] __intel_pmu_enable_all
   0.13%  cyclictest  [kernel.vmlinux]    [k] hrtimer_active
   0.13%  cyclictest  libpthread-2.23.so  [.] __nanosleep
   0.13%  cyclictest  [kernel.vmlinux]    [k] __msecs_to_jiffies
   0.13%  cyclictest  [kernel.vmlinux]    [k] rcu_all_qs
   0.13%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c24
   0.13%  cyclictest  [kernel.vmlinux]    [k] hrtimer_try_to_cancel
   0.13%  cyclictest  cyclictest          [.] nanosleep@plt
   0.12%  cyclictest  [kernel.vmlinux]    [k] __intel_pmu_disable_all
   0.12%  cyclictest  [kernel.vmlinux]    [k] _cond_resched
   0.10%  cyclictest  cyclictest          [.] print_stat
   0.10%  cyclictest  [kernel.vmlinux]    [k] perf_ctx_unlock
   0.10%  cyclictest  [kernel.vmlinux]    [k] sched_clock
   0.10%  cyclictest  [kernel.vmlinux]    [k] free_unref_page_list
   0.09%  cyclictest  [vdso]              [.] 0x0000000000000987
   0.09%  cyclictest  [kernel.vmlinux]    [k] update_cfs_group
   0.08%  cyclictest  [kernel.vmlinux]    [k] copy_semundo
   0.08%  cyclictest  [kernel.vmlinux]    [k] update_blocked_averages
   0.07%  cyclictest  [kernel.vmlinux]    [k] tick_program_event
   0.07%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_fair
   0.07%  cyclictest  [kernel.vmlinux]    [k] __handle_mm_fault
   0.06%  cyclictest  [kernel.vmlinux]    [k] plist_del
   0.06%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c0d
   0.06%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_fair
   0.06%  cyclictest  [vdso]              [.] 0x0000000000000979
   0.06%  cyclictest  [kernel.vmlinux]    [k] munlock_vma_pages_range
   0.05%  cyclictest  [vdso]              [.] 0x000000000000097d
   0.05%  cyclictest  [kernel.vmlinux]    [k] __lru_cache_add
   0.05%  cyclictest  [kernel.vmlinux]    [k] update_load_avg
   0.05%  cyclictest  [kernel.vmlinux]    [k] pick_next_entity
   0.04%  cyclictest  [kernel.vmlinux]    [k] update_rt_migration
   0.04%  cyclictest  [kernel.vmlinux]    [k] pagevec_lru_move_fn
   0.04%  cyclictest  [kernel.vmlinux]    [k] reusable_anon_vma
   0.04%  cyclictest  [kernel.vmlinux]    [k] pmd_huge
   0.04%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c21
   0.04%  cyclictest  ld-2.23.so          [.] strcmp
   0.04%  cyclictest  libc-2.23.so        [.] __call_tls_dtors
   0.04%  cyclictest  [kernel.vmlinux]    [k] free_unref_page_commit
   0.04%  cyclictest  [kernel.vmlinux]    [k] unmap_page_range
   0.04%  cyclictest  [kernel.vmlinux]    [k] PageHeadHuge
   0.04%  cyclictest  [kernel.vmlinux]    [k] vma_interval_tree_iter_next
   0.04%  cyclictest  [kernel.vmlinux]    [k] rmap_walk_file
   0.04%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c12
   0.04%  cyclictest  [kernel.vmlinux]    [k] __update_load_avg_se.isra.41
   0.03%  cyclictest  [kernel.vmlinux]    [k] __munlock_pagevec
   0.03%  cyclictest  libc-2.23.so        [.] _itoa_word
   0.03%  cyclictest  [kernel.vmlinux]    [k] __pagevec_lru_add_fn
   0.03%  cyclictest  [kernel.vmlinux]    [k] ext4_file_mmap
   0.03%  cyclictest  [kernel.vmlinux]    [k] free_pcppages_bulk
   0.02%  cyclictest  [kernel.vmlinux]    [k] lock_hrtimer_base.isra.19
   0.02%  cyclictest  [kernel.vmlinux]    [k] __mod_node_page_state
   0.02%  cyclictest  [kernel.vmlinux]    [k] rb_erase
   0.02%  cyclictest  [kernel.vmlinux]    [k] release_pages
   0.02%  cyclictest  [kernel.vmlinux]    [k] clear_buddies
   0.02%  cyclictest  [kernel.vmlinux]    [k] __hrtimer_next_event_base
   0.02%  cyclictest  [kernel.vmlinux]    [k] swapgs_restore_regs_and_return_t
   0.02%  cyclictest  [kernel.vmlinux]    [k] reweight_entity
   0.02%  cyclictest  [kernel.vmlinux]    [k] update_curr
   0.02%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c29
   0.02%  cyclictest  libpthread-2.23.so  [.] start_thread
   0.02%  cyclictest  [kernel.vmlinux]    [k] xhci_queue_intr_tx
   0.02%  cyclictest  [kernel.vmlinux]    [k] __acct_update_integrals
   0.02%  cyclictest  [vdso]              [.] 0x0000000000000971
   0.02%  cyclictest  [kernel.vmlinux]    [k] xhci_check_args
   0.01%  cyclictest  [kernel.vmlinux]    [k] idle_cpu
   0.01%  cyclictest  [kernel.vmlinux]    [k] scheduler_tick
   0.01%  cyclictest  [kernel.vmlinux]    [k] __delayacct_add_tsk
   0.01%  cyclictest  [kernel.vmlinux]    [k] perf_pmu_disable.part.98
   0.01%  cyclictest  [kernel.vmlinux]    [k] end_repeat_nmi
   0.00%  cyclictest  [kernel.vmlinux]    [k] native_irq_return_iret
   0.00%  cyclictest  [kernel.vmlinux]    [k] nmi_restore
   0.00%  cyclictest  [kernel.vmlinux]    [k] do_nmi
   0.00%  cyclictest  [kernel.vmlinux]    [k] nmi_handle
   0.00%  perf        [kernel.vmlinux]    [k] perf_event_addr_filters_exec
   0.00%  cyclictest  [kernel.vmlinux]    [k] perf_event_nmi_handler
   0.00%  perf        [kernel.vmlinux]    [k] native_write_msr
   0.00%  cyclictest  [unknown]           [.] 0000000000000000

SCHED EVENTS ENABLED:

   7.33%  cyclictest  [kernel.vmlinux]    [k] syscall_return_via_sysret
   6.51%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_trampoline
   5.84%  cyclictest  [kernel.vmlinux]    [k] native_write_msr
   4.00%  cyclictest  cyclictest          [.] timerthread
   3.62%  cyclictest  [kernel.vmlinux]    [k] __schedule
   2.71%  cyclictest  [kernel.vmlinux]    [k] cpupri_set
   2.57%  cyclictest  [kernel.vmlinux]    [k] finish_task_switch
   2.46%  cyclictest  [kernel.vmlinux]    [k] native_sched_clock
   2.19%  cyclictest  [kernel.vmlinux]    [k] __x86_indirect_thunk_rax
   2.03%  cyclictest  [kernel.vmlinux]    [k] ktime_get
   2.01%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_fair
   1.93%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock
   1.90%  cyclictest  [kernel.vmlinux]    [k] timerqueue_add
   1.70%  cyclictest  [kernel.vmlinux]    [k] do_nanosleep
   1.67%  cyclictest  [kernel.vmlinux]    [k] read_tsc
   1.65%  cyclictest  [kernel.vmlinux]    [k] cpuacct_charge
   1.60%  cyclictest  [kernel.vmlinux]    [k] update_curr_rt
   1.49%  cyclictest  [kernel.vmlinux]    [k] probe_sched_switch
   1.40%  cyclictest  [kernel.vmlinux]    [k] __x64_sys_nanosleep
   1.34%  cyclictest  [kernel.vmlinux]    [k] dequeue_rt_stack
   1.28%  cyclictest  [kernel.vmlinux]    [k] do_syscall_64
   1.27%  cyclictest  [kernel.vmlinux]    [k] x86_pmu_disable
   1.24%  cyclictest  [kernel.vmlinux]    [k] __rb_reserve_next
   1.23%  cyclictest  [vdso]              [.] 0x0000000000000977
   1.21%  cyclictest  [vdso]              [.] __vdso_clock_gettime
   1.06%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1d
   1.06%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_stage2
   1.03%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_dl
   1.00%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_rt
   1.00%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_rt
   0.98%  cyclictest  [kernel.vmlinux]    [k] trace_event_raw_event_sched_swit
   0.97%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1b
   0.95%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_idle
   0.91%  cyclictest  [kernel.vmlinux]    [k] trace_save_cmdline
   0.90%  cyclictest  [kernel.vmlinux]    [k] copy_user_generic_unrolled
   0.90%  cyclictest  [kernel.vmlinux]    [k] ctx_sched_out
   0.90%  cyclictest  [kernel.vmlinux]    [k] lapic_next_deadline
   0.86%  cyclictest  [kernel.vmlinux]    [k] intel_pmu_disable_all
   0.85%  cyclictest  [kernel.vmlinux]    [k] rb_commit
   0.84%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64
   0.82%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
   0.80%  cyclictest  [kernel.vmlinux]    [k] hrtimer_nanosleep
   0.76%  cyclictest  [kernel.vmlinux]    [k] _copy_from_user
   0.75%  cyclictest  [kernel.vmlinux]    [k] __hrtimer_init
   0.72%  cyclictest  [kernel.vmlinux]    [k] ring_buffer_lock_reserve
   0.69%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_out
   0.66%  cyclictest  [kernel.vmlinux]    [k] hrtimer_start_range_ns
   0.63%  cyclictest  [kernel.vmlinux]    [k] get_nohz_timer_target
   0.62%  cyclictest  [kernel.vmlinux]    [k] clear_page_erms
   0.61%  cyclictest  libpthread-2.23.so  [.] __pthread_disable_asynccancel
   0.59%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_rt
   0.58%  cyclictest  [kernel.vmlinux]    [k] schedule
   0.56%  cyclictest  [kernel.vmlinux]    [k] pull_rt_task
   0.55%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestore
   0.54%  cyclictest  [kernel.vmlinux]    [k] rb_insert_color
   0.52%  cyclictest  [kernel.vmlinux]    [k] trace_event_buffer_reserve
   0.51%  cyclictest  [kernel.vmlinux]    [k] trace_clock_local
   0.50%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_stop
   0.48%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_in
   0.43%  cyclictest  [kernel.vmlinux]    [k] rcu_note_context_switch
   0.41%  cyclictest  [kernel.vmlinux]    [k] trace_event_buffer_commit
   0.41%  cyclictest  [kernel.vmlinux]    [k] deactivate_task
   0.39%  cyclictest  [kernel.vmlinux]    [k] hrtimer_reprogram
   0.36%  cyclictest  libpthread-2.23.so  [.] __pthread_enable_asynccancel
   0.36%  cyclictest  libc-2.23.so        [.] __clock_gettime
   0.34%  cyclictest  [kernel.vmlinux]    [k] tracing_generic_entry_update
   0.34%  cyclictest  [kernel.vmlinux]    [k] enqueue_hrtimer
   0.29%  cyclictest  [kernel.vmlinux]    [k] dequeue_top_rt_rq
   0.29%  cyclictest  [kernel.vmlinux]    [k] perf_ctx_unlock
   0.28%  cyclictest  [kernel.vmlinux]    [k] copy_user_enhanced_fast_string
   0.27%  cyclictest  libpthread-2.23.so  [.] __nanosleep
   0.26%  cyclictest  [kernel.vmlinux]    [k] ring_buffer_unlock_commit
   0.25%  cyclictest  [kernel.vmlinux]    [k] trace_event_buffer_lock_reserve
   0.24%  cyclictest  [kernel.vmlinux]    [k] get_timespec64
   0.23%  cyclictest  [kernel.vmlinux]    [k] find_busiest_group
   0.22%  cyclictest  cyclictest          [.] print_stat
   0.21%  cyclictest  [kernel.vmlinux]    [k] free_unref_page_list
   0.21%  cyclictest  [kernel.vmlinux]    [k] __intel_pmu_enable_all
   0.21%  cyclictest  [kernel.vmlinux]    [k] rb_move_tail
   0.20%  cyclictest  [kernel.vmlinux]    [k] ring_buffer_event_data
   0.20%  cyclictest  [kernel.vmlinux]    [k] vma_interval_tree_subtree_search
   0.20%  cyclictest  [kernel.vmlinux]    [k] sched_avg_update
   0.19%  cyclictest  [kernel.vmlinux]    [k] tick_program_event
   0.19%  cyclictest  [kernel.vmlinux]    [k] follow_page_mask
   0.19%  cyclictest  [kernel.vmlinux]    [k] sched_clock_cpu
   0.19%  cyclictest  [kernel.vmlinux]    [k] tracing_record_taskinfo_sched_sw
   0.16%  cyclictest  [kernel.vmlinux]    [k] get_page_from_freelist
   0.16%  cyclictest  [kernel.vmlinux]    [k] clockevents_program_event
   0.15%  cyclictest  [vdso]              [.] 0x0000000000000974
   0.15%  cyclictest  [kernel.vmlinux]    [k] hrtimer_try_to_cancel
   0.14%  cyclictest  [kernel.vmlinux]    [k] __alloc_pages_nodemask
   0.13%  cyclictest  cyclictest          [.] main
   0.13%  cyclictest  ld-2.23.so          [.] _dl_sysdep_start
   0.13%  cyclictest  [kernel.vmlinux]    [k] rmap_walk_file
   0.12%  cyclictest  [kernel.vmlinux]    [k] update_rq_clock
   0.12%  cyclictest  libc-2.23.so        [.] __libc_disable_asynccancel
   0.12%  cyclictest  [vdso]              [.] 0x0000000000000970
   0.12%  cyclictest  [kernel.vmlinux]    [k] hrtimer_active
   0.11%  cyclictest  [kernel.vmlinux]    [k] page_not_mapped
   0.11%  cyclictest  [kernel.vmlinux]    [k] sched_clock
   0.10%  cyclictest  [kernel.vmlinux]    [k] __mod_zone_page_state
   0.10%  cyclictest  [kernel.vmlinux]    [k] account_entity_dequeue
   0.09%  cyclictest  libc-2.23.so        [.] __nanosleep
   0.09%  cyclictest  [kernel.vmlinux]    [k] rcu_all_qs
   0.09%  cyclictest  [kernel.vmlinux]    [k] _cond_resched
   0.08%  cyclictest  [kernel.vmlinux]    [k] ring_buffer_record_is_on
   0.08%  cyclictest  [kernel.vmlinux]    [k] dequeue_entity
   0.08%  cyclictest  [kernel.vmlinux]    [k] lock_hrtimer_base.isra.19
   0.08%  cyclictest  [kernel.vmlinux]    [k] update_blocked_averages
   0.08%  cyclictest  [kernel.vmlinux]    [k] ftrace_trace_userstack
   0.07%  cyclictest  [kernel.vmlinux]    [k] try_to_unmap_one
   0.07%  cyclictest  [kernel.vmlinux]    [k] tracing_record_taskinfo_skip
   0.07%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c38
   0.07%  cyclictest  [kernel.vmlinux]    [k] __intel_pmu_disable_all
   0.07%  cyclictest  [kernel.vmlinux]    [k] free_pcppages_bulk
   0.07%  cyclictest  [kernel.vmlinux]    [k] task_ctx_sched_out
   0.07%  cyclictest  [kernel.vmlinux]    [k] __msecs_to_jiffies
   0.07%  cyclictest  [vdso]              [.] 0x0000000000000987
   0.07%  cyclictest  [kernel.vmlinux]    [k] update_rt_migration
   0.06%  cyclictest  [kernel.vmlinux]    [k] reweight_entity
   0.06%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_fair
   0.06%  cyclictest  [kernel.vmlinux]    [k] update_load_avg
   0.06%  cyclictest  [kernel.vmlinux]    [k] trace_buffer_unlock_commit_regs
   0.06%  cyclictest  [kernel.vmlinux]    [k] trace_event_raw_event_sched_stat
   0.05%  cyclictest  [kernel.vmlinux]    [k] update_curr
   0.05%  cyclictest  cyclictest          [.] clock_gettime@plt
   0.04%  cyclictest  [vdso]              [.] 0x000000000000097d
   0.04%  cyclictest  [kernel.vmlinux]    [k] rcu_check_callbacks
   0.04%  cyclictest  [kernel.vmlinux]    [k] load_balance
   0.03%  cyclictest  [kernel.vmlinux]    [k] perf_pmu_disable.part.98
   0.03%  cyclictest  [kernel.vmlinux]    [k] __update_load_avg_se.isra.41
   0.03%  cyclictest  [kernel.vmlinux]    [k] __calc_delta
   0.03%  cyclictest  [kernel.vmlinux]    [k] futex_wake
   0.03%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c0d
   0.03%  cyclictest  cyclictest          [.] nanosleep@plt
   0.02%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c09
   0.02%  cyclictest  [kernel.vmlinux]    [k] set_next_entity
   0.02%  cyclictest  [kernel.vmlinux]    [k] switch_mm_irqs_off
   0.02%  cyclictest  [kernel.vmlinux]    [k] balance_runtime
   0.02%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_fair
   0.02%  cyclictest  [kernel.vmlinux]    [k] find_task_by_vpid
   0.02%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c12
   0.02%  cyclictest  [kernel.vmlinux]    [k] kfree
   0.02%  cyclictest  [kernel.vmlinux]    [k] put_prev_entity
   0.01%  cyclictest  [kernel.vmlinux]    [k] rb_next
   0.01%  cyclictest  [kernel.vmlinux]    [k] ring_buffer_time_stamp_abs
   0.01%  cyclictest  [kernel.vmlinux]    [k] update_min_vruntime
   0.01%  cyclictest  [kernel.vmlinux]    [k] enqueue_top_rt_rq
   0.01%  cyclictest  [kernel.vmlinux]    [k] __kmalloc_node
   0.01%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c30
   0.01%  cyclictest  [kernel.vmlinux]    [k] ktime_get_update_offsets_now
   0.01%  cyclictest  [kernel.vmlinux]    [k] check_cfs_rq_runtime
   0.01%  cyclictest  [kernel.vmlinux]    [k] cpu_load_update
   0.01%  cyclictest  [kernel.vmlinux]    [k] flush_signal_handlers
   0.01%  cyclictest  [kernel.vmlinux]    [k] page_counter_uncharge
   0.00%  cyclictest  [kernel.vmlinux]    [k] end_repeat_nmi
   0.00%  cyclictest  [kernel.vmlinux]    [k] nmi_cpu_backtrace_handler
   0.00%  cyclictest  [kernel.vmlinux]    [k] nmi_restore
   0.00%  cyclictest  [kernel.vmlinux]    [k] nmi_cpu_backtrace
   0.00%  cyclictest  [kernel.vmlinux]    [k] nmi_handle
   0.00%  perf        [kernel.vmlinux]    [k] native_sched_clock
   0.00%  cyclictest  [kernel.vmlinux]    [k] intel_pmu_handle_irq
   0.00%  cyclictest  [kernel.vmlinux]    [k] perf_event_nmi_handler
   0.00%  cyclictest  [kernel.vmlinux]    [k] native_apic_msr_write
   0.00%  perf        [kernel.vmlinux]    [k] native_write_msr

> One thing that I notice too is that we have trace_save_cmdline that
> also gets activated when enabling trace events. Perhaps I should try to
> optimize that more.
>
>> I have run the following test 20 times, and thrown away the first results:
>>
>> $ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q
>>
>> $ uname -a
>> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> For brevity, this is a comparison of one test's results. All other test results show
>> the same ~25% increase.
>>
>> On the left side, the run without ftrace sched events; on the right side, with ftrace
>> sched events enabled.
>>
>> CPU   Count   Min   Act   Avg     Max   Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
>> 0    100000  2339  2936  2841  139478  100000        2900        3182        3566       93056
>> 1     66742  2365  3386  2874   93639   66750        2959        3786        3646      154074
>> 2     50080  2376  3058  2910  196221   50097        2997        4209        3655       18707
>> 3     40076  2394  3461  2931   17914   40091        3006        4417        3750       17159
>> 4     33404  2371  3612  2834   15336   33419        2997        3836        3594       23172
>> 5     28635  2387  3313  2885   25863   28649        2995        3795        3647        9956
>> 6     25058  2384  3428  2968   12162   25071        3051        4366        3719       18151
>> 7     22275  2381  2859  2982   10706   22287        3046        5078        3825       10781
>>
>> I would be thankful for any advice or comments on this, especially with the goal
>> in mind of lowering the runtime impact on the system as much as possible.
>
> But still, we use all sorts of trace events when dealing with real
> time, and trace events hardly show anything in the PREEMPT_RT kernel.
> There, cyclictest shows 20 microseconds, and sched events hardly touch
> that. But I see this weird side effect on a distro kernel too. Maybe
> there's something weird going on when we have CONFIG_PREEMPT_VOLUNTARY
> set :-/

If it helps, I tested also CONFIG_PREEMPT_NONE=y and CONFIG_PREEMPT=y,
with the same relative result (~25% increase).

> I'll investigate further.

If stress-ng --switch-ops is of any help, I have some results for that as well.
This is with CONFIG_PREEMPT_NONE:

SCHED EVENTS DISABLED

$ stress-ng --times -s 8 --switch-ops 20000000

Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
1          22.56     6.33    92.96     99.29      886525
2          23.09     6.34    92.34     98.68      866176
3          23.54     6.35    89.02     95.37      849618
4          22.19     6.56    92.73     99.29      901306
5          22.69     6.41    92.77     99.18      881445

SCHED EVENTS ENABLED

$ stress-ng --times -s 8 --switch-ops 20000000

Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
1          36.28     4.12    95.31     99.43      551268
2          37.61     4.07    95.18     99.25      531773
3          37.96     4.09    94.94     99.03      526870
4          38.17     4.01    94.97     98.99      523972
5          37.47     4.11    94.92     99.03      533760

This is with CONFIG_PREEMPT:

SCHED EVENTS DISABLED

$ stress-ng --times -s 8 --switch-ops 20000000

Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
1          25.73     5.65    93.28     98.93      777303
2          25.74     5.68    93.14     98.82      777001
3          24.84     5.88    92.95     98.83      805153
4          25.12     5.69    93.06     98.75      796178
5          26.04     5.58    92.79     98.37      768049

SCHED EVENTS ENABLED

Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
1          41.70     3.72    95.90     99.62      479616
2          42.91     3.59    95.52     99.11      466092
3          43.24     3.48    95.57     99.05      462535
4          43.04     3.46    95.83     99.30      464684
5          42.62     3.60    95.75     99.34      469263

This is with CONFIG_PREEMPT_VOLUNTARY:

SCHED EVENTS DISABLED

$ stress-ng --times -s 8 --switch-ops 20000000

Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
1          23.27     6.24    91.41     97.64      859476
2          23.51     6.16    92.70     98.86      850702
3          23.47     6.42    92.62     99.04      852152
4          23.92     6.34    92.47     98.81      836120
5          23.63     6.28    92.77     99.05      846382

SCHED EVENTS ENABLED

Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
1          38.65     3.82    95.12     98.94      517464
2          38.77     4.27    94.77     99.03      515863
3          38.70     3.96    95.29     99.25      516796
4          38.52     4.06    95.32     99.38      519211
5          38.58     4.24    94.71     98.95      518403

> Thanks for the report.
>
> -- Steve

Thank you,

I will try to get more familiar with the internals
and then investigate as well.

Claudio
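As a rough cross-check, the bogo ops/s figures above line up with the
cyclictest deltas. Taking the CONFIG_PREEMPT_NONE runs, and assuming one bogo
op corresponds to roughly one context switch:

  1 / 886525 ops/s  ~= 1.13 us per op   (sched events disabled)
  1 / 551268 ops/s  ~= 1.81 us per op   (sched events enabled)
  difference        ~= 0.68 us of added cost per op

which is the same order as the ~0.7-0.8 us increase in the cyclictest
averages.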
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Claudio @ 2018-07-09 14:53 UTC
To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On 07/09/2018 12:06 PM, Claudio wrote:
> Hello Steve,
>
> thank you for your response,
>
> On 07/06/2018 11:24 PM, Steven Rostedt wrote:
>> On Fri, 6 Jul 2018 08:22:01 +0200
>> Claudio <claudio.fontana@gliwa.com> wrote:
>>
>>> Hello all,
>>>
>>> I have been experimenting with the idea of leaving ftrace enabled, with sched events,
>>> on production systems.
>>>
>>> The main concern that I am having at the moment is about the impact on the system.
>>> Enabling the sched events that I currently need for the tracing application
>>> seems to slow down context switches considerably, and make the system less responsive.
>>>
>>> I have tested with cyclictest on the mainline kernel, and noticed an increase of
>>> around 25% in min and avg latencies.
>>>
>>> Is this expected?
>>
>> No it is not. And this is weird. I'm seeing it too.
>>
>>> Some initial investigation into ftrace seems to point at the reservation and commit
>>> of the events into the ring buffer as the highest sources of overhead, while copying
>>> of event parameters, including COMM, does not seem to have any noticeable effect
>>> relative to those costs.
>>
>> Did you use perf to see where the issues were? If so what perf command
>> did you use?
>
> I did not use perf, I will now do it and see what I get.
> You can discard my comment regarding reservation and commit, it was based
> on a wrong premise.
>
> Regarding parameter copying, I just brutally modified the code to not fill
> fields (TP_fast_assign), and I just noticed that it did not change the
> result significantly.
>
>> When running just:
>>
>>   perf record ./cyclictest --smp -p95 -m -s -N -l 1000000 -q
>>
>> I get this:
>>
>>    6.08%  cyclictest  [kernel.vmlinux]  [k] syscall_return_via_sysret
>>    4.03%  cyclictest  cyclictest        [.] timerthread
>>    3.97%  cyclictest  [kernel.vmlinux]  [k] __schedule
>>    3.81%  cyclictest  [kernel.vmlinux]  [k] native_write_msr
>>    3.13%  cyclictest  [kernel.vmlinux]  [k] trace_save_cmdline
>>    2.71%  cyclictest  [kernel.vmlinux]  [k] do_nanosleep
>>    2.48%  cyclictest  [kernel.vmlinux]  [k] do_syscall_64
>>    2.11%  cyclictest  [kernel.vmlinux]  [k] pick_next_task_fair
>>    2.03%  cyclictest  [kernel.vmlinux]  [k] native_sched_clock
>>    1.99%  cyclictest  [kernel.vmlinux]  [k] cpuacct_charge
>>    1.93%  cyclictest  [vdso]            [.] __vdso_clock_gettime
>>    1.86%  cyclictest  [kernel.vmlinux]  [k] hrtimer_nanosleep
>>    1.84%  cyclictest  [kernel.vmlinux]  [k] cpupri_set
>>    1.66%  cyclictest  [kernel.vmlinux]  [k] update_curr_rt
>>    1.65%  cyclictest  [kernel.vmlinux]  [k] __perf_event_task_sched_out
>>    1.59%  cyclictest  [kernel.vmlinux]  [k] dequeue_rt_stack
>>    1.55%  cyclictest  [kernel.vmlinux]  [k] __rb_reserve_next
>>
>> Which shows there's a lot being done before we even get to the ring
>> buffer reserve.
>
> I get (CONFIG_PREEMPT_VOLUNTARY):
>
> SCHED EVENTS DISABLED:
>
>    8.57%  cyclictest  [kernel.vmlinux]    [k] syscall_return_via_sysret
>    7.04%  cyclictest  [kernel.vmlinux]    [k] native_write_msr
>    6.93%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_trampoline
>    4.42%  cyclictest  cyclictest          [.] timerthread
>    4.14%  cyclictest  [kernel.vmlinux]    [k] __schedule
>    3.52%  cyclictest  [kernel.vmlinux]    [k] cpupri_set
>    2.90%  cyclictest  [kernel.vmlinux]    [k] finish_task_switch
>    2.42%  cyclictest  [kernel.vmlinux]    [k] update_curr_rt
>    2.34%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock
>    2.28%  cyclictest  [kernel.vmlinux]    [k] ktime_get
>    2.14%  cyclictest  [kernel.vmlinux]    [k] timerqueue_add
>    2.06%  cyclictest  [kernel.vmlinux]    [k] read_tsc
>    2.06%  cyclictest  [kernel.vmlinux]    [k] native_sched_clock
>    2.01%  cyclictest  [vdso]              [.] __vdso_clock_gettime
>    1.95%  cyclictest  [kernel.vmlinux]    [k] do_nanosleep
>    1.93%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_fair
>    1.89%  cyclictest  [vdso]              [.] 0x0000000000000977
>    1.75%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_rt
>    1.64%  cyclictest  [kernel.vmlinux]    [k] x86_pmu_disable
>    1.57%  cyclictest  [kernel.vmlinux]    [k] dequeue_rt_stack
>    1.53%  cyclictest  [kernel.vmlinux]    [k] __x64_sys_nanosleep
>    1.51%  cyclictest  [kernel.vmlinux]    [k] do_syscall_64
>    1.41%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_rt
>    1.34%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_dl
>    1.28%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_rt
>    1.25%  cyclictest  [kernel.vmlinux]    [k] cpuacct_charge
>    1.25%  cyclictest  [kernel.vmlinux]    [k] hrtimer_start_range_ns
>    1.24%  cyclictest  [kernel.vmlinux]    [k] __x86_indirect_thunk_rax
>    1.07%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1b
>    0.94%  cyclictest  [kernel.vmlinux]    [k] pull_rt_task
>    0.93%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_stage2
>    0.89%  cyclictest  [kernel.vmlinux]    [k] hrtimer_nanosleep
>    0.89%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1d
>    0.88%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64
>    0.79%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
>    0.79%  cyclictest  [kernel.vmlinux]    [k] ctx_sched_out
>    0.78%  cyclictest  [kernel.vmlinux]    [k] copy_user_generic_unrolled
>    0.77%  cyclictest  libpthread-2.23.so  [.] __pthread_enable_asynccancel
>    0.71%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_out
>    0.71%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_idle
>    0.70%  cyclictest  libpthread-2.23.so  [.] __pthread_disable_asynccancel
>    0.70%  cyclictest  [kernel.vmlinux]    [k] __hrtimer_init
>    0.68%  cyclictest  [kernel.vmlinux]    [k] clear_page_erms
>    0.64%  cyclictest  [kernel.vmlinux]    [k] intel_pmu_disable_all
>    0.58%  cyclictest  [kernel.vmlinux]    [k] copy_user_enhanced_fast_string
>    0.58%  cyclictest  [kernel.vmlinux]    [k] _copy_from_user
>    0.58%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestore
>    0.54%  cyclictest  [kernel.vmlinux]    [k] get_nohz_timer_target
>    0.54%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_stop
[...]
>
> SCHED EVENTS ENABLED:
>
>    7.33%  cyclictest  [kernel.vmlinux]    [k] syscall_return_via_sysret
>    6.51%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_trampoline
>    5.84%  cyclictest  [kernel.vmlinux]    [k] native_write_msr
>    4.00%  cyclictest  cyclictest          [.] timerthread
>    3.62%  cyclictest  [kernel.vmlinux]    [k] __schedule
>    2.71%  cyclictest  [kernel.vmlinux]    [k] cpupri_set
>    2.57%  cyclictest  [kernel.vmlinux]    [k] finish_task_switch
>    2.46%  cyclictest  [kernel.vmlinux]    [k] native_sched_clock
>    2.19%  cyclictest  [kernel.vmlinux]    [k] __x86_indirect_thunk_rax
>    2.03%  cyclictest  [kernel.vmlinux]    [k] ktime_get
>    2.01%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_fair
>    1.93%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock
>    1.90%  cyclictest  [kernel.vmlinux]    [k] timerqueue_add
>    1.70%  cyclictest  [kernel.vmlinux]    [k] do_nanosleep
>    1.67%  cyclictest  [kernel.vmlinux]    [k] read_tsc
>    1.65%  cyclictest  [kernel.vmlinux]    [k] cpuacct_charge
>    1.60%  cyclictest  [kernel.vmlinux]    [k] update_curr_rt
>    1.49%  cyclictest  [kernel.vmlinux]    [k] probe_sched_switch
>    1.40%  cyclictest  [kernel.vmlinux]    [k] __x64_sys_nanosleep
>    1.34%  cyclictest  [kernel.vmlinux]    [k] dequeue_rt_stack
>    1.28%  cyclictest  [kernel.vmlinux]    [k] do_syscall_64
>    1.27%  cyclictest  [kernel.vmlinux]    [k] x86_pmu_disable
>    1.24%  cyclictest  [kernel.vmlinux]    [k] __rb_reserve_next
>    1.23%  cyclictest  [vdso]              [.] 0x0000000000000977
>    1.21%  cyclictest  [vdso]              [.] __vdso_clock_gettime
>    1.06%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1d
>    1.06%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64_stage2
>    1.03%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_dl
>    1.00%  cyclictest  [kernel.vmlinux]    [k] put_prev_task_rt
>    1.00%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_rt
>    0.98%  cyclictest  [kernel.vmlinux]    [k] trace_event_raw_event_sched_swit
>    0.97%  cyclictest  libpthread-2.23.so  [.] 0x0000000000010c1b
>    0.95%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_idle
>    0.91%  cyclictest  [kernel.vmlinux]    [k] trace_save_cmdline
>    0.90%  cyclictest  [kernel.vmlinux]    [k] copy_user_generic_unrolled
>    0.90%  cyclictest  [kernel.vmlinux]    [k] ctx_sched_out
>    0.90%  cyclictest  [kernel.vmlinux]    [k] lapic_next_deadline
>    0.86%  cyclictest  [kernel.vmlinux]    [k] intel_pmu_disable_all
>    0.85%  cyclictest  [kernel.vmlinux]    [k] rb_commit
>    0.84%  cyclictest  [kernel.vmlinux]    [k] entry_SYSCALL_64
>    0.82%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
>    0.80%  cyclictest  [kernel.vmlinux]    [k] hrtimer_nanosleep
>    0.76%  cyclictest  [kernel.vmlinux]    [k] _copy_from_user
>    0.75%  cyclictest  [kernel.vmlinux]    [k] __hrtimer_init
>    0.72%  cyclictest  [kernel.vmlinux]    [k] ring_buffer_lock_reserve
>    0.69%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_out
>    0.66%  cyclictest  [kernel.vmlinux]    [k] hrtimer_start_range_ns
>    0.63%  cyclictest  [kernel.vmlinux]    [k] get_nohz_timer_target
>    0.62%  cyclictest  [kernel.vmlinux]    [k] clear_page_erms
>    0.61%  cyclictest  libpthread-2.23.so  [.] __pthread_disable_asynccancel
>    0.59%  cyclictest  [kernel.vmlinux]    [k] dequeue_task_rt
>    0.58%  cyclictest  [kernel.vmlinux]    [k] schedule
>    0.56%  cyclictest  [kernel.vmlinux]    [k] pull_rt_task
>    0.55%  cyclictest  [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestore
>    0.54%  cyclictest  [kernel.vmlinux]    [k] rb_insert_color
>    0.52%  cyclictest  [kernel.vmlinux]    [k] trace_event_buffer_reserve
>    0.51%  cyclictest  [kernel.vmlinux]    [k] trace_clock_local
>    0.50%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_stop
[...]
>
>> One thing that I notice too is that we have trace_save_cmdline that
>> also gets activated when enabling trace events. Perhaps I should try to
>> optimize that more.
>>> I have run the following test 20 times, and thrown away the first results:
>>>
>>> $ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q
>>>
>>> $ uname -a
>>> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> For brevity, this is a comparison of one test's results. All other test results show
>>> the same ~25% increase.
>>>
>>> On the left side, the run without ftrace sched events; on the right side, with ftrace
>>> sched events enabled.
>>>
>>> CPU   Count   Min   Act   Avg     Max   Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
>>> 0    100000  2339  2936  2841  139478  100000        2900        3182        3566       93056
>>> 1     66742  2365  3386  2874   93639   66750        2959        3786        3646      154074
>>> 2     50080  2376  3058  2910  196221   50097        2997        4209        3655       18707
>>> 3     40076  2394  3461  2931   17914   40091        3006        4417        3750       17159
>>> 4     33404  2371  3612  2834   15336   33419        2997        3836        3594       23172
>>> 5     28635  2387  3313  2885   25863   28649        2995        3795        3647        9956
>>> 6     25058  2384  3428  2968   12162   25071        3051        4366        3719       18151
>>> 7     22275  2381  2859  2982   10706   22287        3046        5078        3825       10781
>>>
>>> I would be thankful for any advice or comments on this, especially with the goal
>>> in mind of lowering the runtime impact on the system as much as possible.
>>
>> But still, we use all sorts of trace events when dealing with real
>> time, and trace events hardly show anything in the PREEMPT_RT kernel.
>> There, cyclictest shows 20 microseconds, and sched events hardly touch
>> that. But I see this weird side effect on a distro kernel too. Maybe
>> there's something weird going on when we have CONFIG_PREEMPT_VOLUNTARY
>> set :-/
>
> If it helps, I tested also CONFIG_PREEMPT_NONE=y and CONFIG_PREEMPT=y,
> with the same relative result (~25% increase).
>
>> I'll investigate further.
>
> If stress-ng --switch-ops is of any help, I have some results for that as well.
> This is with CONFIG_PREEMPT_NONE:
>
> SCHED EVENTS DISABLED
>
> $ stress-ng --times -s 8 --switch-ops 20000000
>
> Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
> 1          22.56     6.33    92.96     99.29      886525
> 2          23.09     6.34    92.34     98.68      866176
> 3          23.54     6.35    89.02     95.37      849618
> 4          22.19     6.56    92.73     99.29      901306
> 5          22.69     6.41    92.77     99.18      881445
>
> SCHED EVENTS ENABLED
>
> $ stress-ng --times -s 8 --switch-ops 20000000
>
> Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
> 1          36.28     4.12    95.31     99.43      551268
> 2          37.61     4.07    95.18     99.25      531773
> 3          37.96     4.09    94.94     99.03      526870
> 4          38.17     4.01    94.97     98.99      523972
> 5          37.47     4.11    94.92     99.03      533760
>
> This is with CONFIG_PREEMPT:
>
> SCHED EVENTS DISABLED
>
> $ stress-ng --times -s 8 --switch-ops 20000000
>
> Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
> 1          25.73     5.65    93.28     98.93      777303
> 2          25.74     5.68    93.14     98.82      777001
> 3          24.84     5.88    92.95     98.83      805153
> 4          25.12     5.69    93.06     98.75      796178
> 5          26.04     5.58    92.79     98.37      768049
>
> SCHED EVENTS ENABLED
>
> Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
> 1          41.70     3.72    95.90     99.62      479616
> 2          42.91     3.59    95.52     99.11      466092
> 3          43.24     3.48    95.57     99.05      462535
> 4          43.04     3.46    95.83     99.30      464684
> 5          42.62     3.60    95.75     99.34      469263
>
> This is with CONFIG_PREEMPT_VOLUNTARY:
>
> SCHED EVENTS DISABLED
>
> $ stress-ng --times -s 8 --switch-ops 20000000
>
> Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
> 1          23.27     6.24    91.41     97.64      859476
> 2          23.51     6.16    92.70     98.86      850702
> 3          23.47     6.42    92.62     99.04      852152
> 4          23.92     6.34    92.47     98.81      836120
> 5          23.63     6.28    92.77     99.05      846382
>
> SCHED EVENTS ENABLED
>
> Test  Runtime(s)  User(%)  Sys(%)  Total(%)  bogo ops/s
> 1          38.65     3.82    95.12     98.94      517464
> 2          38.77     4.27    94.77     99.03      515863
> 3          38.70     3.96    95.29     99.25      516796
> 4          38.52     4.06    95.32     99.38      519211
> 5          38.58     4.24    94.71     98.95      518403
>
>> Thanks for the report.
>>
>> -- Steve
>
> Thank you,
>
> I will try to get more familiar with the internals
> and then investigate as well.
>
> Claudio

One additional data point, based on brute force again:

I applied this change, in order to understand if it was the
trace_event_raw_event_* (I suppose primarily trace_event_raw_event_switch)
that contained the latency "offenders":

diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 4ecdfe2..969467d 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -704,6 +704,8 @@ trace_event_raw_event_##call(void *__data, proto)
 	struct trace_event_raw_##call *entry;			\
 	int __data_size;					\
 								\
+	return;							\
+								\
 	if (trace_trigger_soft_disabled(trace_file))		\
 		return;						\
 								\

This reduces the latency overhead from 25% down to 6%.

Maybe obvious? Wanted to share in case it helps, and will dig further.

Ciao,

Claudio
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Steven Rostedt @ 2018-07-09 15:11 UTC
To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Mon, 9 Jul 2018 16:53:52 +0200
Claudio <claudio.fontana@gliwa.com> wrote:

> One additional data point, based on brute force again:
>
> I applied this change, in order to understand if it was the
> trace_event_raw_event_* (I suppose primarily trace_event_raw_event_switch)
> that contained the latency "offenders":
>
> diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
> index 4ecdfe2..969467d 100644
> --- a/include/trace/trace_events.h
> +++ b/include/trace/trace_events.h
> @@ -704,6 +704,8 @@ trace_event_raw_event_##call(void *__data, proto)
>  	struct trace_event_raw_##call *entry;			\
>  	int __data_size;					\
>  								\
> +	return;							\
> +								\
>  	if (trace_trigger_soft_disabled(trace_file))		\
>  		return;						\
>  								\
>
> This reduces the latency overhead from 25% down to 6%.
>
> Maybe obvious? Wanted to share in case it helps, and will dig further.

I noticed that just disabling tracing ("echo 0 > tracing_on") is very
similar. I'm now recording timings of various parts of the code. But the
most I've seen is 12us, which should not add this much overhead. So it's
triggering something else.

I'll be going on PTO next week, and there's things I must do this week,
thus I may not have much more time to look into this until I get back
from PTO (July 23rd).

-- Steve
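The comparison described here can be reproduced roughly as follows (tracefs
path assumed):

  cd /sys/kernel/tracing
  echo 1 > events/sched/enable
  echo 0 > tracing_on    # tracepoints still fire, but nothing is recorded
  ./cyclictest --smp -p95 -m -s -N -l 100000 -q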
* Re: ftrace performance (sched events): cyclictest shows 25% more latency
From: Steven Rostedt @ 2018-07-09 15:32 UTC
To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Fri, 6 Jul 2018 08:22:01 +0200
Claudio <claudio.fontana@gliwa.com> wrote:

> Hello all,
>
> I have been experimenting with the idea of leaving ftrace enabled, with sched events,
> on production systems.
>
> The main concern that I am having at the moment is about the impact on the system.
> Enabling the sched events that I currently need for the tracing application
> seems to slow down context switches considerably, and make the system less responsive.
>
> I have tested with cyclictest on the mainline kernel, and noticed an increase of
> around 25% in min and avg latencies.
>
> Is this expected?
>
> Some initial investigation into ftrace seems to point at the reservation and commit
> of the events into the ring buffer as the highest sources of overhead, while copying
> of event parameters, including COMM, does not seem to have any noticeable effect
> relative to those costs.
>
> I have run the following test 20 times, and thrown away the first results:
>
> $ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q

OK, I just noticed that you are using -N which means all numbers are in
nanoseconds.

> $ uname -a
> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> For brevity, this is a comparison of one test's results. All other test results show
> the same ~25% increase.
>
> On the left side, the run without ftrace sched events; on the right side, with ftrace
> sched events enabled.
>
> CPU   Count   Min   Act   Avg     Max   Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
> 0    100000  2339  2936  2841  139478  100000        2900        3182        3566       93056
> 1     66742  2365  3386  2874   93639   66750        2959        3786        3646      154074
> 2     50080  2376  3058  2910  196221   50097        2997        4209        3655       18707
> 3     40076  2394  3461  2931   17914   40091        3006        4417        3750       17159
> 4     33404  2371  3612  2834   15336   33419        2997        3836        3594       23172
> 5     28635  2387  3313  2885   25863   28649        2995        3795        3647        9956
> 6     25058  2384  3428  2968   12162   25071        3051        4366        3719       18151
> 7     22275  2381  2859  2982   10706   22287        3046        5078        3825       10781
>
> I would be thankful for any advice or comments on this, especially with the goal
> in mind of lowering the runtime impact on the system as much as possible.

Thus, the tracing is causing the wakeup time to be an average of 0.8us
longer.

Yes that is expected.

-- Steve
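That figure can be read directly off the table above:

  CPU 0: 3566 - 2841 = 725 ns
  CPU 7: 3825 - 2982 = 843 ns

with the other CPUs falling in between (roughly 720-850 ns), hence the ~0.8 us
of added average latency.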
* ftrace global trace_pipe_raw
From: Claudio @ 2018-07-24  9:58 UTC
To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

Hello Steven,

I am doing correlation of Linux sched events, following all tasks between CPUs,
and one thing that would be really convenient would be to have a global
trace_pipe_raw, in addition to the per-CPU ones, with already sorted events.

I would imagine the core functionality is already available, since trace_pipe
in the tracing directory already shows all events regardless of CPU, and so
it would be a matter of doing the same for trace_pipe_raw.

But is there a good reason why trace_pipe_raw is available only per-CPU?

Would work in the direction of adding a global trace_pipe_raw be considered
for inclusion?

Thank you,

Claudio

On 07/09/2018 05:32 PM, Steven Rostedt wrote:
> On Fri, 6 Jul 2018 08:22:01 +0200
> Claudio <claudio.fontana@gliwa.com> wrote:
>
>> Hello all,
>>
>> I have been experimenting with the idea of leaving ftrace enabled, with sched events,
>> on production systems.
>>
>> The main concern that I am having at the moment is about the impact on the system.
>> Enabling the sched events that I currently need for the tracing application
>> seems to slow down context switches considerably, and make the system less responsive.
>>
>> I have tested with cyclictest on the mainline kernel, and noticed an increase of
>> around 25% in min and avg latencies.
>>
>> Is this expected?
>>
>> Some initial investigation into ftrace seems to point at the reservation and commit
>> of the events into the ring buffer as the highest sources of overhead, while copying
>> of event parameters, including COMM, does not seem to have any noticeable effect
>> relative to those costs.
>>
>> I have run the following test 20 times, and thrown away the first results:
>>
>> $ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q
>
> OK, I just noticed that you are using -N which means all numbers are in
> nanoseconds.
>
>> $ uname -a
>> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> For brevity, this is a comparison of one test's results. All other test results show
>> the same ~25% increase.
>>
>> On the left side, the run without ftrace sched events; on the right side, with ftrace
>> sched events enabled.
>>
>> CPU   Count   Min   Act   Avg     Max   Count  Min-ftrace  Act-ftrace  Avg-ftrace  Max-ftrace
>> 0    100000  2339  2936  2841  139478  100000        2900        3182        3566       93056
>> 1     66742  2365  3386  2874   93639   66750        2959        3786        3646      154074
>> 2     50080  2376  3058  2910  196221   50097        2997        4209        3655       18707
>> 3     40076  2394  3461  2931   17914   40091        3006        4417        3750       17159
>> 4     33404  2371  3612  2834   15336   33419        2997        3836        3594       23172
>> 5     28635  2387  3313  2885   25863   28649        2995        3795        3647        9956
>> 6     25058  2384  3428  2968   12162   25071        3051        4366        3719       18151
>> 7     22275  2381  2859  2982   10706   22287        3046        5078        3825       10781
>>
>> I would be thankful for any advice or comments on this, especially with the goal
>> in mind of lowering the runtime impact on the system as much as possible.
>
> Thus, the tracing is causing the wakeup time to be an average of 0.8us
> longer.
>
> Yes that is expected.
>
> -- Steve
* Re: ftrace global trace_pipe_raw 2018-07-24 9:58 ` ftrace global trace_pipe_raw Claudio @ 2018-07-24 14:23 ` Steven Rostedt 2018-07-24 14:25 ` Steven Rostedt 2018-12-19 11:32 ` Claudio 0 siblings, 2 replies; 21+ messages in thread From: Steven Rostedt @ 2018-07-24 14:23 UTC (permalink / raw) To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Tue, 24 Jul 2018 11:58:18 +0200 Claudio <claudio.fontana@gliwa.com> wrote:

> Hello Steven,
>
> I am doing correlation of linux sched events, following all tasks between cpus,
> and one thing that would be really convenient would be to have a global
> trace_pipe_raw, in addition to the per-cpu ones, with already sorted events.
>
> I would imagine the core functionality is already available, since trace_pipe
> in the tracing directory already shows all events regardless of CPU, and so
> it would be a matter of doing the same for trace_pipe_raw.

The difference between trace_pipe and trace_pipe_raw is that trace_pipe is post-processed: it reads the per-CPU buffers and interleaves them one event at a time. The trace_pipe_raw just sends you the raw, unprocessed data directly from the buffers, which are grouped per CPU.

>
> But is there a good reason why trace_pipe_raw is available only per-cpu?

Yes, because it maps the ring buffers themselves without any post-processing.

>
> Would work in the direction of adding a global trace_pipe_raw be considered
> for inclusion?

The design of the lockless ring buffer requires that the writer not be preempted, and that the data not be written from more than one location. To achieve this, we make a per-CPU buffer and disable preemption when writing. This means that we have only one writer at a time. It can handle interrupts and NMIs, because they finish their writes before returning to the interrupted context, and this doesn't break the algorithm. But having writers from multiple CPUs would require locking or other heavy synchronization operations that would greatly reduce the speed of writing to the buffers (not to mention the cache thrashing).

-- Steve

^ permalink raw reply [flat|nested] 21+ messages in thread
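To illustrate the argument rather than the kernel's actual code: below is a userspace analogy of the single-writer-per-CPU scheme, with a signal handler standing in for an interrupt that nests a write on the same "CPU". The real __rb_reserve_next() uses CPU-local operations on the tail pointer so nested writers reserve and commit safely; here a plain atomic bump plays that role, and overflow handling is deliberately simplistic (a real buffer would undo the bump):

    #include <signal.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* one buffer per CPU in the kernel; here, one buffer and a
     * signal handler modeling an interrupt on that same CPU */
    static char data[4096];
    static _Atomic size_t head;

    static char *reserve(size_t len)
    {
        /* single atomic bump: safe against the nested "interrupt"
         * below, with no locks; the kernel can use cheaper CPU-local
         * ops since only this CPU ever writes to this buffer */
        size_t at = atomic_fetch_add(&head, len);
        return at + len <= sizeof(data) ? data + at : NULL;
    }

    static void irq_handler(int sig)
    {
        /* nested writer: runs to completion before returning,
         * just like an interrupt, so the algorithm still holds */
        char *p = reserve(8);
        if (p)
            memcpy(p, "irq_evt", 8);
    }

    int main(void)
    {
        signal(SIGALRM, irq_handler);
        alarm(1);                     /* the "interrupt" arrives later */
        for (int i = 0; i < 400; i++) {
            char *p = reserve(8);     /* the normal writer on this CPU */
            if (p)
                memcpy(p, "sched_ev", 8);
        }
        sleep(2);                     /* let the handler fire and nest */
        printf("bytes reserved: %zu\n", atomic_load(&head));
        return 0;
    }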
* Re: ftrace global trace_pipe_raw 2018-07-24 14:23 ` Steven Rostedt @ 2018-07-24 14:25 ` Steven Rostedt 2018-07-24 15:30 ` Claudio 2018-10-08 16:07 ` Claudio 2018-12-19 11:32 ` Claudio 1 sibling, 2 replies; 21+ messages in thread From: Steven Rostedt @ 2018-07-24 14:25 UTC (permalink / raw) To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel On Tue, 24 Jul 2018 10:23:16 -0400 Steven Rostedt <rostedt@goodmis.org> wrote: > > > > Would work in the direction of adding a global trace_pipe_raw be considered > > for inclusion? > > The design of the lockless ring buffer requires not to be preempted, > and that the data cannot be written to from more than one location. To > do so, we make a per CPU buffer, and disable preemption when writing. > This means that we have only one writer at a time. It can handle > interrupts and NMIs, because they will finish before they return and > this doesn't break the algorithm. But having writers from multiple CPUs > would require locking or other heaving synchronization operations that > will greatly reduce the speed of writing to the buffers (not to mention > the cache thrashing). And why would you need a single buffer? Note, we are working on making libtracecmd.so that will allow applications to read the buffers and the library will take care of the interleaving of the raw data. This should hopefully be ready in about three months or so. -- Steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2018-07-24 14:25 ` Steven Rostedt @ 2018-07-24 15:30 ` Claudio 2018-10-08 16:07 ` Claudio 1 sibling, 0 replies; 21+ messages in thread From: Claudio @ 2018-07-24 15:30 UTC (permalink / raw) To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel Hello Steve, thank you for your answer, On 07/24/2018 04:25 PM, Steven Rostedt wrote: > On Tue, 24 Jul 2018 10:23:16 -0400 > Steven Rostedt <rostedt@goodmis.org> wrote: > >>> >>> Would work in the direction of adding a global trace_pipe_raw be considered >>> for inclusion? >> >> The design of the lockless ring buffer requires not to be preempted, >> and that the data cannot be written to from more than one location. To >> do so, we make a per CPU buffer, and disable preemption when writing. >> This means that we have only one writer at a time. It can handle >> interrupts and NMIs, because they will finish before they return and >> this doesn't break the algorithm. But having writers from multiple CPUs >> would require locking or other heaving synchronization operations that >> will greatly reduce the speed of writing to the buffers (not to mention >> the cache thrashing). > I understand, it is not a simple matter then. > And why would you need a single buffer? I am interested in all events that have to do with a specific task, regardless of the CPU they appear on. Having an already post-processed stream of binary data would be awesome I think. > Note, we are working on making > libtracecmd.so that will allow applications to read the buffers and the > library will take care of the interleaving of the raw data. This should > hopefully be ready in about three months or so. > > -- Steve That would be great! So the library could handle this kind of preprocessing, and create a single stream of sorted timestamps/events with the binary data? Thanks a lot, Claudio ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2018-07-24 14:25 ` Steven Rostedt 2018-07-24 15:30 ` Claudio @ 2018-10-08 16:07 ` Claudio 2018-10-08 16:16 ` Steven Rostedt 1 sibling, 1 reply; 21+ messages in thread From: Claudio @ 2018-10-08 16:07 UTC (permalink / raw) To: Steven Rostedt; +Cc: linux-trace-devel

Hello Steven,

On 07/24/2018 04:25 PM, Steven Rostedt wrote:
> On Tue, 24 Jul 2018 10:23:16 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>>>
>>> Would work in the direction of adding a global trace_pipe_raw be considered
>>> for inclusion?
>>
>> The design of the lockless ring buffer requires not to be preempted,
>> and that the data cannot be written to from more than one location. To
>> do so, we make a per CPU buffer, and disable preemption when writing.
>> This means that we have only one writer at a time. It can handle
>> interrupts and NMIs, because they will finish before they return and
>> this doesn't break the algorithm. But having writers from multiple CPUs
>> would require locking or other heaving synchronization operations that
>> will greatly reduce the speed of writing to the buffers (not to mention
>> the cache thrashing).
>
> And why would you need a single buffer? Note, we are working on making
> libtracecmd.so that will allow applications to read the buffers and the
> library will take care of the interleaving of the raw data. This should
> hopefully be ready in about three months or so.
>
> -- Steve
>

Is this something you will showcase at the Linux tracing summit? Is there a repo / branch I should be following?

The reason why we need to end up with a single stream of events is to be able to do "online" task state correlation and timing parameter calculations for all task-related events, independent of core.

Currently we have this on QNX, and we are trying to enable it for Linux as well.

Thank you,

Claudio

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2018-10-08 16:07 ` Claudio @ 2018-10-08 16:16 ` Steven Rostedt 2018-10-08 19:18 ` Claudio 0 siblings, 1 reply; 21+ messages in thread From: Steven Rostedt @ 2018-10-08 16:16 UTC (permalink / raw) To: Claudio; +Cc: linux-trace-devel

On Mon, 8 Oct 2018 18:07:49 +0200 Claudio <claudio.fontana@gliwa.com> wrote:

> Hello Steven,
>
> On 07/24/2018 04:25 PM, Steven Rostedt wrote:
> > On Tue, 24 Jul 2018 10:23:16 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> >>>
> >>> Would work in the direction of adding a global trace_pipe_raw be considered
> >>> for inclusion?
> >>
> >> The design of the lockless ring buffer requires not to be preempted,
> >> and that the data cannot be written to from more than one location. To
> >> do so, we make a per CPU buffer, and disable preemption when writing.
> >> This means that we have only one writer at a time. It can handle
> >> interrupts and NMIs, because they will finish before they return and
> >> this doesn't break the algorithm. But having writers from multiple CPUs
> >> would require locking or other heaving synchronization operations that
> >> will greatly reduce the speed of writing to the buffers (not to mention
> >> the cache thrashing).
> >
> > And why would you need a single buffer? Note, we are working on making
> > libtracecmd.so that will allow applications to read the buffers and the
> > library will take care of the interleaving of the raw data. This should
> > hopefully be ready in about three months or so.
> >
> > -- Steve
> >
>
> Is this something you will showcase in the linux tracing summit?
> Is there a repo / branch I should be following?

We are preparing the code in tools/lib/traceevent of the Linux kernel to turn that into a library.

At the same time, we are looking at making libtracecmd (or perhaps we'll call it libftrace?) to implement all the trace-cmd code as a library as well. But that's happening in the main trace-cmd repo:

git://git.kernel.org/pub/scm/utils/trace-cmd/trace-cmd.git

>
> The reason why we need to end up with a single stream of events is to be
> able to do "online" task state correlation and timing parameters calculations
> for all task-related events independent of cores.

Well, all the events are timestamped, and you can pick different clocks to use, and a simple merge sort gives all the information you need. Note, having per-CPU buffers makes things much more efficient, as you don't need to synchronize with atomics.

-- Steve

>
> Currently we have this on QNX, and we are trying to enable it for Linux as well.
>
> Thank you,
>
> Claudio

^ permalink raw reply [flat|nested] 21+ messages in thread
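The merge Steve describes can be a linear scan over the per-CPU streams, picking the smallest timestamp each round, which is essentially what the kernel's own text formatter does internally. A minimal userspace sketch, assuming the events have already been decoded into (timestamp, cpu) records:

    #include <stdint.h>
    #include <stdio.h>

    struct event { uint64_t ts; int cpu; /* plus decoded payload */ };

    /* one decoded stream per CPU */
    struct stream { const struct event *pos, *end; };

    static const struct event *peek(struct stream *s)
    {
        return s->pos < s->end ? s->pos : NULL;
    }

    /* pick the stream whose next event has the smallest timestamp */
    static const struct event *next_event(struct stream *st, int ncpus)
    {
        struct stream *best = NULL;
        for (int cpu = 0; cpu < ncpus; cpu++) {
            const struct event *e = peek(&st[cpu]);
            if (e && (!best || e->ts < peek(best)->ts))
                best = &st[cpu];
        }
        return best ? best->pos++ : NULL;
    }

    int main(void)
    {
        /* sample data standing in for two decoded per-CPU buffers */
        static const struct event c0[] = { {100, 0}, {300, 0} };
        static const struct event c1[] = { {50, 1}, {200, 1}, {400, 1} };
        struct stream st[] = { { c0, c0 + 2 }, { c1, c1 + 3 } };

        for (const struct event *e; (e = next_event(st, 2)); )
            printf("ts=%llu cpu=%d\n", (unsigned long long)e->ts, e->cpu);
        return 0;
    }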
* Re: ftrace global trace_pipe_raw 2018-10-08 16:16 ` Steven Rostedt @ 2018-10-08 19:18 ` Claudio 2018-11-16 13:23 ` Claudio 0 siblings, 1 reply; 21+ messages in thread From: Claudio @ 2018-10-08 19:18 UTC (permalink / raw) To: Steven Rostedt; +Cc: linux-trace-devel, jens Hi Steven, thank you for your answer, On 10/08/2018 06:16 PM, Steven Rostedt wrote: > On Mon, 8 Oct 2018 18:07:49 +0200 > Claudio <claudio.fontana@gliwa.com> wrote: > >> Hello Steven, >> >> On 07/24/2018 04:25 PM, Steven Rostedt wrote: >>> On Tue, 24 Jul 2018 10:23:16 -0400 >>> Steven Rostedt <rostedt@goodmis.org> wrote: >>> >>>>> >>>>> Would work in the direction of adding a global trace_pipe_raw be considered >>>>> for inclusion? >>>> >>>> The design of the lockless ring buffer requires not to be preempted, >>>> and that the data cannot be written to from more than one location. To >>>> do so, we make a per CPU buffer, and disable preemption when writing. >>>> This means that we have only one writer at a time. It can handle >>>> interrupts and NMIs, because they will finish before they return and >>>> this doesn't break the algorithm. But having writers from multiple CPUs >>>> would require locking or other heaving synchronization operations that >>>> will greatly reduce the speed of writing to the buffers (not to mention >>>> the cache thrashing). >>> >>> And why would you need a single buffer? Note, we are working on making >>> libtracecmd.so that will allow applications to read the buffers and the >>> library will take care of the interleaving of the raw data. This should >>> hopefully be ready in about three months or so. >>> >>> -- Steve >>> >> >> Is this something you will showcase in the linux tracing summit? >> Is there a repo / branch I should be following? > > We are preparing the code in tools/lib/traceevent of the Linux kernel > to turn that into a library. > > At the same time, we are looking at making libtracecmd or perhaps we'll > call it libftrace? to implement all the trace-cmd code as a library as > well. But that's happening in the main trace-cmd repo: > > git://git.kernel.org/pub/scm/utils/trace-cmd/trace-cmd.git > I think that the binary raw event streaming aspect will be part of this last component right? libftrace is way better as a name I think. >> >> The reason why we need to end up with a single stream of events is to be >> able to do "online" task state correlation and timing parameters calculations >> for all task-related events independent of cores. > > Well, all the events are timestamped, and you can pick different clocks > to use, and a simple merge sort gives all the information you need. Yes, we are about to do the merge sorting of the streams ourselves, but if the library does it, even better ;-) > Note, having per cpu buffers makes things much more efficient as you > don't need to do synchronizing with atomics. > > -- Steve > >> >> Currently we have this on QNX, and we are trying to enable it for Linux as well. >> >> Thank you, >> >> Claudio > Thanks, Claudio ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2018-10-08 19:18 ` Claudio @ 2018-11-16 13:23 ` Claudio 0 siblings, 0 replies; 21+ messages in thread From: Claudio @ 2018-11-16 13:23 UTC (permalink / raw) To: Steven Rostedt; +Cc: linux-trace-devel, jens

Hello, just a small update below:

On 10/08/2018 09:18 PM, Claudio wrote:
> Hi Steven,
>
> thank you for your answer,
>
> On 10/08/2018 06:16 PM, Steven Rostedt wrote:
>> On Mon, 8 Oct 2018 18:07:49 +0200
>> Claudio <claudio.fontana@gliwa.com> wrote:
>>
>>> Hello Steven,
>>>
>>> On 07/24/2018 04:25 PM, Steven Rostedt wrote:
>>>> On Tue, 24 Jul 2018 10:23:16 -0400
>>>> Steven Rostedt <rostedt@goodmis.org> wrote:
>>>>
>>>>>>
>>>>>> Would work in the direction of adding a global trace_pipe_raw be considered
>>>>>> for inclusion?
>>>>>
>>>>> The design of the lockless ring buffer requires not to be preempted,
>>>>> and that the data cannot be written to from more than one location. To
>>>>> do so, we make a per CPU buffer, and disable preemption when writing.
>>>>> This means that we have only one writer at a time. It can handle
>>>>> interrupts and NMIs, because they will finish before they return and
>>>>> this doesn't break the algorithm. But having writers from multiple CPUs
>>>>> would require locking or other heaving synchronization operations that
>>>>> will greatly reduce the speed of writing to the buffers (not to mention
>>>>> the cache thrashing).
>>>>
>>>> And why would you need a single buffer? Note, we are working on making
>>>> libtracecmd.so that will allow applications to read the buffers and the
>>>> library will take care of the interleaving of the raw data. This should
>>>> hopefully be ready in about three months or so.
>>>>
>>>> -- Steve
>>>>
>>>
>>> Is this something you will showcase in the linux tracing summit?
>>> Is there a repo / branch I should be following?
>>
>> We are preparing the code in tools/lib/traceevent of the Linux kernel
>> to turn that into a library.
>>
>> At the same time, we are looking at making libtracecmd or perhaps we'll
>> call it libftrace? to implement all the trace-cmd code as a library as
>> well. But that's happening in the main trace-cmd repo:
>>
>> git://git.kernel.org/pub/scm/utils/trace-cmd/trace-cmd.git
>>
>
> I think that the binary raw event streaming aspect will be part of this last component right?
>
> libftrace is way better as a name I think.
>
>
>>>
>>> The reason why we need to end up with a single stream of events is to be
>>> able to do "online" task state correlation and timing parameters calculations
>>> for all task-related events independent of cores.
>>
>> Well, all the events are timestamped, and you can pick different clocks
>> to use, and a simple merge sort gives all the information you need.
>
> Yes, we are about to do the merge sorting of the streams ourselves,
> but if the library does it, even better ;-)

I think that with K streams, instead of a merge sort it would be optimal to use a min-heap for the partial, online sorting of the events coming from the per-CPU streams.

Do you think this could be part of the library, so that it would be able to provide a constant stream of already-sorted events regardless of the CPU, for full sched task state correlation? Or is it your view that this functionality should never be part of the library, and needs to be implemented by every user of the library separately?

For me, it would, I think, be the main factor in deciding whether to use the library or not.
Thank you, Claudio

>
>> Note, having per cpu buffers makes things much more efficient as you
>> don't need to do synchronizing with atomics.
>>
>> -- Steve
>>
>>>
>>> Currently we have this on QNX, and we are trying to enable it for Linux as well.
>>>
>>> Thank you,
>>>
>>> Claudio
>>
>

^ permalink raw reply [flat|nested] 21+ messages in thread
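A sketch of the min-heap idea proposed above, in userspace C: each non-empty per-CPU stream contributes its head event to a binary heap keyed by timestamp, so emitting the next event in global order costs O(log K) rather than the O(K) scan shown earlier. Refilling the streams from the raw pipes is left out; the stream type and sample data are stand-ins:

    #include <stdint.h>
    #include <stdio.h>

    struct event { uint64_t ts; int cpu; };
    struct stream { const struct event *pos, *end; };
    struct heap { struct stream *s[64]; int n; };

    static uint64_t key(struct stream *st) { return st->pos->ts; }

    static void sift_down(struct heap *h, int i)
    {
        for (;;) {
            int l = 2 * i + 1, r = l + 1, m = i;
            if (l < h->n && key(h->s[l]) < key(h->s[m])) m = l;
            if (r < h->n && key(h->s[r]) < key(h->s[m])) m = r;
            if (m == i) return;
            struct stream *t = h->s[i]; h->s[i] = h->s[m]; h->s[m] = t;
            i = m;
        }
    }

    static void heap_push(struct heap *h, struct stream *st)
    {
        int i = h->n++;
        h->s[i] = st;
        while (i && key(h->s[(i - 1) / 2]) > key(h->s[i])) {
            struct stream *t = h->s[i];
            h->s[i] = h->s[(i - 1) / 2]; h->s[(i - 1) / 2] = t;
            i = (i - 1) / 2;
        }
    }

    /* pop the globally earliest event; keep its stream if non-empty */
    static const struct event *heap_next(struct heap *h)
    {
        if (!h->n) return NULL;
        struct stream *st = h->s[0];
        const struct event *e = st->pos++;
        if (st->pos < st->end) {
            sift_down(h, 0);            /* root key changed in place */
        } else {
            h->s[0] = h->s[--h->n];     /* stream drained: drop it */
            if (h->n) sift_down(h, 0);
        }
        return e;
    }

    int main(void)
    {
        static const struct event c0[] = { {100, 0}, {300, 0} };
        static const struct event c1[] = { {50, 1}, {200, 1} };
        struct stream st[] = { { c0, c0 + 2 }, { c1, c1 + 2 } };
        struct heap h = { .n = 0 };

        for (int i = 0; i < 2; i++) heap_push(&h, &st[i]);
        for (const struct event *e; (e = heap_next(&h)); )
            printf("ts=%llu cpu=%d\n", (unsigned long long)e->ts, e->cpu);
        return 0;
    }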
* Re: ftrace global trace_pipe_raw 2018-07-24 14:23 ` Steven Rostedt 2018-07-24 14:25 ` Steven Rostedt @ 2018-12-19 11:32 ` Claudio 2018-12-19 16:37 ` Steven Rostedt 1 sibling, 1 reply; 21+ messages in thread From: Claudio @ 2018-12-19 11:32 UTC (permalink / raw) To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

Hi Steven,

going back to this old theme to clarify a bit what I was trying to achieve:

On 07/24/2018 04:23 PM, Steven Rostedt wrote:
> On Tue, 24 Jul 2018 11:58:18 +0200
> Claudio <claudio.fontana@gliwa.com> wrote:
>
>> Hello Steven,
>>
>> I am doing correlation of linux sched events, following all tasks between cpus,
>> and one thing that would be really convenient would be to have a global
>> trace_pipe_raw, in addition to the per-cpu ones, with already sorted events.

I think that I asked for the wrong thing, since I did not understand how the implementation worked, which led to your response; thank you for the clarification.

>>
>> I would imagine the core functionality is already available, since trace_pipe
>> in the tracing directory already shows all events regardless of CPU, and so
>> it would be a matter of doing the same for trace_pipe_raw.
>
> The difference between trace_pipe and trace_pipe_raw is that trace_pipe
> is post processed, and reads the per CPU buffers and interleaves them
> one event at a time. The trace_pipe_raw just sends you the raw
> unprocessed data directly from the buffers, which are grouped per CPU.

I think that what I am looking for, to improve the performance of our system, is a post-processed stream of binary entry data, already merged from all CPUs and sorted per timestamp, in the same way that it is done for the textual output in __find_next_entry:

	for_each_tracing_cpu(cpu) {

		if (ring_buffer_empty_cpu(buffer, cpu))
			continue;

		ent = peek_next_entry(iter, cpu, &ts, &lost_events);

		/*
		 * Pick the entry with the smallest timestamp:
		 */
		if (ent && (!next || ts < next_ts)) {
			next = ent;
			next_cpu = cpu;
			next_ts = ts;
			next_lost = lost_events;
			next_size = iter->ent_size;
		}
	}

We first tried to use the textual output directly, but this led to unacceptable overheads in parsing the text.

Please correct me if I do not understand, however it seems to me that it would be possible to do the same kind of post-processing, including generating a sorted stream of entries, while just avoiding the text output formatting and outputting the binary data of the entry directly, which would be far more efficient for user-space correlators to consume.

But maybe this is not a general enough requirement to be acceptable for implementing directly in the kernel?

We have the requirement of using the OS tracing events, including scheduling events, to react in software immediately (vs. doing after-the-fact analysis).

Thank you for your comment on this, and I wish you nice holidays.

Ciao,

Claudio

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2018-12-19 11:32 ` Claudio @ 2018-12-19 16:37 ` Steven Rostedt 2019-01-16 8:00 ` Claudio 0 siblings, 1 reply; 21+ messages in thread From: Steven Rostedt @ 2018-12-19 16:37 UTC (permalink / raw) To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel On Wed, 19 Dec 2018 12:32:41 +0100 Claudio <claudio.fontana@gliwa.com> wrote: > >> > >> I would imagine the core functionality is already available, since trace_pipe > >> in the tracing directory already shows all events regardless of CPU, and so > >> it would be a matter of doing the same for trace_pipe_raw. > > > > The difference between trace_pipe and trace_pipe_raw is that trace_pipe > > is post processed, and reads the per CPU buffers and interleaves them > > one event at a time. The trace_pipe_raw just sends you the raw > > unprocessed data directly from the buffers, which are grouped per CPU. > > I think that what I am looking for, to improve the performance of our system, > is a post processed stream of binary entry data, already merged from all CPUs > and sorted per timestamp, in the same way that it is done for textual output > in __find_next_entry: > > for_each_tracing_cpu(cpu) { > > if (ring_buffer_empty_cpu(buffer, cpu)) > continue; > > ent = peek_next_entry(iter, cpu, &ts, &lost_events); > > /* > * Pick the entry with the smallest timestamp: > */ > if (ent && (!next || ts < next_ts)) { > next = ent; > next_cpu = cpu; > next_ts = ts; > next_lost = lost_events; > next_size = iter->ent_size; > } > } > > We first tried to use the textual output directly, but this lead to > unacceptable overheads in parsing the text. > > Please correct me if I do not understand, however it seems to me that it > would be possible do the same kind of post processing including generating > a sorted stream of entries, just avoiding the text output formatting, > and outputting the binary data of the entry directly, which would be way > more efficient to consume directly from user space correlators. > > But maybe this is not a general enough requirement to be acceptable for > implementing directly into the kernel? > > We have the requirement of using the OS tracing events, including > scheduling events, to react from software immediately > (vs doing after-the-fact analysis). Have you looked at using the perf event interface? I believe it uses a single buffer for all events. At least for tracing a single process. -- Steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2018-12-19 16:37 ` Steven Rostedt @ 2019-01-16 8:00 ` Claudio 2019-01-16 13:47 ` Steven Rostedt 0 siblings, 1 reply; 21+ messages in thread From: Claudio @ 2019-01-16 8:00 UTC (permalink / raw) To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel Hi Steven, happy new year, On 12/19/2018 05:37 PM, Steven Rostedt wrote: > On Wed, 19 Dec 2018 12:32:41 +0100 > Claudio <claudio.fontana@gliwa.com> wrote: > >>>> >>>> I would imagine the core functionality is already available, since trace_pipe >>>> in the tracing directory already shows all events regardless of CPU, and so >>>> it would be a matter of doing the same for trace_pipe_raw. >>> >>> The difference between trace_pipe and trace_pipe_raw is that trace_pipe >>> is post processed, and reads the per CPU buffers and interleaves them >>> one event at a time. The trace_pipe_raw just sends you the raw >>> unprocessed data directly from the buffers, which are grouped per CPU. >> >> I think that what I am looking for, to improve the performance of our system, >> is a post processed stream of binary entry data, already merged from all CPUs >> and sorted per timestamp, in the same way that it is done for textual output >> in __find_next_entry: >> >> for_each_tracing_cpu(cpu) { >> >> if (ring_buffer_empty_cpu(buffer, cpu)) >> continue; >> >> ent = peek_next_entry(iter, cpu, &ts, &lost_events); >> >> /* >> * Pick the entry with the smallest timestamp: >> */ >> if (ent && (!next || ts < next_ts)) { >> next = ent; >> next_cpu = cpu; >> next_ts = ts; >> next_lost = lost_events; >> next_size = iter->ent_size; >> } >> } >> >> We first tried to use the textual output directly, but this lead to >> unacceptable overheads in parsing the text. >> >> Please correct me if I do not understand, however it seems to me that it >> would be possible do the same kind of post processing including generating >> a sorted stream of entries, just avoiding the text output formatting, >> and outputting the binary data of the entry directly, which would be way >> more efficient to consume directly from user space correlators. >> >> But maybe this is not a general enough requirement to be acceptable for >> implementing directly into the kernel? >> >> We have the requirement of using the OS tracing events, including >> scheduling events, to react from software immediately >> (vs doing after-the-fact analysis). > > Have you looked at using the perf event interface? I believe it uses a > single buffer for all events. At least for tracing a single process. > > -- Steve > Indeed the perf event interface would be awesome, if only it would support tracing all processes. Unfortunately for my use case, it can only trace one process on any cpus, or all processes on one (1) cpu. I guess for some kind of security concerns.. I'll take a look at how much work it would be to extend the interface for the any process/any cpu use case. Ciao and thank you, Claudio ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: ftrace global trace_pipe_raw 2019-01-16 8:00 ` Claudio @ 2019-01-16 13:47 ` Steven Rostedt 0 siblings, 0 replies; 21+ messages in thread From: Steven Rostedt @ 2019-01-16 13:47 UTC (permalink / raw) To: Claudio; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel

On Wed, 16 Jan 2019 09:00:00 +0100 Claudio <claudio.fontana@gliwa.com> wrote:

> Indeed the perf event interface would be awesome, if only it would support tracing all processes.
>
> Unfortunately for my use case, it can only trace one process on any cpus, or all processes on one (1) cpu.
>
> I guess for some kind of security concerns..

Not security, but performance. Serialized writes are best done on serialized instances (per CPU or a single task). Having all tasks on all CPUs write to a single location is a huge detriment to performance, which is why the buffer you are asking for doesn't exist.

-- Steve

>
> I'll take a look at how much work it would be to extend the interface for the any process/any cpu use case.
>
> Ciao and thank you,
>
> Claudio
>

^ permalink raw reply [flat|nested] 21+ messages in thread
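Steve's performance point is easy to reproduce in userspace: threads bumping one shared counter serialize on a single cache line, while per-thread counters, the analogue of per-CPU buffers, do not. A quick illustrative sketch with pthreads (build with -pthread and time the two phases separately; exact numbers vary by machine):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS 10000000L

    static _Atomic long shared_ctr;

    /* pad each slot to a full cache line to avoid false sharing */
    static struct { _Atomic long v; char pad[56]; } local_ctr[NTHREADS];

    static void *hammer_shared(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++)
            atomic_fetch_add(&shared_ctr, 1);   /* cross-core contention */
        return NULL;
    }

    static void *hammer_local(void *arg)
    {
        _Atomic long *c = &local_ctr[(long)arg].v;
        for (long i = 0; i < ITERS; i++)
            atomic_fetch_add(c, 1);             /* line stays core-local */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        /* phase 1: everyone hammers the single shared counter */
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, hammer_shared, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        /* phase 2: one counter per thread; typically several times
         * faster, since no cache line bounces between cores */
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, hammer_local, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        printf("shared=%ld local0=%ld\n",
               atomic_load(&shared_ctr), atomic_load(&local_ctr[0].v));
        return 0;
    }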
* ftrace global trace_pipe_raw 2018-07-09 15:32 ` Steven Rostedt 2018-07-24 9:58 ` ftrace global trace_pipe_raw Claudio @ 2018-07-24 9:59 ` Claudio 1 sibling, 0 replies; 21+ messages in thread From: Claudio @ 2018-07-24 9:59 UTC (permalink / raw) To: Steven Rostedt; +Cc: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, linux-kernel Hello Steven, I am doing correlation of linux sched events, following all tasks between cpus, and one thing that would be really convenient would be to have a global trace_pipe_raw, in addition to the per-cpu ones, with already sorted events. I would imagine the core functionality is already available, since trace_pipe in the tracing directory already shows all events regardless of CPU, and so it would be a matter of doing the same for trace_pipe_raw. But is there a good reason why trace_pipe_raw is available only per-cpu? Would work in the direction of adding a global trace_pipe_raw be considered for inclusion? Thank you, Claudio ^ permalink raw reply [flat|nested] 21+ messages in thread
Thread overview: 21+ messages
2018-07-06  6:22 ftrace performance (sched events): cyclictest shows 25% more latency Claudio
2018-07-06 21:24 ` Steven Rostedt
2018-07-06 21:39 ` Steven Rostedt
2018-07-06 22:00 ` Steven Rostedt
2018-07-09 10:06 ` Claudio
2018-07-09 14:53 ` Claudio
2018-07-09 15:11 ` Steven Rostedt
2018-07-09 15:32 ` Steven Rostedt
2018-07-24  9:58 ` ftrace global trace_pipe_raw Claudio
2018-07-24 14:23 ` Steven Rostedt
2018-07-24 14:25 ` Steven Rostedt
2018-07-24 15:30 ` Claudio
2018-10-08 16:07 ` Claudio
2018-10-08 16:16 ` Steven Rostedt
2018-10-08 19:18 ` Claudio
2018-11-16 13:23 ` Claudio
2018-12-19 11:32 ` Claudio
2018-12-19 16:37 ` Steven Rostedt
2019-01-16  8:00 ` Claudio
2019-01-16 13:47 ` Steven Rostedt
2018-07-24  9:59 ` Claudio