On Tue, 22 Oct 2019 18:17:40 -0400 Steven Rostedt wrote: > > your solution is to reduce the overhead. > > my solution is to remove it competely. See the difference? > > You're just trimming it down. I'm curious to what overhead you save by > not saving all parameter registers, and doing a case by case basis? I played with adding a ftrace_ip_caller, that only modifies the ip, and passes a regs with bare minimum saved (allowing the callback to modify the regs->ip). I modified the test-events-sample to just hijack the do_raw_spin_lock() function, which is a very hot path, as it is called by most of the spin_lock code. I then ran: # perf stat -r 20 /work/c/hackbench 50 vanilla, as nothing is happening. I then enabled the test-events-sample with the new ftrace ip modification only. I then had it do the hijack of do_raw_spin_lock() with the current ftrace_regs_caller that kprobes and live kernel patching use. The improvement was quite noticeable. Baseline: (no tracing) # perf stat -r 20 /work/c/hackbench 50 Time: 1.146 Time: 1.120 Time: 1.102 Time: 1.099 Time: 1.136 Time: 1.123 Time: 1.128 Time: 1.115 Time: 1.111 Time: 1.192 Time: 1.160 Time: 1.156 Time: 1.135 Time: 1.144 Time: 1.150 Time: 1.120 Time: 1.121 Time: 1.120 Time: 1.106 Time: 1.127 Performance counter stats for '/work/c/hackbench 50' (20 runs): 9,056.18 msec task-clock # 6.770 CPUs utilized ( +- 0.26% ) 148,461 context-switches # 0.016 M/sec ( +- 2.58% ) 18,397 cpu-migrations # 0.002 M/sec ( +- 3.24% ) 54,029 page-faults # 0.006 M/sec ( +- 0.63% ) 33,227,808,738 cycles # 3.669 GHz ( +- 0.25% ) 23,314,273,335 stalled-cycles-frontend # 70.16% frontend cycles idle ( +- 0.25% ) 23,568,794,330 instructions # 0.71 insn per cycle # 0.99 stalled cycles per insn ( +- 0.26% ) 3,756,521,438 branches # 414.802 M/sec ( +- 0.29% ) 36,949,690 branch-misses # 0.98% of all branches ( +- 0.46% ) 1.3377 +- 0.0213 seconds time elapsed ( +- 1.59% ) Using IPMODIFY without regs: # perf stat -r 20 /work/c/hackbench 50 Time: 1.193 Time: 1.115 Time: 1.145 Time: 1.129 Time: 1.121 Time: 1.132 Time: 1.136 Time: 1.156 Time: 1.133 Time: 1.131 Time: 1.138 Time: 1.150 Time: 1.147 Time: 1.137 Time: 1.178 Time: 1.121 Time: 1.142 Time: 1.158 Time: 1.143 Time: 1.168 Performance counter stats for '/work/c/hackbench 50' (20 runs): 9,231.39 msec task-clock # 6.917 CPUs utilized ( +- 0.39% ) 136,822 context-switches # 0.015 M/sec ( +- 3.06% ) 17,308 cpu-migrations # 0.002 M/sec ( +- 2.61% ) 54,521 page-faults # 0.006 M/sec ( +- 0.57% ) 33,875,937,399 cycles # 3.670 GHz ( +- 0.39% ) 23,575,136,580 stalled-cycles-frontend # 69.59% frontend cycles idle ( +- 0.41% ) 24,246,404,007 instructions # 0.72 insn per cycle # 0.97 stalled cycles per insn ( +- 0.33% ) 3,878,453,510 branches # 420.138 M/sec ( +- 0.36% ) 47,653,911 branch-misses # 1.23% of all branches ( +- 0.43% ) 1.33462 +- 0.00440 seconds time elapsed ( +- 0.33% ) The current ftrace_regs_caller: (old way of doing it) # perf stat -r 20 /work/c/hackbench 50 Time: 1.114 Time: 1.153 Time: 1.236 Time: 1.208 Time: 1.179 Time: 1.183 Time: 1.217 Time: 1.190 Time: 1.157 Time: 1.172 Time: 1.215 Time: 1.165 Time: 1.171 Time: 1.188 Time: 1.176 Time: 1.217 Time: 1.341 Time: 1.238 Time: 1.363 Time: 1.287 Performance counter stats for '/work/c/hackbench 50' (20 runs): 9,522.76 msec task-clock # 6.653 CPUs utilized ( +- 0.36% ) 131,347 context-switches # 0.014 M/sec ( +- 3.37% ) 17,090 cpu-migrations # 0.002 M/sec ( +- 3.56% ) 54,126 page-faults # 0.006 M/sec ( +- 0.71% ) 34,942,273,861 cycles # 3.669 GHz ( +- 0.35% ) 24,517,757,272 stalled-cycles-frontend # 70.17% frontend cycles idle ( +- 0.35% ) 24,684,124,265 instructions # 0.71 insn per cycle # 0.99 stalled cycles per insn ( +- 0.34% ) 3,859,734,617 branches # 405.317 M/sec ( +- 0.38% ) 47,286,857 branch-misses # 1.23% of all branches ( +- 0.70% ) 1.4313 +- 0.0244 seconds time elapsed ( +- 1.70% ) In summary we have: Baseline: (no tracing) 33,227,808,738 cycles # 3.669 GHz ( +- 0.25% ) 1.3377 +- 0.0213 seconds time elapsed ( +- 1.59% ) Just ip modification: 33,875,937,399 cycles # 3.670 GHz ( +- 0.39% ) 1.33462 +- 0.00440 seconds time elapsed ( +- 0.33% ) Full regs as done today: 34,942,273,861 cycles # 3.669 GHz ( +- 0.35% ) 1.4313 +- 0.0244 seconds time elapsed ( +- 1.70% ) Note, the hijack function is done very naively, the function trampoline is called twice. do_raw_spin_lock() call ftrace_ip_caller call my_highjack_func test pip (true) call my_do_spin_lock() call do_raw_spin_lock() call ftrace_ip_caller call my_highjack_func test pip (false) We could work on ways to not do this double call, but I just wanted to get some kind of test case out. This method could at least help with live kernel patching. Attached is the patch I used. To switch back to full regs, just remove the comma and uncomment the SAVE_REGS flag in the trace-events-samples.c file. -- Steve