Here's the test with the attached config (a Fedora distro config with localmodconfig run against it), along with two patches to implement tracepoints with static calls. The first makes it so that a tracepoint will call a function pointer to a single callback if there's only one callback, or an "iterator" which iterates a list of callbacks (when there is more than one callback associated with a tracepoint). It adds printk()s where it enables and disables the tracepoints, so expect to see a lot of output when you enable the tracepoints. This is to verify that it's assigning the right code.

Here's what I did.

1) I first took the config, turned off CONFIG_RETPOLINE, and built v4.20-rc4 with that. I ran this to see what the effect was without retpolines. I booted that kernel and did the following (which is also what I did for every kernel):

  # trace-cmd start -e all

To get the same effect you could also do:

  # echo 1 > /sys/kernel/debug/tracing/events/enable

  # perf stat -r 10 /work/c/hackbench 50

The output was this:

No RETPOLINES:

 # perf stat -r 10 /work/c/hackbench 50
Time: 1.351
Time: 1.414
Time: 1.319
Time: 1.277
Time: 1.280
Time: 1.305
Time: 1.294
Time: 1.342
Time: 1.319
Time: 1.288

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

      10,727.44 msec task-clock                #     7.397 CPUs utilized            ( +-  0.95% )
        126,300      context-switches          # 11774.138 M/sec                    ( +- 13.80% )
         14,309      cpu-migrations            #  1333.973 M/sec                    ( +-  8.73% )
         44,073      page-faults               #  4108.652 M/sec                    ( +-  0.68% )
 39,484,799,554      cycles                    # 3680914.295 GHz                    ( +-  0.95% )
 28,470,896,143      stalled-cycles-frontend   #    72.11% frontend cycles idle     ( +-  0.95% )
 26,521,427,813      instructions              #      0.67 insn per cycle
                                               #      1.07 stalled cycles per insn  ( +-  0.85% )
  4,931,066,096      branches                  # 459691625.400 M/sec                ( +-  0.87% )
     19,063,801      branch-misses             #      0.39% of all branches         ( +-  2.05% )

         1.4503 +- 0.0148 seconds time elapsed  ( +-  1.02% )

Then I enabled CONFIG_RETPOLINE, built, booted, and ran it again:

baseline RETPOLINES:

 # perf stat -r 10 /work/c/hackbench 50
Time: 1.313
Time: 1.386
Time: 1.335
Time: 1.363
Time: 1.357
Time: 1.369
Time: 1.363
Time: 1.489
Time: 1.357
Time: 1.422

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

      11,162.24 msec task-clock                #     7.383 CPUs utilized            ( +-  1.11% )
        112,882      context-switches          # 10113.153 M/sec                    ( +- 15.86% )
         14,255      cpu-migrations            #  1277.103 M/sec                    ( +-  7.78% )
         43,067      page-faults               #  3858.393 M/sec                    ( +-  1.04% )
 41,076,270,559      cycles                    # 3680042.874 GHz                    ( +-  1.12% )
 29,669,137,584      stalled-cycles-frontend   #    72.23% frontend cycles idle     ( +-  1.21% )
 26,647,656,812      instructions              #      0.65 insn per cycle
                                               #      1.11 stalled cycles per insn  ( +-  0.81% )
  5,069,504,923      branches                  # 454179389.091 M/sec                ( +-  0.83% )
     99,135,413      branch-misses             #      1.96% of all branches         ( +-  0.87% )

         1.5120 +- 0.0133 seconds time elapsed  ( +-  0.88% )

Then I applied the first tracepoint patch, which makes the change to call directly (and to be able to use static calls later), and tested that.

Added direct calls for trace_events:

 # perf stat -r 10 /work/c/hackbench 50
Time: 1.448
Time: 1.386
Time: 1.404
Time: 1.386
Time: 1.344
Time: 1.397
Time: 1.378
Time: 1.351
Time: 1.369
Time: 1.385

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

      11,249.28 msec task-clock                #     7.382 CPUs utilized            ( +-  0.64% )
        112,058      context-switches          #  9961.721 M/sec                    ( +- 11.15% )
         15,535      cpu-migrations            #  1381.033 M/sec                    ( +- 10.34% )
         43,673      page-faults               #  3882.433 M/sec                    ( +-  1.14% )
 41,407,431,000      cycles                    # 3681020.455 GHz                    ( +-  0.63% )
 29,842,394,154      stalled-cycles-frontend   #    72.07% frontend cycles idle     ( +-  0.63% )
 26,669,867,181      instructions              #      0.64 insn per cycle
                                               #      1.12 stalled cycles per insn  ( +-  0.58% )
  5,085,122,641      branches                  # 452055102.392 M/sec                ( +-  0.60% )
    108,935,006      branch-misses             #      2.14% of all branches         ( +-  0.57% )

         1.5239 +- 0.0139 seconds time elapsed  ( +-  0.91% )

Then I added patches 1 and 2, applied the second attached patch, and ran that:

With static calls:

 # perf stat -r 10 /work/c/hackbench 50
Time: 1.407
Time: 1.424
Time: 1.352
Time: 1.355
Time: 1.361
Time: 1.416
Time: 1.453
Time: 1.353
Time: 1.341
Time: 1.439

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

      11,293.08 msec task-clock                #     7.390 CPUs utilized            ( +-  0.93% )
        125,343      context-switches          # 11099.462 M/sec                    ( +- 11.84% )
         15,587      cpu-migrations            #  1380.272 M/sec                    ( +-  8.21% )
         43,871      page-faults               #  3884.890 M/sec                    ( +-  1.06% )
 41,567,508,330      cycles                    # 3680918.499 GHz                    ( +-  0.94% )
 29,851,271,023      stalled-cycles-frontend   #    71.81% frontend cycles idle     ( +-  0.99% )
 26,878,085,513      instructions              #      0.65 insn per cycle
                                               #      1.11 stalled cycles per insn  ( +-  0.72% )
  5,125,816,911      branches                  # 453905346.879 M/sec                ( +-  0.74% )
    107,643,635      branch-misses             #      2.10% of all branches         ( +-  0.71% )

         1.5282 +- 0.0135 seconds time elapsed  ( +-  0.88% )

Then I applied patch 3 and tested that:

With static call trampolines:

 # perf stat -r 10 /work/c/hackbench 50
Time: 1.350
Time: 1.333
Time: 1.369
Time: 1.361
Time: 1.375
Time: 1.352
Time: 1.316
Time: 1.336
Time: 1.339
Time: 1.371

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

      10,964.38 msec task-clock                #     7.392 CPUs utilized            ( +-  0.41% )
         75,986      context-switches          #  6930.527 M/sec                    ( +-  9.23% )
         12,464      cpu-migrations            #  1136.858 M/sec                    ( +-  7.93% )
         44,476      page-faults               #  4056.558 M/sec                    ( +-  1.12% )
 40,354,963,428      cycles                    # 3680712.468 GHz                    ( +-  0.42% )
 29,057,240,222      stalled-cycles-frontend   #    72.00% frontend cycles idle     ( +-  0.46% )
 26,171,883,339      instructions              #      0.65 insn per cycle
                                               #      1.11 stalled cycles per insn  ( +-  0.32% )
  4,978,193,830      branches                  # 454053195.523 M/sec                ( +-  0.33% )
     83,625,127      branch-misses             #      1.68% of all branches         ( +-  0.33% )

        1.48328 +- 0.00515 seconds time elapsed  ( +-  0.35% )

And finally I added patch 4 and tested that:

Full static calls:

 # perf stat -r 10 /work/c/hackbench 50
Time: 1.302
Time: 1.323
Time: 1.356
Time: 1.325
Time: 1.372
Time: 1.373
Time: 1.319
Time: 1.313
Time: 1.362
Time: 1.322

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

      10,865.10 msec task-clock                #     7.373 CPUs utilized            ( +-  0.62% )
         88,718      context-switches          #  8165.823 M/sec                    ( +- 10.11% )
         13,463      cpu-migrations            #  1239.125 M/sec                    ( +-  8.42% )
         44,574      page-faults               #  4102.673 M/sec                    ( +-  0.60% )
 39,991,476,585      cycles                    # 3680897.280 GHz                    ( +-  0.63% )
 28,713,229,777      stalled-cycles-frontend   #    71.80% frontend cycles idle     ( +-  0.68% )
 26,289,703,633      instructions              #      0.66 insn per cycle
                                               #      1.09 stalled cycles per insn  ( +-  0.44% )
  4,983,099,105      branches                  # 458654631.123 M/sec                ( +-  0.45% )
     83,719,799      branch-misses             #      1.68% of all branches         ( +-  0.44% )

        1.47364 +- 0.00706 seconds time elapsed  ( +-  0.48% )

In summary, we had this:

No RETPOLINES:
         1.4503 +- 0.0148 seconds time elapsed  ( +-  1.02% )

baseline RETPOLINES:
         1.5120 +- 0.0133 seconds time elapsed  ( +-  0.88% )

Added direct calls for trace_events:
         1.5239 +- 0.0139 seconds time elapsed  ( +-  0.91% )

With static calls:
         1.5282 +- 0.0135 seconds time elapsed  ( +-  0.88% )

With static call trampolines:
        1.48328 +- 0.00515 seconds time elapsed  ( +-  0.35% )

Full static calls:
        1.47364 +- 0.00706 seconds time elapsed  ( +-  0.48% )

Adding retpolines caused a  1.5120 / 1.4503 = 1.0425  ( 4.25% ) slowdown
Trampolines made it into   1.48328 / 1.4503 = 1.0227  ( 2.27% ) slowdown
With full static calls     1.47364 / 1.4503 = 1.0160  ( 1.60% ) slowdown

Going from 4.25% to 1.6% isn't bad, and I think this is very much worth the effort. I did not expect it to go to 0%, as there are a lot of other places where retpolines cause issues, but this shows that it does help the tracing code.

I originally did the tests with my development config, which has a bunch of debugging options enabled (hackbench usually takes over 9 seconds there, not the 1.5 seen here), and the slowdown was closer to 9% with retpolines. If people want me to redo the tests with that config, I can, or I can send them the config. Or better yet, the code is here; just use your own configs.

-- Steve