Here's the test with the attached config (a Fedora distro config with localmodconfig run against it), along with two patches to implement tracepoints with static calls. The first makes it so that a tracepoint will call a function pointer to a single callback if there's only one callback, or an "iterator" which iterates a list of callbacks (when there is more than one callback associated with a tracepoint). It adds printk()s where it enables and disables the tracepoints, so expect to see a lot of output when you enable the tracepoints. This is to verify that it's assigning the right code.

Here's what I did.

1) I first took the config, turned off CONFIG_RETPOLINE, and built v4.20-rc4 with that. I ran this to see what the effect was without retpolines. I booted that kernel and did the following (which is also what I did for every kernel):

  # trace-cmd start -e all

To get the same effect you could also do:

  # echo 1 > /sys/kernel/debug/tracing/events/enable

  # perf stat -r 10 /work/c/hackbench 50

The output was this:

No RETPOLINES:

 # perf stat -r 10 /work/c/hackbench 50
Time: 1.351
Time: 1.414
Time: 1.319
Time: 1.277
Time: 1.280
Time: 1.305
Time: 1.294
Time: 1.342
Time: 1.319
Time: 1.288

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

      10,727.44 msec task-clock                #     7.397 CPUs utilized            ( +-  0.95% )
        126,300      context-switches          # 11774.138 M/sec                    ( +- 13.80% )
         14,309      cpu-migrations            #  1333.973 M/sec                    ( +-  8.73% )
         44,073      page-faults               #  4108.652 M/sec                    ( +-  0.68% )
 39,484,799,554      cycles                    # 3680914.295 GHz                    ( +-  0.95% )
 28,470,896,143      stalled-cycles-frontend   #    72.11% frontend cycles idle     ( +-  0.95% )
 26,521,427,813      instructions              #      0.67 insn per cycle
                                               #      1.07 stalled cycles per insn  ( +-  0.85% )
  4,931,066,096      branches                  # 459691625.400 M/sec                ( +-  0.87% )
     19,063,801      branch-misses             #      0.39% of all branches         ( +-  2.05% )

         1.4503 +- 0.0148 seconds time elapsed  ( +-  1.02% )

Then I enabled CONFIG_RETPOLINE, built, booted, and ran it again:

baseline RETPOLINES:

 # perf stat -r 10 /work/c/hackbench 50
Time: 1.313
Time: 1.386
Time: 1.335
Time: 1.363
Time: 1.357
Time: 1.369
Time: 1.363
Time: 1.489
Time: 1.357
Time: 1.422

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

      11,162.24 msec task-clock                #     7.383 CPUs utilized            ( +-  1.11% )
        112,882      context-switches          # 10113.153 M/sec                    ( +- 15.86% )
         14,255      cpu-migrations            #  1277.103 M/sec                    ( +-  7.78% )
         43,067      page-faults               #  3858.393 M/sec                    ( +-  1.04% )
 41,076,270,559      cycles                    # 3680042.874 GHz                    ( +-  1.12% )
 29,669,137,584      stalled-cycles-frontend   #    72.23% frontend cycles idle     ( +-  1.21% )
 26,647,656,812      instructions              #      0.65 insn per cycle
                                               #      1.11 stalled cycles per insn  ( +-  0.81% )
  5,069,504,923      branches                  # 454179389.091 M/sec                ( +-  0.83% )
     99,135,413      branch-misses             #      1.96% of all branches         ( +-  0.87% )

         1.5120 +- 0.0133 seconds time elapsed  ( +-  0.88% )

Then I applied the first tracepoint patch, which makes the change to call directly (and to be able to use static calls later), and tested that.

Added direct calls for trace_events:

 # perf stat -r 10 /work/c/hackbench 50
Time: 1.448
Time: 1.386
Time: 1.404
Time: 1.386
Time: 1.344
Time: 1.397
Time: 1.378
Time: 1.351
Time: 1.369
Time: 1.385

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

      11,249.28 msec task-clock                #     7.382 CPUs utilized            ( +-  0.64% )
        112,058      context-switches          #  9961.721 M/sec                    ( +- 11.15% )
         15,535      cpu-migrations            #  1381.033 M/sec                    ( +- 10.34% )
         43,673      page-faults               #  3882.433 M/sec                    ( +-  1.14% )
 41,407,431,000      cycles                    # 3681020.455 GHz                    ( +-  0.63% )
 29,842,394,154      stalled-cycles-frontend   #    72.07% frontend cycles idle     ( +-  0.63% )
 26,669,867,181      instructions              #      0.64 insn per cycle
                                               #      1.12 stalled cycles per insn  ( +-  0.58% )
  5,085,122,641      branches                  # 452055102.392 M/sec                ( +-  0.60% )
    108,935,006      branch-misses             #      2.14% of all branches         ( +-  0.57% )

         1.5239 +- 0.0139 seconds time elapsed  ( +-  0.91% )

Then I added patches 1 and 2, applied the second attached patch, and ran that:

With static calls:

 # perf stat -r 10 /work/c/hackbench 50
Time: 1.407
Time: 1.424
Time: 1.352
Time: 1.355
Time: 1.361
Time: 1.416
Time: 1.453
Time: 1.353
Time: 1.341
Time: 1.439

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

      11,293.08 msec task-clock                #     7.390 CPUs utilized            ( +-  0.93% )
        125,343      context-switches          # 11099.462 M/sec                    ( +- 11.84% )
         15,587      cpu-migrations            #  1380.272 M/sec                    ( +-  8.21% )
         43,871      page-faults               #  3884.890 M/sec                    ( +-  1.06% )
 41,567,508,330      cycles                    # 3680918.499 GHz                    ( +-  0.94% )
 29,851,271,023      stalled-cycles-frontend   #    71.81% frontend cycles idle     ( +-  0.99% )
 26,878,085,513      instructions              #      0.65 insn per cycle
                                               #      1.11 stalled cycles per insn  ( +-  0.72% )
  5,125,816,911      branches                  # 453905346.879 M/sec                ( +-  0.74% )
    107,643,635      branch-misses             #      2.10% of all branches         ( +-  0.71% )

         1.5282 +- 0.0135 seconds time elapsed  ( +-  0.88% )

Then I applied patch 3 and tested that:

With static call trampolines:

 # perf stat -r 10 /work/c/hackbench 50
Time: 1.350
Time: 1.333
Time: 1.369
Time: 1.361
Time: 1.375
Time: 1.352
Time: 1.316
Time: 1.336
Time: 1.339
Time: 1.371

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

      10,964.38 msec task-clock                #     7.392 CPUs utilized            ( +-  0.41% )
         75,986      context-switches          #  6930.527 M/sec                    ( +-  9.23% )
         12,464      cpu-migrations            #  1136.858 M/sec                    ( +-  7.93% )
         44,476      page-faults               #  4056.558 M/sec                    ( +-  1.12% )
 40,354,963,428      cycles                    # 3680712.468 GHz                    ( +-  0.42% )
 29,057,240,222      stalled-cycles-frontend   #    72.00% frontend cycles idle     ( +-  0.46% )
 26,171,883,339      instructions              #      0.65 insn per cycle
                                               #      1.11 stalled cycles per insn  ( +-  0.32% )
  4,978,193,830      branches                  # 454053195.523 M/sec                ( +-  0.33% )
     83,625,127      branch-misses             #      1.68% of all branches         ( +-  0.33% )

        1.48328 +- 0.00515 seconds time elapsed  ( +-  0.35% )

And finally I added patch 4 and tested that:

Full static calls:

 # perf stat -r 10 /work/c/hackbench 50
Time: 1.302
Time: 1.323
Time: 1.356
Time: 1.325
Time: 1.372
Time: 1.373
Time: 1.319
Time: 1.313
Time: 1.362
Time: 1.322

 Performance counter stats for '/work/c/hackbench 50' (10 runs):

      10,865.10 msec task-clock                #     7.373 CPUs utilized            ( +-  0.62% )
         88,718      context-switches          #  8165.823 M/sec                    ( +- 10.11% )
         13,463      cpu-migrations            #  1239.125 M/sec                    ( +-  8.42% )
         44,574      page-faults               #  4102.673 M/sec                    ( +-  0.60% )
 39,991,476,585      cycles                    # 3680897.280 GHz                    ( +-  0.63% )
 28,713,229,777      stalled-cycles-frontend   #    71.80% frontend cycles idle     ( +-  0.68% )
 26,289,703,633      instructions              #      0.66 insn per cycle
                                               #      1.09 stalled cycles per insn  ( +-  0.44% )
  4,983,099,105      branches                  # 458654631.123 M/sec                ( +-  0.45% )
     83,719,799      branch-misses             #      1.68% of all branches         ( +-  0.44% )

        1.47364 +- 0.00706 seconds time elapsed  ( +-  0.48% )

In summary, we had this:

No RETPOLINES:
         1.4503 +- 0.0148 seconds time elapsed  ( +-  1.02% )

baseline RETPOLINES:
         1.5120 +- 0.0133 seconds time elapsed  ( +-  0.88% )

Added direct calls for trace_events:
         1.5239 +- 0.0139 seconds time elapsed  ( +-  0.91% )

With static calls:
         1.5282 +- 0.0135 seconds time elapsed  ( +-  0.88% )

With static call trampolines:
        1.48328 +- 0.00515 seconds time elapsed  ( +-  0.35% )

Full static calls:
        1.47364 +- 0.00706 seconds time elapsed  ( +-  0.48% )

Adding retpolines caused a  1.5120 / 1.4503 = 1.0425  ( 4.25% ) slowdown
Trampolines made it into   1.48328 / 1.4503 = 1.0227  ( 2.27% ) slowdown
With full static calls     1.47364 / 1.4503 = 1.0160  ( 1.60% ) slowdown

Going from 4.25% to 1.6% isn't bad, and I think this is very much worth the effort. I did not expect it to go to 0%, as there are a lot of other places where retpolines cause issues, but this shows that it does help the tracing code.

I originally did the tests with my development config, which has a bunch of debugging options enabled (hackbench usually takes over 9 seconds there, not the 1.5 seen here), and the slowdown was closer to 9% with retpolines. If people want me to redo the tests with that config, I can, or I can send them the config. Or better yet, the code is here; just use your own configs.

-- Steve