On Tue, 22 Oct 2019 18:17:40 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > your solution is to reduce the overhead.
> > my solution is to remove it competely. See the difference?  
> 
> You're just trimming it down. I'm curious to what overhead you save by
> not saving all parameter registers, and doing a case by case basis?

I played with adding a ftrace_ip_caller, that only modifies the ip, and
passes a regs with bare minimum saved (allowing the callback to modify
the regs->ip).

I modified the test-events-sample to just hijack the do_raw_spin_lock()
function, which is a very hot path, as it is called by most of the
spin_lock code.

I then ran:

 # perf stat -r 20 /work/c/hackbench 50

vanilla, as nothing is happening.

I then enabled the test-events-sample with the new ftrace ip
modification only.

I then had it do the hijack of do_raw_spin_lock() with the current
ftrace_regs_caller that kprobes and live kernel patching use.

The improvement was quite noticeable.

Baseline: (no tracing)

# perf stat -r 20 /work/c/hackbench 50
Time: 1.146
Time: 1.120
Time: 1.102
Time: 1.099
Time: 1.136
Time: 1.123
Time: 1.128
Time: 1.115
Time: 1.111
Time: 1.192
Time: 1.160
Time: 1.156
Time: 1.135
Time: 1.144
Time: 1.150
Time: 1.120
Time: 1.121
Time: 1.120
Time: 1.106
Time: 1.127

 Performance counter stats for '/work/c/hackbench 50' (20 runs):

          9,056.18 msec task-clock                #    6.770 CPUs utilized            ( +-  0.26% )
           148,461      context-switches          #    0.016 M/sec                    ( +-  2.58% )
            18,397      cpu-migrations            #    0.002 M/sec                    ( +-  3.24% )
            54,029      page-faults               #    0.006 M/sec                    ( +-  0.63% )
    33,227,808,738      cycles                    #    3.669 GHz                      ( +-  0.25% )
    23,314,273,335      stalled-cycles-frontend   #   70.16% frontend cycles idle     ( +-  0.25% )
    23,568,794,330      instructions              #    0.71  insn per cycle         
                                                  #    0.99  stalled cycles per insn  ( +-  0.26% )
     3,756,521,438      branches                  #  414.802 M/sec                    ( +-  0.29% )
        36,949,690      branch-misses             #    0.98% of all branches          ( +-  0.46% )

            1.3377 +- 0.0213 seconds time elapsed  ( +-  1.59% )

Using IPMODIFY without regs:

# perf stat -r 20 /work/c/hackbench 50
Time: 1.193
Time: 1.115
Time: 1.145
Time: 1.129
Time: 1.121
Time: 1.132
Time: 1.136
Time: 1.156
Time: 1.133
Time: 1.131
Time: 1.138
Time: 1.150
Time: 1.147
Time: 1.137
Time: 1.178
Time: 1.121
Time: 1.142
Time: 1.158
Time: 1.143
Time: 1.168

 Performance counter stats for '/work/c/hackbench 50' (20 runs):

          9,231.39 msec task-clock                #    6.917 CPUs utilized            ( +-  0.39% )
           136,822      context-switches          #    0.015 M/sec                    ( +-  3.06% )
            17,308      cpu-migrations            #    0.002 M/sec                    ( +-  2.61% )
            54,521      page-faults               #    0.006 M/sec                    ( +-  0.57% )
    33,875,937,399      cycles                    #    3.670 GHz                      ( +-  0.39% )
    23,575,136,580      stalled-cycles-frontend   #   69.59% frontend cycles idle     ( +-  0.41% )
    24,246,404,007      instructions              #    0.72  insn per cycle         
                                                  #    0.97  stalled cycles per insn  ( +-  0.33% )
     3,878,453,510      branches                  #  420.138 M/sec                    ( +-  0.36% )
        47,653,911      branch-misses             #    1.23% of all branches          ( +-  0.43% )

           1.33462 +- 0.00440 seconds time elapsed  ( +-  0.33% )


The current ftrace_regs_caller: (old way of doing it)

# perf stat -r 20 /work/c/hackbench 50
Time: 1.114
Time: 1.153
Time: 1.236
Time: 1.208
Time: 1.179
Time: 1.183
Time: 1.217
Time: 1.190
Time: 1.157
Time: 1.172
Time: 1.215
Time: 1.165
Time: 1.171
Time: 1.188
Time: 1.176
Time: 1.217
Time: 1.341
Time: 1.238
Time: 1.363
Time: 1.287

 Performance counter stats for '/work/c/hackbench 50' (20 runs):

          9,522.76 msec task-clock                #    6.653 CPUs utilized            ( +-  0.36% )
           131,347      context-switches          #    0.014 M/sec                    ( +-  3.37% )
            17,090      cpu-migrations            #    0.002 M/sec                    ( +-  3.56% )
            54,126      page-faults               #    0.006 M/sec                    ( +-  0.71% )
    34,942,273,861      cycles                    #    3.669 GHz                      ( +-  0.35% )
    24,517,757,272      stalled-cycles-frontend   #   70.17% frontend cycles idle     ( +-  0.35% )
    24,684,124,265      instructions              #    0.71  insn per cycle         
                                                  #    0.99  stalled cycles per insn  ( +-  0.34% )
     3,859,734,617      branches                  #  405.317 M/sec                    ( +-  0.38% )
        47,286,857      branch-misses             #    1.23% of all branches          ( +-  0.70% )

            1.4313 +- 0.0244 seconds time elapsed  ( +-  1.70% )


In summary we have:

Baseline: (no tracing)

    33,227,808,738      cycles                    #    3.669 GHz                      ( +-  0.25% )
            1.3377 +- 0.0213 seconds time elapsed  ( +-  1.59% )

Just ip modification:

    33,875,937,399      cycles                    #    3.670 GHz                      ( +-  0.39% )
           1.33462 +- 0.00440 seconds time elapsed  ( +-  0.33% )

Full regs as done today:

    34,942,273,861      cycles                    #    3.669 GHz                      ( +-  0.35% )
            1.4313 +- 0.0244 seconds time elapsed  ( +-  1.70% )


Note, the hijack function is done very naively, the function trampoline
is called twice.

  do_raw_spin_lock()
    call ftrace_ip_caller
      call my_highjack_func
        test pip (true)
        call my_do_spin_lock()
          call do_raw_spin_lock()
            call ftrace_ip_caller
              call my_highjack_func
                test pip (false)

We could work on ways to not do this double call, but I just wanted to
get some kind of test case out.

This method could at least help with live kernel patching.

Attached is the patch I used. To switch back to full regs, just
remove the comma and uncomment the SAVE_REGS flag in the
trace-events-samples.c file.

-- Steve