On Tue, 23 Jan 2018, Ingo Molnar wrote:
> * David Woodhouse wrote:
>
> > > On SkyLake this would add an overhead of maybe 2-3 cycles per function call and
> > > obviously all this code and data would be very cache hot. Given that the average
> > > number of function calls per system call is around a dozen, this would be _much_
> > > faster than any microcode/MSR based approach.
> >
> > That's kind of neat, except you don't want it at the top of the
> > function; you want it at the bottom.
> >
> > If you could hijack the *return* site, then you could check for
> > underflow and stuff the RSB right there. But in __fentry__ there's not
> > a lot you can do other than complain that something bad is going to
> > happen in the future. You know that a string of 16+ rets is going to
> > happen, but you've got no gadget in *there* to deal with it when it
> > does.
>
> No, it can be done with the existing CALL instrumentation callback that
> CONFIG_DYNAMIC_FTRACE=y provides, by pushing a RET trampoline on the stack from
> the CALL trampoline - see my previous email.
>
> > HJ did have patches to turn 'ret' into a form of retpoline, which I
> > don't think ever even got performance-tested.
>
> Return instrumentation is possible as well, but there are two major drawbacks:
>
>  - GCC support for it is not as widely available and return instrumentation is
>    less tested in Linux kernel contexts
>
>  - a major point of my suggestion is that CONFIG_DYNAMIC_FTRACE=y is already
>    enabled in distros here and today, so the runtime overhead to non-SkyLake CPUs
>    would be literally zero, while still allowing to fix the RSB vulnerability on
>    SkyLake.

I played around with that a bit during the week and it turns out to be less
simple than you thought.

1) Injecting a trampoline return only works for functions which have all
   arguments in registers. For functions with arguments on the stack, like
   all varargs functions, this breaks because the function won't find its
   arguments anymore.
   I have not yet found a way to figure out reliably which functions have
   arguments on the stack. Simply ignoring those functions might be an
   option.

   The workaround is to replace the original return address on the stack
   with the trampoline and store the original return address in a per
   thread stack, which I implemented. But this sucks badly performance
   wise.

2) Doing the whole dance on function entry has a real downside because you
   refill the RSB on every 15th return no matter whether it's required or
   not. That really gives a very prominent performance hit.

An alternative idea is to do the following (not yet implemented):

__fentry__:
        incl    PER_CPU_VAR(call_depth)         /* one more call level entered */
        retq

and use -mfunction-return=thunk-extern, which is available on retpoline
enabled compilers. That's a reasonable requirement because w/o retpoline
the whole SKL magic is pointless anyway.

-mfunction-return=thunk-extern issues

        jmp     __x86_return_thunk

instead of ret. In the thunk we can do the whole shebang of mitigation.
That jump can be identified at build time and it can be patched into a ret
for unaffected CPUs. Ideally we do the patching at build time and only
patch the jump in when SKL is detected or paranoia requests it.

We could actually look into that for tracing as well. The only reason why
we don't do that is to select the ideal nop for the CPU the kernel runs on,
which obviously cannot be known at build time.

__x86_return_thunk would look like this:

__x86_return_thunk:
        testl   $0xf, PER_CPU_VAR(call_depth)   /* low four bits of the depth set? */
        jnz     1f                              /* yes: skip the refill */
        stuff_rsb                               /* no: refill the RSB */
1:
        decl    PER_CPU_VAR(call_depth)         /* one call level left */
        ret

The call_depth variable would be reset on context switch.

Though that has another problem: tail calls. Tail calls will invoke the
__fentry__ call of the tail called function, which makes the call_depth
counter unbalanced. Tail calls can be prevented by using
-fno-optimize-sibling-calls, but that probably sucks as well.
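
To make the tail call imbalance concrete, here is a rough user-space C model
of the counting scheme. It is only a sketch under assumptions: the names
model_fentry, model_return_thunk, depth_after_plain_calls and
depth_after_tail_calls are hypothetical, a plain global stands in for
PER_CPU_VAR(call_depth), and a counter stands in for stuff_rsb. It models the
bookkeeping only, not the RSB itself.

```c
#include <assert.h>

/* Hypothetical model: stands in for PER_CPU_VAR(call_depth). */
static int call_depth;
/* Hypothetical model: counts how often stuff_rsb would have run. */
static int rsb_refills;

/* Models the proposed __fentry__: one increment per function entry. */
static void model_fentry(void)
{
        call_depth++;
}

/* Models the proposed __x86_return_thunk: refill when the low four
 * bits of the depth are clear, then decrement on the way out. */
static void model_return_thunk(void)
{
        if ((call_depth & 0xf) == 0)
                rsb_refills++;
        call_depth--;
}

/* n ordinary calls: each one enters once and returns once. */
int depth_after_plain_calls(int n)
{
        assert(n >= 0);
        call_depth = 0;
        rsb_refills = 0;
        for (int i = 0; i < n; i++) {
                model_fentry();
                model_return_thunk();
        }
        return call_depth;
}

/* n calls whose callee ends in a tail call: the caller's __fentry__ and
 * the tail called function's __fentry__ both run, but only one return
 * goes through the thunk. */
int depth_after_tail_calls(int n)
{
        assert(n >= 0);
        call_depth = 0;
        rsb_refills = 0;
        for (int i = 0; i < n; i++) {
                model_fentry();         /* caller entry */
                model_fentry();         /* tail called function entry */
                model_return_thunk();   /* the single shared return */
        }
        return call_depth;
}
```

In this model depth_after_plain_calls(100) leaves the counter at 0, while
depth_after_tail_calls(100) leaves it at 100: the counter drifts by one per
tail call until something such as the context switch reset clears it.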

Yet another possibility is to avoid the function entry and accounting magic
and use the generic gcc return thunk:

__x86_return_thunk:
        call    L2
L1:
        pause
        lfence
        jmp     L1
L2:
        lea     8(%rsp), %rsp|lea 4(%esp), %esp
        ret

which basically refills the RSB on every return.

That can be inline or extern, but in both cases we should be able to patch
it out.

I have no idea how that affects performance, but it might be worthwhile to
experiment with that.

If nobody beats me to it, I'll play around with that some more after
vacation.

Thanks,

        tglx