On Tue, 23 Jan 2018, Ingo Molnar wrote:
> * David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > > On SkyLake this would add an overhead of maybe 2-3 cycles per function call and 
> > > obviously all this code and data would be very cache hot. Given that the average 
> > > number of function calls per system call is around a dozen, this would be _much_ 
> > > faster than any microcode/MSR based approach.
> > 
> > That's kind of neat, except you don't want it at the top of the
> > function; you want it at the bottom.
> > 
> > If you could hijack the *return* site, then you could check for
> > underflow and stuff the RSB right there. But in __fentry__ there's not
> > a lot you can do other than complain that something bad is going to
> > happen in the future. You know that a string of 16+ rets is going to
> > happen, but you've got no gadget in *there* to deal with it when it
> > does.
> 
> No, it can be done with the existing CALL instrumentation callback that 
> CONFIG_DYNAMIC_FTRACE=y provides, by pushing a RET trampoline on the stack from 
> the CALL trampoline - see my previous email.
> 
> > HJ did have patches to turn 'ret' into a form of retpoline, which I
> > don't think ever even got performance-tested.
> 
> Return instrumentation is possible as well, but there are two major drawbacks:
> 
>  - GCC support for it is not as widely available and return instrumentation is 
>    less tested in Linux kernel contexts
> 
>  - a major point of my suggestion is that CONFIG_DYNAMIC_FTRACE=y is already 
>    enabled in distros here and today, so the runtime overhead to non-SkyLake CPUs 
>    would be literally zero, while still allowing to fix the RSB vulnerability on 
>    SkyLake.

I played around with that a bit during the week and it turns out to be less
simple than you thought.

1) Injecting a trampoline return only works for functions which have all
   arguments in registers. For functions with arguments on stack like all
   varg functions this breaks because the function wont find its arguments
   anymore.

   I have not yet found a way to figure out reliably which functions have
   arguments on stack. That might be an option to simply ignore them.

   The workaround is to replace the original return on stack with the
   trampoline and store the original return in a per thread stack, which I
   implemented. But this sucks performance wise badly.

2) Doing the whole dance on function entry has a real down side because you
   refill RSB on every 15th return no matter whether its required or
   not. That really gives a very prominent performance hit.

An alternative idea is to do the following (not yet implemented):

__fentry__:
	incl	PER_CPU_VAR(call_depth)
	retq

and use -mfunction-return=thunk-extern which is available on retpoline
enabled compilers. That's a reasonable requirement because w/o retpoline
the whole SKL magic is pointless anyway.

-mfunction-return=thunk-extern issues

	jump	__x86_return_thunk

instead of ret. In the thunk we can do the whole shebang of mitigation.
That jump can be identified at build time and it can be patched into a ret
for unaffected CPUs. Ideally we do the patching at build time and only
patch the jump in when SKL is detected or paranoia requests it.

We could actually look into that for tracing as well. The only reason why
we don't do that is to select the ideal nop for the CPU the kernel runs on,
which obviously cannot be known at build time.

__x86_return_thunk would look like this:

__x86_return_thunk:
	testl	$0xf, PER_CPU_VAR(call_depth)
	jnz	1f	
	stuff_rsb
   1:
	decl	PER_CPU_VAR(call_depth)
   	ret

The call_depth variable would be reset on context switch.

Though that has another problem: tail calls. Tail calls will invoke the
__fentry__ call of the tail called function, which makes the call_depth
counter unbalanced. Tail calls can be prevented by using
-fno-optimize-sibling-calls, but that probably sucks as well.

Yet another possibility is to avoid the function entry and accouting magic
and use the generic gcc return thunk:

__x86_return_thunk:
	call L2
L1:
	pause
	lfence
	jmp L1
L2:
	lea 8(%rsp), %rsp|lea 4(%esp), %esp
	ret

which basically refills the RSB on every return. That can be inline or
extern, but in both cases we should be able to patch it out.

I have no idea how that affects performance, but it might be worthwhile to
experiment with that.

If nobody beats me to it, I'll play around with that some more after
vacation.

Thanks,

	tglx