On Thu, 2018-01-11 at 17:58 -0600, Tom Lendacky wrote:
> > > + * These are the bare retpoline primitives for indirect jmp and call.
> > > + * Do not use these directly; they only exist to make the ALTERNATIVE
> > > + * invocation below less ugly.
> > > + */
> > > +.macro RETPOLINE_JMP reg:req
> > > +     call    .Ldo_rop_\@
> > > +.Lspec_trap_\@:
> > > +     pause

Note that we never use that one on AMD. You just get 'lfence; jmp *reg'
instead, because you promised us that would work... while Intel said it
would work for a month or two and then said "er, oops, no it doesn't in
all cases", so we're half-waiting for you lot to do the same thing :)

You *do* get the RSB-stuffing one, though, which is the same. So...

> Talked with our engineers some more on using pause vs. lfence.  Pause is
> not serializing on AMD, so the pause/jmp loop will use power as it is
> speculated over waiting for return to mispredict to the correct target.
> Can this be changed back to lfence?  It looked like a very small
> difference in cycles/time.

That seems reasonable, although at this stage I'm also tempted to suggest
we do that kind of fine-tuning in a follow-up patch, like the bikeshedding
about numbers vs. readable labels. We really need the IBRS and IBPB
patches to land on top of this as soon as possible.

Paul, the lfence→pause change was only a tiny micro-optimisation on
Intel, wasn't it? Are you happy with changing the implementation of the
RSB-stuffing code to use lfence again (or what about 'hlt')?

It currently looks like this; the capture loop uses 'jmp' to match the
retpoline, instead of 'call' as in your examples:

#define __FILL_RETURN_BUFFER(reg, nr, sp, uniq)   \
        mov     $(nr/2), reg;                     \
.Ldo_call1_ ## uniq:                              \
        call    .Ldo_call2_ ## uniq;              \
.Ltrap1_ ## uniq:                                 \
        pause;                                    \
        jmp     .Ltrap1_ ## uniq;                 \
.Ldo_call2_ ## uniq:                              \
        call    .Ldo_loop_ ## uniq;               \
.Ltrap2_ ## uniq:                                 \
        pause;                                    \
        jmp     .Ltrap2_ ## uniq;                 \
.Ldo_loop_ ## uniq:                               \
        dec     reg;                              \
        jnz     .Ldo_call1_ ## uniq;              \
        add     $(BITS_PER_LONG/8) * nr, sp;
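
For anyone following along, the AMD selection above happens at patch-apply
time via ALTERNATIVE_2 in the indirect-jump macro: retpoline-capable CPUs
get the full RETPOLINE_JMP thunk, while AMD parts get the serializing
'lfence; jmp *reg' sequence instead. This is an illustrative sketch of the
shape of that macro rather than the exact patch text, so the names here
(JMP_NOSPEC, X86_FEATURE_RETPOLINE, X86_FEATURE_RETPOLINE_AMD) are
assumptions that may differ from what finally lands:

.macro JMP_NOSPEC reg:req
#ifdef CONFIG_RETPOLINE
        ANNOTATE_NOSPEC_ALTERNATIVE
        /* Default: plain indirect jmp; patched per CPU feature at boot. */
        ALTERNATIVE_2 __stringify(jmp *\reg),                             \
                __stringify(RETPOLINE_JMP \reg), X86_FEATURE_RETPOLINE,   \
                __stringify(lfence; jmp *\reg), X86_FEATURE_RETPOLINE_AMD
#else
        jmp     *\reg
#endif
.endm

The nice property is that the pause-vs-lfence question for the speculation
traps only affects the RETPOLINE_JMP and RSB-stuffing bodies; the AMD
'lfence; jmp' path is untouched either way.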