On Sat, 2018-01-06 at 18:02 +0100, Borislav Petkov wrote:
> On Sat, Jan 06, 2018 at 08:23:21AM +0000, David Woodhouse wrote:
> > Thanks. From code inspection, I couldn't see that it was smart enough
> > *not* to process a relative jump in the 'altinstr' section which was
> > jumping to a target *within* that same altinstr section, and thus
> > didn't need to be touched at all. Does this work?
> > 
> > alternative("", "xor %%rdi, %%rdi; jmp 2f; 2: jmp startup_64", X86_FEATURE_K8);
> 
> So this is fine because it gets turned into a two-byte jump:
> 
> [    0.816005] apply_alternatives: feat: 3*32+4, old: (ffffffff810273c9, len: 10), repl: (ffffffff824759d2, len: 10), pad: 10
> [    0.820001] ffffffff810273c9: old_insn: 90 90 90 90 90 90 90 90 90 90
> [    0.821247] ffffffff824759d2: rpl_insn: 48 31 ff eb 00 e9 24 a6 b8 fe
> [    0.822455] process_jumps: insn start 0x48, at 0, len: 3
> [    0.823496] process_jumps: insn start 0xeb, at 3, len: 2
> [    0.824002] process_jumps: insn start 0xe9, at 5, len: 5
> [    0.825120] recompute_jump: target RIP: ffffffff81000000, new_displ: 0xfffd8c37
> [    0.826567] recompute_jump: final displ: 0xfffd8c32, JMP 0xffffffff81000000
> [    0.828001] ffffffff810273c9: final_insn: e9 32 8c fd ff e9 24 a6 b8 fe
> 
> i.e., notice the "eb 00" thing.
> 
> Which, when copied into the kernel proper, will simply work as it is
> a small offset which, when referring to other code which gets copied
> *together* with it, should work. I.e., we're not changing the offsets
> during the copy so all good.
> 
> It becomes more tricky when you force a 5-byte jump:
> 
>         alternative("", "xor %%rdi, %%rdi; .byte 0xe9; .long 2f - .altinstr_replacement; 2: jmp startup_64", X86_FEATURE_K8);
> 
> because then you need to know whether the offset is within the
> .altinstr_replacement section itself or it is meant to be an absolute
> offset like jmp startup_64 or within another section.

Right, so it all tends to work out OK purely by virtue of the fact that
oldinstr and altinstr end up far enough apart in the image that they're
5-byte jumps. Which isn't perfect, but we've lived with worse.

I'm relatively pleased that we've managed to eliminate this as a
dependency for inverting the X86_FEATURE_RETPOLINE logic though, by
following Linus' suggestion to just emit the thunk inline instead of
calling the same one as GCC.

The other fun one for alternatives is in entry_64.S, where we really
need the return address of the call instruction to be *precisely* the
.Lentry_SYSCALL_64_after_fastpath_call label, so we have to eschew the
normal NOSPEC_CALL there:

	/*
	 * This call instruction is handled specially in stub_ptregs_64.
	 * It might end up jumping to the slow path.  If it jumps, RAX
	 * and all argument registers are clobbered.
	 */
#ifdef CONFIG_RETPOLINE
	movq	sys_call_table(, %rax, 8), %rax
	call	__x86.indirect_thunk.rax
#else
	call	*sys_call_table(, %rax, 8)
#endif
.Lentry_SYSCALL_64_after_fastpath_call:

Now it's not like an unconditional branch to the out-of-line thunk is
really going to be much of a problem, even in the case where that
out-of-line thunk is alternative'd into a bare 'jmp *%rax'. But it
would be slightly nicer to avoid it. At the moment, though, it's really
hard to ensure that the 'call' instruction that leaves its address on
the stack is right at the end.
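To make that concrete, here's roughly what the obvious alternative-based
version would look like (purely an illustration using the bare ALTERNATIVE
asm macro, not the actual NOSPEC_CALL definition; it assumes the usual
2-byte 'call *%rax' and 5-byte direct call encodings):

	movq	sys_call_table(, %rax, 8), %rax
	/* Sketch only, not the real NOSPEC_CALL macro. */
	ALTERNATIVE "call *%rax", "call __x86.indirect_thunk.rax", X86_FEATURE_RETPOLINE
	/*
	 * 'call *%rax' is 2 bytes and the replacement call is 5, so the
	 * ALTERNATIVE macro pads the shorter side with 3 NOP bytes
	 * *after* the call.  In the non-retpoline case the return
	 * address therefore lands on the padding, 3 bytes short of the
	 * label below; only the patched-in variant returns exactly to
	 * the label, which is what we need for stub_ptregs_64.
	 */
.Lentry_SYSCALL_64_after_fastpath_call: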
Am I missing a trick there, other than manually inserting leading NOPs
(instead of the automatic trailing ones) to ensure that everything
lines up, and making assumptions about how the assembler will encode
instructions (not that it has *much* choice, but it has some)? On the
whole I think it's fine as it is, but if you have a simple fix then
that would be nice.
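For the record, the manual leading-NOP version I mean would be something
along these lines (again only a sketch; the 'nopl (%rax)' is just an
explicit 3-byte NOP, and it bakes in the 2-byte and 5-byte call
encodings so that no trailing padding is ever generated):

	movq	sys_call_table(, %rax, 8), %rax
	/*
	 * Sketch only: pad by hand *before* the short call so both
	 * variants are 5 bytes and the call always ends exactly at
	 * the label below.
	 */
	ALTERNATIVE "nopl (%rax); call *%rax", "call __x86.indirect_thunk.rax", X86_FEATURE_RETPOLINE
.Lentry_SYSCALL_64_after_fastpath_call:

That lines the return address up in both cases, but only by hard-coding
the instruction lengths, which is exactly the assumption I'd rather not
be making.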