linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/5] x86: dynamic indirect call promotion
@ 2018-10-18  0:54 Nadav Amit
  2018-10-18  0:54 ` [RFC PATCH 1/5] x86: introduce preemption disable prefix Nadav Amit
                   ` (6 more replies)
  0 siblings, 7 replies; 43+ messages in thread
From: Nadav Amit @ 2018-10-18  0:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Peter Zijlstra, H . Peter Anvin ,
	Thomas Gleixner, linux-kernel, Nadav Amit, x86, Borislav Petkov,
	David Woodhouse, Nadav Amit

This RFC introduces runtime promotion of indirect calls, which for the
sake of simplicity (and branding) will be called here "relpolines"
(relative call + trampoline). Relpolines are mainly intended as a way of
reducing the retpoline overheads introduced by the Spectre v2 mitigations.

Unlike indirect call promotion through profile-guided optimization, the
proposed approach does not require a separate profiling stage, works well
with modules whose load addresses are unknown at build time, and can adapt
to changing workloads.

The main idea is simple: for every indirect call, we inject a piece of
code with a fast-path and a slow-path call. The fast path is used if the
target matches the expected (hot) target; the slow path uses a retpoline.
During training, the slow path calls a function that saves the call
source and target in a hash table and keeps a count of the call
frequency. The most common target is then patched into the fast path.
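
As a rough, stand-alone illustration, the training step boils down to
something like the following sketch (names and sizes are illustrative; the
actual implementation uses a per-CPU table and is written in assembly):

	#include <stdint.h>

	#define SAMPLES_NUM	(1 << 8)
	#define SAMPLES_MASK	(SAMPLES_NUM - 1)

	struct sample { uint32_t src, dst, cnt; };
	static struct sample samples[SAMPLES_NUM];

	/* Record one indirect call (source, destination) during training */
	static void record_call(uint32_t src, uint32_t dst)
	{
		uint32_t idx = (src ^ dst) & SAMPLES_MASK; /* collisions are tolerated */

		samples[idx].src = src;
		samples[idx].dst = dst;
		samples[idx].cnt++;
	}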

The patching is done on-the-fly by modifying the conditional branch
(opcode and offset) that follows the comparison of the target with the
hot target. This makes it possible to direct all cores to the fast path
while the slow path is being patched, and vice versa. Patching follows
two additional rules: (1) only a single byte is patched when the code
might be executed by any core; (2) when more than one byte is patched,
no core may run the to-be-patched code, which is ensured by preventing
this code from being preempted and by using synchronize_sched() after
patching the branch that jumps over it.
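
In (simplified) kernel terms, each multi-byte update therefore looks
roughly like the following sketch. text_poke() and synchronize_sched()
are the existing kernel helpers the series relies on, "code" refers to
the relpoline code layout introduced in patch 4, and jmp_opcode/new_rel
are just placeholders:

	/*
	 * 1. Single-byte patch: divert all cores away from the code that is
	 *    about to change (e.g., turn the jnz into an unconditional jmp).
	 */
	text_poke(&code->jnz.opcode, &jmp_opcode, 1);

	/*
	 * 2. Wait until no core can still be running the diverted path.
	 *    Preemption is disabled across the relpoline code, so a
	 *    scheduling point on every core means the old path is unused.
	 */
	synchronize_sched();

	/* 3. The unused path can now be patched with multi-byte writes. */
	text_poke(&code->call.rel, &new_rel, sizeof(new_rel));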

Changing all the indirect calls to use relpolines is done using assembly
macro magic. There are alternative solutions, but this one is
relatively simple and transparent. There is also logic to retrain the
software predictor, but the policy it uses may need to be refined.

In the end, the results are not bad (2-vCPU VM, throughput reported):

		base		relpoline
		----		---------
nginx 		22898 		25178 (+10%)
redis-ycsb	24523		25486 (+4%)
dbench		2144		2103 (-2%)

When retpolines are disabled and retraining is off, the performance
benefit is up to 2% (nginx), but overall the gains are much less
impressive.

There are several open issues: retraining should be done when modules
are removed; CPU hotplug is not supported; x86-32 is probably broken;
and the Makefile does not rebuild when the relpoline code is changed.
Having said that, I am worried that some of the approaches I took would
challenge the new code-of-conduct, so I thought of getting some feedback
before putting more effort into it.

Nadav Amit (5):
  x86: introduce preemption disable prefix
  x86: patch indirect branch promotion
  x86: interface for accessing indirect branch locations
  x86: learning and patching indirect branch targets
  x86: relpoline: disabling interface

 arch/x86/entry/entry_64.S            |  10 +
 arch/x86/include/asm/nospec-branch.h | 158 +++++
 arch/x86/include/asm/sections.h      |   2 +
 arch/x86/kernel/Makefile             |   1 +
 arch/x86/kernel/asm-offsets.c        |   6 +
 arch/x86/kernel/macros.S             |   1 +
 arch/x86/kernel/nospec-branch.c      | 899 +++++++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S        |   7 +
 arch/x86/lib/retpoline.S             |  75 +++
 include/linux/module.h               |   5 +
 kernel/module.c                      |   8 +
 kernel/seccomp.c                     |   2 +
 12 files changed, 1174 insertions(+)
 create mode 100644 arch/x86/kernel/nospec-branch.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18  0:54 [RFC PATCH 0/5] x86: dynamic indirect call promotion Nadav Amit
@ 2018-10-18  0:54 ` Nadav Amit
  2018-10-18  1:22   ` Andy Lutomirski
  2018-10-18  0:54 ` [RFC PATCH 2/5] x86: patch indirect branch promotion Nadav Amit
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 43+ messages in thread
From: Nadav Amit @ 2018-10-18  0:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Peter Zijlstra, H . Peter Anvin ,
	Thomas Gleixner, linux-kernel, Nadav Amit, x86, Borislav Petkov,
	David Woodhouse, Nadav Amit

It is sometimes beneficial to prevent preemption for a very small number
of instructions, or to prevent preemption for some instructions that
precede a branch (the latter case is introduced in the next patches).

To provide such functionality on x86-64, we use an empty REX-prefix
(opcode 0x40) as an indication that preemption is disabled for the
following instruction.

It is expected that this opcode is not in common use.
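
For illustration, an annotated call site in assembly would look as
follows (the call target is just an example):

	preempt_disable_prefix			# emits 0x40 under CONFIG_PREEMPT
	call	some_target			# not preempted while the saved RIP
						# points at this instruction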

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/entry/entry_64.S            | 10 ++++++++++
 arch/x86/include/asm/nospec-branch.h | 12 ++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cb8a5893fd33..31d59aad496e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -643,6 +643,16 @@ retint_kernel:
 	jnc	1f
 0:	cmpl	$0, PER_CPU_VAR(__preempt_count)
 	jnz	1f
+
+	/*
+	 * Allow to use hint to prevent preemption on a certain instruction.
+	 * Consider an instruction with the first byte having REX prefix
+	 * without any bits set as an indication for preemption disabled.
+	 */
+	movq    RIP(%rsp), %rax
+	cmpb    $PREEMPT_DISABLE_PREFIX, (%rax)
+	jz	1f
+
 	call	preempt_schedule_irq
 	jmp	0b
 1:
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 80dc14422495..0267611eb247 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -52,6 +52,12 @@
 	jnz	771b;				\
 	add	$(BITS_PER_LONG/8) * nr, sp;
 
+/*
+ * An empty REX-prefix is an indication that preemption should not take place on
+ * this instruction.
+ */
+#define PREEMPT_DISABLE_PREFIX                 (0x40)
+
 #ifdef __ASSEMBLY__
 
 /*
@@ -148,6 +154,12 @@
 #endif
 .endm
 
+.macro preempt_disable_prefix
+#ifdef CONFIG_PREEMPT
+	.byte	PREEMPT_DISABLE_PREFIX
+#endif
+.endm
+
 #else /* __ASSEMBLY__ */
 
 #define ANNOTATE_NOSPEC_ALTERNATIVE				\
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 2/5] x86: patch indirect branch promotion
  2018-10-18  0:54 [RFC PATCH 0/5] x86: dynamic indirect call promotion Nadav Amit
  2018-10-18  0:54 ` [RFC PATCH 1/5] x86: introduce preemption disable prefix Nadav Amit
@ 2018-10-18  0:54 ` Nadav Amit
  2018-10-18  0:54 ` [RFC PATCH 3/5] x86: interface for accessing indirect branch locations Nadav Amit
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: Nadav Amit @ 2018-10-18  0:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Peter Zijlstra, H . Peter Anvin ,
	Thomas Gleixner, linux-kernel, Nadav Amit, x86, Borislav Petkov,
	David Woodhouse, Nadav Amit

To perform indirect branch promotion, we need to find the locations of
all indirect branches. Retpolines make it relatively easy to find these
branches, by looking at the assembly for calls to the indirect thunks.

An assembly macro named CALL is used to catch all assembly calls, find
those that use indirect thunks, and replace them with the code that is
needed for indirect branch promotion.
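
For reference, the code that the macro emits for, e.g., %rax is roughly
equivalent to the following sequence (the bytes are hand-coded in the
actual macro, and the immediate and fast-path target are patched at
runtime):

	cmp	$0, %rax			# imm32 later holds the hot target
	jnz	4f				# flipped to jmp while patching
	call	__x86_indirect_thunk_rax	# fast path; later patched into a
						# direct call to the hot target
	jmp	5f
4:	call	save_relpoline_rax		# slow path: training/retpoline
5: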

The build system is slightly broken with this patch: changes to
nospec-branch.h should trigger a full kernel rebuild, but currently they
do not.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/nospec-branch.h | 119 +++++++++++++++++++++++++++
 arch/x86/kernel/Makefile             |   1 +
 arch/x86/kernel/asm-offsets.c        |   6 ++
 arch/x86/kernel/macros.S             |   1 +
 arch/x86/kernel/nospec-branch.c      |   5 ++
 arch/x86/kernel/vmlinux.lds.S        |   7 ++
 arch/x86/lib/retpoline.S             |  75 +++++++++++++++++
 7 files changed, 214 insertions(+)
 create mode 100644 arch/x86/kernel/nospec-branch.c

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 0267611eb247..bd2d3a41e88c 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -7,6 +7,27 @@
 #include <asm/alternative-asm.h>
 #include <asm/cpufeatures.h>
 #include <asm/msr-index.h>
+#include <asm/percpu.h>
+
+/*
+ * Defining registers with the architectural order
+ */
+#define ARCH_RAX	0
+#define	ARCH_RCX	1
+#define ARCH_RDX	2
+#define ARCH_RBX	3
+#define ARCH_RSP	4
+#define ARCH_RBP	5
+#define ARCH_RSI	6
+#define ARCH_RDI	7
+#define ARCH_R8		8
+#define ARCH_R9		9
+#define ARCH_R10	10
+#define ARCH_R11	11
+#define ARCH_R12	12
+#define ARCH_R13	13
+#define ARCH_R14	14
+#define ARCH_R15	15
 
 /*
  * Fill the CPU return stack buffer.
@@ -28,6 +49,9 @@
 #define RSB_CLEAR_LOOPS		32	/* To forcibly overwrite all entries */
 #define RSB_FILL_LOOPS		16	/* To avoid underflow */
 
+#define RELPOLINE_SAMPLES_NUM		(1 << 8)
+#define RELPOLINE_SAMPLES_MASK		(RELPOLINE_SAMPLES_NUM - 1)
+
 /*
  * Google experimented with loop-unrolling and this turned out to be
  * the optimal version — two calls, each with their own speculation
@@ -160,6 +184,81 @@
 #endif
 .endm
 
+/*
+ * This macro performs the actual relpoline work. The machine-code is hand
+ * coded to avoid assembler optimizations. This code is heavily patched in
+ * runtime to make it do what it should.
+ */
+.macro relpoline_call reg:req
+	# cmp instruction
+	get_reg_num reg=\reg
+.if reg_num == ARCH_RAX
+	.byte 0x48
+	.byte 0x3d
+.else
+.if reg_num >= ARCH_R8
+	.byte 0x49
+.else
+	.byte 0x48
+.endif
+	.byte 0x81
+	.byte 0xf8 | (reg_num & 7)		# modrm
+.endif
+1:
+	.long 0
+
+	.section .relpolines,"a"
+	_ASM_PTR	1b
+	.byte		reg_num
+	.previous
+
+	# cachepoling-using code
+
+	# jnz 4f, patched to jmp while the target is changed
+	preempt_disable_prefix
+	.byte	0x75, 4f - 2f
+2:
+	# call retpoline
+	preempt_disable_prefix
+	.byte 0xe8
+	.long __x86_indirect_thunk_\reg - 3f
+3:
+	# jmp 5f
+	.byte 0xeb, 5f - 4f
+4:
+	# retpoline space
+	ANNOTATE_NOSPEC_ALTERNATIVE
+	preempt_disable_prefix
+	.byte 0xe8
+	.long save_relpoline_\reg - 5f
+5:
+.endm
+
+#define ARCH_REG_NAMES rax,rcx,rdx,rbx,rsp,rbp,rsi,rdi,r8,r9,r10,r11,r12,r13,r14,r15
+
+.macro get_reg_num reg:req
+	i = 0
+.irp reg_it,ARCH_REG_NAMES
+	.ifc "\reg", "\reg_it"
+		reg_num=i
+	.endif
+	i = i+1
+.endr
+.endm
+
+.macro call v:vararg
+	retpoline = 0
+.irp reg_it,ARCH_REG_NAMES
+.ifc "\v", "__x86_indirect_thunk_\reg_it"
+	relpoline_call reg=\reg_it
+	retpoline = 1
+.endif
+.endr
+.if retpoline == 0
+	{disp8} call \v
+.endif
+.endm
+
 #else /* __ASSEMBLY__ */
 
 #define ANNOTATE_NOSPEC_ALTERNATIVE				\
@@ -288,6 +387,26 @@ static inline void indirect_branch_prediction_barrier(void)
 	alternative_msr_write(MSR_IA32_PRED_CMD, val, X86_FEATURE_USE_IBPB);
 }
 
+/* Data structure that is used during the learning stage */
+struct relpoline_sample {
+	u32 src;
+	u32 dst;
+	u32 cnt;
+	u32 padding;
+} __packed;
+
+DECLARE_PER_CPU_ALIGNED(struct relpoline_sample[RELPOLINE_SAMPLES_NUM],
+		       relpoline_samples);
+
+/*
+ * Information for relpolines as it is saved in the source.
+ */
+struct relpoline_entry {
+	void *rip;
+	u8 reg;
+} __packed;
+
+
 /* The Intel SPEC CTRL MSR base value cache */
 extern u64 x86_spec_ctrl_base;
 
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8824d01c0c35..8a50d304093a 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -138,6 +138,7 @@ obj-$(CONFIG_X86_INTEL_UMIP)		+= umip.o
 obj-$(CONFIG_UNWINDER_ORC)		+= unwind_orc.o
 obj-$(CONFIG_UNWINDER_FRAME_POINTER)	+= unwind_frame.o
 obj-$(CONFIG_UNWINDER_GUESS)		+= unwind_guess.o
+obj-$(CONFIG_RETPOLINE)			+= nospec-branch.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 72adf6c335dc..2db2628c79cd 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -18,6 +18,7 @@
 #include <asm/bootparam.h>
 #include <asm/suspend.h>
 #include <asm/tlbflush.h>
+#include <asm/nospec-branch.h>
 
 #ifdef CONFIG_XEN
 #include <xen/interface/xen.h>
@@ -104,4 +105,9 @@ void common(void) {
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
 	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
+
+	/* Relpolines */
+	OFFSET(RELPOLINE_SAMPLE_src, relpoline_sample, src);
+	OFFSET(RELPOLINE_SAMPLE_dst, relpoline_sample, dst);
+	OFFSET(RELPOLINE_SAMPLE_cnt, relpoline_sample, cnt);
 }
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index 161c95059044..3d79f3d62d20 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -14,3 +14,4 @@
 #include <asm/asm.h>
 #include <asm/cpufeature.h>
 #include <asm/jump_label.h>
+#include <asm/nospec-branch.h>
diff --git a/arch/x86/kernel/nospec-branch.c b/arch/x86/kernel/nospec-branch.c
new file mode 100644
index 000000000000..b3027761442b
--- /dev/null
+++ b/arch/x86/kernel/nospec-branch.c
@@ -0,0 +1,5 @@
+#include <linux/percpu.h>
+#include <asm/nospec-branch.h>
+
+DEFINE_PER_CPU_ALIGNED(struct relpoline_sample[RELPOLINE_SAMPLES_NUM],
+		       relpoline_samples);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 0d618ee634ac..c62735d06d58 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -355,6 +355,13 @@ SECTIONS
 	.data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) {
 		NOSAVE_DATA
 	}
+
+	. = ALIGN(8);
+	.relpolines : AT(ADDR(.relpolines) - LOAD_OFFSET) {
+		__relpolines = .;
+		*(.relpolines)
+		__relpolines_end = .;
+	}
 #endif
 
 	/* BSS */
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index c909961e678a..f30521c180db 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -7,6 +7,8 @@
 #include <asm/alternative-asm.h>
 #include <asm/export.h>
 #include <asm/nospec-branch.h>
+#include <asm/asm-offsets.h>
+#include <asm/frame.h>
 
 .macro THUNK reg
 	.section .text.__x86.indirect_thunk
@@ -45,4 +47,77 @@ GENERATE_THUNK(r12)
 GENERATE_THUNK(r13)
 GENERATE_THUNK(r14)
 GENERATE_THUNK(r15)
+
+.macro save_relpoline reg:req
+ENTRY(save_relpoline_\reg\())
+	pushq 	%rdi
+	pushq	%rsi
+	pushq	%rcx
+
+	/* First load the destination, for the case rsi is the destination */
+.if "\reg" != "rdi"
+	mov	%\reg, %rdi
+.endif
+	mov	24(%rsp), %rsi
+
+	/* Compute the xor as an index in the table */
+	mov	%rsi, %rcx
+	xor	%rdi, %rcx
+	and	$RELPOLINE_SAMPLES_MASK, %ecx
+
+	/* Entry size is 16 bytes */
+	shl	$4, %ecx
+
+	movl	%esi, PER_CPU_VAR(relpoline_samples + RELPOLINE_SAMPLE_src)(%ecx)
+	movl	%edi, PER_CPU_VAR(relpoline_samples + RELPOLINE_SAMPLE_dst)(%ecx)
+	incl	PER_CPU_VAR(relpoline_samples + RELPOLINE_SAMPLE_cnt)(%ecx)
+
+#ifdef CACHEPOLINE_DEBUG
+	incl 	PER_CPU_VAR(relpoline_misses)
+#endif
+	popq	%rcx
+	popq	%rsi
+	popq	%rdi
+	ANNOTATE_NOSPEC_ALTERNATIVE
+	ALTERNATIVE __stringify(ANNOTATE_RETPOLINE_SAFE; jmp *%\reg\()),\
+		"jmp __x86_indirect_thunk_\reg",			\
+		X86_FEATURE_RETPOLINE
+
+ENDPROC(save_relpoline_\reg\())
+_ASM_NOKPROBE(save_relpoline_\reg\())
+EXPORT_SYMBOL(save_relpoline_\reg\())
+.endm
+
+.irp reg,ARCH_REG_NAMES
+.if \reg != "rsp"
+save_relpoline reg=\reg
+.endif
+.endr
+
+/*
+ * List of indirect thunks
+ */
+.pushsection .rodata
+.global indirect_thunks
+indirect_thunks:
+.irp reg,ARCH_REG_NAMES
+.if \reg != "rsp"
+.quad __x86_indirect_thunk_\reg
+.else
+.quad 0
+.endif
+.endr
+
+.global save_relpoline_funcs
+save_relpoline_funcs:
+.irp reg,ARCH_REG_NAMES
+.if \reg != "rsp"
+.quad save_relpoline_\reg
+.else
+.quad 0
+.endif
+.endr
+.popsection
+
+
 #endif
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 3/5] x86: interface for accessing indirect branch locations
  2018-10-18  0:54 [RFC PATCH 0/5] x86: dynamic indirect call promotion Nadav Amit
  2018-10-18  0:54 ` [RFC PATCH 1/5] x86: introduce preemption disable prefix Nadav Amit
  2018-10-18  0:54 ` [RFC PATCH 2/5] x86: patch indirect branch promotion Nadav Amit
@ 2018-10-18  0:54 ` Nadav Amit
  2018-10-18  0:54 ` [RFC PATCH 4/5] x86: learning and patching indirect branch targets Nadav Amit
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: Nadav Amit @ 2018-10-18  0:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Peter Zijlstra, H . Peter Anvin ,
	Thomas Gleixner, linux-kernel, Nadav Amit, x86, Borislav Petkov,
	David Woodhouse, Nadav Amit

Add a C interface for accessing the locations of indirect branches, to
be used for dynamic patching.
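
A minimal sketch of how the new section can be consumed (the consumer
function here is hypothetical; the real user is added in the next patch):

	const struct relpoline_entry *e;

	for (e = __relpolines; e < __relpolines_end; e++)
		register_relpoline(e->rip, e->reg);	/* hypothetical consumer */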

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/nospec-branch.h | 1 -
 arch/x86/include/asm/sections.h      | 2 ++
 include/linux/module.h               | 5 +++++
 kernel/module.c                      | 8 ++++++++
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index bd2d3a41e88c..a4496c141143 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -406,7 +406,6 @@ struct relpoline_entry {
 	u8 reg;
 } __packed;
 
-
 /* The Intel SPEC CTRL MSR base value cache */
 extern u64 x86_spec_ctrl_base;
 
diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 8ea1cfdbeabc..e8dd9c40ecd5 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -4,6 +4,7 @@
 
 #include <asm-generic/sections.h>
 #include <asm/extable.h>
+#include <asm/nospec-branch.h>
 
 extern char __brk_base[], __brk_limit[];
 extern struct exception_table_entry __stop___ex_table[];
@@ -11,6 +12,7 @@ extern char __end_rodata_aligned[];
 
 #if defined(CONFIG_X86_64)
 extern char __end_rodata_hpage_align[];
+extern const struct relpoline_entry __relpolines[], __relpolines_end[];
 #endif
 
 #endif	/* _ASM_X86_SECTIONS_H */
diff --git a/include/linux/module.h b/include/linux/module.h
index f807f15bebbe..29ff390f8339 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -39,6 +39,7 @@ struct modversion_info {
 
 struct module;
 struct exception_table_entry;
+struct relpoline_entry;
 
 struct module_kobject {
 	struct kobject kobj;
@@ -387,6 +388,10 @@ struct module {
 	unsigned int num_exentries;
 	struct exception_table_entry *extable;
 
+	/* Cachepolines */
+	unsigned int num_relpolines;
+	struct relpoline_entry *relpolines;
+
 	/* Startup function. */
 	int (*init)(void);
 
diff --git a/kernel/module.c b/kernel/module.c
index 49a405891587..e34fc28875bd 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -64,6 +64,7 @@
 #include <linux/bsearch.h>
 #include <linux/dynamic_debug.h>
 #include <linux/audit.h>
+#include <asm/nospec-branch.h>
 #include <uapi/linux/module.h>
 #include "module-internal.h"
 
@@ -256,6 +257,7 @@ static void mod_update_bounds(struct module *mod)
 #ifdef CONFIG_KGDB_KDB
 struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
 #endif /* CONFIG_KGDB_KDB */
+struct list_head *cachepoline_modules = &modules;
 
 static void module_assert_mutex(void)
 {
@@ -3125,6 +3127,12 @@ static int find_module_sections(struct module *mod, struct load_info *info)
 	mod->extable = section_objs(info, "__ex_table",
 				    sizeof(*mod->extable), &mod->num_exentries);
 
+#ifdef CONFIG_X86_64
+	mod->relpolines = section_objs(info, "__relpolines",
+					 sizeof(*mod->relpolines),
+					 &mod->num_relpolines);
+#endif
+
 	if (section_addr(info, "__obsparm"))
 		pr_warn("%s: Ignoring obsolete parameters\n", mod->name);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 4/5] x86: learning and patching indirect branch targets
  2018-10-18  0:54 [RFC PATCH 0/5] x86: dynamic indirect call promotion Nadav Amit
                   ` (2 preceding siblings ...)
  2018-10-18  0:54 ` [RFC PATCH 3/5] x86: interface for accessing indirect branch locations Nadav Amit
@ 2018-10-18  0:54 ` Nadav Amit
  2018-10-18  0:54 ` [RFC PATCH 5/5] x86: relpoline: disabling interface Nadav Amit
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: Nadav Amit @ 2018-10-18  0:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Peter Zijlstra, H . Peter Anvin ,
	Thomas Gleixner, linux-kernel, Nadav Amit, x86, Borislav Petkov,
	David Woodhouse, Nadav Amit

During runtime, we collect the targets of indirect branches and patch
them in. Patching is done asynchronously, by modifying each of the
relpoline code paths separately while diverting code execution to the
other path during patching. Preemption is disabled while the relpoline
code runs, and we wait for a scheduling point on each core
(synchronize_sched()) to ensure no core is executing the code being
patched.

To make use of relpolines, a worker goes over the recorded indirect call
targets and sorts them by frequency. The target that was encountered the
most times is patched in.
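
Conceptually, the selection step is as simple as the following
stand-alone sketch (the in-kernel code instead merges and sorts the
per-CPU samples):

	#include <stddef.h>
	#include <stdint.h>

	struct sample { uint32_t src, dst, cnt; };

	/* Return the destination with the highest hit count for one call site */
	static uint32_t pick_hot_target(const struct sample *s, size_t n)
	{
		uint32_t best_dst = 0, best_cnt = 0;
		size_t i;

		for (i = 0; i < n; i++) {
			if (s[i].cnt > best_cnt) {
				best_cnt = s[i].cnt;
				best_dst = s[i].dst;
			}
		}

		return best_dst;	/* low 32 bits, patched into the cmp immediate */
	}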

Periodically, the indirect branches are set back into learning mode to
see whether the targets have changed. The current policy might be too
aggressive.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/nospec-branch.h |   3 +
 arch/x86/kernel/nospec-branch.c      | 894 +++++++++++++++++++++++++++
 2 files changed, 897 insertions(+)

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index a4496c141143..360caad7a890 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -406,6 +406,9 @@ struct relpoline_entry {
 	u8 reg;
 } __packed;
 
+extern const void *indirect_thunks[16];
+extern const void *save_relpoline_funcs[16];
+
 /* The Intel SPEC CTRL MSR base value cache */
 extern u64 x86_spec_ctrl_base;
 
diff --git a/arch/x86/kernel/nospec-branch.c b/arch/x86/kernel/nospec-branch.c
index b3027761442b..2a2cfc2db9d8 100644
--- a/arch/x86/kernel/nospec-branch.c
+++ b/arch/x86/kernel/nospec-branch.c
@@ -1,5 +1,899 @@
 #include <linux/percpu.h>
+#include <linux/cpumask.h>
+#include <linux/sort.h>
+#include <linux/workqueue.h>
+#include <linux/mutex.h>
+#include <linux/memory.h>
+#include <linux/cpu.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/cpumask.h>
+#include <linux/mm.h>
+#include <linux/debugfs.h>
+#include <linux/jump_label.h>
+#include <linux/rhashtable.h>
 #include <asm/nospec-branch.h>
+#include <asm/text-patching.h>
+#include <asm/asm-offsets.h>
+#include <asm/sections.h>
+#include <asm/mmu_context.h>
+
+#define REX_B			(0x41)
+#define CMP_REG_IMM32_OPCODE	(0x81)
+#define JNZ_REL8_OPCODE		(0x75)
+#define JMP_REL8_OPCODE		(0xeb)
+#define CALL_REL32_OPCODE	(0xe8)
+#define CALL_IND_INS		"\xff\xd0"
+
+#ifdef CONFIG_PREEMPT
+#define NO_PREEMPT_PREFIX	u8 no_preempt_prefix
+#else
+#define NO_PREEMPT_PREFIX
+#endif
+
+#define RP_READJUST_SECONDS	(1)
+#define RP_REENABLE_IN_EPOCH	(4)
+#define RP_TARGETS		(1)
+#define RP_MAX_DECISIONS	(64)
+#define RP_SAMPLE_MSECS		(60000)
+
+enum code_state {
+	RELPOLINE_SLOWPATH,
+	RELPOLINE_FASTPATH,
+	RELPOLINE_COND
+};
+
+/*
+ * Reflects the structure of the assembly code. We exclude the compare
+ * opcode which depends on the register.
+ */
+struct relpoline_code {
+	u32 cmp_imm;
+	struct {
+		NO_PREEMPT_PREFIX;
+		u8 opcode;
+		int8_t rel;
+	} __packed jnz;
+	struct {
+		NO_PREEMPT_PREFIX;
+		u8 opcode;
+		int32_t rel;
+	} __packed call;
+	struct {
+		u8 opcode;
+		u8 rel;
+	} __packed jmp_done;
+	struct {
+		NO_PREEMPT_PREFIX;
+		u8 opcode;
+		int32_t rel;
+	} __packed fallback;
+} __packed;
+
+/*
+ * Information for relpolines as it dynamically changes during execution.
+ */
+struct relpoline {
+	struct relpoline_code *code;
+	struct list_head node;
+	struct rhash_head rhash;
+	/*
+	 * The following information can be obtained by disassembly, but saving
+	 * it here does not consume too much memory and makes the code simpler.
+	 */
+	u8 reg : 4;
+	u8 state : 4; /* enum relpoline_state */
+	u8 n_dsts : 7;
+	u8 overflow : 1;
+};
+
+struct relpoline_list {
+	unsigned int num;
+	struct list_head list;
+};
+
+static struct kmem_cache *relpoline_info_cache;
+
+static const struct rhashtable_params relpoline_rht_params = {
+	.automatic_shrinking = true,
+	.key_len = sizeof(uintptr_t),
+	.key_offset = offsetof(struct relpoline, code),
+	.head_offset = offsetof(struct relpoline, rhash),
+};
+
+struct relpoline_transition {
+	u32 dsts[RP_TARGETS];
+	struct relpoline *rp;
+	u8 n_dsts: 7;
+	u8 overflow: 1;
+	u8 prev_n_dsts : 7;
+	u8 reset : 1;
+};
 
 DEFINE_PER_CPU_ALIGNED(struct relpoline_sample[RELPOLINE_SAMPLES_NUM],
 		       relpoline_samples);
+
+enum relpoline_state {
+	RP_STATE_LEARNING,
+	RP_STATE_STABLE,
+	RP_STATE_UNSTABLE,
+	RP_STATE_LAST = RP_STATE_UNSTABLE
+};
+
+#define RP_STATE_NUM		(RP_STATE_LAST + 1)
+
+static struct relpoline_list relpoline_list[RP_STATE_NUM];
+
+static struct rhashtable relpolines_rhead;
+
+/*
+ * List of relpolines that are not learning. We do not want misbehaving
+ * relpolines to wreak havoc and prevent us from figuring out the targets of
+ * relpolines that do have a stable target.
+ */
+static struct mutex relpoline_mutex;
+static struct relpoline_sample *relpoline_samples_copy;
+static struct relpoline_transition cp_changes[RP_MAX_DECISIONS];
+
+struct relpolines_stat {
+	u32 n_dsts[RP_TARGETS];
+	u32 n_overflow;
+};
+
+#ifdef CONFIG_DEBUG_FS
+	/* debugfs file exporting statistics */
+struct dentry *relpoline_dbg_entry;
+#endif
+
+static struct relpolines_stat	*relpolines_stat;
+static ktime_t			relpolines_sample_time;
+static unsigned int relpolines_resampling_count;
+
+static inline void *kernel_ptr(u32 low_addr)
+{
+	return (void *)(low_addr | 0xffffffff00000000ul);
+}
+
+static inline
+int32_t inline_fastpath_rel(struct relpoline_code *code,
+			    const void *dst)
+{
+	return (unsigned long)dst - (unsigned long)code -
+		offsetofend(struct relpoline_code, call);
+}
+
+static inline
+void set_inline_fastpath_rel(struct relpoline_code *code, const void *dst)
+{
+	int32_t rel = inline_fastpath_rel(code, dst);
+
+	text_poke(&code->call.rel, &rel, sizeof(rel));
+}
+
+static inline
+int32_t slowpath_rel(const struct relpoline_code *code, const void *dst)
+{
+	return (unsigned long)dst - (unsigned long)code -
+		offsetofend(struct relpoline_code, fallback);
+}
+
+static void set_slowpath_rel(struct relpoline_code *code, const void *dst)
+{
+	int32_t rel = slowpath_rel(code, dst);
+	u8 opcode = CALL_REL32_OPCODE;
+
+	/*
+	 * The opcode will be already ok in most configurations, but not if we
+	 * change an indirect branch to direct one, which would happen in
+	 * systems which do not use retpolines.
+	 */
+	if (code->fallback.opcode != opcode)
+		text_poke(&code->fallback.opcode, &opcode, sizeof(opcode));
+	text_poke(&code->fallback.rel, &rel, sizeof(rel));
+}
+
+static void direct_relpoline(struct relpoline_code *code, enum code_state state)
+{
+	u8 opcode;
+	char rel;
+
+	/*
+	 * We set the jump targets based on the previous instruction, since we
+	 * have the no-preempt-prefix field which is not always there.
+	 */
+	switch (state) {
+	case RELPOLINE_SLOWPATH:
+		opcode = JMP_REL8_OPCODE;
+		rel = offsetof(struct relpoline_code, fallback);
+		break;
+	case RELPOLINE_FASTPATH:
+		opcode = JMP_REL8_OPCODE;
+		rel = offsetof(struct relpoline_code, call);
+		break;
+	case RELPOLINE_COND:
+		opcode = JNZ_REL8_OPCODE;
+		rel = offsetof(struct relpoline_code, fallback);
+	}
+
+	rel -= offsetofend(struct relpoline_code, jnz);
+
+	/*
+	 * Our atomicity requirements mean that we can only change one byte and
+	 * another one must stay the same.
+	 */
+	BUG_ON(opcode != code->jnz.opcode && rel != code->jnz.rel);
+
+	if (code->jnz.opcode != opcode)
+		text_poke(&code->jnz.opcode, &opcode, 1);
+	if (code->jnz.rel != rel)
+		text_poke(&code->jnz.rel, &rel, 1);
+}
+
+static void update_relpoline_info(struct relpoline_transition *transition)
+{
+	struct relpoline *rp = transition->rp;
+
+	/* Update the statistics */
+	if (rp->overflow)
+		relpolines_stat->n_overflow--;
+	else if (rp->n_dsts != 0)
+		relpolines_stat->n_dsts[rp->n_dsts-1]--;
+
+	if (transition->overflow)
+		relpolines_stat->n_overflow++;
+	else if (transition->n_dsts != 0)
+		relpolines_stat->n_dsts[transition->n_dsts-1]++;
+
+
+	/* Store the state */
+	rp->n_dsts = transition->n_dsts;
+	rp->overflow = transition->overflow;
+}
+
+static void make_inline_relpoline(struct relpoline_code *code, u32 dst)
+{
+	/*
+	 * Update the compared target; we only compare the low 32-bits,
+	 * since the modules and the kernel text always have the same
+	 * upper 32-bits.
+	 */
+	text_poke(&code->cmp_imm, &dst, sizeof(code->cmp_imm));
+
+	set_inline_fastpath_rel(code, kernel_ptr(dst));
+}
+
+static void clear_learned_relpolines(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		memset(&per_cpu(relpoline_samples, cpu)[0], 0,
+		       RELPOLINE_SAMPLES_NUM * sizeof(struct relpoline_sample));
+	}
+}
+
+static void change_relpoline_state(struct relpoline *rp,
+				   enum relpoline_state rp_state, bool new_rp)
+{
+	if (!new_rp) {
+		relpoline_list[rp->state].num--;
+		list_del(&rp->node);
+	}
+
+	relpoline_list[rp_state].num++;
+	list_add(&rp->node, &relpoline_list[rp_state].list);
+	rp->state = rp_state;
+}
+
+static int add_relpolines(const struct relpoline_entry *entries,
+			  unsigned int n_entries)
+{
+	struct relpoline *rp;
+	int i, r = 0;
+
+	for (i = 0; i < n_entries; i++) {
+		if (init_kernel_text((uintptr_t)entries[i].rip))
+			continue;
+
+		rp = kmem_cache_alloc(relpoline_info_cache,
+					   GFP_KERNEL|__GFP_ZERO);
+		if (!rp) {
+			r = -ENOMEM;
+			break;
+		}
+
+		rp->code = entries[i].rip;
+		rp->reg = entries[i].reg;
+
+		r = rhashtable_insert_fast(&relpolines_rhead,
+					   &rp->rhash,
+					   relpoline_rht_params);
+		if (r < 0) {
+			kmem_cache_free(relpoline_info_cache,
+					rp);
+			break;
+		}
+
+		change_relpoline_state(rp, RP_STATE_LEARNING, true);
+	}
+	if (r < 0)
+		WARN_ONCE(1, "Error loading relpolines\n");
+
+	return r;
+}
+
+static void remove_relpolines(const struct relpoline_entry *entries,
+			      unsigned int n_entries)
+{
+	struct relpoline *rp;
+	unsigned int i;
+
+	for (i = 0; i < n_entries; i++) {
+		rp = rhashtable_lookup_fast(&relpolines_rhead, &entries[i].rip,
+					    relpoline_rht_params);
+		if (!rp)
+			continue;
+
+		list_del(&rp->node);
+		rhashtable_remove_fast(&relpolines_rhead, &rp->rhash,
+				       relpoline_rht_params);
+
+		/* TODO: Relearn everything! */
+		kmem_cache_free(relpoline_info_cache, rp);
+	}
+}
+
+static int
+relpoline_module_notify(struct notifier_block *self, unsigned long val,
+			  void *data)
+{
+	struct module *mod = data;
+
+	mutex_lock(&relpoline_mutex);
+
+	switch (val) {
+	case MODULE_STATE_COMING:
+		add_relpolines(mod->relpolines, mod->num_relpolines);
+		break;
+	case MODULE_STATE_GOING:
+		remove_relpolines(mod->relpolines, mod->num_relpolines);
+	}
+
+	mutex_unlock(&relpoline_mutex);
+
+	return 0;
+}
+
+static struct notifier_block relpoline_module_nb = {
+	.notifier_call = relpoline_module_notify,
+	.priority = 1,
+};
+
+static int relpoline_sample_src_cmp_func(const void *l, const void *r)
+{
+	const struct relpoline_sample *s1 = l;
+	const struct relpoline_sample *s2 = r;
+
+	if (s1->src != s2->src)
+		return s1->src - s2->src;
+
+	return s1->dst - s2->dst;
+}
+
+static int relpoline_sample_cnt_cmp_func(const void *l, const void *r)
+{
+	const struct relpoline_sample *s1 = l;
+	const struct relpoline_sample *s2 = r;
+
+	return s2->cnt - s1->cnt;
+}
+
+static int copy_relpolines(void)
+{
+	struct relpoline_sample *p_copy = relpoline_samples_copy;
+	int cpu, i, n_entries;
+
+	for_each_online_cpu(cpu) {
+		struct relpoline_sample *orig;
+
+		orig = per_cpu(relpoline_samples, cpu);
+
+		for (i = 0; i < RELPOLINE_SAMPLES_NUM; i++, orig++) {
+			p_copy->src = READ_ONCE(orig->src);
+
+			/* Do some sanity checks while we are at it */
+			if (p_copy->src == 0)
+				continue;
+
+			if (init_kernel_text((uintptr_t)kernel_ptr(p_copy->src)))
+				continue;
+
+			/* Do the adjusting now to simplify later work */
+			p_copy->src -= offsetofend(struct relpoline_code,
+						   fallback);
+
+			/*
+			 * Ignore the risk of getting wrong data. We can live
+			 * with it (with some performance impact)
+			 */
+			p_copy->dst = READ_ONCE(orig->dst);
+			p_copy->cnt = READ_ONCE(orig->cnt);
+			p_copy++;
+		}
+	}
+
+	n_entries = p_copy - relpoline_samples_copy;
+
+	/* Sort by src */
+	sort(relpoline_samples_copy, n_entries, sizeof(*relpoline_samples_copy),
+	     relpoline_sample_src_cmp_func, NULL);
+
+	return n_entries;
+}
+
+/*
+ * Go over the samples of a single relpoline and decide which destinations
+ * should be used, filling in the given transition.
+ */
+static void relpoline_one_decision(struct relpoline_sample *samples,
+				   unsigned int n_entries,
+				   struct relpoline_transition *transition)
+{
+	struct relpoline *rp = transition->rp;
+	int i, j, k, n_new = 0, n_dsts = 0;
+
+	if (!transition->reset && rp->n_dsts != 0) {
+		transition->dsts[0] = rp->code->cmp_imm;
+		n_dsts++;
+	}
+
+	/*
+	 * Merge destinations with the same source and sum their samples.
+	 */
+	for (i = 1, j = 0; i < n_entries; i++) {
+		if (samples[j].dst != samples[i].dst) {
+			/* New target */
+			samples[++j] = samples[i];
+			continue;
+		}
+		/* Known target, add samples */
+		samples[j].cnt += samples[i].cnt;
+	}
+	n_new = j + 1;
+
+	/*
+	 * Remove entries that are already set. It is not likely to happen, but
+	 * this should not be too expensive. Do not use sort, since it uses
+	 * indirect branches, and is likely to be slower than this silly test.
+	 */
+	for (i = 0, j = 0; i < n_new; i++) {
+		for (k = 0; k < n_dsts; k++) {
+			if (samples[i].dst == transition->dsts[k])
+				break;
+		}
+
+		if (k == n_dsts) {
+			/* New entry */
+			samples[j++] = samples[i];
+		}
+	}
+
+	/* Change the new number based on the duplicates we detected */
+	n_new = j;
+
+	sort(samples, n_new, sizeof(*samples), relpoline_sample_cnt_cmp_func,
+	     NULL);
+
+	/* Add the new ones */
+	for (i = 0; i < n_new && n_dsts < ARRAY_SIZE(transition->dsts);
+	     i++, n_dsts++)
+		transition->dsts[n_dsts] = samples[i].dst;
+
+	if (n_dsts + n_new > 1) {
+		transition->n_dsts = 1;
+		transition->overflow = true;
+	}
+	transition->n_dsts = n_dsts;
+}
+
+static void init_relpoline_transition(struct relpoline_transition *rp_trans,
+				      struct relpoline *rp,
+				      bool reset)
+{
+	rp_trans->rp = rp;
+	rp_trans->n_dsts = 0;
+	rp_trans->overflow = 0;
+	rp_trans->prev_n_dsts = rp->n_dsts;
+	rp_trans->reset = reset;
+}
+
+/*
+ * Returns the number of decisions.
+ */
+static int make_relpoline_decisions(struct relpoline_transition *transition)
+{
+	unsigned int end, start, n_decisions = 0;
+	int n_copied;
+
+	/*
+	 * First we copy the relpolines for a certain hash index to prevent it
+	 * from messing up with our data. While we can cope with races that
+	 * modify the destination, we need the source rip to be consistent.
+	 */
+	n_copied = copy_relpolines();
+
+	for (start = 0, n_decisions = 0;
+	     start < n_copied && n_decisions < RP_MAX_DECISIONS; start = end) {
+		struct relpoline_code *code;
+		struct relpoline *rp;
+
+		code = kernel_ptr(relpoline_samples_copy[start].src);
+		rp = rhashtable_lookup_fast(&relpolines_rhead, &code,
+					    relpoline_rht_params);
+
+		/* Races might cause the source to be wrong. Live with it */
+		if (!rp) {
+			end = start + 1;
+			continue;
+		}
+
+		/* Find all the relevant entries */
+		for (end = start + 1; end < n_copied; end++) {
+			if (relpoline_samples_copy[start].src !=
+			    relpoline_samples_copy[end].src)
+				break;
+		}
+
+		init_relpoline_transition(&transition[n_decisions], rp, false);
+
+		relpoline_one_decision(&relpoline_samples_copy[start],
+					 end - start, &transition[n_decisions]);
+
+		n_decisions++;
+	}
+	return n_decisions;
+}
+
+static void change_retpoline_to_indirect(struct relpoline_code *code, u8 reg)
+{
+	u8 copy_len, offset, nop_len, call_len = 2;
+	u8 patched[sizeof(code->fallback)];
+
+	copy_len = offsetofend(typeof(*code), fallback) -
+		   offsetof(typeof(*code), fallback.opcode);
+
+	/* Save a byte for no-preempt prefix */
+	if (IS_ENABLED(CONFIG_PREEMPT))
+		call_len++;
+
+	/* Save a byte for call */
+	if (reg >= ARCH_R8)
+		call_len++;
+
+	/* We leave the no-preempt prefix unmodified, so we ignore it */
+	nop_len = copy_len - call_len;
+	memcpy(patched, ideal_nops[nop_len], nop_len);
+
+	offset = nop_len;
+	if (IS_ENABLED(CONFIG_PREEMPT))
+		patched[offset++] = PREEMPT_DISABLE_PREFIX;
+	if (reg >= ARCH_R8)
+		patched[offset++] = REX_B;
+
+	memcpy(&patched[offset], CALL_IND_INS, sizeof(CALL_IND_INS) - 1);
+	patched[copy_len - 1] |= reg & 7;
+
+	text_poke(&code->fallback.opcode, patched, copy_len);
+}
+
+static void update_slowpath(struct relpoline_transition *transition)
+{
+	struct relpoline *rp = transition->rp;
+	struct relpoline_code *code = rp->code;
+	u8 reg = rp->reg;
+
+	if (transition->overflow) {
+		if (static_cpu_has(X86_FEATURE_RETPOLINE))
+			set_slowpath_rel(code, indirect_thunks[reg]);
+		else
+			change_retpoline_to_indirect(code, reg);
+	} else
+		set_slowpath_rel(code, save_relpoline_funcs[reg]);
+}
+
+static void update_relpolines(struct relpoline_transition *transitions, int n)
+{
+	struct relpoline_transition *start, *cur, *end;
+
+	mutex_lock(&text_mutex);
+	start = transitions;
+	end = transitions + n;
+
+	for (cur = start; cur != end; cur++) {
+		update_relpoline_info(cur);
+
+		direct_relpoline(cur->rp->code, RELPOLINE_SLOWPATH);
+	}
+
+	/*
+	 * Ensure all cores no longer run the disabled relpolines.  Since
+	 * preemption is disabled between the relpoline compare and call, this
+	 * would mean they are all safe.
+	 */
+	synchronize_sched();
+
+	for (cur = start; cur != end; cur++) {
+		/*
+		 * During the transition the fast-path behaves as a fallback.
+		 */
+		set_inline_fastpath_rel(cur->rp->code,
+					indirect_thunks[cur->rp->reg]);
+
+		/*
+		 * Move everything to the fast-path.
+		 */
+		direct_relpoline(cur->rp->code, RELPOLINE_FASTPATH);
+	}
+
+	synchronize_sched();
+
+	/* Now we can quietly update the slow-path */
+	for (cur = start; cur != end; cur++) {
+		update_slowpath(cur);
+		direct_relpoline(cur->rp->code, RELPOLINE_SLOWPATH);
+	}
+
+	synchronize_sched();
+
+	/* Update the compared value and the call targets */
+	for (cur = start; cur != end; cur++) {
+		struct relpoline *rp = cur->rp;
+		enum relpoline_state state;
+
+		/*
+		 * If there are destinations, enable; otherwise, keep disabled.
+		 */
+		if (cur->n_dsts == 0) {
+			state = RP_STATE_LEARNING;
+		} else {
+			make_inline_relpoline(rp->code, cur->dsts[0]);
+			direct_relpoline(rp->code, RELPOLINE_COND);
+
+			if (cur->overflow || cur->n_dsts > 1)
+				state = RP_STATE_UNSTABLE;
+			else
+				state = RP_STATE_STABLE;
+		}
+
+		change_relpoline_state(rp, state, false);
+	}
+	mutex_unlock(&text_mutex);
+}
+
+static void relearn_relpolines(unsigned int n)
+{
+	struct list_head *list = &relpoline_list[RP_STATE_UNSTABLE].list;
+	struct relpoline *rp, *tmp;
+	unsigned int done = 0;
+
+	while (!list_empty(list) && done < n) {
+		unsigned int i = 0;
+
+		list_for_each_entry_safe(rp, tmp, list, node) {
+			init_relpoline_transition(&cp_changes[i], rp, true);
+
+			i++;
+			done++;
+
+			if (i >= RP_MAX_DECISIONS || done >= n)
+				break;
+		}
+		update_relpolines(cp_changes, i);
+	}
+}
+
+static void relpoline_work(struct work_struct *work);
+
+static DECLARE_DELAYED_WORK(c_work, relpoline_work);
+
+static void relpolines_autolearn(void)
+{
+	if (relpolines_resampling_count != 0) {
+		unsigned int resampled;
+
+		resampled = min_t(unsigned int, relpolines_resampling_count,
+				  RP_REENABLE_IN_EPOCH);
+
+		relearn_relpolines(resampled);
+
+		relpolines_resampling_count -= resampled;
+
+		/*
+		 * We defer starting the next start-time from the end of the
+		 * current one.
+		 */
+		relpolines_sample_time = ktime_get();
+		return;
+	}
+
+	if (ktime_ms_delta(ktime_get(), relpolines_sample_time) <
+	    RP_SAMPLE_MSECS)
+		return;
+
+	/* Start another training period */
+	relpolines_resampling_count = relpoline_list[RP_STATE_UNSTABLE].num;
+}
+
+
+static void relpoline_work(struct work_struct *work)
+{
+	int n;
+
+	mutex_lock(&relpoline_mutex);
+	cpus_read_lock();
+
+	n = make_relpoline_decisions(cp_changes);
+
+	if (n != 0) {
+		update_relpolines(cp_changes, n);
+		clear_learned_relpolines();
+	} else
+		relpolines_autolearn();
+
+	cpus_read_unlock();
+	mutex_unlock(&relpoline_mutex);
+
+	schedule_delayed_work(&c_work, HZ * RP_READJUST_SECONDS);
+}
+
+static int relpoline_debug_show(struct seq_file *f, void *offset)
+{
+	int i;
+
+	seq_puts(f, "Destinations distribution\n");
+	for (i = 0; i < RP_TARGETS; i++)
+		seq_printf(f, "%d, %u\n", i+1, relpolines_stat->n_dsts[i]);
+	seq_printf(f, "%d, %u\n", i+1, relpolines_stat->n_overflow);
+	return 0;
+}
+
+static int relpoline_debug_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, relpoline_debug_show, inode->i_private);
+}
+
+static ssize_t relpoline_debug_write(struct file *file, const char __user *ubuf,
+				     size_t count, loff_t *ppos)
+{
+	u8 kbuf[40] = {0};
+	size_t len;
+
+	len = min(count, sizeof(kbuf) - 1);
+
+	if (len == 0)
+		return -EINVAL;
+
+	if (copy_from_user(kbuf, ubuf, len))
+		return -EFAULT;
+
+	kbuf[len] = '\0';
+	if (kbuf[len - 1] == '\n')
+		kbuf[len - 1] = '\0';
+
+	if (strcmp(kbuf, "relearn") == 0) {
+		mutex_lock(&relpoline_mutex);
+		relearn_relpolines(UINT_MAX);
+		mutex_unlock(&relpoline_mutex);
+	} else
+		return -EINVAL;
+
+	return count;
+}
+
+static const struct file_operations relpoline_debug_fops = {
+	.owner		= THIS_MODULE,
+	.open		= relpoline_debug_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+	.write		= relpoline_debug_write,
+};
+
+static int __init create_relpoline_debugfs(void)
+{
+	int r;
+
+	relpoline_dbg_entry = debugfs_create_file("relpolines", 0444, NULL,
+						  NULL, &relpoline_debug_fops);
+	if (IS_ERR(relpoline_dbg_entry)) {
+		r = PTR_ERR(relpoline_dbg_entry);
+		pr_err("failed to create debugfs entry, error: %d", r);
+		return r;
+	}
+
+	relpolines_stat = kzalloc(sizeof(*(relpolines_stat)), GFP_KERNEL);
+
+	return 0;
+}
+
+static void release_relpoline_debugfs(void)
+{
+	kfree(relpolines_stat);
+	relpolines_stat = NULL;
+}
+
+static int __init relpoline_init(void)
+{
+	bool relpolines_rhead_initialized = false;
+	int i, r;
+
+	if (IS_ENABLED(CONFIG_DEBUG_FS)) {
+		r = create_relpoline_debugfs();
+		if (r)
+			goto error;
+	}
+
+	mutex_init(&relpoline_mutex);
+
+	r = rhashtable_init(&relpolines_rhead, &relpoline_rht_params);
+	if (r)
+		goto error;
+
+	relpolines_rhead_initialized = true;
+
+	relpoline_info_cache = kmem_cache_create("relpoline_cache",
+				sizeof(struct relpoline), 0, 0, NULL);
+	if (!relpoline_info_cache)
+		goto error;
+
+	relpolines_sample_time = ktime_get();
+
+	/*
+	 * TODO: this array needs to be reallocated if CPUs are hotplugged
+	 */
+	relpoline_samples_copy = kmalloc_array(num_online_cpus() * RELPOLINE_SAMPLES_NUM,
+					       sizeof(*relpoline_samples_copy),
+					       GFP_KERNEL);
+
+	if (relpoline_samples_copy == NULL) {
+		r = -ENOMEM;
+		WARN(1, "error allocating relpoline memory");
+		goto error;
+	}
+
+	r = register_module_notifier(&relpoline_module_nb);
+	if (r) {
+		WARN(1, "error initializing relpolines");
+		goto error;
+	}
+
+	for (i = 0; i < RP_STATE_NUM; i++) {
+		INIT_LIST_HEAD(&relpoline_list[i].list);
+		relpoline_list[i].num = 0;
+	}
+
+	/*
+	 * Ignoring errors here, only part of the relpolines would be enabled.
+	 */
+	add_relpolines(__relpolines, __relpolines_end - __relpolines);
+
+	schedule_delayed_work(&c_work, HZ * RP_READJUST_SECONDS * 10);
+
+	return 0;
+error:
+	kfree(relpoline_samples_copy);
+	relpoline_samples_copy = NULL;
+	unregister_module_notifier(&relpoline_module_nb);
+
+	kmem_cache_destroy(relpoline_info_cache);
+	relpoline_info_cache = NULL;
+
+	if (relpolines_rhead_initialized)
+		rhashtable_destroy(&relpolines_rhead);
+	relpolines_rhead_initialized = false;
+
+	release_relpoline_debugfs();
+	return r;
+}
+late_initcall(relpoline_init);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 5/5] x86: relpoline: disabling interface
  2018-10-18  0:54 [RFC PATCH 0/5] x86: dynamic indirect call promotion Nadav Amit
                   ` (3 preceding siblings ...)
  2018-10-18  0:54 ` [RFC PATCH 4/5] x86: learning and patching indirect branch targets Nadav Amit
@ 2018-10-18  0:54 ` Nadav Amit
  2018-10-23 18:36 ` [RFC PATCH 0/5] x86: dynamic indirect call promotion Dave Hansen
  2018-11-28 16:08 ` Josh Poimboeuf
  6 siblings, 0 replies; 43+ messages in thread
From: Nadav Amit @ 2018-10-18  0:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Peter Zijlstra, H . Peter Anvin ,
	Thomas Gleixner, linux-kernel, Nadav Amit, x86, Borislav Petkov,
	David Woodhouse, Nadav Amit

In certain cases it is beneficial not to use indirect branch promotion.
One such case is seccomp, which may hold multiple filters, and different
ones for different processes. The interface tells the macro not to add a
relpoline to the indirect branch.

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/nospec-branch.h | 25 +++++++++++++++++++++++++
 kernel/seccomp.c                     |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 360caad7a890..8b10e8165069 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -246,7 +246,21 @@
 .endr
 .endm
 
+_DISABLE_INDIRECT_BRANCH_OPT = 0
+
+.macro disable_indirect_branch_opt
+_DISABLE_INDIRECT_BRANCH_OPT = 1
+.endm
+
+.macro enable_indirect_branch_opt
+_DISABLE_INDIRECT_BRANCH_OPT = 0
+.endm
+
 .macro call v:vararg
+.if _DISABLE_INDIRECT_BRANCH_OPT
+	# The pseudo-prefix is just to avoid expanding the macro
+	{disp8} call \v
+.else
 	retpoline = 0
 .irp reg_it,ARCH_REG_NAMES
 .ifc "\v", "__x86_indirect_thunk_\reg_it"
@@ -257,6 +271,7 @@
 .if retpoline == 0
 	{disp8} call \v
 .endif
+.endif
 .endm
 
 #else /* __ASSEMBLY__ */
@@ -409,6 +424,16 @@ struct relpoline_entry {
 extern const void *indirect_thunks[16];
 extern const void *save_relpoline_funcs[16];
 
+static inline void enable_relpolines(void)
+{
+       asm volatile("enable_indirect_branch_opt");
+}
+
+static inline void disable_relpolines(void)
+{
+	asm volatile("disable_indirect_branch_opt");
+}
+
 /* The Intel SPEC CTRL MSR base value cache */
 extern u64 x86_spec_ctrl_base;
 
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index fd023ac24e10..c3fbeddfa8fa 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -207,6 +207,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 	 * All filters in the list are evaluated and the lowest BPF return
 	 * value always takes priority (ignoring the DATA).
 	 */
+	disable_relpolines();
 	for (; f; f = f->prev) {
 		u32 cur_ret = BPF_PROG_RUN(f->prog, sd);
 
@@ -215,6 +216,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 			*match = f;
 		}
 	}
+	enable_relpolines();
 	return ret;
 }
 #endif /* CONFIG_SECCOMP_FILTER */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18  0:54 ` [RFC PATCH 1/5] x86: introduce preemption disable prefix Nadav Amit
@ 2018-10-18  1:22   ` Andy Lutomirski
  2018-10-18  3:12     ` Nadav Amit
  2018-10-18  7:54     ` Peter Zijlstra
  0 siblings, 2 replies; 43+ messages in thread
From: Andy Lutomirski @ 2018-10-18  1:22 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, H . Peter Anvin,
	Thomas Gleixner, linux-kernel, Nadav Amit, x86, Borislav Petkov,
	David Woodhouse


> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
> 
> It is sometimes beneficial to prevent preemption for very few
> instructions, or prevent preemption for some instructions that precede
> a branch (this latter case will be introduced in the next patches).
> 
> To provide such functionality on x86-64, we use an empty REX-prefix
> (opcode 0x40) as an indication that preemption is disabled for the
> following instruction.

Nifty!

That being said, I think you have a few bugs.  First, you can’t just ignore a rescheduling interrupt, as you introduce unbounded latency when this happens — you’re effectively emulating preempt_enable_no_resched(), which is not a drop-in replacement for preempt_enable(). To fix this, you may need to jump to a slow-path trampoline that calls schedule() at the end or consider rewinding one instruction instead. Or use TF, which is only a little bit terrifying...

You also aren’t accounting for the case where you get an exception that is, in turn, preempted.



> 
> It is expected that this opcode is not in common use.
> 
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
> arch/x86/entry/entry_64.S            | 10 ++++++++++
> arch/x86/include/asm/nospec-branch.h | 12 ++++++++++++
> 2 files changed, 22 insertions(+)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cb8a5893fd33..31d59aad496e 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -643,6 +643,16 @@ retint_kernel:
>    jnc    1f
> 0:    cmpl    $0, PER_CPU_VAR(__preempt_count)
>    jnz    1f
> +
> +    /*
> +     * Allow to use hint to prevent preemption on a certain instruction.
> +     * Consider an instruction with the first byte having REX prefix
> +     * without any bits set as an indication for preemption disabled.
> +     */
> +    movq    RIP(%rsp), %rax
> +    cmpb    $PREEMPT_DISABLE_PREFIX, (%rax)
> +    jz    1f
> +
>    call    preempt_schedule_irq
>    jmp    0b
> 1:
> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index 80dc14422495..0267611eb247 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -52,6 +52,12 @@
>    jnz    771b;                \
>    add    $(BITS_PER_LONG/8) * nr, sp;
> 
> +/*
> + * An empty REX-prefix is an indication that preemption should not take place on
> + * this instruction.
> + */
> +#define PREEMPT_DISABLE_PREFIX                 (0x40)
> +
> #ifdef __ASSEMBLY__
> 
> /*
> @@ -148,6 +154,12 @@
> #endif
> .endm
> 
> +.macro preempt_disable_prefix
> +#ifdef CONFIG_PREEMPT
> +    .byte    PREEMPT_DISABLE_PREFIX
> +#endif
> +.endm
> +
> #else /* __ASSEMBLY__ */
> 
> #define ANNOTATE_NOSPEC_ALTERNATIVE                \
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18  1:22   ` Andy Lutomirski
@ 2018-10-18  3:12     ` Nadav Amit
  2018-10-18  3:26       ` Nadav Amit
  2018-10-18  3:51       ` Andy Lutomirski
  2018-10-18  7:54     ` Peter Zijlstra
  1 sibling, 2 replies; 43+ messages in thread
From: Nadav Amit @ 2018-10-18  3:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, H . Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, David Woodhouse

at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:

> 
>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>> 
>> It is sometimes beneficial to prevent preemption for very few
>> instructions, or prevent preemption for some instructions that precede
>> a branch (this latter case will be introduced in the next patches).
>> 
>> To provide such functionality on x86-64, we use an empty REX-prefix
>> (opcode 0x40) as an indication that preemption is disabled for the
>> following instruction.
> 
> Nifty!
> 
> That being said, I think you have a few bugs. First, you can’t just ignore
> a rescheduling interrupt, as you introduce unbounded latency when this
> happens — you’re effectively emulating preempt_enable_no_resched(), which
> is not a drop-in replacement for preempt_enable(). To fix this, you may
> need to jump to a slow-path trampoline that calls schedule() at the end or
> consider rewinding one instruction instead. Or use TF, which is only a
> little bit terrifying…

Yes, I didn’t pay enough attention here. For my use-case, I think that the
easiest solution would be to make synchronize_sched() ignore preemptions
that happen while the prefix is detected. It would slightly change the
meaning of the prefix.

> You also aren’t accounting for the case where you get an exception that
> is, in turn, preempted.

Hmm.. Can you give me an example for such an exception in my use-case? I
cannot think of an exception that might be preempted (assuming #BP, #MC
cannot be preempted).

I agree that for super-general case this might be inappropriate.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18  3:12     ` Nadav Amit
@ 2018-10-18  3:26       ` Nadav Amit
  2018-10-18  3:51       ` Andy Lutomirski
  1 sibling, 0 replies; 43+ messages in thread
From: Nadav Amit @ 2018-10-18  3:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, H . Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, David Woodhouse

at 8:11 PM, Nadav Amit <namit@vmware.com> wrote:

> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>> 
>>> It is sometimes beneficial to prevent preemption for very few
>>> instructions, or prevent preemption for some instructions that precede
>>> a branch (this latter case will be introduced in the next patches).
>>> 
>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>> (opcode 0x40) as an indication that preemption is disabled for the
>>> following instruction.
>> 
>> Nifty!
>> 
>> That being said, I think you have a few bugs. First, you can’t just ignore
>> a rescheduling interrupt, as you introduce unbounded latency when this
>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>> need to jump to a slow-path trampoline that calls schedule() at the end or
>> consider rewinding one instruction instead. Or use TF, which is only a
>> little bit terrifying…
> 
> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> easiest solution would be to make synchronize_sched() ignore preemptions
> that happen while the prefix is detected. It would slightly change the
> meaning of the prefix.

Ignore this nonsense that I wrote. I’ll try to come up with a decent
solution.

>> You also aren’t accounting for the case where you get an exception that
>> is, in turn, preempted.
> 
> Hmm.. Can you give me an example for such an exception in my use-case? I
> cannot think of an exception that might be preempted (assuming #BP, #MC
> cannot be preempted).
> 
> I agree that for super-general case this might be inappropriate.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18  3:12     ` Nadav Amit
  2018-10-18  3:26       ` Nadav Amit
@ 2018-10-18  3:51       ` Andy Lutomirski
  2018-10-18 16:47         ` Nadav Amit
  1 sibling, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2018-10-18  3:51 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andrew Lutomirski, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David

On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>
> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> >
> >> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
> >>
> >> It is sometimes beneficial to prevent preemption for very few
> >> instructions, or prevent preemption for some instructions that precede
> >> a branch (this latter case will be introduced in the next patches).
> >>
> >> To provide such functionality on x86-64, we use an empty REX-prefix
> >> (opcode 0x40) as an indication that preemption is disabled for the
> >> following instruction.
> >
> > Nifty!
> >
> > That being said, I think you have a few bugs. First, you can’t just ignore
> > a rescheduling interrupt, as you introduce unbounded latency when this
> > happens — you’re effectively emulating preempt_enable_no_resched(), which
> > is not a drop-in replacement for preempt_enable(). To fix this, you may
> > need to jump to a slow-path trampoline that calls schedule() at the end or
> > consider rewinding one instruction instead. Or use TF, which is only a
> > little bit terrifying…
>
> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> easiest solution would be to make synchronize_sched() ignore preemptions
> that happen while the prefix is detected. It would slightly change the
> meaning of the prefix.
>
> > You also aren’t accounting for the case where you get an exception that
> > is, in turn, preempted.
>
> Hmm.. Can you give me an example for such an exception in my use-case? I
> cannot think of an exception that might be preempted (assuming #BP, #MC
> cannot be preempted).
>

Look for cond_local_irq_enable().

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18  1:22   ` Andy Lutomirski
  2018-10-18  3:12     ` Nadav Amit
@ 2018-10-18  7:54     ` Peter Zijlstra
  2018-10-18 18:14       ` Nadav Amit
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2018-10-18  7:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Ingo Molnar, Andy Lutomirski, H . Peter Anvin,
	Thomas Gleixner, linux-kernel, Nadav Amit, x86, Borislav Petkov,
	David Woodhouse

On Wed, Oct 17, 2018 at 06:22:48PM -0700, Andy Lutomirski wrote:
> 
> > On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
> > 
> > It is sometimes beneficial to prevent preemption for very few
> > instructions, or prevent preemption for some instructions that precede
> > a branch (this latter case will be introduced in the next patches).
> > 
> > To provide such functionality on x86-64, we use an empty REX-prefix
> > (opcode 0x40) as an indication that preemption is disabled for the
> > following instruction.
> 
> Nifty!
> 
> That being said, I think you have a few bugs.

> First, you can’t just ignore a rescheduling interrupt, as you
> introduce unbounded latency when this happens — you’re effectively
> emulating preempt_enable_no_resched(), which is not a drop-in
> replacement for preempt_enable().

> To fix this, you may need to jump to a slow-path trampoline that calls
> schedule() at the end or consider rewinding one instruction instead.
> Or use TF, which is only a little bit terrifying...

At which point we're very close to in-kernel rseq.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18  3:51       ` Andy Lutomirski
@ 2018-10-18 16:47         ` Nadav Amit
  2018-10-18 17:00           ` Andy Lutomirski
  0 siblings, 1 reply; 43+ messages in thread
From: Nadav Amit @ 2018-10-18 16:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Andrew Lutomirski, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David

at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:

> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>> It is sometimes beneficial to prevent preemption for very few
>>>> instructions, or prevent preemption for some instructions that precede
>>>> a branch (this latter case will be introduced in the next patches).
>>>> 
>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>> following instruction.
>>> 
>>> Nifty!
>>> 
>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>> consider rewinding one instruction instead. Or use TF, which is only a
>>> little bit terrifying…
>> 
>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>> easiest solution would be to make synchronize_sched() ignore preemptions
>> that happen while the prefix is detected. It would slightly change the
>> meaning of the prefix.

So thinking about it further, rewinding the instruction seems the easiest
and most robust solution. I’ll do it.

>>> You also aren’t accounting for the case where you get an exception that
>>> is, in turn, preempted.
>> 
>> Hmm.. Can you give me an example for such an exception in my use-case? I
>> cannot think of an exception that might be preempted (assuming #BP, #MC
>> cannot be preempted).
> 
> Look for cond_local_irq_enable().

I looked at it. Yet, I still don’t see how exceptions might happen in my
use-case, but having said that - this can be fixed too.

To be frank, I paid relatively little attention to this subject. Any
feedback about the other parts and especially on the high-level approach? Is
modifying the retpolines in the proposed manner (assembly macros)
acceptable?

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18 16:47         ` Nadav Amit
@ 2018-10-18 17:00           ` Andy Lutomirski
  2018-10-18 17:25             ` Nadav Amit
  2018-10-19  1:08             ` Nadav Amit
  0 siblings, 2 replies; 43+ messages in thread
From: Andy Lutomirski @ 2018-10-18 17:00 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andrew Lutomirski, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David



> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
> 
> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> 
>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>> 
>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>> instructions, or prevent preemption for some instructions that precede
>>>>> a branch (this latter case will be introduced in the next patches).
>>>>> 
>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>> following instruction.
>>>> 
>>>> Nifty!
>>>> 
>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>> little bit terrifying…
>>> 
>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>> that happen while the prefix is detected. It would slightly change the
>>> meaning of the prefix.
> 
> So thinking about it further, rewinding the instruction seems the easiest
> and most robust solution. I’ll do it.
> 
>>>> You also aren’t accounting for the case where you get an exception that
>>>> is, in turn, preempted.
>>> 
>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>> cannot be preempted).
>> 
>> Look for cond_local_irq_enable().
> 
> I looked at it. Yet, I still don’t see how exceptions might happen in my
> use-case, but having said that - this can be fixed too.

I’m not totally certain there’s a case that matters.  But it’s worth checking.

> 
> To be frank, I paid relatively little attention to this subject. Any
> feedback about the other parts and especially on the high-level approach? Is
> modifying the retpolines in the proposed manner (assembly macros)
> acceptable?
> 

It’s certainly a neat idea, and it could be a real speedup.

> Thanks,
> Nadav

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18 17:00           ` Andy Lutomirski
@ 2018-10-18 17:25             ` Nadav Amit
  2018-10-18 17:29               ` Andy Lutomirski
  2018-10-19  1:08             ` Nadav Amit
  1 sibling, 1 reply; 43+ messages in thread
From: Nadav Amit @ 2018-10-18 17:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Andrew Lutomirski, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David

at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:

> 
> 
>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
>> 
>> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> 
>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>>> 
>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>> 
>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>> following instruction.
>>>>> 
>>>>> Nifty!
>>>>> 
>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>> little bit terrifying…
>>>> 
>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>> that happen while the prefix is detected. It would slightly change the
>>>> meaning of the prefix.
>> 
>> So thinking about it further, rewinding the instruction seems the easiest
>> and most robust solution. I’ll do it.
>> 
>>>>> You also aren’t accounting for the case where you get an exception that
>>>>> is, in turn, preempted.
>>>> 
>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>> cannot be preempted).
>>> 
>>> Look for cond_local_irq_enable().
>> 
>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>> use-case, but having said that - this can be fixed too.
> 
> I’m not totally certain there’s a case that matters.  But it’s worth checking 
> 
>> To be frank, I paid relatively little attention to this subject. Any
>> feedback about the other parts and especially on the high-level approach? Is
>> modifying the retpolines in the proposed manner (assembly macros)
>> acceptable?
> 
> It’s certainly a neat idea, and it could be a real speedup.

Great. So I’ll try to shape things up, and I am still waiting for comments
(from others).

I’ll just mention two more patches I need to clean up (I know I still owe you some
work, so obviously it will be done later):

1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17
BPF filters on the Redis server process that are invoked on each
system-call. Invoking each one requires an indirect branch. The patch keeps
a per-process kernel code-page that holds trampolines for these functions.

2. Binary-search for system-calls. Use the per-process kernel code-page also
to hold multiple trampolines for the 16 common system calls of a certain
process. The patch uses an indirection table and a binary-search to find the
proper trampoline.
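
To make (2) a bit more concrete, here is a rough C sketch of the dispatch
logic. It is illustration only: the names are made up, and in the actual
patch the comparisons and calls would be emitted as code in the per-process
page, so that the hot-path calls are direct rather than through a function
pointer as written here.

#define NR_HOT_SYSCALLS 16

struct hot_syscall {
	int nr;				/* syscall number; table sorted by nr */
	long (*handler)(void *regs);	/* a direct call in the generated code */
};

static long dispatch_hot_syscall(const struct hot_syscall *tbl, int nr,
				 void *regs,
				 long (*fallback)(void *regs, int nr))
{
	int lo = 0, hi = NR_HOT_SYSCALLS - 1;

	/* Binary search over the 16 hottest syscalls of this process. */
	while (lo <= hi) {
		int mid = (lo + hi) / 2;

		if (tbl[mid].nr == nr)
			return tbl[mid].handler(regs);
		if (tbl[mid].nr < nr)
			lo = mid + 1;
		else
			hi = mid - 1;
	}

	/* Not one of the hot syscalls: fall back to the regular
	 * (retpolined) indirect dispatch.
	 */
	return fallback(regs, nr);
}

With 16 targets the generated search tree would be roughly four levels of
compare-and-branch deep before it reaches a direct call.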

Thanks again,
Nadav

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18 17:25             ` Nadav Amit
@ 2018-10-18 17:29               ` Andy Lutomirski
  2018-10-18 17:42                 ` Nadav Amit
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2018-10-18 17:29 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andrew Lutomirski, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David

On Thu, Oct 18, 2018 at 10:25 AM Nadav Amit <namit@vmware.com> wrote:
>
> at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> >
> >
> >> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
> >>
> >> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>
> >>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
> >>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>>>
> >>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
> >>>>>>
> >>>>>> It is sometimes beneficial to prevent preemption for very few
> >>>>>> instructions, or prevent preemption for some instructions that precede
> >>>>>> a branch (this latter case will be introduced in the next patches).
> >>>>>>
> >>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
> >>>>>> (opcode 0x40) as an indication that preemption is disabled for the
> >>>>>> following instruction.
> >>>>>
> >>>>> Nifty!
> >>>>>
> >>>>> That being said, I think you have a few bugs. First, you can’t just ignore
> >>>>> a rescheduling interrupt, as you introduce unbounded latency when this
> >>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
> >>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
> >>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
> >>>>> consider rewinding one instruction instead. Or use TF, which is only a
> >>>>> little bit terrifying…
> >>>>
> >>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> >>>> easiest solution would be to make synchronize_sched() ignore preemptions
> >>>> that happen while the prefix is detected. It would slightly change the
> >>>> meaning of the prefix.
> >>
> >> So thinking about it further, rewinding the instruction seems the easiest
> >> and most robust solution. I’ll do it.
> >>
> >>>>> You also aren’t accounting for the case where you get an exception that
> >>>>> is, in turn, preempted.
> >>>>
> >>>> Hmm.. Can you give me an example for such an exception in my use-case? I
> >>>> cannot think of an exception that might be preempted (assuming #BP, #MC
> >>>> cannot be preempted).
> >>>
> >>> Look for cond_local_irq_enable().
> >>
> >> I looked at it. Yet, I still don’t see how exceptions might happen in my
> >> use-case, but having said that - this can be fixed too.
> >
> > I’m not totally certain there’s a case that matters.  But it’s worth checking
> >
> >> To be frank, I paid relatively little attention to this subject. Any
> >> feedback about the other parts and especially on the high-level approach? Is
> >> modifying the retpolines in the proposed manner (assembly macros)
> >> acceptable?
> >
> > It’s certainly a neat idea, and it could be a real speedup.
>
> Great. So I’ll try to shape things up, and I still wait for other comments
> (from others).
>
> I’ll just mention two more patches I need to cleanup (I know I still owe you some
> work, so obviously it will be done later):
>
> 1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17
> BPF filters on the Redis server process that are invoked on each
> system-call. Invoking each one requires an indirect branch. The patch keeps
> a per-process kernel code-page that holds trampolines for these functions.

I wonder how many levels of branches are needed before the branches
involved exceed the retpoline cost.

>
> 2. Binary-search for system-calls. Use the per-process kernel code-page also
> to hold multiple trampolines for the 16 common system calls of a certain
> process. The patch uses an indirection table and a binary-search to find the
> proper trampoline.

Same comment applies here.

>
> Thanks again,
> Nadav



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18 17:29               ` Andy Lutomirski
@ 2018-10-18 17:42                 ` Nadav Amit
  0 siblings, 0 replies; 43+ messages in thread
From: Nadav Amit @ 2018-10-18 17:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Andrew Lutomirski, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David

at 10:29 AM, Andy Lutomirski <luto@amacapital.net> wrote:

> On Thu, Oct 18, 2018 at 10:25 AM Nadav Amit <namit@vmware.com> wrote:
>> at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> 
>>>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>>>>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> 
>>>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>>>>> 
>>>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>>>> 
>>>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>>>> following instruction.
>>>>>>> 
>>>>>>> Nifty!
>>>>>>> 
>>>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>>>> little bit terrifying…
>>>>>> 
>>>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>>>> that happen while the prefix is detected. It would slightly change the
>>>>>> meaning of the prefix.
>>>> 
>>>> So thinking about it further, rewinding the instruction seems the easiest
>>>> and most robust solution. I’ll do it.
>>>> 
>>>>>>> You also aren’t accounting for the case where you get an exception that
>>>>>>> is, in turn, preempted.
>>>>>> 
>>>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>>>> cannot be preempted).
>>>>> 
>>>>> Look for cond_local_irq_enable().
>>>> 
>>>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>>>> use-case, but having said that - this can be fixed too.
>>> 
>>> I’m not totally certain there’s a case that matters.  But it’s worth checking
>>> 
>>>> To be frank, I paid relatively little attention to this subject. Any
>>>> feedback about the other parts and especially on the high-level approach? Is
>>>> modifying the retpolines in the proposed manner (assembly macros)
>>>> acceptable?
>>> 
>>> It’s certainly a neat idea, and it could be a real speedup.
>> 
>> Great. So I’ll try to shape things up, and I still wait for other comments
>> (from others).
>> 
>> I’ll just mention two more patches I need to cleanup (I know I still owe you some
>> work, so obviously it will be done later):
>> 
>> 1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17
>> BPF filters on the Redis server process that are invoked on each
>> system-call. Invoking each one requires an indirect branch. The patch keeps
>> a per-process kernel code-page that holds trampolines for these functions.
> 
> I wonder how many levels of branches are needed before the branches
> involved exceed the retpoline cost.

In this case there is no hierarchy, but a list of trampolines that are
called one after the other, as the seccomp filter order is predefined. It
does not work if different threads of the same process have different
filters.
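
For illustration, the generated per-process code would conceptually be
equivalent to something like the following. The filter_*_bpf_func names are
hypothetical stand-ins for the JITed entry points of the installed filters,
and how seccomp combines the return values is left out:

/* Stand-ins for the JITed entry points of this process' three filters. */
u32 filter_a_bpf_func(const struct seccomp_data *sd);
u32 filter_b_bpf_func(const struct seccomp_data *sd);
u32 filter_c_bpf_func(const struct seccomp_data *sd);

static void run_filters_direct(const struct seccomp_data *sd, u32 ret[3])
{
	/*
	 * Each filter is invoked with a direct call, in the fixed order in
	 * which the filters were attached, instead of walking a list and
	 * making a retpolined indirect call per filter.
	 */
	ret[0] = filter_a_bpf_func(sd);
	ret[1] = filter_b_bpf_func(sd);
	ret[2] = filter_c_bpf_func(sd);
}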

>> 2. Binary-search for system-calls. Use the per-process kernel code-page also
>> to hold multiple trampolines for the 16 common system calls of a certain
>> process. The patch uses an indirection table and a binary-search to find the
>> proper trampoline.
> 
> Same comment applies here.

Branch misprediction wastes ~7 cycles and a retpoline takes at least 30. So
assuming the branch predictor is not completely stupid, 3-4 levels should not
be too much.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18  7:54     ` Peter Zijlstra
@ 2018-10-18 18:14       ` Nadav Amit
  0 siblings, 0 replies; 43+ messages in thread
From: Nadav Amit @ 2018-10-18 18:14 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Ingo Molnar, Andy Lutomirski, H . Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Borislav Petkov, David Woodhouse

at 12:54 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Oct 17, 2018 at 06:22:48PM -0700, Andy Lutomirski wrote:
>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>> 
>>> It is sometimes beneficial to prevent preemption for very few
>>> instructions, or prevent preemption for some instructions that precede
>>> a branch (this latter case will be introduced in the next patches).
>>> 
>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>> (opcode 0x40) as an indication that preemption is disabled for the
>>> following instruction.
>> 
>> Nifty!
>> 
>> That being said, I think you have a few bugs.
> 
>> First, you can’t just ignore a rescheduling interrupt, as you
>> introduce unbounded latency when this happens — you’re effectively
>> emulating preempt_enable_no_resched(), which is not a drop-in
>> replacement for preempt_enable().
> 
>> To fix this, you may need to jump to a slow-path trampoline that calls
>> schedule() at the end or consider rewinding one instruction instead.
>> Or use TF, which is only a little bit terrifying...
> 
> At which point we're very close to in-kernel rseq.

Interesting. I didn’t know about this feature. I’ll see if I can draw some
ideas from there.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-18 17:00           ` Andy Lutomirski
  2018-10-18 17:25             ` Nadav Amit
@ 2018-10-19  1:08             ` Nadav Amit
  2018-10-19  4:29               ` Andy Lutomirski
  2018-10-19  8:33               ` Peter Zijlstra
  1 sibling, 2 replies; 43+ messages in thread
From: Nadav Amit @ 2018-10-19  1:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Andrew Lutomirski, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David

at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:

> 
> 
>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
>> 
>> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> 
>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>>> 
>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>> 
>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>> following instruction.
>>>>> 
>>>>> Nifty!
>>>>> 
>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>> little bit terrifying…
>>>> 
>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>> that happen while the prefix is detected. It would slightly change the
>>>> meaning of the prefix.
>> 
>> So thinking about it further, rewinding the instruction seems the easiest
>> and most robust solution. I’ll do it.
>> 
>>>>> You also aren’t accounting for the case where you get an exception that
>>>>> is, in turn, preempted.
>>>> 
>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>> cannot be preempted).
>>> 
>>> Look for cond_local_irq_enable().
>> 
>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>> use-case, but having said that - this can be fixed too.
> 
> I’m not totally certain there’s a case that matters.  But it’s worth checking 

I am still checking. But I wanted to ask you whether the existing code is
correct, since it seems to me that others make the same mistake I did, unless
I don’t understand the code.

Consider for example do_int3(), and see my inlined comments:

dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
{
	...
	ist_enter(regs); 		// => preempt_disable()
	cond_local_irq_enable(regs);	// => assume it enables IRQs

	...
	// resched irq can be delivered here. It will not cause rescheduling
	// since preemption is disabled

	cond_local_irq_disable(regs);	// => assume it disables IRQs
	ist_exit(regs);			// => preempt_enable_no_resched()
}

At this point a reschedule will not happen for an unbounded length of time
(unless there is another point when exiting the trap handler that checks
whether preemption should take place).

Another example is __BPF_PROG_RUN_ARRAY(), which also uses
preempt_enable_no_resched().

Am I missing something?

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-19  1:08             ` Nadav Amit
@ 2018-10-19  4:29               ` Andy Lutomirski
  2018-10-19  4:44                 ` Nadav Amit
                                   ` (3 more replies)
  2018-10-19  8:33               ` Peter Zijlstra
  1 sibling, 4 replies; 43+ messages in thread
From: Andy Lutomirski @ 2018-10-19  4:29 UTC (permalink / raw)
  To: Nadav Amit, Alexei Starovoitov, Oleg Nesterov
  Cc: Ingo Molnar, Andrew Lutomirski, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David

> On Oct 18, 2018, at 6:08 PM, Nadav Amit <namit@vmware.com> wrote:
>
> at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>>
>>
>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
>>>
>>> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>
>>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>>>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>
>>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>>>>
>>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>>>
>>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>>> following instruction.
>>>>>>
>>>>>> Nifty!
>>>>>>
>>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>>> little bit terrifying…
>>>>>
>>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>>> that happen while the prefix is detected. It would slightly change the
>>>>> meaning of the prefix.
>>>
>>> So thinking about it further, rewinding the instruction seems the easiest
>>> and most robust solution. I’ll do it.
>>>
>>>>>> You also aren’t accounting for the case where you get an exception that
>>>>>> is, in turn, preempted.
>>>>>
>>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>>> cannot be preempted).
>>>>
>>>> Look for cond_local_irq_enable().
>>>
>>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>>> use-case, but having said that - this can be fixed too.
>>
>> I’m not totally certain there’s a case that matters.  But it’s worth checking
>
> I am still checking. But, I wanted to ask you whether the existing code is
> correct, since it seems to me that others do the same mistake I did, unless
> I don’t understand the code.
>
> Consider for example do_int3(), and see my inlined comments:
>
> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> {
>    ...
>    ist_enter(regs);        // => preempt_disable()
>    cond_local_irq_enable(regs);    // => assume it enables IRQs
>
>    ...
>    // resched irq can be delivered here. It will not caused rescheduling
>    // since preemption is disabled
>
>    cond_local_irq_disable(regs);    // => assume it disables IRQs
>    ist_exit(regs);            // => preempt_enable_no_resched()
> }
>
> At this point resched will not happen for unbounded length of time (unless
> there is another point when exiting the trap handler that checks if
> preemption should take place).

I think it's only a bug in the cases where someone uses extable to fix
up an int3 (which would be nuts) or that we oops.  But I should still
fix it.  In the normal case where int3 was in user code, we'll miss
the reschedule in do_trap(), but we'll reschedule in
prepare_exit_to_usermode() -> exit_to_usermode_loop().

>
> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> preempt_enable_no_resched().

Alexei, I think this code is just wrong. Do you know why it uses
preempt_enable_no_resched()?

Oleg, the code in kernel/signal.c:

                preempt_disable();
                read_unlock(&tasklist_lock);
                preempt_enable_no_resched();
                freezable_schedule();

looks bogus.  I don't get what it's trying to achieve with
preempt_disable(), and I also don't see why no_resched does anything.
Sure, it prevents a reschedule triggered during read_unlock() from
causing a reschedule, but it doesn't prevent an interrupt immediately
after the preempt_enable_no_resched() call from scheduling.

--Andy

>
> Am I missing something?
>
> Thanks,
> Nadav

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-19  4:29               ` Andy Lutomirski
@ 2018-10-19  4:44                 ` Nadav Amit
  2018-10-20  1:22                   ` Masami Hiramatsu
  2018-10-19  5:00                 ` Alexei Starovoitov
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread
From: Nadav Amit @ 2018-10-19  4:44 UTC (permalink / raw)
  To: Andy Lutomirski, Alexei Starovoitov, Oleg Nesterov, Masami Hiramatsu
  Cc: Ingo Molnar, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Borislav Petkov, Woodhouse, David

at 9:29 PM, Andy Lutomirski <luto@kernel.org> wrote:

>> On Oct 18, 2018, at 6:08 PM, Nadav Amit <namit@vmware.com> wrote:
>> 
>> at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
>>>> 
>>>> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> 
>>>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
>>>>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> 
>>>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
>>>>>>>> 
>>>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>>>> 
>>>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>>>> following instruction.
>>>>>>> 
>>>>>>> Nifty!
>>>>>>> 
>>>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>>>> little bit terrifying…
>>>>>> 
>>>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>>>> that happen while the prefix is detected. It would slightly change the
>>>>>> meaning of the prefix.
>>>> 
>>>> So thinking about it further, rewinding the instruction seems the easiest
>>>> and most robust solution. I’ll do it.
>>>> 
>>>>>>> You also aren’t accounting for the case where you get an exception that
>>>>>>> is, in turn, preempted.
>>>>>> 
>>>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>>>> cannot be preempted).
>>>>> 
>>>>> Look for cond_local_irq_enable().
>>>> 
>>>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>>>> use-case, but having said that - this can be fixed too.
>>> 
>>> I’m not totally certain there’s a case that matters.  But it’s worth checking
>> 
>> I am still checking. But, I wanted to ask you whether the existing code is
>> correct, since it seems to me that others do the same mistake I did, unless
>> I don’t understand the code.
>> 
>> Consider for example do_int3(), and see my inlined comments:
>> 
>> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>> {
>>   ...
>>   ist_enter(regs);        // => preempt_disable()
>>   cond_local_irq_enable(regs);    // => assume it enables IRQs
>> 
>>   ...
>>   // resched irq can be delivered here. It will not caused rescheduling
>>   // since preemption is disabled
>> 
>>   cond_local_irq_disable(regs);    // => assume it disables IRQs
>>   ist_exit(regs);            // => preempt_enable_no_resched()
>> }
>> 
>> At this point resched will not happen for unbounded length of time (unless
>> there is another point when exiting the trap handler that checks if
>> preemption should take place).
> 
> I think it's only a bug in the cases where someone uses extable to fix
> up an int3 (which would be nuts) or that we oops.  But I should still
> fix it.  In the normal case where int3 was in user code, we'll miss
> the reschedule in do_trap(), but we'll reschedule in
> prepare_exit_to_usermode() -> exit_to_usermode_loop().

Thanks for your quick response, and sorry for bothering you instead of dealing
with it myself. Note that do_debug() does something similar to do_int3().

And then there is optimized_callback() that also uses
preempt_enable_no_resched(). I think the original use was correct, but then
a19b2e3d7839 ("kprobes/x86: Remove IRQ disabling from ftrace-based/optimized
kprobes") removed the IRQ disabling, while leaving
preempt_enable_no_resched(). No?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-19  4:29               ` Andy Lutomirski
  2018-10-19  4:44                 ` Nadav Amit
@ 2018-10-19  5:00                 ` Alexei Starovoitov
  2018-10-19  8:22                   ` Peter Zijlstra
  2018-10-19  8:19                 ` Peter Zijlstra
  2018-10-19 10:38                 ` Oleg Nesterov
  3 siblings, 1 reply; 43+ messages in thread
From: Alexei Starovoitov @ 2018-10-19  5:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Oleg Nesterov, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, LKML, X86 ML, Borislav Petkov,
	Woodhouse, David, daniel, guro

> 
> >
> > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > preempt_enable_no_resched().
> 
> Alexei, I think this code is just wrong.

why 'just wrong' ?

> Do you know why it uses
> preempt_enable_no_resched()?

don't recall precisely.
we could be preemptable at the point where the macro is called.
I think the goal of no_resched was to avoid adding scheduling points
where they didn't exist before just because a prog ran for a few nsec.
Maybe Daniel or Roman remember.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-19  4:29               ` Andy Lutomirski
  2018-10-19  4:44                 ` Nadav Amit
  2018-10-19  5:00                 ` Alexei Starovoitov
@ 2018-10-19  8:19                 ` Peter Zijlstra
  2018-10-19 10:38                 ` Oleg Nesterov
  3 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2018-10-19  8:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Alexei Starovoitov, Oleg Nesterov, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, LKML, X86 ML, Borislav Petkov,
	Woodhouse, David

On Thu, Oct 18, 2018 at 09:29:39PM -0700, Andy Lutomirski wrote:

> > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > preempt_enable_no_resched().
> 
> Alexei, I think this code is just wrong. Do you know why it uses
> preempt_enable_no_resched()?

Yes, that's a straight up bug.

It looks like I need to go fix up abuse again :/

> Oleg, the code in kernel/signal.c:
> 
>                 preempt_disable();
>                 read_unlock(&tasklist_lock);
>                 preempt_enable_no_resched();
>                 freezable_schedule();
> 

The purpose here is to avoid back-to-back schedule() calls, and this
pattern is one of the few correct uses of preempt_enable_no_resched().

Suppose a preemption was requested while we held the read_lock(); then, when
we do read_unlock(), we'd drop preempt_count to 0 and reschedule, and when we
get back we'd instantly call into schedule() _again_.

What this code does is increment preempt_count so that read_unlock() doesn't
hit 0 and therefore doesn't call schedule(), then lower it back to 0 without
a call to schedule(), and then call schedule() explicitly.
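
That is, the same snippet with the intent spelled out (these comments are
added for illustration; they are not in the actual source):

	preempt_disable();		/* on top of the count the rwlock already holds */
	read_unlock(&tasklist_lock);	/* its preempt_enable() can no longer hit 0 */
	preempt_enable_no_resched();	/* drop to 0 without calling schedule() */
	freezable_schedule();		/* the one explicit schedule() call */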


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-19  5:00                 ` Alexei Starovoitov
@ 2018-10-19  8:22                   ` Peter Zijlstra
  2018-10-19 14:47                     ` Alexei Starovoitov
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2018-10-19  8:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Nadav Amit, Oleg Nesterov, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, LKML, X86 ML, Borislav Petkov,
	Woodhouse, David, daniel, guro

On Thu, Oct 18, 2018 at 10:00:53PM -0700, Alexei Starovoitov wrote:
> > 
> > >
> > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > > preempt_enable_no_resched().
> > 
> > Alexei, I think this code is just wrong.
> 
> why 'just wrong' ?

Because you lost a preemption point, this is a no-no.

> 
> > Do you know why it uses
> > preempt_enable_no_resched()?
> 
> dont recall precisely.
> we could be preemptable at the point where macro is called.
> I think the goal of no_resched was to avoid adding scheduling points
> where they didn't exist before just because a prog ran for few nsec.
> May be Daniel or Roman remember.

No, you did the exact opposite: where there previously was a preemption,
you just ate it. The band saw didn't get stopped in time, you lose your
hand, etc.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-19  1:08             ` Nadav Amit
  2018-10-19  4:29               ` Andy Lutomirski
@ 2018-10-19  8:33               ` Peter Zijlstra
  2018-10-19 14:29                 ` Andy Lutomirski
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2018-10-19  8:33 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Ingo Molnar, Andrew Lutomirski, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David

On Fri, Oct 19, 2018 at 01:08:23AM +0000, Nadav Amit wrote:
> Consider for example do_int3(), and see my inlined comments:
> 
> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> {
> 	...
> 	ist_enter(regs); 		// => preempt_disable()
> 	cond_local_irq_enable(regs);	// => assume it enables IRQs
> 
> 	...
> 	// resched irq can be delivered here. It will not caused rescheduling
> 	// since preemption is disabled
> 
> 	cond_local_irq_disable(regs);	// => assume it disables IRQs
> 	ist_exit(regs);			// => preempt_enable_no_resched()
> }
> 
> At this point resched will not happen for unbounded length of time (unless
> there is another point when exiting the trap handler that checks if
> preemption should take place).
> 
> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> preempt_enable_no_resched().
> 
> Am I missing something?

Would not the interrupt return then check for TIF_NEED_RESCHED and call
schedule()?

I think (and this certainly wants a comment) the point is that the ist_exit()
thing relies hard on the interrupt-return path doing the reschedule.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-19  4:29               ` Andy Lutomirski
                                   ` (2 preceding siblings ...)
  2018-10-19  8:19                 ` Peter Zijlstra
@ 2018-10-19 10:38                 ` Oleg Nesterov
  3 siblings, 0 replies; 43+ messages in thread
From: Oleg Nesterov @ 2018-10-19 10:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Alexei Starovoitov, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, LKML, X86 ML, Borislav Petkov,
	Woodhouse, David

On 10/18, Andy Lutomirski wrote:
>
> Oleg, the code in kernel/signal.c:
>
>                 preempt_disable();
>                 read_unlock(&tasklist_lock);
>                 preempt_enable_no_resched();
>                 freezable_schedule();
>
> looks bogus.  I don't get what it's trying to achieve with
> preempt_disable(), and I also don't see why no_resched does anything.
> Sure, it prevents a reschedule triggered during read_unlock() from
> causing a reschedule,

Yes. Let's suppose we remove preempt_disable/enable.

The debugger was already woken up; if it runs on the same CPU it will quite
possibly preempt the tracee. After that the debugger will spin in
wait_task_inactive(), until it is in turn preempted or calls
schedule_timeout(1), so that the tracee (current) can finally call
__schedule(preempt = F) and deactivate_task() to become inactive.

> but it doesn't prevent an interrupt immediately
> after the preempt_enable_no_resched() call from scheduling.

Yes, but this is less likely.

Oleg.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-19  8:33               ` Peter Zijlstra
@ 2018-10-19 14:29                 ` Andy Lutomirski
  2018-11-29  9:46                   ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2018-10-19 14:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, Ingo Molnar, Andrew Lutomirski, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David



> On Oct 19, 2018, at 1:33 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Fri, Oct 19, 2018 at 01:08:23AM +0000, Nadav Amit wrote:
>> Consider for example do_int3(), and see my inlined comments:
>> 
>> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>> {
>>    ...
>>    ist_enter(regs);        // => preempt_disable()
>>    cond_local_irq_enable(regs);    // => assume it enables IRQs
>> 
>>    ...
>>    // resched irq can be delivered here. It will not caused rescheduling
>>    // since preemption is disabled
>> 
>>    cond_local_irq_disable(regs);    // => assume it disables IRQs
>>    ist_exit(regs);            // => preempt_enable_no_resched()
>> }
>> 
>> At this point resched will not happen for unbounded length of time (unless
>> there is another point when exiting the trap handler that checks if
>> preemption should take place).
>> 
>> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
>> preempt_enable_no_resched().
>> 
>> Am I missing something?
> 
> Would not the interrupt return then check for TIF_NEED_RESCHED and call
> schedule() ?

The paranoid exit path doesn’t check TIF_NEED_RESCHED because it’s fundamentally atomic — it’s running on a percpu stack and it can’t schedule. In theory we could do some evil stack switching, but we don’t.

How does NMI handle this?  If an NMI that hit interruptible kernel code overflows a perf counter, how does the wake up work?

(do_int3() is special because it’s not actually IST.  But it can hit in odd places due to kprobes, and I’m nervous about recursing incorrectly into RCU and context tracking code if we were to use exception_enter().)

> 
> I think (and this certainly wants a comment) is that the ist_exit()
> thing hard relies on the interrupt-return path doing the reschedule.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-19  8:22                   ` Peter Zijlstra
@ 2018-10-19 14:47                     ` Alexei Starovoitov
  0 siblings, 0 replies; 43+ messages in thread
From: Alexei Starovoitov @ 2018-10-19 14:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Nadav Amit, Oleg Nesterov, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, LKML, X86 ML, Borislav Petkov,
	David Woodhouse, Daniel Borkmann, Roman Gushchin

On Fri, Oct 19, 2018 at 1:22 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 18, 2018 at 10:00:53PM -0700, Alexei Starovoitov wrote:
> > >
> > > >
> > > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > > > preempt_enable_no_resched().
> > >
> > > Alexei, I think this code is just wrong.
> >
> > why 'just wrong' ?
>
> Because you lost a preemption point, this is a no-no.
>
> >
> > > Do you know why it uses
> > > preempt_enable_no_resched()?
> >
> > dont recall precisely.
> > we could be preemptable at the point where macro is called.
> > I think the goal of no_resched was to avoid adding scheduling points
> > where they didn't exist before just because a prog ran for few nsec.
> > May be Daniel or Roman remember.
>
> No, you did the exact opposite, where there previously was a preemption,
> you just ate it. The band saw didn't get stopped in time, you loose your
> hand etc..

Let me do a few experiments then.
We will fix it up.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-19  4:44                 ` Nadav Amit
@ 2018-10-20  1:22                   ` Masami Hiramatsu
  0 siblings, 0 replies; 43+ messages in thread
From: Masami Hiramatsu @ 2018-10-20  1:22 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Alexei Starovoitov, Oleg Nesterov, Ingo Molnar,
	Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, LKML, X86 ML,
	Borislav Petkov, Woodhouse, David

On Fri, 19 Oct 2018 04:44:33 +0000
Nadav Amit <namit@vmware.com> wrote:

> at 9:29 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> >> On Oct 18, 2018, at 6:08 PM, Nadav Amit <namit@vmware.com> wrote:
> >> 
> >> at 10:00 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> 
> >>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <namit@vmware.com> wrote:
> >>>> 
> >>>> at 8:51 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>>> 
> >>>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <namit@vmware.com> wrote:
> >>>>>> at 6:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>>>>> 
> >>>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <namit@vmware.com> wrote:
> >>>>>>>> 
> >>>>>>>> It is sometimes beneficial to prevent preemption for very few
> >>>>>>>> instructions, or prevent preemption for some instructions that precede
> >>>>>>>> a branch (this latter case will be introduced in the next patches).
> >>>>>>>> 
> >>>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
> >>>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
> >>>>>>>> following instruction.
> >>>>>>> 
> >>>>>>> Nifty!
> >>>>>>> 
> >>>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
> >>>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
> >>>>>>> happens ― you’re effectively emulating preempt_enable_no_resched(), which
> >>>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
> >>>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
> >>>>>>> consider rewinding one instruction instead. Or use TF, which is only a
> >>>>>>> little bit terrifying…
> >>>>>> 
> >>>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> >>>>>> easiest solution would be to make synchronize_sched() ignore preemptions
> >>>>>> that happen while the prefix is detected. It would slightly change the
> >>>>>> meaning of the prefix.
> >>>> 
> >>>> So thinking about it further, rewinding the instruction seems the easiest
> >>>> and most robust solution. I’ll do it.
> >>>> 
> >>>>>>> You also aren’t accounting for the case where you get an exception that
> >>>>>>> is, in turn, preempted.
> >>>>>> 
> >>>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
> >>>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
> >>>>>> cannot be preempted).
> >>>>> 
> >>>>> Look for cond_local_irq_enable().
> >>>> 
> >>>> I looked at it. Yet, I still don’t see how exceptions might happen in my
> >>>> use-case, but having said that - this can be fixed too.
> >>> 
> >>> I’m not totally certain there’s a case that matters.  But it’s worth checking
> >> 
> >> I am still checking. But, I wanted to ask you whether the existing code is
> >> correct, since it seems to me that others do the same mistake I did, unless
> >> I don’t understand the code.
> >> 
> >> Consider for example do_int3(), and see my inlined comments:
> >> 
> >> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> >> {
> >>   ...
> >>   ist_enter(regs);        // => preempt_disable()
> >>   cond_local_irq_enable(regs);    // => assume it enables IRQs
> >> 
> >>   ...
> >>   // resched irq can be delivered here. It will not caused rescheduling
> >>   // since preemption is disabled
> >> 
> >>   cond_local_irq_disable(regs);    // => assume it disables IRQs
> >>   ist_exit(regs);            // => preempt_enable_no_resched()
> >> }
> >> 
> >> At this point resched will not happen for unbounded length of time (unless
> >> there is another point when exiting the trap handler that checks if
> >> preemption should take place).
> > 
> > I think it's only a bug in the cases where someone uses extable to fix
> > up an int3 (which would be nuts) or that we oops.  But I should still
> > fix it.  In the normal case where int3 was in user code, we'll miss
> > the reschedule in do_trap(), but we'll reschedule in
> > prepare_exit_to_usermode() -> exit_to_usermode_loop().
> 
> Thanks for your quick response, and sorry for bothering instead of dealing
> with it. Note that do_debug() does something similar to do_int3().
> 
> And then there is optimized_callback() that also uses
> preempt_enable_no_resched(). I think the original use was correct, but then
> a19b2e3d7839 ("kprobes/x86: Remove IRQ disabling from ftrace-based/optimized
> kprobes”) removed the IRQ disabling, while leaving
> preempt_enable_no_resched() . No?

Ah, good catch!
Indeed, we don't need to stick with no_resched anymore.

Thanks!


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-10-18  0:54 [RFC PATCH 0/5] x86: dynamic indirect call promotion Nadav Amit
                   ` (4 preceding siblings ...)
  2018-10-18  0:54 ` [RFC PATCH 5/5] x86: relpoline: disabling interface Nadav Amit
@ 2018-10-23 18:36 ` Dave Hansen
  2018-10-23 20:32   ` Nadav Amit
  2018-11-28 16:08 ` Josh Poimboeuf
  6 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2018-10-23 18:36 UTC (permalink / raw)
  To: Nadav Amit, Ingo Molnar
  Cc: Andy Lutomirski, Peter Zijlstra, H . Peter Anvin,
	Thomas Gleixner, linux-kernel, Nadav Amit, x86, Borislav Petkov,
	David Woodhouse

On 10/17/18 5:54 PM, Nadav Amit wrote:
> 		base		relpoline
> 		----		---------
> nginx 	22898 		25178 (+10%)
> redis-ycsb	24523		25486 (+4%)
> dbench	2144		2103 (+2%)

Just out of curiosity, which indirect branches are the culprits here for
causing the slowdowns?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-10-23 18:36 ` [RFC PATCH 0/5] x86: dynamic indirect call promotion Dave Hansen
@ 2018-10-23 20:32   ` Nadav Amit
  2018-10-23 20:37     ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Nadav Amit @ 2018-10-23 20:32 UTC (permalink / raw)
  To: Dave Hansen, Nadav Amit, Ingo Molnar
  Cc: Andy Lutomirski, Peter Zijlstra, H . Peter Anvin,
	Thomas Gleixner, linux-kernel, x86, Borislav Petkov,
	David Woodhouse

at 11:36 AM, Dave Hansen <dave.hansen@intel.com> wrote:

> On 10/17/18 5:54 PM, Nadav Amit wrote:
>> base		relpoline
>> 		----		---------
>> nginx 	22898 		25178 (+10%)
>> redis-ycsb	24523		25486 (+4%)
>> dbench	2144		2103 (+2%)
> 
> Just out of curiosity, which indirect branches are the culprits here for
> causing the slowdowns?

So I didn’t try to measure exactly which ones. There are roughly 500 that
actually “run” in my tests. Initially, I took the silly approach of trying
to patch the C source code using semi-automatically generated Coccinelle
scripts, so I can tell you it is not just a few branches but many. The
network stack is full of function pointers (e.g., tcp_congestion_ops,
tcp_sock_af_ops, dst_ops). The file system also uses many function pointers
(file_operations specifically). Compound pages have a destructor, and so on.

If you want, you can rebuild the kernel without retpolines and run
	
  perf record -e br_inst_exec.taken_indirect_near_call:k (your workload)

For some reason I didn’t manage to use PEBS (:ppp) from either the guest or
the host, so my results are a bit skewed (i.e., the sampled location is
usually after the call was taken). Running dbench in the VM gives me the
following “hot-spots”:

# Samples: 304  of event 'br_inst_exec.taken_indirect_near_call'
# Event count (approx.): 60800912
#
# Overhead  Command  Shared Object            Symbol                                       
# ........  .......  .......................  .............................................
#
     5.26%  :197970  [guest.kernel.kallsyms]  [g] __fget_light
     4.28%  :197969  [guest.kernel.kallsyms]  [g] __fget_light
     3.95%  :197969  [guest.kernel.kallsyms]  [g] dcache_readdir
     3.29%  :197970  [guest.kernel.kallsyms]  [g] next_positive.isra.14
     2.96%  :197970  [guest.kernel.kallsyms]  [g] __do_sys_kill
     2.30%  :197970  [guest.kernel.kallsyms]  [g] apparmor_file_open
     1.97%  :197969  [guest.kernel.kallsyms]  [g] __do_sys_kill
     1.97%  :197969  [guest.kernel.kallsyms]  [g] next_positive.isra.14
     1.97%  :197970  [guest.kernel.kallsyms]  [g] _raw_spin_lock
     1.64%  :197969  [guest.kernel.kallsyms]  [g] __alloc_file
     1.64%  :197969  [guest.kernel.kallsyms]  [g] common_file_perm
     1.64%  :197969  [guest.kernel.kallsyms]  [g] filldir
     1.64%  :197970  [guest.kernel.kallsyms]  [g] do_dentry_open
     1.64%  :197970  [guest.kernel.kallsyms]  [g] kmem_cache_free
     1.32%  :197969  [guest.kernel.kallsyms]  [g] __raw_callee_save___pv_queued_spin_unlock
     1.32%  :197969  [guest.kernel.kallsyms]  [g] __slab_free

Regards,
Nadav

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-10-23 20:32   ` Nadav Amit
@ 2018-10-23 20:37     ` Dave Hansen
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2018-10-23 20:37 UTC (permalink / raw)
  To: Nadav Amit, Nadav Amit, Ingo Molnar
  Cc: Andy Lutomirski, Peter Zijlstra, H . Peter Anvin,
	Thomas Gleixner, linux-kernel, x86, Borislav Petkov,
	David Woodhouse

On 10/23/18 1:32 PM, Nadav Amit wrote:
>> On 10/17/18 5:54 PM, Nadav Amit wrote:
>>>            base            relpoline
>>>            ----            ---------
>>> nginx      22898           25178 (+10%)
>>> redis-ycsb 24523           25486 (+4%)
>>> dbench     2144            2103 (+2%)
>> Just out of curiosity, which indirect branches are the culprits here for
>> causing the slowdowns?
> So I didn’t try to measure exactly which one. There are roughly 500 that
> actually “run” in my tests.

OK, cool, that's pretty much all I wanted to know, just that there
aren't 3 of them or something for which we need all this infrastructure.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-10-18  0:54 [RFC PATCH 0/5] x86: dynamic indirect call promotion Nadav Amit
                   ` (5 preceding siblings ...)
  2018-10-23 18:36 ` [RFC PATCH 0/5] x86: dynamic indirect call promotion Dave Hansen
@ 2018-11-28 16:08 ` Josh Poimboeuf
  2018-11-28 19:34   ` Nadav Amit
  6 siblings, 1 reply; 43+ messages in thread
From: Josh Poimboeuf @ 2018-11-28 16:08 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, H . Peter Anvin ,
	Thomas Gleixner, linux-kernel, Nadav Amit, x86, Borislav Petkov,
	David Woodhouse

On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> This RFC introduces indirect call promotion in runtime, which for the
> matter of simplification (and branding) will be called here "relpolines"
> (relative call + trampoline). Relpolines are mainly intended as a way
> of reducing retpoline overheads due to Spectre v2.
> 
> Unlike indirect call promotion through profile guided optimization, the
> proposed approach does not require a profiling stage, works well with
> modules whose address is unknown and can adapt to changing workloads.
> 
> The main idea is simple: for every indirect call, we inject a piece of
> code with fast- and slow-path calls. The fast path is used if the target
> matches the expected (hot) target. The slow-path uses a retpoline.
> During training, the slow-path is set to call a function that saves the
> call source and target in a hash-table and keep count for call
> frequency. The most common target is then patched into the hot path.
> 
> The patching is done on-the-fly by patching the conditional branch
> (opcode and offset) that is used to compare the target to the hot
> target. This allows to direct all cores to the fast-path, while patching
> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> patch a single byte when the code might be executed by any core. (2)
> When patching more than one byte, ensure that all cores do not run the
> to-be-patched-code by preventing this code from being preempted, and
> using synchronize_sched() after patching the branch that jumps over this
> code.
> 
> Changing all the indirect calls to use relpolines is done using assembly
> macro magic. There are alternative solutions, but this one is
> relatively simple and transparent. There is also logic to retrain the
> software predictor, but the policy it uses may need to be refined.
> 
> Eventually the results are not bad (2 VCPU VM, throughput reported):
> 
>            base            relpoline
>            ----            ---------
> nginx      22898           25178 (+10%)
> redis-ycsb 24523           25486 (+4%)
> dbench     2144            2103 (+2%)
> 
> When retpolines are disabled, and if retraining is off, performance
> benefits are up to 2% (nginx), but are much less impressive.

Hi Nadav,

Peter pointed me to these patches during a discussion about retpoline
profiling.  Personally, I think this is brilliant.  This could help
networking and filesystem intensive workloads a lot.

Some high-level comments:

- "Relpoline" looks confusingly a lot like "retpoline".  How about
  "optpoline"?  To avoid confusing myself I will hereafter refer to it
  as such :-)

- Instead of patching one byte at a time, is there a reason why
  text_poke_bp() can't be used?  That would greatly simplify the
  patching process, as everything could be patched in a single step.

- In many cases, a single direct call may not be sufficient, as there
  could be for example multiple tasks using different network protocols
  which need different callbacks for the same call site.

- I'm not sure about the periodic retraining logic, it seems a bit
  nondeterministic and bursty.
  
So I'd propose the following changes:

- In the optpoline, reserve space for multiple (5 or so) comparisons and
  direct calls.  Maybe the number of reserved cmp/jne/call slots can be
  tweaked by the caller somehow.  Or maybe it could grow as needed.
  Starting out, they would just be NOPs.

- Instead of the temporary learning mode, add permanent tracking to
  detect a direct call "miss" -- i.e., when none of the existing direct
  calls are applicable and the retpoline will be used.

- In the case of a miss (or N misses), it could trigger a direct call
  patching operation to be run later (workqueue or syscall exit).  If
  all the direct call slots are full, it could patch the least recently
  modified one.  If this causes thrashing (>x changes over y time), it
  could increase the number of direct call slots using a trampoline.
  Even if there were several slots, CPU branch prediction would
  presumably help make it much faster than a basic retpoline.
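
To make the slot idea concrete, here is a rough userspace C sketch of a
two-slot promoted call (everything here - the names, the slot count, the
promoted_call() wrapper - is invented for illustration; the real thing
would be generated code with a retpoline on the miss path):

  #include <stdio.h>

  static int hot_a(int x) { return x + 1; }
  static int hot_b(int x) { return x * 2; }
  static int cold(int x)  { return x - 1; }

  /* Two reserved compare-and-call "slots", then the generic fallback. */
  static int promoted_call(int (*fn)(int), int x)
  {
          if (fn == hot_a)        /* slot 1: direct, easily predicted call */
                  return hot_a(x);
          if (fn == hot_b)        /* slot 2 */
                  return hot_b(x);
          return fn(x);           /* miss: indirect call (retpoline in the
                                     kernel); also where misses get counted */
  }

  int main(void)
  {
          printf("%d %d %d\n", promoted_call(hot_a, 1),
                 promoted_call(hot_b, 1), promoted_call(cold, 1));
          return 0;
  }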

Thoughts?

-- 
Josh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-11-28 16:08 ` Josh Poimboeuf
@ 2018-11-28 19:34   ` Nadav Amit
  2018-11-29  0:38     ` Josh Poimboeuf
  0 siblings, 1 reply; 43+ messages in thread
From: Nadav Amit @ 2018-11-28 19:34 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, H . Peter Anvin,
	Thomas Gleixner, linux-kernel, x86, Borislav Petkov,
	David Woodhouse

> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>> This RFC introduces indirect call promotion in runtime, which for the
>> matter of simplification (and branding) will be called here "relpolines"
>> (relative call + trampoline). Relpolines are mainly intended as a way
>> of reducing retpoline overheads due to Spectre v2.
>> 
>> Unlike indirect call promotion through profile guided optimization, the
>> proposed approach does not require a profiling stage, works well with
>> modules whose address is unknown and can adapt to changing workloads.
>> 
>> The main idea is simple: for every indirect call, we inject a piece of
>> code with fast- and slow-path calls. The fast path is used if the target
>> matches the expected (hot) target. The slow-path uses a retpoline.
>> During training, the slow-path is set to call a function that saves the
>> call source and target in a hash-table and keep count for call
>> frequency. The most common target is then patched into the hot path.
>> 
>> The patching is done on-the-fly by patching the conditional branch
>> (opcode and offset) that is used to compare the target to the hot
>> target. This allows to direct all cores to the fast-path, while patching
>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>> patch a single byte when the code might be executed by any core. (2)
>> When patching more than one byte, ensure that all cores do not run the
>> to-be-patched-code by preventing this code from being preempted, and
>> using synchronize_sched() after patching the branch that jumps over this
>> code.
>> 
>> Changing all the indirect calls to use relpolines is done using assembly
>> macro magic. There are alternative solutions, but this one is
>> relatively simple and transparent. There is also logic to retrain the
>> software predictor, but the policy it uses may need to be refined.
>> 
>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>> 
>>            base            relpoline
>>            ----            ---------
>> nginx      22898           25178 (+10%)
>> redis-ycsb 24523           25486 (+4%)
>> dbench     2144            2103 (+2%)
>> 
>> When retpolines are disabled, and if retraining is off, performance
>> benefits are up to 2% (nginx), but are much less impressive.
> 
> Hi Nadav,
> 
> Peter pointed me to these patches during a discussion about retpoline
> profiling.  Personally, I think this is brilliant.  This could help
> networking and filesystem intensive workloads a lot.

Thanks! I was a bit held back by the relatively limited number of responses.
I finished another version two weeks ago, and every day I think: “should it
be RFCv2 or v1”, ending up not sending it…

There is one issue that I realized while working on the new version: I’m not
sure it is well-defined what an outlined retpoline is allowed to do. The
indirect branch promotion code can change rflags, which might cause
correctness issues. In practice, using gcc, it is not a problem.

> Some high-level comments:
> 
> - "Relpoline" looks confusingly a lot like "retpoline".  How about
>  "optpoline"?  To avoid confusing myself I will hereafter refer to it
>  as such :-)

Sure. For the academic paper we submitted, we used a much worse name that my
colleague came up with. I’m ok with anything other than that name (not
mentioned to prevent double-blinding violations). I’ll go with your name.

> - Instead of patching one byte at a time, is there a reason why
>  text_poke_bp() can't be used?  That would greatly simplify the
>  patching process, as everything could be patched in a single step.

I thought of it and maybe it is somehow possible, but there are several
problems, for which I didn’t find a simple solution:

1. An indirect branch inside the BP handler might be the one we patch

2. An indirect branch inside an interrupt or NMI handler might be the
   one we patch

3. Overall, we need to patch more than a single instruction, and
   text_poke_bp() is not suitable

> - In many cases, a single direct call may not be sufficient, as there
>  could be for example multiple tasks using different network protocols
>  which need different callbacks for the same call site.

We want to know during compilation how many targets to use. It is not
super-simple to support multiple inlined targets, but it is feasible if you
are willing to annotate when multiple targets are needed. We have a version
which uses outlined indirect branch promotion when there are multiple
targets, but it’s not ready for prime time, and the code-cache misses can
induce some overheads.

> - I'm not sure about the periodic retraining logic, it seems a bit
>  nondeterministic and bursty.

I agree. It can be limited to cases in which modules are loaded/removed,
or when the user explicitly asks for it to take place.

> 
> So I'd propose the following changes:
> 
> - In the optpoline, reserve space for multiple (5 or so) comparisons and
>  direct calls.  Maybe the number of reserved cmp/jne/call slots can be
>  tweaked by the caller somehow.  Or maybe it could grow as needed.
>  Starting out, they would just be NOPs.
> 
> - Instead of the temporary learning mode, add permanent tracking to
>  detect a direct call "miss" -- i.e., when none of the existing direct
>  calls are applicable and the retpoline will be used.
> 
> - In the case of a miss (or N misses), it could trigger a direct call
>  patching operation to be run later (workqueue or syscall exit).  If
>  all the direct call slots are full, it could patch the least recently
>  modified one.  If this causes thrashing (>x changes over y time), it
>  could increase the number of direct call slots using a trampoline.
>  Even if there were several slots, CPU branch prediction would
>  presumably help make it much faster than a basic retpoline.
> 
> Thoughts?

I’m ok with these changes in general, although having multiple inline
targets is not super-simple. However, there are a few issues:

- There is potentially a negative impact due to code-size increase which
  I was worried about.

- I see no reason not to use all the available slots immediately when
  we encounter a miss.

- The order of the branches might be “wrong” (unoptimized) if we do not
  do any relearning.

- The main question is what to do if we run out of slots and still get
  (many?) misses. I presume the right thing is to disable the optpoline
  and jump over it to the retpoline.

Thanks again for the feedback, and please let me know what you think about
my concerns.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-11-28 19:34   ` Nadav Amit
@ 2018-11-29  0:38     ` Josh Poimboeuf
  2018-11-29  1:40       ` Andy Lutomirski
  0 siblings, 1 reply; 43+ messages in thread
From: Josh Poimboeuf @ 2018-11-29  0:38 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Ingo Molnar, Andy Lutomirski, Peter Zijlstra, H . Peter Anvin,
	Thomas Gleixner, linux-kernel, x86, Borislav Petkov,
	David Woodhouse

On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> >> This RFC introduces indirect call promotion in runtime, which for the
> >> matter of simplification (and branding) will be called here "relpolines"
> >> (relative call + trampoline). Relpolines are mainly intended as a way
> >> of reducing retpoline overheads due to Spectre v2.
> >> 
> >> Unlike indirect call promotion through profile guided optimization, the
> >> proposed approach does not require a profiling stage, works well with
> >> modules whose address is unknown and can adapt to changing workloads.
> >> 
> >> The main idea is simple: for every indirect call, we inject a piece of
> >> code with fast- and slow-path calls. The fast path is used if the target
> >> matches the expected (hot) target. The slow-path uses a retpoline.
> >> During training, the slow-path is set to call a function that saves the
> >> call source and target in a hash-table and keep count for call
> >> frequency. The most common target is then patched into the hot path.
> >> 
> >> The patching is done on-the-fly by patching the conditional branch
> >> (opcode and offset) that is used to compare the target to the hot
> >> target. This allows to direct all cores to the fast-path, while patching
> >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> >> patch a single byte when the code might be executed by any core. (2)
> >> When patching more than one byte, ensure that all cores do not run the
> >> to-be-patched-code by preventing this code from being preempted, and
> >> using synchronize_sched() after patching the branch that jumps over this
> >> code.
> >> 
> >> Changing all the indirect calls to use relpolines is done using assembly
> >> macro magic. There are alternative solutions, but this one is
> >> relatively simple and transparent. There is also logic to retrain the
> >> software predictor, but the policy it uses may need to be refined.
> >> 
> >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> >> 
> >> 		base		relpoline
> >> 		----		---------
> >> nginx 	22898 		25178 (+10%)
> >> redis-ycsb	24523		25486 (+4%)
> >> dbench	2144		2103 (+2%)
> >> 
> >> When retpolines are disabled, and if retraining is off, performance
> >> benefits are up to 2% (nginx), but are much less impressive.
> > 
> > Hi Nadav,
> > 
> > Peter pointed me to these patches during a discussion about retpoline
> > profiling.  Personally, I think this is brilliant.  This could help
> > networking and filesystem intensive workloads a lot.
> 
> Thanks! I was a bit held-back by the relatively limited number of responses.

It is a rather, erm, ambitious idea, maybe they were speechless :-)

> I finished another version two weeks ago, and every day I think: "should it
> be RFCv2 or v1”, ending up not sending it…
> 
> There is one issue that I realized while working on the new version: I’m not
> sure it is well-defined what an outline retpoline is allowed to do. The
> indirect branch promotion code can change rflags, which might cause
> correction issues. In practice, using gcc, it is not a problem.

Callees can clobber flags, so it seems fine to me.

> > Some high-level comments:
> > 
> > - "Relpoline" looks confusingly a lot like "retpoline".  How about
> >  "optpoline"?  To avoid confusing myself I will hereafter refer to it
> >  as such :-)
> 
> Sure. For the academic paper we submitted, we used a much worse name that my
> colleague came up with. I’m ok with anything other than that name (not
> mentioned to prevent double-blinding violations). I’ll go with your name.
> 
> > - Instead of patching one byte at a time, is there a reason why
> >  text_poke_bp() can't be used?  That would greatly simplify the
> >  patching process, as everything could be patched in a single step.
> 
> I thought of it and maybe it is somehow possible, but there are several
> problems, for which I didn’t find a simple solution:
> 
> 1. An indirect branch inside the BP handler might be the one we patch

I _think_ nested INT3s should be doable, because they don't use IST.
Maybe Andy can clarify.

To avoid recursion we'd have to make sure not to use any function
pointers in do_int3() before or during the call to poke_int3_handler().

Incidentally I'm working on a patch which adds an indirect call to
poke_int3_handler().  We'd have to disable optpolines for that code.

> 2. An indirect branch inside an interrupt or NMI handler might be the
>    one we patch

But INT3s just use the existing stack, and NMIs support nesting, so I'm
thinking that should also be doable.  Andy?

> 3. Overall, we need to patch more than a single instruction, and
>    text_poke_bp() is not suitable

Hm, I suppose that's true.

> > - In many cases, a single direct call may not be sufficient, as there
> >  could be for example multiple tasks using different network protocols
> >  which need different callbacks for the same call site.
> 
> We want to know during compilation how many targets to use. It is not
> super-simple to support multiple inlined targets, but it is feasible if you
> are willing to annotate when multiple targets are needed.

Why would an annotation be needed?  Is there a reason why the 'call'
macro wouldn't work?

I hate to say it, but another option would be a compiler plugin.

> We have a version which uses outlined indirect branch promotion when
> there are multiple targets, but it’s not ready for prime time, and the
> code-cache misses can induce some overheads.

It would be interesting to see some measurements comparing inline and
out-of-line.  If there were multiple direct call slots then out-of-line
could have advantages in some cases since it reduces the original
function's i-cache footprint.  But maybe it wouldn't be worth it.

> > - I'm not sure about the periodic retraining logic, it seems a bit
> >  nondeterministic and bursty.
> 
> I agree. It can be limited to cases in which modules are loaded/removed,
> or when the user explicitly asks for it to take place.
> 
> > 
> > So I'd propose the following changes:
> > 
> > - In the optpoline, reserve space for multiple (5 or so) comparisons and
> >  direct calls.  Maybe the number of reserved cmp/jne/call slots can be
> >  tweaked by the caller somehow.  Or maybe it could grow as needed.
> >  Starting out, they would just be NOPs.
> > 
> > - Instead of the temporary learning mode, add permanent tracking to
> >  detect a direct call "miss" -- i.e., when none of the existing direct
> >  calls are applicable and the retpoline will be used.
> > 
> > - In the case of a miss (or N misses), it could trigger a direct call
> >  patching operation to be run later (workqueue or syscall exit).  If
> >  all the direct call slots are full, it could patch the least recently
> >  modified one.  If this causes thrashing (>x changes over y time), it
> >  could increase the number of direct call slots using a trampoline.
> >  Even if there were several slots, CPU branch prediction would
> >  presumably help make it much faster than a basic retpoline.
> > 
> > Thoughts?
> 
> I’m ok with these changes in general, although having multiple inline
> targets is not super-simple. However, there are a few issues:
> 
> - There is potentially a negative impact due to code-size increase which
>   I was worried about.

True.  Though out-of-line could help with that (but it would also have
downsides of course).

> - I see no reason not to use all the available slots immediately when
>   we encounter a miss.

Agreed.

> - The order of the branches might be “wrong” (unoptimized) if we do not
>   do any relearning.

True, though branch prediction would mitigate that to a certain degree.
Also if the number of slots is reasonably small (which it will probably
need to be anyway) then it might be good enough.

> - The main question is what to do if we run out of slots and still get
>   (many?) misses. I presume the right thing is to disable the optpoline
>   and jump over it to the retpoline.

It may need some experimentation.  You could just try a simple
round-robin patching scheme.  Possibly only writing after X number of
misses.  And if that doesn't help, just disable it.  Such a simple
approach might work well enough in practice.
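
Very roughly, and purely as an illustration of the policy rather than of any
real code in the patches, the bookkeeping could look something like this
(the struct, the threshold and the NR_SLOTS value are all invented):

  #include <stdbool.h>
  #include <stddef.h>

  #define NR_SLOTS        4
  #define MISS_THRESHOLD  64

  struct optpoline_state {
          void *slot_target[NR_SLOTS];    /* targets currently patched in */
          unsigned int next_slot;         /* round-robin victim */
          unsigned int misses;            /* misses since the last repatch */
          bool disabled;                  /* set if the site keeps thrashing */
  };

  /* Invented hook: called from the slow path when no slot matched. */
  static void optpoline_miss(struct optpoline_state *op, void *target)
  {
          if (op->disabled)
                  return;

          if (++op->misses < MISS_THRESHOLD)
                  return;

          /* Pick the next slot round-robin and record the new target; the
           * actual text patching would be deferred (workqueue etc.). */
          op->slot_target[op->next_slot] = target;
          op->next_slot = (op->next_slot + 1) % NR_SLOTS;
          op->misses = 0;
  }

  int main(void)
  {
          struct optpoline_state op = { { NULL } };
          int dummy = 0;

          for (int i = 0; i < 2 * MISS_THRESHOLD; i++)
                  optpoline_miss(&op, &dummy);
          return 0;
  }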

-- 
Josh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-11-29  0:38     ` Josh Poimboeuf
@ 2018-11-29  1:40       ` Andy Lutomirski
  2018-11-29  2:06         ` Nadav Amit
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2018-11-29  1:40 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Nadav Amit, Ingo Molnar, Andrew Lutomirski, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, LKML, X86 ML, Borislav Petkov,
	Woodhouse, David

On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> > > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > >
> > > On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> > >> This RFC introduces indirect call promotion in runtime, which for the
> > >> matter of simplification (and branding) will be called here "relpolines"
> > >> (relative call + trampoline). Relpolines are mainly intended as a way
> > >> of reducing retpoline overheads due to Spectre v2.
> > >>
> > >> Unlike indirect call promotion through profile guided optimization, the
> > >> proposed approach does not require a profiling stage, works well with
> > >> modules whose address is unknown and can adapt to changing workloads.
> > >>
> > >> The main idea is simple: for every indirect call, we inject a piece of
> > >> code with fast- and slow-path calls. The fast path is used if the target
> > >> matches the expected (hot) target. The slow-path uses a retpoline.
> > >> During training, the slow-path is set to call a function that saves the
> > >> call source and target in a hash-table and keep count for call
> > >> frequency. The most common target is then patched into the hot path.
> > >>
> > >> The patching is done on-the-fly by patching the conditional branch
> > >> (opcode and offset) that is used to compare the target to the hot
> > >> target. This allows to direct all cores to the fast-path, while patching
> > >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> > >> patch a single byte when the code might be executed by any core. (2)
> > >> When patching more than one byte, ensure that all cores do not run the
> > >> to-be-patched-code by preventing this code from being preempted, and
> > >> using synchronize_sched() after patching the branch that jumps over this
> > >> code.
> > >>
> > >> Changing all the indirect calls to use relpolines is done using assembly
> > >> macro magic. There are alternative solutions, but this one is
> > >> relatively simple and transparent. There is also logic to retrain the
> > >> software predictor, but the policy it uses may need to be refined.
> > >>
> > >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> > >>
> > >>            base            relpoline
> > >>            ----            ---------
> > >> nginx      22898           25178 (+10%)
> > >> redis-ycsb 24523           25486 (+4%)
> > >> dbench     2144            2103 (+2%)
> > >>
> > >> When retpolines are disabled, and if retraining is off, performance
> > >> benefits are up to 2% (nginx), but are much less impressive.
> > >
> > > Hi Nadav,
> > >
> > > Peter pointed me to these patches during a discussion about retpoline
> > > profiling.  Personally, I think this is brilliant.  This could help
> > > networking and filesystem intensive workloads a lot.
> >
> > Thanks! I was a bit held-back by the relatively limited number of responses.
>
> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>
> > I finished another version two weeks ago, and every day I think: "should it
> > be RFCv2 or v1”, ending up not sending it…
> >
> > There is one issue that I realized while working on the new version: I’m not
> > sure it is well-defined what an outline retpoline is allowed to do. The
> > indirect branch promotion code can change rflags, which might cause
> > correction issues. In practice, using gcc, it is not a problem.
>
> Callees can clobber flags, so it seems fine to me.

Just to check I understand your approach right: you made a macro
called "call", and you're therefore causing all instances of "call" to
become magic?  This is... terrifying.  It's even plausibly worse than
"#define if" :)  The scariest bit is that it will impact inline asm as
well.  Maybe a gcc plugin would be less alarming?

> >
> > 1. An indirect branch inside the BP handler might be the one we patch
>
> I _think_ nested INT3s should be doable, because they don't use IST.
> Maybe Andy can clarify.

int3 should survive recursion these days.  Although I admit I'm
currently wondering what happens if one thread puts a kprobe on an
address that another thread tries to text_poke.

Also, this relpoline magic is likely to start patching text at runtime
on a semi-regular basis.  This type of patching is *slow*.  Is it a
problem?

> > 2. An indirect branch inside an interrupt or NMI handler might be the
> >    one we patch
>
> But INT3s just use the existing stack, and NMIs support nesting, so I'm
> thinking that should also be doable.  Andy?
>

In principle, as long as the code isn't NOKPROBE_SYMBOL-ified, we
should be fine, right?  I'd be a little nervous if we get an int3 in
the C code that handles the early part of an NMI from user mode.  It's
*probably* okay, but one of the alarming issues is that the int3
return path will implicitly unmask NMI, which isn't fantastic.  Maybe
we finally need to dust off my old "return using RET" code to get rid
of that problem.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-11-29  1:40       ` Andy Lutomirski
@ 2018-11-29  2:06         ` Nadav Amit
  2018-11-29  3:24           ` Andy Lutomirski
  0 siblings, 1 reply; 43+ messages in thread
From: Nadav Amit @ 2018-11-29  2:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David

> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>> 
>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>>>>> This RFC introduces indirect call promotion in runtime, which for the
>>>>> matter of simplification (and branding) will be called here "relpolines"
>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
>>>>> of reducing retpoline overheads due to Spectre v2.
>>>>> 
>>>>> Unlike indirect call promotion through profile guided optimization, the
>>>>> proposed approach does not require a profiling stage, works well with
>>>>> modules whose address is unknown and can adapt to changing workloads.
>>>>> 
>>>>> The main idea is simple: for every indirect call, we inject a piece of
>>>>> code with fast- and slow-path calls. The fast path is used if the target
>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
>>>>> During training, the slow-path is set to call a function that saves the
>>>>> call source and target in a hash-table and keep count for call
>>>>> frequency. The most common target is then patched into the hot path.
>>>>> 
>>>>> The patching is done on-the-fly by patching the conditional branch
>>>>> (opcode and offset) that is used to compare the target to the hot
>>>>> target. This allows to direct all cores to the fast-path, while patching
>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>>>>> patch a single byte when the code might be executed by any core. (2)
>>>>> When patching more than one byte, ensure that all cores do not run the
>>>>> to-be-patched-code by preventing this code from being preempted, and
>>>>> using synchronize_sched() after patching the branch that jumps over this
>>>>> code.
>>>>> 
>>>>> Changing all the indirect calls to use relpolines is done using assembly
>>>>> macro magic. There are alternative solutions, but this one is
>>>>> relatively simple and transparent. There is also logic to retrain the
>>>>> software predictor, but the policy it uses may need to be refined.
>>>>> 
>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>>>>> 
>>>>>           base            relpoline
>>>>>           ----            ---------
>>>>> nginx      22898           25178 (+10%)
>>>>> redis-ycsb 24523           25486 (+4%)
>>>>> dbench     2144            2103 (+2%)
>>>>> 
>>>>> When retpolines are disabled, and if retraining is off, performance
>>>>> benefits are up to 2% (nginx), but are much less impressive.
>>>> 
>>>> Hi Nadav,
>>>> 
>>>> Peter pointed me to these patches during a discussion about retpoline
>>>> profiling.  Personally, I think this is brilliant.  This could help
>>>> networking and filesystem intensive workloads a lot.
>>> 
>>> Thanks! I was a bit held-back by the relatively limited number of responses.
>> 
>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>> 
>>> I finished another version two weeks ago, and every day I think: "should it
>>> be RFCv2 or v1”, ending up not sending it…
>>> 
>>> There is one issue that I realized while working on the new version: I’m not
>>> sure it is well-defined what an outline retpoline is allowed to do. The
>>> indirect branch promotion code can change rflags, which might cause
>>> correction issues. In practice, using gcc, it is not a problem.
>> 
>> Callees can clobber flags, so it seems fine to me.
> 
> Just to check I understand your approach right: you made a macro
> called "call", and you're therefore causing all instances of "call" to
> become magic?  This is... terrifying.  It's even plausibly worse than
> "#define if" :)  The scariest bit is that it will impact inline asm as
> well.  Maybe a gcc plugin would be less alarming?

It is likely to look less alarming. When I looked at the inline retpoline
implementation of gcc, it didn’t look much better than what I did - it
basically just emits assembly instructions.

Anyhow, I’ll look (again) into using gcc-plugins.

>>> 1. An indirect branch inside the BP handler might be the one we patch
>> 
>> I _think_ nested INT3s should be doable, because they don't use IST.
>> Maybe Andy can clarify.
> 
> int3 should survive recursion these days.  Although I admit I'm
> currently wondering what happens if one thread puts a kprobe on an
> address that another thread tries to text_poke.

The issue I was concerned about is having an indirect call *inside* the
handler. For example, you try to patch the call to bp_int3_handler and then
get an int3. Such calls can be annotated to prevent them from being patched.
Then again, I need to see how gcc plugins can get these annotations.

> 
> Also, this relpoline magic is likely to start patching text at runtime
> on a semi-regular basis.  This type of patching is *slow*.  Is it a
> problem?

It didn’t appear so. Although there are >10000 indirect branches in the
kernel, you don’t patch too many of them even if you are doing relearning.

> 
>>> 2. An indirect branch inside an interrupt or NMI handler might be the
>>>   one we patch
>> 
>> But INT3s just use the existing stack, and NMIs support nesting, so I'm
>> thinking that should also be doable.  Andy?
> 
> In principle, as long as the code isn't NOKPROBE_SYMBOL-ified, we
> should be fine, right?  I'd be a little nervous if we get an int3 in
> the C code that handles the early part of an NMI from user mode.  It's
> *probably* okay, but one of the alarming issues is that the int3
> return path will implicitly unmask NMI, which isn't fantastic.  Maybe
> we finally need to dust off my old "return using RET" code to get rid
> of that problem.

So it may be possible. It would require having a new text_poke_bp() variant
for multiple instructions. text_poke_bp() might be slower though.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-11-29  2:06         ` Nadav Amit
@ 2018-11-29  3:24           ` Andy Lutomirski
  2018-11-29  4:36             ` Josh Poimboeuf
  2018-11-29  6:06             ` Andy Lutomirski
  0 siblings, 2 replies; 43+ messages in thread
From: Andy Lutomirski @ 2018-11-29  3:24 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Josh Poimboeuf, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, LKML, X86 ML, Borislav Petkov,
	Woodhouse, David


On Nov 28, 2018, at 6:06 PM, Nadav Amit <namit@vmware.com> wrote:

>> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> 
>>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
>>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>> 
>>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>>>>>> This RFC introduces indirect call promotion in runtime, which for the
>>>>>> matter of simplification (and branding) will be called here "relpolines"
>>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
>>>>>> of reducing retpoline overheads due to Spectre v2.
>>>>>> 
>>>>>> Unlike indirect call promotion through profile guided optimization, the
>>>>>> proposed approach does not require a profiling stage, works well with
>>>>>> modules whose address is unknown and can adapt to changing workloads.
>>>>>> 
>>>>>> The main idea is simple: for every indirect call, we inject a piece of
>>>>>> code with fast- and slow-path calls. The fast path is used if the target
>>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
>>>>>> During training, the slow-path is set to call a function that saves the
>>>>>> call source and target in a hash-table and keep count for call
>>>>>> frequency. The most common target is then patched into the hot path.
>>>>>> 
>>>>>> The patching is done on-the-fly by patching the conditional branch
>>>>>> (opcode and offset) that is used to compare the target to the hot
>>>>>> target. This allows to direct all cores to the fast-path, while patching
>>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>>>>>> patch a single byte when the code might be executed by any core. (2)
>>>>>> When patching more than one byte, ensure that all cores do not run the
>>>>>> to-be-patched-code by preventing this code from being preempted, and
>>>>>> using synchronize_sched() after patching the branch that jumps over this
>>>>>> code.
>>>>>> 
>>>>>> Changing all the indirect calls to use relpolines is done using assembly
>>>>>> macro magic. There are alternative solutions, but this one is
>>>>>> relatively simple and transparent. There is also logic to retrain the
>>>>>> software predictor, but the policy it uses may need to be refined.
>>>>>> 
>>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>>>>>> 
>>>>>>          base            relpoline
>>>>>>          ----            ---------
>>>>>> nginx      22898           25178 (+10%)
>>>>>> redis-ycsb 24523           25486 (+4%)
>>>>>> dbench     2144            2103 (+2%)
>>>>>> 
>>>>>> When retpolines are disabled, and if retraining is off, performance
>>>>>> benefits are up to 2% (nginx), but are much less impressive.
>>>>> 
>>>>> Hi Nadav,
>>>>> 
>>>>> Peter pointed me to these patches during a discussion about retpoline
>>>>> profiling.  Personally, I think this is brilliant.  This could help
>>>>> networking and filesystem intensive workloads a lot.
>>>> 
>>>> Thanks! I was a bit held-back by the relatively limited number of responses.
>>> 
>>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>>> 
>>>> I finished another version two weeks ago, and every day I think: "should it
>>>> be RFCv2 or v1”, ending up not sending it…
>>>> 
>>>> There is one issue that I realized while working on the new version: I’m not
>>>> sure it is well-defined what an outline retpoline is allowed to do. The
>>>> indirect branch promotion code can change rflags, which might cause
>>>> correction issues. In practice, using gcc, it is not a problem.
>>> 
>>> Callees can clobber flags, so it seems fine to me.
>> 
>> Just to check I understand your approach right: you made a macro
>> called "call", and you're therefore causing all instances of "call" to
>> become magic?  This is... terrifying.  It's even plausibly worse than
>> "#define if" :)  The scariest bit is that it will impact inline asm as
>> well.  Maybe a gcc plugin would be less alarming?
> 
> It is likely to look less alarming. When I looked at the inline retpoline
> implementation of gcc, it didn’t look much better than what I did - it
> basically just emits assembly instructions.

To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”

Hey Josh - you could potentially do the same hack to generate the static
call tables. Take that, objtool.

> 
> Anyhow, I look (again) into using gcc-plugins.
> 
>>>> 1. An indirect branch inside the BP handler might be the one we patch
>>> 
>>> I _think_ nested INT3s should be doable, because they don't use IST.
>>> Maybe Andy can clarify.
>> 
>> int3 should survive recursion these days.  Although I admit I'm
>> currently wondering what happens if one thread puts a kprobe on an
>> address that another thread tries to text_poke.
> 
> The issue I regarded is having an indirect call *inside* the the handler.
> For example, you try to patch the call to bp_int3_handler and then get an
> int3. They can be annotated to prevent them from being patched. Then again,
> I need to see how gcc plugins can get these annotations.

We could move the relevant code to a separate object file that disables
the whole mess.

> 
>> 
>> Also, this relpoline magic is likely to start patching text at runtime
>> on a semi-regular basis.  This type of patching is *slow*.  Is it a
>> problem?
> 
> It didn’t appear so. Although there are >10000 indirect branches in the
> kernel, you don’t patch too many of them even you are doing relearning.
> 
>> 
>>>> 2. An indirect branch inside an interrupt or NMI handler might be the
>>>>  one we patch
>>> 
>>> But INT3s just use the existing stack, and NMIs support nesting, so I'm
>>> thinking that should also be doable.  Andy?
>> 
>> In principle, as long as the code isn't NOKPROBE_SYMBOL-ified, we
>> should be fine, right?  I'd be a little nervous if we get an int3 in
>> the C code that handles the early part of an NMI from user mode.  It's
>> *probably* okay, but one of the alarming issues is that the int3
>> return path will implicitly unmask NMI, which isn't fantastic.  Maybe
>> we finally need to dust off my old "return using RET" code to get rid
>> of that problem.
> 
> So it may be possible. It would require having a new text_poke_bp() variant
> for multiple instructions. text_poke_bp() might be slower though.
> 
> 

Can you outline how the patching works at all?  You’re getting rid of
preempt disabling, right?  What’s the actual sequence and how does it work?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-11-29  3:24           ` Andy Lutomirski
@ 2018-11-29  4:36             ` Josh Poimboeuf
  2018-11-29  6:06             ` Andy Lutomirski
  1 sibling, 0 replies; 43+ messages in thread
From: Josh Poimboeuf @ 2018-11-29  4:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Andy Lutomirski, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, LKML, X86 ML, Borislav Petkov,
	Woodhouse, David

On Wed, Nov 28, 2018 at 07:24:08PM -0800, Andy Lutomirski wrote:
> To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”
> 
> Hey Josh - you could potentially do the same hack to generate the
> static call tables. Take that, objtool.

Ha, after witnessing Nadav's glorious hack, I considered that.  But I
didn't see a way to pull it off, because asm macro conditionals don't
seem to have a way to test for a regex (or at least a named prefix) for
the call target.

I'd need a way to detect "call __static_call_tramp_<whatever>".

-- 
Josh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-11-29  3:24           ` Andy Lutomirski
  2018-11-29  4:36             ` Josh Poimboeuf
@ 2018-11-29  6:06             ` Andy Lutomirski
  2018-11-29 15:19               ` Josh Poimboeuf
  1 sibling, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2018-11-29  6:06 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Lutomirski, Josh Poimboeuf, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, LKML, X86 ML, Borislav Petkov,
	Woodhouse, David

On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
> On Nov 28, 2018, at 6:06 PM, Nadav Amit <namit@vmware.com> wrote:
>
> >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> >>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >>>>>
> >>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> >>>>>> This RFC introduces indirect call promotion in runtime, which for the
> >>>>>> matter of simplification (and branding) will be called here "relpolines"
> >>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
> >>>>>> of reducing retpoline overheads due to Spectre v2.
> >>>>>>
> >>>>>> Unlike indirect call promotion through profile guided optimization, the
> >>>>>> proposed approach does not require a profiling stage, works well with
> >>>>>> modules whose address is unknown and can adapt to changing workloads.
> >>>>>>
> >>>>>> The main idea is simple: for every indirect call, we inject a piece of
> >>>>>> code with fast- and slow-path calls. The fast path is used if the target
> >>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
> >>>>>> During training, the slow-path is set to call a function that saves the
> >>>>>> call source and target in a hash-table and keep count for call
> >>>>>> frequency. The most common target is then patched into the hot path.
> >>>>>>
> >>>>>> The patching is done on-the-fly by patching the conditional branch
> >>>>>> (opcode and offset) that is used to compare the target to the hot
> >>>>>> target. This allows to direct all cores to the fast-path, while patching
> >>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> >>>>>> patch a single byte when the code might be executed by any core. (2)
> >>>>>> When patching more than one byte, ensure that all cores do not run the
> >>>>>> to-be-patched-code by preventing this code from being preempted, and
> >>>>>> using synchronize_sched() after patching the branch that jumps over this
> >>>>>> code.
> >>>>>>
> >>>>>> Changing all the indirect calls to use relpolines is done using assembly
> >>>>>> macro magic. There are alternative solutions, but this one is
> >>>>>> relatively simple and transparent. There is also logic to retrain the
> >>>>>> software predictor, but the policy it uses may need to be refined.
> >>>>>>
> >>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
> >>>>>>
> >>>>>>          base            relpoline
> >>>>>>          ----            ---------
> >>>>>> nginx      22898           25178 (+10%)
> >>>>>> redis-ycsb 24523           25486 (+4%)
> >>>>>> dbench     2144            2103 (+2%)
> >>>>>>
> >>>>>> When retpolines are disabled, and if retraining is off, performance
> >>>>>> benefits are up to 2% (nginx), but are much less impressive.
> >>>>>
> >>>>> Hi Nadav,
> >>>>>
> >>>>> Peter pointed me to these patches during a discussion about retpoline
> >>>>> profiling.  Personally, I think this is brilliant.  This could help
> >>>>> networking and filesystem intensive workloads a lot.
> >>>>
> >>>> Thanks! I was a bit held-back by the relatively limited number of responses.
> >>>
> >>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
> >>>
> >>>> I finished another version two weeks ago, and every day I think: "should it
> >>>> be RFCv2 or v1”, ending up not sending it…
> >>>>
> >>>> There is one issue that I realized while working on the new version: I’m not
> >>>> sure it is well-defined what an outline retpoline is allowed to do. The
> >>>> indirect branch promotion code can change rflags, which might cause
> >>>> correction issues. In practice, using gcc, it is not a problem.
> >>>
> >>> Callees can clobber flags, so it seems fine to me.
> >>
> >> Just to check I understand your approach right: you made a macro
> >> called "call", and you're therefore causing all instances of "call" to
> >> become magic?  This is... terrifying.  It's even plausibly worse than
> >> "#define if" :)  The scariest bit is that it will impact inline asm as
> >> well.  Maybe a gcc plugin would be less alarming?
> >
> > It is likely to look less alarming. When I looked at the inline retpoline
> > implementation of gcc, it didn’t look much better than what I did - it
> > basically just emits assembly instructions.
>
> To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”

Although... how do you avoid matching on things that really don't want
this treatment?  paravirt ops come to mind.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix
  2018-10-19 14:29                 ` Andy Lutomirski
@ 2018-11-29  9:46                   ` Peter Zijlstra
  0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2018-11-29  9:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Ingo Molnar, Andrew Lutomirski, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David

On Fri, Oct 19, 2018 at 07:29:45AM -0700, Andy Lutomirski wrote:
> > On Oct 19, 2018, at 1:33 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> >> On Fri, Oct 19, 2018 at 01:08:23AM +0000, Nadav Amit wrote:
> >> Consider for example do_int3(), and see my inlined comments:
> >> 
> >> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> >> {
> >>    ...
> >>    ist_enter(regs);        // => preempt_disable()
> >>    cond_local_irq_enable(regs);    // => assume it enables IRQs
> >> 
> >>    ...
> >>    // resched irq can be delivered here. It will not caused rescheduling
> >>    // since preemption is disabled
> >> 
> >>    cond_local_irq_disable(regs);    // => assume it disables IRQs
> >>    ist_exit(regs);            // => preempt_enable_no_resched()
> >> }
> >> 
> >> At this point resched will not happen for unbounded length of time (unless
> >> there is another point when exiting the trap handler that checks if
> >> preemption should take place).
> >> 
> >> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> >> preempt_enable_no_resched().
> >> 
> >> Am I missing something?
> > 
> > Would not the interrupt return then check for TIF_NEED_RESCHED and call
> > schedule() ?
> 
> The paranoid exit path doesn’t check TIF_NEED_RESCHED because it’s
> fundamentally atomic — it’s running on a percpu stack and it can’t
> schedule. In theory we could do some evil stack switching, but we
> don’t.
> 
> How does NMI handle this?  If an NMI that hit interruptible kernel
> code overflows a perf counter, how does the wake up work?

NMIs should never set NEED_RESCHED. What perf does is a self-IPI
(irq_work) and the wakeup is done from there.
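
For readers unfamiliar with the mechanism, the general shape is roughly the
following; this is a simplified sketch of the irq_work pattern, not the
actual perf code (the names here are made up):

  #include <linux/irq_work.h>
  #include <linux/sched.h>

  static struct task_struct *waiter;      /* task to wake, set elsewhere */
  static struct irq_work wakeup_work;     /* init_irq_work()ed at init time */

  /* Runs in IRQ context after the self-IPI, where waking a task is safe. */
  static void deferred_wakeup(struct irq_work *work)
  {
          if (waiter)
                  wake_up_process(waiter);
  }

  /* Called from NMI context: only queue the work and return. */
  static void nmi_request_wakeup(void)
  {
          irq_work_queue(&wakeup_work);
  }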

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-11-29  6:06             ` Andy Lutomirski
@ 2018-11-29 15:19               ` Josh Poimboeuf
  2018-12-01  6:52                 ` Nadav Amit
  0 siblings, 1 reply; 43+ messages in thread
From: Josh Poimboeuf @ 2018-11-29 15:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David

On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >
> > On Nov 28, 2018, at 6:06 PM, Nadav Amit <namit@vmware.com> wrote:
> >
> > >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > >>
> > >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > >>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> > >>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > >>>>>
> > >>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> > >>>>>> This RFC introduces indirect call promotion in runtime, which for the
> > >>>>>> matter of simplification (and branding) will be called here "relpolines"
> > >>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
> > >>>>>> of reducing retpoline overheads due to Spectre v2.
> > >>>>>>
> > >>>>>> Unlike indirect call promotion through profile guided optimization, the
> > >>>>>> proposed approach does not require a profiling stage, works well with
> > >>>>>> modules whose address is unknown and can adapt to changing workloads.
> > >>>>>>
> > >>>>>> The main idea is simple: for every indirect call, we inject a piece of
> > >>>>>> code with fast- and slow-path calls. The fast path is used if the target
> > >>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
> > >>>>>> During training, the slow-path is set to call a function that saves the
> > >>>>>> call source and target in a hash-table and keep count for call
> > >>>>>> frequency. The most common target is then patched into the hot path.
> > >>>>>>
> > >>>>>> The patching is done on-the-fly by patching the conditional branch
> > >>>>>> (opcode and offset) that is used to compare the target to the hot
> > >>>>>> target. This allows to direct all cores to the fast-path, while patching
> > >>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> > >>>>>> patch a single byte when the code might be executed by any core. (2)
> > >>>>>> When patching more than one byte, ensure that all cores do not run the
> > >>>>>> to-be-patched-code by preventing this code from being preempted, and
> > >>>>>> using synchronize_sched() after patching the branch that jumps over this
> > >>>>>> code.
> > >>>>>>
> > >>>>>> Changing all the indirect calls to use relpolines is done using assembly
> > >>>>>> macro magic. There are alternative solutions, but this one is
> > >>>>>> relatively simple and transparent. There is also logic to retrain the
> > >>>>>> software predictor, but the policy it uses may need to be refined.
> > >>>>>>
> > >>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
> > >>>>>>
> > >>>>>>          base            relpoline
> > >>>>>>          ----            ---------
> > >>>>>> nginx      22898           25178 (+10%)
> > >>>>>> redis-ycsb 24523           25486 (+4%)
> > >>>>>> dbench     2144            2103 (+2%)
> > >>>>>>
> > >>>>>> When retpolines are disabled, and if retraining is off, performance
> > >>>>>> benefits are up to 2% (nginx), but are much less impressive.
> > >>>>>
> > >>>>> Hi Nadav,
> > >>>>>
> > >>>>> Peter pointed me to these patches during a discussion about retpoline
> > >>>>> profiling.  Personally, I think this is brilliant.  This could help
> > >>>>> networking and filesystem intensive workloads a lot.
> > >>>>
> > >>>> Thanks! I was a bit held-back by the relatively limited number of responses.
> > >>>
> > >>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
> > >>>
> > >>>> I finished another version two weeks ago, and every day I think: "should it
> > >>>> be RFCv2 or v1”, ending up not sending it…
> > >>>>
> > >>>> There is one issue that I realized while working on the new version: I’m not
> > >>>> sure it is well-defined what an outline retpoline is allowed to do. The
> > >>>> indirect branch promotion code can change rflags, which might cause
> > >>>> correction issues. In practice, using gcc, it is not a problem.
> > >>>
> > >>> Callees can clobber flags, so it seems fine to me.
> > >>
> > >> Just to check I understand your approach right: you made a macro
> > >> called "call", and you're therefore causing all instances of "call" to
> > >> become magic?  This is... terrifying.  It's even plausibly worse than
> > >> "#define if" :)  The scariest bit is that it will impact inline asm as
> > >> well.  Maybe a gcc plugin would be less alarming?
> > >
> > > It is likely to look less alarming. When I looked at the inline retpoline
> > > implementation of gcc, it didn’t look much better than what I did - it
> > > basically just emits assembly instructions.
> >
> > To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”
> 
> Although... how do you avoid matching on things that really don't want
> this treatment?  paravirt ops come to mind.

Paravirt ops don't use retpolines because they're patched into direct
calls during boot.  So Nadav's patches won't touch them.

-- 
Josh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-11-29 15:19               ` Josh Poimboeuf
@ 2018-12-01  6:52                 ` Nadav Amit
  2018-12-01 14:25                   ` Josh Poimboeuf
  0 siblings, 1 reply; 43+ messages in thread
From: Nadav Amit @ 2018-12-01  6:52 UTC (permalink / raw)
  To: Josh Poimboeuf, Andy Lutomirski
  Cc: Ingo Molnar, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	LKML, X86 ML, Borislav Petkov, Woodhouse, David

> On Nov 29, 2018, at 7:19 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
>> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Nov 28, 2018, at 6:06 PM, Nadav Amit <namit@vmware.com> wrote:
>>> 
>>>>> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>> 
>>>>>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
>>>>>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>>>>> 
>>>>>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>>>>>>>>> [ full cover letter snipped ]
>>>>>>>> 
>>>>>>>> Hi Nadav,
>>>>>>>> 
>>>>>>>> Peter pointed me to these patches during a discussion about retpoline
>>>>>>>> profiling.  Personally, I think this is brilliant.  This could help
>>>>>>>> networking and filesystem intensive workloads a lot.
>>>>>>> 
>>>>>>> Thanks! I was a bit held-back by the relatively limited number of responses.
>>>>>> 
>>>>>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>>>>>> 
>>>>>>> I finished another version two weeks ago, and every day I think: "should it
>>>>>>> be RFCv2 or v1”, ending up not sending it…
>>>>>>> 
>>>>>>> There is one issue that I realized while working on the new version: I’m not
>>>>>>> sure it is well-defined what an outline retpoline is allowed to do. The
>>>>>>> indirect branch promotion code can change rflags, which might cause
>>>>>>> correctness issues. In practice, using gcc, it is not a problem.
>>>>>> 
>>>>>> Callees can clobber flags, so it seems fine to me.
>>>>> 
>>>>> Just to check I understand your approach right: you made a macro
>>>>> called "call", and you're therefore causing all instances of "call" to
>>>>> become magic?  This is... terrifying.  It's even plausibly worse than
>>>>> "#define if" :)  The scariest bit is that it will impact inline asm as
>>>>> well.  Maybe a gcc plugin would be less alarming?
>>>> 
>>>> It is likely to look less alarming. When I looked at the inline retpoline
>>>> implementation of gcc, it didn’t look much better than what I did - it
>>>> basically just emits assembly instructions.
>>> 
>>> To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”
>> 
>> Although... how do you avoid matching on things that really don't want
>> this treatment?  paravirt ops come to mind.
> 
> Paravirt ops don't use retpolines because they're patched into direct
> calls during boot.  So Nadav's patches won't touch them.

Actually, the way it’s handled is slightly more complicated - yes, the CALL
macro should not be applied, as Josh said, but the question is how it is
achieved.

The basic idea is that the CALL macro should only be applied to C source
files, and not to assembly files or to macros.s, which holds the PV call
macros. I will recheck that it is done this way.
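
Roughly, the restriction falls out of how the macro file reaches the
assembler: the generated macros.s is passed to the assembler only when
compiling C files, so plain .S files are assembled without the
override.  Something along these lines - this is a sketch of the idea,
not the exact hunk from the tree:

# Makefile (sketch): the macro file is appended only to the C flags,
# so it is in scope for compiler-generated assembly but not for
# hand-written .S files; nothing is added to KBUILD_AFLAGS.
ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s
export ASM_MACRO_FLAGS
KBUILD_CFLAGS += $(ASM_MACRO_FLAGS)
# (generating macros.s early enough in the build is not shown here)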

Regards,
Nadav

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
  2018-12-01  6:52                 ` Nadav Amit
@ 2018-12-01 14:25                   ` Josh Poimboeuf
  0 siblings, 0 replies; 43+ messages in thread
From: Josh Poimboeuf @ 2018-12-01 14:25 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, LKML, X86 ML, Borislav Petkov, Woodhouse, David

On Sat, Dec 01, 2018 at 06:52:45AM +0000, Nadav Amit wrote:
> > On Nov 29, 2018, at 7:19 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
> >> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >>> On Nov 28, 2018, at 6:06 PM, Nadav Amit <namit@vmware.com> wrote:
> >>> 
> >>>>> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >>>>> 
> >>>>>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >>>>>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> >>>>>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >>>>>>>> 
> >>>>>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> >>>>>>>>> [ full cover letter snipped ]
> >>>>>>>> 
> >>>>>>>> Hi Nadav,
> >>>>>>>> 
> >>>>>>>> Peter pointed me to these patches during a discussion about retpoline
> >>>>>>>> profiling.  Personally, I think this is brilliant.  This could help
> >>>>>>>> networking and filesystem intensive workloads a lot.
> >>>>>>> 
> >>>>>>> Thanks! I was a bit held-back by the relatively limited number of responses.
> >>>>>> 
> >>>>>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
> >>>>>> 
> >>>>>>> I finished another version two weeks ago, and every day I think: "should it
> >>>>>>> be RFCv2 or v1”, ending up not sending it…
> >>>>>>> 
> >>>>>>> There is one issue that I realized while working on the new version: I’m not
> >>>>>>> sure it is well-defined what an outline retpoline is allowed to do. The
> >>>>>>> indirect branch promotion code can change rflags, which might cause
> >>>>>>> correctness issues. In practice, using gcc, it is not a problem.
> >>>>>> 
> >>>>>> Callees can clobber flags, so it seems fine to me.
> >>>>> 
> >>>>> Just to check I understand your approach right: you made a macro
> >>>>> called "call", and you're therefore causing all instances of "call" to
> >>>>> become magic?  This is... terrifying.  It's even plausibly worse than
> >>>>> "#define if" :)  The scariest bit is that it will impact inline asm as
> >>>>> well.  Maybe a gcc plugin would be less alarming?
> >>>> 
> >>>> It is likely to look less alarming. When I looked at the inline retpoline
> >>>> implementation of gcc, it didn’t look much better than what I did - it
> >>>> basically just emits assembly instructions.
> >>> 
> >>> To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”
> >> 
> >> Although... how do you avoid matching on things that really don't want
> >> this treatment?  paravirt ops come to mind.
> > 
> > Paravirt ops don't use retpolines because they're patched into direct
> > calls during boot.  So Nadav's patches won't touch them.
> 
> Actually, the way it’s handled is slightly more complicated - yes, the CALL
> macro should not be applied, as Josh said, but the question is how it is
> achieved.
> 
> The basic idea is that the CALL macro should only be applied to C source
> files, and not to assembly files or to macros.s, which holds the PV call
> macros. I will recheck that it is done this way.

Even if the CALL macro were applied, it would get ignored by your code
because the PARAVIRT_CALL macro doesn't use retpolines.  So it would get
skipped by this check:

.ifc "\v", "__x86_indirect_thunk_\reg_it"
	relpoline_call reg=\reg_it
	retpoline = 1
.endif
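
Spelled out a little more, the surrounding logic has roughly this
shape.  It is only a sketch: the series hooks the "call" mnemonic
itself and its relpoline_call emits the real compare + fast-path +
slow-path sequence, so the macro name "promote_call" and the stub body
below are purely illustrative.

.macro relpoline_call reg:req
	# stand-in body: the macro in the patches emits the learned-target
	# compare, the direct-call fast path and the retpoline slow path
	call	__x86_indirect_thunk_\reg
.endm

.macro promote_call v:vararg
	retpoline = 0
	.irp reg_it,rax,rbx,rcx,rdx,rsi,rdi,rbp,r8,r9,r10,r11,r12,r13,r14,r15
	.ifc "\v", "__x86_indirect_thunk_\reg_it"
		relpoline_call reg=\reg_it	# a retpoline call site: promote it
		retpoline = 1
	.endif
	.endr
	.if retpoline == 0
		call	\v			# anything else (direct calls, PV
						# call sites): emitted unchanged
	.endif
.endm

So something like "promote_call __x86_indirect_thunk_rax" gets
rewritten, while a direct call - or a PARAVIRT_CALL site, whose operand
never names a retpoline thunk - falls straight through unchanged.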

-- 
Josh

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2018-12-01 14:25 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-18  0:54 [RFC PATCH 0/5] x86: dynamic indirect call promotion Nadav Amit
2018-10-18  0:54 ` [RFC PATCH 1/5] x86: introduce preemption disable prefix Nadav Amit
2018-10-18  1:22   ` Andy Lutomirski
2018-10-18  3:12     ` Nadav Amit
2018-10-18  3:26       ` Nadav Amit
2018-10-18  3:51       ` Andy Lutomirski
2018-10-18 16:47         ` Nadav Amit
2018-10-18 17:00           ` Andy Lutomirski
2018-10-18 17:25             ` Nadav Amit
2018-10-18 17:29               ` Andy Lutomirski
2018-10-18 17:42                 ` Nadav Amit
2018-10-19  1:08             ` Nadav Amit
2018-10-19  4:29               ` Andy Lutomirski
2018-10-19  4:44                 ` Nadav Amit
2018-10-20  1:22                   ` Masami Hiramatsu
2018-10-19  5:00                 ` Alexei Starovoitov
2018-10-19  8:22                   ` Peter Zijlstra
2018-10-19 14:47                     ` Alexei Starovoitov
2018-10-19  8:19                 ` Peter Zijlstra
2018-10-19 10:38                 ` Oleg Nesterov
2018-10-19  8:33               ` Peter Zijlstra
2018-10-19 14:29                 ` Andy Lutomirski
2018-11-29  9:46                   ` Peter Zijlstra
2018-10-18  7:54     ` Peter Zijlstra
2018-10-18 18:14       ` Nadav Amit
2018-10-18  0:54 ` [RFC PATCH 2/5] x86: patch indirect branch promotion Nadav Amit
2018-10-18  0:54 ` [RFC PATCH 3/5] x86: interface for accessing indirect branch locations Nadav Amit
2018-10-18  0:54 ` [RFC PATCH 4/5] x86: learning and patching indirect branch targets Nadav Amit
2018-10-18  0:54 ` [RFC PATCH 5/5] x86: relpoline: disabling interface Nadav Amit
2018-10-23 18:36 ` [RFC PATCH 0/5] x86: dynamic indirect call promotion Dave Hansen
2018-10-23 20:32   ` Nadav Amit
2018-10-23 20:37     ` Dave Hansen
2018-11-28 16:08 ` Josh Poimboeuf
2018-11-28 19:34   ` Nadav Amit
2018-11-29  0:38     ` Josh Poimboeuf
2018-11-29  1:40       ` Andy Lutomirski
2018-11-29  2:06         ` Nadav Amit
2018-11-29  3:24           ` Andy Lutomirski
2018-11-29  4:36             ` Josh Poimboeuf
2018-11-29  6:06             ` Andy Lutomirski
2018-11-29 15:19               ` Josh Poimboeuf
2018-12-01  6:52                 ` Nadav Amit
2018-12-01 14:25                   ` Josh Poimboeuf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).