[RFC,0/5] x86: dynamic indirect call promotion

Message ID 20181018005420.82993-1-namit@vmware.com

Message

Nadav Amit Oct. 18, 2018, 12:54 a.m. UTC
This RFC introduces indirect call promotion at runtime, which for the
sake of simplicity (and branding) will be called here "relpolines"
(relative call + trampoline). Relpolines are mainly intended as a way
of reducing the overhead of the retpolines used to mitigate Spectre v2.

Unlike indirect call promotion through profile-guided optimization, the
proposed approach does not require a profiling stage, works well with
modules whose load address is unknown at build time, and can adapt to
changing workloads.

The main idea is simple: for every indirect call, we inject a piece of
code with fast- and slow-path calls. The fast path is used if the target
matches the expected (hot) target. The slow path uses a retpoline.
During training, the slow path is set to call a function that saves the
call source and target in a hash table and keeps a count of the call
frequency. The most common target is then patched into the fast path.
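
To illustrate, an injected call site looks roughly like the sketch below
(illustrative only; the register holding the target, the label names and
the exact layout vary per call site and are not taken verbatim from the
patches):

  cmpq    $hot_target, %rax             # compare against the learned hot target
  jnz     .Lslow                        # the conditional branch that gets patched
  call    hot_target                    # fast path: direct, predictable call
  jmp     .Ldone
.Lslow:
  call    __x86_indirect_thunk_rax      # slow path: retpoline (or, during
                                        # training, the learning routine)
.Ldone: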

The patching is done on the fly by patching the conditional branch
(opcode and offset) that follows the comparison of the target with the
hot target. This makes it possible to direct all cores to the fast path
while the slow path is being patched, and vice versa. Patching follows
two more rules: (1) only patch a single byte when the code might be
executed by any core; (2) when patching more than one byte, ensure that
no core runs the to-be-patched code, by preventing this code from being
preempted and by using synchronize_sched() after patching the branch
that jumps over it.
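
In terms of the sketch above, both redirections can be done with a
single-byte write to the jnz (byte values are illustrative, assuming the
short rel8 form of the branch):

  75 XX    jnz  .Lslow   # normal state: the slow path is taken only on a miss
  75 00    jnz  .+2      # offset patched to zero: never taken, so all cores
                         # stay on the fast path while the slow path is patched
  eb XX    jmp  .Lslow   # opcode patched to a short jmp: always taken, so all
                         # cores use the slow path while the fast path is patched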

Changing all the indirect calls to use relpolines is done using assembly
macro magic. There are alternative solutions, but this one is
relatively simple and transparent. There is also logic to retrain the
software predictor, but the policy it uses may need to be refined.

Overall, the results are not bad (2-VCPU VM, throughput reported):

            base        relpoline
            ----        ---------
nginx       22898       25178 (+10%)
redis-ycsb  24523       25486 (+4%)
dbench       2144        2103 (+2%)

When retpolines are disabled and retraining is off, the performance
benefit is up to 2% (nginx), but it is much less impressive.

There are several open issues: retraining should be done when modules
are removed; CPU hotplug is not supported; x86-32 is probably broken;
and the Makefile does not rebuild when the relpoline code is changed.
Having said that, I am worried that some of the approaches I took would
challenge the new code-of-conduct, so I thought of getting some feedback
before putting more effort into it.

Nadav Amit (5):
  x86: introduce preemption disable prefix
  x86: patch indirect branch promotion
  x86: interface for accessing indirect branch locations
  x86: learning and patching indirect branch targets
  x86: relpoline: disabling interface

 arch/x86/entry/entry_64.S            |  10 +
 arch/x86/include/asm/nospec-branch.h | 158 +++++
 arch/x86/include/asm/sections.h      |   2 +
 arch/x86/kernel/Makefile             |   1 +
 arch/x86/kernel/asm-offsets.c        |   6 +
 arch/x86/kernel/macros.S             |   1 +
 arch/x86/kernel/nospec-branch.c      | 899 +++++++++++++++++++++++++++
 arch/x86/kernel/vmlinux.lds.S        |   7 +
 arch/x86/lib/retpoline.S             |  75 +++
 include/linux/module.h               |   5 +
 kernel/module.c                      |   8 +
 kernel/seccomp.c                     |   2 +
 12 files changed, 1174 insertions(+)
 create mode 100644 arch/x86/kernel/nospec-branch.c

Comments

Dave Hansen Oct. 23, 2018, 6:36 p.m. UTC | #1
On 10/17/18 5:54 PM, Nadav Amit wrote:
> 		base		relpoline
> 		----		---------
> nginx 	22898 		25178 (+10%)
> redis-ycsb	24523		25486 (+4%)
> dbench	2144		2103 (+2%)

Just out of curiosity, which indirect branches are the culprits here for
causing the slowdowns?
Nadav Amit Oct. 23, 2018, 8:32 p.m. UTC | #2
at 11:36 AM, Dave Hansen <dave.hansen@intel.com> wrote:

> On 10/17/18 5:54 PM, Nadav Amit wrote:
>> base		relpoline
>> 		----		---------
>> nginx 	22898 		25178 (+10%)
>> redis-ycsb	24523		25486 (+4%)
>> dbench	2144		2103 (+2%)
> 
> Just out of curiosity, which indirect branches are the culprits here for
> causing the slowdowns?

So I didn’t try to measure exactly which ones. There are roughly 500 that
actually “run” in my tests. Initially, I took the silly approach of trying
to patch the C source code using semi-automatically generated Coccinelle
scripts, so I can tell you it is not just a few branches but many. The
network stack is full of function pointers (e.g., tcp_congestion_ops,
tcp_sock_af_ops, dst_ops). The filesystem layer also uses many function
pointers (file_operations specifically). Compound pages have a destructor,
and so on.

If you want, you can rebuild the kernel without retpolines and run
	
  perf record -e br_inst_exec.taken_indirect_near_call:k (your workload)

For some reason I didn’t manage to use PEBS (:ppp) from either the guest or
the host, so my results are a bit skewed (i.e., the sampled location is
usually after the call was taken). Running dbench in the VM gives me the
following “hot-spots”:

# Samples: 304  of event 'br_inst_exec.taken_indirect_near_call'
# Event count (approx.): 60800912
#
# Overhead  Command  Shared Object            Symbol                                       
# ........  .......  .......................  .............................................
#
     5.26%  :197970  [guest.kernel.kallsyms]  [g] __fget_light
     4.28%  :197969  [guest.kernel.kallsyms]  [g] __fget_light
     3.95%  :197969  [guest.kernel.kallsyms]  [g] dcache_readdir
     3.29%  :197970  [guest.kernel.kallsyms]  [g] next_positive.isra.14
     2.96%  :197970  [guest.kernel.kallsyms]  [g] __do_sys_kill
     2.30%  :197970  [guest.kernel.kallsyms]  [g] apparmor_file_open
     1.97%  :197969  [guest.kernel.kallsyms]  [g] __do_sys_kill
     1.97%  :197969  [guest.kernel.kallsyms]  [g] next_positive.isra.14
     1.97%  :197970  [guest.kernel.kallsyms]  [g] _raw_spin_lock
     1.64%  :197969  [guest.kernel.kallsyms]  [g] __alloc_file
     1.64%  :197969  [guest.kernel.kallsyms]  [g] common_file_perm
     1.64%  :197969  [guest.kernel.kallsyms]  [g] filldir
     1.64%  :197970  [guest.kernel.kallsyms]  [g] do_dentry_open
     1.64%  :197970  [guest.kernel.kallsyms]  [g] kmem_cache_free
     1.32%  :197969  [guest.kernel.kallsyms]  [g] __raw_callee_save___pv_queued_spin_unlock
     1.32%  :197969  [guest.kernel.kallsyms]  [g] __slab_free

Regards,
Nadav
Dave Hansen Oct. 23, 2018, 8:37 p.m. UTC | #3
On 10/23/18 1:32 PM, Nadav Amit wrote:
>> On 10/17/18 5:54 PM, Nadav Amit wrote:
>>> base		relpoline
>>> 		----		---------
>>> nginx 	22898 		25178 (+10%)
>>> redis-ycsb	24523		25486 (+4%)
>>> dbench	2144		2103 (+2%)
>> Just out of curiosity, which indirect branches are the culprits here for
>> causing the slowdowns?
> So I didn’t try to measure exactly which one. There are roughly 500 that
> actually “run” in my tests.

OK, cool, that's pretty much all I wanted to know, just that there
aren't 3 of them or something for which we need all this infrastructure.
Josh Poimboeuf Nov. 28, 2018, 4:08 p.m. UTC | #4
On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> This RFC introduces indirect call promotion in runtime, which for the
> matter of simplification (and branding) will be called here "relpolines"
> (relative call + trampoline). Relpolines are mainly intended as a way
> of reducing retpoline overheads due to Spectre v2.
> 
> Unlike indirect call promotion through profile guided optimization, the
> proposed approach does not require a profiling stage, works well with
> modules whose address is unknown and can adapt to changing workloads.
> 
> The main idea is simple: for every indirect call, we inject a piece of
> code with fast- and slow-path calls. The fast path is used if the target
> matches the expected (hot) target. The slow-path uses a retpoline.
> During training, the slow-path is set to call a function that saves the
> call source and target in a hash-table and keep count for call
> frequency. The most common target is then patched into the hot path.
> 
> The patching is done on-the-fly by patching the conditional branch
> (opcode and offset) that is used to compare the target to the hot
> target. This allows to direct all cores to the fast-path, while patching
> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> patch a single byte when the code might be executed by any core. (2)
> When patching more than one byte, ensure that all cores do not run the
> to-be-patched-code by preventing this code from being preempted, and
> using synchronize_sched() after patching the branch that jumps over this
> code.
> 
> Changing all the indirect calls to use relpolines is done using assembly
> macro magic. There are alternative solutions, but this one is
> relatively simple and transparent. There is also logic to retrain the
> software predictor, but the policy it uses may need to be refined.
> 
> Eventually the results are not bad (2 VCPU VM, throughput reported):
> 
> 		base		relpoline
> 		----		---------
> nginx 	22898 		25178 (+10%)
> redis-ycsb	24523		25486 (+4%)
> dbench	2144		2103 (+2%)
> 
> When retpolines are disabled, and if retraining is off, performance
> benefits are up to 2% (nginx), but are much less impressive.

Hi Nadav,

Peter pointed me to these patches during a discussion about retpoline
profiling.  Personally, I think this is brilliant.  This could help
networking and filesystem intensive workloads a lot.

Some high-level comments:

- "Relpoline" looks confusingly a lot like "retpoline".  How about
  "optpoline"?  To avoid confusing myself I will hereafter refer to it
  as such :-)

- Instead of patching one byte at a time, is there a reason why
  text_poke_bp() can't be used?  That would greatly simplify the
  patching process, as everything could be patched in a single step.

- In many cases, a single direct call may not be sufficient, as there
  could be for example multiple tasks using different network protocols
  which need different callbacks for the same call site.

- I'm not sure about the periodic retraining logic, it seems a bit
  nondeterministic and bursty.
  
So I'd propose the following changes:

- In the optpoline, reserve space for multiple (5 or so) comparisons and
  direct calls.  Maybe the number of reserved cmp/jne/call slots can be
  tweaked by the caller somehow.  Or maybe it could grow as needed.
  Starting out, they would just be NOPs (a rough sketch of this layout
  follows the list).

- Instead of the temporary learning mode, add permanent tracking to
  detect a direct call "miss" -- i.e., when none of the existing direct
  calls are applicable and the retpoline will be used.

- In the case of a miss (or N misses), it could trigger a direct call
  patching operation to be run later (workqueue or syscall exit).  If
  all the direct call slots are full, it could patch the least recently
  modified one.  If this causes thrashing (>x changes over y time), it
  could increase the number of direct call slots using a trampoline.
  Even if there were several slots, CPU branch prediction would
  presumably help make it much faster than a basic retpoline.
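
A rough sketch of such a layout (illustrative; the register holding the
target and the thunk name are assumed):

  cmpq    $target1, %rax                # slot 1
  jnz     1f
  call    target1
  jmp     3f
1:
  cmpq    $target2, %rax                # slot 2; further slots repeat this
  jnz     2f                            # pattern (unpopulated slots start as NOPs)
  call    target2
  jmp     3f
2:
  call    __x86_indirect_thunk_rax      # miss: fall back to the retpoline and
                                        # record the miss for later patching
3: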

Thoughts?
Nadav Amit Nov. 28, 2018, 7:34 p.m. UTC | #5
> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>> This RFC introduces indirect call promotion in runtime, which for the
>> matter of simplification (and branding) will be called here "relpolines"
>> (relative call + trampoline). Relpolines are mainly intended as a way
>> of reducing retpoline overheads due to Spectre v2.
>> 
>> Unlike indirect call promotion through profile guided optimization, the
>> proposed approach does not require a profiling stage, works well with
>> modules whose address is unknown and can adapt to changing workloads.
>> 
>> The main idea is simple: for every indirect call, we inject a piece of
>> code with fast- and slow-path calls. The fast path is used if the target
>> matches the expected (hot) target. The slow-path uses a retpoline.
>> During training, the slow-path is set to call a function that saves the
>> call source and target in a hash-table and keep count for call
>> frequency. The most common target is then patched into the hot path.
>> 
>> The patching is done on-the-fly by patching the conditional branch
>> (opcode and offset) that is used to compare the target to the hot
>> target. This allows to direct all cores to the fast-path, while patching
>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>> patch a single byte when the code might be executed by any core. (2)
>> When patching more than one byte, ensure that all cores do not run the
>> to-be-patched-code by preventing this code from being preempted, and
>> using synchronize_sched() after patching the branch that jumps over this
>> code.
>> 
>> Changing all the indirect calls to use relpolines is done using assembly
>> macro magic. There are alternative solutions, but this one is
>> relatively simple and transparent. There is also logic to retrain the
>> software predictor, but the policy it uses may need to be refined.
>> 
>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>> 
>> 		base		relpoline
>> 		----		---------
>> nginx 	22898 		25178 (+10%)
>> redis-ycsb	24523		25486 (+4%)
>> dbench	2144		2103 (+2%)
>> 
>> When retpolines are disabled, and if retraining is off, performance
>> benefits are up to 2% (nginx), but are much less impressive.
> 
> Hi Nadav,
> 
> Peter pointed me to these patches during a discussion about retpoline
> profiling.  Personally, I think this is brilliant.  This could help
> networking and filesystem intensive workloads a lot.

Thanks! I was a bit held back by the relatively limited number of responses.
I finished another version two weeks ago, and every day I think: "should it
be RFCv2 or v1”, ending up not sending it…

There is one issue that I realized while working on the new version: I’m not
sure it is well-defined what an out-of-line retpoline is allowed to do. The
indirect branch promotion code can change rflags, which might cause
correctness issues. In practice, using gcc, it is not a problem.

> Some high-level comments:
> 
> - "Relpoline" looks confusingly a lot like "retpoline".  How about
>  "optpoline"?  To avoid confusing myself I will hereafter refer to it
>  as such :-)

Sure. For the academic paper we submitted, we used a much worse name that my
colleague came up with. I’m ok with anything other than that name (not
mentioned to prevent double-blinding violations). I’ll go with your name.

> - Instead of patching one byte at a time, is there a reason why
>  text_poke_bp() can't be used?  That would greatly simplify the
>  patching process, as everything could be patched in a single step.

I thought of it and maybe it is somehow possible, but there are several
problems, for which I didn’t find a simple solution:

1. An indirect branch inside the BP handler might be the one we patch

2. An indirect branch inside an interrupt or NMI handler might be the
   one we patch

3. Overall, we need to patch more than a single instruction, and
   text_poke_bp() is not suitable

> - In many cases, a single direct call may not be sufficient, as there
>  could be for example multiple tasks using different network protocols
>  which need different callbacks for the same call site.

We want to know during compilation how many targets to use. It is not
super-simple to support multiple inlined targets, but it is feasible if you
are willing to annotate when multiple targets are needed. We have a version
which uses outlined indirect branch promotion when there are multiple
targets, but it’s not ready for prime time, and the code-cache misses can
induce some overheads.

> - I'm not sure about the periodic retraining logic, it seems a bit
>  nondeterministic and bursty.

I agree. It can be limited to cases in which modules are loaded/removed,
or when the user explicitly asks for it to take place.

> 
> So I'd propose the following changes:
> 
> - In the optpoline, reserve space for multiple (5 or so) comparisons and
>  direct calls.  Maybe the number of reserved cmp/jne/call slots can be
>  tweaked by the caller somehow.  Or maybe it could grow as needed.
>  Starting out, they would just be NOPs.
> 
> - Instead of the temporary learning mode, add permanent tracking to
>  detect a direct call "miss" -- i.e., when none of the existing direct
>  calls are applicable and the retpoline will be used.
> 
> - In the case of a miss (or N misses), it could trigger a direct call
>  patching operation to be run later (workqueue or syscall exit).  If
>  all the direct call slots are full, it could patch the least recently
>  modified one.  If this causes thrashing (>x changes over y time), it
>  could increase the number of direct call slots using a trampoline.
>  Even if there were several slots, CPU branch prediction would
>  presumably help make it much faster than a basic retpoline.
> 
> Thoughts?

I’m ok with these changes in general, although having multiple inline
targets is not super-simple. However, there are a few issues:

- There is potentially a negative impact due to code-size increase which
  I was worried about.

- I see no reason not to use all the available slots immediately when
  we encounter a miss.

- The order of the branches might be “wrong” (unoptimized) if we do not
  do any relearning.

- The main question is what to do if we run out of slots and still get
  (many?) misses. I presume the right thing is to disable the optpoline
  and jump over it to the retpoline.

Thanks again for the feedback, and please let me know what you think about
my concerns.
Josh Poimboeuf Nov. 29, 2018, 12:38 a.m. UTC | #6
On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> >> This RFC introduces indirect call promotion in runtime, which for the
> >> matter of simplification (and branding) will be called here "relpolines"
> >> (relative call + trampoline). Relpolines are mainly intended as a way
> >> of reducing retpoline overheads due to Spectre v2.
> >> 
> >> Unlike indirect call promotion through profile guided optimization, the
> >> proposed approach does not require a profiling stage, works well with
> >> modules whose address is unknown and can adapt to changing workloads.
> >> 
> >> The main idea is simple: for every indirect call, we inject a piece of
> >> code with fast- and slow-path calls. The fast path is used if the target
> >> matches the expected (hot) target. The slow-path uses a retpoline.
> >> During training, the slow-path is set to call a function that saves the
> >> call source and target in a hash-table and keep count for call
> >> frequency. The most common target is then patched into the hot path.
> >> 
> >> The patching is done on-the-fly by patching the conditional branch
> >> (opcode and offset) that is used to compare the target to the hot
> >> target. This allows to direct all cores to the fast-path, while patching
> >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> >> patch a single byte when the code might be executed by any core. (2)
> >> When patching more than one byte, ensure that all cores do not run the
> >> to-be-patched-code by preventing this code from being preempted, and
> >> using synchronize_sched() after patching the branch that jumps over this
> >> code.
> >> 
> >> Changing all the indirect calls to use relpolines is done using assembly
> >> macro magic. There are alternative solutions, but this one is
> >> relatively simple and transparent. There is also logic to retrain the
> >> software predictor, but the policy it uses may need to be refined.
> >> 
> >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> >> 
> >> 		base		relpoline
> >> 		----		---------
> >> nginx 	22898 		25178 (+10%)
> >> redis-ycsb	24523		25486 (+4%)
> >> dbench	2144		2103 (+2%)
> >> 
> >> When retpolines are disabled, and if retraining is off, performance
> >> benefits are up to 2% (nginx), but are much less impressive.
> > 
> > Hi Nadav,
> > 
> > Peter pointed me to these patches during a discussion about retpoline
> > profiling.  Personally, I think this is brilliant.  This could help
> > networking and filesystem intensive workloads a lot.
> 
> Thanks! I was a bit held-back by the relatively limited number of responses.

It is a rather, erm, ambitious idea, maybe they were speechless :-)

> I finished another version two weeks ago, and every day I think: "should it
> be RFCv2 or v1”, ending up not sending it…
> 
> There is one issue that I realized while working on the new version: I’m not
> sure it is well-defined what an outline retpoline is allowed to do. The
> indirect branch promotion code can change rflags, which might cause
> correction issues. In practice, using gcc, it is not a problem.

Callees can clobber flags, so it seems fine to me.

> > Some high-level comments:
> > 
> > - "Relpoline" looks confusingly a lot like "retpoline".  How about
> >  "optpoline"?  To avoid confusing myself I will hereafter refer to it
> >  as such :-)
> 
> Sure. For the academic paper we submitted, we used a much worse name that my
> colleague came up with. I’m ok with anything other than that name (not
> mentioned to prevent double-blinding violations). I’ll go with your name.
> 
> > - Instead of patching one byte at a time, is there a reason why
> >  text_poke_bp() can't be used?  That would greatly simplify the
> >  patching process, as everything could be patched in a single step.
> 
> I thought of it and maybe it is somehow possible, but there are several
> problems, for which I didn’t find a simple solution:
> 
> 1. An indirect branch inside the BP handler might be the one we patch

I _think_ nested INT3s should be doable, because they don't use IST.
Maybe Andy can clarify.

To avoid recursion we'd have to make sure not to use any function
pointers in do_int3() before or during the call to poke_int3_handler().

Incidentally, I'm working on a patch which adds an indirect call to
poke_int3_handler().  We'd have to disable optpolines for that code.

> 2. An indirect branch inside an interrupt or NMI handler might be the
>    one we patch

But INT3s just use the existing stack, and NMIs support nesting, so I'm
thinking that should also be doable.  Andy?

> 3. Overall, we need to patch more than a single instruction, and
>    text_poke_bp() is not suitable

Hm, I suppose that's true.

> > - In many cases, a single direct call may not be sufficient, as there
> >  could be for example multiple tasks using different network protocols
> >  which need different callbacks for the same call site.
> 
> We want to know during compilation how many targets to use. It is not
> super-simple to support multiple inlined targets, but it is feasible if you
> are willing to annotate when multiple targets are needed.

Why would an annotation be needed?  Is there a reason why the 'call'
macro wouldn't work?

I hate to say it, but another option would be a compiler plugin.

> We have a version which uses outlined indirect branch promotion when
> there are multiple targets, but it’s not ready for prime time, and the
> code-cache misses can induce some overheads.

It would be interesting to see some measurements comparing inline and
out-of-line.  If there were multiple direct call slots then out-of-line
could have advantages in some cases since it reduces the original
function's i-cache footprint.  But maybe it wouldn't be worth it.

> > - I'm not sure about the periodic retraining logic, it seems a bit
> >  nondeterministic and bursty.
> 
> I agree. It can be limited to cases in which modules are loaded/removed,
> or when the user explicitly asks for it to take place.
> 
> > 
> > So I'd propose the following changes:
> > 
> > - In the optpoline, reserve space for multiple (5 or so) comparisons and
> >  direct calls.  Maybe the number of reserved cmp/jne/call slots can be
> >  tweaked by the caller somehow.  Or maybe it could grow as needed.
> >  Starting out, they would just be NOPs.
> > 
> > - Instead of the temporary learning mode, add permanent tracking to
> >  detect a direct call "miss" -- i.e., when none of the existing direct
> >  calls are applicable and the retpoline will be used.
> > 
> > - In the case of a miss (or N misses), it could trigger a direct call
> >  patching operation to be run later (workqueue or syscall exit).  If
> >  all the direct call slots are full, it could patch the least recently
> >  modified one.  If this causes thrashing (>x changes over y time), it
> >  could increase the number of direct call slots using a trampoline.
> >  Even if there were several slots, CPU branch prediction would
> >  presumably help make it much faster than a basic retpoline.
> > 
> > Thoughts?
> 
> I’m ok with these changes in general, although having multiple inline
> targets is not super-simple. However, there are a few issues:
> 
> - There is potentially a negative impact due to code-size increase which
>   I was worried about.

True.  Though out-of-line could help with that (but it would also have
downsides of course).

> - I see no reason not to use all the available slots immediately when
>   we encounter a miss.

Agreed.

> - The order of the branches might be “wrong” (unoptimized) if we do not
>   do any relearning.

True, though branch prediction would mitigate that to a certain degree.
Also if the number of slots is reasonably small (which it will probably
need to be anyway) then it might be good enough.

> - The main question is what to do if we run out of slots and still get
>   (many?) misses. I presume the right thing is to disable the optpoline
>   and jump over it to the retpoline.

It may need some experimentation.  You could just try a simple
round-robin patching scheme.  Possibly only writing after X number of
misses.  And if that doesn't help, just disable it.  Such a simple
approach might work well enough in practice.
Andy Lutomirski Nov. 29, 2018, 1:40 a.m. UTC | #7
On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> > > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > >
> > > On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> > >> This RFC introduces indirect call promotion in runtime, which for the
> > >> matter of simplification (and branding) will be called here "relpolines"
> > >> (relative call + trampoline). Relpolines are mainly intended as a way
> > >> of reducing retpoline overheads due to Spectre v2.
> > >>
> > >> Unlike indirect call promotion through profile guided optimization, the
> > >> proposed approach does not require a profiling stage, works well with
> > >> modules whose address is unknown and can adapt to changing workloads.
> > >>
> > >> The main idea is simple: for every indirect call, we inject a piece of
> > >> code with fast- and slow-path calls. The fast path is used if the target
> > >> matches the expected (hot) target. The slow-path uses a retpoline.
> > >> During training, the slow-path is set to call a function that saves the
> > >> call source and target in a hash-table and keep count for call
> > >> frequency. The most common target is then patched into the hot path.
> > >>
> > >> The patching is done on-the-fly by patching the conditional branch
> > >> (opcode and offset) that is used to compare the target to the hot
> > >> target. This allows to direct all cores to the fast-path, while patching
> > >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> > >> patch a single byte when the code might be executed by any core. (2)
> > >> When patching more than one byte, ensure that all cores do not run the
> > >> to-be-patched-code by preventing this code from being preempted, and
> > >> using synchronize_sched() after patching the branch that jumps over this
> > >> code.
> > >>
> > >> Changing all the indirect calls to use relpolines is done using assembly
> > >> macro magic. There are alternative solutions, but this one is
> > >> relatively simple and transparent. There is also logic to retrain the
> > >> software predictor, but the policy it uses may need to be refined.
> > >>
> > >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> > >>
> > >>            base            relpoline
> > >>            ----            ---------
> > >> nginx      22898           25178 (+10%)
> > >> redis-ycsb 24523           25486 (+4%)
> > >> dbench     2144            2103 (+2%)
> > >>
> > >> When retpolines are disabled, and if retraining is off, performance
> > >> benefits are up to 2% (nginx), but are much less impressive.
> > >
> > > Hi Nadav,
> > >
> > > Peter pointed me to these patches during a discussion about retpoline
> > > profiling.  Personally, I think this is brilliant.  This could help
> > > networking and filesystem intensive workloads a lot.
> >
> > Thanks! I was a bit held-back by the relatively limited number of responses.
>
> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>
> > I finished another version two weeks ago, and every day I think: "should it
> > be RFCv2 or v1”, ending up not sending it…
> >
> > There is one issue that I realized while working on the new version: I’m not
> > sure it is well-defined what an outline retpoline is allowed to do. The
> > indirect branch promotion code can change rflags, which might cause
> > correction issues. In practice, using gcc, it is not a problem.
>
> Callees can clobber flags, so it seems fine to me.

Just to check I understand your approach right: you made a macro
called "call", and you're therefore causing all instances of "call" to
become magic?  This is... terrifying.  It's even plausibly worse than
"#define if" :)  The scariest bit is that it will impact inline asm as
well.  Maybe a gcc plugin would be less alarming?

> >
> > 1. An indirect branch inside the BP handler might be the one we patch
>
> I _think_ nested INT3s should be doable, because they don't use IST.
> Maybe Andy can clarify.

int3 should survive recursion these days.  Although I admit I'm
currently wondering what happens if one thread puts a kprobe on an
address that another thread tries to text_poke.

Also, this relpoline magic is likely to start patching text at runtime
on a semi-regular basis.  This type of patching is *slow*.  Is it a
problem?

> > 2. An indirect branch inside an interrupt or NMI handler might be the
> >    one we patch
>
> But INT3s just use the existing stack, and NMIs support nesting, so I'm
> thinking that should also be doable.  Andy?
>

In principle, as long as the code isn't NOKPROBE_SYMBOL-ified, we
should be fine, right?  I'd be a little nervous if we get an int3 in
the C code that handles the early part of an NMI from user mode.  It's
*probably* okay, but one of the alarming issues is that the int3
return path will implicitly unmask NMI, which isn't fantastic.  Maybe
we finally need to dust off my old "return using RET" code to get rid
of that problem.
Nadav Amit Nov. 29, 2018, 2:06 a.m. UTC | #8
> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>> 
>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>>>>> This RFC introduces indirect call promotion in runtime, which for the
>>>>> matter of simplification (and branding) will be called here "relpolines"
>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
>>>>> of reducing retpoline overheads due to Spectre v2.
>>>>> 
>>>>> Unlike indirect call promotion through profile guided optimization, the
>>>>> proposed approach does not require a profiling stage, works well with
>>>>> modules whose address is unknown and can adapt to changing workloads.
>>>>> 
>>>>> The main idea is simple: for every indirect call, we inject a piece of
>>>>> code with fast- and slow-path calls. The fast path is used if the target
>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
>>>>> During training, the slow-path is set to call a function that saves the
>>>>> call source and target in a hash-table and keep count for call
>>>>> frequency. The most common target is then patched into the hot path.
>>>>> 
>>>>> The patching is done on-the-fly by patching the conditional branch
>>>>> (opcode and offset) that is used to compare the target to the hot
>>>>> target. This allows to direct all cores to the fast-path, while patching
>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>>>>> patch a single byte when the code might be executed by any core. (2)
>>>>> When patching more than one byte, ensure that all cores do not run the
>>>>> to-be-patched-code by preventing this code from being preempted, and
>>>>> using synchronize_sched() after patching the branch that jumps over this
>>>>> code.
>>>>> 
>>>>> Changing all the indirect calls to use relpolines is done using assembly
>>>>> macro magic. There are alternative solutions, but this one is
>>>>> relatively simple and transparent. There is also logic to retrain the
>>>>> software predictor, but the policy it uses may need to be refined.
>>>>> 
>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>>>>> 
>>>>>           base            relpoline
>>>>>           ----            ---------
>>>>> nginx      22898           25178 (+10%)
>>>>> redis-ycsb 24523           25486 (+4%)
>>>>> dbench     2144            2103 (+2%)
>>>>> 
>>>>> When retpolines are disabled, and if retraining is off, performance
>>>>> benefits are up to 2% (nginx), but are much less impressive.
>>>> 
>>>> Hi Nadav,
>>>> 
>>>> Peter pointed me to these patches during a discussion about retpoline
>>>> profiling.  Personally, I think this is brilliant.  This could help
>>>> networking and filesystem intensive workloads a lot.
>>> 
>>> Thanks! I was a bit held-back by the relatively limited number of responses.
>> 
>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>> 
>>> I finished another version two weeks ago, and every day I think: "should it
>>> be RFCv2 or v1”, ending up not sending it…
>>> 
>>> There is one issue that I realized while working on the new version: I’m not
>>> sure it is well-defined what an outline retpoline is allowed to do. The
>>> indirect branch promotion code can change rflags, which might cause
>>> correction issues. In practice, using gcc, it is not a problem.
>> 
>> Callees can clobber flags, so it seems fine to me.
> 
> Just to check I understand your approach right: you made a macro
> called "call", and you're therefore causing all instances of "call" to
> become magic?  This is... terrifying.  It's even plausibly worse than
> "#define if" :)  The scariest bit is that it will impact inline asm as
> well.  Maybe a gcc plugin would be less alarming?

It is likely to look less alarming. When I looked at the inline retpoline
implementation of gcc, it didn’t look much better than what I did - it
basically just emits assembly instructions.

Anyhow, I'll look (again) into using gcc plugins.

>>> 1. An indirect branch inside the BP handler might be the one we patch
>> 
>> I _think_ nested INT3s should be doable, because they don't use IST.
>> Maybe Andy can clarify.
> 
> int3 should survive recursion these days.  Although I admit I'm
> currently wondering what happens if one thread puts a kprobe on an
> address that another thread tries to text_poke.

The issue I was referring to is having an indirect call *inside* the handler.
For example, you try to patch the call to bp_int3_handler and then get an
int3. Such calls can be annotated to prevent them from being patched. Then
again, I need to see how gcc plugins can get these annotations.

> 
> Also, this relpoline magic is likely to start patching text at runtime
> on a semi-regular basis.  This type of patching is *slow*.  Is it a
> problem?

It didn’t appear so. Although there are >10000 indirect branches in the
kernel, you don’t patch too many of them even if you are doing relearning.

> 
>>> 2. An indirect branch inside an interrupt or NMI handler might be the
>>>   one we patch
>> 
>> But INT3s just use the existing stack, and NMIs support nesting, so I'm
>> thinking that should also be doable.  Andy?
> 
> In principle, as long as the code isn't NOKPROBE_SYMBOL-ified, we
> should be fine, right?  I'd be a little nervous if we get an int3 in
> the C code that handles the early part of an NMI from user mode.  It's
> *probably* okay, but one of the alarming issues is that the int3
> return path will implicitly unmask NMI, which isn't fantastic.  Maybe
> we finally need to dust off my old "return using RET" code to get rid
> of that problem.

So it may be possible. It would require having a new text_poke_bp() variant
for multiple instructions. text_poke_bp() might be slower though.
Andy Lutomirski Nov. 29, 2018, 3:24 a.m. UTC | #9
On Nov 28, 2018, at 6:06 PM, Nadav Amit <namit@vmware.com> wrote:

>> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> 
>>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
>>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>> 
>>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>>>>>> This RFC introduces indirect call promotion in runtime, which for the
>>>>>> matter of simplification (and branding) will be called here "relpolines"
>>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
>>>>>> of reducing retpoline overheads due to Spectre v2.
>>>>>> 
>>>>>> Unlike indirect call promotion through profile guided optimization, the
>>>>>> proposed approach does not require a profiling stage, works well with
>>>>>> modules whose address is unknown and can adapt to changing workloads.
>>>>>> 
>>>>>> The main idea is simple: for every indirect call, we inject a piece of
>>>>>> code with fast- and slow-path calls. The fast path is used if the target
>>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
>>>>>> During training, the slow-path is set to call a function that saves the
>>>>>> call source and target in a hash-table and keep count for call
>>>>>> frequency. The most common target is then patched into the hot path.
>>>>>> 
>>>>>> The patching is done on-the-fly by patching the conditional branch
>>>>>> (opcode and offset) that is used to compare the target to the hot
>>>>>> target. This allows to direct all cores to the fast-path, while patching
>>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>>>>>> patch a single byte when the code might be executed by any core. (2)
>>>>>> When patching more than one byte, ensure that all cores do not run the
>>>>>> to-be-patched-code by preventing this code from being preempted, and
>>>>>> using synchronize_sched() after patching the branch that jumps over this
>>>>>> code.
>>>>>> 
>>>>>> Changing all the indirect calls to use relpolines is done using assembly
>>>>>> macro magic. There are alternative solutions, but this one is
>>>>>> relatively simple and transparent. There is also logic to retrain the
>>>>>> software predictor, but the policy it uses may need to be refined.
>>>>>> 
>>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>>>>>> 
>>>>>>          base            relpoline
>>>>>>          ----            ---------
>>>>>> nginx      22898           25178 (+10%)
>>>>>> redis-ycsb 24523           25486 (+4%)
>>>>>> dbench     2144            2103 (+2%)
>>>>>> 
>>>>>> When retpolines are disabled, and if retraining is off, performance
>>>>>> benefits are up to 2% (nginx), but are much less impressive.
>>>>> 
>>>>> Hi Nadav,
>>>>> 
>>>>> Peter pointed me to these patches during a discussion about retpoline
>>>>> profiling.  Personally, I think this is brilliant.  This could help
>>>>> networking and filesystem intensive workloads a lot.
>>>> 
>>>> Thanks! I was a bit held-back by the relatively limited number of responses.
>>> 
>>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>>> 
>>>> I finished another version two weeks ago, and every day I think: "should it
>>>> be RFCv2 or v1”, ending up not sending it…
>>>> 
>>>> There is one issue that I realized while working on the new version: I’m not
>>>> sure it is well-defined what an outline retpoline is allowed to do. The
>>>> indirect branch promotion code can change rflags, which might cause
>>>> correction issues. In practice, using gcc, it is not a problem.
>>> 
>>> Callees can clobber flags, so it seems fine to me.
>> 
>> Just to check I understand your approach right: you made a macro
>> called "call", and you're therefore causing all instances of "call" to
>> become magic?  This is... terrifying.  It's even plausibly worse than
>> "#define if" :)  The scariest bit is that it will impact inline asm as
>> well.  Maybe a gcc plugin would be less alarming?
> 
> It is likely to look less alarming. When I looked at the inline retpoline
> implementation of gcc, it didn’t look much better than what I did - it
> basically just emits assembly instructions.

To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”

Hey Josh - you could potentially do the same hack to generate the static call tables. Take that, objtool.

> 
> Anyhow, I look (again) into using gcc-plugins.
> 
>>>> 1. An indirect branch inside the BP handler might be the one we patch
>>> 
>>> I _think_ nested INT3s should be doable, because they don't use IST.
>>> Maybe Andy can clarify.
>> 
>> int3 should survive recursion these days.  Although I admit I'm
>> currently wondering what happens if one thread puts a kprobe on an
>> address that another thread tries to text_poke.
> 
> The issue I regarded is having an indirect call *inside* the the handler.
> For example, you try to patch the call to bp_int3_handler and then get an
> int3. They can be annotated to prevent them from being patched. Then again,
> I need to see how gcc plugins can get these annotations.

We could move the relevant code to a separate object file that disables the whole mess.

> 
>> 
>> Also, this relpoline magic is likely to start patching text at runtime
>> on a semi-regular basis.  This type of patching is *slow*.  Is it a
>> problem?
> 
> It didn’t appear so. Although there are >10000 indirect branches in the
> kernel, you don’t patch too many of them even you are doing relearning.
> 
>> 
>>>> 2. An indirect branch inside an interrupt or NMI handler might be the
>>>>  one we patch
>>> 
>>> But INT3s just use the existing stack, and NMIs support nesting, so I'm
>>> thinking that should also be doable.  Andy?
>> 
>> In principle, as long as the code isn't NOKPROBE_SYMBOL-ified, we
>> should be fine, right?  I'd be a little nervous if we get an int3 in
>> the C code that handles the early part of an NMI from user mode.  It's
>> *probably* okay, but one of the alarming issues is that the int3
>> return path will implicitly unmask NMI, which isn't fantastic.  Maybe
>> we finally need to dust off my old "return using RET" code to get rid
>> of that problem.
> 
> So it may be possible. It would require having a new text_poke_bp() variant
> for multiple instructions. text_poke_bp() might be slower though.
> 
> 

Can you outline how the patching works at all?  You’re getting rid of preempt disabling, right?  What’s the actual sequence and how does it work?
Josh Poimboeuf Nov. 29, 2018, 4:36 a.m. UTC | #10
On Wed, Nov 28, 2018 at 07:24:08PM -0800, Andy Lutomirski wrote:
> To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”
> 
> Hey Josh - you could potentially do the same hack to generate the
> static call tables. Take that, objtool.

Ha, after witnessing Nadav's glorious hack, I considered that.  But I
didn't see a way to pull it off, because asm macro conditionals don't
seem to have a way to test for a regex (or at least a named prefix) for
the call target.

I'd need a way to detect "call __static_call_tramp_<whatever>".
Andy Lutomirski Nov. 29, 2018, 6:06 a.m. UTC | #11
On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
> On Nov 28, 2018, at 6:06 PM, Nadav Amit <namit@vmware.com> wrote:
>
> >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> >>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >>>>>
> >>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> >>>>>> This RFC introduces indirect call promotion in runtime, which for the
> >>>>>> matter of simplification (and branding) will be called here "relpolines"
> >>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
> >>>>>> of reducing retpoline overheads due to Spectre v2.
> >>>>>>
> >>>>>> Unlike indirect call promotion through profile guided optimization, the
> >>>>>> proposed approach does not require a profiling stage, works well with
> >>>>>> modules whose address is unknown and can adapt to changing workloads.
> >>>>>>
> >>>>>> The main idea is simple: for every indirect call, we inject a piece of
> >>>>>> code with fast- and slow-path calls. The fast path is used if the target
> >>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
> >>>>>> During training, the slow-path is set to call a function that saves the
> >>>>>> call source and target in a hash-table and keep count for call
> >>>>>> frequency. The most common target is then patched into the hot path.
> >>>>>>
> >>>>>> The patching is done on-the-fly by patching the conditional branch
> >>>>>> (opcode and offset) that is used to compare the target to the hot
> >>>>>> target. This allows to direct all cores to the fast-path, while patching
> >>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> >>>>>> patch a single byte when the code might be executed by any core. (2)
> >>>>>> When patching more than one byte, ensure that all cores do not run the
> >>>>>> to-be-patched-code by preventing this code from being preempted, and
> >>>>>> using synchronize_sched() after patching the branch that jumps over this
> >>>>>> code.
> >>>>>>
> >>>>>> Changing all the indirect calls to use relpolines is done using assembly
> >>>>>> macro magic. There are alternative solutions, but this one is
> >>>>>> relatively simple and transparent. There is also logic to retrain the
> >>>>>> software predictor, but the policy it uses may need to be refined.
> >>>>>>
> >>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
> >>>>>>
> >>>>>>          base            relpoline
> >>>>>>          ----            ---------
> >>>>>> nginx      22898           25178 (+10%)
> >>>>>> redis-ycsb 24523           25486 (+4%)
> >>>>>> dbench     2144            2103 (+2%)
> >>>>>>
> >>>>>> When retpolines are disabled, and if retraining is off, performance
> >>>>>> benefits are up to 2% (nginx), but are much less impressive.
> >>>>>
> >>>>> Hi Nadav,
> >>>>>
> >>>>> Peter pointed me to these patches during a discussion about retpoline
> >>>>> profiling.  Personally, I think this is brilliant.  This could help
> >>>>> networking and filesystem intensive workloads a lot.
> >>>>
> >>>> Thanks! I was a bit held-back by the relatively limited number of responses.
> >>>
> >>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
> >>>
> >>>> I finished another version two weeks ago, and every day I think: "should it
> >>>> be RFCv2 or v1”, ending up not sending it…
> >>>>
> >>>> There is one issue that I realized while working on the new version: I’m not
> >>>> sure it is well-defined what an outline retpoline is allowed to do. The
> >>>> indirect branch promotion code can change rflags, which might cause
> >>>> correction issues. In practice, using gcc, it is not a problem.
> >>>
> >>> Callees can clobber flags, so it seems fine to me.
> >>
> >> Just to check I understand your approach right: you made a macro
> >> called "call", and you're therefore causing all instances of "call" to
> >> become magic?  This is... terrifying.  It's even plausibly worse than
> >> "#define if" :)  The scariest bit is that it will impact inline asm as
> >> well.  Maybe a gcc plugin would be less alarming?
> >
> > It is likely to look less alarming. When I looked at the inline retpoline
> > implementation of gcc, it didn’t look much better than what I did - it
> > basically just emits assembly instructions.
>
> To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”

Although... how do you avoid matching on things that really don't want
this treatment?  paravirt ops come to mind.
Josh Poimboeuf Nov. 29, 2018, 3:19 p.m. UTC | #12
On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >
> > On Nov 28, 2018, at 6:06 PM, Nadav Amit <namit@vmware.com> wrote:
> >
> > >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > >>
> > >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > >>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> > >>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > >>>>>
> > >>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> > >>>>>> This RFC introduces indirect call promotion in runtime, which for the
> > >>>>>> matter of simplification (and branding) will be called here "relpolines"
> > >>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
> > >>>>>> of reducing retpoline overheads due to Spectre v2.
> > >>>>>>
> > >>>>>> Unlike indirect call promotion through profile guided optimization, the
> > >>>>>> proposed approach does not require a profiling stage, works well with
> > >>>>>> modules whose address is unknown and can adapt to changing workloads.
> > >>>>>>
> > >>>>>> The main idea is simple: for every indirect call, we inject a piece of
> > >>>>>> code with fast- and slow-path calls. The fast path is used if the target
> > >>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
> > >>>>>> During training, the slow-path is set to call a function that saves the
> > >>>>>> call source and target in a hash-table and keep count for call
> > >>>>>> frequency. The most common target is then patched into the hot path.
> > >>>>>>
> > >>>>>> The patching is done on-the-fly by patching the conditional branch
> > >>>>>> (opcode and offset) that is used to compare the target to the hot
> > >>>>>> target. This allows to direct all cores to the fast-path, while patching
> > >>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> > >>>>>> patch a single byte when the code might be executed by any core. (2)
> > >>>>>> When patching more than one byte, ensure that all cores do not run the
> > >>>>>> to-be-patched-code by preventing this code from being preempted, and
> > >>>>>> using synchronize_sched() after patching the branch that jumps over this
> > >>>>>> code.
> > >>>>>>
> > >>>>>> Changing all the indirect calls to use relpolines is done using assembly
> > >>>>>> macro magic. There are alternative solutions, but this one is
> > >>>>>> relatively simple and transparent. There is also logic to retrain the
> > >>>>>> software predictor, but the policy it uses may need to be refined.
> > >>>>>>
> > >>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
> > >>>>>>
> > >>>>>>          base            relpoline
> > >>>>>>          ----            ---------
> > >>>>>> nginx      22898           25178 (+10%)
> > >>>>>> redis-ycsb 24523           25486 (+4%)
> > >>>>>> dbench     2144            2103 (+2%)
> > >>>>>>
> > >>>>>> When retpolines are disabled, and if retraining is off, performance
> > >>>>>> benefits are up to 2% (nginx), but are much less impressive.
> > >>>>>
> > >>>>> Hi Nadav,
> > >>>>>
> > >>>>> Peter pointed me to these patches during a discussion about retpoline
> > >>>>> profiling.  Personally, I think this is brilliant.  This could help
> > >>>>> networking and filesystem intensive workloads a lot.
> > >>>>
> > >>>> Thanks! I was a bit held-back by the relatively limited number of responses.
> > >>>
> > >>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
> > >>>
> > >>>> I finished another version two weeks ago, and every day I think: "should it
> > >>>> be RFCv2 or v1”, ending up not sending it…
> > >>>>
> > >>>> There is one issue that I realized while working on the new version: I’m not
> > >>>> sure it is well-defined what an outline retpoline is allowed to do. The
> > >>>> indirect branch promotion code can change rflags, which might cause
> > >>>> correction issues. In practice, using gcc, it is not a problem.
> > >>>
> > >>> Callees can clobber flags, so it seems fine to me.
> > >>
> > >> Just to check I understand your approach right: you made a macro
> > >> called "call", and you're therefore causing all instances of "call" to
> > >> become magic?  This is... terrifying.  It's even plausibly worse than
> > >> "#define if" :)  The scariest bit is that it will impact inline asm as
> > >> well.  Maybe a gcc plugin would be less alarming?
> > >
> > > It is likely to look less alarming. When I looked at the inline retpoline
> > > implementation of gcc, it didn’t look much better than what I did - it
> > > basically just emits assembly instructions.
> >
> > To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”
> 
> Although... how do you avoid matching on things that really don't want
> this treatment?  paravirt ops come to mind.

Paravirt ops don't use retpolines because they're patched into direct
calls during boot.  So Nadav's patches won't touch them.
Nadav Amit Dec. 1, 2018, 6:52 a.m. UTC | #13
> On Nov 29, 2018, at 7:19 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
>> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Nov 28, 2018, at 6:06 PM, Nadav Amit <namit@vmware.com> wrote:
>>> 
>>>>> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>> 
>>>>>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
>>>>>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>>>>>>>> 
>>>>>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>>>>>>>>> [ RFC cover letter trimmed ]
>>>>>>>> 
>>>>>>>> Hi Nadav,
>>>>>>>> 
>>>>>>>> Peter pointed me to these patches during a discussion about retpoline
>>>>>>>> profiling.  Personally, I think this is brilliant.  This could help
>>>>>>>> networking and filesystem intensive workloads a lot.
>>>>>>> 
>>>>>>> Thanks! I was a bit held-back by the relatively limited number of responses.
>>>>>> 
>>>>>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>>>>>> 
>>>>>>> I finished another version two weeks ago, and every day I think: "should it
>>>>>>> be RFCv2 or v1”, ending up not sending it…
>>>>>>> 
>>>>>>> There is one issue that I realized while working on the new version: I’m not
>>>>>>> sure it is well-defined what an outline retpoline is allowed to do. The
>>>>>>> indirect branch promotion code can change rflags, which might cause
>>>>>>> correctness issues. In practice, using gcc, it is not a problem.
>>>>>> 
>>>>>> Callees can clobber flags, so it seems fine to me.
>>>>> 
>>>>> Just to check I understand your approach right: you made a macro
>>>>> called "call", and you're therefore causing all instances of "call" to
>>>>> become magic?  This is... terrifying.  It's even plausibly worse than
>>>>> "#define if" :)  The scariest bit is that it will impact inline asm as
>>>>> well.  Maybe a gcc plugin would be less alarming?
>>>> 
>>>> It is likely to look less alarming. When I looked at the inline retpoline
>>>> implementation of gcc, it didn’t look much better than what I did - it
>>>> basically just emits assembly instructions.
>>> 
>>> To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”
>> 
>> Although... how do you avoid matching on things that really don't want
>> this treatment?  paravirt ops come to mind.
> 
> Paravirt ops don't use retpolines because they're patched into direct
> calls during boot.  So Nadav's patches won't touch them.

Actually, the way it’s handled is slightly more complicated: yes, as Josh
said, the CALL macro should not be applied to these call sites, but the
question is how that is achieved.

The basic idea is that the CALL macro should only be applied to C source
files and not to assembly files, in particular not to macros.s, which holds
the PV call macros. I will recheck that it is done this way.
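
A minimal sketch of that split, under the assumption that the macro file is
handed to the assembler only for compiler-generated code from C files (e.g.
via something like a -Wa,arch/x86/kernel/macros.s option in the C flags), so
that hand-written .S files and macros.s itself never see the redefinition:

	.text
	# What the compiler emits for an indirect call under retpolines; with
	# the macro file in effect (C objects only), a "call" macro matches
	# this operand and would expand the relpoline fast/slow paths instead.
	call	__x86_indirect_thunk_rax

	# What hand-written assembly (including macros.s with the PV call
	# macros) contains; the macro file is not in effect there, so this
	# stays a plain call. "some_asm_helper" is an invented name.
	call	some_asm_helper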

Regards,
Nadav
Josh Poimboeuf Dec. 1, 2018, 2:25 p.m. UTC | #14
On Sat, Dec 01, 2018 at 06:52:45AM +0000, Nadav Amit wrote:
> [ earlier quoting trimmed ]
> 
> Actually, the way it’s handled is slightly more complicated: yes, as Josh
> said, the CALL macro should not be applied to these call sites, but the
> question is how that is achieved.
> 
> The basic idea is that the CALL macro should only be applied to C source
> files and not to assembly files, in particular not to macros.s, which holds
> the PV call macros. I will recheck that it is done this way.

Even if the CALL macro were applied, it would get ignored by your code
because the PARAVIRT_CALL macro doesn't use retpolines.  So it would get
skipped by this check:

.ifc "\v", "__x86_indirect_thunk_\reg_it"
	relpoline_call reg=\reg_it
	retpoline = 1
.endif
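
To make the effect of that check concrete, here is a hypothetical expansion
of "relpoline_call reg=rax" for one call site (label names are invented and
the training/patching details from the cover letter are omitted):

	.text
relpoline_site_example:
	cmpq	$relpoline_hot_target, %rax	# does the target match the learned hot target?
	jnz	1f				# no: fall back to the slow path
	call	relpoline_hot_target		# fast path: direct, predictable call
	jmp	2f
1:	call	__x86_indirect_thunk_rax	# slow path: retpoline (or the training stub)
2:	ret

A paravirt call site never reaches relpoline_call at all: its operand is a
memory reference into the ops table rather than "__x86_indirect_thunk_<reg>",
so the .ifc comparison above does not match and the call is emitted
unchanged.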