linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v2 0/4] dynamic indirect call promotion
@ 2019-02-02  0:05 Edward Cree
  2019-02-02  0:07 ` [RFC PATCH v2 1/4] static_call: add indirect call promotion (dynamic_call) infrastructure Edward Cree
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Edward Cree @ 2019-02-02  0:05 UTC (permalink / raw)
  To: namit, jpoimboe; +Cc: linux-kernel, x86

This series introduces 'dynamic_calls', branch trees of static calls (updated
 at runtime using text patching), to avoid making indirect calls to common
 targets.  The basic mechanism is
    if (func == static_key_1.target)
        call_static_key_1(args);
    else if (func == static_key_2.target)
        call_static_key_2(args);
    /* ... */
    else
        (*func)(args); /* typically leads to a retpoline nowadays */
 with some additional statistics-gathering to allow periodic relearning of
 branch targets.  Creating and calling a dynamic call table are each a single
 line in the consuming code, although they expand to a nontrivial amount of
 data and text in the kernel image.
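For concreteness, the declaration and call site that patch 2 adds for the RX
 path's pt_prev->func() look like this (slightly abridged):
    DYNAMIC_CALL_4(int, deliver_skb, struct sk_buff *, struct net_device *,
                   struct packet_type *, struct net_device *);
    /* ... and in the RX path, which already runs under rcu_read_lock(): */
    ret = dynamic_deliver_skb(pt_prev->func, skb, skb->dev, pt_prev, orig_dev);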
This is essentially indirect branch prediction, performed in software because
 we can't trust hardware to get it right.  While the processor may speculate
 into the function calls, this is *probably* OK since they are known to be
 functions that frequently get called in this path, and thus are much less
 likely to contain side-channel information leaks than a completely arbitrary
 target PC from a branch target buffer.  Moreover, when the speculation is
 accurate we positively want to be able to speculate into the callee.
The branch target statistics are collected with percpu variables, counting
 both 'hits' on the existing branch targets and 'misses', divided into counts
 for up to four specific targets (first-come-first-served) and a catch-all
 miss count used once that table is full.
When the total number of specific misses on a cpu reaches 1000, work is
 triggered which adds up counts across all CPUs and chooses the two most-
 popular call targets to patch into the call path.
If instead the catch-all miss count reaches 1000, the counts and specific
 targets for that cpu are discarded, since either the target is too
 unpredictable (lots of low-frequency callees rather than a few dominating
 ones) or the targets that populated the table were by chance unpopular ones.
To ensure that the static key target does not change between the if () check
 and the call, the whole dynamic_call must take place in an RCU read-side
 critical section (which, since the callee does not know it is being called in
 this manner, then lasts at least until the callee returns), and the patching
 at re-learning time is done with the help of a static_key to switch callers
 off the dynamic_call path and RCU synchronisation to ensure none are still on
 it.  In cases where RCU cannot be used (e.g. because some callees need to RCU
 synchronise), it might be possible to add a variant that uses
 synchronize_rcu_tasks() when updating, but this series does not attempt this.
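Concretely, the relearning update in patch 1 boils down to the following
 sequence (eliding the statistics bookkeeping):
    static_branch_enable(dc->skip_fast);   /* divert callers off the fast path */
    synchronize_rcu();                     /* wait out callers already on it */
    for (i = 0; i < DYNAMIC_CALL_BRANCHES; i++)
        __static_call_update(dc->key[i], top[i].func);
    static_branch_disable(dc->skip_fast);  /* switch callers back onto it */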

The dynamic_calls created by this series are opt-in, partly because of the
 abovementioned rcu_read_lock requirement.

My attempts to measure the performance impact of dynamic_calls have been
 inconclusive; the effects on an RX-side UDP packet rate test were within
 ±1.5% and nowhere near statistical significance (p around 0.2-0.3 with n=6
 in a Welch t-test).  This could mean that dynamic_calls are ineffective,
 but it could also mean that many more sites need converting before any gain
 shows up, or it could just mean that my testing was insufficiently sensitive
 or measuring the wrong thing.  Given these poor results, this series is
 clearly not 'ready', hence the RFC tags, but hopefully it will inform the
 discussion in this area.

As before, this series depends on Josh's "static calls" patch series (v3 this
 time).  My testing was done with out-of-line static calls, since the inline
 implementation led to crashes; I have not yet determined whether they were
 the fault of my patch or of the static calls series.

Edward Cree (4):
  static_call: add indirect call promotion (dynamic_call) infrastructure
  net: core: use a dynamic_call for pt_prev->func() in RX path
  net: core: use a dynamic_call for dst_input
  net: core: use a dynamic_call for pt_prev->list_func() in list RX path

 include/linux/dynamic_call.h | 300 +++++++++++++++++++++++++++++++++++++++++++
 include/net/dst.h            |   5 +-
 init/Kconfig                 |  11 ++
 kernel/Makefile              |   1 +
 kernel/dynamic_call.c        | 131 +++++++++++++++++++
 net/core/dev.c               |  18 ++-
 net/core/dst.c               |   2 +
 7 files changed, 463 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/dynamic_call.h
 create mode 100644 kernel/dynamic_call.c



* [RFC PATCH v2 1/4] static_call: add indirect call promotion (dynamic_call) infrastructure
  2019-02-02  0:05 [RFC PATCH v2 0/4] dynamic indirect call promotion Edward Cree
@ 2019-02-02  0:07 ` Edward Cree
  2019-02-02  0:07 ` [RFC PATCH v2 2/4] net: core: use a dynamic_call for pt_prev->func() in RX path Edward Cree
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Edward Cree @ 2019-02-02  0:07 UTC (permalink / raw)
  To: namit, jpoimboe; +Cc: linux-kernel, x86

Uses runtime instrumentation of callees from an indirect call site to
 populate an indirect-call-wrapper branch tree.  Essentially we're doing
 indirect branch prediction in software because the hardware can't be
 trusted to get it right; this is sad.
Calls to these trampolines must take place within an RCU read-side
 critical section.  This is necessary because we use RCU synchronisation
 to ensure that no CPUs are running the fast path while we patch it;
 otherwise they could be between checking a static_call's func and
 actually calling it, and end up calling the wrong function.  The use
 of RCU as the synchronisation method means that dynamic_calls cannot be
 used for functions which call synchronize_rcu(), thus the mechanism has
 to be opt-in rather than being automatically applied to all indirect
 calls in the kernel.
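A caller-side sketch of the resulting contract (the names here are purely
 illustrative):
    rcu_read_lock();
    ret = dynamic_foo(ops->foo, arg); /* patched targets stay stable until unlock */
    rcu_read_unlock();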

Enabled by new CONFIG_DYNAMIC_CALLS, which defaults to off (and depends
 on a static_call implementation being available).

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/dynamic_call.h | 300 +++++++++++++++++++++++++++++++++++++++++++
 init/Kconfig                 |  11 ++
 kernel/Makefile              |   1 +
 kernel/dynamic_call.c        | 131 +++++++++++++++++++
 4 files changed, 443 insertions(+)
 create mode 100644 include/linux/dynamic_call.h
 create mode 100644 kernel/dynamic_call.c

diff --git a/include/linux/dynamic_call.h b/include/linux/dynamic_call.h
new file mode 100644
index 000000000000..2e84543c0c8b
--- /dev/null
+++ b/include/linux/dynamic_call.h
@@ -0,0 +1,300 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_DYNAMIC_CALL_H
+#define _LINUX_DYNAMIC_CALL_H
+
+/*
+ * Dynamic call (optpoline) support
+ *
+ * Dynamic calls use code patching and runtime learning to promote indirect
+ * calls into direct calls using the static_call machinery.  They give the
+ * flexibility of function pointers, but with improved performance.  This is
+ * especially important for cases where retpolines would otherwise be used, as
+ * retpolines can significantly impact performance.
+ * Calls to the two callees learned to be most common will be made through
+ * static_calls, while for any other callee the trampoline will fall back to an
+ * indirect call (or a retpoline, if those are enabled).
+ * Patching of newly learned callees into the fast-path relies on RCU to ensure
+ * the fast-path is not in use on any CPU; thus the calls must be made under
+ * the RCU read lock.
+ *
+ *
+ * A dynamic call table must be defined in file scope with
+ *	DYNAMIC_CALL_$NR(ret, name, type1, ..., type$NR);
+ * where $NR is from 1 to 4, ret is the return type of the function and type1
+ * through type$NR are the argument types.  Then, calls can be made through a
+ * matching function pointer 'func' with
+ *	x = dynamic_name(func, arg1, ..., arg$NR);
+ * which will behave equivalently to
+ *	(*func)(arg1, ..., arg$NR);
+ * except hopefully with higher performance.  It is allowed for multiple
+ * callsites to use the same dynamic call table, in which case they will share
+ * statistics for learning.  This will perform well as long as the callsites
+ * typically have the same set of common callees.
+ *
+ * Usage example:
+ *
+ *	struct foo {
+ *		int x;
+ *		int (*f)(int);
+ *	};
+ *	DYNAMIC_CALL_1(int, handle_foo, int);
+ *
+ *	int handle_foo(struct foo *f)
+ *	{
+ *		return dynamic_handle_foo(f->f, f->x);
+ *	}
+ *
+ * This should behave the same as if the function body were changed to:
+ *		return (f->f)(f->x);
+ * but potentially with improved performance.
+ */
+
+#define DEFINE_DYNAMIC_CALL_1(_ret, _name, _type1)			       \
+_ret dynamic_##_name(_ret (*func)(_type1), _type1 arg1);
+
+#define DEFINE_DYNAMIC_CALL_2(_ret, _name, _type1, _type2)		       \
+_ret dynamic_##_name(_ret (*func)(_type1, _type2), _type1 arg1, _type2 arg2);
+
+#define DEFINE_DYNAMIC_CALL_3(_ret, _name, _type1, _type2, _type3)	       \
+_ret dynamic_##_name(_ret (*func)(_type1, _type2, _type3), _type1 arg1,	       \
+		     _type2 arg2, _type3 arg3);
+
+#define DEFINE_DYNAMIC_CALL_4(_ret, _name, _type1, _type2, _type3, _type4)     \
+_ret dynamic_##_name(_ret (*func)(_type1, _type2, _type3, _type4), _type1 arg1,\
+		     _type2 arg2, _type3 arg3, _type4 arg4);
+
+#ifdef CONFIG_DYNAMIC_CALLS
+
+#include <linux/jump_label.h>
+#include <linux/mutex.h>
+#include <linux/percpu.h>
+#include <linux/static_call.h>
+#include <linux/string.h>
+#include <linux/workqueue.h>
+
+/* Number of callees from the slowpath to track on each CPU */
+#define DYNAMIC_CALL_CANDIDATES	4
+/*
+ * Number of fast-path callees; to change this, much of the macrology below
+ * must also be changed.
+ */
+#define DYNAMIC_CALL_BRANCHES	2
+struct dynamic_call_candidate {
+	void *func;
+	unsigned long hit_count;
+};
+struct dynamic_call_percpu {
+	struct dynamic_call_candidate candidates[DYNAMIC_CALL_CANDIDATES];
+	unsigned long hit_count[DYNAMIC_CALL_BRANCHES];
+	unsigned long miss_count;
+};
+struct dynamic_call {
+	struct work_struct update_work;
+	struct static_key_false *skip_stats;
+	struct static_key_true *skip_fast;
+	struct static_call_key *key[DYNAMIC_CALL_BRANCHES];
+	struct dynamic_call_percpu __percpu *percpu;
+	struct mutex update_lock;
+};
+
+void dynamic_call_update(struct work_struct *work);
+
+
+#define __DYNAMIC_CALL_BITS(_ret, _name, ...)				       \
+static _ret dummy_##_name(__VA_ARGS__)					       \
+{									       \
+	BUG();								       \
+}									       \
+DEFINE_STATIC_KEY_TRUE(_name##_skip_fast);				       \
+DEFINE_STATIC_KEY_FALSE(_name##_skip_stats);				       \
+DEFINE_STATIC_CALL(dynamic_##_name##_1, dummy_##_name);			       \
+DEFINE_STATIC_CALL(dynamic_##_name##_2, dummy_##_name);			       \
+DEFINE_PER_CPU(struct dynamic_call_percpu, _name##_dc_pc);		       \
+									       \
+static struct dynamic_call _name##_dc = {				       \
+	.update_work = __WORK_INITIALIZER(_name##_dc.update_work,	       \
+					  dynamic_call_update),		       \
+	.skip_stats = &_name##_skip_stats,				       \
+	.skip_fast = &_name##_skip_fast,				       \
+	.key = {&dynamic_##_name##_1, &dynamic_##_name##_2},		       \
+	.percpu = &_name##_dc_pc,					       \
+	.update_lock = __MUTEX_INITIALIZER(_name##_dc.update_lock),	       \
+};
+
+#define __DYNAMIC_CALL_STATS(_name)					       \
+	if (static_branch_unlikely(&_name##_skip_stats))		       \
+		goto skip_stats;					       \
+	for (i = 0; i < DYNAMIC_CALL_CANDIDATES; i++)			       \
+		if (func == thiscpu->candidates[i].func) {		       \
+			thiscpu->candidates[i].hit_count++;		       \
+			break;						       \
+		}							       \
+	if (i == DYNAMIC_CALL_CANDIDATES) /* no match */		       \
+		for (i = 0; i < DYNAMIC_CALL_CANDIDATES; i++)		       \
+			if (!thiscpu->candidates[i].func) {		       \
+				thiscpu->candidates[i].func = func;	       \
+				thiscpu->candidates[i].hit_count = 1;	       \
+				break;					       \
+			}						       \
+	if (i == DYNAMIC_CALL_CANDIDATES) /* no space */		       \
+		thiscpu->miss_count++;					       \
+									       \
+	for (i = 0; i < DYNAMIC_CALL_CANDIDATES; i++)			       \
+		total_count += thiscpu->candidates[i].hit_count;	       \
+	if (total_count > 1000) /* Arbitrary threshold */		       \
+		schedule_work(&_name##_dc.update_work);			       \
+	else if (thiscpu->miss_count > 1000) {				       \
+		/* Many misses, few hits: let's roll the dice again for a      \
+		 * fresh set of candidates.				       \
+		 */							       \
+		memset(thiscpu->candidates, 0, sizeof(thiscpu->candidates));   \
+		thiscpu->miss_count = 0;				       \
+	}								       \
+skip_stats:
+
+
+#define DYNAMIC_CALL_1(_ret, _name, _type1)				       \
+__DYNAMIC_CALL_BITS(_ret, _name, _type1 arg1)				       \
+									       \
+_ret dynamic_##_name(_ret (*func)(_type1), _type1 arg1)			       \
+{									       \
+	struct dynamic_call_percpu *thiscpu = this_cpu_ptr(_name##_dc.percpu); \
+	unsigned long total_count = 0;					       \
+	int i;								       \
+									       \
+	WARN_ON_ONCE(!rcu_read_lock_held());					       \
+	if (static_branch_unlikely(&_name##_skip_fast))			       \
+		goto skip_fast;						       \
+	if (func == dynamic_##_name##_1.func) {				       \
+		thiscpu->hit_count[0]++;				       \
+		return static_call(dynamic_##_name##_1, arg1);		       \
+	}								       \
+	if (func == dynamic_##_name##_2.func) {				       \
+		thiscpu->hit_count[1]++;				       \
+		return static_call(dynamic_##_name##_2, arg1);		       \
+	}								       \
+									       \
+skip_fast:								       \
+	__DYNAMIC_CALL_STATS(_name)					       \
+	return func(arg1);						       \
+}
+
+#define DYNAMIC_CALL_2(_ret, _name, _type1, _type2)			       \
+__DYNAMIC_CALL_BITS(_ret, _name, _type1 arg1, _type2 arg2)		       \
+									       \
+_ret dynamic_##_name(_ret (*func)(_type1, _type2), _type1 arg1,	_type2 arg2)   \
+{									       \
+	struct dynamic_call_percpu *thiscpu = this_cpu_ptr(_name##_dc.percpu); \
+	unsigned long total_count = 0;					       \
+	int i;								       \
+									       \
+	WARN_ON_ONCE(!rcu_read_lock_held());					       \
+	if (static_branch_unlikely(&_name##_skip_fast))			       \
+		goto skip_fast;						       \
+	if (func == dynamic_##_name##_1.func) {				       \
+		thiscpu->hit_count[0]++;				       \
+		return static_call(dynamic_##_name##_1, arg1, arg2);	       \
+	}								       \
+	if (func == dynamic_##_name##_2.func) {				       \
+		thiscpu->hit_count[1]++;				       \
+		return static_call(dynamic_##_name##_2, arg1, arg2);	       \
+	}								       \
+									       \
+skip_fast:								       \
+	__DYNAMIC_CALL_STATS(_name)					       \
+	return func(arg1, arg2);					       \
+}
+
+#define DYNAMIC_CALL_3(_ret, _name, _type1, _type2, _type3)		       \
+__DYNAMIC_CALL_BITS(_ret, _name, _type1 arg1, _type2 arg2, _type3 arg3)        \
+									       \
+_ret dynamic_##_name(_ret (*func)(_type1, _type2, _type3), _type1 arg1,	       \
+		     _type2 arg2, _type3 arg3)				       \
+{									       \
+	struct dynamic_call_percpu *thiscpu = this_cpu_ptr(_name##_dc.percpu); \
+	unsigned long total_count = 0;					       \
+	int i;								       \
+									       \
+	WARN_ON_ONCE(!rcu_read_lock_held());					       \
+	if (static_branch_unlikely(&_name##_skip_fast))			       \
+		goto skip_fast;						       \
+	if (func == dynamic_##_name##_1.func) {				       \
+		thiscpu->hit_count[0]++;				       \
+		return static_call(dynamic_##_name##_1, arg1, arg2, arg3);  \
+	}								       \
+	if (func == dynamic_##_name##_2.func) {				       \
+		thiscpu->hit_count[1]++;				       \
+		return static_call(dynamic_##_name##_2, arg1, arg2, arg3);  \
+	}								       \
+									       \
+skip_fast:								       \
+	__DYNAMIC_CALL_STATS(_name)					       \
+	return func(arg1, arg2, arg3);					       \
+}
+
+#define DYNAMIC_CALL_4(_ret, _name, _type1, _type2, _type3, _type4)	       \
+__DYNAMIC_CALL_BITS(_ret, _name, _type1 arg1, _type2 arg2, _type3 arg3,	       \
+		    _type4 arg4)					       \
+									       \
+_ret dynamic_##_name(_ret (*func)(_type1, _type2, _type3, _type4), _type1 arg1,\
+		     _type2 arg2, _type3 arg3, _type4 arg4)		       \
+{									       \
+	struct dynamic_call_percpu *thiscpu = this_cpu_ptr(_name##_dc.percpu); \
+	unsigned long total_count = 0;					       \
+	int i;								       \
+									       \
+	WARN_ON_ONCE(!rcu_read_lock_held());					       \
+	if (static_branch_unlikely(&_name##_skip_fast))			       \
+		goto skip_fast;						       \
+	if (func == dynamic_##_name##_1.func) {				       \
+		thiscpu->hit_count[0]++;				       \
+		return static_call(dynamic_##_name##_1, arg1, arg2, arg3, arg4);\
+	}								       \
+	if (func == dynamic_##_name##_2.func) {				       \
+		thiscpu->hit_count[1]++;				       \
+		return static_call(dynamic_##_name##_2, arg1, arg2, arg3, arg4);\
+	}								       \
+									       \
+skip_fast:								       \
+	__DYNAMIC_CALL_STATS(_name)					       \
+	return func(arg1, arg2, arg3, arg4);				       \
+}
+
+#else /* !CONFIG_DYNAMIC_CALLS */
+
+/* Implement as simple indirect calls */
+
+#define DYNAMIC_CALL_1(_ret, _name, _type1)				       \
+_ret dynamic_##_name(_ret (*func)(_type1), _type1 arg1)			       \
+{									       \
+	WARN_ON_ONCE(!rcu_read_lock_held());					       \
+	return func(arg1);						       \
+}									       \
+
+#define DYNAMIC_CALL_2(_ret, _name, _type1, _type2)			       \
+_ret dynamic_##_name(_ret (*func)(_type1, _type2), _type1 arg1, _type2 arg2)   \
+{									       \
+	WARN_ON_ONCE(!rcu_read_lock_held());					       \
+	return func(arg1, arg2);					       \
+}									       \
+
+#define DYNAMIC_CALL_3(_ret, _name, _type1, _type2, _type3)		       \
+_ret dynamic_##_name(_ret (*func)(_type1, _type2, _type3), _type1 arg1,	       \
+		     _type2 arg2, _type3 arg3)				       \
+{									       \
+	WARN_ON_ONCE(!rcu_read_lock_held());					       \
+	return func(arg1, arg2, arg3);					       \
+}									       \
+
+#define DYNAMIC_CALL_4(_ret, _name, _type1, _type2, _type3, _type4)	       \
+_ret dynamic_##_name(_ret (*func)(_type1, _type2, _type3, _type4), _type1 arg1,\
+		     _type2 arg2, _type3 arg3, _type4 arg4)		       \
+{									       \
+	WARN_ON_ONCE(!rcu_read_lock_held());					       \
+	return func(arg1, arg2, arg3, arg4);				       \
+}									       \
+
+
+#endif /* CONFIG_DYNAMIC_CALLS */
+
+#endif /* _LINUX_DYNAMIC_CALL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 513fa544a134..11133c141c21 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1779,6 +1779,17 @@ config PROFILING
 config TRACEPOINTS
 	bool
 
+config DYNAMIC_CALLS
+	bool "Dynamic call optimisation (EXPERIMENTAL)"
+	depends on HAVE_STATIC_CALL
+	help
+	  Say Y here to accelerate selected indirect calls with optpolines,
+	  using runtime learning to populate the optpoline call tables.  This
+	  should improve performance, particularly when retpolines are enabled,
+	  but increases the size of the kernel .text, and on some workloads may
+	  cause the kernel to spend a significant amount of time updating the
+	  call tables.
+
 endmenu		# General setup
 
 source "arch/Kconfig"
diff --git a/kernel/Makefile b/kernel/Makefile
index 8e1c6ca0f6e7..e6c32ac7e519 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
 obj-$(CONFIG_PADATA) += padata.o
 obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-$(CONFIG_DYNAMIC_CALLS) += dynamic_call.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
 
diff --git a/kernel/dynamic_call.c b/kernel/dynamic_call.c
new file mode 100644
index 000000000000..4ba2e5cdded3
--- /dev/null
+++ b/kernel/dynamic_call.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/dynamic_call.h>
+#include <linux/printk.h>
+
+static void dynamic_call_add_cand(struct dynamic_call_candidate *top,
+				 size_t ncands,
+				 struct dynamic_call_candidate next)
+{
+	struct dynamic_call_candidate old;
+	int i;
+
+	for (i = 0; i < ncands; i++) {
+		if (next.hit_count > top[i].hit_count) {
+			/* Swap next with top[i], so that the old top[i] can
+			 * shunt along all lower scores
+			 */
+			old = top[i];
+			top[i] = next;
+			next = old;
+		}
+	}
+}
+
+static void dynamic_call_count_hits(struct dynamic_call_candidate *top,
+				   size_t ncands, struct dynamic_call *dc,
+				   int i)
+{
+	struct dynamic_call_candidate next;
+	struct dynamic_call_percpu *percpu;
+	int cpu;
+
+	next.func = dc->key[i]->func;
+	next.hit_count = 0;
+	for_each_online_cpu(cpu) {
+		percpu = per_cpu_ptr(dc->percpu, cpu);
+		next.hit_count += percpu->hit_count[i];
+		percpu->hit_count[i] = 0;
+	}
+
+	dynamic_call_add_cand(top, ncands, next);
+}
+
+void dynamic_call_update(struct work_struct *work)
+{
+	struct dynamic_call *dc = container_of(work, struct dynamic_call,
+					       update_work);
+	struct dynamic_call_candidate top[4], next, *cands, *cands2;
+	struct dynamic_call_percpu *percpu, *percpu2;
+	int cpu, i, cpu2, j;
+
+	memset(top, 0, sizeof(top));
+
+	pr_debug("dynamic_call_update called for %ps\n", dc);
+	mutex_lock(&dc->update_lock);
+	/* We don't stop the other CPUs adding to their counts while this is
+	 * going on; but it doesn't really matter because this is a heuristic
+	 * anyway so we don't care about perfect accuracy.
+	 */
+	/* First count up the hits on the existing static branches */
+	for (i = 0; i < DYNAMIC_CALL_BRANCHES; i++)
+		dynamic_call_count_hits(top, ARRAY_SIZE(top), dc, i);
+	/* Next count up the callees seen in the fallback path */
+	/* Switch off stats collection in the slowpath first */
+	static_branch_enable(dc->skip_stats);
+	synchronize_rcu();
+	for_each_online_cpu(cpu) {
+		percpu = per_cpu_ptr(dc->percpu, cpu);
+		cands = percpu->candidates;
+		for (i = 0; i < DYNAMIC_CALL_CANDIDATES; i++) {
+			next = cands[i];
+			if (next.func == NULL)
+				continue;
+			next.hit_count = 0;
+			for_each_online_cpu(cpu2) {
+				percpu2 = per_cpu_ptr(dc->percpu, cpu2);
+				cands2 = percpu2->candidates;
+				for (j = 0; j < DYNAMIC_CALL_CANDIDATES; j++) {
+					if (cands2[j].func == next.func) {
+						cands2[j].func = NULL;
+						next.hit_count += cands2[j].hit_count;
+						cands2[j].hit_count = 0;
+						break;
+					}
+				}
+			}
+			dynamic_call_add_cand(top, ARRAY_SIZE(top), next);
+		}
+	}
+	/* Record our results (for debugging) */
+	for (i = 0; i < ARRAY_SIZE(top); i++) {
+		if (i < DYNAMIC_CALL_BRANCHES)
+			pr_debug("%ps: selected [%d] %pf, score %lu\n",
+				 dc, i, top[i].func, top[i].hit_count);
+		else
+			pr_debug("%ps: runnerup [%d] %pf, score %lu\n",
+				 dc, i, top[i].func, top[i].hit_count);
+	}
+	/* It's possible that we could have picked up multiple pushes of the
+	 * workitem, so someone already collected most of the count.  In that
+	 * case, don't make a decision based on only a small number of calls.
+	 */
+	if (top[0].hit_count > 250) {
+		/* Divert callers away from the fast path */
+		static_branch_enable(dc->skip_fast);
+		/* Wait for existing fast path callers to finish */
+		synchronize_rcu();
+		/* Patch the chosen callees into the fast path */
+		for (i = 0; i < DYNAMIC_CALL_BRANCHES; i++) {
+			__static_call_update(dc->key[i], top[i].func);
+			/* Clear the hit-counts, they were for the old funcs */
+			for_each_online_cpu(cpu)
+				per_cpu_ptr(dc->percpu, cpu)->hit_count[i] = 0;
+		}
+		/* Ensure the new fast path is seen before we direct anyone
+		 * into it.  This probably isn't necessary (the binary-patching
+		 * framework probably takes care of it) but let's be paranoid.
+		 */
+		wmb();
+		/* Switch callers back onto the fast path */
+		static_branch_disable(dc->skip_fast);
+	} else {
+		pr_debug("%ps: too few hits, not patching\n", dc);
+	}
+
+	/* Finally, re-enable stats gathering in the fallback path. */
+	static_branch_disable(dc->skip_stats);
+
+	mutex_unlock(&dc->update_lock);
+	pr_debug("dynamic_call_update (%ps) finished\n", dc);
+}



* [RFC PATCH v2 2/4] net: core: use a dynamic_call for pt_prev->func() in RX path
  2019-02-02  0:05 [RFC PATCH v2 0/4] dynamic indirect call promotion Edward Cree
  2019-02-02  0:07 ` [RFC PATCH v2 1/4] static_call: add indirect call promotion (dynamic_call) infrastructure Edward Cree
@ 2019-02-02  0:07 ` Edward Cree
  2019-02-02  0:07 ` [RFC PATCH v2 3/4] net: core: use a dynamic_call for dst_input Edward Cree
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Edward Cree @ 2019-02-02  0:07 UTC (permalink / raw)
  To: namit, jpoimboe; +Cc: linux-kernel, x86

Typically a small number of callees, such as ip[v6]_rcv or packet_rcv,
 will cover most packets.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8e276e0192a1..7b38a33689d8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -146,6 +146,7 @@
 #include <net/udp_tunnel.h>
 #include <linux/net_namespace.h>
 #include <linux/indirect_call_wrapper.h>
+#include <linux/dynamic_call.h>
 
 #include "net-sysfs.h"
 
@@ -1949,6 +1950,9 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
 
+DYNAMIC_CALL_4(int, deliver_skb, struct sk_buff *, struct net_device *,
+	       struct packet_type *, struct net_device *);
+
 static inline int deliver_skb(struct sk_buff *skb,
 			      struct packet_type *pt_prev,
 			      struct net_device *orig_dev)
@@ -1956,7 +1960,7 @@ static inline int deliver_skb(struct sk_buff *skb,
 	if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 		return -ENOMEM;
 	refcount_inc(&skb->users);
-	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+	return dynamic_deliver_skb(pt_prev->func, skb, skb->dev, pt_prev, orig_dev);
 }
 
 static inline void deliver_ptype_list_skb(struct sk_buff *skb,
@@ -4970,7 +4974,8 @@ static int __netif_receive_skb_one_core(struct sk_buff *skb, bool pfmemalloc)
 
 	ret = __netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
 	if (pt_prev)
-		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+		ret = dynamic_deliver_skb(pt_prev->func, skb, skb->dev, pt_prev,
+					  orig_dev);
 	return ret;
 }
 
@@ -5015,7 +5020,8 @@ static inline void __netif_receive_skb_list_ptype(struct list_head *head,
 		pt_prev->list_func(head, pt_prev, orig_dev);
 	else
 		list_for_each_entry_safe(skb, next, head, list)
-			pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+			dynamic_deliver_skb(pt_prev->func, skb, skb->dev,
+					    pt_prev, orig_dev);
 }
 
 static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)



* [RFC PATCH v2 3/4] net: core: use a dynamic_call for dst_input
  2019-02-02  0:05 [RFC PATCH v2 0/4] dynamic indirect call promotion Edward Cree
  2019-02-02  0:07 ` [RFC PATCH v2 1/4] static_call: add indirect call promotion (dynamic_call) infrastructure Edward Cree
  2019-02-02  0:07 ` [RFC PATCH v2 2/4] net: core: use a dynamic_call for pt_prev->func() in RX path Edward Cree
@ 2019-02-02  0:07 ` Edward Cree
  2019-02-02  0:08 ` [RFC PATCH v2 4/4] net: core: use a dynamic_call for pt_prev->list_func() in list RX path Edward Cree
  2019-02-05  8:50 ` [RFC PATCH v2 0/4] dynamic indirect call promotion Nadav Amit
  4 siblings, 0 replies; 8+ messages in thread
From: Edward Cree @ 2019-02-02  0:07 UTC (permalink / raw)
  To: namit, jpoimboe; +Cc: linux-kernel, x86

Typically there will be a small number of callees, such as ip_local_deliver
 or ip_forward, which will cover most packets.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/net/dst.h | 5 ++++-
 net/core/dst.c    | 2 ++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 6cf0870414c7..5dd838b9a7d2 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -16,6 +16,7 @@
 #include <linux/bug.h>
 #include <linux/jiffies.h>
 #include <linux/refcount.h>
+#include <linux/dynamic_call.h>
 #include <net/neighbour.h>
 #include <asm/processor.h>
 
@@ -444,10 +445,12 @@ static inline int dst_output(struct net *net, struct sock *sk, struct sk_buff *s
 	return skb_dst(skb)->output(net, sk, skb);
 }
 
+DEFINE_DYNAMIC_CALL_1(int, dst_input, struct sk_buff *);
+
 /* Input packet from network to transport.  */
 static inline int dst_input(struct sk_buff *skb)
 {
-	return skb_dst(skb)->input(skb);
+	return dynamic_dst_input(skb_dst(skb)->input, skb);
 }
 
 static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
diff --git a/net/core/dst.c b/net/core/dst.c
index 81ccf20e2826..a00a75bab84e 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -342,3 +342,5 @@ void metadata_dst_free_percpu(struct metadata_dst __percpu *md_dst)
 	free_percpu(md_dst);
 }
 EXPORT_SYMBOL_GPL(metadata_dst_free_percpu);
+
+DYNAMIC_CALL_1(int, dst_input, struct sk_buff *);



* [RFC PATCH v2 4/4] net: core: use a dynamic_call for pt_prev->list_func() in list RX path
  2019-02-02  0:05 [RFC PATCH v2 0/4] dynamic indirect call promotion Edward Cree
                   ` (2 preceding siblings ...)
  2019-02-02  0:07 ` [RFC PATCH v2 3/4] net: core: use a dynamic_call for dst_input Edward Cree
@ 2019-02-02  0:08 ` Edward Cree
  2019-02-05  8:50 ` [RFC PATCH v2 0/4] dynamic indirect call promotion Nadav Amit
  4 siblings, 0 replies; 8+ messages in thread
From: Edward Cree @ 2019-02-02  0:08 UTC (permalink / raw)
  To: namit, jpoimboe; +Cc: linux-kernel, x86

There are currently only two possible callees, ip_list_rcv and
 ipv6_list_rcv.  Even when more are added, most packets will typically
 follow one of a small number of callees on any given system.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7b38a33689d8..ecf41618a279 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5006,6 +5006,9 @@ int netif_receive_skb_core(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb_core);
 
+DYNAMIC_CALL_3(void, deliver_skb_list, struct list_head *, struct packet_type *,
+	       struct net_device *);
+
 static inline void __netif_receive_skb_list_ptype(struct list_head *head,
 						  struct packet_type *pt_prev,
 						  struct net_device *orig_dev)
@@ -5017,7 +5020,8 @@ static inline void __netif_receive_skb_list_ptype(struct list_head *head,
 	if (list_empty(head))
 		return;
 	if (pt_prev->list_func != NULL)
-		pt_prev->list_func(head, pt_prev, orig_dev);
+		dynamic_deliver_skb_list(pt_prev->list_func, head, pt_prev,
+					 orig_dev);
 	else
 		list_for_each_entry_safe(skb, next, head, list)
 			dynamic_deliver_skb(pt_prev->func, skb, skb->dev,


* Re: [RFC PATCH v2 0/4] dynamic indirect call promotion
  2019-02-02  0:05 [RFC PATCH v2 0/4] dynamic indirect call promotion Edward Cree
                   ` (3 preceding siblings ...)
  2019-02-02  0:08 ` [RFC PATCH v2 4/4] net: core: use a dynamic_call for pt_prev->list_func() in list RX path Edward Cree
@ 2019-02-05  8:50 ` Nadav Amit
  2019-02-15 17:21   ` Edward Cree
  4 siblings, 1 reply; 8+ messages in thread
From: Nadav Amit @ 2019-02-05  8:50 UTC (permalink / raw)
  To: Edward Cree; +Cc: Josh Poimboeuf, LKML, x86

> On Feb 1, 2019, at 4:05 PM, Edward Cree <ecree@solarflare.com> wrote:
> 
> This series introduces 'dynamic_calls', branch trees of static calls (updated
> at runtime using text patching), to avoid making indirect calls to common
> targets.  The basic mechanism is
>    if (func == static_key_1.target)
>        call_static_key_1(args);
>    else if (func == static_key_2.target)
>        call_static_key_2(args);
>    /* ... */
>    else
>        (*func)(args); /* typically leads to a retpoline nowadays */
> with some additional statistics-gathering to allow periodic relearning of
> branch targets.  Creating and calling a dynamic call table are each a single
> line in the consuming code, although they expand to a nontrivial amount of
> data and text in the kernel image.
> This is essentially indirect branch prediction, performed in software because
> we can't trust hardware to get it right.  While the processor may speculate
> into the function calls, this is *probably* OK since they are known to be
> functions that frequently get called in this path, and thus are much less
> likely to contain side-channel information leaks than a completely arbitrary
> target PC from a branch target buffer.

My rationale is that it is ok since I presume that even after Spectre v2 is
addressed in HW, speculation might be possible using valid BTB targets
(matching the source RIP). This is somewhat equivalent to having software
prediction.

> Moreover, when the speculation is
> accurate we positively want to be able to speculate into the callee.
> The branch target statistics are collected with percpu variables, counting
> both 'hits' on the existing branch targets and 'misses', divided into counts
> for up to four specific targets (first-come-first-served) and a catch-all
> miss count used once that table is full.
> When the total number of specific misses on a cpu reaches 1000, work is
> triggered which adds up counts across all CPUs and chooses the two most-
> popular call targets to patch into the call path.
> If instead the catch-all miss count reaches 1000, the counts and specific
> targets for that cpu are discarded, since either the target is too
> unpredictable (lots of low-frequency callees rather than a few dominating
> ones) or the targets that populated the table were by chance unpopular ones.
> To ensure that the static key target does not change between the if () check
> and the call, the whole dynamic_call must take place in an RCU read-side
> critical section (which, since the callee does not know it is being called in
> this manner, then lasts at least until the callee returns), and the patching
> at re-learning time is done with the help of a static_key to switch callers
> off the dynamic_call path and RCU synchronisation to ensure none are still on
> it.  In cases where RCU cannot be used (e.g. because some callees need to RCU
> synchronise), it might be possible to add a variant that uses
> synchronize_rcu_tasks() when updating, but this series does not attempt this.

I wonder why. This seems like an easy solution, and according to Josh, Steven
Rostedt, and the documentation, it appears to be valid.

> 
> The dynamic_calls created by this series are opt-in, partly because of the
> abovementioned rcu_read_lock requirement.
> 
> My attempts to measure the performance impact of dynamic_calls have been
> inconclusive; the effects on an RX-side UDP packet rate test were within
> ±1.5% and nowhere near statistical significance (p around 0.2-0.3 with n=6
> in a Welch t-test).  This could mean that dynamic_calls are ineffective,
> but it could also mean that many more sites need converting before any gain
> shows up, or it could just mean that my testing was insufficiently sensitive
> or measuring the wrong thing.  Given these poor results, this series is
> clearly not 'ready', hence the RFC tags, but hopefully it will inform the
> discussion in this area.

So I wasted quite some time on this profiling/implementation, and the
results that you got are not surprising. I am afraid that Josh and you are
repeating some of the things I did before, which I now consider to be
“mistakes”.

Like you, I initially used a bunch of C macros to hook into the call
locations (although yours are nicer than mine). Similarly to this patch-set,
I originally change calling locations semi-manually, through an endless
process of: recording indirect branches (using performance counters);
running a python script to create a Coccinelle script to change the callers
and function definitions; and then applying the patches and fixing whatever
got broken.

It took me a while to understand this is the wrong approach. The callers
IMHO should not be changed - programmers should not need to understand some
macro is actually a function call. The complaints that PeterZ and others
give regarding PV infrastructure are relevant here as well.

I presume that the reason you did not see a performance improvement is that
there are hundreds of call sites in the network stack that need to be
modified. Most of the function pointers are concentrated in all kind of
“ops” structures, so it is feasible to annotate them. Yet, changing the
callers will make the code somewhat ugly.

Indeed, it is possible to optimize your code. For example using some variant
of this_cpu_add() (*) would be better than this_cpu_ptr() followed by an add
operation. Or avoiding the WARN_ON_ONCE() if debug is not enabled somehow.
Yet, I don’t think it is the right approach.
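To illustrate the this_cpu_*() point: the generated fast path could bump its
counter with something roughly like (untested sketch)

	this_cpu_inc(_name##_dc_pc.hit_count[0]);

instead of going through this_cpu_ptr() and an increment on the returned
pointer.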

As I stated before, I think that the best solution is to use a GCC plugin,
similar to the version that I have sent before. I have an updated version
that allows custom attributes to be used to opt in certain branches and,
potentially, if you insist on inlining multiple branches, to provide the number
of branches (**). Such a solution will not enable the calling code to be
written in C and would require a plugin for each architecture. Nevertheless,
the code would be more readable.

Feel free to try my code and give me feedback. I did not get feedback on my
last version. Is there a fundamental problem with my plugin? Did you try it
and get bad results, or perhaps it did not build? Why do you prefer an approach
which requires annotation of the callers, instead of something that is much
more transparent?


(*) a variant that does not yell since you didn’t disable preemption, but
does not prevent preemption or disable IRQs.

(**) I didn’t send the latest version yet, since I’m still struggling to make
it compatible with the PV infrastructure. If you want, I'll send the code as
it is right now.



* Re: [RFC PATCH v2 0/4] dynamic indirect call promotion
  2019-02-05  8:50 ` [RFC PATCH v2 0/4] dynamic indirect call promotion Nadav Amit
@ 2019-02-15 17:21   ` Edward Cree
  2019-02-18 16:22     ` Nadav Amit
  0 siblings, 1 reply; 8+ messages in thread
From: Edward Cree @ 2019-02-15 17:21 UTC (permalink / raw)
  To: Nadav Amit; +Cc: Josh Poimboeuf, LKML, x86

On 05/02/19 08:50, Nadav Amit wrote:
>> In cases where RCU cannot be used (e.g. because some callees need to RCU
>> synchronise), it might be possible to add a variant that uses
>> synchronize_rcu_tasks() when updating, but this series does not attempt this.
> I wonder why.
Mainly because I have yet to convince myself that it's the Right Thing.
Note also the following (from kernel/rcu/update.c):

/*
 * This is a very specialized primitive, intended only for a few uses in
* tracing and other situations requiring manipulation of function
* preambles and profiling hooks. The synchronize_rcu_tasks() function
* is not (yet) intended for heavy use from multiple CPUs.  */

> This seems like an easy solution, and according to Josh, Steven
> Rostedt, and the documentation, it appears to be valid.
Will it hurt performance, though, if we end up (say) having rcu-tasks-
 based synchronisation for updates on every indirect call in the kernel?
(As would result from a plugin-based opt-out approach.)

> As I stated before, I think that the best solution is to use a GCC plugin,
> [...] Such a solution will not enable the calling code to be
> written in C and would require a plugin for each architecture.
I'm afraid I don't see why.  If we use the static_calls infrastructure,
 but then do a source-level transformation in the compiler plugin to turn
 indirect calls into dynamic_calls, it should be possible to create an
 opt-out system without any arch-specific code in the plugin (the arch-
 specific stuff being all in the static_calls code).
Any reason that can't be done?  (Note: I don't know much about GCC
 internals, maybe there's something obvious that stops a plugin doing
 things like that.)

>> Feel free to try my code and give me feedback. I did not get feedback on my
>> last version. Is there a fundamental problem with my plugin? Did you try it
>> and get bad results, or perhaps it did not build?
I didn't test your patches yet, because I was busy trying to get mine
 working and ready to post (and also with unrelated work).  But now that
 that's done, next time I have cycles spare for indirect call stuff I
 guess testing (and reviewing) your approach will be next on my list.

> Why do you prefer an approach
> which requires annotation of the callers, instead of something that is much
> more transparent?
I'm concerned about the overhead (in both time and memory) of running
 learning on every indirect call site (including ones that aren't in a
 hot-path, and ones which have such a wide variety of callees that
 promotion really doesn't help) throughout the whole kernel.  Also, an
 annotating programmer knows the locking/rcu context and can thus tell
 whether a given dynamic_call should use synchronize_rcu_tasks(),
 synchronize_rcu(), or perhaps something else (if e.g. the call always
 happens under a mutex, then the updater work could take that mutex).

The real answer, though, is that I don't so much prefer this approach,
 as think that both should be tried "publicly" and evaluated by more
 developers than just us three.  There's a reason this series is
 marked RFC ;-)


-Ed


* Re: [RFC PATCH v2 0/4] dynamic indirect call promotion
  2019-02-15 17:21   ` Edward Cree
@ 2019-02-18 16:22     ` Nadav Amit
  0 siblings, 0 replies; 8+ messages in thread
From: Nadav Amit @ 2019-02-18 16:22 UTC (permalink / raw)
  To: Edward Cree; +Cc: Josh Poimboeuf, LKML, x86

> On Feb 15, 2019, at 9:21 AM, Edward Cree <ecree@solarflare.com> wrote:
> 
> On 05/02/19 08:50, Nadav Amit wrote:
>>> In cases where RCU cannot be used (e.g. because some callees need to RCU
>>> synchronise), it might be possible to add a variant that uses
>>> synchronize_rcu_tasks() when updating, but this series does not attempt this.
>> I wonder why.
> Mainly because I have yet to convince myself that it's the Right Thing.
> Note also the following (from kernel/rcu/update.c):
> 
> /*
> * This is a very specialized primitive, intended only for a few uses in
> * tracing and other situations requiring manipulation of function
> * preambles and profiling hooks. The synchronize_rcu_tasks() function
> * is not (yet) intended for heavy use from multiple CPUs.  */
> 
>> This seems like an easy solution, and according to Josh, Steven
>> Rostedt, and the documentation, it appears to be valid.
> Will it hurt performance, though, if we end up (say) having rcu-tasks-
>  based synchronisation for updates on every indirect call in the kernel?
> (As would result from a plugin-based opt-out approach.)

That’s what batching is for..

> 
>> As I stated before, I think that the best solution is to use a GCC plugin,
>> [...] Such a solution will not enable the calling code to be
>> written in C and would require a plugin for each architecture.
> I'm afraid I don't see why.  If we use the static_calls infrastructure,
>  but then do a source-level transformation in the compiler plugin to turn
>  indirect calls into dynamic_calls, it should be possible to create an
>  opt-out system without any arch-specific code in the plugin (the arch-
>  specific stuff being all in the static_calls code).
> Any reason that can't be done?  (Note: I don't know much about GCC
>  internals, maybe there's something obvious that stops a plugin doing
>  things like that.)

Hmm… I think you are right. It may be possible by hooking into the
PLUGIN_START_PARSE_FUNCTION or PLUGIN_FINISH_PARSE_FUNCTION events. But I
think source-code manipulation is likely to be more error-prone and “dirty”.
I think that assembly is the right level to deal with indirect calls anyhow,
specifically if the same mechanism is used for callee-saved functions.

>> Feel free to try my code and give me feedback. I did not get feedback on my
>> last version. Is there a fundamental problem with my plugin? Did you try it
>> and get bad results, or perhaps it did not build?
> I didn't test your patches yet, because I was busy trying to get mine
>  working and ready to post (and also with unrelated work).  But now that
>  that's done, next time I have cycles spare for indirect call stuff I
>  guess testing (and reviewing) your approach will be next on my list.
> 
>> Why do you prefer an approach
>> which requires annotation of the callers, instead of something that is much
>> more transparent?
> I'm concerned about the overhead (in both time and memory) of running
>  learning on every indirect call site (including ones that aren't in a
>  hot-path, and ones which have such a wide variety of callees that
>  promotion really doesn't help) throughout the whole kernel.  Also, an
>  annotating programmer knows the locking/rcu context and can thus tell
>  whether a given dynamic_call should use synchronize_rcu_tasks(),
>  synchronize_rcu(), or perhaps something else (if e.g. the call always
>  happens under a mutex, then the updater work could take that mutex).
> 
> The real answer, though, is that I don't so much prefer this approach,
>  as think that both should be tried "publicly" and evaluated by more
>  developers than just us three.  There's a reason this series is
>  marked RFC ;-)

Reading my email from ~2 weeks ago - I realize I don’t really understand my
own question. Clearly, annotation is better (if possible). My point was
mainly that it is a tedious job to annotate all the locations, and there are
quite a few.


