linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 net-next 0/1] bpf32->bpf64 mapper and bpf64 interpreter
@ 2014-02-27  2:38 Alexei Starovoitov
  2014-02-27  2:38 ` [PATCH v3 net-next 1/1] " Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2014-02-27  2:38 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Frederic Weisbecker, Arnaldo Carvalho de Melo, Pekka Enberg,
	Arjan van de Ven, Christoph Hellwig, linux-kernel, netdev

Hi All,

V1 patches:
http://thread.gmane.org/gmane.linux.kernel/1605783
V2 patches:
http://thread.gmane.org/gmane.linux.kernel/1642325

V3 summary:
- as suggested by Daniel, added an on-the-fly converter from
  old BPF (aka BPF32) into extended BPF (aka BPF64)
- as suggested by Peter Anvin, added 32-bit subregisters;
  they don't add much to interpreter speed, but they simplify the
  bpf32->bpf64 mapping
- added the sysctl net.core.bpf64_enable flag;
  when enabled, old BPF filters are converted to BPF64
  and used by tcpdump/cls/xtables (a small example of the conversion
  is shown right after this list).
  Safety of the filters is still verified by the old BPF sk_chk_filter();
  BPF64's own bpf_check() is dropped from this patch to simplify review
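
For illustration, this is roughly what the converter does to a single
classic conditional jump, using the structures this patch adds to
include/uapi/linux/filter.h (register numbers follow the converter:
classic A maps to r6, classic X to r7; the jump offset is shown before
the two-pass adjustment, and the values are made up for the example):

	/* classic BPF: jeq #0x800, jt 0, jf 5 */
	struct sock_filter old = { BPF_JMP | BPF_JEQ | BPF_K, 0, 5, 0x800 };

	/* since jt == 0, the converter emits the inverted JNE form: */
	struct bpf_insn new = {
		.code  = BPF_JMP | BPF_JNE | BPF_K,
		.a_reg = 6,		/* classic accumulator A */
		.x_reg = 7,		/* set but unused in the K form */
		.off   = 5,		/* "if r6 != 0x800 goto pc+5" */
		.imm   = 0x800,
	};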

The addition of 32-bit subregs requires some work on the BPF64 x86_64 JIT,
so it's not included in this patch set. The LLVM BPF64 backend also needs
to be taught to take advantage of 32-bit subregs.

Initially the BPF64 instruction set was designed for maximum performance
after JIT; it has now been tweaked for good interpreter speed as well.
Eventually BPF64 can completely replace the existing BPF on all architectures.

Two key reasons why the BPF64 interpreter is noticeably faster
than the existing BPF32 interpreter:

1. fall-through jumps
  In BPF32, jump instructions are forced to take either the 'true' or
  the 'false' branch, which causes branch-miss penalties.
  BPF64 jump instructions have one branch target plus a fall-through
  (e.g. "if r6 == 22 goto pc+N" simply continues with the next insn when
  the condition is false), which fits CPU branch predictor logic better.
  'perf stat' shows a drastic difference in branch-misses.

2. jump-threaded implementation of the interpreter vs a switch statement
  Instead of a single tablejump at the top of a 'switch' statement, GCC
  generates a separate tablejump at the end of each instruction handler,
  which helps the CPU branch predictor (see the sketch below).
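
To illustrate the dispatch difference, here is a minimal user-space
sketch (not the kernel interpreter itself; the toy opcodes are made up,
and the threaded version relies on the GCC/clang "labels as values"
extension, just like the real interpreter):

	#include <stdio.h>

	enum { OP_ADD, OP_SUB, OP_HALT };

	struct insn { int op; int imm; };

	/* switch dispatch: one indirect jump at the top of the loop */
	static int run_switch(const struct insn *pc)
	{
		int acc = 0;

		for (;;) {
			switch (pc->op) {
			case OP_ADD:  acc += pc->imm; pc++; break;
			case OP_SUB:  acc -= pc->imm; pc++; break;
			case OP_HALT: return acc;
			}
		}
	}

	/* jump-threaded dispatch: every handler ends with its own indirect
	 * jump, so the branch predictor sees one jump site per opcode */
	static int run_threaded(const struct insn *pc)
	{
		static const void *jt[] = { &&do_add, &&do_sub, &&do_halt };
		int acc = 0;

		goto *jt[pc->op];
	do_add:
		acc += pc->imm; pc++; goto *jt[pc->op];
	do_sub:
		acc -= pc->imm; pc++; goto *jt[pc->op];
	do_halt:
		return acc;
	}

	int main(void)
	{
		struct insn prog[] = { {OP_ADD, 5}, {OP_SUB, 2}, {OP_HALT, 0} };

		printf("%d %d\n", run_switch(prog), run_threaded(prog));
		return 0;
	}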

Performance of two BPF filters generated by libpcap was measured
on x86_64, i386 and arm32.

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - 
   ((tcp[12]&0xf0)>>2)) != 0)' -dd

Other libpcap programs have similar performance differences.

Raw performance data from a BPF micro-benchmark:
SK_RUN_FILTER on the same SKB (cache-hit) or on 10k different SKBs (cache-miss)
time in nsec per call, smaller is better
(a sketch of the measurement loop is given after the results)
--x86_64--
        fprog #1  fprog #1   fprog #2  fprog #2
        cache-hit cache-miss cache-hit cache-miss
BPF32      90        98       207       220
BPF64      28        85       60        108
BPF32_JIT  12        33       17         44
BPF64_JIT  TBD

--i386--
        fprog #1  fprog #1   fprog #2  fprog #2
        cache-hit cache-miss cache-hit cache-miss
BPF32     107        136      227       252
BPF64      40        119       69       172

--arm32--
        fprog #1  fprog #1   fprog #2  fprog #2
        cache-hit cache-miss cache-hit cache-miss
BPF32     202        300      475       540
BPF64     139        270      296       470
BPF32_JIT  26        182       37       202
BPF64_JIT TBD

On Intel CPUs the BPF64 interpreter is significantly faster than the
old BPF interpreter. The existing BPF32_JIT is obviously even faster;
BPF64_JIT has similar performance (marked TBD above, since the JIT
still needs the 32-bit subreg work).
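
For reference, the cache-hit column can be measured with something along
these lines (an illustrative kernel-context sketch only; the benchmark
module itself is not part of this patch, and the helper name below is
made up). Divide the returned total by the loop count to get nsec/call:

	#include <linux/ktime.h>
	#include <linux/filter.h>
	#include <linux/skbuff.h>

	/* total nsec for 'loops' SK_RUN_FILTER calls on one skb (cache-hit
	 * case; the cache-miss numbers cycle through ~10k different skbs) */
	static u64 bench_total_ns(struct sk_filter *fp, struct sk_buff *skb,
				  unsigned int loops)
	{
		ktime_t start = ktime_get();
		unsigned int i;

		for (i = 0; i < loops; i++)
			SK_RUN_FILTER(fp, skb);

		return ktime_to_ns(ktime_sub(ktime_get(), start));
	}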

Tested with Daniel's 'trinify BPF fuzzer'.

TODO:
- bpf32->bpf64 converter doesn't recognize seccomp and negative
  offsets yet, fix that

- add 32-bit subregs to BPF64 x86_64 JIT and LLVM backend

- add bpf64 verifier, so that tcpdump/cls/xt and others can
  insert both bpf32 and bpf64 programs through the same interface

- add bpf tables, complete 'dropmonitor' and get back to
  systemtap-like probes with bpf64

Please review.
Thanks!

Alexei Starovoitov (1):
  bpf32->bpf64 mapper and bpf64 interpreter

 include/linux/filter.h      |    9 +-
 include/linux/netdevice.h   |    1 +
 include/uapi/linux/filter.h |   37 ++-
 net/core/Makefile           |    2 +-
 net/core/bpf_run.c          |  766 +++++++++++++++++++++++++++++++++++++++++++
 net/core/filter.c           |  114 ++++++-
 net/core/sysctl_net_core.c  |    7 +
 7 files changed, 913 insertions(+), 23 deletions(-)
 create mode 100644 net/core/bpf_run.c

-- 
1.7.9.5



* [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-02-27  2:38 [PATCH v3 net-next 0/1] bpf32->bpf64 mapper and bpf64 interpreter Alexei Starovoitov
@ 2014-02-27  2:38 ` Alexei Starovoitov
  2014-02-28 12:45   ` Daniel Borkmann
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2014-02-27  2:38 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Frederic Weisbecker, Arnaldo Carvalho de Melo, Pekka Enberg,
	Arjan van de Ven, Christoph Hellwig, linux-kernel, netdev

Extended BPF (or 64-bit BPF) is an instruction set for
creating safe, dynamically loadable filters that can call a fixed set
of kernel functions and take a generic bpf_context as input.
A BPF filter is the glue between kernel functions and the bpf_context.
Different kernel subsystems can define their own set of available functions
and alter the BPF machinery for their specific use case.
The BPF64 instruction set is designed for efficient mapping to native
instructions on 64-bit CPUs.

Old BPF instructions are remapped on the fly to BPF64
when sysctl net.core.bpf64_enable=1.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/filter.h      |    9 +-
 include/linux/netdevice.h   |    1 +
 include/uapi/linux/filter.h |   37 ++-
 net/core/Makefile           |    2 +-
 net/core/bpf_run.c          |  766 +++++++++++++++++++++++++++++++++++++++++++
 net/core/filter.c           |  114 ++++++-
 net/core/sysctl_net_core.c  |    7 +
 7 files changed, 913 insertions(+), 23 deletions(-)
 create mode 100644 net/core/bpf_run.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index e568c8ef896b..bf3085258f4c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
 extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned len);
 extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
 
+/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
+int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn *new_prog,
+		int *p_new_len);
+/* execute bpf64 program */
+u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
+
+#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
 #ifdef CONFIG_BPF_JIT
 #include <stdarg.h>
 #include <linux/linkage.h>
@@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
 		print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
 			       16, 1, image, proglen, false);
 }
-#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
 #else
 #include <linux/slab.h>
 static inline void bpf_jit_compile(struct sk_filter *fp)
@@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
 {
 	kfree(fp);
 }
-#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
 #endif
 
 static inline int bpf_tell_extensions(void)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5e84483c0650..7b1acefc244e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2971,6 +2971,7 @@ extern int		netdev_max_backlog;
 extern int		netdev_tstamp_prequeue;
 extern int		weight_p;
 extern int		bpf_jit_enable;
+extern int		bpf64_enable;
 
 bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
 struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
index 8eb9ccaa5b48..70ff29ee6825 100644
--- a/include/uapi/linux/filter.h
+++ b/include/uapi/linux/filter.h
@@ -1,3 +1,4 @@
+/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com */
 /*
  * Linux Socket Filter Data Structures
  */
@@ -19,7 +20,7 @@
  *	Try and keep these values and structures similar to BSD, especially
  *	the BPF code definitions which need to match so you can share filters
  */
- 
+
 struct sock_filter {	/* Filter block */
 	__u16	code;   /* Actual filter code */
 	__u8	jt;	/* Jump true */
@@ -45,12 +46,26 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_JMP         0x05
 #define         BPF_RET         0x06
 #define         BPF_MISC        0x07
+#define         BPF_ALU64       0x07
+
+struct bpf_insn {
+	__u8	code;    /* opcode */
+	__u8    a_reg:4; /* dest register*/
+	__u8    x_reg:4; /* source register */
+	__s16	off;     /* signed offset */
+	__s32	imm;     /* signed immediate constant */
+};
+
+/* pointer to bpf_context is the first and only argument to BPF program
+ * its definition is use-case specific */
+struct bpf_context;
 
 /* ld/ldx fields */
 #define BPF_SIZE(code)  ((code) & 0x18)
 #define         BPF_W           0x00
 #define         BPF_H           0x08
 #define         BPF_B           0x10
+#define         BPF_DW          0x18
 #define BPF_MODE(code)  ((code) & 0xe0)
 #define         BPF_IMM         0x00
 #define         BPF_ABS         0x20
@@ -58,6 +73,7 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_MEM         0x60
 #define         BPF_LEN         0x80
 #define         BPF_MSH         0xa0
+#define         BPF_XADD        0xc0 /* exclusive add */
 
 /* alu/jmp fields */
 #define BPF_OP(code)    ((code) & 0xf0)
@@ -68,16 +84,24 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_OR          0x40
 #define         BPF_AND         0x50
 #define         BPF_LSH         0x60
-#define         BPF_RSH         0x70
+#define         BPF_RSH         0x70 /* logical shift right */
 #define         BPF_NEG         0x80
 #define		BPF_MOD		0x90
 #define		BPF_XOR		0xa0
+#define		BPF_MOV		0xb0 /* mov reg to reg */
+#define		BPF_ARSH	0xc0 /* sign extending arithmetic shift right */
+#define		BPF_BSWAP32	0xd0 /* swap lower 4 bytes of 64-bit register */
+#define		BPF_BSWAP64	0xe0 /* swap all 8 bytes of 64-bit register */
 
 #define         BPF_JA          0x00
-#define         BPF_JEQ         0x10
-#define         BPF_JGT         0x20
-#define         BPF_JGE         0x30
-#define         BPF_JSET        0x40
+#define         BPF_JEQ         0x10 /* jump == */
+#define         BPF_JGT         0x20 /* GT is unsigned '>', JA in x86 */
+#define         BPF_JGE         0x30 /* GE is unsigned '>=', JAE in x86 */
+#define         BPF_JSET        0x40 /* if (A & X) */
+#define         BPF_JNE         0x50 /* jump != */
+#define         BPF_JSGT        0x60 /* SGT is signed '>', GT in x86 */
+#define         BPF_JSGE        0x70 /* SGE is signed '>=', GE in x86 */
+#define         BPF_CALL        0x80 /* function call */
 #define BPF_SRC(code)   ((code) & 0x08)
 #define         BPF_K           0x00
 #define         BPF_X           0x08
@@ -134,5 +158,4 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)
 
-
 #endif /* _UAPI__LINUX_FILTER_H__ */
diff --git a/net/core/Makefile b/net/core/Makefile
index 9628c20acff6..e622b97f58dc 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \
 obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 
 obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
-			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
+			neighbour.o rtnetlink.o utils.o link_watch.o filter.o bpf_run.o \
 			sock_diag.o dev_ioctl.o
 
 obj-$(CONFIG_XFRM) += flow.o
diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
new file mode 100644
index 000000000000..fa1862fcbc74
--- /dev/null
+++ b/net/core/bpf_run.c
@@ -0,0 +1,766 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/filter.h>
+#include <linux/skbuff.h>
+#include <asm/unaligned.h>
+
+int bpf64_enable __read_mostly;
+
+void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k,
+					   unsigned int size);
+
+static inline void *load_pointer(const struct sk_buff *skb, int k,
+				 unsigned int size, void *buffer)
+{
+	if (k >= 0)
+		return skb_header_pointer(skb, k, size, buffer);
+	return bpf_internal_load_pointer_neg_helper(skb, k, size);
+}
+
+static const char *const bpf_class_string[] = {
+	"ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
+};
+
+static const char *const bpf_alu_string[] = {
+	"+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
+	"%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
+};
+
+static const char *const bpf_ldst_string[] = {
+	"u32", "u16", "u8", "u64"
+};
+
+static const char *const bpf_jmp_string[] = {
+	"jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
+};
+
+static const char *reg_to_str(int regno, u64 *regs)
+{
+	static char reg_value[16][32];
+	if (!regs)
+		return "";
+	snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
+		 regs[regno]);
+	return reg_value[regno];
+}
+
+#define R(regno) reg_to_str(regno, regs)
+
+void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
+{
+	u16 class = BPF_CLASS(insn->code);
+	if (class == BPF_ALU || class == BPF_ALU64) {
+		if (BPF_SRC(insn->code) == BPF_X)
+			pr_info("code_%02x %sr%d%s %s r%d%s\n",
+				insn->code, class == BPF_ALU ? "(u32)" : "",
+				insn->a_reg, R(insn->a_reg),
+				bpf_alu_string[BPF_OP(insn->code) >> 4],
+				insn->x_reg, R(insn->x_reg));
+		else
+			pr_info("code_%02x %sr%d%s %s %d\n",
+				insn->code, class == BPF_ALU ? "(u32)" : "",
+				insn->a_reg, R(insn->a_reg),
+				bpf_alu_string[BPF_OP(insn->code) >> 4],
+				insn->imm);
+	} else if (class == BPF_STX) {
+		if (BPF_MODE(insn->code) == BPF_MEM)
+			pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
+				insn->code,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->a_reg, R(insn->a_reg),
+				insn->off, insn->x_reg, R(insn->x_reg));
+		else if (BPF_MODE(insn->code) == BPF_XADD)
+			pr_info("code_%02x lock *(%s *)(r%d%s %+d) += r%d%s\n",
+				insn->code,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->a_reg, R(insn->a_reg), insn->off,
+				insn->x_reg, R(insn->x_reg));
+		else
+			pr_info("BUG_%02x\n", insn->code);
+	} else if (class == BPF_ST) {
+		if (BPF_MODE(insn->code) != BPF_MEM) {
+			pr_info("BUG_st_%02x\n", insn->code);
+			return;
+		}
+		pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
+			insn->code,
+			bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+			insn->a_reg, R(insn->a_reg),
+			insn->off, insn->imm);
+	} else if (class == BPF_LDX) {
+		if (BPF_MODE(insn->code) == BPF_MEM) {
+			pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
+				insn->code, insn->a_reg,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->x_reg, R(insn->x_reg), insn->off);
+		} else {
+			pr_info("BUG_ldx_%02x\n", insn->code);
+		}
+	} else if (class == BPF_LD) {
+		if (BPF_MODE(insn->code) == BPF_ABS) {
+			pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
+				insn->code, insn->a_reg,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->imm);
+		} else if (BPF_MODE(insn->code) == BPF_IND) {
+			pr_info("code_%02x r%d = *(%s *)(skb + r%d%s %+d)\n",
+				insn->code, insn->a_reg,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->x_reg, R(insn->x_reg), insn->imm);
+		} else {
+			pr_info("BUG_ld_%02x\n", insn->code);
+		}
+	} else if (class == BPF_JMP) {
+		u16 opcode = BPF_OP(insn->code);
+		if (opcode == BPF_CALL) {
+			pr_info("code_%02x call %d\n", insn->code, insn->imm);
+		} else if (insn->code == (BPF_JMP | BPF_JA)) {
+			pr_info("code_%02x goto pc%+d\n",
+				insn->code, insn->off);
+		} else if (BPF_SRC(insn->code) == BPF_X) {
+			pr_info("code_%02x if r%d%s %s r%d%s goto pc%+d\n",
+				insn->code, insn->a_reg, R(insn->a_reg),
+				bpf_jmp_string[BPF_OP(insn->code) >> 4],
+				insn->x_reg, R(insn->x_reg), insn->off);
+		} else {
+			pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
+				insn->code, insn->a_reg, R(insn->a_reg),
+				bpf_jmp_string[BPF_OP(insn->code) >> 4],
+				insn->imm, insn->off);
+		}
+	} else {
+		pr_info("code_%02x %s\n", insn->code, bpf_class_string[class]);
+	}
+}
+EXPORT_SYMBOL(pr_info_bpf_insn);
+
+/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style (BPF64)
+ *
+ * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate new
+ * program length in one pass
+ *
+ * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
+ *
+ * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
+ * to remap in two passes: 1st pass finds new jump offsets, 2nd pass remaps
+ */
+int bpf_convert(struct sock_filter *old_prog, int len,
+		struct bpf_insn *new_prog, int *p_new_len)
+{
+	struct bpf_insn *new_insn;
+	struct sock_filter *fp;
+	int *addrs = NULL;
+	int new_len = 0;
+	int pass = 0;
+	int tgt, i;
+
+	if (len <= 0 || len >= BPF_MAXINSNS)
+		return -EINVAL;
+
+	if (new_prog) {
+		addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
+		if (!addrs)
+			return -ENOMEM;
+	}
+
+do_pass:
+	new_insn = new_prog;
+	fp = old_prog;
+	for (i = 0; i < len; fp++, i++) {
+		struct bpf_insn tmp_insns[3] = {};
+		struct bpf_insn *insn = tmp_insns;
+
+		if (addrs)
+			addrs[i] = new_insn - new_prog;
+
+		switch (fp->code) {
+		/* all arithmetic insns and skb loads map as-is */
+		case BPF_ALU | BPF_ADD | BPF_X:
+		case BPF_ALU | BPF_ADD | BPF_K:
+		case BPF_ALU | BPF_SUB | BPF_X:
+		case BPF_ALU | BPF_SUB | BPF_K:
+		case BPF_ALU | BPF_AND | BPF_X:
+		case BPF_ALU | BPF_AND | BPF_K:
+		case BPF_ALU | BPF_OR | BPF_X:
+		case BPF_ALU | BPF_OR | BPF_K:
+		case BPF_ALU | BPF_LSH | BPF_X:
+		case BPF_ALU | BPF_LSH | BPF_K:
+		case BPF_ALU | BPF_RSH | BPF_X:
+		case BPF_ALU | BPF_RSH | BPF_K:
+		case BPF_ALU | BPF_XOR | BPF_X:
+		case BPF_ALU | BPF_XOR | BPF_K:
+		case BPF_ALU | BPF_MUL | BPF_X:
+		case BPF_ALU | BPF_MUL | BPF_K:
+		case BPF_ALU | BPF_DIV | BPF_X:
+		case BPF_ALU | BPF_DIV | BPF_K:
+		case BPF_ALU | BPF_MOD | BPF_X:
+		case BPF_ALU | BPF_MOD | BPF_K:
+		case BPF_ALU | BPF_NEG:
+		case BPF_LD | BPF_ABS | BPF_W:
+		case BPF_LD | BPF_ABS | BPF_H:
+		case BPF_LD | BPF_ABS | BPF_B:
+		case BPF_LD | BPF_IND | BPF_W:
+		case BPF_LD | BPF_IND | BPF_H:
+		case BPF_LD | BPF_IND | BPF_B:
+			insn->code = fp->code;
+			insn->a_reg = 6;
+			insn->x_reg = 7;
+			insn->imm = fp->k;
+			break;
+
+		/* jump opcodes map as-is, but offsets need adjustment */
+		case BPF_JMP | BPF_JA:
+			tgt = i + fp->k + 1;
+			insn->code = fp->code;
+#define EMIT_JMP \
+	do { \
+		if (tgt >= len || tgt < 0) \
+			goto err; \
+		insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
+	} while (0)
+
+			EMIT_JMP;
+			break;
+
+		case BPF_JMP | BPF_JEQ | BPF_K:
+		case BPF_JMP | BPF_JEQ | BPF_X:
+		case BPF_JMP | BPF_JSET | BPF_K:
+		case BPF_JMP | BPF_JSET | BPF_X:
+		case BPF_JMP | BPF_JGT | BPF_K:
+		case BPF_JMP | BPF_JGT | BPF_X:
+		case BPF_JMP | BPF_JGE | BPF_K:
+		case BPF_JMP | BPF_JGE | BPF_X:
+			insn->a_reg = 6;
+			insn->x_reg = 7;
+			insn->imm = fp->k;
+			/* common case where 'jump_false' is next insn */
+			if (fp->jf == 0) {
+				insn->code = fp->code;
+				tgt = i + fp->jt + 1;
+				EMIT_JMP;
+				break;
+			}
+			/* convert JEQ into JNE when 'jump_true' is next insn */
+			if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
+				insn->code = BPF_JMP | BPF_JNE |
+					BPF_SRC(fp->code);
+				tgt = i + fp->jf + 1;
+				EMIT_JMP;
+				break;
+			}
+			/* other jumps are mapped into two insns: Jxx and JA */
+			tgt = i + fp->jt + 1;
+			insn->code = fp->code;
+			EMIT_JMP;
+
+			insn++;
+			insn->code = BPF_JMP | BPF_JA;
+			tgt = i + fp->jf + 1;
+			EMIT_JMP;
+			/* adjust pc relative offset, since it's a 2nd insn */
+			insn->off--;
+			break;
+
+			/* ldxb 4*([14]&0xf) is remaped into 3 insns */
+		case BPF_LDX | BPF_MSH | BPF_B:
+			insn->code = BPF_LD | BPF_ABS | BPF_B;
+			insn->a_reg = 7;
+			insn->imm = fp->k;
+
+			insn++;
+			insn->code = BPF_ALU | BPF_AND | BPF_K;
+			insn->a_reg = 7;
+			insn->imm = 0xf;
+
+			insn++;
+			insn->code = BPF_ALU | BPF_LSH | BPF_K;
+			insn->a_reg = 7;
+			insn->imm = 2;
+			break;
+
+			/* RET_K, RET_A are remaped into 2 insns */
+		case BPF_RET | BPF_A:
+		case BPF_RET | BPF_K:
+			insn->code = BPF_ALU | BPF_MOV |
+				(BPF_SRC(fp->code) == BPF_K ? BPF_K : BPF_X);
+			insn->a_reg = 0;
+			insn->x_reg = 6;
+			insn->imm = fp->k;
+
+			insn++;
+			insn->code = BPF_RET | BPF_K;
+			break;
+
+			/* store to stack */
+		case BPF_ST:
+		case BPF_STX:
+			insn->code = BPF_STX | BPF_MEM | BPF_W;
+			insn->a_reg = 10;
+			insn->x_reg = fp->code == BPF_ST ? 6 : 7;
+			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
+			break;
+
+			/* load from stack */
+		case BPF_LD | BPF_MEM:
+		case BPF_LDX | BPF_MEM:
+			insn->code = BPF_LDX | BPF_MEM | BPF_W;
+			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+			insn->x_reg = 10;
+			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
+			break;
+
+			/* A = K or X = K */
+		case BPF_LD | BPF_IMM:
+		case BPF_LDX | BPF_IMM:
+			insn->code = BPF_ALU | BPF_MOV | BPF_K;
+			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+			insn->imm = fp->k;
+			break;
+
+			/* X = A */
+		case BPF_MISC | BPF_TAX:
+			insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
+			insn->a_reg = 7;
+			insn->x_reg = 6;
+			break;
+
+			/* A = X */
+		case BPF_MISC | BPF_TXA:
+			insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
+			insn->a_reg = 6;
+			insn->x_reg = 7;
+			break;
+
+			/* A = skb->len or X = skb->len */
+		case BPF_LD|BPF_W|BPF_LEN:
+		case BPF_LDX|BPF_W|BPF_LEN:
+			insn->code = BPF_LDX | BPF_MEM | BPF_W;
+			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+			insn->x_reg = 1;
+			insn->off = offsetof(struct sk_buff, len);
+			break;
+
+		default:
+			/* pr_err("unknown opcode %02x\n", fp->code); */
+			goto err;
+		}
+
+		insn++;
+		if (new_prog) {
+			memcpy(new_insn, tmp_insns,
+			       sizeof(*insn) * (insn - tmp_insns));
+		}
+		new_insn += insn - tmp_insns;
+	}
+
+	if (!new_prog) {
+		/* only calculating new length */
+		*p_new_len = new_insn - new_prog;
+		return 0;
+	}
+
+	pass++;
+	if (new_len != new_insn - new_prog) {
+		new_len = new_insn - new_prog;
+		if (pass > 2)
+			goto err;
+		goto do_pass;
+	}
+	kfree(addrs);
+	if (*p_new_len != new_len)
+		/* inconsistent new program length */
+		pr_err("bpf_convert() usage error\n");
+	return 0;
+err:
+	kfree(addrs);
+	return -EINVAL;
+}
+
+notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
+{
+	u64 stack[64];
+	u64 regs[16];
+	void *ptr;
+	u64 tmp;
+	int off;
+
+#ifdef __x86_64
+#define LOAD_IMM /**/
+#define K insn->imm
+#else
+#define LOAD_IMM (K = insn->imm)
+	s32 K = insn->imm;
+#endif
+
+#ifdef DEBUG
+#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
+#else
+#define DEBUG_INSN
+#endif
+
+#define A regs[insn->a_reg]
+#define X regs[insn->x_reg]
+
+#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
+#define L(A, B, C) L_##A##B##C:
+
+#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
+#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
+/* some compilers may need help:
+ * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code]; })
+ */
+
+	static const void *jt[256] = {
+		[0 ... 255] = &&L_default,
+		DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
+		DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
+		DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
+		DL(BPF_ALU, BPF_OR, BPF_X)  DL(BPF_ALU, BPF_OR, BPF_K)
+		DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
+		DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
+		DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
+		DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
+		DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
+		DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
+		DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
+		DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD, BPF_K)
+		DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB, BPF_K)
+		DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND, BPF_K)
+		DL(BPF_ALU64, BPF_OR, BPF_X)  DL(BPF_ALU64, BPF_OR, BPF_K)
+		DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH, BPF_K)
+		DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH, BPF_K)
+		DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR, BPF_K)
+		DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL, BPF_K)
+		DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV, BPF_K)
+		DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH, BPF_K)
+		DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV, BPF_K)
+		DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD, BPF_K)
+		DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
+		DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
+		DL(BPF_ALU, BPF_NEG, 0)
+		DL(BPF_JMP, BPF_CALL, 0)
+		DL(BPF_JMP, BPF_JA, 0)
+		DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
+		DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
+		DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
+		DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
+		DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
+		DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
+		DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
+		DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
+		DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
+		DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
+		DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
+		DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
+		DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
+		DL(BPF_STX, BPF_XADD, BPF_W)
+#ifdef CONFIG_64BIT
+		DL(BPF_STX, BPF_XADD, BPF_DW)
+#endif
+		DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
+		DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
+		DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
+		DL(BPF_RET, BPF_K, 0)
+	};
+
+	regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
+	regs[1/* BPF R1 */] = (u64)(ulong)ctx;
+
+	DEBUG_INSN;
+	/* execute 1st insn */
+select_insn:
+	goto *jt[insn->code];
+
+		/* ALU */
+#define ALU(OPCODE, OP) \
+	L_BPF_ALU64##OPCODE##BPF_X: \
+		A = A OP X; \
+		CONT; \
+	L_BPF_ALU##OPCODE##BPF_X: \
+		A = (u32)A OP (u32)X; \
+		CONT; \
+	L_BPF_ALU64##OPCODE##BPF_K: \
+		A = A OP K; \
+		CONT; \
+	L_BPF_ALU##OPCODE##BPF_K: \
+		A = (u32)A OP (u32)K; \
+		CONT;
+
+	ALU(BPF_ADD, +)
+	ALU(BPF_SUB, -)
+	ALU(BPF_AND, &)
+	ALU(BPF_OR, |)
+	ALU(BPF_LSH, <<)
+	ALU(BPF_RSH, >>)
+	ALU(BPF_XOR, ^)
+	ALU(BPF_MUL, *)
+
+	L(BPF_ALU, BPF_NEG, 0)
+		A = (u32)-A;
+		CONT;
+	L(BPF_ALU, BPF_MOV, BPF_X)
+		A = (u32)X;
+		CONT;
+	L(BPF_ALU, BPF_MOV, BPF_K)
+		A = (u32)K;
+		CONT;
+	L(BPF_ALU64, BPF_MOV, BPF_X)
+		A = X;
+		CONT;
+	L(BPF_ALU64, BPF_MOV, BPF_K)
+		A = K;
+		CONT;
+	L(BPF_ALU64, BPF_ARSH, BPF_X)
+		(*(s64 *) &A) >>= X;
+		CONT;
+	L(BPF_ALU64, BPF_ARSH, BPF_K)
+		(*(s64 *) &A) >>= K;
+		CONT;
+	L(BPF_ALU64, BPF_MOD, BPF_X)
+		tmp = A;
+		if (X)
+			A = do_div(tmp, X);
+		CONT;
+	L(BPF_ALU, BPF_MOD, BPF_X)
+		tmp = (u32)A;
+		if (X)
+			A = do_div(tmp, (u32)X);
+		CONT;
+	L(BPF_ALU64, BPF_MOD, BPF_K)
+		tmp = A;
+		if (K)
+			A = do_div(tmp, K);
+		CONT;
+	L(BPF_ALU, BPF_MOD, BPF_K)
+		tmp = (u32)A;
+		if (K)
+			A = do_div(tmp, (u32)K);
+		CONT;
+	L(BPF_ALU64, BPF_DIV, BPF_X)
+		if (X)
+			do_div(A, X);
+		CONT;
+	L(BPF_ALU, BPF_DIV, BPF_X)
+		tmp = (u32)A;
+		if (X)
+			do_div(tmp, (u32)X);
+		A = (u32)tmp;
+		CONT;
+	L(BPF_ALU64, BPF_DIV, BPF_K)
+		if (K)
+			do_div(A, K);
+		CONT;
+	L(BPF_ALU, BPF_DIV, BPF_K)
+		tmp = (u32)A;
+		if (K)
+			do_div(tmp, (u32)K);
+		A = (u32)tmp;
+		CONT;
+	L(BPF_ALU64, BPF_BSWAP32, BPF_X)
+		A = swab32(A);
+		CONT;
+	L(BPF_ALU64, BPF_BSWAP64, BPF_X)
+		A = swab64(A);
+		CONT;
+
+		/* CALL */
+	L(BPF_JMP, BPF_CALL, 0)
+		/* TODO execute_func(K, regs); */
+		CONT;
+
+		/* JMP */
+	L(BPF_JMP, BPF_JA, 0)
+		insn += insn->off;
+		CONT;
+	L(BPF_JMP, BPF_JEQ, BPF_X)
+		if (A == X) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JEQ, BPF_K)
+		if (A == K) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JNE, BPF_X)
+		if (A != X) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JNE, BPF_K)
+		if (A != K) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JGT, BPF_X)
+		if (A > X) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JGT, BPF_K)
+		if (A > K) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JGE, BPF_X)
+		if (A >= X) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JGE, BPF_K)
+		if (A >= K) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JSGT, BPF_X)
+		if (((s64)A) > ((s64)X)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JSGT, BPF_K)
+		if (((s64)A) > ((s64)K)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JSGE, BPF_X)
+		if (((s64)A) >= ((s64)X)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JSGE, BPF_K)
+		if (((s64)A) >= ((s64)K)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JSET, BPF_X)
+		if (A & X) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JSET, BPF_K)
+		if (A & (u32)K) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+
+		/* STX and ST and LDX*/
+#define LDST(SIZEOP, SIZE) \
+	L_BPF_STXBPF_MEM##SIZEOP: \
+		*(SIZE *)(ulong)(A + insn->off) = X; \
+		CONT; \
+	L_BPF_STBPF_MEM##SIZEOP: \
+		*(SIZE *)(ulong)(A + insn->off) = K; \
+		CONT; \
+	L_BPF_LDXBPF_MEM##SIZEOP: \
+		A = *(SIZE *)(ulong)(X + insn->off); \
+		CONT;
+
+		LDST(BPF_B, u8)
+		LDST(BPF_H, u16)
+		LDST(BPF_W, u32)
+		LDST(BPF_DW, u64)
+
+		/* STX XADD */
+	L(BPF_STX, BPF_XADD, BPF_W)
+		atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
+		CONT;
+#ifdef CONFIG_64BIT
+	L(BPF_STX, BPF_XADD, BPF_DW)
+		atomic64_add((u64)X, (atomic64_t *)(ulong)(A + insn->off));
+		CONT;
+#endif
+
+		/* LD from SKB + K */
+	L(BPF_LD, BPF_ABS, BPF_W)
+		off = K;
+load_word:
+		ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
+		if (likely(ptr != NULL)) {
+			A = get_unaligned_be32(ptr);
+			CONT;
+		}
+		return 0;
+
+	L(BPF_LD, BPF_ABS, BPF_H)
+		off = K;
+load_half:
+		ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
+		if (likely(ptr != NULL)) {
+			A = get_unaligned_be16(ptr);
+			CONT;
+		}
+		return 0;
+
+	L(BPF_LD, BPF_ABS, BPF_B)
+		off = K;
+load_byte:
+		ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
+		if (likely(ptr != NULL)) {
+			A = *(u8 *)ptr;
+			CONT;
+		}
+		return 0;
+
+		/* LD from SKB + X + K */
+	L(BPF_LD, BPF_IND, BPF_W)
+		off = K + X;
+		goto load_word;
+
+	L(BPF_LD, BPF_IND, BPF_H)
+		off = K + X;
+		goto load_half;
+
+	L(BPF_LD, BPF_IND, BPF_B)
+		off = K + X;
+		goto load_byte;
+
+		/* RET */
+	L(BPF_RET, BPF_K, 0)
+		return regs[0/* R0 */];
+
+	L_default:
+		/* bpf_check() or bpf_convert() will guarantee that
+		 * we never reach here
+		 */
+		WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
+		return 0;
+#undef DL
+#undef L
+#undef CONT
+#undef A
+#undef X
+#undef K
+#undef LOAD_IMM
+#undef DEBUG_INSN
+#undef ALU
+#undef LDST
+}
+EXPORT_SYMBOL(bpf_run);
+
diff --git a/net/core/filter.c b/net/core/filter.c
index ad30d626a5bd..575bf5fd4335 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
 {
 	struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
 
+	if ((void *)fp->bpf_func == (void *)bpf_run)
+		/* arch specific jit_free are expecting this value */
+		fp->bpf_func = sk_run_filter;
+
 	bpf_jit_free(fp);
 }
 EXPORT_SYMBOL(sk_filter_release_rcu);
@@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
 	return 0;
 }
 
+static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog *fprog,
+			 struct sock *sk)
+{
+	unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
+	struct sock_filter *old_prog;
+	unsigned int sk_fsize;
+	struct sk_filter *fp;
+	int new_len;
+	int err;
+
+	/* store old program into buffer, since chk_filter will remap opcodes */
+	old_prog = kmalloc(fsize, GFP_KERNEL);
+	if (!old_prog)
+		return -ENOMEM;
+
+	if (sk) {
+		if (copy_from_user(old_prog, fprog->filter, fsize)) {
+			err = -EFAULT;
+			goto free_prog;
+		}
+	} else {
+		memcpy(old_prog, fprog->filter, fsize);
+	}
+
+	/* calculate bpf64 program length */
+	err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
+	if (err)
+		goto free_prog;
+
+	sk_fsize = sk_filter_size(new_len);
+	/* allocate sk_filter to store bpf64 program */
+	if (sk)
+		fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
+	else
+		fp = kmalloc(sk_fsize, GFP_KERNEL);
+	if (!fp) {
+		err = -ENOMEM;
+		goto free_prog;
+	}
+
+	/* remap old insns into bpf64 */
+	err = bpf_convert(old_prog, fprog->len,
+			  (struct bpf_insn *)fp->insns, &new_len);
+	if (err)
+		/* 2nd bpf_convert() can fail only if it fails
+		 * to allocate memory, remapping must succeed
+		 */
+		goto free_fp;
+
+	/* now chk_filter can overwrite old_prog while checking */
+	err = sk_chk_filter(old_prog, fprog->len);
+	if (err)
+		goto free_fp;
+
+	/* discard old prog */
+	kfree(old_prog);
+
+	atomic_set(&fp->refcnt, 1);
+	fp->len = new_len;
+
+	/* bpf64 insns must be executed by bpf_run */
+	fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
+
+	*pfp = fp;
+	return 0;
+free_fp:
+	if (sk)
+		sock_kfree_s(sk, fp, sk_fsize);
+	else
+		kfree(fp);
+free_prog:
+	kfree(old_prog);
+	return err;
+}
+
 /**
  *	sk_unattached_filter_create - create an unattached filter
  *	@fprog: the filter program
@@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
 	if (fprog->filter == NULL)
 		return -EINVAL;
 
+	if (bpf64_enable)
+		return bpf64_prepare(pfp, fprog, NULL);
+
 	fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
 	if (!fp)
 		return -ENOMEM;
@@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 	if (fprog->filter == NULL)
 		return -EINVAL;
 
-	fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
-	if (!fp)
-		return -ENOMEM;
-	if (copy_from_user(fp->insns, fprog->filter, fsize)) {
-		sock_kfree_s(sk, fp, sk_fsize);
-		return -EFAULT;
-	}
+	if (bpf64_enable) {
+		err = bpf64_prepare(&fp, fprog, sk);
+		if (err)
+			return err;
+	} else {
+		fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
+		if (!fp)
+			return -ENOMEM;
+		if (copy_from_user(fp->insns, fprog->filter, fsize)) {
+			sock_kfree_s(sk, fp, sk_fsize);
+			return -EFAULT;
+		}
 
-	atomic_set(&fp->refcnt, 1);
-	fp->len = fprog->len;
+		atomic_set(&fp->refcnt, 1);
+		fp->len = fprog->len;
 
-	err = __sk_prepare_filter(fp);
-	if (err) {
-		sk_filter_uncharge(sk, fp);
-		return err;
+		err = __sk_prepare_filter(fp);
+		if (err) {
+			sk_filter_uncharge(sk, fp);
+			return err;
+		}
 	}
 
 	old_fp = rcu_dereference_protected(sk->sk_filter,
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cf9cd13509a7..f03acc0e8950 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
 	},
 #endif
 	{
+		.procname	= "bpf64_enable",
+		.data		= &bpf64_enable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
 		.procname	= "netdev_tstamp_prequeue",
 		.data		= &netdev_tstamp_prequeue,
 		.maxlen		= sizeof(int),
-- 
1.7.9.5



* Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-02-27  2:38 ` [PATCH v3 net-next 1/1] " Alexei Starovoitov
@ 2014-02-28 12:45   ` Daniel Borkmann
  2014-02-28 20:53     ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Borkmann @ 2014-02-28 12:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Frederic Weisbecker, Arnaldo Carvalho de Melo, Pekka Enberg,
	Arjan van de Ven, Christoph Hellwig, linux-kernel, netdev,
	Hagen Paul Pfeifer, Jesse Gross

Hi Alexei,

[also cc'ing Hagen and Jesse]

Just some minor comments below ... let me know what you think.

On 02/27/2014 03:38 AM, Alexei Starovoitov wrote:
> Extended BPF (or 64-bit BPF) is an instruction set for
> creating safe, dynamically loadable filters that can call a fixed set
> of kernel functions and take a generic bpf_context as input.
> A BPF filter is the glue between kernel functions and the bpf_context.
> Different kernel subsystems can define their own set of available functions
> and alter the BPF machinery for their specific use case.
> The BPF64 instruction set is designed for efficient mapping to native
> instructions on 64-bit CPUs.
>
> Old BPF instructions are remapped on the fly to BPF64
> when sysctl net.core.bpf64_enable=1.

Would be nice if the commit message could be extended with what you
have posted in your [PATCH v3 net-next 0/1] email (but without the
changelog; the changelog should go after the "---" line), so that this
information will also appear here and in the Git log later on. Also,
please elaborate more in this commit message. I think, at least since
it's a bigger change, it also needs to include future planned usage
scenarios for user space and inside the kernel, e.g. for OVS and ftrace.

You could make this patch a 2-patch series and put into patch 2/2 all
the documentation you had in your previous version of the set. It would
be nice to extend Documentation/networking/filter.txt with a description
of your extended BPF so that users can read about it in the same file.

Did you also test that seccomp-BPF still works out?

> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>   include/linux/filter.h      |    9 +-
>   include/linux/netdevice.h   |    1 +
>   include/uapi/linux/filter.h |   37 ++-
>   net/core/Makefile           |    2 +-
>   net/core/bpf_run.c          |  766 +++++++++++++++++++++++++++++++++++++++++++
>   net/core/filter.c           |  114 ++++++-

I would have liked the content from net/core/bpf_run.c to go
directly into net/core/filter.c rather than a separate file, if that's
possible.

>   net/core/sysctl_net_core.c  |    7 +
>   7 files changed, 913 insertions(+), 23 deletions(-)
>   create mode 100644 net/core/bpf_run.c
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index e568c8ef896b..bf3085258f4c 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
>   extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned len);
>   extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
>
> +/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
> +int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn *new_prog,
> +		int *p_new_len);
> +/* execute bpf64 program */
> +u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
> +
> +#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
>   #ifdef CONFIG_BPF_JIT
>   #include <stdarg.h>
>   #include <linux/linkage.h>
> @@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
>   		print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
>   			       16, 1, image, proglen, false);
>   }
> -#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
>   #else
>   #include <linux/slab.h>
>   static inline void bpf_jit_compile(struct sk_filter *fp)
> @@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
>   {
>   	kfree(fp);
>   }
> -#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
>   #endif
>
>   static inline int bpf_tell_extensions(void)
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5e84483c0650..7b1acefc244e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2971,6 +2971,7 @@ extern int		netdev_max_backlog;
>   extern int		netdev_tstamp_prequeue;
>   extern int		weight_p;
>   extern int		bpf_jit_enable;
> +extern int		bpf64_enable;

We should keep naming consistent (so either extended BPF or BPF64),
so maybe bpf_ext_enable? I'd rather presume {bpf,sk_filter}*_ext,
as in 'struct bpf_insn' the immediate value is 32 bit, so for 64-bit
comparisons you'd still need to load two immediate values, right?

After all your changes, we will still have the bpf_jit_enable knob
intact, right?

>   bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
>   struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
> diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
> index 8eb9ccaa5b48..70ff29ee6825 100644
> --- a/include/uapi/linux/filter.h
> +++ b/include/uapi/linux/filter.h
> @@ -1,3 +1,4 @@
> +/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com */
>   /*
>    * Linux Socket Filter Data Structures
>    */

You can merge both comments into one.

> @@ -19,7 +20,7 @@
>    *	Try and keep these values and structures similar to BSD, especially
>    *	the BPF code definitions which need to match so you can share filters
>    */
> -
> +
>   struct sock_filter {	/* Filter block */
>   	__u16	code;   /* Actual filter code */
>   	__u8	jt;	/* Jump true */
> @@ -45,12 +46,26 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
>   #define         BPF_JMP         0x05
>   #define         BPF_RET         0x06
>   #define         BPF_MISC        0x07
> +#define         BPF_ALU64       0x07
> +
> +struct bpf_insn {
> +	__u8	code;    /* opcode */
> +	__u8    a_reg:4; /* dest register*/
> +	__u8    x_reg:4; /* source register */
> +	__s16	off;     /* signed offset */
> +	__s32	imm;     /* signed immediate constant */
> +};

As we have struct sock_filter and it's immutable due to uapi, I
would have liked to see the new data structure with a consistent
naming scheme, e.g. struct sock_filter_ext {...} for extended
BPF.

> +/* pointer to bpf_context is the first and only argument to BPF program
> + * its definition is use-case specific */
> +struct bpf_context;
>
>   /* ld/ldx fields */
>   #define BPF_SIZE(code)  ((code) & 0x18)
>   #define         BPF_W           0x00
>   #define         BPF_H           0x08
>   #define         BPF_B           0x10
> +#define         BPF_DW          0x18
>   #define BPF_MODE(code)  ((code) & 0xe0)
>   #define         BPF_IMM         0x00
>   #define         BPF_ABS         0x20
> @@ -58,6 +73,7 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
>   #define         BPF_MEM         0x60
>   #define         BPF_LEN         0x80
>   #define         BPF_MSH         0xa0
> +#define         BPF_XADD        0xc0 /* exclusive add */
>
>   /* alu/jmp fields */
>   #define BPF_OP(code)    ((code) & 0xf0)
> @@ -68,16 +84,24 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
>   #define         BPF_OR          0x40
>   #define         BPF_AND         0x50
>   #define         BPF_LSH         0x60
> -#define         BPF_RSH         0x70
> +#define         BPF_RSH         0x70 /* logical shift right */
>   #define         BPF_NEG         0x80
>   #define		BPF_MOD		0x90
>   #define		BPF_XOR		0xa0
> +#define		BPF_MOV		0xb0 /* mov reg to reg */
> +#define		BPF_ARSH	0xc0 /* sign extending arithmetic shift right */
> +#define		BPF_BSWAP32	0xd0 /* swap lower 4 bytes of 64-bit register */
> +#define		BPF_BSWAP64	0xe0 /* swap all 8 bytes of 64-bit register */
>
>   #define         BPF_JA          0x00
> -#define         BPF_JEQ         0x10
> -#define         BPF_JGT         0x20
> -#define         BPF_JGE         0x30
> -#define         BPF_JSET        0x40
> +#define         BPF_JEQ         0x10 /* jump == */
> +#define         BPF_JGT         0x20 /* GT is unsigned '>', JA in x86 */
> +#define         BPF_JGE         0x30 /* GE is unsigned '>=', JAE in x86 */
> +#define         BPF_JSET        0x40 /* if (A & X) */
> +#define         BPF_JNE         0x50 /* jump != */
> +#define         BPF_JSGT        0x60 /* SGT is signed '>', GT in x86 */
> +#define         BPF_JSGE        0x70 /* SGE is signed '>=', GE in x86 */
> +#define         BPF_CALL        0x80 /* function call */
>   #define BPF_SRC(code)   ((code) & 0x08)
>   #define         BPF_K           0x00
>   #define         BPF_X           0x08
> @@ -134,5 +158,4 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
>   #define SKF_NET_OFF   (-0x100000)
>   #define SKF_LL_OFF    (-0x200000)
>
> -
>   #endif /* _UAPI__LINUX_FILTER_H__ */
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 9628c20acff6..e622b97f58dc 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \
>   obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
>
>   obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
> -			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
> +			neighbour.o rtnetlink.o utils.o link_watch.o filter.o bpf_run.o \
>   			sock_diag.o dev_ioctl.o
>
>   obj-$(CONFIG_XFRM) += flow.o
> diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
> new file mode 100644
> index 000000000000..fa1862fcbc74
> --- /dev/null
> +++ b/net/core/bpf_run.c
> @@ -0,0 +1,766 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/filter.h>
> +#include <linux/skbuff.h>
> +#include <asm/unaligned.h>
> +
> +int bpf64_enable __read_mostly;
> +
> +void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k,
> +					   unsigned int size);
> +
> +static inline void *load_pointer(const struct sk_buff *skb, int k,
> +				 unsigned int size, void *buffer)
> +{
> +	if (k >= 0)
> +		return skb_header_pointer(skb, k, size, buffer);
> +	return bpf_internal_load_pointer_neg_helper(skb, k, size);
> +}
> +
> +static const char *const bpf_class_string[] = {
> +	"ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
> +};
> +
> +static const char *const bpf_alu_string[] = {
> +	"+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
> +	"%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
> +};
> +
> +static const char *const bpf_ldst_string[] = {
> +	"u32", "u16", "u8", "u64"
> +};
> +
> +static const char *const bpf_jmp_string[] = {
> +	"jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
> +};
> +
> +static const char *reg_to_str(int regno, u64 *regs)
> +{
> +	static char reg_value[16][32];
> +	if (!regs)
> +		return "";
> +	snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
> +		 regs[regno]);
> +	return reg_value[regno];
> +}
> +
> +#define R(regno) reg_to_str(regno, regs)
> +
> +void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
> +{
> +	u16 class = BPF_CLASS(insn->code);
> +	if (class == BPF_ALU || class == BPF_ALU64) {
> +		if (BPF_SRC(insn->code) == BPF_X)
> +			pr_info("code_%02x %sr%d%s %s r%d%s\n",
> +				insn->code, class == BPF_ALU ? "(u32)" : "",
> +				insn->a_reg, R(insn->a_reg),
> +				bpf_alu_string[BPF_OP(insn->code) >> 4],
> +				insn->x_reg, R(insn->x_reg));
> +		else
> +			pr_info("code_%02x %sr%d%s %s %d\n",
> +				insn->code, class == BPF_ALU ? "(u32)" : "",
> +				insn->a_reg, R(insn->a_reg),
> +				bpf_alu_string[BPF_OP(insn->code) >> 4],
> +				insn->imm);
> +	} else if (class == BPF_STX) {
> +		if (BPF_MODE(insn->code) == BPF_MEM)
> +			pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
> +				insn->code,
> +				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +				insn->a_reg, R(insn->a_reg),
> +				insn->off, insn->x_reg, R(insn->x_reg));
> +		else if (BPF_MODE(insn->code) == BPF_XADD)
> +			pr_info("code_%02x lock *(%s *)(r%d%s %+d) += r%d%s\n",
> +				insn->code,
> +				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +				insn->a_reg, R(insn->a_reg), insn->off,
> +				insn->x_reg, R(insn->x_reg));
> +		else
> +			pr_info("BUG_%02x\n", insn->code);
> +	} else if (class == BPF_ST) {
> +		if (BPF_MODE(insn->code) != BPF_MEM) {
> +			pr_info("BUG_st_%02x\n", insn->code);
> +			return;
> +		}
> +		pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
> +			insn->code,
> +			bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +			insn->a_reg, R(insn->a_reg),
> +			insn->off, insn->imm);
> +	} else if (class == BPF_LDX) {
> +		if (BPF_MODE(insn->code) == BPF_MEM) {
> +			pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
> +				insn->code, insn->a_reg,
> +				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +				insn->x_reg, R(insn->x_reg), insn->off);
> +		} else {
> +			pr_info("BUG_ldx_%02x\n", insn->code);
> +		}
> +	} else if (class == BPF_LD) {
> +		if (BPF_MODE(insn->code) == BPF_ABS) {
> +			pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
> +				insn->code, insn->a_reg,
> +				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +				insn->imm);
> +		} else if (BPF_MODE(insn->code) == BPF_IND) {
> +			pr_info("code_%02x r%d = *(%s *)(skb + r%d%s %+d)\n",
> +				insn->code, insn->a_reg,
> +				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +				insn->x_reg, R(insn->x_reg), insn->imm);
> +		} else {
> +			pr_info("BUG_ld_%02x\n", insn->code);
> +		}
> +	} else if (class == BPF_JMP) {
> +		u16 opcode = BPF_OP(insn->code);
> +		if (opcode == BPF_CALL) {
> +			pr_info("code_%02x call %d\n", insn->code, insn->imm);
> +		} else if (insn->code == (BPF_JMP | BPF_JA)) {
> +			pr_info("code_%02x goto pc%+d\n",
> +				insn->code, insn->off);
> +		} else if (BPF_SRC(insn->code) == BPF_X) {
> +			pr_info("code_%02x if r%d%s %s r%d%s goto pc%+d\n",
> +				insn->code, insn->a_reg, R(insn->a_reg),
> +				bpf_jmp_string[BPF_OP(insn->code) >> 4],
> +				insn->x_reg, R(insn->x_reg), insn->off);
> +		} else {
> +			pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
> +				insn->code, insn->a_reg, R(insn->a_reg),
> +				bpf_jmp_string[BPF_OP(insn->code) >> 4],
> +				insn->imm, insn->off);
> +		}
> +	} else {
> +		pr_info("code_%02x %s\n", insn->code, bpf_class_string[class]);
> +	}
> +}
> +EXPORT_SYMBOL(pr_info_bpf_insn);

Why would that need to be exported as a symbol?

I would actually like to avoid having this pr_info* inside the kernel.
Couldn't this be done e.g. through a systemtap script that could be
placed under tools/net/ or inside the documentation file?

> +/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style (BPF64)
> + *
> + * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate new
> + * program length in one pass
> + *
> + * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
> + *
> + * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
> + * to remap in two passes: 1st pass finds new jump offsets, 2nd pass remaps
> + */

Would be nice to have the API comment in kdoc format.

> +int bpf_convert(struct sock_filter *old_prog, int len,
> +		struct bpf_insn *new_prog, int *p_new_len)
> +{

If we place it into net/core/filter.c, it would be nice to keep naming
conventions consistent, e.g. sk_convert_filter() or so.

> +	struct bpf_insn *new_insn;
> +	struct sock_filter *fp;
> +	int *addrs = NULL;
> +	int new_len = 0;
> +	int pass = 0;
> +	int tgt, i;
> +
> +	if (len <= 0 || len >= BPF_MAXINSNS)
> +		return -EINVAL;
> +
> +	if (new_prog) {
> +		addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
> +		if (!addrs)
> +			return -ENOMEM;
> +	}
> +
> +do_pass:
> +	new_insn = new_prog;
> +	fp = old_prog;
> +	for (i = 0; i < len; fp++, i++) {
> +		struct bpf_insn tmp_insns[3] = {};
> +		struct bpf_insn *insn = tmp_insns;
> +
> +		if (addrs)
> +			addrs[i] = new_insn - new_prog;
> +
> +		switch (fp->code) {
> +		/* all arithmetic insns and skb loads map as-is */
> +		case BPF_ALU | BPF_ADD | BPF_X:
> +		case BPF_ALU | BPF_ADD | BPF_K:
> +		case BPF_ALU | BPF_SUB | BPF_X:
> +		case BPF_ALU | BPF_SUB | BPF_K:
> +		case BPF_ALU | BPF_AND | BPF_X:
> +		case BPF_ALU | BPF_AND | BPF_K:
> +		case BPF_ALU | BPF_OR | BPF_X:
> +		case BPF_ALU | BPF_OR | BPF_K:
> +		case BPF_ALU | BPF_LSH | BPF_X:
> +		case BPF_ALU | BPF_LSH | BPF_K:
> +		case BPF_ALU | BPF_RSH | BPF_X:
> +		case BPF_ALU | BPF_RSH | BPF_K:
> +		case BPF_ALU | BPF_XOR | BPF_X:
> +		case BPF_ALU | BPF_XOR | BPF_K:
> +		case BPF_ALU | BPF_MUL | BPF_X:
> +		case BPF_ALU | BPF_MUL | BPF_K:
> +		case BPF_ALU | BPF_DIV | BPF_X:
> +		case BPF_ALU | BPF_DIV | BPF_K:
> +		case BPF_ALU | BPF_MOD | BPF_X:
> +		case BPF_ALU | BPF_MOD | BPF_K:
> +		case BPF_ALU | BPF_NEG:
> +		case BPF_LD | BPF_ABS | BPF_W:
> +		case BPF_LD | BPF_ABS | BPF_H:
> +		case BPF_LD | BPF_ABS | BPF_B:
> +		case BPF_LD | BPF_IND | BPF_W:
> +		case BPF_LD | BPF_IND | BPF_H:
> +		case BPF_LD | BPF_IND | BPF_B:

I think here and elsewhere for similar constructions, we could use the
BPF_S_* helpers that were introduced by Hagen in commit 01f2f3f6ef4d076
("net: optimize Berkeley Packet Filter (BPF) processing").

> +			insn->code = fp->code;
> +			insn->a_reg = 6;
> +			insn->x_reg = 7;
> +			insn->imm = fp->k;
> +			break;
> +
> +		/* jump opcodes map as-is, but offsets need adjustment */
> +		case BPF_JMP | BPF_JA:
> +			tgt = i + fp->k + 1;
> +			insn->code = fp->code;
> +#define EMIT_JMP \
> +	do { \
> +		if (tgt >= len || tgt < 0) \
> +			goto err; \
> +		insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
> +	} while (0)
> +
> +			EMIT_JMP;
> +			break;
> +
> +		case BPF_JMP | BPF_JEQ | BPF_K:
> +		case BPF_JMP | BPF_JEQ | BPF_X:
> +		case BPF_JMP | BPF_JSET | BPF_K:
> +		case BPF_JMP | BPF_JSET | BPF_X:
> +		case BPF_JMP | BPF_JGT | BPF_K:
> +		case BPF_JMP | BPF_JGT | BPF_X:
> +		case BPF_JMP | BPF_JGE | BPF_K:
> +		case BPF_JMP | BPF_JGE | BPF_X:
> +			insn->a_reg = 6;
> +			insn->x_reg = 7;
> +			insn->imm = fp->k;
> +			/* common case where 'jump_false' is next insn */
> +			if (fp->jf == 0) {
> +				insn->code = fp->code;
> +				tgt = i + fp->jt + 1;
> +				EMIT_JMP;
> +				break;
> +			}
> +			/* convert JEQ into JNE when 'jump_true' is next insn */
> +			if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
> +				insn->code = BPF_JMP | BPF_JNE |
> +					BPF_SRC(fp->code);
> +				tgt = i + fp->jf + 1;
> +				EMIT_JMP;
> +				break;
> +			}
> +			/* other jumps are mapped into two insns: Jxx and JA */
> +			tgt = i + fp->jt + 1;
> +			insn->code = fp->code;
> +			EMIT_JMP;
> +
> +			insn++;
> +			insn->code = BPF_JMP | BPF_JA;
> +			tgt = i + fp->jf + 1;
> +			EMIT_JMP;
> +			/* adjust pc relative offset, since it's a 2nd insn */
> +			insn->off--;
> +			break;
> +
> +			/* ldxb 4*([14]&0xf) is remaped into 3 insns */
> +		case BPF_LDX | BPF_MSH | BPF_B:
> +			insn->code = BPF_LD | BPF_ABS | BPF_B;
> +			insn->a_reg = 7;
> +			insn->imm = fp->k;
> +
> +			insn++;
> +			insn->code = BPF_ALU | BPF_AND | BPF_K;
> +			insn->a_reg = 7;
> +			insn->imm = 0xf;
> +
> +			insn++;
> +			insn->code = BPF_ALU | BPF_LSH | BPF_K;
> +			insn->a_reg = 7;
> +			insn->imm = 2;
> +			break;
> +
> +			/* RET_K, RET_A are remaped into 2 insns */
> +		case BPF_RET | BPF_A:
> +		case BPF_RET | BPF_K:
> +			insn->code = BPF_ALU | BPF_MOV |
> +				(BPF_SRC(fp->code) == BPF_K ? BPF_K : BPF_X);
> +			insn->a_reg = 0;
> +			insn->x_reg = 6;
> +			insn->imm = fp->k;
> +
> +			insn++;
> +			insn->code = BPF_RET | BPF_K;
> +			break;
> +
> +			/* store to stack */
> +		case BPF_ST:
> +		case BPF_STX:
> +			insn->code = BPF_STX | BPF_MEM | BPF_W;
> +			insn->a_reg = 10;
> +			insn->x_reg = fp->code == BPF_ST ? 6 : 7;
> +			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
> +			break;
> +
> +			/* load from stack */
> +		case BPF_LD | BPF_MEM:
> +		case BPF_LDX | BPF_MEM:
> +			insn->code = BPF_LDX | BPF_MEM | BPF_W;
> +			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
> +			insn->x_reg = 10;
> +			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
> +			break;
> +
> +			/* A = K or X = K */
> +		case BPF_LD | BPF_IMM:
> +		case BPF_LDX | BPF_IMM:
> +			insn->code = BPF_ALU | BPF_MOV | BPF_K;
> +			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
> +			insn->imm = fp->k;
> +			break;
> +
> +			/* X = A */
> +		case BPF_MISC | BPF_TAX:
> +			insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> +			insn->a_reg = 7;
> +			insn->x_reg = 6;
> +			break;
> +
> +			/* A = X */
> +		case BPF_MISC | BPF_TXA:
> +			insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> +			insn->a_reg = 6;
> +			insn->x_reg = 7;
> +			break;
> +
> +			/* A = skb->len or X = skb->len */
> +		case BPF_LD|BPF_W|BPF_LEN:
> +		case BPF_LDX|BPF_W|BPF_LEN:
> +			insn->code = BPF_LDX | BPF_MEM | BPF_W;
> +			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
> +			insn->x_reg = 1;
> +			insn->off = offsetof(struct sk_buff, len);
> +			break;
> +
> +		default:
> +			/* pr_err("unknown opcode %02x\n", fp->code); */
> +			goto err;
> +		}
> +
> +		insn++;
> +		if (new_prog) {
> +			memcpy(new_insn, tmp_insns,
> +			       sizeof(*insn) * (insn - tmp_insns));
> +		}
> +		new_insn += insn - tmp_insns;
> +	}
> +
> +	if (!new_prog) {
> +		/* only calculating new length */
> +		*p_new_len = new_insn - new_prog;
> +		return 0;
> +	}
> +
> +	pass++;
> +	if (new_len != new_insn - new_prog) {
> +		new_len = new_insn - new_prog;
> +		if (pass > 2)
> +			goto err;
> +		goto do_pass;
> +	}
> +	kfree(addrs);
> +	if (*p_new_len != new_len)
> +		/* inconsistent new program length */
> +		pr_err("bpf_convert() usage error\n");
> +	return 0;
> +err:
> +	kfree(addrs);
> +	return -EINVAL;
> +}
> +
> +notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
> +{

Similarly, sk_run_filter_ext()?

> +	u64 stack[64];
> +	u64 regs[16];
> +	void *ptr;
> +	u64 tmp;
> +	int off;
> +
> +#ifdef __x86_64
> +#define LOAD_IMM /**/
> +#define K insn->imm
> +#else
> +#define LOAD_IMM (K = insn->imm)
> +	s32 K = insn->imm;
> +#endif
> +
> +#ifdef DEBUG
> +#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
> +#else
> +#define DEBUG_INSN
> +#endif

This DEBUG hunk could then be removed when we use a stap script
instead, for example.

> +#define A regs[insn->a_reg]
> +#define X regs[insn->x_reg]
> +
> +#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
> +#define L(A, B, C) L_##A##B##C:

Could we get rid of these two defines? I know you're defining
and using that as labels, but it's not obvious at first what
'jt' does. Maybe 'jt_labels' ?

> +#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
> +#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })

Not a big fan of macros, but here it seems fine though.

> +/* some compilers may need help:
> + * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code]; })
> + */
> +
> +	static const void *jt[256] = {
> +		[0 ... 255] = &&L_default,

Nit: please just one define per line below:

> +		DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
> +		DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
> +		DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
> +		DL(BPF_ALU, BPF_OR, BPF_X)  DL(BPF_ALU, BPF_OR, BPF_K)
> +		DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
> +		DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
> +		DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
> +		DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
> +		DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
> +		DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
> +		DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
> +		DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD, BPF_K)
> +		DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB, BPF_K)
> +		DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND, BPF_K)
> +		DL(BPF_ALU64, BPF_OR, BPF_X)  DL(BPF_ALU64, BPF_OR, BPF_K)
> +		DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH, BPF_K)
> +		DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH, BPF_K)
> +		DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR, BPF_K)
> +		DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL, BPF_K)
> +		DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV, BPF_K)
> +		DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH, BPF_K)
> +		DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV, BPF_K)
> +		DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD, BPF_K)
> +		DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
> +		DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
> +		DL(BPF_ALU, BPF_NEG, 0)
> +		DL(BPF_JMP, BPF_CALL, 0)
> +		DL(BPF_JMP, BPF_JA, 0)
> +		DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
> +		DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
> +		DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
> +		DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
> +		DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
> +		DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
> +		DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
> +		DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
> +		DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
> +		DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
> +		DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
> +		DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
> +		DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
> +		DL(BPF_STX, BPF_XADD, BPF_W)
> +#ifdef CONFIG_64BIT
> +		DL(BPF_STX, BPF_XADD, BPF_DW)
> +#endif
> +		DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
> +		DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
> +		DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
> +		DL(BPF_RET, BPF_K, 0)
> +	};
> +
> +	regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
> +	regs[1/* BPF R1 */] = (u64)(ulong)ctx;
> +
> +	DEBUG_INSN;
> +	/* execute 1st insn */
> +select_insn:
> +	goto *jt[insn->code];
> +
> +		/* ALU */
> +#define ALU(OPCODE, OP) \
> +	L_BPF_ALU64##OPCODE##BPF_X: \
> +		A = A OP X; \
> +		CONT; \
> +	L_BPF_ALU##OPCODE##BPF_X: \
> +		A = (u32)A OP (u32)X; \
> +		CONT; \
> +	L_BPF_ALU64##OPCODE##BPF_K: \
> +		A = A OP K; \
> +		CONT; \
> +	L_BPF_ALU##OPCODE##BPF_K: \
> +		A = (u32)A OP (u32)K; \
> +		CONT;
> +
> +	ALU(BPF_ADD, +)
> +	ALU(BPF_SUB, -)
> +	ALU(BPF_AND, &)
> +	ALU(BPF_OR, |)
> +	ALU(BPF_LSH, <<)
> +	ALU(BPF_RSH, >>)
> +	ALU(BPF_XOR, ^)
> +	ALU(BPF_MUL, *)
> +
> +	L(BPF_ALU, BPF_NEG, 0)
> +		A = (u32)-A;
> +		CONT;
> +	L(BPF_ALU, BPF_MOV, BPF_X)
> +		A = (u32)X;
> +		CONT;
> +	L(BPF_ALU, BPF_MOV, BPF_K)
> +		A = (u32)K;
> +		CONT;
> +	L(BPF_ALU64, BPF_MOV, BPF_X)
> +		A = X;
> +		CONT;
> +	L(BPF_ALU64, BPF_MOV, BPF_K)
> +		A = K;
> +		CONT;
> +	L(BPF_ALU64, BPF_ARSH, BPF_X)
> +		(*(s64 *) &A) >>= X;
> +		CONT;
> +	L(BPF_ALU64, BPF_ARSH, BPF_K)
> +		(*(s64 *) &A) >>= K;
> +		CONT;
> +	L(BPF_ALU64, BPF_MOD, BPF_X)
> +		tmp = A;
> +		if (X)
> +			A = do_div(tmp, X);
> +		CONT;
> +	L(BPF_ALU, BPF_MOD, BPF_X)
> +		tmp = (u32)A;
> +		if (X)
> +			A = do_div(tmp, (u32)X);
> +		CONT;
> +	L(BPF_ALU64, BPF_MOD, BPF_K)
> +		tmp = A;
> +		if (K)
> +			A = do_div(tmp, K);
> +		CONT;
> +	L(BPF_ALU, BPF_MOD, BPF_K)
> +		tmp = (u32)A;
> +		if (K)
> +			A = do_div(tmp, (u32)K);
> +		CONT;
> +	L(BPF_ALU64, BPF_DIV, BPF_X)
> +		if (X)
> +			do_div(A, X);
> +		CONT;
> +	L(BPF_ALU, BPF_DIV, BPF_X)
> +		tmp = (u32)A;
> +		if (X)
> +			do_div(tmp, (u32)X);
> +		A = (u32)tmp;
> +		CONT;
> +	L(BPF_ALU64, BPF_DIV, BPF_K)
> +		if (K)
> +			do_div(A, K);
> +		CONT;
> +	L(BPF_ALU, BPF_DIV, BPF_K)
> +		tmp = (u32)A;
> +		if (K)
> +			do_div(tmp, (u32)K);
> +		A = (u32)tmp;
> +		CONT;
> +	L(BPF_ALU64, BPF_BSWAP32, BPF_X)
> +		A = swab32(A);
> +		CONT;
> +	L(BPF_ALU64, BPF_BSWAP64, BPF_X)
> +		A = swab64(A);
> +		CONT;
> +
> +		/* CALL */
> +	L(BPF_JMP, BPF_CALL, 0)
> +		/* TODO execute_func(K, regs); */
> +		CONT;

Would the filter checker for now just return -EINVAL when this is used?

> +		/* JMP */
> +	L(BPF_JMP, BPF_JA, 0)
> +		insn += insn->off;
> +		CONT;
> +	L(BPF_JMP, BPF_JEQ, BPF_X)
> +		if (A == X) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JEQ, BPF_K)
> +		if (A == K) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JNE, BPF_X)
> +		if (A != X) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JNE, BPF_K)
> +		if (A != K) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JGT, BPF_X)
> +		if (A > X) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JGT, BPF_K)
> +		if (A > K) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JGE, BPF_X)
> +		if (A >= X) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JGE, BPF_K)
> +		if (A >= K) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JSGT, BPF_X)
> +		if (((s64)A) > ((s64)X)) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JSGT, BPF_K)
> +		if (((s64)A) > ((s64)K)) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JSGE, BPF_X)
> +		if (((s64)A) >= ((s64)X)) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JSGE, BPF_K)
> +		if (((s64)A) >= ((s64)K)) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JSET, BPF_X)
> +		if (A & X) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JSET, BPF_K)
> +		if (A & (u32)K) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;

Okay, so right now we only support forward jumps, right? And these
are checked by the old checker code I'd assume?

> +		/* STX and ST and LDX*/
> +#define LDST(SIZEOP, SIZE) \
> +	L_BPF_STXBPF_MEM##SIZEOP: \
> +		*(SIZE *)(ulong)(A + insn->off) = X; \
> +		CONT; \
> +	L_BPF_STBPF_MEM##SIZEOP: \
> +		*(SIZE *)(ulong)(A + insn->off) = K; \
> +		CONT; \
> +	L_BPF_LDXBPF_MEM##SIZEOP: \
> +		A = *(SIZE *)(ulong)(X + insn->off); \
> +		CONT;
> +
> +		LDST(BPF_B, u8)
> +		LDST(BPF_H, u16)
> +		LDST(BPF_W, u32)
> +		LDST(BPF_DW, u64)
> +
> +		/* STX XADD */
> +	L(BPF_STX, BPF_XADD, BPF_W)
> +		atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
> +		CONT;
> +#ifdef CONFIG_64BIT
> +	L(BPF_STX, BPF_XADD, BPF_DW)
> +		atomic64_add((u64)X, (atomic64_t *)(ulong)(A + insn->off));
> +		CONT;
> +#endif
> +
> +		/* LD from SKB + K */

Nit: indent one tab too much (here and elsewhere)

> +	L(BPF_LD, BPF_ABS, BPF_W)
> +		off = K;
> +load_word:
> +		ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
> +		if (likely(ptr != NULL)) {
> +			A = get_unaligned_be32(ptr);
> +			CONT;
> +		}
> +		return 0;
> +
> +	L(BPF_LD, BPF_ABS, BPF_H)
> +		off = K;
> +load_half:
> +		ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
> +		if (likely(ptr != NULL)) {
> +			A = get_unaligned_be16(ptr);
> +			CONT;
> +		}
> +		return 0;
> +
> +	L(BPF_LD, BPF_ABS, BPF_B)
> +		off = K;
> +load_byte:
> +		ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
> +		if (likely(ptr != NULL)) {
> +			A = *(u8 *)ptr;
> +			CONT;
> +		}
> +		return 0;
> +
> +		/* LD from SKB + X + K */

Nit: ditto

> +	L(BPF_LD, BPF_IND, BPF_W)
> +		off = K + X;
> +		goto load_word;
> +
> +	L(BPF_LD, BPF_IND, BPF_H)
> +		off = K + X;
> +		goto load_half;
> +
> +	L(BPF_LD, BPF_IND, BPF_B)
> +		off = K + X;
> +		goto load_byte;
> +
> +		/* RET */
> +	L(BPF_RET, BPF_K, 0)
> +		return regs[0/* R0 */];
> +
> +	L_default:
> +		/* bpf_check() or bpf_convert() will guarantee that
> +		 * we never reach here
> +		 */
> +		WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
> +		return 0;
> +#undef DL
> +#undef L
> +#undef CONT
> +#undef A
> +#undef X
> +#undef K
> +#undef LOAD_IMM
> +#undef DEBUG_INSN
> +#undef ALU
> +#undef LDST
> +}
> +EXPORT_SYMBOL(bpf_run);
> +
> diff --git a/net/core/filter.c b/net/core/filter.c
> index ad30d626a5bd..575bf5fd4335 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
>   {
>   	struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
>
> +	if ((void *)fp->bpf_func == (void *)bpf_run)
> +		/* arch specific jit_free are expecting this value */
> +		fp->bpf_func = sk_run_filter;
> +
>   	bpf_jit_free(fp);
>   }
>   EXPORT_SYMBOL(sk_filter_release_rcu);
> @@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
>   	return 0;
>   }
>
> +static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog *fprog,
> +			 struct sock *sk)
> +{

sk_prepare_filter_ext()?

> +	unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
> +	struct sock_filter *old_prog;
> +	unsigned int sk_fsize;
> +	struct sk_filter *fp;
> +	int new_len;
> +	int err;
> +
> +	/* store old program into buffer, since chk_filter will remap opcodes */
> +	old_prog = kmalloc(fsize, GFP_KERNEL);
> +	if (!old_prog)
> +		return -ENOMEM;
> +
> +	if (sk) {
> +		if (copy_from_user(old_prog, fprog->filter, fsize)) {
> +			err = -EFAULT;
> +			goto free_prog;
> +		}
> +	} else {
> +		memcpy(old_prog, fprog->filter, fsize);
> +	}
> +
> +	/* calculate bpf64 program length */
> +	err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
> +	if (err)
> +		goto free_prog;
> +
> +	sk_fsize = sk_filter_size(new_len);
> +	/* allocate sk_filter to store bpf64 program */
> +	if (sk)
> +		fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
> +	else
> +		fp = kmalloc(sk_fsize, GFP_KERNEL);
> +	if (!fp) {
> +		err = -ENOMEM;
> +		goto free_prog;
> +	}
> +
> +	/* remap old insns into bpf64 */
> +	err = bpf_convert(old_prog, fprog->len,
> +			  (struct bpf_insn *)fp->insns, &new_len);
> +	if (err)
> +		/* 2nd bpf_convert() can fail only if it fails
> +		 * to allocate memory, remapping must succeed
> +		 */
> +		goto free_fp;
> +
> +	/* now chk_filter can overwrite old_prog while checking */
> +	err = sk_chk_filter(old_prog, fprog->len);
> +	if (err)
> +		goto free_fp;
> +
> +	/* discard old prog */
> +	kfree(old_prog);
> +
> +	atomic_set(&fp->refcnt, 1);
> +	fp->len = new_len;
> +
> +	/* bpf64 insns must be executed by bpf_run */
> +	fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
> +
> +	*pfp = fp;
> +	return 0;
> +free_fp:
> +	if (sk)
> +		sock_kfree_s(sk, fp, sk_fsize);
> +	else
> +		kfree(fp);
> +free_prog:
> +	kfree(old_prog);
> +	return err;
> +}
> +
>   /**
>    *	sk_unattached_filter_create - create an unattached filter
>    *	@fprog: the filter program
> @@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
>   	if (fprog->filter == NULL)
>   		return -EINVAL;
>
> +	if (bpf64_enable)
> +		return bpf64_prepare(pfp, fprog, NULL);
> +
>   	fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
>   	if (!fp)
>   		return -ENOMEM;
> @@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>   	if (fprog->filter == NULL)
>   		return -EINVAL;
>
> -	fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
> -	if (!fp)
> -		return -ENOMEM;
> -	if (copy_from_user(fp->insns, fprog->filter, fsize)) {
> -		sock_kfree_s(sk, fp, sk_fsize);
> -		return -EFAULT;
> -	}
> +	if (bpf64_enable) {
> +		err = bpf64_prepare(&fp, fprog, sk);
> +		if (err)
> +			return err;
> +	} else {
> +		fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
> +		if (!fp)
> +			return -ENOMEM;
> +		if (copy_from_user(fp->insns, fprog->filter, fsize)) {
> +			sock_kfree_s(sk, fp, sk_fsize);
> +			return -EFAULT;
> +		}
>
> -	atomic_set(&fp->refcnt, 1);
> -	fp->len = fprog->len;
> +		atomic_set(&fp->refcnt, 1);
> +		fp->len = fprog->len;
>
> -	err = __sk_prepare_filter(fp);
> -	if (err) {
> -		sk_filter_uncharge(sk, fp);
> -		return err;
> +		err = __sk_prepare_filter(fp);
> +		if (err) {
> +			sk_filter_uncharge(sk, fp);
> +			return err;
> +		}
>   	}
>
>   	old_fp = rcu_dereference_protected(sk->sk_filter,
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index cf9cd13509a7..f03acc0e8950 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
>   	},
>   #endif
>   	{
> +		.procname	= "bpf64_enable",
> +		.data		= &bpf64_enable,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec
> +	},
> +	{
>   		.procname	= "netdev_tstamp_prequeue",
>   		.data		= &netdev_tstamp_prequeue,
>   		.maxlen		= sizeof(int),
>

Hope some of the comments made sense. ;-)

Thanks,

Daniel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-02-28 12:45   ` Daniel Borkmann
@ 2014-02-28 20:53     ` Alexei Starovoitov
  2014-03-01  0:10       ` Alexei Starovoitov
  2014-03-01  0:30       ` Daniel Borkmann
  0 siblings, 2 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2014-02-28 20:53 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Frederic Weisbecker, Arnaldo Carvalho de Melo, Pekka Enberg,
	Arjan van de Ven, Christoph Hellwig, linux-kernel, netdev,
	Hagen Paul Pfeifer, Jesse Gross

On Fri, Feb 28, 2014 at 4:45 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> Hi Alexei,
>
> [also cc'ing Hagen and Jesse]
>
> Just some minor comments below ... let me know what you think.

Thank you for the review! Comments below.

> On 02/27/2014 03:38 AM, Alexei Starovoitov wrote:
>>
>> Extended BPF (or 64-bit BPF) is an instruction set to
>> create safe dynamically loadable filters that can call fixed set
>> of kernel functions and take generic bpf_context as an input.
>> BPF filter is a glue between kernel functions and bpf_context.
>> Different kernel subsystems can define their own set of available
>> functions
>> and alter BPF machinery for specific use case.
>> BPF64 instruction set is designed for efficient mapping to native
>> instructions on 64-bit CPUs
>>
>> Old BPF instructions are remapped on the fly to BPF64
>> when sysctl net.core.bpf64_enable=1
>
>
> Would be nice if the commit message could be extended with what you
> have posted in your [PATCH v3 net-next 0/1] email (but without the
> changelog, changelog should go after "---" line), so that also this
> information will appear here and in the Git log later on. Also please
> elaborate more on this commit message. I think, at least, as it's a
> bigger change it also needs to include future planned usage scenarios
> for user space and for inside the kernel e.g. for OVS and ftrace.

Ok will do

> You could make this patch a 2 patch patch-series and put into patch
> 2/2 all documentation you had in your previous version of the set.
> Would be nice to extend the file Documentation/networking/filter.txt
> with a description of your extended BPF so that users can read about
> them in the same file.

Sure.

> Did you also test that seccomp-BPF still works out?

Yes. I have a prototype, but it needs a bit more cleanup.

>
>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>> ---
>>   include/linux/filter.h      |    9 +-
>>   include/linux/netdevice.h   |    1 +
>>   include/uapi/linux/filter.h |   37 ++-
>>   net/core/Makefile           |    2 +-
>>   net/core/bpf_run.c          |  766
>> +++++++++++++++++++++++++++++++++++++++++++
>>   net/core/filter.c           |  114 ++++++-
>
>
> I would have liked to see the content from net/core/bpf_run.c to go
> directly into net/core/filter.c, not as a separate file, if that's
> possible.

sure. that's fine.

>
>>   net/core/sysctl_net_core.c  |    7 +
>>   7 files changed, 913 insertions(+), 23 deletions(-)
>>   create mode 100644 net/core/bpf_run.c
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index e568c8ef896b..bf3085258f4c 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter,
>> unsigned int flen);
>>   extern int sk_get_filter(struct sock *sk, struct sock_filter __user
>> *filter, unsigned len);
>>   extern void sk_decode_filter(struct sock_filter *filt, struct
>> sock_filter *to);
>>
>> +/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
>> +int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn
>> *new_prog,
>> +               int *p_new_len);
>> +/* execute bpf64 program */
>> +u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
>> +
>> +#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>> FILTER->insns)
>>   #ifdef CONFIG_BPF_JIT
>>   #include <stdarg.h>
>>   #include <linux/linkage.h>
>> @@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen,
>> unsigned int proglen,
>>                 print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
>>                                16, 1, image, proglen, false);
>>   }
>> -#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>> FILTER->insns)
>>   #else
>>   #include <linux/slab.h>
>>   static inline void bpf_jit_compile(struct sk_filter *fp)
>> @@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
>>   {
>>         kfree(fp);
>>   }
>> -#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
>>   #endif
>>
>>   static inline int bpf_tell_extensions(void)
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 5e84483c0650..7b1acefc244e 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -2971,6 +2971,7 @@ extern int                netdev_max_backlog;
>>   extern int            netdev_tstamp_prequeue;
>>   extern int            weight_p;
>>   extern int            bpf_jit_enable;
>> +extern int             bpf64_enable;
>
>
> We should keep naming consistent (so either extended BPF or BPF64),
> so maybe bpf_ext_enable ? I'd presume rather {bpf,sk_filter}*_ext

agree. we need consistent naming for both (old and new).
I'll try an all-out rename of bpf_*() into sk_filter64_*() and
sk_filter_ext_*() to see which one looks better.
I'm kinda leaning towards sk_filter64, since it's easier to quickly spot
the difference and it's more descriptive.

> as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
> comparisons, you'd still need to load two immediate values, right?

there is no insn that uses a 64-bit immediate, since 64-bit immediates
are extremely rare. Grepping x86-64 asm code for movabsq returns very few hits.
llvm or gcc can easily construct any constant by a combination of mov,
shifts and ors.
bpf64 comparisons are all 64-bit right now. So far I didn't see a need for
32-bit comparisons: since old bpf is all unsigned, mapping the 32->64 Jxx
is painless.
Notice I added signed comparisons, since real-life programs cannot do
without them.
I also kept the spirit of old bpf by having > and >= only (no < and <=).
That made the llvm/gcc backends a bit harder to do, since no real cpu has
such a limitation.
I'm still contemplating whether to add < and <= (both signed and unsigned),
which is an interesting trade-off: number of instructions vs complexity
of the compiler.
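To make the 'mov, shifts and ors' point concrete, a back-end could
materialize a 64-bit constant roughly like this (illustrative sketch only;
the register number, constant and array name are arbitrary):

	/* r6 = 0x1234567800c0ffee, built from 32-bit immediates only */
	struct bpf_insn build_imm64[] = {
		/* r6 = 0x12345678 (upper half, sign bit clear) */
		{ .code = BPF_ALU64 | BPF_MOV | BPF_K, .a_reg = 6, .imm = 0x12345678 },
		/* r6 <<= 32 */
		{ .code = BPF_ALU64 | BPF_LSH | BPF_K, .a_reg = 6, .imm = 32 },
		/* r6 |= 0x00c0ffee (positive, so no sign-extension issue) */
		{ .code = BPF_ALU64 | BPF_OR | BPF_K, .a_reg = 6, .imm = 0x00c0ffee },
	};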

> After all your changes, we will still have the bpf_jit_enable knob
> intact, right?

Yes.
In this diff the workflow is the following:

old filter comes through sk_attach_filter() or sk_unattached_filter_create()
if (bpf64) {
    convert to new
    sk_chk_filter() - check old bpf
    use new interpreter
} else {
    sk_chk_filter() - check old bpf
    if (bpf_jit_enable)
        use old jit
    else
        use old interpreter
}
Soon I'll add a bpf64 JIT and will reuse the same bpf_jit_enable knob for it,
then add new/old in-band demux into sk_attach_filter(),
so that the workflow will become:

a filter comes through sk_attach_filter() or sk_unattached_filter_create()
if (new filter prog) {
    sk_chk_filter64() - check new bpf
    if (bpf_jit_enable)
        use new jit
    else
        use new interpreter
} else {
    if (bpf64_enable) {
       convert to new
       sk_chk_filter() - check old bpf
       if (bpf_jit_enable)
            use new jit
       else
            use new interpreter
    } else {
       sk_chk_filter()
       if (bpf_jit_enable)
           use old jit
       else
           use old interpreter
    }
}
Eventually bpf64_enable can be made the default, the last 'else' can be
retired, and 'bpf64_enable' can be removed along with the old interpreter.

The bpf_jit_enable knob will stay for the foreseeable future.

>>   bool netdev_has_upper_dev(struct net_device *dev, struct net_device
>> *upper_dev);
>>   struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device
>> *dev,
>> diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
>> index 8eb9ccaa5b48..70ff29ee6825 100644
>> --- a/include/uapi/linux/filter.h
>> +++ b/include/uapi/linux/filter.h
>> @@ -1,3 +1,4 @@
>> +/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com
>> */
>>   /*
>>    * Linux Socket Filter Data Structures
>>    */
>
>
> You can merge both comments into one.

ok.

>
>> @@ -19,7 +20,7 @@
>>    *    Try and keep these values and structures similar to BSD,
>> especially
>>    *    the BPF code definitions which need to match so you can share
>> filters
>>    */
>> -
>> +
>>   struct sock_filter {  /* Filter block */
>>         __u16   code;   /* Actual filter code */
>>         __u8    jt;     /* Jump true */
>> @@ -45,12 +46,26 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>> */
>>   #define         BPF_JMP         0x05
>>   #define         BPF_RET         0x06
>>   #define         BPF_MISC        0x07
>> +#define         BPF_ALU64       0x07
>> +
>> +struct bpf_insn {
>> +       __u8    code;    /* opcode */
>> +       __u8    a_reg:4; /* dest register*/
>> +       __u8    x_reg:4; /* source register */
>> +       __s16   off;     /* signed offset */
>> +       __s32   imm;     /* signed immediate constant */
>> +};
>
>
> As we have struct sock_filter and it's immutable due to uapi, I
> would have liked to see the new data structure with a consistent
> naming scheme, e.g. struct sock_filter_ext {...} for extended
> BPF.

ok. Will try renaming bpf_insn to sock_filter64 and to sock_filter_ext to
see which one looks better throughout the code.

>> +/* pointer to bpf_context is the first and only argument to BPF program
>> + * its definition is use-case specific */
>> +struct bpf_context;
>>
>>   /* ld/ldx fields */
>>   #define BPF_SIZE(code)  ((code) & 0x18)
>>   #define         BPF_W           0x00
>>   #define         BPF_H           0x08
>>   #define         BPF_B           0x10
>> +#define         BPF_DW          0x18
>>   #define BPF_MODE(code)  ((code) & 0xe0)
>>   #define         BPF_IMM         0x00
>>   #define         BPF_ABS         0x20
>> @@ -58,6 +73,7 @@ struct sock_fprog {   /* Required for SO_ATTACH_FILTER.
>> */
>>   #define         BPF_MEM         0x60
>>   #define         BPF_LEN         0x80
>>   #define         BPF_MSH         0xa0
>> +#define         BPF_XADD        0xc0 /* exclusive add */
>>
>>   /* alu/jmp fields */
>>   #define BPF_OP(code)    ((code) & 0xf0)
>> @@ -68,16 +84,24 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>> */
>>   #define         BPF_OR          0x40
>>   #define         BPF_AND         0x50
>>   #define         BPF_LSH         0x60
>> -#define         BPF_RSH         0x70
>> +#define         BPF_RSH         0x70 /* logical shift right */
>>   #define         BPF_NEG         0x80
>>   #define               BPF_MOD         0x90
>>   #define               BPF_XOR         0xa0
>> +#define                BPF_MOV         0xb0 /* mov reg to reg */
>> +#define                BPF_ARSH        0xc0 /* sign extending arithmetic
>> shift right */
>> +#define                BPF_BSWAP32     0xd0 /* swap lower 4 bytes of
>> 64-bit register */
>> +#define                BPF_BSWAP64     0xe0 /* swap all 8 bytes of 64-bit
>> register */
>>
>>   #define         BPF_JA          0x00
>> -#define         BPF_JEQ         0x10
>> -#define         BPF_JGT         0x20
>> -#define         BPF_JGE         0x30
>> -#define         BPF_JSET        0x40
>> +#define         BPF_JEQ         0x10 /* jump == */
>> +#define         BPF_JGT         0x20 /* GT is unsigned '>', JA in x86 */
>> +#define         BPF_JGE         0x30 /* GE is unsigned '>=', JAE in x86
>> */
>> +#define         BPF_JSET        0x40 /* if (A & X) */
>> +#define         BPF_JNE         0x50 /* jump != */
>> +#define         BPF_JSGT        0x60 /* SGT is signed '>', GT in x86 */
>> +#define         BPF_JSGE        0x70 /* SGE is signed '>=', GE in x86 */
>> +#define         BPF_CALL        0x80 /* function call */
>>   #define BPF_SRC(code)   ((code) & 0x08)
>>   #define         BPF_K           0x00
>>   #define         BPF_X           0x08
>> @@ -134,5 +158,4 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>> */
>>   #define SKF_NET_OFF   (-0x100000)
>>   #define SKF_LL_OFF    (-0x200000)
>>
>> -
>>   #endif /* _UAPI__LINUX_FILTER_H__ */
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 9628c20acff6..e622b97f58dc 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o
>> stream.o scm.o \
>>   obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
>>
>>   obj-y              += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o
>> \
>> -                       neighbour.o rtnetlink.o utils.o link_watch.o
>> filter.o \
>> +                       neighbour.o rtnetlink.o utils.o link_watch.o
>> filter.o bpf_run.o \
>>                         sock_diag.o dev_ioctl.o
>>
>>   obj-$(CONFIG_XFRM) += flow.o
>> diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
>> new file mode 100644
>> index 000000000000..fa1862fcbc74
>> --- /dev/null
>> +++ b/net/core/bpf_run.c
>> @@ -0,0 +1,766 @@
>> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of version 2 of the GNU General Public
>> + * License as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful, but
>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>> + * General Public License for more details.
>> + */
>> +#include <linux/kernel.h>
>> +#include <linux/types.h>
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/filter.h>
>> +#include <linux/skbuff.h>
>> +#include <asm/unaligned.h>
>> +
>> +int bpf64_enable __read_mostly;
>> +
>> +void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int
>> k,
>> +                                          unsigned int size);
>> +
>> +static inline void *load_pointer(const struct sk_buff *skb, int k,
>> +                                unsigned int size, void *buffer)
>> +{
>> +       if (k >= 0)
>> +               return skb_header_pointer(skb, k, size, buffer);
>> +       return bpf_internal_load_pointer_neg_helper(skb, k, size);
>> +}
>> +
>> +static const char *const bpf_class_string[] = {
>> +       "ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
>> +};
>> +
>> +static const char *const bpf_alu_string[] = {
>> +       "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
>> +       "%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
>> +};
>> +
>> +static const char *const bpf_ldst_string[] = {
>> +       "u32", "u16", "u8", "u64"
>> +};
>> +
>> +static const char *const bpf_jmp_string[] = {
>> +       "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
>> +};
>> +
>> +static const char *reg_to_str(int regno, u64 *regs)
>> +{
>> +       static char reg_value[16][32];
>> +       if (!regs)
>> +               return "";
>> +       snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
>> +                regs[regno]);
>> +       return reg_value[regno];
>> +}
>> +
>> +#define R(regno) reg_to_str(regno, regs)
>> +
>> +void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
>> +{
>> +       u16 class = BPF_CLASS(insn->code);
>> +       if (class == BPF_ALU || class == BPF_ALU64) {
>> +               if (BPF_SRC(insn->code) == BPF_X)
>> +                       pr_info("code_%02x %sr%d%s %s r%d%s\n",
>> +                               insn->code, class == BPF_ALU ? "(u32)" :
>> "",
>> +                               insn->a_reg, R(insn->a_reg),
>> +                               bpf_alu_string[BPF_OP(insn->code) >> 4],
>> +                               insn->x_reg, R(insn->x_reg));
>> +               else
>> +                       pr_info("code_%02x %sr%d%s %s %d\n",
>> +                               insn->code, class == BPF_ALU ? "(u32)" :
>> "",
>> +                               insn->a_reg, R(insn->a_reg),
>> +                               bpf_alu_string[BPF_OP(insn->code) >> 4],
>> +                               insn->imm);
>> +       } else if (class == BPF_STX) {
>> +               if (BPF_MODE(insn->code) == BPF_MEM)
>> +                       pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
>> +                               insn->code,
>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> +                               insn->a_reg, R(insn->a_reg),
>> +                               insn->off, insn->x_reg, R(insn->x_reg));
>> +               else if (BPF_MODE(insn->code) == BPF_XADD)
>> +                       pr_info("code_%02x lock *(%s *)(r%d%s %+d) +=
>> r%d%s\n",
>> +                               insn->code,
>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> +                               insn->a_reg, R(insn->a_reg), insn->off,
>> +                               insn->x_reg, R(insn->x_reg));
>> +               else
>> +                       pr_info("BUG_%02x\n", insn->code);
>> +       } else if (class == BPF_ST) {
>> +               if (BPF_MODE(insn->code) != BPF_MEM) {
>> +                       pr_info("BUG_st_%02x\n", insn->code);
>> +                       return;
>> +               }
>> +               pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
>> +                       insn->code,
>> +                       bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
>> +                       insn->a_reg, R(insn->a_reg),
>> +                       insn->off, insn->imm);
>> +       } else if (class == BPF_LDX) {
>> +               if (BPF_MODE(insn->code) == BPF_MEM) {
>> +                       pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
>> +                               insn->code, insn->a_reg,
>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> +                               insn->x_reg, R(insn->x_reg), insn->off);
>> +               } else {
>> +                       pr_info("BUG_ldx_%02x\n", insn->code);
>> +               }
>> +       } else if (class == BPF_LD) {
>> +               if (BPF_MODE(insn->code) == BPF_ABS) {
>> +                       pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
>> +                               insn->code, insn->a_reg,
>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> +                               insn->imm);
>> +               } else if (BPF_MODE(insn->code) == BPF_IND) {
>> +                       pr_info("code_%02x r%d = *(%s *)(skb + r%d%s
>> %+d)\n",
>> +                               insn->code, insn->a_reg,
>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> +                               insn->x_reg, R(insn->x_reg), insn->imm);
>> +               } else {
>> +                       pr_info("BUG_ld_%02x\n", insn->code);
>> +               }
>> +       } else if (class == BPF_JMP) {
>> +               u16 opcode = BPF_OP(insn->code);
>> +               if (opcode == BPF_CALL) {
>> +                       pr_info("code_%02x call %d\n", insn->code,
>> insn->imm);
>> +               } else if (insn->code == (BPF_JMP | BPF_JA)) {
>> +                       pr_info("code_%02x goto pc%+d\n",
>> +                               insn->code, insn->off);
>> +               } else if (BPF_SRC(insn->code) == BPF_X) {
>> +                       pr_info("code_%02x if r%d%s %s r%d%s goto
>> pc%+d\n",
>> +                               insn->code, insn->a_reg, R(insn->a_reg),
>> +                               bpf_jmp_string[BPF_OP(insn->code) >> 4],
>> +                               insn->x_reg, R(insn->x_reg), insn->off);
>> +               } else {
>> +                       pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
>> +                               insn->code, insn->a_reg, R(insn->a_reg),
>> +                               bpf_jmp_string[BPF_OP(insn->code) >> 4],
>> +                               insn->imm, insn->off);
>> +               }
>> +       } else {
>> +               pr_info("code_%02x %s\n", insn->code,
>> bpf_class_string[class]);
>> +       }
>> +}
>> +EXPORT_SYMBOL(pr_info_bpf_insn);
>
>
> Why would that need to be exported as a symbol?

the performance numbers I mentioned are from bpf_bench, which is built as
a kernel module, so I used this for debugging from it,
and also to see what execution traces I get while running the trinity bpf
fuzzer :)

> I would actually like to avoid having this pr_info* inside the kernel.
> Couldn't this be done e.g. through systemtap script that could e.g. be
> placed under tools/net/ or inside the documentation file?

I like the idea!
Will drop it from the diff and eventually move it to tools/net.

>> +/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style
>> (BPF64)
>> + *
>> + * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate
>> new
>> + * program length in one pass
>> + *
>> + * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
>> + *
>> + * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
>> + * to remap in two passes: 1st pass finds new jump offsets, 2nd pass
>> remaps
>> + */
>
>
> Would be nice to have the API comment in kdoc format.

good point. will do.

>> +int bpf_convert(struct sock_filter *old_prog, int len,
>> +               struct bpf_insn *new_prog, int *p_new_len)
>> +{
>
>
> If we place it into net/core/filter.c, it would be nice to keep naming
> conventions consistent, e.g. sk_convert_filter() or so.

ok.

>> +       struct bpf_insn *new_insn;
>> +       struct sock_filter *fp;
>> +       int *addrs = NULL;
>> +       int new_len = 0;
>> +       int pass = 0;
>> +       int tgt, i;
>> +
>> +       if (len <= 0 || len >= BPF_MAXINSNS)
>> +               return -EINVAL;
>> +
>> +       if (new_prog) {
>> +               addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
>> +               if (!addrs)
>> +                       return -ENOMEM;
>> +       }
>> +
>> +do_pass:
>> +       new_insn = new_prog;
>> +       fp = old_prog;
>> +       for (i = 0; i < len; fp++, i++) {
>> +               struct bpf_insn tmp_insns[3] = {};
>> +               struct bpf_insn *insn = tmp_insns;
>> +
>> +               if (addrs)
>> +                       addrs[i] = new_insn - new_prog;
>> +
>> +               switch (fp->code) {
>> +               /* all arithmetic insns and skb loads map as-is */
>> +               case BPF_ALU | BPF_ADD | BPF_X:
>> +               case BPF_ALU | BPF_ADD | BPF_K:
>> +               case BPF_ALU | BPF_SUB | BPF_X:
>> +               case BPF_ALU | BPF_SUB | BPF_K:
>> +               case BPF_ALU | BPF_AND | BPF_X:
>> +               case BPF_ALU | BPF_AND | BPF_K:
>> +               case BPF_ALU | BPF_OR | BPF_X:
>> +               case BPF_ALU | BPF_OR | BPF_K:
>> +               case BPF_ALU | BPF_LSH | BPF_X:
>> +               case BPF_ALU | BPF_LSH | BPF_K:
>> +               case BPF_ALU | BPF_RSH | BPF_X:
>> +               case BPF_ALU | BPF_RSH | BPF_K:
>> +               case BPF_ALU | BPF_XOR | BPF_X:
>> +               case BPF_ALU | BPF_XOR | BPF_K:
>> +               case BPF_ALU | BPF_MUL | BPF_X:
>> +               case BPF_ALU | BPF_MUL | BPF_K:
>> +               case BPF_ALU | BPF_DIV | BPF_X:
>> +               case BPF_ALU | BPF_DIV | BPF_K:
>> +               case BPF_ALU | BPF_MOD | BPF_X:
>> +               case BPF_ALU | BPF_MOD | BPF_K:
>> +               case BPF_ALU | BPF_NEG:
>> +               case BPF_LD | BPF_ABS | BPF_W:
>> +               case BPF_LD | BPF_ABS | BPF_H:
>> +               case BPF_LD | BPF_ABS | BPF_B:
>> +               case BPF_LD | BPF_IND | BPF_W:
>> +               case BPF_LD | BPF_IND | BPF_H:
>> +               case BPF_LD | BPF_IND | BPF_B:
>
>
> I think here and elsewhere for similar constructions, we could use
> BPF_S_* helpers that were introduced by Hagen in commit 01f2f3f6ef4d076
> ("net: optimize Berkeley Packet Filter (BPF) processing").

well, old bpf opcodes were just unlucky to be on the wrong side of
compiler heuristics at that time.
Instead of doing remapping while loading and the opposite remapping in
sk_get_filter(), the problem could have been solved with a few dummy
'case' statements. That would have only added a few lines of code instead
of a hundred lines of mapping back and forth.
Also I suspect commit 01f2f3f6ef4d076 was only valid when the whole kernel
is compiled with -Os. With -O2 GCC should have done it right.
The heuristic is: number_of_case_stmts * ratio < range_of_case_values,
and the ratio is 3 for -Os and 10 for -O2.
Old bpf has ~70 stmts and a ~200 range.
See expand_switch_as_decision_tree_p() in gcc/stmt.c.
Eventually, when the current interpreter is retired, I would like to remove
the remapping as well.
Especially for sk_convert_filter() I would like to keep the normal BPF_
values from uapi/filter.h, since the majority of them are reused as-is and
remapping would have added unnecessary code.
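To illustrate the dummy-'case' idea: a few extra labels inside the existing
opcode range raise the case count without widening the range, which pushes
the switch back to the tablejump side of that formula. Purely a sketch, the
unused opcode values below are made up:

		/* hypothetical raw opcodes that sk_chk_filter() rejects anyway;
		 * they exist only so GCC keeps emitting a tablejump
		 */
		case 0x09:
		case 0x0a:
		case 0x0b:
		default:
			return 0;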

>> +                       insn->code = fp->code;
>> +                       insn->a_reg = 6;
>> +                       insn->x_reg = 7;
>> +                       insn->imm = fp->k;
>> +                       break;
>> +
>> +               /* jump opcodes map as-is, but offsets need adjustment */
>> +               case BPF_JMP | BPF_JA:
>> +                       tgt = i + fp->k + 1;
>> +                       insn->code = fp->code;
>> +#define EMIT_JMP \
>> +       do { \
>> +               if (tgt >= len || tgt < 0) \
>> +                       goto err; \
>> +               insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
>> +       } while (0)
>> +
>> +                       EMIT_JMP;
>> +                       break;
>> +
>> +               case BPF_JMP | BPF_JEQ | BPF_K:
>> +               case BPF_JMP | BPF_JEQ | BPF_X:
>> +               case BPF_JMP | BPF_JSET | BPF_K:
>> +               case BPF_JMP | BPF_JSET | BPF_X:
>> +               case BPF_JMP | BPF_JGT | BPF_K:
>> +               case BPF_JMP | BPF_JGT | BPF_X:
>> +               case BPF_JMP | BPF_JGE | BPF_K:
>> +               case BPF_JMP | BPF_JGE | BPF_X:
>> +                       insn->a_reg = 6;
>> +                       insn->x_reg = 7;
>> +                       insn->imm = fp->k;
>> +                       /* common case where 'jump_false' is next insn */
>> +                       if (fp->jf == 0) {
>> +                               insn->code = fp->code;
>> +                               tgt = i + fp->jt + 1;
>> +                               EMIT_JMP;
>> +                               break;
>> +                       }
>> +                       /* convert JEQ into JNE when 'jump_true' is next
>> insn */
>> +                       if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
>> +                               insn->code = BPF_JMP | BPF_JNE |
>> +                                       BPF_SRC(fp->code);
>> +                               tgt = i + fp->jf + 1;
>> +                               EMIT_JMP;
>> +                               break;
>> +                       }
>> +                       /* other jumps are mapped into two insns: Jxx and
>> JA */
>> +                       tgt = i + fp->jt + 1;
>> +                       insn->code = fp->code;
>> +                       EMIT_JMP;
>> +
>> +                       insn++;
>> +                       insn->code = BPF_JMP | BPF_JA;
>> +                       tgt = i + fp->jf + 1;
>> +                       EMIT_JMP;
>> +                       /* adjust pc relative offset, since it's a 2nd
>> insn */
>> +                       insn->off--;
>> +                       break;
>> +
>> +                       /* ldxb 4*([14]&0xf) is remapped into 3 insns */
>> +               case BPF_LDX | BPF_MSH | BPF_B:
>> +                       insn->code = BPF_LD | BPF_ABS | BPF_B;
>> +                       insn->a_reg = 7;
>> +                       insn->imm = fp->k;
>> +
>> +                       insn++;
>> +                       insn->code = BPF_ALU | BPF_AND | BPF_K;
>> +                       insn->a_reg = 7;
>> +                       insn->imm = 0xf;
>> +
>> +                       insn++;
>> +                       insn->code = BPF_ALU | BPF_LSH | BPF_K;
>> +                       insn->a_reg = 7;
>> +                       insn->imm = 2;
>> +                       break;
>> +
>> +                       /* RET_K, RET_A are remapped into 2 insns */
>> +               case BPF_RET | BPF_A:
>> +               case BPF_RET | BPF_K:
>> +                       insn->code = BPF_ALU | BPF_MOV |
>> +                               (BPF_SRC(fp->code) == BPF_K ? BPF_K :
>> BPF_X);
>> +                       insn->a_reg = 0;
>> +                       insn->x_reg = 6;
>> +                       insn->imm = fp->k;
>> +
>> +                       insn++;
>> +                       insn->code = BPF_RET | BPF_K;
>> +                       break;
>> +
>> +                       /* store to stack */
>> +               case BPF_ST:
>> +               case BPF_STX:
>> +                       insn->code = BPF_STX | BPF_MEM | BPF_W;
>> +                       insn->a_reg = 10;
>> +                       insn->x_reg = fp->code == BPF_ST ? 6 : 7;
>> +                       insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>> +                       break;
>> +
>> +                       /* load from stack */
>> +               case BPF_LD | BPF_MEM:
>> +               case BPF_LDX | BPF_MEM:
>> +                       insn->code = BPF_LDX | BPF_MEM | BPF_W;
>> +                       insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>> 7;
>> +                       insn->x_reg = 10;
>> +                       insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>> +                       break;
>> +
>> +                       /* A = K or X = K */
>> +               case BPF_LD | BPF_IMM:
>> +               case BPF_LDX | BPF_IMM:
>> +                       insn->code = BPF_ALU | BPF_MOV | BPF_K;
>> +                       insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>> 7;
>> +                       insn->imm = fp->k;
>> +                       break;
>> +
>> +                       /* X = A */
>> +               case BPF_MISC | BPF_TAX:
>> +                       insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>> +                       insn->a_reg = 7;
>> +                       insn->x_reg = 6;
>> +                       break;
>> +
>> +                       /* A = X */
>> +               case BPF_MISC | BPF_TXA:
>> +                       insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>> +                       insn->a_reg = 6;
>> +                       insn->x_reg = 7;
>> +                       break;
>> +
>> +                       /* A = skb->len or X = skb->len */
>> +               case BPF_LD|BPF_W|BPF_LEN:
>> +               case BPF_LDX|BPF_W|BPF_LEN:
>> +                       insn->code = BPF_LDX | BPF_MEM | BPF_W;
>> +                       insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>> 7;
>> +                       insn->x_reg = 1;
>> +                       insn->off = offsetof(struct sk_buff, len);
>> +                       break;
>> +
>> +               default:
>> +                       /* pr_err("unknown opcode %02x\n", fp->code); */
>> +                       goto err;
>> +               }
>> +
>> +               insn++;
>> +               if (new_prog) {
>> +                       memcpy(new_insn, tmp_insns,
>> +                              sizeof(*insn) * (insn - tmp_insns));
>> +               }
>> +               new_insn += insn - tmp_insns;
>> +       }
>> +
>> +       if (!new_prog) {
>> +               /* only calculating new length */
>> +               *p_new_len = new_insn - new_prog;
>> +               return 0;
>> +       }
>> +
>> +       pass++;
>> +       if (new_len != new_insn - new_prog) {
>> +               new_len = new_insn - new_prog;
>> +               if (pass > 2)
>> +                       goto err;
>> +               goto do_pass;
>> +       }
>> +       kfree(addrs);
>> +       if (*p_new_len != new_len)
>> +               /* inconsistent new program length */
>> +               pr_err("bpf_convert() usage error\n");
>> +       return 0;
>> +err:
>> +       kfree(addrs);
>> +       return -EINVAL;
>> +}
>> +
>> +notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
>> +{
>
>
> Similarly, sk_run_filter_ext()?

ok.

>> +       u64 stack[64];
>> +       u64 regs[16];
>> +       void *ptr;
>> +       u64 tmp;
>> +       int off;
>> +
>> +#ifdef __x86_64
>> +#define LOAD_IMM /**/
>> +#define K insn->imm
>> +#else
>> +#define LOAD_IMM (K = insn->imm)
>> +       s32 K = insn->imm;
>> +#endif
>> +
>> +#ifdef DEBUG
>> +#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
>> +#else
>> +#define DEBUG_INSN
>> +#endif
>
>
> This DEBUG hunk could then be removed when we use a stap script
> instead, for example.

ok

>> +#define A regs[insn->a_reg]
>> +#define X regs[insn->x_reg]
>> +
>> +#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
>> +#define L(A, B, C) L_##A##B##C:
>
>
> Could we get rid of these two defines? I know you're defining

I think it's actually more readable this way and there's less chance of typos:
DL - define label
L - label
Will try to remove the 2nd macro to see how it looks...
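For reference, DL(BPF_ALU, BPF_ADD, BPF_X) expands to the jump table entry

	[BPF_ALU|BPF_ADD|BPF_X] = &&L_BPF_ALUBPF_ADDBPF_X,

and L(BPF_ALU, BPF_ADD, BPF_X) expands to the matching label

	L_BPF_ALUBPF_ADDBPF_X: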

> and using that as labels, but it's not obvious at first what
> 'jt' does. Maybe 'jt_labels' ?

ok will rename.

>> +#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>> +#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>
>
> Not a big fan of macros, but here it seems fine though.
>
>
>> +/* some compilers may need help:
>> + * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code];
>> })
>> + */
>> +
>> +       static const void *jt[256] = {
>> +               [0 ... 255] = &&L_default,
>
>
> Nit: please just one define per line below:

ok.

>> +               DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
>> +               DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
>> +               DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
>> +               DL(BPF_ALU, BPF_OR, BPF_X)  DL(BPF_ALU, BPF_OR, BPF_K)
>> +               DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
>> +               DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
>> +               DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
>> +               DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
>> +               DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
>> +               DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
>> +               DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
>> +               DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_OR, BPF_X)  DL(BPF_ALU64, BPF_OR, BPF_K)
>> +               DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
>> +               DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
>> +               DL(BPF_ALU, BPF_NEG, 0)
>> +               DL(BPF_JMP, BPF_CALL, 0)
>> +               DL(BPF_JMP, BPF_JA, 0)
>> +               DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
>> +               DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
>> +               DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
>> +               DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
>> +               DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
>> +               DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
>> +               DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
>> +               DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
>> +               DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
>> +               DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
>> +               DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
>> +               DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
>> +               DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
>> +               DL(BPF_STX, BPF_XADD, BPF_W)
>> +#ifdef CONFIG_64BIT
>> +               DL(BPF_STX, BPF_XADD, BPF_DW)
>> +#endif
>> +               DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
>> +               DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
>> +               DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
>> +               DL(BPF_RET, BPF_K, 0)
>> +       };
>> +
>> +       regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
>> +       regs[1/* BPF R1 */] = (u64)(ulong)ctx;
>> +
>> +       DEBUG_INSN;
>> +       /* execute 1st insn */
>> +select_insn:
>> +       goto *jt[insn->code];
>> +
>> +               /* ALU */
>> +#define ALU(OPCODE, OP) \
>> +       L_BPF_ALU64##OPCODE##BPF_X: \
>> +               A = A OP X; \
>> +               CONT; \
>> +       L_BPF_ALU##OPCODE##BPF_X: \
>> +               A = (u32)A OP (u32)X; \
>> +               CONT; \
>> +       L_BPF_ALU64##OPCODE##BPF_K: \
>> +               A = A OP K; \
>> +               CONT; \
>> +       L_BPF_ALU##OPCODE##BPF_K: \
>> +               A = (u32)A OP (u32)K; \
>> +               CONT;
>> +
>> +       ALU(BPF_ADD, +)
>> +       ALU(BPF_SUB, -)
>> +       ALU(BPF_AND, &)
>> +       ALU(BPF_OR, |)
>> +       ALU(BPF_LSH, <<)
>> +       ALU(BPF_RSH, >>)
>> +       ALU(BPF_XOR, ^)
>> +       ALU(BPF_MUL, *)
>> +
>> +       L(BPF_ALU, BPF_NEG, 0)
>> +               A = (u32)-A;
>> +               CONT;
>> +       L(BPF_ALU, BPF_MOV, BPF_X)
>> +               A = (u32)X;
>> +               CONT;
>> +       L(BPF_ALU, BPF_MOV, BPF_K)
>> +               A = (u32)K;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_MOV, BPF_X)
>> +               A = X;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_MOV, BPF_K)
>> +               A = K;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_ARSH, BPF_X)
>> +               (*(s64 *) &A) >>= X;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_ARSH, BPF_K)
>> +               (*(s64 *) &A) >>= K;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_MOD, BPF_X)
>> +               tmp = A;
>> +               if (X)
>> +                       A = do_div(tmp, X);
>> +               CONT;
>> +       L(BPF_ALU, BPF_MOD, BPF_X)
>> +               tmp = (u32)A;
>> +               if (X)
>> +                       A = do_div(tmp, (u32)X);
>> +               CONT;
>> +       L(BPF_ALU64, BPF_MOD, BPF_K)
>> +               tmp = A;
>> +               if (K)
>> +                       A = do_div(tmp, K);
>> +               CONT;
>> +       L(BPF_ALU, BPF_MOD, BPF_K)
>> +               tmp = (u32)A;
>> +               if (K)
>> +                       A = do_div(tmp, (u32)K);
>> +               CONT;
>> +       L(BPF_ALU64, BPF_DIV, BPF_X)
>> +               if (X)
>> +                       do_div(A, X);
>> +               CONT;
>> +       L(BPF_ALU, BPF_DIV, BPF_X)
>> +               tmp = (u32)A;
>> +               if (X)
>> +                       do_div(tmp, (u32)X);
>> +               A = (u32)tmp;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_DIV, BPF_K)
>> +               if (K)
>> +                       do_div(A, K);
>> +               CONT;
>> +       L(BPF_ALU, BPF_DIV, BPF_K)
>> +               tmp = (u32)A;
>> +               if (K)
>> +                       do_div(tmp, (u32)K);
>> +               A = (u32)tmp;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_BSWAP32, BPF_X)
>> +               A = swab32(A);
>> +               CONT;
>> +       L(BPF_ALU64, BPF_BSWAP64, BPF_X)
>> +               A = swab64(A);
>> +               CONT;
>> +
>> +               /* CALL */
>> +       L(BPF_JMP, BPF_CALL, 0)
>> +               /* TODO execute_func(K, regs); */
>> +               CONT;
>
>
> Would the filter checker for now just return -EINVAL when this is used?

Sure.
I was planning to address that in a week or so, but ok.
Will remove the 'call' insn from the first patch and re-add it in a clean
way in a follow-up diff.
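
A minimal sketch of such a reject, assuming the opcode macros and the
struct bpf_insn layout from this patch (the helper name is hypothetical,
not something that exists in the submission):

	/* hypothetical: refuse BPF_CALL until the call mechanism is wired up */
	static int reject_call_insns(const struct bpf_insn *insn, int len)
	{
		int i;

		for (i = 0; i < len; i++)
			if (insn[i].code == (BPF_JMP | BPF_CALL))
				return -EINVAL;
		return 0;
	}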

>> +               /* JMP */
>> +       L(BPF_JMP, BPF_JA, 0)
>> +               insn += insn->off;
>> +               CONT;
>> +       L(BPF_JMP, BPF_JEQ, BPF_X)
>> +               if (A == X) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JEQ, BPF_K)
>> +               if (A == K) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JNE, BPF_X)
>> +               if (A != X) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JNE, BPF_K)
>> +               if (A != K) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JGT, BPF_X)
>> +               if (A > X) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JGT, BPF_K)
>> +               if (A > K) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JGE, BPF_X)
>> +               if (A >= X) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JGE, BPF_K)
>> +               if (A >= K) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JSGT, BPF_X)
>> +               if (((s64)A) > ((s64)X)) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JSGT, BPF_K)
>> +               if (((s64)A) > ((s64)K)) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JSGE, BPF_X)
>> +               if (((s64)A) >= ((s64)X)) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JSGE, BPF_K)
>> +               if (((s64)A) >= ((s64)K)) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JSET, BPF_X)
>> +               if (A & X) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JSET, BPF_K)
>> +               if (A & (u32)K) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>
>
> Okay, so right now we only support forward jumps, right? And these
> are checked by the old checker code I'd assume?

Yes. Just like the old interpreter, the new interpreter doesn't care: it just jumps.
And you're correct, the old checker code will allow only forward jumps.
That restriction is preserved by sk_convert_filter() as well.
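
A minimal sketch of what that restriction boils down to for an unconditional
jump (hypothetical helper; the real enforcement stays in sk_chk_filter() on
the old program, and classic jt/jf/k are unsigned, so targets can only move
forward):

	/* hypothetical: a BPF_JA target pc + 1 + k must stay inside the
	 * program; written this way (assuming pc < flen) to avoid
	 * unsigned wrap-around
	 */
	static bool ja_target_in_range(const struct sock_filter *f,
				       unsigned int pc, unsigned int flen)
	{
		return f[pc].k < flen - pc - 1;
	}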

>> +               /* STX and ST and LDX*/
>> +#define LDST(SIZEOP, SIZE) \
>> +       L_BPF_STXBPF_MEM##SIZEOP: \
>> +               *(SIZE *)(ulong)(A + insn->off) = X; \
>> +               CONT; \
>> +       L_BPF_STBPF_MEM##SIZEOP: \
>> +               *(SIZE *)(ulong)(A + insn->off) = K; \
>> +               CONT; \
>> +       L_BPF_LDXBPF_MEM##SIZEOP: \
>> +               A = *(SIZE *)(ulong)(X + insn->off); \
>> +               CONT;
>> +
>> +               LDST(BPF_B, u8)
>> +               LDST(BPF_H, u16)
>> +               LDST(BPF_W, u32)
>> +               LDST(BPF_DW, u64)
>> +
>> +               /* STX XADD */
>> +       L(BPF_STX, BPF_XADD, BPF_W)
>> +               atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
>> +               CONT;
>> +#ifdef CONFIG_64BIT
>> +       L(BPF_STX, BPF_XADD, BPF_DW)
>> +               atomic64_add((u64)X, (atomic64_t *)(ulong)(A + insn->off));
>> +               CONT;
>> +#endif
>> +
>> +               /* LD from SKB + K */
>
>
> Nit: indent one tab too much (here and elsewhere)

ahh. ok.

>> +       L(BPF_LD, BPF_ABS, BPF_W)
>> +               off = K;
>> +load_word:
>> +               ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
>> +               if (likely(ptr != NULL)) {
>> +                       A = get_unaligned_be32(ptr);
>> +                       CONT;
>> +               }
>> +               return 0;
>> +
>> +       L(BPF_LD, BPF_ABS, BPF_H)
>> +               off = K;
>> +load_half:
>> +               ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
>> +               if (likely(ptr != NULL)) {
>> +                       A = get_unaligned_be16(ptr);
>> +                       CONT;
>> +               }
>> +               return 0;
>> +
>> +       L(BPF_LD, BPF_ABS, BPF_B)
>> +               off = K;
>> +load_byte:
>> +               ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
>> +               if (likely(ptr != NULL)) {
>> +                       A = *(u8 *)ptr;
>> +                       CONT;
>> +               }
>> +               return 0;
>> +
>> +               /* LD from SKB + X + K */
>
>
> Nit: ditto

ok

>> +       L(BPF_LD, BPF_IND, BPF_W)
>> +               off = K + X;
>> +               goto load_word;
>> +
>> +       L(BPF_LD, BPF_IND, BPF_H)
>> +               off = K + X;
>> +               goto load_half;
>> +
>> +       L(BPF_LD, BPF_IND, BPF_B)
>> +               off = K + X;
>> +               goto load_byte;
>> +
>> +               /* RET */
>> +       L(BPF_RET, BPF_K, 0)
>> +               return regs[0/* R0 */];
>> +
>> +       L_default:
>> +               /* bpf_check() or bpf_convert() will guarantee that
>> +                * we never reach here
>> +                */
>> +               WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
>> +               return 0;
>> +#undef DL
>> +#undef L
>> +#undef CONT
>> +#undef A
>> +#undef X
>> +#undef K
>> +#undef LOAD_IMM
>> +#undef DEBUG_INSN
>> +#undef ALU
>> +#undef LDST
>> +}
>> +EXPORT_SYMBOL(bpf_run);
>> +
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index ad30d626a5bd..575bf5fd4335 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
>>   {
>>         struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
>>
>> +       if ((void *)fp->bpf_func == (void *)bpf_run)
>> +               /* arch specific jit_free are expecting this value */
>> +               fp->bpf_func = sk_run_filter;
>> +
>>         bpf_jit_free(fp);
>>   }
>>   EXPORT_SYMBOL(sk_filter_release_rcu);
>> @@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
>>         return 0;
>>   }
>>
>> +static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog *fprog,
>> +                        struct sock *sk)
>> +{
>
>
> sk_prepare_filter_ext()?

ok.

>> +       unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
>> +       struct sock_filter *old_prog;
>> +       unsigned int sk_fsize;
>> +       struct sk_filter *fp;
>> +       int new_len;
>> +       int err;
>> +
>> +       /* store old program into buffer, since chk_filter will remap opcodes */
>> +       old_prog = kmalloc(fsize, GFP_KERNEL);
>> +       if (!old_prog)
>> +               return -ENOMEM;
>> +
>> +       if (sk) {
>> +               if (copy_from_user(old_prog, fprog->filter, fsize)) {
>> +                       err = -EFAULT;
>> +                       goto free_prog;
>> +               }
>> +       } else {
>> +               memcpy(old_prog, fprog->filter, fsize);
>> +       }
>> +
>> +       /* calculate bpf64 program length */
>> +       err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
>> +       if (err)
>> +               goto free_prog;
>> +
>> +       sk_fsize = sk_filter_size(new_len);
>> +       /* allocate sk_filter to store bpf64 program */
>> +       if (sk)
>> +               fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>> +       else
>> +               fp = kmalloc(sk_fsize, GFP_KERNEL);
>> +       if (!fp) {
>> +               err = -ENOMEM;
>> +               goto free_prog;
>> +       }
>> +
>> +       /* remap old insns into bpf64 */
>> +       err = bpf_convert(old_prog, fprog->len,
>> +                         (struct bpf_insn *)fp->insns, &new_len);
>> +       if (err)
>> +               /* 2nd bpf_convert() can fail only if it fails
>> +                * to allocate memory, remapping must succeed
>> +                */
>> +               goto free_fp;
>> +
>> +       /* now chk_filter can overwrite old_prog while checking */
>> +       err = sk_chk_filter(old_prog, fprog->len);
>> +       if (err)
>> +               goto free_fp;
>> +
>> +       /* discard old prog */
>> +       kfree(old_prog);
>> +
>> +       atomic_set(&fp->refcnt, 1);
>> +       fp->len = new_len;
>> +
>> +       /* bpf64 insns must be executed by bpf_run */
>> +       fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
>> +
>> +       *pfp = fp;
>> +       return 0;
>> +free_fp:
>> +       if (sk)
>> +               sock_kfree_s(sk, fp, sk_fsize);
>> +       else
>> +               kfree(fp);
>> +free_prog:
>> +       kfree(old_prog);
>> +       return err;
>> +}
>> +
>>   /**
>>    *    sk_unattached_filter_create - create an unattached filter
>>    *    @fprog: the filter program
>> @@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
>>         if (fprog->filter == NULL)
>>                 return -EINVAL;
>>
>> +       if (bpf64_enable)
>> +               return bpf64_prepare(pfp, fprog, NULL);
>> +
>>         fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
>>         if (!fp)
>>                 return -ENOMEM;
>> @@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>>         if (fprog->filter == NULL)
>>                 return -EINVAL;
>>
>> -       fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>> -       if (!fp)
>> -               return -ENOMEM;
>> -       if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>> -               sock_kfree_s(sk, fp, sk_fsize);
>> -               return -EFAULT;
>> -       }
>> +       if (bpf64_enable) {
>> +               err = bpf64_prepare(&fp, fprog, sk);
>> +               if (err)
>> +                       return err;
>> +       } else {
>> +               fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>> +               if (!fp)
>> +                       return -ENOMEM;
>> +               if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>> +                       sock_kfree_s(sk, fp, sk_fsize);
>> +                       return -EFAULT;
>> +               }
>>
>> -       atomic_set(&fp->refcnt, 1);
>> -       fp->len = fprog->len;
>> +               atomic_set(&fp->refcnt, 1);
>> +               fp->len = fprog->len;
>>
>> -       err = __sk_prepare_filter(fp);
>> -       if (err) {
>> -               sk_filter_uncharge(sk, fp);
>> -               return err;
>> +               err = __sk_prepare_filter(fp);
>> +               if (err) {
>> +                       sk_filter_uncharge(sk, fp);
>> +                       return err;
>> +               }
>>         }
>>
>>         old_fp = rcu_dereference_protected(sk->sk_filter,
>> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
>> index cf9cd13509a7..f03acc0e8950 100644
>> --- a/net/core/sysctl_net_core.c
>> +++ b/net/core/sysctl_net_core.c
>> @@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
>>         },
>>   #endif
>>         {
>> +               .procname       = "bpf64_enable",
>> +               .data           = &bpf64_enable,
>> +               .maxlen         = sizeof(int),
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dointvec
>> +       },
>> +       {
>>                 .procname       = "netdev_tstamp_prequeue",
>>                 .data           = &netdev_tstamp_prequeue,
>>                 .maxlen         = sizeof(int),
>>
>
> Hope some of the comments made sense. ;-)

Yes. Indeed. Thanks a lot for the thorough review!
Will address things and resend V4.

Alexei

> Thanks,
>
> Daniel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-02-28 20:53     ` Alexei Starovoitov
@ 2014-03-01  0:10       ` Alexei Starovoitov
  2014-03-01  0:30       ` Daniel Borkmann
  1 sibling, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2014-03-01  0:10 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Frederic Weisbecker, Arnaldo Carvalho de Melo, Pekka Enberg,
	Arjan van de Ven, Christoph Hellwig, linux-kernel, netdev,
	Hagen Paul Pfeifer, Jesse Gross

On Fri, Feb 28, 2014 at 12:53 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Fri, Feb 28, 2014 at 4:45 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>> Hi Alexei,
>>
>> [also cc'ing Hagen and Jesse]
>>
>> Just some minor comments below ... let me know what you think.
>
> Thank you for review! Comments below.
>
>> On 02/27/2014 03:38 AM, Alexei Starovoitov wrote:
>>>
>>> Extended BPF (or 64-bit BPF) is an instruction set to
>>> create safe dynamically loadable filters that can call fixed set
>>> of kernel functions and take generic bpf_context as an input.
>>> BPF filter is a glue between kernel functions and bpf_context.
>>> Different kernel subsystems can define their own set of available
>>> functions
>>> and alter BPF machinery for specific use case.
>>> BPF64 instruction set is designed for efficient mapping to native
>>> instructions on 64-bit CPUs
>>>
>>> Old BPF instructions are remapped on the fly to BPF64
>>> when sysctl net.core.bpf64_enable=1
>>
>>
>> Would be nice if the commit message could be extended with what you
>> have posted in your [PATCH v3 net-next 0/1] email (but without the
>> changelog, changelog should go after "---" line), so that also this
>> information will appear here and in the Git log later on. Also please
>> elaborate more on this commit message. I think, at least, as it's a
>> bigger change it also needs to include future planned usage scenarios
>> for user space and for inside the kernel e.g. for OVS and ftrace.
>
> Ok will do
>
>> You could make this patch a 2 patch patch-series and put into patch
>> 2/2 all documentation you had in your previous version of the set.
>> Would be nice to extend the file Documentation/networking/filter.txt
>> with a description of your extended BPF so that users can read about
>> them in the same file.
>
> Sure.
>
>> Did you also test that seccomp-BPF still works out?
>
> yes. Have a prototype, but it needs a bit more cleanup.
>
>>
>>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>>> ---
>>>   include/linux/filter.h      |    9 +-
>>>   include/linux/netdevice.h   |    1 +
>>>   include/uapi/linux/filter.h |   37 ++-
>>>   net/core/Makefile           |    2 +-
>>>   net/core/bpf_run.c          |  766
>>> +++++++++++++++++++++++++++++++++++++++++++
>>>   net/core/filter.c           |  114 ++++++-
>>
>>
>> I would have liked to see the content from net/core/bpf_run.c to go
>> directly into net/core/filter.c, not as a separate file, if that's
>> possible.
>
> sure. that's fine.
>
>>
>>>   net/core/sysctl_net_core.c  |    7 +
>>>   7 files changed, 913 insertions(+), 23 deletions(-)
>>>   create mode 100644 net/core/bpf_run.c
>>>
>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>> index e568c8ef896b..bf3085258f4c 100644
>>> --- a/include/linux/filter.h
>>> +++ b/include/linux/filter.h
>>> @@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter,
>>> unsigned int flen);
>>>   extern int sk_get_filter(struct sock *sk, struct sock_filter __user
>>> *filter, unsigned len);
>>>   extern void sk_decode_filter(struct sock_filter *filt, struct
>>> sock_filter *to);
>>>
>>> +/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
>>> +int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn
>>> *new_prog,
>>> +               int *p_new_len);
>>> +/* execute bpf64 program */
>>> +u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
>>> +
>>> +#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>>> FILTER->insns)
>>>   #ifdef CONFIG_BPF_JIT
>>>   #include <stdarg.h>
>>>   #include <linux/linkage.h>
>>> @@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen,
>>> unsigned int proglen,
>>>                 print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
>>>                                16, 1, image, proglen, false);
>>>   }
>>> -#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>>> FILTER->insns)
>>>   #else
>>>   #include <linux/slab.h>
>>>   static inline void bpf_jit_compile(struct sk_filter *fp)
>>> @@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
>>>   {
>>>         kfree(fp);
>>>   }
>>> -#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
>>>   #endif
>>>
>>>   static inline int bpf_tell_extensions(void)
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index 5e84483c0650..7b1acefc244e 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -2971,6 +2971,7 @@ extern int                netdev_max_backlog;
>>>   extern int            netdev_tstamp_prequeue;
>>>   extern int            weight_p;
>>>   extern int            bpf_jit_enable;
>>> +extern int             bpf64_enable;
>>
>>
>> We should keep naming consistent (so either extended BPF or BPF64),
>> so maybe bpf_ext_enable ? I'd presume rather {bpf,sk_filter}*_ext
>
> Agree, we need consistent naming for both (old and new).
> I'll try an all-out rename of bpf_*() into sk_filter64_*() and sk_filter_ext_*()
> to see which one looks better.
> I'm kinda leaning towards sk_filter64, since it's easier to quickly spot
> the difference and it's more descriptive.

After going back and forth between sk_filter64 and sk_filter_ext,
I decided to follow your suggestion and stick with sk_filter_ext,
bpf_ext_enable, etc., since calling it bpf64 may raise unnecessary
concerns on 32-bit architectures... but the performance numbers
show that extended BPF is faster than old BPF even on 32-bit cpus,
so it is 'extended BPF' from now on.

Thanks

>> as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
>> comparisons, you'd still need to load to immediate values, right?
>
> There is no insn that uses a 64-bit immediate, since 64-bit immediates
> are extremely rare: grepping x86-64 asm code for movabsq returns very few hits.
> llvm or gcc can easily construct any constant with a combination of mov,
> shifts and ors.
> bpf64 comparisons are all 64-bit right now. So far I didn't see a need for
> a 32-bit comparison, since old bpf is all unsigned, so mapping 32->64 for
> Jxx is painless.
> Notice I added signed comparisons, since real-life programs cannot do
> without them.
> I also kept the spirit of old bpf in having > and >= only (no < and <=).
> That made the llvm/gcc backends a bit harder to do, since no real cpu has
> such limitations.
> I'm still contemplating whether to add < and <= (both signed and unsigned),
> which is an interesting trade-off: number of instructions vs complexity of
> the compiler.
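
As a sketch, a 64-bit constant such as 0x12345678aabbccdd could be built
from 32-bit immediates with the encoding defined in this patch (whether a
compiler would emit exactly this sequence is an assumption):

	/* hypothetical sequence: r6 = 0x12345678aabbccdd
	 * fields are { code, a_reg, x_reg, off, imm } as defined above
	 */
	struct bpf_insn load_imm64[] = {
		/* r6 = high 32 bits */
		{ BPF_ALU64 | BPF_MOV | BPF_K, 6, 0, 0, 0x12345678 },
		/* r6 <<= 32 */
		{ BPF_ALU64 | BPF_LSH | BPF_K, 6, 0, 0, 32 },
		/* r7 = low 32 bits; 32-bit mov zero-extends (u32)imm */
		{ BPF_ALU   | BPF_MOV | BPF_K, 7, 0, 0, 0xaabbccdd },
		/* r6 |= r7 */
		{ BPF_ALU64 | BPF_OR  | BPF_X, 6, 7, 0, 0 },
	};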
>
>> After all your changes, we will still have the bpf_jit_enable knob
>> in tact, right?
>
> Yes.
> In this diff the workflow is the following:
>
> old filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (bpf64) {
>     convert to new
>     sk_chk_filter() - check old bpf
>     use new interpreter
> } else {
>     sk_chk_filter() - check old bpf
>     if (bpf_jit_enable)
>         use old jit
>     else
>         use old interpreter
> }
> Soon I'll add a bpf64 JIT and will reuse the same bpf_jit_enable knob for it,
> then add new/old in-band demux into sk_attach_filter(),
> so that the workflow will become:
>
> a filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (new filter prog) {
>     sk_chk_filter64() - check new bpf
>     if (bpf_jit_enable)
>         use new jit
>     else
>         use new interpreter
> } else {
>     if (bpf64_enable) {
>        convert to new
>        sk_chk_filter() - check old bpf
>        if (bpf_jit_enable)
>             use new jit
>        else
>             use new interpreter
>     } else {
>        sk_chk_filter()
>        if (bpf_jit_enable)
>            use old jit
>        else
>            use old interpreter
>     }
> }
> Eventually bpf64_enable can be made the default, the last 'else' can be
> retired, and 'bpf64_enable' removed along with the old interpreter.
>
> bpf_jit_enable knob will stay for foreseeable future.
>
>>>   bool netdev_has_upper_dev(struct net_device *dev, struct net_device
>>> *upper_dev);
>>>   struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device
>>> *dev,
>>> diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
>>> index 8eb9ccaa5b48..70ff29ee6825 100644
>>> --- a/include/uapi/linux/filter.h
>>> +++ b/include/uapi/linux/filter.h
>>> @@ -1,3 +1,4 @@
>>> +/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com
>>> */
>>>   /*
>>>    * Linux Socket Filter Data Structures
>>>    */
>>
>>
>> You can merge both comments into one.
>
> ok.
>
>>
>>> @@ -19,7 +20,7 @@
>>>    *    Try and keep these values and structures similar to BSD,
>>> especially
>>>    *    the BPF code definitions which need to match so you can share
>>> filters
>>>    */
>>> -
>>> +
>>>   struct sock_filter {  /* Filter block */
>>>         __u16   code;   /* Actual filter code */
>>>         __u8    jt;     /* Jump true */
>>> @@ -45,12 +46,26 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>>   #define         BPF_JMP         0x05
>>>   #define         BPF_RET         0x06
>>>   #define         BPF_MISC        0x07
>>> +#define         BPF_ALU64       0x07
>>> +
>>> +struct bpf_insn {
>>> +       __u8    code;    /* opcode */
>>> +       __u8    a_reg:4; /* dest register*/
>>> +       __u8    x_reg:4; /* source register */
>>> +       __s16   off;     /* signed offset */
>>> +       __s32   imm;     /* signed immediate constant */
>>> +};
>>
>>
>> As we have struct sock_filter and it's immutable due to uapi, I
>> would have liked to see the new data structure with a consistent
>> naming scheme, e.g. struct sock_filter_ext {...} for extended
>> BPF.
>
> Ok. Will rename bpf_insn to sock_filter64 and sock_filter_ext to see
> which one looks better throughout the code.
>
>>> +/* pointer to bpf_context is the first and only argument to BPF program
>>> + * its definition is use-case specific */
>>> +struct bpf_context;
>>>
>>>   /* ld/ldx fields */
>>>   #define BPF_SIZE(code)  ((code) & 0x18)
>>>   #define         BPF_W           0x00
>>>   #define         BPF_H           0x08
>>>   #define         BPF_B           0x10
>>> +#define         BPF_DW          0x18
>>>   #define BPF_MODE(code)  ((code) & 0xe0)
>>>   #define         BPF_IMM         0x00
>>>   #define         BPF_ABS         0x20
>>> @@ -58,6 +73,7 @@ struct sock_fprog {   /* Required for SO_ATTACH_FILTER.
>>> */
>>>   #define         BPF_MEM         0x60
>>>   #define         BPF_LEN         0x80
>>>   #define         BPF_MSH         0xa0
>>> +#define         BPF_XADD        0xc0 /* exclusive add */
>>>
>>>   /* alu/jmp fields */
>>>   #define BPF_OP(code)    ((code) & 0xf0)
>>> @@ -68,16 +84,24 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>>   #define         BPF_OR          0x40
>>>   #define         BPF_AND         0x50
>>>   #define         BPF_LSH         0x60
>>> -#define         BPF_RSH         0x70
>>> +#define         BPF_RSH         0x70 /* logical shift right */
>>>   #define         BPF_NEG         0x80
>>>   #define               BPF_MOD         0x90
>>>   #define               BPF_XOR         0xa0
>>> +#define                BPF_MOV         0xb0 /* mov reg to reg */
>>> +#define                BPF_ARSH        0xc0 /* sign extending arithmetic
>>> shift right */
>>> +#define                BPF_BSWAP32     0xd0 /* swap lower 4 bytes of
>>> 64-bit register */
>>> +#define                BPF_BSWAP64     0xe0 /* swap all 8 bytes of 64-bit
>>> register */
>>>
>>>   #define         BPF_JA          0x00
>>> -#define         BPF_JEQ         0x10
>>> -#define         BPF_JGT         0x20
>>> -#define         BPF_JGE         0x30
>>> -#define         BPF_JSET        0x40
>>> +#define         BPF_JEQ         0x10 /* jump == */
>>> +#define         BPF_JGT         0x20 /* GT is unsigned '>', JA in x86 */
>>> +#define         BPF_JGE         0x30 /* GE is unsigned '>=', JAE in x86
>>> */
>>> +#define         BPF_JSET        0x40 /* if (A & X) */
>>> +#define         BPF_JNE         0x50 /* jump != */
>>> +#define         BPF_JSGT        0x60 /* SGT is signed '>', GT in x86 */
>>> +#define         BPF_JSGE        0x70 /* SGE is signed '>=', GE in x86 */
>>> +#define         BPF_CALL        0x80 /* function call */
>>>   #define BPF_SRC(code)   ((code) & 0x08)
>>>   #define         BPF_K           0x00
>>>   #define         BPF_X           0x08
>>> @@ -134,5 +158,4 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>>   #define SKF_NET_OFF   (-0x100000)
>>>   #define SKF_LL_OFF    (-0x200000)
>>>
>>> -
>>>   #endif /* _UAPI__LINUX_FILTER_H__ */
>>> diff --git a/net/core/Makefile b/net/core/Makefile
>>> index 9628c20acff6..e622b97f58dc 100644
>>> --- a/net/core/Makefile
>>> +++ b/net/core/Makefile
>>> @@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o
>>> stream.o scm.o \
>>>   obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
>>>
>>>   obj-y              += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o
>>> \
>>> -                       neighbour.o rtnetlink.o utils.o link_watch.o
>>> filter.o \
>>> +                       neighbour.o rtnetlink.o utils.o link_watch.o
>>> filter.o bpf_run.o \
>>>                         sock_diag.o dev_ioctl.o
>>>
>>>   obj-$(CONFIG_XFRM) += flow.o
>>> diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
>>> new file mode 100644
>>> index 000000000000..fa1862fcbc74
>>> --- /dev/null
>>> +++ b/net/core/bpf_run.c
>>> @@ -0,0 +1,766 @@
>>> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
>>> + *
>>> + * This program is free software; you can redistribute it and/or
>>> + * modify it under the terms of version 2 of the GNU General Public
>>> + * License as published by the Free Software Foundation.
>>> + *
>>> + * This program is distributed in the hope that it will be useful, but
>>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>>> + * General Public License for more details.
>>> + */
>>> +#include <linux/kernel.h>
>>> +#include <linux/types.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/uaccess.h>
>>> +#include <linux/filter.h>
>>> +#include <linux/skbuff.h>
>>> +#include <asm/unaligned.h>
>>> +
>>> +int bpf64_enable __read_mostly;
>>> +
>>> +void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int
>>> k,
>>> +                                          unsigned int size);
>>> +
>>> +static inline void *load_pointer(const struct sk_buff *skb, int k,
>>> +                                unsigned int size, void *buffer)
>>> +{
>>> +       if (k >= 0)
>>> +               return skb_header_pointer(skb, k, size, buffer);
>>> +       return bpf_internal_load_pointer_neg_helper(skb, k, size);
>>> +}
>>> +
>>> +static const char *const bpf_class_string[] = {
>>> +       "ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
>>> +};
>>> +
>>> +static const char *const bpf_alu_string[] = {
>>> +       "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
>>> +       "%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
>>> +};
>>> +
>>> +static const char *const bpf_ldst_string[] = {
>>> +       "u32", "u16", "u8", "u64"
>>> +};
>>> +
>>> +static const char *const bpf_jmp_string[] = {
>>> +       "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
>>> +};
>>> +
>>> +static const char *reg_to_str(int regno, u64 *regs)
>>> +{
>>> +       static char reg_value[16][32];
>>> +       if (!regs)
>>> +               return "";
>>> +       snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
>>> +                regs[regno]);
>>> +       return reg_value[regno];
>>> +}
>>> +
>>> +#define R(regno) reg_to_str(regno, regs)
>>> +
>>> +void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
>>> +{
>>> +       u16 class = BPF_CLASS(insn->code);
>>> +       if (class == BPF_ALU || class == BPF_ALU64) {
>>> +               if (BPF_SRC(insn->code) == BPF_X)
>>> +                       pr_info("code_%02x %sr%d%s %s r%d%s\n",
>>> +                               insn->code, class == BPF_ALU ? "(u32)" :
>>> "",
>>> +                               insn->a_reg, R(insn->a_reg),
>>> +                               bpf_alu_string[BPF_OP(insn->code) >> 4],
>>> +                               insn->x_reg, R(insn->x_reg));
>>> +               else
>>> +                       pr_info("code_%02x %sr%d%s %s %d\n",
>>> +                               insn->code, class == BPF_ALU ? "(u32)" :
>>> "",
>>> +                               insn->a_reg, R(insn->a_reg),
>>> +                               bpf_alu_string[BPF_OP(insn->code) >> 4],
>>> +                               insn->imm);
>>> +       } else if (class == BPF_STX) {
>>> +               if (BPF_MODE(insn->code) == BPF_MEM)
>>> +                       pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
>>> +                               insn->code,
>>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> +                               insn->a_reg, R(insn->a_reg),
>>> +                               insn->off, insn->x_reg, R(insn->x_reg));
>>> +               else if (BPF_MODE(insn->code) == BPF_XADD)
>>> +                       pr_info("code_%02x lock *(%s *)(r%d%s %+d) +=
>>> r%d%s\n",
>>> +                               insn->code,
>>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> +                               insn->a_reg, R(insn->a_reg), insn->off,
>>> +                               insn->x_reg, R(insn->x_reg));
>>> +               else
>>> +                       pr_info("BUG_%02x\n", insn->code);
>>> +       } else if (class == BPF_ST) {
>>> +               if (BPF_MODE(insn->code) != BPF_MEM) {
>>> +                       pr_info("BUG_st_%02x\n", insn->code);
>>> +                       return;
>>> +               }
>>> +               pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
>>> +                       insn->code,
>>> +                       bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
>>> +                       insn->a_reg, R(insn->a_reg),
>>> +                       insn->off, insn->imm);
>>> +       } else if (class == BPF_LDX) {
>>> +               if (BPF_MODE(insn->code) == BPF_MEM) {
>>> +                       pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
>>> +                               insn->code, insn->a_reg,
>>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> +                               insn->x_reg, R(insn->x_reg), insn->off);
>>> +               } else {
>>> +                       pr_info("BUG_ldx_%02x\n", insn->code);
>>> +               }
>>> +       } else if (class == BPF_LD) {
>>> +               if (BPF_MODE(insn->code) == BPF_ABS) {
>>> +                       pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
>>> +                               insn->code, insn->a_reg,
>>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> +                               insn->imm);
>>> +               } else if (BPF_MODE(insn->code) == BPF_IND) {
>>> +                       pr_info("code_%02x r%d = *(%s *)(skb + r%d%s
>>> %+d)\n",
>>> +                               insn->code, insn->a_reg,
>>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> +                               insn->x_reg, R(insn->x_reg), insn->imm);
>>> +               } else {
>>> +                       pr_info("BUG_ld_%02x\n", insn->code);
>>> +               }
>>> +       } else if (class == BPF_JMP) {
>>> +               u16 opcode = BPF_OP(insn->code);
>>> +               if (opcode == BPF_CALL) {
>>> +                       pr_info("code_%02x call %d\n", insn->code,
>>> insn->imm);
>>> +               } else if (insn->code == (BPF_JMP | BPF_JA)) {
>>> +                       pr_info("code_%02x goto pc%+d\n",
>>> +                               insn->code, insn->off);
>>> +               } else if (BPF_SRC(insn->code) == BPF_X) {
>>> +                       pr_info("code_%02x if r%d%s %s r%d%s goto
>>> pc%+d\n",
>>> +                               insn->code, insn->a_reg, R(insn->a_reg),
>>> +                               bpf_jmp_string[BPF_OP(insn->code) >> 4],
>>> +                               insn->x_reg, R(insn->x_reg), insn->off);
>>> +               } else {
>>> +                       pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
>>> +                               insn->code, insn->a_reg, R(insn->a_reg),
>>> +                               bpf_jmp_string[BPF_OP(insn->code) >> 4],
>>> +                               insn->imm, insn->off);
>>> +               }
>>> +       } else {
>>> +               pr_info("code_%02x %s\n", insn->code,
>>> bpf_class_string[class]);
>>> +       }
>>> +}
>>> +EXPORT_SYMBOL(pr_info_bpf_insn);
>>
>>
>> Why would that need to be exported as a symbol?
>
> The performance numbers I mentioned are from bpf_bench, which is built
> as a kernel module, so I used this for debugging from it,
> and also to see what execution traces I get while running the trinity bpf fuzzer :)
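
Roughly, the measurement loop is along these lines (a sketch only; 'fp' and
'skb' are set up elsewhere in the module and none of the names below come
from the patch):

	/* hypothetical micro-benchmark: average cost of one filter run */
	u64 start = ktime_to_ns(ktime_get());
	u32 res = 0;
	int i;

	for (i = 0; i < 1000000; i++)
		res |= SK_RUN_FILTER(fp, skb);
	pr_info("res %u, avg %llu ns/call\n", res,
		(ktime_to_ns(ktime_get()) - start) / 1000000);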
>
>> I would actually like to avoid having this pr_info* inside the kernel.
>> Couldn't this be done e.g. through systemtap script that could e.g. be
>> placed under tools/net/ or inside the documentation file?
>
> like the idea!
> Will drop it from the diff and eventually will move it to tools/net.
>
>>> +/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style
>>> (BPF64)
>>> + *
>>> + * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate
>>> new
>>> + * program length in one pass
>>> + *
>>> + * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
>>> + *
>>> + * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
>>> + * to remap in two passes: 1st pass finds new jump offsets, 2nd pass
>>> remaps
>>> + */
>>
>>
>> Would be nice to have the API comment it in kdoc format.
>
> good point. will do.
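
Something along these lines, in kernel-doc style (the wording is only a
placeholder, and the sk_convert_filter name assumes the rename discussed
just below):

	/**
	 *	sk_convert_filter - convert an old-style filter program
	 *	@old_prog: classic BPF program
	 *	@len: number of instructions in @old_prog
	 *	@new_prog: buffer for the converted program, or NULL to only
	 *		   compute the required length
	 *	@p_new_len: in/out length of the converted program
	 *
	 *	First call with @new_prog == NULL to calculate the new length,
	 *	then allocate and call again to do the actual remapping.
	 */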
>
>>> +int bpf_convert(struct sock_filter *old_prog, int len,
>>> +               struct bpf_insn *new_prog, int *p_new_len)
>>> +{
>>
>>
>> If we place it into net/core/filter.c, it would be nice to keep naming
>> conventions consistent, e.g. sk_convert_filter() or so.
>
> ok.
>
>>> +       struct bpf_insn *new_insn;
>>> +       struct sock_filter *fp;
>>> +       int *addrs = NULL;
>>> +       int new_len = 0;
>>> +       int pass = 0;
>>> +       int tgt, i;
>>> +
>>> +       if (len <= 0 || len >= BPF_MAXINSNS)
>>> +               return -EINVAL;
>>> +
>>> +       if (new_prog) {
>>> +               addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
>>> +               if (!addrs)
>>> +                       return -ENOMEM;
>>> +       }
>>> +
>>> +do_pass:
>>> +       new_insn = new_prog;
>>> +       fp = old_prog;
>>> +       for (i = 0; i < len; fp++, i++) {
>>> +               struct bpf_insn tmp_insns[3] = {};
>>> +               struct bpf_insn *insn = tmp_insns;
>>> +
>>> +               if (addrs)
>>> +                       addrs[i] = new_insn - new_prog;
>>> +
>>> +               switch (fp->code) {
>>> +               /* all arithmetic insns and skb loads map as-is */
>>> +               case BPF_ALU | BPF_ADD | BPF_X:
>>> +               case BPF_ALU | BPF_ADD | BPF_K:
>>> +               case BPF_ALU | BPF_SUB | BPF_X:
>>> +               case BPF_ALU | BPF_SUB | BPF_K:
>>> +               case BPF_ALU | BPF_AND | BPF_X:
>>> +               case BPF_ALU | BPF_AND | BPF_K:
>>> +               case BPF_ALU | BPF_OR | BPF_X:
>>> +               case BPF_ALU | BPF_OR | BPF_K:
>>> +               case BPF_ALU | BPF_LSH | BPF_X:
>>> +               case BPF_ALU | BPF_LSH | BPF_K:
>>> +               case BPF_ALU | BPF_RSH | BPF_X:
>>> +               case BPF_ALU | BPF_RSH | BPF_K:
>>> +               case BPF_ALU | BPF_XOR | BPF_X:
>>> +               case BPF_ALU | BPF_XOR | BPF_K:
>>> +               case BPF_ALU | BPF_MUL | BPF_X:
>>> +               case BPF_ALU | BPF_MUL | BPF_K:
>>> +               case BPF_ALU | BPF_DIV | BPF_X:
>>> +               case BPF_ALU | BPF_DIV | BPF_K:
>>> +               case BPF_ALU | BPF_MOD | BPF_X:
>>> +               case BPF_ALU | BPF_MOD | BPF_K:
>>> +               case BPF_ALU | BPF_NEG:
>>> +               case BPF_LD | BPF_ABS | BPF_W:
>>> +               case BPF_LD | BPF_ABS | BPF_H:
>>> +               case BPF_LD | BPF_ABS | BPF_B:
>>> +               case BPF_LD | BPF_IND | BPF_W:
>>> +               case BPF_LD | BPF_IND | BPF_H:
>>> +               case BPF_LD | BPF_IND | BPF_B:
>>
>>
>> I think here and elsewhere for similar constructions, we could use
>> BPF_S_* helpers that was introduced by Hagen in commit 01f2f3f6ef4d076
>> ("net: optimize Berkeley Packet Filter (BPF) processing").
>
> Well, old bpf opcodes were just unlucky to be on the wrong side of
> compiler heuristics at that time.
> Instead of doing remapping while loading and the opposite remapping in
> sk_get_filter(), the problem could have been solved with a few dummy
> 'case' statements. That would have added only a few lines of code instead
> of a hundred lines of mapping back and forth.
> Also I suspect commit 01f2f3f6ef4d076 was valid only when the whole kernel
> is compiled with -Os. With -O2 GCC should have done it right.
> The heuristic is: number_of_case_stmts * ratio < range_of_case_values,
> and the ratio is 3 for -Os and 10 for -O2.
> Old bpf has ~70 stmts and a ~200 range.
> See expand_switch_as_decision_tree_p() in gcc/stmt.c.
> Eventually, when the current interpreter is retired, I would like to remove
> the remapping as well.
> Especially for sk_convert_filter() I would like to keep the normal BPF_
> values from uapi/filter.h, since I reuse the majority of them as-is and
> remapping would have added unnecessary code.
>
>>> +                       insn->code = fp->code;
>>> +                       insn->a_reg = 6;
>>> +                       insn->x_reg = 7;
>>> +                       insn->imm = fp->k;
>>> +                       break;
>>> +
>>> +               /* jump opcodes map as-is, but offsets need adjustment */
>>> +               case BPF_JMP | BPF_JA:
>>> +                       tgt = i + fp->k + 1;
>>> +                       insn->code = fp->code;
>>> +#define EMIT_JMP \
>>> +       do { \
>>> +               if (tgt >= len || tgt < 0) \
>>> +                       goto err; \
>>> +               insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
>>> +       } while (0)
>>> +
>>> +                       EMIT_JMP;
>>> +                       break;
>>> +
>>> +               case BPF_JMP | BPF_JEQ | BPF_K:
>>> +               case BPF_JMP | BPF_JEQ | BPF_X:
>>> +               case BPF_JMP | BPF_JSET | BPF_K:
>>> +               case BPF_JMP | BPF_JSET | BPF_X:
>>> +               case BPF_JMP | BPF_JGT | BPF_K:
>>> +               case BPF_JMP | BPF_JGT | BPF_X:
>>> +               case BPF_JMP | BPF_JGE | BPF_K:
>>> +               case BPF_JMP | BPF_JGE | BPF_X:
>>> +                       insn->a_reg = 6;
>>> +                       insn->x_reg = 7;
>>> +                       insn->imm = fp->k;
>>> +                       /* common case where 'jump_false' is next insn */
>>> +                       if (fp->jf == 0) {
>>> +                               insn->code = fp->code;
>>> +                               tgt = i + fp->jt + 1;
>>> +                               EMIT_JMP;
>>> +                               break;
>>> +                       }
>>> +                       /* convert JEQ into JNE when 'jump_true' is next
>>> insn */
>>> +                       if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
>>> +                               insn->code = BPF_JMP | BPF_JNE |
>>> +                                       BPF_SRC(fp->code);
>>> +                               tgt = i + fp->jf + 1;
>>> +                               EMIT_JMP;
>>> +                               break;
>>> +                       }
>>> +                       /* other jumps are mapped into two insns: Jxx and
>>> JA */
>>> +                       tgt = i + fp->jt + 1;
>>> +                       insn->code = fp->code;
>>> +                       EMIT_JMP;
>>> +
>>> +                       insn++;
>>> +                       insn->code = BPF_JMP | BPF_JA;
>>> +                       tgt = i + fp->jf + 1;
>>> +                       EMIT_JMP;
>>> +                       /* adjust pc relative offset, since it's a 2nd
>>> insn */
>>> +                       insn->off--;
>>> +                       break;
>>> +
>>> +                       /* ldxb 4*([14]&0xf) is remaped into 3 insns */
>>> +               case BPF_LDX | BPF_MSH | BPF_B:
>>> +                       insn->code = BPF_LD | BPF_ABS | BPF_B;
>>> +                       insn->a_reg = 7;
>>> +                       insn->imm = fp->k;
>>> +
>>> +                       insn++;
>>> +                       insn->code = BPF_ALU | BPF_AND | BPF_K;
>>> +                       insn->a_reg = 7;
>>> +                       insn->imm = 0xf;
>>> +
>>> +                       insn++;
>>> +                       insn->code = BPF_ALU | BPF_LSH | BPF_K;
>>> +                       insn->a_reg = 7;
>>> +                       insn->imm = 2;
>>> +                       break;
>>> +
>>> +                       /* RET_K, RET_A are remaped into 2 insns */
>>> +               case BPF_RET | BPF_A:
>>> +               case BPF_RET | BPF_K:
>>> +                       insn->code = BPF_ALU | BPF_MOV |
>>> +                               (BPF_SRC(fp->code) == BPF_K ? BPF_K :
>>> BPF_X);
>>> +                       insn->a_reg = 0;
>>> +                       insn->x_reg = 6;
>>> +                       insn->imm = fp->k;
>>> +
>>> +                       insn++;
>>> +                       insn->code = BPF_RET | BPF_K;
>>> +                       break;
>>> +
>>> +                       /* store to stack */
>>> +               case BPF_ST:
>>> +               case BPF_STX:
>>> +                       insn->code = BPF_STX | BPF_MEM | BPF_W;
>>> +                       insn->a_reg = 10;
>>> +                       insn->x_reg = fp->code == BPF_ST ? 6 : 7;
>>> +                       insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>>> +                       break;
>>> +
>>> +                       /* load from stack */
>>> +               case BPF_LD | BPF_MEM:
>>> +               case BPF_LDX | BPF_MEM:
>>> +                       insn->code = BPF_LDX | BPF_MEM | BPF_W;
>>> +                       insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>>> 7;
>>> +                       insn->x_reg = 10;
>>> +                       insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>>> +                       break;
>>> +
>>> +                       /* A = K or X = K */
>>> +               case BPF_LD | BPF_IMM:
>>> +               case BPF_LDX | BPF_IMM:
>>> +                       insn->code = BPF_ALU | BPF_MOV | BPF_K;
>>> +                       insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>>> 7;
>>> +                       insn->imm = fp->k;
>>> +                       break;
>>> +
>>> +                       /* X = A */
>>> +               case BPF_MISC | BPF_TAX:
>>> +                       insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>>> +                       insn->a_reg = 7;
>>> +                       insn->x_reg = 6;
>>> +                       break;
>>> +
>>> +                       /* A = X */
>>> +               case BPF_MISC | BPF_TXA:
>>> +                       insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>>> +                       insn->a_reg = 6;
>>> +                       insn->x_reg = 7;
>>> +                       break;
>>> +
>>> +                       /* A = skb->len or X = skb->len */
>>> +               case BPF_LD|BPF_W|BPF_LEN:
>>> +               case BPF_LDX|BPF_W|BPF_LEN:
>>> +                       insn->code = BPF_LDX | BPF_MEM | BPF_W;
>>> +                       insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>>> 7;
>>> +                       insn->x_reg = 1;
>>> +                       insn->off = offsetof(struct sk_buff, len);
>>> +                       break;
>>> +
>>> +               default:
>>> +                       /* pr_err("unknown opcode %02x\n", fp->code); */
>>> +                       goto err;
>>> +               }
>>> +
>>> +               insn++;
>>> +               if (new_prog) {
>>> +                       memcpy(new_insn, tmp_insns,
>>> +                              sizeof(*insn) * (insn - tmp_insns));
>>> +               }
>>> +               new_insn += insn - tmp_insns;
>>> +       }
>>> +
>>> +       if (!new_prog) {
>>> +               /* only calculating new length */
>>> +               *p_new_len = new_insn - new_prog;
>>> +               return 0;
>>> +       }
>>> +
>>> +       pass++;
>>> +       if (new_len != new_insn - new_prog) {
>>> +               new_len = new_insn - new_prog;
>>> +               if (pass > 2)
>>> +                       goto err;
>>> +               goto do_pass;
>>> +       }
>>> +       kfree(addrs);
>>> +       if (*p_new_len != new_len)
>>> +               /* inconsistent new program length */
>>> +               pr_err("bpf_convert() usage error\n");
>>> +       return 0;
>>> +err:
>>> +       kfree(addrs);
>>> +       return -EINVAL;
>>> +}
>>> +
>>> +notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
>>> +{
>>
>>
>> Similarly, sk_run_filter_ext()?
>
> ok.
>
>>> +       u64 stack[64];
>>> +       u64 regs[16];
>>> +       void *ptr;
>>> +       u64 tmp;
>>> +       int off;
>>> +
>>> +#ifdef __x86_64
>>> +#define LOAD_IMM /**/
>>> +#define K insn->imm
>>> +#else
>>> +#define LOAD_IMM (K = insn->imm)
>>> +       s32 K = insn->imm;
>>> +#endif
>>> +
>>> +#ifdef DEBUG
>>> +#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
>>> +#else
>>> +#define DEBUG_INSN
>>> +#endif
>>
>>
>> This DEBUG hunk could then be removed when we use a stap script
>> instead, for example.
>
> ok
>
>>> +#define A regs[insn->a_reg]
>>> +#define X regs[insn->x_reg]
>>> +
>>> +#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
>>> +#define L(A, B, C) L_##A##B##C:
>>
>>
>> Could we get rid of these two defines? I know you're defining
>
> I think it's actually more readable this way, with less chance of typos.
> DL - define label
> L - label
> Will try to remove the 2nd macro to see how it looks...
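
For one opcode, the two macros expand to the following (taken straight from
the definitions above; shown only to illustrate the label naming):

	/* DL(BPF_ALU64, BPF_MOV, BPF_X) inside the jump table becomes: */
	[BPF_ALU64|BPF_MOV|BPF_X] = &&L_BPF_ALU64BPF_MOVBPF_X,

	/* and L(BPF_ALU64, BPF_MOV, BPF_X) in front of the handler becomes: */
	L_BPF_ALU64BPF_MOVBPF_X: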
>
>> and using that as labels, but it's not obvious at first what
>> 'jt' does. Maybe 'jt_labels' ?
>
> ok will rename.
>
>>> +#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>>> +#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>>
>>
>> Not a big fan of macros, but here it seems fine though.
>>
>>
>>> +/* some compilers may need help:
>>> + * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code];
>>> })
>>> + */
>>> +
>>> +       static const void *jt[256] = {
>>> +               [0 ... 255] = &&L_default,
>>
>>
>> Nit: please just one define per line below:
>
> ok.
>
>>> +               DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
>>> +               DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
>>> +               DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
>>> +               DL(BPF_ALU, BPF_OR, BPF_X)  DL(BPF_ALU, BPF_OR, BPF_K)
>>> +               DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
>>> +               DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
>>> +               DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
>>> +               DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
>>> +               DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
>>> +               DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
>>> +               DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
>>> +               DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_OR, BPF_X)  DL(BPF_ALU64, BPF_OR, BPF_K)
>>> +               DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
>>> +               DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
>>> +               DL(BPF_ALU, BPF_NEG, 0)
>>> +               DL(BPF_JMP, BPF_CALL, 0)
>>> +               DL(BPF_JMP, BPF_JA, 0)
>>> +               DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
>>> +               DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
>>> +               DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
>>> +               DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
>>> +               DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
>>> +               DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
>>> +               DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
>>> +               DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
>>> +               DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
>>> +               DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
>>> +               DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
>>> +               DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
>>> +               DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
>>> +               DL(BPF_STX, BPF_XADD, BPF_W)
>>> +#ifdef CONFIG_64BIT
>>> +               DL(BPF_STX, BPF_XADD, BPF_DW)
>>> +#endif
>>> +               DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
>>> +               DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
>>> +               DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
>>> +               DL(BPF_RET, BPF_K, 0)
>>> +       };
>>> +
>>> +       regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
>>> +       regs[1/* BPF R1 */] = (u64)(ulong)ctx;
>>> +
>>> +       DEBUG_INSN;
>>> +       /* execute 1st insn */
>>> +select_insn:
>>> +       goto *jt[insn->code];
>>> +
>>> +               /* ALU */
>>> +#define ALU(OPCODE, OP) \
>>> +       L_BPF_ALU64##OPCODE##BPF_X: \
>>> +               A = A OP X; \
>>> +               CONT; \
>>> +       L_BPF_ALU##OPCODE##BPF_X: \
>>> +               A = (u32)A OP (u32)X; \
>>> +               CONT; \
>>> +       L_BPF_ALU64##OPCODE##BPF_K: \
>>> +               A = A OP K; \
>>> +               CONT; \
>>> +       L_BPF_ALU##OPCODE##BPF_K: \
>>> +               A = (u32)A OP (u32)K; \
>>> +               CONT;
>>> +
>>> +       ALU(BPF_ADD, +)
>>> +       ALU(BPF_SUB, -)
>>> +       ALU(BPF_AND, &)
>>> +       ALU(BPF_OR, |)
>>> +       ALU(BPF_LSH, <<)
>>> +       ALU(BPF_RSH, >>)
>>> +       ALU(BPF_XOR, ^)
>>> +       ALU(BPF_MUL, *)
>>> +
>>> +       L(BPF_ALU, BPF_NEG, 0)
>>> +               A = (u32)-A;
>>> +               CONT;
>>> +       L(BPF_ALU, BPF_MOV, BPF_X)
>>> +               A = (u32)X;
>>> +               CONT;
>>> +       L(BPF_ALU, BPF_MOV, BPF_K)
>>> +               A = (u32)K;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_MOV, BPF_X)
>>> +               A = X;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_MOV, BPF_K)
>>> +               A = K;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_ARSH, BPF_X)
>>> +               (*(s64 *) &A) >>= X;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_ARSH, BPF_K)
>>> +               (*(s64 *) &A) >>= K;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_MOD, BPF_X)
>>> +               tmp = A;
>>> +               if (X)
>>> +                       A = do_div(tmp, X);
>>> +               CONT;
>>> +       L(BPF_ALU, BPF_MOD, BPF_X)
>>> +               tmp = (u32)A;
>>> +               if (X)
>>> +                       A = do_div(tmp, (u32)X);
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_MOD, BPF_K)
>>> +               tmp = A;
>>> +               if (K)
>>> +                       A = do_div(tmp, K);
>>> +               CONT;
>>> +       L(BPF_ALU, BPF_MOD, BPF_K)
>>> +               tmp = (u32)A;
>>> +               if (K)
>>> +                       A = do_div(tmp, (u32)K);
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_DIV, BPF_X)
>>> +               if (X)
>>> +                       do_div(A, X);
>>> +               CONT;
>>> +       L(BPF_ALU, BPF_DIV, BPF_X)
>>> +               tmp = (u32)A;
>>> +               if (X)
>>> +                       do_div(tmp, (u32)X);
>>> +               A = (u32)tmp;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_DIV, BPF_K)
>>> +               if (K)
>>> +                       do_div(A, K);
>>> +               CONT;
>>> +       L(BPF_ALU, BPF_DIV, BPF_K)
>>> +               tmp = (u32)A;
>>> +               if (K)
>>> +                       do_div(tmp, (u32)K);
>>> +               A = (u32)tmp;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_BSWAP32, BPF_X)
>>> +               A = swab32(A);
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_BSWAP64, BPF_X)
>>> +               A = swab64(A);
>>> +               CONT;
>>> +
>>> +               /* CALL */
>>> +       L(BPF_JMP, BPF_CALL, 0)
>>> +               /* TODO execute_func(K, regs); */
>>> +               CONT;
>>
>>
>> Would the filter checker for now just return -EINVAL when this is used?
>
> sure.
> I was planning to address that in a week or so, but ok.
> Will remove 'call' insn in the first patch and will re-add in a clean
> way in a next diff.
>
>>> +               /* JMP */
>>> +       L(BPF_JMP, BPF_JA, 0)
>>> +               insn += insn->off;
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JEQ, BPF_X)
>>> +               if (A == X) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JEQ, BPF_K)
>>> +               if (A == K) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JNE, BPF_X)
>>> +               if (A != X) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JNE, BPF_K)
>>> +               if (A != K) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JGT, BPF_X)
>>> +               if (A > X) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JGT, BPF_K)
>>> +               if (A > K) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JGE, BPF_X)
>>> +               if (A >= X) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JGE, BPF_K)
>>> +               if (A >= K) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JSGT, BPF_X)
>>> +               if (((s64)A) > ((s64)X)) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JSGT, BPF_K)
>>> +               if (((s64)A) > ((s64)K)) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JSGE, BPF_X)
>>> +               if (((s64)A) >= ((s64)X)) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JSGE, BPF_K)
>>> +               if (((s64)A) >= ((s64)K)) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JSET, BPF_X)
>>> +               if (A & X) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JSET, BPF_K)
>>> +               if (A & (u32)K) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>
>>
>> Okay, so right now we only support forward jumps, right? And these
>> are checked by the old checker code I'd assume?
>
> yes. Just like the old interpreter, the new interpreter doesn't care; it just jumps.
> And you're correct: the old checker code allows only forward jumps.
> That restriction is preserved by sk_convert_filter() as well.
>
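
For context, a small illustration of why classic BPF jumps can only go
forward: in struct sock_filter the jt/jf targets are unsigned 8-bit
offsets taken relative to the next instruction, so a backward target
cannot even be encoded (illustrative snippet, not from the patch):

  #include <linux/filter.h>

  /* jt/jf are u8 offsets past the following insn: forward only */
  struct sock_filter jeq_example =
          BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x86dd /* ETH_P_IPV6 */,
                   0 /* jt: fall through to next insn */,
                   1 /* jf: skip one insn forward     */);
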
>>> +               /* STX and ST and LDX*/
>>> +#define LDST(SIZEOP, SIZE) \
>>> +       L_BPF_STXBPF_MEM##SIZEOP: \
>>> +               *(SIZE *)(ulong)(A + insn->off) = X; \
>>> +               CONT; \
>>> +       L_BPF_STBPF_MEM##SIZEOP: \
>>> +               *(SIZE *)(ulong)(A + insn->off) = K; \
>>> +               CONT; \
>>> +       L_BPF_LDXBPF_MEM##SIZEOP: \
>>> +               A = *(SIZE *)(ulong)(X + insn->off); \
>>> +               CONT;
>>> +
>>> +               LDST(BPF_B, u8)
>>> +               LDST(BPF_H, u16)
>>> +               LDST(BPF_W, u32)
>>> +               LDST(BPF_DW, u64)
>>> +
>>> +               /* STX XADD */
>>> +       L(BPF_STX, BPF_XADD, BPF_W)
>>> +               atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
>>> +               CONT;
>>> +#ifdef CONFIG_64BIT
>>> +       L(BPF_STX, BPF_XADD, BPF_DW)
>>> +               atomic64_add((u64)X, (atomic64_t *)(ulong)(A + insn->off));
>>> +               CONT;
>>> +#endif
>>> +
>>> +               /* LD from SKB + K */
>>
>>
>> Nit: indent one tab too much (here and elsewhere)
>
> ahh. ok.
>
>>> +       L(BPF_LD, BPF_ABS, BPF_W)
>>> +               off = K;
>>> +load_word:
>>> +               ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
>>> +               if (likely(ptr != NULL)) {
>>> +                       A = get_unaligned_be32(ptr);
>>> +                       CONT;
>>> +               }
>>> +               return 0;
>>> +
>>> +       L(BPF_LD, BPF_ABS, BPF_H)
>>> +               off = K;
>>> +load_half:
>>> +               ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
>>> +               if (likely(ptr != NULL)) {
>>> +                       A = get_unaligned_be16(ptr);
>>> +                       CONT;
>>> +               }
>>> +               return 0;
>>> +
>>> +       L(BPF_LD, BPF_ABS, BPF_B)
>>> +               off = K;
>>> +load_byte:
>>> +               ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
>>> +               if (likely(ptr != NULL)) {
>>> +                       A = *(u8 *)ptr;
>>> +                       CONT;
>>> +               }
>>> +               return 0;
>>> +
>>> +               /* LD from SKB + X + K */
>>
>>
>> Nit: ditto
>
> ok
>
>>> +       L(BPF_LD, BPF_IND, BPF_W)
>>> +               off = K + X;
>>> +               goto load_word;
>>> +
>>> +       L(BPF_LD, BPF_IND, BPF_H)
>>> +               off = K + X;
>>> +               goto load_half;
>>> +
>>> +       L(BPF_LD, BPF_IND, BPF_B)
>>> +               off = K + X;
>>> +               goto load_byte;
>>> +
>>> +               /* RET */
>>> +       L(BPF_RET, BPF_K, 0)
>>> +               return regs[0/* R0 */];
>>> +
>>> +       L_default:
>>> +               /* bpf_check() or bpf_convert() will guarantee that
>>> +                * we never reach here
>>> +                */
>>> +               WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
>>> +               return 0;
>>> +#undef DL
>>> +#undef L
>>> +#undef CONT
>>> +#undef A
>>> +#undef X
>>> +#undef K
>>> +#undef LOAD_IMM
>>> +#undef DEBUG_INSN
>>> +#undef ALU
>>> +#undef LDST
>>> +}
>>> +EXPORT_SYMBOL(bpf_run);
>>> +
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index ad30d626a5bd..575bf5fd4335 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
>>>   {
>>>         struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
>>>
>>> +       if ((void *)fp->bpf_func == (void *)bpf_run)
>>> +               /* arch specific jit_free are expecting this value */
>>> +               fp->bpf_func = sk_run_filter;
>>> +
>>>         bpf_jit_free(fp);
>>>   }
>>>   EXPORT_SYMBOL(sk_filter_release_rcu);
>>> @@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
>>>         return 0;
>>>   }
>>>
>>> +static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog *fprog,
>>> +                        struct sock *sk)
>>> +{
>>
>>
>> sk_prepare_filter_ext()?
>
> ok.
>
>>> +       unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
>>> +       struct sock_filter *old_prog;
>>> +       unsigned int sk_fsize;
>>> +       struct sk_filter *fp;
>>> +       int new_len;
>>> +       int err;
>>> +
>>> +       /* store old program into buffer, since chk_filter will remap opcodes */
>>> +       old_prog = kmalloc(fsize, GFP_KERNEL);
>>> +       if (!old_prog)
>>> +               return -ENOMEM;
>>> +
>>> +       if (sk) {
>>> +               if (copy_from_user(old_prog, fprog->filter, fsize)) {
>>> +                       err = -EFAULT;
>>> +                       goto free_prog;
>>> +               }
>>> +       } else {
>>> +               memcpy(old_prog, fprog->filter, fsize);
>>> +       }
>>> +
>>> +       /* calculate bpf64 program length */
>>> +       err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
>>> +       if (err)
>>> +               goto free_prog;
>>> +
>>> +       sk_fsize = sk_filter_size(new_len);
>>> +       /* allocate sk_filter to store bpf64 program */
>>> +       if (sk)
>>> +               fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>>> +       else
>>> +               fp = kmalloc(sk_fsize, GFP_KERNEL);
>>> +       if (!fp) {
>>> +               err = -ENOMEM;
>>> +               goto free_prog;
>>> +       }
>>> +
>>> +       /* remap old insns into bpf64 */
>>> +       err = bpf_convert(old_prog, fprog->len,
>>> +                         (struct bpf_insn *)fp->insns, &new_len);
>>> +       if (err)
>>> +               /* 2nd bpf_convert() can fail only if it fails
>>> +                * to allocate memory, remapping must succeed
>>> +                */
>>> +               goto free_fp;
>>> +
>>> +       /* now chk_filter can overwrite old_prog while checking */
>>> +       err = sk_chk_filter(old_prog, fprog->len);
>>> +       if (err)
>>> +               goto free_fp;
>>> +
>>> +       /* discard old prog */
>>> +       kfree(old_prog);
>>> +
>>> +       atomic_set(&fp->refcnt, 1);
>>> +       fp->len = new_len;
>>> +
>>> +       /* bpf64 insns must be executed by bpf_run */
>>> +       fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
>>> +
>>> +       *pfp = fp;
>>> +       return 0;
>>> +free_fp:
>>> +       if (sk)
>>> +               sock_kfree_s(sk, fp, sk_fsize);
>>> +       else
>>> +               kfree(fp);
>>> +free_prog:
>>> +       kfree(old_prog);
>>> +       return err;
>>> +}
>>> +
>>>   /**
>>>    *    sk_unattached_filter_create - create an unattached filter
>>>    *    @fprog: the filter program
>>> @@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
>>>         if (fprog->filter == NULL)
>>>                 return -EINVAL;
>>>
>>> +       if (bpf64_enable)
>>> +               return bpf64_prepare(pfp, fprog, NULL);
>>> +
>>>         fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
>>>         if (!fp)
>>>                 return -ENOMEM;
>>> @@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>>>         if (fprog->filter == NULL)
>>>                 return -EINVAL;
>>>
>>> -       fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>>> -       if (!fp)
>>> -               return -ENOMEM;
>>> -       if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>>> -               sock_kfree_s(sk, fp, sk_fsize);
>>> -               return -EFAULT;
>>> -       }
>>> +       if (bpf64_enable) {
>>> +               err = bpf64_prepare(&fp, fprog, sk);
>>> +               if (err)
>>> +                       return err;
>>> +       } else {
>>> +               fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>>> +               if (!fp)
>>> +                       return -ENOMEM;
>>> +               if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>>> +                       sock_kfree_s(sk, fp, sk_fsize);
>>> +                       return -EFAULT;
>>> +               }
>>>
>>> -       atomic_set(&fp->refcnt, 1);
>>> -       fp->len = fprog->len;
>>> +               atomic_set(&fp->refcnt, 1);
>>> +               fp->len = fprog->len;
>>>
>>> -       err = __sk_prepare_filter(fp);
>>> -       if (err) {
>>> -               sk_filter_uncharge(sk, fp);
>>> -               return err;
>>> +               err = __sk_prepare_filter(fp);
>>> +               if (err) {
>>> +                       sk_filter_uncharge(sk, fp);
>>> +                       return err;
>>> +               }
>>>         }
>>>
>>>         old_fp = rcu_dereference_protected(sk->sk_filter,
>>> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
>>> index cf9cd13509a7..f03acc0e8950 100644
>>> --- a/net/core/sysctl_net_core.c
>>> +++ b/net/core/sysctl_net_core.c
>>> @@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
>>>         },
>>>   #endif
>>>         {
>>> +               .procname       = "bpf64_enable",
>>> +               .data           = &bpf64_enable,
>>> +               .maxlen         = sizeof(int),
>>> +               .mode           = 0644,
>>> +               .proc_handler   = proc_dointvec
>>> +       },
>>> +       {
>>>                 .procname       = "netdev_tstamp_prequeue",
>>>                 .data           = &netdev_tstamp_prequeue,
>>>                 .maxlen         = sizeof(int),
>>>
>>
>> Hope some of the comments made sense. ;-)
>
> Yes. Indeed. Thanks a lot for the thorough review!
> Will address things and resend V4.
>
> Alexei
>
>> Thanks,
>>
>> Daniel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-02-28 20:53     ` Alexei Starovoitov
  2014-03-01  0:10       ` Alexei Starovoitov
@ 2014-03-01  0:30       ` Daniel Borkmann
  2014-03-03 10:05         ` Hagen Paul Pfeifer
  1 sibling, 1 reply; 8+ messages in thread
From: Daniel Borkmann @ 2014-03-01  0:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Frederic Weisbecker, Arnaldo Carvalho de Melo, Pekka Enberg,
	Arjan van de Ven, Christoph Hellwig, linux-kernel, netdev,
	Hagen Paul Pfeifer, Jesse Gross

On 02/28/2014 09:53 PM, Alexei Starovoitov wrote:
> On Fri, Feb 28, 2014 at 4:45 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
...
>> Did you also test that seccomp-BPF still works out?
>
> yes. Have a prototype, but it needs a bit more cleanup.

Here's [1] some example user-space code for prctl(). The libseccomp
library [2] wraps around that and makes usage easier.

   [1] http://outflux.net/teach-seccomp/
   [2] http://lwn.net/Articles/491308/
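
For reference, a minimal user-space sketch of installing a trivial
allow-everything seccomp-BPF filter via prctl() (error handling trimmed;
see [1] for a complete example that actually restricts syscalls):

  #include <stdio.h>
  #include <sys/prctl.h>
  #include <linux/filter.h>
  #include <linux/seccomp.h>

  int main(void)
  {
          /* trivial policy: allow every syscall; a real filter would
           * inspect seccomp_data->arch and ->nr first */
          struct sock_filter filter[] = {
                  BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
          };
          struct sock_fprog prog = {
                  .len    = sizeof(filter) / sizeof(filter[0]),
                  .filter = filter,
          };

          if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
                  return 1;
          if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
                  return 1;
          printf("seccomp filter installed\n");
          return 0;
  }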

...
>> We should keep naming consistent (so either extended BPF or BPF64),
>> so maybe bpf_ext_enable ? I'd presume rather {bpf,sk_filter}*_ext
>
> agree. we need consistent naming for both (old and new).
> I'll try an all-out rename of bpf_*() into sk_filter64_*() and sk_filter_ext_*()
> to see which one looks better.
> I'm leaning towards sk_filter64, since it's easier to quickly spot
> the difference and it's more descriptive.

Just saw your second email regarding sk_filter_ext() et al, yep, I agree.

>> as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
>> comparisons, you'd still need to load two immediate values, right?
>
> there is no insn that uses a 64-bit immediate, since 64-bit immediates
> are extremely rare. Grepping x86-64 asm code for movabsq returns very few hits.
> llvm or gcc can easily construct any constant by a combination of mov,
> shifts and ors.
> bpf64 comparisons are all 64-bit right now. So far I haven't seen a need for
> 32-bit comparisons, since old bpf is all unsigned, and mapping 32->64 of
> Jxx is painless.

Hm, fair enough, I was just thinking of comparisons of IPv6 addresses
when we do socket filtering. On the other hand, old and new insns are
both 64 bits wide and can be used through the same API then.
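
As an illustration (not from the patch), a 64-bit constant can be
materialized from two 32-bit immediates with a mov/shift/or sequence;
in C terms, roughly:

  /* sketch of the insn sequence a compiler could emit:
   *   mov32 r1, hi ; lsh r1, 32 ; mov32 r2, lo ; or r1, r2
   */
  static inline unsigned long long build_imm64(unsigned int hi,
                                               unsigned int lo)
  {
          unsigned long long r = hi;

          r <<= 32;
          r |= lo;
          return r;
  }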

> Notice I added signed comparisons, since real-life programs cannot do
> without them.
> I also kept the spirit of old bpf in having > and >= only (no < and <=).
> That made the llvm/gcc backends a bit harder to write, since no real cpu has
> such a limitation.

Hehe, I proposed them once, but for low-level BPF it was just easier to
adjust the jt/jf offsets differently to achieve the same effect.
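
For example (illustrative fragment), a classic BPF "A < 42" test is
written with the available BPF_JGE by inverting the condition and
swapping the jt/jf targets:

  #include <linux/filter.h>

  struct sock_filter lt_via_jge[] = {
          BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 42, 1 /* jt */, 0 /* jf */),
          BPF_STMT(BPF_RET | BPF_K, 0xffff),      /* A <  42: accept */
          BPF_STMT(BPF_RET | BPF_K, 0),           /* A >= 42: drop   */
  };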

> I'm still contemplating whether to add < and <= (both signed and unsigned), which is
> an interesting trade-off: number of instructions vs. complexity of the compiler.
>
>> After all your changes, we will still have the bpf_jit_enable knob
>> intact, right?
>
> Yes.
> In this diff the workflow is the following:
>
> old filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (bpf64) {
>      convert to new
>      sk_chk_filter() - check old bpf
>      use new interpreter
> } else {
>      sk_chk_filter() - check old bpf
>      if (bpf_jit_enable)
>          use old jit
>      else
>          use old interpreter
> }
> Soon I'll add a bpf64 JIT and reuse the same bpf_jit_enable knob for it,
> then add new/old in-band demux into sk_attach_filter(),
> so that the workflow will become:
>
> a filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (new filter prog) {
>      sk_chk_filter64() - check new bpf
>      if (bpf_jit_enable)
>          use new jit
>      else
>          use new interpreter
> } else {
>      if (bpf64_enable) {
>         convert to new
>         sk_chk_filter() - check old bpf
>         if (bpf_jit_enable)
>              use new jit
>         else
>              use new interpreter
>      } else {
>         sk_chk_filter()
>         if (bpf_jit_enable)
>             use old jit
>         else
>             use old interpreter
>      }
> }
> Eventually bpf64_enable can be made the default,
> the last 'else' can be retired, and 'bpf64_enable' removed along with
> the old interpreter.
>
> bpf_jit_enable knob will stay for foreseeable future.

Okay, cool. I think it seems reasonable to keep this knob around anyway.
E.g. for seccomp, some people might argue speed is important, while other,
more security-oriented distros might not want to rely on the JIT and would
therefore trade away some performance.

...
>> Why would that need to be exported as a symbol?
>
> the performance numbers I mentioned are from bpf_bench, which is built
> as a kernel module, so I used this export for debugging from it,
> and also to see what execution traces I get while running the trinity bpf fuzzer :)
>
>> I would actually like to avoid having this pr_info* inside the kernel.
>> Couldn't this be done e.g. through a systemtap script that could be
>> placed under tools/net/ or inside the documentation file?
>
> like the idea!
> Will drop it from the diff and eventually will move it to tools/net.

Sounds great!

Thanks,

Daniel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-03-01  0:30       ` Daniel Borkmann
@ 2014-03-03 10:05         ` Hagen Paul Pfeifer
  2014-03-03 23:04           ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Hagen Paul Pfeifer @ 2014-03-03 10:05 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, David S. Miller, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, linux-kernel, netdev, Jesse Gross

* Daniel Borkmann | 2014-03-01 01:30:00 [+0100]:

>>>as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
>>>comparisons, you'd still need to load to immediate values, right?
>>
>>there is no insn that use 64-bit immediate, since 64-bit immediates
>>are extremely rare. grep x86-64 asm code for movabsq will return very few.
>>llvm or gcc can easily construct any constant by combination of mov,
>>shifts and ors.
>>bpf64 comparisons are all 64-bit right now. So far I didn't see a need to do
>>32-bit comparison, since old bpf is all unsigned, mapping 32->64 of
>>Jxx is painless.
>
>Hm, fair enough, I was just thinking for comparisons of IPv6 addresses
>when we do socket filtering. On the other hand, old and new insns are
>both 64 bit wide and can be used though the same api then.

What about the long-term idea of supporting JITed nftables? A 128-bit immediate
is required - maybe the biggest requirement for nftables support.

Hagen

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-03-03 10:05         ` Hagen Paul Pfeifer
@ 2014-03-03 23:04           ` Alexei Starovoitov
  0 siblings, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2014-03-03 23:04 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Daniel Borkmann, David S. Miller, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, linux-kernel, netdev, Jesse Gross

On Mon, Mar 3, 2014 at 2:05 AM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> * Daniel Borkmann | 2014-03-01 01:30:00 [+0100]:
>
>>>>as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
>>>>comparisons, you'd still need to load to immediate values, right?
>>>
>>>there is no insn that use 64-bit immediate, since 64-bit immediates
>>>are extremely rare. grep x86-64 asm code for movabsq will return very few.
>>>llvm or gcc can easily construct any constant by combination of mov,
>>>shifts and ors.
>>>bpf64 comparisons are all 64-bit right now. So far I didn't see a need to do
>>>32-bit comparison, since old bpf is all unsigned, mapping 32->64 of
>>>Jxx is painless.
>>
>>Hm, fair enough, I was just thinking for comparisons of IPv6 addresses
>>when we do socket filtering. On the other hand, old and new insns are
>>both 64 bit wide and can be used though the same api then.
>
> What about the long term idea to support JITed nftables? A 128 bit immediate
> is required - maybe the biggest requirement for nftable support.

I'm still planning to bring the benefits of the ebpf JIT to nft.
There are different ways to approach it. I'm not ready to debate the details,
since I don't have working code for nft+bpf yet, and code speaks better than words.
But I'm confident that the ebpf instruction set will not need 128-bit extensions.
If something unforeseen is needed, we can always add it.
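
As an illustration of why 128-bit immediates are likely unnecessary
(a sketch, not the actual nft+bpf plan): a 128-bit match, e.g. an IPv6
address, decomposes into two 64-bit comparisons:

  /* match a 128-bit value using two 64-bit compares */
  static inline int match128(unsigned long long hi, unsigned long long lo,
                             unsigned long long want_hi,
                             unsigned long long want_lo)
  {
          return hi == want_hi && lo == want_lo;
  }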

Right now I'm testing ebpf+seccomp.
As a micro-benchmark I took a test from libseccomp and added a dummy
syscall loop. There is a nice speedup; I will post a patch soon.
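
A rough sketch of such a dummy-syscall micro-benchmark (hypothetical,
not the patch referred to above): time a tight loop of cheap syscalls
with and without the seccomp filter attached:

  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  #define ITERS 1000000L

  int main(void)
  {
          struct timespec t0, t1;
          long i;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (i = 0; i < ITERS; i++)
                  syscall(SYS_getppid);   /* cheap syscall hit by the filter */
          clock_gettime(CLOCK_MONOTONIC, &t1);

          printf("%.1f ns/syscall\n",
                 ((t1.tv_sec - t0.tv_sec) * 1e9 +
                  (t1.tv_nsec - t0.tv_nsec)) / ITERS);
          return 0;
  }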

Thanks
Alexei

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-03-03 23:04 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-27  2:38 [PATCH v3 net-next 0/1] bpf32->bpf64 mapper and bpf64 interpreter Alexei Starovoitov
2014-02-27  2:38 ` [PATCH v3 net-next 1/1] " Alexei Starovoitov
2014-02-28 12:45   ` Daniel Borkmann
2014-02-28 20:53     ` Alexei Starovoitov
2014-03-01  0:10       ` Alexei Starovoitov
2014-03-01  0:30       ` Daniel Borkmann
2014-03-03 10:05         ` Hagen Paul Pfeifer
2014-03-03 23:04           ` Alexei Starovoitov
