All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alexei Starovoitov <ast@plumgrid.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>,
	Steven Rostedt <rostedt@goodmis.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>,
	Tom Zanussi <tom.zanussi@linux.intel.com>,
	Jovi Zhangwei <jovi.zhangwei@gmail.com>,
	Eric Dumazet <edumazet@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Arnaldo Carvalho de Melo <acme@infradead.org>,
	Pekka Enberg <penberg@iki.fi>,
	Arjan van de Ven <arjan@infradead.org>,
	Christoph Hellwig <hch@infradead.org>,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
Subject: [RFC PATCH v2 tip 3/7] Extended BPF (64-bit BPF) design document
Date: Wed,  5 Feb 2014 17:10:42 -0800	[thread overview]
Message-ID: <1391649046-4383-4-git-send-email-ast@plumgrid.com> (raw)
In-Reply-To: <1391649046-4383-1-git-send-email-ast@plumgrid.com>

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 Documentation/bpf_jit.txt |  204 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 204 insertions(+)
 create mode 100644 Documentation/bpf_jit.txt

diff --git a/Documentation/bpf_jit.txt b/Documentation/bpf_jit.txt
new file mode 100644
index 0000000..9c70f42
--- /dev/null
+++ b/Documentation/bpf_jit.txt
@@ -0,0 +1,204 @@
+Subject: extended BPF or 64-bit BPF
+
+Q: What is BPF?
+A: Safe dynamically loadable 32-bit program that can access skb->data via
+sk_load_byte/half/word calls or seccomp_data. Can be attached to sockets,
+to netfilter xtables, seccomp. In case of sockets/xtables input is skb.
+In case of seccomp input is struct seccomp_data.
+
+Q: What is extended BPF?
+A: Safe dynamically loadable 64-bit program that can call fixed set
+of kernel functions and takes generic bpf_context as an input.
+BPF program is a glue between kernel functions and bpf_context.
+Different kernel subsystems can define their own set of available functions
+and alter BPF machinery for specific use case.
+
+Example 1:
+when function set is {bpf_load_byte/half/word} and bpf_context=skb
+the extended BPF is equivalent to original BPF (w/o negative offset extensions),
+since any such extended BPF program will only be able to load data from skb
+and interpret it.
+
+Example 2:
+when function set is {empty} and bpf_context=seccomp_data,
+the extended BPF is equivalent to original seccomp BPF with simpler programs
+and can immediately take advantage of extended BPF-JIT.
+(original BPF-JIT doesn't work for seccomp)
+
+Example 3:
+when function set is {bpf_load_xxx + bpf_table_lookup} and bpf_context=skb
+the extended BPF can be used to implement network analytics in tcpdump.
+Like counting all tcp flows through the dev or filtering for specific
+set of IP addresses.
+
+Example 4:
+when function set is {load_xxx + table_lookup + trace_printk} and
+bpf_context=pt_regs, the extended BPF is used to implement systemtap-like
+tracing filters
+
+Extended Instruction Set was designed with these goals:
+- write programs in restricted C and compile into BPF with GCC/LLVM
+- just-in-time map to modern 64-bit CPU with minimal performance overhead
+  over two steps: C -> BPF -> native code
+- guarantee termination and safety of BPF program in kernel
+  with simple algorithm
+
+Writing filters in tcpdump syntax or in systemtap language is difficult.
+Same filter done in C is easier to understand.
+GCC/LLVM-bpf backend is optional.
+Extended BPF can be coded with macroses from bpf.h just like original BPF.
+
+Minimal performance overhead is achieved by having one to one mapping
+between BPF insns and native insns, and one to one mapping between BPF
+registers and native registers on 64-bit CPUs
+
+Extended BPF allows jump forward and backward for two reasons:
+to reduce branch mispredict penalty compiler moves cold basic blocks out of
+fall-through path and to reduce code duplication that would be unavoidable
+if only jump forward was available.
+To guarantee termination simple non-recursive depth-first-search verifies
+that there are no back-edges (no loops in the program), program is a DAG
+with root at the first insn, all branches end at the last RET insn and
+all instructions are reachable.
+(Original BPF actually allows unreachable insns, but that's a bug)
+
+Original BPF has two registers (A and X) and hidden frame pointer.
+Extended BPF has ten registers and read-only frame pointer.
+Since 64-bit CPUs are passing arguments to the functions via registers
+the number of args from BPF program to in-kernel function is restricted to 5
+and one register is used to accept return value from in-kernel function.
+x86_64 passes first 6 arguments in registers.
+aarch64/sparcv9/mips64 have 7-8 registers for arguments.
+x86_64 has 6 callee saved registers.
+aarch64/sparcv9/mips64 have 11 or more callee saved registers.
+
+Therefore extended BPF calling convention is defined as:
+R0 - return value from in-kernel function
+R1-R5 - arguments from BPF program to in-kernel function
+R6-R9 - callee saved registers that in-kernel function will preserve
+R10 - read-only frame pointer to access stack
+
+so that all BPF registers map one to one to HW registers on x86_64,aarch64,etc
+and BPF calling convention maps directly to ABIs used by kernel on 64-bit
+architectures.
+
+R0-R5 are scratch registers and BPF program needs spill/fill them if necessary
+across calls.
+Note that there is only one BPF program == one BPF function and it cannot call
+other BPF functions. It can only call predefined in-kernel functions.
+
+All BPF registers are 64-bit without subregs, which makes JITed x86 code
+less optimal, but matches sparc/mips architectures.
+Adding 32-bit subregs was considered, since JIT can map them to x86 and aarch64
+nicely, but read-modify-write overhead for sparc/mips is not worth the gains.
+
+Original BPF and extended BPF are two operand instructions, which helps
+to do one-to-one mapping between BPF insn and x86 insn during JIT.
+
+Extended BPF doesn't have pre-defined endianness not to favor one
+architecture vs another. Therefore bswap insn was introduced.
+Original BPF doesn't have such insn and does bswap as part of sk_load_word call
+which is often unnecessary if we want to compare the value with the constant.
+Restricted C code might be written differently depending on endianness
+and GCC/LLVM-bpf will take an endianness flag.
+
+32-bit architectures run 64-bit extended BPF programs via interpreter
+
+Q: Why extended BPF is 64-bit? Cannot we live with 32-bit?
+A: On 64-bit architectures, pointers are 64-bit and we want to pass 64-bit
+values in/out kernel functions, so 32-bit BPF registers would require to define
+register-pair ABI, there won't be a direct BPF register to HW register
+mapping and JIT would need to do combine/split/move operations for every
+register in and out of the function, which is complex, bug prone and slow.
+Another reason is counters. To use 64-bit counter BPF program would need to do
+a complex math. Again bug prone and not atomic.
+
+Q: Original BPF is safe, deterministic and kernel can easily prove that.
+   Does extended BPF keep these properties?
+A: Yes. The safety of the program is determined in two steps.
+First step does depth-first-search to disallow loops and other CFG validation.
+Second step starts from the first insn and descends all possible paths.
+It simulates execution of every insn and observes the state change of
+registers and stack.
+At the start of the program the register R1 contains a pointer to bpf_context
+and has type PTR_TO_CTX. If checker sees an insn that does R2=R1, then R2 has
+now type PTR_TO_CTX as well and can be used on right hand side of expression.
+If R1=PTR_TO_CTX and insn is R2=R1+1, then R2=INVALID_PTR and it is readable.
+If register was never written to, it's not readable.
+After kernel function call, R1-R5 are reset to unreadable and R0 has a return
+type of the function. Since R6-R9 are callee saved, their state is preserved
+across the call.
+load/store instructions are allowed only with registers of valid types, which
+are PTR_TO_CTX, PTR_TO_TABLE, PTR_TO_STACK. They are bounds and alginment
+checked.
+
+bpf_context structure is generic. Its contents are defined by specific use case.
+For seccomp it can be seccomp_data and through get_context_access callback
+BPF checker is customized, so that BPF program can only access certain fields
+of bpf_context with specified size and alignment.
+For example, the following insn:
+  BPF_INSN_LD(BPF_W, R0, R6, 8)
+intends to load word from address R6 + 8 and store it into R0
+If R6=PTR_TO_CTX, then get_context_access callback should let the checker know
+that offset 8 of size 4 bytes can be accessed for reading, otherwise the checker
+will reject the program.
+If R6=PTR_TO_STACK, then access should be aligned and be within stack bounds,
+which are hard coded to [-480, 0]. In this example offset is 8, so it will fail
+verification.
+The checker will allow BPF program to read data from stack only after it wrote
+into it.
+Pointer register spill/fill is tracked as well, since four (R6-R9) callee saved
+registers may not be enough for some programs.
+
+Allowed function calls are customized via get_func_proto callback.
+For example:
+  u64 bpf_load_byte(struct bpf_context *ctx, u32 offset);
+function will have the following definition:
+  struct bpf_func_proto proto = {RET_INTEGER, PTR_TO_CTX};
+and BPF checker will verify that bpf_load_byte is always called with first
+argument being a valid pointer to bpf_context. After the call BPF register R0
+will be set to readable state, so that BPF program can access it.
+
+One of the useful functions that can be made available to BPF program
+are bpf_table_lookup/bpf_table_update.
+Using them a tracing filter can collect any type of statistics.
+
+Therefore extended BPF program consists of instructions and tables.
+From BPF program the table is identified by constant table_id
+and access to a table in C looks like:
+elem = bpf_table_lookup(ctx, table_id, key);
+
+BPF checker matches 'table_id' against known tables, verifies that 'key' points
+to stack and table->key_size bytes are initialized.
+From there on bpf_table_lookup() is a normal kernel function. It needs to do
+a lookup by whatever means and return either valid pointer to the element
+or NULL. BPF checker will verify that the program accesses the pointer only
+after comparing it to NULL. That's the meaning of PTR_TO_TABLE_CONDITIONAL and
+PTR_TO_TABLE register types in bpf_check.c
+
+If a kernel subsystem wants to use this BPF framework and decides to implement
+bpf_table_lookup, the checker will guarantee that argument 'ctx' is a valid
+pointer to bpf_context, 'table_id' is valid table_id and table->key_size bytes
+can be read from the pointer 'key'. It's up to implementation to decide how it
+wants to do the lookup and what is the key.
+
+Going back to the example BPF insn:
+  BPF_INSN_LD(BPF_W, R0, R6, 8)
+if R6=PTR_TO_TABLE, then offset and size of access must be within
+[0, table->elem_size] which is determined by constant table_id that was passed
+into bpf_table_lookup call prior to this insn.
+
+Just like original, extended BPF is limited to 4096 insns, which means that any
+program will terminate quickly and will call fixed number of kernel functions.
+Earlier implementation of the checker had a precise calculation of worst case
+number of insns, but it was removed to simplify the code, since the worst number
+is always less then number of insns in a program anyway (because it's a DAG).
+
+Since register/stack state tracking simulates execution of all insns in all
+possible branches, it will explode if not bounded. There are two bounds.
+verifier_state stack is limited to 1k, therefore BPF program cannot have
+more than 1k jump insns.
+Total number of insns to be analyzed is limited to 32k, which means that
+checker will either prove correctness or reject the program in few
+milliseconds on average x86 cpu. Valid programs take microseconds to verify.
+
-- 
1.7.9.5


  parent reply	other threads:[~2014-02-06  1:11 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-02-06  1:10 [RFC PATCH v2 tip 0/7] 64-bit BPF insn set and tracing filters Alexei Starovoitov
2014-02-06  1:10 ` [RFC PATCH v2 tip 1/7] Extended BPF core framework Alexei Starovoitov
2014-02-06  1:10 ` [RFC PATCH v2 tip 2/7] Extended BPF JIT for x86-64 Alexei Starovoitov
2014-02-06  1:10 ` Alexei Starovoitov [this message]
2014-02-06  1:10 ` [RFC PATCH v2 tip 4/7] Revert "x86/ptrace: Remove unused regs_get_argument_nth API" Alexei Starovoitov
2014-02-06  1:10 ` [RFC PATCH v2 tip 5/7] use BPF in tracing filters Alexei Starovoitov
2014-02-06  1:10 ` [RFC PATCH v2 tip 6/7] LLVM BPF backend Alexei Starovoitov
2014-02-06  1:10 ` [RFC PATCH v2 tip 7/7] tracing filter examples in BPF Alexei Starovoitov
2014-02-06 10:42 ` [RFC PATCH v2 tip 0/7] 64-bit BPF insn set and tracing filters Daniel Borkmann
2014-02-07  1:20   ` Alexei Starovoitov
2014-02-13 20:20     ` Daniel Borkmann
2014-02-13 22:22       ` Daniel Borkmann
2014-02-14  0:59         ` Alexei Starovoitov
2014-02-14 17:02           ` Daniel Borkmann
2014-02-14 17:55             ` Alexei Starovoitov
2014-02-15 16:13               ` Daniel Borkmann
2014-02-14  4:47       ` Alexei Starovoitov
2014-02-14 17:27         ` Daniel Borkmann
2014-02-14 20:17           ` Alexei Starovoitov
2014-02-13 22:32     ` H. Peter Anvin
2014-02-13 22:44       ` Daniel Borkmann
2014-02-13 22:47         ` H. Peter Anvin
2014-02-13 22:55           ` Daniel Borkmann
  -- strict thread matches above, loose matches on Subject: below --
2014-02-06  0:10 Alexei Starovoitov
2014-02-06  0:10 ` [RFC PATCH v2 tip 3/7] Extended BPF (64-bit BPF) design document Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1391649046-4383-4-git-send-email-ast@plumgrid.com \
    --to=ast@plumgrid.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=acme@infradead.org \
    --cc=akpm@linux-foundation.org \
    --cc=arjan@infradead.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=fweisbec@gmail.com \
    --cc=hch@infradead.org \
    --cc=hpa@zytor.com \
    --cc=jovi.zhangwei@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=masami.hiramatsu.pt@hitachi.com \
    --cc=mingo@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=penberg@iki.fi \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=tom.zanussi@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.