* [PATCH RFC v2 net-next 00/16] BPF syscall, maps, verifier, samples
@ 2014-07-18  4:19 ` Alexei Starovoitov
From: Alexei Starovoitov @ 2014-07-18  4:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

Hi All,

changes V1->V2:
- got rid of global id, everything now FD based (Thanks Andy!)
- split type enum in verifier (as suggested by Andy and Namhyung)
- switched gpl enforcement to be kmod like (as suggested by Andy and David)
- addressed feedback from Namhyung, Chema, Joe
- added more comments to verifier
- renamed sock_filter_int -> bpf_insn
- rebased on net-next

The FD approach made the eBPF user interface much cleaner for the sockets/seccomp/tracing
use cases. The socket and tracing examples (patches 15 and 16) can now be Ctrl-C'd in
the middle and the kernel will automatically clean up everything, including tracing filters.
A small downside is that eBPF programs need to include a 'map fixup' section to use maps,
which is similar to traditional ELF relocation sections, but much simpler.
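
To illustrate the fixup idea only (the record layout below is hypothetical and not the
actual section format used by the samples): the program's instructions carry a placeholder
wherever a map is referenced, and the loader patches in the map FD it received from the
kernel before loading the program:

  /* hypothetical sketch of the 'map fixup' idea, for illustration only */
  struct map_fixup {
  	int insn_idx;	/* instruction that references a map */
  	int map_idx;	/* index into the program's list of maps */
  };

  static void apply_map_fixups(struct bpf_insn *insns,
  			     const struct map_fixup *fixups, int nr_fixups,
  			     const int *map_fds)
  {
  	int i;

  	/* replace each placeholder immediate with the real map FD so the
  	 * kernel can resolve it to a map object during verification
  	 */
  	for (i = 0; i < nr_fixups; i++)
  		insns[fixups[i].insn_idx].imm = map_fds[fixups[i].map_idx];
  }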

The first 11 patches are the eBPF core, which I think is ready for prime time.

Patch 12 (sockets+bpf) is already very useful, and it's trivial to expose more features
for sockets in the future (like packet rewrite or calling flow_dissect).

Patch 13 (tracing+bpf) needs more work to become dtrace-like. It's a first step.

Todo:
- manpage for new syscall
- detect and reject address leaking in non-root programs

----

Fixed V1 cover letter:

'maps' are a generic storage facility of different types for sharing data between the kernel
and user space. Maps are referenced by file descriptor. A root process can create
multiple maps of different types, where key/value are opaque bytes of data.
It's up to user space and the eBPF program to decide what they store in the maps.
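
As a rough user space sketch (the wrapper names and signatures below are assumptions
modeled on the mini library added in patch 14, not a stable API), creating and using a
map could look like this:

  #include <stdio.h>
  #include <unistd.h>
  #include "libbpf.h"	/* samples' mini library (patch 14); assumed API */

  int main(void)
  {
  	long long key = 1, value = 42;
  	int map_fd;

  	/* root only: returns an FD that refers to the new map */
  	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key),
  				sizeof(value), 256 /* max entries */);
  	if (map_fd < 0)
  		return 1;

  	bpf_update_elem(map_fd, &key, &value);	/* kernel copies key/value */

  	value = 0;
  	if (bpf_lookup_elem(map_fd, &key, &value) == 0)
  		printf("value = %lld\n", value);

  	close(map_fd);	/* dropping the last reference frees the map */
  	return 0;
  }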

eBPF programs are similar to kernel modules. They are loaded by a user space
program and unloaded when their fd is closed. Each program is a safe, run-to-completion
set of instructions. The eBPF verifier statically determines that the program
terminates and is safe to execute. During verification the program takes a hold of the
maps it intends to use, so the selected maps cannot be removed until the program is
unloaded. The program can be attached to different events: these can be packets,
tracepoint events and other event types in the future. A new event triggers execution
of the program, which may store information about the event in the maps.
Beyond storing data, programs may call into in-kernel helper functions
which may, for example, dump the stack, do trace_printk or other forms of live
kernel debugging. The same program can be attached to multiple events, and different
programs can access the same map:

  tracepoint  tracepoint  tracepoint    sk_buff    sk_buff
   event A     event B     event C      on eth0    on eth1
    |             |          |            |          |
    |             |          |            |          |
    --> tracing <--      tracing       socket      socket
         prog_1           prog_2       prog_3      prog_4
         |  |               |            |
      |---  -----|  |-------|           map_3
    map_1       map_2

User space (via syscall) and eBPF programs access maps concurrently.
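
To make the flow concrete, here is a sketch of such a program in restricted C (the llvm
backend is not part of this series; the helper wrappers and the map reference below are
assumptions layered on top of BPF_FUNC_map_lookup_elem/BPF_FUNC_map_update_elem, while
the actual samples use bpf_insn 'assembler' macros):

  /* sketch only: count events per key in a map shared with user space */
  static void *(*bpf_map_lookup_elem)(void *map, void *key) =
  	(void *) BPF_FUNC_map_lookup_elem;
  static int (*bpf_map_update_elem)(void *map, void *key, void *value) =
  	(void *) BPF_FUNC_map_update_elem;

  static void *my_map;	/* placeholder, resolved by the loader's 'map fixup' */

  int count_event(void *ctx)	/* ctx is the skb or tracepoint context */
  {
  	long long key = 0, init_val = 1, *val;

  	val = bpf_map_lookup_elem(my_map, &key);
  	if (val)
  		__sync_fetch_and_add(val, 1);	/* atomic add (BPF_XADD) */
  	else
  		bpf_map_update_elem(my_map, &key, &init_val);

  	return 0;
  }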

The last two patches are sample code. The first demonstrates stateful packet inspection:
it counts tcp and udp packets on eth0, and it should be easy to see how this eBPF
framework can be used for network analytics.
The second sample is a simple 'drop monitor': it attaches to the kfree_skb tracepoint
event and counts the number of packet drops at a particular $pc location.
User space periodically summarizes what the eBPF programs recorded.
In these two samples the eBPF programs are tiny and written in 'assembler'
with macros. More complex programs can be written in C (the llvm backend is not
part of this diff and will be upstreamed after this patchset is accepted).
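
The periodic user space summarization can be a small polling loop over the map (sketch
only; bpf_get_next_key/bpf_lookup_elem mirror the iterate/lookup methods from patch 7,
but the exact wrapper signatures are assumptions):

  #include <stdio.h>
  #include "libbpf.h"	/* samples' mini library; assumed API */

  static void dump_counters(int map_fd)
  {
  	long long key = -1, next_key, value;

  	/* walk all keys and print what the eBPF programs accumulated */
  	while (bpf_get_next_key(map_fd, &key, &next_key) == 0) {
  		if (bpf_lookup_elem(map_fd, &next_key, &value) == 0)
  			printf("key %lld: %lld events\n", next_key, value);
  		key = next_key;
  	}
  }

  /* call this e.g. once per second while the programs stay attached */
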
Since eBPF is fully JITed on x64, the cost of running an eBPF program is very
small even for high-frequency events. Here are the numbers comparing
flow_dissector in C vs eBPF:
  x86_64 skb_flow_dissect() same skb (all cached)         -  42 nsec per call
  x86_64 skb_flow_dissect() different skbs (cache misses) - 141 nsec per call
eBPF+jit skb_flow_dissect() same skb (all cached)         -  51 nsec per call
eBPF+jit skb_flow_dissect() different skbs (cache misses) - 135 nsec per call

Thanks
Alexei

------
The following changes since commit da388973d4a15e71cada1219d625b5393c90e5ae:

  iw_cxgb4: fix for 64-bit integer division (2014-07-17 16:52:08 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf master

for you to fetch changes up to e8c12b5d78f612a7651db9648c45999bd6fd3c1c:

  samples: bpf: example of tracing filters with eBPF (2014-07-17 20:08:17 -0700)

----------------------------------------------------------------
Alexei Starovoitov (16):
      net: filter: split filter.c into two files
      bpf: update MAINTAINERS entry
      net: filter: rename struct sock_filter_int into bpf_insn
      net: filter: split filter.h and expose eBPF to user space
      bpf: introduce syscall(BPF, ...) and BPF maps
      bpf: enable bpf syscall on x64
      bpf: add lookup/update/delete/iterate methods to BPF maps
      bpf: add hashtable type of BPF maps
      bpf: expand BPF syscall with program load/unload
      bpf: add eBPF verifier
      bpf: allow eBPF programs to use maps
      net: sock: allow eBPF programs to be attached to sockets
      tracing: allow eBPF programs to be attached to events
      samples: bpf: add mini eBPF library to manipulate maps and programs
      samples: bpf: example of stateful socket filtering
      samples: bpf: example of tracing filters with eBPF

 Documentation/networking/filter.txt    |  302 +++++++
 MAINTAINERS                            |    7 +
 arch/alpha/include/uapi/asm/socket.h   |    2 +
 arch/avr32/include/uapi/asm/socket.h   |    2 +
 arch/cris/include/uapi/asm/socket.h    |    2 +
 arch/frv/include/uapi/asm/socket.h     |    2 +
 arch/ia64/include/uapi/asm/socket.h    |    2 +
 arch/m32r/include/uapi/asm/socket.h    |    2 +
 arch/mips/include/uapi/asm/socket.h    |    2 +
 arch/mn10300/include/uapi/asm/socket.h |    2 +
 arch/parisc/include/uapi/asm/socket.h  |    2 +
 arch/powerpc/include/uapi/asm/socket.h |    2 +
 arch/s390/include/uapi/asm/socket.h    |    2 +
 arch/sparc/include/uapi/asm/socket.h   |    2 +
 arch/x86/net/bpf_jit_comp.c            |    2 +-
 arch/x86/syscalls/syscall_64.tbl       |    1 +
 arch/xtensa/include/uapi/asm/socket.h  |    2 +
 include/linux/bpf.h                    |  136 +++
 include/linux/filter.h                 |  310 +------
 include/linux/ftrace_event.h           |    5 +
 include/linux/syscalls.h               |    2 +
 include/trace/bpf_trace.h              |   29 +
 include/trace/ftrace.h                 |   10 +
 include/uapi/asm-generic/socket.h      |    2 +
 include/uapi/asm-generic/unistd.h      |    4 +-
 include/uapi/linux/Kbuild              |    1 +
 include/uapi/linux/bpf.h               |  391 ++++++++
 kernel/Makefile                        |    1 +
 kernel/bpf/Makefile                    |    1 +
 kernel/bpf/core.c                      |  539 +++++++++++
 kernel/bpf/hashtab.c                   |  371 ++++++++
 kernel/bpf/syscall.c                   |  828 +++++++++++++++++
 kernel/bpf/verifier.c                  | 1520 ++++++++++++++++++++++++++++++++
 kernel/seccomp.c                       |    2 +-
 kernel/sys_ni.c                        |    3 +
 kernel/trace/Kconfig                   |    1 +
 kernel/trace/Makefile                  |    1 +
 kernel/trace/bpf_trace.c               |  212 +++++
 kernel/trace/trace.h                   |    3 +
 kernel/trace/trace_events.c            |   36 +-
 kernel/trace/trace_events_filter.c     |   72 +-
 lib/test_bpf.c                         |    4 +-
 net/core/filter.c                      |  650 +++-----------
 net/core/sock.c                        |   13 +
 samples/bpf/.gitignore                 |    1 +
 samples/bpf/Makefile                   |   15 +
 samples/bpf/dropmon.c                  |  134 +++
 samples/bpf/libbpf.c                   |  109 +++
 samples/bpf/libbpf.h                   |   22 +
 samples/bpf/sock_example.c             |  161 ++++
 50 files changed, 5099 insertions(+), 828 deletions(-)
 create mode 100644 include/linux/bpf.h
 create mode 100644 include/trace/bpf_trace.h
 create mode 100644 include/uapi/linux/bpf.h
 create mode 100644 kernel/bpf/Makefile
 create mode 100644 kernel/bpf/core.c
 create mode 100644 kernel/bpf/hashtab.c
 create mode 100644 kernel/bpf/syscall.c
 create mode 100644 kernel/bpf/verifier.c
 create mode 100644 kernel/trace/bpf_trace.c
 create mode 100644 samples/bpf/.gitignore
 create mode 100644 samples/bpf/Makefile
 create mode 100644 samples/bpf/dropmon.c
 create mode 100644 samples/bpf/libbpf.c
 create mode 100644 samples/bpf/libbpf.h
 create mode 100644 samples/bpf/sock_example.c



* [PATCH RFC v2 net-next 01/16] net: filter: split filter.c into two files
@ 2014-07-18  4:19   ` Alexei Starovoitov
From: Alexei Starovoitov @ 2014-07-18  4:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

BPF is used in several kernel components. This split creates a logical boundary
between the generic eBPF core and the rest:

kernel/bpf/core.c: eBPF interpreter

net/core/filter.c: classic->eBPF converter, classic verifiers, socket filters

This patch only moves functions.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 kernel/Makefile     |    1 +
 kernel/bpf/Makefile |    1 +
 kernel/bpf/core.c   |  536 +++++++++++++++++++++++++++++++++++++++++++++++++++
 net/core/filter.c   |  511 ------------------------------------------------
 4 files changed, 538 insertions(+), 511 deletions(-)
 create mode 100644 kernel/bpf/Makefile
 create mode 100644 kernel/bpf/core.c

diff --git a/kernel/Makefile b/kernel/Makefile
index f2a8b6246ce9..e7360b7c2c0e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -87,6 +87,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/
 obj-$(CONFIG_TRACEPOINTS) += trace/
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
 obj-$(CONFIG_CPU_PM) += cpu_pm.o
+obj-$(CONFIG_NET) += bpf/
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
new file mode 100644
index 000000000000..6a71145e2769
--- /dev/null
+++ b/kernel/bpf/Makefile
@@ -0,0 +1 @@
+obj-y := core.o
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
new file mode 100644
index 000000000000..77a240a1ce11
--- /dev/null
+++ b/kernel/bpf/core.c
@@ -0,0 +1,536 @@
+/*
+ * Linux Socket Filter - Kernel level socket filtering
+ *
+ * Based on the design of the Berkeley Packet Filter. The new
+ * internal format has been designed by PLUMgrid:
+ *
+ *	Copyright (c) 2011 - 2014 PLUMgrid, http://plumgrid.com
+ *
+ * Authors:
+ *
+ *	Jay Schulist <jschlst@samba.org>
+ *	Alexei Starovoitov <ast@plumgrid.com>
+ *	Daniel Borkmann <dborkman@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Andi Kleen - Fix a few bad bugs and races.
+ * Kris Katterjohn - Added many additional checks in sk_chk_filter()
+ */
+#include <linux/filter.h>
+#include <linux/skbuff.h>
+#include <asm/unaligned.h>
+
+/* Registers */
+#define BPF_R0	regs[BPF_REG_0]
+#define BPF_R1	regs[BPF_REG_1]
+#define BPF_R2	regs[BPF_REG_2]
+#define BPF_R3	regs[BPF_REG_3]
+#define BPF_R4	regs[BPF_REG_4]
+#define BPF_R5	regs[BPF_REG_5]
+#define BPF_R6	regs[BPF_REG_6]
+#define BPF_R7	regs[BPF_REG_7]
+#define BPF_R8	regs[BPF_REG_8]
+#define BPF_R9	regs[BPF_REG_9]
+#define BPF_R10	regs[BPF_REG_10]
+
+/* Named registers */
+#define DST	regs[insn->dst_reg]
+#define SRC	regs[insn->src_reg]
+#define FP	regs[BPF_REG_FP]
+#define ARG1	regs[BPF_REG_ARG1]
+#define CTX	regs[BPF_REG_CTX]
+#define IMM	insn->imm
+
+/* No hurry in this branch
+ *
+ * Exported for the bpf jit load helper.
+ */
+void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, unsigned int size)
+{
+	u8 *ptr = NULL;
+
+	if (k >= SKF_NET_OFF)
+		ptr = skb_network_header(skb) + k - SKF_NET_OFF;
+	else if (k >= SKF_LL_OFF)
+		ptr = skb_mac_header(skb) + k - SKF_LL_OFF;
+	if (ptr >= skb->head && ptr + size <= skb_tail_pointer(skb))
+		return ptr;
+
+	return NULL;
+}
+
+/* Base function for offset calculation. Needs to go into .text section,
+ * therefore keeping it non-static as well; will also be used by JITs
+ * anyway later on, so do not let the compiler omit it.
+ */
+noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	return 0;
+}
+
+/**
+ *	__sk_run_filter - run a filter on a given context
+ *	@ctx: buffer to run the filter on
+ *	@insn: filter to apply
+ *
+ * Decode and apply filter instructions to the skb->data. Return length to
+ * keep, 0 for none. @ctx is the data we are operating on, @insn is the
+ * array of filter instructions.
+ */
+static unsigned int __sk_run_filter(void *ctx, const struct sock_filter_int *insn)
+{
+	u64 stack[MAX_BPF_STACK / sizeof(u64)];
+	u64 regs[MAX_BPF_REG], tmp;
+	static const void *jumptable[256] = {
+		[0 ... 255] = &&default_label,
+		/* Now overwrite non-defaults ... */
+		/* 32 bit ALU operations */
+		[BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
+		[BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
+		[BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,
+		[BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,
+		[BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X,
+		[BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K,
+		[BPF_ALU | BPF_OR | BPF_X]  = &&ALU_OR_X,
+		[BPF_ALU | BPF_OR | BPF_K]  = &&ALU_OR_K,
+		[BPF_ALU | BPF_LSH | BPF_X] = &&ALU_LSH_X,
+		[BPF_ALU | BPF_LSH | BPF_K] = &&ALU_LSH_K,
+		[BPF_ALU | BPF_RSH | BPF_X] = &&ALU_RSH_X,
+		[BPF_ALU | BPF_RSH | BPF_K] = &&ALU_RSH_K,
+		[BPF_ALU | BPF_XOR | BPF_X] = &&ALU_XOR_X,
+		[BPF_ALU | BPF_XOR | BPF_K] = &&ALU_XOR_K,
+		[BPF_ALU | BPF_MUL | BPF_X] = &&ALU_MUL_X,
+		[BPF_ALU | BPF_MUL | BPF_K] = &&ALU_MUL_K,
+		[BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X,
+		[BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K,
+		[BPF_ALU | BPF_DIV | BPF_X] = &&ALU_DIV_X,
+		[BPF_ALU | BPF_DIV | BPF_K] = &&ALU_DIV_K,
+		[BPF_ALU | BPF_MOD | BPF_X] = &&ALU_MOD_X,
+		[BPF_ALU | BPF_MOD | BPF_K] = &&ALU_MOD_K,
+		[BPF_ALU | BPF_NEG] = &&ALU_NEG,
+		[BPF_ALU | BPF_END | BPF_TO_BE] = &&ALU_END_TO_BE,
+		[BPF_ALU | BPF_END | BPF_TO_LE] = &&ALU_END_TO_LE,
+		/* 64 bit ALU operations */
+		[BPF_ALU64 | BPF_ADD | BPF_X] = &&ALU64_ADD_X,
+		[BPF_ALU64 | BPF_ADD | BPF_K] = &&ALU64_ADD_K,
+		[BPF_ALU64 | BPF_SUB | BPF_X] = &&ALU64_SUB_X,
+		[BPF_ALU64 | BPF_SUB | BPF_K] = &&ALU64_SUB_K,
+		[BPF_ALU64 | BPF_AND | BPF_X] = &&ALU64_AND_X,
+		[BPF_ALU64 | BPF_AND | BPF_K] = &&ALU64_AND_K,
+		[BPF_ALU64 | BPF_OR | BPF_X] = &&ALU64_OR_X,
+		[BPF_ALU64 | BPF_OR | BPF_K] = &&ALU64_OR_K,
+		[BPF_ALU64 | BPF_LSH | BPF_X] = &&ALU64_LSH_X,
+		[BPF_ALU64 | BPF_LSH | BPF_K] = &&ALU64_LSH_K,
+		[BPF_ALU64 | BPF_RSH | BPF_X] = &&ALU64_RSH_X,
+		[BPF_ALU64 | BPF_RSH | BPF_K] = &&ALU64_RSH_K,
+		[BPF_ALU64 | BPF_XOR | BPF_X] = &&ALU64_XOR_X,
+		[BPF_ALU64 | BPF_XOR | BPF_K] = &&ALU64_XOR_K,
+		[BPF_ALU64 | BPF_MUL | BPF_X] = &&ALU64_MUL_X,
+		[BPF_ALU64 | BPF_MUL | BPF_K] = &&ALU64_MUL_K,
+		[BPF_ALU64 | BPF_MOV | BPF_X] = &&ALU64_MOV_X,
+		[BPF_ALU64 | BPF_MOV | BPF_K] = &&ALU64_MOV_K,
+		[BPF_ALU64 | BPF_ARSH | BPF_X] = &&ALU64_ARSH_X,
+		[BPF_ALU64 | BPF_ARSH | BPF_K] = &&ALU64_ARSH_K,
+		[BPF_ALU64 | BPF_DIV | BPF_X] = &&ALU64_DIV_X,
+		[BPF_ALU64 | BPF_DIV | BPF_K] = &&ALU64_DIV_K,
+		[BPF_ALU64 | BPF_MOD | BPF_X] = &&ALU64_MOD_X,
+		[BPF_ALU64 | BPF_MOD | BPF_K] = &&ALU64_MOD_K,
+		[BPF_ALU64 | BPF_NEG] = &&ALU64_NEG,
+		/* Call instruction */
+		[BPF_JMP | BPF_CALL] = &&JMP_CALL,
+		/* Jumps */
+		[BPF_JMP | BPF_JA] = &&JMP_JA,
+		[BPF_JMP | BPF_JEQ | BPF_X] = &&JMP_JEQ_X,
+		[BPF_JMP | BPF_JEQ | BPF_K] = &&JMP_JEQ_K,
+		[BPF_JMP | BPF_JNE | BPF_X] = &&JMP_JNE_X,
+		[BPF_JMP | BPF_JNE | BPF_K] = &&JMP_JNE_K,
+		[BPF_JMP | BPF_JGT | BPF_X] = &&JMP_JGT_X,
+		[BPF_JMP | BPF_JGT | BPF_K] = &&JMP_JGT_K,
+		[BPF_JMP | BPF_JGE | BPF_X] = &&JMP_JGE_X,
+		[BPF_JMP | BPF_JGE | BPF_K] = &&JMP_JGE_K,
+		[BPF_JMP | BPF_JSGT | BPF_X] = &&JMP_JSGT_X,
+		[BPF_JMP | BPF_JSGT | BPF_K] = &&JMP_JSGT_K,
+		[BPF_JMP | BPF_JSGE | BPF_X] = &&JMP_JSGE_X,
+		[BPF_JMP | BPF_JSGE | BPF_K] = &&JMP_JSGE_K,
+		[BPF_JMP | BPF_JSET | BPF_X] = &&JMP_JSET_X,
+		[BPF_JMP | BPF_JSET | BPF_K] = &&JMP_JSET_K,
+		/* Program return */
+		[BPF_JMP | BPF_EXIT] = &&JMP_EXIT,
+		/* Store instructions */
+		[BPF_STX | BPF_MEM | BPF_B] = &&STX_MEM_B,
+		[BPF_STX | BPF_MEM | BPF_H] = &&STX_MEM_H,
+		[BPF_STX | BPF_MEM | BPF_W] = &&STX_MEM_W,
+		[BPF_STX | BPF_MEM | BPF_DW] = &&STX_MEM_DW,
+		[BPF_STX | BPF_XADD | BPF_W] = &&STX_XADD_W,
+		[BPF_STX | BPF_XADD | BPF_DW] = &&STX_XADD_DW,
+		[BPF_ST | BPF_MEM | BPF_B] = &&ST_MEM_B,
+		[BPF_ST | BPF_MEM | BPF_H] = &&ST_MEM_H,
+		[BPF_ST | BPF_MEM | BPF_W] = &&ST_MEM_W,
+		[BPF_ST | BPF_MEM | BPF_DW] = &&ST_MEM_DW,
+		/* Load instructions */
+		[BPF_LDX | BPF_MEM | BPF_B] = &&LDX_MEM_B,
+		[BPF_LDX | BPF_MEM | BPF_H] = &&LDX_MEM_H,
+		[BPF_LDX | BPF_MEM | BPF_W] = &&LDX_MEM_W,
+		[BPF_LDX | BPF_MEM | BPF_DW] = &&LDX_MEM_DW,
+		[BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,
+		[BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,
+		[BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,
+		[BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
+		[BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
+		[BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
+	};
+	void *ptr;
+	int off;
+
+#define CONT	 ({ insn++; goto select_insn; })
+#define CONT_JMP ({ insn++; goto select_insn; })
+
+	FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)];
+	ARG1 = (u64) (unsigned long) ctx;
+
+	/* Registers used in classic BPF programs need to be reset first. */
+	regs[BPF_REG_A] = 0;
+	regs[BPF_REG_X] = 0;
+
+select_insn:
+	goto *jumptable[insn->code];
+
+	/* ALU */
+#define ALU(OPCODE, OP)			\
+	ALU64_##OPCODE##_X:		\
+		DST = DST OP SRC;	\
+		CONT;			\
+	ALU_##OPCODE##_X:		\
+		DST = (u32) DST OP (u32) SRC;	\
+		CONT;			\
+	ALU64_##OPCODE##_K:		\
+		DST = DST OP IMM;		\
+		CONT;			\
+	ALU_##OPCODE##_K:		\
+		DST = (u32) DST OP (u32) IMM;	\
+		CONT;
+
+	ALU(ADD,  +)
+	ALU(SUB,  -)
+	ALU(AND,  &)
+	ALU(OR,   |)
+	ALU(LSH, <<)
+	ALU(RSH, >>)
+	ALU(XOR,  ^)
+	ALU(MUL,  *)
+#undef ALU
+	ALU_NEG:
+		DST = (u32) -DST;
+		CONT;
+	ALU64_NEG:
+		DST = -DST;
+		CONT;
+	ALU_MOV_X:
+		DST = (u32) SRC;
+		CONT;
+	ALU_MOV_K:
+		DST = (u32) IMM;
+		CONT;
+	ALU64_MOV_X:
+		DST = SRC;
+		CONT;
+	ALU64_MOV_K:
+		DST = IMM;
+		CONT;
+	ALU64_ARSH_X:
+		(*(s64 *) &DST) >>= SRC;
+		CONT;
+	ALU64_ARSH_K:
+		(*(s64 *) &DST) >>= IMM;
+		CONT;
+	ALU64_MOD_X:
+		if (unlikely(SRC == 0))
+			return 0;
+		tmp = DST;
+		DST = do_div(tmp, SRC);
+		CONT;
+	ALU_MOD_X:
+		if (unlikely(SRC == 0))
+			return 0;
+		tmp = (u32) DST;
+		DST = do_div(tmp, (u32) SRC);
+		CONT;
+	ALU64_MOD_K:
+		tmp = DST;
+		DST = do_div(tmp, IMM);
+		CONT;
+	ALU_MOD_K:
+		tmp = (u32) DST;
+		DST = do_div(tmp, (u32) IMM);
+		CONT;
+	ALU64_DIV_X:
+		if (unlikely(SRC == 0))
+			return 0;
+		do_div(DST, SRC);
+		CONT;
+	ALU_DIV_X:
+		if (unlikely(SRC == 0))
+			return 0;
+		tmp = (u32) DST;
+		do_div(tmp, (u32) SRC);
+		DST = (u32) tmp;
+		CONT;
+	ALU64_DIV_K:
+		do_div(DST, IMM);
+		CONT;
+	ALU_DIV_K:
+		tmp = (u32) DST;
+		do_div(tmp, (u32) IMM);
+		DST = (u32) tmp;
+		CONT;
+	ALU_END_TO_BE:
+		switch (IMM) {
+		case 16:
+			DST = (__force u16) cpu_to_be16(DST);
+			break;
+		case 32:
+			DST = (__force u32) cpu_to_be32(DST);
+			break;
+		case 64:
+			DST = (__force u64) cpu_to_be64(DST);
+			break;
+		}
+		CONT;
+	ALU_END_TO_LE:
+		switch (IMM) {
+		case 16:
+			DST = (__force u16) cpu_to_le16(DST);
+			break;
+		case 32:
+			DST = (__force u32) cpu_to_le32(DST);
+			break;
+		case 64:
+			DST = (__force u64) cpu_to_le64(DST);
+			break;
+		}
+		CONT;
+
+	/* CALL */
+	JMP_CALL:
+		/* Function call scratches BPF_R1-BPF_R5 registers,
+		 * preserves BPF_R6-BPF_R9, and stores return value
+		 * into BPF_R0.
+		 */
+		BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3,
+						       BPF_R4, BPF_R5);
+		CONT;
+
+	/* JMP */
+	JMP_JA:
+		insn += insn->off;
+		CONT;
+	JMP_JEQ_X:
+		if (DST == SRC) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JEQ_K:
+		if (DST == IMM) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JNE_X:
+		if (DST != SRC) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JNE_K:
+		if (DST != IMM) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JGT_X:
+		if (DST > SRC) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JGT_K:
+		if (DST > IMM) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JGE_X:
+		if (DST >= SRC) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JGE_K:
+		if (DST >= IMM) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JSGT_X:
+		if (((s64) DST) > ((s64) SRC)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JSGT_K:
+		if (((s64) DST) > ((s64) IMM)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JSGE_X:
+		if (((s64) DST) >= ((s64) SRC)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JSGE_K:
+		if (((s64) DST) >= ((s64) IMM)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JSET_X:
+		if (DST & SRC) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JSET_K:
+		if (DST & IMM) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_EXIT:
+		return BPF_R0;
+
+	/* STX and ST and LDX*/
+#define LDST(SIZEOP, SIZE)						\
+	STX_MEM_##SIZEOP:						\
+		*(SIZE *)(unsigned long) (DST + insn->off) = SRC;	\
+		CONT;							\
+	ST_MEM_##SIZEOP:						\
+		*(SIZE *)(unsigned long) (DST + insn->off) = IMM;	\
+		CONT;							\
+	LDX_MEM_##SIZEOP:						\
+		DST = *(SIZE *)(unsigned long) (SRC + insn->off);	\
+		CONT;
+
+	LDST(B,   u8)
+	LDST(H,  u16)
+	LDST(W,  u32)
+	LDST(DW, u64)
+#undef LDST
+	STX_XADD_W: /* lock xadd *(u32 *)(dst_reg + off16) += src_reg */
+		atomic_add((u32) SRC, (atomic_t *)(unsigned long)
+			   (DST + insn->off));
+		CONT;
+	STX_XADD_DW: /* lock xadd *(u64 *)(dst_reg + off16) += src_reg */
+		atomic64_add((u64) SRC, (atomic64_t *)(unsigned long)
+			     (DST + insn->off));
+		CONT;
+	LD_ABS_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + imm32)) */
+		off = IMM;
+load_word:
+		/* BPF_LD + BPD_ABS and BPF_LD + BPF_IND insns are
+		 * only appearing in the programs where ctx ==
+		 * skb. All programs keep 'ctx' in regs[BPF_REG_CTX]
+		 * == BPF_R6, sk_convert_filter() saves it in BPF_R6,
+		 * internal BPF verifier will check that BPF_R6 ==
+		 * ctx.
+		 *
+		 * BPF_ABS and BPF_IND are wrappers of function calls,
+		 * so they scratch BPF_R1-BPF_R5 registers, preserve
+		 * BPF_R6-BPF_R9, and store return value into BPF_R0.
+		 *
+		 * Implicit input:
+		 *   ctx == skb == BPF_R6 == CTX
+		 *
+		 * Explicit input:
+		 *   SRC == any register
+		 *   IMM == 32-bit immediate
+		 *
+		 * Output:
+		 *   BPF_R0 - 8/16/32-bit skb data converted to cpu endianness
+		 */
+
+		ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 4, &tmp);
+		if (likely(ptr != NULL)) {
+			BPF_R0 = get_unaligned_be32(ptr);
+			CONT;
+		}
+
+		return 0;
+	LD_ABS_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + imm32)) */
+		off = IMM;
+load_half:
+		ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 2, &tmp);
+		if (likely(ptr != NULL)) {
+			BPF_R0 = get_unaligned_be16(ptr);
+			CONT;
+		}
+
+		return 0;
+	LD_ABS_B: /* BPF_R0 = *(u8 *) (skb->data + imm32) */
+		off = IMM;
+load_byte:
+		ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 1, &tmp);
+		if (likely(ptr != NULL)) {
+			BPF_R0 = *(u8 *)ptr;
+			CONT;
+		}
+
+		return 0;
+	LD_IND_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + src_reg + imm32)) */
+		off = IMM + SRC;
+		goto load_word;
+	LD_IND_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + src_reg + imm32)) */
+		off = IMM + SRC;
+		goto load_half;
+	LD_IND_B: /* BPF_R0 = *(u8 *) (skb->data + src_reg + imm32) */
+		off = IMM + SRC;
+		goto load_byte;
+
+	default_label:
+		/* If we ever reach this, we have a bug somewhere. */
+		WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
+		return 0;
+}
+
+void __weak bpf_int_jit_compile(struct sk_filter *prog)
+{
+}
+
+/**
+ *	sk_filter_select_runtime - select execution runtime for BPF program
+ *	@fp: sk_filter populated with internal BPF program
+ *
+ * try to JIT internal BPF program, if JIT is not available select interpreter
+ * BPF program will be executed via SK_RUN_FILTER() macro
+ */
+void sk_filter_select_runtime(struct sk_filter *fp)
+{
+	fp->bpf_func = (void *) __sk_run_filter;
+
+	/* Probe if internal BPF can be JITed */
+	bpf_int_jit_compile(fp);
+}
+EXPORT_SYMBOL_GPL(sk_filter_select_runtime);
+
+/* free internal BPF program */
+void sk_filter_free(struct sk_filter *fp)
+{
+	bpf_jit_free(fp);
+}
+EXPORT_SYMBOL_GPL(sk_filter_free);
diff --git a/net/core/filter.c b/net/core/filter.c
index b90ae7fb3b89..1d0e9492e4fa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -45,45 +45,6 @@
 #include <linux/seccomp.h>
 #include <linux/if_vlan.h>
 
-/* Registers */
-#define BPF_R0	regs[BPF_REG_0]
-#define BPF_R1	regs[BPF_REG_1]
-#define BPF_R2	regs[BPF_REG_2]
-#define BPF_R3	regs[BPF_REG_3]
-#define BPF_R4	regs[BPF_REG_4]
-#define BPF_R5	regs[BPF_REG_5]
-#define BPF_R6	regs[BPF_REG_6]
-#define BPF_R7	regs[BPF_REG_7]
-#define BPF_R8	regs[BPF_REG_8]
-#define BPF_R9	regs[BPF_REG_9]
-#define BPF_R10	regs[BPF_REG_10]
-
-/* Named registers */
-#define DST	regs[insn->dst_reg]
-#define SRC	regs[insn->src_reg]
-#define FP	regs[BPF_REG_FP]
-#define ARG1	regs[BPF_REG_ARG1]
-#define CTX	regs[BPF_REG_CTX]
-#define IMM	insn->imm
-
-/* No hurry in this branch
- *
- * Exported for the bpf jit load helper.
- */
-void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, unsigned int size)
-{
-	u8 *ptr = NULL;
-
-	if (k >= SKF_NET_OFF)
-		ptr = skb_network_header(skb) + k - SKF_NET_OFF;
-	else if (k >= SKF_LL_OFF)
-		ptr = skb_mac_header(skb) + k - SKF_LL_OFF;
-	if (ptr >= skb->head && ptr + size <= skb_tail_pointer(skb))
-		return ptr;
-
-	return NULL;
-}
-
 /**
  *	sk_filter - run a packet through a socket filter
  *	@sk: sock associated with &sk_buff
@@ -126,451 +87,6 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sk_filter);
 
-/* Base function for offset calculation. Needs to go into .text section,
- * therefore keeping it non-static as well; will also be used by JITs
- * anyway later on, so do not let the compiler omit it.
- */
-noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
-{
-	return 0;
-}
-
-/**
- *	__sk_run_filter - run a filter on a given context
- *	@ctx: buffer to run the filter on
- *	@insn: filter to apply
- *
- * Decode and apply filter instructions to the skb->data. Return length to
- * keep, 0 for none. @ctx is the data we are operating on, @insn is the
- * array of filter instructions.
- */
-static unsigned int __sk_run_filter(void *ctx, const struct sock_filter_int *insn)
-{
-	u64 stack[MAX_BPF_STACK / sizeof(u64)];
-	u64 regs[MAX_BPF_REG], tmp;
-	static const void *jumptable[256] = {
-		[0 ... 255] = &&default_label,
-		/* Now overwrite non-defaults ... */
-		/* 32 bit ALU operations */
-		[BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
-		[BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
-		[BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,
-		[BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,
-		[BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X,
-		[BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K,
-		[BPF_ALU | BPF_OR | BPF_X]  = &&ALU_OR_X,
-		[BPF_ALU | BPF_OR | BPF_K]  = &&ALU_OR_K,
-		[BPF_ALU | BPF_LSH | BPF_X] = &&ALU_LSH_X,
-		[BPF_ALU | BPF_LSH | BPF_K] = &&ALU_LSH_K,
-		[BPF_ALU | BPF_RSH | BPF_X] = &&ALU_RSH_X,
-		[BPF_ALU | BPF_RSH | BPF_K] = &&ALU_RSH_K,
-		[BPF_ALU | BPF_XOR | BPF_X] = &&ALU_XOR_X,
-		[BPF_ALU | BPF_XOR | BPF_K] = &&ALU_XOR_K,
-		[BPF_ALU | BPF_MUL | BPF_X] = &&ALU_MUL_X,
-		[BPF_ALU | BPF_MUL | BPF_K] = &&ALU_MUL_K,
-		[BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X,
-		[BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K,
-		[BPF_ALU | BPF_DIV | BPF_X] = &&ALU_DIV_X,
-		[BPF_ALU | BPF_DIV | BPF_K] = &&ALU_DIV_K,
-		[BPF_ALU | BPF_MOD | BPF_X] = &&ALU_MOD_X,
-		[BPF_ALU | BPF_MOD | BPF_K] = &&ALU_MOD_K,
-		[BPF_ALU | BPF_NEG] = &&ALU_NEG,
-		[BPF_ALU | BPF_END | BPF_TO_BE] = &&ALU_END_TO_BE,
-		[BPF_ALU | BPF_END | BPF_TO_LE] = &&ALU_END_TO_LE,
-		/* 64 bit ALU operations */
-		[BPF_ALU64 | BPF_ADD | BPF_X] = &&ALU64_ADD_X,
-		[BPF_ALU64 | BPF_ADD | BPF_K] = &&ALU64_ADD_K,
-		[BPF_ALU64 | BPF_SUB | BPF_X] = &&ALU64_SUB_X,
-		[BPF_ALU64 | BPF_SUB | BPF_K] = &&ALU64_SUB_K,
-		[BPF_ALU64 | BPF_AND | BPF_X] = &&ALU64_AND_X,
-		[BPF_ALU64 | BPF_AND | BPF_K] = &&ALU64_AND_K,
-		[BPF_ALU64 | BPF_OR | BPF_X] = &&ALU64_OR_X,
-		[BPF_ALU64 | BPF_OR | BPF_K] = &&ALU64_OR_K,
-		[BPF_ALU64 | BPF_LSH | BPF_X] = &&ALU64_LSH_X,
-		[BPF_ALU64 | BPF_LSH | BPF_K] = &&ALU64_LSH_K,
-		[BPF_ALU64 | BPF_RSH | BPF_X] = &&ALU64_RSH_X,
-		[BPF_ALU64 | BPF_RSH | BPF_K] = &&ALU64_RSH_K,
-		[BPF_ALU64 | BPF_XOR | BPF_X] = &&ALU64_XOR_X,
-		[BPF_ALU64 | BPF_XOR | BPF_K] = &&ALU64_XOR_K,
-		[BPF_ALU64 | BPF_MUL | BPF_X] = &&ALU64_MUL_X,
-		[BPF_ALU64 | BPF_MUL | BPF_K] = &&ALU64_MUL_K,
-		[BPF_ALU64 | BPF_MOV | BPF_X] = &&ALU64_MOV_X,
-		[BPF_ALU64 | BPF_MOV | BPF_K] = &&ALU64_MOV_K,
-		[BPF_ALU64 | BPF_ARSH | BPF_X] = &&ALU64_ARSH_X,
-		[BPF_ALU64 | BPF_ARSH | BPF_K] = &&ALU64_ARSH_K,
-		[BPF_ALU64 | BPF_DIV | BPF_X] = &&ALU64_DIV_X,
-		[BPF_ALU64 | BPF_DIV | BPF_K] = &&ALU64_DIV_K,
-		[BPF_ALU64 | BPF_MOD | BPF_X] = &&ALU64_MOD_X,
-		[BPF_ALU64 | BPF_MOD | BPF_K] = &&ALU64_MOD_K,
-		[BPF_ALU64 | BPF_NEG] = &&ALU64_NEG,
-		/* Call instruction */
-		[BPF_JMP | BPF_CALL] = &&JMP_CALL,
-		/* Jumps */
-		[BPF_JMP | BPF_JA] = &&JMP_JA,
-		[BPF_JMP | BPF_JEQ | BPF_X] = &&JMP_JEQ_X,
-		[BPF_JMP | BPF_JEQ | BPF_K] = &&JMP_JEQ_K,
-		[BPF_JMP | BPF_JNE | BPF_X] = &&JMP_JNE_X,
-		[BPF_JMP | BPF_JNE | BPF_K] = &&JMP_JNE_K,
-		[BPF_JMP | BPF_JGT | BPF_X] = &&JMP_JGT_X,
-		[BPF_JMP | BPF_JGT | BPF_K] = &&JMP_JGT_K,
-		[BPF_JMP | BPF_JGE | BPF_X] = &&JMP_JGE_X,
-		[BPF_JMP | BPF_JGE | BPF_K] = &&JMP_JGE_K,
-		[BPF_JMP | BPF_JSGT | BPF_X] = &&JMP_JSGT_X,
-		[BPF_JMP | BPF_JSGT | BPF_K] = &&JMP_JSGT_K,
-		[BPF_JMP | BPF_JSGE | BPF_X] = &&JMP_JSGE_X,
-		[BPF_JMP | BPF_JSGE | BPF_K] = &&JMP_JSGE_K,
-		[BPF_JMP | BPF_JSET | BPF_X] = &&JMP_JSET_X,
-		[BPF_JMP | BPF_JSET | BPF_K] = &&JMP_JSET_K,
-		/* Program return */
-		[BPF_JMP | BPF_EXIT] = &&JMP_EXIT,
-		/* Store instructions */
-		[BPF_STX | BPF_MEM | BPF_B] = &&STX_MEM_B,
-		[BPF_STX | BPF_MEM | BPF_H] = &&STX_MEM_H,
-		[BPF_STX | BPF_MEM | BPF_W] = &&STX_MEM_W,
-		[BPF_STX | BPF_MEM | BPF_DW] = &&STX_MEM_DW,
-		[BPF_STX | BPF_XADD | BPF_W] = &&STX_XADD_W,
-		[BPF_STX | BPF_XADD | BPF_DW] = &&STX_XADD_DW,
-		[BPF_ST | BPF_MEM | BPF_B] = &&ST_MEM_B,
-		[BPF_ST | BPF_MEM | BPF_H] = &&ST_MEM_H,
-		[BPF_ST | BPF_MEM | BPF_W] = &&ST_MEM_W,
-		[BPF_ST | BPF_MEM | BPF_DW] = &&ST_MEM_DW,
-		/* Load instructions */
-		[BPF_LDX | BPF_MEM | BPF_B] = &&LDX_MEM_B,
-		[BPF_LDX | BPF_MEM | BPF_H] = &&LDX_MEM_H,
-		[BPF_LDX | BPF_MEM | BPF_W] = &&LDX_MEM_W,
-		[BPF_LDX | BPF_MEM | BPF_DW] = &&LDX_MEM_DW,
-		[BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,
-		[BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,
-		[BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,
-		[BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
-		[BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
-		[BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
-	};
-	void *ptr;
-	int off;
-
-#define CONT	 ({ insn++; goto select_insn; })
-#define CONT_JMP ({ insn++; goto select_insn; })
-
-	FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)];
-	ARG1 = (u64) (unsigned long) ctx;
-
-	/* Registers used in classic BPF programs need to be reset first. */
-	regs[BPF_REG_A] = 0;
-	regs[BPF_REG_X] = 0;
-
-select_insn:
-	goto *jumptable[insn->code];
-
-	/* ALU */
-#define ALU(OPCODE, OP)			\
-	ALU64_##OPCODE##_X:		\
-		DST = DST OP SRC;	\
-		CONT;			\
-	ALU_##OPCODE##_X:		\
-		DST = (u32) DST OP (u32) SRC;	\
-		CONT;			\
-	ALU64_##OPCODE##_K:		\
-		DST = DST OP IMM;		\
-		CONT;			\
-	ALU_##OPCODE##_K:		\
-		DST = (u32) DST OP (u32) IMM;	\
-		CONT;
-
-	ALU(ADD,  +)
-	ALU(SUB,  -)
-	ALU(AND,  &)
-	ALU(OR,   |)
-	ALU(LSH, <<)
-	ALU(RSH, >>)
-	ALU(XOR,  ^)
-	ALU(MUL,  *)
-#undef ALU
-	ALU_NEG:
-		DST = (u32) -DST;
-		CONT;
-	ALU64_NEG:
-		DST = -DST;
-		CONT;
-	ALU_MOV_X:
-		DST = (u32) SRC;
-		CONT;
-	ALU_MOV_K:
-		DST = (u32) IMM;
-		CONT;
-	ALU64_MOV_X:
-		DST = SRC;
-		CONT;
-	ALU64_MOV_K:
-		DST = IMM;
-		CONT;
-	ALU64_ARSH_X:
-		(*(s64 *) &DST) >>= SRC;
-		CONT;
-	ALU64_ARSH_K:
-		(*(s64 *) &DST) >>= IMM;
-		CONT;
-	ALU64_MOD_X:
-		if (unlikely(SRC == 0))
-			return 0;
-		tmp = DST;
-		DST = do_div(tmp, SRC);
-		CONT;
-	ALU_MOD_X:
-		if (unlikely(SRC == 0))
-			return 0;
-		tmp = (u32) DST;
-		DST = do_div(tmp, (u32) SRC);
-		CONT;
-	ALU64_MOD_K:
-		tmp = DST;
-		DST = do_div(tmp, IMM);
-		CONT;
-	ALU_MOD_K:
-		tmp = (u32) DST;
-		DST = do_div(tmp, (u32) IMM);
-		CONT;
-	ALU64_DIV_X:
-		if (unlikely(SRC == 0))
-			return 0;
-		do_div(DST, SRC);
-		CONT;
-	ALU_DIV_X:
-		if (unlikely(SRC == 0))
-			return 0;
-		tmp = (u32) DST;
-		do_div(tmp, (u32) SRC);
-		DST = (u32) tmp;
-		CONT;
-	ALU64_DIV_K:
-		do_div(DST, IMM);
-		CONT;
-	ALU_DIV_K:
-		tmp = (u32) DST;
-		do_div(tmp, (u32) IMM);
-		DST = (u32) tmp;
-		CONT;
-	ALU_END_TO_BE:
-		switch (IMM) {
-		case 16:
-			DST = (__force u16) cpu_to_be16(DST);
-			break;
-		case 32:
-			DST = (__force u32) cpu_to_be32(DST);
-			break;
-		case 64:
-			DST = (__force u64) cpu_to_be64(DST);
-			break;
-		}
-		CONT;
-	ALU_END_TO_LE:
-		switch (IMM) {
-		case 16:
-			DST = (__force u16) cpu_to_le16(DST);
-			break;
-		case 32:
-			DST = (__force u32) cpu_to_le32(DST);
-			break;
-		case 64:
-			DST = (__force u64) cpu_to_le64(DST);
-			break;
-		}
-		CONT;
-
-	/* CALL */
-	JMP_CALL:
-		/* Function call scratches BPF_R1-BPF_R5 registers,
-		 * preserves BPF_R6-BPF_R9, and stores return value
-		 * into BPF_R0.
-		 */
-		BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3,
-						       BPF_R4, BPF_R5);
-		CONT;
-
-	/* JMP */
-	JMP_JA:
-		insn += insn->off;
-		CONT;
-	JMP_JEQ_X:
-		if (DST == SRC) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JEQ_K:
-		if (DST == IMM) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JNE_X:
-		if (DST != SRC) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JNE_K:
-		if (DST != IMM) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JGT_X:
-		if (DST > SRC) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JGT_K:
-		if (DST > IMM) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JGE_X:
-		if (DST >= SRC) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JGE_K:
-		if (DST >= IMM) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JSGT_X:
-		if (((s64) DST) > ((s64) SRC)) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JSGT_K:
-		if (((s64) DST) > ((s64) IMM)) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JSGE_X:
-		if (((s64) DST) >= ((s64) SRC)) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JSGE_K:
-		if (((s64) DST) >= ((s64) IMM)) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JSET_X:
-		if (DST & SRC) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JSET_K:
-		if (DST & IMM) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_EXIT:
-		return BPF_R0;
-
-	/* STX and ST and LDX*/
-#define LDST(SIZEOP, SIZE)						\
-	STX_MEM_##SIZEOP:						\
-		*(SIZE *)(unsigned long) (DST + insn->off) = SRC;	\
-		CONT;							\
-	ST_MEM_##SIZEOP:						\
-		*(SIZE *)(unsigned long) (DST + insn->off) = IMM;	\
-		CONT;							\
-	LDX_MEM_##SIZEOP:						\
-		DST = *(SIZE *)(unsigned long) (SRC + insn->off);	\
-		CONT;
-
-	LDST(B,   u8)
-	LDST(H,  u16)
-	LDST(W,  u32)
-	LDST(DW, u64)
-#undef LDST
-	STX_XADD_W: /* lock xadd *(u32 *)(dst_reg + off16) += src_reg */
-		atomic_add((u32) SRC, (atomic_t *)(unsigned long)
-			   (DST + insn->off));
-		CONT;
-	STX_XADD_DW: /* lock xadd *(u64 *)(dst_reg + off16) += src_reg */
-		atomic64_add((u64) SRC, (atomic64_t *)(unsigned long)
-			     (DST + insn->off));
-		CONT;
-	LD_ABS_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + imm32)) */
-		off = IMM;
-load_word:
-		/* BPF_LD + BPD_ABS and BPF_LD + BPF_IND insns are
-		 * only appearing in the programs where ctx ==
-		 * skb. All programs keep 'ctx' in regs[BPF_REG_CTX]
-		 * == BPF_R6, sk_convert_filter() saves it in BPF_R6,
-		 * internal BPF verifier will check that BPF_R6 ==
-		 * ctx.
-		 *
-		 * BPF_ABS and BPF_IND are wrappers of function calls,
-		 * so they scratch BPF_R1-BPF_R5 registers, preserve
-		 * BPF_R6-BPF_R9, and store return value into BPF_R0.
-		 *
-		 * Implicit input:
-		 *   ctx == skb == BPF_R6 == CTX
-		 *
-		 * Explicit input:
-		 *   SRC == any register
-		 *   IMM == 32-bit immediate
-		 *
-		 * Output:
-		 *   BPF_R0 - 8/16/32-bit skb data converted to cpu endianness
-		 */
-
-		ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 4, &tmp);
-		if (likely(ptr != NULL)) {
-			BPF_R0 = get_unaligned_be32(ptr);
-			CONT;
-		}
-
-		return 0;
-	LD_ABS_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + imm32)) */
-		off = IMM;
-load_half:
-		ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 2, &tmp);
-		if (likely(ptr != NULL)) {
-			BPF_R0 = get_unaligned_be16(ptr);
-			CONT;
-		}
-
-		return 0;
-	LD_ABS_B: /* BPF_R0 = *(u8 *) (skb->data + imm32) */
-		off = IMM;
-load_byte:
-		ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 1, &tmp);
-		if (likely(ptr != NULL)) {
-			BPF_R0 = *(u8 *)ptr;
-			CONT;
-		}
-
-		return 0;
-	LD_IND_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + src_reg + imm32)) */
-		off = IMM + SRC;
-		goto load_word;
-	LD_IND_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + src_reg + imm32)) */
-		off = IMM + SRC;
-		goto load_half;
-	LD_IND_B: /* BPF_R0 = *(u8 *) (skb->data + src_reg + imm32) */
-		off = IMM + SRC;
-		goto load_byte;
-
-	default_label:
-		/* If we ever reach this, we have a bug somewhere. */
-		WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
-		return 0;
-}
-
 /* Helper to find the offset of pkt_type in sk_buff structure. We want
  * to make sure its still a 3bit field starting at a byte boundary;
  * taken from arch/x86/net/bpf_jit_comp.c.
@@ -1455,33 +971,6 @@ out_err:
 	return ERR_PTR(err);
 }
 
-void __weak bpf_int_jit_compile(struct sk_filter *prog)
-{
-}
-
-/**
- *	sk_filter_select_runtime - select execution runtime for BPF program
- *	@fp: sk_filter populated with internal BPF program
- *
- * try to JIT internal BPF program, if JIT is not available select interpreter
- * BPF program will be executed via SK_RUN_FILTER() macro
- */
-void sk_filter_select_runtime(struct sk_filter *fp)
-{
-	fp->bpf_func = (void *) __sk_run_filter;
-
-	/* Probe if internal BPF can be JITed */
-	bpf_int_jit_compile(fp);
-}
-EXPORT_SYMBOL_GPL(sk_filter_select_runtime);
-
-/* free internal BPF program */
-void sk_filter_free(struct sk_filter *fp)
-{
-	bpf_jit_free(fp);
-}
-EXPORT_SYMBOL_GPL(sk_filter_free);
-
 static struct sk_filter *__sk_prepare_filter(struct sk_filter *fp,
 					     struct sock *sk)
 {
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 01/16] net: filter: split filter.c into two files
@ 2014-07-18  4:19   ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

BPF is used in several kernel components. This split creates logical boundary
between generic eBPF core and the rest

kernel/bpf/core.c: eBPF interpreter

net/core/filter.c: classic->eBPF converter, classic verifiers, socket filters

This patch only moves functions.

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
---
 kernel/Makefile     |    1 +
 kernel/bpf/Makefile |    1 +
 kernel/bpf/core.c   |  536 +++++++++++++++++++++++++++++++++++++++++++++++++++
 net/core/filter.c   |  511 ------------------------------------------------
 4 files changed, 538 insertions(+), 511 deletions(-)
 create mode 100644 kernel/bpf/Makefile
 create mode 100644 kernel/bpf/core.c

diff --git a/kernel/Makefile b/kernel/Makefile
index f2a8b6246ce9..e7360b7c2c0e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -87,6 +87,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/
 obj-$(CONFIG_TRACEPOINTS) += trace/
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
 obj-$(CONFIG_CPU_PM) += cpu_pm.o
+obj-$(CONFIG_NET) += bpf/
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
new file mode 100644
index 000000000000..6a71145e2769
--- /dev/null
+++ b/kernel/bpf/Makefile
@@ -0,0 +1 @@
+obj-y := core.o
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
new file mode 100644
index 000000000000..77a240a1ce11
--- /dev/null
+++ b/kernel/bpf/core.c
@@ -0,0 +1,536 @@
+/*
+ * Linux Socket Filter - Kernel level socket filtering
+ *
+ * Based on the design of the Berkeley Packet Filter. The new
+ * internal format has been designed by PLUMgrid:
+ *
+ *	Copyright (c) 2011 - 2014 PLUMgrid, http://plumgrid.com
+ *
+ * Authors:
+ *
+ *	Jay Schulist <jschlst-eUNUBHrolfbYtjvyW6yDsg@public.gmane.org>
+ *	Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
+ *	Daniel Borkmann <dborkman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Andi Kleen - Fix a few bad bugs and races.
+ * Kris Katterjohn - Added many additional checks in sk_chk_filter()
+ */
+#include <linux/filter.h>
+#include <linux/skbuff.h>
+#include <asm/unaligned.h>
+
+/* Registers */
+#define BPF_R0	regs[BPF_REG_0]
+#define BPF_R1	regs[BPF_REG_1]
+#define BPF_R2	regs[BPF_REG_2]
+#define BPF_R3	regs[BPF_REG_3]
+#define BPF_R4	regs[BPF_REG_4]
+#define BPF_R5	regs[BPF_REG_5]
+#define BPF_R6	regs[BPF_REG_6]
+#define BPF_R7	regs[BPF_REG_7]
+#define BPF_R8	regs[BPF_REG_8]
+#define BPF_R9	regs[BPF_REG_9]
+#define BPF_R10	regs[BPF_REG_10]
+
+/* Named registers */
+#define DST	regs[insn->dst_reg]
+#define SRC	regs[insn->src_reg]
+#define FP	regs[BPF_REG_FP]
+#define ARG1	regs[BPF_REG_ARG1]
+#define CTX	regs[BPF_REG_CTX]
+#define IMM	insn->imm
+
+/* No hurry in this branch
+ *
+ * Exported for the bpf jit load helper.
+ */
+void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, unsigned int size)
+{
+	u8 *ptr = NULL;
+
+	if (k >= SKF_NET_OFF)
+		ptr = skb_network_header(skb) + k - SKF_NET_OFF;
+	else if (k >= SKF_LL_OFF)
+		ptr = skb_mac_header(skb) + k - SKF_LL_OFF;
+	if (ptr >= skb->head && ptr + size <= skb_tail_pointer(skb))
+		return ptr;
+
+	return NULL;
+}
+
+/* Base function for offset calculation. Needs to go into .text section,
+ * therefore keeping it non-static as well; will also be used by JITs
+ * anyway later on, so do not let the compiler omit it.
+ */
+noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	return 0;
+}
+
+/**
+ *	__sk_run_filter - run a filter on a given context
+ *	@ctx: buffer to run the filter on
+ *	@insn: filter to apply
+ *
+ * Decode and apply filter instructions to the skb->data. Return length to
+ * keep, 0 for none. @ctx is the data we are operating on, @insn is the
+ * array of filter instructions.
+ */
+static unsigned int __sk_run_filter(void *ctx, const struct sock_filter_int *insn)
+{
+	u64 stack[MAX_BPF_STACK / sizeof(u64)];
+	u64 regs[MAX_BPF_REG], tmp;
+	static const void *jumptable[256] = {
+		[0 ... 255] = &&default_label,
+		/* Now overwrite non-defaults ... */
+		/* 32 bit ALU operations */
+		[BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
+		[BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
+		[BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,
+		[BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,
+		[BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X,
+		[BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K,
+		[BPF_ALU | BPF_OR | BPF_X]  = &&ALU_OR_X,
+		[BPF_ALU | BPF_OR | BPF_K]  = &&ALU_OR_K,
+		[BPF_ALU | BPF_LSH | BPF_X] = &&ALU_LSH_X,
+		[BPF_ALU | BPF_LSH | BPF_K] = &&ALU_LSH_K,
+		[BPF_ALU | BPF_RSH | BPF_X] = &&ALU_RSH_X,
+		[BPF_ALU | BPF_RSH | BPF_K] = &&ALU_RSH_K,
+		[BPF_ALU | BPF_XOR | BPF_X] = &&ALU_XOR_X,
+		[BPF_ALU | BPF_XOR | BPF_K] = &&ALU_XOR_K,
+		[BPF_ALU | BPF_MUL | BPF_X] = &&ALU_MUL_X,
+		[BPF_ALU | BPF_MUL | BPF_K] = &&ALU_MUL_K,
+		[BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X,
+		[BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K,
+		[BPF_ALU | BPF_DIV | BPF_X] = &&ALU_DIV_X,
+		[BPF_ALU | BPF_DIV | BPF_K] = &&ALU_DIV_K,
+		[BPF_ALU | BPF_MOD | BPF_X] = &&ALU_MOD_X,
+		[BPF_ALU | BPF_MOD | BPF_K] = &&ALU_MOD_K,
+		[BPF_ALU | BPF_NEG] = &&ALU_NEG,
+		[BPF_ALU | BPF_END | BPF_TO_BE] = &&ALU_END_TO_BE,
+		[BPF_ALU | BPF_END | BPF_TO_LE] = &&ALU_END_TO_LE,
+		/* 64 bit ALU operations */
+		[BPF_ALU64 | BPF_ADD | BPF_X] = &&ALU64_ADD_X,
+		[BPF_ALU64 | BPF_ADD | BPF_K] = &&ALU64_ADD_K,
+		[BPF_ALU64 | BPF_SUB | BPF_X] = &&ALU64_SUB_X,
+		[BPF_ALU64 | BPF_SUB | BPF_K] = &&ALU64_SUB_K,
+		[BPF_ALU64 | BPF_AND | BPF_X] = &&ALU64_AND_X,
+		[BPF_ALU64 | BPF_AND | BPF_K] = &&ALU64_AND_K,
+		[BPF_ALU64 | BPF_OR | BPF_X] = &&ALU64_OR_X,
+		[BPF_ALU64 | BPF_OR | BPF_K] = &&ALU64_OR_K,
+		[BPF_ALU64 | BPF_LSH | BPF_X] = &&ALU64_LSH_X,
+		[BPF_ALU64 | BPF_LSH | BPF_K] = &&ALU64_LSH_K,
+		[BPF_ALU64 | BPF_RSH | BPF_X] = &&ALU64_RSH_X,
+		[BPF_ALU64 | BPF_RSH | BPF_K] = &&ALU64_RSH_K,
+		[BPF_ALU64 | BPF_XOR | BPF_X] = &&ALU64_XOR_X,
+		[BPF_ALU64 | BPF_XOR | BPF_K] = &&ALU64_XOR_K,
+		[BPF_ALU64 | BPF_MUL | BPF_X] = &&ALU64_MUL_X,
+		[BPF_ALU64 | BPF_MUL | BPF_K] = &&ALU64_MUL_K,
+		[BPF_ALU64 | BPF_MOV | BPF_X] = &&ALU64_MOV_X,
+		[BPF_ALU64 | BPF_MOV | BPF_K] = &&ALU64_MOV_K,
+		[BPF_ALU64 | BPF_ARSH | BPF_X] = &&ALU64_ARSH_X,
+		[BPF_ALU64 | BPF_ARSH | BPF_K] = &&ALU64_ARSH_K,
+		[BPF_ALU64 | BPF_DIV | BPF_X] = &&ALU64_DIV_X,
+		[BPF_ALU64 | BPF_DIV | BPF_K] = &&ALU64_DIV_K,
+		[BPF_ALU64 | BPF_MOD | BPF_X] = &&ALU64_MOD_X,
+		[BPF_ALU64 | BPF_MOD | BPF_K] = &&ALU64_MOD_K,
+		[BPF_ALU64 | BPF_NEG] = &&ALU64_NEG,
+		/* Call instruction */
+		[BPF_JMP | BPF_CALL] = &&JMP_CALL,
+		/* Jumps */
+		[BPF_JMP | BPF_JA] = &&JMP_JA,
+		[BPF_JMP | BPF_JEQ | BPF_X] = &&JMP_JEQ_X,
+		[BPF_JMP | BPF_JEQ | BPF_K] = &&JMP_JEQ_K,
+		[BPF_JMP | BPF_JNE | BPF_X] = &&JMP_JNE_X,
+		[BPF_JMP | BPF_JNE | BPF_K] = &&JMP_JNE_K,
+		[BPF_JMP | BPF_JGT | BPF_X] = &&JMP_JGT_X,
+		[BPF_JMP | BPF_JGT | BPF_K] = &&JMP_JGT_K,
+		[BPF_JMP | BPF_JGE | BPF_X] = &&JMP_JGE_X,
+		[BPF_JMP | BPF_JGE | BPF_K] = &&JMP_JGE_K,
+		[BPF_JMP | BPF_JSGT | BPF_X] = &&JMP_JSGT_X,
+		[BPF_JMP | BPF_JSGT | BPF_K] = &&JMP_JSGT_K,
+		[BPF_JMP | BPF_JSGE | BPF_X] = &&JMP_JSGE_X,
+		[BPF_JMP | BPF_JSGE | BPF_K] = &&JMP_JSGE_K,
+		[BPF_JMP | BPF_JSET | BPF_X] = &&JMP_JSET_X,
+		[BPF_JMP | BPF_JSET | BPF_K] = &&JMP_JSET_K,
+		/* Program return */
+		[BPF_JMP | BPF_EXIT] = &&JMP_EXIT,
+		/* Store instructions */
+		[BPF_STX | BPF_MEM | BPF_B] = &&STX_MEM_B,
+		[BPF_STX | BPF_MEM | BPF_H] = &&STX_MEM_H,
+		[BPF_STX | BPF_MEM | BPF_W] = &&STX_MEM_W,
+		[BPF_STX | BPF_MEM | BPF_DW] = &&STX_MEM_DW,
+		[BPF_STX | BPF_XADD | BPF_W] = &&STX_XADD_W,
+		[BPF_STX | BPF_XADD | BPF_DW] = &&STX_XADD_DW,
+		[BPF_ST | BPF_MEM | BPF_B] = &&ST_MEM_B,
+		[BPF_ST | BPF_MEM | BPF_H] = &&ST_MEM_H,
+		[BPF_ST | BPF_MEM | BPF_W] = &&ST_MEM_W,
+		[BPF_ST | BPF_MEM | BPF_DW] = &&ST_MEM_DW,
+		/* Load instructions */
+		[BPF_LDX | BPF_MEM | BPF_B] = &&LDX_MEM_B,
+		[BPF_LDX | BPF_MEM | BPF_H] = &&LDX_MEM_H,
+		[BPF_LDX | BPF_MEM | BPF_W] = &&LDX_MEM_W,
+		[BPF_LDX | BPF_MEM | BPF_DW] = &&LDX_MEM_DW,
+		[BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,
+		[BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,
+		[BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,
+		[BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
+		[BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
+		[BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
+	};
+	void *ptr;
+	int off;
+
+#define CONT	 ({ insn++; goto select_insn; })
+#define CONT_JMP ({ insn++; goto select_insn; })
+
+	FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)];
+	ARG1 = (u64) (unsigned long) ctx;
+
+	/* Registers used in classic BPF programs need to be reset first. */
+	regs[BPF_REG_A] = 0;
+	regs[BPF_REG_X] = 0;
+
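+	/* Dispatch via computed goto: insn->code is an 8-bit opcode used to
+	 * index the jump table above; opcodes without a handler fall through
+	 * to default_label.
+	 */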
+select_insn:
+	goto *jumptable[insn->code];
+
+	/* ALU */
+#define ALU(OPCODE, OP)			\
+	ALU64_##OPCODE##_X:		\
+		DST = DST OP SRC;	\
+		CONT;			\
+	ALU_##OPCODE##_X:		\
+		DST = (u32) DST OP (u32) SRC;	\
+		CONT;			\
+	ALU64_##OPCODE##_K:		\
+		DST = DST OP IMM;		\
+		CONT;			\
+	ALU_##OPCODE##_K:		\
+		DST = (u32) DST OP (u32) IMM;	\
+		CONT;
+
+	ALU(ADD,  +)
+	ALU(SUB,  -)
+	ALU(AND,  &)
+	ALU(OR,   |)
+	ALU(LSH, <<)
+	ALU(RSH, >>)
+	ALU(XOR,  ^)
+	ALU(MUL,  *)
+#undef ALU
+	ALU_NEG:
+		DST = (u32) -DST;
+		CONT;
+	ALU64_NEG:
+		DST = -DST;
+		CONT;
+	ALU_MOV_X:
+		DST = (u32) SRC;
+		CONT;
+	ALU_MOV_K:
+		DST = (u32) IMM;
+		CONT;
+	ALU64_MOV_X:
+		DST = SRC;
+		CONT;
+	ALU64_MOV_K:
+		DST = IMM;
+		CONT;
+	ALU64_ARSH_X:
+		(*(s64 *) &DST) >>= SRC;
+		CONT;
+	ALU64_ARSH_K:
+		(*(s64 *) &DST) >>= IMM;
+		CONT;
+	ALU64_MOD_X:
+		if (unlikely(SRC == 0))
+			return 0;
+		tmp = DST;
+		DST = do_div(tmp, SRC);
+		CONT;
+	ALU_MOD_X:
+		if (unlikely(SRC == 0))
+			return 0;
+		tmp = (u32) DST;
+		DST = do_div(tmp, (u32) SRC);
+		CONT;
+	ALU64_MOD_K:
+		tmp = DST;
+		DST = do_div(tmp, IMM);
+		CONT;
+	ALU_MOD_K:
+		tmp = (u32) DST;
+		DST = do_div(tmp, (u32) IMM);
+		CONT;
+	ALU64_DIV_X:
+		if (unlikely(SRC == 0))
+			return 0;
+		do_div(DST, SRC);
+		CONT;
+	ALU_DIV_X:
+		if (unlikely(SRC == 0))
+			return 0;
+		tmp = (u32) DST;
+		do_div(tmp, (u32) SRC);
+		DST = (u32) tmp;
+		CONT;
+	ALU64_DIV_K:
+		do_div(DST, IMM);
+		CONT;
+	ALU_DIV_K:
+		tmp = (u32) DST;
+		do_div(tmp, (u32) IMM);
+		DST = (u32) tmp;
+		CONT;
+	ALU_END_TO_BE:
+		switch (IMM) {
+		case 16:
+			DST = (__force u16) cpu_to_be16(DST);
+			break;
+		case 32:
+			DST = (__force u32) cpu_to_be32(DST);
+			break;
+		case 64:
+			DST = (__force u64) cpu_to_be64(DST);
+			break;
+		}
+		CONT;
+	ALU_END_TO_LE:
+		switch (IMM) {
+		case 16:
+			DST = (__force u16) cpu_to_le16(DST);
+			break;
+		case 32:
+			DST = (__force u32) cpu_to_le32(DST);
+			break;
+		case 64:
+			DST = (__force u64) cpu_to_le64(DST);
+			break;
+		}
+		CONT;
+
+	/* CALL */
+	JMP_CALL:
+		/* Function call scratches BPF_R1-BPF_R5 registers,
+		 * preserves BPF_R6-BPF_R9, and stores return value
+		 * into BPF_R0.
+		 */
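+		/* insn->imm holds the helper's offset relative to
+		 * __bpf_call_base (see BPF_EMIT_CALL()), so a 32-bit
+		 * immediate is enough to address any helper function.
+		 */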
+		BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3,
+						       BPF_R4, BPF_R5);
+		CONT;
+
+	/* JMP */
+	JMP_JA:
+		insn += insn->off;
+		CONT;
+	JMP_JEQ_X:
+		if (DST == SRC) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JEQ_K:
+		if (DST == IMM) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JNE_X:
+		if (DST != SRC) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JNE_K:
+		if (DST != IMM) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JGT_X:
+		if (DST > SRC) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JGT_K:
+		if (DST > IMM) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JGE_X:
+		if (DST >= SRC) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JGE_K:
+		if (DST >= IMM) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JSGT_X:
+		if (((s64) DST) > ((s64) SRC)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JSGT_K:
+		if (((s64) DST) > ((s64) IMM)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JSGE_X:
+		if (((s64) DST) >= ((s64) SRC)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JSGE_K:
+		if (((s64) DST) >= ((s64) IMM)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JSET_X:
+		if (DST & SRC) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_JSET_K:
+		if (DST & IMM) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	JMP_EXIT:
+		return BPF_R0;
+
+	/* STX and ST and LDX */
+#define LDST(SIZEOP, SIZE)						\
+	STX_MEM_##SIZEOP:						\
+		*(SIZE *)(unsigned long) (DST + insn->off) = SRC;	\
+		CONT;							\
+	ST_MEM_##SIZEOP:						\
+		*(SIZE *)(unsigned long) (DST + insn->off) = IMM;	\
+		CONT;							\
+	LDX_MEM_##SIZEOP:						\
+		DST = *(SIZE *)(unsigned long) (SRC + insn->off);	\
+		CONT;
+
+	LDST(B,   u8)
+	LDST(H,  u16)
+	LDST(W,  u32)
+	LDST(DW, u64)
+#undef LDST
+	STX_XADD_W: /* lock xadd *(u32 *)(dst_reg + off16) += src_reg */
+		atomic_add((u32) SRC, (atomic_t *)(unsigned long)
+			   (DST + insn->off));
+		CONT;
+	STX_XADD_DW: /* lock xadd *(u64 *)(dst_reg + off16) += src_reg */
+		atomic64_add((u64) SRC, (atomic64_t *)(unsigned long)
+			     (DST + insn->off));
+		CONT;
+	LD_ABS_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + imm32)) */
+		off = IMM;
+load_word:
+		/* BPF_LD + BPF_ABS and BPF_LD + BPF_IND insns are
+		 * only appearing in the programs where ctx ==
+		 * skb. All programs keep 'ctx' in regs[BPF_REG_CTX]
+		 * == BPF_R6, sk_convert_filter() saves it in BPF_R6,
+		 * internal BPF verifier will check that BPF_R6 ==
+		 * ctx.
+		 *
+		 * BPF_ABS and BPF_IND are wrappers of function calls,
+		 * so they scratch BPF_R1-BPF_R5 registers, preserve
+		 * BPF_R6-BPF_R9, and store return value into BPF_R0.
+		 *
+		 * Implicit input:
+		 *   ctx == skb == BPF_R6 == CTX
+		 *
+		 * Explicit input:
+		 *   SRC == any register
+		 *   IMM == 32-bit immediate
+		 *
+		 * Output:
+		 *   BPF_R0 - 8/16/32-bit skb data converted to cpu endianness
+		 */
+
+		ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 4, &tmp);
+		if (likely(ptr != NULL)) {
+			BPF_R0 = get_unaligned_be32(ptr);
+			CONT;
+		}
+
+		return 0;
+	LD_ABS_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + imm32)) */
+		off = IMM;
+load_half:
+		ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 2, &tmp);
+		if (likely(ptr != NULL)) {
+			BPF_R0 = get_unaligned_be16(ptr);
+			CONT;
+		}
+
+		return 0;
+	LD_ABS_B: /* BPF_R0 = *(u8 *) (skb->data + imm32) */
+		off = IMM;
+load_byte:
+		ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 1, &tmp);
+		if (likely(ptr != NULL)) {
+			BPF_R0 = *(u8 *)ptr;
+			CONT;
+		}
+
+		return 0;
+	LD_IND_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + src_reg + imm32)) */
+		off = IMM + SRC;
+		goto load_word;
+	LD_IND_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + src_reg + imm32)) */
+		off = IMM + SRC;
+		goto load_half;
+	LD_IND_B: /* BPF_R0 = *(u8 *) (skb->data + src_reg + imm32) */
+		off = IMM + SRC;
+		goto load_byte;
+
+	default_label:
+		/* If we ever reach this, we have a bug somewhere. */
+		WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
+		return 0;
+}
+
+void __weak bpf_int_jit_compile(struct sk_filter *prog)
+{
+}
+
+/**
+ *	sk_filter_select_runtime - select execution runtime for BPF program
+ *	@fp: sk_filter populated with internal BPF program
+ *
+ * Try to JIT the internal BPF program; if no JIT is available, fall back to
+ * the interpreter. The program will then be executed via the SK_RUN_FILTER()
+ * macro.
+ */
+void sk_filter_select_runtime(struct sk_filter *fp)
+{
+	fp->bpf_func = (void *) __sk_run_filter;
+
+	/* Probe if internal BPF can be JITed */
+	bpf_int_jit_compile(fp);
+}
+EXPORT_SYMBOL_GPL(sk_filter_select_runtime);
+
+/* free internal BPF program */
+void sk_filter_free(struct sk_filter *fp)
+{
+	bpf_jit_free(fp);
+}
+EXPORT_SYMBOL_GPL(sk_filter_free);
diff --git a/net/core/filter.c b/net/core/filter.c
index b90ae7fb3b89..1d0e9492e4fa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -45,45 +45,6 @@
 #include <linux/seccomp.h>
 #include <linux/if_vlan.h>
 
-/* Registers */
-#define BPF_R0	regs[BPF_REG_0]
-#define BPF_R1	regs[BPF_REG_1]
-#define BPF_R2	regs[BPF_REG_2]
-#define BPF_R3	regs[BPF_REG_3]
-#define BPF_R4	regs[BPF_REG_4]
-#define BPF_R5	regs[BPF_REG_5]
-#define BPF_R6	regs[BPF_REG_6]
-#define BPF_R7	regs[BPF_REG_7]
-#define BPF_R8	regs[BPF_REG_8]
-#define BPF_R9	regs[BPF_REG_9]
-#define BPF_R10	regs[BPF_REG_10]
-
-/* Named registers */
-#define DST	regs[insn->dst_reg]
-#define SRC	regs[insn->src_reg]
-#define FP	regs[BPF_REG_FP]
-#define ARG1	regs[BPF_REG_ARG1]
-#define CTX	regs[BPF_REG_CTX]
-#define IMM	insn->imm
-
-/* No hurry in this branch
- *
- * Exported for the bpf jit load helper.
- */
-void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, unsigned int size)
-{
-	u8 *ptr = NULL;
-
-	if (k >= SKF_NET_OFF)
-		ptr = skb_network_header(skb) + k - SKF_NET_OFF;
-	else if (k >= SKF_LL_OFF)
-		ptr = skb_mac_header(skb) + k - SKF_LL_OFF;
-	if (ptr >= skb->head && ptr + size <= skb_tail_pointer(skb))
-		return ptr;
-
-	return NULL;
-}
-
 /**
  *	sk_filter - run a packet through a socket filter
  *	@sk: sock associated with &sk_buff
@@ -126,451 +87,6 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sk_filter);
 
-/* Base function for offset calculation. Needs to go into .text section,
- * therefore keeping it non-static as well; will also be used by JITs
- * anyway later on, so do not let the compiler omit it.
- */
-noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
-{
-	return 0;
-}
-
-/**
- *	__sk_run_filter - run a filter on a given context
- *	@ctx: buffer to run the filter on
- *	@insn: filter to apply
- *
- * Decode and apply filter instructions to the skb->data. Return length to
- * keep, 0 for none. @ctx is the data we are operating on, @insn is the
- * array of filter instructions.
- */
-static unsigned int __sk_run_filter(void *ctx, const struct sock_filter_int *insn)
-{
-	u64 stack[MAX_BPF_STACK / sizeof(u64)];
-	u64 regs[MAX_BPF_REG], tmp;
-	static const void *jumptable[256] = {
-		[0 ... 255] = &&default_label,
-		/* Now overwrite non-defaults ... */
-		/* 32 bit ALU operations */
-		[BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
-		[BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
-		[BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,
-		[BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,
-		[BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X,
-		[BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K,
-		[BPF_ALU | BPF_OR | BPF_X]  = &&ALU_OR_X,
-		[BPF_ALU | BPF_OR | BPF_K]  = &&ALU_OR_K,
-		[BPF_ALU | BPF_LSH | BPF_X] = &&ALU_LSH_X,
-		[BPF_ALU | BPF_LSH | BPF_K] = &&ALU_LSH_K,
-		[BPF_ALU | BPF_RSH | BPF_X] = &&ALU_RSH_X,
-		[BPF_ALU | BPF_RSH | BPF_K] = &&ALU_RSH_K,
-		[BPF_ALU | BPF_XOR | BPF_X] = &&ALU_XOR_X,
-		[BPF_ALU | BPF_XOR | BPF_K] = &&ALU_XOR_K,
-		[BPF_ALU | BPF_MUL | BPF_X] = &&ALU_MUL_X,
-		[BPF_ALU | BPF_MUL | BPF_K] = &&ALU_MUL_K,
-		[BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X,
-		[BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K,
-		[BPF_ALU | BPF_DIV | BPF_X] = &&ALU_DIV_X,
-		[BPF_ALU | BPF_DIV | BPF_K] = &&ALU_DIV_K,
-		[BPF_ALU | BPF_MOD | BPF_X] = &&ALU_MOD_X,
-		[BPF_ALU | BPF_MOD | BPF_K] = &&ALU_MOD_K,
-		[BPF_ALU | BPF_NEG] = &&ALU_NEG,
-		[BPF_ALU | BPF_END | BPF_TO_BE] = &&ALU_END_TO_BE,
-		[BPF_ALU | BPF_END | BPF_TO_LE] = &&ALU_END_TO_LE,
-		/* 64 bit ALU operations */
-		[BPF_ALU64 | BPF_ADD | BPF_X] = &&ALU64_ADD_X,
-		[BPF_ALU64 | BPF_ADD | BPF_K] = &&ALU64_ADD_K,
-		[BPF_ALU64 | BPF_SUB | BPF_X] = &&ALU64_SUB_X,
-		[BPF_ALU64 | BPF_SUB | BPF_K] = &&ALU64_SUB_K,
-		[BPF_ALU64 | BPF_AND | BPF_X] = &&ALU64_AND_X,
-		[BPF_ALU64 | BPF_AND | BPF_K] = &&ALU64_AND_K,
-		[BPF_ALU64 | BPF_OR | BPF_X] = &&ALU64_OR_X,
-		[BPF_ALU64 | BPF_OR | BPF_K] = &&ALU64_OR_K,
-		[BPF_ALU64 | BPF_LSH | BPF_X] = &&ALU64_LSH_X,
-		[BPF_ALU64 | BPF_LSH | BPF_K] = &&ALU64_LSH_K,
-		[BPF_ALU64 | BPF_RSH | BPF_X] = &&ALU64_RSH_X,
-		[BPF_ALU64 | BPF_RSH | BPF_K] = &&ALU64_RSH_K,
-		[BPF_ALU64 | BPF_XOR | BPF_X] = &&ALU64_XOR_X,
-		[BPF_ALU64 | BPF_XOR | BPF_K] = &&ALU64_XOR_K,
-		[BPF_ALU64 | BPF_MUL | BPF_X] = &&ALU64_MUL_X,
-		[BPF_ALU64 | BPF_MUL | BPF_K] = &&ALU64_MUL_K,
-		[BPF_ALU64 | BPF_MOV | BPF_X] = &&ALU64_MOV_X,
-		[BPF_ALU64 | BPF_MOV | BPF_K] = &&ALU64_MOV_K,
-		[BPF_ALU64 | BPF_ARSH | BPF_X] = &&ALU64_ARSH_X,
-		[BPF_ALU64 | BPF_ARSH | BPF_K] = &&ALU64_ARSH_K,
-		[BPF_ALU64 | BPF_DIV | BPF_X] = &&ALU64_DIV_X,
-		[BPF_ALU64 | BPF_DIV | BPF_K] = &&ALU64_DIV_K,
-		[BPF_ALU64 | BPF_MOD | BPF_X] = &&ALU64_MOD_X,
-		[BPF_ALU64 | BPF_MOD | BPF_K] = &&ALU64_MOD_K,
-		[BPF_ALU64 | BPF_NEG] = &&ALU64_NEG,
-		/* Call instruction */
-		[BPF_JMP | BPF_CALL] = &&JMP_CALL,
-		/* Jumps */
-		[BPF_JMP | BPF_JA] = &&JMP_JA,
-		[BPF_JMP | BPF_JEQ | BPF_X] = &&JMP_JEQ_X,
-		[BPF_JMP | BPF_JEQ | BPF_K] = &&JMP_JEQ_K,
-		[BPF_JMP | BPF_JNE | BPF_X] = &&JMP_JNE_X,
-		[BPF_JMP | BPF_JNE | BPF_K] = &&JMP_JNE_K,
-		[BPF_JMP | BPF_JGT | BPF_X] = &&JMP_JGT_X,
-		[BPF_JMP | BPF_JGT | BPF_K] = &&JMP_JGT_K,
-		[BPF_JMP | BPF_JGE | BPF_X] = &&JMP_JGE_X,
-		[BPF_JMP | BPF_JGE | BPF_K] = &&JMP_JGE_K,
-		[BPF_JMP | BPF_JSGT | BPF_X] = &&JMP_JSGT_X,
-		[BPF_JMP | BPF_JSGT | BPF_K] = &&JMP_JSGT_K,
-		[BPF_JMP | BPF_JSGE | BPF_X] = &&JMP_JSGE_X,
-		[BPF_JMP | BPF_JSGE | BPF_K] = &&JMP_JSGE_K,
-		[BPF_JMP | BPF_JSET | BPF_X] = &&JMP_JSET_X,
-		[BPF_JMP | BPF_JSET | BPF_K] = &&JMP_JSET_K,
-		/* Program return */
-		[BPF_JMP | BPF_EXIT] = &&JMP_EXIT,
-		/* Store instructions */
-		[BPF_STX | BPF_MEM | BPF_B] = &&STX_MEM_B,
-		[BPF_STX | BPF_MEM | BPF_H] = &&STX_MEM_H,
-		[BPF_STX | BPF_MEM | BPF_W] = &&STX_MEM_W,
-		[BPF_STX | BPF_MEM | BPF_DW] = &&STX_MEM_DW,
-		[BPF_STX | BPF_XADD | BPF_W] = &&STX_XADD_W,
-		[BPF_STX | BPF_XADD | BPF_DW] = &&STX_XADD_DW,
-		[BPF_ST | BPF_MEM | BPF_B] = &&ST_MEM_B,
-		[BPF_ST | BPF_MEM | BPF_H] = &&ST_MEM_H,
-		[BPF_ST | BPF_MEM | BPF_W] = &&ST_MEM_W,
-		[BPF_ST | BPF_MEM | BPF_DW] = &&ST_MEM_DW,
-		/* Load instructions */
-		[BPF_LDX | BPF_MEM | BPF_B] = &&LDX_MEM_B,
-		[BPF_LDX | BPF_MEM | BPF_H] = &&LDX_MEM_H,
-		[BPF_LDX | BPF_MEM | BPF_W] = &&LDX_MEM_W,
-		[BPF_LDX | BPF_MEM | BPF_DW] = &&LDX_MEM_DW,
-		[BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W,
-		[BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H,
-		[BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B,
-		[BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W,
-		[BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H,
-		[BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B,
-	};
-	void *ptr;
-	int off;
-
-#define CONT	 ({ insn++; goto select_insn; })
-#define CONT_JMP ({ insn++; goto select_insn; })
-
-	FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)];
-	ARG1 = (u64) (unsigned long) ctx;
-
-	/* Registers used in classic BPF programs need to be reset first. */
-	regs[BPF_REG_A] = 0;
-	regs[BPF_REG_X] = 0;
-
-select_insn:
-	goto *jumptable[insn->code];
-
-	/* ALU */
-#define ALU(OPCODE, OP)			\
-	ALU64_##OPCODE##_X:		\
-		DST = DST OP SRC;	\
-		CONT;			\
-	ALU_##OPCODE##_X:		\
-		DST = (u32) DST OP (u32) SRC;	\
-		CONT;			\
-	ALU64_##OPCODE##_K:		\
-		DST = DST OP IMM;		\
-		CONT;			\
-	ALU_##OPCODE##_K:		\
-		DST = (u32) DST OP (u32) IMM;	\
-		CONT;
-
-	ALU(ADD,  +)
-	ALU(SUB,  -)
-	ALU(AND,  &)
-	ALU(OR,   |)
-	ALU(LSH, <<)
-	ALU(RSH, >>)
-	ALU(XOR,  ^)
-	ALU(MUL,  *)
-#undef ALU
-	ALU_NEG:
-		DST = (u32) -DST;
-		CONT;
-	ALU64_NEG:
-		DST = -DST;
-		CONT;
-	ALU_MOV_X:
-		DST = (u32) SRC;
-		CONT;
-	ALU_MOV_K:
-		DST = (u32) IMM;
-		CONT;
-	ALU64_MOV_X:
-		DST = SRC;
-		CONT;
-	ALU64_MOV_K:
-		DST = IMM;
-		CONT;
-	ALU64_ARSH_X:
-		(*(s64 *) &DST) >>= SRC;
-		CONT;
-	ALU64_ARSH_K:
-		(*(s64 *) &DST) >>= IMM;
-		CONT;
-	ALU64_MOD_X:
-		if (unlikely(SRC == 0))
-			return 0;
-		tmp = DST;
-		DST = do_div(tmp, SRC);
-		CONT;
-	ALU_MOD_X:
-		if (unlikely(SRC == 0))
-			return 0;
-		tmp = (u32) DST;
-		DST = do_div(tmp, (u32) SRC);
-		CONT;
-	ALU64_MOD_K:
-		tmp = DST;
-		DST = do_div(tmp, IMM);
-		CONT;
-	ALU_MOD_K:
-		tmp = (u32) DST;
-		DST = do_div(tmp, (u32) IMM);
-		CONT;
-	ALU64_DIV_X:
-		if (unlikely(SRC == 0))
-			return 0;
-		do_div(DST, SRC);
-		CONT;
-	ALU_DIV_X:
-		if (unlikely(SRC == 0))
-			return 0;
-		tmp = (u32) DST;
-		do_div(tmp, (u32) SRC);
-		DST = (u32) tmp;
-		CONT;
-	ALU64_DIV_K:
-		do_div(DST, IMM);
-		CONT;
-	ALU_DIV_K:
-		tmp = (u32) DST;
-		do_div(tmp, (u32) IMM);
-		DST = (u32) tmp;
-		CONT;
-	ALU_END_TO_BE:
-		switch (IMM) {
-		case 16:
-			DST = (__force u16) cpu_to_be16(DST);
-			break;
-		case 32:
-			DST = (__force u32) cpu_to_be32(DST);
-			break;
-		case 64:
-			DST = (__force u64) cpu_to_be64(DST);
-			break;
-		}
-		CONT;
-	ALU_END_TO_LE:
-		switch (IMM) {
-		case 16:
-			DST = (__force u16) cpu_to_le16(DST);
-			break;
-		case 32:
-			DST = (__force u32) cpu_to_le32(DST);
-			break;
-		case 64:
-			DST = (__force u64) cpu_to_le64(DST);
-			break;
-		}
-		CONT;
-
-	/* CALL */
-	JMP_CALL:
-		/* Function call scratches BPF_R1-BPF_R5 registers,
-		 * preserves BPF_R6-BPF_R9, and stores return value
-		 * into BPF_R0.
-		 */
-		BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3,
-						       BPF_R4, BPF_R5);
-		CONT;
-
-	/* JMP */
-	JMP_JA:
-		insn += insn->off;
-		CONT;
-	JMP_JEQ_X:
-		if (DST == SRC) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JEQ_K:
-		if (DST == IMM) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JNE_X:
-		if (DST != SRC) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JNE_K:
-		if (DST != IMM) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JGT_X:
-		if (DST > SRC) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JGT_K:
-		if (DST > IMM) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JGE_X:
-		if (DST >= SRC) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JGE_K:
-		if (DST >= IMM) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JSGT_X:
-		if (((s64) DST) > ((s64) SRC)) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JSGT_K:
-		if (((s64) DST) > ((s64) IMM)) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JSGE_X:
-		if (((s64) DST) >= ((s64) SRC)) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JSGE_K:
-		if (((s64) DST) >= ((s64) IMM)) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JSET_X:
-		if (DST & SRC) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_JSET_K:
-		if (DST & IMM) {
-			insn += insn->off;
-			CONT_JMP;
-		}
-		CONT;
-	JMP_EXIT:
-		return BPF_R0;
-
-	/* STX and ST and LDX*/
-#define LDST(SIZEOP, SIZE)						\
-	STX_MEM_##SIZEOP:						\
-		*(SIZE *)(unsigned long) (DST + insn->off) = SRC;	\
-		CONT;							\
-	ST_MEM_##SIZEOP:						\
-		*(SIZE *)(unsigned long) (DST + insn->off) = IMM;	\
-		CONT;							\
-	LDX_MEM_##SIZEOP:						\
-		DST = *(SIZE *)(unsigned long) (SRC + insn->off);	\
-		CONT;
-
-	LDST(B,   u8)
-	LDST(H,  u16)
-	LDST(W,  u32)
-	LDST(DW, u64)
-#undef LDST
-	STX_XADD_W: /* lock xadd *(u32 *)(dst_reg + off16) += src_reg */
-		atomic_add((u32) SRC, (atomic_t *)(unsigned long)
-			   (DST + insn->off));
-		CONT;
-	STX_XADD_DW: /* lock xadd *(u64 *)(dst_reg + off16) += src_reg */
-		atomic64_add((u64) SRC, (atomic64_t *)(unsigned long)
-			     (DST + insn->off));
-		CONT;
-	LD_ABS_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + imm32)) */
-		off = IMM;
-load_word:
-		/* BPF_LD + BPD_ABS and BPF_LD + BPF_IND insns are
-		 * only appearing in the programs where ctx ==
-		 * skb. All programs keep 'ctx' in regs[BPF_REG_CTX]
-		 * == BPF_R6, sk_convert_filter() saves it in BPF_R6,
-		 * internal BPF verifier will check that BPF_R6 ==
-		 * ctx.
-		 *
-		 * BPF_ABS and BPF_IND are wrappers of function calls,
-		 * so they scratch BPF_R1-BPF_R5 registers, preserve
-		 * BPF_R6-BPF_R9, and store return value into BPF_R0.
-		 *
-		 * Implicit input:
-		 *   ctx == skb == BPF_R6 == CTX
-		 *
-		 * Explicit input:
-		 *   SRC == any register
-		 *   IMM == 32-bit immediate
-		 *
-		 * Output:
-		 *   BPF_R0 - 8/16/32-bit skb data converted to cpu endianness
-		 */
-
-		ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 4, &tmp);
-		if (likely(ptr != NULL)) {
-			BPF_R0 = get_unaligned_be32(ptr);
-			CONT;
-		}
-
-		return 0;
-	LD_ABS_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + imm32)) */
-		off = IMM;
-load_half:
-		ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 2, &tmp);
-		if (likely(ptr != NULL)) {
-			BPF_R0 = get_unaligned_be16(ptr);
-			CONT;
-		}
-
-		return 0;
-	LD_ABS_B: /* BPF_R0 = *(u8 *) (skb->data + imm32) */
-		off = IMM;
-load_byte:
-		ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 1, &tmp);
-		if (likely(ptr != NULL)) {
-			BPF_R0 = *(u8 *)ptr;
-			CONT;
-		}
-
-		return 0;
-	LD_IND_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + src_reg + imm32)) */
-		off = IMM + SRC;
-		goto load_word;
-	LD_IND_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + src_reg + imm32)) */
-		off = IMM + SRC;
-		goto load_half;
-	LD_IND_B: /* BPF_R0 = *(u8 *) (skb->data + src_reg + imm32) */
-		off = IMM + SRC;
-		goto load_byte;
-
-	default_label:
-		/* If we ever reach this, we have a bug somewhere. */
-		WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
-		return 0;
-}
-
 /* Helper to find the offset of pkt_type in sk_buff structure. We want
  * to make sure its still a 3bit field starting at a byte boundary;
  * taken from arch/x86/net/bpf_jit_comp.c.
@@ -1455,33 +971,6 @@ out_err:
 	return ERR_PTR(err);
 }
 
-void __weak bpf_int_jit_compile(struct sk_filter *prog)
-{
-}
-
-/**
- *	sk_filter_select_runtime - select execution runtime for BPF program
- *	@fp: sk_filter populated with internal BPF program
- *
- * try to JIT internal BPF program, if JIT is not available select interpreter
- * BPF program will be executed via SK_RUN_FILTER() macro
- */
-void sk_filter_select_runtime(struct sk_filter *fp)
-{
-	fp->bpf_func = (void *) __sk_run_filter;
-
-	/* Probe if internal BPF can be JITed */
-	bpf_int_jit_compile(fp);
-}
-EXPORT_SYMBOL_GPL(sk_filter_select_runtime);
-
-/* free internal BPF program */
-void sk_filter_free(struct sk_filter *fp)
-{
-	bpf_jit_free(fp);
-}
-EXPORT_SYMBOL_GPL(sk_filter_free);
-
 static struct sk_filter *__sk_prepare_filter(struct sk_filter *fp,
 					     struct sock *sk)
 {
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 02/16] bpf: update MAINTAINERS entry
@ 2014-07-18  4:19   ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 MAINTAINERS |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index ae8cd00215b2..32e24ff46da3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1912,6 +1912,13 @@ S:	Supported
 F:	drivers/net/bonding/
 F:	include/uapi/linux/if_bonding.h
 
+BPF (Safe dynamic programs and tools)
+M:	Alexei Starovoitov <ast@kernel.org>
+L:	netdev@vger.kernel.org
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	kernel/bpf/
+
 BROADCOM B44 10/100 ETHERNET DRIVER
 M:	Gary Zambrano <zambrano@broadcom.com>
 L:	netdev@vger.kernel.org
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 03/16] net: filter: rename struct sock_filter_int into bpf_insn
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (2 preceding siblings ...)
  (?)
@ 2014-07-18  4:19 ` Alexei Starovoitov
  -1 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

A follow-on patch exposes eBPF to user space, so the name 'sock_filter_int'
no longer makes sense; rename it to 'bpf_insn'.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 arch/x86/net/bpf_jit_comp.c |    2 +-
 include/linux/filter.h      |   50 +++++++++++++++++++++----------------------
 kernel/bpf/core.c           |    2 +-
 kernel/seccomp.c            |    2 +-
 lib/test_bpf.c              |    4 ++--
 net/core/filter.c           |   18 ++++++++--------
 6 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 99bef86ed6df..71737a83f022 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -214,7 +214,7 @@ struct jit_context {
 static int do_jit(struct sk_filter *bpf_prog, int *addrs, u8 *image,
 		  int oldproglen, struct jit_context *ctx)
 {
-	struct sock_filter_int *insn = bpf_prog->insnsi;
+	struct bpf_insn *insn = bpf_prog->insnsi;
 	int insn_cnt = bpf_prog->len;
 	u8 temp[64];
 	int i;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index c43c8258e682..a3287d1c9a56 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -82,7 +82,7 @@ enum {
 /* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
 
 #define BPF_ALU64_REG(OP, DST, SRC)				\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,	\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
@@ -90,7 +90,7 @@ enum {
 		.imm   = 0 })
 
 #define BPF_ALU32_REG(OP, DST, SRC)				\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_ALU | BPF_OP(OP) | BPF_X,		\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
@@ -100,7 +100,7 @@ enum {
 /* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
 
 #define BPF_ALU64_IMM(OP, DST, IMM)				\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,	\
 		.dst_reg = DST,					\
 		.src_reg = 0,					\
@@ -108,7 +108,7 @@ enum {
 		.imm   = IMM })
 
 #define BPF_ALU32_IMM(OP, DST, IMM)				\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_ALU | BPF_OP(OP) | BPF_K,		\
 		.dst_reg = DST,					\
 		.src_reg = 0,					\
@@ -118,7 +118,7 @@ enum {
 /* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
 
 #define BPF_ENDIAN(TYPE, DST, LEN)				\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_ALU | BPF_END | BPF_SRC(TYPE),	\
 		.dst_reg = DST,					\
 		.src_reg = 0,					\
@@ -128,7 +128,7 @@ enum {
 /* Short form of mov, dst_reg = src_reg */
 
 #define BPF_MOV64_REG(DST, SRC)					\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_ALU64 | BPF_MOV | BPF_X,		\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
@@ -136,7 +136,7 @@ enum {
 		.imm   = 0 })
 
 #define BPF_MOV32_REG(DST, SRC)					\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_ALU | BPF_MOV | BPF_X,		\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
@@ -146,7 +146,7 @@ enum {
 /* Short form of mov, dst_reg = imm32 */
 
 #define BPF_MOV64_IMM(DST, IMM)					\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_ALU64 | BPF_MOV | BPF_K,		\
 		.dst_reg = DST,					\
 		.src_reg = 0,					\
@@ -154,7 +154,7 @@ enum {
 		.imm   = IMM })
 
 #define BPF_MOV32_IMM(DST, IMM)					\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_ALU | BPF_MOV | BPF_K,		\
 		.dst_reg = DST,					\
 		.src_reg = 0,					\
@@ -164,7 +164,7 @@ enum {
 /* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
 
 #define BPF_MOV64_RAW(TYPE, DST, SRC, IMM)			\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE),	\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
@@ -172,7 +172,7 @@ enum {
 		.imm   = IMM })
 
 #define BPF_MOV32_RAW(TYPE, DST, SRC, IMM)			\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_ALU | BPF_MOV | BPF_SRC(TYPE),	\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
@@ -182,7 +182,7 @@ enum {
 /* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
 
 #define BPF_LD_ABS(SIZE, IMM)					\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
 		.dst_reg = 0,					\
 		.src_reg = 0,					\
@@ -192,7 +192,7 @@ enum {
 /* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
 
 #define BPF_LD_IND(SIZE, SRC, IMM)				\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_IND,	\
 		.dst_reg = 0,					\
 		.src_reg = SRC,					\
@@ -202,7 +202,7 @@ enum {
 /* Memory load, dst_reg = *(uint *) (src_reg + off16) */
 
 #define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,	\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
@@ -212,7 +212,7 @@ enum {
 /* Memory store, *(uint *) (dst_reg + off16) = src_reg */
 
 #define BPF_STX_MEM(SIZE, DST, SRC, OFF)			\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,	\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
@@ -222,7 +222,7 @@ enum {
 /* Memory store, *(uint *) (dst_reg + off16) = imm32 */
 
 #define BPF_ST_MEM(SIZE, DST, OFF, IMM)				\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM,	\
 		.dst_reg = DST,					\
 		.src_reg = 0,					\
@@ -232,7 +232,7 @@ enum {
 /* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
 
 #define BPF_JMP_REG(OP, DST, SRC, OFF)				\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_JMP | BPF_OP(OP) | BPF_X,		\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
@@ -242,7 +242,7 @@ enum {
 /* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
 
 #define BPF_JMP_IMM(OP, DST, IMM, OFF)				\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_JMP | BPF_OP(OP) | BPF_K,		\
 		.dst_reg = DST,					\
 		.src_reg = 0,					\
@@ -252,7 +252,7 @@ enum {
 /* Function call */
 
 #define BPF_EMIT_CALL(FUNC)					\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_JMP | BPF_CALL,			\
 		.dst_reg = 0,					\
 		.src_reg = 0,					\
@@ -262,7 +262,7 @@ enum {
 /* Raw code statement block */
 
 #define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM)			\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = CODE,					\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
@@ -272,7 +272,7 @@ enum {
 /* Program exit */
 
 #define BPF_EXIT_INSN()						\
-	((struct sock_filter_int) {				\
+	((struct bpf_insn) {				\
 		.code  = BPF_JMP | BPF_EXIT,			\
 		.dst_reg = 0,					\
 		.src_reg = 0,					\
@@ -298,7 +298,7 @@ enum {
 /* Macro to invoke filter function. */
 #define SK_RUN_FILTER(filter, ctx)  (*filter->bpf_func)(ctx, filter->insnsi)
 
-struct sock_filter_int {
+struct bpf_insn {
 	__u8	code;		/* opcode */
 	__u8	dst_reg:4;	/* dest register */
 	__u8	src_reg:4;	/* source register */
@@ -330,10 +330,10 @@ struct sk_filter {
 	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
 	struct rcu_head		rcu;
 	unsigned int		(*bpf_func)(const struct sk_buff *skb,
-					    const struct sock_filter_int *filter);
+					    const struct bpf_insn *filter);
 	union {
 		struct sock_filter	insns[0];
-		struct sock_filter_int	insnsi[0];
+		struct bpf_insn	insnsi[0];
 		struct work_struct	work;
 	};
 };
@@ -353,7 +353,7 @@ void sk_filter_select_runtime(struct sk_filter *fp);
 void sk_filter_free(struct sk_filter *fp);
 
 int sk_convert_filter(struct sock_filter *prog, int len,
-		      struct sock_filter_int *new_prog, int *new_len);
+		      struct bpf_insn *new_prog, int *new_len);
 
 int sk_unattached_filter_create(struct sk_filter **pfp,
 				struct sock_fprog_kern *fprog);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 77a240a1ce11..265a02cc822d 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -81,7 +81,7 @@ noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
  * keep, 0 for none. @ctx is the data we are operating on, @insn is the
  * array of filter instructions.
  */
-static unsigned int __sk_run_filter(void *ctx, const struct sock_filter_int *insn)
+static unsigned int __sk_run_filter(void *ctx, const struct bpf_insn *insn)
 {
 	u64 stack[MAX_BPF_STACK / sizeof(u64)];
 	u64 regs[MAX_BPF_REG], tmp;
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 301bbc24739c..565743db5384 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -248,7 +248,7 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
 	if (ret)
 		goto free_prog;
 
-	/* Convert 'sock_filter' insns to 'sock_filter_int' insns */
+	/* Convert 'sock_filter' insns to 'bpf_insn' insns */
 	ret = sk_convert_filter(fp, fprog->len, NULL, &new_len);
 	if (ret)
 		goto free_prog;
diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index c579e0f58818..5f48623ee1a7 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -66,7 +66,7 @@ struct bpf_test {
 	const char *descr;
 	union {
 		struct sock_filter insns[MAX_INSNS];
-		struct sock_filter_int insns_int[MAX_INSNS];
+		struct bpf_insn insns_int[MAX_INSNS];
 	} u;
 	__u8 aux;
 	__u8 data[MAX_DATA];
@@ -1807,7 +1807,7 @@ static struct sk_filter *generate_filter(int which, int *err)
 
 		fp->len = flen;
 		memcpy(fp->insnsi, tests[which].u.insns_int,
-		       fp->len * sizeof(struct sock_filter_int));
+		       fp->len * sizeof(struct bpf_insn));
 
 		sk_filter_select_runtime(fp);
 		break;
diff --git a/net/core/filter.c b/net/core/filter.c
index 1d0e9492e4fa..f3b2d5e9fe5f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -174,9 +174,9 @@ static u64 __get_random_u32(u64 ctx, u64 a, u64 x, u64 r4, u64 r5)
 }
 
 static bool convert_bpf_extensions(struct sock_filter *fp,
-				   struct sock_filter_int **insnp)
+				   struct bpf_insn **insnp)
 {
-	struct sock_filter_int *insn = *insnp;
+	struct bpf_insn *insn = *insnp;
 
 	switch (fp->k) {
 	case SKF_AD_OFF + SKF_AD_PROTOCOL:
@@ -326,7 +326,7 @@ static bool convert_bpf_extensions(struct sock_filter *fp,
  *
  * 2) 2nd pass to remap in two passes: 1st pass finds new
  *    jump offsets, 2nd pass remapping:
- *   new_prog = kmalloc(sizeof(struct sock_filter_int) * new_len);
+ *   new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
  *   sk_convert_filter(old_prog, old_len, new_prog, &new_len);
  *
  * User BPF's register A is mapped to our BPF register 6, user BPF
@@ -336,10 +336,10 @@ static bool convert_bpf_extensions(struct sock_filter *fp,
  * ctx == 'struct seccomp_data *'.
  */
 int sk_convert_filter(struct sock_filter *prog, int len,
-		      struct sock_filter_int *new_prog, int *new_len)
+		      struct bpf_insn *new_prog, int *new_len)
 {
 	int new_flen = 0, pass = 0, target, i;
-	struct sock_filter_int *new_insn;
+	struct bpf_insn *new_insn;
 	struct sock_filter *fp;
 	int *addrs = NULL;
 	u8 bpf_src;
@@ -365,8 +365,8 @@ do_pass:
 	new_insn++;
 
 	for (i = 0; i < len; fp++, i++) {
-		struct sock_filter_int tmp_insns[6] = { };
-		struct sock_filter_int *insn = tmp_insns;
+		struct bpf_insn tmp_insns[6] = { };
+		struct bpf_insn *insn = tmp_insns;
 
 		if (addrs)
 			addrs[i] = new_insn - new_prog;
@@ -913,7 +913,7 @@ static struct sk_filter *__sk_migrate_filter(struct sk_filter *fp,
 	 * representation.
 	 */
 	BUILD_BUG_ON(sizeof(struct sock_filter) !=
-		     sizeof(struct sock_filter_int));
+		     sizeof(struct bpf_insn));
 
 	/* Conversion cannot happen on overlapping memory areas,
 	 * so we need to keep the user BPF around until the 2nd
@@ -945,7 +945,7 @@ static struct sk_filter *__sk_migrate_filter(struct sk_filter *fp,
 
 	fp->len = new_len;
 
-	/* 2nd pass: remap sock_filter insns into sock_filter_int insns. */
+	/* 2nd pass: remap sock_filter insns into bpf_insn insns. */
 	err = sk_convert_filter(old_prog, old_len, fp->insnsi, &new_len);
 	if (err)
 		/* 2nd sk_convert_filter() can fail only if it fails
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 04/16] net: filter: split filter.h and expose eBPF to user space
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (3 preceding siblings ...)
  (?)
@ 2014-07-18  4:19 ` Alexei Starovoitov
  -1 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

eBPF can be used from user space.

uapi/linux/bpf.h: eBPF instruction set definition

linux/filter.h: the rest

This patch only moves macro definitions, but it practically freezes the existing
eBPF instruction set, though new instructions can still be added in the future.

These eBPF definitions cannot go into uapi/linux/filter.h, since the names
may conflict with existing applications.
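
As an illustration only (not something this patch adds): with these
definitions exported, a user-space tool can construct a minimal eBPF
program such as 'r0 = 1; exit' directly from the uapi macros. The sketch
below assumes the classic BPF_* field encodings still come from
<linux/filter.h>:

  #include <linux/filter.h>   /* classic BPF field encodings */
  #include <linux/bpf.h>      /* eBPF definitions exported by this patch */

  static struct bpf_insn prog[] = {
          BPF_MOV64_IMM(BPF_REG_0, 1),    /* r0 = 1 */
          BPF_EXIT_INSN(),                /* return r0 */
  };

How such a program is loaded into the kernel is the subject of later
patches in this series.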

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/filter.h    |  294 +------------------------------------------
 include/uapi/linux/Kbuild |    1 +
 include/uapi/linux/bpf.h  |  303 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 305 insertions(+), 293 deletions(-)
 create mode 100644 include/uapi/linux/bpf.h

diff --git a/include/linux/filter.h b/include/linux/filter.h
index a3287d1c9a56..b43ad6a2b3cf 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -9,303 +9,11 @@
 #include <linux/skbuff.h>
 #include <linux/workqueue.h>
 #include <uapi/linux/filter.h>
-
-/* Internally used and optimized filter representation with extended
- * instruction set based on top of classic BPF.
- */
-
-/* instruction classes */
-#define BPF_ALU64	0x07	/* alu mode in double word width */
-
-/* ld/ldx fields */
-#define BPF_DW		0x18	/* double word */
-#define BPF_XADD	0xc0	/* exclusive add */
-
-/* alu/jmp fields */
-#define BPF_MOV		0xb0	/* mov reg to reg */
-#define BPF_ARSH	0xc0	/* sign extending arithmetic shift right */
-
-/* change endianness of a register */
-#define BPF_END		0xd0	/* flags for endianness conversion: */
-#define BPF_TO_LE	0x00	/* convert to little-endian */
-#define BPF_TO_BE	0x08	/* convert to big-endian */
-#define BPF_FROM_LE	BPF_TO_LE
-#define BPF_FROM_BE	BPF_TO_BE
-
-#define BPF_JNE		0x50	/* jump != */
-#define BPF_JSGT	0x60	/* SGT is signed '>', GT in x86 */
-#define BPF_JSGE	0x70	/* SGE is signed '>=', GE in x86 */
-#define BPF_CALL	0x80	/* function call */
-#define BPF_EXIT	0x90	/* function return */
-
-/* Register numbers */
-enum {
-	BPF_REG_0 = 0,
-	BPF_REG_1,
-	BPF_REG_2,
-	BPF_REG_3,
-	BPF_REG_4,
-	BPF_REG_5,
-	BPF_REG_6,
-	BPF_REG_7,
-	BPF_REG_8,
-	BPF_REG_9,
-	BPF_REG_10,
-	__MAX_BPF_REG,
-};
-
-/* BPF has 10 general purpose 64-bit registers and stack frame. */
-#define MAX_BPF_REG	__MAX_BPF_REG
-
-/* ArgX, context and stack frame pointer register positions. Note,
- * Arg1, Arg2, Arg3, etc are used as argument mappings of function
- * calls in BPF_CALL instruction.
- */
-#define BPF_REG_ARG1	BPF_REG_1
-#define BPF_REG_ARG2	BPF_REG_2
-#define BPF_REG_ARG3	BPF_REG_3
-#define BPF_REG_ARG4	BPF_REG_4
-#define BPF_REG_ARG5	BPF_REG_5
-#define BPF_REG_CTX	BPF_REG_6
-#define BPF_REG_FP	BPF_REG_10
-
-/* Additional register mappings for converted user programs. */
-#define BPF_REG_A	BPF_REG_0
-#define BPF_REG_X	BPF_REG_7
-#define BPF_REG_TMP	BPF_REG_8
-
-/* BPF program can access up to 512 bytes of stack space. */
-#define MAX_BPF_STACK	512
-
-/* Helper macros for filter block array initializers. */
-
-/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
-
-#define BPF_ALU64_REG(OP, DST, SRC)				\
-	((struct bpf_insn) {				\
-		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,	\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = 0 })
-
-#define BPF_ALU32_REG(OP, DST, SRC)				\
-	((struct bpf_insn) {				\
-		.code  = BPF_ALU | BPF_OP(OP) | BPF_X,		\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = 0 })
-
-/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
-
-#define BPF_ALU64_IMM(OP, DST, IMM)				\
-	((struct bpf_insn) {				\
-		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,	\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-#define BPF_ALU32_IMM(OP, DST, IMM)				\
-	((struct bpf_insn) {				\
-		.code  = BPF_ALU | BPF_OP(OP) | BPF_K,		\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-/* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
-
-#define BPF_ENDIAN(TYPE, DST, LEN)				\
-	((struct bpf_insn) {				\
-		.code  = BPF_ALU | BPF_END | BPF_SRC(TYPE),	\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = LEN })
-
-/* Short form of mov, dst_reg = src_reg */
-
-#define BPF_MOV64_REG(DST, SRC)					\
-	((struct bpf_insn) {				\
-		.code  = BPF_ALU64 | BPF_MOV | BPF_X,		\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = 0 })
-
-#define BPF_MOV32_REG(DST, SRC)					\
-	((struct bpf_insn) {				\
-		.code  = BPF_ALU | BPF_MOV | BPF_X,		\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = 0 })
-
-/* Short form of mov, dst_reg = imm32 */
-
-#define BPF_MOV64_IMM(DST, IMM)					\
-	((struct bpf_insn) {				\
-		.code  = BPF_ALU64 | BPF_MOV | BPF_K,		\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-#define BPF_MOV32_IMM(DST, IMM)					\
-	((struct bpf_insn) {				\
-		.code  = BPF_ALU | BPF_MOV | BPF_K,		\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
-
-#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM)			\
-	((struct bpf_insn) {				\
-		.code  = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE),	\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM)			\
-	((struct bpf_insn) {				\
-		.code  = BPF_ALU | BPF_MOV | BPF_SRC(TYPE),	\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
-
-#define BPF_LD_ABS(SIZE, IMM)					\
-	((struct bpf_insn) {				\
-		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
-		.dst_reg = 0,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
-
-#define BPF_LD_IND(SIZE, SRC, IMM)				\
-	((struct bpf_insn) {				\
-		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_IND,	\
-		.dst_reg = 0,					\
-		.src_reg = SRC,					\
-		.off   = 0,					\
-		.imm   = IMM })
-
-/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
-
-#define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
-	((struct bpf_insn) {				\
-		.code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,	\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = OFF,					\
-		.imm   = 0 })
-
-/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
-
-#define BPF_STX_MEM(SIZE, DST, SRC, OFF)			\
-	((struct bpf_insn) {				\
-		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,	\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = OFF,					\
-		.imm   = 0 })
-
-/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
-
-#define BPF_ST_MEM(SIZE, DST, OFF, IMM)				\
-	((struct bpf_insn) {				\
-		.code  = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM,	\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = OFF,					\
-		.imm   = IMM })
-
-/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
-
-#define BPF_JMP_REG(OP, DST, SRC, OFF)				\
-	((struct bpf_insn) {				\
-		.code  = BPF_JMP | BPF_OP(OP) | BPF_X,		\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = OFF,					\
-		.imm   = 0 })
-
-/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
-
-#define BPF_JMP_IMM(OP, DST, IMM, OFF)				\
-	((struct bpf_insn) {				\
-		.code  = BPF_JMP | BPF_OP(OP) | BPF_K,		\
-		.dst_reg = DST,					\
-		.src_reg = 0,					\
-		.off   = OFF,					\
-		.imm   = IMM })
-
-/* Function call */
-
-#define BPF_EMIT_CALL(FUNC)					\
-	((struct bpf_insn) {				\
-		.code  = BPF_JMP | BPF_CALL,			\
-		.dst_reg = 0,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = ((FUNC) - __bpf_call_base) })
-
-/* Raw code statement block */
-
-#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM)			\
-	((struct bpf_insn) {				\
-		.code  = CODE,					\
-		.dst_reg = DST,					\
-		.src_reg = SRC,					\
-		.off   = OFF,					\
-		.imm   = IMM })
-
-/* Program exit */
-
-#define BPF_EXIT_INSN()						\
-	((struct bpf_insn) {				\
-		.code  = BPF_JMP | BPF_EXIT,			\
-		.dst_reg = 0,					\
-		.src_reg = 0,					\
-		.off   = 0,					\
-		.imm   = 0 })
-
-#define bytes_to_bpf_size(bytes)				\
-({								\
-	int bpf_size = -EINVAL;					\
-								\
-	if (bytes == sizeof(u8))				\
-		bpf_size = BPF_B;				\
-	else if (bytes == sizeof(u16))				\
-		bpf_size = BPF_H;				\
-	else if (bytes == sizeof(u32))				\
-		bpf_size = BPF_W;				\
-	else if (bytes == sizeof(u64))				\
-		bpf_size = BPF_DW;				\
-								\
-	bpf_size;						\
-})
+#include <uapi/linux/bpf.h>
 
 /* Macro to invoke filter function. */
 #define SK_RUN_FILTER(filter, ctx)  (*filter->bpf_func)(ctx, filter->insnsi)
 
-struct bpf_insn {
-	__u8	code;		/* opcode */
-	__u8	dst_reg:4;	/* dest register */
-	__u8	src_reg:4;	/* source register */
-	__s16	off;		/* signed offset */
-	__s32	imm;		/* signed immediate constant */
-};
-
 #ifdef CONFIG_COMPAT
 /* A struct sock_filter is architecture independent. */
 struct compat_sock_fprog {
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 24e9033f8b3f..fb3f7b675229 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -67,6 +67,7 @@ header-y += bfs_fs.h
 header-y += binfmts.h
 header-y += blkpg.h
 header-y += blktrace_api.h
+header-y += bpf.h
 header-y += bpqether.h
 header-y += bsg.h
 header-y += btrfs.h
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
new file mode 100644
index 000000000000..3ff5bf5045a7
--- /dev/null
+++ b/include/uapi/linux/bpf.h
@@ -0,0 +1,303 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _UAPI__LINUX_BPF_H__
+#define _UAPI__LINUX_BPF_H__
+
+#include <linux/types.h>
+
+/* Extended instruction set based on top of classic BPF */
+
+/* instruction classes */
+#define BPF_ALU64	0x07	/* alu mode in double word width */
+
+/* ld/ldx fields */
+#define BPF_DW		0x18	/* double word */
+#define BPF_XADD	0xc0	/* exclusive add */
+
+/* alu/jmp fields */
+#define BPF_MOV		0xb0	/* mov reg to reg */
+#define BPF_ARSH	0xc0	/* sign extending arithmetic shift right */
+
+/* change endianness of a register */
+#define BPF_END		0xd0	/* flags for endianness conversion: */
+#define BPF_TO_LE	0x00	/* convert to little-endian */
+#define BPF_TO_BE	0x08	/* convert to big-endian */
+#define BPF_FROM_LE	BPF_TO_LE
+#define BPF_FROM_BE	BPF_TO_BE
+
+#define BPF_JNE		0x50	/* jump != */
+#define BPF_JSGT	0x60	/* SGT is signed '>', GT in x86 */
+#define BPF_JSGE	0x70	/* SGE is signed '>=', GE in x86 */
+#define BPF_CALL	0x80	/* function call */
+#define BPF_EXIT	0x90	/* function return */
+
+/* Register numbers */
+enum {
+	BPF_REG_0 = 0,
+	BPF_REG_1,
+	BPF_REG_2,
+	BPF_REG_3,
+	BPF_REG_4,
+	BPF_REG_5,
+	BPF_REG_6,
+	BPF_REG_7,
+	BPF_REG_8,
+	BPF_REG_9,
+	BPF_REG_10,
+	__MAX_BPF_REG,
+};
+
+/* BPF has 10 general purpose 64-bit registers and stack frame. */
+#define MAX_BPF_REG	__MAX_BPF_REG
+
+/* ArgX, context and stack frame pointer register positions. Note,
+ * Arg1, Arg2, Arg3, etc are used as argument mappings of function
+ * calls in BPF_CALL instruction.
+ */
+#define BPF_REG_ARG1	BPF_REG_1
+#define BPF_REG_ARG2	BPF_REG_2
+#define BPF_REG_ARG3	BPF_REG_3
+#define BPF_REG_ARG4	BPF_REG_4
+#define BPF_REG_ARG5	BPF_REG_5
+#define BPF_REG_CTX	BPF_REG_6
+#define BPF_REG_FP	BPF_REG_10
+
+/* Additional register mappings for converted user programs. */
+#define BPF_REG_A	BPF_REG_0
+#define BPF_REG_X	BPF_REG_7
+#define BPF_REG_TMP	BPF_REG_8
+
+/* BPF program can access up to 512 bytes of stack space. */
+#define MAX_BPF_STACK	512
+
+/* Helper macros for filter block array initializers. */
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC)				\
+	((struct bpf_insn) {				\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC)				\
+	((struct bpf_insn) {				\
+		.code  = BPF_ALU | BPF_OP(OP) | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
+
+#define BPF_ALU64_IMM(OP, DST, IMM)				\
+	((struct bpf_insn) {				\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_ALU32_IMM(OP, DST, IMM)				\
+	((struct bpf_insn) {				\
+		.code  = BPF_ALU | BPF_OP(OP) | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
+
+#define BPF_ENDIAN(TYPE, DST, LEN)				\
+	((struct bpf_insn) {				\
+		.code  = BPF_ALU | BPF_END | BPF_SRC(TYPE),	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = LEN })
+
+/* Short form of mov, dst_reg = src_reg */
+
+#define BPF_MOV64_REG(DST, SRC)					\
+	((struct bpf_insn) {				\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define BPF_MOV32_REG(DST, SRC)					\
+	((struct bpf_insn) {				\
+		.code  = BPF_ALU | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* Short form of mov, dst_reg = imm32 */
+
+#define BPF_MOV64_IMM(DST, IMM)					\
+	((struct bpf_insn) {				\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_MOV32_IMM(DST, IMM)					\
+	((struct bpf_insn) {				\
+		.code  = BPF_ALU | BPF_MOV | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
+
+#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM)			\
+	((struct bpf_insn) {				\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE),	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM)			\
+	((struct bpf_insn) {				\
+		.code  = BPF_ALU | BPF_MOV | BPF_SRC(TYPE),	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
+
+#define BPF_LD_ABS(SIZE, IMM)					\
+	((struct bpf_insn) {				\
+		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
+
+#define BPF_LD_IND(SIZE, SRC, IMM)				\
+	((struct bpf_insn) {				\
+		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_IND,	\
+		.dst_reg = 0,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
+
+#define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {				\
+		.code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
+
+#define BPF_STX_MEM(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {				\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
+
+#define BPF_ST_MEM(SIZE, DST, OFF, IMM)				\
+	((struct bpf_insn) {				\
+		.code  = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
+
+#define BPF_JMP_REG(OP, DST, SRC, OFF)				\
+	((struct bpf_insn) {				\
+		.code  = BPF_JMP | BPF_OP(OP) | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
+
+#define BPF_JMP_IMM(OP, DST, IMM, OFF)				\
+	((struct bpf_insn) {				\
+		.code  = BPF_JMP | BPF_OP(OP) | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Function call */
+
+#define BPF_EMIT_CALL(FUNC)					\
+	((struct bpf_insn) {				\
+		.code  = BPF_JMP | BPF_CALL,			\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = ((FUNC) - __bpf_call_base) })
+
+/* Raw code statement block */
+
+#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM)			\
+	((struct bpf_insn) {				\
+		.code  = CODE,					\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Program exit */
+
+#define BPF_EXIT_INSN()						\
+	((struct bpf_insn) {				\
+		.code  = BPF_JMP | BPF_EXIT,			\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define bytes_to_bpf_size(bytes)				\
+({								\
+	int bpf_size = -EINVAL;					\
+								\
+	if (bytes == sizeof(u8))				\
+		bpf_size = BPF_B;				\
+	else if (bytes == sizeof(u16))				\
+		bpf_size = BPF_H;				\
+	else if (bytes == sizeof(u32))				\
+		bpf_size = BPF_W;				\
+	else if (bytes == sizeof(u64))				\
+		bpf_size = BPF_DW;				\
+								\
+	bpf_size;						\
+})
+
+struct bpf_insn {
+	__u8	code;		/* opcode */
+	__u8	dst_reg:4;	/* dest register */
+	__u8	src_reg:4;	/* source register */
+	__s16	off;		/* signed offset */
+	__s32	imm;		/* signed immediate constant */
+};
+
+#endif /* _UAPI__LINUX_BPF_H__ */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 05/16] bpf: introduce syscall(BPF, ...) and BPF maps
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (4 preceding siblings ...)
  (?)
@ 2014-07-18  4:19 ` Alexei Starovoitov
  2014-07-23 18:02     ` Kees Cook
  -1 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

BPF syscall is a demux for different BPF-related commands.

'maps' is a generic storage of different types for sharing data between kernel
and userspace.

The maps can be created from user space via BPF syscall:
- create a map with given type and attributes
  fd = bpf_map_create(map_type, struct nlattr *attr, int len)
  returns fd or negative error

- close(fd) deletes the map

Next patch allows userspace programs to populate/read maps that eBPF programs
are concurrently updating.

maps can have different types: hash, bloom filter, radix-tree, etc.

The map is defined by:
  . type
  . max number of elements
  . key size in bytes
  . value size in bytes

Next patches allow eBPF programs to access maps via API:
  void * bpf_map_lookup_elem(u32 fd, void *key);
  int bpf_map_update_elem(u32 fd, void *key, void *value);
  int bpf_map_delete_elem(u32 fd, void *key);

This patch establishes core infrastructure for BPF maps.
Next patches implement lookup/update and hashtable type.
More map types can be added in the future.

The syscall uses a type-length-value style of passing arguments to stay backwards
compatible with future extensions to map attributes. Different map types may
use different attributes as well.
The concept of type-length-value is borrowed from netlink, but netlink itself
is not applicable here, since BPF programs and maps can be used in NET-less
configurations.
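
For illustration, a minimal userspace sketch (not part of this patch) of
BPF_MAP_CREATE using hand-rolled netlink-style TLV framing. The nla_put_u32()
helper and create_map() wrapper are hypothetical local functions; the command,
attribute and type names come from this series, and __NR_bpf is only wired up
by a later patch:

  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/netlink.h>      /* struct nlattr, NLA_HDRLEN, NLA_ALIGN */
  #include <linux/bpf.h>          /* BPF_MAP_CREATE, BPF_MAP_* attributes */

  #ifndef __NR_bpf
  #define __NR_bpf 317            /* x86-64 number used in this series */
  #endif

  /* hypothetical helper: append one u32 attribute, return the new offset */
  static int nla_put_u32(char *buf, int off, uint16_t type, uint32_t val)
  {
          struct nlattr *nla = (struct nlattr *)(buf + off);

          nla->nla_type = type;
          nla->nla_len = NLA_HDRLEN + sizeof(val);
          memcpy((char *)nla + NLA_HDRLEN, &val, sizeof(val));
          return off + NLA_ALIGN(nla->nla_len);
  }

  int create_map(int map_type, uint32_t key_size, uint32_t value_size,
                 uint32_t max_entries)
  {
          char buf[64];
          int len = 0;

          len = nla_put_u32(buf, len, BPF_MAP_KEY_SIZE, key_size);
          len = nla_put_u32(buf, len, BPF_MAP_VALUE_SIZE, value_size);
          len = nla_put_u32(buf, len, BPF_MAP_MAX_ENTRIES, max_entries);

          /* returns a map fd or a negative error; close(fd) deletes the map */
          return syscall(__NR_bpf, BPF_MAP_CREATE, map_type, buf, len);
  }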

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 Documentation/networking/filter.txt |   69 +++++++++++
 include/linux/bpf.h                 |   43 +++++++
 include/uapi/linux/bpf.h            |   24 ++++
 kernel/bpf/Makefile                 |    2 +-
 kernel/bpf/syscall.c                |  225 +++++++++++++++++++++++++++++++++++
 5 files changed, 362 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/bpf.h
 create mode 100644 kernel/bpf/syscall.c

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index ee78eba78a9d..e14e486f69cd 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -995,6 +995,75 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
 Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
 2 byte atomic increments are not supported.
 
+eBPF maps
+---------
+'maps' is a generic storage of different types for sharing data between kernel
+and userspace.
+
+The maps are accessed from user space via BPF syscall, which has commands:
+- create a map with given id, type and attributes
+  map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
+  returns positive map id or negative error
+
+- delete map with given map id
+  err = bpf_map_delete(int map_id)
+  returns zero or negative error
+
+- lookup key in a given map referenced by map_id
+  err = bpf_map_lookup_elem(int map_id, void *key, void *value)
+  returns zero and stores found elem into value or negative error
+
+- create or update key/value pair in a given map
+  err = bpf_map_update_elem(int map_id, void *key, void *value)
+  returns zero or negative error
+
+- find and delete element by key in a given map
+  err = bpf_map_delete_elem(int map_id, void *key)
+
+userspace programs use this API to create/populate/read maps that eBPF programs
+are concurrently updating.
+
+maps can have different types: hash, bloom filter, radix-tree, etc.
+
+The map is defined by:
+  . id
+  . type
+  . max number of elements
+  . key size in bytes
+  . value size in bytes
+
+The maps are accessible from eBPF programs with the following API:
+  void * bpf_map_lookup_elem(u32 map_id, void *key);
+  int bpf_map_update_elem(u32 map_id, void *key, void *value);
+  int bpf_map_delete_elem(u32 map_id, void *key);
+
+If eBPF verifier is configured to recognize extra calls in the program
+bpf_map_lookup_elem() and bpf_map_update_elem() then access to maps looks like:
+  ...
+  ptr_to_value = map_lookup_elem(const_int_map_id, key)
+  access memory [ptr_to_value, ptr_to_value + value_size_in_bytes]
+  ...
+  prepare key2 and value2 on stack of key_size and value_size
+  err = map_update_elem(const_int_map_id2, key2, value2)
+  ...
+
+An eBPF program cannot create or delete maps
+(such calls will be unknown to the verifier)
+
+During program loading the refcnt of used maps is incremented, so they don't get
+deleted while the program is running
+
+bpf_map_update_elem() can fail if the maximum number of elements is reached.
+If key2 already exists, bpf_map_update_elem() replaces it with value2 atomically
+
+bpf_map_lookup_elem() can return null or ptr_to_value
+ptr_to_value is read/write from the program point of view.
+
+The verifier will check that the program accesses map elements within specified
+size. It will not let programs pass junk values as 'key' and 'value' to
+bpf_map_*_elem() functions, so these functions (implemented in C inside kernel)
+can safely access the pointers in all cases.
+
 Testing
 -------
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
new file mode 100644
index 000000000000..57af236a0eb4
--- /dev/null
+++ b/include/linux/bpf.h
@@ -0,0 +1,43 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_BPF_H
+#define _LINUX_BPF_H 1
+
+#include <uapi/linux/bpf.h>
+#include <linux/workqueue.h>
+
+struct bpf_map;
+struct nlattr;
+
+/* map is generic key/value storage optionally accessible by eBPF programs */
+struct bpf_map_ops {
+	/* funcs callable from userspace (via syscall) */
+	struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]);
+	void (*map_free)(struct bpf_map *);
+};
+
+struct bpf_map {
+	atomic_t refcnt;
+	int map_id;
+	enum bpf_map_type map_type;
+	u32 key_size;
+	u32 value_size;
+	u32 max_entries;
+	struct bpf_map_ops *ops;
+	struct work_struct work;
+};
+
+struct bpf_map_type_list {
+	struct list_head list_node;
+	struct bpf_map_ops *ops;
+	enum bpf_map_type type;
+};
+
+void bpf_register_map_type(struct bpf_map_type_list *tl);
+struct bpf_map *bpf_map_get(u32 map_id);
+
+#endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3ff5bf5045a7..dcc7eb97a64a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -300,4 +300,28 @@ struct bpf_insn {
 	__s32	imm;		/* signed immediate constant */
 };
 
+/* BPF syscall commands */
+enum bpf_cmd {
+	/* create a map with given type and attributes
+	 * fd = bpf_map_create(bpf_map_type, struct nlattr *attr, int len)
+	 * returns fd or negative error
+	 * map is deleted when fd is closed
+	 */
+	BPF_MAP_CREATE,
+};
+
+enum bpf_map_attributes {
+	BPF_MAP_UNSPEC,
+	BPF_MAP_KEY_SIZE,	/* size of key in bytes */
+	BPF_MAP_VALUE_SIZE,	/* size of value in bytes */
+	BPF_MAP_MAX_ENTRIES,	/* maximum number of entries in a map */
+	__BPF_MAP_ATTR_MAX,
+};
+#define BPF_MAP_ATTR_MAX (__BPF_MAP_ATTR_MAX - 1)
+#define BPF_MAP_MAX_ATTR_SIZE 65535
+
+enum bpf_map_type {
+	BPF_MAP_TYPE_UNSPEC,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 6a71145e2769..e9f7334ed07a 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o
+obj-y := core.o syscall.o
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
new file mode 100644
index 000000000000..c4a330642653
--- /dev/null
+++ b/kernel/bpf/syscall.c
@@ -0,0 +1,225 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <linux/syscalls.h>
+#include <net/netlink.h>
+#include <linux/anon_inodes.h>
+
+/* mutex to protect insertion/deletion of map_id in IDR */
+static DEFINE_MUTEX(bpf_map_lock);
+static DEFINE_IDR(bpf_map_id_idr);
+
+/* maximum number of outstanding maps */
+#define MAX_BPF_MAP_CNT 1024
+static u32 bpf_map_cnt;
+
+static LIST_HEAD(bpf_map_types);
+
+static struct bpf_map *find_and_alloc_map(enum bpf_map_type type,
+					  struct nlattr *tb[BPF_MAP_ATTR_MAX + 1])
+{
+	struct bpf_map_type_list *tl;
+	struct bpf_map *map;
+
+	list_for_each_entry(tl, &bpf_map_types, list_node) {
+		if (tl->type == type) {
+			map = tl->ops->map_alloc(tb);
+			if (IS_ERR(map))
+				return map;
+			map->ops = tl->ops;
+			map->map_type = type;
+			return map;
+		}
+	}
+	return ERR_PTR(-EINVAL);
+}
+
+/* boot time registration of different map implementations */
+void bpf_register_map_type(struct bpf_map_type_list *tl)
+{
+	list_add(&tl->list_node, &bpf_map_types);
+}
+
+/* called from workqueue */
+static void bpf_map_free_deferred(struct work_struct *work)
+{
+	struct bpf_map *map = container_of(work, struct bpf_map, work);
+
+	/* grab the mutex and free the map */
+	mutex_lock(&bpf_map_lock);
+
+	bpf_map_cnt--;
+	idr_remove(&bpf_map_id_idr, map->map_id);
+
+	mutex_unlock(&bpf_map_lock);
+
+	/* implementation dependent freeing */
+	map->ops->map_free(map);
+}
+
+/* decrement map refcnt and schedule it for freeing via workqueue
+ * (underlying map implementation ops->map_free() might sleep)
+ */
+static void __bpf_map_put(struct bpf_map *map)
+{
+	if (atomic_dec_and_test(&map->refcnt)) {
+		INIT_WORK(&map->work, bpf_map_free_deferred);
+		schedule_work(&map->work);
+	}
+}
+
+/* find map by id and decrement its refcnt
+ *
+ * can be called without any locks held
+ *
+ * returns true if map was found
+ */
+static bool bpf_map_put(u32 map_id)
+{
+	struct bpf_map *map;
+
+	rcu_read_lock();
+	map = idr_find(&bpf_map_id_idr, map_id);
+
+	if (!map) {
+		rcu_read_unlock();
+		return false;
+	}
+
+	__bpf_map_put(map);
+	rcu_read_unlock();
+
+	return true;
+}
+
+/* called with bpf_map_lock held */
+struct bpf_map *bpf_map_get(u32 map_id)
+{
+	BUG_ON(!mutex_is_locked(&bpf_map_lock));
+
+	return idr_find(&bpf_map_id_idr, map_id);
+}
+
+static int bpf_map_release(struct inode *inode, struct file *filp)
+{
+	struct bpf_map *map = filp->private_data;
+
+	__bpf_map_put(map);
+	return 0;
+}
+
+static const struct file_operations bpf_map_fops = {
+        .release = bpf_map_release,
+};
+
+static const struct nla_policy map_policy[BPF_MAP_ATTR_MAX + 1] = {
+	[BPF_MAP_KEY_SIZE]    = { .type = NLA_U32 },
+	[BPF_MAP_VALUE_SIZE]  = { .type = NLA_U32 },
+	[BPF_MAP_MAX_ENTRIES] = { .type = NLA_U32 },
+};
+
+/* called via syscall */
+static int map_create(enum bpf_map_type type, struct nlattr __user *uattr, int len)
+{
+	struct nlattr *tb[BPF_MAP_ATTR_MAX + 1];
+	struct bpf_map *map;
+	struct nlattr *attr;
+	int err;
+
+	if (len <= 0 || len > BPF_MAP_MAX_ATTR_SIZE)
+		return -EINVAL;
+
+	attr = kmalloc(len, GFP_USER);
+	if (!attr)
+		return -ENOMEM;
+
+	/* copy map attributes from user space */
+	err = -EFAULT;
+	if (copy_from_user(attr, uattr, len) != 0)
+		goto free_attr;
+
+	/* perform basic validation */
+	err = nla_parse(tb, BPF_MAP_ATTR_MAX, attr, len, map_policy);
+	if (err < 0)
+		goto free_attr;
+
+	/* find map type and init map: hashtable vs rbtree vs bloom vs ... */
+	map = find_and_alloc_map(type, tb);
+	if (IS_ERR(map)) {
+		err = PTR_ERR(map);
+		goto free_attr;
+	}
+
+	atomic_set(&map->refcnt, 1);
+
+	mutex_lock(&bpf_map_lock);
+
+	if (bpf_map_cnt >= MAX_BPF_MAP_CNT) {
+		mutex_unlock(&bpf_map_lock);
+		err = -ENOSPC;
+		goto free_map;
+	}
+
+	/* allocate map id */
+	err = idr_alloc(&bpf_map_id_idr, map, 1 /* min map_id */, 0, GFP_USER);
+
+	if (err > 0)
+		bpf_map_cnt++;
+
+	map->map_id = err;
+
+	mutex_unlock(&bpf_map_lock);
+
+	if (err < 0)
+		/* failed to allocate map id */
+		goto free_map;
+
+	err = anon_inode_getfd("bpf-map", &bpf_map_fops, map, O_RDWR | O_CLOEXEC);
+
+	if (err < 0)
+		/* failed to allocate fd */
+		goto free_map_id;
+
+	/* user supplied array of map attributes is no longer needed */
+	kfree(attr);
+
+	return err;
+
+free_map_id:
+	/* grab the mutex and free the map */
+	mutex_lock(&bpf_map_lock);
+
+	bpf_map_cnt--;
+	idr_remove(&bpf_map_id_idr, map->map_id);
+
+	mutex_unlock(&bpf_map_lock);
+free_map:
+	map->ops->map_free(map);
+free_attr:
+	kfree(attr);
+	return err;
+}
+
+SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
+		unsigned long, arg4, unsigned long, arg5)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	switch (cmd) {
+	case BPF_MAP_CREATE:
+		return map_create((enum bpf_map_type) arg2,
+				  (struct nlattr __user *) arg3, (int) arg4);
+	default:
+		return -EINVAL;
+	}
+}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 06/16] bpf: enable bpf syscall on x64
@ 2014-07-18  4:19   ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

done as separate commit to ease conflict resolution

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 arch/x86/syscalls/syscall_64.tbl  |    1 +
 include/linux/syscalls.h          |    2 ++
 include/uapi/asm-generic/unistd.h |    4 +++-
 kernel/sys_ni.c                   |    3 +++
 4 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1646d2..edbb8460e1b5 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	bpf			sys_bpf
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b0881a0ed322..2b524aeba262 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
 asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
 			 unsigned long idx1, unsigned long idx2);
 asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_bpf(int cmd, unsigned long arg2, unsigned long arg3,
+			unsigned long arg4, unsigned long arg5);
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 333640608087..41e20f8fb87e 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -699,9 +699,11 @@ __SYSCALL(__NR_sched_setattr, sys_sched_setattr)
 __SYSCALL(__NR_sched_getattr, sys_sched_getattr)
 #define __NR_renameat2 276
 __SYSCALL(__NR_renameat2, sys_renameat2)
+#define __NR_bpf 277
+__SYSCALL(__NR_bpf, sys_bpf)
 
 #undef __NR_syscalls
-#define __NR_syscalls 277
+#define __NR_syscalls 278
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 36441b51b5df..877c9aafbfb4 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -213,3 +213,6 @@ cond_syscall(compat_sys_open_by_handle_at);
 
 /* compare kernel pointers */
 cond_syscall(sys_kcmp);
+
+/* access BPF programs and maps */
+cond_syscall(sys_bpf);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 07/16] bpf: add lookup/update/delete/iterate methods to BPF maps
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (6 preceding siblings ...)
  (?)
@ 2014-07-18  4:19 ` Alexei Starovoitov
  2014-07-23 18:25   ` Kees Cook
  -1 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

'maps' is a generic storage of different types for sharing data between kernel
and userspace.

The maps are accessed from user space via BPF syscall, which has commands:

- create a map with given type and attributes
  fd = bpf_map_create(map_type, struct nlattr *attr, int len)
  returns fd or negative error

- lookup key in a given map referenced by fd
  err = bpf_map_lookup_elem(int fd, void *key, void *value)
  returns zero and stores found elem into value or negative error

- create or update key/value pair in a given map
  err = bpf_map_update_elem(int fd, void *key, void *value)
  returns zero or negative error

- find and delete element by key in a given map
  err = bpf_map_delete_elem(int fd, void *key)

- iterate map elements (based on input key return next_key)
  err = bpf_map_get_next_key(int fd, void *key, void *next_key)

- close(fd) deletes the map
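
For illustration, thin userspace wrappers (not part of this patch) around these
commands, plus a sketch of walking all elements of a map with u32 keys and u64
values. The wrapper names are hypothetical and error handling is trimmed:

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/types.h>
  #include <linux/bpf.h>          /* enum bpf_cmd from this series */

  #ifndef __NR_bpf
  #define __NR_bpf 317            /* x86-64 number used in this series */
  #endif

  int bpf_update_elem(int fd, void *key, void *value)
  {
          return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, fd, key, value);
  }

  int bpf_lookup_elem(int fd, void *key, void *value)
  {
          return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, fd, key, value);
  }

  int bpf_get_next_key(int fd, void *key, void *next_key)
  {
          return syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, fd, key, next_key);
  }

  /* walk a map with u32 keys and u64 values; the starting key is assumed
   * not to be present in the map, otherwise that one element is skipped
   */
  void dump_map(int fd)
  {
          __u32 key = -1, next_key;
          __u64 value;

          while (bpf_get_next_key(fd, &key, &next_key) == 0) {
                  if (bpf_lookup_elem(fd, &next_key, &value) == 0)
                          printf("key %u -> value %llu\n", next_key,
                                 (unsigned long long)value);
                  key = next_key;
          }
  }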

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/bpf.h      |    6 ++
 include/uapi/linux/bpf.h |   25 ++++++
 kernel/bpf/syscall.c     |  209 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 240 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 57af236a0eb4..91e2caf8edf9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -18,6 +18,12 @@ struct bpf_map_ops {
 	/* funcs callable from userspace (via syscall) */
 	struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]);
 	void (*map_free)(struct bpf_map *);
+	int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
+
+	/* funcs callable from userspace and from eBPF programs */
+	void *(*map_lookup_elem)(struct bpf_map *map, void *key);
+	int (*map_update_elem)(struct bpf_map *map, void *key, void *value);
+	int (*map_delete_elem)(struct bpf_map *map, void *key);
 };
 
 struct bpf_map {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index dcc7eb97a64a..5e1bfbc9cdc7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -308,6 +308,31 @@ enum bpf_cmd {
 	 * map is deleted when fd is closed
 	 */
 	BPF_MAP_CREATE,
+
+	/* lookup key in a given map referenced by map_id
+	 * err = bpf_map_lookup_elem(int map_id, void *key, void *value)
+	 * returns zero and stores found elem into value
+	 * or negative error
+	 */
+	BPF_MAP_LOOKUP_ELEM,
+
+	/* create or update key/value pair in a given map
+	 * err = bpf_map_update_elem(int map_id, void *key, void *value)
+	 * returns zero or negative error
+	 */
+	BPF_MAP_UPDATE_ELEM,
+
+	/* find and delete elem by key in a given map
+	 * err = bpf_map_delete_elem(int map_id, void *key)
+	 * returns zero or negative error
+	 */
+	BPF_MAP_DELETE_ELEM,
+
+	/* lookup key in a given map and return next key
+	 * err = bpf_map_get_next_key(int map_id, void *key, void *next_key)
+	 * returns zero and stores next key or negative error
+	 */
+	BPF_MAP_GET_NEXT_KEY,
 };
 
 enum bpf_map_attributes {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c4a330642653..ca2be66845b3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -13,6 +13,7 @@
 #include <linux/syscalls.h>
 #include <net/netlink.h>
 #include <linux/anon_inodes.h>
+#include <linux/file.h>
 
 /* mutex to protect insertion/deletion of map_id in IDR */
 static DEFINE_MUTEX(bpf_map_lock);
@@ -209,6 +210,202 @@ free_attr:
 	return err;
 }
 
+static int get_map_id(struct fd f)
+{
+	struct bpf_map *map;
+
+	if (!f.file)
+		return -EBADF;
+
+	if (f.file->f_op != &bpf_map_fops) {
+		fdput(f);
+		return -EINVAL;
+	}
+
+	map = f.file->private_data;
+
+	return map->map_id;
+}
+
+static int map_lookup_elem(int ufd, void __user *ukey, void __user *uvalue)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_map *map;
+	void *key, *value;
+	int err;
+
+	err = get_map_id(f);
+	if (err < 0)
+		return err;
+
+	rcu_read_lock();
+	map = idr_find(&bpf_map_id_idr, err);
+	err = -EINVAL;
+	if (!map)
+		goto err_unlock;
+
+	err = -ENOMEM;
+	key = kmalloc(map->key_size, GFP_ATOMIC);
+	if (!key)
+		goto err_unlock;
+
+	err = -EFAULT;
+	if (copy_from_user(key, ukey, map->key_size) != 0)
+		goto free_key;
+
+	err = -ESRCH;
+	value = map->ops->map_lookup_elem(map, key);
+	if (!value)
+		goto free_key;
+
+	err = -EFAULT;
+	if (copy_to_user(uvalue, value, map->value_size) != 0)
+		goto free_key;
+
+	err = 0;
+
+free_key:
+	kfree(key);
+err_unlock:
+	rcu_read_unlock();
+	fdput(f);
+	return err;
+}
+
+static int map_update_elem(int ufd, void __user *ukey, void __user *uvalue)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_map *map;
+	void *key, *value;
+	int err;
+
+	err = get_map_id(f);
+	if (err < 0)
+		return err;
+
+	rcu_read_lock();
+	map = idr_find(&bpf_map_id_idr, err);
+	err = -EINVAL;
+	if (!map)
+		goto err_unlock;
+
+	err = -ENOMEM;
+	key = kmalloc(map->key_size, GFP_ATOMIC);
+	if (!key)
+		goto err_unlock;
+
+	err = -EFAULT;
+	if (copy_from_user(key, ukey, map->key_size) != 0)
+		goto free_key;
+
+	err = -ENOMEM;
+	value = kmalloc(map->value_size, GFP_ATOMIC);
+	if (!value)
+		goto free_key;
+
+	err = -EFAULT;
+	if (copy_from_user(value, uvalue, map->value_size) != 0)
+		goto free_value;
+
+	err = map->ops->map_update_elem(map, key, value);
+
+free_value:
+	kfree(value);
+free_key:
+	kfree(key);
+err_unlock:
+	rcu_read_unlock();
+	fdput(f);
+	return err;
+}
+
+static int map_delete_elem(int ufd, void __user *ukey)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_map *map;
+	void *key;
+	int err;
+
+	err = get_map_id(f);
+	if (err < 0)
+		return err;
+
+	rcu_read_lock();
+	map = idr_find(&bpf_map_id_idr, err);
+	err = -EINVAL;
+	if (!map)
+		goto err_unlock;
+
+	err = -ENOMEM;
+	key = kmalloc(map->key_size, GFP_ATOMIC);
+	if (!key)
+		goto err_unlock;
+
+	err = -EFAULT;
+	if (copy_from_user(key, ukey, map->key_size) != 0)
+		goto free_key;
+
+	err = map->ops->map_delete_elem(map, key);
+
+free_key:
+	kfree(key);
+err_unlock:
+	rcu_read_unlock();
+	fdput(f);
+	return err;
+}
+
+static int map_get_next_key(int ufd, void __user *ukey, void __user *unext_key)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_map *map;
+	void *key, *next_key;
+	int err;
+
+	err = get_map_id(f);
+	if (err < 0)
+		return err;
+
+	rcu_read_lock();
+	map = idr_find(&bpf_map_id_idr, err);
+	err = -EINVAL;
+	if (!map)
+		goto err_unlock;
+
+	err = -ENOMEM;
+	key = kmalloc(map->key_size, GFP_ATOMIC);
+	if (!key)
+		goto err_unlock;
+
+	err = -EFAULT;
+	if (copy_from_user(key, ukey, map->key_size) != 0)
+		goto free_key;
+
+	err = -ENOMEM;
+	next_key = kmalloc(map->key_size, GFP_ATOMIC);
+	if (!next_key)
+		goto free_key;
+
+	err = map->ops->map_get_next_key(map, key, next_key);
+	if (err)
+		goto free_next_key;
+
+	err = -EFAULT;
+	if (copy_to_user(unext_key, next_key, map->key_size) != 0)
+		goto free_next_key;
+
+	err = 0;
+
+free_next_key:
+	kfree(next_key);
+free_key:
+	kfree(key);
+err_unlock:
+	rcu_read_unlock();
+	fdput(f);
+	return err;
+}
+
 SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -219,6 +416,18 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
 	case BPF_MAP_CREATE:
 		return map_create((enum bpf_map_type) arg2,
 				  (struct nlattr __user *) arg3, (int) arg4);
+	case BPF_MAP_LOOKUP_ELEM:
+		return map_lookup_elem((int) arg2, (void __user *) arg3,
+				       (void __user *) arg4);
+	case BPF_MAP_UPDATE_ELEM:
+		return map_update_elem((int) arg2, (void __user *) arg3,
+				       (void __user *) arg4);
+	case BPF_MAP_DELETE_ELEM:
+		return map_delete_elem((int) arg2, (void __user *) arg3);
+
+	case BPF_MAP_GET_NEXT_KEY:
+		return map_get_next_key((int) arg2, (void __user *) arg3,
+					(void __user *) arg4);
 	default:
 		return -EINVAL;
 	}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 08/16] bpf: add hashtable type of BPF maps
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (7 preceding siblings ...)
  (?)
@ 2014-07-18  4:19 ` Alexei Starovoitov
  2014-07-23 18:36     ` Kees Cook
  -1 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

add new map type: BPF_MAP_TYPE_HASH
and its simple (not auto-resizable) hash table implementation

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/uapi/linux/bpf.h |    1 +
 kernel/bpf/Makefile      |    2 +-
 kernel/bpf/hashtab.c     |  371 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 373 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/hashtab.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5e1bfbc9cdc7..3ea11ba053a8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -347,6 +347,7 @@ enum bpf_map_attributes {
 
 enum bpf_map_type {
 	BPF_MAP_TYPE_UNSPEC,
+	BPF_MAP_TYPE_HASH,
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e9f7334ed07a..558e12712ebc 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o syscall.o
+obj-y := core.o syscall.o hashtab.o
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
new file mode 100644
index 000000000000..6e481cacbba3
--- /dev/null
+++ b/kernel/bpf/hashtab.c
@@ -0,0 +1,371 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <net/netlink.h>
+#include <linux/jhash.h>
+
+struct bpf_htab {
+	struct bpf_map map;
+	struct hlist_head *buckets;
+	struct kmem_cache *elem_cache;
+	char *slab_name;
+	spinlock_t lock;
+	u32 count; /* number of elements in this hashtable */
+	u32 n_buckets; /* number of hash buckets */
+	u32 elem_size; /* size of each element in bytes */
+};
+
+/* each htab element is struct htab_elem + key + value */
+struct htab_elem {
+	struct hlist_node hash_node;
+	struct rcu_head rcu;
+	struct bpf_htab *htab;
+	u32 hash;
+	u32 pad;
+	char key[0];
+};
+
+#define HASH_MAX_BUCKETS 1024
+#define BPF_MAP_MAX_KEY_SIZE 256
+static struct bpf_map *htab_map_alloc(struct nlattr *attr[BPF_MAP_ATTR_MAX + 1])
+{
+	struct bpf_htab *htab;
+	int err, i;
+
+	htab = kmalloc(sizeof(*htab), GFP_USER);
+	if (!htab)
+		return ERR_PTR(-ENOMEM);
+
+	/* look for mandatory map attributes */
+	err = -EINVAL;
+	if (!attr[BPF_MAP_KEY_SIZE])
+		goto free_htab;
+	htab->map.key_size = nla_get_u32(attr[BPF_MAP_KEY_SIZE]);
+
+	if (!attr[BPF_MAP_VALUE_SIZE])
+		goto free_htab;
+	htab->map.value_size = nla_get_u32(attr[BPF_MAP_VALUE_SIZE]);
+
+	if (!attr[BPF_MAP_MAX_ENTRIES])
+		goto free_htab;
+	htab->map.max_entries = nla_get_u32(attr[BPF_MAP_MAX_ENTRIES]);
+
+	htab->n_buckets = (htab->map.max_entries <= HASH_MAX_BUCKETS) ?
+			  htab->map.max_entries : HASH_MAX_BUCKETS;
+
+	/* hash table size must be power of 2 */
+	if ((htab->n_buckets & (htab->n_buckets - 1)) != 0)
+		goto free_htab;
+
+	err = -E2BIG;
+	if (htab->map.key_size > BPF_MAP_MAX_KEY_SIZE)
+		goto free_htab;
+
+	err = -ENOMEM;
+	htab->buckets = kmalloc(htab->n_buckets * sizeof(struct hlist_head),
+				GFP_USER);
+
+	if (!htab->buckets)
+		goto free_htab;
+
+	for (i = 0; i < htab->n_buckets; i++)
+		INIT_HLIST_HEAD(&htab->buckets[i]);
+
+	spin_lock_init(&htab->lock);
+	htab->count = 0;
+
+	htab->elem_size = sizeof(struct htab_elem) +
+			  round_up(htab->map.key_size, 8) +
+			  htab->map.value_size;
+
+	htab->slab_name = kasprintf(GFP_USER, "bpf_htab_%p", htab);
+	if (!htab->slab_name)
+		goto free_buckets;
+
+	htab->elem_cache = kmem_cache_create(htab->slab_name,
+					     htab->elem_size, 0, 0, NULL);
+	if (!htab->elem_cache)
+		goto free_slab_name;
+
+	return &htab->map;
+
+free_slab_name:
+	kfree(htab->slab_name);
+free_buckets:
+	kfree(htab->buckets);
+free_htab:
+	kfree(htab);
+	return ERR_PTR(err);
+}
+
+static inline u32 htab_map_hash(const void *key, u32 key_len)
+{
+	return jhash(key, key_len, 0);
+}
+
+static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static struct htab_elem *lookup_elem_raw(struct hlist_head *head, u32 hash,
+					 void *key, u32 key_size)
+{
+	struct htab_elem *l;
+
+	hlist_for_each_entry_rcu(l, head, hash_node) {
+		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+			return l;
+	}
+	return NULL;
+}
+
+/* Must be called with rcu_read_lock. */
+static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_head *head;
+	struct htab_elem *l;
+	u32 hash, key_size;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	key_size = map->key_size;
+
+	hash = htab_map_hash(key, key_size);
+
+	head = select_bucket(htab, hash);
+
+	l = lookup_elem_raw(head, hash, key, key_size);
+
+	if (l)
+		return l->key + round_up(map->key_size, 8);
+	else
+		return NULL;
+}
+
+/* Must be called with rcu_read_lock. */
+static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_head *head;
+	struct htab_elem *l, *next_l;
+	u32 hash, key_size;
+	int i;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	key_size = map->key_size;
+
+	hash = htab_map_hash(key, key_size);
+
+	head = select_bucket(htab, hash);
+
+	/* lookup the key */
+	l = lookup_elem_raw(head, hash, key, key_size);
+
+	if (!l) {
+		i = 0;
+		goto find_first_elem;
+	}
+
+	/* key was found, get next key in the same bucket */
+	next_l = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(&l->hash_node)),
+				  struct htab_elem, hash_node);
+
+	if (next_l) {
+		/* if next elem in this hash list is non-zero, just return it */
+		memcpy(next_key, next_l->key, key_size);
+		return 0;
+	} else {
+		/* no more elements in this hash list, go to the next bucket */
+		i = hash & (htab->n_buckets - 1);
+		i++;
+	}
+
+find_first_elem:
+	/* iterate over buckets */
+	for (; i < htab->n_buckets; i++) {
+		head = select_bucket(htab, i);
+
+		/* pick first element in the bucket */
+		next_l = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),
+					  struct htab_elem, hash_node);
+		if (next_l) {
+			/* if it's not empty, just return it */
+			memcpy(next_key, next_l->key, key_size);
+			return 0;
+		}
+	}
+
+	/* iterated over all buckets and all elements */
+	return -ENOENT;
+}
+
+static struct htab_elem *htab_alloc_elem(struct bpf_htab *htab)
+{
+	void *l;
+
+	l = kmem_cache_alloc(htab->elem_cache, GFP_ATOMIC);
+	if (!l)
+		return ERR_PTR(-ENOMEM);
+	return l;
+}
+
+static void free_htab_elem_rcu(struct rcu_head *rcu)
+{
+	struct htab_elem *l = container_of(rcu, struct htab_elem, rcu);
+
+	kmem_cache_free(l->htab->elem_cache, l);
+}
+
+static void release_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
+{
+	l->htab = htab;
+	call_rcu(&l->rcu, free_htab_elem_rcu);
+}
+
+/* Must be called with rcu_read_lock. */
+static int htab_map_update_elem(struct bpf_map *map, void *key, void *value)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct htab_elem *l_new, *l_old;
+	struct hlist_head *head;
+	u32 key_size;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	l_new = htab_alloc_elem(htab);
+	if (IS_ERR(l_new))
+		return -ENOMEM;
+
+	key_size = map->key_size;
+
+	memcpy(l_new->key, key, key_size);
+	memcpy(l_new->key + round_up(key_size, 8), value, map->value_size);
+
+	l_new->hash = htab_map_hash(l_new->key, key_size);
+
+	head = select_bucket(htab, l_new->hash);
+
+	l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
+
+	spin_lock_bh(&htab->lock);
+	if (!l_old && unlikely(htab->count >= map->max_entries)) {
+		/* if elem with this 'key' doesn't exist and we've reached
+		 * max_entries limit, fail insertion of new elem
+		 */
+		spin_unlock_bh(&htab->lock);
+		kmem_cache_free(htab->elem_cache, l_new);
+		return -EFBIG;
+	}
+
+	/* add new element to the head of the list, so that concurrent
+	 * search will find it before old elem
+	 */
+	hlist_add_head_rcu(&l_new->hash_node, head);
+	if (l_old) {
+		hlist_del_rcu(&l_old->hash_node);
+		release_htab_elem(htab, l_old);
+	} else {
+		htab->count++;
+	}
+	spin_unlock_bh(&htab->lock);
+
+	return 0;
+}
+
+/* Must be called with rcu_read_lock. */
+static int htab_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct htab_elem *l;
+	struct hlist_head *head;
+	u32 hash, key_size;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	key_size = map->key_size;
+
+	hash = htab_map_hash(key, key_size);
+
+	head = select_bucket(htab, hash);
+
+	l = lookup_elem_raw(head, hash, key, key_size);
+
+	if (l) {
+		spin_lock_bh(&htab->lock);
+		hlist_del_rcu(&l->hash_node);
+		htab->count--;
+		release_htab_elem(htab, l);
+		spin_unlock_bh(&htab->lock);
+		return 0;
+	}
+	return -ESRCH;
+}
+
+static void delete_all_elements(struct bpf_htab *htab)
+{
+	int i;
+
+	for (i = 0; i < htab->n_buckets; i++) {
+		struct hlist_head *head = select_bucket(htab, i);
+		struct hlist_node *n;
+		struct htab_elem *l;
+
+		hlist_for_each_entry_safe(l, n, head, hash_node) {
+			hlist_del_rcu(&l->hash_node);
+			htab->count--;
+			kmem_cache_free(htab->elem_cache, l);
+		}
+	}
+}
+
+/* called when map->refcnt goes to zero */
+static void htab_map_free(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+	/* wait for all outstanding updates to complete */
+	synchronize_rcu();
+
+	/* kmem_cache_free all htab elements */
+	delete_all_elements(htab);
+
+	/* and destroy cache, which might sleep */
+	kmem_cache_destroy(htab->elem_cache);
+
+	kfree(htab->buckets);
+	kfree(htab->slab_name);
+	kfree(htab);
+}
+
+static struct bpf_map_ops htab_ops = {
+	.map_alloc = htab_map_alloc,
+	.map_free = htab_map_free,
+	.map_get_next_key = htab_map_get_next_key,
+	.map_lookup_elem = htab_map_lookup_elem,
+	.map_update_elem = htab_map_update_elem,
+	.map_delete_elem = htab_map_delete_elem,
+};
+
+static struct bpf_map_type_list tl = {
+	.ops = &htab_ops,
+	.type = BPF_MAP_TYPE_HASH,
+};
+
+static int __init register_htab_map(void)
+{
+	bpf_register_map_type(&tl);
+	return 0;
+}
+late_initcall(register_htab_map);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 09/16] bpf: expand BPF syscall with program load/unload
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (8 preceding siblings ...)
  (?)
@ 2014-07-18  4:19 ` Alexei Starovoitov
  2014-07-23 19:00     ` Kees Cook
  -1 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:19 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

eBPF programs are safe run-to-completion functions with load/unload
methods from userspace similar to kernel modules.

User space API:

- load eBPF program
  fd = bpf_prog_load(bpf_prog_type, struct nlattr *prog, int len)

  where 'prog' is a sequence of sections (TEXT, LICENSE, MAP_FIXUP)
  TEXT - array of eBPF instructions
  LICENSE - must be GPL compatible to call helper functions marked gpl_only
  MAP_FIXUP - array of {insn idx, map fd} used by kernel to adjust
  imm constants in 'mov' instructions used to access maps

- unload eBPF program
  close(fd)

User space example of syscall(__NR_bpf, BPF_PROG_LOAD, prog_type, ...)
follows in later patches
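
For illustration, a hedged userspace sketch (not part of this patch) of loading
a trivial 'return 0' program via BPF_PROG_LOAD. The nla_put() helper is a
hypothetical local function using standard netlink TLV framing; the insn macros
and register names come from headers added earlier in this series. A program fd
is only returned once a matching program type is registered by a later patch:

  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/netlink.h>      /* struct nlattr, NLA_HDRLEN, NLA_ALIGN */
  #include <linux/bpf.h>          /* bpf_insn macros, BPF_PROG_* attributes */

  #ifndef __NR_bpf
  #define __NR_bpf 317            /* x86-64 number used in this series */
  #endif

  /* hypothetical helper: append one attribute, return the new offset */
  static int nla_put(char *buf, int off, uint16_t type, const void *data, int size)
  {
          struct nlattr *nla = (struct nlattr *)(buf + off);

          nla->nla_type = type;
          nla->nla_len = NLA_HDRLEN + size;
          memcpy((char *)nla + NLA_HDRLEN, data, size);
          return off + NLA_ALIGN(nla->nla_len);
  }

  int load_return_zero(void)
  {
          struct bpf_insn insns[] = {
                  BPF_MOV64_IMM(BPF_REG_0, 0),    /* r0 = 0 */
                  BPF_EXIT_INSN(),                /* return r0 */
          };
          char license[] = "GPL";
          char buf[256];
          int len = 0;

          len = nla_put(buf, len, BPF_PROG_TEXT, insns, sizeof(insns));
          len = nla_put(buf, len, BPF_PROG_LICENSE, license, sizeof(license));

          /* returns a program fd or a negative error; close(fd) unloads it */
          return syscall(__NR_bpf, BPF_PROG_LOAD, BPF_PROG_TYPE_UNSPEC, buf, len);
  }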

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/bpf.h      |   33 +++++
 include/linux/filter.h   |    9 +-
 include/uapi/linux/bpf.h |   29 +++++
 kernel/bpf/core.c        |    5 +-
 kernel/bpf/syscall.c     |  309 ++++++++++++++++++++++++++++++++++++++++++++++
 net/core/filter.c        |    9 +-
 6 files changed, 388 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 91e2caf8edf9..4967619595cc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -46,4 +46,37 @@ struct bpf_map_type_list {
 void bpf_register_map_type(struct bpf_map_type_list *tl);
 struct bpf_map *bpf_map_get(u32 map_id);
 
+/* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
+ * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL
+ * instructions after verifying
+ */
+struct bpf_func_proto {
+	u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+	bool gpl_only;
+};
+
+struct bpf_verifier_ops {
+	/* return eBPF function prototype for verification */
+	const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
+};
+
+struct bpf_prog_type_list {
+	struct list_head list_node;
+	struct bpf_verifier_ops *ops;
+	enum bpf_prog_type type;
+};
+
+void bpf_register_prog_type(struct bpf_prog_type_list *tl);
+
+struct bpf_prog_info {
+	bool is_gpl_compatible;
+	enum bpf_prog_type prog_type;
+	struct bpf_verifier_ops *ops;
+	u32 *used_maps;
+	u32 used_map_cnt;
+};
+
+void free_bpf_prog_info(struct bpf_prog_info *info);
+struct sk_filter *bpf_prog_get(u32 ufd);
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index b43ad6a2b3cf..822b310e75e1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -30,12 +30,17 @@ struct sock_fprog_kern {
 struct sk_buff;
 struct sock;
 struct seccomp_data;
+struct bpf_prog_info;
 
 struct sk_filter {
 	atomic_t		refcnt;
 	u32			jited:1,	/* Is our filter JIT'ed? */
-				len:31;		/* Number of filter blocks */
-	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
+				ebpf:1,		/* Is it eBPF program ? */
+				len:30;		/* Number of filter blocks */
+	union {
+		struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
+		struct bpf_prog_info	*info;
+	};
 	struct rcu_head		rcu;
 	unsigned int		(*bpf_func)(const struct sk_buff *skb,
 					    const struct bpf_insn *filter);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3ea11ba053a8..06ba71b49f64 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -333,6 +333,13 @@ enum bpf_cmd {
 	 * returns zero and stores next key or negative error
 	 */
 	BPF_MAP_GET_NEXT_KEY,
+
+	/* verify and load eBPF program
+	 * fd = bpf_prog_load(bpf_prog_type, struct nlattr *prog, int len)
+	 * prog is a sequence of sections
+	 * returns fd or negative error
+	 */
+	BPF_PROG_LOAD,
 };
 
 enum bpf_map_attributes {
@@ -350,4 +357,26 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_HASH,
 };
 
+enum bpf_prog_attributes {
+	BPF_PROG_UNSPEC,
+	BPF_PROG_TEXT,		/* array of eBPF instructions */
+	BPF_PROG_LICENSE,	/* license string */
+	BPF_PROG_MAP_FIXUP,	/* array of {insn idx, map fd} to fixup insns */
+	__BPF_PROG_ATTR_MAX,
+};
+#define BPF_PROG_ATTR_MAX (__BPF_PROG_ATTR_MAX - 1)
+#define BPF_PROG_MAX_ATTR_SIZE 65535
+
+enum bpf_prog_type {
+	BPF_PROG_TYPE_UNSPEC,
+};
+
+/* integer value in 'imm' field of BPF_CALL instruction selects which helper
+ * function eBPF program intends to call
+ */
+enum bpf_func_id {
+	BPF_FUNC_unspec,
+	__BPF_FUNC_MAX_ID,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 265a02cc822d..e65ecdc36358 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -23,6 +23,7 @@
 #include <linux/filter.h>
 #include <linux/skbuff.h>
 #include <asm/unaligned.h>
+#include <linux/bpf.h>
 
 /* Registers */
 #define BPF_R0	regs[BPF_REG_0]
@@ -528,9 +529,11 @@ void sk_filter_select_runtime(struct sk_filter *fp)
 }
 EXPORT_SYMBOL_GPL(sk_filter_select_runtime);
 
-/* free internal BPF program */
+/* free internal BPF program, called after RCU grace period */
 void sk_filter_free(struct sk_filter *fp)
 {
+	if (fp->ebpf)
+		free_bpf_prog_info(fp->info);
 	bpf_jit_free(fp);
 }
 EXPORT_SYMBOL_GPL(sk_filter_free);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ca2be66845b3..9e45ca6b6937 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -14,6 +14,8 @@
 #include <net/netlink.h>
 #include <linux/anon_inodes.h>
 #include <linux/file.h>
+#include <linux/license.h>
+#include <linux/filter.h>
 
 /* mutex to protect insertion/deletion of map_id in IDR */
 static DEFINE_MUTEX(bpf_map_lock);
@@ -406,6 +408,310 @@ err_unlock:
 	return err;
 }
 
+static LIST_HEAD(bpf_prog_types);
+
+static int find_prog_type(enum bpf_prog_type type, struct sk_filter *prog)
+{
+	struct bpf_prog_type_list *tl;
+
+	list_for_each_entry(tl, &bpf_prog_types, list_node) {
+		if (tl->type == type) {
+			prog->info->ops = tl->ops;
+			prog->info->prog_type = type;
+			return 0;
+		}
+	}
+	return -EINVAL;
+}
+
+void bpf_register_prog_type(struct bpf_prog_type_list *tl)
+{
+	list_add(&tl->list_node, &bpf_prog_types);
+}
+
+/* fixup insn->imm field of bpf_call instructions:
+ * if (insn->imm == BPF_FUNC_map_lookup_elem)
+ *      insn->imm = bpf_map_lookup_elem - __bpf_call_base;
+ * else if (insn->imm == BPF_FUNC_map_update_elem)
+ *      insn->imm = bpf_map_update_elem - __bpf_call_base;
+ * else ...
+ *
+ * this function is called after eBPF program passed verification
+ */
+static void fixup_bpf_calls(struct sk_filter *prog)
+{
+	const struct bpf_func_proto *fn;
+	int i;
+
+	for (i = 0; i < prog->len; i++) {
+		struct bpf_insn *insn = &prog->insnsi[i];
+
+		if (insn->code == (BPF_JMP | BPF_CALL)) {
+			/* we reach here when program has bpf_call instructions
+			 * and it passed bpf_check(), means that
+			 * ops->get_func_proto must have been supplied, check it
+			 */
+			BUG_ON(!prog->info->ops->get_func_proto);
+
+			fn = prog->info->ops->get_func_proto(insn->imm);
+			/* all functions that have prototype and verifier allowed
+			 * programs to call them, must be real in-kernel functions
+			 */
+			BUG_ON(!fn->func);
+			insn->imm = fn->func - __bpf_call_base;
+		}
+	}
+}
+
+/* fixup instructions that are using map_ids:
+ *
+ * BPF_MOV64_IMM(BPF_REG_1, MAP_ID), // r1 = MAP_ID
+ * BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ *
+ * in the 1st insn kernel replaces MAP_ID with global map_id,
+ * since programs are executing out of different contexts and must use
+ * globally visible ids to access maps
+ *
+ * map_fixup is an array of pairs {insn idx, map ufd}
+ *
+ * kernel resolves ufd -> global map_id and adjusts eBPF instructions
+ */
+static int fixup_bpf_map_id(struct sk_filter *prog, struct nlattr *map_fixup)
+{
+	struct {
+		u32 insn_idx;
+		u32 ufd;
+	} *fixup = nla_data(map_fixup);
+	int fixup_len = nla_len(map_fixup) / sizeof(*fixup);
+	struct bpf_insn *insn;
+	struct fd f;
+	u32 idx;
+	int i, map_id;
+
+	if (fixup_len <= 0)
+		return -EINVAL;
+
+	for (i = 0; i < fixup_len; i++) {
+		idx = fixup[i].insn_idx;
+		if (idx >= prog->len)
+			return -EINVAL;
+
+		insn = &prog->insnsi[idx];
+		if (insn->code != (BPF_ALU64 | BPF_MOV | BPF_K) &&
+		    insn->code != (BPF_ALU | BPF_MOV | BPF_K))
+			return -EINVAL;
+
+		f = fdget(fixup[i].ufd);
+
+		map_id = get_map_id(f);
+
+		if (map_id < 0)
+			return map_id;
+
+		insn->imm = map_id;
+		fdput(f);
+	}
+	return 0;
+}
+
+/* free eBPF program auxiliary data, called after rcu grace period,
+ * so it's safe to drop refcnt on maps used by this program
+ *
+ * called from sk_filter_release()->sk_filter_release_rcu()->sk_filter_free()
+ */
+void free_bpf_prog_info(struct bpf_prog_info *info)
+{
+	bool found;
+	int i;
+
+	for (i = 0; i < info->used_map_cnt; i++) {
+		found = bpf_map_put(info->used_maps[i]);
+		/* all maps that this program was using should obviously still
+		 * be there
+		 */
+		BUG_ON(!found);
+	}
+	kfree(info);
+}
+
+static int bpf_prog_release(struct inode *inode, struct file *filp)
+{
+	struct sk_filter *prog = filp->private_data;
+
+	sk_unattached_filter_destroy(prog);
+	return 0;
+}
+
+static const struct file_operations bpf_prog_fops = {
+        .release = bpf_prog_release,
+};
+
+static const struct nla_policy prog_policy[BPF_PROG_ATTR_MAX + 1] = {
+	[BPF_PROG_TEXT]      = { .type = NLA_BINARY },
+	[BPF_PROG_LICENSE]   = { .type = NLA_NUL_STRING },
+	[BPF_PROG_MAP_FIXUP] = { .type = NLA_BINARY },
+};
+
+static int bpf_prog_load(enum bpf_prog_type type, struct nlattr __user *uattr,
+			 int len)
+{
+	struct nlattr *tb[BPF_PROG_ATTR_MAX + 1];
+	struct sk_filter *prog;
+	struct bpf_map *map;
+	struct nlattr *attr;
+	size_t insn_len;
+	int err, i;
+	bool is_gpl;
+
+	if (len <= 0 || len > BPF_PROG_MAX_ATTR_SIZE)
+		return -EINVAL;
+
+	attr = kmalloc(len, GFP_USER);
+	if (!attr)
+		return -ENOMEM;
+
+	/* copy eBPF program from user space */
+	err = -EFAULT;
+	if (copy_from_user(attr, uattr, len) != 0)
+		goto free_attr;
+
+	/* perform basic validation */
+	err = nla_parse(tb, BPF_PROG_ATTR_MAX, attr, len, prog_policy);
+	if (err < 0)
+		goto free_attr;
+
+	err = -EINVAL;
+	/* look for mandatory license string */
+	if (!tb[BPF_PROG_LICENSE])
+		goto free_attr;
+
+	/* eBPF programs must be GPL compatible to use GPL-ed functions */
+	is_gpl = license_is_gpl_compatible(nla_data(tb[BPF_PROG_LICENSE]));
+
+	/* look for mandatory array of eBPF instructions */
+	if (!tb[BPF_PROG_TEXT])
+		goto free_attr;
+
+	insn_len = nla_len(tb[BPF_PROG_TEXT]);
+	if (insn_len % sizeof(struct bpf_insn) != 0 || insn_len <= 0)
+		goto free_attr;
+
+	/* plain sk_filter allocation */
+	err = -ENOMEM;
+	prog = kmalloc(sk_filter_size(insn_len), GFP_USER);
+	if (!prog)
+		goto free_attr;
+
+	prog->len = insn_len / sizeof(struct bpf_insn);
+	memcpy(prog->insns, nla_data(tb[BPF_PROG_TEXT]), insn_len);
+	prog->orig_prog = NULL;
+	prog->jited = 0;
+	prog->ebpf = 0;
+	atomic_set(&prog->refcnt, 1);
+
+	if (tb[BPF_PROG_MAP_FIXUP]) {
+		/* if program is using maps, fixup map_ids */
+		err = fixup_bpf_map_id(prog, tb[BPF_PROG_MAP_FIXUP]);
+		if (err < 0)
+			goto free_prog;
+	}
+
+	/* allocate eBPF related auxiliary data */
+	prog->info = kzalloc(sizeof(struct bpf_prog_info), GFP_USER);
+	if (!prog->info)
+		goto free_prog;
+	prog->ebpf = 1;
+	prog->info->is_gpl_compatible = is_gpl;
+
+	/* find program type: socket_filter vs tracing_filter */
+	err = find_prog_type(type, prog);
+	if (err < 0)
+		goto free_prog;
+
+	/* lock maps to prevent any changes to maps, since eBPF program may
+	 * use them. In such case bpf_check() will populate prog->used_maps
+	 */
+	mutex_lock(&bpf_map_lock);
+
+	/* run eBPF verifier */
+	/* err = bpf_check(prog); */
+
+	if (err == 0 && prog->info->used_maps) {
+		/* program passed verifier and it's using some maps,
+		 * hold them
+		 */
+		for (i = 0; i < prog->info->used_map_cnt; i++) {
+			map = bpf_map_get(prog->info->used_maps[i]);
+			BUG_ON(!map);
+			atomic_inc(&map->refcnt);
+		}
+	}
+	mutex_unlock(&bpf_map_lock);
+
+	if (err < 0)
+		goto free_prog;
+
+	/* fixup BPF_CALL->imm field */
+	fixup_bpf_calls(prog);
+
+	/* eBPF program is ready to be JITed */
+	sk_filter_select_runtime(prog);
+
+	err = anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog, O_RDWR | O_CLOEXEC);
+
+	if (err < 0)
+		/* failed to allocate fd */
+		goto free_prog;
+
+	/* user supplied eBPF prog attributes are no longer needed */
+	kfree(attr);
+
+	return err;
+free_prog:
+	sk_filter_free(prog);
+free_attr:
+	kfree(attr);
+	return err;
+}
+
+static struct sk_filter *get_prog(struct fd f)
+{
+	struct sk_filter *prog;
+
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+
+	if (f.file->f_op != &bpf_prog_fops) {
+		fdput(f);
+		return ERR_PTR(-EINVAL);
+	}
+
+	prog = f.file->private_data;
+
+	return prog;
+}
+
+/* called from sk_attach_filter_ebpf() or from tracing filter attach
+ * pairs with
+ * sk_detach_filter()->sk_filter_uncharge()->sk_filter_release()
+ * or with
+ * sk_unattached_filter_destroy()->sk_filter_release()
+ */
+struct sk_filter *bpf_prog_get(u32 ufd)
+{
+	struct fd f = fdget(ufd);
+	struct sk_filter *prog;
+
+	prog = get_prog(f);
+
+	if (IS_ERR(prog))
+		return prog;
+
+	atomic_inc(&prog->refcnt);
+	fdput(f);
+	return prog;
+}
+
 SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -428,6 +734,9 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
 	case BPF_MAP_GET_NEXT_KEY:
 		return map_get_next_key((int) arg2, (void __user *) arg3,
 					(void __user *) arg4);
+	case BPF_PROG_LOAD:
+		return bpf_prog_load((enum bpf_prog_type) arg2,
+				     (struct nlattr __user *) arg3, (int) arg4);
 	default:
 		return -EINVAL;
 	}
diff --git a/net/core/filter.c b/net/core/filter.c
index f3b2d5e9fe5f..255dba1bb678 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -835,7 +835,7 @@ static void sk_release_orig_filter(struct sk_filter *fp)
 {
 	struct sock_fprog_kern *fprog = fp->orig_prog;
 
-	if (fprog) {
+	if (!fp->ebpf && fprog) {
 		kfree(fprog->filter);
 		kfree(fprog);
 	}
@@ -867,14 +867,16 @@ static void sk_filter_release(struct sk_filter *fp)
 
 void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
 {
-	atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
+	if (!fp->ebpf)
+		atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
 	sk_filter_release(fp);
 }
 
 void sk_filter_charge(struct sock *sk, struct sk_filter *fp)
 {
 	atomic_inc(&fp->refcnt);
-	atomic_add(sk_filter_size(fp->len), &sk->sk_omem_alloc);
+	if (!fp->ebpf)
+		atomic_add(sk_filter_size(fp->len), &sk->sk_omem_alloc);
 }
 
 static struct sk_filter *__sk_migrate_realloc(struct sk_filter *fp,
@@ -978,6 +980,7 @@ static struct sk_filter *__sk_prepare_filter(struct sk_filter *fp,
 
 	fp->bpf_func = NULL;
 	fp->jited = 0;
+	fp->ebpf = 0;
 
 	err = sk_chk_filter(fp->insns, fp->len);
 	if (err) {
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 10/16] bpf: add eBPF verifier
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (9 preceding siblings ...)
  (?)
@ 2014-07-18  4:20 ` Alexei Starovoitov
  2014-07-23 23:38     ` Kees Cook
  2014-07-24 18:25     ` Andy Lutomirski
  -1 siblings, 2 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

Safety of eBPF programs is statically determined by the verifier, which detects:
- loops
- out of range jumps
- unreachable instructions
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention

It checks that
- R1-R5 registers satisfy the function prototype
- program terminates
- BPF_LD_ABS|IND instructions are only used in socket filters

It is configured with:

- bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
  that provides information to the verifier about which fields of 'ctx'
  are accessible (remember 'ctx' is the first argument to the eBPF program)

- const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
  reports argument types of kernel helper functions that eBPF program
  may call, so that the verifier can check that R1-R5 types match the prototype

More details in Documentation/networking/filter.txt
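
For illustration, a hedged sketch (not part of this patch) of an
is_valid_access() callback for a hypothetical program type whose context is a
two-field struct, where only an aligned 4-byte read of the first field is
allowed. The struct and function names are made up; only the callback prototype
comes from this patch:

  #include <linux/types.h>
  #include <linux/stddef.h>
  #include <linux/bpf.h>

  struct my_ctx {
          __u32 cookie;           /* programs may read this field */
          __u64 kernel_ptr;       /* must stay invisible to programs */
  };

  static bool my_prog_is_valid_access(int off, int size,
                                      enum bpf_access_type type)
  {
          /* a full callback would also act on 'type' (read vs write access);
           * this sketch only validates the offset and size of the access
           */
          return off == offsetof(struct my_ctx, cookie) && size == sizeof(__u32);
  }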

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 Documentation/networking/filter.txt |  233 ++++++
 include/linux/bpf.h                 |   49 ++
 include/uapi/linux/bpf.h            |    1 +
 kernel/bpf/Makefile                 |    2 +-
 kernel/bpf/syscall.c                |    2 +-
 kernel/bpf/verifier.c               | 1520 +++++++++++++++++++++++++++++++++++
 6 files changed, 1805 insertions(+), 2 deletions(-)
 create mode 100644 kernel/bpf/verifier.c

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index e14e486f69cd..778f763fce10 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -995,6 +995,108 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
 Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
 2 byte atomic increments are not supported.
 
+eBPF verifier
+-------------
+The safety of the eBPF program is determined in two steps.
+
+The first step does a DAG check to disallow loops and other CFG validation.
+In particular, it will detect programs that have unreachable instructions
+(though the classic BPF checker allows them).
+
+The second step starts from the first insn and descends all possible paths.
+It simulates execution of every insn and observes the state change of
+registers and stack.
+
+At the start of the program the register R1 contains a pointer to context
+and has type PTR_TO_CTX.
+If the verifier sees an insn that does R2=R1, then R2 now has type
+PTR_TO_CTX as well and can be used on the right hand side of an expression.
+If R1=PTR_TO_CTX and the insn is R2=R1+R1, then R2=INVALID_PTR,
+since the addition of two valid pointers makes an invalid pointer.
+
+If a register was never written to, it's not readable:
+  bpf_mov R0 = R2
+  bpf_exit
+will be rejected, since R2 is unreadable at the start of the program.
+
+After a kernel function call, R1-R5 are reset to unreadable and
+R0 has the return type of the function.
+
+Since R6-R9 are callee saved, their state is preserved across the call.
+  bpf_mov R6 = 1
+  bpf_call foo
+  bpf_mov R0 = R6
+  bpf_exit
+is a correct program. If there was R1 instead of R6, it would have
+been rejected.
+
+Classic BPF register X is mapped to eBPF register R7 inside sk_convert_filter(),
+so that its state is preserved across calls.
+
+load/store instructions are allowed only with registers of valid types, which
+are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
+For example:
+ bpf_mov R1 = 1
+ bpf_mov R2 = 2
+ bpf_xadd *(u32 *)(R1 + 3) += R2
+ bpf_exit
+will be rejected, since R1 doesn't have a valid pointer type at the time of
+execution of instruction bpf_xadd.
+
+At the start R1 contains a pointer to ctx and has type PTR_TO_CTX.
+ctx is generic. The verifier is configured to know what the context is for a
+particular class of bpf programs. For example, ctx == skb for socket filters
+and ctx == seccomp_data for seccomp filters.
+A callback is used to customize the verifier to restrict eBPF program access to
+only certain fields within the ctx structure, with specified size and alignment.
+
+For example, the following insn:
+  bpf_ld R0 = *(u32 *)(R6 + 8)
+intends to load a word from address R6 + 8 and store it into R0.
+If R6=PTR_TO_CTX, the verifier will know via the is_valid_access() callback
+that offset 8 of size 4 bytes can be accessed for reading; otherwise
+the verifier will reject the program.
+If R6=PTR_TO_STACK, then the access must be aligned and within
+stack bounds, which are [-MAX_BPF_STACK, 0). In this example the offset is 8,
+so it will fail verification, since it's out of bounds.
+
+The verifier will allow an eBPF program to read data from the stack only after
+it has written into it.
+The classic BPF verifier does a similar check with the M[0-15] memory slots.
+For example:
+  bpf_ld R0 = *(u32 *)(R10 - 4)
+  bpf_exit
+is an invalid program.
+Though R10 is a correct read-only register with type PTR_TO_STACK,
+and R10 - 4 is within stack bounds, there were no stores into that location.
+
+Pointer register spill/fill is tracked as well, since four (R6-R9)
+callee saved registers may not be enough for some programs.
+
+Allowed function calls are customized with bpf_verifier_ops->get_func_proto().
+For example, the skb_get_nlattr() function has the following definition:
+  struct bpf_func_proto proto = {RET_INTEGER, PTR_TO_CTX};
+and the eBPF verifier will check that this function is always called with the
+first argument being 'ctx'. In other words, R1 must have type PTR_TO_CTX
+at the time of the bpf_call insn.
+After the call, register R0 will be set to a readable state, so that the
+program can access it.
+
+Function calls are the main mechanism for extending the functionality of eBPF
+programs. Socket filters may let programs call one set of functions, whereas
+tracing filters may allow a completely different set.
+
+If a function is made accessible to eBPF programs, it needs to be thought
+through from a security point of view. The verifier will guarantee that the
+function is called with valid arguments.
+
+Seccomp and socket filters have different security restrictions for classic BPF.
+Seccomp solves this with a two-stage verifier: the classic BPF verifier is
+followed by the seccomp verifier. In the case of eBPF, one configurable
+verifier is shared for all use cases.
+
+See details of eBPF verifier in kernel/bpf/verifier.c
+
 eBPF maps
 ---------
 'maps' is a generic storage of different types for sharing data between kernel
@@ -1064,6 +1166,137 @@ size. It will not let programs pass junk values as 'key' and 'value' to
 bpf_map_*_elem() functions, so these functions (implemented in C inside kernel)
 can safely access the pointers in all cases.
 
+Understanding eBPF verifier messages
+------------------------------------
+
+The following are a few examples of invalid eBPF programs and verifier error
+messages as seen in the log:
+
+Program with unreachable instructions:
+static struct bpf_insn prog[] = {
+  BPF_EXIT_INSN(),
+  BPF_EXIT_INSN(),
+};
+Error:
+  unreachable insn 1
+
+Program that reads uninitialized register:
+  BPF_ALU64_REG(BPF_MOV, BPF_REG_0, BPF_REG_2),
+  BPF_EXIT_INSN(),
+Error:
+  0: (bf) r0 = r2
+  R2 !read_ok
+
+Program that doesn't initialize R0 before exiting:
+  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_1),
+  BPF_EXIT_INSN(),
+Error:
+  0: (bf) r2 = r1
+  1: (95) exit
+  R0 !read_ok
+
+Program that accesses stack out of bounds:
+  BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
+  BPF_EXIT_INSN(),
+Error:
+  0: (7a) *(u64 *)(r10 +8) = 0
+  invalid stack off=8 size=8
+
+Program that doesn't initialize stack before passing its address into function:
+  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
+  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+  BPF_EXIT_INSN(),
+Error:
+  0: (bf) r2 = r10
+  1: (07) r2 += -8
+  2: (b7) r1 = 1
+  3: (85) call 1
+  invalid indirect read from stack off -8+0 size 8
+
+Program that uses invalid map_id=2 while calling to map_lookup_elem() function:
+  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 2),
+  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+  BPF_EXIT_INSN(),
+Error:
+  0: (7a) *(u64 *)(r10 -8) = 0
+  1: (bf) r2 = r10
+  2: (07) r2 += -8
+  3: (b7) r1 = 2
+  4: (85) call 1
+  invalid access to map_id=2
+
+Program that doesn't check return value of map_lookup_elem() before accessing
+map element:
+  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
+  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+  BPF_EXIT_INSN(),
+Error:
+  0: (7a) *(u64 *)(r10 -8) = 0
+  1: (bf) r2 = r10
+  2: (07) r2 += -8
+  3: (b7) r1 = 1
+  4: (85) call 1
+  5: (7a) *(u64 *)(r0 +0) = 0
+  R0 invalid mem access 'map_value_or_null'
+
+Program that correctly checks map_lookup_elem() returned value for NULL, but
+accesses the memory with incorrect alignment:
+  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
+  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
+  BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
+  BPF_EXIT_INSN(),
+Error:
+  0: (7a) *(u64 *)(r10 -8) = 0
+  1: (bf) r2 = r10
+  2: (07) r2 += -8
+  3: (b7) r1 = 1
+  4: (85) call 1
+  5: (15) if r0 == 0x0 goto pc+1
+   R0=map_value1 R10=fp
+  6: (7a) *(u64 *)(r0 +4) = 0
+  misaligned access off 4 size 8
+
+Program that correctly checks map_lookup_elem() returned value for NULL and
+accesses memory with correct alignment in one side of 'if' branch, but fails
+to do so in the other side of 'if' branch:
+  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
+  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
+  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+  BPF_EXIT_INSN(),
+  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
+  BPF_EXIT_INSN(),
+Error:
+  0: (7a) *(u64 *)(r10 -8) = 0
+  1: (bf) r2 = r10
+  2: (07) r2 += -8
+  3: (b7) r1 = 1
+  4: (85) call 1
+  5: (15) if r0 == 0x0 goto pc+2
+   R0=map_value1 R10=fp
+  6: (7a) *(u64 *)(r0 +0) = 0
+  7: (95) exit
+
+  from 5 to 8: R0=imm0 R10=fp
+  8: (7a) *(u64 *)(r0 +0) = 1
+  R0 invalid mem access 'imm'
+
 Testing
 -------
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4967619595cc..b5e90efddfcf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -46,6 +46,31 @@ struct bpf_map_type_list {
 void bpf_register_map_type(struct bpf_map_type_list *tl);
 struct bpf_map *bpf_map_get(u32 map_id);
 
+/* function argument constraints */
+enum bpf_arg_type {
+	ARG_ANYTHING = 0,	/* any argument is ok */
+
+	/* the following constraints are used to prototype
+	 * bpf_map_lookup/update/delete_elem() functions
+	 */
+	ARG_CONST_MAP_ID,	/* int const argument used as map_id */
+	ARG_PTR_TO_MAP_KEY,	/* pointer to stack used as map key */
+	ARG_PTR_TO_MAP_VALUE,	/* pointer to stack used as map value */
+
+	/* the following constraints are used to prototype bpf_memcmp() and other
+	 * functions that access data on eBPF program stack
+	 */
+	ARG_PTR_TO_STACK,	/* any pointer to eBPF program stack */
+	ARG_CONST_STACK_SIZE,	/* number of bytes accessed from stack */
+};
+
+/* type of values returned from helper functions */
+enum bpf_return_type {
+	RET_INTEGER,		/* function returns integer */
+	RET_VOID,		/* function doesn't return anything */
+	RET_PTR_TO_MAP_OR_NULL,	/* function returns a pointer to map elem value or NULL */
+};
+
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
  * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL
  * instructions after verifying
@@ -53,11 +78,33 @@ struct bpf_map *bpf_map_get(u32 map_id);
 struct bpf_func_proto {
 	u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 	bool gpl_only;
+	enum bpf_return_type ret_type;
+	enum bpf_arg_type arg1_type;
+	enum bpf_arg_type arg2_type;
+	enum bpf_arg_type arg3_type;
+	enum bpf_arg_type arg4_type;
+	enum bpf_arg_type arg5_type;
+};
+
+/* bpf_context is an intentionally undefined structure. A pointer to
+ * bpf_context is the first argument to eBPF programs.
+ * For socket filters: 'struct bpf_context *' == 'struct sk_buff *'
+ */
+struct bpf_context;
+
+enum bpf_access_type {
+	BPF_READ = 1,
+	BPF_WRITE = 2
 };
 
 struct bpf_verifier_ops {
 	/* return eBPF function prototype for verification */
 	const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
+
+	/* return true if 'size' wide access at offset 'off' within bpf_context
+	 * with 'type' (read or write) is allowed
+	 */
+	bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
 };
 
 struct bpf_prog_type_list {
@@ -78,5 +125,7 @@ struct bpf_prog_info {
 
 void free_bpf_prog_info(struct bpf_prog_info *info);
 struct sk_filter *bpf_prog_get(u32 ufd);
+/* verify correctness of eBPF program */
+int bpf_check(struct sk_filter *fp);
 
 #endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 06ba71b49f64..3f288e1d08f1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -369,6 +369,7 @@ enum bpf_prog_attributes {
 
 enum bpf_prog_type {
 	BPF_PROG_TYPE_UNSPEC,
+	BPF_PROG_TYPE_SOCKET_FILTER,
 };
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 558e12712ebc..95a9035e0f29 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1 +1 @@
-obj-y := core.o syscall.o hashtab.o
+obj-y := core.o syscall.o hashtab.o verifier.o
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9e45ca6b6937..9d441f17548e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -634,7 +634,7 @@ static int bpf_prog_load(enum bpf_prog_type type, struct nlattr __user *uattr,
 	mutex_lock(&bpf_map_lock);
 
 	/* run eBPF verifier */
-	/* err = bpf_check(prog); */
+	err = bpf_check(prog);
 
 	if (err == 0 && prog->info->used_maps) {
 		/* program passed verifier and it's using some maps,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
new file mode 100644
index 000000000000..0fce771632b4
--- /dev/null
+++ b/kernel/bpf/verifier.c
@@ -0,0 +1,1520 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/capability.h>
+
+/* bpf_check() is a static code analyzer that walks eBPF program
+ * instruction by instruction and updates register/stack state.
+ * All paths of conditional branches are analyzed until 'bpf_exit' insn.
+ *
+ * In the first pass a depth-first search verifies that the BPF program is a DAG.
+ * It rejects the following programs:
+ * - larger than BPF_MAXINSNS insns
+ * - if loop is present (detected via back-edge)
+ * - unreachable insns exist (shouldn't be a forest. program = one function)
+ * - out of bounds or malformed jumps
+ * The second pass descends all possible paths from the 1st insn.
+ * Conditional branch target insns keep a linked list of verifier states.
+ * If the state was already visited, this path can be pruned.
+ * If it wasn't a DAG, such state pruning would be incorrect, since it would
+ * skip cycles. Since it's analyzing all paths through the program,
+ * the length of the analysis is limited to 32k insn, which may be hit even
+ * if insn_cnt < 4K, but there are too many branches that change stack/regs.
+ * The number of 'branches to be analyzed' is limited to 1k.
+ *
+ * On entry to each instruction, each register has a type, and the instruction
+ * changes the types of the registers depending on instruction semantics.
+ * If instruction is BPF_MOV64_REG(BPF_REG_1, BPF_REG_5), then type of R5 is
+ * copied to R1.
+ *
+ * All registers are 64-bit (even on 32-bit arch)
+ * R0 - return register
+ * R1-R5 argument passing registers
+ * R6-R9 callee saved registers
+ * R10 - frame pointer read-only
+ *
+ * At the start of BPF program the register R1 contains a pointer to bpf_context
+ * and has type PTR_TO_CTX.
+ *
+ * Most of the time the registers have UNKNOWN_VALUE type, which
+ * means the register has some value, but it's not a valid pointer.
+ * The verifier doesn't attempt to track all arithmetic operations on pointers.
+ * The only special case is the sequence:
+ *    BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+ *    BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -20),
+ * 1st insn copies R10 (which has FRAME_PTR) type into R1
+ * and 2nd arithmetic instruction is pattern matched to recognize
+ * that it wants to construct a pointer to some element within stack.
+ * So after 2nd insn, the register R1 has type PTR_TO_STACK
+ * (and -20 constant is saved for further stack bounds checking).
+ * Meaning that this reg is a pointer to stack plus known immediate constant.
+ *
+ * When program is doing load or store insns the type of base register can be:
+ * PTR_TO_MAP, PTR_TO_CTX, FRAME_PTR. These are three pointer types recognized
+ * by check_mem_access() function.
+ *
+ * PTR_TO_MAP means that this register is pointing to 'map element value'
+ * and the range of [ptr, ptr + map's value_size) is accessible.
+ *
+ * registers used to pass pointers to function calls are verified against
+ * function prototypes
+ *
+ * ARG_PTR_TO_MAP_KEY is a function argument constraint.
+ * It means that the register type passed to this function must be
+ * PTR_TO_STACK and it will be used inside the function as
+ * 'pointer to map element key'
+ *
+ * For example the argument constraints for bpf_map_lookup_elem():
+ *   .ret_type = RET_PTR_TO_MAP_OR_NULL,
+ *   .arg1_type = ARG_CONST_MAP_ID,
+ *   .arg2_type = ARG_PTR_TO_MAP_KEY,
+ *
+ * ret_type says that this function returns 'pointer to map elem value or null'
+ * 1st argument is a 'const immediate' value which must be one of valid map_ids.
+ * 2nd argument is a pointer to stack, which will be used inside the function as
+ * a pointer to map element key.
+ *
+ * On the kernel side the helper function looks like:
+ * u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+ * {
+ *    struct bpf_map *map;
+ *    int map_id = r1;
+ *    void *key = (void *) (unsigned long) r2;
+ *    void *value;
+ *
+ *    here kernel can access 'key' pointer safely, knowing that
+ *    [key, key + map->key_size) bytes are valid and were initialized on
+ *    the stack of eBPF program.
+ * }
+ *
+ * The corresponding eBPF program looks like:
+ *    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),  // after this insn R2 type is FRAME_PTR
+ *    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), // after this insn R2 type is PTR_TO_STACK
+ *    BPF_MOV64_IMM(BPF_REG_1, MAP_ID),      // after this insn R1 type is CONST_IMM
+ *    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ * here the verifier looks at the prototype of map_lookup_elem and sees:
+ * .arg1_type == ARG_CONST_MAP_ID and R1->type == CONST_IMM, which is ok so far,
+ * then it goes and finds a map with map_id equal to the R1->imm value.
+ * Now the verifier knows that this map has a key of key_size bytes
+ *
+ * Then .arg2_type == ARG_PTR_TO_MAP_KEY and R2->type == PTR_TO_STACK, ok so far,
+ * Now verifier checks that [R2, R2 + map's key_size) are within stack limits
+ * and were initialized prior to this call.
+ * If it's ok, then verifier allows this BPF_CALL insn and looks at
+ * .ret_type which is RET_PTR_TO_MAP_OR_NULL, so it sets
+ * R0->type = PTR_TO_MAP_OR_NULL which means bpf_map_lookup_elem() function
+ * returns either a pointer to a map value or NULL.
+ *
+ * When type PTR_TO_MAP_OR_NULL passes through 'if (reg != 0) goto +off' insn,
+ * the register holding that pointer in the true branch changes state to
+ * PTR_TO_MAP and the same register changes state to CONST_IMM in the false
+ * branch. See check_cond_jmp_op().
+ *
+ * After the call R0 is set to return type of the function and registers R1-R5
+ * are set to NOT_INIT to indicate that they are no longer readable.
+ *
+ * load/store alignment is checked:
+ *    BPF_STX_MEM(BPF_DW, dest_reg, src_reg, 3)
+ * is rejected, because it's misaligned
+ *
+ * load/store to stack are bounds checked and register spill is tracked
+ *    BPF_STX_MEM(BPF_B, BPF_REG_10, src_reg, 0)
+ * is rejected, because it's out of bounds
+ *
+ * load/store to map are bounds checked:
+ *    BPF_STX_MEM(BPF_H, dest_reg, src_reg, 8)
+ * is ok, if dest_reg->type == PTR_TO_MAP and
+ * 8 + sizeof(u16) <= map_info->value_size
+ *
+ * load/store to bpf_context are checked against known fields
+ */
+
+#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
+
+/* types of values stored in eBPF registers */
+enum bpf_reg_type {
+	NOT_INIT = 0,		/* nothing was written into register */
+	UNKNOWN_VALUE,		/* reg doesn't contain a valid pointer */
+	PTR_TO_CTX,		/* reg points to bpf_context */
+	PTR_TO_MAP,		/* reg points to map element value */
+	PTR_TO_MAP_OR_NULL,	/* points to map element value or NULL */
+	FRAME_PTR,		/* reg == frame_pointer */
+	PTR_TO_STACK,		/* reg == frame_pointer + imm */
+	CONST_IMM,		/* constant integer value */
+};
+
+struct reg_state {
+	enum bpf_reg_type type;
+	int imm;
+};
+
+enum bpf_stack_slot_type {
+	STACK_INVALID,    /* nothing was stored in this stack slot */
+	STACK_SPILL,      /* 1st byte of register spilled into stack */
+	STACK_SPILL_PART, /* other 7 bytes of register spill */
+	STACK_MISC	  /* BPF program wrote some data into this slot */
+};
+
+struct bpf_stack_slot {
+	enum bpf_stack_slot_type stype;
+	enum bpf_reg_type type;
+	int imm;
+};
+
+/* state of the program:
+ * type of all registers and stack info
+ */
+struct verifier_state {
+	struct reg_state regs[MAX_BPF_REG];
+	struct bpf_stack_slot stack[MAX_BPF_STACK];
+};
+
+/* linked list of verifier states used to prune search */
+struct verifier_state_list {
+	struct verifier_state state;
+	struct verifier_state_list *next;
+};
+
+/* verifier_state + insn_idx are pushed to stack when branch is encountered */
+struct verifier_stack_elem {
+	/* verifier state is 'st'
+	 * before processing instruction 'insn_idx'
+	 * and after processing instruction 'prev_insn_idx'
+	 */
+	struct verifier_state st;
+	int insn_idx;
+	int prev_insn_idx;
+	struct verifier_stack_elem *next;
+};
+
+#define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
+
+/* single container for all structs
+ * one verifier_env per bpf_check() call
+ */
+struct verifier_env {
+	struct sk_filter *prog;		/* eBPF program being verified */
+	struct verifier_stack_elem *head; /* stack of verifier states to be processed */
+	int stack_size;			/* number of states to be processed */
+	struct verifier_state cur_state; /* current verifier state */
+	struct verifier_state_list **branch_landing; /* search pruning optimization */
+	u32 used_maps[MAX_USED_MAPS];	/* array of map_id's used by eBPF program */
+	u32 used_map_cnt;		/* number of used maps */
+};
+
+/* verbose verifier prints what it's seeing
+ * bpf_check() is called under map lock, so no race to access this global var
+ */
+static bool verbose_on;
+
+/* when the verifier rejects an eBPF program, it does a second pass with verbose on
+ * to dump the verification trace to the log, so the user can figure out what's
+ * wrong with the program
+ */
+static int verbose(const char *fmt, ...)
+{
+	va_list args;
+	int ret;
+
+	if (!verbose_on)
+		return 0;
+
+	va_start(args, fmt);
+	ret = vprintk(fmt, args);
+	va_end(args);
+	return ret;
+}
+
+/* string representation of 'enum bpf_reg_type' */
+static const char * const reg_type_str[] = {
+	[NOT_INIT] = "?",
+	[UNKNOWN_VALUE] = "inv",
+	[PTR_TO_CTX] = "ctx",
+	[PTR_TO_MAP] = "map_value",
+	[PTR_TO_MAP_OR_NULL] = "map_value_or_null",
+	[FRAME_PTR] = "fp",
+	[PTR_TO_STACK] = "fp",
+	[CONST_IMM] = "imm",
+};
+
+static void pr_cont_verifier_state(struct verifier_env *env)
+{
+	enum bpf_reg_type t;
+	int i;
+
+	for (i = 0; i < MAX_BPF_REG; i++) {
+		t = env->cur_state.regs[i].type;
+		if (t == NOT_INIT)
+			continue;
+		pr_cont(" R%d=%s", i, reg_type_str[t]);
+		if (t == CONST_IMM ||
+		    t == PTR_TO_STACK ||
+		    t == PTR_TO_MAP_OR_NULL ||
+		    t == PTR_TO_MAP)
+			pr_cont("%d", env->cur_state.regs[i].imm);
+	}
+	for (i = 0; i < MAX_BPF_STACK; i++) {
+		if (env->cur_state.stack[i].stype == STACK_SPILL)
+			pr_cont(" fp%d=%s", -MAX_BPF_STACK + i,
+				reg_type_str[env->cur_state.stack[i].type]);
+	}
+	pr_cont("\n");
+}
+
+static const char *const bpf_class_string[] = {
+	"ld", "ldx", "st", "stx", "alu", "jmp", "BUG", "alu64"
+};
+
+static const char *const bpf_alu_string[] = {
+	"+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
+	"%=", "^=", "=", "s>>=", "endian", "BUG", "BUG"
+};
+
+static const char *const bpf_ldst_string[] = {
+	"u32", "u16", "u8", "u64"
+};
+
+static const char *const bpf_jmp_string[] = {
+	"jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call", "exit"
+};
+
+static void pr_cont_bpf_insn(struct bpf_insn *insn)
+{
+	u8 class = BPF_CLASS(insn->code);
+
+	if (class == BPF_ALU || class == BPF_ALU64) {
+		if (BPF_SRC(insn->code) == BPF_X)
+			pr_cont("(%02x) %sr%d %s %sr%d\n",
+				insn->code, class == BPF_ALU ? "(u32) " : "",
+				insn->dst_reg,
+				bpf_alu_string[BPF_OP(insn->code) >> 4],
+				class == BPF_ALU ? "(u32) " : "",
+				insn->src_reg);
+		else
+			pr_cont("(%02x) %sr%d %s %s%d\n",
+				insn->code, class == BPF_ALU ? "(u32) " : "",
+				insn->dst_reg,
+				bpf_alu_string[BPF_OP(insn->code) >> 4],
+				class == BPF_ALU ? "(u32) " : "",
+				insn->imm);
+	} else if (class == BPF_STX) {
+		if (BPF_MODE(insn->code) == BPF_MEM)
+			pr_cont("(%02x) *(%s *)(r%d %+d) = r%d\n",
+				insn->code,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->dst_reg,
+				insn->off, insn->src_reg);
+		else if (BPF_MODE(insn->code) == BPF_XADD)
+			pr_cont("(%02x) lock *(%s *)(r%d %+d) += r%d\n",
+				insn->code,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->dst_reg, insn->off,
+				insn->src_reg);
+		else
+			pr_cont("BUG_%02x\n", insn->code);
+	} else if (class == BPF_ST) {
+		if (BPF_MODE(insn->code) != BPF_MEM) {
+			pr_cont("BUG_st_%02x\n", insn->code);
+			return;
+		}
+		pr_cont("(%02x) *(%s *)(r%d %+d) = %d\n",
+			insn->code,
+			bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+			insn->dst_reg,
+			insn->off, insn->imm);
+	} else if (class == BPF_LDX) {
+		if (BPF_MODE(insn->code) != BPF_MEM) {
+			pr_cont("BUG_ldx_%02x\n", insn->code);
+			return;
+		}
+		pr_cont("(%02x) r%d = *(%s *)(r%d %+d)\n",
+			insn->code, insn->dst_reg,
+			bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+			insn->src_reg, insn->off);
+	} else if (class == BPF_LD) {
+		if (BPF_MODE(insn->code) == BPF_ABS) {
+			pr_cont("(%02x) r0 = *(%s *)skb[%d]\n",
+				insn->code,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->imm);
+		} else if (BPF_MODE(insn->code) == BPF_IND) {
+			pr_cont("(%02x) r0 = *(%s *)skb[r%d + %d]\n",
+				insn->code,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->src_reg, insn->imm);
+		} else {
+			pr_cont("BUG_ld_%02x\n", insn->code);
+			return;
+		}
+	} else if (class == BPF_JMP) {
+		u8 opcode = BPF_OP(insn->code);
+
+		if (opcode == BPF_CALL) {
+			pr_cont("(%02x) call %d\n", insn->code, insn->imm);
+		} else if (insn->code == (BPF_JMP | BPF_JA)) {
+			pr_cont("(%02x) goto pc%+d\n",
+				insn->code, insn->off);
+		} else if (insn->code == (BPF_JMP | BPF_EXIT)) {
+			pr_cont("(%02x) exit\n", insn->code);
+		} else if (BPF_SRC(insn->code) == BPF_X) {
+			pr_cont("(%02x) if r%d %s r%d goto pc%+d\n",
+				insn->code, insn->dst_reg,
+				bpf_jmp_string[BPF_OP(insn->code) >> 4],
+				insn->src_reg, insn->off);
+		} else {
+			pr_cont("(%02x) if r%d %s 0x%x goto pc%+d\n",
+				insn->code, insn->dst_reg,
+				bpf_jmp_string[BPF_OP(insn->code) >> 4],
+				insn->imm, insn->off);
+		}
+	} else {
+		pr_cont("(%02x) %s\n", insn->code, bpf_class_string[class]);
+	}
+}
+
+static int pop_stack(struct verifier_env *env, int *prev_insn_idx)
+{
+	struct verifier_stack_elem *elem;
+	int insn_idx;
+
+	if (env->head == NULL)
+		return -1;
+
+	memcpy(&env->cur_state, &env->head->st, sizeof(env->cur_state));
+	insn_idx = env->head->insn_idx;
+	if (prev_insn_idx)
+		*prev_insn_idx = env->head->prev_insn_idx;
+	elem = env->head->next;
+	kfree(env->head);
+	env->head = elem;
+	env->stack_size--;
+	return insn_idx;
+}
+
+static struct verifier_state *push_stack(struct verifier_env *env, int insn_idx,
+					 int prev_insn_idx)
+{
+	struct verifier_stack_elem *elem;
+
+	elem = kmalloc(sizeof(struct verifier_stack_elem), GFP_KERNEL);
+	if (!elem)
+		goto err;
+
+	memcpy(&elem->st, &env->cur_state, sizeof(env->cur_state));
+	elem->insn_idx = insn_idx;
+	elem->prev_insn_idx = prev_insn_idx;
+	elem->next = env->head;
+	env->head = elem;
+	env->stack_size++;
+	if (env->stack_size > 1024) {
+		verbose("BPF program is too complex\n");
+		goto err;
+	}
+	return &elem->st;
+err:
+	/* pop all elements and return */
+	while (pop_stack(env, NULL) >= 0);
+	return NULL;
+}
+
+#define CALLER_SAVED_REGS 6
+static const int caller_saved[CALLER_SAVED_REGS] = {
+	BPF_REG_0, BPF_REG_1, BPF_REG_2, BPF_REG_3, BPF_REG_4, BPF_REG_5
+};
+
+static void init_reg_state(struct reg_state *regs)
+{
+	int i;
+
+	for (i = 0; i < MAX_BPF_REG; i++) {
+		regs[i].type = NOT_INIT;
+		regs[i].imm = 0;
+	}
+
+	/* frame pointer */
+	regs[BPF_REG_FP].type = FRAME_PTR;
+
+	/* 1st arg to a function */
+	regs[BPF_REG_1].type = PTR_TO_CTX;
+}
+
+static void mark_reg_unknown_value(struct reg_state *regs, int regno)
+{
+	regs[regno].type = UNKNOWN_VALUE;
+	regs[regno].imm = 0;
+}
+
+static int check_reg_arg(struct reg_state *regs, int regno, bool is_src)
+{
+	if (is_src) {
+		if (regs[regno].type == NOT_INIT) {
+			verbose("R%d !read_ok\n", regno);
+			return -EACCES;
+		}
+	} else {
+		if (regno == BPF_REG_FP)
+			/* frame pointer is read only */
+			return -EACCES;
+		mark_reg_unknown_value(regs, regno);
+	}
+	return 0;
+}
+
+static int bpf_size_to_bytes(int bpf_size)
+{
+	if (bpf_size == BPF_W)
+		return 4;
+	else if (bpf_size == BPF_H)
+		return 2;
+	else if (bpf_size == BPF_B)
+		return 1;
+	else if (bpf_size == BPF_DW)
+		return 8;
+	else
+		return -EACCES;
+}
+
+static int check_stack_write(struct verifier_state *state, int off, int size,
+			     int value_regno)
+{
+	struct bpf_stack_slot *slot;
+	int i;
+
+	if (value_regno >= 0 &&
+	    (state->regs[value_regno].type == PTR_TO_MAP ||
+	     state->regs[value_regno].type == PTR_TO_STACK ||
+	     state->regs[value_regno].type == PTR_TO_CTX)) {
+
+		/* register containing pointer is being spilled into stack */
+		if (size != 8) {
+			verbose("invalid size of register spill\n");
+			return -EACCES;
+		}
+
+		slot = &state->stack[MAX_BPF_STACK + off];
+		slot->stype = STACK_SPILL;
+		/* save register state */
+		slot->type = state->regs[value_regno].type;
+		slot->imm = state->regs[value_regno].imm;
+		for (i = 1; i < 8; i++) {
+			slot = &state->stack[MAX_BPF_STACK + off + i];
+			slot->stype = STACK_SPILL_PART;
+			slot->type = UNKNOWN_VALUE;
+			slot->imm = 0;
+		}
+	} else {
+
+		/* regular write of data into stack */
+		for (i = 0; i < size; i++) {
+			slot = &state->stack[MAX_BPF_STACK + off + i];
+			slot->stype = STACK_MISC;
+			slot->type = UNKNOWN_VALUE;
+			slot->imm = 0;
+		}
+	}
+	return 0;
+}
+
+static int check_stack_read(struct verifier_state *state, int off, int size,
+			    int value_regno)
+{
+	int i;
+	struct bpf_stack_slot *slot;
+
+	slot = &state->stack[MAX_BPF_STACK + off];
+
+	if (slot->stype == STACK_SPILL) {
+		if (size != 8) {
+			verbose("invalid size of register spill\n");
+			return -EACCES;
+		}
+		for (i = 1; i < 8; i++) {
+			if (state->stack[MAX_BPF_STACK + off + i].stype !=
+			    STACK_SPILL_PART) {
+				verbose("corrupted spill memory\n");
+				return -EACCES;
+			}
+		}
+
+		/* restore register state from stack */
+		state->regs[value_regno].type = slot->type;
+		state->regs[value_regno].imm = slot->imm;
+		return 0;
+	} else {
+		for (i = 0; i < size; i++) {
+			if (state->stack[MAX_BPF_STACK + off + i].stype !=
+			    STACK_MISC) {
+				verbose("invalid read from stack off %d+%d size %d\n",
+					off, i, size);
+				return -EACCES;
+			}
+		}
+		/* have read misc data from the stack */
+		mark_reg_unknown_value(state->regs, value_regno);
+		return 0;
+	}
+}
+
+static int remember_map_id(struct verifier_env *env, u32 map_id)
+{
+	int i;
+
+	/* check whether we recorded this map_id already */
+	for (i = 0; i < env->used_map_cnt; i++)
+		if (env->used_maps[i] == map_id)
+			return 0;
+
+	if (env->used_map_cnt >= MAX_USED_MAPS)
+		return -E2BIG;
+
+	/* remember this map_id */
+	env->used_maps[env->used_map_cnt++] = map_id;
+	return 0;
+}
+
+static int get_map_info(struct verifier_env *env, u32 map_id,
+			struct bpf_map **map)
+{
+	/* if BPF program contains bpf_map_lookup_elem(map_id, key)
+	 * the incorrect map_id will be caught here
+	 */
+	*map = bpf_map_get(map_id);
+	if (!*map) {
+		verbose("invalid access to map_id=%d\n", map_id);
+		return -EACCES;
+	}
+
+	_(remember_map_id(env, map_id));
+
+	return 0;
+}
+
+/* check read/write into map element returned by bpf_map_lookup_elem() */
+static int check_map_access(struct verifier_env *env, int regno, int off,
+			    int size)
+{
+	struct bpf_map *map;
+	int map_id = env->cur_state.regs[regno].imm;
+
+	_(get_map_info(env, map_id, &map));
+
+	if (off < 0 || off + size > map->value_size) {
+		verbose("invalid access to map_id=%d leaf_size=%d off=%d size=%d\n",
+			map_id, map->value_size, off, size);
+		return -EACCES;
+	}
+	return 0;
+}
+
+/* check access to 'struct bpf_context' fields */
+static int check_ctx_access(struct verifier_env *env, int off, int size,
+			    enum bpf_access_type t)
+{
+	if (env->prog->info->ops->is_valid_access &&
+	    env->prog->info->ops->is_valid_access(off, size, t))
+		return 0;
+
+	verbose("invalid bpf_context access off=%d size=%d\n", off, size);
+	return -EACCES;
+}
+
+static int check_mem_access(struct verifier_env *env, int regno, int off,
+			    int bpf_size, enum bpf_access_type t,
+			    int value_regno)
+{
+	struct verifier_state *state = &env->cur_state;
+	int size;
+
+	_(size = bpf_size_to_bytes(bpf_size));
+
+	if (off % size != 0) {
+		verbose("misaligned access off %d size %d\n", off, size);
+		return -EACCES;
+	}
+
+	if (state->regs[regno].type == PTR_TO_MAP) {
+		_(check_map_access(env, regno, off, size));
+		if (t == BPF_READ)
+			mark_reg_unknown_value(state->regs, value_regno);
+	} else if (state->regs[regno].type == PTR_TO_CTX) {
+		_(check_ctx_access(env, off, size, t));
+		if (t == BPF_READ)
+			mark_reg_unknown_value(state->regs, value_regno);
+	} else if (state->regs[regno].type == FRAME_PTR) {
+		if (off >= 0 || off < -MAX_BPF_STACK) {
+			verbose("invalid stack off=%d size=%d\n", off, size);
+			return -EACCES;
+		}
+		if (t == BPF_WRITE)
+			_(check_stack_write(state, off, size, value_regno));
+		else
+			_(check_stack_read(state, off, size, value_regno));
+	} else {
+		verbose("R%d invalid mem access '%s'\n",
+			regno, reg_type_str[state->regs[regno].type]);
+		return -EACCES;
+	}
+	return 0;
+}
+
+/* when register 'regno' is passed into function that will read 'access_size'
+ * bytes from that pointer, make sure that it's within stack boundary
+ * and all elements of stack are initialized
+ */
+static int check_stack_boundary(struct verifier_env *env,
+				int regno, int access_size)
+{
+	struct verifier_state *state = &env->cur_state;
+	struct reg_state *regs = state->regs;
+	int off, i;
+
+	if (regs[regno].type != PTR_TO_STACK)
+		return -EACCES;
+
+	off = regs[regno].imm;
+	if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 ||
+	    access_size <= 0) {
+		verbose("invalid stack type R%d off=%d access_size=%d\n",
+			regno, off, access_size);
+		return -EACCES;
+	}
+
+	for (i = 0; i < access_size; i++) {
+		if (state->stack[MAX_BPF_STACK + off + i].stype != STACK_MISC) {
+			verbose("invalid indirect read from stack off %d+%d size %d\n",
+				off, i, access_size);
+			return -EACCES;
+		}
+	}
+	return 0;
+}
+
+static int check_func_arg(struct verifier_env *env, int regno,
+			  enum bpf_arg_type arg_type, int *map_id,
+			  struct bpf_map **mapp)
+{
+	struct reg_state *reg = env->cur_state.regs + regno;
+	enum bpf_reg_type expected_type;
+
+	if (arg_type == ARG_ANYTHING)
+		return 0;
+
+	if (reg->type == NOT_INIT) {
+		verbose("R%d !read_ok\n", regno);
+		return -EACCES;
+	}
+
+	if (arg_type == ARG_PTR_TO_MAP_KEY || arg_type == ARG_PTR_TO_MAP_VALUE) {
+		expected_type = PTR_TO_STACK;
+	} else if (arg_type == ARG_CONST_MAP_ID || arg_type == ARG_CONST_STACK_SIZE) {
+		expected_type = CONST_IMM;
+	} else {
+		verbose("unsupported arg_type %d\n", arg_type);
+		return -EFAULT;
+	}
+
+	if (reg->type != expected_type) {
+		verbose("R%d type=%s expected=%s\n", regno,
+			reg_type_str[reg->type], reg_type_str[expected_type]);
+		return -EACCES;
+	}
+
+	if (arg_type == ARG_CONST_MAP_ID) {
+		/* bpf_map_xxx(map_id) call: check that map_id is valid */
+		*map_id = reg->imm;
+		_(get_map_info(env, reg->imm, mapp));
+	} else if (arg_type == ARG_PTR_TO_MAP_KEY) {
+		/*
+		 * bpf_map_xxx(..., map_id, ..., key) call:
+		 * check that [key, key + map->key_size) are within
+		 * stack limits and initialized
+		 */
+		if (!*mapp) {
+			/*
+			 * in function declaration map_id must come before
+			 * map_key or map_elem, so that it's verified
+			 * and known before we have to check map_key here
+			 */
+			verbose("invalid map_id to access map->key\n");
+			return -EACCES;
+		}
+		_(check_stack_boundary(env, regno, (*mapp)->key_size));
+	} else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
+		/*
+		 * bpf_map_xxx(..., map_id, ..., value) call:
+		 * check [value, value + map->value_size) validity
+		 */
+		if (!*mapp) {
+			verbose("invalid map_id to access map->elem\n");
+			return -EACCES;
+		}
+		_(check_stack_boundary(env, regno, (*mapp)->value_size));
+	} else if (arg_type == ARG_CONST_STACK_SIZE) {
+		/*
+		 * bpf_xxx(..., buf, len) call will access 'len' bytes
+		 * from stack pointer 'buf'. Check it
+		 * note: regno == len, regno - 1 == buf
+		 */
+		_(check_stack_boundary(env, regno - 1, reg->imm));
+	}
+
+	return 0;
+}
+
+static int check_call(struct verifier_env *env, int func_id)
+{
+	struct verifier_state *state = &env->cur_state;
+	const struct bpf_func_proto *fn = NULL;
+	struct reg_state *regs = state->regs;
+	struct bpf_map *map = NULL;
+	struct reg_state *reg;
+	int map_id = -1;
+	int i;
+
+	/* find function prototype */
+	if (func_id <= 0 || func_id >= __BPF_FUNC_MAX_ID) {
+		verbose("invalid func %d\n", func_id);
+		return -EINVAL;
+	}
+
+	if (env->prog->info->ops->get_func_proto)
+		fn = env->prog->info->ops->get_func_proto(func_id);
+
+	if (!fn) {
+		verbose("unknown func %d\n", func_id);
+		return -EINVAL;
+	}
+
+	/* eBPF programs must be GPL compatible to use GPL-ed functions */
+	if (!env->prog->info->is_gpl_compatible && fn->gpl_only) {
+		verbose("cannot call GPL only function from proprietary program\n");
+		return -EINVAL;
+	}
+
+	/* check args */
+	_(check_func_arg(env, BPF_REG_1, fn->arg1_type, &map_id, &map));
+	_(check_func_arg(env, BPF_REG_2, fn->arg2_type, &map_id, &map));
+	_(check_func_arg(env, BPF_REG_3, fn->arg3_type, &map_id, &map));
+	_(check_func_arg(env, BPF_REG_4, fn->arg4_type, &map_id, &map));
+	_(check_func_arg(env, BPF_REG_5, fn->arg5_type, &map_id, &map));
+
+	/* reset caller saved regs */
+	for (i = 0; i < CALLER_SAVED_REGS; i++) {
+		reg = regs + caller_saved[i];
+		reg->type = NOT_INIT;
+		reg->imm = 0;
+	}
+
+	/* update return register */
+	if (fn->ret_type == RET_INTEGER) {
+		regs[BPF_REG_0].type = UNKNOWN_VALUE;
+	} else if (fn->ret_type == RET_VOID) {
+		regs[BPF_REG_0].type = NOT_INIT;
+	} else if (fn->ret_type == RET_PTR_TO_MAP_OR_NULL) {
+		regs[BPF_REG_0].type = PTR_TO_MAP_OR_NULL;
+		/*
+		 * remember map_id, so that check_map_access()
+		 * can check 'value_size' boundary of memory access
+		 * to map element returned from bpf_map_lookup_elem()
+		 */
+		regs[BPF_REG_0].imm = map_id;
+	} else {
+		verbose("unknown return type %d of func %d\n",
+			fn->ret_type, func_id);
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/* check validity of 32-bit and 64-bit arithmetic operations */
+static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn)
+{
+	u8 opcode = BPF_OP(insn->code);
+
+	if (opcode == BPF_END || opcode == BPF_NEG) {
+		if (BPF_SRC(insn->code) != BPF_X)
+			return -EINVAL;
+		/* check src operand */
+		_(check_reg_arg(regs, insn->dst_reg, 1));
+
+		/* check dest operand */
+		_(check_reg_arg(regs, insn->dst_reg, 0));
+
+	} else if (opcode == BPF_MOV) {
+
+		if (BPF_SRC(insn->code) == BPF_X)
+			/* check src operand */
+			_(check_reg_arg(regs, insn->src_reg, 1));
+
+		/* check dest operand */
+		_(check_reg_arg(regs, insn->dst_reg, 0));
+
+		if (BPF_SRC(insn->code) == BPF_X) {
+			if (BPF_CLASS(insn->code) == BPF_ALU64) {
+				/* case: R1 = R2
+				 * copy register state to dest reg
+				 */
+				regs[insn->dst_reg].type = regs[insn->src_reg].type;
+				regs[insn->dst_reg].imm = regs[insn->src_reg].imm;
+			} else {
+				regs[insn->dst_reg].type = UNKNOWN_VALUE;
+				regs[insn->dst_reg].imm = 0;
+			}
+		} else {
+			/* case: R = imm
+			 * remember the value we stored into this reg
+			 */
+			regs[insn->dst_reg].type = CONST_IMM;
+			regs[insn->dst_reg].imm = insn->imm;
+		}
+
+	} else {	/* all other ALU ops: and, sub, xor, add, ... */
+
+		int stack_relative = 0;
+
+		if (BPF_SRC(insn->code) == BPF_X)
+			/* check src1 operand */
+			_(check_reg_arg(regs, insn->src_reg, 1));
+
+		/* check src2 operand */
+		_(check_reg_arg(regs, insn->dst_reg, 1));
+
+		if ((opcode == BPF_MOD || opcode == BPF_DIV) &&
+		    BPF_SRC(insn->code) == BPF_K && insn->imm == 0) {
+			verbose("div by zero\n");
+			return -EINVAL;
+		}
+
+		if (opcode == BPF_ADD && BPF_CLASS(insn->code) == BPF_ALU64 &&
+		    regs[insn->dst_reg].type == FRAME_PTR &&
+		    BPF_SRC(insn->code) == BPF_K)
+			stack_relative = 1;
+
+		/* check dest operand */
+		_(check_reg_arg(regs, insn->dst_reg, 0));
+
+		if (stack_relative) {
+			regs[insn->dst_reg].type = PTR_TO_STACK;
+			regs[insn->dst_reg].imm = insn->imm;
+		}
+	}
+
+	return 0;
+}
+
+static int check_cond_jmp_op(struct verifier_env *env,
+			     struct bpf_insn *insn, int *insn_idx)
+{
+	struct reg_state *regs = env->cur_state.regs;
+	struct verifier_state *other_branch;
+	u8 opcode = BPF_OP(insn->code);
+
+	if (BPF_SRC(insn->code) == BPF_X)
+		/* check src1 operand */
+		_(check_reg_arg(regs, insn->src_reg, 1));
+
+	/* check src2 operand */
+	_(check_reg_arg(regs, insn->dst_reg, 1));
+
+	/* detect if R == 0 where R was initialized to zero earlier */
+	if (BPF_SRC(insn->code) == BPF_K &&
+	    (opcode == BPF_JEQ || opcode == BPF_JNE) &&
+	    regs[insn->dst_reg].type == CONST_IMM &&
+	    regs[insn->dst_reg].imm == insn->imm) {
+		if (opcode == BPF_JEQ) {
+			/* if (imm == imm) goto pc+off;
+			 * only follow the goto, ignore fall-through
+			 */
+			*insn_idx += insn->off;
+			return 0;
+		} else {
+			/* if (imm != imm) goto pc+off;
+			 * only follow fall-through branch, since
+			 * that's where the program will go
+			 */
+			return 0;
+		}
+	}
+
+	other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx);
+	if (!other_branch)
+		return -EFAULT;
+
+	/* detect if R == 0 where R is returned value from bpf_map_lookup_elem() */
+	if (BPF_SRC(insn->code) == BPF_K &&
+	    insn->imm == 0 && (opcode == BPF_JEQ ||
+			       opcode == BPF_JNE) &&
+	    regs[insn->dst_reg].type == PTR_TO_MAP_OR_NULL) {
+		if (opcode == BPF_JEQ) {
+			/* next fallthrough insn can access memory via
+			 * this register
+			 */
+			regs[insn->dst_reg].type = PTR_TO_MAP;
+			/* branch target cannot access it, since reg == 0 */
+			other_branch->regs[insn->dst_reg].type = CONST_IMM;
+			other_branch->regs[insn->dst_reg].imm = 0;
+		} else {
+			other_branch->regs[insn->dst_reg].type = PTR_TO_MAP;
+			regs[insn->dst_reg].type = CONST_IMM;
+			regs[insn->dst_reg].imm = 0;
+		}
+	} else if (BPF_SRC(insn->code) == BPF_K &&
+		   (opcode == BPF_JEQ || opcode == BPF_JNE)) {
+
+		if (opcode == BPF_JEQ) {
+			/* detect if (R == imm) goto
+			 * and in the target state recognize that R = imm
+			 */
+			other_branch->regs[insn->dst_reg].type = CONST_IMM;
+			other_branch->regs[insn->dst_reg].imm = insn->imm;
+		} else {
+			/* detect if (R != imm) goto
+			 * and in the fall-through state recognize that R = imm
+			 */
+			regs[insn->dst_reg].type = CONST_IMM;
+			regs[insn->dst_reg].imm = insn->imm;
+		}
+	}
+	if (verbose_on)
+		pr_cont_verifier_state(env);
+	return 0;
+}
+
+/* verify safety of LD_ABS|LD_IND instructions:
+ * - they can only appear in the programs where ctx == skb
+ * - since they are wrappers of function calls, they scratch R1-R5 registers,
+ *   preserve R6-R9, and store return value into R0
+ *
+ * Implicit input:
+ *   ctx == skb == R6 == CTX
+ *
+ * Explicit input:
+ *   SRC == any register
+ *   IMM == 32-bit immediate
+ *
+ * Output:
+ *   R0 - 8/16/32-bit skb data converted to cpu endianness
+ */
+
+static int check_ld_abs(struct verifier_env *env, struct bpf_insn *insn)
+{
+	struct reg_state *regs = env->cur_state.regs;
+	u8 mode = BPF_MODE(insn->code);
+	struct reg_state *reg;
+	int i;
+
+	if (mode != BPF_ABS && mode != BPF_IND)
+		return -EINVAL;
+
+	if (env->prog->info->prog_type != BPF_PROG_TYPE_SOCKET_FILTER) {
+		verbose("BPF_LD_ABS|IND instructions are only allowed in socket filters\n");
+		return -EINVAL;
+	}
+
+	/* check whether implicit source operand (register R6) is readable */
+	_(check_reg_arg(regs, BPF_REG_6, 1));
+
+	if (regs[BPF_REG_6].type != PTR_TO_CTX) {
+		verbose("at the time of BPF_LD_ABS|IND R6 != pointer to skb\n");
+		return -EINVAL;
+	}
+
+	if (mode == BPF_IND)
+		/* check explicit source operand */
+		_(check_reg_arg(regs, insn->src_reg, 1));
+
+	/* reset caller saved regs to unreadable */
+	for (i = 0; i < CALLER_SAVED_REGS; i++) {
+		reg = regs + caller_saved[i];
+		reg->type = NOT_INIT;
+		reg->imm = 0;
+	}
+
+	/* mark destination R0 register as readable, since it contains
+	 * the value fetched from the packet
+	 */
+	regs[BPF_REG_0].type = UNKNOWN_VALUE;
+	return 0;
+}
+
+/* non-recursive DFS pseudo code
+ * 1  procedure DFS-iterative(G,v):
+ * 2      label v as discovered
+ * 3      let S be a stack
+ * 4      S.push(v)
+ * 5      while S is not empty
+ * 6            t <- S.pop()
+ * 7            if t is what we're looking for:
+ * 8                return t
+ * 9            for all edges e in G.adjacentEdges(t) do
+ * 10               if edge e is already labelled
+ * 11                   continue with the next edge
+ * 12               w <- G.adjacentVertex(t,e)
+ * 13               if vertex w is not discovered and not explored
+ * 14                   label e as tree-edge
+ * 15                   label w as discovered
+ * 16                   S.push(w)
+ * 17                   continue at 5
+ * 18               else if vertex w is discovered
+ * 19                   label e as back-edge
+ * 20               else
+ * 21                   // vertex w is explored
+ * 22                   label e as forward- or cross-edge
+ * 23           label t as explored
+ * 24           S.pop()
+ *
+ * convention:
+ * 1 - discovered
+ * 2 - discovered and 1st branch labelled
+ * 3 - discovered and 1st and 2nd branch labelled
+ * 4 - explored
+ */
+
+#define STATE_END ((struct verifier_state_list *)-1)
+
+#define PUSH_INT(I) \
+	do { \
+		if (cur_stack >= insn_cnt) { \
+			ret = -E2BIG; \
+			goto free_st; \
+		} \
+		stack[cur_stack++] = I; \
+	} while (0)
+
+#define PEEK_INT() \
+	({ \
+		int _ret; \
+		if (cur_stack == 0) \
+			_ret = -1; \
+		else \
+			_ret = stack[cur_stack - 1]; \
+		_ret; \
+	 })
+
+#define POP_INT() \
+	({ \
+		int _ret; \
+		if (cur_stack == 0) \
+			_ret = -1; \
+		else \
+			_ret = stack[--cur_stack]; \
+		_ret; \
+	 })
+
+#define PUSH_INSN(T, W, E) \
+	do { \
+		int w = W; \
+		if (E == 1 && st[T] >= 2) \
+			break; \
+		if (E == 2 && st[T] >= 3) \
+			break; \
+		if (w >= insn_cnt) { \
+			ret = -EACCES; \
+			goto free_st; \
+		} \
+		if (E == 2) \
+			/* mark branch target for state pruning */ \
+			env->branch_landing[w] = STATE_END; \
+		if (st[w] == 0) { \
+			/* tree-edge */ \
+			st[T] = 1 + E; \
+			st[w] = 1; /* discovered */ \
+			PUSH_INT(w); \
+			goto peak_stack; \
+		} else if (st[w] == 1 || st[w] == 2 || st[w] == 3) { \
+			verbose("back-edge from insn %d to %d\n", t, w); \
+			ret = -EINVAL; \
+			goto free_st; \
+		} else if (st[w] == 4) { \
+			/* forward- or cross-edge */ \
+			st[T] = 1 + E; \
+		} else { \
+			verbose("insn state internal bug\n"); \
+			ret = -EFAULT; \
+			goto free_st; \
+		} \
+	} while (0)
+
+/* non-recursive depth-first-search to detect loops in BPF program
+ * loop == back-edge in directed graph
+ */
+static int check_cfg(struct verifier_env *env)
+{
+	struct bpf_insn *insns = env->prog->insnsi;
+	int insn_cnt = env->prog->len;
+	int cur_stack = 0;
+	int *stack;
+	int ret = 0;
+	int *st;
+	int i, t;
+
+	if (insns[insn_cnt - 1].code != (BPF_JMP | BPF_EXIT)) {
+		verbose("last insn is not a 'ret'\n");
+		return -EINVAL;
+	}
+
+	st = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
+	if (!st)
+		return -ENOMEM;
+
+	stack = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
+	if (!stack) {
+		kfree(st);
+		return -ENOMEM;
+	}
+
+	st[0] = 1; /* mark 1st insn as discovered */
+	PUSH_INT(0);
+
+peak_stack:
+	while ((t = PEEK_INT()) != -1) {
+		if (insns[t].code == (BPF_JMP | BPF_EXIT))
+			goto mark_explored;
+
+		if (BPF_CLASS(insns[t].code) == BPF_JMP) {
+			u8 opcode = BPF_OP(insns[t].code);
+
+			if (opcode == BPF_CALL) {
+				PUSH_INSN(t, t + 1, 1);
+			} else if (opcode == BPF_JA) {
+				if (BPF_SRC(insns[t].code) != BPF_X) {
+					ret = -EINVAL;
+					goto free_st;
+				}
+				PUSH_INSN(t, t + insns[t].off + 1, 1);
+			} else {
+				PUSH_INSN(t, t + 1, 1);
+				PUSH_INSN(t, t + insns[t].off + 1, 2);
+			}
+			/* tell verifier to check for equivalent verifier states
+			 * after every call and jump
+			 */
+			env->branch_landing[t + 1] = STATE_END;
+		} else {
+			PUSH_INSN(t, t + 1, 1);
+		}
+
+mark_explored:
+		st[t] = 4; /* explored */
+		if (POP_INT() == -1) {
+			verbose("pop_int internal bug\n");
+			ret = -EFAULT;
+			goto free_st;
+		}
+	}
+
+
+	for (i = 0; i < insn_cnt; i++) {
+		if (st[i] != 4) {
+			verbose("unreachable insn %d\n", i);
+			ret = -EINVAL;
+			goto free_st;
+		}
+	}
+
+free_st:
+	kfree(st);
+	kfree(stack);
+	return ret;
+}
+
+/* compare two verifier states
+ *
+ * all states stored in state_list are known to be valid, since
+ * verifier reached 'bpf_exit' instruction through them
+ *
+ * this function is called when the verifier explores different branches of
+ * execution popped from the state stack. If it sees an old state that has
+ * a more strict register state and a more strict stack state, then this
+ * execution branch doesn't need to be explored further, since the verifier
+ * already concluded that the more strict state leads to a valid finish.
+ *
+ * Therefore two states are equivalent if register state is more conservative
+ * and explored stack state is more conservative than the current one.
+ * Example:
+ *       explored                   current
+ * (slot1=INV slot2=MISC) == (slot1=MISC slot2=MISC)
+ * (slot1=MISC slot2=MISC) != (slot1=INV slot2=MISC)
+ *
+ * In other words if current stack state (one being explored) has more
+ * valid slots than old one that already passed validation, it means
+ * the verifier can stop exploring and conclude that current state is valid too
+ *
+ * Similarly with registers. If explored state has register type as invalid
+ * whereas register type in current state is meaningful, it means that
+ * the current state will reach 'bpf_exit' instruction safely
+ */
+static bool states_equal(struct verifier_state *old, struct verifier_state *cur)
+{
+	int i;
+
+	for (i = 0; i < MAX_BPF_REG; i++) {
+		if (memcmp(&old->regs[i], &cur->regs[i],
+			   sizeof(old->regs[0])) != 0) {
+			if (old->regs[i].type == NOT_INIT ||
+			    old->regs[i].type == UNKNOWN_VALUE)
+				continue;
+			return false;
+		}
+	}
+
+	for (i = 0; i < MAX_BPF_STACK; i++) {
+		if (memcmp(&old->stack[i], &cur->stack[i],
+			   sizeof(old->stack[0])) != 0) {
+			if (old->stack[i].stype == STACK_INVALID)
+				continue;
+			return false;
+		}
+	}
+	return true;
+}
+
+static int is_state_visited(struct verifier_env *env, int insn_idx)
+{
+	struct verifier_state_list *new_sl;
+	struct verifier_state_list *sl;
+
+	sl = env->branch_landing[insn_idx];
+	if (!sl)
+		/* no branch jump to this insn, ignore it */
+		return 0;
+
+	while (sl != STATE_END) {
+		if (states_equal(&sl->state, &env->cur_state))
+			/* reached equivalent register/stack state,
+			 * prune the search
+			 */
+			return 1;
+		sl = sl->next;
+	}
+	new_sl = kmalloc(sizeof(struct verifier_state_list), GFP_KERNEL);
+
+	if (!new_sl)
+		/* ignore ENOMEM, it doesn't affect correctness */
+		return 0;
+
+	/* add new state to the head of linked list */
+	memcpy(&new_sl->state, &env->cur_state, sizeof(env->cur_state));
+	new_sl->next = env->branch_landing[insn_idx];
+	env->branch_landing[insn_idx] = new_sl;
+	return 0;
+}
+
+static int do_check(struct verifier_env *env)
+{
+	struct verifier_state *state = &env->cur_state;
+	struct bpf_insn *insns = env->prog->insnsi;
+	struct reg_state *regs = state->regs;
+	int insn_cnt = env->prog->len;
+	int insn_idx, prev_insn_idx = 0;
+	int insn_processed = 0;
+	bool do_print_state = false;
+
+	init_reg_state(regs);
+	insn_idx = 0;
+	for (;;) {
+		struct bpf_insn *insn;
+		u8 class;
+
+		if (insn_idx >= insn_cnt) {
+			verbose("invalid insn idx %d insn_cnt %d\n",
+				insn_idx, insn_cnt);
+			return -EFAULT;
+		}
+
+		insn = &insns[insn_idx];
+		class = BPF_CLASS(insn->code);
+
+		if (++insn_processed > 32768) {
+			verbose("BPF program is too large. Proccessed %d insn\n",
+				insn_processed);
+			return -E2BIG;
+		}
+
+		if (is_state_visited(env, insn_idx)) {
+			if (verbose_on) {
+				if (do_print_state)
+					pr_cont("\nfrom %d to %d: safe\n",
+						prev_insn_idx, insn_idx);
+				else
+					pr_cont("%d: safe\n", insn_idx);
+			}
+			goto process_bpf_exit;
+		}
+
+		if (verbose_on && do_print_state) {
+			pr_cont("\nfrom %d to %d:", prev_insn_idx, insn_idx);
+			pr_cont_verifier_state(env);
+			do_print_state = false;
+		}
+
+		if (verbose_on) {
+			pr_cont("%d: ", insn_idx);
+			pr_cont_bpf_insn(insn);
+		}
+
+		if (class == BPF_ALU || class == BPF_ALU64) {
+			_(check_alu_op(regs, insn));
+
+		} else if (class == BPF_LDX) {
+			if (BPF_MODE(insn->code) != BPF_MEM)
+				return -EINVAL;
+
+			/* check src operand */
+			_(check_reg_arg(regs, insn->src_reg, 1));
+
+			_(check_mem_access(env, insn->src_reg, insn->off,
+					   BPF_SIZE(insn->code), BPF_READ,
+					   insn->dst_reg));
+
+			/* dest reg state will be updated by mem_access */
+
+		} else if (class == BPF_STX) {
+			/* check src1 operand */
+			_(check_reg_arg(regs, insn->src_reg, 1));
+			/* check src2 operand */
+			_(check_reg_arg(regs, insn->dst_reg, 1));
+			_(check_mem_access(env, insn->dst_reg, insn->off,
+					   BPF_SIZE(insn->code), BPF_WRITE,
+					   insn->src_reg));
+
+		} else if (class == BPF_ST) {
+			if (BPF_MODE(insn->code) != BPF_MEM)
+				return -EINVAL;
+			/* check src operand */
+			_(check_reg_arg(regs, insn->dst_reg, 1));
+			_(check_mem_access(env, insn->dst_reg, insn->off,
+					   BPF_SIZE(insn->code), BPF_WRITE,
+					   -1));
+
+		} else if (class == BPF_JMP) {
+			u8 opcode = BPF_OP(insn->code);
+
+			if (opcode == BPF_CALL) {
+				_(check_call(env, insn->imm));
+			} else if (opcode == BPF_JA) {
+				if (BPF_SRC(insn->code) != BPF_X)
+					return -EINVAL;
+				insn_idx += insn->off + 1;
+				continue;
+			} else if (opcode == BPF_EXIT) {
+				/* eBPF calling convention is such that R0 is used
+				 * to return the value from the eBPF program.
+				 * Make sure that it's readable at the time
+				 * of bpf_exit, which means that the program wrote
+				 * something into it earlier
+				 */
+				_(check_reg_arg(regs, BPF_REG_0, 1));
+process_bpf_exit:
+				insn_idx = pop_stack(env, &prev_insn_idx);
+				if (insn_idx < 0) {
+					break;
+				} else {
+					do_print_state = true;
+					continue;
+				}
+			} else {
+				_(check_cond_jmp_op(env, insn, &insn_idx));
+			}
+		} else if (class == BPF_LD) {
+			_(check_ld_abs(env, insn));
+		} else {
+			verbose("unknown insn class %d\n", class);
+			return -EINVAL;
+		}
+
+		insn_idx++;
+	}
+
+	return 0;
+}
+
+static void free_states(struct verifier_env *env, int insn_cnt)
+{
+	struct verifier_state_list *sl, *sln;
+	int i;
+
+	for (i = 0; i < insn_cnt; i++) {
+		sl = env->branch_landing[i];
+
+		if (sl)
+			while (sl != STATE_END) {
+				sln = sl->next;
+				kfree(sl);
+				sl = sln;
+			}
+	}
+
+	kfree(env->branch_landing);
+}
+
+int bpf_check(struct sk_filter *prog)
+{
+	struct verifier_env *env;
+	int ret;
+
+	if (prog->len <= 0 || prog->len > BPF_MAXINSNS)
+		return -E2BIG;
+
+	env = kzalloc(sizeof(struct verifier_env), GFP_KERNEL);
+	if (!env)
+		return -ENOMEM;
+
+	verbose_on = false;
+retry:
+	env->prog = prog;
+	env->branch_landing = kcalloc(prog->len,
+				      sizeof(struct verifier_state_list *),
+				      GFP_KERNEL);
+
+	if (!env->branch_landing) {
+		kfree(env);
+		return -ENOMEM;
+	}
+
+	ret = check_cfg(env);
+	if (ret < 0)
+		goto free_env;
+
+	ret = do_check(env);
+
+free_env:
+	while (pop_stack(env, NULL) >= 0);
+	free_states(env, prog->len);
+
+	if (ret < 0 && !verbose_on && capable(CAP_SYS_ADMIN)) {
+		/* verification failed, redo it with verbose on */
+		memset(env, 0, sizeof(struct verifier_env));
+		verbose_on = true;
+		goto retry;
+	}
+
+	if (ret == 0 && env->used_map_cnt) {
+		/* if program passed verifier, update used_maps in bpf_prog_info */
+		prog->info->used_maps = kmalloc_array(env->used_map_cnt,
+						      sizeof(u32), GFP_KERNEL);
+		if (!prog->info->used_maps) {
+			kfree(env);
+			return -ENOMEM;
+		}
+		memcpy(prog->info->used_maps, env->used_maps,
+		       sizeof(u32) * env->used_map_cnt);
+		prog->info->used_map_cnt = env->used_map_cnt;
+	}
+
+	kfree(env);
+	return ret;
+}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 11/16] bpf: allow eBPF programs to use maps
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (10 preceding siblings ...)
  (?)
@ 2014-07-18  4:20 ` Alexei Starovoitov
  -1 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
map accessors to eBPF programs
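
As an illustration only (not part of this patch): a kernel subsystem that
wants its eBPF programs to call bpf_map_lookup_elem() would return a
matching bpf_func_proto from its verifier_ops->get_func_proto() callback,
roughly like the socket filter code later in this series does:

static const struct bpf_func_proto *
my_subsys_func_proto(enum bpf_func_id func_id)
{
	/* hypothetical subsystem callback, shown only as a sketch */
	static const struct bpf_func_proto lookup_proto = {
		.func		= bpf_map_lookup_elem,
		.gpl_only	= false,
		.ret_type	= RET_PTR_TO_MAP_OR_NULL,
		.arg1_type	= ARG_CONST_MAP_ID,
		.arg2_type	= ARG_PTR_TO_MAP_KEY,
	};

	switch (func_id) {
	case BPF_FUNC_map_lookup_elem:
		return &lookup_proto;
	default:
		return NULL;
	}
}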

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/bpf.h      |    5 +++
 include/uapi/linux/bpf.h |    3 ++
 kernel/bpf/syscall.c     |   85 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 93 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b5e90efddfcf..a7566afbe23b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -128,4 +128,9 @@ struct sk_filter *bpf_prog_get(u32 ufd);
 /* verify correctness of eBPF program */
 int bpf_check(struct sk_filter *fp);
 
+/* in-kernel helper functions called from eBPF programs */
+u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3f288e1d08f1..06e0f63055fb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -377,6 +377,9 @@ enum bpf_prog_type {
  */
 enum bpf_func_id {
 	BPF_FUNC_unspec,
+	BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(map_id, void *key) */
+	BPF_FUNC_map_update_elem, /* int map_update_elem(map_id, void *key, void *value) */
+	BPF_FUNC_map_delete_elem, /* int map_delete_elem(map_id, void *key) */
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9d441f17548e..2d6e6a171594 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -741,3 +741,88 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
 		return -EINVAL;
 	}
 }
+
+/* called from eBPF program under rcu lock
+ *
+ * if a kernel subsystem allows eBPF programs to call this function,
+ * its verifier_ops->get_func_proto() callback should return
+ * (struct bpf_func_proto) {
+ *    .ret_type = RET_PTR_TO_MAP_OR_NULL,
+ *    .arg1_type = ARG_CONST_MAP_ID,
+ *    .arg2_type = ARG_PTR_TO_MAP_KEY,
+ * }
+ * so that the eBPF verifier properly checks the arguments
+ */
+u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	struct bpf_map *map;
+	int map_id = r1;
+	void *key = (void *) (unsigned long) r2;
+	void *value;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	map = idr_find(&bpf_map_id_idr, map_id);
+	/* eBPF verifier guarantees that map_id is valid for the life of
+	 * the program
+	 */
+	BUG_ON(!map);
+
+	value = map->ops->map_lookup_elem(map, key);
+
+	return (unsigned long) value;
+}
+
+/* called from eBPF program under rcu lock
+ *
+ * if a kernel subsystem allows eBPF programs to call this function,
+ * its verifier_ops->get_func_proto() callback should return
+ * (struct bpf_func_proto) {
+ *    .ret_type = RET_INTEGER,
+ *    .arg1_type = ARG_CONST_MAP_ID,
+ *    .arg2_type = ARG_PTR_TO_MAP_KEY,
+ *    .arg3_type = ARG_PTR_TO_MAP_VALUE,
+ * }
+ * so that the eBPF verifier properly checks the arguments
+ */
+u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	struct bpf_map *map;
+	int map_id = r1;
+	void *key = (void *) (unsigned long) r2;
+	void *value = (void *) (unsigned long) r3;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	map = idr_find(&bpf_map_id_idr, map_id);
+	/* eBPF verifier guarantees that map_id is valid */
+	BUG_ON(!map);
+
+	return map->ops->map_update_elem(map, key, value);
+}
+
+/* called from eBPF program under rcu lock
+ *
+ * if a kernel subsystem allows eBPF programs to call this function,
+ * its verifier_ops->get_func_proto() callback should return
+ * (struct bpf_func_proto) {
+ *    .ret_type = RET_INTEGER,
+ *    .arg1_type = ARG_CONST_MAP_ID,
+ *    .arg2_type = ARG_PTR_TO_MAP_KEY,
+ * }
+ * so that the eBPF verifier properly checks the arguments
+ */
+u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	struct bpf_map *map;
+	int map_id = r1;
+	void *key = (void *) (unsigned long) r2;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	map = idr_find(&bpf_map_id_idr, map_id);
+	/* eBPF verifier guarantees that map_id is valid */
+	BUG_ON(!map);
+
+	return map->ops->map_delete_elem(map, key);
+}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 12/16] net: sock: allow eBPF programs to be attached to sockets
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (11 preceding siblings ...)
  (?)
@ 2014-07-18  4:20 ` Alexei Starovoitov
  -1 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

introduce new setsockopt() command:

int fd;
setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER_EBPF, &fd, sizeof(fd))

fd is associated with an eBPF program previously loaded via:

fd = syscall(__NR_bpf, BPF_PROG_LOAD, BPF_PROG_TYPE_SOCKET_FILTER,
             &prog, sizeof(prog));

setsockopt() calls bpf_prog_get(), which increments the refcnt of the program,
so it doesn't get unloaded while a socket is using it.

The same eBPF program can be attached to different sockets.

Process exit automatically closes the socket, which calls sk_filter_uncharge()
and decrements the refcnt of the eBPF program.
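
For illustration only, a minimal attach helper in user space could look
like this (prog_fd is assumed to be the fd returned by the BPF_PROG_LOAD
call above; error handling trimmed):

static int attach_ebpf_filter(int sock, int prog_fd)
{
	if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER_EBPF,
		       &prog_fd, sizeof(prog_fd)) < 0)
		return -1;
	/* the filter stays attached until SO_DETACH_FILTER, close(sock)
	 * or process exit
	 */
	return 0;
}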

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 arch/alpha/include/uapi/asm/socket.h   |    2 +
 arch/avr32/include/uapi/asm/socket.h   |    2 +
 arch/cris/include/uapi/asm/socket.h    |    2 +
 arch/frv/include/uapi/asm/socket.h     |    2 +
 arch/ia64/include/uapi/asm/socket.h    |    2 +
 arch/m32r/include/uapi/asm/socket.h    |    2 +
 arch/mips/include/uapi/asm/socket.h    |    2 +
 arch/mn10300/include/uapi/asm/socket.h |    2 +
 arch/parisc/include/uapi/asm/socket.h  |    2 +
 arch/powerpc/include/uapi/asm/socket.h |    2 +
 arch/s390/include/uapi/asm/socket.h    |    2 +
 arch/sparc/include/uapi/asm/socket.h   |    2 +
 arch/xtensa/include/uapi/asm/socket.h  |    2 +
 include/linux/filter.h                 |    1 +
 include/uapi/asm-generic/socket.h      |    2 +
 net/core/filter.c                      |  112 ++++++++++++++++++++++++++++++++
 net/core/sock.c                        |   13 ++++
 17 files changed, 154 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 3de1394bcab8..8c83c376b5ba 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -87,4 +87,6 @@
 
 #define SO_BPF_EXTENSIONS	48
 
+#define SO_ATTACH_FILTER_EBPF	49
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 6e6cd159924b..498ef7220466 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -80,4 +80,6 @@
 
 #define SO_BPF_EXTENSIONS	48
 
+#define SO_ATTACH_FILTER_EBPF	49
+
 #endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/cris/include/uapi/asm/socket.h b/arch/cris/include/uapi/asm/socket.h
index ed94e5ed0a23..0d5120724780 100644
--- a/arch/cris/include/uapi/asm/socket.h
+++ b/arch/cris/include/uapi/asm/socket.h
@@ -82,6 +82,8 @@
 
 #define SO_BPF_EXTENSIONS	48
 
+#define SO_ATTACH_FILTER_EBPF	49
+
 #endif /* _ASM_SOCKET_H */
 
 
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index ca2c6e6f31c6..81fba267c285 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -80,5 +80,7 @@
 
 #define SO_BPF_EXTENSIONS	48
 
+#define SO_ATTACH_FILTER_EBPF	49
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index a1b49bac7951..9cbb2e82fa7c 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -89,4 +89,6 @@
 
 #define SO_BPF_EXTENSIONS	48
 
+#define SO_ATTACH_FILTER_EBPF	49
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 6c9a24b3aefa..587ac2fb4106 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -80,4 +80,6 @@
 
 #define SO_BPF_EXTENSIONS	48
 
+#define SO_ATTACH_FILTER_EBPF	49
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index a14baa218c76..ab1aed2306db 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -98,4 +98,6 @@
 
 #define SO_BPF_EXTENSIONS	48
 
+#define SO_ATTACH_FILTER_EBPF	49
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index 6aa3ce1854aa..1c4f916d0ef1 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -80,4 +80,6 @@
 
 #define SO_BPF_EXTENSIONS	48
 
+#define SO_ATTACH_FILTER_EBPF	49
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index fe35ceacf0e7..d189bb79ca07 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -79,4 +79,6 @@
 
 #define SO_BPF_EXTENSIONS	0x4029
 
+#define SO_ATTACH_FILTER_EBPF	0x402a
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index a9c3e2e18c05..88488f24ae7f 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -87,4 +87,6 @@
 
 #define SO_BPF_EXTENSIONS	48
 
+#define SO_ATTACH_FILTER_EBPF	49
+
 #endif	/* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index e031332096d7..c5f26af90366 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -86,4 +86,6 @@
 
 #define SO_BPF_EXTENSIONS	48
 
+#define SO_ATTACH_FILTER_EBPF	49
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 54d9608681b6..667ed3fa63f2 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -76,6 +76,8 @@
 
 #define SO_BPF_EXTENSIONS	0x0032
 
+#define SO_ATTACH_FILTER_EBPF	0x0033
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 39acec0cf0b1..24f3e4434979 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -91,4 +91,6 @@
 
 #define SO_BPF_EXTENSIONS	48
 
+#define SO_ATTACH_FILTER_EBPF	49
+
 #endif	/* _XTENSA_SOCKET_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 822b310e75e1..5a310ed28fbb 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -73,6 +73,7 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
 void sk_unattached_filter_destroy(struct sk_filter *fp);
 
 int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
+int sk_attach_filter_ebpf(u32 ufd, struct sock *sk);
 int sk_detach_filter(struct sock *sk);
 
 int sk_chk_filter(const struct sock_filter *filter, unsigned int flen);
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index ea0796bdcf88..f41844e9ac07 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -82,4 +82,6 @@
 
 #define SO_BPF_EXTENSIONS	48
 
+#define SO_ATTACH_FILTER_EBPF	49
+
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/filter.c b/net/core/filter.c
index 255dba1bb678..ea929fed67b4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -44,6 +44,7 @@
 #include <linux/ratelimit.h>
 #include <linux/seccomp.h>
 #include <linux/if_vlan.h>
+#include <linux/bpf.h>
 
 /**
  *	sk_filter - run a packet through a socket filter
@@ -1117,6 +1118,117 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_attach_filter);
 
+int sk_attach_filter_ebpf(u32 ufd, struct sock *sk)
+{
+	struct sk_filter *fp, *old_fp;
+
+	if (sock_flag(sk, SOCK_FILTER_LOCKED))
+		return -EPERM;
+
+	fp = bpf_prog_get(ufd);
+	if (!fp)
+		return -EINVAL;
+
+	if (fp->info->prog_type != BPF_PROG_TYPE_SOCKET_FILTER) {
+		/* valid fd, but invalid program type */
+		sk_filter_release(fp);
+		return -EINVAL;
+	}
+
+	old_fp = rcu_dereference_protected(sk->sk_filter,
+					   sock_owned_by_user(sk));
+	rcu_assign_pointer(sk->sk_filter, fp);
+
+	if (old_fp)
+		sk_filter_uncharge(sk, old_fp);
+
+	return 0;
+}
+
+static struct bpf_func_proto sock_filter_funcs[] = {
+	[BPF_FUNC_map_lookup_elem] = {
+		.func = bpf_map_lookup_elem,
+		.gpl_only = false,
+		.ret_type = RET_PTR_TO_MAP_OR_NULL,
+		.arg1_type = ARG_CONST_MAP_ID,
+		.arg2_type = ARG_PTR_TO_MAP_KEY,
+	},
+	[BPF_FUNC_map_update_elem] = {
+		.func = bpf_map_update_elem,
+		.gpl_only = false,
+		.ret_type = RET_INTEGER,
+		.arg1_type = ARG_CONST_MAP_ID,
+		.arg2_type = ARG_PTR_TO_MAP_KEY,
+		.arg3_type = ARG_PTR_TO_MAP_VALUE,
+	},
+	[BPF_FUNC_map_delete_elem] = {
+		.func = bpf_map_delete_elem,
+		.gpl_only = false,
+		.ret_type = RET_INTEGER,
+		.arg1_type = ARG_CONST_MAP_ID,
+		.arg2_type = ARG_PTR_TO_MAP_KEY,
+	},
+};
+
+/* allow socket filters to call
+ * bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
+ */
+static const struct bpf_func_proto *sock_filter_func_proto(enum bpf_func_id func_id)
+{
+	if (func_id < 0 || func_id >= ARRAY_SIZE(sock_filter_funcs))
+		return NULL;
+	return &sock_filter_funcs[func_id];
+}
+
+static const struct bpf_context_access {
+	int size;
+	enum bpf_access_type type;
+} sock_filter_ctx_access[] = {
+	[offsetof(struct sk_buff, mark)] = {
+		FIELD_SIZEOF(struct sk_buff, mark), BPF_READ
+	},
+	[offsetof(struct sk_buff, protocol)] = {
+		FIELD_SIZEOF(struct sk_buff, protocol), BPF_READ
+	},
+	[offsetof(struct sk_buff, queue_mapping)] = {
+		FIELD_SIZEOF(struct sk_buff, queue_mapping), BPF_READ
+	},
+};
+
+/* allow socket filters to access to 'mark', 'protocol' and 'queue_mapping'
+ * fields of 'struct sk_buff'
+ */
+static bool sock_filter_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+	const struct bpf_context_access *access;
+
+	if (off < 0 || off >= ARRAY_SIZE(sock_filter_ctx_access))
+		return false;
+
+	access = &sock_filter_ctx_access[off];
+	if (access->size == size && (access->type & type))
+		return true;
+
+	return false;
+}
+
+static struct bpf_verifier_ops sock_filter_ops = {
+	.get_func_proto = sock_filter_func_proto,
+	.is_valid_access = sock_filter_is_valid_access,
+};
+
+static struct bpf_prog_type_list tl = {
+	.ops = &sock_filter_ops,
+	.type = BPF_PROG_TYPE_SOCKET_FILTER,
+};
+
+static int __init register_sock_filter_ops(void)
+{
+	bpf_register_prog_type(&tl);
+	return 0;
+}
+late_initcall(register_sock_filter_ops);
+
 int sk_detach_filter(struct sock *sk)
 {
 	int ret = -ENOENT;
diff --git a/net/core/sock.c b/net/core/sock.c
index 026e01f70274..005d5683ef5c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -895,6 +895,19 @@ set_rcvbuf:
 		}
 		break;
 
+	case SO_ATTACH_FILTER_EBPF:
+		ret = -EINVAL;
+		if (optlen == sizeof(u32)) {
+			u32 ufd;
+
+			ret = -EFAULT;
+			if (copy_from_user(&ufd, optval, sizeof(ufd)))
+				break;
+
+			ret = sk_attach_filter_ebpf(ufd, sk);
+		}
+		break;
+
 	case SO_DETACH_FILTER:
 		ret = sk_detach_filter(sk);
 		break;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 13/16] tracing: allow eBPF programs to be attached to events
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (12 preceding siblings ...)
  (?)
@ 2014-07-18  4:20 ` Alexei Starovoitov
  2014-07-23 23:46   ` Kees Cook
  -1 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

User interface:
fd = open("/sys/kernel/debug/tracing/__event__/filter")

write(fd, "bpf_123")

where 123 is the process-local FD associated with a previously loaded eBPF
program and __event__ is a static tracepoint event
(kprobe events will be supported in future patches).
Once the program is successfully attached to the tracepoint event, the
tracepoint is auto-enabled.

close(fd)
auto-disables the tracepoint event and detaches the eBPF program from it
(a minimal usage sketch follows the helper list below).

eBPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- memcmp
- trace_printk
- load_pointer
- dump_stack
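
As a usage sketch only (mirroring the dropmon sample later in this series;
prog_fd is assumed to come from BPF_PROG_LOAD with type
BPF_PROG_TYPE_TRACING_FILTER):

/* attach prog_fd to the skb:kfree_skb tracepoint; keeping the returned
 * filter fd open keeps the program attached, close() detaches it
 */
static int attach_to_kfree_skb(int prog_fd)
{
	char cmd[32];
	int fd;

	fd = open("/sys/kernel/debug/tracing/events/skb/kfree_skb/filter",
		  O_WRONLY);
	if (fd < 0)
		return -1;

	snprintf(cmd, sizeof(cmd), "bpf_%d", prog_fd);
	if (write(fd, cmd, strlen(cmd)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}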

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/ftrace_event.h       |    5 +
 include/trace/bpf_trace.h          |   29 +++++
 include/trace/ftrace.h             |   10 ++
 include/uapi/linux/bpf.h           |    5 +
 kernel/trace/Kconfig               |    1 +
 kernel/trace/Makefile              |    1 +
 kernel/trace/bpf_trace.c           |  212 ++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h               |    3 +
 kernel/trace/trace_events.c        |   36 +++++-
 kernel/trace/trace_events_filter.c |   72 +++++++++++-
 10 files changed, 372 insertions(+), 2 deletions(-)
 create mode 100644 include/trace/bpf_trace.h
 create mode 100644 kernel/trace/bpf_trace.c

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index cff3106ffe2c..de313bd9a434 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -237,6 +237,7 @@ enum {
 	TRACE_EVENT_FL_WAS_ENABLED_BIT,
 	TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
 	TRACE_EVENT_FL_TRACEPOINT_BIT,
+	TRACE_EVENT_FL_BPF_BIT,
 };
 
 /*
@@ -259,6 +260,7 @@ enum {
 	TRACE_EVENT_FL_WAS_ENABLED	= (1 << TRACE_EVENT_FL_WAS_ENABLED_BIT),
 	TRACE_EVENT_FL_USE_CALL_FILTER	= (1 << TRACE_EVENT_FL_USE_CALL_FILTER_BIT),
 	TRACE_EVENT_FL_TRACEPOINT	= (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
+	TRACE_EVENT_FL_BPF		= (1 << TRACE_EVENT_FL_BPF_BIT),
 };
 
 struct ftrace_event_call {
@@ -536,6 +538,9 @@ event_trigger_unlock_commit_regs(struct ftrace_event_file *file,
 		event_triggers_post_call(file, tt);
 }
 
+struct bpf_context;
+void trace_filter_call_bpf(struct event_filter *filter, struct bpf_context *ctx);
+
 enum {
 	FILTER_OTHER = 0,
 	FILTER_STATIC_STRING,
diff --git a/include/trace/bpf_trace.h b/include/trace/bpf_trace.h
new file mode 100644
index 000000000000..2122437f1317
--- /dev/null
+++ b/include/trace/bpf_trace.h
@@ -0,0 +1,29 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_KERNEL_BPF_TRACE_H
+#define _LINUX_KERNEL_BPF_TRACE_H
+
+/* For tracing filters save the first six arguments of tracepoint events.
+ * On 64-bit architectures the argN fields match the arguments passed to
+ * tracepoint events one to one.
+ * On 32-bit architectures u64 arguments to events are split across two
+ * consecutive argN, argN+1 fields. Pointers, u32, u16, u8 and bool types
+ * match one to one
+ */
+struct bpf_context {
+	unsigned long arg1;
+	unsigned long arg2;
+	unsigned long arg3;
+	unsigned long arg4;
+	unsigned long arg5;
+	unsigned long arg6;
+};
+
+/* call from ftrace_raw_event_*() to copy tracepoint arguments into ctx */
+void populate_bpf_context(struct bpf_context *ctx, ...);
+
+#endif /* _LINUX_KERNEL_BPF_TRACE_H */
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 26b4f2e13275..ad4987ac68bb 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -17,6 +17,7 @@
  */
 
 #include <linux/ftrace_event.h>
+#include <trace/bpf_trace.h>
 
 /*
  * DECLARE_EVENT_CLASS can be used to add a generic function
@@ -634,6 +635,15 @@ ftrace_raw_event_##call(void *__data, proto)				\
 	if (ftrace_trigger_soft_disabled(ftrace_file))			\
 		return;							\
 									\
+	if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) &&	\
+	    unlikely(ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF)) { \
+		struct bpf_context __ctx;				\
+									\
+		populate_bpf_context(&__ctx, args, 0, 0, 0, 0, 0);	\
+		trace_filter_call_bpf(ftrace_file->filter, &__ctx);	\
+		return;							\
+	}								\
+									\
 	__data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
 									\
 	entry = ftrace_event_buffer_reserve(&fbuffer, ftrace_file,	\
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 06e0f63055fb..cedcf9a0db53 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -370,6 +370,7 @@ enum bpf_prog_attributes {
 enum bpf_prog_type {
 	BPF_PROG_TYPE_UNSPEC,
 	BPF_PROG_TYPE_SOCKET_FILTER,
+	BPF_PROG_TYPE_TRACING_FILTER,
 };
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -380,6 +381,10 @@ enum bpf_func_id {
 	BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(map_id, void *key) */
 	BPF_FUNC_map_update_elem, /* int map_update_elem(map_id, void *key, void *value) */
 	BPF_FUNC_map_delete_elem, /* int map_delete_elem(map_id, void *key) */
+	BPF_FUNC_load_pointer,    /* void *bpf_load_pointer(void *unsafe_ptr) */
+	BPF_FUNC_memcmp,          /* int bpf_memcmp(void *unsafe_ptr, void *safe_ptr, int size) */
+	BPF_FUNC_dump_stack,      /* void bpf_dump_stack(void) */
+	BPF_FUNC_printk,          /* int bpf_printk(const char *fmt, int fmt_size, ...) */
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index d4409356f40d..e36d42876634 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -80,6 +80,7 @@ config FTRACE_NMI_ENTER
 
 config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
+	depends on NET
 	bool
 
 config CONTEXT_SWITCH_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 2611613f14f1..a0fcfd97101d 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o
 endif
 obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
 obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
+obj-$(CONFIG_EVENT_TRACING) += bpf_trace.o
 obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
 obj-$(CONFIG_TRACEPOINTS) += power-traces.o
 ifeq ($(CONFIG_PM_RUNTIME),y)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
new file mode 100644
index 000000000000..7263491be792
--- /dev/null
+++ b/kernel/trace/bpf_trace.c
@@ -0,0 +1,212 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/uaccess.h>
+#include <trace/bpf_trace.h>
+#include "trace.h"
+
+/* call from ftrace_raw_event_*() to copy tracepoint arguments into ctx */
+void populate_bpf_context(struct bpf_context *ctx, ...)
+{
+	va_list args;
+
+	va_start(args, ctx);
+
+	ctx->arg1 = va_arg(args, unsigned long);
+	ctx->arg2 = va_arg(args, unsigned long);
+	ctx->arg3 = va_arg(args, unsigned long);
+	ctx->arg4 = va_arg(args, unsigned long);
+	ctx->arg5 = va_arg(args, unsigned long);
+	ctx->arg6 = va_arg(args, unsigned long);
+
+	va_end(args);
+}
+EXPORT_SYMBOL_GPL(populate_bpf_context);
+
+/* called from eBPF program with rcu lock held */
+static u64 bpf_load_ptr(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	void *unsafe_ptr = (void *) (unsigned long) r1;
+	void *ptr = NULL;
+
+	probe_kernel_read(&ptr, unsafe_ptr, sizeof(void *));
+	return (u64) (unsigned long) ptr;
+}
+
+static u64 bpf_memcmp(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	void *unsafe_ptr = (void *) (unsigned long) r1;
+	void *safe_ptr = (void *) r2;
+	u32 size = (u32) r3;
+	char buf[64];
+	int err;
+
+	if (size < 64) {
+		err = probe_kernel_read(buf, unsafe_ptr, size);
+		if (err)
+			return err;
+		return memcmp(buf, safe_ptr, size);
+	}
+	return -1;
+}
+
+static u64 bpf_dump_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	trace_dump_stack(0);
+	return 0;
+}
+
+/* limited printk()
+ * only %d %u %x conversion specifiers allowed
+ */
+static u64 bpf_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5)
+{
+	char *fmt = (char *) r1;
+	int fmt_cnt = 0;
+	int i;
+
+	/* bpf_check() guarantees that fmt points to bpf program stack and
+	 * fmt_size bytes of it were initialized by bpf program
+	 */
+	if (fmt[fmt_size - 1] != 0)
+		return -EINVAL;
+
+	/* check format string for allowed specifiers */
+	for (i = 0; i < fmt_size; i++)
+		if (fmt[i] == '%') {
+			if (i + 1 >= fmt_size)
+				return -EINVAL;
+			if (fmt[i + 1] != 'd' && fmt[i + 1] != 'u' &&
+			    fmt[i + 1] != 'x')
+				return -EINVAL;
+			fmt_cnt++;
+		}
+
+	if (fmt_cnt > 3)
+		return -EINVAL;
+
+	return __trace_printk((unsigned long) __builtin_return_address(3), fmt,
+			      (u32) r3, (u32) r4, (u32) r5);
+}
+
+static struct bpf_func_proto tracing_filter_funcs[] = {
+	[BPF_FUNC_load_pointer] = {
+		.func = bpf_load_ptr,
+		.gpl_only = true,
+		.ret_type = RET_INTEGER,
+	},
+	[BPF_FUNC_memcmp] = {
+		.func = bpf_memcmp,
+		.gpl_only = false,
+		.ret_type = RET_INTEGER,
+		.arg1_type = ARG_ANYTHING,
+		.arg2_type = ARG_PTR_TO_STACK,
+		.arg3_type = ARG_CONST_STACK_SIZE,
+	},
+	[BPF_FUNC_dump_stack] = {
+		.func = bpf_dump_stack,
+		.gpl_only = false,
+		.ret_type = RET_VOID,
+	},
+	[BPF_FUNC_printk] = {
+		.func = bpf_printk,
+		.gpl_only = true,
+		.ret_type = RET_INTEGER,
+		.arg1_type = ARG_PTR_TO_STACK,
+		.arg2_type = ARG_CONST_STACK_SIZE,
+	},
+	[BPF_FUNC_map_lookup_elem] = {
+		.func = bpf_map_lookup_elem,
+		.gpl_only = false,
+		.ret_type = RET_PTR_TO_MAP_OR_NULL,
+		.arg1_type = ARG_CONST_MAP_ID,
+		.arg2_type = ARG_PTR_TO_MAP_KEY,
+	},
+	[BPF_FUNC_map_update_elem] = {
+		.func = bpf_map_update_elem,
+		.gpl_only = false,
+		.ret_type = RET_INTEGER,
+		.arg1_type = ARG_CONST_MAP_ID,
+		.arg2_type = ARG_PTR_TO_MAP_KEY,
+		.arg3_type = ARG_PTR_TO_MAP_VALUE,
+	},
+	[BPF_FUNC_map_delete_elem] = {
+		.func = bpf_map_delete_elem,
+		.gpl_only = false,
+		.ret_type = RET_INTEGER,
+		.arg1_type = ARG_CONST_MAP_ID,
+		.arg2_type = ARG_PTR_TO_MAP_KEY,
+	},
+};
+
+static const struct bpf_func_proto *tracing_filter_func_proto(enum bpf_func_id func_id)
+{
+	if (func_id < 0 || func_id >= ARRAY_SIZE(tracing_filter_funcs))
+		return NULL;
+	return &tracing_filter_funcs[func_id];
+}
+
+static const struct bpf_context_access {
+	int size;
+	enum bpf_access_type type;
+} tracing_filter_ctx_access[] = {
+	[offsetof(struct bpf_context, arg1)] = {
+		FIELD_SIZEOF(struct bpf_context, arg1),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, arg2)] = {
+		FIELD_SIZEOF(struct bpf_context, arg2),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, arg3)] = {
+		FIELD_SIZEOF(struct bpf_context, arg3),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, arg4)] = {
+		FIELD_SIZEOF(struct bpf_context, arg4),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, arg5)] = {
+		FIELD_SIZEOF(struct bpf_context, arg5),
+		BPF_READ
+	},
+};
+
+static bool tracing_filter_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+	const struct bpf_context_access *access;
+
+	if (off < 0 || off >= ARRAY_SIZE(tracing_filter_ctx_access))
+		return false;
+
+	access = &tracing_filter_ctx_access[off];
+	if (access->size == size && (access->type & type))
+		return true;
+
+	return false;
+}
+
+static struct bpf_verifier_ops tracing_filter_ops = {
+	.get_func_proto = tracing_filter_func_proto,
+	.is_valid_access = tracing_filter_is_valid_access,
+};
+
+static struct bpf_prog_type_list tl = {
+	.ops = &tracing_filter_ops,
+	.type = BPF_PROG_TYPE_TRACING_FILTER,
+};
+
+static int __init register_tracing_filter_ops(void)
+{
+	bpf_register_prog_type(&tl);
+	return 0;
+}
+late_initcall(register_tracing_filter_ops);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 9258f5a815db..bb7c6a19ead5 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -984,12 +984,15 @@ struct ftrace_event_field {
 	int			is_signed;
 };
 
+struct sk_filter;
+
 struct event_filter {
 	int			n_preds;	/* Number assigned */
 	int			a_preds;	/* allocated */
 	struct filter_pred	*preds;
 	struct filter_pred	*root;
 	char			*filter_string;
+	struct sk_filter	*prog;
 };
 
 struct event_subsystem {
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index f99e0b3bca8c..de79c27a0a42 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1048,6 +1048,26 @@ event_filter_read(struct file *filp, char __user *ubuf, size_t cnt,
 	return r;
 }
 
+static int event_filter_release(struct inode *inode, struct file *filp)
+{
+	struct ftrace_event_file *file;
+	char buf[2] = "0";
+
+	mutex_lock(&event_mutex);
+	file = event_file_data(filp);
+	if (file) {
+		if (file->event_call->flags & TRACE_EVENT_FL_BPF) {
+			/* auto-disable the filter */
+			ftrace_event_enable_disable(file, 0);
+
+			/* if BPF filter was used, clear it on fd close */
+			apply_event_filter(file, buf);
+		}
+	}
+	mutex_unlock(&event_mutex);
+	return 0;
+}
+
 static ssize_t
 event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
 		   loff_t *ppos)
@@ -1071,10 +1091,23 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
 
 	mutex_lock(&event_mutex);
 	file = event_file_data(filp);
-	if (file)
+	if (file) {
 		err = apply_event_filter(file, buf);
+		if (!err && file->event_call->flags & TRACE_EVENT_FL_BPF)
+			/* once the eBPF filter is applied, auto-enable the event */
+			ftrace_event_enable_disable(file, 1);
+	}
+
 	mutex_unlock(&event_mutex);
 
+	if (file && file->event_call->flags & TRACE_EVENT_FL_BPF) {
+		/*
+		 * allocate per-cpu printk buffers, since eBPF program
+		 * might be calling bpf_printk
+		 */
+		trace_printk_init_buffers();
+	}
+
 	free_page((unsigned long) buf);
 	if (err < 0)
 		return err;
@@ -1325,6 +1358,7 @@ static const struct file_operations ftrace_event_filter_fops = {
 	.open = tracing_open_generic,
 	.read = event_filter_read,
 	.write = event_filter_write,
+	.release = event_filter_release,
 	.llseek = default_llseek,
 };
 
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 8a8631926a07..a27526fae0fe 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -23,6 +23,9 @@
 #include <linux/mutex.h>
 #include <linux/perf_event.h>
 #include <linux/slab.h>
+#include <linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include <linux/filter.h>
 
 #include "trace.h"
 #include "trace_output.h"
@@ -535,6 +538,16 @@ static int filter_match_preds_cb(enum move_type move, struct filter_pred *pred,
 	return WALK_PRED_DEFAULT;
 }
 
+void trace_filter_call_bpf(struct event_filter *filter, struct bpf_context *ctx)
+{
+	BUG_ON(!filter || !filter->prog);
+
+	rcu_read_lock();
+	SK_RUN_FILTER(filter->prog, (void *) ctx);
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(trace_filter_call_bpf);
+
 /* return 1 if event matches, 0 otherwise (discard) */
 int filter_match_preds(struct event_filter *filter, void *rec)
 {
@@ -794,6 +807,8 @@ static void __free_filter(struct event_filter *filter)
 	if (!filter)
 		return;
 
+	if (filter->prog)
+		sk_unattached_filter_destroy(filter->prog);
 	__free_preds(filter);
 	kfree(filter->filter_string);
 	kfree(filter);
@@ -1898,6 +1913,48 @@ static int create_filter_start(char *filter_str, bool set_str,
 	return err;
 }
 
+static int create_filter_bpf(char *filter_str, struct event_filter **filterp)
+{
+	struct event_filter *filter;
+	struct sk_filter *prog;
+	long ufd;
+	int err = 0;
+
+	*filterp = NULL;
+
+	filter = __alloc_filter();
+	if (!filter)
+		return -ENOMEM;
+
+	err = replace_filter_string(filter, filter_str);
+	if (err)
+		goto free_filter;
+
+	err = kstrtol(filter_str + 4, 0, &ufd);
+	if (err)
+		goto free_filter;
+
+	err = -ESRCH;
+	prog = bpf_prog_get(ufd);
+	if (!prog)
+		goto free_filter;
+
+	filter->prog = prog;
+
+	err = -EINVAL;
+	if (prog->info->prog_type != BPF_PROG_TYPE_TRACING_FILTER)
+		/* valid fd, but it's not a tracing filter program */
+		goto free_filter;
+
+	*filterp = filter;
+
+	return 0;
+
+free_filter:
+	__free_filter(filter);
+	return err;
+}
+
 static void create_filter_finish(struct filter_parse_state *ps)
 {
 	if (ps) {
@@ -2007,7 +2064,20 @@ int apply_event_filter(struct ftrace_event_file *file, char *filter_string)
 		return 0;
 	}
 
-	err = create_filter(call, filter_string, true, &filter);
+	/*
+	 * 'bpf_123' string is a request to attach eBPF program with fd == 123
+	 * also accept 'bpf 123', 'bpf.123', 'bpf-123' variants
+	 */
+	if (memcmp(filter_string, "bpf", 3) == 0 && filter_string[3] != 0 &&
+	    filter_string[4] != 0) {
+		err = create_filter_bpf(filter_string, &filter);
+		if (!err)
+			call->flags |= TRACE_EVENT_FL_BPF;
+	} else {
+		err = create_filter(call, filter_string, true, &filter);
+		if (!err)
+			call->flags &= ~TRACE_EVENT_FL_BPF;
+	}
 
 	/*
 	 * Always swap the call filter with the new filter
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 14/16] samples: bpf: add mini eBPF library to manipulate maps and programs
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (13 preceding siblings ...)
  (?)
@ 2014-07-18  4:20 ` Alexei Starovoitov
  -1 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

the library includes a trivial set of BPF syscall wrappers:

int bpf_create_map(int key_size, int value_size, int max_entries);

int bpf_update_elem(int fd, void *key, void *value);

int bpf_lookup_elem(int fd, void *key, void *value);

int bpf_delete_elem(int fd, void *key);

int bpf_get_next_key(int fd, void *key, void *next_key);

int bpf_prog_load(enum bpf_prog_type prog_type,
		  const struct bpf_insn *insns, int insn_len,
		  const char *license,
		  const struct bpf_map_fixup *fixups, int fixup_len);
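
A rough usage sketch of the map wrappers (error handling omitted; key and
value sizes match the sock_example that follows in this series):

int key = 6;			/* IPPROTO_TCP, as in the samples */
long long value = 0, out;
int map_fd;

map_fd = bpf_create_map(sizeof(key), sizeof(value), 2);
bpf_update_elem(map_fd, &key, &value);	/* insert key 6 -> 0 */
bpf_lookup_elem(map_fd, &key, &out);	/* read it back into 'out' */
bpf_delete_elem(map_fd, &key);		/* and remove it */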

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/libbpf.c |  109 ++++++++++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/libbpf.h |   22 ++++++++++
 2 files changed, 131 insertions(+)
 create mode 100644 samples/bpf/libbpf.c
 create mode 100644 samples/bpf/libbpf.h

diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
new file mode 100644
index 000000000000..5460d1f3ad43
--- /dev/null
+++ b/samples/bpf/libbpf.c
@@ -0,0 +1,109 @@
+/* eBPF mini library */
+#include <stdlib.h>
+#include <linux/unistd.h>
+#include <unistd.h>
+#include <string.h>
+#include <linux/netlink.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include "libbpf.h"
+
+struct nlattr_u32 {
+	__u16 nla_len;
+	__u16 nla_type;
+	__u32 val;
+};
+
+int bpf_create_map(int key_size, int value_size, int max_entries)
+{
+	struct nlattr_u32 attr[] = {
+		{
+			.nla_len = sizeof(struct nlattr_u32),
+			.nla_type = BPF_MAP_KEY_SIZE,
+			.val = key_size,
+		},
+		{
+			.nla_len = sizeof(struct nlattr_u32),
+			.nla_type = BPF_MAP_VALUE_SIZE,
+			.val = value_size,
+		},
+		{
+			.nla_len = sizeof(struct nlattr_u32),
+			.nla_type = BPF_MAP_MAX_ENTRIES,
+			.val = max_entries,
+		},
+	};
+
+	return syscall(__NR_bpf, BPF_MAP_CREATE, BPF_MAP_TYPE_HASH, attr, sizeof(attr));
+}
+
+
+int bpf_update_elem(int fd, void *key, void *value)
+{
+	return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, fd, key, value);
+}
+
+int bpf_lookup_elem(int fd, void *key, void *value)
+{
+	return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, fd, key, value);
+}
+
+int bpf_delete_elem(int fd, void *key)
+{
+	return syscall(__NR_bpf, BPF_MAP_DELETE_ELEM, fd, key);
+}
+
+int bpf_get_next_key(int fd, void *key, void *next_key)
+{
+	return syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, fd, key, next_key);
+}
+
+#define ROUND_UP(x, n) (((x) + (n) - 1u) & ~((n) - 1u))
+
+int bpf_prog_load(enum bpf_prog_type prog_type,
+		  const struct bpf_insn *insns, int prog_len,
+		  const char *license,
+		  const struct bpf_map_fixup *fixups, int fixup_len)
+{
+	int nlattr_size, license_len, err;
+	void *nlattr, *ptr;
+
+	license_len = strlen(license) + 1;
+	nlattr_size = sizeof(struct nlattr) + prog_len + sizeof(struct nlattr) +
+		ROUND_UP(license_len, 4) + sizeof(struct nlattr) +
+		fixup_len;
+
+	ptr = nlattr = malloc(nlattr_size);
+
+	*(struct nlattr *) ptr = (struct nlattr) {
+		.nla_len = prog_len + sizeof(struct nlattr),
+		.nla_type = BPF_PROG_TEXT,
+	};
+	ptr += sizeof(struct nlattr);
+
+	memcpy(ptr, insns, prog_len);
+	ptr += prog_len;
+
+	*(struct nlattr *) ptr = (struct nlattr) {
+		.nla_len = ROUND_UP(license_len, 4) + sizeof(struct nlattr),
+		.nla_type = BPF_PROG_LICENSE,
+	};
+	ptr += sizeof(struct nlattr);
+
+	memcpy(ptr, license, license_len);
+	ptr += ROUND_UP(license_len, 4);
+
+	*(struct nlattr *) ptr = (struct nlattr) {
+		.nla_len = fixup_len + sizeof(struct nlattr),
+		.nla_type = BPF_PROG_MAP_FIXUP,
+	};
+	ptr += sizeof(struct nlattr);
+
+	memcpy(ptr, fixups, fixup_len);
+	ptr += fixup_len;
+
+	err = syscall(__NR_bpf, BPF_PROG_LOAD, prog_type, nlattr, nlattr_size);
+
+	free(nlattr);
+	return err;
+}
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
new file mode 100644
index 000000000000..07f668e3dac0
--- /dev/null
+++ b/samples/bpf/libbpf.h
@@ -0,0 +1,22 @@
+/* eBPF mini library */
+#ifndef __LIBBPF_H
+#define __LIBBPF_H
+
+struct bpf_insn;
+
+int bpf_create_map(int key_size, int value_size, int max_entries);
+int bpf_update_elem(int fd, void *key, void *value);
+int bpf_lookup_elem(int fd, void *key, void *value);
+int bpf_delete_elem(int fd, void *key);
+int bpf_get_next_key(int fd, void *key, void *next_key);
+
+struct bpf_map_fixup {
+	int insn_idx;
+	int fd;
+};
+int bpf_prog_load(enum bpf_prog_type prog_type,
+		  const struct bpf_insn *insns, int insn_len,
+		  const char *license,
+		  const struct bpf_map_fixup *fixups, int fixup_len);
+
+#endif
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 15/16] samples: bpf: example of stateful socket filtering
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (14 preceding siblings ...)
  (?)
@ 2014-07-18  4:20 ` Alexei Starovoitov
  -1 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

this socket filter example does:

- creates a hashtable in kernel with key 4 bytes and value 8 bytes

- populates map[6] = 0; map[17] = 0;  // 6 - tcp_proto, 17 - udp_proto

- loads eBPF program:
  r0 = skb[14 + 9]; // load one byte of ip->proto
  *(u32*)(fp - 4) = r0;
  value = bpf_map_lookup_elem(map_id, fp - 4);
  if (value)
       (*(u64*)value) += 1;

- attaches this program to eth0 raw socket

- every second user space reads map[6] and map[17] to see how many
  TCP and UDP packets were seen on eth0

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/.gitignore     |    1 +
 samples/bpf/Makefile       |   13 ++++
 samples/bpf/sock_example.c |  161 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 175 insertions(+)
 create mode 100644 samples/bpf/.gitignore
 create mode 100644 samples/bpf/Makefile
 create mode 100644 samples/bpf/sock_example.c

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
new file mode 100644
index 000000000000..5465c6e92a00
--- /dev/null
+++ b/samples/bpf/.gitignore
@@ -0,0 +1 @@
+sock_example
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
new file mode 100644
index 000000000000..95c990151644
--- /dev/null
+++ b/samples/bpf/Makefile
@@ -0,0 +1,13 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-y := sock_example
+
+sock_example-objs := sock_example.o libbpf.o
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_libbpf.o += -I$(objtree)/usr/include
+HOSTCFLAGS_sock_example.o += -I$(objtree)/usr/include
diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
new file mode 100644
index 000000000000..9b23b2b7e9d0
--- /dev/null
+++ b/samples/bpf/sock_example.c
@@ -0,0 +1,161 @@
+/* eBPF example program:
+ * - creates a hashtable in kernel with key 4 bytes and value 8 bytes
+ *
+ * - populates map[6] = 0; map[17] = 0;  // 6 - tcp_proto, 17 - udp_proto
+ *
+ * - loads eBPF program:
+ *   r0 = skb[14 + 9]; // load one byte of ip->proto
+ *   *(u32*)(fp - 4) = r0;
+ *   value = bpf_map_lookup_elem(map_id, fp - 4);
+ *   if (value)
+ *        (*(u64*)value) += 1;
+ *
+ * - attaches this program to eth0 raw socket
+ *
+ * - every second user space reads map[6] and map[17] to see how many
+ *   TCP and UDP packets were seen on eth0
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <asm-generic/socket.h>
+#include <linux/netlink.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <linux/sockios.h>
+#include <linux/if_packet.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <linux/unistd.h>
+#include <string.h>
+#include <linux/filter.h>
+#include <stdlib.h>
+#include <arpa/inet.h>
+#include "libbpf.h"
+
+static int open_raw_sock(const char *name)
+{
+	struct sockaddr_ll sll;
+	struct packet_mreq mr;
+	struct ifreq ifr;
+	int sock;
+
+	sock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK | SOCK_CLOEXEC, htons(ETH_P_ALL));
+	if (sock < 0) {
+		printf("cannot open socket!\n");
+		return -1;
+	}
+
+	memset(&ifr, 0, sizeof(ifr));
+	strncpy((char *)ifr.ifr_name, name, IFNAMSIZ);
+	if (ioctl(sock, SIOCGIFINDEX, &ifr) < 0) {
+		printf("ioctl: %s\n", strerror(errno));
+		close(sock);
+		return -1;
+	}
+
+	memset(&sll, 0, sizeof(sll));
+	sll.sll_family = AF_PACKET;
+	sll.sll_ifindex = ifr.ifr_ifindex;
+	sll.sll_protocol = htons(ETH_P_ALL);
+	if (bind(sock, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
+		printf("bind: %s\n", strerror(errno));
+		close(sock);
+		return -1;
+	}
+
+	memset(&mr, 0, sizeof(mr));
+	mr.mr_ifindex = ifr.ifr_ifindex;
+	mr.mr_type = PACKET_MR_PROMISC;
+	if (setsockopt(sock, SOL_PACKET, PACKET_ADD_MEMBERSHIP, &mr, sizeof(mr)) < 0) {
+		printf("set_promisc: %s\n", strerror(errno));
+		close(sock);
+		return -1;
+	}
+	return sock;
+}
+
+static int test_sock(void)
+{
+	static struct bpf_insn prog[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+		BPF_LD_ABS(BPF_B, 14 + 9 /* R0 = ip->proto */),
+		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+		BPF_MOV64_IMM(BPF_REG_1, 0), /* r1 = MAP_ID */
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
+		BPF_EXIT_INSN(),
+	};
+	static struct bpf_map_fixup fixup[1];
+
+	int sock = -1, map_fd, prog_fd, i, key;
+	long long value = 0, tcp_cnt, udp_cnt;
+
+	map_fd = bpf_create_map(sizeof(key), sizeof(value), 2);
+	if (map_fd < 0) {
+		printf("failed to create map '%s'\n", strerror(errno));
+		/* must have been left from previous aborted run, delete it */
+		goto cleanup;
+	}
+
+	key = 6; /* tcp */
+	if (bpf_update_elem(map_fd, &key, &value) < 0) {
+		printf("update err key=%d\n", key);
+		goto cleanup;
+	}
+
+	key = 17; /* udp */
+	if (bpf_update_elem(map_fd, &key, &value) < 0) {
+		printf("update err key=%d\n", key);
+		goto cleanup;
+	}
+
+	fixup[0].insn_idx = 5;
+	fixup[0].fd = map_fd;
+
+	prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, sizeof(prog),
+				"GPL", fixup, sizeof(fixup));
+	if (prog_fd < 0) {
+		printf("failed to load prog '%s'\n", strerror(errno));
+		goto cleanup;
+	}
+
+	sock = open_raw_sock("eth0");
+
+	if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER_EBPF, &prog_fd, sizeof(prog_fd)) < 0) {
+		printf("setsockopt %d\n", errno);
+		goto cleanup;
+	}
+
+	for (i = 0; i < 10; i++) {
+		key = 6;
+		if (bpf_lookup_elem(map_fd, &key, &tcp_cnt) < 0) {
+			printf("lookup err\n");
+			break;
+		}
+		key = 17;
+		if (bpf_lookup_elem(map_fd, &key, &udp_cnt) < 0) {
+			printf("lookup err\n");
+			break;
+		}
+		printf("TCP %lld UDP %lld packets\n", tcp_cnt, udp_cnt);
+		sleep(1);
+	}
+
+cleanup:
+	/* maps, programs, raw sockets will auto cleanup on process exit */
+
+	return 0;
+}
+
+int main(void)
+{
+	test_sock();
+	return 0;
+}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH RFC v2 net-next 16/16] samples: bpf: example of tracing filters with eBPF
  2014-07-18  4:19 ` Alexei Starovoitov
                   ` (15 preceding siblings ...)
  (?)
@ 2014-07-18  4:20 ` Alexei Starovoitov
  -1 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-18  4:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Linus Torvalds, Andy Lutomirski, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, linux-api, netdev,
	linux-kernel

simple packet drop monitor:
- in-kernel eBPF program attaches to kfree_skb() event and records number
  of packet drops at given location
- userspace iterates over the map every second and prints stats

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/Makefile  |    4 +-
 samples/bpf/dropmon.c |  134 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 137 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/dropmon.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 95c990151644..8e3dfa0c25e4 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,12 +2,14 @@
 obj- := dummy.o
 
 # List of programs to build
-hostprogs-y := sock_example
+hostprogs-y := sock_example dropmon
 
 sock_example-objs := sock_example.o libbpf.o
+dropmon-objs := dropmon.o libbpf.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
 
 HOSTCFLAGS_libbpf.o += -I$(objtree)/usr/include
 HOSTCFLAGS_sock_example.o += -I$(objtree)/usr/include
+HOSTCFLAGS_dropmon.o += -I$(objtree)/usr/include
diff --git a/samples/bpf/dropmon.c b/samples/bpf/dropmon.c
new file mode 100644
index 000000000000..d3d38832fc74
--- /dev/null
+++ b/samples/bpf/dropmon.c
@@ -0,0 +1,134 @@
+/* simple packet drop monitor:
+ * - in-kernel eBPF program attaches to kfree_skb() event and records number
+ *   of packet drops at given location
+ * - userspace iterates over the map every second and prints stats
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <asm-generic/socket.h>
+#include <linux/netlink.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <linux/sockios.h>
+#include <linux/if_packet.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <linux/unistd.h>
+#include <string.h>
+#include <linux/filter.h>
+#include <stdlib.h>
+#include <arpa/inet.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include "libbpf.h"
+
+#define TRACEPOINT "/sys/kernel/debug/tracing/events/skb/kfree_skb/"
+
+static int write_to_file(const char *file, const char *str, bool keep_open)
+{
+	int fd, err;
+
+	fd = open(file, O_WRONLY);
+	err = write(fd, str, strlen(str));
+	(void) err;
+
+	if (keep_open) {
+		return fd;
+	} else {
+		close(fd);
+		return -1;
+	}
+}
+
+static int dropmon(void)
+{
+	/* the following eBPF program is equivalent to C:
+	 * void filter(struct bpf_context *ctx)
+	 * {
+	 *   long loc = ctx->arg2;
+	 *   long init_val = 1;
+	 *   void *value;
+	 *
+	 *   value = bpf_map_lookup_elem(MAP_ID, &loc);
+	 *   if (value) {
+	 *      (*(long *) value) += 1;
+	 *   } else {
+	 *      bpf_map_update_elem(MAP_ID, &loc, &init_val);
+	 *   }
+	 * }
+	 */
+	static struct bpf_insn prog[] = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8), /* r2 = *(u64 *)(r1 + 8) */
+		BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -8), /* *(u64 *)(fp - 8) = r2 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+		BPF_MOV64_IMM(BPF_REG_1, 0), /* r1 = MAP_ID */
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
+		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+		BPF_EXIT_INSN(),
+		BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1), /* *(u64 *)(fp - 16) = 1 */
+		BPF_MOV64_REG(BPF_REG_3, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -16), /* r3 = fp - 16 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+		BPF_MOV64_IMM(BPF_REG_1, 0), /* r1 = MAP_ID */
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_update_elem),
+		BPF_EXIT_INSN(),
+	};
+
+	static struct bpf_map_fixup fixup[2];
+	long long key, next_key, value = 0;
+	int prog_fd, map_fd, i;
+	char fmt[32];
+
+	map_fd = bpf_create_map(sizeof(key), sizeof(value), 1024);
+	if (map_fd < 0) {
+		printf("failed to create map '%s'\n", strerror(errno));
+		goto cleanup;
+	}
+
+	fixup[0].insn_idx = 4;
+	fixup[0].fd = map_fd;
+	fixup[1].insn_idx = 15;
+	fixup[1].fd = map_fd;
+
+	prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER, prog,
+				sizeof(prog), "GPL", fixup, sizeof(fixup));
+	if (prog_fd < 0) {
+		printf("failed to load prog '%s'\n", strerror(errno));
+		return -1;
+	}
+
+	sprintf(fmt, "bpf_%d", prog_fd);
+
+	write_to_file(TRACEPOINT "filter", fmt, true);
+
+	for (i = 0; i < 10; i++) {
+		key = 0;
+		while (bpf_get_next_key(map_fd, &key, &next_key) == 0) {
+			bpf_lookup_elem(map_fd, &next_key, &value);
+			printf("location 0x%llx count %lld\n", next_key, value);
+			key = next_key;
+		}
+		if (key)
+			printf("\n");
+		sleep(1);
+	}
+
+cleanup:
+	/* maps, programs, tracepoint filters will auto cleanup on process exit */
+
+	return 0;
+}
+
+int main(void)
+{
+	dropmon();
+	return 0;
+}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 02/16] bpf: update MAINTAINERS entry
  2014-07-18  4:19   ` Alexei Starovoitov
  (?)
@ 2014-07-23 17:37   ` Kees Cook
  2014-07-23 17:48       ` Alexei Starovoitov
  -1 siblings, 1 reply; 62+ messages in thread
From: Kees Cook @ 2014-07-23 17:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Thu, Jul 17, 2014 at 9:19 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>  MAINTAINERS |    7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ae8cd00215b2..32e24ff46da3 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1912,6 +1912,13 @@ S:       Supported
>  F:     drivers/net/bonding/
>  F:     include/uapi/linux/if_bonding.h
>
> +BPF (Safe dynamic programs and tools)

bikeshed: I feel like this shouldn't be an acronym. Maybe instead:

BERKELEY PACKET FILTER (BPF: Safe dynamic programs and tools)

-Kees

> +M:     Alexei Starovoitov <ast@kernel.org>
> +L:     netdev@vger.kernel.org
> +L:     linux-kernel@vger.kernel.org
> +S:     Supported
> +F:     kernel/bpf/
> +
>  BROADCOM B44 10/100 ETHERNET DRIVER
>  M:     Gary Zambrano <zambrano@broadcom.com>
>  L:     netdev@vger.kernel.org
> --
> 1.7.9.5
>



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 02/16] bpf: update MAINTAINERS entry
@ 2014-07-23 17:48       ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-23 17:48 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Wed, Jul 23, 2014 at 10:37 AM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Jul 17, 2014 at 9:19 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>> ---
>>  MAINTAINERS |    7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index ae8cd00215b2..32e24ff46da3 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1912,6 +1912,13 @@ S:       Supported
>>  F:     drivers/net/bonding/
>>  F:     include/uapi/linux/if_bonding.h
>>
>> +BPF (Safe dynamic programs and tools)
>
> bikeshed: I feel like this shouldn't be an acronym. Maybe instead:
>
> BERKELEY PACKET FILTER (BPF: Safe dynamic programs and tools)

pile on :)

I think eBPF is no longer an acronym. 'e' stands for 'extended',
but BPF is no longer 'packet filter' only and definitely not 'berkeley'.
So I'd rather keep BPF as a magic abbreviation without spelling it out,
since the full name is historic and no longer meaningful.
I've considered coming up with a brand new abbreviation and full name
for this instruction set, but none looked good and all lose in comparison
to the 'eBPF' name, which is concise and carries enough historical
references to explain the idea behind the new ISA.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 05/16] bpf: introduce syscall(BPF, ...) and BPF maps
@ 2014-07-23 18:02     ` Kees Cook
  0 siblings, 0 replies; 62+ messages in thread
From: Kees Cook @ 2014-07-23 18:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Thu, Jul 17, 2014 at 9:19 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> BPF syscall is a demux for different BPF related commands.
>
> 'maps' is a generic storage of different types for sharing data between kernel
> and userspace.
>
> The maps can be created from user space via BPF syscall:
> - create a map with given type and attributes
>   fd = bpf_map_create(map_type, struct nlattr *attr, int len)
>   returns fd or negative error
>
> - close(fd) deletes the map
>
> Next patch allows userspace programs to populate/read maps that eBPF programs
> are concurrently updating.
>
> maps can have different types: hash, bloom filter, radix-tree, etc.
>
> The map is defined by:
>   . type
>   . max number of elements
>   . key size in bytes
>   . value size in bytes
>
> Next patches allow eBPF programs to access maps via API:
>   void * bpf_map_lookup_elem(u32 fd, void *key);
>   int bpf_map_update_elem(u32 fd, void *key, void *value);
>   int bpf_map_delete_elem(u32 fd, void *key);
>
> This patch establishes core infrastructure for BPF maps.
> Next patches implement lookup/update and hashtable type.
> More map types can be added in the future.
>
> syscall is using type-length-value style of passing arguments to be backwards
> compatible with future extensions to map attributes. Different map types may
> use different attributes as well.
> The concept of type-length-value is borrowed from netlink, but netlink itself
> is not applicable here, since BPF programs and maps can be used in NET-less
> configurations.
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>  Documentation/networking/filter.txt |   69 +++++++++++
>  include/linux/bpf.h                 |   43 +++++++
>  include/uapi/linux/bpf.h            |   24 ++++
>  kernel/bpf/Makefile                 |    2 +-
>  kernel/bpf/syscall.c                |  225 +++++++++++++++++++++++++++++++++++
>  5 files changed, 362 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/bpf.h
>  create mode 100644 kernel/bpf/syscall.c
>
> diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
> index ee78eba78a9d..e14e486f69cd 100644
> --- a/Documentation/networking/filter.txt
> +++ b/Documentation/networking/filter.txt
> @@ -995,6 +995,75 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
>  Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
>  2 byte atomic increments are not supported.
>
> +eBPF maps
> +---------
> +'maps' is a generic storage of different types for sharing data between kernel
> +and userspace.
> +
> +The maps are accessed from user space via BPF syscall, which has commands:
> +- create a map with given id, type and attributes
> +  map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
> +  returns positive map id or negative error

Looks like these docs need updating for the fd-based approach instead
of the map_id approach?
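
For example, mirroring the fd-based API from the commit message above,
this doc text could read (rough sketch, not final wording):

  - create a map with given type and attributes
    fd = bpf_map_create(map_type, struct nlattr *attr, int len)
    returns fd or negative error

  - lookup key in a given map referenced by fd
    err = bpf_map_lookup_elem(int fd, void *key, void *value)

  - close(fd) deletes the map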

> +
> +- delete map with given map id
> +  err = bpf_map_delete(int map_id)
> +  returns zero or negative error
> +
> +- lookup key in a given map referenced by map_id
> +  err = bpf_map_lookup_elem(int map_id, void *key, void *value)
> +  returns zero and stores found elem into value or negative error
> +
> +- create or update key/value pair in a given map
> +  err = bpf_map_update_elem(int map_id, void *key, void *value)
> +  returns zero or negative error
> +
> +- find and delete element by key in a given map
> +  err = bpf_map_delete_elem(int map_id, void *key)
> +
> +userspace programs use this API to create/populate/read maps that eBPF programs
> +are concurrently updating.
> +
> +maps can have different types: hash, bloom filter, radix-tree, etc.
> +
> +The map is defined by:
> +  . id
> +  . type
> +  . max number of elements
> +  . key size in bytes
> +  . value size in bytes
> +
> +The maps are accessible from eBPF programs via this API:
> +  void * bpf_map_lookup_elem(u32 map_id, void *key);
> +  int bpf_map_update_elem(u32 map_id, void *key, void *value);
> +  int bpf_map_delete_elem(u32 map_id, void *key);
> +
> +If eBPF verifier is configured to recognize extra calls in the program
> +bpf_map_lookup_elem() and bpf_map_update_elem() then access to maps looks like:
> +  ...
> +  ptr_to_value = map_lookup_elem(const_int_map_id, key)
> +  access memory [ptr_to_value, ptr_to_value + value_size_in_bytes]
> +  ...
> +  prepare key2 and value2 on stack of key_size and value_size
> +  err = map_update_elem(const_int_map_id2, key2, value2)
> +  ...
> +
> +eBPF program cannot create or delete maps
> +(such calls will be unknown to verifier)
> +
> +During program loading the refcnt of used maps is incremented, so they don't get
> +deleted while program is running
> +
> +bpf_map_update_elem() can fail if the maximum number of elements is reached.
> +If key2 already exists, bpf_map_update_elem() replaces it with value2 atomically.
> +
> +bpf_map_lookup_elem() can return NULL or ptr_to_value.
> +ptr_to_value is read/write from the program's point of view.
> +
> +The verifier will check that the program accesses map elements within specified
> +size. It will not let programs pass junk values as 'key' and 'value' to
> +bpf_map_*_elem() functions, so these functions (implemented in C inside kernel)
> +can safely access the pointers in all cases.
> +
>  Testing
>  -------
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> new file mode 100644
> index 000000000000..57af236a0eb4
> --- /dev/null
> +++ b/include/linux/bpf.h
> @@ -0,0 +1,43 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#ifndef _LINUX_BPF_H
> +#define _LINUX_BPF_H 1
> +
> +#include <uapi/linux/bpf.h>
> +#include <linux/workqueue.h>
> +
> +struct bpf_map;
> +struct nlattr;
> +
> +/* map is generic key/value storage optionally accessible by eBPF programs */
> +struct bpf_map_ops {
> +       /* funcs callable from userspace (via syscall) */
> +       struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]);
> +       void (*map_free)(struct bpf_map *);
> +};
> +
> +struct bpf_map {
> +       atomic_t refcnt;
> +       int map_id;
> +       enum bpf_map_type map_type;
> +       u32 key_size;
> +       u32 value_size;
> +       u32 max_entries;
> +       struct bpf_map_ops *ops;
> +       struct work_struct work;
> +};
> +
> +struct bpf_map_type_list {
> +       struct list_head list_node;
> +       struct bpf_map_ops *ops;
> +       enum bpf_map_type type;
> +};
> +
> +void bpf_register_map_type(struct bpf_map_type_list *tl);
> +struct bpf_map *bpf_map_get(u32 map_id);
> +
> +#endif /* _LINUX_BPF_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 3ff5bf5045a7..dcc7eb97a64a 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -300,4 +300,28 @@ struct bpf_insn {
>         __s32   imm;            /* signed immediate constant */
>  };
>
> +/* BPF syscall commands */
> +enum bpf_cmd {
> +       /* create a map with given type and attributes
> +        * fd = bpf_map_create(bpf_map_type, struct nlattr *attr, int len)
> +        * returns fd or negative error
> +        * map is deleted when fd is closed
> +        */
> +       BPF_MAP_CREATE,
> +};
> +
> +enum bpf_map_attributes {
> +       BPF_MAP_UNSPEC,
> +       BPF_MAP_KEY_SIZE,       /* size of key in bytes */
> +       BPF_MAP_VALUE_SIZE,     /* size of value in bytes */
> +       BPF_MAP_MAX_ENTRIES,    /* maximum number of entries in a map */
> +       __BPF_MAP_ATTR_MAX,
> +};
> +#define BPF_MAP_ATTR_MAX (__BPF_MAP_ATTR_MAX - 1)
> +#define BPF_MAP_MAX_ATTR_SIZE 65535
> +
> +enum bpf_map_type {
> +       BPF_MAP_TYPE_UNSPEC,
> +};
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 6a71145e2769..e9f7334ed07a 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -1 +1 @@
> -obj-y := core.o
> +obj-y := core.o syscall.o
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> new file mode 100644
> index 000000000000..c4a330642653
> --- /dev/null
> +++ b/kernel/bpf/syscall.c
> @@ -0,0 +1,225 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/bpf.h>
> +#include <linux/syscalls.h>
> +#include <net/netlink.h>
> +#include <linux/anon_inodes.h>
> +
> +/* mutex to protect insertion/deletion of map_id in IDR */
> +static DEFINE_MUTEX(bpf_map_lock);
> +static DEFINE_IDR(bpf_map_id_idr);
> +
> +/* maximum number of outstanding maps */
> +#define MAX_BPF_MAP_CNT 1024
> +static u32 bpf_map_cnt;
> +
> +static LIST_HEAD(bpf_map_types);
> +
> +static struct bpf_map *find_and_alloc_map(enum bpf_map_type type,
> +                                         struct nlattr *tb[BPF_MAP_ATTR_MAX + 1])
> +{
> +       struct bpf_map_type_list *tl;
> +       struct bpf_map *map;
> +
> +       list_for_each_entry(tl, &bpf_map_types, list_node) {
> +               if (tl->type == type) {
> +                       map = tl->ops->map_alloc(tb);
> +                       if (IS_ERR(map))
> +                               return map;
> +                       map->ops = tl->ops;
> +                       map->map_type = type;
> +                       return map;
> +               }
> +       }
> +       return ERR_PTR(-EINVAL);
> +}
> +
> +/* boot time registration of different map implementations */
> +void bpf_register_map_type(struct bpf_map_type_list *tl)
> +{
> +       list_add(&tl->list_node, &bpf_map_types);
> +}
> +
> +/* called from workqueue */
> +static void bpf_map_free_deferred(struct work_struct *work)
> +{
> +       struct bpf_map *map = container_of(work, struct bpf_map, work);
> +
> +       /* grab the mutex and free the map */
> +       mutex_lock(&bpf_map_lock);
> +
> +       bpf_map_cnt--;
> +       idr_remove(&bpf_map_id_idr, map->map_id);
> +
> +       mutex_unlock(&bpf_map_lock);
> +
> +       /* implementation dependent freeing */
> +       map->ops->map_free(map);
> +}
> +
> +/* decrement map refcnt and schedule it for freeing via workqueue
> + * (underlying map implementation ops->map_free() might sleep)
> + */
> +static void __bpf_map_put(struct bpf_map *map)
> +{
> +       if (atomic_dec_and_test(&map->refcnt)) {
> +               INIT_WORK(&map->work, bpf_map_free_deferred);
> +               schedule_work(&map->work);
> +       }
> +}
> +
> +/* find map by id and decrement its refcnt
> + *
> + * can be called without any locks held
> + *
> + * returns true if map was found
> + */
> +static bool bpf_map_put(u32 map_id)
> +{
> +       struct bpf_map *map;
> +
> +       rcu_read_lock();
> +       map = idr_find(&bpf_map_id_idr, map_id);
> +
> +       if (!map) {
> +               rcu_read_unlock();
> +               return false;
> +       }
> +
> +       __bpf_map_put(map);
> +       rcu_read_unlock();
> +
> +       return true;
> +}
> +
> +/* called with bpf_map_lock held */
> +struct bpf_map *bpf_map_get(u32 map_id)
> +{
> +       BUG_ON(!mutex_is_locked(&bpf_map_lock));
> +
> +       return idr_find(&bpf_map_id_idr, map_id);
> +}
> +
> +static int bpf_map_release(struct inode *inode, struct file *filp)
> +{
> +       struct bpf_map *map = filp->private_data;
> +
> +       __bpf_map_put(map);
> +       return 0;
> +}
> +
> +static const struct file_operations bpf_map_fops = {
> +        .release = bpf_map_release,
> +};
> +
> +static const struct nla_policy map_policy[BPF_MAP_ATTR_MAX + 1] = {
> +       [BPF_MAP_KEY_SIZE]    = { .type = NLA_U32 },
> +       [BPF_MAP_VALUE_SIZE]  = { .type = NLA_U32 },
> +       [BPF_MAP_MAX_ENTRIES] = { .type = NLA_U32 },
> +};
> +
> +/* called via syscall */
> +static int map_create(enum bpf_map_type type, struct nlattr __user *uattr, int len)
> +{
> +       struct nlattr *tb[BPF_MAP_ATTR_MAX + 1];
> +       struct bpf_map *map;
> +       struct nlattr *attr;
> +       int err;
> +
> +       if (len <= 0 || len > BPF_MAP_MAX_ATTR_SIZE)
> +               return -EINVAL;
> +
> +       attr = kmalloc(len, GFP_USER);
> +       if (!attr)
> +               return -ENOMEM;
> +
> +       /* copy map attributes from user space */
> +       err = -EFAULT;
> +       if (copy_from_user(attr, uattr, len) != 0)
> +               goto free_attr;
> +
> +       /* perform basic validation */
> +       err = nla_parse(tb, BPF_MAP_ATTR_MAX, attr, len, map_policy);
> +       if (err < 0)
> +               goto free_attr;
> +
> +       /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
> +       map = find_and_alloc_map(type, tb);
> +       if (IS_ERR(map)) {
> +               err = PTR_ERR(map);
> +               goto free_attr;
> +       }
> +
> +       atomic_set(&map->refcnt, 1);
> +
> +       mutex_lock(&bpf_map_lock);
> +
> +       if (bpf_map_cnt >= MAX_BPF_MAP_CNT) {
> +               mutex_unlock(&bpf_map_lock);
> +               err = -ENOSPC;
> +               goto free_map;
> +       }
> +
> +       /* allocate map id */
> +       err = idr_alloc(&bpf_map_id_idr, map, 1 /* min map_id */, 0, GFP_USER);
> +
> +       if (err > 0)
> +               bpf_map_cnt++;
> +
> +       map->map_id = err;
> +
> +       mutex_unlock(&bpf_map_lock);
> +
> +       if (err < 0)
> +               /* failed to allocate map id */
> +               goto free_map;
> +
> +       err = anon_inode_getfd("bpf-map", &bpf_map_fops, map, O_RDWR | O_CLOEXEC);
> +
> +       if (err < 0)
> +               /* failed to allocate fd */
> +               goto free_map_id;
> +
> +       /* user supplied array of map attributes is no longer needed */
> +       kfree(attr);
> +
> +       return err;
> +
> +free_map_id:
> +       /* grab the mutex and free the map */
> +       mutex_lock(&bpf_map_lock);
> +
> +       bpf_map_cnt--;
> +       idr_remove(&bpf_map_id_idr, map->map_id);
> +
> +       mutex_unlock(&bpf_map_lock);
> +free_map:
> +       map->ops->map_free(map);
> +free_attr:
> +       kfree(attr);
> +       return err;
> +}
> +
> +SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
> +               unsigned long, arg4, unsigned long, arg5)
> +{
> +       if (!capable(CAP_SYS_ADMIN))
> +               return -EPERM;

It might be valuable to have a comment here describing why this is
currently limited to CAP_SYS_ADMIN.
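
Something along these lines would do (wording is only a sketch, not
taken from the patch):

	/* Root-only for now: the verifier does not yet detect and reject
	 * leaking of kernel addresses to user space, so this check cannot
	 * be relaxed until that is in place.
	 */
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;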

> +
> +       switch (cmd) {
> +       case BPF_MAP_CREATE:
> +               return map_create((enum bpf_map_type) arg2,
> +                                 (struct nlattr __user *) arg3, (int) arg4);

I'd recommend requiring arg5 == 0 here, just for future flexibility.
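
For instance (untested, just to show the shape):

	case BPF_MAP_CREATE:
		if (arg5 != 0)
			return -EINVAL;
		return map_create((enum bpf_map_type) arg2,
				  (struct nlattr __user *) arg3, (int) arg4);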

-Kees

> +       default:
> +               return -EINVAL;
> +       }
> +}
> --
> 1.7.9.5
>



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 07/16] bpf: add lookup/update/delete/iterate methods to BPF maps
  2014-07-18  4:19 ` [PATCH RFC v2 net-next 07/16] bpf: add lookup/update/delete/iterate methods to BPF maps Alexei Starovoitov
@ 2014-07-23 18:25   ` Kees Cook
  2014-07-23 19:49     ` Alexei Starovoitov
  0 siblings, 1 reply; 62+ messages in thread
From: Kees Cook @ 2014-07-23 18:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Thu, Jul 17, 2014 at 9:19 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> 'maps' is a generic storage of different types for sharing data between kernel
> and userspace.
>
> The maps are accessed from user space via BPF syscall, which has commands:
>
> - create a map with given type and attributes
>   fd = bpf_map_create(map_type, struct nlattr *attr, int len)
>   returns fd or negative error
>
> - lookup key in a given map referenced by fd
>   err = bpf_map_lookup_elem(int fd, void *key, void *value)
>   returns zero and stores found elem into value or negative error
>
> - create or update key/value pair in a given map
>   err = bpf_map_update_elem(int fd, void *key, void *value)
>   returns zero or negative error
>
> - find and delete element by key in a given map
>   err = bpf_map_delete_elem(int fd, void *key)
>
> - iterate map elements (based on input key return next_key)
>   err = bpf_map_get_next_key(int fd, void *key, void *next_key)
>
> - close(fd) deletes the map
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>  include/linux/bpf.h      |    6 ++
>  include/uapi/linux/bpf.h |   25 ++++++
>  kernel/bpf/syscall.c     |  209 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 240 insertions(+)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 57af236a0eb4..91e2caf8edf9 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -18,6 +18,12 @@ struct bpf_map_ops {
>         /* funcs callable from userspace (via syscall) */
>         struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]);
>         void (*map_free)(struct bpf_map *);
> +       int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
> +
> +       /* funcs callable from userspace and from eBPF programs */
> +       void *(*map_lookup_elem)(struct bpf_map *map, void *key);
> +       int (*map_update_elem)(struct bpf_map *map, void *key, void *value);
> +       int (*map_delete_elem)(struct bpf_map *map, void *key);
>  };
>
>  struct bpf_map {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index dcc7eb97a64a..5e1bfbc9cdc7 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -308,6 +308,31 @@ enum bpf_cmd {
>          * map is deleted when fd is closed
>          */
>         BPF_MAP_CREATE,
> +
> +       /* lookup key in a given map referenced by map_id
> +        * err = bpf_map_lookup_elem(int map_id, void *key, void *value)

These comments need the same map_id -> fd documentation updates too?
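
i.e. the fd-based form from the commit message, something like:

	/* lookup key in a given map referenced by fd
	 * err = bpf_map_lookup_elem(int fd, void *key, void *value)
	 */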

> +        * returns zero and stores found elem into value
> +        * or negative error
> +        */
> +       BPF_MAP_LOOKUP_ELEM,
> +
> +       /* create or update key/value pair in a given map
> +        * err = bpf_map_update_elem(int map_id, void *key, void *value)
> +        * returns zero or negative error
> +        */
> +       BPF_MAP_UPDATE_ELEM,
> +
> +       /* find and delete elem by key in a given map
> +        * err = bpf_map_delete_elem(int map_id, void *key)
> +        * returns zero or negative error
> +        */
> +       BPF_MAP_DELETE_ELEM,
> +
> +       /* lookup key in a given map and return next key
> +        * err = bpf_map_get_elem(int map_id, void *key, void *next_key)
> +        * returns zero and stores next key or negative error
> +        */
> +       BPF_MAP_GET_NEXT_KEY,
>  };
>
>  enum bpf_map_attributes {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index c4a330642653..ca2be66845b3 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -13,6 +13,7 @@
>  #include <linux/syscalls.h>
>  #include <net/netlink.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/file.h>
>
>  /* mutex to protect insertion/deletion of map_id in IDR */
>  static DEFINE_MUTEX(bpf_map_lock);
> @@ -209,6 +210,202 @@ free_attr:
>         return err;
>  }
>
> +static int get_map_id(struct fd f)
> +{
> +       struct bpf_map *map;
> +
> +       if (!f.file)
> +               return -EBADF;
> +
> +       if (f.file->f_op != &bpf_map_fops) {
> +               fdput(f);

It feels weird to me to do the fdput() inside this function. Should
map_lookup_elem() get an "err_put" label instead?
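
Roughly (untested sketch, leaving the fdput() to the callers):

	static int get_map_id(struct fd f)
	{
		struct bpf_map *map;

		if (!f.file)
			return -EBADF;

		if (f.file->f_op != &bpf_map_fops)
			return -EINVAL;	/* caller still owns f and does fdput() */

		map = f.file->private_data;

		return map->map_id;
	}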

> +               return -EINVAL;
> +       }
> +
> +       map = f.file->private_data;
> +
> +       return map->map_id;
> +}
> +
> +static int map_lookup_elem(int ufd, void __user *ukey, void __user *uvalue)
> +{
> +       struct fd f = fdget(ufd);
> +       struct bpf_map *map;
> +       void *key, *value;
> +       int err;
> +
> +       err = get_map_id(f);
> +       if (err < 0)
> +               return err;

for example:

    if (err < 0)
        goto fail_put;

> +
> +       rcu_read_lock();
> +       map = idr_find(&bpf_map_id_idr, err);
> +       err = -EINVAL;
> +       if (!map)
> +               goto err_unlock;
> +
> +       err = -ENOMEM;
> +       key = kmalloc(map->key_size, GFP_ATOMIC);
> +       if (!key)
> +               goto err_unlock;
> +
> +       err = -EFAULT;
> +       if (copy_from_user(key, ukey, map->key_size) != 0)
> +               goto free_key;
> +
> +       err = -ESRCH;
> +       value = map->ops->map_lookup_elem(map, key);
> +       if (!value)
> +               goto free_key;
> +
> +       err = -EFAULT;
> +       if (copy_to_user(uvalue, value, map->value_size) != 0)
> +               goto free_key;

I'm uncomfortable with memory copying where explicit lengths from
userspace aren't being used. It does look like it would be redundant,
though. Are there other syscalls where the kernel may stomp on user
memory based on internal kernel sizes? I think this is fine as-is, but
it makes me want to think harder about it. :)

> +
> +       err = 0;
> +
> +free_key:
> +       kfree(key);
> +err_unlock:
> +       rcu_read_unlock();

fail_put:

> +       fdput(f);
> +       return err;
> +}
> +
> +static int map_update_elem(int ufd, void __user *ukey, void __user *uvalue)
> +{
> +       struct fd f = fdget(ufd);
> +       struct bpf_map *map;
> +       void *key, *value;
> +       int err;
> +
> +       err = get_map_id(f);
> +       if (err < 0)
> +               return err;

Same thing?

> +
> +       rcu_read_lock();
> +       map = idr_find(&bpf_map_id_idr, err);
> +       err = -EINVAL;
> +       if (!map)
> +               goto err_unlock;
> +
> +       err = -ENOMEM;
> +       key = kmalloc(map->key_size, GFP_ATOMIC);
> +       if (!key)
> +               goto err_unlock;
> +
> +       err = -EFAULT;
> +       if (copy_from_user(key, ukey, map->key_size) != 0)
> +               goto free_key;
> +
> +       err = -ENOMEM;
> +       value = kmalloc(map->value_size, GFP_ATOMIC);
> +       if (!value)
> +               goto free_key;
> +
> +       err = -EFAULT;
> +       if (copy_from_user(value, uvalue, map->value_size) != 0)
> +               goto free_value;
> +
> +       err = map->ops->map_update_elem(map, key, value);
> +
> +free_value:
> +       kfree(value);
> +free_key:
> +       kfree(key);
> +err_unlock:
> +       rcu_read_unlock();
> +       fdput(f);
> +       return err;
> +}
> +
> +static int map_delete_elem(int ufd, void __user *ukey)
> +{
> +       struct fd f = fdget(ufd);
> +       struct bpf_map *map;
> +       void *key;
> +       int err;
> +
> +       err = get_map_id(f);
> +       if (err < 0)
> +               return err;
> +
> +       rcu_read_lock();
> +       map = idr_find(&bpf_map_id_idr, err);
> +       err = -EINVAL;
> +       if (!map)
> +               goto err_unlock;
> +
> +       err = -ENOMEM;
> +       key = kmalloc(map->key_size, GFP_ATOMIC);
> +       if (!key)
> +               goto err_unlock;
> +
> +       err = -EFAULT;
> +       if (copy_from_user(key, ukey, map->key_size) != 0)
> +               goto free_key;
> +
> +       err = map->ops->map_delete_elem(map, key);
> +
> +free_key:
> +       kfree(key);
> +err_unlock:
> +       rcu_read_unlock();
> +       fdput(f);
> +       return err;
> +}
> +
> +static int map_get_next_key(int ufd, void __user *ukey, void __user *unext_key)
> +{
> +       struct fd f = fdget(ufd);
> +       struct bpf_map *map;
> +       void *key, *next_key;
> +       int err;
> +
> +       err = get_map_id(f);
> +       if (err < 0)
> +               return err;
> +
> +       rcu_read_lock();
> +       map = idr_find(&bpf_map_id_idr, err);
> +       err = -EINVAL;
> +       if (!map)
> +               goto err_unlock;
> +
> +       err = -ENOMEM;
> +       key = kmalloc(map->key_size, GFP_ATOMIC);
> +       if (!key)
> +               goto err_unlock;
> +
> +       err = -EFAULT;
> +       if (copy_from_user(key, ukey, map->key_size) != 0)
> +               goto free_key;
> +
> +       err = -ENOMEM;
> +       next_key = kmalloc(map->key_size, GFP_ATOMIC);

In the interests of defensiveness, I'd use kzalloc here.

> +       if (!next_key)
> +               goto free_key;
> +
> +       err = map->ops->map_get_next_key(map, key, next_key);
> +       if (err)
> +               goto free_next_key;
> +
> +       err = -EFAULT;
> +       if (copy_to_user(unext_key, next_key, map->key_size) != 0)
> +               goto free_next_key;
> +
> +       err = 0;
> +
> +free_next_key:
> +       kfree(next_key);
> +free_key:
> +       kfree(key);
> +err_unlock:
> +       rcu_read_unlock();
> +       fdput(f);
> +       return err;
> +}
> +
>  SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
>                 unsigned long, arg4, unsigned long, arg5)
>  {
> @@ -219,6 +416,18 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
>         case BPF_MAP_CREATE:
>                 return map_create((enum bpf_map_type) arg2,
>                                   (struct nlattr __user *) arg3, (int) arg4);
> +       case BPF_MAP_LOOKUP_ELEM:
> +               return map_lookup_elem((int) arg2, (void __user *) arg3,
> +                                      (void __user *) arg4);
> +       case BPF_MAP_UPDATE_ELEM:
> +               return map_update_elem((int) arg2, (void __user *) arg3,
> +                                      (void __user *) arg4);
> +       case BPF_MAP_DELETE_ELEM:
> +               return map_delete_elem((int) arg2, (void __user *) arg3);
> +
> +       case BPF_MAP_GET_NEXT_KEY:
> +               return map_get_next_key((int) arg2, (void __user *) arg3,
> +                                       (void __user *) arg4);

Same observation as the other syscall cmd: perhaps arg5 == 0 should be
checked? Also, since each of these functions looks up the fd and
builds the key, maybe those should be added to a common helper instead
of copy/pasting into each demuxed function?
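
Something like this (untested, names are just placeholders) would let
each command share the fd-to-map and key-copy steps:

	/* caller must hold rcu_read_lock() and do the fdput() */
	static struct bpf_map *bpf_map_from_fd(struct fd f)
	{
		if (!f.file)
			return ERR_PTR(-EBADF);

		if (f.file->f_op != &bpf_map_fops)
			return ERR_PTR(-EINVAL);

		return f.file->private_data;
	}

	/* copy a key of map->key_size bytes from userspace */
	static void *bpf_copy_key_from_user(struct bpf_map *map, void __user *ukey)
	{
		void *key = kmalloc(map->key_size, GFP_ATOMIC);

		if (!key)
			return ERR_PTR(-ENOMEM);

		if (copy_from_user(key, ukey, map->key_size)) {
			kfree(key);
			return ERR_PTR(-EFAULT);
		}

		return key;
	}

Returning the map straight from f.file->private_data would also avoid
the extra idr_find() in every command.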

-Kees

>         default:
>                 return -EINVAL;
>         }
> --
> 1.7.9.5
>



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 08/16] bpf: add hashtable type of BPF maps
@ 2014-07-23 18:36     ` Kees Cook
  0 siblings, 0 replies; 62+ messages in thread
From: Kees Cook @ 2014-07-23 18:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Thu, Jul 17, 2014 at 9:19 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> add new map type: BPF_MAP_TYPE_HASH
> and its simple (not auto resizeable) hash table implementation
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>  include/uapi/linux/bpf.h |    1 +
>  kernel/bpf/Makefile      |    2 +-
>  kernel/bpf/hashtab.c     |  371 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 373 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/hashtab.c
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 5e1bfbc9cdc7..3ea11ba053a8 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -347,6 +347,7 @@ enum bpf_map_attributes {
>
>  enum bpf_map_type {
>         BPF_MAP_TYPE_UNSPEC,
> +       BPF_MAP_TYPE_HASH,
>  };
>
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index e9f7334ed07a..558e12712ebc 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -1 +1 @@
> -obj-y := core.o syscall.o
> +obj-y := core.o syscall.o hashtab.o
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> new file mode 100644
> index 000000000000..6e481cacbba3
> --- /dev/null
> +++ b/kernel/bpf/hashtab.c
> @@ -0,0 +1,371 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/bpf.h>
> +#include <net/netlink.h>
> +#include <linux/jhash.h>
> +
> +struct bpf_htab {
> +       struct bpf_map map;
> +       struct hlist_head *buckets;
> +       struct kmem_cache *elem_cache;
> +       char *slab_name;
> +       spinlock_t lock;
> +       u32 count; /* number of elements in this hashtable */
> +       u32 n_buckets; /* number of hash buckets */
> +       u32 elem_size; /* size of each element in bytes */
> +};
> +
> +/* each htab element is struct htab_elem + key + value */
> +struct htab_elem {
> +       struct hlist_node hash_node;
> +       struct rcu_head rcu;
> +       struct bpf_htab *htab;
> +       u32 hash;
> +       u32 pad;
> +       char key[0];
> +};
> +
> +#define HASH_MAX_BUCKETS 1024
> +#define BPF_MAP_MAX_KEY_SIZE 256
> +static struct bpf_map *htab_map_alloc(struct nlattr *attr[BPF_MAP_ATTR_MAX + 1])
> +{
> +       struct bpf_htab *htab;
> +       int err, i;
> +
> +       htab = kmalloc(sizeof(*htab), GFP_USER);

I'd prefer kzalloc here.

> +       if (!htab)
> +               return ERR_PTR(-ENOMEM);
> +
> +       /* look for mandatory map attributes */
> +       err = -EINVAL;
> +       if (!attr[BPF_MAP_KEY_SIZE])
> +               goto free_htab;
> +       htab->map.key_size = nla_get_u32(attr[BPF_MAP_KEY_SIZE]);
> +
> +       if (!attr[BPF_MAP_VALUE_SIZE])
> +               goto free_htab;
> +       htab->map.value_size = nla_get_u32(attr[BPF_MAP_VALUE_SIZE]);
> +
> +       if (!attr[BPF_MAP_MAX_ENTRIES])
> +               goto free_htab;
> +       htab->map.max_entries = nla_get_u32(attr[BPF_MAP_MAX_ENTRIES]);
> +
> +       htab->n_buckets = (htab->map.max_entries <= HASH_MAX_BUCKETS) ?
> +                         htab->map.max_entries : HASH_MAX_BUCKETS;
> +
> +       /* hash table size must be power of 2 */
> +       if ((htab->n_buckets & (htab->n_buckets - 1)) != 0)
> +               goto free_htab;
> +
> +       err = -E2BIG;
> +       if (htab->map.key_size > BPF_MAP_MAX_KEY_SIZE)
> +               goto free_htab;
> +
> +       err = -ENOMEM;
> +       htab->buckets = kmalloc(htab->n_buckets * sizeof(struct hlist_head),
> +                               GFP_USER);

I'd prefer kcalloc here, even though n_buckets can't currently trigger
an integer overflow.
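
i.e. (sketch):

	htab->buckets = kcalloc(htab->n_buckets, sizeof(struct hlist_head),
				GFP_USER);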

> +
> +       if (!htab->buckets)
> +               goto free_htab;
> +
> +       for (i = 0; i < htab->n_buckets; i++)
> +               INIT_HLIST_HEAD(&htab->buckets[i]);
> +
> +       spin_lock_init(&htab->lock);
> +       htab->count = 0;
> +
> +       htab->elem_size = sizeof(struct htab_elem) +
> +                         round_up(htab->map.key_size, 8) +
> +                         htab->map.value_size;
> +
> +       htab->slab_name = kasprintf(GFP_USER, "bpf_htab_%p", htab);

This leaks a kernel heap memory pointer to userspace. If a unique name is
needed, I think map_id should be used instead.
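
e.g. something like this (assuming map_id were assigned before the cache
is created, which isn't the case in this patch yet):

	htab->slab_name = kasprintf(GFP_USER, "bpf_htab_%d", htab->map.map_id);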

> +       if (!htab->slab_name)
> +               goto free_buckets;
> +
> +       htab->elem_cache = kmem_cache_create(htab->slab_name,
> +                                            htab->elem_size, 0, 0, NULL);
> +       if (!htab->elem_cache)
> +               goto free_slab_name;
> +
> +       return &htab->map;
> +
> +free_slab_name:
> +       kfree(htab->slab_name);
> +free_buckets:
> +       kfree(htab->buckets);
> +free_htab:
> +       kfree(htab);
> +       return ERR_PTR(err);
> +}
> +
> +static inline u32 htab_map_hash(const void *key, u32 key_len)
> +{
> +       return jhash(key, key_len, 0);
> +}
> +
> +static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
> +{
> +       return &htab->buckets[hash & (htab->n_buckets - 1)];
> +}
> +
> +static struct htab_elem *lookup_elem_raw(struct hlist_head *head, u32 hash,
> +                                        void *key, u32 key_size)
> +{
> +       struct htab_elem *l;
> +
> +       hlist_for_each_entry_rcu(l, head, hash_node) {
> +               if (l->hash == hash && !memcmp(&l->key, key, key_size))
> +                       return l;
> +       }
> +       return NULL;
> +}
> +
> +/* Must be called with rcu_read_lock. */
> +static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +       struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +       struct hlist_head *head;
> +       struct htab_elem *l;
> +       u32 hash, key_size;
> +
> +       WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +       key_size = map->key_size;
> +
> +       hash = htab_map_hash(key, key_size);
> +
> +       head = select_bucket(htab, hash);
> +
> +       l = lookup_elem_raw(head, hash, key, key_size);
> +
> +       if (l)
> +               return l->key + round_up(map->key_size, 8);
> +       else
> +               return NULL;
> +}
> +
> +/* Must be called with rcu_read_lock. */
> +static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +       struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +       struct hlist_head *head;
> +       struct htab_elem *l, *next_l;
> +       u32 hash, key_size;
> +       int i;
> +
> +       WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +       key_size = map->key_size;
> +
> +       hash = htab_map_hash(key, key_size);
> +
> +       head = select_bucket(htab, hash);
> +
> +       /* lookup the key */
> +       l = lookup_elem_raw(head, hash, key, key_size);
> +
> +       if (!l) {
> +               i = 0;
> +               goto find_first_elem;
> +       }
> +
> +       /* key was found, get next key in the same bucket */
> +       next_l = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(&l->hash_node)),
> +                                 struct htab_elem, hash_node);
> +
> +       if (next_l) {
> +               /* if next elem in this hash list is non-zero, just return it */
> +               memcpy(next_key, next_l->key, key_size);
> +               return 0;
> +       } else {
> +               /* no more elements in this hash list, go to the next bucket */
> +               i = hash & (htab->n_buckets - 1);
> +               i++;
> +       }
> +
> +find_first_elem:
> +       /* iterate over buckets */
> +       for (; i < htab->n_buckets; i++) {
> +               head = select_bucket(htab, i);
> +
> +               /* pick first element in the bucket */
> +               next_l = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),
> +                                         struct htab_elem, hash_node);
> +               if (next_l) {
> +                       /* if it's not empty, just return it */
> +                       memcpy(next_key, next_l->key, key_size);
> +                       return 0;
> +               }
> +       }
> +
> +       /* iterated over all buckets and all elements */
> +       return -ENOENT;
> +}
> +
> +static struct htab_elem *htab_alloc_elem(struct bpf_htab *htab)
> +{
> +       void *l;
> +
> +       l = kmem_cache_alloc(htab->elem_cache, GFP_ATOMIC);
> +       if (!l)
> +               return ERR_PTR(-ENOMEM);
> +       return l;
> +}
> +
> +static void free_htab_elem_rcu(struct rcu_head *rcu)
> +{
> +       struct htab_elem *l = container_of(rcu, struct htab_elem, rcu);
> +
> +       kmem_cache_free(l->htab->elem_cache, l);
> +}
> +
> +static void release_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> +{
> +       l->htab = htab;
> +       call_rcu(&l->rcu, free_htab_elem_rcu);
> +}
> +
> +/* Must be called with rcu_read_lock. */
> +static int htab_map_update_elem(struct bpf_map *map, void *key, void *value)
> +{
> +       struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +       struct htab_elem *l_new, *l_old;
> +       struct hlist_head *head;
> +       u32 key_size;
> +
> +       WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +       l_new = htab_alloc_elem(htab);
> +       if (IS_ERR(l_new))
> +               return -ENOMEM;
> +
> +       key_size = map->key_size;
> +
> +       memcpy(l_new->key, key, key_size);
> +       memcpy(l_new->key + round_up(key_size, 8), value, map->value_size);
> +
> +       l_new->hash = htab_map_hash(l_new->key, key_size);
> +
> +       head = select_bucket(htab, l_new->hash);
> +
> +       l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
> +
> +       spin_lock_bh(&htab->lock);
> +       if (!l_old && unlikely(htab->count >= map->max_entries)) {
> +               /* if elem with this 'key' doesn't exist and we've reached
> +                * max_entries limit, fail insertion of new elem
> +                */
> +               spin_unlock_bh(&htab->lock);
> +               kmem_cache_free(htab->elem_cache, l_new);
> +               return -EFBIG;
> +       }
> +
> +       /* add new element to the head of the list, so that concurrent
> +        * search will find it before old elem
> +        */
> +       hlist_add_head_rcu(&l_new->hash_node, head);
> +       if (l_old) {
> +               hlist_del_rcu(&l_old->hash_node);
> +               release_htab_elem(htab, l_old);
> +       } else {
> +               htab->count++;
> +       }
> +       spin_unlock_bh(&htab->lock);
> +
> +       return 0;
> +}
> +
> +/* Must be called with rcu_read_lock. */
> +static int htab_map_delete_elem(struct bpf_map *map, void *key)
> +{
> +       struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +       struct htab_elem *l;
> +       struct hlist_head *head;
> +       u32 hash, key_size;
> +
> +       WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +       key_size = map->key_size;
> +
> +       hash = htab_map_hash(key, key_size);
> +
> +       head = select_bucket(htab, hash);
> +
> +       l = lookup_elem_raw(head, hash, key, key_size);
> +
> +       if (l) {
> +               spin_lock_bh(&htab->lock);
> +               hlist_del_rcu(&l->hash_node);
> +               htab->count--;
> +               release_htab_elem(htab, l);
> +               spin_unlock_bh(&htab->lock);
> +               return 0;
> +       }
> +       return -ESRCH;
> +}
> +
> +static void delete_all_elements(struct bpf_htab *htab)
> +{
> +       int i;
> +
> +       for (i = 0; i < htab->n_buckets; i++) {
> +               struct hlist_head *head = select_bucket(htab, i);
> +               struct hlist_node *n;
> +               struct htab_elem *l;
> +
> +               hlist_for_each_entry_safe(l, n, head, hash_node) {
> +                       hlist_del_rcu(&l->hash_node);
> +                       htab->count--;
> +                       kmem_cache_free(htab->elem_cache, l);
> +               }
> +       }
> +}
> +
> +/* called when map->refcnt goes to zero */
> +static void htab_map_free(struct bpf_map *map)
> +{
> +       struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +
> +       /* wait for all outstanding updates to complete */
> +       synchronize_rcu();
> +
> +       /* kmem_cache_free all htab elements */
> +       delete_all_elements(htab);
> +
> +       /* and destroy cache, which might sleep */
> +       kmem_cache_destroy(htab->elem_cache);
> +
> +       kfree(htab->buckets);
> +       kfree(htab->slab_name);
> +       kfree(htab);
> +}
> +
> +static struct bpf_map_ops htab_ops = {
> +       .map_alloc = htab_map_alloc,
> +       .map_free = htab_map_free,
> +       .map_get_next_key = htab_map_get_next_key,
> +       .map_lookup_elem = htab_map_lookup_elem,
> +       .map_update_elem = htab_map_update_elem,
> +       .map_delete_elem = htab_map_delete_elem,
> +};
> +
> +static struct bpf_map_type_list tl = {
> +       .ops = &htab_ops,
> +       .type = BPF_MAP_TYPE_HASH,
> +};
> +
> +static int __init register_htab_map(void)
> +{
> +       bpf_register_map_type(&tl);
> +       return 0;
> +}
> +late_initcall(register_htab_map);
> --
> 1.7.9.5
>

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 08/16] bpf: add hashtable type of BPF maps
@ 2014-07-23 18:36     ` Kees Cook
  0 siblings, 0 replies; 62+ messages in thread
From: Kees Cook @ 2014-07-23 18:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Thu, Jul 17, 2014 at 9:19 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
> add new map type: BPF_MAP_TYPE_HASH
> and its simple (not auto resizeable) hash table implementation
>
> Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
> ---
>  include/uapi/linux/bpf.h |    1 +
>  kernel/bpf/Makefile      |    2 +-
>  kernel/bpf/hashtab.c     |  371 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 373 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/hashtab.c
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 5e1bfbc9cdc7..3ea11ba053a8 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -347,6 +347,7 @@ enum bpf_map_attributes {
>
>  enum bpf_map_type {
>         BPF_MAP_TYPE_UNSPEC,
> +       BPF_MAP_TYPE_HASH,
>  };
>
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index e9f7334ed07a..558e12712ebc 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -1 +1 @@
> -obj-y := core.o syscall.o
> +obj-y := core.o syscall.o hashtab.o
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> new file mode 100644
> index 000000000000..6e481cacbba3
> --- /dev/null
> +++ b/kernel/bpf/hashtab.c
> @@ -0,0 +1,371 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/bpf.h>
> +#include <net/netlink.h>
> +#include <linux/jhash.h>
> +
> +struct bpf_htab {
> +       struct bpf_map map;
> +       struct hlist_head *buckets;
> +       struct kmem_cache *elem_cache;
> +       char *slab_name;
> +       spinlock_t lock;
> +       u32 count; /* number of elements in this hashtable */
> +       u32 n_buckets; /* number of hash buckets */
> +       u32 elem_size; /* size of each element in bytes */
> +};
> +
> +/* each htab element is struct htab_elem + key + value */
> +struct htab_elem {
> +       struct hlist_node hash_node;
> +       struct rcu_head rcu;
> +       struct bpf_htab *htab;
> +       u32 hash;
> +       u32 pad;
> +       char key[0];
> +};
> +
> +#define HASH_MAX_BUCKETS 1024
> +#define BPF_MAP_MAX_KEY_SIZE 256
> +static struct bpf_map *htab_map_alloc(struct nlattr *attr[BPF_MAP_ATTR_MAX + 1])
> +{
> +       struct bpf_htab *htab;
> +       int err, i;
> +
> +       htab = kmalloc(sizeof(*htab), GFP_USER);

I'd prefer kzalloc here.
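
A minimal sketch of that variant (the rest of htab_map_alloc() as quoted
above; illustrative only):

        /* zero the struct so error paths and any fields that are not
         * explicitly set below start from a known state
         */
        htab = kzalloc(sizeof(*htab), GFP_USER);
        if (!htab)
                return ERR_PTR(-ENOMEM);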

> +       if (!htab)
> +               return ERR_PTR(-ENOMEM);
> +
> +       /* look for mandatory map attributes */
> +       err = -EINVAL;
> +       if (!attr[BPF_MAP_KEY_SIZE])
> +               goto free_htab;
> +       htab->map.key_size = nla_get_u32(attr[BPF_MAP_KEY_SIZE]);
> +
> +       if (!attr[BPF_MAP_VALUE_SIZE])
> +               goto free_htab;
> +       htab->map.value_size = nla_get_u32(attr[BPF_MAP_VALUE_SIZE]);
> +
> +       if (!attr[BPF_MAP_MAX_ENTRIES])
> +               goto free_htab;
> +       htab->map.max_entries = nla_get_u32(attr[BPF_MAP_MAX_ENTRIES]);
> +
> +       htab->n_buckets = (htab->map.max_entries <= HASH_MAX_BUCKETS) ?
> +                         htab->map.max_entries : HASH_MAX_BUCKETS;
> +
> +       /* hash table size must be power of 2 */
> +       if ((htab->n_buckets & (htab->n_buckets - 1)) != 0)
> +               goto free_htab;
> +
> +       err = -E2BIG;
> +       if (htab->map.key_size > BPF_MAP_MAX_KEY_SIZE)
> +               goto free_htab;
> +
> +       err = -ENOMEM;
> +       htab->buckets = kmalloc(htab->n_buckets * sizeof(struct hlist_head),
> +                               GFP_USER);

I'd prefer kcalloc here, even though n_buckets can't currently trigger
an integer overflow.
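
A sketch of the kcalloc() form (same allocation, with the multiply
checked for overflow; illustrative only):

        htab->buckets = kcalloc(htab->n_buckets, sizeof(struct hlist_head),
                                GFP_USER);
        if (!htab->buckets)
                goto free_htab;

Since kcalloc() also zeroes the memory, the INIT_HLIST_HEAD() loop below
becomes a no-op (an empty hlist_head is all zeroes), though keeping it
does no harm.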

> +
> +       if (!htab->buckets)
> +               goto free_htab;
> +
> +       for (i = 0; i < htab->n_buckets; i++)
> +               INIT_HLIST_HEAD(&htab->buckets[i]);
> +
> +       spin_lock_init(&htab->lock);
> +       htab->count = 0;
> +
> +       htab->elem_size = sizeof(struct htab_elem) +
> +                         round_up(htab->map.key_size, 8) +
> +                         htab->map.value_size;
> +
> +       htab->slab_name = kasprintf(GFP_USER, "bpf_htab_%p", htab);

This leaks a kernel heap memory pointer to userspace. If a unique name is
needed, I think map_id should be used instead.
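
One non-leaking way to get a unique cache name is a simple counter; the
map_id suggested above would work the same way, assuming it is already
assigned when htab_map_alloc() runs (that ordering is not visible in this
patch). A rough sketch:

        static atomic_t bpf_htab_idx = ATOMIC_INIT(0); /* illustrative only */

        htab->slab_name = kasprintf(GFP_USER, "bpf_htab_%d",
                                    atomic_inc_return(&bpf_htab_idx));
        if (!htab->slab_name)
                goto free_buckets;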

> +       if (!htab->slab_name)
> +               goto free_buckets;
> +
> +       htab->elem_cache = kmem_cache_create(htab->slab_name,
> +                                            htab->elem_size, 0, 0, NULL);
> +       if (!htab->elem_cache)
> +               goto free_slab_name;
> +
> +       return &htab->map;
> +
> +free_slab_name:
> +       kfree(htab->slab_name);
> +free_buckets:
> +       kfree(htab->buckets);
> +free_htab:
> +       kfree(htab);
> +       return ERR_PTR(err);
> +}
> +
> +static inline u32 htab_map_hash(const void *key, u32 key_len)
> +{
> +       return jhash(key, key_len, 0);
> +}
> +
> +static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
> +{
> +       return &htab->buckets[hash & (htab->n_buckets - 1)];
> +}
> +
> +static struct htab_elem *lookup_elem_raw(struct hlist_head *head, u32 hash,
> +                                        void *key, u32 key_size)
> +{
> +       struct htab_elem *l;
> +
> +       hlist_for_each_entry_rcu(l, head, hash_node) {
> +               if (l->hash == hash && !memcmp(&l->key, key, key_size))
> +                       return l;
> +       }
> +       return NULL;
> +}
> +
> +/* Must be called with rcu_read_lock. */
> +static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
> +{
> +       struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +       struct hlist_head *head;
> +       struct htab_elem *l;
> +       u32 hash, key_size;
> +
> +       WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +       key_size = map->key_size;
> +
> +       hash = htab_map_hash(key, key_size);
> +
> +       head = select_bucket(htab, hash);
> +
> +       l = lookup_elem_raw(head, hash, key, key_size);
> +
> +       if (l)
> +               return l->key + round_up(map->key_size, 8);
> +       else
> +               return NULL;
> +}
> +
> +/* Must be called with rcu_read_lock. */
> +static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +       struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +       struct hlist_head *head;
> +       struct htab_elem *l, *next_l;
> +       u32 hash, key_size;
> +       int i;
> +
> +       WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +       key_size = map->key_size;
> +
> +       hash = htab_map_hash(key, key_size);
> +
> +       head = select_bucket(htab, hash);
> +
> +       /* lookup the key */
> +       l = lookup_elem_raw(head, hash, key, key_size);
> +
> +       if (!l) {
> +               i = 0;
> +               goto find_first_elem;
> +       }
> +
> +       /* key was found, get next key in the same bucket */
> +       next_l = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(&l->hash_node)),
> +                                 struct htab_elem, hash_node);
> +
> +       if (next_l) {
> +               /* if next elem in this hash list is non-zero, just return it */
> +               memcpy(next_key, next_l->key, key_size);
> +               return 0;
> +       } else {
> +               /* no more elements in this hash list, go to the next bucket */
> +               i = hash & (htab->n_buckets - 1);
> +               i++;
> +       }
> +
> +find_first_elem:
> +       /* iterate over buckets */
> +       for (; i < htab->n_buckets; i++) {
> +               head = select_bucket(htab, i);
> +
> +               /* pick first element in the bucket */
> +               next_l = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),
> +                                         struct htab_elem, hash_node);
> +               if (next_l) {
> +                       /* if it's not empty, just return it */
> +                       memcpy(next_key, next_l->key, key_size);
> +                       return 0;
> +               }
> +       }
> +
> +       /* iterated over all buckets and all elements */
> +       return -ENOENT;
> +}
> +
> +static struct htab_elem *htab_alloc_elem(struct bpf_htab *htab)
> +{
> +       void *l;
> +
> +       l = kmem_cache_alloc(htab->elem_cache, GFP_ATOMIC);
> +       if (!l)
> +               return ERR_PTR(-ENOMEM);
> +       return l;
> +}
> +
> +static void free_htab_elem_rcu(struct rcu_head *rcu)
> +{
> +       struct htab_elem *l = container_of(rcu, struct htab_elem, rcu);
> +
> +       kmem_cache_free(l->htab->elem_cache, l);
> +}
> +
> +static void release_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> +{
> +       l->htab = htab;
> +       call_rcu(&l->rcu, free_htab_elem_rcu);
> +}
> +
> +/* Must be called with rcu_read_lock. */
> +static int htab_map_update_elem(struct bpf_map *map, void *key, void *value)
> +{
> +       struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +       struct htab_elem *l_new, *l_old;
> +       struct hlist_head *head;
> +       u32 key_size;
> +
> +       WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +       l_new = htab_alloc_elem(htab);
> +       if (IS_ERR(l_new))
> +               return -ENOMEM;
> +
> +       key_size = map->key_size;
> +
> +       memcpy(l_new->key, key, key_size);
> +       memcpy(l_new->key + round_up(key_size, 8), value, map->value_size);
> +
> +       l_new->hash = htab_map_hash(l_new->key, key_size);
> +
> +       head = select_bucket(htab, l_new->hash);
> +
> +       l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
> +
> +       spin_lock_bh(&htab->lock);
> +       if (!l_old && unlikely(htab->count >= map->max_entries)) {
> +               /* if elem with this 'key' doesn't exist and we've reached
> +                * max_entries limit, fail insertion of new elem
> +                */
> +               spin_unlock_bh(&htab->lock);
> +               kmem_cache_free(htab->elem_cache, l_new);
> +               return -EFBIG;
> +       }
> +
> +       /* add new element to the head of the list, so that concurrent
> +        * search will find it before old elem
> +        */
> +       hlist_add_head_rcu(&l_new->hash_node, head);
> +       if (l_old) {
> +               hlist_del_rcu(&l_old->hash_node);
> +               release_htab_elem(htab, l_old);
> +       } else {
> +               htab->count++;
> +       }
> +       spin_unlock_bh(&htab->lock);
> +
> +       return 0;
> +}
> +
> +/* Must be called with rcu_read_lock. */
> +static int htab_map_delete_elem(struct bpf_map *map, void *key)
> +{
> +       struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +       struct htab_elem *l;
> +       struct hlist_head *head;
> +       u32 hash, key_size;
> +
> +       WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +       key_size = map->key_size;
> +
> +       hash = htab_map_hash(key, key_size);
> +
> +       head = select_bucket(htab, hash);
> +
> +       l = lookup_elem_raw(head, hash, key, key_size);
> +
> +       if (l) {
> +               spin_lock_bh(&htab->lock);
> +               hlist_del_rcu(&l->hash_node);
> +               htab->count--;
> +               release_htab_elem(htab, l);
> +               spin_unlock_bh(&htab->lock);
> +               return 0;
> +       }
> +       return -ESRCH;
> +}
> +
> +static void delete_all_elements(struct bpf_htab *htab)
> +{
> +       int i;
> +
> +       for (i = 0; i < htab->n_buckets; i++) {
> +               struct hlist_head *head = select_bucket(htab, i);
> +               struct hlist_node *n;
> +               struct htab_elem *l;
> +
> +               hlist_for_each_entry_safe(l, n, head, hash_node) {
> +                       hlist_del_rcu(&l->hash_node);
> +                       htab->count--;
> +                       kmem_cache_free(htab->elem_cache, l);
> +               }
> +       }
> +}
> +
> +/* called when map->refcnt goes to zero */
> +static void htab_map_free(struct bpf_map *map)
> +{
> +       struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +
> +       /* wait for all outstanding updates to complete */
> +       synchronize_rcu();
> +
> +       /* kmem_cache_free all htab elements */
> +       delete_all_elements(htab);
> +
> +       /* and destroy cache, which might sleep */
> +       kmem_cache_destroy(htab->elem_cache);
> +
> +       kfree(htab->buckets);
> +       kfree(htab->slab_name);
> +       kfree(htab);
> +}
> +
> +static struct bpf_map_ops htab_ops = {
> +       .map_alloc = htab_map_alloc,
> +       .map_free = htab_map_free,
> +       .map_get_next_key = htab_map_get_next_key,
> +       .map_lookup_elem = htab_map_lookup_elem,
> +       .map_update_elem = htab_map_update_elem,
> +       .map_delete_elem = htab_map_delete_elem,
> +};
> +
> +static struct bpf_map_type_list tl = {
> +       .ops = &htab_ops,
> +       .type = BPF_MAP_TYPE_HASH,
> +};
> +
> +static int __init register_htab_map(void)
> +{
> +       bpf_register_map_type(&tl);
> +       return 0;
> +}
> +late_initcall(register_htab_map);
> --
> 1.7.9.5
>

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 02/16] bpf: update MAINTAINERS entry
@ 2014-07-23 18:39         ` Kees Cook
  0 siblings, 0 replies; 62+ messages in thread
From: Kees Cook @ 2014-07-23 18:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Wed, Jul 23, 2014 at 10:48 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Wed, Jul 23, 2014 at 10:37 AM, Kees Cook <keescook@chromium.org> wrote:
>> On Thu, Jul 17, 2014 at 9:19 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>>> ---
>>>  MAINTAINERS |    7 +++++++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index ae8cd00215b2..32e24ff46da3 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -1912,6 +1912,13 @@ S:       Supported
>>>  F:     drivers/net/bonding/
>>>  F:     include/uapi/linux/if_bonding.h
>>>
>>> +BPF (Safe dynamic programs and tools)
>>
>> bikeshed: I feel like this shouldn't be an acronym. Maybe instead:
>>
>> BERKELEY PACKET FILTER (BPF: Safe dynamic programs and tools)
>
> pile on :)
>
> I think eBPF is no longer an acronym. The 'e' stands for 'extended',
> but BPF is no longer 'packet filter' only and definitely not 'berkeley'.
> So I'd rather keep BPF as a magic abbreviation without spelling it out,
> since the full name is historic and no longer meaningful.
> I've considered coming up with a brand new abbreviation and full name
> for this instruction set, but none looked good and all lose in comparison
> to the 'eBPF' name, which is concise and carries enough historical
> references to explain the idea behind the new ISA.

Yeah, that's a fair point. No sense in using "BEE PEE EFF" :)

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 09/16] bpf: expand BPF syscall with program load/unload
@ 2014-07-23 19:00     ` Kees Cook
  0 siblings, 0 replies; 62+ messages in thread
From: Kees Cook @ 2014-07-23 19:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Thu, Jul 17, 2014 at 9:19 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> eBPF programs are safe run-to-completion functions with load/unload
> methods from userspace similar to kernel modules.
>
> User space API:
>
> - load eBPF program
>   fd = bpf_prog_load(bpf_prog_type, struct nlattr *prog, int len)
>
>   where 'prog' is a sequence of sections (TEXT, LICENSE, MAP_ASSOC)
>   TEXT - array of eBPF instructions
>   LICENSE - must be GPL compatible to call helper functions marked gpl_only
>   MAP_FIXUP - array of {insn idx, map fd} used by kernel to adjust
>   imm constants in 'mov' instructions used to access maps

Nit: naming mismatch between MAP_ASSOC vs MAP_FIXUP.
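
For orientation, a user-space load along the lines of the API described
in the changelog would look roughly like this (attribute construction is
elided; prog_attrs/prog_attrs_len are placeholders, and
BPF_PROG_TYPE_UNSPEC is the only program type defined at this point in
the series):

        /* prog_attrs: nlattr stream carrying BPF_PROG_TEXT, BPF_PROG_LICENSE
         * and, if maps are used, BPF_PROG_MAP_FIXUP
         */
        int prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, BPF_PROG_TYPE_UNSPEC,
                              prog_attrs, prog_attrs_len);
        if (prog_fd < 0)
                return prog_fd;     /* rejected by the loader/verifier */

        /* ... attach prog_fd to a socket or tracing event ... */

        close(prog_fd);             /* unloads the program */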

>
> - unload eBPF program
>   close(fd)
>
> User space example of syscall(__NR_bpf, BPF_PROG_LOAD, prog_type, ...)
> follows in later patches
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>  include/linux/bpf.h      |   33 +++++
>  include/linux/filter.h   |    9 +-
>  include/uapi/linux/bpf.h |   29 +++++
>  kernel/bpf/core.c        |    5 +-
>  kernel/bpf/syscall.c     |  309 ++++++++++++++++++++++++++++++++++++++++++++++
>  net/core/filter.c        |    9 +-
>  6 files changed, 388 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 91e2caf8edf9..4967619595cc 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -46,4 +46,37 @@ struct bpf_map_type_list {
>  void bpf_register_map_type(struct bpf_map_type_list *tl);
>  struct bpf_map *bpf_map_get(u32 map_id);
>
> +/* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
> + * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL
> + * instructions after verifying
> + */
> +struct bpf_func_proto {
> +       u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> +       bool gpl_only;
> +};
> +
> +struct bpf_verifier_ops {
> +       /* return eBPF function prototype for verification */
> +       const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
> +};
> +
> +struct bpf_prog_type_list {
> +       struct list_head list_node;
> +       struct bpf_verifier_ops *ops;
> +       enum bpf_prog_type type;
> +};
> +
> +void bpf_register_prog_type(struct bpf_prog_type_list *tl);
> +
> +struct bpf_prog_info {
> +       bool is_gpl_compatible;
> +       enum bpf_prog_type prog_type;
> +       struct bpf_verifier_ops *ops;
> +       u32 *used_maps;
> +       u32 used_map_cnt;
> +};
> +
> +void free_bpf_prog_info(struct bpf_prog_info *info);
> +struct sk_filter *bpf_prog_get(u32 ufd);
> +
>  #endif /* _LINUX_BPF_H */
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index b43ad6a2b3cf..822b310e75e1 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -30,12 +30,17 @@ struct sock_fprog_kern {
>  struct sk_buff;
>  struct sock;
>  struct seccomp_data;
> +struct bpf_prog_info;
>
>  struct sk_filter {
>         atomic_t                refcnt;
>         u32                     jited:1,        /* Is our filter JIT'ed? */
> -                               len:31;         /* Number of filter blocks */
> -       struct sock_fprog_kern  *orig_prog;     /* Original BPF program */
> +                               ebpf:1,         /* Is it eBPF program ? */
> +                               len:30;         /* Number of filter blocks */
> +       union {
> +               struct sock_fprog_kern  *orig_prog;     /* Original BPF program */
> +               struct bpf_prog_info    *info;
> +       };
>         struct rcu_head         rcu;
>         unsigned int            (*bpf_func)(const struct sk_buff *skb,
>                                             const struct bpf_insn *filter);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 3ea11ba053a8..06ba71b49f64 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -333,6 +333,13 @@ enum bpf_cmd {
>          * returns zero and stores next key or negative error
>          */
>         BPF_MAP_GET_NEXT_KEY,
> +
> +       /* verify and load eBPF program
> +        * prog_id = bpf_prog_load(bpf_prog_type, struct nlattr *prog, int len)
> +        * prog is a sequence of sections
> +        * returns fd or negative error
> +        */
> +       BPF_PROG_LOAD,
>  };
>
>  enum bpf_map_attributes {
> @@ -350,4 +357,26 @@ enum bpf_map_type {
>         BPF_MAP_TYPE_HASH,
>  };
>
> +enum bpf_prog_attributes {
> +       BPF_PROG_UNSPEC,
> +       BPF_PROG_TEXT,          /* array of eBPF instructions */
> +       BPF_PROG_LICENSE,       /* license string */
> +       BPF_PROG_MAP_FIXUP,     /* array of {insn idx, map fd} to fixup insns */
> +       __BPF_PROG_ATTR_MAX,
> +};
> +#define BPF_PROG_ATTR_MAX (__BPF_PROG_ATTR_MAX - 1)
> +#define BPF_PROG_MAX_ATTR_SIZE 65535
> +
> +enum bpf_prog_type {
> +       BPF_PROG_TYPE_UNSPEC,
> +};
> +
> +/* integer value in 'imm' field of BPF_CALL instruction selects which helper
> + * function eBPF program intends to call
> + */
> +enum bpf_func_id {
> +       BPF_FUNC_unspec,
> +       __BPF_FUNC_MAX_ID,
> +};
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 265a02cc822d..e65ecdc36358 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -23,6 +23,7 @@
>  #include <linux/filter.h>
>  #include <linux/skbuff.h>
>  #include <asm/unaligned.h>
> +#include <linux/bpf.h>
>
>  /* Registers */
>  #define BPF_R0 regs[BPF_REG_0]
> @@ -528,9 +529,11 @@ void sk_filter_select_runtime(struct sk_filter *fp)
>  }
>  EXPORT_SYMBOL_GPL(sk_filter_select_runtime);
>
> -/* free internal BPF program */
> +/* free internal BPF program, called after RCU grace period */
>  void sk_filter_free(struct sk_filter *fp)
>  {
> +       if (fp->ebpf)
> +               free_bpf_prog_info(fp->info);
>         bpf_jit_free(fp);
>  }
>  EXPORT_SYMBOL_GPL(sk_filter_free);
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index ca2be66845b3..9e45ca6b6937 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -14,6 +14,8 @@
>  #include <net/netlink.h>
>  #include <linux/anon_inodes.h>
>  #include <linux/file.h>
> +#include <linux/license.h>
> +#include <linux/filter.h>
>
>  /* mutex to protect insertion/deletion of map_id in IDR */
>  static DEFINE_MUTEX(bpf_map_lock);
> @@ -406,6 +408,310 @@ err_unlock:
>         return err;
>  }
>
> +static LIST_HEAD(bpf_prog_types);
> +
> +static int find_prog_type(enum bpf_prog_type type, struct sk_filter *prog)
> +{
> +       struct bpf_prog_type_list *tl;
> +
> +       list_for_each_entry(tl, &bpf_prog_types, list_node) {
> +               if (tl->type == type) {
> +                       prog->info->ops = tl->ops;
> +                       prog->info->prog_type = type;
> +                       return 0;
> +               }
> +       }
> +       return -EINVAL;
> +}
> +
> +void bpf_register_prog_type(struct bpf_prog_type_list *tl)
> +{
> +       list_add(&tl->list_node, &bpf_prog_types);
> +}
> +
> +/* fixup insn->imm field of bpf_call instructions:
> + * if (insn->imm == BPF_FUNC_map_lookup_elem)
> + *      insn->imm = bpf_map_lookup_elem - __bpf_call_base;
> + * else if (insn->imm == BPF_FUNC_map_update_elem)
> + *      insn->imm = bpf_map_update_elem - __bpf_call_base;
> + * else ...
> + *
> + * this function is called after eBPF program passed verification
> + */
> +static void fixup_bpf_calls(struct sk_filter *prog)
> +{
> +       const struct bpf_func_proto *fn;
> +       int i;
> +
> +       for (i = 0; i < prog->len; i++) {
> +               struct bpf_insn *insn = &prog->insnsi[i];
> +
> +               if (insn->code == (BPF_JMP | BPF_CALL)) {
> +                       /* we reach here when program has bpf_call instructions
> +                        * and it passed bpf_check(), means that
> +                        * ops->get_func_proto must have been supplied, check it
> +                        */
> +                       BUG_ON(!prog->info->ops->get_func_proto);
> +
> +                       fn = prog->info->ops->get_func_proto(insn->imm);
> +                       /* all functions that have prototype and verifier allowed
> +                        * programs to call them, must be real in-kernel functions
> +                        */
> +                       BUG_ON(!fn->func);
> +                       insn->imm = fn->func - __bpf_call_base;
> +               }
> +       }
> +}
> +
> +/* fixup instructions that are using map_ids:
> + *
> + * BPF_MOV64_IMM(BPF_REG_1, MAP_ID), // r1 = MAP_ID
> + * BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> + *
> + * in the 1st insn kernel replaces MAP_ID with global map_id,
> + * since programs are executing out of different contexts and must use
> + * globally visible ids to access maps
> + *
> + * map_fixup is an array of pairs {insn idx, map ufd}
> + *
> + * kernel resolves ufd -> global map_id and adjusts eBPF instructions
> + */
> +static int fixup_bpf_map_id(struct sk_filter *prog, struct nlattr *map_fixup)
> +{
> +       struct {
> +               u32 insn_idx;
> +               u32 ufd;
> +       } *fixup = nla_data(map_fixup);
> +       int fixup_len = nla_len(map_fixup) / sizeof(*fixup);
> +       struct bpf_insn *insn;
> +       struct fd f;
> +       u32 idx;
> +       int i, map_id;
> +
> +       if (fixup_len <= 0)
> +               return -EINVAL;
> +
> +       for (i = 0; i < fixup_len; i++) {
> +               idx = fixup[i].insn_idx;
> +               if (idx >= prog->len)
> +                       return -EINVAL;
> +
> +               insn = &prog->insnsi[idx];
> +               if (insn->code != (BPF_ALU64 | BPF_MOV | BPF_K) &&
> +                   insn->code != (BPF_ALU | BPF_MOV | BPF_K))
> +                       return -EINVAL;
> +
> +               f = fdget(fixup[i].ufd);
> +
> +               map_id = get_map_id(f);
> +
> +               if (map_id < 0)
> +                       return map_id;
> +
> +               insn->imm = map_id;
> +               fdput(f);

It looks like there's potentially a race risk of a map_id changing out
from under a running program, between the call to fixup_bpf_map_id()
and the bpf_map_get() calls during bpf_prog_load() below...

> +       }
> +       return 0;
> +}
> +
> +/* free eBPF program auxiliary data, called after rcu grace period,
> + * so it's safe to drop refcnt on maps used by this program
> + *
> + * called from sk_filter_release()->sk_filter_release_rcu()->sk_filter_free()
> + */
> +void free_bpf_prog_info(struct bpf_prog_info *info)
> +{
> +       bool found;
> +       int i;
> +
> +       for (i = 0; i < info->used_map_cnt; i++) {
> +               found = bpf_map_put(info->used_maps[i]);
> +               /* all maps that this program was using should obviously still
> +                * be there
> +                */
> +               BUG_ON(!found);
> +       }
> +       kfree(info);
> +}
> +
> +static int bpf_prog_release(struct inode *inode, struct file *filp)
> +{
> +       struct sk_filter *prog = filp->private_data;
> +
> +       sk_unattached_filter_destroy(prog);
> +       return 0;
> +}
> +
> +static const struct file_operations bpf_prog_fops = {
> +        .release = bpf_prog_release,
> +};
> +
> +static const struct nla_policy prog_policy[BPF_PROG_ATTR_MAX + 1] = {
> +       [BPF_PROG_TEXT]      = { .type = NLA_BINARY },
> +       [BPF_PROG_LICENSE]   = { .type = NLA_NUL_STRING },
> +       [BPF_PROG_MAP_FIXUP] = { .type = NLA_BINARY },
> +};
> +
> +static int bpf_prog_load(enum bpf_prog_type type, struct nlattr __user *uattr,
> +                        int len)
> +{
> +       struct nlattr *tb[BPF_PROG_ATTR_MAX + 1];
> +       struct sk_filter *prog;
> +       struct bpf_map *map;
> +       struct nlattr *attr;
> +       size_t insn_len;
> +       int err, i;
> +       bool is_gpl;
> +
> +       if (len <= 0 || len > BPF_PROG_MAX_ATTR_SIZE)
> +               return -EINVAL;
> +
> +       attr = kmalloc(len, GFP_USER);
> +       if (!attr)
> +               return -ENOMEM;
> +
> +       /* copy eBPF program from user space */
> +       err = -EFAULT;
> +       if (copy_from_user(attr, uattr, len) != 0)
> +               goto free_attr;
> +
> +       /* perform basic validation */
> +       err = nla_parse(tb, BPF_PROG_ATTR_MAX, attr, len, prog_policy);
> +       if (err < 0)
> +               goto free_attr;
> +
> +       err = -EINVAL;
> +       /* look for mandatory license string */
> +       if (!tb[BPF_PROG_LICENSE])
> +               goto free_attr;
> +
> +       /* eBPF programs must be GPL compatible to use GPL-ed functions */
> +       is_gpl = license_is_gpl_compatible(nla_data(tb[BPF_PROG_LICENSE]));
> +
> +       /* look for mandatory array of eBPF instructions */
> +       if (!tb[BPF_PROG_TEXT])
> +               goto free_attr;
> +
> +       insn_len = nla_len(tb[BPF_PROG_TEXT]);
> +       if (insn_len % sizeof(struct bpf_insn) != 0 || insn_len <= 0)
> +               goto free_attr;
> +
> +       /* plain sk_filter allocation */
> +       err = -ENOMEM;
> +       prog = kmalloc(sk_filter_size(insn_len), GFP_USER);
> +       if (!prog)
> +               goto free_attr;
> +
> +       prog->len = insn_len / sizeof(struct bpf_insn);
> +       memcpy(prog->insns, nla_data(tb[BPF_PROG_TEXT]), insn_len);
> +       prog->orig_prog = NULL;
> +       prog->jited = 0;
> +       prog->ebpf = 0;
> +       atomic_set(&prog->refcnt, 1);
> +
> +       if (tb[BPF_PROG_MAP_FIXUP]) {
> +               /* if program is using maps, fixup map_ids */
> +               err = fixup_bpf_map_id(prog, tb[BPF_PROG_MAP_FIXUP]);
> +               if (err < 0)
> +                       goto free_prog;
> +       }
> +
> +       /* allocate eBPF related auxiliary data */
> +       prog->info = kzalloc(sizeof(struct bpf_prog_info), GFP_USER);
> +       if (!prog->info)
> +               goto free_prog;
> +       prog->ebpf = 1;
> +       prog->info->is_gpl_compatible = is_gpl;
> +
> +       /* find program type: socket_filter vs tracing_filter */
> +       err = find_prog_type(type, prog);
> +       if (err < 0)
> +               goto free_prog;
> +
> +       /* lock maps to prevent any changes to maps, since eBPF program may
> +        * use them. In such case bpf_check() will populate prog->used_maps
> +        */
> +       mutex_lock(&bpf_map_lock);
> +
> +       /* run eBPF verifier */
> +       /* err = bpf_check(prog); */
> +
> +       if (err == 0 && prog->info->used_maps) {
> +               /* program passed verifier and it's using some maps,
> +                * hold them
> +                */
> +               for (i = 0; i < prog->info->used_map_cnt; i++) {
> +                       map = bpf_map_get(prog->info->used_maps[i]);
> +                       BUG_ON(!map);
> +                       atomic_inc(&map->refcnt);
> +               }
> +       }
> +       mutex_unlock(&bpf_map_lock);

As mentioned above, I think fixup_bpf_map_id needs to be done under
the map_lock mutex, unless I'm misunderstanding something in the
object lifetime.
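
One possible shape of that change, moving the MAP_FIXUP handling down
into the section that already holds bpf_map_lock so the resolved ids
cannot go stale before the refs are taken (verifier call left stubbed
out as in the patch; sketch only):

        mutex_lock(&bpf_map_lock);

        if (tb[BPF_PROG_MAP_FIXUP]) {
                /* resolve map fds to map_ids while no map can be
                 * created or destroyed
                 */
                err = fixup_bpf_map_id(prog, tb[BPF_PROG_MAP_FIXUP]);
                if (err < 0)
                        goto unlock;
        }

        /* err = bpf_check(prog); */

        if (err == 0 && prog->info->used_maps) {
                for (i = 0; i < prog->info->used_map_cnt; i++) {
                        map = bpf_map_get(prog->info->used_maps[i]);
                        BUG_ON(!map);
                        atomic_inc(&map->refcnt);
                }
        }
unlock:
        mutex_unlock(&bpf_map_lock);

        if (err < 0)
                goto free_prog;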

> +
> +       if (err < 0)
> +               goto free_prog;
> +
> +       /* fixup BPF_CALL->imm field */
> +       fixup_bpf_calls(prog);
> +
> +       /* eBPF program is ready to be JITed */
> +       sk_filter_select_runtime(prog);
> +
> +       err = anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog, O_RDWR | O_CLOEXEC);
> +
> +       if (err < 0)
> +               /* failed to allocate fd */
> +               goto free_prog;
> +
> +       /* user supplied eBPF prog attributes are no longer needed */
> +       kfree(attr);
> +
> +       return err;
> +free_prog:
> +       sk_filter_free(prog);
> +free_attr:
> +       kfree(attr);
> +       return err;
> +}
> +
> +static struct sk_filter *get_prog(struct fd f)
> +{
> +       struct sk_filter *prog;
> +
> +       if (!f.file)
> +               return ERR_PTR(-EBADF);
> +
> +       if (f.file->f_op != &bpf_prog_fops) {
> +               fdput(f);
> +               return ERR_PTR(-EINVAL);
> +       }
> +
> +       prog = f.file->private_data;
> +
> +       return prog;
> +}
> +
> +/* called from sk_attach_filter_ebpf() or from tracing filter attach
> + * pairs with
> + * sk_detach_filter()->sk_filter_uncharge()->sk_filter_release()
> + * or with
> + * sk_unattached_filter_destroy()->sk_filter_release()
> + */
> +struct sk_filter *bpf_prog_get(u32 ufd)
> +{
> +       struct fd f = fdget(ufd);
> +       struct sk_filter *prog;
> +
> +       prog = get_prog(f);
> +
> +       if (IS_ERR(prog))
> +               return prog;
> +
> +       atomic_inc(&prog->refcnt);
> +       fdput(f);
> +       return prog;
> +}
> +
>  SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
>                 unsigned long, arg4, unsigned long, arg5)
>  {
> @@ -428,6 +734,9 @@ SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
>         case BPF_MAP_GET_NEXT_KEY:
>                 return map_get_next_key((int) arg2, (void __user *) arg3,
>                                         (void __user *) arg4);
> +       case BPF_PROG_LOAD:
> +               return bpf_prog_load((enum bpf_prog_type) arg2,
> +                                    (struct nlattr __user *) arg3, (int) arg4);
>         default:
>                 return -EINVAL;
>         }
> diff --git a/net/core/filter.c b/net/core/filter.c
> index f3b2d5e9fe5f..255dba1bb678 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -835,7 +835,7 @@ static void sk_release_orig_filter(struct sk_filter *fp)
>  {
>         struct sock_fprog_kern *fprog = fp->orig_prog;
>
> -       if (fprog) {
> +       if (!fp->ebpf && fprog) {
>                 kfree(fprog->filter);
>                 kfree(fprog);
>         }
> @@ -867,14 +867,16 @@ static void sk_filter_release(struct sk_filter *fp)
>
>  void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
>  {
> -       atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
> +       if (!fp->ebpf)
> +               atomic_sub(sk_filter_size(fp->len), &sk->sk_omem_alloc);
>         sk_filter_release(fp);
>  }
>
>  void sk_filter_charge(struct sock *sk, struct sk_filter *fp)
>  {
>         atomic_inc(&fp->refcnt);
> -       atomic_add(sk_filter_size(fp->len), &sk->sk_omem_alloc);
> +       if (!fp->ebpf)
> +               atomic_add(sk_filter_size(fp->len), &sk->sk_omem_alloc);
>  }
>
>  static struct sk_filter *__sk_migrate_realloc(struct sk_filter *fp,
> @@ -978,6 +980,7 @@ static struct sk_filter *__sk_prepare_filter(struct sk_filter *fp,
>
>         fp->bpf_func = NULL;
>         fp->jited = 0;
> +       fp->ebpf = 0;
>
>         err = sk_chk_filter(fp->insns, fp->len);
>         if (err) {
> --
> 1.7.9.5
>

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 05/16] bpf: introduce syscall(BPF, ...) and BPF maps
  2014-07-23 18:02     ` Kees Cook
@ 2014-07-23 19:30     ` Alexei Starovoitov
  -1 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-23 19:30 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Wed, Jul 23, 2014 at 11:02 AM, Kees Cook <keescook@chromium.org> wrote:
>> --- a/Documentation/networking/filter.txt
>> +++ b/Documentation/networking/filter.txt
>>
>> +eBPF maps
>> +---------
>> +'maps' is a generic storage of different types for sharing data between kernel
>> +and userspace.
>> +
>> +The maps are accessed from user space via BPF syscall, which has commands:
>> +- create a map with given id, type and attributes
>> +  map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
>> +  returns positive map id or negative error
>
> Looks like these docs need updating for the fd-based approach instead
> of the map_id approach?

ohh, yes. updated it in srcs and in commit log, but forgot in docs.

>> +SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
>> +               unsigned long, arg4, unsigned long, arg5)
>> +{
>> +       if (!capable(CAP_SYS_ADMIN))
>> +               return -EPERM;
>
> It might be valuable to have a comment here describing why this is
> currently limited to CAP_SYS_ADMIN.

makes sense.
There are several reasons it should be limited to root initially:
- to phase changes in gradually
- verifier is not detecting pointer leaks yet
- full security audit wasn't performed
- tracing and network analytics are root only anyway
Currently eBPF is safe (non-crashing), since safety is relatively easy
to enforce by static analysis. For somebody with a compiler background
it's natural to think about bounds, alignment, uninitialized access, etc.
So I'm confident that I didn't miss anything big in the 'safety' aspect.
'Non-root security' is harder. I'll add pointer leak detection first
and will ask for more suggestions.
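Something along these lines maybe (comment wording is just a first draft):

        /* eBPF programs and maps are root-only for now:
         * - roll the feature out gradually
         * - the verifier does not yet detect pointer leaks to user space
         * - no full security audit has been done yet
         * - tracing and network analytics users are root anyway
         * This can be relaxed once pointer leak detection is in place.
         */
        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;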

>> +       switch (cmd) {
>> +       case BPF_MAP_CREATE:
>> +               return map_create((enum bpf_map_type) arg2,
>> +                                 (struct nlattr __user *) arg3, (int) arg4);
>
> I'd recommend requiring arg5 == 0 here, just for future flexibility.

Though I expect all extensions to go through nlattr attributes,
it's indeed cleaner to enforce arg5==0 here and for all other cmds.
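i.e. something like this before the switch (sketch):

        /* arg5 is unused for now; require zero so it can be given
         * a meaning later without breaking existing users
         */
        if (arg5 != 0)
                return -EINVAL;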

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 07/16] bpf: add lookup/update/delete/iterate methods to BPF maps
  2014-07-23 18:25   ` Kees Cook
@ 2014-07-23 19:49     ` Alexei Starovoitov
  2014-07-23 20:25       ` Kees Cook
  0 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-23 19:49 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Wed, Jul 23, 2014 at 11:25 AM, Kees Cook <keescook@chromium.org> wrote:
>> +
>> +       /* lookup key in a given map referenced by map_id
>> +        * err = bpf_map_lookup_elem(int map_id, void *key, void *value)
>
> This needs map_id documentation updates too?

yes. will grep for it just to make sure.

>> +static int get_map_id(struct fd f)
>> +{
>> +       struct bpf_map *map;
>> +
>> +       if (!f.file)
>> +               return -EBADF;
>> +
>> +       if (f.file->f_op != &bpf_map_fops) {
>> +               fdput(f);
>
> It feels weird to me to do the fdput inside this function. Instead,
> should map_lookup_elem get a "err_put" label, instead?

I don't think it will work, since I'm not sure that fd.flags will be zero
when fd.file == NULL. It looks that way from the return code paths
in fs/file.c, but I wasn't sure that I followed all of them,
so I just picked this style from fs/timerfd.c, assuming it was
done this way on purpose and there can be a case where
fd.file == NULL and fd.flags != 0. In that case we cannot call fdput().
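So the intended calling pattern is roughly this (sketch based on how
map_lookup_elem() uses the helper; variable names are illustrative):

        struct fd f = fdget(ufd);
        int map_id = get_map_id(f);

        if (map_id < 0)
                /* on error get_map_id() already did fdput() when there
                 * was a file to put, so just return
                 */
                return map_id;

        /* ... use the map ... */
        fdput(f);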

>> +       err = -EFAULT;
>> +       if (copy_to_user(uvalue, value, map->value_size) != 0)
>> +               goto free_key;
>
> I'm made uncomfortable with memory copying where explicit lengths from
> userspace aren't being used. It does look like it would be redundant,
> though. Are there other syscalls where the kernel may stomp on user
> memory based on internal kernel sizes? I think this is fine as-is, but
> it makes me want to think harder about it. :)

good question :)
key_size and value_size are passed initially from user space.
Kernel only verifies and allocates internal map elements with given
sizes. Then it copies the value back with the size it remembered.
If user space said at map creation time that value_size is 100,
it should be using it consistently in the user space program.

>> +       err = -ENOMEM;
>> +       next_key = kmalloc(map->key_size, GFP_ATOMIC);
>
> In the interests of defensiveness, I'd use kzalloc here.

I think it would be overkill. The map implementation must consume
all bytes of the incoming 'key' and return exactly the same number
of bytes in 'next_key'. Otherwise the whole iteration over the map
with 'get_next_key' won't work. So if a map implementation is
broken, it will be seen right away. No security leak here :)
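For completeness, user space iteration is expected to look roughly like
this (hypothetical wrappers around the syscall; KEY_SIZE and VALUE_SIZE
must match what was passed at map creation time):

        char key[KEY_SIZE], next_key[KEY_SIZE], value[VALUE_SIZE];

        /* 'key' holds some element that is known to exist */
        while (bpf_map_get_next_key(map_fd, key, next_key) == 0) {
                if (bpf_map_lookup_elem(map_fd, next_key, value) == 0)
                        /* process next_key/value */;
                memcpy(key, next_key, sizeof(key));
        }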

>> +       case BPF_MAP_GET_NEXT_KEY:
>> +               return map_get_next_key((int) arg2, (void __user *) arg3,
>> +                                       (void __user *) arg4);
>
> Same observation as the other syscall cmd: perhaps arg5 == 0 should be
> checked? Also, since each of these functions looks up the fd and

yes. will do.

> builds the key, maybe those should be added to a common helper instead
> of copy/pasting into each demuxed function?

well, get_map_id() is a common helper. I didn't move fdget() all
the way up to the switch statement, since it looks less readable.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 08/16] bpf: add hashtable type of BPF maps
@ 2014-07-23 19:57       ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-23 19:57 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Wed, Jul 23, 2014 at 11:36 AM, Kees Cook <keescook@chromium.org> wrote:
>> +static struct bpf_map *htab_map_alloc(struct nlattr *attr[BPF_MAP_ATTR_MAX + 1])
>> +{
>> +       struct bpf_htab *htab;
>> +       int err, i;
>> +
>> +       htab = kmalloc(sizeof(*htab), GFP_USER);
>
> I'd prefer kzalloc here.

in this case I agree. will change, since it's not in the critical path and we
can afford to waste a few cycles zeroing memory.

>> +       err = -ENOMEM;
>> +       htab->buckets = kmalloc(htab->n_buckets * sizeof(struct hlist_head),
>> +                               GFP_USER);
>
> I'd prefer kcalloc here, even though n_buckets can't currently trigger
> an integer overflow.

hmm, I would argue that kmalloc_array is the preferred way here, but kcalloc?
A few lines below the whole array is initialized with INIT_HLIST_HEAD...
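i.e. roughly:

        /* overflow-checked allocation; no zeroing needed since every
         * bucket is initialized in the loop right below
         */
        htab->buckets = kmalloc_array(htab->n_buckets,
                                      sizeof(struct hlist_head), GFP_USER);
        if (!htab->buckets)
                goto free_htab;         /* error label name is illustrative */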

>> +       for (i = 0; i < htab->n_buckets; i++)
>> +               INIT_HLIST_HEAD(&htab->buckets[i]);

>> +       htab->slab_name = kasprintf(GFP_USER, "bpf_htab_%p", htab);
>
> This leaks a kernel heap memory pointer to userspace. If a unique name
> needed, I think map_id should be used instead.

it leaks, how? slabinfo is only available to root.
The same code exists in conntrack:
net/netfilter/nf_conntrack_core.c:1767

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 09/16] bpf: expand BPF syscall with program load/unload
@ 2014-07-23 20:22       ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-23 20:22 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Wed, Jul 23, 2014 at 12:00 PM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Jul 17, 2014 at 9:19 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> eBPF programs are safe run-to-completion functions with load/unload
>> methods from userspace similar to kernel modules.
>>
>> User space API:
>>
>> - load eBPF program
>>   fd = bpf_prog_load(bpf_prog_type, struct nlattr *prog, int len)
>>
>>   where 'prog' is a sequence of sections (TEXT, LICENSE, MAP_ASSOC)
>>   TEXT - array of eBPF instructions
>>   LICENSE - must be GPL compatible to call helper functions marked gpl_only
>>   MAP_FIXUP - array of {insn idx, map fd} used by kernel to adjust
>>   imm constants in 'mov' instructions used to access maps
>
> Nit: naming mismatch between MAP_ASSOC vs MAP_FIXUP.

ohh yes, I used the map_assoc name initially and forgot to rename it
in the commit log. will fix.

>> + */
>> +static int fixup_bpf_map_id(struct sk_filter *prog, struct nlattr *map_fixup)
>> +{
>> +       struct {
>> +               u32 insn_idx;
>> +               u32 ufd;
>> +       } *fixup = nla_data(map_fixup);
>> +       int fixup_len = nla_len(map_fixup) / sizeof(*fixup);
>> +       struct bpf_insn *insn;
>> +       struct fd f;
>> +       u32 idx;
>> +       int i, map_id;
>> +
>> +       if (fixup_len <= 0)
>> +               return -EINVAL;
>> +
>> +       for (i = 0; i < fixup_len; i++) {
>> +               idx = fixup[i].insn_idx;
>> +               if (idx >= prog->len)
>> +                       return -EINVAL;
>> +
>> +               insn = &prog->insnsi[idx];
>> +               if (insn->code != (BPF_ALU64 | BPF_MOV | BPF_K) &&
>> +                   insn->code != (BPF_ALU | BPF_MOV | BPF_K))
>> +                       return -EINVAL;
>> +
>> +               f = fdget(fixup[i].ufd);
>> +
>> +               map_id = get_map_id(f);
>> +
>> +               if (map_id < 0)
>> +                       return map_id;
>> +
>> +               insn->imm = map_id;
>> +               fdput(f);
>
> It looks like there's potentially a race risk of a map_id changing out
> from under a running program? Between the call to fixup_bpf_map_id()
> and the bpf_map_get() calls during bpf_prog_load() below...

Excellent question!
If user space created a bunch of maps and has another thread
that closes fds (and may be creating new maps) while the main thread is
doing syscall(prog_load,...), then the map_ids stored inside instructions
can become stale by the time bpf_check() is called. In that case
bpf_check() will reject the program: either it will find an unknown map_id
or the map will have invalid key/value ranges for the program to access.
So this is not an issue, but I agree the code is not obviously correct.
My bad for allowing such a subtle race.
I'll increase the mutex_lock() range.
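Roughly, the load path should become (untested sketch, the label name is
made up):

        /* take the lock before translating fds into map_ids, so the ids
         * stored in the instructions cannot be recycled before bpf_check()
         * takes references on the maps
         */
        mutex_lock(&bpf_map_lock);

        if (tb[BPF_PROG_MAP_FIXUP]) {
                err = fixup_bpf_map_id(prog, tb[BPF_PROG_MAP_FIXUP]);
                if (err < 0)
                        goto unlock;
        }

        err = bpf_check(prog);
        /* ... take refs on prog->info->used_maps ... */
unlock:
        mutex_unlock(&bpf_map_lock);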

>> +       if (tb[BPF_PROG_MAP_FIXUP]) {
>> +               /* if program is using maps, fixup map_ids */
>> +               err = fixup_bpf_map_id(prog, tb[BPF_PROG_MAP_FIXUP]);
>> +               if (err < 0)
>> +                       goto free_prog;
>> +       }
>> +
>> +       /* allocate eBPF related auxiliary data */
>> +       prog->info = kzalloc(sizeof(struct bpf_prog_info), GFP_USER);
>> +       if (!prog->info)
>> +               goto free_prog;
>> +       prog->ebpf = 1;
>> +       prog->info->is_gpl_compatible = is_gpl;
>> +
>> +       /* find program type: socket_filter vs tracing_filter */
>> +       err = find_prog_type(type, prog);
>> +       if (err < 0)
>> +               goto free_prog;
>> +
>> +       /* lock maps to prevent any changes to maps, since eBPF program may
>> +        * use them. In such case bpf_check() will populate prog->used_maps
>> +        */
>> +       mutex_lock(&bpf_map_lock);
>> +
>> +       /* run eBPF verifier */
>> +       /* err = bpf_check(prog); */
>> +
>> +       if (err == 0 && prog->info->used_maps) {
>> +               /* program passed verifier and it's using some maps,
>> +                * hold them
>> +                */
>> +               for (i = 0; i < prog->info->used_map_cnt; i++) {
>> +                       map = bpf_map_get(prog->info->used_maps[i]);
>> +                       BUG_ON(!map);
>> +                       atomic_inc(&map->refcnt);
>> +               }
>> +       }
>> +       mutex_unlock(&bpf_map_lock);
>
> As mentioned above, I think fixup_bpf_map_id needs to be done under
> the map_lock mutex, unless I'm misunderstanding something in the
> object lifetime.

yes. Excellent point. Will increase the range.

Thank you so much for the review!

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 07/16] bpf: add lookup/update/delete/iterate methods to BPF maps
  2014-07-23 19:49     ` Alexei Starovoitov
@ 2014-07-23 20:25       ` Kees Cook
  2014-07-23 21:22         ` Alexei Starovoitov
  0 siblings, 1 reply; 62+ messages in thread
From: Kees Cook @ 2014-07-23 20:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Wed, Jul 23, 2014 at 12:49 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Wed, Jul 23, 2014 at 11:25 AM, Kees Cook <keescook@chromium.org> wrote:
>>> +
>>> +       /* lookup key in a given map referenced by map_id
>>> +        * err = bpf_map_lookup_elem(int map_id, void *key, void *value)
>>
>> This needs map_id documentation updates too?
>
> yes. will grep for it just to make sure.
>
>>> +static int get_map_id(struct fd f)
>>> +{
>>> +       struct bpf_map *map;
>>> +
>>> +       if (!f.file)
>>> +               return -EBADF;
>>> +
>>> +       if (f.file->f_op != &bpf_map_fops) {
>>> +               fdput(f);
>>
>> It feels weird to me to do the fdput inside this function. Instead,
>> should map_lookup_elem get a "err_put" label, instead?
>
> I don't think it will work, since I'm not sure that fd.flags will be zero
> when fd.file == NULL. It looks that way from the return code paths
> in fs/file.c, but I wasn't sure that I followed all of them,
> so I just picked this style from fs/timerfd.c, assuming it was
> done this way on purpose and there can be a case where
> fd.file == NULL and fd.flags != 0. In that case we cannot call fdput().

Yeah, hm, looking around, this does seem to be the case. I guess the
thought is that when get_map_id fails, struct fd has been handled.
Maybe add a comment above that function as a reminder?

>>> +       err = -EFAULT;
>>> +       if (copy_to_user(uvalue, value, map->value_size) != 0)
>>> +               goto free_key;
>>
>> I'm made uncomfortable with memory copying where explicit lengths from
>> userspace aren't being used. It does look like it would be redundant,
>> though. Are there other syscalls where the kernel may stomp on user
>> memory based on internal kernel sizes? I think this is fine as-is, but
>> it makes me want to think harder about it. :)
>
> good question :)
> key_size and value_size are passed initially from user space.
> Kernel only verifies and allocates internal map elements with given
> sizes. Then it copies the value back with the size it remembered.
> If user space said at map creation time that value_size is 100,
> it should be using it consistently in the user space program.

Yeah, I think this should be fine as-is.

>
>>> +       err = -ENOMEM;
>>> +       next_key = kmalloc(map->key_size, GFP_ATOMIC);
>>
>> In the interests of defensiveness, I'd use kzalloc here.
>
> I think it would be overkill. The map implementation must consume
> all bytes of the incoming 'key' and return exactly the same number
> of bytes in 'next_key'. Otherwise the whole iteration over the map
> with 'get_next_key' won't work. So if a map implementation is
> broken, it will be seen right away. No security leak here :)

Okay, fair enough. I had a few similar suggestions later. I kind of
wish there was a kcalloc that didn't zero memory to handle the case of
multiplied size input, but no need to spend the time clearing.

>
>>> +       case BPF_MAP_GET_NEXT_KEY:
>>> +               return map_get_next_key((int) arg2, (void __user *) arg3,
>>> +                                       (void __user *) arg4);
>>
>> Same observation as the other syscall cmd: perhaps arg5 == 0 should be
>> checked? Also, since each of these functions looks up the fd and
>
> yes. will do.
>
>> builds the key, maybe those should be added to a common helper instead
>> of copy/pasting into each demuxed function?
>
> well, get_map_id() is a common helper. I didn't move fdget() all
> the way up to the switch statement, since it looks less readable.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 08/16] bpf: add hashtable type of BPF maps
@ 2014-07-23 20:33         ` Kees Cook
  0 siblings, 0 replies; 62+ messages in thread
From: Kees Cook @ 2014-07-23 20:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Wed, Jul 23, 2014 at 12:57 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Wed, Jul 23, 2014 at 11:36 AM, Kees Cook <keescook@chromium.org> wrote:
>>> +static struct bpf_map *htab_map_alloc(struct nlattr *attr[BPF_MAP_ATTR_MAX + 1])
>>> +{
>>> +       struct bpf_htab *htab;
>>> +       int err, i;
>>> +
>>> +       htab = kmalloc(sizeof(*htab), GFP_USER);
>>
>> I'd prefer kzalloc here.
>
> in this case I agree. will change, since it's not in the critical path and we
> can afford to waste a few cycles zeroing memory.
>
>>> +       err = -ENOMEM;
>>> +       htab->buckets = kmalloc(htab->n_buckets * sizeof(struct hlist_head),
>>> +                               GFP_USER);
>>
>> I'd prefer kcalloc here, even though n_buckets can't currently trigger
>> an integer overflow.
>
> hmm, I would argue that kmalloc_array is the preferred way here, but kcalloc?
> A few lines below the whole array is initialized with INIT_HLIST_HEAD...

Ah! I didn't realize kmalloc_array existed! Perfect. Yes, that would
be great to use. The zeroing is not needed, due to the init below, as
you say.

>
>>> +       for (i = 0; i < htab->n_buckets; i++)
>>> +               INIT_HLIST_HEAD(&htab->buckets[i]);
>
>>> +       htab->slab_name = kasprintf(GFP_USER, "bpf_htab_%p", htab);
>>
>> This leaks a kernel heap memory pointer to userspace. If a unique name
>> needed, I think map_id should be used instead.
>
> it leaks, how? slabinfo is only available to root.
> The same code exists in conntrack:
> net/netfilter/nf_conntrack_core.c:1767

Right, in extreme cases, there are system configurations where leaking
addresses even to root can be considered a bug. There are a lot of
these situations in the kernel still, that's true. However, if we can
at all avoid it, I'd really like to avoid adding new ones. Nearly all
the cases of using a memory pointer are for uniqueness concerns, but I
think you can already get that from the map_id.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 07/16] bpf: add lookup/update/delete/iterate methods to BPF maps
  2014-07-23 20:25       ` Kees Cook
@ 2014-07-23 21:22         ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-23 21:22 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Wed, Jul 23, 2014 at 1:25 PM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Jul 23, 2014 at 12:49 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Wed, Jul 23, 2014 at 11:25 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> +
>>>> +       /* lookup key in a given map referenced by map_id
>>>> +        * err = bpf_map_lookup_elem(int map_id, void *key, void *value)
>>>
>>> This needs map_id documentation updates too?
>>
>> yes. will grep for it just to make sure.
>>
>>>> +static int get_map_id(struct fd f)
>>>> +{
>>>> +       struct bpf_map *map;
>>>> +
>>>> +       if (!f.file)
>>>> +               return -EBADF;
>>>> +
>>>> +       if (f.file->f_op != &bpf_map_fops) {
>>>> +               fdput(f);
>>>
>>> It feels weird to me to do the fdput inside this function. Instead,
>>> should map_lookup_elem get a "err_put" label, instead?
>>
>> I don't think it will work, since I'm not sure that fd.flags will be zero
>> when fd.file == NULL. It looks that way from the return code paths
>> in fs/file.c, but I wasn't sure that I followed all of them,
>> so I just picked this style from fs/timerfd.c, assuming it was
>> done this way on purpose and there can be a case where
>> fd.file == NULL and fd.flags != 0. In that case we cannot call fdput().
>
> Yeah, hm, looking around, this does seem to be the case. I guess the
> thought is that when get_map_id fails, struct fd has been handled.

correct.

> Maybe add a comment above that function as a reminder?

yes. will do.
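Maybe something like:

        /* helper to convert a user supplied fd into map_id.
         * On failure the struct fd has already been released here (when
         * there was a file to release), so the caller must not call
         * fdput() again on the error path.
         */
        static int get_map_id(struct fd f)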

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 08/16] bpf: add hashtable type of BPF maps
  2014-07-23 20:33         ` Kees Cook
@ 2014-07-23 21:42         ` Alexei Starovoitov
  -1 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-23 21:42 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Wed, Jul 23, 2014 at 1:33 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>>>> +       htab->slab_name = kasprintf(GFP_USER, "bpf_htab_%p", htab);
>>>
>>> This leaks a kernel heap memory pointer to userspace. If a unique name
>>> needed, I think map_id should be used instead.
>>
>> it leaks, how? slabinfo is only available to root.
>> The same code exists in conntrack:
>> net/netfilter/nf_conntrack_core.c:1767
>
> Right, in extreme cases, there are system configurations where leaking
> addresses even to root can be considered a bug. There are a lot of
> these situations in the kernel still, that's true. However, if we can
> at all avoid it, I'd really like to avoid adding new ones. Nearly all
> the cases of using a memory pointer are for uniqueness concerns, but I
> think you can already get that from the map_id.

ok. fair enough. I think the slab name doesn't have to be unique anymore.
It used to be a requirement in older kernels. If it is ok to reuse now,
I'll just use the same name for all hash-type maps.
Advice from a slab expert would be great...
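If reusing the name is fine, the kasprintf() could go away entirely and the
cache could be created with one shared name, e.g. (sketch; the field and
label names below are made up, they are not in the posted hunk):

        htab->elem_cache = kmem_cache_create("bpf_htab_elem",
                                             htab->elem_size, 0, 0, NULL);
        if (!htab->elem_cache)
                goto free_buckets;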

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 10/16] bpf: add eBPF verifier
@ 2014-07-23 23:38     ` Kees Cook
  0 siblings, 0 replies; 62+ messages in thread
From: Kees Cook @ 2014-07-23 23:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Thu, Jul 17, 2014 at 9:20 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> Safety of eBPF programs is statically determined by the verifier, which detects:
> - loops
> - out of range jumps
> - unreachable instructions
> - invalid instructions
> - uninitialized register access
> - uninitialized stack access
> - misaligned stack access
> - out of range stack access
> - invalid calling convention
>
> It checks that
> - R1-R5 registers satisfy function prototype
> - program terminates
> - BPF_LD_ABS|IND instructions are only used in socket filters
>
> It is configured with:
>
> - bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
> +   that provides information to the verifier which fields of 'ctx'
>   are accessible (remember 'ctx' is the first argument to eBPF program)
>
> - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
>   reports argument types of kernel helper functions that eBPF program
> +   may call, so that the verifier can check that R1-R5 types match the prototype
>
> More details in Documentation/networking/filter.txt
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>  Documentation/networking/filter.txt |  233 ++++++
>  include/linux/bpf.h                 |   49 ++
>  include/uapi/linux/bpf.h            |    1 +
>  kernel/bpf/Makefile                 |    2 +-
>  kernel/bpf/syscall.c                |    2 +-
>  kernel/bpf/verifier.c               | 1520 +++++++++++++++++++++++++++++++++++
>  6 files changed, 1805 insertions(+), 2 deletions(-)
>  create mode 100644 kernel/bpf/verifier.c
>
> diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
> index e14e486f69cd..778f763fce10 100644
> --- a/Documentation/networking/filter.txt
> +++ b/Documentation/networking/filter.txt
> @@ -995,6 +995,108 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
>  Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
>  2 byte atomic increments are not supported.
>
> +eBPF verifier
> +-------------
> +The safety of the eBPF program is determined in two steps.
> +
> +First step does DAG check to disallow loops and other CFG validation.
> +In particular it will detect programs that have unreachable instructions.
> +(though classic BPF checker allows them)
> +
> +Second step starts from the first insn and descends all possible paths.
> +It simulates execution of every insn and observes the state change of
> +registers and stack.
> +
> +At the start of the program the register R1 contains a pointer to context
> +and has type PTR_TO_CTX.
> +If verifier sees an insn that does R2=R1, then R2 has now type
> +PTR_TO_CTX as well and can be used on the right hand side of expression.
> +If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=INVALID_PTR,
> +since addition of two valid pointers makes invalid pointer.
> +
> +If register was never written to, it's not readable:
> +  bpf_mov R0 = R2
> +  bpf_exit
> +will be rejected, since R2 is unreadable at the start of the program.
> +
> +After kernel function call, R1-R5 are reset to unreadable and
> +R0 has a return type of the function.
> +
> +Since R6-R9 are callee saved, their state is preserved across the call.
> +  bpf_mov R6 = 1
> +  bpf_call foo
> +  bpf_mov R0 = R6
> +  bpf_exit
> +is a correct program. If there was R1 instead of R6, it would have
> +been rejected.
> +
> +Classic BPF register X is mapped to eBPF register R7 inside sk_convert_filter(),
> +so that its state is preserved across calls.
> +
> +load/store instructions are allowed only with registers of valid types, which
> +are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
> +For example:
> + bpf_mov R1 = 1
> + bpf_mov R2 = 2
> + bpf_xadd *(u32 *)(R1 + 3) += R2
> + bpf_exit
> +will be rejected, since R1 doesn't have a valid pointer type at the time of
> +execution of instruction bpf_xadd.
> +
> +At the start R1 contains pointer to ctx and R1 type is PTR_TO_CTX.
> +ctx is generic. The verifier is configured to know what the context is for a particular
> +class of bpf programs. For example, context == skb (for socket filters) and
> +ctx == seccomp_data for seccomp filters.
> +A callback is used to customize verifier to restrict eBPF program access to only
> +certain fields within ctx structure with specified size and alignment.
> +
> +For example, the following insn:
> +  bpf_ld R0 = *(u32 *)(R6 + 8)
> +intends to load a word from address R6 + 8 and store it into R0
> +If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
> +that offset 8 of size 4 bytes can be accessed for reading, otherwise
> +the verifier will reject the program.
> +If R6=PTR_TO_STACK, then access should be aligned and be within
> +stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
> +so it will fail verification, since it's out of bounds.
> +
> +The verifier will allow eBPF program to read data from stack only after
> +it wrote into it.
> +Classic BPF verifier does similar check with M[0-15] memory slots.
> +For example:
> +  bpf_ld R0 = *(u32 *)(R10 - 4)
> +  bpf_exit
> +is invalid program.
> +Though R10 is correct read-only register and has type PTR_TO_STACK
> +and R10 - 4 is within stack bounds, there were no stores into that location.
> +
> +Pointer register spill/fill is tracked as well, since four (R6-R9)
> +callee saved registers may not be enough for some programs.
> +
> +Allowed function calls are customized with bpf_verifier_ops->get_func_proto()
> +For example, skb_get_nlattr() function has the following definition:
> +  struct bpf_func_proto proto = {RET_INTEGER, PTR_TO_CTX};
> +and eBPF verifier will check that this function is always called with first
> +argument being 'ctx'. In other words R1 must have type PTR_TO_CTX
> +at the time of bpf_call insn.
> +After the call register R0 will be set to readable state, so that
> +program can access it.
> +
> +Function calls are the main mechanism to extend the functionality of eBPF programs.
> +Socket filters may let programs call one set of functions, whereas tracing
> +filters may allow completely different set.
> +
> +If a function is made accessible to an eBPF program, it needs to be thought through
> +from security point of view. The verifier will guarantee that the function is
> +called with valid arguments.
> +
> +seccomp vs socket filters have different security restrictions for classic BPF.
> +Seccomp solves this by two stage verifier: classic BPF verifier is followed
> +by seccomp verifier. In case of eBPF one configurable verifier is shared for
> +all use cases.
> +
> +See details of eBPF verifier in kernel/bpf/verifier.c
> +
>  eBPF maps
>  ---------
>  'maps' is a generic storage of different types for sharing data between kernel
> @@ -1064,6 +1166,137 @@ size. It will not let programs pass junk values as 'key' and 'value' to
>  bpf_map_*_elem() functions, so these functions (implemented in C inside kernel)
>  can safely access the pointers in all cases.
>
> +Understanding eBPF verifier messages
> +------------------------------------
> +
> +The following are few examples of invalid eBPF programs and verifier error
> +messages as seen in the log:
> +
> +Program with unreachable instructions:
> +static struct bpf_insn prog[] = {
> +  BPF_EXIT_INSN(),
> +  BPF_EXIT_INSN(),
> +};
> +Error:
> +  unreachable insn 1
> +
> +Program that reads uninitialized register:
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_0, BPF_REG_2),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (bf) r0 = r2
> +  R2 !read_ok
> +
> +Program that doesn't initialize R0 before exiting:
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_1),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (bf) r2 = r1
> +  1: (95) exit
> +  R0 !read_ok
> +
> +Program that accesses stack out of bounds:
> +  BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (7a) *(u64 *)(r10 +8) = 0
> +  invalid stack off=8 size=8
> +
> +Program that doesn't initialize stack before passing its address into function:
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
> +  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (bf) r2 = r10
> +  1: (07) r2 += -8
> +  2: (b7) r1 = 1
> +  3: (85) call 1
> +  invalid indirect read from stack off -8+0 size 8
> +
> +Program that uses invalid map_id=2 while calling to map_lookup_elem() function:
> +  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 2),
> +  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (7a) *(u64 *)(r10 -8) = 0
> +  1: (bf) r2 = r10
> +  2: (07) r2 += -8
> +  3: (b7) r1 = 2
> +  4: (85) call 1
> +  invalid access to map_id=2
> +
> +Program that doesn't check return value of map_lookup_elem() before accessing
> +map element:
> +  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
> +  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),

Is the expectation that these pointers are direct kernel function
addresses? It looks like they're indexes in the check_call routine
below. What specifically were the pointer leaks you'd mentioned?

> +  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (7a) *(u64 *)(r10 -8) = 0
> +  1: (bf) r2 = r10
> +  2: (07) r2 += -8
> +  3: (b7) r1 = 1
> +  4: (85) call 1
> +  5: (7a) *(u64 *)(r0 +0) = 0
> +  R0 invalid mem access 'map_value_or_null'
> +
> +Program that correctly checks map_lookup_elem() returned value for NULL, but
> +accesses the memory with incorrect alignment:
> +  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
> +  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
> +  BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (7a) *(u64 *)(r10 -8) = 0
> +  1: (bf) r2 = r10
> +  2: (07) r2 += -8
> +  3: (b7) r1 = 1
> +  4: (85) call 1
> +  5: (15) if r0 == 0x0 goto pc+1
> +   R0=map_value1 R10=fp
> +  6: (7a) *(u64 *)(r0 +4) = 0
> +  misaligned access off 4 size 8
> +
> +Program that correctly checks map_lookup_elem() returned value for NULL and
> +accesses memory with correct alignment in one side of 'if' branch, but fails
> +to do so in the other side of 'if' branch:
> +  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
> +  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
> +  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
> +  BPF_EXIT_INSN(),
> +  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (7a) *(u64 *)(r10 -8) = 0
> +  1: (bf) r2 = r10
> +  2: (07) r2 += -8
> +  3: (b7) r1 = 1
> +  4: (85) call 1
> +  5: (15) if r0 == 0x0 goto pc+2
> +   R0=map_value1 R10=fp
> +  6: (7a) *(u64 *)(r0 +0) = 0
> +  7: (95) exit
> +
> +  from 5 to 8: R0=imm0 R10=fp
> +  8: (7a) *(u64 *)(r0 +0) = 1
> +  R0 invalid mem access 'imm'
> +
>  Testing
>  -------
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 4967619595cc..b5e90efddfcf 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -46,6 +46,31 @@ struct bpf_map_type_list {
>  void bpf_register_map_type(struct bpf_map_type_list *tl);
>  struct bpf_map *bpf_map_get(u32 map_id);
>
> +/* function argument constraints */
> +enum bpf_arg_type {
> +       ARG_ANYTHING = 0,       /* any argument is ok */
> +
> +       /* the following constraints used to prototype
> +        * bpf_map_lookup/update/delete_elem() functions
> +        */
> +       ARG_CONST_MAP_ID,       /* int const argument used as map_id */
> +       ARG_PTR_TO_MAP_KEY,     /* pointer to stack used as map key */
> +       ARG_PTR_TO_MAP_VALUE,   /* pointer to stack used as map value */
> +
> +       /* the following constraints used to prototype bpf_memcmp() and other
> +        * functions that access data on eBPF program stack
> +        */
> +       ARG_PTR_TO_STACK,       /* any pointer to eBPF program stack */
> +       ARG_CONST_STACK_SIZE,   /* number of bytes accessed from stack */
> +};
> +
> +/* type of values returned from helper functions */
> +enum bpf_return_type {
> +       RET_INTEGER,            /* function returns integer */
> +       RET_VOID,               /* function doesn't return anything */
> +       RET_PTR_TO_MAP_OR_NULL, /* function returns a pointer to map elem value or NULL */
> +};
> +
>  /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
>   * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL
>   * instructions after verifying
> @@ -53,11 +78,33 @@ struct bpf_map *bpf_map_get(u32 map_id);
>  struct bpf_func_proto {
>         u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>         bool gpl_only;
> +       enum bpf_return_type ret_type;
> +       enum bpf_arg_type arg1_type;
> +       enum bpf_arg_type arg2_type;
> +       enum bpf_arg_type arg3_type;
> +       enum bpf_arg_type arg4_type;
> +       enum bpf_arg_type arg5_type;
> +};
> +
> +/* bpf_context is intentionally undefined structure. Pointer to bpf_context is
> + * the first argument to eBPF programs.
> + * For socket filters: 'struct bpf_context *' == 'struct sk_buff *'
> + */
> +struct bpf_context;
> +
> +enum bpf_access_type {
> +       BPF_READ = 1,
> +       BPF_WRITE = 2
>  };
>
>  struct bpf_verifier_ops {
>         /* return eBPF function prototype for verification */
>         const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
> +
> +       /* return true if 'size' wide access at offset 'off' within bpf_context
> +        * with 'type' (read or write) is allowed
> +        */
> +       bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
>  };
>
>  struct bpf_prog_type_list {
> @@ -78,5 +125,7 @@ struct bpf_prog_info {
>
>  void free_bpf_prog_info(struct bpf_prog_info *info);
>  struct sk_filter *bpf_prog_get(u32 ufd);
> +/* verify correctness of eBPF program */
> +int bpf_check(struct sk_filter *fp);
>
>  #endif /* _LINUX_BPF_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 06ba71b49f64..3f288e1d08f1 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -369,6 +369,7 @@ enum bpf_prog_attributes {
>
>  enum bpf_prog_type {
>         BPF_PROG_TYPE_UNSPEC,
> +       BPF_PROG_TYPE_SOCKET_FILTER,
>  };
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 558e12712ebc..95a9035e0f29 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -1 +1 @@
> -obj-y := core.o syscall.o hashtab.o
> +obj-y := core.o syscall.o hashtab.o verifier.o
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 9e45ca6b6937..9d441f17548e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -634,7 +634,7 @@ static int bpf_prog_load(enum bpf_prog_type type, struct nlattr __user *uattr,
>         mutex_lock(&bpf_map_lock);
>
>         /* run eBPF verifier */
> -       /* err = bpf_check(prog); */
> +       err = bpf_check(prog);
>
>         if (err == 0 && prog->info->used_maps) {
>                 /* program passed verifier and it's using some maps,
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> new file mode 100644
> index 000000000000..0fce771632b4
> --- /dev/null
> +++ b/kernel/bpf/verifier.c
> @@ -0,0 +1,1520 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/slab.h>
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/capability.h>
> +
> +/* bpf_check() is a static code analyzer that walks eBPF program
> + * instruction by instruction and updates register/stack state.
> + * All paths of conditional branches are analyzed until 'bpf_exit' insn.
> + *
> + * At the first pass depth-first-search verifies that the BPF program is a DAG.
> + * It rejects the following programs:
> + * - larger than BPF_MAXINSNS insns
> + * - if loop is present (detected via back-edge)
> + * - unreachable insns exist (shouldn't be a forest. program = one function)
> + * - out of bounds or malformed jumps
> + * The second pass is all possible path descent from the 1st insn.
> + * Conditional branch target insns keep a link list of verifier states.
> + * If the state already visited, this path can be pruned.
> + * If it wasn't a DAG, such state pruning would be incorrect, since it would
> + * skip cycles. Since it's analyzing all paths through the program,
> + * the length of the analysis is limited to 32k insn, which may be hit even
> + * if insn_cnt < 4K, but there are too many branches that change stack/regs.
> + * Number of 'branches to be analyzed' is limited to 1k
> + *
> + * On entry to each instruction, each register has a type, and the instruction
> + * changes the types of the registers depending on instruction semantics.
> + * If instruction is BPF_MOV64_REG(BPF_REG_1, BPF_REG_5), then type of R5 is
> + * copied to R1.
> + *
> + * All registers are 64-bit (even on 32-bit arch)
> + * R0 - return register
> + * R1-R5 argument passing registers
> + * R6-R9 callee saved registers
> + * R10 - frame pointer read-only
> + *
> + * At the start of BPF program the register R1 contains a pointer to bpf_context
> + * and has type PTR_TO_CTX.
> + *
> + * Most of the time the registers have UNKNOWN_VALUE type, which
> + * means the register has some value, but it's not a valid pointer.
> + * Verifier doesn't attempt to track all arithmetic operations on pointers.
> + * The only special case is the sequence:
> + *    BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
> + *    BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -20),
> + * 1st insn copies R10 (which has FRAME_PTR) type into R1
> + * and 2nd arithmetic instruction is pattern matched to recognize
> + * that it wants to construct a pointer to some element within stack.
> + * So after 2nd insn, the register R1 has type PTR_TO_STACK
> + * (and -20 constant is saved for further stack bounds checking).
> + * Meaning that this reg is a pointer to stack plus known immediate constant.
> + *
> + * When program is doing load or store insns the type of base register can be:
> + * PTR_TO_MAP, PTR_TO_CTX, FRAME_PTR. These are three pointer types recognized
> + * by check_mem_access() function.
> + *
> + * PTR_TO_MAP means that this register is pointing to 'map element value'
> + * and the range of [ptr, ptr + map's value_size) is accessible.
> + *
> + * registers used to pass pointers to function calls are verified against
> + * function prototypes
> + *
> + * ARG_PTR_TO_MAP_KEY is a function argument constraint.
> + * It means that the register type passed to this function must be
> + * PTR_TO_STACK and it will be used inside the function as
> + * 'pointer to map element key'
> + *
> + * For example the argument constraints for bpf_map_lookup_elem():
> + *   .ret_type = RET_PTR_TO_MAP_OR_NULL,
> + *   .arg1_type = ARG_CONST_MAP_ID,
> + *   .arg2_type = ARG_PTR_TO_MAP_KEY,
> + *
> + * ret_type says that this function returns 'pointer to map elem value or null'
> + * 1st argument is a 'const immediate' value which must be one of valid map_ids.
> + * 2nd argument is a pointer to stack, which will be used inside the function as
> + * a pointer to map element key.
> + *
> + * On the kernel side the helper function looks like:
> + * u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> + * {
> + *    struct bpf_map *map;
> + *    int map_id = r1;
> + *    void *key = (void *) (unsigned long) r2;
> + *    void *value;
> + *
> + *    here kernel can access 'key' pointer safely, knowing that
> + *    [key, key + map->key_size) bytes are valid and were initialized on
> + *    the stack of eBPF program.
> + * }
> + *
> + * Corresponding eBPF program looked like:
> + *    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),  // after this insn R2 type is FRAME_PTR
> + *    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), // after this insn R2 type is PTR_TO_STACK
> + *    BPF_MOV64_IMM(BPF_REG_1, MAP_ID),      // after this insn R1 type is CONST_ARG
> + *    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> + * here the verifier looks up the prototype of map_lookup_elem and sees:
> + * .arg1_type == ARG_CONST_MAP_ID and R1->type == CONST_ARG, which is ok so far,
> + * then it goes and finds a map with map_id equal to R1->imm value.
> + * Now verifier knows that this map has key of key_size bytes
> + *
> + * Then .arg2_type == ARG_PTR_TO_MAP_KEY and R2->type == PTR_TO_STACK, ok so far,
> + * Now verifier checks that [R2, R2 + map's key_size) are within stack limits
> + * and were initialized prior to this call.
> + * If it's ok, then verifier allows this BPF_CALL insn and looks at
> + * .ret_type which is RET_PTR_TO_MAP_OR_NULL, so it sets
> + * R0->type = PTR_TO_MAP_OR_NULL, which means the bpf_map_lookup_elem() function
> + * returns either a pointer to the map value or NULL.
> + *
> + * When type PTR_TO_MAP_OR_NULL passes through 'if (reg != 0) goto +off' insn,
> + * the register holding that pointer in the true branch changes state to
> + * PTR_TO_MAP and the same register changes state to CONST_IMM in the false
> + * branch. See check_cond_jmp_op().
> + *
> + * After the call, R0 is set to the return type of the function and registers
> + * R1-R5 are set to NOT_INIT to indicate that they are no longer readable.
> + *
> + * load/store alignment is checked:
> + *    BPF_STX_MEM(BPF_DW, dest_reg, src_reg, 3)
> + * is rejected, because it's misaligned
> + *
> + * load/store to stack are bounds checked and register spill is tracked
> + *    BPF_STX_MEM(BPF_B, BPF_REG_10, src_reg, 0)
> + * is rejected, because it's out of bounds
> + *
> + * load/store to map are bounds checked:
> + *    BPF_STX_MEM(BPF_H, dest_reg, src_reg, 8)
> + * is ok, if dest_reg->type == PTR_TO_MAP and
> + * 8 + sizeof(u16) <= map_info->value_size
> + *
> + * load/store to bpf_context are checked against known fields
> + */
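
Putting the pieces of this comment together, the lookup-and-test shape the
verifier is built around would be something like the following -- just a
sketch using the insn macros quoted above, with MAP_ID as a placeholder and
assuming a map with a 4-byte key and value_size >= 8:

    BPF_MOV64_IMM(BPF_REG_3, 0),                   /* R3 -> CONST_IMM, key contents */
    BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_3, -4), /* fp[-4] init -> STACK_MISC */
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),          /* R2 -> FRAME_PTR */
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),         /* R2 -> PTR_TO_STACK (imm -4) */
    BPF_MOV64_IMM(BPF_REG_1, MAP_ID),              /* R1 -> CONST_IMM */
    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
                                                   /* R0 -> PTR_TO_MAP_OR_NULL, R1-R5 -> NOT_INIT */
    BPF_RAW_INSN(BPF_JMP | BPF_JEQ | BPF_K, BPF_REG_0, 0, 2, 0),
                                                   /* taken: R0 -> CONST_IMM 0, skip to exit */
    BPF_MOV64_IMM(BPF_REG_1, 42),                  /* fall-through: R0 -> PTR_TO_MAP */
    BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, 0),  /* in-bounds write into map value */
    BPF_RAW_INSN(BPF_JMP | BPF_EXIT, 0, 0, 0, 0),

Dropping the JEQ against 0 (i.e. dereferencing R0 while it is still
PTR_TO_MAP_OR_NULL) is exactly the kind of program this comment says
check_mem_access() will reject.
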
> +
> +#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })

This seems overly terse. :) And the meaning tends to be overloaded
(this obviously isn't a translatable string, etc). Perhaps call it
"chk" or "ret_fail"? And I think OP in the body should have ()s around
it to avoid potential macro expansion silliness.
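
Something like this, maybe (sketch only, name open to bikeshedding):

    #define chk(OP) ({ int _err = (OP); if (_err < 0) return _err; })

so the call sites read e.g. "chk(check_reg_arg(regs, insn->src_reg, 1));",
with the extra ()s keeping a compound OP from expanding strangely.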

> +
> +/* types of values stored in eBPF registers */
> +enum bpf_reg_type {
> +       NOT_INIT = 0,           /* nothing was written into register */
> +       UNKNOWN_VALUE,          /* reg doesn't contain a valid pointer */
> +       PTR_TO_CTX,             /* reg points to bpf_context */
> +       PTR_TO_MAP,             /* reg points to map element value */
> +       PTR_TO_MAP_OR_NULL,     /* points to map element value or NULL */
> +       FRAME_PTR,              /* reg == frame_pointer */
> +       PTR_TO_STACK,           /* reg == frame_pointer + imm */
> +       CONST_IMM,              /* constant integer value */
> +};
> +
> +struct reg_state {
> +       enum bpf_reg_type type;
> +       int imm;
> +};
> +
> +enum bpf_stack_slot_type {
> +       STACK_INVALID,    /* nothing was stored in this stack slot */
> +       STACK_SPILL,      /* 1st byte of register spilled into stack */
> +       STACK_SPILL_PART, /* other 7 bytes of register spill */
> +       STACK_MISC        /* BPF program wrote some data into this slot */
> +};
> +
> +struct bpf_stack_slot {
> +       enum bpf_stack_slot_type stype;
> +       enum bpf_reg_type type;
> +       int imm;
> +};
> +
> +/* state of the program:
> + * type of all registers and stack info
> + */
> +struct verifier_state {
> +       struct reg_state regs[MAX_BPF_REG];
> +       struct bpf_stack_slot stack[MAX_BPF_STACK];
> +};
> +
> +/* linked list of verifier states used to prune search */
> +struct verifier_state_list {
> +       struct verifier_state state;
> +       struct verifier_state_list *next;
> +};
> +
> +/* verifier_state + insn_idx are pushed to stack when branch is encountered */
> +struct verifier_stack_elem {
> +       /* verifier state is 'st'
> +        * before processing instruction 'insn_idx'
> +        * and after processing instruction 'prev_insn_idx'
> +        */
> +       struct verifier_state st;
> +       int insn_idx;
> +       int prev_insn_idx;
> +       struct verifier_stack_elem *next;
> +};
> +
> +#define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
> +
> +/* single container for all structs
> + * one verifier_env per bpf_check() call
> + */
> +struct verifier_env {
> +       struct sk_filter *prog;         /* eBPF program being verified */
> +       struct verifier_stack_elem *head; /* stack of verifier states to be processed */
> +       int stack_size;                 /* number of states to be processed */
> +       struct verifier_state cur_state; /* current verifier state */
> +       struct verifier_state_list **branch_landing; /* search pruning optimization */
> +       u32 used_maps[MAX_USED_MAPS];   /* array of map_id's used by eBPF program */
> +       u32 used_map_cnt;               /* number of used maps */
> +};
> +
> +/* verbose verifier prints what it's seeing
> + * bpf_check() is called under map lock, so no race to access this global var
> + */
> +static bool verbose_on;
> +
> +/* when the verifier rejects an eBPF program, it does a second pass with verbose on
> + * to dump the verification trace to the log, so the user can figure out what's
> + * wrong with the program
> + */
> +static int verbose(const char *fmt, ...)
> +{
> +       va_list args;
> +       int ret;
> +
> +       if (!verbose_on)
> +               return 0;
> +
> +       va_start(args, fmt);
> +       ret = vprintk(fmt, args);
> +       va_end(args);
> +       return ret;
> +}
> +
> +/* string representation of 'enum bpf_reg_type' */
> +static const char * const reg_type_str[] = {
> +       [NOT_INIT] = "?",
> +       [UNKNOWN_VALUE] = "inv",
> +       [PTR_TO_CTX] = "ctx",
> +       [PTR_TO_MAP] = "map_value",
> +       [PTR_TO_MAP_OR_NULL] = "map_value_or_null",
> +       [FRAME_PTR] = "fp",
> +       [PTR_TO_STACK] = "fp",
> +       [CONST_IMM] = "imm",
> +};
> +
> +static void pr_cont_verifier_state(struct verifier_env *env)
> +{
> +       enum bpf_reg_type t;
> +       int i;
> +
> +       for (i = 0; i < MAX_BPF_REG; i++) {
> +               t = env->cur_state.regs[i].type;
> +               if (t == NOT_INIT)
> +                       continue;
> +               pr_cont(" R%d=%s", i, reg_type_str[t]);
> +               if (t == CONST_IMM ||
> +                   t == PTR_TO_STACK ||
> +                   t == PTR_TO_MAP_OR_NULL ||
> +                   t == PTR_TO_MAP)
> +                       pr_cont("%d", env->cur_state.regs[i].imm);
> +       }
> +       for (i = 0; i < MAX_BPF_STACK; i++) {
> +               if (env->cur_state.stack[i].stype == STACK_SPILL)
> +                       pr_cont(" fp%d=%s", -MAX_BPF_STACK + i,
> +                               reg_type_str[env->cur_state.stack[i].type]);
> +       }
> +       pr_cont("\n");
> +}
> +
> +static const char *const bpf_class_string[] = {
> +       "ld", "ldx", "st", "stx", "alu", "jmp", "BUG", "alu64"
> +};
> +
> +static const char *const bpf_alu_string[] = {
> +       "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
> +       "%=", "^=", "=", "s>>=", "endian", "BUG", "BUG"
> +};
> +
> +static const char *const bpf_ldst_string[] = {
> +       "u32", "u16", "u8", "u64"
> +};
> +
> +static const char *const bpf_jmp_string[] = {
> +       "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call", "exit"
> +};

It seems like these string arrays should have literal initializers
like reg_type_str does.
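
i.e. something like this for the ALU one (a sketch; the ldst and jmp arrays
would follow the same pattern), so the strings can't silently get out of sync
with the opcode values:

    static const char * const bpf_alu_string[16] = {
        [BPF_ADD >> 4]  = "+=",
        [BPF_SUB >> 4]  = "-=",
        [BPF_MUL >> 4]  = "*=",
        [BPF_DIV >> 4]  = "/=",
        [BPF_OR >> 4]   = "|=",
        [BPF_AND >> 4]  = "&=",
        [BPF_LSH >> 4]  = "<<=",
        [BPF_RSH >> 4]  = ">>=",
        [BPF_NEG >> 4]  = "neg",
        [BPF_MOD >> 4]  = "%=",
        [BPF_XOR >> 4]  = "^=",
        [BPF_MOV >> 4]  = "=",
        [BPF_ARSH >> 4] = "s>>=",
        [BPF_END >> 4]  = "endian",
    };

(The two trailing "BUG" slots would then want either explicit [14]/[15]
entries or a NULL check in the printer.)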

> +
> +static void pr_cont_bpf_insn(struct bpf_insn *insn)
> +{
> +       u8 class = BPF_CLASS(insn->code);
> +
> +       if (class == BPF_ALU || class == BPF_ALU64) {
> +               if (BPF_SRC(insn->code) == BPF_X)
> +                       pr_cont("(%02x) %sr%d %s %sr%d\n",
> +                               insn->code, class == BPF_ALU ? "(u32) " : "",
> +                               insn->dst_reg,
> +                               bpf_alu_string[BPF_OP(insn->code) >> 4],
> +                               class == BPF_ALU ? "(u32) " : "",
> +                               insn->src_reg);
> +               else
> +                       pr_cont("(%02x) %sr%d %s %s%d\n",
> +                               insn->code, class == BPF_ALU ? "(u32) " : "",
> +                               insn->dst_reg,
> +                               bpf_alu_string[BPF_OP(insn->code) >> 4],
> +                               class == BPF_ALU ? "(u32) " : "",
> +                               insn->imm);
> +       } else if (class == BPF_STX) {
> +               if (BPF_MODE(insn->code) == BPF_MEM)
> +                       pr_cont("(%02x) *(%s *)(r%d %+d) = r%d\n",
> +                               insn->code,
> +                               bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +                               insn->dst_reg,
> +                               insn->off, insn->src_reg);
> +               else if (BPF_MODE(insn->code) == BPF_XADD)
> +                       pr_cont("(%02x) lock *(%s *)(r%d %+d) += r%d\n",
> +                               insn->code,
> +                               bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +                               insn->dst_reg, insn->off,
> +                               insn->src_reg);
> +               else
> +                       pr_cont("BUG_%02x\n", insn->code);

As an optimization, would this be more readable by having BPF_SIZE >>
3 and BPF_OP >> 4 pre-loaded in some local variables?
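
e.g. (sketch):

    u8 op = BPF_OP(insn->code) >> 4;
    u8 size = BPF_SIZE(insn->code) >> 3;

    /* ... */
    pr_cont("(%02x) *(%s *)(r%d %+d) = r%d\n",
            insn->code, bpf_ldst_string[size],
            insn->dst_reg, insn->off, insn->src_reg);

which would also shorten the longer format-argument lists a bit.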

> +       } else if (class == BPF_ST) {
> +               if (BPF_MODE(insn->code) != BPF_MEM) {
> +                       pr_cont("BUG_st_%02x\n", insn->code);
> +                       return;
> +               }
> +               pr_cont("(%02x) *(%s *)(r%d %+d) = %d\n",
> +                       insn->code,
> +                       bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +                       insn->dst_reg,
> +                       insn->off, insn->imm);
> +       } else if (class == BPF_LDX) {
> +               if (BPF_MODE(insn->code) != BPF_MEM) {
> +                       pr_cont("BUG_ldx_%02x\n", insn->code);
> +                       return;
> +               }
> +               pr_cont("(%02x) r%d = *(%s *)(r%d %+d)\n",
> +                       insn->code, insn->dst_reg,
> +                       bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +                       insn->src_reg, insn->off);
> +       } else if (class == BPF_LD) {
> +               if (BPF_MODE(insn->code) == BPF_ABS) {
> +                       pr_cont("(%02x) r0 = *(%s *)skb[%d]\n",
> +                               insn->code,
> +                               bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +                               insn->imm);
> +               } else if (BPF_MODE(insn->code) == BPF_IND) {
> +                       pr_cont("(%02x) r0 = *(%s *)skb[r%d + %d]\n",
> +                               insn->code,
> +                               bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +                               insn->src_reg, insn->imm);
> +               } else {
> +                       pr_cont("BUG_ld_%02x\n", insn->code);
> +                       return;
> +               }
> +       } else if (class == BPF_JMP) {
> +               u8 opcode = BPF_OP(insn->code);
> +
> +               if (opcode == BPF_CALL) {
> +                       pr_cont("(%02x) call %d\n", insn->code, insn->imm);
> +               } else if (insn->code == (BPF_JMP | BPF_JA)) {
> +                       pr_cont("(%02x) goto pc%+d\n",
> +                               insn->code, insn->off);
> +               } else if (insn->code == (BPF_JMP | BPF_EXIT)) {
> +                       pr_cont("(%02x) exit\n", insn->code);
> +               } else if (BPF_SRC(insn->code) == BPF_X) {
> +                       pr_cont("(%02x) if r%d %s r%d goto pc%+d\n",
> +                               insn->code, insn->dst_reg,
> +                               bpf_jmp_string[BPF_OP(insn->code) >> 4],
> +                               insn->src_reg, insn->off);
> +               } else {
> +                       pr_cont("(%02x) if r%d %s 0x%x goto pc%+d\n",
> +                               insn->code, insn->dst_reg,
> +                               bpf_jmp_string[BPF_OP(insn->code) >> 4],
> +                               insn->imm, insn->off);
> +               }
> +       } else {
> +               pr_cont("(%02x) %s\n", insn->code, bpf_class_string[class]);
> +       }
> +}
> +
> +static int pop_stack(struct verifier_env *env, int *prev_insn_idx)
> +{
> +       struct verifier_stack_elem *elem;
> +       int insn_idx;
> +
> +       if (env->head == NULL)
> +               return -1;
> +
> +       memcpy(&env->cur_state, &env->head->st, sizeof(env->cur_state));
> +       insn_idx = env->head->insn_idx;
> +       if (prev_insn_idx)
> +               *prev_insn_idx = env->head->prev_insn_idx;
> +       elem = env->head->next;
> +       kfree(env->head);
> +       env->head = elem;
> +       env->stack_size--;
> +       return insn_idx;
> +}
> +
> +static struct verifier_state *push_stack(struct verifier_env *env, int insn_idx,
> +                                        int prev_insn_idx)
> +{
> +       struct verifier_stack_elem *elem;
> +
> +       elem = kmalloc(sizeof(struct verifier_stack_elem), GFP_KERNEL);
> +       if (!elem)
> +               goto err;
> +
> +       memcpy(&elem->st, &env->cur_state, sizeof(env->cur_state));
> +       elem->insn_idx = insn_idx;
> +       elem->prev_insn_idx = prev_insn_idx;
> +       elem->next = env->head;
> +       env->head = elem;
> +       env->stack_size++;
> +       if (env->stack_size > 1024) {
> +               verbose("BPF program is too complex\n");
> +               goto err;
> +       }
> +       return &elem->st;
> +err:
> +       /* pop all elements and return */
> +       while (pop_stack(env, NULL) >= 0);
> +       return NULL;
> +}
> +
> +#define CALLER_SAVED_REGS 6
> +static const int caller_saved[CALLER_SAVED_REGS] = {
> +       BPF_REG_0, BPF_REG_1, BPF_REG_2, BPF_REG_3, BPF_REG_4, BPF_REG_5
> +};
> +
> +static void init_reg_state(struct reg_state *regs)
> +{
> +       int i;
> +
> +       for (i = 0; i < MAX_BPF_REG; i++) {
> +               regs[i].type = NOT_INIT;
> +               regs[i].imm = 0;
> +       }
> +
> +       /* frame pointer */
> +       regs[BPF_REG_FP].type = FRAME_PTR;
> +
> +       /* 1st arg to a function */
> +       regs[BPF_REG_1].type = PTR_TO_CTX;
> +}
> +
> +static void mark_reg_unknown_value(struct reg_state *regs, int regno)
> +{
> +       regs[regno].type = UNKNOWN_VALUE;
> +       regs[regno].imm = 0;
> +}
> +
> +static int check_reg_arg(struct reg_state *regs, int regno, bool is_src)
> +{

Since regno is always populated with dst_reg/src_reg (u8 :4 sized),
shouldn't this be u8 instead of int? (And in check_* below too?) More
importantly, regno needs bounds checking. MAX_BPF_REG is 10, but
dst_reg/src_reg could be up to 15, IIUC.
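
Something like this at the top of check_reg_arg() (and the other helpers that
take a register number straight from the instruction) would cover it -- a
sketch:

    if (regno >= MAX_BPF_REG) {
        verbose("R%d is invalid\n", regno);
        return -EINVAL;
    }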

> +       if (is_src) {
> +               if (regs[regno].type == NOT_INIT) {
> +                       verbose("R%d !read_ok\n", regno);
> +                       return -EACCES;
> +               }
> +       } else {
> +               if (regno == BPF_REG_FP)
> +                       /* frame pointer is read only */

Why no verbose() call here?
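
e.g. (sketch), to match the other rejection paths:

    if (regno == BPF_REG_FP) {
        /* frame pointer is read only */
        verbose("frame pointer is read only\n");
        return -EACCES;
    }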

> +                       return -EACCES;
> +               mark_reg_unknown_value(regs, regno);
> +       }
> +       return 0;
> +}
> +
> +static int bpf_size_to_bytes(int bpf_size)
> +{
> +       if (bpf_size == BPF_W)
> +               return 4;
> +       else if (bpf_size == BPF_H)
> +               return 2;
> +       else if (bpf_size == BPF_B)
> +               return 1;
> +       else if (bpf_size == BPF_DW)
> +               return 8;
> +       else
> +               return -EACCES;
> +}
> +
> +static int check_stack_write(struct verifier_state *state, int off, int size,
> +                            int value_regno)
> +{
> +       struct bpf_stack_slot *slot;
> +       int i;
> +
> +       if (value_regno >= 0 &&
> +           (state->regs[value_regno].type == PTR_TO_MAP ||
> +            state->regs[value_regno].type == PTR_TO_STACK ||
> +            state->regs[value_regno].type == PTR_TO_CTX)) {
> +
> +               /* register containing pointer is being spilled into stack */
> +               if (size != 8) {
> +                       verbose("invalid size of register spill\n");
> +                       return -EACCES;
> +               }
> +
> +               slot = &state->stack[MAX_BPF_STACK + off];
> +               slot->stype = STACK_SPILL;
> +               /* save register state */
> +               slot->type = state->regs[value_regno].type;
> +               slot->imm = state->regs[value_regno].imm;
> +               for (i = 1; i < 8; i++) {
> +                       slot = &state->stack[MAX_BPF_STACK + off + i];

off and size need bounds checking here and below.
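
For example, something like this before the first slot is touched (a sketch;
off is expected to be negative, relative to the frame pointer):

    if (size <= 0 || off < -MAX_BPF_STACK || off + size > 0) {
        verbose("invalid stack off=%d size=%d\n", off, size);
        return -EACCES;
    }

Otherwise state->stack[MAX_BPF_STACK + off + i] can index outside the array.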

> +                       slot->stype = STACK_SPILL_PART;
> +                       slot->type = UNKNOWN_VALUE;
> +                       slot->imm = 0;
> +               }
> +       } else {
> +
> +               /* regular write of data into stack */
> +               for (i = 0; i < size; i++) {
> +                       slot = &state->stack[MAX_BPF_STACK + off + i];
> +                       slot->stype = STACK_MISC;
> +                       slot->type = UNKNOWN_VALUE;
> +                       slot->imm = 0;
> +               }
> +       }
> +       return 0;
> +}
> +
> +static int check_stack_read(struct verifier_state *state, int off, int size,
> +                           int value_regno)
> +{
> +       int i;
> +       struct bpf_stack_slot *slot;
> +
> +       slot = &state->stack[MAX_BPF_STACK + off];
> +
> +       if (slot->stype == STACK_SPILL) {
> +               if (size != 8) {
> +                       verbose("invalid size of register spill\n");
> +                       return -EACCES;
> +               }
> +               for (i = 1; i < 8; i++) {
> +                       if (state->stack[MAX_BPF_STACK + off + i].stype !=
> +                           STACK_SPILL_PART) {
> +                               verbose("corrupted spill memory\n");
> +                               return -EACCES;
> +                       }
> +               }
> +
> +               /* restore register state from stack */
> +               state->regs[value_regno].type = slot->type;
> +               state->regs[value_regno].imm = slot->imm;
> +               return 0;
> +       } else {
> +               for (i = 0; i < size; i++) {
> +                       if (state->stack[MAX_BPF_STACK + off + i].stype !=
> +                           STACK_MISC) {
> +                               verbose("invalid read from stack off %d+%d size %d\n",
> +                                       off, i, size);
> +                               return -EACCES;
> +                       }
> +               }
> +               /* have read misc data from the stack */
> +               mark_reg_unknown_value(state->regs, value_regno);
> +               return 0;
> +       }
> +}
> +
> +static int remember_map_id(struct verifier_env *env, u32 map_id)
> +{
> +       int i;
> +
> +       /* check whether we recorded this map_id already */
> +       for (i = 0; i < env->used_map_cnt; i++)
> +               if (env->used_maps[i] == map_id)
> +                       return 0;
> +
> +       if (env->used_map_cnt >= MAX_USED_MAPS)
> +               return -E2BIG;
> +
> +       /* remember this map_id */
> +       env->used_maps[env->used_map_cnt++] = map_id;
> +       return 0;
> +}
> +
> +static int get_map_info(struct verifier_env *env, u32 map_id,
> +                       struct bpf_map **map)
> +{
> +       /* if BPF program contains bpf_map_lookup_elem(map_id, key)
> +        * an incorrect map_id will be caught here
> +        */
> +       *map = bpf_map_get(map_id);
> +       if (!*map) {
> +               verbose("invalid access to map_id=%d\n", map_id);
> +               return -EACCES;
> +       }
> +
> +       _(remember_map_id(env, map_id));
> +
> +       return 0;
> +}
> +
> +/* check read/write into map element returned by bpf_map_lookup_elem() */
> +static int check_map_access(struct verifier_env *env, int regno, int off,
> +                           int size)
> +{
> +       struct bpf_map *map;
> +       int map_id = env->cur_state.regs[regno].imm;
> +
> +       _(get_map_info(env, map_id, &map));
> +
> +       if (off < 0 || off + size > map->value_size) {

This could be tricked with a negative size, or a giant size, wrapping negative.
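
i.e. checking the pieces separately so nothing can wrap -- a sketch:

    if (off < 0 || size <= 0 || off > map->value_size ||
        size > map->value_size - off) {
        verbose("invalid access to map_id=%d leaf_size=%d off=%d size=%d\n",
                map_id, map->value_size, off, size);
        return -EACCES;
    }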

> +               verbose("invalid access to map_id=%d leaf_size=%d off=%d size=%d\n",
> +                       map_id, map->value_size, off, size);
> +               return -EACCES;
> +       }
> +       return 0;
> +}
> +
> +/* check access to 'struct bpf_context' fields */
> +static int check_ctx_access(struct verifier_env *env, int off, int size,
> +                           enum bpf_access_type t)
> +{
> +       if (env->prog->info->ops->is_valid_access &&
> +           env->prog->info->ops->is_valid_access(off, size, t))
> +               return 0;
> +
> +       verbose("invalid bpf_context access off=%d size=%d\n", off, size);
> +       return -EACCES;
> +}
> +
> +static int check_mem_access(struct verifier_env *env, int regno, int off,
> +                           int bpf_size, enum bpf_access_type t,
> +                           int value_regno)
> +{
> +       struct verifier_state *state = &env->cur_state;
> +       int size;
> +
> +       _(size = bpf_size_to_bytes(bpf_size));
> +
> +       if (off % size != 0) {
> +               verbose("misaligned access off %d size %d\n", off, size);
> +               return -EACCES;
> +       }

I think more off and size checking is needed here.

> +
> +       if (state->regs[regno].type == PTR_TO_MAP) {
> +               _(check_map_access(env, regno, off, size));
> +               if (t == BPF_READ)
> +                       mark_reg_unknown_value(state->regs, value_regno);
> +       } else if (state->regs[regno].type == PTR_TO_CTX) {
> +               _(check_ctx_access(env, off, size, t));
> +               if (t == BPF_READ)
> +                       mark_reg_unknown_value(state->regs, value_regno);
> +       } else if (state->regs[regno].type == FRAME_PTR) {
> +               if (off >= 0 || off < -MAX_BPF_STACK) {
> +                       verbose("invalid stack off=%d size=%d\n", off, size);
> +                       return -EACCES;
> +               }
> +               if (t == BPF_WRITE)
> +                       _(check_stack_write(state, off, size, value_regno));
> +               else
> +                       _(check_stack_read(state, off, size, value_regno));
> +       } else {
> +               verbose("R%d invalid mem access '%s'\n",
> +                       regno, reg_type_str[state->regs[regno].type]);
> +               return -EACCES;
> +       }
> +       return 0;
> +}
> +
> +/* when register 'regno' is passed into function that will read 'access_size'
> + * bytes from that pointer, make sure that it's within stack boundary
> + * and all elements of stack are initialized
> + */
> +static int check_stack_boundary(struct verifier_env *env,
> +                               int regno, int access_size)
> +{
> +       struct verifier_state *state = &env->cur_state;
> +       struct reg_state *regs = state->regs;
> +       int off, i;
> +

regno bounds checking needed.

> +       if (regs[regno].type != PTR_TO_STACK)
> +               return -EACCES;
> +
> +       off = regs[regno].imm;
> +       if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 ||
> +           access_size <= 0) {
> +               verbose("invalid stack type R%d off=%d access_size=%d\n",
> +                       regno, off, access_size);
> +               return -EACCES;
> +       }
> +
> +       for (i = 0; i < access_size; i++) {
> +               if (state->stack[MAX_BPF_STACK + off + i].stype != STACK_MISC) {
> +                       verbose("invalid indirect read from stack off %d+%d size %d\n",
> +                               off, i, access_size);
> +                       return -EACCES;
> +               }
> +       }
> +       return 0;
> +}
> +
> +static int check_func_arg(struct verifier_env *env, int regno,
> +                         enum bpf_arg_type arg_type, int *map_id,
> +                         struct bpf_map **mapp)
> +{
> +       struct reg_state *reg = env->cur_state.regs + regno;

I would use [] instead of + here. (and regno needs bounds checking)
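
i.e.:

    struct reg_state *reg = &env->cur_state.regs[regno];

(after the regno range check), which reads a little more naturally.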

> +       enum bpf_reg_type expected_type;
> +
> +       if (arg_type == ARG_ANYTHING)
> +               return 0;
> +
> +       if (reg->type == NOT_INIT) {
> +               verbose("R%d !read_ok\n", regno);
> +               return -EACCES;
> +       }
> +
> +       if (arg_type == ARG_PTR_TO_MAP_KEY || arg_type == ARG_PTR_TO_MAP_VALUE) {
> +               expected_type = PTR_TO_STACK;
> +       } else if (arg_type == ARG_CONST_MAP_ID || arg_type == ARG_CONST_STACK_SIZE) {
> +               expected_type = CONST_IMM;
> +       } else {
> +               verbose("unsupported arg_type %d\n", arg_type);
> +               return -EFAULT;
> +       }
> +
> +       if (reg->type != expected_type) {
> +               verbose("R%d type=%s expected=%s\n", regno,
> +                       reg_type_str[reg->type], reg_type_str[expected_type]);
> +               return -EACCES;
> +       }
> +
> +       if (arg_type == ARG_CONST_MAP_ID) {
> +               /* bpf_map_xxx(map_id) call: check that map_id is valid */
> +               *map_id = reg->imm;
> +               _(get_map_info(env, reg->imm, mapp));
> +       } else if (arg_type == ARG_PTR_TO_MAP_KEY) {
> +               /*
> +                * bpf_map_xxx(..., map_id, ..., key) call:
> +                * check that [key, key + map->key_size) are within
> +                * stack limits and initialized
> +                */
> +               if (!*mapp) {
> +                       /*
> +                        * in function declaration map_id must come before
> +                        * map_key or map_elem, so that it's verified
> +                        * and known before we have to check map_key here
> +                        */
> +                       verbose("invalid map_id to access map->key\n");
> +                       return -EACCES;
> +               }
> +               _(check_stack_boundary(env, regno, (*mapp)->key_size));
> +       } else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
> +               /*
> +                * bpf_map_xxx(..., map_id, ..., value) call:
> +                * check [value, value + map->value_size) validity
> +                */
> +               if (!*mapp) {
> +                       verbose("invalid map_id to access map->elem\n");
> +                       return -EACCES;
> +               }
> +               _(check_stack_boundary(env, regno, (*mapp)->value_size));
> +       } else if (arg_type == ARG_CONST_STACK_SIZE) {
> +               /*
> +                * bpf_xxx(..., buf, len) call will access 'len' bytes
> +                * from stack pointer 'buf'. Check it
> +                * note: regno == len, regno - 1 == buf
> +                */
> +               _(check_stack_boundary(env, regno - 1, reg->imm));
> +       }
> +
> +       return 0;
> +}
> +
> +static int check_call(struct verifier_env *env, int func_id)
> +{
> +       struct verifier_state *state = &env->cur_state;
> +       const struct bpf_func_proto *fn = NULL;
> +       struct reg_state *regs = state->regs;
> +       struct bpf_map *map = NULL;
> +       struct reg_state *reg;
> +       int map_id = -1;
> +       int i;
> +
> +       /* find function prototype */
> +       if (func_id <= 0 || func_id >= __BPF_FUNC_MAX_ID) {
> +               verbose("invalid func %d\n", func_id);
> +               return -EINVAL;
> +       }
> +
> +       if (env->prog->info->ops->get_func_proto)
> +               fn = env->prog->info->ops->get_func_proto(func_id);
> +
> +       if (!fn) {
> +               verbose("unknown func %d\n", func_id);
> +               return -EINVAL;
> +       }
> +
> +       /* eBPF programs must be GPL compatible to use GPL-ed functions */
> +       if (!env->prog->info->is_gpl_compatible && fn->gpl_only) {
> +               verbose("cannot call GPL only function from proprietary program\n");
> +               return -EINVAL;
> +       }
> +
> +       /* check args */
> +       _(check_func_arg(env, BPF_REG_1, fn->arg1_type, &map_id, &map));
> +       _(check_func_arg(env, BPF_REG_2, fn->arg2_type, &map_id, &map));
> +       _(check_func_arg(env, BPF_REG_3, fn->arg3_type, &map_id, &map));
> +       _(check_func_arg(env, BPF_REG_4, fn->arg4_type, &map_id, &map));
> +       _(check_func_arg(env, BPF_REG_5, fn->arg5_type, &map_id, &map));
> +
> +       /* reset caller saved regs */
> +       for (i = 0; i < CALLER_SAVED_REGS; i++) {
> +               reg = regs + caller_saved[i];
> +               reg->type = NOT_INIT;
> +               reg->imm = 0;
> +       }
> +
> +       /* update return register */
> +       if (fn->ret_type == RET_INTEGER) {
> +               regs[BPF_REG_0].type = UNKNOWN_VALUE;
> +       } else if (fn->ret_type == RET_VOID) {
> +               regs[BPF_REG_0].type = NOT_INIT;
> +       } else if (fn->ret_type == RET_PTR_TO_MAP_OR_NULL) {
> +               regs[BPF_REG_0].type = PTR_TO_MAP_OR_NULL;
> +               /*
> +                * remember map_id, so that check_map_access()
> +                * can check 'value_size' boundary of memory access
> +                * to map element returned from bpf_map_lookup_elem()
> +                */
> +               regs[BPF_REG_0].imm = map_id;
> +       } else {
> +               verbose("unknown return type %d of func %d\n",
> +                       fn->ret_type, func_id);
> +               return -EINVAL;
> +       }
> +       return 0;
> +}
> +
> +/* check validity of 32-bit and 64-bit arithmetic operations */
> +static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn)
> +{
> +       u8 opcode = BPF_OP(insn->code);
> +
> +       if (opcode == BPF_END || opcode == BPF_NEG) {
> +               if (BPF_SRC(insn->code) != BPF_X)
> +                       return -EINVAL;
> +               /* check src operand */
> +               _(check_reg_arg(regs, insn->dst_reg, 1));
> +
> +               /* check dest operand */
> +               _(check_reg_arg(regs, insn->dst_reg, 0));
> +
> +       } else if (opcode == BPF_MOV) {
> +
> +               if (BPF_SRC(insn->code) == BPF_X)
> +                       /* check src operand */
> +                       _(check_reg_arg(regs, insn->src_reg, 1));
> +
> +               /* check dest operand */
> +               _(check_reg_arg(regs, insn->dst_reg, 0));
> +
> +               if (BPF_SRC(insn->code) == BPF_X) {
> +                       if (BPF_CLASS(insn->code) == BPF_ALU64) {
> +                               /* case: R1 = R2
> +                                * copy register state to dest reg
> +                                */
> +                               regs[insn->dst_reg].type = regs[insn->src_reg].type;
> +                               regs[insn->dst_reg].imm = regs[insn->src_reg].imm;
> +                       } else {
> +                               regs[insn->dst_reg].type = UNKNOWN_VALUE;
> +                               regs[insn->dst_reg].imm = 0;
> +                       }
> +               } else {
> +                       /* case: R = imm
> +                        * remember the value we stored into this reg
> +                        */
> +                       regs[insn->dst_reg].type = CONST_IMM;
> +                       regs[insn->dst_reg].imm = insn->imm;
> +               }
> +
> +       } else {        /* all other ALU ops: and, sub, xor, add, ... */
> +
> +               int stack_relative = 0;
> +
> +               if (BPF_SRC(insn->code) == BPF_X)
> +                       /* check src1 operand */
> +                       _(check_reg_arg(regs, insn->src_reg, 1));
> +
> +               /* check src2 operand */
> +               _(check_reg_arg(regs, insn->dst_reg, 1));
> +
> +               if ((opcode == BPF_MOD || opcode == BPF_DIV) &&
> +                   BPF_SRC(insn->code) == BPF_K && insn->imm == 0) {
> +                       verbose("div by zero\n");
> +                       return -EINVAL;
> +               }
> +
> +               if (opcode == BPF_ADD && BPF_CLASS(insn->code) == BPF_ALU64 &&
> +                   regs[insn->dst_reg].type == FRAME_PTR &&
> +                   BPF_SRC(insn->code) == BPF_K)
> +                       stack_relative = 1;
> +
> +               /* check dest operand */
> +               _(check_reg_arg(regs, insn->dst_reg, 0));
> +
> +               if (stack_relative) {
> +                       regs[insn->dst_reg].type = PTR_TO_STACK;
> +                       regs[insn->dst_reg].imm = insn->imm;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +static int check_cond_jmp_op(struct verifier_env *env,
> +                            struct bpf_insn *insn, int *insn_idx)
> +{
> +       struct reg_state *regs = env->cur_state.regs;
> +       struct verifier_state *other_branch;
> +       u8 opcode = BPF_OP(insn->code);
> +
> +       if (BPF_SRC(insn->code) == BPF_X)
> +               /* check src1 operand */
> +               _(check_reg_arg(regs, insn->src_reg, 1));
> +
> +       /* check src2 operand */
> +       _(check_reg_arg(regs, insn->dst_reg, 1));
> +
> +       /* detect if R == 0 where R was initialized to zero earlier */
> +       if (BPF_SRC(insn->code) == BPF_K &&
> +           (opcode == BPF_JEQ || opcode == BPF_JNE) &&
> +           regs[insn->dst_reg].type == CONST_IMM &&
> +           regs[insn->dst_reg].imm == insn->imm) {
> +               if (opcode == BPF_JEQ) {
> +                       /* if (imm == imm) goto pc+off;
> +                        * only follow the goto, ignore fall-through
> +                        */
> +                       *insn_idx += insn->off;
> +                       return 0;
> +               } else {
> +                       /* if (imm != imm) goto pc+off;
> +                        * only follow fall-through branch, since
> +                        * that's where the program will go
> +                        */
> +                       return 0;
> +               }
> +       }
> +
> +       other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx);
> +       if (!other_branch)
> +               return -EFAULT;
> +
> +       /* detect if R == 0 where R is returned value from bpf_map_lookup_elem() */
> +       if (BPF_SRC(insn->code) == BPF_K &&
> +           insn->imm == 0 && (opcode == BPF_JEQ ||
> +                              opcode == BPF_JNE) &&
> +           regs[insn->dst_reg].type == PTR_TO_MAP_OR_NULL) {
> +               if (opcode == BPF_JEQ) {
> +                       /* next fallthrough insn can access memory via
> +                        * this register
> +                        */
> +                       regs[insn->dst_reg].type = PTR_TO_MAP;
> +                       /* branch target cannot access it, since reg == 0 */
> +                       other_branch->regs[insn->dst_reg].type = CONST_IMM;
> +                       other_branch->regs[insn->dst_reg].imm = 0;
> +               } else {
> +                       other_branch->regs[insn->dst_reg].type = PTR_TO_MAP;
> +                       regs[insn->dst_reg].type = CONST_IMM;
> +                       regs[insn->dst_reg].imm = 0;
> +               }
> +       } else if (BPF_SRC(insn->code) == BPF_K &&
> +                  (opcode == BPF_JEQ || opcode == BPF_JNE)) {
> +
> +               if (opcode == BPF_JEQ) {
> +                       /* detect if (R == imm) goto
> +                        * and in the target state recognize that R = imm
> +                        */
> +                       other_branch->regs[insn->dst_reg].type = CONST_IMM;
> +                       other_branch->regs[insn->dst_reg].imm = insn->imm;
> +               } else {
> +                       /* detect if (R != imm) goto
> +                        * and in the fall-through state recognize that R = imm
> +                        */
> +                       regs[insn->dst_reg].type = CONST_IMM;
> +                       regs[insn->dst_reg].imm = insn->imm;
> +               }
> +       }
> +       if (verbose_on)
> +               pr_cont_verifier_state(env);
> +       return 0;
> +}
> +
> +/* verify safety of LD_ABS|LD_IND instructions:
> + * - they can only appear in the programs where ctx == skb
> + * - since they are wrappers of function calls, they scratch R1-R5 registers,
> + *   preserve R6-R9, and store return value into R0
> + *
> + * Implicit input:
> + *   ctx == skb == R6 == CTX
> + *
> + * Explicit input:
> + *   SRC == any register
> + *   IMM == 32-bit immediate
> + *
> + * Output:
> + *   R0 - 8/16/32-bit skb data converted to cpu endianness
> + */
> +
> +static int check_ld_abs(struct verifier_env *env, struct bpf_insn *insn)
> +{
> +       struct reg_state *regs = env->cur_state.regs;
> +       u8 mode = BPF_MODE(insn->code);
> +       struct reg_state *reg;
> +       int i;
> +
> +       if (mode != BPF_ABS && mode != BPF_IND)
> +               return -EINVAL;
> +
> +       if (env->prog->info->prog_type != BPF_PROG_TYPE_SOCKET_FILTER) {
> +               verbose("BPF_LD_ABS|IND instructions are only allowed in socket filters\n");
> +               return -EINVAL;
> +       }
> +
> +       /* check whether implicit source operand (register R6) is readable */
> +       _(check_reg_arg(regs, BPF_REG_6, 1));
> +
> +       if (regs[BPF_REG_6].type != PTR_TO_CTX) {
> +               verbose("at the time of BPF_LD_ABS|IND R6 != pointer to skb\n");
> +               return -EINVAL;
> +       }
> +
> +       if (mode == BPF_IND)
> +               /* check explicit source operand */
> +               _(check_reg_arg(regs, insn->src_reg, 1));
> +
> +       /* reset caller saved regs to unreadable */
> +       for (i = 0; i < CALLER_SAVED_REGS; i++) {
> +               reg = regs + caller_saved[i];
> +               reg->type = NOT_INIT;
> +               reg->imm = 0;
> +       }
> +
> +       /* mark destination R0 register as readable, since it contains
> +        * the value fetched from the packet
> +        */
> +       regs[BPF_REG_0].type = UNKNOWN_VALUE;
> +       return 0;
> +}
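
For reference, the smallest accepted shape would be something like this, in a
BPF_PROG_TYPE_SOCKET_FILTER program (a sketch, spelled with BPF_RAW_INSN so it
only uses macros already shown above):

    BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),            /* R6 -> PTR_TO_CTX (the skb) */
    BPF_RAW_INSN(BPF_LD | BPF_ABS | BPF_B, 0, 0, 0, 23),
                                                    /* R0 = one skb byte, R1-R5 -> NOT_INIT */
    BPF_RAW_INSN(BPF_JMP | BPF_EXIT, 0, 0, 0, 0),

Using R1-R5 after the LD_ABS without re-initializing them should then be
caught by the caller-saved reset above.
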
> +
> +/* non-recursive DFS pseudo code
> + * 1  procedure DFS-iterative(G,v):
> + * 2      label v as discovered
> + * 3      let S be a stack
> + * 4      S.push(v)
> + * 5      while S is not empty
> + * 6            t <- S.pop()
> + * 7            if t is what we're looking for:
> + * 8                return t
> + * 9            for all edges e in G.adjacentEdges(t) do
> + * 10               if edge e is already labelled
> + * 11                   continue with the next edge
> + * 12               w <- G.adjacentVertex(t,e)
> + * 13               if vertex w is not discovered and not explored
> + * 14                   label e as tree-edge
> + * 15                   label w as discovered
> + * 16                   S.push(w)
> + * 17                   continue at 5
> + * 18               else if vertex w is discovered
> + * 19                   label e as back-edge
> + * 20               else
> + * 21                   // vertex w is explored
> + * 22                   label e as forward- or cross-edge
> + * 23           label t as explored
> + * 24           S.pop()
> + *
> + * convention:
> + * 1 - discovered
> + * 2 - discovered and 1st branch labelled
> + * 3 - discovered and 1st and 2nd branch labelled
> + * 4 - explored
> + */
> +
> +#define STATE_END ((struct verifier_state_list *)-1)
> +
> +#define PUSH_INT(I) \
> +       do { \
> +               if (cur_stack >= insn_cnt) { \
> +                       ret = -E2BIG; \
> +                       goto free_st; \
> +               } \
> +               stack[cur_stack++] = I; \
> +       } while (0)
> +
> +#define PEEK_INT() \
> +       ({ \
> +               int _ret; \
> +               if (cur_stack == 0) \
> +                       _ret = -1; \
> +               else \
> +                       _ret = stack[cur_stack - 1]; \
> +               _ret; \
> +        })
> +
> +#define POP_INT() \
> +       ({ \
> +               int _ret; \
> +               if (cur_stack == 0) \
> +                       _ret = -1; \
> +               else \
> +                       _ret = stack[--cur_stack]; \
> +               _ret; \
> +        })
> +
> +#define PUSH_INSN(T, W, E) \
> +       do { \
> +               int w = W; \
> +               if (E == 1 && st[T] >= 2) \
> +                       break; \
> +               if (E == 2 && st[T] >= 3) \
> +                       break; \
> +               if (w >= insn_cnt) { \
> +                       ret = -EACCES; \
> +                       goto free_st; \
> +               } \
> +               if (E == 2) \
> +                       /* mark branch target for state pruning */ \
> +                       env->branch_landing[w] = STATE_END; \
> +               if (st[w] == 0) { \
> +                       /* tree-edge */ \
> +                       st[T] = 1 + E; \
> +                       st[w] = 1; /* discovered */ \
> +                       PUSH_INT(w); \
> +                       goto peak_stack; \
> +               } else if (st[w] == 1 || st[w] == 2 || st[w] == 3) { \
> +                       verbose("back-edge from insn %d to %d\n", t, w); \
> +                       ret = -EINVAL; \
> +                       goto free_st; \
> +               } else if (st[w] == 4) { \
> +                       /* forward- or cross-edge */ \
> +                       st[T] = 1 + E; \
> +               } else { \
> +                       verbose("insn state internal bug\n"); \
> +                       ret = -EFAULT; \
> +                       goto free_st; \
> +               } \
> +       } while (0)
> +
> +/* non-recursive depth-first-search to detect loops in BPF program
> + * loop == back-edge in directed graph
> + */
> +static int check_cfg(struct verifier_env *env)
> +{
> +       struct bpf_insn *insns = env->prog->insnsi;
> +       int insn_cnt = env->prog->len;
> +       int cur_stack = 0;
> +       int *stack;
> +       int ret = 0;
> +       int *st;
> +       int i, t;
> +
> +       if (insns[insn_cnt - 1].code != (BPF_JMP | BPF_EXIT)) {
> +               verbose("last insn is not a 'ret'\n");
> +               return -EINVAL;
> +       }
> +
> +       st = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
> +       if (!st)
> +               return -ENOMEM;
> +
> +       stack = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
> +       if (!stack) {
> +               kfree(st);
> +               return -ENOMEM;
> +       }
> +
> +       st[0] = 1; /* mark 1st insn as discovered */
> +       PUSH_INT(0);
> +
> +peak_stack:
> +       while ((t = PEEK_INT()) != -1) {
> +               if (insns[t].code == (BPF_JMP | BPF_EXIT))
> +                       goto mark_explored;
> +
> +               if (BPF_CLASS(insns[t].code) == BPF_JMP) {
> +                       u8 opcode = BPF_OP(insns[t].code);
> +
> +                       if (opcode == BPF_CALL) {
> +                               PUSH_INSN(t, t + 1, 1);
> +                       } else if (opcode == BPF_JA) {
> +                               if (BPF_SRC(insns[t].code) != BPF_X) {
> +                                       ret = -EINVAL;
> +                                       goto free_st;
> +                               }
> +                               PUSH_INSN(t, t + insns[t].off + 1, 1);
> +                       } else {
> +                               PUSH_INSN(t, t + 1, 1);
> +                               PUSH_INSN(t, t + insns[t].off + 1, 2);
> +                       }
> +                       /* tell verifier to check for equivalent verifier states
> +                        * after every call and jump
> +                        */
> +                       env->branch_landing[t + 1] = STATE_END;
> +               } else {
> +                       PUSH_INSN(t, t + 1, 1);
> +               }
> +
> +mark_explored:
> +               st[t] = 4; /* explored */
> +               if (POP_INT() == -1) {
> +                       verbose("pop_int internal bug\n");
> +                       ret = -EFAULT;
> +                       goto free_st;
> +               }
> +       }
> +
> +
> +       for (i = 0; i < insn_cnt; i++) {
> +               if (st[i] != 4) {
> +                       verbose("unreachable insn %d\n", i);
> +                       ret = -EINVAL;
> +                       goto free_st;
> +               }
> +       }
> +
> +free_st:
> +       kfree(st);
> +       kfree(stack);
> +       return ret;
> +}
> +
> +/* compare two verifier states
> + *
> + * all states stored in state_list are known to be valid, since
> + * verifier reached 'bpf_exit' instruction through them
> + *
> + * this function is called when verifier exploring different branches of
> + * execution popped from the state stack. If it sees an old state that has
> + * more strict register state and more strict stack state, then this execution
> + * branch doesn't need to be explored further, since verifier already
> + * concluded that more strict state leads to valid finish.
> + *
> + * Therefore two states are equivalent if register state is more conservative
> + * and explored stack state is more conservative than the current one.
> + * Example:
> + *       explored                   current
> + * (slot1=INV slot2=MISC) == (slot1=MISC slot2=MISC)
> + * (slot1=MISC slot2=MISC) != (slot1=INV slot2=MISC)
> + *
> + * In other words if current stack state (one being explored) has more
> + * valid slots than old one that already passed validation, it means
> + * the verifier can stop exploring and conclude that current state is valid too
> + *
> + * Similarly with registers. If explored state has register type as invalid
> + * whereas register type in current state is meaningful, it means that
> + * the current state will reach 'bpf_exit' instruction safely
> + */
> +static bool states_equal(struct verifier_state *old, struct verifier_state *cur)
> +{
> +       int i;
> +
> +       for (i = 0; i < MAX_BPF_REG; i++) {
> +               if (memcmp(&old->regs[i], &cur->regs[i],
> +                          sizeof(old->regs[0])) != 0) {
> +                       if (old->regs[i].type == NOT_INIT ||
> +                           old->regs[i].type == UNKNOWN_VALUE)
> +                               continue;
> +                       return false;
> +               }
> +       }
> +
> +       for (i = 0; i < MAX_BPF_STACK; i++) {
> +               if (memcmp(&old->stack[i], &cur->stack[i],
> +                          sizeof(old->stack[0])) != 0) {
> +                       if (old->stack[i].stype == STACK_INVALID)
> +                               continue;
> +                       return false;
> +               }
> +       }
> +       return true;
> +}
> +
> +static int is_state_visited(struct verifier_env *env, int insn_idx)
> +{
> +       struct verifier_state_list *new_sl;
> +       struct verifier_state_list *sl;
> +
> +       sl = env->branch_landing[insn_idx];
> +       if (!sl)
> +               /* no branch jump to this insn, ignore it */
> +               return 0;
> +
> +       while (sl != STATE_END) {
> +               if (states_equal(&sl->state, &env->cur_state))
> +                       /* reached equivalent register/stack state,
> +                        * prune the search
> +                        */
> +                       return 1;
> +               sl = sl->next;
> +       }
> +       new_sl = kmalloc(sizeof(struct verifier_state_list), GFP_KERNEL);
> +
> +       if (!new_sl)
> +               /* ignore ENOMEM, it doesn't affect correctness */
> +               return 0;
> +
> +       /* add new state to the head of linked list */
> +       memcpy(&new_sl->state, &env->cur_state, sizeof(env->cur_state));
> +       new_sl->next = env->branch_landing[insn_idx];
> +       env->branch_landing[insn_idx] = new_sl;
> +       return 0;
> +}
> +
> +static int do_check(struct verifier_env *env)
> +{
> +       struct verifier_state *state = &env->cur_state;
> +       struct bpf_insn *insns = env->prog->insnsi;
> +       struct reg_state *regs = state->regs;
> +       int insn_cnt = env->prog->len;
> +       int insn_idx, prev_insn_idx = 0;
> +       int insn_processed = 0;
> +       bool do_print_state = false;
> +
> +       init_reg_state(regs);
> +       insn_idx = 0;
> +       for (;;) {
> +               struct bpf_insn *insn;
> +               u8 class;
> +
> +               if (insn_idx >= insn_cnt) {
> +                       verbose("invalid insn idx %d insn_cnt %d\n",
> +                               insn_idx, insn_cnt);
> +                       return -EFAULT;
> +               }
> +
> +               insn = &insns[insn_idx];
> +               class = BPF_CLASS(insn->code);
> +
> +               if (++insn_processed > 32768) {
> +                       verbose("BPF program is too large. Processed %d insn\n",
> +                               insn_processed);
> +                       return -E2BIG;
> +               }
> +
> +               if (is_state_visited(env, insn_idx)) {
> +                       if (verbose_on) {
> +                               if (do_print_state)
> +                                       pr_cont("\nfrom %d to %d: safe\n",
> +                                               prev_insn_idx, insn_idx);
> +                               else
> +                                       pr_cont("%d: safe\n", insn_idx);
> +                       }
> +                       goto process_bpf_exit;
> +               }
> +
> +               if (verbose_on && do_print_state) {
> +                       pr_cont("\nfrom %d to %d:", prev_insn_idx, insn_idx);
> +                       pr_cont_verifier_state(env);
> +                       do_print_state = false;
> +               }
> +
> +               if (verbose_on) {
> +                       pr_cont("%d: ", insn_idx);
> +                       pr_cont_bpf_insn(insn);
> +               }
> +
> +               if (class == BPF_ALU || class == BPF_ALU64) {
> +                       _(check_alu_op(regs, insn));
> +
> +               } else if (class == BPF_LDX) {
> +                       if (BPF_MODE(insn->code) != BPF_MEM)
> +                               return -EINVAL;
> +
> +                       /* check src operand */
> +                       _(check_reg_arg(regs, insn->src_reg, 1));
> +
> +                       _(check_mem_access(env, insn->src_reg, insn->off,
> +                                          BPF_SIZE(insn->code), BPF_READ,
> +                                          insn->dst_reg));
> +
> +                       /* dest reg state will be updated by mem_access */
> +
> +               } else if (class == BPF_STX) {
> +                       /* check src1 operand */
> +                       _(check_reg_arg(regs, insn->src_reg, 1));
> +                       /* check src2 operand */
> +                       _(check_reg_arg(regs, insn->dst_reg, 1));
> +                       _(check_mem_access(env, insn->dst_reg, insn->off,
> +                                          BPF_SIZE(insn->code), BPF_WRITE,
> +                                          insn->src_reg));
> +
> +               } else if (class == BPF_ST) {
> +                       if (BPF_MODE(insn->code) != BPF_MEM)
> +                               return -EINVAL;
> +                       /* check src operand */
> +                       _(check_reg_arg(regs, insn->dst_reg, 1));
> +                       _(check_mem_access(env, insn->dst_reg, insn->off,
> +                                          BPF_SIZE(insn->code), BPF_WRITE,
> +                                          -1));
> +
> +               } else if (class == BPF_JMP) {
> +                       u8 opcode = BPF_OP(insn->code);
> +
> +                       if (opcode == BPF_CALL) {
> +                               _(check_call(env, insn->imm));
> +                       } else if (opcode == BPF_JA) {
> +                               if (BPF_SRC(insn->code) != BPF_X)
> +                                       return -EINVAL;
> +                               insn_idx += insn->off + 1;
> +                               continue;
> +                       } else if (opcode == BPF_EXIT) {
> +                               /* eBPF calling convention is such that R0 is used
> +                                * to return the value from eBPF program.
> +                                * Make sure that it's readable at this time
> +                                * of bpf_exit, which means that program wrote
> +                                * something into it earlier
> +                                */
> +                               _(check_reg_arg(regs, BPF_REG_0, 1));
> +process_bpf_exit:
> +                               insn_idx = pop_stack(env, &prev_insn_idx);
> +                               if (insn_idx < 0) {
> +                                       break;
> +                               } else {
> +                                       do_print_state = true;
> +                                       continue;
> +                               }
> +                       } else {
> +                               _(check_cond_jmp_op(env, insn, &insn_idx));
> +                       }
> +               } else if (class == BPF_LD) {
> +                       _(check_ld_abs(env, insn));
> +               } else {
> +                       verbose("unknown insn class %d\n", class);
> +                       return -EINVAL;
> +               }
> +
> +               insn_idx++;
> +       }
> +
> +       return 0;
> +}
> +
> +static void free_states(struct verifier_env *env, int insn_cnt)
> +{
> +       struct verifier_state_list *sl, *sln;
> +       int i;
> +
> +       for (i = 0; i < insn_cnt; i++) {
> +               sl = env->branch_landing[i];
> +
> +               if (sl)
> +                       while (sl != STATE_END) {
> +                               sln = sl->next;
> +                               kfree(sl);
> +                               sl = sln;
> +                       }
> +       }
> +
> +       kfree(env->branch_landing);
> +}
> +
> +int bpf_check(struct sk_filter *prog)
> +{
> +       struct verifier_env *env;
> +       int ret;
> +
> +       if (prog->len <= 0 || prog->len > BPF_MAXINSNS)
> +               return -E2BIG;
> +
> +       env = kzalloc(sizeof(struct verifier_env), GFP_KERNEL);
> +       if (!env)
> +               return -ENOMEM;
> +
> +       verbose_on = false;
> +retry:
> +       env->prog = prog;
> +       env->branch_landing = kcalloc(prog->len,
> +                                     sizeof(struct verifier_state_list *),
> +                                     GFP_KERNEL);
> +
> +       if (!env->branch_landing) {
> +               kfree(env);
> +               return -ENOMEM;
> +       }
> +
> +       ret = check_cfg(env);
> +       if (ret < 0)
> +               goto free_env;
> +
> +       ret = do_check(env);
> +
> +free_env:
> +       while (pop_stack(env, NULL) >= 0);
> +       free_states(env, prog->len);
> +
> +       if (ret < 0 && !verbose_on && capable(CAP_SYS_ADMIN)) {
> +               /* verification failed, redo it with verbose on */
> +               memset(env, 0, sizeof(struct verifier_env));
> +               verbose_on = true;
> +               goto retry;
> +       }
> +
> +       if (ret == 0 && env->used_map_cnt) {
> +               /* if program passed verifier, update used_maps in bpf_prog_info */
> +               prog->info->used_maps = kmalloc_array(env->used_map_cnt,
> +                                                     sizeof(u32), GFP_KERNEL);
> +               if (!prog->info->used_maps) {
> +                       kfree(env);
> +                       return -ENOMEM;
> +               }
> +               memcpy(prog->info->used_maps, env->used_maps,
> +                      sizeof(u32) * env->used_map_cnt);
> +               prog->info->used_map_cnt = env->used_map_cnt;
> +       }
> +
> +       kfree(env);
> +       return ret;
> +}
> --
> 1.7.9.5
>

Unless I've overlooked something, I think this needs much stricter
evaluation of register numbers, offsets, and sizes.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 10/16] bpf: add eBPF verifier
@ 2014-07-23 23:38     ` Kees Cook
  0 siblings, 0 replies; 62+ messages in thread
From: Kees Cook @ 2014-07-23 23:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Thu, Jul 17, 2014 at 9:20 PM, Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
> Safety of eBPF programs is statically determined by the verifier, which detects:
> - loops
> - out of range jumps
> - unreachable instructions
> - invalid instructions
> - uninitialized register access
> - uninitialized stack access
> - misaligned stack access
> - out of range stack access
> - invalid calling convention
>
> It checks that
> - R1-R5 registers satisfy the function prototype
> - program terminates
> - BPF_LD_ABS|IND instructions are only used in socket filters
>
> It is configured with:
>
> - bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
>   that provides information to the verifier about which fields of 'ctx'
>   are accessible (remember 'ctx' is the first argument to eBPF program)
>
> - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
>   reports argument types of kernel helper functions that eBPF program
>   may call, so that the verifier can check that R1-R5 types match the prototype
>
> More details in Documentation/networking/filter.txt
>
> Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
> ---
>  Documentation/networking/filter.txt |  233 ++++++
>  include/linux/bpf.h                 |   49 ++
>  include/uapi/linux/bpf.h            |    1 +
>  kernel/bpf/Makefile                 |    2 +-
>  kernel/bpf/syscall.c                |    2 +-
>  kernel/bpf/verifier.c               | 1520 +++++++++++++++++++++++++++++++++++
>  6 files changed, 1805 insertions(+), 2 deletions(-)
>  create mode 100644 kernel/bpf/verifier.c
>
> diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
> index e14e486f69cd..778f763fce10 100644
> --- a/Documentation/networking/filter.txt
> +++ b/Documentation/networking/filter.txt
> @@ -995,6 +995,108 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
>  Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
>  2 byte atomic increments are not supported.
>
> +eBPF verifier
> +-------------
> +The safety of the eBPF program is determined in two steps.
> +
> +First step does DAG check to disallow loops, along with other CFG validation.
> +In particular it will detect programs that have unreachable instructions.
> +(though classic BPF checker allows them)
> +
> +Second step starts from the first insn and descends all possible paths.
> +It simulates execution of every insn and observes the state change of
> +registers and stack.
> +
> +At the start of the program the register R1 contains a pointer to context
> +and has type PTR_TO_CTX.
> +If the verifier sees an insn that does R2=R1, then R2 now has type
> +PTR_TO_CTX as well and can be used on the right-hand side of an expression.
> +If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=INVALID_PTR,
> +since addition of two valid pointers makes an invalid pointer.
> +
> +If a register was never written to, it's not readable:
> +  bpf_mov R0 = R2
> +  bpf_exit
> +will be rejected, since R2 is unreadable at the start of the program.
> +
> +After a kernel function call, R1-R5 are reset to unreadable and
> +R0 has the return type of the function.
> +
> +Since R6-R9 are callee saved, their state is preserved across the call.
> +  bpf_mov R6 = 1
> +  bpf_call foo
> +  bpf_mov R0 = R6
> +  bpf_exit
> +is a correct program. If there was R1 instead of R6, it would have
> +been rejected.
> +
> +Classic BPF register X is mapped to eBPF register R7 inside sk_convert_filter(),
> +so that its state is preserved across calls.
> +
> +load/store instructions are allowed only with registers of valid types, which
> +are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
> +For example:
> + bpf_mov R1 = 1
> + bpf_mov R2 = 2
> + bpf_xadd *(u32 *)(R1 + 3) += R2
> + bpf_exit
> +will be rejected, since R1 doesn't have a valid pointer type at the time of
> +execution of instruction bpf_xadd.
> +
> +At the start R1 contains a pointer to ctx and R1 type is PTR_TO_CTX.
> +ctx is generic. The verifier is configured to know what the context is for a
> +particular class of bpf programs. For example, ctx == skb for socket filters
> +and ctx == seccomp_data for seccomp filters.
> +A callback is used to customize the verifier and restrict eBPF program access
> +to only certain fields within ctx with specified size and alignment.
> +
> +For example, the following insn:
> +  bpf_ld R0 = *(u32 *)(R6 + 8)
> +intends to load a word from address R6 + 8 and store it into R0.
> +If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
> +that offset 8 of size 4 bytes can be accessed for reading, otherwise
> +the verifier will reject the program.
> +If R6=PTR_TO_STACK, then access should be aligned and be within
> +stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
> +so it will fail verification, since it's out of bounds.
> +
> +The verifier will allow the eBPF program to read data from the stack only
> +after it wrote into it.
> +The classic BPF verifier does a similar check with M[0-15] memory slots.
> +For example:
> +  bpf_ld R0 = *(u32 *)(R10 - 4)
> +  bpf_exit
> +is invalid program.
> +Though R10 is a correct read-only register and has type PTR_TO_STACK
> +and R10 - 4 is within stack bounds, there were no stores into that location.
> +
> +Pointer register spill/fill is tracked as well, since four (R6-R9)
> +callee saved registers may not be enough for some programs.
> +
> +Allowed function calls are customized with bpf_verifier_ops->get_func_proto()
> +For example, skb_get_nlattr() function has the following definition:
> +  struct bpf_func_proto proto = {RET_INTEGER, PTR_TO_CTX};
> +and eBPF verifier will check that this function is always called with first
> +argument being 'ctx'. In other words R1 must have type PTR_TO_CTX
> +at the time of bpf_call insn.
> +After the call register R0 will be set to readable state, so that
> +program can access it.
> +
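Concretely, the per-program-type hookup described above would look roughly like
the sketch below (sk_filter_func_proto, sk_filter_is_valid_access and
bpf_map_lookup_elem_proto are illustrative names only, not definitions taken
from this series):

static const struct bpf_func_proto *
sk_filter_func_proto(enum bpf_func_id func_id)
{
        switch (func_id) {
        case BPF_FUNC_map_lookup_elem:
                return &bpf_map_lookup_elem_proto;      /* illustrative name */
        default:
                return NULL;    /* unknown helpers are rejected by the verifier */
        }
}

static const struct bpf_verifier_ops sk_filter_ops = {
        .get_func_proto  = sk_filter_func_proto,
        .is_valid_access = sk_filter_is_valid_access,   /* illustrative name */
};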
> +Function calls are the main mechanism to extend the functionality of eBPF
> +programs. Socket filters may let programs call one set of functions, whereas
> +tracing filters may allow a completely different set.
> +
> +If a function is made accessible to eBPF programs, it needs to be thought
> +through from a security point of view. The verifier will guarantee that the
> +function is called with valid arguments.
> +
> +Seccomp and socket filters have different security restrictions for classic
> +BPF. Seccomp solves this with a two-stage verifier: the classic BPF verifier
> +is followed by the seccomp verifier. In the case of eBPF, one configurable
> +verifier is shared for all use cases.
> +
> +See details of eBPF verifier in kernel/bpf/verifier.c
> +
>  eBPF maps
>  ---------
>  'maps' is a generic storage of different types for sharing data between kernel
> @@ -1064,6 +1166,137 @@ size. It will not let programs pass junk values as 'key' and 'value' to
>  bpf_map_*_elem() functions, so these functions (implemented in C inside kernel)
>  can safely access the pointers in all cases.
>
> +Understanding eBPF verifier messages
> +------------------------------------
> +
> +The following are a few examples of invalid eBPF programs and the verifier
> +error messages as seen in the log:
> +
> +Program with unreachable instructions:
> +static struct bpf_insn prog[] = {
> +  BPF_EXIT_INSN(),
> +  BPF_EXIT_INSN(),
> +};
> +Error:
> +  unreachable insn 1
> +
> +Program that reads uninitialized register:
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_0, BPF_REG_2),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (bf) r0 = r2
> +  R2 !read_ok
> +
> +Program that doesn't initialize R0 before exiting:
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_1),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (bf) r2 = r1
> +  1: (95) exit
> +  R0 !read_ok
> +
> +Program that accesses stack out of bounds:
> +  BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (7a) *(u64 *)(r10 +8) = 0
> +  invalid stack off=8 size=8
> +
> +Program that doesn't initialize stack before passing its address into function:
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
> +  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (bf) r2 = r10
> +  1: (07) r2 += -8
> +  2: (b7) r1 = 1
> +  3: (85) call 1
> +  invalid indirect read from stack off -8+0 size 8
> +
> +Program that uses invalid map_id=2 while calling the map_lookup_elem() function:
> +  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 2),
> +  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (7a) *(u64 *)(r10 -8) = 0
> +  1: (bf) r2 = r10
> +  2: (07) r2 += -8
> +  3: (b7) r1 = 2
> +  4: (85) call 1
> +  invalid access to map_id=2
> +
> +Program that doesn't check the return value of map_lookup_elem() before
> +accessing the map element:
> +  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
> +  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),

Is the expectation that these pointers are direct kernel function
addresses? It looks like they're indexes in the check_call routine
below. What specifically were the pointer leaks you'd mentioned?

> +  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (7a) *(u64 *)(r10 -8) = 0
> +  1: (bf) r2 = r10
> +  2: (07) r2 += -8
> +  3: (b7) r1 = 1
> +  4: (85) call 1
> +  5: (7a) *(u64 *)(r0 +0) = 0
> +  R0 invalid mem access 'map_value_or_null'
> +
> +Program that correctly checks map_lookup_elem() returned value for NULL, but
> +accesses the memory with incorrect alignment:
> +  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
> +  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
> +  BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (7a) *(u64 *)(r10 -8) = 0
> +  1: (bf) r2 = r10
> +  2: (07) r2 += -8
> +  3: (b7) r1 = 1
> +  4: (85) call 1
> +  5: (15) if r0 == 0x0 goto pc+1
> +   R0=map_value1 R10=fp
> +  6: (7a) *(u64 *)(r0 +4) = 0
> +  misaligned access off 4 size 8
> +
> +Program that correctly checks map_lookup_elem() returned value for NULL and
> +accesses memory with correct alignment in one side of 'if' branch, but fails
> +to do so in the other side of 'if' branch:
> +  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
> +  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
> +  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
> +  BPF_EXIT_INSN(),
> +  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (7a) *(u64 *)(r10 -8) = 0
> +  1: (bf) r2 = r10
> +  2: (07) r2 += -8
> +  3: (b7) r1 = 1
> +  4: (85) call 1
> +  5: (15) if r0 == 0x0 goto pc+2
> +   R0=map_value1 R10=fp
> +  6: (7a) *(u64 *)(r0 +0) = 0
> +  7: (95) exit
> +
> +  from 5 to 8: R0=imm0 R10=fp
> +  8: (7a) *(u64 *)(r0 +0) = 1
> +  R0 invalid mem access 'imm'
> +
>  Testing
>  -------
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 4967619595cc..b5e90efddfcf 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -46,6 +46,31 @@ struct bpf_map_type_list {
>  void bpf_register_map_type(struct bpf_map_type_list *tl);
>  struct bpf_map *bpf_map_get(u32 map_id);
>
> +/* function argument constraints */
> +enum bpf_arg_type {
> +       ARG_ANYTHING = 0,       /* any argument is ok */
> +
> +       /* the following constraints used to prototype
> +        * bpf_map_lookup/update/delete_elem() functions
> +        */
> +       ARG_CONST_MAP_ID,       /* int const argument used as map_id */
> +       ARG_PTR_TO_MAP_KEY,     /* pointer to stack used as map key */
> +       ARG_PTR_TO_MAP_VALUE,   /* pointer to stack used as map value */
> +
> +       /* the following constraints used to prototype bpf_memcmp() and other
> +        * functions that access data on eBPF program stack
> +        */
> +       ARG_PTR_TO_STACK,       /* any pointer to eBPF program stack */
> +       ARG_CONST_STACK_SIZE,   /* number of bytes accessed from stack */
> +};
> +
> +/* type of values returned from helper functions */
> +enum bpf_return_type {
> +       RET_INTEGER,            /* function returns integer */
> +       RET_VOID,               /* function doesn't return anything */
> +       RET_PTR_TO_MAP_OR_NULL, /* function returns a pointer to map elem value or NULL */
> +};
> +
>  /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
>   * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL
>   * instructions after verifying
> @@ -53,11 +78,33 @@ struct bpf_map *bpf_map_get(u32 map_id);
>  struct bpf_func_proto {
>         u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>         bool gpl_only;
> +       enum bpf_return_type ret_type;
> +       enum bpf_arg_type arg1_type;
> +       enum bpf_arg_type arg2_type;
> +       enum bpf_arg_type arg3_type;
> +       enum bpf_arg_type arg4_type;
> +       enum bpf_arg_type arg5_type;
> +};
> +
> +/* bpf_context is intentionally undefined structure. Pointer to bpf_context is
> + * the first argument to eBPF programs.
> + * For socket filters: 'struct bpf_context *' == 'struct sk_buff *'
> + */
> +struct bpf_context;
> +
> +enum bpf_access_type {
> +       BPF_READ = 1,
> +       BPF_WRITE = 2
>  };
>
>  struct bpf_verifier_ops {
>         /* return eBPF function prototype for verification */
>         const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
> +
> +       /* return true if 'size' wide access at offset 'off' within bpf_context
> +        * with 'type' (read or write) is allowed
> +        */
> +       bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
>  };
>
>  struct bpf_prog_type_list {
> @@ -78,5 +125,7 @@ struct bpf_prog_info {
>
>  void free_bpf_prog_info(struct bpf_prog_info *info);
>  struct sk_filter *bpf_prog_get(u32 ufd);
> +/* verify correctness of eBPF program */
> +int bpf_check(struct sk_filter *fp);
>
>  #endif /* _LINUX_BPF_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 06ba71b49f64..3f288e1d08f1 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -369,6 +369,7 @@ enum bpf_prog_attributes {
>
>  enum bpf_prog_type {
>         BPF_PROG_TYPE_UNSPEC,
> +       BPF_PROG_TYPE_SOCKET_FILTER,
>  };
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 558e12712ebc..95a9035e0f29 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -1 +1 @@
> -obj-y := core.o syscall.o hashtab.o
> +obj-y := core.o syscall.o hashtab.o verifier.o
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 9e45ca6b6937..9d441f17548e 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -634,7 +634,7 @@ static int bpf_prog_load(enum bpf_prog_type type, struct nlattr __user *uattr,
>         mutex_lock(&bpf_map_lock);
>
>         /* run eBPF verifier */
> -       /* err = bpf_check(prog); */
> +       err = bpf_check(prog);
>
>         if (err == 0 && prog->info->used_maps) {
>                 /* program passed verifier and it's using some maps,
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> new file mode 100644
> index 000000000000..0fce771632b4
> --- /dev/null
> +++ b/kernel/bpf/verifier.c
> @@ -0,0 +1,1520 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/slab.h>
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/capability.h>
> +
> +/* bpf_check() is a static code analyzer that walks eBPF program
> + * instruction by instruction and updates register/stack state.
> + * All paths of conditional branches are analyzed until 'bpf_exit' insn.
> + *
> + * At the first pass depth-first-search verifies that the BPF program is a DAG.
> + * It rejects the following programs:
> + * - larger than BPF_MAXINSNS insns
> + * - if loop is present (detected via back-edge)
> + * - unreachable insns exist (shouldn't be a forest. program = one function)
> + * - out of bounds or malformed jumps
> + * The second pass is all possible path descent from the 1st insn.
> + * Conditional branch target insns keep a linked list of verifier states.
> + * If the state was already visited, this path can be pruned.
> + * If it wasn't a DAG, such state pruning would be incorrect, since it would
> + * skip cycles. Since it's analyzing all paths through the program,
> + * the length of the analysis is limited to 32k insn, which may be hit even
> + * if insn_cnt < 4K, but there are too many branches that change stack/regs.
> + * Number of 'branches to be analyzed' is limited to 1k
> + *
> + * On entry to each instruction, each register has a type, and the instruction
> + * changes the types of the registers depending on instruction semantics.
> + * If instruction is BPF_MOV64_REG(BPF_REG_1, BPF_REG_5), then type of R5 is
> + * copied to R1.
> + *
> + * All registers are 64-bit (even on 32-bit arch)
> + * R0 - return register
> + * R1-R5 argument passing registers
> + * R6-R9 callee saved registers
> + * R10 - frame pointer read-only
> + *
> + * At the start of BPF program the register R1 contains a pointer to bpf_context
> + * and has type PTR_TO_CTX.
> + *
> + * Most of the time the registers have UNKNOWN_VALUE type, which
> + * means the register has some value, but it's not a valid pointer.
> + * Verifier doesn't attempt to track all arithmetic operations on pointers.
> + * The only special case is the sequence:
> + *    BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
> + *    BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -20),
> + * 1st insn copies R10 (which has FRAME_PTR) type into R1
> + * and 2nd arithmetic instruction is pattern matched to recognize
> + * that it wants to construct a pointer to some element within stack.
> + * So after 2nd insn, the register R1 has type PTR_TO_STACK
> + * (and -20 constant is saved for further stack bounds checking).
> + * Meaning that this reg is a pointer to stack plus known immediate constant.
> + *
> + * When program is doing load or store insns the type of base register can be:
> + * PTR_TO_MAP, PTR_TO_CTX, FRAME_PTR. These are three pointer types recognized
> + * by check_mem_access() function.
> + *
> + * PTR_TO_MAP means that this register is pointing to 'map element value'
> + * and the range of [ptr, ptr + map's value_size) is accessible.
> + *
> + * registers used to pass pointers to function calls are verified against
> + * function prototypes
> + *
> + * ARG_PTR_TO_MAP_KEY is a function argument constraint.
> + * It means that the register type passed to this function must be
> + * PTR_TO_STACK and it will be used inside the function as
> + * 'pointer to map element key'
> + *
> + * For example the argument constraints for bpf_map_lookup_elem():
> + *   .ret_type = RET_PTR_TO_MAP_OR_NULL,
> + *   .arg1_type = ARG_CONST_MAP_ID,
> + *   .arg2_type = ARG_PTR_TO_MAP_KEY,
> + *
> + * ret_type says that this function returns 'pointer to map elem value or null'
> + * 1st argument is a 'const immediate' value which must be one of valid map_ids.
> + * 2nd argument is a pointer to stack, which will be used inside the function as
> + * a pointer to map element key.
> + *
> + * On the kernel side the helper function looks like:
> + * u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> + * {
> + *    struct bpf_map *map;
> + *    int map_id = r1;
> + *    void *key = (void *) (unsigned long) r2;
> + *    void *value;
> + *
> + *    here kernel can access 'key' pointer safely, knowing that
> + *    [key, key + map->key_size) bytes are valid and were initialized on
> + *    the stack of eBPF program.
> + * }
> + *
> + * Corresponding eBPF program looked like:
> + *    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),  // after this insn R2 type is FRAME_PTR
> + *    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), // after this insn R2 type is PTR_TO_STACK
> + *    BPF_MOV64_IMM(BPF_REG_1, MAP_ID),      // after this insn R1 type is CONST_ARG
> + *    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> + * here the verifier looks up the prototype of map_lookup_elem and sees:
> + * .arg1_type == ARG_CONST_MAP_ID and R1->type == CONST_ARG, which is ok so far,
> + * then it goes and finds a map with map_id equal to R1->imm value.
> + * Now verifier knows that this map has key of key_size bytes
> + *
> + * Then .arg2_type == ARG_PTR_TO_MAP_KEY and R2->type == PTR_TO_STACK, ok so far,
> + * Now verifier checks that [R2, R2 + map's key_size) are within stack limits
> + * and were initialized prior to this call.
> + * If it's ok, then verifier allows this BPF_CALL insn and looks at
> + * .ret_type which is RET_PTR_TO_MAP_OR_NULL, so it sets
> + * R0->type = PTR_TO_MAP_OR_NULL which means bpf_map_lookup_elem() function
> + * returns either a pointer to the map value or NULL.
> + *
> + * When type PTR_TO_MAP_OR_NULL passes through 'if (reg != 0) goto +off' insn,
> + * the register holding that pointer in the true branch changes state to
> + * PTR_TO_MAP and the same register changes state to CONST_IMM in the false
> + * branch. See check_cond_jmp_op().
> + *
> + * After the call R0 is set to return type of the function and registers R1-R5
> + * are set to NOT_INIT to indicate that they are no longer readable.
> + *
> + * load/store alignment is checked:
> + *    BPF_STX_MEM(BPF_DW, dest_reg, src_reg, 3)
> + * is rejected, because it's misaligned
> + *
> + * load/store to stack are bounds checked and register spill is tracked
> + *    BPF_STX_MEM(BPF_B, BPF_REG_10, src_reg, 0)
> + * is rejected, because it's out of bounds
> + *
> + * load/store to map are bounds checked:
> + *    BPF_STX_MEM(BPF_H, dest_reg, src_reg, 8)
> + * is ok, if dest_reg->type == PTR_TO_MAP and
> + * 8 + sizeof(u16) <= map_info->value_size
> + *
> + * load/store to bpf_context are checked against known fields
> + */
> +
> +#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })

This seems overly terse. :) And the meaning tends to be overloaded
(this obviously isn't a translatable string, etc). Perhaps call it
"chk" or "ret_fail"? And I think OP in the body should have ()s around
it to avoid potential macro expansion silliness.
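Something along these lines would keep the behavior and read better (a sketch;
the name is only an example):

/* sketch: same early-return semantics as _() above, with (OP) parenthesized
 * so that complex expressions expand as intended
 */
#define chk(OP) ({ int __ret = (OP); if (__ret < 0) return __ret; })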

> +
> +/* types of values stored in eBPF registers */
> +enum bpf_reg_type {
> +       NOT_INIT = 0,           /* nothing was written into register */
> +       UNKNOWN_VALUE,          /* reg doesn't contain a valid pointer */
> +       PTR_TO_CTX,             /* reg points to bpf_context */
> +       PTR_TO_MAP,             /* reg points to map element value */
> +       PTR_TO_MAP_OR_NULL,     /* points to map element value or NULL */
> +       FRAME_PTR,              /* reg == frame_pointer */
> +       PTR_TO_STACK,           /* reg == frame_pointer + imm */
> +       CONST_IMM,              /* constant integer value */
> +};
> +
> +struct reg_state {
> +       enum bpf_reg_type type;
> +       int imm;
> +};
> +
> +enum bpf_stack_slot_type {
> +       STACK_INVALID,    /* nothing was stored in this stack slot */
> +       STACK_SPILL,      /* 1st byte of register spilled into stack */
> +       STACK_SPILL_PART, /* other 7 bytes of register spill */
> +       STACK_MISC        /* BPF program wrote some data into this slot */
> +};
> +
> +struct bpf_stack_slot {
> +       enum bpf_stack_slot_type stype;
> +       enum bpf_reg_type type;
> +       int imm;
> +};
> +
> +/* state of the program:
> + * type of all registers and stack info
> + */
> +struct verifier_state {
> +       struct reg_state regs[MAX_BPF_REG];
> +       struct bpf_stack_slot stack[MAX_BPF_STACK];
> +};
> +
> +/* linked list of verifier states used to prune search */
> +struct verifier_state_list {
> +       struct verifier_state state;
> +       struct verifier_state_list *next;
> +};
> +
> +/* verifier_state + insn_idx are pushed to stack when branch is encountered */
> +struct verifier_stack_elem {
> +       /* verifier state is 'st'
> +        * before processing instruction 'insn_idx'
> +        * and after processing instruction 'prev_insn_idx'
> +        */
> +       struct verifier_state st;
> +       int insn_idx;
> +       int prev_insn_idx;
> +       struct verifier_stack_elem *next;
> +};
> +
> +#define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */
> +
> +/* single container for all structs
> + * one verifier_env per bpf_check() call
> + */
> +struct verifier_env {
> +       struct sk_filter *prog;         /* eBPF program being verified */
> +       struct verifier_stack_elem *head; /* stack of verifier states to be processed */
> +       int stack_size;                 /* number of states to be processed */
> +       struct verifier_state cur_state; /* current verifier state */
> +       struct verifier_state_list **branch_landing; /* search pruning optimization */
> +       u32 used_maps[MAX_USED_MAPS];   /* array of map_id's used by eBPF program */
> +       u32 used_map_cnt;               /* number of used maps */
> +};
> +
> +/* verbose verifier prints what it's seeing
> + * bpf_check() is called under map lock, so no race to access this global var
> + */
> +static bool verbose_on;
> +
> +/* when the verifier rejects an eBPF program, it does a second pass with verbose on
> + * to dump the verification trace to the log, so the user can figure out what's
> + * wrong with the program
> + */
> +static int verbose(const char *fmt, ...)
> +{
> +       va_list args;
> +       int ret;
> +
> +       if (!verbose_on)
> +               return 0;
> +
> +       va_start(args, fmt);
> +       ret = vprintk(fmt, args);
> +       va_end(args);
> +       return ret;
> +}
> +
> +/* string representation of 'enum bpf_reg_type' */
> +static const char * const reg_type_str[] = {
> +       [NOT_INIT] = "?",
> +       [UNKNOWN_VALUE] = "inv",
> +       [PTR_TO_CTX] = "ctx",
> +       [PTR_TO_MAP] = "map_value",
> +       [PTR_TO_MAP_OR_NULL] = "map_value_or_null",
> +       [FRAME_PTR] = "fp",
> +       [PTR_TO_STACK] = "fp",
> +       [CONST_IMM] = "imm",
> +};
> +
> +static void pr_cont_verifier_state(struct verifier_env *env)
> +{
> +       enum bpf_reg_type t;
> +       int i;
> +
> +       for (i = 0; i < MAX_BPF_REG; i++) {
> +               t = env->cur_state.regs[i].type;
> +               if (t == NOT_INIT)
> +                       continue;
> +               pr_cont(" R%d=%s", i, reg_type_str[t]);
> +               if (t == CONST_IMM ||
> +                   t == PTR_TO_STACK ||
> +                   t == PTR_TO_MAP_OR_NULL ||
> +                   t == PTR_TO_MAP)
> +                       pr_cont("%d", env->cur_state.regs[i].imm);
> +       }
> +       for (i = 0; i < MAX_BPF_STACK; i++) {
> +               if (env->cur_state.stack[i].stype == STACK_SPILL)
> +                       pr_cont(" fp%d=%s", -MAX_BPF_STACK + i,
> +                               reg_type_str[env->cur_state.stack[i].type]);
> +       }
> +       pr_cont("\n");
> +}
> +
> +static const char *const bpf_class_string[] = {
> +       "ld", "ldx", "st", "stx", "alu", "jmp", "BUG", "alu64"
> +};
> +
> +static const char *const bpf_alu_string[] = {
> +       "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
> +       "%=", "^=", "=", "s>>=", "endian", "BUG", "BUG"
> +};
> +
> +static const char *const bpf_ldst_string[] = {
> +       "u32", "u16", "u8", "u64"
> +};
> +
> +static const char *const bpf_jmp_string[] = {
> +       "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call", "exit"
> +};

It seems like these string arrays should have literal initializers
like reg_type_str does.
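For example (a sketch), keyed on the same size encoding the users rely on:

/* sketch: designated initializers keyed on BPF_SIZE(code) >> 3, matching how
 * bpf_ldst_string[] is indexed in pr_cont_bpf_insn() below
 */
static const char * const bpf_ldst_string[] = {
        [BPF_W  >> 3] = "u32",
        [BPF_H  >> 3] = "u16",
        [BPF_B  >> 3] = "u8",
        [BPF_DW >> 3] = "u64",
};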

> +
> +static void pr_cont_bpf_insn(struct bpf_insn *insn)
> +{
> +       u8 class = BPF_CLASS(insn->code);
> +
> +       if (class == BPF_ALU || class == BPF_ALU64) {
> +               if (BPF_SRC(insn->code) == BPF_X)
> +                       pr_cont("(%02x) %sr%d %s %sr%d\n",
> +                               insn->code, class == BPF_ALU ? "(u32) " : "",
> +                               insn->dst_reg,
> +                               bpf_alu_string[BPF_OP(insn->code) >> 4],
> +                               class == BPF_ALU ? "(u32) " : "",
> +                               insn->src_reg);
> +               else
> +                       pr_cont("(%02x) %sr%d %s %s%d\n",
> +                               insn->code, class == BPF_ALU ? "(u32) " : "",
> +                               insn->dst_reg,
> +                               bpf_alu_string[BPF_OP(insn->code) >> 4],
> +                               class == BPF_ALU ? "(u32) " : "",
> +                               insn->imm);
> +       } else if (class == BPF_STX) {
> +               if (BPF_MODE(insn->code) == BPF_MEM)
> +                       pr_cont("(%02x) *(%s *)(r%d %+d) = r%d\n",
> +                               insn->code,
> +                               bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +                               insn->dst_reg,
> +                               insn->off, insn->src_reg);
> +               else if (BPF_MODE(insn->code) == BPF_XADD)
> +                       pr_cont("(%02x) lock *(%s *)(r%d %+d) += r%d\n",
> +                               insn->code,
> +                               bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +                               insn->dst_reg, insn->off,
> +                               insn->src_reg);
> +               else
> +                       pr_cont("BUG_%02x\n", insn->code);

As an optimization, would this be more readable by having BPF_SIZE >>
3 and BPF_OP >> 4 pre-loaded in some local variables?
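Something like this, as a sketch of the declarations plus two of the branches
('class' is already computed at the top of pr_cont_bpf_insn()):

        u8 op   = BPF_OP(insn->code) >> 4;      /* index into bpf_alu_string[]/bpf_jmp_string[] */
        u8 size = BPF_SIZE(insn->code) >> 3;    /* index into bpf_ldst_string[] */

        if (class == BPF_STX && BPF_MODE(insn->code) == BPF_MEM)
                pr_cont("(%02x) *(%s *)(r%d %+d) = r%d\n",
                        insn->code, bpf_ldst_string[size],
                        insn->dst_reg, insn->off, insn->src_reg);
        else if (class == BPF_ALU && BPF_SRC(insn->code) == BPF_X)
                pr_cont("(%02x) (u32) r%d %s (u32) r%d\n",
                        insn->code, insn->dst_reg,
                        bpf_alu_string[op], insn->src_reg);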

> +       } else if (class == BPF_ST) {
> +               if (BPF_MODE(insn->code) != BPF_MEM) {
> +                       pr_cont("BUG_st_%02x\n", insn->code);
> +                       return;
> +               }
> +               pr_cont("(%02x) *(%s *)(r%d %+d) = %d\n",
> +                       insn->code,
> +                       bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +                       insn->dst_reg,
> +                       insn->off, insn->imm);
> +       } else if (class == BPF_LDX) {
> +               if (BPF_MODE(insn->code) != BPF_MEM) {
> +                       pr_cont("BUG_ldx_%02x\n", insn->code);
> +                       return;
> +               }
> +               pr_cont("(%02x) r%d = *(%s *)(r%d %+d)\n",
> +                       insn->code, insn->dst_reg,
> +                       bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +                       insn->src_reg, insn->off);
> +       } else if (class == BPF_LD) {
> +               if (BPF_MODE(insn->code) == BPF_ABS) {
> +                       pr_cont("(%02x) r0 = *(%s *)skb[%d]\n",
> +                               insn->code,
> +                               bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +                               insn->imm);
> +               } else if (BPF_MODE(insn->code) == BPF_IND) {
> +                       pr_cont("(%02x) r0 = *(%s *)skb[r%d + %d]\n",
> +                               insn->code,
> +                               bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +                               insn->src_reg, insn->imm);
> +               } else {
> +                       pr_cont("BUG_ld_%02x\n", insn->code);
> +                       return;
> +               }
> +       } else if (class == BPF_JMP) {
> +               u8 opcode = BPF_OP(insn->code);
> +
> +               if (opcode == BPF_CALL) {
> +                       pr_cont("(%02x) call %d\n", insn->code, insn->imm);
> +               } else if (insn->code == (BPF_JMP | BPF_JA)) {
> +                       pr_cont("(%02x) goto pc%+d\n",
> +                               insn->code, insn->off);
> +               } else if (insn->code == (BPF_JMP | BPF_EXIT)) {
> +                       pr_cont("(%02x) exit\n", insn->code);
> +               } else if (BPF_SRC(insn->code) == BPF_X) {
> +                       pr_cont("(%02x) if r%d %s r%d goto pc%+d\n",
> +                               insn->code, insn->dst_reg,
> +                               bpf_jmp_string[BPF_OP(insn->code) >> 4],
> +                               insn->src_reg, insn->off);
> +               } else {
> +                       pr_cont("(%02x) if r%d %s 0x%x goto pc%+d\n",
> +                               insn->code, insn->dst_reg,
> +                               bpf_jmp_string[BPF_OP(insn->code) >> 4],
> +                               insn->imm, insn->off);
> +               }
> +       } else {
> +               pr_cont("(%02x) %s\n", insn->code, bpf_class_string[class]);
> +       }
> +}
> +
> +static int pop_stack(struct verifier_env *env, int *prev_insn_idx)
> +{
> +       struct verifier_stack_elem *elem;
> +       int insn_idx;
> +
> +       if (env->head == NULL)
> +               return -1;
> +
> +       memcpy(&env->cur_state, &env->head->st, sizeof(env->cur_state));
> +       insn_idx = env->head->insn_idx;
> +       if (prev_insn_idx)
> +               *prev_insn_idx = env->head->prev_insn_idx;
> +       elem = env->head->next;
> +       kfree(env->head);
> +       env->head = elem;
> +       env->stack_size--;
> +       return insn_idx;
> +}
> +
> +static struct verifier_state *push_stack(struct verifier_env *env, int insn_idx,
> +                                        int prev_insn_idx)
> +{
> +       struct verifier_stack_elem *elem;
> +
> +       elem = kmalloc(sizeof(struct verifier_stack_elem), GFP_KERNEL);
> +       if (!elem)
> +               goto err;
> +
> +       memcpy(&elem->st, &env->cur_state, sizeof(env->cur_state));
> +       elem->insn_idx = insn_idx;
> +       elem->prev_insn_idx = prev_insn_idx;
> +       elem->next = env->head;
> +       env->head = elem;
> +       env->stack_size++;
> +       if (env->stack_size > 1024) {
> +               verbose("BPF program is too complex\n");
> +               goto err;
> +       }
> +       return &elem->st;
> +err:
> +       /* pop all elements and return */
> +       while (pop_stack(env, NULL) >= 0);
> +       return NULL;
> +}
> +
> +#define CALLER_SAVED_REGS 6
> +static const int caller_saved[CALLER_SAVED_REGS] = {
> +       BPF_REG_0, BPF_REG_1, BPF_REG_2, BPF_REG_3, BPF_REG_4, BPF_REG_5
> +};
> +
> +static void init_reg_state(struct reg_state *regs)
> +{
> +       int i;
> +
> +       for (i = 0; i < MAX_BPF_REG; i++) {
> +               regs[i].type = NOT_INIT;
> +               regs[i].imm = 0;
> +       }
> +
> +       /* frame pointer */
> +       regs[BPF_REG_FP].type = FRAME_PTR;
> +
> +       /* 1st arg to a function */
> +       regs[BPF_REG_1].type = PTR_TO_CTX;
> +}
> +
> +static void mark_reg_unknown_value(struct reg_state *regs, int regno)
> +{
> +       regs[regno].type = UNKNOWN_VALUE;
> +       regs[regno].imm = 0;
> +}
> +
> +static int check_reg_arg(struct reg_state *regs, int regno, bool is_src)
> +{

Since regno is always populated with dst_reg/src_reg (u8 :4 sized),
shouldn't this be u8 instead of int? (And in check_* below too?) More
importantly, regno needs bounds checking. MAX_BPF_REG is 10, but
dst_reg/src_reg could be up to 15, IIUC.
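i.e. roughly (a sketch):

        /* sketch: dst_reg/src_reg are 4-bit fields in the insn encoding, so
         * anything outside R0..R10 must be rejected before regs[] is indexed
         * (the same applies to the other check_* helpers that take a regno)
         */
        if (regno >= MAX_BPF_REG) {
                verbose("R%d is invalid\n", regno);
                return -EINVAL;
        }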

> +       if (is_src) {
> +               if (regs[regno].type == NOT_INIT) {
> +                       verbose("R%d !read_ok\n", regno);
> +                       return -EACCES;
> +               }
> +       } else {
> +               if (regno == BPF_REG_FP)
> +                       /* frame pointer is read only */

Why no verbose() call here?
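e.g. (a sketch, mirroring the error reporting in the read branch above):

        if (regno == BPF_REG_FP) {
                /* frame pointer is read only */
                verbose("frame pointer is read only\n");
                return -EACCES;
        }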

> +                       return -EACCES;
> +               mark_reg_unknown_value(regs, regno);
> +       }
> +       return 0;
> +}
> +
> +static int bpf_size_to_bytes(int bpf_size)
> +{
> +       if (bpf_size == BPF_W)
> +               return 4;
> +       else if (bpf_size == BPF_H)
> +               return 2;
> +       else if (bpf_size == BPF_B)
> +               return 1;
> +       else if (bpf_size == BPF_DW)
> +               return 8;
> +       else
> +               return -EACCES;
> +}
> +
> +static int check_stack_write(struct verifier_state *state, int off, int size,
> +                            int value_regno)
> +{
> +       struct bpf_stack_slot *slot;
> +       int i;
> +
> +       if (value_regno >= 0 &&
> +           (state->regs[value_regno].type == PTR_TO_MAP ||
> +            state->regs[value_regno].type == PTR_TO_STACK ||
> +            state->regs[value_regno].type == PTR_TO_CTX)) {
> +
> +               /* register containing pointer is being spilled into stack */
> +               if (size != 8) {
> +                       verbose("invalid size of register spill\n");
> +                       return -EACCES;
> +               }
> +
> +               slot = &state->stack[MAX_BPF_STACK + off];
> +               slot->stype = STACK_SPILL;
> +               /* save register state */
> +               slot->type = state->regs[value_regno].type;
> +               slot->imm = state->regs[value_regno].imm;
> +               for (i = 1; i < 8; i++) {
> +                       slot = &state->stack[MAX_BPF_STACK + off + i];

off and size need bounds checking here and below.
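For instance (a sketch), before any of the slot updates:

        /* sketch: keep every slot index inside stack[]; off is negative for a
         * valid stack access, so off + size must not go above 0
         */
        if (size <= 0 || off < -MAX_BPF_STACK || off + size > 0) {
                verbose("invalid stack write off=%d size=%d\n", off, size);
                return -EACCES;
        }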

> +                       slot->stype = STACK_SPILL_PART;
> +                       slot->type = UNKNOWN_VALUE;
> +                       slot->imm = 0;
> +               }
> +       } else {
> +
> +               /* regular write of data into stack */
> +               for (i = 0; i < size; i++) {
> +                       slot = &state->stack[MAX_BPF_STACK + off + i];
> +                       slot->stype = STACK_MISC;
> +                       slot->type = UNKNOWN_VALUE;
> +                       slot->imm = 0;
> +               }
> +       }
> +       return 0;
> +}
> +
> +static int check_stack_read(struct verifier_state *state, int off, int size,
> +                           int value_regno)
> +{
> +       int i;
> +       struct bpf_stack_slot *slot;
> +
> +       slot = &state->stack[MAX_BPF_STACK + off];
> +
> +       if (slot->stype == STACK_SPILL) {
> +               if (size != 8) {
> +                       verbose("invalid size of register spill\n");
> +                       return -EACCES;
> +               }
> +               for (i = 1; i < 8; i++) {
> +                       if (state->stack[MAX_BPF_STACK + off + i].stype !=
> +                           STACK_SPILL_PART) {
> +                               verbose("corrupted spill memory\n");
> +                               return -EACCES;
> +                       }
> +               }
> +
> +               /* restore register state from stack */
> +               state->regs[value_regno].type = slot->type;
> +               state->regs[value_regno].imm = slot->imm;
> +               return 0;
> +       } else {
> +               for (i = 0; i < size; i++) {
> +                       if (state->stack[MAX_BPF_STACK + off + i].stype !=
> +                           STACK_MISC) {
> +                               verbose("invalid read from stack off %d+%d size %d\n",
> +                                       off, i, size);
> +                               return -EACCES;
> +                       }
> +               }
> +               /* have read misc data from the stack */
> +               mark_reg_unknown_value(state->regs, value_regno);
> +               return 0;
> +       }
> +}
> +
> +static int remember_map_id(struct verifier_env *env, u32 map_id)
> +{
> +       int i;
> +
> +       /* check whether we recorded this map_id already */
> +       for (i = 0; i < env->used_map_cnt; i++)
> +               if (env->used_maps[i] == map_id)
> +                       return 0;
> +
> +       if (env->used_map_cnt >= MAX_USED_MAPS)
> +               return -E2BIG;
> +
> +       /* remember this map_id */
> +       env->used_maps[env->used_map_cnt++] = map_id;
> +       return 0;
> +}
> +
> +static int get_map_info(struct verifier_env *env, u32 map_id,
> +                       struct bpf_map **map)
> +{
> +       /* if BPF program contains bpf_map_lookup_elem(map_id, key)
> +        * the incorrect map_id will be caught here
> +        */
> +       *map = bpf_map_get(map_id);
> +       if (!*map) {
> +               verbose("invalid access to map_id=%d\n", map_id);
> +               return -EACCES;
> +       }
> +
> +       _(remember_map_id(env, map_id));
> +
> +       return 0;
> +}
> +
> +/* check read/write into map element returned by bpf_map_lookup_elem() */
> +static int check_map_access(struct verifier_env *env, int regno, int off,
> +                           int size)
> +{
> +       struct bpf_map *map;
> +       int map_id = env->cur_state.regs[regno].imm;
> +
> +       _(get_map_info(env, map_id, &map));
> +
> +       if (off < 0 || off + size > map->value_size) {

This could be tricked with a negative size, or a giant size, wrapping negative.
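A guard along these lines (a sketch) would cover both cases:

        /* sketch: require a positive size and widen the sum so that
         * off + size cannot wrap before the value_size comparison
         */
        if (size <= 0 || off < 0 || (u64)off + size > map->value_size) {
                verbose("invalid access to map_id=%d leaf_size=%d off=%d size=%d\n",
                        map_id, map->value_size, off, size);
                return -EACCES;
        }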

> +               verbose("invalid access to map_id=%d leaf_size=%d off=%d size=%d\n",
> +                       map_id, map->value_size, off, size);
> +               return -EACCES;
> +       }
> +       return 0;
> +}
> +
> +/* check access to 'struct bpf_context' fields */
> +static int check_ctx_access(struct verifier_env *env, int off, int size,
> +                           enum bpf_access_type t)
> +{
> +       if (env->prog->info->ops->is_valid_access &&
> +           env->prog->info->ops->is_valid_access(off, size, t))
> +               return 0;
> +
> +       verbose("invalid bpf_context access off=%d size=%d\n", off, size);
> +       return -EACCES;
> +}
> +
> +static int check_mem_access(struct verifier_env *env, int regno, int off,
> +                           int bpf_size, enum bpf_access_type t,
> +                           int value_regno)
> +{
> +       struct verifier_state *state = &env->cur_state;
> +       int size;
> +
> +       _(size = bpf_size_to_bytes(bpf_size));
> +
> +       if (off % size != 0) {
> +               verbose("misaligned access off %d size %d\n", off, size);
> +               return -EACCES;
> +       }

I think more off and size checking is needed here.

> +
> +       if (state->regs[regno].type == PTR_TO_MAP) {
> +               _(check_map_access(env, regno, off, size));
> +               if (t == BPF_READ)
> +                       mark_reg_unknown_value(state->regs, value_regno);
> +       } else if (state->regs[regno].type == PTR_TO_CTX) {
> +               _(check_ctx_access(env, off, size, t));
> +               if (t == BPF_READ)
> +                       mark_reg_unknown_value(state->regs, value_regno);
> +       } else if (state->regs[regno].type == FRAME_PTR) {
> +               if (off >= 0 || off < -MAX_BPF_STACK) {
> +                       verbose("invalid stack off=%d size=%d\n", off, size);
> +                       return -EACCES;
> +               }
> +               if (t == BPF_WRITE)
> +                       _(check_stack_write(state, off, size, value_regno));
> +               else
> +                       _(check_stack_read(state, off, size, value_regno));
> +       } else {
> +               verbose("R%d invalid mem access '%s'\n",
> +                       regno, reg_type_str[state->regs[regno].type]);
> +               return -EACCES;
> +       }
> +       return 0;
> +}
> +
> +/* when register 'regno' is passed into function that will read 'access_size'
> + * bytes from that pointer, make sure that it's within stack boundary
> + * and all elements of stack are initialized
> + */
> +static int check_stack_boundary(struct verifier_env *env,
> +                               int regno, int access_size)
> +{
> +       struct verifier_state *state = &env->cur_state;
> +       struct reg_state *regs = state->regs;
> +       int off, i;
> +

regno bounds checking needed.

> +       if (regs[regno].type != PTR_TO_STACK)
> +               return -EACCES;
> +
> +       off = regs[regno].imm;
> +       if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 ||
> +           access_size <= 0) {
> +               verbose("invalid stack type R%d off=%d access_size=%d\n",
> +                       regno, off, access_size);
> +               return -EACCES;
> +       }
> +
> +       for (i = 0; i < access_size; i++) {
> +               if (state->stack[MAX_BPF_STACK + off + i].stype != STACK_MISC) {
> +                       verbose("invalid indirect read from stack off %d+%d size %d\n",
> +                               off, i, access_size);
> +                       return -EACCES;
> +               }
> +       }
> +       return 0;
> +}
> +
> +static int check_func_arg(struct verifier_env *env, int regno,
> +                         enum bpf_arg_type arg_type, int *map_id,
> +                         struct bpf_map **mapp)
> +{
> +       struct reg_state *reg = env->cur_state.regs + regno;

I would use [] instead of + here. (and regno needs bounds checking)
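i.e. (a sketch):

        struct reg_state *reg;

        /* sketch: bound regno first, then index with [] for clarity */
        if (regno >= MAX_BPF_REG)
                return -EINVAL;
        reg = &env->cur_state.regs[regno];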

> +       enum bpf_reg_type expected_type;
> +
> +       if (arg_type == ARG_ANYTHING)
> +               return 0;
> +
> +       if (reg->type == NOT_INIT) {
> +               verbose("R%d !read_ok\n", regno);
> +               return -EACCES;
> +       }
> +
> +       if (arg_type == ARG_PTR_TO_MAP_KEY || arg_type == ARG_PTR_TO_MAP_VALUE) {
> +               expected_type = PTR_TO_STACK;
> +       } else if (arg_type == ARG_CONST_MAP_ID || arg_type == ARG_CONST_STACK_SIZE) {
> +               expected_type = CONST_IMM;
> +       } else {
> +               verbose("unsupported arg_type %d\n", arg_type);
> +               return -EFAULT;
> +       }
> +
> +       if (reg->type != expected_type) {
> +               verbose("R%d type=%s expected=%s\n", regno,
> +                       reg_type_str[reg->type], reg_type_str[expected_type]);
> +               return -EACCES;
> +       }
> +
> +       if (arg_type == ARG_CONST_MAP_ID) {
> +               /* bpf_map_xxx(map_id) call: check that map_id is valid */
> +               *map_id = reg->imm;
> +               _(get_map_info(env, reg->imm, mapp));
> +       } else if (arg_type == ARG_PTR_TO_MAP_KEY) {
> +               /*
> +                * bpf_map_xxx(..., map_id, ..., key) call:
> +                * check that [key, key + map->key_size) are within
> +                * stack limits and initialized
> +                */
> +               if (!*mapp) {
> +                       /*
> +                        * in function declaration map_id must come before
> +                        * map_key or map_elem, so that it's verified
> +                        * and known before we have to check map_key here
> +                        */
> +                       verbose("invalid map_id to access map->key\n");
> +                       return -EACCES;
> +               }
> +               _(check_stack_boundary(env, regno, (*mapp)->key_size));
> +       } else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
> +               /*
> +                * bpf_map_xxx(..., map_id, ..., value) call:
> +                * check [value, value + map->value_size) validity
> +                */
> +               if (!*mapp) {
> +                       verbose("invalid map_id to access map->elem\n");
> +                       return -EACCES;
> +               }
> +               _(check_stack_boundary(env, regno, (*mapp)->value_size));
> +       } else if (arg_type == ARG_CONST_STACK_SIZE) {
> +               /*
> +                * bpf_xxx(..., buf, len) call will access 'len' bytes
> +                * from stack pointer 'buf'. Check it
> +                * note: regno == len, regno - 1 == buf
> +                */
> +               _(check_stack_boundary(env, regno - 1, reg->imm));
> +       }
> +
> +       return 0;
> +}
> +
> +static int check_call(struct verifier_env *env, int func_id)
> +{
> +       struct verifier_state *state = &env->cur_state;
> +       const struct bpf_func_proto *fn = NULL;
> +       struct reg_state *regs = state->regs;
> +       struct bpf_map *map = NULL;
> +       struct reg_state *reg;
> +       int map_id = -1;
> +       int i;
> +
> +       /* find function prototype */
> +       if (func_id <= 0 || func_id >= __BPF_FUNC_MAX_ID) {
> +               verbose("invalid func %d\n", func_id);
> +               return -EINVAL;
> +       }
> +
> +       if (env->prog->info->ops->get_func_proto)
> +               fn = env->prog->info->ops->get_func_proto(func_id);
> +
> +       if (!fn) {
> +               verbose("unknown func %d\n", func_id);
> +               return -EINVAL;
> +       }
> +
> +       /* eBPF programs must be GPL compatible to use GPL-ed functions */
> +       if (!env->prog->info->is_gpl_compatible && fn->gpl_only) {
> +               verbose("cannot call GPL only function from proprietary program\n");
> +               return -EINVAL;
> +       }
> +
> +       /* check args */
> +       _(check_func_arg(env, BPF_REG_1, fn->arg1_type, &map_id, &map));
> +       _(check_func_arg(env, BPF_REG_2, fn->arg2_type, &map_id, &map));
> +       _(check_func_arg(env, BPF_REG_3, fn->arg3_type, &map_id, &map));
> +       _(check_func_arg(env, BPF_REG_4, fn->arg4_type, &map_id, &map));
> +       _(check_func_arg(env, BPF_REG_5, fn->arg5_type, &map_id, &map));
> +
> +       /* reset caller saved regs */
> +       for (i = 0; i < CALLER_SAVED_REGS; i++) {
> +               reg = regs + caller_saved[i];
> +               reg->type = NOT_INIT;
> +               reg->imm = 0;
> +       }
> +
> +       /* update return register */
> +       if (fn->ret_type == RET_INTEGER) {
> +               regs[BPF_REG_0].type = UNKNOWN_VALUE;
> +       } else if (fn->ret_type == RET_VOID) {
> +               regs[BPF_REG_0].type = NOT_INIT;
> +       } else if (fn->ret_type == RET_PTR_TO_MAP_OR_NULL) {
> +               regs[BPF_REG_0].type = PTR_TO_MAP_OR_NULL;
> +               /*
> +                * remember map_id, so that check_map_access()
> +                * can check 'value_size' boundary of memory access
> +                * to map element returned from bpf_map_lookup_elem()
> +                */
> +               regs[BPF_REG_0].imm = map_id;
> +       } else {
> +               verbose("unknown return type %d of func %d\n",
> +                       fn->ret_type, func_id);
> +               return -EINVAL;
> +       }
> +       return 0;
> +}
> +
> +/* check validity of 32-bit and 64-bit arithmetic operations */
> +static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn)
> +{
> +       u8 opcode = BPF_OP(insn->code);
> +
> +       if (opcode == BPF_END || opcode == BPF_NEG) {
> +               if (BPF_SRC(insn->code) != BPF_X)
> +                       return -EINVAL;
> +               /* check src operand */
> +               _(check_reg_arg(regs, insn->dst_reg, 1));
> +
> +               /* check dest operand */
> +               _(check_reg_arg(regs, insn->dst_reg, 0));
> +
> +       } else if (opcode == BPF_MOV) {
> +
> +               if (BPF_SRC(insn->code) == BPF_X)
> +                       /* check src operand */
> +                       _(check_reg_arg(regs, insn->src_reg, 1));
> +
> +               /* check dest operand */
> +               _(check_reg_arg(regs, insn->dst_reg, 0));
> +
> +               if (BPF_SRC(insn->code) == BPF_X) {
> +                       if (BPF_CLASS(insn->code) == BPF_ALU64) {
> +                               /* case: R1 = R2
> +                                * copy register state to dest reg
> +                                */
> +                               regs[insn->dst_reg].type = regs[insn->src_reg].type;
> +                               regs[insn->dst_reg].imm = regs[insn->src_reg].imm;
> +                       } else {
> +                               regs[insn->dst_reg].type = UNKNOWN_VALUE;
> +                               regs[insn->dst_reg].imm = 0;
> +                       }
> +               } else {
> +                       /* case: R = imm
> +                        * remember the value we stored into this reg
> +                        */
> +                       regs[insn->dst_reg].type = CONST_IMM;
> +                       regs[insn->dst_reg].imm = insn->imm;
> +               }
> +
> +       } else {        /* all other ALU ops: and, sub, xor, add, ... */
> +
> +               int stack_relative = 0;
> +
> +               if (BPF_SRC(insn->code) == BPF_X)
> +                       /* check src1 operand */
> +                       _(check_reg_arg(regs, insn->src_reg, 1));
> +
> +               /* check src2 operand */
> +               _(check_reg_arg(regs, insn->dst_reg, 1));
> +
> +               if ((opcode == BPF_MOD || opcode == BPF_DIV) &&
> +                   BPF_SRC(insn->code) == BPF_K && insn->imm == 0) {
> +                       verbose("div by zero\n");
> +                       return -EINVAL;
> +               }
> +
> +               if (opcode == BPF_ADD && BPF_CLASS(insn->code) == BPF_ALU64 &&
> +                   regs[insn->dst_reg].type == FRAME_PTR &&
> +                   BPF_SRC(insn->code) == BPF_K)
> +                       stack_relative = 1;
> +
> +               /* check dest operand */
> +               _(check_reg_arg(regs, insn->dst_reg, 0));
> +
> +               if (stack_relative) {
> +                       regs[insn->dst_reg].type = PTR_TO_STACK;
> +                       regs[insn->dst_reg].imm = insn->imm;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +static int check_cond_jmp_op(struct verifier_env *env,
> +                            struct bpf_insn *insn, int *insn_idx)
> +{
> +       struct reg_state *regs = env->cur_state.regs;
> +       struct verifier_state *other_branch;
> +       u8 opcode = BPF_OP(insn->code);
> +
> +       if (BPF_SRC(insn->code) == BPF_X)
> +               /* check src1 operand */
> +               _(check_reg_arg(regs, insn->src_reg, 1));
> +
> +       /* check src2 operand */
> +       _(check_reg_arg(regs, insn->dst_reg, 1));
> +
> +       /* detect if R == 0 where R was initialized to zero earlier */
> +       if (BPF_SRC(insn->code) == BPF_K &&
> +           (opcode == BPF_JEQ || opcode == BPF_JNE) &&
> +           regs[insn->dst_reg].type == CONST_IMM &&
> +           regs[insn->dst_reg].imm == insn->imm) {
> +               if (opcode == BPF_JEQ) {
> +                       /* if (imm == imm) goto pc+off;
> +                        * only follow the goto, ignore fall-through
> +                        */
> +                       *insn_idx += insn->off;
> +                       return 0;
> +               } else {
> +                       /* if (imm != imm) goto pc+off;
> +                        * only follow fall-through branch, since
> +                        * that's where the program will go
> +                        */
> +                       return 0;
> +               }
> +       }
> +
> +       other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx);
> +       if (!other_branch)
> +               return -EFAULT;
> +
> +       /* detect if R == 0 where R is returned value from bpf_map_lookup_elem() */
> +       if (BPF_SRC(insn->code) == BPF_K &&
> +           insn->imm == 0 && (opcode == BPF_JEQ ||
> +                              opcode == BPF_JNE) &&
> +           regs[insn->dst_reg].type == PTR_TO_MAP_OR_NULL) {
> +               if (opcode == BPF_JEQ) {
> +                       /* next fallthrough insn can access memory via
> +                        * this register
> +                        */
> +                       regs[insn->dst_reg].type = PTR_TO_MAP;
> +                       /* branch target cannot access it, since reg == 0 */
> +                       other_branch->regs[insn->dst_reg].type = CONST_IMM;
> +                       other_branch->regs[insn->dst_reg].imm = 0;
> +               } else {
> +                       other_branch->regs[insn->dst_reg].type = PTR_TO_MAP;
> +                       regs[insn->dst_reg].type = CONST_IMM;
> +                       regs[insn->dst_reg].imm = 0;
> +               }
> +       } else if (BPF_SRC(insn->code) == BPF_K &&
> +                  (opcode == BPF_JEQ || opcode == BPF_JNE)) {
> +
> +               if (opcode == BPF_JEQ) {
> +                       /* detect if (R == imm) goto
> +                        * and in the target state recognize that R = imm
> +                        */
> +                       other_branch->regs[insn->dst_reg].type = CONST_IMM;
> +                       other_branch->regs[insn->dst_reg].imm = insn->imm;
> +               } else {
> +                       /* detect if (R != imm) goto
> +                        * and in the fall-through state recognize that R = imm
> +                        */
> +                       regs[insn->dst_reg].type = CONST_IMM;
> +                       regs[insn->dst_reg].imm = insn->imm;
> +               }
> +       }
> +       if (verbose_on)
> +               pr_cont_verifier_state(env);
> +       return 0;
> +}
> +
> +/* verify safety of LD_ABS|LD_IND instructions:
> + * - they can only appear in the programs where ctx == skb
> + * - since they are wrappers of function calls, they scratch R1-R5 registers,
> + *   preserve R6-R9, and store return value into R0
> + *
> + * Implicit input:
> + *   ctx == skb == R6 == CTX
> + *
> + * Explicit input:
> + *   SRC == any register
> + *   IMM == 32-bit immediate
> + *
> + * Output:
> + *   R0 - 8/16/32-bit skb data converted to cpu endianness
> + */
> +
> +static int check_ld_abs(struct verifier_env *env, struct bpf_insn *insn)
> +{
> +       struct reg_state *regs = env->cur_state.regs;
> +       u8 mode = BPF_MODE(insn->code);
> +       struct reg_state *reg;
> +       int i;
> +
> +       if (mode != BPF_ABS && mode != BPF_IND)
> +               return -EINVAL;
> +
> +       if (env->prog->info->prog_type != BPF_PROG_TYPE_SOCKET_FILTER) {
> +               verbose("BPF_LD_ABS|IND instructions are only allowed in socket filters\n");
> +               return -EINVAL;
> +       }
> +
> +       /* check whether implicit source operand (register R6) is readable */
> +       _(check_reg_arg(regs, BPF_REG_6, 1));
> +
> +       if (regs[BPF_REG_6].type != PTR_TO_CTX) {
> +               verbose("at the time of BPF_LD_ABS|IND R6 != pointer to skb\n");
> +               return -EINVAL;
> +       }
> +
> +       if (mode == BPF_IND)
> +               /* check explicit source operand */
> +               _(check_reg_arg(regs, insn->src_reg, 1));
> +
> +       /* reset caller saved regs to unreadable */
> +       for (i = 0; i < CALLER_SAVED_REGS; i++) {
> +               reg = regs + caller_saved[i];
> +               reg->type = NOT_INIT;
> +               reg->imm = 0;
> +       }
> +
> +       /* mark destination R0 register as readable, since it contains
> +        * the value fetched from the packet
> +        */
> +       regs[BPF_REG_0].type = UNKNOWN_VALUE;
> +       return 0;
> +}
> +
> +/* non-recursive DFS pseudo code
> + * 1  procedure DFS-iterative(G,v):
> + * 2      label v as discovered
> + * 3      let S be a stack
> + * 4      S.push(v)
> + * 5      while S is not empty
> + * 6            t <- S.pop()
> + * 7            if t is what we're looking for:
> + * 8                return t
> + * 9            for all edges e in G.adjacentEdges(t) do
> + * 10               if edge e is already labelled
> + * 11                   continue with the next edge
> + * 12               w <- G.adjacentVertex(t,e)
> + * 13               if vertex w is not discovered and not explored
> + * 14                   label e as tree-edge
> + * 15                   label w as discovered
> + * 16                   S.push(w)
> + * 17                   continue at 5
> + * 18               else if vertex w is discovered
> + * 19                   label e as back-edge
> + * 20               else
> + * 21                   // vertex w is explored
> + * 22                   label e as forward- or cross-edge
> + * 23           label t as explored
> + * 24           S.pop()
> + *
> + * convention:
> + * 1 - discovered
> + * 2 - discovered and 1st branch labelled
> + * 3 - discovered and 1st and 2nd branch labelled
> + * 4 - explored
> + */
> +
> +#define STATE_END ((struct verifier_state_list *)-1)
> +
> +#define PUSH_INT(I) \
> +       do { \
> +               if (cur_stack >= insn_cnt) { \
> +                       ret = -E2BIG; \
> +                       goto free_st; \
> +               } \
> +               stack[cur_stack++] = I; \
> +       } while (0)
> +
> +#define PEEK_INT() \
> +       ({ \
> +               int _ret; \
> +               if (cur_stack == 0) \
> +                       _ret = -1; \
> +               else \
> +                       _ret = stack[cur_stack - 1]; \
> +               _ret; \
> +        })
> +
> +#define POP_INT() \
> +       ({ \
> +               int _ret; \
> +               if (cur_stack == 0) \
> +                       _ret = -1; \
> +               else \
> +                       _ret = stack[--cur_stack]; \
> +               _ret; \
> +        })
> +
> +#define PUSH_INSN(T, W, E) \
> +       do { \
> +               int w = W; \
> +               if (E == 1 && st[T] >= 2) \
> +                       break; \
> +               if (E == 2 && st[T] >= 3) \
> +                       break; \
> +               if (w >= insn_cnt) { \
> +                       ret = -EACCES; \
> +                       goto free_st; \
> +               } \
> +               if (E == 2) \
> +                       /* mark branch target for state pruning */ \
> +                       env->branch_landing[w] = STATE_END; \
> +               if (st[w] == 0) { \
> +                       /* tree-edge */ \
> +                       st[T] = 1 + E; \
> +                       st[w] = 1; /* discovered */ \
> +                       PUSH_INT(w); \
> +                       goto peek_stack; \
> +               } else if (st[w] == 1 || st[w] == 2 || st[w] == 3) { \
> +                       verbose("back-edge from insn %d to %d\n", t, w); \
> +                       ret = -EINVAL; \
> +                       goto free_st; \
> +               } else if (st[w] == 4) { \
> +                       /* forward- or cross-edge */ \
> +                       st[T] = 1 + E; \
> +               } else { \
> +                       verbose("insn state internal bug\n"); \
> +                       ret = -EFAULT; \
> +                       goto free_st; \
> +               } \
> +       } while (0)
> +
> +/* non-recursive depth-first-search to detect loops in BPF program
> + * loop == back-edge in directed graph
> + */
> +static int check_cfg(struct verifier_env *env)
> +{
> +       struct bpf_insn *insns = env->prog->insnsi;
> +       int insn_cnt = env->prog->len;
> +       int cur_stack = 0;
> +       int *stack;
> +       int ret = 0;
> +       int *st;
> +       int i, t;
> +
> +       if (insns[insn_cnt - 1].code != (BPF_JMP | BPF_EXIT)) {
> +               verbose("last insn is not a 'ret'\n");
> +               return -EINVAL;
> +       }
> +
> +       st = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
> +       if (!st)
> +               return -ENOMEM;
> +
> +       stack = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
> +       if (!stack) {
> +               kfree(st);
> +               return -ENOMEM;
> +       }
> +
> +       st[0] = 1; /* mark 1st insn as discovered */
> +       PUSH_INT(0);
> +
> +peek_stack:
> +       while ((t = PEEK_INT()) != -1) {
> +               if (insns[t].code == (BPF_JMP | BPF_EXIT))
> +                       goto mark_explored;
> +
> +               if (BPF_CLASS(insns[t].code) == BPF_JMP) {
> +                       u8 opcode = BPF_OP(insns[t].code);
> +
> +                       if (opcode == BPF_CALL) {
> +                               PUSH_INSN(t, t + 1, 1);
> +                       } else if (opcode == BPF_JA) {
> +                               if (BPF_SRC(insns[t].code) != BPF_X) {
> +                                       ret = -EINVAL;
> +                                       goto free_st;
> +                               }
> +                               PUSH_INSN(t, t + insns[t].off + 1, 1);
> +                       } else {
> +                               PUSH_INSN(t, t + 1, 1);
> +                               PUSH_INSN(t, t + insns[t].off + 1, 2);
> +                       }
> +                       /* tell verifier to check for equivalent verifier states
> +                        * after every call and jump
> +                        */
> +                       env->branch_landing[t + 1] = STATE_END;
> +               } else {
> +                       PUSH_INSN(t, t + 1, 1);
> +               }
> +
> +mark_explored:
> +               st[t] = 4; /* explored */
> +               if (POP_INT() == -1) {
> +                       verbose("pop_int internal bug\n");
> +                       ret = -EFAULT;
> +                       goto free_st;
> +               }
> +       }
> +
> +
> +       for (i = 0; i < insn_cnt; i++) {
> +               if (st[i] != 4) {
> +                       verbose("unreachable insn %d\n", i);
> +                       ret = -EINVAL;
> +                       goto free_st;
> +               }
> +       }
> +
> +free_st:
> +       kfree(st);
> +       kfree(stack);
> +       return ret;
> +}
> +
> +/* compare two verifier states
> + *
> + * all states stored in state_list are known to be valid, since
> + * verifier reached 'bpf_exit' instruction through them
> + *
> + * this function is called when verifier exploring different branches of
> + * execution popped from the state stack. If it sees an old state that has
> + * more strict register state and more strict stack state then this execution
> + * branch doesn't need to be explored further, since verifier already
> + * concluded that more strict state leads to valid finish.
> + *
> + * Therefore two states are equivalent if register state is more conservative
> + * and explored stack state is more conservative than the current one.
> + * Example:
> + *       explored                   current
> + * (slot1=INV slot2=MISC) == (slot1=MISC slot2=MISC)
> + * (slot1=MISC slot2=MISC) != (slot1=INV slot2=MISC)
> + *
> + * In other words if current stack state (one being explored) has more
> + * valid slots than old one that already passed validation, it means
> + * the verifier can stop exploring and conclude that current state is valid too
> + *
> + * Similarly with registers. If explored state has register type as invalid
> + * whereas register type in current state is meaningful, it means that
> + * the current state will reach 'bpf_exit' instruction safely
> + */
> +static bool states_equal(struct verifier_state *old, struct verifier_state *cur)
> +{
> +       int i;
> +
> +       for (i = 0; i < MAX_BPF_REG; i++) {
> +               if (memcmp(&old->regs[i], &cur->regs[i],
> +                          sizeof(old->regs[0])) != 0) {
> +                       if (old->regs[i].type == NOT_INIT ||
> +                           old->regs[i].type == UNKNOWN_VALUE)
> +                               continue;
> +                       return false;
> +               }
> +       }
> +
> +       for (i = 0; i < MAX_BPF_STACK; i++) {
> +               if (memcmp(&old->stack[i], &cur->stack[i],
> +                          sizeof(old->stack[0])) != 0) {
> +                       if (old->stack[i].stype == STACK_INVALID)
> +                               continue;
> +                       return false;
> +               }
> +       }
> +       return true;
> +}
> +
> +static int is_state_visited(struct verifier_env *env, int insn_idx)
> +{
> +       struct verifier_state_list *new_sl;
> +       struct verifier_state_list *sl;
> +
> +       sl = env->branch_landing[insn_idx];
> +       if (!sl)
> +               /* no branch jump to this insn, ignore it */
> +               return 0;
> +
> +       while (sl != STATE_END) {
> +               if (states_equal(&sl->state, &env->cur_state))
> +                       /* reached equivalent register/stack state,
> +                        * prune the search
> +                        */
> +                       return 1;
> +               sl = sl->next;
> +       }
> +       new_sl = kmalloc(sizeof(struct verifier_state_list), GFP_KERNEL);
> +
> +       if (!new_sl)
> +               /* ignore ENOMEM, it doesn't affect correctness */
> +               return 0;
> +
> +       /* add new state to the head of linked list */
> +       memcpy(&new_sl->state, &env->cur_state, sizeof(env->cur_state));
> +       new_sl->next = env->branch_landing[insn_idx];
> +       env->branch_landing[insn_idx] = new_sl;
> +       return 0;
> +}
> +
> +static int do_check(struct verifier_env *env)
> +{
> +       struct verifier_state *state = &env->cur_state;
> +       struct bpf_insn *insns = env->prog->insnsi;
> +       struct reg_state *regs = state->regs;
> +       int insn_cnt = env->prog->len;
> +       int insn_idx, prev_insn_idx = 0;
> +       int insn_processed = 0;
> +       bool do_print_state = false;
> +
> +       init_reg_state(regs);
> +       insn_idx = 0;
> +       for (;;) {
> +               struct bpf_insn *insn;
> +               u8 class;
> +
> +               if (insn_idx >= insn_cnt) {
> +                       verbose("invalid insn idx %d insn_cnt %d\n",
> +                               insn_idx, insn_cnt);
> +                       return -EFAULT;
> +               }
> +
> +               insn = &insns[insn_idx];
> +               class = BPF_CLASS(insn->code);
> +
> +               if (++insn_processed > 32768) {
> +                       verbose("BPF program is too large. Processed %d insn\n",
> +                               insn_processed);
> +                       return -E2BIG;
> +               }
> +
> +               if (is_state_visited(env, insn_idx)) {
> +                       if (verbose_on) {
> +                               if (do_print_state)
> +                                       pr_cont("\nfrom %d to %d: safe\n",
> +                                               prev_insn_idx, insn_idx);
> +                               else
> +                                       pr_cont("%d: safe\n", insn_idx);
> +                       }
> +                       goto process_bpf_exit;
> +               }
> +
> +               if (verbose_on && do_print_state) {
> +                       pr_cont("\nfrom %d to %d:", prev_insn_idx, insn_idx);
> +                       pr_cont_verifier_state(env);
> +                       do_print_state = false;
> +               }
> +
> +               if (verbose_on) {
> +                       pr_cont("%d: ", insn_idx);
> +                       pr_cont_bpf_insn(insn);
> +               }
> +
> +               if (class == BPF_ALU || class == BPF_ALU64) {
> +                       _(check_alu_op(regs, insn));
> +
> +               } else if (class == BPF_LDX) {
> +                       if (BPF_MODE(insn->code) != BPF_MEM)
> +                               return -EINVAL;
> +
> +                       /* check src operand */
> +                       _(check_reg_arg(regs, insn->src_reg, 1));
> +
> +                       _(check_mem_access(env, insn->src_reg, insn->off,
> +                                          BPF_SIZE(insn->code), BPF_READ,
> +                                          insn->dst_reg));
> +
> +                       /* dest reg state will be updated by mem_access */
> +
> +               } else if (class == BPF_STX) {
> +                       /* check src1 operand */
> +                       _(check_reg_arg(regs, insn->src_reg, 1));
> +                       /* check src2 operand */
> +                       _(check_reg_arg(regs, insn->dst_reg, 1));
> +                       _(check_mem_access(env, insn->dst_reg, insn->off,
> +                                          BPF_SIZE(insn->code), BPF_WRITE,
> +                                          insn->src_reg));
> +
> +               } else if (class == BPF_ST) {
> +                       if (BPF_MODE(insn->code) != BPF_MEM)
> +                               return -EINVAL;
> +                       /* check src operand */
> +                       _(check_reg_arg(regs, insn->dst_reg, 1));
> +                       _(check_mem_access(env, insn->dst_reg, insn->off,
> +                                          BPF_SIZE(insn->code), BPF_WRITE,
> +                                          -1));
> +
> +               } else if (class == BPF_JMP) {
> +                       u8 opcode = BPF_OP(insn->code);
> +
> +                       if (opcode == BPF_CALL) {
> +                               _(check_call(env, insn->imm));
> +                       } else if (opcode == BPF_JA) {
> +                               if (BPF_SRC(insn->code) != BPF_X)
> +                                       return -EINVAL;
> +                               insn_idx += insn->off + 1;
> +                               continue;
> +                       } else if (opcode == BPF_EXIT) {
> +                               /* eBPF calling convention is such that R0 is used
> +                                * to return the value from eBPF program.
> +                                * Make sure that it's readable at this time
> +                                * of bpf_exit, which means that program wrote
> +                                * something into it earlier
> +                                */
> +                               _(check_reg_arg(regs, BPF_REG_0, 1));
> +process_bpf_exit:
> +                               insn_idx = pop_stack(env, &prev_insn_idx);
> +                               if (insn_idx < 0) {
> +                                       break;
> +                               } else {
> +                                       do_print_state = true;
> +                                       continue;
> +                               }
> +                       } else {
> +                               _(check_cond_jmp_op(env, insn, &insn_idx));
> +                       }
> +               } else if (class == BPF_LD) {
> +                       _(check_ld_abs(env, insn));
> +               } else {
> +                       verbose("unknown insn class %d\n", class);
> +                       return -EINVAL;
> +               }
> +
> +               insn_idx++;
> +       }
> +
> +       return 0;
> +}
> +
> +static void free_states(struct verifier_env *env, int insn_cnt)
> +{
> +       struct verifier_state_list *sl, *sln;
> +       int i;
> +
> +       for (i = 0; i < insn_cnt; i++) {
> +               sl = env->branch_landing[i];
> +
> +               if (sl)
> +                       while (sl != STATE_END) {
> +                               sln = sl->next;
> +                               kfree(sl);
> +                               sl = sln;
> +                       }
> +       }
> +
> +       kfree(env->branch_landing);
> +}
> +
> +int bpf_check(struct sk_filter *prog)
> +{
> +       struct verifier_env *env;
> +       int ret;
> +
> +       if (prog->len <= 0 || prog->len > BPF_MAXINSNS)
> +               return -E2BIG;
> +
> +       env = kzalloc(sizeof(struct verifier_env), GFP_KERNEL);
> +       if (!env)
> +               return -ENOMEM;
> +
> +       verbose_on = false;
> +retry:
> +       env->prog = prog;
> +       env->branch_landing = kcalloc(prog->len,
> +                                     sizeof(struct verifier_state_list *),
> +                                     GFP_KERNEL);
> +
> +       if (!env->branch_landing) {
> +               kfree(env);
> +               return -ENOMEM;
> +       }
> +
> +       ret = check_cfg(env);
> +       if (ret < 0)
> +               goto free_env;
> +
> +       ret = do_check(env);
> +
> +free_env:
> +       while (pop_stack(env, NULL) >= 0);
> +       free_states(env, prog->len);
> +
> +       if (ret < 0 && !verbose_on && capable(CAP_SYS_ADMIN)) {
> +               /* verification failed, redo it with verbose on */
> +               memset(env, 0, sizeof(struct verifier_env));
> +               verbose_on = true;
> +               goto retry;
> +       }
> +
> +       if (ret == 0 && env->used_map_cnt) {
> +               /* if program passed verifier, update used_maps in bpf_prog_info */
> +               prog->info->used_maps = kmalloc_array(env->used_map_cnt,
> +                                                     sizeof(u32), GFP_KERNEL);
> +               if (!prog->info->used_maps) {
> +                       kfree(env);
> +                       return -ENOMEM;
> +               }
> +               memcpy(prog->info->used_maps, env->used_maps,
> +                      sizeof(u32) * env->used_map_cnt);
> +               prog->info->used_map_cnt = env->used_map_cnt;
> +       }
> +
> +       kfree(env);
> +       return ret;
> +}
> --
> 1.7.9.5
>

Unless I've overlooked something, I think this needs much stricter
evaluation of register numbers, offsets, and sizes.

-Kees

-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 13/16] tracing: allow eBPF programs to be attached to events
  2014-07-18  4:20 ` [PATCH RFC v2 net-next 13/16] tracing: allow eBPF programs to be attached to events Alexei Starovoitov
@ 2014-07-23 23:46   ` Kees Cook
  2014-07-24  0:06       ` Alexei Starovoitov
  0 siblings, 1 reply; 62+ messages in thread
From: Kees Cook @ 2014-07-23 23:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Thu, Jul 17, 2014 at 9:20 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> User interface:
> fd = open("/sys/kernel/debug/tracing/__event__/filter")
>
> write(fd, "bpf_123")
>
> where 123 is process local FD associated with eBPF program previously loaded.
> __event__ is static tracepoint event.
> (kprobe events will be supported in the future patches)
> Once program is successfully attached to tracepoint event, the tracepoint
> will be auto-enabled
>
> close(fd)
> auto-disables tracepoint event and detaches eBPF program from it
>
> eBPF programs can call in-kernel helper functions to:
> - lookup/update/delete elements in maps
> - memcmp
> - trace_printk
> - load_pointer
> - dump_stack

Ah, this must be the pointer leaking you mentioned. :)

>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>  include/linux/ftrace_event.h       |    5 +
>  include/trace/bpf_trace.h          |   29 +++++
>  include/trace/ftrace.h             |   10 ++
>  include/uapi/linux/bpf.h           |    5 +
>  kernel/trace/Kconfig               |    1 +
>  kernel/trace/Makefile              |    1 +
>  kernel/trace/bpf_trace.c           |  212 ++++++++++++++++++++++++++++++++++++
>  kernel/trace/trace.h               |    3 +
>  kernel/trace/trace_events.c        |   36 +++++-
>  kernel/trace/trace_events_filter.c |   72 +++++++++++-
>  10 files changed, 372 insertions(+), 2 deletions(-)
>  create mode 100644 include/trace/bpf_trace.h
>  create mode 100644 kernel/trace/bpf_trace.c
>
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index cff3106ffe2c..de313bd9a434 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -237,6 +237,7 @@ enum {
>         TRACE_EVENT_FL_WAS_ENABLED_BIT,
>         TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
>         TRACE_EVENT_FL_TRACEPOINT_BIT,
> +       TRACE_EVENT_FL_BPF_BIT,
>  };
>
>  /*
> @@ -259,6 +260,7 @@ enum {
>         TRACE_EVENT_FL_WAS_ENABLED      = (1 << TRACE_EVENT_FL_WAS_ENABLED_BIT),
>         TRACE_EVENT_FL_USE_CALL_FILTER  = (1 << TRACE_EVENT_FL_USE_CALL_FILTER_BIT),
>         TRACE_EVENT_FL_TRACEPOINT       = (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
> +       TRACE_EVENT_FL_BPF              = (1 << TRACE_EVENT_FL_BPF_BIT),
>  };
>
>  struct ftrace_event_call {
> @@ -536,6 +538,9 @@ event_trigger_unlock_commit_regs(struct ftrace_event_file *file,
>                 event_triggers_post_call(file, tt);
>  }
>
> +struct bpf_context;
> +void trace_filter_call_bpf(struct event_filter *filter, struct bpf_context *ctx);
> +
>  enum {
>         FILTER_OTHER = 0,
>         FILTER_STATIC_STRING,
> diff --git a/include/trace/bpf_trace.h b/include/trace/bpf_trace.h
> new file mode 100644
> index 000000000000..2122437f1317
> --- /dev/null
> +++ b/include/trace/bpf_trace.h
> @@ -0,0 +1,29 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#ifndef _LINUX_KERNEL_BPF_TRACE_H
> +#define _LINUX_KERNEL_BPF_TRACE_H
> +
> +/* For tracing filters save first six arguments of tracepoint events.
> + * On 64-bit architectures argN fields will match one to one to arguments passed
> + * to tracepoint events.
> + * On 32-bit architectures u64 arguments to events will be split into two
> + * consecutive argN, argN+1 fields. Pointers, u32, u16, u8, bool types will
> + * match one to one
> + */
> +struct bpf_context {
> +       unsigned long arg1;
> +       unsigned long arg2;
> +       unsigned long arg3;
> +       unsigned long arg4;
> +       unsigned long arg5;
> +       unsigned long arg6;
> +};
> +
> +/* call from ftrace_raw_event_*() to copy tracepoint arguments into ctx */
> +void populate_bpf_context(struct bpf_context *ctx, ...);
> +
> +#endif /* _LINUX_KERNEL_BPF_TRACE_H */
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index 26b4f2e13275..ad4987ac68bb 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -17,6 +17,7 @@
>   */
>
>  #include <linux/ftrace_event.h>
> +#include <trace/bpf_trace.h>
>
>  /*
>   * DECLARE_EVENT_CLASS can be used to add a generic function
> @@ -634,6 +635,15 @@ ftrace_raw_event_##call(void *__data, proto)                               \
>         if (ftrace_trigger_soft_disabled(ftrace_file))                  \
>                 return;                                                 \
>                                                                         \
> +       if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) &&  \
> +           unlikely(ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF)) { \
> +               struct bpf_context __ctx;                               \
> +                                                                       \
> +               populate_bpf_context(&__ctx, args, 0, 0, 0, 0, 0);      \
> +               trace_filter_call_bpf(ftrace_file->filter, &__ctx);     \
> +               return;                                                 \
> +       }                                                               \
> +                                                                       \
>         __data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
>                                                                         \
>         entry = ftrace_event_buffer_reserve(&fbuffer, ftrace_file,      \
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 06e0f63055fb..cedcf9a0db53 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -370,6 +370,7 @@ enum bpf_prog_attributes {
>  enum bpf_prog_type {
>         BPF_PROG_TYPE_UNSPEC,
>         BPF_PROG_TYPE_SOCKET_FILTER,
> +       BPF_PROG_TYPE_TRACING_FILTER,
>  };
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> @@ -380,6 +381,10 @@ enum bpf_func_id {
>         BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(map_id, void *key) */
>         BPF_FUNC_map_update_elem, /* int map_update_elem(map_id, void *key, void *value) */
>         BPF_FUNC_map_delete_elem, /* int map_delete_elem(map_id, void *key) */
> +       BPF_FUNC_load_pointer,    /* void *bpf_load_pointer(void *unsafe_ptr) */
> +       BPF_FUNC_memcmp,          /* int bpf_memcmp(void *unsafe_ptr, void *safe_ptr, int size) */
> +       BPF_FUNC_dump_stack,      /* void bpf_dump_stack(void) */
> +       BPF_FUNC_printk,          /* int bpf_printk(const char *fmt, int fmt_size, ...) */
>         __BPF_FUNC_MAX_ID,
>  };
>
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index d4409356f40d..e36d42876634 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -80,6 +80,7 @@ config FTRACE_NMI_ENTER
>
>  config EVENT_TRACING
>         select CONTEXT_SWITCH_TRACER
> +       depends on NET
>         bool
>
>  config CONTEXT_SWITCH_TRACER
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index 2611613f14f1..a0fcfd97101d 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -52,6 +52,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o
>  endif
>  obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
>  obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
> +obj-$(CONFIG_EVENT_TRACING) += bpf_trace.o

Can the existing tracing mechanisms already expose kernel addresses? I
suspect "yes". So I guess existing limitations on tracing exposure
should already cover access control here? (I'm trying to figure out if
a separate CONFIG is needed -- I don't think so: nothing "new" is
exposed via eBPF, is that right?)

-Kees

>  obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
>  obj-$(CONFIG_TRACEPOINTS) += power-traces.o
>  ifeq ($(CONFIG_PM_RUNTIME),y)
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> new file mode 100644
> index 000000000000..7263491be792
> --- /dev/null
> +++ b/kernel/trace/bpf_trace.c
> @@ -0,0 +1,212 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/slab.h>
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/uaccess.h>
> +#include <trace/bpf_trace.h>
> +#include "trace.h"
> +
> +/* call from ftrace_raw_event_*() to copy tracepoint arguments into ctx */
> +void populate_bpf_context(struct bpf_context *ctx, ...)
> +{
> +       va_list args;
> +
> +       va_start(args, ctx);
> +
> +       ctx->arg1 = va_arg(args, unsigned long);
> +       ctx->arg2 = va_arg(args, unsigned long);
> +       ctx->arg3 = va_arg(args, unsigned long);
> +       ctx->arg4 = va_arg(args, unsigned long);
> +       ctx->arg5 = va_arg(args, unsigned long);
> +       ctx->arg6 = va_arg(args, unsigned long);
> +
> +       va_end(args);
> +}
> +EXPORT_SYMBOL_GPL(populate_bpf_context);
> +
> +/* called from eBPF program with rcu lock held */
> +static u64 bpf_load_ptr(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> +{
> +        void *unsafe_ptr = (void *) r1;
> +       void *ptr = NULL;
> +
> +       probe_kernel_read(&ptr, unsafe_ptr, sizeof(void *));
> +       return (u64) (unsigned long) ptr;
> +}
> +
> +static u64 bpf_memcmp(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> +{
> +        void *unsafe_ptr = (void *) r1;
> +       void *safe_ptr = (void *) r2;
> +       u32 size = (u32) r3;
> +       char buf[64];
> +       int err;
> +
> +       if (size < 64) {
> +               err = probe_kernel_read(buf, unsafe_ptr, size);
> +               if (err)
> +                       return err;
> +               return memcmp(buf, safe_ptr, size);
> +       }
> +       return -1;
> +}
> +
> +static u64 bpf_dump_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> +{
> +       trace_dump_stack(0);
> +       return 0;
> +}
> +
> +/* limited printk()
> + * only %d %u %x conversion specifiers allowed
> + */
> +static u64 bpf_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5)
> +{
> +       char *fmt = (char *) r1;
> +       int fmt_cnt = 0;
> +       int i;
> +
> +       /* bpf_check() guarantees that fmt points to bpf program stack and
> +        * fmt_size bytes of it were initialized by bpf program
> +        */
> +       if (fmt[fmt_size - 1] != 0)
> +               return -EINVAL;
> +
> +       /* check format string for allowed specifiers */
> +       for (i = 0; i < fmt_size; i++)
> +               if (fmt[i] == '%') {
> +                       if (i + 1 >= fmt_size)
> +                               return -EINVAL;
> +                       if (fmt[i + 1] != 'd' && fmt[i + 1] != 'u' &&
> +                           fmt[i + 1] != 'x')
> +                               return -EINVAL;
> +                       fmt_cnt++;
> +               }
> +
> +       if (fmt_cnt > 3)
> +               return -EINVAL;
> +
> +       return __trace_printk((unsigned long) __builtin_return_address(3), fmt,
> +                             (u32) r3, (u32) r4, (u32) r5);
> +}
> +
> +static struct bpf_func_proto tracing_filter_funcs[] = {
> +       [BPF_FUNC_load_pointer] = {
> +               .func = bpf_load_ptr,
> +               .gpl_only = true,
> +               .ret_type = RET_INTEGER,
> +       },
> +       [BPF_FUNC_memcmp] = {
> +               .func = bpf_memcmp,
> +               .gpl_only = false,
> +               .ret_type = RET_INTEGER,
> +               .arg1_type = ARG_ANYTHING,
> +               .arg2_type = ARG_PTR_TO_STACK,
> +               .arg3_type = ARG_CONST_STACK_SIZE,
> +       },
> +       [BPF_FUNC_dump_stack] = {
> +               .func = bpf_dump_stack,
> +               .gpl_only = false,
> +               .ret_type = RET_VOID,
> +       },
> +       [BPF_FUNC_printk] = {
> +               .func = bpf_printk,
> +               .gpl_only = true,
> +               .ret_type = RET_INTEGER,
> +               .arg1_type = ARG_PTR_TO_STACK,
> +               .arg2_type = ARG_CONST_STACK_SIZE,
> +       },
> +       [BPF_FUNC_map_lookup_elem] = {
> +               .func = bpf_map_lookup_elem,
> +               .gpl_only = false,
> +               .ret_type = RET_PTR_TO_MAP_OR_NULL,
> +               .arg1_type = ARG_CONST_MAP_ID,
> +               .arg2_type = ARG_PTR_TO_MAP_KEY,
> +       },
> +       [BPF_FUNC_map_update_elem] = {
> +               .func = bpf_map_update_elem,
> +               .gpl_only = false,
> +               .ret_type = RET_INTEGER,
> +               .arg1_type = ARG_CONST_MAP_ID,
> +               .arg2_type = ARG_PTR_TO_MAP_KEY,
> +               .arg3_type = ARG_PTR_TO_MAP_VALUE,
> +       },
> +       [BPF_FUNC_map_delete_elem] = {
> +               .func = bpf_map_delete_elem,
> +               .gpl_only = false,
> +               .ret_type = RET_INTEGER,
> +               .arg1_type = ARG_CONST_MAP_ID,
> +               .arg2_type = ARG_PTR_TO_MAP_KEY,
> +       },
> +};
> +
> +static const struct bpf_func_proto *tracing_filter_func_proto(enum bpf_func_id func_id)
> +{
> +       if (func_id < 0 || func_id >= ARRAY_SIZE(tracing_filter_funcs))
> +               return NULL;
> +       return &tracing_filter_funcs[func_id];
> +}
> +
> +static const struct bpf_context_access {
> +       int size;
> +       enum bpf_access_type type;
> +} tracing_filter_ctx_access[] = {
> +       [offsetof(struct bpf_context, arg1)] = {
> +               FIELD_SIZEOF(struct bpf_context, arg1),
> +               BPF_READ
> +       },
> +       [offsetof(struct bpf_context, arg2)] = {
> +               FIELD_SIZEOF(struct bpf_context, arg2),
> +               BPF_READ
> +       },
> +       [offsetof(struct bpf_context, arg3)] = {
> +               FIELD_SIZEOF(struct bpf_context, arg3),
> +               BPF_READ
> +       },
> +       [offsetof(struct bpf_context, arg4)] = {
> +               FIELD_SIZEOF(struct bpf_context, arg4),
> +               BPF_READ
> +       },
> +       [offsetof(struct bpf_context, arg5)] = {
> +               FIELD_SIZEOF(struct bpf_context, arg5),
> +               BPF_READ
> +       },
> +};
> +
> +static bool tracing_filter_is_valid_access(int off, int size, enum bpf_access_type type)
> +{
> +       const struct bpf_context_access *access;
> +
> +       if (off < 0 || off >= ARRAY_SIZE(tracing_filter_ctx_access))
> +               return false;
> +
> +       access = &tracing_filter_ctx_access[off];
> +       if (access->size == size && (access->type & type))
> +               return true;
> +
> +       return false;
> +}
> +
> +static struct bpf_verifier_ops tracing_filter_ops = {
> +       .get_func_proto = tracing_filter_func_proto,
> +       .is_valid_access = tracing_filter_is_valid_access,
> +};
> +
> +static struct bpf_prog_type_list tl = {
> +       .ops = &tracing_filter_ops,
> +       .type = BPF_PROG_TYPE_TRACING_FILTER,
> +};
> +
> +static int __init register_tracing_filter_ops(void)
> +{
> +       bpf_register_prog_type(&tl);
> +       return 0;
> +}
> +late_initcall(register_tracing_filter_ops);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 9258f5a815db..bb7c6a19ead5 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -984,12 +984,15 @@ struct ftrace_event_field {
>         int                     is_signed;
>  };
>
> +struct sk_filter;
> +
>  struct event_filter {
>         int                     n_preds;        /* Number assigned */
>         int                     a_preds;        /* allocated */
>         struct filter_pred      *preds;
>         struct filter_pred      *root;
>         char                    *filter_string;
> +       struct sk_filter        *prog;
>  };
>
>  struct event_subsystem {
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index f99e0b3bca8c..de79c27a0a42 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -1048,6 +1048,26 @@ event_filter_read(struct file *filp, char __user *ubuf, size_t cnt,
>         return r;
>  }
>
> +static int event_filter_release(struct inode *inode, struct file *filp)
> +{
> +       struct ftrace_event_file *file;
> +       char buf[2] = "0";
> +
> +       mutex_lock(&event_mutex);
> +       file = event_file_data(filp);
> +       if (file) {
> +               if (file->event_call->flags & TRACE_EVENT_FL_BPF) {
> +                       /* auto-disable the filter */
> +                       ftrace_event_enable_disable(file, 0);
> +
> +                       /* if BPF filter was used, clear it on fd close */
> +                       apply_event_filter(file, buf);
> +               }
> +       }
> +       mutex_unlock(&event_mutex);
> +       return 0;
> +}
> +
>  static ssize_t
>  event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
>                    loff_t *ppos)
> @@ -1071,10 +1091,23 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
>
>         mutex_lock(&event_mutex);
>         file = event_file_data(filp);
> -       if (file)
> +       if (file) {
>                 err = apply_event_filter(file, buf);
> +               if (!err && file->event_call->flags & TRACE_EVENT_FL_BPF)
> +                       /* once filter is applied, auto-enable it */
> +                       ftrace_event_enable_disable(file, 1);
> +       }
> +
>         mutex_unlock(&event_mutex);
>
> +       if (file && file->event_call->flags & TRACE_EVENT_FL_BPF) {
> +               /*
> +                * allocate per-cpu printk buffers, since eBPF program
> +                * might be calling bpf_trace_printk
> +                */
> +               trace_printk_init_buffers();
> +       }
> +
>         free_page((unsigned long) buf);
>         if (err < 0)
>                 return err;
> @@ -1325,6 +1358,7 @@ static const struct file_operations ftrace_event_filter_fops = {
>         .open = tracing_open_generic,
>         .read = event_filter_read,
>         .write = event_filter_write,
> +       .release = event_filter_release,
>         .llseek = default_llseek,
>  };
>
> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
> index 8a8631926a07..a27526fae0fe 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -23,6 +23,9 @@
>  #include <linux/mutex.h>
>  #include <linux/perf_event.h>
>  #include <linux/slab.h>
> +#include <linux/bpf.h>
> +#include <trace/bpf_trace.h>
> +#include <linux/filter.h>
>
>  #include "trace.h"
>  #include "trace_output.h"
> @@ -535,6 +538,16 @@ static int filter_match_preds_cb(enum move_type move, struct filter_pred *pred,
>         return WALK_PRED_DEFAULT;
>  }
>
> +void trace_filter_call_bpf(struct event_filter *filter, struct bpf_context *ctx)
> +{
> +       BUG_ON(!filter || !filter->prog);
> +
> +       rcu_read_lock();
> +       SK_RUN_FILTER(filter->prog, (void *) ctx);
> +       rcu_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(trace_filter_call_bpf);
> +
>  /* return 1 if event matches, 0 otherwise (discard) */
>  int filter_match_preds(struct event_filter *filter, void *rec)
>  {
> @@ -794,6 +807,8 @@ static void __free_filter(struct event_filter *filter)
>         if (!filter)
>                 return;
>
> +       if (filter->prog)
> +               sk_unattached_filter_destroy(filter->prog);
>         __free_preds(filter);
>         kfree(filter->filter_string);
>         kfree(filter);
> @@ -1898,6 +1913,48 @@ static int create_filter_start(char *filter_str, bool set_str,
>         return err;
>  }
>
> +static int create_filter_bpf(char *filter_str, struct event_filter **filterp)
> +{
> +       struct event_filter *filter;
> +       struct sk_filter *prog;
> +       long ufd;
> +       int err = 0;
> +
> +       *filterp = NULL;
> +
> +       filter = __alloc_filter();
> +       if (!filter)
> +               return -ENOMEM;
> +
> +       err = replace_filter_string(filter, filter_str);
> +       if (err)
> +               goto free_filter;
> +
> +       err = kstrtol(filter_str + 4, 0, &ufd);
> +       if (err)
> +               goto free_filter;
> +
> +       err = -ESRCH;
> +       prog = bpf_prog_get(ufd);
> +       if (!prog)
> +               goto free_filter;
> +
> +       filter->prog = prog;
> +
> +       err = -EINVAL;
> +       if (prog->info->prog_type != BPF_PROG_TYPE_TRACING_FILTER)
> +               /* prog_id is valid, but it's not a tracing filter program */
> +               goto free_filter;
> +
> +       *filterp = filter;
> +
> +       return 0;
> +
> +free_filter:
> +       __free_filter(filter);
> +       return err;
> +}
> +
>  static void create_filter_finish(struct filter_parse_state *ps)
>  {
>         if (ps) {
> @@ -2007,7 +2064,20 @@ int apply_event_filter(struct ftrace_event_file *file, char *filter_string)
>                 return 0;
>         }
>
> -       err = create_filter(call, filter_string, true, &filter);
> +       /*
> +        * 'bpf_123' string is a request to attach eBPF program with id == 123
> +        * also accept 'bpf 123', 'bpf.123', 'bpf-123' variants
> +        */
> +       if (memcmp(filter_string, "bpf", 3) == 0 && filter_string[3] != 0 &&
> +           filter_string[4] != 0) {
> +               err = create_filter_bpf(filter_string, &filter);
> +               if (!err)
> +                       call->flags |= TRACE_EVENT_FL_BPF;
> +       } else {
> +               err = create_filter(call, filter_string, true, &filter);
> +               if (!err)
> +                       call->flags &= ~TRACE_EVENT_FL_BPF;
> +       }
>
>         /*
>          * Always swap the call filter with the new filter
> --
> 1.7.9.5
>



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 13/16] tracing: allow eBPF programs to be attached to events
@ 2014-07-24  0:06       ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-24  0:06 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Wed, Jul 23, 2014 at 4:46 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> eBPF programs can call in-kernel helper functions to:
>> - lookup/update/delete elements in maps
>> - memcmp
>> - trace_printk
>> - load_pointer
>> - dump_stack
>
> Ah, this must be the pointer leaking you mentioned. :)
>
>
> Can the existing tracing mechanisms already expose kernel addresses? I
> suspect "yes". So I guess existing limitations on tracing exposure
> should already cover access control here? (I'm trying to figure out if
> a separate CONFIG is needed -- I don't think so: nothing "new" is
> exposed via eBPF, is that right?)

Correct. Through debugfs/tracing the whole kernel is already exposed.
The idea of eBPF for tracing is to give kernel developers and performance
engineers a tool to analyze what the kernel is doing by writing programs
in C and attaching them to kprobe/tracepoint events, so it's definitely
for root only.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 10/16] bpf: add eBPF verifier
  2014-07-23 23:38     ` Kees Cook
  (?)
@ 2014-07-24  0:48     ` Alexei Starovoitov
  -1 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-24  0:48 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Andy Lutomirski,
	Steven Rostedt, Daniel Borkmann, Chema Gonzalez, Eric Dumazet,
	Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa,
	Thomas Gleixner, H. Peter Anvin, Andrew Morton, Linux API,
	Network Development, LKML

On Wed, Jul 23, 2014 at 4:38 PM, Kees Cook <keescook@chromium.org> wrote:
>> +Program that doesn't check return value of map_lookup_elem() before accessing
>> +map element:
>> +  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
>> +  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
>> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
>> +  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
>> +  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
>
> Is the expectation that these pointers are direct kernel function
> addresses? It looks like they're indexes in the check_call routine
> below. What specifically were the pointer leaks you'd mentioned?

Yes, the pointer returned from map_lookup_elem() is a direct pointer
to the map element value. If a program prints it, that is obviously a leak.
Therefore I'm planning to add a 'secure' mode to the verifier where such
pointer leaks are detected and rejected. This mode will be on for
any non-root syscall.
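
For contrast, a sketch of a checked variant of the example above, written in
the same instruction-macro style (BPF_JMP_IMM() and BPF_EXIT_INSN() are
assumed helper macros here, and the map value is assumed to be at least 8
bytes):

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  /* R0 is PTR_TO_MAP_OR_NULL at this point */
  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),   /* if R0 == NULL, skip the store */
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),     /* here R0 is PTR_TO_MAP, access ok */
  BPF_EXIT_INSN(),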

>> +#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
>
> This seems overly terse. :) And the meaning tends to be overloaded
> (this obviously isn't a translatable string, etc). Perhaps call it
> "chk" or "ret_fail"? And I think OP in the body should have ()s around
> it to avoid potential macro expansion silliness.

Sure, I'll wrap OP in ().
you've missed the previous thread about my favorite _ macro:
http://www.spinics.net/lists/netdev/msg288070.html
I think I gave a ton of 'pro' arguments already.
Looks like I'll have to order a bunch of t-shirts with '#define _()' on
them and give them to everyone at the next conference :)
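
For reference, a sketch of the macro with the operand parenthesized (same
body as before, just with OP wrapped):

  #define _(OP) ({ int ret = (OP); if (ret < 0) return ret; })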

>> +static const char *const bpf_jmp_string[] = {
>> +       "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call", "exit"
>> +};
>
> It seems like these string arrays should have literal initializers
> like reg_type_str does.

Yeah, good point. Will do.
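
A sketch of the designated-initializer form (the indices assume the standard
BPF opcode encodings, i.e. BPF_OP(code) >> 4):

  static const char *const bpf_jmp_string[] = {
          [BPF_JA >> 4]   = "jmp",
          [BPF_JEQ >> 4]  = "==",
          [BPF_JGT >> 4]  = ">",
          [BPF_JGE >> 4]  = ">=",
          [BPF_JSET >> 4] = "&",
          [BPF_JNE >> 4]  = "!=",
          [BPF_JSGT >> 4] = "s>",
          [BPF_JSGE >> 4] = "s>=",
          [BPF_CALL >> 4] = "call",
          [BPF_EXIT >> 4] = "exit",
  };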

>> +static int check_reg_arg(struct reg_state *regs, int regno, bool is_src)
>> +{
>
> Since regno is always populated with dst_reg/src_reg (u8 :4 sized),
> shouldn't this be u8 instead of int? (And in check_* below too?) More

Why? The 'int' type is much friendlier to the compiler. u8/u16 are a pain to
deal with, and unsigned types in general are much harder on the optimizer.

> importantly, regno needs bounds checking. MAX_BPF_REG is 10, but
> dst_reg/src_reg could be up to 15, IIUC.

Grr, yes. I somehow lost this check in this version. Good catch.
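
A minimal sketch of the missing bound check, to sit at the top of
check_reg_arg() (the message text and return code are assumptions):

  if (regno >= MAX_BPF_REG) {
          verbose("R%d is invalid\n", regno);
          return -EINVAL;
  }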

>> +       } else {
>> +               if (regno == BPF_REG_FP)
>> +                       /* frame pointer is read only */
>
> Why no verbose() call here?

No good reason. Will add.

>> +               slot = &state->stack[MAX_BPF_STACK + off];
>> +               slot->stype = STACK_SPILL;
>> +               /* save register state */
>> +               slot->type = state->regs[value_regno].type;
>> +               slot->imm = state->regs[value_regno].imm;
>> +               for (i = 1; i < 8; i++) {
>> +                       slot = &state->stack[MAX_BPF_STACK + off + i];
>
> off and size need bounds checking here and below.

off and size were already checked in check_mem_access().
Here size is 1, 2, 4 or 8 and off is within [-MAX_BPF_STACK, 0),
so no extra checks are needed.
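
That is, check_mem_access() is assumed to have already done something along
these lines for frame-pointer based accesses before calling the stack
read/write helpers (a sketch; the exact message is a guess). Together with
the off % size == 0 alignment check and size being 1, 2, 4 or 8, this
guarantees off + size <= 0:

  if (off >= 0 || off < -MAX_BPF_STACK) {
          verbose("invalid stack off=%d size=%d\n", off, size);
          return -EACCES;
  }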

>> +/* check read/write into map element returned by bpf_map_lookup_elem() */
>> +static int check_map_access(struct verifier_env *env, int regno, int off,
>> +                           int size)
>> +{
>> +       struct bpf_map *map;
>> +       int map_id = env->cur_state.regs[regno].imm;
>> +
>> +       _(get_map_info(env, map_id, &map));
>> +
>> +       if (off < 0 || off + size > map->value_size) {
>
> This could be tricked with a negative size, or a giant size, wrapping negative.

Nope, it cannot. check_map_access() is called from check_mem_access(),
where off and size were already checked.

>> +static int check_mem_access(struct verifier_env *env, int regno, int off,
>> +                           int bpf_size, enum bpf_access_type t,
>> +                           int value_regno)
>> +{
>> +       struct verifier_state *state = &env->cur_state;
>> +       int size;
>> +
>> +       _(size = bpf_size_to_bytes(bpf_size));
>> +
>> +       if (off % size != 0) {
>> +               verbose("misaligned access off %d size %d\n", off, size);
>> +               return -EACCES;
>> +       }
>
> I think more off and size checking is needed here.

I don't see the problem. This is the main entry point into the other checks.
The alignment check above is a common check for all memory accesses.
All other, stricter checks are in check_map_access(), check_stack_*() and
check_ctx_access(), which are called from this check_mem_access() function.
Why do you think more checking is needed?

>> +/* when register 'regno' is passed into function that will read 'access_size'
>> + * bytes from that pointer, make sure that it's within stack boundary
>> + * and all elements of stack are initialized
>> + */
>> +static int check_stack_boundary(struct verifier_env *env,
>> +                               int regno, int access_size)
>> +{
>> +       struct verifier_state *state = &env->cur_state;
>> +       struct reg_state *regs = state->regs;
>> +       int off, i;
>> +
>
> regno bounds checking needed.

Nope. check_stack_boundary() is called from check_func_arg(),
which is called only with constant regnos (1, 2, 3, 4, 5) to check function
arguments.

> Unless I've overlooked something, I think this needs much stricter
> evaluation of register numbers, offsets, and sizes.

Sorry to hear that the first glance was disappointing :)
I hope my explanation made it clearer.
The only check that I forgot to carry over during the last year is in
check_reg_arg(). Around November last year the verifier patches I keep
posting diverged a little bit from the ones we keep running in production,
since a few eBPF instructions got renamed, so I had to keep tracking the two.
Once this version gets upstreamed we can finally drop the internal one.
check_reg_arg() is indeed incorrect here. Will fix. That was a good catch.
Thank you for the review!

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 10/16] bpf: add eBPF verifier
@ 2014-07-24 18:25     ` Andy Lutomirski
  0 siblings, 0 replies; 62+ messages in thread
From: Andy Lutomirski @ 2014-07-24 18:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, Linux API,
	Network Development, linux-kernel

On Thu, Jul 17, 2014 at 9:20 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> Safety of eBPF programs is statically determined by the verifier, which detects:
> - loops
> - out of range jumps
> - unreachable instructions
> - invalid instructions
> - uninitialized register access
> - uninitialized stack access
> - misaligned stack access
> - out of range stack access
> - invalid calling convention

Is there something that documents exactly what conditions an eBPF
program must satisfy in order to be considered valid?

--Andy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 10/16] bpf: add eBPF verifier
@ 2014-07-24 19:25       ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-07-24 19:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, Linux API,
	Network Development, linux-kernel

On Thu, Jul 24, 2014 at 11:25 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jul 17, 2014 at 9:20 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> Safety of eBPF programs is statically determined by the verifier, which detects:
>> - loops
>> - out of range jumps
>> - unreachable instructions
>> - invalid instructions
>> - uninitialized register access
>> - uninitialized stack access
>> - misaligned stack access
>> - out of range stack access
>> - invalid calling convention
>
> Is there something that documents exactly what conditions an eBPF
> program must satisfy in order to be considered valid?

I did a writeup in the past on the things the verifier checks and gave it
to internal folks to review. They said that they now understand very
well how it works, but in reality it didn't help them at all to write valid
programs. What worked is the 'verification trace' = the instruction-by-instruction
dump of the verifier state while it's analyzing the program.
I gave a few simple examples of it in the
'Understanding eBPF verifier messages' section:
https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/diff/Documentation/networking/filter.txt?id=b22459133b9f52d2176c8c0f8b5eb036478a40c9
Every example there is what "program must satisfy to be valid"...

Therefore I'm addressing two things:
1. how verifier works and what it checks for.
  that is described in 'eBPF verifier' section of the doc and
  in 200 lines of comments inside verifier.c
2. how to write valid programs
 that's more important one, since it's a key to happy users.
 'verification trace' is the first step. I'm planning to add debug info and
 user space tool that points out to line in C instead of assembler trace.
 In other words to bring errors to user as early as possible during
 compilation process.
 This is not a concern when programs are written in assembler,
 since the programs will be much shorter and thought through by
 the author. However I don't think there will be too many users
 willing to understand ebpf assembler.

I suspect you're more concerned about #1 at this point whereas
I'm concerned about #2.

^ permalink raw reply	[flat|nested] 62+ messages in thread
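
Following up on the 'verification trace' point in the message above, here is a hedged
sketch of how a user space loader could capture that trace at load time. The attribute
names (insns, insn_cnt, license, log_buf, log_size, log_level) follow the bpf(2)
interface as it was eventually merged; the RFC v2 layout may differ, so treat this as
illustrative only, not as this patch set's API.

/* Hedged sketch: request the verifier's state trace while loading a program. */
#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static char verifier_log[65536];

static int load_prog(const struct bpf_insn *insns, int insn_cnt)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
	attr.insns     = (__u64)(unsigned long)insns;
	attr.insn_cnt  = insn_cnt;
	attr.license   = (__u64)(unsigned long)"GPL";
	attr.log_buf   = (__u64)(unsigned long)verifier_log;
	attr.log_size  = sizeof(verifier_log);
	attr.log_level = 1;	/* ask the verifier to emit its trace */

	int fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
	if (fd < 0)
		fprintf(stderr, "verifier rejected the program:\n%s\n",
			verifier_log);
	return fd;
}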

* Re: [PATCH RFC v2 net-next 10/16] bpf: add eBPF verifier
@ 2014-08-12 19:32         ` Andy Lutomirski
  0 siblings, 0 replies; 62+ messages in thread
From: Andy Lutomirski @ 2014-08-12 19:32 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, Linux API,
	Network Development, linux-kernel

On Thu, Jul 24, 2014 at 12:25 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Thu, Jul 24, 2014 at 11:25 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, Jul 17, 2014 at 9:20 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> Safety of eBPF programs is statically determined by the verifier, which detects:
>>> - loops
>>> - out of range jumps
>>> - unreachable instructions
>>> - invalid instructions
>>> - uninitialized register access
>>> - uninitialized stack access
>>> - misaligned stack access
>>> - out of range stack access
>>> - invalid calling convention
>>
>> Is there something that documents exactly what conditions an eBPF
>> program must satisfy in order to be considered valid?
>
> I did a writeup in the past on things that the verifier checks and gave it
> to internal folks to review. They said that they now understand very well
> how it works, but in reality it didn't help them write valid programs at all.
> What worked was the 'verification trace': the instruction-by-instruction dump
> of verifier state while it analyzes the program.
> I gave a few simple examples of it in the
> 'Understanding eBPF verifier messages' section:
> https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/diff/Documentation/networking/filter.txt?id=b22459133b9f52d2176c8c0f8b5eb036478a40c9
> Every example there shows what a "program must satisfy to be valid"...
>
> Therefore I'm addressing two things:
> 1. how the verifier works and what it checks for.
>   That is described in the 'eBPF verifier' section of the doc and
>   in 200 lines of comments inside verifier.c

That doc is pretty good.  I'll try to read it carefully soon.  Sorry
for the huge delay here -- I've been on vacation.

--Andy

> 2. how to write valid programs.
>  That's the more important one, since it's the key to happy users.
>  The 'verification trace' is the first step. I'm planning to add debug info
>  and a user space tool that points to the line of C instead of the assembler
>  trace. In other words, to bring errors to the user as early as possible
>  during the compilation process.
>  This is not a concern when programs are written in assembler, since such
>  programs will be much shorter and thought through by the author. However,
>  I don't think there will be many users willing to learn eBPF assembler.
>
> I suspect you're more concerned about #1 at this point, whereas
> I'm concerned about #2.



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 10/16] bpf: add eBPF verifier
@ 2014-08-12 20:00           ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-08-12 20:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, Linux API,
	Network Development, linux-kernel

On Tue, Aug 12, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jul 24, 2014 at 12:25 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Thu, Jul 24, 2014 at 11:25 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Thu, Jul 17, 2014 at 9:20 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>> Safety of eBPF programs is statically determined by the verifier, which detects:
>>>> - loops
>>>> - out of range jumps
>>>> - unreachable instructions
>>>> - invalid instructions
>>>> - uninitialized register access
>>>> - uninitialized stack access
>>>> - misaligned stack access
>>>> - out of range stack access
>>>> - invalid calling convention
>>>
>>> Is there something that documents exactly what conditions an eBPF
>>> program must satisfy in order to be considered valid?
>>
>> I did a writeup in the past on things that the verifier checks and gave it
>> to internal folks to review. They said that they now understand very well
>> how it works, but in reality it didn't help them write valid programs at all.
>> What worked was the 'verification trace': the instruction-by-instruction dump
>> of verifier state while it analyzes the program.
>> I gave a few simple examples of it in the
>> 'Understanding eBPF verifier messages' section:
>> https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/diff/Documentation/networking/filter.txt?id=b22459133b9f52d2176c8c0f8b5eb036478a40c9
>> Every example there shows what a "program must satisfy to be valid"...
>>
>> Therefore I'm addressing two things:
>> 1. how the verifier works and what it checks for.
>>   That is described in the 'eBPF verifier' section of the doc and
>>   in 200 lines of comments inside verifier.c
>
> That doc is pretty good.  I'll try to read it carefully soon.  Sorry
> for the huge delay here -- I've been on vacation.

I've been sitting on v4 for a few weeks, since it's the merge window.
So please hold off on a careful review. I'll post v4 later today.
Mainly I've split the verifier into several patches to make it
easier to read.
Thanks!

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 10/16] bpf: add eBPF verifier
@ 2014-08-12 20:10             ` Andy Lutomirski
  0 siblings, 0 replies; 62+ messages in thread
From: Andy Lutomirski @ 2014-08-12 20:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, Linux API,
	Network Development, linux-kernel

On Tue, Aug 12, 2014 at 1:00 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Tue, Aug 12, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, Jul 24, 2014 at 12:25 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> On Thu, Jul 24, 2014 at 11:25 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Thu, Jul 17, 2014 at 9:20 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>> Safety of eBPF programs is statically determined by the verifier, which detects:
>>>>> - loops
>>>>> - out of range jumps
>>>>> - unreachable instructions
>>>>> - invalid instructions
>>>>> - uninitialized register access
>>>>> - uninitialized stack access
>>>>> - misaligned stack access
>>>>> - out of range stack access
>>>>> - invalid calling convention
>>>>
>>>> Is there something that documents exactly what conditions an eBPF
>>>> program must satisfy in order to be considered valid?
>>>
>>> I did a writeup in the past on things that the verifier checks and gave it
>>> to internal folks to review. They said that they now understand very well
>>> how it works, but in reality it didn't help them write valid programs at all.
>>> What worked was the 'verification trace': the instruction-by-instruction dump
>>> of verifier state while it analyzes the program.
>>> I gave a few simple examples of it in the
>>> 'Understanding eBPF verifier messages' section:
>>> https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/diff/Documentation/networking/filter.txt?id=b22459133b9f52d2176c8c0f8b5eb036478a40c9
>>> Every example there shows what a "program must satisfy to be valid"...
>>>
>>> Therefore I'm addressing two things:
>>> 1. how the verifier works and what it checks for.
>>>   That is described in the 'eBPF verifier' section of the doc and
>>>   in 200 lines of comments inside verifier.c
>>
>> That doc is pretty good.  I'll try to read it carefully soon.  Sorry
>> for the huge delay here -- I've been on vacation.
>
> I've been sitting on v4 for a few weeks, since it's the merge window.
> So please hold off on a careful review. I'll post v4 later today.
> Mainly I've split the verifier into several patches to make it
> easier to read.
> Thanks!

Will you be at KS / LSS / LinuxCon?

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH RFC v2 net-next 10/16] bpf: add eBPF verifier
@ 2014-08-12 20:43               ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2014-08-12 20:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David S. Miller, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	Daniel Borkmann, Chema Gonzalez, Eric Dumazet, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Jiri Olsa, Thomas Gleixner,
	H. Peter Anvin, Andrew Morton, Kees Cook, Linux API,
	Network Development, linux-kernel

On Tue, Aug 12, 2014 at 1:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Aug 12, 2014 at 1:00 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Tue, Aug 12, 2014 at 12:32 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Thu, Jul 24, 2014 at 12:25 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>> On Thu, Jul 24, 2014 at 11:25 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Thu, Jul 17, 2014 at 9:20 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>>> Safety of eBPF programs is statically determined by the verifier, which detects:
>>>>>> - loops
>>>>>> - out of range jumps
>>>>>> - unreachable instructions
>>>>>> - invalid instructions
>>>>>> - uninitialized register access
>>>>>> - uninitialized stack access
>>>>>> - misaligned stack access
>>>>>> - out of range stack access
>>>>>> - invalid calling convention
>>>>>
>>>>> Is there something that documents exactly what conditions an eBPF
>>>>> program must satisfy in order to be considered valid?
>>>>
>>>> I did a writeup in the past on things that the verifier checks and gave it
>>>> to internal folks to review. They said that they now understand very well
>>>> how it works, but in reality it didn't help them write valid programs at all.
>>>> What worked was the 'verification trace': the instruction-by-instruction dump
>>>> of verifier state while it analyzes the program.
>>>> I gave a few simple examples of it in the
>>>> 'Understanding eBPF verifier messages' section:
>>>> https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/diff/Documentation/networking/filter.txt?id=b22459133b9f52d2176c8c0f8b5eb036478a40c9
>>>> Every example there shows what a "program must satisfy to be valid"...
>>>>
>>>> Therefore I'm addressing two things:
>>>> 1. how the verifier works and what it checks for.
>>>>   That is described in the 'eBPF verifier' section of the doc and
>>>>   in 200 lines of comments inside verifier.c
>>>
>>> That doc is pretty good.  I'll try to read it carefully soon.  Sorry
>>> for the huge delay here -- I've been on vacation.
>>
>> I've been sitting on v4 for a few weeks, since it's the merge window.
>> So please hold off on a careful review. I'll post v4 later today.
>> Mainly I've split the verifier into several patches to make it
>> easier to read.
>> Thanks!
>
> Will you be at KS / LSS / LinuxCon?

I would love to, but I didn't get an invite for KS.
I'll be at Plumbers in October.

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2014-08-12 20:43 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-18  4:19 [PATCH RFC v2 net-next 00/16] BPF syscall, maps, verifier, samples Alexei Starovoitov
2014-07-18  4:19 ` Alexei Starovoitov
2014-07-18  4:19 ` [PATCH RFC v2 net-next 01/16] net: filter: split filter.c into two files Alexei Starovoitov
2014-07-18  4:19   ` Alexei Starovoitov
2014-07-18  4:19 ` [PATCH RFC v2 net-next 02/16] bpf: update MAINTAINERS entry Alexei Starovoitov
2014-07-18  4:19   ` Alexei Starovoitov
2014-07-23 17:37   ` Kees Cook
2014-07-23 17:48     ` Alexei Starovoitov
2014-07-23 17:48       ` Alexei Starovoitov
2014-07-23 18:39       ` Kees Cook
2014-07-23 18:39         ` Kees Cook
2014-07-18  4:19 ` [PATCH RFC v2 net-next 03/16] net: filter: rename struct sock_filter_int into bpf_insn Alexei Starovoitov
2014-07-18  4:19 ` [PATCH RFC v2 net-next 04/16] net: filter: split filter.h and expose eBPF to user space Alexei Starovoitov
2014-07-18  4:19 ` [PATCH RFC v2 net-next 05/16] bpf: introduce syscall(BPF, ...) and BPF maps Alexei Starovoitov
2014-07-23 18:02   ` Kees Cook
2014-07-23 18:02     ` Kees Cook
2014-07-23 19:30     ` Alexei Starovoitov
2014-07-18  4:19 ` [PATCH RFC v2 net-next 06/16] bpf: enable bpf syscall on x64 Alexei Starovoitov
2014-07-18  4:19   ` Alexei Starovoitov
2014-07-18  4:19 ` [PATCH RFC v2 net-next 07/16] bpf: add lookup/update/delete/iterate methods to BPF maps Alexei Starovoitov
2014-07-23 18:25   ` Kees Cook
2014-07-23 19:49     ` Alexei Starovoitov
2014-07-23 20:25       ` Kees Cook
2014-07-23 21:22         ` Alexei Starovoitov
2014-07-18  4:19 ` [PATCH RFC v2 net-next 08/16] bpf: add hashtable type of " Alexei Starovoitov
2014-07-23 18:36   ` Kees Cook
2014-07-23 18:36     ` Kees Cook
2014-07-23 19:57     ` Alexei Starovoitov
2014-07-23 19:57       ` Alexei Starovoitov
2014-07-23 20:33       ` Kees Cook
2014-07-23 20:33         ` Kees Cook
2014-07-23 21:42         ` Alexei Starovoitov
2014-07-18  4:19 ` [PATCH RFC v2 net-next 09/16] bpf: expand BPF syscall with program load/unload Alexei Starovoitov
2014-07-23 19:00   ` Kees Cook
2014-07-23 19:00     ` Kees Cook
2014-07-23 20:22     ` Alexei Starovoitov
2014-07-23 20:22       ` Alexei Starovoitov
2014-07-18  4:20 ` [PATCH RFC v2 net-next 10/16] bpf: add eBPF verifier Alexei Starovoitov
2014-07-23 23:38   ` Kees Cook
2014-07-23 23:38     ` Kees Cook
2014-07-24  0:48     ` Alexei Starovoitov
2014-07-24 18:25   ` Andy Lutomirski
2014-07-24 18:25     ` Andy Lutomirski
2014-07-24 19:25     ` Alexei Starovoitov
2014-07-24 19:25       ` Alexei Starovoitov
2014-08-12 19:32       ` Andy Lutomirski
2014-08-12 19:32         ` Andy Lutomirski
2014-08-12 20:00         ` Alexei Starovoitov
2014-08-12 20:00           ` Alexei Starovoitov
2014-08-12 20:10           ` Andy Lutomirski
2014-08-12 20:10             ` Andy Lutomirski
2014-08-12 20:43             ` Alexei Starovoitov
2014-08-12 20:43               ` Alexei Starovoitov
2014-07-18  4:20 ` [PATCH RFC v2 net-next 11/16] bpf: allow eBPF programs to use maps Alexei Starovoitov
2014-07-18  4:20 ` [PATCH RFC v2 net-next 12/16] net: sock: allow eBPF programs to be attached to sockets Alexei Starovoitov
2014-07-18  4:20 ` [PATCH RFC v2 net-next 13/16] tracing: allow eBPF programs to be attached to events Alexei Starovoitov
2014-07-23 23:46   ` Kees Cook
2014-07-24  0:06     ` Alexei Starovoitov
2014-07-24  0:06       ` Alexei Starovoitov
2014-07-18  4:20 ` [PATCH RFC v2 net-next 14/16] samples: bpf: add mini eBPF library to manipulate maps and programs Alexei Starovoitov
2014-07-18  4:20 ` [PATCH RFC v2 net-next 15/16] samples: bpf: example of stateful socket filtering Alexei Starovoitov
2014-07-18  4:20 ` [PATCH RFC v2 net-next 16/16] samples: bpf: example of tracing filters with eBPF Alexei Starovoitov
