linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 net-next 0/1] bpf32->bpf64 mapper and bpf64 interpreter
@ 2014-02-27  2:38 Alexei Starovoitov
  2014-02-27  2:38 ` [PATCH v3 net-next 1/1] " Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2014-02-27  2:38 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Frederic Weisbecker, Arnaldo Carvalho de Melo, Pekka Enberg,
	Arjan van de Ven, Christoph Hellwig, linux-kernel, netdev

Hi All,

V1 patches:
http://thread.gmane.org/gmane.linux.kernel/1605783
V2 patches:
http://thread.gmane.org/gmane.linux.kernel/1642325

V3 summary:
- as suggested by Daniel, added an on-the-fly converter from
  old BPF (aka BPF32) into extended BPF (aka BPF64)
- as suggested by Peter Anvin, added 32-bit subregisters;
  they don't add much to interpreter speed, but they simplify the
  bpf32->bpf64 mapping
- added the sysctl net.core.bpf64_enable flag;
  when enabled, old BPF filters are converted to BPF64
  and used by tcpdump/cls/xtables (a small example of the conversion
  is shown right after this list).
  Safety of the filters is still verified by the old BPF sk_chk_filter();
  BPF64's own bpf_check() is dropped from this patch to simplify review
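
For illustration, this is roughly what the converter does to a single
classic conditional jump, using the structures this patch adds to
include/uapi/linux/filter.h (register numbers follow the converter:
classic A maps to r6, classic X to r7; the jump offset is shown before
the two-pass adjustment, and the values are made up for the example):

	/* classic BPF: jeq #0x800, jt 0, jf 5 */
	struct sock_filter old = { BPF_JMP | BPF_JEQ | BPF_K, 0, 5, 0x800 };

	/* since jt == 0, the converter emits the inverted JNE form: */
	struct bpf_insn new = {
		.code  = BPF_JMP | BPF_JNE | BPF_K,
		.a_reg = 6,		/* classic accumulator A */
		.x_reg = 7,		/* set but unused in the K form */
		.off   = 5,		/* "if r6 != 0x800 goto pc+5" */
		.imm   = 0x800,
	};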

The addition of 32-bit subregs requires some work on the BPF64 x86_64 JIT,
so it's not included in this patch set. The LLVM BPF64 backend also needs
to be taught to take advantage of 32-bit subregs.

Initially the BPF64 instruction set was designed for maximum performance
after JIT; it has now been tweaked for good interpreter speed as well.
Eventually BPF64 can completely replace the existing BPF on all architectures.

Two key reasons why the BPF64 interpreter is noticeably faster
than the existing BPF32 interpreter:

1. fall-through jumps
  In BPF32, jump instructions are forced to take either the 'true' or
  the 'false' branch, which causes branch-miss penalties.
  BPF64 jump instructions have one branch target plus a fall-through
  (e.g. "if r6 == 22 goto pc+N" simply continues with the next insn when
  the condition is false), which fits CPU branch predictor logic better.
  'perf stat' shows a drastic difference in branch-misses.

2. jump-threaded implementation of the interpreter vs a switch statement
  Instead of a single tablejump at the top of a 'switch' statement, GCC
  generates a separate tablejump at the end of each instruction handler,
  which helps the CPU branch predictor (see the sketch below).
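
To illustrate the dispatch difference, here is a minimal user-space
sketch (not the kernel interpreter itself; the toy opcodes are made up,
and the threaded version relies on the GCC/clang "labels as values"
extension, just like the real interpreter):

	#include <stdio.h>

	enum { OP_ADD, OP_SUB, OP_HALT };

	struct insn { int op; int imm; };

	/* switch dispatch: one indirect jump at the top of the loop */
	static int run_switch(const struct insn *pc)
	{
		int acc = 0;

		for (;;) {
			switch (pc->op) {
			case OP_ADD:  acc += pc->imm; pc++; break;
			case OP_SUB:  acc -= pc->imm; pc++; break;
			case OP_HALT: return acc;
			}
		}
	}

	/* jump-threaded dispatch: every handler ends with its own indirect
	 * jump, so the branch predictor sees one jump site per opcode */
	static int run_threaded(const struct insn *pc)
	{
		static const void *jt[] = { &&do_add, &&do_sub, &&do_halt };
		int acc = 0;

		goto *jt[pc->op];
	do_add:
		acc += pc->imm; pc++; goto *jt[pc->op];
	do_sub:
		acc -= pc->imm; pc++; goto *jt[pc->op];
	do_halt:
		return acc;
	}

	int main(void)
	{
		struct insn prog[] = { {OP_ADD, 5}, {OP_SUB, 2}, {OP_HALT, 0} };

		printf("%d %d\n", run_switch(prog), run_threaded(prog));
		return 0;
	}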

Performance of two BPF filters generated by libpcap was measured
on x86_64, i386 and arm32.

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - 
   ((tcp[12]&0xf0)>>2)) != 0)' -dd

Other libpcap programs have similar performance differences.

Raw performance data from a BPF micro-benchmark:
SK_RUN_FILTER on the same SKB (cache-hit) or on 10k different SKBs (cache-miss)
time in nsec per call, smaller is better
(a sketch of the measurement loop is given after the results)
--x86_64--
        fprog #1  fprog #1   fprog #2  fprog #2
        cache-hit cache-miss cache-hit cache-miss
BPF32      90        98       207       220
BPF64      28        85       60        108
BPF32_JIT  12        33       17         44
BPF64_JIT  TBD

--i386--
        fprog #1  fprog #1   fprog #2  fprog #2
        cache-hit cache-miss cache-hit cache-miss
BPF32     107        136      227       252
BPF64      40        119       69       172

--arm32--
        fprog #1  fprog #1   fprog #2  fprog #2
        cache-hit cache-miss cache-hit cache-miss
BPF32     202        300      475       540
BPF64     139        270      296       470
BPF32_JIT  26        182       37       202
BPF64_JIT TBD

On Intel CPUs the BPF64 interpreter is significantly faster than the
old BPF interpreter. The existing BPF32_JIT is obviously even faster;
BPF64_JIT has similar performance (marked TBD above, since the JIT
still needs the 32-bit subreg work).
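
For reference, the cache-hit column can be measured with something along
these lines (an illustrative kernel-context sketch only; the benchmark
module itself is not part of this patch, and the helper name below is
made up). Divide the returned total by the loop count to get nsec/call:

	#include <linux/ktime.h>
	#include <linux/filter.h>
	#include <linux/skbuff.h>

	/* total nsec for 'loops' SK_RUN_FILTER calls on one skb (cache-hit
	 * case; the cache-miss numbers cycle through ~10k different skbs) */
	static u64 bench_total_ns(struct sk_filter *fp, struct sk_buff *skb,
				  unsigned int loops)
	{
		ktime_t start = ktime_get();
		unsigned int i;

		for (i = 0; i < loops; i++)
			SK_RUN_FILTER(fp, skb);

		return ktime_to_ns(ktime_sub(ktime_get(), start));
	}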

Tested with Daniel's 'trinify BPF fuzzer'.

TODO:
- bpf32->bpf64 converter doesn't recognize seccomp and negative
  offsets yet, fix that

- add 32-bit subregs to BPF64 x86_64 JIT and LLVM backend

- add bpf64 verifier, so that tcpdump/cls/xt and others can
  insert both bpf32 and bpf64 programs through the same interface

- add bpf tables, complete 'dropmonitor' and get back to
  systemtap-like probes with bpf64

Please review.
Thanks!

Alexei Starovoitov (1):
  bpf32->bpf64 mapper and bpf64 interpreter

 include/linux/filter.h      |    9 +-
 include/linux/netdevice.h   |    1 +
 include/uapi/linux/filter.h |   37 ++-
 net/core/Makefile           |    2 +-
 net/core/bpf_run.c          |  766 +++++++++++++++++++++++++++++++++++++++++++
 net/core/filter.c           |  114 ++++++-
 net/core/sysctl_net_core.c  |    7 +
 7 files changed, 913 insertions(+), 23 deletions(-)
 create mode 100644 net/core/bpf_run.c

-- 
1.7.9.5



* [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-02-27  2:38 [PATCH v3 net-next 0/1] bpf32->bpf64 mapper and bpf64 interpreter Alexei Starovoitov
@ 2014-02-27  2:38 ` Alexei Starovoitov
  2014-02-28 12:45   ` Daniel Borkmann
  0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2014-02-27  2:38 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Frederic Weisbecker, Arnaldo Carvalho de Melo, Pekka Enberg,
	Arjan van de Ven, Christoph Hellwig, linux-kernel, netdev

Extended BPF (or 64-bit BPF) is an instruction set for
creating safe, dynamically loadable filters that can call a fixed set
of kernel functions and take a generic bpf_context as input.
A BPF filter is the glue between kernel functions and the bpf_context.
Different kernel subsystems can define their own set of available functions
and alter the BPF machinery for their specific use case.
The BPF64 instruction set is designed for efficient mapping to native
instructions on 64-bit CPUs.

Old BPF instructions are remapped on the fly to BPF64
when sysctl net.core.bpf64_enable=1.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/filter.h      |    9 +-
 include/linux/netdevice.h   |    1 +
 include/uapi/linux/filter.h |   37 ++-
 net/core/Makefile           |    2 +-
 net/core/bpf_run.c          |  766 +++++++++++++++++++++++++++++++++++++++++++
 net/core/filter.c           |  114 ++++++-
 net/core/sysctl_net_core.c  |    7 +
 7 files changed, 913 insertions(+), 23 deletions(-)
 create mode 100644 net/core/bpf_run.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index e568c8ef896b..bf3085258f4c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
 extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned len);
 extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
 
+/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
+int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn *new_prog,
+		int *p_new_len);
+/* execute bpf64 program */
+u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
+
+#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
 #ifdef CONFIG_BPF_JIT
 #include <stdarg.h>
 #include <linux/linkage.h>
@@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
 		print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
 			       16, 1, image, proglen, false);
 }
-#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
 #else
 #include <linux/slab.h>
 static inline void bpf_jit_compile(struct sk_filter *fp)
@@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
 {
 	kfree(fp);
 }
-#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
 #endif
 
 static inline int bpf_tell_extensions(void)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5e84483c0650..7b1acefc244e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2971,6 +2971,7 @@ extern int		netdev_max_backlog;
 extern int		netdev_tstamp_prequeue;
 extern int		weight_p;
 extern int		bpf_jit_enable;
+extern int		bpf64_enable;
 
 bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
 struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
index 8eb9ccaa5b48..70ff29ee6825 100644
--- a/include/uapi/linux/filter.h
+++ b/include/uapi/linux/filter.h
@@ -1,3 +1,4 @@
+/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com */
 /*
  * Linux Socket Filter Data Structures
  */
@@ -19,7 +20,7 @@
  *	Try and keep these values and structures similar to BSD, especially
  *	the BPF code definitions which need to match so you can share filters
  */
- 
+
 struct sock_filter {	/* Filter block */
 	__u16	code;   /* Actual filter code */
 	__u8	jt;	/* Jump true */
@@ -45,12 +46,26 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_JMP         0x05
 #define         BPF_RET         0x06
 #define         BPF_MISC        0x07
+#define         BPF_ALU64       0x07
+
+struct bpf_insn {
+	__u8	code;    /* opcode */
+	__u8    a_reg:4; /* dest register*/
+	__u8    x_reg:4; /* source register */
+	__s16	off;     /* signed offset */
+	__s32	imm;     /* signed immediate constant */
+};
+
+/* pointer to bpf_context is the first and only argument to BPF program
+ * its definition is use-case specific */
+struct bpf_context;
 
 /* ld/ldx fields */
 #define BPF_SIZE(code)  ((code) & 0x18)
 #define         BPF_W           0x00
 #define         BPF_H           0x08
 #define         BPF_B           0x10
+#define         BPF_DW          0x18
 #define BPF_MODE(code)  ((code) & 0xe0)
 #define         BPF_IMM         0x00
 #define         BPF_ABS         0x20
@@ -58,6 +73,7 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_MEM         0x60
 #define         BPF_LEN         0x80
 #define         BPF_MSH         0xa0
+#define         BPF_XADD        0xc0 /* exclusive add */
 
 /* alu/jmp fields */
 #define BPF_OP(code)    ((code) & 0xf0)
@@ -68,16 +84,24 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define         BPF_OR          0x40
 #define         BPF_AND         0x50
 #define         BPF_LSH         0x60
-#define         BPF_RSH         0x70
+#define         BPF_RSH         0x70 /* logical shift right */
 #define         BPF_NEG         0x80
 #define		BPF_MOD		0x90
 #define		BPF_XOR		0xa0
+#define		BPF_MOV		0xb0 /* mov reg to reg */
+#define		BPF_ARSH	0xc0 /* sign extending arithmetic shift right */
+#define		BPF_BSWAP32	0xd0 /* swap lower 4 bytes of 64-bit register */
+#define		BPF_BSWAP64	0xe0 /* swap all 8 bytes of 64-bit register */
 
 #define         BPF_JA          0x00
-#define         BPF_JEQ         0x10
-#define         BPF_JGT         0x20
-#define         BPF_JGE         0x30
-#define         BPF_JSET        0x40
+#define         BPF_JEQ         0x10 /* jump == */
+#define         BPF_JGT         0x20 /* GT is unsigned '>', JA in x86 */
+#define         BPF_JGE         0x30 /* GE is unsigned '>=', JAE in x86 */
+#define         BPF_JSET        0x40 /* if (A & X) */
+#define         BPF_JNE         0x50 /* jump != */
+#define         BPF_JSGT        0x60 /* SGT is signed '>', GT in x86 */
+#define         BPF_JSGE        0x70 /* SGE is signed '>=', GE in x86 */
+#define         BPF_CALL        0x80 /* function call */
 #define BPF_SRC(code)   ((code) & 0x08)
 #define         BPF_K           0x00
 #define         BPF_X           0x08
@@ -134,5 +158,4 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)
 
-
 #endif /* _UAPI__LINUX_FILTER_H__ */
diff --git a/net/core/Makefile b/net/core/Makefile
index 9628c20acff6..e622b97f58dc 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \
 obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 
 obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
-			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
+			neighbour.o rtnetlink.o utils.o link_watch.o filter.o bpf_run.o \
 			sock_diag.o dev_ioctl.o
 
 obj-$(CONFIG_XFRM) += flow.o
diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
new file mode 100644
index 000000000000..fa1862fcbc74
--- /dev/null
+++ b/net/core/bpf_run.c
@@ -0,0 +1,766 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/filter.h>
+#include <linux/skbuff.h>
+#include <asm/unaligned.h>
+
+int bpf64_enable __read_mostly;
+
+void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k,
+					   unsigned int size);
+
+static inline void *load_pointer(const struct sk_buff *skb, int k,
+				 unsigned int size, void *buffer)
+{
+	if (k >= 0)
+		return skb_header_pointer(skb, k, size, buffer);
+	return bpf_internal_load_pointer_neg_helper(skb, k, size);
+}
+
+static const char *const bpf_class_string[] = {
+	"ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
+};
+
+static const char *const bpf_alu_string[] = {
+	"+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
+	"%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
+};
+
+static const char *const bpf_ldst_string[] = {
+	"u32", "u16", "u8", "u64"
+};
+
+static const char *const bpf_jmp_string[] = {
+	"jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
+};
+
+static const char *reg_to_str(int regno, u64 *regs)
+{
+	static char reg_value[16][32];
+	if (!regs)
+		return "";
+	snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
+		 regs[regno]);
+	return reg_value[regno];
+}
+
+#define R(regno) reg_to_str(regno, regs)
+
+void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
+{
+	u16 class = BPF_CLASS(insn->code);
+	if (class == BPF_ALU || class == BPF_ALU64) {
+		if (BPF_SRC(insn->code) == BPF_X)
+			pr_info("code_%02x %sr%d%s %s r%d%s\n",
+				insn->code, class == BPF_ALU ? "(u32)" : "",
+				insn->a_reg, R(insn->a_reg),
+				bpf_alu_string[BPF_OP(insn->code) >> 4],
+				insn->x_reg, R(insn->x_reg));
+		else
+			pr_info("code_%02x %sr%d%s %s %d\n",
+				insn->code, class == BPF_ALU ? "(u32)" : "",
+				insn->a_reg, R(insn->a_reg),
+				bpf_alu_string[BPF_OP(insn->code) >> 4],
+				insn->imm);
+	} else if (class == BPF_STX) {
+		if (BPF_MODE(insn->code) == BPF_MEM)
+			pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
+				insn->code,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->a_reg, R(insn->a_reg),
+				insn->off, insn->x_reg, R(insn->x_reg));
+		else if (BPF_MODE(insn->code) == BPF_XADD)
+			pr_info("code_%02x lock *(%s *)(r%d%s %+d) += r%d%s\n",
+				insn->code,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->a_reg, R(insn->a_reg), insn->off,
+				insn->x_reg, R(insn->x_reg));
+		else
+			pr_info("BUG_%02x\n", insn->code);
+	} else if (class == BPF_ST) {
+		if (BPF_MODE(insn->code) != BPF_MEM) {
+			pr_info("BUG_st_%02x\n", insn->code);
+			return;
+		}
+		pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
+			insn->code,
+			bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+			insn->a_reg, R(insn->a_reg),
+			insn->off, insn->imm);
+	} else if (class == BPF_LDX) {
+		if (BPF_MODE(insn->code) == BPF_MEM) {
+			pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
+				insn->code, insn->a_reg,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->x_reg, R(insn->x_reg), insn->off);
+		} else {
+			pr_info("BUG_ldx_%02x\n", insn->code);
+		}
+	} else if (class == BPF_LD) {
+		if (BPF_MODE(insn->code) == BPF_ABS) {
+			pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
+				insn->code, insn->a_reg,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->imm);
+		} else if (BPF_MODE(insn->code) == BPF_IND) {
+			pr_info("code_%02x r%d = *(%s *)(skb + r%d%s %+d)\n",
+				insn->code, insn->a_reg,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->x_reg, R(insn->x_reg), insn->imm);
+		} else {
+			pr_info("BUG_ld_%02x\n", insn->code);
+		}
+	} else if (class == BPF_JMP) {
+		u16 opcode = BPF_OP(insn->code);
+		if (opcode == BPF_CALL) {
+			pr_info("code_%02x call %d\n", insn->code, insn->imm);
+		} else if (insn->code == (BPF_JMP | BPF_JA)) {
+			pr_info("code_%02x goto pc%+d\n",
+				insn->code, insn->off);
+		} else if (BPF_SRC(insn->code) == BPF_X) {
+			pr_info("code_%02x if r%d%s %s r%d%s goto pc%+d\n",
+				insn->code, insn->a_reg, R(insn->a_reg),
+				bpf_jmp_string[BPF_OP(insn->code) >> 4],
+				insn->x_reg, R(insn->x_reg), insn->off);
+		} else {
+			pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
+				insn->code, insn->a_reg, R(insn->a_reg),
+				bpf_jmp_string[BPF_OP(insn->code) >> 4],
+				insn->imm, insn->off);
+		}
+	} else {
+		pr_info("code_%02x %s\n", insn->code, bpf_class_string[class]);
+	}
+}
+EXPORT_SYMBOL(pr_info_bpf_insn);
+
+/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style (BPF64)
+ *
+ * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate new
+ * program length in one pass
+ *
+ * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
+ *
+ * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
+ * to remap in two passes: 1st pass finds new jump offsets, 2nd pass remaps
+ */
+int bpf_convert(struct sock_filter *old_prog, int len,
+		struct bpf_insn *new_prog, int *p_new_len)
+{
+	struct bpf_insn *new_insn;
+	struct sock_filter *fp;
+	int *addrs = NULL;
+	int new_len = 0;
+	int pass = 0;
+	int tgt, i;
+
+	if (len <= 0 || len >= BPF_MAXINSNS)
+		return -EINVAL;
+
+	if (new_prog) {
+		addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
+		if (!addrs)
+			return -ENOMEM;
+	}
+
+do_pass:
+	new_insn = new_prog;
+	fp = old_prog;
+	for (i = 0; i < len; fp++, i++) {
+		struct bpf_insn tmp_insns[3] = {};
+		struct bpf_insn *insn = tmp_insns;
+
+		if (addrs)
+			addrs[i] = new_insn - new_prog;
+
+		switch (fp->code) {
+		/* all arithmetic insns and skb loads map as-is */
+		case BPF_ALU | BPF_ADD | BPF_X:
+		case BPF_ALU | BPF_ADD | BPF_K:
+		case BPF_ALU | BPF_SUB | BPF_X:
+		case BPF_ALU | BPF_SUB | BPF_K:
+		case BPF_ALU | BPF_AND | BPF_X:
+		case BPF_ALU | BPF_AND | BPF_K:
+		case BPF_ALU | BPF_OR | BPF_X:
+		case BPF_ALU | BPF_OR | BPF_K:
+		case BPF_ALU | BPF_LSH | BPF_X:
+		case BPF_ALU | BPF_LSH | BPF_K:
+		case BPF_ALU | BPF_RSH | BPF_X:
+		case BPF_ALU | BPF_RSH | BPF_K:
+		case BPF_ALU | BPF_XOR | BPF_X:
+		case BPF_ALU | BPF_XOR | BPF_K:
+		case BPF_ALU | BPF_MUL | BPF_X:
+		case BPF_ALU | BPF_MUL | BPF_K:
+		case BPF_ALU | BPF_DIV | BPF_X:
+		case BPF_ALU | BPF_DIV | BPF_K:
+		case BPF_ALU | BPF_MOD | BPF_X:
+		case BPF_ALU | BPF_MOD | BPF_K:
+		case BPF_ALU | BPF_NEG:
+		case BPF_LD | BPF_ABS | BPF_W:
+		case BPF_LD | BPF_ABS | BPF_H:
+		case BPF_LD | BPF_ABS | BPF_B:
+		case BPF_LD | BPF_IND | BPF_W:
+		case BPF_LD | BPF_IND | BPF_H:
+		case BPF_LD | BPF_IND | BPF_B:
+			insn->code = fp->code;
+			insn->a_reg = 6;
+			insn->x_reg = 7;
+			insn->imm = fp->k;
+			break;
+
+		/* jump opcodes map as-is, but offsets need adjustment */
+		case BPF_JMP | BPF_JA:
+			tgt = i + fp->k + 1;
+			insn->code = fp->code;
+#define EMIT_JMP \
+	do { \
+		if (tgt >= len || tgt < 0) \
+			goto err; \
+		insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
+	} while (0)
+
+			EMIT_JMP;
+			break;
+
+		case BPF_JMP | BPF_JEQ | BPF_K:
+		case BPF_JMP | BPF_JEQ | BPF_X:
+		case BPF_JMP | BPF_JSET | BPF_K:
+		case BPF_JMP | BPF_JSET | BPF_X:
+		case BPF_JMP | BPF_JGT | BPF_K:
+		case BPF_JMP | BPF_JGT | BPF_X:
+		case BPF_JMP | BPF_JGE | BPF_K:
+		case BPF_JMP | BPF_JGE | BPF_X:
+			insn->a_reg = 6;
+			insn->x_reg = 7;
+			insn->imm = fp->k;
+			/* common case where 'jump_false' is next insn */
+			if (fp->jf == 0) {
+				insn->code = fp->code;
+				tgt = i + fp->jt + 1;
+				EMIT_JMP;
+				break;
+			}
+			/* convert JEQ into JNE when 'jump_true' is next insn */
+			if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
+				insn->code = BPF_JMP | BPF_JNE |
+					BPF_SRC(fp->code);
+				tgt = i + fp->jf + 1;
+				EMIT_JMP;
+				break;
+			}
+			/* other jumps are mapped into two insns: Jxx and JA */
+			tgt = i + fp->jt + 1;
+			insn->code = fp->code;
+			EMIT_JMP;
+
+			insn++;
+			insn->code = BPF_JMP | BPF_JA;
+			tgt = i + fp->jf + 1;
+			EMIT_JMP;
+			/* adjust pc relative offset, since it's a 2nd insn */
+			insn->off--;
+			break;
+
+			/* ldxb 4*([14]&0xf) is remaped into 3 insns */
+		case BPF_LDX | BPF_MSH | BPF_B:
+			insn->code = BPF_LD | BPF_ABS | BPF_B;
+			insn->a_reg = 7;
+			insn->imm = fp->k;
+
+			insn++;
+			insn->code = BPF_ALU | BPF_AND | BPF_K;
+			insn->a_reg = 7;
+			insn->imm = 0xf;
+
+			insn++;
+			insn->code = BPF_ALU | BPF_LSH | BPF_K;
+			insn->a_reg = 7;
+			insn->imm = 2;
+			break;
+
+			/* RET_K, RET_A are remaped into 2 insns */
+		case BPF_RET | BPF_A:
+		case BPF_RET | BPF_K:
+			insn->code = BPF_ALU | BPF_MOV |
+				(BPF_SRC(fp->code) == BPF_K ? BPF_K : BPF_X);
+			insn->a_reg = 0;
+			insn->x_reg = 6;
+			insn->imm = fp->k;
+
+			insn++;
+			insn->code = BPF_RET | BPF_K;
+			break;
+
+			/* store to stack */
+		case BPF_ST:
+		case BPF_STX:
+			insn->code = BPF_STX | BPF_MEM | BPF_W;
+			insn->a_reg = 10;
+			insn->x_reg = fp->code == BPF_ST ? 6 : 7;
+			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
+			break;
+
+			/* load from stack */
+		case BPF_LD | BPF_MEM:
+		case BPF_LDX | BPF_MEM:
+			insn->code = BPF_LDX | BPF_MEM | BPF_W;
+			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+			insn->x_reg = 10;
+			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
+			break;
+
+			/* A = K or X = K */
+		case BPF_LD | BPF_IMM:
+		case BPF_LDX | BPF_IMM:
+			insn->code = BPF_ALU | BPF_MOV | BPF_K;
+			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+			insn->imm = fp->k;
+			break;
+
+			/* X = A */
+		case BPF_MISC | BPF_TAX:
+			insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
+			insn->a_reg = 7;
+			insn->x_reg = 6;
+			break;
+
+			/* A = X */
+		case BPF_MISC | BPF_TXA:
+			insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
+			insn->a_reg = 6;
+			insn->x_reg = 7;
+			break;
+
+			/* A = skb->len or X = skb->len */
+		case BPF_LD|BPF_W|BPF_LEN:
+		case BPF_LDX|BPF_W|BPF_LEN:
+			insn->code = BPF_LDX | BPF_MEM | BPF_W;
+			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
+			insn->x_reg = 1;
+			insn->off = offsetof(struct sk_buff, len);
+			break;
+
+		default:
+			/* pr_err("unknown opcode %02x\n", fp->code); */
+			goto err;
+		}
+
+		insn++;
+		if (new_prog) {
+			memcpy(new_insn, tmp_insns,
+			       sizeof(*insn) * (insn - tmp_insns));
+		}
+		new_insn += insn - tmp_insns;
+	}
+
+	if (!new_prog) {
+		/* only calculating new length */
+		*p_new_len = new_insn - new_prog;
+		return 0;
+	}
+
+	pass++;
+	if (new_len != new_insn - new_prog) {
+		new_len = new_insn - new_prog;
+		if (pass > 2)
+			goto err;
+		goto do_pass;
+	}
+	kfree(addrs);
+	if (*p_new_len != new_len)
+		/* inconsistent new program length */
+		pr_err("bpf_convert() usage error\n");
+	return 0;
+err:
+	kfree(addrs);
+	return -EINVAL;
+}
+
+notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
+{
+	u64 stack[64];
+	u64 regs[16];
+	void *ptr;
+	u64 tmp;
+	int off;
+
+#ifdef __x86_64
+#define LOAD_IMM /**/
+#define K insn->imm
+#else
+#define LOAD_IMM (K = insn->imm)
+	s32 K = insn->imm;
+#endif
+
+#ifdef DEBUG
+#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
+#else
+#define DEBUG_INSN
+#endif
+
+#define A regs[insn->a_reg]
+#define X regs[insn->x_reg]
+
+#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
+#define L(A, B, C) L_##A##B##C:
+
+#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
+#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
+/* some compilers may need help:
+ * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code]; })
+ */
+
+	static const void *jt[256] = {
+		[0 ... 255] = &&L_default,
+		DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
+		DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
+		DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
+		DL(BPF_ALU, BPF_OR, BPF_X)  DL(BPF_ALU, BPF_OR, BPF_K)
+		DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
+		DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
+		DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
+		DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
+		DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
+		DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
+		DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
+		DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD, BPF_K)
+		DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB, BPF_K)
+		DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND, BPF_K)
+		DL(BPF_ALU64, BPF_OR, BPF_X)  DL(BPF_ALU64, BPF_OR, BPF_K)
+		DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH, BPF_K)
+		DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH, BPF_K)
+		DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR, BPF_K)
+		DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL, BPF_K)
+		DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV, BPF_K)
+		DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH, BPF_K)
+		DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV, BPF_K)
+		DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD, BPF_K)
+		DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
+		DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
+		DL(BPF_ALU, BPF_NEG, 0)
+		DL(BPF_JMP, BPF_CALL, 0)
+		DL(BPF_JMP, BPF_JA, 0)
+		DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
+		DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
+		DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
+		DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
+		DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
+		DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
+		DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
+		DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
+		DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
+		DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
+		DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
+		DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
+		DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
+		DL(BPF_STX, BPF_XADD, BPF_W)
+#ifdef CONFIG_64BIT
+		DL(BPF_STX, BPF_XADD, BPF_DW)
+#endif
+		DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
+		DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
+		DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
+		DL(BPF_RET, BPF_K, 0)
+	};
+
+	regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
+	regs[1/* BPF R1 */] = (u64)(ulong)ctx;
+
+	DEBUG_INSN;
+	/* execute 1st insn */
+select_insn:
+	goto *jt[insn->code];
+
+		/* ALU */
+#define ALU(OPCODE, OP) \
+	L_BPF_ALU64##OPCODE##BPF_X: \
+		A = A OP X; \
+		CONT; \
+	L_BPF_ALU##OPCODE##BPF_X: \
+		A = (u32)A OP (u32)X; \
+		CONT; \
+	L_BPF_ALU64##OPCODE##BPF_K: \
+		A = A OP K; \
+		CONT; \
+	L_BPF_ALU##OPCODE##BPF_K: \
+		A = (u32)A OP (u32)K; \
+		CONT;
+
+	ALU(BPF_ADD, +)
+	ALU(BPF_SUB, -)
+	ALU(BPF_AND, &)
+	ALU(BPF_OR, |)
+	ALU(BPF_LSH, <<)
+	ALU(BPF_RSH, >>)
+	ALU(BPF_XOR, ^)
+	ALU(BPF_MUL, *)
+
+	L(BPF_ALU, BPF_NEG, 0)
+		A = (u32)-A;
+		CONT;
+	L(BPF_ALU, BPF_MOV, BPF_X)
+		A = (u32)X;
+		CONT;
+	L(BPF_ALU, BPF_MOV, BPF_K)
+		A = (u32)K;
+		CONT;
+	L(BPF_ALU64, BPF_MOV, BPF_X)
+		A = X;
+		CONT;
+	L(BPF_ALU64, BPF_MOV, BPF_K)
+		A = K;
+		CONT;
+	L(BPF_ALU64, BPF_ARSH, BPF_X)
+		(*(s64 *) &A) >>= X;
+		CONT;
+	L(BPF_ALU64, BPF_ARSH, BPF_K)
+		(*(s64 *) &A) >>= K;
+		CONT;
+	L(BPF_ALU64, BPF_MOD, BPF_X)
+		tmp = A;
+		if (X)
+			A = do_div(tmp, X);
+		CONT;
+	L(BPF_ALU, BPF_MOD, BPF_X)
+		tmp = (u32)A;
+		if (X)
+			A = do_div(tmp, (u32)X);
+		CONT;
+	L(BPF_ALU64, BPF_MOD, BPF_K)
+		tmp = A;
+		if (K)
+			A = do_div(tmp, K);
+		CONT;
+	L(BPF_ALU, BPF_MOD, BPF_K)
+		tmp = (u32)A;
+		if (K)
+			A = do_div(tmp, (u32)K);
+		CONT;
+	L(BPF_ALU64, BPF_DIV, BPF_X)
+		if (X)
+			do_div(A, X);
+		CONT;
+	L(BPF_ALU, BPF_DIV, BPF_X)
+		tmp = (u32)A;
+		if (X)
+			do_div(tmp, (u32)X);
+		A = (u32)tmp;
+		CONT;
+	L(BPF_ALU64, BPF_DIV, BPF_K)
+		if (K)
+			do_div(A, K);
+		CONT;
+	L(BPF_ALU, BPF_DIV, BPF_K)
+		tmp = (u32)A;
+		if (K)
+			do_div(tmp, (u32)K);
+		A = (u32)tmp;
+		CONT;
+	L(BPF_ALU64, BPF_BSWAP32, BPF_X)
+		A = swab32(A);
+		CONT;
+	L(BPF_ALU64, BPF_BSWAP64, BPF_X)
+		A = swab64(A);
+		CONT;
+
+		/* CALL */
+	L(BPF_JMP, BPF_CALL, 0)
+		/* TODO execute_func(K, regs); */
+		CONT;
+
+		/* JMP */
+	L(BPF_JMP, BPF_JA, 0)
+		insn += insn->off;
+		CONT;
+	L(BPF_JMP, BPF_JEQ, BPF_X)
+		if (A == X) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JEQ, BPF_K)
+		if (A == K) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JNE, BPF_X)
+		if (A != X) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JNE, BPF_K)
+		if (A != K) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JGT, BPF_X)
+		if (A > X) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JGT, BPF_K)
+		if (A > K) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JGE, BPF_X)
+		if (A >= X) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JGE, BPF_K)
+		if (A >= K) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JSGT, BPF_X)
+		if (((s64)A) > ((s64)X)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JSGT, BPF_K)
+		if (((s64)A) > ((s64)K)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JSGE, BPF_X)
+		if (((s64)A) >= ((s64)X)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JSGE, BPF_K)
+		if (((s64)A) >= ((s64)K)) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JSET, BPF_X)
+		if (A & X) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+	L(BPF_JMP, BPF_JSET, BPF_K)
+		if (A & (u32)K) {
+			insn += insn->off;
+			CONT_JMP;
+		}
+		CONT;
+
+		/* STX and ST and LDX*/
+#define LDST(SIZEOP, SIZE) \
+	L_BPF_STXBPF_MEM##SIZEOP: \
+		*(SIZE *)(ulong)(A + insn->off) = X; \
+		CONT; \
+	L_BPF_STBPF_MEM##SIZEOP: \
+		*(SIZE *)(ulong)(A + insn->off) = K; \
+		CONT; \
+	L_BPF_LDXBPF_MEM##SIZEOP: \
+		A = *(SIZE *)(ulong)(X + insn->off); \
+		CONT;
+
+		LDST(BPF_B, u8)
+		LDST(BPF_H, u16)
+		LDST(BPF_W, u32)
+		LDST(BPF_DW, u64)
+
+		/* STX XADD */
+	L(BPF_STX, BPF_XADD, BPF_W)
+		atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
+		CONT;
+#ifdef CONFIG_64BIT
+	L(BPF_STX, BPF_XADD, BPF_DW)
+		atomic64_add((u64)X, (atomic64_t *)(ulong)(A + insn->off));
+		CONT;
+#endif
+
+		/* LD from SKB + K */
+	L(BPF_LD, BPF_ABS, BPF_W)
+		off = K;
+load_word:
+		ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
+		if (likely(ptr != NULL)) {
+			A = get_unaligned_be32(ptr);
+			CONT;
+		}
+		return 0;
+
+	L(BPF_LD, BPF_ABS, BPF_H)
+		off = K;
+load_half:
+		ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
+		if (likely(ptr != NULL)) {
+			A = get_unaligned_be16(ptr);
+			CONT;
+		}
+		return 0;
+
+	L(BPF_LD, BPF_ABS, BPF_B)
+		off = K;
+load_byte:
+		ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
+		if (likely(ptr != NULL)) {
+			A = *(u8 *)ptr;
+			CONT;
+		}
+		return 0;
+
+		/* LD from SKB + X + K */
+	L(BPF_LD, BPF_IND, BPF_W)
+		off = K + X;
+		goto load_word;
+
+	L(BPF_LD, BPF_IND, BPF_H)
+		off = K + X;
+		goto load_half;
+
+	L(BPF_LD, BPF_IND, BPF_B)
+		off = K + X;
+		goto load_byte;
+
+		/* RET */
+	L(BPF_RET, BPF_K, 0)
+		return regs[0/* R0 */];
+
+	L_default:
+		/* bpf_check() or bpf_convert() will guarantee that
+		 * we never reach here
+		 */
+		WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
+		return 0;
+#undef DL
+#undef L
+#undef CONT
+#undef A
+#undef X
+#undef K
+#undef LOAD_IMM
+#undef DEBUG_INSN
+#undef ALU
+#undef LDST
+}
+EXPORT_SYMBOL(bpf_run);
+
diff --git a/net/core/filter.c b/net/core/filter.c
index ad30d626a5bd..575bf5fd4335 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
 {
 	struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
 
+	if ((void *)fp->bpf_func == (void *)bpf_run)
+		/* arch specific jit_free are expecting this value */
+		fp->bpf_func = sk_run_filter;
+
 	bpf_jit_free(fp);
 }
 EXPORT_SYMBOL(sk_filter_release_rcu);
@@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
 	return 0;
 }
 
+static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog *fprog,
+			 struct sock *sk)
+{
+	unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
+	struct sock_filter *old_prog;
+	unsigned int sk_fsize;
+	struct sk_filter *fp;
+	int new_len;
+	int err;
+
+	/* store old program into buffer, since chk_filter will remap opcodes */
+	old_prog = kmalloc(fsize, GFP_KERNEL);
+	if (!old_prog)
+		return -ENOMEM;
+
+	if (sk) {
+		if (copy_from_user(old_prog, fprog->filter, fsize)) {
+			err = -EFAULT;
+			goto free_prog;
+		}
+	} else {
+		memcpy(old_prog, fprog->filter, fsize);
+	}
+
+	/* calculate bpf64 program length */
+	err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
+	if (err)
+		goto free_prog;
+
+	sk_fsize = sk_filter_size(new_len);
+	/* allocate sk_filter to store bpf64 program */
+	if (sk)
+		fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
+	else
+		fp = kmalloc(sk_fsize, GFP_KERNEL);
+	if (!fp) {
+		err = -ENOMEM;
+		goto free_prog;
+	}
+
+	/* remap old insns into bpf64 */
+	err = bpf_convert(old_prog, fprog->len,
+			  (struct bpf_insn *)fp->insns, &new_len);
+	if (err)
+		/* 2nd bpf_convert() can fail only if it fails
+		 * to allocate memory, remapping must succeed
+		 */
+		goto free_fp;
+
+	/* now chk_filter can overwrite old_prog while checking */
+	err = sk_chk_filter(old_prog, fprog->len);
+	if (err)
+		goto free_fp;
+
+	/* discard old prog */
+	kfree(old_prog);
+
+	atomic_set(&fp->refcnt, 1);
+	fp->len = new_len;
+
+	/* bpf64 insns must be executed by bpf_run */
+	fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
+
+	*pfp = fp;
+	return 0;
+free_fp:
+	if (sk)
+		sock_kfree_s(sk, fp, sk_fsize);
+	else
+		kfree(fp);
+free_prog:
+	kfree(old_prog);
+	return err;
+}
+
 /**
  *	sk_unattached_filter_create - create an unattached filter
  *	@fprog: the filter program
@@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
 	if (fprog->filter == NULL)
 		return -EINVAL;
 
+	if (bpf64_enable)
+		return bpf64_prepare(pfp, fprog, NULL);
+
 	fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
 	if (!fp)
 		return -ENOMEM;
@@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 	if (fprog->filter == NULL)
 		return -EINVAL;
 
-	fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
-	if (!fp)
-		return -ENOMEM;
-	if (copy_from_user(fp->insns, fprog->filter, fsize)) {
-		sock_kfree_s(sk, fp, sk_fsize);
-		return -EFAULT;
-	}
+	if (bpf64_enable) {
+		err = bpf64_prepare(&fp, fprog, sk);
+		if (err)
+			return err;
+	} else {
+		fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
+		if (!fp)
+			return -ENOMEM;
+		if (copy_from_user(fp->insns, fprog->filter, fsize)) {
+			sock_kfree_s(sk, fp, sk_fsize);
+			return -EFAULT;
+		}
 
-	atomic_set(&fp->refcnt, 1);
-	fp->len = fprog->len;
+		atomic_set(&fp->refcnt, 1);
+		fp->len = fprog->len;
 
-	err = __sk_prepare_filter(fp);
-	if (err) {
-		sk_filter_uncharge(sk, fp);
-		return err;
+		err = __sk_prepare_filter(fp);
+		if (err) {
+			sk_filter_uncharge(sk, fp);
+			return err;
+		}
 	}
 
 	old_fp = rcu_dereference_protected(sk->sk_filter,
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cf9cd13509a7..f03acc0e8950 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
 	},
 #endif
 	{
+		.procname	= "bpf64_enable",
+		.data		= &bpf64_enable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
 		.procname	= "netdev_tstamp_prequeue",
 		.data		= &netdev_tstamp_prequeue,
 		.maxlen		= sizeof(int),
-- 
1.7.9.5



* Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-02-27  2:38 ` [PATCH v3 net-next 1/1] " Alexei Starovoitov
@ 2014-02-28 12:45   ` Daniel Borkmann
  2014-02-28 20:53     ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Borkmann @ 2014-02-28 12:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Frederic Weisbecker, Arnaldo Carvalho de Melo, Pekka Enberg,
	Arjan van de Ven, Christoph Hellwig, linux-kernel, netdev,
	Hagen Paul Pfeifer, Jesse Gross

Hi Alexei,

[also cc'ing Hagen and Jesse]

Just some minor comments below ... let me know what you think.

On 02/27/2014 03:38 AM, Alexei Starovoitov wrote:
> Extended BPF (or 64-bit BPF) is an instruction set for
> creating safe, dynamically loadable filters that can call a fixed set
> of kernel functions and take a generic bpf_context as input.
> A BPF filter is the glue between kernel functions and the bpf_context.
> Different kernel subsystems can define their own set of available functions
> and alter the BPF machinery for their specific use case.
> The BPF64 instruction set is designed for efficient mapping to native
> instructions on 64-bit CPUs.
>
> Old BPF instructions are remapped on the fly to BPF64
> when sysctl net.core.bpf64_enable=1.

Would be nice if the commit message could be extended with what you
have posted in your [PATCH v3 net-next 0/1] email (but without the
changelog; the changelog should go after the "---" line), so that this
information will also appear here and in the Git log later on. Also,
please elaborate more in this commit message. I think, at least since
it's a bigger change, it also needs to include future planned usage
scenarios for user space and inside the kernel, e.g. for OVS and ftrace.

You could make this patch a 2-patch series and put into patch 2/2 all
the documentation you had in your previous version of the set. It would
be nice to extend Documentation/networking/filter.txt with a description
of your extended BPF so that users can read about it in the same file.

Did you also test that seccomp-BPF still works out?

> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>   include/linux/filter.h      |    9 +-
>   include/linux/netdevice.h   |    1 +
>   include/uapi/linux/filter.h |   37 ++-
>   net/core/Makefile           |    2 +-
>   net/core/bpf_run.c          |  766 +++++++++++++++++++++++++++++++++++++++++++
>   net/core/filter.c           |  114 ++++++-

I would have liked the content from net/core/bpf_run.c to go
directly into net/core/filter.c rather than a separate file, if that's
possible.

>   net/core/sysctl_net_core.c  |    7 +
>   7 files changed, 913 insertions(+), 23 deletions(-)
>   create mode 100644 net/core/bpf_run.c
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index e568c8ef896b..bf3085258f4c 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
>   extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned len);
>   extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
>
> +/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
> +int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn *new_prog,
> +		int *p_new_len);
> +/* execute bpf64 program */
> +u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
> +
> +#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
>   #ifdef CONFIG_BPF_JIT
>   #include <stdarg.h>
>   #include <linux/linkage.h>
> @@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
>   		print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
>   			       16, 1, image, proglen, false);
>   }
> -#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
>   #else
>   #include <linux/slab.h>
>   static inline void bpf_jit_compile(struct sk_filter *fp)
> @@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
>   {
>   	kfree(fp);
>   }
> -#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
>   #endif
>
>   static inline int bpf_tell_extensions(void)
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5e84483c0650..7b1acefc244e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2971,6 +2971,7 @@ extern int		netdev_max_backlog;
>   extern int		netdev_tstamp_prequeue;
>   extern int		weight_p;
>   extern int		bpf_jit_enable;
> +extern int		bpf64_enable;

We should keep naming consistent (so either extended BPF or BPF64),
so maybe bpf_ext_enable? I'd rather presume {bpf,sk_filter}*_ext,
as in 'struct bpf_insn' the immediate value is 32 bit, so for 64-bit
comparisons you'd still need to load two immediate values, right?

After all your changes, we will still have the bpf_jit_enable knob
intact, right?

>   bool netdev_has_upper_dev(struct net_device *dev, struct net_device *upper_dev);
>   struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
> diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
> index 8eb9ccaa5b48..70ff29ee6825 100644
> --- a/include/uapi/linux/filter.h
> +++ b/include/uapi/linux/filter.h
> @@ -1,3 +1,4 @@
> +/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com */
>   /*
>    * Linux Socket Filter Data Structures
>    */

You can merge both comments into one.

> @@ -19,7 +20,7 @@
>    *	Try and keep these values and structures similar to BSD, especially
>    *	the BPF code definitions which need to match so you can share filters
>    */
> -
> +
>   struct sock_filter {	/* Filter block */
>   	__u16	code;   /* Actual filter code */
>   	__u8	jt;	/* Jump true */
> @@ -45,12 +46,26 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
>   #define         BPF_JMP         0x05
>   #define         BPF_RET         0x06
>   #define         BPF_MISC        0x07
> +#define         BPF_ALU64       0x07
> +
> +struct bpf_insn {
> +	__u8	code;    /* opcode */
> +	__u8    a_reg:4; /* dest register*/
> +	__u8    x_reg:4; /* source register */
> +	__s16	off;     /* signed offset */
> +	__s32	imm;     /* signed immediate constant */
> +};

As we have struct sock_filter and it's immutable due to uapi, I
would have liked to see the new data structure with a consistent
naming scheme, e.g. struct sock_filter_ext {...} for extended
BPF.

> +/* pointer to bpf_context is the first and only argument to BPF program
> + * its definition is use-case specific */
> +struct bpf_context;
>
>   /* ld/ldx fields */
>   #define BPF_SIZE(code)  ((code) & 0x18)
>   #define         BPF_W           0x00
>   #define         BPF_H           0x08
>   #define         BPF_B           0x10
> +#define         BPF_DW          0x18
>   #define BPF_MODE(code)  ((code) & 0xe0)
>   #define         BPF_IMM         0x00
>   #define         BPF_ABS         0x20
> @@ -58,6 +73,7 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
>   #define         BPF_MEM         0x60
>   #define         BPF_LEN         0x80
>   #define         BPF_MSH         0xa0
> +#define         BPF_XADD        0xc0 /* exclusive add */
>
>   /* alu/jmp fields */
>   #define BPF_OP(code)    ((code) & 0xf0)
> @@ -68,16 +84,24 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
>   #define         BPF_OR          0x40
>   #define         BPF_AND         0x50
>   #define         BPF_LSH         0x60
> -#define         BPF_RSH         0x70
> +#define         BPF_RSH         0x70 /* logical shift right */
>   #define         BPF_NEG         0x80
>   #define		BPF_MOD		0x90
>   #define		BPF_XOR		0xa0
> +#define		BPF_MOV		0xb0 /* mov reg to reg */
> +#define		BPF_ARSH	0xc0 /* sign extending arithmetic shift right */
> +#define		BPF_BSWAP32	0xd0 /* swap lower 4 bytes of 64-bit register */
> +#define		BPF_BSWAP64	0xe0 /* swap all 8 bytes of 64-bit register */
>
>   #define         BPF_JA          0x00
> -#define         BPF_JEQ         0x10
> -#define         BPF_JGT         0x20
> -#define         BPF_JGE         0x30
> -#define         BPF_JSET        0x40
> +#define         BPF_JEQ         0x10 /* jump == */
> +#define         BPF_JGT         0x20 /* GT is unsigned '>', JA in x86 */
> +#define         BPF_JGE         0x30 /* GE is unsigned '>=', JAE in x86 */
> +#define         BPF_JSET        0x40 /* if (A & X) */
> +#define         BPF_JNE         0x50 /* jump != */
> +#define         BPF_JSGT        0x60 /* SGT is signed '>', GT in x86 */
> +#define         BPF_JSGE        0x70 /* SGE is signed '>=', GE in x86 */
> +#define         BPF_CALL        0x80 /* function call */
>   #define BPF_SRC(code)   ((code) & 0x08)
>   #define         BPF_K           0x00
>   #define         BPF_X           0x08
> @@ -134,5 +158,4 @@ struct sock_fprog {	/* Required for SO_ATTACH_FILTER. */
>   #define SKF_NET_OFF   (-0x100000)
>   #define SKF_LL_OFF    (-0x200000)
>
> -
>   #endif /* _UAPI__LINUX_FILTER_H__ */
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 9628c20acff6..e622b97f58dc 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \
>   obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
>
>   obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
> -			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
> +			neighbour.o rtnetlink.o utils.o link_watch.o filter.o bpf_run.o \
>   			sock_diag.o dev_ioctl.o
>
>   obj-$(CONFIG_XFRM) += flow.o
> diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
> new file mode 100644
> index 000000000000..fa1862fcbc74
> --- /dev/null
> +++ b/net/core/bpf_run.c
> @@ -0,0 +1,766 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/filter.h>
> +#include <linux/skbuff.h>
> +#include <asm/unaligned.h>
> +
> +int bpf64_enable __read_mostly;
> +
> +void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k,
> +					   unsigned int size);
> +
> +static inline void *load_pointer(const struct sk_buff *skb, int k,
> +				 unsigned int size, void *buffer)
> +{
> +	if (k >= 0)
> +		return skb_header_pointer(skb, k, size, buffer);
> +	return bpf_internal_load_pointer_neg_helper(skb, k, size);
> +}
> +
> +static const char *const bpf_class_string[] = {
> +	"ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
> +};
> +
> +static const char *const bpf_alu_string[] = {
> +	"+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
> +	"%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
> +};
> +
> +static const char *const bpf_ldst_string[] = {
> +	"u32", "u16", "u8", "u64"
> +};
> +
> +static const char *const bpf_jmp_string[] = {
> +	"jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
> +};
> +
> +static const char *reg_to_str(int regno, u64 *regs)
> +{
> +	static char reg_value[16][32];
> +	if (!regs)
> +		return "";
> +	snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
> +		 regs[regno]);
> +	return reg_value[regno];
> +}
> +
> +#define R(regno) reg_to_str(regno, regs)
> +
> +void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
> +{
> +	u16 class = BPF_CLASS(insn->code);
> +	if (class == BPF_ALU || class == BPF_ALU64) {
> +		if (BPF_SRC(insn->code) == BPF_X)
> +			pr_info("code_%02x %sr%d%s %s r%d%s\n",
> +				insn->code, class == BPF_ALU ? "(u32)" : "",
> +				insn->a_reg, R(insn->a_reg),
> +				bpf_alu_string[BPF_OP(insn->code) >> 4],
> +				insn->x_reg, R(insn->x_reg));
> +		else
> +			pr_info("code_%02x %sr%d%s %s %d\n",
> +				insn->code, class == BPF_ALU ? "(u32)" : "",
> +				insn->a_reg, R(insn->a_reg),
> +				bpf_alu_string[BPF_OP(insn->code) >> 4],
> +				insn->imm);
> +	} else if (class == BPF_STX) {
> +		if (BPF_MODE(insn->code) == BPF_MEM)
> +			pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
> +				insn->code,
> +				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +				insn->a_reg, R(insn->a_reg),
> +				insn->off, insn->x_reg, R(insn->x_reg));
> +		else if (BPF_MODE(insn->code) == BPF_XADD)
> +			pr_info("code_%02x lock *(%s *)(r%d%s %+d) += r%d%s\n",
> +				insn->code,
> +				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +				insn->a_reg, R(insn->a_reg), insn->off,
> +				insn->x_reg, R(insn->x_reg));
> +		else
> +			pr_info("BUG_%02x\n", insn->code);
> +	} else if (class == BPF_ST) {
> +		if (BPF_MODE(insn->code) != BPF_MEM) {
> +			pr_info("BUG_st_%02x\n", insn->code);
> +			return;
> +		}
> +		pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
> +			insn->code,
> +			bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +			insn->a_reg, R(insn->a_reg),
> +			insn->off, insn->imm);
> +	} else if (class == BPF_LDX) {
> +		if (BPF_MODE(insn->code) == BPF_MEM) {
> +			pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
> +				insn->code, insn->a_reg,
> +				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +				insn->x_reg, R(insn->x_reg), insn->off);
> +		} else {
> +			pr_info("BUG_ldx_%02x\n", insn->code);
> +		}
> +	} else if (class == BPF_LD) {
> +		if (BPF_MODE(insn->code) == BPF_ABS) {
> +			pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
> +				insn->code, insn->a_reg,
> +				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +				insn->imm);
> +		} else if (BPF_MODE(insn->code) == BPF_IND) {
> +			pr_info("code_%02x r%d = *(%s *)(skb + r%d%s %+d)\n",
> +				insn->code, insn->a_reg,
> +				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> +				insn->x_reg, R(insn->x_reg), insn->imm);
> +		} else {
> +			pr_info("BUG_ld_%02x\n", insn->code);
> +		}
> +	} else if (class == BPF_JMP) {
> +		u16 opcode = BPF_OP(insn->code);
> +		if (opcode == BPF_CALL) {
> +			pr_info("code_%02x call %d\n", insn->code, insn->imm);
> +		} else if (insn->code == (BPF_JMP | BPF_JA)) {
> +			pr_info("code_%02x goto pc%+d\n",
> +				insn->code, insn->off);
> +		} else if (BPF_SRC(insn->code) == BPF_X) {
> +			pr_info("code_%02x if r%d%s %s r%d%s goto pc%+d\n",
> +				insn->code, insn->a_reg, R(insn->a_reg),
> +				bpf_jmp_string[BPF_OP(insn->code) >> 4],
> +				insn->x_reg, R(insn->x_reg), insn->off);
> +		} else {
> +			pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
> +				insn->code, insn->a_reg, R(insn->a_reg),
> +				bpf_jmp_string[BPF_OP(insn->code) >> 4],
> +				insn->imm, insn->off);
> +		}
> +	} else {
> +		pr_info("code_%02x %s\n", insn->code, bpf_class_string[class]);
> +	}
> +}
> +EXPORT_SYMBOL(pr_info_bpf_insn);

Why would that need to be exported as a symbol?

I would actually like to avoid having this pr_info* inside the kernel.
Couldn't this be done e.g. through a systemtap script that could be
placed under tools/net/ or inside the documentation file?

> +/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style (BPF64)
> + *
> + * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate new
> + * program length in one pass
> + *
> + * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
> + *
> + * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
> + * to remap in two passes: 1st pass finds new jump offsets, 2nd pass remaps
> + */

Would be nice to have the API comment in kdoc format.

> +int bpf_convert(struct sock_filter *old_prog, int len,
> +		struct bpf_insn *new_prog, int *p_new_len)
> +{

If we place it into net/core/filter.c, it would be nice to keep naming
conventions consistent, e.g. sk_convert_filter() or so.

> +	struct bpf_insn *new_insn;
> +	struct sock_filter *fp;
> +	int *addrs = NULL;
> +	int new_len = 0;
> +	int pass = 0;
> +	int tgt, i;
> +
> +	if (len <= 0 || len >= BPF_MAXINSNS)
> +		return -EINVAL;
> +
> +	if (new_prog) {
> +		addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
> +		if (!addrs)
> +			return -ENOMEM;
> +	}
> +
> +do_pass:
> +	new_insn = new_prog;
> +	fp = old_prog;
> +	for (i = 0; i < len; fp++, i++) {
> +		struct bpf_insn tmp_insns[3] = {};
> +		struct bpf_insn *insn = tmp_insns;
> +
> +		if (addrs)
> +			addrs[i] = new_insn - new_prog;
> +
> +		switch (fp->code) {
> +		/* all arithmetic insns and skb loads map as-is */
> +		case BPF_ALU | BPF_ADD | BPF_X:
> +		case BPF_ALU | BPF_ADD | BPF_K:
> +		case BPF_ALU | BPF_SUB | BPF_X:
> +		case BPF_ALU | BPF_SUB | BPF_K:
> +		case BPF_ALU | BPF_AND | BPF_X:
> +		case BPF_ALU | BPF_AND | BPF_K:
> +		case BPF_ALU | BPF_OR | BPF_X:
> +		case BPF_ALU | BPF_OR | BPF_K:
> +		case BPF_ALU | BPF_LSH | BPF_X:
> +		case BPF_ALU | BPF_LSH | BPF_K:
> +		case BPF_ALU | BPF_RSH | BPF_X:
> +		case BPF_ALU | BPF_RSH | BPF_K:
> +		case BPF_ALU | BPF_XOR | BPF_X:
> +		case BPF_ALU | BPF_XOR | BPF_K:
> +		case BPF_ALU | BPF_MUL | BPF_X:
> +		case BPF_ALU | BPF_MUL | BPF_K:
> +		case BPF_ALU | BPF_DIV | BPF_X:
> +		case BPF_ALU | BPF_DIV | BPF_K:
> +		case BPF_ALU | BPF_MOD | BPF_X:
> +		case BPF_ALU | BPF_MOD | BPF_K:
> +		case BPF_ALU | BPF_NEG:
> +		case BPF_LD | BPF_ABS | BPF_W:
> +		case BPF_LD | BPF_ABS | BPF_H:
> +		case BPF_LD | BPF_ABS | BPF_B:
> +		case BPF_LD | BPF_IND | BPF_W:
> +		case BPF_LD | BPF_IND | BPF_H:
> +		case BPF_LD | BPF_IND | BPF_B:

I think here and elsewhere for similar constructions, we could use the
BPF_S_* helpers that were introduced by Hagen in commit 01f2f3f6ef4d076
("net: optimize Berkeley Packet Filter (BPF) processing").

> +			insn->code = fp->code;
> +			insn->a_reg = 6;
> +			insn->x_reg = 7;
> +			insn->imm = fp->k;
> +			break;
> +
> +		/* jump opcodes map as-is, but offsets need adjustment */
> +		case BPF_JMP | BPF_JA:
> +			tgt = i + fp->k + 1;
> +			insn->code = fp->code;
> +#define EMIT_JMP \
> +	do { \
> +		if (tgt >= len || tgt < 0) \
> +			goto err; \
> +		insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
> +	} while (0)
> +
> +			EMIT_JMP;
> +			break;
> +
> +		case BPF_JMP | BPF_JEQ | BPF_K:
> +		case BPF_JMP | BPF_JEQ | BPF_X:
> +		case BPF_JMP | BPF_JSET | BPF_K:
> +		case BPF_JMP | BPF_JSET | BPF_X:
> +		case BPF_JMP | BPF_JGT | BPF_K:
> +		case BPF_JMP | BPF_JGT | BPF_X:
> +		case BPF_JMP | BPF_JGE | BPF_K:
> +		case BPF_JMP | BPF_JGE | BPF_X:
> +			insn->a_reg = 6;
> +			insn->x_reg = 7;
> +			insn->imm = fp->k;
> +			/* common case where 'jump_false' is next insn */
> +			if (fp->jf == 0) {
> +				insn->code = fp->code;
> +				tgt = i + fp->jt + 1;
> +				EMIT_JMP;
> +				break;
> +			}
> +			/* convert JEQ into JNE when 'jump_true' is next insn */
> +			if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
> +				insn->code = BPF_JMP | BPF_JNE |
> +					BPF_SRC(fp->code);
> +				tgt = i + fp->jf + 1;
> +				EMIT_JMP;
> +				break;
> +			}
> +			/* other jumps are mapped into two insns: Jxx and JA */
> +			tgt = i + fp->jt + 1;
> +			insn->code = fp->code;
> +			EMIT_JMP;
> +
> +			insn++;
> +			insn->code = BPF_JMP | BPF_JA;
> +			tgt = i + fp->jf + 1;
> +			EMIT_JMP;
> +			/* adjust pc relative offset, since it's a 2nd insn */
> +			insn->off--;
> +			break;
> +
> +			/* ldxb 4*([14]&0xf) is remaped into 3 insns */
> +		case BPF_LDX | BPF_MSH | BPF_B:
> +			insn->code = BPF_LD | BPF_ABS | BPF_B;
> +			insn->a_reg = 7;
> +			insn->imm = fp->k;
> +
> +			insn++;
> +			insn->code = BPF_ALU | BPF_AND | BPF_K;
> +			insn->a_reg = 7;
> +			insn->imm = 0xf;
> +
> +			insn++;
> +			insn->code = BPF_ALU | BPF_LSH | BPF_K;
> +			insn->a_reg = 7;
> +			insn->imm = 2;
> +			break;
> +
> +			/* RET_K, RET_A are remaped into 2 insns */
> +		case BPF_RET | BPF_A:
> +		case BPF_RET | BPF_K:
> +			insn->code = BPF_ALU | BPF_MOV |
> +				(BPF_SRC(fp->code) == BPF_K ? BPF_K : BPF_X);
> +			insn->a_reg = 0;
> +			insn->x_reg = 6;
> +			insn->imm = fp->k;
> +
> +			insn++;
> +			insn->code = BPF_RET | BPF_K;
> +			break;
> +
> +			/* store to stack */
> +		case BPF_ST:
> +		case BPF_STX:
> +			insn->code = BPF_STX | BPF_MEM | BPF_W;
> +			insn->a_reg = 10;
> +			insn->x_reg = fp->code == BPF_ST ? 6 : 7;
> +			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
> +			break;
> +
> +			/* load from stack */
> +		case BPF_LD | BPF_MEM:
> +		case BPF_LDX | BPF_MEM:
> +			insn->code = BPF_LDX | BPF_MEM | BPF_W;
> +			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
> +			insn->x_reg = 10;
> +			insn->off = -(BPF_MEMWORDS - fp->k) * 4;
> +			break;
> +
> +			/* A = K or X = K */
> +		case BPF_LD | BPF_IMM:
> +		case BPF_LDX | BPF_IMM:
> +			insn->code = BPF_ALU | BPF_MOV | BPF_K;
> +			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
> +			insn->imm = fp->k;
> +			break;
> +
> +			/* X = A */
> +		case BPF_MISC | BPF_TAX:
> +			insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> +			insn->a_reg = 7;
> +			insn->x_reg = 6;
> +			break;
> +
> +			/* A = X */
> +		case BPF_MISC | BPF_TXA:
> +			insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> +			insn->a_reg = 6;
> +			insn->x_reg = 7;
> +			break;
> +
> +			/* A = skb->len or X = skb->len */
> +		case BPF_LD|BPF_W|BPF_LEN:
> +		case BPF_LDX|BPF_W|BPF_LEN:
> +			insn->code = BPF_LDX | BPF_MEM | BPF_W;
> +			insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 : 7;
> +			insn->x_reg = 1;
> +			insn->off = offsetof(struct sk_buff, len);
> +			break;
> +
> +		default:
> +			/* pr_err("unknown opcode %02x\n", fp->code); */
> +			goto err;
> +		}
> +
> +		insn++;
> +		if (new_prog) {
> +			memcpy(new_insn, tmp_insns,
> +			       sizeof(*insn) * (insn - tmp_insns));
> +		}
> +		new_insn += insn - tmp_insns;
> +	}
> +
> +	if (!new_prog) {
> +		/* only calculating new length */
> +		*p_new_len = new_insn - new_prog;
> +		return 0;
> +	}
> +
> +	pass++;
> +	if (new_len != new_insn - new_prog) {
> +		new_len = new_insn - new_prog;
> +		if (pass > 2)
> +			goto err;
> +		goto do_pass;
> +	}
> +	kfree(addrs);
> +	if (*p_new_len != new_len)
> +		/* inconsistent new program length */
> +		pr_err("bpf_convert() usage error\n");
> +	return 0;
> +err:
> +	kfree(addrs);
> +	return -EINVAL;
> +}
> +
> +notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
> +{

Similarly, sk_run_filter_ext()?

> +	u64 stack[64];
> +	u64 regs[16];
> +	void *ptr;
> +	u64 tmp;
> +	int off;
> +
> +#ifdef __x86_64
> +#define LOAD_IMM /**/
> +#define K insn->imm
> +#else
> +#define LOAD_IMM (K = insn->imm)
> +	s32 K = insn->imm;
> +#endif
> +
> +#ifdef DEBUG
> +#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
> +#else
> +#define DEBUG_INSN
> +#endif

This DEBUG hunk could then be removed when we use a stap script
instead, for example.

> +#define A regs[insn->a_reg]
> +#define X regs[insn->x_reg]
> +
> +#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
> +#define L(A, B, C) L_##A##B##C:

Could we get rid of these two defines? I know you're defining
and using that as labels, but it's not obvious at first what
'jt' does. Maybe 'jt_labels' ?

> +#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
> +#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })

Not a big fan of macros, but here it seems fine though.

> +/* some compilers may need help:
> + * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code]; })
> + */
> +
> +	static const void *jt[256] = {
> +		[0 ... 255] = &&L_default,

Nit: please just one define per line below:

> +		DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
> +		DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
> +		DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
> +		DL(BPF_ALU, BPF_OR, BPF_X)  DL(BPF_ALU, BPF_OR, BPF_K)
> +		DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
> +		DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
> +		DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
> +		DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
> +		DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
> +		DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
> +		DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
> +		DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD, BPF_K)
> +		DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB, BPF_K)
> +		DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND, BPF_K)
> +		DL(BPF_ALU64, BPF_OR, BPF_X)  DL(BPF_ALU64, BPF_OR, BPF_K)
> +		DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH, BPF_K)
> +		DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH, BPF_K)
> +		DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR, BPF_K)
> +		DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL, BPF_K)
> +		DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV, BPF_K)
> +		DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH, BPF_K)
> +		DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV, BPF_K)
> +		DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD, BPF_K)
> +		DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
> +		DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
> +		DL(BPF_ALU, BPF_NEG, 0)
> +		DL(BPF_JMP, BPF_CALL, 0)
> +		DL(BPF_JMP, BPF_JA, 0)
> +		DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
> +		DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
> +		DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
> +		DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
> +		DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
> +		DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
> +		DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
> +		DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
> +		DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
> +		DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
> +		DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
> +		DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
> +		DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
> +		DL(BPF_STX, BPF_XADD, BPF_W)
> +#ifdef CONFIG_64BIT
> +		DL(BPF_STX, BPF_XADD, BPF_DW)
> +#endif
> +		DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
> +		DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
> +		DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
> +		DL(BPF_RET, BPF_K, 0)
> +	};
> +
> +	regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
> +	regs[1/* BPF R1 */] = (u64)(ulong)ctx;
> +
> +	DEBUG_INSN;
> +	/* execute 1st insn */
> +select_insn:
> +	goto *jt[insn->code];
> +
> +		/* ALU */
> +#define ALU(OPCODE, OP) \
> +	L_BPF_ALU64##OPCODE##BPF_X: \
> +		A = A OP X; \
> +		CONT; \
> +	L_BPF_ALU##OPCODE##BPF_X: \
> +		A = (u32)A OP (u32)X; \
> +		CONT; \
> +	L_BPF_ALU64##OPCODE##BPF_K: \
> +		A = A OP K; \
> +		CONT; \
> +	L_BPF_ALU##OPCODE##BPF_K: \
> +		A = (u32)A OP (u32)K; \
> +		CONT;
> +
> +	ALU(BPF_ADD, +)
> +	ALU(BPF_SUB, -)
> +	ALU(BPF_AND, &)
> +	ALU(BPF_OR, |)
> +	ALU(BPF_LSH, <<)
> +	ALU(BPF_RSH, >>)
> +	ALU(BPF_XOR, ^)
> +	ALU(BPF_MUL, *)
> +
> +	L(BPF_ALU, BPF_NEG, 0)
> +		A = (u32)-A;
> +		CONT;
> +	L(BPF_ALU, BPF_MOV, BPF_X)
> +		A = (u32)X;
> +		CONT;
> +	L(BPF_ALU, BPF_MOV, BPF_K)
> +		A = (u32)K;
> +		CONT;
> +	L(BPF_ALU64, BPF_MOV, BPF_X)
> +		A = X;
> +		CONT;
> +	L(BPF_ALU64, BPF_MOV, BPF_K)
> +		A = K;
> +		CONT;
> +	L(BPF_ALU64, BPF_ARSH, BPF_X)
> +		(*(s64 *) &A) >>= X;
> +		CONT;
> +	L(BPF_ALU64, BPF_ARSH, BPF_K)
> +		(*(s64 *) &A) >>= K;
> +		CONT;
> +	L(BPF_ALU64, BPF_MOD, BPF_X)
> +		tmp = A;
> +		if (X)
> +			A = do_div(tmp, X);
> +		CONT;
> +	L(BPF_ALU, BPF_MOD, BPF_X)
> +		tmp = (u32)A;
> +		if (X)
> +			A = do_div(tmp, (u32)X);
> +		CONT;
> +	L(BPF_ALU64, BPF_MOD, BPF_K)
> +		tmp = A;
> +		if (K)
> +			A = do_div(tmp, K);
> +		CONT;
> +	L(BPF_ALU, BPF_MOD, BPF_K)
> +		tmp = (u32)A;
> +		if (K)
> +			A = do_div(tmp, (u32)K);
> +		CONT;
> +	L(BPF_ALU64, BPF_DIV, BPF_X)
> +		if (X)
> +			do_div(A, X);
> +		CONT;
> +	L(BPF_ALU, BPF_DIV, BPF_X)
> +		tmp = (u32)A;
> +		if (X)
> +			do_div(tmp, (u32)X);
> +		A = (u32)tmp;
> +		CONT;
> +	L(BPF_ALU64, BPF_DIV, BPF_K)
> +		if (K)
> +			do_div(A, K);
> +		CONT;
> +	L(BPF_ALU, BPF_DIV, BPF_K)
> +		tmp = (u32)A;
> +		if (K)
> +			do_div(tmp, (u32)K);
> +		A = (u32)tmp;
> +		CONT;
> +	L(BPF_ALU64, BPF_BSWAP32, BPF_X)
> +		A = swab32(A);
> +		CONT;
> +	L(BPF_ALU64, BPF_BSWAP64, BPF_X)
> +		A = swab64(A);
> +		CONT;
> +
> +		/* CALL */
> +	L(BPF_JMP, BPF_CALL, 0)
> +		/* TODO execute_func(K, regs); */
> +		CONT;

Would the filter checker for now just return -EINVAL when this is used?

> +		/* JMP */
> +	L(BPF_JMP, BPF_JA, 0)
> +		insn += insn->off;
> +		CONT;
> +	L(BPF_JMP, BPF_JEQ, BPF_X)
> +		if (A == X) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JEQ, BPF_K)
> +		if (A == K) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JNE, BPF_X)
> +		if (A != X) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JNE, BPF_K)
> +		if (A != K) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JGT, BPF_X)
> +		if (A > X) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JGT, BPF_K)
> +		if (A > K) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JGE, BPF_X)
> +		if (A >= X) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JGE, BPF_K)
> +		if (A >= K) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JSGT, BPF_X)
> +		if (((s64)A) > ((s64)X)) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JSGT, BPF_K)
> +		if (((s64)A) > ((s64)K)) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JSGE, BPF_X)
> +		if (((s64)A) >= ((s64)X)) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JSGE, BPF_K)
> +		if (((s64)A) >= ((s64)K)) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JSET, BPF_X)
> +		if (A & X) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;
> +	L(BPF_JMP, BPF_JSET, BPF_K)
> +		if (A & (u32)K) {
> +			insn += insn->off;
> +			CONT_JMP;
> +		}
> +		CONT;

Okay, so right now we only support forward jumps, right? And these
are checked by the old checker code I'd assume?

> +		/* STX and ST and LDX*/
> +#define LDST(SIZEOP, SIZE) \
> +	L_BPF_STXBPF_MEM##SIZEOP: \
> +		*(SIZE *)(ulong)(A + insn->off) = X; \
> +		CONT; \
> +	L_BPF_STBPF_MEM##SIZEOP: \
> +		*(SIZE *)(ulong)(A + insn->off) = K; \
> +		CONT; \
> +	L_BPF_LDXBPF_MEM##SIZEOP: \
> +		A = *(SIZE *)(ulong)(X + insn->off); \
> +		CONT;
> +
> +		LDST(BPF_B, u8)
> +		LDST(BPF_H, u16)
> +		LDST(BPF_W, u32)
> +		LDST(BPF_DW, u64)
> +
> +		/* STX XADD */
> +	L(BPF_STX, BPF_XADD, BPF_W)
> +		atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
> +		CONT;
> +#ifdef CONFIG_64BIT
> +	L(BPF_STX, BPF_XADD, BPF_DW)
> +		atomic64_add((u64)X, (atomic64_t *)(ulong)(A + insn->off));
> +		CONT;
> +#endif
> +
> +		/* LD from SKB + K */

Nit: indent one tab too much (here and elsewhere)

> +	L(BPF_LD, BPF_ABS, BPF_W)
> +		off = K;
> +load_word:
> +		ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
> +		if (likely(ptr != NULL)) {
> +			A = get_unaligned_be32(ptr);
> +			CONT;
> +		}
> +		return 0;
> +
> +	L(BPF_LD, BPF_ABS, BPF_H)
> +		off = K;
> +load_half:
> +		ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
> +		if (likely(ptr != NULL)) {
> +			A = get_unaligned_be16(ptr);
> +			CONT;
> +		}
> +		return 0;
> +
> +	L(BPF_LD, BPF_ABS, BPF_B)
> +		off = K;
> +load_byte:
> +		ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
> +		if (likely(ptr != NULL)) {
> +			A = *(u8 *)ptr;
> +			CONT;
> +		}
> +		return 0;
> +
> +		/* LD from SKB + X + K */

Nit: ditto

> +	L(BPF_LD, BPF_IND, BPF_W)
> +		off = K + X;
> +		goto load_word;
> +
> +	L(BPF_LD, BPF_IND, BPF_H)
> +		off = K + X;
> +		goto load_half;
> +
> +	L(BPF_LD, BPF_IND, BPF_B)
> +		off = K + X;
> +		goto load_byte;
> +
> +		/* RET */
> +	L(BPF_RET, BPF_K, 0)
> +		return regs[0/* R0 */];
> +
> +	L_default:
> +		/* bpf_check() or bpf_convert() will guarantee that
> +		 * we never reach here
> +		 */
> +		WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
> +		return 0;
> +#undef DL
> +#undef L
> +#undef CONT
> +#undef A
> +#undef X
> +#undef K
> +#undef LOAD_IMM
> +#undef DEBUG_INSN
> +#undef ALU
> +#undef LDST
> +}
> +EXPORT_SYMBOL(bpf_run);
> +
> diff --git a/net/core/filter.c b/net/core/filter.c
> index ad30d626a5bd..575bf5fd4335 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
>   {
>   	struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
>
> +	if ((void *)fp->bpf_func == (void *)bpf_run)
> +		/* arch specific jit_free are expecting this value */
> +		fp->bpf_func = sk_run_filter;
> +
>   	bpf_jit_free(fp);
>   }
>   EXPORT_SYMBOL(sk_filter_release_rcu);
> @@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
>   	return 0;
>   }
>
> +static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog *fprog,
> +			 struct sock *sk)
> +{

sk_prepare_filter_ext()?

> +	unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
> +	struct sock_filter *old_prog;
> +	unsigned int sk_fsize;
> +	struct sk_filter *fp;
> +	int new_len;
> +	int err;
> +
> +	/* store old program into buffer, since chk_filter will remap opcodes */
> +	old_prog = kmalloc(fsize, GFP_KERNEL);
> +	if (!old_prog)
> +		return -ENOMEM;
> +
> +	if (sk) {
> +		if (copy_from_user(old_prog, fprog->filter, fsize)) {
> +			err = -EFAULT;
> +			goto free_prog;
> +		}
> +	} else {
> +		memcpy(old_prog, fprog->filter, fsize);
> +	}
> +
> +	/* calculate bpf64 program length */
> +	err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
> +	if (err)
> +		goto free_prog;
> +
> +	sk_fsize = sk_filter_size(new_len);
> +	/* allocate sk_filter to store bpf64 program */
> +	if (sk)
> +		fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
> +	else
> +		fp = kmalloc(sk_fsize, GFP_KERNEL);
> +	if (!fp) {
> +		err = -ENOMEM;
> +		goto free_prog;
> +	}
> +
> +	/* remap old insns into bpf64 */
> +	err = bpf_convert(old_prog, fprog->len,
> +			  (struct bpf_insn *)fp->insns, &new_len);
> +	if (err)
> +		/* 2nd bpf_convert() can fail only if it fails
> +		 * to allocate memory, remapping must succeed
> +		 */
> +		goto free_fp;
> +
> +	/* now chk_filter can overwrite old_prog while checking */
> +	err = sk_chk_filter(old_prog, fprog->len);
> +	if (err)
> +		goto free_fp;
> +
> +	/* discard old prog */
> +	kfree(old_prog);
> +
> +	atomic_set(&fp->refcnt, 1);
> +	fp->len = new_len;
> +
> +	/* bpf64 insns must be executed by bpf_run */
> +	fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
> +
> +	*pfp = fp;
> +	return 0;
> +free_fp:
> +	if (sk)
> +		sock_kfree_s(sk, fp, sk_fsize);
> +	else
> +		kfree(fp);
> +free_prog:
> +	kfree(old_prog);
> +	return err;
> +}
> +
>   /**
>    *	sk_unattached_filter_create - create an unattached filter
>    *	@fprog: the filter program
> @@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
>   	if (fprog->filter == NULL)
>   		return -EINVAL;
>
> +	if (bpf64_enable)
> +		return bpf64_prepare(pfp, fprog, NULL);
> +
>   	fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
>   	if (!fp)
>   		return -ENOMEM;
> @@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>   	if (fprog->filter == NULL)
>   		return -EINVAL;
>
> -	fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
> -	if (!fp)
> -		return -ENOMEM;
> -	if (copy_from_user(fp->insns, fprog->filter, fsize)) {
> -		sock_kfree_s(sk, fp, sk_fsize);
> -		return -EFAULT;
> -	}
> +	if (bpf64_enable) {
> +		err = bpf64_prepare(&fp, fprog, sk);
> +		if (err)
> +			return err;
> +	} else {
> +		fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
> +		if (!fp)
> +			return -ENOMEM;
> +		if (copy_from_user(fp->insns, fprog->filter, fsize)) {
> +			sock_kfree_s(sk, fp, sk_fsize);
> +			return -EFAULT;
> +		}
>
> -	atomic_set(&fp->refcnt, 1);
> -	fp->len = fprog->len;
> +		atomic_set(&fp->refcnt, 1);
> +		fp->len = fprog->len;
>
> -	err = __sk_prepare_filter(fp);
> -	if (err) {
> -		sk_filter_uncharge(sk, fp);
> -		return err;
> +		err = __sk_prepare_filter(fp);
> +		if (err) {
> +			sk_filter_uncharge(sk, fp);
> +			return err;
> +		}
>   	}
>
>   	old_fp = rcu_dereference_protected(sk->sk_filter,
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index cf9cd13509a7..f03acc0e8950 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
>   	},
>   #endif
>   	{
> +		.procname	= "bpf64_enable",
> +		.data		= &bpf64_enable,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec
> +	},
> +	{
>   		.procname	= "netdev_tstamp_prequeue",
>   		.data		= &netdev_tstamp_prequeue,
>   		.maxlen		= sizeof(int),
>

Hope some of the comments made sense. ;-)

Thanks,

Daniel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-02-28 12:45   ` Daniel Borkmann
@ 2014-02-28 20:53     ` Alexei Starovoitov
  2014-03-01  0:10       ` Alexei Starovoitov
  2014-03-01  0:30       ` Daniel Borkmann
  0 siblings, 2 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2014-02-28 20:53 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Frederic Weisbecker, Arnaldo Carvalho de Melo, Pekka Enberg,
	Arjan van de Ven, Christoph Hellwig, linux-kernel, netdev,
	Hagen Paul Pfeifer, Jesse Gross

On Fri, Feb 28, 2014 at 4:45 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> Hi Alexei,
>
> [also cc'ing Hagen and Jesse]
>
> Just some minor comments below ... let me know what you think.

Thank you for the review! Comments below.

> On 02/27/2014 03:38 AM, Alexei Starovoitov wrote:
>>
>> Extended BPF (or 64-bit BPF) is an instruction set to
>> create safe dynamically loadable filters that can call fixed set
>> of kernel functions and take generic bpf_context as an input.
>> BPF filter is a glue between kernel functions and bpf_context.
>> Different kernel subsystems can define their own set of available
>> functions
>> and alter BPF machinery for specific use case.
>> BPF64 instruction set is designed for efficient mapping to native
>> instructions on 64-bit CPUs
>>
>> Old BPF instructions are remapped on the fly to BPF64
>> when sysctl net.core.bpf64_enable=1
>
>
> Would be nice if the commit message could be extended with what you
> have posted in your [PATCH v3 net-next 0/1] email (but without the
> changelog, changelog should go after "---" line), so that also this
> information will appear here and in the Git log later on. Also please
> elaborate more on this commit message. I think, at least, as it's a
> bigger change it also needs to include future planned usage scenarios
> for user space and for inside the kernel e.g. for OVS and ftrace.

Ok will do

> You could make this patch a 2 patch patch-series and put into patch
> 2/2 all documentation you had in your previous version of the set.
> Would be nice to extend the file Documentation/networking/filter.txt
> with a description of your extended BPF so that users can read about
> them in the same file.

Sure.

> Did you also test that seccomp-BPF still works out?

Yes. I have a prototype, but it needs a bit more cleanup.

>
>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>> ---
>>   include/linux/filter.h      |    9 +-
>>   include/linux/netdevice.h   |    1 +
>>   include/uapi/linux/filter.h |   37 ++-
>>   net/core/Makefile           |    2 +-
>>   net/core/bpf_run.c          |  766
>> +++++++++++++++++++++++++++++++++++++++++++
>>   net/core/filter.c           |  114 ++++++-
>
>
> I would have liked to see the content from net/core/bpf_run.c to go
> directly into net/core/filter.c, not as a separate file, if that's
> possible.

sure. that's fine.

>
>>   net/core/sysctl_net_core.c  |    7 +
>>   7 files changed, 913 insertions(+), 23 deletions(-)
>>   create mode 100644 net/core/bpf_run.c
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index e568c8ef896b..bf3085258f4c 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter,
>> unsigned int flen);
>>   extern int sk_get_filter(struct sock *sk, struct sock_filter __user
>> *filter, unsigned len);
>>   extern void sk_decode_filter(struct sock_filter *filt, struct
>> sock_filter *to);
>>
>> +/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
>> +int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn
>> *new_prog,
>> +               int *p_new_len);
>> +/* execute bpf64 program */
>> +u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
>> +
>> +#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>> FILTER->insns)
>>   #ifdef CONFIG_BPF_JIT
>>   #include <stdarg.h>
>>   #include <linux/linkage.h>
>> @@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen,
>> unsigned int proglen,
>>                 print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
>>                                16, 1, image, proglen, false);
>>   }
>> -#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>> FILTER->insns)
>>   #else
>>   #include <linux/slab.h>
>>   static inline void bpf_jit_compile(struct sk_filter *fp)
>> @@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
>>   {
>>         kfree(fp);
>>   }
>> -#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
>>   #endif
>>
>>   static inline int bpf_tell_extensions(void)
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 5e84483c0650..7b1acefc244e 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -2971,6 +2971,7 @@ extern int                netdev_max_backlog;
>>   extern int            netdev_tstamp_prequeue;
>>   extern int            weight_p;
>>   extern int            bpf_jit_enable;
>> +extern int             bpf64_enable;
>
>
> We should keep naming consistent (so either extended BPF or BPF64),
> so maybe bpf_ext_enable ? I'd presume rather {bpf,sk_filter}*_ext

agree. we need consistent naming for both (old and new).
I'll try an all-out rename of bpf_*() into sk_filter64_*() and
sk_filter_ext_*() to see which one looks better.
I'm kinda leaning towards sk_filter64, since it's easier to quickly spot
the difference and it's more descriptive.

> as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
> comparisons, you'd still need to load two immediate values, right?

there is no insn that uses a 64-bit immediate, since 64-bit immediates
are extremely rare. Grepping x86-64 asm code for movabsq returns very few hits.
llvm or gcc can easily construct any constant by a combination of mov,
shifts and ors.
bpf64 comparisons are all 64-bit right now. So far I didn't see a need for
32-bit comparisons: since old bpf is all unsigned, mapping the 32->64 Jxx
is painless.
Notice I added signed comparisons, since real-life programs cannot do
without them.
I also kept the spirit of old bpf by having > and >= only (no < and <=).
That made the llvm/gcc backends a bit harder to do, since no real cpu has
such a limitation.
I'm still contemplating whether to add < and <= (both signed and unsigned),
which is an interesting trade-off: number of instructions vs complexity
of the compiler.
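To make the 'mov, shifts and ors' point concrete, a back-end could
materialize a 64-bit constant roughly like this (illustrative sketch only;
the register number, constant and array name are arbitrary):

	/* r6 = 0x1234567800c0ffee, built from 32-bit immediates only */
	struct bpf_insn build_imm64[] = {
		/* r6 = 0x12345678 (upper half, sign bit clear) */
		{ .code = BPF_ALU64 | BPF_MOV | BPF_K, .a_reg = 6, .imm = 0x12345678 },
		/* r6 <<= 32 */
		{ .code = BPF_ALU64 | BPF_LSH | BPF_K, .a_reg = 6, .imm = 32 },
		/* r6 |= 0x00c0ffee (positive, so no sign-extension issue) */
		{ .code = BPF_ALU64 | BPF_OR | BPF_K, .a_reg = 6, .imm = 0x00c0ffee },
	};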

> After all your changes, we will still have the bpf_jit_enable knob
> intact, right?

Yes.
In this diff the workflow is the following:

old filter comes through sk_attach_filter() or sk_unattached_filter_create()
if (bpf64) {
    convert to new
    sk_chk_filter() - check old bpf
    use new interpreter
} else {
    sk_chk_filter() - check old bpf
    if (bpf_jit_enable)
        use old jit
    else
        use old interpreter
}
Soon I'll add a bpf64 JIT and will reuse the same bpf_jit_enable knob for it,
then add new/old in-band demux into sk_attach_filter(),
so that the workflow will become:

a filter comes through sk_attach_filter() or sk_unattached_filter_create()
if (new filter prog) {
    sk_chk_filter64() - check new bpf
    if (bpf_jit_enable)
        use new jit
    else
        use new interpreter
} else {
    if (bpf64_enable) {
       convert to new
       sk_chk_filter() - check old bpf
       if (bpf_jit_enable)
            use new jit
       else
            use new interpreter
    } else {
       sk_chk_filter()
       if (bpf_jit_enable)
           use old jit
       else
           use old interpreter
    }
}
Eventually bpf64_enable can be made the default, the last 'else' can be
retired, and 'bpf64_enable' can be removed along with the old interpreter.

The bpf_jit_enable knob will stay for the foreseeable future.

>>   bool netdev_has_upper_dev(struct net_device *dev, struct net_device
>> *upper_dev);
>>   struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device
>> *dev,
>> diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
>> index 8eb9ccaa5b48..70ff29ee6825 100644
>> --- a/include/uapi/linux/filter.h
>> +++ b/include/uapi/linux/filter.h
>> @@ -1,3 +1,4 @@
>> +/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com
>> */
>>   /*
>>    * Linux Socket Filter Data Structures
>>    */
>
>
> You can merge both comments into one.

ok.

>
>> @@ -19,7 +20,7 @@
>>    *    Try and keep these values and structures similar to BSD,
>> especially
>>    *    the BPF code definitions which need to match so you can share
>> filters
>>    */
>> -
>> +
>>   struct sock_filter {  /* Filter block */
>>         __u16   code;   /* Actual filter code */
>>         __u8    jt;     /* Jump true */
>> @@ -45,12 +46,26 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>> */
>>   #define         BPF_JMP         0x05
>>   #define         BPF_RET         0x06
>>   #define         BPF_MISC        0x07
>> +#define         BPF_ALU64       0x07
>> +
>> +struct bpf_insn {
>> +       __u8    code;    /* opcode */
>> +       __u8    a_reg:4; /* dest register*/
>> +       __u8    x_reg:4; /* source register */
>> +       __s16   off;     /* signed offset */
>> +       __s32   imm;     /* signed immediate constant */
>> +};
>
>
> As we have struct sock_filter and it's immutable due to uapi, I
> would have liked to see the new data structure with a consistent
> naming scheme, e.g. struct sock_filter_ext {...} for extended
> BPF.

ok. Will try renaming bpf_insn to sock_filter64 and to sock_filter_ext to
see which one looks better throughout the code.

>> +/* pointer to bpf_context is the first and only argument to BPF program
>> + * its definition is use-case specific */
>> +struct bpf_context;
>>
>>   /* ld/ldx fields */
>>   #define BPF_SIZE(code)  ((code) & 0x18)
>>   #define         BPF_W           0x00
>>   #define         BPF_H           0x08
>>   #define         BPF_B           0x10
>> +#define         BPF_DW          0x18
>>   #define BPF_MODE(code)  ((code) & 0xe0)
>>   #define         BPF_IMM         0x00
>>   #define         BPF_ABS         0x20
>> @@ -58,6 +73,7 @@ struct sock_fprog {   /* Required for SO_ATTACH_FILTER.
>> */
>>   #define         BPF_MEM         0x60
>>   #define         BPF_LEN         0x80
>>   #define         BPF_MSH         0xa0
>> +#define         BPF_XADD        0xc0 /* exclusive add */
>>
>>   /* alu/jmp fields */
>>   #define BPF_OP(code)    ((code) & 0xf0)
>> @@ -68,16 +84,24 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>> */
>>   #define         BPF_OR          0x40
>>   #define         BPF_AND         0x50
>>   #define         BPF_LSH         0x60
>> -#define         BPF_RSH         0x70
>> +#define         BPF_RSH         0x70 /* logical shift right */
>>   #define         BPF_NEG         0x80
>>   #define               BPF_MOD         0x90
>>   #define               BPF_XOR         0xa0
>> +#define                BPF_MOV         0xb0 /* mov reg to reg */
>> +#define                BPF_ARSH        0xc0 /* sign extending arithmetic
>> shift right */
>> +#define                BPF_BSWAP32     0xd0 /* swap lower 4 bytes of
>> 64-bit register */
>> +#define                BPF_BSWAP64     0xe0 /* swap all 8 bytes of 64-bit
>> register */
>>
>>   #define         BPF_JA          0x00
>> -#define         BPF_JEQ         0x10
>> -#define         BPF_JGT         0x20
>> -#define         BPF_JGE         0x30
>> -#define         BPF_JSET        0x40
>> +#define         BPF_JEQ         0x10 /* jump == */
>> +#define         BPF_JGT         0x20 /* GT is unsigned '>', JA in x86 */
>> +#define         BPF_JGE         0x30 /* GE is unsigned '>=', JAE in x86
>> */
>> +#define         BPF_JSET        0x40 /* if (A & X) */
>> +#define         BPF_JNE         0x50 /* jump != */
>> +#define         BPF_JSGT        0x60 /* SGT is signed '>', GT in x86 */
>> +#define         BPF_JSGE        0x70 /* SGE is signed '>=', GE in x86 */
>> +#define         BPF_CALL        0x80 /* function call */
>>   #define BPF_SRC(code)   ((code) & 0x08)
>>   #define         BPF_K           0x00
>>   #define         BPF_X           0x08
>> @@ -134,5 +158,4 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>> */
>>   #define SKF_NET_OFF   (-0x100000)
>>   #define SKF_LL_OFF    (-0x200000)
>>
>> -
>>   #endif /* _UAPI__LINUX_FILTER_H__ */
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 9628c20acff6..e622b97f58dc 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o
>> stream.o scm.o \
>>   obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
>>
>>   obj-y              += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o
>> \
>> -                       neighbour.o rtnetlink.o utils.o link_watch.o
>> filter.o \
>> +                       neighbour.o rtnetlink.o utils.o link_watch.o
>> filter.o bpf_run.o \
>>                         sock_diag.o dev_ioctl.o
>>
>>   obj-$(CONFIG_XFRM) += flow.o
>> diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
>> new file mode 100644
>> index 000000000000..fa1862fcbc74
>> --- /dev/null
>> +++ b/net/core/bpf_run.c
>> @@ -0,0 +1,766 @@
>> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of version 2 of the GNU General Public
>> + * License as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful, but
>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>> + * General Public License for more details.
>> + */
>> +#include <linux/kernel.h>
>> +#include <linux/types.h>
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/filter.h>
>> +#include <linux/skbuff.h>
>> +#include <asm/unaligned.h>
>> +
>> +int bpf64_enable __read_mostly;
>> +
>> +void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int
>> k,
>> +                                          unsigned int size);
>> +
>> +static inline void *load_pointer(const struct sk_buff *skb, int k,
>> +                                unsigned int size, void *buffer)
>> +{
>> +       if (k >= 0)
>> +               return skb_header_pointer(skb, k, size, buffer);
>> +       return bpf_internal_load_pointer_neg_helper(skb, k, size);
>> +}
>> +
>> +static const char *const bpf_class_string[] = {
>> +       "ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
>> +};
>> +
>> +static const char *const bpf_alu_string[] = {
>> +       "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
>> +       "%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
>> +};
>> +
>> +static const char *const bpf_ldst_string[] = {
>> +       "u32", "u16", "u8", "u64"
>> +};
>> +
>> +static const char *const bpf_jmp_string[] = {
>> +       "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
>> +};
>> +
>> +static const char *reg_to_str(int regno, u64 *regs)
>> +{
>> +       static char reg_value[16][32];
>> +       if (!regs)
>> +               return "";
>> +       snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
>> +                regs[regno]);
>> +       return reg_value[regno];
>> +}
>> +
>> +#define R(regno) reg_to_str(regno, regs)
>> +
>> +void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
>> +{
>> +       u16 class = BPF_CLASS(insn->code);
>> +       if (class == BPF_ALU || class == BPF_ALU64) {
>> +               if (BPF_SRC(insn->code) == BPF_X)
>> +                       pr_info("code_%02x %sr%d%s %s r%d%s\n",
>> +                               insn->code, class == BPF_ALU ? "(u32)" :
>> "",
>> +                               insn->a_reg, R(insn->a_reg),
>> +                               bpf_alu_string[BPF_OP(insn->code) >> 4],
>> +                               insn->x_reg, R(insn->x_reg));
>> +               else
>> +                       pr_info("code_%02x %sr%d%s %s %d\n",
>> +                               insn->code, class == BPF_ALU ? "(u32)" :
>> "",
>> +                               insn->a_reg, R(insn->a_reg),
>> +                               bpf_alu_string[BPF_OP(insn->code) >> 4],
>> +                               insn->imm);
>> +       } else if (class == BPF_STX) {
>> +               if (BPF_MODE(insn->code) == BPF_MEM)
>> +                       pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
>> +                               insn->code,
>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> +                               insn->a_reg, R(insn->a_reg),
>> +                               insn->off, insn->x_reg, R(insn->x_reg));
>> +               else if (BPF_MODE(insn->code) == BPF_XADD)
>> +                       pr_info("code_%02x lock *(%s *)(r%d%s %+d) +=
>> r%d%s\n",
>> +                               insn->code,
>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> +                               insn->a_reg, R(insn->a_reg), insn->off,
>> +                               insn->x_reg, R(insn->x_reg));
>> +               else
>> +                       pr_info("BUG_%02x\n", insn->code);
>> +       } else if (class == BPF_ST) {
>> +               if (BPF_MODE(insn->code) != BPF_MEM) {
>> +                       pr_info("BUG_st_%02x\n", insn->code);
>> +                       return;
>> +               }
>> +               pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
>> +                       insn->code,
>> +                       bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
>> +                       insn->a_reg, R(insn->a_reg),
>> +                       insn->off, insn->imm);
>> +       } else if (class == BPF_LDX) {
>> +               if (BPF_MODE(insn->code) == BPF_MEM) {
>> +                       pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
>> +                               insn->code, insn->a_reg,
>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> +                               insn->x_reg, R(insn->x_reg), insn->off);
>> +               } else {
>> +                       pr_info("BUG_ldx_%02x\n", insn->code);
>> +               }
>> +       } else if (class == BPF_LD) {
>> +               if (BPF_MODE(insn->code) == BPF_ABS) {
>> +                       pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
>> +                               insn->code, insn->a_reg,
>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> +                               insn->imm);
>> +               } else if (BPF_MODE(insn->code) == BPF_IND) {
>> +                       pr_info("code_%02x r%d = *(%s *)(skb + r%d%s
>> %+d)\n",
>> +                               insn->code, insn->a_reg,
>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>> 3],
>> +                               insn->x_reg, R(insn->x_reg), insn->imm);
>> +               } else {
>> +                       pr_info("BUG_ld_%02x\n", insn->code);
>> +               }
>> +       } else if (class == BPF_JMP) {
>> +               u16 opcode = BPF_OP(insn->code);
>> +               if (opcode == BPF_CALL) {
>> +                       pr_info("code_%02x call %d\n", insn->code,
>> insn->imm);
>> +               } else if (insn->code == (BPF_JMP | BPF_JA)) {
>> +                       pr_info("code_%02x goto pc%+d\n",
>> +                               insn->code, insn->off);
>> +               } else if (BPF_SRC(insn->code) == BPF_X) {
>> +                       pr_info("code_%02x if r%d%s %s r%d%s goto
>> pc%+d\n",
>> +                               insn->code, insn->a_reg, R(insn->a_reg),
>> +                               bpf_jmp_string[BPF_OP(insn->code) >> 4],
>> +                               insn->x_reg, R(insn->x_reg), insn->off);
>> +               } else {
>> +                       pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
>> +                               insn->code, insn->a_reg, R(insn->a_reg),
>> +                               bpf_jmp_string[BPF_OP(insn->code) >> 4],
>> +                               insn->imm, insn->off);
>> +               }
>> +       } else {
>> +               pr_info("code_%02x %s\n", insn->code,
>> bpf_class_string[class]);
>> +       }
>> +}
>> +EXPORT_SYMBOL(pr_info_bpf_insn);
>
>
> Why would that need to be exported as a symbol?

the performance numbers I mentioned are from bpf_bench, which is built as
a kernel module, so I used this for debugging from it,
and also to see what execution traces I get while running the trinity bpf
fuzzer :)

> I would actually like to avoid having this pr_info* inside the kernel.
> Couldn't this be done e.g. through systemtap script that could e.g. be
> placed under tools/net/ or inside the documentation file?

I like the idea!
Will drop it from the diff and eventually move it to tools/net.

>> +/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style
>> (BPF64)
>> + *
>> + * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate
>> new
>> + * program length in one pass
>> + *
>> + * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
>> + *
>> + * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
>> + * to remap in two passes: 1st pass finds new jump offsets, 2nd pass
>> remaps
>> + */
>
>
> Would be nice to have the API comment in kdoc format.

good point. will do.

>> +int bpf_convert(struct sock_filter *old_prog, int len,
>> +               struct bpf_insn *new_prog, int *p_new_len)
>> +{
>
>
> If we place it into net/core/filter.c, it would be nice to keep naming
> conventions consistent, e.g. sk_convert_filter() or so.

ok.

>> +       struct bpf_insn *new_insn;
>> +       struct sock_filter *fp;
>> +       int *addrs = NULL;
>> +       int new_len = 0;
>> +       int pass = 0;
>> +       int tgt, i;
>> +
>> +       if (len <= 0 || len >= BPF_MAXINSNS)
>> +               return -EINVAL;
>> +
>> +       if (new_prog) {
>> +               addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
>> +               if (!addrs)
>> +                       return -ENOMEM;
>> +       }
>> +
>> +do_pass:
>> +       new_insn = new_prog;
>> +       fp = old_prog;
>> +       for (i = 0; i < len; fp++, i++) {
>> +               struct bpf_insn tmp_insns[3] = {};
>> +               struct bpf_insn *insn = tmp_insns;
>> +
>> +               if (addrs)
>> +                       addrs[i] = new_insn - new_prog;
>> +
>> +               switch (fp->code) {
>> +               /* all arithmetic insns and skb loads map as-is */
>> +               case BPF_ALU | BPF_ADD | BPF_X:
>> +               case BPF_ALU | BPF_ADD | BPF_K:
>> +               case BPF_ALU | BPF_SUB | BPF_X:
>> +               case BPF_ALU | BPF_SUB | BPF_K:
>> +               case BPF_ALU | BPF_AND | BPF_X:
>> +               case BPF_ALU | BPF_AND | BPF_K:
>> +               case BPF_ALU | BPF_OR | BPF_X:
>> +               case BPF_ALU | BPF_OR | BPF_K:
>> +               case BPF_ALU | BPF_LSH | BPF_X:
>> +               case BPF_ALU | BPF_LSH | BPF_K:
>> +               case BPF_ALU | BPF_RSH | BPF_X:
>> +               case BPF_ALU | BPF_RSH | BPF_K:
>> +               case BPF_ALU | BPF_XOR | BPF_X:
>> +               case BPF_ALU | BPF_XOR | BPF_K:
>> +               case BPF_ALU | BPF_MUL | BPF_X:
>> +               case BPF_ALU | BPF_MUL | BPF_K:
>> +               case BPF_ALU | BPF_DIV | BPF_X:
>> +               case BPF_ALU | BPF_DIV | BPF_K:
>> +               case BPF_ALU | BPF_MOD | BPF_X:
>> +               case BPF_ALU | BPF_MOD | BPF_K:
>> +               case BPF_ALU | BPF_NEG:
>> +               case BPF_LD | BPF_ABS | BPF_W:
>> +               case BPF_LD | BPF_ABS | BPF_H:
>> +               case BPF_LD | BPF_ABS | BPF_B:
>> +               case BPF_LD | BPF_IND | BPF_W:
>> +               case BPF_LD | BPF_IND | BPF_H:
>> +               case BPF_LD | BPF_IND | BPF_B:
>
>
> I think here and elsewhere for similar constructions, we could use
> BPF_S_* helpers that were introduced by Hagen in commit 01f2f3f6ef4d076
> ("net: optimize Berkeley Packet Filter (BPF) processing").

well, old bpf opcodes were just unlucky to be on the wrong side of
compiler heuristics at that time.
Instead of doing remapping while loading and the opposite remapping in
sk_get_filter(), the problem could have been solved with a few dummy
'case' statements. That would have only added a few lines of code instead
of a hundred lines of mapping back and forth.
Also I suspect commit 01f2f3f6ef4d076 was only valid when the whole kernel
is compiled with -Os. With -O2 GCC should have done it right.
The heuristic is: number_of_case_stmts * ratio < range_of_case_values,
and the ratio is 3 for -Os and 10 for -O2.
Old bpf has ~70 stmts and a ~200 range.
See expand_switch_as_decision_tree_p() in gcc/stmt.c.
Eventually, when the current interpreter is retired, I would like to remove
the remapping as well.
Especially for sk_convert_filter() I would like to keep the normal BPF_
values from uapi/filter.h, since the majority of them are reused as-is and
remapping would have added unnecessary code.
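To illustrate the dummy-'case' idea: a few extra labels inside the existing
opcode range raise the case count without widening the range, which pushes
the switch back to the tablejump side of that formula. Purely a sketch, the
unused opcode values below are made up:

		/* hypothetical raw opcodes that sk_chk_filter() rejects anyway;
		 * they exist only so GCC keeps emitting a tablejump
		 */
		case 0x09:
		case 0x0a:
		case 0x0b:
		default:
			return 0;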

>> +                       insn->code = fp->code;
>> +                       insn->a_reg = 6;
>> +                       insn->x_reg = 7;
>> +                       insn->imm = fp->k;
>> +                       break;
>> +
>> +               /* jump opcodes map as-is, but offsets need adjustment */
>> +               case BPF_JMP | BPF_JA:
>> +                       tgt = i + fp->k + 1;
>> +                       insn->code = fp->code;
>> +#define EMIT_JMP \
>> +       do { \
>> +               if (tgt >= len || tgt < 0) \
>> +                       goto err; \
>> +               insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
>> +       } while (0)
>> +
>> +                       EMIT_JMP;
>> +                       break;
>> +
>> +               case BPF_JMP | BPF_JEQ | BPF_K:
>> +               case BPF_JMP | BPF_JEQ | BPF_X:
>> +               case BPF_JMP | BPF_JSET | BPF_K:
>> +               case BPF_JMP | BPF_JSET | BPF_X:
>> +               case BPF_JMP | BPF_JGT | BPF_K:
>> +               case BPF_JMP | BPF_JGT | BPF_X:
>> +               case BPF_JMP | BPF_JGE | BPF_K:
>> +               case BPF_JMP | BPF_JGE | BPF_X:
>> +                       insn->a_reg = 6;
>> +                       insn->x_reg = 7;
>> +                       insn->imm = fp->k;
>> +                       /* common case where 'jump_false' is next insn */
>> +                       if (fp->jf == 0) {
>> +                               insn->code = fp->code;
>> +                               tgt = i + fp->jt + 1;
>> +                               EMIT_JMP;
>> +                               break;
>> +                       }
>> +                       /* convert JEQ into JNE when 'jump_true' is next
>> insn */
>> +                       if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
>> +                               insn->code = BPF_JMP | BPF_JNE |
>> +                                       BPF_SRC(fp->code);
>> +                               tgt = i + fp->jf + 1;
>> +                               EMIT_JMP;
>> +                               break;
>> +                       }
>> +                       /* other jumps are mapped into two insns: Jxx and
>> JA */
>> +                       tgt = i + fp->jt + 1;
>> +                       insn->code = fp->code;
>> +                       EMIT_JMP;
>> +
>> +                       insn++;
>> +                       insn->code = BPF_JMP | BPF_JA;
>> +                       tgt = i + fp->jf + 1;
>> +                       EMIT_JMP;
>> +                       /* adjust pc relative offset, since it's a 2nd
>> insn */
>> +                       insn->off--;
>> +                       break;
>> +
>> +                       /* ldxb 4*([14]&0xf) is remapped into 3 insns */
>> +               case BPF_LDX | BPF_MSH | BPF_B:
>> +                       insn->code = BPF_LD | BPF_ABS | BPF_B;
>> +                       insn->a_reg = 7;
>> +                       insn->imm = fp->k;
>> +
>> +                       insn++;
>> +                       insn->code = BPF_ALU | BPF_AND | BPF_K;
>> +                       insn->a_reg = 7;
>> +                       insn->imm = 0xf;
>> +
>> +                       insn++;
>> +                       insn->code = BPF_ALU | BPF_LSH | BPF_K;
>> +                       insn->a_reg = 7;
>> +                       insn->imm = 2;
>> +                       break;
>> +
>> +                       /* RET_K, RET_A are remapped into 2 insns */
>> +               case BPF_RET | BPF_A:
>> +               case BPF_RET | BPF_K:
>> +                       insn->code = BPF_ALU | BPF_MOV |
>> +                               (BPF_SRC(fp->code) == BPF_K ? BPF_K :
>> BPF_X);
>> +                       insn->a_reg = 0;
>> +                       insn->x_reg = 6;
>> +                       insn->imm = fp->k;
>> +
>> +                       insn++;
>> +                       insn->code = BPF_RET | BPF_K;
>> +                       break;
>> +
>> +                       /* store to stack */
>> +               case BPF_ST:
>> +               case BPF_STX:
>> +                       insn->code = BPF_STX | BPF_MEM | BPF_W;
>> +                       insn->a_reg = 10;
>> +                       insn->x_reg = fp->code == BPF_ST ? 6 : 7;
>> +                       insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>> +                       break;
>> +
>> +                       /* load from stack */
>> +               case BPF_LD | BPF_MEM:
>> +               case BPF_LDX | BPF_MEM:
>> +                       insn->code = BPF_LDX | BPF_MEM | BPF_W;
>> +                       insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>> 7;
>> +                       insn->x_reg = 10;
>> +                       insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>> +                       break;
>> +
>> +                       /* A = K or X = K */
>> +               case BPF_LD | BPF_IMM:
>> +               case BPF_LDX | BPF_IMM:
>> +                       insn->code = BPF_ALU | BPF_MOV | BPF_K;
>> +                       insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>> 7;
>> +                       insn->imm = fp->k;
>> +                       break;
>> +
>> +                       /* X = A */
>> +               case BPF_MISC | BPF_TAX:
>> +                       insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>> +                       insn->a_reg = 7;
>> +                       insn->x_reg = 6;
>> +                       break;
>> +
>> +                       /* A = X */
>> +               case BPF_MISC | BPF_TXA:
>> +                       insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>> +                       insn->a_reg = 6;
>> +                       insn->x_reg = 7;
>> +                       break;
>> +
>> +                       /* A = skb->len or X = skb->len */
>> +               case BPF_LD|BPF_W|BPF_LEN:
>> +               case BPF_LDX|BPF_W|BPF_LEN:
>> +                       insn->code = BPF_LDX | BPF_MEM | BPF_W;
>> +                       insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>> 7;
>> +                       insn->x_reg = 1;
>> +                       insn->off = offsetof(struct sk_buff, len);
>> +                       break;
>> +
>> +               default:
>> +                       /* pr_err("unknown opcode %02x\n", fp->code); */
>> +                       goto err;
>> +               }
>> +
>> +               insn++;
>> +               if (new_prog) {
>> +                       memcpy(new_insn, tmp_insns,
>> +                              sizeof(*insn) * (insn - tmp_insns));
>> +               }
>> +               new_insn += insn - tmp_insns;
>> +       }
>> +
>> +       if (!new_prog) {
>> +               /* only calculating new length */
>> +               *p_new_len = new_insn - new_prog;
>> +               return 0;
>> +       }
>> +
>> +       pass++;
>> +       if (new_len != new_insn - new_prog) {
>> +               new_len = new_insn - new_prog;
>> +               if (pass > 2)
>> +                       goto err;
>> +               goto do_pass;
>> +       }
>> +       kfree(addrs);
>> +       if (*p_new_len != new_len)
>> +               /* inconsistent new program length */
>> +               pr_err("bpf_convert() usage error\n");
>> +       return 0;
>> +err:
>> +       kfree(addrs);
>> +       return -EINVAL;
>> +}
>> +
>> +notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
>> +{
>
>
> Similarly, sk_run_filter_ext()?

ok.

>> +       u64 stack[64];
>> +       u64 regs[16];
>> +       void *ptr;
>> +       u64 tmp;
>> +       int off;
>> +
>> +#ifdef __x86_64
>> +#define LOAD_IMM /**/
>> +#define K insn->imm
>> +#else
>> +#define LOAD_IMM (K = insn->imm)
>> +       s32 K = insn->imm;
>> +#endif
>> +
>> +#ifdef DEBUG
>> +#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
>> +#else
>> +#define DEBUG_INSN
>> +#endif
>
>
> This DEBUG hunk could then be removed when we use a stap script
> instead, for example.

ok

>> +#define A regs[insn->a_reg]
>> +#define X regs[insn->x_reg]
>> +
>> +#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
>> +#define L(A, B, C) L_##A##B##C:
>
>
> Could we get rid of these two defines? I know you're defining

I think it's actually more readable this way and there's less chance of typos:
DL - define label
L - label
Will try to remove the 2nd macro to see how it looks...
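For reference, DL(BPF_ALU, BPF_ADD, BPF_X) expands to the jump table entry

	[BPF_ALU|BPF_ADD|BPF_X] = &&L_BPF_ALUBPF_ADDBPF_X,

and L(BPF_ALU, BPF_ADD, BPF_X) expands to the matching label

	L_BPF_ALUBPF_ADDBPF_X: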

> and using that as labels, but it's not obvious at first what
> 'jt' does. Maybe 'jt_labels' ?

ok will rename.

>> +#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>> +#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>
>
> Not a big fan of macros, but here it seems fine though.
>
>
>> +/* some compilers may need help:
>> + * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code];
>> })
>> + */
>> +
>> +       static const void *jt[256] = {
>> +               [0 ... 255] = &&L_default,
>
>
> Nit: please just one define per line below:

ok.

>> +               DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
>> +               DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
>> +               DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
>> +               DL(BPF_ALU, BPF_OR, BPF_X)  DL(BPF_ALU, BPF_OR, BPF_K)
>> +               DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
>> +               DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
>> +               DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
>> +               DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
>> +               DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
>> +               DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
>> +               DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
>> +               DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_OR, BPF_X)  DL(BPF_ALU64, BPF_OR, BPF_K)
>> +               DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD,
>> BPF_K)
>> +               DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
>> +               DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
>> +               DL(BPF_ALU, BPF_NEG, 0)
>> +               DL(BPF_JMP, BPF_CALL, 0)
>> +               DL(BPF_JMP, BPF_JA, 0)
>> +               DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
>> +               DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
>> +               DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
>> +               DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
>> +               DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
>> +               DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
>> +               DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
>> +               DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
>> +               DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
>> +               DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
>> +               DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
>> +               DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
>> +               DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
>> +               DL(BPF_STX, BPF_XADD, BPF_W)
>> +#ifdef CONFIG_64BIT
>> +               DL(BPF_STX, BPF_XADD, BPF_DW)
>> +#endif
>> +               DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
>> +               DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
>> +               DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
>> +               DL(BPF_RET, BPF_K, 0)
>> +       };
>> +
>> +       regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
>> +       regs[1/* BPF R1 */] = (u64)(ulong)ctx;
>> +
>> +       DEBUG_INSN;
>> +       /* execute 1st insn */
>> +select_insn:
>> +       goto *jt[insn->code];
>> +
>> +               /* ALU */
>> +#define ALU(OPCODE, OP) \
>> +       L_BPF_ALU64##OPCODE##BPF_X: \
>> +               A = A OP X; \
>> +               CONT; \
>> +       L_BPF_ALU##OPCODE##BPF_X: \
>> +               A = (u32)A OP (u32)X; \
>> +               CONT; \
>> +       L_BPF_ALU64##OPCODE##BPF_K: \
>> +               A = A OP K; \
>> +               CONT; \
>> +       L_BPF_ALU##OPCODE##BPF_K: \
>> +               A = (u32)A OP (u32)K; \
>> +               CONT;
>> +
>> +       ALU(BPF_ADD, +)
>> +       ALU(BPF_SUB, -)
>> +       ALU(BPF_AND, &)
>> +       ALU(BPF_OR, |)
>> +       ALU(BPF_LSH, <<)
>> +       ALU(BPF_RSH, >>)
>> +       ALU(BPF_XOR, ^)
>> +       ALU(BPF_MUL, *)
>> +
>> +       L(BPF_ALU, BPF_NEG, 0)
>> +               A = (u32)-A;
>> +               CONT;
>> +       L(BPF_ALU, BPF_MOV, BPF_X)
>> +               A = (u32)X;
>> +               CONT;
>> +       L(BPF_ALU, BPF_MOV, BPF_K)
>> +               A = (u32)K;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_MOV, BPF_X)
>> +               A = X;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_MOV, BPF_K)
>> +               A = K;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_ARSH, BPF_X)
>> +               (*(s64 *) &A) >>= X;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_ARSH, BPF_K)
>> +               (*(s64 *) &A) >>= K;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_MOD, BPF_X)
>> +               tmp = A;
>> +               if (X)
>> +                       A = do_div(tmp, X);
>> +               CONT;
>> +       L(BPF_ALU, BPF_MOD, BPF_X)
>> +               tmp = (u32)A;
>> +               if (X)
>> +                       A = do_div(tmp, (u32)X);
>> +               CONT;
>> +       L(BPF_ALU64, BPF_MOD, BPF_K)
>> +               tmp = A;
>> +               if (K)
>> +                       A = do_div(tmp, K);
>> +               CONT;
>> +       L(BPF_ALU, BPF_MOD, BPF_K)
>> +               tmp = (u32)A;
>> +               if (K)
>> +                       A = do_div(tmp, (u32)K);
>> +               CONT;
>> +       L(BPF_ALU64, BPF_DIV, BPF_X)
>> +               if (X)
>> +                       do_div(A, X);
>> +               CONT;
>> +       L(BPF_ALU, BPF_DIV, BPF_X)
>> +               tmp = (u32)A;
>> +               if (X)
>> +                       do_div(tmp, (u32)X);
>> +               A = (u32)tmp;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_DIV, BPF_K)
>> +               if (K)
>> +                       do_div(A, K);
>> +               CONT;
>> +       L(BPF_ALU, BPF_DIV, BPF_K)
>> +               tmp = (u32)A;
>> +               if (K)
>> +                       do_div(tmp, (u32)K);
>> +               A = (u32)tmp;
>> +               CONT;
>> +       L(BPF_ALU64, BPF_BSWAP32, BPF_X)
>> +               A = swab32(A);
>> +               CONT;
>> +       L(BPF_ALU64, BPF_BSWAP64, BPF_X)
>> +               A = swab64(A);
>> +               CONT;
>> +
>> +               /* CALL */
>> +       L(BPF_JMP, BPF_CALL, 0)
>> +               /* TODO execute_func(K, regs); */
>> +               CONT;
>
>
> Would the filter checker for now just return -EINVAL when this is used?

Sure.
I was planning to address that in a week or so, but ok.
Will remove the 'call' insn from the first patch and re-add it in a clean
way in a follow-up diff.
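
A minimal sketch of such a reject, assuming the opcode macros and the
struct bpf_insn layout from this patch (the helper name is hypothetical,
not something that exists in the submission):

	/* hypothetical: refuse BPF_CALL until the call mechanism is wired up */
	static int reject_call_insns(const struct bpf_insn *insn, int len)
	{
		int i;

		for (i = 0; i < len; i++)
			if (insn[i].code == (BPF_JMP | BPF_CALL))
				return -EINVAL;
		return 0;
	}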

>> +               /* JMP */
>> +       L(BPF_JMP, BPF_JA, 0)
>> +               insn += insn->off;
>> +               CONT;
>> +       L(BPF_JMP, BPF_JEQ, BPF_X)
>> +               if (A == X) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JEQ, BPF_K)
>> +               if (A == K) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JNE, BPF_X)
>> +               if (A != X) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JNE, BPF_K)
>> +               if (A != K) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JGT, BPF_X)
>> +               if (A > X) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JGT, BPF_K)
>> +               if (A > K) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JGE, BPF_X)
>> +               if (A >= X) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JGE, BPF_K)
>> +               if (A >= K) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JSGT, BPF_X)
>> +               if (((s64)A) > ((s64)X)) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JSGT, BPF_K)
>> +               if (((s64)A) > ((s64)K)) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JSGE, BPF_X)
>> +               if (((s64)A) >= ((s64)X)) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JSGE, BPF_K)
>> +               if (((s64)A) >= ((s64)K)) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JSET, BPF_X)
>> +               if (A & X) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>> +       L(BPF_JMP, BPF_JSET, BPF_K)
>> +               if (A & (u32)K) {
>> +                       insn += insn->off;
>> +                       CONT_JMP;
>> +               }
>> +               CONT;
>
>
> Okay, so right now we only support forward jumps, right? And these
> are checked by the old checker code I'd assume?

Yes. Just like the old interpreter, the new interpreter doesn't care: it just jumps.
And you're correct, the old checker code will allow only forward jumps.
That restriction is preserved by sk_convert_filter() as well.
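
A minimal sketch of what that restriction boils down to for an unconditional
jump (hypothetical helper; the real enforcement stays in sk_chk_filter() on
the old program, and classic jt/jf/k are unsigned, so targets can only move
forward):

	/* hypothetical: a BPF_JA target pc + 1 + k must stay inside the
	 * program; written this way (assuming pc < flen) to avoid
	 * unsigned wrap-around
	 */
	static bool ja_target_in_range(const struct sock_filter *f,
				       unsigned int pc, unsigned int flen)
	{
		return f[pc].k < flen - pc - 1;
	}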

>> +               /* STX and ST and LDX*/
>> +#define LDST(SIZEOP, SIZE) \
>> +       L_BPF_STXBPF_MEM##SIZEOP: \
>> +               *(SIZE *)(ulong)(A + insn->off) = X; \
>> +               CONT; \
>> +       L_BPF_STBPF_MEM##SIZEOP: \
>> +               *(SIZE *)(ulong)(A + insn->off) = K; \
>> +               CONT; \
>> +       L_BPF_LDXBPF_MEM##SIZEOP: \
>> +               A = *(SIZE *)(ulong)(X + insn->off); \
>> +               CONT;
>> +
>> +               LDST(BPF_B, u8)
>> +               LDST(BPF_H, u16)
>> +               LDST(BPF_W, u32)
>> +               LDST(BPF_DW, u64)
>> +
>> +               /* STX XADD */
>> +       L(BPF_STX, BPF_XADD, BPF_W)
>> +               atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
>> +               CONT;
>> +#ifdef CONFIG_64BIT
>> +       L(BPF_STX, BPF_XADD, BPF_DW)
>> +               atomic64_add((u64)X, (atomic64_t *)(ulong)(A + insn->off));
>> +               CONT;
>> +#endif
>> +
>> +               /* LD from SKB + K */
>
>
> Nit: indent one tab too much (here and elsewhere)

ahh. ok.

>> +       L(BPF_LD, BPF_ABS, BPF_W)
>> +               off = K;
>> +load_word:
>> +               ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
>> +               if (likely(ptr != NULL)) {
>> +                       A = get_unaligned_be32(ptr);
>> +                       CONT;
>> +               }
>> +               return 0;
>> +
>> +       L(BPF_LD, BPF_ABS, BPF_H)
>> +               off = K;
>> +load_half:
>> +               ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
>> +               if (likely(ptr != NULL)) {
>> +                       A = get_unaligned_be16(ptr);
>> +                       CONT;
>> +               }
>> +               return 0;
>> +
>> +       L(BPF_LD, BPF_ABS, BPF_B)
>> +               off = K;
>> +load_byte:
>> +               ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
>> +               if (likely(ptr != NULL)) {
>> +                       A = *(u8 *)ptr;
>> +                       CONT;
>> +               }
>> +               return 0;
>> +
>> +               /* LD from SKB + X + K */
>
>
> Nit: ditto

ok

>> +       L(BPF_LD, BPF_IND, BPF_W)
>> +               off = K + X;
>> +               goto load_word;
>> +
>> +       L(BPF_LD, BPF_IND, BPF_H)
>> +               off = K + X;
>> +               goto load_half;
>> +
>> +       L(BPF_LD, BPF_IND, BPF_B)
>> +               off = K + X;
>> +               goto load_byte;
>> +
>> +               /* RET */
>> +       L(BPF_RET, BPF_K, 0)
>> +               return regs[0/* R0 */];
>> +
>> +       L_default:
>> +               /* bpf_check() or bpf_convert() will guarantee that
>> +                * we never reach here
>> +                */
>> +               WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
>> +               return 0;
>> +#undef DL
>> +#undef L
>> +#undef CONT
>> +#undef A
>> +#undef X
>> +#undef K
>> +#undef LOAD_IMM
>> +#undef DEBUG_INSN
>> +#undef ALU
>> +#undef LDST
>> +}
>> +EXPORT_SYMBOL(bpf_run);
>> +
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index ad30d626a5bd..575bf5fd4335 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
>>   {
>>         struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
>>
>> +       if ((void *)fp->bpf_func == (void *)bpf_run)
>> +               /* arch specific jit_free are expecting this value */
>> +               fp->bpf_func = sk_run_filter;
>> +
>>         bpf_jit_free(fp);
>>   }
>>   EXPORT_SYMBOL(sk_filter_release_rcu);
>> @@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
>>         return 0;
>>   }
>>
>> +static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog *fprog,
>> +                        struct sock *sk)
>> +{
>
>
> sk_prepare_filter_ext()?

ok.

>> +       unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
>> +       struct sock_filter *old_prog;
>> +       unsigned int sk_fsize;
>> +       struct sk_filter *fp;
>> +       int new_len;
>> +       int err;
>> +
>> +       /* store old program into buffer, since chk_filter will remap opcodes */
>> +       old_prog = kmalloc(fsize, GFP_KERNEL);
>> +       if (!old_prog)
>> +               return -ENOMEM;
>> +
>> +       if (sk) {
>> +               if (copy_from_user(old_prog, fprog->filter, fsize)) {
>> +                       err = -EFAULT;
>> +                       goto free_prog;
>> +               }
>> +       } else {
>> +               memcpy(old_prog, fprog->filter, fsize);
>> +       }
>> +
>> +       /* calculate bpf64 program length */
>> +       err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
>> +       if (err)
>> +               goto free_prog;
>> +
>> +       sk_fsize = sk_filter_size(new_len);
>> +       /* allocate sk_filter to store bpf64 program */
>> +       if (sk)
>> +               fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>> +       else
>> +               fp = kmalloc(sk_fsize, GFP_KERNEL);
>> +       if (!fp) {
>> +               err = -ENOMEM;
>> +               goto free_prog;
>> +       }
>> +
>> +       /* remap old insns into bpf64 */
>> +       err = bpf_convert(old_prog, fprog->len,
>> +                         (struct bpf_insn *)fp->insns, &new_len);
>> +       if (err)
>> +               /* 2nd bpf_convert() can fail only if it fails
>> +                * to allocate memory, remapping must succeed
>> +                */
>> +               goto free_fp;
>> +
>> +       /* now chk_filter can overwrite old_prog while checking */
>> +       err = sk_chk_filter(old_prog, fprog->len);
>> +       if (err)
>> +               goto free_fp;
>> +
>> +       /* discard old prog */
>> +       kfree(old_prog);
>> +
>> +       atomic_set(&fp->refcnt, 1);
>> +       fp->len = new_len;
>> +
>> +       /* bpf64 insns must be executed by bpf_run */
>> +       fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
>> +
>> +       *pfp = fp;
>> +       return 0;
>> +free_fp:
>> +       if (sk)
>> +               sock_kfree_s(sk, fp, sk_fsize);
>> +       else
>> +               kfree(fp);
>> +free_prog:
>> +       kfree(old_prog);
>> +       return err;
>> +}
>> +
>>   /**
>>    *    sk_unattached_filter_create - create an unattached filter
>>    *    @fprog: the filter program
>> @@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
>>         if (fprog->filter == NULL)
>>                 return -EINVAL;
>>
>> +       if (bpf64_enable)
>> +               return bpf64_prepare(pfp, fprog, NULL);
>> +
>>         fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
>>         if (!fp)
>>                 return -ENOMEM;
>> @@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>>         if (fprog->filter == NULL)
>>                 return -EINVAL;
>>
>> -       fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>> -       if (!fp)
>> -               return -ENOMEM;
>> -       if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>> -               sock_kfree_s(sk, fp, sk_fsize);
>> -               return -EFAULT;
>> -       }
>> +       if (bpf64_enable) {
>> +               err = bpf64_prepare(&fp, fprog, sk);
>> +               if (err)
>> +                       return err;
>> +       } else {
>> +               fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>> +               if (!fp)
>> +                       return -ENOMEM;
>> +               if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>> +                       sock_kfree_s(sk, fp, sk_fsize);
>> +                       return -EFAULT;
>> +               }
>>
>> -       atomic_set(&fp->refcnt, 1);
>> -       fp->len = fprog->len;
>> +               atomic_set(&fp->refcnt, 1);
>> +               fp->len = fprog->len;
>>
>> -       err = __sk_prepare_filter(fp);
>> -       if (err) {
>> -               sk_filter_uncharge(sk, fp);
>> -               return err;
>> +               err = __sk_prepare_filter(fp);
>> +               if (err) {
>> +                       sk_filter_uncharge(sk, fp);
>> +                       return err;
>> +               }
>>         }
>>
>>         old_fp = rcu_dereference_protected(sk->sk_filter,
>> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
>> index cf9cd13509a7..f03acc0e8950 100644
>> --- a/net/core/sysctl_net_core.c
>> +++ b/net/core/sysctl_net_core.c
>> @@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
>>         },
>>   #endif
>>         {
>> +               .procname       = "bpf64_enable",
>> +               .data           = &bpf64_enable,
>> +               .maxlen         = sizeof(int),
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dointvec
>> +       },
>> +       {
>>                 .procname       = "netdev_tstamp_prequeue",
>>                 .data           = &netdev_tstamp_prequeue,
>>                 .maxlen         = sizeof(int),
>>
>
> Hope some of the comments made sense. ;-)

Yes. Indeed. Thanks a lot for the thorough review!
Will address things and resend V4.

Alexei

> Thanks,
>
> Daniel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-02-28 20:53     ` Alexei Starovoitov
@ 2014-03-01  0:10       ` Alexei Starovoitov
  2014-03-01  0:30       ` Daniel Borkmann
  1 sibling, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2014-03-01  0:10 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Frederic Weisbecker, Arnaldo Carvalho de Melo, Pekka Enberg,
	Arjan van de Ven, Christoph Hellwig, linux-kernel, netdev,
	Hagen Paul Pfeifer, Jesse Gross

On Fri, Feb 28, 2014 at 12:53 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Fri, Feb 28, 2014 at 4:45 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>> Hi Alexei,
>>
>> [also cc'ing Hagen and Jesse]
>>
>> Just some minor comments below ... let me know what you think.
>
> Thank you for review! Comments below.
>
>> On 02/27/2014 03:38 AM, Alexei Starovoitov wrote:
>>>
>>> Extended BPF (or 64-bit BPF) is an instruction set to
>>> create safe dynamically loadable filters that can call fixed set
>>> of kernel functions and take generic bpf_context as an input.
>>> BPF filter is a glue between kernel functions and bpf_context.
>>> Different kernel subsystems can define their own set of available
>>> functions
>>> and alter BPF machinery for specific use case.
>>> BPF64 instruction set is designed for efficient mapping to native
>>> instructions on 64-bit CPUs
>>>
>>> Old BPF instructions are remapped on the fly to BPF64
>>> when sysctl net.core.bpf64_enable=1
>>
>>
>> Would be nice if the commit message could be extended with what you
>> have posted in your [PATCH v3 net-next 0/1] email (but without the
>> changelog, changelog should go after "---" line), so that also this
>> information will appear here and in the Git log later on. Also please
>> elaborate more on this commit message. I think, at least, as it's a
>> bigger change it also needs to include future planned usage scenarios
>> for user space and for inside the kernel e.g. for OVS and ftrace.
>
> Ok will do
>
>> You could make this patch a 2 patch patch-series and put into patch
>> 2/2 all documentation you had in your previous version of the set.
>> Would be nice to extend the file Documentation/networking/filter.txt
>> with a description of your extended BPF so that users can read about
>> them in the same file.
>
> Sure.
>
>> Did you also test that seccomp-BPF still works out?
>
> yes. Have a prototype, but it needs a bit more cleanup.
>
>>
>>> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
>>> ---
>>>   include/linux/filter.h      |    9 +-
>>>   include/linux/netdevice.h   |    1 +
>>>   include/uapi/linux/filter.h |   37 ++-
>>>   net/core/Makefile           |    2 +-
>>>   net/core/bpf_run.c          |  766
>>> +++++++++++++++++++++++++++++++++++++++++++
>>>   net/core/filter.c           |  114 ++++++-
>>
>>
>> I would have liked to see the content from net/core/bpf_run.c to go
>> directly into net/core/filter.c, not as a separate file, if that's
>> possible.
>
> sure. that's fine.
>
>>
>>>   net/core/sysctl_net_core.c  |    7 +
>>>   7 files changed, 913 insertions(+), 23 deletions(-)
>>>   create mode 100644 net/core/bpf_run.c
>>>
>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>> index e568c8ef896b..bf3085258f4c 100644
>>> --- a/include/linux/filter.h
>>> +++ b/include/linux/filter.h
>>> @@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter,
>>> unsigned int flen);
>>>   extern int sk_get_filter(struct sock *sk, struct sock_filter __user
>>> *filter, unsigned len);
>>>   extern void sk_decode_filter(struct sock_filter *filt, struct
>>> sock_filter *to);
>>>
>>> +/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
>>> +int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn
>>> *new_prog,
>>> +               int *p_new_len);
>>> +/* execute bpf64 program */
>>> +u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
>>> +
>>> +#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>>> FILTER->insns)
>>>   #ifdef CONFIG_BPF_JIT
>>>   #include <stdarg.h>
>>>   #include <linux/linkage.h>
>>> @@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen,
>>> unsigned int proglen,
>>>                 print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
>>>                                16, 1, image, proglen, false);
>>>   }
>>> -#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>>> FILTER->insns)
>>>   #else
>>>   #include <linux/slab.h>
>>>   static inline void bpf_jit_compile(struct sk_filter *fp)
>>> @@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
>>>   {
>>>         kfree(fp);
>>>   }
>>> -#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
>>>   #endif
>>>
>>>   static inline int bpf_tell_extensions(void)
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index 5e84483c0650..7b1acefc244e 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -2971,6 +2971,7 @@ extern int                netdev_max_backlog;
>>>   extern int            netdev_tstamp_prequeue;
>>>   extern int            weight_p;
>>>   extern int            bpf_jit_enable;
>>> +extern int             bpf64_enable;
>>
>>
>> We should keep naming consistent (so either extended BPF or BPF64),
>> so maybe bpf_ext_enable ? I'd presume rather {bpf,sk_filter}*_ext
>
> Agree, we need consistent naming for both (old and new).
> I'll try an all-out rename of bpf_*() into sk_filter64_*() and sk_filter_ext_*()
> to see which one looks better.
> I'm kinda leaning towards sk_filter64, since it's easier to quickly spot
> the difference and it's more descriptive.

After going back and forth between sk_filter64 and sk_filter_ext,
I decided to follow your suggestion and stick with sk_filter_ext,
bpf_ext_enable, etc., since calling it bpf64 may raise unnecessary
concerns on 32-bit architectures... but the performance numbers
show that extended BPF is faster than old BPF even on 32-bit cpus,
so it is 'extended BPF' from now on.

Thanks

>> as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
>> comparisons, you'd still need to load to immediate values, right?
>
> There is no insn that uses a 64-bit immediate, since 64-bit immediates
> are extremely rare: grepping x86-64 asm code for movabsq returns very few hits.
> llvm or gcc can easily construct any constant with a combination of mov,
> shifts and ors.
> bpf64 comparisons are all 64-bit right now. So far I didn't see a need for
> a 32-bit comparison, since old bpf is all unsigned, so mapping 32->64 for
> Jxx is painless.
> Notice I added signed comparisons, since real-life programs cannot do
> without them.
> I also kept the spirit of old bpf in having > and >= only (no < and <=).
> That made the llvm/gcc backends a bit harder to do, since no real cpu has
> such limitations.
> I'm still contemplating whether to add < and <= (both signed and unsigned),
> which is an interesting trade-off: number of instructions vs complexity of
> the compiler.
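
As a sketch, a 64-bit constant such as 0x12345678aabbccdd could be built
from 32-bit immediates with the encoding defined in this patch (whether a
compiler would emit exactly this sequence is an assumption):

	/* hypothetical sequence: r6 = 0x12345678aabbccdd
	 * fields are { code, a_reg, x_reg, off, imm } as defined above
	 */
	struct bpf_insn load_imm64[] = {
		/* r6 = high 32 bits */
		{ BPF_ALU64 | BPF_MOV | BPF_K, 6, 0, 0, 0x12345678 },
		/* r6 <<= 32 */
		{ BPF_ALU64 | BPF_LSH | BPF_K, 6, 0, 0, 32 },
		/* r7 = low 32 bits; 32-bit mov zero-extends (u32)imm */
		{ BPF_ALU   | BPF_MOV | BPF_K, 7, 0, 0, 0xaabbccdd },
		/* r6 |= r7 */
		{ BPF_ALU64 | BPF_OR  | BPF_X, 6, 7, 0, 0 },
	};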
>
>> After all your changes, we will still have the bpf_jit_enable knob
>> in tact, right?
>
> Yes.
> In this diff the workflow is the following:
>
> old filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (bpf64) {
>     convert to new
>     sk_chk_filter() - check old bpf
>     use new interpreter
> } else {
>     sk_chk_filter() - check old bpf
>     if (bpf_jit_enable)
>         use old jit
>     else
>         use old interpreter
> }
> Soon I'll add a bpf64 JIT and will reuse the same bpf_jit_enable knob for it,
> then add new/old in-band demux into sk_attach_filter(),
> so that the workflow will become:
>
> a filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (new filter prog) {
>     sk_chk_filter64() - check new bpf
>     if (bpf_jit_enable)
>         use new jit
>     else
>         use new interpreter
> } else {
>     if (bpf64_enable) {
>        convert to new
>        sk_chk_filter() - check old bpf
>        if (bpf_jit_enable)
>             use new jit
>        else
>             use new interpreter
>     } else {
>        sk_chk_filter()
>        if (bpf_jit_enable)
>            use old jit
>        else
>            use old interpreter
>     }
> }
> Eventually bpf64_enable can be made the default, the last 'else' can be
> retired, and 'bpf64_enable' removed along with the old interpreter.
>
> bpf_jit_enable knob will stay for foreseeable future.
>
>>>   bool netdev_has_upper_dev(struct net_device *dev, struct net_device
>>> *upper_dev);
>>>   struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device
>>> *dev,
>>> diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
>>> index 8eb9ccaa5b48..70ff29ee6825 100644
>>> --- a/include/uapi/linux/filter.h
>>> +++ b/include/uapi/linux/filter.h
>>> @@ -1,3 +1,4 @@
>>> +/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com
>>> */
>>>   /*
>>>    * Linux Socket Filter Data Structures
>>>    */
>>
>>
>> You can merge both comments into one.
>
> ok.
>
>>
>>> @@ -19,7 +20,7 @@
>>>    *    Try and keep these values and structures similar to BSD,
>>> especially
>>>    *    the BPF code definitions which need to match so you can share
>>> filters
>>>    */
>>> -
>>> +
>>>   struct sock_filter {  /* Filter block */
>>>         __u16   code;   /* Actual filter code */
>>>         __u8    jt;     /* Jump true */
>>> @@ -45,12 +46,26 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>>   #define         BPF_JMP         0x05
>>>   #define         BPF_RET         0x06
>>>   #define         BPF_MISC        0x07
>>> +#define         BPF_ALU64       0x07
>>> +
>>> +struct bpf_insn {
>>> +       __u8    code;    /* opcode */
>>> +       __u8    a_reg:4; /* dest register*/
>>> +       __u8    x_reg:4; /* source register */
>>> +       __s16   off;     /* signed offset */
>>> +       __s32   imm;     /* signed immediate constant */
>>> +};
>>
>>
>> As we have struct sock_filter and it's immutable due to uapi, I
>> would have liked to see the new data structure with a consistent
>> naming scheme, e.g. struct sock_filter_ext {...} for extended
>> BPF.
>
> Ok. Will rename bpf_insn to sock_filter64 and sock_filter_ext to see
> which one looks better throughout the code.
>
>>> +/* pointer to bpf_context is the first and only argument to BPF program
>>> + * its definition is use-case specific */
>>> +struct bpf_context;
>>>
>>>   /* ld/ldx fields */
>>>   #define BPF_SIZE(code)  ((code) & 0x18)
>>>   #define         BPF_W           0x00
>>>   #define         BPF_H           0x08
>>>   #define         BPF_B           0x10
>>> +#define         BPF_DW          0x18
>>>   #define BPF_MODE(code)  ((code) & 0xe0)
>>>   #define         BPF_IMM         0x00
>>>   #define         BPF_ABS         0x20
>>> @@ -58,6 +73,7 @@ struct sock_fprog {   /* Required for SO_ATTACH_FILTER.
>>> */
>>>   #define         BPF_MEM         0x60
>>>   #define         BPF_LEN         0x80
>>>   #define         BPF_MSH         0xa0
>>> +#define         BPF_XADD        0xc0 /* exclusive add */
>>>
>>>   /* alu/jmp fields */
>>>   #define BPF_OP(code)    ((code) & 0xf0)
>>> @@ -68,16 +84,24 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>>   #define         BPF_OR          0x40
>>>   #define         BPF_AND         0x50
>>>   #define         BPF_LSH         0x60
>>> -#define         BPF_RSH         0x70
>>> +#define         BPF_RSH         0x70 /* logical shift right */
>>>   #define         BPF_NEG         0x80
>>>   #define               BPF_MOD         0x90
>>>   #define               BPF_XOR         0xa0
>>> +#define                BPF_MOV         0xb0 /* mov reg to reg */
>>> +#define                BPF_ARSH        0xc0 /* sign extending arithmetic
>>> shift right */
>>> +#define                BPF_BSWAP32     0xd0 /* swap lower 4 bytes of
>>> 64-bit register */
>>> +#define                BPF_BSWAP64     0xe0 /* swap all 8 bytes of 64-bit
>>> register */
>>>
>>>   #define         BPF_JA          0x00
>>> -#define         BPF_JEQ         0x10
>>> -#define         BPF_JGT         0x20
>>> -#define         BPF_JGE         0x30
>>> -#define         BPF_JSET        0x40
>>> +#define         BPF_JEQ         0x10 /* jump == */
>>> +#define         BPF_JGT         0x20 /* GT is unsigned '>', JA in x86 */
>>> +#define         BPF_JGE         0x30 /* GE is unsigned '>=', JAE in x86
>>> */
>>> +#define         BPF_JSET        0x40 /* if (A & X) */
>>> +#define         BPF_JNE         0x50 /* jump != */
>>> +#define         BPF_JSGT        0x60 /* SGT is signed '>', GT in x86 */
>>> +#define         BPF_JSGE        0x70 /* SGE is signed '>=', GE in x86 */
>>> +#define         BPF_CALL        0x80 /* function call */
>>>   #define BPF_SRC(code)   ((code) & 0x08)
>>>   #define         BPF_K           0x00
>>>   #define         BPF_X           0x08
>>> @@ -134,5 +158,4 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>>   #define SKF_NET_OFF   (-0x100000)
>>>   #define SKF_LL_OFF    (-0x200000)
>>>
>>> -
>>>   #endif /* _UAPI__LINUX_FILTER_H__ */
>>> diff --git a/net/core/Makefile b/net/core/Makefile
>>> index 9628c20acff6..e622b97f58dc 100644
>>> --- a/net/core/Makefile
>>> +++ b/net/core/Makefile
>>> @@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o
>>> stream.o scm.o \
>>>   obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
>>>
>>>   obj-y              += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o
>>> \
>>> -                       neighbour.o rtnetlink.o utils.o link_watch.o
>>> filter.o \
>>> +                       neighbour.o rtnetlink.o utils.o link_watch.o
>>> filter.o bpf_run.o \
>>>                         sock_diag.o dev_ioctl.o
>>>
>>>   obj-$(CONFIG_XFRM) += flow.o
>>> diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
>>> new file mode 100644
>>> index 000000000000..fa1862fcbc74
>>> --- /dev/null
>>> +++ b/net/core/bpf_run.c
>>> @@ -0,0 +1,766 @@
>>> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
>>> + *
>>> + * This program is free software; you can redistribute it and/or
>>> + * modify it under the terms of version 2 of the GNU General Public
>>> + * License as published by the Free Software Foundation.
>>> + *
>>> + * This program is distributed in the hope that it will be useful, but
>>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>>> + * General Public License for more details.
>>> + */
>>> +#include <linux/kernel.h>
>>> +#include <linux/types.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/uaccess.h>
>>> +#include <linux/filter.h>
>>> +#include <linux/skbuff.h>
>>> +#include <asm/unaligned.h>
>>> +
>>> +int bpf64_enable __read_mostly;
>>> +
>>> +void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int
>>> k,
>>> +                                          unsigned int size);
>>> +
>>> +static inline void *load_pointer(const struct sk_buff *skb, int k,
>>> +                                unsigned int size, void *buffer)
>>> +{
>>> +       if (k >= 0)
>>> +               return skb_header_pointer(skb, k, size, buffer);
>>> +       return bpf_internal_load_pointer_neg_helper(skb, k, size);
>>> +}
>>> +
>>> +static const char *const bpf_class_string[] = {
>>> +       "ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
>>> +};
>>> +
>>> +static const char *const bpf_alu_string[] = {
>>> +       "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
>>> +       "%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
>>> +};
>>> +
>>> +static const char *const bpf_ldst_string[] = {
>>> +       "u32", "u16", "u8", "u64"
>>> +};
>>> +
>>> +static const char *const bpf_jmp_string[] = {
>>> +       "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
>>> +};
>>> +
>>> +static const char *reg_to_str(int regno, u64 *regs)
>>> +{
>>> +       static char reg_value[16][32];
>>> +       if (!regs)
>>> +               return "";
>>> +       snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
>>> +                regs[regno]);
>>> +       return reg_value[regno];
>>> +}
>>> +
>>> +#define R(regno) reg_to_str(regno, regs)
>>> +
>>> +void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
>>> +{
>>> +       u16 class = BPF_CLASS(insn->code);
>>> +       if (class == BPF_ALU || class == BPF_ALU64) {
>>> +               if (BPF_SRC(insn->code) == BPF_X)
>>> +                       pr_info("code_%02x %sr%d%s %s r%d%s\n",
>>> +                               insn->code, class == BPF_ALU ? "(u32)" :
>>> "",
>>> +                               insn->a_reg, R(insn->a_reg),
>>> +                               bpf_alu_string[BPF_OP(insn->code) >> 4],
>>> +                               insn->x_reg, R(insn->x_reg));
>>> +               else
>>> +                       pr_info("code_%02x %sr%d%s %s %d\n",
>>> +                               insn->code, class == BPF_ALU ? "(u32)" :
>>> "",
>>> +                               insn->a_reg, R(insn->a_reg),
>>> +                               bpf_alu_string[BPF_OP(insn->code) >> 4],
>>> +                               insn->imm);
>>> +       } else if (class == BPF_STX) {
>>> +               if (BPF_MODE(insn->code) == BPF_MEM)
>>> +                       pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
>>> +                               insn->code,
>>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> +                               insn->a_reg, R(insn->a_reg),
>>> +                               insn->off, insn->x_reg, R(insn->x_reg));
>>> +               else if (BPF_MODE(insn->code) == BPF_XADD)
>>> +                       pr_info("code_%02x lock *(%s *)(r%d%s %+d) +=
>>> r%d%s\n",
>>> +                               insn->code,
>>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> +                               insn->a_reg, R(insn->a_reg), insn->off,
>>> +                               insn->x_reg, R(insn->x_reg));
>>> +               else
>>> +                       pr_info("BUG_%02x\n", insn->code);
>>> +       } else if (class == BPF_ST) {
>>> +               if (BPF_MODE(insn->code) != BPF_MEM) {
>>> +                       pr_info("BUG_st_%02x\n", insn->code);
>>> +                       return;
>>> +               }
>>> +               pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
>>> +                       insn->code,
>>> +                       bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
>>> +                       insn->a_reg, R(insn->a_reg),
>>> +                       insn->off, insn->imm);
>>> +       } else if (class == BPF_LDX) {
>>> +               if (BPF_MODE(insn->code) == BPF_MEM) {
>>> +                       pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
>>> +                               insn->code, insn->a_reg,
>>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> +                               insn->x_reg, R(insn->x_reg), insn->off);
>>> +               } else {
>>> +                       pr_info("BUG_ldx_%02x\n", insn->code);
>>> +               }
>>> +       } else if (class == BPF_LD) {
>>> +               if (BPF_MODE(insn->code) == BPF_ABS) {
>>> +                       pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
>>> +                               insn->code, insn->a_reg,
>>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> +                               insn->imm);
>>> +               } else if (BPF_MODE(insn->code) == BPF_IND) {
>>> +                       pr_info("code_%02x r%d = *(%s *)(skb + r%d%s
>>> %+d)\n",
>>> +                               insn->code, insn->a_reg,
>>> +                               bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> +                               insn->x_reg, R(insn->x_reg), insn->imm);
>>> +               } else {
>>> +                       pr_info("BUG_ld_%02x\n", insn->code);
>>> +               }
>>> +       } else if (class == BPF_JMP) {
>>> +               u16 opcode = BPF_OP(insn->code);
>>> +               if (opcode == BPF_CALL) {
>>> +                       pr_info("code_%02x call %d\n", insn->code,
>>> insn->imm);
>>> +               } else if (insn->code == (BPF_JMP | BPF_JA)) {
>>> +                       pr_info("code_%02x goto pc%+d\n",
>>> +                               insn->code, insn->off);
>>> +               } else if (BPF_SRC(insn->code) == BPF_X) {
>>> +                       pr_info("code_%02x if r%d%s %s r%d%s goto
>>> pc%+d\n",
>>> +                               insn->code, insn->a_reg, R(insn->a_reg),
>>> +                               bpf_jmp_string[BPF_OP(insn->code) >> 4],
>>> +                               insn->x_reg, R(insn->x_reg), insn->off);
>>> +               } else {
>>> +                       pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
>>> +                               insn->code, insn->a_reg, R(insn->a_reg),
>>> +                               bpf_jmp_string[BPF_OP(insn->code) >> 4],
>>> +                               insn->imm, insn->off);
>>> +               }
>>> +       } else {
>>> +               pr_info("code_%02x %s\n", insn->code,
>>> bpf_class_string[class]);
>>> +       }
>>> +}
>>> +EXPORT_SYMBOL(pr_info_bpf_insn);
>>
>>
>> Why would that need to be exported as a symbol?
>
> The performance numbers I mentioned are from bpf_bench, which is built
> as a kernel module, so I used this for debugging from it,
> and also to see what execution traces I get while running the trinity bpf fuzzer :)
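
Roughly, the measurement loop is along these lines (a sketch only; 'fp' and
'skb' are set up elsewhere in the module and none of the names below come
from the patch):

	/* hypothetical micro-benchmark: average cost of one filter run */
	u64 start = ktime_to_ns(ktime_get());
	u32 res = 0;
	int i;

	for (i = 0; i < 1000000; i++)
		res |= SK_RUN_FILTER(fp, skb);
	pr_info("res %u, avg %llu ns/call\n", res,
		(ktime_to_ns(ktime_get()) - start) / 1000000);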
>
>> I would actually like to avoid having this pr_info* inside the kernel.
>> Couldn't this be done e.g. through systemtap script that could e.g. be
>> placed under tools/net/ or inside the documentation file?
>
> like the idea!
> Will drop it from the diff and eventually will move it to tools/net.
>
>>> +/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style
>>> (BPF64)
>>> + *
>>> + * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate
>>> new
>>> + * program length in one pass
>>> + *
>>> + * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
>>> + *
>>> + * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
>>> + * to remap in two passes: 1st pass finds new jump offsets, 2nd pass
>>> remaps
>>> + */
>>
>>
>> Would be nice to have the API comment it in kdoc format.
>
> good point. will do.
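
Something along these lines, in kernel-doc style (the wording is only a
placeholder, and the sk_convert_filter name assumes the rename discussed
just below):

	/**
	 *	sk_convert_filter - convert an old-style filter program
	 *	@old_prog: classic BPF program
	 *	@len: number of instructions in @old_prog
	 *	@new_prog: buffer for the converted program, or NULL to only
	 *		   compute the required length
	 *	@p_new_len: in/out length of the converted program
	 *
	 *	First call with @new_prog == NULL to calculate the new length,
	 *	then allocate and call again to do the actual remapping.
	 */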
>
>>> +int bpf_convert(struct sock_filter *old_prog, int len,
>>> +               struct bpf_insn *new_prog, int *p_new_len)
>>> +{
>>
>>
>> If we place it into net/core/filter.c, it would be nice to keep naming
>> conventions consistent, e.g. sk_convert_filter() or so.
>
> ok.
>
>>> +       struct bpf_insn *new_insn;
>>> +       struct sock_filter *fp;
>>> +       int *addrs = NULL;
>>> +       int new_len = 0;
>>> +       int pass = 0;
>>> +       int tgt, i;
>>> +
>>> +       if (len <= 0 || len >= BPF_MAXINSNS)
>>> +               return -EINVAL;
>>> +
>>> +       if (new_prog) {
>>> +               addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
>>> +               if (!addrs)
>>> +                       return -ENOMEM;
>>> +       }
>>> +
>>> +do_pass:
>>> +       new_insn = new_prog;
>>> +       fp = old_prog;
>>> +       for (i = 0; i < len; fp++, i++) {
>>> +               struct bpf_insn tmp_insns[3] = {};
>>> +               struct bpf_insn *insn = tmp_insns;
>>> +
>>> +               if (addrs)
>>> +                       addrs[i] = new_insn - new_prog;
>>> +
>>> +               switch (fp->code) {
>>> +               /* all arithmetic insns and skb loads map as-is */
>>> +               case BPF_ALU | BPF_ADD | BPF_X:
>>> +               case BPF_ALU | BPF_ADD | BPF_K:
>>> +               case BPF_ALU | BPF_SUB | BPF_X:
>>> +               case BPF_ALU | BPF_SUB | BPF_K:
>>> +               case BPF_ALU | BPF_AND | BPF_X:
>>> +               case BPF_ALU | BPF_AND | BPF_K:
>>> +               case BPF_ALU | BPF_OR | BPF_X:
>>> +               case BPF_ALU | BPF_OR | BPF_K:
>>> +               case BPF_ALU | BPF_LSH | BPF_X:
>>> +               case BPF_ALU | BPF_LSH | BPF_K:
>>> +               case BPF_ALU | BPF_RSH | BPF_X:
>>> +               case BPF_ALU | BPF_RSH | BPF_K:
>>> +               case BPF_ALU | BPF_XOR | BPF_X:
>>> +               case BPF_ALU | BPF_XOR | BPF_K:
>>> +               case BPF_ALU | BPF_MUL | BPF_X:
>>> +               case BPF_ALU | BPF_MUL | BPF_K:
>>> +               case BPF_ALU | BPF_DIV | BPF_X:
>>> +               case BPF_ALU | BPF_DIV | BPF_K:
>>> +               case BPF_ALU | BPF_MOD | BPF_X:
>>> +               case BPF_ALU | BPF_MOD | BPF_K:
>>> +               case BPF_ALU | BPF_NEG:
>>> +               case BPF_LD | BPF_ABS | BPF_W:
>>> +               case BPF_LD | BPF_ABS | BPF_H:
>>> +               case BPF_LD | BPF_ABS | BPF_B:
>>> +               case BPF_LD | BPF_IND | BPF_W:
>>> +               case BPF_LD | BPF_IND | BPF_H:
>>> +               case BPF_LD | BPF_IND | BPF_B:
>>
>>
>> I think here and elsewhere for similar constructions, we could use
>> BPF_S_* helpers that was introduced by Hagen in commit 01f2f3f6ef4d076
>> ("net: optimize Berkeley Packet Filter (BPF) processing").
>
> Well, old bpf opcodes were just unlucky to be on the wrong side of
> compiler heuristics at that time.
> Instead of doing remapping while loading and the opposite remapping in
> sk_get_filter(), the problem could have been solved with a few dummy
> 'case' statements. That would have added only a few lines of code instead
> of a hundred lines of mapping back and forth.
> Also I suspect commit 01f2f3f6ef4d076 was valid only when the whole kernel
> is compiled with -Os. With -O2 GCC should have done it right.
> The heuristic is: number_of_case_stmts * ratio < range_of_case_values,
> and the ratio is 3 for -Os and 10 for -O2.
> Old bpf has ~70 stmts and a ~200 range.
> See expand_switch_as_decision_tree_p() in gcc/stmt.c.
> Eventually, when the current interpreter is retired, I would like to remove
> the remapping as well.
> Especially for sk_convert_filter() I would like to keep the normal BPF_
> values from uapi/filter.h, since I reuse the majority of them as-is and
> remapping would have added unnecessary code.
>
>>> +                       insn->code = fp->code;
>>> +                       insn->a_reg = 6;
>>> +                       insn->x_reg = 7;
>>> +                       insn->imm = fp->k;
>>> +                       break;
>>> +
>>> +               /* jump opcodes map as-is, but offsets need adjustment */
>>> +               case BPF_JMP | BPF_JA:
>>> +                       tgt = i + fp->k + 1;
>>> +                       insn->code = fp->code;
>>> +#define EMIT_JMP \
>>> +       do { \
>>> +               if (tgt >= len || tgt < 0) \
>>> +                       goto err; \
>>> +               insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
>>> +       } while (0)
>>> +
>>> +                       EMIT_JMP;
>>> +                       break;
>>> +
>>> +               case BPF_JMP | BPF_JEQ | BPF_K:
>>> +               case BPF_JMP | BPF_JEQ | BPF_X:
>>> +               case BPF_JMP | BPF_JSET | BPF_K:
>>> +               case BPF_JMP | BPF_JSET | BPF_X:
>>> +               case BPF_JMP | BPF_JGT | BPF_K:
>>> +               case BPF_JMP | BPF_JGT | BPF_X:
>>> +               case BPF_JMP | BPF_JGE | BPF_K:
>>> +               case BPF_JMP | BPF_JGE | BPF_X:
>>> +                       insn->a_reg = 6;
>>> +                       insn->x_reg = 7;
>>> +                       insn->imm = fp->k;
>>> +                       /* common case where 'jump_false' is next insn */
>>> +                       if (fp->jf == 0) {
>>> +                               insn->code = fp->code;
>>> +                               tgt = i + fp->jt + 1;
>>> +                               EMIT_JMP;
>>> +                               break;
>>> +                       }
>>> +                       /* convert JEQ into JNE when 'jump_true' is next
>>> insn */
>>> +                       if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
>>> +                               insn->code = BPF_JMP | BPF_JNE |
>>> +                                       BPF_SRC(fp->code);
>>> +                               tgt = i + fp->jf + 1;
>>> +                               EMIT_JMP;
>>> +                               break;
>>> +                       }
>>> +                       /* other jumps are mapped into two insns: Jxx and
>>> JA */
>>> +                       tgt = i + fp->jt + 1;
>>> +                       insn->code = fp->code;
>>> +                       EMIT_JMP;
>>> +
>>> +                       insn++;
>>> +                       insn->code = BPF_JMP | BPF_JA;
>>> +                       tgt = i + fp->jf + 1;
>>> +                       EMIT_JMP;
>>> +                       /* adjust pc relative offset, since it's a 2nd
>>> insn */
>>> +                       insn->off--;
>>> +                       break;
>>> +
>>> +                       /* ldxb 4*([14]&0xf) is remaped into 3 insns */
>>> +               case BPF_LDX | BPF_MSH | BPF_B:
>>> +                       insn->code = BPF_LD | BPF_ABS | BPF_B;
>>> +                       insn->a_reg = 7;
>>> +                       insn->imm = fp->k;
>>> +
>>> +                       insn++;
>>> +                       insn->code = BPF_ALU | BPF_AND | BPF_K;
>>> +                       insn->a_reg = 7;
>>> +                       insn->imm = 0xf;
>>> +
>>> +                       insn++;
>>> +                       insn->code = BPF_ALU | BPF_LSH | BPF_K;
>>> +                       insn->a_reg = 7;
>>> +                       insn->imm = 2;
>>> +                       break;
>>> +
>>> +                       /* RET_K, RET_A are remaped into 2 insns */
>>> +               case BPF_RET | BPF_A:
>>> +               case BPF_RET | BPF_K:
>>> +                       insn->code = BPF_ALU | BPF_MOV |
>>> +                               (BPF_SRC(fp->code) == BPF_K ? BPF_K :
>>> BPF_X);
>>> +                       insn->a_reg = 0;
>>> +                       insn->x_reg = 6;
>>> +                       insn->imm = fp->k;
>>> +
>>> +                       insn++;
>>> +                       insn->code = BPF_RET | BPF_K;
>>> +                       break;
>>> +
>>> +                       /* store to stack */
>>> +               case BPF_ST:
>>> +               case BPF_STX:
>>> +                       insn->code = BPF_STX | BPF_MEM | BPF_W;
>>> +                       insn->a_reg = 10;
>>> +                       insn->x_reg = fp->code == BPF_ST ? 6 : 7;
>>> +                       insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>>> +                       break;
>>> +
>>> +                       /* load from stack */
>>> +               case BPF_LD | BPF_MEM:
>>> +               case BPF_LDX | BPF_MEM:
>>> +                       insn->code = BPF_LDX | BPF_MEM | BPF_W;
>>> +                       insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>>> 7;
>>> +                       insn->x_reg = 10;
>>> +                       insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>>> +                       break;
>>> +
>>> +                       /* A = K or X = K */
>>> +               case BPF_LD | BPF_IMM:
>>> +               case BPF_LDX | BPF_IMM:
>>> +                       insn->code = BPF_ALU | BPF_MOV | BPF_K;
>>> +                       insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>>> 7;
>>> +                       insn->imm = fp->k;
>>> +                       break;
>>> +
>>> +                       /* X = A */
>>> +               case BPF_MISC | BPF_TAX:
>>> +                       insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>>> +                       insn->a_reg = 7;
>>> +                       insn->x_reg = 6;
>>> +                       break;
>>> +
>>> +                       /* A = X */
>>> +               case BPF_MISC | BPF_TXA:
>>> +                       insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>>> +                       insn->a_reg = 6;
>>> +                       insn->x_reg = 7;
>>> +                       break;
>>> +
>>> +                       /* A = skb->len or X = skb->len */
>>> +               case BPF_LD|BPF_W|BPF_LEN:
>>> +               case BPF_LDX|BPF_W|BPF_LEN:
>>> +                       insn->code = BPF_LDX | BPF_MEM | BPF_W;
>>> +                       insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>>> 7;
>>> +                       insn->x_reg = 1;
>>> +                       insn->off = offsetof(struct sk_buff, len);
>>> +                       break;
>>> +
>>> +               default:
>>> +                       /* pr_err("unknown opcode %02x\n", fp->code); */
>>> +                       goto err;
>>> +               }
>>> +
>>> +               insn++;
>>> +               if (new_prog) {
>>> +                       memcpy(new_insn, tmp_insns,
>>> +                              sizeof(*insn) * (insn - tmp_insns));
>>> +               }
>>> +               new_insn += insn - tmp_insns;
>>> +       }
>>> +
>>> +       if (!new_prog) {
>>> +               /* only calculating new length */
>>> +               *p_new_len = new_insn - new_prog;
>>> +               return 0;
>>> +       }
>>> +
>>> +       pass++;
>>> +       if (new_len != new_insn - new_prog) {
>>> +               new_len = new_insn - new_prog;
>>> +               if (pass > 2)
>>> +                       goto err;
>>> +               goto do_pass;
>>> +       }
>>> +       kfree(addrs);
>>> +       if (*p_new_len != new_len)
>>> +               /* inconsistent new program length */
>>> +               pr_err("bpf_convert() usage error\n");
>>> +       return 0;
>>> +err:
>>> +       kfree(addrs);
>>> +       return -EINVAL;
>>> +}
>>> +
>>> +notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
>>> +{
>>
>>
>> Similarly, sk_run_filter_ext()?
>
> ok.
>
>>> +       u64 stack[64];
>>> +       u64 regs[16];
>>> +       void *ptr;
>>> +       u64 tmp;
>>> +       int off;
>>> +
>>> +#ifdef __x86_64
>>> +#define LOAD_IMM /**/
>>> +#define K insn->imm
>>> +#else
>>> +#define LOAD_IMM (K = insn->imm)
>>> +       s32 K = insn->imm;
>>> +#endif
>>> +
>>> +#ifdef DEBUG
>>> +#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
>>> +#else
>>> +#define DEBUG_INSN
>>> +#endif
>>
>>
>> This DEBUG hunk could then be removed when we use a stap script
>> instead, for example.
>
> ok
>
>>> +#define A regs[insn->a_reg]
>>> +#define X regs[insn->x_reg]
>>> +
>>> +#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
>>> +#define L(A, B, C) L_##A##B##C:
>>
>>
>> Could we get rid of these two defines? I know you're defining
>
> I think it's actually more readable this way, with less chance of typos.
> DL - define label
> L - label
> Will try to remove the 2nd macro to see how it looks...
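
For one opcode, the two macros expand to the following (taken straight from
the definitions above; shown only to illustrate the label naming):

	/* DL(BPF_ALU64, BPF_MOV, BPF_X) inside the jump table becomes: */
	[BPF_ALU64|BPF_MOV|BPF_X] = &&L_BPF_ALU64BPF_MOVBPF_X,

	/* and L(BPF_ALU64, BPF_MOV, BPF_X) in front of the handler becomes: */
	L_BPF_ALU64BPF_MOVBPF_X: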
>
>> and using that as labels, but it's not obvious at first what
>> 'jt' does. Maybe 'jt_labels' ?
>
> ok will rename.
>
>>> +#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>>> +#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>>
>>
>> Not a big fan of macros, but here it seems fine though.
>>
>>
>>> +/* some compilers may need help:
>>> + * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code];
>>> })
>>> + */
>>> +
>>> +       static const void *jt[256] = {
>>> +               [0 ... 255] = &&L_default,
>>
>>
>> Nit: please just one define per line below:
>
> ok.
>
>>> +               DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
>>> +               DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
>>> +               DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
>>> +               DL(BPF_ALU, BPF_OR, BPF_X)  DL(BPF_ALU, BPF_OR, BPF_K)
>>> +               DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
>>> +               DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
>>> +               DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
>>> +               DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
>>> +               DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
>>> +               DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
>>> +               DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
>>> +               DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_OR, BPF_X)  DL(BPF_ALU64, BPF_OR, BPF_K)
>>> +               DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD,
>>> BPF_K)
>>> +               DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
>>> +               DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
>>> +               DL(BPF_ALU, BPF_NEG, 0)
>>> +               DL(BPF_JMP, BPF_CALL, 0)
>>> +               DL(BPF_JMP, BPF_JA, 0)
>>> +               DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
>>> +               DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
>>> +               DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
>>> +               DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
>>> +               DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
>>> +               DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
>>> +               DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
>>> +               DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
>>> +               DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
>>> +               DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
>>> +               DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
>>> +               DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
>>> +               DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
>>> +               DL(BPF_STX, BPF_XADD, BPF_W)
>>> +#ifdef CONFIG_64BIT
>>> +               DL(BPF_STX, BPF_XADD, BPF_DW)
>>> +#endif
>>> +               DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
>>> +               DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
>>> +               DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
>>> +               DL(BPF_RET, BPF_K, 0)
>>> +       };
>>> +
>>> +       regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
>>> +       regs[1/* BPF R1 */] = (u64)(ulong)ctx;
>>> +
>>> +       DEBUG_INSN;
>>> +       /* execute 1st insn */
>>> +select_insn:
>>> +       goto *jt[insn->code];
>>> +
>>> +               /* ALU */
>>> +#define ALU(OPCODE, OP) \
>>> +       L_BPF_ALU64##OPCODE##BPF_X: \
>>> +               A = A OP X; \
>>> +               CONT; \
>>> +       L_BPF_ALU##OPCODE##BPF_X: \
>>> +               A = (u32)A OP (u32)X; \
>>> +               CONT; \
>>> +       L_BPF_ALU64##OPCODE##BPF_K: \
>>> +               A = A OP K; \
>>> +               CONT; \
>>> +       L_BPF_ALU##OPCODE##BPF_K: \
>>> +               A = (u32)A OP (u32)K; \
>>> +               CONT;
>>> +
>>> +       ALU(BPF_ADD, +)
>>> +       ALU(BPF_SUB, -)
>>> +       ALU(BPF_AND, &)
>>> +       ALU(BPF_OR, |)
>>> +       ALU(BPF_LSH, <<)
>>> +       ALU(BPF_RSH, >>)
>>> +       ALU(BPF_XOR, ^)
>>> +       ALU(BPF_MUL, *)
>>> +
>>> +       L(BPF_ALU, BPF_NEG, 0)
>>> +               A = (u32)-A;
>>> +               CONT;
>>> +       L(BPF_ALU, BPF_MOV, BPF_X)
>>> +               A = (u32)X;
>>> +               CONT;
>>> +       L(BPF_ALU, BPF_MOV, BPF_K)
>>> +               A = (u32)K;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_MOV, BPF_X)
>>> +               A = X;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_MOV, BPF_K)
>>> +               A = K;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_ARSH, BPF_X)
>>> +               (*(s64 *) &A) >>= X;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_ARSH, BPF_K)
>>> +               (*(s64 *) &A) >>= K;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_MOD, BPF_X)
>>> +               tmp = A;
>>> +               if (X)
>>> +                       A = do_div(tmp, X);
>>> +               CONT;
>>> +       L(BPF_ALU, BPF_MOD, BPF_X)
>>> +               tmp = (u32)A;
>>> +               if (X)
>>> +                       A = do_div(tmp, (u32)X);
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_MOD, BPF_K)
>>> +               tmp = A;
>>> +               if (K)
>>> +                       A = do_div(tmp, K);
>>> +               CONT;
>>> +       L(BPF_ALU, BPF_MOD, BPF_K)
>>> +               tmp = (u32)A;
>>> +               if (K)
>>> +                       A = do_div(tmp, (u32)K);
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_DIV, BPF_X)
>>> +               if (X)
>>> +                       do_div(A, X);
>>> +               CONT;
>>> +       L(BPF_ALU, BPF_DIV, BPF_X)
>>> +               tmp = (u32)A;
>>> +               if (X)
>>> +                       do_div(tmp, (u32)X);
>>> +               A = (u32)tmp;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_DIV, BPF_K)
>>> +               if (K)
>>> +                       do_div(A, K);
>>> +               CONT;
>>> +       L(BPF_ALU, BPF_DIV, BPF_K)
>>> +               tmp = (u32)A;
>>> +               if (K)
>>> +                       do_div(tmp, (u32)K);
>>> +               A = (u32)tmp;
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_BSWAP32, BPF_X)
>>> +               A = swab32(A);
>>> +               CONT;
>>> +       L(BPF_ALU64, BPF_BSWAP64, BPF_X)
>>> +               A = swab64(A);
>>> +               CONT;
>>> +
>>> +               /* CALL */
>>> +       L(BPF_JMP, BPF_CALL, 0)
>>> +               /* TODO execute_func(K, regs); */
>>> +               CONT;
>>
>>
>> Would the filter checker for now just return -EINVAL when this is used?
>
> sure.
> I was planning to address that in a week or so, but ok.
> Will remove 'call' insn in the first patch and will re-add in a clean
> way in a next diff.
>
>>> +               /* JMP */
>>> +       L(BPF_JMP, BPF_JA, 0)
>>> +               insn += insn->off;
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JEQ, BPF_X)
>>> +               if (A == X) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JEQ, BPF_K)
>>> +               if (A == K) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JNE, BPF_X)
>>> +               if (A != X) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JNE, BPF_K)
>>> +               if (A != K) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JGT, BPF_X)
>>> +               if (A > X) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JGT, BPF_K)
>>> +               if (A > K) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JGE, BPF_X)
>>> +               if (A >= X) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JGE, BPF_K)
>>> +               if (A >= K) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JSGT, BPF_X)
>>> +               if (((s64)A) > ((s64)X)) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JSGT, BPF_K)
>>> +               if (((s64)A) > ((s64)K)) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JSGE, BPF_X)
>>> +               if (((s64)A) >= ((s64)X)) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JSGE, BPF_K)
>>> +               if (((s64)A) >= ((s64)K)) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JSET, BPF_X)
>>> +               if (A & X) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>> +       L(BPF_JMP, BPF_JSET, BPF_K)
>>> +               if (A & (u32)K) {
>>> +                       insn += insn->off;
>>> +                       CONT_JMP;
>>> +               }
>>> +               CONT;
>>
>>
>> Okay, so right now we only support forward jumps, right? And these
>> are checked by the old checker code I'd assume?
>
> yes. Just like the old interpreter, the new interpreter doesn't care; it just jumps.
> And you're correct: the old checker code allows only forward jumps.
> That restriction is preserved by sk_convert_filter() as well.
>
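
For context, a small illustration of why classic BPF jumps can only go
forward: in struct sock_filter the jt/jf targets are unsigned 8-bit
offsets taken relative to the next instruction, so a backward target
cannot even be encoded (illustrative snippet, not from the patch):

  #include <linux/filter.h>

  /* jt/jf are u8 offsets past the following insn: forward only */
  struct sock_filter jeq_example =
          BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x86dd /* ETH_P_IPV6 */,
                   0 /* jt: fall through to next insn */,
                   1 /* jf: skip one insn forward     */);
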
>>> +               /* STX and ST and LDX*/
>>> +#define LDST(SIZEOP, SIZE) \
>>> +       L_BPF_STXBPF_MEM##SIZEOP: \
>>> +               *(SIZE *)(ulong)(A + insn->off) = X; \
>>> +               CONT; \
>>> +       L_BPF_STBPF_MEM##SIZEOP: \
>>> +               *(SIZE *)(ulong)(A + insn->off) = K; \
>>> +               CONT; \
>>> +       L_BPF_LDXBPF_MEM##SIZEOP: \
>>> +               A = *(SIZE *)(ulong)(X + insn->off); \
>>> +               CONT;
>>> +
>>> +               LDST(BPF_B, u8)
>>> +               LDST(BPF_H, u16)
>>> +               LDST(BPF_W, u32)
>>> +               LDST(BPF_DW, u64)
>>> +
>>> +               /* STX XADD */
>>> +       L(BPF_STX, BPF_XADD, BPF_W)
>>> +               atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
>>> +               CONT;
>>> +#ifdef CONFIG_64BIT
>>> +       L(BPF_STX, BPF_XADD, BPF_DW)
>>> +               atomic64_add((u64)X, (atomic64_t *)(ulong)(A + insn->off));
>>> +               CONT;
>>> +#endif
>>> +
>>> +               /* LD from SKB + K */
>>
>>
>> Nit: indent one tab too much (here and elsewhere)
>
> ahh. ok.
>
>>> +       L(BPF_LD, BPF_ABS, BPF_W)
>>> +               off = K;
>>> +load_word:
>>> +               ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
>>> +               if (likely(ptr != NULL)) {
>>> +                       A = get_unaligned_be32(ptr);
>>> +                       CONT;
>>> +               }
>>> +               return 0;
>>> +
>>> +       L(BPF_LD, BPF_ABS, BPF_H)
>>> +               off = K;
>>> +load_half:
>>> +               ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
>>> +               if (likely(ptr != NULL)) {
>>> +                       A = get_unaligned_be16(ptr);
>>> +                       CONT;
>>> +               }
>>> +               return 0;
>>> +
>>> +       L(BPF_LD, BPF_ABS, BPF_B)
>>> +               off = K;
>>> +load_byte:
>>> +               ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
>>> +               if (likely(ptr != NULL)) {
>>> +                       A = *(u8 *)ptr;
>>> +                       CONT;
>>> +               }
>>> +               return 0;
>>> +
>>> +               /* LD from SKB + X + K */
>>
>>
>> Nit: ditto
>
> ok
>
>>> +       L(BPF_LD, BPF_IND, BPF_W)
>>> +               off = K + X;
>>> +               goto load_word;
>>> +
>>> +       L(BPF_LD, BPF_IND, BPF_H)
>>> +               off = K + X;
>>> +               goto load_half;
>>> +
>>> +       L(BPF_LD, BPF_IND, BPF_B)
>>> +               off = K + X;
>>> +               goto load_byte;
>>> +
>>> +               /* RET */
>>> +       L(BPF_RET, BPF_K, 0)
>>> +               return regs[0/* R0 */];
>>> +
>>> +       L_default:
>>> +               /* bpf_check() or bpf_convert() will guarantee that
>>> +                * we never reach here
>>> +                */
>>> +               WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
>>> +               return 0;
>>> +#undef DL
>>> +#undef L
>>> +#undef CONT
>>> +#undef A
>>> +#undef X
>>> +#undef K
>>> +#undef LOAD_IMM
>>> +#undef DEBUG_INSN
>>> +#undef ALU
>>> +#undef LDST
>>> +}
>>> +EXPORT_SYMBOL(bpf_run);
>>> +
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index ad30d626a5bd..575bf5fd4335 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
>>>   {
>>>         struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
>>>
>>> +       if ((void *)fp->bpf_func == (void *)bpf_run)
>>> +               /* arch specific jit_free are expecting this value */
>>> +               fp->bpf_func = sk_run_filter;
>>> +
>>>         bpf_jit_free(fp);
>>>   }
>>>   EXPORT_SYMBOL(sk_filter_release_rcu);
>>> @@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
>>>         return 0;
>>>   }
>>>
>>> +static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog *fprog,
>>> +                        struct sock *sk)
>>> +{
>>
>>
>> sk_prepare_filter_ext()?
>
> ok.
>
>>> +       unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
>>> +       struct sock_filter *old_prog;
>>> +       unsigned int sk_fsize;
>>> +       struct sk_filter *fp;
>>> +       int new_len;
>>> +       int err;
>>> +
>>> +       /* store old program into buffer, since chk_filter will remap opcodes */
>>> +       old_prog = kmalloc(fsize, GFP_KERNEL);
>>> +       if (!old_prog)
>>> +               return -ENOMEM;
>>> +
>>> +       if (sk) {
>>> +               if (copy_from_user(old_prog, fprog->filter, fsize)) {
>>> +                       err = -EFAULT;
>>> +                       goto free_prog;
>>> +               }
>>> +       } else {
>>> +               memcpy(old_prog, fprog->filter, fsize);
>>> +       }
>>> +
>>> +       /* calculate bpf64 program length */
>>> +       err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
>>> +       if (err)
>>> +               goto free_prog;
>>> +
>>> +       sk_fsize = sk_filter_size(new_len);
>>> +       /* allocate sk_filter to store bpf64 program */
>>> +       if (sk)
>>> +               fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>>> +       else
>>> +               fp = kmalloc(sk_fsize, GFP_KERNEL);
>>> +       if (!fp) {
>>> +               err = -ENOMEM;
>>> +               goto free_prog;
>>> +       }
>>> +
>>> +       /* remap old insns into bpf64 */
>>> +       err = bpf_convert(old_prog, fprog->len,
>>> +                         (struct bpf_insn *)fp->insns, &new_len);
>>> +       if (err)
>>> +               /* 2nd bpf_convert() can fail only if it fails
>>> +                * to allocate memory, remapping must succeed
>>> +                */
>>> +               goto free_fp;
>>> +
>>> +       /* now chk_filter can overwrite old_prog while checking */
>>> +       err = sk_chk_filter(old_prog, fprog->len);
>>> +       if (err)
>>> +               goto free_fp;
>>> +
>>> +       /* discard old prog */
>>> +       kfree(old_prog);
>>> +
>>> +       atomic_set(&fp->refcnt, 1);
>>> +       fp->len = new_len;
>>> +
>>> +       /* bpf64 insns must be executed by bpf_run */
>>> +       fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
>>> +
>>> +       *pfp = fp;
>>> +       return 0;
>>> +free_fp:
>>> +       if (sk)
>>> +               sock_kfree_s(sk, fp, sk_fsize);
>>> +       else
>>> +               kfree(fp);
>>> +free_prog:
>>> +       kfree(old_prog);
>>> +       return err;
>>> +}
>>> +
>>>   /**
>>>    *    sk_unattached_filter_create - create an unattached filter
>>>    *    @fprog: the filter program
>>> @@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
>>>         if (fprog->filter == NULL)
>>>                 return -EINVAL;
>>>
>>> +       if (bpf64_enable)
>>> +               return bpf64_prepare(pfp, fprog, NULL);
>>> +
>>>         fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
>>>         if (!fp)
>>>                 return -ENOMEM;
>>> @@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>>>         if (fprog->filter == NULL)
>>>                 return -EINVAL;
>>>
>>> -       fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>>> -       if (!fp)
>>> -               return -ENOMEM;
>>> -       if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>>> -               sock_kfree_s(sk, fp, sk_fsize);
>>> -               return -EFAULT;
>>> -       }
>>> +       if (bpf64_enable) {
>>> +               err = bpf64_prepare(&fp, fprog, sk);
>>> +               if (err)
>>> +                       return err;
>>> +       } else {
>>> +               fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>>> +               if (!fp)
>>> +                       return -ENOMEM;
>>> +               if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>>> +                       sock_kfree_s(sk, fp, sk_fsize);
>>> +                       return -EFAULT;
>>> +               }
>>>
>>> -       atomic_set(&fp->refcnt, 1);
>>> -       fp->len = fprog->len;
>>> +               atomic_set(&fp->refcnt, 1);
>>> +               fp->len = fprog->len;
>>>
>>> -       err = __sk_prepare_filter(fp);
>>> -       if (err) {
>>> -               sk_filter_uncharge(sk, fp);
>>> -               return err;
>>> +               err = __sk_prepare_filter(fp);
>>> +               if (err) {
>>> +                       sk_filter_uncharge(sk, fp);
>>> +                       return err;
>>> +               }
>>>         }
>>>
>>>         old_fp = rcu_dereference_protected(sk->sk_filter,
>>> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
>>> index cf9cd13509a7..f03acc0e8950 100644
>>> --- a/net/core/sysctl_net_core.c
>>> +++ b/net/core/sysctl_net_core.c
>>> @@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
>>>         },
>>>   #endif
>>>         {
>>> +               .procname       = "bpf64_enable",
>>> +               .data           = &bpf64_enable,
>>> +               .maxlen         = sizeof(int),
>>> +               .mode           = 0644,
>>> +               .proc_handler   = proc_dointvec
>>> +       },
>>> +       {
>>>                 .procname       = "netdev_tstamp_prequeue",
>>>                 .data           = &netdev_tstamp_prequeue,
>>>                 .maxlen         = sizeof(int),
>>>
>>
>> Hope some of the comments made sense. ;-)
>
> Yes. Indeed. Thanks a lot for the thorough review!
> Will address things and resend V4.
>
> Alexei
>
>> Thanks,
>>
>> Daniel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-02-28 20:53     ` Alexei Starovoitov
  2014-03-01  0:10       ` Alexei Starovoitov
@ 2014-03-01  0:30       ` Daniel Borkmann
  2014-03-03 10:05         ` Hagen Paul Pfeifer
  1 sibling, 1 reply; 8+ messages in thread
From: Daniel Borkmann @ 2014-03-01  0:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, Linus Torvalds, Andrew Morton,
	Frederic Weisbecker, Arnaldo Carvalho de Melo, Pekka Enberg,
	Arjan van de Ven, Christoph Hellwig, linux-kernel, netdev,
	Hagen Paul Pfeifer, Jesse Gross

On 02/28/2014 09:53 PM, Alexei Starovoitov wrote:
> On Fri, Feb 28, 2014 at 4:45 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
...
>> Did you also test that seccomp-BPF still works out?
>
> yes. Have a prototype, but it needs a bit more cleanup.

Here's [1] some example user-space code for prctl(). The libseccomp
library [2] wraps around that and makes usage easier.

   [1] http://outflux.net/teach-seccomp/
   [2] http://lwn.net/Articles/491308/
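
For reference, a minimal user-space sketch of installing a trivial
allow-everything seccomp-BPF filter via prctl() (error handling trimmed;
see [1] for a complete example that actually restricts syscalls):

  #include <stdio.h>
  #include <sys/prctl.h>
  #include <linux/filter.h>
  #include <linux/seccomp.h>

  int main(void)
  {
          /* trivial policy: allow every syscall; a real filter would
           * inspect seccomp_data->arch and ->nr first */
          struct sock_filter filter[] = {
                  BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
          };
          struct sock_fprog prog = {
                  .len    = sizeof(filter) / sizeof(filter[0]),
                  .filter = filter,
          };

          if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
                  return 1;
          if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
                  return 1;
          printf("seccomp filter installed\n");
          return 0;
  }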

...
>> We should keep naming consistent (so either extended BPF or BPF64),
>> so maybe bpf_ext_enable ? I'd presume rather {bpf,sk_filter}*_ext
>
> agree. we need consistent naming for both (old and new).
> I'll try an all-out rename of bpf_*() into sk_filter64_*() and sk_filter_ext_*()
> to see which one looks better.
> I'm leaning towards sk_filter64, since it's easier to quickly spot
> the difference and it's more descriptive.

Just saw your second email regarding sk_filter_ext() et al, yep, I agree.

>> as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
>> comparisons, you'd still need to load two immediate values, right?
>
> there is no insn that uses a 64-bit immediate, since 64-bit immediates
> are extremely rare. Grepping x86-64 asm code for movabsq returns very few hits.
> llvm or gcc can easily construct any constant by a combination of mov,
> shifts and ors.
> bpf64 comparisons are all 64-bit right now. So far I haven't seen a need for
> 32-bit comparisons, since old bpf is all unsigned, and mapping 32->64 of
> Jxx is painless.

Hm, fair enough, I was just thinking of comparisons of IPv6 addresses
when we do socket filtering. On the other hand, old and new insns are
both 64 bits wide and can be used through the same API then.
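
As an illustration (not from the patch), a 64-bit constant can be
materialized from two 32-bit immediates with a mov/shift/or sequence;
in C terms, roughly:

  /* sketch of the insn sequence a compiler could emit:
   *   mov32 r1, hi ; lsh r1, 32 ; mov32 r2, lo ; or r1, r2
   */
  static inline unsigned long long build_imm64(unsigned int hi,
                                               unsigned int lo)
  {
          unsigned long long r = hi;

          r <<= 32;
          r |= lo;
          return r;
  }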

> Notice I added signed comparisons, since real-life programs cannot do
> without them.
> I also kept the spirit of old bpf in having > and >= only (no < and <=).
> That made the llvm/gcc backends a bit harder to write, since no real cpu has
> such a limitation.

Hehe, I proposed them once, but for low-level BPF it was just easier to
adjust the jt/jf offsets differently to achieve the same effect.
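
For example (illustrative fragment), a classic BPF "A < 42" test is
written with the available BPF_JGE by inverting the condition and
swapping the jt/jf targets:

  #include <linux/filter.h>

  struct sock_filter lt_via_jge[] = {
          BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 42, 1 /* jt */, 0 /* jf */),
          BPF_STMT(BPF_RET | BPF_K, 0xffff),      /* A <  42: accept */
          BPF_STMT(BPF_RET | BPF_K, 0),           /* A >= 42: drop   */
  };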

> I'm still contemplating whether to add < and <= (both signed and unsigned), which is
> an interesting trade-off: number of instructions vs. complexity of the compiler.
>
>> After all your changes, we will still have the bpf_jit_enable knob
>> intact, right?
>
> Yes.
> In this diff the workflow is the following:
>
> old filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (bpf64) {
>      convert to new
>      sk_chk_filter() - check old bpf
>      use new interpreter
> } else {
>      sk_chk_filter() - check old bpf
>      if (bpf_jit_enable)
>          use old jit
>      else
>          use old interpreter
> }
> Soon I'll add a bpf64 JIT and reuse the same bpf_jit_enable knob for it,
> then add new/old in-band demux into sk_attach_filter(),
> so that the workflow will become:
>
> a filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (new filter prog) {
>      sk_chk_filter64() - check new bpf
>      if (bpf_jit_enable)
>          use new jit
>      else
>          use new interpreter
> } else {
>      if (bpf64_enable) {
>         convert to new
>         sk_chk_filter() - check old bpf
>         if (bpf_jit_enable)
>              use new jit
>         else
>              use new interpreter
>      } else {
>         sk_chk_filter()
>         if (bpf_jit_enable)
>             use old jit
>         else
>             use old interpreter
>      }
> }
> Eventually bpf64_enable can be made the default,
> the last 'else' can be retired, and 'bpf64_enable' removed along with
> the old interpreter.
>
> bpf_jit_enable knob will stay for foreseeable future.

Okay, cool. I think it seems reasonable to keep this knob around anyway.
E.g. for seccomp, some people might argue speed is important, while other,
more security-oriented distros might not want to rely on the JIT and would
therefore trade away some performance.

...
>> Why would that need to be exported as a symbol?
>
> the performance numbers I mentioned are from bpf_bench, which is built
> as a kernel module, so I used this export for debugging from it,
> and also to see what execution traces I get while running the trinity bpf fuzzer :)
>
>> I would actually like to avoid having this pr_info* inside the kernel.
>> Couldn't this be done e.g. through a systemtap script that could be
>> placed under tools/net/ or inside the documentation file?
>
> like the idea!
> Will drop it from the diff and eventually will move it to tools/net.

Sounds great!

Thanks,

Daniel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-03-01  0:30       ` Daniel Borkmann
@ 2014-03-03 10:05         ` Hagen Paul Pfeifer
  2014-03-03 23:04           ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Hagen Paul Pfeifer @ 2014-03-03 10:05 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, David S. Miller, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, linux-kernel, netdev, Jesse Gross

* Daniel Borkmann | 2014-03-01 01:30:00 [+0100]:

>>>as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
>>>comparisons, you'd still need to load to immediate values, right?
>>
>>there is no insn that use 64-bit immediate, since 64-bit immediates
>>are extremely rare. grep x86-64 asm code for movabsq will return very few.
>>llvm or gcc can easily construct any constant by combination of mov,
>>shifts and ors.
>>bpf64 comparisons are all 64-bit right now. So far I didn't see a need to do
>>32-bit comparison, since old bpf is all unsigned, mapping 32->64 of
>>Jxx is painless.
>
>Hm, fair enough, I was just thinking for comparisons of IPv6 addresses
>when we do socket filtering. On the other hand, old and new insns are
>both 64 bit wide and can be used though the same api then.

What about the long-term idea of supporting JITed nftables? A 128-bit immediate
is required - maybe the biggest requirement for nftables support.

Hagen

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
  2014-03-03 10:05         ` Hagen Paul Pfeifer
@ 2014-03-03 23:04           ` Alexei Starovoitov
  0 siblings, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2014-03-03 23:04 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Daniel Borkmann, David S. Miller, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	Linus Torvalds, Andrew Morton, Frederic Weisbecker,
	Arnaldo Carvalho de Melo, Pekka Enberg, Arjan van de Ven,
	Christoph Hellwig, linux-kernel, netdev, Jesse Gross

On Mon, Mar 3, 2014 at 2:05 AM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
> * Daniel Borkmann | 2014-03-01 01:30:00 [+0100]:
>
>>>>as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
>>>>comparisons, you'd still need to load to immediate values, right?
>>>
>>>there is no insn that use 64-bit immediate, since 64-bit immediates
>>>are extremely rare. grep x86-64 asm code for movabsq will return very few.
>>>llvm or gcc can easily construct any constant by combination of mov,
>>>shifts and ors.
>>>bpf64 comparisons are all 64-bit right now. So far I didn't see a need to do
>>>32-bit comparison, since old bpf is all unsigned, mapping 32->64 of
>>>Jxx is painless.
>>
>>Hm, fair enough, I was just thinking for comparisons of IPv6 addresses
>>when we do socket filtering. On the other hand, old and new insns are
>>both 64 bit wide and can be used though the same api then.
>
> What about the long term idea to support JITed nftables? A 128 bit immediate
> is required - maybe the biggest requirement for nftable support.

I'm still planning to bring the benefits of the ebpf JIT to nft.
There are different ways to approach it. I'm not ready to debate the details,
since I don't have working code for nft+bpf yet, and code speaks better than words.
But I'm confident that the ebpf instruction set will not need 128-bit extensions.
If something unforeseen is needed, we can always add it.
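
As an illustration of why 128-bit immediates are likely unnecessary
(a sketch, not the actual nft+bpf plan): a 128-bit match, e.g. an IPv6
address, decomposes into two 64-bit comparisons:

  /* match a 128-bit value using two 64-bit compares */
  static inline int match128(unsigned long long hi, unsigned long long lo,
                             unsigned long long want_hi,
                             unsigned long long want_lo)
  {
          return hi == want_hi && lo == want_lo;
  }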

Right now I'm testing ebpf+seccomp.
As a micro-benchmark I took a test from libseccomp and added a dummy
syscall loop. There is a nice speedup; I will post a patch soon.
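
A rough sketch of such a dummy-syscall micro-benchmark (hypothetical,
not the patch referred to above): time a tight loop of cheap syscalls
with and without the seccomp filter attached:

  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  #define ITERS 1000000L

  int main(void)
  {
          struct timespec t0, t1;
          long i;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (i = 0; i < ITERS; i++)
                  syscall(SYS_getppid);   /* cheap syscall hit by the filter */
          clock_gettime(CLOCK_MONOTONIC, &t1);

          printf("%.1f ns/syscall\n",
                 ((t1.tv_sec - t0.tv_sec) * 1e9 +
                  (t1.tv_nsec - t0.tv_nsec)) / ITERS);
          return 0;
  }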

Thanks
Alexei

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-03-03 23:04 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-27  2:38 [PATCH v3 net-next 0/1] bpf32->bpf64 mapper and bpf64 interpreter Alexei Starovoitov
2014-02-27  2:38 ` [PATCH v3 net-next 1/1] " Alexei Starovoitov
2014-02-28 12:45   ` Daniel Borkmann
2014-02-28 20:53     ` Alexei Starovoitov
2014-03-01  0:10       ` Alexei Starovoitov
2014-03-01  0:30       ` Daniel Borkmann
2014-03-03 10:05         ` Hagen Paul Pfeifer
2014-03-03 23:04           ` Alexei Starovoitov
