* [PATCH v2] arm: eBPF JIT compiler
@ 2017-05-25 23:13 ` Shubham Bansal
From: Shubham Bansal @ 2017-05-25 23:13 UTC
  To: linux; +Cc: linux-arm-kernel, linux-kernel, andrew, keescook, Shubham Bansal

The JIT compiler emits 32-bit ARM instructions. It currently compiles
eBPF only; classic BPF is still supported because the BPF core converts
it to eBPF first.

This patch converts the existing Berkeley Packet Filter JIT compiler
from classic BPF to the internal (eBPF) format. Almost all instructions
of the eBPF ISA are supported, except the following:
	BPF_ALU64 | BPF_DIV | BPF_K
	BPF_ALU64 | BPF_DIV | BPF_X
	BPF_ALU64 | BPF_MOD | BPF_K
	BPF_ALU64 | BPF_MOD | BPF_X
	BPF_STX | BPF_XADD | BPF_W
	BPF_STX | BPF_XADD | BPF_DW
	BPF_JMP | BPF_CALL
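
As a rough illustration of the list above, the excluded opcodes can be
recognised with a plain switch on the instruction's opcode byte. This is
only a sketch: the helper name jit_unsupported is hypothetical and is not
taken from the patch, and how the JIT itself rejects these instructions
is not shown here. The opcode field values are the ones defined in the
kernel's uapi BPF headers, copied so that the example builds on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Opcode field values, copied from the kernel's uapi BPF headers. */
#define BPF_STX   0x03
#define BPF_JMP   0x05
#define BPF_ALU64 0x07
#define BPF_W     0x00
#define BPF_DW    0x18
#define BPF_XADD  0xc0
#define BPF_K     0x00
#define BPF_X     0x08
#define BPF_ADD   0x00
#define BPF_DIV   0x30
#define BPF_MOD   0x90
#define BPF_CALL  0x80

/* Hypothetical helper: true if the opcode is on the unsupported list. */
bool jit_unsupported(uint8_t code)
{
	switch (code) {
	case BPF_ALU64 | BPF_DIV | BPF_K:
	case BPF_ALU64 | BPF_DIV | BPF_X:
	case BPF_ALU64 | BPF_MOD | BPF_K:
	case BPF_ALU64 | BPF_MOD | BPF_X:
	case BPF_STX | BPF_XADD | BPF_W:
	case BPF_STX | BPF_XADD | BPF_DW:
	case BPF_JMP | BPF_CALL:
		return true;
	default:
		return false;
	}
}

/* Usage: 64-bit DIV is on the list, 64-bit ADD is not. */
void jit_unsupported_demo(void)
{
	assert(jit_unsupported(BPF_ALU64 | BPF_DIV | BPF_X));
	assert(!jit_unsupported(BPF_ALU64 | BPF_ADD | BPF_X));
}
```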

The implementation uses stack scratch space to emulate the 64-bit eBPF
ISA on 32-bit ARM, because ARM does not have enough general-purpose
registers. Currently, only little-endian machines are supported by this
eBPF JIT compiler.
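
To make the emulation idea concrete, here is a minimal user-space sketch
of a 64-bit add carried out on two 32-bit halves, which is what an
ADDS/ADC instruction pair computes. The struct reg64 type and the
reg64_add name are assumptions made for this example only; the JIT emits
ARM instructions rather than running C like this.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 64-bit eBPF register as two 32-bit words. */
struct reg64 {
	uint32_t lo;
	uint32_t hi;
};

/* dst += src, computed the way an ADDS (low) / ADC (high) pair would. */
void reg64_add(struct reg64 *dst, const struct reg64 *src)
{
	uint32_t lo = dst->lo + src->lo;
	/* Unsigned wrap-around of the low word means a carry out. */
	uint32_t carry = lo < dst->lo;

	dst->lo = lo;
	dst->hi = dst->hi + src->hi + carry;
}

/* Usage: 0x00000001ffffffff + 1 carries into the high word. */
void reg64_add_demo(void)
{
	struct reg64 a = { .lo = 0xffffffffu, .hi = 1 };
	struct reg64 b = { .lo = 1, .hi = 0 };

	reg64_add(&a, &b);
	assert(a.lo == 0 && a.hi == 2);
}
```

A 32-bit ALU operation only touches the low word and then zeroes the
high word, so no carry handling is needed in that case.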

Tested on ARMv7 with QEMU by me (Shubham Bansal).
Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
It is expected to work on ARMv6 as well, as it is part ARMv7 and part
ARMv5, although ARMv6 has not been properly tested.

Both of these tests were run on little-endian machines, once with and
once without CONFIG_FRAME_POINTER.

For testing:

1. JIT is enabled with
	echo 1 > /proc/sys/net/core/bpf_jit_enable
2. Constant blinding can be enabled along with the JIT using
	echo 1 > /proc/sys/net/core/bpf_jit_enable
	echo 2 > /proc/sys/net/core/bpf_jit_harden

See Documentation/networking/filter.txt for more information.

Result : test_bpf: Summary: 314 PASSED, 0 FAILED, [278/306 JIT'ed]

Signed-off-by: Shubham Bansal <illusionist.neo@gmail.com>
---
 Documentation/networking/filter.txt |    4 +-
 arch/arm/Kconfig                    |    2 +-
 arch/arm/net/bpf_jit_32.c           | 2404 ++++++++++++++++++++++++-----------
 arch/arm/net/bpf_jit_32.h           |  108 +-
 4 files changed, 1713 insertions(+), 805 deletions(-)

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index b69b205..01165ac 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -596,8 +596,8 @@ skb pointer). All constraints and restrictions from bpf_check_classic() apply
 before a conversion to the new layout is being done behind the scenes!
 
 Currently, the classic BPF format is being used for JITing on most 32-bit
-architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64 perform JIT
-compilation from eBPF instruction set.
+architectures, whereas x86-64, aarch64, arm, s390x, powerpc64, sparc64 perform
+JIT compilation from eBPF instruction set.
 
 Some core changes of the new internal format:
 
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 8a7ab5e..13ade46 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -47,7 +47,7 @@ config ARM
 	select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
-	select HAVE_CBPF_JIT
+	select HAVE_EBPF_JIT
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CONTEXT_TRACKING
 	select HAVE_C_RECORDMCOUNT
diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index 93d0b6d..c7476e5 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -1,13 +1,15 @@
 /*
- * Just-In-Time compiler for BPF filters on 32bit ARM
+ * Just-In-Time compiler for eBPF filters on 32bit ARM
  *
  * Copyright (c) 2011 Mircea Gherzan <mgherzan@gmail.com>
+ * Copyright (c) 2017 Shubham Bansal <illusionist.neo@gmail.com>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License as published by the
  * Free Software Foundation; version 2 of the License.
  */
 
+#include <linux/bpf.h>
 #include <linux/bitops.h>
 #include <linux/compiler.h>
 #include <linux/errno.h>
@@ -23,44 +25,91 @@
 
 #include "bpf_jit_32.h"
 
+int bpf_jit_enable __read_mostly;
+
+#define STACK_OFFSET(k)	(k)
+#define TMP_REG_1	(MAX_BPF_JIT_REG + 0)	/* TEMP Register 1 */
+#define TMP_REG_2	(MAX_BPF_JIT_REG + 1)	/* TEMP Register 2 */
+#define TCALL_CNT	(MAX_BPF_JIT_REG + 2)	/* Tail Call Count */
+
+/* Flags used for JIT optimization */
+#define SEEN_CALL	(1 << 0)
+
+#define FLAG_IMM_OVERFLOW	(1 << 0)
+
 /*
- * ABI:
+ * Map eBPF registers to ARM 32bit registers or stack scratch space.
+ *
+ * 1. First argument is passed using the arm 32bit registers and rest of the
+ * arguments are passed on stack scratch space.
+ * 2. First callee-saved argument is mapped to arm 32 bit registers and rest
+ * arguments are mapped to scratch space on stack.
+ * 3. We need two 64 bit temp registers to do complex operations on eBPF
+ * registers.
+ *
+ * As the eBPF registers are all 64 bit registers and arm has only 32 bit
+ * registers, we have to map each eBPF registers with two arm 32 bit regs or
+ * scratch memory space and we have to build eBPF 64 bit register from those.
  *
- * r0	scratch register
- * r4	BPF register A
- * r5	BPF register X
- * r6	pointer to the skb
- * r7	skb->data
- * r8	skb_headlen(skb)
  */
+static const u8 bpf2a32[][2] = {
+	/* return value from in-kernel function, and exit value from eBPF */
+	[BPF_REG_0] = {ARM_R1, ARM_R0},
+	/* arguments from eBPF program to in-kernel function */
+	[BPF_REG_1] = {ARM_R3, ARM_R2},
+	/* Stored on stack scratch space */
+	[BPF_REG_2] = {STACK_OFFSET(0), STACK_OFFSET(4)},
+	[BPF_REG_3] = {STACK_OFFSET(8), STACK_OFFSET(12)},
+	[BPF_REG_4] = {STACK_OFFSET(16), STACK_OFFSET(20)},
+	[BPF_REG_5] = {STACK_OFFSET(24), STACK_OFFSET(28)},
+	/* callee saved registers that in-kernel function will preserve */
+	[BPF_REG_6] = {ARM_R5, ARM_R4},
+	/* Stored on stack scratch space */
+	[BPF_REG_7] = {STACK_OFFSET(32), STACK_OFFSET(36)},
+	[BPF_REG_8] = {STACK_OFFSET(40), STACK_OFFSET(44)},
+	[BPF_REG_9] = {STACK_OFFSET(48), STACK_OFFSET(52)},
+	/* Read only Frame Pointer to access Stack */
+	[BPF_REG_FP] = {STACK_OFFSET(56), STACK_OFFSET(60)},
+	/* Temporary Register for internal BPF JIT, can be used
+	 * for constant blindings and others.
+	 */
+	[TMP_REG_1] = {ARM_R7, ARM_R6},
+	[TMP_REG_2] = {ARM_R10, ARM_R8},
+	/* Tail call count. Stored on stack scratch space. */
+	[TCALL_CNT] = {STACK_OFFSET(64), STACK_OFFSET(68)},
+	/* temporary register for blinding constants.
+	 * Stored on stack scratch space.
+	 */
+	[BPF_REG_AX] = {STACK_OFFSET(72), STACK_OFFSET(76)},
+};
 
-#define r_scratch	ARM_R0
-/* r1-r3 are (also) used for the unaligned loads on the non-ARMv7 slowpath */
-#define r_off		ARM_R1
-#define r_A		ARM_R4
-#define r_X		ARM_R5
-#define r_skb		ARM_R6
-#define r_skb_data	ARM_R7
-#define r_skb_hl	ARM_R8
-
-#define SCRATCH_SP_OFFSET	0
-#define SCRATCH_OFF(k)		(SCRATCH_SP_OFFSET + 4 * (k))
-
-#define SEEN_MEM		((1 << BPF_MEMWORDS) - 1)
-#define SEEN_MEM_WORD(k)	(1 << (k))
-#define SEEN_X			(1 << BPF_MEMWORDS)
-#define SEEN_CALL		(1 << (BPF_MEMWORDS + 1))
-#define SEEN_SKB		(1 << (BPF_MEMWORDS + 2))
-#define SEEN_DATA		(1 << (BPF_MEMWORDS + 3))
+#define	dst_lo	dst[1]
+#define dst_hi	dst[0]
+#define src_lo	src[1]
+#define src_hi	src[0]
 
-#define FLAG_NEED_X_RESET	(1 << 0)
-#define FLAG_IMM_OVERFLOW	(1 << 1)
+/*
+ * JIT Context:
+ *
+ * prog			:	bpf_prog
+ * idx			:	index of current last JITed instruction.
+ * prologue_bytes	:	bytes used in prologue.
+ * epilogue_offset	:	offset of epilogue starting.
+ * seen			:	bit mask used for JIT optimization.
+ * offsets		:	array of eBPF instruction offsets in
+ *				JITed code.
+ * target		:	final JITed code.
+ * epilogue_bytes	:	no of bytes used in epilogue.
+ * imm_count		:	no of immediate counts used for global
+ *				variables.
+ * imms			:	array of global variable addresses.
+ */
 
 struct jit_ctx {
-	const struct bpf_prog *skf;
-	unsigned idx;
-	unsigned prologue_bytes;
-	int ret0_fp_idx;
+	const struct bpf_prog *prog;
+	unsigned int idx;
+	unsigned int prologue_bytes;
+	unsigned int epilogue_offset;
 	u32 seen;
 	u32 flags;
 	u32 *offsets;
@@ -72,68 +121,16 @@ struct jit_ctx {
 #endif
 };
 
-int bpf_jit_enable __read_mostly;
-
-static inline int call_neg_helper(struct sk_buff *skb, int offset, void *ret,
-		      unsigned int size)
-{
-	void *ptr = bpf_internal_load_pointer_neg_helper(skb, offset, size);
-
-	if (!ptr)
-		return -EFAULT;
-	memcpy(ret, ptr, size);
-	return 0;
-}
-
-static u64 jit_get_skb_b(struct sk_buff *skb, int offset)
-{
-	u8 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 1);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 1);
-
-	return (u64)err << 32 | ret;
-}
-
-static u64 jit_get_skb_h(struct sk_buff *skb, int offset)
-{
-	u16 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 2);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 2);
-
-	return (u64)err << 32 | ntohs(ret);
-}
-
-static u64 jit_get_skb_w(struct sk_buff *skb, int offset)
-{
-	u32 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 4);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 4);
-
-	return (u64)err << 32 | ntohl(ret);
-}
-
 /*
  * Wrappers which handle both OABI and EABI and assures Thumb2 interworking
  * (where the assembly routines like __aeabi_uidiv could cause problems).
  */
-static u32 jit_udiv(u32 dividend, u32 divisor)
+static u32 jit_udiv32(u32 dividend, u32 divisor)
 {
 	return dividend / divisor;
 }
 
-static u32 jit_mod(u32 dividend, u32 divisor)
+static u32 jit_mod32(u32 dividend, u32 divisor)
 {
 	return dividend % divisor;
 }
@@ -157,36 +154,22 @@ static inline void emit(u32 inst, struct jit_ctx *ctx)
 	_emit(ARM_COND_AL, inst, ctx);
 }
 
-static u16 saved_regs(struct jit_ctx *ctx)
+/*
+ * Checks if immediate value can be converted to imm12(12 bits) value.
+ */
+static int16_t imm8m(u32 x)
 {
-	u16 ret = 0;
-
-	if ((ctx->skf->len > 1) ||
-	    (ctx->skf->insns[0].code == (BPF_RET | BPF_A)))
-		ret |= 1 << r_A;
-
-#ifdef CONFIG_FRAME_POINTER
-	ret |= (1 << ARM_FP) | (1 << ARM_IP) | (1 << ARM_LR) | (1 << ARM_PC);
-#else
-	if (ctx->seen & SEEN_CALL)
-		ret |= 1 << ARM_LR;
-#endif
-	if (ctx->seen & (SEEN_DATA | SEEN_SKB))
-		ret |= 1 << r_skb;
-	if (ctx->seen & SEEN_DATA)
-		ret |= (1 << r_skb_data) | (1 << r_skb_hl);
-	if (ctx->seen & SEEN_X)
-		ret |= 1 << r_X;
-
-	return ret;
-}
+	u32 rot;
 
-static inline int mem_words_used(struct jit_ctx *ctx)
-{
-	/* yes, we do waste some stack space IF there are "holes" in the set" */
-	return fls(ctx->seen & SEEN_MEM);
+	for (rot = 0; rot < 16; rot++)
+		if ((x & ~ror32(0xff, 2 * rot)) == 0)
+			return rol32(x, 2 * rot) | (rot << 8);
+	return -1;
 }
 
+/*
+ * Initializes the JIT space with undefined instructions.
+ */
 static void jit_fill_hole(void *area, unsigned int size)
 {
 	u32 *ptr;
@@ -195,88 +178,34 @@ static void jit_fill_hole(void *area, unsigned int size)
 		*ptr++ = __opcode_to_mem_arm(ARM_INST_UDF);
 }
 
-static void build_prologue(struct jit_ctx *ctx)
-{
-	u16 reg_set = saved_regs(ctx);
-	u16 off;
-
-#ifdef CONFIG_FRAME_POINTER
-	emit(ARM_MOV_R(ARM_IP, ARM_SP), ctx);
-	emit(ARM_PUSH(reg_set), ctx);
-	emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
-#else
-	if (reg_set)
-		emit(ARM_PUSH(reg_set), ctx);
-#endif
+/* Stack must be multiples of 16 Bytes */
+#define STACK_ALIGN(sz) (((sz) + 15) & ~15)
 
-	if (ctx->seen & (SEEN_DATA | SEEN_SKB))
-		emit(ARM_MOV_R(r_skb, ARM_R0), ctx);
-
-	if (ctx->seen & SEEN_DATA) {
-		off = offsetof(struct sk_buff, data);
-		emit(ARM_LDR_I(r_skb_data, r_skb, off), ctx);
-		/* headlen = len - data_len */
-		off = offsetof(struct sk_buff, len);
-		emit(ARM_LDR_I(r_skb_hl, r_skb, off), ctx);
-		off = offsetof(struct sk_buff, data_len);
-		emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
-		emit(ARM_SUB_R(r_skb_hl, r_skb_hl, r_scratch), ctx);
-	}
-
-	if (ctx->flags & FLAG_NEED_X_RESET)
-		emit(ARM_MOV_I(r_X, 0), ctx);
-
-	/* do not leak kernel data to userspace */
-	if (bpf_needs_clear_a(&ctx->skf->insns[0]))
-		emit(ARM_MOV_I(r_A, 0), ctx);
-
-	/* stack space for the BPF_MEM words */
-	if (ctx->seen & SEEN_MEM)
-		emit(ARM_SUB_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
-}
-
-static void build_epilogue(struct jit_ctx *ctx)
-{
-	u16 reg_set = saved_regs(ctx);
-
-	if (ctx->seen & SEEN_MEM)
-		emit(ARM_ADD_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
-
-	reg_set &= ~(1 << ARM_LR);
-
-#ifdef CONFIG_FRAME_POINTER
-	/* the first instruction of the prologue was: mov ip, sp */
-	reg_set &= ~(1 << ARM_IP);
-	reg_set |= (1 << ARM_SP);
-	emit(ARM_LDM(ARM_SP, reg_set), ctx);
-#else
-	if (reg_set) {
-		if (ctx->seen & SEEN_CALL)
-			reg_set |= 1 << ARM_PC;
-		emit(ARM_POP(reg_set), ctx);
-	}
+/* Stack space for BPF_REG_2, BPF_REG_3, BPF_REG_4,
+ * BPF_REG_5, BPF_REG_7, BPF_REG_8, BPF_REG_9,
+ * BPF_REG_FP and Tail call counts.
+ */
+#define SCRATCH_SIZE 80
 
-	if (!(ctx->seen & SEEN_CALL))
-		emit(ARM_BX(ARM_LR), ctx);
-#endif
-}
+/* total stack size used in JITed code */
+#define _STACK_SIZE \
+	(MAX_BPF_STACK + \
+	 + SCRATCH_SIZE + \
+	 + 4 /* extra for skb_copy_bits buffer */)
 
-static int16_t imm8m(u32 x)
-{
-	u32 rot;
+#define STACK_SIZE STACK_ALIGN(_STACK_SIZE)
 
-	for (rot = 0; rot < 16; rot++)
-		if ((x & ~ror32(0xff, 2 * rot)) == 0)
-			return rol32(x, 2 * rot) | (rot << 8);
+/* Get the offset of eBPF REGISTERs stored on scratch space. */
+#define STACK_VAR(off) (STACK_SIZE-off-4)
 
-	return -1;
-}
+/* Offset of skb_copy_bits buffer */
+#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
 
 #if __LINUX_ARM_ARCH__ < 7
 
 static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 {
-	unsigned i = 0, offset;
+	unsigned int i = 0, offset;
 	u16 imm;
 
 	/* on the "fake" run we just count them (duplicates included) */
@@ -295,7 +224,7 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 		ctx->imms[i] = k;
 
 	/* constants go just after the epilogue */
-	offset =  ctx->offsets[ctx->skf->len];
+	offset =  ctx->offsets[ctx->prog->len - 1] * 4;
 	offset += ctx->prologue_bytes;
 	offset += ctx->epilogue_bytes;
 	offset += i * 4;
@@ -319,10 +248,22 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 
 #endif /* __LINUX_ARM_ARCH__ */
 
+static inline int bpf2a32_offset(int bpf_to, int bpf_from,
+				 const struct jit_ctx *ctx) {
+	int to, from;
+
+	if (ctx->target == NULL)
+		return 0;
+	to = ctx->offsets[bpf_to];
+	from = ctx->offsets[bpf_from];
+
+	return to - from - 1;
+}
+
 /*
  * Move an immediate that's not an imm8m to a core register.
  */
-static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
+static inline void emit_mov_i_no8m(const u8 rd, u32 val, struct jit_ctx *ctx)
 {
 #if __LINUX_ARM_ARCH__ < 7
 	emit(ARM_LDR_I(rd, ARM_PC, imm_offset(val, ctx)), ctx);
@@ -333,7 +274,7 @@ static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
 #endif
 }
 
-static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
+static inline void emit_mov_i(const u8 rd, u32 val, struct jit_ctx *ctx)
 {
 	int imm12 = imm8m(val);
 
@@ -343,676 +284,1553 @@ static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
 		emit_mov_i_no8m(rd, val, ctx);
 }
 
-#if __LINUX_ARM_ARCH__ < 6
-
-static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
 {
-	_emit(cond, ARM_LDRB_I(ARM_R3, r_addr, 1), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 3), ctx);
-	_emit(cond, ARM_LSL_I(ARM_R3, ARM_R3, 16), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R0, r_addr, 2), ctx);
-	_emit(cond, ARM_ORR_S(ARM_R3, ARM_R3, ARM_R1, SRTYPE_LSL, 24), ctx);
-	_emit(cond, ARM_ORR_R(ARM_R3, ARM_R3, ARM_R2), ctx);
-	_emit(cond, ARM_ORR_S(r_res, ARM_R3, ARM_R0, SRTYPE_LSL, 8), ctx);
+	ctx->seen |= SEEN_CALL;
+#if __LINUX_ARM_ARCH__ < 5
+	emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
+
+	if (elf_hwcap & HWCAP_THUMB)
+		emit(ARM_BX(tgt_reg), ctx);
+	else
+		emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
+#else
+	emit(ARM_BLX_R(tgt_reg), ctx);
+#endif
 }
 
-static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+static inline int epilogue_offset(const struct jit_ctx *ctx)
 {
-	_emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 1), ctx);
-	_emit(cond, ARM_ORR_S(r_res, ARM_R2, ARM_R1, SRTYPE_LSL, 8), ctx);
+	int to, from;
+	/* No need for 1st dummy run */
+	if (ctx->target == NULL)
+		return 0;
+	to = ctx->epilogue_offset;
+	from = ctx->idx;
+
+	return to - from - 2;
 }
 
-static inline void emit_swap16(u8 r_dst, u8 r_src, struct jit_ctx *ctx)
+static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx, u8 op)
 {
-	/* r_dst = (r_src << 8) | (r_src >> 8) */
-	emit(ARM_LSL_I(ARM_R1, r_src, 8), ctx);
-	emit(ARM_ORR_S(r_dst, ARM_R1, r_src, SRTYPE_LSR, 8), ctx);
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	s32 jmp_offset;
+
+	/* checks if divisor is zero or not. If it is, then
+	 * exit directly.
+	 */
+	emit(ARM_CMP_I(rn, 0), ctx);
+	_emit(ARM_COND_EQ, ARM_MOV_I(ARM_R0, 0), ctx);
+	jmp_offset = epilogue_offset(ctx);
+	_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+#if __LINUX_ARM_ARCH__ == 7
+	if (elf_hwcap & HWCAP_IDIVA) {
+		if (op == BPF_DIV)
+			emit(ARM_UDIV(rd, rm, rn), ctx);
+		else {
+			emit(ARM_UDIV(ARM_IP, rm, rn), ctx);
+			emit(ARM_MLS(rd, rn, ARM_IP, rm), ctx);
+		}
+		return;
+	}
+#endif
 
 	/*
-	 * we need to mask out the bits set in r_dst[23:16] due to
-	 * the first shift instruction.
-	 *
-	 * note that 0x8ff is the encoded immediate 0x00ff0000.
+	 * For BPF_ALU | BPF_DIV | BPF_K instructions:
+	 * ARM_R1 and ARM_R0 contain the 1st argument of the bpf
+	 * function, so we need to save them on the caller side
+	 * to keep them from being clobbered by the callee.
+	 * After the return from the callee, we restore ARM_R0
+	 * and ARM_R1.
 	 */
-	emit(ARM_BIC_I(r_dst, r_dst, 0x8ff), ctx);
-}
+	if (rn != ARM_R1) {
+		emit(ARM_MOV_R(tmp[0], ARM_R1), ctx);
+		emit(ARM_MOV_R(ARM_R1, rn), ctx);
+	}
+	if (rm != ARM_R0) {
+		emit(ARM_MOV_R(tmp[1], ARM_R0), ctx);
+		emit(ARM_MOV_R(ARM_R0, rm), ctx);
+	}
 
-#else  /* ARMv6+ */
+	/* Call appropriate function */
+	ctx->seen |= SEEN_CALL;
+	emit_mov_i(ARM_IP, op == BPF_DIV ?
+		   (u32)jit_udiv32 : (u32)jit_mod32, ctx);
+	emit_blx_r(ARM_IP, ctx);
 
-static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
-{
-	_emit(cond, ARM_LDR_I(r_res, r_addr, 0), ctx);
-#ifdef __LITTLE_ENDIAN
-	_emit(cond, ARM_REV(r_res, r_res), ctx);
-#endif
+	/* Save return value */
+	if (rd != ARM_R0)
+		emit(ARM_MOV_R(rd, ARM_R0), ctx);
+
+	/* Restore ARM_R0 and ARM_R1 */
+	if (rn != ARM_R1)
+		emit(ARM_MOV_R(ARM_R1, tmp[0]), ctx);
+	if (rm != ARM_R0)
+		emit(ARM_MOV_R(ARM_R0, tmp[1]), ctx);
 }
 
-static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+/* Checks whether BPF register is on scratch stack space or not. */
+static inline bool is_on_stack(u8 bpf_reg)
 {
-	_emit(cond, ARM_LDRH_I(r_res, r_addr, 0), ctx);
-#ifdef __LITTLE_ENDIAN
-	_emit(cond, ARM_REV16(r_res, r_res), ctx);
-#endif
+	static u8 stack_regs[] = {BPF_REG_AX, BPF_REG_3, BPF_REG_4, BPF_REG_5,
+				BPF_REG_7, BPF_REG_8, BPF_REG_9, TCALL_CNT,
+				BPF_REG_2, BPF_REG_FP};
+	int i, reg_len = sizeof(stack_regs);
+
+	for (i = 0 ; i < reg_len ; i++) {
+		if (bpf_reg == stack_regs[i])
+			return true;
+	}
+	return false;
 }
 
-static inline void emit_swap16(u8 r_dst __maybe_unused,
-			       u8 r_src __maybe_unused,
-			       struct jit_ctx *ctx __maybe_unused)
+static inline void emit_a32_mov_i(const u8 dst, const u32 val,
+				  bool dstk, struct jit_ctx *ctx)
 {
-#ifdef __LITTLE_ENDIAN
-	emit(ARM_REV16(r_dst, r_src), ctx);
-#endif
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+
+	if (dstk) {
+		emit_mov_i(tmp[1], val, ctx);
+		emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(dst)), ctx);
+	} else {
+		emit_mov_i(dst, val, ctx);
+	}
 }
 
-#endif /* __LINUX_ARM_ARCH__ < 6 */
+/* Sign extended move */
+static inline void emit_a32_mov_i64(const bool is64, const u8 dst[],
+				  const u32 val, bool dstk,
+				  struct jit_ctx *ctx) {
+	u32 hi = 0;
 
+	if (is64 && (val & (1<<31)))
+		hi = (u32)~0;
+	emit_a32_mov_i(dst_lo, val, dstk, ctx);
+	emit_a32_mov_i(dst_hi, hi, dstk, ctx);
+}
 
-/* Compute the immediate value for a PC-relative branch. */
-static inline u32 b_imm(unsigned tgt, struct jit_ctx *ctx)
-{
-	u32 imm;
+static inline void emit_a32_add_r(const u8 dst, const u8 src,
+			      const bool is64, const bool hi,
+			      struct jit_ctx *ctx) {
+	/* 64 bit :
+	 *	adds dst_lo, dst_lo, src_lo
+	 *	adc dst_hi, dst_hi, src_hi
+	 * 32 bit :
+	 *	add dst_lo, dst_lo, src_lo
+	 */
+	if (!hi && is64)
+		emit(ARM_ADDS_R(dst, dst, src), ctx);
+	else if (hi && is64)
+		emit(ARM_ADC_R(dst, dst, src), ctx);
+	else
+		emit(ARM_ADD_R(dst, dst, src), ctx);
+}
 
-	if (ctx->target == NULL)
-		return 0;
-	/*
-	 * BPF allows only forward jumps and the offset of the target is
-	 * still the one computed during the first pass.
+static inline void emit_a32_sub_r(const u8 dst, const u8 src,
+				  const bool is64, const bool hi,
+				  struct jit_ctx *ctx) {
+	/* 64 bit :
+	 *	subs dst_lo, dst_lo, src_lo
+	 *	sbc dst_hi, dst_hi, src_hi
+	 * 32 bit :
+	 *	sub dst_lo, dst_lo, src_lo
 	 */
-	imm  = ctx->offsets[tgt] + ctx->prologue_bytes - (ctx->idx * 4 + 8);
+	if (!hi && is64)
+		emit(ARM_SUBS_R(dst, dst, src), ctx);
+	else if (hi && is64)
+		emit(ARM_SBC_R(dst, dst, src), ctx);
+	else
+		emit(ARM_SUB_R(dst, dst, src), ctx);
+}
 
-	return imm >> 2;
+static inline void emit_alu_r(const u8 dst, const u8 src, const bool is64,
+			      const bool hi, const u8 op, struct jit_ctx *ctx){
+	switch (BPF_OP(op)) {
+	/* dst = dst + src */
+	case BPF_ADD:
+		emit_a32_add_r(dst, src, is64, hi, ctx);
+		break;
+	/* dst = dst - src */
+	case BPF_SUB:
+		emit_a32_sub_r(dst, src, is64, hi, ctx);
+		break;
+	/* dst = dst | src */
+	case BPF_OR:
+		emit(ARM_ORR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst & src */
+	case BPF_AND:
+		emit(ARM_AND_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst ^ src */
+	case BPF_XOR:
+		emit(ARM_EOR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst * src */
+	case BPF_MUL:
+		emit(ARM_MUL(dst, dst, src), ctx);
+		break;
+	/* dst = dst << src */
+	case BPF_LSH:
+		emit(ARM_LSL_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst >> src */
+	case BPF_RSH:
+		emit(ARM_LSR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst >> src (signed)*/
+	case BPF_ARSH:
+		emit(ARM_MOV_SR(dst, dst, SRTYPE_ASR, src), ctx);
+		break;
+	}
 }
 
-#define OP_IMM3(op, r1, r2, imm_val, ctx)				\
-	do {								\
-		imm12 = imm8m(imm_val);					\
-		if (imm12 < 0) {					\
-			emit_mov_i_no8m(r_scratch, imm_val, ctx);	\
-			emit(op ## _R((r1), (r2), r_scratch), ctx);	\
-		} else {						\
-			emit(op ## _I((r1), (r2), imm12), ctx);		\
-		}							\
-	} while (0)
-
-static inline void emit_err_ret(u8 cond, struct jit_ctx *ctx)
-{
-	if (ctx->ret0_fp_idx >= 0) {
-		_emit(cond, ARM_B(b_imm(ctx->ret0_fp_idx, ctx)), ctx);
-		/* NOP to keep the size constant between passes */
-		emit(ARM_MOV_R(ARM_R0, ARM_R0), ctx);
+/* ALU operation (32 bit)
+ * dst = dst (op) src
+ */
+static inline void emit_a32_alu_r(const u8 dst, const u8 src,
+				  bool dstk, bool sstk,
+				  struct jit_ctx *ctx, const bool is64,
+				  const bool hi, const u8 op) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rn = sstk ? tmp[1] : src;
+
+	if (sstk)
+		emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src)), ctx);
+
+	/* ALU operation */
+	if (dstk) {
+		emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
+		emit_alu_r(tmp[0], rn, is64, hi, op, ctx);
+		emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
 	} else {
-		_emit(cond, ARM_MOV_I(ARM_R0, 0), ctx);
-		_emit(cond, ARM_B(b_imm(ctx->skf->len, ctx)), ctx);
+		emit_alu_r(dst, rn, is64, hi, op, ctx);
 	}
 }
 
-static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
-{
-#if __LINUX_ARM_ARCH__ < 5
-	emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
+/* ALU operation (64 bit) */
+static inline void emit_a32_alu_r64(const bool is64, const u8 dst[],
+				  const u8 src[], bool dstk,
+				  bool sstk, struct jit_ctx *ctx,
+				  const u8 op) {
+	emit_a32_alu_r(dst_lo, src_lo, dstk, sstk, ctx, is64, false, op);
+	if (is64)
+		emit_a32_alu_r(dst_hi, src_hi, dstk, sstk, ctx, is64, true, op);
+	else
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+}
 
-	if (elf_hwcap & HWCAP_THUMB)
-		emit(ARM_BX(tgt_reg), ctx);
+/* dst = src (4 bytes)*/
+static inline void emit_a32_mov_r(const u8 dst, const u8 src,
+				  bool dstk, bool sstk,
+				  struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rt = sstk ? tmp[0] : src;
+
+	if (sstk)
+		emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(src)), ctx);
+	if (dstk)
+		emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst)), ctx);
 	else
-		emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
-#else
-	emit(ARM_BLX_R(tgt_reg), ctx);
-#endif
+		emit(ARM_MOV_R(dst, rt), ctx);
 }
 
-static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx,
-				int bpf_op)
-{
-#if __LINUX_ARM_ARCH__ == 7
-	if (elf_hwcap & HWCAP_IDIVA) {
-		if (bpf_op == BPF_DIV)
-			emit(ARM_UDIV(rd, rm, rn), ctx);
-		else {
-			emit(ARM_UDIV(ARM_R3, rm, rn), ctx);
-			emit(ARM_MLS(rd, rn, ARM_R3, rm), ctx);
-		}
-		return;
+/* dst = src */
+static inline void emit_a32_mov_r64(const bool is64, const u8 dst[],
+				  const u8 src[], bool dstk,
+				  bool sstk, struct jit_ctx *ctx) {
+	emit_a32_mov_r(dst_lo, src_lo, dstk, sstk, ctx);
+	if (is64) {
+		/* complete 8 byte move */
+		emit_a32_mov_r(dst_hi, src_hi, dstk, sstk, ctx);
+	} else {
+		/* Zero out high 4 bytes */
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
 	}
-#endif
+}
 
-	/*
-	 * For BPF_ALU | BPF_DIV | BPF_K instructions, rm is ARM_R4
-	 * (r_A) and rn is ARM_R0 (r_scratch) so load rn first into
-	 * ARM_R1 to avoid accidentally overwriting ARM_R0 with rm
-	 * before using it as a source for ARM_R1.
-	 *
-	 * For BPF_ALU | BPF_DIV | BPF_X rm is ARM_R4 (r_A) and rn is
-	 * ARM_R5 (r_X) so there is no particular register overlap
-	 * issues.
-	 */
-	if (rn != ARM_R1)
-		emit(ARM_MOV_R(ARM_R1, rn), ctx);
-	if (rm != ARM_R0)
-		emit(ARM_MOV_R(ARM_R0, rm), ctx);
+/* Shift operations */
+static inline void emit_a32_alu_i(const u8 dst, const u32 val, bool dstk,
+				struct jit_ctx *ctx, const u8 op) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[0] : dst;
+
+	if (dstk)
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+
+	/* Do shift operation */
+	switch (op) {
+	case BPF_LSH:
+		emit(ARM_LSL_I(rd, rd, val), ctx);
+		break;
+	case BPF_RSH:
+		emit(ARM_LSR_I(rd, rd, val), ctx);
+		break;
+	case BPF_NEG:
+		emit(ARM_RSB_I(rd, rd, val), ctx);
+		break;
+	}
+
+	if (dstk)
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+}
+
+/* dst = -dst (64 bit) */
+static inline void emit_a32_neg64(const u8 dst[], bool dstk,
+				struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst[1];
+	u8 rm = dstk ? tmp[0] : dst[0];
+
+	/* Setup Operand */
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do Negate Operation */
+	emit(ARM_RSBS_I(rd, rd, 0), ctx);
+	emit(ARM_RSC_I(rm, rm, 0), ctx);
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
 
+/* dst = dst << src */
+static inline void emit_a32_lsh_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSH operation */
+	emit(ARM_SUB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_RSB_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
 	ctx->seen |= SEEN_CALL;
-	emit_mov_i(ARM_R3, bpf_op == BPF_DIV ? (u32)jit_udiv : (u32)jit_mod,
-		   ctx);
-	emit_blx_r(ARM_R3, ctx);
+	emit(ARM_MOV_SR(ARM_LR, rm, SRTYPE_ASL, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rd, SRTYPE_ASL, ARM_IP), ctx);
+	emit(ARM_ORR_SR(ARM_IP, ARM_LR, rd, SRTYPE_LSR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_ASL, rt), ctx);
+
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
 
-	if (rd != ARM_R0)
-		emit(ARM_MOV_R(rd, ARM_R0), ctx);
+/* dst = dst >> src (signed)*/
+static inline void emit_a32_arsh_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do the ARSH operation */
+	emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
+	_emit(ARM_COND_MI, ARM_B(0), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_ASR, rt), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
 }
 
-static inline void update_on_xread(struct jit_ctx *ctx)
+/* dst = dst >> src */
+static inline void emit_a32_lsr_r64(const u8 dst[], const u8 src[], bool dstk,
+				     bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSR operation */
+	emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_LSR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_LSR, rt), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
+
+/* dst = dst << val */
+static inline void emit_a32_lsh_i64(const u8 dst[], bool dstk,
+				     const u32 val, struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSH operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[0], rm, SRTYPE_ASL, val), ctx);
+		emit(ARM_ORR_SI(rm, tmp2[0], rd, SRTYPE_LSR, 32 - val), ctx);
+		emit(ARM_MOV_SI(rd, rd, SRTYPE_ASL, val), ctx);
+	} else {
+		if (val == 32)
+			emit(ARM_MOV_R(rm, rd), ctx);
+		else
+			emit(ARM_MOV_SI(rm, rd, SRTYPE_ASL, val - 32), ctx);
+		emit(ARM_EOR_R(rd, rd, rd), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+/* dst = dst >> val */
+static inline void emit_a32_lsr_i64(const u8 dst[], bool dstk,
+				    const u32 val, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSR operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
+		emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_LSR, val), ctx);
+	} else if (val == 32) {
+		emit(ARM_MOV_R(rd, rm), ctx);
+		emit(ARM_MOV_I(rm, 0), ctx);
+	} else {
+		emit(ARM_MOV_SI(rd, rm, SRTYPE_LSR, val - 32), ctx);
+		emit(ARM_MOV_I(rm, 0), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+/* dst = dst >> val (signed) */
+static inline void emit_a32_arsh_i64(const u8 dst[], bool dstk,
+				     const u32 val, struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do ARSH operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
+		emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, val), ctx);
+	} else if (val == 32) {
+		emit(ARM_MOV_R(rd, rm), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
+	} else {
+		emit(ARM_MOV_SI(rd, rm, SRTYPE_ASR, val - 32), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+static inline void emit_a32_mul_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands for multiplication */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rn = sstk ? tmp2[0] : src_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+	if (sstk) {
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+		emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_hi)), ctx);
+	}
+
+	/* Do Multiplication */
+	emit(ARM_MUL(ARM_IP, rd, rn), ctx);
+	emit(ARM_MUL(ARM_LR, rm, rt), ctx);
+	/* ARM_LR is used as a scratch register, so it must be saved */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_ADD_R(ARM_LR, ARM_IP, ARM_LR), ctx);
+
+	emit(ARM_UMULL(ARM_IP, rm, rd, rt), ctx);
+	emit(ARM_ADD_R(rm, ARM_LR, rm), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_IP), ctx);
+	}
+}
+
+/* *(size *)(dst + off) = src */
+static inline void emit_str_r(const u8 dst, const u8 src, bool dstk,
+			      const s32 off, struct jit_ctx *ctx, const u8 sz){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst;
+
+	if (dstk)
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+	if (off) {
+		emit_a32_mov_i(tmp[0], off, false, ctx);
+		emit(ARM_ADD_R(tmp[0], rd, tmp[0]), ctx);
+		rd = tmp[0];
+	}
+	switch (sz) {
+	case BPF_W:
+		/* Store a Word */
+		emit(ARM_STR_I(src, rd, 0), ctx);
+		break;
+	case BPF_H:
+		/* Store a HalfWord */
+		emit(ARM_STRH_I(src, rd, 0), ctx);
+		break;
+	case BPF_B:
+		/* Store a Byte */
+		emit(ARM_STRB_I(src, rd, 0), ctx);
+		break;
+	}
+}
+
+/* dst = *(size*)(src + off) */
+static inline void emit_ldx_r(const u8 dst, const u8 src, bool dstk,
+			      const s32 off, struct jit_ctx *ctx, const u8 sz){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst;
+	u8 rm = src;
+
+	if (off) {
+		emit_a32_mov_i(tmp[0], off, false, ctx);
+		emit(ARM_ADD_R(tmp[0], tmp[0], src), ctx);
+		rm = tmp[0];
+	}
+	switch (sz) {
+	case BPF_W:
+		/* Load a Word */
+		emit(ARM_LDR_I(rd, rm, 0), ctx);
+		break;
+	case BPF_H:
+		/* Load a HalfWord */
+		emit(ARM_LDRH_I(rd, rm, 0), ctx);
+		break;
+	case BPF_B:
+		/* Load a Byte */
+		emit(ARM_LDRB_I(rd, rm, 0), ctx);
+		break;
+	}
+	if (dstk)
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+}
+
+/* Arithmetic Operation */
+static inline void emit_ar_r(const u8 rd, const u8 rt, const u8 rm,
+			     const u8 rn, struct jit_ctx *ctx, u8 op) {
+	switch (op) {
+	case BPF_JSET:
+		ctx->seen |= SEEN_CALL;
+		emit(ARM_AND_R(ARM_IP, rt, rn), ctx);
+		emit(ARM_AND_R(ARM_LR, rd, rm), ctx);
+		emit(ARM_ORRS_R(ARM_IP, ARM_LR, ARM_IP), ctx);
+		break;
+	case BPF_JEQ:
+	case BPF_JNE:
+	case BPF_JGT:
+	case BPF_JGE:
+		emit(ARM_CMP_R(rd, rm), ctx);
+		_emit(ARM_COND_EQ, ARM_CMP_R(rt, rn), ctx);
+		break;
+	case BPF_JSGT:
+		emit(ARM_CMP_R(rn, rt), ctx);
+		emit(ARM_SBCS_R(ARM_IP, rm, rd), ctx);
+		break;
+	case BPF_JSGE:
+		emit(ARM_CMP_R(rt, rn), ctx);
+		emit(ARM_SBCS_R(ARM_IP, rd, rm), ctx);
+		break;
+	}
+}
+
+static int out_offset = -1; /* initialized on the first pass of build_body() */
+static int emit_bpf_tail_call(struct jit_ctx *ctx)
+{
+
+	/* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
+	const u8 *r2 = bpf2a32[BPF_REG_2];
+	const u8 *r3 = bpf2a32[BPF_REG_3];
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	const u8 *tcc = bpf2a32[TCALL_CNT];
+	const int idx0 = ctx->idx;
+#define cur_offset (ctx->idx - idx0)
+#define jmp_offset (out_offset - (cur_offset))
+	u32 off, lo, hi;
+
+	/* if (index >= array->map.max_entries)
+	 *	goto out;
+	 */
+	off = offsetof(struct bpf_array, map.max_entries);
+	/* array->map.max_entries */
+	emit_a32_mov_i(tmp[1], off, false, ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
+	/* index (64 bit); only the low word is loaded and compared */
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
+	/* index >= array->map.max_entries */
+	emit(ARM_CMP_R(tmp2[1], tmp[1]), ctx);
+	_emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
+
+	/* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 *	goto out;
+	 * tail_call_cnt++;
+	 */
+	lo = (u32)MAX_TAIL_CALL_CNT;
+	hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
+	emit(ARM_LDR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
+	emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
+	emit(ARM_CMP_I(tmp[0], hi), ctx);
+	_emit(ARM_COND_EQ, ARM_CMP_I(tmp[1], lo), ctx);
+	_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
+	emit(ARM_ADDS_I(tmp[1], tmp[1], 1), ctx);
+	emit(ARM_ADC_I(tmp[0], tmp[0], 0), ctx);
+	emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
+	emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
+
+	/* prog = array->ptrs[index]
+	 * if (prog == NULL)
+	 *	goto out;
+	 */
+	off = offsetof(struct bpf_array, ptrs);
+	emit_a32_mov_i(tmp[1], off, false, ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
+	emit(ARM_MOV_SI(tmp[0], tmp2[1], SRTYPE_ASL, 2), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp[1], tmp[0]), ctx);
+	emit(ARM_CMP_I(tmp[1], 0), ctx);
+	_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+
+	/* goto *(prog->bpf_func + prologue_size); */
+	off = offsetof(struct bpf_prog, bpf_func);
+	emit_a32_mov_i(tmp2[1], off, false, ctx);
+	emit(ARM_LDR_R(tmp[1], tmp[1], tmp2[1]), ctx);
+	emit(ARM_ADD_I(tmp[1], tmp[1], ctx->prologue_bytes), ctx);
+	emit(ARM_BX(tmp[1]), ctx);
+
+	/* out: */
+	if (out_offset == -1)
+		out_offset = cur_offset;
+	if (cur_offset != out_offset) {
+		pr_err_once("tail_call out_offset = %d, expected %d!\n",
+			    cur_offset, out_offset);
+		return -1;
+	}
+	return 0;
+#undef cur_offset
+#undef jmp_offset
+}
+
+/* 0xabcd => 0xcdab */
+static inline void emit_rev16(const u8 rd, const u8 rn, struct jit_ctx *ctx)
 {
-	if (!(ctx->seen & SEEN_X))
-		ctx->flags |= FLAG_NEED_X_RESET;
+#if __LINUX_ARM_ARCH__ < 6
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 8), ctx);
+	emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
+	emit(ARM_ORR_SI(rd, tmp2[0], tmp2[1], SRTYPE_LSL, 8), ctx);
+#else /* ARMv6+ */
+	emit(ARM_REV16(rd, rn), ctx);
+#endif
+}
 
-	ctx->seen |= SEEN_X;
+/* 0xabcdefgh => 0xghefcdab */
+static inline void emit_rev32(const u8 rd, const u8 rn, struct jit_ctx *ctx)
+{
+#if __LINUX_ARM_ARCH__ < 6
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 24), ctx);
+	emit(ARM_ORR_SI(ARM_IP, tmp2[0], tmp2[1], SRTYPE_LSL, 24), ctx);
+
+	emit(ARM_MOV_SI(tmp2[1], rn, SRTYPE_LSR, 8), ctx);
+	emit(ARM_AND_I(tmp2[1], tmp2[1], 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 16), ctx);
+	emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], tmp2[0], SRTYPE_LSL, 8), ctx);
+	emit(ARM_ORR_SI(tmp2[0], tmp2[0], tmp2[1], SRTYPE_LSL, 16), ctx);
+	emit(ARM_ORR_R(rd, ARM_IP, tmp2[0]), ctx);
+
+#else /* ARMv6+ */
+	emit(ARM_REV(rd, rn), ctx);
+#endif
 }
 
-static int build_body(struct jit_ctx *ctx)
+static void build_prologue(struct jit_ctx *ctx)
 {
-	void *load_func[] = {jit_get_skb_b, jit_get_skb_h, jit_get_skb_w};
-	const struct bpf_prog *prog = ctx->skf;
-	const struct sock_filter *inst;
-	unsigned i, load_order, off, condt;
-	int imm12;
-	u32 k;
+	const u8 r0 = bpf2a32[BPF_REG_0][1];
+	const u8 r2 = bpf2a32[BPF_REG_1][1];
+	const u8 r3 = bpf2a32[BPF_REG_1][0];
+	const u8 r4 = bpf2a32[BPF_REG_6][1];
+	const u8 r5 = bpf2a32[BPF_REG_6][0];
+	const u8 r6 = bpf2a32[TMP_REG_1][1];
+	const u8 r7 = bpf2a32[TMP_REG_1][0];
+	const u8 r8 = bpf2a32[TMP_REG_2][1];
+	const u8 r10 = bpf2a32[TMP_REG_2][0];
+	const u8 fplo = bpf2a32[BPF_REG_FP][1];
+	const u8 fphi = bpf2a32[BPF_REG_FP][0];
+	const u8 sp = ARM_SP;
+	const u8 *tcc = bpf2a32[TCALL_CNT];
+
+	u16 reg_set = 0;
 
-	for (i = 0; i < prog->len; i++) {
-		u16 code;
+	/*
+	 * eBPF prog stack layout
+	 *
+	 *                         high
+	 * original ARM_SP =>     +-----+ eBPF prologue
+	 *                        |FP/LR|
+	 * current ARM_FP =>      +-----+
+	 *                        | ... | callee saved registers
+	 * eBPF fp register =>    +-----+ <= (BPF_FP)
+	 *                        | ... | eBPF JIT scratch space
+	 *                        |     | eBPF prog stack
+	 *                        +-----+
+	 *                        |RSVD | JIT scratchpad
+	 * current ARM_SP =>      +-----+ <= (BPF_FP - STACK_SIZE)
+	 *                        |     |
+	 *                        | ... | Function call stack
+	 *                        |     |
+	 *                        +-----+
+	 *                          low
+	 */
 
-		inst = &(prog->insns[i]);
-		/* K as an immediate value operand */
-		k = inst->k;
-		code = bpf_anc_helper(inst);
+	/* Save callee saved registers. */
+	reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
+#ifdef CONFIG_FRAME_POINTER
+	reg_set |= (1<<ARM_FP) | (1<<ARM_IP) | (1<<ARM_LR) | (1<<ARM_PC);
+	emit(ARM_MOV_R(ARM_IP, sp), ctx);
+	emit(ARM_PUSH(reg_set), ctx);
+	emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
+#else
+	/* Check if call instruction exists in BPF body */
+	if (ctx->seen & SEEN_CALL)
+		reg_set |= (1<<ARM_LR);
+	emit(ARM_PUSH(reg_set), ctx);
+#endif
+	/* Save frame pointer for later */
+	emit(ARM_SUB_I(ARM_IP, sp, SCRATCH_SIZE), ctx);
+
+	/* Set up function call stack */
+	emit(ARM_SUB_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
+
+	/* Set up BPF prog stack base register */
+	emit_a32_mov_r(fplo, ARM_IP, true, false, ctx);
+	emit_a32_mov_i(fphi, 0, true, ctx);
+
+	/* mov r4, 0 */
+	emit(ARM_MOV_I(r4, 0), ctx);
+	/* Move bpf_ctx pointer to BPF_R1 */
+	emit(ARM_MOV_R(r3, r4), ctx);
+	emit(ARM_MOV_R(r2, r0), ctx);
+	/* Initialize tail call count */
+	emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[0])), ctx);
+	emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[1])), ctx);
+	/* end of prologue */
+}
 
-		/* compute offsets only in the fake pass */
-		if (ctx->target == NULL)
-			ctx->offsets[i] = ctx->idx * 4;
+static void build_epilogue(struct jit_ctx *ctx)
+{
+	const u8 r4 = bpf2a32[BPF_REG_6][1];
+	const u8 r5 = bpf2a32[BPF_REG_6][0];
+	const u8 r6 = bpf2a32[TMP_REG_1][1];
+	const u8 r7 = bpf2a32[TMP_REG_1][0];
+	const u8 r8 = bpf2a32[TMP_REG_2][1];
+	const u8 r10 = bpf2a32[TMP_REG_2][0];
+	u16 reg_set = 0;
+
+	/* unwind function call stack */
+	emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
+
+	/* restore callee saved registers. */
+	reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
+#ifdef CONFIG_FRAME_POINTER
+	/* the first instruction of the prologue was: mov ip, sp */
+	reg_set |= (1<<ARM_FP) | (1<<ARM_SP) | (1<<ARM_PC);
+	emit(ARM_LDM(ARM_SP, reg_set), ctx);
+#else
+	if (ctx->seen & SEEN_CALL)
+		reg_set |= (1<<ARM_PC);
+	/* Restore callee saved registers. */
+	emit(ARM_POP(reg_set), ctx);
+	/* Return to the caller */
+	if (!(ctx->seen & SEEN_CALL))
+		emit(ARM_BX(ARM_LR), ctx);
+#endif
+}
 
-		switch (code) {
-		case BPF_LD | BPF_IMM:
-			emit_mov_i(r_A, k, ctx);
+/*
+ * Convert an eBPF instruction to a native instruction, i.e.
+ * JIT an eBPF instruction.
+ * Returns :
+ *	0  - Successfully JITed an 8-byte eBPF instruction
+ *	>0 - Successfully JITed a 16-byte eBPF instruction
+ *	<0 - Failed to JIT.
+ */
+static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
+{
+	const u8 code = insn->code;
+	const u8 *dst = bpf2a32[insn->dst_reg];
+	const u8 *src = bpf2a32[insn->src_reg];
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	const s16 off = insn->off;
+	const s32 imm = insn->imm;
+	const int i = insn - ctx->prog->insnsi;
+	const bool is64 = BPF_CLASS(code) == BPF_ALU64;
+	const bool dstk = is_on_stack(insn->dst_reg);
+	const bool sstk = is_on_stack(insn->src_reg);
+	u8 rd, rt, rm, rn;
+	s32 jmp_offset;
+
+#define check_imm(bits, imm) do {				\
+	if ((((imm) > 0) && ((imm) >> (bits))) ||		\
+	    (((imm) < 0) && (~(imm) >> (bits)))) {		\
+		pr_info("[%2d] imm=%d(0x%x) out of range\n",	\
+			i, imm, imm);				\
+		return -EINVAL;					\
+	}							\
+} while (0)
+#define check_imm24(imm) check_imm(24, imm)
+
+	switch (code) {
+	/* ALU operations */
+
+	/* dst = src */
+	case BPF_ALU | BPF_MOV | BPF_K:
+	case BPF_ALU | BPF_MOV | BPF_X:
+	case BPF_ALU64 | BPF_MOV | BPF_K:
+	case BPF_ALU64 | BPF_MOV | BPF_X:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_mov_r64(is64, dst, src, dstk, sstk, ctx);
 			break;
-		case BPF_LD | BPF_W | BPF_LEN:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
-			emit(ARM_LDR_I(r_A, r_skb,
-				       offsetof(struct sk_buff, len)), ctx);
+		case BPF_K:
+			/* Sign-extend immediate value to destination reg */
+			emit_a32_mov_i64(is64, dst, imm, dstk, ctx);
 			break;
-		case BPF_LD | BPF_MEM:
-			/* A = scratch[k] */
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_LDR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
+		}
+		break;
+	/* dst = dst + src/imm */
+	/* dst = dst - src/imm */
+	/* dst = dst | src/imm */
+	/* dst = dst & src/imm */
+	/* dst = dst ^ src/imm */
+	/* dst = dst * src/imm */
+	/* dst = dst << src */
+	/* dst = dst >> src */
+	case BPF_ALU | BPF_ADD | BPF_K:
+	case BPF_ALU | BPF_ADD | BPF_X:
+	case BPF_ALU | BPF_SUB | BPF_K:
+	case BPF_ALU | BPF_SUB | BPF_X:
+	case BPF_ALU | BPF_OR | BPF_K:
+	case BPF_ALU | BPF_OR | BPF_X:
+	case BPF_ALU | BPF_AND | BPF_K:
+	case BPF_ALU | BPF_AND | BPF_X:
+	case BPF_ALU | BPF_XOR | BPF_K:
+	case BPF_ALU | BPF_XOR | BPF_X:
+	case BPF_ALU | BPF_MUL | BPF_K:
+	case BPF_ALU | BPF_MUL | BPF_X:
+	case BPF_ALU | BPF_LSH | BPF_X:
+	case BPF_ALU | BPF_RSH | BPF_X:
+	case BPF_ALU | BPF_ARSH | BPF_K:
+	case BPF_ALU | BPF_ARSH | BPF_X:
+	case BPF_ALU64 | BPF_ADD | BPF_K:
+	case BPF_ALU64 | BPF_ADD | BPF_X:
+	case BPF_ALU64 | BPF_SUB | BPF_K:
+	case BPF_ALU64 | BPF_SUB | BPF_X:
+	case BPF_ALU64 | BPF_OR | BPF_K:
+	case BPF_ALU64 | BPF_OR | BPF_X:
+	case BPF_ALU64 | BPF_AND | BPF_K:
+	case BPF_ALU64 | BPF_AND | BPF_X:
+	case BPF_ALU64 | BPF_XOR | BPF_K:
+	case BPF_ALU64 | BPF_XOR | BPF_X:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_alu_r64(is64, dst, src, dstk, sstk,
+					 ctx, BPF_OP(code));
 			break;
-		case BPF_LD | BPF_W | BPF_ABS:
-			load_order = 2;
-			goto load;
-		case BPF_LD | BPF_H | BPF_ABS:
-			load_order = 1;
-			goto load;
-		case BPF_LD | BPF_B | BPF_ABS:
-			load_order = 0;
-load:
-			emit_mov_i(r_off, k, ctx);
-load_common:
-			ctx->seen |= SEEN_DATA | SEEN_CALL;
-
-			if (load_order > 0) {
-				emit(ARM_SUB_I(r_scratch, r_skb_hl,
-					       1 << load_order), ctx);
-				emit(ARM_CMP_R(r_scratch, r_off), ctx);
-				condt = ARM_COND_GE;
-			} else {
-				emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
-				condt = ARM_COND_HI;
-			}
-
-			/*
-			 * test for negative offset, only if we are
-			 * currently scheduled to take the fast
-			 * path. this will update the flags so that
-			 * the slowpath instruction are ignored if the
-			 * offset is negative.
-			 *
-			 * for loard_order == 0 the HI condition will
-			 * make loads at offset 0 take the slow path too.
+		case BPF_K:
+			/* Move the immediate value into the temporary
+			 * register and then do the ALU operation on the
+			 * temporary register. The move sign-extends the
+			 * immediate value, so it is then safe to operate
+			 * on it.
 			 */
-			_emit(condt, ARM_CMP_I(r_off, 0), ctx);
-
-			_emit(condt, ARM_ADD_R(r_scratch, r_off, r_skb_data),
-			      ctx);
-
-			if (load_order == 0)
-				_emit(condt, ARM_LDRB_I(r_A, r_scratch, 0),
-				      ctx);
-			else if (load_order == 1)
-				emit_load_be16(condt, r_A, r_scratch, ctx);
-			else if (load_order == 2)
-				emit_load_be32(condt, r_A, r_scratch, ctx);
-
-			_emit(condt, ARM_B(b_imm(i + 1, ctx)), ctx);
-
-			/* the slowpath */
-			emit_mov_i(ARM_R3, (u32)load_func[load_order], ctx);
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			/* the offset is already in R1 */
-			emit_blx_r(ARM_R3, ctx);
-			/* check the result of skb_copy_bits */
-			emit(ARM_CMP_I(ARM_R1, 0), ctx);
-			emit_err_ret(ARM_COND_NE, ctx);
-			emit(ARM_MOV_R(r_A, ARM_R0), ctx);
+			emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
+			emit_a32_alu_r64(is64, dst, tmp2, dstk, false,
+					 ctx, BPF_OP(code));
 			break;
-		case BPF_LD | BPF_W | BPF_IND:
-			load_order = 2;
-			goto load_ind;
-		case BPF_LD | BPF_H | BPF_IND:
-			load_order = 1;
-			goto load_ind;
-		case BPF_LD | BPF_B | BPF_IND:
-			load_order = 0;
-load_ind:
-			update_on_xread(ctx);
-			OP_IMM3(ARM_ADD, r_off, r_X, k, ctx);
-			goto load_common;
-		case BPF_LDX | BPF_IMM:
-			ctx->seen |= SEEN_X;
-			emit_mov_i(r_X, k, ctx);
+		}
+		break;
+	/* dst = dst / src(imm) */
+	/* dst = dst % src(imm) */
+	case BPF_ALU | BPF_DIV | BPF_K:
+	case BPF_ALU | BPF_DIV | BPF_X:
+	case BPF_ALU | BPF_MOD | BPF_K:
+	case BPF_ALU | BPF_MOD | BPF_X:
+		rt = src_lo;
+		rd = dstk ? tmp2[1] : dst_lo;
+		if (dstk)
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			rt = sstk ? tmp2[0] : rt;
+			if (sstk)
+				emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)),
+				     ctx);
 			break;
-		case BPF_LDX | BPF_W | BPF_LEN:
-			ctx->seen |= SEEN_X | SEEN_SKB;
-			emit(ARM_LDR_I(r_X, r_skb,
-				       offsetof(struct sk_buff, len)), ctx);
+		case BPF_K:
+			rt = tmp2[0];
+			emit_a32_mov_i(rt, imm, false, ctx);
 			break;
-		case BPF_LDX | BPF_MEM:
-			ctx->seen |= SEEN_X | SEEN_MEM_WORD(k);
-			emit(ARM_LDR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
+		}
+		emit_udivmod(rd, rd, rt, ctx, BPF_OP(code));
+		if (dstk)
+			emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	case BPF_ALU64 | BPF_DIV | BPF_K:
+	case BPF_ALU64 | BPF_DIV | BPF_X:
+	case BPF_ALU64 | BPF_MOD | BPF_K:
+	case BPF_ALU64 | BPF_MOD | BPF_X:
+		goto notyet;
+	/* dst = dst >> imm */
+	/* dst = dst << imm */
+	case BPF_ALU | BPF_RSH | BPF_K:
+	case BPF_ALU | BPF_LSH | BPF_K:
+		if (unlikely(imm > 31))
+			return -EINVAL;
+		if (imm)
+			emit_a32_alu_i(dst_lo, imm, dstk, ctx, BPF_OP(code));
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	/* dst = dst << imm */
+	case BPF_ALU64 | BPF_LSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_lsh_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = dst >> imm */
+	case BPF_ALU64 | BPF_RSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_lsr_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = dst << src */
+	case BPF_ALU64 | BPF_LSH | BPF_X:
+		emit_a32_lsh_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> src */
+	case BPF_ALU64 | BPF_RSH | BPF_X:
+		emit_a32_lsr_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> src (signed) */
+	case BPF_ALU64 | BPF_ARSH | BPF_X:
+		emit_a32_arsh_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> imm (signed) */
+	case BPF_ALU64 | BPF_ARSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_arsh_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = -dst */
+	case BPF_ALU | BPF_NEG:
+		emit_a32_alu_i(dst_lo, 0, dstk, ctx, BPF_OP(code));
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	/* dst = -dst (64 bit) */
+	case BPF_ALU64 | BPF_NEG:
+		emit_a32_neg64(dst, dstk, ctx);
+		break;
+	/* dst = dst * src/imm */
+	case BPF_ALU64 | BPF_MUL | BPF_X:
+	case BPF_ALU64 | BPF_MUL | BPF_K:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_mul_r64(dst, src, dstk, sstk, ctx);
 			break;
-		case BPF_LDX | BPF_B | BPF_MSH:
-			/* x = ((*(frame + k)) & 0xf) << 2; */
-			ctx->seen |= SEEN_X | SEEN_DATA | SEEN_CALL;
-			/* the interpreter should deal with the negative K */
-			if ((int)k < 0)
-				return -1;
-			/* offset in r1: we might have to take the slow path */
-			emit_mov_i(r_off, k, ctx);
-			emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
-
-			/* load in r0: common with the slowpath */
-			_emit(ARM_COND_HI, ARM_LDRB_R(ARM_R0, r_skb_data,
-						      ARM_R1), ctx);
-			/*
-			 * emit_mov_i() might generate one or two instructions,
-			 * the same holds for emit_blx_r()
+		case BPF_K:
+			/* Move the immediate value into the temporary
+			 * register and then do the multiplication on the
+			 * temporary register. The move sign-extends the
+			 * immediate value, so it is then safe to operate
+			 * on it.
 			 */
-			_emit(ARM_COND_HI, ARM_B(b_imm(i + 1, ctx) - 2), ctx);
-
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			/* r_off is r1 */
-			emit_mov_i(ARM_R3, (u32)jit_get_skb_b, ctx);
-			emit_blx_r(ARM_R3, ctx);
-			/* check the return value of skb_copy_bits */
-			emit(ARM_CMP_I(ARM_R1, 0), ctx);
-			emit_err_ret(ARM_COND_NE, ctx);
-
-			emit(ARM_AND_I(r_X, ARM_R0, 0x00f), ctx);
-			emit(ARM_LSL_I(r_X, r_X, 2), ctx);
-			break;
-		case BPF_ST:
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_STR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
-			break;
-		case BPF_STX:
-			update_on_xread(ctx);
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_STR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
-			break;
-		case BPF_ALU | BPF_ADD | BPF_K:
-			/* A += K */
-			OP_IMM3(ARM_ADD, r_A, r_A, k, ctx);
-			break;
-		case BPF_ALU | BPF_ADD | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_ADD_R(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_SUB | BPF_K:
-			/* A -= K */
-			OP_IMM3(ARM_SUB, r_A, r_A, k, ctx);
-			break;
-		case BPF_ALU | BPF_SUB | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_SUB_R(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_MUL | BPF_K:
-			/* A *= K */
-			emit_mov_i(r_scratch, k, ctx);
-			emit(ARM_MUL(r_A, r_A, r_scratch), ctx);
-			break;
-		case BPF_ALU | BPF_MUL | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_MUL(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_DIV | BPF_K:
-			if (k == 1)
-				break;
-			emit_mov_i(r_scratch, k, ctx);
-			emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_DIV);
-			break;
-		case BPF_ALU | BPF_DIV | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_CMP_I(r_X, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-			emit_udivmod(r_A, r_A, r_X, ctx, BPF_DIV);
-			break;
-		case BPF_ALU | BPF_MOD | BPF_K:
-			if (k == 1) {
-				emit_mov_i(r_A, 0, ctx);
-				break;
-			}
-			emit_mov_i(r_scratch, k, ctx);
-			emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_MOD);
+			emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
+			emit_a32_mul_r64(dst, tmp2, dstk, false, ctx);
 			break;
-		case BPF_ALU | BPF_MOD | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_CMP_I(r_X, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-			emit_udivmod(r_A, r_A, r_X, ctx, BPF_MOD);
-			break;
-		case BPF_ALU | BPF_OR | BPF_K:
-			/* A |= K */
-			OP_IMM3(ARM_ORR, r_A, r_A, k, ctx);
+		}
+		break;
+	/* dst = htole(dst) */
+	/* dst = htobe(dst) */
+	case BPF_ALU | BPF_END | BPF_FROM_LE:
+	case BPF_ALU | BPF_END | BPF_FROM_BE:
+		rd = dstk ? tmp[0] : dst_hi;
+		rt = dstk ? tmp[1] : dst_lo;
+		if (dstk) {
+			emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+		if (BPF_SRC(code) == BPF_FROM_LE)
+			goto emit_bswap_uxt;
+		switch (imm) {
+		case 16:
+			emit_rev16(rt, rt, ctx);
+			goto emit_bswap_uxt;
+		case 32:
+			emit_rev32(rt, rt, ctx);
+			goto emit_bswap_uxt;
+		case 64:
+			/* Because of the usage of ARM_LR */
+			ctx->seen |= SEEN_CALL;
+			emit_rev32(ARM_LR, rt, ctx);
+			emit_rev32(rt, rd, ctx);
+			emit(ARM_MOV_R(rd, ARM_LR), ctx);
 			break;
-		case BPF_ALU | BPF_OR | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_ORR_R(r_A, r_A, r_X), ctx);
+		}
+		goto exit;
+emit_bswap_uxt:
+		switch (imm) {
+		case 16:
+			/* zero-extend 16 bits into 64 bits */
+#if __LINUX_ARM_ARCH__ < 6
+			emit_a32_mov_i(tmp2[1], 0xffff, false, ctx);
+			emit(ARM_AND_R(rt, rt, tmp2[1]), ctx);
+#else /* ARMv6+ */
+			emit(ARM_UXTH(rt, rt), ctx);
+#endif
+			emit(ARM_EOR_R(rd, rd, rd), ctx);
 			break;
-		case BPF_ALU | BPF_XOR | BPF_K:
-			/* A ^= K; */
-			OP_IMM3(ARM_EOR, r_A, r_A, k, ctx);
+		case 32:
+			/* zero-extend 32 bits into 64 bits */
+			emit(ARM_EOR_R(rd, rd, rd), ctx);
 			break;
-		case BPF_ANC | SKF_AD_ALU_XOR_X:
-		case BPF_ALU | BPF_XOR | BPF_X:
-			/* A ^= X */
-			update_on_xread(ctx);
-			emit(ARM_EOR_R(r_A, r_A, r_X), ctx);
+		case 64:
+			/* nop */
 			break;
-		case BPF_ALU | BPF_AND | BPF_K:
-			/* A &= K */
-			OP_IMM3(ARM_AND, r_A, r_A, k, ctx);
+		}
+exit:
+		if (dstk) {
+			emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+		break;
+	/* dst = imm64 */
+	case BPF_LD | BPF_IMM | BPF_DW:
+	{
+		const struct bpf_insn insn1 = insn[1];
+		u32 hi, lo = imm;
+
+		if (insn1.code != 0 || insn1.src_reg != 0 ||
+		    insn1.dst_reg != 0 || insn1.off != 0) {
+			/* Note: verifier in BPF core must catch invalid
+			 * instruction.
+			 */
+			pr_err_once("Invalid BPF_LD_IMM64 instruction\n");
+			return -EINVAL;
+		}
+		hi = insn1.imm;
+		emit_a32_mov_i(dst_lo, lo, dstk, ctx);
+		emit_a32_mov_i(dst_hi, hi, dstk, ctx);
+
+		return 1;
+	}
+	/* LDX: dst = *(size *)(src + off) */
+	case BPF_LDX | BPF_MEM | BPF_W:
+	case BPF_LDX | BPF_MEM | BPF_H:
+	case BPF_LDX | BPF_MEM | BPF_B:
+	case BPF_LDX | BPF_MEM | BPF_DW:
+		rn = sstk ? tmp2[1] : src_lo;
+		if (sstk)
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			/* Load a Word */
+		case BPF_H:
+			/* Load a Half-Word */
+		case BPF_B:
+			/* Load a Byte */
+			emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_SIZE(code));
+			emit_a32_mov_i(dst_hi, 0, dstk, ctx);
 			break;
-		case BPF_ALU | BPF_AND | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_AND_R(r_A, r_A, r_X), ctx);
+		case BPF_DW:
+			/* Load a double word */
+			emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_W);
+			emit_ldx_r(dst_hi, rn, dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_ALU | BPF_LSH | BPF_K:
-			if (unlikely(k > 31))
-				return -1;
-			emit(ARM_LSL_I(r_A, r_A, k), ctx);
+		}
+		break;
+	/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
+	case BPF_LD | BPF_ABS | BPF_W:
+	case BPF_LD | BPF_ABS | BPF_H:
+	case BPF_LD | BPF_ABS | BPF_B:
+	/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
+	case BPF_LD | BPF_IND | BPF_W:
+	case BPF_LD | BPF_IND | BPF_H:
+	case BPF_LD | BPF_IND | BPF_B:
+	{
+		const u8 r4 = bpf2a32[BPF_REG_6][1]; /* r4 = ptr to sk_buff */
+		const u8 r0 = bpf2a32[BPF_REG_0][1]; /* r0: struct sk_buff *skb */
+						     /* rtn value */
+		const u8 r1 = bpf2a32[BPF_REG_0][0]; /* r1: int k */
+		const u8 r2 = bpf2a32[BPF_REG_1][1]; /* r2: unsigned int size */
+		const u8 r3 = bpf2a32[BPF_REG_1][0]; /* r3: void *buffer */
+		const u8 r6 = bpf2a32[TMP_REG_1][1]; /* r6: void *(*func)(..) */
+		int size;
+
+		/* Setting up first argument */
+		emit(ARM_MOV_R(r0, r4), ctx);
+
+		/* Setting up second argument */
+		emit_a32_mov_i(r1, imm, false, ctx);
+		if (BPF_MODE(code) == BPF_IND)
+			emit_a32_alu_r(r1, src_lo, false, sstk, ctx,
+				       false, false, BPF_ADD);
+
+		/* Setting up third argument */
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			size = 4;
 			break;
-		case BPF_ALU | BPF_LSH | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_LSL_R(r_A, r_A, r_X), ctx);
+		case BPF_H:
+			size = 2;
 			break;
-		case BPF_ALU | BPF_RSH | BPF_K:
-			if (unlikely(k > 31))
-				return -1;
-			if (k)
-				emit(ARM_LSR_I(r_A, r_A, k), ctx);
+		case BPF_B:
+			size = 1;
 			break;
-		case BPF_ALU | BPF_RSH | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_LSR_R(r_A, r_A, r_X), ctx);
+		default:
+			return -EINVAL;
+		}
+		emit_a32_mov_i(r2, size, false, ctx);
+
+		/* Setting up fourth argument */
+		emit(ARM_ADD_I(r3, ARM_SP, imm8m(SKB_BUFFER)), ctx);
+
+		/* Setting up function pointer to call */
+		emit_a32_mov_i(r6, (unsigned int)bpf_load_pointer, false, ctx);
+		emit_blx_r(r6, ctx);
+
+		emit(ARM_EOR_R(r1, r1, r1), ctx);
+		/* Check whether the returned address is NULL.
+		 * If it is NULL, jump to the epilogue;
+		 * otherwise load the value from the returned address.
+		 */
+		emit(ARM_CMP_I(r0, 0), ctx);
+		jmp_offset = epilogue_offset(ctx);
+		check_imm24(jmp_offset);
+		_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+
+		/* Load value from the address */
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			emit(ARM_LDR_I(r0, r0, 0), ctx);
+			emit_rev32(r0, r0, ctx);
 			break;
-		case BPF_ALU | BPF_NEG:
-			/* A = -A */
-			emit(ARM_RSB_I(r_A, r_A, 0), ctx);
+		case BPF_H:
+			emit(ARM_LDRH_I(r0, r0, 0), ctx);
+			emit_rev16(r0, r0, ctx);
 			break;
-		case BPF_JMP | BPF_JA:
-			/* pc += K */
-			emit(ARM_B(b_imm(i + k + 1, ctx)), ctx);
+		case BPF_B:
+			emit(ARM_LDRB_I(r0, r0, 0), ctx);
+			/* No need to reverse */
 			break;
-		case BPF_JMP | BPF_JEQ | BPF_K:
-			/* pc += (A == K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_EQ;
-			goto cmp_imm;
-		case BPF_JMP | BPF_JGT | BPF_K:
-			/* pc += (A > K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_HI;
-			goto cmp_imm;
-		case BPF_JMP | BPF_JGE | BPF_K:
-			/* pc += (A >= K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_HS;
-cmp_imm:
-			imm12 = imm8m(k);
-			if (imm12 < 0) {
-				emit_mov_i_no8m(r_scratch, k, ctx);
-				emit(ARM_CMP_R(r_A, r_scratch), ctx);
-			} else {
-				emit(ARM_CMP_I(r_A, imm12), ctx);
-			}
-cond_jump:
-			if (inst->jt)
-				_emit(condt, ARM_B(b_imm(i + inst->jt + 1,
-						   ctx)), ctx);
-			if (inst->jf)
-				_emit(condt ^ 1, ARM_B(b_imm(i + inst->jf + 1,
-							     ctx)), ctx);
+		}
+		break;
+	}
+	/* ST: *(size *)(dst + off) = imm */
+	case BPF_ST | BPF_MEM | BPF_W:
+	case BPF_ST | BPF_MEM | BPF_H:
+	case BPF_ST | BPF_MEM | BPF_B:
+	case BPF_ST | BPF_MEM | BPF_DW:
+		switch (BPF_SIZE(code)) {
+		case BPF_DW:
+			/* Sign-extend immediate value into temp reg */
+			emit_a32_mov_i64(true, tmp2, imm, false, ctx);
+			emit_str_r(dst_lo, tmp2[1], dstk, off, ctx, BPF_W);
+			emit_str_r(dst_lo, tmp2[0], dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_JMP | BPF_JEQ | BPF_X:
-			/* pc += (A == X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_EQ;
-			goto cmp_x;
-		case BPF_JMP | BPF_JGT | BPF_X:
-			/* pc += (A > X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_HI;
-			goto cmp_x;
-		case BPF_JMP | BPF_JGE | BPF_X:
-			/* pc += (A >= X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_CS;
-cmp_x:
-			update_on_xread(ctx);
-			emit(ARM_CMP_R(r_A, r_X), ctx);
-			goto cond_jump;
-		case BPF_JMP | BPF_JSET | BPF_K:
-			/* pc += (A & K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_NE;
-			/* not set iff all zeroes iff Z==1 iff EQ */
-
-			imm12 = imm8m(k);
-			if (imm12 < 0) {
-				emit_mov_i_no8m(r_scratch, k, ctx);
-				emit(ARM_TST_R(r_A, r_scratch), ctx);
-			} else {
-				emit(ARM_TST_I(r_A, imm12), ctx);
-			}
-			goto cond_jump;
-		case BPF_JMP | BPF_JSET | BPF_X:
-			/* pc += (A & X) ? pc->jt : pc->jf */
-			update_on_xread(ctx);
-			condt  = ARM_COND_NE;
-			emit(ARM_TST_R(r_A, r_X), ctx);
-			goto cond_jump;
-		case BPF_RET | BPF_A:
-			emit(ARM_MOV_R(ARM_R0, r_A), ctx);
-			goto b_epilogue;
-		case BPF_RET | BPF_K:
-			if ((k == 0) && (ctx->ret0_fp_idx < 0))
-				ctx->ret0_fp_idx = i;
-			emit_mov_i(ARM_R0, k, ctx);
-b_epilogue:
-			if (i != ctx->skf->len - 1)
-				emit(ARM_B(b_imm(prog->len, ctx)), ctx);
+		case BPF_W:
+		case BPF_H:
+		case BPF_B:
+			emit_a32_mov_i(tmp2[1], imm, false, ctx);
+			emit_str_r(dst_lo, tmp2[1], dstk, off, ctx,
+				   BPF_SIZE(code));
 			break;
-		case BPF_MISC | BPF_TAX:
-			/* X = A */
-			ctx->seen |= SEEN_X;
-			emit(ARM_MOV_R(r_X, r_A), ctx);
+		}
+		break;
+	/* STX XADD: lock *(u32 *)(dst + off) += src */
+	case BPF_STX | BPF_XADD | BPF_W:
+	/* STX XADD: lock *(u64 *)(dst + off) += src */
+	case BPF_STX | BPF_XADD | BPF_DW:
+		goto notyet;
+	/* STX: *(size *)(dst + off) = src */
+	case BPF_STX | BPF_MEM | BPF_W:
+	case BPF_STX | BPF_MEM | BPF_H:
+	case BPF_STX | BPF_MEM | BPF_B:
+	case BPF_STX | BPF_MEM | BPF_DW:
+	{
+		u8 sz = BPF_SIZE(code);
+
+		rn = sstk ? tmp2[1] : src_lo;
+		rm = sstk ? tmp2[0] : src_hi;
+		if (!sstk)
+			goto do_store;
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+		case BPF_H:
+			emit(ARM_LDRH_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+		case BPF_B:
+			emit(ARM_LDRB_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+empty_hi:
+			emit(ARM_EOR_R(rm, rm, rm), ctx);
+		case BPF_DW:
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
+			sz = BPF_W;
 			break;
-		case BPF_MISC | BPF_TXA:
-			/* A = X */
-			update_on_xread(ctx);
-			emit(ARM_MOV_R(r_A, r_X), ctx);
+		}
+
+do_store:
+		/* Clear higher word except for BPF_DW */
+		if (BPF_SIZE(code) != BPF_DW)
+			emit(ARM_EOR_R(rm, rm, rm), ctx);
+
+		/* Store the value */
+		emit_str_r(dst_lo, rn, dstk, off, ctx, sz);
+		emit_str_r(dst_lo, rm, dstk, off+4, ctx, BPF_W);
+		break;
+	}
+	/* PC += off if dst == src */
+	/* PC += off if dst > src */
+	/* PC += off if dst >= src */
+	/* PC += off if dst != src */
+	/* PC += off if dst > src (signed) */
+	/* PC += off if dst >= src (signed) */
+	/* PC += off if dst & src */
+	case BPF_JMP | BPF_JEQ | BPF_X:
+	case BPF_JMP | BPF_JGT | BPF_X:
+	case BPF_JMP | BPF_JGE | BPF_X:
+	case BPF_JMP | BPF_JNE | BPF_X:
+	case BPF_JMP | BPF_JSGT | BPF_X:
+	case BPF_JMP | BPF_JSGE | BPF_X:
+	case BPF_JMP | BPF_JSET | BPF_X:
+		/* Setup source registers */
+		rm = sstk ? tmp2[0] : src_hi;
+		rn = sstk ? tmp2[1] : src_lo;
+		if (sstk) {
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
+		}
+		goto go_jmp;
+	/* PC += off if dst == imm */
+	/* PC += off if dst > imm */
+	/* PC += off if dst >= imm */
+	/* PC += off if dst != imm */
+	/* PC += off if dst > imm (signed) */
+	/* PC += off if dst >= imm (signed) */
+	/* PC += off if dst & imm */
+	case BPF_JMP | BPF_JEQ | BPF_K:
+	case BPF_JMP | BPF_JGT | BPF_K:
+	case BPF_JMP | BPF_JGE | BPF_K:
+	case BPF_JMP | BPF_JNE | BPF_K:
+	case BPF_JMP | BPF_JSGT | BPF_K:
+	case BPF_JMP | BPF_JSGE | BPF_K:
+	case BPF_JMP | BPF_JSET | BPF_K:
+		if (off == 0)
 			break;
-		case BPF_ANC | SKF_AD_PROTOCOL:
-			/* A = ntohs(skb->protocol) */
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  protocol) != 2);
-			off = offsetof(struct sk_buff, protocol);
-			emit(ARM_LDRH_I(r_scratch, r_skb, off), ctx);
-			emit_swap16(r_A, r_scratch, ctx);
+		rm = tmp2[0];
+		rn = tmp2[1];
+		/* Sign-extend immediate value */
+		emit_a32_mov_i64(true, tmp2, imm, false, ctx);
+go_jmp:
+		/* Setup destination register */
+		rd = dstk ? tmp[0] : dst_hi;
+		rt = dstk ? tmp[1] : dst_lo;
+		if (dstk) {
+			emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+
+		/* Check for the condition */
+		emit_ar_r(rd, rt, rm, rn, ctx, BPF_OP(code));
+
+		/* Setup JUMP instruction */
+		jmp_offset = bpf2a32_offset(i+off, i, ctx);
+		switch (BPF_OP(code)) {
+		case BPF_JNE:
+		case BPF_JSET:
+			_emit(ARM_COND_NE, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_CPU:
-			/* r_scratch = current_thread_info() */
-			OP_IMM3(ARM_BIC, r_scratch, ARM_SP, THREAD_SIZE - 1, ctx);
-			/* A = current_thread_info()->cpu */
-			BUILD_BUG_ON(FIELD_SIZEOF(struct thread_info, cpu) != 4);
-			off = offsetof(struct thread_info, cpu);
-			emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
+		case BPF_JEQ:
+			_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_IFINDEX:
-		case BPF_ANC | SKF_AD_HATYPE:
-			/* A = skb->dev->ifindex */
-			/* A = skb->dev->type */
-			ctx->seen |= SEEN_SKB;
-			off = offsetof(struct sk_buff, dev);
-			emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
-
-			emit(ARM_CMP_I(r_scratch, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-
-			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
-						  ifindex) != 4);
-			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
-						  type) != 2);
-
-			if (code == (BPF_ANC | SKF_AD_IFINDEX)) {
-				off = offsetof(struct net_device, ifindex);
-				emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
-			} else {
-				/*
-				 * offset of field "type" in "struct
-				 * net_device" is above what can be
-				 * used in the ldrh rd, [rn, #imm]
-				 * instruction, so load the offset in
-				 * a register and use ldrh rd, [rn, rm]
-				 */
-				off = offsetof(struct net_device, type);
-				emit_mov_i(ARM_R3, off, ctx);
-				emit(ARM_LDRH_R(r_A, r_scratch, ARM_R3), ctx);
-			}
+		case BPF_JGT:
+			_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_MARK:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
-			off = offsetof(struct sk_buff, mark);
-			emit(ARM_LDR_I(r_A, r_skb, off), ctx);
+		case BPF_JGE:
+			_emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_RXHASH:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, hash) != 4);
-			off = offsetof(struct sk_buff, hash);
-			emit(ARM_LDR_I(r_A, r_skb, off), ctx);
+		case BPF_JSGT:
+			_emit(ARM_COND_LT, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_VLAN_TAG:
-		case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2);
-			off = offsetof(struct sk_buff, vlan_tci);
-			emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
-			if (code == (BPF_ANC | SKF_AD_VLAN_TAG))
-				OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx);
-			else {
-				OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx);
-				OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
-			}
+		case BPF_JSGE:
+			_emit(ARM_COND_GE, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_PKTTYPE:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  __pkt_type_offset[0]) != 1);
-			off = PKT_TYPE_OFFSET();
-			emit(ARM_LDRB_I(r_A, r_skb, off), ctx);
-			emit(ARM_AND_I(r_A, r_A, PKT_TYPE_MAX), ctx);
-#ifdef __BIG_ENDIAN_BITFIELD
-			emit(ARM_LSR_I(r_A, r_A, 5), ctx);
-#endif
+		}
+		break;
+	/* JMP OFF */
+	case BPF_JMP | BPF_JA:
+	{
+		if (off == 0)
 			break;
-		case BPF_ANC | SKF_AD_QUEUE:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  queue_mapping) != 2);
-			BUILD_BUG_ON(offsetof(struct sk_buff,
-					      queue_mapping) > 0xff);
-			off = offsetof(struct sk_buff, queue_mapping);
-			emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
+		jmp_offset = bpf2a32_offset(i+off, i, ctx);
+		check_imm24(jmp_offset);
+		emit(ARM_B(jmp_offset), ctx);
+		break;
+	}
+	/* tail call */
+	case BPF_JMP | BPF_CALL | BPF_X:
+		if (emit_bpf_tail_call(ctx))
+			return -EFAULT;
+		break;
+	/* function call */
+	case BPF_JMP | BPF_CALL:
+		goto notyet;
+	/* function return */
+	case BPF_JMP | BPF_EXIT:
+		/* Optimization: when last instruction is EXIT
+		 * simply fallthrough to epilogue.
+		 */
+		if (i == ctx->prog->len - 1)
 			break;
-		case BPF_ANC | SKF_AD_PAY_OFFSET:
-			ctx->seen |= SEEN_SKB | SEEN_CALL;
+		jmp_offset = epilogue_offset(ctx);
+		check_imm24(jmp_offset);
+		emit(ARM_B(jmp_offset), ctx);
+		break;
+notyet:
+		pr_info_once("*** NOT YET: opcode %02x ***\n", code);
+		return -EFAULT;
+	default:
+		pr_err_once("unknown opcode %02x\n", code);
+		return -EINVAL;
+	}
 
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			emit_mov_i(ARM_R3, (unsigned int)skb_get_poff, ctx);
-			emit_blx_r(ARM_R3, ctx);
-			emit(ARM_MOV_R(r_A, ARM_R0), ctx);
-			break;
-		case BPF_LDX | BPF_W | BPF_ABS:
-			/*
-			 * load a 32bit word from struct seccomp_data.
-			 * seccomp_check_filter() will already have checked
-			 * that k is 32bit aligned and lies within the
-			 * struct seccomp_data.
-			 */
-			ctx->seen |= SEEN_SKB;
-			emit(ARM_LDR_I(r_A, r_skb, k), ctx);
-			break;
-		default:
-			return -1;
+	if (ctx->flags & FLAG_IMM_OVERFLOW)
+		/*
+		 * this instruction generated an overflow when
+		 * trying to access the literal pool, so
+		 * delegate this filter to the kernel interpreter.
+		 */
+		return -1;
+	return 0;
+}
+
+static int build_body(struct jit_ctx *ctx)
+{
+	const struct bpf_prog *prog = ctx->prog;
+	unsigned int i;
+
+	for (i = 0; i < prog->len; i++) {
+		const struct bpf_insn *insn = &(prog->insnsi[i]);
+		int ret;
+
+		ret = build_insn(insn, ctx);
+
+		/* ret > 0: 64-bit immediate load, which spans two eBPF insns. */
+		if (ret > 0) {
+			i++;
+			if (ctx->target == NULL)
+				ctx->offsets[i] = ctx->idx;
+			continue;
 		}
 
-		if (ctx->flags & FLAG_IMM_OVERFLOW)
-			/*
-			 * this instruction generated an overflow when
-			 * trying to access the literal pool, so
-			 * delegate this filter to the kernel interpreter.
-			 */
-			return -1;
+		if (ctx->target == NULL)
+			ctx->offsets[i] = ctx->idx;
+
+		/* If unsuccessful, return the error code */
+		if (ret)
+			return ret;
 	}
+	return 0;
+}
 
-	/* compute offsets only during the first pass */
-	if (ctx->target == NULL)
-		ctx->offsets[i] = ctx->idx * 4;
+static int validate_code(struct jit_ctx *ctx)
+{
+	int i;
+
+	for (i = 0; i < ctx->idx; i++) {
+		u32 a32_insn = le32_to_cpu(ctx->target[i]);
+
+		if (a32_insn == ARM_INST_UDF)
+			return -1;
+	}
 
 	return 0;
 }
 
+void bpf_jit_compile(struct bpf_prog *prog)
+{
+	/* Nothing to do here. We support Internal BPF. */
+}
 
-void bpf_jit_compile(struct bpf_prog *fp)
+struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
+#ifdef __LITTLE_ENDIAN
+	struct bpf_prog *tmp, *orig_prog = prog;
 	struct bpf_binary_header *header;
+	bool tmp_blinded = false;
 	struct jit_ctx ctx;
-	unsigned tmp_idx;
-	unsigned alloc_size;
-	u8 *target_ptr;
+	unsigned int tmp_idx;
+	unsigned int image_size;
+	u8 *image_ptr;
 
+	/* If BPF JIT was not enabled then we must fall back to
+	 * the interpreter.
+	 */
 	if (!bpf_jit_enable)
-		return;
+		return orig_prog;
 
-	memset(&ctx, 0, sizeof(ctx));
-	ctx.skf		= fp;
-	ctx.ret0_fp_idx = -1;
+	/* If constant blinding was enabled and we failed during blinding
+	 * then we must fall back to the interpreter. Otherwise, we save
+	 * the new JITed code.
+	 */
+	tmp = bpf_jit_blind_constants(prog);
 
-	ctx.offsets = kzalloc(4 * (ctx.skf->len + 1), GFP_KERNEL);
-	if (ctx.offsets == NULL)
-		return;
+	if (IS_ERR(tmp))
+		return orig_prog;
+	if (tmp != prog) {
+		tmp_blinded = true;
+		prog = tmp;
+	}
+
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.prog = prog;
 
-	/* fake pass to fill in the ctx->seen */
-	if (unlikely(build_body(&ctx)))
+	/* If we cannot allocate memory for offsets[], we must
+	 * fall back to the interpreter.
+	 */
+	ctx.offsets = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
+	if (ctx.offsets == NULL) {
+		prog = orig_prog;
 		goto out;
+	}
+
+	/* 1) Fake pass to find the length of the JITed code and
+	 * to compute ctx->offsets and other context variables
+	 * needed to generate the final JITed code.
+	 * Also, calculate the random starting pointer of the JITed code,
+	 * which is prefixed by a random number of fault instructions.
+	 *
+	 * If the first pass fails then there is no chance of it
+	 * being successful in the second pass, so just fall back
+	 * to the interpreter.
+	 */
+	if (build_body(&ctx)) {
+		prog = orig_prog;
+		goto out_off;
+	}
 
 	tmp_idx = ctx.idx;
 	build_prologue(&ctx);
 	ctx.prologue_bytes = (ctx.idx - tmp_idx) * 4;
 
+	ctx.epilogue_offset = ctx.idx;
+
 #if __LINUX_ARM_ARCH__ < 7
 	tmp_idx = ctx.idx;
 	build_epilogue(&ctx);
@@ -1020,64 +1838,96 @@ void bpf_jit_compile(struct bpf_prog *fp)
 
 	ctx.idx += ctx.imm_count;
 	if (ctx.imm_count) {
-		ctx.imms = kzalloc(4 * ctx.imm_count, GFP_KERNEL);
-		if (ctx.imms == NULL)
-			goto out;
+		ctx.imms = kcalloc(ctx.imm_count, sizeof(u32), GFP_KERNEL);
+		if (ctx.imms == NULL) {
+			prog = orig_prog;
+			goto out_off;
+		}
 	}
 #else
-	/* there's nothing after the epilogue on ARMv7 */
+	/* there's nothing about the epilogue on ARMv7 */
 	build_epilogue(&ctx);
 #endif
-	alloc_size = 4 * ctx.idx;
-	header = bpf_jit_binary_alloc(alloc_size, &target_ptr,
-				      4, jit_fill_hole);
-	if (header == NULL)
-		goto out;
+	/* Now we can get the actual image size of the JITed ARM code.
+	 * Currently, we do not consider Thumb-2 instructions for the
+	 * JIT, although they could decrease the size of the image.
+	 *
+	 * As each ARM instruction is 32 bits long, we translate the
+	 * number of JITed instructions into the size required to store
+	 * the JITed code.
+	 */
+	image_size = sizeof(u32) * ctx.idx;
 
-	ctx.target = (u32 *) target_ptr;
+	/* Now we know the size of the structure to make */
+	header = bpf_jit_binary_alloc(image_size, &image_ptr,
+				      sizeof(u32), jit_fill_hole);
+	/* If we cannot allocate memory for the structure, we must
+	 * fall back to the interpreter.
+	 */
+	if (header == NULL) {
+		prog = orig_prog;
+		goto out_imms;
+	}
+
+	/* 2.) Actual pass to generate final JIT code */
+	ctx.target = (u32 *) image_ptr;
 	ctx.idx = 0;
 
 	build_prologue(&ctx);
+
+	/* If building the body of the JITed code fails somehow,
+	 * we fall back to the interpreter.
+	 */
 	if (build_body(&ctx) < 0) {
-#if __LINUX_ARM_ARCH__ < 7
-		if (ctx.imm_count)
-			kfree(ctx.imms);
-#endif
+		image_ptr = NULL;
 		bpf_jit_binary_free(header);
-		goto out;
+		prog = orig_prog;
+		goto out_imms;
 	}
 	build_epilogue(&ctx);
 
+	/* 3.) Extra pass to validate JITed Code */
+	if (validate_code(&ctx)) {
+		image_ptr = NULL;
+		bpf_jit_binary_free(header);
+		prog = orig_prog;
+		goto out_imms;
+	}
 	flush_icache_range((u32)header, (u32)(ctx.target + ctx.idx));
 
-#if __LINUX_ARM_ARCH__ < 7
-	if (ctx.imm_count)
-		kfree(ctx.imms);
-#endif
-
 	if (bpf_jit_enable > 1)
 		/* there are 2 passes here */
-		bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
+		bpf_jit_dump(prog->len, image_size, 2, ctx.target);
 
 	set_memory_ro((unsigned long)header, header->pages);
-	fp->bpf_func = (void *)ctx.target;
-	fp->jited = 1;
-out:
+	prog->bpf_func = (void *)ctx.target;
+	prog->jited = 1;
+out_imms:
+#if __LINUX_ARM_ARCH__ < 7
+	if (ctx.imm_count)
+		kfree(ctx.imms);
+#endif
+out_off:
 	kfree(ctx.offsets);
-	return;
+out:
+	if (tmp_blinded)
+		bpf_jit_prog_release_other(prog, prog == orig_prog ?
+					   tmp : orig_prog);
+#endif /* __LITTLE_ENDIAN */
+	return prog;
 }
 
-void bpf_jit_free(struct bpf_prog *fp)
+void bpf_jit_free(struct bpf_prog *prog)
 {
-	unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
+	unsigned long addr = (unsigned long)prog->bpf_func & PAGE_MASK;
 	struct bpf_binary_header *header = (void *)addr;
 
-	if (!fp->jited)
+	if (!prog->jited)
 		goto free_filter;
 
 	set_memory_rw(addr, header->pages);
 	bpf_jit_binary_free(header);
 
 free_filter:
-	bpf_prog_unlock_free(fp);
+	bpf_prog_unlock_free(prog);
 }
diff --git a/arch/arm/net/bpf_jit_32.h b/arch/arm/net/bpf_jit_32.h
index c46fca2..d5cf5f6 100644
--- a/arch/arm/net/bpf_jit_32.h
+++ b/arch/arm/net/bpf_jit_32.h
@@ -11,6 +11,7 @@
 #ifndef PFILTER_OPCODES_ARM_H
 #define PFILTER_OPCODES_ARM_H
 
+/* ARM 32bit Registers */
 #define ARM_R0	0
 #define ARM_R1	1
 #define ARM_R2	2
@@ -22,38 +23,43 @@
 #define ARM_R8	8
 #define ARM_R9	9
 #define ARM_R10	10
-#define ARM_FP	11
-#define ARM_IP	12
-#define ARM_SP	13
-#define ARM_LR	14
-#define ARM_PC	15
-
-#define ARM_COND_EQ		0x0
-#define ARM_COND_NE		0x1
-#define ARM_COND_CS		0x2
+#define ARM_FP	11	/* Frame Pointer */
+#define ARM_IP	12	/* Intra-procedure scratch register */
+#define ARM_SP	13	/* Stack pointer: as load/store base reg */
+#define ARM_LR	14	/* Link Register */
+#define ARM_PC	15	/* Program counter */
+
+#define ARM_COND_EQ		0x0	/* == */
+#define ARM_COND_NE		0x1	/* != */
+#define ARM_COND_CS		0x2	/* unsigned >= */
 #define ARM_COND_HS		ARM_COND_CS
-#define ARM_COND_CC		0x3
+#define ARM_COND_CC		0x3	/* unsigned < */
 #define ARM_COND_LO		ARM_COND_CC
-#define ARM_COND_MI		0x4
-#define ARM_COND_PL		0x5
-#define ARM_COND_VS		0x6
-#define ARM_COND_VC		0x7
-#define ARM_COND_HI		0x8
-#define ARM_COND_LS		0x9
-#define ARM_COND_GE		0xa
-#define ARM_COND_LT		0xb
-#define ARM_COND_GT		0xc
-#define ARM_COND_LE		0xd
-#define ARM_COND_AL		0xe
+#define ARM_COND_MI		0x4	/* < 0 */
+#define ARM_COND_PL		0x5	/* >= 0 */
+#define ARM_COND_VS		0x6	/* Signed Overflow */
+#define ARM_COND_VC		0x7	/* No Signed Overflow */
+#define ARM_COND_HI		0x8	/* unsigned > */
+#define ARM_COND_LS		0x9	/* unsigned <= */
+#define ARM_COND_GE		0xa	/* Signed >= */
+#define ARM_COND_LT		0xb	/* Signed < */
+#define ARM_COND_GT		0xc	/* Signed > */
+#define ARM_COND_LE		0xd	/* Signed <= */
+#define ARM_COND_AL		0xe	/* None */
 
 /* register shift types */
 #define SRTYPE_LSL		0
 #define SRTYPE_LSR		1
 #define SRTYPE_ASR		2
 #define SRTYPE_ROR		3
+#define SRTYPE_ASL		(SRTYPE_LSL)
 
 #define ARM_INST_ADD_R		0x00800000
+#define ARM_INST_ADDS_R		0x00900000
+#define ARM_INST_ADC_R		0x00a00000
+#define ARM_INST_ADC_I		0x02a00000
 #define ARM_INST_ADD_I		0x02800000
+#define ARM_INST_ADDS_I		0x02900000
 
 #define ARM_INST_AND_R		0x00000000
 #define ARM_INST_AND_I		0x02000000
@@ -76,8 +82,10 @@
 #define ARM_INST_LDRH_I		0x01d000b0
 #define ARM_INST_LDRH_R		0x019000b0
 #define ARM_INST_LDR_I		0x05900000
+#define ARM_INST_LDR_R		0x07900000
 
 #define ARM_INST_LDM		0x08900000
+#define ARM_INST_LDM_IA		0x08b00000
 
 #define ARM_INST_LSL_I		0x01a00000
 #define ARM_INST_LSL_R		0x01a00010
@@ -86,6 +94,7 @@
 #define ARM_INST_LSR_R		0x01a00030
 
 #define ARM_INST_MOV_R		0x01a00000
+#define ARM_INST_MOVS_R		0x01b00000
 #define ARM_INST_MOV_I		0x03a00000
 #define ARM_INST_MOVW		0x03000000
 #define ARM_INST_MOVT		0x03400000
@@ -96,17 +105,28 @@
 #define ARM_INST_PUSH		0x092d0000
 
 #define ARM_INST_ORR_R		0x01800000
+#define ARM_INST_ORRS_R		0x01900000
 #define ARM_INST_ORR_I		0x03800000
 
 #define ARM_INST_REV		0x06bf0f30
 #define ARM_INST_REV16		0x06bf0fb0
 
 #define ARM_INST_RSB_I		0x02600000
+#define ARM_INST_RSBS_I		0x02700000
+#define ARM_INST_RSC_I		0x02e00000
 
 #define ARM_INST_SUB_R		0x00400000
+#define ARM_INST_SUBS_R		0x00500000
+#define ARM_INST_RSB_R		0x00600000
 #define ARM_INST_SUB_I		0x02400000
+#define ARM_INST_SUBS_I		0x02500000
+#define ARM_INST_SBC_I		0x02c00000
+#define ARM_INST_SBC_R		0x00c00000
+#define ARM_INST_SBCS_R		0x00d00000
 
 #define ARM_INST_STR_I		0x05800000
+#define ARM_INST_STRB_I		0x05c00000
+#define ARM_INST_STRH_I		0x01c000b0
 
 #define ARM_INST_TST_R		0x01100000
 #define ARM_INST_TST_I		0x03100000
@@ -117,6 +137,8 @@
 
 #define ARM_INST_MLS		0x00600090
 
+#define ARM_INST_UXTH		0x06ff0070
+
 /*
  * Use a suitable undefined instruction to use for ARM/Thumb2 faulting.
  * We need to be careful not to conflict with those used by other modules
@@ -135,9 +157,15 @@
 #define _AL3_R(op, rd, rn, rm)	((op ## _R) | (rd) << 12 | (rn) << 16 | (rm))
 /* immediate */
 #define _AL3_I(op, rd, rn, imm)	((op ## _I) | (rd) << 12 | (rn) << 16 | (imm))
+/* register with register-shift */
+#define _AL3_SR(inst)	(inst | (1 << 4))
 
 #define ARM_ADD_R(rd, rn, rm)	_AL3_R(ARM_INST_ADD, rd, rn, rm)
+#define ARM_ADDS_R(rd, rn, rm)	_AL3_R(ARM_INST_ADDS, rd, rn, rm)
 #define ARM_ADD_I(rd, rn, imm)	_AL3_I(ARM_INST_ADD, rd, rn, imm)
+#define ARM_ADDS_I(rd, rn, imm)	_AL3_I(ARM_INST_ADDS, rd, rn, imm)
+#define ARM_ADC_R(rd, rn, rm)	_AL3_R(ARM_INST_ADC, rd, rn, rm)
+#define ARM_ADC_I(rd, rn, imm)	_AL3_I(ARM_INST_ADC, rd, rn, imm)
 
 #define ARM_AND_R(rd, rn, rm)	_AL3_R(ARM_INST_AND, rd, rn, rm)
 #define ARM_AND_I(rd, rn, imm)	_AL3_I(ARM_INST_AND, rd, rn, imm)
@@ -156,7 +184,9 @@
 #define ARM_EOR_I(rd, rn, imm)	_AL3_I(ARM_INST_EOR, rd, rn, imm)
 
 #define ARM_LDR_I(rt, rn, off)	(ARM_INST_LDR_I | (rt) << 12 | (rn) << 16 \
-				 | (off))
+				 | ((off) & 0xfff))
+#define ARM_LDR_R(rt, rn, rm)	(ARM_INST_LDR_R | (rt) << 12 | (rn) << 16 \
+				 | (rm))
 #define ARM_LDRB_I(rt, rn, off)	(ARM_INST_LDRB_I | (rt) << 12 | (rn) << 16 \
 				 | (off))
 #define ARM_LDRB_R(rt, rn, rm)	(ARM_INST_LDRB_R | (rt) << 12 | (rn) << 16 \
@@ -167,15 +197,23 @@
 				 | (rm))
 
 #define ARM_LDM(rn, regs)	(ARM_INST_LDM | (rn) << 16 | (regs))
+#define ARM_LDM_IA(rn, regs)	(ARM_INST_LDM_IA | (rn) << 16 | (regs))
 
 #define ARM_LSL_R(rd, rn, rm)	(_AL3_R(ARM_INST_LSL, rd, 0, rn) | (rm) << 8)
 #define ARM_LSL_I(rd, rn, imm)	(_AL3_I(ARM_INST_LSL, rd, 0, rn) | (imm) << 7)
 
 #define ARM_LSR_R(rd, rn, rm)	(_AL3_R(ARM_INST_LSR, rd, 0, rn) | (rm) << 8)
 #define ARM_LSR_I(rd, rn, imm)	(_AL3_I(ARM_INST_LSR, rd, 0, rn) | (imm) << 7)
+#define ARM_ASR_R(rd, rn, rm)   (_AL3_R(ARM_INST_ASR, rd, 0, rn) | (rm) << 8)
+#define ARM_ASR_I(rd, rn, imm)  (_AL3_I(ARM_INST_ASR, rd, 0, rn) | (imm) << 7)
 
 #define ARM_MOV_R(rd, rm)	_AL3_R(ARM_INST_MOV, rd, 0, rm)
+#define ARM_MOVS_R(rd, rm)	_AL3_R(ARM_INST_MOVS, rd, 0, rm)
 #define ARM_MOV_I(rd, imm)	_AL3_I(ARM_INST_MOV, rd, 0, imm)
+#define ARM_MOV_SR(rd, rm, type, rs)	\
+	(_AL3_SR(ARM_MOV_R(rd, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_MOV_SI(rd, rm, type, imm6)	\
+	(ARM_MOV_R(rd, rm) | (type) << 5 | (imm6) << 7)
 
 #define ARM_MOVW(rd, imm)	\
 	(ARM_INST_MOVW | ((imm) >> 12) << 16 | (rd) << 12 | ((imm) & 0x0fff))
@@ -190,19 +228,38 @@
 
 #define ARM_ORR_R(rd, rn, rm)	_AL3_R(ARM_INST_ORR, rd, rn, rm)
 #define ARM_ORR_I(rd, rn, imm)	_AL3_I(ARM_INST_ORR, rd, rn, imm)
-#define ARM_ORR_S(rd, rn, rm, type, rs)	\
-	(ARM_ORR_R(rd, rn, rm) | (type) << 5 | (rs) << 7)
+#define ARM_ORR_SR(rd, rn, rm, type, rs)	\
+	(_AL3_SR(ARM_ORR_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_ORRS_R(rd, rn, rm)	_AL3_R(ARM_INST_ORRS, rd, rn, rm)
+#define ARM_ORRS_SR(rd, rn, rm, type, rs)	\
+	(_AL3_SR(ARM_ORRS_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_ORR_SI(rd, rn, rm, type, imm6)	\
+	(ARM_ORR_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
+#define ARM_ORRS_SI(rd, rn, rm, type, imm6)	\
+	(ARM_ORRS_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
 
 #define ARM_REV(rd, rm)		(ARM_INST_REV | (rd) << 12 | (rm))
 #define ARM_REV16(rd, rm)	(ARM_INST_REV16 | (rd) << 12 | (rm))
 
 #define ARM_RSB_I(rd, rn, imm)	_AL3_I(ARM_INST_RSB, rd, rn, imm)
+#define ARM_RSBS_I(rd, rn, imm)	_AL3_I(ARM_INST_RSBS, rd, rn, imm)
+#define ARM_RSC_I(rd, rn, imm)	_AL3_I(ARM_INST_RSC, rd, rn, imm)
 
 #define ARM_SUB_R(rd, rn, rm)	_AL3_R(ARM_INST_SUB, rd, rn, rm)
+#define ARM_SUBS_R(rd, rn, rm)	_AL3_R(ARM_INST_SUBS, rd, rn, rm)
+#define ARM_RSB_R(rd, rn, rm)	_AL3_R(ARM_INST_RSB, rd, rn, rm)
+#define ARM_SBC_R(rd, rn, rm)	_AL3_R(ARM_INST_SBC, rd, rn, rm)
+#define ARM_SBCS_R(rd, rn, rm)	_AL3_R(ARM_INST_SBCS, rd, rn, rm)
 #define ARM_SUB_I(rd, rn, imm)	_AL3_I(ARM_INST_SUB, rd, rn, imm)
+#define ARM_SUBS_I(rd, rn, imm)	_AL3_I(ARM_INST_SUBS, rd, rn, imm)
+#define ARM_SBC_I(rd, rn, imm)	_AL3_I(ARM_INST_SBC, rd, rn, imm)
 
 #define ARM_STR_I(rt, rn, off)	(ARM_INST_STR_I | (rt) << 12 | (rn) << 16 \
-				 | (off))
+				 | ((off) & 0xfff))
+#define ARM_STRH_I(rt, rn, off)	(ARM_INST_STRH_I | (rt) << 12 | (rn) << 16 \
+				 | (((off) & 0xf0) << 4) | ((off) & 0xf))
+#define ARM_STRB_I(rt, rn, off)	(ARM_INST_STRB_I | (rt) << 12 | (rn) << 16 \
+				 | ((off) & 0xfff))
 
 #define ARM_TST_R(rn, rm)	_AL3_R(ARM_INST_TST, 0, rn, rm)
 #define ARM_TST_I(rn, imm)	_AL3_I(ARM_INST_TST, 0, rn, imm)
@@ -214,5 +271,6 @@
 
 #define ARM_MLS(rd, rn, rm, ra)	(ARM_INST_MLS | (rd) << 16 | (rn) | (rm) << 8 \
 				 | (ra) << 12)
+#define ARM_UXTH(rd, rm)	(ARM_INST_UXTH | (rd) << 12 | (rm))
 
 #endif /* PFILTER_OPCODES_ARM_H */
-- 
2.7.4
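
The ARM_STR*_I macros in the header above pack an immediate offset into the instruction word. As a rough sketch of the two layouts involved (based on the A32 instruction encodings, not on code from this patch): word and byte loads/stores carry the offset as one 12-bit field in bits [11:0], whereas halfword loads/stores split an 8-bit offset into a high nibble in bits [11:8] and a low nibble in bits [3:0]. The helper names below (`enc_off_imm12`, `enc_off_imm8_split`, `str_i`, `strh_i`) are hypothetical and exist only for this illustration. Only non-negative offsets are handled, and the condition field is left at zero as in the header's macros.

```c
#include <stdint.h>

/* Base opcodes, same values as ARM_INST_STR_I and ARM_INST_STRH_I. */
#define INST_STR_I	0x05800000u
#define INST_STRH_I	0x01c000b0u

/* Word/byte form: the offset is a single 12-bit field, bits [11:0]. */
static uint32_t enc_off_imm12(uint32_t off)
{
	return off & 0xfff;
}

/* Halfword form: 8-bit offset split into bits [11:8] and bits [3:0]. */
static uint32_t enc_off_imm8_split(uint32_t off)
{
	return ((off & 0xf0) << 4) | (off & 0xf);
}

/* str rt, [rn, #off] (hypothetical helper, condition field left as 0) */
static uint32_t str_i(uint32_t rt, uint32_t rn, uint32_t off)
{
	return INST_STR_I | rt << 12 | rn << 16 | enc_off_imm12(off);
}

/* strh rt, [rn, #off] (hypothetical helper, condition field left as 0) */
static uint32_t strh_i(uint32_t rt, uint32_t rn, uint32_t off)
{
	return INST_STRH_I | rt << 12 | rn << 16 | enc_off_imm8_split(off);
}
```

With rt = r0, rn = sp (13) and an offset of 0x24, the word form keeps 0x24 in the low bits, while the halfword form places 0x2 in bits [11:8] and 0x4 in bits [3:0].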


* [PATCH v2] arm: eBPF JIT compiler
@ 2017-05-25 23:13 ` Shubham Bansal
  0 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-05-25 23:13 UTC (permalink / raw)
  To: linux-arm-kernel

The JIT compiler emits 32-bit ARM instructions. It currently accepts
eBPF only; classic BPF is still supported because the BPF core converts
it to eBPF.

This patch converts the existing ARM JIT compiler for the Berkeley
Packet Filter from classic BPF to internal BPF (eBPF). Almost all
instructions of the eBPF ISA are supported, except the following:
	BPF_ALU64 | BPF_DIV | BPF_K
	BPF_ALU64 | BPF_DIV | BPF_X
	BPF_ALU64 | BPF_MOD | BPF_K
	BPF_ALU64 | BPF_MOD | BPF_X
	BPF_STX | BPF_XADD | BPF_W
	BPF_STX | BPF_XADD | BPF_DW
	BPF_JMP | BPF_CALL

The implementation uses stack scratch space to emulate the 64-bit eBPF
ISA on 32-bit ARM, because ARM does not have enough general purpose
registers. Currently, only little-endian machines are supported by this
eBPF JIT compiler.
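
As an illustration of the emulation described above (a sketch, not code from this patch): a 64-bit eBPF register can be modelled as a pair of 32-bit words, and a 64-bit add then becomes an add on the low words followed by an add-with-carry on the high words, which is the job the ADDS/ADC instruction pair does on ARM. The names `reg64` and `add64` are hypothetical and used only here.

```c
#include <stdint.h>

/* Hypothetical model of one 64-bit eBPF register as two 32-bit words. */
struct reg64 {
	uint32_t lo;
	uint32_t hi;
};

/* dst + src, computed the way a 32-bit CPU does it. */
static struct reg64 add64(struct reg64 dst, struct reg64 src)
{
	struct reg64 res;
	uint32_t carry;

	res.lo = dst.lo + src.lo;		/* like ADDS: may wrap around */
	carry = res.lo < dst.lo;		/* carry out of the low word */
	res.hi = dst.hi + src.hi + carry;	/* like ADC: adds the carry in */
	return res;
}
```

For example, adding 1 to the pair {lo = 0xffffffff, hi = 1} wraps the low word to 0 and carries into the high word, giving {lo = 0, hi = 2}.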

Tested on ARMv7 with QEMU by me (Shubham Bansal).
Tested on ARMv5 by Andrew Lunn (andrew at lunn.ch).
Expected to work on ARMv6 as well, since it is part ARMv7 and part ARMv5,
although it has not been properly tested on ARMv6.

Both tests were run with and without CONFIG_FRAME_POINTER, on
little-endian machines.

For testing:

1. JIT is enabled with
	echo 1 > /proc/sys/net/core/bpf_jit_enable
2. Constant Blinding can be enabled along with JIT using
	echo 1 > /proc/sys/net/core/bpf_jit_enable
	echo 2 > /proc/sys/net/core/bpf_jit_harden

See Documentation/networking/filter.txt for more information.

Result : test_bpf: Summary: 314 PASSED, 0 FAILED, [278/306 JIT'ed]

Signed-off-by: Shubham Bansal <illusionist.neo@gmail.com>
---
 Documentation/networking/filter.txt |    4 +-
 arch/arm/Kconfig                    |    2 +-
 arch/arm/net/bpf_jit_32.c           | 2404 ++++++++++++++++++++++++-----------
 arch/arm/net/bpf_jit_32.h           |  108 +-
 4 files changed, 1713 insertions(+), 805 deletions(-)

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index b69b205..01165ac 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -596,8 +596,8 @@ skb pointer). All constraints and restrictions from bpf_check_classic() apply
 before a conversion to the new layout is being done behind the scenes!
 
 Currently, the classic BPF format is being used for JITing on most 32-bit
-architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64 perform JIT
-compilation from eBPF instruction set.
+architectures, whereas x86-64, aarch64, arm, s390x, powerpc64, sparc64 perform
+JIT compilation from eBPF instruction set.
 
 Some core changes of the new internal format:
 
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 8a7ab5e..13ade46 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -47,7 +47,7 @@ config ARM
 	select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
-	select HAVE_CBPF_JIT
+	select HAVE_EBPF_JIT
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CONTEXT_TRACKING
 	select HAVE_C_RECORDMCOUNT
diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index 93d0b6d..c7476e5 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -1,13 +1,15 @@
 /*
- * Just-In-Time compiler for BPF filters on 32bit ARM
+ * Just-In-Time compiler for eBPF filters on 32bit ARM
  *
  * Copyright (c) 2011 Mircea Gherzan <mgherzan@gmail.com>
+ * Copyright (c) 2017 Shubham Bansal <illusionist.neo@gmail.com>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License as published by the
  * Free Software Foundation; version 2 of the License.
  */
 
+#include <linux/bpf.h>
 #include <linux/bitops.h>
 #include <linux/compiler.h>
 #include <linux/errno.h>
@@ -23,44 +25,91 @@
 
 #include "bpf_jit_32.h"
 
+int bpf_jit_enable __read_mostly;
+
+#define STACK_OFFSET(k)	(k)
+#define TMP_REG_1	(MAX_BPF_JIT_REG + 0)	/* TEMP Register 1 */
+#define TMP_REG_2	(MAX_BPF_JIT_REG + 1)	/* TEMP Register 2 */
+#define TCALL_CNT	(MAX_BPF_JIT_REG + 2)	/* Tail Call Count */
+
+/* Flags used for JIT optimization */
+#define SEEN_CALL	(1 << 0)
+
+#define FLAG_IMM_OVERFLOW	(1 << 0)
+
 /*
- * ABI:
+ * Map eBPF registers to ARM 32bit registers or stack scratch space.
+ *
+ * 1. The first argument is passed in ARM 32-bit registers and the rest of
+ * the arguments are passed in stack scratch space.
+ * 2. The first callee-saved argument is mapped to ARM 32-bit registers and
+ * the rest are mapped to scratch space on the stack.
+ * 3. We need two 64-bit temp registers to do complex operations on eBPF
+ * registers.
+ *
+ * As the eBPF registers are all 64-bit and ARM has only 32-bit registers,
+ * we have to map each eBPF register to two ARM 32-bit registers or to
+ * scratch memory space, and build the eBPF 64-bit register from those.
  *
- * r0	scratch register
- * r4	BPF register A
- * r5	BPF register X
- * r6	pointer to the skb
- * r7	skb->data
- * r8	skb_headlen(skb)
  */
+static const u8 bpf2a32[][2] = {
+	/* return value from in-kernel function, and exit value from eBPF */
+	[BPF_REG_0] = {ARM_R1, ARM_R0},
+	/* arguments from eBPF program to in-kernel function */
+	[BPF_REG_1] = {ARM_R3, ARM_R2},
+	/* Stored on stack scratch space */
+	[BPF_REG_2] = {STACK_OFFSET(0), STACK_OFFSET(4)},
+	[BPF_REG_3] = {STACK_OFFSET(8), STACK_OFFSET(12)},
+	[BPF_REG_4] = {STACK_OFFSET(16), STACK_OFFSET(20)},
+	[BPF_REG_5] = {STACK_OFFSET(24), STACK_OFFSET(28)},
+	/* callee saved registers that in-kernel function will preserve */
+	[BPF_REG_6] = {ARM_R5, ARM_R4},
+	/* Stored on stack scratch space */
+	[BPF_REG_7] = {STACK_OFFSET(32), STACK_OFFSET(36)},
+	[BPF_REG_8] = {STACK_OFFSET(40), STACK_OFFSET(44)},
+	[BPF_REG_9] = {STACK_OFFSET(48), STACK_OFFSET(52)},
+	/* Read only Frame Pointer to access Stack */
+	[BPF_REG_FP] = {STACK_OFFSET(56), STACK_OFFSET(60)},
+	/* Temporary registers for the internal BPF JIT; can be used
+	 * for constant blinding and other purposes.
+	 */
+	[TMP_REG_1] = {ARM_R7, ARM_R6},
+	[TMP_REG_2] = {ARM_R10, ARM_R8},
+	/* Tail call count. Stored on stack scratch space. */
+	[TCALL_CNT] = {STACK_OFFSET(64), STACK_OFFSET(68)},
+	/* temporary register for blinding constants.
+	 * Stored on stack scratch space.
+	 */
+	[BPF_REG_AX] = {STACK_OFFSET(72), STACK_OFFSET(76)},
+};
 
-#define r_scratch	ARM_R0
-/* r1-r3 are (also) used for the unaligned loads on the non-ARMv7 slowpath */
-#define r_off		ARM_R1
-#define r_A		ARM_R4
-#define r_X		ARM_R5
-#define r_skb		ARM_R6
-#define r_skb_data	ARM_R7
-#define r_skb_hl	ARM_R8
-
-#define SCRATCH_SP_OFFSET	0
-#define SCRATCH_OFF(k)		(SCRATCH_SP_OFFSET + 4 * (k))
-
-#define SEEN_MEM		((1 << BPF_MEMWORDS) - 1)
-#define SEEN_MEM_WORD(k)	(1 << (k))
-#define SEEN_X			(1 << BPF_MEMWORDS)
-#define SEEN_CALL		(1 << (BPF_MEMWORDS + 1))
-#define SEEN_SKB		(1 << (BPF_MEMWORDS + 2))
-#define SEEN_DATA		(1 << (BPF_MEMWORDS + 3))
+#define dst_lo	dst[1]
+#define dst_hi	dst[0]
+#define src_lo	src[1]
+#define src_hi	src[0]
 
-#define FLAG_NEED_X_RESET	(1 << 0)
-#define FLAG_IMM_OVERFLOW	(1 << 1)
+/*
+ * JIT Context:
+ *
+ * prog			:	bpf_prog
+ * idx			:	index of current last JITed instruction.
+ * prologue_bytes	:	bytes used in prologue.
+ * epilogue_offset	:	offset at which the epilogue starts.
+ * seen			:	bit mask used for JIT optimization.
+ * offsets		:	array of eBPF instruction offsets in
+ *				JITed code.
+ * target		:	final JITed code.
+ * epilogue_bytes	:	number of bytes used in epilogue.
+ * imm_count		:	number of immediates used for global
+ *				variables.
+ * imms			:	array of global variable addresses.
+ */
 
 struct jit_ctx {
-	const struct bpf_prog *skf;
-	unsigned idx;
-	unsigned prologue_bytes;
-	int ret0_fp_idx;
+	const struct bpf_prog *prog;
+	unsigned int idx;
+	unsigned int prologue_bytes;
+	unsigned int epilogue_offset;
 	u32 seen;
 	u32 flags;
 	u32 *offsets;
@@ -72,68 +121,16 @@ struct jit_ctx {
 #endif
 };
 
-int bpf_jit_enable __read_mostly;
-
-static inline int call_neg_helper(struct sk_buff *skb, int offset, void *ret,
-		      unsigned int size)
-{
-	void *ptr = bpf_internal_load_pointer_neg_helper(skb, offset, size);
-
-	if (!ptr)
-		return -EFAULT;
-	memcpy(ret, ptr, size);
-	return 0;
-}
-
-static u64 jit_get_skb_b(struct sk_buff *skb, int offset)
-{
-	u8 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 1);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 1);
-
-	return (u64)err << 32 | ret;
-}
-
-static u64 jit_get_skb_h(struct sk_buff *skb, int offset)
-{
-	u16 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 2);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 2);
-
-	return (u64)err << 32 | ntohs(ret);
-}
-
-static u64 jit_get_skb_w(struct sk_buff *skb, int offset)
-{
-	u32 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 4);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 4);
-
-	return (u64)err << 32 | ntohl(ret);
-}
-
 /*
  * Wrappers which handle both OABI and EABI and assures Thumb2 interworking
  * (where the assembly routines like __aeabi_uidiv could cause problems).
  */
-static u32 jit_udiv(u32 dividend, u32 divisor)
+static u32 jit_udiv32(u32 dividend, u32 divisor)
 {
 	return dividend / divisor;
 }
 
-static u32 jit_mod(u32 dividend, u32 divisor)
+static u32 jit_mod32(u32 dividend, u32 divisor)
 {
 	return dividend % divisor;
 }
@@ -157,36 +154,22 @@ static inline void emit(u32 inst, struct jit_ctx *ctx)
 	_emit(ARM_COND_AL, inst, ctx);
 }
 
-static u16 saved_regs(struct jit_ctx *ctx)
+/*
+ * Encodes an immediate as imm12 (8 bit value, 4 bit rotation); -1 if it can't.
+ */
+static int16_t imm8m(u32 x)
 {
-	u16 ret = 0;
-
-	if ((ctx->skf->len > 1) ||
-	    (ctx->skf->insns[0].code == (BPF_RET | BPF_A)))
-		ret |= 1 << r_A;
-
-#ifdef CONFIG_FRAME_POINTER
-	ret |= (1 << ARM_FP) | (1 << ARM_IP) | (1 << ARM_LR) | (1 << ARM_PC);
-#else
-	if (ctx->seen & SEEN_CALL)
-		ret |= 1 << ARM_LR;
-#endif
-	if (ctx->seen & (SEEN_DATA | SEEN_SKB))
-		ret |= 1 << r_skb;
-	if (ctx->seen & SEEN_DATA)
-		ret |= (1 << r_skb_data) | (1 << r_skb_hl);
-	if (ctx->seen & SEEN_X)
-		ret |= 1 << r_X;
-
-	return ret;
-}
+	u32 rot;
 
-static inline int mem_words_used(struct jit_ctx *ctx)
-{
-	/* yes, we do waste some stack space IF there are "holes" in the set" */
-	return fls(ctx->seen & SEEN_MEM);
+	for (rot = 0; rot < 16; rot++)
+		if ((x & ~ror32(0xff, 2 * rot)) == 0)
+			return rol32(x, 2 * rot) | (rot << 8);
+	return -1;
 }
 
+/*
+ * Initializes the JIT space with undefined instructions.
+ */
 static void jit_fill_hole(void *area, unsigned int size)
 {
 	u32 *ptr;
@@ -195,88 +178,34 @@ static void jit_fill_hole(void *area, unsigned int size)
 		*ptr++ = __opcode_to_mem_arm(ARM_INST_UDF);
 }
 
-static void build_prologue(struct jit_ctx *ctx)
-{
-	u16 reg_set = saved_regs(ctx);
-	u16 off;
-
-#ifdef CONFIG_FRAME_POINTER
-	emit(ARM_MOV_R(ARM_IP, ARM_SP), ctx);
-	emit(ARM_PUSH(reg_set), ctx);
-	emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
-#else
-	if (reg_set)
-		emit(ARM_PUSH(reg_set), ctx);
-#endif
+/* Stack size must be a multiple of 16 bytes */
+#define STACK_ALIGN(sz) (((sz) + 15) & ~15)
 
-	if (ctx->seen & (SEEN_DATA | SEEN_SKB))
-		emit(ARM_MOV_R(r_skb, ARM_R0), ctx);
-
-	if (ctx->seen & SEEN_DATA) {
-		off = offsetof(struct sk_buff, data);
-		emit(ARM_LDR_I(r_skb_data, r_skb, off), ctx);
-		/* headlen = len - data_len */
-		off = offsetof(struct sk_buff, len);
-		emit(ARM_LDR_I(r_skb_hl, r_skb, off), ctx);
-		off = offsetof(struct sk_buff, data_len);
-		emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
-		emit(ARM_SUB_R(r_skb_hl, r_skb_hl, r_scratch), ctx);
-	}
-
-	if (ctx->flags & FLAG_NEED_X_RESET)
-		emit(ARM_MOV_I(r_X, 0), ctx);
-
-	/* do not leak kernel data to userspace */
-	if (bpf_needs_clear_a(&ctx->skf->insns[0]))
-		emit(ARM_MOV_I(r_A, 0), ctx);
-
-	/* stack space for the BPF_MEM words */
-	if (ctx->seen & SEEN_MEM)
-		emit(ARM_SUB_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
-}
-
-static void build_epilogue(struct jit_ctx *ctx)
-{
-	u16 reg_set = saved_regs(ctx);
-
-	if (ctx->seen & SEEN_MEM)
-		emit(ARM_ADD_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
-
-	reg_set &= ~(1 << ARM_LR);
-
-#ifdef CONFIG_FRAME_POINTER
-	/* the first instruction of the prologue was: mov ip, sp */
-	reg_set &= ~(1 << ARM_IP);
-	reg_set |= (1 << ARM_SP);
-	emit(ARM_LDM(ARM_SP, reg_set), ctx);
-#else
-	if (reg_set) {
-		if (ctx->seen & SEEN_CALL)
-			reg_set |= 1 << ARM_PC;
-		emit(ARM_POP(reg_set), ctx);
-	}
+/* Stack space for BPF_REG_2, BPF_REG_3, BPF_REG_4,
+ * BPF_REG_5, BPF_REG_7, BPF_REG_8, BPF_REG_9,
+ * BPF_REG_FP, BPF_REG_AX and Tail call counts.
+ */
+#define SCRATCH_SIZE 80
 
-	if (!(ctx->seen & SEEN_CALL))
-		emit(ARM_BX(ARM_LR), ctx);
-#endif
-}
+/* total stack size used in JITed code */
+#define _STACK_SIZE \
+	(MAX_BPF_STACK + \
+	 SCRATCH_SIZE + \
+	 4 /* extra for skb_copy_bits buffer */)
 
-static int16_t imm8m(u32 x)
-{
-	u32 rot;
+#define STACK_SIZE STACK_ALIGN(_STACK_SIZE)
 
-	for (rot = 0; rot < 16; rot++)
-		if ((x & ~ror32(0xff, 2 * rot)) == 0)
-			return rol32(x, 2 * rot) | (rot << 8);
+/* Get the offset of eBPF REGISTERs stored on scratch space. */
+#define STACK_VAR(off) (STACK_SIZE - (off) - 4)
 
-	return -1;
-}
+/* Offset of skb_copy_bits buffer */
+#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
 
 #if __LINUX_ARM_ARCH__ < 7
 
 static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 {
-	unsigned i = 0, offset;
+	unsigned int i = 0, offset;
 	u16 imm;
 
 	/* on the "fake" run we just count them (duplicates included) */
@@ -295,7 +224,7 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 		ctx->imms[i] = k;
 
 	/* constants go just after the epilogue */
-	offset =  ctx->offsets[ctx->skf->len];
+	offset = ctx->offsets[ctx->prog->len - 1] * 4;
 	offset += ctx->prologue_bytes;
 	offset += ctx->epilogue_bytes;
 	offset += i * 4;
@@ -319,10 +248,22 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 
 #endif /* __LINUX_ARM_ARCH__ */
 
+static inline int bpf2a32_offset(int bpf_to, int bpf_from,
+				 const struct jit_ctx *ctx) {
+	int to, from;
+
+	if (ctx->target == NULL)
+		return 0;
+	to = ctx->offsets[bpf_to];
+	from = ctx->offsets[bpf_from];
+
+	return to - from - 1;
+}
+
 /*
  * Move an immediate that's not an imm8m to a core register.
  */
-static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
+static inline void emit_mov_i_no8m(const u8 rd, u32 val, struct jit_ctx *ctx)
 {
 #if __LINUX_ARM_ARCH__ < 7
 	emit(ARM_LDR_I(rd, ARM_PC, imm_offset(val, ctx)), ctx);
@@ -333,7 +274,7 @@ static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
 #endif
 }
 
-static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
+static inline void emit_mov_i(const u8 rd, u32 val, struct jit_ctx *ctx)
 {
 	int imm12 = imm8m(val);
 
@@ -343,676 +284,1553 @@ static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
 		emit_mov_i_no8m(rd, val, ctx);
 }
 
-#if __LINUX_ARM_ARCH__ < 6
-
-static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
 {
-	_emit(cond, ARM_LDRB_I(ARM_R3, r_addr, 1), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 3), ctx);
-	_emit(cond, ARM_LSL_I(ARM_R3, ARM_R3, 16), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R0, r_addr, 2), ctx);
-	_emit(cond, ARM_ORR_S(ARM_R3, ARM_R3, ARM_R1, SRTYPE_LSL, 24), ctx);
-	_emit(cond, ARM_ORR_R(ARM_R3, ARM_R3, ARM_R2), ctx);
-	_emit(cond, ARM_ORR_S(r_res, ARM_R3, ARM_R0, SRTYPE_LSL, 8), ctx);
+	ctx->seen |= SEEN_CALL;
+#if __LINUX_ARM_ARCH__ < 5
+	emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
+
+	if (elf_hwcap & HWCAP_THUMB)
+		emit(ARM_BX(tgt_reg), ctx);
+	else
+		emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
+#else
+	emit(ARM_BLX_R(tgt_reg), ctx);
+#endif
 }
 
-static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+static inline int epilogue_offset(const struct jit_ctx *ctx)
 {
-	_emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 1), ctx);
-	_emit(cond, ARM_ORR_S(r_res, ARM_R2, ARM_R1, SRTYPE_LSL, 8), ctx);
+	int to, from;
+	/* No need for 1st dummy run */
+	if (ctx->target == NULL)
+		return 0;
+	to = ctx->epilogue_offset;
+	from = ctx->idx;
+
+	return to - from - 2;
 }
 
-static inline void emit_swap16(u8 r_dst, u8 r_src, struct jit_ctx *ctx)
+static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx, u8 op)
 {
-	/* r_dst = (r_src << 8) | (r_src >> 8) */
-	emit(ARM_LSL_I(ARM_R1, r_src, 8), ctx);
-	emit(ARM_ORR_S(r_dst, ARM_R1, r_src, SRTYPE_LSR, 8), ctx);
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	s32 jmp_offset;
+
+	/* Check whether the divisor is zero. If it is, zero
+	 * ARM_R0 and branch straight to the epilogue.
+	 */
+	emit(ARM_CMP_I(rn, 0), ctx);
+	_emit(ARM_COND_EQ, ARM_MOV_I(ARM_R0, 0), ctx);
+	jmp_offset = epilogue_offset(ctx);
+	_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+#if __LINUX_ARM_ARCH__ == 7
+	if (elf_hwcap & HWCAP_IDIVA) {
+		if (op == BPF_DIV)
+			emit(ARM_UDIV(rd, rm, rn), ctx);
+		else {
+			emit(ARM_UDIV(ARM_IP, rm, rn), ctx);
+			emit(ARM_MLS(rd, rn, ARM_IP, rm), ctx);
+		}
+		return;
+	}
+#endif
 
 	/*
-	 * we need to mask out the bits set in r_dst[23:16] due to
-	 * the first shift instruction.
-	 *
-	 * note that 0x8ff is the encoded immediate 0x00ff0000.
+	 * For BPF_ALU | BPF_DIV | BPF_K instructions:
+	 * ARM_R1 and ARM_R0 hold BPF_REG_0 (see bpf2a32), and the
+	 * call below uses them to pass its arguments, so save them
+	 * in the temp registers before the call. After the callee
+	 * returns, restore ARM_R0 and ARM_R1 from the temp
+	 * registers.
 	 */
-	emit(ARM_BIC_I(r_dst, r_dst, 0x8ff), ctx);
-}
+	if (rn != ARM_R1) {
+		emit(ARM_MOV_R(tmp[0], ARM_R1), ctx);
+		emit(ARM_MOV_R(ARM_R1, rn), ctx);
+	}
+	if (rm != ARM_R0) {
+		emit(ARM_MOV_R(tmp[1], ARM_R0), ctx);
+		emit(ARM_MOV_R(ARM_R0, rm), ctx);
+	}
 
-#else  /* ARMv6+ */
+	/* Call appropriate function */
+	ctx->seen |= SEEN_CALL;
+	emit_mov_i(ARM_IP, op == BPF_DIV ?
+		   (u32)jit_udiv32 : (u32)jit_mod32, ctx);
+	emit_blx_r(ARM_IP, ctx);
 
-static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
-{
-	_emit(cond, ARM_LDR_I(r_res, r_addr, 0), ctx);
-#ifdef __LITTLE_ENDIAN
-	_emit(cond, ARM_REV(r_res, r_res), ctx);
-#endif
+	/* Save return value */
+	if (rd != ARM_R0)
+		emit(ARM_MOV_R(rd, ARM_R0), ctx);
+
+	/* Restore ARM_R0 and ARM_R1 */
+	if (rn != ARM_R1)
+		emit(ARM_MOV_R(ARM_R1, tmp[0]), ctx);
+	if (rm != ARM_R0)
+		emit(ARM_MOV_R(ARM_R0, tmp[1]), ctx);
 }
 
-static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+/* Checks whether BPF register is on scratch stack space or not. */
+static inline bool is_on_stack(u8 bpf_reg)
 {
-	_emit(cond, ARM_LDRH_I(r_res, r_addr, 0), ctx);
-#ifdef __LITTLE_ENDIAN
-	_emit(cond, ARM_REV16(r_res, r_res), ctx);
-#endif
+	static u8 stack_regs[] = {BPF_REG_AX, BPF_REG_3, BPF_REG_4, BPF_REG_5,
+				BPF_REG_7, BPF_REG_8, BPF_REG_9, TCALL_CNT,
+				BPF_REG_2, BPF_REG_FP};
+	int i, reg_len = ARRAY_SIZE(stack_regs);
+
+	for (i = 0 ; i < reg_len ; i++) {
+		if (bpf_reg == stack_regs[i])
+			return true;
+	}
+	return false;
 }
 
-static inline void emit_swap16(u8 r_dst __maybe_unused,
-			       u8 r_src __maybe_unused,
-			       struct jit_ctx *ctx __maybe_unused)
+static inline void emit_a32_mov_i(const u8 dst, const u32 val,
+				  bool dstk, struct jit_ctx *ctx)
 {
-#ifdef __LITTLE_ENDIAN
-	emit(ARM_REV16(r_dst, r_src), ctx);
-#endif
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+
+	if (dstk) {
+		emit_mov_i(tmp[1], val, ctx);
+		emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(dst)), ctx);
+	} else {
+		emit_mov_i(dst, val, ctx);
+	}
 }
 
-#endif /* __LINUX_ARM_ARCH__ < 6 */
+/* Sign extended move */
+static inline void emit_a32_mov_i64(const bool is64, const u8 dst[],
+				  const u32 val, bool dstk,
+				  struct jit_ctx *ctx) {
+	u32 hi = 0;
 
+	if (is64 && (val & (1U << 31)))
+		hi = (u32)~0;
+	emit_a32_mov_i(dst_lo, val, dstk, ctx);
+	emit_a32_mov_i(dst_hi, hi, dstk, ctx);
+}
 
-/* Compute the immediate value for a PC-relative branch. */
-static inline u32 b_imm(unsigned tgt, struct jit_ctx *ctx)
-{
-	u32 imm;
+static inline void emit_a32_add_r(const u8 dst, const u8 src,
+			      const bool is64, const bool hi,
+			      struct jit_ctx *ctx) {
+	/* 64 bit :
+	 *	adds dst_lo, dst_lo, src_lo
+	 *	adc dst_hi, dst_hi, src_hi
+	 * 32 bit :
+	 *	add dst_lo, dst_lo, src_lo
+	 */
+	if (!hi && is64)
+		emit(ARM_ADDS_R(dst, dst, src), ctx);
+	else if (hi && is64)
+		emit(ARM_ADC_R(dst, dst, src), ctx);
+	else
+		emit(ARM_ADD_R(dst, dst, src), ctx);
+}
 
-	if (ctx->target == NULL)
-		return 0;
-	/*
-	 * BPF allows only forward jumps and the offset of the target is
-	 * still the one computed during the first pass.
+static inline void emit_a32_sub_r(const u8 dst, const u8 src,
+				  const bool is64, const bool hi,
+				  struct jit_ctx *ctx) {
+	/* 64 bit :
+	 *	subs dst_lo, dst_lo, src_lo
+	 *	sbc dst_hi, dst_hi, src_hi
+	 * 32 bit :
+	 *	sub dst_lo, dst_lo, src_lo
 	 */
-	imm  = ctx->offsets[tgt] + ctx->prologue_bytes - (ctx->idx * 4 + 8);
+	if (!hi && is64)
+		emit(ARM_SUBS_R(dst, dst, src), ctx);
+	else if (hi && is64)
+		emit(ARM_SBC_R(dst, dst, src), ctx);
+	else
+		emit(ARM_SUB_R(dst, dst, src), ctx);
+}
 
-	return imm >> 2;
+static inline void emit_alu_r(const u8 dst, const u8 src, const bool is64,
+			      const bool hi, const u8 op, struct jit_ctx *ctx){
+	switch (BPF_OP(op)) {
+	/* dst = dst + src */
+	case BPF_ADD:
+		emit_a32_add_r(dst, src, is64, hi, ctx);
+		break;
+	/* dst = dst - src */
+	case BPF_SUB:
+		emit_a32_sub_r(dst, src, is64, hi, ctx);
+		break;
+	/* dst = dst | src */
+	case BPF_OR:
+		emit(ARM_ORR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst & src */
+	case BPF_AND:
+		emit(ARM_AND_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst ^ src */
+	case BPF_XOR:
+		emit(ARM_EOR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst * src */
+	case BPF_MUL:
+		emit(ARM_MUL(dst, dst, src), ctx);
+		break;
+	/* dst = dst << src */
+	case BPF_LSH:
+		emit(ARM_LSL_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst >> src */
+	case BPF_RSH:
+		emit(ARM_LSR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst >> src (signed)*/
+	case BPF_ARSH:
+		emit(ARM_MOV_SR(dst, dst, SRTYPE_ASR, src), ctx);
+		break;
+	}
 }
 
-#define OP_IMM3(op, r1, r2, imm_val, ctx)				\
-	do {								\
-		imm12 = imm8m(imm_val);					\
-		if (imm12 < 0) {					\
-			emit_mov_i_no8m(r_scratch, imm_val, ctx);	\
-			emit(op ## _R((r1), (r2), r_scratch), ctx);	\
-		} else {						\
-			emit(op ## _I((r1), (r2), imm12), ctx);		\
-		}							\
-	} while (0)
-
-static inline void emit_err_ret(u8 cond, struct jit_ctx *ctx)
-{
-	if (ctx->ret0_fp_idx >= 0) {
-		_emit(cond, ARM_B(b_imm(ctx->ret0_fp_idx, ctx)), ctx);
-		/* NOP to keep the size constant between passes */
-		emit(ARM_MOV_R(ARM_R0, ARM_R0), ctx);
+/* ALU operation (32 bit)
+ * dst = dst (op) src
+ */
+static inline void emit_a32_alu_r(const u8 dst, const u8 src,
+				  bool dstk, bool sstk,
+				  struct jit_ctx *ctx, const bool is64,
+				  const bool hi, const u8 op) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rn = sstk ? tmp[1] : src;
+
+	if (sstk)
+		emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src)), ctx);
+
+	/* ALU operation */
+	if (dstk) {
+		emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
+		emit_alu_r(tmp[0], rn, is64, hi, op, ctx);
+		emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
 	} else {
-		_emit(cond, ARM_MOV_I(ARM_R0, 0), ctx);
-		_emit(cond, ARM_B(b_imm(ctx->skf->len, ctx)), ctx);
+		emit_alu_r(dst, rn, is64, hi, op, ctx);
 	}
 }
 
-static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
-{
-#if __LINUX_ARM_ARCH__ < 5
-	emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
+/* ALU operation (64 bit) */
+static inline void emit_a32_alu_r64(const bool is64, const u8 dst[],
+				  const u8 src[], bool dstk,
+				  bool sstk, struct jit_ctx *ctx,
+				  const u8 op) {
+	emit_a32_alu_r(dst_lo, src_lo, dstk, sstk, ctx, is64, false, op);
+	if (is64)
+		emit_a32_alu_r(dst_hi, src_hi, dstk, sstk, ctx, is64, true, op);
+	else
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+}
 
-	if (elf_hwcap & HWCAP_THUMB)
-		emit(ARM_BX(tgt_reg), ctx);
+/* dst = src (4 bytes) */
+static inline void emit_a32_mov_r(const u8 dst, const u8 src,
+				  bool dstk, bool sstk,
+				  struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rt = sstk ? tmp[0] : src;
+
+	if (sstk)
+		emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(src)), ctx);
+	if (dstk)
+		emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst)), ctx);
 	else
-		emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
-#else
-	emit(ARM_BLX_R(tgt_reg), ctx);
-#endif
+		emit(ARM_MOV_R(dst, rt), ctx);
 }
 
-static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx,
-				int bpf_op)
-{
-#if __LINUX_ARM_ARCH__ == 7
-	if (elf_hwcap & HWCAP_IDIVA) {
-		if (bpf_op == BPF_DIV)
-			emit(ARM_UDIV(rd, rm, rn), ctx);
-		else {
-			emit(ARM_UDIV(ARM_R3, rm, rn), ctx);
-			emit(ARM_MLS(rd, rn, ARM_R3, rm), ctx);
-		}
-		return;
+/* dst = src */
+static inline void emit_a32_mov_r64(const bool is64, const u8 dst[],
+				  const u8 src[], bool dstk,
+				  bool sstk, struct jit_ctx *ctx) {
+	emit_a32_mov_r(dst_lo, src_lo, dstk, sstk, ctx);
+	if (is64) {
+		/* complete 8 byte move */
+		emit_a32_mov_r(dst_hi, src_hi, dstk, sstk, ctx);
+	} else {
+		/* Zero out high 4 bytes */
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
 	}
-#endif
+}
 
-	/*
-	 * For BPF_ALU | BPF_DIV | BPF_K instructions, rm is ARM_R4
-	 * (r_A) and rn is ARM_R0 (r_scratch) so load rn first into
-	 * ARM_R1 to avoid accidentally overwriting ARM_R0 with rm
-	 * before using it as a source for ARM_R1.
-	 *
-	 * For BPF_ALU | BPF_DIV | BPF_X rm is ARM_R4 (r_A) and rn is
-	 * ARM_R5 (r_X) so there is no particular register overlap
-	 * issues.
-	 */
-	if (rn != ARM_R1)
-		emit(ARM_MOV_R(ARM_R1, rn), ctx);
-	if (rm != ARM_R0)
-		emit(ARM_MOV_R(ARM_R0, rm), ctx);
+/* Shift and negate operations with an immediate */
+static inline void emit_a32_alu_i(const u8 dst, const u32 val, bool dstk,
+				struct jit_ctx *ctx, const u8 op) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[0] : dst;
+
+	if (dstk)
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+
+	/* Do shift operation */
+	switch (op) {
+	case BPF_LSH:
+		emit(ARM_LSL_I(rd, rd, val), ctx);
+		break;
+	case BPF_RSH:
+		emit(ARM_LSR_I(rd, rd, val), ctx);
+		break;
+	case BPF_NEG:
+		emit(ARM_RSB_I(rd, rd, val), ctx);
+		break;
+	}
+
+	if (dstk)
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+}
+
+/* dst = -dst (64 bit) */
+static inline void emit_a32_neg64(const u8 dst[], bool dstk,
+				struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst[1];
+	u8 rm = dstk ? tmp[0] : dst[0];
+
+	/* Setup Operand */
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do Negate Operation */
+	emit(ARM_RSBS_I(rd, rd, 0), ctx);
+	emit(ARM_RSC_I(rm, rm, 0), ctx);
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
 
+/* dst = dst << src */
+static inline void emit_a32_lsh_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSH operation */
+	emit(ARM_SUB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_RSB_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
 	ctx->seen |= SEEN_CALL;
-	emit_mov_i(ARM_R3, bpf_op == BPF_DIV ? (u32)jit_udiv : (u32)jit_mod,
-		   ctx);
-	emit_blx_r(ARM_R3, ctx);
+	emit(ARM_MOV_SR(ARM_LR, rm, SRTYPE_ASL, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rd, SRTYPE_ASL, ARM_IP), ctx);
+	emit(ARM_ORR_SR(ARM_IP, ARM_LR, rd, SRTYPE_LSR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_ASL, rt), ctx);
+
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
 
-	if (rd != ARM_R0)
-		emit(ARM_MOV_R(rd, ARM_R0), ctx);
+/* dst = dst >> src (signed)*/
+static inline void emit_a32_arsh_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do the ARSH operation */
+	emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
+	_emit(ARM_COND_MI, ARM_B(0), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_ASR, rt), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
 }
 
-static inline void update_on_xread(struct jit_ctx *ctx)
+/* dst = dst >> src */
+static inline void emit_a32_lsr_r64(const u8 dst[], const u8 src[], bool dstk,
+				     bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSR operation */
+	emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_LSR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_LSR, rt), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
+
+/* dst = dst << val */
+static inline void emit_a32_lsh_i64(const u8 dst[], bool dstk,
+				     const u32 val, struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSH operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[0], rm, SRTYPE_ASL, val), ctx);
+		emit(ARM_ORR_SI(rm, tmp2[0], rd, SRTYPE_LSR, 32 - val), ctx);
+		emit(ARM_MOV_SI(rd, rd, SRTYPE_ASL, val), ctx);
+	} else {
+		if (val == 32)
+			emit(ARM_MOV_R(rm, rd), ctx);
+		else
+			emit(ARM_MOV_SI(rm, rd, SRTYPE_ASL, val - 32), ctx);
+		emit(ARM_EOR_R(rd, rd, rd), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+/* dst = dst >> val */
+static inline void emit_a32_lsr_i64(const u8 dst[], bool dstk,
+				    const u32 val, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSR operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
+		emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_LSR, val), ctx);
+	} else if (val == 32) {
+		emit(ARM_MOV_R(rd, rm), ctx);
+		emit(ARM_MOV_I(rm, 0), ctx);
+	} else {
+		emit(ARM_MOV_SI(rd, rm, SRTYPE_LSR, val - 32), ctx);
+		emit(ARM_MOV_I(rm, 0), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+/* dst = dst >> val (signed) */
+static inline void emit_a32_arsh_i64(const u8 dst[], bool dstk,
+				     const u32 val, struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do ARSH operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
+		emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, val), ctx);
+	} else if (val == 32) {
+		emit(ARM_MOV_R(rd, rm), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
+	} else {
+		emit(ARM_MOV_SI(rd, rm, SRTYPE_ASR, val - 32), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+static inline void emit_a32_mul_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands for multiplication */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rn = sstk ? tmp2[0] : src_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+	if (sstk) {
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+		emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_hi)), ctx);
+	}
+
+	/* Do Multiplication */
+	emit(ARM_MUL(ARM_IP, rd, rn), ctx);
+	emit(ARM_MUL(ARM_LR, rm, rt), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_ADD_R(ARM_LR, ARM_IP, ARM_LR), ctx);
+
+	emit(ARM_UMULL(ARM_IP, rm, rd, rt), ctx);
+	emit(ARM_ADD_R(rm, ARM_LR, rm), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_IP), ctx);
+	}
+}
+
+/* *(size *)(dst + off) = src */
+static inline void emit_str_r(const u8 dst, const u8 src, bool dstk,
+			      const s32 off, struct jit_ctx *ctx, const u8 sz){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst;
+
+	if (dstk)
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+	if (off) {
+		emit_a32_mov_i(tmp[0], off, false, ctx);
+		emit(ARM_ADD_R(tmp[0], rd, tmp[0]), ctx);
+		rd = tmp[0];
+	}
+	switch (sz) {
+	case BPF_W:
+		/* Store a Word */
+		emit(ARM_STR_I(src, rd, 0), ctx);
+		break;
+	case BPF_H:
+		/* Store a HalfWord */
+		emit(ARM_STRH_I(src, rd, 0), ctx);
+		break;
+	case BPF_B:
+		/* Store a Byte */
+		emit(ARM_STRB_I(src, rd, 0), ctx);
+		break;
+	}
+}
+
+/* dst = *(size*)(src + off) */
+static inline void emit_ldx_r(const u8 dst, const u8 src, bool dstk,
+			      const s32 off, struct jit_ctx *ctx, const u8 sz){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst;
+	u8 rm = src;
+
+	if (off) {
+		emit_a32_mov_i(tmp[0], off, false, ctx);
+		emit(ARM_ADD_R(tmp[0], tmp[0], src), ctx);
+		rm = tmp[0];
+	}
+	switch (sz) {
+	case BPF_W:
+		/* Load a Word */
+		emit(ARM_LDR_I(rd, rm, 0), ctx);
+		break;
+	case BPF_H:
+		/* Load a HalfWord */
+		emit(ARM_LDRH_I(rd, rm, 0), ctx);
+		break;
+	case BPF_B:
+		/* Load a Byte */
+		emit(ARM_LDRB_I(rd, rm, 0), ctx);
+		break;
+	}
+	if (dstk)
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+}
+
+/* Arithmetic Operation */
+static inline void emit_ar_r(const u8 rd, const u8 rt, const u8 rm,
+			     const u8 rn, struct jit_ctx *ctx, u8 op) {
+	switch (op) {
+	case BPF_JSET:
+		ctx->seen |= SEEN_CALL;
+		emit(ARM_AND_R(ARM_IP, rt, rn), ctx);
+		emit(ARM_AND_R(ARM_LR, rd, rm), ctx);
+		emit(ARM_ORRS_R(ARM_IP, ARM_LR, ARM_IP), ctx);
+		break;
+	case BPF_JEQ:
+	case BPF_JNE:
+	case BPF_JGT:
+	case BPF_JGE:
+		emit(ARM_CMP_R(rd, rm), ctx);
+		_emit(ARM_COND_EQ, ARM_CMP_R(rt, rn), ctx);
+		break;
+	case BPF_JSGT:
+		emit(ARM_CMP_R(rn, rt), ctx);
+		emit(ARM_SBCS_R(ARM_IP, rm, rd), ctx);
+		break;
+	case BPF_JSGE:
+		emit(ARM_CMP_R(rt, rn), ctx);
+		emit(ARM_SBCS_R(ARM_IP, rd, rm), ctx);
+		break;
+	}
+}
+
+static int out_offset = -1; /* initialized on the first pass of build_body() */
+static int emit_bpf_tail_call(struct jit_ctx *ctx)
+{
+
+	/* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
+	const u8 *r2 = bpf2a32[BPF_REG_2];
+	const u8 *r3 = bpf2a32[BPF_REG_3];
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	const u8 *tcc = bpf2a32[TCALL_CNT];
+	const int idx0 = ctx->idx;
+#define cur_offset (ctx->idx - idx0)
+#define jmp_offset (out_offset - (cur_offset))
+	u32 off, lo, hi;
+
+	/* if (index >= array->map.max_entries)
+	 *	goto out;
+	 */
+	off = offsetof(struct bpf_array, map.max_entries);
+	/* array->map.max_entries */
+	emit_a32_mov_i(tmp[1], off, false, ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
+	/* index (64 bit) */
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
+	/* index >= array->map.max_entries */
+	emit(ARM_CMP_R(tmp2[1], tmp[1]), ctx);
+	_emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
+
+	/* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 *	goto out;
+	 * tail_call_cnt++;
+	 */
+	lo = (u32)MAX_TAIL_CALL_CNT;
+	hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
+	emit(ARM_LDR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
+	emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
+	emit(ARM_CMP_I(tmp[0], hi), ctx);
+	_emit(ARM_COND_EQ, ARM_CMP_I(tmp[1], lo), ctx);
+	_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
+	emit(ARM_ADDS_I(tmp[1], tmp[1], 1), ctx);
+	emit(ARM_ADC_I(tmp[0], tmp[0], 0), ctx);
+	emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
+	emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
+
+	/* prog = array->ptrs[index]
+	 * if (prog == NULL)
+	 *	goto out;
+	 */
+	off = offsetof(struct bpf_array, ptrs);
+	emit_a32_mov_i(tmp[1], off, false, ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
+	emit(ARM_ADD_R(tmp[1], tmp2[1], tmp[1]), ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
+	emit(ARM_MOV_SI(tmp[0], tmp2[1], SRTYPE_ASL, 2), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp[1], tmp[0]), ctx);
+	emit(ARM_CMP_I(tmp[1], 0), ctx);
+	_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+
+	/* goto *(prog->bpf_func + prologue_size); */
+	off = offsetof(struct bpf_prog, bpf_func);
+	emit_a32_mov_i(tmp2[1], off, false, ctx);
+	emit(ARM_LDR_R(tmp[1], tmp[1], tmp2[1]), ctx);
+	emit(ARM_ADD_I(tmp[1], tmp[1], ctx->prologue_bytes), ctx);
+	emit(ARM_BX(tmp[1]), ctx);
+
+	/* out: */
+	if (out_offset == -1)
+		out_offset = cur_offset;
+	if (cur_offset != out_offset) {
+		pr_err_once("tail_call out_offset = %d, expected %d!\n",
+			    cur_offset, out_offset);
+		return -1;
+	}
+	return 0;
+#undef cur_offset
+#undef jmp_offset
+}
+
+/* 0xabcd => 0xcdab */
+static inline void emit_rev16(const u8 rd, const u8 rn, struct jit_ctx *ctx)
 {
-	if (!(ctx->seen & SEEN_X))
-		ctx->flags |= FLAG_NEED_X_RESET;
+#if __LINUX_ARM_ARCH__ < 6
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 8), ctx);
+	emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
+	emit(ARM_ORR_SI(rd, tmp2[0], tmp2[1], SRTYPE_LSL, 8), ctx);
+#else /* ARMv6+ */
+	emit(ARM_REV16(rd, rn), ctx);
+#endif
+}
 
-	ctx->seen |= SEEN_X;
+/* 0xabcdefgh => 0xghefcdab */
+static inline void emit_rev32(const u8 rd, const u8 rn, struct jit_ctx *ctx)
+{
+#if __LINUX_ARM_ARCH__ < 6
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 24), ctx);
+	emit(ARM_ORR_SI(ARM_IP, tmp2[0], tmp2[1], SRTYPE_LSL, 24), ctx);
+
+	emit(ARM_MOV_SI(tmp2[1], rn, SRTYPE_LSR, 8), ctx);
+	emit(ARM_AND_I(tmp2[1], tmp2[1], 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 16), ctx);
+	emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], tmp2[0], SRTYPE_LSL, 8), ctx);
+	emit(ARM_ORR_SI(tmp2[0], tmp2[0], tmp2[1], SRTYPE_LSL, 16), ctx);
+	emit(ARM_ORR_R(rd, ARM_IP, tmp2[0]), ctx);
+
+#else /* ARMv6+ */
+	emit(ARM_REV(rd, rn), ctx);
+#endif
 }
 
-static int build_body(struct jit_ctx *ctx)
+static void build_prologue(struct jit_ctx *ctx)
 {
-	void *load_func[] = {jit_get_skb_b, jit_get_skb_h, jit_get_skb_w};
-	const struct bpf_prog *prog = ctx->skf;
-	const struct sock_filter *inst;
-	unsigned i, load_order, off, condt;
-	int imm12;
-	u32 k;
+	const u8 r0 = bpf2a32[BPF_REG_0][1];
+	const u8 r2 = bpf2a32[BPF_REG_1][1];
+	const u8 r3 = bpf2a32[BPF_REG_1][0];
+	const u8 r4 = bpf2a32[BPF_REG_6][1];
+	const u8 r5 = bpf2a32[BPF_REG_6][0];
+	const u8 r6 = bpf2a32[TMP_REG_1][1];
+	const u8 r7 = bpf2a32[TMP_REG_1][0];
+	const u8 r8 = bpf2a32[TMP_REG_2][1];
+	const u8 r10 = bpf2a32[TMP_REG_2][0];
+	const u8 fplo = bpf2a32[BPF_REG_FP][1];
+	const u8 fphi = bpf2a32[BPF_REG_FP][0];
+	const u8 sp = ARM_SP;
+	const u8 *tcc = bpf2a32[TCALL_CNT];
+
+	u16 reg_set = 0;
 
-	for (i = 0; i < prog->len; i++) {
-		u16 code;
+	/*
+	 * eBPF prog stack layout
+	 *
+	 *                         high
+	 * original ARM_SP =>     +-----+ eBPF prologue
+	 *                        |FP/LR|
+	 * current ARM_FP =>      +-----+
+	 *                        | ... | callee saved registers
+	 * eBPF fp register =>    +-----+ <= (BPF_FP)
+	 *                        | ... | eBPF JIT scratch space
+	 *                        |     | eBPF prog stack
+	 *                        +-----+
+	 *                        |RSVD | JIT scratchpad
+	 * current ARM_SP =>      +-----+ <= (BPF_FP - STACK_SIZE)
+	 *                        |     |
+	 *                        | ... | Function call stack
+	 *                        |     |
+	 *                        +-----+
+	 *                          low
+	 */
 
-		inst = &(prog->insns[i]);
-		/* K as an immediate value operand */
-		k = inst->k;
-		code = bpf_anc_helper(inst);
+	/* Save callee saved registers. */
+	reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
+#ifdef CONFIG_FRAME_POINTER
+	reg_set |= (1<<ARM_FP) | (1<<ARM_IP) | (1<<ARM_LR) | (1<<ARM_PC);
+	emit(ARM_MOV_R(ARM_IP, sp), ctx);
+	emit(ARM_PUSH(reg_set), ctx);
+	emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
+#else
+	/* Check if call instruction exists in BPF body */
+	if (ctx->seen & SEEN_CALL)
+		reg_set |= (1<<ARM_LR);
+	emit(ARM_PUSH(reg_set), ctx);
+#endif
+	/* Stash the eBPF frame pointer base in ARM_IP for later */
+	emit(ARM_SUB_I(ARM_IP, sp, SCRATCH_SIZE), ctx);
+
+	/* Set up function call stack */
+	emit(ARM_SUB_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
+
+	/* Set up BPF prog stack base register */
+	emit_a32_mov_r(fplo, ARM_IP, true, false, ctx);
+	emit_a32_mov_i(fphi, 0, true, ctx);
+
+	/* mov r4, 0 */
+	emit(ARM_MOV_I(r4, 0), ctx);
+	/* MOV bpf_ctx pointer to BPF_R1 */
+	emit(ARM_MOV_R(r3, r4), ctx);
+	emit(ARM_MOV_R(r2, r0), ctx);
+	/* Initialize Tail Count */
+	emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[0])), ctx);
+	emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[1])), ctx);
+	/* end of prologue */
+}
 
-		/* compute offsets only in the fake pass */
-		if (ctx->target == NULL)
-			ctx->offsets[i] = ctx->idx * 4;
+static void build_epilogue(struct jit_ctx *ctx)
+{
+	const u8 r4 = bpf2a32[BPF_REG_6][1];
+	const u8 r5 = bpf2a32[BPF_REG_6][0];
+	const u8 r6 = bpf2a32[TMP_REG_1][1];
+	const u8 r7 = bpf2a32[TMP_REG_1][0];
+	const u8 r8 = bpf2a32[TMP_REG_2][1];
+	const u8 r10 = bpf2a32[TMP_REG_2][0];
+	u16 reg_set = 0;
+
+	/* unwind function call stack */
+	emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
+
+	/* restore callee saved registers. */
+	reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
+#ifdef CONFIG_FRAME_POINTER
+	/* the first instruction of the prologue was: mov ip, sp */
+	reg_set |= (1<<ARM_FP) | (1<<ARM_SP) | (1<<ARM_PC);
+	emit(ARM_LDM(ARM_SP, reg_set), ctx);
+#else
+	if (ctx->seen & SEEN_CALL)
+		reg_set |= (1<<ARM_PC);
+	/* Restore callee saved registers. */
+	emit(ARM_POP(reg_set), ctx);
+	/* Return to the caller */
+	if (!(ctx->seen & SEEN_CALL))
+		emit(ARM_BX(ARM_LR), ctx);
+#endif
+}
 
-		switch (code) {
-		case BPF_LD | BPF_IMM:
-			emit_mov_i(r_A, k, ctx);
+/*
+ * Convert an eBPF instruction to native instructions, i.e.
+ * JIT a single eBPF instruction.
+ * Returns:
+ *	0  - Successfully JITed an 8-byte eBPF instruction
+ *	>0 - Successfully JITed a 16-byte eBPF instruction
+ *	<0 - Failed to JIT.
+ */
+static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
+{
+	const u8 code = insn->code;
+	const u8 *dst = bpf2a32[insn->dst_reg];
+	const u8 *src = bpf2a32[insn->src_reg];
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	const s16 off = insn->off;
+	const s32 imm = insn->imm;
+	const int i = insn - ctx->prog->insnsi;
+	const bool is64 = BPF_CLASS(code) == BPF_ALU64;
+	const bool dstk = is_on_stack(insn->dst_reg);
+	const bool sstk = is_on_stack(insn->src_reg);
+	u8 rd, rt, rm, rn;
+	s32 jmp_offset;
+
+#define check_imm(bits, imm) do {				\
+	if ((((imm) > 0) && ((imm) >> (bits))) ||		\
+	    (((imm) < 0) && (~(imm) >> (bits)))) {		\
+		pr_info("[%2d] imm=%d(0x%x) out of range\n",	\
+			i, imm, imm);				\
+		return -EINVAL;					\
+	}							\
+} while (0)
+#define check_imm24(imm) check_imm(24, imm)
+
+	switch (code) {
+	/* ALU operations */
+
+	/* dst = src */
+	case BPF_ALU | BPF_MOV | BPF_K:
+	case BPF_ALU | BPF_MOV | BPF_X:
+	case BPF_ALU64 | BPF_MOV | BPF_K:
+	case BPF_ALU64 | BPF_MOV | BPF_X:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_mov_r64(is64, dst, src, dstk, sstk, ctx);
 			break;
-		case BPF_LD | BPF_W | BPF_LEN:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
-			emit(ARM_LDR_I(r_A, r_skb,
-				       offsetof(struct sk_buff, len)), ctx);
+		case BPF_K:
+			/* Sign-extend immediate value to destination reg */
+			emit_a32_mov_i64(is64, dst, imm, dstk, ctx);
 			break;
-		case BPF_LD | BPF_MEM:
-			/* A = scratch[k] */
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_LDR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
+		}
+		break;
+	/* dst = dst + src/imm */
+	/* dst = dst - src/imm */
+	/* dst = dst | src/imm */
+	/* dst = dst & src/imm */
+	/* dst = dst ^ src/imm */
+	/* dst = dst * src/imm */
+	/* dst = dst << src */
+	/* dst = dst >> src */
+	case BPF_ALU | BPF_ADD | BPF_K:
+	case BPF_ALU | BPF_ADD | BPF_X:
+	case BPF_ALU | BPF_SUB | BPF_K:
+	case BPF_ALU | BPF_SUB | BPF_X:
+	case BPF_ALU | BPF_OR | BPF_K:
+	case BPF_ALU | BPF_OR | BPF_X:
+	case BPF_ALU | BPF_AND | BPF_K:
+	case BPF_ALU | BPF_AND | BPF_X:
+	case BPF_ALU | BPF_XOR | BPF_K:
+	case BPF_ALU | BPF_XOR | BPF_X:
+	case BPF_ALU | BPF_MUL | BPF_K:
+	case BPF_ALU | BPF_MUL | BPF_X:
+	case BPF_ALU | BPF_LSH | BPF_X:
+	case BPF_ALU | BPF_RSH | BPF_X:
+	case BPF_ALU | BPF_ARSH | BPF_K:
+	case BPF_ALU | BPF_ARSH | BPF_X:
+	case BPF_ALU64 | BPF_ADD | BPF_K:
+	case BPF_ALU64 | BPF_ADD | BPF_X:
+	case BPF_ALU64 | BPF_SUB | BPF_K:
+	case BPF_ALU64 | BPF_SUB | BPF_X:
+	case BPF_ALU64 | BPF_OR | BPF_K:
+	case BPF_ALU64 | BPF_OR | BPF_X:
+	case BPF_ALU64 | BPF_AND | BPF_K:
+	case BPF_ALU64 | BPF_AND | BPF_X:
+	case BPF_ALU64 | BPF_XOR | BPF_K:
+	case BPF_ALU64 | BPF_XOR | BPF_X:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_alu_r64(is64, dst, src, dstk, sstk,
+					 ctx, BPF_OP(code));
 			break;
-		case BPF_LD | BPF_W | BPF_ABS:
-			load_order = 2;
-			goto load;
-		case BPF_LD | BPF_H | BPF_ABS:
-			load_order = 1;
-			goto load;
-		case BPF_LD | BPF_B | BPF_ABS:
-			load_order = 0;
-load:
-			emit_mov_i(r_off, k, ctx);
-load_common:
-			ctx->seen |= SEEN_DATA | SEEN_CALL;
-
-			if (load_order > 0) {
-				emit(ARM_SUB_I(r_scratch, r_skb_hl,
-					       1 << load_order), ctx);
-				emit(ARM_CMP_R(r_scratch, r_off), ctx);
-				condt = ARM_COND_GE;
-			} else {
-				emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
-				condt = ARM_COND_HI;
-			}
-
-			/*
-			 * test for negative offset, only if we are
-			 * currently scheduled to take the fast
-			 * path. this will update the flags so that
-			 * the slowpath instruction are ignored if the
-			 * offset is negative.
-			 *
-			 * for loard_order == 0 the HI condition will
-			 * make loads@offset 0 take the slow path too.
+		case BPF_K:
+			/* Move the immediate value into a temporary
+			 * register first. The move sign-extends it as
+			 * needed, so the ALU operation can then be
+			 * applied to the temporary register exactly as
+			 * in the BPF_X case.
 			 */
-			_emit(condt, ARM_CMP_I(r_off, 0), ctx);
-
-			_emit(condt, ARM_ADD_R(r_scratch, r_off, r_skb_data),
-			      ctx);
-
-			if (load_order == 0)
-				_emit(condt, ARM_LDRB_I(r_A, r_scratch, 0),
-				      ctx);
-			else if (load_order == 1)
-				emit_load_be16(condt, r_A, r_scratch, ctx);
-			else if (load_order == 2)
-				emit_load_be32(condt, r_A, r_scratch, ctx);
-
-			_emit(condt, ARM_B(b_imm(i + 1, ctx)), ctx);
-
-			/* the slowpath */
-			emit_mov_i(ARM_R3, (u32)load_func[load_order], ctx);
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			/* the offset is already in R1 */
-			emit_blx_r(ARM_R3, ctx);
-			/* check the result of skb_copy_bits */
-			emit(ARM_CMP_I(ARM_R1, 0), ctx);
-			emit_err_ret(ARM_COND_NE, ctx);
-			emit(ARM_MOV_R(r_A, ARM_R0), ctx);
+			emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
+			emit_a32_alu_r64(is64, dst, tmp2, dstk, false,
+					 ctx, BPF_OP(code));
 			break;
-		case BPF_LD | BPF_W | BPF_IND:
-			load_order = 2;
-			goto load_ind;
-		case BPF_LD | BPF_H | BPF_IND:
-			load_order = 1;
-			goto load_ind;
-		case BPF_LD | BPF_B | BPF_IND:
-			load_order = 0;
-load_ind:
-			update_on_xread(ctx);
-			OP_IMM3(ARM_ADD, r_off, r_X, k, ctx);
-			goto load_common;
-		case BPF_LDX | BPF_IMM:
-			ctx->seen |= SEEN_X;
-			emit_mov_i(r_X, k, ctx);
+		}
+		break;
+	/* dst = dst / src(imm) */
+	/* dst = dst % src(imm) */
+	case BPF_ALU | BPF_DIV | BPF_K:
+	case BPF_ALU | BPF_DIV | BPF_X:
+	case BPF_ALU | BPF_MOD | BPF_K:
+	case BPF_ALU | BPF_MOD | BPF_X:
+		rt = src_lo;
+		rd = dstk ? tmp2[1] : dst_lo;
+		if (dstk)
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			rt = sstk ? tmp2[0] : rt;
+			if (sstk)
+				emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)),
+				     ctx);
 			break;
-		case BPF_LDX | BPF_W | BPF_LEN:
-			ctx->seen |= SEEN_X | SEEN_SKB;
-			emit(ARM_LDR_I(r_X, r_skb,
-				       offsetof(struct sk_buff, len)), ctx);
+		case BPF_K:
+			rt = tmp2[0];
+			emit_a32_mov_i(rt, imm, false, ctx);
 			break;
-		case BPF_LDX | BPF_MEM:
-			ctx->seen |= SEEN_X | SEEN_MEM_WORD(k);
-			emit(ARM_LDR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
+		}
+		emit_udivmod(rd, rd, rt, ctx, BPF_OP(code));
+		if (dstk)
+			emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	case BPF_ALU64 | BPF_DIV | BPF_K:
+	case BPF_ALU64 | BPF_DIV | BPF_X:
+	case BPF_ALU64 | BPF_MOD | BPF_K:
+	case BPF_ALU64 | BPF_MOD | BPF_X:
+		goto notyet;
+	/* dst = dst >> imm */
+	/* dst = dst << imm */
+	case BPF_ALU | BPF_RSH | BPF_K:
+	case BPF_ALU | BPF_LSH | BPF_K:
+		if (unlikely(imm > 31))
+			return -EINVAL;
+		if (imm)
+			emit_a32_alu_i(dst_lo, imm, dstk, ctx, BPF_OP(code));
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	/* dst = dst << imm */
+	case BPF_ALU64 | BPF_LSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_lsh_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = dst >> imm */
+	case BPF_ALU64 | BPF_RSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_lsr_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = dst << src */
+	case BPF_ALU64 | BPF_LSH | BPF_X:
+		emit_a32_lsh_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> src */
+	case BPF_ALU64 | BPF_RSH | BPF_X:
+		emit_a32_lsr_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> src (signed) */
+	case BPF_ALU64 | BPF_ARSH | BPF_X:
+		emit_a32_arsh_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> imm (signed) */
+	case BPF_ALU64 | BPF_ARSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_arsh_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = -dst */
+	case BPF_ALU | BPF_NEG:
+		emit_a32_alu_i(dst_lo, 0, dstk, ctx, BPF_OP(code));
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	/* dst = -dst (64 bit) */
+	case BPF_ALU64 | BPF_NEG:
+		emit_a32_neg64(dst, dstk, ctx);
+		break;
+	/* dst = dst * src/imm */
+	case BPF_ALU64 | BPF_MUL | BPF_X:
+	case BPF_ALU64 | BPF_MUL | BPF_K:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_mul_r64(dst, src, dstk, sstk, ctx);
 			break;
-		case BPF_LDX | BPF_B | BPF_MSH:
-			/* x = ((*(frame + k)) & 0xf) << 2; */
-			ctx->seen |= SEEN_X | SEEN_DATA | SEEN_CALL;
-			/* the interpreter should deal with the negative K */
-			if ((int)k < 0)
-				return -1;
-			/* offset in r1: we might have to take the slow path */
-			emit_mov_i(r_off, k, ctx);
-			emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
-
-			/* load in r0: common with the slowpath */
-			_emit(ARM_COND_HI, ARM_LDRB_R(ARM_R0, r_skb_data,
-						      ARM_R1), ctx);
-			/*
-			 * emit_mov_i() might generate one or two instructions,
-			 * the same holds for emit_blx_r()
+		case BPF_K:
+			/* Move the immediate value into a temporary
+			 * register first. The move sign-extends it to
+			 * 64 bits, so the multiplication can then be
+			 * done on the temporary register exactly as in
+			 * the BPF_X case.
 			 */
-			_emit(ARM_COND_HI, ARM_B(b_imm(i + 1, ctx) - 2), ctx);
-
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			/* r_off is r1 */
-			emit_mov_i(ARM_R3, (u32)jit_get_skb_b, ctx);
-			emit_blx_r(ARM_R3, ctx);
-			/* check the return value of skb_copy_bits */
-			emit(ARM_CMP_I(ARM_R1, 0), ctx);
-			emit_err_ret(ARM_COND_NE, ctx);
-
-			emit(ARM_AND_I(r_X, ARM_R0, 0x00f), ctx);
-			emit(ARM_LSL_I(r_X, r_X, 2), ctx);
-			break;
-		case BPF_ST:
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_STR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
-			break;
-		case BPF_STX:
-			update_on_xread(ctx);
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_STR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
-			break;
-		case BPF_ALU | BPF_ADD | BPF_K:
-			/* A += K */
-			OP_IMM3(ARM_ADD, r_A, r_A, k, ctx);
-			break;
-		case BPF_ALU | BPF_ADD | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_ADD_R(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_SUB | BPF_K:
-			/* A -= K */
-			OP_IMM3(ARM_SUB, r_A, r_A, k, ctx);
-			break;
-		case BPF_ALU | BPF_SUB | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_SUB_R(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_MUL | BPF_K:
-			/* A *= K */
-			emit_mov_i(r_scratch, k, ctx);
-			emit(ARM_MUL(r_A, r_A, r_scratch), ctx);
-			break;
-		case BPF_ALU | BPF_MUL | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_MUL(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_DIV | BPF_K:
-			if (k == 1)
-				break;
-			emit_mov_i(r_scratch, k, ctx);
-			emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_DIV);
-			break;
-		case BPF_ALU | BPF_DIV | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_CMP_I(r_X, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-			emit_udivmod(r_A, r_A, r_X, ctx, BPF_DIV);
-			break;
-		case BPF_ALU | BPF_MOD | BPF_K:
-			if (k == 1) {
-				emit_mov_i(r_A, 0, ctx);
-				break;
-			}
-			emit_mov_i(r_scratch, k, ctx);
-			emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_MOD);
+			emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
+			emit_a32_mul_r64(dst, tmp2, dstk, false, ctx);
 			break;
-		case BPF_ALU | BPF_MOD | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_CMP_I(r_X, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-			emit_udivmod(r_A, r_A, r_X, ctx, BPF_MOD);
-			break;
-		case BPF_ALU | BPF_OR | BPF_K:
-			/* A |= K */
-			OP_IMM3(ARM_ORR, r_A, r_A, k, ctx);
+		}
+		break;
+	/* dst = htole(dst) */
+	/* dst = htobe(dst) */
+	case BPF_ALU | BPF_END | BPF_FROM_LE:
+	case BPF_ALU | BPF_END | BPF_FROM_BE:
+		rd = dstk ? tmp[0] : dst_hi;
+		rt = dstk ? tmp[1] : dst_lo;
+		if (dstk) {
+			emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+		if (BPF_SRC(code) == BPF_FROM_LE)
+			goto emit_bswap_uxt;
+		switch (imm) {
+		case 16:
+			emit_rev16(rt, rt, ctx);
+			goto emit_bswap_uxt;
+		case 32:
+			emit_rev32(rt, rt, ctx);
+			goto emit_bswap_uxt;
+		case 64:
+			/* Because of the usage of ARM_LR */
+			ctx->seen |= SEEN_CALL;
+			emit_rev32(ARM_LR, rt, ctx);
+			emit_rev32(rt, rd, ctx);
+			emit(ARM_MOV_R(rd, ARM_LR), ctx);
 			break;
-		case BPF_ALU | BPF_OR | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_ORR_R(r_A, r_A, r_X), ctx);
+		}
+		goto exit;
+emit_bswap_uxt:
+		switch (imm) {
+		case 16:
+			/* zero-extend 16 bits into 64 bits */
+#if __LINUX_ARM_ARCH__ < 6
+			emit_a32_mov_i(tmp2[1], 0xffff, false, ctx);
+			emit(ARM_AND_R(rt, rt, tmp2[1]), ctx);
+#else /* ARMv6+ */
+			emit(ARM_UXTH(rt, rt), ctx);
+#endif
+			emit(ARM_EOR_R(rd, rd, rd), ctx);
 			break;
-		case BPF_ALU | BPF_XOR | BPF_K:
-			/* A ^= K; */
-			OP_IMM3(ARM_EOR, r_A, r_A, k, ctx);
+		case 32:
+			/* zero-extend 32 bits into 64 bits */
+			emit(ARM_EOR_R(rd, rd, rd), ctx);
 			break;
-		case BPF_ANC | SKF_AD_ALU_XOR_X:
-		case BPF_ALU | BPF_XOR | BPF_X:
-			/* A ^= X */
-			update_on_xread(ctx);
-			emit(ARM_EOR_R(r_A, r_A, r_X), ctx);
+		case 64:
+			/* nop */
 			break;
-		case BPF_ALU | BPF_AND | BPF_K:
-			/* A &= K */
-			OP_IMM3(ARM_AND, r_A, r_A, k, ctx);
+		}
+exit:
+		if (dstk) {
+			emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+		break;
+	/* dst = imm64 */
+	case BPF_LD | BPF_IMM | BPF_DW:
+	{
+		const struct bpf_insn insn1 = insn[1];
+		u32 hi, lo = imm;
+
+		if (insn1.code != 0 || insn1.src_reg != 0 ||
+		    insn1.dst_reg != 0 || insn1.off != 0) {
+			/* Note: the verifier in the BPF core should have
+			 * rejected an invalid instruction already.
+			 */
+			pr_err_once("Invalid BPF_LD_IMM64 instruction\n");
+			return -EINVAL;
+		}
+		hi = insn1.imm;
+		emit_a32_mov_i(dst_lo, lo, dstk, ctx);
+		emit_a32_mov_i(dst_hi, hi, dstk, ctx);
+
+		return 1;
+	}
+	/* LDX: dst = *(size *)(src + off) */
+	case BPF_LDX | BPF_MEM | BPF_W:
+	case BPF_LDX | BPF_MEM | BPF_H:
+	case BPF_LDX | BPF_MEM | BPF_B:
+	case BPF_LDX | BPF_MEM | BPF_DW:
+		rn = sstk ? tmp2[1] : src_lo;
+		if (sstk)
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			/* Load a Word */
+		case BPF_H:
+			/* Load a Half-Word */
+		case BPF_B:
+			/* Load a Byte */
+			emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_SIZE(code));
+			emit_a32_mov_i(dst_hi, 0, dstk, ctx);
 			break;
-		case BPF_ALU | BPF_AND | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_AND_R(r_A, r_A, r_X), ctx);
+		case BPF_DW:
+			/* Load a double word */
+			emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_W);
+			emit_ldx_r(dst_hi, rn, dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_ALU | BPF_LSH | BPF_K:
-			if (unlikely(k > 31))
-				return -1;
-			emit(ARM_LSL_I(r_A, r_A, k), ctx);
+		}
+		break;
+	/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
+	case BPF_LD | BPF_ABS | BPF_W:
+	case BPF_LD | BPF_ABS | BPF_H:
+	case BPF_LD | BPF_ABS | BPF_B:
+	/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
+	case BPF_LD | BPF_IND | BPF_W:
+	case BPF_LD | BPF_IND | BPF_H:
+	case BPF_LD | BPF_IND | BPF_B:
+	{
+		const u8 r4 = bpf2a32[BPF_REG_6][1]; /* r4 = ptr to sk_buff */
+		const u8 r0 = bpf2a32[BPF_REG_0][1]; /*r0: struct sk_buff *skb*/
+						     /* rtn value */
+		const u8 r1 = bpf2a32[BPF_REG_0][0]; /* r1: int k */
+		const u8 r2 = bpf2a32[BPF_REG_1][1]; /* r2: unsigned int size */
+		const u8 r3 = bpf2a32[BPF_REG_1][0]; /* r3: void *buffer */
+		const u8 r6 = bpf2a32[TMP_REG_1][1]; /* r6: void *(*func)(..) */
+		int size;
+
+		/* Setting up first argument */
+		emit(ARM_MOV_R(r0, r4), ctx);
+
+		/* Setting up second argument */
+		emit_a32_mov_i(r1, imm, false, ctx);
+		if (BPF_MODE(code) == BPF_IND)
+			emit_a32_alu_r(r1, src_lo, false, sstk, ctx,
+				       false, false, BPF_ADD);
+
+		/* Setting up third argument */
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			size = 4;
 			break;
-		case BPF_ALU | BPF_LSH | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_LSL_R(r_A, r_A, r_X), ctx);
+		case BPF_H:
+			size = 2;
 			break;
-		case BPF_ALU | BPF_RSH | BPF_K:
-			if (unlikely(k > 31))
-				return -1;
-			if (k)
-				emit(ARM_LSR_I(r_A, r_A, k), ctx);
+		case BPF_B:
+			size = 1;
 			break;
-		case BPF_ALU | BPF_RSH | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_LSR_R(r_A, r_A, r_X), ctx);
+		default:
+			return -EINVAL;
+		}
+		emit_a32_mov_i(r2, size, false, ctx);
+
+		/* Setting up fourth argument */
+		emit(ARM_ADD_I(r3, ARM_SP, imm8m(SKB_BUFFER)), ctx);
+
+		/* Setting up function pointer to call */
+		emit_a32_mov_i(r6, (unsigned int)bpf_load_pointer, false, ctx);
+		emit_blx_r(r6, ctx);
+
+		emit(ARM_EOR_R(r1, r1, r1), ctx);
+		/* Check whether the returned address is NULL.
+		 * If it is NULL, jump to the epilogue;
+		 * otherwise load the value from the returned address.
+		 */
+		emit(ARM_CMP_I(r0, 0), ctx);
+		jmp_offset = epilogue_offset(ctx);
+		check_imm24(jmp_offset);
+		_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+
+		/* Load value from the address */
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			emit(ARM_LDR_I(r0, r0, 0), ctx);
+			emit_rev32(r0, r0, ctx);
 			break;
-		case BPF_ALU | BPF_NEG:
-			/* A = -A */
-			emit(ARM_RSB_I(r_A, r_A, 0), ctx);
+		case BPF_H:
+			emit(ARM_LDRH_I(r0, r0, 0), ctx);
+			emit_rev16(r0, r0, ctx);
 			break;
-		case BPF_JMP | BPF_JA:
-			/* pc += K */
-			emit(ARM_B(b_imm(i + k + 1, ctx)), ctx);
+		case BPF_B:
+			emit(ARM_LDRB_I(r0, r0, 0), ctx);
+			/* No need to reverse */
 			break;
-		case BPF_JMP | BPF_JEQ | BPF_K:
-			/* pc += (A == K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_EQ;
-			goto cmp_imm;
-		case BPF_JMP | BPF_JGT | BPF_K:
-			/* pc += (A > K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_HI;
-			goto cmp_imm;
-		case BPF_JMP | BPF_JGE | BPF_K:
-			/* pc += (A >= K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_HS;
-cmp_imm:
-			imm12 = imm8m(k);
-			if (imm12 < 0) {
-				emit_mov_i_no8m(r_scratch, k, ctx);
-				emit(ARM_CMP_R(r_A, r_scratch), ctx);
-			} else {
-				emit(ARM_CMP_I(r_A, imm12), ctx);
-			}
-cond_jump:
-			if (inst->jt)
-				_emit(condt, ARM_B(b_imm(i + inst->jt + 1,
-						   ctx)), ctx);
-			if (inst->jf)
-				_emit(condt ^ 1, ARM_B(b_imm(i + inst->jf + 1,
-							     ctx)), ctx);
+		}
+		break;
+	}
+	/* ST: *(size *)(dst + off) = imm */
+	case BPF_ST | BPF_MEM | BPF_W:
+	case BPF_ST | BPF_MEM | BPF_H:
+	case BPF_ST | BPF_MEM | BPF_B:
+	case BPF_ST | BPF_MEM | BPF_DW:
+		switch (BPF_SIZE(code)) {
+		case BPF_DW:
+			/* Sign-extend immediate value into temp reg */
+			emit_a32_mov_i64(true, tmp2, imm, false, ctx);
+			emit_str_r(dst_lo, tmp2[1], dstk, off, ctx, BPF_W);
+			emit_str_r(dst_lo, tmp2[0], dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_JMP | BPF_JEQ | BPF_X:
-			/* pc += (A == X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_EQ;
-			goto cmp_x;
-		case BPF_JMP | BPF_JGT | BPF_X:
-			/* pc += (A > X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_HI;
-			goto cmp_x;
-		case BPF_JMP | BPF_JGE | BPF_X:
-			/* pc += (A >= X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_CS;
-cmp_x:
-			update_on_xread(ctx);
-			emit(ARM_CMP_R(r_A, r_X), ctx);
-			goto cond_jump;
-		case BPF_JMP | BPF_JSET | BPF_K:
-			/* pc += (A & K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_NE;
-			/* not set iff all zeroes iff Z==1 iff EQ */
-
-			imm12 = imm8m(k);
-			if (imm12 < 0) {
-				emit_mov_i_no8m(r_scratch, k, ctx);
-				emit(ARM_TST_R(r_A, r_scratch), ctx);
-			} else {
-				emit(ARM_TST_I(r_A, imm12), ctx);
-			}
-			goto cond_jump;
-		case BPF_JMP | BPF_JSET | BPF_X:
-			/* pc += (A & X) ? pc->jt : pc->jf */
-			update_on_xread(ctx);
-			condt  = ARM_COND_NE;
-			emit(ARM_TST_R(r_A, r_X), ctx);
-			goto cond_jump;
-		case BPF_RET | BPF_A:
-			emit(ARM_MOV_R(ARM_R0, r_A), ctx);
-			goto b_epilogue;
-		case BPF_RET | BPF_K:
-			if ((k == 0) && (ctx->ret0_fp_idx < 0))
-				ctx->ret0_fp_idx = i;
-			emit_mov_i(ARM_R0, k, ctx);
-b_epilogue:
-			if (i != ctx->skf->len - 1)
-				emit(ARM_B(b_imm(prog->len, ctx)), ctx);
+		case BPF_W:
+		case BPF_H:
+		case BPF_B:
+			emit_a32_mov_i(tmp2[1], imm, false, ctx);
+			emit_str_r(dst_lo, tmp2[1], dstk, off, ctx,
+				   BPF_SIZE(code));
 			break;
-		case BPF_MISC | BPF_TAX:
-			/* X = A */
-			ctx->seen |= SEEN_X;
-			emit(ARM_MOV_R(r_X, r_A), ctx);
+		}
+		break;
+	/* STX XADD: lock *(u32 *)(dst + off) += src */
+	case BPF_STX | BPF_XADD | BPF_W:
+	/* STX XADD: lock *(u64 *)(dst + off) += src */
+	case BPF_STX | BPF_XADD | BPF_DW:
+		goto notyet;
+	/* STX: *(size *)(dst + off) = src */
+	case BPF_STX | BPF_MEM | BPF_W:
+	case BPF_STX | BPF_MEM | BPF_H:
+	case BPF_STX | BPF_MEM | BPF_B:
+	case BPF_STX | BPF_MEM | BPF_DW:
+	{
+		u8 sz = BPF_SIZE(code);
+
+		rn = sstk ? tmp2[1] : src_lo;
+		rm = sstk ? tmp2[0] : src_hi;
+		if (sstk) {
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
+		}
+
+		/* Store the value. Only BPF_DW writes the high word:
+		 * narrower stores must not touch memory at off+4, and
+		 * the high word of the source must not be clobbered.
+		 */
+		switch (sz) {
+		case BPF_DW:
+			emit_str_r(dst_lo, rn, dstk, off, ctx, BPF_W);
+			emit_str_r(dst_lo, rm, dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_MISC | BPF_TXA:
-			/* A = X */
-			update_on_xread(ctx);
-			emit(ARM_MOV_R(r_A, r_X), ctx);
+		default:
+			emit_str_r(dst_lo, rn, dstk, off, ctx, sz);
+			break;
+		}
+		break;
+	}
+	/* PC += off if dst == src */
+	/* PC += off if dst > src */
+	/* PC += off if dst >= src */
+	/* PC += off if dst != src */
+	/* PC += off if dst > src (signed) */
+	/* PC += off if dst >= src (signed) */
+	/* PC += off if dst & src */
+	case BPF_JMP | BPF_JEQ | BPF_X:
+	case BPF_JMP | BPF_JGT | BPF_X:
+	case BPF_JMP | BPF_JGE | BPF_X:
+	case BPF_JMP | BPF_JNE | BPF_X:
+	case BPF_JMP | BPF_JSGT | BPF_X:
+	case BPF_JMP | BPF_JSGE | BPF_X:
+	case BPF_JMP | BPF_JSET | BPF_X:
+		/* Setup source registers */
+		rm = sstk ? tmp2[0] : src_hi;
+		rn = sstk ? tmp2[1] : src_lo;
+		if (sstk) {
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
+		}
+		goto go_jmp;
+	/* PC += off if dst == imm */
+	/* PC += off if dst > imm */
+	/* PC += off if dst >= imm */
+	/* PC += off if dst != imm */
+	/* PC += off if dst > imm (signed) */
+	/* PC += off if dst >= imm (signed) */
+	/* PC += off if dst & imm */
+	case BPF_JMP | BPF_JEQ | BPF_K:
+	case BPF_JMP | BPF_JGT | BPF_K:
+	case BPF_JMP | BPF_JGE | BPF_K:
+	case BPF_JMP | BPF_JNE | BPF_K:
+	case BPF_JMP | BPF_JSGT | BPF_K:
+	case BPF_JMP | BPF_JSGE | BPF_K:
+	case BPF_JMP | BPF_JSET | BPF_K:
+		if (off == 0)
 			break;
-		case BPF_ANC | SKF_AD_PROTOCOL:
-			/* A = ntohs(skb->protocol) */
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  protocol) != 2);
-			off = offsetof(struct sk_buff, protocol);
-			emit(ARM_LDRH_I(r_scratch, r_skb, off), ctx);
-			emit_swap16(r_A, r_scratch, ctx);
+		rm = tmp2[0];
+		rn = tmp2[1];
+		/* Sign-extend immediate value */
+		emit_a32_mov_i64(true, tmp2, imm, false, ctx);
+go_jmp:
+		/* Setup destination register */
+		rd = dstk ? tmp[0] : dst_hi;
+		rt = dstk ? tmp[1] : dst_lo;
+		if (dstk) {
+			emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+
+		/* Check for the condition */
+		emit_ar_r(rd, rt, rm, rn, ctx, BPF_OP(code));
+
+		/* Setup JUMP instruction */
+		jmp_offset = bpf2a32_offset(i+off, i, ctx);
+		switch (BPF_OP(code)) {
+		case BPF_JNE:
+		case BPF_JSET:
+			_emit(ARM_COND_NE, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_CPU:
-			/* r_scratch = current_thread_info() */
-			OP_IMM3(ARM_BIC, r_scratch, ARM_SP, THREAD_SIZE - 1, ctx);
-			/* A = current_thread_info()->cpu */
-			BUILD_BUG_ON(FIELD_SIZEOF(struct thread_info, cpu) != 4);
-			off = offsetof(struct thread_info, cpu);
-			emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
+		case BPF_JEQ:
+			_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_IFINDEX:
-		case BPF_ANC | SKF_AD_HATYPE:
-			/* A = skb->dev->ifindex */
-			/* A = skb->dev->type */
-			ctx->seen |= SEEN_SKB;
-			off = offsetof(struct sk_buff, dev);
-			emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
-
-			emit(ARM_CMP_I(r_scratch, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-
-			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
-						  ifindex) != 4);
-			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
-						  type) != 2);
-
-			if (code == (BPF_ANC | SKF_AD_IFINDEX)) {
-				off = offsetof(struct net_device, ifindex);
-				emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
-			} else {
-				/*
-				 * offset of field "type" in "struct
-				 * net_device" is above what can be
-				 * used in the ldrh rd, [rn, #imm]
-				 * instruction, so load the offset in
-				 * a register and use ldrh rd, [rn, rm]
-				 */
-				off = offsetof(struct net_device, type);
-				emit_mov_i(ARM_R3, off, ctx);
-				emit(ARM_LDRH_R(r_A, r_scratch, ARM_R3), ctx);
-			}
+		case BPF_JGT:
+			_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_MARK:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
-			off = offsetof(struct sk_buff, mark);
-			emit(ARM_LDR_I(r_A, r_skb, off), ctx);
+		case BPF_JGE:
+			_emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_RXHASH:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, hash) != 4);
-			off = offsetof(struct sk_buff, hash);
-			emit(ARM_LDR_I(r_A, r_skb, off), ctx);
+		case BPF_JSGT:
+			_emit(ARM_COND_LT, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_VLAN_TAG:
-		case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2);
-			off = offsetof(struct sk_buff, vlan_tci);
-			emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
-			if (code == (BPF_ANC | SKF_AD_VLAN_TAG))
-				OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx);
-			else {
-				OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx);
-				OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
-			}
+		case BPF_JSGE:
+			_emit(ARM_COND_GE, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_PKTTYPE:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  __pkt_type_offset[0]) != 1);
-			off = PKT_TYPE_OFFSET();
-			emit(ARM_LDRB_I(r_A, r_skb, off), ctx);
-			emit(ARM_AND_I(r_A, r_A, PKT_TYPE_MAX), ctx);
-#ifdef __BIG_ENDIAN_BITFIELD
-			emit(ARM_LSR_I(r_A, r_A, 5), ctx);
-#endif
+		}
+		break;
+	/* JMP OFF */
+	case BPF_JMP | BPF_JA:
+	{
+		if (off == 0)
 			break;
-		case BPF_ANC | SKF_AD_QUEUE:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  queue_mapping) != 2);
-			BUILD_BUG_ON(offsetof(struct sk_buff,
-					      queue_mapping) > 0xff);
-			off = offsetof(struct sk_buff, queue_mapping);
-			emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
+		jmp_offset = bpf2a32_offset(i+off, i, ctx);
+		check_imm24(jmp_offset);
+		emit(ARM_B(jmp_offset), ctx);
+		break;
+	}
+	/* tail call */
+	case BPF_JMP | BPF_CALL | BPF_X:
+		if (emit_bpf_tail_call(ctx))
+			return -EFAULT;
+		break;
+	/* function call */
+	case BPF_JMP | BPF_CALL:
+		goto notyet;
+	/* function return */
+	case BPF_JMP | BPF_EXIT:
+		/* Optimization: when last instruction is EXIT
+		 * simply fallthrough to epilogue.
+		 */
+		if (i == ctx->prog->len - 1)
 			break;
-		case BPF_ANC | SKF_AD_PAY_OFFSET:
-			ctx->seen |= SEEN_SKB | SEEN_CALL;
+		jmp_offset = epilogue_offset(ctx);
+		check_imm24(jmp_offset);
+		emit(ARM_B(jmp_offset), ctx);
+		break;
+notyet:
+		pr_info_once("*** NOT YET: opcode %02x ***\n", code);
+		return -EFAULT;
+	default:
+		pr_err_once("unknown opcode %02x\n", code);
+		return -EINVAL;
+	}
 
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			emit_mov_i(ARM_R3, (unsigned int)skb_get_poff, ctx);
-			emit_blx_r(ARM_R3, ctx);
-			emit(ARM_MOV_R(r_A, ARM_R0), ctx);
-			break;
-		case BPF_LDX | BPF_W | BPF_ABS:
-			/*
-			 * load a 32bit word from struct seccomp_data.
-			 * seccomp_check_filter() will already have checked
-			 * that k is 32bit aligned and lies within the
-			 * struct seccomp_data.
-			 */
-			ctx->seen |= SEEN_SKB;
-			emit(ARM_LDR_I(r_A, r_skb, k), ctx);
-			break;
-		default:
-			return -1;
+	if (ctx->flags & FLAG_IMM_OVERFLOW)
+		/*
+		 * this instruction generated an overflow when
+		 * trying to access the literal pool, so
+		 * delegate this filter to the kernel interpreter.
+		 */
+		return -1;
+	return 0;
+}
+
+static int build_body(struct jit_ctx *ctx)
+{
+	const struct bpf_prog *prog = ctx->prog;
+	unsigned int i;
+
+	for (i = 0; i < prog->len; i++) {
+		const struct bpf_insn *insn = &(prog->insnsi[i]);
+		int ret;
+
+		ret = build_insn(insn, ctx);
+
+		/* ret > 0: 64-bit immediate load, which spans two insns. */
+		if (ret > 0) {
+			i++;
+			if (ctx->target == NULL)
+				ctx->offsets[i] = ctx->idx;
+			continue;
 		}
 
-		if (ctx->flags & FLAG_IMM_OVERFLOW)
-			/*
-			 * this instruction generated an overflow when
-			 * trying to access the literal pool, so
-			 * delegate this filter to the kernel interpreter.
-			 */
-			return -1;
+		if (ctx->target == NULL)
+			ctx->offsets[i] = ctx->idx;
+
+		/* If unsuccessful, return with error code */
+		if (ret)
+			return ret;
 	}
+	return 0;
+}
 
-	/* compute offsets only during the first pass */
-	if (ctx->target == NULL)
-		ctx->offsets[i] = ctx->idx * 4;
+static int validate_code(struct jit_ctx *ctx)
+{
+	int i;
+
+	for (i = 0; i < ctx->idx; i++) {
+		u32 a32_insn = le32_to_cpu(ctx->target[i]);
+
+		if (a32_insn == ARM_INST_UDF)
+			return -1;
+	}
 
 	return 0;
 }
 
+void bpf_jit_compile(struct bpf_prog *prog)
+{
+	/* Nothing to do here. We support Internal BPF. */
+}
 
-void bpf_jit_compile(struct bpf_prog *fp)
+struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
+#ifdef __LITTLE_ENDIAN
+	struct bpf_prog *tmp, *orig_prog = prog;
 	struct bpf_binary_header *header;
+	bool tmp_blinded = false;
 	struct jit_ctx ctx;
-	unsigned tmp_idx;
-	unsigned alloc_size;
-	u8 *target_ptr;
+	unsigned int tmp_idx;
+	unsigned int image_size;
+	u8 *image_ptr;
 
+	/* If BPF JIT was not enabled then we must fall back to
+	 * the interpreter.
+	 */
 	if (!bpf_jit_enable)
-		return;
+		return orig_prog;
 
-	memset(&ctx, 0, sizeof(ctx));
-	ctx.skf		= fp;
-	ctx.ret0_fp_idx = -1;
+	/* If constant blinding was enabled and we failed during blinding
+	 * then we must fall back to the interpreter. Otherwise, we save
+	 * the new JITed code.
+	 */
+	tmp = bpf_jit_blind_constants(prog);
 
-	ctx.offsets = kzalloc(4 * (ctx.skf->len + 1), GFP_KERNEL);
-	if (ctx.offsets == NULL)
-		return;
+	if (IS_ERR(tmp))
+		return orig_prog;
+	if (tmp != prog) {
+		tmp_blinded = true;
+		prog = tmp;
+	}
+
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.prog = prog;
 
-	/* fake pass to fill in the ctx->seen */
-	if (unlikely(build_body(&ctx)))
+	/* If we cannot allocate memory for offsets[], then
+	 * we must fall back to the interpreter
+	 */
+	ctx.offsets = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
+	if (ctx.offsets == NULL) {
+		prog = orig_prog;
 		goto out;
+	}
+
+	/* 1.) Fake pass to find the length of the JITed code,
+	 * to compute ctx->offsets and other context variables
+	 * needed to compute final JITed code.
+	 * Also, calculate random starting pointer/start of JITed code
+	 * which is prefixed by random number of fault instructions.
+	 *
+	 * If the first pass fails then there is no chance of it
+	 * being successful in the second pass, so just fall back
+	 * to the interpreter.
+	 */
+	if (build_body(&ctx)) {
+		prog = orig_prog;
+		goto out_off;
+	}
 
 	tmp_idx = ctx.idx;
 	build_prologue(&ctx);
 	ctx.prologue_bytes = (ctx.idx - tmp_idx) * 4;
 
+	ctx.epilogue_offset = ctx.idx;
+
 #if __LINUX_ARM_ARCH__ < 7
 	tmp_idx = ctx.idx;
 	build_epilogue(&ctx);
@@ -1020,64 +1838,96 @@ void bpf_jit_compile(struct bpf_prog *fp)
 
 	ctx.idx += ctx.imm_count;
 	if (ctx.imm_count) {
-		ctx.imms = kzalloc(4 * ctx.imm_count, GFP_KERNEL);
-		if (ctx.imms == NULL)
-			goto out;
+		ctx.imms = kcalloc(ctx.imm_count, sizeof(u32), GFP_KERNEL);
+		if (ctx.imms == NULL) {
+			prog = orig_prog;
+			goto out_off;
+		}
 	}
 #else
-	/* there's nothing after the epilogue on ARMv7 */
+	/* there's nothing about the epilogue on ARMv7 */
 	build_epilogue(&ctx);
 #endif
-	alloc_size = 4 * ctx.idx;
-	header = bpf_jit_binary_alloc(alloc_size, &target_ptr,
-				      4, jit_fill_hole);
-	if (header == NULL)
-		goto out;
+	/* Now we can get the actual image size of the JITed arm code.
+	 * Currently, we are not considering the THUMB-2 instructions
+	 * for jit, although it can decrease the size of the image.
+	 *
+	 * As each arm instruction is of length 32bit, we are translating
+	 * number of JITed instructions into the size required to store this
+	 * JITed code.
+	 */
+	image_size = sizeof(u32) * ctx.idx;
 
-	ctx.target = (u32 *) target_ptr;
+	/* Now we know the size of the structure to make */
+	header = bpf_jit_binary_alloc(image_size, &image_ptr,
+				      sizeof(u32), jit_fill_hole);
+	/* If we cannot allocate memory for the structure, then
+	 * we must fall back to the interpreter
+	 */
+	if (header == NULL) {
+		prog = orig_prog;
+		goto out_imms;
+	}
+
+	/* 2.) Actual pass to generate final JIT code */
+	ctx.target = (u32 *) image_ptr;
 	ctx.idx = 0;
 
 	build_prologue(&ctx);
+
+	/* If building the body of the JITed code fails somehow,
+	 * we fall back to the interpreter.
+	 */
 	if (build_body(&ctx) < 0) {
-#if __LINUX_ARM_ARCH__ < 7
-		if (ctx.imm_count)
-			kfree(ctx.imms);
-#endif
+		image_ptr = NULL;
 		bpf_jit_binary_free(header);
-		goto out;
+		prog = orig_prog;
+		goto out_imms;
 	}
 	build_epilogue(&ctx);
 
+	/* 3.) Extra pass to validate JITed Code */
+	if (validate_code(&ctx)) {
+		image_ptr = NULL;
+		bpf_jit_binary_free(header);
+		prog = orig_prog;
+		goto out_imms;
+	}
 	flush_icache_range((u32)header, (u32)(ctx.target + ctx.idx));
 
-#if __LINUX_ARM_ARCH__ < 7
-	if (ctx.imm_count)
-		kfree(ctx.imms);
-#endif
-
 	if (bpf_jit_enable > 1)
 		/* there are 2 passes here */
-		bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
+		bpf_jit_dump(prog->len, image_size, 2, ctx.target);
 
 	set_memory_ro((unsigned long)header, header->pages);
-	fp->bpf_func = (void *)ctx.target;
-	fp->jited = 1;
-out:
+	prog->bpf_func = (void *)ctx.target;
+	prog->jited = 1;
+out_imms:
+#if __LINUX_ARM_ARCH__ < 7
+	if (ctx.imm_count)
+		kfree(ctx.imms);
+#endif
+out_off:
 	kfree(ctx.offsets);
-	return;
+out:
+	if (tmp_blinded)
+		bpf_jit_prog_release_other(prog, prog == orig_prog ?
+					   tmp : orig_prog);
+#endif /* __LITTLE_ENDIAN */
+	return prog;
 }
 
-void bpf_jit_free(struct bpf_prog *fp)
+void bpf_jit_free(struct bpf_prog *prog)
 {
-	unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
+	unsigned long addr = (unsigned long)prog->bpf_func & PAGE_MASK;
 	struct bpf_binary_header *header = (void *)addr;
 
-	if (!fp->jited)
+	if (!prog->jited)
 		goto free_filter;
 
 	set_memory_rw(addr, header->pages);
 	bpf_jit_binary_free(header);
 
 free_filter:
-	bpf_prog_unlock_free(fp);
+	bpf_prog_unlock_free(prog);
 }
diff --git a/arch/arm/net/bpf_jit_32.h b/arch/arm/net/bpf_jit_32.h
index c46fca2..d5cf5f6 100644
--- a/arch/arm/net/bpf_jit_32.h
+++ b/arch/arm/net/bpf_jit_32.h
@@ -11,6 +11,7 @@
 #ifndef PFILTER_OPCODES_ARM_H
 #define PFILTER_OPCODES_ARM_H
 
+/* ARM 32bit Registers */
 #define ARM_R0	0
 #define ARM_R1	1
 #define ARM_R2	2
@@ -22,38 +23,43 @@
 #define ARM_R8	8
 #define ARM_R9	9
 #define ARM_R10	10
-#define ARM_FP	11
-#define ARM_IP	12
-#define ARM_SP	13
-#define ARM_LR	14
-#define ARM_PC	15
-
-#define ARM_COND_EQ		0x0
-#define ARM_COND_NE		0x1
-#define ARM_COND_CS		0x2
+#define ARM_FP	11	/* Frame Pointer */
+#define ARM_IP	12	/* Intra-procedure scratch register */
+#define ARM_SP	13	/* Stack pointer: as load/store base reg */
+#define ARM_LR	14	/* Link Register */
+#define ARM_PC	15	/* Program counter */
+
+#define ARM_COND_EQ		0x0	/* == */
+#define ARM_COND_NE		0x1	/* != */
+#define ARM_COND_CS		0x2	/* unsigned >= */
 #define ARM_COND_HS		ARM_COND_CS
-#define ARM_COND_CC		0x3
+#define ARM_COND_CC		0x3	/* unsigned < */
 #define ARM_COND_LO		ARM_COND_CC
-#define ARM_COND_MI		0x4
-#define ARM_COND_PL		0x5
-#define ARM_COND_VS		0x6
-#define ARM_COND_VC		0x7
-#define ARM_COND_HI		0x8
-#define ARM_COND_LS		0x9
-#define ARM_COND_GE		0xa
-#define ARM_COND_LT		0xb
-#define ARM_COND_GT		0xc
-#define ARM_COND_LE		0xd
-#define ARM_COND_AL		0xe
+#define ARM_COND_MI		0x4	/* < 0 */
+#define ARM_COND_PL		0x5	/* >= 0 */
+#define ARM_COND_VS		0x6	/* Signed Overflow */
+#define ARM_COND_VC		0x7	/* No Signed Overflow */
+#define ARM_COND_HI		0x8	/* unsigned > */
+#define ARM_COND_LS		0x9	/* unsigned <= */
+#define ARM_COND_GE		0xa	/* Signed >= */
+#define ARM_COND_LT		0xb	/* Signed < */
+#define ARM_COND_GT		0xc	/* Signed > */
+#define ARM_COND_LE		0xd	/* Signed <= */
+#define ARM_COND_AL		0xe	/* Always (unconditional) */
 
 /* register shift types */
 #define SRTYPE_LSL		0
 #define SRTYPE_LSR		1
 #define SRTYPE_ASR		2
 #define SRTYPE_ROR		3
+#define SRTYPE_ASL		(SRTYPE_LSL)
 
 #define ARM_INST_ADD_R		0x00800000
+#define ARM_INST_ADDS_R		0x00900000
+#define ARM_INST_ADC_R		0x00a00000
+#define ARM_INST_ADC_I		0x02a00000
 #define ARM_INST_ADD_I		0x02800000
+#define ARM_INST_ADDS_I		0x02900000
 
 #define ARM_INST_AND_R		0x00000000
 #define ARM_INST_AND_I		0x02000000
@@ -76,8 +82,10 @@
 #define ARM_INST_LDRH_I		0x01d000b0
 #define ARM_INST_LDRH_R		0x019000b0
 #define ARM_INST_LDR_I		0x05900000
+#define ARM_INST_LDR_R		0x07900000
 
 #define ARM_INST_LDM		0x08900000
+#define ARM_INST_LDM_IA		0x08b00000
 
 #define ARM_INST_LSL_I		0x01a00000
 #define ARM_INST_LSL_R		0x01a00010
@@ -86,6 +94,7 @@
 #define ARM_INST_LSR_R		0x01a00030
 
 #define ARM_INST_MOV_R		0x01a00000
+#define ARM_INST_MOVS_R		0x01b00000
 #define ARM_INST_MOV_I		0x03a00000
 #define ARM_INST_MOVW		0x03000000
 #define ARM_INST_MOVT		0x03400000
@@ -96,17 +105,28 @@
 #define ARM_INST_PUSH		0x092d0000
 
 #define ARM_INST_ORR_R		0x01800000
+#define ARM_INST_ORRS_R		0x01900000
 #define ARM_INST_ORR_I		0x03800000
 
 #define ARM_INST_REV		0x06bf0f30
 #define ARM_INST_REV16		0x06bf0fb0
 
 #define ARM_INST_RSB_I		0x02600000
+#define ARM_INST_RSBS_I		0x02700000
+#define ARM_INST_RSC_I		0x02e00000
 
 #define ARM_INST_SUB_R		0x00400000
+#define ARM_INST_SUBS_R		0x00500000
+#define ARM_INST_RSB_R		0x00600000
 #define ARM_INST_SUB_I		0x02400000
+#define ARM_INST_SUBS_I		0x02500000
+#define ARM_INST_SBC_I		0x02c00000
+#define ARM_INST_SBC_R		0x00c00000
+#define ARM_INST_SBCS_R		0x00d00000
 
 #define ARM_INST_STR_I		0x05800000
+#define ARM_INST_STRB_I		0x05c00000
+#define ARM_INST_STRH_I		0x01c000b0
 
 #define ARM_INST_TST_R		0x01100000
 #define ARM_INST_TST_I		0x03100000
@@ -117,6 +137,8 @@
 
 #define ARM_INST_MLS		0x00600090
 
+#define ARM_INST_UXTH		0x06ff0070
+
 /*
  * Use a suitable undefined instruction to use for ARM/Thumb2 faulting.
  * We need to be careful not to conflict with those used by other modules
@@ -135,9 +157,15 @@
 #define _AL3_R(op, rd, rn, rm)	((op ## _R) | (rd) << 12 | (rn) << 16 | (rm))
 /* immediate */
 #define _AL3_I(op, rd, rn, imm)	((op ## _I) | (rd) << 12 | (rn) << 16 | (imm))
+/* register with register-shift */
+#define _AL3_SR(inst)	(inst | (1 << 4))
 
 #define ARM_ADD_R(rd, rn, rm)	_AL3_R(ARM_INST_ADD, rd, rn, rm)
+#define ARM_ADDS_R(rd, rn, rm)	_AL3_R(ARM_INST_ADDS, rd, rn, rm)
 #define ARM_ADD_I(rd, rn, imm)	_AL3_I(ARM_INST_ADD, rd, rn, imm)
+#define ARM_ADDS_I(rd, rn, imm)	_AL3_I(ARM_INST_ADDS, rd, rn, imm)
+#define ARM_ADC_R(rd, rn, rm)	_AL3_R(ARM_INST_ADC, rd, rn, rm)
+#define ARM_ADC_I(rd, rn, imm)	_AL3_I(ARM_INST_ADC, rd, rn, imm)
 
 #define ARM_AND_R(rd, rn, rm)	_AL3_R(ARM_INST_AND, rd, rn, rm)
 #define ARM_AND_I(rd, rn, imm)	_AL3_I(ARM_INST_AND, rd, rn, imm)
@@ -156,7 +184,9 @@
 #define ARM_EOR_I(rd, rn, imm)	_AL3_I(ARM_INST_EOR, rd, rn, imm)
 
 #define ARM_LDR_I(rt, rn, off)	(ARM_INST_LDR_I | (rt) << 12 | (rn) << 16 \
-				 | (off))
+				 | ((off) & 0xfff))
+#define ARM_LDR_R(rt, rn, rm)	(ARM_INST_LDR_R | (rt) << 12 | (rn) << 16 \
+				 | (rm))
 #define ARM_LDRB_I(rt, rn, off)	(ARM_INST_LDRB_I | (rt) << 12 | (rn) << 16 \
 				 | (off))
 #define ARM_LDRB_R(rt, rn, rm)	(ARM_INST_LDRB_R | (rt) << 12 | (rn) << 16 \
@@ -167,15 +197,23 @@
 				 | (rm))
 
 #define ARM_LDM(rn, regs)	(ARM_INST_LDM | (rn) << 16 | (regs))
+#define ARM_LDM_IA(rn, regs)	(ARM_INST_LDM_IA | (rn) << 16 | (regs))
 
 #define ARM_LSL_R(rd, rn, rm)	(_AL3_R(ARM_INST_LSL, rd, 0, rn) | (rm) << 8)
 #define ARM_LSL_I(rd, rn, imm)	(_AL3_I(ARM_INST_LSL, rd, 0, rn) | (imm) << 7)
 
 #define ARM_LSR_R(rd, rn, rm)	(_AL3_R(ARM_INST_LSR, rd, 0, rn) | (rm) << 8)
 #define ARM_LSR_I(rd, rn, imm)	(_AL3_I(ARM_INST_LSR, rd, 0, rn) | (imm) << 7)
+#define ARM_ASR_R(rd, rn, rm)	(_AL3_R(ARM_INST_ASR, rd, 0, rn) | (rm) << 8)
+#define ARM_ASR_I(rd, rn, imm)	(_AL3_I(ARM_INST_ASR, rd, 0, rn) | (imm) << 7)
 
 #define ARM_MOV_R(rd, rm)	_AL3_R(ARM_INST_MOV, rd, 0, rm)
+#define ARM_MOVS_R(rd, rm)	_AL3_R(ARM_INST_MOVS, rd, 0, rm)
 #define ARM_MOV_I(rd, imm)	_AL3_I(ARM_INST_MOV, rd, 0, imm)
+#define ARM_MOV_SR(rd, rm, type, rs)	\
+	(_AL3_SR(ARM_MOV_R(rd, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_MOV_SI(rd, rm, type, imm6)	\
+	(ARM_MOV_R(rd, rm) | (type) << 5 | (imm6) << 7)
 
 #define ARM_MOVW(rd, imm)	\
 	(ARM_INST_MOVW | ((imm) >> 12) << 16 | (rd) << 12 | ((imm) & 0x0fff))
@@ -190,19 +228,38 @@
 
 #define ARM_ORR_R(rd, rn, rm)	_AL3_R(ARM_INST_ORR, rd, rn, rm)
 #define ARM_ORR_I(rd, rn, imm)	_AL3_I(ARM_INST_ORR, rd, rn, imm)
-#define ARM_ORR_S(rd, rn, rm, type, rs)	\
-	(ARM_ORR_R(rd, rn, rm) | (type) << 5 | (rs) << 7)
+#define ARM_ORR_SR(rd, rn, rm, type, rs)	\
+	(_AL3_SR(ARM_ORR_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_ORRS_R(rd, rn, rm)	_AL3_R(ARM_INST_ORRS, rd, rn, rm)
+#define ARM_ORRS_SR(rd, rn, rm, type, rs)	\
+	(_AL3_SR(ARM_ORRS_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_ORR_SI(rd, rn, rm, type, imm6)	\
+	(ARM_ORR_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
+#define ARM_ORRS_SI(rd, rn, rm, type, imm6)	\
+	(ARM_ORRS_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
 
 #define ARM_REV(rd, rm)		(ARM_INST_REV | (rd) << 12 | (rm))
 #define ARM_REV16(rd, rm)	(ARM_INST_REV16 | (rd) << 12 | (rm))
 
 #define ARM_RSB_I(rd, rn, imm)	_AL3_I(ARM_INST_RSB, rd, rn, imm)
+#define ARM_RSBS_I(rd, rn, imm)	_AL3_I(ARM_INST_RSBS, rd, rn, imm)
+#define ARM_RSC_I(rd, rn, imm)	_AL3_I(ARM_INST_RSC, rd, rn, imm)
 
 #define ARM_SUB_R(rd, rn, rm)	_AL3_R(ARM_INST_SUB, rd, rn, rm)
+#define ARM_SUBS_R(rd, rn, rm)	_AL3_R(ARM_INST_SUBS, rd, rn, rm)
+#define ARM_RSB_R(rd, rn, rm)	_AL3_R(ARM_INST_RSB, rd, rn, rm)
+#define ARM_SBC_R(rd, rn, rm)	_AL3_R(ARM_INST_SBC, rd, rn, rm)
+#define ARM_SBCS_R(rd, rn, rm)	_AL3_R(ARM_INST_SBCS, rd, rn, rm)
 #define ARM_SUB_I(rd, rn, imm)	_AL3_I(ARM_INST_SUB, rd, rn, imm)
+#define ARM_SUBS_I(rd, rn, imm)	_AL3_I(ARM_INST_SUBS, rd, rn, imm)
+#define ARM_SBC_I(rd, rn, imm)	_AL3_I(ARM_INST_SBC, rd, rn, imm)
 
 #define ARM_STR_I(rt, rn, off)	(ARM_INST_STR_I | (rt) << 12 | (rn) << 16 \
-				 | (off))
+				 | ((off) & 0xfff))
+#define ARM_STRH_I(rt, rn, off)	(ARM_INST_STRH_I | (rt) << 12 | (rn) << 16 \
+				 | (((off) & 0xf0) << 4) | ((off) & 0xf))
+#define ARM_STRB_I(rt, rn, off)	(ARM_INST_STRB_I | (rt) << 12 | (rn) << 16 \
+				 | ((off) & 0xfff))
 
 #define ARM_TST_R(rn, rm)	_AL3_R(ARM_INST_TST, 0, rn, rm)
 #define ARM_TST_I(rn, imm)	_AL3_I(ARM_INST_TST, 0, rn, imm)
@@ -214,5 +271,6 @@
 
 #define ARM_MLS(rd, rn, rm, ra)	(ARM_INST_MLS | (rd) << 16 | (rn) | (rm) << 8 \
 				 | (ra) << 12)
+#define ARM_UXTH(rd, rm)	(ARM_INST_UXTH | (rd) << 12 | (rm))
 
 #endif /* PFILTER_OPCODES_ARM_H */
-- 
2.7.4
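
A note on the bpf_jit_32.h encoders above: ARM_LDR_I and ARM_STRH_I place
the immediate offset differently. The sketch below is a minimal, standalone
user-space illustration and is not part of the patch. The macro bodies are
copied from the diff; the condition field is left at zero, as in the macros
themselves, and the expected values in the comments were worked out by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Register numbers and opcode templates copied from the patch. */
#define ARM_R0			0
#define ARM_SP			13
#define ARM_INST_LDR_I		0x05900000
#define ARM_INST_STRH_I		0x01c000b0

/* Word load: contiguous 12-bit offset in bits [11:0]. */
#define ARM_LDR_I(rt, rn, off)	(ARM_INST_LDR_I | (rt) << 12 | (rn) << 16 \
				 | ((off) & 0xfff))

/* Halfword store: 8-bit offset split into bits [11:8] and bits [3:0]. */
#define ARM_STRH_I(rt, rn, off)	(ARM_INST_STRH_I | (rt) << 12 | (rn) << 16 \
				 | (((off) & 0xf0) << 4) | ((off) & 0xf))

/* ldr r0, [sp, #0x24] without a condition field: 0x059d0024 */
static uint32_t sketch_ldr_sp_0x24(void)
{
	return ARM_LDR_I(ARM_R0, ARM_SP, 0x24);
}

/* strh r0, [sp, #0x24] without a condition field: 0x01cd02b4 */
static uint32_t sketch_strh_sp_0x24(void)
{
	return ARM_STRH_I(ARM_R0, ARM_SP, 0x24);
}
```

The word form keeps the offset contiguous, while the halfword form splits
it into two nibbles, which is why the two macros mask "off" differently.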

^ permalink raw reply related	[flat|nested] 87+ messages in thread
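
On the conditional jumps in build_insn(): each
_emit(ARM_COND_xx, ARM_B(jmp_offset), ctx) call pairs a condition code with
a branch. Neither ARM_B nor _emit is visible in this part of the diff, so
the sketch below is an assumption based on the architectural A32 branch
layout (condition in bits [31:28], signed 24-bit word offset in bits
[23:0]). The names SKETCH_INST_B, sketch_cond and sketch_branch are
hypothetical and only illustrate the encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Condition codes as defined in the patch. */
#define ARM_COND_EQ	0x0
#define ARM_COND_NE	0x1
#define ARM_COND_AL	0xe

/* Assumption: A32 "B" opcode template, offset field left empty. */
#define SKETCH_INST_B	0x0a000000u

/* Hypothetical helper: place the condition code in bits [31:28]. */
static uint32_t sketch_cond(uint32_t cond, uint32_t inst)
{
	return (cond << 28) | inst;
}

/*
 * Hypothetical helper: branch from instruction index 'from' to index 'to'.
 * In ARM state the PC reads as the address of the branch plus 8, that is
 * two instructions ahead, hence the "- 2".
 */
static uint32_t sketch_branch(uint32_t cond, int from, int to)
{
	int32_t imm24 = to - from - 2;

	return sketch_cond(cond, SKETCH_INST_B |
				 ((uint32_t)imm24 & 0x00ffffffu));
}
```

For instance, an unconditional branch to itself encodes as 0xeafffffe, the
familiar "b ." word, which is a quick sanity check for the offset rule.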

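The two-pass scheme described in the comments of bpf_int_jit_compile() (a
sizing pass with ctx.target == NULL, then a pass that writes the image) can
be sketched in user-space C roughly as follows. This is a toy model under
stated assumptions: every sketch_* name is hypothetical, calloc() stands in
for bpf_jit_binary_alloc(), and each source op simply expands to the given
number of NOP words.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_NOP	0xe1a00000u	/* mov r0, r0 */

struct sketch_ctx {
	uint32_t *target;	/* NULL during the sizing pass */
	unsigned int idx;	/* next free instruction slot */
	unsigned int *offsets;	/* idx reached after each source op */
};

/* Count the instruction always; store it only when an image exists. */
static void sketch_emit(struct sketch_ctx *ctx, uint32_t inst)
{
	if (ctx->target != NULL)
		ctx->target[ctx->idx] = inst;
	ctx->idx++;
}

/* Toy translation: source op i expands to ops[i] NOP words. */
static void sketch_build_body(struct sketch_ctx *ctx,
			      const unsigned int *ops, unsigned int len)
{
	unsigned int i, j;

	for (i = 0; i < len; i++) {
		for (j = 0; j < ops[i]; j++)
			sketch_emit(ctx, SKETCH_NOP);
		/* offsets are recorded during the sizing pass only */
		if (ctx->target == NULL)
			ctx->offsets[i] = ctx->idx;
	}
}

/* Returns the number of words written; *image stays NULL on failure. */
static unsigned int sketch_compile(const unsigned int *ops, unsigned int len,
				   uint32_t **image)
{
	struct sketch_ctx ctx = { 0 };
	unsigned int count = 0;

	*image = NULL;
	if (len == 0)
		return 0;
	ctx.offsets = calloc(len, sizeof(*ctx.offsets));
	if (ctx.offsets == NULL)
		return 0;

	/* 1) sizing pass: nothing is written, only idx advances */
	sketch_build_body(&ctx, ops, len);
	count = ctx.idx;
	if (count == 0)
		goto out_off;

	ctx.target = calloc(count, sizeof(*ctx.target));
	if (ctx.target == NULL) {
		count = 0;
		goto out_off;
	}

	/* 2) actual pass: same walk, now storing into the image */
	ctx.idx = 0;
	sketch_build_body(&ctx, ops, len);
	*image = ctx.target;
out_off:
	free(ctx.offsets);
	return count;
}
```

The caller owns the returned image and frees it. Unwinding through a label
on failure mirrors the out_imms/out_off/out structure used in the patch.
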
* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-05-25 23:13 ` Shubham Bansal
@ 2017-05-25 23:23   ` Andrew Lunn
  -1 siblings, 0 replies; 87+ messages in thread
From: Andrew Lunn @ 2017-05-25 23:23 UTC (permalink / raw)
  To: Shubham Bansal; +Cc: linux, linux-arm-kernel, linux-kernel, keescook

> Tested on ARMv7 with QEMU by me (Shubham Bansal).
> Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
> Expected to work on ARMv6 as well, as its a part ARMv7 and part ARMv5.
> Although, a proper testing is not done for ARMv6.
> 
> Both of these testing are done with and without CONFIG_FRAME_POINTER
> separately for LITTLE ENDIAN machine.

Nope. I only tested it in the default configuration of mvebu_v5_defconfig.
Please change the 'Both' to 'Some'.

      Andrew

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-05-25 23:23   ` Andrew Lunn
@ 2017-05-25 23:34     ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-05-25 23:34 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Russell King - ARM Linux, linux-arm-kernel, linux-kernel, Kees Cook

Hi Andrew,

Oh, I didn't know. Could you test it now and confirm? I think it will
work, and it shouldn't take much time.

Please.
Best,
Shubham Bansal


On Fri, May 26, 2017 at 4:53 AM, Andrew Lunn <andrew@lunn.ch> wrote:
>> Tested on ARMv7 with QEMU by me (Shubham Bansal).
>> Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
>> Expected to work on ARMv6 as well, as its a part ARMv7 and part ARMv5.
>> Although, a proper testing is not done for ARMv6.
>>
>> Both of these testing are done with and without CONFIG_FRAME_POINTER
>> separately for LITTLE ENDIAN machine.
>
> Nope. I only tested it in the default configuration of mvebu_v5_defconfig.
> Please change the 'Both' to 'Some'.
>
>       Andrew

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-05-25 23:34     ` Shubham Bansal
@ 2017-05-25 23:36       ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-05-25 23:36 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Russell King - ARM Linux, linux-arm-kernel, linux-kernel, Kees Cook

Just to add: only a very small part deals with
CONFIG_FRAME_POINTER, just one move instruction.
Best,
Shubham Bansal


On Fri, May 26, 2017 at 5:04 AM, Shubham Bansal
<illusionist.neo@gmail.com> wrote:
> Hi Andrew,
>
> Oh. I didn't knew. Can you test it now and confirm it? I think it will
> work and wouldn't take much of the time.
>
> Please.
> Best,
> Shubham Bansal
>
>
> On Fri, May 26, 2017 at 4:53 AM, Andrew Lunn <andrew@lunn.ch> wrote:
>>> Tested on ARMv7 with QEMU by me (Shubham Bansal).
>>> Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
>>> Expected to work on ARMv6 as well, as its a part ARMv7 and part ARMv5.
>>> Although, a proper testing is not done for ARMv6.
>>>
>>> Both of these testing are done with and without CONFIG_FRAME_POINTER
>>> separately for LITTLE ENDIAN machine.
>>
>> Nope. I only tested it in the default configuration of mvebu_v5_defconfig.
>> Please change the 'Both' to 'Some'.
>>
>>       Andrew

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-05-25 23:36       ` Shubham Bansal
@ 2017-05-26 16:57         ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-05-26 16:57 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Russell King - ARM Linux, linux-arm-kernel, linux-kernel, Kees Cook

Hi Andrew,

Did you get the time to test the code with CONFIG_FRAME_POINTER? It
would be great if you could check if it works on ARMv5 so that Russell
can look at the patch.

Thanks.
Best,
Shubham Bansal


On Fri, May 26, 2017 at 5:06 AM, Shubham Bansal
<illusionist.neo@gmail.com> wrote:
> Just to add. It a very very small part which deal with
> CONFIG_FRAME_POINTER just one move instruction.
> Best,
> Shubham Bansal
>
>
> On Fri, May 26, 2017 at 5:04 AM, Shubham Bansal
> <illusionist.neo@gmail.com> wrote:
>> Hi Andrew,
>>
>> Oh. I didn't knew. Can you test it now and confirm it? I think it will
>> work and wouldn't take much of the time.
>>
>> Please.
>> Best,
>> Shubham Bansal
>>
>>
>> On Fri, May 26, 2017 at 4:53 AM, Andrew Lunn <andrew@lunn.ch> wrote:
>>>> Tested on ARMv7 with QEMU by me (Shubham Bansal).
>>>> Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
>>>> Expected to work on ARMv6 as well, as its a part ARMv7 and part ARMv5.
>>>> Although, a proper testing is not done for ARMv6.
>>>>
>>>> Both of these testing are done with and without CONFIG_FRAME_POINTER
>>>> separately for LITTLE ENDIAN machine.
>>>
>>> Nope. I only tested it in the default configuration of mvebu_v5_defconfig.
>>> Please change the 'Both' to 'Some'.
>>>
>>>       Andrew

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-05-26 16:57         ` Shubham Bansal
@ 2017-05-30 18:50           ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-05-30 18:50 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Russell King - ARM Linux, linux-arm-kernel, linux-kernel, Kees Cook

Hi Russell,

I tried everything I could to run an ARMv6 machine in the last 4 days, but I
was not able to. If nobody is willing to test the code on their
machine, maybe we should not merge the code.
I have been more than happy to assist anyone who wants to test it, but
it looks like other people don't need it. I guess it was a wasted
effort.

I don't have anything more to offer regarding this patch.
Thank you.

Best,
Shubham Bansal


On Fri, May 26, 2017 at 10:27 PM, Shubham Bansal
<illusionist.neo@gmail.com> wrote:
> Hi Andrew,
>
> Did you get the time to test the code with CONFIG_FRAME_POINTER? It
> would be great if you could check if it works on ARMv5 so that Russell
> can look at the patch.
>
> Thanks.
> Best,
> Shubham Bansal
>
>
> On Fri, May 26, 2017 at 5:06 AM, Shubham Bansal
> <illusionist.neo@gmail.com> wrote:
>> Just to add. It a very very small part which deal with
>> CONFIG_FRAME_POINTER just one move instruction.
>> Best,
>> Shubham Bansal
>>
>>
>> On Fri, May 26, 2017 at 5:04 AM, Shubham Bansal
>> <illusionist.neo@gmail.com> wrote:
>>> Hi Andrew,
>>>
>>> Oh. I didn't knew. Can you test it now and confirm it? I think it will
>>> work and wouldn't take much of the time.
>>>
>>> Please.
>>> Best,
>>> Shubham Bansal
>>>
>>>
>>> On Fri, May 26, 2017 at 4:53 AM, Andrew Lunn <andrew@lunn.ch> wrote:
>>>>> Tested on ARMv7 with QEMU by me (Shubham Bansal).
>>>>> Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
>>>>> Expected to work on ARMv6 as well, as its a part ARMv7 and part ARMv5.
>>>>> Although, a proper testing is not done for ARMv6.
>>>>>
>>>>> Both of these testing are done with and without CONFIG_FRAME_POINTER
>>>>> separately for LITTLE ENDIAN machine.
>>>>
>>>> Nope. I only tested it in the default configuration of mvebu_v5_defconfig.
>>>> Please change the 'Both' to 'Some'.
>>>>
>>>>       Andrew

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-05-25 23:13 ` Shubham Bansal
@ 2017-05-30 19:11   ` Kees Cook
  -1 siblings, 0 replies; 87+ messages in thread
From: Kees Cook @ 2017-05-30 19:11 UTC (permalink / raw)
  To: Shubham Bansal, Russell King; +Cc: linux-arm-kernel, LKML, Andrew Lunn

On Thu, May 25, 2017 at 4:13 PM, Shubham Bansal
<illusionist.neo@gmail.com> wrote:
> The JIT compiler emits ARM 32 bit instructions. Currently, it supports
> eBPF only. Classic BPF is supported because of the conversion by BPF
> core.
>
> This patch essentially changes the current implementation of the JIT
> compiler for Berkeley Packet Filter from classic to internal, with almost
> all instructions from the eBPF ISA supported except the following:
>         BPF_ALU64 | BPF_DIV | BPF_K
>         BPF_ALU64 | BPF_DIV | BPF_X
>         BPF_ALU64 | BPF_MOD | BPF_K
>         BPF_ALU64 | BPF_MOD | BPF_X
>         BPF_STX | BPF_XADD | BPF_W
>         BPF_STX | BPF_XADD | BPF_DW
>         BPF_JMP | BPF_CALL
>
> The implementation uses scratch space to emulate the 64 bit eBPF ISA on
> 32 bit ARM because of the shortage of general purpose registers on ARM.
> Currently, only LITTLE ENDIAN machines are supported in this eBPF JIT Compiler.
>
> Tested on ARMv7 with QEMU by me (Shubham Bansal).
> Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
> Expected to work on ARMv6 as well, as it is part ARMv7 and part ARMv5,
> although proper testing has not been done for ARMv6.
>
> Both of these tests were done with and without CONFIG_FRAME_POINTER
> separately on a LITTLE ENDIAN machine.
>
> For testing:
>
> 1. JIT is enabled with
>         echo 1 > /proc/sys/net/core/bpf_jit_enable
> 2. Constant Blinding can be enabled along with JIT using
>         echo 1 > /proc/sys/net/core/bpf_jit_enable
>         echo 2 > /proc/sys/net/core/bpf_jit_harden
>
> See Documentation/networking/filter.txt for more information.
>
> Result : test_bpf: Summary: 314 PASSED, 0 FAILED, [278/306 JIT'ed]
>
> Signed-off-by: Shubham Bansal <illusionist.neo@gmail.com>

Thanks for this! Russell, should this patch go via the ARM patch tracker?

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-05-25 23:13 ` Shubham Bansal
@ 2017-05-30 19:19   ` Kees Cook
  -1 siblings, 0 replies; 87+ messages in thread
From: Kees Cook @ 2017-05-30 19:19 UTC (permalink / raw)
  To: Shubham Bansal, Network Development, Daniel Borkmann,
	David S. Miller, Alexei Starovoitov
  Cc: Russell King, linux-arm-kernel, LKML, Andrew Lunn

Forwarding this to net-dev and eBPF folks, who weren't on CC...

-Kees

On Thu, May 25, 2017 at 4:13 PM, Shubham Bansal
<illusionist.neo@gmail.com> wrote:
> The JIT compiler emits ARM 32 bit instructions. Currently, it supports
> eBPF only. Classic BPF is supported because of the conversion by BPF
> core.
>
> This patch essentially changes the current implementation of the JIT
> compiler for Berkeley Packet Filter from classic to internal, with almost
> all instructions from the eBPF ISA supported except the following:
>         BPF_ALU64 | BPF_DIV | BPF_K
>         BPF_ALU64 | BPF_DIV | BPF_X
>         BPF_ALU64 | BPF_MOD | BPF_K
>         BPF_ALU64 | BPF_MOD | BPF_X
>         BPF_STX | BPF_XADD | BPF_W
>         BPF_STX | BPF_XADD | BPF_DW
>         BPF_JMP | BPF_CALL
>
> The implementation uses scratch space to emulate the 64 bit eBPF ISA on
> 32 bit ARM because of the shortage of general purpose registers on ARM.
> Currently, only LITTLE ENDIAN machines are supported in this eBPF JIT Compiler.
>
> Tested on ARMv7 with QEMU by me (Shubham Bansal).
> Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
> Expected to work on ARMv6 as well, as it is part ARMv7 and part ARMv5,
> although proper testing has not been done for ARMv6.
>
> Both of these tests were done with and without CONFIG_FRAME_POINTER
> separately on a LITTLE ENDIAN machine.
>
> For testing:
>
> 1. JIT is enabled with
>         echo 1 > /proc/sys/net/core/bpf_jit_enable
> 2. Constant Blinding can be enabled along with JIT using
>         echo 1 > /proc/sys/net/core/bpf_jit_enable
>         echo 2 > /proc/sys/net/core/bpf_jit_harden
>
> See Documentation/networking/filter.txt for more information.
>
> Result : test_bpf: Summary: 314 PASSED, 0 FAILED, [278/306 JIT'ed]
>
> Signed-off-by: Shubham Bansal <illusionist.neo@gmail.com>
> ---
>  Documentation/networking/filter.txt |    4 +-
>  arch/arm/Kconfig                    |    2 +-
>  arch/arm/net/bpf_jit_32.c           | 2404 ++++++++++++++++++++++++-----------
>  arch/arm/net/bpf_jit_32.h           |  108 +-
>  4 files changed, 1713 insertions(+), 805 deletions(-)
>
> diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
> index b69b205..01165ac 100644
> --- a/Documentation/networking/filter.txt
> +++ b/Documentation/networking/filter.txt
> @@ -596,8 +596,8 @@ skb pointer). All constraints and restrictions from bpf_check_classic() apply
>  before a conversion to the new layout is being done behind the scenes!
>
>  Currently, the classic BPF format is being used for JITing on most 32-bit
> -architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64 perform JIT
> -compilation from eBPF instruction set.
> +architectures, whereas x86-64, aarch64, arm, s390x, powerpc64, sparc64 perform
> +JIT compilation from eBPF instruction set.
>
>  Some core changes of the new internal format:
>
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 8a7ab5e..13ade46 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -47,7 +47,7 @@ config ARM
>         select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
>         select HAVE_ARCH_TRACEHOOK
>         select HAVE_ARM_SMCCC if CPU_V7
> -       select HAVE_CBPF_JIT
> +       select HAVE_EBPF_JIT
>         select HAVE_CC_STACKPROTECTOR
>         select HAVE_CONTEXT_TRACKING
>         select HAVE_C_RECORDMCOUNT
> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
> index 93d0b6d..c7476e5 100644
> --- a/arch/arm/net/bpf_jit_32.c
> +++ b/arch/arm/net/bpf_jit_32.c
> @@ -1,13 +1,15 @@
>  /*
> - * Just-In-Time compiler for BPF filters on 32bit ARM
> + * Just-In-Time compiler for eBPF filters on 32bit ARM
>   *
>   * Copyright (c) 2011 Mircea Gherzan <mgherzan@gmail.com>
> + * Copyright (c) 2017 Shubham Bansal <illusionist.neo@gmail.com>
>   *
>   * This program is free software; you can redistribute it and/or modify it
>   * under the terms of the GNU General Public License as published by the
>   * Free Software Foundation; version 2 of the License.
>   */
>
> +#include <linux/bpf.h>
>  #include <linux/bitops.h>
>  #include <linux/compiler.h>
>  #include <linux/errno.h>
> @@ -23,44 +25,91 @@
>
>  #include "bpf_jit_32.h"
>
> +int bpf_jit_enable __read_mostly;
> +
> +#define STACK_OFFSET(k)        (k)
> +#define TMP_REG_1      (MAX_BPF_JIT_REG + 0)   /* TEMP Register 1 */
> +#define TMP_REG_2      (MAX_BPF_JIT_REG + 1)   /* TEMP Register 2 */
> +#define TCALL_CNT      (MAX_BPF_JIT_REG + 2)   /* Tail Call Count */
> +
> +/* Flags used for JIT optimization */
> +#define SEEN_CALL      (1 << 0)
> +
> +#define FLAG_IMM_OVERFLOW      (1 << 0)
> +
>  /*
> - * ABI:
> + * Map eBPF registers to ARM 32bit registers or stack scratch space.
> + *
> + * 1. First argument is passed using the arm 32bit registers and rest of the
> + * arguments are passed on stack scratch space.
> + * 2. First callee-saved argument is mapped to arm 32 bit registers and the
> + * rest of them are mapped to scratch space on stack.
> + * 3. We need two 64 bit temp registers to do complex operations on eBPF
> + * registers.
> + *
> + * As the eBPF registers are all 64 bit registers and arm has only 32 bit
> + * registers, we have to map each eBPF register to two arm 32 bit regs or
> + * scratch memory space and we have to build the eBPF 64 bit register from those.
>   *
> - * r0  scratch register
> - * r4  BPF register A
> - * r5  BPF register X
> - * r6  pointer to the skb
> - * r7  skb->data
> - * r8  skb_headlen(skb)
>   */
> +static const u8 bpf2a32[][2] = {
> +       /* return value from in-kernel function, and exit value from eBPF */
> +       [BPF_REG_0] = {ARM_R1, ARM_R0},
> +       /* arguments from eBPF program to in-kernel function */
> +       [BPF_REG_1] = {ARM_R3, ARM_R2},
> +       /* Stored on stack scratch space */
> +       [BPF_REG_2] = {STACK_OFFSET(0), STACK_OFFSET(4)},
> +       [BPF_REG_3] = {STACK_OFFSET(8), STACK_OFFSET(12)},
> +       [BPF_REG_4] = {STACK_OFFSET(16), STACK_OFFSET(20)},
> +       [BPF_REG_5] = {STACK_OFFSET(24), STACK_OFFSET(28)},
> +       /* callee saved registers that in-kernel function will preserve */
> +       [BPF_REG_6] = {ARM_R5, ARM_R4},
> +       /* Stored on stack scratch space */
> +       [BPF_REG_7] = {STACK_OFFSET(32), STACK_OFFSET(36)},
> +       [BPF_REG_8] = {STACK_OFFSET(40), STACK_OFFSET(44)},
> +       [BPF_REG_9] = {STACK_OFFSET(48), STACK_OFFSET(52)},
> +       /* Read only Frame Pointer to access Stack */
> +       [BPF_REG_FP] = {STACK_OFFSET(56), STACK_OFFSET(60)},
> +       /* Temporary Register for internal BPF JIT, can be used
> +        * for constant blindings and others.
> +        */
> +       [TMP_REG_1] = {ARM_R7, ARM_R6},
> +       [TMP_REG_2] = {ARM_R10, ARM_R8},
> +       /* Tail call count. Stored on stack scratch space. */
> +       [TCALL_CNT] = {STACK_OFFSET(64), STACK_OFFSET(68)},
> +       /* temporary register for blinding constants.
> +        * Stored on stack scratch space.
> +        */
> +       [BPF_REG_AX] = {STACK_OFFSET(72), STACK_OFFSET(76)},
> +};
>
> -#define r_scratch      ARM_R0
> -/* r1-r3 are (also) used for the unaligned loads on the non-ARMv7 slowpath */
> -#define r_off          ARM_R1
> -#define r_A            ARM_R4
> -#define r_X            ARM_R5
> -#define r_skb          ARM_R6
> -#define r_skb_data     ARM_R7
> -#define r_skb_hl       ARM_R8
> -
> -#define SCRATCH_SP_OFFSET      0
> -#define SCRATCH_OFF(k)         (SCRATCH_SP_OFFSET + 4 * (k))
> -
> -#define SEEN_MEM               ((1 << BPF_MEMWORDS) - 1)
> -#define SEEN_MEM_WORD(k)       (1 << (k))
> -#define SEEN_X                 (1 << BPF_MEMWORDS)
> -#define SEEN_CALL              (1 << (BPF_MEMWORDS + 1))
> -#define SEEN_SKB               (1 << (BPF_MEMWORDS + 2))
> -#define SEEN_DATA              (1 << (BPF_MEMWORDS + 3))
> +#define dst_lo dst[1]
> +#define dst_hi dst[0]
> +#define src_lo src[1]
> +#define src_hi src[0]
>
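
A note for readers following the mapping above: each 64 bit eBPF register
is kept as two 32 bit halves, with index 0 holding the high word and index 1
the low word (dst_hi = dst[0], dst_lo = dst[1]). Below is a minimal userspace
sketch of that convention. It is not code from the patch; struct reg_pair,
split64() and join64() are hypothetical names used only for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for one eBPF register as the JIT sees it:
 * half[0] is the high 32 bits, half[1] is the low 32 bits, which
 * mirrors dst_hi = dst[0] and dst_lo = dst[1] in the patch.
 */
struct reg_pair {
	uint32_t half[2];
};

/* Split a 64 bit value into the {hi, lo} pair. */
static struct reg_pair split64(uint64_t v)
{
	struct reg_pair r;

	r.half[0] = (uint32_t)(v >> 32);	/* high word */
	r.half[1] = (uint32_t)v;		/* low word */
	return r;
}

/* Rebuild the 64 bit value from the {hi, lo} pair. */
static uint64_t join64(struct reg_pair r)
{
	return ((uint64_t)r.half[0] << 32) | r.half[1];
}
```

In the patch each half is either an ARM core register or a stack scratch
slot; the sketch only shows the ordering of the two halves.
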
> -#define FLAG_NEED_X_RESET      (1 << 0)
> -#define FLAG_IMM_OVERFLOW      (1 << 1)
> +/*
> + * JIT Context:
> + *
> + * prog                :       bpf_prog
> + * idx                 :       index of the last JITed instruction.
> + * prologue_bytes      :       bytes used in prologue.
> + * epilogue_offset     :       offset at which the epilogue starts.
> + * seen                :       bit mask used for JIT optimization.
> + * offsets             :       array of eBPF instruction offsets in
> + *                             JITed code.
> + * target              :       final JITed code.
> + * epilogue_bytes      :       number of bytes used in epilogue.
> + * imm_count           :       number of immediates used for global
> + *                             variables.
> + * imms                :       array of global variable addresses.
> + */
>
>  struct jit_ctx {
> -       const struct bpf_prog *skf;
> -       unsigned idx;
> -       unsigned prologue_bytes;
> -       int ret0_fp_idx;
> +       const struct bpf_prog *prog;
> +       unsigned int idx;
> +       unsigned int prologue_bytes;
> +       unsigned int epilogue_offset;
>         u32 seen;
>         u32 flags;
>         u32 *offsets;
> @@ -72,68 +121,16 @@ struct jit_ctx {
>  #endif
>  };
>
> -int bpf_jit_enable __read_mostly;
> -
> -static inline int call_neg_helper(struct sk_buff *skb, int offset, void *ret,
> -                     unsigned int size)
> -{
> -       void *ptr = bpf_internal_load_pointer_neg_helper(skb, offset, size);
> -
> -       if (!ptr)
> -               return -EFAULT;
> -       memcpy(ret, ptr, size);
> -       return 0;
> -}
> -
> -static u64 jit_get_skb_b(struct sk_buff *skb, int offset)
> -{
> -       u8 ret;
> -       int err;
> -
> -       if (offset < 0)
> -               err = call_neg_helper(skb, offset, &ret, 1);
> -       else
> -               err = skb_copy_bits(skb, offset, &ret, 1);
> -
> -       return (u64)err << 32 | ret;
> -}
> -
> -static u64 jit_get_skb_h(struct sk_buff *skb, int offset)
> -{
> -       u16 ret;
> -       int err;
> -
> -       if (offset < 0)
> -               err = call_neg_helper(skb, offset, &ret, 2);
> -       else
> -               err = skb_copy_bits(skb, offset, &ret, 2);
> -
> -       return (u64)err << 32 | ntohs(ret);
> -}
> -
> -static u64 jit_get_skb_w(struct sk_buff *skb, int offset)
> -{
> -       u32 ret;
> -       int err;
> -
> -       if (offset < 0)
> -               err = call_neg_helper(skb, offset, &ret, 4);
> -       else
> -               err = skb_copy_bits(skb, offset, &ret, 4);
> -
> -       return (u64)err << 32 | ntohl(ret);
> -}
> -
>  /*
>   * Wrappers which handle both OABI and EABI and assures Thumb2 interworking
>   * (where the assembly routines like __aeabi_uidiv could cause problems).
>   */
> -static u32 jit_udiv(u32 dividend, u32 divisor)
> +static u32 jit_udiv32(u32 dividend, u32 divisor)
>  {
>         return dividend / divisor;
>  }
>
> -static u32 jit_mod(u32 dividend, u32 divisor)
> +static u32 jit_mod32(u32 dividend, u32 divisor)
>  {
>         return dividend % divisor;
>  }
> @@ -157,36 +154,22 @@ static inline void emit(u32 inst, struct jit_ctx *ctx)
>         _emit(ARM_COND_AL, inst, ctx);
>  }
>
> -static u16 saved_regs(struct jit_ctx *ctx)
> +/*
> + * Checks if immediate value can be converted to imm12(12 bits) value.
> + */
> +static int16_t imm8m(u32 x)
>  {
> -       u16 ret = 0;
> -
> -       if ((ctx->skf->len > 1) ||
> -           (ctx->skf->insns[0].code == (BPF_RET | BPF_A)))
> -               ret |= 1 << r_A;
> -
> -#ifdef CONFIG_FRAME_POINTER
> -       ret |= (1 << ARM_FP) | (1 << ARM_IP) | (1 << ARM_LR) | (1 << ARM_PC);
> -#else
> -       if (ctx->seen & SEEN_CALL)
> -               ret |= 1 << ARM_LR;
> -#endif
> -       if (ctx->seen & (SEEN_DATA | SEEN_SKB))
> -               ret |= 1 << r_skb;
> -       if (ctx->seen & SEEN_DATA)
> -               ret |= (1 << r_skb_data) | (1 << r_skb_hl);
> -       if (ctx->seen & SEEN_X)
> -               ret |= 1 << r_X;
> -
> -       return ret;
> -}
> +       u32 rot;
>
> -static inline int mem_words_used(struct jit_ctx *ctx)
> -{
> -       /* yes, we do waste some stack space IF there are "holes" in the set" */
> -       return fls(ctx->seen & SEEN_MEM);
> +       for (rot = 0; rot < 16; rot++)
> +               if ((x & ~ror32(0xff, 2 * rot)) == 0)
> +                       return rol32(x, 2 * rot) | (rot << 8);
> +       return -1;
>  }
>
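
The check above can be tried outside the kernel. ARM data-processing
instructions accept an 8 bit value rotated right by an even amount, and
imm8m() searches for such a rotation. Here is a sketch that assumes
userspace stand-ins for the kernel's ror32()/rol32(); the loop itself
follows the function quoted above.

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel's rotate helpers (assumption). */
static uint32_t ror32(uint32_t w, unsigned int s)
{
	return (w >> (s & 31)) | (w << ((32 - s) & 31));
}

static uint32_t rol32(uint32_t w, unsigned int s)
{
	return (w << (s & 31)) | (w >> ((32 - s) & 31));
}

/*
 * Return the 12 bit encoding (rotation field in bits 11:8, byte in
 * bits 7:0) if x fits an 8 bit value rotated right by 2 * rot,
 * or -1 if no such rotation exists.
 */
static int16_t imm8m(uint32_t x)
{
	uint32_t rot;

	for (rot = 0; rot < 16; rot++)
		if ((x & ~ror32(0xff, 2 * rot)) == 0)
			return rol32(x, 2 * rot) | (rot << 8);
	return -1;
}
```

For example, 0x00ff0000 encodes as 0x8ff, which matches the constant
mentioned in the removed emit_swap16() comment, while 0x101 has no
encoding and yields -1, so the JIT has to fall back to emit_mov_i_no8m().
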
> +/*
> + * Initializes the JIT space with undefined instructions.
> + */
>  static void jit_fill_hole(void *area, unsigned int size)
>  {
>         u32 *ptr;
> @@ -195,88 +178,34 @@ static void jit_fill_hole(void *area, unsigned int size)
>                 *ptr++ = __opcode_to_mem_arm(ARM_INST_UDF);
>  }
>
> -static void build_prologue(struct jit_ctx *ctx)
> -{
> -       u16 reg_set = saved_regs(ctx);
> -       u16 off;
> -
> -#ifdef CONFIG_FRAME_POINTER
> -       emit(ARM_MOV_R(ARM_IP, ARM_SP), ctx);
> -       emit(ARM_PUSH(reg_set), ctx);
> -       emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
> -#else
> -       if (reg_set)
> -               emit(ARM_PUSH(reg_set), ctx);
> -#endif
> +/* Stack must be multiples of 16 Bytes */
> +#define STACK_ALIGN(sz) (((sz) + 15) & ~15)
>
> -       if (ctx->seen & (SEEN_DATA | SEEN_SKB))
> -               emit(ARM_MOV_R(r_skb, ARM_R0), ctx);
> -
> -       if (ctx->seen & SEEN_DATA) {
> -               off = offsetof(struct sk_buff, data);
> -               emit(ARM_LDR_I(r_skb_data, r_skb, off), ctx);
> -               /* headlen = len - data_len */
> -               off = offsetof(struct sk_buff, len);
> -               emit(ARM_LDR_I(r_skb_hl, r_skb, off), ctx);
> -               off = offsetof(struct sk_buff, data_len);
> -               emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
> -               emit(ARM_SUB_R(r_skb_hl, r_skb_hl, r_scratch), ctx);
> -       }
> -
> -       if (ctx->flags & FLAG_NEED_X_RESET)
> -               emit(ARM_MOV_I(r_X, 0), ctx);
> -
> -       /* do not leak kernel data to userspace */
> -       if (bpf_needs_clear_a(&ctx->skf->insns[0]))
> -               emit(ARM_MOV_I(r_A, 0), ctx);
> -
> -       /* stack space for the BPF_MEM words */
> -       if (ctx->seen & SEEN_MEM)
> -               emit(ARM_SUB_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
> -}
> -
> -static void build_epilogue(struct jit_ctx *ctx)
> -{
> -       u16 reg_set = saved_regs(ctx);
> -
> -       if (ctx->seen & SEEN_MEM)
> -               emit(ARM_ADD_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
> -
> -       reg_set &= ~(1 << ARM_LR);
> -
> -#ifdef CONFIG_FRAME_POINTER
> -       /* the first instruction of the prologue was: mov ip, sp */
> -       reg_set &= ~(1 << ARM_IP);
> -       reg_set |= (1 << ARM_SP);
> -       emit(ARM_LDM(ARM_SP, reg_set), ctx);
> -#else
> -       if (reg_set) {
> -               if (ctx->seen & SEEN_CALL)
> -                       reg_set |= 1 << ARM_PC;
> -               emit(ARM_POP(reg_set), ctx);
> -       }
> +/* Stack space for BPF_REG_2, BPF_REG_3, BPF_REG_4,
> + * BPF_REG_5, BPF_REG_7, BPF_REG_8, BPF_REG_9,
> + * BPF_REG_FP, Tail call count and BPF_REG_AX.
> + */
> +#define SCRATCH_SIZE 80
>
> -       if (!(ctx->seen & SEEN_CALL))
> -               emit(ARM_BX(ARM_LR), ctx);
> -#endif
> -}
> +/* total stack size used in JITed code */
> +#define _STACK_SIZE \
> +       (MAX_BPF_STACK + \
> +        SCRATCH_SIZE + \
> +        4 /* extra for skb_copy_bits buffer */)
>
> -static int16_t imm8m(u32 x)
> -{
> -       u32 rot;
> +#define STACK_SIZE STACK_ALIGN(_STACK_SIZE)
>
> -       for (rot = 0; rot < 16; rot++)
> -               if ((x & ~ror32(0xff, 2 * rot)) == 0)
> -                       return rol32(x, 2 * rot) | (rot << 8);
> +/* Get the offset of eBPF REGISTERs stored on scratch space. */
> +#define STACK_VAR(off) (STACK_SIZE - (off) - 4)
>
> -       return -1;
> -}
> +/* Offset of skb_copy_bits buffer */
> +#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
>
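
To make the stack layout arithmetic concrete, here is a small worked sketch
of the macros above. It assumes MAX_BPF_STACK is 512, which is the value I
believe include/linux/filter.h uses; RAW_STACK_SIZE is a renamed stand-in
for _STACK_SIZE, since a leading underscore plus capital is reserved in
userspace C.

```c
/* Assumption: MAX_BPF_STACK as defined in include/linux/filter.h. */
#define MAX_BPF_STACK	512

/* Round up to a multiple of 16 bytes. */
#define STACK_ALIGN(sz)	(((sz) + 15) & ~15)

/* Ten 64 bit slots: offsets 0..76 in the bpf2a32[] table. */
#define SCRATCH_SIZE	80

/* 512 + 80 + 4 = 596 bytes before alignment. */
#define RAW_STACK_SIZE	(MAX_BPF_STACK + SCRATCH_SIZE + 4)

/* 596 rounded up to the next multiple of 16 is 608. */
#define STACK_SIZE	STACK_ALIGN(RAW_STACK_SIZE)

/* SP-relative offset of a scratch slot; slot 0 sits at the top. */
#define STACK_VAR(off)	(STACK_SIZE - (off) - 4)

/* The skb_copy_bits buffer follows the last scratch slot. */
#define SKB_BUFFER	STACK_VAR(SCRATCH_SIZE)

/* Helper so the offsets can be inspected at run time. */
static int stack_var(int off)
{
	return STACK_VAR(off);
}
```

Under that assumption STACK_VAR(0) is 604, STACK_VAR(76) is 528 and
SKB_BUFFER is 524, so the scratch slots grow downwards from the top of the
frame and the eBPF program stack sits below them.
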
>  #if __LINUX_ARM_ARCH__ < 7
>
>  static u16 imm_offset(u32 k, struct jit_ctx *ctx)
>  {
> -       unsigned i = 0, offset;
> +       unsigned int i = 0, offset;
>         u16 imm;
>
>         /* on the "fake" run we just count them (duplicates included) */
> @@ -295,7 +224,7 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
>                 ctx->imms[i] = k;
>
>         /* constants go just after the epilogue */
> -       offset =  ctx->offsets[ctx->skf->len];
> +       offset =  ctx->offsets[ctx->prog->len - 1] * 4;
>         offset += ctx->prologue_bytes;
>         offset += ctx->epilogue_bytes;
>         offset += i * 4;
> @@ -319,10 +248,22 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
>
>  #endif /* __LINUX_ARM_ARCH__ */
>
> +static inline int bpf2a32_offset(int bpf_to, int bpf_from,
> +                                const struct jit_ctx *ctx) {
> +       int to, from;
> +
> +       if (ctx->target == NULL)
> +               return 0;
> +       to = ctx->offsets[bpf_to];
> +       from = ctx->offsets[bpf_from];
> +
> +       return to - from - 1;
> +}
> +
>  /*
>   * Move an immediate that's not an imm8m to a core register.
>   */
> -static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
> +static inline void emit_mov_i_no8m(const u8 rd, u32 val, struct jit_ctx *ctx)
>  {
>  #if __LINUX_ARM_ARCH__ < 7
>         emit(ARM_LDR_I(rd, ARM_PC, imm_offset(val, ctx)), ctx);
> @@ -333,7 +274,7 @@ static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
>  #endif
>  }
>
> -static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
> +static inline void emit_mov_i(const u8 rd, u32 val, struct jit_ctx *ctx)
>  {
>         int imm12 = imm8m(val);
>
> @@ -343,676 +284,1553 @@ static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
>                 emit_mov_i_no8m(rd, val, ctx);
>  }
>
> -#if __LINUX_ARM_ARCH__ < 6
> -
> -static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
> +static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
>  {
> -       _emit(cond, ARM_LDRB_I(ARM_R3, r_addr, 1), ctx);
> -       _emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
> -       _emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 3), ctx);
> -       _emit(cond, ARM_LSL_I(ARM_R3, ARM_R3, 16), ctx);
> -       _emit(cond, ARM_LDRB_I(ARM_R0, r_addr, 2), ctx);
> -       _emit(cond, ARM_ORR_S(ARM_R3, ARM_R3, ARM_R1, SRTYPE_LSL, 24), ctx);
> -       _emit(cond, ARM_ORR_R(ARM_R3, ARM_R3, ARM_R2), ctx);
> -       _emit(cond, ARM_ORR_S(r_res, ARM_R3, ARM_R0, SRTYPE_LSL, 8), ctx);
> +       ctx->seen |= SEEN_CALL;
> +#if __LINUX_ARM_ARCH__ < 5
> +       emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
> +
> +       if (elf_hwcap & HWCAP_THUMB)
> +               emit(ARM_BX(tgt_reg), ctx);
> +       else
> +               emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
> +#else
> +       emit(ARM_BLX_R(tgt_reg), ctx);
> +#endif
>  }
>
> -static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
> +static inline int epilogue_offset(const struct jit_ctx *ctx)
>  {
> -       _emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
> -       _emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 1), ctx);
> -       _emit(cond, ARM_ORR_S(r_res, ARM_R2, ARM_R1, SRTYPE_LSL, 8), ctx);
> +       int to, from;
> +       /* No need for 1st dummy run */
> +       if (ctx->target == NULL)
> +               return 0;
> +       to = ctx->epilogue_offset;
> +       from = ctx->idx;
> +
> +       return to - from - 2;
>  }
>
> -static inline void emit_swap16(u8 r_dst, u8 r_src, struct jit_ctx *ctx)
> +static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx, u8 op)
>  {
> -       /* r_dst = (r_src << 8) | (r_src >> 8) */
> -       emit(ARM_LSL_I(ARM_R1, r_src, 8), ctx);
> -       emit(ARM_ORR_S(r_dst, ARM_R1, r_src, SRTYPE_LSR, 8), ctx);
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       s32 jmp_offset;
> +
> +       /* checks if divisor is zero or not. If it is, then
> +        * exit directly.
> +        */
> +       emit(ARM_CMP_I(rn, 0), ctx);
> +       _emit(ARM_COND_EQ, ARM_MOV_I(ARM_R0, 0), ctx);
> +       jmp_offset = epilogue_offset(ctx);
> +       _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
> +#if __LINUX_ARM_ARCH__ == 7
> +       if (elf_hwcap & HWCAP_IDIVA) {
> +               if (op == BPF_DIV)
> +                       emit(ARM_UDIV(rd, rm, rn), ctx);
> +               else {
> +                       emit(ARM_UDIV(ARM_IP, rm, rn), ctx);
> +                       emit(ARM_MLS(rd, rn, ARM_IP, rm), ctx);
> +               }
> +               return;
> +       }
> +#endif
>
>         /*
> -        * we need to mask out the bits set in r_dst[23:16] due to
> -        * the first shift instruction.
> -        *
> -        * note that 0x8ff is the encoded immediate 0x00ff0000.
> +        * For BPF_ALU | BPF_DIV | BPF_K instructions:
> +        * as ARM_R1 and ARM_R0 contain the 1st argument of the bpf
> +        * function, we need to save them on the caller side to keep
> +        * them from being clobbered within the callee.
> +        * After the return from the callee, we restore ARM_R0 and
> +        * ARM_R1.
>          */
> -       emit(ARM_BIC_I(r_dst, r_dst, 0x8ff), ctx);
> -}
> +       if (rn != ARM_R1) {
> +               emit(ARM_MOV_R(tmp[0], ARM_R1), ctx);
> +               emit(ARM_MOV_R(ARM_R1, rn), ctx);
> +       }
> +       if (rm != ARM_R0) {
> +               emit(ARM_MOV_R(tmp[1], ARM_R0), ctx);
> +               emit(ARM_MOV_R(ARM_R0, rm), ctx);
> +       }
>
> -#else  /* ARMv6+ */
> +       /* Call appropriate function */
> +       ctx->seen |= SEEN_CALL;
> +       emit_mov_i(ARM_IP, op == BPF_DIV ?
> +                  (u32)jit_udiv32 : (u32)jit_mod32, ctx);
> +       emit_blx_r(ARM_IP, ctx);
>
> -static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
> -{
> -       _emit(cond, ARM_LDR_I(r_res, r_addr, 0), ctx);
> -#ifdef __LITTLE_ENDIAN
> -       _emit(cond, ARM_REV(r_res, r_res), ctx);
> -#endif
> +       /* Save return value */
> +       if (rd != ARM_R0)
> +               emit(ARM_MOV_R(rd, ARM_R0), ctx);
> +
> +       /* Restore ARM_R0 and ARM_R1 */
> +       if (rn != ARM_R1)
> +               emit(ARM_MOV_R(ARM_R1, tmp[0]), ctx);
> +       if (rm != ARM_R0)
> +               emit(ARM_MOV_R(ARM_R0, tmp[1]), ctx);
>  }
>
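
As I read emit_udivmod(), a zero divisor makes the JITed program set R0 to 0
and branch to the epilogue, and on CPUs with hardware divide the modulo is
built from UDIV followed by MLS (rd = rm - q * rn). A userspace sketch of
those semantics follows; udivmod32() is a hypothetical name and the boolean
return value stands in for "continue" versus "exit with 0".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the 32 bit div/mod emitted by the JIT.
 * Returns false when the divisor is zero, which corresponds to the
 * JITed code loading 0 into R0 and jumping to the epilogue.
 */
static bool udivmod32(uint32_t rm, uint32_t rn, bool is_mod, uint32_t *rd)
{
	uint32_t q;

	if (rn == 0)
		return false;

	q = rm / rn;			/* UDIV */
	if (is_mod)
		*rd = rm - q * rn;	/* MLS: remainder from quotient */
	else
		*rd = q;
	return true;
}
```

On CPUs without UDIV the patch calls jit_udiv32()/jit_mod32() instead, which
should give the same results for a non-zero divisor.
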
> -static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
> +/* Checks whether BPF register is on scratch stack space or not. */
> +static inline bool is_on_stack(u8 bpf_reg)
>  {
> -       _emit(cond, ARM_LDRH_I(r_res, r_addr, 0), ctx);
> -#ifdef __LITTLE_ENDIAN
> -       _emit(cond, ARM_REV16(r_res, r_res), ctx);
> -#endif
> +       static u8 stack_regs[] = {BPF_REG_AX, BPF_REG_3, BPF_REG_4, BPF_REG_5,
> +                               BPF_REG_7, BPF_REG_8, BPF_REG_9, TCALL_CNT,
> +                               BPF_REG_2, BPF_REG_FP};
> +       int i, reg_len = sizeof(stack_regs);
> +
> +       for (i = 0 ; i < reg_len ; i++) {
> +               if (bpf_reg == stack_regs[i])
> +                       return true;
> +       }
> +       return false;
>  }
>
> -static inline void emit_swap16(u8 r_dst __maybe_unused,
> -                              u8 r_src __maybe_unused,
> -                              struct jit_ctx *ctx __maybe_unused)
> +static inline void emit_a32_mov_i(const u8 dst, const u32 val,
> +                                 bool dstk, struct jit_ctx *ctx)
>  {
> -#ifdef __LITTLE_ENDIAN
> -       emit(ARM_REV16(r_dst, r_src), ctx);
> -#endif
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +
> +       if (dstk) {
> +               emit_mov_i(tmp[1], val, ctx);
> +               emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(dst)), ctx);
> +       } else {
> +               emit_mov_i(dst, val, ctx);
> +       }
>  }
>
> -#endif /* __LINUX_ARM_ARCH__ < 6 */
> +/* Sign extended move */
> +static inline void emit_a32_mov_i64(const bool is64, const u8 dst[],
> +                                 const u32 val, bool dstk,
> +                                 struct jit_ctx *ctx) {
> +       u32 hi = 0;
>
> +       if (is64 && (val & (1U << 31)))
> +               hi = (u32)~0;
> +       emit_a32_mov_i(dst_lo, val, dstk, ctx);
> +       emit_a32_mov_i(dst_hi, hi, dstk, ctx);
> +}
>
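
For reference, the sign extension done by emit_a32_mov_i64() can be modelled
in plain C as below. This is a sketch of the semantics only, not of the
emitted instructions; mov_i64() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of a sign extended immediate move into a {hi, lo} pair.
 * For 64 bit moves the high word is all ones when bit 31 of the
 * immediate is set; for 32 bit moves the high word is cleared.
 */
static void mov_i64(bool is64, uint32_t val, uint32_t *hi, uint32_t *lo)
{
	*lo = val;
	*hi = (is64 && (val & (1U << 31))) ? ~0U : 0;
}
```

So a 64 bit move of the immediate -2 (0xfffffffe) produces hi = 0xffffffff,
while the same immediate in a 32 bit move leaves hi = 0.
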
> -/* Compute the immediate value for a PC-relative branch. */
> -static inline u32 b_imm(unsigned tgt, struct jit_ctx *ctx)
> -{
> -       u32 imm;
> +static inline void emit_a32_add_r(const u8 dst, const u8 src,
> +                             const bool is64, const bool hi,
> +                             struct jit_ctx *ctx) {
> +       /* 64 bit :
> +        *      adds dst_lo, dst_lo, src_lo
> +        *      adc dst_hi, dst_hi, src_hi
> +        * 32 bit :
> +        *      add dst_lo, dst_lo, src_lo
> +        */
> +       if (!hi && is64)
> +               emit(ARM_ADDS_R(dst, dst, src), ctx);
> +       else if (hi && is64)
> +               emit(ARM_ADC_R(dst, dst, src), ctx);
> +       else
> +               emit(ARM_ADD_R(dst, dst, src), ctx);
> +}
>
> -       if (ctx->target == NULL)
> -               return 0;
> -       /*
> -        * BPF allows only forward jumps and the offset of the target is
> -        * still the one computed during the first pass.
> +static inline void emit_a32_sub_r(const u8 dst, const u8 src,
> +                                 const bool is64, const bool hi,
> +                                 struct jit_ctx *ctx) {
> +       /* 64 bit :
> +        *      subs dst_lo, dst_lo, src_lo
> +        *      sbc dst_hi, dst_hi, src_hi
> +        * 32 bit :
> +        *      sub dst_lo, dst_lo, src_lo
>          */
> -       imm  = ctx->offsets[tgt] + ctx->prologue_bytes - (ctx->idx * 4 + 8);
> +       if (!hi && is64)
> +               emit(ARM_SUBS_R(dst, dst, src), ctx);
> +       else if (hi && is64)
> +               emit(ARM_SBC_R(dst, dst, src), ctx);
> +       else
> +               emit(ARM_SUB_R(dst, dst, src), ctx);
> +}
>
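
The adds/adc and subs/sbc pairs in the two helpers above propagate the carry
(or borrow) from the low word into the high word. A userspace sketch of the
same arithmetic on {hi, lo} pairs is below; add64() and sub64() are
hypothetical names, and the carry is computed explicitly because C has no
flags register.

```c
#include <stdint.h>

#define HI 0	/* same ordering as dst_hi = dst[0] */
#define LO 1	/* same ordering as dst_lo = dst[1] */

/* dst += src, as "adds lo; adc hi" would compute it. */
static void add64(uint32_t dst[2], const uint32_t src[2])
{
	uint32_t lo = dst[LO] + src[LO];
	uint32_t carry = lo < dst[LO];	/* wrapped around => carry out */

	dst[LO] = lo;
	dst[HI] = dst[HI] + src[HI] + carry;
}

/* dst -= src, as "subs lo; sbc hi" would compute it. */
static void sub64(uint32_t dst[2], const uint32_t src[2])
{
	uint32_t borrow = dst[LO] < src[LO];

	dst[LO] = dst[LO] - src[LO];
	dst[HI] = dst[HI] - src[HI] - borrow;
}
```

This is why the order matters in the JIT: the low halves have to be
processed first so that the flag set by adds/subs is still valid when
adc/sbc runs on the high halves.
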
> -       return imm >> 2;
> +static inline void emit_alu_r(const u8 dst, const u8 src, const bool is64,
> +                             const bool hi, const u8 op, struct jit_ctx *ctx){
> +       switch (BPF_OP(op)) {
> +       /* dst = dst + src */
> +       case BPF_ADD:
> +               emit_a32_add_r(dst, src, is64, hi, ctx);
> +               break;
> +       /* dst = dst - src */
> +       case BPF_SUB:
> +               emit_a32_sub_r(dst, src, is64, hi, ctx);
> +               break;
> +       /* dst = dst | src */
> +       case BPF_OR:
> +               emit(ARM_ORR_R(dst, dst, src), ctx);
> +               break;
> +       /* dst = dst & src */
> +       case BPF_AND:
> +               emit(ARM_AND_R(dst, dst, src), ctx);
> +               break;
> +       /* dst = dst ^ src */
> +       case BPF_XOR:
> +               emit(ARM_EOR_R(dst, dst, src), ctx);
> +               break;
> +       /* dst = dst * src */
> +       case BPF_MUL:
> +               emit(ARM_MUL(dst, dst, src), ctx);
> +               break;
> +       /* dst = dst << src */
> +       case BPF_LSH:
> +               emit(ARM_LSL_R(dst, dst, src), ctx);
> +               break;
> +       /* dst = dst >> src */
> +       case BPF_RSH:
> +               emit(ARM_LSR_R(dst, dst, src), ctx);
> +               break;
> +       /* dst = dst >> src (signed)*/
> +       case BPF_ARSH:
> +               emit(ARM_MOV_SR(dst, dst, SRTYPE_ASR, src), ctx);
> +               break;
> +       }
>  }
>
> -#define OP_IMM3(op, r1, r2, imm_val, ctx)                              \
> -       do {                                                            \
> -               imm12 = imm8m(imm_val);                                 \
> -               if (imm12 < 0) {                                        \
> -                       emit_mov_i_no8m(r_scratch, imm_val, ctx);       \
> -                       emit(op ## _R((r1), (r2), r_scratch), ctx);     \
> -               } else {                                                \
> -                       emit(op ## _I((r1), (r2), imm12), ctx);         \
> -               }                                                       \
> -       } while (0)
> -
> -static inline void emit_err_ret(u8 cond, struct jit_ctx *ctx)
> -{
> -       if (ctx->ret0_fp_idx >= 0) {
> -               _emit(cond, ARM_B(b_imm(ctx->ret0_fp_idx, ctx)), ctx);
> -               /* NOP to keep the size constant between passes */
> -               emit(ARM_MOV_R(ARM_R0, ARM_R0), ctx);
> +/* ALU operation (32 bit)
> + * dst = dst (op) src
> + */
> +static inline void emit_a32_alu_r(const u8 dst, const u8 src,
> +                                 bool dstk, bool sstk,
> +                                 struct jit_ctx *ctx, const bool is64,
> +                                 const bool hi, const u8 op) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       u8 rn = sstk ? tmp[1] : src;
> +
> +       if (sstk)
> +               emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src)), ctx);
> +
> +       /* ALU operation */
> +       if (dstk) {
> +               emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
> +               emit_alu_r(tmp[0], rn, is64, hi, op, ctx);
> +               emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
>         } else {
> -               _emit(cond, ARM_MOV_I(ARM_R0, 0), ctx);
> -               _emit(cond, ARM_B(b_imm(ctx->skf->len, ctx)), ctx);
> +               emit_alu_r(dst, rn, is64, hi, op, ctx);
>         }
>  }
>
> -static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
> -{
> -#if __LINUX_ARM_ARCH__ < 5
> -       emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
> +/* ALU operation (64 bit) */
> +static inline void emit_a32_alu_r64(const bool is64, const u8 dst[],
> +                                 const u8 src[], bool dstk,
> +                                 bool sstk, struct jit_ctx *ctx,
> +                                 const u8 op) {
> +       emit_a32_alu_r(dst_lo, src_lo, dstk, sstk, ctx, is64, false, op);
> +       if (is64)
> +               emit_a32_alu_r(dst_hi, src_hi, dstk, sstk, ctx, is64, true, op);
> +       else
> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
> +}
>
> -       if (elf_hwcap & HWCAP_THUMB)
> -               emit(ARM_BX(tgt_reg), ctx);
> +/* dst = src (4 bytes) */
> +static inline void emit_a32_mov_r(const u8 dst, const u8 src,
> +                                 bool dstk, bool sstk,
> +                                 struct jit_ctx *ctx) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       u8 rt = sstk ? tmp[0] : src;
> +
> +       if (sstk)
> +               emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(src)), ctx);
> +       if (dstk)
> +               emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst)), ctx);
>         else
> -               emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
> -#else
> -       emit(ARM_BLX_R(tgt_reg), ctx);
> -#endif
> +               emit(ARM_MOV_R(dst, rt), ctx);
>  }
>
> -static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx,
> -                               int bpf_op)
> -{
> -#if __LINUX_ARM_ARCH__ == 7
> -       if (elf_hwcap & HWCAP_IDIVA) {
> -               if (bpf_op == BPF_DIV)
> -                       emit(ARM_UDIV(rd, rm, rn), ctx);
> -               else {
> -                       emit(ARM_UDIV(ARM_R3, rm, rn), ctx);
> -                       emit(ARM_MLS(rd, rn, ARM_R3, rm), ctx);
> -               }
> -               return;
> +/* dst = src */
> +static inline void emit_a32_mov_r64(const bool is64, const u8 dst[],
> +                                 const u8 src[], bool dstk,
> +                                 bool sstk, struct jit_ctx *ctx) {
> +       emit_a32_mov_r(dst_lo, src_lo, dstk, sstk, ctx);
> +       if (is64) {
> +               /* complete 8 byte move */
> +               emit_a32_mov_r(dst_hi, src_hi, dstk, sstk, ctx);
> +       } else {
> +               /* Zero out high 4 bytes */
> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>         }
> -#endif
> +}
>
> -       /*
> -        * For BPF_ALU | BPF_DIV | BPF_K instructions, rm is ARM_R4
> -        * (r_A) and rn is ARM_R0 (r_scratch) so load rn first into
> -        * ARM_R1 to avoid accidentally overwriting ARM_R0 with rm
> -        * before using it as a source for ARM_R1.
> -        *
> -        * For BPF_ALU | BPF_DIV | BPF_X rm is ARM_R4 (r_A) and rn is
> -        * ARM_R5 (r_X) so there is no particular register overlap
> -        * issues.
> -        */
> -       if (rn != ARM_R1)
> -               emit(ARM_MOV_R(ARM_R1, rn), ctx);
> -       if (rm != ARM_R0)
> -               emit(ARM_MOV_R(ARM_R0, rm), ctx);
> +/* Shift operations */
> +static inline void emit_a32_alu_i(const u8 dst, const u32 val, bool dstk,
> +                               struct jit_ctx *ctx, const u8 op) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       u8 rd = dstk ? tmp[0] : dst;
> +
> +       if (dstk)
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
> +
> +       /* Do shift operation */
> +       switch (op) {
> +       case BPF_LSH:
> +               emit(ARM_LSL_I(rd, rd, val), ctx);
> +               break;
> +       case BPF_RSH:
> +               emit(ARM_LSR_I(rd, rd, val), ctx);
> +               break;
> +       case BPF_NEG:
> +               emit(ARM_RSB_I(rd, rd, val), ctx);
> +               break;
> +       }
> +
> +       if (dstk)
> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
> +}
> +
> +/* dst = -dst (64 bit) */
> +static inline void emit_a32_neg64(const u8 dst[], bool dstk,
> +                               struct jit_ctx *ctx){
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       u8 rd = dstk ? tmp[1] : dst[1];
> +       u8 rm = dstk ? tmp[0] : dst[0];
> +
> +       /* Setup Operand */
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do Negate Operation */
> +       emit(ARM_RSBS_I(rd, rd, 0), ctx);
> +       emit(ARM_RSC_I(rm, rm, 0), ctx);
> +
> +       if (dstk) {
> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +}
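
The rsbs/rsc pair computes 0 - dst across both words, i.e. a two's complement negate rather than a bitwise NOT. A minimal sketch of that arithmetic, assuming the low word is processed first as in the hunk above (`neg64_by_halves()` is a hypothetical name used only for illustration):

```c
#include <stdint.h>

/* dst = -dst, modelled as RSBS lo, lo, #0 followed by RSC hi, hi, #0 */
static void neg64_by_halves(uint32_t *lo, uint32_t *hi)
{
	/* RSBS computes 0 - lo; there is no borrow only when lo == 0 */
	uint32_t carry = (*lo == 0);

	*lo = 0u - *lo;
	/* RSC computes 0 - hi - (1 - carry) */
	*hi = 0u - *hi - (1u - carry);
}
```
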
>
> +/* dst = dst << src */
> +static inline void emit_a32_lsh_r64(const u8 dst[], const u8 src[], bool dstk,
> +                                   bool sstk, struct jit_ctx *ctx) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +
> +       /* Setup Operands */
> +       u8 rt = sstk ? tmp2[1] : src_lo;
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +
> +       if (sstk)
> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do LSH operation */
> +       emit(ARM_SUB_I(ARM_IP, rt, 32), ctx);
> +       emit(ARM_RSB_I(tmp2[0], rt, 32), ctx);
> +       /* As we are using ARM_LR */
>         ctx->seen |= SEEN_CALL;
> -       emit_mov_i(ARM_R3, bpf_op == BPF_DIV ? (u32)jit_udiv : (u32)jit_mod,
> -                  ctx);
> -       emit_blx_r(ARM_R3, ctx);
> +       emit(ARM_MOV_SR(ARM_LR, rm, SRTYPE_ASL, rt), ctx);
> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rd, SRTYPE_ASL, ARM_IP), ctx);
> +       emit(ARM_ORR_SR(ARM_IP, ARM_LR, rd, SRTYPE_LSR, tmp2[0]), ctx);
> +       emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_ASL, rt), ctx);
> +
> +       if (dstk) {
> +               emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       } else {
> +               emit(ARM_MOV_R(rd, ARM_LR), ctx);
> +               emit(ARM_MOV_R(rm, ARM_IP), ctx);
> +       }
> +}
>
> -       if (rd != ARM_R0)
> -               emit(ARM_MOV_R(rd, ARM_R0), ctx);
> +/* dst = dst >> src (signed)*/
> +static inline void emit_a32_arsh_r64(const u8 dst[], const u8 src[], bool dstk,
> +                                   bool sstk, struct jit_ctx *ctx) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       /* Setup Operands */
> +       u8 rt = sstk ? tmp2[1] : src_lo;
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +
> +       if (sstk)
> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do the ARSH operation */
> +       emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
> +       emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
> +       /* As we are using ARM_LR */
> +       ctx->seen |= SEEN_CALL;
> +       emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
> +       _emit(ARM_COND_MI, ARM_B(0), ctx);
> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASR, tmp2[0]), ctx);
> +       emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_ASR, rt), ctx);
> +       if (dstk) {
> +               emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       } else {
> +               emit(ARM_MOV_R(rd, ARM_LR), ctx);
> +               emit(ARM_MOV_R(rm, ARM_IP), ctx);
> +       }
>  }
>
> -static inline void update_on_xread(struct jit_ctx *ctx)
> +/* dst = dst >> src */
> +static inline void emit_a32_lsr_r64(const u8 dst[], const u8 src[], bool dstk,
> +                                    bool sstk, struct jit_ctx *ctx) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       /* Setup Operands */
> +       u8 rt = sstk ? tmp2[1] : src_lo;
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +
> +       if (sstk)
> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do LSR operation */
> +       emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
> +       emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
> +       /* As we are using ARM_LR */
> +       ctx->seen |= SEEN_CALL;
> +       emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_LSR, tmp2[0]), ctx);
> +       emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_LSR, rt), ctx);
> +       if (dstk) {
> +               emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       } else {
> +               emit(ARM_MOV_R(rd, ARM_LR), ctx);
> +               emit(ARM_MOV_R(rm, ARM_IP), ctx);
> +       }
> +}
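
The register-shift helpers above lean on two properties of ARM register-specified shifts: only the least significant byte of the shift register is used, and LSL/LSR by 32 or more yield 0. That is why the "wrong side" ORR contributes nothing for a given shift amount. A host-side sketch of the logical right shift, mirroring the instruction order in emit_a32_lsr_r64 (the helper names `arm_lsl()`, `arm_lsr()` and `lsr64_model()` are hypothetical, and shift amounts of 64 or more are left out of scope as an assumption):

```c
#include <assert.h>
#include <stdint.h>

/* Register-specified LSL: bottom byte of rs is the amount, >= 32 gives 0 */
static uint32_t arm_lsl(uint32_t v, uint32_t rs)
{
	uint32_t n = rs & 0xff;

	return n >= 32 ? 0 : v << n;
}

/* Register-specified LSR: bottom byte of rs is the amount, >= 32 gives 0 */
static uint32_t arm_lsr(uint32_t v, uint32_t rs)
{
	uint32_t n = rs & 0xff;

	return n >= 32 ? 0 : v >> n;
}

/* dst = dst >> shift (logical), for 0 <= shift < 64 */
static uint64_t lsr64_model(uint64_t dst, uint32_t shift)
{
	uint32_t lo = (uint32_t)dst;
	uint32_t hi = (uint32_t)(dst >> 32);
	uint32_t ip, t, lr;

	assert(shift < 64);             /* assumption: larger shifts not modelled */

	ip = 32u - shift;               /* rsb  ip, rt, #32 */
	t = shift - 32u;                /* subs tmp2[0], rt, #32 */
	lr = arm_lsr(lo, shift);        /* mov  lr, rd, lsr rt */
	lr |= arm_lsl(hi, ip);          /* orr  lr, lr, rm, asl ip */
	lr |= arm_lsr(hi, t);           /* orr  lr, lr, rm, lsr tmp2[0] */
	ip = arm_lsr(hi, shift);        /* mov  ip, rm, lsr rt */

	return ((uint64_t)ip << 32) | lr;
}
```

The arithmetic variant cannot use the same unconditional ORR, because ASR by 32 or more fills with the sign bit instead of producing 0. That appears to be the reason for the conditional branch over the ASR-based ORR in emit_a32_arsh_r64.
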
> +
> +/* dst = dst << val */
> +static inline void emit_a32_lsh_i64(const u8 dst[], bool dstk,
> +                                    const u32 val, struct jit_ctx *ctx){
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       /* Setup operands */
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do LSH operation */
> +       if (val < 32) {
> +               emit(ARM_MOV_SI(tmp2[0], rm, SRTYPE_ASL, val), ctx);
> +               emit(ARM_ORR_SI(rm, tmp2[0], rd, SRTYPE_LSR, 32 - val), ctx);
> +               emit(ARM_MOV_SI(rd, rd, SRTYPE_ASL, val), ctx);
> +       } else {
> +               if (val == 32)
> +                       emit(ARM_MOV_R(rm, rd), ctx);
> +               else
> +                       emit(ARM_MOV_SI(rm, rd, SRTYPE_ASL, val - 32), ctx);
> +               emit(ARM_EOR_R(rd, rd, rd), ctx);
> +       }
> +
> +       if (dstk) {
> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +}
> +
> +/* dst = dst >> val */
> +static inline void emit_a32_lsr_i64(const u8 dst[], bool dstk,
> +                                   const u32 val, struct jit_ctx *ctx) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       /* Setup operands */
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do LSR operation */
> +       if (val < 32) {
> +               emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
> +               emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_LSR, val), ctx);
> +       } else if (val == 32) {
> +               emit(ARM_MOV_R(rd, rm), ctx);
> +               emit(ARM_MOV_I(rm, 0), ctx);
> +       } else {
> +               emit(ARM_MOV_SI(rd, rm, SRTYPE_LSR, val - 32), ctx);
> +               emit(ARM_MOV_I(rm, 0), ctx);
> +       }
> +
> +       if (dstk) {
> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +}
> +
> +/* dst = dst >> val (signed) */
> +static inline void emit_a32_arsh_i64(const u8 dst[], bool dstk,
> +                                    const u32 val, struct jit_ctx *ctx){
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       /* Setup operands */
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do ARSH operation */
> +       if (val < 32) {
> +               emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
> +               emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, val), ctx);
> +       } else if (val == 32) {
> +               emit(ARM_MOV_R(rd, rm), ctx);
> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
> +       } else {
> +               emit(ARM_MOV_SI(rd, rm, SRTYPE_ASR, val - 32), ctx);
> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
> +       }
> +
> +       if (dstk) {
> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +}
> +
> +static inline void emit_a32_mul_r64(const u8 dst[], const u8 src[], bool dstk,
> +                                   bool sstk, struct jit_ctx *ctx) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       /* Setup operands for multiplication */
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +       u8 rt = sstk ? tmp2[1] : src_lo;
> +       u8 rn = sstk ? tmp2[0] : src_hi;
> +
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +       if (sstk) {
> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
> +               emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_hi)), ctx);
> +       }
> +
> +       /* Do Multiplication */
> +       emit(ARM_MUL(ARM_IP, rd, rn), ctx);
> +       emit(ARM_MUL(ARM_LR, rm, rt), ctx);
> +       /* As we are using ARM_LR */
> +       ctx->seen |= SEEN_CALL;
> +       emit(ARM_ADD_R(ARM_LR, ARM_IP, ARM_LR), ctx);
> +
> +       emit(ARM_UMULL(ARM_IP, rm, rd, rt), ctx);
> +       emit(ARM_ADD_R(rm, ARM_LR, rm), ctx);
> +       if (dstk) {
> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       } else {
> +               emit(ARM_MOV_R(rd, ARM_IP), ctx);
> +       }
> +}
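
The multiply sequence above follows the usual decomposition of a 64x64 -> 64 bit product into three 32-bit multiplies: two truncated cross products (mul) and one widening product of the low words (umull). A minimal sketch of that identity, with the hypothetical name `mul64_by_halves()` used for illustration only:

```c
#include <stdint.h>

/* dst = dst * src (low 64 bits), built from 32-bit pieces */
static uint64_t mul64_by_halves(uint64_t dst, uint64_t src)
{
	uint32_t dlo = (uint32_t)dst, dhi = (uint32_t)(dst >> 32);
	uint32_t slo = (uint32_t)src, shi = (uint32_t)(src >> 32);

	/* mul + mul + add: cross terms, only their low 32 bits matter */
	uint32_t cross = dlo * shi + dhi * slo;
	/* umull: full 64-bit product of the two low words */
	uint64_t low = (uint64_t)dlo * slo;
	/* add: fold the cross terms into the high word */
	uint32_t hi = (uint32_t)(low >> 32) + cross;

	return ((uint64_t)hi << 32) | (uint32_t)low;
}
```

The dhi * shi term is absent because it only affects bits 64 and above, which are discarded.
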
> +
> +/* *(size *)(dst + off) = src */
> +static inline void emit_str_r(const u8 dst, const u8 src, bool dstk,
> +                             const s32 off, struct jit_ctx *ctx, const u8 sz){
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       u8 rd = dstk ? tmp[1] : dst;
> +
> +       if (dstk)
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
> +       if (off) {
> +               emit_a32_mov_i(tmp[0], off, false, ctx);
> +               emit(ARM_ADD_R(tmp[0], rd, tmp[0]), ctx);
> +               rd = tmp[0];
> +       }
> +       switch (sz) {
> +       case BPF_W:
> +               /* Store a Word */
> +               emit(ARM_STR_I(src, rd, 0), ctx);
> +               break;
> +       case BPF_H:
> +               /* Store a HalfWord */
> +               emit(ARM_STRH_I(src, rd, 0), ctx);
> +               break;
> +       case BPF_B:
> +               /* Store a Byte */
> +               emit(ARM_STRB_I(src, rd, 0), ctx);
> +               break;
> +       }
> +}
> +
> +/* dst = *(size*)(src + off) */
> +static inline void emit_ldx_r(const u8 dst, const u8 src, bool dstk,
> +                             const s32 off, struct jit_ctx *ctx, const u8 sz){
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       u8 rd = dstk ? tmp[1] : dst;
> +       u8 rm = src;
> +
> +       if (off) {
> +               emit_a32_mov_i(tmp[0], off, false, ctx);
> +               emit(ARM_ADD_R(tmp[0], tmp[0], src), ctx);
> +               rm = tmp[0];
> +       }
> +       switch (sz) {
> +       case BPF_W:
> +               /* Load a Word */
> +               emit(ARM_LDR_I(rd, rm, 0), ctx);
> +               break;
> +       case BPF_H:
> +               /* Load a HalfWord */
> +               emit(ARM_LDRH_I(rd, rm, 0), ctx);
> +               break;
> +       case BPF_B:
> +               /* Load a Byte */
> +               emit(ARM_LDRB_I(rd, rm, 0), ctx);
> +               break;
> +       }
> +       if (dstk)
> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
> +}
> +
> +/* Arithmetic Operation */
> +static inline void emit_ar_r(const u8 rd, const u8 rt, const u8 rm,
> +                            const u8 rn, struct jit_ctx *ctx, u8 op) {
> +       switch (op) {
> +       case BPF_JSET:
> +               ctx->seen |= SEEN_CALL;
> +               emit(ARM_AND_R(ARM_IP, rt, rn), ctx);
> +               emit(ARM_AND_R(ARM_LR, rd, rm), ctx);
> +               emit(ARM_ORRS_R(ARM_IP, ARM_LR, ARM_IP), ctx);
> +               break;
> +       case BPF_JEQ:
> +       case BPF_JNE:
> +       case BPF_JGT:
> +       case BPF_JGE:
> +               emit(ARM_CMP_R(rd, rm), ctx);
> +               _emit(ARM_COND_EQ, ARM_CMP_R(rt, rn), ctx);
> +               break;
> +       case BPF_JSGT:
> +               emit(ARM_CMP_R(rn, rt), ctx);
> +               emit(ARM_SBCS_R(ARM_IP, rm, rd), ctx);
> +               break;
> +       case BPF_JSGE:
> +               emit(ARM_CMP_R(rt, rn), ctx);
> +               emit(ARM_SBCS_R(ARM_IP, rd, rm), ctx);
> +               break;
> +       }
> +}
> +
> +static int out_offset = -1; /* initialized on the first pass of build_body() */
> +static int emit_bpf_tail_call(struct jit_ctx *ctx)
> +{
> +
> +       /* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
> +       const u8 *r2 = bpf2a32[BPF_REG_2];
> +       const u8 *r3 = bpf2a32[BPF_REG_3];
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       const u8 *tcc = bpf2a32[TCALL_CNT];
> +       const int idx0 = ctx->idx;
> +#define cur_offset (ctx->idx - idx0)
> +#define jmp_offset (out_offset - (cur_offset))
> +       u32 off, lo, hi;
> +
> +       /* if (index >= array->map.max_entries)
> +        *      goto out;
> +        */
> +       off = offsetof(struct bpf_array, map.max_entries);
> +       /* array->map.max_entries */
> +       emit_a32_mov_i(tmp[1], off, false, ctx);
> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
> +       emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
> +       /* index (64 bit) */
> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
> +       /* index >= array->map.max_entries */
> +       emit(ARM_CMP_R(tmp2[1], tmp[1]), ctx);
> +       _emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
> +
> +       /* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
> +        *      goto out;
> +        * tail_call_cnt++;
> +        */
> +       lo = (u32)MAX_TAIL_CALL_CNT;
> +       hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
> +       emit(ARM_LDR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
> +       emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
> +       emit(ARM_CMP_I(tmp[0], hi), ctx);
> +       _emit(ARM_COND_EQ, ARM_CMP_I(tmp[1], lo), ctx);
> +       _emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
> +       emit(ARM_ADDS_I(tmp[1], tmp[1], 1), ctx);
> +       emit(ARM_ADC_I(tmp[0], tmp[0], 0), ctx);
> +       emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
> +       emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
> +
> +       /* prog = array->ptrs[index]
> +        * if (prog == NULL)
> +        *      goto out;
> +        */
> +       off = offsetof(struct bpf_array, ptrs);
> +       emit_a32_mov_i(tmp[1], off, false, ctx);
> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
> +       emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
> +       emit(ARM_MOV_SI(tmp[0], tmp2[1], SRTYPE_ASL, 2), ctx);
> +       emit(ARM_LDR_R(tmp[1], tmp[1], tmp[0]), ctx);
> +       emit(ARM_CMP_I(tmp[1], 0), ctx);
> +       _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
> +
> +       /* goto *(prog->bpf_func + prologue_size); */
> +       off = offsetof(struct bpf_prog, bpf_func);
> +       emit_a32_mov_i(tmp2[1], off, false, ctx);
> +       emit(ARM_LDR_R(tmp[1], tmp[1], tmp2[1]), ctx);
> +       emit(ARM_ADD_I(tmp[1], tmp[1], ctx->prologue_bytes), ctx);
> +       emit(ARM_BX(tmp[1]), ctx);
> +
> +       /* out: */
> +       if (out_offset == -1)
> +               out_offset = cur_offset;
> +       if (cur_offset != out_offset) {
> +               pr_err_once("tail_call out_offset = %d, expected %d!\n",
> +                           cur_offset, out_offset);
> +               return -1;
> +       }
> +       return 0;
> +#undef cur_offset
> +#undef jmp_offset
> +}
> +
> +/* 0xabcd => 0xcdab */
> +static inline void emit_rev16(const u8 rd, const u8 rn, struct jit_ctx *ctx)
>  {
> -       if (!(ctx->seen & SEEN_X))
> -               ctx->flags |= FLAG_NEED_X_RESET;
> +#if __LINUX_ARM_ARCH__ < 6
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +
> +       emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
> +       emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 8), ctx);
> +       emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
> +       emit(ARM_ORR_SI(rd, tmp2[0], tmp2[1], SRTYPE_LSL, 8), ctx);
> +#else /* ARMv6+ */
> +       emit(ARM_REV16(rd, rn), ctx);
> +#endif
> +}
>
> -       ctx->seen |= SEEN_X;
> +/* 0xabcdefgh => 0xghefcdab */
> +static inline void emit_rev32(const u8 rd, const u8 rn, struct jit_ctx *ctx)
> +{
> +#if __LINUX_ARM_ARCH__ < 6
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +
> +       emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
> +       emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 24), ctx);
> +       emit(ARM_ORR_SI(ARM_IP, tmp2[0], tmp2[1], SRTYPE_LSL, 24), ctx);
> +
> +       emit(ARM_MOV_SI(tmp2[1], rn, SRTYPE_LSR, 8), ctx);
> +       emit(ARM_AND_I(tmp2[1], tmp2[1], 0xff), ctx);
> +       emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 16), ctx);
> +       emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
> +       emit(ARM_MOV_SI(tmp2[0], tmp2[0], SRTYPE_LSL, 8), ctx);
> +       emit(ARM_ORR_SI(tmp2[0], tmp2[0], tmp2[1], SRTYPE_LSL, 16), ctx);
> +       emit(ARM_ORR_R(rd, ARM_IP, tmp2[0]), ctx);
> +
> +#else /* ARMv6+ */
> +       emit(ARM_REV(rd, rn), ctx);
> +#endif
>  }
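
To make the pre-ARMv6 fallbacks easier to check by hand, here is a host-side model of what the two instruction sequences compute. The names `rev16_fallback()` and `rev32_fallback()` are hypothetical and exist only for this sketch.

```c
#include <stdint.h>

/* Model of the pre-ARMv6 path in emit_rev16(): only bits 15:0 of rn
 * contribute, and bits 31:16 of the result are zero.
 */
static uint32_t rev16_fallback(uint32_t rn)
{
	uint32_t b0 = rn & 0xff;          /* and tmp2[1], rn, #0xff */
	uint32_t b1 = (rn >> 8) & 0xff;   /* mov ..., lsr #8 ; and ..., #0xff */

	return b1 | (b0 << 8);            /* orr rd, tmp2[0], tmp2[1], lsl #8 */
}

/* Model of the pre-ARMv6 path in emit_rev32(): full 4-byte reversal */
static uint32_t rev32_fallback(uint32_t rn)
{
	/* outer bytes: byte 0 <-> byte 3 */
	uint32_t outer = (rn >> 24) | ((rn & 0xff) << 24);
	/* inner bytes: byte 1 <-> byte 2 */
	uint32_t inner = (((rn >> 16) & 0xff) << 8) |
			 (((rn >> 8) & 0xff) << 16);

	return outer | inner;
}
```

One difference worth keeping in mind: the REV16 instruction swaps the bytes within both halfwords, whereas the fallback clears the upper halfword. The two paths agree only if the caller ignores or masks bits 31:16 afterwards, which is not visible in this hunk.
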
>
> -static int build_body(struct jit_ctx *ctx)
> +static void build_prologue(struct jit_ctx *ctx)
>  {
> -       void *load_func[] = {jit_get_skb_b, jit_get_skb_h, jit_get_skb_w};
> -       const struct bpf_prog *prog = ctx->skf;
> -       const struct sock_filter *inst;
> -       unsigned i, load_order, off, condt;
> -       int imm12;
> -       u32 k;
> +       const u8 r0 = bpf2a32[BPF_REG_0][1];
> +       const u8 r2 = bpf2a32[BPF_REG_1][1];
> +       const u8 r3 = bpf2a32[BPF_REG_1][0];
> +       const u8 r4 = bpf2a32[BPF_REG_6][1];
> +       const u8 r5 = bpf2a32[BPF_REG_6][0];
> +       const u8 r6 = bpf2a32[TMP_REG_1][1];
> +       const u8 r7 = bpf2a32[TMP_REG_1][0];
> +       const u8 r8 = bpf2a32[TMP_REG_2][1];
> +       const u8 r10 = bpf2a32[TMP_REG_2][0];
> +       const u8 fplo = bpf2a32[BPF_REG_FP][1];
> +       const u8 fphi = bpf2a32[BPF_REG_FP][0];
> +       const u8 sp = ARM_SP;
> +       const u8 *tcc = bpf2a32[TCALL_CNT];
> +
> +       u16 reg_set = 0;
>
> -       for (i = 0; i < prog->len; i++) {
> -               u16 code;
> +       /*
> +        * eBPF prog stack layout
> +        *
> +        *                         high
> +        * original ARM_SP =>     +-----+ eBPF prologue
> +        *                        |FP/LR|
> +        * current ARM_FP =>      +-----+
> +        *                        | ... | callee saved registers
> +        * eBPF fp register =>    +-----+ <= (BPF_FP)
> +        *                        | ... | eBPF JIT scratch space
> +        *                        |     | eBPF prog stack
> +        *                        +-----+
> +        *                        |RSVD | JIT scratchpad
> +        * current ARM_SP =>      +-----+ <= (BPF_FP - STACK_SIZE)
> +        *                        |     |
> +        *                        | ... | Function call stack
> +        *                        |     |
> +        *                        +-----+
> +        *                          low
> +        */
>
> -               inst = &(prog->insns[i]);
> -               /* K as an immediate value operand */
> -               k = inst->k;
> -               code = bpf_anc_helper(inst);
> +       /* Save callee saved registers. */
> +       reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
> +#ifdef CONFIG_FRAME_POINTER
> +       reg_set |= (1<<ARM_FP) | (1<<ARM_IP) | (1<<ARM_LR) | (1<<ARM_PC);
> +       emit(ARM_MOV_R(ARM_IP, sp), ctx);
> +       emit(ARM_PUSH(reg_set), ctx);
> +       emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
> +#else
> +       /* Check if call instruction exists in BPF body */
> +       if (ctx->seen & SEEN_CALL)
> +               reg_set |= (1<<ARM_LR);
> +       emit(ARM_PUSH(reg_set), ctx);
> +#endif
> +       /* Save frame pointer for later */
> +       emit(ARM_SUB_I(ARM_IP, sp, SCRATCH_SIZE), ctx);
> +
> +       /* Set up function call stack */
> +       emit(ARM_SUB_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
> +
> +       /* Set up BPF prog stack base register */
> +       emit_a32_mov_r(fplo, ARM_IP, true, false, ctx);
> +       emit_a32_mov_i(fphi, 0, true, ctx);
> +
> +       /* mov r4, 0 */
> +       emit(ARM_MOV_I(r4, 0), ctx);
> +       /* MOV bpf_ctx pointer to BPF_R1 */
> +       emit(ARM_MOV_R(r3, r4), ctx);
> +       emit(ARM_MOV_R(r2, r0), ctx);
> +       /* Initialize Tail Count */
> +       emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[0])), ctx);
> +       emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[1])), ctx);
> +       /* end of prologue */
> +}
>
> -               /* compute offsets only in the fake pass */
> -               if (ctx->target == NULL)
> -                       ctx->offsets[i] = ctx->idx * 4;
> +static void build_epilogue(struct jit_ctx *ctx)
> +{
> +       const u8 r4 = bpf2a32[BPF_REG_6][1];
> +       const u8 r5 = bpf2a32[BPF_REG_6][0];
> +       const u8 r6 = bpf2a32[TMP_REG_1][1];
> +       const u8 r7 = bpf2a32[TMP_REG_1][0];
> +       const u8 r8 = bpf2a32[TMP_REG_2][1];
> +       const u8 r10 = bpf2a32[TMP_REG_2][0];
> +       u16 reg_set = 0;
> +
> +       /* unwind function call stack */
> +       emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
> +
> +       /* restore callee saved registers. */
> +       reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
> +#ifdef CONFIG_FRAME_POINTER
> +       /* the first instruction of the prologue was: mov ip, sp */
> +       reg_set |= (1<<ARM_FP) | (1<<ARM_SP) | (1<<ARM_PC);
> +       emit(ARM_LDM(ARM_SP, reg_set), ctx);
> +#else
> +       if (ctx->seen & SEEN_CALL)
> +               reg_set |= (1<<ARM_PC);
> +       /* Restore callee saved registers. */
> +       emit(ARM_POP(reg_set), ctx);
> +       /* Return to the caller */
> +       if (!(ctx->seen & SEEN_CALL))
> +               emit(ARM_BX(ARM_LR), ctx);
> +#endif
> +}
>
> -               switch (code) {
> -               case BPF_LD | BPF_IMM:
> -                       emit_mov_i(r_A, k, ctx);
> +/*
> + * Convert an eBPF instruction to a native instruction, i.e.
> + * JIT an eBPF instruction.
> + * Returns :
> + *     0  - Successfully JITed an 8-byte eBPF instruction
> + *     >0 - Successfully JITed a 16-byte eBPF instruction
> + *     <0 - Failed to JIT.
> + */
> +static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
> +{
> +       const u8 code = insn->code;
> +       const u8 *dst = bpf2a32[insn->dst_reg];
> +       const u8 *src = bpf2a32[insn->src_reg];
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       const s16 off = insn->off;
> +       const s32 imm = insn->imm;
> +       const int i = insn - ctx->prog->insnsi;
> +       const bool is64 = BPF_CLASS(code) == BPF_ALU64;
> +       const bool dstk = is_on_stack(insn->dst_reg);
> +       const bool sstk = is_on_stack(insn->src_reg);
> +       u8 rd, rt, rm, rn;
> +       s32 jmp_offset;
> +
> +#define check_imm(bits, imm) do {                              \
> +       if ((((imm) > 0) && ((imm) >> (bits))) ||               \
> +           (((imm) < 0) && (~(imm) >> (bits)))) {              \
> +               pr_info("[%2d] imm=%d(0x%x) out of range\n",    \
> +                       i, imm, imm);                           \
> +               return -EINVAL;                                 \
> +       }                                                       \
> +} while (0)
> +#define check_imm24(imm) check_imm(24, imm)
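
It may help to spell out which values check_imm() actually accepts. The sketch below restates the macro's condition as a function (`imm_in_range()` is a hypothetical name, and the early return is replaced by a boolean result):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same predicate as check_imm(bits, imm), returning true when the macro
 * would NOT bail out with -EINVAL.
 */
static bool imm_in_range(int bits, int32_t imm)
{
	assert(bits > 0 && bits < 31);  /* assumption: sane field widths only */

	if ((imm > 0 && (imm >> bits)) ||
	    (imm < 0 && (~imm >> bits)))
		return false;
	return true;
}
```

With bits = 24 this accepts every value in [-2^24, 2^24 - 1]. If check_imm24() is intended to guard the signed 24-bit immediate of a B instruction, that field holds [-2^23, 2^23 - 1], so the check would be one bit too permissive. Whether this matters depends on how jmp_offset is bounded elsewhere, which this hunk does not show.
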
> +
> +       switch (code) {
> +       /* ALU operations */
> +
> +       /* dst = src */
> +       case BPF_ALU | BPF_MOV | BPF_K:
> +       case BPF_ALU | BPF_MOV | BPF_X:
> +       case BPF_ALU64 | BPF_MOV | BPF_K:
> +       case BPF_ALU64 | BPF_MOV | BPF_X:
> +               switch (BPF_SRC(code)) {
> +               case BPF_X:
> +                       emit_a32_mov_r64(is64, dst, src, dstk, sstk, ctx);
>                         break;
> -               case BPF_LD | BPF_W | BPF_LEN:
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
> -                       emit(ARM_LDR_I(r_A, r_skb,
> -                                      offsetof(struct sk_buff, len)), ctx);
> +               case BPF_K:
> +                       /* Sign-extend immediate value to destination reg */
> +                       emit_a32_mov_i64(is64, dst, imm, dstk, ctx);
>                         break;
> -               case BPF_LD | BPF_MEM:
> -                       /* A = scratch[k] */
> -                       ctx->seen |= SEEN_MEM_WORD(k);
> -                       emit(ARM_LDR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
> +               }
> +               break;
> +       /* dst = dst + src/imm */
> +       /* dst = dst - src/imm */
> +       /* dst = dst | src/imm */
> +       /* dst = dst & src/imm */
> +       /* dst = dst ^ src/imm */
> +       /* dst = dst * src/imm */
> +       /* dst = dst << src */
> +       /* dst = dst >> src */
> +       /* dst = dst >> src/imm (signed) */
> +       case BPF_ALU | BPF_ADD | BPF_K:
> +       case BPF_ALU | BPF_ADD | BPF_X:
> +       case BPF_ALU | BPF_SUB | BPF_K:
> +       case BPF_ALU | BPF_SUB | BPF_X:
> +       case BPF_ALU | BPF_OR | BPF_K:
> +       case BPF_ALU | BPF_OR | BPF_X:
> +       case BPF_ALU | BPF_AND | BPF_K:
> +       case BPF_ALU | BPF_AND | BPF_X:
> +       case BPF_ALU | BPF_XOR | BPF_K:
> +       case BPF_ALU | BPF_XOR | BPF_X:
> +       case BPF_ALU | BPF_MUL | BPF_K:
> +       case BPF_ALU | BPF_MUL | BPF_X:
> +       case BPF_ALU | BPF_LSH | BPF_X:
> +       case BPF_ALU | BPF_RSH | BPF_X:
> +       case BPF_ALU | BPF_ARSH | BPF_K:
> +       case BPF_ALU | BPF_ARSH | BPF_X:
> +       case BPF_ALU64 | BPF_ADD | BPF_K:
> +       case BPF_ALU64 | BPF_ADD | BPF_X:
> +       case BPF_ALU64 | BPF_SUB | BPF_K:
> +       case BPF_ALU64 | BPF_SUB | BPF_X:
> +       case BPF_ALU64 | BPF_OR | BPF_K:
> +       case BPF_ALU64 | BPF_OR | BPF_X:
> +       case BPF_ALU64 | BPF_AND | BPF_K:
> +       case BPF_ALU64 | BPF_AND | BPF_X:
> +       case BPF_ALU64 | BPF_XOR | BPF_K:
> +       case BPF_ALU64 | BPF_XOR | BPF_X:
> +               switch (BPF_SRC(code)) {
> +               case BPF_X:
> +                       emit_a32_alu_r64(is64, dst, src, dstk, sstk,
> +                                        ctx, BPF_OP(code));
>                         break;
> -               case BPF_LD | BPF_W | BPF_ABS:
> -                       load_order = 2;
> -                       goto load;
> -               case BPF_LD | BPF_H | BPF_ABS:
> -                       load_order = 1;
> -                       goto load;
> -               case BPF_LD | BPF_B | BPF_ABS:
> -                       load_order = 0;
> -load:
> -                       emit_mov_i(r_off, k, ctx);
> -load_common:
> -                       ctx->seen |= SEEN_DATA | SEEN_CALL;
> -
> -                       if (load_order > 0) {
> -                               emit(ARM_SUB_I(r_scratch, r_skb_hl,
> -                                              1 << load_order), ctx);
> -                               emit(ARM_CMP_R(r_scratch, r_off), ctx);
> -                               condt = ARM_COND_GE;
> -                       } else {
> -                               emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
> -                               condt = ARM_COND_HI;
> -                       }
> -
> -                       /*
> -                        * test for negative offset, only if we are
> -                        * currently scheduled to take the fast
> -                        * path. this will update the flags so that
> -                        * the slowpath instruction are ignored if the
> -                        * offset is negative.
> -                        *
> -                        * for loard_order == 0 the HI condition will
> -                        * make loads at offset 0 take the slow path too.
> +               case BPF_K:
> +                       /* Sign-extend the immediate into the temporary
> +                        * register pair first, then perform the ALU
> +                        * operation on that pair, so that the operation
> +                        * sees a correctly extended operand.
>                          */
> -                       _emit(condt, ARM_CMP_I(r_off, 0), ctx);
> -
> -                       _emit(condt, ARM_ADD_R(r_scratch, r_off, r_skb_data),
> -                             ctx);
> -
> -                       if (load_order == 0)
> -                               _emit(condt, ARM_LDRB_I(r_A, r_scratch, 0),
> -                                     ctx);
> -                       else if (load_order == 1)
> -                               emit_load_be16(condt, r_A, r_scratch, ctx);
> -                       else if (load_order == 2)
> -                               emit_load_be32(condt, r_A, r_scratch, ctx);
> -
> -                       _emit(condt, ARM_B(b_imm(i + 1, ctx)), ctx);
> -
> -                       /* the slowpath */
> -                       emit_mov_i(ARM_R3, (u32)load_func[load_order], ctx);
> -                       emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
> -                       /* the offset is already in R1 */
> -                       emit_blx_r(ARM_R3, ctx);
> -                       /* check the result of skb_copy_bits */
> -                       emit(ARM_CMP_I(ARM_R1, 0), ctx);
> -                       emit_err_ret(ARM_COND_NE, ctx);
> -                       emit(ARM_MOV_R(r_A, ARM_R0), ctx);
> +                       emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
> +                       emit_a32_alu_r64(is64, dst, tmp2, dstk, false,
> +                                        ctx, BPF_OP(code));
>                         break;
> -               case BPF_LD | BPF_W | BPF_IND:
> -                       load_order = 2;
> -                       goto load_ind;
> -               case BPF_LD | BPF_H | BPF_IND:
> -                       load_order = 1;
> -                       goto load_ind;
> -               case BPF_LD | BPF_B | BPF_IND:
> -                       load_order = 0;
> -load_ind:
> -                       update_on_xread(ctx);
> -                       OP_IMM3(ARM_ADD, r_off, r_X, k, ctx);
> -                       goto load_common;
> -               case BPF_LDX | BPF_IMM:
> -                       ctx->seen |= SEEN_X;
> -                       emit_mov_i(r_X, k, ctx);
> +               }
> +               break;
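
To make the BPF_K path above concrete: a 64-bit eBPF register is held as two
32-bit words, so the immediate has to be widened to a hi/lo pair before the
operation. The sketch below models that in plain C for BPF_ADD only. The
struct and function names are hypothetical, and the carry handling is an
assumption about what a 64-bit add on two 32-bit halves has to do, not a
copy of emit_a32_alu_r64().

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit BPF register held as two 32-bit words. */
struct reg64 {
	uint32_t lo;
	uint32_t hi;
};

/* Sign-extend a 32-bit immediate into a hi/lo pair. */
static struct reg64 imm_to_reg64(int32_t imm)
{
	struct reg64 r;

	r.lo = (uint32_t)imm;
	r.hi = imm < 0 ? 0xffffffffu : 0;
	return r;
}

/* 64-bit add built from two 32-bit adds with carry propagation. */
static struct reg64 add64(struct reg64 a, struct reg64 b)
{
	struct reg64 r;

	r.lo = a.lo + b.lo;
	/* (r.lo < a.lo) is 1 exactly when the low add wrapped around */
	r.hi = a.hi + b.hi + (r.lo < a.lo);
	return r;
}
```

Without the sign extension, adding an immediate of -1 to a 64-bit register
would add 0x00000000ffffffff instead of subtracting one.
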
> +       /* dst = dst / src/imm */
> +       /* dst = dst % src/imm */
> +       case BPF_ALU | BPF_DIV | BPF_K:
> +       case BPF_ALU | BPF_DIV | BPF_X:
> +       case BPF_ALU | BPF_MOD | BPF_K:
> +       case BPF_ALU | BPF_MOD | BPF_X:
> +               rt = src_lo;
> +               rd = dstk ? tmp2[1] : dst_lo;
> +               if (dstk)
> +                       emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               switch (BPF_SRC(code)) {
> +               case BPF_X:
> +                       rt = sstk ? tmp2[0] : rt;
> +                       if (sstk)
> +                               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)),
> +                                    ctx);
>                         break;
> -               case BPF_LDX | BPF_W | BPF_LEN:
> -                       ctx->seen |= SEEN_X | SEEN_SKB;
> -                       emit(ARM_LDR_I(r_X, r_skb,
> -                                      offsetof(struct sk_buff, len)), ctx);
> +               case BPF_K:
> +                       rt = tmp2[0];
> +                       emit_a32_mov_i(rt, imm, false, ctx);
>                         break;
> -               case BPF_LDX | BPF_MEM:
> -                       ctx->seen |= SEEN_X | SEEN_MEM_WORD(k);
> -                       emit(ARM_LDR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
> +               }
> +               emit_udivmod(rd, rd, rt, ctx, BPF_OP(code));
> +               if (dstk)
> +                       emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
> +               break;
> +       case BPF_ALU64 | BPF_DIV | BPF_K:
> +       case BPF_ALU64 | BPF_DIV | BPF_X:
> +       case BPF_ALU64 | BPF_MOD | BPF_K:
> +       case BPF_ALU64 | BPF_MOD | BPF_X:
> +               goto notyet;
> +       /* dst = dst >> imm */
> +       /* dst = dst << imm */
> +       case BPF_ALU | BPF_RSH | BPF_K:
> +       case BPF_ALU | BPF_LSH | BPF_K:
> +               if (unlikely(imm > 31))
> +                       return -EINVAL;
> +               if (imm)
> +                       emit_a32_alu_i(dst_lo, imm, dstk, ctx, BPF_OP(code));
> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
> +               break;
> +       /* dst = dst << imm */
> +       case BPF_ALU64 | BPF_LSH | BPF_K:
> +               if (unlikely(imm > 63))
> +                       return -EINVAL;
> +               emit_a32_lsh_i64(dst, dstk, imm, ctx);
> +               break;
> +       /* dst = dst >> imm */
> +       case BPF_ALU64 | BPF_RSH | BPF_K:
> +               if (unlikely(imm > 63))
> +                       return -EINVAL;
> +               emit_a32_lsr_i64(dst, dstk, imm, ctx);
> +               break;
> +       /* dst = dst << src */
> +       case BPF_ALU64 | BPF_LSH | BPF_X:
> +               emit_a32_lsh_r64(dst, src, dstk, sstk, ctx);
> +               break;
> +       /* dst = dst >> src */
> +       case BPF_ALU64 | BPF_RSH | BPF_X:
> +               emit_a32_lsr_r64(dst, src, dstk, sstk, ctx);
> +               break;
> +       /* dst = dst >> src (signed) */
> +       case BPF_ALU64 | BPF_ARSH | BPF_X:
> +               emit_a32_arsh_r64(dst, src, dstk, sstk, ctx);
> +               break;
> +       /* dst = dst >> imm (signed) */
> +       case BPF_ALU64 | BPF_ARSH | BPF_K:
> +               if (unlikely(imm > 63))
> +                       return -EINVAL;
> +               emit_a32_arsh_i64(dst, dstk, imm, ctx);
> +               break;
> +       /* dst = -dst */
> +       case BPF_ALU | BPF_NEG:
> +               emit_a32_alu_i(dst_lo, 0, dstk, ctx, BPF_OP(code));
> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
> +               break;
> +       /* dst = -dst (64 bit) */
> +       case BPF_ALU64 | BPF_NEG:
> +               emit_a32_neg64(dst, dstk, ctx);
> +               break;
> +       /* dst = dst * src/imm */
> +       case BPF_ALU64 | BPF_MUL | BPF_X:
> +       case BPF_ALU64 | BPF_MUL | BPF_K:
> +               switch (BPF_SRC(code)) {
> +               case BPF_X:
> +                       emit_a32_mul_r64(dst, src, dstk, sstk, ctx);
>                         break;
> -               case BPF_LDX | BPF_B | BPF_MSH:
> -                       /* x = ((*(frame + k)) & 0xf) << 2; */
> -                       ctx->seen |= SEEN_X | SEEN_DATA | SEEN_CALL;
> -                       /* the interpreter should deal with the negative K */
> -                       if ((int)k < 0)
> -                               return -1;
> -                       /* offset in r1: we might have to take the slow path */
> -                       emit_mov_i(r_off, k, ctx);
> -                       emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
> -
> -                       /* load in r0: common with the slowpath */
> -                       _emit(ARM_COND_HI, ARM_LDRB_R(ARM_R0, r_skb_data,
> -                                                     ARM_R1), ctx);
> -                       /*
> -                        * emit_mov_i() might generate one or two instructions,
> -                        * the same holds for emit_blx_r()
> +               case BPF_K:
> +                       /* Sign-extend the immediate into the temporary
> +                        * register pair first, then multiply by that
> +                        * pair, so that the multiplication sees a
> +                        * correctly extended operand.
>                          */
> -                       _emit(ARM_COND_HI, ARM_B(b_imm(i + 1, ctx) - 2), ctx);
> -
> -                       emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
> -                       /* r_off is r1 */
> -                       emit_mov_i(ARM_R3, (u32)jit_get_skb_b, ctx);
> -                       emit_blx_r(ARM_R3, ctx);
> -                       /* check the return value of skb_copy_bits */
> -                       emit(ARM_CMP_I(ARM_R1, 0), ctx);
> -                       emit_err_ret(ARM_COND_NE, ctx);
> -
> -                       emit(ARM_AND_I(r_X, ARM_R0, 0x00f), ctx);
> -                       emit(ARM_LSL_I(r_X, r_X, 2), ctx);
> -                       break;
> -               case BPF_ST:
> -                       ctx->seen |= SEEN_MEM_WORD(k);
> -                       emit(ARM_STR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
> -                       break;
> -               case BPF_STX:
> -                       update_on_xread(ctx);
> -                       ctx->seen |= SEEN_MEM_WORD(k);
> -                       emit(ARM_STR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
> -                       break;
> -               case BPF_ALU | BPF_ADD | BPF_K:
> -                       /* A += K */
> -                       OP_IMM3(ARM_ADD, r_A, r_A, k, ctx);
> -                       break;
> -               case BPF_ALU | BPF_ADD | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_ADD_R(r_A, r_A, r_X), ctx);
> -                       break;
> -               case BPF_ALU | BPF_SUB | BPF_K:
> -                       /* A -= K */
> -                       OP_IMM3(ARM_SUB, r_A, r_A, k, ctx);
> -                       break;
> -               case BPF_ALU | BPF_SUB | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_SUB_R(r_A, r_A, r_X), ctx);
> -                       break;
> -               case BPF_ALU | BPF_MUL | BPF_K:
> -                       /* A *= K */
> -                       emit_mov_i(r_scratch, k, ctx);
> -                       emit(ARM_MUL(r_A, r_A, r_scratch), ctx);
> -                       break;
> -               case BPF_ALU | BPF_MUL | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_MUL(r_A, r_A, r_X), ctx);
> -                       break;
> -               case BPF_ALU | BPF_DIV | BPF_K:
> -                       if (k == 1)
> -                               break;
> -                       emit_mov_i(r_scratch, k, ctx);
> -                       emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_DIV);
> -                       break;
> -               case BPF_ALU | BPF_DIV | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_CMP_I(r_X, 0), ctx);
> -                       emit_err_ret(ARM_COND_EQ, ctx);
> -                       emit_udivmod(r_A, r_A, r_X, ctx, BPF_DIV);
> -                       break;
> -               case BPF_ALU | BPF_MOD | BPF_K:
> -                       if (k == 1) {
> -                               emit_mov_i(r_A, 0, ctx);
> -                               break;
> -                       }
> -                       emit_mov_i(r_scratch, k, ctx);
> -                       emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_MOD);
> +                       emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
> +                       emit_a32_mul_r64(dst, tmp2, dstk, false, ctx);
>                         break;
> -               case BPF_ALU | BPF_MOD | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_CMP_I(r_X, 0), ctx);
> -                       emit_err_ret(ARM_COND_EQ, ctx);
> -                       emit_udivmod(r_A, r_A, r_X, ctx, BPF_MOD);
> -                       break;
> -               case BPF_ALU | BPF_OR | BPF_K:
> -                       /* A |= K */
> -                       OP_IMM3(ARM_ORR, r_A, r_A, k, ctx);
> +               }
> +               break;
> +       /* dst = htole(dst) */
> +       /* dst = htobe(dst) */
> +       case BPF_ALU | BPF_END | BPF_FROM_LE:
> +       case BPF_ALU | BPF_END | BPF_FROM_BE:
> +               rd = dstk ? tmp[0] : dst_hi;
> +               rt = dstk ? tmp[1] : dst_lo;
> +               if (dstk) {
> +                       emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +                       emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +               }
> +               if (BPF_SRC(code) == BPF_FROM_LE)
> +                       goto emit_bswap_uxt;
> +               switch (imm) {
> +               case 16:
> +                       emit_rev16(rt, rt, ctx);
> +                       goto emit_bswap_uxt;
> +               case 32:
> +                       emit_rev32(rt, rt, ctx);
> +                       goto emit_bswap_uxt;
> +               case 64:
> +                       /* Because of the usage of ARM_LR */
> +                       ctx->seen |= SEEN_CALL;
> +                       emit_rev32(ARM_LR, rt, ctx);
> +                       emit_rev32(rt, rd, ctx);
> +                       emit(ARM_MOV_R(rd, ARM_LR), ctx);
>                         break;
> -               case BPF_ALU | BPF_OR | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_ORR_R(r_A, r_A, r_X), ctx);
> +               }
> +               goto exit;
> +emit_bswap_uxt:
> +               switch (imm) {
> +               case 16:
> +                       /* zero-extend 16 bits into 64 bits */
> +#if __LINUX_ARM_ARCH__ < 6
> +                       emit_a32_mov_i(tmp2[1], 0xffff, false, ctx);
> +                       emit(ARM_AND_R(rt, rt, tmp2[1]), ctx);
> +#else /* ARMv6+ */
> +                       emit(ARM_UXTH(rt, rt), ctx);
> +#endif
> +                       emit(ARM_EOR_R(rd, rd, rd), ctx);
>                         break;
> -               case BPF_ALU | BPF_XOR | BPF_K:
> -                       /* A ^= K; */
> -                       OP_IMM3(ARM_EOR, r_A, r_A, k, ctx);
> +               case 32:
> +                       /* zero-extend 32 bits into 64 bits */
> +                       emit(ARM_EOR_R(rd, rd, rd), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_ALU_XOR_X:
> -               case BPF_ALU | BPF_XOR | BPF_X:
> -                       /* A ^= X */
> -                       update_on_xread(ctx);
> -                       emit(ARM_EOR_R(r_A, r_A, r_X), ctx);
> +               case 64:
> +                       /* nop */
>                         break;
> -               case BPF_ALU | BPF_AND | BPF_K:
> -                       /* A &= K */
> -                       OP_IMM3(ARM_AND, r_A, r_A, k, ctx);
> +               }
> +exit:
> +               if (dstk) {
> +                       emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +                       emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +               }
> +               break;
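
As a cross-check of the BPF_END case above, here is a small C model of what
the emitted code should compute on a little-endian host. The function names
are hypothetical. On such a host a conversion to little-endian only
truncates and zero-extends, while a conversion to big-endian also swaps
bytes. The 64-bit case mirrors the patch: the new low word is the reversed
old high word and vice versa.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit word. */
static uint32_t swab32(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0xff00u) |
	       ((x << 8) & 0xff0000u) | (x << 24);
}

/*
 * Hypothetical model of BPF_ALU | BPF_END on a little-endian host.
 * to_be selects BPF_FROM_BE (non-zero) or BPF_FROM_LE (zero);
 * width is the instruction immediate: 16, 32 or 64.
 */
static uint64_t bpf_end_le_host(uint64_t dst, int to_be, int width)
{
	uint32_t lo = (uint32_t)dst;
	uint32_t hi = (uint32_t)(dst >> 32);

	switch (width) {
	case 16:
		lo &= 0xffffu;
		if (to_be)
			lo = ((lo & 0xffu) << 8) | (lo >> 8);
		hi = 0;		/* zero-extend 16 bits into 64 bits */
		break;
	case 32:
		if (to_be)
			lo = swab32(lo);
		hi = 0;		/* zero-extend 32 bits into 64 bits */
		break;
	case 64:
		if (to_be) {
			uint32_t tmp = swab32(lo);

			lo = swab32(hi);
			hi = tmp;
		}
		break;
	}
	return ((uint64_t)hi << 32) | lo;
}
```
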
> +       /* dst = imm64 */
> +       case BPF_LD | BPF_IMM | BPF_DW:
> +       {
> +               const struct bpf_insn insn1 = insn[1];
> +               u32 hi, lo = imm;
> +
> +               if (insn1.code != 0 || insn1.src_reg != 0 ||
> +                   insn1.dst_reg != 0 || insn1.off != 0) {
> +                       /* Note: the verifier in the BPF core must catch
> +                        * invalid instructions.
> +                        */
> +                       pr_err_once("Invalid BPF_LD_IMM64 instruction\n");
> +                       return -EINVAL;
> +               }
> +               hi = insn1.imm;
> +               emit_a32_mov_i(dst_lo, lo, dstk, ctx);
> +               emit_a32_mov_i(dst_hi, hi, dstk, ctx);
> +
> +               return 1;
> +       }
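
A note on the BPF_LD | BPF_IMM | BPF_DW case above: the 64-bit constant is
split across the imm fields of two consecutive instruction slots, which is
why this case returns 1 so the caller can skip the second slot. The sketch
below shows how the two halves combine. ld_imm64_value() is a hypothetical
helper that takes the two imm fields directly rather than struct bpf_insn.

```c
#include <stdint.h>

/*
 * Hypothetical helper: combine the imm fields of the two slots of a
 * BPF_LD | BPF_IMM | BPF_DW instruction into the 64-bit constant.
 * imm_first is insn[0].imm (low word), imm_second is insn[1].imm
 * (high word).
 */
static uint64_t ld_imm64_value(int32_t imm_first, int32_t imm_second)
{
	uint32_t lo = (uint32_t)imm_first;
	uint32_t hi = (uint32_t)imm_second;

	/* No sign extension: the high word comes only from the second slot */
	return ((uint64_t)hi << 32) | lo;
}
```

Unlike the BPF_K ALU cases, the low word is not sign-extended here, which
matches the two independent emit_a32_mov_i() calls in the patch.
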
> +       /* LDX: dst = *(size *)(src + off) */
> +       case BPF_LDX | BPF_MEM | BPF_W:
> +       case BPF_LDX | BPF_MEM | BPF_H:
> +       case BPF_LDX | BPF_MEM | BPF_B:
> +       case BPF_LDX | BPF_MEM | BPF_DW:
> +               rn = sstk ? tmp2[1] : src_lo;
> +               if (sstk)
> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
> +               switch (BPF_SIZE(code)) {
> +               case BPF_W:
> +                       /* Load a Word */
> +               case BPF_H:
> +                       /* Load a Half-Word */
> +               case BPF_B:
> +                       /* Load a Byte */
> +                       emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_SIZE(code));
> +                       emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>                         break;
> -               case BPF_ALU | BPF_AND | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_AND_R(r_A, r_A, r_X), ctx);
> +               case BPF_DW:
> +                       /* Load a double word */
> +                       emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_W);
> +                       emit_ldx_r(dst_hi, rn, dstk, off+4, ctx, BPF_W);
>                         break;
> -               case BPF_ALU | BPF_LSH | BPF_K:
> -                       if (unlikely(k > 31))
> -                               return -1;
> -                       emit(ARM_LSL_I(r_A, r_A, k), ctx);
> +               }
> +               break;
> +       /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
> +       case BPF_LD | BPF_ABS | BPF_W:
> +       case BPF_LD | BPF_ABS | BPF_H:
> +       case BPF_LD | BPF_ABS | BPF_B:
> +       /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
> +       case BPF_LD | BPF_IND | BPF_W:
> +       case BPF_LD | BPF_IND | BPF_H:
> +       case BPF_LD | BPF_IND | BPF_B:
> +       {
> +               const u8 r4 = bpf2a32[BPF_REG_6][1]; /* r4 = ptr to sk_buff */
> +               const u8 r0 = bpf2a32[BPF_REG_0][1]; /* r0: skb arg, return value */
> +               const u8 r1 = bpf2a32[BPF_REG_0][0]; /* r1: int k */
> +               const u8 r2 = bpf2a32[BPF_REG_1][1]; /* r2: unsigned int size */
> +               const u8 r3 = bpf2a32[BPF_REG_1][0]; /* r3: void *buffer */
> +               const u8 r6 = bpf2a32[TMP_REG_1][1]; /* r6: void *(*func)(..) */
> +               int size;
> +
> +               /* Setting up first argument */
> +               emit(ARM_MOV_R(r0, r4), ctx);
> +
> +               /* Setting up second argument */
> +               emit_a32_mov_i(r1, imm, false, ctx);
> +               if (BPF_MODE(code) == BPF_IND)
> +                       emit_a32_alu_r(r1, src_lo, false, sstk, ctx,
> +                                      false, false, BPF_ADD);
> +
> +               /* Setting up third argument */
> +               switch (BPF_SIZE(code)) {
> +               case BPF_W:
> +                       size = 4;
>                         break;
> -               case BPF_ALU | BPF_LSH | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_LSL_R(r_A, r_A, r_X), ctx);
> +               case BPF_H:
> +                       size = 2;
>                         break;
> -               case BPF_ALU | BPF_RSH | BPF_K:
> -                       if (unlikely(k > 31))
> -                               return -1;
> -                       if (k)
> -                               emit(ARM_LSR_I(r_A, r_A, k), ctx);
> +               case BPF_B:
> +                       size = 1;
>                         break;
> -               case BPF_ALU | BPF_RSH | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_LSR_R(r_A, r_A, r_X), ctx);
> +               default:
> +                       return -EINVAL;
> +               }
> +               emit_a32_mov_i(r2, size, false, ctx);
> +
> +               /* Setting up fourth argument */
> +               emit(ARM_ADD_I(r3, ARM_SP, imm8m(SKB_BUFFER)), ctx);
> +
> +               /* Setting up function pointer to call */
> +               emit_a32_mov_i(r6, (unsigned int)bpf_load_pointer, false, ctx);
> +               emit_blx_r(r6, ctx);
> +
> +               emit(ARM_EOR_R(r1, r1, r1), ctx);
> +               /* Check whether the returned pointer is NULL.
> +                * If it is NULL, jump to the epilogue; otherwise
> +                * load the value from the returned address.
> +                */
> +               emit(ARM_CMP_I(r0, 0), ctx);
> +               jmp_offset = epilogue_offset(ctx);
> +               check_imm24(jmp_offset);
> +               _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
> +
> +               /* Load value from the address */
> +               switch (BPF_SIZE(code)) {
> +               case BPF_W:
> +                       emit(ARM_LDR_I(r0, r0, 0), ctx);
> +                       emit_rev32(r0, r0, ctx);
>                         break;
> -               case BPF_ALU | BPF_NEG:
> -                       /* A = -A */
> -                       emit(ARM_RSB_I(r_A, r_A, 0), ctx);
> +               case BPF_H:
> +                       emit(ARM_LDRH_I(r0, r0, 0), ctx);
> +                       emit_rev16(r0, r0, ctx);
>                         break;
> -               case BPF_JMP | BPF_JA:
> -                       /* pc += K */
> -                       emit(ARM_B(b_imm(i + k + 1, ctx)), ctx);
> +               case BPF_B:
> +                       emit(ARM_LDRB_I(r0, r0, 0), ctx);
> +                       /* No need to reverse */
>                         break;
> -               case BPF_JMP | BPF_JEQ | BPF_K:
> -                       /* pc += (A == K) ? pc->jt : pc->jf */
> -                       condt  = ARM_COND_EQ;
> -                       goto cmp_imm;
> -               case BPF_JMP | BPF_JGT | BPF_K:
> -                       /* pc += (A > K) ? pc->jt : pc->jf */
> -                       condt  = ARM_COND_HI;
> -                       goto cmp_imm;
> -               case BPF_JMP | BPF_JGE | BPF_K:
> -                       /* pc += (A >= K) ? pc->jt : pc->jf */
> -                       condt  = ARM_COND_HS;
> -cmp_imm:
> -                       imm12 = imm8m(k);
> -                       if (imm12 < 0) {
> -                               emit_mov_i_no8m(r_scratch, k, ctx);
> -                               emit(ARM_CMP_R(r_A, r_scratch), ctx);
> -                       } else {
> -                               emit(ARM_CMP_I(r_A, imm12), ctx);
> -                       }
> -cond_jump:
> -                       if (inst->jt)
> -                               _emit(condt, ARM_B(b_imm(i + inst->jt + 1,
> -                                                  ctx)), ctx);
> -                       if (inst->jf)
> -                               _emit(condt ^ 1, ARM_B(b_imm(i + inst->jf + 1,
> -                                                            ctx)), ctx);
> +               }
> +               break;
> +       }
> +       /* ST: *(size *)(dst + off) = imm */
> +       case BPF_ST | BPF_MEM | BPF_W:
> +       case BPF_ST | BPF_MEM | BPF_H:
> +       case BPF_ST | BPF_MEM | BPF_B:
> +       case BPF_ST | BPF_MEM | BPF_DW:
> +               switch (BPF_SIZE(code)) {
> +               case BPF_DW:
> +                       /* Sign-extend immediate value into temp reg */
> +                       emit_a32_mov_i64(true, tmp2, imm, false, ctx);
> +                       emit_str_r(dst_lo, tmp2[1], dstk, off, ctx, BPF_W);
> +                       emit_str_r(dst_lo, tmp2[0], dstk, off+4, ctx, BPF_W);
>                         break;
> -               case BPF_JMP | BPF_JEQ | BPF_X:
> -                       /* pc += (A == X) ? pc->jt : pc->jf */
> -                       condt   = ARM_COND_EQ;
> -                       goto cmp_x;
> -               case BPF_JMP | BPF_JGT | BPF_X:
> -                       /* pc += (A > X) ? pc->jt : pc->jf */
> -                       condt   = ARM_COND_HI;
> -                       goto cmp_x;
> -               case BPF_JMP | BPF_JGE | BPF_X:
> -                       /* pc += (A >= X) ? pc->jt : pc->jf */
> -                       condt   = ARM_COND_CS;
> -cmp_x:
> -                       update_on_xread(ctx);
> -                       emit(ARM_CMP_R(r_A, r_X), ctx);
> -                       goto cond_jump;
> -               case BPF_JMP | BPF_JSET | BPF_K:
> -                       /* pc += (A & K) ? pc->jt : pc->jf */
> -                       condt  = ARM_COND_NE;
> -                       /* not set iff all zeroes iff Z==1 iff EQ */
> -
> -                       imm12 = imm8m(k);
> -                       if (imm12 < 0) {
> -                               emit_mov_i_no8m(r_scratch, k, ctx);
> -                               emit(ARM_TST_R(r_A, r_scratch), ctx);
> -                       } else {
> -                               emit(ARM_TST_I(r_A, imm12), ctx);
> -                       }
> -                       goto cond_jump;
> -               case BPF_JMP | BPF_JSET | BPF_X:
> -                       /* pc += (A & X) ? pc->jt : pc->jf */
> -                       update_on_xread(ctx);
> -                       condt  = ARM_COND_NE;
> -                       emit(ARM_TST_R(r_A, r_X), ctx);
> -                       goto cond_jump;
> -               case BPF_RET | BPF_A:
> -                       emit(ARM_MOV_R(ARM_R0, r_A), ctx);
> -                       goto b_epilogue;
> -               case BPF_RET | BPF_K:
> -                       if ((k == 0) && (ctx->ret0_fp_idx < 0))
> -                               ctx->ret0_fp_idx = i;
> -                       emit_mov_i(ARM_R0, k, ctx);
> -b_epilogue:
> -                       if (i != ctx->skf->len - 1)
> -                               emit(ARM_B(b_imm(prog->len, ctx)), ctx);
> +               case BPF_W:
> +               case BPF_H:
> +               case BPF_B:
> +                       emit_a32_mov_i(tmp2[1], imm, false, ctx);
> +                       emit_str_r(dst_lo, tmp2[1], dstk, off, ctx,
> +                                  BPF_SIZE(code));
>                         break;
> -               case BPF_MISC | BPF_TAX:
> -                       /* X = A */
> -                       ctx->seen |= SEEN_X;
> -                       emit(ARM_MOV_R(r_X, r_A), ctx);
> +               }
> +               break;
> +       /* STX XADD: lock *(u32 *)(dst + off) += src */
> +       case BPF_STX | BPF_XADD | BPF_W:
> +       /* STX XADD: lock *(u64 *)(dst + off) += src */
> +       case BPF_STX | BPF_XADD | BPF_DW:
> +               goto notyet;
> +       /* STX: *(size *)(dst + off) = src */
> +       case BPF_STX | BPF_MEM | BPF_W:
> +       case BPF_STX | BPF_MEM | BPF_H:
> +       case BPF_STX | BPF_MEM | BPF_B:
> +       case BPF_STX | BPF_MEM | BPF_DW:
> +       {
> +               u8 sz = BPF_SIZE(code);
> +
> +               rn = sstk ? tmp2[1] : src_lo;
> +               rm = sstk ? tmp2[0] : src_hi;
> +               if (!sstk)
> +                       goto do_store;
> +               switch (BPF_SIZE(code)) {
> +               case BPF_W:
> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
> +                       goto empty_hi;
> +               case BPF_H:
> +                       emit(ARM_LDRH_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
> +                       goto empty_hi;
> +               case BPF_B:
> +                       emit(ARM_LDRB_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
> +                       goto empty_hi;
> +empty_hi:
> +                       emit(ARM_EOR_R(rm, rm, rm), ctx);
> +                       break;
> +               case BPF_DW:
> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
> +                       emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
> +                       sz = BPF_W;
>                         break;
> -               case BPF_MISC | BPF_TXA:
> -                       /* A = X */
> -                       update_on_xread(ctx);
> -                       emit(ARM_MOV_R(r_A, r_X), ctx);
> +               }
> +
> +do_store:
> +               /* Store the value; only BPF_DW has a high word to store */
> +               if (BPF_SIZE(code) == BPF_DW) {
> +                       emit_str_r(dst_lo, rn, dstk, off, ctx, BPF_W);
> +                       emit_str_r(dst_lo, rm, dstk, off+4, ctx, BPF_W);
> +               } else {
> +                       emit_str_r(dst_lo, rn, dstk, off, ctx, sz);
> +               }
> +               break;
> +       }
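
For anyone following the BPF_DW store path above: on a little-endian
machine a 64-bit value is written as two 32-bit words, the low word at
off and the high word at off + 4. A minimal user-space sketch of that
split, assuming nothing about the JIT internals (store_u64_as_words is
a hypothetical name, not a function from this patch):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a 64-bit value held as a (lo, hi) pair
 * as two 32-bit words, low word first, mirroring the register pair
 * the JIT uses for one eBPF register.
 */
static void store_u64_as_words(uint8_t *dst, int off,
			       uint32_t lo, uint32_t hi)
{
	memcpy(dst + off, &lo, sizeof(lo));	/* low word at off */
	memcpy(dst + off + 4, &hi, sizeof(hi));	/* high word at off + 4 */
}
```

Narrower stores (BPF_W, BPF_H, BPF_B) only have a low part, so nothing
should be written at off + 4 for them.
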
> +       /* PC += off if dst == src */
> +       /* PC += off if dst > src */
> +       /* PC += off if dst >= src */
> +       /* PC += off if dst != src */
> +       /* PC += off if dst > src (signed) */
> +       /* PC += off if dst >= src (signed) */
> +       /* PC += off if dst & src */
> +       case BPF_JMP | BPF_JEQ | BPF_X:
> +       case BPF_JMP | BPF_JGT | BPF_X:
> +       case BPF_JMP | BPF_JGE | BPF_X:
> +       case BPF_JMP | BPF_JNE | BPF_X:
> +       case BPF_JMP | BPF_JSGT | BPF_X:
> +       case BPF_JMP | BPF_JSGE | BPF_X:
> +       case BPF_JMP | BPF_JSET | BPF_X:
> +               /* Setup source registers */
> +               rm = sstk ? tmp2[0] : src_hi;
> +               rn = sstk ? tmp2[1] : src_lo;
> +               if (sstk) {
> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
> +                       emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
> +               }
> +               goto go_jmp;
> +       /* PC += off if dst == imm */
> +       /* PC += off if dst > imm */
> +       /* PC += off if dst >= imm */
> +       /* PC += off if dst != imm */
> +       /* PC += off if dst > imm (signed) */
> +       /* PC += off if dst >= imm (signed) */
> +       /* PC += off if dst & imm */
> +       case BPF_JMP | BPF_JEQ | BPF_K:
> +       case BPF_JMP | BPF_JGT | BPF_K:
> +       case BPF_JMP | BPF_JGE | BPF_K:
> +       case BPF_JMP | BPF_JNE | BPF_K:
> +       case BPF_JMP | BPF_JSGT | BPF_K:
> +       case BPF_JMP | BPF_JSGE | BPF_K:
> +       case BPF_JMP | BPF_JSET | BPF_K:
> +               if (off == 0)
>                         break;
> -               case BPF_ANC | SKF_AD_PROTOCOL:
> -                       /* A = ntohs(skb->protocol) */
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
> -                                                 protocol) != 2);
> -                       off = offsetof(struct sk_buff, protocol);
> -                       emit(ARM_LDRH_I(r_scratch, r_skb, off), ctx);
> -                       emit_swap16(r_A, r_scratch, ctx);
> +               rm = tmp2[0];
> +               rn = tmp2[1];
> +               /* Sign-extend immediate value */
> +               emit_a32_mov_i64(true, tmp2, imm, false, ctx);
> +go_jmp:
> +               /* Setup destination register */
> +               rd = dstk ? tmp[0] : dst_hi;
> +               rt = dstk ? tmp[1] : dst_lo;
> +               if (dstk) {
> +                       emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +                       emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +               }
> +
> +               /* Check for the condition */
> +               emit_ar_r(rd, rt, rm, rn, ctx, BPF_OP(code));
> +
> +               /* Setup JUMP instruction */
> +               jmp_offset = bpf2a32_offset(i+off, i, ctx);
> +               switch (BPF_OP(code)) {
> +               case BPF_JNE:
> +               case BPF_JSET:
> +                       _emit(ARM_COND_NE, ARM_B(jmp_offset), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_CPU:
> -                       /* r_scratch = current_thread_info() */
> -                       OP_IMM3(ARM_BIC, r_scratch, ARM_SP, THREAD_SIZE - 1, ctx);
> -                       /* A = current_thread_info()->cpu */
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct thread_info, cpu) != 4);
> -                       off = offsetof(struct thread_info, cpu);
> -                       emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
> +               case BPF_JEQ:
> +                       _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_IFINDEX:
> -               case BPF_ANC | SKF_AD_HATYPE:
> -                       /* A = skb->dev->ifindex */
> -                       /* A = skb->dev->type */
> -                       ctx->seen |= SEEN_SKB;
> -                       off = offsetof(struct sk_buff, dev);
> -                       emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
> -
> -                       emit(ARM_CMP_I(r_scratch, 0), ctx);
> -                       emit_err_ret(ARM_COND_EQ, ctx);
> -
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
> -                                                 ifindex) != 4);
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
> -                                                 type) != 2);
> -
> -                       if (code == (BPF_ANC | SKF_AD_IFINDEX)) {
> -                               off = offsetof(struct net_device, ifindex);
> -                               emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
> -                       } else {
> -                               /*
> -                                * offset of field "type" in "struct
> -                                * net_device" is above what can be
> -                                * used in the ldrh rd, [rn, #imm]
> -                                * instruction, so load the offset in
> -                                * a register and use ldrh rd, [rn, rm]
> -                                */
> -                               off = offsetof(struct net_device, type);
> -                               emit_mov_i(ARM_R3, off, ctx);
> -                               emit(ARM_LDRH_R(r_A, r_scratch, ARM_R3), ctx);
> -                       }
> +               case BPF_JGT:
> +                       _emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_MARK:
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
> -                       off = offsetof(struct sk_buff, mark);
> -                       emit(ARM_LDR_I(r_A, r_skb, off), ctx);
> +               case BPF_JGE:
> +                       _emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_RXHASH:
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, hash) != 4);
> -                       off = offsetof(struct sk_buff, hash);
> -                       emit(ARM_LDR_I(r_A, r_skb, off), ctx);
> +               case BPF_JSGT:
> +                       _emit(ARM_COND_LT, ARM_B(jmp_offset), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_VLAN_TAG:
> -               case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2);
> -                       off = offsetof(struct sk_buff, vlan_tci);
> -                       emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
> -                       if (code == (BPF_ANC | SKF_AD_VLAN_TAG))
> -                               OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx);
> -                       else {
> -                               OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx);
> -                               OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
> -                       }
> +               case BPF_JSGE:
> +                       _emit(ARM_COND_GE, ARM_B(jmp_offset), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_PKTTYPE:
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
> -                                                 __pkt_type_offset[0]) != 1);
> -                       off = PKT_TYPE_OFFSET();
> -                       emit(ARM_LDRB_I(r_A, r_skb, off), ctx);
> -                       emit(ARM_AND_I(r_A, r_A, PKT_TYPE_MAX), ctx);
> -#ifdef __BIG_ENDIAN_BITFIELD
> -                       emit(ARM_LSR_I(r_A, r_A, 5), ctx);
> -#endif
> +               }
> +               break;
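
The condition codes above (HI/CS for unsigned, LT/GE for signed) only
make sense once the 64-bit comparison has been reduced to 32-bit
halves. emit_ar_r() is not visible in this part of the diff, so the
following is only a sketch of the general technique, not necessarily
the instruction sequence the patch emits; pair_ugt and pair_sgt are
hypothetical names:

```c
#include <stdint.h>
#include <stdbool.h>

/* Unsigned dst > src over (hi, lo) pairs: the high words decide
 * unless they are equal, in which case the low words decide.
 */
static bool pair_ugt(uint32_t dhi, uint32_t dlo, uint32_t shi, uint32_t slo)
{
	if (dhi != shi)
		return dhi > shi;
	return dlo > slo;
}

/* Signed dst > src: only the high word carries the sign, so it is
 * compared as signed; the low word is still compared as unsigned.
 */
static bool pair_sgt(uint32_t dhi, uint32_t dlo, uint32_t shi, uint32_t slo)
{
	if (dhi != shi)
		return (int32_t)dhi > (int32_t)shi;
	return dlo > slo;
}
```

The difference between the two only shows up when the high words have
different top bits, which is exactly the case the signed condition
codes exist for.
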
> +       /* JMP OFF */
> +       case BPF_JMP | BPF_JA:
> +       {
> +               if (off == 0)
>                         break;
> -               case BPF_ANC | SKF_AD_QUEUE:
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
> -                                                 queue_mapping) != 2);
> -                       BUILD_BUG_ON(offsetof(struct sk_buff,
> -                                             queue_mapping) > 0xff);
> -                       off = offsetof(struct sk_buff, queue_mapping);
> -                       emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
> +               jmp_offset = bpf2a32_offset(i+off, i, ctx);
> +               check_imm24(jmp_offset);
> +               emit(ARM_B(jmp_offset), ctx);
> +               break;
> +       }
> +       /* tail call */
> +       case BPF_JMP | BPF_CALL | BPF_X:
> +               if (emit_bpf_tail_call(ctx))
> +                       return -EFAULT;
> +               break;
> +       /* function call */
> +       case BPF_JMP | BPF_CALL:
> +               goto notyet;
> +       /* function return */
> +       case BPF_JMP | BPF_EXIT:
> +               /* Optimization: when last instruction is EXIT
> +                * simply fallthrough to epilogue.
> +                */
> +               if (i == ctx->prog->len - 1)
>                         break;
> -               case BPF_ANC | SKF_AD_PAY_OFFSET:
> -                       ctx->seen |= SEEN_SKB | SEEN_CALL;
> +               jmp_offset = epilogue_offset(ctx);
> +               check_imm24(jmp_offset);
> +               emit(ARM_B(jmp_offset), ctx);
> +               break;
> +notyet:
> +               pr_info_once("*** NOT YET: opcode %02x ***\n", code);
> +               return -EFAULT;
> +       default:
> +               pr_err_once("unknown opcode %02x\n", code);
> +               return -EINVAL;
> +       }
>
> -                       emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
> -                       emit_mov_i(ARM_R3, (unsigned int)skb_get_poff, ctx);
> -                       emit_blx_r(ARM_R3, ctx);
> -                       emit(ARM_MOV_R(r_A, ARM_R0), ctx);
> -                       break;
> -               case BPF_LDX | BPF_W | BPF_ABS:
> -                       /*
> -                        * load a 32bit word from struct seccomp_data.
> -                        * seccomp_check_filter() will already have checked
> -                        * that k is 32bit aligned and lies within the
> -                        * struct seccomp_data.
> -                        */
> -                       ctx->seen |= SEEN_SKB;
> -                       emit(ARM_LDR_I(r_A, r_skb, k), ctx);
> -                       break;
> -               default:
> -                       return -1;
> +       if (ctx->flags & FLAG_IMM_OVERFLOW)
> +               /*
> +                * this instruction generated an overflow when
> +                * trying to access the literal pool, so
> +                * delegate this filter to the kernel interpreter.
> +                */
> +               return -1;
> +       return 0;
> +}
> +
> +static int build_body(struct jit_ctx *ctx)
> +{
> +       const struct bpf_prog *prog = ctx->prog;
> +       unsigned int i;
> +
> +       for (i = 0; i < prog->len; i++) {
> +               const struct bpf_insn *insn = &(prog->insnsi[i]);
> +               int ret;
> +
> +               ret = build_insn(insn, ctx);
> +
> +               /* ret > 0: a 64 bit immediate load consumed two insns */
> +               if (ret > 0) {
> +                       i++;
> +                       if (ctx->target == NULL)
> +                               ctx->offsets[i] = ctx->idx;
> +                       continue;
>                 }
>
> -               if (ctx->flags & FLAG_IMM_OVERFLOW)
> -                       /*
> -                        * this instruction generated an overflow when
> -                        * trying to access the literal pool, so
> -                        * delegate this filter to the kernel interpreter.
> -                        */
> -                       return -1;
> +               if (ctx->target == NULL)
> +                       ctx->offsets[i] = ctx->idx;
> +
> +               /* If unsuccessful, return with error code */
> +               if (ret)
> +                       return ret;
>         }
> +       return 0;
> +}
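
To make the bookkeeping in build_body() easier to review, here is a
small model of the loop. It is a sketch under stated assumptions:
cost[] and wide[] are hypothetical stand-ins for what build_insn()
would emit and return, and record_offsets is not a function from this
patch. offsets[i] ends up holding the native instruction index just
past BPF insn i, and a two-slot (64 bit immediate) insn records its
offset only in the second slot, as the loop above does:

```c
/* cost[i]: native instructions emitted for BPF insn i (assumed input)
 * wide[i]: non-zero if insn i also consumes slot i + 1 (assumed input)
 * Returns the total number of native instructions.
 */
static int record_offsets(const int *cost, const int *wide, int len,
			  int *offsets)
{
	int idx = 0;
	int i;

	for (i = 0; i < len; i++) {
		idx += cost[i];
		if (wide[i]) {
			i++;			/* skip the second slot */
			if (i < len)
				offsets[i] = idx;
			continue;
		}
		offsets[i] = idx;
	}
	return idx;
}
```

Note that the first slot of a wide insn keeps whatever value offsets[]
was initialised with, which is zero when the array comes from kcalloc().
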
>
> -       /* compute offsets only during the first pass */
> -       if (ctx->target == NULL)
> -               ctx->offsets[i] = ctx->idx * 4;
> +static int validate_code(struct jit_ctx *ctx)
> +{
> +       int i;
> +
> +       for (i = 0; i < ctx->idx; i++) {
> +               u32 a32_insn = le32_to_cpu(ctx->target[i]);
> +
> +               if (a32_insn == ARM_INST_UDF)
> +                       return -1;
> +       }
>
>         return 0;
>  }
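
My reading of validate_code() is that it relies on the image having
been pre-filled with an undefined instruction, so any slot the second
pass failed to write is still detectable. A minimal sketch of that
idea, assuming a filler value of 0xe7fddef1 (FILL_WORD, fill_hole and
validate are names made up for this example):

```c
#include <stdint.h>
#include <stddef.h>

#define FILL_WORD 0xe7fddef1u	/* assumed filler value, for illustration */

/* Pre-fill the buffer so unwritten slots are recognisable later. */
static void fill_hole(uint32_t *buf, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		buf[i] = FILL_WORD;
}

/* Return -1 if any of the first n slots still holds the filler. */
static int validate(const uint32_t *buf, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (buf[i] == FILL_WORD)
			return -1;
	return 0;
}
```

The check cannot tell a missed slot from a legitimately emitted filler
instruction, so it only works if the generator never emits that value
on purpose.
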
>
> +void bpf_jit_compile(struct bpf_prog *prog)
> +{
> +       /* Nothing to do here. We support Internal BPF. */
> +}
>
> -void bpf_jit_compile(struct bpf_prog *fp)
> +struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  {
> +#ifdef __LITTLE_ENDIAN
> +       struct bpf_prog *tmp, *orig_prog = prog;
>         struct bpf_binary_header *header;
> +       bool tmp_blinded = false;
>         struct jit_ctx ctx;
> -       unsigned tmp_idx;
> -       unsigned alloc_size;
> -       u8 *target_ptr;
> +       unsigned int tmp_idx;
> +       unsigned int image_size;
> +       u8 *image_ptr;
>
> +       /* If BPF JIT was not enabled then we must fall back to
> +        * the interpreter.
> +        */
>         if (!bpf_jit_enable)
> -               return;
> +               return orig_prog;
>
> -       memset(&ctx, 0, sizeof(ctx));
> -       ctx.skf         = fp;
> -       ctx.ret0_fp_idx = -1;
> +       /* If constant blinding was enabled and we failed during blinding
> +        * then we must fall back to the interpreter. Otherwise, we save
> +        * the new JITed code.
> +        */
> +       tmp = bpf_jit_blind_constants(prog);
>
> -       ctx.offsets = kzalloc(4 * (ctx.skf->len + 1), GFP_KERNEL);
> -       if (ctx.offsets == NULL)
> -               return;
> +       if (IS_ERR(tmp))
> +               return orig_prog;
> +       if (tmp != prog) {
> +               tmp_blinded = true;
> +               prog = tmp;
> +       }
> +
> +       memset(&ctx, 0, sizeof(ctx));
> +       ctx.prog = prog;
>
> -       /* fake pass to fill in the ctx->seen */
> -       if (unlikely(build_body(&ctx)))
> +       /* If we are not able to allocate memory for offsets[],
> +        * we must fall back to the interpreter.
> +        */
> +       ctx.offsets = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
> +       if (ctx.offsets == NULL) {
> +               prog = orig_prog;
>                 goto out;
> +       }
> +
> +       /* 1) fake pass to find the length of the JITed code,
> +        * to compute ctx->offsets and other context variables
> +        * needed to compute final JITed code.
> +        * Also, calculate random starting pointer/start of JITed code
> +        * which is prefixed by random number of fault instructions.
> +        *
> +        * If the first pass fails then there is no chance of it
> +        * being successful in the second pass, so just fall back
> +        * to the interpreter.
> +        */
> +       if (build_body(&ctx)) {
> +               prog = orig_prog;
> +               goto out_off;
> +       }
>
>         tmp_idx = ctx.idx;
>         build_prologue(&ctx);
>         ctx.prologue_bytes = (ctx.idx - tmp_idx) * 4;
>
> +       ctx.epilogue_offset = ctx.idx;
> +
>  #if __LINUX_ARM_ARCH__ < 7
>         tmp_idx = ctx.idx;
>         build_epilogue(&ctx);
> @@ -1020,64 +1838,96 @@ void bpf_jit_compile(struct bpf_prog *fp)
>
>         ctx.idx += ctx.imm_count;
>         if (ctx.imm_count) {
> -               ctx.imms = kzalloc(4 * ctx.imm_count, GFP_KERNEL);
> -               if (ctx.imms == NULL)
> -                       goto out;
> +               ctx.imms = kcalloc(ctx.imm_count, sizeof(u32), GFP_KERNEL);
> +               if (ctx.imms == NULL) {
> +                       prog = orig_prog;
> +                       goto out_off;
> +               }
>         }
>  #else
>         /* there's nothing after the epilogue on ARMv7 */
>         build_epilogue(&ctx);
>  #endif
> -       alloc_size = 4 * ctx.idx;
> -       header = bpf_jit_binary_alloc(alloc_size, &target_ptr,
> -                                     4, jit_fill_hole);
> -       if (header == NULL)
> -               goto out;
> +       /* Now we can get the actual image size of the JITed ARM code.
> +        * Currently, we do not consider Thumb-2 instructions for the
> +        * JIT, although they could decrease the size of the image.
> +        *
> +        * As each ARM instruction is 32 bits long, we translate the
> +        * number of JITed instructions into the size required to store
> +        * the JITed code.
> +        */
> +       image_size = sizeof(u32) * ctx.idx;
>
> -       ctx.target = (u32 *) target_ptr;
> +       /* Now we know the size of the structure to make */
> +       header = bpf_jit_binary_alloc(image_size, &image_ptr,
> +                                     sizeof(u32), jit_fill_hole);
> +       /* If we are not able to allocate memory for the structure,
> +        * we must fall back to the interpreter.
> +        */
> +       if (header == NULL) {
> +               prog = orig_prog;
> +               goto out_imms;
> +       }
> +
> +       /* 2.) Actual pass to generate final JIT code */
> +       ctx.target = (u32 *) image_ptr;
>         ctx.idx = 0;
>
>         build_prologue(&ctx);
> +
> +       /* If building the body of the JITed code fails somehow,
> +        * we fall back to the interpreter.
> +        */
>         if (build_body(&ctx) < 0) {
> -#if __LINUX_ARM_ARCH__ < 7
> -               if (ctx.imm_count)
> -                       kfree(ctx.imms);
> -#endif
> +               image_ptr = NULL;
>                 bpf_jit_binary_free(header);
> -               goto out;
> +               prog = orig_prog;
> +               goto out_imms;
>         }
>         build_epilogue(&ctx);
>
> +       /* 3.) Extra pass to validate JITed Code */
> +       if (validate_code(&ctx)) {
> +               image_ptr = NULL;
> +               bpf_jit_binary_free(header);
> +               prog = orig_prog;
> +               goto out_imms;
> +       }
>         flush_icache_range((u32)header, (u32)(ctx.target + ctx.idx));
>
> -#if __LINUX_ARM_ARCH__ < 7
> -       if (ctx.imm_count)
> -               kfree(ctx.imms);
> -#endif
> -
>         if (bpf_jit_enable > 1)
>                 /* there are 2 passes here */
> -               bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
> +               bpf_jit_dump(prog->len, image_size, 2, ctx.target);
>
>         set_memory_ro((unsigned long)header, header->pages);
> -       fp->bpf_func = (void *)ctx.target;
> -       fp->jited = 1;
> -out:
> +       prog->bpf_func = (void *)ctx.target;
> +       prog->jited = 1;
> +out_imms:
> +#if __LINUX_ARM_ARCH__ < 7
> +       if (ctx.imm_count)
> +               kfree(ctx.imms);
> +#endif
> +out_off:
>         kfree(ctx.offsets);
> -       return;
> +out:
> +       if (tmp_blinded)
> +               bpf_jit_prog_release_other(prog, prog == orig_prog ?
> +                                          tmp : orig_prog);
> +#endif /* __LITTLE_ENDIAN */
> +       return prog;
>  }
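
The overall flow of bpf_int_jit_compile() is the usual two-pass
scheme: run the generator once with no target buffer just to count
instructions, allocate an image of that size, then run it again to
write the code. A user-space sketch of only that sizing scheme,
assuming an emit helper that skips the write when the target is NULL
(struct sketch_ctx, sketch_emit, sketch_build and sketch_jit are all
hypothetical, and calloc() stands in for bpf_jit_binary_alloc()):

```c
#include <stdint.h>
#include <stdlib.h>

struct sketch_ctx {
	uint32_t *target;	/* NULL during the counting pass */
	unsigned int idx;	/* next instruction slot */
};

static void sketch_emit(uint32_t inst, struct sketch_ctx *ctx)
{
	if (ctx->target)
		ctx->target[ctx->idx] = inst;
	ctx->idx++;
}

/* Stand-in for prologue + body + epilogue: "mov r0, #0; bx lr". */
static void sketch_build(struct sketch_ctx *ctx)
{
	sketch_emit(0xe3a00000u, ctx);
	sketch_emit(0xe12fff1eu, ctx);
}

static uint32_t *sketch_jit(unsigned int *count)
{
	struct sketch_ctx ctx = { NULL, 0 };
	uint32_t *image;

	sketch_build(&ctx);			/* pass 1: count only */
	image = calloc(ctx.idx, sizeof(uint32_t));
	if (!image)
		return NULL;			/* caller falls back */

	ctx.target = image;
	ctx.idx = 0;
	sketch_build(&ctx);			/* pass 2: emit for real */
	*count = ctx.idx;
	return image;
}
```

Both passes have to emit exactly the same number of instructions,
otherwise the offsets computed in the first pass no longer match the
final image.
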
>
> -void bpf_jit_free(struct bpf_prog *fp)
> +void bpf_jit_free(struct bpf_prog *prog)
>  {
> -       unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
> +       unsigned long addr = (unsigned long)prog->bpf_func & PAGE_MASK;
>         struct bpf_binary_header *header = (void *)addr;
>
> -       if (!fp->jited)
> +       if (!prog->jited)
>                 goto free_filter;
>
>         set_memory_rw(addr, header->pages);
>         bpf_jit_binary_free(header);
>
>  free_filter:
> -       bpf_prog_unlock_free(fp);
> +       bpf_prog_unlock_free(prog);
>  }
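
bpf_jit_free() recovers the binary header from bpf_func by masking the
address down to a page boundary, which as I read it relies on the
header sitting at the start of the page that contains the JITed code.
A minimal sketch of the masking itself, assuming a 4096 byte page
(SKETCH_PAGE_SIZE and page_base are made-up names):

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration */

/* Round an address down to the start of its page. */
static uintptr_t page_base(uintptr_t addr)
{
	return addr & ~((uintptr_t)SKETCH_PAGE_SIZE - 1);
}
```
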
> diff --git a/arch/arm/net/bpf_jit_32.h b/arch/arm/net/bpf_jit_32.h
> index c46fca2..d5cf5f6 100644
> --- a/arch/arm/net/bpf_jit_32.h
> +++ b/arch/arm/net/bpf_jit_32.h
> @@ -11,6 +11,7 @@
>  #ifndef PFILTER_OPCODES_ARM_H
>  #define PFILTER_OPCODES_ARM_H
>
> +/* ARM 32bit Registers */
>  #define ARM_R0 0
>  #define ARM_R1 1
>  #define ARM_R2 2
> @@ -22,38 +23,43 @@
>  #define ARM_R8 8
>  #define ARM_R9 9
>  #define ARM_R10        10
> -#define ARM_FP 11
> -#define ARM_IP 12
> -#define ARM_SP 13
> -#define ARM_LR 14
> -#define ARM_PC 15
> -
> -#define ARM_COND_EQ            0x0
> -#define ARM_COND_NE            0x1
> -#define ARM_COND_CS            0x2
> +#define ARM_FP 11      /* Frame Pointer */
> +#define ARM_IP 12      /* Intra-procedure scratch register */
> +#define ARM_SP 13      /* Stack pointer: as load/store base reg */
> +#define ARM_LR 14      /* Link Register */
> +#define ARM_PC 15      /* Program counter */
> +
> +#define ARM_COND_EQ            0x0     /* == */
> +#define ARM_COND_NE            0x1     /* != */
> +#define ARM_COND_CS            0x2     /* unsigned >= */
>  #define ARM_COND_HS            ARM_COND_CS
> -#define ARM_COND_CC            0x3
> +#define ARM_COND_CC            0x3     /* unsigned < */
>  #define ARM_COND_LO            ARM_COND_CC
> -#define ARM_COND_MI            0x4
> -#define ARM_COND_PL            0x5
> -#define ARM_COND_VS            0x6
> -#define ARM_COND_VC            0x7
> -#define ARM_COND_HI            0x8
> -#define ARM_COND_LS            0x9
> -#define ARM_COND_GE            0xa
> -#define ARM_COND_LT            0xb
> -#define ARM_COND_GT            0xc
> -#define ARM_COND_LE            0xd
> -#define ARM_COND_AL            0xe
> +#define ARM_COND_MI            0x4     /* < 0 */
> +#define ARM_COND_PL            0x5     /* >= 0 */
> +#define ARM_COND_VS            0x6     /* Signed Overflow */
> +#define ARM_COND_VC            0x7     /* No Signed Overflow */
> +#define ARM_COND_HI            0x8     /* unsigned > */
> +#define ARM_COND_LS            0x9     /* unsigned <= */
> +#define ARM_COND_GE            0xa     /* Signed >= */
> +#define ARM_COND_LT            0xb     /* Signed < */
> +#define ARM_COND_GT            0xc     /* Signed > */
> +#define ARM_COND_LE            0xd     /* Signed <= */
> +#define ARM_COND_AL            0xe     /* None */
>
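
For context on how these condition codes are used by the conditional
jumps earlier in the patch: in the A32 encoding the condition sits in
bits 31:28 of the instruction, and B takes a signed 24-bit word offset
in the low bits. The sketch below shows that encoding only; with_cond
and enc_branch are hypothetical helpers, and the constants are copied
from the header or from the A32 encoding of B:

```c
#include <stdint.h>

#define COND_NE	0x1u
#define COND_AL	0xeu
#define INST_B	0x0a000000u	/* A32 branch, condition field left empty */

/* Place a condition code in bits 31:28 of an instruction word. */
static uint32_t with_cond(uint32_t cond, uint32_t inst)
{
	return inst | (cond << 28);
}

/* Conditional branch with a signed 24-bit word offset. */
static uint32_t enc_branch(uint32_t cond, int32_t imm24)
{
	return with_cond(cond, INST_B | ((uint32_t)imm24 & 0x00ffffffu));
}
```

With these, enc_branch(COND_AL, -2) gives 0xeafffffe, the familiar
branch-to-self, because the PC reads as the current instruction plus 8.
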
>  /* register shift types */
>  #define SRTYPE_LSL             0
>  #define SRTYPE_LSR             1
>  #define SRTYPE_ASR             2
>  #define SRTYPE_ROR             3
> +#define SRTYPE_ASL             (SRTYPE_LSL)
>
>  #define ARM_INST_ADD_R         0x00800000
> +#define ARM_INST_ADDS_R                0x00900000
> +#define ARM_INST_ADC_R         0x00a00000
> +#define ARM_INST_ADC_I         0x02a00000
>  #define ARM_INST_ADD_I         0x02800000
> +#define ARM_INST_ADDS_I                0x02900000
>
>  #define ARM_INST_AND_R         0x00000000
>  #define ARM_INST_AND_I         0x02000000
> @@ -76,8 +82,10 @@
>  #define ARM_INST_LDRH_I                0x01d000b0
>  #define ARM_INST_LDRH_R                0x019000b0
>  #define ARM_INST_LDR_I         0x05900000
> +#define ARM_INST_LDR_R         0x07900000
>
>  #define ARM_INST_LDM           0x08900000
> +#define ARM_INST_LDM_IA                0x08b00000
>
>  #define ARM_INST_LSL_I         0x01a00000
>  #define ARM_INST_LSL_R         0x01a00010
> @@ -86,6 +94,7 @@
>  #define ARM_INST_LSR_R         0x01a00030
>
>  #define ARM_INST_MOV_R         0x01a00000
> +#define ARM_INST_MOVS_R                0x01b00000
>  #define ARM_INST_MOV_I         0x03a00000
>  #define ARM_INST_MOVW          0x03000000
>  #define ARM_INST_MOVT          0x03400000
> @@ -96,17 +105,28 @@
>  #define ARM_INST_PUSH          0x092d0000
>
>  #define ARM_INST_ORR_R         0x01800000
> +#define ARM_INST_ORRS_R                0x01900000
>  #define ARM_INST_ORR_I         0x03800000
>
>  #define ARM_INST_REV           0x06bf0f30
>  #define ARM_INST_REV16         0x06bf0fb0
>
>  #define ARM_INST_RSB_I         0x02600000
> +#define ARM_INST_RSBS_I                0x02700000
> +#define ARM_INST_RSC_I         0x02e00000
>
>  #define ARM_INST_SUB_R         0x00400000
> +#define ARM_INST_SUBS_R                0x00500000
> +#define ARM_INST_RSB_R         0x00600000
>  #define ARM_INST_SUB_I         0x02400000
> +#define ARM_INST_SUBS_I                0x02500000
> +#define ARM_INST_SBC_I         0x02c00000
> +#define ARM_INST_SBC_R         0x00c00000
> +#define ARM_INST_SBCS_R                0x00d00000
>
>  #define ARM_INST_STR_I         0x05800000
> +#define ARM_INST_STRB_I                0x05c00000
> +#define ARM_INST_STRH_I                0x01c000b0
>
>  #define ARM_INST_TST_R         0x01100000
>  #define ARM_INST_TST_I         0x03100000
> @@ -117,6 +137,8 @@
>
>  #define ARM_INST_MLS           0x00600090
>
> +#define ARM_INST_UXTH          0x06ff0070
> +
>  /*
>   * Use a suitable undefined instruction to use for ARM/Thumb2 faulting.
>   * We need to be careful not to conflict with those used by other modules
> @@ -135,9 +157,15 @@
>  #define _AL3_R(op, rd, rn, rm) ((op ## _R) | (rd) << 12 | (rn) << 16 | (rm))
>  /* immediate */
>  #define _AL3_I(op, rd, rn, imm)        ((op ## _I) | (rd) << 12 | (rn) << 16 | (imm))
> +/* register with register-shift */
> +#define _AL3_SR(inst)  (inst | (1 << 4))
>
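
A worked example may help when checking the _AL3_R and _AL3_SR
macros. The functions below are a sketch that reproduces the same bit
layout in plain C (al3_r and mov_sr are hypothetical names; the opcode
values are copied from the header above):

```c
#include <stdint.h>

#define INST_ADD_R	0x00800000u
#define INST_MOV_R	0x01a00000u
#define SRTYPE_LSL	0u

/* Register form: rd in bits 15:12, rn in bits 19:16, rm in bits 3:0. */
static uint32_t al3_r(uint32_t op, uint32_t rd, uint32_t rn, uint32_t rm)
{
	return op | rd << 12 | rn << 16 | rm;
}

/* Register-shifted register form: bit 4 set, shift type in bits 6:5,
 * shift register in bits 11:8.
 */
static uint32_t mov_sr(uint32_t rd, uint32_t rm, uint32_t type, uint32_t rs)
{
	return al3_r(INST_MOV_R, rd, 0, rm) | 1u << 4 | type << 5 | rs << 8;
}
```

al3_r(INST_ADD_R, 0, 1, 2) yields 0x00810002, which becomes 0xe0810002
("add r0, r1, r2") once the AL condition is ORed in, and
mov_sr(0, 1, SRTYPE_LSL, 2) yields 0x01a00211 ("lsl r0, r1, r2").
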
>  #define ARM_ADD_R(rd, rn, rm)  _AL3_R(ARM_INST_ADD, rd, rn, rm)
> +#define ARM_ADDS_R(rd, rn, rm) _AL3_R(ARM_INST_ADDS, rd, rn, rm)
>  #define ARM_ADD_I(rd, rn, imm) _AL3_I(ARM_INST_ADD, rd, rn, imm)
> +#define ARM_ADDS_I(rd, rn, imm)        _AL3_I(ARM_INST_ADDS, rd, rn, imm)
> +#define ARM_ADC_R(rd, rn, rm)  _AL3_R(ARM_INST_ADC, rd, rn, rm)
> +#define ARM_ADC_I(rd, rn, imm) _AL3_I(ARM_INST_ADC, rd, rn, imm)
>
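
I assume ADDS and ADC are added here for the 64-bit ALU emulation: the
low words are added with the flags set, and the carry is folded into
the addition of the high words. A sketch of the arithmetic in plain C,
not of the emitted code (struct pair and pair_add are hypothetical):

```c
#include <stdint.h>

struct pair {
	uint32_t lo;
	uint32_t hi;
};

/* 64-bit add on (lo, hi) pairs: add the low words first, then add
 * the high words plus the carry out of the low addition. This is the
 * effect of an ADDS on the low words followed by an ADC on the high
 * words.
 */
static struct pair pair_add(struct pair a, struct pair b)
{
	struct pair r;
	uint32_t carry;

	r.lo = a.lo + b.lo;
	carry = r.lo < a.lo;	/* unsigned wrap-around means carry out */
	r.hi = a.hi + b.hi + carry;
	return r;
}
```

SUBS and SBC follow the same pattern for subtraction, with a borrow in
place of the carry.
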
>  #define ARM_AND_R(rd, rn, rm)  _AL3_R(ARM_INST_AND, rd, rn, rm)
>  #define ARM_AND_I(rd, rn, imm) _AL3_I(ARM_INST_AND, rd, rn, imm)
> @@ -156,7 +184,9 @@
>  #define ARM_EOR_I(rd, rn, imm) _AL3_I(ARM_INST_EOR, rd, rn, imm)
>
>  #define ARM_LDR_I(rt, rn, off) (ARM_INST_LDR_I | (rt) << 12 | (rn) << 16 \
> -                                | (off))
> +                                | ((off) & 0xfff))
> +#define ARM_LDR_R(rt, rn, rm)  (ARM_INST_LDR_R | (rt) << 12 | (rn) << 16 \
> +                                | (rm))
>  #define ARM_LDRB_I(rt, rn, off)        (ARM_INST_LDRB_I | (rt) << 12 | (rn) << 16 \
>                                  | (off))
>  #define ARM_LDRB_R(rt, rn, rm) (ARM_INST_LDRB_R | (rt) << 12 | (rn) << 16 \
> @@ -167,15 +197,23 @@
>                                  | (rm))
>
>  #define ARM_LDM(rn, regs)      (ARM_INST_LDM | (rn) << 16 | (regs))
> +#define ARM_LDM_IA(rn, regs)   (ARM_INST_LDM_IA | (rn) << 16 | (regs))
>
>  #define ARM_LSL_R(rd, rn, rm)  (_AL3_R(ARM_INST_LSL, rd, 0, rn) | (rm) << 8)
>  #define ARM_LSL_I(rd, rn, imm) (_AL3_I(ARM_INST_LSL, rd, 0, rn) | (imm) << 7)
>
>  #define ARM_LSR_R(rd, rn, rm)  (_AL3_R(ARM_INST_LSR, rd, 0, rn) | (rm) << 8)
>  #define ARM_LSR_I(rd, rn, imm) (_AL3_I(ARM_INST_LSR, rd, 0, rn) | (imm) << 7)
> +#define ARM_ASR_R(rd, rn, rm)   (_AL3_R(ARM_INST_ASR, rd, 0, rn) | (rm) << 8)
> +#define ARM_ASR_I(rd, rn, imm)  (_AL3_I(ARM_INST_ASR, rd, 0, rn) | (imm) << 7)
>
>  #define ARM_MOV_R(rd, rm)      _AL3_R(ARM_INST_MOV, rd, 0, rm)
> +#define ARM_MOVS_R(rd, rm)     _AL3_R(ARM_INST_MOVS, rd, 0, rm)
>  #define ARM_MOV_I(rd, imm)     _AL3_I(ARM_INST_MOV, rd, 0, imm)
> +#define ARM_MOV_SR(rd, rm, type, rs)   \
> +       (_AL3_SR(ARM_MOV_R(rd, rm)) | (type) << 5 | (rs) << 8)
> +#define ARM_MOV_SI(rd, rm, type, imm6) \
> +       (ARM_MOV_R(rd, rm) | (type) << 5 | (imm6) << 7)
>
>  #define ARM_MOVW(rd, imm)      \
>         (ARM_INST_MOVW | ((imm) >> 12) << 16 | (rd) << 12 | ((imm) & 0x0fff))
> @@ -190,19 +228,38 @@
>
>  #define ARM_ORR_R(rd, rn, rm)  _AL3_R(ARM_INST_ORR, rd, rn, rm)
>  #define ARM_ORR_I(rd, rn, imm) _AL3_I(ARM_INST_ORR, rd, rn, imm)
> -#define ARM_ORR_S(rd, rn, rm, type, rs)        \
> -       (ARM_ORR_R(rd, rn, rm) | (type) << 5 | (rs) << 7)
> +#define ARM_ORR_SR(rd, rn, rm, type, rs)       \
> +       (_AL3_SR(ARM_ORR_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
> +#define ARM_ORRS_R(rd, rn, rm) _AL3_R(ARM_INST_ORRS, rd, rn, rm)
> +#define ARM_ORRS_SR(rd, rn, rm, type, rs)      \
> +       (_AL3_SR(ARM_ORRS_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
> +#define ARM_ORR_SI(rd, rn, rm, type, imm6)     \
> +       (ARM_ORR_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
> +#define ARM_ORRS_SI(rd, rn, rm, type, imm6)    \
> +       (ARM_ORRS_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
>
>  #define ARM_REV(rd, rm)                (ARM_INST_REV | (rd) << 12 | (rm))
>  #define ARM_REV16(rd, rm)      (ARM_INST_REV16 | (rd) << 12 | (rm))
>
>  #define ARM_RSB_I(rd, rn, imm) _AL3_I(ARM_INST_RSB, rd, rn, imm)
> +#define ARM_RSBS_I(rd, rn, imm)        _AL3_I(ARM_INST_RSBS, rd, rn, imm)
> +#define ARM_RSC_I(rd, rn, imm) _AL3_I(ARM_INST_RSC, rd, rn, imm)
>
>  #define ARM_SUB_R(rd, rn, rm)  _AL3_R(ARM_INST_SUB, rd, rn, rm)
> +#define ARM_SUBS_R(rd, rn, rm) _AL3_R(ARM_INST_SUBS, rd, rn, rm)
> +#define ARM_RSB_R(rd, rn, rm)  _AL3_R(ARM_INST_RSB, rd, rn, rm)
> +#define ARM_SBC_R(rd, rn, rm)  _AL3_R(ARM_INST_SBC, rd, rn, rm)
> +#define ARM_SBCS_R(rd, rn, rm) _AL3_R(ARM_INST_SBCS, rd, rn, rm)
>  #define ARM_SUB_I(rd, rn, imm) _AL3_I(ARM_INST_SUB, rd, rn, imm)
> +#define ARM_SUBS_I(rd, rn, imm)        _AL3_I(ARM_INST_SUBS, rd, rn, imm)
> +#define ARM_SBC_I(rd, rn, imm) _AL3_I(ARM_INST_SBC, rd, rn, imm)
>
>  #define ARM_STR_I(rt, rn, off) (ARM_INST_STR_I | (rt) << 12 | (rn) << 16 \
> -                                | (off))
> +                                | ((off) & 0xfff))
> +#define ARM_STRH_I(rt, rn, off)        (ARM_INST_STRH_I | (rt) << 12 | (rn) << 16 \
> +                                | (((off) & 0xf0) << 4) | ((off) & 0xf))
> +#define ARM_STRB_I(rt, rn, off)        (ARM_INST_STRB_I | (rt) << 12 | (rn) << 16 \
> +                                | ((off) & 0xfff))
>
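
One thing worth double-checking in the store macros is the immediate
layout. As far as I know from the A32 encodings, word and byte stores
(STR, STRB) take a 12-bit immediate in bits 11:0, while halfword
stores (STRH) take an 8-bit immediate split into bits 11:8 and 3:0.
A sketch of the two layouts (enc_str_i and enc_strh_i are hypothetical
names; the opcode values are copied from the header above):

```c
#include <stdint.h>

#define INST_STR_I	0x05800000u	/* word store, imm12 offset */
#define INST_STRH_I	0x01c000b0u	/* halfword store, split imm8 offset */

/* imm12 form: offset goes into bits 11:0 unchanged. */
static uint32_t enc_str_i(uint32_t rt, uint32_t rn, uint32_t off)
{
	return INST_STR_I | rt << 12 | rn << 16 | (off & 0xfff);
}

/* Split imm8 form: high nibble in bits 11:8, low nibble in bits 3:0. */
static uint32_t enc_strh_i(uint32_t rt, uint32_t rn, uint32_t off)
{
	return INST_STRH_I | rt << 12 | rn << 16 |
	       ((off & 0xf0) << 4) | (off & 0xf);
}
```

For an offset of 0x24 with rt = r1 and rn = r2 the two forms give
0x05821024 and 0x01c212b4 respectively, so mixing them up only goes
unnoticed for offsets below 16.
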
>  #define ARM_TST_R(rn, rm)      _AL3_R(ARM_INST_TST, 0, rn, rm)
>  #define ARM_TST_I(rn, imm)     _AL3_I(ARM_INST_TST, 0, rn, imm)
> @@ -214,5 +271,6 @@
>
>  #define ARM_MLS(rd, rn, rm, ra)        (ARM_INST_MLS | (rd) << 16 | (rn) | (rm) << 8 \
>                                  | (ra) << 12)
> +#define ARM_UXTH(rd, rm)       (ARM_INST_UXTH | (rd) << 12 | (rm))
>
>  #endif /* PFILTER_OPCODES_ARM_H */
> --
> 2.7.4
>



-- 
Kees Cook
Pixel Security


* Re: [PATCH v2] arm: eBPF JIT compiler
@ 2017-05-30 19:19   ` Kees Cook
  0 siblings, 0 replies; 87+ messages in thread
From: Kees Cook @ 2017-05-30 19:19 UTC (permalink / raw)
  To: Shubham Bansal, Network Development, Daniel Borkmann,
	David S. Miller, Alexei Starovoitov
  Cc: Russell King, linux-arm-kernel, LKML, Andrew Lunn

Forwarding this to net-dev and eBPF folks, who weren't on CC...

-Kees

On Thu, May 25, 2017 at 4:13 PM, Shubham Bansal
<illusionist.neo@gmail.com> wrote:
> The JIT compiler emits ARM 32 bit instructions. Currently, it supports
> eBPF only. Classic BPF is supported through the conversion done by the
> BPF core.
>
> This patch essentially changes the current implementation of the
> Berkeley Packet Filter JIT compiler from classic to internal, with
> almost all instructions from the eBPF ISA supported except the following:
>         BPF_ALU64 | BPF_DIV | BPF_K
>         BPF_ALU64 | BPF_DIV | BPF_X
>         BPF_ALU64 | BPF_MOD | BPF_K
>         BPF_ALU64 | BPF_MOD | BPF_X
>         BPF_STX | BPF_XADD | BPF_W
>         BPF_STX | BPF_XADD | BPF_DW
>         BPF_JMP | BPF_CALL
>
> The implementation uses scratch space to emulate the 64 bit eBPF ISA on
> 32 bit ARM because ARM does not have enough general purpose registers.
> Currently, only LITTLE ENDIAN machines are supported in this eBPF JIT
> Compiler.
>
> Tested on ARMv7 with QEMU by me (Shubham Bansal).
> Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
> Expected to work on ARMv6 as well, as it is part ARMv7 and part ARMv5,
> although no proper testing has been done on ARMv6.
>
> Both of these tests were done with and without CONFIG_FRAME_POINTER
> separately on a LITTLE ENDIAN machine.
>
> For testing:
>
> 1. JIT is enabled with
>         echo 1 > /proc/sys/net/core/bpf_jit_enable
> 2. Constant Blinding can be enabled along with JIT using
>         echo 1 > /proc/sys/net/core/bpf_jit_enable
>         echo 2 > /proc/sys/net/core/bpf_jit_harden
>
> See Documentation/networking/filter.txt for more information.
>
> Result : test_bpf: Summary: 314 PASSED, 0 FAILED, [278/306 JIT'ed]
>
> Signed-off-by: Shubham Bansal <illusionist.neo@gmail.com>
> ---
>  Documentation/networking/filter.txt |    4 +-
>  arch/arm/Kconfig                    |    2 +-
>  arch/arm/net/bpf_jit_32.c           | 2404 ++++++++++++++++++++++++-----------
>  arch/arm/net/bpf_jit_32.h           |  108 +-
>  4 files changed, 1713 insertions(+), 805 deletions(-)
>
> diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
> index b69b205..01165ac 100644
> --- a/Documentation/networking/filter.txt
> +++ b/Documentation/networking/filter.txt
> @@ -596,8 +596,8 @@ skb pointer). All constraints and restrictions from bpf_check_classic() apply
>  before a conversion to the new layout is being done behind the scenes!
>
>  Currently, the classic BPF format is being used for JITing on most 32-bit
> -architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64 perform JIT
> -compilation from eBPF instruction set.
> +architectures, whereas x86-64, aarch64, arm, s390x, powerpc64, sparc64 perform
> +JIT compilation from eBPF instruction set.
>
>  Some core changes of the new internal format:
>
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 8a7ab5e..13ade46 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -47,7 +47,7 @@ config ARM
>         select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
>         select HAVE_ARCH_TRACEHOOK
>         select HAVE_ARM_SMCCC if CPU_V7
> -       select HAVE_CBPF_JIT
> +       select HAVE_EBPF_JIT
>         select HAVE_CC_STACKPROTECTOR
>         select HAVE_CONTEXT_TRACKING
>         select HAVE_C_RECORDMCOUNT
> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
> index 93d0b6d..c7476e5 100644
> --- a/arch/arm/net/bpf_jit_32.c
> +++ b/arch/arm/net/bpf_jit_32.c
> @@ -1,13 +1,15 @@
>  /*
> - * Just-In-Time compiler for BPF filters on 32bit ARM
> + * Just-In-Time compiler for eBPF filters on 32bit ARM
>   *
>   * Copyright (c) 2011 Mircea Gherzan <mgherzan@gmail.com>
> + * Copyright (c) 2017 Shubham Bansal <illusionist.neo@gmail.com>
>   *
>   * This program is free software; you can redistribute it and/or modify it
>   * under the terms of the GNU General Public License as published by the
>   * Free Software Foundation; version 2 of the License.
>   */
>
> +#include <linux/bpf.h>
>  #include <linux/bitops.h>
>  #include <linux/compiler.h>
>  #include <linux/errno.h>
> @@ -23,44 +25,91 @@
>
>  #include "bpf_jit_32.h"
>
> +int bpf_jit_enable __read_mostly;
> +
> +#define STACK_OFFSET(k)        (k)
> +#define TMP_REG_1      (MAX_BPF_JIT_REG + 0)   /* TEMP Register 1 */
> +#define TMP_REG_2      (MAX_BPF_JIT_REG + 1)   /* TEMP Register 2 */
> +#define TCALL_CNT      (MAX_BPF_JIT_REG + 2)   /* Tail Call Count */
> +
> +/* Flags used for JIT optimization */
> +#define SEEN_CALL      (1 << 0)
> +
> +#define FLAG_IMM_OVERFLOW      (1 << 0)
> +
>  /*
> - * ABI:
> + * Map eBPF registers to ARM 32bit registers or stack scratch space.
> + *
> + * 1. First argument is passed using the arm 32bit registers and rest of the
> + * arguments are passed on stack scratch space.
> + * 2. First callee-saved argument is mapped to arm 32 bit registers and rest of
> + * the arguments are mapped to scratch space on stack.
> + * 3. We need two 64 bit temp registers to do complex operations on eBPF
> + * registers.
> + *
> + * As the eBPF registers are all 64 bit registers and arm has only 32 bit
> + * registers, we have to map each eBPF registers with two arm 32 bit regs or
> + * scratch memory space and we have to build eBPF 64 bit register from those.
>   *
> - * r0  scratch register
> - * r4  BPF register A
> - * r5  BPF register X
> - * r6  pointer to the skb
> - * r7  skb->data
> - * r8  skb_headlen(skb)
>   */
> +static const u8 bpf2a32[][2] = {
> +       /* return value from in-kernel function, and exit value from eBPF */
> +       [BPF_REG_0] = {ARM_R1, ARM_R0},
> +       /* arguments from eBPF program to in-kernel function */
> +       [BPF_REG_1] = {ARM_R3, ARM_R2},
> +       /* Stored on stack scratch space */
> +       [BPF_REG_2] = {STACK_OFFSET(0), STACK_OFFSET(4)},
> +       [BPF_REG_3] = {STACK_OFFSET(8), STACK_OFFSET(12)},
> +       [BPF_REG_4] = {STACK_OFFSET(16), STACK_OFFSET(20)},
> +       [BPF_REG_5] = {STACK_OFFSET(24), STACK_OFFSET(28)},
> +       /* callee saved registers that in-kernel function will preserve */
> +       [BPF_REG_6] = {ARM_R5, ARM_R4},
> +       /* Stored on stack scratch space */
> +       [BPF_REG_7] = {STACK_OFFSET(32), STACK_OFFSET(36)},
> +       [BPF_REG_8] = {STACK_OFFSET(40), STACK_OFFSET(44)},
> +       [BPF_REG_9] = {STACK_OFFSET(48), STACK_OFFSET(52)},
> +       /* Read only Frame Pointer to access Stack */
> +       [BPF_REG_FP] = {STACK_OFFSET(56), STACK_OFFSET(60)},
> +       /* Temporary Register for internal BPF JIT, can be used
> +        * for constant blindings and others.
> +        */
> +       [TMP_REG_1] = {ARM_R7, ARM_R6},
> +       [TMP_REG_2] = {ARM_R10, ARM_R8},
> +       /* Tail call count. Stored on stack scratch space. */
> +       [TCALL_CNT] = {STACK_OFFSET(64), STACK_OFFSET(68)},
> +       /* temporary register for blinding constants.
> +        * Stored on stack scratch space.
> +        */
> +       [BPF_REG_AX] = {STACK_OFFSET(72), STACK_OFFSET(76)},
> +};
>
> -#define r_scratch      ARM_R0
> -/* r1-r3 are (also) used for the unaligned loads on the non-ARMv7 slowpath */
> -#define r_off          ARM_R1
> -#define r_A            ARM_R4
> -#define r_X            ARM_R5
> -#define r_skb          ARM_R6
> -#define r_skb_data     ARM_R7
> -#define r_skb_hl       ARM_R8
> -
> -#define SCRATCH_SP_OFFSET      0
> -#define SCRATCH_OFF(k)         (SCRATCH_SP_OFFSET + 4 * (k))
> -
> -#define SEEN_MEM               ((1 << BPF_MEMWORDS) - 1)
> -#define SEEN_MEM_WORD(k)       (1 << (k))
> -#define SEEN_X                 (1 << BPF_MEMWORDS)
> -#define SEEN_CALL              (1 << (BPF_MEMWORDS + 1))
> -#define SEEN_SKB               (1 << (BPF_MEMWORDS + 2))
> -#define SEEN_DATA              (1 << (BPF_MEMWORDS + 3))
> +#define        dst_lo  dst[1]
> +#define dst_hi dst[0]
> +#define src_lo src[1]
> +#define src_hi src[0]
>
> -#define FLAG_NEED_X_RESET      (1 << 0)
> -#define FLAG_IMM_OVERFLOW      (1 << 1)
> +/*
> + * JIT Context:
> + *
> + * prog                        :       bpf_prog
> + * idx                 :       index of current last JITed instruction.
> + * prologue_bytes      :       bytes used in prologue.
> + * epilogue_offset     :       offset of epilogue starting.
> + * seen                        :       bit mask used for JIT optimization.
> + * offsets             :       array of eBPF instruction offsets in
> + *                             JITed code.
> + * target              :       final JITed code.
> + * epilogue_bytes      :       no of bytes used in epilogue.
> + * imm_count           :       no of immediate counts used for global
> + *                             variables.
> + * imms                        :       array of global variable addresses.
> + */
>
>  struct jit_ctx {
> -       const struct bpf_prog *skf;
> -       unsigned idx;
> -       unsigned prologue_bytes;
> -       int ret0_fp_idx;
> +       const struct bpf_prog *prog;
> +       unsigned int idx;
> +       unsigned int prologue_bytes;
> +       unsigned int epilogue_offset;
>         u32 seen;
>         u32 flags;
>         u32 *offsets;
> @@ -72,68 +121,16 @@ struct jit_ctx {
>  #endif
>  };
>
> -int bpf_jit_enable __read_mostly;
> -
> -static inline int call_neg_helper(struct sk_buff *skb, int offset, void *ret,
> -                     unsigned int size)
> -{
> -       void *ptr = bpf_internal_load_pointer_neg_helper(skb, offset, size);
> -
> -       if (!ptr)
> -               return -EFAULT;
> -       memcpy(ret, ptr, size);
> -       return 0;
> -}
> -
> -static u64 jit_get_skb_b(struct sk_buff *skb, int offset)
> -{
> -       u8 ret;
> -       int err;
> -
> -       if (offset < 0)
> -               err = call_neg_helper(skb, offset, &ret, 1);
> -       else
> -               err = skb_copy_bits(skb, offset, &ret, 1);
> -
> -       return (u64)err << 32 | ret;
> -}
> -
> -static u64 jit_get_skb_h(struct sk_buff *skb, int offset)
> -{
> -       u16 ret;
> -       int err;
> -
> -       if (offset < 0)
> -               err = call_neg_helper(skb, offset, &ret, 2);
> -       else
> -               err = skb_copy_bits(skb, offset, &ret, 2);
> -
> -       return (u64)err << 32 | ntohs(ret);
> -}
> -
> -static u64 jit_get_skb_w(struct sk_buff *skb, int offset)
> -{
> -       u32 ret;
> -       int err;
> -
> -       if (offset < 0)
> -               err = call_neg_helper(skb, offset, &ret, 4);
> -       else
> -               err = skb_copy_bits(skb, offset, &ret, 4);
> -
> -       return (u64)err << 32 | ntohl(ret);
> -}
> -
>  /*
>   * Wrappers which handle both OABI and EABI and assures Thumb2 interworking
>   * (where the assembly routines like __aeabi_uidiv could cause problems).
>   */
> -static u32 jit_udiv(u32 dividend, u32 divisor)
> +static u32 jit_udiv32(u32 dividend, u32 divisor)
>  {
>         return dividend / divisor;
>  }
>
> -static u32 jit_mod(u32 dividend, u32 divisor)
> +static u32 jit_mod32(u32 dividend, u32 divisor)
>  {
>         return dividend % divisor;
>  }
> @@ -157,36 +154,22 @@ static inline void emit(u32 inst, struct jit_ctx *ctx)
>         _emit(ARM_COND_AL, inst, ctx);
>  }
>
> -static u16 saved_regs(struct jit_ctx *ctx)
> +/*
> + * Checks if immediate value can be converted to imm12(12 bits) value.
> + */
> +static int16_t imm8m(u32 x)
>  {
> -       u16 ret = 0;
> -
> -       if ((ctx->skf->len > 1) ||
> -           (ctx->skf->insns[0].code == (BPF_RET | BPF_A)))
> -               ret |= 1 << r_A;
> -
> -#ifdef CONFIG_FRAME_POINTER
> -       ret |= (1 << ARM_FP) | (1 << ARM_IP) | (1 << ARM_LR) | (1 << ARM_PC);
> -#else
> -       if (ctx->seen & SEEN_CALL)
> -               ret |= 1 << ARM_LR;
> -#endif
> -       if (ctx->seen & (SEEN_DATA | SEEN_SKB))
> -               ret |= 1 << r_skb;
> -       if (ctx->seen & SEEN_DATA)
> -               ret |= (1 << r_skb_data) | (1 << r_skb_hl);
> -       if (ctx->seen & SEEN_X)
> -               ret |= 1 << r_X;
> -
> -       return ret;
> -}
> +       u32 rot;
>
> -static inline int mem_words_used(struct jit_ctx *ctx)
> -{
> -       /* yes, we do waste some stack space IF there are "holes" in the set" */
> -       return fls(ctx->seen & SEEN_MEM);
> +       for (rot = 0; rot < 16; rot++)
> +               if ((x & ~ror32(0xff, 2 * rot)) == 0)
> +                       return rol32(x, 2 * rot) | (rot << 8);
> +       return -1;
>  }
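
For anyone reading along who hasn't met the ARM "modified immediate" form: the value has to fit in an 8-bit window rotated right by an even amount, and the result packs the 4-bit rotation above the 8-bit payload. Below is a minimal userspace sketch of the same check, assuming nothing beyond standard C. The `_sketch` names are mine, and the rotate helpers are stand-ins for the kernel's ror32()/rol32().

```c
#include <stdint.h>

/* 32-bit rotates; n == 0 is special-cased so we never shift by 32. */
static uint32_t ror32_sketch(uint32_t x, unsigned int n)
{
	n &= 31;
	return n ? (x >> n) | (x << (32 - n)) : x;
}

static uint32_t rol32_sketch(uint32_t x, unsigned int n)
{
	n &= 31;
	return n ? (x << n) | (x >> (32 - n)) : x;
}

/*
 * Returns rot:imm8 packed into 12 bits, or -1 when x does not fit in an
 * 8-bit window rotated right by an even amount.
 */
static int imm8m_sketch(uint32_t x)
{
	unsigned int rot;

	for (rot = 0; rot < 16; rot++)
		if ((x & ~ror32_sketch(0xff, 2 * rot)) == 0)
			return (int)(rol32_sketch(x, 2 * rot) | (rot << 8));
	return -1;
}
```

With this, 0x00ff0000 encodes as 0x8ff (the value the old emit_swap16() comment mentioned), while something like 0x101 returns -1 and has to go through emit_mov_i_no8m().
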
>
> +/*
> + * Initializes the JIT space with undefined instructions.
> + */
>  static void jit_fill_hole(void *area, unsigned int size)
>  {
>         u32 *ptr;
> @@ -195,88 +178,34 @@ static void jit_fill_hole(void *area, unsigned int size)
>                 *ptr++ = __opcode_to_mem_arm(ARM_INST_UDF);
>  }
>
> -static void build_prologue(struct jit_ctx *ctx)
> -{
> -       u16 reg_set = saved_regs(ctx);
> -       u16 off;
> -
> -#ifdef CONFIG_FRAME_POINTER
> -       emit(ARM_MOV_R(ARM_IP, ARM_SP), ctx);
> -       emit(ARM_PUSH(reg_set), ctx);
> -       emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
> -#else
> -       if (reg_set)
> -               emit(ARM_PUSH(reg_set), ctx);
> -#endif
> +/* Stack must be multiples of 16 Bytes */
> +#define STACK_ALIGN(sz) (((sz) + 15) & ~15)
>
> -       if (ctx->seen & (SEEN_DATA | SEEN_SKB))
> -               emit(ARM_MOV_R(r_skb, ARM_R0), ctx);
> -
> -       if (ctx->seen & SEEN_DATA) {
> -               off = offsetof(struct sk_buff, data);
> -               emit(ARM_LDR_I(r_skb_data, r_skb, off), ctx);
> -               /* headlen = len - data_len */
> -               off = offsetof(struct sk_buff, len);
> -               emit(ARM_LDR_I(r_skb_hl, r_skb, off), ctx);
> -               off = offsetof(struct sk_buff, data_len);
> -               emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
> -               emit(ARM_SUB_R(r_skb_hl, r_skb_hl, r_scratch), ctx);
> -       }
> -
> -       if (ctx->flags & FLAG_NEED_X_RESET)
> -               emit(ARM_MOV_I(r_X, 0), ctx);
> -
> -       /* do not leak kernel data to userspace */
> -       if (bpf_needs_clear_a(&ctx->skf->insns[0]))
> -               emit(ARM_MOV_I(r_A, 0), ctx);
> -
> -       /* stack space for the BPF_MEM words */
> -       if (ctx->seen & SEEN_MEM)
> -               emit(ARM_SUB_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
> -}
> -
> -static void build_epilogue(struct jit_ctx *ctx)
> -{
> -       u16 reg_set = saved_regs(ctx);
> -
> -       if (ctx->seen & SEEN_MEM)
> -               emit(ARM_ADD_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
> -
> -       reg_set &= ~(1 << ARM_LR);
> -
> -#ifdef CONFIG_FRAME_POINTER
> -       /* the first instruction of the prologue was: mov ip, sp */
> -       reg_set &= ~(1 << ARM_IP);
> -       reg_set |= (1 << ARM_SP);
> -       emit(ARM_LDM(ARM_SP, reg_set), ctx);
> -#else
> -       if (reg_set) {
> -               if (ctx->seen & SEEN_CALL)
> -                       reg_set |= 1 << ARM_PC;
> -               emit(ARM_POP(reg_set), ctx);
> -       }
> +/* Stack space for BPF_REG_2, BPF_REG_3, BPF_REG_4,
> + * BPF_REG_5, BPF_REG_7, BPF_REG_8, BPF_REG_9,
> + * BPF_REG_FP, Tail call count and BPF_REG_AX
> + * (10 register pairs of 8 bytes each).
> + */
> +#define SCRATCH_SIZE 80
>
> -       if (!(ctx->seen & SEEN_CALL))
> -               emit(ARM_BX(ARM_LR), ctx);
> -#endif
> -}
> +/* total stack size used in JITed code */
> +#define _STACK_SIZE \
> +       (MAX_BPF_STACK + \
> +        SCRATCH_SIZE + \
> +        4 /* extra for skb_copy_bits buffer */)
>
> -static int16_t imm8m(u32 x)
> -{
> -       u32 rot;
> +#define STACK_SIZE STACK_ALIGN(_STACK_SIZE)
>
> -       for (rot = 0; rot < 16; rot++)
> -               if ((x & ~ror32(0xff, 2 * rot)) == 0)
> -                       return rol32(x, 2 * rot) | (rot << 8);
> +/* Get the offset of eBPF REGISTERs stored on scratch space. */
> +#define STACK_VAR(off) (STACK_SIZE - (off) - 4)
>
> -       return -1;
> -}
> +/* Offset of skb_copy_bits buffer */
> +#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
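
To make the layout concrete, here is a small sketch that evaluates these macros in userspace. It assumes MAX_BPF_STACK is 512 (its value in linux/filter.h as far as I know). scratch_slot_offset() is a hypothetical helper used only for illustration.

```c
/* Assumption: MAX_BPF_STACK is 512 bytes, as defined in linux/filter.h. */
#define MAX_BPF_STACK	512
#define SCRATCH_SIZE	80
#define STACK_ALIGN(sz)	(((sz) + 15) & ~15)
#define STACK_SIZE	STACK_ALIGN(MAX_BPF_STACK + SCRATCH_SIZE + 4)
#define STACK_VAR(off)	(STACK_SIZE - (off) - 4)
#define SKB_BUFFER	STACK_VAR(SCRATCH_SIZE)

/* Hypothetical helper: sp-relative byte offset of a scratch slot. */
static int scratch_slot_offset(int off)
{
	return STACK_VAR(off);
}
```

Under that assumption STACK_SIZE comes out as 608, the two words of BPF_REG_2 (STACK_OFFSET 0 and 4) sit at sp+604 and sp+600, and SKB_BUFFER lands at sp+524, just below the last scratch slot.
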
>
>  #if __LINUX_ARM_ARCH__ < 7
>
>  static u16 imm_offset(u32 k, struct jit_ctx *ctx)
>  {
> -       unsigned i = 0, offset;
> +       unsigned int i = 0, offset;
>         u16 imm;
>
>         /* on the "fake" run we just count them (duplicates included) */
> @@ -295,7 +224,7 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
>                 ctx->imms[i] = k;
>
>         /* constants go just after the epilogue */
> -       offset =  ctx->offsets[ctx->skf->len];
> +       offset =  ctx->offsets[ctx->prog->len - 1] * 4;
>         offset += ctx->prologue_bytes;
>         offset += ctx->epilogue_bytes;
>         offset += i * 4;
> @@ -319,10 +248,22 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
>
>  #endif /* __LINUX_ARM_ARCH__ */
>
> +static inline int bpf2a32_offset(int bpf_to, int bpf_from,
> +                                const struct jit_ctx *ctx) {
> +       int to, from;
> +
> +       if (ctx->target == NULL)
> +               return 0;
> +       to = ctx->offsets[bpf_to];
> +       from = ctx->offsets[bpf_from];
> +
> +       return to - from - 1;
> +}
> +
>  /*
>   * Move an immediate that's not an imm8m to a core register.
>   */
> -static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
> +static inline void emit_mov_i_no8m(const u8 rd, u32 val, struct jit_ctx *ctx)
>  {
>  #if __LINUX_ARM_ARCH__ < 7
>         emit(ARM_LDR_I(rd, ARM_PC, imm_offset(val, ctx)), ctx);
> @@ -333,7 +274,7 @@ static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
>  #endif
>  }
>
> -static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
> +static inline void emit_mov_i(const u8 rd, u32 val, struct jit_ctx *ctx)
>  {
>         int imm12 = imm8m(val);
>
> @@ -343,676 +284,1553 @@ static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
>                 emit_mov_i_no8m(rd, val, ctx);
>  }
>
> -#if __LINUX_ARM_ARCH__ < 6
> -
> -static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
> +static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
>  {
> -       _emit(cond, ARM_LDRB_I(ARM_R3, r_addr, 1), ctx);
> -       _emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
> -       _emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 3), ctx);
> -       _emit(cond, ARM_LSL_I(ARM_R3, ARM_R3, 16), ctx);
> -       _emit(cond, ARM_LDRB_I(ARM_R0, r_addr, 2), ctx);
> -       _emit(cond, ARM_ORR_S(ARM_R3, ARM_R3, ARM_R1, SRTYPE_LSL, 24), ctx);
> -       _emit(cond, ARM_ORR_R(ARM_R3, ARM_R3, ARM_R2), ctx);
> -       _emit(cond, ARM_ORR_S(r_res, ARM_R3, ARM_R0, SRTYPE_LSL, 8), ctx);
> +       ctx->seen |= SEEN_CALL;
> +#if __LINUX_ARM_ARCH__ < 5
> +       emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
> +
> +       if (elf_hwcap & HWCAP_THUMB)
> +               emit(ARM_BX(tgt_reg), ctx);
> +       else
> +               emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
> +#else
> +       emit(ARM_BLX_R(tgt_reg), ctx);
> +#endif
>  }
>
> -static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
> +static inline int epilogue_offset(const struct jit_ctx *ctx)
>  {
> -       _emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
> -       _emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 1), ctx);
> -       _emit(cond, ARM_ORR_S(r_res, ARM_R2, ARM_R1, SRTYPE_LSL, 8), ctx);
> +       int to, from;
> +       /* No need for 1st dummy run */
> +       if (ctx->target == NULL)
> +               return 0;
> +       to = ctx->epilogue_offset;
> +       from = ctx->idx;
> +
> +       return to - from - 2;
>  }
>
> -static inline void emit_swap16(u8 r_dst, u8 r_src, struct jit_ctx *ctx)
> +static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx, u8 op)
>  {
> -       /* r_dst = (r_src << 8) | (r_src >> 8) */
> -       emit(ARM_LSL_I(ARM_R1, r_src, 8), ctx);
> -       emit(ARM_ORR_S(r_dst, ARM_R1, r_src, SRTYPE_LSR, 8), ctx);
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       s32 jmp_offset;
> +
> +       /* checks if divisor is zero or not. If it is, then
> +        * exit directly.
> +        */
> +       emit(ARM_CMP_I(rn, 0), ctx);
> +       _emit(ARM_COND_EQ, ARM_MOV_I(ARM_R0, 0), ctx);
> +       jmp_offset = epilogue_offset(ctx);
> +       _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
> +#if __LINUX_ARM_ARCH__ == 7
> +       if (elf_hwcap & HWCAP_IDIVA) {
> +               if (op == BPF_DIV)
> +                       emit(ARM_UDIV(rd, rm, rn), ctx);
> +               else {
> +                       emit(ARM_UDIV(ARM_IP, rm, rn), ctx);
> +                       emit(ARM_MLS(rd, rn, ARM_IP, rm), ctx);
> +               }
> +               return;
> +       }
> +#endif
>
>         /*
> -        * we need to mask out the bits set in r_dst[23:16] due to
> -        * the first shift instruction.
> -        *
> -        * note that 0x8ff is the encoded immediate 0x00ff0000.
> +        * For BPF_ALU | BPF_DIV | BPF_K instructions:
> +        * As ARM_R1 and ARM_R0 contain the 1st argument of the bpf
> +        * function, we need to save them on the caller side to keep
> +        * them from being destroyed within the callee.
> +        * After the return from the callee, we restore ARM_R0 and
> +        * ARM_R1.
>          */
> -       emit(ARM_BIC_I(r_dst, r_dst, 0x8ff), ctx);
> -}
> +       if (rn != ARM_R1) {
> +               emit(ARM_MOV_R(tmp[0], ARM_R1), ctx);
> +               emit(ARM_MOV_R(ARM_R1, rn), ctx);
> +       }
> +       if (rm != ARM_R0) {
> +               emit(ARM_MOV_R(tmp[1], ARM_R0), ctx);
> +               emit(ARM_MOV_R(ARM_R0, rm), ctx);
> +       }
>
> -#else  /* ARMv6+ */
> +       /* Call appropriate function */
> +       ctx->seen |= SEEN_CALL;
> +       emit_mov_i(ARM_IP, op == BPF_DIV ?
> +                  (u32)jit_udiv32 : (u32)jit_mod32, ctx);
> +       emit_blx_r(ARM_IP, ctx);
>
> -static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
> -{
> -       _emit(cond, ARM_LDR_I(r_res, r_addr, 0), ctx);
> -#ifdef __LITTLE_ENDIAN
> -       _emit(cond, ARM_REV(r_res, r_res), ctx);
> -#endif
> +       /* Save return value */
> +       if (rd != ARM_R0)
> +               emit(ARM_MOV_R(rd, ARM_R0), ctx);
> +
> +       /* Restore ARM_R0 and ARM_R1 */
> +       if (rn != ARM_R1)
> +               emit(ARM_MOV_R(ARM_R1, tmp[0]), ctx);
> +       if (rm != ARM_R0)
> +               emit(ARM_MOV_R(ARM_R0, tmp[1]), ctx);
>  }
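
A note on the HWCAP_IDIVA path, since the MLS operand order is easy to misread: MLS computes rd = ra - rn * rm, so UDIV followed by MLS yields the remainder. A rough sketch of the arithmetic in plain C follows. The function name is made up for illustration, and the assert stands in for the zero-divisor check that the JITed code performs before reaching this point.

```c
#include <assert.h>
#include <stdint.h>

/* udiv ip, rm, rn ; mls rd, rn, ip, rm  =>  rd = rm - rn * ip */
static uint32_t umod_via_mls_sketch(uint32_t rm, uint32_t rn)
{
	uint32_t ip;

	/* The JITed code has already exited with r0 = 0 when rn == 0. */
	assert(rn != 0);
	ip = rm / rn;		/* quotient, as UDIV leaves it in ip */
	return rm - rn * ip;	/* MLS: multiply and subtract from rm */
}
```
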
>
> -static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
> +/* Checks whether BPF register is on scratch stack space or not. */
> +static inline bool is_on_stack(u8 bpf_reg)
>  {
> -       _emit(cond, ARM_LDRH_I(r_res, r_addr, 0), ctx);
> -#ifdef __LITTLE_ENDIAN
> -       _emit(cond, ARM_REV16(r_res, r_res), ctx);
> -#endif
> +       static u8 stack_regs[] = {BPF_REG_AX, BPF_REG_3, BPF_REG_4, BPF_REG_5,
> +                               BPF_REG_7, BPF_REG_8, BPF_REG_9, TCALL_CNT,
> +                               BPF_REG_2, BPF_REG_FP};
> +       int i, reg_len = sizeof(stack_regs);
> +
> +       for (i = 0 ; i < reg_len ; i++) {
> +               if (bpf_reg == stack_regs[i])
> +                       return true;
> +       }
> +       return false;
>  }
>
> -static inline void emit_swap16(u8 r_dst __maybe_unused,
> -                              u8 r_src __maybe_unused,
> -                              struct jit_ctx *ctx __maybe_unused)
> +static inline void emit_a32_mov_i(const u8 dst, const u32 val,
> +                                 bool dstk, struct jit_ctx *ctx)
>  {
> -#ifdef __LITTLE_ENDIAN
> -       emit(ARM_REV16(r_dst, r_src), ctx);
> -#endif
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +
> +       if (dstk) {
> +               emit_mov_i(tmp[1], val, ctx);
> +               emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(dst)), ctx);
> +       } else {
> +               emit_mov_i(dst, val, ctx);
> +       }
>  }
>
> -#endif /* __LINUX_ARM_ARCH__ < 6 */
> +/* Sign extended move */
> +static inline void emit_a32_mov_i64(const bool is64, const u8 dst[],
> +                                 const u32 val, bool dstk,
> +                                 struct jit_ctx *ctx) {
> +       u32 hi = 0;
>
> +       if (is64 && (val & (1<<31)))
> +               hi = (u32)~0;
> +       emit_a32_mov_i(dst_lo, val, dstk, ctx);
> +       emit_a32_mov_i(dst_hi, hi, dstk, ctx);
> +}
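
The effect of the sign-extended move can be sketched in plain C as below. This is only a model of the resulting 64-bit value, not of the emitted instructions; mov_i64_sketch() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the value that ends up in the {hi, lo} register pair. */
static uint64_t mov_i64_sketch(bool is64, uint32_t val)
{
	uint32_t hi = 0;

	/* Only a 64-bit move replicates bit 31 into the high word. */
	if (is64 && (val & (1u << 31)))
		hi = ~0u;
	return ((uint64_t)hi << 32) | val;
}
```

So a 64-bit move of 0x80000000 gives 0xffffffff80000000, while the 32-bit form leaves the high word zero.
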
>
> -/* Compute the immediate value for a PC-relative branch. */
> -static inline u32 b_imm(unsigned tgt, struct jit_ctx *ctx)
> -{
> -       u32 imm;
> +static inline void emit_a32_add_r(const u8 dst, const u8 src,
> +                             const bool is64, const bool hi,
> +                             struct jit_ctx *ctx) {
> +       /* 64 bit :
> +        *      adds dst_lo, dst_lo, src_lo
> +        *      adc dst_hi, dst_hi, src_hi
> +        * 32 bit :
> +        *      add dst_lo, dst_lo, src_lo
> +        */
> +       if (!hi && is64)
> +               emit(ARM_ADDS_R(dst, dst, src), ctx);
> +       else if (hi && is64)
> +               emit(ARM_ADC_R(dst, dst, src), ctx);
> +       else
> +               emit(ARM_ADD_R(dst, dst, src), ctx);
> +}
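
For readers less familiar with the ADDS/ADC pairing, this is roughly what the two emitted instructions compute when a 64-bit eBPF register is split across two 32-bit halves. It is a userspace sketch with an invented name; the carry variable models the ARM carry flag.

```c
#include <stdint.h>

static uint64_t add64_sketch(uint64_t dst, uint64_t src)
{
	uint32_t dst_lo = (uint32_t)dst, dst_hi = (uint32_t)(dst >> 32);
	uint32_t src_lo = (uint32_t)src, src_hi = (uint32_t)(src >> 32);
	uint32_t carry;

	dst_lo += src_lo;		/* adds dst_lo, dst_lo, src_lo */
	carry = dst_lo < src_lo;	/* carry flag: set on unsigned wrap */
	dst_hi += src_hi + carry;	/* adc  dst_hi, dst_hi, src_hi */
	return ((uint64_t)dst_hi << 32) | dst_lo;
}
```

The point is that the low-word add has to be the flag-setting variant, and nothing that touches the flags may be emitted between the two halves.
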
>
> -       if (ctx->target == NULL)
> -               return 0;
> -       /*
> -        * BPF allows only forward jumps and the offset of the target is
> -        * still the one computed during the first pass.
> +static inline void emit_a32_sub_r(const u8 dst, const u8 src,
> +                                 const bool is64, const bool hi,
> +                                 struct jit_ctx *ctx) {
> +       /* 64 bit :
> +        *      subs dst_lo, dst_lo, src_lo
> +        *      sbc dst_hi, dst_hi, src_hi
> +        * 32 bit :
> +        *      sub dst_lo, dst_lo, src_lo
>          */
> -       imm  = ctx->offsets[tgt] + ctx->prologue_bytes - (ctx->idx * 4 + 8);
> +       if (!hi && is64)
> +               emit(ARM_SUBS_R(dst, dst, src), ctx);
> +       else if (hi && is64)
> +               emit(ARM_SBC_R(dst, dst, src), ctx);
> +       else
> +               emit(ARM_SUB_R(dst, dst, src), ctx);
> +}
>
> -       return imm >> 2;
> +static inline void emit_alu_r(const u8 dst, const u8 src, const bool is64,
> +                             const bool hi, const u8 op, struct jit_ctx *ctx){
> +       switch (BPF_OP(op)) {
> +       /* dst = dst + src */
> +       case BPF_ADD:
> +               emit_a32_add_r(dst, src, is64, hi, ctx);
> +               break;
> +       /* dst = dst - src */
> +       case BPF_SUB:
> +               emit_a32_sub_r(dst, src, is64, hi, ctx);
> +               break;
> +       /* dst = dst | src */
> +       case BPF_OR:
> +               emit(ARM_ORR_R(dst, dst, src), ctx);
> +               break;
> +       /* dst = dst & src */
> +       case BPF_AND:
> +               emit(ARM_AND_R(dst, dst, src), ctx);
> +               break;
> +       /* dst = dst ^ src */
> +       case BPF_XOR:
> +               emit(ARM_EOR_R(dst, dst, src), ctx);
> +               break;
> +       /* dst = dst * src */
> +       case BPF_MUL:
> +               emit(ARM_MUL(dst, dst, src), ctx);
> +               break;
> +       /* dst = dst << src */
> +       case BPF_LSH:
> +               emit(ARM_LSL_R(dst, dst, src), ctx);
> +               break;
> +       /* dst = dst >> src */
> +       case BPF_RSH:
> +               emit(ARM_LSR_R(dst, dst, src), ctx);
> +               break;
> +       /* dst = dst >> src (signed)*/
> +       case BPF_ARSH:
> +               emit(ARM_MOV_SR(dst, dst, SRTYPE_ASR, src), ctx);
> +               break;
> +       }
>  }
>
> -#define OP_IMM3(op, r1, r2, imm_val, ctx)                              \
> -       do {                                                            \
> -               imm12 = imm8m(imm_val);                                 \
> -               if (imm12 < 0) {                                        \
> -                       emit_mov_i_no8m(r_scratch, imm_val, ctx);       \
> -                       emit(op ## _R((r1), (r2), r_scratch), ctx);     \
> -               } else {                                                \
> -                       emit(op ## _I((r1), (r2), imm12), ctx);         \
> -               }                                                       \
> -       } while (0)
> -
> -static inline void emit_err_ret(u8 cond, struct jit_ctx *ctx)
> -{
> -       if (ctx->ret0_fp_idx >= 0) {
> -               _emit(cond, ARM_B(b_imm(ctx->ret0_fp_idx, ctx)), ctx);
> -               /* NOP to keep the size constant between passes */
> -               emit(ARM_MOV_R(ARM_R0, ARM_R0), ctx);
> +/* ALU operation (32 bit)
> + * dst = dst (op) src
> + */
> +static inline void emit_a32_alu_r(const u8 dst, const u8 src,
> +                                 bool dstk, bool sstk,
> +                                 struct jit_ctx *ctx, const bool is64,
> +                                 const bool hi, const u8 op) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       u8 rn = sstk ? tmp[1] : src;
> +
> +       if (sstk)
> +               emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src)), ctx);
> +
> +       /* ALU operation */
> +       if (dstk) {
> +               emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
> +               emit_alu_r(tmp[0], rn, is64, hi, op, ctx);
> +               emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
>         } else {
> -               _emit(cond, ARM_MOV_I(ARM_R0, 0), ctx);
> -               _emit(cond, ARM_B(b_imm(ctx->skf->len, ctx)), ctx);
> +               emit_alu_r(dst, rn, is64, hi, op, ctx);
>         }
>  }
>
> -static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
> -{
> -#if __LINUX_ARM_ARCH__ < 5
> -       emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
> +/* ALU operation (64 bit) */
> +static inline void emit_a32_alu_r64(const bool is64, const u8 dst[],
> +                                 const u8 src[], bool dstk,
> +                                 bool sstk, struct jit_ctx *ctx,
> +                                 const u8 op) {
> +       emit_a32_alu_r(dst_lo, src_lo, dstk, sstk, ctx, is64, false, op);
> +       if (is64)
> +               emit_a32_alu_r(dst_hi, src_hi, dstk, sstk, ctx, is64, true, op);
> +       else
> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
> +}
>
> -       if (elf_hwcap & HWCAP_THUMB)
> -               emit(ARM_BX(tgt_reg), ctx);
> +/* dst = src (4 bytes) */
> +static inline void emit_a32_mov_r(const u8 dst, const u8 src,
> +                                 bool dstk, bool sstk,
> +                                 struct jit_ctx *ctx) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       u8 rt = sstk ? tmp[0] : src;
> +
> +       if (sstk)
> +               emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(src)), ctx);
> +       if (dstk)
> +               emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst)), ctx);
>         else
> -               emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
> -#else
> -       emit(ARM_BLX_R(tgt_reg), ctx);
> -#endif
> +               emit(ARM_MOV_R(dst, rt), ctx);
>  }
>
> -static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx,
> -                               int bpf_op)
> -{
> -#if __LINUX_ARM_ARCH__ == 7
> -       if (elf_hwcap & HWCAP_IDIVA) {
> -               if (bpf_op == BPF_DIV)
> -                       emit(ARM_UDIV(rd, rm, rn), ctx);
> -               else {
> -                       emit(ARM_UDIV(ARM_R3, rm, rn), ctx);
> -                       emit(ARM_MLS(rd, rn, ARM_R3, rm), ctx);
> -               }
> -               return;
> +/* dst = src */
> +static inline void emit_a32_mov_r64(const bool is64, const u8 dst[],
> +                                 const u8 src[], bool dstk,
> +                                 bool sstk, struct jit_ctx *ctx) {
> +       emit_a32_mov_r(dst_lo, src_lo, dstk, sstk, ctx);
> +       if (is64) {
> +               /* complete 8 byte move */
> +               emit_a32_mov_r(dst_hi, src_hi, dstk, sstk, ctx);
> +       } else {
> +               /* Zero out high 4 bytes */
> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>         }
> -#endif
> +}
>
> -       /*
> -        * For BPF_ALU | BPF_DIV | BPF_K instructions, rm is ARM_R4
> -        * (r_A) and rn is ARM_R0 (r_scratch) so load rn first into
> -        * ARM_R1 to avoid accidentally overwriting ARM_R0 with rm
> -        * before using it as a source for ARM_R1.
> -        *
> -        * For BPF_ALU | BPF_DIV | BPF_X rm is ARM_R4 (r_A) and rn is
> -        * ARM_R5 (r_X) so there is no particular register overlap
> -        * issues.
> -        */
> -       if (rn != ARM_R1)
> -               emit(ARM_MOV_R(ARM_R1, rn), ctx);
> -       if (rm != ARM_R0)
> -               emit(ARM_MOV_R(ARM_R0, rm), ctx);
> +/* Shift and negate operations with an immediate operand */
> +static inline void emit_a32_alu_i(const u8 dst, const u32 val, bool dstk,
> +                               struct jit_ctx *ctx, const u8 op) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       u8 rd = dstk ? tmp[0] : dst;
> +
> +       if (dstk)
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
> +
> +       /* Do shift operation */
> +       switch (op) {
> +       case BPF_LSH:
> +               emit(ARM_LSL_I(rd, rd, val), ctx);
> +               break;
> +       case BPF_RSH:
> +               emit(ARM_LSR_I(rd, rd, val), ctx);
> +               break;
> +       case BPF_NEG:
> +               emit(ARM_RSB_I(rd, rd, val), ctx);
> +               break;
> +       }
> +
> +       if (dstk)
> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
> +}
> +
> +/* dst = -dst (64 bit) */
> +static inline void emit_a32_neg64(const u8 dst[], bool dstk,
> +                               struct jit_ctx *ctx){
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       u8 rd = dstk ? tmp[1] : dst[1];
> +       u8 rm = dstk ? tmp[0] : dst[0];
> +
> +       /* Setup Operand */
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do Negate Operation */
> +       emit(ARM_RSBS_I(rd, rd, 0), ctx);
> +       emit(ARM_RSC_I(rm, rm, 0), ctx);
> +
> +       if (dstk) {
> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +}
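
The RSBS/RSC pair above is a two-word negate with borrow propagation. A minimal C sketch of that arithmetic follows; it is not part of the patch, and the function name neg64_halves is hypothetical. It assumes the usual ARM convention that the carry flag after a subtraction means "no borrow".

```c
#include <stdint.h>

/* Hypothetical model: negate a 64-bit value stored as two 32-bit halves. */
static void neg64_halves(uint32_t *lo, uint32_t *hi)
{
	/* RSBS lo, lo, #0: lo = 0 - lo. C is set only when no borrow
	 * occurred, which for 0 - lo means lo was 0. */
	uint32_t carry = (*lo == 0);

	*lo = 0u - *lo;
	/* RSC hi, hi, #0: hi = 0 - hi - (1 - C) */
	*hi = 0u - *hi - (1u - carry);
}
```

For example, negating {hi = 0, lo = 1} yields 0xffffffff in both halves, which is -1 as a 64-bit value.
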
>
> +/* dst = dst << src */
> +static inline void emit_a32_lsh_r64(const u8 dst[], const u8 src[], bool dstk,
> +                                   bool sstk, struct jit_ctx *ctx) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +
> +       /* Setup Operands */
> +       u8 rt = sstk ? tmp2[1] : src_lo;
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +
> +       if (sstk)
> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do LSH operation */
> +       emit(ARM_SUB_I(ARM_IP, rt, 32), ctx);
> +       emit(ARM_RSB_I(tmp2[0], rt, 32), ctx);
> +       /* As we are using ARM_LR */
>         ctx->seen |= SEEN_CALL;
> -       emit_mov_i(ARM_R3, bpf_op == BPF_DIV ? (u32)jit_udiv : (u32)jit_mod,
> -                  ctx);
> -       emit_blx_r(ARM_R3, ctx);
> +       emit(ARM_MOV_SR(ARM_LR, rm, SRTYPE_ASL, rt), ctx);
> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rd, SRTYPE_ASL, ARM_IP), ctx);
> +       emit(ARM_ORR_SR(ARM_IP, ARM_LR, rd, SRTYPE_LSR, tmp2[0]), ctx);
> +       emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_ASL, rt), ctx);
> +
> +       if (dstk) {
> +               emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       } else {
> +               emit(ARM_MOV_R(rd, ARM_LR), ctx);
> +               emit(ARM_MOV_R(rm, ARM_IP), ctx);
> +       }
> +}
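
The LSH sequence above relies on how ARM handles register-specified shift amounts: only the low byte of the shift register is used, and an LSL or LSR by 32 or more produces 0. A C sketch of the same dataflow is given below to make that easier to check. It is a model written for this discussion, not code from the patch; arm_lsl, arm_lsr and lsh64_model are hypothetical names, and the model assumes a shift count in 0..63.

```c
#include <stdint.h>

/* Register-specified LSL: uses the low byte of rs, result is 0 for >= 32. */
static uint32_t arm_lsl(uint32_t v, uint32_t rs)
{
	uint32_t n = rs & 0xff;

	return n < 32 ? v << n : 0;
}

/* Register-specified LSR: uses the low byte of rs, result is 0 for >= 32. */
static uint32_t arm_lsr(uint32_t v, uint32_t rs)
{
	uint32_t n = rs & 0xff;

	return n < 32 ? v >> n : 0;
}

/* Hypothetical model of the emitted sequence: {hi,lo} << n, 0 <= n < 64. */
static uint64_t lsh64_model(uint32_t lo, uint32_t hi, uint32_t n)
{
	uint32_t ip = n - 32;          /* SUB ip, rt, #32  */
	uint32_t t  = 32 - n;          /* RSB tmp, rt, #32 */
	uint32_t lr = arm_lsl(hi, n);  /* MOV lr, rm, ASL rt */

	lr |= arm_lsl(lo, ip);         /* ORR lr, lr, rd, ASL ip  */
	ip = lr | arm_lsr(lo, t);      /* ORR ip, lr, rd, LSR tmp */
	lr = arm_lsl(lo, n);           /* MOV lr, rd, ASL rt */

	return ((uint64_t)ip << 32) | lr;
}
```

For n < 32 the "n - 32" term wraps to a large byte value and contributes 0; for n >= 32 the "32 - n" term does the same, so exactly one of the two cross terms is live in each range.
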
>
> -       if (rd != ARM_R0)
> -               emit(ARM_MOV_R(rd, ARM_R0), ctx);
> +/* dst = dst >> src (signed) */
> +static inline void emit_a32_arsh_r64(const u8 dst[], const u8 src[], bool dstk,
> +                                   bool sstk, struct jit_ctx *ctx) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       /* Setup Operands */
> +       u8 rt = sstk ? tmp2[1] : src_lo;
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +
> +       if (sstk)
> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do the ARSH operation */
> +       emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
> +       emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
> +       /* As we are using ARM_LR */
> +       ctx->seen |= SEEN_CALL;
> +       emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
> +       _emit(ARM_COND_MI, ARM_B(0), ctx);
> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASR, tmp2[0]), ctx);
> +       emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_ASR, rt), ctx);
> +       if (dstk) {
> +               emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       } else {
> +               emit(ARM_MOV_R(rd, ARM_LR), ctx);
> +               emit(ARM_MOV_R(rm, ARM_IP), ctx);
> +       }
>  }
>
> -static inline void update_on_xread(struct jit_ctx *ctx)
> +/* dst = dst >> src */
> +static inline void emit_a32_lsr_r64(const u8 dst[], const u8 src[], bool dstk,
> +                                    bool sstk, struct jit_ctx *ctx) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       /* Setup Operands */
> +       u8 rt = sstk ? tmp2[1] : src_lo;
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +
> +       if (sstk)
> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do LSR operation */
> +       emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
> +       emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
> +       /* As we are using ARM_LR */
> +       ctx->seen |= SEEN_CALL;
> +       emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_LSR, tmp2[0]), ctx);
> +       emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_LSR, rt), ctx);
> +       if (dstk) {
> +               emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       } else {
> +               emit(ARM_MOV_R(rd, ARM_LR), ctx);
> +               emit(ARM_MOV_R(rm, ARM_IP), ctx);
> +       }
> +}
> +
> +/* dst = dst << val */
> +static inline void emit_a32_lsh_i64(const u8 dst[], bool dstk,
> +                                    const u32 val, struct jit_ctx *ctx){
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       /* Setup operands */
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do LSH operation */
> +       if (val < 32) {
> +               emit(ARM_MOV_SI(tmp2[0], rm, SRTYPE_ASL, val), ctx);
> +               emit(ARM_ORR_SI(rm, tmp2[0], rd, SRTYPE_LSR, 32 - val), ctx);
> +               emit(ARM_MOV_SI(rd, rd, SRTYPE_ASL, val), ctx);
> +       } else {
> +               if (val == 32)
> +                       emit(ARM_MOV_R(rm, rd), ctx);
> +               else
> +                       emit(ARM_MOV_SI(rm, rd, SRTYPE_ASL, val - 32), ctx);
> +               emit(ARM_EOR_R(rd, rd, rd), ctx);
> +       }
> +
> +       if (dstk) {
> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +}
> +
> +/* dst = dst >> val */
> +static inline void emit_a32_lsr_i64(const u8 dst[], bool dstk,
> +                                   const u32 val, struct jit_ctx *ctx) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       /* Setup operands */
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do LSR operation */
> +       if (val < 32) {
> +               emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
> +               emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_LSR, val), ctx);
> +       } else if (val == 32) {
> +               emit(ARM_MOV_R(rd, rm), ctx);
> +               emit(ARM_MOV_I(rm, 0), ctx);
> +       } else {
> +               emit(ARM_MOV_SI(rd, rm, SRTYPE_LSR, val - 32), ctx);
> +               emit(ARM_MOV_I(rm, 0), ctx);
> +       }
> +
> +       if (dstk) {
> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +}
> +
> +/* dst = dst >> val (signed) */
> +static inline void emit_a32_arsh_i64(const u8 dst[], bool dstk,
> +                                    const u32 val, struct jit_ctx *ctx){
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +        /* Setup operands */
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +
> +       /* Do ARSH operation */
> +       if (val < 32) {
> +               emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
> +               emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, val), ctx);
> +       } else if (val == 32) {
> +               emit(ARM_MOV_R(rd, rm), ctx);
> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
> +       } else {
> +               emit(ARM_MOV_SI(rd, rm, SRTYPE_ASR, val - 32), ctx);
> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
> +       }
> +
> +       if (dstk) {
> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +}
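
The three immediate-shift helpers above share one structure: a case for val < 32, one for val == 32 and one for val > 32. The sketch below models the arithmetic-shift variant in portable C. It is illustrative only; asr32 and arsh64_imm_model are hypothetical names, and the model deliberately covers 0 < val < 64 and says nothing about val == 0.

```c
#include <stdint.h>

/* Portable 32-bit arithmetic shift right for 0 <= n <= 31. */
static uint32_t asr32(uint32_t v, uint32_t n)
{
	uint32_t r = v >> n;

	if (v & 0x80000000u)
		r |= ~(0xffffffffu >> n);  /* replicate the sign bit */
	return r;
}

/* Hypothetical model: {hi,lo} >> val (signed), for 0 < val < 64. */
static uint64_t arsh64_imm_model(uint32_t lo, uint32_t hi, uint32_t val)
{
	if (val < 32) {
		/* low word takes the bits shifted out of the high word */
		lo = (lo >> val) | (hi << (32 - val));
		hi = asr32(hi, val);
	} else if (val == 32) {
		lo = hi;
		hi = asr32(hi, 31);        /* all sign bits */
	} else {
		lo = asr32(hi, val - 32);
		hi = asr32(hi, 31);
	}
	return ((uint64_t)hi << 32) | lo;
}
```

The logical-shift helpers follow the same case split, with the high word filled with zeros instead of sign bits.
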
> +
> +static inline void emit_a32_mul_r64(const u8 dst[], const u8 src[], bool dstk,
> +                                   bool sstk, struct jit_ctx *ctx) {
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       /* Setup operands for multiplication */
> +       u8 rd = dstk ? tmp[1] : dst_lo;
> +       u8 rm = dstk ? tmp[0] : dst_hi;
> +       u8 rt = sstk ? tmp2[1] : src_lo;
> +       u8 rn = sstk ? tmp2[0] : src_hi;
> +
> +       if (dstk) {
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       }
> +       if (sstk) {
> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
> +               emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_hi)), ctx);
> +       }
> +
> +       /* Do Multiplication */
> +       emit(ARM_MUL(ARM_IP, rd, rn), ctx);
> +       emit(ARM_MUL(ARM_LR, rm, rt), ctx);
> +       /* As we are using ARM_LR */
> +       ctx->seen |= SEEN_CALL;
> +       emit(ARM_ADD_R(ARM_LR, ARM_IP, ARM_LR), ctx);
> +
> +       emit(ARM_UMULL(ARM_IP, rm, rd, rt), ctx);
> +       emit(ARM_ADD_R(rm, ARM_LR, rm), ctx);
> +       if (dstk) {
> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +       } else {
> +               emit(ARM_MOV_R(rd, ARM_IP), ctx);
> +       }
> +}
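
The multiplication above is the schoolbook decomposition of a 64x64 product truncated to 64 bits: the two cross products only affect the high word, and UMULL supplies the full 64-bit product of the low words. A small C sketch of that identity, with the hypothetical name mul64_model, is shown here as a reading aid rather than as the patch's implementation.

```c
#include <stdint.h>

/* Hypothetical model: low 64 bits of {dhi,dlo} * {shi,slo}. */
static uint64_t mul64_model(uint32_t dlo, uint32_t dhi,
			    uint32_t slo, uint32_t shi)
{
	/* MUL + MUL + ADD: cross terms, taken modulo 2^32 */
	uint32_t cross = dlo * shi + dhi * slo;
	/* UMULL: full 64-bit product of the two low words */
	uint64_t low = (uint64_t)dlo * slo;
	/* ADD: fold the cross terms into the high word */
	uint32_t hi = (uint32_t)(low >> 32) + cross;

	return ((uint64_t)hi << 32) | (uint32_t)low;
}
```

The dhi * shi term is dropped because it only contributes to bits 64 and above.
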
> +
> +/* *(size *)(dst + off) = src */
> +static inline void emit_str_r(const u8 dst, const u8 src, bool dstk,
> +                             const s32 off, struct jit_ctx *ctx, const u8 sz){
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       u8 rd = dstk ? tmp[1] : dst;
> +
> +       if (dstk)
> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
> +       if (off) {
> +               emit_a32_mov_i(tmp[0], off, false, ctx);
> +               emit(ARM_ADD_R(tmp[0], rd, tmp[0]), ctx);
> +               rd = tmp[0];
> +       }
> +       switch (sz) {
> +       case BPF_W:
> +               /* Store a Word */
> +               emit(ARM_STR_I(src, rd, 0), ctx);
> +               break;
> +       case BPF_H:
> +               /* Store a HalfWord */
> +               emit(ARM_STRH_I(src, rd, 0), ctx);
> +               break;
> +       case BPF_B:
> +               /* Store a Byte */
> +               emit(ARM_STRB_I(src, rd, 0), ctx);
> +               break;
> +       }
> +}
> +
> +/* dst = *(size*)(src + off) */
> +static inline void emit_ldx_r(const u8 dst, const u8 src, bool dstk,
> +                             const s32 off, struct jit_ctx *ctx, const u8 sz){
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       u8 rd = dstk ? tmp[1] : dst;
> +       u8 rm = src;
> +
> +       if (off) {
> +               emit_a32_mov_i(tmp[0], off, false, ctx);
> +               emit(ARM_ADD_R(tmp[0], tmp[0], src), ctx);
> +               rm = tmp[0];
> +       }
> +       switch (sz) {
> +       case BPF_W:
> +               /* Load a Word */
> +               emit(ARM_LDR_I(rd, rm, 0), ctx);
> +               break;
> +       case BPF_H:
> +               /* Load a HalfWord */
> +               emit(ARM_LDRH_I(rd, rm, 0), ctx);
> +               break;
> +       case BPF_B:
> +               /* Load a Byte */
> +               emit(ARM_LDRB_I(rd, rm, 0), ctx);
> +               break;
> +       }
> +       if (dstk)
> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
> +}
> +
> +/* Arithmetic Operation */
> +static inline void emit_ar_r(const u8 rd, const u8 rt, const u8 rm,
> +                            const u8 rn, struct jit_ctx *ctx, u8 op) {
> +       switch (op) {
> +       case BPF_JSET:
> +               ctx->seen |= SEEN_CALL;
> +               emit(ARM_AND_R(ARM_IP, rt, rn), ctx);
> +               emit(ARM_AND_R(ARM_LR, rd, rm), ctx);
> +               emit(ARM_ORRS_R(ARM_IP, ARM_LR, ARM_IP), ctx);
> +               break;
> +       case BPF_JEQ:
> +       case BPF_JNE:
> +       case BPF_JGT:
> +       case BPF_JGE:
> +               emit(ARM_CMP_R(rd, rm), ctx);
> +               _emit(ARM_COND_EQ, ARM_CMP_R(rt, rn), ctx);
> +               break;
> +       case BPF_JSGT:
> +               emit(ARM_CMP_R(rn, rt), ctx);
> +               emit(ARM_SBCS_R(ARM_IP, rm, rd), ctx);
> +               break;
> +       case BPF_JSGE:
> +               emit(ARM_CMP_R(rt, rn), ctx);
> +               emit(ARM_SBCS_R(ARM_IP, rd, rm), ctx);
> +               break;
> +       }
> +}
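
The comparison cases above build 64-bit compares from 32-bit flag-setting instructions. The sketch below models two of them in C, assuming rd/rt hold the high/low words of dst and rm/rn the high/low words of src (the call sites are not visible in this hunk, so that mapping is an assumption). The names ugt64_halves and sgt64_halves are hypothetical.

```c
#include <stdint.h>

/* Unsigned dst > src: CMP on the high words, CMPEQ on the low words. */
static int ugt64_halves(uint32_t dlo, uint32_t dhi,
			uint32_t slo, uint32_t shi)
{
	if (dhi != shi)              /* CMP rd, rm decides */
		return dhi > shi;
	return dlo > slo;            /* CMPEQ rt, rn decides */
}

/* Signed dst > src: CMP rn, rt then SBCS ip, rm, rd computes src - dst
 * across both words; the LT condition (N != V) then means src < dst. */
static int sgt64_halves(uint32_t dlo, uint32_t dhi,
			uint32_t slo, uint32_t shi)
{
	uint32_t borrow = slo < dlo;                /* CMP: C clear on borrow */
	uint32_t r = shi - dhi - borrow;            /* SBCS result */
	uint32_t n = r >> 31;                       /* N flag */
	uint32_t v = ((shi ^ dhi) & (shi ^ r)) >> 31; /* V flag */

	return n != v;
}
```

The signed variant only needs the flags of the high-word subtraction, which is why the SBCS result itself is written to a scratch register and discarded.
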
> +
> +static int out_offset = -1; /* initialized on the first pass of build_body() */
> +static int emit_bpf_tail_call(struct jit_ctx *ctx)
> +{
> +
> +       /* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
> +       const u8 *r2 = bpf2a32[BPF_REG_2];
> +       const u8 *r3 = bpf2a32[BPF_REG_3];
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       const u8 *tcc = bpf2a32[TCALL_CNT];
> +       const int idx0 = ctx->idx;
> +#define cur_offset (ctx->idx - idx0)
> +#define jmp_offset (out_offset - (cur_offset))
> +       u32 off, lo, hi;
> +
> +       /* if (index >= array->map.max_entries)
> +        *      goto out;
> +        */
> +       off = offsetof(struct bpf_array, map.max_entries);
> +       /* array->map.max_entries */
> +       emit_a32_mov_i(tmp[1], off, false, ctx);
> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
> +       emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
> +       /* index (64 bit) */
> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
> +       /* index >= array->map.max_entries */
> +       emit(ARM_CMP_R(tmp2[1], tmp[1]), ctx);
> +       _emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
> +
> +       /* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
> +        *      goto out;
> +        * tail_call_cnt++;
> +        */
> +       lo = (u32)MAX_TAIL_CALL_CNT;
> +       hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
> +       emit(ARM_LDR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
> +       emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
> +       emit(ARM_CMP_I(tmp[0], hi), ctx);
> +       _emit(ARM_COND_EQ, ARM_CMP_I(tmp[1], lo), ctx);
> +       _emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
> +       emit(ARM_ADDS_I(tmp[1], tmp[1], 1), ctx);
> +       emit(ARM_ADC_I(tmp[0], tmp[0], 0), ctx);
> +       emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
> +       emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
> +
> +       /* prog = array->ptrs[index]
> +        * if (prog == NULL)
> +        *      goto out;
> +        */
> +       off = offsetof(struct bpf_array, ptrs);
> +       emit_a32_mov_i(tmp[1], off, false, ctx);
> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
> +       emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
> +       emit(ARM_MOV_SI(tmp[0], tmp2[1], SRTYPE_ASL, 2), ctx);
> +       emit(ARM_LDR_R(tmp[1], tmp[1], tmp[0]), ctx);
> +       emit(ARM_CMP_I(tmp[1], 0), ctx);
> +       _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
> +
> +       /* goto *(prog->bpf_func + prologue_size); */
> +       off = offsetof(struct bpf_prog, bpf_func);
> +       emit_a32_mov_i(tmp2[1], off, false, ctx);
> +       emit(ARM_LDR_R(tmp[1], tmp[1], tmp2[1]), ctx);
> +       emit(ARM_ADD_I(tmp[1], tmp[1], ctx->prologue_bytes), ctx);
> +       emit(ARM_BX(tmp[1]), ctx);
> +
> +       /* out: */
> +       if (out_offset == -1)
> +               out_offset = cur_offset;
> +       if (cur_offset != out_offset) {
> +               pr_err_once("tail_call out_offset = %d, expected %d!\n",
> +                           cur_offset, out_offset);
> +               return -1;
> +       }
> +       return 0;
> +#undef cur_offset
> +#undef jmp_offset
> +}
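
The control flow that emit_bpf_tail_call() encodes is spelled out in its block comments; the C sketch below restates it as ordinary code. The struct layouts, the array size and the value 32 for MAX_TAIL_CALL_CNT are simplifications assumed for illustration and are not the kernel definitions. Note that the emitted code loads only the low word of the index (r3[1]) for the bounds check, which the 32-bit index parameter mirrors.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TAIL_CALL_CNT 32            /* assumed value, for illustration */

struct prog { int id; };                /* stand-in for struct bpf_prog */
struct prog_array {                     /* stand-in for struct bpf_array */
	uint32_t max_entries;
	struct prog *ptrs[4];
};

/* Returns the program to jump to, or NULL for the "goto out" paths. */
static struct prog *tail_call_target(struct prog_array *array,
				     uint32_t index, uint64_t *tail_call_cnt)
{
	if (index >= array->max_entries)
		return NULL;                    /* index out of range */
	if (*tail_call_cnt > MAX_TAIL_CALL_CNT)
		return NULL;                    /* too many chained calls */
	(*tail_call_cnt)++;
	return array->ptrs[index];              /* NULL slot also means "out" */
}
```

In the JITed code the non-NULL case ends with a jump to prog->bpf_func plus the prologue size, so the callee's prologue is skipped and the tail-call counter on the stack is preserved.
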
> +
> +/* 0xabcd => 0xcdab */
> +static inline void emit_rev16(const u8 rd, const u8 rn, struct jit_ctx *ctx)
>  {
> -       if (!(ctx->seen & SEEN_X))
> -               ctx->flags |= FLAG_NEED_X_RESET;
> +#if __LINUX_ARM_ARCH__ < 6
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +
> +       emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
> +       emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 8), ctx);
> +       emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
> +       emit(ARM_ORR_SI(rd, tmp2[0], tmp2[1], SRTYPE_LSL, 8), ctx);
> +#else /* ARMv6+ */
> +       emit(ARM_REV16(rd, rn), ctx);
> +#endif
> +}
>
> -       ctx->seen |= SEEN_X;
> +/* 0xabcdefgh => 0xghefcdab */
> +static inline void emit_rev32(const u8 rd, const u8 rn, struct jit_ctx *ctx)
> +{
> +#if __LINUX_ARM_ARCH__ < 6
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +
> +       emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
> +       emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 24), ctx);
> +       emit(ARM_ORR_SI(ARM_IP, tmp2[0], tmp2[1], SRTYPE_LSL, 24), ctx);
> +
> +       emit(ARM_MOV_SI(tmp2[1], rn, SRTYPE_LSR, 8), ctx);
> +       emit(ARM_AND_I(tmp2[1], tmp2[1], 0xff), ctx);
> +       emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 16), ctx);
> +       emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
> +       emit(ARM_MOV_SI(tmp2[0], tmp2[0], SRTYPE_LSL, 8), ctx);
> +       emit(ARM_ORR_SI(tmp2[0], tmp2[0], tmp2[1], SRTYPE_LSL, 16), ctx);
> +       emit(ARM_ORR_R(rd, ARM_IP, tmp2[0]), ctx);
> +
> +#else /* ARMv6+ */
> +       emit(ARM_REV(rd, rn), ctx);
> +#endif
>  }
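
For the pre-ARMv6 paths above, the byte swaps are assembled from shifts, masks and ORs. The following C sketch mirrors those fallback sequences step by step; rev16_model and rev32_model are hypothetical names used only here. As in the emitted fallback, the 16-bit variant leaves the upper halfword of the result zero.

```c
#include <stdint.h>

/* Model of the pre-ARMv6 emit_rev16() path: swap the two low bytes. */
static uint32_t rev16_model(uint32_t x)
{
	uint32_t b0 = x & 0xff;          /* AND tmp2[1], rn, #0xff */
	uint32_t b1 = (x >> 8) & 0xff;   /* MOV ..LSR #8; AND ..#0xff */

	return b1 | (b0 << 8);           /* ORR rd, b1, b0, LSL #8 */
}

/* Model of the pre-ARMv6 emit_rev32() path: reverse all four bytes. */
static uint32_t rev32_model(uint32_t x)
{
	/* outer bytes: byte 0 goes to the top, byte 3 to the bottom */
	uint32_t outer = (x >> 24) | ((x & 0xff) << 24);
	uint32_t b1 = (x >> 8) & 0xff;
	uint32_t b2 = (x >> 16) & 0xff;
	/* inner bytes trade places */
	uint32_t inner = (b2 << 8) | (b1 << 16);

	return outer | inner;
}
```
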
>
> -static int build_body(struct jit_ctx *ctx)
> +static void build_prologue(struct jit_ctx *ctx)
>  {
> -       void *load_func[] = {jit_get_skb_b, jit_get_skb_h, jit_get_skb_w};
> -       const struct bpf_prog *prog = ctx->skf;
> -       const struct sock_filter *inst;
> -       unsigned i, load_order, off, condt;
> -       int imm12;
> -       u32 k;
> +       const u8 r0 = bpf2a32[BPF_REG_0][1];
> +       const u8 r2 = bpf2a32[BPF_REG_1][1];
> +       const u8 r3 = bpf2a32[BPF_REG_1][0];
> +       const u8 r4 = bpf2a32[BPF_REG_6][1];
> +       const u8 r5 = bpf2a32[BPF_REG_6][0];
> +       const u8 r6 = bpf2a32[TMP_REG_1][1];
> +       const u8 r7 = bpf2a32[TMP_REG_1][0];
> +       const u8 r8 = bpf2a32[TMP_REG_2][1];
> +       const u8 r10 = bpf2a32[TMP_REG_2][0];
> +       const u8 fplo = bpf2a32[BPF_REG_FP][1];
> +       const u8 fphi = bpf2a32[BPF_REG_FP][0];
> +       const u8 sp = ARM_SP;
> +       const u8 *tcc = bpf2a32[TCALL_CNT];
> +
> +       u16 reg_set = 0;
>
> -       for (i = 0; i < prog->len; i++) {
> -               u16 code;
> +       /*
> +        * eBPF prog stack layout
> +        *
> +        *                         high
> +        * original ARM_SP =>     +-----+ eBPF prologue
> +        *                        |FP/LR|
> +        * current ARM_FP =>      +-----+
> +        *                        | ... | callee saved registers
> +        * eBPF fp register =>    +-----+ <= (BPF_FP)
> +        *                        | ... | eBPF JIT scratch space
> +        *                        |     | eBPF prog stack
> +        *                        +-----+
> +        *                        |RSVD | JIT scratchpad
> +        * current ARM_SP =>      +-----+ <= (BPF_FP - STACK_SIZE)
> +        *                        |     |
> +        *                        | ... | Function call stack
> +        *                        |     |
> +        *                        +-----+
> +        *                          low
> +        */
>
> -               inst = &(prog->insns[i]);
> -               /* K as an immediate value operand */
> -               k = inst->k;
> -               code = bpf_anc_helper(inst);
> +       /* Save callee saved registers. */
> +       reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
> +#ifdef CONFIG_FRAME_POINTER
> +       reg_set |= (1<<ARM_FP) | (1<<ARM_IP) | (1<<ARM_LR) | (1<<ARM_PC);
> +       emit(ARM_MOV_R(ARM_IP, sp), ctx);
> +       emit(ARM_PUSH(reg_set), ctx);
> +       emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
> +#else
> +       /* Check if call instruction exists in BPF body */
> +       if (ctx->seen & SEEN_CALL)
> +               reg_set |= (1<<ARM_LR);
> +       emit(ARM_PUSH(reg_set), ctx);
> +#endif
> +       /* Save frame pointer for later */
> +       emit(ARM_SUB_I(ARM_IP, sp, SCRATCH_SIZE), ctx);
> +
> +       /* Set up function call stack */
> +       emit(ARM_SUB_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
> +
> +       /* Set up BPF prog stack base register */
> +       emit_a32_mov_r(fplo, ARM_IP, true, false, ctx);
> +       emit_a32_mov_i(fphi, 0, true, ctx);
> +
> +       /* mov r4, 0 */
> +       emit(ARM_MOV_I(r4, 0), ctx);
> +       /* MOV bpf_ctx pointer to BPF_R1 */
> +       emit(ARM_MOV_R(r3, r4), ctx);
> +       emit(ARM_MOV_R(r2, r0), ctx);
> +       /* Initialize Tail Count */
> +       emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[0])), ctx);
> +       emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[1])), ctx);
> +       /* end of prologue */
> +}
>
> -               /* compute offsets only in the fake pass */
> -               if (ctx->target == NULL)
> -                       ctx->offsets[i] = ctx->idx * 4;
> +static void build_epilogue(struct jit_ctx *ctx)
> +{
> +       const u8 r4 = bpf2a32[BPF_REG_6][1];
> +       const u8 r5 = bpf2a32[BPF_REG_6][0];
> +       const u8 r6 = bpf2a32[TMP_REG_1][1];
> +       const u8 r7 = bpf2a32[TMP_REG_1][0];
> +       const u8 r8 = bpf2a32[TMP_REG_2][1];
> +       const u8 r10 = bpf2a32[TMP_REG_2][0];
> +       u16 reg_set = 0;
> +
> +       /* unwind function call stack */
> +       emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
> +
> +       /* restore callee saved registers. */
> +       reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
> +#ifdef CONFIG_FRAME_POINTER
> +       /* the first instruction of the prologue was: mov ip, sp */
> +       reg_set |= (1<<ARM_FP) | (1<<ARM_SP) | (1<<ARM_PC);
> +       emit(ARM_LDM(ARM_SP, reg_set), ctx);
> +#else
> +       if (ctx->seen & SEEN_CALL)
> +               reg_set |= (1<<ARM_PC);
> +       /* Restore callee saved registers. */
> +       emit(ARM_POP(reg_set), ctx);
> +       /* Return to the caller */
> +       if (!(ctx->seen & SEEN_CALL))
> +               emit(ARM_BX(ARM_LR), ctx);
> +#endif
> +}
>
> -               switch (code) {
> -               case BPF_LD | BPF_IMM:
> -                       emit_mov_i(r_A, k, ctx);
> +/*
> + * Convert an eBPF instruction to a native instruction, i.e.
> + * JIT an eBPF instruction.
> + * Returns:
> + *     0  - Successfully JITed an 8-byte eBPF instruction
> + *     >0 - Successfully JITed a 16-byte eBPF instruction
> + *     <0 - Failed to JIT.
> + */
> +static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
> +{
> +       const u8 code = insn->code;
> +       const u8 *dst = bpf2a32[insn->dst_reg];
> +       const u8 *src = bpf2a32[insn->src_reg];
> +       const u8 *tmp = bpf2a32[TMP_REG_1];
> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
> +       const s16 off = insn->off;
> +       const s32 imm = insn->imm;
> +       const int i = insn - ctx->prog->insnsi;
> +       const bool is64 = BPF_CLASS(code) == BPF_ALU64;
> +       const bool dstk = is_on_stack(insn->dst_reg);
> +       const bool sstk = is_on_stack(insn->src_reg);
> +       u8 rd, rt, rm, rn;
> +       s32 jmp_offset;
> +
> +#define check_imm(bits, imm) do {                              \
> +       if ((((imm) > 0) && ((imm) >> (bits))) ||               \
> +           (((imm) < 0) && (~(imm) >> (bits)))) {              \
> +               pr_info("[%2d] imm=%d(0x%x) out of range\n",    \
> +                       i, imm, imm);                           \
> +               return -EINVAL;                                 \
> +       }                                                       \
> +} while (0)
> +#define check_imm24(imm) check_imm(24, imm)
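
The check_imm() macro above can be read as a plain predicate. The sketch below, with the hypothetical name imm_fits, reproduces its two conditions as a function so the accepted range is easy to test. As written, the check accepts values from -2^bits to 2^bits - 1 inclusive.

```c
#include <stdint.h>

/* Hypothetical function form of check_imm(): returns 1 when imm is
 * accepted, 0 when the macro would print the message and return -EINVAL. */
static int imm_fits(int bits, int32_t imm)
{
	if (imm > 0 && (imm >> bits))
		return 0;       /* positive value has bits above 'bits' */
	if (imm < 0 && (~imm >> bits))
		return 0;       /* negative value is below -2^bits */
	return 1;
}
```

With bits = 24, for instance, 0xffffff and -0x1000000 are accepted, while 0x1000000 and -0x1000001 are rejected.
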
> +
> +       switch (code) {
> +       /* ALU operations */
> +
> +       /* dst = src */
> +       case BPF_ALU | BPF_MOV | BPF_K:
> +       case BPF_ALU | BPF_MOV | BPF_X:
> +       case BPF_ALU64 | BPF_MOV | BPF_K:
> +       case BPF_ALU64 | BPF_MOV | BPF_X:
> +               switch (BPF_SRC(code)) {
> +               case BPF_X:
> +                       emit_a32_mov_r64(is64, dst, src, dstk, sstk, ctx);
>                         break;
> -               case BPF_LD | BPF_W | BPF_LEN:
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
> -                       emit(ARM_LDR_I(r_A, r_skb,
> -                                      offsetof(struct sk_buff, len)), ctx);
> +               case BPF_K:
> +                       /* Sign-extend immediate value to destination reg */
> +                       emit_a32_mov_i64(is64, dst, imm, dstk, ctx);
>                         break;
> -               case BPF_LD | BPF_MEM:
> -                       /* A = scratch[k] */
> -                       ctx->seen |= SEEN_MEM_WORD(k);
> -                       emit(ARM_LDR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
> +               }
> +               break;
> +       /* dst = dst + src/imm */
> +       /* dst = dst - src/imm */
> +       /* dst = dst | src/imm */
> +       /* dst = dst & src/imm */
> +       /* dst = dst ^ src/imm */
> +       /* dst = dst * src/imm */
> +       /* dst = dst << src */
> +       /* dst = dst >> src */
> +       case BPF_ALU | BPF_ADD | BPF_K:
> +       case BPF_ALU | BPF_ADD | BPF_X:
> +       case BPF_ALU | BPF_SUB | BPF_K:
> +       case BPF_ALU | BPF_SUB | BPF_X:
> +       case BPF_ALU | BPF_OR | BPF_K:
> +       case BPF_ALU | BPF_OR | BPF_X:
> +       case BPF_ALU | BPF_AND | BPF_K:
> +       case BPF_ALU | BPF_AND | BPF_X:
> +       case BPF_ALU | BPF_XOR | BPF_K:
> +       case BPF_ALU | BPF_XOR | BPF_X:
> +       case BPF_ALU | BPF_MUL | BPF_K:
> +       case BPF_ALU | BPF_MUL | BPF_X:
> +       case BPF_ALU | BPF_LSH | BPF_X:
> +       case BPF_ALU | BPF_RSH | BPF_X:
> +       case BPF_ALU | BPF_ARSH | BPF_K:
> +       case BPF_ALU | BPF_ARSH | BPF_X:
> +       case BPF_ALU64 | BPF_ADD | BPF_K:
> +       case BPF_ALU64 | BPF_ADD | BPF_X:
> +       case BPF_ALU64 | BPF_SUB | BPF_K:
> +       case BPF_ALU64 | BPF_SUB | BPF_X:
> +       case BPF_ALU64 | BPF_OR | BPF_K:
> +       case BPF_ALU64 | BPF_OR | BPF_X:
> +       case BPF_ALU64 | BPF_AND | BPF_K:
> +       case BPF_ALU64 | BPF_AND | BPF_X:
> +       case BPF_ALU64 | BPF_XOR | BPF_K:
> +       case BPF_ALU64 | BPF_XOR | BPF_X:
> +               switch (BPF_SRC(code)) {
> +               case BPF_X:
> +                       emit_a32_alu_r64(is64, dst, src, dstk, sstk,
> +                                        ctx, BPF_OP(code));
>                         break;
> -               case BPF_LD | BPF_W | BPF_ABS:
> -                       load_order = 2;
> -                       goto load;
> -               case BPF_LD | BPF_H | BPF_ABS:
> -                       load_order = 1;
> -                       goto load;
> -               case BPF_LD | BPF_B | BPF_ABS:
> -                       load_order = 0;
> -load:
> -                       emit_mov_i(r_off, k, ctx);
> -load_common:
> -                       ctx->seen |= SEEN_DATA | SEEN_CALL;
> -
> -                       if (load_order > 0) {
> -                               emit(ARM_SUB_I(r_scratch, r_skb_hl,
> -                                              1 << load_order), ctx);
> -                               emit(ARM_CMP_R(r_scratch, r_off), ctx);
> -                               condt = ARM_COND_GE;
> -                       } else {
> -                               emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
> -                               condt = ARM_COND_HI;
> -                       }
> -
> -                       /*
> -                        * test for negative offset, only if we are
> -                        * currently scheduled to take the fast
> -                        * path. this will update the flags so that
> -                        * the slowpath instruction are ignored if the
> -                        * offset is negative.
> -                        *
> -                        * for loard_order == 0 the HI condition will
> -                        * make loads at offset 0 take the slow path too.
> +               case BPF_K:
> +                       /* Move immediate value to the temporary register
> +                        * and then do the ALU operation on the temporary
> +                        * register as this will sign-extend the immediate
> +                        * value into temporary reg and then it would be
> +                        * safe to do the operation on it.
>                          */
> -                       _emit(condt, ARM_CMP_I(r_off, 0), ctx);
> -
> -                       _emit(condt, ARM_ADD_R(r_scratch, r_off, r_skb_data),
> -                             ctx);
> -
> -                       if (load_order == 0)
> -                               _emit(condt, ARM_LDRB_I(r_A, r_scratch, 0),
> -                                     ctx);
> -                       else if (load_order == 1)
> -                               emit_load_be16(condt, r_A, r_scratch, ctx);
> -                       else if (load_order == 2)
> -                               emit_load_be32(condt, r_A, r_scratch, ctx);
> -
> -                       _emit(condt, ARM_B(b_imm(i + 1, ctx)), ctx);
> -
> -                       /* the slowpath */
> -                       emit_mov_i(ARM_R3, (u32)load_func[load_order], ctx);
> -                       emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
> -                       /* the offset is already in R1 */
> -                       emit_blx_r(ARM_R3, ctx);
> -                       /* check the result of skb_copy_bits */
> -                       emit(ARM_CMP_I(ARM_R1, 0), ctx);
> -                       emit_err_ret(ARM_COND_NE, ctx);
> -                       emit(ARM_MOV_R(r_A, ARM_R0), ctx);
> +                       emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
> +                       emit_a32_alu_r64(is64, dst, tmp2, dstk, false,
> +                                        ctx, BPF_OP(code));
>                         break;
> -               case BPF_LD | BPF_W | BPF_IND:
> -                       load_order = 2;
> -                       goto load_ind;
> -               case BPF_LD | BPF_H | BPF_IND:
> -                       load_order = 1;
> -                       goto load_ind;
> -               case BPF_LD | BPF_B | BPF_IND:
> -                       load_order = 0;
> -load_ind:
> -                       update_on_xread(ctx);
> -                       OP_IMM3(ARM_ADD, r_off, r_X, k, ctx);
> -                       goto load_common;
> -               case BPF_LDX | BPF_IMM:
> -                       ctx->seen |= SEEN_X;
> -                       emit_mov_i(r_X, k, ctx);
> +               }
> +               break;
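
The comment in the BPF_K case above relies on the immediate being sign-extended when it is moved into the temporary register pair. A minimal sketch of that idea in plain C follows. It is a model, not the JIT code: `struct reg64`, `mov_i64()` and `reg64_value()` are hypothetical names, and the rule that the high word is zero for 32-bit operations is an assumption, since the body of emit_a32_mov_i64() is not part of this hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit eBPF register held as two 32-bit words. */
struct reg64 {
	uint32_t hi;
	uint32_t lo;
};

/*
 * Model of loading a 32-bit immediate into a register pair.
 * Assumption: for a 64-bit operation the high word is the sign
 * extension of imm; for a 32-bit operation it is zero.
 */
static struct reg64 mov_i64(bool is64, int32_t imm)
{
	struct reg64 r;

	r.lo = (uint32_t)imm;
	r.hi = (is64 && imm < 0) ? 0xffffffffu : 0;
	return r;
}

/* Reassemble the pair into one value so results are easy to compare. */
static uint64_t reg64_value(struct reg64 r)
{
	return ((uint64_t)r.hi << 32) | r.lo;
}
```

With this model, an immediate of -1 becomes all ones in both words for a 64-bit operation, but only in the low word for a 32-bit one.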
> +       /* dst = dst / src(imm) */
> +       /* dst = dst % src(imm) */
> +       case BPF_ALU | BPF_DIV | BPF_K:
> +       case BPF_ALU | BPF_DIV | BPF_X:
> +       case BPF_ALU | BPF_MOD | BPF_K:
> +       case BPF_ALU | BPF_MOD | BPF_X:
> +               rt = src_lo;
> +               rd = dstk ? tmp2[1] : dst_lo;
> +               if (dstk)
> +                       emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               switch (BPF_SRC(code)) {
> +               case BPF_X:
> +                       rt = sstk ? tmp2[0] : rt;
> +                       if (sstk)
> +                               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)),
> +                                    ctx);
>                         break;
> -               case BPF_LDX | BPF_W | BPF_LEN:
> -                       ctx->seen |= SEEN_X | SEEN_SKB;
> -                       emit(ARM_LDR_I(r_X, r_skb,
> -                                      offsetof(struct sk_buff, len)), ctx);
> +               case BPF_K:
> +                       rt = tmp2[0];
> +                       emit_a32_mov_i(rt, imm, false, ctx);
>                         break;
> -               case BPF_LDX | BPF_MEM:
> -                       ctx->seen |= SEEN_X | SEEN_MEM_WORD(k);
> -                       emit(ARM_LDR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
> +               }
> +               emit_udivmod(rd, rd, rt, ctx, BPF_OP(code));
> +               if (dstk)
> +                       emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
> +               break;
> +       case BPF_ALU64 | BPF_DIV | BPF_K:
> +       case BPF_ALU64 | BPF_DIV | BPF_X:
> +       case BPF_ALU64 | BPF_MOD | BPF_K:
> +       case BPF_ALU64 | BPF_MOD | BPF_X:
> +               goto notyet;
> +       /* dst = dst >> imm */
> +       /* dst = dst << imm */
> +       case BPF_ALU | BPF_RSH | BPF_K:
> +       case BPF_ALU | BPF_LSH | BPF_K:
> +               if (unlikely(imm > 31))
> +                       return -EINVAL;
> +               if (imm)
> +                       emit_a32_alu_i(dst_lo, imm, dstk, ctx, BPF_OP(code));
> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
> +               break;
> +       /* dst = dst << imm */
> +       case BPF_ALU64 | BPF_LSH | BPF_K:
> +               if (unlikely(imm > 63))
> +                       return -EINVAL;
> +               emit_a32_lsh_i64(dst, dstk, imm, ctx);
> +               break;
> +       /* dst = dst >> imm */
> +       case BPF_ALU64 | BPF_RSH | BPF_K:
> +               if (unlikely(imm > 63))
> +                       return -EINVAL;
> +               emit_a32_lsr_i64(dst, dstk, imm, ctx);
> +               break;
> +       /* dst = dst << src */
> +       case BPF_ALU64 | BPF_LSH | BPF_X:
> +               emit_a32_lsh_r64(dst, src, dstk, sstk, ctx);
> +               break;
> +       /* dst = dst >> src */
> +       case BPF_ALU64 | BPF_RSH | BPF_X:
> +               emit_a32_lsr_r64(dst, src, dstk, sstk, ctx);
> +               break;
> +       /* dst = dst >> src (signed) */
> +       case BPF_ALU64 | BPF_ARSH | BPF_X:
> +               emit_a32_arsh_r64(dst, src, dstk, sstk, ctx);
> +               break;
> +       /* dst = dst >> imm (signed) */
> +       case BPF_ALU64 | BPF_ARSH | BPF_K:
> +               if (unlikely(imm > 63))
> +                       return -EINVAL;
> +               emit_a32_arsh_i64(dst, dstk, imm, ctx);
> +               break;
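
The 64-bit shift cases above all work on a register split into a high and a low 32-bit word. The helper bodies are not in this hunk, so the following is only a sketch, in plain C, of how shifts by a constant can be composed from 32-bit operations. The function names are hypothetical, and the shift amount is assumed to be below 64, matching the `imm > 63` checks in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Logical left shift of the pair hi:lo by val bits (val < 64). */
static void lsh_i64(uint32_t *hi, uint32_t *lo, unsigned int val)
{
	assert(val < 64);
	if (val == 0)
		return;
	if (val < 32) {
		*hi = (*hi << val) | (*lo >> (32 - val));
		*lo <<= val;
	} else {
		*hi = *lo << (val - 32);
		*lo = 0;
	}
}

/* Logical right shift of the pair hi:lo by val bits (val < 64). */
static void rsh_i64(uint32_t *hi, uint32_t *lo, unsigned int val)
{
	assert(val < 64);
	if (val == 0)
		return;
	if (val < 32) {
		*lo = (*lo >> val) | (*hi << (32 - val));
		*hi >>= val;
	} else {
		*lo = *hi >> (val - 32);
		*hi = 0;
	}
}

/* Arithmetic right shift: vacated bits are filled with the sign bit. */
static void arsh_i64(uint32_t *hi, uint32_t *lo, unsigned int val)
{
	uint32_t sign = (*hi & 0x80000000u) ? 0xffffffffu : 0;

	assert(val < 64);
	if (val == 0)
		return;
	if (val < 32) {
		*lo = (*lo >> val) | (*hi << (32 - val));
		*hi = (*hi >> val) | (sign << (32 - val));
	} else {
		*lo = *hi >> (val - 32);
		if (val > 32)
			*lo |= sign << (64 - val);
		*hi = sign;
	}
}
```

The split at 32 is what matters here: below it, bits cross from one word into the other; at or above it, one word is moved wholesale and the other is cleared or filled with the sign.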
> +       /* dst = -dst */
> +       case BPF_ALU | BPF_NEG:
> +               emit_a32_alu_i(dst_lo, 0, dstk, ctx, BPF_OP(code));
> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
> +               break;
> +       /* dst = -dst (64 bit) */
> +       case BPF_ALU64 | BPF_NEG:
> +               emit_a32_neg64(dst, dstk, ctx);
> +               break;
> +       /* dst = dst * src/imm */
> +       case BPF_ALU64 | BPF_MUL | BPF_X:
> +       case BPF_ALU64 | BPF_MUL | BPF_K:
> +               switch (BPF_SRC(code)) {
> +               case BPF_X:
> +                       emit_a32_mul_r64(dst, src, dstk, sstk, ctx);
>                         break;
> -               case BPF_LDX | BPF_B | BPF_MSH:
> -                       /* x = ((*(frame + k)) & 0xf) << 2; */
> -                       ctx->seen |= SEEN_X | SEEN_DATA | SEEN_CALL;
> -                       /* the interpreter should deal with the negative K */
> -                       if ((int)k < 0)
> -                               return -1;
> -                       /* offset in r1: we might have to take the slow path */
> -                       emit_mov_i(r_off, k, ctx);
> -                       emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
> -
> -                       /* load in r0: common with the slowpath */
> -                       _emit(ARM_COND_HI, ARM_LDRB_R(ARM_R0, r_skb_data,
> -                                                     ARM_R1), ctx);
> -                       /*
> -                        * emit_mov_i() might generate one or two instructions,
> -                        * the same holds for emit_blx_r()
> +               case BPF_K:
> +                       /* Move the immediate value into the temporary
> +                        * register first and do the multiplication on
> +                        * that register. The move sign-extends the
> +                        * immediate, so the operation is then safe to
> +                        * perform.
>                          */
> -                       _emit(ARM_COND_HI, ARM_B(b_imm(i + 1, ctx) - 2), ctx);
> -
> -                       emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
> -                       /* r_off is r1 */
> -                       emit_mov_i(ARM_R3, (u32)jit_get_skb_b, ctx);
> -                       emit_blx_r(ARM_R3, ctx);
> -                       /* check the return value of skb_copy_bits */
> -                       emit(ARM_CMP_I(ARM_R1, 0), ctx);
> -                       emit_err_ret(ARM_COND_NE, ctx);
> -
> -                       emit(ARM_AND_I(r_X, ARM_R0, 0x00f), ctx);
> -                       emit(ARM_LSL_I(r_X, r_X, 2), ctx);
> -                       break;
> -               case BPF_ST:
> -                       ctx->seen |= SEEN_MEM_WORD(k);
> -                       emit(ARM_STR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
> -                       break;
> -               case BPF_STX:
> -                       update_on_xread(ctx);
> -                       ctx->seen |= SEEN_MEM_WORD(k);
> -                       emit(ARM_STR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
> -                       break;
> -               case BPF_ALU | BPF_ADD | BPF_K:
> -                       /* A += K */
> -                       OP_IMM3(ARM_ADD, r_A, r_A, k, ctx);
> -                       break;
> -               case BPF_ALU | BPF_ADD | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_ADD_R(r_A, r_A, r_X), ctx);
> -                       break;
> -               case BPF_ALU | BPF_SUB | BPF_K:
> -                       /* A -= K */
> -                       OP_IMM3(ARM_SUB, r_A, r_A, k, ctx);
> -                       break;
> -               case BPF_ALU | BPF_SUB | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_SUB_R(r_A, r_A, r_X), ctx);
> -                       break;
> -               case BPF_ALU | BPF_MUL | BPF_K:
> -                       /* A *= K */
> -                       emit_mov_i(r_scratch, k, ctx);
> -                       emit(ARM_MUL(r_A, r_A, r_scratch), ctx);
> -                       break;
> -               case BPF_ALU | BPF_MUL | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_MUL(r_A, r_A, r_X), ctx);
> -                       break;
> -               case BPF_ALU | BPF_DIV | BPF_K:
> -                       if (k == 1)
> -                               break;
> -                       emit_mov_i(r_scratch, k, ctx);
> -                       emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_DIV);
> -                       break;
> -               case BPF_ALU | BPF_DIV | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_CMP_I(r_X, 0), ctx);
> -                       emit_err_ret(ARM_COND_EQ, ctx);
> -                       emit_udivmod(r_A, r_A, r_X, ctx, BPF_DIV);
> -                       break;
> -               case BPF_ALU | BPF_MOD | BPF_K:
> -                       if (k == 1) {
> -                               emit_mov_i(r_A, 0, ctx);
> -                               break;
> -                       }
> -                       emit_mov_i(r_scratch, k, ctx);
> -                       emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_MOD);
> +                       emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
> +                       emit_a32_mul_r64(dst, tmp2, dstk, false, ctx);
>                         break;
> -               case BPF_ALU | BPF_MOD | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_CMP_I(r_X, 0), ctx);
> -                       emit_err_ret(ARM_COND_EQ, ctx);
> -                       emit_udivmod(r_A, r_A, r_X, ctx, BPF_MOD);
> -                       break;
> -               case BPF_ALU | BPF_OR | BPF_K:
> -                       /* A |= K */
> -                       OP_IMM3(ARM_ORR, r_A, r_A, k, ctx);
> +               }
> +               break;
> +       /* dst = htole(dst) */
> +       /* dst = htobe(dst) */
> +       case BPF_ALU | BPF_END | BPF_FROM_LE:
> +       case BPF_ALU | BPF_END | BPF_FROM_BE:
> +               rd = dstk ? tmp[0] : dst_hi;
> +               rt = dstk ? tmp[1] : dst_lo;
> +               if (dstk) {
> +                       emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +                       emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +               }
> +               if (BPF_SRC(code) == BPF_FROM_LE)
> +                       goto emit_bswap_uxt;
> +               switch (imm) {
> +               case 16:
> +                       emit_rev16(rt, rt, ctx);
> +                       goto emit_bswap_uxt;
> +               case 32:
> +                       emit_rev32(rt, rt, ctx);
> +                       goto emit_bswap_uxt;
> +               case 64:
> +                       /* Because of the usage of ARM_LR */
> +                       ctx->seen |= SEEN_CALL;
> +                       emit_rev32(ARM_LR, rt, ctx);
> +                       emit_rev32(rt, rd, ctx);
> +                       emit(ARM_MOV_R(rd, ARM_LR), ctx);
>                         break;
> -               case BPF_ALU | BPF_OR | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_ORR_R(r_A, r_A, r_X), ctx);
> +               }
> +               goto exit;
> +emit_bswap_uxt:
> +               switch (imm) {
> +               case 16:
> +                       /* zero-extend 16 bits into 64 bits */
> +#if __LINUX_ARM_ARCH__ < 6
> +                       emit_a32_mov_i(tmp2[1], 0xffff, false, ctx);
> +                       emit(ARM_AND_R(rt, rt, tmp2[1]), ctx);
> +#else /* ARMv6+ */
> +                       emit(ARM_UXTH(rt, rt), ctx);
> +#endif
> +                       emit(ARM_EOR_R(rd, rd, rd), ctx);
>                         break;
> -               case BPF_ALU | BPF_XOR | BPF_K:
> -                       /* A ^= K; */
> -                       OP_IMM3(ARM_EOR, r_A, r_A, k, ctx);
> +               case 32:
> +                       /* zero-extend 32 bits into 64 bits */
> +                       emit(ARM_EOR_R(rd, rd, rd), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_ALU_XOR_X:
> -               case BPF_ALU | BPF_XOR | BPF_X:
> -                       /* A ^= X */
> -                       update_on_xread(ctx);
> -                       emit(ARM_EOR_R(r_A, r_A, r_X), ctx);
> +               case 64:
> +                       /* nop */
>                         break;
> -               case BPF_ALU | BPF_AND | BPF_K:
> -                       /* A &= K */
> -                       OP_IMM3(ARM_AND, r_A, r_A, k, ctx);
> +               }
> +exit:
> +               if (dstk) {
> +                       emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +                       emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +               }
> +               break;
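
The BPF_END handling above is easier to follow as a plain C model: on a little-endian host, converting to big-endian reverses bytes within the selected width and zero-extends the result, and the 64-bit case reverses each word and swaps the two. The sketch below assumes a little-endian host, as the patch does. `rev16()`, `rev32()` and `to_be()` are hypothetical names that only mirror what the emitted instructions are expected to do.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bytes inside each 16-bit half of x. */
static uint32_t rev16(uint32_t x)
{
	return ((x & 0x00ff00ffu) << 8) | ((x >> 8) & 0x00ff00ffu);
}

/* Reverse all four bytes of x. */
static uint32_t rev32(uint32_t x)
{
	return ((x & 0xffu) << 24) | ((x & 0xff00u) << 8) |
	       ((x >> 8) & 0xff00u) | (x >> 24);
}

/* Model of dst = htobe(dst) for width 16, 32 or 64 on the pair hi:lo. */
static void to_be(uint32_t *hi, uint32_t *lo, int width)
{
	uint32_t tmp;

	assert(width == 16 || width == 32 || width == 64);
	switch (width) {
	case 16:
		/* swap the low two bytes, then zero-extend to 64 bits */
		*lo = rev16(*lo) & 0xffffu;
		*hi = 0;
		break;
	case 32:
		*lo = rev32(*lo);
		*hi = 0;
		break;
	case 64:
		/* reverse each word and exchange them */
		tmp = rev32(*lo);
		*lo = rev32(*hi);
		*hi = tmp;
		break;
	}
}
```

The temporary in the 64-bit case plays the role that ARM_LR plays in the patch.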
> +       /* dst = imm64 */
> +       case BPF_LD | BPF_IMM | BPF_DW:
> +       {
> +               const struct bpf_insn insn1 = insn[1];
> +               u32 hi, lo = imm;
> +
> +               if (insn1.code != 0 || insn1.src_reg != 0 ||
> +                   insn1.dst_reg != 0 || insn1.off != 0) {
> +                       /* Note: verifier in BPF core must catch invalid
> +                        * instruction.
> +                        */
> +                       pr_err_once("Invalid BPF_LD_IMM64 instruction\n");
> +                       return -EINVAL;
> +               }
> +               hi = insn1.imm;
> +               emit_a32_mov_i(dst_lo, lo, dstk, ctx);
> +               emit_a32_mov_i(dst_hi, hi, dstk, ctx);
> +
> +               return 1;
> +       }
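
The 64-bit immediate load above spans two instruction slots: the first carries the low word in its imm field, and the second carries the high word and must otherwise be empty. A small self-contained model of that check and assembly is sketched here. `struct insn_model` is a hypothetical, simplified stand-in for the real instruction layout, and `ld_imm64()` is not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one eBPF instruction slot. */
struct insn_model {
	uint8_t code;
	uint8_t dst_reg;
	uint8_t src_reg;
	int16_t off;
	int32_t imm;
};

/*
 * Combine the two slots of a 64-bit immediate load.
 * Returns 0 and writes *out on success, or -1 if the second slot
 * carries anything other than an immediate.
 */
static int ld_imm64(const struct insn_model insn[2], uint64_t *out)
{
	const struct insn_model *next = &insn[1];

	if (next->code || next->src_reg || next->dst_reg || next->off)
		return -1;

	/* Each half is taken as raw 32 bits; no sign extension here. */
	*out = ((uint64_t)(uint32_t)next->imm << 32) | (uint32_t)insn[0].imm;
	return 0;
}
```

Note that a negative low immediate does not spill into the high word in this model, which matches the patch loading dst_lo and dst_hi independently.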
> +       /* LDX: dst = *(size *)(src + off) */
> +       case BPF_LDX | BPF_MEM | BPF_W:
> +       case BPF_LDX | BPF_MEM | BPF_H:
> +       case BPF_LDX | BPF_MEM | BPF_B:
> +       case BPF_LDX | BPF_MEM | BPF_DW:
> +               rn = sstk ? tmp2[1] : src_lo;
> +               if (sstk)
> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
> +               switch (BPF_SIZE(code)) {
> +               case BPF_W:
> +                       /* Load a Word */
> +               case BPF_H:
> +                       /* Load a Half-Word */
> +               case BPF_B:
> +                       /* Load a Byte */
> +                       emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_SIZE(code));
> +                       emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>                         break;
> -               case BPF_ALU | BPF_AND | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_AND_R(r_A, r_A, r_X), ctx);
> +               case BPF_DW:
> +                       /* Load a double word */
> +                       emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_W);
> +                       emit_ldx_r(dst_hi, rn, dstk, off+4, ctx, BPF_W);
>                         break;
> -               case BPF_ALU | BPF_LSH | BPF_K:
> -                       if (unlikely(k > 31))
> -                               return -1;
> -                       emit(ARM_LSL_I(r_A, r_A, k), ctx);
> +               }
> +               break;
> +       /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
> +       case BPF_LD | BPF_ABS | BPF_W:
> +       case BPF_LD | BPF_ABS | BPF_H:
> +       case BPF_LD | BPF_ABS | BPF_B:
> +       /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
> +       case BPF_LD | BPF_IND | BPF_W:
> +       case BPF_LD | BPF_IND | BPF_H:
> +       case BPF_LD | BPF_IND | BPF_B:
> +       {
> +               const u8 r4 = bpf2a32[BPF_REG_6][1]; /* r4 = ptr to sk_buff */
> +               const u8 r0 = bpf2a32[BPF_REG_0][1]; /*r0: struct sk_buff *skb*/
> +                                                    /* rtn value */
> +               const u8 r1 = bpf2a32[BPF_REG_0][0]; /* r1: int k */
> +               const u8 r2 = bpf2a32[BPF_REG_1][1]; /* r2: unsigned int size */
> +               const u8 r3 = bpf2a32[BPF_REG_1][0]; /* r3: void *buffer */
> +               const u8 r6 = bpf2a32[TMP_REG_1][1]; /* r6: void *(*func)(..) */
> +               int size;
> +
> +               /* Setting up first argument */
> +               emit(ARM_MOV_R(r0, r4), ctx);
> +
> +               /* Setting up second argument */
> +               emit_a32_mov_i(r1, imm, false, ctx);
> +               if (BPF_MODE(code) == BPF_IND)
> +                       emit_a32_alu_r(r1, src_lo, false, sstk, ctx,
> +                                      false, false, BPF_ADD);
> +
> +               /* Setting up third argument */
> +               switch (BPF_SIZE(code)) {
> +               case BPF_W:
> +                       size = 4;
>                         break;
> -               case BPF_ALU | BPF_LSH | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_LSL_R(r_A, r_A, r_X), ctx);
> +               case BPF_H:
> +                       size = 2;
>                         break;
> -               case BPF_ALU | BPF_RSH | BPF_K:
> -                       if (unlikely(k > 31))
> -                               return -1;
> -                       if (k)
> -                               emit(ARM_LSR_I(r_A, r_A, k), ctx);
> +               case BPF_B:
> +                       size = 1;
>                         break;
> -               case BPF_ALU | BPF_RSH | BPF_X:
> -                       update_on_xread(ctx);
> -                       emit(ARM_LSR_R(r_A, r_A, r_X), ctx);
> +               default:
> +                       return -EINVAL;
> +               }
> +               emit_a32_mov_i(r2, size, false, ctx);
> +
> +               /* Setting up fourth argument */
> +               emit(ARM_ADD_I(r3, ARM_SP, imm8m(SKB_BUFFER)), ctx);
> +
> +               /* Setting up function pointer to call */
> +               emit_a32_mov_i(r6, (unsigned int)bpf_load_pointer, false, ctx);
> +               emit_blx_r(r6, ctx);
> +
> +               emit(ARM_EOR_R(r1, r1, r1), ctx);
> +               /* Check whether the returned pointer is NULL.
> +                * If it is NULL, jump to the epilogue;
> +                * otherwise load the value from the returned address.
> +                */
> +               emit(ARM_CMP_I(r0, 0), ctx);
> +               jmp_offset = epilogue_offset(ctx);
> +               check_imm24(jmp_offset);
> +               _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
> +
> +               /* Load value from the address */
> +               switch (BPF_SIZE(code)) {
> +               case BPF_W:
> +                       emit(ARM_LDR_I(r0, r0, 0), ctx);
> +                       emit_rev32(r0, r0, ctx);
>                         break;
> -               case BPF_ALU | BPF_NEG:
> -                       /* A = -A */
> -                       emit(ARM_RSB_I(r_A, r_A, 0), ctx);
> +               case BPF_H:
> +                       emit(ARM_LDRH_I(r0, r0, 0), ctx);
> +                       emit_rev16(r0, r0, ctx);
>                         break;
> -               case BPF_JMP | BPF_JA:
> -                       /* pc += K */
> -                       emit(ARM_B(b_imm(i + k + 1, ctx)), ctx);
> +               case BPF_B:
> +                       emit(ARM_LDRB_I(r0, r0, 0), ctx);
> +                       /* No need to reverse */
>                         break;
> -               case BPF_JMP | BPF_JEQ | BPF_K:
> -                       /* pc += (A == K) ? pc->jt : pc->jf */
> -                       condt  = ARM_COND_EQ;
> -                       goto cmp_imm;
> -               case BPF_JMP | BPF_JGT | BPF_K:
> -                       /* pc += (A > K) ? pc->jt : pc->jf */
> -                       condt  = ARM_COND_HI;
> -                       goto cmp_imm;
> -               case BPF_JMP | BPF_JGE | BPF_K:
> -                       /* pc += (A >= K) ? pc->jt : pc->jf */
> -                       condt  = ARM_COND_HS;
> -cmp_imm:
> -                       imm12 = imm8m(k);
> -                       if (imm12 < 0) {
> -                               emit_mov_i_no8m(r_scratch, k, ctx);
> -                               emit(ARM_CMP_R(r_A, r_scratch), ctx);
> -                       } else {
> -                               emit(ARM_CMP_I(r_A, imm12), ctx);
> -                       }
> -cond_jump:
> -                       if (inst->jt)
> -                               _emit(condt, ARM_B(b_imm(i + inst->jt + 1,
> -                                                  ctx)), ctx);
> -                       if (inst->jf)
> -                               _emit(condt ^ 1, ARM_B(b_imm(i + inst->jf + 1,
> -                                                            ctx)), ctx);
> +               }
> +               break;
> +       }
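
The LD_ABS/LD_IND sequence above calls a helper that returns a pointer to the requested bytes, bails out to the epilogue if that pointer is NULL, and then loads the value in network byte order. The following plain C sketch models that flow under stated assumptions: `load_pointer()` is a hypothetical stand-in for the helper and only does a bounds check on a flat buffer, and the -1 return of `ld_abs()` stands for the branch to the epilogue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the helper call: return a pointer to
 * size bytes at offset off, or NULL if they are not all inside the
 * packet.
 */
static const uint8_t *load_pointer(const uint8_t *data, size_t len,
				   size_t off, size_t size)
{
	if (off > len || size > len - off)
		return NULL;
	return data + off;
}

/*
 * Model of the load: fetch 1, 2 or 4 bytes in network byte order.
 * Returns 0 and writes *out, or -1 where the JIT would branch to
 * the epilogue.
 */
static int ld_abs(const uint8_t *data, size_t len, size_t off,
		  size_t size, uint32_t *out)
{
	const uint8_t *p = load_pointer(data, len, off, size);
	uint32_t v = 0;
	size_t i;

	assert(size == 1 || size == 2 || size == 4);
	if (!p)
		return -1;
	/* Most significant byte first, independent of host endianness. */
	for (i = 0; i < size; i++)
		v = (v << 8) | p[i];
	*out = v;
	return 0;
}
```

Assembling the value byte by byte gives the same result as the load followed by emit_rev16()/emit_rev32() on a little-endian machine, without depending on the host byte order.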
> +       /* ST: *(size *)(dst + off) = imm */
> +       case BPF_ST | BPF_MEM | BPF_W:
> +       case BPF_ST | BPF_MEM | BPF_H:
> +       case BPF_ST | BPF_MEM | BPF_B:
> +       case BPF_ST | BPF_MEM | BPF_DW:
> +               switch (BPF_SIZE(code)) {
> +               case BPF_DW:
> +                       /* Sign-extend immediate value into temp reg */
> +                       emit_a32_mov_i64(true, tmp2, imm, false, ctx);
> +                       emit_str_r(dst_lo, tmp2[1], dstk, off, ctx, BPF_W);
> +                       emit_str_r(dst_lo, tmp2[0], dstk, off+4, ctx, BPF_W);
>                         break;
> -               case BPF_JMP | BPF_JEQ | BPF_X:
> -                       /* pc += (A == X) ? pc->jt : pc->jf */
> -                       condt   = ARM_COND_EQ;
> -                       goto cmp_x;
> -               case BPF_JMP | BPF_JGT | BPF_X:
> -                       /* pc += (A > X) ? pc->jt : pc->jf */
> -                       condt   = ARM_COND_HI;
> -                       goto cmp_x;
> -               case BPF_JMP | BPF_JGE | BPF_X:
> -                       /* pc += (A >= X) ? pc->jt : pc->jf */
> -                       condt   = ARM_COND_CS;
> -cmp_x:
> -                       update_on_xread(ctx);
> -                       emit(ARM_CMP_R(r_A, r_X), ctx);
> -                       goto cond_jump;
> -               case BPF_JMP | BPF_JSET | BPF_K:
> -                       /* pc += (A & K) ? pc->jt : pc->jf */
> -                       condt  = ARM_COND_NE;
> -                       /* not set iff all zeroes iff Z==1 iff EQ */
> -
> -                       imm12 = imm8m(k);
> -                       if (imm12 < 0) {
> -                               emit_mov_i_no8m(r_scratch, k, ctx);
> -                               emit(ARM_TST_R(r_A, r_scratch), ctx);
> -                       } else {
> -                               emit(ARM_TST_I(r_A, imm12), ctx);
> -                       }
> -                       goto cond_jump;
> -               case BPF_JMP | BPF_JSET | BPF_X:
> -                       /* pc += (A & X) ? pc->jt : pc->jf */
> -                       update_on_xread(ctx);
> -                       condt  = ARM_COND_NE;
> -                       emit(ARM_TST_R(r_A, r_X), ctx);
> -                       goto cond_jump;
> -               case BPF_RET | BPF_A:
> -                       emit(ARM_MOV_R(ARM_R0, r_A), ctx);
> -                       goto b_epilogue;
> -               case BPF_RET | BPF_K:
> -                       if ((k == 0) && (ctx->ret0_fp_idx < 0))
> -                               ctx->ret0_fp_idx = i;
> -                       emit_mov_i(ARM_R0, k, ctx);
> -b_epilogue:
> -                       if (i != ctx->skf->len - 1)
> -                               emit(ARM_B(b_imm(prog->len, ctx)), ctx);
> +               case BPF_W:
> +               case BPF_H:
> +               case BPF_B:
> +                       emit_a32_mov_i(tmp2[1], imm, false, ctx);
> +                       emit_str_r(dst_lo, tmp2[1], dstk, off, ctx,
> +                                  BPF_SIZE(code));
>                         break;
> -               case BPF_MISC | BPF_TAX:
> -                       /* X = A */
> -                       ctx->seen |= SEEN_X;
> -                       emit(ARM_MOV_R(r_X, r_A), ctx);
> +               }
> +               break;
> +       /* STX XADD: lock *(u32 *)(dst + off) += src */
> +       case BPF_STX | BPF_XADD | BPF_W:
> +       /* STX XADD: lock *(u64 *)(dst + off) += src */
> +       case BPF_STX | BPF_XADD | BPF_DW:
> +               goto notyet;
> +       /* STX: *(size *)(dst + off) = src */
> +       case BPF_STX | BPF_MEM | BPF_W:
> +       case BPF_STX | BPF_MEM | BPF_H:
> +       case BPF_STX | BPF_MEM | BPF_B:
> +       case BPF_STX | BPF_MEM | BPF_DW:
> +       {
> +               u8 sz = BPF_SIZE(code);
> +
> +               rn = sstk ? tmp2[1] : src_lo;
> +               rm = sstk ? tmp2[0] : src_hi;
> +               if (!sstk)
> +                       goto do_store;
> +               switch (BPF_SIZE(code)) {
> +               case BPF_W:
> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
> +                       goto empty_hi;
> +               case BPF_H:
> +                       emit(ARM_LDRH_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
> +                       goto empty_hi;
> +               case BPF_B:
> +                       emit(ARM_LDRB_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
> +                       goto empty_hi;
> +empty_hi:
> +                       emit(ARM_EOR_R(rm, rm, rm), ctx);
> +                       break;
> +               case BPF_DW:
> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
> +                       emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
> +                       sz = BPF_W;
>                         break;
> -               case BPF_MISC | BPF_TXA:
> -                       /* A = X */
> -                       update_on_xread(ctx);
> -                       emit(ARM_MOV_R(r_A, r_X), ctx);
> +               }
> +
> +do_store:
> +               /* Clear higher word except for BPF_DW */
> +               if (BPF_SIZE(code) != BPF_DW)
> +                       emit(ARM_EOR_R(rm, rm, rm), ctx);
> +
> +               /* Store the value */
> +               emit_str_r(dst_lo, rn, dstk, off, ctx, sz);
> +               if (BPF_SIZE(code) == BPF_DW)
> +                       emit_str_r(dst_lo, rm, dstk, off+4, ctx, BPF_W);
> +               break;
> +       }
> +       /* PC += off if dst == src */
> +       /* PC += off if dst > src */
> +       /* PC += off if dst >= src */
> +       /* PC += off if dst != src */
> +       /* PC += off if dst > src (signed) */
> +       /* PC += off if dst >= src (signed) */
> +       /* PC += off if dst & src */
> +       case BPF_JMP | BPF_JEQ | BPF_X:
> +       case BPF_JMP | BPF_JGT | BPF_X:
> +       case BPF_JMP | BPF_JGE | BPF_X:
> +       case BPF_JMP | BPF_JNE | BPF_X:
> +       case BPF_JMP | BPF_JSGT | BPF_X:
> +       case BPF_JMP | BPF_JSGE | BPF_X:
> +       case BPF_JMP | BPF_JSET | BPF_X:
> +               /* Setup source registers */
> +               rm = sstk ? tmp2[0] : src_hi;
> +               rn = sstk ? tmp2[1] : src_lo;
> +               if (sstk) {
> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
> +                       emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
> +               }
> +               goto go_jmp;
> +       /* PC += off if dst == imm */
> +       /* PC += off if dst > imm */
> +       /* PC += off if dst >= imm */
> +       /* PC += off if dst != imm */
> +       /* PC += off if dst > imm (signed) */
> +       /* PC += off if dst >= imm (signed) */
> +       /* PC += off if dst & imm */
> +       case BPF_JMP | BPF_JEQ | BPF_K:
> +       case BPF_JMP | BPF_JGT | BPF_K:
> +       case BPF_JMP | BPF_JGE | BPF_K:
> +       case BPF_JMP | BPF_JNE | BPF_K:
> +       case BPF_JMP | BPF_JSGT | BPF_K:
> +       case BPF_JMP | BPF_JSGE | BPF_K:
> +       case BPF_JMP | BPF_JSET | BPF_K:
> +               if (off == 0)
>                         break;
> -               case BPF_ANC | SKF_AD_PROTOCOL:
> -                       /* A = ntohs(skb->protocol) */
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
> -                                                 protocol) != 2);
> -                       off = offsetof(struct sk_buff, protocol);
> -                       emit(ARM_LDRH_I(r_scratch, r_skb, off), ctx);
> -                       emit_swap16(r_A, r_scratch, ctx);
> +               rm = tmp2[0];
> +               rn = tmp2[1];
> +               /* Sign-extend immediate value */
> +               emit_a32_mov_i64(true, tmp2, imm, false, ctx);
> +go_jmp:
> +               /* Setup destination register */
> +               rd = dstk ? tmp[0] : dst_hi;
> +               rt = dstk ? tmp[1] : dst_lo;
> +               if (dstk) {
> +                       emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
> +                       emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
> +               }
> +
> +               /* Check for the condition */
> +               emit_ar_r(rd, rt, rm, rn, ctx, BPF_OP(code));
> +
> +               /* Setup JUMP instruction */
> +               jmp_offset = bpf2a32_offset(i+off, i, ctx);
> +               switch (BPF_OP(code)) {
> +               case BPF_JNE:
> +               case BPF_JSET:
> +                       _emit(ARM_COND_NE, ARM_B(jmp_offset), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_CPU:
> -                       /* r_scratch = current_thread_info() */
> -                       OP_IMM3(ARM_BIC, r_scratch, ARM_SP, THREAD_SIZE - 1, ctx);
> -                       /* A = current_thread_info()->cpu */
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct thread_info, cpu) != 4);
> -                       off = offsetof(struct thread_info, cpu);
> -                       emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
> +               case BPF_JEQ:
> +                       _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_IFINDEX:
> -               case BPF_ANC | SKF_AD_HATYPE:
> -                       /* A = skb->dev->ifindex */
> -                       /* A = skb->dev->type */
> -                       ctx->seen |= SEEN_SKB;
> -                       off = offsetof(struct sk_buff, dev);
> -                       emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
> -
> -                       emit(ARM_CMP_I(r_scratch, 0), ctx);
> -                       emit_err_ret(ARM_COND_EQ, ctx);
> -
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
> -                                                 ifindex) != 4);
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
> -                                                 type) != 2);
> -
> -                       if (code == (BPF_ANC | SKF_AD_IFINDEX)) {
> -                               off = offsetof(struct net_device, ifindex);
> -                               emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
> -                       } else {
> -                               /*
> -                                * offset of field "type" in "struct
> -                                * net_device" is above what can be
> -                                * used in the ldrh rd, [rn, #imm]
> -                                * instruction, so load the offset in
> -                                * a register and use ldrh rd, [rn, rm]
> -                                */
> -                               off = offsetof(struct net_device, type);
> -                               emit_mov_i(ARM_R3, off, ctx);
> -                               emit(ARM_LDRH_R(r_A, r_scratch, ARM_R3), ctx);
> -                       }
> +               case BPF_JGT:
> +                       _emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_MARK:
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
> -                       off = offsetof(struct sk_buff, mark);
> -                       emit(ARM_LDR_I(r_A, r_skb, off), ctx);
> +               case BPF_JGE:
> +                       _emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_RXHASH:
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, hash) != 4);
> -                       off = offsetof(struct sk_buff, hash);
> -                       emit(ARM_LDR_I(r_A, r_skb, off), ctx);
> +               case BPF_JSGT:
> +                       _emit(ARM_COND_LT, ARM_B(jmp_offset), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_VLAN_TAG:
> -               case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2);
> -                       off = offsetof(struct sk_buff, vlan_tci);
> -                       emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
> -                       if (code == (BPF_ANC | SKF_AD_VLAN_TAG))
> -                               OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx);
> -                       else {
> -                               OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx);
> -                               OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
> -                       }
> +               case BPF_JSGE:
> +                       _emit(ARM_COND_GE, ARM_B(jmp_offset), ctx);
>                         break;
> -               case BPF_ANC | SKF_AD_PKTTYPE:
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
> -                                                 __pkt_type_offset[0]) != 1);
> -                       off = PKT_TYPE_OFFSET();
> -                       emit(ARM_LDRB_I(r_A, r_skb, off), ctx);
> -                       emit(ARM_AND_I(r_A, r_A, PKT_TYPE_MAX), ctx);
> -#ifdef __BIG_ENDIAN_BITFIELD
> -                       emit(ARM_LSR_I(r_A, r_A, 5), ctx);
> -#endif
> +               }
> +               break;
> +       /* JMP OFF */
> +       case BPF_JMP | BPF_JA:
> +       {
> +               if (off == 0)
>                         break;
> -               case BPF_ANC | SKF_AD_QUEUE:
> -                       ctx->seen |= SEEN_SKB;
> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
> -                                                 queue_mapping) != 2);
> -                       BUILD_BUG_ON(offsetof(struct sk_buff,
> -                                             queue_mapping) > 0xff);
> -                       off = offsetof(struct sk_buff, queue_mapping);
> -                       emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
> +               jmp_offset = bpf2a32_offset(i+off, i, ctx);
> +               check_imm24(jmp_offset);
> +               emit(ARM_B(jmp_offset), ctx);
> +               break;
> +       }
> +       /* tail call */
> +       case BPF_JMP | BPF_CALL | BPF_X:
> +               if (emit_bpf_tail_call(ctx))
> +                       return -EFAULT;
> +               break;
> +       /* function call */
> +       case BPF_JMP | BPF_CALL:
> +               goto notyet;
> +       /* function return */
> +       case BPF_JMP | BPF_EXIT:
> +               /* Optimization: when last instruction is EXIT
> +                * simply fallthrough to epilogue.
> +                */
> +               if (i == ctx->prog->len - 1)
>                         break;
> -               case BPF_ANC | SKF_AD_PAY_OFFSET:
> -                       ctx->seen |= SEEN_SKB | SEEN_CALL;
> +               jmp_offset = epilogue_offset(ctx);
> +               check_imm24(jmp_offset);
> +               emit(ARM_B(jmp_offset), ctx);
> +               break;
> +notyet:
> +               pr_info_once("*** NOT YET: opcode %02x ***\n", code);
> +               return -EFAULT;
> +       default:
> +               pr_err_once("unknown opcode %02x\n", code);
> +               return -EINVAL;
> +       }
>
> -                       emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
> -                       emit_mov_i(ARM_R3, (unsigned int)skb_get_poff, ctx);
> -                       emit_blx_r(ARM_R3, ctx);
> -                       emit(ARM_MOV_R(r_A, ARM_R0), ctx);
> -                       break;
> -               case BPF_LDX | BPF_W | BPF_ABS:
> -                       /*
> -                        * load a 32bit word from struct seccomp_data.
> -                        * seccomp_check_filter() will already have checked
> -                        * that k is 32bit aligned and lies within the
> -                        * struct seccomp_data.
> -                        */
> -                       ctx->seen |= SEEN_SKB;
> -                       emit(ARM_LDR_I(r_A, r_skb, k), ctx);
> -                       break;
> -               default:
> -                       return -1;
> +       if (ctx->flags & FLAG_IMM_OVERFLOW)
> +               /*
> +                * this instruction generated an overflow when
> +                * trying to access the literal pool, so
> +                * delegate this filter to the kernel interpreter.
> +                */
> +               return -1;
> +       return 0;
> +}
> +
> +static int build_body(struct jit_ctx *ctx)
> +{
> +       const struct bpf_prog *prog = ctx->prog;
> +       unsigned int i;
> +
> +       for (i = 0; i < prog->len; i++) {
> +               const struct bpf_insn *insn = &(prog->insnsi[i]);
> +               int ret;
> +
> +               ret = build_insn(insn, ctx);
> +
> +               /* It's used with loading the 64 bit immediate value. */
> +               if (ret > 0) {
> +                       i++;
> +                       if (ctx->target == NULL)
> +                               ctx->offsets[i] = ctx->idx;
> +                       continue;
>                 }
>
> -               if (ctx->flags & FLAG_IMM_OVERFLOW)
> -                       /*
> -                        * this instruction generated an overflow when
> -                        * trying to access the literal pool, so
> -                        * delegate this filter to the kernel interpreter.
> -                        */
> -                       return -1;
> +               if (ctx->target == NULL)
> +                       ctx->offsets[i] = ctx->idx;
> +
> +               /* If unsuccessful, return with error code */
> +               if (ret)
> +                       return ret;
>         }
> +       return 0;
> +}
>
> -       /* compute offsets only during the first pass */
> -       if (ctx->target == NULL)
> -               ctx->offsets[i] = ctx->idx * 4;
> +static int validate_code(struct jit_ctx *ctx)
> +{
> +       int i;
> +
> +       for (i = 0; i < ctx->idx; i++) {
> +               u32 a32_insn = le32_to_cpu(ctx->target[i]);
> +
> +               if (a32_insn == ARM_INST_UDF)
> +                       return -1;
> +       }
>
>         return 0;
>  }
>
> +void bpf_jit_compile(struct bpf_prog *prog)
> +{
> +       /* Nothing to do here. We support Internal BPF. */
> +}
>
> -void bpf_jit_compile(struct bpf_prog *fp)
> +struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  {
> +#ifdef __LITTLE_ENDIAN
> +       struct bpf_prog *tmp, *orig_prog = prog;
>         struct bpf_binary_header *header;
> +       bool tmp_blinded = false;
>         struct jit_ctx ctx;
> -       unsigned tmp_idx;
> -       unsigned alloc_size;
> -       u8 *target_ptr;
> +       unsigned int tmp_idx;
> +       unsigned int image_size;
> +       u8 *image_ptr;
>
> +       /* If BPF JIT was not enabled then we must fall back to
> +        * the interpreter.
> +        */
>         if (!bpf_jit_enable)
> -               return;
> +               return orig_prog;
>
> -       memset(&ctx, 0, sizeof(ctx));
> -       ctx.skf         = fp;
> -       ctx.ret0_fp_idx = -1;
> +       /* If constant blinding was enabled and we failed during blinding
> +        * then we must fall back to the interpreter. Otherwise, we save
> +        * the new JITed code.
> +        */
> +       tmp = bpf_jit_blind_constants(prog);
>
> -       ctx.offsets = kzalloc(4 * (ctx.skf->len + 1), GFP_KERNEL);
> -       if (ctx.offsets == NULL)
> -               return;
> +       if (IS_ERR(tmp))
> +               return orig_prog;
> +       if (tmp != prog) {
> +               tmp_blinded = true;
> +               prog = tmp;
> +       }
> +
> +       memset(&ctx, 0, sizeof(ctx));
> +       ctx.prog = prog;
>
> -       /* fake pass to fill in the ctx->seen */
> -       if (unlikely(build_body(&ctx)))
> +       /* If we cannot allocate memory for offsets[], then
> +        * we must fall back to the interpreter.
> +        */
> +       ctx.offsets = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
> +       if (ctx.offsets == NULL) {
> +               prog = orig_prog;
>                 goto out;
> +       }
> +
> +       /* 1) fake pass to find the length of the JITed code,
> +        * to compute ctx->offsets and other context variables
> +        * needed to compute final JITed code.
> +        * Also, calculate random starting pointer/start of JITed code
> +        * which is prefixed by random number of fault instructions.
> +        *
> +        * If the first pass fails then there is no chance of it
> +        * being successful in the second pass, so just fall back
> +        * to the interpreter.
> +        */
> +       if (build_body(&ctx)) {
> +               prog = orig_prog;
> +               goto out_off;
> +       }
>
>         tmp_idx = ctx.idx;
>         build_prologue(&ctx);
>         ctx.prologue_bytes = (ctx.idx - tmp_idx) * 4;
>
> +       ctx.epilogue_offset = ctx.idx;
> +
>  #if __LINUX_ARM_ARCH__ < 7
>         tmp_idx = ctx.idx;
>         build_epilogue(&ctx);
> @@ -1020,64 +1838,96 @@ void bpf_jit_compile(struct bpf_prog *fp)
>
>         ctx.idx += ctx.imm_count;
>         if (ctx.imm_count) {
> -               ctx.imms = kzalloc(4 * ctx.imm_count, GFP_KERNEL);
> -               if (ctx.imms == NULL)
> -                       goto out;
> +               ctx.imms = kcalloc(ctx.imm_count, sizeof(u32), GFP_KERNEL);
> +               if (ctx.imms == NULL) {
> +                       prog = orig_prog;
> +                       goto out_off;
> +               }
>         }
>  #else
> -       /* there's nothing after the epilogue on ARMv7 */
> +       /* there's nothing about the epilogue on ARMv7 */
>         build_epilogue(&ctx);
>  #endif
> -       alloc_size = 4 * ctx.idx;
> -       header = bpf_jit_binary_alloc(alloc_size, &target_ptr,
> -                                     4, jit_fill_hole);
> -       if (header == NULL)
> -               goto out;
> +       /* Now we can get the actual image size of the JITed arm code.
> +        * Currently, we are not considering the THUMB-2 instructions
> +        * for jit, although it can decrease the size of the image.
> +        *
> +        * As each arm instruction is 32 bits long, we translate the
> +        * number of JITed instructions into the size required to store
> +        * the JITed code.
> +        */
> +       image_size = sizeof(u32) * ctx.idx;
>
> -       ctx.target = (u32 *) target_ptr;
> +       /* Now we know the size of the structure to make */
> +       header = bpf_jit_binary_alloc(image_size, &image_ptr,
> +                                     sizeof(u32), jit_fill_hole);
> +       /* If we cannot allocate memory for the structure, then
> +        * we must fall back to the interpreter.
> +        */
> +       if (header == NULL) {
> +               prog = orig_prog;
> +               goto out_imms;
> +       }
> +
> +       /* 2.) Actual pass to generate final JIT code */
> +       ctx.target = (u32 *) image_ptr;
>         ctx.idx = 0;
>
>         build_prologue(&ctx);
> +
> +       /* If building the body of the JITed code fails somehow,
> +        * we fall back to the interpreter.
> +        */
>         if (build_body(&ctx) < 0) {
> -#if __LINUX_ARM_ARCH__ < 7
> -               if (ctx.imm_count)
> -                       kfree(ctx.imms);
> -#endif
> +               image_ptr = NULL;
>                 bpf_jit_binary_free(header);
> -               goto out;
> +               prog = orig_prog;
> +               goto out_imms;
>         }
>         build_epilogue(&ctx);
>
> +       /* 3.) Extra pass to validate JITed Code */
> +       if (validate_code(&ctx)) {
> +               image_ptr = NULL;
> +               bpf_jit_binary_free(header);
> +               prog = orig_prog;
> +               goto out_imms;
> +       }
>         flush_icache_range((u32)header, (u32)(ctx.target + ctx.idx));
>
> -#if __LINUX_ARM_ARCH__ < 7
> -       if (ctx.imm_count)
> -               kfree(ctx.imms);
> -#endif
> -
>         if (bpf_jit_enable > 1)
>                 /* there are 2 passes here */
> -               bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
> +               bpf_jit_dump(prog->len, image_size, 2, ctx.target);
>
>         set_memory_ro((unsigned long)header, header->pages);
> -       fp->bpf_func = (void *)ctx.target;
> -       fp->jited = 1;
> -out:
> +       prog->bpf_func = (void *)ctx.target;
> +       prog->jited = 1;
> +out_imms:
> +#if __LINUX_ARM_ARCH__ < 7
> +       if (ctx.imm_count)
> +               kfree(ctx.imms);
> +#endif
> +out_off:
>         kfree(ctx.offsets);
> -       return;
> +out:
> +       if (tmp_blinded)
> +               bpf_jit_prog_release_other(prog, prog == orig_prog ?
> +                                          tmp : orig_prog);
> +#endif /* __LITTLE_ENDIAN */
> +       return prog;
>  }
>
> -void bpf_jit_free(struct bpf_prog *fp)
> +void bpf_jit_free(struct bpf_prog *prog)
>  {
> -       unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
> +       unsigned long addr = (unsigned long)prog->bpf_func & PAGE_MASK;
>         struct bpf_binary_header *header = (void *)addr;
>
> -       if (!fp->jited)
> +       if (!prog->jited)
>                 goto free_filter;
>
>         set_memory_rw(addr, header->pages);
>         bpf_jit_binary_free(header);
>
>  free_filter:
> -       bpf_prog_unlock_free(fp);
> +       bpf_prog_unlock_free(prog);
>  }
> diff --git a/arch/arm/net/bpf_jit_32.h b/arch/arm/net/bpf_jit_32.h
> index c46fca2..d5cf5f6 100644
> --- a/arch/arm/net/bpf_jit_32.h
> +++ b/arch/arm/net/bpf_jit_32.h
> @@ -11,6 +11,7 @@
>  #ifndef PFILTER_OPCODES_ARM_H
>  #define PFILTER_OPCODES_ARM_H
>
> +/* ARM 32bit Registers */
>  #define ARM_R0 0
>  #define ARM_R1 1
>  #define ARM_R2 2
> @@ -22,38 +23,43 @@
>  #define ARM_R8 8
>  #define ARM_R9 9
>  #define ARM_R10        10
> -#define ARM_FP 11
> -#define ARM_IP 12
> -#define ARM_SP 13
> -#define ARM_LR 14
> -#define ARM_PC 15
> -
> -#define ARM_COND_EQ            0x0
> -#define ARM_COND_NE            0x1
> -#define ARM_COND_CS            0x2
> +#define ARM_FP 11      /* Frame Pointer */
> +#define ARM_IP 12      /* Intra-procedure scratch register */
> +#define ARM_SP 13      /* Stack pointer: as load/store base reg */
> +#define ARM_LR 14      /* Link Register */
> +#define ARM_PC 15      /* Program counter */
> +
> +#define ARM_COND_EQ            0x0     /* == */
> +#define ARM_COND_NE            0x1     /* != */
> +#define ARM_COND_CS            0x2     /* unsigned >= */
>  #define ARM_COND_HS            ARM_COND_CS
> -#define ARM_COND_CC            0x3
> +#define ARM_COND_CC            0x3     /* unsigned < */
>  #define ARM_COND_LO            ARM_COND_CC
> -#define ARM_COND_MI            0x4
> -#define ARM_COND_PL            0x5
> -#define ARM_COND_VS            0x6
> -#define ARM_COND_VC            0x7
> -#define ARM_COND_HI            0x8
> -#define ARM_COND_LS            0x9
> -#define ARM_COND_GE            0xa
> -#define ARM_COND_LT            0xb
> -#define ARM_COND_GT            0xc
> -#define ARM_COND_LE            0xd
> -#define ARM_COND_AL            0xe
> +#define ARM_COND_MI            0x4     /* < 0 */
> +#define ARM_COND_PL            0x5     /* >= 0 */
> +#define ARM_COND_VS            0x6     /* Signed Overflow */
> +#define ARM_COND_VC            0x7     /* No Signed Overflow */
> +#define ARM_COND_HI            0x8     /* unsigned > */
> +#define ARM_COND_LS            0x9     /* unsigned <= */
> +#define ARM_COND_GE            0xa     /* Signed >= */
> +#define ARM_COND_LT            0xb     /* Signed < */
> +#define ARM_COND_GT            0xc     /* Signed > */
> +#define ARM_COND_LE            0xd     /* Signed <= */
> +#define ARM_COND_AL            0xe     /* None */
>
>  /* register shift types */
>  #define SRTYPE_LSL             0
>  #define SRTYPE_LSR             1
>  #define SRTYPE_ASR             2
>  #define SRTYPE_ROR             3
> +#define SRTYPE_ASL             (SRTYPE_LSL)
>
>  #define ARM_INST_ADD_R         0x00800000
> +#define ARM_INST_ADDS_R                0x00900000
> +#define ARM_INST_ADC_R         0x00a00000
> +#define ARM_INST_ADC_I         0x02a00000
>  #define ARM_INST_ADD_I         0x02800000
> +#define ARM_INST_ADDS_I                0x02900000
>
>  #define ARM_INST_AND_R         0x00000000
>  #define ARM_INST_AND_I         0x02000000
> @@ -76,8 +82,10 @@
>  #define ARM_INST_LDRH_I                0x01d000b0
>  #define ARM_INST_LDRH_R                0x019000b0
>  #define ARM_INST_LDR_I         0x05900000
> +#define ARM_INST_LDR_R         0x07900000
>
>  #define ARM_INST_LDM           0x08900000
> +#define ARM_INST_LDM_IA                0x08b00000
>
>  #define ARM_INST_LSL_I         0x01a00000
>  #define ARM_INST_LSL_R         0x01a00010
> @@ -86,6 +94,7 @@
>  #define ARM_INST_LSR_R         0x01a00030
>
>  #define ARM_INST_MOV_R         0x01a00000
> +#define ARM_INST_MOVS_R                0x01b00000
>  #define ARM_INST_MOV_I         0x03a00000
>  #define ARM_INST_MOVW          0x03000000
>  #define ARM_INST_MOVT          0x03400000
> @@ -96,17 +105,28 @@
>  #define ARM_INST_PUSH          0x092d0000
>
>  #define ARM_INST_ORR_R         0x01800000
> +#define ARM_INST_ORRS_R                0x01900000
>  #define ARM_INST_ORR_I         0x03800000
>
>  #define ARM_INST_REV           0x06bf0f30
>  #define ARM_INST_REV16         0x06bf0fb0
>
>  #define ARM_INST_RSB_I         0x02600000
> +#define ARM_INST_RSBS_I                0x02700000
> +#define ARM_INST_RSC_I         0x02e00000
>
>  #define ARM_INST_SUB_R         0x00400000
> +#define ARM_INST_SUBS_R                0x00500000
> +#define ARM_INST_RSB_R         0x00600000
>  #define ARM_INST_SUB_I         0x02400000
> +#define ARM_INST_SUBS_I                0x02500000
> +#define ARM_INST_SBC_I         0x02c00000
> +#define ARM_INST_SBC_R         0x00c00000
> +#define ARM_INST_SBCS_R                0x00d00000
>
>  #define ARM_INST_STR_I         0x05800000
> +#define ARM_INST_STRB_I                0x05c00000
> +#define ARM_INST_STRH_I                0x01c000b0
>
>  #define ARM_INST_TST_R         0x01100000
>  #define ARM_INST_TST_I         0x03100000
> @@ -117,6 +137,8 @@
>
>  #define ARM_INST_MLS           0x00600090
>
> +#define ARM_INST_UXTH          0x06ff0070
> +
>  /*
>   * Use a suitable undefined instruction to use for ARM/Thumb2 faulting.
>   * We need to be careful not to conflict with those used by other modules
> @@ -135,9 +157,15 @@
>  #define _AL3_R(op, rd, rn, rm) ((op ## _R) | (rd) << 12 | (rn) << 16 | (rm))
>  /* immediate */
>  #define _AL3_I(op, rd, rn, imm)        ((op ## _I) | (rd) << 12 | (rn) << 16 | (imm))
> +/* register with register-shift */
> +#define _AL3_SR(inst)  (inst | (1 << 4))
>
>  #define ARM_ADD_R(rd, rn, rm)  _AL3_R(ARM_INST_ADD, rd, rn, rm)
> +#define ARM_ADDS_R(rd, rn, rm) _AL3_R(ARM_INST_ADDS, rd, rn, rm)
>  #define ARM_ADD_I(rd, rn, imm) _AL3_I(ARM_INST_ADD, rd, rn, imm)
> +#define ARM_ADDS_I(rd, rn, imm)        _AL3_I(ARM_INST_ADDS, rd, rn, imm)
> +#define ARM_ADC_R(rd, rn, rm)  _AL3_R(ARM_INST_ADC, rd, rn, rm)
> +#define ARM_ADC_I(rd, rn, imm) _AL3_I(ARM_INST_ADC, rd, rn, imm)
>
>  #define ARM_AND_R(rd, rn, rm)  _AL3_R(ARM_INST_AND, rd, rn, rm)
>  #define ARM_AND_I(rd, rn, imm) _AL3_I(ARM_INST_AND, rd, rn, imm)
> @@ -156,7 +184,9 @@
>  #define ARM_EOR_I(rd, rn, imm) _AL3_I(ARM_INST_EOR, rd, rn, imm)
>
>  #define ARM_LDR_I(rt, rn, off) (ARM_INST_LDR_I | (rt) << 12 | (rn) << 16 \
> -                                | (off))
> +                                | ((off) & 0xfff))
> +#define ARM_LDR_R(rt, rn, rm)  (ARM_INST_LDR_R | (rt) << 12 | (rn) << 16 \
> +                                | (rm))
>  #define ARM_LDRB_I(rt, rn, off)        (ARM_INST_LDRB_I | (rt) << 12 | (rn) << 16 \
>                                  | (off))
>  #define ARM_LDRB_R(rt, rn, rm) (ARM_INST_LDRB_R | (rt) << 12 | (rn) << 16 \
> @@ -167,15 +197,23 @@
>                                  | (rm))
>
>  #define ARM_LDM(rn, regs)      (ARM_INST_LDM | (rn) << 16 | (regs))
> +#define ARM_LDM_IA(rn, regs)   (ARM_INST_LDM_IA | (rn) << 16 | (regs))
>
>  #define ARM_LSL_R(rd, rn, rm)  (_AL3_R(ARM_INST_LSL, rd, 0, rn) | (rm) << 8)
>  #define ARM_LSL_I(rd, rn, imm) (_AL3_I(ARM_INST_LSL, rd, 0, rn) | (imm) << 7)
>
>  #define ARM_LSR_R(rd, rn, rm)  (_AL3_R(ARM_INST_LSR, rd, 0, rn) | (rm) << 8)
>  #define ARM_LSR_I(rd, rn, imm) (_AL3_I(ARM_INST_LSR, rd, 0, rn) | (imm) << 7)
> +#define ARM_ASR_R(rd, rn, rm)   (_AL3_R(ARM_INST_ASR, rd, 0, rn) | (rm) << 8)
> +#define ARM_ASR_I(rd, rn, imm)  (_AL3_I(ARM_INST_ASR, rd, 0, rn) | (imm) << 7)
>
>  #define ARM_MOV_R(rd, rm)      _AL3_R(ARM_INST_MOV, rd, 0, rm)
> +#define ARM_MOVS_R(rd, rm)     _AL3_R(ARM_INST_MOVS, rd, 0, rm)
>  #define ARM_MOV_I(rd, imm)     _AL3_I(ARM_INST_MOV, rd, 0, imm)
> +#define ARM_MOV_SR(rd, rm, type, rs)   \
> +       (_AL3_SR(ARM_MOV_R(rd, rm)) | (type) << 5 | (rs) << 8)
> +#define ARM_MOV_SI(rd, rm, type, imm6) \
> +       (ARM_MOV_R(rd, rm) | (type) << 5 | (imm6) << 7)
>
>  #define ARM_MOVW(rd, imm)      \
>         (ARM_INST_MOVW | ((imm) >> 12) << 16 | (rd) << 12 | ((imm) & 0x0fff))
> @@ -190,19 +228,38 @@
>
>  #define ARM_ORR_R(rd, rn, rm)  _AL3_R(ARM_INST_ORR, rd, rn, rm)
>  #define ARM_ORR_I(rd, rn, imm) _AL3_I(ARM_INST_ORR, rd, rn, imm)
> -#define ARM_ORR_S(rd, rn, rm, type, rs)        \
> -       (ARM_ORR_R(rd, rn, rm) | (type) << 5 | (rs) << 7)
> +#define ARM_ORR_SR(rd, rn, rm, type, rs)       \
> +       (_AL3_SR(ARM_ORR_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
> +#define ARM_ORRS_R(rd, rn, rm) _AL3_R(ARM_INST_ORRS, rd, rn, rm)
> +#define ARM_ORRS_SR(rd, rn, rm, type, rs)      \
> +       (_AL3_SR(ARM_ORRS_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
> +#define ARM_ORR_SI(rd, rn, rm, type, imm6)     \
> +       (ARM_ORR_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
> +#define ARM_ORRS_SI(rd, rn, rm, type, imm6)    \
> +       (ARM_ORRS_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
>
>  #define ARM_REV(rd, rm)                (ARM_INST_REV | (rd) << 12 | (rm))
>  #define ARM_REV16(rd, rm)      (ARM_INST_REV16 | (rd) << 12 | (rm))
>
>  #define ARM_RSB_I(rd, rn, imm) _AL3_I(ARM_INST_RSB, rd, rn, imm)
> +#define ARM_RSBS_I(rd, rn, imm)        _AL3_I(ARM_INST_RSBS, rd, rn, imm)
> +#define ARM_RSC_I(rd, rn, imm) _AL3_I(ARM_INST_RSC, rd, rn, imm)
>
>  #define ARM_SUB_R(rd, rn, rm)  _AL3_R(ARM_INST_SUB, rd, rn, rm)
> +#define ARM_SUBS_R(rd, rn, rm) _AL3_R(ARM_INST_SUBS, rd, rn, rm)
> +#define ARM_RSB_R(rd, rn, rm)  _AL3_R(ARM_INST_RSB, rd, rn, rm)
> +#define ARM_SBC_R(rd, rn, rm)  _AL3_R(ARM_INST_SBC, rd, rn, rm)
> +#define ARM_SBCS_R(rd, rn, rm) _AL3_R(ARM_INST_SBCS, rd, rn, rm)
>  #define ARM_SUB_I(rd, rn, imm) _AL3_I(ARM_INST_SUB, rd, rn, imm)
> +#define ARM_SUBS_I(rd, rn, imm)        _AL3_I(ARM_INST_SUBS, rd, rn, imm)
> +#define ARM_SBC_I(rd, rn, imm) _AL3_I(ARM_INST_SBC, rd, rn, imm)
>
>  #define ARM_STR_I(rt, rn, off) (ARM_INST_STR_I | (rt) << 12 | (rn) << 16 \
> -                                | (off))
> +                                | ((off) & 0xfff))
> +#define ARM_STRH_I(rt, rn, off)        (ARM_INST_STRH_I | (rt) << 12 | (rn) << 16 \
> +                                | (((off) & 0xf0) << 4) | ((off) & 0xf))
> +#define ARM_STRB_I(rt, rn, off)        (ARM_INST_STRB_I | (rt) << 12 | (rn) << 16 \
> +                                | (((off) & 0xf0) << 4) | ((off) & 0xf))
>
>  #define ARM_TST_R(rn, rm)      _AL3_R(ARM_INST_TST, 0, rn, rm)
>  #define ARM_TST_I(rn, imm)     _AL3_I(ARM_INST_TST, 0, rn, imm)
> @@ -214,5 +271,6 @@
>
>  #define ARM_MLS(rd, rn, rm, ra)        (ARM_INST_MLS | (rd) << 16 | (rn) | (rm) << 8 \
>                                  | (ra) << 12)
> +#define ARM_UXTH(rd, rm)       (ARM_INST_UXTH | (rd) << 12 | (rm))
>
>  #endif /* PFILTER_OPCODES_ARM_H */
> --
> 2.7.4
>



-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 87+ messages in thread
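
Editorial note on the quoted bpf_jit_32.h hunk: the new ADDS/ADC macros are
what let a 64 bit eBPF addition be emitted as two 32 bit ARM instructions
(add the low words and set the carry flag, then add the high words plus the
carry). The sketch below is a standalone user-space illustration, not code
from the patch. The macro definitions are copied from the quoted header;
with_cond() and add64_emulated() are hypothetical helpers, and placing the
condition code in bits 31:28 is an assumption about what the JIT's _emit()
does.

```c
#include <assert.h>
#include <stdint.h>

/* Definitions copied from the quoted bpf_jit_32.h hunk. */
#define ARM_R0 0
#define ARM_R1 1
#define ARM_R2 2
#define ARM_R3 3

#define ARM_COND_AL		0xe	/* always */

#define ARM_INST_ADDS_R		0x00900000
#define ARM_INST_ADC_R		0x00a00000

#define _AL3_R(op, rd, rn, rm)	((op ## _R) | (rd) << 12 | (rn) << 16 | (rm))
#define ARM_ADDS_R(rd, rn, rm)	_AL3_R(ARM_INST_ADDS, rd, rn, rm)
#define ARM_ADC_R(rd, rn, rm)	_AL3_R(ARM_INST_ADC, rd, rn, rm)

/* Hypothetical helper: put a 4 bit condition code in bits 31:28. */
static uint32_t with_cond(uint32_t cond, uint32_t inst)
{
	assert(cond <= 0xf);
	return (cond << 28) | inst;
}

/*
 * Hypothetical C model of the ADDS/ADC pair. As in the patch, index 1
 * holds the low word and index 0 holds the high word of a 64 bit value.
 */
static void add64_emulated(uint32_t dst[2], const uint32_t src[2])
{
	uint32_t lo = dst[1] + src[1];		/* ADDS: low words */
	uint32_t carry = lo < dst[1];		/* carry out of the low add */

	dst[0] = dst[0] + src[0] + carry;	/* ADC: high words + carry */
	dst[1] = lo;
}
```

With the register mapping from the patch (BPF_REG_1 in r3:r2, BPF_REG_0 in
r1:r0), with_cond(ARM_COND_AL, ARM_ADDS_R(ARM_R2, ARM_R2, ARM_R0)) gives
0xe0922000 (adds r2, r2, r0) and with_cond(ARM_COND_AL, ARM_ADC_R(ARM_R3,
ARM_R3, ARM_R1)) gives 0xe0a33001 (adc r3, r3, r1).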

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-05-30 19:19   ` Kees Cook
@ 2017-06-06 19:47     ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-06-06 19:47 UTC (permalink / raw)
  To: Kees Cook
  Cc: Network Development, Daniel Borkmann, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

Hi Russell, Alexei, David, Daniel, Kees,

Any update on this patch moving forward?

Best,
Shubham Bansal
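
Editorial note for readers of the thread: bpf_int_jit_compile() in the
quoted patch runs build_body() twice, once with ctx.target == NULL to size
the image and fill ctx.offsets[], and once more to write the instructions
into the allocated buffer. The sketch below is a minimal user-space model of
that scheme under stated assumptions, not the patch's code. All toy_* names
are hypothetical, each toy input value n simply expands to n NOP words, and
the behaviour of toy_emit() (store only when a target exists, always advance
idx) is assumed rather than taken from the quoted hunks.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_NOP 0xe1a00000u	/* mov r0, r0 with condition AL */

struct toy_ctx {
	uint32_t *target;	/* NULL during the sizing pass */
	unsigned int idx;	/* 32 bit words emitted so far */
	unsigned int *offsets;	/* offsets[i]: idx after toy insn i */
};

static void toy_emit(uint32_t inst, struct toy_ctx *ctx)
{
	if (ctx->target != NULL)
		ctx->target[ctx->idx] = inst;
	ctx->idx++;
}

/* Toy translation: input value prog[i] expands to prog[i] NOP words. */
static void toy_build_body(const unsigned int *prog, unsigned int len,
			   struct toy_ctx *ctx)
{
	unsigned int i, k;

	for (i = 0; i < len; i++) {
		for (k = 0; k < prog[i]; k++)
			toy_emit(TOY_NOP, ctx);
		/* offsets are recorded during the first pass only */
		if (ctx->target == NULL)
			ctx->offsets[i] = ctx->idx;
	}
}

/* Returns a malloc'ed image (caller frees) or NULL; *words gets its size. */
static uint32_t *toy_jit(const unsigned int *prog, unsigned int len,
			 unsigned int *words)
{
	struct toy_ctx ctx = { 0 };
	uint32_t *image;

	ctx.offsets = calloc(len ? len : 1, sizeof(*ctx.offsets));
	if (ctx.offsets == NULL)
		return NULL;

	/* Pass 1: target is NULL, so only idx and offsets[] are updated. */
	toy_build_body(prog, len, &ctx);
	*words = ctx.idx;

	image = calloc(*words ? *words : 1, sizeof(*image));
	if (image != NULL) {
		/* Pass 2: same walk, now storing into the sized buffer. */
		ctx.target = image;
		ctx.idx = 0;
		toy_build_body(prog, len, &ctx);
		assert(ctx.idx == *words);	/* both passes must agree */
	}

	free(ctx.offsets);
	return image;
}
```

The point of the design is that branch offsets and the image size are only
known after a full walk of the program, so the first walk produces no
output and the second walk can rely on the numbers it collected.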


On Wed, May 31, 2017 at 12:49 AM, Kees Cook <keescook@chromium.org> wrote:
> Forwarding this to net-dev and eBPF folks, who weren't on CC...
>
> -Kees
>
> On Thu, May 25, 2017 at 4:13 PM, Shubham Bansal
> <illusionist.neo@gmail.com> wrote:
>> The JIT compiler emits ARM 32 bit instructions. Currently, it supports
>> eBPF only. Classic BPF is supported because of the conversion by BPF
>> core.
>>
>> This patch changes the current implementation of the JIT compiler for
>> the Berkeley Packet Filter from classic to internal, with almost all
>> instructions from the eBPF ISA supported except the following:
>>         BPF_ALU64 | BPF_DIV | BPF_K
>>         BPF_ALU64 | BPF_DIV | BPF_X
>>         BPF_ALU64 | BPF_MOD | BPF_K
>>         BPF_ALU64 | BPF_MOD | BPF_X
>>         BPF_STX | BPF_XADD | BPF_W
>>         BPF_STX | BPF_XADD | BPF_DW
>>         BPF_JMP | BPF_CALL
>>
>> The implementation uses scratch space to emulate the 64 bit eBPF ISA on
>> 32 bit ARM because ARM has too few general purpose registers. Currently,
>> only LITTLE ENDIAN machines are supported in this eBPF JIT Compiler.
>>
>> Tested on ARMv7 with QEMU by me (Shubham Bansal).
>> Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
>> Expected to work on ARMv6 as well, as it is part ARMv7 and part ARMv5,
>> although ARMv6 has not been properly tested.
>>
>> Both of these tests were run with and without CONFIG_FRAME_POINTER,
>> separately, on a LITTLE ENDIAN machine.
>>
>> For testing:
>>
>> 1. JIT is enabled with
>>         echo 1 > /proc/sys/net/core/bpf_jit_enable
>> 2. Constant Blinding can be enabled along with JIT using
>>         echo 1 > /proc/sys/net/core/bpf_jit_enable
>>         echo 2 > /proc/sys/net/core/bpf_jit_harden
>>
>> See Documentation/networking/filter.txt for more information.
>>
>> Result : test_bpf: Summary: 314 PASSED, 0 FAILED, [278/306 JIT'ed]
>>
>> Signed-off-by: Shubham Bansal <illusionist.neo@gmail.com>
>> ---
>>  Documentation/networking/filter.txt |    4 +-
>>  arch/arm/Kconfig                    |    2 +-
>>  arch/arm/net/bpf_jit_32.c           | 2404 ++++++++++++++++++++++++-----------
>>  arch/arm/net/bpf_jit_32.h           |  108 +-
>>  4 files changed, 1713 insertions(+), 805 deletions(-)
>>
>> diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
>> index b69b205..01165ac 100644
>> --- a/Documentation/networking/filter.txt
>> +++ b/Documentation/networking/filter.txt
>> @@ -596,8 +596,8 @@ skb pointer). All constraints and restrictions from bpf_check_classic() apply
>>  before a conversion to the new layout is being done behind the scenes!
>>
>>  Currently, the classic BPF format is being used for JITing on most 32-bit
>> -architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64 perform JIT
>> -compilation from eBPF instruction set.
>> +architectures, whereas x86-64, aarch64, arm, s390x, powerpc64, sparc64 perform
>> +JIT compilation from eBPF instruction set.
>>
>>  Some core changes of the new internal format:
>>
>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
>> index 8a7ab5e..13ade46 100644
>> --- a/arch/arm/Kconfig
>> +++ b/arch/arm/Kconfig
>> @@ -47,7 +47,7 @@ config ARM
>>         select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
>>         select HAVE_ARCH_TRACEHOOK
>>         select HAVE_ARM_SMCCC if CPU_V7
>> -       select HAVE_CBPF_JIT
>> +       select HAVE_EBPF_JIT
>>         select HAVE_CC_STACKPROTECTOR
>>         select HAVE_CONTEXT_TRACKING
>>         select HAVE_C_RECORDMCOUNT
>> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
>> index 93d0b6d..c7476e5 100644
>> --- a/arch/arm/net/bpf_jit_32.c
>> +++ b/arch/arm/net/bpf_jit_32.c
>> @@ -1,13 +1,15 @@
>>  /*
>> - * Just-In-Time compiler for BPF filters on 32bit ARM
>> + * Just-In-Time compiler for eBPF filters on 32bit ARM
>>   *
>>   * Copyright (c) 2011 Mircea Gherzan <mgherzan@gmail.com>
>> + * Copyright (c) 2017 Shubham Bansal <illusionist.neo@gmail.com>
>>   *
>>   * This program is free software; you can redistribute it and/or modify it
>>   * under the terms of the GNU General Public License as published by the
>>   * Free Software Foundation; version 2 of the License.
>>   */
>>
>> +#include <linux/bpf.h>
>>  #include <linux/bitops.h>
>>  #include <linux/compiler.h>
>>  #include <linux/errno.h>
>> @@ -23,44 +25,91 @@
>>
>>  #include "bpf_jit_32.h"
>>
>> +int bpf_jit_enable __read_mostly;
>> +
>> +#define STACK_OFFSET(k)        (k)
>> +#define TMP_REG_1      (MAX_BPF_JIT_REG + 0)   /* TEMP Register 1 */
>> +#define TMP_REG_2      (MAX_BPF_JIT_REG + 1)   /* TEMP Register 2 */
>> +#define TCALL_CNT      (MAX_BPF_JIT_REG + 2)   /* Tail Call Count */
>> +
>> +/* Flags used for JIT optimization */
>> +#define SEEN_CALL      (1 << 0)
>> +
>> +#define FLAG_IMM_OVERFLOW      (1 << 0)
>> +
>>  /*
>> - * ABI:
>> + * Map eBPF registers to ARM 32bit registers or stack scratch space.
>> + *
>> + * 1. The first argument is passed in arm 32 bit registers and the rest of
>> + * the arguments are passed in stack scratch space.
>> + * 2. The first callee-saved argument is mapped to arm 32 bit registers and
>> + * the rest are mapped to scratch space on the stack.
>> + * 3. We need two 64 bit temp registers to do complex operations on eBPF
>> + * registers.
>> + *
>> + * As the eBPF registers are all 64 bit registers and arm has only 32 bit
>> + * registers, we have to map each eBPF register to two arm 32 bit regs or
>> + * scratch memory space and build the eBPF 64 bit register from those.
>>   *
>> - * r0  scratch register
>> - * r4  BPF register A
>> - * r5  BPF register X
>> - * r6  pointer to the skb
>> - * r7  skb->data
>> - * r8  skb_headlen(skb)
>>   */
>> +static const u8 bpf2a32[][2] = {
>> +       /* return value from in-kernel function, and exit value from eBPF */
>> +       [BPF_REG_0] = {ARM_R1, ARM_R0},
>> +       /* arguments from eBPF program to in-kernel function */
>> +       [BPF_REG_1] = {ARM_R3, ARM_R2},
>> +       /* Stored on stack scratch space */
>> +       [BPF_REG_2] = {STACK_OFFSET(0), STACK_OFFSET(4)},
>> +       [BPF_REG_3] = {STACK_OFFSET(8), STACK_OFFSET(12)},
>> +       [BPF_REG_4] = {STACK_OFFSET(16), STACK_OFFSET(20)},
>> +       [BPF_REG_5] = {STACK_OFFSET(24), STACK_OFFSET(28)},
>> +       /* callee saved registers that in-kernel function will preserve */
>> +       [BPF_REG_6] = {ARM_R5, ARM_R4},
>> +       /* Stored on stack scratch space */
>> +       [BPF_REG_7] = {STACK_OFFSET(32), STACK_OFFSET(36)},
>> +       [BPF_REG_8] = {STACK_OFFSET(40), STACK_OFFSET(44)},
>> +       [BPF_REG_9] = {STACK_OFFSET(48), STACK_OFFSET(52)},
>> +       /* Read only Frame Pointer to access Stack */
>> +       [BPF_REG_FP] = {STACK_OFFSET(56), STACK_OFFSET(60)},
>> +       /* Temporary registers for internal BPF JIT use; can be used
>> +        * for constant blinding and other purposes.
>> +        */
>> +       [TMP_REG_1] = {ARM_R7, ARM_R6},
>> +       [TMP_REG_2] = {ARM_R10, ARM_R8},
>> +       /* Tail call count. Stored on stack scratch space. */
>> +       [TCALL_CNT] = {STACK_OFFSET(64), STACK_OFFSET(68)},
>> +       /* temporary register for blinding constants.
>> +        * Stored on stack scratch space.
>> +        */
>> +       [BPF_REG_AX] = {STACK_OFFSET(72), STACK_OFFSET(76)},
>> +};
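Each row of the table above names where the high word lives first and the low word second, which is why the patch's dst_hi/dst_lo macros index [0] and [1] respectively. A minimal user-space sketch of that split, not kernel code; `struct reg_pair`, `split64` and `join64` are names made up for illustration:

```c
#include <stdint.h>

/* Hypothetical mirror of one bpf2a32[] row: index 0 = high word, 1 = low word. */
struct reg_pair {
	uint32_t word[2];
};

/* Split a 64-bit eBPF value into the two 32-bit halves the JIT tracks. */
static struct reg_pair split64(uint64_t v)
{
	struct reg_pair p;

	p.word[0] = (uint32_t)(v >> 32);	/* high half, like dst_hi */
	p.word[1] = (uint32_t)v;		/* low half, like dst_lo */
	return p;
}

/* Rebuild the 64-bit value from its halves. */
static uint64_t join64(struct reg_pair p)
{
	return ((uint64_t)p.word[0] << 32) | p.word[1];
}
```

Every 64-bit helper later in the patch is effectively an operation on such a pair, with an extra load/store when the pair lives in stack scratch space.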
>>
>> -#define r_scratch      ARM_R0
>> -/* r1-r3 are (also) used for the unaligned loads on the non-ARMv7 slowpath */
>> -#define r_off          ARM_R1
>> -#define r_A            ARM_R4
>> -#define r_X            ARM_R5
>> -#define r_skb          ARM_R6
>> -#define r_skb_data     ARM_R7
>> -#define r_skb_hl       ARM_R8
>> -
>> -#define SCRATCH_SP_OFFSET      0
>> -#define SCRATCH_OFF(k)         (SCRATCH_SP_OFFSET + 4 * (k))
>> -
>> -#define SEEN_MEM               ((1 << BPF_MEMWORDS) - 1)
>> -#define SEEN_MEM_WORD(k)       (1 << (k))
>> -#define SEEN_X                 (1 << BPF_MEMWORDS)
>> -#define SEEN_CALL              (1 << (BPF_MEMWORDS + 1))
>> -#define SEEN_SKB               (1 << (BPF_MEMWORDS + 2))
>> -#define SEEN_DATA              (1 << (BPF_MEMWORDS + 3))
>> +#define dst_lo dst[1]
>> +#define dst_hi dst[0]
>> +#define src_lo src[1]
>> +#define src_hi src[0]
>>
>> -#define FLAG_NEED_X_RESET      (1 << 0)
>> -#define FLAG_IMM_OVERFLOW      (1 << 1)
>> +/*
>> + * JIT Context:
>> + *
>> + * prog                        :       bpf_prog
>> + * idx                 :       index of current last JITed instruction.
>> + * prologue_bytes      :       bytes used in prologue.
>> + * epilogue_offset     :       offset of epilogue starting.
>> + * seen                        :       bit mask used for JIT optimization.
>> + * offsets             :       array of eBPF instruction offsets in
>> + *                             JITed code.
>> + * target              :       final JITed code.
>> + * epilogue_bytes      :       number of bytes used in epilogue.
>> + * imm_count           :       number of immediates used for global
>> + *                             variables.
>> + * imms                        :       array of global variable addresses.
>> + */
>>
>>  struct jit_ctx {
>> -       const struct bpf_prog *skf;
>> -       unsigned idx;
>> -       unsigned prologue_bytes;
>> -       int ret0_fp_idx;
>> +       const struct bpf_prog *prog;
>> +       unsigned int idx;
>> +       unsigned int prologue_bytes;
>> +       unsigned int epilogue_offset;
>>         u32 seen;
>>         u32 flags;
>>         u32 *offsets;
>> @@ -72,68 +121,16 @@ struct jit_ctx {
>>  #endif
>>  };
>>
>> -int bpf_jit_enable __read_mostly;
>> -
>> -static inline int call_neg_helper(struct sk_buff *skb, int offset, void *ret,
>> -                     unsigned int size)
>> -{
>> -       void *ptr = bpf_internal_load_pointer_neg_helper(skb, offset, size);
>> -
>> -       if (!ptr)
>> -               return -EFAULT;
>> -       memcpy(ret, ptr, size);
>> -       return 0;
>> -}
>> -
>> -static u64 jit_get_skb_b(struct sk_buff *skb, int offset)
>> -{
>> -       u8 ret;
>> -       int err;
>> -
>> -       if (offset < 0)
>> -               err = call_neg_helper(skb, offset, &ret, 1);
>> -       else
>> -               err = skb_copy_bits(skb, offset, &ret, 1);
>> -
>> -       return (u64)err << 32 | ret;
>> -}
>> -
>> -static u64 jit_get_skb_h(struct sk_buff *skb, int offset)
>> -{
>> -       u16 ret;
>> -       int err;
>> -
>> -       if (offset < 0)
>> -               err = call_neg_helper(skb, offset, &ret, 2);
>> -       else
>> -               err = skb_copy_bits(skb, offset, &ret, 2);
>> -
>> -       return (u64)err << 32 | ntohs(ret);
>> -}
>> -
>> -static u64 jit_get_skb_w(struct sk_buff *skb, int offset)
>> -{
>> -       u32 ret;
>> -       int err;
>> -
>> -       if (offset < 0)
>> -               err = call_neg_helper(skb, offset, &ret, 4);
>> -       else
>> -               err = skb_copy_bits(skb, offset, &ret, 4);
>> -
>> -       return (u64)err << 32 | ntohl(ret);
>> -}
>> -
>>  /*
>>   * Wrappers which handle both OABI and EABI and assures Thumb2 interworking
>>   * (where the assembly routines like __aeabi_uidiv could cause problems).
>>   */
>> -static u32 jit_udiv(u32 dividend, u32 divisor)
>> +static u32 jit_udiv32(u32 dividend, u32 divisor)
>>  {
>>         return dividend / divisor;
>>  }
>>
>> -static u32 jit_mod(u32 dividend, u32 divisor)
>> +static u32 jit_mod32(u32 dividend, u32 divisor)
>>  {
>>         return dividend % divisor;
>>  }
>> @@ -157,36 +154,22 @@ static inline void emit(u32 inst, struct jit_ctx *ctx)
>>         _emit(ARM_COND_AL, inst, ctx);
>>  }
>>
>> -static u16 saved_regs(struct jit_ctx *ctx)
>> +/*
>> + * Checks whether an immediate can be encoded as an ARM imm12 (an 8-bit value
>> + * plus a 4-bit rotation). Returns the encoding, or -1 if it cannot.
>> + */
>> +static int16_t imm8m(u32 x)
>>  {
>> -       u16 ret = 0;
>> -
>> -       if ((ctx->skf->len > 1) ||
>> -           (ctx->skf->insns[0].code == (BPF_RET | BPF_A)))
>> -               ret |= 1 << r_A;
>> -
>> -#ifdef CONFIG_FRAME_POINTER
>> -       ret |= (1 << ARM_FP) | (1 << ARM_IP) | (1 << ARM_LR) | (1 << ARM_PC);
>> -#else
>> -       if (ctx->seen & SEEN_CALL)
>> -               ret |= 1 << ARM_LR;
>> -#endif
>> -       if (ctx->seen & (SEEN_DATA | SEEN_SKB))
>> -               ret |= 1 << r_skb;
>> -       if (ctx->seen & SEEN_DATA)
>> -               ret |= (1 << r_skb_data) | (1 << r_skb_hl);
>> -       if (ctx->seen & SEEN_X)
>> -               ret |= 1 << r_X;
>> -
>> -       return ret;
>> -}
>> +       u32 rot;
>>
>> -static inline int mem_words_used(struct jit_ctx *ctx)
>> -{
>> -       /* yes, we do waste some stack space IF there are "holes" in the set" */
>> -       return fls(ctx->seen & SEEN_MEM);
>> +       for (rot = 0; rot < 16; rot++)
>> +               if ((x & ~ror32(0xff, 2 * rot)) == 0)
>> +                       return rol32(x, 2 * rot) | (rot << 8);
>> +       return -1;
>>  }
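For readers who want to try the encoder outside the kernel, here is a stand-alone sketch of the same search. The loop is copied from the quoted imm8m(); the `ror32`/`rol32` helpers are local stand-ins for the kernel's bitops versions and are an assumption of this sketch:

```c
#include <stdint.h>

/* Rotate helpers; masking the count avoids undefined shifts by 32. */
static uint32_t ror32(uint32_t w, unsigned int s)
{
	return (w >> (s & 31)) | (w << ((-s) & 31));
}

static uint32_t rol32(uint32_t w, unsigned int s)
{
	return (w << (s & 31)) | (w >> ((-s) & 31));
}

/*
 * Try each of the 16 even rotations. If x fits inside 0xff rotated right
 * by 2 * rot, return rot in bits [11:8] and the 8-bit payload in bits [7:0].
 * Return -1 when no rotation works.
 */
static int16_t imm8m(uint32_t x)
{
	uint32_t rot;

	for (rot = 0; rot < 16; rot++)
		if ((x & ~ror32(0xff, 2 * rot)) == 0)
			return rol32(x, 2 * rot) | (rot << 8);
	return -1;
}
```

With this sketch, 0x00ff0000 encodes as 0x8ff (the value the removed emit_swap16() comment mentions), while 0x101 spans nine bits and yields -1, so the caller has to fall back to emit_mov_i_no8m().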
>>
>> +/*
>> + * Initializes the JIT space with undefined instructions.
>> + */
>>  static void jit_fill_hole(void *area, unsigned int size)
>>  {
>>         u32 *ptr;
>> @@ -195,88 +178,34 @@ static void jit_fill_hole(void *area, unsigned int size)
>>                 *ptr++ = __opcode_to_mem_arm(ARM_INST_UDF);
>>  }
>>
>> -static void build_prologue(struct jit_ctx *ctx)
>> -{
>> -       u16 reg_set = saved_regs(ctx);
>> -       u16 off;
>> -
>> -#ifdef CONFIG_FRAME_POINTER
>> -       emit(ARM_MOV_R(ARM_IP, ARM_SP), ctx);
>> -       emit(ARM_PUSH(reg_set), ctx);
>> -       emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
>> -#else
>> -       if (reg_set)
>> -               emit(ARM_PUSH(reg_set), ctx);
>> -#endif
>> +/* Stack size must be a multiple of 16 bytes */
>> +#define STACK_ALIGN(sz) (((sz) + 15) & ~15)
>>
>> -       if (ctx->seen & (SEEN_DATA | SEEN_SKB))
>> -               emit(ARM_MOV_R(r_skb, ARM_R0), ctx);
>> -
>> -       if (ctx->seen & SEEN_DATA) {
>> -               off = offsetof(struct sk_buff, data);
>> -               emit(ARM_LDR_I(r_skb_data, r_skb, off), ctx);
>> -               /* headlen = len - data_len */
>> -               off = offsetof(struct sk_buff, len);
>> -               emit(ARM_LDR_I(r_skb_hl, r_skb, off), ctx);
>> -               off = offsetof(struct sk_buff, data_len);
>> -               emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
>> -               emit(ARM_SUB_R(r_skb_hl, r_skb_hl, r_scratch), ctx);
>> -       }
>> -
>> -       if (ctx->flags & FLAG_NEED_X_RESET)
>> -               emit(ARM_MOV_I(r_X, 0), ctx);
>> -
>> -       /* do not leak kernel data to userspace */
>> -       if (bpf_needs_clear_a(&ctx->skf->insns[0]))
>> -               emit(ARM_MOV_I(r_A, 0), ctx);
>> -
>> -       /* stack space for the BPF_MEM words */
>> -       if (ctx->seen & SEEN_MEM)
>> -               emit(ARM_SUB_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
>> -}
>> -
>> -static void build_epilogue(struct jit_ctx *ctx)
>> -{
>> -       u16 reg_set = saved_regs(ctx);
>> -
>> -       if (ctx->seen & SEEN_MEM)
>> -               emit(ARM_ADD_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
>> -
>> -       reg_set &= ~(1 << ARM_LR);
>> -
>> -#ifdef CONFIG_FRAME_POINTER
>> -       /* the first instruction of the prologue was: mov ip, sp */
>> -       reg_set &= ~(1 << ARM_IP);
>> -       reg_set |= (1 << ARM_SP);
>> -       emit(ARM_LDM(ARM_SP, reg_set), ctx);
>> -#else
>> -       if (reg_set) {
>> -               if (ctx->seen & SEEN_CALL)
>> -                       reg_set |= 1 << ARM_PC;
>> -               emit(ARM_POP(reg_set), ctx);
>> -       }
>> +/* Stack space for BPF_REG_2, BPF_REG_3, BPF_REG_4,
>> + * BPF_REG_5, BPF_REG_7, BPF_REG_8, BPF_REG_9,
>> + * BPF_REG_FP, the tail call count and BPF_REG_AX
>> + * (10 registers of 8 bytes each).
>> + */
>> +#define SCRATCH_SIZE 80
>>
>> -       if (!(ctx->seen & SEEN_CALL))
>> -               emit(ARM_BX(ARM_LR), ctx);
>> -#endif
>> -}
>> +/* total stack size used in JITed code */
>> +#define _STACK_SIZE \
>> +       (MAX_BPF_STACK + \
>> +        SCRATCH_SIZE + \
>> +        4 /* extra for skb_copy_bits buffer */)
>>
>> -static int16_t imm8m(u32 x)
>> -{
>> -       u32 rot;
>> +#define STACK_SIZE STACK_ALIGN(_STACK_SIZE)
>>
>> -       for (rot = 0; rot < 16; rot++)
>> -               if ((x & ~ror32(0xff, 2 * rot)) == 0)
>> -                       return rol32(x, 2 * rot) | (rot << 8);
>> +/* Get the offset of eBPF REGISTERs stored on scratch space. */
>> +#define STACK_VAR(off) (STACK_SIZE - (off) - 4)
>>
>> -       return -1;
>> -}
>> +/* Offset of skb_copy_bits buffer */
>> +#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
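To see what these macros evaluate to, the arithmetic can be checked in user space. This sketch assumes MAX_BPF_STACK is 512 (its value in include/linux/filter.h); the macro bodies otherwise follow the quoted definitions, with the macro argument parenthesised:

```c
/* Assumption: MAX_BPF_STACK is 512, as in include/linux/filter.h. */
#define MAX_BPF_STACK	512
#define SCRATCH_SIZE	80

/* Round up to the next multiple of 16. */
#define STACK_ALIGN(sz)	(((sz) + 15) & ~15)

/* BPF stack + register scratch space + 4-byte skb_copy_bits buffer. */
#define _STACK_SIZE	(MAX_BPF_STACK + SCRATCH_SIZE + 4)
#define STACK_SIZE	STACK_ALIGN(_STACK_SIZE)

/* SP-relative slot for a scratch offset, counted down from the top. */
#define STACK_VAR(off)	(STACK_SIZE - (off) - 4)
#define SKB_BUFFER	STACK_VAR(SCRATCH_SIZE)
```

Under that assumption the raw size is 596 bytes, which rounds up to 608; scratch offset 0 lands at SP + 604 and the skb_copy_bits buffer at SP + 524, just below the 80 bytes of register scratch space.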
>>
>>  #if __LINUX_ARM_ARCH__ < 7
>>
>>  static u16 imm_offset(u32 k, struct jit_ctx *ctx)
>>  {
>> -       unsigned i = 0, offset;
>> +       unsigned int i = 0, offset;
>>         u16 imm;
>>
>>         /* on the "fake" run we just count them (duplicates included) */
>> @@ -295,7 +224,7 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
>>                 ctx->imms[i] = k;
>>
>>         /* constants go just after the epilogue */
>> -       offset =  ctx->offsets[ctx->skf->len];
>> +       offset =  ctx->offsets[ctx->prog->len - 1] * 4;
>>         offset += ctx->prologue_bytes;
>>         offset += ctx->epilogue_bytes;
>>         offset += i * 4;
>> @@ -319,10 +248,22 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
>>
>>  #endif /* __LINUX_ARM_ARCH__ */
>>
>> +static inline int bpf2a32_offset(int bpf_to, int bpf_from,
>> +                                const struct jit_ctx *ctx) {
>> +       int to, from;
>> +
>> +       if (ctx->target == NULL)
>> +               return 0;
>> +       to = ctx->offsets[bpf_to];
>> +       from = ctx->offsets[bpf_from];
>> +
>> +       return to - from - 1;
>> +}
>> +
>>  /*
>>   * Move an immediate that's not an imm8m to a core register.
>>   */
>> -static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
>> +static inline void emit_mov_i_no8m(const u8 rd, u32 val, struct jit_ctx *ctx)
>>  {
>>  #if __LINUX_ARM_ARCH__ < 7
>>         emit(ARM_LDR_I(rd, ARM_PC, imm_offset(val, ctx)), ctx);
>> @@ -333,7 +274,7 @@ static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
>>  #endif
>>  }
>>
>> -static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
>> +static inline void emit_mov_i(const u8 rd, u32 val, struct jit_ctx *ctx)
>>  {
>>         int imm12 = imm8m(val);
>>
>> @@ -343,676 +284,1553 @@ static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
>>                 emit_mov_i_no8m(rd, val, ctx);
>>  }
>>
>> -#if __LINUX_ARM_ARCH__ < 6
>> -
>> -static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
>> +static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
>>  {
>> -       _emit(cond, ARM_LDRB_I(ARM_R3, r_addr, 1), ctx);
>> -       _emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
>> -       _emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 3), ctx);
>> -       _emit(cond, ARM_LSL_I(ARM_R3, ARM_R3, 16), ctx);
>> -       _emit(cond, ARM_LDRB_I(ARM_R0, r_addr, 2), ctx);
>> -       _emit(cond, ARM_ORR_S(ARM_R3, ARM_R3, ARM_R1, SRTYPE_LSL, 24), ctx);
>> -       _emit(cond, ARM_ORR_R(ARM_R3, ARM_R3, ARM_R2), ctx);
>> -       _emit(cond, ARM_ORR_S(r_res, ARM_R3, ARM_R0, SRTYPE_LSL, 8), ctx);
>> +       ctx->seen |= SEEN_CALL;
>> +#if __LINUX_ARM_ARCH__ < 5
>> +       emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
>> +
>> +       if (elf_hwcap & HWCAP_THUMB)
>> +               emit(ARM_BX(tgt_reg), ctx);
>> +       else
>> +               emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
>> +#else
>> +       emit(ARM_BLX_R(tgt_reg), ctx);
>> +#endif
>>  }
>>
>> -static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
>> +static inline int epilogue_offset(const struct jit_ctx *ctx)
>>  {
>> -       _emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
>> -       _emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 1), ctx);
>> -       _emit(cond, ARM_ORR_S(r_res, ARM_R2, ARM_R1, SRTYPE_LSL, 8), ctx);
>> +       int to, from;
>> +       /* No need for 1st dummy run */
>> +       if (ctx->target == NULL)
>> +               return 0;
>> +       to = ctx->epilogue_offset;
>> +       from = ctx->idx;
>> +
>> +       return to - from - 2;
>>  }
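The "- 2" comes from how ARM branches are encoded: the offset is relative to the PC, which reads as the address of the branch plus 8 bytes, i.e. two instruction words ahead. A small sketch of that rule in word units; `arm_branch_word_offset` is a hypothetical helper, not something in the patch:

```c
#include <stdint.h>

/*
 * Word offset to place in an ARM B/BL instruction located at instruction
 * index branch_idx so that it lands on instruction index target_idx.
 * The PC seen by the branch is already two words past the branch itself.
 */
static int32_t arm_branch_word_offset(uint32_t branch_idx, uint32_t target_idx)
{
	return (int32_t)target_idx - (int32_t)branch_idx - 2;
}
```

This matches epilogue_offset() as long as ctx->idx is read before the branch is emitted, so that it is the index of the branch itself; emit_udivmod() below calls it that way.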
>>
>> -static inline void emit_swap16(u8 r_dst, u8 r_src, struct jit_ctx *ctx)
>> +static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx, u8 op)
>>  {
>> -       /* r_dst = (r_src << 8) | (r_src >> 8) */
>> -       emit(ARM_LSL_I(ARM_R1, r_src, 8), ctx);
>> -       emit(ARM_ORR_S(r_dst, ARM_R1, r_src, SRTYPE_LSR, 8), ctx);
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       s32 jmp_offset;
>> +
>> +       /* checks if divisor is zero or not. If it is, then
>> +        * exit directly.
>> +        */
>> +       emit(ARM_CMP_I(rn, 0), ctx);
>> +       _emit(ARM_COND_EQ, ARM_MOV_I(ARM_R0, 0), ctx);
>> +       jmp_offset = epilogue_offset(ctx);
>> +       _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
>> +#if __LINUX_ARM_ARCH__ == 7
>> +       if (elf_hwcap & HWCAP_IDIVA) {
>> +               if (op == BPF_DIV)
>> +                       emit(ARM_UDIV(rd, rm, rn), ctx);
>> +               else {
>> +                       emit(ARM_UDIV(ARM_IP, rm, rn), ctx);
>> +                       emit(ARM_MLS(rd, rn, ARM_IP, rm), ctx);
>> +               }
>> +               return;
>> +       }
>> +#endif
>>
>>         /*
>> -        * we need to mask out the bits set in r_dst[23:16] due to
>> -        * the first shift instruction.
>> -        *
>> -        * note that 0x8ff is the encoded immediate 0x00ff0000.
>> +        * For BPF_ALU | BPF_DIV | BPF_K instructions:
>> +        * As ARM_R1 and ARM_R0 contain the 1st argument of the bpf
>> +        * function, we need to save them on the caller side to keep
>> +        * them from being clobbered by the callee.
>> +        * After the callee returns, we restore ARM_R0 and
>> +        * ARM_R1.
>>          */
>> -       emit(ARM_BIC_I(r_dst, r_dst, 0x8ff), ctx);
>> -}
>> +       if (rn != ARM_R1) {
>> +               emit(ARM_MOV_R(tmp[0], ARM_R1), ctx);
>> +               emit(ARM_MOV_R(ARM_R1, rn), ctx);
>> +       }
>> +       if (rm != ARM_R0) {
>> +               emit(ARM_MOV_R(tmp[1], ARM_R0), ctx);
>> +               emit(ARM_MOV_R(ARM_R0, rm), ctx);
>> +       }
>>
>> -#else  /* ARMv6+ */
>> +       /* Call appropriate function */
>> +       ctx->seen |= SEEN_CALL;
>> +       emit_mov_i(ARM_IP, op == BPF_DIV ?
>> +                  (u32)jit_udiv32 : (u32)jit_mod32, ctx);
>> +       emit_blx_r(ARM_IP, ctx);
>>
>> -static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
>> -{
>> -       _emit(cond, ARM_LDR_I(r_res, r_addr, 0), ctx);
>> -#ifdef __LITTLE_ENDIAN
>> -       _emit(cond, ARM_REV(r_res, r_res), ctx);
>> -#endif
>> +       /* Save return value */
>> +       if (rd != ARM_R0)
>> +               emit(ARM_MOV_R(rd, ARM_R0), ctx);
>> +
>> +       /* Restore ARM_R0 and ARM_R1 */
>> +       if (rn != ARM_R1)
>> +               emit(ARM_MOV_R(ARM_R1, tmp[0]), ctx);
>> +       if (rm != ARM_R0)
>> +               emit(ARM_MOV_R(ARM_R0, tmp[1]), ctx);
>>  }
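On the HWCAP_IDIVA path there is no hardware remainder instruction, so the modulo case is built from UDIV followed by MLS (multiply and subtract). A sketch of what that pair computes, in plain C; `mod_via_udiv_mls` is an illustrative name:

```c
#include <stdint.h>

/*
 * Equivalent of:
 *	udiv ip, rm, rn		@ ip = rm / rn
 *	mls  rd, rn, ip, rm	@ rd = rm - rn * ip
 * The caller must guarantee rn != 0; the JIT emits a zero check first.
 */
static uint32_t mod_via_udiv_mls(uint32_t rm, uint32_t rn)
{
	uint32_t q = rm / rn;

	return rm - rn * q;
}
```

The result equals `rm % rn` for every non-zero divisor, which is also what the jit_mod32() fallback returns.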
>>
>> -static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
>> +/* Checks whether BPF register is on scratch stack space or not. */
>> +static inline bool is_on_stack(u8 bpf_reg)
>>  {
>> -       _emit(cond, ARM_LDRH_I(r_res, r_addr, 0), ctx);
>> -#ifdef __LITTLE_ENDIAN
>> -       _emit(cond, ARM_REV16(r_res, r_res), ctx);
>> -#endif
>> +       static const u8 stack_regs[] = {BPF_REG_AX, BPF_REG_3, BPF_REG_4,
>> +                               BPF_REG_5, BPF_REG_7, BPF_REG_8, BPF_REG_9,
>> +                               TCALL_CNT, BPF_REG_2, BPF_REG_FP};
>> +       int i, reg_len = ARRAY_SIZE(stack_regs);
>> +
>> +       for (i = 0 ; i < reg_len ; i++) {
>> +               if (bpf_reg == stack_regs[i])
>> +                       return true;
>> +       }
>> +       return false;
>>  }
>>
>> -static inline void emit_swap16(u8 r_dst __maybe_unused,
>> -                              u8 r_src __maybe_unused,
>> -                              struct jit_ctx *ctx __maybe_unused)
>> +static inline void emit_a32_mov_i(const u8 dst, const u32 val,
>> +                                 bool dstk, struct jit_ctx *ctx)
>>  {
>> -#ifdef __LITTLE_ENDIAN
>> -       emit(ARM_REV16(r_dst, r_src), ctx);
>> -#endif
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +
>> +       if (dstk) {
>> +               emit_mov_i(tmp[1], val, ctx);
>> +               emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(dst)), ctx);
>> +       } else {
>> +               emit_mov_i(dst, val, ctx);
>> +       }
>>  }
>>
>> -#endif /* __LINUX_ARM_ARCH__ < 6 */
>> +/* Sign extended move */
>> +static inline void emit_a32_mov_i64(const bool is64, const u8 dst[],
>> +                                 const u32 val, bool dstk,
>> +                                 struct jit_ctx *ctx) {
>> +       u32 hi = 0;
>>
>> +       if (is64 && (val & (1U << 31)))
>> +               hi = (u32)~0;
>> +       emit_a32_mov_i(dst_lo, val, dstk, ctx);
>> +       emit_a32_mov_i(dst_hi, hi, dstk, ctx);
>> +}
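The effect of emit_a32_mov_i64() on the register pair can be modelled directly: the low word is the immediate, and the high word is either zero or all ones depending on the sign bit and on whether the operation is 64-bit. A user-space sketch; `mov_imm` is a made-up name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Value the {hi, lo} pair holds after a move of a 32-bit immediate. */
static uint64_t mov_imm(bool is64, uint32_t val)
{
	uint32_t hi = 0;

	/* 64-bit moves sign-extend; 32-bit moves zero the high word. */
	if (is64 && (val & (1U << 31)))
		hi = ~0U;

	return ((uint64_t)hi << 32) | val;
}
```

So a 64-bit move of -2 (0xfffffffe) produces 0xfffffffffffffffe, while the 32-bit form of the same instruction produces 0x00000000fffffffe.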
>>
>> -/* Compute the immediate value for a PC-relative branch. */
>> -static inline u32 b_imm(unsigned tgt, struct jit_ctx *ctx)
>> -{
>> -       u32 imm;
>> +static inline void emit_a32_add_r(const u8 dst, const u8 src,
>> +                             const bool is64, const bool hi,
>> +                             struct jit_ctx *ctx) {
>> +       /* 64 bit :
>> +        *      adds dst_lo, dst_lo, src_lo
>> +        *      adc dst_hi, dst_hi, src_hi
>> +        * 32 bit :
>> +        *      add dst_lo, dst_lo, src_lo
>> +        */
>> +       if (!hi && is64)
>> +               emit(ARM_ADDS_R(dst, dst, src), ctx);
>> +       else if (hi && is64)
>> +               emit(ARM_ADC_R(dst, dst, src), ctx);
>> +       else
>> +               emit(ARM_ADD_R(dst, dst, src), ctx);
>> +}
>>
>> -       if (ctx->target == NULL)
>> -               return 0;
>> -       /*
>> -        * BPF allows only forward jumps and the offset of the target is
>> -        * still the one computed during the first pass.
>> +static inline void emit_a32_sub_r(const u8 dst, const u8 src,
>> +                                 const bool is64, const bool hi,
>> +                                 struct jit_ctx *ctx) {
>> +       /* 64 bit :
>> +        *      subs dst_lo, dst_lo, src_lo
>> +        *      sbc dst_hi, dst_hi, src_hi
>> +        * 32 bit :
>> +        *      sub dst_lo, dst_lo, src_lo
>>          */
>> -       imm  = ctx->offsets[tgt] + ctx->prologue_bytes - (ctx->idx * 4 + 8);
>> +       if (!hi && is64)
>> +               emit(ARM_SUBS_R(dst, dst, src), ctx);
>> +       else if (hi && is64)
>> +               emit(ARM_SBC_R(dst, dst, src), ctx);
>> +       else
>> +               emit(ARM_SUB_R(dst, dst, src), ctx);
>> +}
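The adds/adc and subs/sbc pairs rely on the carry flag to link the two halves. The same arithmetic can be written out in C with an explicit carry or borrow, which may help when checking the emitted sequences by hand. `struct u64_pair`, `add64` and `sub64` are illustrative names only:

```c
#include <stdint.h>

struct u64_pair {
	uint32_t hi;
	uint32_t lo;
};

/* adds lo, lo, src_lo; adc hi, hi, src_hi */
static struct u64_pair add64(struct u64_pair a, struct u64_pair b)
{
	struct u64_pair r;
	uint32_t carry;

	r.lo = a.lo + b.lo;
	carry = r.lo < a.lo;		/* unsigned wrap means carry out */
	r.hi = a.hi + b.hi + carry;
	return r;
}

/* subs lo, lo, src_lo; sbc hi, hi, src_hi */
static struct u64_pair sub64(struct u64_pair a, struct u64_pair b)
{
	struct u64_pair r;
	uint32_t borrow = a.lo < b.lo;	/* low half needs to borrow */

	r.lo = a.lo - b.lo;
	r.hi = a.hi - b.hi - borrow;
	return r;
}
```

This is also why the low half has to be emitted first: the high-half instruction consumes the flag that the low-half instruction sets.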
>>
>> -       return imm >> 2;
>> +static inline void emit_alu_r(const u8 dst, const u8 src, const bool is64,
>> +                             const bool hi, const u8 op, struct jit_ctx *ctx){
>> +       switch (BPF_OP(op)) {
>> +       /* dst = dst + src */
>> +       case BPF_ADD:
>> +               emit_a32_add_r(dst, src, is64, hi, ctx);
>> +               break;
>> +       /* dst = dst - src */
>> +       case BPF_SUB:
>> +               emit_a32_sub_r(dst, src, is64, hi, ctx);
>> +               break;
>> +       /* dst = dst | src */
>> +       case BPF_OR:
>> +               emit(ARM_ORR_R(dst, dst, src), ctx);
>> +               break;
>> +       /* dst = dst & src */
>> +       case BPF_AND:
>> +               emit(ARM_AND_R(dst, dst, src), ctx);
>> +               break;
>> +       /* dst = dst ^ src */
>> +       case BPF_XOR:
>> +               emit(ARM_EOR_R(dst, dst, src), ctx);
>> +               break;
>> +       /* dst = dst * src */
>> +       case BPF_MUL:
>> +               emit(ARM_MUL(dst, dst, src), ctx);
>> +               break;
>> +       /* dst = dst << src */
>> +       case BPF_LSH:
>> +               emit(ARM_LSL_R(dst, dst, src), ctx);
>> +               break;
>> +       /* dst = dst >> src */
>> +       case BPF_RSH:
>> +               emit(ARM_LSR_R(dst, dst, src), ctx);
>> +               break;
>> +       /* dst = dst >> src (signed)*/
>> +       case BPF_ARSH:
>> +               emit(ARM_MOV_SR(dst, dst, SRTYPE_ASR, src), ctx);
>> +               break;
>> +       }
>>  }
>>
>> -#define OP_IMM3(op, r1, r2, imm_val, ctx)                              \
>> -       do {                                                            \
>> -               imm12 = imm8m(imm_val);                                 \
>> -               if (imm12 < 0) {                                        \
>> -                       emit_mov_i_no8m(r_scratch, imm_val, ctx);       \
>> -                       emit(op ## _R((r1), (r2), r_scratch), ctx);     \
>> -               } else {                                                \
>> -                       emit(op ## _I((r1), (r2), imm12), ctx);         \
>> -               }                                                       \
>> -       } while (0)
>> -
>> -static inline void emit_err_ret(u8 cond, struct jit_ctx *ctx)
>> -{
>> -       if (ctx->ret0_fp_idx >= 0) {
>> -               _emit(cond, ARM_B(b_imm(ctx->ret0_fp_idx, ctx)), ctx);
>> -               /* NOP to keep the size constant between passes */
>> -               emit(ARM_MOV_R(ARM_R0, ARM_R0), ctx);
>> +/* ALU operation (32 bit)
>> + * dst = dst (op) src
>> + */
>> +static inline void emit_a32_alu_r(const u8 dst, const u8 src,
>> +                                 bool dstk, bool sstk,
>> +                                 struct jit_ctx *ctx, const bool is64,
>> +                                 const bool hi, const u8 op) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       u8 rn = sstk ? tmp[1] : src;
>> +
>> +       if (sstk)
>> +               emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src)), ctx);
>> +
>> +       /* ALU operation */
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
>> +               emit_alu_r(tmp[0], rn, is64, hi, op, ctx);
>> +               emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
>>         } else {
>> -               _emit(cond, ARM_MOV_I(ARM_R0, 0), ctx);
>> -               _emit(cond, ARM_B(b_imm(ctx->skf->len, ctx)), ctx);
>> +               emit_alu_r(dst, rn, is64, hi, op, ctx);
>>         }
>>  }
>>
>> -static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
>> -{
>> -#if __LINUX_ARM_ARCH__ < 5
>> -       emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
>> +/* ALU operation (64 bit) */
>> +static inline void emit_a32_alu_r64(const bool is64, const u8 dst[],
>> +                                 const u8 src[], bool dstk,
>> +                                 bool sstk, struct jit_ctx *ctx,
>> +                                 const u8 op) {
>> +       emit_a32_alu_r(dst_lo, src_lo, dstk, sstk, ctx, is64, false, op);
>> +       if (is64)
>> +               emit_a32_alu_r(dst_hi, src_hi, dstk, sstk, ctx, is64, true, op);
>> +       else
>> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>> +}
>>
>> -       if (elf_hwcap & HWCAP_THUMB)
>> -               emit(ARM_BX(tgt_reg), ctx);
>> +/* dst = src (4 bytes) */
>> +static inline void emit_a32_mov_r(const u8 dst, const u8 src,
>> +                                 bool dstk, bool sstk,
>> +                                 struct jit_ctx *ctx) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       u8 rt = sstk ? tmp[0] : src;
>> +
>> +       if (sstk)
>> +               emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(src)), ctx);
>> +       if (dstk)
>> +               emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst)), ctx);
>>         else
>> -               emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
>> -#else
>> -       emit(ARM_BLX_R(tgt_reg), ctx);
>> -#endif
>> +               emit(ARM_MOV_R(dst, rt), ctx);
>>  }
>>
>> -static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx,
>> -                               int bpf_op)
>> -{
>> -#if __LINUX_ARM_ARCH__ == 7
>> -       if (elf_hwcap & HWCAP_IDIVA) {
>> -               if (bpf_op == BPF_DIV)
>> -                       emit(ARM_UDIV(rd, rm, rn), ctx);
>> -               else {
>> -                       emit(ARM_UDIV(ARM_R3, rm, rn), ctx);
>> -                       emit(ARM_MLS(rd, rn, ARM_R3, rm), ctx);
>> -               }
>> -               return;
>> +/* dst = src */
>> +static inline void emit_a32_mov_r64(const bool is64, const u8 dst[],
>> +                                 const u8 src[], bool dstk,
>> +                                 bool sstk, struct jit_ctx *ctx) {
>> +       emit_a32_mov_r(dst_lo, src_lo, dstk, sstk, ctx);
>> +       if (is64) {
>> +               /* complete 8 byte move */
>> +               emit_a32_mov_r(dst_hi, src_hi, dstk, sstk, ctx);
>> +       } else {
>> +               /* Zero out high 4 bytes */
>> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>>         }
>> -#endif
>> +}
>>
>> -       /*
>> -        * For BPF_ALU | BPF_DIV | BPF_K instructions, rm is ARM_R4
>> -        * (r_A) and rn is ARM_R0 (r_scratch) so load rn first into
>> -        * ARM_R1 to avoid accidentally overwriting ARM_R0 with rm
>> -        * before using it as a source for ARM_R1.
>> -        *
>> -        * For BPF_ALU | BPF_DIV | BPF_X rm is ARM_R4 (r_A) and rn is
>> -        * ARM_R5 (r_X) so there is no particular register overlap
>> -        * issues.
>> -        */
>> -       if (rn != ARM_R1)
>> -               emit(ARM_MOV_R(ARM_R1, rn), ctx);
>> -       if (rm != ARM_R0)
>> -               emit(ARM_MOV_R(ARM_R0, rm), ctx);
>> +/* Shift and negate operations with an immediate operand (32 bit) */
>> +static inline void emit_a32_alu_i(const u8 dst, const u32 val, bool dstk,
>> +                               struct jit_ctx *ctx, const u8 op) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       u8 rd = dstk ? tmp[0] : dst;
>> +
>> +       if (dstk)
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
>> +
>> +       /* Do the shift or negate operation */
>> +       switch (op) {
>> +       case BPF_LSH:
>> +               emit(ARM_LSL_I(rd, rd, val), ctx);
>> +               break;
>> +       case BPF_RSH:
>> +               emit(ARM_LSR_I(rd, rd, val), ctx);
>> +               break;
>> +       case BPF_NEG:
>> +               emit(ARM_RSB_I(rd, rd, val), ctx);
>> +               break;
>> +       }
>> +
>> +       if (dstk)
>> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
>> +}
>> +
>> +/* dst = -dst (64 bit) */
>> +static inline void emit_a32_neg64(const u8 dst[], bool dstk,
>> +                               struct jit_ctx *ctx){
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       u8 rd = dstk ? tmp[1] : dst[1];
>> +       u8 rm = dstk ? tmp[0] : dst[0];
>> +
>> +       /* Setup Operand */
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do Negate Operation */
>> +       emit(ARM_RSBS_I(rd, rd, 0), ctx);
>> +       emit(ARM_RSC_I(rm, rm, 0), ctx);
>> +
>> +       if (dstk) {
>> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +}
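The rsbs/rsc pair computes an arithmetic negation (0 - dst), not a bitwise complement. A sketch of the same two steps on 32-bit halves, with the borrow made explicit; `neg64_halves` is a hypothetical name:

```c
#include <stdint.h>

/*
 * rsbs lo, lo, #0	@ lo = 0 - lo, borrow if lo was non-zero
 * rsc  hi, hi, #0	@ hi = 0 - hi - borrow
 */
static uint64_t neg64_halves(uint64_t v)
{
	uint32_t lo = (uint32_t)v;
	uint32_t hi = (uint32_t)(v >> 32);
	uint32_t borrow = lo != 0;
	uint32_t new_lo = 0 - lo;
	uint32_t new_hi = 0 - hi - borrow;

	return ((uint64_t)new_hi << 32) | new_lo;
}
```

For any input the result matches a native 64-bit `0 - v`, for example 1 becomes 0xffffffffffffffff.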
>>
>> +/* dst = dst << src */
>> +static inline void emit_a32_lsh_r64(const u8 dst[], const u8 src[], bool dstk,
>> +                                   bool sstk, struct jit_ctx *ctx) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +
>> +       /* Setup Operands */
>> +       u8 rt = sstk ? tmp2[1] : src_lo;
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +
>> +       if (sstk)
>> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do LSH operation */
>> +       emit(ARM_SUB_I(ARM_IP, rt, 32), ctx);
>> +       emit(ARM_RSB_I(tmp2[0], rt, 32), ctx);
>> +       /* As we are using ARM_LR */
>>         ctx->seen |= SEEN_CALL;
>> -       emit_mov_i(ARM_R3, bpf_op == BPF_DIV ? (u32)jit_udiv : (u32)jit_mod,
>> -                  ctx);
>> -       emit_blx_r(ARM_R3, ctx);
>> +       emit(ARM_MOV_SR(ARM_LR, rm, SRTYPE_ASL, rt), ctx);
>> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rd, SRTYPE_ASL, ARM_IP), ctx);
>> +       emit(ARM_ORR_SR(ARM_IP, ARM_LR, rd, SRTYPE_LSR, tmp2[0]), ctx);
>> +       emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_ASL, rt), ctx);
>> +
>> +       if (dstk) {
>> +               emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       } else {
>> +               emit(ARM_MOV_R(rd, ARM_LR), ctx);
>> +               emit(ARM_MOV_R(rm, ARM_IP), ctx);
>> +       }
>> +}
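The shift sequence works without branches because of how ARM treats register-specified shift counts: only the low byte of the count register is used, and an LSL or LSR by 32 or more produces 0. Under that model the four data instructions can be emulated in C and compared against a native 64-bit shift. The helper names below are illustrative, and the emulation is a sketch of the quoted sequence, not a substitute for testing on hardware:

```c
#include <stdint.h>

/* Register-specified LSL: low byte of the count, 32 or more gives 0. */
static uint32_t arm_lsl(uint32_t x, uint32_t n)
{
	n &= 0xff;
	return n < 32 ? x << n : 0;
}

/* Register-specified LSR: same count rules as LSL. */
static uint32_t arm_lsr(uint32_t x, uint32_t n)
{
	n &= 0xff;
	return n < 32 ? x >> n : 0;
}

/* Mirrors emit_a32_lsh_r64() for shift counts 0..63. */
static uint64_t lsh64(uint64_t v, uint32_t rt)
{
	uint32_t lo = (uint32_t)v;
	uint32_t hi = (uint32_t)(v >> 32);
	uint32_t ip = rt - 32;			/* sub ip, rt, #32 */
	uint32_t t = 32 - rt;			/* rsb tmp2[0], rt, #32 */
	uint32_t new_hi;

	new_hi = arm_lsl(hi, rt);		/* mov lr, rm, lsl rt */
	new_hi |= arm_lsl(lo, ip);		/* orr lr, lr, rd, lsl ip */
	new_hi |= arm_lsr(lo, t);		/* orr ip, lr, rd, lsr tmp2[0] */

	return ((uint64_t)new_hi << 32) | arm_lsl(lo, rt); /* mov lr, rd, lsl rt */
}
```

For counts below 32 the `lsl ip` term vanishes (its count wraps to a value of 32 or more) and the `lsr` term carries the bits that cross from the low word into the high word; for counts of 32 and above the roles swap.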
>>
>> -       if (rd != ARM_R0)
>> -               emit(ARM_MOV_R(rd, ARM_R0), ctx);
>> +/* dst = dst >> src (signed)*/
>> +static inline void emit_a32_arsh_r64(const u8 dst[], const u8 src[], bool dstk,
>> +                                   bool sstk, struct jit_ctx *ctx) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       /* Setup Operands */
>> +       u8 rt = sstk ? tmp2[1] : src_lo;
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +
>> +       if (sstk)
>> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do the ARSH operation */
>> +       emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
>> +       emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
>> +       /* As we are using ARM_LR */
>> +       ctx->seen |= SEEN_CALL;
>> +       emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
>> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
>> +       _emit(ARM_COND_MI, ARM_B(0), ctx);
>> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASR, tmp2[0]), ctx);
>> +       emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_ASR, rt), ctx);
>> +       if (dstk) {
>> +               emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       } else {
>> +               emit(ARM_MOV_R(rd, ARM_LR), ctx);
>> +               emit(ARM_MOV_R(rm, ARM_IP), ctx);
>> +       }
>>  }
>>
>> -static inline void update_on_xread(struct jit_ctx *ctx)
>> +/* dst = dst >> src */
>> +static inline void emit_a32_lsr_r64(const u8 dst[], const u8 src[], bool dstk,
>> +                                    bool sstk, struct jit_ctx *ctx) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       /* Setup Operands */
>> +       u8 rt = sstk ? tmp2[1] : src_lo;
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +
>> +       if (sstk)
>> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do LSR operation */
>> +       emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
>> +       emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
>> +       /* As we are using ARM_LR */
>> +       ctx->seen |= SEEN_CALL;
>> +       emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
>> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
>> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_LSR, tmp2[0]), ctx);
>> +       emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_LSR, rt), ctx);
>> +       if (dstk) {
>> +               emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       } else {
>> +               emit(ARM_MOV_R(rd, ARM_LR), ctx);
>> +               emit(ARM_MOV_R(rm, ARM_IP), ctx);
>> +       }
>> +}
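
As a reading aid for the register-shift helpers, here is a minimal host-side
sketch of what the LSR sequence above appears to compute. It assumes the
usual A32 rule that a register-specified LSL/LSR uses only the bottom byte
of the shift register and yields 0 for amounts of 32 or more. The names
arm_lsl, arm_lsr and model_lsr_r64 are hypothetical and not part of the
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed A32 behaviour: only the low byte of rs counts, >= 32 gives 0. */
static uint32_t arm_lsl(uint32_t x, uint32_t rs)
{
	uint32_t n = rs & 0xff;

	return n < 32 ? x << n : 0;
}

static uint32_t arm_lsr(uint32_t x, uint32_t rs)
{
	uint32_t n = rs & 0xff;

	return n < 32 ? x >> n : 0;
}

/* Hypothetical model of the quoted LSR sequence on two 32-bit halves. */
uint64_t model_lsr_r64(uint64_t v, uint32_t rt)
{
	uint32_t lo = (uint32_t)v;
	uint32_t hi = (uint32_t)(v >> 32);
	uint32_t ip = 32 - rt;		/* RSB  ip, rt, #32 */
	uint32_t t = rt - 32;		/* SUBS t, rt, #32 */
	uint32_t lr;

	lr = arm_lsr(lo, rt);		/* MOV lr, lo, LSR rt */
	lr |= arm_lsl(hi, ip);		/* ORR lr, lr, hi, LSL ip */
	lr |= arm_lsr(hi, t);		/* ORR lr, lr, hi, LSR t */
	ip = arm_lsr(hi, rt);		/* MOV ip, hi, LSR rt */

	return ((uint64_t)ip << 32) | lr;
}
```

Under the same assumption, the arithmetic variant cannot OR in the third
term unconditionally, because ASR by 32 or more fills with the sign bit
rather than 0. That appears to be why emit_a32_arsh_r64() skips that one
instruction with a conditional branch when rt - 32 is negative.
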
>> +
>> +/* dst = dst << val */
>> +static inline void emit_a32_lsh_i64(const u8 dst[], bool dstk,
>> +                                    const u32 val, struct jit_ctx *ctx){
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       /* Setup operands */
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do LSH operation */
>> +       if (val < 32) {
>> +               emit(ARM_MOV_SI(tmp2[0], rm, SRTYPE_ASL, val), ctx);
>> +               emit(ARM_ORR_SI(rm, tmp2[0], rd, SRTYPE_LSR, 32 - val), ctx);
>> +               emit(ARM_MOV_SI(rd, rd, SRTYPE_ASL, val), ctx);
>> +       } else {
>> +               if (val == 32)
>> +                       emit(ARM_MOV_R(rm, rd), ctx);
>> +               else
>> +                       emit(ARM_MOV_SI(rm, rd, SRTYPE_ASL, val - 32), ctx);
>> +               emit(ARM_EOR_R(rd, rd, rd), ctx);
>> +       }
>> +
>> +       if (dstk) {
>> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +}
>> +
>> +/* dst = dst >> val */
>> +static inline void emit_a32_lsr_i64(const u8 dst[], bool dstk,
>> +                                   const u32 val, struct jit_ctx *ctx) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       /* Setup operands */
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do LSR operation */
>> +       if (val < 32) {
>> +               emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
>> +               emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
>> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_LSR, val), ctx);
>> +       } else if (val == 32) {
>> +               emit(ARM_MOV_R(rd, rm), ctx);
>> +               emit(ARM_MOV_I(rm, 0), ctx);
>> +       } else {
>> +               emit(ARM_MOV_SI(rd, rm, SRTYPE_LSR, val - 32), ctx);
>> +               emit(ARM_MOV_I(rm, 0), ctx);
>> +       }
>> +
>> +       if (dstk) {
>> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +}
>> +
>> +/* dst = dst >> val (signed) */
>> +static inline void emit_a32_arsh_i64(const u8 dst[], bool dstk,
>> +                                    const u32 val, struct jit_ctx *ctx){
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       /* Setup operands */
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do ARSH operation */
>> +       if (val < 32) {
>> +               emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
>> +               emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
>> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, val), ctx);
>> +       } else if (val == 32) {
>> +               emit(ARM_MOV_R(rd, rm), ctx);
>> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
>> +       } else {
>> +               emit(ARM_MOV_SI(rd, rm, SRTYPE_ASR, val - 32), ctx);
>> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
>> +       }
>> +
>> +       if (dstk) {
>> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +}
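
For the immediate-shift helpers, a small host-side sketch of the three
cases (val < 32, val == 32, val > 32) may help when checking the emitted
sequences. It is only a model: asr32, model_lsr_i64 and model_arsh_i64 are
hypothetical names, and the sketch treats val == 0 as a no-op. In the quoted
val < 32 branch a zero shift would compute 32 - val = 32, which may be worth
checking separately against the immediate-shift encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right of a 32-bit word by n (0..31), written portably. */
static uint32_t asr32(uint32_t x, uint32_t n)
{
	uint32_t fill = (x >> 31) ? ~(UINT32_MAX >> n) : 0;

	return (x >> n) | fill;
}

/* Hypothetical model of dst = dst >> val (logical) on two 32-bit halves. */
uint64_t model_lsr_i64(uint64_t v, uint32_t val)
{
	uint32_t lo = (uint32_t)v;
	uint32_t hi = (uint32_t)(v >> 32);

	if (val == 0)
		return v;		/* assumption: shift by 0 is a no-op */
	if (val < 32) {
		lo = (lo >> val) | (hi << (32 - val));
		hi >>= val;
	} else if (val == 32) {
		lo = hi;
		hi = 0;
	} else {
		lo = hi >> (val - 32);
		hi = 0;
	}
	return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical model of dst = dst >> val (signed) on two 32-bit halves. */
uint64_t model_arsh_i64(uint64_t v, uint32_t val)
{
	uint32_t lo = (uint32_t)v;
	uint32_t hi = (uint32_t)(v >> 32);

	if (val == 0)
		return v;		/* assumption: shift by 0 is a no-op */
	if (val < 32) {
		lo = (lo >> val) | (hi << (32 - val));
		hi = asr32(hi, val);
	} else if (val == 32) {
		lo = hi;
		hi = asr32(hi, 31);	/* all sign bits */
	} else {
		lo = asr32(hi, val - 32);
		hi = asr32(hi, 31);
	}
	return ((uint64_t)hi << 32) | lo;
}
```
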
>> +
>> +static inline void emit_a32_mul_r64(const u8 dst[], const u8 src[], bool dstk,
>> +                                   bool sstk, struct jit_ctx *ctx) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       /* Setup operands for multiplication */
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +       u8 rt = sstk ? tmp2[1] : src_lo;
>> +       u8 rn = sstk ? tmp2[0] : src_hi;
>> +
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +       if (sstk) {
>> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +               emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_hi)), ctx);
>> +       }
>> +
>> +       /* Do Multiplication */
>> +       emit(ARM_MUL(ARM_IP, rd, rn), ctx);
>> +       emit(ARM_MUL(ARM_LR, rm, rt), ctx);
>> +       /* As we are using ARM_LR */
>> +       ctx->seen |= SEEN_CALL;
>> +       emit(ARM_ADD_R(ARM_LR, ARM_IP, ARM_LR), ctx);
>> +
>> +       emit(ARM_UMULL(ARM_IP, rm, rd, rt), ctx);
>> +       emit(ARM_ADD_R(rm, ARM_LR, rm), ctx);
>> +       if (dstk) {
>> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       } else {
>> +               emit(ARM_MOV_R(rd, ARM_IP), ctx);
>> +       }
>> +}
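
The multiplication above follows the usual schoolbook split: the two cross
products only affect the high word, and the low-by-low product needs the
full 64-bit result (UMULL). A minimal host-side sketch, with the
hypothetical name model_mul_r64, that mirrors the MUL/MUL/ADD/UMULL/ADD
order:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of dst = dst * src built from 32-bit halves. */
uint64_t model_mul_r64(uint64_t dst, uint64_t src)
{
	uint32_t dlo = (uint32_t)dst, dhi = (uint32_t)(dst >> 32);
	uint32_t slo = (uint32_t)src, shi = (uint32_t)(src >> 32);

	/* MUL ip, dlo, shi; MUL lr, dhi, slo; ADD lr, ip, lr (mod 2^32) */
	uint32_t cross = dlo * shi + dhi * slo;

	/* UMULL: full 64-bit product of the two low words */
	uint64_t full = (uint64_t)dlo * slo;

	uint32_t lo = (uint32_t)full;
	uint32_t hi = (uint32_t)(full >> 32) + cross;	/* ADD rm, lr, rm */

	return ((uint64_t)hi << 32) | lo;
}
```

The dhi * shi term is dropped on purpose in the sketch, since it only
contributes to bits 64 and above.
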
>> +
>> +/* *(size *)(dst + off) = src */
>> +static inline void emit_str_r(const u8 dst, const u8 src, bool dstk,
>> +                             const s32 off, struct jit_ctx *ctx, const u8 sz){
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       u8 rd = dstk ? tmp[1] : dst;
>> +
>> +       if (dstk)
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
>> +       if (off) {
>> +               emit_a32_mov_i(tmp[0], off, false, ctx);
>> +               emit(ARM_ADD_R(tmp[0], rd, tmp[0]), ctx);
>> +               rd = tmp[0];
>> +       }
>> +       switch (sz) {
>> +       case BPF_W:
>> +               /* Store a Word */
>> +               emit(ARM_STR_I(src, rd, 0), ctx);
>> +               break;
>> +       case BPF_H:
>> +               /* Store a HalfWord */
>> +               emit(ARM_STRH_I(src, rd, 0), ctx);
>> +               break;
>> +       case BPF_B:
>> +               /* Store a Byte */
>> +               emit(ARM_STRB_I(src, rd, 0), ctx);
>> +               break;
>> +       }
>> +}
>> +
>> +/* dst = *(size*)(src + off) */
>> +static inline void emit_ldx_r(const u8 dst, const u8 src, bool dstk,
>> +                             const s32 off, struct jit_ctx *ctx, const u8 sz){
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       u8 rd = dstk ? tmp[1] : dst;
>> +       u8 rm = src;
>> +
>> +       if (off) {
>> +               emit_a32_mov_i(tmp[0], off, false, ctx);
>> +               emit(ARM_ADD_R(tmp[0], tmp[0], src), ctx);
>> +               rm = tmp[0];
>> +       }
>> +       switch (sz) {
>> +       case BPF_W:
>> +               /* Load a Word */
>> +               emit(ARM_LDR_I(rd, rm, 0), ctx);
>> +               break;
>> +       case BPF_H:
>> +               /* Load a HalfWord */
>> +               emit(ARM_LDRH_I(rd, rm, 0), ctx);
>> +               break;
>> +       case BPF_B:
>> +               /* Load a Byte */
>> +               emit(ARM_LDRB_I(rd, rm, 0), ctx);
>> +               break;
>> +       }
>> +       if (dstk)
>> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
>> +}
>> +
>> +/* Arithmetic Operation */
>> +static inline void emit_ar_r(const u8 rd, const u8 rt, const u8 rm,
>> +                            const u8 rn, struct jit_ctx *ctx, u8 op) {
>> +       switch (op) {
>> +       case BPF_JSET:
>> +               ctx->seen |= SEEN_CALL;
>> +               emit(ARM_AND_R(ARM_IP, rt, rn), ctx);
>> +               emit(ARM_AND_R(ARM_LR, rd, rm), ctx);
>> +               emit(ARM_ORRS_R(ARM_IP, ARM_LR, ARM_IP), ctx);
>> +               break;
>> +       case BPF_JEQ:
>> +       case BPF_JNE:
>> +       case BPF_JGT:
>> +       case BPF_JGE:
>> +               emit(ARM_CMP_R(rd, rm), ctx);
>> +               _emit(ARM_COND_EQ, ARM_CMP_R(rt, rn), ctx);
>> +               break;
>> +       case BPF_JSGT:
>> +               emit(ARM_CMP_R(rn, rt), ctx);
>> +               emit(ARM_SBCS_R(ARM_IP, rm, rd), ctx);
>> +               break;
>> +       case BPF_JSGE:
>> +               emit(ARM_CMP_R(rt, rn), ctx);
>> +               emit(ARM_SBCS_R(ARM_IP, rd, rm), ctx);
>> +               break;
>> +       }
>> +}
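
To make the flag logic in emit_ar_r() easier to check, here is a host-side
sketch of the two compare patterns. It assumes rd/rt hold the high/low words
of dst and rm/rn the high/low words of src (the call sites are not in this
hunk), and that the caller branches on CS for JGE and on GE (N == V) for
JSGE. model_jge and model_jsge are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* CMP hi, hi; CMPEQ lo, lo; then test carry (unsigned >=). */
int model_jge(uint64_t a, uint64_t b)
{
	uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
	uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);

	if (ahi != bhi)
		return ahi >= bhi;	/* flags come from the high compare */
	return alo >= blo;		/* CMPEQ overwrote them */
}

/* CMP lo, lo; SBCS tmp, hi, hi; then test N == V (signed >=). */
int model_jsge(int64_t sa, int64_t sb)
{
	uint64_t a = (uint64_t)sa, b = (uint64_t)sb;
	uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
	uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);

	uint32_t carry = alo >= blo;		/* C set means no borrow */
	uint32_t res = ahi - bhi - (1 - carry);	/* SBCS */
	uint32_t n = res >> 31;
	uint32_t v = ((ahi ^ bhi) & (ahi ^ res)) >> 31;

	return n == v;
}
```
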
>> +
>> +static int out_offset = -1; /* initialized on the first pass of build_body() */
>> +static int emit_bpf_tail_call(struct jit_ctx *ctx)
>> +{
>> +
>> +       /* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
>> +       const u8 *r2 = bpf2a32[BPF_REG_2];
>> +       const u8 *r3 = bpf2a32[BPF_REG_3];
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       const u8 *tcc = bpf2a32[TCALL_CNT];
>> +       const int idx0 = ctx->idx;
>> +#define cur_offset (ctx->idx - idx0)
>> +#define jmp_offset (out_offset - (cur_offset))
>> +       u32 off, lo, hi;
>> +
>> +       /* if (index >= array->map.max_entries)
>> +        *      goto out;
>> +        */
>> +       off = offsetof(struct bpf_array, map.max_entries);
>> +       /* array->map.max_entries */
>> +       emit_a32_mov_i(tmp[1], off, false, ctx);
>> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
>> +       emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
>> +       /* index (64 bit) */
>> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
>> +       /* index >= array->map.max_entries */
>> +       emit(ARM_CMP_R(tmp2[1], tmp[1]), ctx);
>> +       _emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
>> +
>> +       /* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
>> +        *      goto out;
>> +        * tail_call_cnt++;
>> +        */
>> +       lo = (u32)MAX_TAIL_CALL_CNT;
>> +       hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
>> +       emit(ARM_LDR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
>> +       emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
>> +       emit(ARM_CMP_I(tmp[0], hi), ctx);
>> +       _emit(ARM_COND_EQ, ARM_CMP_I(tmp[1], lo), ctx);
>> +       _emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
>> +       emit(ARM_ADDS_I(tmp[1], tmp[1], 1), ctx);
>> +       emit(ARM_ADC_I(tmp[0], tmp[0], 0), ctx);
>> +       emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
>> +       emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
>> +
>> +       /* prog = array->ptrs[index]
>> +        * if (prog == NULL)
>> +        *      goto out;
>> +        */
>> +       off = offsetof(struct bpf_array, ptrs);
>> +       emit_a32_mov_i(tmp[1], off, false, ctx);
>> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
>> +       emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
>> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
>> +       emit(ARM_MOV_SI(tmp[0], tmp2[1], SRTYPE_ASL, 2), ctx);
>> +       emit(ARM_LDR_R(tmp[1], tmp[1], tmp[0]), ctx);
>> +       emit(ARM_CMP_I(tmp[1], 0), ctx);
>> +       _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
>> +
>> +       /* goto *(prog->bpf_func + prologue_size); */
>> +       off = offsetof(struct bpf_prog, bpf_func);
>> +       emit_a32_mov_i(tmp2[1], off, false, ctx);
>> +       emit(ARM_LDR_R(tmp[1], tmp[1], tmp2[1]), ctx);
>> +       emit(ARM_ADD_I(tmp[1], tmp[1], ctx->prologue_bytes), ctx);
>> +       emit(ARM_BX(tmp[1]), ctx);
>> +
>> +       /* out: */
>> +       if (out_offset == -1)
>> +               out_offset = cur_offset;
>> +       if (cur_offset != out_offset) {
>> +               pr_err_once("tail_call out_offset = %d, expected %d!\n",
>> +                           cur_offset, out_offset);
>> +               return -1;
>> +       }
>> +       return 0;
>> +#undef cur_offset
>> +#undef jmp_offset
>> +}
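
For reference while reviewing the emitted checks, the control flow described
by the comments in emit_bpf_tail_call() can be sketched in plain C as below.
The struct layouts, the fixed four-slot array and the value 32 for
MAX_TAIL_CALL_CNT are assumptions made for the sketch only. Like the quoted
code, it compares only the low word of the index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_TAIL_CALL_CNT 32	/* assumption for this sketch */

struct model_prog {
	int id;
};

struct model_array {
	uint32_t max_entries;
	struct model_prog *ptrs[4];	/* hypothetical fixed size */
};

/* Returns the program to jump to, or NULL for the "goto out" paths. */
struct model_prog *model_tail_call(const struct model_array *array,
				   uint64_t index, uint64_t *tail_call_cnt)
{
	/* if (index >= array->map.max_entries) goto out; */
	if ((uint32_t)index >= array->max_entries)
		return NULL;

	/* if (tail_call_cnt > MAX_TAIL_CALL_CNT) goto out; */
	if (*tail_call_cnt > MODEL_MAX_TAIL_CALL_CNT)
		return NULL;
	(*tail_call_cnt)++;

	/* prog = array->ptrs[index]; if (prog == NULL) goto out; */
	return array->ptrs[(uint32_t)index];
}
```

Note that in this order the counter is incremented even when the selected
slot turns out to be NULL, which matches the order of the emitted
instructions above.
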
>> +
>> +/* 0xabcd => 0xcdab */
>> +static inline void emit_rev16(const u8 rd, const u8 rn, struct jit_ctx *ctx)
>>  {
>> -       if (!(ctx->seen & SEEN_X))
>> -               ctx->flags |= FLAG_NEED_X_RESET;
>> +#if __LINUX_ARM_ARCH__ < 6
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +
>> +       emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
>> +       emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 8), ctx);
>> +       emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
>> +       emit(ARM_ORR_SI(rd, tmp2[0], tmp2[1], SRTYPE_LSL, 8), ctx);
>> +#else /* ARMv6+ */
>> +       emit(ARM_REV16(rd, rn), ctx);
>> +#endif
>> +}
>>
>> -       ctx->seen |= SEEN_X;
>> +/* 0xabcdefgh => 0xghefcdab */
>> +static inline void emit_rev32(const u8 rd, const u8 rn, struct jit_ctx *ctx)
>> +{
>> +#if __LINUX_ARM_ARCH__ < 6
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +
>> +       emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
>> +       emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 24), ctx);
>> +       emit(ARM_ORR_SI(ARM_IP, tmp2[0], tmp2[1], SRTYPE_LSL, 24), ctx);
>> +
>> +       emit(ARM_MOV_SI(tmp2[1], rn, SRTYPE_LSR, 8), ctx);
>> +       emit(ARM_AND_I(tmp2[1], tmp2[1], 0xff), ctx);
>> +       emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 16), ctx);
>> +       emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
>> +       emit(ARM_MOV_SI(tmp2[0], tmp2[0], SRTYPE_LSL, 8), ctx);
>> +       emit(ARM_ORR_SI(tmp2[0], tmp2[0], tmp2[1], SRTYPE_LSL, 16), ctx);
>> +       emit(ARM_ORR_R(rd, ARM_IP, tmp2[0]), ctx);
>> +
>> +#else /* ARMv6+ */
>> +       emit(ARM_REV(rd, rn), ctx);
>> +#endif
>>  }
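
The pre-ARMv6 fallbacks in emit_rev16() and emit_rev32() are plain
shift-and-mask byte swaps. A host-side sketch that follows the same steps
(model_rev16 and model_rev32 are hypothetical names) can be used to check
the masks and shift amounts:

```c
#include <assert.h>
#include <stdint.h>

/* 0xabcd => 0xcdab; like the fallback path, the upper half ends up zero. */
uint32_t model_rev16(uint32_t x)
{
	uint32_t b0 = x & 0xff;
	uint32_t b1 = (x >> 8) & 0xff;

	return b1 | (b0 << 8);
}

/* Full 32-bit byte reversal, built in the same order as the fallback. */
uint32_t model_rev32(uint32_t x)
{
	uint32_t outer = (x >> 24) | ((x & 0xff) << 24);
	uint32_t b1 = (x >> 8) & 0xff;
	uint32_t b2 = (x >> 16) & 0xff;
	uint32_t inner = (b2 << 8) | (b1 << 16);

	return outer | inner;
}
```

One difference that may matter to callers: the REV16 instruction swaps the
bytes of both halfwords, while the fallback sequence only produces the low
halfword and leaves the upper one zero.
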
>>
>> -static int build_body(struct jit_ctx *ctx)
>> +static void build_prologue(struct jit_ctx *ctx)
>>  {
>> -       void *load_func[] = {jit_get_skb_b, jit_get_skb_h, jit_get_skb_w};
>> -       const struct bpf_prog *prog = ctx->skf;
>> -       const struct sock_filter *inst;
>> -       unsigned i, load_order, off, condt;
>> -       int imm12;
>> -       u32 k;
>> +       const u8 r0 = bpf2a32[BPF_REG_0][1];
>> +       const u8 r2 = bpf2a32[BPF_REG_1][1];
>> +       const u8 r3 = bpf2a32[BPF_REG_1][0];
>> +       const u8 r4 = bpf2a32[BPF_REG_6][1];
>> +       const u8 r5 = bpf2a32[BPF_REG_6][0];
>> +       const u8 r6 = bpf2a32[TMP_REG_1][1];
>> +       const u8 r7 = bpf2a32[TMP_REG_1][0];
>> +       const u8 r8 = bpf2a32[TMP_REG_2][1];
>> +       const u8 r10 = bpf2a32[TMP_REG_2][0];
>> +       const u8 fplo = bpf2a32[BPF_REG_FP][1];
>> +       const u8 fphi = bpf2a32[BPF_REG_FP][0];
>> +       const u8 sp = ARM_SP;
>> +       const u8 *tcc = bpf2a32[TCALL_CNT];
>> +
>> +       u16 reg_set = 0;
>>
>> -       for (i = 0; i < prog->len; i++) {
>> -               u16 code;
>> +       /*
>> +        * eBPF prog stack layout
>> +        *
>> +        *                         high
>> +        * original ARM_SP =>     +-----+ eBPF prologue
>> +        *                        |FP/LR|
>> +        * current ARM_FP =>      +-----+
>> +        *                        | ... | callee saved registers
>> +        * eBPF fp register =>    +-----+ <= (BPF_FP)
>> +        *                        | ... | eBPF JIT scratch space
>> +        *                        |     | eBPF prog stack
>> +        *                        +-----+
>> +        *                        |RSVD | JIT scratchpad
>> +        * current ARM_SP =>      +-----+ <= (BPF_FP - STACK_SIZE)
>> +        *                        |     |
>> +        *                        | ... | Function call stack
>> +        *                        |     |
>> +        *                        +-----+
>> +        *                          low
>> +        */
>>
>> -               inst = &(prog->insns[i]);
>> -               /* K as an immediate value operand */
>> -               k = inst->k;
>> -               code = bpf_anc_helper(inst);
>> +       /* Save callee saved registers. */
>> +       reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
>> +#ifdef CONFIG_FRAME_POINTER
>> +       reg_set |= (1<<ARM_FP) | (1<<ARM_IP) | (1<<ARM_LR) | (1<<ARM_PC);
>> +       emit(ARM_MOV_R(ARM_IP, sp), ctx);
>> +       emit(ARM_PUSH(reg_set), ctx);
>> +       emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
>> +#else
>> +       /* Check if call instruction exists in BPF body */
>> +       if (ctx->seen & SEEN_CALL)
>> +               reg_set |= (1<<ARM_LR);
>> +       emit(ARM_PUSH(reg_set), ctx);
>> +#endif
>> +       /* Save frame pointer for later */
>> +       emit(ARM_SUB_I(ARM_IP, sp, SCRATCH_SIZE), ctx);
>> +
>> +       /* Set up function call stack */
>> +       emit(ARM_SUB_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
>> +
>> +       /* Set up BPF prog stack base register */
>> +       emit_a32_mov_r(fplo, ARM_IP, true, false, ctx);
>> +       emit_a32_mov_i(fphi, 0, true, ctx);
>> +
>> +       /* mov r4, 0 */
>> +       emit(ARM_MOV_I(r4, 0), ctx);
>> +       /* MOV bpf_ctx pointer to BPF_R1 */
>> +       emit(ARM_MOV_R(r3, r4), ctx);
>> +       emit(ARM_MOV_R(r2, r0), ctx);
>> +       /* Initialize Tail Count */
>> +       emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[0])), ctx);
>> +       emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[1])), ctx);
>> +       /* end of prologue */
>> +}
>>
>> -               /* compute offsets only in the fake pass */
>> -               if (ctx->target == NULL)
>> -                       ctx->offsets[i] = ctx->idx * 4;
>> +static void build_epilogue(struct jit_ctx *ctx)
>> +{
>> +       const u8 r4 = bpf2a32[BPF_REG_6][1];
>> +       const u8 r5 = bpf2a32[BPF_REG_6][0];
>> +       const u8 r6 = bpf2a32[TMP_REG_1][1];
>> +       const u8 r7 = bpf2a32[TMP_REG_1][0];
>> +       const u8 r8 = bpf2a32[TMP_REG_2][1];
>> +       const u8 r10 = bpf2a32[TMP_REG_2][0];
>> +       u16 reg_set = 0;
>> +
>> +       /* unwind function call stack */
>> +       emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
>> +
>> +       /* restore callee saved registers. */
>> +       reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
>> +#ifdef CONFIG_FRAME_POINTER
>> +       /* the first instruction of the prologue was: mov ip, sp */
>> +       reg_set |= (1<<ARM_FP) | (1<<ARM_SP) | (1<<ARM_PC);
>> +       emit(ARM_LDM(ARM_SP, reg_set), ctx);
>> +#else
>> +       if (ctx->seen & SEEN_CALL)
>> +               reg_set |= (1<<ARM_PC);
>> +       /* Restore callee saved registers. */
>> +       emit(ARM_POP(reg_set), ctx);
>> +       /* Return to the caller */
>> +       if (!(ctx->seen & SEEN_CALL))
>> +               emit(ARM_BX(ARM_LR), ctx);
>> +#endif
>> +}
>>
>> -               switch (code) {
>> -               case BPF_LD | BPF_IMM:
>> -                       emit_mov_i(r_A, k, ctx);
>> +/*
>> + * Convert an eBPF instruction to a native instruction, i.e.
>> + * JIT an eBPF instruction.
>> + * Returns :
>> + *     0  - Successfully JITed an 8-byte eBPF instruction
>> + *     >0 - Successfully JITed a 16-byte eBPF instruction
>> + *     <0 - Failed to JIT.
>> + */
>> +static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
>> +{
>> +       const u8 code = insn->code;
>> +       const u8 *dst = bpf2a32[insn->dst_reg];
>> +       const u8 *src = bpf2a32[insn->src_reg];
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       const s16 off = insn->off;
>> +       const s32 imm = insn->imm;
>> +       const int i = insn - ctx->prog->insnsi;
>> +       const bool is64 = BPF_CLASS(code) == BPF_ALU64;
>> +       const bool dstk = is_on_stack(insn->dst_reg);
>> +       const bool sstk = is_on_stack(insn->src_reg);
>> +       u8 rd, rt, rm, rn;
>> +       s32 jmp_offset;
>> +
>> +#define check_imm(bits, imm) do {                              \
>> +       if ((((imm) > 0) && ((imm) >> (bits))) ||               \
>> +           (((imm) < 0) && (~(imm) >> (bits)))) {              \
>> +               pr_info("[%2d] imm=%d(0x%x) out of range\n",    \
>> +                       i, imm, imm);                           \
>> +               return -EINVAL;                                 \
>> +       }                                                       \
>> +} while (0)
>> +#define check_imm24(imm) check_imm(24, imm)
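
A function form of the check_imm() test, written out to see which range it
accepts. This is a sketch only: imm_fits is a hypothetical name and bits is
assumed to be below 32.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when check_imm(bits, imm) would NOT bail out with -EINVAL. */
int imm_fits(int32_t imm, unsigned int bits)
{
	if (imm > 0 && (imm >> bits))
		return 0;	/* positive value needs more than 'bits' bits */
	if (imm < 0 && (~imm >> bits))
		return 0;	/* same test on the one's complement */
	return 1;
}
```

Written this way, the accepted range is [-2^bits, 2^bits - 1], which is one
bit wider than a two's-complement field of that many bits. If check_imm24()
is meant to guard a signed 24-bit branch field, that may be worth a second
look.
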
>> +
>> +       switch (code) {
>> +       /* ALU operations */
>> +
>> +       /* dst = src */
>> +       case BPF_ALU | BPF_MOV | BPF_K:
>> +       case BPF_ALU | BPF_MOV | BPF_X:
>> +       case BPF_ALU64 | BPF_MOV | BPF_K:
>> +       case BPF_ALU64 | BPF_MOV | BPF_X:
>> +               switch (BPF_SRC(code)) {
>> +               case BPF_X:
>> +                       emit_a32_mov_r64(is64, dst, src, dstk, sstk, ctx);
>>                         break;
>> -               case BPF_LD | BPF_W | BPF_LEN:
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
>> -                       emit(ARM_LDR_I(r_A, r_skb,
>> -                                      offsetof(struct sk_buff, len)), ctx);
>> +               case BPF_K:
>> +                       /* Sign-extend immediate value to destination reg */
>> +                       emit_a32_mov_i64(is64, dst, imm, dstk, ctx);
>>                         break;
>> -               case BPF_LD | BPF_MEM:
>> -                       /* A = scratch[k] */
>> -                       ctx->seen |= SEEN_MEM_WORD(k);
>> -                       emit(ARM_LDR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
>> +               }
>> +               break;
>> +       /* dst = dst + src/imm */
>> +       /* dst = dst - src/imm */
>> +       /* dst = dst | src/imm */
>> +       /* dst = dst & src/imm */
>> +       /* dst = dst ^ src/imm */
>> +       /* dst = dst * src/imm */
>> +       /* dst = dst << src */
>> +       /* dst = dst >> src */
>> +       case BPF_ALU | BPF_ADD | BPF_K:
>> +       case BPF_ALU | BPF_ADD | BPF_X:
>> +       case BPF_ALU | BPF_SUB | BPF_K:
>> +       case BPF_ALU | BPF_SUB | BPF_X:
>> +       case BPF_ALU | BPF_OR | BPF_K:
>> +       case BPF_ALU | BPF_OR | BPF_X:
>> +       case BPF_ALU | BPF_AND | BPF_K:
>> +       case BPF_ALU | BPF_AND | BPF_X:
>> +       case BPF_ALU | BPF_XOR | BPF_K:
>> +       case BPF_ALU | BPF_XOR | BPF_X:
>> +       case BPF_ALU | BPF_MUL | BPF_K:
>> +       case BPF_ALU | BPF_MUL | BPF_X:
>> +       case BPF_ALU | BPF_LSH | BPF_X:
>> +       case BPF_ALU | BPF_RSH | BPF_X:
>> +       case BPF_ALU | BPF_ARSH | BPF_K:
>> +       case BPF_ALU | BPF_ARSH | BPF_X:
>> +       case BPF_ALU64 | BPF_ADD | BPF_K:
>> +       case BPF_ALU64 | BPF_ADD | BPF_X:
>> +       case BPF_ALU64 | BPF_SUB | BPF_K:
>> +       case BPF_ALU64 | BPF_SUB | BPF_X:
>> +       case BPF_ALU64 | BPF_OR | BPF_K:
>> +       case BPF_ALU64 | BPF_OR | BPF_X:
>> +       case BPF_ALU64 | BPF_AND | BPF_K:
>> +       case BPF_ALU64 | BPF_AND | BPF_X:
>> +       case BPF_ALU64 | BPF_XOR | BPF_K:
>> +       case BPF_ALU64 | BPF_XOR | BPF_X:
>> +               switch (BPF_SRC(code)) {
>> +               case BPF_X:
>> +                       emit_a32_alu_r64(is64, dst, src, dstk, sstk,
>> +                                        ctx, BPF_OP(code));
>>                         break;
>> -               case BPF_LD | BPF_W | BPF_ABS:
>> -                       load_order = 2;
>> -                       goto load;
>> -               case BPF_LD | BPF_H | BPF_ABS:
>> -                       load_order = 1;
>> -                       goto load;
>> -               case BPF_LD | BPF_B | BPF_ABS:
>> -                       load_order = 0;
>> -load:
>> -                       emit_mov_i(r_off, k, ctx);
>> -load_common:
>> -                       ctx->seen |= SEEN_DATA | SEEN_CALL;
>> -
>> -                       if (load_order > 0) {
>> -                               emit(ARM_SUB_I(r_scratch, r_skb_hl,
>> -                                              1 << load_order), ctx);
>> -                               emit(ARM_CMP_R(r_scratch, r_off), ctx);
>> -                               condt = ARM_COND_GE;
>> -                       } else {
>> -                               emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
>> -                               condt = ARM_COND_HI;
>> -                       }
>> -
>> -                       /*
>> -                        * test for negative offset, only if we are
>> -                        * currently scheduled to take the fast
>> -                        * path. this will update the flags so that
>> -                        * the slowpath instruction are ignored if the
>> -                        * offset is negative.
>> -                        *
>> -                        * for loard_order == 0 the HI condition will
>> -                        * make loads at offset 0 take the slow path too.
>> +               case BPF_K:
>> +                       /* Move the immediate into a temporary register
>> +                        * first. That move sign-extends the value, so
>> +                        * the ALU operation can then safely take the
>> +                        * temporary register as its source operand
>> +                        * (see emit_a32_mov_i64()).
>>                          */
>> -                       _emit(condt, ARM_CMP_I(r_off, 0), ctx);
>> -
>> -                       _emit(condt, ARM_ADD_R(r_scratch, r_off, r_skb_data),
>> -                             ctx);
>> -
>> -                       if (load_order == 0)
>> -                               _emit(condt, ARM_LDRB_I(r_A, r_scratch, 0),
>> -                                     ctx);
>> -                       else if (load_order == 1)
>> -                               emit_load_be16(condt, r_A, r_scratch, ctx);
>> -                       else if (load_order == 2)
>> -                               emit_load_be32(condt, r_A, r_scratch, ctx);
>> -
>> -                       _emit(condt, ARM_B(b_imm(i + 1, ctx)), ctx);
>> -
>> -                       /* the slowpath */
>> -                       emit_mov_i(ARM_R3, (u32)load_func[load_order], ctx);
>> -                       emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
>> -                       /* the offset is already in R1 */
>> -                       emit_blx_r(ARM_R3, ctx);
>> -                       /* check the result of skb_copy_bits */
>> -                       emit(ARM_CMP_I(ARM_R1, 0), ctx);
>> -                       emit_err_ret(ARM_COND_NE, ctx);
>> -                       emit(ARM_MOV_R(r_A, ARM_R0), ctx);
>> +                       emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
>> +                       emit_a32_alu_r64(is64, dst, tmp2, dstk, false,
>> +                                        ctx, BPF_OP(code));
>>                         break;
>> -               case BPF_LD | BPF_W | BPF_IND:
>> -                       load_order = 2;
>> -                       goto load_ind;
>> -               case BPF_LD | BPF_H | BPF_IND:
>> -                       load_order = 1;
>> -                       goto load_ind;
>> -               case BPF_LD | BPF_B | BPF_IND:
>> -                       load_order = 0;
>> -load_ind:
>> -                       update_on_xread(ctx);
>> -                       OP_IMM3(ARM_ADD, r_off, r_X, k, ctx);
>> -                       goto load_common;
>> -               case BPF_LDX | BPF_IMM:
>> -                       ctx->seen |= SEEN_X;
>> -                       emit_mov_i(r_X, k, ctx);
>> +               }
>> +               break;
>> +       /* dst = dst / src(imm) */
>> +       /* dst = dst % src(imm) */
>> +       case BPF_ALU | BPF_DIV | BPF_K:
>> +       case BPF_ALU | BPF_DIV | BPF_X:
>> +       case BPF_ALU | BPF_MOD | BPF_K:
>> +       case BPF_ALU | BPF_MOD | BPF_X:
>> +               rt = src_lo;
>> +               rd = dstk ? tmp2[1] : dst_lo;
>> +               if (dstk)
>> +                       emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               switch (BPF_SRC(code)) {
>> +               case BPF_X:
>> +                       rt = sstk ? tmp2[0] : rt;
>> +                       if (sstk)
>> +                               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)),
>> +                                    ctx);
>>                         break;
>> -               case BPF_LDX | BPF_W | BPF_LEN:
>> -                       ctx->seen |= SEEN_X | SEEN_SKB;
>> -                       emit(ARM_LDR_I(r_X, r_skb,
>> -                                      offsetof(struct sk_buff, len)), ctx);
>> +               case BPF_K:
>> +                       rt = tmp2[0];
>> +                       emit_a32_mov_i(rt, imm, false, ctx);
>>                         break;
>> -               case BPF_LDX | BPF_MEM:
>> -                       ctx->seen |= SEEN_X | SEEN_MEM_WORD(k);
>> -                       emit(ARM_LDR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
>> +               }
>> +               emit_udivmod(rd, rd, rt, ctx, BPF_OP(code));
>> +               if (dstk)
>> +                       emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>> +               break;
>> +       case BPF_ALU64 | BPF_DIV | BPF_K:
>> +       case BPF_ALU64 | BPF_DIV | BPF_X:
>> +       case BPF_ALU64 | BPF_MOD | BPF_K:
>> +       case BPF_ALU64 | BPF_MOD | BPF_X:
>> +               goto notyet;
>> +       /* dst = dst >> imm */
>> +       /* dst = dst << imm */
>> +       case BPF_ALU | BPF_RSH | BPF_K:
>> +       case BPF_ALU | BPF_LSH | BPF_K:
>> +               if (unlikely(imm > 31))
>> +                       return -EINVAL;
>> +               if (imm)
>> +                       emit_a32_alu_i(dst_lo, imm, dstk, ctx, BPF_OP(code));
>> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>> +               break;
>> +       /* dst = dst << imm */
>> +       case BPF_ALU64 | BPF_LSH | BPF_K:
>> +               if (unlikely(imm > 63))
>> +                       return -EINVAL;
>> +               emit_a32_lsh_i64(dst, dstk, imm, ctx);
>> +               break;
>> +       /* dst = dst >> imm */
>> +       case BPF_ALU64 | BPF_RSH | BPF_K:
>> +               if (unlikely(imm > 63))
>> +                       return -EINVAL;
>> +               emit_a32_lsr_i64(dst, dstk, imm, ctx);
>> +               break;
>> +       /* dst = dst << src */
>> +       case BPF_ALU64 | BPF_LSH | BPF_X:
>> +               emit_a32_lsh_r64(dst, src, dstk, sstk, ctx);
>> +               break;
>> +       /* dst = dst >> src */
>> +       case BPF_ALU64 | BPF_RSH | BPF_X:
>> +               emit_a32_lsr_r64(dst, src, dstk, sstk, ctx);
>> +               break;
>> +       /* dst = dst >> src (signed) */
>> +       case BPF_ALU64 | BPF_ARSH | BPF_X:
>> +               emit_a32_arsh_r64(dst, src, dstk, sstk, ctx);
>> +               break;
>> +       /* dst = dst >> imm (signed) */
>> +       case BPF_ALU64 | BPF_ARSH | BPF_K:
>> +               if (unlikely(imm > 63))
>> +                       return -EINVAL;
>> +               emit_a32_arsh_i64(dst, dstk, imm, ctx);
>> +               break;
>> +       /* dst = -dst */
>> +       case BPF_ALU | BPF_NEG:
>> +               emit_a32_alu_i(dst_lo, 0, dstk, ctx, BPF_OP(code));
>> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>> +               break;
>> +       /* dst = -dst (64 bit) */
>> +       case BPF_ALU64 | BPF_NEG:
>> +               emit_a32_neg64(dst, dstk, ctx);
>> +               break;
>> +       /* dst = dst * src/imm */
>> +       case BPF_ALU64 | BPF_MUL | BPF_X:
>> +       case BPF_ALU64 | BPF_MUL | BPF_K:
>> +               switch (BPF_SRC(code)) {
>> +               case BPF_X:
>> +                       emit_a32_mul_r64(dst, src, dstk, sstk, ctx);
>>                         break;
>> -               case BPF_LDX | BPF_B | BPF_MSH:
>> -                       /* x = ((*(frame + k)) & 0xf) << 2; */
>> -                       ctx->seen |= SEEN_X | SEEN_DATA | SEEN_CALL;
>> -                       /* the interpreter should deal with the negative K */
>> -                       if ((int)k < 0)
>> -                               return -1;
>> -                       /* offset in r1: we might have to take the slow path */
>> -                       emit_mov_i(r_off, k, ctx);
>> -                       emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
>> -
>> -                       /* load in r0: common with the slowpath */
>> -                       _emit(ARM_COND_HI, ARM_LDRB_R(ARM_R0, r_skb_data,
>> -                                                     ARM_R1), ctx);
>> -                       /*
>> -                        * emit_mov_i() might generate one or two instructions,
>> -                        * the same holds for emit_blx_r()
>> +               case BPF_K:
>> +                       /* Move the immediate value into the temporary
>> +                        * register pair first. This sign-extends the
>> +                        * immediate into the temp registers, so the
>> +                        * multiplication can then safely operate on
>> +                        * them.
>>                          */
>> -                       _emit(ARM_COND_HI, ARM_B(b_imm(i + 1, ctx) - 2), ctx);
>> -
>> -                       emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
>> -                       /* r_off is r1 */
>> -                       emit_mov_i(ARM_R3, (u32)jit_get_skb_b, ctx);
>> -                       emit_blx_r(ARM_R3, ctx);
>> -                       /* check the return value of skb_copy_bits */
>> -                       emit(ARM_CMP_I(ARM_R1, 0), ctx);
>> -                       emit_err_ret(ARM_COND_NE, ctx);
>> -
>> -                       emit(ARM_AND_I(r_X, ARM_R0, 0x00f), ctx);
>> -                       emit(ARM_LSL_I(r_X, r_X, 2), ctx);
>> -                       break;
>> -               case BPF_ST:
>> -                       ctx->seen |= SEEN_MEM_WORD(k);
>> -                       emit(ARM_STR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
>> -                       break;
>> -               case BPF_STX:
>> -                       update_on_xread(ctx);
>> -                       ctx->seen |= SEEN_MEM_WORD(k);
>> -                       emit(ARM_STR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_ADD | BPF_K:
>> -                       /* A += K */
>> -                       OP_IMM3(ARM_ADD, r_A, r_A, k, ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_ADD | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_ADD_R(r_A, r_A, r_X), ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_SUB | BPF_K:
>> -                       /* A -= K */
>> -                       OP_IMM3(ARM_SUB, r_A, r_A, k, ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_SUB | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_SUB_R(r_A, r_A, r_X), ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_MUL | BPF_K:
>> -                       /* A *= K */
>> -                       emit_mov_i(r_scratch, k, ctx);
>> -                       emit(ARM_MUL(r_A, r_A, r_scratch), ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_MUL | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_MUL(r_A, r_A, r_X), ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_DIV | BPF_K:
>> -                       if (k == 1)
>> -                               break;
>> -                       emit_mov_i(r_scratch, k, ctx);
>> -                       emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_DIV);
>> -                       break;
>> -               case BPF_ALU | BPF_DIV | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_CMP_I(r_X, 0), ctx);
>> -                       emit_err_ret(ARM_COND_EQ, ctx);
>> -                       emit_udivmod(r_A, r_A, r_X, ctx, BPF_DIV);
>> -                       break;
>> -               case BPF_ALU | BPF_MOD | BPF_K:
>> -                       if (k == 1) {
>> -                               emit_mov_i(r_A, 0, ctx);
>> -                               break;
>> -                       }
>> -                       emit_mov_i(r_scratch, k, ctx);
>> -                       emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_MOD);
>> +                       emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
>> +                       emit_a32_mul_r64(dst, tmp2, dstk, false, ctx);
>>                         break;
>> -               case BPF_ALU | BPF_MOD | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_CMP_I(r_X, 0), ctx);
>> -                       emit_err_ret(ARM_COND_EQ, ctx);
>> -                       emit_udivmod(r_A, r_A, r_X, ctx, BPF_MOD);
>> -                       break;
>> -               case BPF_ALU | BPF_OR | BPF_K:
>> -                       /* A |= K */
>> -                       OP_IMM3(ARM_ORR, r_A, r_A, k, ctx);
>> +               }
>> +               break;
>> +       /* dst = htole(dst) */
>> +       /* dst = htobe(dst) */
>> +       case BPF_ALU | BPF_END | BPF_FROM_LE:
>> +       case BPF_ALU | BPF_END | BPF_FROM_BE:
>> +               rd = dstk ? tmp[0] : dst_hi;
>> +               rt = dstk ? tmp[1] : dst_lo;
>> +               if (dstk) {
>> +                       emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +                       emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +               }
>> +               if (BPF_SRC(code) == BPF_FROM_LE)
>> +                       goto emit_bswap_uxt;
>> +               switch (imm) {
>> +               case 16:
>> +                       emit_rev16(rt, rt, ctx);
>> +                       goto emit_bswap_uxt;
>> +               case 32:
>> +                       emit_rev32(rt, rt, ctx);
>> +                       goto emit_bswap_uxt;
>> +               case 64:
>> +                       /* Because of the usage of ARM_LR */
>> +                       ctx->seen |= SEEN_CALL;
>> +                       emit_rev32(ARM_LR, rt, ctx);
>> +                       emit_rev32(rt, rd, ctx);
>> +                       emit(ARM_MOV_R(rd, ARM_LR), ctx);
>>                         break;
>> -               case BPF_ALU | BPF_OR | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_ORR_R(r_A, r_A, r_X), ctx);
>> +               }
>> +               goto exit;
>> +emit_bswap_uxt:
>> +               switch (imm) {
>> +               case 16:
>> +                       /* zero-extend 16 bits into 64 bits */
>> +#if __LINUX_ARM_ARCH__ < 6
>> +                       emit_a32_mov_i(tmp2[1], 0xffff, false, ctx);
>> +                       emit(ARM_AND_R(rt, rt, tmp2[1]), ctx);
>> +#else /* ARMv6+ */
>> +                       emit(ARM_UXTH(rt, rt), ctx);
>> +#endif
>> +                       emit(ARM_EOR_R(rd, rd, rd), ctx);
>>                         break;
>> -               case BPF_ALU | BPF_XOR | BPF_K:
>> -                       /* A ^= K; */
>> -                       OP_IMM3(ARM_EOR, r_A, r_A, k, ctx);
>> +               case 32:
>> +                       /* zero-extend 32 bits into 64 bits */
>> +                       emit(ARM_EOR_R(rd, rd, rd), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_ALU_XOR_X:
>> -               case BPF_ALU | BPF_XOR | BPF_X:
>> -                       /* A ^= X */
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_EOR_R(r_A, r_A, r_X), ctx);
>> +               case 64:
>> +                       /* nop */
>>                         break;
>> -               case BPF_ALU | BPF_AND | BPF_K:
>> -                       /* A &= K */
>> -                       OP_IMM3(ARM_AND, r_A, r_A, k, ctx);
>> +               }
>> +exit:
>> +               if (dstk) {
>> +                       emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +                       emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +               }
>> +               break;
>> +       /* dst = imm64 */
>> +       case BPF_LD | BPF_IMM | BPF_DW:
>> +       {
>> +               const struct bpf_insn insn1 = insn[1];
>> +               u32 hi, lo = imm;
>> +
>> +               if (insn1.code != 0 || insn1.src_reg != 0 ||
>> +                   insn1.dst_reg != 0 || insn1.off != 0) {
>> +                       /* Note: verifier in BPF core must catch invalid
>> +                        * instruction.
>> +                        */
>> +                       pr_err_once("Invalid BPF_LD_IMM64 instruction\n");
>> +                       return -EINVAL;
>> +               }
>> +               hi = insn1.imm;
>> +               emit_a32_mov_i(dst_lo, lo, dstk, ctx);
>> +               emit_a32_mov_i(dst_hi, hi, dstk, ctx);
>> +
>> +               return 1;
>> +       }
>> +       /* LDX: dst = *(size *)(src + off) */
>> +       case BPF_LDX | BPF_MEM | BPF_W:
>> +       case BPF_LDX | BPF_MEM | BPF_H:
>> +       case BPF_LDX | BPF_MEM | BPF_B:
>> +       case BPF_LDX | BPF_MEM | BPF_DW:
>> +               rn = sstk ? tmp2[1] : src_lo;
>> +               if (sstk)
>> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +               switch (BPF_SIZE(code)) {
>> +               case BPF_W:
>> +                       /* Load a Word */
>> +               case BPF_H:
>> +                       /* Load a Half-Word */
>> +               case BPF_B:
>> +                       /* Load a Byte */
>> +                       emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_SIZE(code));
>> +                       emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>>                         break;
>> -               case BPF_ALU | BPF_AND | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_AND_R(r_A, r_A, r_X), ctx);
>> +               case BPF_DW:
>> +                       /* Load a double word */
>> +                       emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_W);
>> +                       emit_ldx_r(dst_hi, rn, dstk, off+4, ctx, BPF_W);
>>                         break;
>> -               case BPF_ALU | BPF_LSH | BPF_K:
>> -                       if (unlikely(k > 31))
>> -                               return -1;
>> -                       emit(ARM_LSL_I(r_A, r_A, k), ctx);
>> +               }
>> +               break;
>> +       /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
>> +       case BPF_LD | BPF_ABS | BPF_W:
>> +       case BPF_LD | BPF_ABS | BPF_H:
>> +       case BPF_LD | BPF_ABS | BPF_B:
>> +       /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
>> +       case BPF_LD | BPF_IND | BPF_W:
>> +       case BPF_LD | BPF_IND | BPF_H:
>> +       case BPF_LD | BPF_IND | BPF_B:
>> +       {
>> +               const u8 r4 = bpf2a32[BPF_REG_6][1]; /* r4 = ptr to sk_buff */
>> +               const u8 r0 = bpf2a32[BPF_REG_0][1]; /* r0: struct sk_buff *skb */
>> +                                                    /* rtn value */
>> +               const u8 r1 = bpf2a32[BPF_REG_0][0]; /* r1: int k */
>> +               const u8 r2 = bpf2a32[BPF_REG_1][1]; /* r2: unsigned int size */
>> +               const u8 r3 = bpf2a32[BPF_REG_1][0]; /* r3: void *buffer */
>> +               const u8 r6 = bpf2a32[TMP_REG_1][1]; /* r6: void *(*func)(..) */
>> +               int size;
>> +
>> +               /* Setting up first argument */
>> +               emit(ARM_MOV_R(r0, r4), ctx);
>> +
>> +               /* Setting up second argument */
>> +               emit_a32_mov_i(r1, imm, false, ctx);
>> +               if (BPF_MODE(code) == BPF_IND)
>> +                       emit_a32_alu_r(r1, src_lo, false, sstk, ctx,
>> +                                      false, false, BPF_ADD);
>> +
>> +               /* Setting up third argument */
>> +               switch (BPF_SIZE(code)) {
>> +               case BPF_W:
>> +                       size = 4;
>>                         break;
>> -               case BPF_ALU | BPF_LSH | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_LSL_R(r_A, r_A, r_X), ctx);
>> +               case BPF_H:
>> +                       size = 2;
>>                         break;
>> -               case BPF_ALU | BPF_RSH | BPF_K:
>> -                       if (unlikely(k > 31))
>> -                               return -1;
>> -                       if (k)
>> -                               emit(ARM_LSR_I(r_A, r_A, k), ctx);
>> +               case BPF_B:
>> +                       size = 1;
>>                         break;
>> -               case BPF_ALU | BPF_RSH | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_LSR_R(r_A, r_A, r_X), ctx);
>> +               default:
>> +                       return -EINVAL;
>> +               }
>> +               emit_a32_mov_i(r2, size, false, ctx);
>> +
>> +               /* Setting up fourth argument */
>> +               emit(ARM_ADD_I(r3, ARM_SP, imm8m(SKB_BUFFER)), ctx);
>> +
>> +               /* Setting up function pointer to call */
>> +               emit_a32_mov_i(r6, (unsigned int)bpf_load_pointer, false, ctx);
>> +               emit_blx_r(r6, ctx);
>> +
>> +               emit(ARM_EOR_R(r1, r1, r1), ctx);
>> +               /* Check whether the returned pointer is NULL.
>> +                * If it is, jump to the epilogue; otherwise
>> +                * load the value from the returned address.
>> +                */
>> +               emit(ARM_CMP_I(r0, 0), ctx);
>> +               jmp_offset = epilogue_offset(ctx);
>> +               check_imm24(jmp_offset);
>> +               _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
>> +
>> +               /* Load value from the address */
>> +               switch (BPF_SIZE(code)) {
>> +               case BPF_W:
>> +                       emit(ARM_LDR_I(r0, r0, 0), ctx);
>> +                       emit_rev32(r0, r0, ctx);
>>                         break;
>> -               case BPF_ALU | BPF_NEG:
>> -                       /* A = -A */
>> -                       emit(ARM_RSB_I(r_A, r_A, 0), ctx);
>> +               case BPF_H:
>> +                       emit(ARM_LDRH_I(r0, r0, 0), ctx);
>> +                       emit_rev16(r0, r0, ctx);
>>                         break;
>> -               case BPF_JMP | BPF_JA:
>> -                       /* pc += K */
>> -                       emit(ARM_B(b_imm(i + k + 1, ctx)), ctx);
>> +               case BPF_B:
>> +                       emit(ARM_LDRB_I(r0, r0, 0), ctx);
>> +                       /* No need to reverse */
>>                         break;
>> -               case BPF_JMP | BPF_JEQ | BPF_K:
>> -                       /* pc += (A == K) ? pc->jt : pc->jf */
>> -                       condt  = ARM_COND_EQ;
>> -                       goto cmp_imm;
>> -               case BPF_JMP | BPF_JGT | BPF_K:
>> -                       /* pc += (A > K) ? pc->jt : pc->jf */
>> -                       condt  = ARM_COND_HI;
>> -                       goto cmp_imm;
>> -               case BPF_JMP | BPF_JGE | BPF_K:
>> -                       /* pc += (A >= K) ? pc->jt : pc->jf */
>> -                       condt  = ARM_COND_HS;
>> -cmp_imm:
>> -                       imm12 = imm8m(k);
>> -                       if (imm12 < 0) {
>> -                               emit_mov_i_no8m(r_scratch, k, ctx);
>> -                               emit(ARM_CMP_R(r_A, r_scratch), ctx);
>> -                       } else {
>> -                               emit(ARM_CMP_I(r_A, imm12), ctx);
>> -                       }
>> -cond_jump:
>> -                       if (inst->jt)
>> -                               _emit(condt, ARM_B(b_imm(i + inst->jt + 1,
>> -                                                  ctx)), ctx);
>> -                       if (inst->jf)
>> -                               _emit(condt ^ 1, ARM_B(b_imm(i + inst->jf + 1,
>> -                                                            ctx)), ctx);
>> +               }
>> +               break;
>> +       }
>> +       /* ST: *(size *)(dst + off) = imm */
>> +       case BPF_ST | BPF_MEM | BPF_W:
>> +       case BPF_ST | BPF_MEM | BPF_H:
>> +       case BPF_ST | BPF_MEM | BPF_B:
>> +       case BPF_ST | BPF_MEM | BPF_DW:
>> +               switch (BPF_SIZE(code)) {
>> +               case BPF_DW:
>> +                       /* Sign-extend immediate value into temp reg */
>> +                       emit_a32_mov_i64(true, tmp2, imm, false, ctx);
>> +                       emit_str_r(dst_lo, tmp2[1], dstk, off, ctx, BPF_W);
>> +                       emit_str_r(dst_lo, tmp2[0], dstk, off+4, ctx, BPF_W);
>>                         break;
>> -               case BPF_JMP | BPF_JEQ | BPF_X:
>> -                       /* pc += (A == X) ? pc->jt : pc->jf */
>> -                       condt   = ARM_COND_EQ;
>> -                       goto cmp_x;
>> -               case BPF_JMP | BPF_JGT | BPF_X:
>> -                       /* pc += (A > X) ? pc->jt : pc->jf */
>> -                       condt   = ARM_COND_HI;
>> -                       goto cmp_x;
>> -               case BPF_JMP | BPF_JGE | BPF_X:
>> -                       /* pc += (A >= X) ? pc->jt : pc->jf */
>> -                       condt   = ARM_COND_CS;
>> -cmp_x:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_CMP_R(r_A, r_X), ctx);
>> -                       goto cond_jump;
>> -               case BPF_JMP | BPF_JSET | BPF_K:
>> -                       /* pc += (A & K) ? pc->jt : pc->jf */
>> -                       condt  = ARM_COND_NE;
>> -                       /* not set iff all zeroes iff Z==1 iff EQ */
>> -
>> -                       imm12 = imm8m(k);
>> -                       if (imm12 < 0) {
>> -                               emit_mov_i_no8m(r_scratch, k, ctx);
>> -                               emit(ARM_TST_R(r_A, r_scratch), ctx);
>> -                       } else {
>> -                               emit(ARM_TST_I(r_A, imm12), ctx);
>> -                       }
>> -                       goto cond_jump;
>> -               case BPF_JMP | BPF_JSET | BPF_X:
>> -                       /* pc += (A & X) ? pc->jt : pc->jf */
>> -                       update_on_xread(ctx);
>> -                       condt  = ARM_COND_NE;
>> -                       emit(ARM_TST_R(r_A, r_X), ctx);
>> -                       goto cond_jump;
>> -               case BPF_RET | BPF_A:
>> -                       emit(ARM_MOV_R(ARM_R0, r_A), ctx);
>> -                       goto b_epilogue;
>> -               case BPF_RET | BPF_K:
>> -                       if ((k == 0) && (ctx->ret0_fp_idx < 0))
>> -                               ctx->ret0_fp_idx = i;
>> -                       emit_mov_i(ARM_R0, k, ctx);
>> -b_epilogue:
>> -                       if (i != ctx->skf->len - 1)
>> -                               emit(ARM_B(b_imm(prog->len, ctx)), ctx);
>> +               case BPF_W:
>> +               case BPF_H:
>> +               case BPF_B:
>> +                       emit_a32_mov_i(tmp2[1], imm, false, ctx);
>> +                       emit_str_r(dst_lo, tmp2[1], dstk, off, ctx,
>> +                                  BPF_SIZE(code));
>>                         break;
>> -               case BPF_MISC | BPF_TAX:
>> -                       /* X = A */
>> -                       ctx->seen |= SEEN_X;
>> -                       emit(ARM_MOV_R(r_X, r_A), ctx);
>> +               }
>> +               break;
>> +       /* STX XADD: lock *(u32 *)(dst + off) += src */
>> +       case BPF_STX | BPF_XADD | BPF_W:
>> +       /* STX XADD: lock *(u64 *)(dst + off) += src */
>> +       case BPF_STX | BPF_XADD | BPF_DW:
>> +               goto notyet;
>> +       /* STX: *(size *)(dst + off) = src */
>> +       case BPF_STX | BPF_MEM | BPF_W:
>> +       case BPF_STX | BPF_MEM | BPF_H:
>> +       case BPF_STX | BPF_MEM | BPF_B:
>> +       case BPF_STX | BPF_MEM | BPF_DW:
>> +       {
>> +               u8 sz = BPF_SIZE(code);
>> +
>> +               rn = sstk ? tmp2[1] : src_lo;
>> +               rm = sstk ? tmp2[0] : src_hi;
>> +               if (!sstk)
>> +                       goto do_store;
>> +               switch (BPF_SIZE(code)) {
>> +               case BPF_W:
>> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +                       goto empty_hi;
>> +               case BPF_H:
>> +                       emit(ARM_LDRH_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +                       goto empty_hi;
>> +               case BPF_B:
>> +                       emit(ARM_LDRB_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +                       goto empty_hi;
>> +empty_hi:
>> +                       emit(ARM_EOR_R(rm, rm, rm), ctx);
>> +               case BPF_DW:
>> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +                       emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
>> +                       sz = BPF_W;
>>                         break;
>> -               case BPF_MISC | BPF_TXA:
>> -                       /* A = X */
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_MOV_R(r_A, r_X), ctx);
>> +               }
>> +
>> +do_store:
>> +               /* Clear higher word except for BPF_DW */
>> +               if (BPF_SIZE(code) != BPF_DW)
>> +                       emit(ARM_EOR_R(rm, rm, rm), ctx);
>> +
>> +               /* Store the value */
>> +               emit_str_r(dst_lo, rn, dstk, off, ctx, sz);
>> +               emit_str_r(dst_lo, rm, dstk, off+4, ctx, BPF_W);
>> +               break;
>> +       }
>> +       /* PC += off if dst == src */
>> +       /* PC += off if dst > src */
>> +       /* PC += off if dst >= src */
>> +       /* PC += off if dst != src */
>> +       /* PC += off if dst > src (signed) */
>> +       /* PC += off if dst >= src (signed) */
>> +       /* PC += off if dst & src */
>> +       case BPF_JMP | BPF_JEQ | BPF_X:
>> +       case BPF_JMP | BPF_JGT | BPF_X:
>> +       case BPF_JMP | BPF_JGE | BPF_X:
>> +       case BPF_JMP | BPF_JNE | BPF_X:
>> +       case BPF_JMP | BPF_JSGT | BPF_X:
>> +       case BPF_JMP | BPF_JSGE | BPF_X:
>> +       case BPF_JMP | BPF_JSET | BPF_X:
>> +               /* Setup source registers */
>> +               rm = sstk ? tmp2[0] : src_hi;
>> +               rn = sstk ? tmp2[1] : src_lo;
>> +               if (sstk) {
>> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +                       emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
>> +               }
>> +               goto go_jmp;
>> +       /* PC += off if dst == imm */
>> +       /* PC += off if dst > imm */
>> +       /* PC += off if dst >= imm */
>> +       /* PC += off if dst != imm */
>> +       /* PC += off if dst > imm (signed) */
>> +       /* PC += off if dst >= imm (signed) */
>> +       /* PC += off if dst & imm */
>> +       case BPF_JMP | BPF_JEQ | BPF_K:
>> +       case BPF_JMP | BPF_JGT | BPF_K:
>> +       case BPF_JMP | BPF_JGE | BPF_K:
>> +       case BPF_JMP | BPF_JNE | BPF_K:
>> +       case BPF_JMP | BPF_JSGT | BPF_K:
>> +       case BPF_JMP | BPF_JSGE | BPF_K:
>> +       case BPF_JMP | BPF_JSET | BPF_K:
>> +               if (off == 0)
>>                         break;
>> -               case BPF_ANC | SKF_AD_PROTOCOL:
>> -                       /* A = ntohs(skb->protocol) */
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
>> -                                                 protocol) != 2);
>> -                       off = offsetof(struct sk_buff, protocol);
>> -                       emit(ARM_LDRH_I(r_scratch, r_skb, off), ctx);
>> -                       emit_swap16(r_A, r_scratch, ctx);
>> +               rm = tmp2[0];
>> +               rn = tmp2[1];
>> +               /* Sign-extend immediate value */
>> +               emit_a32_mov_i64(true, tmp2, imm, false, ctx);
>> +go_jmp:
>> +               /* Setup destination register */
>> +               rd = dstk ? tmp[0] : dst_hi;
>> +               rt = dstk ? tmp[1] : dst_lo;
>> +               if (dstk) {
>> +                       emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +                       emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +               }
>> +
>> +               /* Check for the condition */
>> +               emit_ar_r(rd, rt, rm, rn, ctx, BPF_OP(code));
>> +
>> +               /* Setup JUMP instruction */
>> +               jmp_offset = bpf2a32_offset(i+off, i, ctx);
>> +               switch (BPF_OP(code)) {
>> +               case BPF_JNE:
>> +               case BPF_JSET:
>> +                       _emit(ARM_COND_NE, ARM_B(jmp_offset), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_CPU:
>> -                       /* r_scratch = current_thread_info() */
>> -                       OP_IMM3(ARM_BIC, r_scratch, ARM_SP, THREAD_SIZE - 1, ctx);
>> -                       /* A = current_thread_info()->cpu */
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct thread_info, cpu) != 4);
>> -                       off = offsetof(struct thread_info, cpu);
>> -                       emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
>> +               case BPF_JEQ:
>> +                       _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_IFINDEX:
>> -               case BPF_ANC | SKF_AD_HATYPE:
>> -                       /* A = skb->dev->ifindex */
>> -                       /* A = skb->dev->type */
>> -                       ctx->seen |= SEEN_SKB;
>> -                       off = offsetof(struct sk_buff, dev);
>> -                       emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
>> -
>> -                       emit(ARM_CMP_I(r_scratch, 0), ctx);
>> -                       emit_err_ret(ARM_COND_EQ, ctx);
>> -
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
>> -                                                 ifindex) != 4);
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
>> -                                                 type) != 2);
>> -
>> -                       if (code == (BPF_ANC | SKF_AD_IFINDEX)) {
>> -                               off = offsetof(struct net_device, ifindex);
>> -                               emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
>> -                       } else {
>> -                               /*
>> -                                * offset of field "type" in "struct
>> -                                * net_device" is above what can be
>> -                                * used in the ldrh rd, [rn, #imm]
>> -                                * instruction, so load the offset in
>> -                                * a register and use ldrh rd, [rn, rm]
>> -                                */
>> -                               off = offsetof(struct net_device, type);
>> -                               emit_mov_i(ARM_R3, off, ctx);
>> -                               emit(ARM_LDRH_R(r_A, r_scratch, ARM_R3), ctx);
>> -                       }
>> +               case BPF_JGT:
>> +                       _emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_MARK:
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
>> -                       off = offsetof(struct sk_buff, mark);
>> -                       emit(ARM_LDR_I(r_A, r_skb, off), ctx);
>> +               case BPF_JGE:
>> +                       _emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_RXHASH:
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, hash) != 4);
>> -                       off = offsetof(struct sk_buff, hash);
>> -                       emit(ARM_LDR_I(r_A, r_skb, off), ctx);
>> +               case BPF_JSGT:
>> +                       _emit(ARM_COND_LT, ARM_B(jmp_offset), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_VLAN_TAG:
>> -               case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2);
>> -                       off = offsetof(struct sk_buff, vlan_tci);
>> -                       emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
>> -                       if (code == (BPF_ANC | SKF_AD_VLAN_TAG))
>> -                               OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx);
>> -                       else {
>> -                               OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx);
>> -                               OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
>> -                       }
>> +               case BPF_JSGE:
>> +                       _emit(ARM_COND_GE, ARM_B(jmp_offset), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_PKTTYPE:
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
>> -                                                 __pkt_type_offset[0]) != 1);
>> -                       off = PKT_TYPE_OFFSET();
>> -                       emit(ARM_LDRB_I(r_A, r_skb, off), ctx);
>> -                       emit(ARM_AND_I(r_A, r_A, PKT_TYPE_MAX), ctx);
>> -#ifdef __BIG_ENDIAN_BITFIELD
>> -                       emit(ARM_LSR_I(r_A, r_A, 5), ctx);
>> -#endif
>> +               }
>> +               break;
>> +       /* JMP OFF */
>> +       case BPF_JMP | BPF_JA:
>> +       {
>> +               if (off == 0)
>>                         break;
>> -               case BPF_ANC | SKF_AD_QUEUE:
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
>> -                                                 queue_mapping) != 2);
>> -                       BUILD_BUG_ON(offsetof(struct sk_buff,
>> -                                             queue_mapping) > 0xff);
>> -                       off = offsetof(struct sk_buff, queue_mapping);
>> -                       emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
>> +               jmp_offset = bpf2a32_offset(i+off, i, ctx);
>> +               check_imm24(jmp_offset);
>> +               emit(ARM_B(jmp_offset), ctx);
>> +               break;
>> +       }
>> +       /* tail call */
>> +       case BPF_JMP | BPF_CALL | BPF_X:
>> +               if (emit_bpf_tail_call(ctx))
>> +                       return -EFAULT;
>> +               break;
>> +       /* function call */
>> +       case BPF_JMP | BPF_CALL:
>> +               goto notyet;
>> +       /* function return */
>> +       case BPF_JMP | BPF_EXIT:
>> +               /* Optimization: when last instruction is EXIT
>> +                * simply fallthrough to epilogue.
>> +                */
>> +               if (i == ctx->prog->len - 1)
>>                         break;
>> -               case BPF_ANC | SKF_AD_PAY_OFFSET:
>> -                       ctx->seen |= SEEN_SKB | SEEN_CALL;
>> +               jmp_offset = epilogue_offset(ctx);
>> +               check_imm24(jmp_offset);
>> +               emit(ARM_B(jmp_offset), ctx);
>> +               break;
>> +notyet:
>> +               pr_info_once("*** NOT YET: opcode %02x ***\n", code);
>> +               return -EFAULT;
>> +       default:
>> +               pr_err_once("unknown opcode %02x\n", code);
>> +               return -EINVAL;
>> +       }
>>
>> -                       emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
>> -                       emit_mov_i(ARM_R3, (unsigned int)skb_get_poff, ctx);
>> -                       emit_blx_r(ARM_R3, ctx);
>> -                       emit(ARM_MOV_R(r_A, ARM_R0), ctx);
>> -                       break;
>> -               case BPF_LDX | BPF_W | BPF_ABS:
>> -                       /*
>> -                        * load a 32bit word from struct seccomp_data.
>> -                        * seccomp_check_filter() will already have checked
>> -                        * that k is 32bit aligned and lies within the
>> -                        * struct seccomp_data.
>> -                        */
>> -                       ctx->seen |= SEEN_SKB;
>> -                       emit(ARM_LDR_I(r_A, r_skb, k), ctx);
>> -                       break;
>> -               default:
>> -                       return -1;
>> +       if (ctx->flags & FLAG_IMM_OVERFLOW)
>> +               /*
>> +                * this instruction generated an overflow when
>> +                * trying to access the literal pool, so
>> +                * delegate this filter to the kernel interpreter.
>> +                */
>> +               return -1;
>> +       return 0;
>> +}
>> +
>> +static int build_body(struct jit_ctx *ctx)
>> +{
>> +       const struct bpf_prog *prog = ctx->prog;
>> +       unsigned int i;
>> +
>> +       for (i = 0; i < prog->len; i++) {
>> +               const struct bpf_insn *insn = &(prog->insnsi[i]);
>> +               int ret;
>> +
>> +               ret = build_insn(insn, ctx);
>> +
>> +               /* ret > 0: a 64 bit immediate load used two eBPF insns. */
>> +               if (ret > 0) {
>> +                       i++;
>> +                       if (ctx->target == NULL)
>> +                               ctx->offsets[i] = ctx->idx;
>> +                       continue;
>>                 }
>>
>> -               if (ctx->flags & FLAG_IMM_OVERFLOW)
>> -                       /*
>> -                        * this instruction generated an overflow when
>> -                        * trying to access the literal pool, so
>> -                        * delegate this filter to the kernel interpreter.
>> -                        */
>> -                       return -1;
>> +               if (ctx->target == NULL)
>> +                       ctx->offsets[i] = ctx->idx;
>> +
>> +               /* If unsuccessful, return with error code */
>> +               if (ret)
>> +                       return ret;
>>         }
>> +       return 0;
>> +}
>>
>> -       /* compute offsets only during the first pass */
>> -       if (ctx->target == NULL)
>> -               ctx->offsets[i] = ctx->idx * 4;
>> +static int validate_code(struct jit_ctx *ctx)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < ctx->idx; i++) {
>> +               u32 a32_insn = le32_to_cpu(ctx->target[i]);
>> +
>> +               if (a32_insn == ARM_INST_UDF)
>> +                       return -1;
>> +       }
>>
>>         return 0;
>>  }
>>
>> +void bpf_jit_compile(struct bpf_prog *prog)
>> +{
>> +       /* Nothing to do here. We support Internal BPF. */
>> +}
>>
>> -void bpf_jit_compile(struct bpf_prog *fp)
>> +struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>  {
>> +#ifdef __LITTLE_ENDIAN
>> +       struct bpf_prog *tmp, *orig_prog = prog;
>>         struct bpf_binary_header *header;
>> +       bool tmp_blinded = false;
>>         struct jit_ctx ctx;
>> -       unsigned tmp_idx;
>> -       unsigned alloc_size;
>> -       u8 *target_ptr;
>> +       unsigned int tmp_idx;
>> +       unsigned int image_size;
>> +       u8 *image_ptr;
>>
>> +       /* If BPF JIT was not enabled then we must fall back to
>> +        * the interpreter.
>> +        */
>>         if (!bpf_jit_enable)
>> -               return;
>> +               return orig_prog;
>>
>> -       memset(&ctx, 0, sizeof(ctx));
>> -       ctx.skf         = fp;
>> -       ctx.ret0_fp_idx = -1;
>> +       /* If constant blinding was enabled and we failed during blinding
>> +        * then we must fall back to the interpreter. Otherwise, we save
>> +        * the new JITed code.
>> +        */
>> +       tmp = bpf_jit_blind_constants(prog);
>>
>> -       ctx.offsets = kzalloc(4 * (ctx.skf->len + 1), GFP_KERNEL);
>> -       if (ctx.offsets == NULL)
>> -               return;
>> +       if (IS_ERR(tmp))
>> +               return orig_prog;
>> +       if (tmp != prog) {
>> +               tmp_blinded = true;
>> +               prog = tmp;
>> +       }
>> +
>> +       memset(&ctx, 0, sizeof(ctx));
>> +       ctx.prog = prog;
>>
>> -       /* fake pass to fill in the ctx->seen */
>> -       if (unlikely(build_body(&ctx)))
>> +       /* If we cannot allocate memory for offsets[], then
>> +        * we must fall back to the interpreter.
>> +        */
>> +       ctx.offsets = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
>> +       if (ctx.offsets == NULL) {
>> +               prog = orig_prog;
>>                 goto out;
>> +       }
>> +
>> +       /* 1) Fake pass to find the length of the JITed code and
>> +        * to compute ctx->offsets and other context variables
>> +        * needed to generate the final JITed code.
>> +        * Also, calculate random starting pointer/start of JITed code
>> +        * which is prefixed by random number of fault instructions.
>> +        *
>> +        * If the first pass fails then there is no chance of it
>> +        * being successful in the second pass, so just fall back
>> +        * to the interpreter.
>> +        */
>> +       if (build_body(&ctx)) {
>> +               prog = orig_prog;
>> +               goto out_off;
>> +       }
>>
>>         tmp_idx = ctx.idx;
>>         build_prologue(&ctx);
>>         ctx.prologue_bytes = (ctx.idx - tmp_idx) * 4;
>>
>> +       ctx.epilogue_offset = ctx.idx;
>> +
>>  #if __LINUX_ARM_ARCH__ < 7
>>         tmp_idx = ctx.idx;
>>         build_epilogue(&ctx);
>> @@ -1020,64 +1838,96 @@ void bpf_jit_compile(struct bpf_prog *fp)
>>
>>         ctx.idx += ctx.imm_count;
>>         if (ctx.imm_count) {
>> -               ctx.imms = kzalloc(4 * ctx.imm_count, GFP_KERNEL);
>> -               if (ctx.imms == NULL)
>> -                       goto out;
>> +               ctx.imms = kcalloc(ctx.imm_count, sizeof(u32), GFP_KERNEL);
>> +               if (ctx.imms == NULL) {
>> +                       prog = orig_prog;
>> +                       goto out_off;
>> +               }
>>         }
>>  #else
>> -       /* there's nothing after the epilogue on ARMv7 */
>> +       /* there's nothing about the epilogue on ARMv7 */
>>         build_epilogue(&ctx);
>>  #endif
>> -       alloc_size = 4 * ctx.idx;
>> -       header = bpf_jit_binary_alloc(alloc_size, &target_ptr,
>> -                                     4, jit_fill_hole);
>> -       if (header == NULL)
>> -               goto out;
>> +       /* Now we can get the actual image size of the JITed arm code.
>> +        * Currently, we are not considering the THUMB-2 instructions
>> +        * for jit, although it can decrease the size of the image.
>> +        *
>> +        * As each arm instruction is 32 bits long, we translate the
>> +        * number of JITed instructions into the size required to store
>> +        * the JITed code.
>> +        */
>> +       image_size = sizeof(u32) * ctx.idx;
>>
>> -       ctx.target = (u32 *) target_ptr;
>> +       /* Now we know the size of the structure to make */
>> +       header = bpf_jit_binary_alloc(image_size, &image_ptr,
>> +                                     sizeof(u32), jit_fill_hole);
>> +       /* If we cannot allocate memory for the structure, then
>> +        * we must fall back to the interpreter.
>> +        */
>> +       if (header == NULL) {
>> +               prog = orig_prog;
>> +               goto out_imms;
>> +       }
>> +
>> +       /* 2.) Actual pass to generate final JIT code */
>> +       ctx.target = (u32 *) image_ptr;
>>         ctx.idx = 0;
>>
>>         build_prologue(&ctx);
>> +
>> +       /* If building the body of the JITed code fails somehow,
>> +        * we fall back to the interpreter.
>> +        */
>>         if (build_body(&ctx) < 0) {
>> -#if __LINUX_ARM_ARCH__ < 7
>> -               if (ctx.imm_count)
>> -                       kfree(ctx.imms);
>> -#endif
>> +               image_ptr = NULL;
>>                 bpf_jit_binary_free(header);
>> -               goto out;
>> +               prog = orig_prog;
>> +               goto out_imms;
>>         }
>>         build_epilogue(&ctx);
>>
>> +       /* 3.) Extra pass to validate JITed Code */
>> +       if (validate_code(&ctx)) {
>> +               image_ptr = NULL;
>> +               bpf_jit_binary_free(header);
>> +               prog = orig_prog;
>> +               goto out_imms;
>> +       }
>>         flush_icache_range((u32)header, (u32)(ctx.target + ctx.idx));
>>
>> -#if __LINUX_ARM_ARCH__ < 7
>> -       if (ctx.imm_count)
>> -               kfree(ctx.imms);
>> -#endif
>> -
>>         if (bpf_jit_enable > 1)
>>                 /* there are 2 passes here */
>> -               bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
>> +               bpf_jit_dump(prog->len, image_size, 2, ctx.target);
>>
>>         set_memory_ro((unsigned long)header, header->pages);
>> -       fp->bpf_func = (void *)ctx.target;
>> -       fp->jited = 1;
>> -out:
>> +       prog->bpf_func = (void *)ctx.target;
>> +       prog->jited = 1;
>> +out_imms:
>> +#if __LINUX_ARM_ARCH__ < 7
>> +       if (ctx.imm_count)
>> +               kfree(ctx.imms);
>> +#endif
>> +out_off:
>>         kfree(ctx.offsets);
>> -       return;
>> +out:
>> +       if (tmp_blinded)
>> +               bpf_jit_prog_release_other(prog, prog == orig_prog ?
>> +                                          tmp : orig_prog);
>> +#endif /* __LITTLE_ENDIAN */
>> +       return prog;
>>  }
>>
>> -void bpf_jit_free(struct bpf_prog *fp)
>> +void bpf_jit_free(struct bpf_prog *prog)
>>  {
>> -       unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
>> +       unsigned long addr = (unsigned long)prog->bpf_func & PAGE_MASK;
>>         struct bpf_binary_header *header = (void *)addr;
>>
>> -       if (!fp->jited)
>> +       if (!prog->jited)
>>                 goto free_filter;
>>
>>         set_memory_rw(addr, header->pages);
>>         bpf_jit_binary_free(header);
>>
>>  free_filter:
>> -       bpf_prog_unlock_free(fp);
>> +       bpf_prog_unlock_free(prog);
>>  }
>> diff --git a/arch/arm/net/bpf_jit_32.h b/arch/arm/net/bpf_jit_32.h
>> index c46fca2..d5cf5f6 100644
>> --- a/arch/arm/net/bpf_jit_32.h
>> +++ b/arch/arm/net/bpf_jit_32.h
>> @@ -11,6 +11,7 @@
>>  #ifndef PFILTER_OPCODES_ARM_H
>>  #define PFILTER_OPCODES_ARM_H
>>
>> +/* ARM 32bit Registers */
>>  #define ARM_R0 0
>>  #define ARM_R1 1
>>  #define ARM_R2 2
>> @@ -22,38 +23,43 @@
>>  #define ARM_R8 8
>>  #define ARM_R9 9
>>  #define ARM_R10        10
>> -#define ARM_FP 11
>> -#define ARM_IP 12
>> -#define ARM_SP 13
>> -#define ARM_LR 14
>> -#define ARM_PC 15
>> -
>> -#define ARM_COND_EQ            0x0
>> -#define ARM_COND_NE            0x1
>> -#define ARM_COND_CS            0x2
>> +#define ARM_FP 11      /* Frame Pointer */
>> +#define ARM_IP 12      /* Intra-procedure scratch register */
>> +#define ARM_SP 13      /* Stack pointer: as load/store base reg */
>> +#define ARM_LR 14      /* Link Register */
>> +#define ARM_PC 15      /* Program counter */
>> +
>> +#define ARM_COND_EQ            0x0     /* == */
>> +#define ARM_COND_NE            0x1     /* != */
>> +#define ARM_COND_CS            0x2     /* unsigned >= */
>>  #define ARM_COND_HS            ARM_COND_CS
>> -#define ARM_COND_CC            0x3
>> +#define ARM_COND_CC            0x3     /* unsigned < */
>>  #define ARM_COND_LO            ARM_COND_CC
>> -#define ARM_COND_MI            0x4
>> -#define ARM_COND_PL            0x5
>> -#define ARM_COND_VS            0x6
>> -#define ARM_COND_VC            0x7
>> -#define ARM_COND_HI            0x8
>> -#define ARM_COND_LS            0x9
>> -#define ARM_COND_GE            0xa
>> -#define ARM_COND_LT            0xb
>> -#define ARM_COND_GT            0xc
>> -#define ARM_COND_LE            0xd
>> -#define ARM_COND_AL            0xe
>> +#define ARM_COND_MI            0x4     /* < 0 */
>> +#define ARM_COND_PL            0x5     /* >= 0 */
>> +#define ARM_COND_VS            0x6     /* Signed Overflow */
>> +#define ARM_COND_VC            0x7     /* No Signed Overflow */
>> +#define ARM_COND_HI            0x8     /* unsigned > */
>> +#define ARM_COND_LS            0x9     /* unsigned <= */
>> +#define ARM_COND_GE            0xa     /* Signed >= */
>> +#define ARM_COND_LT            0xb     /* Signed < */
>> +#define ARM_COND_GT            0xc     /* Signed > */
>> +#define ARM_COND_LE            0xd     /* Signed <= */
>> +#define ARM_COND_AL            0xe     /* None */
>>
>>  /* register shift types */
>>  #define SRTYPE_LSL             0
>>  #define SRTYPE_LSR             1
>>  #define SRTYPE_ASR             2
>>  #define SRTYPE_ROR             3
>> +#define SRTYPE_ASL             (SRTYPE_LSL)
>>
>>  #define ARM_INST_ADD_R         0x00800000
>> +#define ARM_INST_ADDS_R                0x00900000
>> +#define ARM_INST_ADC_R         0x00a00000
>> +#define ARM_INST_ADC_I         0x02a00000
>>  #define ARM_INST_ADD_I         0x02800000
>> +#define ARM_INST_ADDS_I                0x02900000
>>
>>  #define ARM_INST_AND_R         0x00000000
>>  #define ARM_INST_AND_I         0x02000000
>> @@ -76,8 +82,10 @@
>>  #define ARM_INST_LDRH_I                0x01d000b0
>>  #define ARM_INST_LDRH_R                0x019000b0
>>  #define ARM_INST_LDR_I         0x05900000
>> +#define ARM_INST_LDR_R         0x07900000
>>
>>  #define ARM_INST_LDM           0x08900000
>> +#define ARM_INST_LDM_IA                0x08b00000
>>
>>  #define ARM_INST_LSL_I         0x01a00000
>>  #define ARM_INST_LSL_R         0x01a00010
>> @@ -86,6 +94,7 @@
>>  #define ARM_INST_LSR_R         0x01a00030
>>
>>  #define ARM_INST_MOV_R         0x01a00000
>> +#define ARM_INST_MOVS_R                0x01b00000
>>  #define ARM_INST_MOV_I         0x03a00000
>>  #define ARM_INST_MOVW          0x03000000
>>  #define ARM_INST_MOVT          0x03400000
>> @@ -96,17 +105,28 @@
>>  #define ARM_INST_PUSH          0x092d0000
>>
>>  #define ARM_INST_ORR_R         0x01800000
>> +#define ARM_INST_ORRS_R                0x01900000
>>  #define ARM_INST_ORR_I         0x03800000
>>
>>  #define ARM_INST_REV           0x06bf0f30
>>  #define ARM_INST_REV16         0x06bf0fb0
>>
>>  #define ARM_INST_RSB_I         0x02600000
>> +#define ARM_INST_RSBS_I                0x02700000
>> +#define ARM_INST_RSC_I         0x02e00000
>>
>>  #define ARM_INST_SUB_R         0x00400000
>> +#define ARM_INST_SUBS_R                0x00500000
>> +#define ARM_INST_RSB_R         0x00600000
>>  #define ARM_INST_SUB_I         0x02400000
>> +#define ARM_INST_SUBS_I                0x02500000
>> +#define ARM_INST_SBC_I         0x02c00000
>> +#define ARM_INST_SBC_R         0x00c00000
>> +#define ARM_INST_SBCS_R                0x00d00000
>>
>>  #define ARM_INST_STR_I         0x05800000
>> +#define ARM_INST_STRB_I                0x05c00000
>> +#define ARM_INST_STRH_I                0x01c000b0
>>
>>  #define ARM_INST_TST_R         0x01100000
>>  #define ARM_INST_TST_I         0x03100000
>> @@ -117,6 +137,8 @@
>>
>>  #define ARM_INST_MLS           0x00600090
>>
>> +#define ARM_INST_UXTH          0x06ff0070
>> +
>>  /*
>>   * Use a suitable undefined instruction to use for ARM/Thumb2 faulting.
>>   * We need to be careful not to conflict with those used by other modules
>> @@ -135,9 +157,15 @@
>>  #define _AL3_R(op, rd, rn, rm) ((op ## _R) | (rd) << 12 | (rn) << 16 | (rm))
>>  /* immediate */
>>  #define _AL3_I(op, rd, rn, imm)        ((op ## _I) | (rd) << 12 | (rn) << 16 | (imm))
>> +/* register with register-shift */
>> +#define _AL3_SR(inst)  (inst | (1 << 4))
>>
>>  #define ARM_ADD_R(rd, rn, rm)  _AL3_R(ARM_INST_ADD, rd, rn, rm)
>> +#define ARM_ADDS_R(rd, rn, rm) _AL3_R(ARM_INST_ADDS, rd, rn, rm)
>>  #define ARM_ADD_I(rd, rn, imm) _AL3_I(ARM_INST_ADD, rd, rn, imm)
>> +#define ARM_ADDS_I(rd, rn, imm)        _AL3_I(ARM_INST_ADDS, rd, rn, imm)
>> +#define ARM_ADC_R(rd, rn, rm)  _AL3_R(ARM_INST_ADC, rd, rn, rm)
>> +#define ARM_ADC_I(rd, rn, imm) _AL3_I(ARM_INST_ADC, rd, rn, imm)
>>
>>  #define ARM_AND_R(rd, rn, rm)  _AL3_R(ARM_INST_AND, rd, rn, rm)
>>  #define ARM_AND_I(rd, rn, imm) _AL3_I(ARM_INST_AND, rd, rn, imm)
>> @@ -156,7 +184,9 @@
>>  #define ARM_EOR_I(rd, rn, imm) _AL3_I(ARM_INST_EOR, rd, rn, imm)
>>
>>  #define ARM_LDR_I(rt, rn, off) (ARM_INST_LDR_I | (rt) << 12 | (rn) << 16 \
>> -                                | (off))
>> +                                | ((off) & 0xfff))
>> +#define ARM_LDR_R(rt, rn, rm)  (ARM_INST_LDR_R | (rt) << 12 | (rn) << 16 \
>> +                                | (rm))
>>  #define ARM_LDRB_I(rt, rn, off)        (ARM_INST_LDRB_I | (rt) << 12 | (rn) << 16 \
>>                                  | (off))
>>  #define ARM_LDRB_R(rt, rn, rm) (ARM_INST_LDRB_R | (rt) << 12 | (rn) << 16 \
>> @@ -167,15 +197,23 @@
>>                                  | (rm))
>>
>>  #define ARM_LDM(rn, regs)      (ARM_INST_LDM | (rn) << 16 | (regs))
>> +#define ARM_LDM_IA(rn, regs)   (ARM_INST_LDM_IA | (rn) << 16 | (regs))
>>
>>  #define ARM_LSL_R(rd, rn, rm)  (_AL3_R(ARM_INST_LSL, rd, 0, rn) | (rm) << 8)
>>  #define ARM_LSL_I(rd, rn, imm) (_AL3_I(ARM_INST_LSL, rd, 0, rn) | (imm) << 7)
>>
>>  #define ARM_LSR_R(rd, rn, rm)  (_AL3_R(ARM_INST_LSR, rd, 0, rn) | (rm) << 8)
>>  #define ARM_LSR_I(rd, rn, imm) (_AL3_I(ARM_INST_LSR, rd, 0, rn) | (imm) << 7)
>> +#define ARM_ASR_R(rd, rn, rm)   (_AL3_R(ARM_INST_ASR, rd, 0, rn) | (rm) << 8)
>> +#define ARM_ASR_I(rd, rn, imm)  (_AL3_I(ARM_INST_ASR, rd, 0, rn) | (imm) << 7)
>>
>>  #define ARM_MOV_R(rd, rm)      _AL3_R(ARM_INST_MOV, rd, 0, rm)
>> +#define ARM_MOVS_R(rd, rm)     _AL3_R(ARM_INST_MOVS, rd, 0, rm)
>>  #define ARM_MOV_I(rd, imm)     _AL3_I(ARM_INST_MOV, rd, 0, imm)
>> +#define ARM_MOV_SR(rd, rm, type, rs)   \
>> +       (_AL3_SR(ARM_MOV_R(rd, rm)) | (type) << 5 | (rs) << 8)
>> +#define ARM_MOV_SI(rd, rm, type, imm6) \
>> +       (ARM_MOV_R(rd, rm) | (type) << 5 | (imm6) << 7)
>>
>>  #define ARM_MOVW(rd, imm)      \
>>         (ARM_INST_MOVW | ((imm) >> 12) << 16 | (rd) << 12 | ((imm) & 0x0fff))
>> @@ -190,19 +228,38 @@
>>
>>  #define ARM_ORR_R(rd, rn, rm)  _AL3_R(ARM_INST_ORR, rd, rn, rm)
>>  #define ARM_ORR_I(rd, rn, imm) _AL3_I(ARM_INST_ORR, rd, rn, imm)
>> -#define ARM_ORR_S(rd, rn, rm, type, rs)        \
>> -       (ARM_ORR_R(rd, rn, rm) | (type) << 5 | (rs) << 7)
>> +#define ARM_ORR_SR(rd, rn, rm, type, rs)       \
>> +       (_AL3_SR(ARM_ORR_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
>> +#define ARM_ORRS_R(rd, rn, rm) _AL3_R(ARM_INST_ORRS, rd, rn, rm)
>> +#define ARM_ORRS_SR(rd, rn, rm, type, rs)      \
>> +       (_AL3_SR(ARM_ORRS_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
>> +#define ARM_ORR_SI(rd, rn, rm, type, imm6)     \
>> +       (ARM_ORR_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
>> +#define ARM_ORRS_SI(rd, rn, rm, type, imm6)    \
>> +       (ARM_ORRS_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
>>
>>  #define ARM_REV(rd, rm)                (ARM_INST_REV | (rd) << 12 | (rm))
>>  #define ARM_REV16(rd, rm)      (ARM_INST_REV16 | (rd) << 12 | (rm))
>>
>>  #define ARM_RSB_I(rd, rn, imm) _AL3_I(ARM_INST_RSB, rd, rn, imm)
>> +#define ARM_RSBS_I(rd, rn, imm)        _AL3_I(ARM_INST_RSBS, rd, rn, imm)
>> +#define ARM_RSC_I(rd, rn, imm) _AL3_I(ARM_INST_RSC, rd, rn, imm)
>>
>>  #define ARM_SUB_R(rd, rn, rm)  _AL3_R(ARM_INST_SUB, rd, rn, rm)
>> +#define ARM_SUBS_R(rd, rn, rm) _AL3_R(ARM_INST_SUBS, rd, rn, rm)
>> +#define ARM_RSB_R(rd, rn, rm)  _AL3_R(ARM_INST_RSB, rd, rn, rm)
>> +#define ARM_SBC_R(rd, rn, rm)  _AL3_R(ARM_INST_SBC, rd, rn, rm)
>> +#define ARM_SBCS_R(rd, rn, rm) _AL3_R(ARM_INST_SBCS, rd, rn, rm)
>>  #define ARM_SUB_I(rd, rn, imm) _AL3_I(ARM_INST_SUB, rd, rn, imm)
>> +#define ARM_SUBS_I(rd, rn, imm)        _AL3_I(ARM_INST_SUBS, rd, rn, imm)
>> +#define ARM_SBC_I(rd, rn, imm) _AL3_I(ARM_INST_SBC, rd, rn, imm)
>>
>>  #define ARM_STR_I(rt, rn, off) (ARM_INST_STR_I | (rt) << 12 | (rn) << 16 \
>> -                                | (off))
>> +                                | ((off) & 0xfff))
>> +#define ARM_STRH_I(rt, rn, off)        (ARM_INST_STRH_I | (rt) << 12 | (rn) << 16 \
>> +                                | (((off) & 0xf0) << 4) | ((off) & 0xf))
>> +#define ARM_STRB_I(rt, rn, off)        (ARM_INST_STRB_I | (rt) << 12 | (rn) << 16 \
>> +                                | (((off) & 0xf0) << 4) | ((off) & 0xf))
>>
>>  #define ARM_TST_R(rn, rm)      _AL3_R(ARM_INST_TST, 0, rn, rm)
>>  #define ARM_TST_I(rn, imm)     _AL3_I(ARM_INST_TST, 0, rn, imm)
>> @@ -214,5 +271,6 @@
>>
>>  #define ARM_MLS(rd, rn, rm, ra)        (ARM_INST_MLS | (rd) << 16 | (rn) | (rm) << 8 \
>>                                  | (ra) << 12)
>> +#define ARM_UXTH(rd, rm)       (ARM_INST_UXTH | (rd) << 12 | (rm))
>>
>>  #endif /* PFILTER_OPCODES_ARM_H */
>> --
>> 2.7.4
>>
>
>
>
> --
> Kees Cook
> Pixel Security
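
A minimal sketch, not taken from the patch, of how the encoding macros in
the bpf_jit_32.h hunk quoted above compose into instruction words. It
assumes that a 64 bit add of BPF_REG_1 into BPF_REG_0 is lowered to an ADDS
on the low words followed by an ADC on the high words, using the r1:r0 and
r3:r2 pairs from the patch's bpf2a32 table. WITH_COND and sketch_add64 are
hypothetical names; the patch's own emit path is not reproduced here.

```c
#include <stdint.h>

/* Values and field layout copied from the quoted bpf_jit_32.h hunk. */
#define ARM_R0 0
#define ARM_R1 1
#define ARM_R2 2
#define ARM_R3 3
#define ARM_COND_AL      0xe
#define ARM_INST_ADDS_R  0x00900000
#define ARM_INST_ADC_R   0x00a00000

#define _AL3_R(op, rd, rn, rm) ((op ## _R) | (rd) << 12 | (rn) << 16 | (rm))
#define ARM_ADDS_R(rd, rn, rm) _AL3_R(ARM_INST_ADDS, rd, rn, rm)
#define ARM_ADC_R(rd, rn, rm)  _AL3_R(ARM_INST_ADC, rd, rn, rm)

/* Hypothetical helper (assumption): condition code goes in bits 31:28. */
#define WITH_COND(cond, inst) ((uint32_t)(inst) | (uint32_t)(cond) << 28)

/*
 * Hypothetical helper (assumption): write the two words for dst += src,
 * where each eBPF register is a {hi, lo} pair of ARM registers, as in
 * the patch's dst_hi/dst_lo convention. Returns the words written.
 */
static unsigned int sketch_add64(uint32_t out[2], const uint8_t dst[2],
				 const uint8_t src[2])
{
	/* Low words first; ADDS updates the carry flag. */
	out[0] = WITH_COND(ARM_COND_AL, ARM_ADDS_R(dst[1], dst[1], src[1]));
	/* High words next; ADC adds the carry in. */
	out[1] = WITH_COND(ARM_COND_AL, ARM_ADC_R(dst[0], dst[0], src[0]));
	return 2;
}
```

With dst = {ARM_R1, ARM_R0} and src = {ARM_R3, ARM_R2}, the two words are
0xe0900002 (adds r0, r0, r2) and 0xe0a11003 (adc r1, r1, r3).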


* Re: [PATCH v2] arm: eBPF JIT compiler
@ 2017-06-06 19:47     ` Shubham Bansal
From: Shubham Bansal @ 2017-06-06 19:47 UTC (permalink / raw)
  To: Kees Cook
  Cc: Network Development, Daniel Borkmann, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

Hi Russell, Alexei, David, Daniel, Kees,

Is there any update on moving this patch forward?

Best,
Shubham Bansal
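
For readers following the thread, here is a minimal sketch of the sizing
pass followed by the emitting pass that bpf_int_jit_compile() in the patch
under discussion is built around: the same builder runs twice, first with no
target buffer so that only the instruction count advances, then again into a
buffer of exactly that size. All sketch_* names are hypothetical, malloc()
stands in for bpf_jit_binary_alloc(), and the placeholder words are not what
the JIT emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for the patch's struct jit_ctx (assumed subset). */
struct sketch_ctx {
	uint32_t *target;	/* NULL during the sizing pass */
	unsigned int idx;	/* number of 32 bit words emitted so far */
};

/* Always count the word; store it only when a buffer exists. */
static void sketch_emit(uint32_t inst, struct sketch_ctx *ctx)
{
	if (ctx->target != NULL)
		ctx->target[ctx->idx] = inst;
	ctx->idx++;
}

/* Placeholder body: three fixed words standing in for real JIT output. */
static void sketch_build(struct sketch_ctx *ctx)
{
	sketch_emit(0xe1a00000u, ctx);	/* mov r0, r0 */
	sketch_emit(0xe1a00000u, ctx);	/* mov r0, r0 */
	sketch_emit(0xe12fff1eu, ctx);	/* bx lr */
}

/*
 * Run the sizing pass, allocate, then run the emitting pass.
 * Returns the image (caller frees) and stores its size in bytes,
 * or returns NULL if the allocation fails.
 */
static uint32_t *sketch_compile(size_t *image_size)
{
	struct sketch_ctx ctx = { NULL, 0 };

	assert(image_size != NULL);

	sketch_build(&ctx);			/* pass 1: count only */
	*image_size = sizeof(uint32_t) * ctx.idx;

	ctx.target = malloc(*image_size);
	if (ctx.target == NULL)
		return NULL;

	ctx.idx = 0;
	sketch_build(&ctx);			/* pass 2: write words */
	return ctx.target;
}
```

Because both passes run the same builder, the second pass cannot write more
words than the first pass counted, as long as the builder is deterministic.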


On Wed, May 31, 2017 at 12:49 AM, Kees Cook <keescook@chromium.org> wrote:
> Forwarding this to net-dev and eBPF folks, who weren't on CC...
>
> -Kees
>
> On Thu, May 25, 2017 at 4:13 PM, Shubham Bansal
> <illusionist.neo@gmail.com> wrote:
>> The JIT compiler emits ARM 32 bit instructions. Currently, it supports
>> eBPF only. Classic BPF is still supported because the BPF core converts
>> it to eBPF.
>>
>> This patch essentially changes the current Berkeley Packet Filter JIT
>> compiler from classic BPF to internal BPF (eBPF), with almost all
>> instructions of the eBPF ISA supported except the following:
>>         BPF_ALU64 | BPF_DIV | BPF_K
>>         BPF_ALU64 | BPF_DIV | BPF_X
>>         BPF_ALU64 | BPF_MOD | BPF_K
>>         BPF_ALU64 | BPF_MOD | BPF_X
>>         BPF_STX | BPF_XADD | BPF_W
>>         BPF_STX | BPF_XADD | BPF_DW
>>         BPF_JMP | BPF_CALL
>>
>> The implementation uses stack scratch space to emulate the 64 bit eBPF
>> ISA on 32 bit ARM, because ARM does not have enough general purpose
>> registers. Currently, only little-endian machines are supported.
>>
>> Tested on ARMv7 with QEMU by me (Shubham Bansal).
>> Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
>> Expected to work on ARMv6 as well, as it is part ARMv7 and part ARMv5,
>> although ARMv6 has not been properly tested.
>>
>> Both tests were run with and without CONFIG_FRAME_POINTER, separately,
>> on little-endian machines.
>>
>> For testing:
>>
>> 1. JIT is enabled with
>>         echo 1 > /proc/sys/net/core/bpf_jit_enable
>> 2. Constant Blinding can be enabled along with JIT using
>>         echo 1 > /proc/sys/net/core/bpf_jit_enable
>>         echo 2 > /proc/sys/net/core/bpf_jit_harden
>>
>> See Documentation/networking/filter.txt for more information.
>>
>> Result : test_bpf: Summary: 314 PASSED, 0 FAILED, [278/306 JIT'ed]
>>
>> Signed-off-by: Shubham Bansal <illusionist.neo@gmail.com>
>> ---
>>  Documentation/networking/filter.txt |    4 +-
>>  arch/arm/Kconfig                    |    2 +-
>>  arch/arm/net/bpf_jit_32.c           | 2404 ++++++++++++++++++++++++-----------
>>  arch/arm/net/bpf_jit_32.h           |  108 +-
>>  4 files changed, 1713 insertions(+), 805 deletions(-)
>>
>> diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
>> index b69b205..01165ac 100644
>> --- a/Documentation/networking/filter.txt
>> +++ b/Documentation/networking/filter.txt
>> @@ -596,8 +596,8 @@ skb pointer). All constraints and restrictions from bpf_check_classic() apply
>>  before a conversion to the new layout is being done behind the scenes!
>>
>>  Currently, the classic BPF format is being used for JITing on most 32-bit
>> -architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64 perform JIT
>> -compilation from eBPF instruction set.
>> +architectures, whereas x86-64, aarch64, arm, s390x, powerpc64, sparc64 perform
>> +JIT compilation from eBPF instruction set.
>>
>>  Some core changes of the new internal format:
>>
>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
>> index 8a7ab5e..13ade46 100644
>> --- a/arch/arm/Kconfig
>> +++ b/arch/arm/Kconfig
>> @@ -47,7 +47,7 @@ config ARM
>>         select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
>>         select HAVE_ARCH_TRACEHOOK
>>         select HAVE_ARM_SMCCC if CPU_V7
>> -       select HAVE_CBPF_JIT
>> +       select HAVE_EBPF_JIT
>>         select HAVE_CC_STACKPROTECTOR
>>         select HAVE_CONTEXT_TRACKING
>>         select HAVE_C_RECORDMCOUNT
>> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
>> index 93d0b6d..c7476e5 100644
>> --- a/arch/arm/net/bpf_jit_32.c
>> +++ b/arch/arm/net/bpf_jit_32.c
>> @@ -1,13 +1,15 @@
>>  /*
>> - * Just-In-Time compiler for BPF filters on 32bit ARM
>> + * Just-In-Time compiler for eBPF filters on 32bit ARM
>>   *
>>   * Copyright (c) 2011 Mircea Gherzan <mgherzan@gmail.com>
>> + * Copyright (c) 2017 Shubham Bansal <illusionist.neo@gmail.com>
>>   *
>>   * This program is free software; you can redistribute it and/or modify it
>>   * under the terms of the GNU General Public License as published by the
>>   * Free Software Foundation; version 2 of the License.
>>   */
>>
>> +#include <linux/bpf.h>
>>  #include <linux/bitops.h>
>>  #include <linux/compiler.h>
>>  #include <linux/errno.h>
>> @@ -23,44 +25,91 @@
>>
>>  #include "bpf_jit_32.h"
>>
>> +int bpf_jit_enable __read_mostly;
>> +
>> +#define STACK_OFFSET(k)        (k)
>> +#define TMP_REG_1      (MAX_BPF_JIT_REG + 0)   /* TEMP Register 1 */
>> +#define TMP_REG_2      (MAX_BPF_JIT_REG + 1)   /* TEMP Register 2 */
>> +#define TCALL_CNT      (MAX_BPF_JIT_REG + 2)   /* Tail Call Count */
>> +
>> +/* Flags used for JIT optimization */
>> +#define SEEN_CALL      (1 << 0)
>> +
>> +#define FLAG_IMM_OVERFLOW      (1 << 0)
>> +
>>  /*
>> - * ABI:
>> + * Map eBPF registers to ARM 32bit registers or stack scratch space.
>> + *
>> + * 1. First argument is passed using the arm 32bit registers and rest of the
>> + * arguments are passed on stack scratch space.
>> + * 2. First callee-saved argument is mapped to arm 32 bit registers and the
>> + * rest are mapped to scratch space on stack.
>> + * 3. We need two 64 bit temp registers to do complex operations on eBPF
>> + * registers.
>> + *
>> + * As the eBPF registers are all 64 bit registers and arm has only 32 bit
>> + * registers, we have to map each eBPF register to two arm 32 bit regs or
>> + * scratch memory space and build the eBPF 64 bit register from those.
>>   *
>> - * r0  scratch register
>> - * r4  BPF register A
>> - * r5  BPF register X
>> - * r6  pointer to the skb
>> - * r7  skb->data
>> - * r8  skb_headlen(skb)
>>   */
>> +static const u8 bpf2a32[][2] = {
>> +       /* return value from in-kernel function, and exit value from eBPF */
>> +       [BPF_REG_0] = {ARM_R1, ARM_R0},
>> +       /* arguments from eBPF program to in-kernel function */
>> +       [BPF_REG_1] = {ARM_R3, ARM_R2},
>> +       /* Stored on stack scratch space */
>> +       [BPF_REG_2] = {STACK_OFFSET(0), STACK_OFFSET(4)},
>> +       [BPF_REG_3] = {STACK_OFFSET(8), STACK_OFFSET(12)},
>> +       [BPF_REG_4] = {STACK_OFFSET(16), STACK_OFFSET(20)},
>> +       [BPF_REG_5] = {STACK_OFFSET(24), STACK_OFFSET(28)},
>> +       /* callee saved registers that in-kernel function will preserve */
>> +       [BPF_REG_6] = {ARM_R5, ARM_R4},
>> +       /* Stored on stack scratch space */
>> +       [BPF_REG_7] = {STACK_OFFSET(32), STACK_OFFSET(36)},
>> +       [BPF_REG_8] = {STACK_OFFSET(40), STACK_OFFSET(44)},
>> +       [BPF_REG_9] = {STACK_OFFSET(48), STACK_OFFSET(52)},
>> +       /* Read only Frame Pointer to access Stack */
>> +       [BPF_REG_FP] = {STACK_OFFSET(56), STACK_OFFSET(60)},
>> +       /* Temporary Register for internal BPF JIT, can be used
>> +        * for constant blindings and others.
>> +        */
>> +       [TMP_REG_1] = {ARM_R7, ARM_R6},
>> +       [TMP_REG_2] = {ARM_R10, ARM_R8},
>> +       /* Tail call count. Stored on stack scratch space. */
>> +       [TCALL_CNT] = {STACK_OFFSET(64), STACK_OFFSET(68)},
>> +       /* temporary register for blinding constants.
>> +        * Stored on stack scratch space.
>> +        */
>> +       [BPF_REG_AX] = {STACK_OFFSET(72), STACK_OFFSET(76)},
>> +};
>>
>> -#define r_scratch      ARM_R0
>> -/* r1-r3 are (also) used for the unaligned loads on the non-ARMv7 slowpath */
>> -#define r_off          ARM_R1
>> -#define r_A            ARM_R4
>> -#define r_X            ARM_R5
>> -#define r_skb          ARM_R6
>> -#define r_skb_data     ARM_R7
>> -#define r_skb_hl       ARM_R8
>> -
>> -#define SCRATCH_SP_OFFSET      0
>> -#define SCRATCH_OFF(k)         (SCRATCH_SP_OFFSET + 4 * (k))
>> -
>> -#define SEEN_MEM               ((1 << BPF_MEMWORDS) - 1)
>> -#define SEEN_MEM_WORD(k)       (1 << (k))
>> -#define SEEN_X                 (1 << BPF_MEMWORDS)
>> -#define SEEN_CALL              (1 << (BPF_MEMWORDS + 1))
>> -#define SEEN_SKB               (1 << (BPF_MEMWORDS + 2))
>> -#define SEEN_DATA              (1 << (BPF_MEMWORDS + 3))
>> +#define        dst_lo  dst[1]
>> +#define dst_hi dst[0]
>> +#define src_lo src[1]
>> +#define src_hi src[0]
>>
>> -#define FLAG_NEED_X_RESET      (1 << 0)
>> -#define FLAG_IMM_OVERFLOW      (1 << 1)
>> +/*
>> + * JIT Context:
>> + *
>> + * prog                        :       bpf_prog
>> + * idx                 :       index of current last JITed instruction.
>> + * prologue_bytes      :       bytes used in prologue.
>> + * epilogue_offset     :       offset of epilogue starting.
>> + * seen                        :       bit mask used for JIT optimization.
>> + * offsets             :       array of eBPF instruction offsets in
>> + *                             JITed code.
>> + * target              :       final JITed code.
>> + * epilogue_bytes      :       no of bytes used in epilogue.
>> + * imm_count           :       no of immediate counts used for global
>> + *                             variables.
>> + * imms                        :       array of global variable addresses.
>> + */
>>
>>  struct jit_ctx {
>> -       const struct bpf_prog *skf;
>> -       unsigned idx;
>> -       unsigned prologue_bytes;
>> -       int ret0_fp_idx;
>> +       const struct bpf_prog *prog;
>> +       unsigned int idx;
>> +       unsigned int prologue_bytes;
>> +       unsigned int epilogue_offset;
>>         u32 seen;
>>         u32 flags;
>>         u32 *offsets;
>> @@ -72,68 +121,16 @@ struct jit_ctx {
>>  #endif
>>  };
>>
>> -int bpf_jit_enable __read_mostly;
>> -
>> -static inline int call_neg_helper(struct sk_buff *skb, int offset, void *ret,
>> -                     unsigned int size)
>> -{
>> -       void *ptr = bpf_internal_load_pointer_neg_helper(skb, offset, size);
>> -
>> -       if (!ptr)
>> -               return -EFAULT;
>> -       memcpy(ret, ptr, size);
>> -       return 0;
>> -}
>> -
>> -static u64 jit_get_skb_b(struct sk_buff *skb, int offset)
>> -{
>> -       u8 ret;
>> -       int err;
>> -
>> -       if (offset < 0)
>> -               err = call_neg_helper(skb, offset, &ret, 1);
>> -       else
>> -               err = skb_copy_bits(skb, offset, &ret, 1);
>> -
>> -       return (u64)err << 32 | ret;
>> -}
>> -
>> -static u64 jit_get_skb_h(struct sk_buff *skb, int offset)
>> -{
>> -       u16 ret;
>> -       int err;
>> -
>> -       if (offset < 0)
>> -               err = call_neg_helper(skb, offset, &ret, 2);
>> -       else
>> -               err = skb_copy_bits(skb, offset, &ret, 2);
>> -
>> -       return (u64)err << 32 | ntohs(ret);
>> -}
>> -
>> -static u64 jit_get_skb_w(struct sk_buff *skb, int offset)
>> -{
>> -       u32 ret;
>> -       int err;
>> -
>> -       if (offset < 0)
>> -               err = call_neg_helper(skb, offset, &ret, 4);
>> -       else
>> -               err = skb_copy_bits(skb, offset, &ret, 4);
>> -
>> -       return (u64)err << 32 | ntohl(ret);
>> -}
>> -
>>  /*
>>   * Wrappers which handle both OABI and EABI and assures Thumb2 interworking
>>   * (where the assembly routines like __aeabi_uidiv could cause problems).
>>   */
>> -static u32 jit_udiv(u32 dividend, u32 divisor)
>> +static u32 jit_udiv32(u32 dividend, u32 divisor)
>>  {
>>         return dividend / divisor;
>>  }
>>
>> -static u32 jit_mod(u32 dividend, u32 divisor)
>> +static u32 jit_mod32(u32 dividend, u32 divisor)
>>  {
>>         return dividend % divisor;
>>  }
>> @@ -157,36 +154,22 @@ static inline void emit(u32 inst, struct jit_ctx *ctx)
>>         _emit(ARM_COND_AL, inst, ctx);
>>  }
>>
>> -static u16 saved_regs(struct jit_ctx *ctx)
>> +/*
>> + * Returns the 12-bit ARM immediate encoding (imm12) of x,
>> + * or -1 if x cannot be encoded that way.
>> + */
>> +static int16_t imm8m(u32 x)
>>  {
>> -       u16 ret = 0;
>> -
>> -       if ((ctx->skf->len > 1) ||
>> -           (ctx->skf->insns[0].code == (BPF_RET | BPF_A)))
>> -               ret |= 1 << r_A;
>> -
>> -#ifdef CONFIG_FRAME_POINTER
>> -       ret |= (1 << ARM_FP) | (1 << ARM_IP) | (1 << ARM_LR) | (1 << ARM_PC);
>> -#else
>> -       if (ctx->seen & SEEN_CALL)
>> -               ret |= 1 << ARM_LR;
>> -#endif
>> -       if (ctx->seen & (SEEN_DATA | SEEN_SKB))
>> -               ret |= 1 << r_skb;
>> -       if (ctx->seen & SEEN_DATA)
>> -               ret |= (1 << r_skb_data) | (1 << r_skb_hl);
>> -       if (ctx->seen & SEEN_X)
>> -               ret |= 1 << r_X;
>> -
>> -       return ret;
>> -}
>> +       u32 rot;
>>
>> -static inline int mem_words_used(struct jit_ctx *ctx)
>> -{
>> -       /* yes, we do waste some stack space IF there are "holes" in the set" */
>> -       return fls(ctx->seen & SEEN_MEM);
>> +       for (rot = 0; rot < 16; rot++)
>> +               if ((x & ~ror32(0xff, 2 * rot)) == 0)
>> +                       return rol32(x, 2 * rot) | (rot << 8);
>> +       return -1;
>>  }
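
For readers following the encoding: below is a minimal user-space sketch of
what imm8m() computes, assuming the usual ARM data-processing immediate form
(an 8-bit value rotated right by an even amount). ror32()/rol32() are
re-implemented locally here, and imm8m_sketch() and imm12_decode() are
illustrative names that are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate helpers; masking the count avoids an undefined shift by 32. */
static uint32_t ror32(uint32_t x, unsigned int n)
{
	n &= 31;
	return (x >> n) | (x << ((32 - n) & 31));
}

static uint32_t rol32(uint32_t x, unsigned int n)
{
	n &= 31;
	return (x << n) | (x >> ((32 - n) & 31));
}

/* Returns the imm12 encoding ((rot << 8) | imm8), or -1 if x is not encodable. */
static int imm8m_sketch(uint32_t x)
{
	unsigned int rot;

	for (rot = 0; rot < 16; rot++)
		if ((x & ~ror32(0xff, 2 * rot)) == 0)
			return (int)(rol32(x, 2 * rot) | (rot << 8));
	return -1;
}

/* Decodes an imm12 back to the 32-bit constant it stands for. */
static uint32_t imm12_decode(int imm12)
{
	assert(imm12 >= 0 && imm12 <= 0xfff);
	return ror32((uint32_t)imm12 & 0xff, 2 * ((unsigned int)imm12 >> 8));
}
```

With this sketch, 0x00ff0000 encodes to 0x8ff, which matches the constant
mentioned in the comment of the removed emit_swap16().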
>>
>> +/*
>> + * Initializes the JIT space with undefined instructions.
>> + */
>>  static void jit_fill_hole(void *area, unsigned int size)
>>  {
>>         u32 *ptr;
>> @@ -195,88 +178,34 @@ static void jit_fill_hole(void *area, unsigned int size)
>>                 *ptr++ = __opcode_to_mem_arm(ARM_INST_UDF);
>>  }
>>
>> -static void build_prologue(struct jit_ctx *ctx)
>> -{
>> -       u16 reg_set = saved_regs(ctx);
>> -       u16 off;
>> -
>> -#ifdef CONFIG_FRAME_POINTER
>> -       emit(ARM_MOV_R(ARM_IP, ARM_SP), ctx);
>> -       emit(ARM_PUSH(reg_set), ctx);
>> -       emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
>> -#else
>> -       if (reg_set)
>> -               emit(ARM_PUSH(reg_set), ctx);
>> -#endif
>> +/* Stack size must be a multiple of 16 bytes */
>> +#define STACK_ALIGN(sz) (((sz) + 15) & ~15)
>>
>> -       if (ctx->seen & (SEEN_DATA | SEEN_SKB))
>> -               emit(ARM_MOV_R(r_skb, ARM_R0), ctx);
>> -
>> -       if (ctx->seen & SEEN_DATA) {
>> -               off = offsetof(struct sk_buff, data);
>> -               emit(ARM_LDR_I(r_skb_data, r_skb, off), ctx);
>> -               /* headlen = len - data_len */
>> -               off = offsetof(struct sk_buff, len);
>> -               emit(ARM_LDR_I(r_skb_hl, r_skb, off), ctx);
>> -               off = offsetof(struct sk_buff, data_len);
>> -               emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
>> -               emit(ARM_SUB_R(r_skb_hl, r_skb_hl, r_scratch), ctx);
>> -       }
>> -
>> -       if (ctx->flags & FLAG_NEED_X_RESET)
>> -               emit(ARM_MOV_I(r_X, 0), ctx);
>> -
>> -       /* do not leak kernel data to userspace */
>> -       if (bpf_needs_clear_a(&ctx->skf->insns[0]))
>> -               emit(ARM_MOV_I(r_A, 0), ctx);
>> -
>> -       /* stack space for the BPF_MEM words */
>> -       if (ctx->seen & SEEN_MEM)
>> -               emit(ARM_SUB_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
>> -}
>> -
>> -static void build_epilogue(struct jit_ctx *ctx)
>> -{
>> -       u16 reg_set = saved_regs(ctx);
>> -
>> -       if (ctx->seen & SEEN_MEM)
>> -               emit(ARM_ADD_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
>> -
>> -       reg_set &= ~(1 << ARM_LR);
>> -
>> -#ifdef CONFIG_FRAME_POINTER
>> -       /* the first instruction of the prologue was: mov ip, sp */
>> -       reg_set &= ~(1 << ARM_IP);
>> -       reg_set |= (1 << ARM_SP);
>> -       emit(ARM_LDM(ARM_SP, reg_set), ctx);
>> -#else
>> -       if (reg_set) {
>> -               if (ctx->seen & SEEN_CALL)
>> -                       reg_set |= 1 << ARM_PC;
>> -               emit(ARM_POP(reg_set), ctx);
>> -       }
>> +/* Stack space for BPF_REG_2, BPF_REG_3, BPF_REG_4,
>> + * BPF_REG_5, BPF_REG_7, BPF_REG_8, BPF_REG_9,
>> + * BPF_REG_FP and Tail call counts.
>> + */
>> +#define SCRATCH_SIZE 80
>>
>> -       if (!(ctx->seen & SEEN_CALL))
>> -               emit(ARM_BX(ARM_LR), ctx);
>> -#endif
>> -}
>> +/* total stack size used in JITed code */
>> +#define _STACK_SIZE \
>> +       (MAX_BPF_STACK + \
>> +        SCRATCH_SIZE + \
>> +        4 /* extra for skb_copy_bits buffer */)
>>
>> -static int16_t imm8m(u32 x)
>> -{
>> -       u32 rot;
>> +#define STACK_SIZE STACK_ALIGN(_STACK_SIZE)
>>
>> -       for (rot = 0; rot < 16; rot++)
>> -               if ((x & ~ror32(0xff, 2 * rot)) == 0)
>> -                       return rol32(x, 2 * rot) | (rot << 8);
>> +/* Get the offset of an eBPF register stored in the scratch space. */
>> +#define STACK_VAR(off) (STACK_SIZE - (off) - 4)
>>
>> -       return -1;
>> -}
>> +/* Offset of skb_copy_bits buffer */
>> +#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
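
To make the resulting layout concrete, here is a small sketch that evaluates
these macros in user space. It assumes MAX_BPF_STACK is 512 (the value in
include/linux/filter.h), renames _STACK_SIZE to RAW_STACK_SIZE to stay out of
the reserved identifier space, and adds a hypothetical helper,
scratch_slot_offset(), that is not in the patch.

```c
#include <assert.h>

#define MAX_BPF_STACK 512	/* assumption: value from include/linux/filter.h */

#define STACK_ALIGN(sz) (((sz) + 15) & ~15)
#define SCRATCH_SIZE 80
#define RAW_STACK_SIZE (MAX_BPF_STACK + SCRATCH_SIZE + 4)
#define STACK_SIZE STACK_ALIGN(RAW_STACK_SIZE)
#define STACK_VAR(off) (STACK_SIZE - (off) - 4)
#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)

/* SP-relative offset of the 4-byte scratch slot number `slot` (hypothetical). */
static int scratch_slot_offset(int slot)
{
	assert(slot >= 0 && slot * 4 < SCRATCH_SIZE);
	return STACK_VAR(slot * 4);
}
```

Under these assumptions the raw size is 596 bytes, which rounds up to 608, so
the scratch slots sit at the top of the frame and count downwards from 604.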
>>
>>  #if __LINUX_ARM_ARCH__ < 7
>>
>>  static u16 imm_offset(u32 k, struct jit_ctx *ctx)
>>  {
>> -       unsigned i = 0, offset;
>> +       unsigned int i = 0, offset;
>>         u16 imm;
>>
>>         /* on the "fake" run we just count them (duplicates included) */
>> @@ -295,7 +224,7 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
>>                 ctx->imms[i] = k;
>>
>>         /* constants go just after the epilogue */
>> -       offset =  ctx->offsets[ctx->skf->len];
>> +       offset =  ctx->offsets[ctx->prog->len - 1] * 4;
>>         offset += ctx->prologue_bytes;
>>         offset += ctx->epilogue_bytes;
>>         offset += i * 4;
>> @@ -319,10 +248,22 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
>>
>>  #endif /* __LINUX_ARM_ARCH__ */
>>
>> +static inline int bpf2a32_offset(int bpf_to, int bpf_from,
>> +                                const struct jit_ctx *ctx) {
>> +       int to, from;
>> +
>> +       if (ctx->target == NULL)
>> +               return 0;
>> +       to = ctx->offsets[bpf_to];
>> +       from = ctx->offsets[bpf_from];
>> +
>> +       return to - from - 1;
>> +}
>> +
>>  /*
>>   * Move an immediate that's not an imm8m to a core register.
>>   */
>> -static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
>> +static inline void emit_mov_i_no8m(const u8 rd, u32 val, struct jit_ctx *ctx)
>>  {
>>  #if __LINUX_ARM_ARCH__ < 7
>>         emit(ARM_LDR_I(rd, ARM_PC, imm_offset(val, ctx)), ctx);
>> @@ -333,7 +274,7 @@ static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
>>  #endif
>>  }
>>
>> -static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
>> +static inline void emit_mov_i(const u8 rd, u32 val, struct jit_ctx *ctx)
>>  {
>>         int imm12 = imm8m(val);
>>
>> @@ -343,676 +284,1553 @@ static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
>>                 emit_mov_i_no8m(rd, val, ctx);
>>  }
>>
>> -#if __LINUX_ARM_ARCH__ < 6
>> -
>> -static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
>> +static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
>>  {
>> -       _emit(cond, ARM_LDRB_I(ARM_R3, r_addr, 1), ctx);
>> -       _emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
>> -       _emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 3), ctx);
>> -       _emit(cond, ARM_LSL_I(ARM_R3, ARM_R3, 16), ctx);
>> -       _emit(cond, ARM_LDRB_I(ARM_R0, r_addr, 2), ctx);
>> -       _emit(cond, ARM_ORR_S(ARM_R3, ARM_R3, ARM_R1, SRTYPE_LSL, 24), ctx);
>> -       _emit(cond, ARM_ORR_R(ARM_R3, ARM_R3, ARM_R2), ctx);
>> -       _emit(cond, ARM_ORR_S(r_res, ARM_R3, ARM_R0, SRTYPE_LSL, 8), ctx);
>> +       ctx->seen |= SEEN_CALL;
>> +#if __LINUX_ARM_ARCH__ < 5
>> +       emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
>> +
>> +       if (elf_hwcap & HWCAP_THUMB)
>> +               emit(ARM_BX(tgt_reg), ctx);
>> +       else
>> +               emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
>> +#else
>> +       emit(ARM_BLX_R(tgt_reg), ctx);
>> +#endif
>>  }
>>
>> -static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
>> +static inline int epilogue_offset(const struct jit_ctx *ctx)
>>  {
>> -       _emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
>> -       _emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 1), ctx);
>> -       _emit(cond, ARM_ORR_S(r_res, ARM_R2, ARM_R1, SRTYPE_LSL, 8), ctx);
>> +       int to, from;
>> +       /* Offsets are not needed during the first (dummy) run */
>> +       if (ctx->target == NULL)
>> +               return 0;
>> +       to = ctx->epilogue_offset;
>> +       from = ctx->idx;
>> +
>> +       return to - from - 2;
>>  }
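
The "- 2" above follows from the ARM convention that the PC reads as the
address of the current instruction plus 8, i.e. two instructions ahead. A
minimal sketch of that arithmetic in instruction (word) units, with
hypothetical helper names that do not appear in the patch:

```c
#include <assert.h>

/*
 * Word offset to place in an ARM B instruction located at instruction
 * index `from` so that it lands on instruction index `to`.
 */
static int arm_branch_word_offset(int from, int to)
{
	assert(from >= 0 && to >= 0);
	return to - from - 2;
}

/* Instruction index reached by a branch at `from` with the given offset. */
static int arm_branch_target(int from, int word_offset)
{
	return from + 2 + word_offset;
}
```

One consequence is that a branch with offset 0 does not fall through: it
skips exactly one instruction.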
>>
>> -static inline void emit_swap16(u8 r_dst, u8 r_src, struct jit_ctx *ctx)
>> +static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx, u8 op)
>>  {
>> -       /* r_dst = (r_src << 8) | (r_src >> 8) */
>> -       emit(ARM_LSL_I(ARM_R1, r_src, 8), ctx);
>> -       emit(ARM_ORR_S(r_dst, ARM_R1, r_src, SRTYPE_LSR, 8), ctx);
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       s32 jmp_offset;
>> +
>> +       /* Check whether the divisor is zero. If it is, set the
>> +        * return value to 0 and jump straight to the epilogue.
>> +        */
>> +       emit(ARM_CMP_I(rn, 0), ctx);
>> +       _emit(ARM_COND_EQ, ARM_MOV_I(ARM_R0, 0), ctx);
>> +       jmp_offset = epilogue_offset(ctx);
>> +       _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
>> +#if __LINUX_ARM_ARCH__ == 7
>> +       if (elf_hwcap & HWCAP_IDIVA) {
>> +               if (op == BPF_DIV)
>> +                       emit(ARM_UDIV(rd, rm, rn), ctx);
>> +               else {
>> +                       emit(ARM_UDIV(ARM_IP, rm, rn), ctx);
>> +                       emit(ARM_MLS(rd, rn, ARM_IP, rm), ctx);
>> +               }
>> +               return;
>> +       }
>> +#endif
>>
>>         /*
>> -        * we need to mask out the bits set in r_dst[23:16] due to
>> -        * the first shift instruction.
>> -        *
>> -        * note that 0x8ff is the encoded immediate 0x00ff0000.
>> +        * For BPF_ALU | BPF_DIV | BPF_K instructions:
>> +        * ARM_R0 and ARM_R1 hold the first argument of the BPF
>> +        * function, so the caller saves them before the call to
>> +        * keep the callee from clobbering them, and restores
>> +        * ARM_R0 and ARM_R1 after the callee returns.
>>          */
>> -       emit(ARM_BIC_I(r_dst, r_dst, 0x8ff), ctx);
>> -}
>> +       if (rn != ARM_R1) {
>> +               emit(ARM_MOV_R(tmp[0], ARM_R1), ctx);
>> +               emit(ARM_MOV_R(ARM_R1, rn), ctx);
>> +       }
>> +       if (rm != ARM_R0) {
>> +               emit(ARM_MOV_R(tmp[1], ARM_R0), ctx);
>> +               emit(ARM_MOV_R(ARM_R0, rm), ctx);
>> +       }
>>
>> -#else  /* ARMv6+ */
>> +       /* Call appropriate function */
>> +       ctx->seen |= SEEN_CALL;
>> +       emit_mov_i(ARM_IP, op == BPF_DIV ?
>> +                  (u32)jit_udiv32 : (u32)jit_mod32, ctx);
>> +       emit_blx_r(ARM_IP, ctx);
>>
>> -static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
>> -{
>> -       _emit(cond, ARM_LDR_I(r_res, r_addr, 0), ctx);
>> -#ifdef __LITTLE_ENDIAN
>> -       _emit(cond, ARM_REV(r_res, r_res), ctx);
>> -#endif
>> +       /* Save return value */
>> +       if (rd != ARM_R0)
>> +               emit(ARM_MOV_R(rd, ARM_R0), ctx);
>> +
>> +       /* Restore ARM_R0 and ARM_R1 */
>> +       if (rn != ARM_R1)
>> +               emit(ARM_MOV_R(ARM_R1, tmp[0]), ctx);
>> +       if (rm != ARM_R0)
>> +               emit(ARM_MOV_R(ARM_R0, tmp[1]), ctx);
>>  }
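
As a reading aid for emit_udivmod(): a user-space model of the zero-divisor
guard and of how the ARMv7 path derives the remainder from UDIV followed by
MLS (rd = rm - rn * quotient). The function names are illustrative only, and
the boolean return value stands in for "branch to the epilogue with R0 = 0".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Models UDIV + MLS: remainder = rm - rn * (rm / rn). */
static uint32_t mod_via_udiv_mls(uint32_t rm, uint32_t rn)
{
	uint32_t q;

	assert(rn != 0);
	q = rm / rn;		/* udiv ip, rm, rn */
	return rm - rn * q;	/* mls  rd, rn, ip, rm */
}

/*
 * Returns false when the divisor is zero; the JITed code would then set
 * R0 to 0 and branch to the epilogue. Otherwise stores the result in *rd.
 */
static bool udivmod_sketch(uint32_t rm, uint32_t rn, bool is_div, uint32_t *rd)
{
	if (rn == 0)
		return false;
	*rd = is_div ? rm / rn : mod_via_udiv_mls(rm, rn);
	return true;
}
```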
>>
>> -static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
>> +/* Checks whether BPF register is on scratch stack space or not. */
>> +static inline bool is_on_stack(u8 bpf_reg)
>>  {
>> -       _emit(cond, ARM_LDRH_I(r_res, r_addr, 0), ctx);
>> -#ifdef __LITTLE_ENDIAN
>> -       _emit(cond, ARM_REV16(r_res, r_res), ctx);
>> -#endif
>> +       static u8 stack_regs[] = {BPF_REG_AX, BPF_REG_3, BPF_REG_4, BPF_REG_5,
>> +                               BPF_REG_7, BPF_REG_8, BPF_REG_9, TCALL_CNT,
>> +                               BPF_REG_2, BPF_REG_FP};
>> +       int i, reg_len = sizeof(stack_regs);
>> +
>> +       for (i = 0 ; i < reg_len ; i++) {
>> +               if (bpf_reg == stack_regs[i])
>> +                       return true;
>> +       }
>> +       return false;
>>  }
>>
>> -static inline void emit_swap16(u8 r_dst __maybe_unused,
>> -                              u8 r_src __maybe_unused,
>> -                              struct jit_ctx *ctx __maybe_unused)
>> +static inline void emit_a32_mov_i(const u8 dst, const u32 val,
>> +                                 bool dstk, struct jit_ctx *ctx)
>>  {
>> -#ifdef __LITTLE_ENDIAN
>> -       emit(ARM_REV16(r_dst, r_src), ctx);
>> -#endif
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +
>> +       if (dstk) {
>> +               emit_mov_i(tmp[1], val, ctx);
>> +               emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(dst)), ctx);
>> +       } else {
>> +               emit_mov_i(dst, val, ctx);
>> +       }
>>  }
>>
>> -#endif /* __LINUX_ARM_ARCH__ < 6 */
>> +/* Sign extended move */
>> +static inline void emit_a32_mov_i64(const bool is64, const u8 dst[],
>> +                                 const u32 val, bool dstk,
>> +                                 struct jit_ctx *ctx) {
>> +       u32 hi = 0;
>>
>> +       if (is64 && (val & (1U << 31)))
>> +               hi = (u32)~0;
>> +       emit_a32_mov_i(dst_lo, val, dstk, ctx);
>> +       emit_a32_mov_i(dst_hi, hi, dstk, ctx);
>> +}
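
A small model of the sign-extended move, for reference. struct reg64 and
mov_i64_sketch() are illustrative names; the lo/hi pair stands in for the two
32-bit halves that the JIT keeps in registers or on the scratch stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct reg64 {
	uint32_t lo;
	uint32_t hi;
};

/* 64-bit moves sign-extend the 32-bit immediate; 32-bit moves zero the high half. */
static struct reg64 mov_i64_sketch(bool is64, uint32_t val)
{
	struct reg64 r = { .lo = val, .hi = 0 };

	if (is64 && (val & (1u << 31)))
		r.hi = ~(uint32_t)0;
	/* The high word is always either all zeros or all ones. */
	assert(r.hi == 0 || r.hi == 0xffffffffu);
	return r;
}
```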
>>
>> -/* Compute the immediate value for a PC-relative branch. */
>> -static inline u32 b_imm(unsigned tgt, struct jit_ctx *ctx)
>> -{
>> -       u32 imm;
>> +static inline void emit_a32_add_r(const u8 dst, const u8 src,
>> +                             const bool is64, const bool hi,
>> +                             struct jit_ctx *ctx) {
>> +       /* 64 bit :
>> +        *      adds dst_lo, dst_lo, src_lo
>> +        *      adc dst_hi, dst_hi, src_hi
>> +        * 32 bit :
>> +        *      add dst_lo, dst_lo, src_lo
>> +        */
>> +       if (!hi && is64)
>> +               emit(ARM_ADDS_R(dst, dst, src), ctx);
>> +       else if (hi && is64)
>> +               emit(ARM_ADC_R(dst, dst, src), ctx);
>> +       else
>> +               emit(ARM_ADD_R(dst, dst, src), ctx);
>> +}
>>
>> -       if (ctx->target == NULL)
>> -               return 0;
>> -       /*
>> -        * BPF allows only forward jumps and the offset of the target is
>> -        * still the one computed during the first pass.
>> +static inline void emit_a32_sub_r(const u8 dst, const u8 src,
>> +                                 const bool is64, const bool hi,
>> +                                 struct jit_ctx *ctx) {
>> +       /* 64 bit :
>> +        *      subs dst_lo, dst_lo, src_lo
>> +        *      sbc dst_hi, dst_hi, src_hi
>> +        * 32 bit :
>> +        *      sub dst_lo, dst_lo, src_lo
>>          */
>> -       imm  = ctx->offsets[tgt] + ctx->prologue_bytes - (ctx->idx * 4 + 8);
>> +       if (!hi && is64)
>> +               emit(ARM_SUBS_R(dst, dst, src), ctx);
>> +       else if (hi && is64)
>> +               emit(ARM_SBC_R(dst, dst, src), ctx);
>> +       else
>> +               emit(ARM_SUB_R(dst, dst, src), ctx);
>> +}
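
The adds/adc and subs/sbc pairs described in the comments above can be
modelled in plain C as follows. This is only a sketch of the carry and borrow
propagation; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct reg64 {
	uint32_t lo;
	uint32_t hi;
};

static struct reg64 from_u64(uint64_t v)
{
	struct reg64 r = { (uint32_t)v, (uint32_t)(v >> 32) };
	return r;
}

static uint64_t to_u64(struct reg64 r)
{
	return ((uint64_t)r.hi << 32) | r.lo;
}

static struct reg64 add64_sketch(struct reg64 dst, struct reg64 src)
{
	uint32_t lo = dst.lo + src.lo;		/* adds dst_lo, dst_lo, src_lo */
	uint32_t carry = lo < dst.lo;		/* C flag produced by adds */

	dst.hi = dst.hi + src.hi + carry;	/* adc dst_hi, dst_hi, src_hi */
	dst.lo = lo;
	return dst;
}

static struct reg64 sub64_sketch(struct reg64 dst, struct reg64 src)
{
	uint32_t borrow = dst.lo < src.lo;	/* subs clears C on borrow */

	dst.lo = dst.lo - src.lo;		/* subs dst_lo, dst_lo, src_lo */
	dst.hi = dst.hi - src.hi - borrow;	/* sbc dst_hi, dst_hi, src_hi */
	return dst;
}

/* Compares both sketches against native 64-bit arithmetic. */
static void check_addsub64(uint64_t a, uint64_t b)
{
	assert(to_u64(add64_sketch(from_u64(a), from_u64(b))) == a + b);
	assert(to_u64(sub64_sketch(from_u64(a), from_u64(b))) == a - b);
}
```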
>>
>> -       return imm >> 2;
>> +static inline void emit_alu_r(const u8 dst, const u8 src, const bool is64,
>> +                             const bool hi, const u8 op, struct jit_ctx *ctx){
>> +       switch (BPF_OP(op)) {
>> +       /* dst = dst + src */
>> +       case BPF_ADD:
>> +               emit_a32_add_r(dst, src, is64, hi, ctx);
>> +               break;
>> +       /* dst = dst - src */
>> +       case BPF_SUB:
>> +               emit_a32_sub_r(dst, src, is64, hi, ctx);
>> +               break;
>> +       /* dst = dst | src */
>> +       case BPF_OR:
>> +               emit(ARM_ORR_R(dst, dst, src), ctx);
>> +               break;
>> +       /* dst = dst & src */
>> +       case BPF_AND:
>> +               emit(ARM_AND_R(dst, dst, src), ctx);
>> +               break;
>> +       /* dst = dst ^ src */
>> +       case BPF_XOR:
>> +               emit(ARM_EOR_R(dst, dst, src), ctx);
>> +               break;
>> +       /* dst = dst * src */
>> +       case BPF_MUL:
>> +               emit(ARM_MUL(dst, dst, src), ctx);
>> +               break;
>> +       /* dst = dst << src */
>> +       case BPF_LSH:
>> +               emit(ARM_LSL_R(dst, dst, src), ctx);
>> +               break;
>> +       /* dst = dst >> src */
>> +       case BPF_RSH:
>> +               emit(ARM_LSR_R(dst, dst, src), ctx);
>> +               break;
>> +       /* dst = dst >> src (signed)*/
>> +       case BPF_ARSH:
>> +               emit(ARM_MOV_SR(dst, dst, SRTYPE_ASR, src), ctx);
>> +               break;
>> +       }
>>  }
>>
>> -#define OP_IMM3(op, r1, r2, imm_val, ctx)                              \
>> -       do {                                                            \
>> -               imm12 = imm8m(imm_val);                                 \
>> -               if (imm12 < 0) {                                        \
>> -                       emit_mov_i_no8m(r_scratch, imm_val, ctx);       \
>> -                       emit(op ## _R((r1), (r2), r_scratch), ctx);     \
>> -               } else {                                                \
>> -                       emit(op ## _I((r1), (r2), imm12), ctx);         \
>> -               }                                                       \
>> -       } while (0)
>> -
>> -static inline void emit_err_ret(u8 cond, struct jit_ctx *ctx)
>> -{
>> -       if (ctx->ret0_fp_idx >= 0) {
>> -               _emit(cond, ARM_B(b_imm(ctx->ret0_fp_idx, ctx)), ctx);
>> -               /* NOP to keep the size constant between passes */
>> -               emit(ARM_MOV_R(ARM_R0, ARM_R0), ctx);
>> +/* ALU operation (32 bit)
>> + * dst = dst (op) src
>> + */
>> +static inline void emit_a32_alu_r(const u8 dst, const u8 src,
>> +                                 bool dstk, bool sstk,
>> +                                 struct jit_ctx *ctx, const bool is64,
>> +                                 const bool hi, const u8 op) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       u8 rn = sstk ? tmp[1] : src;
>> +
>> +       if (sstk)
>> +               emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src)), ctx);
>> +
>> +       /* ALU operation */
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
>> +               emit_alu_r(tmp[0], rn, is64, hi, op, ctx);
>> +               emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
>>         } else {
>> -               _emit(cond, ARM_MOV_I(ARM_R0, 0), ctx);
>> -               _emit(cond, ARM_B(b_imm(ctx->skf->len, ctx)), ctx);
>> +               emit_alu_r(dst, rn, is64, hi, op, ctx);
>>         }
>>  }
>>
>> -static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
>> -{
>> -#if __LINUX_ARM_ARCH__ < 5
>> -       emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
>> +/* ALU operation (64 bit) */
>> +static inline void emit_a32_alu_r64(const bool is64, const u8 dst[],
>> +                                 const u8 src[], bool dstk,
>> +                                 bool sstk, struct jit_ctx *ctx,
>> +                                 const u8 op) {
>> +       emit_a32_alu_r(dst_lo, src_lo, dstk, sstk, ctx, is64, false, op);
>> +       if (is64)
>> +               emit_a32_alu_r(dst_hi, src_hi, dstk, sstk, ctx, is64, true, op);
>> +       else
>> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>> +}
>>
>> -       if (elf_hwcap & HWCAP_THUMB)
>> -               emit(ARM_BX(tgt_reg), ctx);
>> +/* dst = src (4 bytes) */
>> +static inline void emit_a32_mov_r(const u8 dst, const u8 src,
>> +                                 bool dstk, bool sstk,
>> +                                 struct jit_ctx *ctx) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       u8 rt = sstk ? tmp[0] : src;
>> +
>> +       if (sstk)
>> +               emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(src)), ctx);
>> +       if (dstk)
>> +               emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst)), ctx);
>>         else
>> -               emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
>> -#else
>> -       emit(ARM_BLX_R(tgt_reg), ctx);
>> -#endif
>> +               emit(ARM_MOV_R(dst, rt), ctx);
>>  }
>>
>> -static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx,
>> -                               int bpf_op)
>> -{
>> -#if __LINUX_ARM_ARCH__ == 7
>> -       if (elf_hwcap & HWCAP_IDIVA) {
>> -               if (bpf_op == BPF_DIV)
>> -                       emit(ARM_UDIV(rd, rm, rn), ctx);
>> -               else {
>> -                       emit(ARM_UDIV(ARM_R3, rm, rn), ctx);
>> -                       emit(ARM_MLS(rd, rn, ARM_R3, rm), ctx);
>> -               }
>> -               return;
>> +/* dst = src */
>> +static inline void emit_a32_mov_r64(const bool is64, const u8 dst[],
>> +                                 const u8 src[], bool dstk,
>> +                                 bool sstk, struct jit_ctx *ctx) {
>> +       emit_a32_mov_r(dst_lo, src_lo, dstk, sstk, ctx);
>> +       if (is64) {
>> +               /* complete 8 byte move */
>> +               emit_a32_mov_r(dst_hi, src_hi, dstk, sstk, ctx);
>> +       } else {
>> +               /* Zero out high 4 bytes */
>> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>>         }
>> -#endif
>> +}
>>
>> -       /*
>> -        * For BPF_ALU | BPF_DIV | BPF_K instructions, rm is ARM_R4
>> -        * (r_A) and rn is ARM_R0 (r_scratch) so load rn first into
>> -        * ARM_R1 to avoid accidentally overwriting ARM_R0 with rm
>> -        * before using it as a source for ARM_R1.
>> -        *
>> -        * For BPF_ALU | BPF_DIV | BPF_X rm is ARM_R4 (r_A) and rn is
>> -        * ARM_R5 (r_X) so there is no particular register overlap
>> -        * issues.
>> -        */
>> -       if (rn != ARM_R1)
>> -               emit(ARM_MOV_R(ARM_R1, rn), ctx);
>> -       if (rm != ARM_R0)
>> -               emit(ARM_MOV_R(ARM_R0, rm), ctx);
>> +/* Shift and negate operations with an immediate operand */
>> +static inline void emit_a32_alu_i(const u8 dst, const u32 val, bool dstk,
>> +                               struct jit_ctx *ctx, const u8 op) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       u8 rd = dstk ? tmp[0] : dst;
>> +
>> +       if (dstk)
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
>> +
>> +       /* Do the shift or negate operation */
>> +       switch (op) {
>> +       case BPF_LSH:
>> +               emit(ARM_LSL_I(rd, rd, val), ctx);
>> +               break;
>> +       case BPF_RSH:
>> +               emit(ARM_LSR_I(rd, rd, val), ctx);
>> +               break;
>> +       case BPF_NEG:
>> +               emit(ARM_RSB_I(rd, rd, val), ctx);
>> +               break;
>> +       }
>> +
>> +       if (dstk)
>> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
>> +}
>> +
>> +/* dst = -dst (64 bit) */
>> +static inline void emit_a32_neg64(const u8 dst[], bool dstk,
>> +                               struct jit_ctx *ctx){
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       u8 rd = dstk ? tmp[1] : dst[1];
>> +       u8 rm = dstk ? tmp[0] : dst[0];
>> +
>> +       /* Setup Operand */
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do Negate Operation */
>> +       emit(ARM_RSBS_I(rd, rd, 0), ctx);
>> +       emit(ARM_RSC_I(rm, rm, 0), ctx);
>> +
>> +       if (dstk) {
>> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +}
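
RSBS followed by RSC computes 0 - dst across both halves, i.e. an arithmetic
negation rather than a bitwise NOT. A minimal model, with illustrative names
that are not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Models "rsbs lo, lo, #0" followed by "rsc hi, hi, #0". */
static uint64_t neg64_sketch(uint64_t dst)
{
	uint32_t lo = (uint32_t)dst;
	uint32_t hi = (uint32_t)(dst >> 32);
	uint32_t borrow = lo != 0;	/* 0 - lo borrows unless lo is 0 */

	lo = 0u - lo;			/* rsbs */
	hi = 0u - hi - borrow;		/* rsc */
	return ((uint64_t)hi << 32) | lo;
}

/* Compares the sketch against native 64-bit negation. */
static void check_neg64(uint64_t v)
{
	assert(neg64_sketch(v) == (uint64_t)0 - v);
}
```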
>>
>> +/* dst = dst << src */
>> +static inline void emit_a32_lsh_r64(const u8 dst[], const u8 src[], bool dstk,
>> +                                   bool sstk, struct jit_ctx *ctx) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +
>> +       /* Setup Operands */
>> +       u8 rt = sstk ? tmp2[1] : src_lo;
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +
>> +       if (sstk)
>> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do LSH operation */
>> +       emit(ARM_SUB_I(ARM_IP, rt, 32), ctx);
>> +       emit(ARM_RSB_I(tmp2[0], rt, 32), ctx);
>> +       /* As we are using ARM_LR */
>>         ctx->seen |= SEEN_CALL;
>> -       emit_mov_i(ARM_R3, bpf_op == BPF_DIV ? (u32)jit_udiv : (u32)jit_mod,
>> -                  ctx);
>> -       emit_blx_r(ARM_R3, ctx);
>> +       emit(ARM_MOV_SR(ARM_LR, rm, SRTYPE_ASL, rt), ctx);
>> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rd, SRTYPE_ASL, ARM_IP), ctx);
>> +       emit(ARM_ORR_SR(ARM_IP, ARM_LR, rd, SRTYPE_LSR, tmp2[0]), ctx);
>> +       emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_ASL, rt), ctx);
>> +
>> +       if (dstk) {
>> +               emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       } else {
>> +               emit(ARM_MOV_R(rd, ARM_LR), ctx);
>> +               emit(ARM_MOV_R(rm, ARM_IP), ctx);
>> +       }
>> +}
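
The four-instruction sequence above relies on ARM register-controlled shifts
using only the low byte of the count register and producing 0 for LSL/LSR
counts of 32 or more, so the "wrong" term drops out for each range of the
shift amount. A user-space model of that reasoning (names are illustrative,
and ASL is treated as a synonym for LSL):

```c
#include <assert.h>
#include <stdint.h>

/* Register-controlled LSL: low byte of the count, result 0 for counts >= 32. */
static uint32_t arm_lsl(uint32_t x, uint32_t count)
{
	count &= 0xff;
	return count < 32 ? x << count : 0;
}

/* Register-controlled LSR: low byte of the count, result 0 for counts >= 32. */
static uint32_t arm_lsr(uint32_t x, uint32_t count)
{
	count &= 0xff;
	return count < 32 ? x >> count : 0;
}

static uint64_t lsh_r64_sketch(uint64_t dst, uint32_t rt)
{
	uint32_t rd = (uint32_t)dst;		/* low word */
	uint32_t rm = (uint32_t)(dst >> 32);	/* high word */
	uint32_t ip, t, lr;

	assert(rt < 64);
	ip = rt - 32;			/* sub ip, rt, #32 */
	t = 32 - rt;			/* rsb tmp2[0], rt, #32 */
	lr = arm_lsl(rm, rt);		/* mov lr, rm, lsl rt */
	lr |= arm_lsl(rd, ip);		/* orr lr, lr, rd, lsl ip */
	ip = lr | arm_lsr(rd, t);	/* orr ip, lr, rd, lsr tmp2[0] */
	lr = arm_lsl(rd, rt);		/* mov lr, rd, lsl rt */
	return ((uint64_t)ip << 32) | lr;
}

/* Compares the sketch against a native 64-bit shift for every count 0..63. */
static void check_lsh_r64(uint64_t v)
{
	uint32_t s;

	for (s = 0; s < 64; s++)
		assert(lsh_r64_sketch(v, s) == v << s);
}
```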
>>
>> -       if (rd != ARM_R0)
>> -               emit(ARM_MOV_R(rd, ARM_R0), ctx);
>> +/* dst = dst >> src (signed)*/
>> +static inline void emit_a32_arsh_r64(const u8 dst[], const u8 src[], bool dstk,
>> +                                   bool sstk, struct jit_ctx *ctx) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       /* Setup Operands */
>> +       u8 rt = sstk ? tmp2[1] : src_lo;
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +
>> +       if (sstk)
>> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do the ARSH operation */
>> +       emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
>> +       emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
>> +       /* As we are using ARM_LR */
>> +       ctx->seen |= SEEN_CALL;
>> +       emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
>> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
>> +       _emit(ARM_COND_MI, ARM_B(0), ctx);
>> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASR, tmp2[0]), ctx);
>> +       emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_ASR, rt), ctx);
>> +       if (dstk) {
>> +               emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       } else {
>> +               emit(ARM_MOV_R(rd, ARM_LR), ctx);
>> +               emit(ARM_MOV_R(rm, ARM_IP), ctx);
>> +       }
>>  }
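
A note on the conditional ARM_B(0) in the arithmetic-shift variant: a branch
with offset 0 targets PC + 8, so when SUBS leaves the N flag set (shift
amount below 32) the following ORR with ASR is skipped. Unlike LSR, an ASR by
an out-of-range count yields the sign fill rather than 0, so it cannot simply
be OR-ed in. A user-space model of the sequence, with illustrative names:

```c
#include <assert.h>
#include <stdint.h>

static uint32_t arm_lsl(uint32_t x, uint32_t count)
{
	count &= 0xff;
	return count < 32 ? x << count : 0;
}

static uint32_t arm_lsr(uint32_t x, uint32_t count)
{
	count &= 0xff;
	return count < 32 ? x >> count : 0;
}

/* Register-controlled ASR: counts >= 32 give a word filled with the sign bit. */
static uint32_t arm_asr(uint32_t x, uint32_t count)
{
	uint32_t sign = (x & 0x80000000u) ? 0xffffffffu : 0;

	count &= 0xff;
	if (count == 0)
		return x;
	if (count >= 32)
		return sign;
	return (x >> count) | (sign << (32 - count));
}

static uint64_t arsh_r64_sketch(uint64_t dst, uint32_t rt)
{
	uint32_t rd = (uint32_t)dst;		/* low word */
	uint32_t rm = (uint32_t)(dst >> 32);	/* high word */
	uint32_t ip, t, lr;

	assert(rt < 64);
	ip = 32 - rt;			/* rsb ip, rt, #32 */
	t = rt - 32;			/* subs tmp2[0], rt, #32 */
	lr = arm_lsr(rd, rt);		/* mov lr, rd, lsr rt */
	lr |= arm_lsl(rm, ip);		/* orr lr, lr, rm, asl ip */
	if (!(t & 0x80000000u))		/* bmi skips the next instruction */
		lr |= arm_asr(rm, t);	/* orr lr, lr, rm, asr tmp2[0] */
	ip = arm_asr(rm, rt);		/* mov ip, rm, asr rt */
	return ((uint64_t)ip << 32) | lr;
}

/* Portable reference for a 64-bit arithmetic right shift, 0 <= s < 64. */
static uint64_t asr64_ref(uint64_t v, uint32_t s)
{
	uint64_t r = v >> s;

	if (s && (v >> 63))
		r |= ~(UINT64_MAX >> s);
	return r;
}

static void check_arsh_r64(uint64_t v)
{
	uint32_t s;

	for (s = 0; s < 64; s++)
		assert(arsh_r64_sketch(v, s) == asr64_ref(v, s));
}
```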
>>
>> -static inline void update_on_xread(struct jit_ctx *ctx)
>> +/* dst = dst >> src */
>> +static inline void emit_a32_lsr_r64(const u8 dst[], const u8 src[], bool dstk,
>> +                                    bool sstk, struct jit_ctx *ctx) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       /* Setup Operands */
>> +       u8 rt = sstk ? tmp2[1] : src_lo;
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +
>> +       if (sstk)
>> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do LSR operation */
>> +       emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
>> +       emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
>> +       /* As we are using ARM_LR */
>> +       ctx->seen |= SEEN_CALL;
>> +       emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
>> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
>> +       emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_LSR, tmp2[0]), ctx);
>> +       emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_LSR, rt), ctx);
>> +       if (dstk) {
>> +               emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       } else {
>> +               emit(ARM_MOV_R(rd, ARM_LR), ctx);
>> +               emit(ARM_MOV_R(rm, ARM_IP), ctx);
>> +       }
>> +}
>> +
>> +/* dst = dst << val */
>> +static inline void emit_a32_lsh_i64(const u8 dst[], bool dstk,
>> +                                    const u32 val, struct jit_ctx *ctx){
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       /* Setup operands */
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do LSH operation */
>> +       if (val < 32) {
>> +               emit(ARM_MOV_SI(tmp2[0], rm, SRTYPE_ASL, val), ctx);
>> +               emit(ARM_ORR_SI(rm, tmp2[0], rd, SRTYPE_LSR, 32 - val), ctx);
>> +               emit(ARM_MOV_SI(rd, rd, SRTYPE_ASL, val), ctx);
>> +       } else {
>> +               if (val == 32)
>> +                       emit(ARM_MOV_R(rm, rd), ctx);
>> +               else
>> +                       emit(ARM_MOV_SI(rm, rd, SRTYPE_ASL, val - 32), ctx);
>> +               emit(ARM_EOR_R(rd, rd, rd), ctx);
>> +       }
>> +
>> +       if (dstk) {
>> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +}
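
The three cases of the immediate left shift (val < 32, val == 32, val > 32)
can be checked against a native 64-bit shift with a sketch like the one
below. It deliberately excludes val == 0, since "32 - val" is then not a
valid C shift count; whether the JIT is ever asked to emit that case is not
something this sketch tries to answer. Names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t lsh_i64_sketch(uint64_t dst, uint32_t val)
{
	uint32_t rd = (uint32_t)dst;		/* low word */
	uint32_t rm = (uint32_t)(dst >> 32);	/* high word */

	assert(val > 0 && val < 64);
	if (val < 32) {
		/* high = (high << val) | (low >> (32 - val)); low <<= val */
		rm = (rm << val) | (rd >> (32 - val));
		rd <<= val;
	} else {
		/* the low word moves into the high word, shifted by val - 32 */
		rm = (val == 32) ? rd : rd << (val - 32);
		rd = 0;				/* eor rd, rd, rd */
	}
	return ((uint64_t)rm << 32) | rd;
}

/* Compares the sketch against a native shift for every count 1..63. */
static void check_lsh_i64(uint64_t v)
{
	uint32_t s;

	for (s = 1; s < 64; s++)
		assert(lsh_i64_sketch(v, s) == v << s);
}
```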
>> +
>> +/* dst = dst >> val */
>> +static inline void emit_a32_lsr_i64(const u8 dst[], bool dstk,
>> +                                   const u32 val, struct jit_ctx *ctx) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       /* Setup operands */
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do LSR operation */
>> +       if (val < 32) {
>> +               emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
>> +               emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
>> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_LSR, val), ctx);
>> +       } else if (val == 32) {
>> +               emit(ARM_MOV_R(rd, rm), ctx);
>> +               emit(ARM_MOV_I(rm, 0), ctx);
>> +       } else {
>> +               emit(ARM_MOV_SI(rd, rm, SRTYPE_LSR, val - 32), ctx);
>> +               emit(ARM_MOV_I(rm, 0), ctx);
>> +       }
>> +
>> +       if (dstk) {
>> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +}
>> +
>> +/* dst = dst >> val (signed) */
>> +static inline void emit_a32_arsh_i64(const u8 dst[], bool dstk,
>> +                                    const u32 val, struct jit_ctx *ctx){
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       /* Setup operands */
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +
>> +       /* Do ARSH operation */
>> +       if (val < 32) {
>> +               emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
>> +               emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
>> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, val), ctx);
>> +       } else if (val == 32) {
>> +               emit(ARM_MOV_R(rd, rm), ctx);
>> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
>> +       } else {
>> +               emit(ARM_MOV_SI(rd, rm, SRTYPE_ASR, val - 32), ctx);
>> +               emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
>> +       }
>> +
>> +       if (dstk) {
>> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +}
>> +
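
To make the three constant-shift helpers above easier to check, here is a
plain C model of what I understand the emitted sequences to compute. It is
only a sketch: struct pair32 and the function names are mine and do not
appear in the patch. The model returns early for val == 0, because a 32-bit
shift by 32 is undefined in C. In the quoted code that same case reaches the
val < 32 branch with a "32 - val" shift amount of 32, so it may be worth
checking how the ARM_*_SI macros encode it.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, like an ARM register pair. */
struct pair32 {
	uint32_t lo;
	uint32_t hi;
};

/* dst = dst << val, for 0 <= val < 64 */
static struct pair32 lsh64(struct pair32 d, uint32_t val)
{
	assert(val < 64);
	if (val == 0)
		return d;	/* avoid an undefined 32-bit shift by 32 */
	if (val < 32) {
		d.hi = (d.hi << val) | (d.lo >> (32 - val));
		d.lo <<= val;
	} else {
		d.hi = d.lo << (val - 32);	/* val == 32 is a plain move */
		d.lo = 0;
	}
	return d;
}

/* dst = dst >> val (logical), for 0 <= val < 64 */
static struct pair32 lsr64(struct pair32 d, uint32_t val)
{
	assert(val < 64);
	if (val == 0)
		return d;
	if (val < 32) {
		d.lo = (d.lo >> val) | (d.hi << (32 - val));
		d.hi >>= val;
	} else {
		d.lo = d.hi >> (val - 32);
		d.hi = 0;
	}
	return d;
}

/* dst = dst >> val (arithmetic), for 0 <= val < 64 */
static struct pair32 arsh64(struct pair32 d, uint32_t val)
{
	/* All ones if the 64-bit value is negative, otherwise zero. */
	uint32_t sign = (d.hi & 0x80000000u) ? 0xffffffffu : 0;
	struct pair32 r;

	assert(val < 64);
	if (val == 0)
		return d;
	r = lsr64(d, val);
	/* Fill the vacated top bits with copies of the sign bit. */
	if (val < 32) {
		r.hi |= sign << (32 - val);
	} else {
		r.hi = sign;
		if (val > 32)
			r.lo |= sign << (64 - val);
	}
	return r;
}
```

The arithmetic variant is built from the logical one plus an explicit sign
fill, so the model does not depend on how the compiler treats right shifts of
negative values.
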
>> +/* dst = dst * src */
>> +static inline void emit_a32_mul_r64(const u8 dst[], const u8 src[], bool dstk,
>> +                                   bool sstk, struct jit_ctx *ctx) {
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       /* Setup operands for multiplication */
>> +       u8 rd = dstk ? tmp[1] : dst_lo;
>> +       u8 rm = dstk ? tmp[0] : dst_hi;
>> +       u8 rt = sstk ? tmp2[1] : src_lo;
>> +       u8 rn = sstk ? tmp2[0] : src_hi;
>> +
>> +       if (dstk) {
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       }
>> +       if (sstk) {
>> +               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +               emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_hi)), ctx);
>> +       }
>> +
>> +       /* Do Multiplication */
>> +       emit(ARM_MUL(ARM_IP, rd, rn), ctx);
>> +       emit(ARM_MUL(ARM_LR, rm, rt), ctx);
>> +       /* As we are using ARM_LR */
>> +       ctx->seen |= SEEN_CALL;
>> +       emit(ARM_ADD_R(ARM_LR, ARM_IP, ARM_LR), ctx);
>> +
>> +       emit(ARM_UMULL(ARM_IP, rm, rd, rt), ctx);
>> +       emit(ARM_ADD_R(rm, ARM_LR, rm), ctx);
>> +       if (dstk) {
>> +               emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +       } else {
>> +               emit(ARM_MOV_R(rd, ARM_IP), ctx);
>> +       }
>> +}
>> +
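
For reference, the multiply above follows the usual schoolbook split: two
32-bit MULs for the cross terms, one UMULL for the low product, and the cross
terms added into the high word. A small C model, assuming I have read the
register roles correctly (the function names are mine, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of (dst_hi:dst_lo) * (src_hi:src_lo), built from 32-bit pieces. */
static uint64_t mul64_by_halves(uint32_t dst_lo, uint32_t dst_hi,
				uint32_t src_lo, uint32_t src_hi)
{
	/* Two MULs plus an ADD: only the low 32 bits of each cross term matter. */
	uint32_t cross = dst_lo * src_hi + dst_hi * src_lo;
	/* UMULL: full 64-bit product of the two low words. */
	uint64_t low = (uint64_t)dst_lo * src_lo;
	/* The high word of the low product absorbs the cross terms. */
	uint32_t hi = (uint32_t)(low >> 32) + cross;

	return ((uint64_t)hi << 32) | (uint32_t)low;
}

/* Cross-check the model against the compiler's native 64-bit multiply. */
static void mul64_check(uint64_t a, uint64_t b)
{
	assert(mul64_by_halves((uint32_t)a, (uint32_t)(a >> 32),
			       (uint32_t)b, (uint32_t)(b >> 32)) == a * b);
}
```

The dst_hi * src_hi term is dropped on purpose: it only contributes to bits
64 and above, which a 64-bit result discards.
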
>> +/* *(size *)(dst + off) = src */
>> +static inline void emit_str_r(const u8 dst, const u8 src, bool dstk,
>> +                             const s32 off, struct jit_ctx *ctx, const u8 sz){
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       u8 rd = dstk ? tmp[1] : dst;
>> +
>> +       if (dstk)
>> +               emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
>> +       if (off) {
>> +               emit_a32_mov_i(tmp[0], off, false, ctx);
>> +               emit(ARM_ADD_R(tmp[0], rd, tmp[0]), ctx);
>> +               rd = tmp[0];
>> +       }
>> +       switch (sz) {
>> +       case BPF_W:
>> +               /* Store a Word */
>> +               emit(ARM_STR_I(src, rd, 0), ctx);
>> +               break;
>> +       case BPF_H:
>> +               /* Store a HalfWord */
>> +               emit(ARM_STRH_I(src, rd, 0), ctx);
>> +               break;
>> +       case BPF_B:
>> +               /* Store a Byte */
>> +               emit(ARM_STRB_I(src, rd, 0), ctx);
>> +               break;
>> +       }
>> +}
>> +
>> +/* dst = *(size*)(src + off) */
>> +static inline void emit_ldx_r(const u8 dst, const u8 src, bool dstk,
>> +                             const s32 off, struct jit_ctx *ctx, const u8 sz){
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       u8 rd = dstk ? tmp[1] : dst;
>> +       u8 rm = src;
>> +
>> +       if (off) {
>> +               emit_a32_mov_i(tmp[0], off, false, ctx);
>> +               emit(ARM_ADD_R(tmp[0], tmp[0], src), ctx);
>> +               rm = tmp[0];
>> +       }
>> +       switch (sz) {
>> +       case BPF_W:
>> +               /* Load a Word */
>> +               emit(ARM_LDR_I(rd, rm, 0), ctx);
>> +               break;
>> +       case BPF_H:
>> +               /* Load a HalfWord */
>> +               emit(ARM_LDRH_I(rd, rm, 0), ctx);
>> +               break;
>> +       case BPF_B:
>> +               /* Load a Byte */
>> +               emit(ARM_LDRB_I(rd, rm, 0), ctx);
>> +               break;
>> +       }
>> +       if (dstk)
>> +               emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
>> +}
>> +
>> +/* Arithmetic Operation */
>> +static inline void emit_ar_r(const u8 rd, const u8 rt, const u8 rm,
>> +                            const u8 rn, struct jit_ctx *ctx, u8 op) {
>> +       switch (op) {
>> +       case BPF_JSET:
>> +               ctx->seen |= SEEN_CALL;
>> +               emit(ARM_AND_R(ARM_IP, rt, rn), ctx);
>> +               emit(ARM_AND_R(ARM_LR, rd, rm), ctx);
>> +               emit(ARM_ORRS_R(ARM_IP, ARM_LR, ARM_IP), ctx);
>> +               break;
>> +       case BPF_JEQ:
>> +       case BPF_JNE:
>> +       case BPF_JGT:
>> +       case BPF_JGE:
>> +               emit(ARM_CMP_R(rd, rm), ctx);
>> +               _emit(ARM_COND_EQ, ARM_CMP_R(rt, rn), ctx);
>> +               break;
>> +       case BPF_JSGT:
>> +               emit(ARM_CMP_R(rn, rt), ctx);
>> +               emit(ARM_SBCS_R(ARM_IP, rm, rd), ctx);
>> +               break;
>> +       case BPF_JSGE:
>> +               emit(ARM_CMP_R(rt, rn), ctx);
>> +               emit(ARM_SBCS_R(ARM_IP, rd, rm), ctx);
>> +               break;
>> +       }
>> +}
>> +
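
The signed cases took me a moment, so here is how I read them. The CMP on
one pair of words sets the carry, the SBCS on the other pair finishes a
64-bit subtraction, and the branch emitted by the caller then tests N and V.
The sketch below models that in C. It assumes rd/rm carry the high words and
rt/rn the low words; the call site is not in this hunk, so that mapping is my
guess, and the function names are mine.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if (a_hi:a_lo) >= (b_hi:b_lo) as signed 64-bit values. */
static int sge64_by_halves(uint32_t a_lo, uint32_t a_hi,
			   uint32_t b_lo, uint32_t b_hi)
{
	/* CMP a_lo, b_lo: a borrow out of the low words clears C. */
	uint32_t borrow = a_lo < b_lo;
	/* SBCS ip, a_hi, b_hi: subtract the high words with that borrow. */
	uint32_t r = a_hi - b_hi - borrow;
	uint32_t n = r >> 31;					/* N flag */
	uint32_t v = ((a_hi ^ b_hi) & (a_hi ^ r)) >> 31;	/* V flag */

	return n == v;						/* GE */
}

/* a > b is evaluated as b < a: swap the operands and test LT (N != V). */
static int sgt64_by_halves(uint32_t a_lo, uint32_t a_hi,
			   uint32_t b_lo, uint32_t b_hi)
{
	return !sge64_by_halves(b_lo, b_hi, a_lo, a_hi);
}

/* Cross-check both helpers against native signed 64-bit comparison. */
static void scmp64_check(int64_t a, int64_t b)
{
	uint64_t ua = (uint64_t)a, ub = (uint64_t)b;

	assert(sge64_by_halves((uint32_t)ua, (uint32_t)(ua >> 32),
			       (uint32_t)ub, (uint32_t)(ub >> 32)) == (a >= b));
	assert(sgt64_by_halves((uint32_t)ua, (uint32_t)(ua >> 32),
			       (uint32_t)ub, (uint32_t)(ub >> 32)) == (a > b));
}
```

If that reading is right, the operand swap in the BPF_JSGT case is what lets
both signed comparisons reuse the same CMP/SBCS pattern.
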
>> +static int out_offset = -1; /* initialized on the first pass of build_body() */
>> +static int emit_bpf_tail_call(struct jit_ctx *ctx)
>> +{
>> +
>> +       /* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
>> +       const u8 *r2 = bpf2a32[BPF_REG_2];
>> +       const u8 *r3 = bpf2a32[BPF_REG_3];
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       const u8 *tcc = bpf2a32[TCALL_CNT];
>> +       const int idx0 = ctx->idx;
>> +#define cur_offset (ctx->idx - idx0)
>> +#define jmp_offset (out_offset - (cur_offset))
>> +       u32 off, lo, hi;
>> +
>> +       /* if (index >= array->map.max_entries)
>> +        *      goto out;
>> +        */
>> +       off = offsetof(struct bpf_array, map.max_entries);
>> +       /* array->map.max_entries */
>> +       emit_a32_mov_i(tmp[1], off, false, ctx);
>> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
>> +       emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
>> +       /* index (64 bit) */
>> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
>> +       /* index >= array->map.max_entries */
>> +       emit(ARM_CMP_R(tmp2[1], tmp[1]), ctx);
>> +       _emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
>> +
>> +       /* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
>> +        *      goto out;
>> +        * tail_call_cnt++;
>> +        */
>> +       lo = (u32)MAX_TAIL_CALL_CNT;
>> +       hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
>> +       emit(ARM_LDR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
>> +       emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
>> +       emit(ARM_CMP_I(tmp[0], hi), ctx);
>> +       _emit(ARM_COND_EQ, ARM_CMP_I(tmp[1], lo), ctx);
>> +       _emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
>> +       emit(ARM_ADDS_I(tmp[1], tmp[1], 1), ctx);
>> +       emit(ARM_ADC_I(tmp[0], tmp[0], 0), ctx);
>> +       emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
>> +       emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
>> +
>> +       /* prog = array->ptrs[index]
>> +        * if (prog == NULL)
>> +        *      goto out;
>> +        */
>> +       off = offsetof(struct bpf_array, ptrs);
>> +       emit_a32_mov_i(tmp[1], off, false, ctx);
>> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
>> +       emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
>> +       emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
>> +       emit(ARM_MOV_SI(tmp[0], tmp2[1], SRTYPE_ASL, 2), ctx);
>> +       emit(ARM_LDR_R(tmp[1], tmp[1], tmp[0]), ctx);
>> +       emit(ARM_CMP_I(tmp[1], 0), ctx);
>> +       _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
>> +
>> +       /* goto *(prog->bpf_func + prologue_size); */
>> +       off = offsetof(struct bpf_prog, bpf_func);
>> +       emit_a32_mov_i(tmp2[1], off, false, ctx);
>> +       emit(ARM_LDR_R(tmp[1], tmp[1], tmp2[1]), ctx);
>> +       emit(ARM_ADD_I(tmp[1], tmp[1], ctx->prologue_bytes), ctx);
>> +       emit(ARM_BX(tmp[1]), ctx);
>> +
>> +       /* out: */
>> +       if (out_offset == -1)
>> +               out_offset = cur_offset;
>> +       if (cur_offset != out_offset) {
>> +               pr_err_once("tail_call out_offset = %d, expected %d!\n",
>> +                           cur_offset, out_offset);
>> +               return -1;
>> +       }
>> +       return 0;
>> +#undef cur_offset
>> +#undef jmp_offset
>> +}
>> +
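
As a reading aid for the tail call sequence, this is the control flow I
believe the emitted instructions implement, written as ordinary C. The struct
and function names are hypothetical stand-ins for struct bpf_array and
struct bpf_prog, and the limit of 32 is my assumption for MAX_TAIL_CALL_CNT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_TAIL_CALL_CNT 32	/* assumed value, not taken from the patch */

struct model_prog {
	int id;
};

struct model_array {
	uint32_t max_entries;
	struct model_prog **ptrs;
};

/* Returns the program to jump to, or NULL for the "out" fall-through. */
static struct model_prog *tail_call_target(const struct model_array *array,
					   uint64_t index,
					   uint64_t *tail_call_cnt)
{
	/* Only the low word of the index is loaded and compared. */
	uint32_t idx = (uint32_t)index;

	assert(array && tail_call_cnt);

	if (idx >= array->max_entries)
		return NULL;
	if (*tail_call_cnt > MODEL_MAX_TAIL_CALL_CNT)
		return NULL;
	(*tail_call_cnt)++;

	/* An empty slot also falls through to "out". */
	return array->ptrs[idx];
}
```

Note that the counter is bumped before the NULL check on the slot, which
matches the order of the emitted code as far as I can tell.
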
>> +/* 0xabcd => 0xcdab */
>> +static inline void emit_rev16(const u8 rd, const u8 rn, struct jit_ctx *ctx)
>>  {
>> -       if (!(ctx->seen & SEEN_X))
>> -               ctx->flags |= FLAG_NEED_X_RESET;
>> +#if __LINUX_ARM_ARCH__ < 6
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +
>> +       emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
>> +       emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 8), ctx);
>> +       emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
>> +       emit(ARM_ORR_SI(rd, tmp2[0], tmp2[1], SRTYPE_LSL, 8), ctx);
>> +#else /* ARMv6+ */
>> +       emit(ARM_REV16(rd, rn), ctx);
>> +#endif
>> +}
>>
>> -       ctx->seen |= SEEN_X;
>> +/* 0xabcdefgh => 0xghefcdab */
>> +static inline void emit_rev32(const u8 rd, const u8 rn, struct jit_ctx *ctx)
>> +{
>> +#if __LINUX_ARM_ARCH__ < 6
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +
>> +       emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
>> +       emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 24), ctx);
>> +       emit(ARM_ORR_SI(ARM_IP, tmp2[0], tmp2[1], SRTYPE_LSL, 24), ctx);
>> +
>> +       emit(ARM_MOV_SI(tmp2[1], rn, SRTYPE_LSR, 8), ctx);
>> +       emit(ARM_AND_I(tmp2[1], tmp2[1], 0xff), ctx);
>> +       emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 16), ctx);
>> +       emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
>> +       emit(ARM_MOV_SI(tmp2[0], tmp2[0], SRTYPE_LSL, 8), ctx);
>> +       emit(ARM_ORR_SI(tmp2[0], tmp2[0], tmp2[1], SRTYPE_LSL, 16), ctx);
>> +       emit(ARM_ORR_R(rd, ARM_IP, tmp2[0]), ctx);
>> +
>> +#else /* ARMv6+ */
>> +       emit(ARM_REV(rd, rn), ctx);
>> +#endif
>>  }
>>
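
The pre-ARMv6 fallbacks above can be modelled in C as below (function names
are mine). One thing the model makes visible: the rev16 fallback keeps only
the low halfword, whereas the REV16 instruction swaps the bytes of both
halfwords, so the two paths agree only if the caller ignores or clears the
upper 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* 0xabcd => 0xcdab; only the low 16 bits of the input are used. */
static uint32_t rev16_fallback(uint32_t rn)
{
	uint32_t b0 = rn & 0xff;
	uint32_t b1 = (rn >> 8) & 0xff;

	return b1 | (b0 << 8);
}

/* Full 32-bit byte reversal, assembled from an outer and an inner byte pair. */
static uint32_t rev32_fallback(uint32_t rn)
{
	uint32_t outer = (rn >> 24) | ((rn & 0xff) << 24);
	uint32_t b1 = (rn >> 8) & 0xff;
	uint32_t b2 = (rn >> 16) & 0xff;
	uint32_t inner = (b2 << 8) | (b1 << 16);

	return outer | inner;
}

/* Reversing twice must give back the original word. */
static void rev32_roundtrip_check(uint32_t x)
{
	assert(rev32_fallback(rev32_fallback(x)) == x);
}
```
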
>> -static int build_body(struct jit_ctx *ctx)
>> +static void build_prologue(struct jit_ctx *ctx)
>>  {
>> -       void *load_func[] = {jit_get_skb_b, jit_get_skb_h, jit_get_skb_w};
>> -       const struct bpf_prog *prog = ctx->skf;
>> -       const struct sock_filter *inst;
>> -       unsigned i, load_order, off, condt;
>> -       int imm12;
>> -       u32 k;
>> +       const u8 r0 = bpf2a32[BPF_REG_0][1];
>> +       const u8 r2 = bpf2a32[BPF_REG_1][1];
>> +       const u8 r3 = bpf2a32[BPF_REG_1][0];
>> +       const u8 r4 = bpf2a32[BPF_REG_6][1];
>> +       const u8 r5 = bpf2a32[BPF_REG_6][0];
>> +       const u8 r6 = bpf2a32[TMP_REG_1][1];
>> +       const u8 r7 = bpf2a32[TMP_REG_1][0];
>> +       const u8 r8 = bpf2a32[TMP_REG_2][1];
>> +       const u8 r10 = bpf2a32[TMP_REG_2][0];
>> +       const u8 fplo = bpf2a32[BPF_REG_FP][1];
>> +       const u8 fphi = bpf2a32[BPF_REG_FP][0];
>> +       const u8 sp = ARM_SP;
>> +       const u8 *tcc = bpf2a32[TCALL_CNT];
>> +
>> +       u16 reg_set = 0;
>>
>> -       for (i = 0; i < prog->len; i++) {
>> -               u16 code;
>> +       /*
>> +        * eBPF prog stack layout
>> +        *
>> +        *                         high
>> +        * original ARM_SP =>     +-----+ eBPF prologue
>> +        *                        |FP/LR|
>> +        * current ARM_FP =>      +-----+
>> +        *                        | ... | callee saved registers
>> +        * eBPF fp register =>    +-----+ <= (BPF_FP)
>> +        *                        | ... | eBPF JIT scratch space
>> +        *                        |     | eBPF prog stack
>> +        *                        +-----+
>> +        *                        |RSVD | JIT scratchpad
>> +        * current ARM_SP =>      +-----+ <= (BPF_FP - STACK_SIZE)
>> +        *                        |     |
>> +        *                        | ... | Function call stack
>> +        *                        |     |
>> +        *                        +-----+
>> +        *                          low
>> +        */
>>
>> -               inst = &(prog->insns[i]);
>> -               /* K as an immediate value operand */
>> -               k = inst->k;
>> -               code = bpf_anc_helper(inst);
>> +       /* Save callee saved registers. */
>> +       reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
>> +#ifdef CONFIG_FRAME_POINTER
>> +       reg_set |= (1<<ARM_FP) | (1<<ARM_IP) | (1<<ARM_LR) | (1<<ARM_PC);
>> +       emit(ARM_MOV_R(ARM_IP, sp), ctx);
>> +       emit(ARM_PUSH(reg_set), ctx);
>> +       emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
>> +#else
>> +       /* Check if call instruction exists in BPF body */
>> +       if (ctx->seen & SEEN_CALL)
>> +               reg_set |= (1<<ARM_LR);
>> +       emit(ARM_PUSH(reg_set), ctx);
>> +#endif
>> +       /* Save frame pointer for later */
>> +       emit(ARM_SUB_I(ARM_IP, sp, SCRATCH_SIZE), ctx);
>> +
>> +       /* Set up function call stack */
>> +       emit(ARM_SUB_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
>> +
>> +       /* Set up BPF prog stack base register */
>> +       emit_a32_mov_r(fplo, ARM_IP, true, false, ctx);
>> +       emit_a32_mov_i(fphi, 0, true, ctx);
>> +
>> +       /* mov r4, 0 */
>> +       emit(ARM_MOV_I(r4, 0), ctx);
>> +       /* MOV bpf_ctx pointer to BPF_R1 */
>> +       emit(ARM_MOV_R(r3, r4), ctx);
>> +       emit(ARM_MOV_R(r2, r0), ctx);
>> +       /* Initialize Tail Count */
>> +       emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[0])), ctx);
>> +       emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[1])), ctx);
>> +       /* end of prologue */
>> +}
>>
>> -               /* compute offsets only in the fake pass */
>> -               if (ctx->target == NULL)
>> -                       ctx->offsets[i] = ctx->idx * 4;
>> +static void build_epilogue(struct jit_ctx *ctx)
>> +{
>> +       const u8 r4 = bpf2a32[BPF_REG_6][1];
>> +       const u8 r5 = bpf2a32[BPF_REG_6][0];
>> +       const u8 r6 = bpf2a32[TMP_REG_1][1];
>> +       const u8 r7 = bpf2a32[TMP_REG_1][0];
>> +       const u8 r8 = bpf2a32[TMP_REG_2][1];
>> +       const u8 r10 = bpf2a32[TMP_REG_2][0];
>> +       u16 reg_set = 0;
>> +
>> +       /* unwind function call stack */
>> +       emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
>> +
>> +       /* restore callee saved registers. */
>> +       reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
>> +#ifdef CONFIG_FRAME_POINTER
>> +       /* the first instruction of the prologue was: mov ip, sp */
>> +       reg_set |= (1<<ARM_FP) | (1<<ARM_SP) | (1<<ARM_PC);
>> +       emit(ARM_LDM(ARM_SP, reg_set), ctx);
>> +#else
>> +       if (ctx->seen & SEEN_CALL)
>> +               reg_set |= (1<<ARM_PC);
>> +       /* Restore callee saved registers. */
>> +       emit(ARM_POP(reg_set), ctx);
>> +       /* Return to the caller */
>> +       if (!(ctx->seen & SEEN_CALL))
>> +               emit(ARM_BX(ARM_LR), ctx);
>> +#endif
>> +}
>>
>> -               switch (code) {
>> -               case BPF_LD | BPF_IMM:
>> -                       emit_mov_i(r_A, k, ctx);
>> +/*
>> + * Convert an eBPF instruction to native instructions, i.e.
>> + * JIT an eBPF instruction.
>> + * Returns :
>> + *     0  - Successfully JITed an 8-byte eBPF instruction
>> + *     >0 - Successfully JITed a 16-byte eBPF instruction
>> + *     <0 - Failed to JIT.
>> + */
>> +static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
>> +{
>> +       const u8 code = insn->code;
>> +       const u8 *dst = bpf2a32[insn->dst_reg];
>> +       const u8 *src = bpf2a32[insn->src_reg];
>> +       const u8 *tmp = bpf2a32[TMP_REG_1];
>> +       const u8 *tmp2 = bpf2a32[TMP_REG_2];
>> +       const s16 off = insn->off;
>> +       const s32 imm = insn->imm;
>> +       const int i = insn - ctx->prog->insnsi;
>> +       const bool is64 = BPF_CLASS(code) == BPF_ALU64;
>> +       const bool dstk = is_on_stack(insn->dst_reg);
>> +       const bool sstk = is_on_stack(insn->src_reg);
>> +       u8 rd, rt, rm, rn;
>> +       s32 jmp_offset;
>> +
>> +#define check_imm(bits, imm) do {                              \
>> +       if ((((imm) > 0) && ((imm) >> (bits))) ||               \
>> +           (((imm) < 0) && (~(imm) >> (bits)))) {              \
>> +               pr_info("[%2d] imm=%d(0x%x) out of range\n",    \
>> +                       i, imm, imm);                           \
>> +               return -EINVAL;                                 \
>> +       }                                                       \
>> +} while (0)
>> +#define check_imm24(imm) check_imm(24, imm)
>> +
>> +       switch (code) {
>> +       /* ALU operations */
>> +
>> +       /* dst = src */
>> +       case BPF_ALU | BPF_MOV | BPF_K:
>> +       case BPF_ALU | BPF_MOV | BPF_X:
>> +       case BPF_ALU64 | BPF_MOV | BPF_K:
>> +       case BPF_ALU64 | BPF_MOV | BPF_X:
>> +               switch (BPF_SRC(code)) {
>> +               case BPF_X:
>> +                       emit_a32_mov_r64(is64, dst, src, dstk, sstk, ctx);
>>                         break;
>> -               case BPF_LD | BPF_W | BPF_LEN:
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
>> -                       emit(ARM_LDR_I(r_A, r_skb,
>> -                                      offsetof(struct sk_buff, len)), ctx);
>> +               case BPF_K:
>> +                       /* Sign-extend immediate value to destination reg */
>> +                       emit_a32_mov_i64(is64, dst, imm, dstk, ctx);
>>                         break;
>> -               case BPF_LD | BPF_MEM:
>> -                       /* A = scratch[k] */
>> -                       ctx->seen |= SEEN_MEM_WORD(k);
>> -                       emit(ARM_LDR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
>> +               }
>> +               break;
>> +       /* dst = dst + src/imm */
>> +       /* dst = dst - src/imm */
>> +       /* dst = dst | src/imm */
>> +       /* dst = dst & src/imm */
>> +       /* dst = dst ^ src/imm */
>> +       /* dst = dst * src/imm */
>> +       /* dst = dst << src */
>> +       /* dst = dst >> src */
>> +       case BPF_ALU | BPF_ADD | BPF_K:
>> +       case BPF_ALU | BPF_ADD | BPF_X:
>> +       case BPF_ALU | BPF_SUB | BPF_K:
>> +       case BPF_ALU | BPF_SUB | BPF_X:
>> +       case BPF_ALU | BPF_OR | BPF_K:
>> +       case BPF_ALU | BPF_OR | BPF_X:
>> +       case BPF_ALU | BPF_AND | BPF_K:
>> +       case BPF_ALU | BPF_AND | BPF_X:
>> +       case BPF_ALU | BPF_XOR | BPF_K:
>> +       case BPF_ALU | BPF_XOR | BPF_X:
>> +       case BPF_ALU | BPF_MUL | BPF_K:
>> +       case BPF_ALU | BPF_MUL | BPF_X:
>> +       case BPF_ALU | BPF_LSH | BPF_X:
>> +       case BPF_ALU | BPF_RSH | BPF_X:
>> +       case BPF_ALU | BPF_ARSH | BPF_K:
>> +       case BPF_ALU | BPF_ARSH | BPF_X:
>> +       case BPF_ALU64 | BPF_ADD | BPF_K:
>> +       case BPF_ALU64 | BPF_ADD | BPF_X:
>> +       case BPF_ALU64 | BPF_SUB | BPF_K:
>> +       case BPF_ALU64 | BPF_SUB | BPF_X:
>> +       case BPF_ALU64 | BPF_OR | BPF_K:
>> +       case BPF_ALU64 | BPF_OR | BPF_X:
>> +       case BPF_ALU64 | BPF_AND | BPF_K:
>> +       case BPF_ALU64 | BPF_AND | BPF_X:
>> +       case BPF_ALU64 | BPF_XOR | BPF_K:
>> +       case BPF_ALU64 | BPF_XOR | BPF_X:
>> +               switch (BPF_SRC(code)) {
>> +               case BPF_X:
>> +                       emit_a32_alu_r64(is64, dst, src, dstk, sstk,
>> +                                        ctx, BPF_OP(code));
>>                         break;
>> -               case BPF_LD | BPF_W | BPF_ABS:
>> -                       load_order = 2;
>> -                       goto load;
>> -               case BPF_LD | BPF_H | BPF_ABS:
>> -                       load_order = 1;
>> -                       goto load;
>> -               case BPF_LD | BPF_B | BPF_ABS:
>> -                       load_order = 0;
>> -load:
>> -                       emit_mov_i(r_off, k, ctx);
>> -load_common:
>> -                       ctx->seen |= SEEN_DATA | SEEN_CALL;
>> -
>> -                       if (load_order > 0) {
>> -                               emit(ARM_SUB_I(r_scratch, r_skb_hl,
>> -                                              1 << load_order), ctx);
>> -                               emit(ARM_CMP_R(r_scratch, r_off), ctx);
>> -                               condt = ARM_COND_GE;
>> -                       } else {
>> -                               emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
>> -                               condt = ARM_COND_HI;
>> -                       }
>> -
>> -                       /*
>> -                        * test for negative offset, only if we are
>> -                        * currently scheduled to take the fast
>> -                        * path. this will update the flags so that
>> -                        * the slowpath instruction are ignored if the
>> -                        * offset is negative.
>> -                        *
>> -                        * for loard_order == 0 the HI condition will
>> -                        * make loads at offset 0 take the slow path too.
>> +               case BPF_K:
>> +                       /* Move the immediate into a temporary register
>> +                        * first. The move sign-extends the immediate,
>> +                        * so the ALU operation can then treat the
>> +                        * temporary like any other register source
>> +                        * operand.
>>                          */
>> -                       _emit(condt, ARM_CMP_I(r_off, 0), ctx);
>> -
>> -                       _emit(condt, ARM_ADD_R(r_scratch, r_off, r_skb_data),
>> -                             ctx);
>> -
>> -                       if (load_order == 0)
>> -                               _emit(condt, ARM_LDRB_I(r_A, r_scratch, 0),
>> -                                     ctx);
>> -                       else if (load_order == 1)
>> -                               emit_load_be16(condt, r_A, r_scratch, ctx);
>> -                       else if (load_order == 2)
>> -                               emit_load_be32(condt, r_A, r_scratch, ctx);
>> -
>> -                       _emit(condt, ARM_B(b_imm(i + 1, ctx)), ctx);
>> -
>> -                       /* the slowpath */
>> -                       emit_mov_i(ARM_R3, (u32)load_func[load_order], ctx);
>> -                       emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
>> -                       /* the offset is already in R1 */
>> -                       emit_blx_r(ARM_R3, ctx);
>> -                       /* check the result of skb_copy_bits */
>> -                       emit(ARM_CMP_I(ARM_R1, 0), ctx);
>> -                       emit_err_ret(ARM_COND_NE, ctx);
>> -                       emit(ARM_MOV_R(r_A, ARM_R0), ctx);
>> +                       emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
>> +                       emit_a32_alu_r64(is64, dst, tmp2, dstk, false,
>> +                                        ctx, BPF_OP(code));
>>                         break;
>> -               case BPF_LD | BPF_W | BPF_IND:
>> -                       load_order = 2;
>> -                       goto load_ind;
>> -               case BPF_LD | BPF_H | BPF_IND:
>> -                       load_order = 1;
>> -                       goto load_ind;
>> -               case BPF_LD | BPF_B | BPF_IND:
>> -                       load_order = 0;
>> -load_ind:
>> -                       update_on_xread(ctx);
>> -                       OP_IMM3(ARM_ADD, r_off, r_X, k, ctx);
>> -                       goto load_common;
>> -               case BPF_LDX | BPF_IMM:
>> -                       ctx->seen |= SEEN_X;
>> -                       emit_mov_i(r_X, k, ctx);
>> +               }
>> +               break;
>> +       /* dst = dst / src(imm) */
>> +       /* dst = dst % src(imm) */
>> +       case BPF_ALU | BPF_DIV | BPF_K:
>> +       case BPF_ALU | BPF_DIV | BPF_X:
>> +       case BPF_ALU | BPF_MOD | BPF_K:
>> +       case BPF_ALU | BPF_MOD | BPF_X:
>> +               rt = src_lo;
>> +               rd = dstk ? tmp2[1] : dst_lo;
>> +               if (dstk)
>> +                       emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               switch (BPF_SRC(code)) {
>> +               case BPF_X:
>> +                       rt = sstk ? tmp2[0] : rt;
>> +                       if (sstk)
>> +                               emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)),
>> +                                    ctx);
>>                         break;
>> -               case BPF_LDX | BPF_W | BPF_LEN:
>> -                       ctx->seen |= SEEN_X | SEEN_SKB;
>> -                       emit(ARM_LDR_I(r_X, r_skb,
>> -                                      offsetof(struct sk_buff, len)), ctx);
>> +               case BPF_K:
>> +                       rt = tmp2[0];
>> +                       emit_a32_mov_i(rt, imm, false, ctx);
>>                         break;
>> -               case BPF_LDX | BPF_MEM:
>> -                       ctx->seen |= SEEN_X | SEEN_MEM_WORD(k);
>> -                       emit(ARM_LDR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
>> +               }
>> +               emit_udivmod(rd, rd, rt, ctx, BPF_OP(code));
>> +               if (dstk)
>> +                       emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>> +               break;
>> +       case BPF_ALU64 | BPF_DIV | BPF_K:
>> +       case BPF_ALU64 | BPF_DIV | BPF_X:
>> +       case BPF_ALU64 | BPF_MOD | BPF_K:
>> +       case BPF_ALU64 | BPF_MOD | BPF_X:
>> +               goto notyet;
>> +       /* dst = dst >> imm */
>> +       /* dst = dst << imm */
>> +       case BPF_ALU | BPF_RSH | BPF_K:
>> +       case BPF_ALU | BPF_LSH | BPF_K:
>> +               if (unlikely(imm > 31))
>> +                       return -EINVAL;
>> +               if (imm)
>> +                       emit_a32_alu_i(dst_lo, imm, dstk, ctx, BPF_OP(code));
>> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>> +               break;
>> +       /* dst = dst << imm */
>> +       case BPF_ALU64 | BPF_LSH | BPF_K:
>> +               if (unlikely(imm > 63))
>> +                       return -EINVAL;
>> +               emit_a32_lsh_i64(dst, dstk, imm, ctx);
>> +               break;
>> +       /* dst = dst >> imm */
>> +       case BPF_ALU64 | BPF_RSH | BPF_K:
>> +               if (unlikely(imm > 63))
>> +                       return -EINVAL;
>> +               emit_a32_lsr_i64(dst, dstk, imm, ctx);
>> +               break;
>> +       /* dst = dst << src */
>> +       case BPF_ALU64 | BPF_LSH | BPF_X:
>> +               emit_a32_lsh_r64(dst, src, dstk, sstk, ctx);
>> +               break;
>> +       /* dst = dst >> src */
>> +       case BPF_ALU64 | BPF_RSH | BPF_X:
>> +               emit_a32_lsr_r64(dst, src, dstk, sstk, ctx);
>> +               break;
>> +       /* dst = dst >> src (signed) */
>> +       case BPF_ALU64 | BPF_ARSH | BPF_X:
>> +               emit_a32_arsh_r64(dst, src, dstk, sstk, ctx);
>> +               break;
>> +       /* dst = dst >> imm (signed) */
>> +       case BPF_ALU64 | BPF_ARSH | BPF_K:
>> +               if (unlikely(imm > 63))
>> +                       return -EINVAL;
>> +               emit_a32_arsh_i64(dst, dstk, imm, ctx);
>> +               break;
>> +       /* dst = -dst */
>> +       case BPF_ALU | BPF_NEG:
>> +               emit_a32_alu_i(dst_lo, 0, dstk, ctx, BPF_OP(code));
>> +               emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>> +               break;
>> +       /* dst = -dst (64 bit) */
>> +       case BPF_ALU64 | BPF_NEG:
>> +               emit_a32_neg64(dst, dstk, ctx);
>> +               break;
>> +       /* dst = dst * src/imm */
>> +       case BPF_ALU64 | BPF_MUL | BPF_X:
>> +       case BPF_ALU64 | BPF_MUL | BPF_K:
>> +               switch (BPF_SRC(code)) {
>> +               case BPF_X:
>> +                       emit_a32_mul_r64(dst, src, dstk, sstk, ctx);
>>                         break;
>> -               case BPF_LDX | BPF_B | BPF_MSH:
>> -                       /* x = ((*(frame + k)) & 0xf) << 2; */
>> -                       ctx->seen |= SEEN_X | SEEN_DATA | SEEN_CALL;
>> -                       /* the interpreter should deal with the negative K */
>> -                       if ((int)k < 0)
>> -                               return -1;
>> -                       /* offset in r1: we might have to take the slow path */
>> -                       emit_mov_i(r_off, k, ctx);
>> -                       emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
>> -
>> -                       /* load in r0: common with the slowpath */
>> -                       _emit(ARM_COND_HI, ARM_LDRB_R(ARM_R0, r_skb_data,
>> -                                                     ARM_R1), ctx);
>> -                       /*
>> -                        * emit_mov_i() might generate one or two instructions,
>> -                        * the same holds for emit_blx_r()
>> +               case BPF_K:
>> +                       /* Move the immediate into a temporary register
>> +                        * first. The move sign-extends the immediate,
>> +                        * so the multiplication can then treat the
>> +                        * temporary like any other register source
>> +                        * operand.
>>                          */
>> -                       _emit(ARM_COND_HI, ARM_B(b_imm(i + 1, ctx) - 2), ctx);
>> -
>> -                       emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
>> -                       /* r_off is r1 */
>> -                       emit_mov_i(ARM_R3, (u32)jit_get_skb_b, ctx);
>> -                       emit_blx_r(ARM_R3, ctx);
>> -                       /* check the return value of skb_copy_bits */
>> -                       emit(ARM_CMP_I(ARM_R1, 0), ctx);
>> -                       emit_err_ret(ARM_COND_NE, ctx);
>> -
>> -                       emit(ARM_AND_I(r_X, ARM_R0, 0x00f), ctx);
>> -                       emit(ARM_LSL_I(r_X, r_X, 2), ctx);
>> -                       break;
>> -               case BPF_ST:
>> -                       ctx->seen |= SEEN_MEM_WORD(k);
>> -                       emit(ARM_STR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
>> -                       break;
>> -               case BPF_STX:
>> -                       update_on_xread(ctx);
>> -                       ctx->seen |= SEEN_MEM_WORD(k);
>> -                       emit(ARM_STR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_ADD | BPF_K:
>> -                       /* A += K */
>> -                       OP_IMM3(ARM_ADD, r_A, r_A, k, ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_ADD | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_ADD_R(r_A, r_A, r_X), ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_SUB | BPF_K:
>> -                       /* A -= K */
>> -                       OP_IMM3(ARM_SUB, r_A, r_A, k, ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_SUB | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_SUB_R(r_A, r_A, r_X), ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_MUL | BPF_K:
>> -                       /* A *= K */
>> -                       emit_mov_i(r_scratch, k, ctx);
>> -                       emit(ARM_MUL(r_A, r_A, r_scratch), ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_MUL | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_MUL(r_A, r_A, r_X), ctx);
>> -                       break;
>> -               case BPF_ALU | BPF_DIV | BPF_K:
>> -                       if (k == 1)
>> -                               break;
>> -                       emit_mov_i(r_scratch, k, ctx);
>> -                       emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_DIV);
>> -                       break;
>> -               case BPF_ALU | BPF_DIV | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_CMP_I(r_X, 0), ctx);
>> -                       emit_err_ret(ARM_COND_EQ, ctx);
>> -                       emit_udivmod(r_A, r_A, r_X, ctx, BPF_DIV);
>> -                       break;
>> -               case BPF_ALU | BPF_MOD | BPF_K:
>> -                       if (k == 1) {
>> -                               emit_mov_i(r_A, 0, ctx);
>> -                               break;
>> -                       }
>> -                       emit_mov_i(r_scratch, k, ctx);
>> -                       emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_MOD);
>> +                       emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
>> +                       emit_a32_mul_r64(dst, tmp2, dstk, false, ctx);
>>                         break;
>> -               case BPF_ALU | BPF_MOD | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_CMP_I(r_X, 0), ctx);
>> -                       emit_err_ret(ARM_COND_EQ, ctx);
>> -                       emit_udivmod(r_A, r_A, r_X, ctx, BPF_MOD);
>> -                       break;
>> -               case BPF_ALU | BPF_OR | BPF_K:
>> -                       /* A |= K */
>> -                       OP_IMM3(ARM_ORR, r_A, r_A, k, ctx);
>> +               }
>> +               break;
>> +       /* dst = htole(dst) */
>> +       /* dst = htobe(dst) */
>> +       case BPF_ALU | BPF_END | BPF_FROM_LE:
>> +       case BPF_ALU | BPF_END | BPF_FROM_BE:
>> +               rd = dstk ? tmp[0] : dst_hi;
>> +               rt = dstk ? tmp[1] : dst_lo;
>> +               if (dstk) {
>> +                       emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +                       emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +               }
>> +               if (BPF_SRC(code) == BPF_FROM_LE)
>> +                       goto emit_bswap_uxt;
>> +               switch (imm) {
>> +               case 16:
>> +                       emit_rev16(rt, rt, ctx);
>> +                       goto emit_bswap_uxt;
>> +               case 32:
>> +                       emit_rev32(rt, rt, ctx);
>> +                       goto emit_bswap_uxt;
>> +               case 64:
>> +                       /* Because of the usage of ARM_LR */
>> +                       ctx->seen |= SEEN_CALL;
>> +                       emit_rev32(ARM_LR, rt, ctx);
>> +                       emit_rev32(rt, rd, ctx);
>> +                       emit(ARM_MOV_R(rd, ARM_LR), ctx);
>>                         break;
>> -               case BPF_ALU | BPF_OR | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_ORR_R(r_A, r_A, r_X), ctx);
>> +               }
>> +               goto exit;
>> +emit_bswap_uxt:
>> +               switch (imm) {
>> +               case 16:
>> +                       /* zero-extend 16 bits into 64 bits */
>> +#if __LINUX_ARM_ARCH__ < 6
>> +                       emit_a32_mov_i(tmp2[1], 0xffff, false, ctx);
>> +                       emit(ARM_AND_R(rt, rt, tmp2[1]), ctx);
>> +#else /* ARMv6+ */
>> +                       emit(ARM_UXTH(rt, rt), ctx);
>> +#endif
>> +                       emit(ARM_EOR_R(rd, rd, rd), ctx);
>>                         break;
>> -               case BPF_ALU | BPF_XOR | BPF_K:
>> -                       /* A ^= K; */
>> -                       OP_IMM3(ARM_EOR, r_A, r_A, k, ctx);
>> +               case 32:
>> +                       /* zero-extend 32 bits into 64 bits */
>> +                       emit(ARM_EOR_R(rd, rd, rd), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_ALU_XOR_X:
>> -               case BPF_ALU | BPF_XOR | BPF_X:
>> -                       /* A ^= X */
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_EOR_R(r_A, r_A, r_X), ctx);
>> +               case 64:
>> +                       /* nop */
>>                         break;
>> -               case BPF_ALU | BPF_AND | BPF_K:
>> -                       /* A &= K */
>> -                       OP_IMM3(ARM_AND, r_A, r_A, k, ctx);
>> +               }
>> +exit:
>> +               if (dstk) {
>> +                       emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +                       emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +               }
>> +               break;
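
For readers following the BPF_END case quoted above: a minimal
user-space sketch of what the emitted sequence appears to compute on a
little-endian host, assuming the 64-bit BPF register is held as a
(hi, lo) pair of 32-bit words. The names struct reg64, model_swab16(),
model_swab32() and model_bpf_end() are made up for illustration and are
not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit BPF register split into two words. */
struct reg64 {
	uint32_t hi;
	uint32_t lo;
};

/* Swap the two low bytes; the result is already zero-extended. */
static uint32_t model_swab16(uint32_t x)
{
	return ((x & 0x00ffu) << 8) | ((x & 0xff00u) >> 8);
}

/* Reverse all four bytes of a 32-bit word. */
static uint32_t model_swab32(uint32_t x)
{
	return (x << 24) | ((x & 0xff00u) << 8) |
	       ((x >> 8) & 0xff00u) | (x >> 24);
}

/*
 * from_be != 0 models BPF_FROM_BE (bytes are swapped on a LE host);
 * from_be == 0 models BPF_FROM_LE (only truncation/zero-extension).
 */
static struct reg64 model_bpf_end(struct reg64 r, int width, int from_be)
{
	assert(width == 16 || width == 32 || width == 64);

	switch (width) {
	case 16:
		r.lo = from_be ? model_swab16(r.lo) : (r.lo & 0xffffu);
		r.hi = 0;		/* zero-extend into 64 bits */
		break;
	case 32:
		if (from_be)
			r.lo = model_swab32(r.lo);
		r.hi = 0;		/* zero-extend into 64 bits */
		break;
	case 64:
		if (from_be) {
			/* Swap each word and exchange the halves. */
			uint32_t t = model_swab32(r.lo);

			r.lo = model_swab32(r.hi);
			r.hi = t;
		}
		break;
	}
	return r;
}
```

The 64-bit path needs one extra temporary for the exchange, which is
consistent with the quoted code borrowing ARM_LR and setting SEEN_CALL.
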
>> +       /* dst = imm64 */
>> +       case BPF_LD | BPF_IMM | BPF_DW:
>> +       {
>> +               const struct bpf_insn insn1 = insn[1];
>> +               u32 hi, lo = imm;
>> +
>> +               if (insn1.code != 0 || insn1.src_reg != 0 ||
>> +                   insn1.dst_reg != 0 || insn1.off != 0) {
>> +                       /* Note: the verifier in BPF core should already
>> +                        * have rejected such an invalid instruction.
>> +                        */
>> +                       pr_err_once("Invalid BPF_LD_IMM64 instruction\n");
>> +                       return -EINVAL;
>> +               }
>> +               hi = insn1.imm;
>> +               emit_a32_mov_i(dst_lo, lo, dstk, ctx);
>> +               emit_a32_mov_i(dst_hi, hi, dstk, ctx);
>> +
>> +               return 1;
>> +       }
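
A small sketch of how the two instruction slots of BPF_LD | BPF_IMM |
BPF_DW combine into one 64-bit constant, as the case above does with
lo = imm and hi = insn1.imm. struct toy_bpf_insn is a trimmed-down,
hypothetical stand-in for struct bpf_insn, and toy_ld_imm64() is an
illustrative helper, not something the patch defines.

```c
#include <stdint.h>

/* Hypothetical stand-in carrying only the fields used here. */
struct toy_bpf_insn {
	uint8_t code;
	int32_t imm;
};

/*
 * The first slot holds the low 32 bits, the second slot the high
 * 32 bits. Both immediates are reinterpreted as unsigned words, so
 * no sign extension happens between the halves.
 */
static uint64_t toy_ld_imm64(const struct toy_bpf_insn insn[2])
{
	uint32_t lo = (uint32_t)insn[0].imm;
	uint32_t hi = (uint32_t)insn[1].imm;

	return ((uint64_t)hi << 32) | lo;
}
```

Because the second slot is consumed here, the case returns 1 rather
than 0 so that the caller can step over it.
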
>> +       /* LDX: dst = *(size *)(src + off) */
>> +       case BPF_LDX | BPF_MEM | BPF_W:
>> +       case BPF_LDX | BPF_MEM | BPF_H:
>> +       case BPF_LDX | BPF_MEM | BPF_B:
>> +       case BPF_LDX | BPF_MEM | BPF_DW:
>> +               rn = sstk ? tmp2[1] : src_lo;
>> +               if (sstk)
>> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +               switch (BPF_SIZE(code)) {
>> +               case BPF_W:
>> +                       /* Load a Word */
>> +               case BPF_H:
>> +                       /* Load a Half-Word */
>> +               case BPF_B:
>> +                       /* Load a Byte */
>> +                       emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_SIZE(code));
>> +                       emit_a32_mov_i(dst_hi, 0, dstk, ctx);
>>                         break;
>> -               case BPF_ALU | BPF_AND | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_AND_R(r_A, r_A, r_X), ctx);
>> +               case BPF_DW:
>> +                       /* Load a double word */
>> +                       emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_W);
>> +                       emit_ldx_r(dst_hi, rn, dstk, off+4, ctx, BPF_W);
>>                         break;
>> -               case BPF_ALU | BPF_LSH | BPF_K:
>> -                       if (unlikely(k > 31))
>> -                               return -1;
>> -                       emit(ARM_LSL_I(r_A, r_A, k), ctx);
>> +               }
>> +               break;
>> +       /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
>> +       case BPF_LD | BPF_ABS | BPF_W:
>> +       case BPF_LD | BPF_ABS | BPF_H:
>> +       case BPF_LD | BPF_ABS | BPF_B:
>> +       /* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
>> +       case BPF_LD | BPF_IND | BPF_W:
>> +       case BPF_LD | BPF_IND | BPF_H:
>> +       case BPF_LD | BPF_IND | BPF_B:
>> +       {
>> +               const u8 r4 = bpf2a32[BPF_REG_6][1]; /* r4 = ptr to sk_buff */
>> +               const u8 r0 = bpf2a32[BPF_REG_0][1]; /* r0: struct sk_buff *skb */
>> +                                                    /* rtn value */
>> +               const u8 r1 = bpf2a32[BPF_REG_0][0]; /* r1: int k */
>> +               const u8 r2 = bpf2a32[BPF_REG_1][1]; /* r2: unsigned int size */
>> +               const u8 r3 = bpf2a32[BPF_REG_1][0]; /* r3: void *buffer */
>> +               const u8 r6 = bpf2a32[TMP_REG_1][1]; /* r6: void *(*func)(..) */
>> +               int size;
>> +
>> +               /* Setting up first argument */
>> +               emit(ARM_MOV_R(r0, r4), ctx);
>> +
>> +               /* Setting up second argument */
>> +               emit_a32_mov_i(r1, imm, false, ctx);
>> +               if (BPF_MODE(code) == BPF_IND)
>> +                       emit_a32_alu_r(r1, src_lo, false, sstk, ctx,
>> +                                      false, false, BPF_ADD);
>> +
>> +               /* Setting up third argument */
>> +               switch (BPF_SIZE(code)) {
>> +               case BPF_W:
>> +                       size = 4;
>>                         break;
>> -               case BPF_ALU | BPF_LSH | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_LSL_R(r_A, r_A, r_X), ctx);
>> +               case BPF_H:
>> +                       size = 2;
>>                         break;
>> -               case BPF_ALU | BPF_RSH | BPF_K:
>> -                       if (unlikely(k > 31))
>> -                               return -1;
>> -                       if (k)
>> -                               emit(ARM_LSR_I(r_A, r_A, k), ctx);
>> +               case BPF_B:
>> +                       size = 1;
>>                         break;
>> -               case BPF_ALU | BPF_RSH | BPF_X:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_LSR_R(r_A, r_A, r_X), ctx);
>> +               default:
>> +                       return -EINVAL;
>> +               }
>> +               emit_a32_mov_i(r2, size, false, ctx);
>> +
>> +               /* Setting up fourth argument */
>> +               emit(ARM_ADD_I(r3, ARM_SP, imm8m(SKB_BUFFER)), ctx);
>> +
>> +               /* Setting up function pointer to call */
>> +               emit_a32_mov_i(r6, (unsigned int)bpf_load_pointer, false, ctx);
>> +               emit_blx_r(r6, ctx);
>> +
>> +               emit(ARM_EOR_R(r1, r1, r1), ctx);
>> +               /* Check whether the returned pointer is NULL.
>> +                * If it is NULL, jump to the epilogue;
>> +                * otherwise load the value from the returned address.
>> +                */
>> +               emit(ARM_CMP_I(r0, 0), ctx);
>> +               jmp_offset = epilogue_offset(ctx);
>> +               check_imm24(jmp_offset);
>> +               _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
>> +
>> +               /* Load value from the address */
>> +               switch (BPF_SIZE(code)) {
>> +               case BPF_W:
>> +                       emit(ARM_LDR_I(r0, r0, 0), ctx);
>> +                       emit_rev32(r0, r0, ctx);
>>                         break;
>> -               case BPF_ALU | BPF_NEG:
>> -                       /* A = -A */
>> -                       emit(ARM_RSB_I(r_A, r_A, 0), ctx);
>> +               case BPF_H:
>> +                       emit(ARM_LDRH_I(r0, r0, 0), ctx);
>> +                       emit_rev16(r0, r0, ctx);
>>                         break;
>> -               case BPF_JMP | BPF_JA:
>> -                       /* pc += K */
>> -                       emit(ARM_B(b_imm(i + k + 1, ctx)), ctx);
>> +               case BPF_B:
>> +                       emit(ARM_LDRB_I(r0, r0, 0), ctx);
>> +                       /* No need to reverse */
>>                         break;
>> -               case BPF_JMP | BPF_JEQ | BPF_K:
>> -                       /* pc += (A == K) ? pc->jt : pc->jf */
>> -                       condt  = ARM_COND_EQ;
>> -                       goto cmp_imm;
>> -               case BPF_JMP | BPF_JGT | BPF_K:
>> -                       /* pc += (A > K) ? pc->jt : pc->jf */
>> -                       condt  = ARM_COND_HI;
>> -                       goto cmp_imm;
>> -               case BPF_JMP | BPF_JGE | BPF_K:
>> -                       /* pc += (A >= K) ? pc->jt : pc->jf */
>> -                       condt  = ARM_COND_HS;
>> -cmp_imm:
>> -                       imm12 = imm8m(k);
>> -                       if (imm12 < 0) {
>> -                               emit_mov_i_no8m(r_scratch, k, ctx);
>> -                               emit(ARM_CMP_R(r_A, r_scratch), ctx);
>> -                       } else {
>> -                               emit(ARM_CMP_I(r_A, imm12), ctx);
>> -                       }
>> -cond_jump:
>> -                       if (inst->jt)
>> -                               _emit(condt, ARM_B(b_imm(i + inst->jt + 1,
>> -                                                  ctx)), ctx);
>> -                       if (inst->jf)
>> -                               _emit(condt ^ 1, ARM_B(b_imm(i + inst->jf + 1,
>> -                                                            ctx)), ctx);
>> +               }
>> +               break;
>> +       }
>> +       /* ST: *(size *)(dst + off) = imm */
>> +       case BPF_ST | BPF_MEM | BPF_W:
>> +       case BPF_ST | BPF_MEM | BPF_H:
>> +       case BPF_ST | BPF_MEM | BPF_B:
>> +       case BPF_ST | BPF_MEM | BPF_DW:
>> +               switch (BPF_SIZE(code)) {
>> +               case BPF_DW:
>> +                       /* Sign-extend immediate value into temp reg */
>> +                       emit_a32_mov_i64(true, tmp2, imm, false, ctx);
>> +                       emit_str_r(dst_lo, tmp2[1], dstk, off, ctx, BPF_W);
>> +                       emit_str_r(dst_lo, tmp2[0], dstk, off+4, ctx, BPF_W);
>>                         break;
>> -               case BPF_JMP | BPF_JEQ | BPF_X:
>> -                       /* pc += (A == X) ? pc->jt : pc->jf */
>> -                       condt   = ARM_COND_EQ;
>> -                       goto cmp_x;
>> -               case BPF_JMP | BPF_JGT | BPF_X:
>> -                       /* pc += (A > X) ? pc->jt : pc->jf */
>> -                       condt   = ARM_COND_HI;
>> -                       goto cmp_x;
>> -               case BPF_JMP | BPF_JGE | BPF_X:
>> -                       /* pc += (A >= X) ? pc->jt : pc->jf */
>> -                       condt   = ARM_COND_CS;
>> -cmp_x:
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_CMP_R(r_A, r_X), ctx);
>> -                       goto cond_jump;
>> -               case BPF_JMP | BPF_JSET | BPF_K:
>> -                       /* pc += (A & K) ? pc->jt : pc->jf */
>> -                       condt  = ARM_COND_NE;
>> -                       /* not set iff all zeroes iff Z==1 iff EQ */
>> -
>> -                       imm12 = imm8m(k);
>> -                       if (imm12 < 0) {
>> -                               emit_mov_i_no8m(r_scratch, k, ctx);
>> -                               emit(ARM_TST_R(r_A, r_scratch), ctx);
>> -                       } else {
>> -                               emit(ARM_TST_I(r_A, imm12), ctx);
>> -                       }
>> -                       goto cond_jump;
>> -               case BPF_JMP | BPF_JSET | BPF_X:
>> -                       /* pc += (A & X) ? pc->jt : pc->jf */
>> -                       update_on_xread(ctx);
>> -                       condt  = ARM_COND_NE;
>> -                       emit(ARM_TST_R(r_A, r_X), ctx);
>> -                       goto cond_jump;
>> -               case BPF_RET | BPF_A:
>> -                       emit(ARM_MOV_R(ARM_R0, r_A), ctx);
>> -                       goto b_epilogue;
>> -               case BPF_RET | BPF_K:
>> -                       if ((k == 0) && (ctx->ret0_fp_idx < 0))
>> -                               ctx->ret0_fp_idx = i;
>> -                       emit_mov_i(ARM_R0, k, ctx);
>> -b_epilogue:
>> -                       if (i != ctx->skf->len - 1)
>> -                               emit(ARM_B(b_imm(prog->len, ctx)), ctx);
>> +               case BPF_W:
>> +               case BPF_H:
>> +               case BPF_B:
>> +                       emit_a32_mov_i(tmp2[1], imm, false, ctx);
>> +                       emit_str_r(dst_lo, tmp2[1], dstk, off, ctx,
>> +                                  BPF_SIZE(code));
>>                         break;
>> -               case BPF_MISC | BPF_TAX:
>> -                       /* X = A */
>> -                       ctx->seen |= SEEN_X;
>> -                       emit(ARM_MOV_R(r_X, r_A), ctx);
>> +               }
>> +               break;
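
The BPF_DW store above sign-extends the 32-bit immediate before writing
two words. A minimal sketch of that split, assuming the usual eBPF rule
that a 32-bit immediate is sign-extended to 64 bits; imm64_lo() and
imm64_hi() are hypothetical helper names used only for illustration.

```c
#include <stdint.h>

/* Low word of the sign-extended immediate (stored at off). */
static uint32_t imm64_lo(int32_t imm)
{
	return (uint32_t)(uint64_t)(int64_t)imm;
}

/* High word of the sign-extended immediate (stored at off + 4). */
static uint32_t imm64_hi(int32_t imm)
{
	return (uint32_t)((uint64_t)(int64_t)imm >> 32);
}
```

For a negative immediate the high word is all ones, for a non-negative
one it is zero, so the high word never needs a literal of its own.
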
>> +       /* STX XADD: lock *(u32 *)(dst + off) += src */
>> +       case BPF_STX | BPF_XADD | BPF_W:
>> +       /* STX XADD: lock *(u64 *)(dst + off) += src */
>> +       case BPF_STX | BPF_XADD | BPF_DW:
>> +               goto notyet;
>> +       /* STX: *(size *)(dst + off) = src */
>> +       case BPF_STX | BPF_MEM | BPF_W:
>> +       case BPF_STX | BPF_MEM | BPF_H:
>> +       case BPF_STX | BPF_MEM | BPF_B:
>> +       case BPF_STX | BPF_MEM | BPF_DW:
>> +       {
>> +               u8 sz = BPF_SIZE(code);
>> +
>> +               rn = sstk ? tmp2[1] : src_lo;
>> +               rm = sstk ? tmp2[0] : src_hi;
>> +               if (!sstk)
>> +                       goto do_store;
>> +               switch (BPF_SIZE(code)) {
>> +               case BPF_W:
>> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +                       goto empty_hi;
>> +               case BPF_H:
>> +                       emit(ARM_LDRH_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +                       goto empty_hi;
>> +               case BPF_B:
>> +                       emit(ARM_LDRB_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +                       goto empty_hi;
>> +empty_hi:
>> +                       emit(ARM_EOR_R(rm, rm, rm), ctx);
>> +               case BPF_DW:
>> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +                       emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
>> +                       sz = BPF_W;
>>                         break;
>> -               case BPF_MISC | BPF_TXA:
>> -                       /* A = X */
>> -                       update_on_xread(ctx);
>> -                       emit(ARM_MOV_R(r_A, r_X), ctx);
>> +               }
>> +
>> +do_store:
>> +               /* Clear higher word except for BPF_DW */
>> +               if (BPF_SIZE(code) != BPF_DW)
>> +                       emit(ARM_EOR_R(rm, rm, rm), ctx);
>> +
>> +               /* Store the value */
>> +               emit_str_r(dst_lo, rn, dstk, off, ctx, sz);
>> +               emit_str_r(dst_lo, rm, dstk, off+4, ctx, BPF_W);
>> +               break;
>> +       }
>> +       /* PC += off if dst == src */
>> +       /* PC += off if dst > src */
>> +       /* PC += off if dst >= src */
>> +       /* PC += off if dst != src */
>> +       /* PC += off if dst > src (signed) */
>> +       /* PC += off if dst >= src (signed) */
>> +       /* PC += off if dst & src */
>> +       case BPF_JMP | BPF_JEQ | BPF_X:
>> +       case BPF_JMP | BPF_JGT | BPF_X:
>> +       case BPF_JMP | BPF_JGE | BPF_X:
>> +       case BPF_JMP | BPF_JNE | BPF_X:
>> +       case BPF_JMP | BPF_JSGT | BPF_X:
>> +       case BPF_JMP | BPF_JSGE | BPF_X:
>> +       case BPF_JMP | BPF_JSET | BPF_X:
>> +               /* Setup source registers */
>> +               rm = sstk ? tmp2[0] : src_hi;
>> +               rn = sstk ? tmp2[1] : src_lo;
>> +               if (sstk) {
>> +                       emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
>> +                       emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
>> +               }
>> +               goto go_jmp;
>> +       /* PC += off if dst == imm */
>> +       /* PC += off if dst > imm */
>> +       /* PC += off if dst >= imm */
>> +       /* PC += off if dst != imm */
>> +       /* PC += off if dst > imm (signed) */
>> +       /* PC += off if dst >= imm (signed) */
>> +       /* PC += off if dst & imm */
>> +       case BPF_JMP | BPF_JEQ | BPF_K:
>> +       case BPF_JMP | BPF_JGT | BPF_K:
>> +       case BPF_JMP | BPF_JGE | BPF_K:
>> +       case BPF_JMP | BPF_JNE | BPF_K:
>> +       case BPF_JMP | BPF_JSGT | BPF_K:
>> +       case BPF_JMP | BPF_JSGE | BPF_K:
>> +       case BPF_JMP | BPF_JSET | BPF_K:
>> +               if (off == 0)
>>                         break;
>> -               case BPF_ANC | SKF_AD_PROTOCOL:
>> -                       /* A = ntohs(skb->protocol) */
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
>> -                                                 protocol) != 2);
>> -                       off = offsetof(struct sk_buff, protocol);
>> -                       emit(ARM_LDRH_I(r_scratch, r_skb, off), ctx);
>> -                       emit_swap16(r_A, r_scratch, ctx);
>> +               rm = tmp2[0];
>> +               rn = tmp2[1];
>> +               /* Sign-extend immediate value */
>> +               emit_a32_mov_i64(true, tmp2, imm, false, ctx);
>> +go_jmp:
>> +               /* Setup destination register */
>> +               rd = dstk ? tmp[0] : dst_hi;
>> +               rt = dstk ? tmp[1] : dst_lo;
>> +               if (dstk) {
>> +                       emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
>> +                       emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
>> +               }
>> +
>> +               /* Check for the condition */
>> +               emit_ar_r(rd, rt, rm, rn, ctx, BPF_OP(code));
>> +
>> +               /* Setup JUMP instruction */
>> +               jmp_offset = bpf2a32_offset(i+off, i, ctx);
>> +               switch (BPF_OP(code)) {
>> +               case BPF_JNE:
>> +               case BPF_JSET:
>> +                       _emit(ARM_COND_NE, ARM_B(jmp_offset), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_CPU:
>> -                       /* r_scratch = current_thread_info() */
>> -                       OP_IMM3(ARM_BIC, r_scratch, ARM_SP, THREAD_SIZE - 1, ctx);
>> -                       /* A = current_thread_info()->cpu */
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct thread_info, cpu) != 4);
>> -                       off = offsetof(struct thread_info, cpu);
>> -                       emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
>> +               case BPF_JEQ:
>> +                       _emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_IFINDEX:
>> -               case BPF_ANC | SKF_AD_HATYPE:
>> -                       /* A = skb->dev->ifindex */
>> -                       /* A = skb->dev->type */
>> -                       ctx->seen |= SEEN_SKB;
>> -                       off = offsetof(struct sk_buff, dev);
>> -                       emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
>> -
>> -                       emit(ARM_CMP_I(r_scratch, 0), ctx);
>> -                       emit_err_ret(ARM_COND_EQ, ctx);
>> -
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
>> -                                                 ifindex) != 4);
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
>> -                                                 type) != 2);
>> -
>> -                       if (code == (BPF_ANC | SKF_AD_IFINDEX)) {
>> -                               off = offsetof(struct net_device, ifindex);
>> -                               emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
>> -                       } else {
>> -                               /*
>> -                                * offset of field "type" in "struct
>> -                                * net_device" is above what can be
>> -                                * used in the ldrh rd, [rn, #imm]
>> -                                * instruction, so load the offset in
>> -                                * a register and use ldrh rd, [rn, rm]
>> -                                */
>> -                               off = offsetof(struct net_device, type);
>> -                               emit_mov_i(ARM_R3, off, ctx);
>> -                               emit(ARM_LDRH_R(r_A, r_scratch, ARM_R3), ctx);
>> -                       }
>> +               case BPF_JGT:
>> +                       _emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_MARK:
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
>> -                       off = offsetof(struct sk_buff, mark);
>> -                       emit(ARM_LDR_I(r_A, r_skb, off), ctx);
>> +               case BPF_JGE:
>> +                       _emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_RXHASH:
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, hash) != 4);
>> -                       off = offsetof(struct sk_buff, hash);
>> -                       emit(ARM_LDR_I(r_A, r_skb, off), ctx);
>> +               case BPF_JSGT:
>> +                       _emit(ARM_COND_LT, ARM_B(jmp_offset), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_VLAN_TAG:
>> -               case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2);
>> -                       off = offsetof(struct sk_buff, vlan_tci);
>> -                       emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
>> -                       if (code == (BPF_ANC | SKF_AD_VLAN_TAG))
>> -                               OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx);
>> -                       else {
>> -                               OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx);
>> -                               OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
>> -                       }
>> +               case BPF_JSGE:
>> +                       _emit(ARM_COND_GE, ARM_B(jmp_offset), ctx);
>>                         break;
>> -               case BPF_ANC | SKF_AD_PKTTYPE:
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
>> -                                                 __pkt_type_offset[0]) != 1);
>> -                       off = PKT_TYPE_OFFSET();
>> -                       emit(ARM_LDRB_I(r_A, r_skb, off), ctx);
>> -                       emit(ARM_AND_I(r_A, r_A, PKT_TYPE_MAX), ctx);
>> -#ifdef __BIG_ENDIAN_BITFIELD
>> -                       emit(ARM_LSR_I(r_A, r_A, 5), ctx);
>> -#endif
>> +               }
>> +               break;
>> +       /* JMP OFF */
>> +       case BPF_JMP | BPF_JA:
>> +       {
>> +               if (off == 0)
>>                         break;
>> -               case BPF_ANC | SKF_AD_QUEUE:
>> -                       ctx->seen |= SEEN_SKB;
>> -                       BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
>> -                                                 queue_mapping) != 2);
>> -                       BUILD_BUG_ON(offsetof(struct sk_buff,
>> -                                             queue_mapping) > 0xff);
>> -                       off = offsetof(struct sk_buff, queue_mapping);
>> -                       emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
>> +               jmp_offset = bpf2a32_offset(i+off, i, ctx);
>> +               check_imm24(jmp_offset);
>> +               emit(ARM_B(jmp_offset), ctx);
>> +               break;
>> +       }
>> +       /* tail call */
>> +       case BPF_JMP | BPF_CALL | BPF_X:
>> +               if (emit_bpf_tail_call(ctx))
>> +                       return -EFAULT;
>> +               break;
>> +       /* function call */
>> +       case BPF_JMP | BPF_CALL:
>> +               goto notyet;
>> +       /* function return */
>> +       case BPF_JMP | BPF_EXIT:
>> +               /* Optimization: when the last instruction is EXIT,
>> +                * simply fall through to the epilogue.
>> +                */
>> +               if (i == ctx->prog->len - 1)
>>                         break;
>> -               case BPF_ANC | SKF_AD_PAY_OFFSET:
>> -                       ctx->seen |= SEEN_SKB | SEEN_CALL;
>> +               jmp_offset = epilogue_offset(ctx);
>> +               check_imm24(jmp_offset);
>> +               emit(ARM_B(jmp_offset), ctx);
>> +               break;
>> +notyet:
>> +               pr_info_once("*** NOT YET: opcode %02x ***\n", code);
>> +               return -EFAULT;
>> +       default:
>> +               pr_err_once("unknown opcode %02x\n", code);
>> +               return -EINVAL;
>> +       }
>>
>> -                       emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
>> -                       emit_mov_i(ARM_R3, (unsigned int)skb_get_poff, ctx);
>> -                       emit_blx_r(ARM_R3, ctx);
>> -                       emit(ARM_MOV_R(r_A, ARM_R0), ctx);
>> -                       break;
>> -               case BPF_LDX | BPF_W | BPF_ABS:
>> -                       /*
>> -                        * load a 32bit word from struct seccomp_data.
>> -                        * seccomp_check_filter() will already have checked
>> -                        * that k is 32bit aligned and lies within the
>> -                        * struct seccomp_data.
>> -                        */
>> -                       ctx->seen |= SEEN_SKB;
>> -                       emit(ARM_LDR_I(r_A, r_skb, k), ctx);
>> -                       break;
>> -               default:
>> -                       return -1;
>> +       if (ctx->flags & FLAG_IMM_OVERFLOW)
>> +               /*
>> +                * this instruction generated an overflow when
>> +                * trying to access the literal pool, so
>> +                * delegate this filter to the kernel interpreter.
>> +                */
>> +               return -1;
>> +       return 0;
>> +}
>> +
>> +static int build_body(struct jit_ctx *ctx)
>> +{
>> +       const struct bpf_prog *prog = ctx->prog;
>> +       unsigned int i;
>> +
>> +       for (i = 0; i < prog->len; i++) {
>> +               const struct bpf_insn *insn = &(prog->insnsi[i]);
>> +               int ret;
>> +
>> +               ret = build_insn(insn, ctx);
>> +
>> +               /* ret > 0: a 64-bit immediate load used two slots. */
>> +               if (ret > 0) {
>> +                       i++;
>> +                       if (ctx->target == NULL)
>> +                               ctx->offsets[i] = ctx->idx;
>> +                       continue;
>>                 }
>>
>> -               if (ctx->flags & FLAG_IMM_OVERFLOW)
>> -                       /*
>> -                        * this instruction generated an overflow when
>> -                        * trying to access the literal pool, so
>> -                        * delegate this filter to the kernel interpreter.
>> -                        */
>> -                       return -1;
>> +               if (ctx->target == NULL)
>> +                       ctx->offsets[i] = ctx->idx;
>> +
>> +               /* If unsuccessful, return with error code */
>> +               if (ret)
>> +                       return ret;
>>         }
>> +       return 0;
>> +}
>>
>> -       /* compute offsets only during the first pass */
>> -       if (ctx->target == NULL)
>> -               ctx->offsets[i] = ctx->idx * 4;
>> +static int validate_code(struct jit_ctx *ctx)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < ctx->idx; i++) {
>> +               u32 a32_insn = le32_to_cpu(ctx->target[i]);
>> +
>> +               if (a32_insn == ARM_INST_UDF)
>> +                       return -1;
>> +       }
>>
>>         return 0;
>>  }
>>
>> +void bpf_jit_compile(struct bpf_prog *prog)
>> +{
>> +       /* Nothing to do here. We support Internal BPF. */
>> +}
>>
>> -void bpf_jit_compile(struct bpf_prog *fp)
>> +struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>>  {
>> +#ifdef __LITTLE_ENDIAN
>> +       struct bpf_prog *tmp, *orig_prog = prog;
>>         struct bpf_binary_header *header;
>> +       bool tmp_blinded = false;
>>         struct jit_ctx ctx;
>> -       unsigned tmp_idx;
>> -       unsigned alloc_size;
>> -       u8 *target_ptr;
>> +       unsigned int tmp_idx;
>> +       unsigned int image_size;
>> +       u8 *image_ptr;
>>
>> +       /* If BPF JIT was not enabled then we must fall back to
>> +        * the interpreter.
>> +        */
>>         if (!bpf_jit_enable)
>> -               return;
>> +               return orig_prog;
>>
>> -       memset(&ctx, 0, sizeof(ctx));
>> -       ctx.skf         = fp;
>> -       ctx.ret0_fp_idx = -1;
>> +       /* If constant blinding was enabled and we failed during blinding
>> +        * then we must fall back to the interpreter. Otherwise, we save
>> +        * the new JITed code.
>> +        */
>> +       tmp = bpf_jit_blind_constants(prog);
>>
>> -       ctx.offsets = kzalloc(4 * (ctx.skf->len + 1), GFP_KERNEL);
>> -       if (ctx.offsets == NULL)
>> -               return;
>> +       if (IS_ERR(tmp))
>> +               return orig_prog;
>> +       if (tmp != prog) {
>> +               tmp_blinded = true;
>> +               prog = tmp;
>> +       }
>> +
>> +       memset(&ctx, 0, sizeof(ctx));
>> +       ctx.prog = prog;
>>
>> -       /* fake pass to fill in the ctx->seen */
>> -       if (unlikely(build_body(&ctx)))
>> +       /* Not able to allocate memory for offsets[] , then
>> +        * we must fall back to the interpreter
>> +        */
>> +       ctx.offsets = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
>> +       if (ctx.offsets == NULL) {
>> +               prog = orig_prog;
>>                 goto out;
>> +       }
>> +
>> +       /* 1) fake pass to find the length of the JITed code,
>> +        * to compute ctx->offsets and other context variables
>> +        * needed to compute final JITed code.
>> +        * Also, calculate random starting pointer/start of JITed code
>> +        * which is prefixed by random number of fault instructions.
>> +        *
>> +        * If the first pass fails then there is no chance of it
>> +        * being successful in the second pass, so just fall back
>> +        * to the interpreter.
>> +        */
>> +       if (build_body(&ctx)) {
>> +               prog = orig_prog;
>> +               goto out_off;
>> +       }
>>
>>         tmp_idx = ctx.idx;
>>         build_prologue(&ctx);
>>         ctx.prologue_bytes = (ctx.idx - tmp_idx) * 4;
>>
>> +       ctx.epilogue_offset = ctx.idx;
>> +
>>  #if __LINUX_ARM_ARCH__ < 7
>>         tmp_idx = ctx.idx;
>>         build_epilogue(&ctx);
>> @@ -1020,64 +1838,96 @@ void bpf_jit_compile(struct bpf_prog *fp)
>>
>>         ctx.idx += ctx.imm_count;
>>         if (ctx.imm_count) {
>> -               ctx.imms = kzalloc(4 * ctx.imm_count, GFP_KERNEL);
>> -               if (ctx.imms == NULL)
>> -                       goto out;
>> +               ctx.imms = kcalloc(ctx.imm_count, sizeof(u32), GFP_KERNEL);
>> +               if (ctx.imms == NULL) {
>> +                       prog = orig_prog;
>> +                       goto out_off;
>> +               }
>>         }
>>  #else
>> -       /* there's nothing after the epilogue on ARMv7 */
>> +       /* there's nothing about the epilogue on ARMv7 */
>>         build_epilogue(&ctx);
>>  #endif
>> -       alloc_size = 4 * ctx.idx;
>> -       header = bpf_jit_binary_alloc(alloc_size, &target_ptr,
>> -                                     4, jit_fill_hole);
>> -       if (header == NULL)
>> -               goto out;
>> +       /* Now we can get the actual image size of the JITed arm code.
>> +        * Currently, we are not considering the THUMB-2 instructions
>> +        * for jit, although it can decrease the size of the image.
>> +        *
>> +        * As each arm instruction is of length 32bit, we are translating
>> +        * the number of JITed instructions into the size required to store
>> +        * the JITed code.
>> +        */
>> +       image_size = sizeof(u32) * ctx.idx;
>>
>> -       ctx.target = (u32 *) target_ptr;
>> +       /* Now we know the size of the structure to make */
>> +       header = bpf_jit_binary_alloc(image_size, &image_ptr,
>> +                                     sizeof(u32), jit_fill_hole);
>> +       /* Not able to allocate memory for the structure then
>> +        * we must fall back to the interpretation
>> +        */
>> +       if (header == NULL) {
>> +               prog = orig_prog;
>> +               goto out_imms;
>> +       }
>> +
>> +       /* 2.) Actual pass to generate final JIT code */
>> +       ctx.target = (u32 *) image_ptr;
>>         ctx.idx = 0;
>>
>>         build_prologue(&ctx);
>> +
>> +       /* If building the body of the JITed code fails somehow,
>> +        * we fall back to the interpretation.
>> +        */
>>         if (build_body(&ctx) < 0) {
>> -#if __LINUX_ARM_ARCH__ < 7
>> -               if (ctx.imm_count)
>> -                       kfree(ctx.imms);
>> -#endif
>> +               image_ptr = NULL;
>>                 bpf_jit_binary_free(header);
>> -               goto out;
>> +               prog = orig_prog;
>> +               goto out_imms;
>>         }
>>         build_epilogue(&ctx);
>>
>> +       /* 3.) Extra pass to validate JITed Code */
>> +       if (validate_code(&ctx)) {
>> +               image_ptr = NULL;
>> +               bpf_jit_binary_free(header);
>> +               prog = orig_prog;
>> +               goto out_imms;
>> +       }
>>         flush_icache_range((u32)header, (u32)(ctx.target + ctx.idx));
>>
>> -#if __LINUX_ARM_ARCH__ < 7
>> -       if (ctx.imm_count)
>> -               kfree(ctx.imms);
>> -#endif
>> -
>>         if (bpf_jit_enable > 1)
>>                 /* there are 2 passes here */
>> -               bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
>> +               bpf_jit_dump(prog->len, image_size, 2, ctx.target);
>>
>>         set_memory_ro((unsigned long)header, header->pages);
>> -       fp->bpf_func = (void *)ctx.target;
>> -       fp->jited = 1;
>> -out:
>> +       prog->bpf_func = (void *)ctx.target;
>> +       prog->jited = 1;
>> +out_imms:
>> +#if __LINUX_ARM_ARCH__ < 7
>> +       if (ctx.imm_count)
>> +               kfree(ctx.imms);
>> +#endif
>> +out_off:
>>         kfree(ctx.offsets);
>> -       return;
>> +out:
>> +       if (tmp_blinded)
>> +               bpf_jit_prog_release_other(prog, prog == orig_prog ?
>> +                                          tmp : orig_prog);
>> +#endif /* __LITTLE_ENDIAN */
>> +       return prog;
>>  }
>>
>> -void bpf_jit_free(struct bpf_prog *fp)
>> +void bpf_jit_free(struct bpf_prog *prog)
>>  {
>> -       unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
>> +       unsigned long addr = (unsigned long)prog->bpf_func & PAGE_MASK;
>>         struct bpf_binary_header *header = (void *)addr;
>>
>> -       if (!fp->jited)
>> +       if (!prog->jited)
>>                 goto free_filter;
>>
>>         set_memory_rw(addr, header->pages);
>>         bpf_jit_binary_free(header);
>>
>>  free_filter:
>> -       bpf_prog_unlock_free(fp);
>> +       bpf_prog_unlock_free(prog);
>>  }
>> diff --git a/arch/arm/net/bpf_jit_32.h b/arch/arm/net/bpf_jit_32.h
>> index c46fca2..d5cf5f6 100644
>> --- a/arch/arm/net/bpf_jit_32.h
>> +++ b/arch/arm/net/bpf_jit_32.h
>> @@ -11,6 +11,7 @@
>>  #ifndef PFILTER_OPCODES_ARM_H
>>  #define PFILTER_OPCODES_ARM_H
>>
>> +/* ARM 32bit Registers */
>>  #define ARM_R0 0
>>  #define ARM_R1 1
>>  #define ARM_R2 2
>> @@ -22,38 +23,43 @@
>>  #define ARM_R8 8
>>  #define ARM_R9 9
>>  #define ARM_R10        10
>> -#define ARM_FP 11
>> -#define ARM_IP 12
>> -#define ARM_SP 13
>> -#define ARM_LR 14
>> -#define ARM_PC 15
>> -
>> -#define ARM_COND_EQ            0x0
>> -#define ARM_COND_NE            0x1
>> -#define ARM_COND_CS            0x2
>> +#define ARM_FP 11      /* Frame Pointer */
>> +#define ARM_IP 12      /* Intra-procedure scratch register */
>> +#define ARM_SP 13      /* Stack pointer: as load/store base reg */
>> +#define ARM_LR 14      /* Link Register */
>> +#define ARM_PC 15      /* Program counter */
>> +
>> +#define ARM_COND_EQ            0x0     /* == */
>> +#define ARM_COND_NE            0x1     /* != */
>> +#define ARM_COND_CS            0x2     /* unsigned >= */
>>  #define ARM_COND_HS            ARM_COND_CS
>> -#define ARM_COND_CC            0x3
>> +#define ARM_COND_CC            0x3     /* unsigned < */
>>  #define ARM_COND_LO            ARM_COND_CC
>> -#define ARM_COND_MI            0x4
>> -#define ARM_COND_PL            0x5
>> -#define ARM_COND_VS            0x6
>> -#define ARM_COND_VC            0x7
>> -#define ARM_COND_HI            0x8
>> -#define ARM_COND_LS            0x9
>> -#define ARM_COND_GE            0xa
>> -#define ARM_COND_LT            0xb
>> -#define ARM_COND_GT            0xc
>> -#define ARM_COND_LE            0xd
>> -#define ARM_COND_AL            0xe
>> +#define ARM_COND_MI            0x4     /* < 0 */
>> +#define ARM_COND_PL            0x5     /* >= 0 */
>> +#define ARM_COND_VS            0x6     /* Signed Overflow */
>> +#define ARM_COND_VC            0x7     /* No Signed Overflow */
>> +#define ARM_COND_HI            0x8     /* unsigned > */
>> +#define ARM_COND_LS            0x9     /* unsigned <= */
>> +#define ARM_COND_GE            0xa     /* Signed >= */
>> +#define ARM_COND_LT            0xb     /* Signed < */
>> +#define ARM_COND_GT            0xc     /* Signed > */
>> +#define ARM_COND_LE            0xd     /* Signed <= */
>> +#define ARM_COND_AL            0xe     /* None */
>>
>>  /* register shift types */
>>  #define SRTYPE_LSL             0
>>  #define SRTYPE_LSR             1
>>  #define SRTYPE_ASR             2
>>  #define SRTYPE_ROR             3
>> +#define SRTYPE_ASL             (SRTYPE_LSL)
>>
>>  #define ARM_INST_ADD_R         0x00800000
>> +#define ARM_INST_ADDS_R                0x00900000
>> +#define ARM_INST_ADC_R         0x00a00000
>> +#define ARM_INST_ADC_I         0x02a00000
>>  #define ARM_INST_ADD_I         0x02800000
>> +#define ARM_INST_ADDS_I                0x02900000
>>
>>  #define ARM_INST_AND_R         0x00000000
>>  #define ARM_INST_AND_I         0x02000000
>> @@ -76,8 +82,10 @@
>>  #define ARM_INST_LDRH_I                0x01d000b0
>>  #define ARM_INST_LDRH_R                0x019000b0
>>  #define ARM_INST_LDR_I         0x05900000
>> +#define ARM_INST_LDR_R         0x07900000
>>
>>  #define ARM_INST_LDM           0x08900000
>> +#define ARM_INST_LDM_IA                0x08b00000
>>
>>  #define ARM_INST_LSL_I         0x01a00000
>>  #define ARM_INST_LSL_R         0x01a00010
>> @@ -86,6 +94,7 @@
>>  #define ARM_INST_LSR_R         0x01a00030
>>
>>  #define ARM_INST_MOV_R         0x01a00000
>> +#define ARM_INST_MOVS_R                0x01b00000
>>  #define ARM_INST_MOV_I         0x03a00000
>>  #define ARM_INST_MOVW          0x03000000
>>  #define ARM_INST_MOVT          0x03400000
>> @@ -96,17 +105,28 @@
>>  #define ARM_INST_PUSH          0x092d0000
>>
>>  #define ARM_INST_ORR_R         0x01800000
>> +#define ARM_INST_ORRS_R                0x01900000
>>  #define ARM_INST_ORR_I         0x03800000
>>
>>  #define ARM_INST_REV           0x06bf0f30
>>  #define ARM_INST_REV16         0x06bf0fb0
>>
>>  #define ARM_INST_RSB_I         0x02600000
>> +#define ARM_INST_RSBS_I                0x02700000
>> +#define ARM_INST_RSC_I         0x02e00000
>>
>>  #define ARM_INST_SUB_R         0x00400000
>> +#define ARM_INST_SUBS_R                0x00500000
>> +#define ARM_INST_RSB_R         0x00600000
>>  #define ARM_INST_SUB_I         0x02400000
>> +#define ARM_INST_SUBS_I                0x02500000
>> +#define ARM_INST_SBC_I         0x02c00000
>> +#define ARM_INST_SBC_R         0x00c00000
>> +#define ARM_INST_SBCS_R                0x00d00000
>>
>>  #define ARM_INST_STR_I         0x05800000
>> +#define ARM_INST_STRB_I                0x05c00000
>> +#define ARM_INST_STRH_I                0x01c000b0
>>
>>  #define ARM_INST_TST_R         0x01100000
>>  #define ARM_INST_TST_I         0x03100000
>> @@ -117,6 +137,8 @@
>>
>>  #define ARM_INST_MLS           0x00600090
>>
>> +#define ARM_INST_UXTH          0x06ff0070
>> +
>>  /*
>>   * Use a suitable undefined instruction to use for ARM/Thumb2 faulting.
>>   * We need to be careful not to conflict with those used by other modules
>> @@ -135,9 +157,15 @@
>>  #define _AL3_R(op, rd, rn, rm) ((op ## _R) | (rd) << 12 | (rn) << 16 | (rm))
>>  /* immediate */
>>  #define _AL3_I(op, rd, rn, imm)        ((op ## _I) | (rd) << 12 | (rn) << 16 | (imm))
>> +/* register with register-shift */
>> +#define _AL3_SR(inst)  (inst | (1 << 4))
>>
>>  #define ARM_ADD_R(rd, rn, rm)  _AL3_R(ARM_INST_ADD, rd, rn, rm)
>> +#define ARM_ADDS_R(rd, rn, rm) _AL3_R(ARM_INST_ADDS, rd, rn, rm)
>>  #define ARM_ADD_I(rd, rn, imm) _AL3_I(ARM_INST_ADD, rd, rn, imm)
>> +#define ARM_ADDS_I(rd, rn, imm)        _AL3_I(ARM_INST_ADDS, rd, rn, imm)
>> +#define ARM_ADC_R(rd, rn, rm)  _AL3_R(ARM_INST_ADC, rd, rn, rm)
>> +#define ARM_ADC_I(rd, rn, imm) _AL3_I(ARM_INST_ADC, rd, rn, imm)
>>
>>  #define ARM_AND_R(rd, rn, rm)  _AL3_R(ARM_INST_AND, rd, rn, rm)
>>  #define ARM_AND_I(rd, rn, imm) _AL3_I(ARM_INST_AND, rd, rn, imm)
>> @@ -156,7 +184,9 @@
>>  #define ARM_EOR_I(rd, rn, imm) _AL3_I(ARM_INST_EOR, rd, rn, imm)
>>
>>  #define ARM_LDR_I(rt, rn, off) (ARM_INST_LDR_I | (rt) << 12 | (rn) << 16 \
>> -                                | (off))
>> +                                | ((off) & 0xfff))
>> +#define ARM_LDR_R(rt, rn, rm)  (ARM_INST_LDR_R | (rt) << 12 | (rn) << 16 \
>> +                                | (rm))
>>  #define ARM_LDRB_I(rt, rn, off)        (ARM_INST_LDRB_I | (rt) << 12 | (rn) << 16 \
>>                                  | (off))
>>  #define ARM_LDRB_R(rt, rn, rm) (ARM_INST_LDRB_R | (rt) << 12 | (rn) << 16 \
>> @@ -167,15 +197,23 @@
>>                                  | (rm))
>>
>>  #define ARM_LDM(rn, regs)      (ARM_INST_LDM | (rn) << 16 | (regs))
>> +#define ARM_LDM_IA(rn, regs)   (ARM_INST_LDM_IA | (rn) << 16 | (regs))
>>
>>  #define ARM_LSL_R(rd, rn, rm)  (_AL3_R(ARM_INST_LSL, rd, 0, rn) | (rm) << 8)
>>  #define ARM_LSL_I(rd, rn, imm) (_AL3_I(ARM_INST_LSL, rd, 0, rn) | (imm) << 7)
>>
>>  #define ARM_LSR_R(rd, rn, rm)  (_AL3_R(ARM_INST_LSR, rd, 0, rn) | (rm) << 8)
>>  #define ARM_LSR_I(rd, rn, imm) (_AL3_I(ARM_INST_LSR, rd, 0, rn) | (imm) << 7)
>> +#define ARM_ASR_R(rd, rn, rm)   (_AL3_R(ARM_INST_ASR, rd, 0, rn) | (rm) << 8)
>> +#define ARM_ASR_I(rd, rn, imm)  (_AL3_I(ARM_INST_ASR, rd, 0, rn) | (imm) << 7)
>>
>>  #define ARM_MOV_R(rd, rm)      _AL3_R(ARM_INST_MOV, rd, 0, rm)
>> +#define ARM_MOVS_R(rd, rm)     _AL3_R(ARM_INST_MOVS, rd, 0, rm)
>>  #define ARM_MOV_I(rd, imm)     _AL3_I(ARM_INST_MOV, rd, 0, imm)
>> +#define ARM_MOV_SR(rd, rm, type, rs)   \
>> +       (_AL3_SR(ARM_MOV_R(rd, rm)) | (type) << 5 | (rs) << 8)
>> +#define ARM_MOV_SI(rd, rm, type, imm6) \
>> +       (ARM_MOV_R(rd, rm) | (type) << 5 | (imm6) << 7)
>>
>>  #define ARM_MOVW(rd, imm)      \
>>         (ARM_INST_MOVW | ((imm) >> 12) << 16 | (rd) << 12 | ((imm) & 0x0fff))
>> @@ -190,19 +228,38 @@
>>
>>  #define ARM_ORR_R(rd, rn, rm)  _AL3_R(ARM_INST_ORR, rd, rn, rm)
>>  #define ARM_ORR_I(rd, rn, imm) _AL3_I(ARM_INST_ORR, rd, rn, imm)
>> -#define ARM_ORR_S(rd, rn, rm, type, rs)        \
>> -       (ARM_ORR_R(rd, rn, rm) | (type) << 5 | (rs) << 7)
>> +#define ARM_ORR_SR(rd, rn, rm, type, rs)       \
>> +       (_AL3_SR(ARM_ORR_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
>> +#define ARM_ORRS_R(rd, rn, rm) _AL3_R(ARM_INST_ORRS, rd, rn, rm)
>> +#define ARM_ORRS_SR(rd, rn, rm, type, rs)      \
>> +       (_AL3_SR(ARM_ORRS_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
>> +#define ARM_ORR_SI(rd, rn, rm, type, imm6)     \
>> +       (ARM_ORR_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
>> +#define ARM_ORRS_SI(rd, rn, rm, type, imm6)    \
>> +       (ARM_ORRS_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
>>
>>  #define ARM_REV(rd, rm)                (ARM_INST_REV | (rd) << 12 | (rm))
>>  #define ARM_REV16(rd, rm)      (ARM_INST_REV16 | (rd) << 12 | (rm))
>>
>>  #define ARM_RSB_I(rd, rn, imm) _AL3_I(ARM_INST_RSB, rd, rn, imm)
>> +#define ARM_RSBS_I(rd, rn, imm)        _AL3_I(ARM_INST_RSBS, rd, rn, imm)
>> +#define ARM_RSC_I(rd, rn, imm) _AL3_I(ARM_INST_RSC, rd, rn, imm)
>>
>>  #define ARM_SUB_R(rd, rn, rm)  _AL3_R(ARM_INST_SUB, rd, rn, rm)
>> +#define ARM_SUBS_R(rd, rn, rm) _AL3_R(ARM_INST_SUBS, rd, rn, rm)
>> +#define ARM_RSB_R(rd, rn, rm)  _AL3_R(ARM_INST_RSB, rd, rn, rm)
>> +#define ARM_SBC_R(rd, rn, rm)  _AL3_R(ARM_INST_SBC, rd, rn, rm)
>> +#define ARM_SBCS_R(rd, rn, rm) _AL3_R(ARM_INST_SBCS, rd, rn, rm)
>>  #define ARM_SUB_I(rd, rn, imm) _AL3_I(ARM_INST_SUB, rd, rn, imm)
>> +#define ARM_SUBS_I(rd, rn, imm)        _AL3_I(ARM_INST_SUBS, rd, rn, imm)
>> +#define ARM_SBC_I(rd, rn, imm) _AL3_I(ARM_INST_SBC, rd, rn, imm)
>>
>>  #define ARM_STR_I(rt, rn, off) (ARM_INST_STR_I | (rt) << 12 | (rn) << 16 \
>> -                                | (off))
>> +                                | ((off) & 0xfff))
>> +#define ARM_STRH_I(rt, rn, off)        (ARM_INST_STRH_I | (rt) << 12 | (rn) << 16 \
>> +                                | (((off) & 0xf0) << 4) | ((off) & 0xf))
>> +#define ARM_STRB_I(rt, rn, off)        (ARM_INST_STRB_I | (rt) << 12 | (rn) << 16 \
>> +                                | (((off) & 0xf0) << 4) | ((off) & 0xf))
>>
>>  #define ARM_TST_R(rn, rm)      _AL3_R(ARM_INST_TST, 0, rn, rm)
>>  #define ARM_TST_I(rn, imm)     _AL3_I(ARM_INST_TST, 0, rn, imm)
>> @@ -214,5 +271,6 @@
>>
>>  #define ARM_MLS(rd, rn, rm, ra)        (ARM_INST_MLS | (rd) << 16 | (rn) | (rm) << 8 \
>>                                  | (ra) << 12)
>> +#define ARM_UXTH(rd, rm)       (ARM_INST_UXTH | (rd) << 12 | (rm))
>>
>>  #endif /* PFILTER_OPCODES_ARM_H */
>> --
>> 2.7.4
>>
>
>
>
> --
> Kees Cook
> Pixel Security

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-06 19:47     ` Shubham Bansal
  (?)
@ 2017-06-12  2:00       ` Kees Cook
  -1 siblings, 0 replies; 87+ messages in thread
From: Kees Cook @ 2017-06-12  2:00 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: Network Development, Daniel Borkmann, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

On Tue, Jun 6, 2017 at 12:47 PM, Shubham Bansal
<illusionist.neo@gmail.com> wrote:
> Hi Russell, Alexei, David, Daniel, kees,
>
> Any update on this patch moving forward?

Since this has been tested by various people and passes the
existing self-tests, I think this can probably go in via the ARM patch
tracker. Russell, does that sound okay to you?

-Kees

-- 
Kees Cook
Pixel Security


* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-05-30 19:19   ` Kees Cook
  (?)
@ 2017-06-12 10:21     ` Daniel Borkmann
  -1 siblings, 0 replies; 87+ messages in thread
From: Daniel Borkmann @ 2017-06-12 10:21 UTC (permalink / raw)
  To: Kees Cook, Shubham Bansal, Network Development, David S. Miller,
	Alexei Starovoitov
  Cc: Russell King, linux-arm-kernel, LKML, Andrew Lunn

On 05/30/2017 09:19 PM, Kees Cook wrote:
> Forwarding this to net-dev and eBPF folks, who weren't on CC...

Sorry for being late. Some comments below from a cursory look ...

> -Kees
>
> On Thu, May 25, 2017 at 4:13 PM, Shubham Bansal
> <illusionist.neo@gmail.com> wrote:
>> The JIT compiler emits ARM 32 bit instructions. Currently, It supports
>> eBPF only. Classic BPF is supported because of the conversion by BPF
>> core.
>>
>> This patch is essentially changing the current implementation of JIT
>> compiler of Berkeley Packet Filter from classic to internal with almost
>> all instructions from eBPF ISA supported except the following
>>          BPF_ALU64 | BPF_DIV | BPF_K
>>          BPF_ALU64 | BPF_DIV | BPF_X
>>          BPF_ALU64 | BPF_MOD | BPF_K
>>          BPF_ALU64 | BPF_MOD | BPF_X
>>          BPF_STX | BPF_XADD | BPF_W
>>          BPF_STX | BPF_XADD | BPF_DW
>>          BPF_JMP | BPF_CALL

Any plans to implement the above, especially BPF_JMP | BPF_CALL, in the near
future? The reason I'm asking is that i) the arm32 cBPF JIT currently
implements all of the cBPF extensions (except SKF_AD_RANDOM and
SKF_AD_VLAN_TPID), so some programs that were JITed before, e.g. those using
SKF_AD_CPU, would now fall back to the eBPF interpreter due to the lack of
translation in the JIT, and ii) probably most (if not all) eBPF programs use
BPF helper calls heavily, which will still redirect them to the interpreter
right now due to the lack of BPF_JMP | BPF_CALL support, so it's really quite
essential to have it implemented.

>> Implementation is using scratch space to emulate 64 bit eBPF ISA on 32 bit
>> ARM because of deficiency of general purpose registers on ARM. Currently,
>> only LITTLE ENDIAN machines are supported in this eBPF JIT Compiler.
>>
>> Tested on ARMv7 with QEMU by me (Shubham Bansal).
>> Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
>> Expected to work on ARMv6 as well, as its a part ARMv7 and part ARMv5.
>> Although, a proper testing is not done for ARMv6.
>>
>> Both of these testing are done with and without CONFIG_FRAME_POINTER
>> separately for LITTLE ENDIAN machine.
>>
>> For testing:
>>
>> 1. JIT is enabled with
>>          echo 1 > /proc/sys/net/core/bpf_jit_enable
>> 2. Constant Blinding can be enabled along with JIT using
>>          echo 1 > /proc/sys/net/core/bpf_jit_enable
>>          echo 2 > /proc/sys/net/core/bpf_jit_harden
>>
>> See Documentation/networking/filter.txt for more information.
>>
>> Result : test_bpf: Summary: 314 PASSED, 0 FAILED, [278/306 JIT'ed]

Did you also manage to get the BPF selftest suite running in the meantime
(tools/testing/selftests/bpf/)? There are a couple of programs that clang
will compile (test_pkt_access.o, test_xdp.o, test_l4lb.o, test_tcp_estats.o)
and then test run.

Did you manage to get tail calls tested as well (I assume so since you
implemented emit_bpf_tail_call() in the patch but just out of curiosity)?

>> Signed-off-by: Shubham Bansal <illusionist.neo@gmail.com>
>> ---
>>   Documentation/networking/filter.txt |    4 +-
>>   arch/arm/Kconfig                    |    2 +-
>>   arch/arm/net/bpf_jit_32.c           | 2404 ++++++++++++++++++++++++-----------
>>   arch/arm/net/bpf_jit_32.h           |  108 +-
>>   4 files changed, 1713 insertions(+), 805 deletions(-)
>>
[...]

If arm folks take the patch, there will be two minor (silent) merge
conflicts with net-next:

1) In bpf_int_jit_compile(), a prog->jited_len = image_size assignment
    needs to be added below the jited = 1 assignment.
2) The internal tail call opcode changed from BPF_JMP | BPF_CALL | BPF_X
    into BPF_JMP | BPF_TAIL_CALL.
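
For illustration only, the first item could look roughly like the sketch
below. The struct is a userspace stand-in (an assumption, not the kernel's
struct bpf_prog) and publish_image() is a hypothetical helper name; only
jited, jited_len and image_size come from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the few struct bpf_prog fields discussed here. */
struct prog_sketch {
	void *bpf_func;
	unsigned int jited;
	unsigned int jited_len;
};

/*
 * Hypothetical helper: publish a finished image of insn_count 32-bit ARM
 * instructions, recording its size in bytes right after setting jited.
 */
static void publish_image(struct prog_sketch *prog, uint32_t *target,
			  unsigned int insn_count)
{
	unsigned int image_size = sizeof(uint32_t) * insn_count;

	assert(prog != NULL && target != NULL);
	prog->bpf_func = (void *)target;
	prog->jited = 1;
	prog->jited_len = image_size;	/* the line net-next would need */
}
```

The second item is a pure opcode rename in the switch and needs no new logic.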

Two minor things below; these could probably also be handled as a follow-up.

[...]
>> +       /* dst = imm64 */
>> +       case BPF_LD | BPF_IMM | BPF_DW:
>> +       {
>> +               const struct bpf_insn insn1 = insn[1];
>> +               u32 hi, lo = imm;
>> +
>> +               if (insn1.code != 0 || insn1.src_reg != 0 ||
>> +                   insn1.dst_reg != 0 || insn1.off != 0) {
>> +                       /* Note: verifier in BPF core must catch invalid
>> +                        * instruction.
>> +                        */
>> +                       pr_err_once("Invalid BPF_LD_IMM64 instruction\n");
>> +                       return -EINVAL;
>> +               }

Nit: this check can be removed, as the verifier already takes care
of it. (No JIT checks for this anymore.)

>> +               hi = insn1.imm;
>> +               emit_a32_mov_i(dst_lo, lo, dstk, ctx);
>> +               emit_a32_mov_i(dst_hi, hi, dstk, ctx);
>> +
>> +               return 1;
>> +       }
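
To make the two-slot handling concrete, here is a minimal userspace sketch of
how a 64 bit immediate load spans two instruction slots and why the builder
returns 1 so that the caller skips the second slot. struct insn_sketch,
build_one() and count_visited() are hypothetical stand-ins, not the patch's
code; the assert only models the guarantee that the verifier provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct bpf_insn: opcode and 32 bit immediate. */
struct insn_sketch {
	uint8_t code;
	int32_t imm;
};

#define SKETCH_LD_IMM64 0x18	/* BPF_LD | BPF_IMM | BPF_DW */

/* Returns 1 if the instruction consumed two slots, 0 otherwise. */
static int build_one(const struct insn_sketch *insn, uint64_t *dst)
{
	if (insn->code == SKETCH_LD_IMM64) {
		/* Cast through u32 so the low word is not sign-extended. */
		uint32_t lo = (uint32_t)insn[0].imm;
		uint32_t hi = (uint32_t)insn[1].imm;

		*dst = ((uint64_t)hi << 32) | lo;
		return 1;
	}
	return 0;
}

/* Walk the program like build_body(): a positive return skips one slot. */
static size_t count_visited(const struct insn_sketch *prog, size_t len,
			    uint64_t *dst)
{
	size_t i, visited = 0;

	for (i = 0; i < len; i++) {
		/* The verifier guarantees a second slot; modelled as assert. */
		if (prog[i].code == SKETCH_LD_IMM64)
			assert(i + 1 < len);
		visited++;
		if (build_one(&prog[i], dst) > 0)
			i++;
	}
	return visited;
}
```

Since the verifier already rejects a malformed second slot, the sketch does
not re-check its code, register or offset fields.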
[...]
>> -       /* compute offsets only during the first pass */
>> -       if (ctx->target == NULL)
>> -               ctx->offsets[i] = ctx->idx * 4;
>> +static int validate_code(struct jit_ctx *ctx)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < ctx->idx; i++) {
>> +               u32 a32_insn = le32_to_cpu(ctx->target[i]);

Given __opcode_to_mem_arm(ARM_INST_UDF) is used to fill the image,
perhaps use the __mem_to_opcode_arm() helper for the check?

>> +               if (a32_insn == ARM_INST_UDF)
>> +                       return -1;
>> +       }
>>
>>          return 0;
>>   }
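
A rough userspace sketch of the point about matching helpers: the check only
finds the filler reliably if it decodes memory with the inverse of whatever
conversion was used when filling. All sketch_* names are hypothetical
stand-ins, the swap flag merely models a build where instruction words are
stored byte-swapped, and the UDF value is assumed from bpf_jit_32.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value of ARM_INST_UDF; verify against bpf_jit_32.h. */
#define SKETCH_INST_UDF 0xe7fddef1u

static uint32_t sketch_swab32(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
	       ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Stand-ins for the opcode<->memory conversion pair. */
static uint32_t sketch_opcode_to_mem(uint32_t opcode, int swap)
{
	return swap ? sketch_swab32(opcode) : opcode;
}

static uint32_t sketch_mem_to_opcode(uint32_t mem, int swap)
{
	return swap ? sketch_swab32(mem) : mem;
}

/* Fill the image with faulting instructions, as a fill-hole hook would. */
static void sketch_fill_hole(uint32_t *image, size_t n, int swap)
{
	size_t i;

	assert(image != NULL);
	for (i = 0; i < n; i++)
		image[i] = sketch_opcode_to_mem(SKETCH_INST_UDF, swap);
}

/* Return -1 if any slot still holds the filler, 0 otherwise. */
static int sketch_validate(const uint32_t *image, size_t n, int swap)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (sketch_mem_to_opcode(image[i], swap) == SKETCH_INST_UDF)
			return -1;
	return 0;
}
```

With the pair kept symmetric, the check behaves the same whether or not the
stored form is swapped, which is the property the suggestion is after.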
>>
>> +void bpf_jit_compile(struct bpf_prog *prog)
>> +{
>> +       /* Nothing to do here. We support Internal BPF. */
>> +}

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-12 10:21     ` Daniel Borkmann
  (?)
@ 2017-06-12 11:06       ` Russell King - ARM Linux
  -1 siblings, 0 replies; 87+ messages in thread
From: Russell King - ARM Linux @ 2017-06-12 11:06 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Kees Cook, Shubham Bansal, Network Development, David S. Miller,
	Alexei Starovoitov, linux-arm-kernel, LKML, Andrew Lunn

On Mon, Jun 12, 2017 at 12:21:03PM +0200, Daniel Borkmann wrote:
> On 05/30/2017 09:19 PM, Kees Cook wrote:
> >On Thu, May 25, 2017 at 4:13 PM, Shubham Bansal
> ><illusionist.neo@gmail.com> wrote:
> >>+static int validate_code(struct jit_ctx *ctx)
> >>+{
> >>+       int i;
> >>+
> >>+       for (i = 0; i < ctx->idx; i++) {
> >>+               u32 a32_insn = le32_to_cpu(ctx->target[i]);
> 
> Given __opcode_to_mem_arm(ARM_INST_UDF) is used to fill the image,
> perhaps use the __mem_to_opcode_arm() helper for the check?
> 
> >>+               if (a32_insn == ARM_INST_UDF)

The following is probably better:

		if (ctx->target[i] == __opcode_to_mem_arm(ARM_INST_UDF))

since then you can take advantage of the compiler optimising the
constant rather than having to do a byte swap on an unknown 32-bit
value.
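
A minimal user-space sketch of the difference, not the kernel code
itself. swap32() is a hypothetical stand-in for __opcode_to_mem_arm()
and __mem_to_opcode_arm() that models a configuration where the
in-memory byte order differs from the CPU's native order, and the
INST_UDF value is only illustrative. Variant A converts every word it
reads; variant B converts the constant once, which the compiler can
fold at build time, so the loop is a plain compare.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative filler opcode; check bpf_jit_32.h for the real definition. */
#define INST_UDF 0xe7fddef1u

/* Hypothetical stand-in for the kernel's opcode/memory conversion helpers:
 * models a setup where memory order is byte-swapped relative to the CPU. */
static inline uint32_t swap32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Variant A: convert every word read from the image, then compare. */
static int validate_swap_each(const uint32_t *image, size_t n)
{
	assert(image != NULL);
	for (size_t i = 0; i < n; i++)
		if (swap32(image[i]) == INST_UDF)
			return -1;
	return 0;
}

/* Variant B: convert the constant once; the loop body is a plain compare. */
static int validate_swap_const(const uint32_t *image, size_t n)
{
	const uint32_t filler = swap32(INST_UDF);

	assert(image != NULL);
	for (size_t i = 0; i < n; i++)
		if (image[i] == filler)
			return -1;
	return 0;
}
```

Both variants return the same result for any image; only the amount of
per-word work differs.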

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-12 10:21     ` Daniel Borkmann
  (?)
@ 2017-06-12 15:40       ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-06-12 15:40 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

On Mon, Jun 12, 2017 at 3:51 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 05/30/2017 09:19 PM, Kees Cook wrote:

>>> This patch is essentially changing the current implementation of JIT
>>> compiler of Berkeley Packet Filter from classic to internal with almost
>>> all instructions from eBPF ISA supported except the following
>>>          BPF_ALU64 | BPF_DIV | BPF_K
>>>          BPF_ALU64 | BPF_DIV | BPF_X
>>>          BPF_ALU64 | BPF_MOD | BPF_K
>>>          BPF_ALU64 | BPF_MOD | BPF_X
>>>          BPF_STX | BPF_XADD | BPF_W
>>>          BPF_STX | BPF_XADD | BPF_DW
>>>          BPF_JMP | BPF_CALL
>
>
> Any plans to implement above especially BPF_JMP | BPF_CALL in near future?
> Reason why I'm asking is that i) currently the arm32 cBPF JIT implements
> all of the cBPF extensions (except SKF_AD_RANDOM and SKF_AD_VLAN_TPID).
> Some of the programs that were JITed before e.g. using SKF_AD_CPU would now
> fall back to the eBPF interpreter due to lack of translation in JIT, but
> also ii) that probably most (if not all) of eBPF programs use BPF helper
> calls heavily, which will still redirect them to the interpreter right now
> due to lack of BPF_JMP | BPF_CALL support, so it's really quite essential
> to have it implemented.

I can try to implement BPF_JMP | BPF_CALL. I didn't do it last time
because I thought it would make the code look messy and make it a pain
to get through review.
For this, I have to map the eBPF argument registers to the ARM ABI
argument registers and move the values across, since the eBPF calling
convention doesn't match the ARM ABI.
Let me see whether it's possible.
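
To illustrate the mapping problem, here is a rough user-space model,
not code from the patch. It assumes the standard 32-bit ARM procedure
call standard (AAPCS), under which 64-bit arguments occupy the
register pairs r0:r1 and r2:r3 and any further arguments go on the
stack, and a little-endian layout with the low word first. The names
arm_call_args, marshal_bpf_args() and BPF_MAX_ARGS are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BPF_MAX_ARGS 5	/* eBPF passes helper arguments in R1..R5 */

/* Hypothetical model of where an AAPCS callee expects its arguments. */
struct arm_call_args {
	uint32_t r[4];		/* r0..r3: first two 64-bit arguments */
	uint32_t stack[6];	/* remaining three 64-bit arguments, low word first */
};

/* Split each 64-bit eBPF argument into two 32-bit words and place them
 * where the callee would look for them. bpf_r[0] models eBPF R1. */
static void marshal_bpf_args(const uint64_t bpf_r[BPF_MAX_ARGS], int nargs,
			     struct arm_call_args *out)
{
	assert(nargs >= 0 && nargs <= BPF_MAX_ARGS);
	memset(out, 0, sizeof(*out));

	for (int i = 0; i < nargs; i++) {
		uint32_t lo = (uint32_t)bpf_r[i];
		uint32_t hi = (uint32_t)(bpf_r[i] >> 32);

		if (i < 2) {		/* R1 -> r0:r1, R2 -> r2:r3 */
			out->r[2 * i] = lo;
			out->r[2 * i + 1] = hi;
		} else {		/* R3..R5 spill to the stack */
			out->stack[2 * (i - 2)] = lo;
			out->stack[2 * (i - 2) + 1] = hi;
		}
	}
}
```

Under these assumptions only two 64-bit arguments fit in registers, so
a helper that takes three or more needs extra stack stores before the
call.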

As far as the following four are concerned:

>>>          BPF_ALU64 | BPF_DIV | BPF_K
>>>          BPF_ALU64 | BPF_DIV | BPF_X
>>>          BPF_ALU64 | BPF_MOD | BPF_K
>>>          BPF_ALU64 | BPF_MOD | BPF_X

I don't think it is possible with the current register constraints. I
already tried this.
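
For a sense of the register pressure, a shift-and-subtract 64-bit
unsigned division written only with 32-bit halves looks roughly like
the sketch below. This is an illustration, not something the patch
does; u64pair and udiv64() are hypothetical names, and division by
zero is left to the caller. Dividend, divisor, quotient and remainder
alone account for eight live 32-bit values.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit words, as on a 32-bit core. */
struct u64pair {
	uint32_t lo, hi;
};

/* Restoring (shift-and-subtract) division: computes n / d and n % d.
 * The caller must ensure d != 0. */
static void udiv64(struct u64pair n, struct u64pair d,
		   struct u64pair *q, struct u64pair *r)
{
	struct u64pair quot = { 0, 0 }, rem = { 0, 0 };

	assert(d.lo != 0 || d.hi != 0);

	for (int i = 63; i >= 0; i--) {
		/* rem = (rem << 1) | bit i of n */
		uint32_t bit = (i >= 32) ? (n.hi >> (i - 32)) & 1u
					 : (n.lo >> i) & 1u;
		rem.hi = (rem.hi << 1) | (rem.lo >> 31);
		rem.lo = (rem.lo << 1) | bit;

		/* if rem >= d: subtract d and set quotient bit i */
		if (rem.hi > d.hi || (rem.hi == d.hi && rem.lo >= d.lo)) {
			uint32_t borrow = rem.lo < d.lo;

			rem.lo -= d.lo;
			rem.hi -= d.hi + borrow;
			if (i >= 32)
				quot.hi |= 1u << (i - 32);
			else
				quot.lo |= 1u << i;
		}
	}
	*q = quot;
	*r = rem;
}
```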

>
>>> Implementation is using scratch space to emulate 64 bit eBPF ISA on 32
>>> bit
>>> ARM because of deficiency of general purpose registers on ARM. Currently,
>>> only LITTLE ENDIAN machines are supported in this eBPF JIT Compiler.
>>>
>>> Tested on ARMv7 with QEMU by me (Shubham Bansal).
>>> Tested on ARMv5 by Andrew Lunn (andrew@lunn.ch).
>>> Expected to work on ARMv6 as well, as its a part ARMv7 and part ARMv5.
>>> Although, a proper testing is not done for ARMv6.
>>>
>>> Both of these testing are done with and without CONFIG_FRAME_POINTER
>>> separately for LITTLE ENDIAN machine.
>>>
>>> For testing:
>>>
>>> 1. JIT is enabled with
>>>          echo 1 > /proc/sys/net/core/bpf_jit_enable
>>> 2. Constant Blinding can be enabled along with JIT using
>>>          echo 1 > /proc/sys/net/core/bpf_jit_enable
>>>          echo 2 > /proc/sys/net/core/bpf_jit_harden
>>>
>>> See Documentation/networking/filter.txt for more information.
>>>
>>> Result : test_bpf: Summary: 314 PASSED, 0 FAILED, [278/306 JIT'ed]
>
>
> Did you also manage to get the BPF selftest suite running in the meantime
> (tools/testing/selftests/bpf/)? There are a couple of programs that clang
> will compile (test_pkt_access.o, test_xdp.o, test_l4lb.o, test_tcp_estats.o)
> and then test run.

Nope. It looks like a recent addition to the tests. Can you please
tell me how to run it?

>
> Did you manage to get tail calls tested as well (I assume so since you
> implemented emit_bpf_tail_call() in the patch but just out of curiosity)?

I didn't test it separately; I assumed test_bpf already covers it. Doesn't it?

>
>>> Signed-off-by: Shubham Bansal <illusionist.neo@gmail.com>
>>> ---
>>>   Documentation/networking/filter.txt |    4 +-
>>>   arch/arm/Kconfig                    |    2 +-
>>>   arch/arm/net/bpf_jit_32.c           | 2404 ++++++++++++++++++++++++-----------
>>>   arch/arm/net/bpf_jit_32.h           |  108 +-
>>>   4 files changed, 1713 insertions(+), 805 deletions(-)
>>>
> [...]
>
> If arm folks take the patch, there will be two minor (silent) merge
> conflicts with net-next:
>
> 1) In bpf_int_jit_compile(), below the jited = 1 assignment, there
>    needs to come a prog->jited_len = image_size.

Done.

> 2) The internal tail call opcode changed from BPF_JMP | BPF_CALL | BPF_X
>    into BPF_JMP | BPF_TAIL_CALL.

Done.

>
> Two minor things below, could probably also be as follow-up.
>
> [...]
>>>
>>> +       /* dst = imm64 */
>>> +       case BPF_LD | BPF_IMM | BPF_DW:
>>> +       {
>>> +               const struct bpf_insn insn1 = insn[1];
>>> +               u32 hi, lo = imm;
>>> +
>>> +               if (insn1.code != 0 || insn1.src_reg != 0 ||
>>> +                   insn1.dst_reg != 0 || insn1.off != 0) {
>>> +                       /* Note: verifier in BPF core must catch invalid
>>> +                        * instruction.
>>> +                        */
>>> +                       pr_err_once("Invalid BPF_LD_IMM64 instruction\n");
>>> +                       return -EINVAL;
>>> +               }
>
>
> Nit: this check can be removed as verifier already takes care
> of it. (No JIT checks for this anymore.)
>
>>> +               hi = insn1.imm;
>>> +               emit_a32_mov_i(dst_lo, lo, dstk, ctx);
>>> +               emit_a32_mov_i(dst_hi, hi, dstk, ctx);
>>> +
>>> +               return 1;
>>> +       }
>
> [...]
>>>
>>> -       /* compute offsets only during the first pass */
>>> -       if (ctx->target == NULL)
>>> -               ctx->offsets[i] = ctx->idx * 4;
>>> +static int validate_code(struct jit_ctx *ctx)
>>> +{
>>> +       int i;
>>> +
>>> +       for (i = 0; i < ctx->idx; i++) {
>>> +               u32 a32_insn = le32_to_cpu(ctx->target[i]);
>
>
> Given __opcode_to_mem_arm(ARM_INST_UDF) is used to fill the image,
> perhaps use the __mem_to_opcode_arm() helper for the check?

Done.


I will resend the patch with these fixes. I would really appreciate it
if you could point out any more issues with the code, so that I can
address them in the next revision.

Thanks.
Shubham

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-12 11:06       ` Russell King - ARM Linux
  (?)
@ 2017-06-12 15:41         ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-06-12 15:41 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Daniel Borkmann, Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, linux-arm-kernel, LKML, Andrew Lunn

Hi Russell,

On Mon, Jun 12, 2017 at 4:36 PM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Mon, Jun 12, 2017 at 12:21:03PM +0200, Daniel Borkmann wrote:
>> On 05/30/2017 09:19 PM, Kees Cook wrote:
>> >On Thu, May 25, 2017 at 4:13 PM, Shubham Bansal
>> ><illusionist.neo@gmail.com> wrote:
>> >>+static int validate_code(struct jit_ctx *ctx)
>> >>+{
>> >>+       int i;
>> >>+
>> >>+       for (i = 0; i < ctx->idx; i++) {
>> >>+               u32 a32_insn = le32_to_cpu(ctx->target[i]);
>>
>> Given __opcode_to_mem_arm(ARM_INST_UDF) is used to fill the image,
>> perhaps use the __mem_to_opcode_arm() helper for the check?
>>
>> >>+               if (a32_insn == ARM_INST_UDF)
>
> The following is probably better:
>
>                 if (ctx->target[i] == __opcode_to_mem_arm(ARM_INST_UDF))
>
> since then you can take advantage of the compiler optimising the
> constant rather than having to do a byte swap on an unknown 32-bit
> value.

Done. Thanks :)
Please let me know if you find any more issues with the code. I would
really appreciate it.

-Shubham

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v2] arm: eBPF JIT compiler
@ 2017-06-12 15:41         ` Shubham Bansal
  0 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-06-12 15:41 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russel,

On Mon, Jun 12, 2017 at 4:36 PM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Mon, Jun 12, 2017 at 12:21:03PM +0200, Daniel Borkmann wrote:
>> On 05/30/2017 09:19 PM, Kees Cook wrote:
>> >On Thu, May 25, 2017 at 4:13 PM, Shubham Bansal
>> ><illusionist.neo@gmail.com> wrote:
>> >>+static int validate_code(struct jit_ctx *ctx)
>> >>+{
>> >>+       int i;
>> >>+
>> >>+       for (i = 0; i < ctx->idx; i++) {
>> >>+               u32 a32_insn = le32_to_cpu(ctx->target[i]);
>>
>> Given __opcode_to_mem_arm(ARM_INST_UDF) is used to fill the image,
>> perhaps use the __mem_to_opcode_arm() helper for the check?
>>
>> >>+               if (a32_insn == ARM_INST_UDF)
>
> The following is probably better:
>
>                 if (ctx->target[i] == __opcode_to_mem_arm(ARM_INST_UDF))
>
> since then you can take advantage of the compiler optimising the
> constant rather than having to do a byte swap on an unknown 32-bit
> value.

Done. Thanks :)
Please check if you can find anymore issues with the code. I really
appreciate it.

-Shubham

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-12 15:40       ` Shubham Bansal
  (?)
@ 2017-06-12 22:45         ` Alexander Alemayhu
  -1 siblings, 0 replies; 87+ messages in thread
From: Alexander Alemayhu @ 2017-06-12 22:45 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: Daniel Borkmann, Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

On Mon, Jun 12, 2017 at 09:10:07PM +0530, Shubham Bansal wrote:
> 
> Nope. It looks like a latest addition to testing. Can you please tell
> me how to test with it?
>
cd tools/testing/selftests/bpf/
make
sudo ./test_progs

-- 
Kind regards

Alexander Alemayhu

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-12 22:45         ` Alexander Alemayhu
@ 2017-06-12 22:47           ` David Miller
  -1 siblings, 0 replies; 87+ messages in thread
From: David Miller @ 2017-06-12 22:47 UTC (permalink / raw)
  To: alexander
  Cc: illusionist.neo, daniel, keescook, netdev, ast, linux,
	linux-arm-kernel, linux-kernel, andrew

From: Alexander Alemayhu <alexander@alemayhu.com>
Date: Tue, 13 Jun 2017 00:45:45 +0200

> On Mon, Jun 12, 2017 at 09:10:07PM +0530, Shubham Bansal wrote:
>> 
>> Nope. It looks like a latest addition to testing. Can you please tell
>> me how to test with it?
>>
> cd tools/testing/selftests/bpf/
> make
> sudo ./test_progs

Also, you might need to do a "make headers_install" at the top level
before doing this.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-12 15:40       ` Shubham Bansal
  (?)
@ 2017-06-12 23:17         ` Daniel Borkmann
  -1 siblings, 0 replies; 87+ messages in thread
From: Daniel Borkmann @ 2017-06-12 23:17 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

On 06/12/2017 05:40 PM, Shubham Bansal wrote:
[...]
>> Did you manage to get tail calls tested as well (I assume so since you
>> implemented emit_bpf_tail_call() in the patch but just out of curiosity)?
>
> I didn't try it exclusively, I thought test_bpf must have tested it. Doesn't it?

In samples/bpf/ there's sockex3* that would exercise it, or
alternatively, in the iproute2 repo under examples/bpf/, there are
bpf_cyclic.c and bpf_tailcall.c as example programs.

Hm, generally, we should really also add a test case to the BPF
selftest suite to facilitate that. I'll likely do that for
the next batch of BPF patches.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-12 15:40       ` Shubham Bansal
  (?)
@ 2017-06-13  6:56         ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-06-13  6:56 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

Hi Daniel, Kees, David, Russell,

>> Any plans to implement above especially BPF_JMP | BPF_CALL in near future?
>> Reason why I'm asking is that i) currently the arm32 cBPF JIT implements
>> all of the cBPF extensions (except SKF_AD_RANDOM and SKF_AD_VLAN_TPID).
>> Some of the programs that were JITed before e.g. using SKF_AD_CPU would now
>> fall back to the eBPF interpreter due to lack of translation in JIT, but
>> also ii) that probably most (if not all) of eBPF programs use BPF helper
>> calls heavily, which will still redirect them to the interpreter right now
>> due to lack of BPF_JMP | BPF_CALL support, so it's really quite essential
>> to have it implemented.
>
> I can try for BPF_JMP | BPF_CALL. I didn't do it last time because I
> thought, it would make the code look messy and become pain to get it
> through the review.
> For this, I have to map eBPF arguments with arm ABI arguments and move
> ebpf arguments to corresponding arm ABI arguments, as eBPF arguments
> doesn't match with arm ABI arguments.
> Let me try that if its possible.

Okay. I looked at it and tried a few different solutions. There is a
problem with implementing BPF_JMP | BPF_CALL: the transition between
4-byte and 8-byte arguments. Let's take a look at the following example
to get a clearer picture of the problem.

Let's consider this function:
CASE 1:                            foo(int a, int b, long long c, int d)
To call this function on 32-bit ARM, I have to pass the arguments as
follows:
                                         a -> r0
                                         b -> r1
                                         c -> r2, r3
                                         d -> stack_top

Now consider another example function:
CASE 2:                           bar(int a, int b, int c, int d)
To call this function on 32-bit ARM, I have to pass the arguments as
follows:
                                       a -> r0
                                       b -> r1
                                       c -> r2
                                       d -> r3
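
The two mappings above can be reproduced with a small sketch of the
register assignment rules. It assumes the EABI (AAPCS) convention and
handles only 4-byte and 8-byte integer arguments; struct arg_loc and
aapcs_assign() are hypothetical names used for illustration only.

```c
#include <stddef.h>

/* Where one argument ends up: core registers or the stack. */
struct arg_loc {
	int reg;	/* first core register (0..3), or -1 if on the stack */
	int nregs;	/* number of registers used (0, 1 or 2) */
	int stack_off;	/* byte offset from sp at the call, or -1 if in registers */
};

/* Assign r0-r3 or stack slots to arguments of 4 or 8 bytes each. */
static void aapcs_assign(const size_t *sizes, size_t n, struct arg_loc *out)
{
	int ncrn = 0;	/* next core register number */
	int nsaa = 0;	/* next stacked argument offset */
	size_t i;

	for (i = 0; i < n; i++) {
		int words = (sizes[i] == 8) ? 2 : 1;

		/* A 64-bit value starts in an even-numbered register. */
		if (words == 2)
			ncrn = (ncrn + 1) & ~1;

		if (ncrn + words <= 4) {
			out[i].reg = ncrn;
			out[i].nregs = words;
			out[i].stack_off = -1;
			ncrn += words;
		} else {
			/* Once the stack is used, registers are not back-filled. */
			ncrn = 4;
			if (words == 2)
				nsaa = (nsaa + 7) & ~7;
			out[i].reg = -1;
			out[i].nregs = 0;
			out[i].stack_off = nsaa;
			nsaa += words * 4;
		}
	}
}
```

With sizes {4, 4, 8, 4} this yields the CASE 1 layout (c in r2 and r3, d
at the top of the stack), and with {4, 4, 4, 4} the CASE 2 layout.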

So you can clearly see the problem. At the call site, the JIT has no way
of knowing which of the two ways above to use for passing the arguments.
There are two possible solutions:

1. I can look at the address of the function being called and pass the
arguments accordingly, but that's not really a robust solution, as we
would have to change the arm32 JIT each time a new BPF helper function
is added.

2. Alternatively, if any of you can confirm that only 4-byte arguments
(pointers included) are passed to BPF helper functions on arm32, now and
in the future, then I can assume that each argument is passed as a
4-byte value, and my trimming of 8-byte arguments to 4 bytes wouldn't be
a problem. In that case, the arguments for CASE 1 and CASE 2 would be
passed in the same way, i.e.
                                       a -> r0
                                       b -> r1
                                       c -> r2
                                       d -> r3
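
A minimal sketch of what the trimming in option 2 would do to the five
64-bit eBPF argument registers (R1-R5), and of the case where it loses
information. The function names are hypothetical; under this option the
first four words would go to r0-r3 and the fifth to the stack.

```c
#include <stdint.h>

/* Hypothetical: narrow the eBPF arguments R1-R5 to 32-bit words. */
static void narrow_args(const uint64_t bpf_args[5], uint32_t out[5])
{
	int i;

	for (i = 0; i < 5; i++)
		out[i] = (uint32_t)bpf_args[i];	/* upper 32 bits are dropped */
}

/* Returns 1 if trimming this value to 32 bits would change it. */
static int narrowing_is_lossy(uint64_t v)
{
	return (v >> 32) != 0;
}
```

The trimming is only safe if narrowing_is_lossy() is 0 for every value a
helper can receive, which is the guarantee option 2 depends on.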

Let me know what you think. I can't think of a solution to this problem
other than those mentioned above. I would love to hear any ideas from
you.

>> Did you also manage to get the BPF selftest suite running in the meantime
>> (tools/testing/selftests/bpf/)? There are a couple of programs that clang
>> will compile (test_pkt_access.o, test_xdp.o, test_l4lb.o, test_tcp_estats.o)
>> and then test run.
I will run these tests tonight. Hopefully I will be able to run them.

Any comments are welcome. I would love to hear what you think about my
solutions above.

Thanks.
Shubham

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-13  6:56         ` Shubham Bansal
  (?)
@ 2017-06-14 20:31           ` Daniel Borkmann
  -1 siblings, 0 replies; 87+ messages in thread
From: Daniel Borkmann @ 2017-06-14 20:31 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

On 06/13/2017 08:56 AM, Shubham Bansal wrote:
> Hi Daniel, Kees, David, Russell,
>
>>> Any plans to implement above especially BPF_JMP | BPF_CALL in near future?
>>> Reason why I'm asking is that i) currently the arm32 cBPF JIT implements
>>> all of the cBPF extensions (except SKF_AD_RANDOM and SKF_AD_VLAN_TPID).
>>> Some of the programs that were JITed before e.g. using SKF_AD_CPU would now
>>> fall back to the eBPF interpreter due to lack of translation in JIT, but
>>> also ii) that probably most (if not all) of eBPF programs use BPF helper
>>> calls heavily, which will still redirect them to the interpreter right now
>>> due to lack of BPF_JMP | BPF_CALL support, so it's really quite essential
>>> to have it implemented.
>>
>> I can try for BPF_JMP | BPF_CALL. I didn't do it last time because I
>> thought, it would make the code look messy and become pain to get it
>> through the review.
>> For this, I have to map eBPF arguments with arm ABI arguments and move
>> ebpf arguments to corresponding arm ABI arguments, as eBPF arguments
>> doesn't match with arm ABI arguments.
>> Let me try that if its possible.
>
> Okay. I looked at it and tried a few different solutions. There is a
> problem with implementing BPF_JMP | BPF_CALL: the transition between
> 4-byte and 8-byte arguments. Let's take a look at the following example
> to get a clearer picture of the problem.
>
> Let's consider this function:
> CASE 1:                            foo(int a, int b, long long c, int d)
> To call this function on 32-bit ARM, I have to pass the arguments as
> follows:
>                                           a -> r0
>                                           b -> r1
>                                           c -> r2, r3
>                                           d -> stack_top
>
> Now consider another example function:
> CASE 2:                           bar(int a, int b, int c, int d)
> To call this function on 32-bit ARM, I have to pass the arguments as
> follows:
>                                         a -> r0
>                                         b -> r1
>                                         c -> r2
>                                         d -> r3
>
> So you can clearly see the problem. At the call site, the JIT has no way
> of knowing which of the two ways above to use for passing the arguments.
> There are two possible solutions:

Right.

> 1. I can look at the address of the function being called and pass the
> arguments accordingly, but that's not really a robust solution, as we
> would have to change the arm32 JIT each time a new BPF helper function
> is added.

Yeah, that would be rather ugly.

> 2. Alternatively, if any of you can confirm that only 4-byte arguments
> (pointers included) are passed to BPF helper functions on arm32, now and
> in the future, then I can assume that each argument is passed as a
> 4-byte value, and my trimming of 8-byte arguments to 4 bytes wouldn't be
> a problem. In that case, the arguments for CASE 1 and CASE 2 would be
> passed in the same way, i.e.
>                                         a -> r0
>                                         b -> r1
>                                         c -> r2
>                                         d -> r3
>
> Let me know what you think. I can't think of a solution to this problem
> other than those mentioned above. I would love to hear any ideas from
> you.

Not all of the helpers take only arguments of 4 bytes or less; a few
take 8-byte arguments, so making that general assumption wouldn't work.
I guess what could be done is to give helpers a flag in struct
bpf_func_proto that tells JITs that all args are 4 bytes on 32-bit, so
you could probably use a convention similar to case 2 for them.
Presumably, to make use of that information, the JIT would need to be
reworked to extract it via bpf_analyzer(), which does a verifier run to
re-analyze the program, as in the nfp JIT case.
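
A rough sketch of how such a flag could steer the JIT. Everything here is
hypothetical: struct helper_proto only stands in for struct
bpf_func_proto, the args_32bit field does not exist in the kernel, and
how the JIT would obtain the prototype at the call site is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct bpf_func_proto. */
struct helper_proto {
	uint64_t (*func)(uint64_t, uint64_t, uint64_t, uint64_t, uint64_t);
	bool args_32bit;	/* hypothetical: all args fit in 32 bits on 32-bit */
};

enum call_conv {
	CALL_ARGS_32BIT,	/* one register per argument, as in case 2 */
	CALL_ARGS_64BIT,	/* register pairs and stack slots */
};

/* Pick the calling convention; without a prototype, stay conservative. */
static enum call_conv pick_call_conv(const struct helper_proto *proto)
{
	return (proto && proto->args_32bit) ? CALL_ARGS_32BIT : CALL_ARGS_64BIT;
}
```

Falling back to the 64-bit convention when nothing is known means an
untagged helper is never called with trimmed arguments.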

The other option could perhaps be to check the interpreter disassembly
of ___bpf_prog_run() to see how it handles the BPF_JMP | BPF_CALL helper
call, and do something similarly generic in the JIT as well.
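
A minimal sketch of the generic approach, assuming (as the interpreter
appears to do) that every helper is called through one uniform signature
taking R1-R5 as 64-bit values and returning a 64-bit result in R0. The
names bpf_helper_fn, demo_helper and do_call are hypothetical.

```c
#include <stdint.h>

/* Uniform helper signature: five 64-bit arguments, 64-bit return. */
typedef uint64_t (*bpf_helper_fn)(uint64_t r1, uint64_t r2, uint64_t r3,
				  uint64_t r4, uint64_t r5);

/* Hypothetical helper that only uses its first two arguments. */
static uint64_t demo_helper(uint64_t r1, uint64_t r2, uint64_t r3,
			    uint64_t r4, uint64_t r5)
{
	(void)r3;
	(void)r4;
	(void)r5;
	return r1 + r2;
}

/* regs[] models eBPF registers R0-R10; the call always passes R1-R5. */
static void do_call(bpf_helper_fn fn, uint64_t regs[11])
{
	regs[0] = fn(regs[1], regs[2], regs[3], regs[4], regs[5]);
}
```

With a single signature the argument layout is the same for every helper,
so the JIT would not need per-helper knowledge; under the EABI that would
presumably put R1 and R2 in r0-r3 and R3-R5 on the stack.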

>>> Did you also manage to get the BPF selftest suite running in the meantime
>>> (tools/testing/selftests/bpf/)? There are a couple of programs that clang
>>> will compile (test_pkt_access.o, test_xdp.o, test_l4lb.o, test_tcp_estats.o)
>>> and then test run.
> I will run these tests tonight. Hopefully I will be able to run them.

Ok.

> Any comments are welcome. I would love to hear what you think about my
> solutions above.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-14 20:31           ` Daniel Borkmann
  (?)
@ 2017-06-17 12:23             ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-06-17 12:23 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

Hi Daniel,

>
> Not all of the helpers have 4 or less byte arguments only, there are a
> few with 8 byte arguments, so making that general assumption wouldn't
> work. I guess what could be done is that helpers have a flag in struct
> bpf_func_proto which indicates for JITs that all args are 4 byte on 32bit
> so you could probably use convention similar to case2 for them. Presumably
> for that information to process, the JIT might need to be reworked to
> extract that via bpf_analyzer() that does a verifier run to re-analyze
> the program like in nfp JIT case.

Let me try for a better solution that supports both 4-byte and 8-byte
arguments. I hope it works out. Are you sure this patch can pass if it
only supports 4-byte arguments, though?
Let me list what I have to do, so that you can tell me if I am thinking
about this the wrong way:

* I will add a bit flag to bpf_func_proto that records whether each
argument of a helper call is 4 bytes or 8 bytes. If the LSB of the flag
is set, the first argument is 8 bytes; otherwise it is not. I think I
can handle this flag properly in build_insn() in my code. Does this
sound okay?
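
A minimal sketch of the bit-flag idea, assuming one bit per helper
argument with bit 0 describing the first argument. The type
arg64_mask_t and both functions are hypothetical; no such field exists
in bpf_func_proto.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_HELPER_ARGS 5

/* Hypothetical flag word: bit i set means argument i + 1 is 8 bytes. */
typedef uint8_t arg64_mask_t;

/* True if the argument at zero-based index arg_idx is 8 bytes wide. */
bool arg_is_64bit(arg64_mask_t mask, unsigned int arg_idx)
{
    assert(arg_idx < MAX_HELPER_ARGS);
    return (mask >> arg_idx) & 1u;
}

/* Total bytes of argument data the JIT would have to marshal. */
unsigned int arg_bytes(arg64_mask_t mask, unsigned int nargs)
{
    unsigned int i, bytes = 0;

    assert(nargs <= MAX_HELPER_ARGS);
    for (i = 0; i < nargs; i++)
        bytes += arg_is_64bit(mask, i) ? 8 : 4;
    return bytes;
}
```

For foo(int a, int b, long long c, int d) the mask would be 0x4, since
only the third argument is 8 bytes wide.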

I don't understand the second part of your solution, i.e.

> Presumably
> for that information to process, the JIT might need to be reworked to
> extract that via bpf_analyzer() that does a verifier run to re-analyze
> the program like in nfp JIT case.

Please explain what you are suggesting and how I can extract the bit
flag from struct bpf_func_proto.

Please reply asap, as I would like to finish it over the weekend. Please.

-Shubham

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-17 12:23             ` Shubham Bansal
  (?)
@ 2017-06-19 18:10               ` Daniel Borkmann
  -1 siblings, 0 replies; 87+ messages in thread
From: Daniel Borkmann @ 2017-06-19 18:10 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

On 06/17/2017 02:23 PM, Shubham Bansal wrote:
> Hi Daniel,
>
>> Not all of the helpers have 4 or less byte arguments only, there are a
>> few with 8 byte arguments, so making that general assumption wouldn't
>> work. I guess what could be done is that helpers have a flag in struct
>> bpf_func_proto which indicates for JITs that all args are 4 byte on 32bit
>> so you could probably use convention similar to case2 for them. Presumably
>> for that information to process, the JIT might need to be reworked to
>> extract that via bpf_analyzer() that does a verifier run to re-analyze
>> the program like in nfp JIT case.
>
> Let me try a better solution which can be used to support both 4 byte
> and 8 byte arguments. I hope it would work out. Are you sure this
> patch can pass if it only supports 4 byte arguments though?
> Let me list out what I have to do, so that you can tell me if I am
> thinking in a wrong way :-
>
> * I will add a bit flag in bpf_func_proto to represent whether
> different arguments in a function call are 4 bytes or 8 bytes. If lsb
> of bit flag is set then first argument is 8 byte, otherwise its not. I
> think I can handle this flag properly in build_insn() in my code. Does
> this sound okay?
>
> I don't understand second part of your solution, i.e.
>
>> Presumably
>> for that information to process, the JIT might need to be reworked to
>> extract that via bpf_analyzer() that does a verifier run to re-analyze
>> the program like in nfp JIT case.
>
> Please explain what are you suggesting and how can I extract bit flag
> from bpf_func_proto().
>
> Please reply asap, as I would like to finish it over the weekend. Please.

Sorry, I was travelling over the weekend, so I didn't read it in time.

What is the issue with imitating in the JIT what the interpreter is
doing as a starting point? That should be generic enough to handle
any case.

Otherwise you'd need some sort of reverse mapping, since the verifier
has already converted BPF_CALL insns into relative helper addresses
in the imm part.
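
A minimal user-space sketch of the relative-address encoding, assuming
imm holds the signed 32-bit distance between the helper and a fixed
base function. call_base, demo_helper, encode_imm and decode_imm are
hypothetical stand-ins, and converting function pointers to integers
relies on implementation-defined behaviour.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*helper_fn)(uint64_t, uint64_t, uint64_t,
                              uint64_t, uint64_t);

/* Hypothetical base function that all offsets are measured from. */
uint64_t call_base(uint64_t r1, uint64_t r2, uint64_t r3,
                   uint64_t r4, uint64_t r5)
{
    (void)r1; (void)r2; (void)r3; (void)r4; (void)r5;
    return 0;
}

/* Hypothetical helper: doubles its first argument. */
uint64_t demo_helper(uint64_t r1, uint64_t r2, uint64_t r3,
                     uint64_t r4, uint64_t r5)
{
    (void)r2; (void)r3; (void)r4; (void)r5;
    return r1 * 2;
}

/* Verifier side of the model: store the helper as an offset. */
int32_t encode_imm(helper_fn fn)
{
    intptr_t delta = (intptr_t)((uintptr_t)fn - (uintptr_t)call_base);

    assert(delta >= INT32_MIN && delta <= INT32_MAX);
    return (int32_t)delta;
}

/* Interpreter or JIT side: rebuild the call address from imm. */
helper_fn decode_imm(int32_t imm)
{
    return (helper_fn)((uintptr_t)call_base + (uintptr_t)(intptr_t)imm);
}
```

decode_imm(encode_imm(demo_helper)) yields demo_helper again, which is
all a JIT needs in order to emit the call. What it cannot recover from
the offset alone is the helper's argument types.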

> -Shubham
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-19 18:10               ` Daniel Borkmann
  (?)
@ 2017-06-20  1:34                 ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-06-20  1:34 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

Hi Daniel,

>
> Sorry, had a travel over the weekend, so didn't read it in time.
>
> What is the issue with imitating in JIT what the interpreter is
> doing as a starting point? That should be generic enough to handle
> any case.
>
> Otherwise you'd need some sort of reverse mapping since verifier
> already converted BPF_CALL insns into relative helper addresses
> in imm part.
>
Sorry but I don't get what you are trying to say. Can you explain it
with an example?

-Shubham

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-20  1:34                 ` Shubham Bansal
  (?)
@ 2017-06-20 16:55                   ` Daniel Borkmann
  -1 siblings, 0 replies; 87+ messages in thread
From: Daniel Borkmann @ 2017-06-20 16:55 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

On 06/20/2017 03:34 AM, Shubham Bansal wrote:
> Hi Daniel,
>
>> Sorry, had a travel over the weekend, so didn't read it in time.
>>
>> What is the issue with imitating in JIT what the interpreter is
>> doing as a starting point? That should be generic enough to handle
>> any case.

Why not proceed this way first?

>> Otherwise you'd need some sort of reverse mapping since verifier
>> already converted BPF_CALL insns into relative helper addresses
>> in imm part.
>>
> Sorry but I don't get what you are trying to say. Can you explain it
> with an example?

Ok, probably the best thing is to check fixup_bpf_calls() in the verifier;
see the fn = prog->aux->ops->get_func_proto(insn->imm) call. It fetches the
helper function specification based on the BPF_FUNC_* enum and converts
the imm field into a relative address for the function such that if
you look at ___bpf_prog_run(), JMP_CALL label, the call address can
be reconstructed again. So you'd need some reverse mapping to get back
to the struct bpf_func_proto, so you can check argX_type, which would need
to be extended to say whether it's JITable on 32-bit or not.
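
A minimal sketch of such a reverse mapping, assuming a table of known
helper descriptors that can be searched by function address. struct
demo_func_proto, its args_all_32bit field and both helpers are
hypothetical; the real struct bpf_func_proto has no such marker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*helper_fn)(uint64_t, uint64_t, uint64_t,
                              uint64_t, uint64_t);

/* Hypothetical, cut-down stand-in for struct bpf_func_proto. */
struct demo_func_proto {
    helper_fn func;
    uint8_t args_all_32bit;  /* hypothetical "JITable on 32-bit" marker */
};

static uint64_t demo_get_id(uint64_t r1, uint64_t r2, uint64_t r3,
                            uint64_t r4, uint64_t r5)
{
    (void)r1; (void)r2; (void)r3; (void)r4; (void)r5;
    return 7;
}

static uint64_t demo_get_wide(uint64_t r1, uint64_t r2, uint64_t r3,
                              uint64_t r4, uint64_t r5)
{
    (void)r2; (void)r3; (void)r4; (void)r5;
    return r1 >> 32;  /* needs the upper half of a 64-bit argument */
}

static const struct demo_func_proto demo_protos[] = {
    { demo_get_id,   1 },
    { demo_get_wide, 0 },
};

/* Map a reconstructed call address back to its descriptor, or NULL. */
const struct demo_func_proto *proto_from_addr(helper_fn fn)
{
    size_t i;

    assert(fn != NULL);
    for (i = 0; i < sizeof(demo_protos) / sizeof(demo_protos[0]); i++)
        if (demo_protos[i].func == fn)
            return &demo_protos[i];
    return NULL;
}
```

A linear scan keeps the sketch short. Its cost grows with the number of
helpers, which is one reason a uniform calling convention that needs no
lookup may be the simpler route.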

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-20 16:55                   ` Daniel Borkmann
  (?)
@ 2017-06-21 14:26                     ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-06-21 14:26 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

Hi Daniel,

>
> So my question would be, why can't the JIT imitate something
> similar to what we do in the interpreter as well? So looking
> at the disasm of what gcc compiles for the interpreter when it's
> doing the above call could help as well in going forward. Not
> sure if that answers your question, but perhaps not sure if I
> understand your question yet?

I just looked at the code again and I think I completely misunderstood
the logic of BPF_JMP | BPF_CALL.
I think each helper function works like this:

u64 ____helper_function(u32 a1, u32 a2)
{
     return 0;  /* real helper body goes here */
}

u64 helper_function(u64 a1, u64 a2)
{
     return ____helper_function((u32)a1, (u32)a2);
}

So ultimately, we call helper_function, which takes only u64
arguments. I know it's asking a lot, but can you please confirm this asap? I
would like to start implementing it.
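
A compilable user-space version of the same pattern, as a sketch only.
It assumes the outer function always takes five u64 values and narrows
them before calling the typed inner function. inner_demo_helper and
demo_helper are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Typed inner function: sees the narrow types the helper really uses. */
static uint64_t inner_demo_helper(uint32_t a1, uint32_t a2)
{
    return (uint64_t)a1 + a2;
}

/*
 * Outer entry point with the uniform signature. This is the address a
 * caller would jump to; unused registers are simply ignored and the
 * used ones are truncated to the inner function's argument types.
 */
uint64_t demo_helper(uint64_t r1, uint64_t r2, uint64_t r3,
                     uint64_t r4, uint64_t r5)
{
    (void)r3; (void)r4; (void)r5;
    return inner_demo_helper((uint32_t)r1, (uint32_t)r2);
}
```

Because the narrowing happens inside the wrapper, the caller can pass
full 64-bit register values without knowing the inner prototype.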

>
> Cheers,
> Daniel

-Shubham

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-21 14:26                     ` Shubham Bansal
  (?)
@ 2017-06-21 16:32                       ` Daniel Borkmann
  -1 siblings, 0 replies; 87+ messages in thread
From: Daniel Borkmann @ 2017-06-21 16:32 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

On 06/21/2017 04:26 PM, Shubham Bansal wrote:
[...]
> So ultimately, we call helper_function which takes u64 as arguments
> only. I know its asking a lot, but can you please confirm this asap? I
> would like to start implementing it.

Yes, that is correct. I think it would be the better, more generic
approach going forward to always assume that.
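
A minimal sketch of where the arguments would land on 32-bit ARM under
that assumption, using a simplified model of the AAPCS core-register
rules for 4-byte and 8-byte integer arguments only. struct arg_loc and
aapcs_assign are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_ARG_REGS = 4 };  /* r0..r3 carry arguments */

struct arg_loc {
    int reg;        /* first core register used, or -1 if on the stack */
    int stack_off;  /* byte offset from sp at the call, or -1 */
};

/*
 * Assign a location to each argument. An 8-byte value starts at an
 * even register or an 8-byte aligned stack slot and is never split
 * between registers and stack.
 */
void aapcs_assign(const int *sizes, size_t nargs, struct arg_loc *out)
{
    int ncrn = 0;  /* next core register number */
    int nsaa = 0;  /* next stacked argument offset */
    size_t i;

    for (i = 0; i < nargs; i++) {
        int words = sizes[i] / 4;

        assert(sizes[i] == 4 || sizes[i] == 8);
        if (words == 2)
            ncrn = (ncrn + 1) & ~1;      /* round up to even register */
        if (ncrn + words <= NUM_ARG_REGS) {
            out[i].reg = ncrn;
            out[i].stack_off = -1;
            ncrn += words;
            continue;
        }
        ncrn = NUM_ARG_REGS;             /* registers are exhausted */
        if (words == 2)
            nsaa = (nsaa + 7) & ~7;      /* 8-byte align the slot */
        out[i].reg = -1;
        out[i].stack_off = nsaa;
        nsaa += sizes[i];
    }
}
```

With five 8-byte arguments the first two occupy r0:r1 and r2:r3 and the
remaining three go to the stack at offsets 0, 8 and 16. That layout is
the same for every helper, so no per-helper knowledge is needed.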

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-21 16:32                       ` Daniel Borkmann
  (?)
@ 2017-06-21 19:37                         ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-06-21 19:37 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

Hi Daniel,

Good news. Got the CALL to work.

[  145.670882] test_bpf: Summary: 316 PASSED, 0 FAILED, [287/308 JIT'ed]

Awesome. Do you think the patch could get accepted with this
implementation? If so, I will send the patch in a couple of days after
some refactoring. If not, please let me know what more is required.

Best,
Shubham Bansal


On Wed, Jun 21, 2017 at 10:02 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 06/21/2017 04:26 PM, Shubham Bansal wrote:
> [...]
>>
>> So ultimately, we call helper_function which takes u64 as arguments
>> only. I know its asking a lot, but can you please confirm this asap? I
>> would like to start implementing it.
>
>
> Yes, that is correct. I think it would be the better, more generic
> approach going forward to always assume that.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-21 19:37                         ` Shubham Bansal
  (?)
@ 2017-06-21 19:53                           ` Daniel Borkmann
  -1 siblings, 0 replies; 87+ messages in thread
From: Daniel Borkmann @ 2017-06-21 19:53 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

On 06/21/2017 09:37 PM, Shubham Bansal wrote:
> Hi Daniel,
>
> Good news. Got the CALL to work.
>
> [  145.670882] test_bpf: Summary: 316 PASSED, 0 FAILED, [287/308 JIT'ed]
>
> Awesome. Do you think with this implementation, the patch could get
> accepted? If you think so, then I will send the patch in couple of
> days after some refactoring, if not, then do let me know what more is
> required?

Nice, it's ultimately up to the arm folks to review the set in-depth,
but feel free to send out the patch once you're done refactoring. With
BPF_CALL support that looks quite good from pov of supported insns.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-21 16:32                       ` Daniel Borkmann
@ 2017-06-23 22:39                         ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-06-23 22:39 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

[-- Attachment #1: Type: text/plain, Size: 452 bytes --]

Hi Russell, Daniel and Kees,

I am attaching the latest patch with this mail. It includes support
for BPF_CALL | BPF_JMP, tested with and without constant blinding on
an ARMv7 machine.
Due to limitations of my machine I can't test the tail call. It
would be a great help if any of you could help me with this.

It's been a long time since this patch has been in the works. Russell,
can you please help with sending this patch to the ARM patch tracker?

Thanks.
Shubham
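
For anyone skimming the attachment: the patch keeps each 64-bit eBPF register
as two 32-bit halves, and the comments in emit_a32_add_r() and
emit_a32_sub_r() describe 64-bit add and subtract as adds/adc and subs/sbc
pairs. Below is a minimal user-space sketch of the same arithmetic in plain C.
It is not part of the patch, and the names reg64, add64 and sub64 are
hypothetical.

```c
#include <stdint.h>

/* Hypothetical: a 64-bit value split into two 32-bit halves. */
struct reg64 {
	uint32_t hi;
	uint32_t lo;
};

/*
 * 64-bit add from two 32-bit adds:
 *   adds lo, lo, src_lo   (low word, produces a carry)
 *   adc  hi, hi, src_hi   (high word, consumes the carry)
 */
static struct reg64 add64(struct reg64 dst, struct reg64 src)
{
	uint32_t lo = dst.lo + src.lo;	/* wraps modulo 2^32 */
	uint32_t carry = lo < dst.lo;	/* 1 if the low add overflowed */
	struct reg64 r = { dst.hi + src.hi + carry, lo };

	return r;
}

/*
 * 64-bit subtract from two 32-bit subtracts:
 *   subs lo, lo, src_lo   (low word, produces a borrow)
 *   sbc  hi, hi, src_hi   (high word, consumes the borrow)
 */
static struct reg64 sub64(struct reg64 dst, struct reg64 src)
{
	uint32_t borrow = dst.lo < src.lo;	/* 1 if the low sub underflows */
	struct reg64 r = { dst.hi - src.hi - borrow, dst.lo - src.lo };

	return r;
}
```

For example, adding {hi = 0, lo = 1} to {hi = 0, lo = 0xffffffff} carries into
the high word and gives {hi = 1, lo = 0}.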

[-- Attachment #2: 0001-Added-Support-for-BPF_CALL-BPF_JMP.patch --]
[-- Type: application/octet-stream, Size: 87846 bytes --]

From 502dd777765a982ce1b479ee01911fa6fe023a76 Mon Sep 17 00:00:00 2001
From: Shubham Bansal <illusionist.neo@gmail.com>
Date: Sat, 24 Jun 2017 04:03:37 +0530
Subject: [PATCH] Added Support for BPF_CALL | BPF_JMP.

---
 arch/arm/Kconfig          |    2 +-
 arch/arm/net/bpf_jit_32.c | 2430 ++++++++++++++++++++++++++++++---------------
 arch/arm/net/bpf_jit_32.h |  108 +-
 3 files changed, 1736 insertions(+), 804 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 4c1a35f..53bf116 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -48,7 +48,7 @@ config ARM
 	select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
-	select HAVE_CBPF_JIT
+	select HAVE_EBPF_JIT
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CONTEXT_TRACKING
 	select HAVE_C_RECORDMCOUNT
diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index d5b9fa1..8b8ddc4 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -1,13 +1,15 @@
 /*
- * Just-In-Time compiler for BPF filters on 32bit ARM
+ * Just-In-Time compiler for eBPF filters on 32bit ARM
  *
  * Copyright (c) 2011 Mircea Gherzan <mgherzan@gmail.com>
+ * Copyright (c) 2017 Shubham Bansal <illusionist.neo@gmail.com>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License as published by the
  * Free Software Foundation; version 2 of the License.
  */
 
+#include <linux/bpf.h>
 #include <linux/bitops.h>
 #include <linux/compiler.h>
 #include <linux/errno.h>
@@ -18,50 +20,96 @@
 #include <linux/if_vlan.h>
 
 #include <asm/cacheflush.h>
-#include <asm/set_memory.h>
 #include <asm/hwcap.h>
 #include <asm/opcodes.h>
 
 #include "bpf_jit_32.h"
 
+int bpf_jit_enable __read_mostly;
+
+#define STACK_OFFSET(k)	(k)
+#define TMP_REG_1	(MAX_BPF_JIT_REG + 0)	/* TEMP Register 1 */
+#define TMP_REG_2	(MAX_BPF_JIT_REG + 1)	/* TEMP Register 2 */
+#define TCALL_CNT	(MAX_BPF_JIT_REG + 2)	/* Tail Call Count */
+
+/* Flags used for JIT optimization */
+#define SEEN_CALL	(1 << 0)
+
+#define FLAG_IMM_OVERFLOW	(1 << 0)
+
 /*
- * ABI:
+ * Map eBPF registers to ARM 32bit registers or stack scratch space.
+ *
+ * 1. The first argument is passed in arm 32 bit registers and the rest of
+ * the arguments are passed in stack scratch space.
+ * 2. The first callee-saved argument is mapped to arm 32 bit registers and
+ * the rest are mapped to scratch space on the stack.
+ * 3. We need two 64 bit temp registers to do complex operations on eBPF
+ * registers.
+ *
+ * As the eBPF registers are all 64 bit registers and arm has only 32 bit
+ * registers, we have to map each eBPF register to two arm 32 bit regs or
+ * scratch memory space and build the eBPF 64 bit register from those.
  *
- * r0	scratch register
- * r4	BPF register A
- * r5	BPF register X
- * r6	pointer to the skb
- * r7	skb->data
- * r8	skb_headlen(skb)
  */
+static const u8 bpf2a32[][2] = {
+	/* return value from in-kernel function, and exit value from eBPF */
+	[BPF_REG_0] = {ARM_R1, ARM_R0},
+	/* arguments from eBPF program to in-kernel function */
+	[BPF_REG_1] = {ARM_R3, ARM_R2},
+	/* Stored on stack scratch space */
+	[BPF_REG_2] = {STACK_OFFSET(0), STACK_OFFSET(4)},
+	[BPF_REG_3] = {STACK_OFFSET(8), STACK_OFFSET(12)},
+	[BPF_REG_4] = {STACK_OFFSET(16), STACK_OFFSET(20)},
+	[BPF_REG_5] = {STACK_OFFSET(24), STACK_OFFSET(28)},
+	/* callee saved registers that in-kernel function will preserve */
+	[BPF_REG_6] = {ARM_R5, ARM_R4},
+	/* Stored on stack scratch space */
+	[BPF_REG_7] = {STACK_OFFSET(32), STACK_OFFSET(36)},
+	[BPF_REG_8] = {STACK_OFFSET(40), STACK_OFFSET(44)},
+	[BPF_REG_9] = {STACK_OFFSET(48), STACK_OFFSET(52)},
+	/* Read only Frame Pointer to access Stack */
+	[BPF_REG_FP] = {STACK_OFFSET(56), STACK_OFFSET(60)},
+	/* Temporary Register for internal BPF JIT, can be used
+	 * for constant blindings and others.
+	 */
+	[TMP_REG_1] = {ARM_R7, ARM_R6},
+	[TMP_REG_2] = {ARM_R10, ARM_R8},
+	/* Tail call count. Stored on stack scratch space. */
+	[TCALL_CNT] = {STACK_OFFSET(64), STACK_OFFSET(68)},
+	/* temporary register for blinding constants.
+	 * Stored on stack scratch space.
+	 */
+	[BPF_REG_AX] = {STACK_OFFSET(72), STACK_OFFSET(76)},
+};
 
-#define r_scratch	ARM_R0
-/* r1-r3 are (also) used for the unaligned loads on the non-ARMv7 slowpath */
-#define r_off		ARM_R1
-#define r_A		ARM_R4
-#define r_X		ARM_R5
-#define r_skb		ARM_R6
-#define r_skb_data	ARM_R7
-#define r_skb_hl	ARM_R8
-
-#define SCRATCH_SP_OFFSET	0
-#define SCRATCH_OFF(k)		(SCRATCH_SP_OFFSET + 4 * (k))
-
-#define SEEN_MEM		((1 << BPF_MEMWORDS) - 1)
-#define SEEN_MEM_WORD(k)	(1 << (k))
-#define SEEN_X			(1 << BPF_MEMWORDS)
-#define SEEN_CALL		(1 << (BPF_MEMWORDS + 1))
-#define SEEN_SKB		(1 << (BPF_MEMWORDS + 2))
-#define SEEN_DATA		(1 << (BPF_MEMWORDS + 3))
+#define	dst_lo	dst[1]
+#define dst_hi	dst[0]
+#define src_lo	src[1]
+#define src_hi	src[0]
 
-#define FLAG_NEED_X_RESET	(1 << 0)
-#define FLAG_IMM_OVERFLOW	(1 << 1)
+/*
+ * JIT Context:
+ *
+ * prog			:	bpf_prog
+ * idx			:	index of current last JITed instruction.
+ * prologue_bytes	:	bytes used in prologue.
+ * epilogue_offset	:	offset of epilogue starting.
+ * seen			:	bit mask used for JIT optimization.
+ * offsets		:	array of eBPF instruction offsets in
+ *				JITed code.
+ * target		:	final JITed code.
+ * epilogue_bytes	:	number of bytes used in epilogue.
+ * imm_count		:	number of immediates used for global
+ *				variables.
+ * imms			:	array of global variable addresses.
+ */
 
 struct jit_ctx {
-	const struct bpf_prog *skf;
-	unsigned idx;
-	unsigned prologue_bytes;
-	int ret0_fp_idx;
+	const struct bpf_prog *prog;
+	unsigned int idx;
+	unsigned int prologue_bytes;
+	unsigned int epilogue_offset;
 	u32 seen;
 	u32 flags;
 	u32 *offsets;
@@ -73,68 +121,16 @@ struct jit_ctx {
 #endif
 };
 
-int bpf_jit_enable __read_mostly;
-
-static inline int call_neg_helper(struct sk_buff *skb, int offset, void *ret,
-		      unsigned int size)
-{
-	void *ptr = bpf_internal_load_pointer_neg_helper(skb, offset, size);
-
-	if (!ptr)
-		return -EFAULT;
-	memcpy(ret, ptr, size);
-	return 0;
-}
-
-static u64 jit_get_skb_b(struct sk_buff *skb, int offset)
-{
-	u8 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 1);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 1);
-
-	return (u64)err << 32 | ret;
-}
-
-static u64 jit_get_skb_h(struct sk_buff *skb, int offset)
-{
-	u16 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 2);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 2);
-
-	return (u64)err << 32 | ntohs(ret);
-}
-
-static u64 jit_get_skb_w(struct sk_buff *skb, int offset)
-{
-	u32 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 4);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 4);
-
-	return (u64)err << 32 | ntohl(ret);
-}
-
 /*
  * Wrappers which handle both OABI and EABI and assures Thumb2 interworking
  * (where the assembly routines like __aeabi_uidiv could cause problems).
  */
-static u32 jit_udiv(u32 dividend, u32 divisor)
+static u32 jit_udiv32(u32 dividend, u32 divisor)
 {
 	return dividend / divisor;
 }
 
-static u32 jit_mod(u32 dividend, u32 divisor)
+static u32 jit_mod32(u32 dividend, u32 divisor)
 {
 	return dividend % divisor;
 }
@@ -158,36 +154,22 @@ static inline void emit(u32 inst, struct jit_ctx *ctx)
 	_emit(ARM_COND_AL, inst, ctx);
 }
 
-static u16 saved_regs(struct jit_ctx *ctx)
+/*
+ * Encodes x as an imm12 (12 bit) value; returns -1 if that is impossible.
+ */
+static int16_t imm8m(u32 x)
 {
-	u16 ret = 0;
-
-	if ((ctx->skf->len > 1) ||
-	    (ctx->skf->insns[0].code == (BPF_RET | BPF_A)))
-		ret |= 1 << r_A;
-
-#ifdef CONFIG_FRAME_POINTER
-	ret |= (1 << ARM_FP) | (1 << ARM_IP) | (1 << ARM_LR) | (1 << ARM_PC);
-#else
-	if (ctx->seen & SEEN_CALL)
-		ret |= 1 << ARM_LR;
-#endif
-	if (ctx->seen & (SEEN_DATA | SEEN_SKB))
-		ret |= 1 << r_skb;
-	if (ctx->seen & SEEN_DATA)
-		ret |= (1 << r_skb_data) | (1 << r_skb_hl);
-	if (ctx->seen & SEEN_X)
-		ret |= 1 << r_X;
-
-	return ret;
-}
+	u32 rot;
 
-static inline int mem_words_used(struct jit_ctx *ctx)
-{
-	/* yes, we do waste some stack space IF there are "holes" in the set" */
-	return fls(ctx->seen & SEEN_MEM);
+	for (rot = 0; rot < 16; rot++)
+		if ((x & ~ror32(0xff, 2 * rot)) == 0)
+			return rol32(x, 2 * rot) | (rot << 8);
+	return -1;
 }
 
+/*
+ * Initializes the JIT space with undefined instructions.
+ */
 static void jit_fill_hole(void *area, unsigned int size)
 {
 	u32 *ptr;
@@ -196,88 +178,34 @@ static void jit_fill_hole(void *area, unsigned int size)
 		*ptr++ = __opcode_to_mem_arm(ARM_INST_UDF);
 }
 
-static void build_prologue(struct jit_ctx *ctx)
-{
-	u16 reg_set = saved_regs(ctx);
-	u16 off;
-
-#ifdef CONFIG_FRAME_POINTER
-	emit(ARM_MOV_R(ARM_IP, ARM_SP), ctx);
-	emit(ARM_PUSH(reg_set), ctx);
-	emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
-#else
-	if (reg_set)
-		emit(ARM_PUSH(reg_set), ctx);
-#endif
-
-	if (ctx->seen & (SEEN_DATA | SEEN_SKB))
-		emit(ARM_MOV_R(r_skb, ARM_R0), ctx);
-
-	if (ctx->seen & SEEN_DATA) {
-		off = offsetof(struct sk_buff, data);
-		emit(ARM_LDR_I(r_skb_data, r_skb, off), ctx);
-		/* headlen = len - data_len */
-		off = offsetof(struct sk_buff, len);
-		emit(ARM_LDR_I(r_skb_hl, r_skb, off), ctx);
-		off = offsetof(struct sk_buff, data_len);
-		emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
-		emit(ARM_SUB_R(r_skb_hl, r_skb_hl, r_scratch), ctx);
-	}
+/* Stack must be multiples of 16 Bytes */
+#define STACK_ALIGN(sz) (((sz) + 15) & ~15)
 
-	if (ctx->flags & FLAG_NEED_X_RESET)
-		emit(ARM_MOV_I(r_X, 0), ctx);
-
-	/* do not leak kernel data to userspace */
-	if (bpf_needs_clear_a(&ctx->skf->insns[0]))
-		emit(ARM_MOV_I(r_A, 0), ctx);
-
-	/* stack space for the BPF_MEM words */
-	if (ctx->seen & SEEN_MEM)
-		emit(ARM_SUB_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
-}
-
-static void build_epilogue(struct jit_ctx *ctx)
-{
-	u16 reg_set = saved_regs(ctx);
-
-	if (ctx->seen & SEEN_MEM)
-		emit(ARM_ADD_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
-
-	reg_set &= ~(1 << ARM_LR);
-
-#ifdef CONFIG_FRAME_POINTER
-	/* the first instruction of the prologue was: mov ip, sp */
-	reg_set &= ~(1 << ARM_IP);
-	reg_set |= (1 << ARM_SP);
-	emit(ARM_LDM(ARM_SP, reg_set), ctx);
-#else
-	if (reg_set) {
-		if (ctx->seen & SEEN_CALL)
-			reg_set |= 1 << ARM_PC;
-		emit(ARM_POP(reg_set), ctx);
-	}
+/* Stack space for BPF_REG_2, BPF_REG_3, BPF_REG_4,
+ * BPF_REG_5, BPF_REG_7, BPF_REG_8, BPF_REG_9,
+ * BPF_REG_FP, Tail call count and BPF_REG_AX.
+ */
+#define SCRATCH_SIZE 80
 
-	if (!(ctx->seen & SEEN_CALL))
-		emit(ARM_BX(ARM_LR), ctx);
-#endif
-}
+/* total stack size used in JITed code */
+#define _STACK_SIZE \
+	(MAX_BPF_STACK + \
+	 SCRATCH_SIZE + \
+	 4 /* extra for skb_copy_bits buffer */)
 
-static int16_t imm8m(u32 x)
-{
-	u32 rot;
+#define STACK_SIZE STACK_ALIGN(_STACK_SIZE)
 
-	for (rot = 0; rot < 16; rot++)
-		if ((x & ~ror32(0xff, 2 * rot)) == 0)
-			return rol32(x, 2 * rot) | (rot << 8);
+/* Get the offset of eBPF REGISTERs stored on scratch space. */
+#define STACK_VAR(off) (STACK_SIZE - (off) - 4)
 
-	return -1;
-}
+/* Offset of skb_copy_bits buffer */
+#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
 
 #if __LINUX_ARM_ARCH__ < 7
 
 static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 {
-	unsigned i = 0, offset;
+	unsigned int i = 0, offset;
 	u16 imm;
 
 	/* on the "fake" run we just count them (duplicates included) */
@@ -296,7 +224,7 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 		ctx->imms[i] = k;
 
 	/* constants go just after the epilogue */
-	offset =  ctx->offsets[ctx->skf->len];
+	offset =  ctx->offsets[ctx->prog->len - 1] * 4;
 	offset += ctx->prologue_bytes;
 	offset += ctx->epilogue_bytes;
 	offset += i * 4;
@@ -320,10 +248,22 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 
 #endif /* __LINUX_ARM_ARCH__ */
 
+static inline int bpf2a32_offset(int bpf_to, int bpf_from,
+				 const struct jit_ctx *ctx) {
+	int to, from;
+
+	if (ctx->target == NULL)
+		return 0;
+	to = ctx->offsets[bpf_to];
+	from = ctx->offsets[bpf_from];
+
+	return to - from - 1;
+}
+
 /*
  * Move an immediate that's not an imm8m to a core register.
  */
-static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
+static inline void emit_mov_i_no8m(const u8 rd, u32 val, struct jit_ctx *ctx)
 {
 #if __LINUX_ARM_ARCH__ < 7
 	emit(ARM_LDR_I(rd, ARM_PC, imm_offset(val, ctx)), ctx);
@@ -334,7 +274,7 @@ static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
 #endif
 }
 
-static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
+static inline void emit_mov_i(const u8 rd, u32 val, struct jit_ctx *ctx)
 {
 	int imm12 = imm8m(val);
 
@@ -344,676 +284,1578 @@ static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
 		emit_mov_i_no8m(rd, val, ctx);
 }
 
-#if __LINUX_ARM_ARCH__ < 6
-
-static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
 {
-	_emit(cond, ARM_LDRB_I(ARM_R3, r_addr, 1), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 3), ctx);
-	_emit(cond, ARM_LSL_I(ARM_R3, ARM_R3, 16), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R0, r_addr, 2), ctx);
-	_emit(cond, ARM_ORR_S(ARM_R3, ARM_R3, ARM_R1, SRTYPE_LSL, 24), ctx);
-	_emit(cond, ARM_ORR_R(ARM_R3, ARM_R3, ARM_R2), ctx);
-	_emit(cond, ARM_ORR_S(r_res, ARM_R3, ARM_R0, SRTYPE_LSL, 8), ctx);
+	ctx->seen |= SEEN_CALL;
+#if __LINUX_ARM_ARCH__ < 5
+	emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
+
+	if (elf_hwcap & HWCAP_THUMB)
+		emit(ARM_BX(tgt_reg), ctx);
+	else
+		emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
+#else
+	emit(ARM_BLX_R(tgt_reg), ctx);
+#endif
 }
 
-static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+static inline int epilogue_offset(const struct jit_ctx *ctx)
 {
-	_emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 1), ctx);
-	_emit(cond, ARM_ORR_S(r_res, ARM_R2, ARM_R1, SRTYPE_LSL, 8), ctx);
+	int to, from;
+	/* No need for 1st dummy run */
+	if (ctx->target == NULL)
+		return 0;
+	to = ctx->epilogue_offset;
+	from = ctx->idx;
+
+	return to - from - 2;
 }
 
-static inline void emit_swap16(u8 r_dst, u8 r_src, struct jit_ctx *ctx)
+static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx, u8 op)
 {
-	/* r_dst = (r_src << 8) | (r_src >> 8) */
-	emit(ARM_LSL_I(ARM_R1, r_src, 8), ctx);
-	emit(ARM_ORR_S(r_dst, ARM_R1, r_src, SRTYPE_LSR, 8), ctx);
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	s32 jmp_offset;
+
+	/* checks if divisor is zero or not. If it is, then
+	 * exit directly.
+	 */
+	emit(ARM_CMP_I(rn, 0), ctx);
+	_emit(ARM_COND_EQ, ARM_MOV_I(ARM_R0, 0), ctx);
+	jmp_offset = epilogue_offset(ctx);
+	_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+#if __LINUX_ARM_ARCH__ == 7
+	if (elf_hwcap & HWCAP_IDIVA) {
+		if (op == BPF_DIV)
+			emit(ARM_UDIV(rd, rm, rn), ctx);
+		else {
+			emit(ARM_UDIV(ARM_IP, rm, rn), ctx);
+			emit(ARM_MLS(rd, rn, ARM_IP, rm), ctx);
+		}
+		return;
+	}
+#endif
 
 	/*
-	 * we need to mask out the bits set in r_dst[23:16] due to
-	 * the first shift instruction.
-	 *
-	 * note that 0x8ff is the encoded immediate 0x00ff0000.
+	 * For BPF_ALU | BPF_DIV | BPF_K instructions:
+	 * As ARM_R1 and ARM_R0 contain the 1st argument of the bpf
+	 * function, we need to save them on the caller side to keep
+	 * them from getting destroyed within the callee.
+	 * After the return from the callee, we restore ARM_R0 and
+	 * ARM_R1.
+	 */
-	emit(ARM_BIC_I(r_dst, r_dst, 0x8ff), ctx);
-}
+	if (rn != ARM_R1) {
+		emit(ARM_MOV_R(tmp[0], ARM_R1), ctx);
+		emit(ARM_MOV_R(ARM_R1, rn), ctx);
+	}
+	if (rm != ARM_R0) {
+		emit(ARM_MOV_R(tmp[1], ARM_R0), ctx);
+		emit(ARM_MOV_R(ARM_R0, rm), ctx);
+	}
+
+	/* Call appropriate function */
+	ctx->seen |= SEEN_CALL;
+	emit_mov_i(ARM_IP, op == BPF_DIV ?
+		   (u32)jit_udiv32 : (u32)jit_mod32, ctx);
+	emit_blx_r(ARM_IP, ctx);
 
-#else  /* ARMv6+ */
+	/* Save return value */
+	if (rd != ARM_R0)
+		emit(ARM_MOV_R(rd, ARM_R0), ctx);
 
-static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
-{
-	_emit(cond, ARM_LDR_I(r_res, r_addr, 0), ctx);
-#ifdef __LITTLE_ENDIAN
-	_emit(cond, ARM_REV(r_res, r_res), ctx);
-#endif
+	/* Restore ARM_R0 and ARM_R1 */
+	if (rn != ARM_R1)
+		emit(ARM_MOV_R(ARM_R1, tmp[0]), ctx);
+	if (rm != ARM_R0)
+		emit(ARM_MOV_R(ARM_R0, tmp[1]), ctx);
 }
 
-static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+/* Checks whether BPF register is on scratch stack space or not. */
+static inline bool is_on_stack(u8 bpf_reg)
 {
-	_emit(cond, ARM_LDRH_I(r_res, r_addr, 0), ctx);
-#ifdef __LITTLE_ENDIAN
-	_emit(cond, ARM_REV16(r_res, r_res), ctx);
-#endif
+	static u8 stack_regs[] = {BPF_REG_AX, BPF_REG_3, BPF_REG_4, BPF_REG_5,
+				BPF_REG_7, BPF_REG_8, BPF_REG_9, TCALL_CNT,
+				BPF_REG_2, BPF_REG_FP};
+	int i, reg_len = sizeof(stack_regs);
+
+	for (i = 0 ; i < reg_len ; i++) {
+		if (bpf_reg == stack_regs[i])
+			return true;
+	}
+	return false;
 }
 
-static inline void emit_swap16(u8 r_dst __maybe_unused,
-			       u8 r_src __maybe_unused,
-			       struct jit_ctx *ctx __maybe_unused)
+static inline void emit_a32_mov_i(const u8 dst, const u32 val,
+				  bool dstk, struct jit_ctx *ctx)
 {
-#ifdef __LITTLE_ENDIAN
-	emit(ARM_REV16(r_dst, r_src), ctx);
-#endif
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+
+	if (dstk) {
+		emit_mov_i(tmp[1], val, ctx);
+		emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(dst)), ctx);
+	} else {
+		emit_mov_i(dst, val, ctx);
+	}
 }
 
-#endif /* __LINUX_ARM_ARCH__ < 6 */
+/* Sign extended move */
+static inline void emit_a32_mov_i64(const bool is64, const u8 dst[],
+				  const u32 val, bool dstk,
+				  struct jit_ctx *ctx) {
+	u32 hi = 0;
 
+	if (is64 && (val & (1<<31)))
+		hi = (u32)~0;
+	emit_a32_mov_i(dst_lo, val, dstk, ctx);
+	emit_a32_mov_i(dst_hi, hi, dstk, ctx);
+}
 
-/* Compute the immediate value for a PC-relative branch. */
-static inline u32 b_imm(unsigned tgt, struct jit_ctx *ctx)
-{
-	u32 imm;
+static inline void emit_a32_add_r(const u8 dst, const u8 src,
+			      const bool is64, const bool hi,
+			      struct jit_ctx *ctx) {
+	/* 64 bit :
+	 *	adds dst_lo, dst_lo, src_lo
+	 *	adc dst_hi, dst_hi, src_hi
+	 * 32 bit :
+	 *	add dst_lo, dst_lo, src_lo
+	 */
+	if (!hi && is64)
+		emit(ARM_ADDS_R(dst, dst, src), ctx);
+	else if (hi && is64)
+		emit(ARM_ADC_R(dst, dst, src), ctx);
+	else
+		emit(ARM_ADD_R(dst, dst, src), ctx);
+}
 
-	if (ctx->target == NULL)
-		return 0;
-	/*
-	 * BPF allows only forward jumps and the offset of the target is
-	 * still the one computed during the first pass.
+static inline void emit_a32_sub_r(const u8 dst, const u8 src,
+				  const bool is64, const bool hi,
+				  struct jit_ctx *ctx) {
+	/* 64 bit :
+	 *	subs dst_lo, dst_lo, src_lo
+	 *	sbc dst_hi, dst_hi, src_hi
+	 * 32 bit :
+	 *	sub dst_lo, dst_lo, src_lo
 	 */
-	imm  = ctx->offsets[tgt] + ctx->prologue_bytes - (ctx->idx * 4 + 8);
+	if (!hi && is64)
+		emit(ARM_SUBS_R(dst, dst, src), ctx);
+	else if (hi && is64)
+		emit(ARM_SBC_R(dst, dst, src), ctx);
+	else
+		emit(ARM_SUB_R(dst, dst, src), ctx);
+}
 
-	return imm >> 2;
+static inline void emit_alu_r(const u8 dst, const u8 src, const bool is64,
+			      const bool hi, const u8 op, struct jit_ctx *ctx){
+	switch (BPF_OP(op)) {
+	/* dst = dst + src */
+	case BPF_ADD:
+		emit_a32_add_r(dst, src, is64, hi, ctx);
+		break;
+	/* dst = dst - src */
+	case BPF_SUB:
+		emit_a32_sub_r(dst, src, is64, hi, ctx);
+		break;
+	/* dst = dst | src */
+	case BPF_OR:
+		emit(ARM_ORR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst & src */
+	case BPF_AND:
+		emit(ARM_AND_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst ^ src */
+	case BPF_XOR:
+		emit(ARM_EOR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst * src */
+	case BPF_MUL:
+		emit(ARM_MUL(dst, dst, src), ctx);
+		break;
+	/* dst = dst << src */
+	case BPF_LSH:
+		emit(ARM_LSL_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst >> src */
+	case BPF_RSH:
+		emit(ARM_LSR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst >> src (signed)*/
+	case BPF_ARSH:
+		emit(ARM_MOV_SR(dst, dst, SRTYPE_ASR, src), ctx);
+		break;
+	}
 }
 
-#define OP_IMM3(op, r1, r2, imm_val, ctx)				\
-	do {								\
-		imm12 = imm8m(imm_val);					\
-		if (imm12 < 0) {					\
-			emit_mov_i_no8m(r_scratch, imm_val, ctx);	\
-			emit(op ## _R((r1), (r2), r_scratch), ctx);	\
-		} else {						\
-			emit(op ## _I((r1), (r2), imm12), ctx);		\
-		}							\
-	} while (0)
-
-static inline void emit_err_ret(u8 cond, struct jit_ctx *ctx)
-{
-	if (ctx->ret0_fp_idx >= 0) {
-		_emit(cond, ARM_B(b_imm(ctx->ret0_fp_idx, ctx)), ctx);
-		/* NOP to keep the size constant between passes */
-		emit(ARM_MOV_R(ARM_R0, ARM_R0), ctx);
+/* ALU operation (32 bit)
+ * dst = dst (op) src
+ */
+static inline void emit_a32_alu_r(const u8 dst, const u8 src,
+				  bool dstk, bool sstk,
+				  struct jit_ctx *ctx, const bool is64,
+				  const bool hi, const u8 op) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rn = sstk ? tmp[1] : src;
+
+	if (sstk)
+		emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src)), ctx);
+
+	/* ALU operation */
+	if (dstk) {
+		emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
+		emit_alu_r(tmp[0], rn, is64, hi, op, ctx);
+		emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
 	} else {
-		_emit(cond, ARM_MOV_I(ARM_R0, 0), ctx);
-		_emit(cond, ARM_B(b_imm(ctx->skf->len, ctx)), ctx);
+		emit_alu_r(dst, rn, is64, hi, op, ctx);
 	}
 }
 
-static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
-{
-#if __LINUX_ARM_ARCH__ < 5
-	emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
+/* ALU operation (64 bit) */
+static inline void emit_a32_alu_r64(const bool is64, const u8 dst[],
+				  const u8 src[], bool dstk,
+				  bool sstk, struct jit_ctx *ctx,
+				  const u8 op) {
+	emit_a32_alu_r(dst_lo, src_lo, dstk, sstk, ctx, is64, false, op);
+	if (is64)
+		emit_a32_alu_r(dst_hi, src_hi, dstk, sstk, ctx, is64, true, op);
+	else
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+}
 
-	if (elf_hwcap & HWCAP_THUMB)
-		emit(ARM_BX(tgt_reg), ctx);
+/* dst = src (4 bytes) */
+static inline void emit_a32_mov_r(const u8 dst, const u8 src,
+				  bool dstk, bool sstk,
+				  struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rt = sstk ? tmp[0] : src;
+
+	if (sstk)
+		emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(src)), ctx);
+	if (dstk)
+		emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst)), ctx);
 	else
-		emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
-#else
-	emit(ARM_BLX_R(tgt_reg), ctx);
-#endif
+		emit(ARM_MOV_R(dst, rt), ctx);
 }
 
-static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx,
-				int bpf_op)
-{
-#if __LINUX_ARM_ARCH__ == 7
-	if (elf_hwcap & HWCAP_IDIVA) {
-		if (bpf_op == BPF_DIV)
-			emit(ARM_UDIV(rd, rm, rn), ctx);
-		else {
-			emit(ARM_UDIV(ARM_R3, rm, rn), ctx);
-			emit(ARM_MLS(rd, rn, ARM_R3, rm), ctx);
-		}
-		return;
+/* dst = src */
+static inline void emit_a32_mov_r64(const bool is64, const u8 dst[],
+				  const u8 src[], bool dstk,
+				  bool sstk, struct jit_ctx *ctx) {
+	emit_a32_mov_r(dst_lo, src_lo, dstk, sstk, ctx);
+	if (is64) {
+		/* complete 8 byte move */
+		emit_a32_mov_r(dst_hi, src_hi, dstk, sstk, ctx);
+	} else {
+		/* Zero out high 4 bytes */
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
 	}
-#endif
+}
 
-	/*
-	 * For BPF_ALU | BPF_DIV | BPF_K instructions, rm is ARM_R4
-	 * (r_A) and rn is ARM_R0 (r_scratch) so load rn first into
-	 * ARM_R1 to avoid accidentally overwriting ARM_R0 with rm
-	 * before using it as a source for ARM_R1.
-	 *
-	 * For BPF_ALU | BPF_DIV | BPF_X rm is ARM_R4 (r_A) and rn is
-	 * ARM_R5 (r_X) so there is no particular register overlap
-	 * issues.
-	 */
-	if (rn != ARM_R1)
-		emit(ARM_MOV_R(ARM_R1, rn), ctx);
-	if (rm != ARM_R0)
-		emit(ARM_MOV_R(ARM_R0, rm), ctx);
+/* Shift and negate operations with an immediate operand */
+static inline void emit_a32_alu_i(const u8 dst, const u32 val, bool dstk,
+				struct jit_ctx *ctx, const u8 op) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[0] : dst;
+
+	if (dstk)
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+
+	/* Do shift operation */
+	switch (op) {
+	case BPF_LSH:
+		emit(ARM_LSL_I(rd, rd, val), ctx);
+		break;
+	case BPF_RSH:
+		emit(ARM_LSR_I(rd, rd, val), ctx);
+		break;
+	case BPF_NEG:
+		emit(ARM_RSB_I(rd, rd, val), ctx);
+		break;
+	}
+
+	if (dstk)
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+}
 
+/* dst = -dst (64 bit) */
+static inline void emit_a32_neg64(const u8 dst[], bool dstk,
+				struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst[1];
+	u8 rm = dstk ? tmp[0] : dst[0];
+
+	/* Setup Operand */
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do Negate Operation */
+	emit(ARM_RSBS_I(rd, rd, 0), ctx);
+	emit(ARM_RSC_I(rm, rm, 0), ctx);
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+/* dst = dst << src */
+static inline void emit_a32_lsh_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSH operation */
+	emit(ARM_SUB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_RSB_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
 	ctx->seen |= SEEN_CALL;
-	emit_mov_i(ARM_R3, bpf_op == BPF_DIV ? (u32)jit_udiv : (u32)jit_mod,
-		   ctx);
-	emit_blx_r(ARM_R3, ctx);
+	emit(ARM_MOV_SR(ARM_LR, rm, SRTYPE_ASL, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rd, SRTYPE_ASL, ARM_IP), ctx);
+	emit(ARM_ORR_SR(ARM_IP, ARM_LR, rd, SRTYPE_LSR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_ASL, rt), ctx);
+
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
 
-	if (rd != ARM_R0)
-		emit(ARM_MOV_R(rd, ARM_R0), ctx);
+/* dst = dst >> src (signed)*/
+static inline void emit_a32_arsh_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do the ARSH operation */
+	emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
+	_emit(ARM_COND_MI, ARM_B(0), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_ASR, rt), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
+
+/* dst = dst >> src */
+static inline void emit_a32_lsr_r64(const u8 dst[], const u8 src[], bool dstk,
+				     bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSR operation */
+	emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_LSR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_LSR, rt), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
+
+/* dst = dst << val */
+static inline void emit_a32_lsh_i64(const u8 dst[], bool dstk,
+				     const u32 val, struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSH operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[0], rm, SRTYPE_ASL, val), ctx);
+		emit(ARM_ORR_SI(rm, tmp2[0], rd, SRTYPE_LSR, 32 - val), ctx);
+		emit(ARM_MOV_SI(rd, rd, SRTYPE_ASL, val), ctx);
+	} else {
+		if (val == 32)
+			emit(ARM_MOV_R(rm, rd), ctx);
+		else
+			emit(ARM_MOV_SI(rm, rd, SRTYPE_ASL, val - 32), ctx);
+		emit(ARM_EOR_R(rd, rd, rd), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+/* dst = dst >> val */
+static inline void emit_a32_lsr_i64(const u8 dst[], bool dstk,
+				    const u32 val, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSR operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
+		emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_LSR, val), ctx);
+	} else if (val == 32) {
+		emit(ARM_MOV_R(rd, rm), ctx);
+		emit(ARM_MOV_I(rm, 0), ctx);
+	} else {
+		emit(ARM_MOV_SI(rd, rm, SRTYPE_LSR, val - 32), ctx);
+		emit(ARM_MOV_I(rm, 0), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
 }
 
-static inline void update_on_xread(struct jit_ctx *ctx)
+/* dst = dst >> val (signed) */
+static inline void emit_a32_arsh_i64(const u8 dst[], bool dstk,
+				     const u32 val, struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do ARSH operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
+		emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, val), ctx);
+	} else if (val == 32) {
+		emit(ARM_MOV_R(rd, rm), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
+	} else {
+		emit(ARM_MOV_SI(rd, rm, SRTYPE_ASR, val - 32), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+static inline void emit_a32_mul_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands for multiplication */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rn = sstk ? tmp2[0] : src_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+	if (sstk) {
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+		emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_hi)), ctx);
+	}
+
+	/* Do Multiplication */
+	emit(ARM_MUL(ARM_IP, rd, rn), ctx);
+	emit(ARM_MUL(ARM_LR, rm, rt), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_ADD_R(ARM_LR, ARM_IP, ARM_LR), ctx);
+
+	emit(ARM_UMULL(ARM_IP, rm, rd, rt), ctx);
+	emit(ARM_ADD_R(rm, ARM_LR, rm), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_IP), ctx);
+	}
+}
+
+/* *(size *)(dst + off) = src */
+static inline void emit_str_r(const u8 dst, const u8 src, bool dstk,
+			      const s32 off, struct jit_ctx *ctx, const u8 sz){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst;
+
+	if (dstk)
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+	if (off) {
+		emit_a32_mov_i(tmp[0], off, false, ctx);
+		emit(ARM_ADD_R(tmp[0], rd, tmp[0]), ctx);
+		rd = tmp[0];
+	}
+	switch (sz) {
+	case BPF_W:
+		/* Store a Word */
+		emit(ARM_STR_I(src, rd, 0), ctx);
+		break;
+	case BPF_H:
+		/* Store a HalfWord */
+		emit(ARM_STRH_I(src, rd, 0), ctx);
+		break;
+	case BPF_B:
+		/* Store a Byte */
+		emit(ARM_STRB_I(src, rd, 0), ctx);
+		break;
+	}
+}
+
+/* dst = *(size*)(src + off) */
+static inline void emit_ldx_r(const u8 dst, const u8 src, bool dstk,
+			      const s32 off, struct jit_ctx *ctx, const u8 sz){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst;
+	u8 rm = src;
+
+	if (off) {
+		emit_a32_mov_i(tmp[0], off, false, ctx);
+		emit(ARM_ADD_R(tmp[0], tmp[0], src), ctx);
+		rm = tmp[0];
+	}
+	switch (sz) {
+	case BPF_W:
+		/* Load a Word */
+		emit(ARM_LDR_I(rd, rm, 0), ctx);
+		break;
+	case BPF_H:
+		/* Load a HalfWord */
+		emit(ARM_LDRH_I(rd, rm, 0), ctx);
+		break;
+	case BPF_B:
+		/* Load a Byte */
+		emit(ARM_LDRB_I(rd, rm, 0), ctx);
+		break;
+	}
+	if (dstk)
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+}
+
+/* Arithmetic Operation */
+static inline void emit_ar_r(const u8 rd, const u8 rt, const u8 rm,
+			     const u8 rn, struct jit_ctx *ctx, u8 op) {
+	switch (op) {
+	case BPF_JSET:
+		ctx->seen |= SEEN_CALL;
+		emit(ARM_AND_R(ARM_IP, rt, rn), ctx);
+		emit(ARM_AND_R(ARM_LR, rd, rm), ctx);
+		emit(ARM_ORRS_R(ARM_IP, ARM_LR, ARM_IP), ctx);
+		break;
+	case BPF_JEQ:
+	case BPF_JNE:
+	case BPF_JGT:
+	case BPF_JGE:
+		emit(ARM_CMP_R(rd, rm), ctx);
+		_emit(ARM_COND_EQ, ARM_CMP_R(rt, rn), ctx);
+		break;
+	case BPF_JSGT:
+		emit(ARM_CMP_R(rn, rt), ctx);
+		emit(ARM_SBCS_R(ARM_IP, rm, rd), ctx);
+		break;
+	case BPF_JSGE:
+		emit(ARM_CMP_R(rt, rn), ctx);
+		emit(ARM_SBCS_R(ARM_IP, rd, rm), ctx);
+		break;
+	}
+}
+
+static int out_offset = -1; /* initialized on the first pass of build_body() */
+static int emit_bpf_tail_call(struct jit_ctx *ctx)
 {
-	if (!(ctx->seen & SEEN_X))
-		ctx->flags |= FLAG_NEED_X_RESET;
 
-	ctx->seen |= SEEN_X;
+	/* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
+	const u8 *r2 = bpf2a32[BPF_REG_2];
+	const u8 *r3 = bpf2a32[BPF_REG_3];
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	const u8 *tcc = bpf2a32[TCALL_CNT];
+	const int idx0 = ctx->idx;
+#define cur_offset (ctx->idx - idx0)
+#define jmp_offset (out_offset - (cur_offset))
+	u32 off, lo, hi;
+
+	/* if (index >= array->map.max_entries)
+	 *	goto out;
+	 */
+	off = offsetof(struct bpf_array, map.max_entries);
+	/* array->map.max_entries */
+	emit_a32_mov_i(tmp[1], off, false, ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
+	/* index (only the low 32 bits are compared) */
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
+	/* index >= array->map.max_entries */
+	emit(ARM_CMP_R(tmp2[1], tmp[1]), ctx);
+	_emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
+
+	/* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 *	goto out;
+	 * tail_call_cnt++;
+	 */
+	lo = (u32)MAX_TAIL_CALL_CNT;
+	hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
+	emit(ARM_LDR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
+	emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
+	emit(ARM_CMP_I(tmp[0], hi), ctx);
+	_emit(ARM_COND_EQ, ARM_CMP_I(tmp[1], lo), ctx);
+	_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
+	emit(ARM_ADDS_I(tmp[1], tmp[1], 1), ctx);
+	emit(ARM_ADC_I(tmp[0], tmp[0], 0), ctx);
+	emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
+	emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
+
+	/* prog = array->ptrs[index]
+	 * if (prog == NULL)
+	 *	goto out;
+	 */
+	off = offsetof(struct bpf_array, ptrs);
+	emit_a32_mov_i(tmp[1], off, false, ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
+	emit(ARM_ADD_R(tmp[1], tmp2[1], tmp[1]), ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
+	emit(ARM_MOV_SI(tmp[0], tmp2[1], SRTYPE_ASL, 2), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp[1], tmp[0]), ctx);
+	emit(ARM_CMP_I(tmp[1], 0), ctx);
+	_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+
+	/* goto *(prog->bpf_func + prologue_size); */
+	off = offsetof(struct bpf_prog, bpf_func);
+	emit_a32_mov_i(tmp2[1], off, false, ctx);
+	emit(ARM_LDR_R(tmp[1], tmp[1], tmp2[1]), ctx);
+	emit(ARM_ADD_I(tmp[1], tmp[1], ctx->prologue_bytes), ctx);
+	emit(ARM_BX(tmp[1]), ctx);
+
+	/* out: */
+	if (out_offset == -1)
+		out_offset = cur_offset;
+	if (cur_offset != out_offset) {
+		pr_err_once("tail_call out_offset = %d, expected %d!\n",
+			    cur_offset, out_offset);
+		return -1;
+	}
+	return 0;
+#undef cur_offset
+#undef jmp_offset
 }
 
-static int build_body(struct jit_ctx *ctx)
+/* 0xabcd => 0xcdab */
+static inline void emit_rev16(const u8 rd, const u8 rn, struct jit_ctx *ctx)
 {
-	void *load_func[] = {jit_get_skb_b, jit_get_skb_h, jit_get_skb_w};
-	const struct bpf_prog *prog = ctx->skf;
-	const struct sock_filter *inst;
-	unsigned i, load_order, off, condt;
-	int imm12;
-	u32 k;
+#if __LINUX_ARM_ARCH__ < 6
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 8), ctx);
+	emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
+	emit(ARM_ORR_SI(rd, tmp2[0], tmp2[1], SRTYPE_LSL, 8), ctx);
+#else /* ARMv6+ */
+	emit(ARM_REV16(rd, rn), ctx);
+#endif
+}
 
-	for (i = 0; i < prog->len; i++) {
-		u16 code;
+/* 0xabcdefgh => 0xghefcdab */
+static inline void emit_rev32(const u8 rd, const u8 rn, struct jit_ctx *ctx)
+{
+#if __LINUX_ARM_ARCH__ < 6
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 24), ctx);
+	emit(ARM_ORR_SI(ARM_IP, tmp2[0], tmp2[1], SRTYPE_LSL, 24), ctx);
+
+	emit(ARM_MOV_SI(tmp2[1], rn, SRTYPE_LSR, 8), ctx);
+	emit(ARM_AND_I(tmp2[1], tmp2[1], 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 16), ctx);
+	emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], tmp2[0], SRTYPE_LSL, 8), ctx);
+	emit(ARM_ORR_SI(tmp2[0], tmp2[0], tmp2[1], SRTYPE_LSL, 16), ctx);
+	emit(ARM_ORR_R(rd, ARM_IP, tmp2[0]), ctx);
+
+#else /* ARMv6+ */
+	emit(ARM_REV(rd, rn), ctx);
+#endif
+}
+
+/* Push the scratch stack register on top of the stack. */
+static inline void emit_push_r64(const u8 src[], const u8 shift,
+		struct jit_ctx *ctx)
+{
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	u16 reg_set = 0;
 
-		inst = &(prog->insns[i]);
-		/* K as an immediate value operand */
-		k = inst->k;
-		code = bpf_anc_helper(inst);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(src[1]+shift)), ctx);
+	emit(ARM_LDR_I(tmp2[0], ARM_SP, STACK_VAR(src[0]+shift)), ctx);
 
-		/* compute offsets only in the fake pass */
-		if (ctx->target == NULL)
-			ctx->offsets[i] = ctx->idx * 4;
+	reg_set = (1 << tmp2[1]) | (1 << tmp2[0]);
+	emit(ARM_PUSH(reg_set), ctx);
+}
+
+static void build_prologue(struct jit_ctx *ctx)
+{
+	const u8 r0 = bpf2a32[BPF_REG_0][1];
+	const u8 r2 = bpf2a32[BPF_REG_1][1];
+	const u8 r3 = bpf2a32[BPF_REG_1][0];
+	const u8 r4 = bpf2a32[BPF_REG_6][1];
+	const u8 r5 = bpf2a32[BPF_REG_6][0];
+	const u8 r6 = bpf2a32[TMP_REG_1][1];
+	const u8 r7 = bpf2a32[TMP_REG_1][0];
+	const u8 r8 = bpf2a32[TMP_REG_2][1];
+	const u8 r10 = bpf2a32[TMP_REG_2][0];
+	const u8 fplo = bpf2a32[BPF_REG_FP][1];
+	const u8 fphi = bpf2a32[BPF_REG_FP][0];
+	const u8 sp = ARM_SP;
+	const u8 *tcc = bpf2a32[TCALL_CNT];
+
+	u16 reg_set = 0;
+
+	/*
+	 * eBPF prog stack layout
+	 *
+	 *                         high
+	 * original ARM_SP =>     +-----+ eBPF prologue
+	 *                        |FP/LR|
+	 * current ARM_FP =>      +-----+
+	 *                        | ... | callee saved registers
+	 * eBPF fp register =>    +-----+ <= (BPF_FP)
+	 *                        | ... | eBPF JIT scratch space
+	 *                        |     | eBPF prog stack
+	 *                        +-----+
+	 *                        |RSVD | JIT scratchpad
+	 * current ARM_SP =>      +-----+ <= (BPF_FP - STACK_SIZE)
+	 *                        |     |
+	 *                        | ... | Function call stack
+	 *                        |     |
+	 *                        +-----+
+	 *                          low
+	 */
+
+	/* Save callee saved registers. */
+	reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
+#ifdef CONFIG_FRAME_POINTER
+	reg_set |= (1<<ARM_FP) | (1<<ARM_IP) | (1<<ARM_LR) | (1<<ARM_PC);
+	emit(ARM_MOV_R(ARM_IP, sp), ctx);
+	emit(ARM_PUSH(reg_set), ctx);
+	emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
+#else
+	/* Check if call instruction exists in BPF body */
+	if (ctx->seen & SEEN_CALL)
+		reg_set |= (1<<ARM_LR);
+	emit(ARM_PUSH(reg_set), ctx);
+#endif
+	/* Save frame pointer for later */
+	emit(ARM_SUB_I(ARM_IP, sp, SCRATCH_SIZE), ctx);
+
+	/* Set up function call stack */
+	emit(ARM_SUB_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
+
+	/* Set up BPF prog stack base register */
+	emit_a32_mov_r(fplo, ARM_IP, true, false, ctx);
+	emit_a32_mov_i(fphi, 0, true, ctx);
+
+	/* mov r4, 0 */
+	emit(ARM_MOV_I(r4, 0), ctx);
+
+	/* Move BPF_CTX to BPF_R1 */
+	emit(ARM_MOV_R(r3, r4), ctx);
+	emit(ARM_MOV_R(r2, r0), ctx);
+	/* Initialize Tail Count */
+	emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[0])), ctx);
+	emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[1])), ctx);
+	/* end of prologue */
+}
+
+static void build_epilogue(struct jit_ctx *ctx)
+{
+	const u8 r4 = bpf2a32[BPF_REG_6][1];
+	const u8 r5 = bpf2a32[BPF_REG_6][0];
+	const u8 r6 = bpf2a32[TMP_REG_1][1];
+	const u8 r7 = bpf2a32[TMP_REG_1][0];
+	const u8 r8 = bpf2a32[TMP_REG_2][1];
+	const u8 r10 = bpf2a32[TMP_REG_2][0];
+	u16 reg_set = 0;
+
+	/* unwind function call stack */
+	emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
+
+	/* restore callee saved registers. */
+	reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
+#ifdef CONFIG_FRAME_POINTER
+	/* the first instruction of the prologue was: mov ip, sp */
+	reg_set |= (1<<ARM_FP) | (1<<ARM_SP) | (1<<ARM_PC);
+	emit(ARM_LDM(ARM_SP, reg_set), ctx);
+#else
+	if (ctx->seen & SEEN_CALL)
+		reg_set |= (1<<ARM_PC);
+	/* Restore callee saved registers. */
+	emit(ARM_POP(reg_set), ctx);
+	/* Return to the caller */
+	if (!(ctx->seen & SEEN_CALL))
+		emit(ARM_BX(ARM_LR), ctx);
+#endif
+}
 
-		switch (code) {
-		case BPF_LD | BPF_IMM:
-			emit_mov_i(r_A, k, ctx);
+/*
+ * Convert an eBPF instruction to native instructions, i.e.
+ * JIT an eBPF instruction.
+ * Returns :
+ *	0  - Successfully JITed an 8-byte eBPF instruction
+ *	>0 - Successfully JITed a 16-byte eBPF instruction
+ *	<0 - Failed to JIT.
+ */
+static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
+{
+	const u8 code = insn->code;
+	const u8 *dst = bpf2a32[insn->dst_reg];
+	const u8 *src = bpf2a32[insn->src_reg];
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	const s16 off = insn->off;
+	const s32 imm = insn->imm;
+	const int i = insn - ctx->prog->insnsi;
+	const bool is64 = BPF_CLASS(code) == BPF_ALU64;
+	const bool dstk = is_on_stack(insn->dst_reg);
+	const bool sstk = is_on_stack(insn->src_reg);
+	u8 rd, rt, rm, rn;
+	s32 jmp_offset;
+
+#define check_imm(bits, imm) do {				\
+	if ((((imm) > 0) && ((imm) >> (bits))) ||		\
+	    (((imm) < 0) && (~(imm) >> (bits)))) {		\
+		pr_info("[%2d] imm=%d(0x%x) out of range\n",	\
+			i, imm, imm);				\
+		return -EINVAL;					\
+	}							\
+} while (0)
+#define check_imm24(imm) check_imm(24, imm)
+
+	switch (code) {
+	/* ALU operations */
+
+	/* dst = src */
+	case BPF_ALU | BPF_MOV | BPF_K:
+	case BPF_ALU | BPF_MOV | BPF_X:
+	case BPF_ALU64 | BPF_MOV | BPF_K:
+	case BPF_ALU64 | BPF_MOV | BPF_X:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_mov_r64(is64, dst, src, dstk, sstk, ctx);
 			break;
-		case BPF_LD | BPF_W | BPF_LEN:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
-			emit(ARM_LDR_I(r_A, r_skb,
-				       offsetof(struct sk_buff, len)), ctx);
+		case BPF_K:
+			/* Sign-extend immediate value to destination reg */
+			emit_a32_mov_i64(is64, dst, imm, dstk, ctx);
 			break;
-		case BPF_LD | BPF_MEM:
-			/* A = scratch[k] */
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_LDR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
+		}
+		break;
+	/* dst = dst + src/imm */
+	/* dst = dst - src/imm */
+	/* dst = dst | src/imm */
+	/* dst = dst & src/imm */
+	/* dst = dst ^ src/imm */
+	/* dst = dst * src/imm */
+	/* dst = dst << src */
+	/* dst = dst >> src */
+	case BPF_ALU | BPF_ADD | BPF_K:
+	case BPF_ALU | BPF_ADD | BPF_X:
+	case BPF_ALU | BPF_SUB | BPF_K:
+	case BPF_ALU | BPF_SUB | BPF_X:
+	case BPF_ALU | BPF_OR | BPF_K:
+	case BPF_ALU | BPF_OR | BPF_X:
+	case BPF_ALU | BPF_AND | BPF_K:
+	case BPF_ALU | BPF_AND | BPF_X:
+	case BPF_ALU | BPF_XOR | BPF_K:
+	case BPF_ALU | BPF_XOR | BPF_X:
+	case BPF_ALU | BPF_MUL | BPF_K:
+	case BPF_ALU | BPF_MUL | BPF_X:
+	case BPF_ALU | BPF_LSH | BPF_X:
+	case BPF_ALU | BPF_RSH | BPF_X:
+	case BPF_ALU | BPF_ARSH | BPF_K:
+	case BPF_ALU | BPF_ARSH | BPF_X:
+	case BPF_ALU64 | BPF_ADD | BPF_K:
+	case BPF_ALU64 | BPF_ADD | BPF_X:
+	case BPF_ALU64 | BPF_SUB | BPF_K:
+	case BPF_ALU64 | BPF_SUB | BPF_X:
+	case BPF_ALU64 | BPF_OR | BPF_K:
+	case BPF_ALU64 | BPF_OR | BPF_X:
+	case BPF_ALU64 | BPF_AND | BPF_K:
+	case BPF_ALU64 | BPF_AND | BPF_X:
+	case BPF_ALU64 | BPF_XOR | BPF_K:
+	case BPF_ALU64 | BPF_XOR | BPF_X:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_alu_r64(is64, dst, src, dstk, sstk,
+					 ctx, BPF_OP(code));
 			break;
-		case BPF_LD | BPF_W | BPF_ABS:
-			load_order = 2;
-			goto load;
-		case BPF_LD | BPF_H | BPF_ABS:
-			load_order = 1;
-			goto load;
-		case BPF_LD | BPF_B | BPF_ABS:
-			load_order = 0;
-load:
-			emit_mov_i(r_off, k, ctx);
-load_common:
-			ctx->seen |= SEEN_DATA | SEEN_CALL;
-
-			if (load_order > 0) {
-				emit(ARM_SUB_I(r_scratch, r_skb_hl,
-					       1 << load_order), ctx);
-				emit(ARM_CMP_R(r_scratch, r_off), ctx);
-				condt = ARM_COND_GE;
-			} else {
-				emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
-				condt = ARM_COND_HI;
-			}
-
-			/*
-			 * test for negative offset, only if we are
-			 * currently scheduled to take the fast
-			 * path. this will update the flags so that
-			 * the slowpath instruction are ignored if the
-			 * offset is negative.
-			 *
-			 * for loard_order == 0 the HI condition will
-			 * make loads at offset 0 take the slow path too.
+		case BPF_K:
+			/* Move the immediate into a temporary register
+			 * first, which sign-extends it, and then do the
+			 * ALU operation on that register. The register
+			 * form of the operation can then be reused
+			 * safely.
 			 */
-			_emit(condt, ARM_CMP_I(r_off, 0), ctx);
-
-			_emit(condt, ARM_ADD_R(r_scratch, r_off, r_skb_data),
-			      ctx);
-
-			if (load_order == 0)
-				_emit(condt, ARM_LDRB_I(r_A, r_scratch, 0),
-				      ctx);
-			else if (load_order == 1)
-				emit_load_be16(condt, r_A, r_scratch, ctx);
-			else if (load_order == 2)
-				emit_load_be32(condt, r_A, r_scratch, ctx);
-
-			_emit(condt, ARM_B(b_imm(i + 1, ctx)), ctx);
-
-			/* the slowpath */
-			emit_mov_i(ARM_R3, (u32)load_func[load_order], ctx);
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			/* the offset is already in R1 */
-			emit_blx_r(ARM_R3, ctx);
-			/* check the result of skb_copy_bits */
-			emit(ARM_CMP_I(ARM_R1, 0), ctx);
-			emit_err_ret(ARM_COND_NE, ctx);
-			emit(ARM_MOV_R(r_A, ARM_R0), ctx);
+			emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
+			emit_a32_alu_r64(is64, dst, tmp2, dstk, false,
+					 ctx, BPF_OP(code));
 			break;
-		case BPF_LD | BPF_W | BPF_IND:
-			load_order = 2;
-			goto load_ind;
-		case BPF_LD | BPF_H | BPF_IND:
-			load_order = 1;
-			goto load_ind;
-		case BPF_LD | BPF_B | BPF_IND:
-			load_order = 0;
-load_ind:
-			update_on_xread(ctx);
-			OP_IMM3(ARM_ADD, r_off, r_X, k, ctx);
-			goto load_common;
-		case BPF_LDX | BPF_IMM:
-			ctx->seen |= SEEN_X;
-			emit_mov_i(r_X, k, ctx);
+		}
+		break;
+	/* dst = dst / src(imm) */
+	/* dst = dst % src(imm) */
+	case BPF_ALU | BPF_DIV | BPF_K:
+	case BPF_ALU | BPF_DIV | BPF_X:
+	case BPF_ALU | BPF_MOD | BPF_K:
+	case BPF_ALU | BPF_MOD | BPF_X:
+		rt = src_lo;
+		rd = dstk ? tmp2[1] : dst_lo;
+		if (dstk)
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			rt = sstk ? tmp2[0] : rt;
+			if (sstk)
+				emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)),
+				     ctx);
 			break;
-		case BPF_LDX | BPF_W | BPF_LEN:
-			ctx->seen |= SEEN_X | SEEN_SKB;
-			emit(ARM_LDR_I(r_X, r_skb,
-				       offsetof(struct sk_buff, len)), ctx);
+		case BPF_K:
+			rt = tmp2[0];
+			emit_a32_mov_i(rt, imm, false, ctx);
 			break;
-		case BPF_LDX | BPF_MEM:
-			ctx->seen |= SEEN_X | SEEN_MEM_WORD(k);
-			emit(ARM_LDR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
+		}
+		emit_udivmod(rd, rd, rt, ctx, BPF_OP(code));
+		if (dstk)
+			emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	case BPF_ALU64 | BPF_DIV | BPF_K:
+	case BPF_ALU64 | BPF_DIV | BPF_X:
+	case BPF_ALU64 | BPF_MOD | BPF_K:
+	case BPF_ALU64 | BPF_MOD | BPF_X:
+		goto notyet;
+	/* dst = dst >> imm */
+	/* dst = dst << imm */
+	case BPF_ALU | BPF_RSH | BPF_K:
+	case BPF_ALU | BPF_LSH | BPF_K:
+		if (unlikely(imm > 31))
+			return -EINVAL;
+		if (imm)
+			emit_a32_alu_i(dst_lo, imm, dstk, ctx, BPF_OP(code));
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	/* dst = dst << imm */
+	case BPF_ALU64 | BPF_LSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_lsh_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = dst >> imm */
+	case BPF_ALU64 | BPF_RSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_lsr_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = dst << src */
+	case BPF_ALU64 | BPF_LSH | BPF_X:
+		emit_a32_lsh_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> src */
+	case BPF_ALU64 | BPF_RSH | BPF_X:
+		emit_a32_lsr_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> src (signed) */
+	case BPF_ALU64 | BPF_ARSH | BPF_X:
+		emit_a32_arsh_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> imm (signed) */
+	case BPF_ALU64 | BPF_ARSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_arsh_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = -dst */
+	case BPF_ALU | BPF_NEG:
+		emit_a32_alu_i(dst_lo, 0, dstk, ctx, BPF_OP(code));
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	/* dst = -dst (64 bit) */
+	case BPF_ALU64 | BPF_NEG:
+		emit_a32_neg64(dst, dstk, ctx);
+		break;
+	/* dst = dst * src/imm */
+	case BPF_ALU64 | BPF_MUL | BPF_X:
+	case BPF_ALU64 | BPF_MUL | BPF_K:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_mul_r64(dst, src, dstk, sstk, ctx);
 			break;
-		case BPF_LDX | BPF_B | BPF_MSH:
-			/* x = ((*(frame + k)) & 0xf) << 2; */
-			ctx->seen |= SEEN_X | SEEN_DATA | SEEN_CALL;
-			/* the interpreter should deal with the negative K */
-			if ((int)k < 0)
-				return -1;
-			/* offset in r1: we might have to take the slow path */
-			emit_mov_i(r_off, k, ctx);
-			emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
-
-			/* load in r0: common with the slowpath */
-			_emit(ARM_COND_HI, ARM_LDRB_R(ARM_R0, r_skb_data,
-						      ARM_R1), ctx);
-			/*
-			 * emit_mov_i() might generate one or two instructions,
-			 * the same holds for emit_blx_r()
+		case BPF_K:
+			/* Move the immediate into a temporary register
+			 * first, which sign-extends it, and then do the
+			 * multiplication on that register. The register
+			 * form of the multiplication can then be reused
+			 * safely.
 			 */
-			_emit(ARM_COND_HI, ARM_B(b_imm(i + 1, ctx) - 2), ctx);
-
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			/* r_off is r1 */
-			emit_mov_i(ARM_R3, (u32)jit_get_skb_b, ctx);
-			emit_blx_r(ARM_R3, ctx);
-			/* check the return value of skb_copy_bits */
-			emit(ARM_CMP_I(ARM_R1, 0), ctx);
-			emit_err_ret(ARM_COND_NE, ctx);
-
-			emit(ARM_AND_I(r_X, ARM_R0, 0x00f), ctx);
-			emit(ARM_LSL_I(r_X, r_X, 2), ctx);
-			break;
-		case BPF_ST:
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_STR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
-			break;
-		case BPF_STX:
-			update_on_xread(ctx);
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_STR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
-			break;
-		case BPF_ALU | BPF_ADD | BPF_K:
-			/* A += K */
-			OP_IMM3(ARM_ADD, r_A, r_A, k, ctx);
-			break;
-		case BPF_ALU | BPF_ADD | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_ADD_R(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_SUB | BPF_K:
-			/* A -= K */
-			OP_IMM3(ARM_SUB, r_A, r_A, k, ctx);
-			break;
-		case BPF_ALU | BPF_SUB | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_SUB_R(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_MUL | BPF_K:
-			/* A *= K */
-			emit_mov_i(r_scratch, k, ctx);
-			emit(ARM_MUL(r_A, r_A, r_scratch), ctx);
+			emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
+			emit_a32_mul_r64(dst, tmp2, dstk, false, ctx);
 			break;
-		case BPF_ALU | BPF_MUL | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_MUL(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_DIV | BPF_K:
-			if (k == 1)
-				break;
-			emit_mov_i(r_scratch, k, ctx);
-			emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_DIV);
-			break;
-		case BPF_ALU | BPF_DIV | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_CMP_I(r_X, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-			emit_udivmod(r_A, r_A, r_X, ctx, BPF_DIV);
-			break;
-		case BPF_ALU | BPF_MOD | BPF_K:
-			if (k == 1) {
-				emit_mov_i(r_A, 0, ctx);
-				break;
-			}
-			emit_mov_i(r_scratch, k, ctx);
-			emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_MOD);
-			break;
-		case BPF_ALU | BPF_MOD | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_CMP_I(r_X, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-			emit_udivmod(r_A, r_A, r_X, ctx, BPF_MOD);
-			break;
-		case BPF_ALU | BPF_OR | BPF_K:
-			/* A |= K */
-			OP_IMM3(ARM_ORR, r_A, r_A, k, ctx);
+		}
+		break;
+	/* dst = htole(dst) */
+	/* dst = htobe(dst) */
+	case BPF_ALU | BPF_END | BPF_FROM_LE:
+	case BPF_ALU | BPF_END | BPF_FROM_BE:
+		rd = dstk ? tmp[0] : dst_hi;
+		rt = dstk ? tmp[1] : dst_lo;
+		if (dstk) {
+			emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+		if (BPF_SRC(code) == BPF_FROM_LE)
+			goto emit_bswap_uxt;
+		switch (imm) {
+		case 16:
+			emit_rev16(rt, rt, ctx);
+			goto emit_bswap_uxt;
+		case 32:
+			emit_rev32(rt, rt, ctx);
+			goto emit_bswap_uxt;
+		case 64:
+			/* Because of the usage of ARM_LR */
+			ctx->seen |= SEEN_CALL;
+			emit_rev32(ARM_LR, rt, ctx);
+			emit_rev32(rt, rd, ctx);
+			emit(ARM_MOV_R(rd, ARM_LR), ctx);
 			break;
-		case BPF_ALU | BPF_OR | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_ORR_R(r_A, r_A, r_X), ctx);
+		}
+		goto exit;
+emit_bswap_uxt:
+		switch (imm) {
+		case 16:
+			/* zero-extend 16 bits into 64 bits */
+#if __LINUX_ARM_ARCH__ < 6
+			emit_a32_mov_i(tmp2[1], 0xffff, false, ctx);
+			emit(ARM_AND_R(rt, rt, tmp2[1]), ctx);
+#else /* ARMv6+ */
+			emit(ARM_UXTH(rt, rt), ctx);
+#endif
+			emit(ARM_EOR_R(rd, rd, rd), ctx);
 			break;
-		case BPF_ALU | BPF_XOR | BPF_K:
-			/* A ^= K; */
-			OP_IMM3(ARM_EOR, r_A, r_A, k, ctx);
+		case 32:
+			/* zero-extend 32 bits into 64 bits */
+			emit(ARM_EOR_R(rd, rd, rd), ctx);
 			break;
-		case BPF_ANC | SKF_AD_ALU_XOR_X:
-		case BPF_ALU | BPF_XOR | BPF_X:
-			/* A ^= X */
-			update_on_xread(ctx);
-			emit(ARM_EOR_R(r_A, r_A, r_X), ctx);
+		case 64:
+			/* nop */
 			break;
-		case BPF_ALU | BPF_AND | BPF_K:
-			/* A &= K */
-			OP_IMM3(ARM_AND, r_A, r_A, k, ctx);
+		}
+exit:
+		if (dstk) {
+			emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+		break;
+	/* dst = imm64 */
+	case BPF_LD | BPF_IMM | BPF_DW:
+	{
+		const struct bpf_insn insn1 = insn[1];
+		u32 hi, lo = imm;
+
+		hi = insn1.imm;
+		emit_a32_mov_i(dst_lo, lo, dstk, ctx);
+		emit_a32_mov_i(dst_hi, hi, dstk, ctx);
+
+		return 1;
+	}
+	/* LDX: dst = *(size *)(src + off) */
+	case BPF_LDX | BPF_MEM | BPF_W:
+	case BPF_LDX | BPF_MEM | BPF_H:
+	case BPF_LDX | BPF_MEM | BPF_B:
+	case BPF_LDX | BPF_MEM | BPF_DW:
+		rn = sstk ? tmp2[1] : src_lo;
+		if (sstk)
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			/* Load a Word */
+		case BPF_H:
+			/* Load a Half-Word */
+		case BPF_B:
+			/* Load a Byte */
+			emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_SIZE(code));
+			emit_a32_mov_i(dst_hi, 0, dstk, ctx);
 			break;
-		case BPF_ALU | BPF_AND | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_AND_R(r_A, r_A, r_X), ctx);
+		case BPF_DW:
+			/* Load a double word */
+			emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_W);
+			emit_ldx_r(dst_hi, rn, dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_ALU | BPF_LSH | BPF_K:
-			if (unlikely(k > 31))
-				return -1;
-			emit(ARM_LSL_I(r_A, r_A, k), ctx);
+		}
+		break;
+	/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
+	case BPF_LD | BPF_ABS | BPF_W:
+	case BPF_LD | BPF_ABS | BPF_H:
+	case BPF_LD | BPF_ABS | BPF_B:
+	/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
+	case BPF_LD | BPF_IND | BPF_W:
+	case BPF_LD | BPF_IND | BPF_H:
+	case BPF_LD | BPF_IND | BPF_B:
+	{
+		const u8 r4 = bpf2a32[BPF_REG_6][1]; /* r4 = ptr to sk_buff */
+		const u8 r0 = bpf2a32[BPF_REG_0][1]; /*r0: struct sk_buff *skb*/
+						     /* rtn value */
+		const u8 r1 = bpf2a32[BPF_REG_0][0]; /* r1: int k */
+		const u8 r2 = bpf2a32[BPF_REG_1][1]; /* r2: unsigned int size */
+		const u8 r3 = bpf2a32[BPF_REG_1][0]; /* r3: void *buffer */
+		const u8 r6 = bpf2a32[TMP_REG_1][1]; /* r6: void *(*func)(..) */
+		int size;
+
+		/* Setting up first argument */
+		emit(ARM_MOV_R(r0, r4), ctx);
+
+		/* Setting up second argument */
+		emit_a32_mov_i(r1, imm, false, ctx);
+		if (BPF_MODE(code) == BPF_IND)
+			emit_a32_alu_r(r1, src_lo, false, sstk, ctx,
+				       false, false, BPF_ADD);
+
+		/* Setting up third argument */
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			size = 4;
 			break;
-		case BPF_ALU | BPF_LSH | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_LSL_R(r_A, r_A, r_X), ctx);
+		case BPF_H:
+			size = 2;
 			break;
-		case BPF_ALU | BPF_RSH | BPF_K:
-			if (unlikely(k > 31))
-				return -1;
-			if (k)
-				emit(ARM_LSR_I(r_A, r_A, k), ctx);
+		case BPF_B:
+			size = 1;
 			break;
-		case BPF_ALU | BPF_RSH | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_LSR_R(r_A, r_A, r_X), ctx);
+		default:
+			return -EINVAL;
+		}
+		emit_a32_mov_i(r2, size, false, ctx);
+
+		/* Setting up fourth argument */
+		emit(ARM_ADD_I(r3, ARM_SP, imm8m(SKB_BUFFER)), ctx);
+
+		/* Setting up function pointer to call */
+		emit_a32_mov_i(r6, (unsigned int)bpf_load_pointer, false, ctx);
+		emit_blx_r(r6, ctx);
+
+		emit(ARM_EOR_R(r1, r1, r1), ctx);
+		/* Check whether the returned pointer is NULL.
+		 * If it is NULL, jump to the epilogue;
+		 * otherwise load the value from the returned address.
+		 */
+		emit(ARM_CMP_I(r0, 0), ctx);
+		jmp_offset = epilogue_offset(ctx);
+		check_imm24(jmp_offset);
+		_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+
+		/* Load value from the address */
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			emit(ARM_LDR_I(r0, r0, 0), ctx);
+			emit_rev32(r0, r0, ctx);
 			break;
-		case BPF_ALU | BPF_NEG:
-			/* A = -A */
-			emit(ARM_RSB_I(r_A, r_A, 0), ctx);
+		case BPF_H:
+			emit(ARM_LDRH_I(r0, r0, 0), ctx);
+			emit_rev16(r0, r0, ctx);
 			break;
-		case BPF_JMP | BPF_JA:
-			/* pc += K */
-			emit(ARM_B(b_imm(i + k + 1, ctx)), ctx);
+		case BPF_B:
+			emit(ARM_LDRB_I(r0, r0, 0), ctx);
+			/* No need to reverse */
 			break;
-		case BPF_JMP | BPF_JEQ | BPF_K:
-			/* pc += (A == K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_EQ;
-			goto cmp_imm;
-		case BPF_JMP | BPF_JGT | BPF_K:
-			/* pc += (A > K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_HI;
-			goto cmp_imm;
-		case BPF_JMP | BPF_JGE | BPF_K:
-			/* pc += (A >= K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_HS;
-cmp_imm:
-			imm12 = imm8m(k);
-			if (imm12 < 0) {
-				emit_mov_i_no8m(r_scratch, k, ctx);
-				emit(ARM_CMP_R(r_A, r_scratch), ctx);
-			} else {
-				emit(ARM_CMP_I(r_A, imm12), ctx);
-			}
-cond_jump:
-			if (inst->jt)
-				_emit(condt, ARM_B(b_imm(i + inst->jt + 1,
-						   ctx)), ctx);
-			if (inst->jf)
-				_emit(condt ^ 1, ARM_B(b_imm(i + inst->jf + 1,
-							     ctx)), ctx);
+		}
+		break;
+	}
+	/* ST: *(size *)(dst + off) = imm */
+	case BPF_ST | BPF_MEM | BPF_W:
+	case BPF_ST | BPF_MEM | BPF_H:
+	case BPF_ST | BPF_MEM | BPF_B:
+	case BPF_ST | BPF_MEM | BPF_DW:
+		switch (BPF_SIZE(code)) {
+		case BPF_DW:
+			/* Sign-extend immediate value into temp reg */
+			emit_a32_mov_i64(true, tmp2, imm, false, ctx);
+			emit_str_r(dst_lo, tmp2[1], dstk, off, ctx, BPF_W);
+			emit_str_r(dst_lo, tmp2[0], dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_JMP | BPF_JEQ | BPF_X:
-			/* pc += (A == X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_EQ;
-			goto cmp_x;
-		case BPF_JMP | BPF_JGT | BPF_X:
-			/* pc += (A > X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_HI;
-			goto cmp_x;
-		case BPF_JMP | BPF_JGE | BPF_X:
-			/* pc += (A >= X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_CS;
-cmp_x:
-			update_on_xread(ctx);
-			emit(ARM_CMP_R(r_A, r_X), ctx);
-			goto cond_jump;
-		case BPF_JMP | BPF_JSET | BPF_K:
-			/* pc += (A & K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_NE;
-			/* not set iff all zeroes iff Z==1 iff EQ */
-
-			imm12 = imm8m(k);
-			if (imm12 < 0) {
-				emit_mov_i_no8m(r_scratch, k, ctx);
-				emit(ARM_TST_R(r_A, r_scratch), ctx);
-			} else {
-				emit(ARM_TST_I(r_A, imm12), ctx);
-			}
-			goto cond_jump;
-		case BPF_JMP | BPF_JSET | BPF_X:
-			/* pc += (A & X) ? pc->jt : pc->jf */
-			update_on_xread(ctx);
-			condt  = ARM_COND_NE;
-			emit(ARM_TST_R(r_A, r_X), ctx);
-			goto cond_jump;
-		case BPF_RET | BPF_A:
-			emit(ARM_MOV_R(ARM_R0, r_A), ctx);
-			goto b_epilogue;
-		case BPF_RET | BPF_K:
-			if ((k == 0) && (ctx->ret0_fp_idx < 0))
-				ctx->ret0_fp_idx = i;
-			emit_mov_i(ARM_R0, k, ctx);
-b_epilogue:
-			if (i != ctx->skf->len - 1)
-				emit(ARM_B(b_imm(prog->len, ctx)), ctx);
+		case BPF_W:
+		case BPF_H:
+		case BPF_B:
+			emit_a32_mov_i(tmp2[1], imm, false, ctx);
+			emit_str_r(dst_lo, tmp2[1], dstk, off, ctx,
+				   BPF_SIZE(code));
 			break;
-		case BPF_MISC | BPF_TAX:
-			/* X = A */
-			ctx->seen |= SEEN_X;
-			emit(ARM_MOV_R(r_X, r_A), ctx);
+		}
+		break;
+	/* STX XADD: lock *(u32 *)(dst + off) += src */
+	case BPF_STX | BPF_XADD | BPF_W:
+	/* STX XADD: lock *(u64 *)(dst + off) += src */
+	case BPF_STX | BPF_XADD | BPF_DW:
+		goto notyet;
+	/* STX: *(size *)(dst + off) = src */
+	case BPF_STX | BPF_MEM | BPF_W:
+	case BPF_STX | BPF_MEM | BPF_H:
+	case BPF_STX | BPF_MEM | BPF_B:
+	case BPF_STX | BPF_MEM | BPF_DW:
+	{
+		u8 sz = BPF_SIZE(code);
+
+		rn = sstk ? tmp2[1] : src_lo;
+		rm = sstk ? tmp2[0] : src_hi;
+		if (!sstk)
+			goto do_store;
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+		case BPF_H:
+			emit(ARM_LDRH_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+		case BPF_B:
+			emit(ARM_LDRB_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+empty_hi:
+			emit(ARM_EOR_R(rm, rm, rm), ctx);
+		case BPF_DW:
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
+			sz = BPF_W;
 			break;
-		case BPF_MISC | BPF_TXA:
-			/* A = X */
-			update_on_xread(ctx);
-			emit(ARM_MOV_R(r_A, r_X), ctx);
+		}
+
+do_store:
+		/* Clear higher word except for BPF_DW */
+		if (BPF_SIZE(code) != BPF_DW)
+			emit(ARM_EOR_R(rm, rm, rm), ctx);
+
+		/* Store the value */
+		emit_str_r(dst_lo, rn, dstk, off, ctx, sz);
+		emit_str_r(dst_lo, rm, dstk, off+4, ctx, BPF_W);
+		break;
+	}
+	/* PC += off if dst == src */
+	/* PC += off if dst > src */
+	/* PC += off if dst >= src */
+	/* PC += off if dst != src */
+	/* PC += off if dst > src (signed) */
+	/* PC += off if dst >= src (signed) */
+	/* PC += off if dst & src */
+	case BPF_JMP | BPF_JEQ | BPF_X:
+	case BPF_JMP | BPF_JGT | BPF_X:
+	case BPF_JMP | BPF_JGE | BPF_X:
+	case BPF_JMP | BPF_JNE | BPF_X:
+	case BPF_JMP | BPF_JSGT | BPF_X:
+	case BPF_JMP | BPF_JSGE | BPF_X:
+	case BPF_JMP | BPF_JSET | BPF_X:
+		/* Setup source registers */
+		rm = sstk ? tmp2[0] : src_hi;
+		rn = sstk ? tmp2[1] : src_lo;
+		if (sstk) {
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
+		}
+		goto go_jmp;
+	/* PC += off if dst == imm */
+	/* PC += off if dst > imm */
+	/* PC += off if dst >= imm */
+	/* PC += off if dst != imm */
+	/* PC += off if dst > imm (signed) */
+	/* PC += off if dst >= imm (signed) */
+	/* PC += off if dst & imm */
+	case BPF_JMP | BPF_JEQ | BPF_K:
+	case BPF_JMP | BPF_JGT | BPF_K:
+	case BPF_JMP | BPF_JGE | BPF_K:
+	case BPF_JMP | BPF_JNE | BPF_K:
+	case BPF_JMP | BPF_JSGT | BPF_K:
+	case BPF_JMP | BPF_JSGE | BPF_K:
+	case BPF_JMP | BPF_JSET | BPF_K:
+		if (off == 0)
 			break;
-		case BPF_ANC | SKF_AD_PROTOCOL:
-			/* A = ntohs(skb->protocol) */
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  protocol) != 2);
-			off = offsetof(struct sk_buff, protocol);
-			emit(ARM_LDRH_I(r_scratch, r_skb, off), ctx);
-			emit_swap16(r_A, r_scratch, ctx);
+		rm = tmp2[0];
+		rn = tmp2[1];
+		/* Sign-extend immediate value */
+		emit_a32_mov_i64(true, tmp2, imm, false, ctx);
+go_jmp:
+		/* Setup destination register */
+		rd = dstk ? tmp[0] : dst_hi;
+		rt = dstk ? tmp[1] : dst_lo;
+		if (dstk) {
+			emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+
+		/* Check for the condition */
+		emit_ar_r(rd, rt, rm, rn, ctx, BPF_OP(code));
+
+		/* Setup JUMP instruction */
+		jmp_offset = bpf2a32_offset(i+off, i, ctx);
+		switch (BPF_OP(code)) {
+		case BPF_JNE:
+		case BPF_JSET:
+			_emit(ARM_COND_NE, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_CPU:
-			/* r_scratch = current_thread_info() */
-			OP_IMM3(ARM_BIC, r_scratch, ARM_SP, THREAD_SIZE - 1, ctx);
-			/* A = current_thread_info()->cpu */
-			BUILD_BUG_ON(FIELD_SIZEOF(struct thread_info, cpu) != 4);
-			off = offsetof(struct thread_info, cpu);
-			emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
+		case BPF_JEQ:
+			_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_IFINDEX:
-		case BPF_ANC | SKF_AD_HATYPE:
-			/* A = skb->dev->ifindex */
-			/* A = skb->dev->type */
-			ctx->seen |= SEEN_SKB;
-			off = offsetof(struct sk_buff, dev);
-			emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
-
-			emit(ARM_CMP_I(r_scratch, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-
-			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
-						  ifindex) != 4);
-			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
-						  type) != 2);
-
-			if (code == (BPF_ANC | SKF_AD_IFINDEX)) {
-				off = offsetof(struct net_device, ifindex);
-				emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
-			} else {
-				/*
-				 * offset of field "type" in "struct
-				 * net_device" is above what can be
-				 * used in the ldrh rd, [rn, #imm]
-				 * instruction, so load the offset in
-				 * a register and use ldrh rd, [rn, rm]
-				 */
-				off = offsetof(struct net_device, type);
-				emit_mov_i(ARM_R3, off, ctx);
-				emit(ARM_LDRH_R(r_A, r_scratch, ARM_R3), ctx);
-			}
+		case BPF_JGT:
+			_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_MARK:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
-			off = offsetof(struct sk_buff, mark);
-			emit(ARM_LDR_I(r_A, r_skb, off), ctx);
+		case BPF_JGE:
+			_emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_RXHASH:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, hash) != 4);
-			off = offsetof(struct sk_buff, hash);
-			emit(ARM_LDR_I(r_A, r_skb, off), ctx);
+		case BPF_JSGT:
+			_emit(ARM_COND_LT, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_VLAN_TAG:
-		case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2);
-			off = offsetof(struct sk_buff, vlan_tci);
-			emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
-			if (code == (BPF_ANC | SKF_AD_VLAN_TAG))
-				OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx);
-			else {
-				OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx);
-				OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
-			}
+		case BPF_JSGE:
+			_emit(ARM_COND_GE, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_PKTTYPE:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  __pkt_type_offset[0]) != 1);
-			off = PKT_TYPE_OFFSET();
-			emit(ARM_LDRB_I(r_A, r_skb, off), ctx);
-			emit(ARM_AND_I(r_A, r_A, PKT_TYPE_MAX), ctx);
-#ifdef __BIG_ENDIAN_BITFIELD
-			emit(ARM_LSR_I(r_A, r_A, 5), ctx);
-#endif
+		}
+		break;
+	/* JMP OFF */
+	case BPF_JMP | BPF_JA:
+	{
+		if (off == 0)
 			break;
-		case BPF_ANC | SKF_AD_QUEUE:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  queue_mapping) != 2);
-			BUILD_BUG_ON(offsetof(struct sk_buff,
-					      queue_mapping) > 0xff);
-			off = offsetof(struct sk_buff, queue_mapping);
-			emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
+		jmp_offset = bpf2a32_offset(i+off, i, ctx);
+		check_imm24(jmp_offset);
+		emit(ARM_B(jmp_offset), ctx);
+		break;
+	}
+	/* tail call */
+	case BPF_JMP | BPF_CALL | BPF_X:
+		if (emit_bpf_tail_call(ctx))
+			return -EFAULT;
+		break;
+	/* function call */
+	case BPF_JMP | BPF_CALL:
+	{
+		const u8 *r0 = bpf2a32[BPF_REG_0];
+		const u8 *r1 = bpf2a32[BPF_REG_1];
+		const u8 *r2 = bpf2a32[BPF_REG_2];
+		const u8 *r3 = bpf2a32[BPF_REG_3];
+		const u8 *r4 = bpf2a32[BPF_REG_4];
+		const u8 *r5 = bpf2a32[BPF_REG_5];
+		const u32 func = (u32)__bpf_call_base + imm;
+
+		emit_a32_mov_r64(true, r0, r1, false, false, ctx);
+		emit_a32_mov_r64(true, r1, r2, false, true, ctx);
+		emit_push_r64(r5, 0, ctx);
+		emit_push_r64(r4, 8, ctx);
+		emit_push_r64(r3, 16, ctx);
+
+		emit_a32_mov_i(tmp[1], func, false, ctx);
+		emit_blx_r(tmp[1], ctx);
+
+		emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(24)), ctx); /* pop stack args */
+		break;
+	}
+	/* function return */
+	case BPF_JMP | BPF_EXIT:
+		/* Optimization: when last instruction is EXIT
+		 * simply fallthrough to epilogue.
+		 */
+		if (i == ctx->prog->len - 1)
 			break;
-		case BPF_ANC | SKF_AD_PAY_OFFSET:
-			ctx->seen |= SEEN_SKB | SEEN_CALL;
+		jmp_offset = epilogue_offset(ctx);
+		check_imm24(jmp_offset);
+		emit(ARM_B(jmp_offset), ctx);
+		break;
+notyet:
+		pr_info_once("*** NOT YET: opcode %02x ***\n", code);
+		return -EFAULT;
+	default:
+		pr_err_once("unknown opcode %02x\n", code);
+		return -EINVAL;
+	}
 
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			emit_mov_i(ARM_R3, (unsigned int)skb_get_poff, ctx);
-			emit_blx_r(ARM_R3, ctx);
-			emit(ARM_MOV_R(r_A, ARM_R0), ctx);
-			break;
-		case BPF_LDX | BPF_W | BPF_ABS:
-			/*
-			 * load a 32bit word from struct seccomp_data.
-			 * seccomp_check_filter() will already have checked
-			 * that k is 32bit aligned and lies within the
-			 * struct seccomp_data.
-			 */
-			ctx->seen |= SEEN_SKB;
-			emit(ARM_LDR_I(r_A, r_skb, k), ctx);
-			break;
-		default:
-			return -1;
+	if (ctx->flags & FLAG_IMM_OVERFLOW)
+		/*
+		 * this instruction generated an overflow when
+		 * trying to access the literal pool, so
+		 * delegate this filter to the kernel interpreter.
+		 */
+		return -1;
+	return 0;
+}
+
+static int build_body(struct jit_ctx *ctx)
+{
+	const struct bpf_prog *prog = ctx->prog;
+	unsigned int i;
+
+	for (i = 0; i < prog->len; i++) {
+		const struct bpf_insn *insn = &(prog->insnsi[i]);
+		int ret;
+
+		ret = build_insn(insn, ctx);
+
+		/* A 64-bit immediate load spans two insns; skip the second. */
+		if (ret > 0) {
+			i++;
+			if (ctx->target == NULL)
+				ctx->offsets[i] = ctx->idx;
+			continue;
 		}
 
-		if (ctx->flags & FLAG_IMM_OVERFLOW)
-			/*
-			 * this instruction generated an overflow when
-			 * trying to access the literal pool, so
-			 * delegate this filter to the kernel interpreter.
-			 */
-			return -1;
+		if (ctx->target == NULL)
+			ctx->offsets[i] = ctx->idx;
+
+		/* If unsuccessful, return with error code */
+		if (ret)
+			return ret;
 	}
+	return 0;
+}
 
-	/* compute offsets only during the first pass */
-	if (ctx->target == NULL)
-		ctx->offsets[i] = ctx->idx * 4;
+static int validate_code(struct jit_ctx *ctx)
+{
+	int i;
+
+	for (i = 0; i < ctx->idx; i++) {
+		if (ctx->target[i] == __opcode_to_mem_arm(ARM_INST_UDF))
+			return -1;
+	}
 
 	return 0;
 }
 
+void bpf_jit_compile(struct bpf_prog *prog)
+{
+	/* Nothing to do here. We support Internal BPF. */
+}
 
-void bpf_jit_compile(struct bpf_prog *fp)
+struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
+#ifdef __LITTLE_ENDIAN
+	struct bpf_prog *tmp, *orig_prog = prog;
 	struct bpf_binary_header *header;
+	bool tmp_blinded = false;
 	struct jit_ctx ctx;
-	unsigned tmp_idx;
-	unsigned alloc_size;
-	u8 *target_ptr;
+	unsigned int tmp_idx;
+	unsigned int image_size;
+	u8 *image_ptr;
 
+	/* If BPF JIT was not enabled then we must fall back to
+	 * the interpreter.
+	 */
 	if (!bpf_jit_enable)
-		return;
+		return orig_prog;
 
-	memset(&ctx, 0, sizeof(ctx));
-	ctx.skf		= fp;
-	ctx.ret0_fp_idx = -1;
+	/* If constant blinding was enabled and we failed during blinding
+	 * then we must fall back to the interpreter. Otherwise, we save
+	 * the new JITed code.
+	 */
+	tmp = bpf_jit_blind_constants(prog);
 
-	ctx.offsets = kzalloc(4 * (ctx.skf->len + 1), GFP_KERNEL);
-	if (ctx.offsets == NULL)
-		return;
+	if (IS_ERR(tmp))
+		return orig_prog;
+	if (tmp != prog) {
+		tmp_blinded = true;
+		prog = tmp;
+	}
 
-	/* fake pass to fill in the ctx->seen */
-	if (unlikely(build_body(&ctx)))
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.prog = prog;
+
+	/* If we cannot allocate memory for offsets[], we must
+	 * fall back to the interpreter.
+	 */
+	ctx.offsets = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
+	if (ctx.offsets == NULL) {
+		prog = orig_prog;
 		goto out;
+	}
+
+	/* 1) fake pass to find the length of the JITed code,
+	 * to compute ctx->offsets and other context variables
+	 * needed to compute final JITed code.
+	 * Also, calculate random starting pointer/start of JITed code
+	 * which is prefixed by random number of fault instructions.
+	 *
+	 * If the first pass fails then there is no chance of it
+	 * being successful in the second pass, so just fall back
+	 * to the interpreter.
+	 */
+	if (build_body(&ctx)) {
+		prog = orig_prog;
+		goto out_off;
+	}
 
 	tmp_idx = ctx.idx;
 	build_prologue(&ctx);
 	ctx.prologue_bytes = (ctx.idx - tmp_idx) * 4;
 
+	ctx.epilogue_offset = ctx.idx;
+
 #if __LINUX_ARM_ARCH__ < 7
 	tmp_idx = ctx.idx;
 	build_epilogue(&ctx);
@@ -1021,64 +1863,96 @@ void bpf_jit_compile(struct bpf_prog *fp)
 
 	ctx.idx += ctx.imm_count;
 	if (ctx.imm_count) {
-		ctx.imms = kzalloc(4 * ctx.imm_count, GFP_KERNEL);
-		if (ctx.imms == NULL)
-			goto out;
+		ctx.imms = kcalloc(ctx.imm_count, sizeof(u32), GFP_KERNEL);
+		if (ctx.imms == NULL) {
+			prog = orig_prog;
+			goto out_off;
+		}
 	}
 #else
-	/* there's nothing after the epilogue on ARMv7 */
+	/* nothing follows the epilogue on ARMv7 */
 	build_epilogue(&ctx);
 #endif
-	alloc_size = 4 * ctx.idx;
-	header = bpf_jit_binary_alloc(alloc_size, &target_ptr,
-				      4, jit_fill_hole);
-	if (header == NULL)
-		goto out;
+	/* Now we can get the actual image size of the JITed arm code.
+	 * Currently, we are not considering the THUMB-2 instructions
+	 * for jit, although it can decrease the size of the image.
+	 *
+	 * As each ARM instruction is 32 bits long, we translate the
+	 * number of JITed instructions into the size required to store
+	 * the JITed code.
+	 */
+	image_size = sizeof(u32) * ctx.idx;
+
+	/* Now we know the size of the structure to make */
+	header = bpf_jit_binary_alloc(image_size, &image_ptr,
+				      sizeof(u32), jit_fill_hole);
+	/* If we cannot allocate memory for the structure,
+	 * we must fall back to the interpreter.
+	 */
+	if (header == NULL) {
+		prog = orig_prog;
+		goto out_imms;
+	}
 
-	ctx.target = (u32 *) target_ptr;
+	/* 2.) Actual pass to generate final JIT code */
+	ctx.target = (u32 *) image_ptr;
 	ctx.idx = 0;
 
 	build_prologue(&ctx);
+
+	/* If building the body of the JITed code fails somehow,
+	 * we fall back to the interpreter.
+	 */
 	if (build_body(&ctx) < 0) {
-#if __LINUX_ARM_ARCH__ < 7
-		if (ctx.imm_count)
-			kfree(ctx.imms);
-#endif
+		image_ptr = NULL;
 		bpf_jit_binary_free(header);
-		goto out;
+		prog = orig_prog;
+		goto out_imms;
 	}
 	build_epilogue(&ctx);
 
+	/* 3.) Extra pass to validate JITed Code */
+	if (validate_code(&ctx)) {
+		image_ptr = NULL;
+		bpf_jit_binary_free(header);
+		prog = orig_prog;
+		goto out_imms;
+	}
 	flush_icache_range((u32)header, (u32)(ctx.target + ctx.idx));
 
-#if __LINUX_ARM_ARCH__ < 7
-	if (ctx.imm_count)
-		kfree(ctx.imms);
-#endif
-
 	if (bpf_jit_enable > 1)
 		/* there are 2 passes here */
-		bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
+		bpf_jit_dump(prog->len, image_size, 2, ctx.target);
 
 	set_memory_ro((unsigned long)header, header->pages);
-	fp->bpf_func = (void *)ctx.target;
-	fp->jited = 1;
-out:
+	prog->bpf_func = (void *)ctx.target;
+	prog->jited = 1;
+out_imms:
+#if __LINUX_ARM_ARCH__ < 7
+	if (ctx.imm_count)
+		kfree(ctx.imms);
+#endif
+out_off:
 	kfree(ctx.offsets);
-	return;
+out:
+	if (tmp_blinded)
+		bpf_jit_prog_release_other(prog, prog == orig_prog ?
+					   tmp : orig_prog);
+#endif /* __LITTLE_ENDIAN */
+	return prog;
 }
 
-void bpf_jit_free(struct bpf_prog *fp)
+void bpf_jit_free(struct bpf_prog *prog)
 {
-	unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
+	unsigned long addr = (unsigned long)prog->bpf_func & PAGE_MASK;
 	struct bpf_binary_header *header = (void *)addr;
 
-	if (!fp->jited)
+	if (!prog->jited)
 		goto free_filter;
 
 	set_memory_rw(addr, header->pages);
 	bpf_jit_binary_free(header);
 
 free_filter:
-	bpf_prog_unlock_free(fp);
+	bpf_prog_unlock_free(prog);
 }
diff --git a/arch/arm/net/bpf_jit_32.h b/arch/arm/net/bpf_jit_32.h
index c46fca2..d5cf5f6 100644
--- a/arch/arm/net/bpf_jit_32.h
+++ b/arch/arm/net/bpf_jit_32.h
@@ -11,6 +11,7 @@
 #ifndef PFILTER_OPCODES_ARM_H
 #define PFILTER_OPCODES_ARM_H
 
+/* ARM 32bit Registers */
 #define ARM_R0	0
 #define ARM_R1	1
 #define ARM_R2	2
@@ -22,38 +23,43 @@
 #define ARM_R8	8
 #define ARM_R9	9
 #define ARM_R10	10
-#define ARM_FP	11
-#define ARM_IP	12
-#define ARM_SP	13
-#define ARM_LR	14
-#define ARM_PC	15
-
-#define ARM_COND_EQ		0x0
-#define ARM_COND_NE		0x1
-#define ARM_COND_CS		0x2
+#define ARM_FP	11	/* Frame Pointer */
+#define ARM_IP	12	/* Intra-procedure scratch register */
+#define ARM_SP	13	/* Stack pointer: as load/store base reg */
+#define ARM_LR	14	/* Link Register */
+#define ARM_PC	15	/* Program counter */
+
+#define ARM_COND_EQ		0x0	/* == */
+#define ARM_COND_NE		0x1	/* != */
+#define ARM_COND_CS		0x2	/* unsigned >= */
 #define ARM_COND_HS		ARM_COND_CS
-#define ARM_COND_CC		0x3
+#define ARM_COND_CC		0x3	/* unsigned < */
 #define ARM_COND_LO		ARM_COND_CC
-#define ARM_COND_MI		0x4
-#define ARM_COND_PL		0x5
-#define ARM_COND_VS		0x6
-#define ARM_COND_VC		0x7
-#define ARM_COND_HI		0x8
-#define ARM_COND_LS		0x9
-#define ARM_COND_GE		0xa
-#define ARM_COND_LT		0xb
-#define ARM_COND_GT		0xc
-#define ARM_COND_LE		0xd
-#define ARM_COND_AL		0xe
+#define ARM_COND_MI		0x4	/* < 0 */
+#define ARM_COND_PL		0x5	/* >= 0 */
+#define ARM_COND_VS		0x6	/* Signed Overflow */
+#define ARM_COND_VC		0x7	/* No Signed Overflow */
+#define ARM_COND_HI		0x8	/* unsigned > */
+#define ARM_COND_LS		0x9	/* unsigned <= */
+#define ARM_COND_GE		0xa	/* Signed >= */
+#define ARM_COND_LT		0xb	/* Signed < */
+#define ARM_COND_GT		0xc	/* Signed > */
+#define ARM_COND_LE		0xd	/* Signed <= */
+#define ARM_COND_AL		0xe	/* None */
 
 /* register shift types */
 #define SRTYPE_LSL		0
 #define SRTYPE_LSR		1
 #define SRTYPE_ASR		2
 #define SRTYPE_ROR		3
+#define SRTYPE_ASL		(SRTYPE_LSL)
 
 #define ARM_INST_ADD_R		0x00800000
+#define ARM_INST_ADDS_R		0x00900000
+#define ARM_INST_ADC_R		0x00a00000
+#define ARM_INST_ADC_I		0x02a00000
 #define ARM_INST_ADD_I		0x02800000
+#define ARM_INST_ADDS_I		0x02900000
 
 #define ARM_INST_AND_R		0x00000000
 #define ARM_INST_AND_I		0x02000000
@@ -76,8 +82,10 @@
 #define ARM_INST_LDRH_I		0x01d000b0
 #define ARM_INST_LDRH_R		0x019000b0
 #define ARM_INST_LDR_I		0x05900000
+#define ARM_INST_LDR_R		0x07900000
 
 #define ARM_INST_LDM		0x08900000
+#define ARM_INST_LDM_IA		0x08b00000
 
 #define ARM_INST_LSL_I		0x01a00000
 #define ARM_INST_LSL_R		0x01a00010
@@ -86,6 +94,7 @@
 #define ARM_INST_LSR_R		0x01a00030
 
 #define ARM_INST_MOV_R		0x01a00000
+#define ARM_INST_MOVS_R		0x01b00000
 #define ARM_INST_MOV_I		0x03a00000
 #define ARM_INST_MOVW		0x03000000
 #define ARM_INST_MOVT		0x03400000
@@ -96,17 +105,28 @@
 #define ARM_INST_PUSH		0x092d0000
 
 #define ARM_INST_ORR_R		0x01800000
+#define ARM_INST_ORRS_R		0x01900000
 #define ARM_INST_ORR_I		0x03800000
 
 #define ARM_INST_REV		0x06bf0f30
 #define ARM_INST_REV16		0x06bf0fb0
 
 #define ARM_INST_RSB_I		0x02600000
+#define ARM_INST_RSBS_I		0x02700000
+#define ARM_INST_RSC_I		0x02e00000
 
 #define ARM_INST_SUB_R		0x00400000
+#define ARM_INST_SUBS_R		0x00500000
+#define ARM_INST_RSB_R		0x00600000
 #define ARM_INST_SUB_I		0x02400000
+#define ARM_INST_SUBS_I		0x02500000
+#define ARM_INST_SBC_I		0x02c00000
+#define ARM_INST_SBC_R		0x00c00000
+#define ARM_INST_SBCS_R		0x00d00000
 
 #define ARM_INST_STR_I		0x05800000
+#define ARM_INST_STRB_I		0x05c00000
+#define ARM_INST_STRH_I		0x01c000b0
 
 #define ARM_INST_TST_R		0x01100000
 #define ARM_INST_TST_I		0x03100000
@@ -117,6 +137,8 @@
 
 #define ARM_INST_MLS		0x00600090
 
+#define ARM_INST_UXTH		0x06ff0070
+
 /*
  * Use a suitable undefined instruction to use for ARM/Thumb2 faulting.
  * We need to be careful not to conflict with those used by other modules
@@ -135,9 +157,15 @@
 #define _AL3_R(op, rd, rn, rm)	((op ## _R) | (rd) << 12 | (rn) << 16 | (rm))
 /* immediate */
 #define _AL3_I(op, rd, rn, imm)	((op ## _I) | (rd) << 12 | (rn) << 16 | (imm))
+/* register with register-shift */
+#define _AL3_SR(inst)	(inst | (1 << 4))
 
 #define ARM_ADD_R(rd, rn, rm)	_AL3_R(ARM_INST_ADD, rd, rn, rm)
+#define ARM_ADDS_R(rd, rn, rm)	_AL3_R(ARM_INST_ADDS, rd, rn, rm)
 #define ARM_ADD_I(rd, rn, imm)	_AL3_I(ARM_INST_ADD, rd, rn, imm)
+#define ARM_ADDS_I(rd, rn, imm)	_AL3_I(ARM_INST_ADDS, rd, rn, imm)
+#define ARM_ADC_R(rd, rn, rm)	_AL3_R(ARM_INST_ADC, rd, rn, rm)
+#define ARM_ADC_I(rd, rn, imm)	_AL3_I(ARM_INST_ADC, rd, rn, imm)
 
 #define ARM_AND_R(rd, rn, rm)	_AL3_R(ARM_INST_AND, rd, rn, rm)
 #define ARM_AND_I(rd, rn, imm)	_AL3_I(ARM_INST_AND, rd, rn, imm)
@@ -156,7 +184,9 @@
 #define ARM_EOR_I(rd, rn, imm)	_AL3_I(ARM_INST_EOR, rd, rn, imm)
 
 #define ARM_LDR_I(rt, rn, off)	(ARM_INST_LDR_I | (rt) << 12 | (rn) << 16 \
-				 | (off))
+				 | ((off) & 0xfff))
+#define ARM_LDR_R(rt, rn, rm)	(ARM_INST_LDR_R | (rt) << 12 | (rn) << 16 \
+				 | (rm))
 #define ARM_LDRB_I(rt, rn, off)	(ARM_INST_LDRB_I | (rt) << 12 | (rn) << 16 \
 				 | (off))
 #define ARM_LDRB_R(rt, rn, rm)	(ARM_INST_LDRB_R | (rt) << 12 | (rn) << 16 \
@@ -167,15 +197,23 @@
 				 | (rm))
 
 #define ARM_LDM(rn, regs)	(ARM_INST_LDM | (rn) << 16 | (regs))
+#define ARM_LDM_IA(rn, regs)	(ARM_INST_LDM_IA | (rn) << 16 | (regs))
 
 #define ARM_LSL_R(rd, rn, rm)	(_AL3_R(ARM_INST_LSL, rd, 0, rn) | (rm) << 8)
 #define ARM_LSL_I(rd, rn, imm)	(_AL3_I(ARM_INST_LSL, rd, 0, rn) | (imm) << 7)
 
 #define ARM_LSR_R(rd, rn, rm)	(_AL3_R(ARM_INST_LSR, rd, 0, rn) | (rm) << 8)
 #define ARM_LSR_I(rd, rn, imm)	(_AL3_I(ARM_INST_LSR, rd, 0, rn) | (imm) << 7)
+#define ARM_ASR_R(rd, rn, rm)   (_AL3_R(ARM_INST_ASR, rd, 0, rn) | (rm) << 8)
+#define ARM_ASR_I(rd, rn, imm)  (_AL3_I(ARM_INST_ASR, rd, 0, rn) | (imm) << 7)
 
 #define ARM_MOV_R(rd, rm)	_AL3_R(ARM_INST_MOV, rd, 0, rm)
+#define ARM_MOVS_R(rd, rm)	_AL3_R(ARM_INST_MOVS, rd, 0, rm)
 #define ARM_MOV_I(rd, imm)	_AL3_I(ARM_INST_MOV, rd, 0, imm)
+#define ARM_MOV_SR(rd, rm, type, rs)	\
+	(_AL3_SR(ARM_MOV_R(rd, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_MOV_SI(rd, rm, type, imm6)	\
+	(ARM_MOV_R(rd, rm) | (type) << 5 | (imm6) << 7)
 
 #define ARM_MOVW(rd, imm)	\
 	(ARM_INST_MOVW | ((imm) >> 12) << 16 | (rd) << 12 | ((imm) & 0x0fff))
@@ -190,19 +228,38 @@
 
 #define ARM_ORR_R(rd, rn, rm)	_AL3_R(ARM_INST_ORR, rd, rn, rm)
 #define ARM_ORR_I(rd, rn, imm)	_AL3_I(ARM_INST_ORR, rd, rn, imm)
-#define ARM_ORR_S(rd, rn, rm, type, rs)	\
-	(ARM_ORR_R(rd, rn, rm) | (type) << 5 | (rs) << 7)
+#define ARM_ORR_SR(rd, rn, rm, type, rs)	\
+	(_AL3_SR(ARM_ORR_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_ORRS_R(rd, rn, rm)	_AL3_R(ARM_INST_ORRS, rd, rn, rm)
+#define ARM_ORRS_SR(rd, rn, rm, type, rs)	\
+	(_AL3_SR(ARM_ORRS_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_ORR_SI(rd, rn, rm, type, imm6)	\
+	(ARM_ORR_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
+#define ARM_ORRS_SI(rd, rn, rm, type, imm6)	\
+	(ARM_ORRS_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
 
 #define ARM_REV(rd, rm)		(ARM_INST_REV | (rd) << 12 | (rm))
 #define ARM_REV16(rd, rm)	(ARM_INST_REV16 | (rd) << 12 | (rm))
 
 #define ARM_RSB_I(rd, rn, imm)	_AL3_I(ARM_INST_RSB, rd, rn, imm)
+#define ARM_RSBS_I(rd, rn, imm)	_AL3_I(ARM_INST_RSBS, rd, rn, imm)
+#define ARM_RSC_I(rd, rn, imm)	_AL3_I(ARM_INST_RSC, rd, rn, imm)
 
 #define ARM_SUB_R(rd, rn, rm)	_AL3_R(ARM_INST_SUB, rd, rn, rm)
+#define ARM_SUBS_R(rd, rn, rm)	_AL3_R(ARM_INST_SUBS, rd, rn, rm)
+#define ARM_RSB_R(rd, rn, rm)	_AL3_R(ARM_INST_RSB, rd, rn, rm)
+#define ARM_SBC_R(rd, rn, rm)	_AL3_R(ARM_INST_SBC, rd, rn, rm)
+#define ARM_SBCS_R(rd, rn, rm)	_AL3_R(ARM_INST_SBCS, rd, rn, rm)
 #define ARM_SUB_I(rd, rn, imm)	_AL3_I(ARM_INST_SUB, rd, rn, imm)
+#define ARM_SUBS_I(rd, rn, imm)	_AL3_I(ARM_INST_SUBS, rd, rn, imm)
+#define ARM_SBC_I(rd, rn, imm)	_AL3_I(ARM_INST_SBC, rd, rn, imm)
 
 #define ARM_STR_I(rt, rn, off)	(ARM_INST_STR_I | (rt) << 12 | (rn) << 16 \
-				 | (off))
+				 | ((off) & 0xfff))
+#define ARM_STRH_I(rt, rn, off)	(ARM_INST_STRH_I | (rt) << 12 | (rn) << 16 \
+				 | (((off) & 0xf0) << 4) | ((off) & 0xf))
+#define ARM_STRB_I(rt, rn, off)	(ARM_INST_STRB_I | (rt) << 12 | (rn) << 16 \
+				 | (((off) & 0xf0) << 4) | ((off) & 0xf))
 
 #define ARM_TST_R(rn, rm)	_AL3_R(ARM_INST_TST, 0, rn, rm)
 #define ARM_TST_I(rn, imm)	_AL3_I(ARM_INST_TST, 0, rn, imm)
@@ -214,5 +271,6 @@
 
 #define ARM_MLS(rd, rn, rm, ra)	(ARM_INST_MLS | (rd) << 16 | (rn) | (rm) << 8 \
 				 | (ra) << 12)
+#define ARM_UXTH(rd, rm)	(ARM_INST_UXTH | (rd) << 12 | (rm))
 
 #endif /* PFILTER_OPCODES_ARM_H */
-- 
2.7.4



* Re: [PATCH v2] arm: eBPF JIT compiler
@ 2017-06-23 22:39                         ` Shubham Bansal
From: Shubham Bansal @ 2017-06-23 22:39 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Kees Cook, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

[-- Attachment #1: Type: text/plain, Size: 452 bytes --]

Hi Russell, Daniel and Kees,

I am attaching the latest patch to this mail. It includes support for
BPF_JMP | BPF_CALL, tested with and without constant blinding on an
ARMv7 machine.
Due to limitations of my machine, I can't test the tail call. It
would be a great help if any of you could test it for me.

This patch has been in the works for a long time. Russell, could you
please help with sending it to the ARM patch tracker?

Thanks.
Shubham
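
For anyone able to run the tail call path mentioned above: a test would want to hit each bail-out condition as well as the successful jump. The sketch below is a user-space model of the generic eBPF tail call semantics, not the code emitted by emit_bpf_tail_call(). The names `struct prog_array` and `tail_call_target` and the limit of 32 are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_TAIL_CALL_CNT 32	/* assumed limit, for illustration */

/* Hypothetical stand-in for a program array map. */
struct prog_array {
	uint32_t max_entries;
	const void **ptrs;	/* a NULL entry means the slot is empty */
};

/*
 * Returns the program to jump to, or NULL when execution should fall
 * through to the instruction after the tail call.
 */
static const void *tail_call_target(const struct prog_array *array,
				    uint32_t index, uint32_t *tail_call_cnt)
{
	if (index >= array->max_entries)
		return NULL;		/* index out of bounds */
	if (*tail_call_cnt > MODEL_MAX_TAIL_CALL_CNT)
		return NULL;		/* too many chained tail calls */
	(*tail_call_cnt)++;
	return array->ptrs[index];	/* may be NULL: empty slot */
}
```

A test program would therefore need at least four cases: a valid slot, an empty slot, an out-of-range index, and a chain long enough to exceed the counter limit.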

[-- Attachment #2: 0001-Added-Support-for-BPF_CALL-BPF_JMP.patch --]
[-- Type: application/octet-stream, Size: 87846 bytes --]

From 502dd777765a982ce1b479ee01911fa6fe023a76 Mon Sep 17 00:00:00 2001
From: Shubham Bansal <illusionist.neo@gmail.com>
Date: Sat, 24 Jun 2017 04:03:37 +0530
Subject: [PATCH] Added Support for BPF_CALL | BPF_JMP.

---
 arch/arm/Kconfig          |    2 +-
 arch/arm/net/bpf_jit_32.c | 2430 ++++++++++++++++++++++++++++++---------------
 arch/arm/net/bpf_jit_32.h |  108 +-
 3 files changed, 1736 insertions(+), 804 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 4c1a35f..53bf116 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -48,7 +48,7 @@ config ARM
 	select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
-	select HAVE_CBPF_JIT
+	select HAVE_EBPF_JIT
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CONTEXT_TRACKING
 	select HAVE_C_RECORDMCOUNT
diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index d5b9fa1..8b8ddc4 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -1,13 +1,15 @@
 /*
- * Just-In-Time compiler for BPF filters on 32bit ARM
+ * Just-In-Time compiler for eBPF filters on 32bit ARM
  *
  * Copyright (c) 2011 Mircea Gherzan <mgherzan@gmail.com>
+ * Copyright (c) 2017 Shubham Bansal <illusionist.neo@gmail.com>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License as published by the
  * Free Software Foundation; version 2 of the License.
  */
 
+#include <linux/bpf.h>
 #include <linux/bitops.h>
 #include <linux/compiler.h>
 #include <linux/errno.h>
@@ -18,50 +20,96 @@
 #include <linux/if_vlan.h>
 
 #include <asm/cacheflush.h>
-#include <asm/set_memory.h>
 #include <asm/hwcap.h>
 #include <asm/opcodes.h>
 
 #include "bpf_jit_32.h"
 
+int bpf_jit_enable __read_mostly;
+
+#define STACK_OFFSET(k)	(k)
+#define TMP_REG_1	(MAX_BPF_JIT_REG + 0)	/* TEMP Register 1 */
+#define TMP_REG_2	(MAX_BPF_JIT_REG + 1)	/* TEMP Register 2 */
+#define TCALL_CNT	(MAX_BPF_JIT_REG + 2)	/* Tail Call Count */
+
+/* Flags used for JIT optimization */
+#define SEEN_CALL	(1 << 0)
+
+#define FLAG_IMM_OVERFLOW	(1 << 0)
+
 /*
- * ABI:
+ * Map eBPF registers to ARM 32bit registers or stack scratch space.
+ *
+ * 1. The first argument is passed in ARM 32 bit registers and the rest of the
+ * arguments are passed in stack scratch space.
+ * 2. The first callee-saved register is mapped to ARM 32 bit registers and the
+ * rest are mapped to scratch space on the stack.
+ * 3. We need two 64 bit temp registers to do complex operations on eBPF
+ * registers.
+ *
+ * As the eBPF registers are all 64 bit registers and ARM has only 32 bit
+ * registers, we have to map each eBPF register to two ARM 32 bit regs or
+ * scratch memory space and build the eBPF 64 bit register from those.
  *
- * r0	scratch register
- * r4	BPF register A
- * r5	BPF register X
- * r6	pointer to the skb
- * r7	skb->data
- * r8	skb_headlen(skb)
  */
+static const u8 bpf2a32[][2] = {
+	/* return value from in-kernel function, and exit value from eBPF */
+	[BPF_REG_0] = {ARM_R1, ARM_R0},
+	/* arguments from eBPF program to in-kernel function */
+	[BPF_REG_1] = {ARM_R3, ARM_R2},
+	/* Stored on stack scratch space */
+	[BPF_REG_2] = {STACK_OFFSET(0), STACK_OFFSET(4)},
+	[BPF_REG_3] = {STACK_OFFSET(8), STACK_OFFSET(12)},
+	[BPF_REG_4] = {STACK_OFFSET(16), STACK_OFFSET(20)},
+	[BPF_REG_5] = {STACK_OFFSET(24), STACK_OFFSET(28)},
+	/* callee saved registers that in-kernel function will preserve */
+	[BPF_REG_6] = {ARM_R5, ARM_R4},
+	/* Stored on stack scratch space */
+	[BPF_REG_7] = {STACK_OFFSET(32), STACK_OFFSET(36)},
+	[BPF_REG_8] = {STACK_OFFSET(40), STACK_OFFSET(44)},
+	[BPF_REG_9] = {STACK_OFFSET(48), STACK_OFFSET(52)},
+	/* Read only Frame Pointer to access Stack */
+	[BPF_REG_FP] = {STACK_OFFSET(56), STACK_OFFSET(60)},
+	/* Temporary Register for internal BPF JIT, can be used
+	 * for constant blindings and others.
+	 */
+	[TMP_REG_1] = {ARM_R7, ARM_R6},
+	[TMP_REG_2] = {ARM_R10, ARM_R8},
+	/* Tail call count. Stored on stack scratch space. */
+	[TCALL_CNT] = {STACK_OFFSET(64), STACK_OFFSET(68)},
+	/* temporary register for blinding constants.
+	 * Stored on stack scratch space.
+	 */
+	[BPF_REG_AX] = {STACK_OFFSET(72), STACK_OFFSET(76)},
+};
 
-#define r_scratch	ARM_R0
-/* r1-r3 are (also) used for the unaligned loads on the non-ARMv7 slowpath */
-#define r_off		ARM_R1
-#define r_A		ARM_R4
-#define r_X		ARM_R5
-#define r_skb		ARM_R6
-#define r_skb_data	ARM_R7
-#define r_skb_hl	ARM_R8
-
-#define SCRATCH_SP_OFFSET	0
-#define SCRATCH_OFF(k)		(SCRATCH_SP_OFFSET + 4 * (k))
-
-#define SEEN_MEM		((1 << BPF_MEMWORDS) - 1)
-#define SEEN_MEM_WORD(k)	(1 << (k))
-#define SEEN_X			(1 << BPF_MEMWORDS)
-#define SEEN_CALL		(1 << (BPF_MEMWORDS + 1))
-#define SEEN_SKB		(1 << (BPF_MEMWORDS + 2))
-#define SEEN_DATA		(1 << (BPF_MEMWORDS + 3))
+#define	dst_lo	dst[1]
+#define dst_hi	dst[0]
+#define src_lo	src[1]
+#define src_hi	src[0]
 
-#define FLAG_NEED_X_RESET	(1 << 0)
-#define FLAG_IMM_OVERFLOW	(1 << 1)
+/*
+ * JIT Context:
+ *
+ * prog			:	bpf_prog
+ * idx			:	index of current last JITed instruction.
+ * prologue_bytes	:	bytes used in prologue.
+ * epilogue_offset	:	offset of epilogue starting.
+ * seen			:	bit mask used for JIT optimization.
+ * offsets		:	array of eBPF instruction offsets in
+ *				JITed code.
+ * target		:	final JITed code.
+ * epilogue_bytes	:	no of bytes used in epilogue.
+ * imm_count		:	no of immediate counts used for global
+ *				variables.
+ * imms			:	array of global variable addresses.
+ */
 
 struct jit_ctx {
-	const struct bpf_prog *skf;
-	unsigned idx;
-	unsigned prologue_bytes;
-	int ret0_fp_idx;
+	const struct bpf_prog *prog;
+	unsigned int idx;
+	unsigned int prologue_bytes;
+	unsigned int epilogue_offset;
 	u32 seen;
 	u32 flags;
 	u32 *offsets;
@@ -73,68 +121,16 @@ struct jit_ctx {
 #endif
 };
 
-int bpf_jit_enable __read_mostly;
-
-static inline int call_neg_helper(struct sk_buff *skb, int offset, void *ret,
-		      unsigned int size)
-{
-	void *ptr = bpf_internal_load_pointer_neg_helper(skb, offset, size);
-
-	if (!ptr)
-		return -EFAULT;
-	memcpy(ret, ptr, size);
-	return 0;
-}
-
-static u64 jit_get_skb_b(struct sk_buff *skb, int offset)
-{
-	u8 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 1);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 1);
-
-	return (u64)err << 32 | ret;
-}
-
-static u64 jit_get_skb_h(struct sk_buff *skb, int offset)
-{
-	u16 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 2);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 2);
-
-	return (u64)err << 32 | ntohs(ret);
-}
-
-static u64 jit_get_skb_w(struct sk_buff *skb, int offset)
-{
-	u32 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 4);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 4);
-
-	return (u64)err << 32 | ntohl(ret);
-}
-
 /*
  * Wrappers which handle both OABI and EABI and assures Thumb2 interworking
  * (where the assembly routines like __aeabi_uidiv could cause problems).
  */
-static u32 jit_udiv(u32 dividend, u32 divisor)
+static u32 jit_udiv32(u32 dividend, u32 divisor)
 {
 	return dividend / divisor;
 }
 
-static u32 jit_mod(u32 dividend, u32 divisor)
+static u32 jit_mod32(u32 dividend, u32 divisor)
 {
 	return dividend % divisor;
 }
@@ -158,36 +154,22 @@ static inline void emit(u32 inst, struct jit_ctx *ctx)
 	_emit(ARM_COND_AL, inst, ctx);
 }
 
-static u16 saved_regs(struct jit_ctx *ctx)
+/*
+ * Checks if immediate value can be converted to imm12(12 bits) value.
+ */
+static int16_t imm8m(u32 x)
 {
-	u16 ret = 0;
-
-	if ((ctx->skf->len > 1) ||
-	    (ctx->skf->insns[0].code == (BPF_RET | BPF_A)))
-		ret |= 1 << r_A;
-
-#ifdef CONFIG_FRAME_POINTER
-	ret |= (1 << ARM_FP) | (1 << ARM_IP) | (1 << ARM_LR) | (1 << ARM_PC);
-#else
-	if (ctx->seen & SEEN_CALL)
-		ret |= 1 << ARM_LR;
-#endif
-	if (ctx->seen & (SEEN_DATA | SEEN_SKB))
-		ret |= 1 << r_skb;
-	if (ctx->seen & SEEN_DATA)
-		ret |= (1 << r_skb_data) | (1 << r_skb_hl);
-	if (ctx->seen & SEEN_X)
-		ret |= 1 << r_X;
-
-	return ret;
-}
+	u32 rot;
 
-static inline int mem_words_used(struct jit_ctx *ctx)
-{
-	/* yes, we do waste some stack space IF there are "holes" in the set" */
-	return fls(ctx->seen & SEEN_MEM);
+	for (rot = 0; rot < 16; rot++)
+		if ((x & ~ror32(0xff, 2 * rot)) == 0)
+			return rol32(x, 2 * rot) | (rot << 8);
+	return -1;
 }
 
+/*
+ * Initializes the JIT space with undefined instructions.
+ */
 static void jit_fill_hole(void *area, unsigned int size)
 {
 	u32 *ptr;
@@ -196,88 +178,34 @@ static void jit_fill_hole(void *area, unsigned int size)
 		*ptr++ = __opcode_to_mem_arm(ARM_INST_UDF);
 }
 
-static void build_prologue(struct jit_ctx *ctx)
-{
-	u16 reg_set = saved_regs(ctx);
-	u16 off;
-
-#ifdef CONFIG_FRAME_POINTER
-	emit(ARM_MOV_R(ARM_IP, ARM_SP), ctx);
-	emit(ARM_PUSH(reg_set), ctx);
-	emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
-#else
-	if (reg_set)
-		emit(ARM_PUSH(reg_set), ctx);
-#endif
-
-	if (ctx->seen & (SEEN_DATA | SEEN_SKB))
-		emit(ARM_MOV_R(r_skb, ARM_R0), ctx);
-
-	if (ctx->seen & SEEN_DATA) {
-		off = offsetof(struct sk_buff, data);
-		emit(ARM_LDR_I(r_skb_data, r_skb, off), ctx);
-		/* headlen = len - data_len */
-		off = offsetof(struct sk_buff, len);
-		emit(ARM_LDR_I(r_skb_hl, r_skb, off), ctx);
-		off = offsetof(struct sk_buff, data_len);
-		emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
-		emit(ARM_SUB_R(r_skb_hl, r_skb_hl, r_scratch), ctx);
-	}
+/* Stack size must be a multiple of 16 bytes */
+#define STACK_ALIGN(sz) (((sz) + 15) & ~15)
 
-	if (ctx->flags & FLAG_NEED_X_RESET)
-		emit(ARM_MOV_I(r_X, 0), ctx);
-
-	/* do not leak kernel data to userspace */
-	if (bpf_needs_clear_a(&ctx->skf->insns[0]))
-		emit(ARM_MOV_I(r_A, 0), ctx);
-
-	/* stack space for the BPF_MEM words */
-	if (ctx->seen & SEEN_MEM)
-		emit(ARM_SUB_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
-}
-
-static void build_epilogue(struct jit_ctx *ctx)
-{
-	u16 reg_set = saved_regs(ctx);
-
-	if (ctx->seen & SEEN_MEM)
-		emit(ARM_ADD_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
-
-	reg_set &= ~(1 << ARM_LR);
-
-#ifdef CONFIG_FRAME_POINTER
-	/* the first instruction of the prologue was: mov ip, sp */
-	reg_set &= ~(1 << ARM_IP);
-	reg_set |= (1 << ARM_SP);
-	emit(ARM_LDM(ARM_SP, reg_set), ctx);
-#else
-	if (reg_set) {
-		if (ctx->seen & SEEN_CALL)
-			reg_set |= 1 << ARM_PC;
-		emit(ARM_POP(reg_set), ctx);
-	}
+/* Stack space for BPF_REG_2, BPF_REG_3, BPF_REG_4,
+ * BPF_REG_5, BPF_REG_7, BPF_REG_8, BPF_REG_9,
+ * BPF_REG_FP, the tail call count and BPF_REG_AX.
+ */
+#define SCRATCH_SIZE 80
 
-	if (!(ctx->seen & SEEN_CALL))
-		emit(ARM_BX(ARM_LR), ctx);
-#endif
-}
+/* total stack size used in JITed code */
+#define _STACK_SIZE \
+	(MAX_BPF_STACK + \
+	 SCRATCH_SIZE + \
+	 4 /* extra for skb_copy_bits buffer */)
 
-static int16_t imm8m(u32 x)
-{
-	u32 rot;
+#define STACK_SIZE STACK_ALIGN(_STACK_SIZE)
 
-	for (rot = 0; rot < 16; rot++)
-		if ((x & ~ror32(0xff, 2 * rot)) == 0)
-			return rol32(x, 2 * rot) | (rot << 8);
+/* Get the offset of eBPF REGISTERs stored on scratch space. */
+#define STACK_VAR(off) (STACK_SIZE - (off) - 4)
 
-	return -1;
-}
+/* Offset of skb_copy_bits buffer */
+#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
 
 #if __LINUX_ARM_ARCH__ < 7
 
 static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 {
-	unsigned i = 0, offset;
+	unsigned int i = 0, offset;
 	u16 imm;
 
 	/* on the "fake" run we just count them (duplicates included) */
@@ -296,7 +224,7 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 		ctx->imms[i] = k;
 
 	/* constants go just after the epilogue */
-	offset =  ctx->offsets[ctx->skf->len];
+	offset =  ctx->offsets[ctx->prog->len - 1] * 4;
 	offset += ctx->prologue_bytes;
 	offset += ctx->epilogue_bytes;
 	offset += i * 4;
@@ -320,10 +248,22 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 
 #endif /* __LINUX_ARM_ARCH__ */
 
+static inline int bpf2a32_offset(int bpf_to, int bpf_from,
+				 const struct jit_ctx *ctx) {
+	int to, from;
+
+	if (ctx->target == NULL)
+		return 0;
+	to = ctx->offsets[bpf_to];
+	from = ctx->offsets[bpf_from];
+
+	return to - from - 1;
+}
+
 /*
  * Move an immediate that's not an imm8m to a core register.
  */
-static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
+static inline void emit_mov_i_no8m(const u8 rd, u32 val, struct jit_ctx *ctx)
 {
 #if __LINUX_ARM_ARCH__ < 7
 	emit(ARM_LDR_I(rd, ARM_PC, imm_offset(val, ctx)), ctx);
@@ -334,7 +274,7 @@ static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
 #endif
 }
 
-static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
+static inline void emit_mov_i(const u8 rd, u32 val, struct jit_ctx *ctx)
 {
 	int imm12 = imm8m(val);
 
@@ -344,676 +284,1578 @@ static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
 		emit_mov_i_no8m(rd, val, ctx);
 }
 
-#if __LINUX_ARM_ARCH__ < 6
-
-static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
 {
-	_emit(cond, ARM_LDRB_I(ARM_R3, r_addr, 1), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 3), ctx);
-	_emit(cond, ARM_LSL_I(ARM_R3, ARM_R3, 16), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R0, r_addr, 2), ctx);
-	_emit(cond, ARM_ORR_S(ARM_R3, ARM_R3, ARM_R1, SRTYPE_LSL, 24), ctx);
-	_emit(cond, ARM_ORR_R(ARM_R3, ARM_R3, ARM_R2), ctx);
-	_emit(cond, ARM_ORR_S(r_res, ARM_R3, ARM_R0, SRTYPE_LSL, 8), ctx);
+	ctx->seen |= SEEN_CALL;
+#if __LINUX_ARM_ARCH__ < 5
+	emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
+
+	if (elf_hwcap & HWCAP_THUMB)
+		emit(ARM_BX(tgt_reg), ctx);
+	else
+		emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
+#else
+	emit(ARM_BLX_R(tgt_reg), ctx);
+#endif
 }
 
-static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+static inline int epilogue_offset(const struct jit_ctx *ctx)
 {
-	_emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 1), ctx);
-	_emit(cond, ARM_ORR_S(r_res, ARM_R2, ARM_R1, SRTYPE_LSL, 8), ctx);
+	int to, from;
+	/* No need for 1st dummy run */
+	if (ctx->target == NULL)
+		return 0;
+	to = ctx->epilogue_offset;
+	from = ctx->idx;
+
+	return to - from - 2;
 }
 
-static inline void emit_swap16(u8 r_dst, u8 r_src, struct jit_ctx *ctx)
+static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx, u8 op)
 {
-	/* r_dst = (r_src << 8) | (r_src >> 8) */
-	emit(ARM_LSL_I(ARM_R1, r_src, 8), ctx);
-	emit(ARM_ORR_S(r_dst, ARM_R1, r_src, SRTYPE_LSR, 8), ctx);
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	s32 jmp_offset;
+
+	/* checks if divisor is zero or not. If it is, then
+	 * exit directly.
+	 */
+	emit(ARM_CMP_I(rn, 0), ctx);
+	_emit(ARM_COND_EQ, ARM_MOV_I(ARM_R0, 0), ctx);
+	jmp_offset = epilogue_offset(ctx);
+	_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+#if __LINUX_ARM_ARCH__ == 7
+	if (elf_hwcap & HWCAP_IDIVA) {
+		if (op == BPF_DIV)
+			emit(ARM_UDIV(rd, rm, rn), ctx);
+		else {
+			emit(ARM_UDIV(ARM_IP, rm, rn), ctx);
+			emit(ARM_MLS(rd, rn, ARM_IP, rm), ctx);
+		}
+		return;
+	}
+#endif
 
 	/*
-	 * we need to mask out the bits set in r_dst[23:16] due to
-	 * the first shift instruction.
-	 *
-	 * note that 0x8ff is the encoded immediate 0x00ff0000.
+	 * For BPF_ALU | BPF_DIV | BPF_K instructions:
+	 * As ARM_R1 and ARM_R0 contain the 1st argument of the bpf
+	 * function, we need to save them on the caller side to keep
+	 * them from being clobbered by the callee.
+	 * After the return from the callee, we restore ARM_R0 and
+	 * ARM_R1.
 	 */
-	emit(ARM_BIC_I(r_dst, r_dst, 0x8ff), ctx);
-}
+	if (rn != ARM_R1) {
+		emit(ARM_MOV_R(tmp[0], ARM_R1), ctx);
+		emit(ARM_MOV_R(ARM_R1, rn), ctx);
+	}
+	if (rm != ARM_R0) {
+		emit(ARM_MOV_R(tmp[1], ARM_R0), ctx);
+		emit(ARM_MOV_R(ARM_R0, rm), ctx);
+	}
+
+	/* Call appropriate function */
+	ctx->seen |= SEEN_CALL;
+	emit_mov_i(ARM_IP, op == BPF_DIV ?
+		   (u32)jit_udiv32 : (u32)jit_mod32, ctx);
+	emit_blx_r(ARM_IP, ctx);
 
-#else  /* ARMv6+ */
+	/* Save return value */
+	if (rd != ARM_R0)
+		emit(ARM_MOV_R(rd, ARM_R0), ctx);
 
-static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
-{
-	_emit(cond, ARM_LDR_I(r_res, r_addr, 0), ctx);
-#ifdef __LITTLE_ENDIAN
-	_emit(cond, ARM_REV(r_res, r_res), ctx);
-#endif
+	/* Restore ARM_R0 and ARM_R1 */
+	if (rn != ARM_R1)
+		emit(ARM_MOV_R(ARM_R1, tmp[0]), ctx);
+	if (rm != ARM_R0)
+		emit(ARM_MOV_R(ARM_R0, tmp[1]), ctx);
 }
 
-static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+/* Checks whether BPF register is on scratch stack space or not. */
+static inline bool is_on_stack(u8 bpf_reg)
 {
-	_emit(cond, ARM_LDRH_I(r_res, r_addr, 0), ctx);
-#ifdef __LITTLE_ENDIAN
-	_emit(cond, ARM_REV16(r_res, r_res), ctx);
-#endif
+	static u8 stack_regs[] = {BPF_REG_AX, BPF_REG_3, BPF_REG_4, BPF_REG_5,
+				BPF_REG_7, BPF_REG_8, BPF_REG_9, TCALL_CNT,
+				BPF_REG_2, BPF_REG_FP};
+	int i, reg_len = sizeof(stack_regs);
+
+	for (i = 0 ; i < reg_len ; i++) {
+		if (bpf_reg == stack_regs[i])
+			return true;
+	}
+	return false;
 }
 
-static inline void emit_swap16(u8 r_dst __maybe_unused,
-			       u8 r_src __maybe_unused,
-			       struct jit_ctx *ctx __maybe_unused)
+static inline void emit_a32_mov_i(const u8 dst, const u32 val,
+				  bool dstk, struct jit_ctx *ctx)
 {
-#ifdef __LITTLE_ENDIAN
-	emit(ARM_REV16(r_dst, r_src), ctx);
-#endif
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+
+	if (dstk) {
+		emit_mov_i(tmp[1], val, ctx);
+		emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(dst)), ctx);
+	} else {
+		emit_mov_i(dst, val, ctx);
+	}
 }
 
-#endif /* __LINUX_ARM_ARCH__ < 6 */
+/* Sign extended move */
+static inline void emit_a32_mov_i64(const bool is64, const u8 dst[],
+				  const u32 val, bool dstk,
+				  struct jit_ctx *ctx) {
+	u32 hi = 0;
 
+	if (is64 && (val & (1<<31)))
+		hi = (u32)~0;
+	emit_a32_mov_i(dst_lo, val, dstk, ctx);
+	emit_a32_mov_i(dst_hi, hi, dstk, ctx);
+}
 
-/* Compute the immediate value for a PC-relative branch. */
-static inline u32 b_imm(unsigned tgt, struct jit_ctx *ctx)
-{
-	u32 imm;
+static inline void emit_a32_add_r(const u8 dst, const u8 src,
+			      const bool is64, const bool hi,
+			      struct jit_ctx *ctx) {
+	/* 64 bit :
+	 *	adds dst_lo, dst_lo, src_lo
+	 *	adc dst_hi, dst_hi, src_hi
+	 * 32 bit :
+	 *	add dst_lo, dst_lo, src_lo
+	 */
+	if (!hi && is64)
+		emit(ARM_ADDS_R(dst, dst, src), ctx);
+	else if (hi && is64)
+		emit(ARM_ADC_R(dst, dst, src), ctx);
+	else
+		emit(ARM_ADD_R(dst, dst, src), ctx);
+}
 
-	if (ctx->target == NULL)
-		return 0;
-	/*
-	 * BPF allows only forward jumps and the offset of the target is
-	 * still the one computed during the first pass.
+static inline void emit_a32_sub_r(const u8 dst, const u8 src,
+				  const bool is64, const bool hi,
+				  struct jit_ctx *ctx) {
+	/* 64 bit :
+	 *	subs dst_lo, dst_lo, src_lo
+	 *	sbc dst_hi, dst_hi, src_hi
+	 * 32 bit :
+	 *	sub dst_lo, dst_lo, src_lo
 	 */
-	imm  = ctx->offsets[tgt] + ctx->prologue_bytes - (ctx->idx * 4 + 8);
+	if (!hi && is64)
+		emit(ARM_SUBS_R(dst, dst, src), ctx);
+	else if (hi && is64)
+		emit(ARM_SBC_R(dst, dst, src), ctx);
+	else
+		emit(ARM_SUB_R(dst, dst, src), ctx);
+}
 
-	return imm >> 2;
+static inline void emit_alu_r(const u8 dst, const u8 src, const bool is64,
+			      const bool hi, const u8 op, struct jit_ctx *ctx){
+	switch (BPF_OP(op)) {
+	/* dst = dst + src */
+	case BPF_ADD:
+		emit_a32_add_r(dst, src, is64, hi, ctx);
+		break;
+	/* dst = dst - src */
+	case BPF_SUB:
+		emit_a32_sub_r(dst, src, is64, hi, ctx);
+		break;
+	/* dst = dst | src */
+	case BPF_OR:
+		emit(ARM_ORR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst & src */
+	case BPF_AND:
+		emit(ARM_AND_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst ^ src */
+	case BPF_XOR:
+		emit(ARM_EOR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst * src */
+	case BPF_MUL:
+		emit(ARM_MUL(dst, dst, src), ctx);
+		break;
+	/* dst = dst << src */
+	case BPF_LSH:
+		emit(ARM_LSL_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst >> src */
+	case BPF_RSH:
+		emit(ARM_LSR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst >> src (signed)*/
+	case BPF_ARSH:
+		emit(ARM_MOV_SR(dst, dst, SRTYPE_ASR, src), ctx);
+		break;
+	}
 }
 
-#define OP_IMM3(op, r1, r2, imm_val, ctx)				\
-	do {								\
-		imm12 = imm8m(imm_val);					\
-		if (imm12 < 0) {					\
-			emit_mov_i_no8m(r_scratch, imm_val, ctx);	\
-			emit(op ## _R((r1), (r2), r_scratch), ctx);	\
-		} else {						\
-			emit(op ## _I((r1), (r2), imm12), ctx);		\
-		}							\
-	} while (0)
-
-static inline void emit_err_ret(u8 cond, struct jit_ctx *ctx)
-{
-	if (ctx->ret0_fp_idx >= 0) {
-		_emit(cond, ARM_B(b_imm(ctx->ret0_fp_idx, ctx)), ctx);
-		/* NOP to keep the size constant between passes */
-		emit(ARM_MOV_R(ARM_R0, ARM_R0), ctx);
+/* ALU operation (32 bit)
+ * dst = dst (op) src
+ */
+static inline void emit_a32_alu_r(const u8 dst, const u8 src,
+				  bool dstk, bool sstk,
+				  struct jit_ctx *ctx, const bool is64,
+				  const bool hi, const u8 op) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rn = sstk ? tmp[1] : src;
+
+	if (sstk)
+		emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src)), ctx);
+
+	/* ALU operation */
+	if (dstk) {
+		emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
+		emit_alu_r(tmp[0], rn, is64, hi, op, ctx);
+		emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
 	} else {
-		_emit(cond, ARM_MOV_I(ARM_R0, 0), ctx);
-		_emit(cond, ARM_B(b_imm(ctx->skf->len, ctx)), ctx);
+		emit_alu_r(dst, rn, is64, hi, op, ctx);
 	}
 }
 
-static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
-{
-#if __LINUX_ARM_ARCH__ < 5
-	emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
+/* ALU operation (64 bit) */
+static inline void emit_a32_alu_r64(const bool is64, const u8 dst[],
+				  const u8 src[], bool dstk,
+				  bool sstk, struct jit_ctx *ctx,
+				  const u8 op) {
+	emit_a32_alu_r(dst_lo, src_lo, dstk, sstk, ctx, is64, false, op);
+	if (is64)
+		emit_a32_alu_r(dst_hi, src_hi, dstk, sstk, ctx, is64, true, op);
+	else
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+}
 
-	if (elf_hwcap & HWCAP_THUMB)
-		emit(ARM_BX(tgt_reg), ctx);
+/* dst = src (4 bytes) */
+static inline void emit_a32_mov_r(const u8 dst, const u8 src,
+				  bool dstk, bool sstk,
+				  struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rt = sstk ? tmp[0] : src;
+
+	if (sstk)
+		emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(src)), ctx);
+	if (dstk)
+		emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst)), ctx);
 	else
-		emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
-#else
-	emit(ARM_BLX_R(tgt_reg), ctx);
-#endif
+		emit(ARM_MOV_R(dst, rt), ctx);
 }
 
-static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx,
-				int bpf_op)
-{
-#if __LINUX_ARM_ARCH__ == 7
-	if (elf_hwcap & HWCAP_IDIVA) {
-		if (bpf_op == BPF_DIV)
-			emit(ARM_UDIV(rd, rm, rn), ctx);
-		else {
-			emit(ARM_UDIV(ARM_R3, rm, rn), ctx);
-			emit(ARM_MLS(rd, rn, ARM_R3, rm), ctx);
-		}
-		return;
+/* dst = src */
+static inline void emit_a32_mov_r64(const bool is64, const u8 dst[],
+				  const u8 src[], bool dstk,
+				  bool sstk, struct jit_ctx *ctx) {
+	emit_a32_mov_r(dst_lo, src_lo, dstk, sstk, ctx);
+	if (is64) {
+		/* complete 8 byte move */
+		emit_a32_mov_r(dst_hi, src_hi, dstk, sstk, ctx);
+	} else {
+		/* Zero out high 4 bytes */
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
 	}
-#endif
+}
 
-	/*
-	 * For BPF_ALU | BPF_DIV | BPF_K instructions, rm is ARM_R4
-	 * (r_A) and rn is ARM_R0 (r_scratch) so load rn first into
-	 * ARM_R1 to avoid accidentally overwriting ARM_R0 with rm
-	 * before using it as a source for ARM_R1.
-	 *
-	 * For BPF_ALU | BPF_DIV | BPF_X rm is ARM_R4 (r_A) and rn is
-	 * ARM_R5 (r_X) so there is no particular register overlap
-	 * issues.
-	 */
-	if (rn != ARM_R1)
-		emit(ARM_MOV_R(ARM_R1, rn), ctx);
-	if (rm != ARM_R0)
-		emit(ARM_MOV_R(ARM_R0, rm), ctx);
+/* Shift and negate operations with an immediate */
+static inline void emit_a32_alu_i(const u8 dst, const u32 val, bool dstk,
+				struct jit_ctx *ctx, const u8 op) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[0] : dst;
+
+	if (dstk)
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+
+	/* Do shift operation */
+	switch (op) {
+	case BPF_LSH:
+		emit(ARM_LSL_I(rd, rd, val), ctx);
+		break;
+	case BPF_RSH:
+		emit(ARM_LSR_I(rd, rd, val), ctx);
+		break;
+	case BPF_NEG:
+		emit(ARM_RSB_I(rd, rd, val), ctx);
+		break;
+	}
+
+	if (dstk)
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+}
 
+/* dst = -dst (64 bit) */
+static inline void emit_a32_neg64(const u8 dst[], bool dstk,
+				struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst[1];
+	u8 rm = dstk ? tmp[0] : dst[0];
+
+	/* Setup Operand */
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do Negate Operation */
+	emit(ARM_RSBS_I(rd, rd, 0), ctx);
+	emit(ARM_RSC_I(rm, rm, 0), ctx);
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+/* dst = dst << src */
+static inline void emit_a32_lsh_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSH operation */
+	emit(ARM_SUB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_RSB_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
 	ctx->seen |= SEEN_CALL;
-	emit_mov_i(ARM_R3, bpf_op == BPF_DIV ? (u32)jit_udiv : (u32)jit_mod,
-		   ctx);
-	emit_blx_r(ARM_R3, ctx);
+	emit(ARM_MOV_SR(ARM_LR, rm, SRTYPE_ASL, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rd, SRTYPE_ASL, ARM_IP), ctx);
+	emit(ARM_ORR_SR(ARM_IP, ARM_LR, rd, SRTYPE_LSR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_ASL, rt), ctx);
+
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
 
-	if (rd != ARM_R0)
-		emit(ARM_MOV_R(rd, ARM_R0), ctx);
+/* dst = dst >> src (signed)*/
+static inline void emit_a32_arsh_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do the ARSH operation */
+	emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
+	_emit(ARM_COND_MI, ARM_B(0), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_ASR, rt), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
+
+/* dst = dst >> src */
+static inline void emit_a32_lsr_r64(const u8 dst[], const u8 src[], bool dstk,
+				     bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSR operation */
+	emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_LSR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_LSR, rt), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
+
+/* dst = dst << val */
+static inline void emit_a32_lsh_i64(const u8 dst[], bool dstk,
+				     const u32 val, struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSH operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[0], rm, SRTYPE_ASL, val), ctx);
+		emit(ARM_ORR_SI(rm, tmp2[0], rd, SRTYPE_LSR, 32 - val), ctx);
+		emit(ARM_MOV_SI(rd, rd, SRTYPE_ASL, val), ctx);
+	} else {
+		if (val == 32)
+			emit(ARM_MOV_R(rm, rd), ctx);
+		else
+			emit(ARM_MOV_SI(rm, rd, SRTYPE_ASL, val - 32), ctx);
+		emit(ARM_EOR_R(rd, rd, rd), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+/* dst = dst >> val */
+static inline void emit_a32_lsr_i64(const u8 dst[], bool dstk,
+				    const u32 val, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSR operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
+		emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_LSR, val), ctx);
+	} else if (val == 32) {
+		emit(ARM_MOV_R(rd, rm), ctx);
+		emit(ARM_MOV_I(rm, 0), ctx);
+	} else {
+		emit(ARM_MOV_SI(rd, rm, SRTYPE_LSR, val - 32), ctx);
+		emit(ARM_MOV_I(rm, 0), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
 }
 
-static inline void update_on_xread(struct jit_ctx *ctx)
+/* dst = dst >> val (signed) */
+static inline void emit_a32_arsh_i64(const u8 dst[], bool dstk,
+				     const u32 val, struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	 /* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do ARSH operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
+		emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, val), ctx);
+	} else if (val == 32) {
+		emit(ARM_MOV_R(rd, rm), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
+	} else {
+		emit(ARM_MOV_SI(rd, rm, SRTYPE_ASR, val - 32), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+static inline void emit_a32_mul_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands for multiplication */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rn = sstk ? tmp2[0] : src_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+	if (sstk) {
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+		emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_hi)), ctx);
+	}
+
+	/* Do Multiplication */
+	emit(ARM_MUL(ARM_IP, rd, rn), ctx);
+	emit(ARM_MUL(ARM_LR, rm, rt), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_ADD_R(ARM_LR, ARM_IP, ARM_LR), ctx);
+
+	emit(ARM_UMULL(ARM_IP, rm, rd, rt), ctx);
+	emit(ARM_ADD_R(rm, ARM_LR, rm), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_IP), ctx);
+	}
+}
+
+/* *(size *)(dst + off) = src */
+static inline void emit_str_r(const u8 dst, const u8 src, bool dstk,
+			      const s32 off, struct jit_ctx *ctx, const u8 sz){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst;
+
+	if (dstk)
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+	if (off) {
+		emit_a32_mov_i(tmp[0], off, false, ctx);
+		emit(ARM_ADD_R(tmp[0], rd, tmp[0]), ctx);
+		rd = tmp[0];
+	}
+	switch (sz) {
+	case BPF_W:
+		/* Store a Word */
+		emit(ARM_STR_I(src, rd, 0), ctx);
+		break;
+	case BPF_H:
+		/* Store a HalfWord */
+		emit(ARM_STRH_I(src, rd, 0), ctx);
+		break;
+	case BPF_B:
+		/* Store a Byte */
+		emit(ARM_STRB_I(src, rd, 0), ctx);
+		break;
+	}
+}
+
+/* dst = *(size*)(src + off) */
+static inline void emit_ldx_r(const u8 dst, const u8 src, bool dstk,
+			      const s32 off, struct jit_ctx *ctx, const u8 sz)
+{
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst;
+	u8 rm = src;
+
+	if (off) {
+		emit_a32_mov_i(tmp[0], off, false, ctx);
+		emit(ARM_ADD_R(tmp[0], tmp[0], src), ctx);
+		rm = tmp[0];
+	}
+	switch (sz) {
+	case BPF_W:
+		/* Load a Word */
+		emit(ARM_LDR_I(rd, rm, 0), ctx);
+		break;
+	case BPF_H:
+		/* Load a HalfWord */
+		emit(ARM_LDRH_I(rd, rm, 0), ctx);
+		break;
+	case BPF_B:
+		/* Load a Byte */
+		emit(ARM_LDRB_I(rd, rm, 0), ctx);
+		break;
+	}
+	if (dstk)
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+}
+
+/* Arithmetic Operation */
+static inline void emit_ar_r(const u8 rd, const u8 rt, const u8 rm,
+			     const u8 rn, struct jit_ctx *ctx, u8 op)
+{
+	switch (op) {
+	case BPF_JSET:
+		ctx->seen |= SEEN_CALL;
+		emit(ARM_AND_R(ARM_IP, rt, rn), ctx);
+		emit(ARM_AND_R(ARM_LR, rd, rm), ctx);
+		emit(ARM_ORRS_R(ARM_IP, ARM_LR, ARM_IP), ctx);
+		break;
+	case BPF_JEQ:
+	case BPF_JNE:
+	case BPF_JGT:
+	case BPF_JGE:
+		emit(ARM_CMP_R(rd, rm), ctx);
+		_emit(ARM_COND_EQ, ARM_CMP_R(rt, rn), ctx);
+		break;
+	case BPF_JSGT:
+		emit(ARM_CMP_R(rn, rt), ctx);
+		emit(ARM_SBCS_R(ARM_IP, rm, rd), ctx);
+		break;
+	case BPF_JSGE:
+		emit(ARM_CMP_R(rt, rn), ctx);
+		emit(ARM_SBCS_R(ARM_IP, rd, rm), ctx);
+		break;
+	}
+}
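The BPF_JSGT and BPF_JSGE cases above use a CMP on the low words followed by SBCS on the high words, so that the N and V flags describe the full 64-bit signed subtraction. The sketch below models that flag computation in plain C. It assumes the conditional branch that follows tests GE for BPF_JSGE and LT for BPF_JSGT (the operands are swapped in the JSGT case); those branch conditions are outside this part of the patch, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of "cmp a_lo, b_lo ; sbcs tmp, a_hi, b_hi".
 * Returns true when N == V afterwards, i.e. the ARM GE condition.
 */
static bool cmp_sbcs_ge(uint64_t a, uint64_t b)
{
	uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
	uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
	uint32_t borrow = a_lo < b_lo;		/* CMP clears C on borrow */
	uint32_t res = a_hi - b_hi - borrow;	/* SBCS result */
	bool n = (res >> 31) != 0;		/* N flag: sign of the result */
	/* V flag: the exact signed result does not fit in 32 bits */
	int64_t wide = (int64_t)(int32_t)a_hi - (int64_t)(int32_t)b_hi - borrow;
	bool v = wide != (int64_t)(int32_t)res;

	return n == v;
}

/* dst >= src (signed): operands in natural order, branch on GE (assumed) */
static bool jsge(int64_t dst, int64_t src)
{
	return cmp_sbcs_ge((uint64_t)dst, (uint64_t)src);
}

/* dst > src (signed): operands swapped, branch on LT (assumed) */
static bool jsgt(int64_t dst, int64_t src)
{
	return !cmp_sbcs_ge((uint64_t)src, (uint64_t)dst);
}
```

Swapping the operands for JSGT avoids needing a separate "greater than" test across two words: src < dst is the negation of src >= dst, which is the same GE check applied to the swapped pair.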
+
+static int out_offset = -1; /* initialized on the first pass of build_body() */
+static int emit_bpf_tail_call(struct jit_ctx *ctx)
 {
-	if (!(ctx->seen & SEEN_X))
-		ctx->flags |= FLAG_NEED_X_RESET;
 
-	ctx->seen |= SEEN_X;
+	/* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
+	const u8 *r2 = bpf2a32[BPF_REG_2];
+	const u8 *r3 = bpf2a32[BPF_REG_3];
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	const u8 *tcc = bpf2a32[TCALL_CNT];
+	const int idx0 = ctx->idx;
+#define cur_offset (ctx->idx - idx0)
+#define jmp_offset (out_offset - (cur_offset))
+	u32 off, lo, hi;
+
+	/* if (index >= array->map.max_entries)
+	 *	goto out;
+	 */
+	off = offsetof(struct bpf_array, map.max_entries);
+	/* array->map.max_entries */
+	emit_a32_mov_i(tmp[1], off, false, ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
+	/* index (64 bit) */
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
+	/* index >= array->map.max_entries */
+	emit(ARM_CMP_R(tmp2[1], tmp[1]), ctx);
+	_emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
+
+	/* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 *	goto out;
+	 * tail_call_cnt++;
+	 */
+	lo = (u32)MAX_TAIL_CALL_CNT;
+	hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
+	emit(ARM_LDR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
+	emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
+	emit(ARM_CMP_I(tmp[0], hi), ctx);
+	_emit(ARM_COND_EQ, ARM_CMP_I(tmp[1], lo), ctx);
+	_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
+	emit(ARM_ADDS_I(tmp[1], tmp[1], 1), ctx);
+	emit(ARM_ADC_I(tmp[0], tmp[0], 0), ctx);
+	emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
+	emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
+
+	/* prog = array->ptrs[index]
+	 * if (prog == NULL)
+	 *	goto out;
+	 */
+	off = offsetof(struct bpf_array, ptrs);
+	emit_a32_mov_i(tmp[1], off, false, ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
+	emit(ARM_ADD_R(tmp[1], tmp2[1], tmp[1]), ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
+	emit(ARM_MOV_SI(tmp[0], tmp2[1], SRTYPE_ASL, 2), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp[1], tmp[0]), ctx);
+	emit(ARM_CMP_I(tmp[1], 0), ctx);
+	_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+
+	/* goto *(prog->bpf_func + prologue_size); */
+	off = offsetof(struct bpf_prog, bpf_func);
+	emit_a32_mov_i(tmp2[1], off, false, ctx);
+	emit(ARM_LDR_R(tmp[1], tmp[1], tmp2[1]), ctx);
+	emit(ARM_ADD_I(tmp[1], tmp[1], ctx->prologue_bytes), ctx);
+	emit(ARM_BX(tmp[1]), ctx);
+
+	/* out: */
+	if (out_offset == -1)
+		out_offset = cur_offset;
+	if (cur_offset != out_offset) {
+		pr_err_once("tail_call out_offset = %d, expected %d!\n",
+			    cur_offset, out_offset);
+		return -1;
+	}
+	return 0;
+#undef cur_offset
+#undef jmp_offset
 }
 
-static int build_body(struct jit_ctx *ctx)
+/* 0xabcd => 0xcdab */
+static inline void emit_rev16(const u8 rd, const u8 rn, struct jit_ctx *ctx)
 {
-	void *load_func[] = {jit_get_skb_b, jit_get_skb_h, jit_get_skb_w};
-	const struct bpf_prog *prog = ctx->skf;
-	const struct sock_filter *inst;
-	unsigned i, load_order, off, condt;
-	int imm12;
-	u32 k;
+#if __LINUX_ARM_ARCH__ < 6
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 8), ctx);
+	emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
+	emit(ARM_ORR_SI(rd, tmp2[0], tmp2[1], SRTYPE_LSL, 8), ctx);
+#else /* ARMv6+ */
+	emit(ARM_REV16(rd, rn), ctx);
+#endif
+}
 
-	for (i = 0; i < prog->len; i++) {
-		u16 code;
+/* 0xabcdefgh => 0xghefcdab */
+static inline void emit_rev32(const u8 rd, const u8 rn, struct jit_ctx *ctx)
+{
+#if __LINUX_ARM_ARCH__ < 6
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 24), ctx);
+	emit(ARM_ORR_SI(ARM_IP, tmp2[0], tmp2[1], SRTYPE_LSL, 24), ctx);
+
+	emit(ARM_MOV_SI(tmp2[1], rn, SRTYPE_LSR, 8), ctx);
+	emit(ARM_AND_I(tmp2[1], tmp2[1], 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 16), ctx);
+	emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], tmp2[0], SRTYPE_LSL, 8), ctx);
+	emit(ARM_ORR_SI(tmp2[0], tmp2[0], tmp2[1], SRTYPE_LSL, 16), ctx);
+	emit(ARM_ORR_R(rd, ARM_IP, tmp2[0]), ctx);
+
+#else /* ARMv6+ */
+	emit(ARM_REV(rd, rn), ctx);
+#endif
+}
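On cores older than ARMv6 there is no REV or REV16 instruction, so emit_rev16() and emit_rev32() fall back to shifts, masks and ORs. The C sketch below mirrors those fallback sequences step by step so the byte movement is easier to follow. The function names are hypothetical and the code is only a model of the emitted instructions, not part of the patch.

```c
#include <stdint.h>

/* Hypothetical model of the pre-ARMv6 emit_rev16() sequence. */
static uint32_t rev16_fallback(uint32_t rn)
{
	uint32_t t1 = rn & 0xff;		/* and  t1, rn, #0xff */
	uint32_t t0 = (rn >> 8) & 0xff;		/* lsr + and */

	return t0 | (t1 << 8);			/* orr  rd, t0, t1, lsl #8 */
}

/* Hypothetical model of the pre-ARMv6 emit_rev32() sequence. */
static uint32_t rev32_fallback(uint32_t rn)
{
	uint32_t t1 = rn & 0xff;		/* byte 0 */
	uint32_t t0 = rn >> 24;			/* byte 3 */
	uint32_t ip = t0 | (t1 << 24);		/* bytes 0 and 3 swapped */

	t1 = (rn >> 8) & 0xff;			/* byte 1 */
	t0 = (rn >> 16) & 0xff;			/* byte 2 */
	t0 <<= 8;				/* byte 2 moves to position 1 */
	t0 |= t1 << 16;				/* byte 1 moves to position 2 */

	return ip | t0;
}
```

Note that the 16-bit fallback discards the upper halfword of its input, whereas the ARMv6+ REV16 instruction swaps bytes within both halfwords; the later zero-extension step is what makes the two paths agree for BPF_END with imm == 16.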
+
+/* Push the scratch stack register on top of the stack */
+static inline void emit_push_r64(const u8 src[], const u8 shift,
+		struct jit_ctx *ctx)
+{
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	u16 reg_set = 0;
 
-		inst = &(prog->insns[i]);
-		/* K as an immediate value operand */
-		k = inst->k;
-		code = bpf_anc_helper(inst);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(src[1]+shift)), ctx);
+	emit(ARM_LDR_I(tmp2[0], ARM_SP, STACK_VAR(src[0]+shift)), ctx);
 
-		/* compute offsets only in the fake pass */
-		if (ctx->target == NULL)
-			ctx->offsets[i] = ctx->idx * 4;
+	reg_set = (1 << tmp2[1]) | (1 << tmp2[0]);
+	emit(ARM_PUSH(reg_set), ctx);
+}
+
+static void build_prologue(struct jit_ctx *ctx)
+{
+	const u8 r0 = bpf2a32[BPF_REG_0][1];
+	const u8 r2 = bpf2a32[BPF_REG_1][1];
+	const u8 r3 = bpf2a32[BPF_REG_1][0];
+	const u8 r4 = bpf2a32[BPF_REG_6][1];
+	const u8 r5 = bpf2a32[BPF_REG_6][0];
+	const u8 r6 = bpf2a32[TMP_REG_1][1];
+	const u8 r7 = bpf2a32[TMP_REG_1][0];
+	const u8 r8 = bpf2a32[TMP_REG_2][1];
+	const u8 r10 = bpf2a32[TMP_REG_2][0];
+	const u8 fplo = bpf2a32[BPF_REG_FP][1];
+	const u8 fphi = bpf2a32[BPF_REG_FP][0];
+	const u8 sp = ARM_SP;
+	const u8 *tcc = bpf2a32[TCALL_CNT];
+
+	u16 reg_set = 0;
+
+	/*
+	 * eBPF prog stack layout
+	 *
+	 *                         high
+	 * original ARM_SP =>     +-----+ eBPF prologue
+	 *                        |FP/LR|
+	 * current ARM_FP =>      +-----+
+	 *                        | ... | callee saved registers
+	 * eBPF fp register =>    +-----+ <= (BPF_FP)
+	 *                        | ... | eBPF JIT scratch space
+	 *                        |     | eBPF prog stack
+	 *                        +-----+
+	 *                        |RSVD | JIT scratchpad
+	 * current ARM_SP =>      +-----+ <= (BPF_FP - STACK_SIZE)
+	 *                        |     |
+	 *                        | ... | Function call stack
+	 *                        |     |
+	 *                        +-----+
+	 *                          low
+	 */
+
+	/* Save callee saved registers. */
+	reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
+#ifdef CONFIG_FRAME_POINTER
+	reg_set |= (1<<ARM_FP) | (1<<ARM_IP) | (1<<ARM_LR) | (1<<ARM_PC);
+	emit(ARM_MOV_R(ARM_IP, sp), ctx);
+	emit(ARM_PUSH(reg_set), ctx);
+	emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
+#else
+	/* Check if call instruction exists in BPF body */
+	if (ctx->seen & SEEN_CALL)
+		reg_set |= (1<<ARM_LR);
+	emit(ARM_PUSH(reg_set), ctx);
+#endif
+	/* Save frame pointer for later */
+	emit(ARM_SUB_I(ARM_IP, sp, SCRATCH_SIZE), ctx);
+
+	/* Set up function call stack */
+	emit(ARM_SUB_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
+
+	/* Set up BPF prog stack base register */
+	emit_a32_mov_r(fplo, ARM_IP, true, false, ctx);
+	emit_a32_mov_i(fphi, 0, true, ctx);
+
+	/* mov r4, 0 */
+	emit(ARM_MOV_I(r4, 0), ctx);
+
+	/* Move BPF_CTX to BPF_R1 */
+	emit(ARM_MOV_R(r3, r4), ctx);
+	emit(ARM_MOV_R(r2, r0), ctx);
+	/* Initialize Tail Count */
+	emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[0])), ctx);
+	emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[1])), ctx);
+	/* end of prologue */
+}
+
+static void build_epilogue(struct jit_ctx *ctx)
+{
+	const u8 r4 = bpf2a32[BPF_REG_6][1];
+	const u8 r5 = bpf2a32[BPF_REG_6][0];
+	const u8 r6 = bpf2a32[TMP_REG_1][1];
+	const u8 r7 = bpf2a32[TMP_REG_1][0];
+	const u8 r8 = bpf2a32[TMP_REG_2][1];
+	const u8 r10 = bpf2a32[TMP_REG_2][0];
+	u16 reg_set = 0;
+
+	/* unwind function call stack */
+	emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
+
+	/* restore callee saved registers. */
+	reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
+#ifdef CONFIG_FRAME_POINTER
+	/* the first instruction of the prologue was: mov ip, sp */
+	reg_set |= (1<<ARM_FP) | (1<<ARM_SP) | (1<<ARM_PC);
+	emit(ARM_LDM(ARM_SP, reg_set), ctx);
+#else
+	if (ctx->seen & SEEN_CALL)
+		reg_set |= (1<<ARM_PC);
+	/* Restore callee saved registers. */
+	emit(ARM_POP(reg_set), ctx);
+	/* Return to the caller */
+	if (!(ctx->seen & SEEN_CALL))
+		emit(ARM_BX(ARM_LR), ctx);
+#endif
+}
 
-		switch (code) {
-		case BPF_LD | BPF_IMM:
-			emit_mov_i(r_A, k, ctx);
+/*
+ * Convert an eBPF instruction to native instructions, i.e.
+ * JIT a single eBPF instruction.
+ * Returns :
+ *	0  - Successfully JITed an 8-byte eBPF instruction
+ *	>0 - Successfully JITed a 16-byte eBPF instruction
+ *	<0 - Failed to JIT.
+ */
+static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
+{
+	const u8 code = insn->code;
+	const u8 *dst = bpf2a32[insn->dst_reg];
+	const u8 *src = bpf2a32[insn->src_reg];
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	const s16 off = insn->off;
+	const s32 imm = insn->imm;
+	const int i = insn - ctx->prog->insnsi;
+	const bool is64 = BPF_CLASS(code) == BPF_ALU64;
+	const bool dstk = is_on_stack(insn->dst_reg);
+	const bool sstk = is_on_stack(insn->src_reg);
+	u8 rd, rt, rm, rn;
+	s32 jmp_offset;
+
+#define check_imm(bits, imm) do {				\
+	if ((((imm) > 0) && ((imm) >> (bits))) ||		\
+	    (((imm) < 0) && (~(imm) >> (bits)))) {		\
+		pr_info("[%2d] imm=%d(0x%x) out of range\n",	\
+			i, imm, imm);				\
+		return -EINVAL;					\
+	}							\
+} while (0)
+#define check_imm24(imm) check_imm(24, imm)
+
+	switch (code) {
+	/* ALU operations */
+
+	/* dst = src */
+	case BPF_ALU | BPF_MOV | BPF_K:
+	case BPF_ALU | BPF_MOV | BPF_X:
+	case BPF_ALU64 | BPF_MOV | BPF_K:
+	case BPF_ALU64 | BPF_MOV | BPF_X:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_mov_r64(is64, dst, src, dstk, sstk, ctx);
 			break;
-		case BPF_LD | BPF_W | BPF_LEN:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
-			emit(ARM_LDR_I(r_A, r_skb,
-				       offsetof(struct sk_buff, len)), ctx);
+		case BPF_K:
+			/* Sign-extend immediate value to destination reg */
+			emit_a32_mov_i64(is64, dst, imm, dstk, ctx);
 			break;
-		case BPF_LD | BPF_MEM:
-			/* A = scratch[k] */
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_LDR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
+		}
+		break;
+	/* dst = dst + src/imm */
+	/* dst = dst - src/imm */
+	/* dst = dst | src/imm */
+	/* dst = dst & src/imm */
+	/* dst = dst ^ src/imm */
+	/* dst = dst * src/imm */
+	/* dst = dst << src */
+	/* dst = dst >> src */
+	case BPF_ALU | BPF_ADD | BPF_K:
+	case BPF_ALU | BPF_ADD | BPF_X:
+	case BPF_ALU | BPF_SUB | BPF_K:
+	case BPF_ALU | BPF_SUB | BPF_X:
+	case BPF_ALU | BPF_OR | BPF_K:
+	case BPF_ALU | BPF_OR | BPF_X:
+	case BPF_ALU | BPF_AND | BPF_K:
+	case BPF_ALU | BPF_AND | BPF_X:
+	case BPF_ALU | BPF_XOR | BPF_K:
+	case BPF_ALU | BPF_XOR | BPF_X:
+	case BPF_ALU | BPF_MUL | BPF_K:
+	case BPF_ALU | BPF_MUL | BPF_X:
+	case BPF_ALU | BPF_LSH | BPF_X:
+	case BPF_ALU | BPF_RSH | BPF_X:
+	case BPF_ALU | BPF_ARSH | BPF_K:
+	case BPF_ALU | BPF_ARSH | BPF_X:
+	case BPF_ALU64 | BPF_ADD | BPF_K:
+	case BPF_ALU64 | BPF_ADD | BPF_X:
+	case BPF_ALU64 | BPF_SUB | BPF_K:
+	case BPF_ALU64 | BPF_SUB | BPF_X:
+	case BPF_ALU64 | BPF_OR | BPF_K:
+	case BPF_ALU64 | BPF_OR | BPF_X:
+	case BPF_ALU64 | BPF_AND | BPF_K:
+	case BPF_ALU64 | BPF_AND | BPF_X:
+	case BPF_ALU64 | BPF_XOR | BPF_K:
+	case BPF_ALU64 | BPF_XOR | BPF_X:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_alu_r64(is64, dst, src, dstk, sstk,
+					 ctx, BPF_OP(code));
 			break;
-		case BPF_LD | BPF_W | BPF_ABS:
-			load_order = 2;
-			goto load;
-		case BPF_LD | BPF_H | BPF_ABS:
-			load_order = 1;
-			goto load;
-		case BPF_LD | BPF_B | BPF_ABS:
-			load_order = 0;
-load:
-			emit_mov_i(r_off, k, ctx);
-load_common:
-			ctx->seen |= SEEN_DATA | SEEN_CALL;
-
-			if (load_order > 0) {
-				emit(ARM_SUB_I(r_scratch, r_skb_hl,
-					       1 << load_order), ctx);
-				emit(ARM_CMP_R(r_scratch, r_off), ctx);
-				condt = ARM_COND_GE;
-			} else {
-				emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
-				condt = ARM_COND_HI;
-			}
-
-			/*
-			 * test for negative offset, only if we are
-			 * currently scheduled to take the fast
-			 * path. this will update the flags so that
-			 * the slowpath instruction are ignored if the
-			 * offset is negative.
-			 *
-			 * for loard_order == 0 the HI condition will
-			 * make loads at offset 0 take the slow path too.
+		case BPF_K:
+			/* Move the immediate into a temporary register
+			 * pair first. This takes care of sign extension,
+			 * so the ALU operation can then treat it like
+			 * any other register operand.
 			 */
-			_emit(condt, ARM_CMP_I(r_off, 0), ctx);
-
-			_emit(condt, ARM_ADD_R(r_scratch, r_off, r_skb_data),
-			      ctx);
-
-			if (load_order == 0)
-				_emit(condt, ARM_LDRB_I(r_A, r_scratch, 0),
-				      ctx);
-			else if (load_order == 1)
-				emit_load_be16(condt, r_A, r_scratch, ctx);
-			else if (load_order == 2)
-				emit_load_be32(condt, r_A, r_scratch, ctx);
-
-			_emit(condt, ARM_B(b_imm(i + 1, ctx)), ctx);
-
-			/* the slowpath */
-			emit_mov_i(ARM_R3, (u32)load_func[load_order], ctx);
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			/* the offset is already in R1 */
-			emit_blx_r(ARM_R3, ctx);
-			/* check the result of skb_copy_bits */
-			emit(ARM_CMP_I(ARM_R1, 0), ctx);
-			emit_err_ret(ARM_COND_NE, ctx);
-			emit(ARM_MOV_R(r_A, ARM_R0), ctx);
+			emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
+			emit_a32_alu_r64(is64, dst, tmp2, dstk, false,
+					 ctx, BPF_OP(code));
 			break;
-		case BPF_LD | BPF_W | BPF_IND:
-			load_order = 2;
-			goto load_ind;
-		case BPF_LD | BPF_H | BPF_IND:
-			load_order = 1;
-			goto load_ind;
-		case BPF_LD | BPF_B | BPF_IND:
-			load_order = 0;
-load_ind:
-			update_on_xread(ctx);
-			OP_IMM3(ARM_ADD, r_off, r_X, k, ctx);
-			goto load_common;
-		case BPF_LDX | BPF_IMM:
-			ctx->seen |= SEEN_X;
-			emit_mov_i(r_X, k, ctx);
+		}
+		break;
+	/* dst = dst / src(imm) */
+	/* dst = dst % src(imm) */
+	case BPF_ALU | BPF_DIV | BPF_K:
+	case BPF_ALU | BPF_DIV | BPF_X:
+	case BPF_ALU | BPF_MOD | BPF_K:
+	case BPF_ALU | BPF_MOD | BPF_X:
+		rt = src_lo;
+		rd = dstk ? tmp2[1] : dst_lo;
+		if (dstk)
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			rt = sstk ? tmp2[0] : rt;
+			if (sstk)
+				emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)),
+				     ctx);
 			break;
-		case BPF_LDX | BPF_W | BPF_LEN:
-			ctx->seen |= SEEN_X | SEEN_SKB;
-			emit(ARM_LDR_I(r_X, r_skb,
-				       offsetof(struct sk_buff, len)), ctx);
+		case BPF_K:
+			rt = tmp2[0];
+			emit_a32_mov_i(rt, imm, false, ctx);
 			break;
-		case BPF_LDX | BPF_MEM:
-			ctx->seen |= SEEN_X | SEEN_MEM_WORD(k);
-			emit(ARM_LDR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
+		}
+		emit_udivmod(rd, rd, rt, ctx, BPF_OP(code));
+		if (dstk)
+			emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	case BPF_ALU64 | BPF_DIV | BPF_K:
+	case BPF_ALU64 | BPF_DIV | BPF_X:
+	case BPF_ALU64 | BPF_MOD | BPF_K:
+	case BPF_ALU64 | BPF_MOD | BPF_X:
+		goto notyet;
+	/* dst = dst >> imm */
+	/* dst = dst << imm */
+	case BPF_ALU | BPF_RSH | BPF_K:
+	case BPF_ALU | BPF_LSH | BPF_K:
+		if (unlikely(imm > 31))
+			return -EINVAL;
+		if (imm)
+			emit_a32_alu_i(dst_lo, imm, dstk, ctx, BPF_OP(code));
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	/* dst = dst << imm */
+	case BPF_ALU64 | BPF_LSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_lsh_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = dst >> imm */
+	case BPF_ALU64 | BPF_RSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_lsr_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = dst << src */
+	case BPF_ALU64 | BPF_LSH | BPF_X:
+		emit_a32_lsh_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> src */
+	case BPF_ALU64 | BPF_RSH | BPF_X:
+		emit_a32_lsr_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> src (signed) */
+	case BPF_ALU64 | BPF_ARSH | BPF_X:
+		emit_a32_arsh_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> imm (signed) */
+	case BPF_ALU64 | BPF_ARSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_arsh_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = -dst */
+	case BPF_ALU | BPF_NEG:
+		emit_a32_alu_i(dst_lo, 0, dstk, ctx, BPF_OP(code));
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	/* dst = -dst (64 bit) */
+	case BPF_ALU64 | BPF_NEG:
+		emit_a32_neg64(dst, dstk, ctx);
+		break;
+	/* dst = dst * src/imm */
+	case BPF_ALU64 | BPF_MUL | BPF_X:
+	case BPF_ALU64 | BPF_MUL | BPF_K:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_mul_r64(dst, src, dstk, sstk, ctx);
 			break;
-		case BPF_LDX | BPF_B | BPF_MSH:
-			/* x = ((*(frame + k)) & 0xf) << 2; */
-			ctx->seen |= SEEN_X | SEEN_DATA | SEEN_CALL;
-			/* the interpreter should deal with the negative K */
-			if ((int)k < 0)
-				return -1;
-			/* offset in r1: we might have to take the slow path */
-			emit_mov_i(r_off, k, ctx);
-			emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
-
-			/* load in r0: common with the slowpath */
-			_emit(ARM_COND_HI, ARM_LDRB_R(ARM_R0, r_skb_data,
-						      ARM_R1), ctx);
-			/*
-			 * emit_mov_i() might generate one or two instructions,
-			 * the same holds for emit_blx_r()
+		case BPF_K:
+			/* Move the immediate into a temporary register
+			 * pair first. This sign-extends it to 64 bits,
+			 * so the multiplication can then treat it like
+			 * any other register operand.
 			 */
-			_emit(ARM_COND_HI, ARM_B(b_imm(i + 1, ctx) - 2), ctx);
-
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			/* r_off is r1 */
-			emit_mov_i(ARM_R3, (u32)jit_get_skb_b, ctx);
-			emit_blx_r(ARM_R3, ctx);
-			/* check the return value of skb_copy_bits */
-			emit(ARM_CMP_I(ARM_R1, 0), ctx);
-			emit_err_ret(ARM_COND_NE, ctx);
-
-			emit(ARM_AND_I(r_X, ARM_R0, 0x00f), ctx);
-			emit(ARM_LSL_I(r_X, r_X, 2), ctx);
-			break;
-		case BPF_ST:
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_STR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
-			break;
-		case BPF_STX:
-			update_on_xread(ctx);
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_STR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
-			break;
-		case BPF_ALU | BPF_ADD | BPF_K:
-			/* A += K */
-			OP_IMM3(ARM_ADD, r_A, r_A, k, ctx);
-			break;
-		case BPF_ALU | BPF_ADD | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_ADD_R(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_SUB | BPF_K:
-			/* A -= K */
-			OP_IMM3(ARM_SUB, r_A, r_A, k, ctx);
-			break;
-		case BPF_ALU | BPF_SUB | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_SUB_R(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_MUL | BPF_K:
-			/* A *= K */
-			emit_mov_i(r_scratch, k, ctx);
-			emit(ARM_MUL(r_A, r_A, r_scratch), ctx);
+			emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
+			emit_a32_mul_r64(dst, tmp2, dstk, false, ctx);
 			break;
-		case BPF_ALU | BPF_MUL | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_MUL(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_DIV | BPF_K:
-			if (k == 1)
-				break;
-			emit_mov_i(r_scratch, k, ctx);
-			emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_DIV);
-			break;
-		case BPF_ALU | BPF_DIV | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_CMP_I(r_X, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-			emit_udivmod(r_A, r_A, r_X, ctx, BPF_DIV);
-			break;
-		case BPF_ALU | BPF_MOD | BPF_K:
-			if (k == 1) {
-				emit_mov_i(r_A, 0, ctx);
-				break;
-			}
-			emit_mov_i(r_scratch, k, ctx);
-			emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_MOD);
-			break;
-		case BPF_ALU | BPF_MOD | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_CMP_I(r_X, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-			emit_udivmod(r_A, r_A, r_X, ctx, BPF_MOD);
-			break;
-		case BPF_ALU | BPF_OR | BPF_K:
-			/* A |= K */
-			OP_IMM3(ARM_ORR, r_A, r_A, k, ctx);
+		}
+		break;
+	/* dst = htole(dst) */
+	/* dst = htobe(dst) */
+	case BPF_ALU | BPF_END | BPF_FROM_LE:
+	case BPF_ALU | BPF_END | BPF_FROM_BE:
+		rd = dstk ? tmp[0] : dst_hi;
+		rt = dstk ? tmp[1] : dst_lo;
+		if (dstk) {
+			emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+		if (BPF_SRC(code) == BPF_FROM_LE)
+			goto emit_bswap_uxt;
+		switch (imm) {
+		case 16:
+			emit_rev16(rt, rt, ctx);
+			goto emit_bswap_uxt;
+		case 32:
+			emit_rev32(rt, rt, ctx);
+			goto emit_bswap_uxt;
+		case 64:
+			/* Because of the usage of ARM_LR */
+			ctx->seen |= SEEN_CALL;
+			emit_rev32(ARM_LR, rt, ctx);
+			emit_rev32(rt, rd, ctx);
+			emit(ARM_MOV_R(rd, ARM_LR), ctx);
 			break;
-		case BPF_ALU | BPF_OR | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_ORR_R(r_A, r_A, r_X), ctx);
+		}
+		goto exit;
+emit_bswap_uxt:
+		switch (imm) {
+		case 16:
+			/* zero-extend 16 bits into 64 bits */
+#if __LINUX_ARM_ARCH__ < 6
+			emit_a32_mov_i(tmp2[1], 0xffff, false, ctx);
+			emit(ARM_AND_R(rt, rt, tmp2[1]), ctx);
+#else /* ARMv6+ */
+			emit(ARM_UXTH(rt, rt), ctx);
+#endif
+			emit(ARM_EOR_R(rd, rd, rd), ctx);
 			break;
-		case BPF_ALU | BPF_XOR | BPF_K:
-			/* A ^= K; */
-			OP_IMM3(ARM_EOR, r_A, r_A, k, ctx);
+		case 32:
+			/* zero-extend 32 bits into 64 bits */
+			emit(ARM_EOR_R(rd, rd, rd), ctx);
 			break;
-		case BPF_ANC | SKF_AD_ALU_XOR_X:
-		case BPF_ALU | BPF_XOR | BPF_X:
-			/* A ^= X */
-			update_on_xread(ctx);
-			emit(ARM_EOR_R(r_A, r_A, r_X), ctx);
+		case 64:
+			/* nop */
 			break;
-		case BPF_ALU | BPF_AND | BPF_K:
-			/* A &= K */
-			OP_IMM3(ARM_AND, r_A, r_A, k, ctx);
+		}
+exit:
+		if (dstk) {
+			emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+		break;
+	/* dst = imm64 */
+	case BPF_LD | BPF_IMM | BPF_DW:
+	{
+		const struct bpf_insn insn1 = insn[1];
+		u32 hi, lo = imm;
+
+		hi = insn1.imm;
+		emit_a32_mov_i(dst_lo, lo, dstk, ctx);
+		emit_a32_mov_i(dst_hi, hi, dstk, ctx);
+
+		return 1;
+	}
+	/* LDX: dst = *(size *)(src + off) */
+	case BPF_LDX | BPF_MEM | BPF_W:
+	case BPF_LDX | BPF_MEM | BPF_H:
+	case BPF_LDX | BPF_MEM | BPF_B:
+	case BPF_LDX | BPF_MEM | BPF_DW:
+		rn = sstk ? tmp2[1] : src_lo;
+		if (sstk)
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			/* Load a Word */
+		case BPF_H:
+			/* Load a Half-Word */
+		case BPF_B:
+			/* Load a Byte */
+			emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_SIZE(code));
+			emit_a32_mov_i(dst_hi, 0, dstk, ctx);
 			break;
-		case BPF_ALU | BPF_AND | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_AND_R(r_A, r_A, r_X), ctx);
+		case BPF_DW:
+			/* Load a double word */
+			emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_W);
+			emit_ldx_r(dst_hi, rn, dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_ALU | BPF_LSH | BPF_K:
-			if (unlikely(k > 31))
-				return -1;
-			emit(ARM_LSL_I(r_A, r_A, k), ctx);
+		}
+		break;
+	/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
+	case BPF_LD | BPF_ABS | BPF_W:
+	case BPF_LD | BPF_ABS | BPF_H:
+	case BPF_LD | BPF_ABS | BPF_B:
+	/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
+	case BPF_LD | BPF_IND | BPF_W:
+	case BPF_LD | BPF_IND | BPF_H:
+	case BPF_LD | BPF_IND | BPF_B:
+	{
+		const u8 r4 = bpf2a32[BPF_REG_6][1]; /* r4 = ptr to sk_buff */
+		const u8 r0 = bpf2a32[BPF_REG_0][1]; /*r0: struct sk_buff *skb*/
+						     /* rtn value */
+		const u8 r1 = bpf2a32[BPF_REG_0][0]; /* r1: int k */
+		const u8 r2 = bpf2a32[BPF_REG_1][1]; /* r2: unsigned int size */
+		const u8 r3 = bpf2a32[BPF_REG_1][0]; /* r3: void *buffer */
+		const u8 r6 = bpf2a32[TMP_REG_1][1]; /* r6: void *(*func)(..) */
+		int size;
+
+		/* Setting up first argument */
+		emit(ARM_MOV_R(r0, r4), ctx);
+
+		/* Setting up second argument */
+		emit_a32_mov_i(r1, imm, false, ctx);
+		if (BPF_MODE(code) == BPF_IND)
+			emit_a32_alu_r(r1, src_lo, false, sstk, ctx,
+				       false, false, BPF_ADD);
+
+		/* Setting up third argument */
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			size = 4;
 			break;
-		case BPF_ALU | BPF_LSH | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_LSL_R(r_A, r_A, r_X), ctx);
+		case BPF_H:
+			size = 2;
 			break;
-		case BPF_ALU | BPF_RSH | BPF_K:
-			if (unlikely(k > 31))
-				return -1;
-			if (k)
-				emit(ARM_LSR_I(r_A, r_A, k), ctx);
+		case BPF_B:
+			size = 1;
 			break;
-		case BPF_ALU | BPF_RSH | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_LSR_R(r_A, r_A, r_X), ctx);
+		default:
+			return -EINVAL;
+		}
+		emit_a32_mov_i(r2, size, false, ctx);
+
+		/* Setting up fourth argument */
+		emit(ARM_ADD_I(r3, ARM_SP, imm8m(SKB_BUFFER)), ctx);
+
+		/* Setting up function pointer to call */
+		emit_a32_mov_i(r6, (unsigned int)bpf_load_pointer, false, ctx);
+		emit_blx_r(r6, ctx);
+
+		emit(ARM_EOR_R(r1, r1, r1), ctx);
+		/* Check whether the returned pointer is NULL.
+		 * If it is NULL, jump to the epilogue;
+		 * otherwise load the value from the returned address.
+		 */
+		emit(ARM_CMP_I(r0, 0), ctx);
+		jmp_offset = epilogue_offset(ctx);
+		check_imm24(jmp_offset);
+		_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+
+		/* Load value from the address */
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			emit(ARM_LDR_I(r0, r0, 0), ctx);
+			emit_rev32(r0, r0, ctx);
 			break;
-		case BPF_ALU | BPF_NEG:
-			/* A = -A */
-			emit(ARM_RSB_I(r_A, r_A, 0), ctx);
+		case BPF_H:
+			emit(ARM_LDRH_I(r0, r0, 0), ctx);
+			emit_rev16(r0, r0, ctx);
 			break;
-		case BPF_JMP | BPF_JA:
-			/* pc += K */
-			emit(ARM_B(b_imm(i + k + 1, ctx)), ctx);
+		case BPF_B:
+			emit(ARM_LDRB_I(r0, r0, 0), ctx);
+			/* No need to reverse */
 			break;
-		case BPF_JMP | BPF_JEQ | BPF_K:
-			/* pc += (A == K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_EQ;
-			goto cmp_imm;
-		case BPF_JMP | BPF_JGT | BPF_K:
-			/* pc += (A > K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_HI;
-			goto cmp_imm;
-		case BPF_JMP | BPF_JGE | BPF_K:
-			/* pc += (A >= K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_HS;
-cmp_imm:
-			imm12 = imm8m(k);
-			if (imm12 < 0) {
-				emit_mov_i_no8m(r_scratch, k, ctx);
-				emit(ARM_CMP_R(r_A, r_scratch), ctx);
-			} else {
-				emit(ARM_CMP_I(r_A, imm12), ctx);
-			}
-cond_jump:
-			if (inst->jt)
-				_emit(condt, ARM_B(b_imm(i + inst->jt + 1,
-						   ctx)), ctx);
-			if (inst->jf)
-				_emit(condt ^ 1, ARM_B(b_imm(i + inst->jf + 1,
-							     ctx)), ctx);
+		}
+		break;
+	}
+	/* ST: *(size *)(dst + off) = imm */
+	case BPF_ST | BPF_MEM | BPF_W:
+	case BPF_ST | BPF_MEM | BPF_H:
+	case BPF_ST | BPF_MEM | BPF_B:
+	case BPF_ST | BPF_MEM | BPF_DW:
+		switch (BPF_SIZE(code)) {
+		case BPF_DW:
+			/* Sign-extend immediate value into temp reg */
+			emit_a32_mov_i64(true, tmp2, imm, false, ctx);
+			emit_str_r(dst_lo, tmp2[1], dstk, off, ctx, BPF_W);
+			emit_str_r(dst_lo, tmp2[0], dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_JMP | BPF_JEQ | BPF_X:
-			/* pc += (A == X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_EQ;
-			goto cmp_x;
-		case BPF_JMP | BPF_JGT | BPF_X:
-			/* pc += (A > X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_HI;
-			goto cmp_x;
-		case BPF_JMP | BPF_JGE | BPF_X:
-			/* pc += (A >= X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_CS;
-cmp_x:
-			update_on_xread(ctx);
-			emit(ARM_CMP_R(r_A, r_X), ctx);
-			goto cond_jump;
-		case BPF_JMP | BPF_JSET | BPF_K:
-			/* pc += (A & K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_NE;
-			/* not set iff all zeroes iff Z==1 iff EQ */
-
-			imm12 = imm8m(k);
-			if (imm12 < 0) {
-				emit_mov_i_no8m(r_scratch, k, ctx);
-				emit(ARM_TST_R(r_A, r_scratch), ctx);
-			} else {
-				emit(ARM_TST_I(r_A, imm12), ctx);
-			}
-			goto cond_jump;
-		case BPF_JMP | BPF_JSET | BPF_X:
-			/* pc += (A & X) ? pc->jt : pc->jf */
-			update_on_xread(ctx);
-			condt  = ARM_COND_NE;
-			emit(ARM_TST_R(r_A, r_X), ctx);
-			goto cond_jump;
-		case BPF_RET | BPF_A:
-			emit(ARM_MOV_R(ARM_R0, r_A), ctx);
-			goto b_epilogue;
-		case BPF_RET | BPF_K:
-			if ((k == 0) && (ctx->ret0_fp_idx < 0))
-				ctx->ret0_fp_idx = i;
-			emit_mov_i(ARM_R0, k, ctx);
-b_epilogue:
-			if (i != ctx->skf->len - 1)
-				emit(ARM_B(b_imm(prog->len, ctx)), ctx);
+		case BPF_W:
+		case BPF_H:
+		case BPF_B:
+			emit_a32_mov_i(tmp2[1], imm, false, ctx);
+			emit_str_r(dst_lo, tmp2[1], dstk, off, ctx,
+				   BPF_SIZE(code));
 			break;
-		case BPF_MISC | BPF_TAX:
-			/* X = A */
-			ctx->seen |= SEEN_X;
-			emit(ARM_MOV_R(r_X, r_A), ctx);
+		}
+		break;
+	/* STX XADD: lock *(u32 *)(dst + off) += src */
+	case BPF_STX | BPF_XADD | BPF_W:
+	/* STX XADD: lock *(u64 *)(dst + off) += src */
+	case BPF_STX | BPF_XADD | BPF_DW:
+		goto notyet;
+	/* STX: *(size *)(dst + off) = src */
+	case BPF_STX | BPF_MEM | BPF_W:
+	case BPF_STX | BPF_MEM | BPF_H:
+	case BPF_STX | BPF_MEM | BPF_B:
+	case BPF_STX | BPF_MEM | BPF_DW:
+	{
+		u8 sz = BPF_SIZE(code);
+
+		rn = sstk ? tmp2[1] : src_lo;
+		rm = sstk ? tmp2[0] : src_hi;
+		if (!sstk)
+			goto do_store;
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+		case BPF_H:
+			emit(ARM_LDRH_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+		case BPF_B:
+			emit(ARM_LDRB_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+empty_hi:
+			emit(ARM_EOR_R(rm, rm, rm), ctx);
+			break;
+		case BPF_DW:
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
+			sz = BPF_W;
 			break;
-		case BPF_MISC | BPF_TXA:
-			/* A = X */
-			update_on_xread(ctx);
-			emit(ARM_MOV_R(r_A, r_X), ctx);
+		}
+
+do_store:
+		/* Store the value; a double word takes two word stores */
+		if (BPF_SIZE(code) == BPF_DW) {
+			emit_str_r(dst_lo, rn, dstk, off, ctx, BPF_W);
+			emit_str_r(dst_lo, rm, dstk, off+4, ctx, BPF_W);
+		} else {
+			emit_str_r(dst_lo, rn, dstk, off, ctx, sz);
+		}
+		break;
+	}
+	/* PC += off if dst == src */
+	/* PC += off if dst > src */
+	/* PC += off if dst >= src */
+	/* PC += off if dst != src */
+	/* PC += off if dst > src (signed) */
+	/* PC += off if dst >= src (signed) */
+	/* PC += off if dst & src */
+	case BPF_JMP | BPF_JEQ | BPF_X:
+	case BPF_JMP | BPF_JGT | BPF_X:
+	case BPF_JMP | BPF_JGE | BPF_X:
+	case BPF_JMP | BPF_JNE | BPF_X:
+	case BPF_JMP | BPF_JSGT | BPF_X:
+	case BPF_JMP | BPF_JSGE | BPF_X:
+	case BPF_JMP | BPF_JSET | BPF_X:
+		/* Setup source registers */
+		rm = sstk ? tmp2[0] : src_hi;
+		rn = sstk ? tmp2[1] : src_lo;
+		if (sstk) {
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
+		}
+		goto go_jmp;
+	/* PC += off if dst == imm */
+	/* PC += off if dst > imm */
+	/* PC += off if dst >= imm */
+	/* PC += off if dst != imm */
+	/* PC += off if dst > imm (signed) */
+	/* PC += off if dst >= imm (signed) */
+	/* PC += off if dst & imm */
+	case BPF_JMP | BPF_JEQ | BPF_K:
+	case BPF_JMP | BPF_JGT | BPF_K:
+	case BPF_JMP | BPF_JGE | BPF_K:
+	case BPF_JMP | BPF_JNE | BPF_K:
+	case BPF_JMP | BPF_JSGT | BPF_K:
+	case BPF_JMP | BPF_JSGE | BPF_K:
+	case BPF_JMP | BPF_JSET | BPF_K:
+		if (off == 0)
 			break;
-		case BPF_ANC | SKF_AD_PROTOCOL:
-			/* A = ntohs(skb->protocol) */
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  protocol) != 2);
-			off = offsetof(struct sk_buff, protocol);
-			emit(ARM_LDRH_I(r_scratch, r_skb, off), ctx);
-			emit_swap16(r_A, r_scratch, ctx);
+		rm = tmp2[0];
+		rn = tmp2[1];
+		/* Sign-extend immediate value */
+		emit_a32_mov_i64(true, tmp2, imm, false, ctx);
+go_jmp:
+		/* Setup destination register */
+		rd = dstk ? tmp[0] : dst_hi;
+		rt = dstk ? tmp[1] : dst_lo;
+		if (dstk) {
+			emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+
+		/* Check for the condition */
+		emit_ar_r(rd, rt, rm, rn, ctx, BPF_OP(code));
+
+		/* Setup JUMP instruction */
+		jmp_offset = bpf2a32_offset(i+off, i, ctx);
+		switch (BPF_OP(code)) {
+		case BPF_JNE:
+		case BPF_JSET:
+			_emit(ARM_COND_NE, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_CPU:
-			/* r_scratch = current_thread_info() */
-			OP_IMM3(ARM_BIC, r_scratch, ARM_SP, THREAD_SIZE - 1, ctx);
-			/* A = current_thread_info()->cpu */
-			BUILD_BUG_ON(FIELD_SIZEOF(struct thread_info, cpu) != 4);
-			off = offsetof(struct thread_info, cpu);
-			emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
+		case BPF_JEQ:
+			_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_IFINDEX:
-		case BPF_ANC | SKF_AD_HATYPE:
-			/* A = skb->dev->ifindex */
-			/* A = skb->dev->type */
-			ctx->seen |= SEEN_SKB;
-			off = offsetof(struct sk_buff, dev);
-			emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
-
-			emit(ARM_CMP_I(r_scratch, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-
-			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
-						  ifindex) != 4);
-			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
-						  type) != 2);
-
-			if (code == (BPF_ANC | SKF_AD_IFINDEX)) {
-				off = offsetof(struct net_device, ifindex);
-				emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
-			} else {
-				/*
-				 * offset of field "type" in "struct
-				 * net_device" is above what can be
-				 * used in the ldrh rd, [rn, #imm]
-				 * instruction, so load the offset in
-				 * a register and use ldrh rd, [rn, rm]
-				 */
-				off = offsetof(struct net_device, type);
-				emit_mov_i(ARM_R3, off, ctx);
-				emit(ARM_LDRH_R(r_A, r_scratch, ARM_R3), ctx);
-			}
+		case BPF_JGT:
+			_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_MARK:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
-			off = offsetof(struct sk_buff, mark);
-			emit(ARM_LDR_I(r_A, r_skb, off), ctx);
+		case BPF_JGE:
+			_emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_RXHASH:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, hash) != 4);
-			off = offsetof(struct sk_buff, hash);
-			emit(ARM_LDR_I(r_A, r_skb, off), ctx);
+		case BPF_JSGT:
+			_emit(ARM_COND_LT, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_VLAN_TAG:
-		case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2);
-			off = offsetof(struct sk_buff, vlan_tci);
-			emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
-			if (code == (BPF_ANC | SKF_AD_VLAN_TAG))
-				OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx);
-			else {
-				OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx);
-				OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
-			}
+		case BPF_JSGE:
+			_emit(ARM_COND_GE, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_PKTTYPE:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  __pkt_type_offset[0]) != 1);
-			off = PKT_TYPE_OFFSET();
-			emit(ARM_LDRB_I(r_A, r_skb, off), ctx);
-			emit(ARM_AND_I(r_A, r_A, PKT_TYPE_MAX), ctx);
-#ifdef __BIG_ENDIAN_BITFIELD
-			emit(ARM_LSR_I(r_A, r_A, 5), ctx);
-#endif
+		}
+		break;
+	/* JMP OFF */
+	case BPF_JMP | BPF_JA:
+	{
+		if (off == 0)
 			break;
-		case BPF_ANC | SKF_AD_QUEUE:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  queue_mapping) != 2);
-			BUILD_BUG_ON(offsetof(struct sk_buff,
-					      queue_mapping) > 0xff);
-			off = offsetof(struct sk_buff, queue_mapping);
-			emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
+		jmp_offset = bpf2a32_offset(i+off, i, ctx);
+		check_imm24(jmp_offset);
+		emit(ARM_B(jmp_offset), ctx);
+		break;
+	}
+	/* tail call */
+	case BPF_JMP | BPF_CALL | BPF_X:
+		if (emit_bpf_tail_call(ctx))
+			return -EFAULT;
+		break;
+	/* function call */
+	case BPF_JMP | BPF_CALL:
+	{
+		const u8 *r0 = bpf2a32[BPF_REG_0];
+		const u8 *r1 = bpf2a32[BPF_REG_1];
+		const u8 *r2 = bpf2a32[BPF_REG_2];
+		const u8 *r3 = bpf2a32[BPF_REG_3];
+		const u8 *r4 = bpf2a32[BPF_REG_4];
+		const u8 *r5 = bpf2a32[BPF_REG_5];
+		const u32 func = (u32)__bpf_call_base + imm;
+
+		emit_a32_mov_r64(true, r0, r1, false, false, ctx);
+		emit_a32_mov_r64(true, r1, r2, false, true, ctx);
+		emit_push_r64(r5, 0, ctx);
+		emit_push_r64(r4, 8, ctx);
+		emit_push_r64(r3, 16, ctx);
+
+		emit_a32_mov_i(tmp[1], func, false, ctx);
+		emit_blx_r(tmp[1], ctx);
+		/* The caller pops the three stacked 64-bit arguments */
+		emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(24)), ctx);
+		break;
+	}
+	/* function return */
+	case BPF_JMP | BPF_EXIT:
+		/* Optimization: when last instruction is EXIT
+		 * simply fallthrough to epilogue.
+		 */
+		if (i == ctx->prog->len - 1)
 			break;
-		case BPF_ANC | SKF_AD_PAY_OFFSET:
-			ctx->seen |= SEEN_SKB | SEEN_CALL;
+		jmp_offset = epilogue_offset(ctx);
+		check_imm24(jmp_offset);
+		emit(ARM_B(jmp_offset), ctx);
+		break;
+notyet:
+		pr_info_once("*** NOT YET: opcode %02x ***\n", code);
+		return -EFAULT;
+	default:
+		pr_err_once("unknown opcode %02x\n", code);
+		return -EINVAL;
+	}
 
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			emit_mov_i(ARM_R3, (unsigned int)skb_get_poff, ctx);
-			emit_blx_r(ARM_R3, ctx);
-			emit(ARM_MOV_R(r_A, ARM_R0), ctx);
-			break;
-		case BPF_LDX | BPF_W | BPF_ABS:
-			/*
-			 * load a 32bit word from struct seccomp_data.
-			 * seccomp_check_filter() will already have checked
-			 * that k is 32bit aligned and lies within the
-			 * struct seccomp_data.
-			 */
-			ctx->seen |= SEEN_SKB;
-			emit(ARM_LDR_I(r_A, r_skb, k), ctx);
-			break;
-		default:
-			return -1;
+	if (ctx->flags & FLAG_IMM_OVERFLOW)
+		/*
+		 * this instruction generated an overflow when
+		 * trying to access the literal pool, so
+		 * delegate this filter to the kernel interpreter.
+		 */
+		return -1;
+	return 0;
+}
+
+static int build_body(struct jit_ctx *ctx)
+{
+	const struct bpf_prog *prog = ctx->prog;
+	unsigned int i;
+
+	for (i = 0; i < prog->len; i++) {
+		const struct bpf_insn *insn = &(prog->insnsi[i]);
+		int ret;
+
+		ret = build_insn(insn, ctx);
+
+		/* ret > 0: a 64-bit immediate load used two insn slots */
+		if (ret > 0) {
+			i++;
+			if (ctx->target == NULL)
+				ctx->offsets[i] = ctx->idx;
+			continue;
 		}
 
-		if (ctx->flags & FLAG_IMM_OVERFLOW)
-			/*
-			 * this instruction generated an overflow when
-			 * trying to access the literal pool, so
-			 * delegate this filter to the kernel interpreter.
-			 */
-			return -1;
+		if (ctx->target == NULL)
+			ctx->offsets[i] = ctx->idx;
+
+		/* If unsuccessful, return with error code */
+		if (ret)
+			return ret;
 	}
+	return 0;
+}
 
-	/* compute offsets only during the first pass */
-	if (ctx->target == NULL)
-		ctx->offsets[i] = ctx->idx * 4;
+static int validate_code(struct jit_ctx *ctx)
+{
+	int i;
+
+	for (i = 0; i < ctx->idx; i++) {
+		if (ctx->target[i] == __opcode_to_mem_arm(ARM_INST_UDF))
+			return -1;
+	}
 
 	return 0;
 }
 
+void bpf_jit_compile(struct bpf_prog *prog)
+{
+	/* Nothing to do here. We support Internal BPF. */
+}
 
-void bpf_jit_compile(struct bpf_prog *fp)
+struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
+#ifdef __LITTLE_ENDIAN
+	struct bpf_prog *tmp, *orig_prog = prog;
 	struct bpf_binary_header *header;
+	bool tmp_blinded = false;
 	struct jit_ctx ctx;
-	unsigned tmp_idx;
-	unsigned alloc_size;
-	u8 *target_ptr;
+	unsigned int tmp_idx;
+	unsigned int image_size;
+	u8 *image_ptr;
 
+	/* If BPF JIT was not enabled then we must fall back to
+	 * the interpreter.
+	 */
 	if (!bpf_jit_enable)
-		return;
+		return orig_prog;
 
-	memset(&ctx, 0, sizeof(ctx));
-	ctx.skf		= fp;
-	ctx.ret0_fp_idx = -1;
+	/* If constant blinding was enabled and we failed during blinding
+	 * then we must fall back to the interpreter. Otherwise, we save
+	 * the new JITed code.
+	 */
+	tmp = bpf_jit_blind_constants(prog);
 
-	ctx.offsets = kzalloc(4 * (ctx.skf->len + 1), GFP_KERNEL);
-	if (ctx.offsets == NULL)
-		return;
+	if (IS_ERR(tmp))
+		return orig_prog;
+	if (tmp != prog) {
+		tmp_blinded = true;
+		prog = tmp;
+	}
 
-	/* fake pass to fill in the ctx->seen */
-	if (unlikely(build_body(&ctx)))
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.prog = prog;
+
+	/* If memory for offsets[] cannot be allocated, we must
+	 * fall back to the interpreter.
+	 */
+	ctx.offsets = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
+	if (ctx.offsets == NULL) {
+		prog = orig_prog;
 		goto out;
+	}
+
+	/* 1.) Fake pass to find the length of the JITed code and
+	 * to compute ctx->offsets and the other context variables
+	 * needed to generate the final JITed code.
+	 * Also, calculate the random starting pointer of the JITed code,
+	 * which is prefixed by a random number of fault instructions.
+	 *
+	 * If the first pass fails then there is no chance of the
+	 * second pass succeeding, so just fall back
+	 * to the interpreter.
+	 */
+	if (build_body(&ctx)) {
+		prog = orig_prog;
+		goto out_off;
+	}
 
 	tmp_idx = ctx.idx;
 	build_prologue(&ctx);
 	ctx.prologue_bytes = (ctx.idx - tmp_idx) * 4;
 
+	ctx.epilogue_offset = ctx.idx;
+
 #if __LINUX_ARM_ARCH__ < 7
 	tmp_idx = ctx.idx;
 	build_epilogue(&ctx);
@@ -1021,64 +1863,96 @@ void bpf_jit_compile(struct bpf_prog *fp)
 
 	ctx.idx += ctx.imm_count;
 	if (ctx.imm_count) {
-		ctx.imms = kzalloc(4 * ctx.imm_count, GFP_KERNEL);
-		if (ctx.imms == NULL)
-			goto out;
+		ctx.imms = kcalloc(ctx.imm_count, sizeof(u32), GFP_KERNEL);
+		if (ctx.imms == NULL) {
+			prog = orig_prog;
+			goto out_off;
+		}
 	}
 #else
-	/* there's nothing after the epilogue on ARMv7 */
+	/* there's nothing after the epilogue on ARMv7 */
 	build_epilogue(&ctx);
 #endif
-	alloc_size = 4 * ctx.idx;
-	header = bpf_jit_binary_alloc(alloc_size, &target_ptr,
-				      4, jit_fill_hole);
-	if (header == NULL)
-		goto out;
+	/* Now we can get the actual image size of the JITed ARM code.
+	 * Currently, we do not emit Thumb-2 instructions, although
+	 * doing so could decrease the size of the image.
+	 *
+	 * As each ARM instruction is 32 bits long, we translate the
+	 * number of JITed instructions into the number of bytes
+	 * required to store the JITed code.
+	 */
+	image_size = sizeof(u32) * ctx.idx;
+
+	/* Now we know the size of the structure to make */
+	header = bpf_jit_binary_alloc(image_size, &image_ptr,
+				      sizeof(u32), jit_fill_hole);
+	/* If memory for the image cannot be allocated, we must
+	 * fall back to the interpreter.
+	 */
+	if (header == NULL) {
+		prog = orig_prog;
+		goto out_imms;
+	}
 
-	ctx.target = (u32 *) target_ptr;
+	/* 2.) Actual pass to generate final JIT code */
+	ctx.target = (u32 *) image_ptr;
 	ctx.idx = 0;
 
 	build_prologue(&ctx);
+
+	/* If building the body of the JITed code fails somehow,
+	 * we fall back to the interpreter.
+	 */
 	if (build_body(&ctx) < 0) {
-#if __LINUX_ARM_ARCH__ < 7
-		if (ctx.imm_count)
-			kfree(ctx.imms);
-#endif
+		image_ptr = NULL;
 		bpf_jit_binary_free(header);
-		goto out;
+		prog = orig_prog;
+		goto out_imms;
 	}
 	build_epilogue(&ctx);
 
+	/* 3.) Extra pass to validate JITed Code */
+	if (validate_code(&ctx)) {
+		image_ptr = NULL;
+		bpf_jit_binary_free(header);
+		prog = orig_prog;
+		goto out_imms;
+	}
 	flush_icache_range((u32)header, (u32)(ctx.target + ctx.idx));
 
-#if __LINUX_ARM_ARCH__ < 7
-	if (ctx.imm_count)
-		kfree(ctx.imms);
-#endif
-
 	if (bpf_jit_enable > 1)
 		/* there are 2 passes here */
-		bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
+		bpf_jit_dump(prog->len, image_size, 2, ctx.target);
 
 	set_memory_ro((unsigned long)header, header->pages);
-	fp->bpf_func = (void *)ctx.target;
-	fp->jited = 1;
-out:
+	prog->bpf_func = (void *)ctx.target;
+	prog->jited = 1;
+out_imms:
+#if __LINUX_ARM_ARCH__ < 7
+	if (ctx.imm_count)
+		kfree(ctx.imms);
+#endif
+out_off:
 	kfree(ctx.offsets);
-	return;
+out:
+	if (tmp_blinded)
+		bpf_jit_prog_release_other(prog, prog == orig_prog ?
+					   tmp : orig_prog);
+#endif /* __LITTLE_ENDIAN */
+	return prog;
 }
 
-void bpf_jit_free(struct bpf_prog *fp)
+void bpf_jit_free(struct bpf_prog *prog)
 {
-	unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
+	unsigned long addr = (unsigned long)prog->bpf_func & PAGE_MASK;
 	struct bpf_binary_header *header = (void *)addr;
 
-	if (!fp->jited)
+	if (!prog->jited)
 		goto free_filter;
 
 	set_memory_rw(addr, header->pages);
 	bpf_jit_binary_free(header);
 
 free_filter:
-	bpf_prog_unlock_free(fp);
+	bpf_prog_unlock_free(prog);
 }
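
The two-pass scheme that bpf_int_jit_compile() follows above (a sizing
pass with ctx.target == NULL, then a real pass into the allocated image)
can be illustrated with a small userspace model. This is only a sketch
under stated assumptions: struct toy_ctx, toy_emit(), toy_build_body()
and toy_jit() are hypothetical names, and the "each toy instruction
expands to (i % 3) + 1 words" rule is made up for the illustration. It is
not the patch's code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the patch's struct jit_ctx. */
struct toy_ctx {
	uint32_t *target;	/* NULL during the sizing pass */
	unsigned int idx;	/* number of 32-bit words emitted so far */
	unsigned int *offsets;	/* offsets[i] = idx after toy insn i */
};

/* Always count the word; store it only when a buffer exists. */
static void toy_emit(struct toy_ctx *ctx, uint32_t word)
{
	if (ctx->target != NULL)
		ctx->target[ctx->idx] = word;
	ctx->idx++;
}

/* Pretend toy instruction i expands to (i % 3) + 1 machine words. */
static void toy_build_body(struct toy_ctx *ctx, unsigned int len)
{
	unsigned int i, n;

	for (i = 0; i < len; i++) {
		for (n = 0; n <= i % 3; n++)
			toy_emit(ctx, 0xe1a00000u);	/* mov r0, r0 */
		/* offsets are recorded during the first pass only */
		if (ctx->target == NULL)
			ctx->offsets[i] = ctx->idx;
	}
}

/* Returns the number of words generated, or 0 on allocation failure. */
unsigned int toy_jit(unsigned int len, uint32_t **image)
{
	struct toy_ctx ctx = { NULL, 0, NULL };
	unsigned int words;

	ctx.offsets = calloc(len, sizeof(*ctx.offsets));
	if (ctx.offsets == NULL)
		return 0;

	toy_build_body(&ctx, len);	/* pass 1: sizes and offsets */
	words = ctx.idx;

	ctx.target = calloc(words, sizeof(uint32_t));
	if (ctx.target == NULL) {
		free(ctx.offsets);
		return 0;
	}

	ctx.idx = 0;
	toy_build_body(&ctx, len);	/* pass 2: real emission */
	assert(ctx.idx == words);	/* both passes must agree */

	free(ctx.offsets);
	*image = ctx.target;
	return words;
}
```

The point of the model is that the emit helper advances idx in both
passes, so the size computed by the first pass is exactly the size the
second pass fills.
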
diff --git a/arch/arm/net/bpf_jit_32.h b/arch/arm/net/bpf_jit_32.h
index c46fca2..d5cf5f6 100644
--- a/arch/arm/net/bpf_jit_32.h
+++ b/arch/arm/net/bpf_jit_32.h
@@ -11,6 +11,7 @@
 #ifndef PFILTER_OPCODES_ARM_H
 #define PFILTER_OPCODES_ARM_H
 
+/* ARM 32bit Registers */
 #define ARM_R0	0
 #define ARM_R1	1
 #define ARM_R2	2
@@ -22,38 +23,43 @@
 #define ARM_R8	8
 #define ARM_R9	9
 #define ARM_R10	10
-#define ARM_FP	11
-#define ARM_IP	12
-#define ARM_SP	13
-#define ARM_LR	14
-#define ARM_PC	15
-
-#define ARM_COND_EQ		0x0
-#define ARM_COND_NE		0x1
-#define ARM_COND_CS		0x2
+#define ARM_FP	11	/* Frame Pointer */
+#define ARM_IP	12	/* Intra-procedure scratch register */
+#define ARM_SP	13	/* Stack pointer: as load/store base reg */
+#define ARM_LR	14	/* Link Register */
+#define ARM_PC	15	/* Program counter */
+
+#define ARM_COND_EQ		0x0	/* == */
+#define ARM_COND_NE		0x1	/* != */
+#define ARM_COND_CS		0x2	/* unsigned >= */
 #define ARM_COND_HS		ARM_COND_CS
-#define ARM_COND_CC		0x3
+#define ARM_COND_CC		0x3	/* unsigned < */
 #define ARM_COND_LO		ARM_COND_CC
-#define ARM_COND_MI		0x4
-#define ARM_COND_PL		0x5
-#define ARM_COND_VS		0x6
-#define ARM_COND_VC		0x7
-#define ARM_COND_HI		0x8
-#define ARM_COND_LS		0x9
-#define ARM_COND_GE		0xa
-#define ARM_COND_LT		0xb
-#define ARM_COND_GT		0xc
-#define ARM_COND_LE		0xd
-#define ARM_COND_AL		0xe
+#define ARM_COND_MI		0x4	/* < 0 */
+#define ARM_COND_PL		0x5	/* >= 0 */
+#define ARM_COND_VS		0x6	/* Signed Overflow */
+#define ARM_COND_VC		0x7	/* No Signed Overflow */
+#define ARM_COND_HI		0x8	/* unsigned > */
+#define ARM_COND_LS		0x9	/* unsigned <= */
+#define ARM_COND_GE		0xa	/* Signed >= */
+#define ARM_COND_LT		0xb	/* Signed < */
+#define ARM_COND_GT		0xc	/* Signed > */
+#define ARM_COND_LE		0xd	/* Signed <= */
+#define ARM_COND_AL		0xe	/* Always (unconditional) */
 
 /* register shift types */
 #define SRTYPE_LSL		0
 #define SRTYPE_LSR		1
 #define SRTYPE_ASR		2
 #define SRTYPE_ROR		3
+#define SRTYPE_ASL		(SRTYPE_LSL)
 
 #define ARM_INST_ADD_R		0x00800000
+#define ARM_INST_ADDS_R		0x00900000
+#define ARM_INST_ADC_R		0x00a00000
+#define ARM_INST_ADC_I		0x02a00000
 #define ARM_INST_ADD_I		0x02800000
+#define ARM_INST_ADDS_I		0x02900000
 
 #define ARM_INST_AND_R		0x00000000
 #define ARM_INST_AND_I		0x02000000
@@ -76,8 +82,10 @@
 #define ARM_INST_LDRH_I		0x01d000b0
 #define ARM_INST_LDRH_R		0x019000b0
 #define ARM_INST_LDR_I		0x05900000
+#define ARM_INST_LDR_R		0x07900000
 
 #define ARM_INST_LDM		0x08900000
+#define ARM_INST_LDM_IA		0x08b00000
 
 #define ARM_INST_LSL_I		0x01a00000
 #define ARM_INST_LSL_R		0x01a00010
@@ -86,6 +94,7 @@
 #define ARM_INST_LSR_R		0x01a00030
 
 #define ARM_INST_MOV_R		0x01a00000
+#define ARM_INST_MOVS_R		0x01b00000
 #define ARM_INST_MOV_I		0x03a00000
 #define ARM_INST_MOVW		0x03000000
 #define ARM_INST_MOVT		0x03400000
@@ -96,17 +105,28 @@
 #define ARM_INST_PUSH		0x092d0000
 
 #define ARM_INST_ORR_R		0x01800000
+#define ARM_INST_ORRS_R		0x01900000
 #define ARM_INST_ORR_I		0x03800000
 
 #define ARM_INST_REV		0x06bf0f30
 #define ARM_INST_REV16		0x06bf0fb0
 
 #define ARM_INST_RSB_I		0x02600000
+#define ARM_INST_RSBS_I		0x02700000
+#define ARM_INST_RSC_I		0x02e00000
 
 #define ARM_INST_SUB_R		0x00400000
+#define ARM_INST_SUBS_R		0x00500000
+#define ARM_INST_RSB_R		0x00600000
 #define ARM_INST_SUB_I		0x02400000
+#define ARM_INST_SUBS_I		0x02500000
+#define ARM_INST_SBC_I		0x02c00000
+#define ARM_INST_SBC_R		0x00c00000
+#define ARM_INST_SBCS_R		0x00d00000
 
 #define ARM_INST_STR_I		0x05800000
+#define ARM_INST_STRB_I		0x05c00000
+#define ARM_INST_STRH_I		0x01c000b0
 
 #define ARM_INST_TST_R		0x01100000
 #define ARM_INST_TST_I		0x03100000
@@ -117,6 +137,8 @@
 
 #define ARM_INST_MLS		0x00600090
 
+#define ARM_INST_UXTH		0x06ff0070
+
 /*
  * Use a suitable undefined instruction to use for ARM/Thumb2 faulting.
  * We need to be careful not to conflict with those used by other modules
@@ -135,9 +157,15 @@
 #define _AL3_R(op, rd, rn, rm)	((op ## _R) | (rd) << 12 | (rn) << 16 | (rm))
 /* immediate */
 #define _AL3_I(op, rd, rn, imm)	((op ## _I) | (rd) << 12 | (rn) << 16 | (imm))
+/* register with register-shift */
+#define _AL3_SR(inst)	(inst | (1 << 4))
 
 #define ARM_ADD_R(rd, rn, rm)	_AL3_R(ARM_INST_ADD, rd, rn, rm)
+#define ARM_ADDS_R(rd, rn, rm)	_AL3_R(ARM_INST_ADDS, rd, rn, rm)
 #define ARM_ADD_I(rd, rn, imm)	_AL3_I(ARM_INST_ADD, rd, rn, imm)
+#define ARM_ADDS_I(rd, rn, imm)	_AL3_I(ARM_INST_ADDS, rd, rn, imm)
+#define ARM_ADC_R(rd, rn, rm)	_AL3_R(ARM_INST_ADC, rd, rn, rm)
+#define ARM_ADC_I(rd, rn, imm)	_AL3_I(ARM_INST_ADC, rd, rn, imm)
 
 #define ARM_AND_R(rd, rn, rm)	_AL3_R(ARM_INST_AND, rd, rn, rm)
 #define ARM_AND_I(rd, rn, imm)	_AL3_I(ARM_INST_AND, rd, rn, imm)
@@ -156,7 +184,9 @@
 #define ARM_EOR_I(rd, rn, imm)	_AL3_I(ARM_INST_EOR, rd, rn, imm)
 
 #define ARM_LDR_I(rt, rn, off)	(ARM_INST_LDR_I | (rt) << 12 | (rn) << 16 \
-				 | (off))
+				 | ((off) & 0xfff))
+#define ARM_LDR_R(rt, rn, rm)	(ARM_INST_LDR_R | (rt) << 12 | (rn) << 16 \
+				 | (rm))
 #define ARM_LDRB_I(rt, rn, off)	(ARM_INST_LDRB_I | (rt) << 12 | (rn) << 16 \
 				 | (off))
 #define ARM_LDRB_R(rt, rn, rm)	(ARM_INST_LDRB_R | (rt) << 12 | (rn) << 16 \
@@ -167,15 +197,23 @@
 				 | (rm))
 
 #define ARM_LDM(rn, regs)	(ARM_INST_LDM | (rn) << 16 | (regs))
+#define ARM_LDM_IA(rn, regs)	(ARM_INST_LDM_IA | (rn) << 16 | (regs))
 
 #define ARM_LSL_R(rd, rn, rm)	(_AL3_R(ARM_INST_LSL, rd, 0, rn) | (rm) << 8)
 #define ARM_LSL_I(rd, rn, imm)	(_AL3_I(ARM_INST_LSL, rd, 0, rn) | (imm) << 7)
 
 #define ARM_LSR_R(rd, rn, rm)	(_AL3_R(ARM_INST_LSR, rd, 0, rn) | (rm) << 8)
 #define ARM_LSR_I(rd, rn, imm)	(_AL3_I(ARM_INST_LSR, rd, 0, rn) | (imm) << 7)
+#define ARM_ASR_R(rd, rn, rm)   (_AL3_R(ARM_INST_ASR, rd, 0, rn) | (rm) << 8)
+#define ARM_ASR_I(rd, rn, imm)  (_AL3_I(ARM_INST_ASR, rd, 0, rn) | (imm) << 7)
 
 #define ARM_MOV_R(rd, rm)	_AL3_R(ARM_INST_MOV, rd, 0, rm)
+#define ARM_MOVS_R(rd, rm)	_AL3_R(ARM_INST_MOVS, rd, 0, rm)
 #define ARM_MOV_I(rd, imm)	_AL3_I(ARM_INST_MOV, rd, 0, imm)
+#define ARM_MOV_SR(rd, rm, type, rs)	\
+	(_AL3_SR(ARM_MOV_R(rd, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_MOV_SI(rd, rm, type, imm6)	\
+	(ARM_MOV_R(rd, rm) | (type) << 5 | (imm6) << 7)
 
 #define ARM_MOVW(rd, imm)	\
 	(ARM_INST_MOVW | ((imm) >> 12) << 16 | (rd) << 12 | ((imm) & 0x0fff))
@@ -190,19 +228,38 @@
 
 #define ARM_ORR_R(rd, rn, rm)	_AL3_R(ARM_INST_ORR, rd, rn, rm)
 #define ARM_ORR_I(rd, rn, imm)	_AL3_I(ARM_INST_ORR, rd, rn, imm)
-#define ARM_ORR_S(rd, rn, rm, type, rs)	\
-	(ARM_ORR_R(rd, rn, rm) | (type) << 5 | (rs) << 7)
+#define ARM_ORR_SR(rd, rn, rm, type, rs)	\
+	(_AL3_SR(ARM_ORR_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_ORRS_R(rd, rn, rm)	_AL3_R(ARM_INST_ORRS, rd, rn, rm)
+#define ARM_ORRS_SR(rd, rn, rm, type, rs)	\
+	(_AL3_SR(ARM_ORRS_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_ORR_SI(rd, rn, rm, type, imm6)	\
+	(ARM_ORR_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
+#define ARM_ORRS_SI(rd, rn, rm, type, imm6)	\
+	(ARM_ORRS_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
 
 #define ARM_REV(rd, rm)		(ARM_INST_REV | (rd) << 12 | (rm))
 #define ARM_REV16(rd, rm)	(ARM_INST_REV16 | (rd) << 12 | (rm))
 
 #define ARM_RSB_I(rd, rn, imm)	_AL3_I(ARM_INST_RSB, rd, rn, imm)
+#define ARM_RSBS_I(rd, rn, imm)	_AL3_I(ARM_INST_RSBS, rd, rn, imm)
+#define ARM_RSC_I(rd, rn, imm)	_AL3_I(ARM_INST_RSC, rd, rn, imm)
 
 #define ARM_SUB_R(rd, rn, rm)	_AL3_R(ARM_INST_SUB, rd, rn, rm)
+#define ARM_SUBS_R(rd, rn, rm)	_AL3_R(ARM_INST_SUBS, rd, rn, rm)
+#define ARM_RSB_R(rd, rn, rm)	_AL3_R(ARM_INST_RSB, rd, rn, rm)
+#define ARM_SBC_R(rd, rn, rm)	_AL3_R(ARM_INST_SBC, rd, rn, rm)
+#define ARM_SBCS_R(rd, rn, rm)	_AL3_R(ARM_INST_SBCS, rd, rn, rm)
 #define ARM_SUB_I(rd, rn, imm)	_AL3_I(ARM_INST_SUB, rd, rn, imm)
+#define ARM_SUBS_I(rd, rn, imm)	_AL3_I(ARM_INST_SUBS, rd, rn, imm)
+#define ARM_SBC_I(rd, rn, imm)	_AL3_I(ARM_INST_SBC, rd, rn, imm)
 
 #define ARM_STR_I(rt, rn, off)	(ARM_INST_STR_I | (rt) << 12 | (rn) << 16 \
-				 | (off))
+				 | ((off) & 0xfff))
+#define ARM_STRH_I(rt, rn, off)	(ARM_INST_STRH_I | (rt) << 12 | (rn) << 16 \
+				 | (((off) & 0xf0) << 4) | ((off) & 0xf))
+#define ARM_STRB_I(rt, rn, off)	(ARM_INST_STRB_I | (rt) << 12 | (rn) << 16 \
+				 | ((off) & 0xfff))
 
 #define ARM_TST_R(rn, rm)	_AL3_R(ARM_INST_TST, 0, rn, rm)
 #define ARM_TST_I(rn, imm)	_AL3_I(ARM_INST_TST, 0, rn, imm)
@@ -214,5 +271,6 @@
 
 #define ARM_MLS(rd, rn, rm, ra)	(ARM_INST_MLS | (rd) << 16 | (rn) | (rm) << 8 \
 				 | (ra) << 12)
+#define ARM_UXTH(rd, rm)	(ARM_INST_UXTH | (rd) << 12 | (rm))
 
 #endif /* PFILTER_OPCODES_ARM_H */
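
The store macros in the header above use two different immediate
layouts: STR (word) carries one contiguous 12-bit offset, while STRH
(halfword) carries an 8-bit offset split into two nibbles. A minimal
userspace sketch of both layouts follows. The opcode templates are the
values from the header; the function names enc_str_imm12(),
enc_strh_imm8() and with_cond() are hypothetical, and with_cond() only
models how the condition code is OR'ed into the top four bits.

```c
#include <stdint.h>

/* Opcode templates, same values as in the header of this patch. */
#define INST_STR_I	0x05800000u
#define INST_STRH_I	0x01c000b0u
#define COND_AL		0xeu

/* Word form: one contiguous 12-bit offset field in bits [11:0]. */
uint32_t enc_str_imm12(uint32_t rt, uint32_t rn, uint32_t off)
{
	return INST_STR_I | rt << 12 | rn << 16 | (off & 0xfff);
}

/* Halfword form: 8-bit offset split into bits [11:8] and [3:0]. */
uint32_t enc_strh_imm8(uint32_t rt, uint32_t rn, uint32_t off)
{
	return INST_STRH_I | rt << 12 | rn << 16 |
	       ((off & 0xf0) << 4) | (off & 0xf);
}

/* The condition code occupies the top four bits of the instruction. */
uint32_t with_cond(uint32_t cond, uint32_t inst)
{
	return cond << 28 | inst;
}
```

With these helpers, "str r1, [r2, #0x123]" comes out as 0xe5821123 and
"strh r1, [r2, #0x12]" as 0xe1c211b2.
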
-- 
2.7.4
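
The flag-setting and carry-consuming opcodes this patch adds to the
header (ADDS/ADC, SUBS/SBC) are the usual building blocks for 64-bit
arithmetic on a 32-bit core. The JIT's own ALU helpers are not shown in
this part of the thread, so the following is only a C model of the idea,
assuming a 64-bit add is split into a low-half ADDS and a high-half ADC.
struct reg64 and add64_pair() are hypothetical names.

```c
#include <stdint.h>

/* One 64-bit eBPF register modelled as two 32-bit halves. */
struct reg64 {
	uint32_t lo;
	uint32_t hi;
};

/* dst += src, computed the way an ADDS/ADC pair would compute it. */
struct reg64 add64_pair(struct reg64 dst, struct reg64 src)
{
	uint32_t lo = dst.lo + src.lo;		/* ADDS: low halves */
	uint32_t carry = lo < dst.lo;		/* C flag: unsigned wrap */

	dst.hi = dst.hi + src.hi + carry;	/* ADC: high halves + carry */
	dst.lo = lo;
	return dst;
}
```

Subtraction follows the same pattern with SUBS on the low halves and SBC
on the high halves.
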


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-06-23 22:39                         ` Shubham Bansal
  (?)
@ 2017-07-05 22:11                           ` Kees Cook
  -1 siblings, 0 replies; 87+ messages in thread
From: Kees Cook @ 2017-07-05 22:11 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: Daniel Borkmann, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

On Fri, Jun 23, 2017 at 3:39 PM, Shubham Bansal
<illusionist.neo@gmail.com> wrote:
> Hi Russell,Daniel and Kees,
>
> I am attaching the latest patch with this mail. It included support
> for BPF_CALL | BPF_JMP tested with and without constant blinding on
> ARMv7 machine.
> Due to the limitation on my machine I can't test the tail call. It
> would be a great help if any of you could help me with this.

Is this just a matter of running test_bpf?

Have you been able to debootstrap a debian chroot for ARMv7?

> Its been a long time since this patch is in works, Russell, can you
> please help with sending this patch to ARM patch tracker?

If some other folks can Ack this, I can throw it at the patch tracker
for you. I'll report back on my findings.

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-07-05 22:11                           ` Kees Cook
  (?)
@ 2017-07-05 22:38                             ` Kees Cook
  -1 siblings, 0 replies; 87+ messages in thread
From: Kees Cook @ 2017-07-05 22:38 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: Daniel Borkmann, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

On Wed, Jul 5, 2017 at 3:11 PM, Kees Cook <keescook@chromium.org> wrote:
> On Fri, Jun 23, 2017 at 3:39 PM, Shubham Bansal
> <illusionist.neo@gmail.com> wrote:
>> Hi Russell,Daniel and Kees,
>>
>> I am attaching the latest patch with this mail. It included support
>> for BPF_CALL | BPF_JMP tested with and without constant blinding on
>> ARMv7 machine.
>> Due to the limitation on my machine I can't test the tail call. It
>> would be a great help if any of you could help me with this.
>
> Is this just a matter of running test_bpf?

If so:

Tested-by: Kees Cook <keescook@chromium.org>

test_bpf: Summary: 316 PASSED, 0 FAILED, [287/308 JIT'ed]

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-07-05 22:38                             ` Kees Cook
@ 2017-07-06  3:49                               ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-07-06  3:49 UTC (permalink / raw)
  To: Kees Cook
  Cc: Daniel Borkmann, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

[-- Attachment #1: Type: text/plain, Size: 460 bytes --]

Hi Kees,

The problem is that my ARM machine doesn't have clang and iproute2, which
is keeping me from testing the bpf tail calls.

You should do the following to test it:

1. cd tools/testing/selftests/bpf/
2. make
3. sudo ./test_progs

Before testing, you have to run "make headers_install".
These tests exercise tail calls with the attached patch. If it's too
much work, can you please upload your ARM image so that I can test it?
I just need a good machine.

-Shubham

[-- Attachment #2: 0001-Added-Support-for-BPF_CALL-BPF_JMP.patch --]
[-- Type: application/octet-stream, Size: 87846 bytes --]

From 502dd777765a982ce1b479ee01911fa6fe023a76 Mon Sep 17 00:00:00 2001
From: Shubham Bansal <illusionist.neo@gmail.com>
Date: Sat, 24 Jun 2017 04:03:37 +0530
Subject: [PATCH] Added Support for BPF_CALL | BPF_JMP.

---
 arch/arm/Kconfig          |    2 +-
 arch/arm/net/bpf_jit_32.c | 2430 ++++++++++++++++++++++++++++++---------------
 arch/arm/net/bpf_jit_32.h |  108 +-
 3 files changed, 1736 insertions(+), 804 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 4c1a35f..53bf116 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -48,7 +48,7 @@ config ARM
 	select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
-	select HAVE_CBPF_JIT
+	select HAVE_EBPF_JIT
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CONTEXT_TRACKING
 	select HAVE_C_RECORDMCOUNT
diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index d5b9fa1..8b8ddc4 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -1,13 +1,15 @@
 /*
- * Just-In-Time compiler for BPF filters on 32bit ARM
+ * Just-In-Time compiler for eBPF filters on 32bit ARM
  *
  * Copyright (c) 2011 Mircea Gherzan <mgherzan@gmail.com>
+ * Copyright (c) 2017 Shubham Bansal <illusionist.neo@gmail.com>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License as published by the
  * Free Software Foundation; version 2 of the License.
  */
 
+#include <linux/bpf.h>
 #include <linux/bitops.h>
 #include <linux/compiler.h>
 #include <linux/errno.h>
@@ -18,50 +20,96 @@
 #include <linux/if_vlan.h>
 
 #include <asm/cacheflush.h>
-#include <asm/set_memory.h>
 #include <asm/hwcap.h>
 #include <asm/opcodes.h>
 
 #include "bpf_jit_32.h"
 
+int bpf_jit_enable __read_mostly;
+
+#define STACK_OFFSET(k)	(k)
+#define TMP_REG_1	(MAX_BPF_JIT_REG + 0)	/* TEMP Register 1 */
+#define TMP_REG_2	(MAX_BPF_JIT_REG + 1)	/* TEMP Register 2 */
+#define TCALL_CNT	(MAX_BPF_JIT_REG + 2)	/* Tail Call Count */
+
+/* Flags used for JIT optimization */
+#define SEEN_CALL	(1 << 0)
+
+#define FLAG_IMM_OVERFLOW	(1 << 0)
+
 /*
- * ABI:
+ * Map eBPF registers to ARM 32bit registers or stack scratch space.
+ *
+ * 1. The first argument is passed in ARM 32-bit registers and the rest of
+ * the arguments are passed in stack scratch space.
+ * 2. The first callee-saved register is mapped to ARM 32-bit registers and
+ * the rest are mapped to scratch space on the stack.
+ * 3. We need two 64-bit temp registers to do complex operations on eBPF
+ * registers.
+ *
+ * As the eBPF registers are all 64-bit and ARM has only 32-bit registers,
+ * each eBPF register is mapped to a pair of ARM 32-bit registers or to
+ * scratch memory, and the 64-bit eBPF value is built from those two halves.
  *
- * r0	scratch register
- * r4	BPF register A
- * r5	BPF register X
- * r6	pointer to the skb
- * r7	skb->data
- * r8	skb_headlen(skb)
  */
+static const u8 bpf2a32[][2] = {
+	/* return value from in-kernel function, and exit value from eBPF */
+	[BPF_REG_0] = {ARM_R1, ARM_R0},
+	/* arguments from eBPF program to in-kernel function */
+	[BPF_REG_1] = {ARM_R3, ARM_R2},
+	/* Stored on stack scratch space */
+	[BPF_REG_2] = {STACK_OFFSET(0), STACK_OFFSET(4)},
+	[BPF_REG_3] = {STACK_OFFSET(8), STACK_OFFSET(12)},
+	[BPF_REG_4] = {STACK_OFFSET(16), STACK_OFFSET(20)},
+	[BPF_REG_5] = {STACK_OFFSET(24), STACK_OFFSET(28)},
+	/* callee saved registers that in-kernel function will preserve */
+	[BPF_REG_6] = {ARM_R5, ARM_R4},
+	/* Stored on stack scratch space */
+	[BPF_REG_7] = {STACK_OFFSET(32), STACK_OFFSET(36)},
+	[BPF_REG_8] = {STACK_OFFSET(40), STACK_OFFSET(44)},
+	[BPF_REG_9] = {STACK_OFFSET(48), STACK_OFFSET(52)},
+	/* Read only Frame Pointer to access Stack */
+	[BPF_REG_FP] = {STACK_OFFSET(56), STACK_OFFSET(60)},
+	/* Temporary registers for internal BPF JIT use, e.g. for
+	 * constant blinding and other operations.
+	 */
+	[TMP_REG_1] = {ARM_R7, ARM_R6},
+	[TMP_REG_2] = {ARM_R10, ARM_R8},
+	/* Tail call count. Stored on stack scratch space. */
+	[TCALL_CNT] = {STACK_OFFSET(64), STACK_OFFSET(68)},
+	/* temporary register for blinding constants.
+	 * Stored on stack scratch space.
+	 */
+	[BPF_REG_AX] = {STACK_OFFSET(72), STACK_OFFSET(76)},
+};
 
-#define r_scratch	ARM_R0
-/* r1-r3 are (also) used for the unaligned loads on the non-ARMv7 slowpath */
-#define r_off		ARM_R1
-#define r_A		ARM_R4
-#define r_X		ARM_R5
-#define r_skb		ARM_R6
-#define r_skb_data	ARM_R7
-#define r_skb_hl	ARM_R8
-
-#define SCRATCH_SP_OFFSET	0
-#define SCRATCH_OFF(k)		(SCRATCH_SP_OFFSET + 4 * (k))
-
-#define SEEN_MEM		((1 << BPF_MEMWORDS) - 1)
-#define SEEN_MEM_WORD(k)	(1 << (k))
-#define SEEN_X			(1 << BPF_MEMWORDS)
-#define SEEN_CALL		(1 << (BPF_MEMWORDS + 1))
-#define SEEN_SKB		(1 << (BPF_MEMWORDS + 2))
-#define SEEN_DATA		(1 << (BPF_MEMWORDS + 3))
+#define dst_lo	dst[1]
+#define dst_hi	dst[0]
+#define src_lo	src[1]
+#define src_hi	src[0]
 
-#define FLAG_NEED_X_RESET	(1 << 0)
-#define FLAG_IMM_OVERFLOW	(1 << 1)
+/*
+ * JIT Context:
+ *
+ * prog			:	bpf_prog
+ * idx			:	index of the most recently JITed instruction.
+ * prologue_bytes	:	bytes used in prologue.
+ * epilogue_offset	:	offset of epilogue starting.
+ * seen			:	bit mask used for JIT optimization.
+ * offsets		:	array of eBPF instruction offsets in
+ *				JITed code.
+ * target		:	final JITed code.
+ * epilogue_bytes	:	number of bytes used in epilogue.
+ * imm_count		:	number of immediates used for global
+ *				variables.
+ * imms			:	array of global variable addresses.
+ */
 
 struct jit_ctx {
-	const struct bpf_prog *skf;
-	unsigned idx;
-	unsigned prologue_bytes;
-	int ret0_fp_idx;
+	const struct bpf_prog *prog;
+	unsigned int idx;
+	unsigned int prologue_bytes;
+	unsigned int epilogue_offset;
 	u32 seen;
 	u32 flags;
 	u32 *offsets;
@@ -73,68 +121,16 @@ struct jit_ctx {
 #endif
 };
 
-int bpf_jit_enable __read_mostly;
-
-static inline int call_neg_helper(struct sk_buff *skb, int offset, void *ret,
-		      unsigned int size)
-{
-	void *ptr = bpf_internal_load_pointer_neg_helper(skb, offset, size);
-
-	if (!ptr)
-		return -EFAULT;
-	memcpy(ret, ptr, size);
-	return 0;
-}
-
-static u64 jit_get_skb_b(struct sk_buff *skb, int offset)
-{
-	u8 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 1);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 1);
-
-	return (u64)err << 32 | ret;
-}
-
-static u64 jit_get_skb_h(struct sk_buff *skb, int offset)
-{
-	u16 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 2);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 2);
-
-	return (u64)err << 32 | ntohs(ret);
-}
-
-static u64 jit_get_skb_w(struct sk_buff *skb, int offset)
-{
-	u32 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 4);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 4);
-
-	return (u64)err << 32 | ntohl(ret);
-}
-
 /*
  * Wrappers which handle both OABI and EABI and assures Thumb2 interworking
  * (where the assembly routines like __aeabi_uidiv could cause problems).
  */
-static u32 jit_udiv(u32 dividend, u32 divisor)
+static u32 jit_udiv32(u32 dividend, u32 divisor)
 {
 	return dividend / divisor;
 }
 
-static u32 jit_mod(u32 dividend, u32 divisor)
+static u32 jit_mod32(u32 dividend, u32 divisor)
 {
 	return dividend % divisor;
 }
@@ -158,36 +154,22 @@ static inline void emit(u32 inst, struct jit_ctx *ctx)
 	_emit(ARM_COND_AL, inst, ctx);
 }
 
-static u16 saved_regs(struct jit_ctx *ctx)
+/*
+ * Encode x as an ARM imm12 (8-bit value + 4-bit rotation), or return -1.
+ */
+static int16_t imm8m(u32 x)
 {
-	u16 ret = 0;
-
-	if ((ctx->skf->len > 1) ||
-	    (ctx->skf->insns[0].code == (BPF_RET | BPF_A)))
-		ret |= 1 << r_A;
-
-#ifdef CONFIG_FRAME_POINTER
-	ret |= (1 << ARM_FP) | (1 << ARM_IP) | (1 << ARM_LR) | (1 << ARM_PC);
-#else
-	if (ctx->seen & SEEN_CALL)
-		ret |= 1 << ARM_LR;
-#endif
-	if (ctx->seen & (SEEN_DATA | SEEN_SKB))
-		ret |= 1 << r_skb;
-	if (ctx->seen & SEEN_DATA)
-		ret |= (1 << r_skb_data) | (1 << r_skb_hl);
-	if (ctx->seen & SEEN_X)
-		ret |= 1 << r_X;
-
-	return ret;
-}
+	u32 rot;
 
-static inline int mem_words_used(struct jit_ctx *ctx)
-{
-	/* yes, we do waste some stack space IF there are "holes" in the set" */
-	return fls(ctx->seen & SEEN_MEM);
+	for (rot = 0; rot < 16; rot++)
+		if ((x & ~ror32(0xff, 2 * rot)) == 0)
+			return rol32(x, 2 * rot) | (rot << 8);
+	return -1;
 }
 
+/*
+ * Initializes the JIT space with undefined instructions.
+ */
 static void jit_fill_hole(void *area, unsigned int size)
 {
 	u32 *ptr;
@@ -196,88 +178,34 @@ static void jit_fill_hole(void *area, unsigned int size)
 		*ptr++ = __opcode_to_mem_arm(ARM_INST_UDF);
 }
 
-static void build_prologue(struct jit_ctx *ctx)
-{
-	u16 reg_set = saved_regs(ctx);
-	u16 off;
-
-#ifdef CONFIG_FRAME_POINTER
-	emit(ARM_MOV_R(ARM_IP, ARM_SP), ctx);
-	emit(ARM_PUSH(reg_set), ctx);
-	emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
-#else
-	if (reg_set)
-		emit(ARM_PUSH(reg_set), ctx);
-#endif
-
-	if (ctx->seen & (SEEN_DATA | SEEN_SKB))
-		emit(ARM_MOV_R(r_skb, ARM_R0), ctx);
-
-	if (ctx->seen & SEEN_DATA) {
-		off = offsetof(struct sk_buff, data);
-		emit(ARM_LDR_I(r_skb_data, r_skb, off), ctx);
-		/* headlen = len - data_len */
-		off = offsetof(struct sk_buff, len);
-		emit(ARM_LDR_I(r_skb_hl, r_skb, off), ctx);
-		off = offsetof(struct sk_buff, data_len);
-		emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
-		emit(ARM_SUB_R(r_skb_hl, r_skb_hl, r_scratch), ctx);
-	}
+/* Stack size must be a multiple of 16 bytes */
+#define STACK_ALIGN(sz) (((sz) + 15) & ~15)
 
-	if (ctx->flags & FLAG_NEED_X_RESET)
-		emit(ARM_MOV_I(r_X, 0), ctx);
-
-	/* do not leak kernel data to userspace */
-	if (bpf_needs_clear_a(&ctx->skf->insns[0]))
-		emit(ARM_MOV_I(r_A, 0), ctx);
-
-	/* stack space for the BPF_MEM words */
-	if (ctx->seen & SEEN_MEM)
-		emit(ARM_SUB_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
-}
-
-static void build_epilogue(struct jit_ctx *ctx)
-{
-	u16 reg_set = saved_regs(ctx);
-
-	if (ctx->seen & SEEN_MEM)
-		emit(ARM_ADD_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
-
-	reg_set &= ~(1 << ARM_LR);
-
-#ifdef CONFIG_FRAME_POINTER
-	/* the first instruction of the prologue was: mov ip, sp */
-	reg_set &= ~(1 << ARM_IP);
-	reg_set |= (1 << ARM_SP);
-	emit(ARM_LDM(ARM_SP, reg_set), ctx);
-#else
-	if (reg_set) {
-		if (ctx->seen & SEEN_CALL)
-			reg_set |= 1 << ARM_PC;
-		emit(ARM_POP(reg_set), ctx);
-	}
+/* Stack space for BPF_REG_2, BPF_REG_3, BPF_REG_4,
+ * BPF_REG_5, BPF_REG_7, BPF_REG_8, BPF_REG_9,
+ * BPF_REG_FP, the tail call count and BPF_REG_AX (10 x 8 bytes).
+ */
+#define SCRATCH_SIZE 80
 
-	if (!(ctx->seen & SEEN_CALL))
-		emit(ARM_BX(ARM_LR), ctx);
-#endif
-}
+/* total stack size used in JITed code */
+#define _STACK_SIZE \
+	(MAX_BPF_STACK + \
+	 SCRATCH_SIZE + \
+	 4 /* extra for skb_copy_bits buffer */)
 
-static int16_t imm8m(u32 x)
-{
-	u32 rot;
+#define STACK_SIZE STACK_ALIGN(_STACK_SIZE)
 
-	for (rot = 0; rot < 16; rot++)
-		if ((x & ~ror32(0xff, 2 * rot)) == 0)
-			return rol32(x, 2 * rot) | (rot << 8);
+/* Get the offset of eBPF REGISTERs stored on scratch space. */
+#define STACK_VAR(off) (STACK_SIZE - (off) - 4)
 
-	return -1;
-}
+/* Offset of skb_copy_bits buffer */
+#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
 
 #if __LINUX_ARM_ARCH__ < 7
 
 static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 {
-	unsigned i = 0, offset;
+	unsigned int i = 0, offset;
 	u16 imm;
 
 	/* on the "fake" run we just count them (duplicates included) */
@@ -296,7 +224,7 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 		ctx->imms[i] = k;
 
 	/* constants go just after the epilogue */
-	offset =  ctx->offsets[ctx->skf->len];
+	offset =  ctx->offsets[ctx->prog->len - 1] * 4;
 	offset += ctx->prologue_bytes;
 	offset += ctx->epilogue_bytes;
 	offset += i * 4;
@@ -320,10 +248,22 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 
 #endif /* __LINUX_ARM_ARCH__ */
 
+static inline int bpf2a32_offset(int bpf_to, int bpf_from,
+				 const struct jit_ctx *ctx) {
+	int to, from;
+
+	if (ctx->target == NULL)
+		return 0;
+	to = ctx->offsets[bpf_to];
+	from = ctx->offsets[bpf_from];
+
+	return to - from - 1;
+}
+
 /*
  * Move an immediate that's not an imm8m to a core register.
  */
-static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
+static inline void emit_mov_i_no8m(const u8 rd, u32 val, struct jit_ctx *ctx)
 {
 #if __LINUX_ARM_ARCH__ < 7
 	emit(ARM_LDR_I(rd, ARM_PC, imm_offset(val, ctx)), ctx);
@@ -334,7 +274,7 @@ static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
 #endif
 }
 
-static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
+static inline void emit_mov_i(const u8 rd, u32 val, struct jit_ctx *ctx)
 {
 	int imm12 = imm8m(val);
 
@@ -344,676 +284,1578 @@ static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
 		emit_mov_i_no8m(rd, val, ctx);
 }
 
-#if __LINUX_ARM_ARCH__ < 6
-
-static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
 {
-	_emit(cond, ARM_LDRB_I(ARM_R3, r_addr, 1), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 3), ctx);
-	_emit(cond, ARM_LSL_I(ARM_R3, ARM_R3, 16), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R0, r_addr, 2), ctx);
-	_emit(cond, ARM_ORR_S(ARM_R3, ARM_R3, ARM_R1, SRTYPE_LSL, 24), ctx);
-	_emit(cond, ARM_ORR_R(ARM_R3, ARM_R3, ARM_R2), ctx);
-	_emit(cond, ARM_ORR_S(r_res, ARM_R3, ARM_R0, SRTYPE_LSL, 8), ctx);
+	ctx->seen |= SEEN_CALL;
+#if __LINUX_ARM_ARCH__ < 5
+	emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
+
+	if (elf_hwcap & HWCAP_THUMB)
+		emit(ARM_BX(tgt_reg), ctx);
+	else
+		emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
+#else
+	emit(ARM_BLX_R(tgt_reg), ctx);
+#endif
 }
 
-static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+static inline int epilogue_offset(const struct jit_ctx *ctx)
 {
-	_emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 1), ctx);
-	_emit(cond, ARM_ORR_S(r_res, ARM_R2, ARM_R1, SRTYPE_LSL, 8), ctx);
+	int to, from;
+	/* No need for 1st dummy run */
+	if (ctx->target == NULL)
+		return 0;
+	to = ctx->epilogue_offset;
+	from = ctx->idx;
+
+	return to - from - 2;
 }
 
-static inline void emit_swap16(u8 r_dst, u8 r_src, struct jit_ctx *ctx)
+static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx, u8 op)
 {
-	/* r_dst = (r_src << 8) | (r_src >> 8) */
-	emit(ARM_LSL_I(ARM_R1, r_src, 8), ctx);
-	emit(ARM_ORR_S(r_dst, ARM_R1, r_src, SRTYPE_LSR, 8), ctx);
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	s32 jmp_offset;
+
+	/* Check whether the divisor is zero. If it is, set the return
+	 * value to 0 and jump directly to the epilogue.
+	 */
+	emit(ARM_CMP_I(rn, 0), ctx);
+	_emit(ARM_COND_EQ, ARM_MOV_I(ARM_R0, 0), ctx);
+	jmp_offset = epilogue_offset(ctx);
+	_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+#if __LINUX_ARM_ARCH__ == 7
+	if (elf_hwcap & HWCAP_IDIVA) {
+		if (op == BPF_DIV)
+			emit(ARM_UDIV(rd, rm, rn), ctx);
+		else {
+			emit(ARM_UDIV(ARM_IP, rm, rn), ctx);
+			emit(ARM_MLS(rd, rn, ARM_IP, rm), ctx);
+		}
+		return;
+	}
+#endif
 
 	/*
-	 * we need to mask out the bits set in r_dst[23:16] due to
-	 * the first shift instruction.
-	 *
-	 * note that 0x8ff is the encoded immediate 0x00ff0000.
+	 * For BPF_ALU | BPF_DIV | BPF_K instructions:
+	 * As ARM_R1 and ARM_R0 contain the 1st argument of the bpf
+	 * function, we need to save them on the caller side so that
+	 * the callee does not clobber them.
+	 * After the callee returns, we restore ARM_R0 and
+	 * ARM_R1.
 	 */
-	emit(ARM_BIC_I(r_dst, r_dst, 0x8ff), ctx);
-}
+	if (rn != ARM_R1) {
+		emit(ARM_MOV_R(tmp[0], ARM_R1), ctx);
+		emit(ARM_MOV_R(ARM_R1, rn), ctx);
+	}
+	if (rm != ARM_R0) {
+		emit(ARM_MOV_R(tmp[1], ARM_R0), ctx);
+		emit(ARM_MOV_R(ARM_R0, rm), ctx);
+	}
+
+	/* Call appropriate function */
+	ctx->seen |= SEEN_CALL;
+	emit_mov_i(ARM_IP, op == BPF_DIV ?
+		   (u32)jit_udiv32 : (u32)jit_mod32, ctx);
+	emit_blx_r(ARM_IP, ctx);
 
-#else  /* ARMv6+ */
+	/* Save return value */
+	if (rd != ARM_R0)
+		emit(ARM_MOV_R(rd, ARM_R0), ctx);
 
-static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
-{
-	_emit(cond, ARM_LDR_I(r_res, r_addr, 0), ctx);
-#ifdef __LITTLE_ENDIAN
-	_emit(cond, ARM_REV(r_res, r_res), ctx);
-#endif
+	/* Restore ARM_R0 and ARM_R1 */
+	if (rn != ARM_R1)
+		emit(ARM_MOV_R(ARM_R1, tmp[0]), ctx);
+	if (rm != ARM_R0)
+		emit(ARM_MOV_R(ARM_R0, tmp[1]), ctx);
 }
 
-static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+/* Checks whether BPF register is on scratch stack space or not. */
+static inline bool is_on_stack(u8 bpf_reg)
 {
-	_emit(cond, ARM_LDRH_I(r_res, r_addr, 0), ctx);
-#ifdef __LITTLE_ENDIAN
-	_emit(cond, ARM_REV16(r_res, r_res), ctx);
-#endif
+	static u8 stack_regs[] = {BPF_REG_AX, BPF_REG_3, BPF_REG_4, BPF_REG_5,
+				BPF_REG_7, BPF_REG_8, BPF_REG_9, TCALL_CNT,
+				BPF_REG_2, BPF_REG_FP};
+	int i, reg_len = ARRAY_SIZE(stack_regs);
+
+	for (i = 0 ; i < reg_len ; i++) {
+		if (bpf_reg == stack_regs[i])
+			return true;
+	}
+	return false;
 }
 
-static inline void emit_swap16(u8 r_dst __maybe_unused,
-			       u8 r_src __maybe_unused,
-			       struct jit_ctx *ctx __maybe_unused)
+static inline void emit_a32_mov_i(const u8 dst, const u32 val,
+				  bool dstk, struct jit_ctx *ctx)
 {
-#ifdef __LITTLE_ENDIAN
-	emit(ARM_REV16(r_dst, r_src), ctx);
-#endif
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+
+	if (dstk) {
+		emit_mov_i(tmp[1], val, ctx);
+		emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(dst)), ctx);
+	} else {
+		emit_mov_i(dst, val, ctx);
+	}
 }
 
-#endif /* __LINUX_ARM_ARCH__ < 6 */
+/* Sign extended move */
+static inline void emit_a32_mov_i64(const bool is64, const u8 dst[],
+				  const u32 val, bool dstk,
+				  struct jit_ctx *ctx) {
+	u32 hi = 0;
 
+	if (is64 && (val & (1U << 31)))
+		hi = (u32)~0;
+	emit_a32_mov_i(dst_lo, val, dstk, ctx);
+	emit_a32_mov_i(dst_hi, hi, dstk, ctx);
+}
 
-/* Compute the immediate value for a PC-relative branch. */
-static inline u32 b_imm(unsigned tgt, struct jit_ctx *ctx)
-{
-	u32 imm;
+static inline void emit_a32_add_r(const u8 dst, const u8 src,
+			      const bool is64, const bool hi,
+			      struct jit_ctx *ctx) {
+	/* 64 bit :
+	 *	adds dst_lo, dst_lo, src_lo
+	 *	adc dst_hi, dst_hi, src_hi
+	 * 32 bit :
+	 *	add dst_lo, dst_lo, src_lo
+	 */
+	if (!hi && is64)
+		emit(ARM_ADDS_R(dst, dst, src), ctx);
+	else if (hi && is64)
+		emit(ARM_ADC_R(dst, dst, src), ctx);
+	else
+		emit(ARM_ADD_R(dst, dst, src), ctx);
+}
 
-	if (ctx->target == NULL)
-		return 0;
-	/*
-	 * BPF allows only forward jumps and the offset of the target is
-	 * still the one computed during the first pass.
+static inline void emit_a32_sub_r(const u8 dst, const u8 src,
+				  const bool is64, const bool hi,
+				  struct jit_ctx *ctx) {
+	/* 64 bit :
+	 *	subs dst_lo, dst_lo, src_lo
+	 *	sbc dst_hi, dst_hi, src_hi
+	 * 32 bit :
+	 *	sub dst_lo, dst_lo, src_lo
 	 */
-	imm  = ctx->offsets[tgt] + ctx->prologue_bytes - (ctx->idx * 4 + 8);
+	if (!hi && is64)
+		emit(ARM_SUBS_R(dst, dst, src), ctx);
+	else if (hi && is64)
+		emit(ARM_SBC_R(dst, dst, src), ctx);
+	else
+		emit(ARM_SUB_R(dst, dst, src), ctx);
+}
 
-	return imm >> 2;
+static inline void emit_alu_r(const u8 dst, const u8 src, const bool is64,
+			      const bool hi, const u8 op, struct jit_ctx *ctx){
+	switch (BPF_OP(op)) {
+	/* dst = dst + src */
+	case BPF_ADD:
+		emit_a32_add_r(dst, src, is64, hi, ctx);
+		break;
+	/* dst = dst - src */
+	case BPF_SUB:
+		emit_a32_sub_r(dst, src, is64, hi, ctx);
+		break;
+	/* dst = dst | src */
+	case BPF_OR:
+		emit(ARM_ORR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst & src */
+	case BPF_AND:
+		emit(ARM_AND_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst ^ src */
+	case BPF_XOR:
+		emit(ARM_EOR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst * src */
+	case BPF_MUL:
+		emit(ARM_MUL(dst, dst, src), ctx);
+		break;
+	/* dst = dst << src */
+	case BPF_LSH:
+		emit(ARM_LSL_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst >> src */
+	case BPF_RSH:
+		emit(ARM_LSR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst >> src (signed)*/
+	case BPF_ARSH:
+		emit(ARM_MOV_SR(dst, dst, SRTYPE_ASR, src), ctx);
+		break;
+	}
 }
 
-#define OP_IMM3(op, r1, r2, imm_val, ctx)				\
-	do {								\
-		imm12 = imm8m(imm_val);					\
-		if (imm12 < 0) {					\
-			emit_mov_i_no8m(r_scratch, imm_val, ctx);	\
-			emit(op ## _R((r1), (r2), r_scratch), ctx);	\
-		} else {						\
-			emit(op ## _I((r1), (r2), imm12), ctx);		\
-		}							\
-	} while (0)
-
-static inline void emit_err_ret(u8 cond, struct jit_ctx *ctx)
-{
-	if (ctx->ret0_fp_idx >= 0) {
-		_emit(cond, ARM_B(b_imm(ctx->ret0_fp_idx, ctx)), ctx);
-		/* NOP to keep the size constant between passes */
-		emit(ARM_MOV_R(ARM_R0, ARM_R0), ctx);
+/* ALU operation (32 bit)
+ * dst = dst (op) src
+ */
+static inline void emit_a32_alu_r(const u8 dst, const u8 src,
+				  bool dstk, bool sstk,
+				  struct jit_ctx *ctx, const bool is64,
+				  const bool hi, const u8 op) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rn = sstk ? tmp[1] : src;
+
+	if (sstk)
+		emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src)), ctx);
+
+	/* ALU operation */
+	if (dstk) {
+		emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
+		emit_alu_r(tmp[0], rn, is64, hi, op, ctx);
+		emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
 	} else {
-		_emit(cond, ARM_MOV_I(ARM_R0, 0), ctx);
-		_emit(cond, ARM_B(b_imm(ctx->skf->len, ctx)), ctx);
+		emit_alu_r(dst, rn, is64, hi, op, ctx);
 	}
 }
 
-static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
-{
-#if __LINUX_ARM_ARCH__ < 5
-	emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
+/* ALU operation (64 bit) */
+static inline void emit_a32_alu_r64(const bool is64, const u8 dst[],
+				  const u8 src[], bool dstk,
+				  bool sstk, struct jit_ctx *ctx,
+				  const u8 op) {
+	emit_a32_alu_r(dst_lo, src_lo, dstk, sstk, ctx, is64, false, op);
+	if (is64)
+		emit_a32_alu_r(dst_hi, src_hi, dstk, sstk, ctx, is64, true, op);
+	else
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+}
 
-	if (elf_hwcap & HWCAP_THUMB)
-		emit(ARM_BX(tgt_reg), ctx);
+/* dst = src (4 bytes) */
+static inline void emit_a32_mov_r(const u8 dst, const u8 src,
+				  bool dstk, bool sstk,
+				  struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rt = sstk ? tmp[0] : src;
+
+	if (sstk)
+		emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(src)), ctx);
+	if (dstk)
+		emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst)), ctx);
 	else
-		emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
-#else
-	emit(ARM_BLX_R(tgt_reg), ctx);
-#endif
+		emit(ARM_MOV_R(dst, rt), ctx);
 }
 
-static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx,
-				int bpf_op)
-{
-#if __LINUX_ARM_ARCH__ == 7
-	if (elf_hwcap & HWCAP_IDIVA) {
-		if (bpf_op == BPF_DIV)
-			emit(ARM_UDIV(rd, rm, rn), ctx);
-		else {
-			emit(ARM_UDIV(ARM_R3, rm, rn), ctx);
-			emit(ARM_MLS(rd, rn, ARM_R3, rm), ctx);
-		}
-		return;
+/* dst = src */
+static inline void emit_a32_mov_r64(const bool is64, const u8 dst[],
+				  const u8 src[], bool dstk,
+				  bool sstk, struct jit_ctx *ctx) {
+	emit_a32_mov_r(dst_lo, src_lo, dstk, sstk, ctx);
+	if (is64) {
+		/* complete 8 byte move */
+		emit_a32_mov_r(dst_hi, src_hi, dstk, sstk, ctx);
+	} else {
+		/* Zero out high 4 bytes */
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
 	}
-#endif
+}
 
-	/*
-	 * For BPF_ALU | BPF_DIV | BPF_K instructions, rm is ARM_R4
-	 * (r_A) and rn is ARM_R0 (r_scratch) so load rn first into
-	 * ARM_R1 to avoid accidentally overwriting ARM_R0 with rm
-	 * before using it as a source for ARM_R1.
-	 *
-	 * For BPF_ALU | BPF_DIV | BPF_X rm is ARM_R4 (r_A) and rn is
-	 * ARM_R5 (r_X) so there is no particular register overlap
-	 * issues.
-	 */
-	if (rn != ARM_R1)
-		emit(ARM_MOV_R(ARM_R1, rn), ctx);
-	if (rm != ARM_R0)
-		emit(ARM_MOV_R(ARM_R0, rm), ctx);
+/* Shift and negate operations with an immediate (32 bit) */
+static inline void emit_a32_alu_i(const u8 dst, const u32 val, bool dstk,
+				struct jit_ctx *ctx, const u8 op) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[0] : dst;
+
+	if (dstk)
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+
+	/* Do shift operation */
+	switch (op) {
+	case BPF_LSH:
+		emit(ARM_LSL_I(rd, rd, val), ctx);
+		break;
+	case BPF_RSH:
+		emit(ARM_LSR_I(rd, rd, val), ctx);
+		break;
+	case BPF_NEG:
+		emit(ARM_RSB_I(rd, rd, val), ctx);
+		break;
+	}
+
+	if (dstk)
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+}
 
+/* dst = -dst (64 bit) */
+static inline void emit_a32_neg64(const u8 dst[], bool dstk,
+				struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst[1];
+	u8 rm = dstk ? tmp[0] : dst[0];
+
+	/* Setup Operand */
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do Negate Operation */
+	emit(ARM_RSBS_I(rd, rd, 0), ctx);
+	emit(ARM_RSC_I(rm, rm, 0), ctx);
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+/* dst = dst << src */
+static inline void emit_a32_lsh_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSH operation */
+	emit(ARM_SUB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_RSB_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
 	ctx->seen |= SEEN_CALL;
-	emit_mov_i(ARM_R3, bpf_op == BPF_DIV ? (u32)jit_udiv : (u32)jit_mod,
-		   ctx);
-	emit_blx_r(ARM_R3, ctx);
+	emit(ARM_MOV_SR(ARM_LR, rm, SRTYPE_ASL, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rd, SRTYPE_ASL, ARM_IP), ctx);
+	emit(ARM_ORR_SR(ARM_IP, ARM_LR, rd, SRTYPE_LSR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_ASL, rt), ctx);
+
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
 
-	if (rd != ARM_R0)
-		emit(ARM_MOV_R(rd, ARM_R0), ctx);
+/* dst = dst >> src (signed)*/
+static inline void emit_a32_arsh_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do the ARSH operation */
+	emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
+	_emit(ARM_COND_MI, ARM_B(0), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_ASR, rt), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
+
+/* dst = dst >> src */
+static inline void emit_a32_lsr_r64(const u8 dst[], const u8 src[], bool dstk,
+				     bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSR operation */
+	emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_LSR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_LSR, rt), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
+
+/* dst = dst << val */
+static inline void emit_a32_lsh_i64(const u8 dst[], bool dstk,
+				     const u32 val, struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSH operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[0], rm, SRTYPE_ASL, val), ctx);
+		emit(ARM_ORR_SI(rm, tmp2[0], rd, SRTYPE_LSR, 32 - val), ctx);
+		emit(ARM_MOV_SI(rd, rd, SRTYPE_ASL, val), ctx);
+	} else {
+		if (val == 32)
+			emit(ARM_MOV_R(rm, rd), ctx);
+		else
+			emit(ARM_MOV_SI(rm, rd, SRTYPE_ASL, val - 32), ctx);
+		emit(ARM_EOR_R(rd, rd, rd), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+/* dst = dst >> val */
+static inline void emit_a32_lsr_i64(const u8 dst[], bool dstk,
+				    const u32 val, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSR operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
+		emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_LSR, val), ctx);
+	} else if (val == 32) {
+		emit(ARM_MOV_R(rd, rm), ctx);
+		emit(ARM_MOV_I(rm, 0), ctx);
+	} else {
+		emit(ARM_MOV_SI(rd, rm, SRTYPE_LSR, val - 32), ctx);
+		emit(ARM_MOV_I(rm, 0), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
 }
 
-static inline void update_on_xread(struct jit_ctx *ctx)
+/* dst = dst >> val (signed) */
+static inline void emit_a32_arsh_i64(const u8 dst[], bool dstk,
+				     const u32 val, struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	 /* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do ARSH operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
+		emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, val), ctx);
+	} else if (val == 32) {
+		emit(ARM_MOV_R(rd, rm), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
+	} else {
+		emit(ARM_MOV_SI(rd, rm, SRTYPE_ASR, val - 32), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+static inline void emit_a32_mul_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands for multiplication */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rn = sstk ? tmp2[0] : src_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+	if (sstk) {
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+		emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_hi)), ctx);
+	}
+
+	/* Do Multiplication */
+	emit(ARM_MUL(ARM_IP, rd, rn), ctx);
+	emit(ARM_MUL(ARM_LR, rm, rt), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_ADD_R(ARM_LR, ARM_IP, ARM_LR), ctx);
+
+	emit(ARM_UMULL(ARM_IP, rm, rd, rt), ctx);
+	emit(ARM_ADD_R(rm, ARM_LR, rm), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_IP), ctx);
+	}
+}
+
+/* *(size *)(dst + off) = src */
+static inline void emit_str_r(const u8 dst, const u8 src, bool dstk,
+			      const s32 off, struct jit_ctx *ctx, const u8 sz){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst;
+
+	if (dstk)
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+	if (off) {
+		emit_a32_mov_i(tmp[0], off, false, ctx);
+		emit(ARM_ADD_R(tmp[0], rd, tmp[0]), ctx);
+		rd = tmp[0];
+	}
+	switch (sz) {
+	case BPF_W:
+		/* Store a Word */
+		emit(ARM_STR_I(src, rd, 0), ctx);
+		break;
+	case BPF_H:
+		/* Store a HalfWord */
+		emit(ARM_STRH_I(src, rd, 0), ctx);
+		break;
+	case BPF_B:
+		/* Store a Byte */
+		emit(ARM_STRB_I(src, rd, 0), ctx);
+		break;
+	}
+}
+
+/* dst = *(size*)(src + off) */
+static inline void emit_ldx_r(const u8 dst, const u8 src, bool dstk,
+			      const s32 off, struct jit_ctx *ctx, const u8 sz){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst;
+	u8 rm = src;
+
+	if (off) {
+		emit_a32_mov_i(tmp[0], off, false, ctx);
+		emit(ARM_ADD_R(tmp[0], tmp[0], src), ctx);
+		rm = tmp[0];
+	}
+	switch (sz) {
+	case BPF_W:
+		/* Load a Word */
+		emit(ARM_LDR_I(rd, rm, 0), ctx);
+		break;
+	case BPF_H:
+		/* Load a HalfWord */
+		emit(ARM_LDRH_I(rd, rm, 0), ctx);
+		break;
+	case BPF_B:
+		/* Load a Byte */
+		emit(ARM_LDRB_I(rd, rm, 0), ctx);
+		break;
+	}
+	if (dstk)
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+}
+
+/* Arithmetic Operation */
+static inline void emit_ar_r(const u8 rd, const u8 rt, const u8 rm,
+			     const u8 rn, struct jit_ctx *ctx, u8 op) {
+	switch (op) {
+	case BPF_JSET:
+		ctx->seen |= SEEN_CALL;
+		emit(ARM_AND_R(ARM_IP, rt, rn), ctx);
+		emit(ARM_AND_R(ARM_LR, rd, rm), ctx);
+		emit(ARM_ORRS_R(ARM_IP, ARM_LR, ARM_IP), ctx);
+		break;
+	case BPF_JEQ:
+	case BPF_JNE:
+	case BPF_JGT:
+	case BPF_JGE:
+		emit(ARM_CMP_R(rd, rm), ctx);
+		_emit(ARM_COND_EQ, ARM_CMP_R(rt, rn), ctx);
+		break;
+	case BPF_JSGT:
+		emit(ARM_CMP_R(rn, rt), ctx);
+		emit(ARM_SBCS_R(ARM_IP, rm, rd), ctx);
+		break;
+	case BPF_JSGE:
+		emit(ARM_CMP_R(rt, rn), ctx);
+		emit(ARM_SBCS_R(ARM_IP, rd, rm), ctx);
+		break;
+	}
+}
+
+static int out_offset = -1; /* initialized on the first pass of build_body() */
+static int emit_bpf_tail_call(struct jit_ctx *ctx)
 {
-	if (!(ctx->seen & SEEN_X))
-		ctx->flags |= FLAG_NEED_X_RESET;
 
-	ctx->seen |= SEEN_X;
+	/* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
+	const u8 *r2 = bpf2a32[BPF_REG_2];
+	const u8 *r3 = bpf2a32[BPF_REG_3];
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	const u8 *tcc = bpf2a32[TCALL_CNT];
+	const int idx0 = ctx->idx;
+#define cur_offset (ctx->idx - idx0)
+#define jmp_offset (out_offset - (cur_offset))
+	u32 off, lo, hi;
+
+	/* if (index >= array->map.max_entries)
+	 *	goto out;
+	 */
+	off = offsetof(struct bpf_array, map.max_entries);
+	/* array->map.max_entries */
+	emit_a32_mov_i(tmp[1], off, false, ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
+	/* index: only the low 32 bits are loaded and compared */
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
+	/* index >= array->map.max_entries */
+	emit(ARM_CMP_R(tmp2[1], tmp[1]), ctx);
+	_emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
+
+	/* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 *	goto out;
+	 * tail_call_cnt++;
+	 */
+	lo = (u32)MAX_TAIL_CALL_CNT;
+	hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
+	emit(ARM_LDR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
+	emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
+	emit(ARM_CMP_I(tmp[0], hi), ctx);
+	_emit(ARM_COND_EQ, ARM_CMP_I(tmp[1], lo), ctx);
+	_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
+	emit(ARM_ADDS_I(tmp[1], tmp[1], 1), ctx);
+	emit(ARM_ADC_I(tmp[0], tmp[0], 0), ctx);
+	emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
+	emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
+
+	/* prog = array->ptrs[index]
+	 * if (prog == NULL)
+	 *	goto out;
+	 */
+	off = offsetof(struct bpf_array, ptrs);
+	emit_a32_mov_i(tmp[1], off, false, ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
+	emit(ARM_ADD_R(tmp[1], tmp2[1], tmp[1]), ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
+	emit(ARM_MOV_SI(tmp[0], tmp2[1], SRTYPE_ASL, 2), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp[1], tmp[0]), ctx);
+	emit(ARM_CMP_I(tmp[1], 0), ctx);
+	_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+
+	/* goto *(prog->bpf_func + prologue_size); */
+	off = offsetof(struct bpf_prog, bpf_func);
+	emit_a32_mov_i(tmp2[1], off, false, ctx);
+	emit(ARM_LDR_R(tmp[1], tmp[1], tmp2[1]), ctx);
+	emit(ARM_ADD_I(tmp[1], tmp[1], ctx->prologue_bytes), ctx);
+	emit(ARM_BX(tmp[1]), ctx);
+
+	/* out: */
+	if (out_offset == -1)
+		out_offset = cur_offset;
+	if (cur_offset != out_offset) {
+		pr_err_once("tail_call out_offset = %d, expected %d!\n",
+			    cur_offset, out_offset);
+		return -1;
+	}
+	return 0;
+#undef cur_offset
+#undef jmp_offset
 }
 
-static int build_body(struct jit_ctx *ctx)
+/* 0xabcd => 0xcdab */
+static inline void emit_rev16(const u8 rd, const u8 rn, struct jit_ctx *ctx)
 {
-	void *load_func[] = {jit_get_skb_b, jit_get_skb_h, jit_get_skb_w};
-	const struct bpf_prog *prog = ctx->skf;
-	const struct sock_filter *inst;
-	unsigned i, load_order, off, condt;
-	int imm12;
-	u32 k;
+#if __LINUX_ARM_ARCH__ < 6
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 8), ctx);
+	emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
+	emit(ARM_ORR_SI(rd, tmp2[0], tmp2[1], SRTYPE_LSL, 8), ctx);
+#else /* ARMv6+ */
+	emit(ARM_REV16(rd, rn), ctx);
+#endif
+}
 
-	for (i = 0; i < prog->len; i++) {
-		u16 code;
+/* 0xabcdefgh => 0xghefcdab */
+static inline void emit_rev32(const u8 rd, const u8 rn, struct jit_ctx *ctx)
+{
+#if __LINUX_ARM_ARCH__ < 6
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 24), ctx);
+	emit(ARM_ORR_SI(ARM_IP, tmp2[0], tmp2[1], SRTYPE_LSL, 24), ctx);
+
+	emit(ARM_MOV_SI(tmp2[1], rn, SRTYPE_LSR, 8), ctx);
+	emit(ARM_AND_I(tmp2[1], tmp2[1], 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 16), ctx);
+	emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], tmp2[0], SRTYPE_LSL, 8), ctx);
+	emit(ARM_ORR_SI(tmp2[0], tmp2[0], tmp2[1], SRTYPE_LSL, 16), ctx);
+	emit(ARM_ORR_R(rd, ARM_IP, tmp2[0]), ctx);
+
+#else /* ARMv6+ */
+	emit(ARM_REV(rd, rn), ctx);
+#endif
+}
+
+/* Push the scratch stack register on top of the stack */
+static inline void emit_push_r64(const u8 src[], const u8 shift,
+		struct jit_ctx *ctx)
+{
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	u16 reg_set = 0;
 
-		inst = &(prog->insns[i]);
-		/* K as an immediate value operand */
-		k = inst->k;
-		code = bpf_anc_helper(inst);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(src[1]+shift)), ctx);
+	emit(ARM_LDR_I(tmp2[0], ARM_SP, STACK_VAR(src[0]+shift)), ctx);
 
-		/* compute offsets only in the fake pass */
-		if (ctx->target == NULL)
-			ctx->offsets[i] = ctx->idx * 4;
+	reg_set = (1 << tmp2[1]) | (1 << tmp2[0]);
+	emit(ARM_PUSH(reg_set), ctx);
+}
+
+static void build_prologue(struct jit_ctx *ctx)
+{
+	const u8 r0 = bpf2a32[BPF_REG_0][1];
+	const u8 r2 = bpf2a32[BPF_REG_1][1];
+	const u8 r3 = bpf2a32[BPF_REG_1][0];
+	const u8 r4 = bpf2a32[BPF_REG_6][1];
+	const u8 r5 = bpf2a32[BPF_REG_6][0];
+	const u8 r6 = bpf2a32[TMP_REG_1][1];
+	const u8 r7 = bpf2a32[TMP_REG_1][0];
+	const u8 r8 = bpf2a32[TMP_REG_2][1];
+	const u8 r10 = bpf2a32[TMP_REG_2][0];
+	const u8 fplo = bpf2a32[BPF_REG_FP][1];
+	const u8 fphi = bpf2a32[BPF_REG_FP][0];
+	const u8 sp = ARM_SP;
+	const u8 *tcc = bpf2a32[TCALL_CNT];
+
+	u16 reg_set = 0;
+
+	/*
+	 * eBPF prog stack layout
+	 *
+	 *                         high
+	 * original ARM_SP =>     +-----+ eBPF prologue
+	 *                        |FP/LR|
+	 * current ARM_FP =>      +-----+
+	 *                        | ... | callee saved registers
+	 * eBPF fp register =>    +-----+ <= (BPF_FP)
+	 *                        | ... | eBPF JIT scratch space
+	 *                        |     | eBPF prog stack
+	 *                        +-----+
+	 *                        |RSVD | JIT scratchpad
+	 * current ARM_SP =>      +-----+ <= (BPF_FP - STACK_SIZE)
+	 *                        |     |
+	 *                        | ... | Function call stack
+	 *                        |     |
+	 *                        +-----+
+	 *                          low
+	 */
+
+	/* Save callee saved registers. */
+	reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
+#ifdef CONFIG_FRAME_POINTER
+	reg_set |= (1<<ARM_FP) | (1<<ARM_IP) | (1<<ARM_LR) | (1<<ARM_PC);
+	emit(ARM_MOV_R(ARM_IP, sp), ctx);
+	emit(ARM_PUSH(reg_set), ctx);
+	emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
+#else
+	/* Check if call instruction exists in BPF body */
+	if (ctx->seen & SEEN_CALL)
+		reg_set |= (1<<ARM_LR);
+	emit(ARM_PUSH(reg_set), ctx);
+#endif
+	/* Save frame pointer for later */
+	emit(ARM_SUB_I(ARM_IP, sp, SCRATCH_SIZE), ctx);
+
+	/* Set up function call stack */
+	emit(ARM_SUB_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
+
+	/* Set up BPF prog stack base register */
+	emit_a32_mov_r(fplo, ARM_IP, true, false, ctx);
+	emit_a32_mov_i(fphi, 0, true, ctx);
+
+	/* mov r4, 0 */
+	emit(ARM_MOV_I(r4, 0), ctx);
+
+	/* Move BPF_CTX to BPF_R1 */
+	emit(ARM_MOV_R(r3, r4), ctx);
+	emit(ARM_MOV_R(r2, r0), ctx);
+	/* Initialize Tail Count */
+	emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[0])), ctx);
+	emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[1])), ctx);
+	/* end of prologue */
+}
+
+static void build_epilogue(struct jit_ctx *ctx)
+{
+	const u8 r4 = bpf2a32[BPF_REG_6][1];
+	const u8 r5 = bpf2a32[BPF_REG_6][0];
+	const u8 r6 = bpf2a32[TMP_REG_1][1];
+	const u8 r7 = bpf2a32[TMP_REG_1][0];
+	const u8 r8 = bpf2a32[TMP_REG_2][1];
+	const u8 r10 = bpf2a32[TMP_REG_2][0];
+	u16 reg_set = 0;
+
+	/* unwind function call stack */
+	emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
+
+	/* restore callee saved registers. */
+	reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
+#ifdef CONFIG_FRAME_POINTER
+	/* the first instruction of the prologue was: mov ip, sp */
+	reg_set |= (1<<ARM_FP) | (1<<ARM_SP) | (1<<ARM_PC);
+	emit(ARM_LDM(ARM_SP, reg_set), ctx);
+#else
+	if (ctx->seen & SEEN_CALL)
+		reg_set |= (1<<ARM_PC);
+	/* Restore callee saved registers. */
+	emit(ARM_POP(reg_set), ctx);
+	/* Return to the caller */
+	if (!(ctx->seen & SEEN_CALL))
+		emit(ARM_BX(ARM_LR), ctx);
+#endif
+}
 
-		switch (code) {
-		case BPF_LD | BPF_IMM:
-			emit_mov_i(r_A, k, ctx);
+/*
+ * Convert an eBPF instruction to native instructions, i.e.
+ * JIT an eBPF instruction.
+ * Returns :
+ *	0  - Successfully JITed an 8-byte eBPF instruction
+ *	>0 - Successfully JITed a 16-byte eBPF instruction
+ *	<0 - Failed to JIT.
+ */
+static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
+{
+	const u8 code = insn->code;
+	const u8 *dst = bpf2a32[insn->dst_reg];
+	const u8 *src = bpf2a32[insn->src_reg];
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	const s16 off = insn->off;
+	const s32 imm = insn->imm;
+	const int i = insn - ctx->prog->insnsi;
+	const bool is64 = BPF_CLASS(code) == BPF_ALU64;
+	const bool dstk = is_on_stack(insn->dst_reg);
+	const bool sstk = is_on_stack(insn->src_reg);
+	u8 rd, rt, rm, rn;
+	s32 jmp_offset;
+
+#define check_imm(bits, imm) do {				\
+	if ((((imm) > 0) && ((imm) >> (bits))) ||		\
+	    (((imm) < 0) && (~(imm) >> (bits)))) {		\
+		pr_info("[%2d] imm=%d(0x%x) out of range\n",	\
+			i, imm, imm);				\
+		return -EINVAL;					\
+	}							\
+} while (0)
+#define check_imm24(imm) check_imm(24, imm)
+
+	switch (code) {
+	/* ALU operations */
+
+	/* dst = src */
+	case BPF_ALU | BPF_MOV | BPF_K:
+	case BPF_ALU | BPF_MOV | BPF_X:
+	case BPF_ALU64 | BPF_MOV | BPF_K:
+	case BPF_ALU64 | BPF_MOV | BPF_X:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_mov_r64(is64, dst, src, dstk, sstk, ctx);
 			break;
-		case BPF_LD | BPF_W | BPF_LEN:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
-			emit(ARM_LDR_I(r_A, r_skb,
-				       offsetof(struct sk_buff, len)), ctx);
+		case BPF_K:
+			/* Sign-extend immediate value to destination reg */
+			emit_a32_mov_i64(is64, dst, imm, dstk, ctx);
 			break;
-		case BPF_LD | BPF_MEM:
-			/* A = scratch[k] */
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_LDR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
+		}
+		break;
+	/* dst = dst + src/imm */
+	/* dst = dst - src/imm */
+	/* dst = dst | src/imm */
+	/* dst = dst & src/imm */
+	/* dst = dst ^ src/imm */
+	/* dst = dst * src/imm */
+	/* dst = dst << src */
+	/* dst = dst >> src */
+	case BPF_ALU | BPF_ADD | BPF_K:
+	case BPF_ALU | BPF_ADD | BPF_X:
+	case BPF_ALU | BPF_SUB | BPF_K:
+	case BPF_ALU | BPF_SUB | BPF_X:
+	case BPF_ALU | BPF_OR | BPF_K:
+	case BPF_ALU | BPF_OR | BPF_X:
+	case BPF_ALU | BPF_AND | BPF_K:
+	case BPF_ALU | BPF_AND | BPF_X:
+	case BPF_ALU | BPF_XOR | BPF_K:
+	case BPF_ALU | BPF_XOR | BPF_X:
+	case BPF_ALU | BPF_MUL | BPF_K:
+	case BPF_ALU | BPF_MUL | BPF_X:
+	case BPF_ALU | BPF_LSH | BPF_X:
+	case BPF_ALU | BPF_RSH | BPF_X:
+	case BPF_ALU | BPF_ARSH | BPF_K:
+	case BPF_ALU | BPF_ARSH | BPF_X:
+	case BPF_ALU64 | BPF_ADD | BPF_K:
+	case BPF_ALU64 | BPF_ADD | BPF_X:
+	case BPF_ALU64 | BPF_SUB | BPF_K:
+	case BPF_ALU64 | BPF_SUB | BPF_X:
+	case BPF_ALU64 | BPF_OR | BPF_K:
+	case BPF_ALU64 | BPF_OR | BPF_X:
+	case BPF_ALU64 | BPF_AND | BPF_K:
+	case BPF_ALU64 | BPF_AND | BPF_X:
+	case BPF_ALU64 | BPF_XOR | BPF_K:
+	case BPF_ALU64 | BPF_XOR | BPF_X:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_alu_r64(is64, dst, src, dstk, sstk,
+					 ctx, BPF_OP(code));
 			break;
-		case BPF_LD | BPF_W | BPF_ABS:
-			load_order = 2;
-			goto load;
-		case BPF_LD | BPF_H | BPF_ABS:
-			load_order = 1;
-			goto load;
-		case BPF_LD | BPF_B | BPF_ABS:
-			load_order = 0;
-load:
-			emit_mov_i(r_off, k, ctx);
-load_common:
-			ctx->seen |= SEEN_DATA | SEEN_CALL;
-
-			if (load_order > 0) {
-				emit(ARM_SUB_I(r_scratch, r_skb_hl,
-					       1 << load_order), ctx);
-				emit(ARM_CMP_R(r_scratch, r_off), ctx);
-				condt = ARM_COND_GE;
-			} else {
-				emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
-				condt = ARM_COND_HI;
-			}
-
-			/*
-			 * test for negative offset, only if we are
-			 * currently scheduled to take the fast
-			 * path. this will update the flags so that
-			 * the slowpath instruction are ignored if the
-			 * offset is negative.
-			 *
-			 * for loard_order == 0 the HI condition will
-			 * make loads at offset 0 take the slow path too.
+		case BPF_K:
+			/* Move the immediate value to the temporary
+			 * register first: the move sign-extends the
+			 * immediate, so the ALU operation can then be
+			 * done safely on the temporary register instead
+			 * of on the raw immediate.
 			 */
-			_emit(condt, ARM_CMP_I(r_off, 0), ctx);
-
-			_emit(condt, ARM_ADD_R(r_scratch, r_off, r_skb_data),
-			      ctx);
-
-			if (load_order == 0)
-				_emit(condt, ARM_LDRB_I(r_A, r_scratch, 0),
-				      ctx);
-			else if (load_order == 1)
-				emit_load_be16(condt, r_A, r_scratch, ctx);
-			else if (load_order == 2)
-				emit_load_be32(condt, r_A, r_scratch, ctx);
-
-			_emit(condt, ARM_B(b_imm(i + 1, ctx)), ctx);
-
-			/* the slowpath */
-			emit_mov_i(ARM_R3, (u32)load_func[load_order], ctx);
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			/* the offset is already in R1 */
-			emit_blx_r(ARM_R3, ctx);
-			/* check the result of skb_copy_bits */
-			emit(ARM_CMP_I(ARM_R1, 0), ctx);
-			emit_err_ret(ARM_COND_NE, ctx);
-			emit(ARM_MOV_R(r_A, ARM_R0), ctx);
+			emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
+			emit_a32_alu_r64(is64, dst, tmp2, dstk, false,
+					 ctx, BPF_OP(code));
 			break;
-		case BPF_LD | BPF_W | BPF_IND:
-			load_order = 2;
-			goto load_ind;
-		case BPF_LD | BPF_H | BPF_IND:
-			load_order = 1;
-			goto load_ind;
-		case BPF_LD | BPF_B | BPF_IND:
-			load_order = 0;
-load_ind:
-			update_on_xread(ctx);
-			OP_IMM3(ARM_ADD, r_off, r_X, k, ctx);
-			goto load_common;
-		case BPF_LDX | BPF_IMM:
-			ctx->seen |= SEEN_X;
-			emit_mov_i(r_X, k, ctx);
+		}
+		break;
+	/* dst = dst / src(imm) */
+	/* dst = dst % src(imm) */
+	case BPF_ALU | BPF_DIV | BPF_K:
+	case BPF_ALU | BPF_DIV | BPF_X:
+	case BPF_ALU | BPF_MOD | BPF_K:
+	case BPF_ALU | BPF_MOD | BPF_X:
+		rt = src_lo;
+		rd = dstk ? tmp2[1] : dst_lo;
+		if (dstk)
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			rt = sstk ? tmp2[0] : rt;
+			if (sstk)
+				emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)),
+				     ctx);
 			break;
-		case BPF_LDX | BPF_W | BPF_LEN:
-			ctx->seen |= SEEN_X | SEEN_SKB;
-			emit(ARM_LDR_I(r_X, r_skb,
-				       offsetof(struct sk_buff, len)), ctx);
+		case BPF_K:
+			rt = tmp2[0];
+			emit_a32_mov_i(rt, imm, false, ctx);
 			break;
-		case BPF_LDX | BPF_MEM:
-			ctx->seen |= SEEN_X | SEEN_MEM_WORD(k);
-			emit(ARM_LDR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
+		}
+		emit_udivmod(rd, rd, rt, ctx, BPF_OP(code));
+		if (dstk)
+			emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	case BPF_ALU64 | BPF_DIV | BPF_K:
+	case BPF_ALU64 | BPF_DIV | BPF_X:
+	case BPF_ALU64 | BPF_MOD | BPF_K:
+	case BPF_ALU64 | BPF_MOD | BPF_X:
+		goto notyet;
+	/* dst = dst >> imm */
+	/* dst = dst << imm */
+	case BPF_ALU | BPF_RSH | BPF_K:
+	case BPF_ALU | BPF_LSH | BPF_K:
+		if (unlikely(imm > 31))
+			return -EINVAL;
+		if (imm)
+			emit_a32_alu_i(dst_lo, imm, dstk, ctx, BPF_OP(code));
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	/* dst = dst << imm */
+	case BPF_ALU64 | BPF_LSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_lsh_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = dst >> imm */
+	case BPF_ALU64 | BPF_RSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_lsr_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = dst << src */
+	case BPF_ALU64 | BPF_LSH | BPF_X:
+		emit_a32_lsh_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> src */
+	case BPF_ALU64 | BPF_RSH | BPF_X:
+		emit_a32_lsr_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> src (signed) */
+	case BPF_ALU64 | BPF_ARSH | BPF_X:
+		emit_a32_arsh_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> imm (signed) */
+	case BPF_ALU64 | BPF_ARSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_arsh_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = -dst */
+	case BPF_ALU | BPF_NEG:
+		emit_a32_alu_i(dst_lo, 0, dstk, ctx, BPF_OP(code));
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	/* dst = -dst (64 bit) */
+	case BPF_ALU64 | BPF_NEG:
+		emit_a32_neg64(dst, dstk, ctx);
+		break;
+	/* dst = dst * src/imm */
+	case BPF_ALU64 | BPF_MUL | BPF_X:
+	case BPF_ALU64 | BPF_MUL | BPF_K:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_mul_r64(dst, src, dstk, sstk, ctx);
 			break;
-		case BPF_LDX | BPF_B | BPF_MSH:
-			/* x = ((*(frame + k)) & 0xf) << 2; */
-			ctx->seen |= SEEN_X | SEEN_DATA | SEEN_CALL;
-			/* the interpreter should deal with the negative K */
-			if ((int)k < 0)
-				return -1;
-			/* offset in r1: we might have to take the slow path */
-			emit_mov_i(r_off, k, ctx);
-			emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
-
-			/* load in r0: common with the slowpath */
-			_emit(ARM_COND_HI, ARM_LDRB_R(ARM_R0, r_skb_data,
-						      ARM_R1), ctx);
-			/*
-			 * emit_mov_i() might generate one or two instructions,
-			 * the same holds for emit_blx_r()
+		case BPF_K:
+			/* Move the immediate value to the temporary
+			 * register first: the move sign-extends the
+			 * immediate, so the multiplication can then be
+			 * done safely on the temporary register instead
+			 * of on the raw immediate.
 			 */
-			_emit(ARM_COND_HI, ARM_B(b_imm(i + 1, ctx) - 2), ctx);
-
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			/* r_off is r1 */
-			emit_mov_i(ARM_R3, (u32)jit_get_skb_b, ctx);
-			emit_blx_r(ARM_R3, ctx);
-			/* check the return value of skb_copy_bits */
-			emit(ARM_CMP_I(ARM_R1, 0), ctx);
-			emit_err_ret(ARM_COND_NE, ctx);
-
-			emit(ARM_AND_I(r_X, ARM_R0, 0x00f), ctx);
-			emit(ARM_LSL_I(r_X, r_X, 2), ctx);
-			break;
-		case BPF_ST:
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_STR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
-			break;
-		case BPF_STX:
-			update_on_xread(ctx);
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_STR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
-			break;
-		case BPF_ALU | BPF_ADD | BPF_K:
-			/* A += K */
-			OP_IMM3(ARM_ADD, r_A, r_A, k, ctx);
-			break;
-		case BPF_ALU | BPF_ADD | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_ADD_R(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_SUB | BPF_K:
-			/* A -= K */
-			OP_IMM3(ARM_SUB, r_A, r_A, k, ctx);
-			break;
-		case BPF_ALU | BPF_SUB | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_SUB_R(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_MUL | BPF_K:
-			/* A *= K */
-			emit_mov_i(r_scratch, k, ctx);
-			emit(ARM_MUL(r_A, r_A, r_scratch), ctx);
+			emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
+			emit_a32_mul_r64(dst, tmp2, dstk, false, ctx);
 			break;
-		case BPF_ALU | BPF_MUL | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_MUL(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_DIV | BPF_K:
-			if (k == 1)
-				break;
-			emit_mov_i(r_scratch, k, ctx);
-			emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_DIV);
-			break;
-		case BPF_ALU | BPF_DIV | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_CMP_I(r_X, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-			emit_udivmod(r_A, r_A, r_X, ctx, BPF_DIV);
-			break;
-		case BPF_ALU | BPF_MOD | BPF_K:
-			if (k == 1) {
-				emit_mov_i(r_A, 0, ctx);
-				break;
-			}
-			emit_mov_i(r_scratch, k, ctx);
-			emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_MOD);
-			break;
-		case BPF_ALU | BPF_MOD | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_CMP_I(r_X, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-			emit_udivmod(r_A, r_A, r_X, ctx, BPF_MOD);
-			break;
-		case BPF_ALU | BPF_OR | BPF_K:
-			/* A |= K */
-			OP_IMM3(ARM_ORR, r_A, r_A, k, ctx);
+		}
+		break;
+	/* dst = htole(dst) */
+	/* dst = htobe(dst) */
+	case BPF_ALU | BPF_END | BPF_FROM_LE:
+	case BPF_ALU | BPF_END | BPF_FROM_BE:
+		rd = dstk ? tmp[0] : dst_hi;
+		rt = dstk ? tmp[1] : dst_lo;
+		if (dstk) {
+			emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+		if (BPF_SRC(code) == BPF_FROM_LE)
+			goto emit_bswap_uxt;
+		switch (imm) {
+		case 16:
+			emit_rev16(rt, rt, ctx);
+			goto emit_bswap_uxt;
+		case 32:
+			emit_rev32(rt, rt, ctx);
+			goto emit_bswap_uxt;
+		case 64:
+			/* Because of the usage of ARM_LR */
+			ctx->seen |= SEEN_CALL;
+			emit_rev32(ARM_LR, rt, ctx);
+			emit_rev32(rt, rd, ctx);
+			emit(ARM_MOV_R(rd, ARM_LR), ctx);
 			break;
-		case BPF_ALU | BPF_OR | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_ORR_R(r_A, r_A, r_X), ctx);
+		}
+		goto exit;
+emit_bswap_uxt:
+		switch (imm) {
+		case 16:
+			/* zero-extend 16 bits into 64 bits */
+#if __LINUX_ARM_ARCH__ < 6
+			emit_a32_mov_i(tmp2[1], 0xffff, false, ctx);
+			emit(ARM_AND_R(rt, rt, tmp2[1]), ctx);
+#else /* ARMv6+ */
+			emit(ARM_UXTH(rt, rt), ctx);
+#endif
+			emit(ARM_EOR_R(rd, rd, rd), ctx);
 			break;
-		case BPF_ALU | BPF_XOR | BPF_K:
-			/* A ^= K; */
-			OP_IMM3(ARM_EOR, r_A, r_A, k, ctx);
+		case 32:
+			/* zero-extend 32 bits into 64 bits */
+			emit(ARM_EOR_R(rd, rd, rd), ctx);
 			break;
-		case BPF_ANC | SKF_AD_ALU_XOR_X:
-		case BPF_ALU | BPF_XOR | BPF_X:
-			/* A ^= X */
-			update_on_xread(ctx);
-			emit(ARM_EOR_R(r_A, r_A, r_X), ctx);
+		case 64:
+			/* nop */
 			break;
-		case BPF_ALU | BPF_AND | BPF_K:
-			/* A &= K */
-			OP_IMM3(ARM_AND, r_A, r_A, k, ctx);
+		}
+exit:
+		if (dstk) {
+			emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+		break;
+	/* dst = imm64 */
+	case BPF_LD | BPF_IMM | BPF_DW:
+	{
+		const struct bpf_insn insn1 = insn[1];
+		u32 hi, lo = imm;
+
+		hi = insn1.imm;
+		emit_a32_mov_i(dst_lo, lo, dstk, ctx);
+		emit_a32_mov_i(dst_hi, hi, dstk, ctx);
+
+		return 1;
+	}
+	/* LDX: dst = *(size *)(src + off) */
+	case BPF_LDX | BPF_MEM | BPF_W:
+	case BPF_LDX | BPF_MEM | BPF_H:
+	case BPF_LDX | BPF_MEM | BPF_B:
+	case BPF_LDX | BPF_MEM | BPF_DW:
+		rn = sstk ? tmp2[1] : src_lo;
+		if (sstk)
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			/* Load a Word */
+		case BPF_H:
+			/* Load a Half-Word */
+		case BPF_B:
+			/* Load a Byte */
+			emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_SIZE(code));
+			emit_a32_mov_i(dst_hi, 0, dstk, ctx);
 			break;
-		case BPF_ALU | BPF_AND | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_AND_R(r_A, r_A, r_X), ctx);
+		case BPF_DW:
+			/* Load a double word */
+			emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_W);
+			emit_ldx_r(dst_hi, rn, dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_ALU | BPF_LSH | BPF_K:
-			if (unlikely(k > 31))
-				return -1;
-			emit(ARM_LSL_I(r_A, r_A, k), ctx);
+		}
+		break;
+	/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
+	case BPF_LD | BPF_ABS | BPF_W:
+	case BPF_LD | BPF_ABS | BPF_H:
+	case BPF_LD | BPF_ABS | BPF_B:
+	/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
+	case BPF_LD | BPF_IND | BPF_W:
+	case BPF_LD | BPF_IND | BPF_H:
+	case BPF_LD | BPF_IND | BPF_B:
+	{
+		const u8 r4 = bpf2a32[BPF_REG_6][1]; /* r4 = ptr to sk_buff */
+		const u8 r0 = bpf2a32[BPF_REG_0][1]; /*r0: struct sk_buff *skb*/
+						     /* rtn value */
+		const u8 r1 = bpf2a32[BPF_REG_0][0]; /* r1: int k */
+		const u8 r2 = bpf2a32[BPF_REG_1][1]; /* r2: unsigned int size */
+		const u8 r3 = bpf2a32[BPF_REG_1][0]; /* r3: void *buffer */
+		const u8 r6 = bpf2a32[TMP_REG_1][1]; /* r6: void *(*func)(..) */
+		int size;
+
+		/* Setting up first argument */
+		emit(ARM_MOV_R(r0, r4), ctx);
+
+		/* Setting up second argument */
+		emit_a32_mov_i(r1, imm, false, ctx);
+		if (BPF_MODE(code) == BPF_IND)
+			emit_a32_alu_r(r1, src_lo, false, sstk, ctx,
+				       false, false, BPF_ADD);
+
+		/* Setting up third argument */
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			size = 4;
 			break;
-		case BPF_ALU | BPF_LSH | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_LSL_R(r_A, r_A, r_X), ctx);
+		case BPF_H:
+			size = 2;
 			break;
-		case BPF_ALU | BPF_RSH | BPF_K:
-			if (unlikely(k > 31))
-				return -1;
-			if (k)
-				emit(ARM_LSR_I(r_A, r_A, k), ctx);
+		case BPF_B:
+			size = 1;
 			break;
-		case BPF_ALU | BPF_RSH | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_LSR_R(r_A, r_A, r_X), ctx);
+		default:
+			return -EINVAL;
+		}
+		emit_a32_mov_i(r2, size, false, ctx);
+
+		/* Setting up fourth argument */
+		emit(ARM_ADD_I(r3, ARM_SP, imm8m(SKB_BUFFER)), ctx);
+
+		/* Setting up function pointer to call */
+		emit_a32_mov_i(r6, (unsigned int)bpf_load_pointer, false, ctx);
+		emit_blx_r(r6, ctx);
+
+		emit(ARM_EOR_R(r1, r1, r1), ctx);
+		/* Check whether the returned pointer is NULL.
+		 * If it is NULL, jump to the epilogue;
+		 * otherwise load the value from the returned address.
+		 */
+		emit(ARM_CMP_I(r0, 0), ctx);
+		jmp_offset = epilogue_offset(ctx);
+		check_imm24(jmp_offset);
+		_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+
+		/* Load value from the address */
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			emit(ARM_LDR_I(r0, r0, 0), ctx);
+			emit_rev32(r0, r0, ctx);
 			break;
-		case BPF_ALU | BPF_NEG:
-			/* A = -A */
-			emit(ARM_RSB_I(r_A, r_A, 0), ctx);
+		case BPF_H:
+			emit(ARM_LDRH_I(r0, r0, 0), ctx);
+			emit_rev16(r0, r0, ctx);
 			break;
-		case BPF_JMP | BPF_JA:
-			/* pc += K */
-			emit(ARM_B(b_imm(i + k + 1, ctx)), ctx);
+		case BPF_B:
+			emit(ARM_LDRB_I(r0, r0, 0), ctx);
+			/* No need to reverse */
 			break;
-		case BPF_JMP | BPF_JEQ | BPF_K:
-			/* pc += (A == K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_EQ;
-			goto cmp_imm;
-		case BPF_JMP | BPF_JGT | BPF_K:
-			/* pc += (A > K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_HI;
-			goto cmp_imm;
-		case BPF_JMP | BPF_JGE | BPF_K:
-			/* pc += (A >= K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_HS;
-cmp_imm:
-			imm12 = imm8m(k);
-			if (imm12 < 0) {
-				emit_mov_i_no8m(r_scratch, k, ctx);
-				emit(ARM_CMP_R(r_A, r_scratch), ctx);
-			} else {
-				emit(ARM_CMP_I(r_A, imm12), ctx);
-			}
-cond_jump:
-			if (inst->jt)
-				_emit(condt, ARM_B(b_imm(i + inst->jt + 1,
-						   ctx)), ctx);
-			if (inst->jf)
-				_emit(condt ^ 1, ARM_B(b_imm(i + inst->jf + 1,
-							     ctx)), ctx);
+		}
+		break;
+	}
+	/* ST: *(size *)(dst + off) = imm */
+	case BPF_ST | BPF_MEM | BPF_W:
+	case BPF_ST | BPF_MEM | BPF_H:
+	case BPF_ST | BPF_MEM | BPF_B:
+	case BPF_ST | BPF_MEM | BPF_DW:
+		switch (BPF_SIZE(code)) {
+		case BPF_DW:
+			/* Sign-extend immediate value into temp reg */
+			emit_a32_mov_i64(true, tmp2, imm, false, ctx);
+			emit_str_r(dst_lo, tmp2[1], dstk, off, ctx, BPF_W);
+			emit_str_r(dst_lo, tmp2[0], dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_JMP | BPF_JEQ | BPF_X:
-			/* pc += (A == X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_EQ;
-			goto cmp_x;
-		case BPF_JMP | BPF_JGT | BPF_X:
-			/* pc += (A > X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_HI;
-			goto cmp_x;
-		case BPF_JMP | BPF_JGE | BPF_X:
-			/* pc += (A >= X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_CS;
-cmp_x:
-			update_on_xread(ctx);
-			emit(ARM_CMP_R(r_A, r_X), ctx);
-			goto cond_jump;
-		case BPF_JMP | BPF_JSET | BPF_K:
-			/* pc += (A & K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_NE;
-			/* not set iff all zeroes iff Z==1 iff EQ */
-
-			imm12 = imm8m(k);
-			if (imm12 < 0) {
-				emit_mov_i_no8m(r_scratch, k, ctx);
-				emit(ARM_TST_R(r_A, r_scratch), ctx);
-			} else {
-				emit(ARM_TST_I(r_A, imm12), ctx);
-			}
-			goto cond_jump;
-		case BPF_JMP | BPF_JSET | BPF_X:
-			/* pc += (A & X) ? pc->jt : pc->jf */
-			update_on_xread(ctx);
-			condt  = ARM_COND_NE;
-			emit(ARM_TST_R(r_A, r_X), ctx);
-			goto cond_jump;
-		case BPF_RET | BPF_A:
-			emit(ARM_MOV_R(ARM_R0, r_A), ctx);
-			goto b_epilogue;
-		case BPF_RET | BPF_K:
-			if ((k == 0) && (ctx->ret0_fp_idx < 0))
-				ctx->ret0_fp_idx = i;
-			emit_mov_i(ARM_R0, k, ctx);
-b_epilogue:
-			if (i != ctx->skf->len - 1)
-				emit(ARM_B(b_imm(prog->len, ctx)), ctx);
+		case BPF_W:
+		case BPF_H:
+		case BPF_B:
+			emit_a32_mov_i(tmp2[1], imm, false, ctx);
+			emit_str_r(dst_lo, tmp2[1], dstk, off, ctx,
+				   BPF_SIZE(code));
 			break;
-		case BPF_MISC | BPF_TAX:
-			/* X = A */
-			ctx->seen |= SEEN_X;
-			emit(ARM_MOV_R(r_X, r_A), ctx);
+		}
+		break;
+	/* STX XADD: lock *(u32 *)(dst + off) += src */
+	case BPF_STX | BPF_XADD | BPF_W:
+	/* STX XADD: lock *(u64 *)(dst + off) += src */
+	case BPF_STX | BPF_XADD | BPF_DW:
+		goto notyet;
+	/* STX: *(size *)(dst + off) = src */
+	case BPF_STX | BPF_MEM | BPF_W:
+	case BPF_STX | BPF_MEM | BPF_H:
+	case BPF_STX | BPF_MEM | BPF_B:
+	case BPF_STX | BPF_MEM | BPF_DW:
+	{
+		u8 sz = BPF_SIZE(code);
+
+		rn = sstk ? tmp2[1] : src_lo;
+		rm = sstk ? tmp2[0] : src_hi;
+		if (!sstk)
+			goto do_store;
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+		case BPF_H:
+			emit(ARM_LDRH_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+		case BPF_B:
+			emit(ARM_LDRB_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+empty_hi:
+			emit(ARM_EOR_R(rm, rm, rm), ctx);
+			break;
+		case BPF_DW:
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
+			sz = BPF_W;
 			break;
-		case BPF_MISC | BPF_TXA:
-			/* A = X */
-			update_on_xread(ctx);
-			emit(ARM_MOV_R(r_A, r_X), ctx);
+		}
+
+do_store:
+		/* Clear higher word except for BPF_DW */
+		if (BPF_SIZE(code) != BPF_DW)
+			emit(ARM_EOR_R(rm, rm, rm), ctx);
+
+		/* Store the value */
+		emit_str_r(dst_lo, rn, dstk, off, ctx, sz);
+		emit_str_r(dst_lo, rm, dstk, off+4, ctx, BPF_W);
+		break;
+	}
+	/* PC += off if dst == src */
+	/* PC += off if dst > src */
+	/* PC += off if dst >= src */
+	/* PC += off if dst != src */
+	/* PC += off if dst > src (signed) */
+	/* PC += off if dst >= src (signed) */
+	/* PC += off if dst & src */
+	case BPF_JMP | BPF_JEQ | BPF_X:
+	case BPF_JMP | BPF_JGT | BPF_X:
+	case BPF_JMP | BPF_JGE | BPF_X:
+	case BPF_JMP | BPF_JNE | BPF_X:
+	case BPF_JMP | BPF_JSGT | BPF_X:
+	case BPF_JMP | BPF_JSGE | BPF_X:
+	case BPF_JMP | BPF_JSET | BPF_X:
+		/* Setup source registers */
+		rm = sstk ? tmp2[0] : src_hi;
+		rn = sstk ? tmp2[1] : src_lo;
+		if (sstk) {
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
+		}
+		goto go_jmp;
+	/* PC += off if dst == imm */
+	/* PC += off if dst > imm */
+	/* PC += off if dst >= imm */
+	/* PC += off if dst != imm */
+	/* PC += off if dst > imm (signed) */
+	/* PC += off if dst >= imm (signed) */
+	/* PC += off if dst & imm */
+	case BPF_JMP | BPF_JEQ | BPF_K:
+	case BPF_JMP | BPF_JGT | BPF_K:
+	case BPF_JMP | BPF_JGE | BPF_K:
+	case BPF_JMP | BPF_JNE | BPF_K:
+	case BPF_JMP | BPF_JSGT | BPF_K:
+	case BPF_JMP | BPF_JSGE | BPF_K:
+	case BPF_JMP | BPF_JSET | BPF_K:
+		if (off == 0)
 			break;
-		case BPF_ANC | SKF_AD_PROTOCOL:
-			/* A = ntohs(skb->protocol) */
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  protocol) != 2);
-			off = offsetof(struct sk_buff, protocol);
-			emit(ARM_LDRH_I(r_scratch, r_skb, off), ctx);
-			emit_swap16(r_A, r_scratch, ctx);
+		rm = tmp2[0];
+		rn = tmp2[1];
+		/* Sign-extend immediate value */
+		emit_a32_mov_i64(true, tmp2, imm, false, ctx);
+go_jmp:
+		/* Setup destination register */
+		rd = dstk ? tmp[0] : dst_hi;
+		rt = dstk ? tmp[1] : dst_lo;
+		if (dstk) {
+			emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+
+		/* Check for the condition */
+		emit_ar_r(rd, rt, rm, rn, ctx, BPF_OP(code));
+
+		/* Setup JUMP instruction */
+		jmp_offset = bpf2a32_offset(i+off, i, ctx);
+		switch (BPF_OP(code)) {
+		case BPF_JNE:
+		case BPF_JSET:
+			_emit(ARM_COND_NE, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_CPU:
-			/* r_scratch = current_thread_info() */
-			OP_IMM3(ARM_BIC, r_scratch, ARM_SP, THREAD_SIZE - 1, ctx);
-			/* A = current_thread_info()->cpu */
-			BUILD_BUG_ON(FIELD_SIZEOF(struct thread_info, cpu) != 4);
-			off = offsetof(struct thread_info, cpu);
-			emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
+		case BPF_JEQ:
+			_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_IFINDEX:
-		case BPF_ANC | SKF_AD_HATYPE:
-			/* A = skb->dev->ifindex */
-			/* A = skb->dev->type */
-			ctx->seen |= SEEN_SKB;
-			off = offsetof(struct sk_buff, dev);
-			emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
-
-			emit(ARM_CMP_I(r_scratch, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-
-			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
-						  ifindex) != 4);
-			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
-						  type) != 2);
-
-			if (code == (BPF_ANC | SKF_AD_IFINDEX)) {
-				off = offsetof(struct net_device, ifindex);
-				emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
-			} else {
-				/*
-				 * offset of field "type" in "struct
-				 * net_device" is above what can be
-				 * used in the ldrh rd, [rn, #imm]
-				 * instruction, so load the offset in
-				 * a register and use ldrh rd, [rn, rm]
-				 */
-				off = offsetof(struct net_device, type);
-				emit_mov_i(ARM_R3, off, ctx);
-				emit(ARM_LDRH_R(r_A, r_scratch, ARM_R3), ctx);
-			}
+		case BPF_JGT:
+			_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_MARK:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
-			off = offsetof(struct sk_buff, mark);
-			emit(ARM_LDR_I(r_A, r_skb, off), ctx);
+		case BPF_JGE:
+			_emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_RXHASH:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, hash) != 4);
-			off = offsetof(struct sk_buff, hash);
-			emit(ARM_LDR_I(r_A, r_skb, off), ctx);
+		case BPF_JSGT:
+			_emit(ARM_COND_LT, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_VLAN_TAG:
-		case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2);
-			off = offsetof(struct sk_buff, vlan_tci);
-			emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
-			if (code == (BPF_ANC | SKF_AD_VLAN_TAG))
-				OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx);
-			else {
-				OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx);
-				OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
-			}
+		case BPF_JSGE:
+			_emit(ARM_COND_GE, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_PKTTYPE:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  __pkt_type_offset[0]) != 1);
-			off = PKT_TYPE_OFFSET();
-			emit(ARM_LDRB_I(r_A, r_skb, off), ctx);
-			emit(ARM_AND_I(r_A, r_A, PKT_TYPE_MAX), ctx);
-#ifdef __BIG_ENDIAN_BITFIELD
-			emit(ARM_LSR_I(r_A, r_A, 5), ctx);
-#endif
+		}
+		break;
+	/* JMP OFF */
+	case BPF_JMP | BPF_JA:
+	{
+		if (off == 0)
 			break;
-		case BPF_ANC | SKF_AD_QUEUE:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  queue_mapping) != 2);
-			BUILD_BUG_ON(offsetof(struct sk_buff,
-					      queue_mapping) > 0xff);
-			off = offsetof(struct sk_buff, queue_mapping);
-			emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
+		jmp_offset = bpf2a32_offset(i+off, i, ctx);
+		check_imm24(jmp_offset);
+		emit(ARM_B(jmp_offset), ctx);
+		break;
+	}
+	/* tail call */
+	case BPF_JMP | BPF_CALL | BPF_X:
+		if (emit_bpf_tail_call(ctx))
+			return -EFAULT;
+		break;
+	/* function call */
+	case BPF_JMP | BPF_CALL:
+	{
+		const u8 *r0 = bpf2a32[BPF_REG_0];
+		const u8 *r1 = bpf2a32[BPF_REG_1];
+		const u8 *r2 = bpf2a32[BPF_REG_2];
+		const u8 *r3 = bpf2a32[BPF_REG_3];
+		const u8 *r4 = bpf2a32[BPF_REG_4];
+		const u8 *r5 = bpf2a32[BPF_REG_5];
+		const u32 func = (u32)__bpf_call_base + imm;
+
+		emit_a32_mov_r64(true, r0, r1, false, false, ctx);
+		emit_a32_mov_r64(true, r1, r2, false, true, ctx);
+		emit_push_r64(r5, 0, ctx);
+		emit_push_r64(r4, 8, ctx);
+		emit_push_r64(r3, 16, ctx);
+
+		emit_a32_mov_i(tmp[1], func, false, ctx);
+		emit_blx_r(tmp[1], ctx);
+
+		/* caller cleanup: pop the 24 bytes pushed for r3-r5 */
+		emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(24)), ctx);
+		break;
+	}
+	/* function return */
+	case BPF_JMP | BPF_EXIT:
+		/* Optimization: when the last instruction is EXIT,
+		 * simply fall through to the epilogue.
+		 */
+		if (i == ctx->prog->len - 1)
 			break;
-		case BPF_ANC | SKF_AD_PAY_OFFSET:
-			ctx->seen |= SEEN_SKB | SEEN_CALL;
+		jmp_offset = epilogue_offset(ctx);
+		check_imm24(jmp_offset);
+		emit(ARM_B(jmp_offset), ctx);
+		break;
+notyet:
+		pr_info_once("*** NOT YET: opcode %02x ***\n", code);
+		return -EFAULT;
+	default:
+		pr_err_once("unknown opcode %02x\n", code);
+		return -EINVAL;
+	}
 
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			emit_mov_i(ARM_R3, (unsigned int)skb_get_poff, ctx);
-			emit_blx_r(ARM_R3, ctx);
-			emit(ARM_MOV_R(r_A, ARM_R0), ctx);
-			break;
-		case BPF_LDX | BPF_W | BPF_ABS:
-			/*
-			 * load a 32bit word from struct seccomp_data.
-			 * seccomp_check_filter() will already have checked
-			 * that k is 32bit aligned and lies within the
-			 * struct seccomp_data.
-			 */
-			ctx->seen |= SEEN_SKB;
-			emit(ARM_LDR_I(r_A, r_skb, k), ctx);
-			break;
-		default:
-			return -1;
+	if (ctx->flags & FLAG_IMM_OVERFLOW)
+		/*
+		 * this instruction generated an overflow when
+		 * trying to access the literal pool, so
+		 * delegate this filter to the kernel interpreter.
+		 */
+		return -1;
+	return 0;
+}
+
+static int build_body(struct jit_ctx *ctx)
+{
+	const struct bpf_prog *prog = ctx->prog;
+	unsigned int i;
+
+	for (i = 0; i < prog->len; i++) {
+		const struct bpf_insn *insn = &(prog->insnsi[i]);
+		int ret;
+
+		ret = build_insn(insn, ctx);
+
+		/* A positive return value means a 64-bit immediate load,
+		 * which spans two eBPF instructions: skip the second slot.
+		 */
+		if (ret > 0) {
+			i++;
+			if (ctx->target == NULL)
+				ctx->offsets[i] = ctx->idx;
+			continue;
 		}
 
-		if (ctx->flags & FLAG_IMM_OVERFLOW)
-			/*
-			 * this instruction generated an overflow when
-			 * trying to access the literal pool, so
-			 * delegate this filter to the kernel interpreter.
-			 */
-			return -1;
+		if (ctx->target == NULL)
+			ctx->offsets[i] = ctx->idx;
+
+		/* If unsuccessful, return with error code */
+		if (ret)
+			return ret;
 	}
+	return 0;
+}
 
-	/* compute offsets only during the first pass */
-	if (ctx->target == NULL)
-		ctx->offsets[i] = ctx->idx * 4;
+static int validate_code(struct jit_ctx *ctx)
+{
+	int i;
+
+	for (i = 0; i < ctx->idx; i++) {
+		if (ctx->target[i] == __opcode_to_mem_arm(ARM_INST_UDF))
+			return -1;
+	}
 
 	return 0;
 }
 
+void bpf_jit_compile(struct bpf_prog *prog)
+{
+	/* Nothing to do here. We support Internal BPF. */
+}
 
-void bpf_jit_compile(struct bpf_prog *fp)
+struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
+#ifdef __LITTLE_ENDIAN
+	struct bpf_prog *tmp, *orig_prog = prog;
 	struct bpf_binary_header *header;
+	bool tmp_blinded = false;
 	struct jit_ctx ctx;
-	unsigned tmp_idx;
-	unsigned alloc_size;
-	u8 *target_ptr;
+	unsigned int tmp_idx;
+	unsigned int image_size;
+	u8 *image_ptr;
 
+	/* If BPF JIT was not enabled then we must fall back to
+	 * the interpreter.
+	 */
 	if (!bpf_jit_enable)
-		return;
+		return orig_prog;
 
-	memset(&ctx, 0, sizeof(ctx));
-	ctx.skf		= fp;
-	ctx.ret0_fp_idx = -1;
+	/* If constant blinding was enabled and we failed during blinding
+	 * then we must fall back to the interpreter. Otherwise, we save
+	 * the new JITed code.
+	 */
+	tmp = bpf_jit_blind_constants(prog);
 
-	ctx.offsets = kzalloc(4 * (ctx.skf->len + 1), GFP_KERNEL);
-	if (ctx.offsets == NULL)
-		return;
+	if (IS_ERR(tmp))
+		return orig_prog;
+	if (tmp != prog) {
+		tmp_blinded = true;
+		prog = tmp;
+	}
 
-	/* fake pass to fill in the ctx->seen */
-	if (unlikely(build_body(&ctx)))
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.prog = prog;
+
+	/* If we cannot allocate memory for offsets[], we must
+	 * fall back to the interpreter.
+	 */
+	ctx.offsets = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
+	if (ctx.offsets == NULL) {
+		prog = orig_prog;
 		goto out;
+	}
+
+	/* 1) Fake pass to find the length of the JITed code,
+	 * to compute ctx->offsets and other context variables
+	 * needed to compute the final JITed code.
+	 * Also, calculate the random starting pointer/start of the JITed
+	 * code, which is prefixed by a random number of fault instructions.
+	 *
+	 * If the first pass fails then there is no chance of it
+	 * being successful in the second pass, so just fall back
+	 * to the interpreter.
+	 */
+	if (build_body(&ctx)) {
+		prog = orig_prog;
+		goto out_off;
+	}
 
 	tmp_idx = ctx.idx;
 	build_prologue(&ctx);
 	ctx.prologue_bytes = (ctx.idx - tmp_idx) * 4;
 
+	ctx.epilogue_offset = ctx.idx;
+
 #if __LINUX_ARM_ARCH__ < 7
 	tmp_idx = ctx.idx;
 	build_epilogue(&ctx);
@@ -1021,64 +1863,96 @@ void bpf_jit_compile(struct bpf_prog *fp)
 
 	ctx.idx += ctx.imm_count;
 	if (ctx.imm_count) {
-		ctx.imms = kzalloc(4 * ctx.imm_count, GFP_KERNEL);
-		if (ctx.imms == NULL)
-			goto out;
+		ctx.imms = kcalloc(ctx.imm_count, sizeof(u32), GFP_KERNEL);
+		if (ctx.imms == NULL) {
+			prog = orig_prog;
+			goto out_off;
+		}
 	}
 #else
-	/* there's nothing after the epilogue on ARMv7 */
+	/* there is nothing after the epilogue on ARMv7 */
 	build_epilogue(&ctx);
 #endif
-	alloc_size = 4 * ctx.idx;
-	header = bpf_jit_binary_alloc(alloc_size, &target_ptr,
-				      4, jit_fill_hole);
-	if (header == NULL)
-		goto out;
+	/* Now we can get the actual image size of the JITed arm code.
+	 * Currently, we are not considering the THUMB-2 instructions
+	 * for jit, although it can decrease the size of the image.
+	 *
+	 * As each arm instruction is 32 bits long, we translate the
+	 * number of JITed instructions into the size required to store
+	 * the JITed code.
+	 */
+	image_size = sizeof(u32) * ctx.idx;
+
+	/* Now we know the size of the structure to make */
+	header = bpf_jit_binary_alloc(image_size, &image_ptr,
+				      sizeof(u32), jit_fill_hole);
+	/* If we cannot allocate memory for the image, we must
+	 * fall back to the interpreter.
+	 */
+	if (header == NULL) {
+		prog = orig_prog;
+		goto out_imms;
+	}
 
-	ctx.target = (u32 *) target_ptr;
+	/* 2) Actual pass to generate final JIT code */
+	ctx.target = (u32 *) image_ptr;
 	ctx.idx = 0;
 
 	build_prologue(&ctx);
+
+	/* If building the body of the JITed code fails somehow,
+	 * we fall back to the interpreter.
+	 */
 	if (build_body(&ctx) < 0) {
-#if __LINUX_ARM_ARCH__ < 7
-		if (ctx.imm_count)
-			kfree(ctx.imms);
-#endif
+		image_ptr = NULL;
 		bpf_jit_binary_free(header);
-		goto out;
+		prog = orig_prog;
+		goto out_imms;
 	}
 	build_epilogue(&ctx);
 
+	/* 3) Extra pass to validate JITed code */
+	if (validate_code(&ctx)) {
+		image_ptr = NULL;
+		bpf_jit_binary_free(header);
+		prog = orig_prog;
+		goto out_imms;
+	}
 	flush_icache_range((u32)header, (u32)(ctx.target + ctx.idx));
 
-#if __LINUX_ARM_ARCH__ < 7
-	if (ctx.imm_count)
-		kfree(ctx.imms);
-#endif
-
 	if (bpf_jit_enable > 1)
 		/* there are 2 passes here */
-		bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
+		bpf_jit_dump(prog->len, image_size, 2, ctx.target);
 
 	set_memory_ro((unsigned long)header, header->pages);
-	fp->bpf_func = (void *)ctx.target;
-	fp->jited = 1;
-out:
+	prog->bpf_func = (void *)ctx.target;
+	prog->jited = 1;
+out_imms:
+#if __LINUX_ARM_ARCH__ < 7
+	if (ctx.imm_count)
+		kfree(ctx.imms);
+#endif
+out_off:
 	kfree(ctx.offsets);
-	return;
+out:
+	if (tmp_blinded)
+		bpf_jit_prog_release_other(prog, prog == orig_prog ?
+					   tmp : orig_prog);
+#endif /* __LITTLE_ENDIAN */
+	return prog;
 }
 
-void bpf_jit_free(struct bpf_prog *fp)
+void bpf_jit_free(struct bpf_prog *prog)
 {
-	unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
+	unsigned long addr = (unsigned long)prog->bpf_func & PAGE_MASK;
 	struct bpf_binary_header *header = (void *)addr;
 
-	if (!fp->jited)
+	if (!prog->jited)
 		goto free_filter;
 
 	set_memory_rw(addr, header->pages);
 	bpf_jit_binary_free(header);
 
 free_filter:
-	bpf_prog_unlock_free(fp);
+	bpf_prog_unlock_free(prog);
 }
diff --git a/arch/arm/net/bpf_jit_32.h b/arch/arm/net/bpf_jit_32.h
index c46fca2..d5cf5f6 100644
--- a/arch/arm/net/bpf_jit_32.h
+++ b/arch/arm/net/bpf_jit_32.h
@@ -11,6 +11,7 @@
 #ifndef PFILTER_OPCODES_ARM_H
 #define PFILTER_OPCODES_ARM_H
 
+/* ARM 32bit Registers */
 #define ARM_R0	0
 #define ARM_R1	1
 #define ARM_R2	2
@@ -22,38 +23,43 @@
 #define ARM_R8	8
 #define ARM_R9	9
 #define ARM_R10	10
-#define ARM_FP	11
-#define ARM_IP	12
-#define ARM_SP	13
-#define ARM_LR	14
-#define ARM_PC	15
-
-#define ARM_COND_EQ		0x0
-#define ARM_COND_NE		0x1
-#define ARM_COND_CS		0x2
+#define ARM_FP	11	/* Frame Pointer */
+#define ARM_IP	12	/* Intra-procedure scratch register */
+#define ARM_SP	13	/* Stack pointer: as load/store base reg */
+#define ARM_LR	14	/* Link Register */
+#define ARM_PC	15	/* Program counter */
+
+#define ARM_COND_EQ		0x0	/* == */
+#define ARM_COND_NE		0x1	/* != */
+#define ARM_COND_CS		0x2	/* unsigned >= */
 #define ARM_COND_HS		ARM_COND_CS
-#define ARM_COND_CC		0x3
+#define ARM_COND_CC		0x3	/* unsigned < */
 #define ARM_COND_LO		ARM_COND_CC
-#define ARM_COND_MI		0x4
-#define ARM_COND_PL		0x5
-#define ARM_COND_VS		0x6
-#define ARM_COND_VC		0x7
-#define ARM_COND_HI		0x8
-#define ARM_COND_LS		0x9
-#define ARM_COND_GE		0xa
-#define ARM_COND_LT		0xb
-#define ARM_COND_GT		0xc
-#define ARM_COND_LE		0xd
-#define ARM_COND_AL		0xe
+#define ARM_COND_MI		0x4	/* < 0 */
+#define ARM_COND_PL		0x5	/* >= 0 */
+#define ARM_COND_VS		0x6	/* Signed Overflow */
+#define ARM_COND_VC		0x7	/* No Signed Overflow */
+#define ARM_COND_HI		0x8	/* unsigned > */
+#define ARM_COND_LS		0x9	/* unsigned <= */
+#define ARM_COND_GE		0xa	/* Signed >= */
+#define ARM_COND_LT		0xb	/* Signed < */
+#define ARM_COND_GT		0xc	/* Signed > */
+#define ARM_COND_LE		0xd	/* Signed <= */
+#define ARM_COND_AL		0xe	/* Always (unconditional) */
 
 /* register shift types */
 #define SRTYPE_LSL		0
 #define SRTYPE_LSR		1
 #define SRTYPE_ASR		2
 #define SRTYPE_ROR		3
+#define SRTYPE_ASL		(SRTYPE_LSL)
 
 #define ARM_INST_ADD_R		0x00800000
+#define ARM_INST_ADDS_R		0x00900000
+#define ARM_INST_ADC_R		0x00a00000
+#define ARM_INST_ADC_I		0x02a00000
 #define ARM_INST_ADD_I		0x02800000
+#define ARM_INST_ADDS_I		0x02900000
 
 #define ARM_INST_AND_R		0x00000000
 #define ARM_INST_AND_I		0x02000000
@@ -76,8 +82,10 @@
 #define ARM_INST_LDRH_I		0x01d000b0
 #define ARM_INST_LDRH_R		0x019000b0
 #define ARM_INST_LDR_I		0x05900000
+#define ARM_INST_LDR_R		0x07900000
 
 #define ARM_INST_LDM		0x08900000
+#define ARM_INST_LDM_IA		0x08b00000
 
 #define ARM_INST_LSL_I		0x01a00000
 #define ARM_INST_LSL_R		0x01a00010
@@ -86,6 +94,7 @@
 #define ARM_INST_LSR_R		0x01a00030
 
 #define ARM_INST_MOV_R		0x01a00000
+#define ARM_INST_MOVS_R		0x01b00000
 #define ARM_INST_MOV_I		0x03a00000
 #define ARM_INST_MOVW		0x03000000
 #define ARM_INST_MOVT		0x03400000
@@ -96,17 +105,28 @@
 #define ARM_INST_PUSH		0x092d0000
 
 #define ARM_INST_ORR_R		0x01800000
+#define ARM_INST_ORRS_R		0x01900000
 #define ARM_INST_ORR_I		0x03800000
 
 #define ARM_INST_REV		0x06bf0f30
 #define ARM_INST_REV16		0x06bf0fb0
 
 #define ARM_INST_RSB_I		0x02600000
+#define ARM_INST_RSBS_I		0x02700000
+#define ARM_INST_RSC_I		0x02e00000
 
 #define ARM_INST_SUB_R		0x00400000
+#define ARM_INST_SUBS_R		0x00500000
+#define ARM_INST_RSB_R		0x00600000
 #define ARM_INST_SUB_I		0x02400000
+#define ARM_INST_SUBS_I		0x02500000
+#define ARM_INST_SBC_I		0x02c00000
+#define ARM_INST_SBC_R		0x00c00000
+#define ARM_INST_SBCS_R		0x00d00000
 
 #define ARM_INST_STR_I		0x05800000
+#define ARM_INST_STRB_I		0x05c00000
+#define ARM_INST_STRH_I		0x01c000b0
 
 #define ARM_INST_TST_R		0x01100000
 #define ARM_INST_TST_I		0x03100000
@@ -117,6 +137,8 @@
 
 #define ARM_INST_MLS		0x00600090
 
+#define ARM_INST_UXTH		0x06ff0070
+
 /*
  * Use a suitable undefined instruction to use for ARM/Thumb2 faulting.
  * We need to be careful not to conflict with those used by other modules
@@ -135,9 +157,15 @@
 #define _AL3_R(op, rd, rn, rm)	((op ## _R) | (rd) << 12 | (rn) << 16 | (rm))
 /* immediate */
 #define _AL3_I(op, rd, rn, imm)	((op ## _I) | (rd) << 12 | (rn) << 16 | (imm))
+/* register with register-shift */
+#define _AL3_SR(inst)	(inst | (1 << 4))
 
 #define ARM_ADD_R(rd, rn, rm)	_AL3_R(ARM_INST_ADD, rd, rn, rm)
+#define ARM_ADDS_R(rd, rn, rm)	_AL3_R(ARM_INST_ADDS, rd, rn, rm)
 #define ARM_ADD_I(rd, rn, imm)	_AL3_I(ARM_INST_ADD, rd, rn, imm)
+#define ARM_ADDS_I(rd, rn, imm)	_AL3_I(ARM_INST_ADDS, rd, rn, imm)
+#define ARM_ADC_R(rd, rn, rm)	_AL3_R(ARM_INST_ADC, rd, rn, rm)
+#define ARM_ADC_I(rd, rn, imm)	_AL3_I(ARM_INST_ADC, rd, rn, imm)
 
 #define ARM_AND_R(rd, rn, rm)	_AL3_R(ARM_INST_AND, rd, rn, rm)
 #define ARM_AND_I(rd, rn, imm)	_AL3_I(ARM_INST_AND, rd, rn, imm)
@@ -156,7 +184,9 @@
 #define ARM_EOR_I(rd, rn, imm)	_AL3_I(ARM_INST_EOR, rd, rn, imm)
 
 #define ARM_LDR_I(rt, rn, off)	(ARM_INST_LDR_I | (rt) << 12 | (rn) << 16 \
-				 | (off))
+				 | ((off) & 0xfff))
+#define ARM_LDR_R(rt, rn, rm)	(ARM_INST_LDR_R | (rt) << 12 | (rn) << 16 \
+				 | (rm))
 #define ARM_LDRB_I(rt, rn, off)	(ARM_INST_LDRB_I | (rt) << 12 | (rn) << 16 \
 				 | (off))
 #define ARM_LDRB_R(rt, rn, rm)	(ARM_INST_LDRB_R | (rt) << 12 | (rn) << 16 \
@@ -167,15 +197,23 @@
 				 | (rm))
 
 #define ARM_LDM(rn, regs)	(ARM_INST_LDM | (rn) << 16 | (regs))
+#define ARM_LDM_IA(rn, regs)	(ARM_INST_LDM_IA | (rn) << 16 | (regs))
 
 #define ARM_LSL_R(rd, rn, rm)	(_AL3_R(ARM_INST_LSL, rd, 0, rn) | (rm) << 8)
 #define ARM_LSL_I(rd, rn, imm)	(_AL3_I(ARM_INST_LSL, rd, 0, rn) | (imm) << 7)
 
 #define ARM_LSR_R(rd, rn, rm)	(_AL3_R(ARM_INST_LSR, rd, 0, rn) | (rm) << 8)
 #define ARM_LSR_I(rd, rn, imm)	(_AL3_I(ARM_INST_LSR, rd, 0, rn) | (imm) << 7)
+#define ARM_ASR_R(rd, rn, rm)	(_AL3_R(ARM_INST_ASR, rd, 0, rn) | (rm) << 8)
+#define ARM_ASR_I(rd, rn, imm)	(_AL3_I(ARM_INST_ASR, rd, 0, rn) | (imm) << 7)
 
 #define ARM_MOV_R(rd, rm)	_AL3_R(ARM_INST_MOV, rd, 0, rm)
+#define ARM_MOVS_R(rd, rm)	_AL3_R(ARM_INST_MOVS, rd, 0, rm)
 #define ARM_MOV_I(rd, imm)	_AL3_I(ARM_INST_MOV, rd, 0, imm)
+#define ARM_MOV_SR(rd, rm, type, rs)	\
+	(_AL3_SR(ARM_MOV_R(rd, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_MOV_SI(rd, rm, type, imm6)	\
+	(ARM_MOV_R(rd, rm) | (type) << 5 | (imm6) << 7)
 
 #define ARM_MOVW(rd, imm)	\
 	(ARM_INST_MOVW | ((imm) >> 12) << 16 | (rd) << 12 | ((imm) & 0x0fff))
@@ -190,19 +228,38 @@
 
 #define ARM_ORR_R(rd, rn, rm)	_AL3_R(ARM_INST_ORR, rd, rn, rm)
 #define ARM_ORR_I(rd, rn, imm)	_AL3_I(ARM_INST_ORR, rd, rn, imm)
-#define ARM_ORR_S(rd, rn, rm, type, rs)	\
-	(ARM_ORR_R(rd, rn, rm) | (type) << 5 | (rs) << 7)
+#define ARM_ORR_SR(rd, rn, rm, type, rs)	\
+	(_AL3_SR(ARM_ORR_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_ORRS_R(rd, rn, rm)	_AL3_R(ARM_INST_ORRS, rd, rn, rm)
+#define ARM_ORRS_SR(rd, rn, rm, type, rs)	\
+	(_AL3_SR(ARM_ORRS_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_ORR_SI(rd, rn, rm, type, imm6)	\
+	(ARM_ORR_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
+#define ARM_ORRS_SI(rd, rn, rm, type, imm6)	\
+	(ARM_ORRS_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
 
 #define ARM_REV(rd, rm)		(ARM_INST_REV | (rd) << 12 | (rm))
 #define ARM_REV16(rd, rm)	(ARM_INST_REV16 | (rd) << 12 | (rm))
 
 #define ARM_RSB_I(rd, rn, imm)	_AL3_I(ARM_INST_RSB, rd, rn, imm)
+#define ARM_RSBS_I(rd, rn, imm)	_AL3_I(ARM_INST_RSBS, rd, rn, imm)
+#define ARM_RSC_I(rd, rn, imm)	_AL3_I(ARM_INST_RSC, rd, rn, imm)
 
 #define ARM_SUB_R(rd, rn, rm)	_AL3_R(ARM_INST_SUB, rd, rn, rm)
+#define ARM_SUBS_R(rd, rn, rm)	_AL3_R(ARM_INST_SUBS, rd, rn, rm)
+#define ARM_RSB_R(rd, rn, rm)	_AL3_R(ARM_INST_RSB, rd, rn, rm)
+#define ARM_SBC_R(rd, rn, rm)	_AL3_R(ARM_INST_SBC, rd, rn, rm)
+#define ARM_SBCS_R(rd, rn, rm)	_AL3_R(ARM_INST_SBCS, rd, rn, rm)
 #define ARM_SUB_I(rd, rn, imm)	_AL3_I(ARM_INST_SUB, rd, rn, imm)
+#define ARM_SUBS_I(rd, rn, imm)	_AL3_I(ARM_INST_SUBS, rd, rn, imm)
+#define ARM_SBC_I(rd, rn, imm)	_AL3_I(ARM_INST_SBC, rd, rn, imm)
 
 #define ARM_STR_I(rt, rn, off)	(ARM_INST_STR_I | (rt) << 12 | (rn) << 16 \
-				 | (off))
+				 | ((off) & 0xfff))
+#define ARM_STRH_I(rt, rn, off)	(ARM_INST_STRH_I | (rt) << 12 | (rn) << 16 \
+				 | (((off) & 0xf0) << 4) | ((off) & 0xf))
+#define ARM_STRB_I(rt, rn, off)	(ARM_INST_STRB_I | (rt) << 12 | (rn) << 16 \
+				 | ((off) & 0xfff))
 
 #define ARM_TST_R(rn, rm)	_AL3_R(ARM_INST_TST, 0, rn, rm)
 #define ARM_TST_I(rn, imm)	_AL3_I(ARM_INST_TST, 0, rn, imm)
@@ -214,5 +271,6 @@
 
 #define ARM_MLS(rd, rn, rm, ra)	(ARM_INST_MLS | (rd) << 16 | (rn) | (rm) << 8 \
 				 | (ra) << 12)
+#define ARM_UXTH(rd, rm)	(ARM_INST_UXTH | (rd) << 12 | (rm))
 
 #endif /* PFILTER_OPCODES_ARM_H */
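
Note on the immediate operands taken by macros such as ARM_ADD_I: the value
passed in is the already encoded 12-bit field (an 8-bit constant plus a
4-bit rotation), which is what imm8m() produces. The sketch below is a
userspace approximation for illustration only. ror32()/rol32() are local
stand-ins for the kernel helpers, encode_add_i() is a hypothetical name,
and the return type is widened to int.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel's ror32()/rol32(); safe for n == 0. */
static uint32_t ror32(uint32_t x, unsigned int n)
{
	n &= 31;
	return (x >> n) | (x << ((32 - n) & 31));
}

static uint32_t rol32(uint32_t x, unsigned int n)
{
	n &= 31;
	return (x << n) | (x >> ((32 - n) & 31));
}

/*
 * Search for an even rotation under which x fits in 8 bits.
 * Returns (rot << 8) | imm8, or -1 if x cannot be encoded.
 */
static int imm8m(uint32_t x)
{
	unsigned int rot;

	for (rot = 0; rot < 16; rot++)
		if ((x & ~ror32(0xff, 2 * rot)) == 0)
			return (int)(rol32(x, 2 * rot) | (rot << 8));
	return -1;
}

#define COND_AL		0xeu		/* same value as ARM_COND_AL */
#define INST_ADD_I	0x02800000u	/* same value as ARM_INST_ADD_I */

/* Hypothetical helper: ARM_ADD_I() with the AL condition folded in. */
static uint32_t encode_add_i(unsigned int rd, unsigned int rn, uint32_t imm12)
{
	return (COND_AL << 28) | INST_ADD_I | (rd << 12) | (rn << 16) | imm12;
}
```

With rd = rn = 13 (SP) and imm8m(24), encode_add_i() gives the word for
"add sp, sp, #24", the stack adjustment emitted after a helper call. A
caller should check for -1 before using the result as an instruction field.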
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH v2] arm: eBPF JIT compiler
@ 2017-07-06  3:49                               ` Shubham Bansal
  0 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-07-06  3:49 UTC (permalink / raw)
  To: Kees Cook
  Cc: Daniel Borkmann, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

[-- Attachment #1: Type: text/plain, Size: 460 bytes --]

Hi Kees,

The problem is that my ARM machine doesn't have clang and iproute2, which
is keeping me from testing the bpf tail calls.

You should do the following to test it:

1. cd tools/testing/selftests/bpf/
2. make
3. sudo ./test_progs

Before testing, you have to run "make headers_install".
These tests are for tail calls with the attached patch. If it's too
much work, can you please upload your ARM image so that I can test it?
I just need a good machine.

-Shubham

[-- Attachment #2: 0001-Added-Support-for-BPF_CALL-BPF_JMP.patch --]
[-- Type: application/octet-stream, Size: 87846 bytes --]

From 502dd777765a982ce1b479ee01911fa6fe023a76 Mon Sep 17 00:00:00 2001
From: Shubham Bansal <illusionist.neo@gmail.com>
Date: Sat, 24 Jun 2017 04:03:37 +0530
Subject: [PATCH] Added Support for BPF_CALL | BPF_JMP.

---
 arch/arm/Kconfig          |    2 +-
 arch/arm/net/bpf_jit_32.c | 2430 ++++++++++++++++++++++++++++++---------------
 arch/arm/net/bpf_jit_32.h |  108 +-
 3 files changed, 1736 insertions(+), 804 deletions(-)
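
Note on the register mapping in this patch: each 64-bit eBPF register is
held as two 32-bit halves (dst_hi = dst[0], dst_lo = dst[1]), either in an
ARM register pair or in stack scratch space. The sketch below models how a
64-bit add can be built from the halves with a carry, which is presumably
what the ADDS/ADC opcodes added to bpf_jit_32.h are for. struct reg64,
add64() and to_u64() are hypothetical names for illustration, not part of
the patch.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit eBPF register modelled as two 32-bit halves (hi first). */
struct reg64 {
	uint32_t hi;
	uint32_t lo;
};

/* 64-bit add built from two 32-bit adds and a carry. */
static struct reg64 add64(struct reg64 dst, struct reg64 src)
{
	struct reg64 res;
	uint32_t carry;

	res.lo = dst.lo + src.lo;		/* like ADDS: low words */
	carry = res.lo < dst.lo;		/* unsigned wrap means carry out */
	res.hi = dst.hi + src.hi + carry;	/* like ADC: high words + carry */
	return res;
}

/* Reassemble the halves, for checking against native 64-bit arithmetic. */
static uint64_t to_u64(struct reg64 r)
{
	return ((uint64_t)r.hi << 32) | r.lo;
}
```

For example, adding 1 to 0x00000001ffffffff carries out of the low word and
gives 0x0000000200000000.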

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 4c1a35f..53bf116 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -48,7 +48,7 @@ config ARM
 	select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARM_SMCCC if CPU_V7
-	select HAVE_CBPF_JIT
+	select HAVE_EBPF_JIT
 	select HAVE_CC_STACKPROTECTOR
 	select HAVE_CONTEXT_TRACKING
 	select HAVE_C_RECORDMCOUNT
diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index d5b9fa1..8b8ddc4 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -1,13 +1,15 @@
 /*
- * Just-In-Time compiler for BPF filters on 32bit ARM
+ * Just-In-Time compiler for eBPF filters on 32bit ARM
  *
  * Copyright (c) 2011 Mircea Gherzan <mgherzan@gmail.com>
+ * Copyright (c) 2017 Shubham Bansal <illusionist.neo@gmail.com>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License as published by the
  * Free Software Foundation; version 2 of the License.
  */
 
+#include <linux/bpf.h>
 #include <linux/bitops.h>
 #include <linux/compiler.h>
 #include <linux/errno.h>
@@ -18,50 +20,96 @@
 #include <linux/if_vlan.h>
 
 #include <asm/cacheflush.h>
-#include <asm/set_memory.h>
 #include <asm/hwcap.h>
 #include <asm/opcodes.h>
 
 #include "bpf_jit_32.h"
 
+int bpf_jit_enable __read_mostly;
+
+#define STACK_OFFSET(k)	(k)
+#define TMP_REG_1	(MAX_BPF_JIT_REG + 0)	/* TEMP Register 1 */
+#define TMP_REG_2	(MAX_BPF_JIT_REG + 1)	/* TEMP Register 2 */
+#define TCALL_CNT	(MAX_BPF_JIT_REG + 2)	/* Tail Call Count */
+
+/* Flags used for JIT optimization */
+#define SEEN_CALL	(1 << 0)
+
+#define FLAG_IMM_OVERFLOW	(1 << 0)
+
 /*
- * ABI:
+ * Map eBPF registers to ARM 32-bit registers or stack scratch space.
+ *
+ * 1. The first argument is passed in ARM 32-bit registers and the rest of
+ * the arguments are passed in stack scratch space.
+ * 2. The first callee-saved register is mapped to ARM 32-bit registers and
+ * the rest are mapped to scratch space on the stack.
+ * 3. We need two 64-bit temp registers to do complex operations on eBPF
+ * registers.
+ *
+ * As the eBPF registers are all 64-bit registers and ARM has only 32-bit
+ * registers, we have to map each eBPF register to two ARM 32-bit registers
+ * or to scratch memory space, and build the 64-bit eBPF register from those.
  *
- * r0	scratch register
- * r4	BPF register A
- * r5	BPF register X
- * r6	pointer to the skb
- * r7	skb->data
- * r8	skb_headlen(skb)
  */
+static const u8 bpf2a32[][2] = {
+	/* return value from in-kernel function, and exit value from eBPF */
+	[BPF_REG_0] = {ARM_R1, ARM_R0},
+	/* arguments from eBPF program to in-kernel function */
+	[BPF_REG_1] = {ARM_R3, ARM_R2},
+	/* Stored on stack scratch space */
+	[BPF_REG_2] = {STACK_OFFSET(0), STACK_OFFSET(4)},
+	[BPF_REG_3] = {STACK_OFFSET(8), STACK_OFFSET(12)},
+	[BPF_REG_4] = {STACK_OFFSET(16), STACK_OFFSET(20)},
+	[BPF_REG_5] = {STACK_OFFSET(24), STACK_OFFSET(28)},
+	/* callee saved registers that in-kernel function will preserve */
+	[BPF_REG_6] = {ARM_R5, ARM_R4},
+	/* Stored on stack scratch space */
+	[BPF_REG_7] = {STACK_OFFSET(32), STACK_OFFSET(36)},
+	[BPF_REG_8] = {STACK_OFFSET(40), STACK_OFFSET(44)},
+	[BPF_REG_9] = {STACK_OFFSET(48), STACK_OFFSET(52)},
+	/* Read only Frame Pointer to access Stack */
+	[BPF_REG_FP] = {STACK_OFFSET(56), STACK_OFFSET(60)},
+	/* Temporary registers for internal BPF JIT use; can be used
+	 * for constant blinding and other purposes.
+	 */
+	[TMP_REG_1] = {ARM_R7, ARM_R6},
+	[TMP_REG_2] = {ARM_R10, ARM_R8},
+	/* Tail call count. Stored on stack scratch space. */
+	[TCALL_CNT] = {STACK_OFFSET(64), STACK_OFFSET(68)},
+	/* temporary register for blinding constants.
+	 * Stored on stack scratch space.
+	 */
+	[BPF_REG_AX] = {STACK_OFFSET(72), STACK_OFFSET(76)},
+};
 
-#define r_scratch	ARM_R0
-/* r1-r3 are (also) used for the unaligned loads on the non-ARMv7 slowpath */
-#define r_off		ARM_R1
-#define r_A		ARM_R4
-#define r_X		ARM_R5
-#define r_skb		ARM_R6
-#define r_skb_data	ARM_R7
-#define r_skb_hl	ARM_R8
-
-#define SCRATCH_SP_OFFSET	0
-#define SCRATCH_OFF(k)		(SCRATCH_SP_OFFSET + 4 * (k))
-
-#define SEEN_MEM		((1 << BPF_MEMWORDS) - 1)
-#define SEEN_MEM_WORD(k)	(1 << (k))
-#define SEEN_X			(1 << BPF_MEMWORDS)
-#define SEEN_CALL		(1 << (BPF_MEMWORDS + 1))
-#define SEEN_SKB		(1 << (BPF_MEMWORDS + 2))
-#define SEEN_DATA		(1 << (BPF_MEMWORDS + 3))
+#define dst_lo	dst[1]
+#define dst_hi	dst[0]
+#define src_lo	src[1]
+#define src_hi	src[0]
 
-#define FLAG_NEED_X_RESET	(1 << 0)
-#define FLAG_IMM_OVERFLOW	(1 << 1)
+/*
+ * JIT Context:
+ *
+ * prog			:	bpf_prog
+ * idx			:	index of current last JITed instruction.
+ * prologue_bytes	:	bytes used in prologue.
+ * epilogue_offset	:	offset of epilogue starting.
+ * seen			:	bit mask used for JIT optimization.
+ * offsets		:	array of eBPF instruction offsets in
+ *				JITed code.
+ * target		:	final JITed code.
+ * epilogue_bytes	:	number of bytes used in epilogue.
+ * imm_count		:	number of immediates used for global
+ *				variables.
+ * imms			:	array of global variable addresses.
+ */
 
 struct jit_ctx {
-	const struct bpf_prog *skf;
-	unsigned idx;
-	unsigned prologue_bytes;
-	int ret0_fp_idx;
+	const struct bpf_prog *prog;
+	unsigned int idx;
+	unsigned int prologue_bytes;
+	unsigned int epilogue_offset;
 	u32 seen;
 	u32 flags;
 	u32 *offsets;
@@ -73,68 +121,16 @@ struct jit_ctx {
 #endif
 };
 
-int bpf_jit_enable __read_mostly;
-
-static inline int call_neg_helper(struct sk_buff *skb, int offset, void *ret,
-		      unsigned int size)
-{
-	void *ptr = bpf_internal_load_pointer_neg_helper(skb, offset, size);
-
-	if (!ptr)
-		return -EFAULT;
-	memcpy(ret, ptr, size);
-	return 0;
-}
-
-static u64 jit_get_skb_b(struct sk_buff *skb, int offset)
-{
-	u8 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 1);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 1);
-
-	return (u64)err << 32 | ret;
-}
-
-static u64 jit_get_skb_h(struct sk_buff *skb, int offset)
-{
-	u16 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 2);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 2);
-
-	return (u64)err << 32 | ntohs(ret);
-}
-
-static u64 jit_get_skb_w(struct sk_buff *skb, int offset)
-{
-	u32 ret;
-	int err;
-
-	if (offset < 0)
-		err = call_neg_helper(skb, offset, &ret, 4);
-	else
-		err = skb_copy_bits(skb, offset, &ret, 4);
-
-	return (u64)err << 32 | ntohl(ret);
-}
-
 /*
  * Wrappers which handle both OABI and EABI and assures Thumb2 interworking
  * (where the assembly routines like __aeabi_uidiv could cause problems).
  */
-static u32 jit_udiv(u32 dividend, u32 divisor)
+static u32 jit_udiv32(u32 dividend, u32 divisor)
 {
 	return dividend / divisor;
 }
 
-static u32 jit_mod(u32 dividend, u32 divisor)
+static u32 jit_mod32(u32 dividend, u32 divisor)
 {
 	return dividend % divisor;
 }
@@ -158,36 +154,22 @@ static inline void emit(u32 inst, struct jit_ctx *ctx)
 	_emit(ARM_COND_AL, inst, ctx);
 }
 
-static u16 saved_regs(struct jit_ctx *ctx)
+/*
+ * Converts an immediate value to the ARM imm12 encoding (8-bit value
+ * plus 4-bit rotation). Returns -1 if the value cannot be encoded.
+ */
+static int16_t imm8m(u32 x)
 {
-	u16 ret = 0;
-
-	if ((ctx->skf->len > 1) ||
-	    (ctx->skf->insns[0].code == (BPF_RET | BPF_A)))
-		ret |= 1 << r_A;
-
-#ifdef CONFIG_FRAME_POINTER
-	ret |= (1 << ARM_FP) | (1 << ARM_IP) | (1 << ARM_LR) | (1 << ARM_PC);
-#else
-	if (ctx->seen & SEEN_CALL)
-		ret |= 1 << ARM_LR;
-#endif
-	if (ctx->seen & (SEEN_DATA | SEEN_SKB))
-		ret |= 1 << r_skb;
-	if (ctx->seen & SEEN_DATA)
-		ret |= (1 << r_skb_data) | (1 << r_skb_hl);
-	if (ctx->seen & SEEN_X)
-		ret |= 1 << r_X;
-
-	return ret;
-}
+	u32 rot;
 
-static inline int mem_words_used(struct jit_ctx *ctx)
-{
-	/* yes, we do waste some stack space IF there are "holes" in the set" */
-	return fls(ctx->seen & SEEN_MEM);
+	for (rot = 0; rot < 16; rot++)
+		if ((x & ~ror32(0xff, 2 * rot)) == 0)
+			return rol32(x, 2 * rot) | (rot << 8);
+	return -1;
 }
 
+/*
+ * Initializes the JIT space with undefined instructions.
+ */
 static void jit_fill_hole(void *area, unsigned int size)
 {
 	u32 *ptr;
@@ -196,88 +178,34 @@ static void jit_fill_hole(void *area, unsigned int size)
 		*ptr++ = __opcode_to_mem_arm(ARM_INST_UDF);
 }
 
-static void build_prologue(struct jit_ctx *ctx)
-{
-	u16 reg_set = saved_regs(ctx);
-	u16 off;
-
-#ifdef CONFIG_FRAME_POINTER
-	emit(ARM_MOV_R(ARM_IP, ARM_SP), ctx);
-	emit(ARM_PUSH(reg_set), ctx);
-	emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
-#else
-	if (reg_set)
-		emit(ARM_PUSH(reg_set), ctx);
-#endif
-
-	if (ctx->seen & (SEEN_DATA | SEEN_SKB))
-		emit(ARM_MOV_R(r_skb, ARM_R0), ctx);
-
-	if (ctx->seen & SEEN_DATA) {
-		off = offsetof(struct sk_buff, data);
-		emit(ARM_LDR_I(r_skb_data, r_skb, off), ctx);
-		/* headlen = len - data_len */
-		off = offsetof(struct sk_buff, len);
-		emit(ARM_LDR_I(r_skb_hl, r_skb, off), ctx);
-		off = offsetof(struct sk_buff, data_len);
-		emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
-		emit(ARM_SUB_R(r_skb_hl, r_skb_hl, r_scratch), ctx);
-	}
+/* Stack size must be a multiple of 16 bytes */
+#define STACK_ALIGN(sz) (((sz) + 15) & ~15)
 
-	if (ctx->flags & FLAG_NEED_X_RESET)
-		emit(ARM_MOV_I(r_X, 0), ctx);
-
-	/* do not leak kernel data to userspace */
-	if (bpf_needs_clear_a(&ctx->skf->insns[0]))
-		emit(ARM_MOV_I(r_A, 0), ctx);
-
-	/* stack space for the BPF_MEM words */
-	if (ctx->seen & SEEN_MEM)
-		emit(ARM_SUB_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
-}
-
-static void build_epilogue(struct jit_ctx *ctx)
-{
-	u16 reg_set = saved_regs(ctx);
-
-	if (ctx->seen & SEEN_MEM)
-		emit(ARM_ADD_I(ARM_SP, ARM_SP, mem_words_used(ctx) * 4), ctx);
-
-	reg_set &= ~(1 << ARM_LR);
-
-#ifdef CONFIG_FRAME_POINTER
-	/* the first instruction of the prologue was: mov ip, sp */
-	reg_set &= ~(1 << ARM_IP);
-	reg_set |= (1 << ARM_SP);
-	emit(ARM_LDM(ARM_SP, reg_set), ctx);
-#else
-	if (reg_set) {
-		if (ctx->seen & SEEN_CALL)
-			reg_set |= 1 << ARM_PC;
-		emit(ARM_POP(reg_set), ctx);
-	}
+/* Stack space for BPF_REG_2, BPF_REG_3, BPF_REG_4,
+ * BPF_REG_5, BPF_REG_7, BPF_REG_8, BPF_REG_9,
+ * BPF_REG_FP and Tail call counts.
+ */
+#define SCRATCH_SIZE 80
 
-	if (!(ctx->seen & SEEN_CALL))
-		emit(ARM_BX(ARM_LR), ctx);
-#endif
-}
+/* total stack size used in JITed code */
+#define _STACK_SIZE \
+	(MAX_BPF_STACK + \
+	 SCRATCH_SIZE + \
+	 4 /* extra for skb_copy_bits buffer */)
 
-static int16_t imm8m(u32 x)
-{
-	u32 rot;
+#define STACK_SIZE STACK_ALIGN(_STACK_SIZE)
 
-	for (rot = 0; rot < 16; rot++)
-		if ((x & ~ror32(0xff, 2 * rot)) == 0)
-			return rol32(x, 2 * rot) | (rot << 8);
+/* Get the offset of eBPF REGISTERs stored on scratch space. */
+#define STACK_VAR(off) (STACK_SIZE-off-4)
 
-	return -1;
-}
+/* Offset of skb_copy_bits buffer */
+#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
 
 #if __LINUX_ARM_ARCH__ < 7
 
 static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 {
-	unsigned i = 0, offset;
+	unsigned int i = 0, offset;
 	u16 imm;
 
 	/* on the "fake" run we just count them (duplicates included) */
@@ -296,7 +224,7 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 		ctx->imms[i] = k;
 
 	/* constants go just after the epilogue */
-	offset =  ctx->offsets[ctx->skf->len];
+	offset =  ctx->offsets[ctx->prog->len - 1] * 4;
 	offset += ctx->prologue_bytes;
 	offset += ctx->epilogue_bytes;
 	offset += i * 4;
@@ -320,10 +248,22 @@ static u16 imm_offset(u32 k, struct jit_ctx *ctx)
 
 #endif /* __LINUX_ARM_ARCH__ */
 
+static inline int bpf2a32_offset(int bpf_to, int bpf_from,
+				 const struct jit_ctx *ctx) {
+	int to, from;
+
+	if (ctx->target == NULL)
+		return 0;
+	to = ctx->offsets[bpf_to];
+	from = ctx->offsets[bpf_from];
+
+	return to - from - 1;
+}
+
 /*
  * Move an immediate that's not an imm8m to a core register.
  */
-static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
+static inline void emit_mov_i_no8m(const u8 rd, u32 val, struct jit_ctx *ctx)
 {
 #if __LINUX_ARM_ARCH__ < 7
 	emit(ARM_LDR_I(rd, ARM_PC, imm_offset(val, ctx)), ctx);
@@ -334,7 +274,7 @@ static inline void emit_mov_i_no8m(int rd, u32 val, struct jit_ctx *ctx)
 #endif
 }
 
-static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
+static inline void emit_mov_i(const u8 rd, u32 val, struct jit_ctx *ctx)
 {
 	int imm12 = imm8m(val);
 
@@ -344,676 +284,1578 @@ static inline void emit_mov_i(int rd, u32 val, struct jit_ctx *ctx)
 		emit_mov_i_no8m(rd, val, ctx);
 }
 
-#if __LINUX_ARM_ARCH__ < 6
-
-static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
 {
-	_emit(cond, ARM_LDRB_I(ARM_R3, r_addr, 1), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 3), ctx);
-	_emit(cond, ARM_LSL_I(ARM_R3, ARM_R3, 16), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R0, r_addr, 2), ctx);
-	_emit(cond, ARM_ORR_S(ARM_R3, ARM_R3, ARM_R1, SRTYPE_LSL, 24), ctx);
-	_emit(cond, ARM_ORR_R(ARM_R3, ARM_R3, ARM_R2), ctx);
-	_emit(cond, ARM_ORR_S(r_res, ARM_R3, ARM_R0, SRTYPE_LSL, 8), ctx);
+	ctx->seen |= SEEN_CALL;
+#if __LINUX_ARM_ARCH__ < 5
+	emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
+
+	if (elf_hwcap & HWCAP_THUMB)
+		emit(ARM_BX(tgt_reg), ctx);
+	else
+		emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
+#else
+	emit(ARM_BLX_R(tgt_reg), ctx);
+#endif
 }
 
-static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+static inline int epilogue_offset(const struct jit_ctx *ctx)
 {
-	_emit(cond, ARM_LDRB_I(ARM_R1, r_addr, 0), ctx);
-	_emit(cond, ARM_LDRB_I(ARM_R2, r_addr, 1), ctx);
-	_emit(cond, ARM_ORR_S(r_res, ARM_R2, ARM_R1, SRTYPE_LSL, 8), ctx);
+	int to, from;
+	/* Offset is not needed on the first (dummy) pass */
+	if (ctx->target == NULL)
+		return 0;
+	to = ctx->epilogue_offset;
+	from = ctx->idx;
+
+	return to - from - 2;
 }
 
-static inline void emit_swap16(u8 r_dst, u8 r_src, struct jit_ctx *ctx)
+static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx, u8 op)
 {
-	/* r_dst = (r_src << 8) | (r_src >> 8) */
-	emit(ARM_LSL_I(ARM_R1, r_src, 8), ctx);
-	emit(ARM_ORR_S(r_dst, ARM_R1, r_src, SRTYPE_LSR, 8), ctx);
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	s32 jmp_offset;
+
+	/* Check whether the divisor is zero. If it is, set ARM_R0
+	 * to 0 and branch directly to the epilogue.
+	 */
+	emit(ARM_CMP_I(rn, 0), ctx);
+	_emit(ARM_COND_EQ, ARM_MOV_I(ARM_R0, 0), ctx);
+	jmp_offset = epilogue_offset(ctx);
+	_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+#if __LINUX_ARM_ARCH__ == 7
+	if (elf_hwcap & HWCAP_IDIVA) {
+		if (op == BPF_DIV)
+			emit(ARM_UDIV(rd, rm, rn), ctx);
+		else {
+			emit(ARM_UDIV(ARM_IP, rm, rn), ctx);
+			emit(ARM_MLS(rd, rn, ARM_IP, rm), ctx);
+		}
+		return;
+	}
+#endif
 
 	/*
-	 * we need to mask out the bits set in r_dst[23:16] due to
-	 * the first shift instruction.
-	 *
-	 * note that 0x8ff is the encoded immediate 0x00ff0000.
+	 * For BPF_ALU | BPF_DIV | BPF_K instructions:
+	 * ARM_R0 and ARM_R1 hold the first argument of the BPF
+	 * function, so they are saved on the caller side to keep
+	 * the callee from clobbering them.
+	 * After the callee returns, ARM_R0 and ARM_R1 are
+	 * restored.
 	 */
-	emit(ARM_BIC_I(r_dst, r_dst, 0x8ff), ctx);
-}
+	if (rn != ARM_R1) {
+		emit(ARM_MOV_R(tmp[0], ARM_R1), ctx);
+		emit(ARM_MOV_R(ARM_R1, rn), ctx);
+	}
+	if (rm != ARM_R0) {
+		emit(ARM_MOV_R(tmp[1], ARM_R0), ctx);
+		emit(ARM_MOV_R(ARM_R0, rm), ctx);
+	}
+
+	/* Call appropriate function */
+	ctx->seen |= SEEN_CALL;
+	emit_mov_i(ARM_IP, op == BPF_DIV ?
+		   (u32)jit_udiv32 : (u32)jit_mod32, ctx);
+	emit_blx_r(ARM_IP, ctx);
 
-#else  /* ARMv6+ */
+	/* Save return value */
+	if (rd != ARM_R0)
+		emit(ARM_MOV_R(rd, ARM_R0), ctx);
 
-static void emit_load_be32(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
-{
-	_emit(cond, ARM_LDR_I(r_res, r_addr, 0), ctx);
-#ifdef __LITTLE_ENDIAN
-	_emit(cond, ARM_REV(r_res, r_res), ctx);
-#endif
+	/* Restore ARM_R0 and ARM_R1 */
+	if (rn != ARM_R1)
+		emit(ARM_MOV_R(ARM_R1, tmp[0]), ctx);
+	if (rm != ARM_R0)
+		emit(ARM_MOV_R(ARM_R0, tmp[1]), ctx);
 }
 
-static void emit_load_be16(u8 cond, u8 r_res, u8 r_addr, struct jit_ctx *ctx)
+/* Checks whether BPF register is on scratch stack space or not. */
+static inline bool is_on_stack(u8 bpf_reg)
 {
-	_emit(cond, ARM_LDRH_I(r_res, r_addr, 0), ctx);
-#ifdef __LITTLE_ENDIAN
-	_emit(cond, ARM_REV16(r_res, r_res), ctx);
-#endif
+	static u8 stack_regs[] = {BPF_REG_AX, BPF_REG_3, BPF_REG_4, BPF_REG_5,
+				BPF_REG_7, BPF_REG_8, BPF_REG_9, TCALL_CNT,
+				BPF_REG_2, BPF_REG_FP};
+	int i, reg_len = sizeof(stack_regs);
+
+	for (i = 0 ; i < reg_len ; i++) {
+		if (bpf_reg == stack_regs[i])
+			return true;
+	}
+	return false;
 }
 
-static inline void emit_swap16(u8 r_dst __maybe_unused,
-			       u8 r_src __maybe_unused,
-			       struct jit_ctx *ctx __maybe_unused)
+static inline void emit_a32_mov_i(const u8 dst, const u32 val,
+				  bool dstk, struct jit_ctx *ctx)
 {
-#ifdef __LITTLE_ENDIAN
-	emit(ARM_REV16(r_dst, r_src), ctx);
-#endif
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+
+	if (dstk) {
+		emit_mov_i(tmp[1], val, ctx);
+		emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(dst)), ctx);
+	} else {
+		emit_mov_i(dst, val, ctx);
+	}
 }
 
-#endif /* __LINUX_ARM_ARCH__ < 6 */
+/* Sign extended move */
+static inline void emit_a32_mov_i64(const bool is64, const u8 dst[],
+				  const u32 val, bool dstk,
+				  struct jit_ctx *ctx) {
+	u32 hi = 0;
 
+	if (is64 && (val & (1<<31)))
+		hi = (u32)~0;
+	emit_a32_mov_i(dst_lo, val, dstk, ctx);
+	emit_a32_mov_i(dst_hi, hi, dstk, ctx);
+}
 
-/* Compute the immediate value for a PC-relative branch. */
-static inline u32 b_imm(unsigned tgt, struct jit_ctx *ctx)
-{
-	u32 imm;
+static inline void emit_a32_add_r(const u8 dst, const u8 src,
+			      const bool is64, const bool hi,
+			      struct jit_ctx *ctx) {
+	/* 64 bit :
+	 *	adds dst_lo, dst_lo, src_lo
+	 *	adc dst_hi, dst_hi, src_hi
+	 * 32 bit :
+	 *	add dst_lo, dst_lo, src_lo
+	 */
+	if (!hi && is64)
+		emit(ARM_ADDS_R(dst, dst, src), ctx);
+	else if (hi && is64)
+		emit(ARM_ADC_R(dst, dst, src), ctx);
+	else
+		emit(ARM_ADD_R(dst, dst, src), ctx);
+}
 
-	if (ctx->target == NULL)
-		return 0;
-	/*
-	 * BPF allows only forward jumps and the offset of the target is
-	 * still the one computed during the first pass.
+static inline void emit_a32_sub_r(const u8 dst, const u8 src,
+				  const bool is64, const bool hi,
+				  struct jit_ctx *ctx) {
+	/* 64 bit :
+	 *	subs dst_lo, dst_lo, src_lo
+	 *	sbc dst_hi, dst_hi, src_hi
+	 * 32 bit :
+	 *	sub dst_lo, dst_lo, src_lo
 	 */
-	imm  = ctx->offsets[tgt] + ctx->prologue_bytes - (ctx->idx * 4 + 8);
+	if (!hi && is64)
+		emit(ARM_SUBS_R(dst, dst, src), ctx);
+	else if (hi && is64)
+		emit(ARM_SBC_R(dst, dst, src), ctx);
+	else
+		emit(ARM_SUB_R(dst, dst, src), ctx);
+}
 
-	return imm >> 2;
+static inline void emit_alu_r(const u8 dst, const u8 src, const bool is64,
+			      const bool hi, const u8 op, struct jit_ctx *ctx){
+	switch (BPF_OP(op)) {
+	/* dst = dst + src */
+	case BPF_ADD:
+		emit_a32_add_r(dst, src, is64, hi, ctx);
+		break;
+	/* dst = dst - src */
+	case BPF_SUB:
+		emit_a32_sub_r(dst, src, is64, hi, ctx);
+		break;
+	/* dst = dst | src */
+	case BPF_OR:
+		emit(ARM_ORR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst & src */
+	case BPF_AND:
+		emit(ARM_AND_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst ^ src */
+	case BPF_XOR:
+		emit(ARM_EOR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst * src */
+	case BPF_MUL:
+		emit(ARM_MUL(dst, dst, src), ctx);
+		break;
+	/* dst = dst << src */
+	case BPF_LSH:
+		emit(ARM_LSL_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst >> src */
+	case BPF_RSH:
+		emit(ARM_LSR_R(dst, dst, src), ctx);
+		break;
+	/* dst = dst >> src (signed)*/
+	case BPF_ARSH:
+		emit(ARM_MOV_SR(dst, dst, SRTYPE_ASR, src), ctx);
+		break;
+	}
 }
 
-#define OP_IMM3(op, r1, r2, imm_val, ctx)				\
-	do {								\
-		imm12 = imm8m(imm_val);					\
-		if (imm12 < 0) {					\
-			emit_mov_i_no8m(r_scratch, imm_val, ctx);	\
-			emit(op ## _R((r1), (r2), r_scratch), ctx);	\
-		} else {						\
-			emit(op ## _I((r1), (r2), imm12), ctx);		\
-		}							\
-	} while (0)
-
-static inline void emit_err_ret(u8 cond, struct jit_ctx *ctx)
-{
-	if (ctx->ret0_fp_idx >= 0) {
-		_emit(cond, ARM_B(b_imm(ctx->ret0_fp_idx, ctx)), ctx);
-		/* NOP to keep the size constant between passes */
-		emit(ARM_MOV_R(ARM_R0, ARM_R0), ctx);
+/* ALU operation (32 bit)
+ * dst = dst (op) src
+ */
+static inline void emit_a32_alu_r(const u8 dst, const u8 src,
+				  bool dstk, bool sstk,
+				  struct jit_ctx *ctx, const bool is64,
+				  const bool hi, const u8 op) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rn = sstk ? tmp[1] : src;
+
+	if (sstk)
+		emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src)), ctx);
+
+	/* ALU operation */
+	if (dstk) {
+		emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
+		emit_alu_r(tmp[0], rn, is64, hi, op, ctx);
+		emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(dst)), ctx);
 	} else {
-		_emit(cond, ARM_MOV_I(ARM_R0, 0), ctx);
-		_emit(cond, ARM_B(b_imm(ctx->skf->len, ctx)), ctx);
+		emit_alu_r(dst, rn, is64, hi, op, ctx);
 	}
 }
 
-static inline void emit_blx_r(u8 tgt_reg, struct jit_ctx *ctx)
-{
-#if __LINUX_ARM_ARCH__ < 5
-	emit(ARM_MOV_R(ARM_LR, ARM_PC), ctx);
+/* ALU operation (64 bit) */
+static inline void emit_a32_alu_r64(const bool is64, const u8 dst[],
+				  const u8 src[], bool dstk,
+				  bool sstk, struct jit_ctx *ctx,
+				  const u8 op) {
+	emit_a32_alu_r(dst_lo, src_lo, dstk, sstk, ctx, is64, false, op);
+	if (is64)
+		emit_a32_alu_r(dst_hi, src_hi, dstk, sstk, ctx, is64, true, op);
+	else
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+}
 
-	if (elf_hwcap & HWCAP_THUMB)
-		emit(ARM_BX(tgt_reg), ctx);
+/* dst = src (4 bytes) */
+static inline void emit_a32_mov_r(const u8 dst, const u8 src,
+				  bool dstk, bool sstk,
+				  struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rt = sstk ? tmp[0] : src;
+
+	if (sstk)
+		emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(src)), ctx);
+	if (dstk)
+		emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst)), ctx);
 	else
-		emit(ARM_MOV_R(ARM_PC, tgt_reg), ctx);
-#else
-	emit(ARM_BLX_R(tgt_reg), ctx);
-#endif
+		emit(ARM_MOV_R(dst, rt), ctx);
 }
 
-static inline void emit_udivmod(u8 rd, u8 rm, u8 rn, struct jit_ctx *ctx,
-				int bpf_op)
-{
-#if __LINUX_ARM_ARCH__ == 7
-	if (elf_hwcap & HWCAP_IDIVA) {
-		if (bpf_op == BPF_DIV)
-			emit(ARM_UDIV(rd, rm, rn), ctx);
-		else {
-			emit(ARM_UDIV(ARM_R3, rm, rn), ctx);
-			emit(ARM_MLS(rd, rn, ARM_R3, rm), ctx);
-		}
-		return;
+/* dst = src */
+static inline void emit_a32_mov_r64(const bool is64, const u8 dst[],
+				  const u8 src[], bool dstk,
+				  bool sstk, struct jit_ctx *ctx) {
+	emit_a32_mov_r(dst_lo, src_lo, dstk, sstk, ctx);
+	if (is64) {
+		/* complete 8 byte move */
+		emit_a32_mov_r(dst_hi, src_hi, dstk, sstk, ctx);
+	} else {
+		/* Zero out high 4 bytes */
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
 	}
-#endif
+}
 
-	/*
-	 * For BPF_ALU | BPF_DIV | BPF_K instructions, rm is ARM_R4
-	 * (r_A) and rn is ARM_R0 (r_scratch) so load rn first into
-	 * ARM_R1 to avoid accidentally overwriting ARM_R0 with rm
-	 * before using it as a source for ARM_R1.
-	 *
-	 * For BPF_ALU | BPF_DIV | BPF_X rm is ARM_R4 (r_A) and rn is
-	 * ARM_R5 (r_X) so there is no particular register overlap
-	 * issues.
-	 */
-	if (rn != ARM_R1)
-		emit(ARM_MOV_R(ARM_R1, rn), ctx);
-	if (rm != ARM_R0)
-		emit(ARM_MOV_R(ARM_R0, rm), ctx);
+/* Shift and negate operations with an immediate (32 bit) */
+static inline void emit_a32_alu_i(const u8 dst, const u32 val, bool dstk,
+				struct jit_ctx *ctx, const u8 op) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[0] : dst;
+
+	if (dstk)
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+
+	/* Do the shift or negate operation */
+	switch (op) {
+	case BPF_LSH:
+		emit(ARM_LSL_I(rd, rd, val), ctx);
+		break;
+	case BPF_RSH:
+		emit(ARM_LSR_I(rd, rd, val), ctx);
+		break;
+	case BPF_NEG:
+		emit(ARM_RSB_I(rd, rd, val), ctx);
+		break;
+	}
+
+	if (dstk)
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+}
 
+/* dst = -dst (64 bit) */
+static inline void emit_a32_neg64(const u8 dst[], bool dstk,
+				struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst[1];
+	u8 rm = dstk ? tmp[0] : dst[0];
+
+	/* Setup Operand */
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do Negate Operation */
+	emit(ARM_RSBS_I(rd, rd, 0), ctx);
+	emit(ARM_RSC_I(rm, rm, 0), ctx);
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+/* dst = dst << src */
+static inline void emit_a32_lsh_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSH operation */
+	emit(ARM_SUB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_RSB_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
 	ctx->seen |= SEEN_CALL;
-	emit_mov_i(ARM_R3, bpf_op == BPF_DIV ? (u32)jit_udiv : (u32)jit_mod,
-		   ctx);
-	emit_blx_r(ARM_R3, ctx);
+	emit(ARM_MOV_SR(ARM_LR, rm, SRTYPE_ASL, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rd, SRTYPE_ASL, ARM_IP), ctx);
+	emit(ARM_ORR_SR(ARM_IP, ARM_LR, rd, SRTYPE_LSR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_ASL, rt), ctx);
+
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
 
-	if (rd != ARM_R0)
-		emit(ARM_MOV_R(rd, ARM_R0), ctx);
+/* dst = dst >> src (signed)*/
+static inline void emit_a32_arsh_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do the ARSH operation */
+	emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
+	_emit(ARM_COND_MI, ARM_B(0), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_ASR, rt), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
+
+/* dst = dst >> src */
+static inline void emit_a32_lsr_r64(const u8 dst[], const u8 src[], bool dstk,
+				     bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup Operands */
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (sstk)
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSR operation */
+	emit(ARM_RSB_I(ARM_IP, rt, 32), ctx);
+	emit(ARM_SUBS_I(tmp2[0], rt, 32), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_MOV_SR(ARM_LR, rd, SRTYPE_LSR, rt), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_ASL, ARM_IP), ctx);
+	emit(ARM_ORR_SR(ARM_LR, ARM_LR, rm, SRTYPE_LSR, tmp2[0]), ctx);
+	emit(ARM_MOV_SR(ARM_IP, rm, SRTYPE_LSR, rt), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_LR, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_LR), ctx);
+		emit(ARM_MOV_R(rm, ARM_IP), ctx);
+	}
+}
+
+/* dst = dst << val */
+static inline void emit_a32_lsh_i64(const u8 dst[], bool dstk,
+				     const u32 val, struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSH operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[0], rm, SRTYPE_ASL, val), ctx);
+		emit(ARM_ORR_SI(rm, tmp2[0], rd, SRTYPE_LSR, 32 - val), ctx);
+		emit(ARM_MOV_SI(rd, rd, SRTYPE_ASL, val), ctx);
+	} else {
+		if (val == 32)
+			emit(ARM_MOV_R(rm, rd), ctx);
+		else
+			emit(ARM_MOV_SI(rm, rd, SRTYPE_ASL, val - 32), ctx);
+		emit(ARM_EOR_R(rd, rd, rd), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+/* dst = dst >> val */
+static inline void emit_a32_lsr_i64(const u8 dst[], bool dstk,
+				    const u32 val, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do LSR operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
+		emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_LSR, val), ctx);
+	} else if (val == 32) {
+		emit(ARM_MOV_R(rd, rm), ctx);
+		emit(ARM_MOV_I(rm, 0), ctx);
+	} else {
+		emit(ARM_MOV_SI(rd, rm, SRTYPE_LSR, val - 32), ctx);
+		emit(ARM_MOV_I(rm, 0), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
 }
 
-static inline void update_on_xread(struct jit_ctx *ctx)
+/* dst = dst >> val (signed) */
+static inline void emit_a32_arsh_i64(const u8 dst[], bool dstk,
+				     const u32 val, struct jit_ctx *ctx){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+
+	/* Do ARSH operation */
+	if (val < 32) {
+		emit(ARM_MOV_SI(tmp2[1], rd, SRTYPE_LSR, val), ctx);
+		emit(ARM_ORR_SI(rd, tmp2[1], rm, SRTYPE_ASL, 32 - val), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, val), ctx);
+	} else if (val == 32) {
+		emit(ARM_MOV_R(rd, rm), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
+	} else {
+		emit(ARM_MOV_SI(rd, rm, SRTYPE_ASR, val - 32), ctx);
+		emit(ARM_MOV_SI(rm, rm, SRTYPE_ASR, 31), ctx);
+	}
+
+	if (dstk) {
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+}
+
+static inline void emit_a32_mul_r64(const u8 dst[], const u8 src[], bool dstk,
+				    bool sstk, struct jit_ctx *ctx) {
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	/* Setup operands for multiplication */
+	u8 rd = dstk ? tmp[1] : dst_lo;
+	u8 rm = dstk ? tmp[0] : dst_hi;
+	u8 rt = sstk ? tmp2[1] : src_lo;
+	u8 rn = sstk ? tmp2[0] : src_hi;
+
+	if (dstk) {
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	}
+	if (sstk) {
+		emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)), ctx);
+		emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_hi)), ctx);
+	}
+
+	/* Do Multiplication */
+	emit(ARM_MUL(ARM_IP, rd, rn), ctx);
+	emit(ARM_MUL(ARM_LR, rm, rt), ctx);
+	/* As we are using ARM_LR */
+	ctx->seen |= SEEN_CALL;
+	emit(ARM_ADD_R(ARM_LR, ARM_IP, ARM_LR), ctx);
+
+	emit(ARM_UMULL(ARM_IP, rm, rd, rt), ctx);
+	emit(ARM_ADD_R(rm, ARM_LR, rm), ctx);
+	if (dstk) {
+		emit(ARM_STR_I(ARM_IP, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit(ARM_STR_I(rm, ARM_SP, STACK_VAR(dst_hi)), ctx);
+	} else {
+		emit(ARM_MOV_R(rd, ARM_IP), ctx);
+	}
+}
+
+/* *(size *)(dst + off) = src */
+static inline void emit_str_r(const u8 dst, const u8 src, bool dstk,
+			      const s32 off, struct jit_ctx *ctx, const u8 sz){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst;
+
+	if (dstk)
+		emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+	if (off) {
+		emit_a32_mov_i(tmp[0], off, false, ctx);
+		emit(ARM_ADD_R(tmp[0], rd, tmp[0]), ctx);
+		rd = tmp[0];
+	}
+	switch (sz) {
+	case BPF_W:
+		/* Store a Word */
+		emit(ARM_STR_I(src, rd, 0), ctx);
+		break;
+	case BPF_H:
+		/* Store a HalfWord */
+		emit(ARM_STRH_I(src, rd, 0), ctx);
+		break;
+	case BPF_B:
+		/* Store a Byte */
+		emit(ARM_STRB_I(src, rd, 0), ctx);
+		break;
+	}
+}
+
+/* dst = *(size*)(src + off) */
+static inline void emit_ldx_r(const u8 dst, const u8 src, bool dstk,
+			      const s32 off, struct jit_ctx *ctx, const u8 sz){
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	u8 rd = dstk ? tmp[1] : dst;
+	u8 rm = src;
+
+	if (off) {
+		emit_a32_mov_i(tmp[0], off, false, ctx);
+		emit(ARM_ADD_R(tmp[0], tmp[0], src), ctx);
+		rm = tmp[0];
+	}
+	switch (sz) {
+	case BPF_W:
+		/* Load a Word */
+		emit(ARM_LDR_I(rd, rm, 0), ctx);
+		break;
+	case BPF_H:
+		/* Load a HalfWord */
+		emit(ARM_LDRH_I(rd, rm, 0), ctx);
+		break;
+	case BPF_B:
+		/* Load a Byte */
+		emit(ARM_LDRB_I(rd, rm, 0), ctx);
+		break;
+	}
+	if (dstk)
+		emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst)), ctx);
+}
+
+/* Arithmetic Operation */
+static inline void emit_ar_r(const u8 rd, const u8 rt, const u8 rm,
+			     const u8 rn, struct jit_ctx *ctx, u8 op) {
+	switch (op) {
+	case BPF_JSET:
+		ctx->seen |= SEEN_CALL;
+		emit(ARM_AND_R(ARM_IP, rt, rn), ctx);
+		emit(ARM_AND_R(ARM_LR, rd, rm), ctx);
+		emit(ARM_ORRS_R(ARM_IP, ARM_LR, ARM_IP), ctx);
+		break;
+	case BPF_JEQ:
+	case BPF_JNE:
+	case BPF_JGT:
+	case BPF_JGE:
+		emit(ARM_CMP_R(rd, rm), ctx);
+		_emit(ARM_COND_EQ, ARM_CMP_R(rt, rn), ctx);
+		break;
+	case BPF_JSGT:
+		emit(ARM_CMP_R(rn, rt), ctx);
+		emit(ARM_SBCS_R(ARM_IP, rm, rd), ctx);
+		break;
+	case BPF_JSGE:
+		emit(ARM_CMP_R(rt, rn), ctx);
+		emit(ARM_SBCS_R(ARM_IP, rd, rm), ctx);
+		break;
+	}
+}
+
+static int out_offset = -1; /* initialized on the first pass of build_body() */
+static int emit_bpf_tail_call(struct jit_ctx *ctx)
 {
-	if (!(ctx->seen & SEEN_X))
-		ctx->flags |= FLAG_NEED_X_RESET;
 
-	ctx->seen |= SEEN_X;
+	/* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
+	const u8 *r2 = bpf2a32[BPF_REG_2];
+	const u8 *r3 = bpf2a32[BPF_REG_3];
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	const u8 *tcc = bpf2a32[TCALL_CNT];
+	const int idx0 = ctx->idx;
+#define cur_offset (ctx->idx - idx0)
+#define jmp_offset (out_offset - (cur_offset))
+	u32 off, lo, hi;
+
+	/* if (index >= array->map.max_entries)
+	 *	goto out;
+	 */
+	off = offsetof(struct bpf_array, map.max_entries);
+	/* array->map.max_entries */
+	emit_a32_mov_i(tmp[1], off, false, ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp2[1], tmp[1]), ctx);
+	/* index is 64 bit, but only the low 32 bits are loaded */
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
+	/* index >= array->map.max_entries */
+	emit(ARM_CMP_R(tmp2[1], tmp[1]), ctx);
+	_emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
+
+	/* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 *	goto out;
+	 * tail_call_cnt++;
+	 */
+	lo = (u32)MAX_TAIL_CALL_CNT;
+	hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
+	emit(ARM_LDR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
+	emit(ARM_LDR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
+	emit(ARM_CMP_I(tmp[0], hi), ctx);
+	_emit(ARM_COND_EQ, ARM_CMP_I(tmp[1], lo), ctx);
+	_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
+	emit(ARM_ADDS_I(tmp[1], tmp[1], 1), ctx);
+	emit(ARM_ADC_I(tmp[0], tmp[0], 0), ctx);
+	emit(ARM_STR_I(tmp[1], ARM_SP, STACK_VAR(tcc[1])), ctx);
+	emit(ARM_STR_I(tmp[0], ARM_SP, STACK_VAR(tcc[0])), ctx);
+
+	/* prog = array->ptrs[index]
+	 * if (prog == NULL)
+	 *	goto out;
+	 */
+	off = offsetof(struct bpf_array, ptrs);
+	emit_a32_mov_i(tmp[1], off, false, ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r2[1])), ctx);
+	emit(ARM_ADD_R(tmp[1], tmp2[1], tmp[1]), ctx);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(r3[1])), ctx);
+	emit(ARM_MOV_SI(tmp[0], tmp2[1], SRTYPE_ASL, 2), ctx);
+	emit(ARM_LDR_R(tmp[1], tmp[1], tmp[0]), ctx);
+	emit(ARM_CMP_I(tmp[1], 0), ctx);
+	_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+
+	/* goto *(prog->bpf_func + prologue_size); */
+	off = offsetof(struct bpf_prog, bpf_func);
+	emit_a32_mov_i(tmp2[1], off, false, ctx);
+	emit(ARM_LDR_R(tmp[1], tmp[1], tmp2[1]), ctx);
+	emit(ARM_ADD_I(tmp[1], tmp[1], ctx->prologue_bytes), ctx);
+	emit(ARM_BX(tmp[1]), ctx);
+
+	/* out: */
+	if (out_offset == -1)
+		out_offset = cur_offset;
+	if (cur_offset != out_offset) {
+		pr_err_once("tail_call out_offset = %d, expected %d!\n",
+			    cur_offset, out_offset);
+		return -1;
+	}
+	return 0;
+#undef cur_offset
+#undef jmp_offset
 }
 
-static int build_body(struct jit_ctx *ctx)
+/* 0xabcd => 0xcdab */
+static inline void emit_rev16(const u8 rd, const u8 rn, struct jit_ctx *ctx)
 {
-	void *load_func[] = {jit_get_skb_b, jit_get_skb_h, jit_get_skb_w};
-	const struct bpf_prog *prog = ctx->skf;
-	const struct sock_filter *inst;
-	unsigned i, load_order, off, condt;
-	int imm12;
-	u32 k;
+#if __LINUX_ARM_ARCH__ < 6
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 8), ctx);
+	emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
+	emit(ARM_ORR_SI(rd, tmp2[0], tmp2[1], SRTYPE_LSL, 8), ctx);
+#else /* ARMv6+ */
+	emit(ARM_REV16(rd, rn), ctx);
+#endif
+}
 
-	for (i = 0; i < prog->len; i++) {
-		u16 code;
+/* 0xabcdefgh => 0xghefcdab */
+static inline void emit_rev32(const u8 rd, const u8 rn, struct jit_ctx *ctx)
+{
+#if __LINUX_ARM_ARCH__ < 6
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+
+	emit(ARM_AND_I(tmp2[1], rn, 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 24), ctx);
+	emit(ARM_ORR_SI(ARM_IP, tmp2[0], tmp2[1], SRTYPE_LSL, 24), ctx);
+
+	emit(ARM_MOV_SI(tmp2[1], rn, SRTYPE_LSR, 8), ctx);
+	emit(ARM_AND_I(tmp2[1], tmp2[1], 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], rn, SRTYPE_LSR, 16), ctx);
+	emit(ARM_AND_I(tmp2[0], tmp2[0], 0xff), ctx);
+	emit(ARM_MOV_SI(tmp2[0], tmp2[0], SRTYPE_LSL, 8), ctx);
+	emit(ARM_ORR_SI(tmp2[0], tmp2[0], tmp2[1], SRTYPE_LSL, 16), ctx);
+	emit(ARM_ORR_R(rd, ARM_IP, tmp2[0]), ctx);
+
+#else /* ARMv6+ */
+	emit(ARM_REV(rd, rn), ctx);
+#endif
+}
+
+/* Push the scratch stack register on top of the stack. */
+static inline void emit_push_r64(const u8 src[], const u8 shift,
+		struct jit_ctx *ctx)
+{
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	u16 reg_set = 0;
 
-		inst = &(prog->insns[i]);
-		/* K as an immediate value operand */
-		k = inst->k;
-		code = bpf_anc_helper(inst);
+	emit(ARM_LDR_I(tmp2[1], ARM_SP, STACK_VAR(src[1]+shift)), ctx);
+	emit(ARM_LDR_I(tmp2[0], ARM_SP, STACK_VAR(src[0]+shift)), ctx);
 
-		/* compute offsets only in the fake pass */
-		if (ctx->target == NULL)
-			ctx->offsets[i] = ctx->idx * 4;
+	reg_set = (1 << tmp2[1]) | (1 << tmp2[0]);
+	emit(ARM_PUSH(reg_set), ctx);
+}
+
+static void build_prologue(struct jit_ctx *ctx)
+{
+	const u8 r0 = bpf2a32[BPF_REG_0][1];
+	const u8 r2 = bpf2a32[BPF_REG_1][1];
+	const u8 r3 = bpf2a32[BPF_REG_1][0];
+	const u8 r4 = bpf2a32[BPF_REG_6][1];
+	const u8 r5 = bpf2a32[BPF_REG_6][0];
+	const u8 r6 = bpf2a32[TMP_REG_1][1];
+	const u8 r7 = bpf2a32[TMP_REG_1][0];
+	const u8 r8 = bpf2a32[TMP_REG_2][1];
+	const u8 r10 = bpf2a32[TMP_REG_2][0];
+	const u8 fplo = bpf2a32[BPF_REG_FP][1];
+	const u8 fphi = bpf2a32[BPF_REG_FP][0];
+	const u8 sp = ARM_SP;
+	const u8 *tcc = bpf2a32[TCALL_CNT];
+
+	u16 reg_set = 0;
+
+	/*
+	 * eBPF prog stack layout
+	 *
+	 *                         high
+	 * original ARM_SP =>     +-----+ eBPF prologue
+	 *                        |FP/LR|
+	 * current ARM_FP =>      +-----+
+	 *                        | ... | callee saved registers
+	 * eBPF fp register =>    +-----+ <= (BPF_FP)
+	 *                        | ... | eBPF JIT scratch space
+	 *                        |     | eBPF prog stack
+	 *                        +-----+
+	 *                        |RSVD | JIT scratchpad
+	 * current ARM_SP =>      +-----+ <= (BPF_FP - STACK_SIZE)
+	 *                        |     |
+	 *                        | ... | Function call stack
+	 *                        |     |
+	 *                        +-----+
+	 *                          low
+	 */
+
+	/* Save callee saved registers. */
+	reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
+#ifdef CONFIG_FRAME_POINTER
+	reg_set |= (1<<ARM_FP) | (1<<ARM_IP) | (1<<ARM_LR) | (1<<ARM_PC);
+	emit(ARM_MOV_R(ARM_IP, sp), ctx);
+	emit(ARM_PUSH(reg_set), ctx);
+	emit(ARM_SUB_I(ARM_FP, ARM_IP, 4), ctx);
+#else
+	/* Check if call instruction exists in BPF body */
+	if (ctx->seen & SEEN_CALL)
+		reg_set |= (1<<ARM_LR);
+	emit(ARM_PUSH(reg_set), ctx);
+#endif
+	/* Save frame pointer for later */
+	emit(ARM_SUB_I(ARM_IP, sp, SCRATCH_SIZE), ctx);
+
+	/* Set up function call stack */
+	emit(ARM_SUB_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
+
+	/* Set up BPF prog stack base register */
+	emit_a32_mov_r(fplo, ARM_IP, true, false, ctx);
+	emit_a32_mov_i(fphi, 0, true, ctx);
+
+	/* mov r4, 0 */
+	emit(ARM_MOV_I(r4, 0), ctx);
+
+	/* Move BPF_CTX to BPF_R1 */
+	emit(ARM_MOV_R(r3, r4), ctx);
+	emit(ARM_MOV_R(r2, r0), ctx);
+	/* Initialize Tail Count */
+	emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[0])), ctx);
+	emit(ARM_STR_I(r4, ARM_SP, STACK_VAR(tcc[1])), ctx);
+	/* end of prologue */
+}
+
+static void build_epilogue(struct jit_ctx *ctx)
+{
+	const u8 r4 = bpf2a32[BPF_REG_6][1];
+	const u8 r5 = bpf2a32[BPF_REG_6][0];
+	const u8 r6 = bpf2a32[TMP_REG_1][1];
+	const u8 r7 = bpf2a32[TMP_REG_1][0];
+	const u8 r8 = bpf2a32[TMP_REG_2][1];
+	const u8 r10 = bpf2a32[TMP_REG_2][0];
+	u16 reg_set = 0;
+
+	/* unwind function call stack */
+	emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(STACK_SIZE)), ctx);
+
+	/* restore callee saved registers. */
+	reg_set |= (1<<r4) | (1<<r5) | (1<<r6) | (1<<r7) | (1<<r8) | (1<<r10);
+#ifdef CONFIG_FRAME_POINTER
+	/* the first instruction of the prologue was: mov ip, sp */
+	reg_set |= (1<<ARM_FP) | (1<<ARM_SP) | (1<<ARM_PC);
+	emit(ARM_LDM(ARM_SP, reg_set), ctx);
+#else
+	if (ctx->seen & SEEN_CALL)
+		reg_set |= (1<<ARM_PC);
+	/* Restore callee saved registers. */
+	emit(ARM_POP(reg_set), ctx);
+	/* Return to the caller */
+	if (!(ctx->seen & SEEN_CALL))
+		emit(ARM_BX(ARM_LR), ctx);
+#endif
+}
 
-		switch (code) {
-		case BPF_LD | BPF_IMM:
-			emit_mov_i(r_A, k, ctx);
+/*
+ * Convert an eBPF instruction to native instructions, i.e.
+ * JIT an eBPF instruction.
+ * Returns :
+ *	0  - Successfully JITed an 8-byte eBPF instruction
+ *	>0 - Successfully JITed a 16-byte eBPF instruction
+ *	<0 - Failed to JIT.
+ */
+static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
+{
+	const u8 code = insn->code;
+	const u8 *dst = bpf2a32[insn->dst_reg];
+	const u8 *src = bpf2a32[insn->src_reg];
+	const u8 *tmp = bpf2a32[TMP_REG_1];
+	const u8 *tmp2 = bpf2a32[TMP_REG_2];
+	const s16 off = insn->off;
+	const s32 imm = insn->imm;
+	const int i = insn - ctx->prog->insnsi;
+	const bool is64 = BPF_CLASS(code) == BPF_ALU64;
+	const bool dstk = is_on_stack(insn->dst_reg);
+	const bool sstk = is_on_stack(insn->src_reg);
+	u8 rd, rt, rm, rn;
+	s32 jmp_offset;
+
+#define check_imm(bits, imm) do {				\
+	if ((((imm) > 0) && ((imm) >> (bits))) ||		\
+	    (((imm) < 0) && (~(imm) >> (bits)))) {		\
+		pr_info("[%2d] imm=%d(0x%x) out of range\n",	\
+			i, imm, imm);				\
+		return -EINVAL;					\
+	}							\
+} while (0)
+#define check_imm24(imm) check_imm(24, imm)
+
+	switch (code) {
+	/* ALU operations */
+
+	/* dst = src */
+	case BPF_ALU | BPF_MOV | BPF_K:
+	case BPF_ALU | BPF_MOV | BPF_X:
+	case BPF_ALU64 | BPF_MOV | BPF_K:
+	case BPF_ALU64 | BPF_MOV | BPF_X:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_mov_r64(is64, dst, src, dstk, sstk, ctx);
 			break;
-		case BPF_LD | BPF_W | BPF_LEN:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
-			emit(ARM_LDR_I(r_A, r_skb,
-				       offsetof(struct sk_buff, len)), ctx);
+		case BPF_K:
+			/* Sign-extend immediate value to destination reg */
+			emit_a32_mov_i64(is64, dst, imm, dstk, ctx);
 			break;
-		case BPF_LD | BPF_MEM:
-			/* A = scratch[k] */
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_LDR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
+		}
+		break;
+	/* dst = dst + src/imm */
+	/* dst = dst - src/imm */
+	/* dst = dst | src/imm */
+	/* dst = dst & src/imm */
+	/* dst = dst ^ src/imm */
+	/* dst = dst * src/imm */
+	/* dst = dst << src */
+	/* dst = dst >> src */
+	case BPF_ALU | BPF_ADD | BPF_K:
+	case BPF_ALU | BPF_ADD | BPF_X:
+	case BPF_ALU | BPF_SUB | BPF_K:
+	case BPF_ALU | BPF_SUB | BPF_X:
+	case BPF_ALU | BPF_OR | BPF_K:
+	case BPF_ALU | BPF_OR | BPF_X:
+	case BPF_ALU | BPF_AND | BPF_K:
+	case BPF_ALU | BPF_AND | BPF_X:
+	case BPF_ALU | BPF_XOR | BPF_K:
+	case BPF_ALU | BPF_XOR | BPF_X:
+	case BPF_ALU | BPF_MUL | BPF_K:
+	case BPF_ALU | BPF_MUL | BPF_X:
+	case BPF_ALU | BPF_LSH | BPF_X:
+	case BPF_ALU | BPF_RSH | BPF_X:
+	case BPF_ALU | BPF_ARSH | BPF_K:
+	case BPF_ALU | BPF_ARSH | BPF_X:
+	case BPF_ALU64 | BPF_ADD | BPF_K:
+	case BPF_ALU64 | BPF_ADD | BPF_X:
+	case BPF_ALU64 | BPF_SUB | BPF_K:
+	case BPF_ALU64 | BPF_SUB | BPF_X:
+	case BPF_ALU64 | BPF_OR | BPF_K:
+	case BPF_ALU64 | BPF_OR | BPF_X:
+	case BPF_ALU64 | BPF_AND | BPF_K:
+	case BPF_ALU64 | BPF_AND | BPF_X:
+	case BPF_ALU64 | BPF_XOR | BPF_K:
+	case BPF_ALU64 | BPF_XOR | BPF_X:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_alu_r64(is64, dst, src, dstk, sstk,
+					 ctx, BPF_OP(code));
 			break;
-		case BPF_LD | BPF_W | BPF_ABS:
-			load_order = 2;
-			goto load;
-		case BPF_LD | BPF_H | BPF_ABS:
-			load_order = 1;
-			goto load;
-		case BPF_LD | BPF_B | BPF_ABS:
-			load_order = 0;
-load:
-			emit_mov_i(r_off, k, ctx);
-load_common:
-			ctx->seen |= SEEN_DATA | SEEN_CALL;
-
-			if (load_order > 0) {
-				emit(ARM_SUB_I(r_scratch, r_skb_hl,
-					       1 << load_order), ctx);
-				emit(ARM_CMP_R(r_scratch, r_off), ctx);
-				condt = ARM_COND_GE;
-			} else {
-				emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
-				condt = ARM_COND_HI;
-			}
-
-			/*
-			 * test for negative offset, only if we are
-			 * currently scheduled to take the fast
-			 * path. this will update the flags so that
-			 * the slowpath instruction are ignored if the
-			 * offset is negative.
-			 *
-			 * for loard_order == 0 the HI condition will
-			 * make loads at offset 0 take the slow path too.
+		case BPF_K:
+			/* Move the immediate value into the temporary
+			 * register first. This sign-extends it to the
+			 * operand width, so the ALU operation can then
+			 * safely be done register-to-register on the
+			 * temporary.
 			 */
-			_emit(condt, ARM_CMP_I(r_off, 0), ctx);
-
-			_emit(condt, ARM_ADD_R(r_scratch, r_off, r_skb_data),
-			      ctx);
-
-			if (load_order == 0)
-				_emit(condt, ARM_LDRB_I(r_A, r_scratch, 0),
-				      ctx);
-			else if (load_order == 1)
-				emit_load_be16(condt, r_A, r_scratch, ctx);
-			else if (load_order == 2)
-				emit_load_be32(condt, r_A, r_scratch, ctx);
-
-			_emit(condt, ARM_B(b_imm(i + 1, ctx)), ctx);
-
-			/* the slowpath */
-			emit_mov_i(ARM_R3, (u32)load_func[load_order], ctx);
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			/* the offset is already in R1 */
-			emit_blx_r(ARM_R3, ctx);
-			/* check the result of skb_copy_bits */
-			emit(ARM_CMP_I(ARM_R1, 0), ctx);
-			emit_err_ret(ARM_COND_NE, ctx);
-			emit(ARM_MOV_R(r_A, ARM_R0), ctx);
+			emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
+			emit_a32_alu_r64(is64, dst, tmp2, dstk, false,
+					 ctx, BPF_OP(code));
 			break;
-		case BPF_LD | BPF_W | BPF_IND:
-			load_order = 2;
-			goto load_ind;
-		case BPF_LD | BPF_H | BPF_IND:
-			load_order = 1;
-			goto load_ind;
-		case BPF_LD | BPF_B | BPF_IND:
-			load_order = 0;
-load_ind:
-			update_on_xread(ctx);
-			OP_IMM3(ARM_ADD, r_off, r_X, k, ctx);
-			goto load_common;
-		case BPF_LDX | BPF_IMM:
-			ctx->seen |= SEEN_X;
-			emit_mov_i(r_X, k, ctx);
+		}
+		break;
+	/* dst = dst / src(imm) */
+	/* dst = dst % src(imm) */
+	case BPF_ALU | BPF_DIV | BPF_K:
+	case BPF_ALU | BPF_DIV | BPF_X:
+	case BPF_ALU | BPF_MOD | BPF_K:
+	case BPF_ALU | BPF_MOD | BPF_X:
+		rt = src_lo;
+		rd = dstk ? tmp2[1] : dst_lo;
+		if (dstk)
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			rt = sstk ? tmp2[0] : rt;
+			if (sstk)
+				emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(src_lo)),
+				     ctx);
 			break;
-		case BPF_LDX | BPF_W | BPF_LEN:
-			ctx->seen |= SEEN_X | SEEN_SKB;
-			emit(ARM_LDR_I(r_X, r_skb,
-				       offsetof(struct sk_buff, len)), ctx);
+		case BPF_K:
+			rt = tmp2[0];
+			emit_a32_mov_i(rt, imm, false, ctx);
 			break;
-		case BPF_LDX | BPF_MEM:
-			ctx->seen |= SEEN_X | SEEN_MEM_WORD(k);
-			emit(ARM_LDR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
+		}
+		emit_udivmod(rd, rd, rt, ctx, BPF_OP(code));
+		if (dstk)
+			emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_lo)), ctx);
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	case BPF_ALU64 | BPF_DIV | BPF_K:
+	case BPF_ALU64 | BPF_DIV | BPF_X:
+	case BPF_ALU64 | BPF_MOD | BPF_K:
+	case BPF_ALU64 | BPF_MOD | BPF_X:
+		goto notyet;
+	/* dst = dst >> imm */
+	/* dst = dst << imm */
+	case BPF_ALU | BPF_RSH | BPF_K:
+	case BPF_ALU | BPF_LSH | BPF_K:
+		if (unlikely(imm > 31))
+			return -EINVAL;
+		if (imm)
+			emit_a32_alu_i(dst_lo, imm, dstk, ctx, BPF_OP(code));
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	/* dst = dst << imm */
+	case BPF_ALU64 | BPF_LSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_lsh_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = dst >> imm */
+	case BPF_ALU64 | BPF_RSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_lsr_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = dst << src */
+	case BPF_ALU64 | BPF_LSH | BPF_X:
+		emit_a32_lsh_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> src */
+	case BPF_ALU64 | BPF_RSH | BPF_X:
+		emit_a32_lsr_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> src (signed) */
+	case BPF_ALU64 | BPF_ARSH | BPF_X:
+		emit_a32_arsh_r64(dst, src, dstk, sstk, ctx);
+		break;
+	/* dst = dst >> imm (signed) */
+	case BPF_ALU64 | BPF_ARSH | BPF_K:
+		if (unlikely(imm > 63))
+			return -EINVAL;
+		emit_a32_arsh_i64(dst, dstk, imm, ctx);
+		break;
+	/* dst = -dst */
+	case BPF_ALU | BPF_NEG:
+		emit_a32_alu_i(dst_lo, 0, dstk, ctx, BPF_OP(code));
+		emit_a32_mov_i(dst_hi, 0, dstk, ctx);
+		break;
+	/* dst = -dst (64 bit) */
+	case BPF_ALU64 | BPF_NEG:
+		emit_a32_neg64(dst, dstk, ctx);
+		break;
+	/* dst = dst * src/imm */
+	case BPF_ALU64 | BPF_MUL | BPF_X:
+	case BPF_ALU64 | BPF_MUL | BPF_K:
+		switch (BPF_SRC(code)) {
+		case BPF_X:
+			emit_a32_mul_r64(dst, src, dstk, sstk, ctx);
 			break;
-		case BPF_LDX | BPF_B | BPF_MSH:
-			/* x = ((*(frame + k)) & 0xf) << 2; */
-			ctx->seen |= SEEN_X | SEEN_DATA | SEEN_CALL;
-			/* the interpreter should deal with the negative K */
-			if ((int)k < 0)
-				return -1;
-			/* offset in r1: we might have to take the slow path */
-			emit_mov_i(r_off, k, ctx);
-			emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
-
-			/* load in r0: common with the slowpath */
-			_emit(ARM_COND_HI, ARM_LDRB_R(ARM_R0, r_skb_data,
-						      ARM_R1), ctx);
-			/*
-			 * emit_mov_i() might generate one or two instructions,
-			 * the same holds for emit_blx_r()
+		case BPF_K:
+			/* Move the immediate value into the temporary
+			 * register first. This sign-extends it to 64
+			 * bits, so the multiplication can then safely
+			 * be done register-to-register on the
+			 * temporary.
 			 */
-			_emit(ARM_COND_HI, ARM_B(b_imm(i + 1, ctx) - 2), ctx);
-
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			/* r_off is r1 */
-			emit_mov_i(ARM_R3, (u32)jit_get_skb_b, ctx);
-			emit_blx_r(ARM_R3, ctx);
-			/* check the return value of skb_copy_bits */
-			emit(ARM_CMP_I(ARM_R1, 0), ctx);
-			emit_err_ret(ARM_COND_NE, ctx);
-
-			emit(ARM_AND_I(r_X, ARM_R0, 0x00f), ctx);
-			emit(ARM_LSL_I(r_X, r_X, 2), ctx);
-			break;
-		case BPF_ST:
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_STR_I(r_A, ARM_SP, SCRATCH_OFF(k)), ctx);
-			break;
-		case BPF_STX:
-			update_on_xread(ctx);
-			ctx->seen |= SEEN_MEM_WORD(k);
-			emit(ARM_STR_I(r_X, ARM_SP, SCRATCH_OFF(k)), ctx);
-			break;
-		case BPF_ALU | BPF_ADD | BPF_K:
-			/* A += K */
-			OP_IMM3(ARM_ADD, r_A, r_A, k, ctx);
-			break;
-		case BPF_ALU | BPF_ADD | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_ADD_R(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_SUB | BPF_K:
-			/* A -= K */
-			OP_IMM3(ARM_SUB, r_A, r_A, k, ctx);
-			break;
-		case BPF_ALU | BPF_SUB | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_SUB_R(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_MUL | BPF_K:
-			/* A *= K */
-			emit_mov_i(r_scratch, k, ctx);
-			emit(ARM_MUL(r_A, r_A, r_scratch), ctx);
+			emit_a32_mov_i64(is64, tmp2, imm, false, ctx);
+			emit_a32_mul_r64(dst, tmp2, dstk, false, ctx);
 			break;
-		case BPF_ALU | BPF_MUL | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_MUL(r_A, r_A, r_X), ctx);
-			break;
-		case BPF_ALU | BPF_DIV | BPF_K:
-			if (k == 1)
-				break;
-			emit_mov_i(r_scratch, k, ctx);
-			emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_DIV);
-			break;
-		case BPF_ALU | BPF_DIV | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_CMP_I(r_X, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-			emit_udivmod(r_A, r_A, r_X, ctx, BPF_DIV);
-			break;
-		case BPF_ALU | BPF_MOD | BPF_K:
-			if (k == 1) {
-				emit_mov_i(r_A, 0, ctx);
-				break;
-			}
-			emit_mov_i(r_scratch, k, ctx);
-			emit_udivmod(r_A, r_A, r_scratch, ctx, BPF_MOD);
-			break;
-		case BPF_ALU | BPF_MOD | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_CMP_I(r_X, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-			emit_udivmod(r_A, r_A, r_X, ctx, BPF_MOD);
-			break;
-		case BPF_ALU | BPF_OR | BPF_K:
-			/* A |= K */
-			OP_IMM3(ARM_ORR, r_A, r_A, k, ctx);
+		}
+		break;
+	/* dst = htole(dst) */
+	/* dst = htobe(dst) */
+	case BPF_ALU | BPF_END | BPF_FROM_LE:
+	case BPF_ALU | BPF_END | BPF_FROM_BE:
+		rd = dstk ? tmp[0] : dst_hi;
+		rt = dstk ? tmp[1] : dst_lo;
+		if (dstk) {
+			emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+		if (BPF_SRC(code) == BPF_FROM_LE)
+			goto emit_bswap_uxt;
+		switch (imm) {
+		case 16:
+			emit_rev16(rt, rt, ctx);
+			goto emit_bswap_uxt;
+		case 32:
+			emit_rev32(rt, rt, ctx);
+			goto emit_bswap_uxt;
+		case 64:
+			/* ARM_LR is clobbered as scratch; have it saved */
+			ctx->seen |= SEEN_CALL;
+			emit_rev32(ARM_LR, rt, ctx);
+			emit_rev32(rt, rd, ctx);
+			emit(ARM_MOV_R(rd, ARM_LR), ctx);
 			break;
-		case BPF_ALU | BPF_OR | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_ORR_R(r_A, r_A, r_X), ctx);
+		}
+		goto exit;
+emit_bswap_uxt:
+		switch (imm) {
+		case 16:
+			/* zero-extend 16 bits into 64 bits */
+#if __LINUX_ARM_ARCH__ < 6
+			emit_a32_mov_i(tmp2[1], 0xffff, false, ctx);
+			emit(ARM_AND_R(rt, rt, tmp2[1]), ctx);
+#else /* ARMv6+ */
+			emit(ARM_UXTH(rt, rt), ctx);
+#endif
+			emit(ARM_EOR_R(rd, rd, rd), ctx);
 			break;
-		case BPF_ALU | BPF_XOR | BPF_K:
-			/* A ^= K; */
-			OP_IMM3(ARM_EOR, r_A, r_A, k, ctx);
+		case 32:
+			/* zero-extend 32 bits into 64 bits */
+			emit(ARM_EOR_R(rd, rd, rd), ctx);
 			break;
-		case BPF_ANC | SKF_AD_ALU_XOR_X:
-		case BPF_ALU | BPF_XOR | BPF_X:
-			/* A ^= X */
-			update_on_xread(ctx);
-			emit(ARM_EOR_R(r_A, r_A, r_X), ctx);
+		case 64:
+			/* nop */
 			break;
-		case BPF_ALU | BPF_AND | BPF_K:
-			/* A &= K */
-			OP_IMM3(ARM_AND, r_A, r_A, k, ctx);
+		}
+exit:
+		if (dstk) {
+			emit(ARM_STR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_STR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+		break;
+	/* dst = imm64 */
+	case BPF_LD | BPF_IMM | BPF_DW:
+	{
+		const struct bpf_insn insn1 = insn[1];
+		u32 hi, lo = imm;
+
+		hi = insn1.imm;
+		emit_a32_mov_i(dst_lo, lo, dstk, ctx);
+		emit_a32_mov_i(dst_hi, hi, dstk, ctx);
+
+		return 1;
+	}
+	/* LDX: dst = *(size *)(src + off) */
+	case BPF_LDX | BPF_MEM | BPF_W:
+	case BPF_LDX | BPF_MEM | BPF_H:
+	case BPF_LDX | BPF_MEM | BPF_B:
+	case BPF_LDX | BPF_MEM | BPF_DW:
+		rn = sstk ? tmp2[1] : src_lo;
+		if (sstk)
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			/* Load a Word */
+		case BPF_H:
+			/* Load a Half-Word */
+		case BPF_B:
+			/* Load a Byte */
+			emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_SIZE(code));
+			emit_a32_mov_i(dst_hi, 0, dstk, ctx);
 			break;
-		case BPF_ALU | BPF_AND | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_AND_R(r_A, r_A, r_X), ctx);
+		case BPF_DW:
+			/* Load a double word */
+			emit_ldx_r(dst_lo, rn, dstk, off, ctx, BPF_W);
+			emit_ldx_r(dst_hi, rn, dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_ALU | BPF_LSH | BPF_K:
-			if (unlikely(k > 31))
-				return -1;
-			emit(ARM_LSL_I(r_A, r_A, k), ctx);
+		}
+		break;
+	/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + imm)) */
+	case BPF_LD | BPF_ABS | BPF_W:
+	case BPF_LD | BPF_ABS | BPF_H:
+	case BPF_LD | BPF_ABS | BPF_B:
+	/* R0 = ntohx(*(size *)(((struct sk_buff *)R6)->data + src + imm)) */
+	case BPF_LD | BPF_IND | BPF_W:
+	case BPF_LD | BPF_IND | BPF_H:
+	case BPF_LD | BPF_IND | BPF_B:
+	{
+		const u8 r4 = bpf2a32[BPF_REG_6][1]; /* r4 = ptr to sk_buff */
+		const u8 r0 = bpf2a32[BPF_REG_0][1]; /*r0: struct sk_buff *skb*/
+						     /* rtn value */
+		const u8 r1 = bpf2a32[BPF_REG_0][0]; /* r1: int k */
+		const u8 r2 = bpf2a32[BPF_REG_1][1]; /* r2: unsigned int size */
+		const u8 r3 = bpf2a32[BPF_REG_1][0]; /* r3: void *buffer */
+		const u8 r6 = bpf2a32[TMP_REG_1][1]; /* r6: void *(*func)(..) */
+		int size;
+
+		/* Setting up first argument */
+		emit(ARM_MOV_R(r0, r4), ctx);
+
+		/* Setting up second argument */
+		emit_a32_mov_i(r1, imm, false, ctx);
+		if (BPF_MODE(code) == BPF_IND)
+			emit_a32_alu_r(r1, src_lo, false, sstk, ctx,
+				       false, false, BPF_ADD);
+
+		/* Setting up third argument */
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			size = 4;
 			break;
-		case BPF_ALU | BPF_LSH | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_LSL_R(r_A, r_A, r_X), ctx);
+		case BPF_H:
+			size = 2;
 			break;
-		case BPF_ALU | BPF_RSH | BPF_K:
-			if (unlikely(k > 31))
-				return -1;
-			if (k)
-				emit(ARM_LSR_I(r_A, r_A, k), ctx);
+		case BPF_B:
+			size = 1;
 			break;
-		case BPF_ALU | BPF_RSH | BPF_X:
-			update_on_xread(ctx);
-			emit(ARM_LSR_R(r_A, r_A, r_X), ctx);
+		default:
+			return -EINVAL;
+		}
+		emit_a32_mov_i(r2, size, false, ctx);
+
+		/* Setting up fourth argument */
+		emit(ARM_ADD_I(r3, ARM_SP, imm8m(SKB_BUFFER)), ctx);
+
+		/* Setting up function pointer to call */
+		emit_a32_mov_i(r6, (unsigned int)bpf_load_pointer, false, ctx);
+		emit_blx_r(r6, ctx);
+
+		emit(ARM_EOR_R(r1, r1, r1), ctx);
+		/* Check whether the returned address is NULL.
+		 * If NULL, jump to the epilogue;
+		 * otherwise load the value from the returned address.
+		 */
+		emit(ARM_CMP_I(r0, 0), ctx);
+		jmp_offset = epilogue_offset(ctx);
+		check_imm24(jmp_offset);
+		_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
+
+		/* Load value from the address */
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			emit(ARM_LDR_I(r0, r0, 0), ctx);
+			emit_rev32(r0, r0, ctx);
 			break;
-		case BPF_ALU | BPF_NEG:
-			/* A = -A */
-			emit(ARM_RSB_I(r_A, r_A, 0), ctx);
+		case BPF_H:
+			emit(ARM_LDRH_I(r0, r0, 0), ctx);
+			emit_rev16(r0, r0, ctx);
 			break;
-		case BPF_JMP | BPF_JA:
-			/* pc += K */
-			emit(ARM_B(b_imm(i + k + 1, ctx)), ctx);
+		case BPF_B:
+			emit(ARM_LDRB_I(r0, r0, 0), ctx);
+			/* No need to reverse */
 			break;
-		case BPF_JMP | BPF_JEQ | BPF_K:
-			/* pc += (A == K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_EQ;
-			goto cmp_imm;
-		case BPF_JMP | BPF_JGT | BPF_K:
-			/* pc += (A > K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_HI;
-			goto cmp_imm;
-		case BPF_JMP | BPF_JGE | BPF_K:
-			/* pc += (A >= K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_HS;
-cmp_imm:
-			imm12 = imm8m(k);
-			if (imm12 < 0) {
-				emit_mov_i_no8m(r_scratch, k, ctx);
-				emit(ARM_CMP_R(r_A, r_scratch), ctx);
-			} else {
-				emit(ARM_CMP_I(r_A, imm12), ctx);
-			}
-cond_jump:
-			if (inst->jt)
-				_emit(condt, ARM_B(b_imm(i + inst->jt + 1,
-						   ctx)), ctx);
-			if (inst->jf)
-				_emit(condt ^ 1, ARM_B(b_imm(i + inst->jf + 1,
-							     ctx)), ctx);
+		}
+		break;
+	}
+	/* ST: *(size *)(dst + off) = imm */
+	case BPF_ST | BPF_MEM | BPF_W:
+	case BPF_ST | BPF_MEM | BPF_H:
+	case BPF_ST | BPF_MEM | BPF_B:
+	case BPF_ST | BPF_MEM | BPF_DW:
+		switch (BPF_SIZE(code)) {
+		case BPF_DW:
+			/* Sign-extend immediate value into temp reg */
+			emit_a32_mov_i64(true, tmp2, imm, false, ctx);
+			emit_str_r(dst_lo, tmp2[1], dstk, off, ctx, BPF_W);
+			emit_str_r(dst_lo, tmp2[0], dstk, off+4, ctx, BPF_W);
 			break;
-		case BPF_JMP | BPF_JEQ | BPF_X:
-			/* pc += (A == X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_EQ;
-			goto cmp_x;
-		case BPF_JMP | BPF_JGT | BPF_X:
-			/* pc += (A > X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_HI;
-			goto cmp_x;
-		case BPF_JMP | BPF_JGE | BPF_X:
-			/* pc += (A >= X) ? pc->jt : pc->jf */
-			condt   = ARM_COND_CS;
-cmp_x:
-			update_on_xread(ctx);
-			emit(ARM_CMP_R(r_A, r_X), ctx);
-			goto cond_jump;
-		case BPF_JMP | BPF_JSET | BPF_K:
-			/* pc += (A & K) ? pc->jt : pc->jf */
-			condt  = ARM_COND_NE;
-			/* not set iff all zeroes iff Z==1 iff EQ */
-
-			imm12 = imm8m(k);
-			if (imm12 < 0) {
-				emit_mov_i_no8m(r_scratch, k, ctx);
-				emit(ARM_TST_R(r_A, r_scratch), ctx);
-			} else {
-				emit(ARM_TST_I(r_A, imm12), ctx);
-			}
-			goto cond_jump;
-		case BPF_JMP | BPF_JSET | BPF_X:
-			/* pc += (A & X) ? pc->jt : pc->jf */
-			update_on_xread(ctx);
-			condt  = ARM_COND_NE;
-			emit(ARM_TST_R(r_A, r_X), ctx);
-			goto cond_jump;
-		case BPF_RET | BPF_A:
-			emit(ARM_MOV_R(ARM_R0, r_A), ctx);
-			goto b_epilogue;
-		case BPF_RET | BPF_K:
-			if ((k == 0) && (ctx->ret0_fp_idx < 0))
-				ctx->ret0_fp_idx = i;
-			emit_mov_i(ARM_R0, k, ctx);
-b_epilogue:
-			if (i != ctx->skf->len - 1)
-				emit(ARM_B(b_imm(prog->len, ctx)), ctx);
+		case BPF_W:
+		case BPF_H:
+		case BPF_B:
+			emit_a32_mov_i(tmp2[1], imm, false, ctx);
+			emit_str_r(dst_lo, tmp2[1], dstk, off, ctx,
+				   BPF_SIZE(code));
 			break;
-		case BPF_MISC | BPF_TAX:
-			/* X = A */
-			ctx->seen |= SEEN_X;
-			emit(ARM_MOV_R(r_X, r_A), ctx);
+		}
+		break;
+	/* STX XADD: lock *(u32 *)(dst + off) += src */
+	case BPF_STX | BPF_XADD | BPF_W:
+	/* STX XADD: lock *(u64 *)(dst + off) += src */
+	case BPF_STX | BPF_XADD | BPF_DW:
+		goto notyet;
+	/* STX: *(size *)(dst + off) = src */
+	case BPF_STX | BPF_MEM | BPF_W:
+	case BPF_STX | BPF_MEM | BPF_H:
+	case BPF_STX | BPF_MEM | BPF_B:
+	case BPF_STX | BPF_MEM | BPF_DW:
+	{
+		u8 sz = BPF_SIZE(code);
+
+		rn = sstk ? tmp2[1] : src_lo;
+		rm = sstk ? tmp2[0] : src_hi;
+		if (!sstk)
+			goto do_store;
+		switch (BPF_SIZE(code)) {
+		case BPF_W:
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+		case BPF_H:
+			emit(ARM_LDRH_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+		case BPF_B:
+			emit(ARM_LDRB_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			goto empty_hi;
+empty_hi:
+			emit(ARM_EOR_R(rm, rm, rm), ctx);
+			break;
+		case BPF_DW:
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
+			sz = BPF_W;
 			break;
-		case BPF_MISC | BPF_TXA:
-			/* A = X */
-			update_on_xread(ctx);
-			emit(ARM_MOV_R(r_A, r_X), ctx);
+		}
+
+do_store:
+		/* Store the low word; only BPF_DW also stores the high word */
+		if (BPF_SIZE(code) == BPF_DW) {
+			emit_str_r(dst_lo, rn, dstk, off, ctx, BPF_W);
+			emit_str_r(dst_lo, rm, dstk, off+4, ctx, BPF_W);
+		} else {
+			emit_str_r(dst_lo, rn, dstk, off, ctx, sz);
+		}
+		break;
+		break;
+	}
+	/* PC += off if dst == src */
+	/* PC += off if dst > src */
+	/* PC += off if dst >= src */
+	/* PC += off if dst != src */
+	/* PC += off if dst > src (signed) */
+	/* PC += off if dst >= src (signed) */
+	/* PC += off if dst & src */
+	case BPF_JMP | BPF_JEQ | BPF_X:
+	case BPF_JMP | BPF_JGT | BPF_X:
+	case BPF_JMP | BPF_JGE | BPF_X:
+	case BPF_JMP | BPF_JNE | BPF_X:
+	case BPF_JMP | BPF_JSGT | BPF_X:
+	case BPF_JMP | BPF_JSGE | BPF_X:
+	case BPF_JMP | BPF_JSET | BPF_X:
+		/* Setup source registers */
+		rm = sstk ? tmp2[0] : src_hi;
+		rn = sstk ? tmp2[1] : src_lo;
+		if (sstk) {
+			emit(ARM_LDR_I(rn, ARM_SP, STACK_VAR(src_lo)), ctx);
+			emit(ARM_LDR_I(rm, ARM_SP, STACK_VAR(src_hi)), ctx);
+		}
+		goto go_jmp;
+	/* PC += off if dst == imm */
+	/* PC += off if dst > imm */
+	/* PC += off if dst >= imm */
+	/* PC += off if dst != imm */
+	/* PC += off if dst > imm (signed) */
+	/* PC += off if dst >= imm (signed) */
+	/* PC += off if dst & imm */
+	case BPF_JMP | BPF_JEQ | BPF_K:
+	case BPF_JMP | BPF_JGT | BPF_K:
+	case BPF_JMP | BPF_JGE | BPF_K:
+	case BPF_JMP | BPF_JNE | BPF_K:
+	case BPF_JMP | BPF_JSGT | BPF_K:
+	case BPF_JMP | BPF_JSGE | BPF_K:
+	case BPF_JMP | BPF_JSET | BPF_K:
+		if (off == 0)
 			break;
-		case BPF_ANC | SKF_AD_PROTOCOL:
-			/* A = ntohs(skb->protocol) */
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  protocol) != 2);
-			off = offsetof(struct sk_buff, protocol);
-			emit(ARM_LDRH_I(r_scratch, r_skb, off), ctx);
-			emit_swap16(r_A, r_scratch, ctx);
+		rm = tmp2[0];
+		rn = tmp2[1];
+		/* Sign-extend immediate value */
+		emit_a32_mov_i64(true, tmp2, imm, false, ctx);
+go_jmp:
+		/* Setup destination register */
+		rd = dstk ? tmp[0] : dst_hi;
+		rt = dstk ? tmp[1] : dst_lo;
+		if (dstk) {
+			emit(ARM_LDR_I(rt, ARM_SP, STACK_VAR(dst_lo)), ctx);
+			emit(ARM_LDR_I(rd, ARM_SP, STACK_VAR(dst_hi)), ctx);
+		}
+
+		/* Check for the condition */
+		emit_ar_r(rd, rt, rm, rn, ctx, BPF_OP(code));
+
+		/* Setup JUMP instruction */
+		jmp_offset = bpf2a32_offset(i+off, i, ctx);
+		switch (BPF_OP(code)) {
+		case BPF_JNE:
+		case BPF_JSET:
+			_emit(ARM_COND_NE, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_CPU:
-			/* r_scratch = current_thread_info() */
-			OP_IMM3(ARM_BIC, r_scratch, ARM_SP, THREAD_SIZE - 1, ctx);
-			/* A = current_thread_info()->cpu */
-			BUILD_BUG_ON(FIELD_SIZEOF(struct thread_info, cpu) != 4);
-			off = offsetof(struct thread_info, cpu);
-			emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
+		case BPF_JEQ:
+			_emit(ARM_COND_EQ, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_IFINDEX:
-		case BPF_ANC | SKF_AD_HATYPE:
-			/* A = skb->dev->ifindex */
-			/* A = skb->dev->type */
-			ctx->seen |= SEEN_SKB;
-			off = offsetof(struct sk_buff, dev);
-			emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
-
-			emit(ARM_CMP_I(r_scratch, 0), ctx);
-			emit_err_ret(ARM_COND_EQ, ctx);
-
-			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
-						  ifindex) != 4);
-			BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
-						  type) != 2);
-
-			if (code == (BPF_ANC | SKF_AD_IFINDEX)) {
-				off = offsetof(struct net_device, ifindex);
-				emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
-			} else {
-				/*
-				 * offset of field "type" in "struct
-				 * net_device" is above what can be
-				 * used in the ldrh rd, [rn, #imm]
-				 * instruction, so load the offset in
-				 * a register and use ldrh rd, [rn, rm]
-				 */
-				off = offsetof(struct net_device, type);
-				emit_mov_i(ARM_R3, off, ctx);
-				emit(ARM_LDRH_R(r_A, r_scratch, ARM_R3), ctx);
-			}
+		case BPF_JGT:
+			_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_MARK:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
-			off = offsetof(struct sk_buff, mark);
-			emit(ARM_LDR_I(r_A, r_skb, off), ctx);
+		case BPF_JGE:
+			_emit(ARM_COND_CS, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_RXHASH:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, hash) != 4);
-			off = offsetof(struct sk_buff, hash);
-			emit(ARM_LDR_I(r_A, r_skb, off), ctx);
+		case BPF_JSGT:
+			_emit(ARM_COND_LT, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_VLAN_TAG:
-		case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2);
-			off = offsetof(struct sk_buff, vlan_tci);
-			emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
-			if (code == (BPF_ANC | SKF_AD_VLAN_TAG))
-				OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx);
-			else {
-				OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx);
-				OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
-			}
+		case BPF_JSGE:
+			_emit(ARM_COND_GE, ARM_B(jmp_offset), ctx);
 			break;
-		case BPF_ANC | SKF_AD_PKTTYPE:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  __pkt_type_offset[0]) != 1);
-			off = PKT_TYPE_OFFSET();
-			emit(ARM_LDRB_I(r_A, r_skb, off), ctx);
-			emit(ARM_AND_I(r_A, r_A, PKT_TYPE_MAX), ctx);
-#ifdef __BIG_ENDIAN_BITFIELD
-			emit(ARM_LSR_I(r_A, r_A, 5), ctx);
-#endif
+		}
+		break;
+	/* JMP OFF */
+	case BPF_JMP | BPF_JA:
+	{
+		if (off == 0)
 			break;
-		case BPF_ANC | SKF_AD_QUEUE:
-			ctx->seen |= SEEN_SKB;
-			BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-						  queue_mapping) != 2);
-			BUILD_BUG_ON(offsetof(struct sk_buff,
-					      queue_mapping) > 0xff);
-			off = offsetof(struct sk_buff, queue_mapping);
-			emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
+		jmp_offset = bpf2a32_offset(i+off, i, ctx);
+		check_imm24(jmp_offset);
+		emit(ARM_B(jmp_offset), ctx);
+		break;
+	}
+	/* tail call */
+	case BPF_JMP | BPF_CALL | BPF_X:
+		if (emit_bpf_tail_call(ctx))
+			return -EFAULT;
+		break;
+	/* function call */
+	case BPF_JMP | BPF_CALL:
+	{
+		const u8 *r0 = bpf2a32[BPF_REG_0];
+		const u8 *r1 = bpf2a32[BPF_REG_1];
+		const u8 *r2 = bpf2a32[BPF_REG_2];
+		const u8 *r3 = bpf2a32[BPF_REG_3];
+		const u8 *r4 = bpf2a32[BPF_REG_4];
+		const u8 *r5 = bpf2a32[BPF_REG_5];
+		const u32 func = (u32)__bpf_call_base + imm;
+
+		emit_a32_mov_r64(true, r0, r1, false, false, ctx);
+		emit_a32_mov_r64(true, r1, r2, false, true, ctx);
+		emit_push_r64(r5, 0, ctx);
+		emit_push_r64(r4, 8, ctx);
+		emit_push_r64(r3, 16, ctx);
+
+		emit_a32_mov_i(tmp[1], func, false, ctx);
+		emit_blx_r(tmp[1], ctx);
+
+		/* caller pops the three pushed 64-bit arguments */
+		emit(ARM_ADD_I(ARM_SP, ARM_SP, imm8m(24)), ctx);
+		break;
+	}
+	/* function return */
+	case BPF_JMP | BPF_EXIT:
+		/* Optimization: when the last instruction is EXIT,
+		 * simply fall through to the epilogue.
+		 */
+		if (i == ctx->prog->len - 1)
 			break;
-		case BPF_ANC | SKF_AD_PAY_OFFSET:
-			ctx->seen |= SEEN_SKB | SEEN_CALL;
+		jmp_offset = epilogue_offset(ctx);
+		check_imm24(jmp_offset);
+		emit(ARM_B(jmp_offset), ctx);
+		break;
+notyet:
+		pr_info_once("*** NOT YET: opcode %02x ***\n", code);
+		return -EFAULT;
+	default:
+		pr_err_once("unknown opcode %02x\n", code);
+		return -EINVAL;
+	}
 
-			emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
-			emit_mov_i(ARM_R3, (unsigned int)skb_get_poff, ctx);
-			emit_blx_r(ARM_R3, ctx);
-			emit(ARM_MOV_R(r_A, ARM_R0), ctx);
-			break;
-		case BPF_LDX | BPF_W | BPF_ABS:
-			/*
-			 * load a 32bit word from struct seccomp_data.
-			 * seccomp_check_filter() will already have checked
-			 * that k is 32bit aligned and lies within the
-			 * struct seccomp_data.
-			 */
-			ctx->seen |= SEEN_SKB;
-			emit(ARM_LDR_I(r_A, r_skb, k), ctx);
-			break;
-		default:
-			return -1;
+	if (ctx->flags & FLAG_IMM_OVERFLOW)
+		/*
+		 * this instruction generated an overflow when
+		 * trying to access the literal pool, so
+		 * delegate this filter to the kernel interpreter.
+		 */
+		return -1;
+	return 0;
+}
+
+static int build_body(struct jit_ctx *ctx)
+{
+	const struct bpf_prog *prog = ctx->prog;
+	unsigned int i;
+
+	for (i = 0; i < prog->len; i++) {
+		const struct bpf_insn *insn = &(prog->insnsi[i]);
+		int ret;
+
+		ret = build_insn(insn, ctx);
+
+		/* ret > 0: a 64-bit immediate load, which spans two insns. */
+		if (ret > 0) {
+			i++;
+			if (ctx->target == NULL)
+				ctx->offsets[i] = ctx->idx;
+			continue;
 		}
 
-		if (ctx->flags & FLAG_IMM_OVERFLOW)
-			/*
-			 * this instruction generated an overflow when
-			 * trying to access the literal pool, so
-			 * delegate this filter to the kernel interpreter.
-			 */
-			return -1;
+		if (ctx->target == NULL)
+			ctx->offsets[i] = ctx->idx;
+
+		/* If unsuccessful, return with error code */
+		if (ret)
+			return ret;
 	}
+	return 0;
+}
 
-	/* compute offsets only during the first pass */
-	if (ctx->target == NULL)
-		ctx->offsets[i] = ctx->idx * 4;
+static int validate_code(struct jit_ctx *ctx)
+{
+	int i;
+
+	for (i = 0; i < ctx->idx; i++) {
+		if (ctx->target[i] == __opcode_to_mem_arm(ARM_INST_UDF))
+			return -1;
+	}
 
 	return 0;
 }
 
+void bpf_jit_compile(struct bpf_prog *prog)
+{
+	/* Nothing to do here. We support Internal BPF. */
+}
 
-void bpf_jit_compile(struct bpf_prog *fp)
+struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
+#ifdef __LITTLE_ENDIAN
+	struct bpf_prog *tmp, *orig_prog = prog;
 	struct bpf_binary_header *header;
+	bool tmp_blinded = false;
 	struct jit_ctx ctx;
-	unsigned tmp_idx;
-	unsigned alloc_size;
-	u8 *target_ptr;
+	unsigned int tmp_idx;
+	unsigned int image_size;
+	u8 *image_ptr;
 
+	/* If BPF JIT was not enabled then we must fall back to
+	 * the interpreter.
+	 */
 	if (!bpf_jit_enable)
-		return;
+		return orig_prog;
 
-	memset(&ctx, 0, sizeof(ctx));
-	ctx.skf		= fp;
-	ctx.ret0_fp_idx = -1;
+	/* If constant blinding was enabled and we failed during blinding
+	 * then we must fall back to the interpreter. Otherwise, we save
+	 * the new JITed code.
+	 */
+	tmp = bpf_jit_blind_constants(prog);
 
-	ctx.offsets = kzalloc(4 * (ctx.skf->len + 1), GFP_KERNEL);
-	if (ctx.offsets == NULL)
-		return;
+	if (IS_ERR(tmp))
+		return orig_prog;
+	if (tmp != prog) {
+		tmp_blinded = true;
+		prog = tmp;
+	}
 
-	/* fake pass to fill in the ctx->seen */
-	if (unlikely(build_body(&ctx)))
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.prog = prog;
+
+	/* If we cannot allocate memory for offsets[], then
+	 * we must fall back to the interpreter.
+	 */
+	ctx.offsets = kcalloc(prog->len, sizeof(int), GFP_KERNEL);
+	if (ctx.offsets == NULL) {
+		prog = orig_prog;
 		goto out;
+	}
+
+	/* 1) fake pass to find the length of the JITed code,
+	 * to compute ctx->offsets and other context variables
+	 * needed to compute final JITed code.
+	 * Also, calculate random starting pointer/start of JITed code
+	 * which is prefixed by random number of fault instructions.
+	 *
+	 * If the first pass fails then there is no chance of it
+	 * being successful in the second pass, so just fall back
+	 * to the interpreter.
+	 */
+	if (build_body(&ctx)) {
+		prog = orig_prog;
+		goto out_off;
+	}
 
 	tmp_idx = ctx.idx;
 	build_prologue(&ctx);
 	ctx.prologue_bytes = (ctx.idx - tmp_idx) * 4;
 
+	ctx.epilogue_offset = ctx.idx;
+
 #if __LINUX_ARM_ARCH__ < 7
 	tmp_idx = ctx.idx;
 	build_epilogue(&ctx);
@@ -1021,64 +1863,96 @@ void bpf_jit_compile(struct bpf_prog *fp)
 
 	ctx.idx += ctx.imm_count;
 	if (ctx.imm_count) {
-		ctx.imms = kzalloc(4 * ctx.imm_count, GFP_KERNEL);
-		if (ctx.imms == NULL)
-			goto out;
+		ctx.imms = kcalloc(ctx.imm_count, sizeof(u32), GFP_KERNEL);
+		if (ctx.imms == NULL) {
+			prog = orig_prog;
+			goto out_off;
+		}
 	}
 #else
-	/* there's nothing after the epilogue on ARMv7 */
+	/* there's nothing after the epilogue on ARMv7 */
 	build_epilogue(&ctx);
 #endif
-	alloc_size = 4 * ctx.idx;
-	header = bpf_jit_binary_alloc(alloc_size, &target_ptr,
-				      4, jit_fill_hole);
-	if (header == NULL)
-		goto out;
+	/* Now we can get the actual image size of the JITed arm code.
+	 * Currently, we are not considering the THUMB-2 instructions
+	 * for jit, although it can decrease the size of the image.
+	 *
+	 * As each arm instruction is 32 bits long, we translate the
+	 * number of JITed instructions into the size required to store
+	 * the JITed code.
+	 */
+	image_size = sizeof(u32) * ctx.idx;
+
+	/* Now we know the size of the structure to make */
+	header = bpf_jit_binary_alloc(image_size, &image_ptr,
+				      sizeof(u32), jit_fill_hole);
+	/* If we cannot allocate memory for the structure, then
+	 * we must fall back to the interpreter.
+	 */
+	if (header == NULL) {
+		prog = orig_prog;
+		goto out_imms;
+	}
 
-	ctx.target = (u32 *) target_ptr;
+	/* 2.) Actual pass to generate final JIT code */
+	ctx.target = (u32 *) image_ptr;
 	ctx.idx = 0;
 
 	build_prologue(&ctx);
+
+	/* If building the body of the JITed code fails somehow,
+	 * we fall back to the interpreter.
+	 */
 	if (build_body(&ctx) < 0) {
-#if __LINUX_ARM_ARCH__ < 7
-		if (ctx.imm_count)
-			kfree(ctx.imms);
-#endif
+		image_ptr = NULL;
 		bpf_jit_binary_free(header);
-		goto out;
+		prog = orig_prog;
+		goto out_imms;
 	}
 	build_epilogue(&ctx);
 
+	/* 3.) Extra pass to validate JITed Code */
+	if (validate_code(&ctx)) {
+		image_ptr = NULL;
+		bpf_jit_binary_free(header);
+		prog = orig_prog;
+		goto out_imms;
+	}
 	flush_icache_range((u32)header, (u32)(ctx.target + ctx.idx));
 
-#if __LINUX_ARM_ARCH__ < 7
-	if (ctx.imm_count)
-		kfree(ctx.imms);
-#endif
-
 	if (bpf_jit_enable > 1)
 		/* there are 2 passes here */
-		bpf_jit_dump(fp->len, alloc_size, 2, ctx.target);
+		bpf_jit_dump(prog->len, image_size, 2, ctx.target);
 
 	set_memory_ro((unsigned long)header, header->pages);
-	fp->bpf_func = (void *)ctx.target;
-	fp->jited = 1;
-out:
+	prog->bpf_func = (void *)ctx.target;
+	prog->jited = 1;
+out_imms:
+#if __LINUX_ARM_ARCH__ < 7
+	if (ctx.imm_count)
+		kfree(ctx.imms);
+#endif
+out_off:
 	kfree(ctx.offsets);
-	return;
+out:
+	if (tmp_blinded)
+		bpf_jit_prog_release_other(prog, prog == orig_prog ?
+					   tmp : orig_prog);
+#endif /* __LITTLE_ENDIAN */
+	return prog;
 }
 
-void bpf_jit_free(struct bpf_prog *fp)
+void bpf_jit_free(struct bpf_prog *prog)
 {
-	unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
+	unsigned long addr = (unsigned long)prog->bpf_func & PAGE_MASK;
 	struct bpf_binary_header *header = (void *)addr;
 
-	if (!fp->jited)
+	if (!prog->jited)
 		goto free_filter;
 
 	set_memory_rw(addr, header->pages);
 	bpf_jit_binary_free(header);
 
 free_filter:
-	bpf_prog_unlock_free(fp);
+	bpf_prog_unlock_free(prog);
 }
diff --git a/arch/arm/net/bpf_jit_32.h b/arch/arm/net/bpf_jit_32.h
index c46fca2..d5cf5f6 100644
--- a/arch/arm/net/bpf_jit_32.h
+++ b/arch/arm/net/bpf_jit_32.h
@@ -11,6 +11,7 @@
 #ifndef PFILTER_OPCODES_ARM_H
 #define PFILTER_OPCODES_ARM_H
 
+/* ARM 32bit Registers */
 #define ARM_R0	0
 #define ARM_R1	1
 #define ARM_R2	2
@@ -22,38 +23,43 @@
 #define ARM_R8	8
 #define ARM_R9	9
 #define ARM_R10	10
-#define ARM_FP	11
-#define ARM_IP	12
-#define ARM_SP	13
-#define ARM_LR	14
-#define ARM_PC	15
-
-#define ARM_COND_EQ		0x0
-#define ARM_COND_NE		0x1
-#define ARM_COND_CS		0x2
+#define ARM_FP	11	/* Frame Pointer */
+#define ARM_IP	12	/* Intra-procedure scratch register */
+#define ARM_SP	13	/* Stack pointer: as load/store base reg */
+#define ARM_LR	14	/* Link Register */
+#define ARM_PC	15	/* Program counter */
+
+#define ARM_COND_EQ		0x0	/* == */
+#define ARM_COND_NE		0x1	/* != */
+#define ARM_COND_CS		0x2	/* unsigned >= */
 #define ARM_COND_HS		ARM_COND_CS
-#define ARM_COND_CC		0x3
+#define ARM_COND_CC		0x3	/* unsigned < */
 #define ARM_COND_LO		ARM_COND_CC
-#define ARM_COND_MI		0x4
-#define ARM_COND_PL		0x5
-#define ARM_COND_VS		0x6
-#define ARM_COND_VC		0x7
-#define ARM_COND_HI		0x8
-#define ARM_COND_LS		0x9
-#define ARM_COND_GE		0xa
-#define ARM_COND_LT		0xb
-#define ARM_COND_GT		0xc
-#define ARM_COND_LE		0xd
-#define ARM_COND_AL		0xe
+#define ARM_COND_MI		0x4	/* < 0 */
+#define ARM_COND_PL		0x5	/* >= 0 */
+#define ARM_COND_VS		0x6	/* Signed Overflow */
+#define ARM_COND_VC		0x7	/* No Signed Overflow */
+#define ARM_COND_HI		0x8	/* unsigned > */
+#define ARM_COND_LS		0x9	/* unsigned <= */
+#define ARM_COND_GE		0xa	/* Signed >= */
+#define ARM_COND_LT		0xb	/* Signed < */
+#define ARM_COND_GT		0xc	/* Signed > */
+#define ARM_COND_LE		0xd	/* Signed <= */
+#define ARM_COND_AL		0xe	/* Always (unconditional) */
 
 /* register shift types */
 #define SRTYPE_LSL		0
 #define SRTYPE_LSR		1
 #define SRTYPE_ASR		2
 #define SRTYPE_ROR		3
+#define SRTYPE_ASL		(SRTYPE_LSL)
 
 #define ARM_INST_ADD_R		0x00800000
+#define ARM_INST_ADDS_R		0x00900000
+#define ARM_INST_ADC_R		0x00a00000
+#define ARM_INST_ADC_I		0x02a00000
 #define ARM_INST_ADD_I		0x02800000
+#define ARM_INST_ADDS_I		0x02900000
 
 #define ARM_INST_AND_R		0x00000000
 #define ARM_INST_AND_I		0x02000000
@@ -76,8 +82,10 @@
 #define ARM_INST_LDRH_I		0x01d000b0
 #define ARM_INST_LDRH_R		0x019000b0
 #define ARM_INST_LDR_I		0x05900000
+#define ARM_INST_LDR_R		0x07900000
 
 #define ARM_INST_LDM		0x08900000
+#define ARM_INST_LDM_IA		0x08b00000
 
 #define ARM_INST_LSL_I		0x01a00000
 #define ARM_INST_LSL_R		0x01a00010
@@ -86,6 +94,7 @@
 #define ARM_INST_LSR_R		0x01a00030
 
 #define ARM_INST_MOV_R		0x01a00000
+#define ARM_INST_MOVS_R		0x01b00000
 #define ARM_INST_MOV_I		0x03a00000
 #define ARM_INST_MOVW		0x03000000
 #define ARM_INST_MOVT		0x03400000
@@ -96,17 +105,28 @@
 #define ARM_INST_PUSH		0x092d0000
 
 #define ARM_INST_ORR_R		0x01800000
+#define ARM_INST_ORRS_R		0x01900000
 #define ARM_INST_ORR_I		0x03800000
 
 #define ARM_INST_REV		0x06bf0f30
 #define ARM_INST_REV16		0x06bf0fb0
 
 #define ARM_INST_RSB_I		0x02600000
+#define ARM_INST_RSBS_I		0x02700000
+#define ARM_INST_RSC_I		0x02e00000
 
 #define ARM_INST_SUB_R		0x00400000
+#define ARM_INST_SUBS_R		0x00500000
+#define ARM_INST_RSB_R		0x00600000
 #define ARM_INST_SUB_I		0x02400000
+#define ARM_INST_SUBS_I		0x02500000
+#define ARM_INST_SBC_I		0x02c00000
+#define ARM_INST_SBC_R		0x00c00000
+#define ARM_INST_SBCS_R		0x00d00000
 
 #define ARM_INST_STR_I		0x05800000
+#define ARM_INST_STRB_I		0x05c00000
+#define ARM_INST_STRH_I		0x01c000b0
 
 #define ARM_INST_TST_R		0x01100000
 #define ARM_INST_TST_I		0x03100000
@@ -117,6 +137,8 @@
 
 #define ARM_INST_MLS		0x00600090
 
+#define ARM_INST_UXTH		0x06ff0070
+
 /*
  * Use a suitable undefined instruction to use for ARM/Thumb2 faulting.
  * We need to be careful not to conflict with those used by other modules
@@ -135,9 +157,15 @@
 #define _AL3_R(op, rd, rn, rm)	((op ## _R) | (rd) << 12 | (rn) << 16 | (rm))
 /* immediate */
 #define _AL3_I(op, rd, rn, imm)	((op ## _I) | (rd) << 12 | (rn) << 16 | (imm))
+/* register with register-shift */
+#define _AL3_SR(inst)	(inst | (1 << 4))
 
 #define ARM_ADD_R(rd, rn, rm)	_AL3_R(ARM_INST_ADD, rd, rn, rm)
+#define ARM_ADDS_R(rd, rn, rm)	_AL3_R(ARM_INST_ADDS, rd, rn, rm)
 #define ARM_ADD_I(rd, rn, imm)	_AL3_I(ARM_INST_ADD, rd, rn, imm)
+#define ARM_ADDS_I(rd, rn, imm)	_AL3_I(ARM_INST_ADDS, rd, rn, imm)
+#define ARM_ADC_R(rd, rn, rm)	_AL3_R(ARM_INST_ADC, rd, rn, rm)
+#define ARM_ADC_I(rd, rn, imm)	_AL3_I(ARM_INST_ADC, rd, rn, imm)
 
 #define ARM_AND_R(rd, rn, rm)	_AL3_R(ARM_INST_AND, rd, rn, rm)
 #define ARM_AND_I(rd, rn, imm)	_AL3_I(ARM_INST_AND, rd, rn, imm)
@@ -156,7 +184,9 @@
 #define ARM_EOR_I(rd, rn, imm)	_AL3_I(ARM_INST_EOR, rd, rn, imm)
 
 #define ARM_LDR_I(rt, rn, off)	(ARM_INST_LDR_I | (rt) << 12 | (rn) << 16 \
-				 | (off))
+				 | ((off) & 0xfff))
+#define ARM_LDR_R(rt, rn, rm)	(ARM_INST_LDR_R | (rt) << 12 | (rn) << 16 \
+				 | (rm))
 #define ARM_LDRB_I(rt, rn, off)	(ARM_INST_LDRB_I | (rt) << 12 | (rn) << 16 \
 				 | (off))
 #define ARM_LDRB_R(rt, rn, rm)	(ARM_INST_LDRB_R | (rt) << 12 | (rn) << 16 \
@@ -167,15 +197,23 @@
 				 | (rm))
 
 #define ARM_LDM(rn, regs)	(ARM_INST_LDM | (rn) << 16 | (regs))
+#define ARM_LDM_IA(rn, regs)	(ARM_INST_LDM_IA | (rn) << 16 | (regs))
 
 #define ARM_LSL_R(rd, rn, rm)	(_AL3_R(ARM_INST_LSL, rd, 0, rn) | (rm) << 8)
 #define ARM_LSL_I(rd, rn, imm)	(_AL3_I(ARM_INST_LSL, rd, 0, rn) | (imm) << 7)
 
 #define ARM_LSR_R(rd, rn, rm)	(_AL3_R(ARM_INST_LSR, rd, 0, rn) | (rm) << 8)
 #define ARM_LSR_I(rd, rn, imm)	(_AL3_I(ARM_INST_LSR, rd, 0, rn) | (imm) << 7)
+#define ARM_ASR_R(rd, rn, rm)   (_AL3_R(ARM_INST_ASR, rd, 0, rn) | (rm) << 8)
+#define ARM_ASR_I(rd, rn, imm)  (_AL3_I(ARM_INST_ASR, rd, 0, rn) | (imm) << 7)
 
 #define ARM_MOV_R(rd, rm)	_AL3_R(ARM_INST_MOV, rd, 0, rm)
+#define ARM_MOVS_R(rd, rm)	_AL3_R(ARM_INST_MOVS, rd, 0, rm)
 #define ARM_MOV_I(rd, imm)	_AL3_I(ARM_INST_MOV, rd, 0, imm)
+#define ARM_MOV_SR(rd, rm, type, rs)	\
+	(_AL3_SR(ARM_MOV_R(rd, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_MOV_SI(rd, rm, type, imm6)	\
+	(ARM_MOV_R(rd, rm) | (type) << 5 | (imm6) << 7)
 
 #define ARM_MOVW(rd, imm)	\
 	(ARM_INST_MOVW | ((imm) >> 12) << 16 | (rd) << 12 | ((imm) & 0x0fff))
@@ -190,19 +228,38 @@
 
 #define ARM_ORR_R(rd, rn, rm)	_AL3_R(ARM_INST_ORR, rd, rn, rm)
 #define ARM_ORR_I(rd, rn, imm)	_AL3_I(ARM_INST_ORR, rd, rn, imm)
-#define ARM_ORR_S(rd, rn, rm, type, rs)	\
-	(ARM_ORR_R(rd, rn, rm) | (type) << 5 | (rs) << 7)
+#define ARM_ORR_SR(rd, rn, rm, type, rs)	\
+	(_AL3_SR(ARM_ORR_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_ORRS_R(rd, rn, rm)	_AL3_R(ARM_INST_ORRS, rd, rn, rm)
+#define ARM_ORRS_SR(rd, rn, rm, type, rs)	\
+	(_AL3_SR(ARM_ORRS_R(rd, rn, rm)) | (type) << 5 | (rs) << 8)
+#define ARM_ORR_SI(rd, rn, rm, type, imm6)	\
+	(ARM_ORR_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
+#define ARM_ORRS_SI(rd, rn, rm, type, imm6)	\
+	(ARM_ORRS_R(rd, rn, rm) | (type) << 5 | (imm6) << 7)
 
 #define ARM_REV(rd, rm)		(ARM_INST_REV | (rd) << 12 | (rm))
 #define ARM_REV16(rd, rm)	(ARM_INST_REV16 | (rd) << 12 | (rm))
 
 #define ARM_RSB_I(rd, rn, imm)	_AL3_I(ARM_INST_RSB, rd, rn, imm)
+#define ARM_RSBS_I(rd, rn, imm)	_AL3_I(ARM_INST_RSBS, rd, rn, imm)
+#define ARM_RSC_I(rd, rn, imm)	_AL3_I(ARM_INST_RSC, rd, rn, imm)
 
 #define ARM_SUB_R(rd, rn, rm)	_AL3_R(ARM_INST_SUB, rd, rn, rm)
+#define ARM_SUBS_R(rd, rn, rm)	_AL3_R(ARM_INST_SUBS, rd, rn, rm)
+#define ARM_RSB_R(rd, rn, rm)	_AL3_R(ARM_INST_RSB, rd, rn, rm)
+#define ARM_SBC_R(rd, rn, rm)	_AL3_R(ARM_INST_SBC, rd, rn, rm)
+#define ARM_SBCS_R(rd, rn, rm)	_AL3_R(ARM_INST_SBCS, rd, rn, rm)
 #define ARM_SUB_I(rd, rn, imm)	_AL3_I(ARM_INST_SUB, rd, rn, imm)
+#define ARM_SUBS_I(rd, rn, imm)	_AL3_I(ARM_INST_SUBS, rd, rn, imm)
+#define ARM_SBC_I(rd, rn, imm)	_AL3_I(ARM_INST_SBC, rd, rn, imm)
 
 #define ARM_STR_I(rt, rn, off)	(ARM_INST_STR_I | (rt) << 12 | (rn) << 16 \
-				 | (off))
+				 | ((off) & 0xfff))
+#define ARM_STRH_I(rt, rn, off)	(ARM_INST_STRH_I | (rt) << 12 | (rn) << 16 \
+				 | (((off) & 0xf0) << 4) | ((off) & 0xf))
+#define ARM_STRB_I(rt, rn, off)	(ARM_INST_STRB_I | (rt) << 12 | (rn) << 16 \
+				 | ((off) & 0xfff))
 
 #define ARM_TST_R(rn, rm)	_AL3_R(ARM_INST_TST, 0, rn, rm)
 #define ARM_TST_I(rn, imm)	_AL3_I(ARM_INST_TST, 0, rn, imm)
@@ -214,5 +271,6 @@
 
 #define ARM_MLS(rd, rn, rm, ra)	(ARM_INST_MLS | (rd) << 16 | (rn) | (rm) << 8 \
 				 | (ra) << 12)
+#define ARM_UXTH(rd, rm)	(ARM_INST_UXTH | (rd) << 12 | (rm))
 
 #endif /* PFILTER_OPCODES_ARM_H */
-- 
2.7.4
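
For readers following the pass structure of bpf_int_jit_compile() in the
patch above, here is a minimal user-space sketch of the same two-pass idea.
The first pass runs with a NULL target, so it only counts instruction words
and records per-instruction offsets; the second pass writes into a buffer of
exactly that size. The names toy_ctx, toy_emit, toy_build_body and toy_jit,
and the fixed "two output words per input instruction" expansion, are
assumptions made for illustration. This is not the kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct jit_ctx; not the kernel structure. */
struct toy_ctx {
	uint32_t *target;	/* NULL during the sizing pass */
	unsigned int idx;	/* index of the next output word */
	unsigned int *offsets;	/* offsets[i]: words emitted once insn i is done */
};

/* Always advance the index; store only when an output buffer exists. */
static void toy_emit(uint32_t inst, struct toy_ctx *ctx)
{
	if (ctx->target != NULL)
		ctx->target[ctx->idx] = inst;
	ctx->idx++;
}

/* Assumption: every toy input instruction expands to two output words. */
static void toy_build_body(const uint32_t *insns, unsigned int len,
			   struct toy_ctx *ctx)
{
	unsigned int i;

	for (i = 0; i < len; i++) {
		toy_emit(insns[i], ctx);
		toy_emit(~insns[i], ctx);
		/* record offsets only during the sizing pass */
		if (ctx->target == NULL)
			ctx->offsets[i] = ctx->idx;
	}
}

/*
 * Returns a malloc'ed image and stores its length (in words) in *words.
 * Returns NULL if allocation fails, in which case a real JIT would fall
 * back to the interpreter.
 */
static uint32_t *toy_jit(const uint32_t *insns, unsigned int len,
			 unsigned int *words, unsigned int *offsets)
{
	struct toy_ctx ctx = { NULL, 0, offsets };
	uint32_t *image;

	toy_build_body(insns, len, &ctx);	/* pass 1: sizing only */
	*words = ctx.idx;

	image = malloc(sizeof(uint32_t) * ctx.idx);
	if (image == NULL)
		return NULL;

	ctx.target = image;
	ctx.idx = 0;
	toy_build_body(insns, len, &ctx);	/* pass 2: real emission */

	/* both passes must agree on the image size */
	assert(ctx.idx == *words);
	return image;
}
```

The design point is that the emit helper behaves identically in both passes
apart from the store, so the size computed in pass 1 cannot drift from what
pass 2 writes.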


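The macros in bpf_jit_32.h build each ARM instruction word by OR-ing
register numbers and immediates into fixed bit positions. The user-space
sketch below mirrors ARM_ADD_R and ARM_STRH_I from the patch and then places
a condition code in bits 31:28. The function names enc_add_r, enc_strh_i and
with_cond are hypothetical, and the condition step is an assumption about
what the JIT's emit path does, since that code is not part of this excerpt.

```c
#include <assert.h>
#include <stdint.h>

#define ARM_COND_AL	0xeu
#define ARM_INST_ADD_R	0x00800000u
#define ARM_INST_STRH_I	0x01c000b0u

/* Assumed emit-time step: place the condition code in bits 31:28. */
static uint32_t with_cond(uint32_t cond, uint32_t inst)
{
	return inst | (cond << 28);
}

/* Same layout as ARM_ADD_R: Rd in bits 15:12, Rn in 19:16, Rm in 3:0. */
static uint32_t enc_add_r(uint32_t rd, uint32_t rn, uint32_t rm)
{
	assert(rd < 16 && rn < 16 && rm < 16);
	return ARM_INST_ADD_R | rd << 12 | rn << 16 | rm;
}

/*
 * Same layout as ARM_STRH_I: the 8-bit offset is split, with the high
 * nibble in bits 11:8 and the low nibble in bits 3:0.
 */
static uint32_t enc_strh_i(uint32_t rt, uint32_t rn, uint32_t off)
{
	assert(rt < 16 && rn < 16 && off < 256);
	return ARM_INST_STRH_I | rt << 12 | rn << 16 |
	       ((off & 0xf0) << 4) | (off & 0xf);
}
```

With these helpers, with_cond(ARM_COND_AL, enc_add_r(0, 1, 2)) yields
0xe0810002, the A32 encoding of "add r0, r1, r2".
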

* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-07-06  3:49                               ` Shubham Bansal
  (?)
@ 2017-07-07  4:42                                 ` Kees Cook
  -1 siblings, 0 replies; 87+ messages in thread
From: Kees Cook @ 2017-07-07  4:42 UTC (permalink / raw)
  To: Shubham Bansal
  Cc: Daniel Borkmann, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

On Wed, Jul 5, 2017 at 8:49 PM, Shubham Bansal
<illusionist.neo@gmail.com> wrote:
> Hi Kees,
>
> Problem is my ARM machine doesn't have clang and iproute2, which is
> keeping me from testing the bpf tail calls.
>
> You should do the following to test it:
>
> 1. cd tools/testing/selftests/bpf/
> 2. make
> 3. sudo ./test_progs
>
> And, before testing, you have to do "make headers_install".
> These tests are for tail calls with the attached patch. If it's too
> much work, can you please upload your arm image so that I can test it?
> I just need a good machine.

I've got all this set up now, and it faults during the test:

Unable to handle kernel NULL pointer dereference at virtual address 00000008
...
CPU: 0 PID: 1922 Comm: test_progs Not tainted 4.12.0+ #60
...
PC is at __htab_map_lookup_elem+0x54/0x1f4

I'll see if I can send you this disk image...

-Kees


-- 
Kees Cook
Pixel Security


* Re: [PATCH v2] arm: eBPF JIT compiler
  2017-07-07  4:42                                 ` Kees Cook
  (?)
@ 2017-07-07  4:49                                   ` Shubham Bansal
  -1 siblings, 0 replies; 87+ messages in thread
From: Shubham Bansal @ 2017-07-07  4:49 UTC (permalink / raw)
  To: Kees Cook
  Cc: Daniel Borkmann, Network Development, David S. Miller,
	Alexei Starovoitov, Russell King, linux-arm-kernel, LKML,
	Andrew Lunn

Okay Kees. I will take a look at it.
Best,
Shubham Bansal


On Fri, Jul 7, 2017 at 10:12 AM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Jul 5, 2017 at 8:49 PM, Shubham Bansal
> <illusionist.neo@gmail.com> wrote:
>> Hi Kees,
>>
>> Problem is my ARM machine doesn't have clang and iproute2, which is
>> keeping me from testing the bpf tail calls.
>>
>> You should do the following to test it:
>>
>> 1. cd tools/testing/selftests/bpf/
>> 2. make
>> 3. sudo ./test_progs
>>
>> And, before testing, you have to do "make headers_install".
>> These tests are for tail calls with the attached patch. If it's too
>> much work, can you please upload your arm image so that I can test it?
>> I just need a good machine.
>
> I've got all this set up now, and it faults during the test:
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000008
> ...
> CPU: 0 PID: 1922 Comm: test_progs Not tainted 4.12.0+ #60
> ...
> PC is at __htab_map_lookup_elem+0x54/0x1f4
>
> I'll see if I can send you this disk image...
>
> -Kees
>
>
> --
> Kees Cook
> Pixel Security

