* [RFC PATCH tip 0/5] tracing filters with BPF
@ 2013-12-03  4:28 Alexei Starovoitov
From: Alexei Starovoitov @ 2013-12-03  4:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel

Hi All,

The following set of patches adds BPF support to trace filters.

Trace filters can be written in C and allow safe read-only access to any
kernel data structure, similar to SystemTap, but with safety guaranteed by
the kernel.

The user can do:
cat bpf_program > /sys/kernel/debug/tracing/.../filter
where the tracing event is either a static tracepoint or a dynamic event
created via kprobe_events.

The filter program may look like:
void filter(struct bpf_context *ctx)
{
        char devname[4] = "eth5";
        struct net_device *dev;
        struct sk_buff *skb = 0;

        dev = (struct net_device *)ctx->regs.si;
        if (bpf_memcmp(dev->name, devname, 4) == 0) {
                char fmt[] = "skb %p dev %p eth5\n";
                bpf_trace_printk(fmt, skb, dev, 0, 0);
        }
}

The kernel will do static analysis of the BPF program to make sure that it
cannot crash the kernel (no loops, only valid memory/register accesses, etc).
Then the kernel will map the BPF instructions to x86 instructions and let the
result run in place of the trace filter.

To demonstrate performance, I ran a synthetic test:
        dev = init_net.loopback_dev;
        do_gettimeofday(&start_tv);
        for (i = 0; i < 1000000; i++) {
                struct sk_buff *skb;
                skb = netdev_alloc_skb(dev, 128);
                kfree_skb(skb);
        }
        do_gettimeofday(&end_tv);
        time = end_tv.tv_sec - start_tv.tv_sec;
        time *= USEC_PER_SEC;
        time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec);

        printk("1M skb alloc/free %lld (usecs)\n", time);

no tracing
[   33.450966] 1M skb alloc/free 145179 (usecs)

echo 1 > enable
[   97.186379] 1M skb alloc/free 240419 (usecs)
(tracing slows down kfree_skb() due to event_buffer_lock/buffer_unlock_commit)

echo 'name==eth5' > filter
[  139.644161] 1M skb alloc/free 302552 (usecs)
(running filter_match_preds() for every skb and discarding
event_buffer is even slower)

cat bpf_prog > filter
[  171.150566] 1M skb alloc/free 199463 (usecs)
(JITed bpf program is safely checking dev->name == eth5 and discarding)

echo 0 > enable
[  258.073593] 1M skb alloc/free 144919 (usecs)
(tracing is disabled, performance is back to original)

The C program compiled into BPF and then JITed to x86 is faster than the
filter_match_preds() approach: 199-145 = 54 msec of overhead vs
302-145 = 157 msec per 1M events.

tracing+BPF is a tool for safe read-only access to kernel variables without
recompiling the kernel and without affecting running programs.

BPF filters can be written manually (see tools/bpf/trace/filter_ex1.c) or,
better, compiled from restricted C via GCC or LLVM.
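
As a rough illustration (a sketch, not the actual contents of filter_ex1.c),
a manually written filter can be assembled from the BPF_INSN_* macros that
patch 1 adds to include/linux/bpf.h. The bpf_context offset and the helper id
passed to BPF_INSN_CALL below are placeholders:

#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/bpf.h>

/* sketch only: the ctx offset and the helper id are placeholders */
static int build_example_filter(struct bpf_insn *out, int max_insns)
{
	struct bpf_insn prog[] = {
		/* R2 = *(u64 *)(R1 + 8); R1 holds the bpf_context on entry */
		BPF_INSN_LD(BPF_DW, R2, R1, 8),
		/* if (R2 == 0) skip over the call insn */
		BPF_INSN_JUMP_IMM(BPF_JEQ, R2, 0, 1),
		/* call a helper; '1' stands in for a strtab-based func id */
		BPF_INSN_CALL(1),
		BPF_INSN_RET(),
	};

	if (max_insns < (int)ARRAY_SIZE(prog))
		return -EINVAL;
	memcpy(out, prog, sizeof(prog));
	return ARRAY_SIZE(prog);
}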

Q: What is the difference between existing BPF and extended BPF?
A:
Existing BPF insn from uapi/linux/filter.h
struct sock_filter {
        __u16   code;   /* Actual filter code */
        __u8    jt;     /* Jump true */
        __u8    jf;     /* Jump false */
        __u32   k;      /* Generic multiuse field */
};

Extended BPF insn from linux/bpf.h
struct bpf_insn {
        __u8    code;    /* opcode */
        __u8    a_reg:4; /* dest register*/
        __u8    x_reg:4; /* source register */
        __s16   off;     /* signed offset */
        __s32   imm;     /* signed immediate constant */
};

opcode encoding is the same between old BPF and extended BPF.
Original BPF has two 32-bit registers.
Extended BPF has ten 64-bit registers.
That is the main difference.

Old BPF used the jt/jf fields for jump insns only.
Extended BPF replaces them with a generic signed 'off' field, used by both
jump and non-jump insns.
The old 'k' field and the new 'imm' field have the same meaning.
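
To make the encoding difference concrete, here is the same "jump if the value
equals 5" test in both forms (a sketch; the jump offsets are arbitrary, and in
this RFC the shared opcode #defines carry the same values in both headers):

#include <linux/filter.h>	/* struct sock_filter (old BPF) */
#include <linux/bpf.h>		/* struct bpf_insn (extended BPF, this series) */

/* old BPF: implicit 32-bit accumulator A, separate jt/jf jump offsets */
struct sock_filter old_jeq = {
	.code = BPF_JMP | BPF_JEQ | BPF_K,
	.jt   = 2,	/* jump 2 insns ahead when A == 5 */
	.jf   = 0,	/* fall through otherwise */
	.k    = 5,
};

/* extended BPF: explicit 64-bit register, one signed 'off' taken on match */
struct bpf_insn new_jeq = {
	.code  = BPF_JMP | BPF_JEQ | BPF_K,
	.a_reg = R1,
	.x_reg = 0,
	.off   = 2,	/* pc += 2 when R1 == 5, fall through otherwise */
	.imm   = 5,
};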

Thanks

Alexei Starovoitov (5):
  Extended BPF core framework
  Extended BPF JIT for x86-64
  Extended BPF (64-bit BPF) design document
  use BPF in tracing filters
  tracing filter examples in BPF

 Documentation/bpf_jit.txt            |  204 +++++++
 arch/x86/Kconfig                     |    1 +
 arch/x86/net/Makefile                |    1 +
 arch/x86/net/bpf64_jit_comp.c        |  625 ++++++++++++++++++++
 arch/x86/net/bpf_jit_comp.c          |   23 +-
 arch/x86/net/bpf_jit_comp.h          |   35 ++
 include/linux/bpf.h                  |  149 +++++
 include/linux/bpf_jit.h              |  129 +++++
 include/linux/ftrace_event.h         |    3 +
 include/trace/bpf_trace.h            |   27 +
 include/trace/ftrace.h               |   14 +
 kernel/Makefile                      |    1 +
 kernel/bpf_jit/Makefile              |    3 +
 kernel/bpf_jit/bpf_check.c           | 1054 ++++++++++++++++++++++++++++++++++
 kernel/bpf_jit/bpf_run.c             |  452 +++++++++++++++
 kernel/trace/Kconfig                 |    1 +
 kernel/trace/Makefile                |    1 +
 kernel/trace/bpf_trace_callbacks.c   |  191 ++++++
 kernel/trace/trace.c                 |    7 +
 kernel/trace/trace.h                 |   11 +-
 kernel/trace/trace_events.c          |    9 +-
 kernel/trace/trace_events_filter.c   |   61 +-
 kernel/trace/trace_kprobe.c          |    6 +
 lib/Kconfig.debug                    |   15 +
 tools/bpf/llvm/README.txt            |    6 +
 tools/bpf/trace/Makefile             |   34 ++
 tools/bpf/trace/README.txt           |   15 +
 tools/bpf/trace/filter_ex1.c         |   52 ++
 tools/bpf/trace/filter_ex1_orig.c    |   23 +
 tools/bpf/trace/filter_ex2.c         |   74 +++
 tools/bpf/trace/filter_ex2_orig.c    |   47 ++
 tools/bpf/trace/trace_filter_check.c |   82 +++
 32 files changed, 3332 insertions(+), 24 deletions(-)
 create mode 100644 Documentation/bpf_jit.txt
 create mode 100644 arch/x86/net/bpf64_jit_comp.c
 create mode 100644 arch/x86/net/bpf_jit_comp.h
 create mode 100644 include/linux/bpf.h
 create mode 100644 include/linux/bpf_jit.h
 create mode 100644 include/trace/bpf_trace.h
 create mode 100644 kernel/bpf_jit/Makefile
 create mode 100644 kernel/bpf_jit/bpf_check.c
 create mode 100644 kernel/bpf_jit/bpf_run.c
 create mode 100644 kernel/trace/bpf_trace_callbacks.c
 create mode 100644 tools/bpf/llvm/README.txt
 create mode 100644 tools/bpf/trace/Makefile
 create mode 100644 tools/bpf/trace/README.txt
 create mode 100644 tools/bpf/trace/filter_ex1.c
 create mode 100644 tools/bpf/trace/filter_ex1_orig.c
 create mode 100644 tools/bpf/trace/filter_ex2.c
 create mode 100644 tools/bpf/trace/filter_ex2_orig.c
 create mode 100644 tools/bpf/trace/trace_filter_check.c

-- 
1.7.9.5



* [RFC PATCH tip 1/5] Extended BPF core framework
  2013-12-03  4:28 [RFC PATCH tip 0/5] tracing filters with BPF Alexei Starovoitov
@ 2013-12-03  4:28 ` Alexei Starovoitov
From: Alexei Starovoitov @ 2013-12-03  4:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel

Extended BPF (or 64-bit BPF) is an instruction set for creating safe,
dynamically loadable filters that can call a fixed set of kernel functions
and take a generic bpf_context as input.
A BPF filter is the glue between kernel functions and the bpf_context.
Different kernel subsystems can define their own set of available functions
and tailor the BPF machinery to their specific use case.

include/linux/bpf.h - instruction set definition
kernel/bpf_jit/bpf_check.c - code safety checker/static analyzer
kernel/bpf_jit/bpf_run.c - emulator for archs without BPF64_JIT

The extended BPF instruction set is designed for efficient mapping to native
instructions on 64-bit CPUs.
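
For illustration, a minimal (hypothetical) user of this core could look like
the sketch below. Every callback body is a placeholder; the real trace-side
callbacks are added later in the series (kernel/trace/bpf_trace_callbacks.c):

#include <linux/bpf_jit.h>

static void demo_execute_func(char *strtab, int id, u64 *regs)
{
	/* interpreter path: dispatch on the name at strtab + id,
	 * read args from regs[R1]..regs[R5], put the result in regs[R0] */
}

static const struct bpf_func_proto demo_proto = {
	.ret_type  = RET_VOID,
	.arg1_type = INVALID_PTR,
	.arg2_type = INVALID_PTR,
	.arg3_type = INVALID_PTR,
	.arg4_type = INVALID_PTR,
};

static const struct bpf_func_proto *demo_get_func_proto(char *strtab, int id)
{
	return &demo_proto;	/* sketch: accept any helper */
}

static const struct bpf_context_access *demo_get_context_access(int off)
{
	static const struct bpf_context_access access = {
		.size = 8,
		.type = BPF_READ,
	};
	return &access;		/* sketch: allow 8-byte reads at any offset */
}

static struct bpf_callbacks demo_cb = {
	.execute_func       = demo_execute_func,
	.get_func_proto     = demo_get_func_proto,
	.get_context_access = demo_get_context_access,
};

static int demo_load_and_run(const char *image, int len,
			     struct bpf_context *ctx)
{
	struct bpf_program *prog;
	int err;

	err = bpf_load_image(image, len, &demo_cb, &prog); /* verify + JIT */
	if (err)
		return err;

	if (prog->jit_image)
		prog->jit_image(ctx);	/* JITed native code (CONFIG_BPF64_JIT) */
	else
		bpf_run(prog, ctx);	/* portable interpreter */

	bpf_free(prog);
	return 0;
}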

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/bpf.h        |  149 +++++++
 include/linux/bpf_jit.h    |  129 ++++++
 kernel/Makefile            |    1 +
 kernel/bpf_jit/Makefile    |    3 +
 kernel/bpf_jit/bpf_check.c | 1054 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf_jit/bpf_run.c   |  452 +++++++++++++++++++
 lib/Kconfig.debug          |   15 +
 7 files changed, 1803 insertions(+)
 create mode 100644 include/linux/bpf.h
 create mode 100644 include/linux/bpf_jit.h
 create mode 100644 kernel/bpf_jit/Makefile
 create mode 100644 kernel/bpf_jit/bpf_check.c
 create mode 100644 kernel/bpf_jit/bpf_run.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
new file mode 100644
index 0000000..a7d3fb7
--- /dev/null
+++ b/include/linux/bpf.h
@@ -0,0 +1,149 @@
+/* 64-bit BPF is Copyright (c) 2011-2013, PLUMgrid, http://plumgrid.com */
+
+#ifndef __LINUX_BPF_H__
+#define __LINUX_BPF_H__
+
+#include <linux/types.h>
+
+struct bpf_insn {
+	__u8	code;    /* opcode */
+	__u8    a_reg:4; /* dest register*/
+	__u8    x_reg:4; /* source register */
+	__s16	off;     /* signed offset */
+	__s32	imm;     /* signed immediate constant */
+};
+
+struct bpf_table {
+	__u32   type;
+	__u32   key_size;
+	__u32   elem_size;
+	__u32   max_entries;
+	__u32   param1;         /* meaning is table-dependent */
+};
+
+enum bpf_table_type {
+	BPF_TABLE_HASH = 1,
+	BPF_TABLE_LPM
+};
+
+/* maximum number of insns and tables in a BPF program */
+#define MAX_BPF_INSNS 4096
+#define MAX_BPF_TABLES 64
+#define MAX_BPF_STRTAB_SIZE 1024
+
+/* pointer to bpf_context is the first and only argument to BPF program
+ * its definition is use-case specific */
+struct bpf_context;
+
+/* bpf_add|sub|...: a += x
+ *         bpf_mov: a = x
+ *       bpf_bswap: bswap a */
+#define BPF_INSN_ALU(op, a, x) \
+	(struct bpf_insn){BPF_ALU|BPF_OP(op)|BPF_X, a, x, 0, 0}
+
+/* bpf_add|sub|...: a += imm
+ *         bpf_mov: a = imm */
+#define BPF_INSN_ALU_IMM(op, a, imm) \
+	(struct bpf_insn){BPF_ALU|BPF_OP(op)|BPF_K, a, 0, 0, imm}
+
+/* a = *(uint *) (x + off) */
+#define BPF_INSN_LD(size, a, x, off) \
+	(struct bpf_insn){BPF_LDX|BPF_SIZE(size)|BPF_REL, a, x, off, 0}
+
+/* *(uint *) (a + off) = x */
+#define BPF_INSN_ST(size, a, off, x) \
+	(struct bpf_insn){BPF_STX|BPF_SIZE(size)|BPF_REL, a, x, off, 0}
+
+/* *(uint *) (a + off) = imm */
+#define BPF_INSN_ST_IMM(size, a, off, imm) \
+	(struct bpf_insn){BPF_ST|BPF_SIZE(size)|BPF_REL, a, 0, off, imm}
+
+/* lock *(uint *) (a + off) += x */
+#define BPF_INSN_XADD(size, a, off, x) \
+	(struct bpf_insn){BPF_STX|BPF_SIZE(size)|BPF_XADD, a, x, off, 0}
+
+/* if (a 'op' x) pc += off else fall through */
+#define BPF_INSN_JUMP(op, a, x, off) \
+	(struct bpf_insn){BPF_JMP|BPF_OP(op)|BPF_X, a, x, off, 0}
+
+/* if (a 'op' imm) pc += off else fall through */
+#define BPF_INSN_JUMP_IMM(op, a, imm, off) \
+	(struct bpf_insn){BPF_JMP|BPF_OP(op)|BPF_K, a, 0, off, imm}
+
+#define BPF_INSN_RET() \
+	(struct bpf_insn){BPF_RET|BPF_K, 0, 0, 0, 0}
+
+#define BPF_INSN_CALL(fn_code) \
+	(struct bpf_insn){BPF_JMP|BPF_CALL, 0, 0, 0, fn_code}
+
+/* Instruction classes */
+#define BPF_CLASS(code) ((code) & 0x07)
+#define         BPF_LD          0x00
+#define         BPF_LDX         0x01
+#define         BPF_ST          0x02
+#define         BPF_STX         0x03
+#define         BPF_ALU         0x04
+#define         BPF_JMP         0x05
+#define         BPF_RET         0x06
+
+/* ld/ldx fields */
+#define BPF_SIZE(code)  ((code) & 0x18)
+#define         BPF_W           0x00
+#define         BPF_H           0x08
+#define         BPF_B           0x10
+#define         BPF_DW          0x18
+#define BPF_MODE(code)  ((code) & 0xe0)
+#define         BPF_IMM         0x00
+#define         BPF_ABS         0x20
+#define         BPF_IND         0x40
+#define         BPF_MEM         0x60
+#define         BPF_LEN         0x80
+#define         BPF_MSH         0xa0
+#define         BPF_REL         0xc0
+#define         BPF_XADD        0xe0 /* exclusive add */
+
+/* alu/jmp fields */
+#define BPF_OP(code)    ((code) & 0xf0)
+#define         BPF_ADD         0x00
+#define         BPF_SUB         0x10
+#define         BPF_MUL         0x20
+#define         BPF_DIV         0x30
+#define         BPF_OR          0x40
+#define         BPF_AND         0x50
+#define         BPF_LSH         0x60
+#define         BPF_RSH         0x70 /* logical shift right */
+#define         BPF_NEG         0x80
+#define         BPF_MOD         0x90
+#define         BPF_XOR         0xa0
+#define         BPF_MOV         0xb0 /* mov reg to reg */
+#define         BPF_ARSH        0xc0 /* sign extending arithmetic shift right */
+#define         BPF_BSWAP32     0xd0 /* swap lower 4 bytes of 64-bit register */
+#define         BPF_BSWAP64     0xe0 /* swap all 8 bytes of 64-bit register */
+
+#define         BPF_JA          0x00
+#define         BPF_JEQ         0x10 /* jump == */
+#define         BPF_JGT         0x20 /* GT is unsigned '>', JA in x86 */
+#define         BPF_JGE         0x30 /* GE is unsigned '>=', JAE in x86 */
+#define         BPF_JSET        0x40
+#define         BPF_JNE         0x50 /* jump != */
+#define         BPF_JSGT        0x60 /* SGT is signed '>', GT in x86 */
+#define         BPF_JSGE        0x70 /* SGE is signed '>=', GE in x86 */
+#define         BPF_CALL        0x80 /* function call */
+#define BPF_SRC(code)   ((code) & 0x08)
+#define         BPF_K           0x00
+#define         BPF_X           0x08
+
+/* 64-bit registers */
+#define         R0              0
+#define         R1              1
+#define         R2              2
+#define         R3              3
+#define         R4              4
+#define         R5              5
+#define         R6              6
+#define         R7              7
+#define         R8              8
+#define         R9              9
+#define         __fp__          10
+
+#endif /* __LINUX_BPF_H__ */
diff --git a/include/linux/bpf_jit.h b/include/linux/bpf_jit.h
new file mode 100644
index 0000000..84b85d7
--- /dev/null
+++ b/include/linux/bpf_jit.h
@@ -0,0 +1,129 @@
+/* 64-bit BPF is Copyright (c) 2011-2013, PLUMgrid, http://plumgrid.com */
+
+#ifndef __LINUX_BPF_JIT_H__
+#define __LINUX_BPF_JIT_H__
+
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/bpf.h>
+
+/*
+ * type of value stored in a BPF register or
+ * passed into function as an argument or
+ * returned from the function
+ */
+enum bpf_reg_type {
+	INVALID_PTR,  /* reg doesn't contain a valid pointer */
+	PTR_TO_CTX,   /* reg points to bpf_context */
+	PTR_TO_TABLE, /* reg points to table element */
+	PTR_TO_TABLE_CONDITIONAL, /* points to table element or NULL */
+	PTR_TO_STACK,     /* reg == frame_pointer */
+	PTR_TO_STACK_IMM, /* reg == frame_pointer + imm */
+	PTR_TO_STACK_IMM_TABLE_KEY, /* pointer to stack used as table key */
+	PTR_TO_STACK_IMM_TABLE_ELEM, /* pointer to stack used as table elem */
+	RET_INTEGER, /* function returns integer */
+	RET_VOID,    /* function returns void */
+	CONST_ARG,    /* function expects integer constant argument */
+	CONST_ARG_TABLE_ID, /* int const argument that is used as table_id */
+	/*
+	 * int const argument indicating number of bytes accessed from stack
+	 * previous function argument must be ptr_to_stack_imm
+	 */
+	CONST_ARG_STACK_IMM_SIZE,
+};
+
+/* BPF function prototype */
+struct bpf_func_proto {
+	enum bpf_reg_type ret_type;
+	enum bpf_reg_type arg1_type;
+	enum bpf_reg_type arg2_type;
+	enum bpf_reg_type arg3_type;
+	enum bpf_reg_type arg4_type;
+};
+
+/* struct bpf_context access type */
+enum bpf_access_type {
+	BPF_READ = 1,
+	BPF_WRITE = 2
+};
+
+struct bpf_context_access {
+	int size;
+	enum bpf_access_type type;
+};
+
+struct bpf_callbacks {
+	/* execute BPF func_id with given registers */
+	void (*execute_func)(char *strtab, int id, u64 *regs);
+
+	/* return address of func_id suitable to be called from JITed program */
+	void *(*jit_select_func)(char *strtab, int id);
+
+	/* return BPF function prototype for verification */
+	const struct bpf_func_proto* (*get_func_proto)(char *strtab, int id);
+
+	/* return expected bpf_context access size and permissions
+	 * for given byte offset within bpf_context */
+	const struct bpf_context_access *(*get_context_access)(int off);
+};
+
+struct bpf_program {
+	int   insn_cnt;
+	int   table_cnt;
+	int   strtab_size;
+	struct bpf_insn *insns;
+	struct bpf_table *tables;
+	char *strtab;
+	struct bpf_callbacks *cb;
+	void (*jit_image)(struct bpf_context *ctx);
+	struct work_struct work;
+};
+/*
+ * BPF image format:
+ * 4 bytes "bpf\0"
+ * 4 bytes - size of insn section in bytes
+ * 4 bytes - size of table definition section in bytes
+ * 4 bytes - size of strtab section in bytes
+ * bpf insns: one or more of 'struct bpf_insn'
+ * hash table definitions: zero or more of 'struct bpf_table'
+ * string table: zero separated ascii strings
+ *
+ * bpf_load_image() - load BPF image, setup callback extensions
+ * and run through verifier
+ */
+int bpf_load_image(const char *image, int image_len, struct bpf_callbacks *cb,
+		   struct bpf_program **prog);
+
+/* free BPF program */
+void bpf_free(struct bpf_program *prog);
+
+/* execute BPF program */
+void bpf_run(struct bpf_program *prog, struct bpf_context *ctx);
+
+/* verify correctness of BPF program */
+int bpf_check(struct bpf_program *prog);
+
+/* pr_info one BPF instruction and its registers */
+void pr_info_bpf_insn(struct bpf_insn *insn, u64 *regs);
+
+static inline void free_bpf_program(struct bpf_program *prog)
+{
+	kfree(prog->strtab);
+	kfree(prog->tables);
+	kfree(prog->insns);
+	kfree(prog);
+}
+#if defined(CONFIG_BPF64_JIT)
+void bpf_compile(struct bpf_program *prog);
+void __bpf_free(struct bpf_program *prog);
+#else
+static inline void bpf_compile(struct bpf_program *prog)
+{
+}
+static inline void __bpf_free(struct bpf_program *prog)
+{
+	free_bpf_program(prog);
+}
+#endif
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index bbaf7d5..68a974e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -83,6 +83,7 @@ obj-$(CONFIG_TRACING) += trace/
 obj-$(CONFIG_TRACE_CLOCK) += trace/
 obj-$(CONFIG_RING_BUFFER) += trace/
 obj-$(CONFIG_TRACEPOINTS) += trace/
+obj-$(CONFIG_BPF64) += bpf_jit/
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
 obj-$(CONFIG_CPU_PM) += cpu_pm.o
 
diff --git a/kernel/bpf_jit/Makefile b/kernel/bpf_jit/Makefile
new file mode 100644
index 0000000..2e576f9
--- /dev/null
+++ b/kernel/bpf_jit/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_BPF64) += bpf_check.o
+obj-$(CONFIG_BPF64) += bpf_run.o
+
diff --git a/kernel/bpf_jit/bpf_check.c b/kernel/bpf_jit/bpf_check.c
new file mode 100644
index 0000000..b6f5f4af
--- /dev/null
+++ b/kernel/bpf_jit/bpf_check.c
@@ -0,0 +1,1054 @@
+/* Copyright (c) 2011-2013 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf_jit.h>
+
+/*
+ * bpf_check() is a static code analyzer that walks the BPF program
+ * instruction by instruction and updates register/stack state.
+ * All paths of conditional branches are analyzed until 'ret' insn.
+ *
+ * At the first pass depth-first-search verifies that the BPF program is a DAG.
+ * It rejects the following programs:
+ * - larger than 4K insns or 64 tables
+ * - if loop is present (detected via back-edge)
+ * - unreachable insns exist (shouldn't be a forest. program = one function)
+ * - more than one ret insn
+ * - ret insn is not a last insn
+ * - out of bounds or malformed jumps
+ * The second pass is all possible path descent from the 1st insn.
+ * Conditional branch target insns keep a linked list of verifier states.
+ * If the state was already visited, this path can be pruned.
+ * If it wasn't a DAG, such state pruning would be incorrect, since it would
+ * skip cycles. Since it's analyzing all paths through the program,
+ * the length of the analysis is limited to 32k insns, which may be hit even
+ * if insn_cnt < 4K, but there are too many branches that change stack/regs.
+ * Number of 'branches to be analyzed' is limited to 1k
+ *
+ * All registers are 64-bit (even on 32-bit arch)
+ * R0 - return register
+ * R1-R5 argument passing registers
+ * R6-R9 callee saved registers
+ * R10 - frame pointer read-only
+ *
+ * At the start of BPF program the register R1 contains a pointer to bpf_context
+ * and has type PTR_TO_CTX.
+ *
+ * R10 has type PTR_TO_STACK. The sequence 'mov Rx, R10; add Rx, imm' changes
+ * Rx state to PTR_TO_STACK_IMM and the immediate constant is saved for later
+ * stack bounds checking
+ *
+ * registers used to pass pointers to function calls are verified against
+ * function prototypes
+ *
+ * Example: before the call to bpf_table_lookup(), R1 must have type PTR_TO_CTX,
+ * R2 must contain an integer constant and R3 must be PTR_TO_STACK_IMM_TABLE_KEY.
+ * The integer constant in R2 is a table_id. It's checked that 0 <= R2 < table_cnt
+ * and the corresponding table_info->key_size is fetched to check that
+ * [R3, R3 + table_info->key_size) is within stack limits and all that stack
+ * memory was initialized earlier by the BPF program.
+ * After bpf_table_lookup() call insn, R0 is set to PTR_TO_TABLE_CONDITIONAL
+ * R1-R5 are cleared and no longer readable (but still writeable).
+ *
+ * The bpf_table_lookup() function returns either a pointer to a table value
+ * or NULL, which has type PTR_TO_TABLE_CONDITIONAL. Once it passes a != 0 insn
+ * the register holding that pointer in the true branch changes state to
+ * PTR_TO_TABLE and the same register changes state to INVALID_PTR in the false
+ * branch. See check_cond_jmp_op()
+ *
+ * load/store alignment is checked
+ * Ex: stx [Rx + 3], (u32)Ry is rejected
+ *
+ * load/store to stack bounds checked and register spill is tracked
+ * Ex: stx [R10 + 0], (u8)Rx is rejected
+ *
+ * load/store to table bounds checked and table_id provides table size
+ * Ex: stx [Rx + 8], (u16)Ry is ok, if Rx is PTR_TO_TABLE and
+ * 8 + sizeof(u16) <= table_info->elem_size
+ *
+ * load/store to bpf_context checked against known fields
+ *
+ * Future improvements:
+ * stack size is hardcoded to 512 bytes maximum per program, relax it
+ */
+#define _(OP) ({ int ret = OP; if (ret < 0) return ret; })
+
+/* JITed code allocates 512 bytes and uses the bottom 4 slots
+ * to save R6-R9
+ */
+#define MAX_BPF_STACK (512 - 4 * 8)
+
+struct reg_state {
+	enum bpf_reg_type ptr;
+	bool read_ok;
+	int imm;
+};
+
+#define MAX_REG 11
+
+enum bpf_stack_slot_type {
+	STACK_INVALID,    /* nothing was stored in this stack slot */
+	STACK_SPILL,      /* 1st byte of register spilled into stack */
+	STACK_SPILL_PART, /* other 7 bytes of register spill */
+	STACK_MISC	  /* BPF program wrote some data into this slot */
+};
+
+struct bpf_stack_slot {
+	enum bpf_stack_slot_type type;
+	enum bpf_reg_type ptr;
+	int imm;
+};
+
+/* state of the program:
+ * type of all registers and stack info
+ */
+struct verifier_state {
+	struct reg_state regs[MAX_REG];
+	struct bpf_stack_slot stack[MAX_BPF_STACK];
+};
+
+/* linked list of verifier states
+ * used to prune search
+ */
+struct verifier_state_list {
+	struct verifier_state state;
+	struct verifier_state_list *next;
+};
+
+/* verifier_state + insn_idx are pushed to stack
+ * when branch is encountered
+ */
+struct verifier_stack_elem {
+	struct verifier_state st;
+	int insn_idx; /* at insn 'insn_idx' the program state is 'st' */
+	struct verifier_stack_elem *next;
+};
+
+/* single container for all structs
+ * one verifier_env per bpf_check() call
+ */
+struct verifier_env {
+	struct bpf_program *prog;
+	struct verifier_stack_elem *head;
+	int stack_size;
+	struct verifier_state cur_state;
+	struct verifier_state_list **branch_landing;
+};
+
+static int pop_stack(struct verifier_env *env)
+{
+	int insn_idx;
+	struct verifier_stack_elem *elem;
+	if (env->head == NULL)
+		return -1;
+	memcpy(&env->cur_state, &env->head->st, sizeof(env->cur_state));
+	insn_idx = env->head->insn_idx;
+	elem = env->head->next;
+	kfree(env->head);
+	env->head = elem;
+	env->stack_size--;
+	return insn_idx;
+}
+
+static struct verifier_state *push_stack(struct verifier_env *env, int insn_idx)
+{
+	struct verifier_stack_elem *elem;
+	elem = kmalloc(sizeof(struct verifier_stack_elem), GFP_KERNEL);
+	if (!elem)
+		goto err;
+	memcpy(&elem->st, &env->cur_state, sizeof(env->cur_state));
+	elem->insn_idx = insn_idx;
+	elem->next = env->head;
+	env->head = elem;
+	env->stack_size++;
+	if (env->stack_size > 1024) {
+		pr_err("BPF program is too complex\n");
+		goto err;
+	}
+	return &elem->st;
+err:
+	/* pop all elements and return */
+	while (pop_stack(env) >= 0);
+	return NULL;
+}
+
+#define CALLER_SAVED_REGS 6
+static const int caller_saved[CALLER_SAVED_REGS] = { R0, R1, R2, R3, R4, R5 };
+
+static void init_reg_state(struct reg_state *regs)
+{
+	struct reg_state *reg;
+	int i;
+	for (i = 0; i < MAX_REG; i++) {
+		regs[i].ptr = INVALID_PTR;
+		regs[i].read_ok = false;
+		regs[i].imm = 0xbadbad;
+	}
+	reg = regs + __fp__;
+	reg->ptr = PTR_TO_STACK;
+	reg->read_ok = true;
+
+	reg = regs + R1;	/* 1st arg to a function */
+	reg->ptr = PTR_TO_CTX;
+	reg->read_ok = true;
+}
+
+static void mark_reg_no_ptr(struct reg_state *regs, int regno)
+{
+	regs[regno].ptr = INVALID_PTR;
+	regs[regno].imm = 0xbadbad;
+	regs[regno].read_ok = true;
+}
+
+static int check_reg_arg(struct reg_state *regs, int regno, bool is_src)
+{
+	if (is_src) {
+		if (!regs[regno].read_ok) {
+			pr_err("R%d !read_ok\n", regno);
+			return -EACCES;
+		}
+	} else {
+		if (regno == __fp__)
+			/* frame pointer is read only */
+			return -EACCES;
+		mark_reg_no_ptr(regs, regno);
+	}
+	return 0;
+}
+
+static int bpf_size_to_bytes(int bpf_size)
+{
+	if (bpf_size == BPF_W)
+		return 4;
+	else if (bpf_size == BPF_H)
+		return 2;
+	else if (bpf_size == BPF_B)
+		return 1;
+	else if (bpf_size == BPF_DW)
+		return 8;
+	else
+		return -EACCES;
+}
+
+static int check_stack_write(struct verifier_state *state, int off, int size,
+			     int value_regno)
+{
+	int i;
+	struct bpf_stack_slot *slot;
+	if (value_regno >= 0 &&
+	    (state->regs[value_regno].ptr == PTR_TO_TABLE ||
+	     state->regs[value_regno].ptr == PTR_TO_CTX)) {
+
+		/* register containing pointer is being spilled into stack */
+		if (size != 8) {
+			pr_err("invalid size of register spill\n");
+			return -EACCES;
+		}
+
+		slot = &state->stack[MAX_BPF_STACK + off];
+		slot->type = STACK_SPILL;
+		/* save register state */
+		slot->ptr = state->regs[value_regno].ptr;
+		slot->imm = state->regs[value_regno].imm;
+		for (i = 1; i < 8; i++) {
+			slot = &state->stack[MAX_BPF_STACK + off + i];
+			slot->type = STACK_SPILL_PART;
+		}
+	} else {
+
+		/* regular write of data into stack */
+		for (i = 0; i < size; i++) {
+			slot = &state->stack[MAX_BPF_STACK + off + i];
+			slot->type = STACK_MISC;
+		}
+	}
+	return 0;
+}
+
+static int check_stack_read(struct verifier_state *state, int off, int size,
+			    int value_regno)
+{
+	int i;
+	struct bpf_stack_slot *slot;
+
+	slot = &state->stack[MAX_BPF_STACK + off];
+
+	if (slot->type == STACK_SPILL) {
+		if (size != 8) {
+			pr_err("invalid size of register spill\n");
+			return -EACCES;
+		}
+		for (i = 1; i < 8; i++) {
+			if (state->stack[MAX_BPF_STACK + off + i].type !=
+			    STACK_SPILL_PART) {
+				pr_err("corrupted spill memory\n");
+				return -EACCES;
+			}
+		}
+
+		/* restore register state from stack */
+		state->regs[value_regno].ptr = slot->ptr;
+		state->regs[value_regno].imm = slot->imm;
+		state->regs[value_regno].read_ok = true;
+		return 0;
+	} else {
+		for (i = 0; i < size; i++) {
+			if (state->stack[MAX_BPF_STACK + off + i].type !=
+			    STACK_MISC) {
+				pr_err("invalid read from stack off %d+%d size %d\n",
+				       off, i, size);
+				return -EACCES;
+			}
+		}
+		/* have read misc data from the stack */
+		mark_reg_no_ptr(state->regs, value_regno);
+		return 0;
+	}
+}
+
+static int get_table_info(struct verifier_env *env, int table_id,
+			  struct bpf_table **tablep)
+{
+	/* if BPF program contains bpf_table_lookup(ctx, 1024, key)
+	 * the incorrect table_id will be caught here
+	 */
+	if (table_id < 0 || table_id >= env->prog->table_cnt) {
+		pr_err("invalid access to table_id=%d max_tables=%d\n",
+		       table_id, env->prog->table_cnt);
+		return -EACCES;
+	}
+	*tablep = &env->prog->tables[table_id];
+	return 0;
+}
+
+/* check read/write into table element returned by bpf_table_lookup() */
+static int check_table_access(struct verifier_env *env, int regno, int off,
+			      int size)
+{
+	struct bpf_table *table;
+	int table_id = env->cur_state.regs[regno].imm;
+
+	_(get_table_info(env, table_id, &table));
+
+	if (off < 0 || off + size > table->elem_size) {
+		pr_err("invalid access to table_id=%d leaf_size=%d off=%d size=%d\n",
+		       table_id, table->elem_size, off, size);
+		return -EACCES;
+	}
+	return 0;
+}
+
+/* check access to 'struct bpf_context' fields */
+static int check_ctx_access(struct verifier_env *env, int off, int size,
+			    enum bpf_access_type t)
+{
+	const struct bpf_context_access *access;
+
+	if (off < 0 || off >= 32768/* struct bpf_context shouldn't be huge */)
+		goto error;
+
+	access = env->prog->cb->get_context_access(off);
+	if (!access)
+		goto error;
+
+	if (access->size == size && (access->type & t))
+		return 0;
+error:
+	pr_err("invalid bpf_context access off=%d size=%d\n", off, size);
+	return -EACCES;
+}
+
+static int check_mem_access(struct verifier_env *env, int regno, int off,
+			    int bpf_size, enum bpf_access_type t,
+			    int value_regno)
+{
+	struct verifier_state *state = &env->cur_state;
+	int size;
+	_(size = bpf_size_to_bytes(bpf_size));
+
+	if (off % size != 0) {
+		pr_err("misaligned access off %d size %d\n", off, size);
+		return -EACCES;
+	}
+
+	if (state->regs[regno].ptr == PTR_TO_TABLE) {
+		_(check_table_access(env, regno, off, size));
+		if (t == BPF_READ)
+			mark_reg_no_ptr(state->regs, value_regno);
+	} else if (state->regs[regno].ptr == PTR_TO_CTX) {
+		_(check_ctx_access(env, off, size, t));
+		if (t == BPF_READ)
+			mark_reg_no_ptr(state->regs, value_regno);
+	} else if (state->regs[regno].ptr == PTR_TO_STACK) {
+		if (off >= 0 || off < -MAX_BPF_STACK) {
+			pr_err("invalid stack off=%d size=%d\n", off, size);
+			return -EACCES;
+		}
+		if (t == BPF_WRITE)
+			_(check_stack_write(state, off, size, value_regno));
+		else
+			_(check_stack_read(state, off, size, value_regno));
+	} else {
+		pr_err("invalid mem access %d\n", state->regs[regno].ptr);
+		return -EACCES;
+	}
+	return 0;
+}
+
+/*
+ * when register 'regno' is passed into function that will read 'access_size'
+ * bytes from that pointer, make sure that it's within stack boundary
+ * and all elements of stack are initialized
+ */
+static int check_stack_boundary(struct verifier_env *env,
+				int regno, int access_size)
+{
+	struct verifier_state *state = &env->cur_state;
+	struct reg_state *regs = state->regs;
+	int off, i;
+
+	if (regs[regno].ptr != PTR_TO_STACK_IMM)
+		return -EACCES;
+
+	off = regs[regno].imm;
+	if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 ||
+	    access_size <= 0) {
+		pr_err("invalid stack ptr R%d off=%d access_size=%d\n",
+		       regno, off, access_size);
+		return -EACCES;
+	}
+
+	for (i = 0; i < access_size; i++) {
+		if (state->stack[MAX_BPF_STACK + off + i].type != STACK_MISC) {
+			pr_err("invalid indirect read from stack off %d+%d size %d\n",
+			       off, i, access_size);
+			return -EACCES;
+		}
+	}
+	return 0;
+}
+
+static int check_func_arg(struct verifier_env *env, int regno,
+			  enum bpf_reg_type arg_type, int *table_id,
+			  struct bpf_table **tablep)
+{
+	struct reg_state *reg = env->cur_state.regs + regno;
+	enum bpf_reg_type expected_type;
+
+	if (arg_type == INVALID_PTR)
+		return 0;
+
+	if (!reg->read_ok) {
+		pr_err("R%d !read_ok\n", regno);
+		return -EACCES;
+	}
+
+	if (arg_type == PTR_TO_STACK_IMM_TABLE_KEY ||
+	    arg_type == PTR_TO_STACK_IMM_TABLE_ELEM)
+		expected_type = PTR_TO_STACK_IMM;
+	else if (arg_type == CONST_ARG_TABLE_ID ||
+		 arg_type == CONST_ARG_STACK_IMM_SIZE)
+		expected_type = CONST_ARG;
+	else
+		expected_type = arg_type;
+
+	if (reg->ptr != expected_type) {
+		pr_err("R%d ptr=%d expected=%d\n", regno, reg->ptr,
+		       expected_type);
+		return -EACCES;
+	}
+
+	if (arg_type == CONST_ARG_TABLE_ID) {
+		/* bpf_table_xxx(table_id) call: check that table_id is valid */
+		*table_id = reg->imm;
+		_(get_table_info(env, reg->imm, tablep));
+	} else if (arg_type == PTR_TO_STACK_IMM_TABLE_KEY) {
+		/*
+		 * bpf_table_xxx(..., table_id, ..., key) call:
+		 * check that [key, key + table_info->key_size) are within
+		 * stack limits and initialized
+		 */
+		if (!*tablep) {
+			/*
+			 * in function declaration table_id must come before
+			 * table_key or table_elem, so that it's verified
+			 * and known before we have to check table_key here
+			 */
+			pr_err("invalid table_id to access table->key\n");
+			return -EACCES;
+		}
+		_(check_stack_boundary(env, regno, (*tablep)->key_size));
+	} else if (arg_type == PTR_TO_STACK_IMM_TABLE_ELEM) {
+		/*
+		 * bpf_table_xxx(..., table_id, ..., elem) call:
+		 * check [elem, elem + table_info->elem_size) validity
+		 */
+		if (!*tablep) {
+			pr_err("invalid table_id to access table->elem\n");
+			return -EACCES;
+		}
+		_(check_stack_boundary(env, regno, (*tablep)->elem_size));
+	} else if (arg_type == CONST_ARG_STACK_IMM_SIZE) {
+		/*
+		 * bpf_xxx(..., buf, len) call will access 'len' bytes
+		 * from stack pointer 'buf'. Check it
+		 * note: regno == len, regno - 1 == buf
+		 */
+		_(check_stack_boundary(env, regno - 1, reg->imm));
+	}
+
+	return 0;
+}
+
+static int check_call(struct verifier_env *env, int func_id)
+{
+	struct verifier_state *state = &env->cur_state;
+	const struct bpf_func_proto *fn = NULL;
+	struct reg_state *regs = state->regs;
+	struct bpf_table *table = NULL;
+	int table_id = -1;
+	struct reg_state *reg;
+	int i;
+
+	/* find function prototype */
+	if (func_id <= 0 || func_id >= env->prog->strtab_size) {
+		pr_err("invalid func %d\n", func_id);
+		return -EINVAL;
+	}
+
+	if (env->prog->cb->get_func_proto)
+		fn = env->prog->cb->get_func_proto(env->prog->strtab, func_id);
+
+	if (!fn || (fn->ret_type != RET_INTEGER &&
+		    fn->ret_type != PTR_TO_TABLE_CONDITIONAL &&
+		    fn->ret_type != RET_VOID)) {
+		pr_err("unknown func %d\n", func_id);
+		return -EINVAL;
+	}
+
+	/* check args */
+	_(check_func_arg(env, R1, fn->arg1_type, &table_id, &table));
+	_(check_func_arg(env, R2, fn->arg2_type, &table_id, &table));
+	_(check_func_arg(env, R3, fn->arg3_type, &table_id, &table));
+	_(check_func_arg(env, R4, fn->arg4_type, &table_id, &table));
+
+	/* reset caller saved regs */
+	for (i = 0; i < CALLER_SAVED_REGS; i++) {
+		reg = regs + caller_saved[i];
+		reg->read_ok = false;
+		reg->ptr = INVALID_PTR;
+		reg->imm = 0xbadbad;
+	}
+
+	/* update return register */
+	reg = regs + R0;
+	if (fn->ret_type == RET_INTEGER) {
+		reg->read_ok = true;
+		reg->ptr = INVALID_PTR;
+	} else if (fn->ret_type != RET_VOID) {
+		reg->read_ok = true;
+		reg->ptr = fn->ret_type;
+		if (fn->ret_type == PTR_TO_TABLE_CONDITIONAL)
+			/*
+			 * remember table_id, so that check_table_access()
+			 * can check 'elem_size' boundary of memory access
+			 * to table element returned from bpf_table_lookup()
+			 */
+			reg->imm = table_id;
+	}
+	return 0;
+}
+
+static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn)
+{
+	u16 opcode = BPF_OP(insn->code);
+
+	if (opcode == BPF_BSWAP32 || opcode == BPF_BSWAP64 ||
+	    opcode == BPF_NEG) {
+		if (BPF_SRC(insn->code) != BPF_X)
+			return -EINVAL;
+		/* check src operand */
+		_(check_reg_arg(regs, insn->a_reg, 1));
+
+		/* check dest operand */
+		_(check_reg_arg(regs, insn->a_reg, 0));
+
+	} else if (opcode == BPF_MOV) {
+
+		if (BPF_SRC(insn->code) == BPF_X)
+			/* check src operand */
+			_(check_reg_arg(regs, insn->x_reg, 1));
+
+		/* check dest operand */
+		_(check_reg_arg(regs, insn->a_reg, 0));
+
+		if (BPF_SRC(insn->code) == BPF_X) {
+			/* case: R1 = R2
+			 * copy register state to dest reg
+			 */
+			regs[insn->a_reg].ptr = regs[insn->x_reg].ptr;
+			regs[insn->a_reg].imm = regs[insn->x_reg].imm;
+		} else {
+			/* case: R = imm
+			 * remember the value we stored into this reg
+			 */
+			regs[insn->a_reg].ptr = CONST_ARG;
+			regs[insn->a_reg].imm = insn->imm;
+		}
+
+	} else {	/* all other ALU ops: and, sub, xor, add, ... */
+
+		int stack_relative = 0;
+
+		if (BPF_SRC(insn->code) == BPF_X)
+			/* check src1 operand */
+			_(check_reg_arg(regs, insn->x_reg, 1));
+
+		/* check src2 operand */
+		_(check_reg_arg(regs, insn->a_reg, 1));
+
+		if (opcode == BPF_ADD &&
+		    regs[insn->a_reg].ptr == PTR_TO_STACK &&
+		    BPF_SRC(insn->code) == BPF_K)
+			stack_relative = 1;
+
+		/* check dest operand */
+		_(check_reg_arg(regs, insn->a_reg, 0));
+
+		if (stack_relative) {
+			regs[insn->a_reg].ptr = PTR_TO_STACK_IMM;
+			regs[insn->a_reg].imm = insn->imm;
+		}
+	}
+
+	return 0;
+}
+
+static int check_cond_jmp_op(struct verifier_env *env, struct bpf_insn *insn,
+			     int insn_idx)
+{
+	struct reg_state *regs = env->cur_state.regs;
+	struct verifier_state *other_branch;
+	u16 opcode = BPF_OP(insn->code);
+
+	if (BPF_SRC(insn->code) == BPF_X)
+		/* check src1 operand */
+		_(check_reg_arg(regs, insn->x_reg, 1));
+
+	/* check src2 operand */
+	_(check_reg_arg(regs, insn->a_reg, 1));
+
+	other_branch = push_stack(env, insn_idx + insn->off + 1);
+	if (!other_branch)
+		return -EFAULT;
+
+	/* detect if R == 0 where R is returned value from table_lookup() */
+	if (BPF_SRC(insn->code) == BPF_K &&
+	    insn->imm == 0 && (opcode == BPF_JEQ ||
+			       opcode == BPF_JNE) &&
+	    regs[insn->a_reg].ptr == PTR_TO_TABLE_CONDITIONAL) {
+		if (opcode == BPF_JEQ) {
+			/*
+			 * next fallthrough insn can access memory via
+			 * this register
+			 */
+			regs[insn->a_reg].ptr = PTR_TO_TABLE;
+			/* branch target cannot access it, since reg == 0 */
+			other_branch->regs[insn->a_reg].ptr = INVALID_PTR;
+		} else {
+			other_branch->regs[insn->a_reg].ptr = PTR_TO_TABLE;
+			regs[insn->a_reg].ptr = INVALID_PTR;
+		}
+	}
+	return 0;
+}
+
+
+/*
+ * non-recursive DFS pseudo code
+ * 1  procedure DFS-iterative(G,v):
+ * 2      label v as discovered
+ * 3      let S be a stack
+ * 4      S.push(v)
+ * 5      while S is not empty
+ * 6            t <- S.pop()
+ * 7            if t is what we're looking for:
+ * 8                return t
+ * 9            for all edges e in G.adjacentEdges(t) do
+ * 10               if edge e is already labelled
+ * 11                   continue with the next edge
+ * 12               w <- G.adjacentVertex(t,e)
+ * 13               if vertex w is not discovered and not explored
+ * 14                   label e as tree-edge
+ * 15                   label w as discovered
+ * 16                   S.push(w)
+ * 17                   continue at 5
+ * 18               else if vertex w is discovered
+ * 19                   label e as back-edge
+ * 20               else
+ * 21                   // vertex w is explored
+ * 22                   label e as forward- or cross-edge
+ * 23           label t as explored
+ * 24           S.pop()
+ *
+ * convention:
+ * 1 - discovered
+ * 2 - discovered and 1st branch labelled
+ * 3 - discovered and 1st and 2nd branch labelled
+ * 4 - explored
+ */
+
+#define STATE_END ((struct verifier_state_list *)-1)
+
+#define PUSH_INT(I) \
+	do { \
+		if (cur_stack >= insn_cnt) { \
+			ret = -E2BIG; \
+			goto free_st; \
+		} \
+		stack[cur_stack++] = I; \
+	} while (0)
+
+#define PEAK_INT() \
+	({ \
+		int _ret; \
+		if (cur_stack == 0) \
+			_ret = -1; \
+		else \
+			_ret = stack[cur_stack - 1]; \
+		_ret; \
+	 })
+
+#define POP_INT() \
+	({ \
+		int _ret; \
+		if (cur_stack == 0) \
+			_ret = -1; \
+		else \
+			_ret = stack[--cur_stack]; \
+		_ret; \
+	 })
+
+#define PUSH_INSN(T, W, E) \
+	do { \
+		int w = W; \
+		if (E == 1 && st[T] >= 2) \
+			break; \
+		if (E == 2 && st[T] >= 3) \
+			break; \
+		if (w >= insn_cnt) { \
+			ret = -EACCES; \
+			goto free_st; \
+		} \
+		if (E == 2) \
+			/* mark branch target for state pruning */ \
+			env->branch_landing[w] = STATE_END; \
+		if (st[w] == 0) { \
+			/* tree-edge */ \
+			st[T] = 1 + E; \
+			st[w] = 1; /* discovered */ \
+			PUSH_INT(w); \
+			goto peak_stack; \
+		} else if (st[w] == 1 || st[w] == 2 || st[w] == 3) { \
+			pr_err("back-edge from insn %d to %d\n", t, w); \
+			ret = -EINVAL; \
+			goto free_st; \
+		} else if (st[w] == 4) { \
+			/* forward- or cross-edge */ \
+			st[T] = 1 + E; \
+		} else { \
+			pr_err("insn state internal bug\n"); \
+			ret = -EFAULT; \
+			goto free_st; \
+		} \
+	} while (0)
+
+/* non-recursive depth-first-search to detect loops in BPF program
+ * loop == back-edge in directed graph
+ */
+static int check_cfg(struct verifier_env *env)
+{
+	struct bpf_insn *insns = env->prog->insns;
+	int insn_cnt = env->prog->insn_cnt;
+	int cur_stack = 0;
+	int *stack;
+	int ret = 0;
+	int *st;
+	int i, t;
+
+	if (insns[insn_cnt - 1].code != (BPF_RET | BPF_K)) {
+		pr_err("last insn is not a 'ret'\n");
+		return -EINVAL;
+	}
+
+	st = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
+	if (!st)
+		return -ENOMEM;
+
+	stack = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL);
+	if (!stack) {
+		kfree(st);
+		return -ENOMEM;
+	}
+
+	st[0] = 1; /* mark 1st insn as discovered */
+	PUSH_INT(0);
+
+peak_stack:
+	while ((t = PEAK_INT()) != -1) {
+		if (t == insn_cnt - 1)
+			goto mark_explored;
+
+		if (BPF_CLASS(insns[t].code) == BPF_RET) {
+			pr_err("extraneous 'ret'\n");
+			ret = -EINVAL;
+			goto free_st;
+		}
+
+		if (BPF_CLASS(insns[t].code) == BPF_JMP) {
+			u16 opcode = BPF_OP(insns[t].code);
+			if (opcode == BPF_CALL) {
+				PUSH_INSN(t, t + 1, 1);
+			} else if (opcode == BPF_JA) {
+				if (BPF_SRC(insns[t].code) != BPF_X) {
+					ret = -EINVAL;
+					goto free_st;
+				}
+				PUSH_INSN(t, t + insns[t].off + 1, 1);
+			} else {
+				PUSH_INSN(t, t + 1, 1);
+				PUSH_INSN(t, t + insns[t].off + 1, 2);
+			}
+		} else {
+			PUSH_INSN(t, t + 1, 1);
+		}
+
+mark_explored:
+		st[t] = 4; /* explored */
+		if (POP_INT() == -1) {
+			pr_err("pop_int internal bug\n");
+			ret = -EFAULT;
+			goto free_st;
+		}
+	}
+
+
+	for (i = 0; i < insn_cnt; i++) {
+		if (st[i] != 4) {
+			pr_err("unreachable insn %d\n", i);
+			ret = -EINVAL;
+			goto free_st;
+		}
+	}
+
+free_st:
+	kfree(st);
+	kfree(stack);
+	return ret;
+}
+
+static int is_state_visited(struct verifier_env *env, int insn_idx)
+{
+	struct verifier_state_list *new_sl;
+	struct verifier_state_list *sl;
+
+	sl = env->branch_landing[insn_idx];
+	if (!sl)
+		/* no branch jump to this insn, ignore it */
+		return 0;
+
+	while (sl != STATE_END) {
+		if (memcmp(&sl->state, &env->cur_state,
+			   sizeof(env->cur_state)) == 0)
+			/* reached the same register/stack state,
+			 * prune the search
+			 */
+			return 1;
+		sl = sl->next;
+	}
+	new_sl = kmalloc(sizeof(struct verifier_state_list), GFP_KERNEL);
+
+	if (!new_sl)
+		/* ignore kmalloc error, since it's rare and doesn't affect
+		 * correctness of algorithm
+		 */
+		return 0;
+	/* add new state to the head of linked list */
+	memcpy(&new_sl->state, &env->cur_state, sizeof(env->cur_state));
+	new_sl->next = env->branch_landing[insn_idx];
+	env->branch_landing[insn_idx] = new_sl;
+	return 0;
+}
+
+#undef _
+#define _(OP) ({ err = OP; if (err < 0) goto err_print_insn; })
+
+static int __bpf_check(struct verifier_env *env)
+{
+	struct verifier_state *state = &env->cur_state;
+	struct bpf_insn *insns = env->prog->insns;
+	struct reg_state *regs = state->regs;
+	int insn_cnt = env->prog->insn_cnt;
+	int insn_processed = 0;
+	int insn_idx;
+	int err;
+
+	init_reg_state(regs);
+	insn_idx = 0;
+	for (;;) {
+		struct bpf_insn *insn;
+		u16 class;
+
+		if (insn_idx >= insn_cnt) {
+			pr_err("invalid insn idx %d insn_cnt %d\n",
+			       insn_idx, insn_cnt);
+			return -EFAULT;
+		}
+
+		insn = &insns[insn_idx];
+		class = BPF_CLASS(insn->code);
+
+		if (++insn_processed > 32768) {
+			pr_err("BPF program is too large. Processed %d insn\n",
+			       insn_processed);
+			return -E2BIG;
+		}
+
+		if (is_state_visited(env, insn_idx))
+			goto process_ret;
+
+		if (class == BPF_ALU) {
+			_(check_alu_op(regs, insn));
+
+		} else if (class == BPF_LDX) {
+			if (BPF_MODE(insn->code) != BPF_REL)
+				return -EINVAL;
+
+			/* check src operand */
+			_(check_reg_arg(regs, insn->x_reg, 1));
+
+			_(check_mem_access(env, insn->x_reg, insn->off,
+					   BPF_SIZE(insn->code), BPF_READ,
+					   insn->a_reg));
+
+			/* dest reg state will be updated by mem_access */
+
+		} else if (class == BPF_STX) {
+			/* check src1 operand */
+			_(check_reg_arg(regs, insn->x_reg, 1));
+			/* check src2 operand */
+			_(check_reg_arg(regs, insn->a_reg, 1));
+			_(check_mem_access(env, insn->a_reg, insn->off,
+					   BPF_SIZE(insn->code), BPF_WRITE,
+					   insn->x_reg));
+
+		} else if (class == BPF_ST) {
+			if (BPF_MODE(insn->code) != BPF_REL)
+				return -EINVAL;
+			/* check src operand */
+			_(check_reg_arg(regs, insn->a_reg, 1));
+			_(check_mem_access(env, insn->a_reg, insn->off,
+					   BPF_SIZE(insn->code), BPF_WRITE,
+					   -1));
+
+		} else if (class == BPF_JMP) {
+			u16 opcode = BPF_OP(insn->code);
+			if (opcode == BPF_CALL) {
+				_(check_call(env, insn->imm));
+			} else if (opcode == BPF_JA) {
+				if (BPF_SRC(insn->code) != BPF_X)
+					return -EINVAL;
+				insn_idx += insn->off + 1;
+				continue;
+			} else {
+				_(check_cond_jmp_op(env, insn, insn_idx));
+			}
+
+		} else if (class == BPF_RET) {
+process_ret:
+			insn_idx = pop_stack(env);
+			if (insn_idx < 0)
+				break;
+			else
+				continue;
+		}
+
+		insn_idx++;
+	}
+
+	pr_debug("insn_processed %d\n", insn_processed);
+	return 0;
+
+err_print_insn:
+	pr_info("insn #%d\n", insn_idx);
+	pr_info_bpf_insn(&insns[insn_idx], NULL);
+	return err;
+}
+
+static void free_states(struct verifier_env *env, int insn_cnt)
+{
+	int i;
+
+	for (i = 0; i < insn_cnt; i++) {
+		struct verifier_state_list *sl = env->branch_landing[i];
+		if (sl)
+			while (sl != STATE_END) {
+				struct verifier_state_list *sln = sl->next;
+				kfree(sl);
+				sl = sln;
+			}
+	}
+
+	kfree(env->branch_landing);
+}
+
+int bpf_check(struct bpf_program *prog)
+{
+	int ret;
+	struct verifier_env *env;
+
+	if (prog->insn_cnt <= 0 || prog->insn_cnt > MAX_BPF_INSNS ||
+	    prog->table_cnt < 0 || prog->table_cnt > MAX_BPF_TABLES ||
+	    prog->strtab_size < 0 || prog->strtab_size > MAX_BPF_STRTAB_SIZE ||
+	    prog->strtab[prog->strtab_size - 1] != 0) {
+		pr_err("BPF program has %d insn and %d tables. Max is %d/%d\n",
+		       prog->insn_cnt, prog->table_cnt,
+		       MAX_BPF_INSNS, MAX_BPF_TABLES);
+		return -E2BIG;
+	}
+
+	env = kzalloc(sizeof(struct verifier_env), GFP_KERNEL);
+	if (!env)
+		return -ENOMEM;
+
+	env->prog = prog;
+	env->branch_landing = kzalloc(sizeof(struct verifier_state_list *) *
+				      prog->insn_cnt, GFP_KERNEL);
+
+	if (!env->branch_landing) {
+		kfree(env);
+		return -ENOMEM;
+	}
+
+	ret = check_cfg(env);
+	if (ret)
+		goto free_env;
+	ret = __bpf_check(env);
+free_env:
+	while (pop_stack(env) >= 0);
+	free_states(env, prog->insn_cnt);
+	kfree(env);
+	return ret;
+}
+EXPORT_SYMBOL(bpf_check);
diff --git a/kernel/bpf_jit/bpf_run.c b/kernel/bpf_jit/bpf_run.c
new file mode 100644
index 0000000..380b618
--- /dev/null
+++ b/kernel/bpf_jit/bpf_run.c
@@ -0,0 +1,452 @@
+/* Copyright (c) 2011-2013 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/bpf_jit.h>
+
+static const char *const bpf_class_string[] = {
+	"ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
+};
+
+static const char *const bpf_alu_string[] = {
+	"+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
+	"%=", "^=", "=", "s>>=", "bswap32", "bswap64", "BUG"
+};
+
+static const char *const bpf_ldst_string[] = {
+	"u32", "u16", "u8", "u64"
+};
+
+static const char *const bpf_jmp_string[] = {
+	"jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
+};
+
+static const char *reg_to_str(int regno, u64 *regs)
+{
+	static char reg_value[16][32];
+	if (!regs)
+		return "";
+	snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
+		 regs[regno]);
+	return reg_value[regno];
+}
+
+#define R(regno) reg_to_str(regno, regs)
+
+void pr_info_bpf_insn(struct bpf_insn *insn, u64 *regs)
+{
+	u16 class = BPF_CLASS(insn->code);
+	if (class == BPF_ALU) {
+		if (BPF_SRC(insn->code) == BPF_X)
+			pr_info("code_%02x r%d%s %s r%d%s\n",
+				insn->code, insn->a_reg, R(insn->a_reg),
+				bpf_alu_string[BPF_OP(insn->code) >> 4],
+				insn->x_reg, R(insn->x_reg));
+		else
+			pr_info("code_%02x r%d%s %s %d\n",
+				insn->code, insn->a_reg, R(insn->a_reg),
+				bpf_alu_string[BPF_OP(insn->code) >> 4],
+				insn->imm);
+	} else if (class == BPF_STX) {
+		if (BPF_MODE(insn->code) == BPF_REL)
+			pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
+				insn->code,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->a_reg, R(insn->a_reg),
+				insn->off, insn->x_reg, R(insn->x_reg));
+		else if (BPF_MODE(insn->code) == BPF_XADD)
+			pr_info("code_%02x lock *(%s *)(r%d%s %+d) += r%d%s\n",
+				insn->code,
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->a_reg, R(insn->a_reg), insn->off,
+				insn->x_reg, R(insn->x_reg));
+		else
+			pr_info("BUG_%02x\n", insn->code);
+	} else if (class == BPF_ST) {
+		if (BPF_MODE(insn->code) != BPF_REL) {
+			pr_info("BUG_st_%02x\n", insn->code);
+			return;
+		}
+		pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
+			insn->code,
+			bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+			insn->a_reg, R(insn->a_reg),
+			insn->off, insn->imm);
+	} else if (class == BPF_LDX) {
+		if (BPF_MODE(insn->code) != BPF_REL) {
+			pr_info("BUG_ldx_%02x\n", insn->code);
+			return;
+		}
+		pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
+			insn->code, insn->a_reg,
+			bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+			insn->x_reg, R(insn->x_reg), insn->off);
+	} else if (class == BPF_JMP) {
+		u16 opcode = BPF_OP(insn->code);
+		if (opcode == BPF_CALL) {
+			pr_info("code_%02x call %d\n", insn->code, insn->imm);
+		} else if (insn->code == (BPF_JMP | BPF_JA | BPF_X)) {
+			pr_info("code_%02x goto pc%+d\n",
+				insn->code, insn->off);
+		} else if (BPF_SRC(insn->code) == BPF_X) {
+			pr_info("code_%02x if r%d%s %s r%d%s goto pc%+d\n",
+				insn->code, insn->a_reg, R(insn->a_reg),
+				bpf_jmp_string[BPF_OP(insn->code) >> 4],
+				insn->x_reg, R(insn->x_reg), insn->off);
+		} else {
+			pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
+				insn->code, insn->a_reg, R(insn->a_reg),
+				bpf_jmp_string[BPF_OP(insn->code) >> 4],
+				insn->imm, insn->off);
+		}
+	} else {
+		pr_info("code_%02x %s\n", insn->code, bpf_class_string[class]);
+	}
+}
+
+void bpf_run(struct bpf_program *prog, struct bpf_context *ctx)
+{
+	struct bpf_insn *insn = prog->insns;
+	u64 stack[64];
+	u64 regs[16] = { };
+	regs[__fp__] = (u64)(ulong)&stack[64];
+	regs[R1] = (u64)(ulong)ctx;
+
+	for (;; insn++) {
+		const s32 K = insn->imm;
+		u64 tmp;
+		u64 *a_reg = &regs[insn->a_reg];
+		u64 *x_reg = &regs[insn->x_reg];
+#define A (*a_reg)
+#define X (*x_reg)
+		/*pr_info_bpf_insn(insn, regs);*/
+		switch (insn->code) {
+			/* ALU */
+		case BPF_ALU | BPF_ADD | BPF_X:
+			A += X;
+			continue;
+		case BPF_ALU | BPF_ADD | BPF_K:
+			A += K;
+			continue;
+		case BPF_ALU | BPF_SUB | BPF_X:
+			A -= X;
+			continue;
+		case BPF_ALU | BPF_SUB | BPF_K:
+			A -= K;
+			continue;
+		case BPF_ALU | BPF_AND | BPF_X:
+			A &= X;
+			continue;
+		case BPF_ALU | BPF_AND | BPF_K:
+			A &= K;
+			continue;
+		case BPF_ALU | BPF_OR | BPF_X:
+			A |= X;
+			continue;
+		case BPF_ALU | BPF_OR | BPF_K:
+			A |= K;
+			continue;
+		case BPF_ALU | BPF_LSH | BPF_X:
+			A <<= X;
+			continue;
+		case BPF_ALU | BPF_LSH | BPF_K:
+			A <<= K;
+			continue;
+		case BPF_ALU | BPF_RSH | BPF_X:
+			A >>= X;
+			continue;
+		case BPF_ALU | BPF_RSH | BPF_K:
+			A >>= K;
+			continue;
+		case BPF_ALU | BPF_MOV | BPF_X:
+			A = X;
+			continue;
+		case BPF_ALU | BPF_MOV | BPF_K:
+			A = K;
+			continue;
+		case BPF_ALU | BPF_ARSH | BPF_X:
+			(*(s64 *) &A) >>= X;
+			continue;
+		case BPF_ALU | BPF_ARSH | BPF_K:
+			(*(s64 *) &A) >>= K;
+			continue;
+		case BPF_ALU | BPF_BSWAP32 | BPF_X:
+			A = __builtin_bswap32(A);
+			continue;
+		case BPF_ALU | BPF_BSWAP64 | BPF_X:
+			A = __builtin_bswap64(A);
+			continue;
+		case BPF_ALU | BPF_MOD | BPF_X:
+			tmp = A;
+			if (X)
+				A = do_div(tmp, X);
+			continue;
+		case BPF_ALU | BPF_MOD | BPF_K:
+			tmp = A;
+			if (K)
+				A = do_div(tmp, K);
+			continue;
+
+			/* CALL */
+		case BPF_JMP | BPF_CALL:
+			prog->cb->execute_func(prog->strtab, K, regs);
+			continue;
+
+			/* JMP */
+		case BPF_JMP | BPF_JA | BPF_X:
+			insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JEQ | BPF_X:
+			if (A == X)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JEQ | BPF_K:
+			if (A == K)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JNE | BPF_X:
+			if (A != X)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JNE | BPF_K:
+			if (A != K)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JGT | BPF_X:
+			if (A > X)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JGT | BPF_K:
+			if (A > K)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JGE | BPF_X:
+			if (A >= X)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JGE | BPF_K:
+			if (A >= K)
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JSGT | BPF_X:
+			if (((s64)A) > ((s64)X))
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JSGT | BPF_K:
+			if (((s64)A) > ((s64)K))
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JSGE | BPF_X:
+			if (((s64)A) >= ((s64)X))
+				insn += insn->off;
+			continue;
+		case BPF_JMP | BPF_JSGE | BPF_K:
+			if (((s64)A) >= ((s64)K))
+				insn += insn->off;
+			continue;
+
+			/* STX */
+		case BPF_STX | BPF_REL | BPF_B:
+			*(u8 *)(ulong)(A + insn->off) = X;
+			continue;
+		case BPF_STX | BPF_REL | BPF_H:
+			*(u16 *)(ulong)(A + insn->off) = X;
+			continue;
+		case BPF_STX | BPF_REL | BPF_W:
+			*(u32 *)(ulong)(A + insn->off) = X;
+			continue;
+		case BPF_STX | BPF_REL | BPF_DW:
+			*(u64 *)(ulong)(A + insn->off) = X;
+			continue;
+
+			/* ST */
+		case BPF_ST | BPF_REL | BPF_B:
+			*(u8 *)(ulong)(A + insn->off) = K;
+			continue;
+		case BPF_ST | BPF_REL | BPF_H:
+			*(u16 *)(ulong)(A + insn->off) = K;
+			continue;
+		case BPF_ST | BPF_REL | BPF_W:
+			*(u32 *)(ulong)(A + insn->off) = K;
+			continue;
+		case BPF_ST | BPF_REL | BPF_DW:
+			*(u64 *)(ulong)(A + insn->off) = K;
+			continue;
+
+			/* LDX */
+		case BPF_LDX | BPF_REL | BPF_B:
+			A = *(u8 *)(ulong)(X + insn->off);
+			continue;
+		case BPF_LDX | BPF_REL | BPF_H:
+			A = *(u16 *)(ulong)(X + insn->off);
+			continue;
+		case BPF_LDX | BPF_REL | BPF_W:
+			A = *(u32 *)(ulong)(X + insn->off);
+			continue;
+		case BPF_LDX | BPF_REL | BPF_DW:
+			A = *(u64 *)(ulong)(X + insn->off);
+			continue;
+
+			/* STX XADD */
+		case BPF_STX | BPF_XADD | BPF_B:
+			__sync_fetch_and_add((u8 *)(ulong)(A + insn->off),
+					     (u8)X);
+			continue;
+		case BPF_STX | BPF_XADD | BPF_H:
+			__sync_fetch_and_add((u16 *)(ulong)(A + insn->off),
+					     (u16)X);
+			continue;
+		case BPF_STX | BPF_XADD | BPF_W:
+			__sync_fetch_and_add((u32 *)(ulong)(A + insn->off),
+					     (u32)X);
+			continue;
+		case BPF_STX | BPF_XADD | BPF_DW:
+			__sync_fetch_and_add((u64 *)(ulong)(A + insn->off),
+					     (u64)X);
+			continue;
+
+			/* RET */
+		case BPF_RET | BPF_K:
+			return;
+		default:
+			/*
+			 * bpf_check() will guarantee that
+			 * we never reach here
+			 */
+			pr_err("unknown opcode %02x\n", insn->code);
+			return;
+		}
+	}
+}
+EXPORT_SYMBOL(bpf_run);
+
+/*
+ * BPF image format:
+ * 4 bytes "bpf\0"
+ * 4 bytes - size of insn section in bytes
+ * 4 bytes - size of table definition section in bytes
+ * 4 bytes - size of strtab section in bytes
+ * bpf insns: one or more of 'struct bpf_insn'
+ * hash table definitions: zero or more of 'struct bpf_table'
+ * string table: zero separated ascii strings
+ */
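+/*
+ * Illustrative layout (example only, not from the loader below): a minimal
+ * image with two insns, no tables and a one-byte string table is
+ * 16 + 16 + 0 + 1 = 33 bytes:
+ *   offset  0: "bpf\0"
+ *   offset  4: 16  (insn section: 2 * sizeof(struct bpf_insn))
+ *   offset  8: 0   (no table definitions)
+ *   offset 12: 1   (strtab holds a single '\0')
+ *   offset 16: two struct bpf_insn
+ *   offset 32: '\0'
+ */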
+#define BPF_HEADER_SIZE 16
+int bpf_load_image(const char *image, int image_len, struct bpf_callbacks *cb,
+		   struct bpf_program **p_prog)
+{
+	struct bpf_program *prog;
+	int insn_size, htab_size, strtab_size;
+	int ret;
+
+	BUILD_BUG_ON(sizeof(struct bpf_insn) != 8);
+
+	if (!image || !cb || !cb->execute_func || !cb->get_func_proto ||
+	    !cb->get_context_access)
+		return -EINVAL;
+
+	if (image_len < BPF_HEADER_SIZE + sizeof(struct bpf_insn) ||
+	    memcmp(image, "bpf", 4) != 0) {
+		pr_err("invalid bpf image, size=%d\n", image_len);
+		return -EINVAL;
+	}
+
+	memcpy(&insn_size, image + 4, 4);
+	memcpy(&htab_size, image + 8, 4);
+	memcpy(&strtab_size, image + 12, 4);
+
+	if (insn_size % sizeof(struct bpf_insn) ||
+	    htab_size % sizeof(struct bpf_table) ||
+	    insn_size <= 0 ||
+	    insn_size / sizeof(struct bpf_insn) > MAX_BPF_INSNS ||
+	    htab_size < 0 ||
+	    htab_size / sizeof(struct bpf_table) > MAX_BPF_TABLES ||
+	    strtab_size < 0 ||
+	    strtab_size > MAX_BPF_STRTAB_SIZE ||
+	    insn_size + htab_size + strtab_size + BPF_HEADER_SIZE != image_len) {
+		pr_err("BPF program insn_size %d htab_size %d strtab_size %d\n",
+		       insn_size, htab_size, strtab_size);
+		return -E2BIG;
+	}
+
+	prog = kzalloc(sizeof(struct bpf_program), GFP_KERNEL);
+	if (!prog)
+		return -ENOMEM;
+
+	prog->insn_cnt = insn_size / sizeof(struct bpf_insn);
+	prog->cb = cb;
+
+	prog->insns = kmalloc(insn_size, GFP_KERNEL);
+	if (!prog->insns) {
+		ret = -ENOMEM;
+		goto free_prog;
+	}
+
+	memcpy(prog->insns, image + BPF_HEADER_SIZE, insn_size);
+
+	if (htab_size) {
+		prog->table_cnt = htab_size / sizeof(struct bpf_table);
+		prog->tables = kmalloc(htab_size, GFP_KERNEL);
+		if (!prog->tables) {
+			ret = -ENOMEM;
+			goto free_insns;
+		}
+		memcpy(prog->tables,
+		       image + BPF_HEADER_SIZE + insn_size,
+		       htab_size);
+	}
+
+	if (strtab_size) {
+		prog->strtab_size = strtab_size;
+		prog->strtab = kmalloc(strtab_size, GFP_KERNEL);
+		if (!prog->strtab) {
+			ret = -ENOMEM;
+			goto free_tables;
+		}
+		memcpy(prog->strtab,
+		       image + BPF_HEADER_SIZE + insn_size + htab_size,
+		       strtab_size);
+	}
+
+	/* verify BPF program */
+	ret = bpf_check(prog);
+	if (ret)
+		goto free_strtab;
+
+	/* compile it (map BPF insns to native hw insns) */
+	bpf_compile(prog);
+
+	*p_prog = prog;
+
+	return 0;
+
+free_strtab:
+	kfree(prog->strtab);
+free_tables:
+	kfree(prog->tables);
+free_insns:
+	kfree(prog->insns);
+free_prog:
+	kfree(prog);
+	return ret;
+}
+EXPORT_SYMBOL(bpf_load_image);
+
+void bpf_free(struct bpf_program *prog)
+{
+	if (!prog)
+		return;
+	__bpf_free(prog);
+}
+EXPORT_SYMBOL(bpf_free);
+
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 6982094..7b50774 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1591,3 +1591,18 @@ source "samples/Kconfig"
 
 source "lib/Kconfig.kgdb"
 
+# Used by archs to tell that they support 64-bit BPF JIT
+config HAVE_BPF64_JIT
+	bool
+
+config BPF64
+	bool "Enable 64-bit BPF instruction set support"
+	help
+	  Enable this option to support 64-bit BPF programs
+
+config BPF64_JIT
+	bool "Enable 64-bit BPF JIT compiler"
+	depends on BPF64 && HAVE_BPF64_JIT
+	help
+	  Enable Just-In-Time compiler for 64-bit BPF programs
+
-- 
1.7.9.5



* [RFC PATCH tip 2/5] Extended BPF JIT for x86-64
  2013-12-03  4:28 [RFC PATCH tip 0/5] tracing filters with BPF Alexei Starovoitov
  2013-12-03  4:28 ` [RFC PATCH tip 1/5] Extended BPF core framework Alexei Starovoitov
@ 2013-12-03  4:28 ` Alexei Starovoitov
  2013-12-03  4:28 ` [RFC PATCH tip 3/5] Extended BPF (64-bit BPF) design document Alexei Starovoitov
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-03  4:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel

Just-In-Time compiler that maps 64-bit BPF instructions to x86-64 instructions.

Most BPF instructions have a one-to-one mapping to x86-64 instructions.

Every BPF register maps to one x86-64 register:
R0 -> rax
R1 -> rdi
R2 -> rsi
R3 -> rdx
R4 -> rcx
R5 -> r8
R6 -> rbx
R7 -> r13
R8 -> r14
R9 -> r15
FP -> rbp

BPF calling convention is defined as:
R0 - return value from in-kernel function
R1-R5 - arguments from BPF program to in-kernel function
R6-R9 - callee saved registers that in-kernel function will preserve
R10 - read-only frame pointer to access stack
so BPF calling convention maps directly to x86-64 calling convention.

This allows zero-overhead calls between the BPF filter and safe kernel functions.

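For illustration (a sketch, not actual JIT output; 'bpf_helper' is a made-up
helper name), a call needs no argument shuffling because the BPF argument
registers already line up with the x86-64 ABI:

  BPF:                      JITed x86-64:
  R1 = R6                   mov rdi, rbx      ; 1st argument
  call bpf_helper           call bpf_helper   ; rdi/rsi/... already per ABI
  R6 = R0                   mov rbx, rax      ; save return value
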
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 arch/x86/Kconfig              |    1 +
 arch/x86/net/Makefile         |    1 +
 arch/x86/net/bpf64_jit_comp.c |  625 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/net/bpf_jit_comp.c   |   23 +-
 arch/x86/net/bpf_jit_comp.h   |   35 +++
 5 files changed, 665 insertions(+), 20 deletions(-)
 create mode 100644 arch/x86/net/bpf64_jit_comp.c
 create mode 100644 arch/x86/net/bpf_jit_comp.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c84cf90..44b0b11 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -92,6 +92,7 @@ config X86
 	select GENERIC_CLOCKEVENTS_MIN_ADJUST
 	select IRQ_FORCED_THREADING
 	select HAVE_BPF_JIT if X86_64
+	select HAVE_BPF64_JIT if X86_64
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select CLKEVT_I8253
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/x86/net/Makefile b/arch/x86/net/Makefile
index 90568c3..c3bb7d5 100644
--- a/arch/x86/net/Makefile
+++ b/arch/x86/net/Makefile
@@ -2,3 +2,4 @@
 # Arch-specific network modules
 #
 obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o
+obj-$(CONFIG_BPF64_JIT) += bpf64_jit_comp.o
diff --git a/arch/x86/net/bpf64_jit_comp.c b/arch/x86/net/bpf64_jit_comp.c
new file mode 100644
index 0000000..5f7c331
--- /dev/null
+++ b/arch/x86/net/bpf64_jit_comp.c
@@ -0,0 +1,625 @@
+/*
+ * Copyright (c) 2011-2013 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf_jit.h>
+#include <linux/moduleloader.h>
+#include "bpf_jit_comp.h"
+
+static inline u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
+{
+	if (len == 1)
+		*ptr = bytes;
+	else if (len == 2)
+		*(u16 *)ptr = bytes;
+	else
+		*(u32 *)ptr = bytes;
+	return ptr + len;
+}
+
+#define EMIT(bytes, len) (prog = emit_code(prog, (bytes), (len)))
+
+#define EMIT1(b1)		EMIT(b1, 1)
+#define EMIT2(b1, b2)		EMIT((b1) + ((b2) << 8), 2)
+#define EMIT3(b1, b2, b3)	EMIT((b1) + ((b2) << 8) + ((b3) << 16), 3)
+#define EMIT4(b1, b2, b3, b4)	EMIT((b1) + ((b2) << 8) + ((b3) << 16) + \
+				     ((b4) << 24), 4)
+/* imm32 is sign extended by cpu */
+#define EMIT1_off32(b1, off) \
+	do {EMIT1(b1); EMIT(off, 4); } while (0)
+#define EMIT2_off32(b1, b2, off) \
+	do {EMIT2(b1, b2); EMIT(off, 4); } while (0)
+#define EMIT3_off32(b1, b2, b3, off) \
+	do {EMIT3(b1, b2, b3); EMIT(off, 4); } while (0)
+#define EMIT4_off32(b1, b2, b3, b4, off) \
+	do {EMIT4(b1, b2, b3, b4); EMIT(off, 4); } while (0)
+
+/* mov A, X */
+#define EMIT_mov(A, X) \
+	EMIT3(add_2mod(0x48, A, X), 0x89, add_2reg(0xC0, A, X))
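+/* e.g. (illustrative): EMIT_mov(R6, R0) emits 'mov rbx, rax' = 0x48, 0x89, 0xC3 */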
+
+#define X86_JAE 0x73
+#define X86_JE  0x74
+#define X86_JNE 0x75
+#define X86_JA  0x77
+#define X86_JGE 0x7D
+#define X86_JG  0x7F
+
+static inline bool is_imm8(__s32 value)
+{
+	return value <= 127 && value >= -128;
+}
+
+static inline bool is_simm32(__s64 value)
+{
+	return value == (__s64)(__s32)value;
+}
+
+static int bpf_size_to_x86_bytes(int bpf_size)
+{
+	if (bpf_size == BPF_W)
+		return 4;
+	else if (bpf_size == BPF_H)
+		return 2;
+	else if (bpf_size == BPF_B)
+		return 1;
+	else if (bpf_size == BPF_DW)
+		return 4; /* imm32 */
+	else
+		return 0;
+}
+
+#define AUX_REG 32
+
+/* avoid x86-64 R12 which if used as base address in memory access
+ * always needs an extra byte for index */
+static const int reg2hex[] = {
+	[R0] = 0, /* rax */
+	[R1] = 7, /* rdi */
+	[R2] = 6, /* rsi */
+	[R3] = 2, /* rdx */
+	[R4] = 1, /* rcx */
+	[R5] = 0, /* r8 */
+	[R6] = 3, /* rbx callee saved */
+	[R7] = 5, /* r13 callee saved */
+	[R8] = 6, /* r14 callee saved */
+	[R9] = 7, /* r15 callee saved */
+	[__fp__] = 5, /* rbp readonly */
+	[AUX_REG] = 1, /* r9 temp register */
+};
+
+/* is_ereg() == true if r8 <= reg <= r15,
+ * rax,rcx,...,rbp don't need extra byte of encoding */
+static inline bool is_ereg(u32 reg)
+{
+	if (reg == R5 || (reg >= R7 && reg <= R9) || reg == AUX_REG)
+		return true;
+	else
+		return false;
+}
+
+static inline u8 add_1mod(u8 byte, u32 reg)
+{
+	if (is_ereg(reg))
+		byte |= 1;
+	return byte;
+}
+static inline u8 add_2mod(u8 byte, u32 r1, u32 r2)
+{
+	if (is_ereg(r1))
+		byte |= 1;
+	if (is_ereg(r2))
+		byte |= 4;
+	return byte;
+}
+
+static inline u8 add_1reg(u8 byte, u32 a_reg)
+{
+	return byte + reg2hex[a_reg];
+}
+static inline u8 add_2reg(u8 byte, u32 a_reg, u32 x_reg)
+{
+	return byte + reg2hex[a_reg] + (reg2hex[x_reg] << 3);
+}
+
+static u8 *select_bpf_func(struct bpf_program *prog, int id)
+{
+	if (id <= 0 || id >= prog->strtab_size)
+		return NULL;
+	return prog->cb->jit_select_func(prog->strtab, id);
+}
+
+static int do_jit(struct bpf_program *bpf_prog, int *addrs, u8 *image,
+		  int oldproglen)
+{
+	struct bpf_insn *insn = bpf_prog->insns;
+	int insn_cnt = bpf_prog->insn_cnt;
+	u8 temp[64];
+	int i;
+	int proglen = 0;
+	u8 *prog = temp;
+	int stacksize = 512;
+
+	EMIT1(0x55); /* push rbp */
+	EMIT3(0x48, 0x89, 0xE5); /* mov rbp,rsp */
+
+	/* sub rsp, stacksize */
+	EMIT3_off32(0x48, 0x81, 0xEC, stacksize);
+	/* mov qword ptr [rbp-X],rbx */
+	EMIT3_off32(0x48, 0x89, 0x9D, -stacksize);
+	/* mov qword ptr [rbp-X],r13 */
+	EMIT3_off32(0x4C, 0x89, 0xAD, -stacksize + 8);
+	/* mov qword ptr [rbp-X],r14 */
+	EMIT3_off32(0x4C, 0x89, 0xB5, -stacksize + 16);
+	/* mov qword ptr [rbp-X],r15 */
+	EMIT3_off32(0x4C, 0x89, 0xBD, -stacksize + 24);
+
+	for (i = 0; i < insn_cnt; i++, insn++) {
+		const __s32 K = insn->imm;
+		__u32 a_reg = insn->a_reg;
+		__u32 x_reg = insn->x_reg;
+		u8 b1 = 0, b2 = 0, b3 = 0;
+		u8 jmp_cond;
+		__s64 jmp_offset;
+		int ilen;
+		u8 *func;
+
+		switch (insn->code) {
+			/* ALU */
+		case BPF_ALU | BPF_ADD | BPF_X:
+		case BPF_ALU | BPF_SUB | BPF_X:
+		case BPF_ALU | BPF_AND | BPF_X:
+		case BPF_ALU | BPF_OR | BPF_X:
+		case BPF_ALU | BPF_XOR | BPF_X:
+			b1 = 0x48;
+			b3 = 0xC0;
+			switch (BPF_OP(insn->code)) {
+			case BPF_ADD: b2 = 0x01; break;
+			case BPF_SUB: b2 = 0x29; break;
+			case BPF_AND: b2 = 0x21; break;
+			case BPF_OR: b2 = 0x09; break;
+			case BPF_XOR: b2 = 0x31; break;
+			}
+			EMIT3(add_2mod(b1, a_reg, x_reg), b2,
+			      add_2reg(b3, a_reg, x_reg));
+			break;
+
+			/* mov A, X */
+		case BPF_ALU | BPF_MOV | BPF_X:
+			EMIT_mov(a_reg, x_reg);
+			break;
+
+			/* neg A */
+		case BPF_ALU | BPF_NEG | BPF_X:
+			EMIT3(add_1mod(0x48, a_reg), 0xF7,
+			      add_1reg(0xD8, a_reg));
+			break;
+
+		case BPF_ALU | BPF_ADD | BPF_K:
+		case BPF_ALU | BPF_SUB | BPF_K:
+		case BPF_ALU | BPF_AND | BPF_K:
+		case BPF_ALU | BPF_OR | BPF_K:
+			b1 = add_1mod(0x48, a_reg);
+
+			switch (BPF_OP(insn->code)) {
+			case BPF_ADD: b3 = 0xC0; break;
+			case BPF_SUB: b3 = 0xE8; break;
+			case BPF_AND: b3 = 0xE0; break;
+			case BPF_OR: b3 = 0xC8; break;
+			}
+
+			if (is_imm8(K))
+				EMIT4(b1, 0x83, add_1reg(b3, a_reg), K);
+			else
+				EMIT3_off32(b1, 0x81, add_1reg(b3, a_reg), K);
+			break;
+
+		case BPF_ALU | BPF_MOV | BPF_K:
+			/* 'mov rax, imm32' sign extends imm32.
+			 * possible optimization: if imm32 is positive,
+			 * use 'mov eax, imm32' (which zero-extends imm32)
+			 * to save 2 bytes */
+			b1 = add_1mod(0x48, a_reg);
+			b2 = 0xC7;
+			b3 = 0xC0;
+			EMIT3_off32(b1, b2, add_1reg(b3, a_reg), K);
+			break;
+
+			/* A %= X
+			 * A /= X */
+		case BPF_ALU | BPF_MOD | BPF_X:
+		case BPF_ALU | BPF_DIV | BPF_X:
+			EMIT1(0x50); /* push rax */
+			EMIT1(0x52); /* push rdx */
+
+			/* mov r9, X */
+			EMIT_mov(AUX_REG, x_reg);
+
+			/* mov rax, A */
+			EMIT_mov(R0, a_reg);
+
+			/* xor rdx, rdx */
+			EMIT3(0x48, 0x31, 0xd2);
+
+			/* if X==0, skip divide: A=0 for MOD (rdx is 0),
+			 * A is left unchanged for DIV
+			 */
+
+			/* cmp r9, 0 */
+			EMIT4(0x49, 0x83, 0xF9, 0x00);
+
+			/* je .+3 */
+			EMIT2(X86_JE, 3);
+
+			/* div r9 */
+			EMIT3(0x49, 0xF7, 0xF1);
+
+			if (BPF_OP(insn->code) == BPF_MOD) {
+				/* mov r9, rdx */
+				EMIT3(0x49, 0x89, 0xD1);
+			} else {
+				/* mov r9, rax */
+				EMIT3(0x49, 0x89, 0xC1);
+			}
+
+			EMIT1(0x5A); /* pop rdx */
+			EMIT1(0x58); /* pop rax */
+
+			/* mov A, r9 */
+			EMIT_mov(a_reg, AUX_REG);
+			break;
+
+			/* shifts */
+		case BPF_ALU | BPF_LSH | BPF_K:
+		case BPF_ALU | BPF_RSH | BPF_K:
+		case BPF_ALU | BPF_ARSH | BPF_K:
+			b1 = add_1mod(0x48, a_reg);
+			switch (BPF_OP(insn->code)) {
+			case BPF_LSH: b3 = 0xE0; break;
+			case BPF_RSH: b3 = 0xE8; break;
+			case BPF_ARSH: b3 = 0xF8; break;
+			}
+			EMIT4(b1, 0xC1, add_1reg(b3, a_reg), K);
+			break;
+
+		case BPF_ALU | BPF_BSWAP32 | BPF_X:
+			/* emit 'bswap eax' to swap lower 4-bytes */
+			if (is_ereg(a_reg))
+				EMIT2(0x41, 0x0F);
+			else
+				EMIT1(0x0F);
+			EMIT1(add_1reg(0xC8, a_reg));
+			break;
+
+		case BPF_ALU | BPF_BSWAP64 | BPF_X:
+			/* emit 'bswap rax' to swap 8-bytes */
+			EMIT3(add_1mod(0x48, a_reg), 0x0F,
+			      add_1reg(0xC8, a_reg));
+			break;
+
+			/* ST: *(u8*)(a_reg + off) = imm */
+		case BPF_ST | BPF_REL | BPF_B:
+			if (is_ereg(a_reg))
+				EMIT2(0x41, 0xC6);
+			else
+				EMIT1(0xC6);
+			goto st;
+		case BPF_ST | BPF_REL | BPF_H:
+			if (is_ereg(a_reg))
+				EMIT3(0x66, 0x41, 0xC7);
+			else
+				EMIT2(0x66, 0xC7);
+			goto st;
+		case BPF_ST | BPF_REL | BPF_W:
+			if (is_ereg(a_reg))
+				EMIT2(0x41, 0xC7);
+			else
+				EMIT1(0xC7);
+			goto st;
+		case BPF_ST | BPF_REL | BPF_DW:
+			EMIT2(add_1mod(0x48, a_reg), 0xC7);
+
+st:			if (is_imm8(insn->off))
+				EMIT2(add_1reg(0x40, a_reg), insn->off);
+			else
+				EMIT1_off32(add_1reg(0x80, a_reg), insn->off);
+
+			EMIT(K, bpf_size_to_x86_bytes(BPF_SIZE(insn->code)));
+			break;
+
+			/* STX: *(u8*)(a_reg + off) = x_reg */
+		case BPF_STX | BPF_REL | BPF_B:
+			/* emit 'mov byte ptr [rax + off], al' */
+			if (is_ereg(a_reg) || is_ereg(x_reg) ||
+			    /* have to add extra byte for x86 SIL, DIL regs */
+			    x_reg == R1 || x_reg == R2)
+				EMIT2(add_2mod(0x40, a_reg, x_reg), 0x88);
+			else
+				EMIT1(0x88);
+			goto stx;
+		case BPF_STX | BPF_REL | BPF_H:
+			if (is_ereg(a_reg) || is_ereg(x_reg))
+				EMIT3(0x66, add_2mod(0x40, a_reg, x_reg), 0x89);
+			else
+				EMIT2(0x66, 0x89);
+			goto stx;
+		case BPF_STX | BPF_REL | BPF_W:
+			if (is_ereg(a_reg) || is_ereg(x_reg))
+				EMIT2(add_2mod(0x40, a_reg, x_reg), 0x89);
+			else
+				EMIT1(0x89);
+			goto stx;
+		case BPF_STX | BPF_REL | BPF_DW:
+			EMIT2(add_2mod(0x48, a_reg, x_reg), 0x89);
+stx:			if (is_imm8(insn->off))
+				EMIT2(add_2reg(0x40, a_reg, x_reg), insn->off);
+			else
+				EMIT1_off32(add_2reg(0x80, a_reg, x_reg),
+					    insn->off);
+			break;
+
+			/* LDX: a_reg = *(u8*)(x_reg + off) */
+		case BPF_LDX | BPF_REL | BPF_B:
+			/* emit 'movzx rax, byte ptr [rax + off]' */
+			EMIT3(add_2mod(0x48, x_reg, a_reg), 0x0F, 0xB6);
+			goto ldx;
+		case BPF_LDX | BPF_REL | BPF_H:
+			/* emit 'movzx rax, word ptr [rax + off]' */
+			EMIT3(add_2mod(0x48, x_reg, a_reg), 0x0F, 0xB7);
+			goto ldx;
+		case BPF_LDX | BPF_REL | BPF_W:
+			/* emit 'mov eax, dword ptr [rax+0x14]' */
+			if (is_ereg(a_reg) || is_ereg(x_reg))
+				EMIT2(add_2mod(0x40, x_reg, a_reg), 0x8B);
+			else
+				EMIT1(0x8B);
+			goto ldx;
+		case BPF_LDX | BPF_REL | BPF_DW:
+			/* emit 'mov rax, qword ptr [rax+0x14]' */
+			EMIT2(add_2mod(0x48, x_reg, a_reg), 0x8B);
+ldx:			/* if insn->off == 0 we can save one extra byte, but
+			 * special case of x86 R13 which always needs an offset
+			 * is not worth the pain */
+			if (is_imm8(insn->off))
+				EMIT2(add_2reg(0x40, x_reg, a_reg), insn->off);
+			else
+				EMIT1_off32(add_2reg(0x80, x_reg, a_reg),
+					    insn->off);
+			break;
+
+			/* STX XADD: lock *(u8*)(a_reg + off) += x_reg */
+		case BPF_STX | BPF_XADD | BPF_B:
+			/* emit 'lock add byte ptr [rax + off], al' */
+			if (is_ereg(a_reg) || is_ereg(x_reg) ||
+			    /* have to add extra byte for x86 SIL, DIL regs */
+			    x_reg == R1 || x_reg == R2)
+				EMIT3(0xF0, add_2mod(0x40, a_reg, x_reg), 0x00);
+			else
+				EMIT2(0xF0, 0x00);
+			goto xadd;
+		case BPF_STX | BPF_XADD | BPF_H:
+			if (is_ereg(a_reg) || is_ereg(x_reg))
+				EMIT4(0x66, 0xF0, add_2mod(0x40, a_reg, x_reg),
+				      0x01);
+			else
+				EMIT3(0x66, 0xF0, 0x01);
+			goto xadd;
+		case BPF_STX | BPF_XADD | BPF_W:
+			if (is_ereg(a_reg) || is_ereg(x_reg))
+				EMIT3(0xF0, add_2mod(0x40, a_reg, x_reg), 0x01);
+			else
+				EMIT2(0xF0, 0x01);
+			goto xadd;
+		case BPF_STX | BPF_XADD | BPF_DW:
+			EMIT3(0xF0, add_2mod(0x48, a_reg, x_reg), 0x01);
+xadd:			if (is_imm8(insn->off))
+				EMIT2(add_2reg(0x40, a_reg, x_reg), insn->off);
+			else
+				EMIT1_off32(add_2reg(0x80, a_reg, x_reg),
+					    insn->off);
+			break;
+
+			/* call */
+		case BPF_JMP | BPF_CALL:
+			func = select_bpf_func(bpf_prog, K);
+			jmp_offset = func - (image + addrs[i]);
+			if (!func || !is_simm32(jmp_offset)) {
+				pr_err("unsupported bpf func %d addr %p image %p\n",
+				       K, func, image);
+				return -EINVAL;
+			}
+			EMIT1_off32(0xE8, jmp_offset);
+			break;
+
+			/* cond jump */
+		case BPF_JMP | BPF_JEQ | BPF_X:
+		case BPF_JMP | BPF_JNE | BPF_X:
+		case BPF_JMP | BPF_JGT | BPF_X:
+		case BPF_JMP | BPF_JGE | BPF_X:
+		case BPF_JMP | BPF_JSGT | BPF_X:
+		case BPF_JMP | BPF_JSGE | BPF_X:
+			/* emit 'cmp a_reg, x_reg' insn */
+			b1 = 0x48;
+			b2 = 0x39;
+			b3 = 0xC0;
+			EMIT3(add_2mod(b1, a_reg, x_reg), b2,
+			      add_2reg(b3, a_reg, x_reg));
+			goto emit_jump;
+		case BPF_JMP | BPF_JEQ | BPF_K:
+		case BPF_JMP | BPF_JNE | BPF_K:
+		case BPF_JMP | BPF_JGT | BPF_K:
+		case BPF_JMP | BPF_JGE | BPF_K:
+		case BPF_JMP | BPF_JSGT | BPF_K:
+		case BPF_JMP | BPF_JSGE | BPF_K:
+			/* emit 'cmp a_reg, imm8/32' */
+			EMIT1(add_1mod(0x48, a_reg));
+
+			if (is_imm8(K))
+				EMIT3(0x83, add_1reg(0xF8, a_reg), K);
+			else
+				EMIT2_off32(0x81, add_1reg(0xF8, a_reg), K);
+
+emit_jump:		/* convert BPF opcode to x86 */
+			switch (BPF_OP(insn->code)) {
+			case BPF_JEQ:
+				jmp_cond = X86_JE;
+				break;
+			case BPF_JNE:
+				jmp_cond = X86_JNE;
+				break;
+			case BPF_JGT:
+				/* GT is unsigned '>', JA in x86 */
+				jmp_cond = X86_JA;
+				break;
+			case BPF_JGE:
+				/* GE is unsigned '>=', JAE in x86 */
+				jmp_cond = X86_JAE;
+				break;
+			case BPF_JSGT:
+				/* signed '>', GT in x86 */
+				jmp_cond = X86_JG;
+				break;
+			case BPF_JSGE:
+				/* signed '>=', GE in x86 */
+				jmp_cond = X86_JGE;
+				break;
+			default: /* to silence gcc warning */
+				return -EFAULT;
+			}
+			jmp_offset = addrs[i + insn->off] - addrs[i];
+			if (is_imm8(jmp_offset)) {
+				EMIT2(jmp_cond, jmp_offset);
+			} else if (is_simm32(jmp_offset)) {
+				EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset);
+			} else {
+				pr_err("cond_jmp gen bug %llx\n", jmp_offset);
+				return -EFAULT;
+			}
+
+			break;
+
+		case BPF_JMP | BPF_JA | BPF_X:
+			jmp_offset = addrs[i + insn->off] - addrs[i];
+			if (is_imm8(jmp_offset)) {
+				EMIT2(0xEB, jmp_offset);
+			} else if (is_simm32(jmp_offset)) {
+				EMIT1_off32(0xE9, jmp_offset);
+			} else {
+				pr_err("jmp gen bug %llx\n", jmp_offset);
+				return -EFAULT;
+			}
+
+			break;
+
+		case BPF_RET | BPF_K:
+			/* mov rbx, qword ptr [rbp-X] */
+			EMIT3_off32(0x48, 0x8B, 0x9D, -stacksize);
+			/* mov r13, qword ptr [rbp-X] */
+			EMIT3_off32(0x4C, 0x8B, 0xAD, -stacksize + 8);
+			/* mov r14, qword ptr [rbp-X] */
+			EMIT3_off32(0x4C, 0x8B, 0xB5, -stacksize + 16);
+			/* mov r15, qword ptr [rbp-X] */
+			EMIT3_off32(0x4C, 0x8B, 0xBD, -stacksize + 24);
+
+			EMIT1(0xC9); /* leave */
+			EMIT1(0xC3); /* ret */
+			break;
+
+		default:
+			/*pr_debug_bpf_insn(insn, NULL);*/
+			pr_err("bpf_jit: unknown opcode %02x\n", insn->code);
+			return -EINVAL;
+		}
+
+		ilen = prog - temp;
+		if (image) {
+			if (proglen + ilen > oldproglen)
+				return -2;
+			memcpy(image + proglen, temp, ilen);
+		}
+		proglen += ilen;
+		addrs[i] = proglen;
+		prog = temp;
+	}
+	return proglen;
+}
+
+void bpf_compile(struct bpf_program *prog)
+{
+	struct bpf_binary_header *header = NULL;
+	int proglen, oldproglen = 0;
+	int *addrs;
+	u8 *image = NULL;
+	int pass;
+	int i;
+
+	if (!prog || !prog->cb || !prog->cb->jit_select_func)
+		return;
+
+	addrs = kmalloc(prog->insn_cnt * sizeof(*addrs), GFP_KERNEL);
+	if (!addrs)
+		return;
+
+	for (proglen = 0, i = 0; i < prog->insn_cnt; i++) {
+		proglen += 64;
+		addrs[i] = proglen;
+	}
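+	/* JIT in up to 10 passes: addrs[] starts with a pessimistic
+	 * 64-byte-per-insn estimate and shrinks as branch offsets settle;
+	 * once proglen stops changing, allocate the image and let the
+	 * final pass emit into it
+	 */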
+	for (pass = 0; pass < 10; pass++) {
+		proglen = do_jit(prog, addrs, image, oldproglen);
+		if (proglen <= 0) {
+			image = NULL;
+			goto out;
+		}
+		if (image) {
+			if (proglen != oldproglen)
+				pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
+				       proglen, oldproglen);
+			break;
+		}
+		if (proglen == oldproglen) {
+			header = bpf_alloc_binary(proglen, &image);
+			if (!header)
+				goto out;
+		}
+		oldproglen = proglen;
+	}
+
+	if (image) {
+		bpf_flush_icache(header, image + proglen);
+		set_memory_ro((unsigned long)header, header->pages);
+	}
+out:
+	kfree(addrs);
+	prog->jit_image = (void (*)(struct bpf_context *ctx))image;
+	return;
+}
+
+static void bpf_jit_free_deferred(struct work_struct *work)
+{
+	struct bpf_program *prog = container_of(work, struct bpf_program, work);
+	unsigned long addr = (unsigned long)prog->jit_image & PAGE_MASK;
+	struct bpf_binary_header *header = (void *)addr;
+
+	set_memory_rw(addr, header->pages);
+	module_free(NULL, header);
+	free_bpf_program(prog);
+}
+
+void __bpf_free(struct bpf_program *prog)
+{
+	if (prog->jit_image) {
+		INIT_WORK(&prog->work, bpf_jit_free_deferred);
+		schedule_work(&prog->work);
+	} else {
+		free_bpf_program(prog);
+	}
+}
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 26328e8..3c35f8d 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -13,6 +13,7 @@
 #include <linux/filter.h>
 #include <linux/if_vlan.h>
 #include <linux/random.h>
+#include "bpf_jit_comp.h"
 
 /*
  * Conventions :
@@ -112,16 +113,6 @@ do {								\
 #define SEEN_XREG    2 /* ebx is used */
 #define SEEN_MEM     4 /* use mem[] for temporary storage */
 
-static inline void bpf_flush_icache(void *start, void *end)
-{
-	mm_segment_t old_fs = get_fs();
-
-	set_fs(KERNEL_DS);
-	smp_wmb();
-	flush_icache_range((unsigned long)start, (unsigned long)end);
-	set_fs(old_fs);
-}
-
 #define CHOOSE_LOAD_FUNC(K, func) \
 	((int)K < 0 ? ((int)K >= SKF_LL_OFF ? func##_negative_offset : func) : func##_positive_offset)
 
@@ -145,16 +136,8 @@ static int pkt_type_offset(void)
 	return -1;
 }
 
-struct bpf_binary_header {
-	unsigned int	pages;
-	/* Note : for security reasons, bpf code will follow a randomly
-	 * sized amount of int3 instructions
-	 */
-	u8		image[];
-};
-
-static struct bpf_binary_header *bpf_alloc_binary(unsigned int proglen,
-						  u8 **image_ptr)
+struct bpf_binary_header *bpf_alloc_binary(unsigned int proglen,
+					   u8 **image_ptr)
 {
 	unsigned int sz, hole;
 	struct bpf_binary_header *header;
diff --git a/arch/x86/net/bpf_jit_comp.h b/arch/x86/net/bpf_jit_comp.h
new file mode 100644
index 0000000..74ff45d
--- /dev/null
+++ b/arch/x86/net/bpf_jit_comp.h
@@ -0,0 +1,35 @@
+/* bpf_jit_comp.h : BPF filter alloc/free routines
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+#ifndef __BPF_JIT_COMP_H
+#define __BPF_JIT_COMP_H
+
+#include <linux/uaccess.h>
+#include <asm/cacheflush.h>
+
+struct bpf_binary_header {
+	unsigned int	pages;
+	/* Note : for security reasons, bpf code will follow a randomly
+	 * sized amount of int3 instructions
+	 */
+	u8		image[];
+};
+
+static inline void bpf_flush_icache(void *start, void *end)
+{
+	mm_segment_t old_fs = get_fs();
+
+	set_fs(KERNEL_DS);
+	smp_wmb();
+	flush_icache_range((unsigned long)start, (unsigned long)end);
+	set_fs(old_fs);
+}
+
+struct bpf_binary_header *bpf_alloc_binary(unsigned int proglen,
+					   u8 **image_ptr);
+
+#endif
-- 
1.7.9.5



* [RFC PATCH tip 3/5] Extended BPF (64-bit BPF) design document
  2013-12-03  4:28 [RFC PATCH tip 0/5] tracing filters with BPF Alexei Starovoitov
  2013-12-03  4:28 ` [RFC PATCH tip 1/5] Extended BPF core framework Alexei Starovoitov
  2013-12-03  4:28 ` [RFC PATCH tip 2/5] Extended BPF JIT for x86-64 Alexei Starovoitov
@ 2013-12-03  4:28 ` Alexei Starovoitov
  2013-12-03 17:01   ` H. Peter Anvin
  2013-12-03  4:28 ` [RFC PATCH tip 4/5] use BPF in tracing filters Alexei Starovoitov
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-03  4:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 Documentation/bpf_jit.txt |  204 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 204 insertions(+)
 create mode 100644 Documentation/bpf_jit.txt

diff --git a/Documentation/bpf_jit.txt b/Documentation/bpf_jit.txt
new file mode 100644
index 0000000..9c70f42
--- /dev/null
+++ b/Documentation/bpf_jit.txt
@@ -0,0 +1,204 @@
+Subject: extended BPF or 64-bit BPF
+
+Q: What is BPF?
+A: A safe, dynamically loadable 32-bit program that can access skb->data via
+sk_load_byte/half/word calls or seccomp_data. It can be attached to sockets,
+to netfilter xtables and to seccomp. In the case of sockets/xtables the input
+is an skb; in the case of seccomp the input is struct seccomp_data.
+
+Q: What is extended BPF?
+A: A safe, dynamically loadable 64-bit program that can call a fixed set
+of kernel functions and takes a generic bpf_context as input.
+A BPF program is the glue between kernel functions and bpf_context.
+Different kernel subsystems can define their own set of available functions
+and alter the BPF machinery for their specific use case.
+
+Example 1:
+when function set is {bpf_load_byte/half/word} and bpf_context=skb
+the extended BPF is equivalent to original BPF (w/o negative offset extensions),
+since any such extended BPF program will only be able to load data from skb
+and interpret it.
+
+Example 2:
+when function set is {empty} and bpf_context=seccomp_data,
+the extended BPF is equivalent to original seccomp BPF with simpler programs
+and can immediately take advantage of extended BPF-JIT.
+(original BPF-JIT doesn't work for seccomp)
+
+Example 3:
+when function set is {bpf_load_xxx + bpf_table_lookup} and bpf_context=skb
+the extended BPF can be used to implement network analytics in tcpdump.
+Like counting all tcp flows through the dev or filtering for a specific
+set of IP addresses.
+
+Example 4:
+when function set is {load_xxx + table_lookup + trace_printk} and
+bpf_context=pt_regs, the extended BPF is used to implement systemtap-like
+tracing filters.
+
+Extended Instruction Set was designed with these goals:
+- write programs in restricted C and compile into BPF with GCC/LLVM
+- just-in-time map to modern 64-bit CPU with minimal performance overhead
+  over two steps: C -> BPF -> native code
+- guarantee termination and safety of BPF program in kernel
+  with simple algorithm
+
+Writing filters in tcpdump syntax or in the systemtap language is difficult.
+The same filter done in C is easier to understand.
+The GCC/LLVM-bpf backend is optional.
+Extended BPF can be coded with macros from bpf.h just like original BPF.
+
+Minimal performance overhead is achieved by a one-to-one mapping
+between BPF insns and native insns, and a one-to-one mapping between BPF
+registers and native registers on 64-bit CPUs.
+
+Extended BPF allows jumps forward and backward for two reasons:
+to reduce the branch mispredict penalty the compiler moves cold basic blocks
+out of the fall-through path, and to reduce the code duplication that would be
+unavoidable if only forward jumps were available.
+To guarantee termination, a simple non-recursive depth-first-search verifies
+that there are no back-edges (no loops in the program); the program is a DAG
+with root at the first insn, all branches end at the last RET insn and
+all instructions are reachable.
+(Original BPF actually allows unreachable insns, but that's a bug)
+
+Original BPF has two registers (A and X) and a hidden frame pointer.
+Extended BPF has ten registers and a read-only frame pointer.
+Since 64-bit CPUs pass arguments to functions via registers, the number of args
+from a BPF program to an in-kernel function is restricted to 5 and one register
+is used to accept the return value from an in-kernel function.
+x86_64 passes the first 6 arguments in registers.
+aarch64/sparcv9/mips64 have 7-8 registers for arguments.
+x86_64 has 6 callee saved registers.
+aarch64/sparcv9/mips64 have 11 or more callee saved registers.
+
+Therefore extended BPF calling convention is defined as:
+R0 - return value from in-kernel function
+R1-R5 - arguments from BPF program to in-kernel function
+R6-R9 - callee saved registers that in-kernel function will preserve
+R10 - read-only frame pointer to access stack
+
+so that all BPF registers map one to one to HW registers on x86_64,aarch64,etc
+and BPF calling convention maps directly to ABIs used by kernel on 64-bit
+architectures.
+
+R0-R5 are scratch registers and BPF program needs spill/fill them if necessary
+across calls.
+Note that there is only one BPF program == one BPF function and it cannot call
+other BPF functions. It can only call predefined in-kernel functions.
+
+All BPF registers are 64-bit without subregs, which makes JITed x86 code
+less optimal, but matches sparc/mips architectures.
+Adding 32-bit subregs was considered, since JIT can map them to x86 and aarch64
+nicely, but read-modify-write overhead for sparc/mips is not worth the gains.
+
+Original BPF and extended BPF are two-operand instruction sets, which helps
+to do a one-to-one mapping between a BPF insn and an x86 insn during JIT.
+
+Extended BPF has no pre-defined endianness, so as not to favor one
+architecture over another; therefore the bswap insn was introduced.
+Original BPF doesn't have such an insn and does bswap as part of the
+sk_load_word call, which is often unnecessary if we want to compare the value
+with a constant.
+Restricted C code might be written differently depending on endianness
+and GCC/LLVM-bpf will take an endianness flag.
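+
+For example (illustrative): to compare a 32-bit big-endian header field with a
+constant, a little-endian target applies BPF_ALU | BPF_BSWAP32 | BPF_X to the
+loaded value before the BPF_JMP | BPF_JEQ | BPF_K comparison, while a
+big-endian target compiles the same C without the swap.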
+
+32-bit architectures run 64-bit extended BPF programs via the interpreter.
+
+Q: Why is extended BPF 64-bit? Can't we live with 32-bit?
+A: On 64-bit architectures, pointers are 64-bit and we want to pass 64-bit
+values in and out of kernel functions, so 32-bit BPF registers would require
+defining a register-pair ABI; there would be no direct BPF register to HW
+register mapping and the JIT would need to do combine/split/move operations for
+every register in and out of the function, which is complex, bug prone and slow.
+Another reason is counters. To use a 64-bit counter a BPF program would need to
+do complex math. Again, bug prone and not atomic.
+
+Q: Original BPF is safe, deterministic and kernel can easily prove that.
+   Does extended BPF keep these properties?
+A: Yes. The safety of the program is determined in two steps.
+The first step does a depth-first-search to disallow loops, plus other CFG validation.
+The second step starts from the first insn and descends all possible paths.
+It simulates execution of every insn and observes the state change of
+registers and stack.
+At the start of the program the register R1 contains a pointer to bpf_context
+and has type PTR_TO_CTX. If checker sees an insn that does R2=R1, then R2 has
+now type PTR_TO_CTX as well and can be used on right hand side of expression.
+If R1=PTR_TO_CTX and insn is R2=R1+1, then R2=INVALID_PTR and it is readable.
+If register was never written to, it's not readable.
+After kernel function call, R1-R5 are reset to unreadable and R0 has a return
+type of the function. Since R6-R9 are callee saved, their state is preserved
+across the call.
+load/store instructions are allowed only with registers of valid types, which
+are PTR_TO_CTX, PTR_TO_TABLE, PTR_TO_STACK. They are bounds and alignment
+checked.
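+
+For example (illustrative), given R1=PTR_TO_CTX at the start:
+  R2 = R1                        // R2 becomes PTR_TO_CTX
+  R3 = R2 + 8                    // R3 becomes INVALID_PTR (usable as a value only)
+  BPF_INSN_LD(BPF_W, R0, R3, 0)  // rejected: load through an invalid pointer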
+
+bpf_context structure is generic. Its contents are defined by the specific use case.
+For seccomp it can be seccomp_data and through get_context_access callback
+BPF checker is customized, so that BPF program can only access certain fields
+of bpf_context with specified size and alignment.
+For example, the following insn:
+  BPF_INSN_LD(BPF_W, R0, R6, 8)
+intends to load word from address R6 + 8 and store it into R0
+If R6=PTR_TO_CTX, then get_context_access callback should let the checker know
+that offset 8 of size 4 bytes can be accessed for reading, otherwise the checker
+will reject the program.
+If R6=PTR_TO_STACK, then access should be aligned and be within stack bounds,
+which are hard coded to [-480, 0]. In this example offset is 8, so it will fail
+verification.
+The checker will allow BPF program to read data from stack only after it wrote
+into it.
+Pointer register spill/fill is tracked as well, since four (R6-R9) callee saved
+registers may not be enough for some programs.
+
+Allowed function calls are customized via get_func_proto callback.
+For example:
+  u64 bpf_load_byte(struct bpf_context *ctx, u32 offset);
+function will have the following definition:
+  struct bpf_func_proto proto = {RET_INTEGER, PTR_TO_CTX};
+and BPF checker will verify that bpf_load_byte is always called with first
+argument being a valid pointer to bpf_context. After the call BPF register R0
+will be set to readable state, so that BPF program can access it.
+
+Among the useful functions that can be made available to a BPF program
+are bpf_table_lookup/bpf_table_update.
+Using them, a tracing filter can collect any type of statistics.
+
+Therefore an extended BPF program consists of instructions and tables.
+From the BPF program a table is identified by a constant table_id
+and access to a table in C looks like:
+elem = bpf_table_lookup(ctx, table_id, key);
+
+BPF checker matches 'table_id' against known tables, verifies that 'key' points
+to stack and table->key_size bytes are initialized.
+From there on bpf_table_lookup() is a normal kernel function. It needs to do
+a lookup by whatever means and return either valid pointer to the element
+or NULL. BPF checker will verify that the program accesses the pointer only
+after comparing it to NULL. That's the meaning of PTR_TO_TABLE_CONDITIONAL and
+PTR_TO_TABLE register types in bpf_check.c
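+
+A sketch in restricted C (illustrative only; assumes table_id 1 with an 8-byte
+key and an 8-byte counter as the element):
+  u64 key = 1, init = 1, *leaf;
+  leaf = bpf_table_lookup(ctx, 1, &key);
+  if (leaf)
+          (*leaf)++;      /* dereference allowed only after the NULL check */
+  else
+          bpf_table_update(ctx, 1, &key, &init);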
+
+If a kernel subsystem wants to use this BPF framework and decides to implement
+bpf_table_lookup, the checker will guarantee that argument 'ctx' is a valid
+pointer to bpf_context, 'table_id' is valid table_id and table->key_size bytes
+can be read from the pointer 'key'. It's up to implementation to decide how it
+wants to do the lookup and what is the key.
+
+Going back to the example BPF insn:
+  BPF_INSN_LD(BPF_W, R0, R6, 8)
+if R6=PTR_TO_TABLE, then offset and size of access must be within
+[0, table->elem_size] which is determined by constant table_id that was passed
+into bpf_table_lookup call prior to this insn.
+
+Just like the original, extended BPF is limited to 4096 insns, which means that
+any program will terminate quickly and will call a fixed number of kernel
+functions.
+Earlier implementation of the checker had a precise calculation of the worst
+case number of insns, but it was removed to simplify the code, since the worst
+number is always less than the number of insns in a program anyway (because
+it's a DAG).
+
+Since register/stack state tracking simulates execution of all insns in all
+possible branches, it will explode if not bounded. There are two bounds.
+verifier_state stack is limited to 1k, therefore BPF program cannot have
+more than 1k jump insns.
+Total number of insns to be analyzed is limited to 32k, which means that
+checker will either prove correctness or reject the program in a few
+milliseconds on an average x86 CPU. Valid programs take microseconds to verify.
+
-- 
1.7.9.5



* [RFC PATCH tip 4/5] use BPF in tracing filters
  2013-12-03  4:28 [RFC PATCH tip 0/5] tracing filters with BPF Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2013-12-03  4:28 ` [RFC PATCH tip 3/5] Extended BPF (64-bit BPF) design document Alexei Starovoitov
@ 2013-12-03  4:28 ` Alexei Starovoitov
  2013-12-04  0:48   ` Masami Hiramatsu
  2013-12-03  4:28 ` [RFC PATCH tip 5/5] tracing filter examples in BPF Alexei Starovoitov
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-03  4:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel

Such filters can be written in C and allow safe read-only access to
any kernel data structure.
Like systemtap but with safety guaranteed by kernel.

The user can do:
cat bpf_program > /sys/kernel/debug/tracing/.../filter
if tracing event is either static or dynamic via kprobe_events.

The program can be anything as long as bpf_check() can verify its safety.
For example, the user can create a kprobe_event on dst_discard()
and use the logically following code inside the BPF filter:
      skb = (struct sk_buff *)ctx->regs.di;
      dev = bpf_load_pointer(&skb->dev);
to access 'struct net_device'.
Since the prototype is 'int dst_discard(struct sk_buff *skb);',
the 'skb' pointer is in the 'rdi' register on x86_64.
bpf_load_pointer() will try to fetch the 'dev' field of 'struct sk_buff'
and will suppress the page fault if the pointer is incorrect.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/ftrace_event.h       |    3 +
 include/trace/bpf_trace.h          |   27 +++++
 include/trace/ftrace.h             |   14 +++
 kernel/trace/Kconfig               |    1 +
 kernel/trace/Makefile              |    1 +
 kernel/trace/bpf_trace_callbacks.c |  191 ++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.c               |    7 ++
 kernel/trace/trace.h               |   11 ++-
 kernel/trace/trace_events.c        |    9 +-
 kernel/trace/trace_events_filter.c |   61 +++++++++++-
 kernel/trace/trace_kprobe.c        |    6 ++
 11 files changed, 327 insertions(+), 4 deletions(-)
 create mode 100644 include/trace/bpf_trace.h
 create mode 100644 kernel/trace/bpf_trace_callbacks.c

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 8c9b7a1..8d4a7a3 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -203,6 +203,7 @@ enum {
 	TRACE_EVENT_FL_IGNORE_ENABLE_BIT,
 	TRACE_EVENT_FL_WAS_ENABLED_BIT,
 	TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
+	TRACE_EVENT_FL_BPF_BIT,
 };
 
 /*
@@ -223,6 +224,7 @@ enum {
 	TRACE_EVENT_FL_IGNORE_ENABLE	= (1 << TRACE_EVENT_FL_IGNORE_ENABLE_BIT),
 	TRACE_EVENT_FL_WAS_ENABLED	= (1 << TRACE_EVENT_FL_WAS_ENABLED_BIT),
 	TRACE_EVENT_FL_USE_CALL_FILTER	= (1 << TRACE_EVENT_FL_USE_CALL_FILTER_BIT),
+	TRACE_EVENT_FL_BPF		= (1 << TRACE_EVENT_FL_BPF_BIT),
 };
 
 struct ftrace_event_call {
@@ -347,6 +349,7 @@ extern int filter_check_discard(struct ftrace_event_file *file, void *rec,
 extern int call_filter_check_discard(struct ftrace_event_call *call, void *rec,
 				     struct ring_buffer *buffer,
 				     struct ring_buffer_event *event);
+extern void filter_call_bpf(struct event_filter *filter, struct pt_regs *regs);
 
 enum {
 	FILTER_OTHER = 0,
diff --git a/include/trace/bpf_trace.h b/include/trace/bpf_trace.h
new file mode 100644
index 0000000..99d1e4b
--- /dev/null
+++ b/include/trace/bpf_trace.h
@@ -0,0 +1,27 @@
+/* Copyright (c) 2011-2013 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_KERNEL_BPF_TRACE_H
+#define _LINUX_KERNEL_BPF_TRACE_H
+
+#include <linux/ptrace.h>
+
+struct bpf_context {
+	struct pt_regs regs;
+};
+
+void *bpf_load_pointer(void *unsafe_ptr);
+long bpf_memcmp(void *unsafe_ptr, void *safe_ptr, long size);
+void bpf_dump_stack(struct bpf_context *ctx);
+void bpf_trace_printk(char *fmt, long fmt_size,
+		      long arg1, long arg2, long arg3);
+void *bpf_table_lookup(struct bpf_context *ctx, long table_id, const void *key);
+long bpf_table_update(struct bpf_context *ctx, long table_id, const void *key,
+		      const void *leaf);
+
+extern struct bpf_callbacks bpf_trace_cb;
+
+#endif /* _LINUX_KERNEL_BPF_TRACE_H */
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 5c38606..4054393 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -17,6 +17,7 @@
  */
 
 #include <linux/ftrace_event.h>
+#include <linux/kexec.h>
 
 /*
  * DECLARE_EVENT_CLASS can be used to add a generic function
@@ -526,6 +527,11 @@ static inline notrace int ftrace_get_offsets_##call(			\
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 									\
+static noinline __noclone notrace void					\
+ftrace_raw_event_save_regs_##call(struct pt_regs *__regs, proto)	\
+{									\
+	crash_setup_regs(__regs, NULL);					\
+}									\
 static notrace void							\
 ftrace_raw_event_##call(void *__data, proto)				\
 {									\
@@ -543,6 +549,14 @@ ftrace_raw_event_##call(void *__data, proto)				\
 		     &ftrace_file->flags))				\
 		return;							\
 									\
+	if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) &&	\
+	    unlikely(ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF)) { \
+		struct pt_regs __regs;					\
+		ftrace_raw_event_save_regs_##call(&__regs, args);	\
+		filter_call_bpf(ftrace_file->filter, &__regs);		\
+		return;							\
+	}								\
+									\
 	local_save_flags(irq_flags);					\
 	pc = preempt_count();						\
 									\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 015f85a..2809cd1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -80,6 +80,7 @@ config FTRACE_NMI_ENTER
 
 config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
+	select BPF64
 	bool
 
 config CONTEXT_SWITCH_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index d7e2068..fe90d85 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -50,6 +50,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o
 endif
 obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
+obj-$(CONFIG_EVENT_TRACING) += bpf_trace_callbacks.o
 obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
 obj-$(CONFIG_TRACEPOINTS) += power-traces.o
 ifeq ($(CONFIG_PM_RUNTIME),y)
diff --git a/kernel/trace/bpf_trace_callbacks.c b/kernel/trace/bpf_trace_callbacks.c
new file mode 100644
index 0000000..c2afd43
--- /dev/null
+++ b/kernel/trace/bpf_trace_callbacks.c
@@ -0,0 +1,191 @@
+/* Copyright (c) 2011-2013 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf_jit.h>
+#include <linux/uaccess.h>
+#include <trace/bpf_trace.h>
+#include "trace.h"
+
+#define MAX_CTX_OFF sizeof(struct bpf_context)
+
+static const struct bpf_context_access ctx_access[MAX_CTX_OFF] = {
+#ifdef CONFIG_X86_64
+	[offsetof(struct bpf_context, regs.di)] = {
+		FIELD_SIZEOF(struct bpf_context, regs.di),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, regs.si)] = {
+		FIELD_SIZEOF(struct bpf_context, regs.si),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, regs.dx)] = {
+		FIELD_SIZEOF(struct bpf_context, regs.dx),
+		BPF_READ
+	},
+	[offsetof(struct bpf_context, regs.cx)] = {
+		FIELD_SIZEOF(struct bpf_context, regs.cx),
+		BPF_READ
+	},
+#endif
+};
+
+static const struct bpf_context_access *get_context_access(int off)
+{
+	if (off >= MAX_CTX_OFF)
+		return NULL;
+	return &ctx_access[off];
+}
+
+void *bpf_load_pointer(void *unsafe_ptr)
+{
+	void *ptr = NULL;
+
+	probe_kernel_read(&ptr, unsafe_ptr, sizeof(void *));
+	return ptr;
+}
+
+long bpf_memcmp(void *unsafe_ptr, void *safe_ptr, long size)
+{
+	char buf[64];
+	int err;
+
+	if (size < 64) {
+		err = probe_kernel_read(buf, unsafe_ptr, size);
+		if (err)
+			return err;
+		return memcmp(buf, safe_ptr, size);
+	}
+	return -1;
+}
+
+void bpf_dump_stack(struct bpf_context *ctx)
+{
+	unsigned long flags;
+
+	local_save_flags(flags);
+
+	__trace_stack_regs(flags, 0, preempt_count(), (struct pt_regs *)ctx);
+}
+
+/*
+ * limited trace_printk()
+ * only %d %u %p %x conversion specifiers allowed
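+ *
+ * e.g. (illustrative) usage from a BPF filter, assuming 'skb' and 'len' were
+ * loaded earlier in the program:
+ *   char fmt[] = "skb %p len %d\n";
+ *   bpf_trace_printk(fmt, sizeof(fmt), (long)skb, len, 0);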
+ */
+void bpf_trace_printk(char *fmt, long fmt_size, long arg1, long arg2, long arg3)
+{
+	int fmt_cnt = 0;
+	int i;
+
+	/*
+	 * bpf_check() guarantees that fmt points to bpf program stack and
+	 * fmt_size bytes of it were initialized by bpf program
+	 */
+	if (fmt[fmt_size - 1] != 0)
+		return;
+
+	for (i = 0; i < fmt_size; i++)
+		if (fmt[i] == '%') {
+			if (i + 1 >= fmt_size)
+				return;
+			if (fmt[i + 1] != 'p' && fmt[i + 1] != 'd' &&
+			    fmt[i + 1] != 'u' && fmt[i + 1] != 'x')
+				return;
+			fmt_cnt++;
+		}
+	if (fmt_cnt > 3)
+		return;
+	__trace_printk((unsigned long)__builtin_return_address(3), fmt,
+		       arg1, arg2, arg3);
+}
+
+
+static const struct bpf_func_proto *get_func_proto(char *strtab, int id)
+{
+	if (!strcmp(strtab + id, "bpf_load_pointer")) {
+		static const struct bpf_func_proto proto = {RET_INTEGER};
+		return &proto;
+	}
+	if (!strcmp(strtab + id, "bpf_memcmp")) {
+		static const struct bpf_func_proto proto = {RET_INTEGER,
+			INVALID_PTR, PTR_TO_STACK_IMM,
+			CONST_ARG_STACK_IMM_SIZE};
+		return &proto;
+	}
+	if (!strcmp(strtab + id, "bpf_dump_stack")) {
+		static const struct bpf_func_proto proto = {RET_VOID,
+			PTR_TO_CTX};
+		return &proto;
+	}
+	if (!strcmp(strtab + id, "bpf_trace_printk")) {
+		static const struct bpf_func_proto proto = {RET_VOID,
+			PTR_TO_STACK_IMM, CONST_ARG_STACK_IMM_SIZE};
+		return &proto;
+	}
+	if (!strcmp(strtab + id, "bpf_table_lookup")) {
+		static const struct bpf_func_proto proto = {
+			PTR_TO_TABLE_CONDITIONAL, PTR_TO_CTX,
+			CONST_ARG_TABLE_ID, PTR_TO_STACK_IMM_TABLE_KEY};
+		return &proto;
+	}
+	if (!strcmp(strtab + id, "bpf_table_update")) {
+		static const struct bpf_func_proto proto = {RET_INTEGER,
+			PTR_TO_CTX, CONST_ARG_TABLE_ID,
+			PTR_TO_STACK_IMM_TABLE_KEY,
+			PTR_TO_STACK_IMM_TABLE_ELEM};
+		return &proto;
+	}
+	return NULL;
+}
+
+static void execute_func(char *strtab, int id, u64 *regs)
+{
+	regs[R0] = 0;
+
+	/*
+	 * strcmp-approach is not efficient.
+	 * TODO: optimize it for poor archs that don't have JIT yet
+	 */
+	if (!strcmp(strtab + id, "bpf_load_pointer")) {
+		regs[R0] = (u64)bpf_load_pointer((void *)regs[R1]);
+	} else if (!strcmp(strtab + id, "bpf_memcmp")) {
+		regs[R0] = (u64)bpf_memcmp((void *)regs[R1], (void *)regs[R2],
+					   (long)regs[R3]);
+	} else if (!strcmp(strtab + id, "bpf_dump_stack")) {
+		bpf_dump_stack((struct bpf_context *)regs[R1]);
+	} else if (!strcmp(strtab + id, "bpf_trace_printk")) {
+		bpf_trace_printk((char *)regs[R1], (long)regs[R2],
+				 (long)regs[R3], (long)regs[R4],
+				 (long)regs[R5]);
+	} else {
+		pr_err_once("trace cannot execute unknown bpf function %d '%s'\n",
+			    id, strtab + id);
+	}
+}
+
+static void *jit_select_func(char *strtab, int id)
+{
+	if (!strcmp(strtab + id, "bpf_load_pointer"))
+		return bpf_load_pointer;
+
+	if (!strcmp(strtab + id, "bpf_memcmp"))
+		return bpf_memcmp;
+
+	if (!strcmp(strtab + id, "bpf_dump_stack"))
+		return bpf_dump_stack;
+
+	if (!strcmp(strtab + id, "bpf_trace_printk"))
+		return bpf_trace_printk;
+
+	return NULL;
+}
+
+struct bpf_callbacks bpf_trace_cb = {
+	execute_func, jit_select_func, get_func_proto, get_context_access
+};
+
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 9d20cd9..c052936 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1758,6 +1758,13 @@ void __trace_stack(struct trace_array *tr, unsigned long flags, int skip,
 	__ftrace_trace_stack(tr->trace_buffer.buffer, flags, skip, pc, NULL);
 }
 
+void __trace_stack_regs(unsigned long flags, int skip, int pc,
+			struct pt_regs *regs)
+{
+	__ftrace_trace_stack(global_trace.trace_buffer.buffer, flags, skip,
+			     pc, regs);
+}
+
 /**
  * trace_dump_stack - record a stack back trace in the trace buffer
  * @skip: Number of functions to skip (helper handlers)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index ea189e0..33d26aff 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -616,6 +616,8 @@ void ftrace_trace_userstack(struct ring_buffer *buffer, unsigned long flags,
 
 void __trace_stack(struct trace_array *tr, unsigned long flags, int skip,
 		   int pc);
+void __trace_stack_regs(unsigned long flags, int skip, int pc,
+			struct pt_regs *regs);
 #else
 static inline void ftrace_trace_stack(struct ring_buffer *buffer,
 				      unsigned long flags, int skip, int pc)
@@ -637,6 +639,10 @@ static inline void __trace_stack(struct trace_array *tr, unsigned long flags,
 				 int skip, int pc)
 {
 }
+static inline void __trace_stack_regs(unsigned long flags, int skip, int pc,
+				      struct pt_regs *regs)
+{
+}
 #endif /* CONFIG_STACKTRACE */
 
 extern cycle_t ftrace_now(int cpu);
@@ -936,12 +942,15 @@ struct ftrace_event_field {
 	int			is_signed;
 };
 
+struct bpf_program;
+
 struct event_filter {
 	int			n_preds;	/* Number assigned */
 	int			a_preds;	/* allocated */
 	struct filter_pred	*preds;
 	struct filter_pred	*root;
 	char			*filter_string;
+	struct bpf_program	*prog;
 };
 
 struct event_subsystem {
@@ -1014,7 +1023,7 @@ filter_parse_regex(char *buff, int len, char **search, int *not);
 extern void print_event_filter(struct ftrace_event_file *file,
 			       struct trace_seq *s);
 extern int apply_event_filter(struct ftrace_event_file *file,
-			      char *filter_string);
+			      char *filter_string, int filter_len);
 extern int apply_subsystem_event_filter(struct ftrace_subsystem_dir *dir,
 					char *filter_string);
 extern void print_subsystem_event_filter(struct event_subsystem *system,
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index f919a2e..deed25f 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1041,9 +1041,16 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
 	mutex_lock(&event_mutex);
 	file = event_file_data(filp);
 	if (file)
-		err = apply_event_filter(file, buf);
+		err = apply_event_filter(file, buf, cnt);
 	mutex_unlock(&event_mutex);
 
+	if (file && (file->event_call->flags & TRACE_EVENT_FL_BPF))
+		/*
+		 * allocate per-cpu printk buffers, since BPF program
+		 * might be calling bpf_trace_printk
+		 */
+		trace_printk_init_buffers();
+
 	free_page((unsigned long) buf);
 	if (err < 0)
 		return err;
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 2468f56..36c7bd6 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -23,6 +23,8 @@
 #include <linux/mutex.h>
 #include <linux/perf_event.h>
 #include <linux/slab.h>
+#include <linux/bpf_jit.h>
+#include <trace/bpf_trace.h>
 
 #include "trace.h"
 #include "trace_output.h"
@@ -535,6 +537,20 @@ static int filter_match_preds_cb(enum move_type move, struct filter_pred *pred,
 	return WALK_PRED_DEFAULT;
 }
 
+void filter_call_bpf(struct event_filter *filter, struct pt_regs *regs)
+{
+	BUG_ON(!filter || !filter->prog);
+
+	if (!filter->prog->jit_image) {
+		pr_warn_once("BPF jit image is not available. Fallback to emulation\n");
+		bpf_run(filter->prog, (struct bpf_context *)regs);
+		return;
+	}
+
+	filter->prog->jit_image((struct bpf_context *)regs);
+}
+EXPORT_SYMBOL_GPL(filter_call_bpf);
+
 /* return 1 if event matches, 0 otherwise (discard) */
 int filter_match_preds(struct event_filter *filter, void *rec)
 {
@@ -794,6 +810,7 @@ static void __free_filter(struct event_filter *filter)
 	if (!filter)
 		return;
 
+	bpf_free(filter->prog);
 	__free_preds(filter);
 	kfree(filter->filter_string);
 	kfree(filter);
@@ -1893,6 +1910,37 @@ static int create_filter_start(char *filter_str, bool set_str,
 	return err;
 }
 
+static int create_filter_bpf(char *filter_str, int filter_len,
+			     struct event_filter **filterp)
+{
+	struct event_filter *filter;
+	int err = 0;
+
+	*filterp = NULL;
+
+	filter = __alloc_filter();
+	if (filter)
+		err = replace_filter_string(filter, "bpf");
+
+	if (!filter || err) {
+		__free_filter(filter);
+		return -ENOMEM;
+	}
+
+	err = bpf_load_image(filter_str, filter_len, &bpf_trace_cb,
+			     &filter->prog);
+
+	if (err) {
+		pr_err("failed to load bpf %d\n", err);
+		__free_filter(filter);
+		return -EACCES;
+	}
+
+	*filterp = filter;
+
+	return err;
+}
+
 static void create_filter_finish(struct filter_parse_state *ps)
 {
 	if (ps) {
@@ -1973,7 +2021,8 @@ static int create_system_filter(struct event_subsystem *system,
 }
 
 /* caller must hold event_mutex */
-int apply_event_filter(struct ftrace_event_file *file, char *filter_string)
+int apply_event_filter(struct ftrace_event_file *file, char *filter_string,
+		       int filter_len)
 {
 	struct ftrace_event_call *call = file->event_call;
 	struct event_filter *filter;
@@ -1995,7 +2044,15 @@ int apply_event_filter(struct ftrace_event_file *file, char *filter_string)
 		return 0;
 	}
 
-	err = create_filter(call, filter_string, true, &filter);
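+	/* a BPF image begins with the "bpf\0" magic (see bpf_load_image()),
+	 * so a written image string-compares equal to "bpf" here
+	 */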
+	if (!strcmp(filter_string, "bpf")) {
+		err = create_filter_bpf(filter_string, filter_len, &filter);
+		if (!err)
+			call->flags |= TRACE_EVENT_FL_BPF;
+	} else {
+		err = create_filter(call, filter_string, true, &filter);
+		if (!err)
+			call->flags &= ~TRACE_EVENT_FL_BPF;
+	}
 
 	/*
 	 * Always swap the call filter with the new filter
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index dae9541..e1a2187 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -819,6 +819,12 @@ __kprobe_trace_func(struct trace_probe *tp, struct pt_regs *regs,
 	if (test_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &ftrace_file->flags))
 		return;
 
+	if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) &&
+	    unlikely(ftrace_file->event_call->flags & TRACE_EVENT_FL_BPF)) {
+		filter_call_bpf(ftrace_file->filter, regs);
+		return;
+	}
+
 	local_save_flags(irq_flags);
 	pc = preempt_count();
 
-- 
1.7.9.5



* [RFC PATCH tip 5/5] tracing filter examples in BPF
  2013-12-03  4:28 [RFC PATCH tip 0/5] tracing filters with BPF Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2013-12-03  4:28 ` [RFC PATCH tip 4/5] use BPF in tracing filters Alexei Starovoitov
@ 2013-12-03  4:28 ` Alexei Starovoitov
  2013-12-04  0:35   ` Jonathan Corbet
  2013-12-03  9:16 ` [RFC PATCH tip 0/5] tracing filters with BPF Ingo Molnar
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-03  4:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel

filter_ex1: filter that prints events for the loopback device only

$ cat filter_ex1.bpf > /sys/kernel/debug/tracing/events/net/netif_receive_skb/filter
$ echo 1 > /sys/kernel/debug/tracing/events/net/netif_receive_skb/enable
$ ping -c1 localhost
$ cat /sys/kernel/debug/tracing/trace_pipe
            ping-5913  [003] ..s2  3779.285726: __netif_receive_skb_core: skb ffff880808e3a300 dev ffff88080bbf8000
            ping-5913  [003] ..s2  3779.285744: __netif_receive_skb_core: skb ffff880808e3a900 dev ffff88080bbf8000

To pre-check correctness of the filter do:
$ trace_filter_check filter_ex1.bpf
(final filter check always happens in kernel)

bpf/llvm - placeholder for LLVM-BPF backend

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 GCC-BPF backend is available on github
 (since gcc plugin infrastructure doesn't allow for out-of-tree backends)

 LLVM plugin infra is very flexible.
 The LLVM-BPF backend reuses the 'LLVM target independent code generator'
 and is currently a work in progress. It can be built out of the LLVM tree.
 The user would need to 'apt-get install llvm-3.x-dev',
 which brings in llvm headers and static libraries,
 and then compile only the BPF backend.

 Both compilers can compile C into BPF instruction set.

 tools/bpf/llvm/README.txt            |    6 +++
 tools/bpf/trace/Makefile             |   34 ++++++++++++++
 tools/bpf/trace/README.txt           |   15 +++++++
 tools/bpf/trace/filter_ex1.c         |   52 +++++++++++++++++++++
 tools/bpf/trace/filter_ex1_orig.c    |   23 ++++++++++
 tools/bpf/trace/filter_ex2.c         |   74 ++++++++++++++++++++++++++++++
 tools/bpf/trace/filter_ex2_orig.c    |   47 +++++++++++++++++++
 tools/bpf/trace/trace_filter_check.c |   82 ++++++++++++++++++++++++++++++++++
 8 files changed, 333 insertions(+)
 create mode 100644 tools/bpf/llvm/README.txt
 create mode 100644 tools/bpf/trace/Makefile
 create mode 100644 tools/bpf/trace/README.txt
 create mode 100644 tools/bpf/trace/filter_ex1.c
 create mode 100644 tools/bpf/trace/filter_ex1_orig.c
 create mode 100644 tools/bpf/trace/filter_ex2.c
 create mode 100644 tools/bpf/trace/filter_ex2_orig.c
 create mode 100644 tools/bpf/trace/trace_filter_check.c

diff --git a/tools/bpf/llvm/README.txt b/tools/bpf/llvm/README.txt
new file mode 100644
index 0000000..3ca3ece
--- /dev/null
+++ b/tools/bpf/llvm/README.txt
@@ -0,0 +1,6 @@
+placeholder for LLVM BPF backend:
+lib/Target/BPF/*.cpp
+
+prerequisites:
+apt-get install llvm-3.[23]-dev
+
diff --git a/tools/bpf/trace/Makefile b/tools/bpf/trace/Makefile
new file mode 100644
index 0000000..b63f974
--- /dev/null
+++ b/tools/bpf/trace/Makefile
@@ -0,0 +1,34 @@
+CC = gcc
+
+all: trace_filter_check filter_ex1.bpf filter_ex2.bpf
+
+srctree=../../..
+src-perf=../../perf
+
+CFLAGS += -I$(src-perf)/util/include
+CFLAGS += -I$(src-perf)/arch/$(ARCH)/include
+CFLAGS += -I$(srctree)/arch/$(ARCH)/include/uapi
+CFLAGS += -I$(srctree)/arch/$(ARCH)/include
+CFLAGS += -I$(srctree)/include/uapi
+CFLAGS += -I$(srctree)/include
+CFLAGS += -Wall -O2
+
+trace_filter_check: LDLIBS = -Wl,--unresolved-symbols=ignore-all
+trace_filter_check: trace_filter_check.o \
+	$(srctree)/kernel/bpf_jit/bpf_check.o \
+	$(srctree)/kernel/bpf_jit/bpf_run.o \
+	$(srctree)/kernel/trace/bpf_trace_callbacks.o
+
+filter_ex1: filter_ex1.o
+filter_ex1.bpf: filter_ex1
+	./filter_ex1 > filter_ex1.bpf
+	rm filter_ex1
+
+filter_ex2: filter_ex2.o
+filter_ex2.bpf: filter_ex2
+	./filter_ex2 > filter_ex2.bpf
+	rm filter_ex2
+
+clean:
+	rm -rf *.o *.bpf trace_filter_check filter_ex1 filter_ex2
+
diff --git a/tools/bpf/trace/README.txt b/tools/bpf/trace/README.txt
new file mode 100644
index 0000000..7c1fcb9
--- /dev/null
+++ b/tools/bpf/trace/README.txt
@@ -0,0 +1,15 @@
+Tracing filter examples
+
+filter_ex1: tracing filter example that prints events for loopback device only
+
+$ cat filter_ex1.bpf > /sys/kernel/debug/tracing/events/net/netif_receive_skb/filter
+$ echo 1 > /sys/kernel/debug/tracing/events/net/netif_receive_skb/enable
+$ ping -c1 localhost
+$ cat /sys/kernel/debug/tracing/trace_pipe
+            ping-5913  [003] ..s2  3779.285726: __netif_receive_skb_core: skb ffff880808e3a300 dev ffff88080bbf8000
+            ping-5913  [003] ..s2  3779.285744: __netif_receive_skb_core: skb ffff880808e3a900 dev ffff88080bbf8000
+
+To pre-check correctness of the filter do:
+$ trace_filter_check filter_ex1.bpf
+(final filter check always happens in kernel)
+
diff --git a/tools/bpf/trace/filter_ex1.c b/tools/bpf/trace/filter_ex1.c
new file mode 100644
index 0000000..74696ba
--- /dev/null
+++ b/tools/bpf/trace/filter_ex1.c
@@ -0,0 +1,52 @@
+#include <linux/bpf.h>
+
+struct bpf_insn bpf_insns_filter[] = {
+// registers to save R6 R7
+// allocate 24 bytes stack
+	BPF_INSN_ST_IMM(BPF_W, __fp__, -20, 28524), // *(uint32*)(__fp__, -20)=28524
+	BPF_INSN_LD(BPF_DW, R6, R1, 104), // R6=*(uint64*)(R1, 104)
+	BPF_INSN_ALU(BPF_MOV, R1, R6), // R1 = R6
+	BPF_INSN_ALU_IMM(BPF_ADD, R1, 32), // R1 += 32
+	BPF_INSN_CALL(1), // R0=bpf_load_pointer();
+	BPF_INSN_ALU(BPF_MOV, R7, R0), // R7 = R0
+	BPF_INSN_ALU_IMM(BPF_MOV, R3, 2), // R3 = 2
+	BPF_INSN_ALU(BPF_MOV, R2, __fp__), // R2 = __fp__
+	BPF_INSN_ALU_IMM(BPF_ADD, R2, -20), // R2 += -20
+	BPF_INSN_ALU(BPF_MOV, R1, R7), // R1 = R7
+	BPF_INSN_CALL(18), // R0=bpf_memcmp();
+	BPF_INSN_JUMP_IMM(BPF_JNE, R0, 0, 11), // if (R0 != 0) goto LabelL5
+	BPF_INSN_ST_IMM(BPF_W, __fp__, -16, 543320947), // *(uint32*)(__fp__, -16)=543320947
+	BPF_INSN_ST_IMM(BPF_W, __fp__, -12, 1679847461), // *(uint32*)(__fp__, -12)=1679847461
+	BPF_INSN_ST_IMM(BPF_W, __fp__, -8, 622884453), // *(uint32*)(__fp__, -8)=622884453
+	BPF_INSN_ST_IMM(BPF_W, __fp__, -4, 663664), // *(uint32*)(__fp__, -4)=663664
+	BPF_INSN_ALU_IMM(BPF_MOV, R5, 0), // R5 = 0
+	BPF_INSN_ALU(BPF_MOV, R4, R7), // R4 = R7
+	BPF_INSN_ALU(BPF_MOV, R3, R6), // R3 = R6
+	BPF_INSN_ALU_IMM(BPF_MOV, R2, 16), // R2 = 16
+	BPF_INSN_ALU(BPF_MOV, R1, __fp__), // R1 = __fp__
+	BPF_INSN_ALU_IMM(BPF_ADD, R1, -16), // R1 += -16
+	BPF_INSN_CALL(29), // (void)bpf_trace_printk();
+//LabelL5:
+	BPF_INSN_RET(), // return void
+};
+
+const char func_strtab[46] = "\0bpf_load_pointer\0bpf_memcmp\0bpf_trace_printk";
+
+int main()
+{
+	char header[4] = "bpf";
+
+	int insn_size = sizeof(bpf_insns_filter);
+	int htab_size = 0;
+	int strtab_size = sizeof(func_strtab);
+
+	write(1, header, 4);
+	write(1, &insn_size, 4);
+	write(1, &htab_size, 4);
+	write(1, &strtab_size, 4);
+
+	write(1, bpf_insns_filter, insn_size);
+	write(1, func_strtab, strtab_size);
+	return 0;
+}
+
diff --git a/tools/bpf/trace/filter_ex1_orig.c b/tools/bpf/trace/filter_ex1_orig.c
new file mode 100644
index 0000000..e670a82
--- /dev/null
+++ b/tools/bpf/trace/filter_ex1_orig.c
@@ -0,0 +1,23 @@
+/*
+ * tracing filter example
+ * if attached to /sys/kernel/debug/tracing/events/net/netif_receive_skb
+ * it will print events for loopback device only
+ */
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/bpf.h>
+#include <trace/bpf_trace.h>
+
+void filter(struct bpf_context *ctx)
+{
+	char devname[4] = "lo";
+	struct net_device *dev;
+	struct sk_buff *skb = 0;
+
+	skb = (struct sk_buff *)ctx->regs.si;
+	dev = bpf_load_pointer(&skb->dev);
+	if (bpf_memcmp(dev->name, devname, 2) == 0) {
+		char fmt[] = "skb %p dev %p \n";
+		bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)dev, 0);
+	}
+}
diff --git a/tools/bpf/trace/filter_ex2.c b/tools/bpf/trace/filter_ex2.c
new file mode 100644
index 0000000..cf5b7ce
--- /dev/null
+++ b/tools/bpf/trace/filter_ex2.c
@@ -0,0 +1,74 @@
+#include <linux/bpf.h>
+
+struct bpf_insn bpf_insns_filter[] = {
+// registers to save R6 R7
+// allocate 32 bytes stack
+	BPF_INSN_ALU(BPF_MOV, R6, R1), // R6 = R1
+	BPF_INSN_ST_IMM(BPF_DW, __fp__, -32, 0), // *(uint64*)(__fp__, -32)=0
+	BPF_INSN_LD(BPF_DW, R1, R6, 104), // R1=*(uint64*)(R6, 104)
+	BPF_INSN_ALU_IMM(BPF_ADD, R1, 32), // R1 += 32
+	BPF_INSN_CALL(1), // R0=bpf_load_pointer();
+	BPF_INSN_ALU(BPF_MOV, R7, R0), // R7 = R0
+	BPF_INSN_ST(BPF_DW, __fp__, -32, R7), // *(uint64*)(__fp__, -32)=R7
+	BPF_INSN_ALU(BPF_MOV, R3, __fp__), // R3 = __fp__
+	BPF_INSN_ALU_IMM(BPF_ADD, R3, -32), // R3 += -32
+	BPF_INSN_ALU_IMM(BPF_MOV, R2, 0), // R2 = 0
+	BPF_INSN_ALU(BPF_MOV, R1, R6), // R1 = R6
+	BPF_INSN_CALL(18), // R0=bpf_table_lookup();
+	BPF_INSN_JUMP_IMM(BPF_JEQ, R0, 0, 18), // if (R0 == 0) goto LabelL2
+	BPF_INSN_ALU_IMM(BPF_MOV, R1, 1), // R1 = 1
+	BPF_INSN_XADD(BPF_DW, R0, 0, R1), // atomic (*(uint64*)R0, 0) += R1
+	BPF_INSN_LD(BPF_DW, R1, R0, 0), // R1=*(uint64*)(R0, 0)
+	BPF_INSN_ALU_IMM(BPF_MOD, R1, 10000), // R1=((uint64)R1)%((uint64)10000)
+	BPF_INSN_JUMP_IMM(BPF_JNE, R1, 0, 21), // if (R1 != 0) goto LabelL6
+	BPF_INSN_ST_IMM(BPF_W, __fp__, -24, 544630116), // *(uint32*)(__fp__, -24)=544630116
+	BPF_INSN_ST_IMM(BPF_W, __fp__, -20, 538996773), // *(uint32*)(__fp__, -20)=538996773
+	BPF_INSN_ST_IMM(BPF_W, __fp__, -16, 1601465200), // *(uint32*)(__fp__, -16)=1601465200
+	BPF_INSN_ST_IMM(BPF_W, __fp__, -12, 544501347), // *(uint32*)(__fp__, -12)=544501347
+	BPF_INSN_ST_IMM(BPF_W, __fp__, -8, 680997), // *(uint32*)(__fp__, -8)=680997
+	BPF_INSN_ALU_IMM(BPF_MOV, R5, 0), // R5 = 0
+	BPF_INSN_LD(BPF_DW, R4, R0, 0), // R4=*(uint64*)(R0, 0)
+	BPF_INSN_ALU(BPF_MOV, R3, R7), // R3 = R7
+	BPF_INSN_ALU_IMM(BPF_MOV, R2, 20), // R2 = 20
+	BPF_INSN_ALU(BPF_MOV, R1, __fp__), // R1 = __fp__
+	BPF_INSN_ALU_IMM(BPF_ADD, R1, -24), // R1 += -24
+	BPF_INSN_CALL(35), // (void)bpf_trace_printk();
+	BPF_INSN_JUMP(BPF_JA, 0, 0, 8), // goto LabelL6
+//LabelL2:
+	BPF_INSN_ST_IMM(BPF_DW, __fp__, -24, 0), // *(uint64*)(__fp__, -24)=0
+	BPF_INSN_ALU(BPF_MOV, R4, __fp__), // R4 = __fp__
+	BPF_INSN_ALU_IMM(BPF_ADD, R4, -24), // R4 += -24
+	BPF_INSN_ALU(BPF_MOV, R3, __fp__), // R3 = __fp__
+	BPF_INSN_ALU_IMM(BPF_ADD, R3, -32), // R3 += -32
+	BPF_INSN_ALU_IMM(BPF_MOV, R2, 0), // R2 = 0
+	BPF_INSN_ALU(BPF_MOV, R1, R6), // R1 = R6
+	BPF_INSN_CALL(52), // R0=bpf_table_update();
+//LabelL6:
+	BPF_INSN_RET(), // return void
+};
+
+struct bpf_table bpf_filter_tables[] = {
+	{BPF_TABLE_HASH, 8, 8, 4096, 0}
+};
+
+const char func_strtab[69] = "\0bpf_load_pointer\0bpf_table_lookup\0bpf_trace_printk\0bpf_table_update";
+
+int main()
+{
+	char header[4] = "bpf";
+
+	int insn_size = sizeof(bpf_insns_filter);
+	int htab_size = sizeof(bpf_filter_tables);
+	int strtab_size = sizeof(func_strtab);
+
+	write(1, header, 4);
+	write(1, &insn_size, 4);
+	write(1, &htab_size, 4);
+	write(1, &strtab_size, 4);
+
+	write(1, bpf_insns_filter, insn_size);
+	write(1, bpf_filter_tables, htab_size);
+	write(1, func_strtab, strtab_size);
+	return 0;
+}
+
diff --git a/tools/bpf/trace/filter_ex2_orig.c b/tools/bpf/trace/filter_ex2_orig.c
new file mode 100644
index 0000000..a716490
--- /dev/null
+++ b/tools/bpf/trace/filter_ex2_orig.c
@@ -0,0 +1,47 @@
+/*
+ * tracing filter that counts number of events per device
+ * if attached to /sys/kernel/debug/tracing/events/net/netif_receive_skb
+ * it will count number of received packets for different devices
+ */
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/bpf.h>
+#include <trace/bpf_trace.h>
+
+struct dev_key {
+	void *dev;
+};
+
+struct dev_leaf {
+	uint64_t packet_cnt;
+};
+
+void filter(struct bpf_context *ctx)
+{
+	struct net_device *dev;
+	struct sk_buff *skb = 0;
+	struct dev_leaf *leaf;
+	struct dev_key key = {};
+
+	skb = (struct sk_buff *)ctx->regs.si;
+	dev = bpf_load_pointer(&skb->dev);
+
+	key.dev = dev;
+	leaf = bpf_table_lookup(ctx, 0, &key);
+	if (leaf) {
+		__sync_fetch_and_add(&leaf->packet_cnt, 1);
+		if (leaf->packet_cnt % 10000 == 0) {
+			char fmt[] = "dev %p  pkt_cnt %d\n";
+			bpf_trace_printk(fmt, sizeof(fmt), (long)dev,
+					 leaf->packet_cnt, 0);
+		}
+	} else {
+		struct dev_leaf new_leaf = {};
+		bpf_table_update(ctx, 0, &key, &new_leaf);
+	}
+}
+
+struct bpf_table filter_tables[] = {
+	{BPF_TABLE_HASH, sizeof(struct dev_key), sizeof(struct dev_leaf), 4096, 0}
+};
+
diff --git a/tools/bpf/trace/trace_filter_check.c b/tools/bpf/trace/trace_filter_check.c
new file mode 100644
index 0000000..4d408f5
--- /dev/null
+++ b/tools/bpf/trace/trace_filter_check.c
@@ -0,0 +1,82 @@
+/* Copyright (c) 2011-2013 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdarg.h>
+#include <errno.h>
+#include <string.h>
+
+void *__kmalloc(size_t size, int flags)
+{
+	return calloc(size, 1);
+}
+
+void kfree(void *objp)
+{
+	free(objp);
+}
+
+int kmalloc_caches[128];
+void *kmem_cache_alloc_trace(void *caches, int flags, size_t size)
+{
+	return calloc(size, 1);
+}
+
+void bpf_compile(void *prog)
+{
+}
+
+void __bpf_free(void *prog)
+{
+}
+
+int printk(const char *fmt, ...)
+{
+	int ret;
+	va_list ap;
+
+	va_start(ap, fmt);
+	ret = vprintf(fmt, ap);
+	va_end(ap);
+	return ret;
+}
+
+char buf[16000];
+int bpf_load_image(const char *image, int image_len, struct bpf_callbacks *cb,
+		   void **p_prog);
+
+int main(int ac, char **av)
+{
+	FILE *f;
+	int size, err;
+	void *prog;
+
+	if (ac < 2) {
+		printf("Usage: %s bpf_binary_image\n", av[0]);
+		return 1;
+	}
+
+	f = fopen(av[1], "r");
+	if (!f) {
+		printf("fopen %s\n", strerror(errno));
+		return 2;
+	}
+	size = fread(buf, 1, sizeof(buf), f);
+	if (size <= 0) {
+		printf("fread %s\n", strerror(errno));
+		return 3;
+	}
+	err = bpf_load_image(buf, size, &bpf_trace_cb, &prog);
+	if (!err)
+		printf("OK\n");
+	else
+		printf("err %s\n", strerror(-err));
+	fclose(f);
+	return 0;
+}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-03  4:28 [RFC PATCH tip 0/5] tracing filters with BPF Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2013-12-03  4:28 ` [RFC PATCH tip 5/5] tracing filter examples in BPF Alexei Starovoitov
@ 2013-12-03  9:16 ` Ingo Molnar
  2013-12-03 15:33   ` Steven Rostedt
  2013-12-03 18:06   ` Alexei Starovoitov
  2013-12-03 10:34 ` Masami Hiramatsu
  2013-12-04  0:01 ` Andi Kleen
  7 siblings, 2 replies; 65+ messages in thread
From: Ingo Molnar @ 2013-12-03  9:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, Linus Torvalds, Andrew Morton,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi, Pekka Enberg, David S. Miller, Arjan van de Ven,
	Christoph Hellwig


* Alexei Starovoitov <ast@plumgrid.com> wrote:

> Hi All,
> 
> the following set of patches adds BPF support to trace filters.
> 
> Trace filters can be written in C and allow safe read-only access to 
> any kernel data structure. Like systemtap but with safety guaranteed 
> by kernel.

Very cool! (Added various other folks who might be interested in this 
to the Cc: list.)

I have one generic concern:

It would be important to make it easy to extract loaded BPF code from 
the kernel in source code equivalent form, which compiles to the same 
BPF code.

I.e. I think it would be fundamentally important to make sure that 
this is all within the kernel's license domain, to make it very clear 
there can be no 'binary only' BPF scripts.

By up-loading BPF into a kernel the person loading it agrees to make 
that code available to all users of that system who can access it, 
under the same license as the kernel's code (or under a more 
permissive license).

The last thing we want is people getting funny ideas and writing 
drivers in BPF and hiding the code or making license claims over it 
...

I.e. we want to allow flexible plugins technologically, but make sure 
people who run into such a plugin can modify and improve it under the 
same license as they can modify and improve the kernel itself!

[ People can still 'hide' their sekrit plugins if they want to, by not
  distributing them to anyone who'd redistribute it widely. ]

> The user can do:
> cat bpf_program > /sys/kernel/debug/tracing/.../filter
> if tracing event is either static or dynamic via kprobe_events.
> 
> The filter program may look like:
> void filter(struct bpf_context *ctx)
> {
>         char devname[4] = "eth5";
>         struct net_device *dev;
>         struct sk_buff *skb = 0;
> 
>         dev = (struct net_device *)ctx->regs.si;
>         if (bpf_memcmp(dev->name, devname, 4) == 0) {
>                 char fmt[] = "skb %p dev %p eth5\n";
>                 bpf_trace_printk(fmt, skb, dev, 0, 0);
>         }
> }
> 
> The kernel will do static analysis of bpf program to make sure that 
> it cannot crash the kernel (doesn't have loops, valid 
> memory/register accesses, etc). Then kernel will map bpf 
> instructions to x86 instructions and let it run in the place of 
> trace filter.
> 
> To demonstrate performance I did a synthetic test:
>         dev = init_net.loopback_dev;
>         do_gettimeofday(&start_tv);
>         for (i = 0; i < 1000000; i++) {
>                 struct sk_buff *skb;
>                 skb = netdev_alloc_skb(dev, 128);
>                 kfree_skb(skb);
>         }
>         do_gettimeofday(&end_tv);
>         time = end_tv.tv_sec - start_tv.tv_sec;
>         time *= USEC_PER_SEC;
>         time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec);
> 
>         printk("1M skb alloc/free %lld (usecs)\n", time);
> 
> no tracing
> [   33.450966] 1M skb alloc/free 145179 (usecs)
> 
> echo 1 > enable
> [   97.186379] 1M skb alloc/free 240419 (usecs)
> (tracing slows down kfree_skb() due to event_buffer_lock/buffer_unlock_commit)
> 
> echo 'name==eth5' > filter
> [  139.644161] 1M skb alloc/free 302552 (usecs)
> (running filter_match_preds() for every skb and discarding
> event_buffer is even slower)
> 
> cat bpf_prog > filter
> [  171.150566] 1M skb alloc/free 199463 (usecs)
> (JITed bpf program is safely checking dev->name == eth5 and discarding)

So, to do the math:

   tracing               'all' overhead:   95 nsecs per event
   tracing 'eth5 + old filter' overhead:  157 nsecs per event
   tracing 'eth5 + BPF filter' overhead:   54 nsecs per event
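
(These per-event numbers follow directly from the measured totals:
   240419 - 145179 =  95240 usecs over 1M events  ~  95 nsecs/event
   302552 - 145179 = 157373 usecs over 1M events  ~ 157 nsecs/event
   199463 - 145179 =  54284 usecs over 1M events  ~  54 nsecs/event)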

So via BPF and a fairly trivial filter, we are able to reduce tracing 
overhead for real - while old-style filters only add to it.

In addition to that we now also have arbitrary BPF scripts enabled: full C 
programs (or programs written in any other language from which BPF bytecode 
can be generated).

Seems like a massive win-win scenario to me ;-)

> echo 0 > enable
> [  258.073593] 1M skb alloc/free 144919 (usecs)
> (tracing is disabled, performance is back to original)
> 
> The C program compiled into BPF and then JITed into x86 is faster 
> than filter_match_preds() approach (199-145 msec vs 302-145 msec)
> 
> tracing+bpf is a tool for safe read-only access to variables without 
> recompiling the kernel and without affecting running programs.
> 
> BPF filters can be written manually (see 
> tools/bpf/trace/filter_ex1.c) or better compiled from restricted C 
> via GCC or LLVM

> Q: What is the difference between existing BPF and extended BPF?
> A:
> Existing BPF insn from uapi/linux/filter.h
> struct sock_filter {
>         __u16   code;   /* Actual filter code */
>         __u8    jt;     /* Jump true */
>         __u8    jf;     /* Jump false */
>         __u32   k;      /* Generic multiuse field */
> };
> 
> Extended BPF insn from linux/bpf.h
> struct bpf_insn {
>         __u8    code;    /* opcode */
>         __u8    a_reg:4; /* dest register*/
>         __u8    x_reg:4; /* source register */
>         __s16   off;     /* signed offset */
>         __s32   imm;     /* signed immediate constant */
> };
> 
> opcode encoding is the same between old BPF and extended BPF.
> Original BPF has two 32-bit registers.
> Extended BPF has ten 64-bit registers.
> That is the main difference.
> 
> Old BPF was using jt/jf fields for jump-insn only.
> New BPF combines them into generic 'off' field for jump and non-jump insns.
> k==imm field has the same meaning.

This only affects the internal JIT representation, not the BPF byte 
code, right?

>  32 files changed, 3332 insertions(+), 24 deletions(-)

Impressive!

I'm wondering, will the new nftable code in the works make use of the BPF 
JIT as well, or is that a separate implementation?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-03  4:28 [RFC PATCH tip 0/5] tracing filters with BPF Alexei Starovoitov
                   ` (5 preceding siblings ...)
  2013-12-03  9:16 ` [RFC PATCH tip 0/5] tracing filters with BPF Ingo Molnar
@ 2013-12-03 10:34 ` Masami Hiramatsu
  2013-12-04  0:01 ` Andi Kleen
  7 siblings, 0 replies; 65+ messages in thread
From: Masami Hiramatsu @ 2013-12-03 10:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel

(2013/12/03 13:28), Alexei Starovoitov wrote:
> Hi All,
> 
> the following set of patches adds BPF support to trace filters.
> 
> Trace filters can be written in C and allow safe read-only access to any
> kernel data structure. Like systemtap but with safety guaranteed by kernel.
> 
> The user can do:
> cat bpf_program > /sys/kernel/debug/tracing/.../filter
> if tracing event is either static or dynamic via kprobe_events.

Oh, thank you for this great work! :D

> 
> The filter program may look like:
> void filter(struct bpf_context *ctx)
> {
>         char devname[4] = "eth5";
>         struct net_device *dev;
>         struct sk_buff *skb = 0;
> 
>         dev = (struct net_device *)ctx->regs.si;
>         if (bpf_memcmp(dev->name, devname, 4) == 0) {
>                 char fmt[] = "skb %p dev %p eth5\n";
>                 bpf_trace_printk(fmt, skb, dev, 0, 0);
>         }
> }
> 
> The kernel will do static analysis of bpf program to make sure that it cannot
> crash the kernel (doesn't have loops, valid memory/register accesses, etc).
> Then kernel will map bpf instructions to x86 instructions and let it
> run in the place of trace filter.
> 
> To demonstrate performance I did a synthetic test:
>         dev = init_net.loopback_dev;
>         do_gettimeofday(&start_tv);
>         for (i = 0; i < 1000000; i++) {
>                 struct sk_buff *skb;
>                 skb = netdev_alloc_skb(dev, 128);
>                 kfree_skb(skb);
>         }
>         do_gettimeofday(&end_tv);
>         time = end_tv.tv_sec - start_tv.tv_sec;
>         time *= USEC_PER_SEC;
>         time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec);
> 
>         printk("1M skb alloc/free %lld (usecs)\n", time);
> 
> no tracing
> [   33.450966] 1M skb alloc/free 145179 (usecs)
> 
> echo 1 > enable
> [   97.186379] 1M skb alloc/free 240419 (usecs)
> (tracing slows down kfree_skb() due to event_buffer_lock/buffer_unlock_commit)
> 
> echo 'name==eth5' > filter
> [  139.644161] 1M skb alloc/free 302552 (usecs)
> (running filter_match_preds() for every skb and discarding
> event_buffer is even slower)
> 
> cat bpf_prog > filter
> [  171.150566] 1M skb alloc/free 199463 (usecs)
> (JITed bpf program is safely checking dev->name == eth5 and discarding)
> 
> echo 0 > enable
> [  258.073593] 1M skb alloc/free 144919 (usecs)
> (tracing is disabled, performance is back to original)
> 
> The C program compiled into BPF and then JITed into x86 is faster than
> filter_match_preds() approach (199-145 msec vs 302-145 msec)

Great! :)

> tracing+bpf is a tool for safe read-only access to variables without recompiling
> the kernel and without affecting running programs.

Hmm, this feature and trace-event trigger actions can give us
powerful on-the-fly scripting functionality...

> BPF filters can be written manually (see tools/bpf/trace/filter_ex1.c)
> or better compiled from restricted C via GCC or LLVM
> 
> Q: What is the difference between existing BPF and extended BPF?
> A:
> Existing BPF insn from uapi/linux/filter.h
> struct sock_filter {
>         __u16   code;   /* Actual filter code */
>         __u8    jt;     /* Jump true */
>         __u8    jf;     /* Jump false */
>         __u32   k;      /* Generic multiuse field */
> };
> 
> Extended BPF insn from linux/bpf.h
> struct bpf_insn {
>         __u8    code;    /* opcode */
>         __u8    a_reg:4; /* dest register*/
>         __u8    x_reg:4; /* source register */
>         __s16   off;     /* signed offset */
>         __s32   imm;     /* signed immediate constant */
> };
> 
> opcode encoding is the same between old BPF and extended BPF.
> Original BPF has two 32-bit registers.
> Extended BPF has ten 64-bit registers.
> That is the main difference.
> 
> Old BPF was using jt/jf fields for jump-insn only.
> New BPF combines them into generic 'off' field for jump and non-jump insns.
> k==imm field has the same meaning.

Looks very interesting. :)

Thank you!

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-03  9:16 ` [RFC PATCH tip 0/5] tracing filters with BPF Ingo Molnar
@ 2013-12-03 15:33   ` Steven Rostedt
  2013-12-03 18:26     ` Alexei Starovoitov
  2013-12-03 18:06   ` Alexei Starovoitov
  1 sibling, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2013-12-03 15:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexei Starovoitov, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel, Linus Torvalds, Andrew Morton,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi, Pekka Enberg, David S. Miller, Arjan van de Ven,
	Christoph Hellwig

On Tue, 3 Dec 2013 10:16:55 +0100
Ingo Molnar <mingo@kernel.org> wrote:

 
> So, to do the math:
> 
>    tracing               'all' overhead:   95 nsecs per event
>    tracing 'eth5 + old filter' overhead:  157 nsecs per event
>    tracing 'eth5 + BPF filter' overhead:   54 nsecs per event
> 
> So via BPF and a fairly trivial filter, we are able to reduce tracing 
> overhead for real - while old-style filters only add to it.

Yep, seems that BPF can do what I wasn't able to do with the normal
filters. Although, I haven't looked at the code yet, I'm assuming that
the BPF works on the parameters passed into the trace event. The normal
filters can only process the results of the trace (what's being
recorded) not the parameters of the trace event itself. To get what's
recorded, we need to write to the buffer first, and then we decided if
we want to keep the event or not and discard the event from the buffer
if we do not.

That method does not reduce overhead at all, and only adds to it, as
Alexei's tests have shown. The purpose of the filter was not to reduce
overhead, but to reduce filling the buffer with needless data.

It looks as if the BPF filter works on the parameters of the trace
event and not what is written to the buffers (as they can be
different). I've been looking for a way to do just that, and if this
does accomplish it, I'll be very happy :-)

-- Steve

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 3/5] Extended BPF (64-bit BPF) design document
  2013-12-03  4:28 ` [RFC PATCH tip 3/5] Extended BPF (64-bit BPF) design document Alexei Starovoitov
@ 2013-12-03 17:01   ` H. Peter Anvin
  2013-12-03 19:59     ` Alexei Starovoitov
  0 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-12-03 17:01 UTC (permalink / raw)
  To: Alexei Starovoitov, Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel

On 12/02/2013 08:28 PM, Alexei Starovoitov wrote:
> +
> +All BPF registers are 64-bit without subregs, which makes JITed x86 code
> +less optimal, but matches sparc/mips architectures.
> +Adding 32-bit subregs was considered, since JIT can map them to x86 and aarch64
> +nicely, but read-modify-write overhead for sparc/mips is not worth the gains.
> +

I find this tradeoff to be more than somewhat puzzling, given that x86
and ARM are by far the dominant architectures, and it would make
implementation on 32-bit CPUs cheaper if a lot of the operations are 32 bit.

Instead it seems like the niche architectures (which, realistically,
SPARC and MIPS have become) ought to take the performance hit.

Perhaps you are simply misunderstanding the notion of subregisters.
Neither x86 nor ARM64 leave the top 32 bits intact, so I don't see why
SPARC/MIPS would do RMW either.

> +Q: Why extended BPF is 64-bit? Cannot we live with 32-bit?
> +A: On 64-bit architectures, pointers are 64-bit and we want to pass 64-bit
> +values in/out kernel functions, so 32-bit BPF registers would require to define
> +register-pair ABI, there won't be a direct BPF register to HW register
> +mapping and JIT would need to do combine/split/move operations for every
> +register in and out of the function, which is complex, bug prone and slow.
> +Another reason is counters. To use 64-bit counter BPF program would need to do
> +a complex math. Again bug prone and not atomic.

Having EBPF code manipulating pointers - or kernel memory - directly
seems like a nonstarter.  However, per your subsequent paragraph it
sounds like pointers are a special type at which point it shouldn't
matter at the EBPF level how many bytes it takes to represent it?

	-hpa


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-03  9:16 ` [RFC PATCH tip 0/5] tracing filters with BPF Ingo Molnar
  2013-12-03 15:33   ` Steven Rostedt
@ 2013-12-03 18:06   ` Alexei Starovoitov
  2013-12-04  9:34     ` Ingo Molnar
  1 sibling, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-03 18:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, Linus Torvalds, Andrew Morton,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi, Pekka Enberg, David S. Miller, Arjan van de Ven,
	Christoph Hellwig

On Tue, Dec 3, 2013 at 1:16 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> Very cool! (Added various other folks who might be interested in this
> to the Cc: list.)
>
> I have one generic concern:
>
> It would be important to make it easy to extract loaded BPF code from
> the kernel in source code equivalent form, which compiles to the same
> BPF code.
>
> I.e. I think it would be fundamentally important to make sure that
> this is all within the kernel's license domain, to make it very clear
> there can be no 'binary only' BPF scripts.
>
> By up-loading BPF into a kernel the person loading it agrees to make
> that code available to all users of that system who can access it,
> under the same license as the kernel's code (or under a more
> permissive license).
>
> The last thing we want is people getting funny ideas and writing
> drivers in BPF and hiding the code or making license claims over it

all makes sense.
In the case of kernel modules all export_symbols are accessible and the
module has to have a kernel compatible license. The same licensing terms
apply to anything else that interacts with kernel functions.
In the case of BPF the list of accessible functions is tiny, so it's much
easier to enforce a specific limited use case.
For tracing filters it's just bpf_load_xx/trace_printk/dump_stack.
Even if someone has funny ideas they cannot be brought to life, since
drivers need a lot more than this set of functions and the BPF checker
will reject any attempts to call something outside of this tiny list.
imo the same applies to existing BPF as well, meaning that tcpdump
filter strings and seccomp filters, if distributed, have to have their
source code available.

> I.e. we want to allow flexible plugins technologically, but make sure
> people who run into such a plugin can modify and improve it under the
> same license as they can modify and improve the kernel itself!

wow. I guess if the whole thing takes off, we would need an in-kernel
directory to store upstreamed bpf filters as well :)

>> opcode encoding is the same between old BPF and extended BPF.
>> Original BPF has two 32-bit registers.
>> Extended BPF has ten 64-bit registers.
>> That is the main difference.
>>
>> Old BPF was using jt/jf fields for jump-insn only.
>> New BPF combines them into generic 'off' field for jump and non-jump insns.
>> k==imm field has the same meaning.
>
> This only affects the internal JIT representation, not the BPF byte
> code, right?

that is the ebpf vs bpf code difference. The JIT doesn't keep another
representation;
it just converts it to x86.

>>  32 files changed, 3332 insertions(+), 24 deletions(-)
>
> Impressive!
>
> I'm wondering, will the new nftable code in the works make use of the BPF
> JIT as well, or is that a separate implementation?

nft is a much higher level state machine customized for the specific nftable use case.
imo iptables/nftable rules can be compiled into extended bpf.
One needs to define bpf_context and a set of functions to do packet
lookup via bpf_callbacks...
but let's do it one step at a time.

Thanks
Alexei

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-03 15:33   ` Steven Rostedt
@ 2013-12-03 18:26     ` Alexei Starovoitov
  2013-12-04  1:13       ` Masami Hiramatsu
  0 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-03 18:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, Linus Torvalds, Andrew Morton,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi, Pekka Enberg, David S. Miller, Arjan van de Ven,
	Christoph Hellwig

On Tue, Dec 3, 2013 at 7:33 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 3 Dec 2013 10:16:55 +0100
> Ingo Molnar <mingo@kernel.org> wrote:
>
>
>> So, to do the math:
>>
>>    tracing               'all' overhead:   95 nsecs per event
>>    tracing 'eth5 + old filter' overhead:  157 nsecs per event
>>    tracing 'eth5 + BPF filter' overhead:   54 nsecs per event
>>
>> So via BPF and a fairly trivial filter, we are able to reduce tracing
>> overhead for real - while old-style filters only add to it.
>
> Yep, seems that BPF can do what I wasn't able to do with the normal
> filters. Although, I haven't looked at the code yet, I'm assuming that
> the BPF works on the parameters passed into the trace event. The normal
> filters can only process the results of the trace (what's being
> recorded) not the parameters of the trace event itself. To get what's
> recorded, we need to write to the buffer first, and then we decided if
> we want to keep the event or not and discard the event from the buffer
> if we do not.
>
> That method does not reduce overhead at all, and only adds to it, as
> Alexei's tests have shown. The purpose of the filter was not to reduce
> overhead, but to reduce filling the buffer with needless data.

Precisely.
The assumption is that filters will filter out the majority of the events.
So the filter takes pt_regs as input, has to interpret them, and calls
bpf_trace_printk if it really wants to store something for the human to see.
We can extend bpf trace filters to return true/false to indicate whether
the TP_printk format specified as part of the event should be printed as
well, but imo that's unnecessary.
When I was using bpf filters to debug networking bits I didn't need
that printk format of the event. I only used the event as an entry point,
filtering out things and printing different fields vs the initial event -
more like what developers do when they sprinkle
trace_printk/dump_stack through the code while debugging.

The only inconvenience so far is knowing how parameters are getting
into registers:
on x86-64, arg1 is in rdi, arg2 is in rsi, ... I want to improve that
after the first step is done.
In the proposed patches bpf_context == pt_regs at the event entry point.
It would be cleaner to have struct {arg1,arg2,…} as bpf_context instead,
but that needs more code and I wanted to keep the first patch to a
minimum.
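
A purely illustrative sketch of that cleaner layout (not part of these
patches; the number of args and the field types are assumptions here)
could be:

struct bpf_context {
	unsigned long arg1;	/* 1st argument of the probed function */
	unsigned long arg2;
	unsigned long arg3;
	unsigned long arg4;
	unsigned long arg5;
	unsigned long arg6;
};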

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 3/5] Extended BPF (64-bit BPF) design document
  2013-12-03 17:01   ` H. Peter Anvin
@ 2013-12-03 19:59     ` Alexei Starovoitov
  2013-12-03 20:41       ` Frank Ch. Eigler
  0 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-03 19:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel

On Tue, Dec 3, 2013 at 9:01 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 12/02/2013 08:28 PM, Alexei Starovoitov wrote:
>> +
>> +All BPF registers are 64-bit without subregs, which makes JITed x86 code
>> +less optimal, but matches sparc/mips architectures.
>> +Adding 32-bit subregs was considered, since JIT can map them to x86 and aarch64
>> +nicely, but read-modify-write overhead for sparc/mips is not worth the gains.
>> +
>
> I find this tradeoff to be more than somewhat puzzling, given that x86
> and ARM are by far the dominant tradeoffs, and it would make
> implementation on 32-bit CPUs cheaper if a lot of the operations are 32 bit.
>
> Instead it seems like the niche architectures (which, realistically,
> SPARC and MIPS have become) ought to take the performance hit.
>
> Perhaps you are simply misunderstanding the notion of subregisters.
> Neither x86 nor ARM64 leave the top 32 bits intact, so I don't see why
> SPARC/MIPS would do RMW either.

32-bit register write access on x86-64 is of course zero-extended; 16 and 8
bit accesses are not.
If the BPF ISA were to allow 32-bit subregs, it would need to allow int args
to be passed in them as well. I'm not sure yet that the arm64 calling
convention will match x86-64 in that sense.
From the compiler's point of view, if an arg has int32 type and lives in a
32-bit subreg on a 64-bit cpu, the compiler has to access it with 32-bit
subreg ops only. It cannot assume that it was zero extended by the caller.
So all ops on sparc/mips would have to use extra registers and do 64->32 masks.
Also, 32-bit subregs don't help when looking at int variables. In both cases
it will be a 32-bit load, then 'cmp eax' with subregs and 'cmp rax' without
them after JIT.
To increment an atomic 32-bit counter in memory, they don't help either.
They just don't give enough performance boost to justify the complexity in
encoding, analyzing and JITing. So I don't see a viable benefit of 32-bit
subregs yet.

The above arguments apply to 64-bit CPUs with 64-bit registers with or
without 32-bit subregs.

If you're talking about 32-bit CPUs, it's a completely different matter.
If we want to support JIT on them we need a 32-bit bpf isa
(the proposed BPF isa is 64-bit with no effort to make it JITable on 32-bit cpus),
which can reuse all of the same encoding and make all registers 32-bit.
Compiler will produce different bpf code though.
It would know that it cannot load 64-bit value in one insn and will
use register pairs and so on.
Like -m32 / -m64 switch for bpf backend.
Letting the compiler generate the 64-bit BPF isa and then trying to JIT it to
a 32-bit cpu is feasible, but very painful. Such a JIT would be too large
to include in the kernel.
The proposed JIT is short and simple because it maps all registers and
instructions one to one.

So the big question, do we really care about lack of bpf jit on 32-bit cpus?
Considering that ebpf still works on them, but via interpreter (see
bpf_run.c)...
imo that is the same situation as we have today with old bpf.

>> +Q: Why extended BPF is 64-bit? Cannot we live with 32-bit?
>> +A: On 64-bit architectures, pointers are 64-bit and we want to pass 64-bit
>> +values in/out kernel functions, so 32-bit BPF registers would require to define
>> +register-pair ABI, there won't be a direct BPF register to HW register
>> +mapping and JIT would need to do combine/split/move operations for every
>> +register in and out of the function, which is complex, bug prone and slow.
>> +Another reason is counters. To use 64-bit counter BPF program would need to do
>> +a complex math. Again bug prone and not atomic.
>
> Having EBPF code manipulating pointers - or kernel memory - directly
> seems like a nonstarter.  However, per your subsequent paragraph it
> sounds like pointers are a special type at which point it shouldn't
> matter at the EBPF level how many bytes it takes to represent it?

bpf_check() will track every register through every insn.
If pointer is stored in the register, it will know what type
of pointer it is and will allow '*reg' operation only if pointer is valid.
For example, upon entry into bpf program, register R1 will have type ptr_to_ctx.
After JITing it means that 'rdi' has a valid pointer and it points to
'struct bpf_context'.
If bpf code has R1 = R1 + 1 insn, the checker will assign invalid_ptr type to R1
after this insn and memory access via R1 will be rejected by checker.

BPF program actually can manipulate kernel memory directly
when checker guarantees that it is safe to do so :)

For example in tracing filters bpf_context access is restricted to:
static const struct bpf_context_access ctx_access[MAX_CTX_OFF] = {
        [offsetof(struct bpf_context, regs.di)] = {
                FIELD_SIZEOF(struct bpf_context, regs.di),
                BPF_READ
        },

meaning that bpf program can only do 8-byte load from 'rdi + 112'
when rdi still has type ptr_to_ctx. (112 is offset of 'di' field
within bpf_context)
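
As a rough sketch (not the actual bpf_check() code; the helper name and the
'size'/'flags' field names are made up here for illustration), checking a
load through a ptr_to_ctx register against that table boils down to:

static int ctx_access_ok(int off, int size, int type)
{
	const struct bpf_context_access *a;

	if (off < 0 || off >= MAX_CTX_OFF)
		return 0;
	a = &ctx_access[off];
	/* the size must match exactly and the access type
	 * (e.g. BPF_READ) must be allowed for this offset */
	return a->size == size && (a->flags & type);
}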

Direct access is what makes it so efficient and fast. After JITing, the bpf
program is pretty close to natively compiled code. C->bpf->x86 is
quite close to C->x86. (talking about x86_64 of course)

Over the course of development bpf_check() found several compiler bugs.
I also tried all sorts of ways to break the bpf jail from inside of a
bpf program, but so far the checker catches everything I was able to throw
at it.

btw, tools/bpf/trace/trace_filter_check.c is a user space program that
links kernel/bpf_jit/bpf_check.o to make it easier to debug/understand
how bpf_check() is working.
It's there to do the same check as kernel will do while loading, but
doing it in userspace.
So it's faster to get an answer whether bpf filter is safe or not.
Examples are in the same tools/bpf/trace/ dir.

Thank you so much for review! Really appreciate the feedback.

Regards,
Alexei

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 3/5] Extended BPF (64-bit BPF) design document
  2013-12-03 19:59     ` Alexei Starovoitov
@ 2013-12-03 20:41       ` Frank Ch. Eigler
  2013-12-03 21:31         ` Alexei Starovoitov
  0 siblings, 1 reply; 65+ messages in thread
From: Frank Ch. Eigler @ 2013-12-03 20:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: H. Peter Anvin, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel


Alexei Starovoitov <ast@plumgrid.com> writes:

> [...]
>> Having EBPF code manipulating pointers - or kernel memory - directly
>> seems like a nonstarter.  However, per your subsequent paragraph it
>> sounds like pointers are a special type at which point it shouldn't
>> matter at the EBPF level how many bytes it takes to represent it?
>
> bpf_check() will track every register through every insn.
> If pointer is stored in the register, it will know what type
> of pointer it is and will allow '*reg' operation only if pointer is valid.
> [...]
> BPF program actually can manipulate kernel memory directly
> when checker guarantees that it is safe to do so :)

It sounds like this sort of static analysis would have difficulty with
situations such as:

- multiple levels of indirection

- conditionals (where it can't trace a unique data/type flow for all pointers)

- aliasing (same reason)

- the possibility of bad (or userspace?) pointers arriving as
  parameters from the underlying trace events
  

> For example in tracing filters bpf_context access is restricted to:
> static const struct bpf_context_access ctx_access[MAX_CTX_OFF] = {
>         [offsetof(struct bpf_context, regs.di)] = {
>                 FIELD_SIZEOF(struct bpf_context, regs.di),
>                 BPF_READ
>         },

Are such constraints to be hard-coded in the kernel?


> Over course of development bpf_check() found several compiler bugs.
> I also tried all of sorts of ways to break bpf jail from inside of a
> bpf program, but so far checker catches everything I was able to throw
> at it.

(One can be sure that attackers will chew hard on this interface,
should it become reasonably accessible to userspace, so good job
starting to check carefully!)


- FChE

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 3/5] Extended BPF (64-bit BPF) design document
  2013-12-03 20:41       ` Frank Ch. Eigler
@ 2013-12-03 21:31         ` Alexei Starovoitov
  2013-12-04  9:24           ` Ingo Molnar
  0 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-03 21:31 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: H. Peter Anvin, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Tue, Dec 3, 2013 at 12:41 PM, Frank Ch. Eigler <fche@redhat.com> wrote:
>
> Alexei Starovoitov <ast@plumgrid.com> writes:
>
>> [...]
>>> Having EBPF code manipulating pointers - or kernel memory - directly
>>> seems like a nonstarter.  However, per your subsequent paragraph it
>>> sounds like pointers are a special type at which point it shouldn't
>>> matter at the EBPF level how many bytes it takes to represent it?
>>
>> bpf_check() will track every register through every insn.
>> If pointer is stored in the register, it will know what type
>> of pointer it is and will allow '*reg' operation only if pointer is valid.
>> [...]
>> BPF program actually can manipulate kernel memory directly
>> when checker guarantees that it is safe to do so :)
>
> It sounds like this sort of static analysis would have difficulty with
> situations such as:

Yes, good points.
It's a simple analysis, with a minimal number of lines in the kernel,
whose goal is to cover the cases where performance matters.
But the cases you mentioned are supported by bpf_load_pointer() in
bpf_trace_callbacks.c

> - multiple levels of indirection

In tools/bpf/trace/filter_ex1_orig.c, when the filter reads the first
argument of the event it assumes it got an 'skb' pointer.
If it just type casts it and does 'skb->dev', this memory load will be
rejected by bpf_check().
Instead it does bpf_load_pointer(), which will bring in the value if the
pointer is not faulting.
It may be reading junk, and the subsequent bpf_memcmp(dev->name, "name")
may be reading something even more bogus.
But it won't crash the kernel, and if the filter is applied to the correct
event, the 'skb' pointer will indeed be a 'struct sk_buff', 'dev' will
indeed be pointing to a 'struct net_device' and 'dev->name' will be the
actual name.
It's possible to make the first level of indirection faster by making the
static analyzer more complex, but imo it's not worth it for one level.
It's possible to teach it multi-level, but then the analyzer will
become too large and won't be suitable for the kernel.
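
To make the contrast concrete (based on filter_ex1_orig.c; the direct
dereference line is added here only to show what would be rejected):

	skb = (struct sk_buff *)ctx->regs.si;
	dev = skb->dev;				/* direct load: rejected by bpf_check() */
	dev = bpf_load_pointer(&skb->dev);	/* fault-safe load: accepted */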

> - conditionals (where it can't trace a unique data/type flow for all pointers)

Some conditionals where performance matters are supported.
For example, bpf_table_lookup() returns either a valid pointer to a table
element or null, and bpf_check() assigns the pointer type
'ptr_to_table_conditional' until the bpf program does 'reg != null'; in the
true branch the register changes type to 'ptr_to_table', so that the bpf
program can access the table element for both read and write. See
filter_ex2_orig.c, where 'leaf->packet_cnt' is a direct memory load from a
pointer that points to a table element.
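
The corresponding lines in filter_ex2_orig.c are:

	leaf = bpf_table_lookup(ctx, 0, &key);	/* R0 is ptr_to_table_conditional here */
	if (leaf)				/* null check required by the checker */
		__sync_fetch_and_add(&leaf->packet_cnt, 1);	/* now ptr_to_table */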

> - aliasing (same reason)

not sure what you mean here.

> - the possibility of bad (or userspace?) pointers arriving as
>   parameters from the underlying trace events

solved same way as case #1

>> For example in tracing filters bpf_context access is restricted to:
>> static const struct bpf_context_access ctx_access[MAX_CTX_OFF] = {
>>         [offsetof(struct bpf_context, regs.di)] = {
>>                 FIELD_SIZEOF(struct bpf_context, regs.di),
>>                 BPF_READ
>>         },
>
> Are such constraints to be hard-coded in the kernel?

I'm hoping that BPF framework will be used by different kernel subsystems.
In case of trace filters 'bpf_context' will be defined once and
hard-coded within kernel.
So all tracing filters will have the same input interface.
For other subsystems bpf_context may mean something different with its
own access rights.
So bpf filters that load fine as tracing filters will be rejected when
loaded into another bpf subsystem.
Theoretically a subsystem may define bpf_context dynamically, like its
own bpf_context for every trace event, but that's overkill.

>> Over course of development bpf_check() found several compiler bugs.
>> I also tried all of sorts of ways to break bpf jail from inside of a
>> bpf program, but so far checker catches everything I was able to throw
>> at it.
>
> (One can be sure that attackers will chew hard on this interface,
> should it become reasonably accessible to userspace, so good job
> starting to check carefully!)

I think for quite some time root privileges would be needed to load
these filters,
but eventually it would be useful to let non-root users insert them as well.

Thank you for feedback!

Regards,
Alexei

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-03  4:28 [RFC PATCH tip 0/5] tracing filters with BPF Alexei Starovoitov
                   ` (6 preceding siblings ...)
  2013-12-03 10:34 ` Masami Hiramatsu
@ 2013-12-04  0:01 ` Andi Kleen
  2013-12-04  3:09   ` Alexei Starovoitov
  2013-12-05 16:31   ` Frank Ch. Eigler
  7 siblings, 2 replies; 65+ messages in thread
From: Andi Kleen @ 2013-12-04  0:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

Alexei Starovoitov <ast@plumgrid.com> writes:

Can you do a performance comparison with e.g. ktap?
How much faster is it?

While it sounds interesting, I would strongly advise to make this
capability only available to root. Traditionally lots of complex byte
code languages which were designed to be "safe" and verifiable weren't
really. e.g. I managed to crash things with "safe" systemtap multiple
times. And we all know what happened to Java.

So the likelihood of this having some hole somewhere (either in
the byte code or in some library function) is high.
 
-Andi 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 5/5] tracing filter examples in BPF
  2013-12-03  4:28 ` [RFC PATCH tip 5/5] tracing filter examples in BPF Alexei Starovoitov
@ 2013-12-04  0:35   ` Jonathan Corbet
  2013-12-04  1:21     ` Alexei Starovoitov
  0 siblings, 1 reply; 65+ messages in thread
From: Jonathan Corbet @ 2013-12-04  0:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Mon,  2 Dec 2013 20:28:50 -0800
Alexei Starovoitov <ast@plumgrid.com> wrote:

>  GCC-BPF backend is available on github
>  (since gcc plugin infrastructure doesn't allow for out-of-tree backends)

Do you have a pointer to where this backend can be found?  I've
done a bit of digging around but seem to be unable to find it...

Thanks,

jon

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 4/5] use BPF in tracing filters
  2013-12-03  4:28 ` [RFC PATCH tip 4/5] use BPF in tracing filters Alexei Starovoitov
@ 2013-12-04  0:48   ` Masami Hiramatsu
  2013-12-04  1:11     ` Steven Rostedt
  0 siblings, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2013-12-04  0:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, yrl.pp-manager.tt

(2013/12/03 13:28), Alexei Starovoitov wrote:
> Such filters can be written in C and allow safe read-only access to
> any kernel data structure.
> Like systemtap but with safety guaranteed by kernel.
> 
> The user can do:
> cat bpf_program > /sys/kernel/debug/tracing/.../filter
> if tracing event is either static or dynamic via kprobe_events.
> 
> The program can be anything as long as bpf_check() can verify its safety.
> For example, the user can create kprobe_event on dst_discard()
> and use logically following code inside BPF filter:
>       skb = (struct sk_buff *)ctx->regs.di;
>       dev = bpf_load_pointer(&skb->dev);
> to access 'struct net_device'
> Since its prototype is 'int dst_discard(struct sk_buff *skb);'
> 'skb' pointer is in 'rdi' register on x86_64
> bpf_load_pointer() will try to fetch 'dev' field of 'sk_buff'
> structure and will suppress page-fault if pointer is incorrect.

Hmm, I doubt this is a good way to integrate with ftrace.
I would prefer to use this to replace the current ftrace filter,
fetch functions and actions. In that case, we can continue
to use the current interface but with much faster tracing.
Also, we can see what filter/arguments/actions are set
on each event.

Thank you,


-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 4/5] use BPF in tracing filters
  2013-12-04  0:48   ` Masami Hiramatsu
@ 2013-12-04  1:11     ` Steven Rostedt
  2013-12-05  0:05       ` Masami Hiramatsu
  0 siblings, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2013-12-04  1:11 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Alexei Starovoitov, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, yrl.pp-manager.tt

On Wed, 04 Dec 2013 09:48:44 +0900
Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:

> (2013/12/03 13:28), Alexei Starovoitov wrote:
> > Such filters can be written in C and allow safe read-only access to
> > any kernel data structure.
> > Like systemtap but with safety guaranteed by kernel.
> > 
> > The user can do:
> > cat bpf_program > /sys/kernel/debug/tracing/.../filter
> > if tracing event is either static or dynamic via kprobe_events.
> > 
> > The program can be anything as long as bpf_check() can verify its safety.
> > For example, the user can create kprobe_event on dst_discard()
> > and use logically following code inside BPF filter:
> >       skb = (struct sk_buff *)ctx->regs.di;
> >       dev = bpf_load_pointer(&skb->dev);
> > to access 'struct net_device'
> > Since its prototype is 'int dst_discard(struct sk_buff *skb);'
> > 'skb' pointer is in 'rdi' register on x86_64
> > bpf_load_pointer() will try to fetch 'dev' field of 'sk_buff'
> > structure and will suppress page-fault if pointer is incorrect.
> 
> Hmm, I doubt it is a good way to integrate with ftrace.
> I prefer to use this for replacing current ftrace filter,

I'm not sure how we can do that. Especially since the bpf is very arch
specific, and the current filters work for all archs.

> fetch functions and actions. In that case, we can continue
> to use current interface but much faster to trace.
> Also, we can see what filter/arguments/actions are set
> on each event.

There's also the problem that the current filters work with the results
of what is written to the buffer, not what is passed in by the trace
point, as that isn't even displayed to the user.

For example, sched_switch gets passed struct task_struct *prev and
*next; from those we save prev_comm, prev_pid, prev_prio, prev_state,
next_comm, next_prio and next_state. These are expressed to the user
by the format file of the event:

	field:char prev_comm[32];	offset:16;	size:16;	signed:1;
	field:pid_t prev_pid;	offset:32;	size:4;	signed:1;
	field:int prev_prio;	offset:36;	size:4;	signed:1;
	field:long prev_state;	offset:40;	size:8;	signed:1;
	field:char next_comm[32];	offset:48;	size:16;	signed:1;
	field:pid_t next_pid;	offset:64;	size:4;	signed:1;
	field:int next_prio;	offset:68;	size:4;	signed:1;

And the filters can check "next_prio > 10" and what not. The bpf
program needs to access next->prio. There's nothing that shows the user
what is passed to the tracepoint, and from that, what structure member
to use from there. The user would be required to look at the source
code of the given kernel - a requirement not needed by the current
implementation.

Also, there are results that cannot be trivially converted. Taking a
quick look at some TRACE_EVENT() structures, I found bcache_bio, which
has this:

        TP_fast_assign(
                __entry->dev            = bio->bi_bdev->bd_dev;
                __entry->sector         = bio->bi_sector;
                __entry->nr_sector      = bio->bi_size >> 9;
                blk_fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
        ),

Where the blk_fill_rwbs() updates the status of the entry->rwbs based
on the bi_rw field. A filter must remain backward compatible with
something like:

	rwbs == "w"  or rwbs =~ '*w*'


Now maybe we can make the filter code use some of the bpf if possible,
but to get the result, it still needs to write to the ring buffer, and
discard it if it is incorrect. Which will not make it any faster than
the original trace, but perhaps faster than the trace + current filter.

The speed up that was shown was because we were processing the
parameters of the trace point and not the result. That currently
requires the user to have full access to the source of the kernel they
are tracing.

-- Steve

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-03 18:26     ` Alexei Starovoitov
@ 2013-12-04  1:13       ` Masami Hiramatsu
  2013-12-09  7:29         ` Namhyung Kim
  0 siblings, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2013-12-04  1:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, Linus Torvalds, Andrew Morton,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi, Pekka Enberg, David S. Miller, Arjan van de Ven,
	Christoph Hellwig, Oleg Nesterov, namhyung

(2013/12/04 3:26), Alexei Starovoitov wrote:
> On Tue, Dec 3, 2013 at 7:33 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> On Tue, 3 Dec 2013 10:16:55 +0100
>> Ingo Molnar <mingo@kernel.org> wrote:
>>
>>
>>> So, to do the math:
>>>
>>>    tracing               'all' overhead:   95 nsecs per event
>>>    tracing 'eth5 + old filter' overhead:  157 nsecs per event
>>>    tracing 'eth5 + BPF filter' overhead:   54 nsecs per event
>>>
>>> So via BPF and a fairly trivial filter, we are able to reduce tracing
>>> overhead for real - while old-style filters add to it.
>>
>> Yep, seems that BPF can do what I wasn't able to do with the normal
>> filters. Although, I haven't looked at the code yet, I'm assuming that
>> the BPF works on the parameters passed into the trace event. The normal
>> filters can only process the results of the trace (what's being
>> recorded) not the parameters of the trace event itself. To get what's
>> recorded, we need to write to the buffer first, and then we decided if
>> we want to keep the event or not and discard the event from the buffer
>> if we do not.
>>
>> That method does not reduce overhead at all, and only adds to it, as
>> Alexei's tests have shown. The purpose of the filter was not to reduce
>> overhead, but to reduce filling the buffer with needless data.
> 
> Precisely.
> Assumption is that filters will filter out majority of the events.
> So filter takes pt_regs as input, has to interpret them and call
> bpf_trace_printk
> if it really wants to store something for the human to see.
> We can extend bpf trace filters to return true/false to indicate
> whether TP_printk-format
> specified as part of the event should be printed as well, but imo
> that's unnecessary.
> When I was using bpf filters to debug networking bits I didn't need
> that printk format of the event. I only used event as an entry point,
> filtering out things and printing different fields vs initial event.
> More like what developers do when they sprinkle
> trace_printk/dump_stack through the code while debugging.
> 
> the only inconvenience so far is to know how parameters are getting
> into registers.
> on x86-64, arg1 is in rdi, arg2 is in rsi,... I want to improve that
> after first step is done.

Actually, that part is done by the perf-probe and ftrace dynamic events
(kernel/trace/trace_probe.c). I think this generic BPF is good for
re-implementing fetch methods. :)

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 5/5] tracing filter examples in BPF
  2013-12-04  0:35   ` Jonathan Corbet
@ 2013-12-04  1:21     ` Alexei Starovoitov
  0 siblings, 0 replies; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-04  1:21 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Tue, Dec 3, 2013 at 4:35 PM, Jonathan Corbet <corbet@lwn.net> wrote:
> On Mon,  2 Dec 2013 20:28:50 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>>  GCC-BPF backend is available on github
>>  (since gcc plugin infrastructure doesn't allow for out-of-tree backends)
>
> Do you have a pointer to where this backend can be found?  I've
> done a bit of digging around but seem to be unable to find it…

here it is:
https://github.com/iovisor/bpf_gcc/commit/9e7223f8f09c822ecc6e18309e89a574a23dbf63

It's quite small compared to a normal backend:
 config.me                          |    3 +
 config.sub                         |   12 +-
 gcc/common/config/bpf/bpf-common.c |   39 ++
 gcc/config.gcc                     |   18 +
 gcc/config/bpf/bpf-modes.def       |   22 +
 gcc/config/bpf/bpf-protos.h        |   60 ++
 gcc/config/bpf/bpf.c               | 1063 ++++++++++++++++++++++++++++++++++++
 gcc/config/bpf/bpf.h               |  746 +++++++++++++++++++++++++
 gcc/config/bpf/bpf.md              |  895 ++++++++++++++++++++++++++++++
 gcc/config/bpf/linux.h             |   67 +++
 10 files changed, 2922 insertions(+), 3 deletions(-)

Instruction emission is in bpf.md in functions like:
(define_insn "adddi3"
  [(set (match_operand:DI 0 "int_reg_operand" "=r, r")
  (plus:DI (match_operand:DI 1 "int_reg_operand" "0, 0")
    (match_operand:DI 2 "int_reg_or_const_operand" "r, K")))]
  ""
  "@
   BPF_INSN_ALU(BPF_ADD, %0, %2), // %0 += %2
   BPF_INSN_ALU_IMM(BPF_ADD, %0, %2), // %0 += %2"
[(set_attr "type" "arith,arith") ])

So it takes C and emits BPF_* macros from include/linux/bpf.h.
The tools/bpf/trace/ examples were generated by it.
Eventually it will be changed to emit binary hex directly, but macros
are much easier to understand for now.

Thanks
Alexei

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-04  0:01 ` Andi Kleen
@ 2013-12-04  3:09   ` Alexei Starovoitov
  2013-12-05  4:40     ` Alexei Starovoitov
  2013-12-05 16:31   ` Frank Ch. Eigler
  1 sibling, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-04  3:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Alexei Starovoitov <ast@plumgrid.com> writes:
>
> Can you do some performance comparison compared to e.g. ktap?
> How much faster is it?

imo the most interesting ktap scripts (like kmalloc-top.kp) need
tables and timers.
Tables are almost ready for prime time, but timers I prefer to keep
out of the kernel.
I would like the bpf filter to fill tables with interesting data in the
kernel, up to a predefined limit, and to periodically read and clear the
tables from userspace.
This way I will be able to do nettop.stp and iotop.stp like programs.
So I'm still thinking about what a clean kernel/user interface for
bpf-defined tables should be.
The format of the keys and elements of a table is defined within the bpf
program. During load of the bpf program, the tables are allocated and the
bpf program can then do lookups/updates in them. At the same time a
corresponding userspace program can read the tables of this particular
bpf program over netlink.
Creating its own debugfs files for every filter feels too slow and
feature-limited, since files are an all-or-nothing interface. Netlink
access to bpf tables feels cleaner. Userspace will use libmnl to
access them. Other ideas?
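
A rough filter-side sketch of what I have in mind (the struct layout, the
table id and the bpf_table_lookup()/bpf_table_update() helpers are all
placeholder names; nothing below exists in the posted patches):

/* keys and values are plain structs declared inside the filter */
struct alloc_key {
        long call_site;
};

struct alloc_val {
        long bytes;
        long count;
};

void filter(struct bpf_context *ctx)
{
        struct alloc_key key = { .call_site = (long)ctx->regs.di };
        struct alloc_val init = { .bytes = 0, .count = 0 };
        struct alloc_val *val;

        val = bpf_table_lookup(1 /* table id */, &key);
        if (!val)
                val = bpf_table_update(1, &key, &init); /* insert up to the limit */
        if (val) {
                val->bytes += (long)ctx->regs.si;       /* e.g. allocation size */
                val->count++;
        }
}

A userspace tool would then periodically read and clear table id 1 of this
program over netlink (via libmnl) to produce kmalloc-top style output.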

In the meantime I'll do a simple
  trace probe:xx { print }
performance test…

> While it sounds interesting, I would strongly advise to make this
> capability only available to root. Traditionally lots of complex byte
> code languages which were designed to be "safe" and verifiable weren't
> really. e.g. i managed to crash things with "safe" systemtap multiple
> times. And we all know what happened to Java.
>
> So the likelihood of this having some hole somewhere (either in
> the byte code or in some library function) is high.

Tracing filters are for root only today and should stay this way.
As far as the safety of bpf goes… hard to argue with the systemtap point ;)
Though existing bpf is generally accepted to be safe,
extended bpf needs time to prove itself.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 3/5] Extended BPF (64-bit BPF) design document
  2013-12-03 21:31         ` Alexei Starovoitov
@ 2013-12-04  9:24           ` Ingo Molnar
  0 siblings, 0 replies; 65+ messages in thread
From: Ingo Molnar @ 2013-12-04  9:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Frank Ch. Eigler, H. Peter Anvin, Steven Rostedt, Peter Zijlstra,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel


* Alexei Starovoitov <ast@plumgrid.com> wrote:

> It's possible to teach it for multi-level, but then analyzer will 
> become too large and won't be suitable for kernel.

Btw., even if we want to start simple with most things, the above 
statement is not actually true in the broad sense: the constraint for 
the kernel is utility, not complexity.

We have various kinds of highly complex code in the kernel and can 
deal with it just fine. Throwing away useful ideas just because they 
seem too complex at first sight is almost always wrong.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-03 18:06   ` Alexei Starovoitov
@ 2013-12-04  9:34     ` Ingo Molnar
  2013-12-04 17:36       ` Alexei Starovoitov
  0 siblings, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2013-12-04  9:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, Linus Torvalds, Andrew Morton,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi, Pekka Enberg, David S. Miller, Arjan van de Ven,
	Christoph Hellwig


* Alexei Starovoitov <ast@plumgrid.com> wrote:

> On Tue, Dec 3, 2013 at 1:16 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > Very cool! (Added various other folks who might be interested in 
> > this to the Cc: list.)
> >
> > I have one generic concern:
> >
> > It would be important to make it easy to extract loaded BPF code 
> > from the kernel in source code equivalent form, which compiles to 
> > the same BPF code.
> >
> > I.e. I think it would be fundamentally important to make sure that 
> > this is all within the kernel's license domain, to make it very 
> > clear there can be no 'binary only' BPF scripts.
> >
> > By up-loading BPF into a kernel the person loading it agrees to 
> > make that code available to all users of that system who can 
> > access it, under the same license as the kernel's code (or under a 
> > more permissive license).
> >
> > The last thing we want is people getting funny ideas and writing 
> > drivers in BPF and hiding the code or making license claims over 
> > it
> 
> all makes sense. In case of kernel modules all export_symbols are 
> accessible and module has to have kernel compatible license. Same 
> licensing terms apply to anything else that interacts with kernel 
> functions. In case of BPF the list of accessible functions is tiny, 
> so it's much easier to enforce specific limited use case. For 
> tracing filters it's just bpf_load_xx/trace_printk/dump_stack. Even 
> if someone has funny ideas they cannot be brought to life, since 
> drivers need a lot more than this set of functions and BPF checker 
> will reject any attempts to call something outside of this tiny 
> list. imo the same applies to existing BPF as well. Meaning that 
> tcpdump filter string and seccomp filters, if distributed, has to 
> have their source code available.

I mean more than that, I mean the licensing of BPF filters a user can 
find on his own system's kernel should be very clear: by the act of 
loading a BPF script into the kernel the user doing the 'upload' gives 
permission for it to be redistributed on kernel-compatible license 
terms.

The easiest way to achieve that is to make sure that all loaded BPF 
scripts are 'registered' and are dumpable, viewable and reusable. 
That's good for debugging and it's good for transparency.

This means a minimal BPF decoder will have to be in the kernel as 
well, but that's OK, we actually have several x86 instruction decoders 
in the kernel already, so there's no complexity threshold.

> > I.e. we want to allow flexible plugins technologically, but make 
> > sure people who run into such a plugin can modify and improve it 
> > under the same license as they can modify and improve the kernel 
> > itself!
> 
> wow. I guess if the whole thing takes off, we would need an 
> in-kernel directory to store upstreamed bpf filters as well :)

I see no reason why not, but more importantly all currently loaded BPF 
scripts should be dumpable, displayable and reusable in a kernel 
license compatible fashion.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-04  9:34     ` Ingo Molnar
@ 2013-12-04 17:36       ` Alexei Starovoitov
  2013-12-05 10:38         ` Ingo Molnar
  0 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-04 17:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, Linus Torvalds, Andrew Morton,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi, Pekka Enberg, David S. Miller, Arjan van de Ven,
	Christoph Hellwig

On Wed, Dec 4, 2013 at 1:34 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> On Tue, Dec 3, 2013 at 1:16 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> >
>> > Very cool! (Added various other folks who might be interested in
>> > this to the Cc: list.)
>> >
>> > I have one generic concern:
>> >
>> > It would be important to make it easy to extract loaded BPF code
>> > from the kernel in source code equivalent form, which compiles to
>> > the same BPF code.
>> >
>> > I.e. I think it would be fundamentally important to make sure that
>> > this is all within the kernel's license domain, to make it very
>> > clear there can be no 'binary only' BPF scripts.
>> >
>> > By up-loading BPF into a kernel the person loading it agrees to
>> > make that code available to all users of that system who can
>> > access it, under the same license as the kernel's code (or under a
>> > more permissive license).
>> >
>> > The last thing we want is people getting funny ideas and writing
>> > drivers in BPF and hiding the code or making license claims over
>> > it
>>
>> all makes sense. In case of kernel modules all export_symbols are
>> accessible and module has to have kernel compatible license. Same
>> licensing terms apply to anything else that interacts with kernel
>> functions. In case of BPF the list of accessible functions is tiny,
>> so it's much easier to enforce specific limited use case. For
>> tracing filters it's just bpf_load_xx/trace_printk/dump_stack. Even
>> if someone has funny ideas they cannot be brought to life, since
>> drivers need a lot more than this set of functions and BPF checker
>> will reject any attempts to call something outside of this tiny
>> list. imo the same applies to existing BPF as well. Meaning that
>> tcpdump filter string and seccomp filters, if distributed, has to
>> have their source code available.
>
> I mean more than that, I mean the licensing of BPF filters a user can
> find on his own system's kernel should be very clear: by the act of
> loading a BPF script into the kernel the user doing the 'upload' gives
> permission for it to be redistributed on kernel-compatible license
> terms.
>
> The easiest way to achieve that is to make sure that all loaded BPF
> scripts are 'registered' and are dumpable, viewable and reusable.
> That's good for debugging and it's good for transparency.
>
> This means a minimal BPF decoder will have to be in the kernel as
> well, but that's OK, we actually have several x86 instruction decoders
> in the kernel already, so there's no complexity threshold.

sure. there is pr_info_bpf_insn() in bpf_run.c that dumps bpf insn in
human readable format.
I'll hook it up to trace_seq, so that "cat
/sys/kernel/debug/.../filter" will dump it.

Also I'm thinking to add 'license_string' section to bpf binary format
and call license_is_gpl_compatible() on it during load.
If false, then just reject it…. not even messing with taint flags...
That would be way stronger indication of bpf licensing terms than what
we have for .ko
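
Roughly along these lines (sketch only; trace_seq_printf() is the only
existing kernel API here, the bpf_* names and fields stand in for whatever
pr_info_bpf_insn() gets refactored into):

static void bpf_dump_prog(struct trace_seq *s, struct bpf_program *prog)
{
        int i;

        /* one instruction per line, same output as pr_info_bpf_insn() */
        for (i = 0; i < prog->insn_cnt; i++)
                bpf_print_insn(s, &prog->insns[i]);

        trace_seq_printf(s, "license: %s\n", prog->license);
}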

>> wow. I guess if the whole thing takes off, we would need an
>> in-kernel directory to store upstreamed bpf filters as well :)
>
> I see no reason why not, but more importantly all currently loaded BPF
> scripts should be dumpable, displayable and reusable in a kernel
> license compatible fashion.

ok. will add global bpf list as well (was hesitating to do something
like this because of central lock)
and something in debugfs that dumps bodies of all currently loaded filters.

Will that solve the concern?

Thanks
Alexei

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 4/5] use BPF in tracing filters
  2013-12-04  1:11     ` Steven Rostedt
@ 2013-12-05  0:05       ` Masami Hiramatsu
  2013-12-05  5:11         ` Alexei Starovoitov
  0 siblings, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2013-12-05  0:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Alexei Starovoitov, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, yrl.pp-manager.tt

(2013/12/04 10:11), Steven Rostedt wrote:
> On Wed, 04 Dec 2013 09:48:44 +0900
> Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:
> 
>> (2013/12/03 13:28), Alexei Starovoitov wrote:
>>> Such filters can be written in C and allow safe read-only access to
>>> any kernel data structure.
>>> Like systemtap but with safety guaranteed by kernel.
>>>
>>> The user can do:
>>> cat bpf_program > /sys/kernel/debug/tracing/.../filter
>>> if tracing event is either static or dynamic via kprobe_events.
>>>
>>> The program can be anything as long as bpf_check() can verify its safety.
>>> For example, the user can create kprobe_event on dst_discard()
>>> and use logically following code inside BPF filter:
>>>       skb = (struct sk_buff *)ctx->regs.di;
>>>       dev = bpf_load_pointer(&skb->dev);
>>> to access 'struct net_device'
>>> Since its prototype is 'int dst_discard(struct sk_buff *skb);'
>>> 'skb' pointer is in 'rdi' register on x86_64
>>> bpf_load_pointer() will try to fetch 'dev' field of 'sk_buff'
>>> structure and will suppress page-fault if pointer is incorrect.
>>
>> Hmm, I doubt it is a good way to integrate with ftrace.
>> I prefer to use this for replacing current ftrace filter,
> 
> I'm not sure how we can do that. Especially since the bpf is very arch
> specific, and the current filters work for all archs.

My idea is to use BPF as an arch-specific optimization for the
ftrace filter. On other archs, the filter works with the current
code. So ftrace keeps holding the filter_preds and compiles them into
BPF bytecode where possible.
And this backend optimization can also be done for the fetch methods.
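
Roughly, in illustrative pseudo-code (none of these names or fields are
taken from the posted patches): keep the current filter syntax, but on
arches with a BPF JIT translate each predicate into BPF instead of calling
filter_match_preds() at runtime.

static int filter_pred_to_bpf(struct filter_pred *pred, struct bpf_insn *insn)
{
        if (pred->op != OP_EQ)          /* only a trivial case in this sketch */
                return -ENOTSUPP;       /* fall back to filter_match_preds() */

        /* load the field from the recorded entry, then jump to the
         * 'discard' label when it doesn't match; both macros are made up
         */
        insn[0] = BPF_INSN_LD_FIELD(pred->field);
        insn[1] = BPF_INSN_JNE(pred->val, LABEL_DISCARD);
        return 2;
}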

>> fetch functions and actions. In that case, we can continue
>> to use current interface but much faster to trace.
>> Also, we can see what filter/arguments/actions are set
>> on each event.
> 
> There's also the problem that the current filters work with the results
> of what is written to the buffer, not what is passed in by the trace
> point, as that isn't even displayed to the user.

Agreed, that's why I said I doubt this implementation is in a good
shape to integrate. The ktap style is better, since it just gets
parameters from the perf buffer entry (using the event format).

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-04  3:09   ` Alexei Starovoitov
@ 2013-12-05  4:40     ` Alexei Starovoitov
  2013-12-05 10:41       ` Ingo Molnar
                         ` (4 more replies)
  0 siblings, 5 replies; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-05  4:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>
>> Can you do some performance comparison compared to e.g. ktap?
>> How much faster is it?

Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
trace skb:kfree_skb {
        if (arg2 == 0x100) {
                printf("%x %x\n", arg1, arg2)
        }
}
1M skb alloc/free 350315 (usecs)

baseline without any tracing:
1M skb alloc/free 145400 (usecs)

then equivalent bpf test:
void filter(struct bpf_context *ctx)
{
        void *loc = (void *)ctx->regs.dx;
        if (loc == 0x100) {
                struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
                char fmt[] = "skb %p loc %p\n";
                bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
        }
}
1M skb alloc/free 183214 (usecs)

so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145

obviously ktap is an interpreter, so it's not really fair.

To make it really unfair I did:
trace skb:kfree_skb {
        if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
            arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
            arg2 == 0x900 || arg2 == 0x1000) {
                printf("%x %x\n", arg1, arg2)
        }
}
1M skb alloc/free 484280 (usecs)

and corresponding bpf:
void filter(struct bpf_context *ctx)
{
        void *loc = (void *)ctx->regs.dx;
        if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 ||
            loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 ||
            loc == 0x900 || loc == 0x1000) {
                struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
                char fmt[] = "skb %p loc %p\n";
                bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
        }
}
1M skb alloc/free 185660 (usecs)

the difference is bigger now: 484-145 vs 185-145

9 extra 'if' conditions for bpf is almost nothing, since they
translate into 18 new x86 instructions after JITing, but for
interpreter it's obviously costly.

Why 0x100 instead of 0x1? To make sure that the compiler doesn't optimize
the chain into '<' / '>' range checks.
Otherwise it's really really unfair.

ktap is a nice tool. Great job Jovi!
I noticed that it doesn't always clear created kprobes after run and I
see a bunch of .../tracing/events/ktap_kprobes_xxx, but that's a minor
thing.

Thanks
Alexei

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 4/5] use BPF in tracing filters
  2013-12-05  0:05       ` Masami Hiramatsu
@ 2013-12-05  5:11         ` Alexei Starovoitov
  2013-12-06  8:43           ` Masami Hiramatsu
  0 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-05  5:11 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, yrl.pp-manager.tt

On Wed, Dec 4, 2013 at 4:05 PM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
> (2013/12/04 10:11), Steven Rostedt wrote:
>> On Wed, 04 Dec 2013 09:48:44 +0900
>> Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:
>>
>>> fetch functions and actions. In that case, we can continue
>>> to use current interface but much faster to trace.
>>> Also, we can see what filter/arguments/actions are set
>>> on each event.
>>
>> There's also the problem that the current filters work with the results
>> of what is written to the buffer, not what is passed in by the trace
>> point, as that isn't even displayed to the user.
>
> Agreed, that's why I said I doubt this implementation is in a good
> shape to integrate. The ktap style is better, since it just gets
> parameters from the perf buffer entry (using the event format).

Are you saying to always store all arguments into the ring buffer and let
the filter run on that?
It's slower, but it's cleaner because it's human readable? Since ktap's
arg1 matching the first argument of the tracepoint is better than doing
ctx->regs.di? Sure, the si->arg1 rename is easy to do.
With the si->arg1 tweak the bpf code will become architecture independent: it
will run through the JIT on x86 and through the interpreter everywhere else.
But for kprobes the user has to specify 'var=cpu_register' during probe
creation… how is that better than doing the same in the filter?
I'm open to suggestions on how to improve the usability.

Thanks
Alexei

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-04 17:36       ` Alexei Starovoitov
@ 2013-12-05 10:38         ` Ingo Molnar
  2013-12-06  5:43           ` Alexei Starovoitov
  0 siblings, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2013-12-05 10:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, Linus Torvalds, Andrew Morton,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi, Pekka Enberg, David S. Miller, Arjan van de Ven,
	Christoph Hellwig


* Alexei Starovoitov <ast@plumgrid.com> wrote:

> > I mean more than that, I mean the licensing of BPF filters a user 
> > can find on his own system's kernel should be very clear: by the 
> > act of loading a BPF script into the kernel the user doing the 
> > 'upload' gives permission for it to be redistributed on 
> > kernel-compatible license terms.
> >
> > The easiest way to achieve that is to make sure that all loaded 
> > BPF scripts are 'registered' and are dumpable, viewable and 
> > reusable. That's good for debugging and it's good for 
> > transparency.
> >
> > This means a minimal BPF decoder will have to be in the kernel as 
> > well, but that's OK, we actually have several x86 instruction 
> > decoders in the kernel already, so there's no complexity threshold.
> 
> sure. there is pr_info_bpf_insn() in bpf_run.c that dumps bpf insn in
> human readable format.
> I'll hook it up to trace_seq, so that "cat
> /sys/kernel/debug/.../filter" will dump it.
> 
> Also I'm thinking to add 'license_string' section to bpf binary format
> and call license_is_gpl_compatible() on it during load.
> If false, then just reject it…. not even messing with taint flags...
> That would be way stronger indication of bpf licensing terms than what
> we have for .ko

But will BPF tools generate such gpl-compatible license tags by 
default? If yes then this might work, combined with the facility 
below. If not then it's just a nuisance to users.

Also, 'tainting' is a non-issue here, as we don't want the kernel to 
load license-incompatible scripts at all. This should be made clear in 
the design of the facility and the tooling itself.

> >> wow. I guess if the whole thing takes off, we would need an 
> >> in-kernel directory to store upstreamed bpf filters as well :)
> >
> > I see no reason why not, but more importantly all currently loaded 
> > BPF scripts should be dumpable, displayable and reusable in a 
> > kernel license compatible fashion.
> 
> ok. will add global bpf list as well (was hesitating to do something 
> like this because of central lock)

A lock + list is no big issue here I think, we do such central lookup 
locks all the time. If it ever becomes measurable it can be made 
scalable via numerous techniques.
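
(For what it's worth, the usual pattern is tiny; the names below are made
up, only the list/mutex primitives are existing kernel API:)

static LIST_HEAD(bpf_prog_list);
static DEFINE_MUTEX(bpf_prog_mutex);

static void bpf_prog_register(struct bpf_program *prog)
{
        mutex_lock(&bpf_prog_mutex);
        list_add_tail(&prog->list, &bpf_prog_list);     /* 'list' member assumed */
        mutex_unlock(&bpf_prog_mutex);
}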

> and something in debugfs that dumps bodies of all currently loaded 
> filters.
> 
> Will that solve the concern?

My concern would be solved by adding a facility to always be able to 
dump source code as well, i.e. trivially transform it to C or so, so 
that people can review it - or just edit it on the fly, recompile and 
reinsert? Most BPF scripts ought to be pretty simple.

(For example the most common way to load OpenGL shaders is to load the 
GLSL source code and that source code can be queried after insertion 
as well, so this is not an unusual model for small plugin-like 
scriptlets.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-05  4:40     ` Alexei Starovoitov
@ 2013-12-05 10:41       ` Ingo Molnar
  2013-12-05 13:46         ` Steven Rostedt
  2013-12-05 16:11       ` Frank Ch. Eigler
                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2013-12-05 10:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andi Kleen, Steven Rostedt, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel


* Alexei Starovoitov <ast@plumgrid.com> wrote:

> > On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >>
> >> Can you do some performance comparison compared to e.g. ktap?
> >> How much faster is it?
> 
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> trace skb:kfree_skb {
>         if (arg2 == 0x100) {
>                 printf("%x %x\n", arg1, arg2)
>         }
> }
> 1M skb alloc/free 350315 (usecs)
> 
> baseline without any tracing:
> 1M skb alloc/free 145400 (usecs)
> 
> then equivalent bpf test:
> void filter(struct bpf_context *ctx)
> {
>         void *loc = (void *)ctx->regs.dx;
>         if (loc == 0x100) {
>                 struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>                 char fmt[] = "skb %p loc %p\n";
>                 bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>         }
> }
> 1M skb alloc/free 183214 (usecs)
> 
> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
> 
> obviously ktap is an interpreter, so it's not really fair.
> 
> To make it really unfair I did:
> trace skb:kfree_skb {
>         if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
>             arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
>             arg2 == 0x900 || arg2 == 0x1000) {
>                 printf("%x %x\n", arg1, arg2)
>         }
> }
> 1M skb alloc/free 484280 (usecs)

Real life scripts, for example the ones related to network protocol 
analysis will often have such patterns in them, so I don't think this 
measurement is particularly unfair.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-05 10:41       ` Ingo Molnar
@ 2013-12-05 13:46         ` Steven Rostedt
  2013-12-05 22:36           ` Alexei Starovoitov
  0 siblings, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2013-12-05 13:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexei Starovoitov, Andi Kleen, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Thu, 5 Dec 2013 11:41:13 +0100
Ingo Molnar <mingo@kernel.org> wrote:

 
> > so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
> > 
> > obviously ktap is an interpreter, so it's not really fair.
> > 
> > To make it really unfair I did:
> > trace skb:kfree_skb {
> >         if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
> >             arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
> >             arg2 == 0x900 || arg2 == 0x1000) {
> >                 printf("%x %x\n", arg1, arg2)
> >         }
> > }
> > 1M skb alloc/free 484280 (usecs)
> 
> Real life scripts, for example the ones related to network protocol 
> analysis will often have such patterns in them, so I don't think this 
> measurement is particularly unfair.

I agree. As the size of the if statement grows, the filter logic gets
linearly more expensive, but the bpf filter does not.

I know that it would be great to have the bpf filter run before
recording of the tracepoint, but as that becomes quite awkward for a
user interface, because it requires intimate knowledge of the kernel
source, this speed up on the filter itself may be worth while to have
it happen after the recording of the buffer. When it happens after the
record, then the bpf has direct access to the event entry and its
fields as described by the trace event format files.
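
Concretely, a post-record program could be handed the entry laid out per the
event's format file (sketch only; the context passed to the program and the
struct below, built from the sched_switch format quoted earlier with the
common_* header fields omitted, are assumptions):

struct sched_switch_entry {
        char  prev_comm[16];    /* size 16 per the format file */
        pid_t prev_pid;
        int   prev_prio;
        long  prev_state;
        char  next_comm[16];
        pid_t next_pid;
        int   next_prio;
};

int filter(void *entry)
{
        struct sched_switch_entry *e = entry;

        /* same effect as the existing 'next_prio > 10' filter string */
        return e->next_prio > 10;
}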

-- Steve

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-05  4:40     ` Alexei Starovoitov
  2013-12-05 10:41       ` Ingo Molnar
@ 2013-12-05 16:11       ` Frank Ch. Eigler
  2013-12-05 19:43         ` Alexei Starovoitov
  2013-12-06  0:14       ` Andi Kleen
                         ` (2 subsequent siblings)
  4 siblings, 1 reply; 65+ messages in thread
From: Frank Ch. Eigler @ 2013-12-05 16:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andi Kleen, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, linux-kernel


ast wrote:

>>[...]
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> trace skb:kfree_skb {
>         if (arg2 == 0x100) {
>                 printf("%x %x\n", arg1, arg2)
>         }
> }
> [...]

For reference, you might try putting systemtap into the performance
comparison matrix too:

# stap -e 'probe kernel.trace("kfree_skb") { 
              if ($location == 0x100 /* || $location == 0x200 etc. */ ) {
                 printf("%x %x\n", $skb, $location)
              }
           }'


- FChE

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-04  0:01 ` Andi Kleen
  2013-12-04  3:09   ` Alexei Starovoitov
@ 2013-12-05 16:31   ` Frank Ch. Eigler
  1 sibling, 0 replies; 65+ messages in thread
From: Frank Ch. Eigler @ 2013-12-05 16:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alexei Starovoitov, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, linux-kernel

Andi Kleen <andi@firstfloor.org> writes:

> [...]  While it sounds interesting, I would strongly advise to make
> this capability only available to root. Traditionally lots of
> complex byte code languages which were designed to be "safe" and
> verifiable weren't really. e.g. i managed to crash things with
> "safe" systemtap multiple times. [...]

Note that systemtap has never been a byte code language, that avenue
being considered lkml-futile at the time, but instead pure C.  Its
safety comes from a mix of compiled-in checks (which you can inspect
via "stap -p3") and script-to-C translation checks (which are
self-explanatory).  Its risks come from bugs in the checks (quite
rare), problems in the runtime library (rare), and problems in
underlying kernel facilities (rare or frequent - consider kprobes).


> So the likelihood of this having some hole somewhere (either in
> the byte code or in some library function) is high.

Very true!


- FChE

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-05 16:11       ` Frank Ch. Eigler
@ 2013-12-05 19:43         ` Alexei Starovoitov
  0 siblings, 0 replies; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-05 19:43 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Andi Kleen, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, linux-kernel

On Thu, Dec 5, 2013 at 8:11 AM, Frank Ch. Eigler <fche@redhat.com> wrote:
>
> ast wrote:
>
>>>[...]
>> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
>> trace skb:kfree_skb {
>>         if (arg2 == 0x100) {
>>                 printf("%x %x\n", arg1, arg2)
>>         }
>> }
>> [...]
>
> For reference, you might try putting systemtap into the performance
> comparison matrix too:
>
> # stap -e 'probe kernel.trace("kfree_skb") {
>               if ($location == 0x100 /* || $location == 0x200 etc. */ ) {
>                  printf("%x %x\n", $skb, $location)
>               }
>            }'

stap with one 'if': 1M skb alloc/free 200696 (usecs)
stap with 10 'if': 1M skb alloc/free 202135 (usecs)

so the systemtap entry overhead is a bit higher than bpf's, and the extra
if-s show the same (nearly flat) progression, as expected for compiled code.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-05 13:46         ` Steven Rostedt
@ 2013-12-05 22:36           ` Alexei Starovoitov
  2013-12-05 23:37             ` Steven Rostedt
  0 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-05 22:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Andi Kleen, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I know that it would be great to have the bpf filter run before
> recording of the tracepoint, but as that becomes quite awkward for a
> user interface, because it requires intimate knowledge of the kernel
> source, this speed up on the filter itself may be worth while to have
> it happen after the recording of the buffer. When it happens after the
> record, then the bpf has direct access to the event entry and its
> fields as described by the trace event format files.

I don't understand that 'awkward' part yet. What do you mean by 'knowledge of
the kernel'? By accessing pt_regs structure? Something else ?
Can we try fixing the interface first before compromising on performance?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-05 22:36           ` Alexei Starovoitov
@ 2013-12-05 23:37             ` Steven Rostedt
  2013-12-06  4:49               ` Alexei Starovoitov
  0 siblings, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2013-12-05 23:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Andi Kleen, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Thu, 5 Dec 2013 14:36:58 -0800
Alexei Starovoitov <ast@plumgrid.com> wrote:

> On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I know that it would be great to have the bpf filter run before
> > recording of the tracepoint, but as that becomes quite awkward for a
> > user interface, because it requires intimate knowledge of the kernel
> > source, this speed up on the filter itself may be worth while to have
> > it happen after the recording of the buffer. When it happens after the
> > record, then the bpf has direct access to the event entry and its
> > fields as described by the trace event format files.
> 
> I don't understand that 'awkward' part yet. What do you mean by 'knowledge of
> the kernel'? By accessing pt_regs structure? Something else ?
> Can we try fixing the interface first before compromising on performance?

Let me ask you this. If you do not have the source of the kernel on
hand, can you use BPF to filter the sched_switch tracepoint on prev pid?

The current filter interface allows you to filter with just what the
running kernel provides. No need for debug info from the vmlinux or
anything else.

pt_regs is not that useful without having something to translate what
that means.

I'm fine if it becomes a requirement to have a vmlinux built with
DEBUG_INFO to use BPF and have a tool like perf to translate the
filters. But that must not replace what the current filters do now.
That is, it can be an add on, but not a replacement.

 -- Steve

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-05  4:40     ` Alexei Starovoitov
  2013-12-05 10:41       ` Ingo Molnar
  2013-12-05 16:11       ` Frank Ch. Eigler
@ 2013-12-06  0:14       ` Andi Kleen
  2013-12-06  1:10         ` H. Peter Anvin
  2013-12-06  5:19       ` Jovi Zhangwei
  2013-12-06  6:17       ` Jovi Zhangwei
  4 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2013-12-06  0:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andi Kleen, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, linux-kernel

> 1M skb alloc/free 185660 (usecs)
> 
> the difference is bigger now: 484-145 vs 185-145

Thanks for the data. 

This is an obvious improvement, but imho not big enough to be extremely
compelling (< cost 1-2 cache misses, no orders of magnitude improvements
that would justify a lot of code)

One larger problem I have with your patchkit is where exactly it fits
with the user base.

In my experience there are roughly two groups of trace users:
kernel hackers and users. The kernel hackers want something
convenient and fast, but for anything complicated or performance
critical they can always hack the kernel to include custom 
instrumentation.

Your code requires a compiler, so from my perspective it 
wouldn't be a lot easier or faster to use than just changing 
the code directly and recompiling. 

The users want something simple too that shields them from
having to learn all the internals. They don't want to recompile.
As far as I can tell your code is a bit too low level for that,
and the requirement for the compiler may also scare them.

Where exactly does it fit?

-Andi


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-06  0:14       ` Andi Kleen
@ 2013-12-06  1:10         ` H. Peter Anvin
  2013-12-06  1:20           ` Andi Kleen
  0 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-12-06  1:10 UTC (permalink / raw)
  To: Andi Kleen, Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Peter Zijlstra, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel

On 12/05/2013 04:14 PM, Andi Kleen wrote:
> 
> In my experience there are roughly two groups of trace users:
> kernel hackers and users. The kernel hackers want something
> convenient and fast, but for anything complicated or performance
> critical they can always hack the kernel to include custom 
> instrumentation.
> 

Not to mention that in that case we might as well -- since we need a
compiler anyway -- generate the machine code in user space; the JIT
solution really only is useful if it can provide something that we can't
do otherwise, e.g. enable it in secure boot environments.

	-hpa


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-06  1:10         ` H. Peter Anvin
@ 2013-12-06  1:20           ` Andi Kleen
  2013-12-06  1:28             ` H. Peter Anvin
                               ` (3 more replies)
  0 siblings, 4 replies; 65+ messages in thread
From: Andi Kleen @ 2013-12-06  1:20 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexei Starovoitov, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

"H. Peter Anvin" <hpa@zytor.com> writes:
>
> Not to mention that in that case we might as well -- since we need a
> compiler anyway -- generate the machine code in user space; the JIT
> solution really only is useful if it can provide something that we can't
> do otherwise, e.g. enable it in secure boot environments.

I can see there may be some setups which don't have a compiler
(e.g. I know some people don't use systemtap because of that)
But this needs a custom gcc install too as far as I understand.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-06  1:20           ` Andi Kleen
@ 2013-12-06  1:28             ` H. Peter Anvin
  2013-12-06 21:43               ` Frank Ch. Eigler
  2013-12-06  5:16             ` Alexei Starovoitov
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-12-06  1:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Alexei Starovoitov, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On 12/05/2013 05:20 PM, Andi Kleen wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>>
>> Not to mention that in that case we might as well -- since we need a
>> compiler anyway -- generate the machine code in user space; the JIT
>> solution really only is useful if it can provide something that we can't
>> do otherwise, e.g. enable it in secure boot environments.
> 
> I can see there may be some setups which don't have a compiler
> (e.g. I know some people don't use systemtap because of that)
> But this needs a custom gcc install too as far as I understand.
> 

Yes... but no compiler and secure boot tend to go together, or at least
will in the future.

	-hpa


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-05 23:37             ` Steven Rostedt
@ 2013-12-06  4:49               ` Alexei Starovoitov
  2013-12-10 15:47                 ` Ingo Molnar
  0 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-06  4:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Andi Kleen, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Thu, Dec 5, 2013 at 3:37 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 5 Dec 2013 14:36:58 -0800
> Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> >
>> > I know that it would be great to have the bpf filter run before
>> > recording of the tracepoint, but as that becomes quite awkward for a
>> > user interface, because it requires intimate knowledge of the kernel
>> > source, this speed up on the filter itself may be worth while to have
>> > it happen after the recording of the buffer. When it happens after the
>> > record, then the bpf has direct access to the event entry and its
>> > fields as described by the trace event format files.
>>
>> I don't understand that 'awkward' part yet. What do you mean by 'knowledge of
>> the kernel'? By accessing pt_regs structure? Something else ?
>> Can we try fixing the interface first before compromising on performance?
>
> Let me ask you this. If you do not have the source of the kernel on
> hand, can you use BPF to filter the sched_switch tracepoint on prev pid?
>
> The current filter interface allows you to filter with just what the
> running kernel provides. No need for debug info from the vmlinux or
> anything else.

Understood and agreed. For users that are satisfied with the amount of info
that a single trace_event provides (like sched_switch) there is probably
little reason to do complex filtering. Either they're fine with all
the events or they will just filter based on pid only.

> I'm fine if it becomes a requirement to have a vmlinux built with
> DEBUG_INFO to use BPF and have a tool like perf to translate the
> filters. But that must not replace what the current filters do now.
> That is, it can be an add on, but not a replacement.

Of course. tracing filters via bpf is an additional tool for kernel debugging.
bpf by itself has use cases beyond tracing.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-06  1:20           ` Andi Kleen
  2013-12-06  1:28             ` H. Peter Anvin
@ 2013-12-06  5:16             ` Alexei Starovoitov
  2013-12-06 23:54               ` Masami Hiramatsu
  2013-12-06  5:46             ` Jovi Zhangwei
  2013-12-07  1:12             ` Alexei Starovoitov
  3 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-06  5:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> the difference is bigger now: 484-145 vs 185-145
>
> This is an obvious improvement, but imho not big enough to be extremely
> compelling (< cost 1-2 cache misses, no orders of magnitude improvements
> that would justify a lot of code)

hmm. we're comparing against ktap here…
which has 5x more kernel code and is 8x slower in this test...

> Your code requires a compiler, so from my perspective it
> wouldn't be a lot easier or faster to use than just changing
> the code directly and recompiling.
>
> The users want something simple too that shields them from
> having to learn all the internals. They don't want to recompile.
> As far as I can tell your code is a bit too low level for that,
> and the requirement for the compiler may also scare them.
>
> Where exactly does it fit?

the goal is to have an llvm compiler next to perf, wrapped in a user-friendly way.

compiling a small filter vs recompiling the full kernel…
inserting into a live kernel vs rebooting…
not sure how you're saying they're equivalent.

In my kernel debugging experience the current tools (tracing, systemtap)
were rarely enough.
I always had to add my own printks through the code, recompile and reboot,
often just to see that it's not the place where I want to print things
or that it's too verbose.
Then I would adjust the printks, recompile and reboot again.
That was slow and tedious, since I would be crashing things from time to time,
just because skb doesn't always have a valid dev or I made a typo.
For debugging I really need something quick and dirty that lets me add my own
printk of whatever structs I want, anywhere in the kernel, without crashing it.
That's exactly what bpf tracing filters do.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-05  4:40     ` Alexei Starovoitov
                         ` (2 preceding siblings ...)
  2013-12-06  0:14       ` Andi Kleen
@ 2013-12-06  5:19       ` Jovi Zhangwei
  2013-12-06 23:58         ` Masami Hiramatsu
  2013-12-06  6:17       ` Jovi Zhangwei
  4 siblings, 1 reply; 65+ messages in thread
From: Jovi Zhangwei @ 2013-12-06  5:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andi Kleen, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Eric Dumazet, LKML

Hi Alexei,

On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>>
>>> Can you do some performance comparison compared to e.g. ktap?
>>> How much faster is it?
>
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> trace skb:kfree_skb {
>         if (arg2 == 0x100) {
>                 printf("%x %x\n", arg1, arg2)
>         }
> }
> 1M skb alloc/free 350315 (usecs)
>
> baseline without any tracing:
> 1M skb alloc/free 145400 (usecs)
>
> then equivalent bpf test:
> void filter(struct bpf_context *ctx)
> {
>         void *loc = (void *)ctx->regs.dx;
>         if (loc == 0x100) {
>                 struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>                 char fmt[] = "skb %p loc %p\n";
>                 bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>         }
> }
> 1M skb alloc/free 183214 (usecs)
>
> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>
> obviously ktap is an interpreter, so it's not really fair.
>
> To make it really unfair I did:
> trace skb:kfree_skb {
>         if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
>             arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
>             arg2 == 0x900 || arg2 == 0x1000) {
>                 printf("%x %x\n", arg1, arg2)
>         }
> }
> 1M skb alloc/free 484280 (usecs)
>
> and corresponding bpf:
> void filter(struct bpf_context *ctx)
> {
>         void *loc = (void *)ctx->regs.dx;
>         if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 ||
>             loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 ||
>             loc == 0x900 || loc == 0x1000) {
>                 struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>                 char fmt[] = "skb %p loc %p\n";
>                 bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>         }
> }
> 1M skb alloc/free 185660 (usecs)
>
> the difference is bigger now: 484-145 vs 185-145
>
There is a big difference between comparing arg2 (in ktap) and direct register
access (ctx->regs.dx).

The current argument fetching (arg2 in the above testcase) implementation in
ktap is very inefficient, see ktap/interpreter/lib_kdebug.c:kp_event_getarg.
The only way to speed it up is a kernel tracing code change: let an external
tracing module access event fields without going through a list lookup. This
work is not started yet. :)

Of course, I'm not saying this argument fetching issue is the whole root cause
of the performance gap compared with bpf and Systemtap; bytecode execution
speed won't compare with raw machine code anyway.
(There is a plan to use a JIT in the ktap core, like the luajit project, but
it needs some time to work on.)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-05 10:38         ` Ingo Molnar
@ 2013-12-06  5:43           ` Alexei Starovoitov
  0 siblings, 0 replies; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-06  5:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, H. Peter Anvin, Thomas Gleixner,
	Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, Linus Torvalds, Andrew Morton,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi, Pekka Enberg, David S. Miller, Arjan van de Ven,
	Christoph Hellwig

On Thu, Dec 5, 2013 at 2:38 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
>> Also I'm thinking to add 'license_string' section to bpf binary format
>> and call license_is_gpl_compatible() on it during load.
>> If false, then just reject it…. not even messing with taint flags...
>> That would be way stronger indication of bpf licensing terms than what
>> we have for .ko
>
> But will BFP tools generate such gpl-compatible license tags by
> default? If yes then this might work, combined with the facility
> below. If not then it's just a nuisance to users.

yes. similar to existing .ko module_license() tag. see below.

> My concern would be solved by adding a facility to always be able to
> dump source code as well, i.e. trivially transform it to C or so, so
> that people can review it - or just edit it on the fly, recompile and
> reinsert? Most BFP scripts ought to be pretty simple.

C code has '#include' lines in it, so without storing fully preprocessed code
it will not be equivalent, but then the true source will be gigantic.
It could be zipped, but that sounds like overkill.
Also we might want other languages with their own dependent includes.
Sure, we can have a section in the bpf binary that carries the source, but it's not
enforceable: the kernel cannot know that it's the actual source,
gcc/llvm will produce different bpf code out of the same source,
the source may be in C or in language X, etc.
It doesn't seem that including some form of source will help
with enforcing the license.

imo requiring a module_license("gpl"); line in the C code, and an equivalent
string in all other languages that want to translate to bpf, would be a
stronger indication of licensing terms.
The compiler would then have to include that string in the 'license_string'
section, and the kernel could actually enforce it.
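
For example, a filter source under that convention could start like this
(just a sketch of the proposal; nothing in this patch set implements it yet):

module_license("gpl");

void filter(struct bpf_context *ctx)
{
        /* filter body as in the earlier examples */
}

and the compiler front-end would copy the "gpl" string into the
'license_string' section of the emitted bpf binary for the kernel to check
at load time.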

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-06  1:20           ` Andi Kleen
  2013-12-06  1:28             ` H. Peter Anvin
  2013-12-06  5:16             ` Alexei Starovoitov
@ 2013-12-06  5:46             ` Jovi Zhangwei
  2013-12-07  1:12             ` Alexei Starovoitov
  3 siblings, 0 replies; 65+ messages in thread
From: Jovi Zhangwei @ 2013-12-06  5:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, Alexei Starovoitov, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Eric Dumazet, LKML

On Fri, Dec 6, 2013 at 9:20 AM, Andi Kleen <andi@firstfloor.org> wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>>
>> Not to mention that in that case we might as well -- since we need a
>> compiler anyway -- generate the machine code in user space; the JIT
>> solution really only is useful if it can provide something that we can't
>> do otherwise, e.g. enable it in secure boot environments.
>
> I can see there may be some setups which don't have a compiler
> (e.g. I know some people don't use systemtap because of that)
> But this needs a custom gcc install too as far as I understand.
>
If it depends on gcc, then it looks like Systemtap. It is a big
inconvenience for embedded environments and many production systems
to install gcc.
(Not sure if it needs a kernel compilation environment as well.)

It seems the event filter is bound to a specific event, so it's not possible
to trace many events in a cooperating style. Looking at the Systemtap and ktap
samples, many event handlers need to cooperate; the simplest
example is recording syscall execution time (duration of exit - entry).

If this design is intentional, then I would think it targets speeding up
the current kernel tracing filter (but needs an extra userspace filter compiler).

And I guess the bpf filter still needs to keep userspace tracing in mind :),
if it wants to be a complete and integrated tracing solution.
(Use a separate userspace compiler or translator to resolve symbols.)

Thanks

Jovi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-05  4:40     ` Alexei Starovoitov
                         ` (3 preceding siblings ...)
  2013-12-06  5:19       ` Jovi Zhangwei
@ 2013-12-06  6:17       ` Jovi Zhangwei
  4 siblings, 0 replies; 65+ messages in thread
From: Jovi Zhangwei @ 2013-12-06  6:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andi Kleen, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Eric Dumazet, LKML

On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>>
>>> Can you do some performance comparison compared to e.g. ktap?
>>> How much faster is it?
>
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> trace skb:kfree_skb {
>         if (arg2 == 0x100) {
>                 printf("%x %x\n", arg1, arg2)
>         }
> }
> 1M skb alloc/free 350315 (usecs)
>
> baseline without any tracing:
> 1M skb alloc/free 145400 (usecs)
>
> then equivalent bpf test:
> void filter(struct bpf_context *ctx)
> {
>         void *loc = (void *)ctx->regs.dx;
>         if (loc == 0x100) {
>                 struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>                 char fmt[] = "skb %p loc %p\n";
>                 bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>         }
> }
> 1M skb alloc/free 183214 (usecs)
>
> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>
> obviously ktap is an interpreter, so it's not really fair.
>
> To make it really unfair I did:
> trace skb:kfree_skb {
>         if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
>             arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
>             arg2 == 0x900 || arg2 == 0x1000) {
>                 printf("%x %x\n", arg1, arg2)
>         }
> }
> 1M skb alloc/free 484280 (usecs)
>
I lost my mind for a while. :)

If bpf only focuses on filtering, then it's not fair to compare with ktap
like that, since ktap can easily make use of the current kernel filter;
you should use the script below:

trace skb:kfree_skb /location == 0x100 || location == 0x200 || .../ {
    printf("%x %x\n", arg1, arg2)
}

As ktap is a user of the current simple kernel tracing filter, I fully
agree with Steven:
    "it can be an add on, but not a replacement."


Thanks,

Jovi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 4/5] use BPF in tracing filters
  2013-12-05  5:11         ` Alexei Starovoitov
@ 2013-12-06  8:43           ` Masami Hiramatsu
  2013-12-06 10:05             ` Jovi Zhangwei
  0 siblings, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2013-12-06  8:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Tom Zanussi, Jovi Zhangwei, Eric Dumazet,
	linux-kernel, yrl.pp-manager.tt

(2013/12/05 14:11), Alexei Starovoitov wrote:
> On Wed, Dec 4, 2013 at 4:05 PM, Masami Hiramatsu
> <masami.hiramatsu.pt@hitachi.com> wrote:
>> (2013/12/04 10:11), Steven Rostedt wrote:
>>> On Wed, 04 Dec 2013 09:48:44 +0900
>>> Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:
>>>
>>>> fetch functions and actions. In that case, we can continue
>>>> to use current interface but much faster to trace.
>>>> Also, we can see what filter/arguments/actions are set
>>>> on each event.
>>>
>>> There's also the problem that the current filters work with the results
>>> of what is written to the buffer, not what is passed in by the trace
>>> point, as that isn't even displayed to the user.
>>
>> Agreed, so I've said I doubt this implementation is a good
>> shape to integrate. Ktap style is better, since it just gets
>> parameters from perf buffer entry (using event format).
> 
> Are you saying always store all arguments into ring buffer and let
> filter run on it?

Yes, that is what ftrace does. I doubt your way fits all of the existing
trace-event macros. However, I think that just for dynamic events, you can
integrate the argument fetching and filtering.

> It's slower, but it's cleaner, because of human readable? since ktap
> arg1 matches first
> argument of tracepoint is better than doing ctx->regs.di ? Sure.
> si->arg1 is easy to fix.
> With si->arg1 tweak the bpf will become architecture independent. It
> will run through JIT on x86 and through interpreter everywhere else.
> but for kprobes user have to specify 'var=cpu_register' during probe
> creation… how is it better than doing the same in filter?

Haven't you used perf-probe yet? It already supports this kind of
translation from a kernel local variable name to registers, offsets,
and dereferences. :) And kprobe-events can parse such arguments into a
method chain. See Documentation/trace/kprobetrace.txt and
tools/perf/Documentation/perf-probe.txt for more detail.
Anyway, I'd like to use bpf for re-implementing the fetch methods. :)
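
For example, the kind of argument definition kprobe-events accepts is roughly
(the example from Documentation/trace/kprobetrace.txt):

  echo 'p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)' > /sys/kernel/debug/tracing/kprobe_events

and perf-probe can generate such a definition from source-level names
(e.g. 'perf probe do_sys_open filename') when debuginfo is available.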

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 4/5] use BPF in tracing filters
  2013-12-06  8:43           ` Masami Hiramatsu
@ 2013-12-06 10:05             ` Jovi Zhangwei
  2013-12-06 23:48               ` Masami Hiramatsu
  0 siblings, 1 reply; 65+ messages in thread
From: Jovi Zhangwei @ 2013-12-06 10:05 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Alexei Starovoitov, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Tom Zanussi, Eric Dumazet, LKML,
	yrl.pp-manager.tt

On Fri, Dec 6, 2013 at 4:43 PM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
> (2013/12/05 14:11), Alexei Starovoitov wrote:
>> On Wed, Dec 4, 2013 at 4:05 PM, Masami Hiramatsu
>> <masami.hiramatsu.pt@hitachi.com> wrote:
>>> (2013/12/04 10:11), Steven Rostedt wrote:
>>>> On Wed, 04 Dec 2013 09:48:44 +0900
>>>> Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:
>>>>
>>>>> fetch functions and actions. In that case, we can continue
>>>>> to use current interface but much faster to trace.
>>>>> Also, we can see what filter/arguments/actions are set
>>>>> on each event.
>>>>
>>>> There's also the problem that the current filters work with the results
>>>> of what is written to the buffer, not what is passed in by the trace
>>>> point, as that isn't even displayed to the user.
>>>
>>> Agreed, so I've said I doubt this implementation is a good
>>> shape to integrate. Ktap style is better, since it just gets
>>> parameters from perf buffer entry (using event format).
>>
>> Are you saying always store all arguments into ring buffer and let
>> filter run on it?
>
> Yes, it is what ftrace does. I doubt your way fits all of the existing
> trace-event macros. However, I think just for dynamic events, you can
> integrating the argument fetching and filtering.
>
Will this affect the user interface of perf-probe argument fetching?

I mean, if we use a bpf backend, do we then need gcc to compile bpf source
for perf-probe argument fetching? As we know, current argument
fetching goes through the kprobe_events/uprobe_events debugfs files, and
ktap is based on this behavior.

Thanks.

Jovi.

>> It's slower, but it's cleaner, because of human readable? since ktap
>> arg1 matches first
>> argument of tracepoint is better than doing ctx->regs.di ? Sure.
>> si->arg1 is easy to fix.
>> With si->arg1 tweak the bpf will become architecture independent. It
>> will run through JIT on x86 and through interpreter everywhere else.
>> but for kprobes user have to specify 'var=cpu_register' during probe
>> creation… how is it better than doing the same in filter?
>
> Haven't you used perf-probe yet? It already supports such kind of
> translation from kernel local variable name to registers, offsets,
> and dereference. :) And kprobe-events can parse such arguments into
> method chain. See Documentation/trace/kprobetrace.txt and
> tools/perf/Documentation/perf-probe.txt for more detail.
> Anyway, I'd like to use the bpf for re-implementing fetch method. :)
>
> Thank you,
>
> --
> Masami HIRAMATSU
> IT Management Research Dept. Linux Technology Center
> Hitachi, Ltd., Yokohama Research Laboratory
> E-mail: masami.hiramatsu.pt@hitachi.com
>
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-06  1:28             ` H. Peter Anvin
@ 2013-12-06 21:43               ` Frank Ch. Eigler
  0 siblings, 0 replies; 65+ messages in thread
From: Frank Ch. Eigler @ 2013-12-06 21:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, Alexei Starovoitov, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Jovi Zhangwei, Eric Dumazet, linux-kernel


hpa wrote:

>> I can see there may be some setups which don't have a compiler
>> (e.g. I know some people don't use systemtap because of that)
>> But this needs a custom gcc install too as far as I understand.
>
> Yes... but no compiler and secure boot tend to go together, or at
> least will in the future.

(Maybe not: we're already experimenting with support for secureboot in
systemtap.)

- FChE

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 4/5] use BPF in tracing filters
  2013-12-06 10:05             ` Jovi Zhangwei
@ 2013-12-06 23:48               ` Masami Hiramatsu
  2013-12-08 18:22                 ` Frank Ch. Eigler
  0 siblings, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2013-12-06 23:48 UTC (permalink / raw)
  To: Jovi Zhangwei
  Cc: Alexei Starovoitov, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Tom Zanussi, Eric Dumazet, LKML,
	yrl.pp-manager.tt

(2013/12/06 19:05), Jovi Zhangwei wrote:
> On Fri, Dec 6, 2013 at 4:43 PM, Masami Hiramatsu
> <masami.hiramatsu.pt@hitachi.com> wrote:
>> (2013/12/05 14:11), Alexei Starovoitov wrote:
>>> On Wed, Dec 4, 2013 at 4:05 PM, Masami Hiramatsu
>>> <masami.hiramatsu.pt@hitachi.com> wrote:
>>>> (2013/12/04 10:11), Steven Rostedt wrote:
>>>>> On Wed, 04 Dec 2013 09:48:44 +0900
>>>>> Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:
>>>>>
>>>>>> fetch functions and actions. In that case, we can continue
>>>>>> to use current interface but much faster to trace.
>>>>>> Also, we can see what filter/arguments/actions are set
>>>>>> on each event.
>>>>>
>>>>> There's also the problem that the current filters work with the results
>>>>> of what is written to the buffer, not what is passed in by the trace
>>>>> point, as that isn't even displayed to the user.
>>>>
>>>> Agreed, so I've said I doubt this implementation is a good
>>>> shape to integrate. Ktap style is better, since it just gets
>>>> parameters from perf buffer entry (using event format).
>>>
>>> Are you saying always store all arguments into ring buffer and let
>>> filter run on it?
>>
>> Yes, it is what ftrace does. I doubt your way fits all of the existing
>> trace-event macros. However, I think just for dynamic events, you can
>> integrating the argument fetching and filtering.
>>
> Does this will affect the user interface of perf-probe argument fetching?
> 
> I mean if use bpf backend, do we must need gcc to compile bpf source
> for perf-probe argument fetching? as we known, current argument
> fetching is go through kprobe_events/uprobe_events debugfs file, and
> ktap is based on this behavior.

No, I don't want to do that. Feeding binary code into the kernel is
neither trusted nor controllable. I'd just like to see code which
optimizes the current fetching/filtering methods, and that is possible.

Anyway, as far as I can see, there look to be two different models of
tracing in our minds.

A) Fixed event based tracing: In this model, there are several fixed
"events" which are well defined with fixed arguments. The tracer handles these
events and only uses limited arguments. It's like packet stream
processing. ftrace, perf etc. use this model.

B) Flexible event-point tracing: In this model, each tracer (or even each
trace user) can freely define their own events; there are some fixed
tracing points defined, but the arguments are defined by users. It's like a
debugger's breakpoint debugging. systemtap, ktap etc. use this model.

Of course, both have pros/cons, and can share some fundamental features;
e.g. the B model has good flexibility and the A model is easy to use for beginners.

I think we'd better not integrate these two, but rather find a better way
to share each functionality.

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-06  5:16             ` Alexei Starovoitov
@ 2013-12-06 23:54               ` Masami Hiramatsu
  2013-12-07  1:01                 ` Alexei Starovoitov
  0 siblings, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2013-12-06 23:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andi Kleen, H. Peter Anvin, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, Thomas Gleixner, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

(2013/12/06 14:16), Alexei Starovoitov wrote:
> On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>> the difference is bigger now: 484-145 vs 185-145
>>
>> This is a obvious improvement, but imho not big enough to be extremely
>> compelling (< cost 1-2 cache misses, no orders of magnitude improvements
>> that would justify a lot of code)
> 
> hmm. we're comparing against ktap here…
> which has 5x more kernel code and 8x slower in this test...
> 
>> Your code requires a compiler, so from my perspective it
>> wouldn't be a lot easier or faster to use than just changing
>> the code directly and recompile.
>>
>> The users want something simple too that shields them from
>> having to learn all the internals. They don't want to recompile.
>> As far as I can tell your code is a bit too low level for that,
>> and the requirement for the compiler may also scare them.
>>
>> Where exactly does it fit?
> 
> the goal is to have llvm compiler next to perf, wrapped in a user friendly way.
> 
> compiling small filter vs recompiling full kernel…
> inserting into live kernel vs rebooting …
> not sure how you're saying it's equivalent.
> 
> In my kernel debugging experience current tools (tracing, systemtap)
> were rarely enough.
> I always had to add my own printks through the code, recompile and reboot.
> Often just to see that it's not the place where I want to print things
> or it's too verbose.
> Then I would adjust printks, recompile and reboot again.
> That was slow and tedious, since I would be crashing things from time to time
> just because skb doesn't always have a valid dev or I made a typo.
> For debugging I do really need something quick and dirty that lets me
> add my own printk
> of whatever structs I want anywhere in the kernel without crashing it.
> That's exactly what bpf tracing filters do.

I recommend you use perf-probe. That will give you an easy solution. :)


Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-06  5:19       ` Jovi Zhangwei
@ 2013-12-06 23:58         ` Masami Hiramatsu
  2013-12-07 16:21           ` Jovi Zhangwei
  0 siblings, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2013-12-06 23:58 UTC (permalink / raw)
  To: Jovi Zhangwei
  Cc: Alexei Starovoitov, Andi Kleen, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Tom Zanussi,
	Eric Dumazet, LKML

(2013/12/06 14:19), Jovi Zhangwei wrote:
> Hi Alexei,
> 
> On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>>>
>>>> Can you do some performance comparison compared to e.g. ktap?
>>>> How much faster is it?
>>
>> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
>> trace skb:kfree_skb {
>>         if (arg2 == 0x100) {
>>                 printf("%x %x\n", arg1, arg2)
>>         }
>> }
>> 1M skb alloc/free 350315 (usecs)
>>
>> baseline without any tracing:
>> 1M skb alloc/free 145400 (usecs)
>>
>> then equivalent bpf test:
>> void filter(struct bpf_context *ctx)
>> {
>>         void *loc = (void *)ctx->regs.dx;
>>         if (loc == 0x100) {
>>                 struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>                 char fmt[] = "skb %p loc %p\n";
>>                 bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>         }
>> }
>> 1M skb alloc/free 183214 (usecs)
>>
>> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>>
>> obviously ktap is an interpreter, so it's not really fair.
>>
>> To make it really unfair I did:
>> trace skb:kfree_skb {
>>         if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
>>             arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
>>             arg2 == 0x900 || arg2 == 0x1000) {
>>                 printf("%x %x\n", arg1, arg2)
>>         }
>> }
>> 1M skb alloc/free 484280 (usecs)
>>
>> and corresponding bpf:
>> void filter(struct bpf_context *ctx)
>> {
>>         void *loc = (void *)ctx->regs.dx;
>>         if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 ||
>>             loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 ||
>>             loc == 0x900 || loc == 0x1000) {
>>                 struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>                 char fmt[] = "skb %p loc %p\n";
>>                 bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>         }
>> }
>> 1M skb alloc/free 185660 (usecs)
>>
>> the difference is bigger now: 484-145 vs 185-145
>>
> There have big differences for compare arg2(in ktap) with direct register
> access(ctx->regs.dx).
> 
> The current argument fetching(arg2 in above testcase) implementation in ktap
> is very inefficient, see ktap/interpreter/lib_kdebug.c:kp_event_getarg.
> The only way to speedup is kernel tracing code change, let external tracing
> module access event field not through list lookup. This work is not
> started yet. :)

I'm not sure why you can't access it directly from the ftrace event buffer.
It is just a packed data structure and its layout is exposed via debugfs.
You can decode it and get the offset/size of each field by using libtraceevent.
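
For example, the skb:kfree_skb format file (abridged; exact offsets vary by
kernel build) looks roughly like:

        field:void * skbaddr;   offset:8;       size:8; signed:0;
        field:void * location;  offset:16;      size:8; signed:0;
        field:unsigned short protocol;  offset:24;      size:2; signed:0;

so 'location' is just a fixed-offset load from the raw record.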

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-06 23:54               ` Masami Hiramatsu
@ 2013-12-07  1:01                 ` Alexei Starovoitov
  0 siblings, 0 replies; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-07  1:01 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andi Kleen, H. Peter Anvin, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, Thomas Gleixner, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Fri, Dec 6, 2013 at 3:54 PM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
> (2013/12/06 14:16), Alexei Starovoitov wrote:
>> On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>>> the difference is bigger now: 484-145 vs 185-145
>>>
>>> This is a obvious improvement, but imho not big enough to be extremely
>>> compelling (< cost 1-2 cache misses, no orders of magnitude improvements
>>> that would justify a lot of code)
>>
>> hmm. we're comparing against ktap here…
>> which has 5x more kernel code and 8x slower in this test...
>>
>>> Your code requires a compiler, so from my perspective it
>>> wouldn't be a lot easier or faster to use than just changing
>>> the code directly and recompile.
>>>
>>> The users want something simple too that shields them from
>>> having to learn all the internals. They don't want to recompile.
>>> As far as I can tell your code is a bit too low level for that,
>>> and the requirement for the compiler may also scare them.
>>>
>>> Where exactly does it fit?
>>
>> the goal is to have llvm compiler next to perf, wrapped in a user friendly way.
>>
>> compiling small filter vs recompiling full kernel…
>> inserting into live kernel vs rebooting …
>> not sure how you're saying it's equivalent.
>>
>> In my kernel debugging experience current tools (tracing, systemtap)
>> were rarely enough.
>> I always had to add my own printks through the code, recompile and reboot.
>> Often just to see that it's not the place where I want to print things
>> or it's too verbose.
>> Then I would adjust printks, recompile and reboot again.
>> That was slow and tedious, since I would be crashing things from time to time
>> just because skb doesn't always have a valid dev or I made a typo.
>> For debugging I do really need something quick and dirty that lets me
>> add my own printk
>> of whatever structs I want anywhere in the kernel without crashing it.
>> That's exactly what bpf tracing filters do.
>
> I recommend you to use perf-probe. That will give you an easy solution. :)

it is indeed very cool.
Thanks!

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-06  1:20           ` Andi Kleen
                               ` (2 preceding siblings ...)
  2013-12-06  5:46             ` Jovi Zhangwei
@ 2013-12-07  1:12             ` Alexei Starovoitov
  2013-12-07 16:53               ` Jovi Zhangwei
  3 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-07  1:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen <andi@firstfloor.org> wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>>
>> Not to mention that in that case we might as well -- since we need a
>> compiler anyway -- generate the machine code in user space; the JIT
>> solution really only is useful if it can provide something that we can't
>> do otherwise, e.g. enable it in secure boot environments.
>
> I can see there may be some setups which don't have a compiler
> (e.g. I know some people don't use systemtap because of that)
> But this needs a custom gcc install too as far as I understand.

fyi the custom gcc is a single 13M binary. It doesn't depend on any
include files or any libraries,
and can be easily packaged together with perf... even for embedded environments.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-06 23:58         ` Masami Hiramatsu
@ 2013-12-07 16:21           ` Jovi Zhangwei
  2013-12-09  4:59             ` Masami Hiramatsu
  0 siblings, 1 reply; 65+ messages in thread
From: Jovi Zhangwei @ 2013-12-07 16:21 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Alexei Starovoitov, Andi Kleen, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Tom Zanussi,
	Eric Dumazet, LKML

On Sat, Dec 7, 2013 at 7:58 AM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
> (2013/12/06 14:19), Jovi Zhangwei wrote:
>> Hi Alexei,
>>
>> On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>>>>
>>>>> Can you do some performance comparison compared to e.g. ktap?
>>>>> How much faster is it?
>>>
>>> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
>>> trace skb:kfree_skb {
>>>         if (arg2 == 0x100) {
>>>                 printf("%x %x\n", arg1, arg2)
>>>         }
>>> }
>>> 1M skb alloc/free 350315 (usecs)
>>>
>>> baseline without any tracing:
>>> 1M skb alloc/free 145400 (usecs)
>>>
>>> then equivalent bpf test:
>>> void filter(struct bpf_context *ctx)
>>> {
>>>         void *loc = (void *)ctx->regs.dx;
>>>         if (loc == 0x100) {
>>>                 struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>>                 char fmt[] = "skb %p loc %p\n";
>>>                 bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>>         }
>>> }
>>> 1M skb alloc/free 183214 (usecs)
>>>
>>> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>>>
>>> obviously ktap is an interpreter, so it's not really fair.
>>>
>>> To make it really unfair I did:
>>> trace skb:kfree_skb {
>>>         if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
>>>             arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
>>>             arg2 == 0x900 || arg2 == 0x1000) {
>>>                 printf("%x %x\n", arg1, arg2)
>>>         }
>>> }
>>> 1M skb alloc/free 484280 (usecs)
>>>
>>> and corresponding bpf:
>>> void filter(struct bpf_context *ctx)
>>> {
>>>         void *loc = (void *)ctx->regs.dx;
>>>         if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 ||
>>>             loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 ||
>>>             loc == 0x900 || loc == 0x1000) {
>>>                 struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>>                 char fmt[] = "skb %p loc %p\n";
>>>                 bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>>         }
>>> }
>>> 1M skb alloc/free 185660 (usecs)
>>>
>>> the difference is bigger now: 484-145 vs 185-145
>>>
>> There have big differences for compare arg2(in ktap) with direct register
>> access(ctx->regs.dx).
>>
>> The current argument fetching(arg2 in above testcase) implementation in ktap
>> is very inefficient, see ktap/interpreter/lib_kdebug.c:kp_event_getarg.
>> The only way to speedup is kernel tracing code change, let external tracing
>> module access event field not through list lookup. This work is not
>> started yet. :)
>
> I'm not sure why you can't access it directly from ftrace-event buffer.
> There is just a packed data structure and it is exposed via debugfs.
> You can decode it and can get an offset/size by using libtraceevent.
>
Then it means the event field info needs to be passed into the kernel through
the trunk, which looks strange because the kernel structure is the source of the
event field info; it's like a loop-back, and it needs to engage with libtraceevent
in userspace.
(The side effect is that it will make compilation slower and consume more memory;
sometimes one script will process 20K events, like 'trace
probe:big_dso:*'.)

So "the only way" which I said is wrong; your approach indeed is another way.
I just think that maybe using an array instead of a list for event fields would be
more efficient, if the list is not strictly needed. We can check it more in the future.
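
Roughly, the point is the difference between a per-hit name lookup and an
index resolved once at load time. A toy sketch (not the actual ftrace/ktap
code; the struct and helpers are made up for illustration):

        #include <string.h>

        struct field { const char *name; int offset; int size; };

        /* name based: strcmp over the field table on every event hit */
        static const struct field *find_by_name(const struct field *f, int n,
                                                const char *name)
        {
                int i;

                for (i = 0; i < n; i++)
                        if (strcmp(f[i].name, name) == 0)
                                return &f[i];
                return NULL;
        }

        /* index based: the name is resolved once, then it is just f[idx] */
        static const struct field *find_by_index(const struct field *f, int idx)
        {
                return &f[idx];
        }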

Thanks.

Jovi.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-07  1:12             ` Alexei Starovoitov
@ 2013-12-07 16:53               ` Jovi Zhangwei
  0 siblings, 0 replies; 65+ messages in thread
From: Jovi Zhangwei @ 2013-12-07 16:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andi Kleen, H. Peter Anvin, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, Thomas Gleixner, Masami Hiramatsu, Tom Zanussi,
	Eric Dumazet, LKML

On Sat, Dec 7, 2013 at 9:12 AM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> "H. Peter Anvin" <hpa@zytor.com> writes:
>>>
>>> Not to mention that in that case we might as well -- since we need a
>>> compiler anyway -- generate the machine code in user space; the JIT
>>> solution really only is useful if it can provide something that we can't
>>> do otherwise, e.g. enable it in secure boot environments.
>>
>> I can see there may be some setups which don't have a compiler
>> (e.g. I know some people don't use systemtap because of that)
>> But this needs a custom gcc install too as far as I understand.
>
> fyi custom gcc is a single 13M binary. It doesn't depend on any
> include files or any libraries.
> and can be easily packaged together with perf... even for embedded environment.

Hmm, a 13M binary is big IMO; perf is just 5M after compiling on my system.
I'm not sure embedding a custom gcc into perf is a good idea. (And would we
need to compile that custom gcc every time we build perf?)

IMO gcc size is not the whole/main reason why embedded systems don't
install it. I have seen many, many production embedded systems, and none of
them install gcc, or gdb, etc. I would never expect Android to install
gcc some day, and I would be really surprised if a telecom vendor delivered a
Linux board with gcc installed to customers.

Another question: does the custom gcc for the bpf filter need kernel
header files for compilation? If it does, then this issue is even bigger
than gcc size for embedded systems. (Same problem as Systemtap.)

Thanks,

Jovi.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 4/5] use BPF in tracing filters
  2013-12-06 23:48               ` Masami Hiramatsu
@ 2013-12-08 18:22                 ` Frank Ch. Eigler
  2013-12-09 10:12                   ` Masami Hiramatsu
  0 siblings, 1 reply; 65+ messages in thread
From: Frank Ch. Eigler @ 2013-12-08 18:22 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Jovi Zhangwei, Alexei Starovoitov, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Tom Zanussi,
	Eric Dumazet, LKML, yrl.pp-manager.tt


masami.hiramatsu.pt wrote:

> [...]
> Anyway, as far as I can see, there looks be two different models of
> tracing in our mind.
>
> A) Fixed event based tracing: In this model, there are several fixed
> "events" which well defined with fixed arguments. tracer handles these
> events and only use limited arguments. It's like a packet stream
> processing. ftrace, perf etc. are used this model.
>
> B) Flexible event-point tracing: In this model, each tracer(or even
> trace user) can freely define their own event, there will be some fixed
> tracing points defined, but arguments are defined by users. It's like a
> debugger's breakpoint debugging. systemtap, ktap etc. are used this model.

It may be more useful to think of it as a contrast along the
hard-coded versus programmable axis.  (perf, systemtap, and ktap can
each reach to some extent across your "fixed" vs "flexible" line.
Each has some dynamic and some static-tracepoint capability.)


> e.g. B model has a good flexibility and A model is easy to use for
> beginners.

I don't think it's the model that dictates ease-of-use, but the
quality of implementation, logistics, documentation, and examples.


- FChE

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-07 16:21           ` Jovi Zhangwei
@ 2013-12-09  4:59             ` Masami Hiramatsu
  0 siblings, 0 replies; 65+ messages in thread
From: Masami Hiramatsu @ 2013-12-09  4:59 UTC (permalink / raw)
  To: Jovi Zhangwei
  Cc: Alexei Starovoitov, Andi Kleen, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Tom Zanussi,
	Eric Dumazet, LKML

(2013/12/08 1:21), Jovi Zhangwei wrote:
> On Sat, Dec 7, 2013 at 7:58 AM, Masami Hiramatsu
> <masami.hiramatsu.pt@hitachi.com> wrote:
>> (2013/12/06 14:19), Jovi Zhangwei wrote:
>>> Hi Alexei,
>>>
>>> On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>>>>>
>>>>>> Can you do some performance comparison compared to e.g. ktap?
>>>>>> How much faster is it?
>>>>
>>>> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
>>>> trace skb:kfree_skb {
>>>>         if (arg2 == 0x100) {
>>>>                 printf("%x %x\n", arg1, arg2)
>>>>         }
>>>> }
>>>> 1M skb alloc/free 350315 (usecs)
>>>>
>>>> baseline without any tracing:
>>>> 1M skb alloc/free 145400 (usecs)
>>>>
>>>> then equivalent bpf test:
>>>> void filter(struct bpf_context *ctx)
>>>> {
>>>>         void *loc = (void *)ctx->regs.dx;
>>>>         if (loc == 0x100) {
>>>>                 struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>>>                 char fmt[] = "skb %p loc %p\n";
>>>>                 bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>>>         }
>>>> }
>>>> 1M skb alloc/free 183214 (usecs)
>>>>
>>>> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>>>>
>>>> obviously ktap is an interpreter, so it's not really fair.
>>>>
>>>> To make it really unfair I did:
>>>> trace skb:kfree_skb {
>>>>         if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
>>>>             arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
>>>>             arg2 == 0x900 || arg2 == 0x1000) {
>>>>                 printf("%x %x\n", arg1, arg2)
>>>>         }
>>>> }
>>>> 1M skb alloc/free 484280 (usecs)
>>>>
>>>> and corresponding bpf:
>>>> void filter(struct bpf_context *ctx)
>>>> {
>>>>         void *loc = (void *)ctx->regs.dx;
>>>>         if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 ||
>>>>             loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 ||
>>>>             loc == 0x900 || loc == 0x1000) {
>>>>                 struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>>>                 char fmt[] = "skb %p loc %p\n";
>>>>                 bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>>>         }
>>>> }
>>>> 1M skb alloc/free 185660 (usecs)
>>>>
>>>> the difference is bigger now: 484-145 vs 185-145
>>>>
>>> There have big differences for compare arg2(in ktap) with direct register
>>> access(ctx->regs.dx).
>>>
>>> The current argument fetching(arg2 in above testcase) implementation in ktap
>>> is very inefficient, see ktap/interpreter/lib_kdebug.c:kp_event_getarg.
>>> The only way to speedup is kernel tracing code change, let external tracing
>>> module access event field not through list lookup. This work is not
>>> started yet. :)
>>
>> I'm not sure why you can't access it directly from ftrace-event buffer.
>> There is just a packed data structure and it is exposed via debugfs.
>> You can decode it and can get an offset/size by using libtraceevent.
>>
> Then it means there need pass the event field info into kernel through trunk,
> it looks strange because the kernel structure is the source of event field info,
> it's like loop-back, and need to engage with libtraceevent in userspace.

No, the static trace events have their own kernel data structures, but
the dynamic events don't. They expose the data format (offset/type)
via debugfs, but do not define a new data structure.
So, I meant it is enough for the script to take an offset and cast
to the corresponding size.
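
In C terms that is just something like this (a minimal little-endian sketch,
not ktap code; 'rec' is a pointer to the raw event record):

        #include <string.h>

        /* read a 'size'-byte field at 'offset' from a raw event record */
        static unsigned long long get_field(const void *rec, int offset, int size)
        {
                unsigned long long v = 0;

                memcpy(&v, (const char *)rec + offset, size);
                return v;
        }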

> (the side effect is it will make compilation slow, and consume more memory,
> sometimes it will process 20K events in one script, like 'trace
> probe:big_dso:*')

I doubt it, since you only need to get the formats for the events that
the script is using.

> So "the only way" which I said is wrong, your approach indeed is another way.
> I just think maybe use array instead of list for event fields would be more
> efficient if list is not must needed. we can check it more in future.

Ah, perhaps I misunderstood the ktap implementation. Does it define dynamic
events right before loading a bytecode? In that case, I recommend you
change the loader to adjust the bytecode after defining the events, tuning the
offset information so that it fits the target event format.

e.g.
 1) compile a bytecode with dummy offsets
 2) define the new additional dynamic events
 3) get the field offset information from the events
 4) modify the bytecode in memory to replace the dummy offsets with the correct ones
 5) load the bytecode

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-04  1:13       ` Masami Hiramatsu
@ 2013-12-09  7:29         ` Namhyung Kim
  2013-12-09  9:51           ` Masami Hiramatsu
  0 siblings, 1 reply; 65+ messages in thread
From: Namhyung Kim @ 2013-12-09  7:29 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Alexei Starovoitov, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel, Linus Torvalds, Andrew Morton,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi, Pekka Enberg, David S. Miller, Arjan van de Ven,
	Christoph Hellwig, Oleg Nesterov

Hi Masami,

On Wed, 04 Dec 2013 10:13:37 +0900, Masami Hiramatsu wrote:
> (2013/12/04 3:26), Alexei Starovoitov wrote:
>> the only inconvenience so far is to know how parameters are getting
>> into registers.
>> on x86-64, arg1 is in rdi, arg2 is in rsi,... I want to improve that
>> after first step is done.
>
> Actually, that part is done by the perf-probe and ftrace dynamic events
> (kernel/trace/trace_probe.c). I think this generic BPF is good for
> re-implementing fetch methods. :)

For implementing the fetch method, it seems that it needs to access user
memory, the stack and/or current (task_struct - for utask or vma later) from
the BPF VM as well.  Is it OK from the security perspective?

Anyway, I'll take a look at it later if I have time, but I want to get
the existing/pending implementation merged first. :)

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-09  7:29         ` Namhyung Kim
@ 2013-12-09  9:51           ` Masami Hiramatsu
  0 siblings, 0 replies; 65+ messages in thread
From: Masami Hiramatsu @ 2013-12-09  9:51 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Alexei Starovoitov, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel, Linus Torvalds, Andrew Morton,
	Frédéric Weisbecker, Arnaldo Carvalho de Melo,
	Tom Zanussi, Pekka Enberg, David S. Miller, Arjan van de Ven,
	Christoph Hellwig, Oleg Nesterov, yrl.pp-manager.tt

(2013/12/09 16:29), Namhyung Kim wrote:
> Hi Masami,
> 
> On Wed, 04 Dec 2013 10:13:37 +0900, Masami Hiramatsu wrote:
>> (2013/12/04 3:26), Alexei Starovoitov wrote:
>>> the only inconvenience so far is to know how parameters are getting
>>> into registers.
>>> on x86-64, arg1 is in rdi, arg2 is in rsi,... I want to improve that
>>> after first step is done.
>>
>> Actually, that part is done by the perf-probe and ftrace dynamic events
>> (kernel/trace/trace_probe.c). I think this generic BPF is good for
>> re-implementing fetch methods. :)
> 
> For implementing the fetch method, it seems that it needs to access user
> memory, the stack and/or current (task_struct - for utask or vma later) from
> the BPF VM as well.  Is it OK from the security perspective?

Do you mean security or safety?  :)
For safety, I think we can check that the BPF binary doesn't break anything.
Anyway, for the fetch method, I think we have to make a generic syntax tree
for the archs which don't support BPF, and the BPF bytecode will be generated
from the syntax tree. IOW, I'd like to use BPF just for optimizing the
memory address calculation.
For security, it is hard to check what information is sensitive
in the kernel, so I think it should be restricted to the root user for a while.

> Anyway, I'll take a look at it later if I have time, but I want to get
> the existing/pending implementation merged first. :)

Yes, of course ! :)

Thank you,
-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 4/5] use BPF in tracing filters
  2013-12-08 18:22                 ` Frank Ch. Eigler
@ 2013-12-09 10:12                   ` Masami Hiramatsu
  0 siblings, 0 replies; 65+ messages in thread
From: Masami Hiramatsu @ 2013-12-09 10:12 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Jovi Zhangwei, Alexei Starovoitov, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, H. Peter Anvin, Thomas Gleixner, Tom Zanussi,
	Eric Dumazet, LKML, yrl.pp-manager.tt

(2013/12/09 3:22), Frank Ch. Eigler wrote:
> 
> masami.hiramatsu.pt wrote:
> 
>> [...]
>> Anyway, as far as I can see, there looks be two different models of
>> tracing in our mind.
>>
>> A) Fixed event based tracing: In this model, there are several fixed
>> "events" which well defined with fixed arguments. tracer handles these
>> events and only use limited arguments. It's like a packet stream
>> processing. ftrace, perf etc. are used this model.
>>
>> B) Flexible event-point tracing: In this model, each tracer(or even
>> trace user) can freely define their own event, there will be some fixed
>> tracing points defined, but arguments are defined by users. It's like a
>> debugger's breakpoint debugging. systemtap, ktap etc. are used this model.
> 
> It may be more useful to think of it as a contrast along the
> hard-coded versus programmable axis.  (perf, systemtap, and ktap can
> each reach to some extent across your "fixed" vs "flexible" line.
> Each has some dynamic and some static-tracepoint capability.)

Oh, I meant that B tends not to share the defined events among
different tracing instances. Each instance defines new, different
dynamic events and accesses memory and registers freely.
OTOH, the Ftrace and LTT models are based on fixed, shared
and well defined events. Even if a new dynamic event is defined,
it will be shared by every instance.

> 
>> e.g. B model has a good flexibility and A model is easy to use for
>> beginners.
> 
> I don't think it's the model that dictates ease-of-use, but the
> quality of implementation, logistics, documentation, and examples.

Of course, but it requires learning a new way of programming. And
also, we need to know about the target source code to set up
new events. I know that systemtap provides many pre-defined
probe points, so systemtap may already have solved this kind of
issue. ;)

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-06  4:49               ` Alexei Starovoitov
@ 2013-12-10 15:47                 ` Ingo Molnar
  2013-12-11  2:32                   ` Alexei Starovoitov
  0 siblings, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2013-12-10 15:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Andi Kleen, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel


* Alexei Starovoitov <ast@plumgrid.com> wrote:

> > I'm fine if it becomes a requirement to have a vmlinux built with 
> > DEBUG_INFO to use BPF and have a tool like perf to translate the 
> > filters. But it that must not replace what the current filters do 
> > now. That is, it can be an add on, but not a replacement.
> 
> Of course. tracing filters via bpf is an additional tool for kernel 
> debugging. bpf by itself has use cases beyond tracing.

Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for 
most people.

Would it be possible to make BPF filters recognize exposed details
like the current filters do, without depending on the vmlinux?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-10 15:47                 ` Ingo Molnar
@ 2013-12-11  2:32                   ` Alexei Starovoitov
  2013-12-11  3:35                     ` Masami Hiramatsu
  0 siblings, 1 reply; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-11  2:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Andi Kleen, Peter Zijlstra, H. Peter Anvin,
	Thomas Gleixner, Masami Hiramatsu, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Tue, Dec 10, 2013 at 7:47 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> > I'm fine if it becomes a requirement to have a vmlinux built with
>> > DEBUG_INFO to use BPF and have a tool like perf to translate the
>> > filters. But it that must not replace what the current filters do
>> > now. That is, it can be an add on, but not a replacement.
>>
>> Of course. tracing filters via bpf is an additional tool for kernel
>> debugging. bpf by itself has use cases beyond tracing.
>
> Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for
> most people.

there is a misunderstanding here.
I was saying 'of course' to 'not replace current filter infra'.

bpf does not depend on debug info.
That's the key difference between 'perf probe' approach and bpf filters.

Masami is right that what I was trying to achieve with bpf filters
is similar to 'perf probe': insert a dynamic probe anywhere
in the kernel, walk pointers, data structures, print interesting stuff.

'perf probe' does it via scanning vmlinux with debug info.
bpf filters don't need it.
tools/bpf/trace/*_orig.c examples only depend on linux headers
in /lib/modules/../build/include/
Today bpf compiler struct layout is the same as x86_64.

Tomorrow bpf compiler will have flags to adjust endianness, pointer size, etc
of the front-end. Similar to -m32/-m64 and -m*-endian flags.
Neat part is that I don't need to do any work, just enable it properly in
the bpf backend. From gcc/llvm point of view, bpf is yet another 'hw'
architecture that compiler is emitting code for.
So when C code of filter_ex1_orig.c does 'skb->dev', compiler determines
field offset by looking at /lib/modules/.../include/skbuff.h
whereas for 'perf probe' 'skb->dev' means walk debug info.

Something like: cc1 -mlayout_x86_64 filter.c will produce bpf code that
walks all data structures in the same way x86_64 does it.
Even if the user makes a mistake and uses -mlayout_aarch64, it won't crash.
Note that all -m* flags will be in one compiler. It won't grow any bigger
because of that. All of it already supported by C front-ends.
It may sound complex, but really very little code for the bpf backend.

I didn't look inside systemtap/ktap enough to say how much they're
relying on presence of debug info to make a comparison.

I see two main use cases for bpf tracing filters: debugging live kernel
and collecting stats. Same tricks that [sk]tap do with their maps.
Or may be some of the stats that 'perf record' collects in userspace
can be collected by bpf filter in kernel and stored into generic bpf table?

> Would it be possible to make BPF filters recognize exposed details
> like the current filters do, without depending on the vmlinux?

Well, if you say that the presence of linux headers is also too much to ask,
I can hook bpf in after the probes have stored all the args.

This way the current simple filter syntax can move to userspace:
'arg1==x || arg2!=y' can be parsed by userspace, bpf code
generated and fed into the kernel. It will be faster than walk_pred_tree(),
but if we cannot remove 2k lines from trace_events_filter.c
because of backward compatibility, extra performance becomes
the only reason to have two different implementations.
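
Just to illustrate the shape of such generated code (a rough sketch;
bpf_load_arg() is a made-up helper standing in for however the stored
args would actually be accessed):

void filter(struct bpf_context *ctx)
{
        /* hypothetical helper: fetch the Nth argument the probe stored */
        long arg1 = bpf_load_arg(ctx, 1);
        long arg2 = bpf_load_arg(ctx, 2);

        if (arg1 == 0x100 || arg2 != 0x200) {
                /* predicate matched: the event would be kept */
        }
        /* otherwise it would be discarded */
}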

Another use case is to optimize the fetch sequences of dynamic probes
as Masami suggested, but the backward compatibility requirement
would preserve two ways of doing it as well.

imo the current hook of bpf into tracing is more compelling, but let me
think more about reusing data stored in the ring buffer.

Thanks
Alexei

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-11  2:32                   ` Alexei Starovoitov
@ 2013-12-11  3:35                     ` Masami Hiramatsu
  2013-12-12  2:48                       ` Alexei Starovoitov
  0 siblings, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2013-12-11  3:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Andi Kleen, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

(2013/12/11 11:32), Alexei Starovoitov wrote:
> On Tue, Dec 10, 2013 at 7:47 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> * Alexei Starovoitov <ast@plumgrid.com> wrote:
>>
>>>> I'm fine if it becomes a requirement to have a vmlinux built with
>>>> DEBUG_INFO to use BPF and have a tool like perf to translate the
>>>> filters. But it that must not replace what the current filters do
>>>> now. That is, it can be an add on, but not a replacement.
>>>
>>> Of course. tracing filters via bpf is an additional tool for kernel
>>> debugging. bpf by itself has use cases beyond tracing.
>>
>> Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for
>> most people.
> 
> there is a misunderstanding here.
> I was saying 'of course' to 'not replace current filter infra'.
> 
> bpf does not depend on debug info.
> That's the key difference between 'perf probe' approach and bpf filters.
> 
> Masami is right that what I was trying to achieve with bpf filters
> is similar to 'perf probe': insert a dynamic probe anywhere
> in the kernel, walk pointers, data structures, print interesting stuff.
> 
> 'perf probe' does it via scanning vmlinux with debug info.
> bpf filters don't need it.
> tools/bpf/trace/*_orig.c examples only depend on linux headers
> in /lib/modules/../build/include/
> Today bpf compiler struct layout is the same as x86_64.
> 
> Tomorrow bpf compiler will have flags to adjust endianness, pointer size, etc
> of the front-end. Similar to -m32/-m64 and -m*-endian flags.
> Neat part is that I don't need to do any work, just enable it properly in
> the bpf backend. From gcc/llvm point of view, bpf is yet another 'hw'
> architecture that compiler is emitting code for.
> So when C code of filter_ex1_orig.c does 'skb->dev', compiler determines
> field offset by looking at /lib/modules/.../include/skbuff.h
> whereas for 'perf probe' 'skb->dev' means walk debug info.

Right, the offset within the data structure can be obtained from the header etc.

However, how would the bpf get the register or stack assignment of
skb itself? In the tracepoint macro, it will be able to get it from
function parameters (it needs a trick, like jprobe does).
I doubt you can do that on kprobes/uprobes without any debuginfo
support. :(

And is it possible to trace a field in a data structure which is
defined locally in somewhere.c ? :) (maybe it's just a corner case)

> Something like: cc1 -mlayout_x86_64 filter.c will produce bpf code that
> walks all data structures in the same way x86_64 does it.
> Even if the user makes a mistake and uses -mlayout_aarch64, it won't crash.
> Note that all -m* flags will be in one compiler. It won't grow any bigger
> because of that. All of it already supported by C front-ends.
> It may sound complex, but really very little code for the bpf backend.
> 
> I didn't look inside systemtap/ktap enough to say how much they're
> relying on presence of debug info to make a comparison.
> 
> I see two main use cases for bpf tracing filters: debugging live kernel
> and collecting stats. Same tricks that [sk]tap do with their maps.
> Or may be some of the stats that 'perf record' collects in userspace
> can be collected by bpf filter in kernel and stored into generic bpf table?
> 
>> Would it be possible to make BPF filters recognize exposed details
>> like the current filters do, without depending on the vmlinux?
> 
> Well, if you say that presence of linux headers is also too much to ask,
> I can hook bpf after probes stored all the args.
> 
> This way current simple filter syntax can move to userspace.
> 'arg1==x || arg2!=y' can be parsed by userspace, bpf code
> generated and fed into kernel. It will be faster than walk_pred_tree(),
> but if we cannot remove 2k lines from trace_events_filter.c
> because of backward compatibility, extra performance becomes
> the only reason to have two different implementations.
> 
> Another use case is to optimize fetch sequences of dynamic probes
> as Masami suggested, but backward compatibility requirement
> would preserve to ways of doing it as well.

The backward compatibility issue is only for the interface, but not
for the implementation, I think. :) The fetch method and filter
pred do already parse the argument into a syntax tree. IMHO, bpf
can optimize that tree to just a simple opcode stream.

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH tip 0/5] tracing filters with BPF
  2013-12-11  3:35                     ` Masami Hiramatsu
@ 2013-12-12  2:48                       ` Alexei Starovoitov
  0 siblings, 0 replies; 65+ messages in thread
From: Alexei Starovoitov @ 2013-12-12  2:48 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Steven Rostedt, Andi Kleen, Peter Zijlstra,
	H. Peter Anvin, Thomas Gleixner, Tom Zanussi, Jovi Zhangwei,
	Eric Dumazet, linux-kernel

On Tue, Dec 10, 2013 at 7:35 PM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
> (2013/12/11 11:32), Alexei Starovoitov wrote:
>> On Tue, Dec 10, 2013 at 7:47 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>>
>>> * Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>
>>>>> I'm fine if it becomes a requirement to have a vmlinux built with
>>>>> DEBUG_INFO to use BPF and have a tool like perf to translate the
>>>>> filters. But it that must not replace what the current filters do
>>>>> now. That is, it can be an add on, but not a replacement.
>>>>
>>>> Of course. tracing filters via bpf is an additional tool for kernel
>>>> debugging. bpf by itself has use cases beyond tracing.
>>>
>>> Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for
>>> most people.
>>
>> there is a misunderstanding here.
>> I was saying 'of course' to 'not replace current filter infra'.
>>
>> bpf does not depend on debug info.
>> That's the key difference between 'perf probe' approach and bpf filters.
>>
>> Masami is right that what I was trying to achieve with bpf filters
>> is similar to 'perf probe': insert a dynamic probe anywhere
>> in the kernel, walk pointers, data structures, print interesting stuff.
>>
>> 'perf probe' does it via scanning vmlinux with debug info.
>> bpf filters don't need it.
>> tools/bpf/trace/*_orig.c examples only depend on linux headers
>> in /lib/modules/../build/include/
>> Today bpf compiler struct layout is the same as x86_64.
>>
>> Tomorrow bpf compiler will have flags to adjust endianness, pointer size, etc
>> of the front-end. Similar to -m32/-m64 and -m*-endian flags.
>> Neat part is that I don't need to do any work, just enable it properly in
>> the bpf backend. From gcc/llvm point of view, bpf is yet another 'hw'
>> architecture that compiler is emitting code for.
>> So when C code of filter_ex1_orig.c does 'skb->dev', compiler determines
>> field offset by looking at /lib/modules/.../include/skbuff.h
>> whereas for 'perf probe' 'skb->dev' means walk debug info.
>
> Right, the offset in the data structure can be obtained from the header etc.
>
> However, how would the bpf get the register or stack assignment of
> skb itself? In the tracepoint macro, it will be able to get it from
> function parameters (it needs a trick, like jprobe does).
> I doubt you can do that on kprobes/uprobes without any debuginfo
> support. :(

the 4/5 diff actually shows how it's working ;)
For kprobes it works at the function entry, since the arguments are still
in the registers, and it walks the pointers further down from there.
It cannot do func+line_number as perf-probe does, of course.
For tracepoints it's the same trick: call a non-inlined func with the
traceprobe args and call the inlined crash_setup_regs() that stores the regs.

Of course, there are limitations. For example, the 7th function argument
goes onto the stack and requires more work to get out. If a struct is not
defined in a .h, it would need to be redefined in filter.c.
Corner cases, as you said.
Today the user of a bpf filter needs to know that arg1 goes into %rdi and
so on; that is easy to clean up.
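
As a rough sketch of what such a filter looks like today (illustration only,
not one of the tools/bpf/trace examples; it assumes a kprobe at the entry of
a function that takes a struct sk_buff * as its first argument, and it
assumes bpf_context exposes the saved %rdi as ctx->regs.di):

void filter(struct bpf_context *ctx)
{
        struct sk_buff *skb;

        /* on x86_64 the first argument arrives in %rdi; the probe has
         * saved the registers into ctx before calling the filter
         * (regs.di as the field name is an assumption here)
         */
        skb = (struct sk_buff *)ctx->regs.di;
        if (skb && skb->dev) {
                char fmt[] = "skb %p dev %p\n";
                bpf_trace_printk(fmt, skb, skb->dev, 0, 0);
        }
}

A wrong register choice just reads the wrong data; the static checks and
guarded loads keep it from crashing the kernel, as discussed earlier in
the thread.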

>> Another use case is to optimize fetch sequences of dynamic probes
>> as Masami suggested, but the backward compatibility requirement
>> would preserve two ways of doing it as well.
>
> The backward compatibility issue is only for the interface, but not
> for the implementation, I think. :) The fetch method and filter
> pred already parse the argument into a syntax tree. IMHO, bpf
> can optimize that tree to just a simple opcode stream.

ahh. yes. that's doable.

Thanks
Alexei


end of thread

Thread overview: 65+ messages
2013-12-03  4:28 [RFC PATCH tip 0/5] tracing filters with BPF Alexei Starovoitov
2013-12-03  4:28 ` [RFC PATCH tip 1/5] Extended BPF core framework Alexei Starovoitov
2013-12-03  4:28 ` [RFC PATCH tip 2/5] Extended BPF JIT for x86-64 Alexei Starovoitov
2013-12-03  4:28 ` [RFC PATCH tip 3/5] Extended BPF (64-bit BPF) design document Alexei Starovoitov
2013-12-03 17:01   ` H. Peter Anvin
2013-12-03 19:59     ` Alexei Starovoitov
2013-12-03 20:41       ` Frank Ch. Eigler
2013-12-03 21:31         ` Alexei Starovoitov
2013-12-04  9:24           ` Ingo Molnar
2013-12-03  4:28 ` [RFC PATCH tip 4/5] use BPF in tracing filters Alexei Starovoitov
2013-12-04  0:48   ` Masami Hiramatsu
2013-12-04  1:11     ` Steven Rostedt
2013-12-05  0:05       ` Masami Hiramatsu
2013-12-05  5:11         ` Alexei Starovoitov
2013-12-06  8:43           ` Masami Hiramatsu
2013-12-06 10:05             ` Jovi Zhangwei
2013-12-06 23:48               ` Masami Hiramatsu
2013-12-08 18:22                 ` Frank Ch. Eigler
2013-12-09 10:12                   ` Masami Hiramatsu
2013-12-03  4:28 ` [RFC PATCH tip 5/5] tracing filter examples in BPF Alexei Starovoitov
2013-12-04  0:35   ` Jonathan Corbet
2013-12-04  1:21     ` Alexei Starovoitov
2013-12-03  9:16 ` [RFC PATCH tip 0/5] tracing filters with BPF Ingo Molnar
2013-12-03 15:33   ` Steven Rostedt
2013-12-03 18:26     ` Alexei Starovoitov
2013-12-04  1:13       ` Masami Hiramatsu
2013-12-09  7:29         ` Namhyung Kim
2013-12-09  9:51           ` Masami Hiramatsu
2013-12-03 18:06   ` Alexei Starovoitov
2013-12-04  9:34     ` Ingo Molnar
2013-12-04 17:36       ` Alexei Starovoitov
2013-12-05 10:38         ` Ingo Molnar
2013-12-06  5:43           ` Alexei Starovoitov
2013-12-03 10:34 ` Masami Hiramatsu
2013-12-04  0:01 ` Andi Kleen
2013-12-04  3:09   ` Alexei Starovoitov
2013-12-05  4:40     ` Alexei Starovoitov
2013-12-05 10:41       ` Ingo Molnar
2013-12-05 13:46         ` Steven Rostedt
2013-12-05 22:36           ` Alexei Starovoitov
2013-12-05 23:37             ` Steven Rostedt
2013-12-06  4:49               ` Alexei Starovoitov
2013-12-10 15:47                 ` Ingo Molnar
2013-12-11  2:32                   ` Alexei Starovoitov
2013-12-11  3:35                     ` Masami Hiramatsu
2013-12-12  2:48                       ` Alexei Starovoitov
2013-12-05 16:11       ` Frank Ch. Eigler
2013-12-05 19:43         ` Alexei Starovoitov
2013-12-06  0:14       ` Andi Kleen
2013-12-06  1:10         ` H. Peter Anvin
2013-12-06  1:20           ` Andi Kleen
2013-12-06  1:28             ` H. Peter Anvin
2013-12-06 21:43               ` Frank Ch. Eigler
2013-12-06  5:16             ` Alexei Starovoitov
2013-12-06 23:54               ` Masami Hiramatsu
2013-12-07  1:01                 ` Alexei Starovoitov
2013-12-06  5:46             ` Jovi Zhangwei
2013-12-07  1:12             ` Alexei Starovoitov
2013-12-07 16:53               ` Jovi Zhangwei
2013-12-06  5:19       ` Jovi Zhangwei
2013-12-06 23:58         ` Masami Hiramatsu
2013-12-07 16:21           ` Jovi Zhangwei
2013-12-09  4:59             ` Masami Hiramatsu
2013-12-06  6:17       ` Jovi Zhangwei
2013-12-05 16:31   ` Frank Ch. Eigler
