* [RFC PATCH bpf-next 0/4] Introduce xdp_call.h and the BPF dispatcher
@ 2019-11-13 20:47 Björn Töpel
  2019-11-13 20:47 ` [RFC PATCH bpf-next 1/4] bpf: teach bpf_arch_text_poke() jumps Björn Töpel
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Björn Töpel @ 2019-11-13 20:47 UTC (permalink / raw)
  To: netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon

This RFC(!) introduces the BPF dispatcher and xdp_call.h: a mechanism
that avoids the retpoline overhead by text-poking/rewriting indirect
calls into direct calls.

The ideas build on Alexei's V3 of the BPF trampoline work, namely:
  * Use the existing BPF JIT infrastructure to generate code
  * Use bpf_arch_text_poke() to modify the kernel text

To try the series out, you'll need V3 of the BPF trampoline work [1].

The main idea: each XDP call-site calls into the jited dispatch table
instead of making an indirect call. The dispatch table then calls the
XDP programs directly. In pseudo code it looks something like this:

unsigned int do_call(struct bpf_prog *prog, struct xdp_buff *xdp)
{
	if (prog == PROG1)
		return call_direct_PROG1(xdp);
	if (prog == PROG2)
		return call_direct_PROG2(xdp);
	return indirect_call(prog, xdp);
}

The current dispatcher supports four entries. It could support more,
but I don't know if that's really practical (...and I was lazy -- more
than 4 entries meant moving to >1B Jcc offsets. :-P). The dispatcher is
re-generated every time an XDP program/entry is added or removed. The
upper limit of four in this series means that if six i40e netdevs have
an XDP program running, the fifth and sixth will be using an indirect
call.
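
For reference, the jited dispatch code for two entries A < B looks
roughly like this (a hand-written sketch, not the exact emitted
sequence; the bpf_func pointer arrives in %rdx):

	jmp	dispatch	/* skip over the retpoline fallback */
  fallback:
	jmp	__x86_indirect_thunk_rdx
  dispatch:
	movabs	$A, %rax
	cmp	%rax, %rdx	/* cmp rdx, rax */
	jg	try_B		/* prog > A */
	jne	fallback	/* prog < A: not in the table */
	jmp	A		/* direct jmp, no retpoline */
  try_B:
	movabs	$B, %rax
	cmp	%rax, %rdx
	jne	fallback
	jmp	B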

Now to the performance numbers. I ran this on my 3 GHz Skylake, with
64B UDP packets sent to the i40e at ~40 Mpps.

Benchmark:
  # ./xdp_rxq_info --dev enp134s0f0 --action XDP_DROP

  1. Baseline:              26.0 Mpps
  2. Dispatcher 1 entry:    35.5 Mpps (+36.5%)
  3. Dispatcher 4 entries:  32.9 Mpps (+26.5%)
  4. Dispatcher 5 entries:  24.2 Mpps (-6.9%)

In scenario 4, the benchmark uses the dispatcher, but the table is
full. This means that the caller pays for the dispatching *and* the
retpoline.

Is this a good idea? The performance is nice! Can it be done in a
better way? Useful for other BPF programs? I would love your input!


Thanks!
Björn

[1] https://patchwork.ozlabs.org/cover/1191672/

Björn Töpel (4):
  bpf: teach bpf_arch_text_poke() jumps
  bpf: introduce BPF dispatcher
  xdp: introduce xdp_call
  i40e: start using xdp_call.h

 arch/x86/net/bpf_jit_comp.c                 | 130 ++++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_main.c |   5 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |   5 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  |   5 +-
 include/linux/bpf.h                         |   3 +
 include/linux/xdp_call.h                    |  49 +++++
 kernel/bpf/Makefile                         |   1 +
 kernel/bpf/dispatcher.c                     | 197 ++++++++++++++++++++
 8 files changed, 388 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/xdp_call.h
 create mode 100644 kernel/bpf/dispatcher.c

-- 
2.20.1



* [RFC PATCH bpf-next 1/4] bpf: teach bpf_arch_text_poke() jumps
  2019-11-13 20:47 [RFC PATCH bpf-next 0/4] Introduce xdp_call.h and the BPF dispatcher Björn Töpel
@ 2019-11-13 20:47 ` Björn Töpel
  2019-11-13 20:47 ` [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher Björn Töpel
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Björn Töpel @ 2019-11-13 20:47 UTC (permalink / raw)
  To: netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon

From: Björn Töpel <bjorn.topel@intel.com>

The BPF dispatcher, introduced in a later commit, hijacks a trampoline
function. This commit teaches the text poker to emit jmpq in addition
to callq.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 arch/x86/net/bpf_jit_comp.c | 34 +++++++++++++++++++++++++++++-----
 include/linux/bpf.h         |  3 +++
 2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 79157d886a3e..28782a1c386e 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -481,7 +481,7 @@ static void emit_stx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
 	*pprog = prog;
 }
 
-static int emit_call(u8 **pprog, void *func, void *ip)
+static int emit_call_jmp(u8 **pprog, void *func, void *ip, u8 insn)
 {
 	u8 *prog = *pprog;
 	int cnt = 0;
@@ -492,17 +492,28 @@ static int emit_call(u8 **pprog, void *func, void *ip)
 		pr_err("Target call %p is out of range\n", func);
 		return -EINVAL;
 	}
-	EMIT1_off32(0xE8, offset);
+	EMIT1_off32(insn, offset);
 	*pprog = prog;
 	return 0;
 }
 
+static int emit_call(u8 **pprog, void *func, void *ip)
+{
+	return emit_call_jmp(pprog, func, ip, 0xE8);
+}
+
+/* Emits tail-call */
+static int emit_jmp(u8 **pprog, void *func, void *ip)
+{
+	return emit_call_jmp(pprog, func, ip, 0xE9);
+}
+
 int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 		       void *old_addr, void *new_addr)
 {
 	u8 old_insn[X86_CALL_SIZE] = {};
 	u8 new_insn[X86_CALL_SIZE] = {};
-	u8 *prog;
+	u8 *prog, insn;
 	int ret;
 
 	if (!is_kernel_text((long)ip) &&
@@ -510,31 +521,44 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 		/* BPF trampoline in modules is not supported */
 		return -EINVAL;
 
+	switch (t) {
+	case BPF_MOD_NOP_TO_CALL:
+	case BPF_MOD_CALL_TO_CALL:
+	case BPF_MOD_CALL_TO_NOP:
+		insn = 0xE8;
+		break;
+	default:
+		insn = 0xE9;
+	}
+
 	if (old_addr) {
 		prog = old_insn;
-		ret = emit_call(&prog, old_addr, (void *)ip);
+		ret = emit_call_jmp(&prog, old_addr, (void *)ip, insn);
 		if (ret)
 			return ret;
 	}
 	if (new_addr) {
 		prog = new_insn;
-		ret = emit_call(&prog, new_addr, (void *)ip);
+		ret = emit_call_jmp(&prog, new_addr, (void *)ip, insn);
 		if (ret)
 			return ret;
 	}
 	ret = -EBUSY;
 	mutex_lock(&text_mutex);
 	switch (t) {
+	case BPF_MOD_NOP_TO_JMP:
 	case BPF_MOD_NOP_TO_CALL:
 		if (memcmp(ip, ideal_nops[NOP_ATOMIC5], X86_CALL_SIZE))
 			goto out;
 		text_poke(ip, new_insn, X86_CALL_SIZE);
 		break;
+	case BPF_MOD_JMP_TO_JMP:
 	case BPF_MOD_CALL_TO_CALL:
 		if (memcmp(ip, old_insn, X86_CALL_SIZE))
 			goto out;
 		text_poke(ip, new_insn, X86_CALL_SIZE);
 		break;
+	case BPF_MOD_JMP_TO_NOP:
 	case BPF_MOD_CALL_TO_NOP:
 		if (memcmp(ip, old_insn, X86_CALL_SIZE))
 			goto out;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6a80af092048..38b0715050a9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1270,6 +1270,9 @@ enum bpf_text_poke_type {
 	BPF_MOD_NOP_TO_CALL,
 	BPF_MOD_CALL_TO_CALL,
 	BPF_MOD_CALL_TO_NOP,
+	BPF_MOD_NOP_TO_JMP,
+	BPF_MOD_JMP_TO_JMP,
+	BPF_MOD_JMP_TO_NOP,
 };
 int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 		       void *addr1, void *addr2);
-- 
2.20.1



* [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-13 20:47 [RFC PATCH bpf-next 0/4] Introduce xdp_call.h and the BPF dispatcher Björn Töpel
  2019-11-13 20:47 ` [RFC PATCH bpf-next 1/4] bpf: teach bpf_arch_text_poke() jumps Björn Töpel
@ 2019-11-13 20:47 ` Björn Töpel
  2019-11-13 21:40   ` Edward Cree
                     ` (3 more replies)
  2019-11-13 20:47 ` [RFC PATCH bpf-next 3/4] xdp: introduce xdp_call Björn Töpel
  2019-11-13 20:47 ` [RFC PATCH bpf-next 4/4] i40e: start using xdp_call.h Björn Töpel
  3 siblings, 4 replies; 23+ messages in thread
From: Björn Töpel @ 2019-11-13 20:47 UTC (permalink / raw)
  To: netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon

From: Björn Töpel <bjorn.topel@intel.com>

The BPF dispatcher builds on top of the BPF trampoline ideas: use
bpf_arch_text_poke() and (re-)use the BPF JIT to generate the code.
The dispatcher builds a dispatch table for XDP programs, for
retpoline avoidance. The table lookup is a simple binary search,
i.e. O(log n). Here, the dispatch table is limited to four entries
(for laziness reasons -- only 1B relative jumps :-P). If the dispatch
table is full, it falls back to the retpoline path.

An example: A module/driver allocates a dispatcher. The dispatcher is
shared for all netdevs. Each netdev allocates a slot in the dispatcher
for its BPF program. The netdev then uses the dispatcher to call the
correct program with a direct call (actually a tail-call).
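
In pseudo assembly, the dispatcher hijacks the trampoline's __fentry__
nop (a sketch of the mechanism, assuming the usual 5-byte nop as the
patch point):

  /* before any program is added: */
  dispatcher_trampoline:
	nop5					/* __fentry__ pad */
	...indirect (retpoline) call of bpf_func...

  /* after the first program is added (BPF_MOD_NOP_TO_JMP): */
  dispatcher_trampoline:
	jmp	<jited dispatch table>		/* tail-call; the retpoline
						 * path below is now
						 * unreachable */
	...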

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 arch/x86/net/bpf_jit_comp.c |  96 ++++++++++++++++++
 kernel/bpf/Makefile         |   1 +
 kernel/bpf/dispatcher.c     | 197 ++++++++++++++++++++++++++++++++++++
 3 files changed, 294 insertions(+)
 create mode 100644 kernel/bpf/dispatcher.c

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 28782a1c386e..d75aebf508b8 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -10,10 +10,12 @@
 #include <linux/if_vlan.h>
 #include <linux/bpf.h>
 #include <linux/memory.h>
+#include <linux/sort.h>
 #include <asm/extable.h>
 #include <asm/set_memory.h>
 #include <asm/nospec-branch.h>
 #include <asm/text-patching.h>
+#include <asm/asm-prototypes.h>
 
 static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
 {
@@ -1471,6 +1473,100 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
 	return 0;
 }
 
+#if defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE)
+
+/* Emits the dispatcher. Id lookup is limited to BPF_DISPATCHER_MAX,
+ * so it'll fit into PAGE_SIZE/2. The lookup is binary search: O(log
+ * n).
+ */
+static int emit_bpf_dispatcher(u8 **pprog, int a, int b, u64 *progs,
+			       u8 *fb)
+{
+	u8 *prog = *pprog, *jg_reloc;
+	int pivot, err, cnt = 0;
+	s64 jmp_offset;
+
+	if (a == b) {
+		emit_mov_imm64(&prog, BPF_REG_0,	/* movabs func,%rax */
+			       progs[a] >> 32,
+			       (progs[a] << 32) >> 32);
+		EMIT3(0x48, 0x39, 0xC2);		/* cmp rdx, rax */
+		jmp_offset = fb - (prog + 2);
+		if (!is_imm8(jmp_offset))
+			return -1;
+		EMIT2(X86_JNE, jmp_offset);		/* jne retpoline */
+		err = emit_jmp(&prog, (void *)progs[a], /* jmp bpf_prog */
+			       prog);
+		if (err)
+			return err;
+		goto out;
+	}
+
+	pivot = (b - a) / 2;
+	emit_mov_imm64(&prog, BPF_REG_0, progs[a + pivot] >> 32,
+		       (progs[a + pivot] << 32) >> 32);
+	EMIT3(0x48, 0x39, 0xC2);			/* cmp rdx, rax */
+
+	jg_reloc = prog;
+	EMIT2(X86_JG, 0);				/* jg pivot + 1-part */
+
+	err = emit_bpf_dispatcher(&prog, a, a + pivot, progs, fb);
+	if (err)
+		return err;
+
+	jmp_offset = prog - (jg_reloc + 2);
+	if (!is_imm8(jmp_offset))
+		return -1;
+	emit_code(jg_reloc, X86_JG + (jmp_offset << 8), 2);
+
+	err = emit_bpf_dispatcher(&prog, a + pivot + 1, b, progs, fb);
+	if (err)
+		return err;
+out:
+	*pprog = prog;
+	return 0;
+}
+
+#endif
+
+static int cmp_ips(const void *a, const void *b)
+{
+	const u64 *ipa = a;
+	const u64 *ipb = b;
+
+	if (*ipa > *ipb)
+		return 1;
+	if (*ipa < *ipb)
+		return -1;
+	return 0;
+}
+
+#define BPF_DISPATCHER_MAX 4
+
+int arch_prepare_bpf_dispatcher(void *image, struct bpf_prog **progs,
+				int num_progs)
+{
+	u64 ips[BPF_DISPATCHER_MAX] = {};
+	u8 *fallback, *prog = image;
+	int i, err, cnt = 0;
+
+	if (!num_progs || num_progs > BPF_DISPATCHER_MAX)
+		return -EINVAL;
+
+	for (i = 0; i < num_progs; i++)
+		ips[i] = (u64)progs[i]->bpf_func;
+
+	EMIT2(0xEB, 5);	/* jmp rip+5 (skip retpoline) */
+	fallback = prog;
+	err = emit_jmp(&prog,	/* jmp retpoline */
+		       __x86_indirect_thunk_rdx, prog);
+	if (err)
+		return err;
+
+	sort(&ips[0], num_progs, sizeof(ips[i]), cmp_ips, NULL);
+	return emit_bpf_dispatcher(&prog, 0, num_progs - 1, &ips[0], fallback);
+}
+
 struct x64_jit_data {
 	struct bpf_binary_header *header;
 	int *addrs;
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 3f671bf617e8..d4f330351f87 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o
+obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
new file mode 100644
index 000000000000..691898640720
--- /dev/null
+++ b/kernel/bpf/dispatcher.c
@@ -0,0 +1,197 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2019 Intel Corporation. */
+
+#ifdef CONFIG_RETPOLINE
+
+#include <linux/hash.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+/* The BPF dispatcher is a multiway branch code generator. A user
+ * registers a slot (id) and can then update the BPF program for that
+ * slot. The dispatcher is jited, and will be rejited every time a
+ * slot is allocated/deallocated for performance reasons. An example:
+ * A module provides code for multiple netdevs. Each netdev can have
+ * one XDP program. The module code will allocate a dispatcher, and
+ * when the netdev enables XDP it allocates a new slot.
+ *
+ * Nothing like STATIC_CALL_INLINE is supported yet, so an explicit
+ * trampoline is needed:
+ *
+ *   unsigned int dispatcher_trampoline(void *ctx, void *insn, int id)
+ */
+
+#define DISPATCHER_HASH_BITS 10
+#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS)
+
+static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE];
+
+#define BPF_DISPATCHER_MAX 4
+
+struct bpf_dispatcher {
+	struct hlist_node hlist;
+	void *func;
+	struct bpf_prog *progs[BPF_DISPATCHER_MAX];
+	int num_progs;
+	void *image;
+	u64 selector;
+};
+
+static DEFINE_MUTEX(dispatcher_mutex);
+
+struct bpf_dispatcher *bpf_dispatcher_lookup(void *func)
+{
+	struct bpf_dispatcher *d;
+	struct hlist_head *head;
+	void *image;
+
+	head = &dispatcher_table[hash_ptr(func, DISPATCHER_HASH_BITS)];
+	hlist_for_each_entry(d, head, hlist) {
+		if (d->func == func)
+			return d;
+	}
+	d = kzalloc(sizeof(*d), GFP_KERNEL);
+	if (!d)
+		return NULL;
+
+	image = bpf_jit_alloc_exec(PAGE_SIZE);
+	if (!image) {
+		kfree(d);
+		return NULL;
+	}
+
+	d->func = func;
+	INIT_HLIST_NODE(&d->hlist);
+	hlist_add_head(&d->hlist, head);
+
+	set_vm_flush_reset_perms(image);
+	set_memory_x((long)image, 1);
+	d->image = image;
+	return d;
+}
+
+static void bpf_dispatcher_free(struct bpf_dispatcher *d)
+{
+	bpf_jit_free_exec(d->image);
+	hlist_del(&d->hlist);
+	kfree(d);
+}
+
+static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
+				   struct bpf_prog *prog)
+{
+	struct bpf_prog **entry = NULL;
+	int i, err = 0;
+
+	if (d->num_progs == BPF_DISPATCHER_MAX)
+		return err;
+
+	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
+		if (!entry && !d->progs[i])
+			entry = &d->progs[i];
+		if (d->progs[i] == prog)
+			return err;
+	}
+
+	prog = bpf_prog_inc(prog);
+	if (IS_ERR(prog))
+		return err;
+
+	*entry = prog;
+	d->num_progs++;
+	return err;
+}
+
+static void bpf_dispatcher_remove_prog(struct bpf_dispatcher *d,
+				       struct bpf_prog *prog)
+{
+	int i;
+
+	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
+		if (d->progs[i] == prog) {
+			bpf_prog_put(prog);
+			d->progs[i] = NULL;
+			d->num_progs--;
+			break;
+		}
+	}
+}
+
+int __weak arch_prepare_bpf_dispatcher(void *image, struct bpf_prog **progs,
+				       int num_ids)
+{
+	return -ENOTSUPP;
+}
+
+/* NB! bpf_dispatcher_update() might free the dispatcher! */
+static int bpf_dispatcher_update(struct bpf_dispatcher *d)
+{
+	void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
+	void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
+	int err;
+
+	if (d->num_progs == 0) {
+		err = bpf_arch_text_poke(d->func, BPF_MOD_JMP_TO_NOP,
+					 old_image, NULL);
+		bpf_dispatcher_free(d);
+		goto out;
+	}
+
+	err = arch_prepare_bpf_dispatcher(new_image, &d->progs[0],
+					  d->num_progs);
+	if (err)
+		goto out;
+
+	if (d->selector)
+		/* progs already running at this address */
+		err = bpf_arch_text_poke(d->func, BPF_MOD_JMP_TO_JMP,
+					 old_image, new_image);
+	else
+		/* first time registering */
+		err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JMP,
+					 NULL, new_image);
+
+	if (err)
+		goto out;
+	d->selector++;
+
+out:
+	return err;
+}
+
+void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
+				struct bpf_prog *to)
+{
+	struct bpf_dispatcher *d;
+
+	if (!from && !to)
+		return;
+
+	mutex_lock(&dispatcher_mutex);
+	d = bpf_dispatcher_lookup(func);
+	if (!d)
+		goto out;
+
+	if (from)
+		bpf_dispatcher_remove_prog(d, from);
+
+	if (to)
+		bpf_dispatcher_add_prog(d, to);
+
+	WARN_ON(bpf_dispatcher_update(d));
+
+out:
+	mutex_unlock(&dispatcher_mutex);
+}
+
+static int __init init_dispatchers(void)
+{
+	int i;
+
+	for (i = 0; i < DISPATCHER_TABLE_SIZE; i++)
+		INIT_HLIST_HEAD(&dispatcher_table[i]);
+	return 0;
+}
+late_initcall(init_dispatchers);
+
+#endif
-- 
2.20.1



* [RFC PATCH bpf-next 3/4] xdp: introduce xdp_call
  2019-11-13 20:47 [RFC PATCH bpf-next 0/4] Introduce xdp_call.h and the BPF dispatcher Björn Töpel
  2019-11-13 20:47 ` [RFC PATCH bpf-next 1/4] bpf: teach bpf_arch_text_poke() jumps Björn Töpel
  2019-11-13 20:47 ` [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher Björn Töpel
@ 2019-11-13 20:47 ` Björn Töpel
  2019-11-13 20:47 ` [RFC PATCH bpf-next 4/4] i40e: start using xdp_call.h Björn Töpel
  3 siblings, 0 replies; 23+ messages in thread
From: Björn Töpel @ 2019-11-13 20:47 UTC (permalink / raw)
  To: netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon

From: Björn Töpel <bjorn.topel@intel.com>

The xdp_call.h header wraps a more user-friendly API around the BPF
dispatcher. A user adds a trampoline/XDP caller using the
DEFINE_XDP_CALL macro, and updates the BPF dispatcher via
xdp_call_update(). The actual dispatch is done via xdp_call_run();
see the sketch below.

The next commit will showcase how i40e uses xdp_call.

When static_call, especially the inlined version, is upstreamed,
xdp_call.h will be converted to use it.
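
A driver would use the API roughly like this (a sketch with a
hypothetical "foo" driver; the next commit has the real i40e
conversion):

  DEFINE_XDP_CALL(foo_xdp_call);	/* once, e.g. in foo_main.c */

  /* XDP setup path, when swapping programs: */
  xdp_call_update(foo_xdp_call, old_prog, prog);

  /* hot Rx path, instead of bpf_prog_run_xdp(): */
  act = xdp_call_run(foo_xdp_call, xdp_prog, xdp);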

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/xdp_call.h | 49 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 include/linux/xdp_call.h

diff --git a/include/linux/xdp_call.h b/include/linux/xdp_call.h
new file mode 100644
index 000000000000..e736a4d3c961
--- /dev/null
+++ b/include/linux/xdp_call.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2019 Intel Corporation. */
+#ifndef _LINUX_XDP_CALL_H
+#define _LINUX_XDP_CALL_H
+
+#include <linux/filter.h>
+
+#if defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE)
+
+void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
+				struct bpf_prog *to);
+
+#define XDP_CALL_TRAMP(name)	____xdp_call_##name##_tramp
+
+#define DEFINE_XDP_CALL(name)						\
+	unsigned int XDP_CALL_TRAMP(name)(				\
+		const void *xdp_ctx,					\
+		const struct bpf_insn *insnsi,				\
+		unsigned int (*bpf_func)(const void *,			\
+					 const struct bpf_insn *))	\
+	{								\
+		return bpf_func(xdp_ctx, insnsi);			\
+	}
+
+#define DECLARE_XDP_CALL(name)						\
+	unsigned int XDP_CALL_TRAMP(name)(				\
+		const void *xdp_ctx,					\
+		const struct bpf_insn *insnsi,				\
+		unsigned int (*bpf_func)(const void *,			\
+					 const struct bpf_insn *))
+
+#define xdp_call_run(name, xdp_prog, xdp)				\
+	XDP_CALL_TRAMP(name)(xdp, xdp_prog->insnsi,			\
+			     xdp_prog->bpf_func)
+
+#define xdp_call_update(name, from_xdp_prog, to_xdp_prog)	\
+	bpf_dispatcher_change_prog(&XDP_CALL_TRAMP(name),	\
+				   from_xdp_prog,		\
+				   to_xdp_prog)
+
+#else /* !defined(CONFIG_BPF_JIT) || !defined(CONFIG_RETPOLINE) */
+
+#define DEFINE_XDP_CALL(name)
+#define DECLARE_XDP_CALL(name)
+#define xdp_call_run(name, xdp_prog, xdp) bpf_prog_run_xdp(xdp_prog, xdp)
+#define xdp_call_update(name, from_xdp_prog, to_xdp_prog)
+
+#endif /* defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE) */
+#endif /* _LINUX_XDP_CALL_H */
-- 
2.20.1



* [RFC PATCH bpf-next 4/4] i40e: start using xdp_call.h
  2019-11-13 20:47 [RFC PATCH bpf-next 0/4] Introduce xdp_call.h and the BPF dispatcher Björn Töpel
                   ` (2 preceding siblings ...)
  2019-11-13 20:47 ` [RFC PATCH bpf-next 3/4] xdp: introduce xdp_call Björn Töpel
@ 2019-11-13 20:47 ` Björn Töpel
  3 siblings, 0 replies; 23+ messages in thread
From: Björn Töpel @ 2019-11-13 20:47 UTC (permalink / raw)
  To: netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon

From: Björn Töpel <bjorn.topel@intel.com>

This commit starts using xdp_call.h and the BPF dispatcher to avoid
the retpoline overhead.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 5 +++++
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 5 ++++-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 5 ++++-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b3d7edbb1389..59b530e4198f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include <linux/of_net.h>
 #include <linux/pci.h>
 #include <linux/bpf.h>
+#include <linux/xdp_call.h>
 
 /* Local includes */
 #include "i40e.h"
@@ -12517,6 +12518,8 @@ static netdev_features_t i40e_features_check(struct sk_buff *skb,
 	return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
 }
 
+DEFINE_XDP_CALL(i40e_xdp_call);
+
 /**
  * i40e_xdp_setup - add/remove an XDP program
  * @vsi: VSI to changed
@@ -12552,6 +12555,8 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 	for (i = 0; i < vsi->num_queue_pairs; i++)
 		WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
 
+	xdp_call_update(i40e_xdp_call, old_prog, prog);
+
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index b8496037ef7f..34d7b15897a1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -3,6 +3,7 @@
 
 #include <linux/prefetch.h>
 #include <linux/bpf_trace.h>
+#include <linux/xdp_call.h>
 #include <net/xdp.h>
 #include "i40e.h"
 #include "i40e_trace.h"
@@ -2188,6 +2189,8 @@ int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, struct i40e_ring *xdp_ring)
 	return i40e_xmit_xdp_ring(xdpf, xdp_ring);
 }
 
+DECLARE_XDP_CALL(i40e_xdp_call);
+
 /**
  * i40e_run_xdp - run an XDP program
  * @rx_ring: Rx ring being processed
@@ -2209,7 +2212,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	act = xdp_call_run(i40e_xdp_call, xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
 		break;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index a05dfecdd9b4..c623eeaeb625 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -2,6 +2,7 @@
 /* Copyright(c) 2018 Intel Corporation. */
 
 #include <linux/bpf_trace.h>
+#include <linux/xdp_call.h>
 #include <net/xdp_sock.h>
 #include <net/xdp.h>
 
@@ -179,6 +180,8 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
 		i40e_xsk_umem_disable(vsi, qid);
 }
 
+DECLARE_XDP_CALL(i40e_xdp_call);
+
 /**
  * i40e_run_xdp_zc - Executes an XDP program on an xdp_buff
  * @rx_ring: Rx ring
@@ -202,7 +205,7 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
 	 * this path is enabled by setting an XDP program.
 	 */
 	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	act = xdp_call_run(i40e_xdp_call, xdp_prog, xdp);
 	offset = xdp->data - xdp->data_hard_start;
 
 	xdp->handle = xsk_umem_adjust_offset(umem, xdp->handle, offset);
-- 
2.20.1



* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-13 20:47 ` [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher Björn Töpel
@ 2019-11-13 21:40   ` Edward Cree
  2019-11-14  6:29     ` Björn Töpel
  2019-11-14 12:31   ` Toke Høiland-Jørgensen
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Edward Cree @ 2019-11-13 21:40 UTC (permalink / raw)
  To: Björn Töpel, netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon

On 13/11/2019 20:47, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The BPF dispatcher builds on top of the BPF trampoline ideas;
> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is a simple binary search model, so
> lookup is O(log n). Here, the dispatch table is limited to four
> entries (for laziness reason -- only 1B relative jumps :-P). If the
> dispatch table is full, it will fallback to the retpoline path.
> 
> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> and a BPF program. The netdev then uses the dispatcher to call the
> correct program with a direct call (actually a tail-call).
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
The first-come-first-served model for dispatcher slots might mean that
 a low-traffic user ends up getting priority while a higher-traffic
 user is stuck with the retpoline fallback.  Have you considered using
 a learning mechanism, like in my dynamic call RFC [1] earlier this
 year?  (Though I'm sure a better learning mechanism than the one I
 used there could be devised.)

> +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
> +				   struct bpf_prog *prog)
> +{
> +	struct bpf_prog **entry = NULL;
> +	int i, err = 0;
> +
> +	if (d->num_progs == BPF_DISPATCHER_MAX)
> +		return err;
> +
> +	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +		if (!entry && !d->progs[i])
> +			entry = &d->progs[i];
> +		if (d->progs[i] == prog)
> +			return err;
> +	}
> +
> +	prog = bpf_prog_inc(prog);
> +	if (IS_ERR(prog))
> +		return err;
> +
> +	*entry = prog;
> +	d->num_progs++;
> +	return err;
> +}
If I'm reading this function right, it always returns zero; was that
 the intention, and if so why isn't it void?

-Ed

[1] https://lkml.org/lkml/2019/2/1/948


* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-13 21:40   ` Edward Cree
@ 2019-11-14  6:29     ` Björn Töpel
  2019-11-14 10:18       ` Edward Cree
  0 siblings, 1 reply; 23+ messages in thread
From: Björn Töpel @ 2019-11-14  6:29 UTC (permalink / raw)
  To: Edward Cree
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon

On Wed, 13 Nov 2019 at 22:41, Edward Cree <ecree@solarflare.com> wrote:
>
> On 13/11/2019 20:47, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > The BPF dispatcher builds on top of the BPF trampoline ideas;
> > Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> > code. The dispatcher builds a dispatch table for XDP programs, for
> > retpoline avoidance. The table is a simple binary search model, so
> > lookup is O(log n). Here, the dispatch table is limited to four
> > entries (for laziness reason -- only 1B relative jumps :-P). If the
> > dispatch table is full, it will fallback to the retpoline path.
> >
> > An example: A module/driver allocates a dispatcher. The dispatcher is
> > shared for all netdevs. Each netdev allocate a slot in the dispatcher
> > and a BPF program. The netdev then uses the dispatcher to call the
> > correct program with a direct call (actually a tail-call).
> >
> > Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> The first-come-first-served model for dispatcher slots might mean that
>  a low-traffic user ends up getting priority while a higher-traffic
>  user is stuck with the retpoline fallback.  Have you considered using
>  a learning mechanism, like in my dynamic call RFC [1] earlier this
>  year?  (Though I'm sure a better learning mechanism than the one I
>  used there could be devised.)
>

My rationale was that this mechanism would almost exclusively be used
by physical HW NICs using XDP. My hunch was that the number of netdevs
would be ~4, and typically fewer using XDP, so a more sophisticated
mechanism didn't really make sense IMO. However, your approach is more
generic and doesn't require any arch-specific work. What was the
pushback on your work? I'll read up on the thread. I'm intrigued to
take your series for a spin!

> > +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
> > +                                struct bpf_prog *prog)
> > +{
> > +     struct bpf_prog **entry = NULL;
> > +     int i, err = 0;
> > +
> > +     if (d->num_progs == BPF_DISPATCHER_MAX)
> > +             return err;
> > +
> > +     for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> > +             if (!entry && !d->progs[i])
> > +                     entry = &d->progs[i];
> > +             if (d->progs[i] == prog)
> > +                     return err;
> > +     }
> > +
> > +     prog = bpf_prog_inc(prog);
> > +     if (IS_ERR(prog))
> > +             return err;
> > +
> > +     *entry = prog;
> > +     d->num_progs++;
> > +     return err;
> > +}
> If I'm reading this function right, it always returns zero; was that
>  the intention, and if so why isn't it void?
>

Ugh, yeah. In general the error handling should be improved in this
RFC. If it makes sense to move forward with this series, I'll make
sure to address that. Thanks for taking a look, and for the pointers
to your work!


Cheers,
Björn

> -Ed
>
> [1] https://lkml.org/lkml/2019/2/1/948


* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-14  6:29     ` Björn Töpel
@ 2019-11-14 10:18       ` Edward Cree
  2019-11-14 11:21         ` Björn Töpel
  0 siblings, 1 reply; 23+ messages in thread
From: Edward Cree @ 2019-11-14 10:18 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon

On 14/11/2019 06:29, Björn Töpel wrote:
> On Wed, 13 Nov 2019 at 22:41, Edward Cree <ecree@solarflare.com> wrote:
>> On 13/11/2019 20:47, Björn Töpel wrote:
>> The first-come-first-served model for dispatcher slots might mean that
>>  a low-traffic user ends up getting priority while a higher-traffic
>>  user is stuck with the retpoline fallback.  Have you considered using
>>  a learning mechanism, like in my dynamic call RFC [1] earlier this
>>  year?  (Though I'm sure a better learning mechanism than the one I
>>  used there could be devised.)
> My rationale was that this mechanism would almost exclusively be used
> by physical HW NICs using XDP. My hunch was that the number of netdevs
> would be ~4, and typically less using XDP, so a more sophisticated
> mechanism didn't really make sense IMO.
That seems reasonable in most cases, although I can imagine systems with
 a couple of four-port boards being a thing.  I suppose the netdevs are
 likely to all have the same XDP prog, though, and if I'm reading your
 code right it seems they'd share a slot in that case.

> However, your approach is more
> generic and doesn't require any arch specific work. What was the push
> back for your work?
Mainly that I couldn't demonstrate a performance benefit from the few
 call sites I annotated, and others working in the area felt that
 manual annotation wouldn't scale — Nadav Amit had a different approach
 [2] that used a GCC plugin to apply a dispatcher on an opt-out basis
 to all the indirect calls in the kernel; the discussion on that got
 bogged down in interactions between text patching and perf tracing
 which all went *waaaay* over my head.  AFAICT the static_call series I
 was depending on never got merged, and I'm not sure if anyone's still
 working on it.

-Ed

[2] https://lkml.org/lkml/2018/12/31/19


* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-14 10:18       ` Edward Cree
@ 2019-11-14 11:21         ` Björn Töpel
  2019-11-14 13:58           ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Björn Töpel @ 2019-11-14 11:21 UTC (permalink / raw)
  To: Edward Cree
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon, Peter Zijlstra

On Thu, 14 Nov 2019 at 11:19, Edward Cree <ecree@solarflare.com> wrote:
>
> On 14/11/2019 06:29, Björn Töpel wrote:
[...]
> > My rationale was that this mechanism would almost exclusively be used
> > by physical HW NICs using XDP. My hunch was that the number of netdevs
> > would be ~4, and typically less using XDP, so a more sophisticated
> > mechanism didn't really make sense IMO.
> That seems reasonable in most cases, although I can imagine systems with
>  a couple of four-port boards being a thing.  I suppose the netdevs are
>  likely to all have the same XDP prog, though, and if I'm reading your
>  code right it seems they'd share a slot in that case.
>

Yup, correct!

> > However, your approach is more
> > generic and doesn't require any arch specific work. What was the push
> > back for your work?
> Mainly that I couldn't demonstrate a performance benefit from the few
>  call sites I annotated, and others working in the area felt that
>  manual annotation wouldn't scale — Nadav Amit had a different approach
>  [2] that used a GCC plugin to apply a dispatcher on an opt-out basis
>  to all the indirect calls in the kernel; the discussion on that got
>  bogged down in interactions between text patching and perf tracing
>  which all went *waaaay* over my head.  AFAICT the static_call series I
>  was depending on never got merged, and I'm not sure if anyone's still
>  working on it.
>

Again, thanks for the pointers. PeterZ is (hopefully) still working on
the static_call stuff [3]. The static_call_inline variant would be a
good fit here, and maybe even using static_call as a patchpad/dispatcher
like you did is a better route. I will check out Nadav's work!


Björn

[3] https://lore.kernel.org/lkml/20191007090225.44108711.6@infradead.org/#r

> -Ed
>
> [2] https://lkml.org/lkml/2018/12/31/19


* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-13 20:47 ` [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher Björn Töpel
  2019-11-13 21:40   ` Edward Cree
@ 2019-11-14 12:31   ` Toke Høiland-Jørgensen
  2019-11-14 13:03     ` Daniel Borkmann
  2019-11-15  0:30   ` Alexei Starovoitov
  2019-11-18 19:36   ` Andrii Nakryiko
  3 siblings, 1 reply; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-14 12:31 UTC (permalink / raw)
  To: Björn Töpel, netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon

Björn Töpel <bjorn.topel@gmail.com> writes:

> From: Björn Töpel <bjorn.topel@intel.com>
>
> The BPF dispatcher builds on top of the BPF trampoline ideas;
> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is a simple binary search model, so
> lookup is O(log n). Here, the dispatch table is limited to four
> entries (for laziness reason -- only 1B relative jumps :-P). If the
> dispatch table is full, it will fallback to the retpoline path.

So it's O(log n) with n == 4? Have you compared the performance of just
doing four linear compare-and-jumps? Seems to me it may not be that big
of a difference for such a small N?
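
I.e., something like this (hand-wavy sketch):

  movabs $PROG1, %rax
  cmp    %rax, %rdx		/* prog pointer in %rdx */
  je     jmp_to_prog1
  ...same for PROG2..PROG4...
  jmp    retpoline_fallback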

> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> and a BPF program. The netdev then uses the dispatcher to call the
> correct program with a direct call (actually a tail-call).

Is it really accurate to call it a tail call? To me, that would imply
that it increments the tail call limit counter and all that? Isn't this
just a direct jump using the trampoline stuff?

-Toke



* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-14 12:31   ` Toke Høiland-Jørgensen
@ 2019-11-14 13:03     ` Daniel Borkmann
  2019-11-14 13:09       ` Toke Høiland-Jørgensen
  2019-11-14 13:56       ` Björn Töpel
  0 siblings, 2 replies; 23+ messages in thread
From: Daniel Borkmann @ 2019-11-14 13:03 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Björn Töpel, netdev, ast
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon

On 11/14/19 1:31 PM, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@gmail.com> writes:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> The BPF dispatcher builds on top of the BPF trampoline ideas;
>> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
>> code. The dispatcher builds a dispatch table for XDP programs, for
>> retpoline avoidance. The table is a simple binary search model, so
>> lookup is O(log n). Here, the dispatch table is limited to four
>> entries (for laziness reason -- only 1B relative jumps :-P). If the
>> dispatch table is full, it will fallback to the retpoline path.
> 
> So it's O(log n) with n == 4? Have you compared the performance of just
> doing four linear compare-and-jumps? Seems to me it may not be that big
> of a difference for such a small N?

Did you perform some microbenchmarks wrt search tree? Mainly wondering
since for code emission for switch/case statements, clang/gcc turns off
indirect calls entirely under retpoline, see [0] from back then.

>> An example: A module/driver allocates a dispatcher. The dispatcher is
>> shared for all netdevs. Each netdev allocate a slot in the dispatcher
>> and a BPF program. The netdev then uses the dispatcher to call the
>> correct program with a direct call (actually a tail-call).
> 
> Is it really accurate to call it a tail call? To me, that would imply
> that it increments the tail call limit counter and all that? Isn't this
> just a direct jump using the trampoline stuff?

Not meant in BPF context here, but more general [1].

(For actual BPF tail calls I have a series close to ready for getting
rid of most indirect calls which I'll post later today.)

Best,
Daniel

   [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9d57ef15cbe327fe54416dd194ee0ea66ae53a4
   [1] https://en.wikipedia.org/wiki/Tail_call


* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-14 13:03     ` Daniel Borkmann
@ 2019-11-14 13:09       ` Toke Høiland-Jørgensen
  2019-11-14 13:56       ` Björn Töpel
  1 sibling, 0 replies; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-14 13:09 UTC (permalink / raw)
  To: Daniel Borkmann, Björn Töpel, netdev, ast
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 11/14/19 1:31 PM, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>>> The BPF dispatcher builds on top of the BPF trampoline ideas;
>>> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
>>> code. The dispatcher builds a dispatch table for XDP programs, for
>>> retpoline avoidance. The table is a simple binary search model, so
>>> lookup is O(log n). Here, the dispatch table is limited to four
>>> entries (for laziness reason -- only 1B relative jumps :-P). If the
>>> dispatch table is full, it will fallback to the retpoline path.
>> 
>> So it's O(log n) with n == 4? Have you compared the performance of just
>> doing four linear compare-and-jumps? Seems to me it may not be that big
>> of a difference for such a small N?
>
> Did you perform some microbenchmarks wrt search tree? Mainly wondering
> since for code emission for switch/case statements, clang/gcc turns off
> indirect calls entirely under retpoline, see [0] from back then.

Yes, this was exactly the example I had in mind :)

>>> An example: A module/driver allocates a dispatcher. The dispatcher is
>>> shared for all netdevs. Each netdev allocate a slot in the dispatcher
>>> and a BPF program. The netdev then uses the dispatcher to call the
>>> correct program with a direct call (actually a tail-call).
>> 
>> Is it really accurate to call it a tail call? To me, that would imply
>> that it increments the tail call limit counter and all that? Isn't this
>> just a direct jump using the trampoline stuff?
>
> Not meant in BPF context here, but more general [1].

Ah, right, that makes more sense.

> (For actual BPF tail calls I have a series close to ready for getting
> rid of most indirect calls which I'll post later today.)

Cool!

-Toke



* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-14 13:03     ` Daniel Borkmann
  2019-11-14 13:09       ` Toke Høiland-Jørgensen
@ 2019-11-14 13:56       ` Björn Töpel
  2019-11-14 14:55         ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 23+ messages in thread
From: Björn Töpel @ 2019-11-14 13:56 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Toke Høiland-Jørgensen, Netdev, Alexei Starovoitov,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon

On Thu, 14 Nov 2019 at 14:03, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 11/14/19 1:31 PM, Toke Høiland-Jørgensen wrote:
> > Björn Töpel <bjorn.topel@gmail.com> writes:
> >> From: Björn Töpel <bjorn.topel@intel.com>
> >>
> >> The BPF dispatcher builds on top of the BPF trampoline ideas;
> >> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> >> code. The dispatcher builds a dispatch table for XDP programs, for
> >> retpoline avoidance. The table is a simple binary search model, so
> >> lookup is O(log n). Here, the dispatch table is limited to four
> >> entries (for laziness reason -- only 1B relative jumps :-P). If the
> >> dispatch table is full, it will fallback to the retpoline path.
> >
> > So it's O(log n) with n == 4? Have you compared the performance of just
> > doing four linear compare-and-jumps? Seems to me it may not be that big
> > of a difference for such a small N?
>
> Did you perform some microbenchmarks wrt search tree? Mainly wondering
> since for code emission for switch/case statements, clang/gcc turns off
> indirect calls entirely under retpoline, see [0] from back then.
>

As Toke stated, binsearch is not needed for 4 entries. I started out
with 16 (and explicit ids instead of pointers), and there it made more
sense. If folks think it's a good idea to move forward, then with 4
entries it makes sense to simplify the code generator, or maybe base
it on static_calls like Ed did.

As for ubenchmarks, I only compared 1 cmp vs 3 vs 4 + retpoline, as
stated in the cover. For a proper patch I can do a more in-depth
analysis. Or was there anything in particular you were looking for?

On switch/case code generation, there's a great paper [3] from the
2008 GCC dev summit ("A Superoptimizer Analysis of Multiway Branch
Code Generation").

[3] http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=968AE756567863243AC7B1728915861A?doi=10.1.1.602.1875&rep=rep1&type=pdf


> >> An example: A module/driver allocates a dispatcher. The dispatcher is
> >> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> >> and a BPF program. The netdev then uses the dispatcher to call the
> >> correct program with a direct call (actually a tail-call).
> >
> > Is it really accurate to call it a tail call? To me, that would imply
> > that it increments the tail call limit counter and all that? Isn't this
> > just a direct jump using the trampoline stuff?
>
> Not meant in BPF context here, but more general [1].
>
> (For actual BPF tail calls I have a series close to ready for getting
> rid of most indirect calls which I'll post later today.)
>

Thanks for the clarification, Daniel! (call vs jmp)

> Best,
> Daniel
>
>    [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9d57ef15cbe327fe54416dd194ee0ea66ae53a4
>    [1] https://en.wikipedia.org/wiki/Tail_call


* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-14 11:21         ` Björn Töpel
@ 2019-11-14 13:58           ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-11-14 13:58 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Edward Cree, Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon

On Thu, Nov 14, 2019 at 12:21:27PM +0100, Björn Töpel wrote:
> Again, thanks for the pointers. PeterZ is (hopefully) still working on
> the static_call stuff [3]. The static_call_inline would be a good fit
> here, and maybe even using static_call as a patchpad/dispatcher like
> you did is a better route. I will checkout Nadav's work!

Yes, I'll repost that once the current pile of text_poke and ftrace
changes lands.


* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-14 13:56       ` Björn Töpel
@ 2019-11-14 14:55         ` Toke Høiland-Jørgensen
  2019-11-14 15:03           ` Björn Töpel
  0 siblings, 1 reply; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-14 14:55 UTC (permalink / raw)
  To: Björn Töpel, Daniel Borkmann
  Cc: Netdev, Alexei Starovoitov, Björn Töpel, bpf,
	Magnus Karlsson, Karlsson, Magnus, Jonathan Lemon

Björn Töpel <bjorn.topel@gmail.com> writes:

> On Thu, 14 Nov 2019 at 14:03, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> On 11/14/19 1:31 PM, Toke Høiland-Jørgensen wrote:
>> > Björn Töpel <bjorn.topel@gmail.com> writes:
>> >> From: Björn Töpel <bjorn.topel@intel.com>
>> >>
>> >> The BPF dispatcher builds on top of the BPF trampoline ideas;
>> >> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
>> >> code. The dispatcher builds a dispatch table for XDP programs, for
>> >> retpoline avoidance. The table is a simple binary search model, so
>> >> lookup is O(log n). Here, the dispatch table is limited to four
>> >> entries (for laziness reason -- only 1B relative jumps :-P). If the
>> >> dispatch table is full, it will fallback to the retpoline path.
>> >
>> > So it's O(log n) with n == 4? Have you compared the performance of just
>> > doing four linear compare-and-jumps? Seems to me it may not be that big
>> > of a difference for such a small N?
>>
>> Did you perform some microbenchmarks wrt search tree? Mainly wondering
>> since for code emission for switch/case statements, clang/gcc turns off
>> indirect calls entirely under retpoline, see [0] from back then.
>>
>
> As Toke stated, binsearch is not needed for 4 entries. I started out
> with 16 (and explicit ids instead of pointers), and there it made more
> sense. If folks think it's a good idea to move forward -- and with 4
> entries, it makes sense to make the code generator easier, or maybe
> based on static_calls like Ed did.

I don't really have anything to back it up, but my hunch is that only
four entries is a limit that people are going to end up hitting. And
since the performance falls off quite a cliff after hitting that
limit, I do fear that this is something we will hear about quite
emphatically :)

-Toke



* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-14 14:55         ` Toke Høiland-Jørgensen
@ 2019-11-14 15:03           ` Björn Töpel
  2019-11-14 15:12             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 23+ messages in thread
From: Björn Töpel @ 2019-11-14 15:03 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Daniel Borkmann, Netdev, Alexei Starovoitov,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon

On Thu, 14 Nov 2019 at 15:55, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
[...]
> I don't really have anything to back it up, but my hunch is that only 4
> entries will end up being a limit that people are going to end up
> hitting. And since the performance falls off quite the cliff after
> hitting that limit, I do fear that this is something we will hear about
> quite emphatically :)
>

Hmm, maybe. I have 8 i40e netdevs on my test machine, all running XDP,
but I don't think that's normal for deployments. ;-)

Björn


* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-14 15:03           ` Björn Töpel
@ 2019-11-14 15:12             ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 23+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-14 15:12 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Daniel Borkmann, Netdev, Alexei Starovoitov,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon

Björn Töpel <bjorn.topel@gmail.com> writes:

> On Thu, 14 Nov 2019 at 15:55, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
> [...]
>> I don't really have anything to back it up, but my hunch is that only 4
>> entries will end up being a limit that people are going to end up
>> hitting. And since the performance falls off quite the cliff after
>> hitting that limit, I do fear that this is something we will hear about
>> quite emphatically :)
>>
>
> Hmm, maybe. I have 8 i40e netdevs on my test machine, all running XDP,
> but I don't think that's normal for deployments. ;-)

Well, the fact that products like this exist kinda indicates that this
is something people are doing, no?

https://small-tree.net/products/10gbe-switches/p3e10g-6-sr/

-Toke



* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-13 20:47 ` [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher Björn Töpel
  2019-11-13 21:40   ` Edward Cree
  2019-11-14 12:31   ` Toke Høiland-Jørgensen
@ 2019-11-15  0:30   ` Alexei Starovoitov
  2019-11-15  7:56     ` Björn Töpel
  2019-11-18 19:36   ` Andrii Nakryiko
  3 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2019-11-15  0:30 UTC (permalink / raw)
  To: Björn Töpel
  Cc: netdev, ast, daniel, Björn Töpel, bpf, magnus.karlsson,
	magnus.karlsson, jonathan.lemon

On Wed, Nov 13, 2019 at 09:47:35PM +0100, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The BPF dispatcher builds on top of the BPF trampoline ideas;
> Introduce bpf_arch_text_poke() and (re-)use the BPF JIT generate
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is a simple binary search model, so
> lookup is O(log n). Here, the dispatch table is limited to four
> entries (for laziness reason -- only 1B relative jumps :-P). If the
> dispatch table is full, it will fallback to the retpoline path.
> 
> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each netdev allocate a slot in the dispatcher
> and a BPF program. The netdev then uses the dispatcher to call the
> correct program with a direct call (actually a tail-call).
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  arch/x86/net/bpf_jit_comp.c |  96 ++++++++++++++++++
>  kernel/bpf/Makefile         |   1 +
>  kernel/bpf/dispatcher.c     | 197 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 294 insertions(+)
>  create mode 100644 kernel/bpf/dispatcher.c
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 28782a1c386e..d75aebf508b8 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -10,10 +10,12 @@
>  #include <linux/if_vlan.h>
>  #include <linux/bpf.h>
>  #include <linux/memory.h>
> +#include <linux/sort.h>
>  #include <asm/extable.h>
>  #include <asm/set_memory.h>
>  #include <asm/nospec-branch.h>
>  #include <asm/text-patching.h>
> +#include <asm/asm-prototypes.h>
>  
>  static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
>  {
> @@ -1471,6 +1473,100 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
>  	return 0;
>  }
>  
> +#if defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE)
> +
> +/* Emits the dispatcher. Id lookup is limited to BPF_DISPATCHER_MAX,
> + * so it'll fit into PAGE_SIZE/2. The lookup is binary search: O(log
> + * n).
> + */
> +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, u64 *progs,
> +			       u8 *fb)
> +{
> +	u8 *prog = *pprog, *jg_reloc;
> +	int pivot, err, cnt = 0;
> +	s64 jmp_offset;
> +
> +	if (a == b) {
> +		emit_mov_imm64(&prog, BPF_REG_0,	/* movabs func,%rax */
> +			       progs[a] >> 32,
> +			       (progs[a] << 32) >> 32);

Could you try optimizing emit_mov_imm64() to recognize s32?
iirc there was a single x86 insn that could move and sign extend.
That should cut down on the bytecode size and probably make things a bit faster?
Another alternative is to compare the lower 32 bits only, since on x86-64
the upper 32 should be ~0 anyway for bpf prog pointers.
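
Something like this, iirc (from memory, so double-check the encodings):

  48 b8 <imm64>     movabs $imm64,%rax   /* 10 bytes */
  48 c7 c0 <imm32>  mov    $imm32,%rax   /* 7 bytes, imm32 sign-extended to 64 bit */
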
Looking at the bookkeeping code, I think I should be able to generalize the
bpf trampoline a bit and share the code for bpf dispatch.
Could you also try aligning the jmp targets a bit by inserting nops?
Some x86 cpus are sensitive to jmp target alignment. Even without considering
the JCC bug it could be helpful, especially since we're talking about XDP/AF_XDP
here, which will be pushing millions of calls through the bpf dispatch.



* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-15  0:30   ` Alexei Starovoitov
@ 2019-11-15  7:56     ` Björn Töpel
  2019-11-15 21:58       ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Björn Töpel @ 2019-11-15  7:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon, Toke Høiland-Jørgensen, Edward Cree

On Fri, 15 Nov 2019 at 01:30, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
[...]
>
> Could you try optimizing emit_mov_imm64() to recognize s32 ?
> iirc there was a single x86 insns that could move and sign extend.
> That should cut down on bytecode size and probably make things a bit faster?
> Another alternative is compare lower 32-bit only, since on x86-64 upper 32
> should be ~0 anyway for bpf prog pointers.

Good ideas, thanks! I'll do the optimization, extend it to >4 entries
(as Toke suggested), and do a non-RFC respin.

> Looking at bookkeeping code, I think I should be able to generalize bpf
> trampoline a bit and share the code for bpf dispatch.

Ok, good!

> Could you also try aligning jmp target a bit by inserting nops?
> Some x86 cpus are sensitive to jmp target alignment. Even without considering
> JCC bug it could be helpful. Especially since we're talking about XDP/AF_XDP
> here that will be pushing millions of calls through bpf dispatch.
>

Yeah, I need to address the Jcc bug anyway, so that makes sense.

Another thought; I'm using the fentry nop as the patch point, so it won't
play nice with other users of fentry atm -- but the plan is to move to
Steve's *_ftrace_direct work at some point, correct?


Björn


* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-15  7:56     ` Björn Töpel
@ 2019-11-15 21:58       ` Alexei Starovoitov
  2019-11-18 10:03         ` Björn Töpel
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2019-11-15 21:58 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon, Toke Høiland-Jørgensen, Edward Cree

On Fri, Nov 15, 2019 at 08:56:59AM +0100, Björn Töpel wrote:
> On Fri, 15 Nov 2019 at 01:30, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> [...]
> >
> > Could you try optimizing emit_mov_imm64() to recognize s32?
> > IIRC there is a single x86 insn that can move and sign-extend.
> > That should cut down the emitted code size and probably make things a bit faster?
> > Another alternative is to compare the lower 32 bits only, since on x86-64 the
> > upper 32 bits should be ~0 anyway for bpf prog pointers.
> 
> Good ideas, thanks! I'll do the optimization, extend it to >4 entries
> (as Toke suggested), and do a non-RFC respin.
> 
> > Looking at the bookkeeping code, I think I should be able to generalize the bpf
> > trampoline a bit and share the code with the bpf dispatcher.
> 
> Ok, good!
> 
> > Could you also try aligning the jmp targets a bit by inserting nops?
> > Some x86 CPUs are sensitive to jmp target alignment. Even without considering the
> > JCC bug it could be helpful, especially since we're talking about XDP/AF_XDP here,
> > which will be pushing millions of calls through the bpf dispatcher.
> >
> 
> Yeah, I need to address the Jcc bug anyway, so that makes sense.
> 
> Another thought: I'm using the fentry nop as the patch point, so it won't
> play nice with other users of fentry at the moment -- but the plan is to move
> to Steve's *_ftrace_direct work at some point, correct?

Yes. I'll start playing with reg/mod/unreg_ftrace_direct on Monday.
Steven has a bunch more in his tree for merging, so I cannot just pull
all of the ftrace API features into bpf-next. So "be nice to other fentry users"
would have to be done during the merge window, or shortly after it closes, in
the bpf-next tree. I think it's fine. In the bpf dispatcher case it's really
one dummy function we're talking about. If it had been marked 'notrace'
from the get-go, no one would blink. It's a dummy function: not interesting
for ftrace-ing and not interesting from a live-patching point of view.
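
Roughly (sketch; register_ftrace_direct() and friends are from
Steven's tree, so the exact names/signatures may still shift):

        /* attach: let ftrace own the __fentry__ nop <-> jmp flip */
        err = register_ftrace_direct((unsigned long)d->func,
                                     (unsigned long)image);

        /* detach */
        err = unregister_ftrace_direct((unsigned long)d->func,
                                       (unsigned long)image);

instead of poking the fentry site behind ftrace's back with
bpf_arch_text_poke().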


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-15 21:58       ` Alexei Starovoitov
@ 2019-11-18 10:03         ` Björn Töpel
  0 siblings, 0 replies; 23+ messages in thread
From: Björn Töpel @ 2019-11-18 10:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon, Toke Høiland-Jørgensen, Edward Cree

On Fri, 15 Nov 2019 at 22:58, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
[...]
> > Another thought: I'm using the fentry nop as the patch point, so it won't
> > play nice with other users of fentry at the moment -- but the plan is to move
> > to Steve's *_ftrace_direct work at some point, correct?
>
> Yes. I'll start playing with reg/mod/unreg_ftrace_direct on Monday.
> Steven has a bunch more in his tree for merging, so I cannot just pull
> all of the ftrace API features into bpf-next. So "be nice to other fentry users"
> would have to be done during the merge window, or shortly after it closes, in
> the bpf-next tree. I think it's fine.

Yup, I agree.

> In the bpf dispatcher case it's really
> one dummy function we're talking about. If it had been marked 'notrace'
> from the get-go, no one would blink. It's a dummy function: not interesting
> for ftrace-ing and not interesting from a live-patching point of view.
>

...but marking it with 'notrace' would remove the __fentry__ nop.
Anyway, the "be nice" approach is OK.
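
For reference, the whole patch point is just (sketch, modulo naming):

        /* Compiled with -mfentry: starts with a 5-byte nop that
         * bpf_arch_text_poke() later turns into 'jmp <image>'.
         */
        unsigned int bpf_dispatcher_func(const void *ctx,
                                         const struct bpf_insn *insnsi,
                                         unsigned int (*bpf_func)(const void *,
                                                                  const struct bpf_insn *))
        {
                return bpf_func(ctx, insnsi);
        }

i.e. 'notrace' would remove exactly the nop we patch.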

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-13 20:47 ` [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher Björn Töpel
                     ` (2 preceding siblings ...)
  2019-11-15  0:30   ` Alexei Starovoitov
@ 2019-11-18 19:36   ` Andrii Nakryiko
  2019-11-18 20:11     ` Björn Töpel
  3 siblings, 1 reply; 23+ messages in thread
From: Andrii Nakryiko @ 2019-11-18 19:36 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Networking, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Magnus Karlsson,
	Jonathan Lemon

On Wed, Nov 13, 2019 at 12:48 PM Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> From: Björn Töpel <bjorn.topel@intel.com>
>
> The BPF dispatcher builds on top of the BPF trampoline ideas; it
> uses bpf_arch_text_poke() and (re-)uses the BPF JIT to generate the
> code. The dispatcher builds a dispatch table for XDP programs, for
> retpoline avoidance. The table is searched with a simple binary
> search, so lookup is O(log n). Here, the dispatch table is limited
> to four entries (for laziness reasons -- only 1B relative jumps
> :-P). If the dispatch table is full, it will fall back to the
> retpoline path.
> 
> An example: A module/driver allocates a dispatcher. The dispatcher
> is shared for all netdevs. Each netdev allocates a slot in the
> dispatcher for its BPF program. The netdev then uses the dispatcher
> to call the correct program with a direct call (actually a
> tail-call).
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  arch/x86/net/bpf_jit_comp.c |  96 ++++++++++++++++++
>  kernel/bpf/Makefile         |   1 +
>  kernel/bpf/dispatcher.c     | 197 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 294 insertions(+)
>  create mode 100644 kernel/bpf/dispatcher.c
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 28782a1c386e..d75aebf508b8 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -10,10 +10,12 @@
>  #include <linux/if_vlan.h>
>  #include <linux/bpf.h>
>  #include <linux/memory.h>
> +#include <linux/sort.h>
>  #include <asm/extable.h>
>  #include <asm/set_memory.h>
>  #include <asm/nospec-branch.h>
>  #include <asm/text-patching.h>
> +#include <asm/asm-prototypes.h>
>

[...]

> +
> +int arch_prepare_bpf_dispatcher(void *image, struct bpf_prog **progs,
> +                               int num_progs)
> +{
> +       u64 ips[BPF_DISPATCHER_MAX] = {};
> +       u8 *fallback, *prog = image;
> +       int i, err, cnt = 0;
> +
> +       if (!num_progs || num_progs > BPF_DISPATCHER_MAX)
> +               return -EINVAL;
> +
> +       for (i = 0; i < num_progs; i++)
> +               ips[i] = (u64)progs[i]->bpf_func;
> +
> +       EMIT2(0xEB, 5); /* jmp rip+5 (skip retpoline) */
> +       fallback = prog;
> +       err = emit_jmp(&prog,   /* jmp retpoline */
> +                      __x86_indirect_thunk_rdx, prog);
> +       if (err)
> +               return err;
> +
> +       sort(&ips[0], num_progs, sizeof(ips[i]), cmp_ips, NULL);

nit: sizeof(ips[i]) looks weird... sizeof(ips[0]) (or sizeof(*ips)) would read better.

> +       return emit_bpf_dispatcher(&prog, 0, num_progs - 1, &ips[0], fallback);
> +}
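
Side note: emit_bpf_dispatcher() (not shown in this hunk) emits a
divide-and-conquer compare tree over the sorted addresses. A plain-C
model of the generated lookup (sketch -- the real thing is emitted
x86-64, where a leaf hit becomes a direct jmp and a miss takes the
retpoline set up above) would be:

        /* ips[] sorted ascending, lo/hi inclusive; returns the
         * direct-call target, or 0 for the indirect fallback
         */
        static u64 dispatch(u64 func, const u64 *ips, int lo, int hi)
        {
                int mid;

                if (lo == hi)
                        return func == ips[lo] ? ips[lo] : 0;
                mid = (lo + hi + 1) / 2;
                if (func < ips[mid])    /* cmp + jb */
                        return dispatch(func, ips, lo, mid - 1);
                return dispatch(func, ips, mid, hi);
        }
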
> +
>  struct x64_jit_data {
>         struct bpf_binary_header *header;
>         int *addrs;

[...]

> +
> +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
> +                                  struct bpf_prog *prog)
> +{
> +       struct bpf_prog **entry = NULL;
> +       int i, err = 0;
> +
> +       if (d->num_progs == BPF_DISPATCHER_MAX)
> +               return err;

err == 0, not what you want, probably

> +
> +       for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +               if (!entry && !d->progs[i])
> +                       entry = &d->progs[i];
> +               if (d->progs[i] == prog)
> +                       return err;
> +       }
> +
> +       prog = bpf_prog_inc(prog);
> +       if (IS_ERR(prog))
> +               return err;
> +
> +       *entry = prog;
> +       d->num_progs++;
> +       return err;
> +}
> +
> +static void bpf_dispatcher_remove_prog(struct bpf_dispatcher *d,
> +                                      struct bpf_prog *prog)
> +{
> +       int i;
> +
> +       for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +               if (d->progs[i] == prog) {
> +                       bpf_prog_put(prog);
> +                       d->progs[i] = NULL;
> +                       d->num_progs--;

instead of allowing holes, why not swap the removed prog with the last one
in d->progs?

> +                       break;
> +               }
> +       }
> +}
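
Something like this, perhaps (untested sketch):

        for (i = 0; i < d->num_progs; i++) {
                if (d->progs[i] != prog)
                        continue;
                bpf_prog_put(prog);
                d->num_progs--;
                /* keep d->progs[] dense: move the last entry into
                 * the freed slot
                 */
                d->progs[i] = d->progs[d->num_progs];
                d->progs[d->num_progs] = NULL;
                break;
        }

which also lets the add path simply append at d->progs[d->num_progs].
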
> +
> +int __weak arch_prepare_bpf_dispatcher(void *image, struct bpf_prog **progs,
> +                                      int num_ids)
> +{
> +       return -ENOTSUPP;
> +}
> +
> +/* NB! bpf_dispatcher_update() might free the dispatcher! */
> +static int bpf_dispatcher_update(struct bpf_dispatcher *d)
> +{
> +       void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
> +       void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
> +       int err;
> +
> +       if (d->num_progs == 0) {
> +               err = bpf_arch_text_poke(d->func, BPF_MOD_JMP_TO_NOP,
> +                                        old_image, NULL);
> +               bpf_dispatcher_free(d);
> +               goto out;
> +       }
> +
> +       err = arch_prepare_bpf_dispatcher(new_image, &d->progs[0],
> +                                         d->num_progs);
> +       if (err)
> +               goto out;
> +
> +       if (d->selector)
> +               /* progs already running at this address */
> +               err = bpf_arch_text_poke(d->func, BPF_MOD_JMP_TO_JMP,
> +                                        old_image, new_image);
> +       else
> +               /* first time registering */
> +               err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JMP,
> +                                        NULL, new_image);
> +
> +       if (err)
> +               goto out;
> +       d->selector++;
> +
> +out:
> +       return err;
> +}
> +
> +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> +                               struct bpf_prog *to)
> +{
> +       struct bpf_dispatcher *d;
> +
> +       if (!from && !to)
> +               return;
> +
> +       mutex_lock(&dispatcher_mutex);
> +       d = bpf_dispatcher_lookup(func);
> +       if (!d)
> +               goto out;
> +
> +       if (from)
> +               bpf_dispatcher_remove_prog(d, from);
> +
> +       if (to)
> +               bpf_dispatcher_add_prog(d, to);

this can fail

> +
> +       WARN_ON(bpf_dispatcher_update(d));

shouldn't the dispatcher be removed from the list before it is freed? It
seems like handling the dispatcher freeing is better done here explicitly
(and you won't need to leave an NB remark)
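
e.g. (sketch; assuming the dispatcher sits in the hash table via a
d->hlist node):

        WARN_ON(bpf_dispatcher_update(d));
        if (!d->num_progs) {
                hlist_del(&d->hlist);
                bpf_dispatcher_free(d);
        }

with bpf_dispatcher_update() no longer freeing anything itself.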

> +
> +out:
> +       mutex_unlock(&dispatcher_mutex);
> +}
> +
> +static int __init init_dispatchers(void)
> +{
> +       int i;
> +
> +       for (i = 0; i < DISPATCHER_TABLE_SIZE; i++)
> +               INIT_HLIST_HEAD(&dispatcher_table[i]);
> +       return 0;
> +}
> +late_initcall(init_dispatchers);
> +
> +#endif
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher
  2019-11-18 19:36   ` Andrii Nakryiko
@ 2019-11-18 20:11     ` Björn Töpel
  0 siblings, 0 replies; 23+ messages in thread
From: Björn Töpel @ 2019-11-18 20:11 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Networking, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Magnus Karlsson,
	Jonathan Lemon

On Mon, 18 Nov 2019 at 20:36, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
[...]
> > +       sort(&ips[0], num_progs, sizeof(ips[i]), cmp_ips, NULL);
>
> nit: sizeof(ips[i]) looks weird... sizeof(ips[0]) (or sizeof(*ips)) would read better.
>

Ick! Thanks for spotting.

> > +       return emit_bpf_dispatcher(&prog, 0, num_progs - 1, &ips[0], fallback);
> > +}
> > +
> >  struct x64_jit_data {
> >         struct bpf_binary_header *header;
> >         int *addrs;
>
> [...]
>
> > +
> > +static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
> > +                                  struct bpf_prog *prog)
> > +{
> > +       struct bpf_prog **entry = NULL;
> > +       int i, err = 0;
> > +
> > +       if (d->num_progs == BPF_DISPATCHER_MAX)
> > +               return err;
>
> err == 0, not what you want, probably
>

Indeed, the error handling in this RFC is bad. I'll fix that in the patch proper!
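
Probably something along these lines (untested sketch):

        static int bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
                                           struct bpf_prog *prog)
        {
                struct bpf_prog **entry = NULL;
                int i;

                if (d->num_progs == BPF_DISPATCHER_MAX)
                        return -EBUSY;

                for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
                        if (!entry && !d->progs[i])
                                entry = &d->progs[i];
                        if (d->progs[i] == prog)
                                return 0;       /* already installed */
                }

                prog = bpf_prog_inc(prog);
                if (IS_ERR(prog))
                        return PTR_ERR(prog);

                *entry = prog;
                d->num_progs++;
                return 0;
        }

and letting bpf_dispatcher_change_prog() propagate the error to its
caller instead of WARN_ON()ing.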

[...]
> > +static void bpf_dispatcher_remove_prog(struct bpf_dispatcher *d,
> > +                                      struct bpf_prog *prog)
> > +{
> > +       int i;
> > +
> > +       for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> > +               if (d->progs[i] == prog) {
> > +                       bpf_prog_put(prog);
> > +                       d->progs[i] = NULL;
> > +                       d->num_progs--;
>
> instead of allowing holes, why not swap the removed prog with the last one
> in d->progs?
>

Yeah, holes are a no-go. I'll redo that.

[...]

> > +
> > +       WARN_ON(bpf_dispatcher_update(d));
>
> shouldn't the dispatcher be removed from the list before it is freed? It
> seems like handling the dispatcher freeing is better done here explicitly
> (and you won't need to leave an NB remark)
>

I agree. Let's make that explicit. I'll send out a patch proper in a day or two.

Thanks for looking at the code, Andrii!


Björn

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-11-18 20:11 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-13 20:47 [RFC PATCH bpf-next 0/4] Introduce xdp_call.h and the BPF dispatcher Björn Töpel
2019-11-13 20:47 ` [RFC PATCH bpf-next 1/4] bpf: teach bpf_arch_text_poke() jumps Björn Töpel
2019-11-13 20:47 ` [RFC PATCH bpf-next 2/4] bpf: introduce BPF dispatcher Björn Töpel
2019-11-13 21:40   ` Edward Cree
2019-11-14  6:29     ` Björn Töpel
2019-11-14 10:18       ` Edward Cree
2019-11-14 11:21         ` Björn Töpel
2019-11-14 13:58           ` Peter Zijlstra
2019-11-14 12:31   ` Toke Høiland-Jørgensen
2019-11-14 13:03     ` Daniel Borkmann
2019-11-14 13:09       ` Toke Høiland-Jørgensen
2019-11-14 13:56       ` Björn Töpel
2019-11-14 14:55         ` Toke Høiland-Jørgensen
2019-11-14 15:03           ` Björn Töpel
2019-11-14 15:12             ` Toke Høiland-Jørgensen
2019-11-15  0:30   ` Alexei Starovoitov
2019-11-15  7:56     ` Björn Töpel
2019-11-15 21:58       ` Alexei Starovoitov
2019-11-18 10:03         ` Björn Töpel
2019-11-18 19:36   ` Andrii Nakryiko
2019-11-18 20:11     ` Björn Töpel
2019-11-13 20:47 ` [RFC PATCH bpf-next 3/4] xdp: introduce xdp_call Björn Töpel
2019-11-13 20:47 ` [RFC PATCH bpf-next 4/4] i40e: start using xdp_call.h Björn Töpel
