BPF Archive on lore.kernel.org
* [PATCH bpf-next v2 0/6] Introduce the BPF dispatcher and xdp_call.h
@ 2019-11-23  7:12 Björn Töpel
  2019-11-23  7:12 ` [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher Björn Töpel
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Björn Töpel @ 2019-11-23  7:12 UTC (permalink / raw)
  To: netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon, ecree, thoiland, andrii.nakryiko, tariqt, saeedm,
	maximmi

Overview
========

A bit late in the merge-window, but here goes... This v2 series
introduces the BPF dispatcher and a wrapper, xdp_call.h, which
together provide a mechanism to avoid indirect calls when retpolines
are enabled.

The BPF dispatcher is a multi-way branch code generator, mainly
targeted for XDP programs. When an XDP program is executed via
bpf_prog_run_xdp(), it is invoked via an indirect call. With
retpolines enabled, the indirect call has a substantial performance
impact. The dispatcher is a mechanism that transforms multiple
indirect calls into direct calls, and therefore avoids the
retpoline. The dispatcher is generated using the BPF JIT, and relies
on text poking provided by bpf_arch_text_poke().

Note that bpf_arch_text_poke() only supports text poking of
non-module code. That means that a driver needs to be built in to
benefit from the dispatcher. This might change in the future,
e.g. with the introduction of static_call().

The dispatcher hijacks a trampoline function via the __fentry__ nop
of the trampoline. One dispatcher instance currently supports up to 16
dispatch points. This can be extended in the future.

An example: A module/driver allocates a dispatcher. The dispatcher is
shared for all netdevs. Each unique XDP program has a slot in the
dispatcher, registered by a netdev. The netdev then uses the
dispatcher to call the correct program with a direct call.

The xdp_call.h header wraps a more user-friendly API around the BPF
dispatcher. A user adds a trampoline/XDP caller using the
DEFINE_XDP_CALL macro, and updates the BPF dispatcher via
xdp_call_update(). The actual dispatch is done via xdp_call_run().

The following drivers have been modified to use xdp_call.h: ixgbe,
i40e, mlx4, and mlx5.

*NOTE* that mlx4 and mlx5 are only "compile tested". I don't have any
Mellanox hardware, so if anyone could do tests with those drivers,
that would be very helpful. If not, I suggest dropping the last two
patches from the series, and adding them later if useful.

Generated code, x86-64
======================

The dispatcher currently has a maximum of 16 entries, where one entry
is a unique BPF program. Multiple users of a dispatcher instance using
the same BPF program will share that entry.

The program/slot lookup is performed by a binary search, O(log
n). Let's have a look at the generated code.

The trampoline function has the following signature:

  unsigned int tramp(const void *xdp_ctx,
                     const struct bpf_insn *insnsi,
                     unsigned int (*bpf_func)(const void *,
                                              const struct bpf_insn *))

On Intel x86-64 this means that rdx will contain the bpf_func. To
make it easier to read, I've let the BPF programs have the following
range: 0xffffffffffffffff (-1) to 0xfffffffffffffff0
(-16). 0xffffffff81c00f10 is the retpoline thunk, in this case
__x86_indirect_thunk_rdx.

The minimal dispatcher will then look like this:

ffffffffc0002000: cmp    rdx,0xffffffffffffffff
ffffffffc0002007: je     0xffffffffffffffff ; -1
ffffffffc000200d: jmp    0xffffffff81c00f10

The largest dispatcher looks like this:

ffffffffc0020000: cmp    rdx,0xfffffffffffffff7 ; -9
ffffffffc0020007: jg     0xffffffffc0020130
ffffffffc002000d: cmp    rdx,0xfffffffffffffff3 ; -13
ffffffffc0020014: jg     0xffffffffc00200a0
ffffffffc002001a: cmp    rdx,0xfffffffffffffff1 ; -15
ffffffffc0020021: jg     0xffffffffc0020060
ffffffffc0020023: cmp    rdx,0xfffffffffffffff0 ; -16
ffffffffc002002a: jg     0xffffffffc0020040
ffffffffc002002c: cmp    rdx,0xfffffffffffffff0 ; -16
ffffffffc0020033: je     0xfffffffffffffff0 ; -16
ffffffffc0020039: jmp    0xffffffff81c00f10
ffffffffc002003e: xchg   ax,ax
ffffffffc0020040: cmp    rdx,0xfffffffffffffff1 ; -15
ffffffffc0020047: je     0xfffffffffffffff1 ; -15
ffffffffc002004d: jmp    0xffffffff81c00f10
ffffffffc0020052: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc002005a: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc0020060: cmp    rdx,0xfffffffffffffff2 ; -14
ffffffffc0020067: jg     0xffffffffc0020080
ffffffffc0020069: cmp    rdx,0xfffffffffffffff2 ; -14
ffffffffc0020070: je     0xfffffffffffffff2 ; -14
ffffffffc0020076: jmp    0xffffffff81c00f10
ffffffffc002007b: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc0020080: cmp    rdx,0xfffffffffffffff3 ; -13
ffffffffc0020087: je     0xfffffffffffffff3 ; -13
ffffffffc002008d: jmp    0xffffffff81c00f10
ffffffffc0020092: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc002009a: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc00200a0: cmp    rdx,0xfffffffffffffff5 ; -11
ffffffffc00200a7: jg     0xffffffffc00200f0
ffffffffc00200a9: cmp    rdx,0xfffffffffffffff4 ; -12
ffffffffc00200b0: jg     0xffffffffc00200d0
ffffffffc00200b2: cmp    rdx,0xfffffffffffffff4 ; -12
ffffffffc00200b9: je     0xfffffffffffffff4 ; -12
ffffffffc00200bf: jmp    0xffffffff81c00f10
ffffffffc00200c4: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc00200cc: nop    DWORD PTR [rax+0x0]
ffffffffc00200d0: cmp    rdx,0xfffffffffffffff5 ; -11
ffffffffc00200d7: je     0xfffffffffffffff5 ; -11
ffffffffc00200dd: jmp    0xffffffff81c00f10
ffffffffc00200e2: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc00200ea: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc00200f0: cmp    rdx,0xfffffffffffffff6 ; -10
ffffffffc00200f7: jg     0xffffffffc0020110
ffffffffc00200f9: cmp    rdx,0xfffffffffffffff6 ; -10
ffffffffc0020100: je     0xfffffffffffffff6 ; -10
ffffffffc0020106: jmp    0xffffffff81c00f10
ffffffffc002010b: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc0020110: cmp    rdx,0xfffffffffffffff7 ; -9
ffffffffc0020117: je     0xfffffffffffffff7 ; -9
ffffffffc002011d: jmp    0xffffffff81c00f10
ffffffffc0020122: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc002012a: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc0020130: cmp    rdx,0xfffffffffffffffb ; -5
ffffffffc0020137: jg     0xffffffffc00201d0
ffffffffc002013d: cmp    rdx,0xfffffffffffffff9 ; -7
ffffffffc0020144: jg     0xffffffffc0020190
ffffffffc0020146: cmp    rdx,0xfffffffffffffff8 ; -8
ffffffffc002014d: jg     0xffffffffc0020170
ffffffffc002014f: cmp    rdx,0xfffffffffffffff8 ; -8
ffffffffc0020156: je     0xfffffffffffffff8 ; -8
ffffffffc002015c: jmp    0xffffffff81c00f10
ffffffffc0020161: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc0020169: nop    DWORD PTR [rax+0x0]
ffffffffc0020170: cmp    rdx,0xfffffffffffffff9 ; -7
ffffffffc0020177: je     0xfffffffffffffff9 ; -7
ffffffffc002017d: jmp    0xffffffff81c00f10
ffffffffc0020182: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc002018a: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc0020190: cmp    rdx,0xfffffffffffffffa ; -6
ffffffffc0020197: jg     0xffffffffc00201b0
ffffffffc0020199: cmp    rdx,0xfffffffffffffffa ; -6
ffffffffc00201a0: je     0xfffffffffffffffa ; -6
ffffffffc00201a6: jmp    0xffffffff81c00f10
ffffffffc00201ab: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc00201b0: cmp    rdx,0xfffffffffffffffb ; -5
ffffffffc00201b7: je     0xfffffffffffffffb ; -5
ffffffffc00201bd: jmp    0xffffffff81c00f10
ffffffffc00201c2: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc00201ca: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc00201d0: cmp    rdx,0xfffffffffffffffd ; -3
ffffffffc00201d7: jg     0xffffffffc0020220
ffffffffc00201d9: cmp    rdx,0xfffffffffffffffc ; -4
ffffffffc00201e0: jg     0xffffffffc0020200
ffffffffc00201e2: cmp    rdx,0xfffffffffffffffc ; -4
ffffffffc00201e9: je     0xfffffffffffffffc ; -4
ffffffffc00201ef: jmp    0xffffffff81c00f10
ffffffffc00201f4: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc00201fc: nop    DWORD PTR [rax+0x0]
ffffffffc0020200: cmp    rdx,0xfffffffffffffffd ; -3
ffffffffc0020207: je     0xfffffffffffffffd ; -3
ffffffffc002020d: jmp    0xffffffff81c00f10
ffffffffc0020212: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc002021a: nop    WORD PTR [rax+rax*1+0x0]
ffffffffc0020220: cmp    rdx,0xfffffffffffffffe ; -2
ffffffffc0020227: jg     0xffffffffc0020240
ffffffffc0020229: cmp    rdx,0xfffffffffffffffe ; -2
ffffffffc0020230: je     0xfffffffffffffffe ; -2
ffffffffc0020236: jmp    0xffffffff81c00f10
ffffffffc002023b: nop    DWORD PTR [rax+rax*1+0x0]
ffffffffc0020240: cmp    rdx,0xffffffffffffffff ; -1
ffffffffc0020247: je     0xffffffffffffffff ; -1
ffffffffc002024d: jmp    0xffffffff81c00f10

The nops are there to align jump targets to 16B.

Performance
===========

The tests were performed using the xdp_rxq_info sample program with
the following command-line:

  # xdp_rxq_info --dev eth0 --action XDP_DROP

64B UDP packets were sent at line rate (~59 Mpps) from a packet
generator to a 40GbE i40e NIC attached to a 3 GHz Intel Skylake
machine.

1.  Baseline w/o dispatcher: 22.7 Mpps
2.  Dispatcher,  1 entry:    31.7 Mpps (+40%)
3.  Dispatcher,  2 entries:  32.2 Mpps (+42%)
4.  Dispatcher,  3 entries:  31.3 Mpps (+38%)
5.  Dispatcher,  4 entries:  32.0 Mpps (+41%)
6.  Dispatcher,  5 entries:  31.2 Mpps (+37%)
7.  Dispatcher,  6 entries:  31.2 Mpps (+37%)
8.  Dispatcher,  7 entries:  30.2 Mpps (+33%)
9.  Dispatcher,  8 entries:  31.3 Mpps (+39%)
10. Dispatcher,  9 entries:  30.1 Mpps (+32%)
11. Dispatcher, 10 entries:  31.6 Mpps (+39%)
12. Dispatcher, 11 entries:  31.1 Mpps (+37%)
13. Dispatcher, 12 entries:  30.9 Mpps (+36%)
14. Dispatcher, 13 entries:  30.4 Mpps (+34%)
15. Dispatcher, 14 entries:  31.2 Mpps (+37%)
16. Dispatcher, 15 entries:  30.9 Mpps (+36%)
17. Dispatcher, 16 entries:  32.1 Mpps (+41%)
18. Dispatcher, full:        22.4 Mpps (- 1%)

Test 18 showcases the cost of walking a full dispatcher and then
falling back to an indirect call.

As the results show, it is hard to see any difference between 1 and 16
entries, other than small variations between runs.

Revisions
=========

v1->v2: [1]
  * Fixed i386 build warning (kbuild robot)
  * Made bpf_dispatcher_lookup() static (kbuild robot)
  * Make sure xdp_call.h is only enabled for builtins
  * Add xdp_call() to ixgbe, mlx4, and mlx5

RFC->v1: [2]
  * Improved error handling (Edward and Andrii)
  * Explicit cleanup (Andrii)
  * Use 32B with sext cmp (Alexei)
  * Align jump targets to 16B (Alexei)
  * 4 to 16 entries (Toke)
  * Added stats to xdp_call_run()

[1] https://lore.kernel.org/bpf/20191119160757.27714-1-bjorn.topel@gmail.com/
[2] https://lore.kernel.org/bpf/20191113204737.31623-1-bjorn.topel@gmail.com/


Thanks!
Björn

Björn Töpel (6):
  bpf: introduce BPF dispatcher
  xdp: introduce xdp_call
  i40e: start using xdp_call.h
  ixgbe: start using xdp_call.h
  net/mlx4_en: start using xdp_call.h
  net/mlx5e: Start using xdp_call.h

 arch/x86/net/bpf_jit_comp.c                   | 135 ++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   5 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   5 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    |   5 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   7 +-
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |   7 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |   5 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |   5 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   5 +
 include/linux/xdp_call.h                      |  66 ++++++
 kernel/bpf/Makefile                           |   1 +
 kernel/bpf/dispatcher.c                       | 208 ++++++++++++++++++
 12 files changed, 448 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/xdp_call.h
 create mode 100644 kernel/bpf/dispatcher.c

-- 
2.20.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher
  2019-11-23  7:12 [PATCH bpf-next v2 0/6] Introduce the BPF dispatcher and xdp_call.h Björn Töpel
@ 2019-11-23  7:12 ` Björn Töpel
  2019-11-24  1:55   ` Alexei Starovoitov
                     ` (2 more replies)
  2019-11-23  7:12 ` [PATCH bpf-next v2 2/6] xdp: introduce xdp_call Björn Töpel
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 21+ messages in thread
From: Björn Töpel @ 2019-11-23  7:12 UTC (permalink / raw)
  To: netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon, ecree, thoiland, andrii.nakryiko, tariqt, saeedm,
	maximmi

From: Björn Töpel <bjorn.topel@intel.com>

The BPF dispatcher is a multiway branch code generator, mainly
targeted for XDP programs. When an XDP program is executed via
bpf_prog_run_xdp(), it is invoked via an indirect call. With
retpolines enabled, the indirect call has a substantial performance
impact. The dispatcher is a mechanism that transforms multiple
indirect calls into direct calls, and therefore avoids the
retpoline. The dispatcher is generated using the BPF JIT, and relies
on text poking provided by bpf_arch_text_poke().

The dispatcher hijacks a trampoline function via the __fentry__ nop
of the trampoline. One dispatcher instance currently supports up to 16
dispatch points. This can be extended in the future.

An example: A module/driver allocates a dispatcher. The dispatcher is
shared for all netdevs. Each unique XDP program has a slot in the
dispatcher, registered by a netdev. The netdev then uses the
dispatcher to call the correct program with a direct call.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 arch/x86/net/bpf_jit_comp.c | 135 +++++++++++++++++++++++
 kernel/bpf/Makefile         |   1 +
 kernel/bpf/dispatcher.c     | 208 ++++++++++++++++++++++++++++++++++++
 3 files changed, 344 insertions(+)
 create mode 100644 kernel/bpf/dispatcher.c

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 15615c94804f..9ca81bc9e7f3 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -10,10 +10,12 @@
 #include <linux/if_vlan.h>
 #include <linux/bpf.h>
 #include <linux/memory.h>
+#include <linux/sort.h>
 #include <asm/extable.h>
 #include <asm/set_memory.h>
 #include <asm/nospec-branch.h>
 #include <asm/text-patching.h>
+#include <asm/asm-prototypes.h>
 
 static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
 {
@@ -1565,6 +1567,139 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
 	return 0;
 }
 
+#if defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE)
+
+static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+	s64 offset;
+
+	offset = func - (ip + 2 + 4);
+	if (!is_simm32(offset)) {
+		pr_err("Target %p is out of range\n", func);
+		return -EINVAL;
+	}
+	EMIT2_off32(0x0F, jmp_cond + 0x10, offset);
+	*pprog = prog;
+	return 0;
+}
+
+static void emit_nops(u8 **pprog, unsigned int len)
+{
+	unsigned int i, noplen;
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	while (len > 0) {
+		noplen = len;
+
+		if (noplen > ASM_NOP_MAX)
+			noplen = ASM_NOP_MAX;
+
+		for (i = 0; i < noplen; i++)
+			EMIT1(ideal_nops[noplen][i]);
+		len -= noplen;
+	}
+
+	*pprog = prog;
+}
+
+static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs)
+{
+	u8 *jg_reloc, *jg_target, *prog = *pprog;
+	int pivot, err, jg_bytes = 1, cnt = 0;
+	s64 jg_offset;
+
+	if (a == b) {
+		/* Leaf node of recursion, i.e. not a range of indices
+		 * anymore.
+		 */
+		EMIT1(add_1mod(0x48, BPF_REG_3));	/* cmp rdx,func */
+		if (!is_simm32(progs[a]))
+			return -1;
+		EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3),
+			    progs[a]);
+		err = emit_cond_near_jump(&prog,	/* je func */
+					  (void *)progs[a], prog,
+					  X86_JE);
+		if (err)
+			return err;
+
+		err = emit_jump(&prog,			/* jmp thunk */
+				__x86_indirect_thunk_rdx, prog);
+		if (err)
+			return err;
+
+		*pprog = prog;
+		return 0;
+	}
+
+	/* Not a leaf node, so we pivot, and recursively descend into
+	 * the lower and upper ranges.
+	 */
+	pivot = (b - a) / 2;
+	EMIT1(add_1mod(0x48, BPF_REG_3));		/* cmp rdx,func */
+	if (!is_simm32(progs[a + pivot]))
+		return -1;
+	EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3), progs[a + pivot]);
+
+	if (pivot > 2) {				/* jg upper_part */
+		/* Require near jump. */
+		jg_bytes = 4;
+		EMIT2_off32(0x0F, X86_JG + 0x10, 0);
+	} else {
+		EMIT2(X86_JG, 0);
+	}
+	jg_reloc = prog;
+
+	err = emit_bpf_dispatcher(&prog, a, a + pivot,	/* emit lower_part */
+				  progs);
+	if (err)
+		return err;
+
+	/* Intel 64 and IA-32 Architectures Optimization Reference
+	 * Manual, 3.4.1.5 Code Alignment Assembly/Compiler Coding
+	 * Rule 12. (M impact, H generality) All branch targets should
+	 * be 16-byte aligned.
+	 */
+	jg_target = PTR_ALIGN(prog, 16);
+	if (jg_target != prog)
+		emit_nops(&prog, jg_target - prog);
+	jg_offset = prog - jg_reloc;
+	emit_code(jg_reloc - jg_bytes, jg_offset, jg_bytes);
+
+	err = emit_bpf_dispatcher(&prog, a + pivot + 1,	/* emit upper_part */
+				  b, progs);
+	if (err)
+		return err;
+
+	*pprog = prog;
+	return 0;
+}
+
+static int cmp_ips(const void *a, const void *b)
+{
+	const s64 *ipa = a;
+	const s64 *ipb = b;
+
+	if (*ipa > *ipb)
+		return 1;
+	if (*ipa < *ipb)
+		return -1;
+	return 0;
+}
+
+int arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs)
+{
+	u8 *prog = image;
+
+	sort(funcs, num_funcs, sizeof(funcs[0]), cmp_ips, NULL);
+	return emit_bpf_dispatcher(&prog, 0, num_funcs - 1, funcs);
+}
+
+#endif
+
 struct x64_jit_data {
 	struct bpf_binary_header *header;
 	int *addrs;
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 3f671bf617e8..d4f330351f87 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o
+obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
new file mode 100644
index 000000000000..385dd76ab6d2
--- /dev/null
+++ b/kernel/bpf/dispatcher.c
@@ -0,0 +1,208 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2019 Intel Corporation. */
+
+#ifdef CONFIG_RETPOLINE
+
+#include <linux/hash.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+/* The BPF dispatcher is a multiway branch code generator. The
+ * dispatcher is a mechanism to avoid the performance penalty of an
+ * indirect call when retpolines are enabled. A dispatch client tries
+ * to register a BPF program into the dispatcher, and if there is
+ * available room in the dispatcher a direct call to the BPF program
+ * will be generated. All calls to the BPF programs called via the
+ * dispatcher will then be a direct call, instead of an indirect. The
+ * dispatcher hijacks a trampoline function via the __fentry__ nop of
+ * the trampoline. The trampoline function has the following
+ * signature:
+ *	unsigned int trampoline(
+ *		const void *xdp_ctx,
+ *		const struct bpf_insn *insnsi,
+ *		unsigned int (*bpf_func)(const void *,
+ *					 const struct bpf_insn *));
+ *
+ * Direct use of the dispatcher is discouraged, and instead a wrapper
+ * such as xdp_call.h should be used.
+ */
+
+#define DISPATCHER_HASH_BITS 10
+#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS)
+
+static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE];
+
+#define BPF_DISPATCHER_MAX 16
+
+struct bpf_dispatcher {
+	struct hlist_node hlist;
+	void *func;
+	struct bpf_prog *progs[BPF_DISPATCHER_MAX];
+	int num_progs;
+	void *image;
+	u64 selector;
+};
+
+static DEFINE_MUTEX(dispatcher_mutex);
+
+static struct bpf_dispatcher *bpf_dispatcher_lookup(void *func)
+{
+	struct bpf_dispatcher *d;
+	struct hlist_head *head;
+	void *image;
+
+	head = &dispatcher_table[hash_ptr(func, DISPATCHER_HASH_BITS)];
+	hlist_for_each_entry(d, head, hlist) {
+		if (d->func == func)
+			return d;
+	}
+	d = kzalloc(sizeof(*d), GFP_KERNEL);
+	if (!d)
+		return NULL;
+
+	image = bpf_jit_alloc_exec(PAGE_SIZE);
+	if (!image) {
+		kfree(d);
+		return NULL;
+	}
+
+	d->func = func;
+	INIT_HLIST_NODE(&d->hlist);
+	hlist_add_head(&d->hlist, head);
+
+	set_vm_flush_reset_perms(image);
+	set_memory_x((long)image, 1);
+	d->image = image;
+	return d;
+}
+
+static void bpf_dispatcher_free(struct bpf_dispatcher *d)
+{
+	bpf_jit_free_exec(d->image);
+	hlist_del(&d->hlist);
+	kfree(d);
+}
+
+static bool bpf_dispatcher_add_prog(struct bpf_dispatcher *d,
+				    struct bpf_prog *prog)
+{
+	struct bpf_prog **entry = NULL;
+	int i;
+
+	if (!prog)
+		return false;
+
+	if (d->num_progs == BPF_DISPATCHER_MAX)
+		return false;
+
+	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
+		if (!entry && !d->progs[i])
+			entry = &d->progs[i];
+		if (d->progs[i] == prog)
+			return false;
+	}
+
+	bpf_prog_inc(prog);
+	*entry = prog;
+	d->num_progs++;
+	return true;
+}
+
+static bool bpf_dispatcher_remove_prog(struct bpf_dispatcher *d,
+				       struct bpf_prog *prog)
+{
+	int i;
+
+	if (!prog)
+		return false;
+
+	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
+		if (d->progs[i] == prog) {
+			bpf_prog_put(prog);
+			d->progs[i] = NULL;
+			d->num_progs--;
+			return true;
+		}
+	}
+	return false;
+}
+
+int __weak arch_prepare_bpf_dispatcher(void *image, s64 *funcs,
+				       int num_funcs)
+{
+	return -ENOTSUPP;
+}
+
+static void bpf_dispatcher_update(struct bpf_dispatcher *d)
+{
+	void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
+	void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
+	s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
+	int i, err;
+
+	if (!d->num_progs) {
+		bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
+				   old_image, NULL);
+		return;
+	}
+
+	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
+		if (d->progs[i])
+			*ipsp++ = (s64)(uintptr_t)d->progs[i]->bpf_func;
+	}
+	err = arch_prepare_bpf_dispatcher(new_image, &ips[0], d->num_progs);
+	if (err)
+		return;
+
+	if (d->selector) {
+		/* progs already running at this address */
+		err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
+					 old_image, new_image);
+	} else {
+		/* first time registering */
+		err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
+					 NULL, new_image);
+	}
+	if (err)
+		return;
+	d->selector++;
+}
+
+void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
+				struct bpf_prog *to)
+{
+	struct bpf_dispatcher *d;
+	bool changed = false;
+
+	if (from == to)
+		return;
+
+	mutex_lock(&dispatcher_mutex);
+	d = bpf_dispatcher_lookup(func);
+	if (!d)
+		goto out;
+
+	changed |= bpf_dispatcher_remove_prog(d, from);
+	changed |= bpf_dispatcher_add_prog(d, to);
+
+	if (!changed)
+		goto out;
+
+	bpf_dispatcher_update(d);
+	if (!d->num_progs)
+		bpf_dispatcher_free(d);
+out:
+	mutex_unlock(&dispatcher_mutex);
+}
+
+static int __init init_dispatchers(void)
+{
+	int i;
+
+	for (i = 0; i < DISPATCHER_TABLE_SIZE; i++)
+		INIT_HLIST_HEAD(&dispatcher_table[i]);
+	return 0;
+}
+late_initcall(init_dispatchers);
+
+#endif
-- 
2.20.1



* [PATCH bpf-next v2 2/6] xdp: introduce xdp_call
  2019-11-23  7:12 [PATCH bpf-next v2 0/6] Introduce the BPF dispatcher and xdp_call.h Björn Töpel
  2019-11-23  7:12 ` [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher Björn Töpel
@ 2019-11-23  7:12 ` Björn Töpel
  2019-11-24  1:59   ` Alexei Starovoitov
  2019-11-25 11:18   ` Toke Høiland-Jørgensen
  2019-11-23  7:12 ` [PATCH bpf-next v2 3/6] i40e: start using xdp_call.h Björn Töpel
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: Björn Töpel @ 2019-11-23  7:12 UTC (permalink / raw)
  To: netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon, ecree, thoiland, andrii.nakryiko, tariqt, saeedm,
	maximmi

From: Björn Töpel <bjorn.topel@intel.com>

The xdp_call.h header wraps a more user-friendly API around the BPF
dispatcher. A user adds a trampoline/XDP caller using the
DEFINE_XDP_CALL macro, and updates the BPF dispatcher via
xdp_call_update(). The actual dispatch is done via xdp_call_run().

Note that xdp_call_run() is only supported for built-in drivers. Module
builds will fall back to bpf_prog_run_xdp().

The next patch will showcase how the i40e driver uses xdp_call.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/xdp_call.h | 66 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)
 create mode 100644 include/linux/xdp_call.h

diff --git a/include/linux/xdp_call.h b/include/linux/xdp_call.h
new file mode 100644
index 000000000000..69b2d325a787
--- /dev/null
+++ b/include/linux/xdp_call.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2019 Intel Corporation. */
+#ifndef _LINUX_XDP_CALL_H
+#define _LINUX_XDP_CALL_H
+
+#include <linux/filter.h>
+
+#if defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE) && !defined(MODULE)
+
+void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
+				struct bpf_prog *to);
+
+#define XDP_CALL_TRAMP(name)	____xdp_call_##name##_tramp
+
+#define DEFINE_XDP_CALL(name)						\
+	unsigned int XDP_CALL_TRAMP(name)(				\
+		const void *xdp_ctx,					\
+		const struct bpf_insn *insnsi,				\
+		unsigned int (*bpf_func)(const void *,			\
+					 const struct bpf_insn *))	\
+	{								\
+		return bpf_func(xdp_ctx, insnsi);			\
+	}
+
+#define DECLARE_XDP_CALL(name)						\
+	unsigned int XDP_CALL_TRAMP(name)(				\
+		const void *xdp_ctx,					\
+		const struct bpf_insn *insnsi,				\
+		unsigned int (*bpf_func)(const void *,			\
+					 const struct bpf_insn *))
+
+#define xdp_call_run(name, prog, ctx) ({				\
+	u32 ret;							\
+	cant_sleep();							\
+	if (static_branch_unlikely(&bpf_stats_enabled_key)) {		\
+		struct bpf_prog_stats *stats;				\
+		u64 start = sched_clock();				\
+		ret = XDP_CALL_TRAMP(name)(ctx,				\
+					   (prog)->insnsi,		\
+					   (prog)->bpf_func);		\
+		stats = this_cpu_ptr((prog)->aux->stats);		\
+		u64_stats_update_begin(&stats->syncp);			\
+		stats->cnt++;						\
+		stats->nsecs += sched_clock() - start;			\
+		u64_stats_update_end(&stats->syncp);			\
+	} else {							\
+		ret = XDP_CALL_TRAMP(name)(ctx,				\
+					   (prog)->insnsi,		\
+					   (prog)->bpf_func);		\
+	}								\
+	ret; })
+
+#define xdp_call_update(name, from_xdp_prog, to_xdp_prog)	\
+	bpf_dispatcher_change_prog(&XDP_CALL_TRAMP(name),	\
+				   from_xdp_prog,		\
+				   to_xdp_prog)
+
+#else
+
+#define DEFINE_XDP_CALL(name)
+#define DECLARE_XDP_CALL(name)
+#define xdp_call_run(name, xdp_prog, xdp) bpf_prog_run_xdp(xdp_prog, xdp)
+#define xdp_call_update(name, from_xdp_prog, to_xdp_prog)
+
+#endif
+#endif /* _LINUX_XDP_CALL_H */
-- 
2.20.1



* [PATCH bpf-next v2 3/6] i40e: start using xdp_call.h
  2019-11-23  7:12 [PATCH bpf-next v2 0/6] Introduce the BPF dispatcher and xdp_call.h Björn Töpel
  2019-11-23  7:12 ` [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher Björn Töpel
  2019-11-23  7:12 ` [PATCH bpf-next v2 2/6] xdp: introduce xdp_call Björn Töpel
@ 2019-11-23  7:12 ` Björn Töpel
  2019-11-23  7:12 ` [PATCH bpf-next v2 4/6] ixgbe: " Björn Töpel
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Björn Töpel @ 2019-11-23  7:12 UTC (permalink / raw)
  To: netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon, ecree, thoiland, andrii.nakryiko, tariqt, saeedm,
	maximmi

From: Björn Töpel <bjorn.topel@intel.com>

This commit starts using xdp_call.h and the BPF dispatcher to avoid
the retpoline overhead.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 5 +++++
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 5 ++++-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 5 ++++-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 1ccabeafa44c..9fb2ea43d481 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include <linux/of_net.h>
 #include <linux/pci.h>
 #include <linux/bpf.h>
+#include <linux/xdp_call.h>
 
 /* Local includes */
 #include "i40e.h"
@@ -12517,6 +12518,8 @@ static netdev_features_t i40e_features_check(struct sk_buff *skb,
 	return features & ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
 }
 
+DEFINE_XDP_CALL(i40e_xdp_call);
+
 /**
  * i40e_xdp_setup - add/remove an XDP program
  * @vsi: VSI to changed
@@ -12552,6 +12555,8 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
 	for (i = 0; i < vsi->num_queue_pairs; i++)
 		WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
 
+	xdp_call_update(i40e_xdp_call, old_prog, prog);
+
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index b8496037ef7f..34d7b15897a1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -3,6 +3,7 @@
 
 #include <linux/prefetch.h>
 #include <linux/bpf_trace.h>
+#include <linux/xdp_call.h>
 #include <net/xdp.h>
 #include "i40e.h"
 #include "i40e_trace.h"
@@ -2188,6 +2189,8 @@ int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, struct i40e_ring *xdp_ring)
 	return i40e_xmit_xdp_ring(xdpf, xdp_ring);
 }
 
+DECLARE_XDP_CALL(i40e_xdp_call);
+
 /**
  * i40e_run_xdp - run an XDP program
  * @rx_ring: Rx ring being processed
@@ -2209,7 +2212,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	act = xdp_call_run(i40e_xdp_call, xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
 		break;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index d07e1a890428..5c3421f9cc69 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -2,6 +2,7 @@
 /* Copyright(c) 2018 Intel Corporation. */
 
 #include <linux/bpf_trace.h>
+#include <linux/xdp_call.h>
 #include <net/xdp_sock.h>
 #include <net/xdp.h>
 
@@ -179,6 +180,8 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
 		i40e_xsk_umem_disable(vsi, qid);
 }
 
+DECLARE_XDP_CALL(i40e_xdp_call);
+
 /**
  * i40e_run_xdp_zc - Executes an XDP program on an xdp_buff
  * @rx_ring: Rx ring
@@ -202,7 +205,7 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
 	 * this path is enabled by setting an XDP program.
 	 */
 	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	act = xdp_call_run(i40e_xdp_call, xdp_prog, xdp);
 	offset = xdp->data - xdp->data_hard_start;
 
 	xdp->handle = xsk_umem_adjust_offset(umem, xdp->handle, offset);
-- 
2.20.1



* [PATCH bpf-next v2 4/6] ixgbe: start using xdp_call.h
  2019-11-23  7:12 [PATCH bpf-next v2 0/6] Introduce the BPF dispatcher and xdp_call.h Björn Töpel
                   ` (2 preceding siblings ...)
  2019-11-23  7:12 ` [PATCH bpf-next v2 3/6] i40e: start using xdp_call.h Björn Töpel
@ 2019-11-23  7:12 ` " Björn Töpel
  2019-11-23  7:12 ` [PATCH bpf-next v2 5/6] net/mlx4_en: " Björn Töpel
  2019-11-23  7:12 ` [PATCH bpf-next v2 6/6] net/mlx5e: Start " Björn Töpel
  5 siblings, 0 replies; 21+ messages in thread
From: Björn Töpel @ 2019-11-23  7:12 UTC (permalink / raw)
  To: netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon, ecree, thoiland, andrii.nakryiko, tariqt, saeedm,
	maximmi

From: Björn Töpel <bjorn.topel@intel.com>

This commit starts using xdp_call.h and the BPF dispatcher to avoid
the retpoline overhead.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 25c097cd8100..9c5cea239258 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -37,6 +37,7 @@
 #include <net/mpls.h>
 #include <net/xdp_sock.h>
 #include <net/xfrm.h>
+#include <linux/xdp_call.h>
 
 #include "ixgbe.h"
 #include "ixgbe_common.h"
@@ -2193,6 +2194,8 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring *rx_ring,
 	return skb;
 }
 
+DEFINE_XDP_CALL(ixgbe_xdp_call);
+
 static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 				     struct ixgbe_ring *rx_ring,
 				     struct xdp_buff *xdp)
@@ -2210,7 +2213,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	act = xdp_call_run(ixgbe_xdp_call, xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
 		break;
@@ -10273,6 +10276,8 @@ static int ixgbe_xdp_setup(struct net_device *dev, struct bpf_prog *prog)
 			    adapter->xdp_prog);
 	}
 
+	xdp_call_update(ixgbe_xdp_call, old_prog, prog);
+
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
-- 
2.20.1



* [PATCH bpf-next v2 5/6] net/mlx4_en: start using xdp_call.h
  2019-11-23  7:12 [PATCH bpf-next v2 0/6] Introduce the BPF dispatcher and xdp_call.h Björn Töpel
                   ` (3 preceding siblings ...)
  2019-11-23  7:12 ` [PATCH bpf-next v2 4/6] ixgbe: " Björn Töpel
@ 2019-11-23  7:12 ` " Björn Töpel
  2019-11-23  7:12 ` [PATCH bpf-next v2 6/6] net/mlx5e: Start " Björn Töpel
  5 siblings, 0 replies; 21+ messages in thread
From: Björn Töpel @ 2019-11-23  7:12 UTC (permalink / raw)
  To: netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon, ecree, thoiland, andrii.nakryiko, tariqt, saeedm,
	maximmi

From: Björn Töpel <bjorn.topel@intel.com>

This commit starts using xdp_call.h and the BPF dispatcher to avoid
the retpoline overhead.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 7 ++++++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 5 ++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index d4697beeacc2..f2dea32e5599 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -41,6 +41,7 @@
 #include <net/ip.h>
 #include <net/vxlan.h>
 #include <net/devlink.h>
+#include <linux/xdp_call.h>
 
 #include <linux/mlx4/driver.h>
 #include <linux/mlx4/device.h>
@@ -2759,12 +2760,14 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 	return err;
 }
 
+DEFINE_XDP_CALL(mlx4_xdp_call);
+
 static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_en_dev *mdev = priv->mdev;
 	struct mlx4_en_port_profile new_prof;
-	struct bpf_prog *old_prog;
+	struct bpf_prog *old_prog = NULL;
 	struct mlx4_en_priv *tmp;
 	int tx_changed = 0;
 	int xdp_ring_num;
@@ -2790,6 +2793,7 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 			if (old_prog)
 				bpf_prog_put(old_prog);
 		}
+		xdp_call_update(mlx4_xdp_call, old_prog, prog);
 		mutex_unlock(&mdev->state_lock);
 		return 0;
 	}
@@ -2839,6 +2843,7 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 		if (old_prog)
 			bpf_prog_put(old_prog);
 	}
+	xdp_call_update(mlx4_xdp_call, old_prog, prog);
 
 	if (port_up) {
 		err = mlx4_en_start_port(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index db3552f2d087..f2400ea36a11 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -42,6 +42,7 @@
 #include <linux/if_vlan.h>
 #include <linux/vmalloc.h>
 #include <linux/irq.h>
+#include <linux/xdp_call.h>
 
 #include <net/ip.h>
 #if IS_ENABLED(CONFIG_IPV6)
@@ -661,6 +662,8 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 #define MLX4_CQE_STATUS_IP_ANY (MLX4_CQE_STATUS_IPV4)
 #endif
 
+DECLARE_XDP_CALL(mlx4_xdp_call);
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -782,7 +785,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			xdp.data_end = xdp.data + length;
 			orig_data = xdp.data;
 
-			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			act = xdp_call_run(mlx4_xdp_call, xdp_prog, &xdp);
 
 			length = xdp.data_end - xdp.data;
 			if (xdp.data != orig_data) {
-- 
2.20.1



* [PATCH bpf-next v2 6/6] net/mlx5e: Start using xdp_call.h
  2019-11-23  7:12 [PATCH bpf-next v2 0/6] Introduce the BPF dispatcher and xdp_call.h Björn Töpel
                   ` (4 preceding siblings ...)
  2019-11-23  7:12 ` [PATCH bpf-next v2 5/6] net/mlx4_en: " Björn Töpel
@ 2019-11-23  7:12 ` " Björn Töpel
  5 siblings, 0 replies; 21+ messages in thread
From: Björn Töpel @ 2019-11-23  7:12 UTC (permalink / raw)
  To: netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon, ecree, thoiland, andrii.nakryiko, tariqt, saeedm,
	maximmi

From: Björn Töpel <bjorn.topel@intel.com>

This commit starts using xdp_call.h and the BPF dispatcher to avoid
the retpoline overhead.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c  | 5 ++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 +++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index f049e0ac308a..cc11b0db950e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -32,6 +32,7 @@
 
 #include <linux/bpf_trace.h>
 #include <net/xdp_sock.h>
+#include <linux/xdp_call.h>
 #include "en/xdp.h"
 #include "en/params.h"
 
@@ -117,6 +118,8 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
 	return sq->xmit_xdp_frame(sq, &xdptxd, &xdpi, 0);
 }
 
+DECLARE_XDP_CALL(mlx5e_xdp_call);
+
 /* returns true if packet was consumed by xdp */
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
 		      void *va, u16 *rx_headroom, u32 *len, bool xsk)
@@ -138,7 +141,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
 		xdp.handle = di->xsk.handle;
 	xdp.rxq = &rq->xdp_rxq;
 
-	act = bpf_prog_run_xdp(prog, &xdp);
+	act = xdp_call_run(mlx5e_xdp_call, prog, &xdp);
 	if (xsk) {
 		u64 off = xdp.data - xdp.data_hard_start;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index e8d799c0dfda..0b26f9d7a968 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -39,6 +39,7 @@
 #include <linux/if_bridge.h>
 #include <net/page_pool.h>
 #include <net/xdp_sock.h>
+#include <linux/xdp_call.h>
 #include "eswitch.h"
 #include "en.h"
 #include "en/txrx.h"
@@ -4384,6 +4385,8 @@ static int mlx5e_xdp_update_state(struct mlx5e_priv *priv)
 	return 0;
 }
 
+DEFINE_XDP_CALL(mlx5e_xdp_call);
+
 static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 {
 	struct mlx5e_priv *priv = netdev_priv(netdev);
@@ -4428,6 +4431,8 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 		old_prog = xchg(&priv->channels.params.xdp_prog, prog);
 	}
 
+	xdp_call_update(mlx5e_xdp_call, old_prog, prog);
+
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
-- 
2.20.1



* Re: [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher
  2019-11-23  7:12 ` [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher Björn Töpel
@ 2019-11-24  1:55   ` Alexei Starovoitov
  2019-11-24  6:55     ` Björn Töpel
  2019-11-25  0:08   ` kbuild test robot
  2019-11-25 10:53   ` Daniel Borkmann
  2 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2019-11-24  1:55 UTC (permalink / raw)
  To: Björn Töpel
  Cc: netdev, ast, daniel, Björn Töpel, bpf, magnus.karlsson,
	magnus.karlsson, jonathan.lemon, ecree, thoiland,
	andrii.nakryiko, tariqt, saeedm, maximmi

On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> +
> +		err = emit_jump(&prog,			/* jmp thunk */
> +				__x86_indirect_thunk_rdx, prog);

could you please add a comment that this is gcc specific and gate it
by build_bug_on?
I think even if the compiler stays the same, the change of flags:
RETPOLINE_CFLAGS_GCC := -mindirect-branch=thunk-extern -mindirect-branch-register
may change the name of this helper?
I wonder whether it's possible to make it compiler independent.

> diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
> new file mode 100644
> index 000000000000..385dd76ab6d2
> --- /dev/null
> +++ b/kernel/bpf/dispatcher.c
> @@ -0,0 +1,208 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2019 Intel Corporation. */
> +
> +#ifdef CONFIG_RETPOLINE

I'm worried that such strong gating will make the code rot, especially since
it's not covered by selftests.
Could you please add xdp_call_run() to generic xdp and add a selftest ?
Also could you please benchmark it without retpoline?
iirc direct call is often faster than indirect, so I suspect this optimization
may benefit non-mitigated kernels.

> +#define DISPATCHER_HASH_BITS 10
> +#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS)
> +
> +static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE];

there is one DEFINE_XDP_CALL per driver, so total number of such
dispatch routines is pretty small. 1<<10 hash table is overkill.
The hash table itself is overkill :)

How about adding below:

> +#define BPF_DISPATCHER_MAX 16
> +
> +struct bpf_dispatcher {
> +	struct hlist_node hlist;
> +	void *func;
> +	struct bpf_prog *progs[BPF_DISPATCHER_MAX];
> +	int num_progs;
> +	void *image;
> +	u64 selector;
> +};

without hlist and without func to DEFINE_XDP_CALL() macro?
Then bpf_dispatcher_lookup() will become bpf_dispatcher_init()
and the rest will become a bit simpler?

> +
> +	set_vm_flush_reset_perms(image);
> +	set_memory_x((long)image, 1);
> +	d->image = image;

Can you add a common helper for this bit to share between
bpf dispatch and bpf trampoline?

> +static void bpf_dispatcher_update(struct bpf_dispatcher *d)
> +{
> +	void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
> +	void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
> +	s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
> +	int i, err;
> +
> +	if (!d->num_progs) {
> +		bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
> +				   old_image, NULL);
> +		return;

how does it work? Without doing d->selector = 0; the next addition
will try to do JUMP_TO_JUMP and will fail...

> +	}
> +
> +	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +		if (d->progs[i])
> +			*ipsp++ = (s64)(uintptr_t)d->progs[i]->bpf_func;
> +	}
> +	err = arch_prepare_bpf_dispatcher(new_image, &ips[0], d->num_progs);
> +	if (err)
> +		return;
> +
> +	if (d->selector) {
> +		/* progs already running at this address */
> +		err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
> +					 old_image, new_image);
> +	} else {
> +		/* first time registering */
> +		err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
> +					 NULL, new_image);
> +	}
> +	if (err)
> +		return;
> +	d->selector++;
> +}

Not sure how to share selector logic between dispatch and trampoline.
But above selector=0; weirdness is a sign that sharing is probably necessary?

> +
> +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> +				struct bpf_prog *to)
> +{
> +	struct bpf_dispatcher *d;
> +	bool changed = false;
> +
> +	if (from == to)
> +		return;
> +
> +	mutex_lock(&dispatcher_mutex);
> +	d = bpf_dispatcher_lookup(func);
> +	if (!d)
> +		goto out;
> +
> +	changed |= bpf_dispatcher_remove_prog(d, from);
> +	changed |= bpf_dispatcher_add_prog(d, to);
> +
> +	if (!changed)
> +		goto out;
> +
> +	bpf_dispatcher_update(d);
> +	if (!d->num_progs)
> +		bpf_dispatcher_free(d);

I think I got why it works.
Every time the prog cnt goes to zero you free the trampoline right away
and next time it will be allocated again and kzalloc() will zero selector.
That's hard to spot.
Also if user space does for(;;) attach/detach;
it will keep stressing bpf_jit_alloc_exec.
In case of bpf trampoline attach/detach won't be stressing it.
Only load/unload which are much slower due to verification.
I guess such difference is ok.



* Re: [PATCH bpf-next v2 2/6] xdp: introduce xdp_call
  2019-11-23  7:12 ` [PATCH bpf-next v2 2/6] xdp: introduce xdp_call Björn Töpel
@ 2019-11-24  1:59   ` Alexei Starovoitov
  2019-11-24  6:56     ` Björn Töpel
  2019-11-25 11:18   ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2019-11-24  1:59 UTC (permalink / raw)
  To: Björn Töpel
  Cc: netdev, ast, daniel, Björn Töpel, bpf, magnus.karlsson,
	magnus.karlsson, jonathan.lemon, ecree, thoiland,
	andrii.nakryiko, tariqt, saeedm, maximmi

On Sat, Nov 23, 2019 at 08:12:21AM +0100, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The xdp_call.h header wraps a more user-friendly API around the BPF
> dispatcher. A user adds a trampoline/XDP caller using the
> DEFINE_XDP_CALL macro, and updates the BPF dispatcher via
> xdp_call_update(). The actual dispatch is done via xdp_call().
> 
> Note that xdp_call() is only supported for builtin drivers. Module
> builds will fall back to bpf_prog_run_xdp().
> 
> The next patch will showcase how the i40e driver uses xdp_call.
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  include/linux/xdp_call.h | 66 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 66 insertions(+)
>  create mode 100644 include/linux/xdp_call.h
> 
> diff --git a/include/linux/xdp_call.h b/include/linux/xdp_call.h
> new file mode 100644
> index 000000000000..69b2d325a787
> --- /dev/null
> +++ b/include/linux/xdp_call.h
> @@ -0,0 +1,66 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/* Copyright(c) 2019 Intel Corporation. */
> +#ifndef _LINUX_XDP_CALL_H
> +#define _LINUX_XDP_CALL_H
> +
> +#include <linux/filter.h>
> +
> +#if defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE) && !defined(MODULE)
> +
> +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> +				struct bpf_prog *to);
> +
> +#define XDP_CALL_TRAMP(name)	____xdp_call_##name##_tramp
> +
> +#define DEFINE_XDP_CALL(name)						\
> +	unsigned int XDP_CALL_TRAMP(name)(				\
> +		const void *xdp_ctx,					\
> +		const struct bpf_insn *insnsi,				\
> +		unsigned int (*bpf_func)(const void *,			\
> +					 const struct bpf_insn *))	\
> +	{								\
> +		return bpf_func(xdp_ctx, insnsi);			\
> +	}
> +
> +#define DECLARE_XDP_CALL(name)						\
> +	unsigned int XDP_CALL_TRAMP(name)(				\
> +		const void *xdp_ctx,					\
> +		const struct bpf_insn *insnsi,				\
> +		unsigned int (*bpf_func)(const void *,			\
> +					 const struct bpf_insn *))
> +
> +#define xdp_call_run(name, prog, ctx) ({				\
> +	u32 ret;							\
> +	cant_sleep();							\
> +	if (static_branch_unlikely(&bpf_stats_enabled_key)) {		\
> +		struct bpf_prog_stats *stats;				\
> +		u64 start = sched_clock();				\
> +		ret = XDP_CALL_TRAMP(name)(ctx,				\
> +					   (prog)->insnsi,		\
> +					   (prog)->bpf_func);		\
> +		stats = this_cpu_ptr((prog)->aux->stats);		\
> +		u64_stats_update_begin(&stats->syncp);			\
> +		stats->cnt++;						\
> +		stats->nsecs += sched_clock() - start;			\
> +		u64_stats_update_end(&stats->syncp);			\
> +	} else {							\
> +		ret = XDP_CALL_TRAMP(name)(ctx,				\
> +					   (prog)->insnsi,		\
> +					   (prog)->bpf_func);		\
> +	}								\
> +	ret; })

I cannot help but wonder whether it's possible to avoid copy-paste from
BPF_PROG_RUN().
At least could you place this new macro right next to BPF_PROG_RUN?
If it's in a different file eventually they may diverge.



* Re: [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher
  2019-11-24  1:55   ` Alexei Starovoitov
@ 2019-11-24  6:55     ` Björn Töpel
  2019-11-24 17:08       ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: Björn Töpel @ 2019-11-24  6:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon, Edward Cree, Toke Høiland-Jørgensen,
	Andrii Nakryiko, Tariq Toukan, Saeed Mahameed,
	Maxim Mikityanskiy

On Sun, 24 Nov 2019 at 02:55, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> > +
> > +             err = emit_jump(&prog,                  /* jmp thunk */
> > +                             __x86_indirect_thunk_rdx, prog);
>
> could you please add a comment that this is gcc specific and gate it
> by build_bug_on?
> I think even if the compiler stays the same, the change of flags:
> RETPOLINE_CFLAGS_GCC := -mindirect-branch=thunk-extern -mindirect-branch-register
> may change the name of this helper?
> I wonder whether it's possible to make it compiler independent.
>
> > diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
> > new file mode 100644
> > index 000000000000..385dd76ab6d2
> > --- /dev/null
> > +++ b/kernel/bpf/dispatcher.c
> > @@ -0,0 +1,208 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright(c) 2019 Intel Corporation. */
> > +
> > +#ifdef CONFIG_RETPOLINE
>
> I'm worried that such strong gating will make the code rot, especially since
> it's not covered by selftests.
> Could you please add xdp_call_run() to generic xdp and add a selftest ?
> Also could you please benchmark it without retpoline?
> iirc direct call is often faster than indirect, so I suspect this optimization
> may benefit non-mitigated kernels.
>
> > +#define DISPATCHER_HASH_BITS 10
> > +#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS)
> > +
> > +static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE];
>
> there is one DEFINE_XDP_CALL per driver, so total number of such
> dispatch routines is pretty small. 1<<10 hash table is overkill.
> The hash table itself is overkill :)
>
> How about adding below:
>
> > +#define BPF_DISPATCHER_MAX 16
> > +
> > +struct bpf_dispatcher {
> > +     struct hlist_node hlist;
> > +     void *func;
> > +     struct bpf_prog *progs[BPF_DISPATCHER_MAX];
> > +     int num_progs;
> > +     void *image;
> > +     u64 selector;
> > +};
>
> without hlist and without func to DEFINE_XDP_CALL() macro?
> Then bpf_dispatcher_lookup() will become bpf_dispatcher_init()
> and the rest will become a bit simpler?
>
> > +
> > +     set_vm_flush_reset_perms(image);
> > +     set_memory_x((long)image, 1);
> > +     d->image = image;
>
> Can you add a common helper for this bit to share between
> bpf dispatch and bpf trampoline?
>
> > +static void bpf_dispatcher_update(struct bpf_dispatcher *d)
> > +{
> > +     void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
> > +     void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
> > +     s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
> > +     int i, err;
> > +
> > +     if (!d->num_progs) {
> > +             bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
> > +                                old_image, NULL);
> > +             return;
>
> how does it work? Without doing d->selector = 0; the next addition
> will try to do JUMP_TO_JUMP and will fail...
>
> > +     }
> > +
> > +     for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> > +             if (d->progs[i])
> > +                     *ipsp++ = (s64)(uintptr_t)d->progs[i]->bpf_func;
> > +     }
> > +     err = arch_prepare_bpf_dispatcher(new_image, &ips[0], d->num_progs);
> > +     if (err)
> > +             return;
> > +
> > +     if (d->selector) {
> > +             /* progs already running at this address */
> > +             err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
> > +                                      old_image, new_image);
> > +     } else {
> > +             /* first time registering */
> > +             err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
> > +                                      NULL, new_image);
> > +     }
> > +     if (err)
> > +             return;
> > +     d->selector++;
> > +}
>
> Not sure how to share selector logic between dispatch and trampoline.
> But above selector=0; weirdness is a sign that sharing is probably necessary?
>
> > +
> > +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> > +                             struct bpf_prog *to)
> > +{
> > +     struct bpf_dispatcher *d;
> > +     bool changed = false;
> > +
> > +     if (from == to)
> > +             return;
> > +
> > +     mutex_lock(&dispatcher_mutex);
> > +     d = bpf_dispatcher_lookup(func);
> > +     if (!d)
> > +             goto out;
> > +
> > +     changed |= bpf_dispatcher_remove_prog(d, from);
> > +     changed |= bpf_dispatcher_add_prog(d, to);
> > +
> > +     if (!changed)
> > +             goto out;
> > +
> > +     bpf_dispatcher_update(d);
> > +     if (!d->num_progs)
> > +             bpf_dispatcher_free(d);
>
> I think I got why it works.
> Every time the prog cnt goes to zero you free the trampoline right away
> and next time it will be allocated again and kzalloc() will zero selector.
> That's hard to spot.
> Also if user space does for(;;) attach/detach;
> it will keep stressing bpf_jit_alloc_exec.
> In case of bpf trampoline attach/detach won't be stressing it.
> Only load/unload which are much slower due to verification.
> I guess such difference is ok.
>

Alexei, thanks for all feedback (on the weekend)! I agree with all of
above, and especially missing selftests and too much code duplication.

I'll do a respin, but that'll be in the next window, given that Linus
will (probably) tag the release today.


Björn


* Re: [PATCH bpf-next v2 2/6] xdp: introduce xdp_call
  2019-11-24  1:59   ` Alexei Starovoitov
@ 2019-11-24  6:56     ` Björn Töpel
  0 siblings, 0 replies; 21+ messages in thread
From: Björn Töpel @ 2019-11-24  6:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon, Edward Cree, Toke Høiland-Jørgensen,
	Andrii Nakryiko, Tariq Toukan, Saeed Mahameed,
	Maxim Mikityanskiy

On Sun, 24 Nov 2019 at 02:59, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Nov 23, 2019 at 08:12:21AM +0100, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > The xdp_call.h header wraps a more user-friendly API around the BPF
> > dispatcher. A user adds a trampoline/XDP caller using the
> > DEFINE_XDP_CALL macro, and updates the BPF dispatcher via
> > xdp_call_update(). The actual dispatch is done via xdp_call().
> >
> > Note that xdp_call() is only supported for builtin drivers. Module
> > builds will fall back to bpf_prog_run_xdp().
> >
> > The next patch will showcase how the i40e driver uses xdp_call.
> >
> > Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> > ---
> >  include/linux/xdp_call.h | 66 ++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 66 insertions(+)
> >  create mode 100644 include/linux/xdp_call.h
> >
> > diff --git a/include/linux/xdp_call.h b/include/linux/xdp_call.h
> > new file mode 100644
> > index 000000000000..69b2d325a787
> > --- /dev/null
> > +++ b/include/linux/xdp_call.h
> > @@ -0,0 +1,66 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/* Copyright(c) 2019 Intel Corporation. */
> > +#ifndef _LINUX_XDP_CALL_H
> > +#define _LINUX_XDP_CALL_H
> > +
> > +#include <linux/filter.h>
> > +
> > +#if defined(CONFIG_BPF_JIT) && defined(CONFIG_RETPOLINE) && !defined(MODULE)
> > +
> > +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> > +                             struct bpf_prog *to);
> > +
> > +#define XDP_CALL_TRAMP(name) ____xdp_call_##name##_tramp
> > +
> > +#define DEFINE_XDP_CALL(name)                                                \
> > +     unsigned int XDP_CALL_TRAMP(name)(                              \
> > +             const void *xdp_ctx,                                    \
> > +             const struct bpf_insn *insnsi,                          \
> > +             unsigned int (*bpf_func)(const void *,                  \
> > +                                      const struct bpf_insn *))      \
> > +     {                                                               \
> > +             return bpf_func(xdp_ctx, insnsi);                       \
> > +     }
> > +
> > +#define DECLARE_XDP_CALL(name)                                               \
> > +     unsigned int XDP_CALL_TRAMP(name)(                              \
> > +             const void *xdp_ctx,                                    \
> > +             const struct bpf_insn *insnsi,                          \
> > +             unsigned int (*bpf_func)(const void *,                  \
> > +                                      const struct bpf_insn *))
> > +
> > +#define xdp_call_run(name, prog, ctx) ({                             \
> > +     u32 ret;                                                        \
> > +     cant_sleep();                                                   \
> > +     if (static_branch_unlikely(&bpf_stats_enabled_key)) {           \
> > +             struct bpf_prog_stats *stats;                           \
> > +             u64 start = sched_clock();                              \
> > +             ret = XDP_CALL_TRAMP(name)(ctx,                         \
> > +                                        (prog)->insnsi,              \
> > +                                        (prog)->bpf_func);           \
> > +             stats = this_cpu_ptr((prog)->aux->stats);               \
> > +             u64_stats_update_begin(&stats->syncp);                  \
> > +             stats->cnt++;                                           \
> > +             stats->nsecs += sched_clock() - start;                  \
> > +             u64_stats_update_end(&stats->syncp);                    \
> > +     } else {                                                        \
> > +             ret = XDP_CALL_TRAMP(name)(ctx,                         \
> > +                                        (prog)->insnsi,              \
> > +                                        (prog)->bpf_func);           \
> > +     }                                                               \
> > +     ret; })
>
> I cannot help but wonder whether it's possible to avoid copy-paste from
> BPF_PROG_RUN().
> At least could you place this new macro right next to BPF_PROG_RUN?
> If it's in a different file eventually they may diverge.
>

Yeah, I'll take a stab at that!


Thanks,
Björn


* Re: [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher
  2019-11-24  6:55     ` Björn Töpel
@ 2019-11-24 17:08       ` Alexei Starovoitov
  2019-11-24 17:16         ` Björn Töpel
  0 siblings, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2019-11-24 17:08 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon, Edward Cree, Toke Høiland-Jørgensen,
	Andrii Nakryiko, Tariq Toukan, Saeed Mahameed,
	Maxim Mikityanskiy

On Sun, Nov 24, 2019 at 07:55:07AM +0100, Björn Töpel wrote:
> >
> > I think I got why it works.
> > Every time the prog cnt goes to zero you free the trampoline right away
> > and next time it will be allocated again and kzalloc() will zero selector.
> > That's hard to spot.
> > Also if user space does for(;;) attach/detach;
> > it will keep stressing bpf_jit_alloc_exec.
> > In case of bpf trampoline attach/detach won't be stressing it.
> > Only load/unload which are much slower due to verification.
> > I guess such difference is ok.
> >
> 
> Alexei, thanks for all feedback (on the weekend)! I agree with all of
> above, and especially missing selftests and too much code duplication.
> 
> I'll do a respin, but that'll be in the next window, given that Linus
> will (probably) tag the release today.

I want it to land just as much as you do :) Two weeks is not a big deal. We
backport all of bpf and xdp as soon as it lands in bpf-next/net-next. We don't
wait for patches to reach Linus's tree. So this dispatch logic will be running
on our servers way sooner than you'd expect. I guess that explains my obsession
with quality. Same goes for libbpf.




* Re: [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher
  2019-11-24 17:08       ` Alexei Starovoitov
@ 2019-11-24 17:16         ` Björn Töpel
  0 siblings, 0 replies; 21+ messages in thread
From: Björn Töpel @ 2019-11-24 17:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon, Edward Cree, Toke Høiland-Jørgensen,
	Andrii Nakryiko, Tariq Toukan, Saeed Mahameed,
	Maxim Mikityanskiy

On Sun, 24 Nov 2019 at 18:08, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Nov 24, 2019 at 07:55:07AM +0100, Björn Töpel wrote:
> > >
> > > I think I got why it works.
> > > Every time the prog cnt goes to zero you free the trampoline right away
> > > and next time it will be allocated again and kzalloc() will zero selector.
> > > That's hard to spot.
> > > Also if user space does for(;;) attach/detach;
> > > it will keep stressing bpf_jit_alloc_exec.
> > > In case of bpf trampoline attach/detach won't be stressing it.
> > > Only load/unload which are much slower due to verification.
> > > I guess such difference is ok.
> > >
> >
> > Alexei, thanks for all feedback (on the weekend)! I agree with all of
> > above, and especially missing selftests and too much code duplication.
> >
> > I'll do a respin, but that'll be in the next window, given that Linus
> > will (probably) tag the release today.
>
> I want it to land just as much as you do :) Two weeks is not a big deal. We
> backport all of bpf and xdp as soon as it lands in bpf-next/net-next. We don't
> wait for patches to reach Linus's tree. So this dispatch logic will be running
> on our servers way sooner than you'd expect. I guess that explains my obsession
> with quality. Same goes for libbpf.
>

No reason to rush it in! It's just a week back and forth, and your
comments were spot on.


Cheers,
Björn


>


* Re: [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher
  2019-11-23  7:12 ` [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher Björn Töpel
  2019-11-24  1:55   ` Alexei Starovoitov
@ 2019-11-25  0:08   ` kbuild test robot
  2019-11-25 10:53   ` Daniel Borkmann
  2 siblings, 0 replies; 21+ messages in thread
From: kbuild test robot @ 2019-11-25  0:08 UTC (permalink / raw)
  To: Björn Töpel
  Cc: kbuild-all, netdev, ast, daniel, Björn Töpel, bpf,
	magnus.karlsson, magnus.karlsson, jonathan.lemon, ecree,
	thoiland, andrii.nakryiko, tariqt, saeedm, maximmi

[-- Attachment #1: Type: text/plain, Size: 3121 bytes --]

Hi Björn,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]
[also build test ERROR on next-20191122]
[cannot apply to v5.4-rc8]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest using the '--base' option to specify
the base tree in git format-patch; please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Bj-rn-T-pel/Introduce-the-BPF-dispatcher-and-xdp_call-h/20191125-033931
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: x86_64-randconfig-a001-20191124 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.2-10+deb8u1) 4.9.2
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add the following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   kernel/bpf/dispatcher.c: In function 'bpf_dispatcher_update':
>> kernel/bpf/dispatcher.c:144:31: error: 'BPF_MOD_JUMP_TO_NOP' undeclared (first use in this function)
      bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
                                  ^
   kernel/bpf/dispatcher.c:144:31: note: each undeclared identifier is reported only once for each function it appears in
>> kernel/bpf/dispatcher.c:159:37: error: 'BPF_MOD_JUMP_TO_JUMP' undeclared (first use in this function)
      err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
                                        ^
>> kernel/bpf/dispatcher.c:163:37: error: 'BPF_MOD_NOP_TO_JUMP' undeclared (first use in this function)
      err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
                                        ^

vim +/BPF_MOD_JUMP_TO_NOP +144 kernel/bpf/dispatcher.c

   135	
   136	static void bpf_dispatcher_update(struct bpf_dispatcher *d)
   137	{
   138		void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
   139		void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
   140		s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
   141		int i, err;
   142	
   143		if (!d->num_progs) {
 > 144			bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
   145					   old_image, NULL);
   146			return;
   147		}
   148	
   149		for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
   150			if (d->progs[i])
   151				*ipsp++ = (s64)(uintptr_t)d->progs[i]->bpf_func;
   152		}
   153		err = arch_prepare_bpf_dispatcher(new_image, &ips[0], d->num_progs);
   154		if (err)
   155			return;
   156	
   157		if (d->selector) {
   158			/* progs already running at this address */
 > 159			err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
   160						 old_image, new_image);
   161		} else {
   162			/* first time registering */
 > 163			err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
   164						 NULL, new_image);
   165		}
   166		if (err)
   167			return;
   168		d->selector++;
   169	}
   170	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27618 bytes --]


* Re: [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher
  2019-11-23  7:12 ` [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher Björn Töpel
  2019-11-24  1:55   ` Alexei Starovoitov
  2019-11-25  0:08   ` kbuild test robot
@ 2019-11-25 10:53   ` Daniel Borkmann
  2019-11-25 15:20     ` Björn Töpel
  2 siblings, 1 reply; 21+ messages in thread
From: Daniel Borkmann @ 2019-11-25 10:53 UTC (permalink / raw)
  To: Björn Töpel
  Cc: netdev, ast, Björn Töpel, bpf, magnus.karlsson,
	magnus.karlsson, jonathan.lemon, ecree, thoiland,
	andrii.nakryiko, tariqt, saeedm, maximmi

On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The BPF dispatcher is a multiway branch code generator, mainly
> targeted for XDP programs. When an XDP program is executed via
> bpf_prog_run_xdp(), it is invoked via an indirect call. With
> retpolines enabled, the indirect call has a substantial performance
> impact. The dispatcher is a mechanism that transforms multiple indirect
> calls into direct calls, and therefore avoids the retpoline. The
> dispatcher is generated using the BPF JIT, and relies on the text
> poking provided by bpf_arch_text_poke().
> 
> The dispatcher hijacks a trampoline function via the __fentry__ nop
> of the trampoline. One dispatcher instance currently supports up to 16
> dispatch points. This can be extended in the future.
> 
> An example: A module/driver allocates a dispatcher. The dispatcher is
> shared for all netdevs. Each unique XDP program has a slot in the
> dispatcher, registered by a netdev. The netdev then uses the
> dispatcher to call the correct program with a direct call.
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
[...]
> +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs)
> +{
> +	u8 *jg_reloc, *jg_target, *prog = *pprog;
> +	int pivot, err, jg_bytes = 1, cnt = 0;
> +	s64 jg_offset;
> +
> +	if (a == b) {
> +		/* Leaf node of recursion, i.e. not a range of indices
> +		 * anymore.
> +		 */
> +		EMIT1(add_1mod(0x48, BPF_REG_3));	/* cmp rdx,func */
> +		if (!is_simm32(progs[a]))
> +			return -1;
> +		EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3),
> +			    progs[a]);
> +		err = emit_cond_near_jump(&prog,	/* je func */
> +					  (void *)progs[a], prog,
> +					  X86_JE);
> +		if (err)
> +			return err;
> +
> +		err = emit_jump(&prog,			/* jmp thunk */
> +				__x86_indirect_thunk_rdx, prog);
> +		if (err)
> +			return err;
> +
> +		*pprog = prog;
> +		return 0;
> +	}
> +
> +	/* Not a leaf node, so we pivot, and recursively descend into
> +	 * the lower and upper ranges.
> +	 */
> +	pivot = (b - a) / 2;
> +	EMIT1(add_1mod(0x48, BPF_REG_3));		/* cmp rdx,func */
> +	if (!is_simm32(progs[a + pivot]))
> +		return -1;
> +	EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3), progs[a + pivot]);
> +
> +	if (pivot > 2) {				/* jg upper_part */
> +		/* Require near jump. */
> +		jg_bytes = 4;
> +		EMIT2_off32(0x0F, X86_JG + 0x10, 0);
> +	} else {
> +		EMIT2(X86_JG, 0);
> +	}
> +	jg_reloc = prog;
> +
> +	err = emit_bpf_dispatcher(&prog, a, a + pivot,	/* emit lower_part */
> +				  progs);
> +	if (err)
> +		return err;
> +
> +	/* Intel 64 and IA-32 Architectures Optimization Reference
> +	 * Manual, 3.4.1.5 Code Alignment Assembly/Compiler Coding
> +	 * Rule 12. (M impact, H generality) All branch targets should
> +	 * be 16-byte aligned.

Isn't this section 3.4.1.4, rule 11 or are you reading a newer manual
than on the website [0]? :) Just wondering, in your IXIA tests, did you
see any noticeable slowdowns if you don't do the 16-byte alignments as
in the rest of the kernel [1,2]?

  [0] https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
  [1] be6cb02779ca ("x86: Align jump targets to 1-byte boundaries")
  [2] https://lore.kernel.org/patchwork/patch/560050/

> +	 */
> +	jg_target = PTR_ALIGN(prog, 16);
> +	if (jg_target != prog)
> +		emit_nops(&prog, jg_target - prog);
> +	jg_offset = prog - jg_reloc;
> +	emit_code(jg_reloc - jg_bytes, jg_offset, jg_bytes);
> +
> +	err = emit_bpf_dispatcher(&prog, a + pivot + 1,	/* emit upper_part */
> +				  b, progs);
> +	if (err)
> +		return err;
> +
> +	*pprog = prog;
> +	return 0;
> +}


* Re: [PATCH bpf-next v2 2/6] xdp: introduce xdp_call
  2019-11-23  7:12 ` [PATCH bpf-next v2 2/6] xdp: introduce xdp_call Björn Töpel
  2019-11-24  1:59   ` Alexei Starovoitov
@ 2019-11-25 11:18   ` Toke Høiland-Jørgensen
  2019-11-25 15:21     ` Björn Töpel
  1 sibling, 1 reply; 21+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-25 11:18 UTC (permalink / raw)
  To: Björn Töpel, netdev, ast, daniel
  Cc: Björn Töpel, bpf, magnus.karlsson, magnus.karlsson,
	jonathan.lemon, ecree, andrii.nakryiko, tariqt, saeedm, maximmi

Björn Töpel <bjorn.topel@gmail.com> writes:

> From: Björn Töpel <bjorn.topel@intel.com>
>
> The xdp_call.h header wraps a more user-friendly API around the BPF
> dispatcher. A user adds a trampoline/XDP caller using the
> DEFINE_XDP_CALL macro, and updates the BPF dispatcher via
> xdp_call_update(). The actual dispatch is done via xdp_call().
>
> Note that xdp_call() is only supported for builtin drivers. Module
> builds will fall back to bpf_prog_run_xdp().

I don't like this restriction. Distro kernels are not likely to start
shipping all the network drivers builtin, so they won't see the
performance benefits of this dispatcher.

What is the reason these dispatcher blocks have to reside in the driver?
Couldn't we just allocate one system-wide, and then simply change
bpf_prog_run_xdp() to make use of it transparently (from the driver
PoV)? That would also remove the need to modify every driver...

-Toke



* Re: [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher
  2019-11-25 10:53   ` Daniel Borkmann
@ 2019-11-25 15:20     ` Björn Töpel
  0 siblings, 0 replies; 21+ messages in thread
From: Björn Töpel @ 2019-11-25 15:20 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Netdev, Alexei Starovoitov, Björn Töpel, bpf,
	Magnus Karlsson, Karlsson, Magnus, Jonathan Lemon, Edward Cree,
	Toke Høiland-Jørgensen, Andrii Nakryiko, Tariq Toukan,
	Saeed Mahameed, Maxim Mikityanskiy

On Mon, 25 Nov 2019 at 11:54, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > The BPF dispatcher is a multiway branch code generator, mainly
> > targeted for XDP programs. When an XDP program is executed via
> > bpf_prog_run_xdp(), it is invoked via an indirect call. With
> > retpolines enabled, the indirect call has a substantial performance
> > impact. The dispatcher is a mechanism that transforms multiple indirect
> > calls into direct calls, and therefore avoids the retpoline. The
> > dispatcher is generated using the BPF JIT, and relies on the text
> > poking provided by bpf_arch_text_poke().
> >
> > The dispatcher hijacks a trampoline function via the __fentry__ nop
> > of the trampoline. One dispatcher instance currently supports up to 16
> > dispatch points. This can be extended in the future.
> >
> > An example: A module/driver allocates a dispatcher. The dispatcher is
> > shared for all netdevs. Each unique XDP program has a slot in the
> > dispatcher, registered by a netdev. The netdev then uses the
> > dispatcher to call the correct program with a direct call.
> >
> > Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> [...]
> > +static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs)
> > +{
> > +     u8 *jg_reloc, *jg_target, *prog = *pprog;
> > +     int pivot, err, jg_bytes = 1, cnt = 0;
> > +     s64 jg_offset;
> > +
> > +     if (a == b) {
> > +             /* Leaf node of recursion, i.e. not a range of indices
> > +              * anymore.
> > +              */
> > +             EMIT1(add_1mod(0x48, BPF_REG_3));       /* cmp rdx,func */
> > +             if (!is_simm32(progs[a]))
> > +                     return -1;
> > +             EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3),
> > +                         progs[a]);
> > +             err = emit_cond_near_jump(&prog,        /* je func */
> > +                                       (void *)progs[a], prog,
> > +                                       X86_JE);
> > +             if (err)
> > +                     return err;
> > +
> > +             err = emit_jump(&prog,                  /* jmp thunk */
> > +                             __x86_indirect_thunk_rdx, prog);
> > +             if (err)
> > +                     return err;
> > +
> > +             *pprog = prog;
> > +             return 0;
> > +     }
> > +
> > +     /* Not a leaf node, so we pivot, and recursively descend into
> > +      * the lower and upper ranges.
> > +      */
> > +     pivot = (b - a) / 2;
> > +     EMIT1(add_1mod(0x48, BPF_REG_3));               /* cmp rdx,func */
> > +     if (!is_simm32(progs[a + pivot]))
> > +             return -1;
> > +     EMIT2_off32(0x81, add_1reg(0xF8, BPF_REG_3), progs[a + pivot]);
> > +
> > +     if (pivot > 2) {                                /* jg upper_part */
> > +             /* Require near jump. */
> > +             jg_bytes = 4;
> > +             EMIT2_off32(0x0F, X86_JG + 0x10, 0);
> > +     } else {
> > +             EMIT2(X86_JG, 0);
> > +     }
> > +     jg_reloc = prog;
> > +
> > +     err = emit_bpf_dispatcher(&prog, a, a + pivot,  /* emit lower_part */
> > +                               progs);
> > +     if (err)
> > +             return err;
> > +
> > +     /* Intel 64 and IA-32 Architectures Optimization Reference
> > +      * Manual, 3.4.1.5 Code Alignment Assembly/Compiler Coding
> > +      * Rule 12. (M impact, H generality) All branch targets should
> > +      * be 16-byte aligned.
>
> Isn't this section 3.4.1.4, rule 11 or are you reading a newer manual
> than on the website [0]? :)

Argh, no, you're right. Typo. Thanks!

> Just wondering, in your IXIA tests, did you
> see any noticeable slowdowns if you don't do the 16-byte alignments as
> in the rest of the kernel [1,2]?
>
>   [0] https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
>   [1] be6cb02779ca ("x86: Align jump targets to 1-byte boundaries")
>   [2] https://lore.kernel.org/patchwork/patch/560050/
>

Interesting! Thanks for the pointers. I'll do more benchmarking for
the next rev, and hopefully we can leave the nops out. Let's see.


Björn

> > +      */
> > +     jg_target = PTR_ALIGN(prog, 16);
> > +     if (jg_target != prog)
> > +             emit_nops(&prog, jg_target - prog);
> > +     jg_offset = prog - jg_reloc;
> > +     emit_code(jg_reloc - jg_bytes, jg_offset, jg_bytes);
> > +
> > +     err = emit_bpf_dispatcher(&prog, a + pivot + 1, /* emit upper_part */
> > +                               b, progs);
> > +     if (err)
> > +             return err;
> > +
> > +     *pprog = prog;
> > +     return 0;
> > +}


* Re: [PATCH bpf-next v2 2/6] xdp: introduce xdp_call
  2019-11-25 11:18   ` Toke Høiland-Jørgensen
@ 2019-11-25 15:21     ` Björn Töpel
  2019-11-25 15:56       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 21+ messages in thread
From: Björn Töpel @ 2019-11-25 15:21 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon, Edward Cree, Andrii Nakryiko, Tariq Toukan,
	Saeed Mahameed, Maxim Mikityanskiy

On Mon, 25 Nov 2019 at 12:18, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Björn Töpel <bjorn.topel@gmail.com> writes:
>
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > The xdp_call.h header wraps a more user-friendly API around the BPF
> > dispatcher. A user adds a trampoline/XDP caller using the
> > DEFINE_XDP_CALL macro, and updates the BPF dispatcher via
> > xdp_call_update(). The actual dispatch is done via xdp_call().
> >
> > Note that xdp_call() is only supported for builtin drivers. Module
> > builds will fallback to bpf_prog_run_xdp().
>
> I don't like this restriction. Distro kernels are not likely to start
> shipping all the network drivers builtin, so they won't benefit from the
> performance benefits from this dispatcher.
>
> What is the reason these dispatcher blocks have to reside in the driver?
> Couldn't we just allocate one system-wide, and then simply change
> bpf_prog_run_xdp() to make use of it transparently (from the driver
> PoV)? That would also remove the need to modify every driver...
>

Good idea! I'll try that out. Thanks for the suggestion!

Björn


> -Toke
>


* Re: [PATCH bpf-next v2 2/6] xdp: introduce xdp_call
  2019-11-25 15:21     ` Björn Töpel
@ 2019-11-25 15:56       ` Toke Høiland-Jørgensen
  2019-11-26  7:43         ` Björn Töpel
  0 siblings, 1 reply; 21+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-25 15:56 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon, Edward Cree, Andrii Nakryiko, Tariq Toukan,
	Saeed Mahameed, Maxim Mikityanskiy

Björn Töpel <bjorn.topel@gmail.com> writes:

> On Mon, 25 Nov 2019 at 12:18, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>
>> > From: Björn Töpel <bjorn.topel@intel.com>
>> >
>> > The xdp_call.h header wraps a more user-friendly API around the BPF
>> > dispatcher. A user adds a trampoline/XDP caller using the
>> > DEFINE_XDP_CALL macro, and updates the BPF dispatcher via
>> > xdp_call_update(). The actual dispatch is done via xdp_call().
>> >
>> > Note that xdp_call() is only supported for builtin drivers. Module
>> > builds will fallback to bpf_prog_run_xdp().
>>
>> I don't like this restriction. Distro kernels are not likely to start
>> shipping all the network drivers builtin, so they won't benefit from the
>> performance benefits from this dispatcher.
>>
>> What is the reason these dispatcher blocks have to reside in the driver?
>> Couldn't we just allocate one system-wide, and then simply change
>> bpf_prog_run_xdp() to make use of it transparently (from the driver
>> PoV)? That would also remove the need to modify every driver...
>>
>
> Good idea! I'll try that out. Thanks for the suggestion!

Awesome! I guess the table may need to be a bit bigger if it's
system-wide? But since you've already gone to all that trouble with the
binary search, I guess that shouldn't have too much of a performance
impact? Maybe the size could even be a config option so users/distros
can make their own size tradeoff?

-Toke



* Re: [PATCH bpf-next v2 2/6] xdp: introduce xdp_call
  2019-11-25 15:56       ` Toke Høiland-Jørgensen
@ 2019-11-26  7:43         ` Björn Töpel
  2019-11-26  8:37           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 21+ messages in thread
From: Björn Töpel @ 2019-11-26  7:43 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon, Edward Cree, Andrii Nakryiko, Tariq Toukan,
	Saeed Mahameed, Maxim Mikityanskiy

On Mon, 25 Nov 2019 at 16:56, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Björn Töpel <bjorn.topel@gmail.com> writes:
>
> > On Mon, 25 Nov 2019 at 12:18, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Björn Töpel <bjorn.topel@gmail.com> writes:
> >>
> >> > From: Björn Töpel <bjorn.topel@intel.com>
> >> >
> >> > The xdp_call.h header wraps a more user-friendly API around the BPF
> >> > dispatcher. A user adds a trampoline/XDP caller using the
> >> > DEFINE_XDP_CALL macro, and updates the BPF dispatcher via
> >> > xdp_call_update(). The actual dispatch is done via xdp_call().
> >> >
> >> > Note that xdp_call() is only supported for builtin drivers. Module
> >> > builds will fallback to bpf_prog_run_xdp().
> >>
> >> I don't like this restriction. Distro kernels are not likely to start
> >> shipping all the network drivers builtin, so they won't benefit from the
> >> performance benefits from this dispatcher.
> >>
> >> What is the reason these dispatcher blocks have to reside in the driver?
> >> Couldn't we just allocate one system-wide, and then simply change
> >> bpf_prog_run_xdp() to make use of it transparently (from the driver
> >> PoV)? That would also remove the need to modify every driver...
> >>
> >
> > Good idea! I'll try that out. Thanks for the suggestion!
>
> Awesome! I guess the table may need to be a bit bigger if it's
> system-wide? But since you've already gone to all that trouble with the
> binary search, I guess that shouldn't have too much of a performance
> impact? Maybe the size could even be a config option so users/distros
> can make their own size tradeoff?
>

My bigger concern is not the dispatcher size, but that any XDP update
will be a system-wide text poke. OTOH, this is still the case even if
there are multiple dispatchers. No more "swap the XDP program within
one packet's latency".

Still, definitely worth a try. And configurable dispatcher size might
be a good idea as well! Thanks!


Cheers,
Björn

> -Toke
>


* Re: [PATCH bpf-next v2 2/6] xdp: introduce xdp_call
  2019-11-26  7:43         ` Björn Töpel
@ 2019-11-26  8:37           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 21+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-26  8:37 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, bpf, Magnus Karlsson, Karlsson, Magnus,
	Jonathan Lemon, Edward Cree, Andrii Nakryiko, Tariq Toukan,
	Saeed Mahameed, Maxim Mikityanskiy

Björn Töpel <bjorn.topel@gmail.com> writes:

> On Mon, 25 Nov 2019 at 16:56, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>
>> > On Mon, 25 Nov 2019 at 12:18, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Björn Töpel <bjorn.topel@gmail.com> writes:
>> >>
>> >> > From: Björn Töpel <bjorn.topel@intel.com>
>> >> >
>> >> > The xdp_call.h header wraps a more user-friendly API around the BPF
>> >> > dispatcher. A user adds a trampoline/XDP caller using the
>> >> > DEFINE_XDP_CALL macro, and updates the BPF dispatcher via
>> >> > xdp_call_update(). The actual dispatch is done via xdp_call().
>> >> >
>> >> > Note that xdp_call() is only supported for builtin drivers. Module
>> >> > builds will fallback to bpf_prog_run_xdp().
>> >>
>> >> I don't like this restriction. Distro kernels are not likely to start
>> >> shipping all the network drivers builtin, so they won't benefit from the
>> >> performance benefits from this dispatcher.
>> >>
>> >> What is the reason these dispatcher blocks have to reside in the driver?
>> >> Couldn't we just allocate one system-wide, and then simply change
>> >> bpf_prog_run_xdp() to make use of it transparently (from the driver
>> >> PoV)? That would also remove the need to modify every driver...
>> >>
>> >
>> > Good idea! I'll try that out. Thanks for the suggestion!
>>
>> Awesome! I guess the table may need to be a bit bigger if it's
>> system-wide? But since you've already gone to all that trouble with the
>> binary search, I guess that shouldn't have too much of a performance
>> impact? Maybe the size could even be a config option so users/distros
>> can make their own size tradeoff?
>>
>
> My bigger concern is not the dispatcher size, but that any XDP update
> will be a system wide text-poke. OTOH, this is still the case even if
> there are multiple dispatchers. No more "quickly swap XDP program in
> one packet latency".

Ah, right. I don't actually know the details of how all this kernel text
rewriting happens. I just assumed it was magic faerie dust that made
everything faster; but now you're telling me there are tradeoffs?! ;)

When you say "no more quickly swap XDP programs" you mean that the
attach operation itself will take longer, right? I.e., it's not that it
will disrupt packet flow to the old program while it's happening? Also,
how much longer?

-Toke




Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-23  7:12 [PATCH bpf-next v2 0/6] Introduce the BPF dispatcher and xdp_call.h Björn Töpel
2019-11-23  7:12 ` [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher Björn Töpel
2019-11-24  1:55   ` Alexei Starovoitov
2019-11-24  6:55     ` Björn Töpel
2019-11-24 17:08       ` Alexei Starovoitov
2019-11-24 17:16         ` Björn Töpel
2019-11-25  0:08   ` kbuild test robot
2019-11-25 10:53   ` Daniel Borkmann
2019-11-25 15:20     ` Björn Töpel
2019-11-23  7:12 ` [PATCH bpf-next v2 2/6] xdp: introduce xdp_call Björn Töpel
2019-11-24  1:59   ` Alexei Starovoitov
2019-11-24  6:56     ` Björn Töpel
2019-11-25 11:18   ` Toke Høiland-Jørgensen
2019-11-25 15:21     ` Björn Töpel
2019-11-25 15:56       ` Toke Høiland-Jørgensen
2019-11-26  7:43         ` Björn Töpel
2019-11-26  8:37           ` Toke Høiland-Jørgensen
2019-11-23  7:12 ` [PATCH bpf-next v2 3/6] i40e: start using xdp_call.h Björn Töpel
2019-11-23  7:12 ` [PATCH bpf-next v2 4/6] ixgbe: " Björn Töpel
2019-11-23  7:12 ` [PATCH bpf-next v2 5/6] net/mlx4_en: " Björn Töpel
2019-11-23  7:12 ` [PATCH bpf-next v2 6/6] net/mlx5e: Start " Björn Töpel
