Re: [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: "Björn Töpel" <bjorn.topel@gmail.com>
Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	"Björn Töpel" <bjorn.topel@intel.com>,
	bpf@vger.kernel.org, magnus.karlsson@gmail.com,
	magnus.karlsson@intel.com, jonathan.lemon@gmail.com,
	ecree@solarflare.com, thoiland@redhat.com,
	andrii.nakryiko@gmail.com, tariqt@mellanox.com,
	saeedm@mellanox.com, maximmi@mellanox.com
Subject: Re: [PATCH bpf-next v2 1/6] bpf: introduce BPF dispatcher
Date: Sat, 23 Nov 2019 17:55:06 -0800	[thread overview]
Message-ID: <20191124015504.yypqw4gx52e5e6og@ast-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <20191123071226.6501-2-bjorn.topel@gmail.com>

On Sat, Nov 23, 2019 at 08:12:20AM +0100, Björn Töpel wrote:
> +
> +		err = emit_jump(&prog,			/* jmp thunk */
> +				__x86_indirect_thunk_rdx, prog);

could you please add a comment that this is gcc specific and gate it
by build_bug_on ?
I think even if compiler stays the change of flags:
RETPOLINE_CFLAGS_GCC := -mindirect-branch=thunk-extern -mindirect-branch-register
may change the name of this helper?
I wonder whether it's possible to make it compiler independent.

> diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
> new file mode 100644
> index 000000000000..385dd76ab6d2
> --- /dev/null
> +++ b/kernel/bpf/dispatcher.c
> @@ -0,0 +1,208 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2019 Intel Corporation. */
> +
> +#ifdef CONFIG_RETPOLINE

I'm worried that such strong gating will make the code rot. Especially it's not
covered by selftests.
Could you please add xdp_call_run() to generic xdp and add a selftest ?
Also could you please benchmark it without retpoline?
iirc direct call is often faster than indirect, so I suspect this optimization
may benefit non-mitigated kernels.

> +#define DISPATCHER_HASH_BITS 10
> +#define DISPATCHER_TABLE_SIZE (1 << DISPATCHER_HASH_BITS)
> +
> +static struct hlist_head dispatcher_table[DISPATCHER_TABLE_SIZE];

there is one DEFINE_XDP_CALL per driver, so total number of such
dispatch routines is pretty small. 1<<10 hash table is overkill.
The hash table itself is overkill :)

How about adding below:

> +#define BPF_DISPATCHER_MAX 16
> +
> +struct bpf_dispatcher {
> +	struct hlist_node hlist;
> +	void *func;
> +	struct bpf_prog *progs[BPF_DISPATCHER_MAX];
> +	int num_progs;
> +	void *image;
> +	u64 selector;
> +};

without hlist and without func to DEFINE_XDP_CALL() macro?
Then bpf_dispatcher_lookup() will become bpf_dispatcher_init()
and the rest will become a bit simpler?

> +
> +	set_vm_flush_reset_perms(image);
> +	set_memory_x((long)image, 1);
> +	d->image = image;

Can you add a common helper for this bit to share between
bpf dispatch and bpf trampoline?

> +static void bpf_dispatcher_update(struct bpf_dispatcher *d)
> +{
> +	void *old_image = d->image + ((d->selector + 1) & 1) * PAGE_SIZE / 2;
> +	void *new_image = d->image + (d->selector & 1) * PAGE_SIZE / 2;
> +	s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
> +	int i, err;
> +
> +	if (!d->num_progs) {
> +		bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_NOP,
> +				   old_image, NULL);
> +		return;

how does it work? Without doing d->selector = 0; the next addition
will try to do JUMP_TO_JUMP and will fail...

> +	}
> +
> +	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
> +		if (d->progs[i])
> +			*ipsp++ = (s64)(uintptr_t)d->progs[i]->bpf_func;
> +	}
> +	err = arch_prepare_bpf_dispatcher(new_image, &ips[0], d->num_progs);
> +	if (err)
> +		return;
> +
> +	if (d->selector) {
> +		/* progs already running at this address */
> +		err = bpf_arch_text_poke(d->func, BPF_MOD_JUMP_TO_JUMP,
> +					 old_image, new_image);
> +	} else {
> +		/* first time registering */
> +		err = bpf_arch_text_poke(d->func, BPF_MOD_NOP_TO_JUMP,
> +					 NULL, new_image);
> +	}
> +	if (err)
> +		return;
> +	d->selector++;
> +}

Not sure how to share selector logic between dispatch and trampoline.
But above selector=0; weirdness is a sign that sharing is probably necessary?

> +
> +void bpf_dispatcher_change_prog(void *func, struct bpf_prog *from,
> +				struct bpf_prog *to)
> +{
> +	struct bpf_dispatcher *d;
> +	bool changed = false;
> +
> +	if (from == to)
> +		return;
> +
> +	mutex_lock(&dispatcher_mutex);
> +	d = bpf_dispatcher_lookup(func);
> +	if (!d)
> +		goto out;
> +
> +	changed |= bpf_dispatcher_remove_prog(d, from);
> +	changed |= bpf_dispatcher_add_prog(d, to);
> +
> +	if (!changed)
> +		goto out;
> +
> +	bpf_dispatcher_update(d);
> +	if (!d->num_progs)
> +		bpf_dispatcher_free(d);

I think I got it why it works.
Every time the prog cnt goes to zero you free the trampoline right away
and next time it will be allocated again and kzalloc() will zero selector.
That's hard to spot.
Also if user space does for(;;) attach/detach;
it will keep stressing bpf_jit_alloc_exec.
In case of bpf trampoline attach/detach won't be stressing it.
Only load/unload which are much slower due to verification.
I guess such difference is ok.