bpf.vger.kernel.org archive mirror
* [PATCH RFC] Userspace library for handling multiple XDP programs on an interface
@ 2020-02-28 14:22 Toke Høiland-Jørgensen
  2020-02-28 14:22 ` [PATCH RFC] libxdp: Add libxdp (FOR COMMENT ONLY) Toke Høiland-Jørgensen
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-28 14:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Lorenz Bauer,
	Andrey Ignatov, netdev, bpf

Hi everyone

As most of you are no doubt aware, we've had various discussions on how
to handle multiple XDP programs on a single interface. With the freplace
functionality, the kernel infrastructure is now there to handle this
(almost, see "missing pieces" below).

While the freplace mechanism offers userspace a lot of flexibility in
how to handle dispatching of multiple XDP programs, userspace also has
to do quite a bit of work to implement this (compared to just issuing
load+attach). The goal of this email is to get some feedback on a
library to implement this, in the hope that we can converge on something
that will be widely applicable, ensuring that (a) no one has to reinvent
the wheel, and (b) we don't end up with a proliferation of subtly
incompatible dispatcher models that make it hard or impossible to mix
and match XDP programs from multiple sources.

My proposal for the beginnings of such a library is in the xdp-tools repository
on Github, in the 'xdp-multi-prog' branch.

To clone and compile simply do this:

$ git clone --recurse-submodules -b xdp-multi-prog https://github.com/xdp-project/xdp-tools
$ cd xdp-tools && ./configure && make

See lib/libxdp/libxdp.c for the library implementation, and xdp-loader/ for a
command-line loader that supports loading multiple programs in one go using the
dispatcher (just supply it multiple filenames on the command line). There are
still some missing bits, marked with FIXME comments in the code, and discussed
below.

I'm also including libxdp as a patch in the next email, but only to facilitate
easy commenting on the code; use the version on Github if you actually want to
compile and play with it.

The overall goal of the library is to *make the simple case easy* but retain
enough flexibility for custom applications to specify their own load order etc
where needed. The "simple case" here being to just load one or more XDP programs
onto an interface while retaining any programs that may already be loaded there.

**** Program metadata

To do this, I propose two pieces of metadata that an XDP program can specify for
itself, which will serve as defaults to guide the loading:

- Its *run priority*: This is simply an integer priority number that will be
  used to sort programs when building the dispatcher. The inspiration is
  old-style rc init scripts, where daemons are started in numerical order on
  boot (e.g., /etc/rc.d/S50sshd). My hope here is that we can establish a
  convention around ranges of priorities that make sense for different types of
  programs; e.g., packet filters would use low priorities, and programs that
  want to monitor traffic on the host would use high priorities, etc.

- Its *chain call actions*: These are the return codes for which the next
  program should be called. The idea here is that a program can indicate which
  actions it makes sense to continue operating on; the default is just XDP_PASS,
  and I expect this would be the most common case.

The metadata is specified using BTF, using a syntax similar to BTF-defined maps,
i.e.:

struct {
	__uint(priority, 10);
	__uint(XDP_PASS, 1); // chain call on XDP_PASS...
	__uint(XDP_DROP, 1); // ...and on XDP_DROP
} XDP_RUN_CONFIG(FUNCNAME);

(where FUNCNAME is the function name of the XDP program this config refers to).

Using BTF for this ensures that the metadata stays with the program in the
object file. And because this becomes part of the object BTF, it will be loaded
into the kernel and is thus also retrievable for loaded programs.

The libxdp loader will use the run priority to sort XDP programs before loading,
and it will use the chain call actions to configure the dispatcher program. Note
that the values defined in the metadata only serve as defaults, though; the user
should be able to override the values on load to sort programs in an arbitrary
order.

**** The dispatcher program
The dispatcher program is a simple XDP program that is generated from a template
to just implement a series of dispatch statements like these:

        if (num_progs_enabled < 1)
                goto out;
        ret = prog0(ctx);
        if (!((1 << ret) & conf.chain_call_actions[0]))
                return ret;

        if (num_progs_enabled < 2)
                goto out;
        ret = prog1(ctx);
        if (!((1 << ret) & conf.chain_call_actions[1]))
                return ret;

        [...]

The num_progs_enabled and conf.chain_call_actions variables are static const
global variables, which means that the compiler will put them into the .rodata
section, allowing the kernel to perform dead code elimination if the
num_progs_enabled check fails. libxdp will set the values based on the program
metadata before loading the dispatcher, then use freplace to put the actual
component programs into the placeholders specified by prog0, prog1, etc.

The dispatcher program makes liberal use of variables marked as 'volatile' to
prevent the compiler from optimising out the checks and calls to the dummy
functions.

**** Missing pieces
While the libxdp code can assemble a basic dispatcher and load it into the
kernel, there are a couple of missing pieces on the kernel side; I will propose
patches to fix these, but figured there was no reason to hold back posting of
the library for comments because of this. These missing pieces are:

- There is currently no way to persist the freplace after the program exits; the
  file descriptor returned by bpf_raw_tracepoint_open() will release the program
  when it is closed, and it cannot be pinned.

- There is no way to re-attach an already loaded program to another function;
  this is needed for updating the call sequence: When a new program is loaded,
  libxdp should get the existing list of component programs on the interface and
  insert the new one into the chain in the appropriate place. To do this it
  needs to build a new dispatcher and reattach all the old programs to it.
  Ideally, this should be doable without detaching them from the old dispatcher;
  that way, we can build the new dispatcher completely, and atomically replace
  it on the interface by the usual XDP attach mechanism.

---

Toke Høiland-Jørgensen (1):
      libxdp: Add libxdp (FOR COMMENT ONLY)


 tools/lib/xdp/libxdp.c          |  856 +++++++++++++++++++++++++++++++++++++++
 tools/lib/xdp/libxdp.h          |   38 ++
 tools/lib/xdp/prog_dispatcher.h |   17 +
 tools/lib/xdp/xdp-dispatcher.c  |  178 ++++++++
 tools/lib/xdp/xdp_helpers.h     |   12 +
 5 files changed, 1101 insertions(+)
 create mode 100644 tools/lib/xdp/libxdp.c
 create mode 100644 tools/lib/xdp/libxdp.h
 create mode 100644 tools/lib/xdp/prog_dispatcher.h
 create mode 100644 tools/lib/xdp/xdp-dispatcher.c
 create mode 100644 tools/lib/xdp/xdp_helpers.h



* [PATCH RFC] libxdp: Add libxdp (FOR COMMENT ONLY)
  2020-02-28 14:22 [PATCH RFC] Userspace library for handling multiple XDP programs on an interface Toke Høiland-Jørgensen
@ 2020-02-28 14:22 ` Toke Høiland-Jørgensen
  2020-02-28 22:15 ` [PATCH RFC] Userspace library for handling multiple XDP programs on an interface Andrey Ignatov
  2020-02-28 23:21 ` Andrii Nakryiko
  2 siblings, 0 replies; 11+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-28 14:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Lorenz Bauer,
	Andrey Ignatov, netdev, bpf

From: Toke Høiland-Jørgensen <toke@redhat.com>

**THIS IS A DUMMY PATCH TO EASE INLINE REVIEW ONLY; SEE COVER LETTER**

This adds tools/lib/xdp/ with the contents of the libxdp code. This is only
included to ease inline commenting on the code; it does not actually
compile in this form, and I am not (yet) proposing it should live in the
kernel tree.

There is a working version on Github; to clone and compile simply do this:

$ git clone --recurse-submodules -b xdp-multi-prog https://github.com/xdp-project/xdp-tools
$ cd xdp-tools && ./configure && make

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 tools/lib/xdp/libxdp.c          |  856 +++++++++++++++++++++++++++++++++++++++
 tools/lib/xdp/libxdp.h          |   38 ++
 tools/lib/xdp/prog_dispatcher.h |   17 +
 tools/lib/xdp/xdp-dispatcher.c  |  178 ++++++++
 tools/lib/xdp/xdp_helpers.h     |   12 +
 5 files changed, 1101 insertions(+)
 create mode 100644 tools/lib/xdp/libxdp.c
 create mode 100644 tools/lib/xdp/libxdp.h
 create mode 100644 tools/lib/xdp/prog_dispatcher.h
 create mode 100644 tools/lib/xdp/xdp-dispatcher.c
 create mode 100644 tools/lib/xdp/xdp_helpers.h

diff --git a/tools/lib/xdp/libxdp.c b/tools/lib/xdp/libxdp.c
new file mode 100644
index 000000000000..46f2a73157a0
--- /dev/null
+++ b/tools/lib/xdp/libxdp.c
@@ -0,0 +1,856 @@
+// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+
+/*
+ * XDP management utility functions
+ *
+ * Copyright (C) 2020 Toke Høiland-Jørgensen <toke@redhat.com>
+ */
+
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+
+#include <linux/err.h> /* ERR_PTR */
+#include <linux/if_link.h>
+
+#include <bpf/libbpf.h>
+#include <bpf/btf.h>
+#include <xdp/libxdp.h>
+#include <xdp/prog_dispatcher.h>
+#include "logging.h"
+#include "util.h"
+
+#define XDP_RUN_CONFIG_SEC ".xdp_run_config"
+
+struct xdp_program {
+	/* one of prog or prog_fd should be set */
+	struct bpf_program *bpf_prog;
+	struct bpf_object *bpf_obj;
+	struct btf *btf;
+	int prog_fd;
+	int link_fd;
+	const char *prog_name;
+	__u8 prog_tag[BPF_TAG_SIZE];
+	__u64 load_time;
+	bool from_external_obj;
+	unsigned int run_prio;
+	unsigned int chain_call_actions; // bitmap
+};
+
+
+static const char *xdp_action_names[] = {
+	[XDP_ABORTED] = "XDP_ABORTED",
+	[XDP_DROP] = "XDP_DROP",
+	[XDP_PASS] = "XDP_PASS",
+	[XDP_TX] = "XDP_TX",
+	[XDP_REDIRECT] = "XDP_REDIRECT",
+};
+
+static struct btf *xdp_program__btf(struct xdp_program *xdp_prog)
+{
+	return xdp_prog->btf;
+}
+
+void xdp_program__set_chain_call_enabled(struct xdp_program *prog, unsigned int action,
+					 bool enabled)
+{
+	/* FIXME: Should this also update the BTF info? */
+	if (enabled)
+		prog->chain_call_actions |= (1<<action);
+	else
+		prog->chain_call_actions &= ~(1<<action);
+}
+
+bool xdp_program__chain_call_enabled(struct xdp_program *prog,
+				     enum xdp_action action)
+{
+	return !!(prog->chain_call_actions & (1<<action));
+}
+
+unsigned int xdp_program__run_prio(struct xdp_program *prog)
+{
+	return prog->run_prio;
+}
+
+void xdp_program__set_run_prio(struct xdp_program *prog, unsigned int run_prio)
+{
+	/* FIXME: Should this also update the BTF info? */
+	prog->run_prio = run_prio;
+}
+
+const char *xdp_program__name(struct xdp_program *prog)
+{
+	return prog->prog_name;
+}
+
+int xdp_program__print_chain_call_actions(struct xdp_program *prog,
+					  char *buf,
+					  size_t buf_len)
+{
+	bool first = true;
+	char *pos = buf;
+	size_t len = 0;
+	int i;
+
+	for (i = 0; i <= XDP_REDIRECT; i++) {
+		if (xdp_program__chain_call_enabled(prog, i)) {
+			if (!first) {
+				*pos++ = ',';
+				buf_len--;
+			} else {
+				first = false;
+			}
+			len = snprintf(pos, buf_len, "%s",
+				       xdp_action_names[i]);
+			pos += len;
+			buf_len -= len;
+		}
+	}
+	return 0;
+}
+
+static const struct btf_type *
+skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)
+{
+	const struct btf_type *t = btf__type_by_id(btf, id);
+
+	if (res_id)
+		*res_id = id;
+
+	while (btf_is_mod(t) || btf_is_typedef(t)) {
+		if (res_id)
+			*res_id = t->type;
+		t = btf__type_by_id(btf, t->type);
+	}
+
+	return t;
+}
+
+static bool get_field_int(const char *prog_name, const struct btf *btf,
+			  const struct btf_type *def,
+			  const struct btf_member *m, __u32 *res)
+{
+	const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
+	const char *name = btf__name_by_offset(btf, m->name_off);
+	const struct btf_array *arr_info;
+	const struct btf_type *arr_t;
+
+	if (!btf_is_ptr(t)) {
+		pr_warn("prog '%s': attr '%s': expected PTR, got %u.\n",
+			prog_name, name, btf_kind(t));
+		return false;
+	}
+
+	arr_t = btf__type_by_id(btf, t->type);
+	if (!arr_t) {
+		pr_warn("prog '%s': attr '%s': type [%u] not found.\n",
+			prog_name, name, t->type);
+		return false;
+	}
+	if (!btf_is_array(arr_t)) {
+		pr_warn("prog '%s': attr '%s': expected ARRAY, got %u.\n",
+			prog_name, name, btf_kind(arr_t));
+		return false;
+	}
+	arr_info = btf_array(arr_t);
+	*res = arr_info->nelems;
+	return true;
+}
+
+static bool get_xdp_action(const char *act_name, unsigned int *act)
+{
+	const char **name = xdp_action_names;
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(xdp_action_names); i++, name++) {
+		if (!strcmp(act_name, *name)) {
+			*act = i;
+			return true;
+		}
+	}
+	return false;
+}
+
+/**
+ * This function parses the run config information attached to an XDP program.
+ *
+ * This information is specified using BTF, in a format similar to how
+ * BTF-defined maps are done. The definition looks like this:
+ *
+ * struct {
+ *	__uint(priority, 10);
+ *	__uint(XDP_PASS, 1);
+ * } XDP_RUN_CONFIG(FUNCNAME);
+ *
+ * The priority is simply an integer that will be used to sort programs as they
+ * are attached on the interface (see cmp_xdp_programs() for full sort order).
+ * In addition to the priority, the run config can define an integer value for
+ * each XDP action. A non-zero value means that execution will continue to the
+ * next loaded program if the current program returns that action. I.e., in the
+ * above example, any return value other than XDP_PASS will cause the dispatcher
+ * to exit with that return code, whereas XDP_PASS means execution will
+ * continue.
+ *
+ * Since this information becomes part of the object file BTF info, it will
+ * survive loading into the kernel, and so it can be retrieved for
+ * already-loaded programs as well.
+ */
+static int xdp_parse_run_config(struct xdp_program *xdp_prog)
+{
+	const struct btf_type *t, *var, *def, *sec = NULL;
+	struct btf *btf = xdp_program__btf(xdp_prog);
+	int err, nr_types, i, j, vlen, mlen;
+	const struct btf_var_secinfo *vi;
+	const struct btf_var *var_extra;
+	const struct btf_member *m;
+	char struct_name[100];
+	const char *name;
+
+	if (!btf) {
+		pr_debug("No BTF found for program\n");
+		return -ENOENT;
+	}
+
+	err = check_snprintf(struct_name, sizeof(struct_name), "_%s",
+			     xdp_program__name(xdp_prog));
+	if (err)
+		return err;
+
+	nr_types = btf__get_nr_types(btf);
+	for (i = 1; i <= nr_types; i++) {
+		t = btf__type_by_id(btf, i);
+		if (!btf_is_datasec(t))
+			continue;
+		name = btf__name_by_offset(btf, t->name_off);
+		if (strcmp(name, XDP_RUN_CONFIG_SEC) == 0) {
+			sec = t;
+			break;
+		}
+	}
+
+	if (!sec) {
+		pr_debug("DATASEC '%s' not found.\n", XDP_RUN_CONFIG_SEC);
+		return -ENOENT;
+	}
+
+	vlen = btf_vlen(sec);
+	vi = btf_var_secinfos(sec);
+	for (i = 0; i < vlen; i++, vi++) {
+		var = btf__type_by_id(btf, vi->type);
+		var_extra = btf_var(var);
+		name = btf__name_by_offset(btf, var->name_off);
+
+		if (strcmp(name, struct_name))
+			continue;
+
+		if (!btf_is_var(var)) {
+			pr_warn("struct '%s': unexpected var kind %u.\n",
+				name, btf_kind(var));
+			return -EINVAL;
+		}
+		if (var_extra->linkage != BTF_VAR_GLOBAL_ALLOCATED &&
+		    var_extra->linkage != BTF_VAR_STATIC) {
+			pr_warn("struct '%s': unsupported var linkage %u.\n",
+				name, var_extra->linkage);
+			return -EOPNOTSUPP;
+		}
+
+		def = skip_mods_and_typedefs(btf, var->type, NULL);
+		if (!btf_is_struct(def)) {
+			pr_warn("struct '%s': unexpected def kind %u.\n",
+				name, btf_kind(var));
+			return -EINVAL;
+		}
+		if (def->size > vi->size) {
+			pr_warn("struct '%s': invalid def size.\n", name);
+			return -EINVAL;
+		}
+
+		mlen = btf_vlen(def);
+		m = btf_members(def);
+		for (j = 0; j < mlen; j++, m++) {
+			const char *mname = btf__name_by_offset(btf, m->name_off);
+			unsigned int val, act;
+
+			if (!mname) {
+				pr_warn("struct '%s': invalid field #%d.\n", name, j);
+				return -EINVAL;
+			}
+			if (!strcmp(mname, "priority")) {
+				if (!get_field_int(struct_name, btf, def, m,
+						   &xdp_prog->run_prio))
+					return -EINVAL;
+				continue;
+			} else if (get_xdp_action(mname, &act)) {
+				if (!get_field_int(struct_name, btf, def, m,
+						   &val))
+					return -EINVAL;
+				xdp_program__set_chain_call_enabled(xdp_prog, act, val);
+			} else {
+				pr_warn("Invalid mname: %s\n", mname);
+				return -ENOTSUP;
+			}
+		}
+		return 0;
+	}
+
+	pr_debug("Couldn't find run order struct %s\n", struct_name);
+	return -ENOENT;
+}
+
+static struct xdp_program *xdp_program__new(void)
+{
+	struct xdp_program *xdp_prog;
+
+	xdp_prog = malloc(sizeof(*xdp_prog));
+	if (!xdp_prog)
+		return ERR_PTR(-ENOMEM);
+
+	memset(xdp_prog, 0, sizeof(*xdp_prog));
+
+	xdp_prog->prog_fd = -1;
+	xdp_prog->link_fd = -1;
+	xdp_prog->run_prio = XDP_DEFAULT_RUN_PRIO;
+	xdp_prog->chain_call_actions = XDP_DEFAULT_CHAIN_CALL_ACTIONS;
+
+	return xdp_prog;
+}
+
+void xdp_program__free(struct xdp_program *xdp_prog)
+{
+	if (xdp_prog->link_fd >= 0)
+		close(xdp_prog->link_fd);
+	if (xdp_prog->prog_fd >= 0)
+		close(xdp_prog->prog_fd);
+
+	if (!xdp_prog->from_external_obj) {
+		if (xdp_prog->bpf_obj)
+			bpf_object__close(xdp_prog->bpf_obj);
+		else if (xdp_prog->btf)
+			btf__free(xdp_prog->btf);
+	}
+}
+
+static int xdp_program__fill_from_obj(struct xdp_program *xdp_prog,
+				      struct bpf_object *obj,
+				      const char *prog_name,
+				      bool external)
+{
+	struct bpf_program *bpf_prog;
+	int err;
+
+	if (prog_name)
+		bpf_prog = bpf_object__find_program_by_title(obj, prog_name);
+	else
+		bpf_prog = bpf_program__next(NULL, obj);
+
+	if (!bpf_prog)
+		return -ENOENT;
+
+	xdp_prog->bpf_prog = bpf_prog;
+	xdp_prog->bpf_obj = obj;
+	xdp_prog->btf = bpf_object__btf(obj);
+	xdp_prog->from_external_obj = external;
+	xdp_prog->prog_name = bpf_program__name(bpf_prog);
+
+	err = xdp_parse_run_config(xdp_prog);
+	if (err && err != -ENOENT)
+		return err;
+
+	return 0;
+}
+
+struct xdp_program *xdp_program__from_bpf_obj(struct bpf_object *obj,
+					      const char *prog_name)
+{
+	struct xdp_program *xdp_prog;
+	int err;
+
+	xdp_prog = xdp_program__new();
+	if (IS_ERR(xdp_prog))
+		return xdp_prog;
+
+	err = xdp_program__fill_from_obj(xdp_prog, obj, prog_name, true);
+	if (err)
+		goto err;
+
+	return xdp_prog;
+err:
+	xdp_program__free(xdp_prog);
+	return ERR_PTR(err);
+}
+
+struct xdp_program *xdp_program__open_file(const char *filename,
+					   const char *prog_name,
+					   struct bpf_object_open_opts *opts)
+{
+	struct xdp_program *xdp_prog;
+	struct bpf_object *obj;
+	int err;
+
+	obj = bpf_object__open_file(filename, opts);
+	err = libbpf_get_error(obj);
+	if (err)
+		return ERR_PTR(err);
+
+	xdp_prog = xdp_program__new();
+	if (IS_ERR(xdp_prog)) {
+		bpf_object__close(obj);
+		return xdp_prog;
+	}
+
+	err = xdp_program__fill_from_obj(xdp_prog, obj, prog_name, false);
+	if (err)
+		goto err;
+
+	return xdp_prog;
+err:
+	xdp_program__free(xdp_prog);
+	bpf_object__close(obj);
+	return ERR_PTR(err);
+}
+
+static int xdp_program__fill_from_fd(struct xdp_program *xdp_prog, int fd)
+{
+	struct bpf_prog_info info = {};
+	__u32 len = sizeof(info);
+	struct btf *btf = NULL;
+	int err = 0;
+
+	err = bpf_obj_get_info_by_fd(fd, &info, &len);
+	if (err) {
+		pr_warn("couldn't get program info: %s", strerror(errno));
+		err = -errno;
+		goto err;
+	}
+
+	if (!xdp_prog->prog_name) {
+		xdp_prog->prog_name = strdup(info.name);
+		if (!xdp_prog->prog_name) {
+			err = -ENOMEM;
+			pr_warn("failed to strdup program title");
+			goto err;
+		}
+	}
+
+	if (info.btf_id && !xdp_prog->btf) {
+		err = btf__get_from_id(info.btf_id, &btf);
+		if (err) {
+			pr_warn("Couldn't get BTF for ID %u\n", info.btf_id);
+			goto err;
+		}
+		xdp_prog->btf = btf;
+	}
+
+	memcpy(xdp_prog->prog_tag, info.tag, BPF_TAG_SIZE);
+	xdp_prog->load_time = info.load_time;
+	xdp_prog->prog_fd = fd;
+
+	return 0;
+err:
+	btf__free(btf);
+	return err;
+}
+
+struct xdp_program *xdp_program__from_id(__u32 id)
+{
+	struct xdp_program *xdp_prog = NULL;
+	int fd, err;
+
+	fd = bpf_prog_get_fd_by_id(id);
+	if (fd < 0) {
+		pr_warn("couldn't get program fd: %s", strerror(errno));
+		err = -errno;
+		goto err;
+	}
+
+	xdp_prog = xdp_program__new();
+	if (IS_ERR(xdp_prog))
+		return xdp_prog;
+
+	err = xdp_program__fill_from_fd(xdp_prog, fd);
+	if (err)
+		goto err;
+
+	err = xdp_parse_run_config(xdp_prog);
+	if (err && err != -ENOENT)
+		goto err;
+
+	return xdp_prog;
+err:
+	free(xdp_prog);
+	return ERR_PTR(err);
+}
+
+static int cmp_xdp_programs(const void *_a, const void *_b)
+{
+	const struct xdp_program *a = *(struct xdp_program * const *)_a;
+	const struct xdp_program *b = *(struct xdp_program * const *)_b;
+	int cmp;
+
+	if (a->run_prio != b->run_prio)
+		return a->run_prio < b->run_prio ? -1 : 1;
+
+	cmp = strcmp(a->prog_name, b->prog_name);
+	if (cmp)
+		return cmp;
+
+	/* Hopefully the two checks above will resolve most comparisons; in
+	 * cases where they don't, hopefully the checks below will keep the
+	 * order stable.
+	 */
+
+	/* loaded before non-loaded */
+	if (a->prog_fd >= 0 && b->prog_fd < 0)
+		return -1;
+	else if (a->prog_fd < 0 && b->prog_fd >= 0)
+		return 1;
+
+	/* two unloaded programs - compare by size */
+	if (a->bpf_prog && b->bpf_prog) {
+		size_t size_a, size_b;
+
+		size_a = bpf_program__size(a->bpf_prog);
+		size_b = bpf_program__size(b->bpf_prog);
+		if (size_a != size_b)
+			return size_a < size_b ? -1 : 1;
+	}
+
+	cmp = memcmp(a->prog_tag, b->prog_tag, BPF_TAG_SIZE);
+	if (cmp)
+		return cmp;
+
+	/* at this point we are really grasping for straws */
+	if (a->load_time != b->load_time)
+		return a->load_time < b->load_time ? -1 : 1;
+
+	return 0;
+}
+
+int xdp_program__get_from_ifindex(int ifindex, struct xdp_program **progs,
+				  size_t *num_progs)
+{
+	struct xdp_link_info xinfo = {};
+	struct xdp_program *p;
+	__u32 prog_id;
+	int err;
+
+	err = bpf_get_link_xdp_info(ifindex, &xinfo, sizeof(xinfo), 0);
+	if (err)
+		return err;
+
+	if (xinfo.attach_mode == XDP_ATTACHED_SKB)
+		prog_id = xinfo.skb_prog_id;
+	else
+		prog_id = xinfo.drv_prog_id;
+
+	if (!prog_id)
+		return -ENOENT;
+
+	p = xdp_program__from_id(prog_id);
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+
+	/* FIXME: This should figure out whether the loaded program is a
+	 * dispatcher, and if it is, go find the component programs and return
+	 * those instead.
+	 */
+	progs[0] = p;
+	*num_progs = 1;
+
+	return 0;
+}
+
+int xdp_program__load(struct xdp_program *prog)
+{
+	int err;
+
+	if (prog->prog_fd >= 0)
+		return -EEXIST;
+
+	if (!prog->bpf_obj)
+		return -EINVAL;
+
+	err = bpf_object__load(prog->bpf_obj);
+	if (err)
+		return err;
+
+	pr_debug("Loaded XDP program %s, got fd %d\n",
+		 xdp_program__name(prog), bpf_program__fd(prog->bpf_prog));
+
+	return xdp_program__fill_from_fd(prog, bpf_program__fd(prog->bpf_prog));
+}
+
+int gen_xdp_multiprog(struct xdp_program **progs, size_t num_progs)
+{
+	struct bpf_program *dispatcher_prog;
+	struct bpf_map *prog_config_map;
+	int fd, prog_fd, err, i, lfd;
+	struct xdp_dispatcher_config *rodata;
+	struct xdp_program *prog;
+	struct bpf_object *obj;
+	struct stat sb = {};
+	char buf[PATH_MAX];
+	void *ptr = NULL;
+	size_t sz = 0;
+
+	/* The only way libbpf exposes access to the rodata section is through
+	 * the skeleton API. We need to modify it before loading a program,
+	 * hence all this boilerplate code, until we can fix libbpf to just
+	 * expose map->mmaped directly.
+	 */
+	struct bpf_prog_skeleton prog_skel = {
+		.name = "xdp_main",
+		.prog = &dispatcher_prog
+	};
+	struct bpf_map_skeleton map_skel = {
+		.name = "xdp_disp.rodata",
+		.map = &prog_config_map,
+		.mmaped = (void **)&rodata
+	};
+
+	struct bpf_object_skeleton s = {
+		.sz = sizeof(s),
+		.name = "xdp_dispatcher",
+		.obj = &obj,
+		.map_cnt = 1,
+		.map_skel_sz = sizeof(map_skel),
+		.maps = &map_skel,
+		.prog_cnt = 1,
+		.prog_skel_sz = sizeof(prog_skel),
+		.progs = &prog_skel
+	};
+
+	pr_debug("Generating multi-prog dispatcher for %zu programs\n", num_progs);
+
+	err = find_bpf_file(buf, sizeof(buf), "xdp-dispatcher.o");
+	if (err)
+		return err;
+
+	fd = open(buf, O_RDONLY);
+	if (fd < 0) {
+		err = -errno;
+		pr_warn("Couldn't open BPF file %s\n", buf);
+		return err;
+	}
+
+	err = fstat(fd, &sb);
+	if (err)
+		goto err;
+	sz = sb.st_size;
+
+	ptr = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	if (ptr == MAP_FAILED) {
+		err = -errno;
+		ptr = NULL;
+		pr_warn("Couldn't mmap BPF file\n");
+		goto err;
+	}
+	s.data = ptr;
+	s.data_sz = sz;
+
+	err = bpf_object__open_skeleton(&s, NULL);
+	if (err)
+		goto err;
+
+	munmap(ptr, sz);
+	ptr = NULL;
+	close(fd);
+	fd = -1;
+
+	if (!rodata) {
+		pr_warn("No mmaped rodata map area\n");
+		err = -EINVAL;
+		goto err;
+	}
+
+	/* set dispatcher parameters before loading */
+	rodata->num_progs_enabled = num_progs;
+	for (i = 0; i < num_progs; i++)
+		rodata->chain_call_actions[i] = progs[i]->chain_call_actions;
+
+	err = bpf_object__load(obj);
+	if (err)
+		goto err;
+
+	prog_fd = bpf_program__fd(dispatcher_prog);
+	for (i = 0; i < num_progs; i++) {
+		prog = progs[i];
+		err = check_snprintf(buf, sizeof(buf), "prog%d", i);
+		if (err)
+			goto err_obj;
+
+		/* FIXME: The following assumes the component XDP programs are
+		 * not already loaded. We do want to be able to re-attach
+		 * already-loaded programs into a new dispatcher here; but the
+		 * kernel doesn't currently allow this. So for now, just assume
+		 * programs are not already loaded and load them as TYPE_EXT
+		 * programs.
+		 */
+
+		err = bpf_program__set_attach_target(prog->bpf_prog, prog_fd,
+						     buf);
+		if (err) {
+			pr_debug("Failed to set attach target: %s\n", strerror(-err));
+			goto err_obj;
+		}
+
+		bpf_program__set_type(prog->bpf_prog, BPF_PROG_TYPE_EXT);
+		err = xdp_program__load(prog);
+		if (err) {
+			pr_warn("Failed to load program %d ('%s'): %s\n",
+				i, xdp_program__name(prog), strerror(-err));
+			goto err_obj;
+		}
+
+		/* The attach will disappear once this fd is closed */
+		lfd = bpf_raw_tracepoint_open(NULL, prog->prog_fd);
+		if (lfd < 0) {
+			err = lfd;
+			pr_warn("Failed to attach program %d ('%s') to dispatcher: %s\n",
+				i, xdp_program__name(prog), strerror(-err));
+			goto err_obj;
+		}
+
+		pr_debug("Attached prog '%s' with priority %d in dispatcher entry '%s' with fd %d\n",
+			 xdp_program__name(prog), xdp_program__run_prio(prog),
+			 buf, lfd);
+		prog->link_fd = lfd;
+	}
+
+	return prog_fd;
+
+err_obj:
+	bpf_object__close(obj);
+err:
+	if (ptr)
+		munmap(ptr, sz);
+	if (fd > -1)
+		close(fd);
+	return err;
+}
+
+int xdp_attach_programs(struct xdp_program **progs, size_t num_progs,
+			int ifindex, bool force, enum xdp_attach_mode mode)
+{
+	int err = 0, xdp_flags = 0, prog_fd;
+
+	if (!num_progs)
+		return -EINVAL;
+
+	if (num_progs > 1) {
+		qsort(progs, num_progs, sizeof(*progs), cmp_xdp_programs);
+		prog_fd = gen_xdp_multiprog(progs, num_progs);
+	} else if (progs[0]->prog_fd >= 0) {
+		prog_fd = progs[0]->prog_fd;
+	} else {
+		prog_fd = xdp_program__load(progs[0]);
+	}
+
+	if (prog_fd < 0) {
+		err = prog_fd;
+		goto out;
+	}
+
+	pr_debug("Loading XDP fd %d onto ifindex %d\n", prog_fd, ifindex);
+
+	switch (mode) {
+	case XDP_MODE_SKB:
+		xdp_flags |= XDP_FLAGS_SKB_MODE;
+		break;
+	case XDP_MODE_NATIVE:
+		xdp_flags |= XDP_FLAGS_DRV_MODE;
+		break;
+	case XDP_MODE_HW:
+		xdp_flags |= XDP_FLAGS_HW_MODE;
+		break;
+	case XDP_MODE_UNSPEC:
+		break;
+	}
+
+	if (!force)
+		xdp_flags |= XDP_FLAGS_UPDATE_IF_NOEXIST;
+
+	err = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
+	if (err == -EEXIST && !(xdp_flags & XDP_FLAGS_UPDATE_IF_NOEXIST)) {
+		/* Program replace didn't work, probably because a program of
+		 * the opposite type is loaded. Let's unload that and try
+		 * loading again.
+		 */
+
+		__u32 old_flags = xdp_flags;
+
+		xdp_flags &= ~XDP_FLAGS_MODES;
+		xdp_flags |= (mode == XDP_MODE_SKB) ? XDP_FLAGS_DRV_MODE :
+			XDP_FLAGS_SKB_MODE;
+		err = bpf_set_link_xdp_fd(ifindex, -1, xdp_flags);
+		if (!err)
+			err = bpf_set_link_xdp_fd(ifindex, prog_fd, old_flags);
+	}
+	if (err < 0) {
+		pr_warn("Error attaching XDP program to ifindex %d: %s\n",
+			ifindex, strerror(-err));
+
+		switch (-err) {
+		case EBUSY:
+		case EEXIST:
+			pr_warn("XDP already loaded on device;"
+				" use --force to replace\n");
+			break;
+		case EOPNOTSUPP:
+			pr_warn("Native XDP not supported;"
+				" try using --skb-mode\n");
+			break;
+		default:
+			break;
+		}
+		goto out;
+	}
+
+	pr_debug("Loaded %zu programs on ifindex '%d'%s\n",
+		 num_progs, ifindex,
+		 mode == XDP_MODE_SKB ? " in skb mode" : "");
+
+	return prog_fd;
+out:
+	return err;
+}
+
+
+int xdp_program__attach(const struct xdp_program *prog,
+			int ifindex, bool replace, enum xdp_attach_mode mode)
+{
+	struct xdp_program *old_progs[10], *all_progs[10];
+	size_t num_old_progs = 10, num_progs;
+	int err, i;
+
+	/* FIXME: The idea here is that the API should allow the caller to just
+	 * attach a program; and the library will take care of finding the
+	 * already-attached programs, inserting the new one into the sequence
+	 * based on its priority, build a new dispatcher, and atomically replace
+	 * the old one. This needs a kernel API to allow re-attaching already
+	 * loaded freplace programs, as well as the ability to attach each
+	 * program to multiple places. So for now, this function doesn't really
+	 * work.
+	 */
+	err = xdp_program__get_from_ifindex(ifindex, old_progs, &num_old_progs);
+	if (err && err != -ENOENT)
+		return err;
+
+	if (replace) {
+		num_progs = 1;
+		all_progs[0] = (struct xdp_program *)prog;
+	} else {
+		for (i = 0; i < num_old_progs; i++)
+			all_progs[i] = old_progs[i];
+		num_progs = num_old_progs + 1;
+	}
+
+	err = xdp_attach_programs(all_progs, num_progs, ifindex, true, mode);
+	if (err)
+		return err;
+	return 0;
+}
diff --git a/tools/lib/xdp/libxdp.h b/tools/lib/xdp/libxdp.h
new file mode 100644
index 000000000000..2ee960ec68b9
--- /dev/null
+++ b/tools/lib/xdp/libxdp.h
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+
+/*
+ * XDP management utility functions
+ *
+ * Copyright (C) 2020 Toke Høiland-Jørgensen <toke@redhat.com>
+ */
+
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+#include "xdp_helpers.h"
+#include "util.h"
+
+struct xdp_program;
+
+struct xdp_program *xdp_program__from_bpf_obj(struct bpf_object *obj,
+					      const char *prog_name);
+struct xdp_program *xdp_program__open_file(const char *filename,
+					   const char *prog_name,
+					   struct bpf_object_open_opts *opts);
+struct xdp_program *xdp_program__from_id(__u32 prog_id);
+
+void xdp_program__free(struct xdp_program *xdp_prog);
+
+const char *xdp_program__name(struct xdp_program *xdp_prog);
+unsigned int xdp_program__run_prio(struct xdp_program *xdp_prog);
+void xdp_program__set_run_prio(struct xdp_program *xdp_prog, unsigned int run_prio);
+bool xdp_program__chain_call_enabled(struct xdp_program *xdp_prog,
+				     enum xdp_action action);
+void xdp_program__set_chain_call_enabled(struct xdp_program *prog, unsigned int action,
+                                         bool enabled);
+
+int xdp_program__print_chain_call_actions(struct xdp_program *prog,
+					  char *buf,
+					  size_t buf_len);
+
+int xdp_attach_programs(struct xdp_program **progs, size_t num_progs,
+                        int ifindex, bool force, enum xdp_attach_mode mode);
diff --git a/tools/lib/xdp/prog_dispatcher.h b/tools/lib/xdp/prog_dispatcher.h
new file mode 100644
index 000000000000..25311dc137ff
--- /dev/null
+++ b/tools/lib/xdp/prog_dispatcher.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: (GPL-2.0-or-later OR BSD-2-Clause) */
+
+#ifndef __PROG_DISPATCHER_H
+#define __PROG_DISPATCHER_H
+
+#include <linux/types.h>
+
+#ifndef MAX_DISPATCHER_ACTIONS
+#define MAX_DISPATCHER_ACTIONS 10
+#endif
+
+struct xdp_dispatcher_config {
+	__u8 num_progs_enabled;
+	__u32 chain_call_actions[MAX_DISPATCHER_ACTIONS];
+};
+
+#endif
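As a side note (not part of the patch): a loader consuming this struct encodes each program's chain-call actions as a bitmap of XDP return codes, one __u32 per dispatcher slot. A minimal userspace sketch, with the kernel type names stubbed so it stands alone, might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for linux/types.h */
typedef unsigned char __u8;
typedef unsigned int __u32;

#define MAX_DISPATCHER_ACTIONS 10

/* Mirrors struct xdp_dispatcher_config above */
struct xdp_dispatcher_config {
	__u8 num_progs_enabled;
	__u32 chain_call_actions[MAX_DISPATCHER_ACTIONS];
};

/* XDP return codes as defined in linux/bpf.h */
enum { XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };

#define XDP_DEFAULT_CHAIN_CALL_ACTIONS (1U << XDP_PASS)

/* Fill the config from per-program action bitmaps; in this sketch a 0
 * entry means "no metadata supplied", so we fall back to the default
 * of chain-calling on XDP_PASS only. */
static void fill_dispatcher_config(struct xdp_dispatcher_config *conf,
				   const __u32 *actions, size_t num_progs)
{
	size_t i;

	conf->num_progs_enabled = (__u8)num_progs;
	for (i = 0; i < num_progs && i < MAX_DISPATCHER_ACTIONS; i++)
		conf->chain_call_actions[i] =
			actions[i] ? actions[i]
				   : XDP_DEFAULT_CHAIN_CALL_ACTIONS;
}
```

fill_dispatcher_config() is hypothetical; the real library sets these values in the dispatcher object's .rodata before load.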
diff --git a/tools/lib/xdp/xdp-dispatcher.c b/tools/lib/xdp/xdp-dispatcher.c
new file mode 100644
index 000000000000..0d965c48bdb8
--- /dev/null
+++ b/tools/lib/xdp/xdp-dispatcher.c
@@ -0,0 +1,178 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#include <xdp/prog_dispatcher.h>
+
+/* While 'const volatile' sounds a little like an oxymoron, there's reason
+ * behind the madness:
+ *
+ * - const places the data in rodata, where libbpf will mark it as read-only and
+ *   frozen on program load, letting the kernel do dead code elimination based
+ *   on the values.
+ *
+ * - volatile prevents the compiler from optimising away the checks based on the
+ *   compile-time value of the variables, which is important since we will be
+ *   changing the values before loading the program into the kernel.
+ */
+static volatile const struct xdp_dispatcher_config conf = {};
+
+/* The volatile return value prevents the compiler from assuming it knows the
+ * return value and optimising based on that.
+ */
+__attribute__ ((noinline))
+int prog0(struct xdp_md *ctx) {
+        volatile int ret = XDP_PASS;
+
+        if (!ctx)
+                return XDP_ABORTED;
+        return ret;
+}
+__attribute__ ((noinline))
+int prog1(struct xdp_md *ctx) {
+        volatile int ret = XDP_PASS;
+
+        if (!ctx)
+                return XDP_ABORTED;
+        return ret;
+}
+__attribute__ ((noinline))
+int prog2(struct xdp_md *ctx) {
+        volatile int ret = XDP_PASS;
+
+        if (!ctx)
+                return XDP_ABORTED;
+        return ret;
+}
+__attribute__ ((noinline))
+int prog3(struct xdp_md *ctx) {
+        volatile int ret = XDP_PASS;
+
+        if (!ctx)
+                return XDP_ABORTED;
+        return ret;
+}
+__attribute__ ((noinline))
+int prog4(struct xdp_md *ctx) {
+        volatile int ret = XDP_PASS;
+
+        if (!ctx)
+                return XDP_ABORTED;
+        return ret;
+}
+__attribute__ ((noinline))
+int prog5(struct xdp_md *ctx) {
+        volatile int ret = XDP_PASS;
+
+        if (!ctx)
+                return XDP_ABORTED;
+        return ret;
+}
+__attribute__ ((noinline))
+int prog6(struct xdp_md *ctx) {
+        volatile int ret = XDP_PASS;
+
+        if (!ctx)
+                return XDP_ABORTED;
+        return ret;
+}
+__attribute__ ((noinline))
+int prog7(struct xdp_md *ctx) {
+        volatile int ret = XDP_PASS;
+
+        if (!ctx)
+                return XDP_ABORTED;
+        return ret;
+}
+__attribute__ ((noinline))
+int prog8(struct xdp_md *ctx) {
+        volatile int ret = XDP_PASS;
+
+        if (!ctx)
+                return XDP_ABORTED;
+        return ret;
+}
+__attribute__ ((noinline))
+int prog9(struct xdp_md *ctx) {
+        volatile int ret = XDP_PASS;
+
+        if (!ctx)
+                return XDP_ABORTED;
+        return ret;
+}
+
+
+SEC("xdp_dispatcher")
+int xdp_main(struct xdp_md *ctx)
+{
+        __u8 num_progs_enabled = conf.num_progs_enabled;
+        int ret;
+
+        if (num_progs_enabled < 1)
+                goto out;
+        ret = prog0(ctx);
+        if (!((1 << ret) & conf.chain_call_actions[0]))
+                return ret;
+
+        if (num_progs_enabled < 2)
+                goto out;
+        ret = prog1(ctx);
+        if (!((1 << ret) & conf.chain_call_actions[1]))
+                return ret;
+
+        if (num_progs_enabled < 3)
+                goto out;
+        ret = prog2(ctx);
+        if (!((1 << ret) & conf.chain_call_actions[2]))
+                return ret;
+
+        if (num_progs_enabled < 4)
+                goto out;
+        ret = prog3(ctx);
+        if (!((1 << ret) & conf.chain_call_actions[3]))
+                return ret;
+
+        if (num_progs_enabled < 5)
+                goto out;
+        ret = prog4(ctx);
+        if (!((1 << ret) & conf.chain_call_actions[4]))
+                return ret;
+
+        if (num_progs_enabled < 6)
+                goto out;
+        ret = prog5(ctx);
+        if (!((1 << ret) & conf.chain_call_actions[5]))
+                return ret;
+
+        if (num_progs_enabled < 7)
+                goto out;
+        ret = prog6(ctx);
+        if (!((1 << ret) & conf.chain_call_actions[6]))
+                return ret;
+
+        if (num_progs_enabled < 8)
+                goto out;
+        ret = prog7(ctx);
+        if (!((1 << ret) & conf.chain_call_actions[7]))
+                return ret;
+
+        if (num_progs_enabled < 9)
+                goto out;
+        ret = prog8(ctx);
+        if (!((1 << ret) & conf.chain_call_actions[8]))
+                return ret;
+
+        if (num_progs_enabled < 10)
+                goto out;
+        ret = prog9(ctx);
+        if (!((1 << ret) & conf.chain_call_actions[9]))
+                return ret;
+
+out:
+        return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
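The per-slot test `!((1 << ret) & conf.chain_call_actions[i])` above treats each config word as a bitmap over XDP return codes. Extracted into plain C (for illustration only, outside the BPF context), the decision is:

```c
#include <assert.h>

/* XDP return codes as defined in linux/bpf.h */
enum { XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT };

/* Return non-zero if the chain should continue to the next program:
 * the bit corresponding to the action `ret` must be set in the slot's
 * chain-call bitmap. */
static int chain_continues(int ret, unsigned int chain_call_actions)
{
	return !!((1U << ret) & chain_call_actions);
}
```

With the default bitmap of (1 << XDP_PASS), the chain continues only on XDP_PASS; any other return code short-circuits out of the dispatcher.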
diff --git a/tools/lib/xdp/xdp_helpers.h b/tools/lib/xdp/xdp_helpers.h
new file mode 100644
index 000000000000..ec295367a8a0
--- /dev/null
+++ b/tools/lib/xdp/xdp_helpers.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: (GPL-2.0-or-later OR BSD-2-Clause) */
+
+#ifndef __XDP_HELPERS_H
+#define __XDP_HELPERS_H
+
+#define _CONCAT(x,y) x ## y
+#define XDP_RUN_CONFIG(f) _CONCAT(_,f) SEC(".xdp_run_config")
+
+#define XDP_DEFAULT_RUN_PRIO 50
+#define XDP_DEFAULT_CHAIN_CALL_ACTIONS (1<<XDP_PASS)
+
+#endif
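For illustration, a concrete use of XDP_RUN_CONFIG() for a hypothetical XDP program function my_filter could look like the sketch below. SEC() and __uint() are stubbed here to mimic their libbpf bpf_helpers.h definitions so the snippet compiles stand-alone; in a real XDP object they come from libbpf, and the member names and values end up in BTF, which is how libxdp reads them back:

```c
#include <assert.h>

/* Stubs mimicking libbpf's bpf_helpers.h (assumes a Linux/ELF target) */
#define SEC(name) __attribute__((section(name), used))
#define __uint(name, val) int (*name)[val]

#define _CONCAT(x, y) x ## y
#define XDP_RUN_CONFIG(f) _CONCAT(_, f) SEC(".xdp_run_config")

/* Run config for the hypothetical program function my_filter: run at
 * priority 10, chain call on both XDP_PASS and XDP_DROP. */
struct {
	__uint(priority, 10);
	__uint(XDP_PASS, 1);
	__uint(XDP_DROP, 1);
} XDP_RUN_CONFIG(my_filter);
```

The macro expands to a variable named _my_filter placed in the .xdp_run_config section, tying the metadata to the program function by name.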


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface
  2020-02-28 14:22 [PATCH RFC] Userspace library for handling multiple XDP programs on an interface Toke Høiland-Jørgensen
  2020-02-28 14:22 ` [PATCH RFC] libxdp: Add libxdp (FOR COMMENT ONLY) Toke Høiland-Jørgensen
@ 2020-02-28 22:15 ` Andrey Ignatov
  2020-02-29 10:36   ` Toke Høiland-Jørgensen
  2020-02-28 23:21 ` Andrii Nakryiko
  2 siblings, 1 reply; 11+ messages in thread
From: Andrey Ignatov @ 2020-02-28 22:15 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Lorenz Bauer, netdev,
	bpf, ctakshak

Toke Høiland-Jørgensen <toke@redhat.com> [Fri, 2020-02-28 06:22 -0800]:
> Hi everyone
> 
> As most of you are no doubt aware, we've had various discussions on how
> to handle multiple XDP programs on a single interface. With the freplace
> functionality, the kernel infrastructure is now there to handle this
> (almost, see "missing pieces" below).
> 
> While the freplace mechanism offers userspace a lot of flexibility in
> how to handle dispatching of multiple XDP programs, userspace also has
> to do quite a bit of work to implement this (compared to just issuing
> load+attach). The goal of this email is to get some feedback on a
> library to implement this, in the hope that we can converge on something
> that will be widely applicable, ensuring that both (a) everyone doesn't
> have to reinvent the wheel, and (b) we don't end up with a proliferation
> of subtly incompatible dispatcher models that makes it hard or
> impossible to mix and match XDP programs from multiple sources.
> 
> My proposal for the beginnings of such a library is in the xdp-tools repository
> on Github, in the 'xdp-multi-prog' branch.
> 
> To clone and compile simply do this:
> 
> $ git clone --recurse-submodules -b xdp-multi-prog https://github.com/xdp-project/xdp-tools
> $ cd xdp-tools && ./configure && make
> 
> See lib/libxdp/libxdp.c for the library implementation, and xdp-loader/ for a
> command-line loader that supports loading multiple programs in one go using the
> dispatcher (just supply it multiple filenames on the command line). There are
> still some missing bits, marked with FIXME comments in the code, and discussed
> below.
> 
> I'm also including libxdp as a patch in the next email, but only to facilitate
> easy commenting on the code; use the version of Github if you actually want to
> compile and play with it.
> 
> The overall goal of the library is to *make the simple case easy* but retain
> enough flexibility for custom applications to specify their own load order etc
> where needed. The "simple case" here being to just load one or more XDP programs
> onto an interface while retaining any programs that may already be loaded there.
> 
> **** Program metadata
> 
> To do this, I propose two pieces of metadata that an XDP program can specify for
> itself, which will serve as defaults to guide the loading:
> 
> - Its *run priority*: This is simply an integer priority number that will be
>   used to sort programs when building the dispatcher. The inspiration is
>   old-style rc init scripts, where daemons are started in numerical order on
>   boot (e.g., /etc/rc.d/S50sshd). My hope here is that we can establish a
>   convention around ranges of priorities that make sense for different types of
>   programs; e.g., packet filters would use low priorities, and programs that
>   want to monitor the traffic on the host will use high priorities, etc.
> 
> - Its *chain call actions*: These are the return codes for which the next
>   program should be called. The idea here is that a program can indicate which
>   actions it makes sense to continue operating on; the default is just XDP_PASS,
>   and I expect this would be the most common case.
> 
> The metadata is specified using BTF, using a syntax similar to BTF-defined maps,
> i.e.:
> 
> struct {
> 	__uint(priority, 10);
> 	__uint(XDP_PASS, 1); // chain call on XDP_PASS...
> 	__uint(XDP_DROP, 1);  // ...and on XDP_DROP
> } XDP_RUN_CONFIG(FUNCNAME);
> 
> (where FUNCNAME is the function name of the XDP program this config refers to).
> 
> Using BTF for this ensures that the metadata stays with the program in the
> object file. And because this becomes part of the object BTF, it will be loaded
> into the kernel and is thus also retrievable for loaded programs.
> 
> The libxdp loader will use the run priority to sort XDP programs before loading,
> and it will use the chain call actions to configure the dispatcher program. Note
> that the values defined in the metadata only serve as a default, though; the user
> should be able to override the values on load to sort programs in an arbitrary
> order.
> 
> **** The dispatcher program
> The dispatcher program is a simple XDP program that is generated from a template
> to just implement a series of dispatch statements like these:
> 
>         if (num_progs_enabled < 1)
>                 goto out;
>         ret = prog0(ctx);
>         if (!((1 << ret) & conf.chain_call_actions[0]))
>                 return ret;
> 
>         if (num_progs_enabled < 2)
>                 goto out;
>         ret = prog1(ctx);
>         if (!((1 << ret) & conf.chain_call_actions[1]))
>                 return ret;
> 
>         [...]
> 
> The num_progs_enabled and conf.chain_call_actions variables are static const
> global variables, which means that the compiler will put them into the .rodata
> section, allowing the kernel to perform dead code elimination if the
> num_progs_enabled check fails. libxdp will set the values based on the program
> metadata before loading the dispatcher, then use freplace to put the actual
> component programs into the placeholders specified by prog0, prog1, etc.
> 
> The dispatcher program makes liberal use of variables marked as 'volatile' to
> prevent the compiler from optimising out the checks and calls to the dummy
> functions.
> 
> **** Missing pieces
> While the libxdp code can assemble a basic dispatcher and load it into the
> kernel, there are a couple of missing pieces on the kernel side; I will propose
> patches to fix these, but figured there was no reason to hold back posting of
> the library for comments because of this. These missing pieces are:
> 
> - There is currently no way to persist the freplace after the program exits; the
>   file descriptor returned by bpf_raw_tracepoint_open() will release the program
>   when it is closed, and it cannot be pinned.
> 
> - There is no way to re-attach an already loaded program to another function;
>   this is needed for updating the call sequence: When a new program is loaded,
>   libxdp should get the existing list of component programs on the interface and
>   insert the new one into the chain in the appropriate place. To do this it
>   needs to build a new dispatcher and reattach all the old programs to it.
>   Ideally, this should be doable without detaching them from the old dispatcher;
>   that way, we can build the new dispatcher completely, and atomically replace
>   it on the interface by the usual XDP attach mechanism.

Hey Toke,

Thanks for posting it. I'm still reading the library on github and may
have more thoughts later, but wanted to provide early feedback on one
high level thing.

The main challenge I see for applying this approach in the fb case is the
need to recreate the dispatcher every time a new program has to be
added.

Imagine there are a few containers and every container wants to run an
application that attaches an XDP program to the "dispatcher" via
freplace. Every application may have a "priority" reserved for it, but
recreating the dispatcher may have a race condition, for example:

- dispatcher exists in some stable state with programs p1, p2 attached
  to it;

- app A starts and gets the set of progs currently attached to
  dispatcher: p1, p2 (yeah, this kernel API doesn't exist, as you
  mentioned, but even if it's added ..);

- app A adds another program pA to the list and generates dispatcher
  with the new set: p1, p2, pA;

- app B starts, gets same list from dispatcher: p1, p2 and adds its own
  pB program and generates new dispatcher.

- now the end result depends on the order in which app A and app B
  attach their corresponding new dispatchers to the interface, but in
  any case the final dispatcher will be missing either pA or pB.

Yes, some central coordination mechanism may be added to "lock" the lib
while recreating the dispatcher, but it may not be container-environment
friendly.
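The lost-update race described above can be sketched in a few lines of C, with the interface state modelled as a plain struct (illustration only; attach() stands in for the XDP attach call, which unconditionally replaces the whole dispatcher):

```c
#include <assert.h>
#include <string.h>

#define MAX_PROGS 10

/* Model of the programs attached to one interface's dispatcher */
struct iface {
	int num_progs;
	const char *progs[MAX_PROGS];
};

/* Stand-in for the XDP attach call: unconditionally replaces the
 * whole dispatcher, which is what makes the race possible. */
static void attach(struct iface *ifc, const struct iface *replacement)
{
	*ifc = *replacement;
}

/* Each app works on its own snapshot of the attached program list */
static void add_prog(struct iface *snapshot, const char *prog)
{
	snapshot->progs[snapshot->num_progs++] = prog;
}

static int has_prog(const struct iface *ifc, const char *prog)
{
	int i;

	for (i = 0; i < ifc->num_progs; i++)
		if (!strcmp(ifc->progs[i], prog))
			return 1;
	return 0;
}
```

If app A and app B both snapshot {p1, p2}, add their own program, and attach, whichever attach lands last wins and the other app's program is silently dropped.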

Also I see at least one other way to do it w/o regenerating dispatcher
every time:

It can be created and attached once with a "big enough" number of slots,
for example 100, and programs may use their corresponding slots to
freplace w/o regenerating the dispatcher. Having that many no-op slots
should not be a big deal from what I understand, and the kernel can
optimize it.

Sure, chaining decisions may be different but it may be abstracted to a
separate freplace-able step that may get the action (XDP_PASS,
XDP_DROP, etc) from the XDP program, pass it to the next,
decision-making function that by default proceeds only if previous
program returned XDP_PASS, but user can freplace it with whatever logic
is needed.

And yeah, regenerating the dispatcher would need a new kernel API to be
added to learn what ext-programs are currently attached to the
dispatcher (though I don't know how big a deal that is).

This is the main thing so far; I'll likely provide more feedback when I
have some more time to read the code ..

> ---
> 
> Toke Høiland-Jørgensen (1):
>       libxdp: Add libxdp (FOR COMMENT ONLY)
> 
> 
>  tools/lib/xdp/libxdp.c          |  856 +++++++++++++++++++++++++++++++++++++++
>  tools/lib/xdp/libxdp.h          |   38 ++
>  tools/lib/xdp/prog_dispatcher.h |   17 +
>  tools/lib/xdp/xdp-dispatcher.c  |  178 ++++++++
>  tools/lib/xdp/xdp_helpers.h     |   12 +
>  5 files changed, 1101 insertions(+)
>  create mode 100644 tools/lib/xdp/libxdp.c
>  create mode 100644 tools/lib/xdp/libxdp.h
>  create mode 100644 tools/lib/xdp/prog_dispatcher.h
>  create mode 100644 tools/lib/xdp/xdp-dispatcher.c
>  create mode 100644 tools/lib/xdp/xdp_helpers.h
> 

-- 
Andrey Ignatov


* Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface
  2020-02-28 14:22 [PATCH RFC] Userspace library for handling multiple XDP programs on an interface Toke Høiland-Jørgensen
  2020-02-28 14:22 ` [PATCH RFC] libxdp: Add libxdp (FOR COMMENT ONLY) Toke Høiland-Jørgensen
  2020-02-28 22:15 ` [PATCH RFC] Userspace library for handling multiple XDP programs on an interface Andrey Ignatov
@ 2020-02-28 23:21 ` Andrii Nakryiko
  2020-02-29 10:37   ` Toke Høiland-Jørgensen
  2 siblings, 1 reply; 11+ messages in thread
From: Andrii Nakryiko @ 2020-02-28 23:21 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Lorenz Bauer,
	Andrey Ignatov, Networking, bpf

On Fri, Feb 28, 2020 at 6:22 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Hi everyone
>
> As most of you are no doubt aware, we've had various discussions on how
> to handle multiple XDP programs on a single interface. With the freplace
> functionality, the kernel infrastructure is now there to handle this
> (almost, see "missing pieces" below).
>
> While the freplace mechanism offers userspace a lot of flexibility in
> how to handle dispatching of multiple XDP programs, userspace also has
> to do quite a bit of work to implement this (compared to just issuing
> load+attach). The goal of this email is to get some feedback on a
> library to implement this, in the hope that we can converge on something
> that will be widely applicable, ensuring that both (a) everyone doesn't
> have to reinvent the wheel, and (b) we don't end up with a proliferation
> of subtly incompatible dispatcher models that makes it hard or
> impossible to mix and match XDP programs from multiple sources.
>

[...]

>
> **** Missing pieces
> While the libxdp code can assemble a basic dispatcher and load it into the
> kernel, there are a couple of missing pieces on the kernel side; I will propose
> patches to fix these, but figured there was no reason to hold back posting of
> the library for comments because of this. These missing pieces are:
>
> - There is currently no way to persist the freplace after the program exits; the
>   file descriptor returned by bpf_raw_tracepoint_open() will release the program
>   when it is closed, and it cannot be pinned.

This is completely addressed by patch set [0] I just posted. It will
allow you to pin freplace BPF link in BPF FS. Feel free to review and
comment there, if anything is missing.

  [0] https://patchwork.ozlabs.org/project/netdev/list/?series=161582&state=*

>
> - There is no way to re-attach an already loaded program to another function;
>   this is needed for updating the call sequence: When a new program is loaded,
>   libxdp should get the existing list of component programs on the interface and
>   insert the new one into the chain in the appropriate place. To do this it
>   needs to build a new dispatcher and reattach all the old programs to it.
>   Ideally, this should be doable without detaching them from the old dispatcher;
>   that way, we can build the new dispatcher completely, and atomically replace
>   it on the interface by the usual XDP attach mechanism.
>
> ---
>
> Toke Høiland-Jørgensen (1):
>       libxdp: Add libxdp (FOR COMMENT ONLY)
>
>
>  tools/lib/xdp/libxdp.c          |  856 +++++++++++++++++++++++++++++++++++++++
>  tools/lib/xdp/libxdp.h          |   38 ++
>  tools/lib/xdp/prog_dispatcher.h |   17 +
>  tools/lib/xdp/xdp-dispatcher.c  |  178 ++++++++
>  tools/lib/xdp/xdp_helpers.h     |   12 +
>  5 files changed, 1101 insertions(+)
>  create mode 100644 tools/lib/xdp/libxdp.c
>  create mode 100644 tools/lib/xdp/libxdp.h
>  create mode 100644 tools/lib/xdp/prog_dispatcher.h
>  create mode 100644 tools/lib/xdp/xdp-dispatcher.c
>  create mode 100644 tools/lib/xdp/xdp_helpers.h
>


* Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface
  2020-02-28 22:15 ` [PATCH RFC] Userspace library for handling multiple XDP programs on an interface Andrey Ignatov
@ 2020-02-29 10:36   ` Toke Høiland-Jørgensen
  2020-03-03  1:03     ` Andrey Ignatov
  0 siblings, 1 reply; 11+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-29 10:36 UTC (permalink / raw)
  To: Andrey Ignatov
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Lorenz Bauer, netdev,
	bpf, ctakshak

Andrey Ignatov <rdna@fb.com> writes:

> The main challenge I see for applying this approach in the fb case is the
> need to recreate the dispatcher every time a new program has to be
> added.
>
> Imagine there are a few containers and every container wants to run an
> application that attaches an XDP program to the "dispatcher" via
> freplace. Every application may have a "priority" reserved for it, but
> recreating the dispatcher may have a race condition, for example:

Yeah, I did realise this is potentially racy, but so is any loading of
XDP programs right now (i.e., two applications can both try loading a
single XDP program at the same time, and end up stomping on each others'
feet). So we'll need to solve that in any case. I've managed to come up
with two possible ways to solve this:

1. Locking: Make it possible for a process to temporarily lock the
XDP program loaded onto an interface so no other program can modify it
until the lock is released.

2. A cmpxchg operation: Add a new field to the XDP load netlink message
containing an fd of the old program that the load call is expecting to
replace. I.e., instead of attach(ifindex, prog_fd, flags), you have
attach(ifindex, prog_fd, old_fd, flags). The kernel can then check that
the old_fd matches the program currently loaded before replacing
anything, and reject the operation otherwise.
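As a sketch of how userspace would drive option (2): attach_cmpxchg() below is hypothetical, since no netlink attribute for an expected-old-program fd exists at the time of writing, and the "kernel" state is mocked as a plain variable so the retry logic can run stand-alone:

```c
#include <assert.h>
#include <errno.h>

static int kernel_prog = 10;	/* mock: id of the attached program */

/* Hypothetical cmpxchg-style attach: replace only if the caller's
 * view of the currently attached program is still current. */
static int attach_cmpxchg(int new_prog, int expected_old)
{
	if (kernel_prog != expected_old)
		return -EEXIST;
	kernel_prog = new_prog;
	return 0;
}

static int get_attached_prog(void)
{
	return kernel_prog;
}

/* Stand-in for rebuilding the dispatcher from the old program list
 * plus the new program (here just an arithmetic tag for testing). */
static int build_dispatcher(int old_prog, int new_prog)
{
	return old_prog * 100 + new_prog;
}

/* The read-rebuild-replace cycle: retry if someone else raced us. */
static int load_with_retry(int new_prog, int max_tries)
{
	while (max_tries--) {
		int old = get_attached_prog();
		int dispatcher = build_dispatcher(old, new_prog);

		if (attach_cmpxchg(dispatcher, old) == 0)
			return 0;
		/* Lost the race; re-read the state and rebuild */
	}
	return -EAGAIN;
}
```

The key property is that a concurrent update makes the attach fail cleanly instead of silently discarding the other party's programs, so the loser of the race simply re-reads and rebuilds.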

With either of these mechanisms it should be possible for userspace to
do the right thing if the kernel state changes underneath it. I'm
leaning towards (2) because I think it is simpler to implement and
doesn't require any new state be kept in the kernel. The drawback is
that it may lead to a lot of retries if many processes are trying to
load their programs at the same time. Some data would be good here: How
often do you expect programs to be loaded/unloaded in your use case?

As for your other suggestion:

> Also I see at least one other way to do it w/o regenerating dispatcher
> every time:
>
> It can be created and attached once with a "big enough" number of slots,
> for example 100, and programs may use their corresponding slots to
> freplace w/o regenerating the dispatcher. Having that many no-op slots
> should not be a big deal from what I understand, and the kernel can
> optimize it.

I thought about having the dispatcher stay around for longer, and just
replacing more function slots as new programs are added/removed. The
reason I didn't go with this is the following: Modifying the dispatcher
while it is loaded means that the modifications will apply to traffic on
the interface immediately. This is fine for simple add/remove of a
single program, but it limits which operations you can do atomically.
E.g., you can't switch the order of two programs, or add or remove more
than one, in a way that is atomic from the PoV of the traffic on the
interface.

Since I expect that we will need to support atomic operations even for
these more complex cases, that means we'll need to support rebuilding
the dispatcher anyway, and solving the race condition problem for that.
And once we've done that, the simple add/remove in the existing
dispatcher becomes just an additional code path that we'll need to
maintain, so why bother? :)

I am also not sure it's as simple as you say for the kernel to optimise
a more complex dispatcher: The current dead code elimination relies on
map data being frozen at verification time, so it's not applicable to
optimising a dispatcher as it is being changed later. Now, this could
probably be fixed and/or we could try doing clever tricks with the flow
control in the dispatcher program itself. But again, why bother if we
have to support the dispatcher rebuild mode of operation anyway?

I may have missed something, of course, so feel free to point out if you
see anything wrong with my reasoning above!

> This is the main thing so far; I'll likely provide more feedback when I
> have some more time to read the code ..

Sounds good! You're already being very helpful, so thank you! :)

-Toke



* Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface
  2020-02-28 23:21 ` Andrii Nakryiko
@ 2020-02-29 10:37   ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 11+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-29 10:37 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Lorenz Bauer,
	Andrey Ignatov, Networking, bpf

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Fri, Feb 28, 2020 at 6:22 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Hi everyone
>>
>> As most of you are no doubt aware, we've had various discussions on how
>> to handle multiple XDP programs on a single interface. With the freplace
>> functionality, the kernel infrastructure is now there to handle this
>> (almost, see "missing pieces" below).
>>
>> While the freplace mechanism offers userspace a lot of flexibility in
>> how to handle dispatching of multiple XDP programs, userspace also has
>> to do quite a bit of work to implement this (compared to just issuing
>> load+attach). The goal of this email is to get some feedback on a
>> library to implement this, in the hope that we can converge on something
>> that will be widely applicable, ensuring that both (a) everyone doesn't
>> have to reinvent the wheel, and (b) we don't end up with a proliferation
>> of subtly incompatible dispatcher models that makes it hard or
>> impossible to mix and match XDP programs from multiple sources.
>>
>
> [...]
>
>>
>> **** Missing pieces
>> While the libxdp code can assemble a basic dispatcher and load it into the
>> kernel, there are a couple of missing pieces on the kernel side; I will propose
>> patches to fix these, but figured there was no reason to hold back posting of
>> the library for comments because of this. These missing pieces are:
>>
>> - There is currently no way to persist the freplace after the program exits; the
>>   file descriptor returned by bpf_raw_tracepoint_open() will release the program
>>   when it is closed, and it cannot be pinned.
>
> This is completely addressed by patch set [0] I just posted. It will
> allow you to pin freplace BPF link in BPF FS. Feel free to review and
> comment there, if anything is missing.
>
>   [0]
>   https://patchwork.ozlabs.org/project/netdev/list/?series=161582&state=*

Ah, excellent! I'll take a look, thanks for the pointer :)

-Toke



* Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface
  2020-02-29 10:36   ` Toke Høiland-Jørgensen
@ 2020-03-03  1:03     ` Andrey Ignatov
  2020-03-03  9:50       ` Toke Høiland-Jørgensen
  2020-03-03 19:53       ` Andrii Nakryiko
  0 siblings, 2 replies; 11+ messages in thread
From: Andrey Ignatov @ 2020-03-03  1:03 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Lorenz Bauer, netdev,
	bpf, ctakshak

Toke Høiland-Jørgensen <toke@redhat.com> [Sat, 2020-02-29 02:36 -0800]:
> Andrey Ignatov <rdna@fb.com> writes:
> 
> > The main challenge I see for applying this approach in the fb case is the
> > need to recreate the dispatcher every time a new program has to be
> > added.
> >
> > Imagine there are a few containers and every container wants to run an
> > application that attaches an XDP program to the "dispatcher" via
> > freplace. Every application may have a "priority" reserved for it, but
> > recreating the dispatcher may have a race condition, for example:
> 
> Yeah, I did realise this is potentially racy, but so is any loading of
> XDP programs right now (i.e., two applications can both try loading a
> single XDP program at the same time, and end up stomping on each others'
> feet). So we'll need to solve that in any case. I've managed to come up
> with two possible ways to solve this:
> 
> 1. Locking: Make it possible for a process to temporarily lock the
> XDP program loaded onto an interface so no other program can modify it
> until the lock is released.
> 
> 2. A cmpxchg operation: Add a new field to the XDP load netlink message
> containing an fd of the old program that the load call is expecting to
> replace. I.e., instead of attach(ifindex, prog_fd, flags), you have
> attach(ifindex, prog_fd, old_fd, flags). The kernel can then check that
> the old_fd matches the program currently loaded before replacing
> anything, and reject the operation otherwise.
> 
> With either of these mechanisms it should be possible for userspace to
> do the right thing if the kernel state changes underneath it. I'm
> leaning towards (2) because I think it is simpler to implement and
> doesn't require any new state be kept in the kernel.

Yep, that will solve the race.

(2) sounds good to me; in fact I did a similar thing for cgroup-bpf in:

7dd68b3279f1 ("bpf: Support replacing cgroup-bpf program in MULTI mode")

where the user can pass replace_bpf_fd and the BPF_F_REPLACE flag, and it
guarantees that the program the user wants will be replaced, not a new
program that was attached by somebody else just a moment ago.


> The drawback is
> that it may lead to a lot of retries if many processes are trying to
> load their programs at the same time. Some data would be good here: How
> often do you expect programs to be loaded/unloaded in your use case?


In the case I mentioned it's more about having multiple applications
that may start/restart at the same time, not about frequency. It'll be a
few (one digit number) apps, which means having a few retries should be
fine, as long as "the old program doesn't exist" can be detected easily
(e.g. ENOENT should work) so we don't retry on errors that are obviously
unrelated to the race condition.


> As for your other suggestion:
> 
> > Also I see at least one other way to do it w/o regenerating dispatcher
> > every time:
> >
> > It can be created and attached once with a "big enough" number of slots,
> > for example 100, and programs may use their corresponding slots to
> > freplace w/o regenerating the dispatcher. Having that many no-op slots
> > should not be a big deal from what I understand, and the kernel can
> > optimize it.
> 
> I thought about having the dispatcher stay around for longer, and just
> replacing more function slots as new programs are added/removed. The
> reason I didn't go with this is the following: Modifying the dispatcher
> while it is loaded means that the modifications will apply to traffic on
> the interface immediately. This is fine for simple add/remove of a
> single program, but it limits which operations you can do atomically.
> E.g., you can't switch the order of two programs, or add or remove more
> than one, in a way that is atomic from the PoV of the traffic on the
> interface.

Right, simple add/remove cases are the only ones I've seen so far, since
programs are usually more or less independent and just need to be
chained properly, w/o anything like "the order of programs should be
changed atomically" or "two programs must be enabled atomically".

But okay, I can imagine that this may happen in the wild. In that case,
yes, full regeneration of the dispatcher looks like the right option.


> Since I expect that we will need to support atomic operations even for
> these more complex cases, that means we'll need to support rebuilding
> the dispatcher anyway, and solving the race condition problem for that.
> And once we've done that, the simple add/remove in the existing
> dispatcher becomes just an additional code path that we'll need to
> maintain, so why bother? :)
> 
> I am also not sure it's as simple as you say for the kernel to optimise
> a more complex dispatcher: The current dead code elimination relies on
> map data being frozen at verification time, so it's not applicable to
> optimising a dispatcher as it is being changed later. Now, this could
> probably be fixed and/or we could try doing clever tricks with the flow
> control in the dispatcher program itself. But again, why bother if we
> have to support the dispatcher rebuild mode of operation anyway?

Yeah, having the ability to regenerate the full dispatcher helps to
avoid dealing with those no-op programs.

This also kinda solves the problem of allocating positions in the list
of noop_func1, noop_func2, ..., noop_funcN: since N is limited, keeping
"enough space" between existing programs to be able to attach something
else between them in the future can be challenging in the general case.

> I may have missed something, of course, so feel free to point out if you
> see anything wrong with my reasoning above!
> 
> > This is the main thing so far, I'll likely provide more feedback when
> > have some more time to read the code ..
> 
> Sounds good! You're already being very helpful, so thank you! :)

I've spent more time reading the library, and I like the static global
data idea that allows "regenerating" the dispatcher w/o actually
recompiling it, so that it can still be compiled once and distributed to
all relevant hosts. It simplifies a bunch of the things discussed above.

But this part in the "missing pieces":

> > - There is no way to re-attach an already loaded program to another function;
> >   this is needed for updating the call sequence: When a new program is loaded,
> >   libxdp should get the existing list of component programs on the interface and
> >   insert the new one into the chain in the appropriate place. To do this it
> >   needs to build a new dispatcher and reattach all the old programs to it.
> >   Ideally, this should be doable without detaching them from the old dispatcher;
> >   that way, we can build the new dispatcher completely, and atomically replace
> >   it on the interface by the usual XDP attach mechanism.

seems to be a "must-have", including the "Ideally" section, since IMO
simply adding a new program should not interrupt what previously
attached programs are doing.

If there is a container A that attached progA to the dispatcher some
time ago, and container B then regenerates the dispatcher to add progB,
that shouldn't stop progA from being executed even for a short period of
time, since for some programs that's just a no-go (e.g. if progA is a
firewall, disabling it would mean allowing traffic that is otherwise
denied).

I'm not the one who can answer how hard it would be to support this kind
of "re-attaching" on the kernel side, and I'm curious myself. I do see,
though, that the current implementation of ext programs has a single
(prog->aux->linked_prog, prog->aux->attach_btf_id) pair.

Also it's not clear what to do with the fd returned by
bpf_tracing_prog_attach (whether it can be pinned or not). E.g. if
container A generated a dispatcher with ext progA attached to it and got
this "link" fd, but then the dispatcher was regenerated and progA was
reattached to the new dispatcher, how do we make sure that the "link" fd
is still the right one, and that cleanup will happen when a process in
container A closes the fd it has (or unpins the corresponding file in
bpffs)?


-- 
Andrey Ignatov

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface
  2020-03-03  1:03     ` Andrey Ignatov
@ 2020-03-03  9:50       ` Toke Høiland-Jørgensen
  2020-03-03 16:24         ` Alexei Starovoitov
  2020-03-03 19:53       ` Andrii Nakryiko
  1 sibling, 1 reply; 11+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-03  9:50 UTC (permalink / raw)
  To: Andrey Ignatov
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Andrii Nakryiko, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Lorenz Bauer, netdev,
	bpf, ctakshak

Andrey Ignatov <rdna@fb.com> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> [Sat, 2020-02-29 02:36 -0800]:
>> Andrey Ignatov <rdna@fb.com> writes:
>> 
>> > The main challenges I see for applying this approach in fb case is the
>> > need to recreate the dispatcher every time a new program has to be
>> > added.
>> >
>> > Imagine there there are a few containers and every container wants to
>> > run an application that attaches XDP program to the "dispatcher" via
>> > freplace. Every application may have a "priority" reserved for it, but
>> > recreating the dispatcher may have race condition, for example:
>> 
>> Yeah, I did realise this is potentially racy, but so is any loading of
>> XDP programs right now (i.e., two applications can both try loading a
>> single XDP program at the same time, and end up stomping on each others'
>> feet). So we'll need to solve that in any case. I've managed to come up
>> with two possible ways to solve this:
>> 
>> 1. Locking: Make it possible for a process to temporarily lock the
>> XDP program loaded onto an interface so no other program can modify it
>> until the lock is released.
>> 
>> 2. A cmpxchg operation: Add a new field to the XDP load netlink message
>> containing an fd of the old program that the load call is expecting to
>> replace. I.e., instead of attach(ifindex, prog_fd, flags), you have
>> attach(ifindex, prog_fd, old_fd, flags). The kernel can then check that
>> the old_fd matches the program currently loaded before replacing
>> anything, and reject the operation otherwise.
>> 
>> With either of these mechanisms it should be possible for userspace to
>> do the right thing if the kernel state changes underneath it. I'm
>> leaning towards (2) because I think it is simpler to implement and
>> doesn't require any new state be kept in the kernel.
>
> Yep, that will solve the race.
>
> (2) sounds good to me; in fact I did a similar thing for cgroup-bpf in:
>
> 7dd68b3279f1 ("bpf: Support replacing cgroup-bpf program in MULTI mode")
>
> where the user can pass replace_bpf_fd and the BPF_F_REPLACE flag, and
> it guarantees that the program the user wants to replace is the one that
> gets replaced, not a new program that was attached by somebody else just
> a moment ago.

Yes, that was exactly the API I had in mind for XDP as well. Awesome!

>> The drawback is
>> that it may lead to a lot of retries if many processes are trying to
>> load their programs at the same time. Some data would be good here: How
>> often do you expect programs to be loaded/unloaded in your use case?
>
>
> In the case I mentioned it's more about having multiple applications
> that may start/restart at the same time, not about frequency. It'll be a
> few (a single-digit number of) apps, which means having a few retries
> should be fine, as long as "the old program doesn't exist" can be
> detected easily (e.g. ENOENT should work), so that we don't retry on
> errors that are obviously unrelated to the race condition.

OK, great, let's go with that, then.

>> As for your other suggestion:
>> 
>> > Also I see at least one other way to do it w/o regenerating dispatcher
>> > every time:
>> >
>> > It can be created and attached once with "big enough" number of slots,
>> > for example with 100 and programs may use use their corresponding slot
>> > to freplace w/o regenerating the dispatcher. Having those big number of
>> > no-op slots should not be a big deal from what I understand and kernel
>> > can optimize it.
>> 
>> I thought about having the dispatcher stay around for longer, and just
>> replacing more function slots as new programs are added/removed. The
>> reason I didn't go with this is the following: Modifying the dispatcher
>> while it is loaded means that the modifications will apply to traffic on
>> the interface immediately. This is fine for simple add/remove of a
>> single program, but it limits which operations you can do atomically.
>> E.g., you can't switch the order of two programs, or add or remove more
>> than one, in a way that is atomic from the PoV of the traffic on the
>> interface.
>
> Right, simple add/remove cases are the only ones I've seen so far, since
> programs are usually more or less independent and just need to be
> chained properly, w/o anything like "the order of programs should be
> changed atomically" or "two programs must be enabled atomically".
>
> But okay, I can imagine that this may happen in the wild. In that case,
> yes, full regeneration of the dispatcher looks like the right option.

Yes, my thought exactly :)

>> Since I expect that we will need to support atomic operations even for
>> these more complex cases, that means we'll need to support rebuilding
>> the dispatcher anyway, and solving the race condition problem for that.
>> And once we've done that, the simple add/remove in the existing
>> dispatcher becomes just an additional code path that we'll need to
>> maintain, so why bother? :)
>> 
>> I am also not sure it's as simple as you say for the kernel to optimise
>> a more complex dispatcher: The current dead code elimination relies on
>> map data being frozen at verification time, so it's not applicable to
>> optimising a dispatcher as it is being changed later. Now, this could
>> probably be fixed and/or we could try doing clever tricks with the flow
>> control in the dispatcher program itself. But again, why bother if we
>> have to support the dispatcher rebuild mode of operation anyway?
>
> Yeah, having the ability to regenerate the full dispatcher helps to
> avoid dealing with those no-op programs.
>
> This also kinda solves the problem of allocating positions in the list
> of noop_func1, noop_func2, ..., noop_funcN: since N is limited, keeping
> "enough space" between existing programs to be able to attach something
> else between them in the future can be challenging in the general case.

Yeah, it strikes me as one of those things that works fine for simple
cases and then breaks down badly once you hit the limit (i.e., run out
of free slots).

>> I may have missed something, of course, so feel free to point out if you
>> see anything wrong with my reasoning above!
>> 
>> > This is the main thing so far, I'll likely provide more feedback when
>> > have some more time to read the code ..
>> 
>> Sounds good! You're already being very helpful, so thank you! :)
>
> I've spent more time reading the library, and I like the static global
> data idea that allows "regenerating" the dispatcher w/o actually
> recompiling it, so that it can still be compiled once and distributed to
> all relevant hosts. It simplifies a bunch of the things discussed above.

Yup, I was quite happy when I discovered this could be done with a
pre-compiled program + dead code elimination; makes things so much
easier! So glad you agree, and thank you for taking a look!

> But this part in the "missing pieces":
>
>> > - There is no way to re-attach an already loaded program to another function;
>> >   this is needed for updating the call sequence: When a new program is loaded,
>> >   libxdp should get the existing list of component programs on the interface and
>> >   insert the new one into the chain in the appropriate place. To do this it
>> >   needs to build a new dispatcher and reattach all the old programs to it.
>> >   Ideally, this should be doable without detaching them from the old dispatcher;
>> >   that way, we can build the new dispatcher completely, and atomically replace
>> >   it on the interface by the usual XDP attach mechanism.
>
> seems to be a "must-have", including the "Ideally" section, since IMO
> simply adding a new program should not interrupt what previously
> attached programs are doing.
>
> If there is a container A that attached progA to the dispatcher some
> time ago, and container B then regenerates the dispatcher to add progB,
> that shouldn't stop progA from being executed even for a short period of
> time, since for some programs that's just a no-go (e.g. if progA is a
> firewall, disabling it would mean allowing traffic that is otherwise
> denied).

Yeah, you're right, I was probably being a bit too timid when I said
"ideally"; this is really something we should support.

> I'm not the one who can answer how hard it would be to support this kind
> of "re-attaching" on the kernel side, and I'm curious myself. I do see,
> though, that the current implementation of ext programs has a single
> (prog->aux->linked_prog, prog->aux->attach_btf_id) pair.

No, it doesn't work with the way things are set up currently; but with a
bit of refactoring I believe it can be made to work. I'll give it a
shot, and we'll see if I'm right...

> Also it's not clear what to do with the fd returned by
> bpf_tracing_prog_attach (whether it can be pinned or not). E.g. if
> container A generated a dispatcher with ext progA attached to it and got
> this "link" fd, but then the dispatcher was regenerated and progA was
> reattached to the new dispatcher, how do we make sure that the "link" fd
> is still the right one, and that cleanup will happen when a process in
> container A closes the fd it has (or unpins the corresponding file in
> bpffs)?

This is the reason why I think the 'link' between the main program and
the replacement program is in the "wrong direction". Instead I want to
introduce a new attachment API that can be used instead of
bpf_raw_tracepoint_open() - something like:

prog_fd = sys_bpf(BPF_PROG_LOAD, ...); // dispatcher
func_fd = sys_bpf(BPF_PROG_LOAD, ...); // replacement func
err = sys_bpf(BPF_PROG_REPLACE_FUNC, prog_fd, btf_id, func_fd); // does *not* return an fd

When using this, the kernel will flip the direction of the reference
between BPF programs, so it goes main_prog -> replacement_prog. And
instead of getting an fd back, this will make the replacement prog share
its lifecycle with the main program, so that when the main program is
released, so is the replacement (absent other references, of course).
There could be an explicit 'release' command as well, of course, and a
way to list all replacements on a program.

I think that making replacement progs 'fate-share' with the main prog
this way will make for a more natural API. In fact, I don't think it's
possible to atomically keep things consistent if we have to rely on
pinning; see [0] for my brain dump on this over on Andrii's bpf_link
patch series thread.

-Toke

[0] https://lore.kernel.org/bpf/87imjms8cm.fsf@toke.dk/



* Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface
  2020-03-03  9:50       ` Toke Høiland-Jørgensen
@ 2020-03-03 16:24         ` Alexei Starovoitov
  2020-03-03 16:27           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2020-03-03 16:24 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrey Ignatov, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Lorenz Bauer, Network Development, bpf,
	Takshak Chahande

On Tue, Mar 3, 2020 at 1:50 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> This is the reason why I think the 'link' between the main program and
> the replacement program is in the "wrong direction". Instead I want to
> introduce a new attachment API that can be used instead of
> bpf_raw_tracepoint_open() - something like:
>
> prog_fd = sys_bpf(BPF_PROG_LOAD, ...); // dispatcher
> func_fd = sys_bpf(BPF_PROG_LOAD, ...); // replacement func
> err = sys_bpf(BPF_PROG_REPLACE_FUNC, prog_fd, btf_id, func_fd); // does *not* return an fd
>
> When using this, the kernel will flip the direction of the reference
> between BPF programs, so it goes main_prog -> replacement_prog. And
> instead of getting an fd back, this will make the replacement prog share
> its lifecycle with the main program, so that when the main program is
> released, so is the replacement (absent other references, of course).
> There could be an explicit 'release' command as well, of course, and a
> way to list all replacements on a program.

Nack to such api.
We hit this opposite direction issue with xdp and tc in the past.
Not going to repeat the same mistake again.


* Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface
  2020-03-03 16:24         ` Alexei Starovoitov
@ 2020-03-03 16:27           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 11+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-03 16:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrey Ignatov, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Lorenz Bauer, Network Development, bpf,
	Takshak Chahande

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Tue, Mar 3, 2020 at 1:50 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> This is the reason why I think the 'link' between the main program and
>> the replacement program is in the "wrong direction". Instead I want to
>> introduce a new attachment API that can be used instead of
>> bpf_raw_tracepoint_open() - something like:
>>
>> prog_fd = sys_bpf(BPF_PROG_LOAD, ...); // dispatcher
>> func_fd = sys_bpf(BPF_PROG_LOAD, ...); // replacement func
>> err = sys_bpf(BPF_PROG_REPLACE_FUNC, prog_fd, btf_id, func_fd); // does *not* return an fd
>>
>> When using this, the kernel will flip the direction of the reference
>> between BPF programs, so it goes main_prog -> replacement_prog. And
>> instead of getting an fd back, this will make the replacement prog share
>> its lifecycle with the main program, so that when the main program is
>> released, so is the replacement (absent other references, of course).
>> There could be an explicit 'release' command as well, of course, and a
>> way to list all replacements on a program.
>
> Nack to such api.
> We hit this opposite direction issue with xdp and tc in the past.
> Not going to repeat the same mistake again.

Care to elaborate? What mistake, and what was the issue?

-Toke



* Re: [PATCH RFC] Userspace library for handling multiple XDP programs on an interface
  2020-03-03  1:03     ` Andrey Ignatov
  2020-03-03  9:50       ` Toke Høiland-Jørgensen
@ 2020-03-03 19:53       ` Andrii Nakryiko
  1 sibling, 0 replies; 11+ messages in thread
From: Andrii Nakryiko @ 2020-03-03 19:53 UTC (permalink / raw)
  To: Andrey Ignatov
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Lorenz Bauer, Networking,
	bpf, ctakshak

On Mon, Mar 2, 2020 at 5:03 PM Andrey Ignatov <rdna@fb.com> wrote:
>
> Toke Høiland-Jørgensen <toke@redhat.com> [Sat, 2020-02-29 02:36 -0800]:
> > Andrey Ignatov <rdna@fb.com> writes:
> >
> > > The main challenges I see for applying this approach in fb case is the
> > > need to recreate the dispatcher every time a new program has to be
> > > added.
> > >
> > > Imagine there there are a few containers and every container wants to
> > > run an application that attaches XDP program to the "dispatcher" via
> > > freplace. Every application may have a "priority" reserved for it, but
> > > recreating the dispatcher may have race condition, for example:
> >
> > Yeah, I did realise this is potentially racy, but so is any loading of
> > XDP programs right now (i.e., two applications can both try loading a
> > single XDP program at the same time, and end up stomping on each others'
> > feet). So we'll need to solve that in any case. I've managed to come up
> > with two possible ways to solve this:
> >
> > 1. Locking: Make it possible for a process to temporarily lock the
> > XDP program loaded onto an interface so no other program can modify it
> > until the lock is released.
> >
> > 2. A cmpxchg operation: Add a new field to the XDP load netlink message
> > containing an fd of the old program that the load call is expecting to
> > replace. I.e., instead of attach(ifindex, prog_fd, flags), you have
> > attach(ifindex, prog_fd, old_fd, flags). The kernel can then check that
> > the old_fd matches the program currently loaded before replacing
> > anything, and reject the operation otherwise.
> >
> > With either of these mechanisms it should be possible for userspace to
> > do the right thing if the kernel state changes underneath it. I'm
> > leaning towards (2) because I think it is simpler to implement and
> > doesn't require any new state be kept in the kernel.
>
> Yep, that will solve the race.
>
> (2) sounds good to me; in fact I did a similar thing for cgroup-bpf in:
>
> 7dd68b3279f1 ("bpf: Support replacing cgroup-bpf program in MULTI mode")
>
> where the user can pass replace_bpf_fd and the BPF_F_REPLACE flag, and
> it guarantees that the program the user wants to replace is the one that
> gets replaced, not a new program that was attached by somebody else just
> a moment ago.
>
>
> > The drawback is
> > that it may lead to a lot of retries if many processes are trying to
> > load their programs at the same time. Some data would be good here: How
> > often do you expect programs to be loaded/unloaded in your use case?
>
>
> In the case I mentioned it's more about having multiple applications
> that may start/restart at the same time, not about frequency. It'll be a
> few (a single-digit number of) apps, which means having a few retries
> should be fine, as long as "the old program doesn't exist" can be
> detected easily (e.g. ENOENT should work), so that we don't retry on
> errors that are obviously unrelated to the race condition.
>
>
> > As for your other suggestion:
> >
> > > Also I see at least one other way to do it w/o regenerating dispatcher
> > > every time:
> > >
> > > It can be created and attached once with "big enough" number of slots,
> > > for example with 100 and programs may use use their corresponding slot
> > > to freplace w/o regenerating the dispatcher. Having those big number of
> > > no-op slots should not be a big deal from what I understand and kernel
> > > can optimize it.
> >
> > I thought about having the dispatcher stay around for longer, and just
> > replacing more function slots as new programs are added/removed. The
> > reason I didn't go with this is the following: Modifying the dispatcher
> > while it is loaded means that the modifications will apply to traffic on
> > the interface immediately. This is fine for simple add/remove of a
> > single program, but it limits which operations you can do atomically.
> > E.g., you can't switch the order of two programs, or add or remove more
> > than one, in a way that is atomic from the PoV of the traffic on the
> > interface.
>
> Right, simple add/remove cases are the only ones I've seen so far, since
> programs are usually more or less independent and just need to be
> chained properly, w/o anything like "the order of programs should be
> changed atomically" or "two programs must be enabled atomically".
>
> But okay, I can imagine that this may happen in the wild. In that case,
> yes, full regeneration of the dispatcher looks like the right option.
>
>
> > Since I expect that we will need to support atomic operations even for
> > these more complex cases, that means we'll need to support rebuilding
> > the dispatcher anyway, and solving the race condition problem for that.
> > And once we've done that, the simple add/remove in the existing
> > dispatcher becomes just an additional code path that we'll need to
> > maintain, so why bother? :)
> >
> > I am also not sure it's as simple as you say for the kernel to optimise
> > a more complex dispatcher: The current dead code elimination relies on
> > map data being frozen at verification time, so it's not applicable to
> > optimising a dispatcher as it is being changed later. Now, this could
> > probably be fixed and/or we could try doing clever tricks with the flow
> > control in the dispatcher program itself. But again, why bother if we
> > have to support the dispatcher rebuild mode of operation anyway?
>
> Yeah, having the ability to regenerate the full dispatcher helps to
> avoid dealing with those no-op programs.
>
> This also kinda solves the problem of allocating positions in the list
> of noop_func1, noop_func2, ..., noop_funcN: since N is limited, keeping
> "enough space" between existing programs to be able to attach something
> else between them in the future can be challenging in the general case.
>
> > I may have missed something, of course, so feel free to point out if you
> > see anything wrong with my reasoning above!
> >
> > > This is the main thing so far, I'll likely provide more feedback when
> > > have some more time to read the code ..
> >
> > Sounds good! You're already being very helpful, so thank you! :)
>
> I've spent more time reading the library, and I like the static global
> data idea that allows "regenerating" the dispatcher w/o actually
> recompiling it, so that it can still be compiled once and distributed to
> all relevant hosts. It simplifies a bunch of the things discussed above.
>
> But this part in the "missing pieces":
>
> > > - There is no way to re-attach an already loaded program to another function;
> > >   this is needed for updating the call sequence: When a new program is loaded,
> > >   libxdp should get the existing list of component programs on the interface and
> > >   insert the new one into the chain in the appropriate place. To do this it
> > >   needs to build a new dispatcher and reattach all the old programs to it.
> > >   Ideally, this should be doable without detaching them from the old dispatcher;
> > >   that way, we can build the new dispatcher completely, and atomically replace
> > >   it on the interface by the usual XDP attach mechanism.
>
> seems to be a "must-have", including the "Ideally" section, since IMO
> simply adding a new program should not interrupt what previously
> attached programs are doing.
>
> If there is a container A that attached progA to the dispatcher some
> time ago, and container B then regenerates the dispatcher to add progB,
> that shouldn't stop progA from being executed even for a short period of
> time, since for some programs that's just a no-go (e.g. if progA is a
> firewall, disabling it would mean allowing traffic that is otherwise
> denied).
>
> I'm not the one who can answer how hard it would be to support this kind
> of "re-attaching" on the kernel side, and I'm curious myself. I do see,
> though, that the current implementation of ext programs has a single
> (prog->aux->linked_prog, prog->aux->attach_btf_id) pair.
>
> Also it's not clear what to do with the fd returned by
> bpf_tracing_prog_attach (whether it can be pinned or not). E.g. if
> container A generated a dispatcher with ext progA attached to it and got
> this "link" fd, but then the dispatcher was regenerated and progA was
> reattached to the new dispatcher, how do we make sure that the "link" fd
> is still the right one, and that cleanup will happen when a process in
> container A closes the fd it has (or unpins the corresponding file in
> bpffs)?

I just replied to Daniel on another thread (it's weird how we have
inter-related discussions on separate threads, but whatever..).
Basically, once you establish an XDP bpf_link for the XDP dispatcher,
you can't attach another bpf_link anymore. It will keep failing until
the bpf_link goes away (close the FD, unpin, etc.). But once you have
the bpf_link, you should be able to replace the underlying BPF program,
as long as you still own that link. This way it should be possible to
re-generate the XDP dispatcher and safely switch it.

>
>
> --
> Andrey Ignatov


end of thread, other threads:[~2020-03-03 19:54 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-28 14:22 [PATCH RFC] Userspace library for handling multiple XDP programs on an interface Toke Høiland-Jørgensen
2020-02-28 14:22 ` [PATCH RFC] libxdp: Add libxdp (FOR COMMENT ONLY) Toke Høiland-Jørgensen
2020-02-28 22:15 ` [PATCH RFC] Userspace library for handling multiple XDP programs on an interface Andrey Ignatov
2020-02-29 10:36   ` Toke Høiland-Jørgensen
2020-03-03  1:03     ` Andrey Ignatov
2020-03-03  9:50       ` Toke Høiland-Jørgensen
2020-03-03 16:24         ` Alexei Starovoitov
2020-03-03 16:27           ` Toke Høiland-Jørgensen
2020-03-03 19:53       ` Andrii Nakryiko
2020-02-28 23:21 ` Andrii Nakryiko
2020-02-29 10:37   ` Toke Høiland-Jørgensen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).