bpf.vger.kernel.org archive mirror
* [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
@ 2020-02-28 22:39 Andrii Nakryiko
  2020-02-28 22:39 ` [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction Andrii Nakryiko
                   ` (3 more replies)
  0 siblings, 4 replies; 50+ messages in thread
From: Andrii Nakryiko @ 2020-02-28 22:39 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

This patch series adds a bpf_link abstraction, analogous to libbpf's already
existing bpf_link abstraction. It formalizes and makes more uniform the
existing bpf_link-like BPF program link (attachment) types (raw tracepoint and
tracing links), which are FD-based objects that are automatically detached
when the last file reference is closed. These types of BPF program links are
switched over to the bpf_link framework.

The FD-based bpf_link approach provides strong safety guarantees by ensuring
that no abandoned BPF program stays attached if the user process suddenly
exits or forgets to clean up after itself. This is especially important in
production environments and is the model that all the recent new BPF link
types follow.

One of the previously existing inconveniences of the FD-based approach,
though, was the scenario in which a user process wants to install a BPF link
and exit, but let the attached BPF program keep running. Now, with the
bpf_link abstraction in place, it's easy to support pinning links in BPF FS,
which is done as part of the same patch #1. This allows FD-based BPF program
links to survive the exit of the user process and the closing of the original
file descriptor, by creating a file entry in BPF FS. This keeps the
safe-by-default behavior, with a simple way to opt out for the cases where
it's needed.

Corresponding libbpf APIs are added in the same patch set, as well as
selftests for this functionality.
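
To illustrate the intended flow, below is a rough sketch of how a user-space
agent could attach a BPF program, pin the resulting link, and exit, using the
new libbpf APIs (error handling is abbreviated; the tracepoint name and pin
path are just examples):

  /* #include <bpf/libbpf.h>; 'prog' is an already-loaded bpf_program */
  struct bpf_link *link;

  link = bpf_program__attach_raw_tracepoint(prog, "sys_enter");
  if (libbpf_get_error(link))
          return -1;
  /* pinning keeps the attachment alive after this process exits */
  if (bpf_link__pin(link, "/sys/fs/bpf/my_raw_tp_link"))
          return -1;
  /* closes our FD; the pinned link keeps the program attached */
  bpf_link__destroy(link);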

Other types of BPF program attachments (XDP, cgroup, perf_event, etc.) are
going to be converted in subsequent patches to follow a similar approach.

Andrii Nakryiko (3):
  bpf: introduce pinnable bpf_link abstraction
  libbpf: add bpf_link pinning/unpinning
  selftests/bpf: add link pinning selftests

 include/linux/bpf.h                           |  13 ++
 kernel/bpf/inode.c                            |  42 +++-
 kernel/bpf/syscall.c                          | 209 +++++++++++++++---
 tools/lib/bpf/libbpf.c                        | 131 ++++++++---
 tools/lib/bpf/libbpf.h                        |   5 +
 tools/lib/bpf/libbpf.map                      |   5 +
 .../selftests/bpf/prog_tests/link_pinning.c   | 105 +++++++++
 .../selftests/bpf/progs/test_link_pinning.c   |  25 +++
 8 files changed, 470 insertions(+), 65 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/link_pinning.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_link_pinning.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction
  2020-02-28 22:39 [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction Andrii Nakryiko
@ 2020-02-28 22:39 ` Andrii Nakryiko
  2020-03-02 10:13   ` Toke Høiland-Jørgensen
  2020-03-03  2:50   ` Alexei Starovoitov
  2020-02-28 22:39 ` [PATCH bpf-next 2/3] libbpf: add bpf_link pinning/unpinning Andrii Nakryiko
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 50+ messages in thread
From: Andrii Nakryiko @ 2020-02-28 22:39 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Introduce a bpf_link abstraction, representing an attachment of a BPF program
to a BPF hook point (e.g., tracepoint, perf event, etc.). bpf_link
encapsulates ownership of the attached BPF program and reference counting of
the link itself (needed when the link is referenced from multiple anonymous
inodes), and it ensures that the release callback is called from process
context, so that users can safely take mutex locks and sleep.

Additionally, with the new abstraction it's now possible to generalize
pinning of a link object in BPF FS, which allows explicitly preventing BPF
program detachment on process exit by pinning the link in BPF FS and letting
another, independent process open it and keep working with it.

Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) to the bpf_link framework, making them pinnable in
BPF FS. More FD-based bpf_links will be added in follow-up patches.
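
For reference, a new FD-based link type would plug into this framework roughly
as follows (a condensed, illustrative sketch modeled on the tracing link
conversion below; the bpf_foo_* names are made up, and hook attach/detach
details as well as prog refcount handling on error paths are elided):

  #include <linux/bpf.h>

  struct bpf_foo_link {
          struct bpf_link link;
          /* hook-specific state goes here */
  };

  static void bpf_foo_link_release(struct bpf_link *link)
  {
          struct bpf_foo_link *foo =
                  container_of(link, struct bpf_foo_link, link);

          /* detach from the hook; guaranteed to run in process context,
           * bpf_prog_put() on link->prog is done by the framework
           */
          kfree(foo);
  }

  static const struct bpf_link_ops bpf_foo_link_lops = {
          .release = bpf_foo_link_release,
  };

  static int bpf_foo_link_attach(struct bpf_prog *prog)
  {
          struct bpf_foo_link *link;
          int fd;

          link = kzalloc(sizeof(*link), GFP_USER);
          if (!link)
                  return -ENOMEM;
          /* on success, the link owns the prog reference */
          bpf_link_init(&link->link, &bpf_foo_link_lops, prog);

          /* ... attach prog to the actual hook here ... */

          fd = bpf_link_new_fd(&link->link);
          if (fd < 0)
                  kfree(link);    /* also undo the hook attach */
          return fd;
  }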

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 include/linux/bpf.h  |  13 +++
 kernel/bpf/inode.c   |  42 ++++++++-
 kernel/bpf/syscall.c | 209 ++++++++++++++++++++++++++++++++++++-------
 3 files changed, 226 insertions(+), 38 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6015a4daf118..f13c78c6f29d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1056,6 +1056,19 @@ extern int sysctl_unprivileged_bpf_disabled;
 int bpf_map_new_fd(struct bpf_map *map, int flags);
 int bpf_prog_new_fd(struct bpf_prog *prog);
 
+struct bpf_link;
+
+struct bpf_link_ops {
+	void (*release)(struct bpf_link *link);
+};
+
+void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops,
+		   struct bpf_prog *prog);
+void bpf_link_inc(struct bpf_link *link);
+void bpf_link_put(struct bpf_link *link);
+int bpf_link_new_fd(struct bpf_link *link);
+struct bpf_link *bpf_link_get_from_fd(u32 ufd);
+
 int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
 int bpf_obj_get_user(const char __user *pathname, int flags);
 
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 5e40e7fccc21..95087d9f4ed3 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -25,6 +25,7 @@ enum bpf_type {
 	BPF_TYPE_UNSPEC	= 0,
 	BPF_TYPE_PROG,
 	BPF_TYPE_MAP,
+	BPF_TYPE_LINK,
 };
 
 static void *bpf_any_get(void *raw, enum bpf_type type)
@@ -36,6 +37,9 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
 	case BPF_TYPE_MAP:
 		bpf_map_inc_with_uref(raw);
 		break;
+	case BPF_TYPE_LINK:
+		bpf_link_inc(raw);
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		break;
@@ -53,6 +57,9 @@ static void bpf_any_put(void *raw, enum bpf_type type)
 	case BPF_TYPE_MAP:
 		bpf_map_put_with_uref(raw);
 		break;
+	case BPF_TYPE_LINK:
+		bpf_link_put(raw);
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		break;
@@ -63,20 +70,32 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
 {
 	void *raw;
 
-	*type = BPF_TYPE_MAP;
 	raw = bpf_map_get_with_uref(ufd);
-	if (IS_ERR(raw)) {
+	if (!IS_ERR(raw)) {
+		*type = BPF_TYPE_MAP;
+		return raw;
+	}
+
+	raw = bpf_prog_get(ufd);
+	if (!IS_ERR(raw)) {
 		*type = BPF_TYPE_PROG;
-		raw = bpf_prog_get(ufd);
+		return raw;
 	}
 
-	return raw;
+	raw = bpf_link_get_from_fd(ufd);
+	if (!IS_ERR(raw)) {
+		*type = BPF_TYPE_LINK;
+		return raw;
+	}
+
+	return ERR_PTR(-EINVAL);
 }
 
 static const struct inode_operations bpf_dir_iops;
 
 static const struct inode_operations bpf_prog_iops = { };
 static const struct inode_operations bpf_map_iops  = { };
+static const struct inode_operations bpf_link_iops  = { };
 
 static struct inode *bpf_get_inode(struct super_block *sb,
 				   const struct inode *dir,
@@ -114,6 +133,8 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
 		*type = BPF_TYPE_PROG;
 	else if (inode->i_op == &bpf_map_iops)
 		*type = BPF_TYPE_MAP;
+	else if (inode->i_op == &bpf_link_iops)
+		*type = BPF_TYPE_LINK;
 	else
 		return -EACCES;
 
@@ -335,6 +356,12 @@ static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg)
 			     &bpffs_map_fops : &bpffs_obj_fops);
 }
 
+static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg)
+{
+	return bpf_mkobj_ops(dentry, mode, arg, &bpf_link_iops,
+			     &bpffs_obj_fops);
+}
+
 static struct dentry *
 bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags)
 {
@@ -411,6 +438,9 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
 	case BPF_TYPE_MAP:
 		ret = vfs_mkobj(dentry, mode, bpf_mkmap, raw);
 		break;
+	case BPF_TYPE_LINK:
+		ret = vfs_mkobj(dentry, mode, bpf_mklink, raw);
+		break;
 	default:
 		ret = -EPERM;
 	}
@@ -487,6 +517,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
 		ret = bpf_prog_new_fd(raw);
 	else if (type == BPF_TYPE_MAP)
 		ret = bpf_map_new_fd(raw, f_flags);
+	else if (type == BPF_TYPE_LINK)
+		ret = bpf_link_new_fd(raw);
 	else
 		return -ENOENT;
 
@@ -504,6 +536,8 @@ static struct bpf_prog *__get_prog_inode(struct inode *inode, enum bpf_prog_type
 
 	if (inode->i_op == &bpf_map_iops)
 		return ERR_PTR(-EINVAL);
+	if (inode->i_op == &bpf_link_iops)
+		return ERR_PTR(-EINVAL);
 	if (inode->i_op != &bpf_prog_iops)
 		return ERR_PTR(-EACCES);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c536c65256ad..fca8de7e7872 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2173,23 +2173,153 @@ static int bpf_obj_get(const union bpf_attr *attr)
 				attr->file_flags);
 }
 
-static int bpf_tracing_prog_release(struct inode *inode, struct file *filp)
+struct bpf_link {
+	atomic64_t refcnt;
+	const struct bpf_link_ops *ops;
+	struct bpf_prog *prog;
+	struct work_struct work;
+};
+
+void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops,
+		   struct bpf_prog *prog)
 {
-	struct bpf_prog *prog = filp->private_data;
+	atomic64_set(&link->refcnt, 1);
+	link->ops = ops;
+	link->prog = prog;
+}
+
+void bpf_link_inc(struct bpf_link *link)
+{
+	atomic64_inc(&link->refcnt);
+}
+
+/* bpf_link_free is guaranteed to be called from process context */
+static void bpf_link_free(struct bpf_link *link)
+{
+	struct bpf_prog *prog;
 
-	WARN_ON_ONCE(bpf_trampoline_unlink_prog(prog));
+	/* remember prog locally, because release below will free link memory */
+	prog = link->prog;
+	/* extra clean up and kfree of container link struct */
+	link->ops->release(link);
+	/* no more accessing of link members after this point */
 	bpf_prog_put(prog);
+}
+
+static void bpf_link_put_deferred(struct work_struct *work)
+{
+	struct bpf_link *link = container_of(work, struct bpf_link, work);
+
+	bpf_link_free(link);
+}
+
+/* bpf_link_put can be called from atomic context, but ensures that resources
+ * are freed from process context
+ */
+void bpf_link_put(struct bpf_link *link)
+{
+	if (!atomic64_dec_and_test(&link->refcnt))
+		return;
+
+	if (in_atomic()) {
+		INIT_WORK(&link->work, bpf_link_put_deferred);
+		schedule_work(&link->work);
+	} else {
+		bpf_link_free(link);
+	}
+}
+
+static int bpf_link_release(struct inode *inode, struct file *filp)
+{
+	struct bpf_link *link = filp->private_data;
+
+	bpf_link_put(link);
 	return 0;
 }
 
-static const struct file_operations bpf_tracing_prog_fops = {
-	.release	= bpf_tracing_prog_release,
+#ifdef CONFIG_PROC_FS
+static const struct bpf_link_ops bpf_raw_tp_lops;
+static const struct bpf_link_ops bpf_tracing_link_lops;
+static const struct bpf_link_ops bpf_xdp_link_lops;
+
+static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
+{
+	const struct bpf_link *link = filp->private_data;
+	const struct bpf_prog *prog = link->prog;
+	char prog_tag[sizeof(prog->tag) * 2 + 1] = { };
+	const char *link_type;
+
+	if (link->ops == &bpf_raw_tp_lops)
+		link_type = "raw_tracepoint";
+	else if (link->ops == &bpf_tracing_link_lops)
+		link_type = "tracing";
+	else
+		link_type = "unknown";
+
+	bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
+	seq_printf(m,
+		   "link_type:\t%s\n"
+		   "prog_tag:\t%s\n"
+		   "prog_id:\t%u\n",
+		   link_type,
+		   prog_tag,
+		   prog->aux->id);
+}
+#endif
+
+const struct file_operations bpf_link_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= bpf_link_show_fdinfo,
+#endif
+	.release	= bpf_link_release,
 	.read		= bpf_dummy_read,
 	.write		= bpf_dummy_write,
 };
 
+int bpf_link_new_fd(struct bpf_link *link)
+{
+	return anon_inode_getfd("bpf-link", &bpf_link_fops, link, O_CLOEXEC);
+}
+
+struct bpf_link *bpf_link_get_from_fd(u32 ufd)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_link *link;
+
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+	if (f.file->f_op != &bpf_link_fops) {
+		fdput(f);
+		return ERR_PTR(-EINVAL);
+	}
+
+	link = f.file->private_data;
+	bpf_link_inc(link);
+	fdput(f);
+
+	return link;
+}
+
+struct bpf_tracing_link {
+	struct bpf_link link;
+};
+
+static void bpf_tracing_link_release(struct bpf_link *link)
+{
+	struct bpf_tracing_link *tr_link =
+		container_of(link, struct bpf_tracing_link, link);
+
+	WARN_ON_ONCE(bpf_trampoline_unlink_prog(link->prog));
+	kfree(tr_link);
+}
+
+static const struct bpf_link_ops bpf_tracing_link_lops = {
+	.release = bpf_tracing_link_release,
+};
+
 static int bpf_tracing_prog_attach(struct bpf_prog *prog)
 {
+	struct bpf_tracing_link *link;
 	int tr_fd, err;
 
 	if (prog->expected_attach_type != BPF_TRACE_FENTRY &&
@@ -2199,53 +2329,57 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog)
 		goto out_put_prog;
 	}
 
+	link = kzalloc(sizeof(*link), GFP_USER);
+	if (!link) {
+		err = -ENOMEM;
+		goto out_put_prog;
+	}
+	bpf_link_init(&link->link, &bpf_tracing_link_lops, prog);
+
 	err = bpf_trampoline_link_prog(prog);
 	if (err)
-		goto out_put_prog;
+		goto out_free_link;
 
-	tr_fd = anon_inode_getfd("bpf-tracing-prog", &bpf_tracing_prog_fops,
-				 prog, O_CLOEXEC);
+	tr_fd = anon_inode_getfd("bpf-tracing-link", &bpf_link_fops,
+				 &link->link, O_CLOEXEC);
 	if (tr_fd < 0) {
 		WARN_ON_ONCE(bpf_trampoline_unlink_prog(prog));
 		err = tr_fd;
-		goto out_put_prog;
+		goto out_free_link;
 	}
 	return tr_fd;
 
+out_free_link:
+	kfree(link);
 out_put_prog:
 	bpf_prog_put(prog);
 	return err;
 }
 
-struct bpf_raw_tracepoint {
+struct bpf_raw_tp_link {
+	struct bpf_link link;
 	struct bpf_raw_event_map *btp;
-	struct bpf_prog *prog;
 };
 
-static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp)
+static void bpf_raw_tp_link_release(struct bpf_link *link)
 {
-	struct bpf_raw_tracepoint *raw_tp = filp->private_data;
+	struct bpf_raw_tp_link *raw_tp =
+		container_of(link, struct bpf_raw_tp_link, link);
 
-	if (raw_tp->prog) {
-		bpf_probe_unregister(raw_tp->btp, raw_tp->prog);
-		bpf_prog_put(raw_tp->prog);
-	}
+	bpf_probe_unregister(raw_tp->btp, raw_tp->link.prog);
 	bpf_put_raw_tracepoint(raw_tp->btp);
 	kfree(raw_tp);
-	return 0;
 }
 
-static const struct file_operations bpf_raw_tp_fops = {
-	.release	= bpf_raw_tracepoint_release,
-	.read		= bpf_dummy_read,
-	.write		= bpf_dummy_write,
+static const struct bpf_link_ops bpf_raw_tp_lops = {
+	.release = bpf_raw_tp_link_release,
 };
 
 #define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd
 
 static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 {
-	struct bpf_raw_tracepoint *raw_tp;
+	struct bpf_raw_tp_link *raw_tp;
 	struct bpf_raw_event_map *btp;
 	struct bpf_prog *prog;
 	const char *tp_name;
@@ -2302,15 +2436,15 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 		err = -ENOMEM;
 		goto out_put_btp;
 	}
+	bpf_link_init(&raw_tp->link, &bpf_raw_tp_lops, prog);
 	raw_tp->btp = btp;
-	raw_tp->prog = prog;
 
 	err = bpf_probe_register(raw_tp->btp, prog);
 	if (err)
 		goto out_free_tp;
 
-	tp_fd = anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
-				 O_CLOEXEC);
+	tp_fd = anon_inode_getfd("bpf-raw-tp-link", &bpf_link_fops,
+				 &raw_tp->link, O_CLOEXEC);
 	if (tp_fd < 0) {
 		bpf_probe_unregister(raw_tp->btp, prog);
 		err = tp_fd;
@@ -3266,15 +3400,21 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
 	if (err)
 		goto out;
 
-	if (file->f_op == &bpf_raw_tp_fops) {
-		struct bpf_raw_tracepoint *raw_tp = file->private_data;
-		struct bpf_raw_event_map *btp = raw_tp->btp;
+	if (file->f_op == &bpf_link_fops) {
+		struct bpf_link *link = file->private_data;
 
-		err = bpf_task_fd_query_copy(attr, uattr,
-					     raw_tp->prog->aux->id,
-					     BPF_FD_TYPE_RAW_TRACEPOINT,
-					     btp->tp->name, 0, 0);
-		goto put_file;
+		if (link->ops == &bpf_raw_tp_lops) {
+			struct bpf_raw_tp_link *raw_tp =
+				container_of(link, struct bpf_raw_tp_link, link);
+			struct bpf_raw_event_map *btp = raw_tp->btp;
+
+			err = bpf_task_fd_query_copy(attr, uattr,
+						     raw_tp->link.prog->aux->id,
+						     BPF_FD_TYPE_RAW_TRACEPOINT,
+						     btp->tp->name, 0, 0);
+			goto put_file;
+		}
+		goto out_not_supp;
 	}
 
 	event = perf_get_event(file);
@@ -3294,6 +3434,7 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
 		goto put_file;
 	}
 
+out_not_supp:
 	err = -ENOTSUPP;
 put_file:
 	fput(file);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH bpf-next 2/3] libbpf: add bpf_link pinning/unpinning
  2020-02-28 22:39 [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction Andrii Nakryiko
  2020-02-28 22:39 ` [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction Andrii Nakryiko
@ 2020-02-28 22:39 ` Andrii Nakryiko
  2020-03-02 10:16   ` Toke Høiland-Jørgensen
  2020-02-28 22:39 ` [PATCH bpf-next 3/3] selftests/bpf: add link pinning selftests Andrii Nakryiko
  2020-03-02 10:11 ` [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction Toke Høiland-Jørgensen
  3 siblings, 1 reply; 50+ messages in thread
From: Andrii Nakryiko @ 2020-02-28 22:39 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

With the bpf_link abstraction now explicitly supported by the kernel, add
a pinning/unpinning API for links. Also allow creating (opening) a bpf_link
from a BPF FS file.

This API allows otherwise "ephemeral" FD-based BPF links (like raw tracepoint
or fexit/freplace attachments) to survive user process exit by pinning them in
BPF FS, which is an important use case for long-running BPF programs.

As part of this, expose the underlying FD of a bpf_link. While legacy
bpf_links might not have an FD associated with them (which will be expressed
as a bpf_link with fd=-1), the kernel's abstraction is built around FD-based
usage, so match it closely. This, subsequently, allows having a generic
pinning/unpinning API for the generalized bpf_link. For some types of
bpf_links the kernel might not support pinning, in which case bpf_link__pin()
will return an error.

With the FD being part of the generic bpf_link, also get rid of bpf_link_fd
in favor of using vanilla bpf_link.
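
As a rough usage sketch, a second process can later re-open the pinned link to
take over the attachment and detach the program (the pin path is just an
example, error handling is abbreviated):

  struct bpf_link *link;

  link = bpf_link__open("/sys/fs/bpf/my_raw_tp_link");
  if (libbpf_get_error(link))
          return -1;
  /* remove the BPF FS entry; the attachment now lives only through our FD */
  if (bpf_link__unpin(link))
          return -1;
  /* closing the last FD detaches the BPF program */
  bpf_link__destroy(link);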

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/lib/bpf/libbpf.c   | 131 +++++++++++++++++++++++++++++++--------
 tools/lib/bpf/libbpf.h   |   5 ++
 tools/lib/bpf/libbpf.map |   5 ++
 3 files changed, 114 insertions(+), 27 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 996162801f7a..f8c4042e5855 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -6931,6 +6931,8 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
 struct bpf_link {
 	int (*detach)(struct bpf_link *link);
 	int (*destroy)(struct bpf_link *link);
+	char *pin_path;		/* NULL, if not pinned */
+	int fd;			/* hook FD, -1 if not applicable */
 	bool disconnected;
 };
 
@@ -6960,26 +6962,109 @@ int bpf_link__destroy(struct bpf_link *link)
 		err = link->detach(link);
 	if (link->destroy)
 		link->destroy(link);
+	if (link->pin_path)
+		free(link->pin_path);
 	free(link);
 
 	return err;
 }
 
-struct bpf_link_fd {
-	struct bpf_link link; /* has to be at the top of struct */
-	int fd; /* hook FD */
-};
+int bpf_link__fd(const struct bpf_link *link)
+{
+	return link->fd;
+}
+
+const char *bpf_link__pin_path(const struct bpf_link *link)
+{
+	return link->pin_path;
+}
+
+static int bpf_link__detach_fd(struct bpf_link *link)
+{
+	return close(link->fd);
+}
+
+struct bpf_link *bpf_link__open(const char *path)
+{
+	struct bpf_link *link;
+	int fd;
+
+	fd = bpf_obj_get(path);
+	if (fd < 0) {
+		fd = -errno;
+		pr_warn("failed to open link at %s: %d\n", path, fd);
+		return ERR_PTR(fd);
+	}
+
+	link = calloc(1, sizeof(*link));
+	if (!link) {
+		close(fd);
+		return ERR_PTR(-ENOMEM);
+	}
+	link->detach = &bpf_link__detach_fd;
+	link->fd = fd;
+
+	link->pin_path = strdup(path);
+	if (!link->pin_path) {
+		bpf_link__destroy(link);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	return link;
+}
+
+int bpf_link__pin(struct bpf_link *link, const char *path)
+{
+	int err;
+
+	if (link->pin_path)
+		return -EBUSY;
+	err = make_parent_dir(path);
+	if (err)
+		return err;
+	err = check_path(path);
+	if (err)
+		return err;
+
+	link->pin_path = strdup(path);
+	if (!link->pin_path)
+		return -ENOMEM;
+
+	if (bpf_obj_pin(link->fd, link->pin_path)) {
+		err = -errno;
+		zfree(&link->pin_path);
+		return err;
+	}
+
+	pr_debug("link fd=%d: pinned at %s\n", link->fd, link->pin_path);
+	return 0;
+}
+
+int bpf_link__unpin(struct bpf_link *link)
+{
+	int err;
+
+	if (!link->pin_path)
+		return -EINVAL;
+
+	err = unlink(link->pin_path);
+	if (err != 0)
+		return -errno;
+
+	pr_debug("link fd=%d: unpinned from %s\n", link->fd, link->pin_path);
+	zfree(&link->pin_path);
+	return 0;
+}
 
 static int bpf_link__detach_perf_event(struct bpf_link *link)
 {
-	struct bpf_link_fd *l = (void *)link;
 	int err;
 
-	err = ioctl(l->fd, PERF_EVENT_IOC_DISABLE, 0);
+	err = ioctl(link->fd, PERF_EVENT_IOC_DISABLE, 0);
 	if (err)
 		err = -errno;
 
-	close(l->fd);
+	close(link->fd);
 	return err;
 }
 
@@ -6987,7 +7072,7 @@ struct bpf_link *bpf_program__attach_perf_event(struct bpf_program *prog,
 						int pfd)
 {
 	char errmsg[STRERR_BUFSIZE];
-	struct bpf_link_fd *link;
+	struct bpf_link *link;
 	int prog_fd, err;
 
 	if (pfd < 0) {
@@ -7005,7 +7090,7 @@ struct bpf_link *bpf_program__attach_perf_event(struct bpf_program *prog,
 	link = calloc(1, sizeof(*link));
 	if (!link)
 		return ERR_PTR(-ENOMEM);
-	link->link.detach = &bpf_link__detach_perf_event;
+	link->detach = &bpf_link__detach_perf_event;
 	link->fd = pfd;
 
 	if (ioctl(pfd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0) {
@@ -7024,7 +7109,7 @@ struct bpf_link *bpf_program__attach_perf_event(struct bpf_program *prog,
 			   libbpf_strerror_r(err, errmsg, sizeof(errmsg)));
 		return ERR_PTR(err);
 	}
-	return (struct bpf_link *)link;
+	return link;
 }
 
 /*
@@ -7312,18 +7397,11 @@ static struct bpf_link *attach_tp(const struct bpf_sec_def *sec,
 	return link;
 }
 
-static int bpf_link__detach_fd(struct bpf_link *link)
-{
-	struct bpf_link_fd *l = (void *)link;
-
-	return close(l->fd);
-}
-
 struct bpf_link *bpf_program__attach_raw_tracepoint(struct bpf_program *prog,
 						    const char *tp_name)
 {
 	char errmsg[STRERR_BUFSIZE];
-	struct bpf_link_fd *link;
+	struct bpf_link *link;
 	int prog_fd, pfd;
 
 	prog_fd = bpf_program__fd(prog);
@@ -7336,7 +7414,7 @@ struct bpf_link *bpf_program__attach_raw_tracepoint(struct bpf_program *prog,
 	link = calloc(1, sizeof(*link));
 	if (!link)
 		return ERR_PTR(-ENOMEM);
-	link->link.detach = &bpf_link__detach_fd;
+	link->detach = &bpf_link__detach_fd;
 
 	pfd = bpf_raw_tracepoint_open(tp_name, prog_fd);
 	if (pfd < 0) {
@@ -7348,7 +7426,7 @@ struct bpf_link *bpf_program__attach_raw_tracepoint(struct bpf_program *prog,
 		return ERR_PTR(pfd);
 	}
 	link->fd = pfd;
-	return (struct bpf_link *)link;
+	return link;
 }
 
 static struct bpf_link *attach_raw_tp(const struct bpf_sec_def *sec,
@@ -7362,7 +7440,7 @@ static struct bpf_link *attach_raw_tp(const struct bpf_sec_def *sec,
 struct bpf_link *bpf_program__attach_trace(struct bpf_program *prog)
 {
 	char errmsg[STRERR_BUFSIZE];
-	struct bpf_link_fd *link;
+	struct bpf_link *link;
 	int prog_fd, pfd;
 
 	prog_fd = bpf_program__fd(prog);
@@ -7375,7 +7453,7 @@ struct bpf_link *bpf_program__attach_trace(struct bpf_program *prog)
 	link = calloc(1, sizeof(*link));
 	if (!link)
 		return ERR_PTR(-ENOMEM);
-	link->link.detach = &bpf_link__detach_fd;
+	link->detach = &bpf_link__detach_fd;
 
 	pfd = bpf_raw_tracepoint_open(NULL, prog_fd);
 	if (pfd < 0) {
@@ -7409,10 +7487,9 @@ struct bpf_link *bpf_program__attach(struct bpf_program *prog)
 
 static int bpf_link__detach_struct_ops(struct bpf_link *link)
 {
-	struct bpf_link_fd *l = (void *)link;
 	__u32 zero = 0;
 
-	if (bpf_map_delete_elem(l->fd, &zero))
+	if (bpf_map_delete_elem(link->fd, &zero))
 		return -errno;
 
 	return 0;
@@ -7421,7 +7498,7 @@ static int bpf_link__detach_struct_ops(struct bpf_link *link)
 struct bpf_link *bpf_map__attach_struct_ops(struct bpf_map *map)
 {
 	struct bpf_struct_ops *st_ops;
-	struct bpf_link_fd *link;
+	struct bpf_link *link;
 	__u32 i, zero = 0;
 	int err;
 
@@ -7453,10 +7530,10 @@ struct bpf_link *bpf_map__attach_struct_ops(struct bpf_map *map)
 		return ERR_PTR(err);
 	}
 
-	link->link.detach = bpf_link__detach_struct_ops;
+	link->detach = bpf_link__detach_struct_ops;
 	link->fd = map->fd;
 
-	return (struct bpf_link *)link;
+	return link;
 }
 
 enum bpf_perf_event_ret
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 02fc58a21a7f..d38d7a629417 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -219,6 +219,11 @@ LIBBPF_API void bpf_program__unload(struct bpf_program *prog);
 
 struct bpf_link;
 
+LIBBPF_API struct bpf_link *bpf_link__open(const char *path);
+LIBBPF_API int bpf_link__fd(const struct bpf_link *link);
+LIBBPF_API const char *bpf_link__pin_path(const struct bpf_link *link);
+LIBBPF_API int bpf_link__pin(struct bpf_link *link, const char *path);
+LIBBPF_API int bpf_link__unpin(struct bpf_link *link);
 LIBBPF_API void bpf_link__disconnect(struct bpf_link *link);
 LIBBPF_API int bpf_link__destroy(struct bpf_link *link);
 
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 7b014c8cdece..5129283c0284 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -238,5 +238,10 @@ LIBBPF_0.0.7 {
 
 LIBBPF_0.0.8 {
 	global:
+		bpf_link__fd;
+		bpf_link__open;
+		bpf_link__pin;
+		bpf_link__pin_path;
+		bpf_link__unpin;
 		bpf_program__set_attach_target;
 } LIBBPF_0.0.7;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH bpf-next 3/3] selftests/bpf: add link pinning selftests
  2020-02-28 22:39 [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction Andrii Nakryiko
  2020-02-28 22:39 ` [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction Andrii Nakryiko
  2020-02-28 22:39 ` [PATCH bpf-next 2/3] libbpf: add bpf_link pinning/unpinning Andrii Nakryiko
@ 2020-02-28 22:39 ` Andrii Nakryiko
  2020-03-02 10:11 ` [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction Toke Høiland-Jørgensen
  3 siblings, 0 replies; 50+ messages in thread
From: Andrii Nakryiko @ 2020-02-28 22:39 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel; +Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Add selftests validating link pinning/unpinning and associated BPF link
(attachment) lifetime.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 .../selftests/bpf/prog_tests/link_pinning.c   | 105 ++++++++++++++++++
 .../selftests/bpf/progs/test_link_pinning.c   |  25 +++++
 2 files changed, 130 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/link_pinning.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_link_pinning.c

diff --git a/tools/testing/selftests/bpf/prog_tests/link_pinning.c b/tools/testing/selftests/bpf/prog_tests/link_pinning.c
new file mode 100644
index 000000000000..a743288cf384
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/link_pinning.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#include <test_progs.h>
+#include <sys/stat.h>
+
+#include "test_link_pinning.skel.h"
+
+static int duration = 0;
+
+void test_link_pinning_subtest(struct bpf_program *prog,
+			       struct test_link_pinning__bss *bss)
+{
+	const char *link_pin_path = "/sys/fs/bpf/pinned_link_test";
+	struct stat statbuf = {};
+	struct bpf_link *link;
+	int err, i;
+
+	link = bpf_program__attach(prog);
+	if (CHECK(IS_ERR(link), "link_attach", "err: %ld\n", PTR_ERR(link)))
+		goto cleanup;
+
+	bss->in = 1;
+	usleep(1);
+	CHECK(bss->out != 1, "res_check1", "exp %d, got %d\n", 1, bss->out);
+
+	/* pin link */
+	err = bpf_link__pin(link, link_pin_path);
+	if (CHECK(err, "link_pin", "err: %d\n", err))
+		goto cleanup;
+
+	CHECK(strcmp(link_pin_path, bpf_link__pin_path(link)), "pin_path1",
+	      "exp %s, got %s\n", link_pin_path, bpf_link__pin_path(link));
+
+	/* check that link was pinned */
+	err = stat(link_pin_path, &statbuf);
+	if (CHECK(err, "stat_link", "err %d errno %d\n", err, errno))
+		goto cleanup;
+
+	bss->in = 2;
+	usleep(1);
+	CHECK(bss->out != 2, "res_check2", "exp %d, got %d\n", 2, bss->out);
+
+	/* destroy link, pinned link should keep program attached */
+	bpf_link__destroy(link);
+	link = NULL;
+
+	bss->in = 3;
+	usleep(1);
+	CHECK(bss->out != 3, "res_check3", "exp %d, got %d\n", 3, bss->out);
+
+	/* re-open link from BPFFS */
+	link = bpf_link__open(link_pin_path);
+	if (CHECK(IS_ERR(link), "link_open", "err: %ld\n", PTR_ERR(link)))
+		goto cleanup;
+
+	CHECK(strcmp(link_pin_path, bpf_link__pin_path(link)), "pin_path2",
+	      "exp %s, got %s\n", link_pin_path, bpf_link__pin_path(link));
+
+	/* unpin link from BPFFS, program still attached */
+	err = bpf_link__unpin(link);
+	if (CHECK(err, "link_unpin", "err: %d\n", err))
+		goto cleanup;
+
+	/* still active, as we have FD open now */
+	bss->in = 4;
+	usleep(1);
+	CHECK(bss->out != 4, "res_check4", "exp %d, got %d\n", 4, bss->out);
+
+	bpf_link__destroy(link);
+	link = NULL;
+
+	/* Validate it's finally detached.
+	 * Actual detachment might get delayed a bit, so there is no reliable
+	 * way to validate it immediately here, let's count up for long enough
+	 * and see if eventually output stops being updated
+	 */
+	for (i = 5; i < 10000; i++) {
+		bss->in = i;
+		usleep(1);
+		if (bss->out == i - 1)
+			break;
+	}
+	CHECK(i == 10000, "link_attached", "got to iteration #%d\n", i);
+
+cleanup:
+	if (!IS_ERR(link))
+		bpf_link__destroy(link);
+}
+
+void test_link_pinning(void)
+{
+	struct test_link_pinning* skel;
+
+	skel = test_link_pinning__open_and_load();
+	if (CHECK(!skel, "skel_open", "failed to open skeleton\n"))
+		return;
+
+	if (test__start_subtest("pin_raw_tp"))
+		test_link_pinning_subtest(skel->progs.raw_tp_prog, skel->bss);
+	if (test__start_subtest("pin_tp_btf"))
+		test_link_pinning_subtest(skel->progs.tp_btf_prog, skel->bss);
+
+	test_link_pinning__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_link_pinning.c b/tools/testing/selftests/bpf/progs/test_link_pinning.c
new file mode 100644
index 000000000000..bbf2a5264dc0
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_link_pinning.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+int in = 0;
+int out = 0;
+
+SEC("raw_tp/sys_enter")
+int raw_tp_prog(const void *ctx)
+{
+	out = in;
+	return 0;
+}
+
+SEC("tp_btf/sys_enter")
+int tp_btf_prog(const void *ctx)
+{
+	out = in;
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-02-28 22:39 [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2020-02-28 22:39 ` [PATCH bpf-next 3/3] selftests/bpf: add link pinning selftests Andrii Nakryiko
@ 2020-03-02 10:11 ` Toke Høiland-Jørgensen
  2020-03-02 18:05   ` Andrii Nakryiko
  3 siblings, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-02 10:11 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Andrii Nakryiko <andriin@fb.com> writes:

> This patch series adds bpf_link abstraction, analogous to libbpf's already
> existing bpf_link abstraction. This formalizes and makes more uniform existing
> bpf_link-like BPF program link (attachment) types (raw tracepoint and tracing
> links), which are FD-based objects that are automatically detached when last
> file reference is closed. These types of BPF program links are switched to
> using bpf_link framework.
>
> FD-based bpf_link approach provides great safety guarantees, by ensuring there
> is not going to be an abandoned BPF program attached, if user process suddenly
> exits or forgets to clean up after itself. This is especially important in
> production environment and is what all the recent new BPF link types followed.
>
> One of the previously existing  inconveniences of FD-based approach, though,
> was the scenario in which user process wants to install BPF link and exit, but
> let attached BPF program run. Now, with bpf_link abstraction in place, it's
> easy to support pinning links in BPF FS, which is done as part of the same
> patch #1. This allows FD-based BPF program links to survive exit of a user
> process and original file descriptor being closed, by creating a file entry
> in BPF FS. This provides great safety by default, with simple way to opt out
> for cases where it's needed.

While being able to pin the fds returned by bpf_raw_tracepoint_open()
certainly helps, I still feel like this is the wrong abstraction for
freplace(): When I'm building a program using freplace to put in new
functions (say, an XDP multi-prog dispatcher :)), I really want the
'new' functions (i.e., the freplace'd bpf_progs) to share their lifetime
with the calling BPF program. I.e., I want to be able to do something
like:

prog_fd = sys_bpf(BPF_PROG_LOAD, ...); // dispatcher
func_fd = sys_bpf(BPF_PROG_LOAD, ...); // replacement func
err = sys_bpf(BPF_PROG_REPLACE_FUNC, prog_fd, btf_id, func_fd); // does *not* return an fd

That last call should make the ref-counting be in the prog_fd -> func_fd
direction, so that when prog_fd is released, it will do
bpf_prog_put(func_fd). There could be an additional call like
sys_bpf(BPF_PROG_REPLACE_FUNC_DETACH, prog_fd, btf_id) for explicit
detach as well, of course.

With such an API, lifecycle management for an XDP program keeps being
obvious: There's an fd for the root program attached to the interface,
and that's it. When that is released the whole thing disappears. Whereas
with the bpf_raw_tracepoint_open() API, the userspace program suddenly
has to make sure all the component function FDs are pinned, which seems
cumbersome and error-prone...

I'll try to propose patches for what this could look like; I think it
could co-exist with this bpf_link abstraction, though, so no need to
hold up this series...

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction
  2020-02-28 22:39 ` [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction Andrii Nakryiko
@ 2020-03-02 10:13   ` Toke Høiland-Jørgensen
  2020-03-02 18:06     ` Andrii Nakryiko
  2020-03-03  2:50   ` Alexei Starovoitov
  1 sibling, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-02 10:13 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Andrii Nakryiko <andriin@fb.com> writes:

> Introduce bpf_link abstraction, representing an attachment of BPF program to
> a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
> ownership of attached BPF program, reference counting of a link itself, when
> referenced from multiple anonymous inodes, as well as ensures that release
> callback will be called from a process context, so that users can safely take
> mutex locks and sleep.
>
> Additionally, with a new abstraction it's now possible to generalize pinning
> of a link object in BPF FS, allowing to explicitly prevent BPF program
> detachment on process exit by pinning it in a BPF FS and let it open from
> independent other process to keep working with it.
>
> Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
> program attachments) into utilizing bpf_link framework, making them pinnable
> in BPF FS. More FD-based bpf_links will be added in follow up patches.
>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>  include/linux/bpf.h  |  13 +++
>  kernel/bpf/inode.c   |  42 ++++++++-
>  kernel/bpf/syscall.c | 209 ++++++++++++++++++++++++++++++++++++-------
>  3 files changed, 226 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 6015a4daf118..f13c78c6f29d 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1056,6 +1056,19 @@ extern int sysctl_unprivileged_bpf_disabled;
>  int bpf_map_new_fd(struct bpf_map *map, int flags);
>  int bpf_prog_new_fd(struct bpf_prog *prog);
>  
> +struct bpf_link;
> +
> +struct bpf_link_ops {
> +	void (*release)(struct bpf_link *link);
> +};
> +
> +void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops,
> +		   struct bpf_prog *prog);
> +void bpf_link_inc(struct bpf_link *link);
> +void bpf_link_put(struct bpf_link *link);
> +int bpf_link_new_fd(struct bpf_link *link);
> +struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> +
>  int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>  int bpf_obj_get_user(const char __user *pathname, int flags);
>  
> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> index 5e40e7fccc21..95087d9f4ed3 100644
> --- a/kernel/bpf/inode.c
> +++ b/kernel/bpf/inode.c
> @@ -25,6 +25,7 @@ enum bpf_type {
>  	BPF_TYPE_UNSPEC	= 0,
>  	BPF_TYPE_PROG,
>  	BPF_TYPE_MAP,
> +	BPF_TYPE_LINK,
>  };
>  
>  static void *bpf_any_get(void *raw, enum bpf_type type)
> @@ -36,6 +37,9 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
>  	case BPF_TYPE_MAP:
>  		bpf_map_inc_with_uref(raw);
>  		break;
> +	case BPF_TYPE_LINK:
> +		bpf_link_inc(raw);
> +		break;
>  	default:
>  		WARN_ON_ONCE(1);
>  		break;
> @@ -53,6 +57,9 @@ static void bpf_any_put(void *raw, enum bpf_type type)
>  	case BPF_TYPE_MAP:
>  		bpf_map_put_with_uref(raw);
>  		break;
> +	case BPF_TYPE_LINK:
> +		bpf_link_put(raw);
> +		break;
>  	default:
>  		WARN_ON_ONCE(1);
>  		break;
> @@ -63,20 +70,32 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
>  {
>  	void *raw;
>  
> -	*type = BPF_TYPE_MAP;
>  	raw = bpf_map_get_with_uref(ufd);
> -	if (IS_ERR(raw)) {
> +	if (!IS_ERR(raw)) {
> +		*type = BPF_TYPE_MAP;
> +		return raw;
> +	}
> +
> +	raw = bpf_prog_get(ufd);
> +	if (!IS_ERR(raw)) {
>  		*type = BPF_TYPE_PROG;
> -		raw = bpf_prog_get(ufd);
> +		return raw;
>  	}
>  
> -	return raw;
> +	raw = bpf_link_get_from_fd(ufd);
> +	if (!IS_ERR(raw)) {
> +		*type = BPF_TYPE_LINK;
> +		return raw;
> +	}
> +
> +	return ERR_PTR(-EINVAL);
>  }
>  
>  static const struct inode_operations bpf_dir_iops;
>  
>  static const struct inode_operations bpf_prog_iops = { };
>  static const struct inode_operations bpf_map_iops  = { };
> +static const struct inode_operations bpf_link_iops  = { };
>  
>  static struct inode *bpf_get_inode(struct super_block *sb,
>  				   const struct inode *dir,
> @@ -114,6 +133,8 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
>  		*type = BPF_TYPE_PROG;
>  	else if (inode->i_op == &bpf_map_iops)
>  		*type = BPF_TYPE_MAP;
> +	else if (inode->i_op == &bpf_link_iops)
> +		*type = BPF_TYPE_LINK;
>  	else
>  		return -EACCES;
>  
> @@ -335,6 +356,12 @@ static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg)
>  			     &bpffs_map_fops : &bpffs_obj_fops);
>  }
>  
> +static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg)
> +{
> +	return bpf_mkobj_ops(dentry, mode, arg, &bpf_link_iops,
> +			     &bpffs_obj_fops);
> +}
> +
>  static struct dentry *
>  bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags)
>  {
> @@ -411,6 +438,9 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
>  	case BPF_TYPE_MAP:
>  		ret = vfs_mkobj(dentry, mode, bpf_mkmap, raw);
>  		break;
> +	case BPF_TYPE_LINK:
> +		ret = vfs_mkobj(dentry, mode, bpf_mklink, raw);
> +		break;
>  	default:
>  		ret = -EPERM;
>  	}
> @@ -487,6 +517,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
>  		ret = bpf_prog_new_fd(raw);
>  	else if (type == BPF_TYPE_MAP)
>  		ret = bpf_map_new_fd(raw, f_flags);
> +	else if (type == BPF_TYPE_LINK)
> +		ret = bpf_link_new_fd(raw);
>  	else
>  		return -ENOENT;
>  
> @@ -504,6 +536,8 @@ static struct bpf_prog *__get_prog_inode(struct inode *inode, enum bpf_prog_type
>  
>  	if (inode->i_op == &bpf_map_iops)
>  		return ERR_PTR(-EINVAL);
> +	if (inode->i_op == &bpf_link_iops)
> +		return ERR_PTR(-EINVAL);
>  	if (inode->i_op != &bpf_prog_iops)
>  		return ERR_PTR(-EACCES);
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index c536c65256ad..fca8de7e7872 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2173,23 +2173,153 @@ static int bpf_obj_get(const union bpf_attr *attr)
>  				attr->file_flags);
>  }
>  
> -static int bpf_tracing_prog_release(struct inode *inode, struct file *filp)
> +struct bpf_link {
> +	atomic64_t refcnt;

refcount_t ?

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 2/3] libbpf: add bpf_link pinning/unpinning
  2020-02-28 22:39 ` [PATCH bpf-next 2/3] libbpf: add bpf_link pinning/unpinning Andrii Nakryiko
@ 2020-03-02 10:16   ` Toke Høiland-Jørgensen
  2020-03-02 18:09     ` Andrii Nakryiko
  0 siblings, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-02 10:16 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Andrii Nakryiko <andriin@fb.com> writes:

> With bpf_link abstraction supported by kernel explicitly, add
> pinning/unpinning API for links. Also allow to create (open) bpf_link from BPF
> FS file.
>
> This API allows to have an "ephemeral" FD-based BPF links (like raw tracepoint
> or fexit/freplace attachments) surviving user process exit, by pinning them in
> a BPF FS, which is an important use case for long-running BPF programs.
>
> As part of this, expose underlying FD for bpf_link. While legacy bpf_link's
> might not have a FD associated with them (which will be expressed as
> a bpf_link with fd=-1), kernel's abstraction is based around FD-based usage,
> so match it closely. This, subsequently, allows to have a generic
> pinning/unpinning API for generalized bpf_link. For some types of bpf_links
> kernel might not support pinning, in which case bpf_link__pin() will return
> error.
>
> With FD being part of generic bpf_link, also get rid of bpf_link_fd in favor
> > of using vanilla bpf_link.
>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>  tools/lib/bpf/libbpf.c   | 131 +++++++++++++++++++++++++++++++--------
>  tools/lib/bpf/libbpf.h   |   5 ++
>  tools/lib/bpf/libbpf.map |   5 ++
>  3 files changed, 114 insertions(+), 27 deletions(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 996162801f7a..f8c4042e5855 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -6931,6 +6931,8 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
>  struct bpf_link {
>  	int (*detach)(struct bpf_link *link);
>  	int (*destroy)(struct bpf_link *link);
> +	char *pin_path;		/* NULL, if not pinned */
> +	int fd;			/* hook FD, -1 if not applicable */
>  	bool disconnected;
>  };
>  
> @@ -6960,26 +6962,109 @@ int bpf_link__destroy(struct bpf_link *link)
>  		err = link->detach(link);
>  	if (link->destroy)
>  		link->destroy(link);
> +	if (link->pin_path)
> +		free(link->pin_path);

This will still detach the link even if it's pinned, won't it? What's
the expectation, that the calling application just won't call
bpf_link__destroy() if it pins the link? But then it will leak memory?
Or is it just that __destroy() will close the fd, but if it's pinned the
kernel won't actually detach anything? In that case, it seems like the
function name becomes somewhat misleading?

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-02 10:11 ` [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction Toke Høiland-Jørgensen
@ 2020-03-02 18:05   ` Andrii Nakryiko
  2020-03-02 22:24     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 50+ messages in thread
From: Andrii Nakryiko @ 2020-03-02 18:05 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Mon, Mar 2, 2020 at 2:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andriin@fb.com> writes:
>
> > This patch series adds bpf_link abstraction, analogous to libbpf's already
> > existing bpf_link abstraction. This formalizes and makes more uniform existing
> > bpf_link-like BPF program link (attachment) types (raw tracepoint and tracing
> > links), which are FD-based objects that are automatically detached when last
> > file reference is closed. These types of BPF program links are switched to
> > using bpf_link framework.
> >
> > FD-based bpf_link approach provides great safety guarantees, by ensuring there
> > is not going to be an abandoned BPF program attached, if user process suddenly
> > exits or forgets to clean up after itself. This is especially important in
> > production environment and is what all the recent new BPF link types followed.
> >
> > One of the previously existing  inconveniences of FD-based approach, though,
> > was the scenario in which user process wants to install BPF link and exit, but
> > let attached BPF program run. Now, with bpf_link abstraction in place, it's
> > easy to support pinning links in BPF FS, which is done as part of the same
> > patch #1. This allows FD-based BPF program links to survive exit of a user
> > process and original file descriptor being closed, by creating a file entry
> > in BPF FS. This provides great safety by default, with simple way to opt out
> > for cases where it's needed.
>
> While being able to pin the fds returned by bpf_raw_tracepoint_open()
> certainly helps, I still feel like this is the wrong abstraction for
> freplace(): When I'm building a program using freplace to put in new
> functions (say, an XDP multi-prog dispatcher :)), I really want the
> 'new' functions (i.e., the freplace'd bpf_progs) to share their lifetime
> with the calling BPF program. I.e., I want to be able to do something
> like:

freplace programs take a refcount on the BPF program they are
replacing, so in that sense they do share lifetime, except the dependency
is the opposite of what you describe: the rootlet/dispatcher program can't
go away as long as it has at least one freplace program attached. It
(the dispatcher) might get detached, though, but the freplace program,
technically, will still be attached to the now-detached dispatcher (so it
won't be invoked, yet it's still attached). I hope that makes sense :)

>
> prog_fd = sys_bpf(BPF_PROG_LOAD, ...); // dispatcher
> func_fd = sys_bpf(BPF_PROG_LOAD, ...); // replacement func
> err = sys_bpf(BPF_PROG_REPLACE_FUNC, prog_fd, btf_id, func_fd); // does *not* return an fd
>
> That last call should make the ref-counting be in the prog_fd -> func_fd
> direction, so that when prog_fd is released, it will do
> bpf_prog_put(func_fd). There could be an additional call like
> sys_bpf(BPF_PROG_REPLACE_FUNC_DETACH, prog_fd, btf_id) for explicit
> detach as well, of course.

Taking this additional refcount will create a dependency loop (see
above), so that's why it wasn't done, I think.

With FD-based bpf_link, though, you'll be able to "transfer ownership"
from the application that installed the freplace program in the first place
to the program that will eventually unload/replace the dispatcher BPF
program. You do that by pinning the freplace program in a BPFFS location
that's known to the libxdp library, and when you need to detach and
unload the XDP dispatcher and the overridden XDP programs, the "admin
process" which manages the XDP dispatcher will be able to just go and
unpin and detach everything, if necessary.

>
> With such an API, lifecycle management for an XDP program keeps being
> obvious: There's an fd for the root program attached to the interface,
> and that's it. When that is released the whole thing disappears. Whereas
> with the bpf_raw_tracepoint_open() API, the userspace program suddenly
> has to make sure all the component function FDs are pinned, which seems
> cumbersome and error-prone...

I thought that's what libxdp is supposed to do (among other things).
So for user applications it will be all hidden inside the library API,
no?

>
> I'll try to propose patches for what this could look like; I think it
> could co-exist with this bpf_link abstraction, though, so no need to
> hold up this series...

Yeah, either way, this is important and is desired behavior not just
for freplace cases.

>
> -Toke
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction
  2020-03-02 10:13   ` Toke Høiland-Jørgensen
@ 2020-03-02 18:06     ` Andrii Nakryiko
  2020-03-02 21:40       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 50+ messages in thread
From: Andrii Nakryiko @ 2020-03-02 18:06 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Mon, Mar 2, 2020 at 2:13 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andriin@fb.com> writes:
>
> > Introduce bpf_link abstraction, representing an attachment of BPF program to
> > a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
> > ownership of attached BPF program, reference counting of a link itself, when
> > referenced from multiple anonymous inodes, as well as ensures that release
> > callback will be called from a process context, so that users can safely take
> > mutex locks and sleep.
> >
> > Additionally, with a new abstraction it's now possible to generalize pinning
> > of a link object in BPF FS, allowing to explicitly prevent BPF program
> > detachment on process exit by pinning it in a BPF FS and let it open from
> > independent other process to keep working with it.
> >
> > Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
> > program attachments) into utilizing bpf_link framework, making them pinnable
> > in BPF FS. More FD-based bpf_links will be added in follow up patches.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> >  include/linux/bpf.h  |  13 +++
> >  kernel/bpf/inode.c   |  42 ++++++++-
> >  kernel/bpf/syscall.c | 209 ++++++++++++++++++++++++++++++++++++-------
> >  3 files changed, 226 insertions(+), 38 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 6015a4daf118..f13c78c6f29d 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1056,6 +1056,19 @@ extern int sysctl_unprivileged_bpf_disabled;
> >  int bpf_map_new_fd(struct bpf_map *map, int flags);
> >  int bpf_prog_new_fd(struct bpf_prog *prog);
> >
> > +struct bpf_link;
> > +
> > +struct bpf_link_ops {
> > +     void (*release)(struct bpf_link *link);
> > +};
> > +
> > +void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops,
> > +                struct bpf_prog *prog);
> > +void bpf_link_inc(struct bpf_link *link);
> > +void bpf_link_put(struct bpf_link *link);
> > +int bpf_link_new_fd(struct bpf_link *link);
> > +struct bpf_link *bpf_link_get_from_fd(u32 ufd);
> > +
> >  int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
> >  int bpf_obj_get_user(const char __user *pathname, int flags);
> >
> > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> > index 5e40e7fccc21..95087d9f4ed3 100644
> > --- a/kernel/bpf/inode.c
> > +++ b/kernel/bpf/inode.c
> > @@ -25,6 +25,7 @@ enum bpf_type {
> >       BPF_TYPE_UNSPEC = 0,
> >       BPF_TYPE_PROG,
> >       BPF_TYPE_MAP,
> > +     BPF_TYPE_LINK,
> >  };
> >
> >  static void *bpf_any_get(void *raw, enum bpf_type type)
> > @@ -36,6 +37,9 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
> >       case BPF_TYPE_MAP:
> >               bpf_map_inc_with_uref(raw);
> >               break;
> > +     case BPF_TYPE_LINK:
> > +             bpf_link_inc(raw);
> > +             break;
> >       default:
> >               WARN_ON_ONCE(1);
> >               break;
> > @@ -53,6 +57,9 @@ static void bpf_any_put(void *raw, enum bpf_type type)
> >       case BPF_TYPE_MAP:
> >               bpf_map_put_with_uref(raw);
> >               break;
> > +     case BPF_TYPE_LINK:
> > +             bpf_link_put(raw);
> > +             break;
> >       default:
> >               WARN_ON_ONCE(1);
> >               break;
> > @@ -63,20 +70,32 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
> >  {
> >       void *raw;
> >
> > -     *type = BPF_TYPE_MAP;
> >       raw = bpf_map_get_with_uref(ufd);
> > -     if (IS_ERR(raw)) {
> > +     if (!IS_ERR(raw)) {
> > +             *type = BPF_TYPE_MAP;
> > +             return raw;
> > +     }
> > +
> > +     raw = bpf_prog_get(ufd);
> > +     if (!IS_ERR(raw)) {
> >               *type = BPF_TYPE_PROG;
> > -             raw = bpf_prog_get(ufd);
> > +             return raw;
> >       }
> >
> > -     return raw;
> > +     raw = bpf_link_get_from_fd(ufd);
> > +     if (!IS_ERR(raw)) {
> > +             *type = BPF_TYPE_LINK;
> > +             return raw;
> > +     }
> > +
> > +     return ERR_PTR(-EINVAL);
> >  }
> >
> >  static const struct inode_operations bpf_dir_iops;
> >
> >  static const struct inode_operations bpf_prog_iops = { };
> >  static const struct inode_operations bpf_map_iops  = { };
> > +static const struct inode_operations bpf_link_iops  = { };
> >
> >  static struct inode *bpf_get_inode(struct super_block *sb,
> >                                  const struct inode *dir,
> > @@ -114,6 +133,8 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
> >               *type = BPF_TYPE_PROG;
> >       else if (inode->i_op == &bpf_map_iops)
> >               *type = BPF_TYPE_MAP;
> > +     else if (inode->i_op == &bpf_link_iops)
> > +             *type = BPF_TYPE_LINK;
> >       else
> >               return -EACCES;
> >
> > @@ -335,6 +356,12 @@ static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg)
> >                            &bpffs_map_fops : &bpffs_obj_fops);
> >  }
> >
> > +static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg)
> > +{
> > +     return bpf_mkobj_ops(dentry, mode, arg, &bpf_link_iops,
> > +                          &bpffs_obj_fops);
> > +}
> > +
> >  static struct dentry *
> >  bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags)
> >  {
> > @@ -411,6 +438,9 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
> >       case BPF_TYPE_MAP:
> >               ret = vfs_mkobj(dentry, mode, bpf_mkmap, raw);
> >               break;
> > +     case BPF_TYPE_LINK:
> > +             ret = vfs_mkobj(dentry, mode, bpf_mklink, raw);
> > +             break;
> >       default:
> >               ret = -EPERM;
> >       }
> > @@ -487,6 +517,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
> >               ret = bpf_prog_new_fd(raw);
> >       else if (type == BPF_TYPE_MAP)
> >               ret = bpf_map_new_fd(raw, f_flags);
> > +     else if (type == BPF_TYPE_LINK)
> > +             ret = bpf_link_new_fd(raw);
> >       else
> >               return -ENOENT;
> >
> > @@ -504,6 +536,8 @@ static struct bpf_prog *__get_prog_inode(struct inode *inode, enum bpf_prog_type
> >
> >       if (inode->i_op == &bpf_map_iops)
> >               return ERR_PTR(-EINVAL);
> > +     if (inode->i_op == &bpf_link_iops)
> > +             return ERR_PTR(-EINVAL);
> >       if (inode->i_op != &bpf_prog_iops)
> >               return ERR_PTR(-EACCES);
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index c536c65256ad..fca8de7e7872 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2173,23 +2173,153 @@ static int bpf_obj_get(const union bpf_attr *attr)
> >                               attr->file_flags);
> >  }
> >
> > -static int bpf_tracing_prog_release(struct inode *inode, struct file *filp)
> > +struct bpf_link {
> > +     atomic64_t refcnt;
>
> refcount_t ?

Both bpf_map and bpf_prog stick to atomic64 for their refcounting, so
I'd like to stay consistent and use a refcount that can't possibly leak
resources (which refcount_t can, if it overflows).

>
> -Toke
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 2/3] libbpf: add bpf_link pinning/unpinning
  2020-03-02 10:16   ` Toke Høiland-Jørgensen
@ 2020-03-02 18:09     ` Andrii Nakryiko
  2020-03-02 21:45       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 50+ messages in thread
From: Andrii Nakryiko @ 2020-03-02 18:09 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Mon, Mar 2, 2020 at 2:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andriin@fb.com> writes:
>
> > With bpf_link abstraction supported by kernel explicitly, add
> > pinning/unpinning API for links. Also allow to create (open) bpf_link from BPF
> > FS file.
> >
> > This API allows to have an "ephemeral" FD-based BPF links (like raw tracepoint
> > or fexit/freplace attachments) surviving user process exit, by pinning them in
> > a BPF FS, which is an important use case for long-running BPF programs.
> >
> > As part of this, expose underlying FD for bpf_link. While legacy bpf_link's
> > might not have a FD associated with them (which will be expressed as
> > a bpf_link with fd=-1), kernel's abstraction is based around FD-based usage,
> > so match it closely. This, subsequently, allows to have a generic
> > pinning/unpinning API for generalized bpf_link. For some types of bpf_links
> > kernel might not support pinning, in which case bpf_link__pin() will return
> > error.
> >
> > With FD being part of generic bpf_link, also get rid of bpf_link_fd in favor
> > of using vanilla bpf_link.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> >  tools/lib/bpf/libbpf.c   | 131 +++++++++++++++++++++++++++++++--------
> >  tools/lib/bpf/libbpf.h   |   5 ++
> >  tools/lib/bpf/libbpf.map |   5 ++
> >  3 files changed, 114 insertions(+), 27 deletions(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 996162801f7a..f8c4042e5855 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -6931,6 +6931,8 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
> >  struct bpf_link {
> >       int (*detach)(struct bpf_link *link);
> >       int (*destroy)(struct bpf_link *link);
> > +     char *pin_path;         /* NULL, if not pinned */
> > +     int fd;                 /* hook FD, -1 if not applicable */
> >       bool disconnected;
> >  };
> >
> > @@ -6960,26 +6962,109 @@ int bpf_link__destroy(struct bpf_link *link)
> >               err = link->detach(link);
> >       if (link->destroy)
> >               link->destroy(link);
> > +     if (link->pin_path)
> > +             free(link->pin_path);
>
> This will still detach the link even if it's pinned, won't it? What's

No, this will just free pin_path string memory.

> the expectation, that the calling application just won't call
> bpf_link__destroy() if it pins the link? But then it will leak memory?
> Or is it just that __destroy() will close the fd, but if it's pinned the
> kernel won't actually detach anything? In that case, it seems like the
> function name becomes somewhat misleading?

Yes, the latter, it will close its own FD, but if someone else holds
another open FD against the same bpf_link (due to pinning, or if you
shared the FD with a child process, etc.), then the kernel will keep
the link alive. bpf_link__destroy() is more of a "sever the link my
process has" or "destroy my local link". Maybe not an ideal name, but
it should be close enough, I think.
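
To make the intended flow concrete, this is roughly the usage I have in
mind with the APIs from patch 2 (untested sketch; assumes prog is an
already loaded struct bpf_program *):

	struct bpf_link *link;
	int err;

	link = bpf_program__attach_raw_tracepoint(prog, "sched_switch");
	if (libbpf_get_error(link))
		return -1;

	/* pinning creates an extra reference held by the BPF FS entry */
	err = bpf_link__pin(link, "/sys/fs/bpf/my_link");
	if (err) {
		bpf_link__destroy(link);
		return err;
	}

	/* closes our FD and frees pin_path, but the attachment stays
	 * alive in the kernel through the pinned file
	 */
	bpf_link__destroy(link);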

>
> -Toke
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction
  2020-03-02 18:06     ` Andrii Nakryiko
@ 2020-03-02 21:40       ` Toke Høiland-Jørgensen
  2020-03-02 23:37         ` Andrii Nakryiko
  0 siblings, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-02 21:40 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Mon, Mar 2, 2020 at 2:13 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andriin@fb.com> writes:
>>
>> > Introduce bpf_link abstraction, representing an attachment of BPF program to
>> > a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
>> > ownership of attached BPF program, reference counting of a link itself, when
>> > reference from multiple anonymous inodes, as well as ensures that release
>> > callback will be called from a process context, so that users can safely take
>> > mutex locks and sleep.
>> >
>> > Additionally, with a new abstraction it's now possible to generalize pinning
>> > of a link object in BPF FS, allowing to explicitly prevent BPF program
>> > detachment on process exit by pinning it in a BPF FS and let it open from
>> > independent other process to keep working with it.
>> >
>> > Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
>> > program attachments) into utilizing bpf_link framework, making them pinnable
>> > in BPF FS. More FD-based bpf_links will be added in follow up patches.
>> >
>> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>> > ---
>> >  include/linux/bpf.h  |  13 +++
>> >  kernel/bpf/inode.c   |  42 ++++++++-
>> >  kernel/bpf/syscall.c | 209 ++++++++++++++++++++++++++++++++++++-------
>> >  3 files changed, 226 insertions(+), 38 deletions(-)
>> >
>> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> > index 6015a4daf118..f13c78c6f29d 100644
>> > --- a/include/linux/bpf.h
>> > +++ b/include/linux/bpf.h
>> > @@ -1056,6 +1056,19 @@ extern int sysctl_unprivileged_bpf_disabled;
>> >  int bpf_map_new_fd(struct bpf_map *map, int flags);
>> >  int bpf_prog_new_fd(struct bpf_prog *prog);
>> >
>> > +struct bpf_link;
>> > +
>> > +struct bpf_link_ops {
>> > +     void (*release)(struct bpf_link *link);
>> > +};
>> > +
>> > +void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops,
>> > +                struct bpf_prog *prog);
>> > +void bpf_link_inc(struct bpf_link *link);
>> > +void bpf_link_put(struct bpf_link *link);
>> > +int bpf_link_new_fd(struct bpf_link *link);
>> > +struct bpf_link *bpf_link_get_from_fd(u32 ufd);
>> > +
>> >  int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
>> >  int bpf_obj_get_user(const char __user *pathname, int flags);
>> >
>> > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
>> > index 5e40e7fccc21..95087d9f4ed3 100644
>> > --- a/kernel/bpf/inode.c
>> > +++ b/kernel/bpf/inode.c
>> > @@ -25,6 +25,7 @@ enum bpf_type {
>> >       BPF_TYPE_UNSPEC = 0,
>> >       BPF_TYPE_PROG,
>> >       BPF_TYPE_MAP,
>> > +     BPF_TYPE_LINK,
>> >  };
>> >
>> >  static void *bpf_any_get(void *raw, enum bpf_type type)
>> > @@ -36,6 +37,9 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
>> >       case BPF_TYPE_MAP:
>> >               bpf_map_inc_with_uref(raw);
>> >               break;
>> > +     case BPF_TYPE_LINK:
>> > +             bpf_link_inc(raw);
>> > +             break;
>> >       default:
>> >               WARN_ON_ONCE(1);
>> >               break;
>> > @@ -53,6 +57,9 @@ static void bpf_any_put(void *raw, enum bpf_type type)
>> >       case BPF_TYPE_MAP:
>> >               bpf_map_put_with_uref(raw);
>> >               break;
>> > +     case BPF_TYPE_LINK:
>> > +             bpf_link_put(raw);
>> > +             break;
>> >       default:
>> >               WARN_ON_ONCE(1);
>> >               break;
>> > @@ -63,20 +70,32 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
>> >  {
>> >       void *raw;
>> >
>> > -     *type = BPF_TYPE_MAP;
>> >       raw = bpf_map_get_with_uref(ufd);
>> > -     if (IS_ERR(raw)) {
>> > +     if (!IS_ERR(raw)) {
>> > +             *type = BPF_TYPE_MAP;
>> > +             return raw;
>> > +     }
>> > +
>> > +     raw = bpf_prog_get(ufd);
>> > +     if (!IS_ERR(raw)) {
>> >               *type = BPF_TYPE_PROG;
>> > -             raw = bpf_prog_get(ufd);
>> > +             return raw;
>> >       }
>> >
>> > -     return raw;
>> > +     raw = bpf_link_get_from_fd(ufd);
>> > +     if (!IS_ERR(raw)) {
>> > +             *type = BPF_TYPE_LINK;
>> > +             return raw;
>> > +     }
>> > +
>> > +     return ERR_PTR(-EINVAL);
>> >  }
>> >
>> >  static const struct inode_operations bpf_dir_iops;
>> >
>> >  static const struct inode_operations bpf_prog_iops = { };
>> >  static const struct inode_operations bpf_map_iops  = { };
>> > +static const struct inode_operations bpf_link_iops  = { };
>> >
>> >  static struct inode *bpf_get_inode(struct super_block *sb,
>> >                                  const struct inode *dir,
>> > @@ -114,6 +133,8 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
>> >               *type = BPF_TYPE_PROG;
>> >       else if (inode->i_op == &bpf_map_iops)
>> >               *type = BPF_TYPE_MAP;
>> > +     else if (inode->i_op == &bpf_link_iops)
>> > +             *type = BPF_TYPE_LINK;
>> >       else
>> >               return -EACCES;
>> >
>> > @@ -335,6 +356,12 @@ static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg)
>> >                            &bpffs_map_fops : &bpffs_obj_fops);
>> >  }
>> >
>> > +static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg)
>> > +{
>> > +     return bpf_mkobj_ops(dentry, mode, arg, &bpf_link_iops,
>> > +                          &bpffs_obj_fops);
>> > +}
>> > +
>> >  static struct dentry *
>> >  bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags)
>> >  {
>> > @@ -411,6 +438,9 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw,
>> >       case BPF_TYPE_MAP:
>> >               ret = vfs_mkobj(dentry, mode, bpf_mkmap, raw);
>> >               break;
>> > +     case BPF_TYPE_LINK:
>> > +             ret = vfs_mkobj(dentry, mode, bpf_mklink, raw);
>> > +             break;
>> >       default:
>> >               ret = -EPERM;
>> >       }
>> > @@ -487,6 +517,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags)
>> >               ret = bpf_prog_new_fd(raw);
>> >       else if (type == BPF_TYPE_MAP)
>> >               ret = bpf_map_new_fd(raw, f_flags);
>> > +     else if (type == BPF_TYPE_LINK)
>> > +             ret = bpf_link_new_fd(raw);
>> >       else
>> >               return -ENOENT;
>> >
>> > @@ -504,6 +536,8 @@ static struct bpf_prog *__get_prog_inode(struct inode *inode, enum bpf_prog_type
>> >
>> >       if (inode->i_op == &bpf_map_iops)
>> >               return ERR_PTR(-EINVAL);
>> > +     if (inode->i_op == &bpf_link_iops)
>> > +             return ERR_PTR(-EINVAL);
>> >       if (inode->i_op != &bpf_prog_iops)
>> >               return ERR_PTR(-EACCES);
>> >
>> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> > index c536c65256ad..fca8de7e7872 100644
>> > --- a/kernel/bpf/syscall.c
>> > +++ b/kernel/bpf/syscall.c
>> > @@ -2173,23 +2173,153 @@ static int bpf_obj_get(const union bpf_attr *attr)
>> >                               attr->file_flags);
>> >  }
>> >
>> > -static int bpf_tracing_prog_release(struct inode *inode, struct file *filp)
>> > +struct bpf_link {
>> > +     atomic64_t refcnt;
>>
>> refcount_t ?
>
> Both bpf_map and bpf_prog stick to atomic64 for their refcounting, so
> I'd like to stay consistent and use a refcount that can't possibly leak
> resources (which refcount_t can, if it overflows).

refcount_t is specifically supposed to turn a possible use-after-free on
under/overflow into a warning, isn't it? Not going to insist or anything
here, just found it odd that you'd prefer the other...

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 2/3] libbpf: add bpf_link pinning/unpinning
  2020-03-02 18:09     ` Andrii Nakryiko
@ 2020-03-02 21:45       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-02 21:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Mon, Mar 2, 2020 at 2:17 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andriin@fb.com> writes:
>>
>> > With bpf_link abstraction supported by kernel explicitly, add
>> > pinning/unpinning API for links. Also allow to create (open) bpf_link from BPF
>> > FS file.
>> >
>> > This API allows to have an "ephemeral" FD-based BPF links (like raw tracepoint
>> > or fexit/freplace attachments) surviving user process exit, by pinning them in
>> > a BPF FS, which is an important use case for long-running BPF programs.
>> >
>> > As part of this, expose underlying FD for bpf_link. While legacy bpf_link's
>> > might not have a FD associated with them (which will be expressed as
>> > a bpf_link with fd=-1), kernel's abstraction is based around FD-based usage,
>> > so match it closely. This, subsequently, allows to have a generic
>> > pinning/unpinning API for generalized bpf_link. For some types of bpf_links
>> > kernel might not support pinning, in which case bpf_link__pin() will return
>> > error.
>> >
>> > With FD being part of generic bpf_link, also get rid of bpf_link_fd in favor
>> > of using vanilla bpf_link.
>> >
>> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>> > ---
>> >  tools/lib/bpf/libbpf.c   | 131 +++++++++++++++++++++++++++++++--------
>> >  tools/lib/bpf/libbpf.h   |   5 ++
>> >  tools/lib/bpf/libbpf.map |   5 ++
>> >  3 files changed, 114 insertions(+), 27 deletions(-)
>> >
>> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> > index 996162801f7a..f8c4042e5855 100644
>> > --- a/tools/lib/bpf/libbpf.c
>> > +++ b/tools/lib/bpf/libbpf.c
>> > @@ -6931,6 +6931,8 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
>> >  struct bpf_link {
>> >       int (*detach)(struct bpf_link *link);
>> >       int (*destroy)(struct bpf_link *link);
>> > +     char *pin_path;         /* NULL, if not pinned */
>> > +     int fd;                 /* hook FD, -1 if not applicable */
>> >       bool disconnected;
>> >  };
>> >
>> > @@ -6960,26 +6962,109 @@ int bpf_link__destroy(struct bpf_link *link)
>> >               err = link->detach(link);
>> >       if (link->destroy)
>> >               link->destroy(link);
>> > +     if (link->pin_path)
>> > +             free(link->pin_path);
>>
>> This will still detach the link even if it's pinned, won't it? What's
>
> No, this will just free pin_path string memory.

I meant the containing function; i.e., the link->detach() call above
will close the fd.

>> the expectation, that the calling application just won't call
>> bpf_link__destroy() if it pins the link? But then it will leak memory?
>> Or is it just that __destroy() will close the fd, but if it's pinned the
>> kernel won't actually detach anything? In that case, it seems like the
>> function name becomes somewhat misleading?
>
> Yes, the latter, it will close its own FD, but if someone else has
> open other FD against the same bpf_link (due to pinning or if you
> shared FD with child process, etc), then kernel will keep it.
> bpf_link__destroy() is more of a "sever the link my process has" or
> "destroy my local link". Maybe not ideal name, but should be close
> enough, I think.

Hmm, yeah, OK, I guess I can live with it ;)

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-02 18:05   ` Andrii Nakryiko
@ 2020-03-02 22:24     ` Toke Høiland-Jørgensen
  2020-03-02 23:35       ` Andrii Nakryiko
  2020-03-03  8:12       ` Daniel Borkmann
  0 siblings, 2 replies; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-02 22:24 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Mon, Mar 2, 2020 at 2:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andriin@fb.com> writes:
>>
>> > This patch series adds bpf_link abstraction, analogous to libbpf's already
>> > existing bpf_link abstraction. This formalizes and makes more uniform existing
>> > bpf_link-like BPF program link (attachment) types (raw tracepoint and tracing
>> > links), which are FD-based objects that are automatically detached when last
>> > file reference is closed. These types of BPF program links are switched to
>> > using bpf_link framework.
>> >
>> > FD-based bpf_link approach provides great safety guarantees, by ensuring there
>> > is not going to be an abandoned BPF program attached, if user process suddenly
>> > exits or forgets to clean up after itself. This is especially important in
>> > production environment and is what all the recent new BPF link types followed.
>> >
>> > One of the previously existing  inconveniences of FD-based approach, though,
>> > was the scenario in which user process wants to install BPF link and exit, but
>> > let attached BPF program run. Now, with bpf_link abstraction in place, it's
>> > easy to support pinning links in BPF FS, which is done as part of the same
>> > patch #1. This allows FD-based BPF program links to survive exit of a user
>> > process and original file descriptor being closed, by creating an file entry
>> > in BPF FS. This provides great safety by default, with simple way to opt out
>> > for cases where it's needed.
>>
>> While being able to pin the fds returned by bpf_raw_tracepoint_open()
>> certainly helps, I still feel like this is the wrong abstraction for
>> freplace(): When I'm building a program using freplace to put in new
>> functions (say, an XDP multi-prog dispatcher :)), I really want the
>> 'new' functions (i.e., the freplace'd bpf_progs) to share their lifetime
>> with the calling BPF program. I.e., I want to be able to do something
>> like:
>
> freplace programs will take refcount on a BPF program they are
> replacing, so in that sense they do share lifetime, except dependency
> is opposite to what you describe: rootlet/dispatcher program can't go
> away as long it has at least one freplace program attached. It
> (dispatcher) might get detached, though, but freplace, technically,
> will still be attached to now-detached dispatcher (so won't be
> invoked, yet still attached). I hope that makes sense :)

Yes, I realise that; I just think it's the wrong way 'round :)

>> prog_fd = sys_bpf(BPF_PROG_LOAD, ...); // dispatcher
>> func_fd = sys_bpf(BPF_PROG_LOAD, ...); // replacement func
>> err = sys_bpf(BPF_PROG_REPLACE_FUNC, prog_fd, btf_id, func_fd); // does *not* return an fd
>>
>> That last call should make the ref-counting be in the prog_fd -> func_fd
>> direction, so that when prog_fd is released, it will do
>> bpf_prog_put(func_fd). There could be an additional call like
>> sys_bpf(BPF_PROG_REPLACE_FUNC_DETACH, prog_fd, btf_id) for explicit
>> detach as well, of course.
>
> Taking this additional refcount will create a dependency loop (see
> above), so that's why it wasn't done, I think.

Right, that's why I want the new operation to 'flip' the direction of
the refcnt inside the kernel, so there would no longer be a reference
from the replacement to the dispatcher, only the other way (so no
loops).

> With FD-based bpf_link, though, you'll be able to "transfer ownership"
> from application that installed freplace program in the first place,
> to the program that eventually will unload/replace dispatcher BPF
> program. You do that by pinning freplace program in BPFFS location,
> that's known to this libxdp library, and when you need to detach and
> unload XDP dispatcher and overriden XDP programs, the "admin process"
> which manages XDP dispatcher, will be able to just go and unpin and
> detach everything, if necessary.

Yes, so let's assume we want to do it this way (with replacement
programs pinned in a known location in bpffs). First off, that will
introduce a hard dependency on bpffs for XDP; so now your networking
program also needs access to mounting filesystems, or you need to rely
on the system (including your container runtime) to set this up right
for you.

Assuming that has been solved, the steps to replace an existing set of
programs with a new one would be something like:

0. We assume that we already have /sys/fs/bpf/xdp/eth0/mainprog,
   /sys/fs/bpf/xdp/eth0/prog1, /sys/fs/bpf/xdp/eth0/prog2, and that
   mainprog is attached to eth0
   
1. Create new dispatcher, attach programs, pin replacement progs
   somewhere (/sys/fs/bpf/xdp/eth0.new/{mainprog,prog1,prog2,prog3}?)

2. Attach new mainprog to eth0, replacing the old one

3. `mv /sys/fs/bpf/xdp/eth0 /sys/fs/bpf/xdp/eth0.old`

4. `mv /sys/fs/bpf/xdp/eth0.new /sys/fs/bpf/xdp/eth0`

5. `rm -r /sys/fs/bpf/xdp/eth0.old`


Or you could switch steps 2 and 3. Either way, there is no way to do
steps 2..4 as one atomic operation; so you can end up in an inconsistent
state.

Oh, and what happens if the netdevice moves namespaces while it has an
XDP program attached? Then you can end up with the bpffs where you
pinned the programs originally not being available at all anymore.

Contrast this to the case where the replacement programs are just
referenced from the main prog: none of the above will be necessary, and
replacement will just be one atomic operation on the interface.

>> With such an API, lifecycle management for an XDP program keeps being
>> obvious: There's an fd for the root program attached to the interface,
>> and that's it. When that is released the whole thing disappears. Whereas
>> with the bpf_raw_tracepoint_open() API, the userspace program suddenly
>> has to make sure all the component function FDs are pinned, which seems
>> cumbersome and error-prone...
>
> I thought that's what libxdp is supposed to do (among other things).
> So for user applications it will be all hidden inside the library API,
> no?

Sure, but that also means that even if we somehow manage to write a
library without any bugs, it will still be called from arbitrary
applications that may crash between two operations and leave things in
an inconsistent state.

So I guess what I'm saying is that while it most likely is *possible* to
build the multi-prog facility using bpf_raw_tracepoint_open() + pinning,
I believe that the result will be brittle, and that we would be better
served with a different kernel primitive.

Do you see what I mean? Or am I missing something obvious here?

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-02 22:24     ` Toke Høiland-Jørgensen
@ 2020-03-02 23:35       ` Andrii Nakryiko
  2020-03-03  8:12         ` Toke Høiland-Jørgensen
  2020-03-03  8:12       ` Daniel Borkmann
  1 sibling, 1 reply; 50+ messages in thread
From: Andrii Nakryiko @ 2020-03-02 23:35 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Mon, Mar 2, 2020 at 2:25 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Mon, Mar 2, 2020 at 2:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Andrii Nakryiko <andriin@fb.com> writes:
> >>
> >> > This patch series adds bpf_link abstraction, analogous to libbpf's already
> >> > existing bpf_link abstraction. This formalizes and makes more uniform existing
> >> > bpf_link-like BPF program link (attachment) types (raw tracepoint and tracing
> >> > links), which are FD-based objects that are automatically detached when last
> >> > file reference is closed. These types of BPF program links are switched to
> >> > using bpf_link framework.
> >> >
> >> > FD-based bpf_link approach provides great safety guarantees, by ensuring there
> >> > is not going to be an abandoned BPF program attached, if user process suddenly
> >> > exits or forgets to clean up after itself. This is especially important in
> >> > production environment and is what all the recent new BPF link types followed.
> >> >
> >> > One of the previously existing  inconveniences of FD-based approach, though,
> >> > was the scenario in which user process wants to install BPF link and exit, but
> >> > let attached BPF program run. Now, with bpf_link abstraction in place, it's
> >> > easy to support pinning links in BPF FS, which is done as part of the same
> >> > patch #1. This allows FD-based BPF program links to survive exit of a user
> >> > process and original file descriptor being closed, by creating an file entry
> >> > in BPF FS. This provides great safety by default, with simple way to opt out
> >> > for cases where it's needed.
> >>
> >> While being able to pin the fds returned by bpf_raw_tracepoint_open()
> >> certainly helps, I still feel like this is the wrong abstraction for
> >> freplace(): When I'm building a program using freplace to put in new
> >> functions (say, an XDP multi-prog dispatcher :)), I really want the
> >> 'new' functions (i.e., the freplace'd bpf_progs) to share their lifetime
> >> with the calling BPF program. I.e., I want to be able to do something
> >> like:
> >
> > freplace programs will take refcount on a BPF program they are
> > replacing, so in that sense they do share lifetime, except dependency
> > is opposite to what you describe: rootlet/dispatcher program can't go
> > away as long it has at least one freplace program attached. It
> > (dispatcher) might get detached, though, but freplace, technically,
> > will still be attached to now-detached dispatcher (so won't be
> > invoked, yet still attached). I hope that makes sense :)
>
> Yes, I realise that; I just think it's the wrong way 'round :)
>
> >> prog_fd = sys_bpf(BPF_PROG_LOAD, ...); // dispatcher
> >> func_fd = sys_bpf(BPF_PROG_LOAD, ...); // replacement func
> >> err = sys_bpf(BPF_PROG_REPLACE_FUNC, prog_fd, btf_id, func_fd); // does *not* return an fd
> >>
> >> That last call should make the ref-counting be in the prog_fd -> func_fd
> >> direction, so that when prog_fd is released, it will do
> >> bpf_prog_put(func_fd). There could be an additional call like
> >> sys_bpf(BPF_PROG_REPLACE_FUNC_DETACH, prog_fd, btf_id) for explicit
> >> detach as well, of course.
> >
> > Taking this additional refcount will create a dependency loop (see
> > above), so that's why it wasn't done, I think.
>
> Right, that's why I want the new operation to 'flip' the direction of
> the refcnt inside the kernel, so there would no longer be a reference
> from the replacement to the dispatcher, only the other way (so no
> loops).
>
> > With FD-based bpf_link, though, you'll be able to "transfer ownership"
> > from application that installed freplace program in the first place,
> > to the program that eventually will unload/replace dispatcher BPF
> > program. You do that by pinning freplace program in BPFFS location,
> > that's known to this libxdp library, and when you need to detach and
> > unload XDP dispatcher and overriden XDP programs, the "admin process"
> > which manages XDP dispatcher, will be able to just go and unpin and
> > detach everything, if necessary.
>
> Yes, so let's assume we want to do it this way (with replacement
> programs pinned in a known location in bpffs). First off, that will
> introduce a hard dependency on bpffs for XDP; so now your networking
> program also needs access to mounting filesystems, or you need to rely
> on the system (including your container runtime) to set this up right
> for you.
>
> Assuming that has been solved, the steps to replace an existing set of
> programs with a new one would be something like:
>
> 0. We assume that we already have /sys/fs/bpf/xdp/eth0/mainprog,
>    /sys/fs/bpf/xdp/eth0/prog1, /sys/fs/bpf/xdp/eth0/prog2, and that
>    mainprog is attached to eth0
>
> 1. Create new dispatcher, attach programs, pin replacement progs
>    somewhere (/sys/fs/bpf/xdp/eth0.new/{mainprog,prog1,prog2,prog3}?)
>
> 2. Attach new mainprog to eth0, replacing the old one
>
> 3. `mv /sys/fs/bpf/xdp/eth0 /sys/fs/bpf/xdp/eth0.old`
>
> 4. `mv /sys/fs/bpf/xdp/eth0.new /sys/fs/bpf/xdp/eth0`
>
> 5. `rm -r /sys/fs/bpf/xdp/eth0.old`
>
>
> Or you could switch steps 2 and 3. Either way, there is no way to do
> steps 2..4 as one atomic operation; so you can end up in an inconsistent
> state.
>

I understand that as well :) But bpf_link is not trying to solve this
multi-XDP program use case. What it does solve is making it possible
to have "ephemeral" FD-based attachment points (links), which by
default will be auto-detached if the process exits or crashes (so they
are safe for production use and don't leave unaccounted BPF programs
running). But they allow turning that "ephemeral" link into a
"permanent" one by pinning it in BPF FS. That's it, I wasn't trying to
solve your specific use case. It does help with preserving the XDP
dispatcher indefinitely, though, so the original process that created
the XDP dispatcher doesn't have to use the existing "permanent" XDP
program attachment and doesn't have to stay alive for too long.
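
E.g., whatever "admin" process later takes over can just re-open the
pinned link and detach everything once it's done, roughly like this
(sketch; the path is made up, API names per patch 2):

	struct bpf_link *link;

	link = bpf_link__open("/sys/fs/bpf/xdp/dispatcher_link");
	if (libbpf_get_error(link))
		return -1;

	/* remove the BPF FS entry; the link is then held only by our FD
	 * (error handling omitted)
	 */
	bpf_link__unpin(link);

	/* dropping the last FD detaches the underlying BPF program */
	bpf_link__destroy(link);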

All the XDP chaining specific issues should probably be discussed on
your original thread that includes Andrey as well.

> Oh, and what happens if the netdevice moves namespaces while it has an
> XDP program attached? Then you can end up with the bpffs where you
> pinned the programs originally not being available at all anymore.
>
> Contrast this to the case where the replacement programs are just
> referenced from the main prog: none of the above will be necessary, and
> replacement will just be one atomic operation on the interface.

I don't think it's that simple. To do this atomic switch, you'll need
to be able to attach the same XDP program to two different XDP
dispatchers, which isn't possible today. But again, I don't think this
thread is the right place to discuss those issues.

>
> >> With such an API, lifecycle management for an XDP program keeps being
> >> obvious: There's an fd for the root program attached to the interface,
> >> and that's it. When that is released the whole thing disappears. Whereas
> >> with the bpf_raw_tracepoint_open() API, the userspace program suddenly
> >> has to make sure all the component function FDs are pinned, which seems
> >> cumbersome and error-prone...
> >
> > I thought that's what libxdp is supposed to do (among other things).
> > So for user applications it will be all hidden inside the library API,
> > no?
>
> Sure, but that also means that even if we somehow manage to write a
> library without any bugs, it will still be called from arbitrary
> applications that may crash between two operations and leave things in
> an inconsistent state.

Whatever that library does, it shouldn't leave the host system in a
bad state despite any possible crashes; that should be designed in,
regardless of the specific mechanism used. Which is also a reason for
having FD-based ephemeral links, because they will get cleaned up
automatically on crash, unless the library has reached a stable point
where the attached program can be permanently pinned. But again,
that's a general problem, not just with XDP programs, but with cgroup
programs and others, which is what I'm trying to address with
bpf_link. Let's continue discussing XDP dispatcher stuff on the
respective mail thread.

>
> So I guess what I'm saying is that while it most likely is *possible* to
> build the multi-prog facility using bpf_raw_tracepoint_open() + pinning,
> I believe that the result will be brittle, and that we would be better
> served with a different kernel primitive.
>
> Do you see what I mean? Or am I missing something obvious here?
>
> -Toke
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction
  2020-03-02 21:40       ` Toke Høiland-Jørgensen
@ 2020-03-02 23:37         ` Andrii Nakryiko
  0 siblings, 0 replies; 50+ messages in thread
From: Andrii Nakryiko @ 2020-03-02 23:37 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Mon, Mar 2, 2020 at 1:40 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Mon, Mar 2, 2020 at 2:13 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Andrii Nakryiko <andriin@fb.com> writes:
> >>
> >> > Introduce bpf_link abstraction, representing an attachment of BPF program to
> >> > a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
> >> > ownership of attached BPF program, reference counting of a link itself, when
> >> > reference from multiple anonymous inodes, as well as ensures that release
> >> > callback will be called from a process context, so that users can safely take
> >> > mutex locks and sleep.
> >> >
> >> > Additionally, with a new abstraction it's now possible to generalize pinning
> >> > of a link object in BPF FS, allowing to explicitly prevent BPF program
> >> > detachment on process exit by pinning it in a BPF FS and let it open from
> >> > independent other process to keep working with it.
> >> >
> >> > Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
> >> > program attachments) into utilizing bpf_link framework, making them pinnable
> >> > in BPF FS. More FD-based bpf_links will be added in follow up patches.
> >> >
> >> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> >> > ---
> >> >  include/linux/bpf.h  |  13 +++
> >> >  kernel/bpf/inode.c   |  42 ++++++++-
> >> >  kernel/bpf/syscall.c | 209 ++++++++++++++++++++++++++++++++++++-------
> >> >  3 files changed, 226 insertions(+), 38 deletions(-)
> >> >

[...]

> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> > index c536c65256ad..fca8de7e7872 100644
> >> > --- a/kernel/bpf/syscall.c
> >> > +++ b/kernel/bpf/syscall.c
> >> > @@ -2173,23 +2173,153 @@ static int bpf_obj_get(const union bpf_attr *attr)
> >> >                               attr->file_flags);
> >> >  }
> >> >
> >> > -static int bpf_tracing_prog_release(struct inode *inode, struct file *filp)
> >> > +struct bpf_link {
> >> > +     atomic64_t refcnt;
> >>
> >> refcount_t ?
> >
> > Both bpf_map and bpf_prog stick to atomic64 for their refcounting, so
> > I'd like to stay consistent and use a refcount that can't possibly leak
> > resources (which refcount_t can, if it overflows).
>
> refcount_t is specifically supposed to turn a possible use-after-free on
> under/overflow into a warning, isn't it? Not going to insist or anything
> here, just found it odd that you'd prefer the other...

Well, underflow is a huge bug that should never happen in well-tested
code (at least that's the assumption for bpf_map and bpf_prog), and we
are generally very careful about that. Overflow can happen only
because refcount_t uses a 32-bit integer, which atomic64_t side-steps
completely by going to a 64-bit integer. So yeah, I'd rather stick to
the same approach that's used for bpf_map and bpf_prog.

>
> -Toke
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction
  2020-02-28 22:39 ` [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction Andrii Nakryiko
  2020-03-02 10:13   ` Toke Høiland-Jørgensen
@ 2020-03-03  2:50   ` Alexei Starovoitov
  2020-03-03  4:18     ` Andrii Nakryiko
  1 sibling, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2020-03-03  2:50 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, netdev, ast, daniel, andrii.nakryiko, kernel-team

On Fri, Feb 28, 2020 at 02:39:46PM -0800, Andrii Nakryiko wrote:
>  
> +int bpf_link_new_fd(struct bpf_link *link)
> +{
> +	return anon_inode_getfd("bpf-link", &bpf_link_fops, link, O_CLOEXEC);
> +}
...
> -	tr_fd = anon_inode_getfd("bpf-tracing-prog", &bpf_tracing_prog_fops,
> -				 prog, O_CLOEXEC);
> +	tr_fd = anon_inode_getfd("bpf-tracing-link", &bpf_link_fops,
> +				 &link->link, O_CLOEXEC);
...
> -	tp_fd = anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
> -				 O_CLOEXEC);
> +	tp_fd = anon_inode_getfd("bpf-raw-tp-link", &bpf_link_fops,
> +				 &raw_tp->link, O_CLOEXEC);

I don't think different names are strong enough reason to open code it.
I think bpf_link_new_fd() should be used in all cases.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction
  2020-03-03  2:50   ` Alexei Starovoitov
@ 2020-03-03  4:18     ` Andrii Nakryiko
  0 siblings, 0 replies; 50+ messages in thread
From: Andrii Nakryiko @ 2020-03-03  4:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Mon, Mar 2, 2020 at 6:50 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Feb 28, 2020 at 02:39:46PM -0800, Andrii Nakryiko wrote:
> >
> > +int bpf_link_new_fd(struct bpf_link *link)
> > +{
> > +     return anon_inode_getfd("bpf-link", &bpf_link_fops, link, O_CLOEXEC);
> > +}
> ...
> > -     tr_fd = anon_inode_getfd("bpf-tracing-prog", &bpf_tracing_prog_fops,
> > -                              prog, O_CLOEXEC);
> > +     tr_fd = anon_inode_getfd("bpf-tracing-link", &bpf_link_fops,
> > +                              &link->link, O_CLOEXEC);
> ...
> > -     tp_fd = anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp,
> > -                              O_CLOEXEC);
> > +     tp_fd = anon_inode_getfd("bpf-raw-tp-link", &bpf_link_fops,
> > +                              &raw_tp->link, O_CLOEXEC);
>
> I don't think different names are strong enough reason to open code it.
> I think bpf_link_new_fd() should be used in all cases.

Oh, this got simplified from the initial implementation after a few
rounds of refactoring, and I didn't notice that now I can just use
bpf_link_new_fd() here. Will update.
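
I.e., both call sites just become the following (sketch; only the
per-type "bpf-tracing-link"/"bpf-raw-tp-link" names are lost, error
handling stays as is):

	/* tracing attach */
	tr_fd = bpf_link_new_fd(&link->link);

	/* raw tracepoint attach */
	tp_fd = bpf_link_new_fd(&raw_tp->link);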

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-02 23:35       ` Andrii Nakryiko
@ 2020-03-03  8:12         ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-03  8:12 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team


> All the XDP chaining specific issues should probably be discussed on your
> original thread that includes Andrey as well.

Yeah, sorry for hijacking your thread with my brain dump; I'll move it
over to the other one :)

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-02 22:24     ` Toke Høiland-Jørgensen
  2020-03-02 23:35       ` Andrii Nakryiko
@ 2020-03-03  8:12       ` Daniel Borkmann
  2020-03-03 15:46         ` Alexei Starovoitov
  1 sibling, 1 reply; 50+ messages in thread
From: Daniel Borkmann @ 2020-03-03  8:12 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov, Kernel Team

On 3/2/20 11:24 PM, Toke Høiland-Jørgensen wrote:
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> On Mon, Mar 2, 2020 at 2:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>> Andrii Nakryiko <andriin@fb.com> writes:
>>>
>>>> This patch series adds bpf_link abstraction, analogous to libbpf's already
>>>> existing bpf_link abstraction. This formalizes and makes more uniform existing
>>>> bpf_link-like BPF program link (attachment) types (raw tracepoint and tracing
>>>> links), which are FD-based objects that are automatically detached when last
>>>> file reference is closed. These types of BPF program links are switched to
>>>> using bpf_link framework.
>>>>
>>>> FD-based bpf_link approach provides great safety guarantees, by ensuring there
>>>> is not going to be an abandoned BPF program attached, if user process suddenly
>>>> exits or forgets to clean up after itself. This is especially important in
>>>> production environment and is what all the recent new BPF link types followed.
>>>>
>>>> One of the previously existing  inconveniences of FD-based approach, though,
>>>> was the scenario in which user process wants to install BPF link and exit, but
>>>> let attached BPF program run. Now, with bpf_link abstraction in place, it's
>>>> easy to support pinning links in BPF FS, which is done as part of the same
>>>> patch #1. This allows FD-based BPF program links to survive exit of a user
>>>> process and original file descriptor being closed, by creating an file entry
>>>> in BPF FS. This provides great safety by default, with simple way to opt out
>>>> for cases where it's needed.

I can see the motivation for this abstraction in particular for tracing, but given
the goal of bpf_link is to formalize and make the various program attachment types
more uniform, how is this going to solve e.g. the tc/BPF case? There is no guarantee
that, once you create a link with the prog attached to cls_bpf, someone else isn't
going to replace that qdisc underneath you, and then you end up in the same situation
as if you had only pinned the program itself (and not a link). So bpf_link then gives
the wrong impression that something is still attached and active while it is not.
What is the plan for these types?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-03  8:12       ` Daniel Borkmann
@ 2020-03-03 15:46         ` Alexei Starovoitov
  2020-03-03 19:23           ` Daniel Borkmann
  0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2020-03-03 15:46 UTC (permalink / raw)
  To: Daniel Borkmann, Toke Høiland-Jørgensen, Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Kernel Team

On 3/3/20 12:12 AM, Daniel Borkmann wrote:
> 
> I can see the motivation for this abstraction in particular for tracing, 
> but given
> the goal of bpf_link is to formalize and make the various program 
> attachment types
> more uniform, how is this going to solve e.g. the tc/BPF case? There is 
> no guarantee
> that while you create a link with the prog attached to cls_bpf that 
> someone else is
> going to replace that qdisc underneath you, and hence you end up with 
> the same case
> as if you would have only pinned the program itself (and not a link). So 
> bpf_link
> then gives a wrong impression that something is still attached and 
> active while it
> is not. What is the plan for these types?


TC is not easy to handle, right, but I don't see a 'wrong impression'
part. The link will keep the program attached to the qdisc. The admin
may try to remove the qdisc from the netdev, but that's a separate
issue. Same thing with xdp: the link will keep the xdp program
attached, but the admin may do ifconfig down and no packets will be
flowing. Similar with cgroups: the link will keep the prog attached to
a cgroup, but the admin can still do rmdir and the cgroup will be in a
'dying' state. In case of tracing there is no intermediate entity
between programs and the kernel. In case of networking there are
layers: netdevs, qdiscs, etc. Maybe dev_hold is a way to go.
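
E.g. a future netdev-attached link could look roughly like this
(purely hypothetical sketch; none of these types exist yet):

struct bpf_xdp_link {
	struct bpf_link link;
	struct net_device *dev;	/* reference taken with dev_hold() at attach */
};

static void bpf_xdp_link_release(struct bpf_link *link)
{
	struct bpf_xdp_link *xdp_link =
		container_of(link, struct bpf_xdp_link, link);

	/* actual detach of the program from the device would go here */
	dev_put(xdp_link->dev);	/* drop the dev_hold() reference */
}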

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-03 15:46         ` Alexei Starovoitov
@ 2020-03-03 19:23           ` Daniel Borkmann
  2020-03-03 19:46             ` Andrii Nakryiko
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Borkmann @ 2020-03-03 19:23 UTC (permalink / raw)
  To: Alexei Starovoitov, Toke Høiland-Jørgensen, Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Kernel Team

On 3/3/20 4:46 PM, Alexei Starovoitov wrote:
> On 3/3/20 12:12 AM, Daniel Borkmann wrote:
>>
>> I can see the motivation for this abstraction in particular for tracing, but given
>> the goal of bpf_link is to formalize and make the various program attachment types
>> more uniform, how is this going to solve e.g. the tc/BPF case? There is no guarantee
>> that while you create a link with the prog attached to cls_bpf that someone else is
>> going to replace that qdisc underneath you, and hence you end up with the same case
>> as if you would have only pinned the program itself (and not a link). So bpf_link
>> then gives a wrong impression that something is still attached and active while it
>> is not. What is the plan for these types?
> 
> TC is not easy to handle, right, but I don't see a 'wrong impression' part. The link will keep the program attached to qdisc. The admin
> may try to remove qdisc for netdev, but that's a separate issue.
> Same thing with xdp. The link will keep xdp program attached,
> but admin may do ifconfig down and no packets will be flowing.
> Similar with cgroups. The link will keep prog attached to a cgroup,
> but admin can still do rmdir and cgroup will be in 'dying' state.
> In case of tracing there is no intermediate entity between programs
> and the kernel. In case of networking there are layers.
> Netdevs, qdiscs, etc. May be dev_hold is a way to go.

Yep, right. I mean taking tracing use-case aside, in Cilium we attach to XDP, tc,
cgroups BPF and whatnot, and we can tear down the Cilium user space agent just
fine while packets keep flowing through the BPF progs, and a later restart will
just reattach them atomically, e.g. Cilium version upgrades are usually done this
way.

This decoupling works since the attach point is already holding the reference on
the program, and if needed user space can always retrieve what has been attached
there. So the surrounding object acts like the "bpf_link" already. I think we need
to figure out what semantics an actual bpf_link should have there. Given an admin
can change qdisc/netdev/etc underneath us, and hence cause implicit detachment, I
don't know whether it would make much sense to keep surrounding objects like filter,
qdisc or even netdev alive to work around it since there's a whole dependency chain,
like in case of filter instance, it would be kept alive, but surrounding qdisc may
be dropped.

The question is: if there are no good semantics and benefits over what can be done
today with existing infra (abstracted from user space via libbpf) for the remaining
program types, perhaps it makes sense to make the pinning tracing-specific only
instead of a generic abstraction which only ever works for a limited number of types?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-03 19:23           ` Daniel Borkmann
@ 2020-03-03 19:46             ` Andrii Nakryiko
  2020-03-03 20:24               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 50+ messages in thread
From: Andrii Nakryiko @ 2020-03-03 19:46 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Toke Høiland-Jørgensen,
	Andrii Nakryiko, bpf, Networking, Kernel Team

On Tue, Mar 3, 2020 at 11:23 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 3/3/20 4:46 PM, Alexei Starovoitov wrote:
> > On 3/3/20 12:12 AM, Daniel Borkmann wrote:
> >>
> >> I can see the motivation for this abstraction in particular for tracing, but given
> >> the goal of bpf_link is to formalize and make the various program attachment types
> >> more uniform, how is this going to solve e.g. the tc/BPF case? There is no guarantee
> >> that while you create a link with the prog attached to cls_bpf that someone else is
> >> going to replace that qdisc underneath you, and hence you end up with the same case
> >> as if you would have only pinned the program itself (and not a link). So bpf_link
> >> then gives a wrong impression that something is still attached and active while it
> >> is not. What is the plan for these types?
> >
> > TC is not easy to handle, right, but I don't see a 'wrong impression' part. The link will keep the program attached to qdisc. The admin
> > may try to remove qdisc for netdev, but that's a separate issue.
> > Same thing with xdp. The link will keep xdp program attached,
> > but admin may do ifconfig down and no packets will be flowing.
> > Similar with cgroups. The link will keep prog attached to a cgroup,
> > but admin can still do rmdir and cgroup will be in 'dying' state.
> > In case of tracing there is no intermediate entity between programs
> > and the kernel. In case of networking there are layers.
> > Netdevs, qdiscs, etc. May be dev_hold is a way to go.
>
> Yep, right. I mean taking tracing use-case aside, in Cilium we attach to XDP, tc,
> cgroups BPF and whatnot, and we can tear down the Cilium user space agent just
> fine while packets keep flowing through the BPF progs, and a later restart will
> just reattach them atomically, e.g. Cilium version upgrades are usually done this
> way.

Right. This is the case where you want the attached BPF program to
survive the exit of the controlling application process. That is not a
safe default, though, because it might lead to a BPF program running
without anyone knowing, leading to really bad consequences. It's
especially important for applications that are deployed fleet-wide and
that don't "control" the hosts they are deployed to. If such an
application crashes and no one notices or does anything about it, the
BPF program will keep running, draining resources or even just, say,
dropping packets. We at FB had outages due to such permanent BPF
attachment semantics. With FD-based bpf_link we are getting a
framework that allows safe, auto-detachable behavior by default,
unless the application explicitly opts in with bpf_link__pin().

>
> This decoupling works since the attach point is already holding the reference on
> the program, and if needed user space can always retrieve what has been attached
> there. So the surrounding object acts like the "bpf_link" already. I think we need
> to figure out what semantics an actual bpf_link should have there. Given an admin
> can change qdisc/netdev/etc underneath us, and hence cause implicit detachment, I
> don't know whether it would make much sense to keep surrounding objects like filter,
> qdisc or even netdev alive to work around it since there's a whole dependency chain,
> like in case of filter instance, it would be kept alive, but surrounding qdisc may
> be dropped.

I don't have specific enough knowledge right now to answer the tc/BPF
question, but it seems like an attached BPF program should hold a
reference to whatever it's attached to (net_device or whatnot) and not
let it just disappear? E.g., for cgroups, the cgroup will go into a
dying state, but it will still be there as long as there are remaining
BPF programs attached, sockets open, etc. I think this should be the
general approach, but again, I don't know the specifics of each
"attach point".

>
> Question is, if there are no good semantics and benefits over what can be done
> today with existing infra (abstracted from user space via libbpf) for the remaining
> program types, perhaps it makes sense to have the pinning tracing specific only
> instead of generic abstraction which only ever works for a limited number?

See above, I think bpf_link is what allows having both auto-detachment
by default, as well as long-lived BPF attachments (with explicit
opt-in).

As for what bpf_link can provide on top of existing stuff: one thing
that becomes more apparent with recent XDP discussions, and what was
solved in a cgroup-specific way for cgroup BPF, is that there is a
need to swap BPF programs without interruption (the BPF_F_REPLACE
behavior for cgroup BPF). Similar semantics are desirable for XDP, it
seems. That's where bpf_link is useful. Once a bpf_link is attached
(for specificity, let's say an XDP program to some ifindex), it cannot
be replaced with another bpf_link. The attached bpf_link will need to
be detached first (by means of closing all open FDs to it). This
ensures no one can accidentally replace the XDP dispatcher program.

Now, once you have a bpf_link attached, there will be a bpf_link
operation (e.g., BPF_LINK_SWAP or something like that), where the
underlying BPF program associated with the bpf_link gets replaced with
a new BPF program without any interruption. Optionally, we can provide
an expected_bpf_program_fd to make sure we are replacing the right
program (for cases where there could be a few bpf_link owners trying
to modify the bpf_link, like in the libxdp case). So in that sense
bpf_link is a coordination point, which mediates access to the BPF
hook (resource).
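
In the same pseudo-syscall notation used earlier in the thread, it
would look something like this (command and argument names are made up
for illustration; none of this exists yet):

	/* link_fd: FD of an already attached bpf_link (e.g. XDP on some ifindex) */

	/* atomically swap the program behind the attached link; fails if
	 * the currently attached program is not expected_prog_fd
	 */
	err = sys_bpf(BPF_LINK_SWAP, link_fd, new_prog_fd, expected_prog_fd);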

Thoughts?


>
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-03 19:46             ` Andrii Nakryiko
@ 2020-03-03 20:24               ` Toke Høiland-Jørgensen
  2020-03-03 20:53                 ` Daniel Borkmann
  2020-03-03 22:40                 ` Jakub Kicinski
  0 siblings, 2 replies; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-03 20:24 UTC (permalink / raw)
  To: Andrii Nakryiko, Daniel Borkmann
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, Networking, Kernel Team

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Tue, Mar 3, 2020 at 11:23 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> On 3/3/20 4:46 PM, Alexei Starovoitov wrote:
>> > On 3/3/20 12:12 AM, Daniel Borkmann wrote:
>> >>
>> >> I can see the motivation for this abstraction in particular for tracing, but given
>> >> the goal of bpf_link is to formalize and make the various program attachment types
>> >> more uniform, how is this going to solve e.g. the tc/BPF case? There is no guarantee
>> >> that while you create a link with the prog attached to cls_bpf that someone else is
>> >> going to replace that qdisc underneath you, and hence you end up with the same case
>> >> as if you would have only pinned the program itself (and not a link). So bpf_link
>> >> then gives a wrong impression that something is still attached and active while it
>> >> is not. What is the plan for these types?
>> >
>> > TC is not easy to handle, right, but I don't see a 'wrong impression' part. The link will keep the program attached to qdisc. The admin
>> > may try to remove qdisc for netdev, but that's a separate issue.
>> > Same thing with xdp. The link will keep xdp program attached,
>> > but admin may do ifconfig down and no packets will be flowing.
>> > Similar with cgroups. The link will keep prog attached to a cgroup,
>> > but admin can still do rmdir and cgroup will be in 'dying' state.
>> > In case of tracing there is no intermediate entity between programs
>> > and the kernel. In case of networking there are layers.
>> > Netdevs, qdiscs, etc. May be dev_hold is a way to go.
>>
>> Yep, right. I mean taking tracing use-case aside, in Cilium we attach to XDP, tc,
>> cgroups BPF and whatnot, and we can tear down the Cilium user space agent just
>> fine while packets keep flowing through the BPF progs, and a later restart will
>> just reattach them atomically, e.g. Cilium version upgrades are usually done this
>> way.
>
> Right. This is the case where you want attached BPF program to survive
> control application process exiting. Which is not a safe default,
> though, because it might lead to BPF program running without anyone
> knowing, leading to really bad consequences. It's especially important
> for applications that are deployed fleet-wide and that don't "control"
> hosts they are deployed to. If such application crashes and no one
> notices and does anything about that, BPF program will keep running
> draining resources or even just, say, dropping packets. We at FB had
> outages due to such permanent BPF attachment semantics. With FD-based
> bpf_link we are getting a framework, which allows safe,
> auto-detachable behavior by default, unless application explicitly
> opts in w/ bpf_link__pin().
>
>>
>> This decoupling works since the attach point is already holding the reference on
>> the program, and if needed user space can always retrieve what has been attached
>> there. So the surrounding object acts like the "bpf_link" already. I think we need
>> to figure out what semantics an actual bpf_link should have there. Given an admin
>> can change qdisc/netdev/etc underneath us, and hence cause implicit detachment, I
>> don't know whether it would make much sense to keep surrounding objects like filter,
>> qdisc or even netdev alive to work around it since there's a whole dependency chain,
>> like in case of filter instance, it would be kept alive, but surrounding qdisc may
>> be dropped.
>
> I don't have specific enough knowledge right now to answer tc/BPF
> question, but it seems like attached BPF program should hold a
> reference to whatever it's attached to (net_device or whatnot) and not
> let it just disappear? E.g., for cgroups, cgroup will go into dying
> state, but it still will be there as long as there are remaining BPF
> programs attached, sockets open, etc. I think it should be a general
> approach, but again, I don't know specifics of each "attach point".
>
>>
>> Question is, if there are no good semantics and benefits over what can be done
>> today with existing infra (abstracted from user space via libbpf) for the remaining
>> program types, perhaps it makes sense to have the pinning tracing specific only
>> instead of generic abstraction which only ever works for a limited number?
>
> See above, I think bpf_link is what allows to have both
> auto-detachment by default, as well as allow long-lived BPF
> attachments (with explicit opt-in).
>
> As for what bpf_link can provide on top of existing stuff. One thing
> that becomes more apparent with recent XDP discussions and what was
> solved in cgroup-specific way for cgroup BPFs, is that there is a need
> to swap BPF programs without interruption (BPF_F_REPLACE behavior for
> cgroup BPF). Similar semantics is desirable for XDP, it seems. That's
> where bpf_link is useful. Once bpf_link is attached (for specificity,
> let's say XDP program to some ifindex), it cannot be replaced with
> other bpf_link. The attached bpf_link will need to be detached first (by
> means of closing all open FDs to it). This ensures no one can
> accidentally replace XDP dispatcher program.
>
> Now, once you have bpf_link attached, there will be bpf_link operation
> (e.g., BPF_LINK_SWAP or something like that), where underlying BPF
> program, associated with bpf_link, will get replaced with a new BPF
> program without an interruption. Optionally, we can provide
> expected_bpf_program_fd to make sure we are replacing the right
> program (for cases where there could be a few bpf_link owners trying to modify
> bpf_link, like in libxdp case). So in that sense bpf_link is a
> coordination point, which mediates access to BPF hook (resource).
>
> Thoughts?

I can see how the bpf_link abstraction helps by providing a single
abstraction for all the tracing-type attachments that are fd-based
anyway; but I think I agree with Daniel that maybe it makes more sense
to keep it to those? I.e., I'm not sure what bpf_link adds to XDP
program attachment? The expected_prev_fd field to replace a program
could just as well be provided by extending the existing netlink API?

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-03 20:24               ` Toke Høiland-Jørgensen
@ 2020-03-03 20:53                 ` Daniel Borkmann
  2020-03-03 22:01                   ` Alexei Starovoitov
  2020-03-03 22:40                 ` Jakub Kicinski
  1 sibling, 1 reply; 50+ messages in thread
From: Daniel Borkmann @ 2020-03-03 20:53 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Andrii Nakryiko
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, Networking, Kernel Team

On 3/3/20 9:24 PM, Toke Høiland-Jørgensen wrote:
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> On Tue, Mar 3, 2020 at 11:23 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>> On 3/3/20 4:46 PM, Alexei Starovoitov wrote:
>>>> On 3/3/20 12:12 AM, Daniel Borkmann wrote:
>>>>>
>>>>> I can see the motivation for this abstraction in particular for tracing, but given
>>>>> the goal of bpf_link is to formalize and make the various program attachment types
>>>>> more uniform, how is this going to solve e.g. the tc/BPF case? There is no guarantee
>>>>> that while you create a link with the prog attached to cls_bpf that someone else is
>>>>> going to replace that qdisc underneath you, and hence you end up with the same case
>>>>> as if you would have only pinned the program itself (and not a link). So bpf_link
>>>>> then gives a wrong impression that something is still attached and active while it
>>>>> is not. What is the plan for these types?
>>>>
>>>> TC is not easy to handle, right, but I don't see a 'wrong impression' part. The link will keep the program attached to qdisc. The admin
>>>> may try to remove qdisc for netdev, but that's a separate issue.
>>>> Same thing with xdp. The link will keep xdp program attached,
>>>> but admin may do ifconfig down and no packets will be flowing.
>>>> Similar with cgroups. The link will keep prog attached to a cgroup,
>>>> but admin can still do rmdir and cgroup will be in 'dying' state.
>>>> In case of tracing there is no intermediate entity between programs
>>>> and the kernel. In case of networking there are layers.
>>>> Netdevs, qdiscs, etc. May be dev_hold is a way to go.
>>>
>>> Yep, right. I mean taking tracing use-case aside, in Cilium we attach to XDP, tc,
>>> cgroups BPF and whatnot, and we can tear down the Cilium user space agent just
>>> fine while packets keep flowing through the BPF progs, and a later restart will
>>> just reattach them atomically, e.g. Cilium version upgrades are usually done this
>>> way.
>>
>> Right. This is the case where you want attached BPF program to survive
>> control application process exiting. Which is not a safe default,
>> though, because it might lead to BPF program running without anyone
>> knowing, leading to really bad consequences. It's especially important
>> for applications that are deployed fleet-wide and that don't "control"
>> hosts they are deployed to. If such application crashes and no one
>> notices and does anything about that, BPF program will keep running
>> draining resources or even just, say, dropping packets. We at FB had
>> outages due to such permanent BPF attachment semantics. With FD-based

I think it depends on the environment, and yes, whether the orchestrator
of those progs controls the host [networking] as in case of Cilium. We
actually had cases where a large user in prod was accidentally removing
the Cilium k8s daemon set (and hence the user space agent as well) and only
noticed 1hrs later since everything just kept running in the data path as
expected w/o causing them an outage. So I think both attachment semantics
have pros and cons. ;)

>> bpf_link we are getting a framework, which allows safe,
>> auto-detachable behavior by default, unless application explicitly
>> opts in w/ bpf_link__pin().
>>
>>> This decoupling works since the attach point is already holding the reference on
>>> the program, and if needed user space can always retrieve what has been attached
>>> there. So the surrounding object acts like the "bpf_link" already. I think we need
>>> to figure out what semantics an actual bpf_link should have there. Given an admin
>>> can change qdisc/netdev/etc underneath us, and hence cause implicit detachment, I
>>> don't know whether it would make much sense to keep surrounding objects like filter,
>>> qdisc or even netdev alive to work around it since there's a whole dependency chain,
>>> like in case of filter instance, it would be kept alive, but surrounding qdisc may
>>> be dropped.
>>
>> I don't have specific enough knowledge right now to answer tc/BPF
>> question, but it seems like attached BPF program should hold a
>> reference to whatever it's attached to (net_device or whatnot) and not
>> let it just disappear? E.g., for cgroups, cgroup will go into dying

But then are you also expecting that netlink requests which drop that tc
filter that holds this BPF prog would get rejected given it has a bpf_link,
is active & pinned and traffic goes through? If not the case, then what
would be the point? If it is the case, then this seems rather complex to
realize for rather little gain given there are two uapi interfaces (bpf,
tc/netlink) which then mess around with the same underlying object in
different ways.

>> state, but it still will be there as long as there are remaining BPF
>> programs attached, sockets open, etc. I think it should be a general
>> approach, but again, I don't know specifics of each "attach point".
>>
>>> Question is, if there are no good semantics and benefits over what can be done
>>> today with existing infra (abstracted from user space via libbpf) for the remaining
>>> program types, perhaps it makes sense to have the pinning tracing specific only
>>> instead of generic abstraction which only ever works for a limited number?
>>
>> See above, I think bpf_link is what allows to have both
>> auto-detachment by default, as well as allow long-lived BPF
>> attachments (with explicit opt-in).
>>
>> As for what bpf_link can provide on top of existing stuff. One thing
>> that becomes more apparent with recent XDP discussions and what was
>> solved in cgroup-specific way for cgroup BPFs, is that there is a need
>> to swap BPF programs without interruption (BPF_F_REPLACE behavior for
>> cgroup BPF). Similar semantics is desirable for XDP, it seems. That's
>> where bpf_link is useful. Once bpf_link is attached (for specificity,
>> let's say XDP program to some ifindex), it cannot be replaced with
>> other bpf_link. The attached bpf_link will need to be detached first (by
>> means of closing all open FDs to it). This ensures no one can
>> accidentally replace XDP dispatcher program.
>>
>> Now, once you have bpf_link attached, there will be bpf_link operation
>> (e.g., BPF_LINK_SWAP or something like that), where underlying BPF
>> program, associated with bpf_link, will get replaced with a new BPF
>> program without an interruption. Optionally, we can provide
>> expected_bpf_program_fd to make sure we are replacing the right
>> program (for cases where there could be a few bpf_link owners trying to modify
>> bpf_link, like in libxdp case). So in that sense bpf_link is a
>> coordination point, which mediates access to BPF hook (resource).
>>
>> Thoughts?
> 
> I can see how the bpf_link abstraction helps by providing a single
> abstraction for all the tracing-type attachments that are fd-based
> anyway; but I think I agree with Daniel that maybe it makes more sense
> to keep it to those? I.e., I'm not sure what bpf_link adds to XDP
> program attachment? The expected_prev_fd field to replace a program
> could just as well be provided by extending the existing netlink API?
> 
> -Toke
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-03 20:53                 ` Daniel Borkmann
@ 2020-03-03 22:01                   ` Alexei Starovoitov
  2020-03-03 22:27                     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2020-03-03 22:01 UTC (permalink / raw)
  To: Daniel Borkmann, Toke Høiland-Jørgensen, Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Kernel Team

On 3/3/20 12:53 PM, Daniel Borkmann wrote:
> 
> I think it depends on the environment, and yes, whether the orchestrator
> of those progs controls the host [networking] as in case of Cilium. We
> actually had cases where a large user in prod was accidentally removing
> the Cilium k8s daemon set (and hence the user space agent as well) and only
> noticed 1hrs later since everything just kept running in the data path as
> expected w/o causing them an outage. So I think both attachment semantics
> have pros and cons. ;)

of course. that's why there is pinning of FD-based links.
There are cases where pinning is useful and there are cases where
pinning will cause outages.
During app restart temporary pinning might be useful too.
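
As a rough sketch of that restart pattern (using bpf_link__pin() from patch 2
of this series plus the existing bpf_obj_get(); the pin path is made up and
error handling is omitted):

#include <unistd.h>
#include <bpf/bpf.h>		/* bpf_obj_get() */
#include <bpf/libbpf.h>		/* bpf_link__pin() */

/* old process, before exiting: keep the attachment alive across restart */
static void hand_off(struct bpf_link *link)
{
	bpf_link__pin(link, "/sys/fs/bpf/myapp/link");	/* made-up path */
}

/* new process, after restart: take the link FD back and drop the pin, so
 * the attachment again lives only as long as this process holds the FD */
static int reclaim(void)
{
	int link_fd = bpf_obj_get("/sys/fs/bpf/myapp/link");

	if (link_fd >= 0)
		unlink("/sys/fs/bpf/myapp/link");
	return link_fd;
}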

> But then are you also expecting that netlink requests which drop that tc
> filter that holds this BPF prog would get rejected given it has a bpf_link,
> is active & pinned and traffic goes through? If not the case, then what
> would be the point? If it is the case, then this seems rather complex to
> realize for rather little gain given there are two uapi interfaces (bpf,
> tc/netlink) which then mess around with the same underlying object in
> different ways.

Legacy api for tc, xdp, cgroup will not be able to override FD-based
link. For TC it's easy. cls-bpf allows multi-prog, so netlink
adding/removing progs will not be able to touch progs that are
attached via FD-based link.
Same thing for cgroups. FD-based link will be similar to 'multi' mode.
The owner of the link has a guarantee that their program will
stay attached to cgroup.
XDP is also easy. Since it has only one prog. Attaching FD-based link
will prevent netlink from overriding it.
This way the rootlet prog installed by libxdp (let's find a better name
for it) will stay attached. libxdp can choose to pin it in some libxdp
specific location, so other libxdp-enabled applications can find it
in the same location, detach, replace, modify, but random app that
wants to hack an xdp prog won't be able to mess with it.
We didn't come up with these design choices overnight. It came from
hard lessons learned while deploying xdp, tc and cgroup in production.
Legacy apis will not be deprecated, of course.
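
For concreteness, a minimal user space sketch of the flow described above.
It assumes XDP grows a link-based attach API along the lines of
bpf_program__attach_xdp() (this series only converts raw_tracepoint and
tracing attachments); bpf_link__pin() is added in patch 2, and the object
file, section title and pin path are all made-up names:

#include <bpf/libbpf.h>

static int install_dispatcher(int ifindex)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	struct bpf_link *link;

	obj = bpf_object__open_file("xdp_dispatcher.o", NULL);
	if (libbpf_get_error(obj) || bpf_object__load(obj))
		return -1;
	prog = bpf_object__find_program_by_title(obj, "xdp");
	if (!prog)
		return -1;

	/* assumed link-based XDP attach, analogous to the tracing links */
	link = bpf_program__attach_xdp(prog, ifindex);
	if (libbpf_get_error(link))
		return -1;

	/* pin in a libxdp-specific location so cooperating loaders can find,
	 * replace or detach it; a netlink-only tool cannot override it */
	if (bpf_link__pin(link, "/sys/fs/bpf/xdp/dispatcher"))
		return -1;

	/* the process may now exit; the pinned link keeps the prog attached */
	return 0;
}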

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-03 22:01                   ` Alexei Starovoitov
@ 2020-03-03 22:27                     ` Toke Høiland-Jørgensen
  2020-03-04  4:36                       ` Alexei Starovoitov
  0 siblings, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-03 22:27 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Kernel Team

Alexei Starovoitov <ast@fb.com> writes:

> On 3/3/20 12:53 PM, Daniel Borkmann wrote:
>> 
>> I think it depends on the environment, and yes, whether the orchestrator
>> of those progs controls the host [networking] as in case of Cilium. We
>> actually had cases where a large user in prod was accidentally removing
>> the Cilium k8s daemon set (and hence the user space agent as well) and only
>> noticed 1hrs later since everything just kept running in the data path as
>> expected w/o causing them an outage. So I think both attachment semantics
>> have pros and cons. ;)
>
> of course. that's why there is pinning of FD-based links.
> There are cases where pinning is useful and there are cases where
> pinning will cause outages.
> During app restart temporary pinning might be useful too.
>
>> But then are you also expecting that netlink requests which drop that tc
>> filter that holds this BPF prog would get rejected given it has a bpf_link,
>> is active & pinned and traffic goes through? If not the case, then what
>> would be the point? If it is the case, then this seems rather complex to
>> realize for rather little gain given there are two uapi interfaces (bpf,
>> tc/netlink) which then mess around with the same underlying object in
>> different ways.
>
> Legacy api for tc, xdp, cgroup will not be able to override FD-based
> link. For TC it's easy. cls-bpf allows multi-prog, so netlink
> adding/removing progs will not be able to touch progs that are
> attached via FD-based link.
> Same thing for cgroups. FD-based link will be similar to 'multi' mode.
> The owner of the link has a guarantee that their program will
> stay attached to cgroup.
> XDP is also easy. Since it has only one prog. Attaching FD-based link
> will prevent netlink from overriding it.

So what happens if the device goes away?

> This way the rootlet prog installed by libxdp (let's find a better name
> for it) will stay attached.

Dispatcher prog?

> libxdp can choose to pin it in some libxdp specific location, so other
> libxdp-enabled applications can find it in the same location, detach,
> replace, modify, but random app that wants to hack an xdp prog won't
> be able to mess with it.

What if that "random app" comes first, and keeps holding on to the link
fd? Then the admin essentially has to start killing processes until they
find the one that has the device locked, no?

And what about the case where the link fd is pinned on a bpffs that is
no longer available? I.e., if a netdevice with an XDP program moves
namespaces and no longer has access to the original bpffs, that XDP
program would essentially become immutable?

> We didn't come up with these design choices overnight. It came from
> hard lessons learned while deploying xdp, tc and cgroup in production.
> Legacy apis will not be deprecated, of course.

Not deprecated, just less privileged?

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-03 20:24               ` Toke Høiland-Jørgensen
  2020-03-03 20:53                 ` Daniel Borkmann
@ 2020-03-03 22:40                 ` Jakub Kicinski
  1 sibling, 0 replies; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-03 22:40 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov,
	Andrii Nakryiko, bpf, Networking, Kernel Team

On Tue, 03 Mar 2020 21:24:31 +0100 Toke Høiland-Jørgensen wrote:
> I can see how the bpf_link abstraction helps by providing a single
> abstraction for all the tracing-type attachments that are fd-based
> anyway; but I think I agree with Daniel that maybe it makes more sense
> to keep it to those? I.e., I'm not sure what bpf_link adds to XDP
> program attachment? The expected_prev_fd field to replace a program
> could just as well be provided by extending the existing netlink API?

+1

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-03 22:27                     ` Toke Høiland-Jørgensen
@ 2020-03-04  4:36                       ` Alexei Starovoitov
  2020-03-04  7:47                         ` Toke Høiland-Jørgensen
  2020-03-04 19:41                         ` Jakub Kicinski
  0 siblings, 2 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2020-03-04  4:36 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Networking, Kernel Team

On Tue, Mar 03, 2020 at 11:27:13PM +0100, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <ast@fb.com> writes:
> >
> > Legacy api for tc, xdp, cgroup will not be able to override FD-based
> > link. For TC it's easy. cls-bpf allows multi-prog, so netlink
> > adding/removing progs will not be able to touch progs that are
> > attached via FD-based link.
> > Same thing for cgroups. FD-based link will be similar to 'multi' mode.
> > The owner of the link has a guarantee that their program will
> > stay attached to cgroup.
> > XDP is also easy. Since it has only one prog. Attaching FD-based link
> > will prevent netlink from overriding it.
> 
> So what happens if the device goes away?

I'm not sure yet whether it's cleaner to make netdev, qdisc, cgroup be held
by the link or to use a notifier approach. There are pros and cons to both.

> > This way the rootlet prog installed by libxdp (let's find a better name
> > for it) will stay attached.
> 
> Dispatcher prog?

would be great, but 'bpf_dispatcher' name is already used in the kernel.
I guess we can still call the library libdispatcher and dispatcher prog?
Alternatives:
libchainer and chainer prog
libaggregator and aggregator prog?
libpolicer kinda fits too, but could be misleading.
libxdp is very confusing. It's not xdp specific.

> > libxdp can choose to pin it in some libxdp specific location, so other
> > libxdp-enabled applications can find it in the same location, detach,
> > replace, modify, but random app that wants to hack an xdp prog won't
> > be able to mess with it.
> 
> What if that "random app" comes first, and keeps holding on to the link
> fd? Then the admin essentially has to start killing processes until they
> find the one that has the device locked, no?

Of course not. We have to provide an api to make it easy to discover
what process holds that link and where it's pinned.
But if we go with notifier approach none of it is an issue.
Whether target obj is held or notifier is used everything I said before still
stands. "random app" that uses netlink after libdispatcher got its link FD will
not be able to mess with carefully orchestrated setup done by libdispatcher.

Also either approach will guarantee that infamous message:
"unregister_netdevice: waiting for %s to become free. Usage count"
users will never see.

> And what about the case where the link fd is pinned on a bpffs that is
> no longer available? I.e., if a netdevice with an XDP program moves
> namespaces and no longer has access to the original bpffs, that XDP
> program would essentially become immutable?

'immutable' will not be possible.
It's not clear to me how bpffs is going to disappear. What do you mean
exactly?

> > We didn't come up with these design choices overnight. It came from
> > hard lessons learned while deploying xdp, tc and cgroup in production.
> > Legacy apis will not be deprecated, of course.
> 
> Not deprecated, just less privileged?

No idea what you're referring to.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-04  4:36                       ` Alexei Starovoitov
@ 2020-03-04  7:47                         ` Toke Høiland-Jørgensen
  2020-03-04 15:47                           ` Alexei Starovoitov
  2020-03-04 19:41                         ` Jakub Kicinski
  1 sibling, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-04  7:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Networking, Kernel Team

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Tue, Mar 03, 2020 at 11:27:13PM +0100, Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov <ast@fb.com> writes:
>> >
>> > Legacy api for tc, xdp, cgroup will not be able to override FD-based
>> > link. For TC it's easy. cls-bpf allows multi-prog, so netlink
>> > adding/removing progs will not be able to touch progs that are
>> > attached via FD-based link.
>> > Same thing for cgroups. FD-based link will be similar to 'multi' mode.
>> > The owner of the link has a guarantee that their program will
>> > stay attached to cgroup.
>> > XDP is also easy. Since it has only one prog. Attaching FD-based link
>> > will prevent netlink from overriding it.
>> 
>> So what happens if the device goes away?
>
> I'm not sure yet whether it's cleaner to make netdev, qdisc, cgroup to be held
> by the link or use notifier approach. There are pros and cons to both.
>
>> > This way the rootlet prog installed by libxdp (let's find a better name
>> > for it) will stay attached.
>> 
>> Dispatcher prog?
>
> would be great, but 'bpf_dispatcher' name is already used in the kernel.
> I guess we can still call the library libdispatcher and dispatcher prog?
> Alternatives:
> libchainer and chainer prog
> libaggregator and aggregator prog?
> libpolicer kinda fits too, but could be misleading.

Of those, I like 'dispatcher' best.

> libxdp is very confusing. It's not xdp specific.

Presumably the parts that are generally useful will just end up in
libbpf (eventually)?

>> > libxdp can choose to pin it in some libxdp specific location, so other
>> > libxdp-enabled applications can find it in the same location, detach,
>> > replace, modify, but random app that wants to hack an xdp prog won't
>> > be able to mess with it.
>> 
>> What if that "random app" comes first, and keeps holding on to the link
>> fd? Then the admin essentially has to start killing processes until they
>> find the one that has the device locked, no?
>
> Of course not. We have to provide an api to make it easy to discover
> what process holds that link and where it's pinned.
> But if we go with notifier approach none of it is an issue.
> Whether target obj is held or notifier is used everything I said before still
> stands. "random app" that uses netlink after libdispatcher got its link FD will
> not be able to mess with carefully orchestrated setup done by
> libdispatcher.

Protecting things against random modification is fine. What I want to
avoid is XDP/tc programs locking the device so an admin needs to perform
extra steps if it is in use when (e.g.) shutting down a device. XDP
should be something any application can use as acceleration, and if it
becomes known as "that annoying thing that locks my netdev", then that
is not going to happen.

> Also either approach will guarantee that infamous message:
> "unregister_netdevice: waiting for %s to become free. Usage count"
> users will never see.
>
>> And what about the case where the link fd is pinned on a bpffs that is
>> no longer available? I.e., if a netdevice with an XDP program moves
>> namespaces and no longer has access to the original bpffs, that XDP
>> program would essentially become immutable?
>
> 'immutable' will not be possible.
> I'm not clear to me how bpffs is going to disappear. What do you mean
> exactly?

# stat /sys/fs/bpf | grep Device
Device: 1fh/31d	Inode: 1013963     Links: 2
# mkdir /sys/fs/bpf/test; ls /sys/fs/bpf
test
# ip netns add test
# ip netns exec test stat /sys/fs/bpf/test
stat: cannot stat '/sys/fs/bpf/test': No such file or directory
# ip netns exec test stat /sys/fs/bpf | grep Device
Device: 3fh/63d	Inode: 12242       Links: 2

It's a different bpffs instance inside the netns, so it won't have
access to anything pinned in the outer one...

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-04  7:47                         ` Toke Høiland-Jørgensen
@ 2020-03-04 15:47                           ` Alexei Starovoitov
  2020-03-05 10:37                             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2020-03-04 15:47 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Networking, Kernel Team

On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote:
> >
> >> And what about the case where the link fd is pinned on a bpffs that is
> >> no longer available? I.e., if a netdevice with an XDP program moves
> >> namespaces and no longer has access to the original bpffs, that XDP
> >> program would essentially become immutable?
> >
> > 'immutable' will not be possible.
> > I'm not clear to me how bpffs is going to disappear. What do you mean
> > exactly?
> 
> # stat /sys/fs/bpf | grep Device
> Device: 1fh/31d	Inode: 1013963     Links: 2
> # mkdir /sys/fs/bpf/test; ls /sys/fs/bpf
> test
> # ip netns add test
> # ip netns exec test stat /sys/fs/bpf/test
> stat: cannot stat '/sys/fs/bpf/test': No such file or directory
> # ip netns exec test stat /sys/fs/bpf | grep Device
> Device: 3fh/63d	Inode: 12242       Links: 2
> 
> It's a different bpffs instance inside the netns, so it won't have
> access to anything pinned in the outer one...

Toke, please get your facts straight.

> # stat /sys/fs/bpf | grep Device
> Device: 1fh/31d	Inode: 1013963     Links: 2

Inode != 1 means that this is not bpffs.
I guess this is still sysfs.

> # mkdir /sys/fs/bpf/test; ls /sys/fs/bpf
> test
> # ip netns add test
> # ip netns exec test stat /sys/fs/bpf/test
> stat: cannot stat '/sys/fs/bpf/test': No such file or directory
> # ip netns exec test stat /sys/fs/bpf | grep Device
> Device: 3fh/63d	Inode: 12242       Links: 2

This is your new sysfs after ip netns exec.

netns has nothing to do with bpffs despite your claims.

Try this instead:
# mkdir /tmp/bpf
# mount -t bpf bpf /tmp/bpf
# stat /tmp/bpf|grep Device
Device: 1eh/30d	Inode: 1           Links: 2
# stat -f /tmp/bpf|grep Type
    ID: 0        Namelen: 255     Type: bpf_fs
# mkdir /tmp/bpf/test
# ip netns add my
# ip netns exec my stat /tmp/bpf|grep Device
Device: 1eh/30d	Inode: 1           Links: 3
# ip netns exec my stat -f /tmp/bpf|grep Type
    ID: 0        Namelen: 255     Type: bpf_fs
# ip netns exec my ls /tmp/bpf/
test

Having said that we do allow remounting bpffs on top of existing one:
# mount -t bpf bpf /var/aa
# mkdir /var/aa/bb
# stat -f /var/aa/bb|grep Type
    ID: 0        Namelen: 255     Type: bpf_fs
# mount -t bpf bpf /var/aa
# stat -f /var/aa/bb|grep Type
stat: cannot read file system information for '/var/aa/bb': No such file or directory
# umount /var/aa
# stat -f /var/aa/bb|grep Type
    ID: 0        Namelen: 255     Type: bpf_fs

Still that doesn't mean that pinned link is 'immutable'.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-04  4:36                       ` Alexei Starovoitov
  2020-03-04  7:47                         ` Toke Høiland-Jørgensen
@ 2020-03-04 19:41                         ` Jakub Kicinski
  2020-03-04 20:45                           ` Alexei Starovoitov
  1 sibling, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-04 19:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Networking, Kernel Team

On Tue, 3 Mar 2020 20:36:45 -0800 Alexei Starovoitov wrote:
> > > libxdp can choose to pin it in some libxdp specific location, so other
> > > libxdp-enabled applications can find it in the same location, detach,
> > > replace, modify, but random app that wants to hack an xdp prog won't
> > > be able to mess with it.  
> > 
> > What if that "random app" comes first, and keeps holding on to the link
> > fd? Then the admin essentially has to start killing processes until they
> > find the one that has the device locked, no?  
> 
> Of course not. We have to provide an api to make it easy to discover
> what process holds that link and where it's pinned.

That API to discover ownership would be useful but it's on the BPF side.

We have netlink notifications in networking world. The application
which doesn't want its program replaced should simply listen to the
netlink notifications and act if something goes wrong. And we already
have XDP_FLAGS_UPDATE_IF_NOEXIST.
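
(For reference, a minimal sketch of using that flag with the libbpf helper
that exists today; prog_fd is assumed to be an already loaded XDP program:)

#include <stdio.h>
#include <linux/if_link.h>	/* XDP_FLAGS_UPDATE_IF_NOEXIST */
#include <bpf/libbpf.h>		/* bpf_set_link_xdp_fd() */

static int attach_if_free(int ifindex, int prog_fd)
{
	/* refuse to clobber whatever XDP program is already attached;
	 * the kernel rejects the request (typically -EBUSY) in that case */
	int err = bpf_set_link_xdp_fd(ifindex, prog_fd,
				      XDP_FLAGS_UPDATE_IF_NOEXIST);

	if (err)
		fprintf(stderr, "attach failed: %d\n", err);
	return err;
}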

> But if we go with notifier approach none of it is an issue.

Sorry, what's the notifier approach? You mean netdev notifier chain 
or something new?

> Whether target obj is held or notifier is used everything I said before still
> stands. "random app" that uses netlink after libdispatcher got its link FD will
> not be able to mess with carefully orchestrated setup done by libdispatcher.
> 
> Also either approach will guarantee that infamous message:
> "unregister_netdevice: waiting for %s to become free. Usage count"
> users will never see.
>
> > And what about the case where the link fd is pinned on a bpffs that is
> > no longer available? I.e., if a netdevice with an XDP program moves
> > namespaces and no longer has access to the original bpffs, that XDP
> > program would essentially become immutable?  
> 
> 'immutable' will not be possible.
> I'm not clear to me how bpffs is going to disappear. What do you mean
> exactly?
> 
> > > We didn't come up with these design choices overnight. It came from
> > > hard lessons learned while deploying xdp, tc and cgroup in production.
> > > Legacy apis will not be deprecated, of course.  

This sounds like a version of devm_* helpers for configuration.
Why are current user space APIs insufficient? Surely all of this can 
be done from user space. And we will need a centralized daemon for XDP
dispatch, so why is it not a part of a daemon?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-04 19:41                         ` Jakub Kicinski
@ 2020-03-04 20:45                           ` Alexei Starovoitov
  2020-03-04 21:24                             ` Jakub Kicinski
  0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2020-03-04 20:45 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Networking, Kernel Team

On Wed, Mar 04, 2020 at 11:41:58AM -0800, Jakub Kicinski wrote:
> On Tue, 3 Mar 2020 20:36:45 -0800 Alexei Starovoitov wrote:
> > > > libxdp can choose to pin it in some libxdp specific location, so other
> > > > libxdp-enabled applications can find it in the same location, detach,
> > > > replace, modify, but random app that wants to hack an xdp prog won't
> > > > be able to mess with it.  
> > > 
> > > What if that "random app" comes first, and keeps holding on to the link
> > > fd? Then the admin essentially has to start killing processes until they
> > > find the one that has the device locked, no?  
> > 
> > Of course not. We have to provide an api to make it easy to discover
> > what process holds that link and where it's pinned.
> 
> That API to discover ownership would be useful but it's on the BPF side.

it's on bpf side because it's bpf specific.

> We have netlink notifications in networking world. The application
> which doesn't want its program replaced should simply listen to the
> netlink notifications and act if something goes wrong.

instead of locking the bike let's setup a camera and monitor the bike
when somebody steals it.
and then what? chase the thief and bring the bike back?

> > But if we go with notifier approach none of it is an issue.
> 
> Sorry, what's the notifier approach? You mean netdev notifier chain 
> or something new?

that's tbd.

> > Whether target obj is held or notifier is used everything I said before still
> > stands. "random app" that uses netlink after libdispatcher got its link FD will
> > not be able to mess with carefully orchestrated setup done by libdispatcher.
> > 
> > Also either approach will guarantee that infamous message:
> > "unregister_netdevice: waiting for %s to become free. Usage count"
> > users will never see.
> >
> > > And what about the case where the link fd is pinned on a bpffs that is
> > > no longer available? I.e., if a netdevice with an XDP program moves
> > > namespaces and no longer has access to the original bpffs, that XDP
> > > program would essentially become immutable?  
> > 
> > 'immutable' will not be possible.
> > I'm not clear to me how bpffs is going to disappear. What do you mean
> > exactly?
> > 
> > > > We didn't come up with these design choices overnight. It came from
> > > > hard lessons learned while deploying xdp, tc and cgroup in production.
> > > > Legacy apis will not be deprecated, of course.  
> 
> This sounds like a version of devm_* helpers for configuration.
> Why are current user space APIs insufficient? 

current xdp, tc, cgroup apis don't have the concept of the link
and owner of that link.

> Surely all of this can 
> be done from user space.

with a camera for theft monitoring. that will work well.

> And we will need a centralized daemon for XDP
> dispatch, so why is it not a part of a daemon?

current design of libdispatcher doesn't need the daemon.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-04 20:45                           ` Alexei Starovoitov
@ 2020-03-04 21:24                             ` Jakub Kicinski
  2020-03-05  1:07                               ` Alexei Starovoitov
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-04 21:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Networking, Kernel Team

On Wed, 4 Mar 2020 12:45:07 -0800 Alexei Starovoitov wrote:
> On Wed, Mar 04, 2020 at 11:41:58AM -0800, Jakub Kicinski wrote:
> > On Tue, 3 Mar 2020 20:36:45 -0800 Alexei Starovoitov wrote:  
> > > > > libxdp can choose to pin it in some libxdp specific location, so other
> > > > > libxdp-enabled applications can find it in the same location, detach,
> > > > > replace, modify, but random app that wants to hack an xdp prog won't
> > > > > be able to mess with it.    
> > > > 
> > > > What if that "random app" comes first, and keeps holding on to the link
> > > > fd? Then the admin essentially has to start killing processes until they
> > > > find the one that has the device locked, no?    
> > > 
> > > Of course not. We have to provide an api to make it easy to discover
> > > what process holds that link and where it's pinned.  
> > 
> > That API to discover ownership would be useful but it's on the BPF side.  
> 
> it's on bpf side because it's bpf specific.
> 
> > We have netlink notifications in networking world. The application
> > which doesn't want its program replaced should simply listen to the
> > netlink notifications and act if something goes wrong.  
> 
> instead of locking the bike let's setup a camera and monitor the bike
> when somebody steals it.
> and then what? chase the thief and bring the bike back?

:) Is the bike the BPF program? It's more like thief is stealing our
parking spot, we still have the program :)

Maybe also the thief should not have CAP_ADMIN in the first place?
And ask a daemon to perform its actions..

> > > But if we go with notifier approach none of it is an issue.  
> > 
> > Sorry, what's the notifier approach? You mean netdev notifier chain 
> > or something new?  
> 
> that's tbd.
> 
> > > Whether target obj is held or notifier is used everything I said before still
> > > stands. "random app" that uses netlink after libdispatcher got its link FD will
> > > not be able to mess with carefully orchestrated setup done by libdispatcher.
> > > 
> > > Also either approach will guarantee that infamous message:
> > > "unregister_netdevice: waiting for %s to become free. Usage count"
> > > users will never see.
> > >  
> > > > And what about the case where the link fd is pinned on a bpffs that is
> > > > no longer available? I.e., if a netdevice with an XDP program moves
> > > > namespaces and no longer has access to the original bpffs, that XDP
> > > > program would essentially become immutable?    
> > > 
> > > 'immutable' will not be possible.
> > > I'm not clear to me how bpffs is going to disappear. What do you mean
> > > exactly?
> > >   
> > > > > We didn't come up with these design choices overnight. It came from
> > > > > hard lessons learned while deploying xdp, tc and cgroup in production.
> > > > > Legacy apis will not be deprecated, of course.    
> > 
> > This sounds like a version of devm_* helpers for configuration.
> > Why are current user space APIs insufficient?   
> 
> current xdp, tc, cgroup apis don't have the concept of the link
> and owner of that link.

Why do the attachment points have to have a concept of an owner and 
not the program itself?

Link is a very overloaded term; I may not comprehend very well what
it models because of that.

> > Surely all of this can 
> > be done from user space.  
> 
> with a camera for theft monitoring. that will work well.
> 
> > And we will need a centralized daemon for XDP
> > dispatch, so why is it not a part of a daemon?  
> 
> current design of libdispatcher doesn't need the daemon.

Which is flawed. Why do we want to solve a distributed problem of
multiple applications, with potentially different versions of a library,
cooperating, when we can make it a daemon?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-04 21:24                             ` Jakub Kicinski
@ 2020-03-05  1:07                               ` Alexei Starovoitov
  2020-03-05  8:16                                 ` Jakub Kicinski
  0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2020-03-05  1:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Networking, Kernel Team

On Wed, Mar 04, 2020 at 01:24:39PM -0800, Jakub Kicinski wrote:
> On Wed, 4 Mar 2020 12:45:07 -0800 Alexei Starovoitov wrote:
> > On Wed, Mar 04, 2020 at 11:41:58AM -0800, Jakub Kicinski wrote:
> > > On Tue, 3 Mar 2020 20:36:45 -0800 Alexei Starovoitov wrote:  
> > > > > > libxdp can choose to pin it in some libxdp specific location, so other
> > > > > > libxdp-enabled applications can find it in the same location, detach,
> > > > > > replace, modify, but random app that wants to hack an xdp prog won't
> > > > > > be able to mess with it.    
> > > > > 
> > > > > What if that "random app" comes first, and keeps holding on to the link
> > > > > fd? Then the admin essentially has to start killing processes until they
> > > > > find the one that has the device locked, no?    
> > > > 
> > > > Of course not. We have to provide an api to make it easy to discover
> > > > what process holds that link and where it's pinned.  
> > > 
> > > That API to discover ownership would be useful but it's on the BPF side.  
> > 
> > it's on bpf side because it's bpf specific.
> > 
> > > We have netlink notifications in networking world. The application
> > > which doesn't want its program replaced should simply listen to the
> > > netlink notifications and act if something goes wrong.  
> > 
> > instead of locking the bike let's setup a camera and monitor the bike
> > when somebody steals it.
> > and then what? chase the thief and bring the bike back?
> 
> :) Is the bike the BPF program? It's more like thief is stealing our
> parking spot, we still have the program :)

yeah. parking spot is a better analogy.

> Maybe also the thief should not have CAP_ADMIN in the first place?
> And ask a daemon to perform its actions..

a daemon idea keeps coming back in circles.
With FD-based kprobe/uprobe/tracepoint/fexit/fentry that problem is gone,
but xdp, tc, cgroup still don't have the owner concept.
Some people argued that these three need three separate daemons.
Especially since cgroups are mainly managed by systemd plus a container
manager, it's quite different from networking (xdp, tc), where something
like 'networkd' might make sense.
But if you take this line of thought all the way, systemd should be that
single daemon to coordinate attaching to xdp, tc, cgroup, because
in many cases cgroup and tc progs have to coordinate the work.
And that's where it's getting gloomy... unless the kernel can provide
a facility so a central daemon is not necessary.

> > current xdp, tc, cgroup apis don't have the concept of the link
> > and owner of that link.
> 
> Why do the attachment points have to have a concept of an owner and 
> not the program itself?

bpf program is an object. That object has an owner or multiple owners.
A user process that holds a pointer to that object is a shared owner.
FD is such pointer. FD == std::shared_ptr<bpf_prog>.
Holding that pointer guarantees that <bpf_prog> will not disappear,
but it says nothing about whether the program will keep running.
For [ku]probe,tp,fentry,fexit there was always <bpf_link> in the kernel.
It wasn't that formal in the past until Andrii's most recent patches,
but the concept existed for a long time. FD == std::shared_ptr<bpf_link>
connects a kernel object with <bpf_prog>. When that kernel object emits
an event, the <bpf_link> guarantees that <bpf_prog> will be executed.
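
As a concrete illustration of that ownership model, here is a sketch using
the raw_tracepoint link this series converts (object file and program title
are made-up names, error handling omitted):

#include <bpf/libbpf.h>

static void link_vs_prog_fd(void)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	struct bpf_link *link;

	obj = bpf_object__open_file("rawtp.o", NULL);
	bpf_object__load(obj);		/* prog exists, but runs on nothing yet */
	prog = bpf_object__find_program_by_title(obj,
						 "raw_tracepoint/sched_switch");

	link = bpf_program__attach_raw_tracepoint(prog, "sched_switch");
	/* the link now connects the tracepoint with the prog:
	 * the prog runs on every sched_switch event */

	bpf_link__destroy(link);	/* last link ref dropped -> auto-detach */
	bpf_object__close(obj);		/* last prog ref dropped -> prog is freed */
}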

For cgroups we don't have such concept. We thought that three attach modes we
introduced (default, allow-override, allow-multi) would cover all use cases. But
in practice it turned out that it only works when there is a central daemon for
_all_ cgroup-bpf progs in the system, otherwise different processes step on each
other. More so, there has to be a central diff-review human authority, otherwise
teams step on each other. That sort-of works within one org, but doesn't
scale.

To avoid making systemd a central place to coordinate attaching xdp, tc, cgroup
progs the kernel has to provide a mechanism for an application to connect a
kernel object with a prog and hold the ownership of that link so that no other
process in the system can break that connection. That kernel object is cgroup,
qdisc, netdev. Interesting question comes when that object disappears. What to
do with the link? Two ways to solve it:
1. make link hold the object, so it cannot be removed.
2. destroy the link when object goes away.
Both have pros and cons as I mentioned earlier.
And that's what's to be decided.
I think the truth is somewhat in the middle. The link has to hold the object,
so it doesn't disappear from under it, but get notified on deletion, so the
link can be self destroyed. From the user point of view the execution guarantee
is still preserved. The kernel object was removed and the link has one dangling
side. Note this behavior is vastly different from existing xdp, tc, cgroup
behavior where both object and bpf prog can be alive, but connection is gone
and execution guarantee is broken.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-05  1:07                               ` Alexei Starovoitov
@ 2020-03-05  8:16                                 ` Jakub Kicinski
  2020-03-05 11:05                                   ` Toke Høiland-Jørgensen
  2020-03-05 16:39                                   ` Alexei Starovoitov
  0 siblings, 2 replies; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-05  8:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Networking, Kernel Team

On Wed, 4 Mar 2020 17:07:08 -0800, Alexei Starovoitov wrote:
> > Maybe also the thief should not have CAP_ADMIN in the first place?
> > And ask a daemon to perform its actions..  
> 
> a daemon idea keeps coming back in circles.
> With FD-based kprobe/uprobe/tracepoint/fexit/fentry that problem is gone,
> but xdp, tc, cgroup still don't have the owner concept.
> Some people argued that these three need three separate daemons.
> Especially since cgroups are mainly managed by systemd plus container
> manager it's quite different from networking (xdp, tc) where something
> like 'networkd' might makes sense.
> But if you take this line of thought all the ways systemd should be that
> single daemon to coordinate attaching to xdp, tc, cgroup because
> in many cases cgroup and tc progs have to coordinate the work.

The feature creep could happen, but Toke's proposal has a fairly simple
feature set, which should be easy to cover by a stand alone daemon.

Toke, I saw that in the library discussion there was no mention of 
a daemon, what makes a daemon solution unsuitable?

> At that's where it's getting gloomy... unless the kernel can provide
> a facility so central daemon is not necessary.
> 
> > > current xdp, tc, cgroup apis don't have the concept of the link
> > > and owner of that link.  
> > 
> > Why do the attachment points have to have a concept of an owner and 
> > not the program itself?  
> 
> bpf program is an object. That object has an owner or multiple owners.
> A user process that holds a pointer to that object is a shared owner.
> FD is such pointer. FD == std::shared_ptr<bpf_prog>.
> Holding that pointer guarantees that <bpf_prog> will not disappear,
> but it says nothing that the program will keep running.
> For [ku]probe,tp,fentry,fexit there was always <bpf_link> in the kernel.
> It wasn't that formal in the past until most recent Andrii's patches,
> but the concept existed for long time. FD == std::shared_ptr<bpf_link>
> connects a kernel object with <bpf_prog>. When that kernel objects emits
> an event the <bpf_link> guarantees that <bpf_prog> will be executed.

I see so the link is sort of [owner -> prog -> target].

> For cgroups we don't have such concept. We thought that three attach modes we
> introduced (default, allow-override, allow-multi) will cover all use cases. But
> in practice turned out that it only works when there is a central daemon for
> _all_ cgroup-bpf progs in the system otherwise different processes step on each
> other. More so there has to be a central diff-review human authority otherwise
> teams step on each other. That's sort-of works within one org, but doesn't
> scale.
> 
> To avoid making systemd a central place to coordinate attaching xdp, tc, cgroup
> progs the kernel has to provide a mechanism for an application to connect a
> kernel object with a prog and hold the ownership of that link so that no other
> process in the system can break that connection. 

To me for XDP the promise that nothing breaks the connection cannot be
made without a daemon, because without the daemon the link has to be
available somewhere/pinned to make changes to, and therefore is no
longer safe. (Lock but with a key right next to it, in the previous
analogies.)

And the daemon IMHO can just monitor the changes. No different from how we
would monitor for applications fiddling with any other networking state,
addresses, routes, device config, you name it. XDP changes already fire a
link change notification, that's been there probably from day one.
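
(A minimal sketch of such a monitor with plain rtnetlink, not tied to this
series: subscribe to RTMGRP_LINK and report RTM_NEWLINK messages carrying
an IFLA_XDP attribute; error handling kept to a minimum:)

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_link.h>

int main(void)
{
	struct sockaddr_nl addr = {
		.nl_family = AF_NETLINK,
		.nl_groups = RTMGRP_LINK,
	};
	char buf[8192];
	int sock, len;

	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	if (sock < 0 || bind(sock, (struct sockaddr *)&addr, sizeof(addr)))
		return 1;

	while ((len = recv(sock, buf, sizeof(buf), 0)) > 0) {
		struct nlmsghdr *nh;

		for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
		     nh = NLMSG_NEXT(nh, len)) {
			struct ifinfomsg *ifi;
			struct rtattr *rta;
			int alen;

			if (nh->nlmsg_type != RTM_NEWLINK)
				continue;
			ifi = NLMSG_DATA(nh);
			alen = IFLA_PAYLOAD(nh);
			for (rta = IFLA_RTA(ifi); RTA_OK(rta, alen);
			     rta = RTA_NEXT(rta, alen)) {
				/* mask off NLA_F_NESTED & co. before comparing */
				if ((rta->rta_type & NLA_TYPE_MASK) == IFLA_XDP)
					printf("XDP change on ifindex %d\n",
					       ifi->ifi_index);
			}
		}
	}
	close(sock);
	return 0;
}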

> That kernel object is cgroup,
> qdisc, netdev. Interesting question comes when that object disappears. What to
> do with the link? Two ways to solve it:
> 1. make link hold the object, so it cannot be removed.
> 2. destroy the link when object goes away.
> Both have pros and cons as I mentioned earlier. And that's what's to be decided.
> I think the truth is somewhat in the middle. The link has to hold the object,
> so it doesn't disappear from under it, but get notified on deletion, so the
> link can be self destroyed. From the user point of view the execution guarantee
> is still preserved. The kernel object was removed and the link has one dangling
> side. Note this behavior is vastly different from existing xdp, tc, cgroup
> behavior where both object and bpf prog can be alive, but connection is gone
> and execution guarantee is broken.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-04 15:47                           ` Alexei Starovoitov
@ 2020-03-05 10:37                             ` Toke Høiland-Jørgensen
  2020-03-05 16:34                               ` Alexei Starovoitov
  0 siblings, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-05 10:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Networking, Kernel Team

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote:
>> >
>> >> And what about the case where the link fd is pinned on a bpffs that is
>> >> no longer available? I.e., if a netdevice with an XDP program moves
>> >> namespaces and no longer has access to the original bpffs, that XDP
>> >> program would essentially become immutable?
>> >
>> > 'immutable' will not be possible.
>> > I'm not clear to me how bpffs is going to disappear. What do you mean
>> > exactly?
>> 
>> # stat /sys/fs/bpf | grep Device
>> Device: 1fh/31d	Inode: 1013963     Links: 2
>> # mkdir /sys/fs/bpf/test; ls /sys/fs/bpf
>> test
>> # ip netns add test
>> # ip netns exec test stat /sys/fs/bpf/test
>> stat: cannot stat '/sys/fs/bpf/test': No such file or directory
>> # ip netns exec test stat /sys/fs/bpf | grep Device
>> Device: 3fh/63d	Inode: 12242       Links: 2
>> 
>> It's a different bpffs instance inside the netns, so it won't have
>> access to anything pinned in the outer one...
>
> Toke, please get your facts straight.
>
>> # stat /sys/fs/bpf | grep Device
>> Device: 1fh/31d	Inode: 1013963     Links: 2
>
> Inode != 1 means that this is not bpffs.
> I guess this is still sysfs.

Yes, my bad; I was confused because I was misremembering when 'ip'
mounts a new bpffs: I thought it was on every ns change, but it's only
when loading a BPF program, and I was in a hurry so I didn't check
properly; sorry about that.

Anyway, what I was trying to express:

> Still that doesn't mean that pinned link is 'immutable'.

I don't mean 'immutable' in the sense that it cannot be removed ever.
Just that we may end up in a situation where an application can see a
netdev with an XDP program attached, has the right privileges to modify
it, but can't because it can't find the pinned bpf_link. Right? Or am I
misunderstanding your proposal?

Amending my example from before, this could happen by:

1. Someone attaches a program to eth0, and pins the bpf_link to
   /sys/fs/bpf/myprog

2. eth0 is moved to a different namespace which mounts a new sysfs at
   /sys

3. Inside that namespace, /sys/fs/bpf/myprog is no longer accessible, so
   xdp-loader can't get access to the original bpf_link; but the XDP
   program is still attached to eth0.

If this happens, and the bpf_link locks the program in place, doesn't
that mean that anything executed inside that namespace will be
unable to remove the XDP program? Thus making it 'immutable' from their
PoV.

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-05  8:16                                 ` Jakub Kicinski
@ 2020-03-05 11:05                                   ` Toke Høiland-Jørgensen
  2020-03-05 18:13                                     ` Jakub Kicinski
  2020-03-05 16:39                                   ` Alexei Starovoitov
  1 sibling, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-05 11:05 UTC (permalink / raw)
  To: Jakub Kicinski, Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Networking, Kernel Team

Jakub Kicinski <kuba@kernel.org> writes:

> On Wed, 4 Mar 2020 17:07:08 -0800, Alexei Starovoitov wrote:
>> > Maybe also the thief should not have CAP_ADMIN in the first place?
>> > And ask a daemon to perform its actions..  
>> 
>> a daemon idea keeps coming back in circles.
>> With FD-based kprobe/uprobe/tracepoint/fexit/fentry that problem is gone,
>> but xdp, tc, cgroup still don't have the owner concept.
>> Some people argued that these three need three separate daemons.
>> Especially since cgroups are mainly managed by systemd plus container
>> manager it's quite different from networking (xdp, tc) where something
>> like 'networkd' might makes sense.
>> But if you take this line of thought all the ways systemd should be that
>> single daemon to coordinate attaching to xdp, tc, cgroup because
>> in many cases cgroup and tc progs have to coordinate the work.
>
> The feature creep could happen, but Toke's proposal has a fairly simple
> feature set, which should be easy to cover by a stand alone daemon.
>
> Toke, I saw that in the library discussion there was no mention of 
> a daemon, what makes a daemon solution unsuitable?

Quoting from the last discussion[0]:

> - Introducing a new, separate code base that we'll have to write, support
>   and manage updates to.
> 
> - Add a new dependency to using XDP (now you not only need the kernel
>   and libraries, you'll also need the daemon).
> 
> - Have to duplicate or wrap functionality currently found in the kernel;
>   at least:
>   
>     - Keeping track of which XDP programs are loaded and attached to
>       each interface (as well as the "new state" of their attachment
>       order).
> 
>     - Some kind of interface with the verifier; if an app does
>       xdpd_rpc_load(prog), how is the verifier result going to get back
>       to the caller?
> 
> - Have to deal with state synchronisation issues (how does xdpd handle
>   kernel state changing from underneath it?).
> 
> While these are issues that are (probably) all solvable, I think the
> cost of solving them is far higher than putting the support into the
> kernel. Which is why I think kernel support is the best solution :)

The context was slightly different, since this was before we had
freplace support in the kernel. But apart from the point about the
verifier, I think the arguments still stand. In fact, now that we have
that, we don't even need userspace linking, so basically a daemon's only
task would be to arbitrate access to the XDP hook? In my book,
arbitrating access to resources is what the kernel is all about...

-Toke

[0] https://lore.kernel.org/bpf/m/874l07fu61.fsf@toke.dk


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-05 10:37                             ` Toke Høiland-Jørgensen
@ 2020-03-05 16:34                               ` Alexei Starovoitov
  2020-03-05 22:34                                 ` Daniel Borkmann
  0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2020-03-05 16:34 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Andrii Nakryiko, bpf, Networking, Kernel Team

On Thu, Mar 05, 2020 at 11:37:11AM +0100, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote:
> >> >
> >> >> And what about the case where the link fd is pinned on a bpffs that is
> >> >> no longer available? I.e., if a netdevice with an XDP program moves
> >> >> namespaces and no longer has access to the original bpffs, that XDP
> >> >> program would essentially become immutable?
> >> >
> >> > 'immutable' will not be possible.
> >> > I'm not clear to me how bpffs is going to disappear. What do you mean
> >> > exactly?
> >> 
> >> # stat /sys/fs/bpf | grep Device
> >> Device: 1fh/31d	Inode: 1013963     Links: 2
> >> # mkdir /sys/fs/bpf/test; ls /sys/fs/bpf
> >> test
> >> # ip netns add test
> >> # ip netns exec test stat /sys/fs/bpf/test
> >> stat: cannot stat '/sys/fs/bpf/test': No such file or directory
> >> # ip netns exec test stat /sys/fs/bpf | grep Device
> >> Device: 3fh/63d	Inode: 12242       Links: 2
> >> 
> >> It's a different bpffs instance inside the netns, so it won't have
> >> access to anything pinned in the outer one...
> >
> > Toke, please get your facts straight.
> >
> >> # stat /sys/fs/bpf | grep Device
> >> Device: 1fh/31d	Inode: 1013963     Links: 2
> >
> > Inode != 1 means that this is not bpffs.
> > I guess this is still sysfs.
> 
> Yes, my bad; I was confused because I was misremembering when 'ip'
> mounts a new bpffs: I thought it was on every ns change, but it's only
> when loading a BPF program, and I was in a hurry so I didn't check
> properly; sorry about that.
> 
> Anyway, what I was trying to express:
> 
> > Still that doesn't mean that pinned link is 'immutable'.
> 
> I don't mean 'immutable' in the sense that it cannot be removed ever.
> Just that we may end up in a situation where an application can see a
> netdev with an XDP program attached, has the right privileges to modify
> it, but can't because it can't find the pinned bpf_link. Right? Or am I
> misunderstanding your proposal?
> 
> Amending my example from before, this could happen by:
> 
> 1. Someone attaches a program to eth0, and pins the bpf_link to
>    /sys/fs/bpf/myprog
> 
> 2. eth0 is moved to a different namespace which mounts a new sysfs at
>    /sys
> 
> 3. Inside that namespace, /sys/fs/bpf/myprog is no longer accessible, so
>    xdp-loader can't get access to the original bpf_link; but the XDP
>    program is still attached to eth0.

The key to decide is whether moving netdev across netns should be allowed
when xdp attached. I think it should be denied. Even when legacy xdp
program is attached, since it will confuse user space managing part.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-05  8:16                                 ` Jakub Kicinski
  2020-03-05 11:05                                   ` Toke Høiland-Jørgensen
@ 2020-03-05 16:39                                   ` Alexei Starovoitov
  1 sibling, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2020-03-05 16:39 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Networking, Kernel Team

On Thu, Mar 05, 2020 at 12:16:20AM -0800, Jakub Kicinski wrote:
> > 
> > bpf program is an object. That object has an owner or multiple owners.
> > A user process that holds a pointer to that object is a shared owner.
> > FD is such pointer. FD == std::shared_ptr<bpf_prog>.
> > Holding that pointer guarantees that <bpf_prog> will not disappear,
> > but it says nothing that the program will keep running.
> > For [ku]probe,tp,fentry,fexit there was always <bpf_link> in the kernel.
> > It wasn't that formal in the past until most recent Andrii's patches,
> > but the concept existed for long time. FD == std::shared_ptr<bpf_link>
> > connects a kernel object with <bpf_prog>. When that kernel objects emits
> > an event the <bpf_link> guarantees that <bpf_prog> will be executed.
> 
> I see so the link is sort of [owner -> prog -> target].

No. Link has two pieces: kernel object and program. It connects the two.
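
(For reference, a simplified sketch of that kernel-side object; field names
are approximate, see patch #1 of this series for the actual definition:)

    struct bpf_link {
            atomic64_t refcnt;              /* shared ownership; each FD holds a ref */
            const struct bpf_link_ops *ops; /* how to detach from the kernel object */
            struct bpf_prog *prog;          /* the BPF program this link attaches */
    };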

> > For cgroups we don't have such concept. We thought that three attach modes we
> > introduced (default, allow-override, allow-multi) will cover all use cases. But
> > in practice turned out that it only works when there is a central daemon for
> > _all_ cgroup-bpf progs in the system otherwise different processes step on each
> > other. More so there has to be a central diff-review human authority otherwise
> > teams step on each other. That's sort-of works within one org, but doesn't
> > scale.
> > 
> > To avoid making systemd a central place to coordinate attaching xdp, tc, cgroup
> > progs the kernel has to provide a mechanism for an application to connect a
> > kernel object with a prog and hold the ownership of that link so that no other
> > process in the system can break that connection. 
> 
> To me for XDP the promise that nothing breaks the connection cannot be
> made without a daemon, because without the daemon the link has to be
> available somewhere/pinned to make changes to, and therefore is no
> longer safe.

This is not true. Nothing requires pinning.
Again. FD == shared_ptr. Two applications can share that link and
coordinate changes to the dispatcher prog without a daemon and without pinning.
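
(A minimal sketch of what such sharing can look like -- standard SCM_RIGHTS
fd passing over a connected AF_UNIX socket; "sock" and "link_fd" are
placeholders for whatever the two applications agree on:)

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* sender side: hand link_fd to the peer process over a unix socket */
    static int send_link_fd(int sock, int link_fd)
    {
            char dummy = 'x';
            struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
            union {
                    struct cmsghdr align;
                    char buf[CMSG_SPACE(sizeof(int))];
            } u;
            struct msghdr msg = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
            };
            struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SCM_RIGHTS;
            cmsg->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(cmsg), &link_fd, sizeof(int));

            return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }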

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-05 11:05                                   ` Toke Høiland-Jørgensen
@ 2020-03-05 18:13                                     ` Jakub Kicinski
  2020-03-09 11:41                                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-05 18:13 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Andrii Nakryiko, bpf, Networking, Kernel Team

On Thu, 05 Mar 2020 12:05:23 +0100 Toke Høiland-Jørgensen wrote:
> Jakub Kicinski <kuba@kernel.org> writes:
> > On Wed, 4 Mar 2020 17:07:08 -0800, Alexei Starovoitov wrote:  
> >> > Maybe also the thief should not have CAP_ADMIN in the first place?
> >> > And ask a daemon to perform its actions..    
> >> 
> >> a daemon idea keeps coming back in circles.
> >> With FD-based kprobe/uprobe/tracepoint/fexit/fentry that problem is gone,
> >> but xdp, tc, cgroup still don't have the owner concept.
> >> Some people argued that these three need three separate daemons.
> >> Especially since cgroups are mainly managed by systemd plus container
> >> manager it's quite different from networking (xdp, tc) where something
> >> like 'networkd' might makes sense.
> >> But if you take this line of thought all the ways systemd should be that
> >> single daemon to coordinate attaching to xdp, tc, cgroup because
> >> in many cases cgroup and tc progs have to coordinate the work.  
> >
> > The feature creep could happen, but Toke's proposal has a fairly simple
> > feature set, which should be easy to cover by a stand alone daemon.
> >
> > Toke, I saw that in the library discussion there was no mention of 
> > a daemon, what makes a daemon solution unsuitable?  
> 
> Quoting from the last discussion[0]:
> 
> > - Introducing a new, separate code base that we'll have to write, support
> >   and manage updates to.
> >
> > - Add a new dependency to using XDP (now you not only need the kernel
> >   and libraries, you'll also need the daemon).
> >
> > - Have to duplicate or wrap functionality currently found in the kernel;
> >   at least:
> >   
> >     - Keeping track of which XDP programs are loaded and attached to
> >       each interface (as well as the "new state" of their attachment
> >       order).
> >
> >     - Some kind of interface with the verifier; if an app does
> >       xdpd_rpc_load(prog), how is the verifier result going to get back
> >       to the caller?
> >
> > - Have to deal with state synchronisation issues (how does xdpd handle
> >   kernel state changing from underneath it?).
> > 
> > While these are issues that are (probably) all solvable, I think the
> > cost of solving them is far higher than putting the support into the
> > kernel. Which is why I think kernel support is the best solution :)  
> 
> The context was slightly different, since this was before we had
> freplace support in the kernel. But apart from the point about the
> verifier, I think the arguments still stand. In fact, now that we have
> that, we don't even need userspace linking, so basically a daemon's only
> task would be to arbitrate access to the XDP hook? In my book,
> arbitrating access to resources is what the kernel is all about...

You said that like the library doesn't arbitrate access and manage
resources.. It does exactly the same work the daemon would do.

Your prog chaining in the kernel proposal, now that would be a kernel
mechanism, but that's not what we're discussing here.

Daemon just trades off the complexity of making calls for the complexity
of the system and serializing/de-serializing the state.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-05 16:34                               ` Alexei Starovoitov
@ 2020-03-05 22:34                                 ` Daniel Borkmann
  2020-03-05 22:50                                   ` Alexei Starovoitov
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Borkmann @ 2020-03-05 22:34 UTC (permalink / raw)
  To: Alexei Starovoitov, Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Networking, Kernel Team

On 3/5/20 5:34 PM, Alexei Starovoitov wrote:
> On Thu, Mar 05, 2020 at 11:37:11AM +0100, Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>> On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote:
[...]
>> Anyway, what I was trying to express:
>>
>>> Still that doesn't mean that pinned link is 'immutable'.
>>
>> I don't mean 'immutable' in the sense that it cannot be removed ever.
>> Just that we may end up in a situation where an application can see a
>> netdev with an XDP program attached, has the right privileges to modify
>> it, but can't because it can't find the pinned bpf_link. Right? Or am I
>> misunderstanding your proposal?
>>
>> Amending my example from before, this could happen by:
>>
>> 1. Someone attaches a program to eth0, and pins the bpf_link to
>>     /sys/fs/bpf/myprog
>>
>> 2. eth0 is moved to a different namespace which mounts a new sysfs at
>>     /sys
>>
>> 3. Inside that namespace, /sys/fs/bpf/myprog is no longer accessible, so
>>     xdp-loader can't get access to the original bpf_link; but the XDP
>>     program is still attached to eth0.
> 
> The key to decide is whether moving netdev across netns should be allowed
> when xdp attached. I think it should be denied. Even when legacy xdp
> program is attached, since it will confuse user space managing part.

There are perfectly valid use cases where this is done already today (minus
bpf_link), for example, consider an orchestrator that is setting up the BPF
program on the device, moving to the newly created application pod during
the CNI call in k8s, such that the new pod does not have the /sys/fs/bpf/
mount instance and if unprivileged cannot remove the BPF prog from the dev
either. We do something like this in case of ipvlan, meaning, we attach a
rootlet prog that calls into single slot of a tail call map, move it to the
application pod, and only out of Cilium's own pod and it's pod-local bpf fs
instance we manage the pinned tail call map to update the main programs in
that single slot w/o having to switch any netns later on.
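
(A minimal sketch of such a rootlet, assuming a single-slot
BPF_MAP_TYPE_PROG_ARRAY that the agent later updates through its pinned map
fd; map and function names are made up, and whether the rootlet runs at tc
or XDP doesn't matter for the illustration:)

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
            __uint(max_entries, 1);
            __uint(key_size, sizeof(__u32));
            __uint(value_size, sizeof(__u32));
    } policy_call_map SEC(".maps");

    SEC("classifier")
    int rootlet(struct __sk_buff *skb)
    {
            /* jump into slot 0 of the prog array; if the slot is empty,
             * the tail call fails and we simply fall through */
            bpf_tail_call(skb, &policy_call_map, 0);
            return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";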

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-05 22:34                                 ` Daniel Borkmann
@ 2020-03-05 22:50                                   ` Alexei Starovoitov
  2020-03-05 23:42                                     ` Daniel Borkmann
  0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2020-03-05 22:50 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Andrii Nakryiko, Andrii Nakryiko, bpf, Networking, Kernel Team

On Thu, Mar 05, 2020 at 11:34:18PM +0100, Daniel Borkmann wrote:
> On 3/5/20 5:34 PM, Alexei Starovoitov wrote:
> > On Thu, Mar 05, 2020 at 11:37:11AM +0100, Toke Høiland-Jørgensen wrote:
> > > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> > > > On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote:
> [...]
> > > Anyway, what I was trying to express:
> > > 
> > > > Still that doesn't mean that pinned link is 'immutable'.
> > > 
> > > I don't mean 'immutable' in the sense that it cannot be removed ever.
> > > Just that we may end up in a situation where an application can see a
> > > netdev with an XDP program attached, has the right privileges to modify
> > > it, but can't because it can't find the pinned bpf_link. Right? Or am I
> > > misunderstanding your proposal?
> > > 
> > > Amending my example from before, this could happen by:
> > > 
> > > 1. Someone attaches a program to eth0, and pins the bpf_link to
> > >     /sys/fs/bpf/myprog
> > > 
> > > 2. eth0 is moved to a different namespace which mounts a new sysfs at
> > >     /sys
> > > 
> > > 3. Inside that namespace, /sys/fs/bpf/myprog is no longer accessible, so
> > >     xdp-loader can't get access to the original bpf_link; but the XDP
> > >     program is still attached to eth0.
> > 
> > The key to decide is whether moving netdev across netns should be allowed
> > when xdp attached. I think it should be denied. Even when legacy xdp
> > program is attached, since it will confuse user space managing part.
> 
> There are perfectly valid use cases where this is done already today (minus
> bpf_link), for example, consider an orchestrator that is setting up the BPF
> program on the device, moving to the newly created application pod during
> the CNI call in k8s, such that the new pod does not have the /sys/fs/bpf/
> mount instance and if unprivileged cannot remove the BPF prog from the dev
> either. We do something like this in case of ipvlan, meaning, we attach a
> rootlet prog that calls into single slot of a tail call map, move it to the
> application pod, and only out of Cilium's own pod and it's pod-local bpf fs
> instance we manage the pinned tail call map to update the main programs in
> that single slot w/o having to switch any netns later on.

Right. You mentioned this use case before, but I managed to forget about it.
Totally makes sense for prog to stay attached to netdev when it's moved.
I think pod manager would also prefer that pod is not able to replace
xdp prog from inside the container. It sounds to me that steps 1,2,3 above
is exactly the desired behavior. Otherwise what stops some application
that started in a pod to override it?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-05 22:50                                   ` Alexei Starovoitov
@ 2020-03-05 23:42                                     ` Daniel Borkmann
  2020-03-06  8:31                                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Borkmann @ 2020-03-05 23:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Andrii Nakryiko, Andrii Nakryiko, bpf, Networking, Kernel Team

On 3/5/20 11:50 PM, Alexei Starovoitov wrote:
> On Thu, Mar 05, 2020 at 11:34:18PM +0100, Daniel Borkmann wrote:
>> On 3/5/20 5:34 PM, Alexei Starovoitov wrote:
>>> On Thu, Mar 05, 2020 at 11:37:11AM +0100, Toke Høiland-Jørgensen wrote:
>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>>>> On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote:
>> [...]
>>>> Anyway, what I was trying to express:
>>>>
>>>>> Still that doesn't mean that pinned link is 'immutable'.
>>>>
>>>> I don't mean 'immutable' in the sense that it cannot be removed ever.
>>>> Just that we may end up in a situation where an application can see a
>>>> netdev with an XDP program attached, has the right privileges to modify
>>>> it, but can't because it can't find the pinned bpf_link. Right? Or am I
>>>> misunderstanding your proposal?
>>>>
>>>> Amending my example from before, this could happen by:
>>>>
>>>> 1. Someone attaches a program to eth0, and pins the bpf_link to
>>>>      /sys/fs/bpf/myprog
>>>>
>>>> 2. eth0 is moved to a different namespace which mounts a new sysfs at
>>>>      /sys
>>>>
>>>> 3. Inside that namespace, /sys/fs/bpf/myprog is no longer accessible, so
>>>>      xdp-loader can't get access to the original bpf_link; but the XDP
>>>>      program is still attached to eth0.
>>>
>>> The key to decide is whether moving netdev across netns should be allowed
>>> when xdp attached. I think it should be denied. Even when legacy xdp
>>> program is attached, since it will confuse user space managing part.
>>
>> There are perfectly valid use cases where this is done already today (minus
>> bpf_link), for example, consider an orchestrator that is setting up the BPF
>> program on the device, moving to the newly created application pod during
>> the CNI call in k8s, such that the new pod does not have the /sys/fs/bpf/
>> mount instance and if unprivileged cannot remove the BPF prog from the dev
>> either. We do something like this in case of ipvlan, meaning, we attach a
>> rootlet prog that calls into single slot of a tail call map, move it to the
>> application pod, and only out of Cilium's own pod and it's pod-local bpf fs
>> instance we manage the pinned tail call map to update the main programs in
>> that single slot w/o having to switch any netns later on.
> 
> Right. You mentioned this use case before, but I managed to forget about it.
> Totally makes sense for prog to stay attached to netdev when it's moved.
> I think pod manager would also prefer that pod is not able to replace
> xdp prog from inside the container. It sounds to me that steps 1,2,3 above
> is exactly the desired behavior. Otherwise what stops some application
> that started in a pod to override it?

Generally, yes, and it shouldn't need to care nor see what is happening in
/sys/fs/bpf/ from the orchestrator at least (or could potentially have its
own private mount under /sys/fs/bpf/ or elsewhere). Ideally, the behavior
should be that orchestrator does all the setup out of its own namespace,
then moves everything over to the newly created target namespace and e.g.
only if the pod has the capable(cap_sys_admin) permissions, it could mess
around with anything attached there, or via similar model as done in [0]
when there is a master device. Last time I looked, there is a down/up cycle
on the dev upon netns migration and it flushes e.g. attached qdiscs afaik, so
there are limitations that still need to be addressed. Not sure atm if same
is happening to XDP right now. In this regards veth devices are a bit nicer
to work with since everything can be attached on hostns ingress w/o needing
to worry on the peer dev in the pod's netns.

   [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7cc9f7003a969d359f608ebb701d42cafe75b84a

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-05 23:42                                     ` Daniel Borkmann
@ 2020-03-06  8:31                                       ` Toke Høiland-Jørgensen
  2020-03-06 10:25                                         ` Daniel Borkmann
  0 siblings, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-06  8:31 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Networking, Kernel Team

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 3/5/20 11:50 PM, Alexei Starovoitov wrote:
>> On Thu, Mar 05, 2020 at 11:34:18PM +0100, Daniel Borkmann wrote:
>>> On 3/5/20 5:34 PM, Alexei Starovoitov wrote:
>>>> On Thu, Mar 05, 2020 at 11:37:11AM +0100, Toke Høiland-Jørgensen wrote:
>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>>>>> On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote:
>>> [...]
>>>>> Anyway, what I was trying to express:
>>>>>
>>>>>> Still that doesn't mean that pinned link is 'immutable'.
>>>>>
>>>>> I don't mean 'immutable' in the sense that it cannot be removed ever.
>>>>> Just that we may end up in a situation where an application can see a
>>>>> netdev with an XDP program attached, has the right privileges to modify
>>>>> it, but can't because it can't find the pinned bpf_link. Right? Or am I
>>>>> misunderstanding your proposal?
>>>>>
>>>>> Amending my example from before, this could happen by:
>>>>>
>>>>> 1. Someone attaches a program to eth0, and pins the bpf_link to
>>>>>      /sys/fs/bpf/myprog
>>>>>
>>>>> 2. eth0 is moved to a different namespace which mounts a new sysfs at
>>>>>      /sys
>>>>>
>>>>> 3. Inside that namespace, /sys/fs/bpf/myprog is no longer accessible, so
>>>>>      xdp-loader can't get access to the original bpf_link; but the XDP
>>>>>      program is still attached to eth0.
>>>>
>>>> The key to decide is whether moving netdev across netns should be allowed
>>>> when xdp attached. I think it should be denied. Even when legacy xdp
>>>> program is attached, since it will confuse user space managing part.
>>>
>>> There are perfectly valid use cases where this is done already today (minus
>>> bpf_link), for example, consider an orchestrator that is setting up the BPF
>>> program on the device, moving to the newly created application pod during
>>> the CNI call in k8s, such that the new pod does not have the /sys/fs/bpf/
>>> mount instance and if unprivileged cannot remove the BPF prog from the dev
>>> either. We do something like this in case of ipvlan, meaning, we attach a
>>> rootlet prog that calls into single slot of a tail call map, move it to the
>>> application pod, and only out of Cilium's own pod and it's pod-local bpf fs
>>> instance we manage the pinned tail call map to update the main programs in
>>> that single slot w/o having to switch any netns later on.
>> 
>> Right. You mentioned this use case before, but I managed to forget about it.
>> Totally makes sense for prog to stay attached to netdev when it's moved.
>> I think pod manager would also prefer that pod is not able to replace
>> xdp prog from inside the container. It sounds to me that steps 1,2,3 above
>> is exactly the desired behavior. Otherwise what stops some application
>> that started in a pod to override it?
>
> Generally, yes, and it shouldn't need to care nor see what is happening in
> /sys/fs/bpf/ from the orchestrator at least (or could potentially have its
> own private mount under /sys/fs/bpf/ or elsewhere). Ideally, the behavior
> should be that orchestrator does all the setup out of its own namespace,
> then moves everything over to the newly created target namespace and e.g.
> only if the pod has the capable(cap_sys_admin) permissions, it could mess
> around with anything attached there, or via similar model as done in [0]
> when there is a master device.

Yup, I can see how this can be a reasonable use case where you *would*
want the locking. However, my concern is that there should be a way for
an admin to recover from this (say, if it happens by mistake, or a
misbehaving application). Otherwise, I fear we'll end up with support
cases where the only answer is "try rebooting", which is obviously not
ideal.

> Last time I looked, there is a down/up cycle on the dev upon netns
> migration and it flushes e.g. attached qdiscs afaik, so there are
> limitations that still need to be addressed. Not sure atm if same is
> happening to XDP right now.

XDP programs will stay attached. devmaps (for redirect) have a notifier
that will remove devices when they move out of a namespace. Not sure if
there are any other issues with netns moves somewhere.

> In this regards veth devices are a bit nicer to work with since
> everything can be attached on hostns ingress w/o needing to worry on
> the peer dev in the pod's netns.

Presumably the XDP EGRESS hook that David Ahern is working on will make
this doable for XDP on veth as well?

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-06  8:31                                       ` Toke Høiland-Jørgensen
@ 2020-03-06 10:25                                         ` Daniel Borkmann
  2020-03-06 10:42                                           ` Toke Høiland-Jørgensen
  2020-03-06 18:09                                           ` David Ahern
  0 siblings, 2 replies; 50+ messages in thread
From: Daniel Borkmann @ 2020-03-06 10:25 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Networking, Kernel Team

On 3/6/20 9:31 AM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>> On 3/5/20 11:50 PM, Alexei Starovoitov wrote:
>>> On Thu, Mar 05, 2020 at 11:34:18PM +0100, Daniel Borkmann wrote:
>>>> On 3/5/20 5:34 PM, Alexei Starovoitov wrote:
>>>>> On Thu, Mar 05, 2020 at 11:37:11AM +0100, Toke Høiland-Jørgensen wrote:
>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>>>>>> On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote:
>>>> [...]
>>>>>> Anyway, what I was trying to express:
>>>>>>
>>>>>>> Still that doesn't mean that pinned link is 'immutable'.
>>>>>>
>>>>>> I don't mean 'immutable' in the sense that it cannot be removed ever.
>>>>>> Just that we may end up in a situation where an application can see a
>>>>>> netdev with an XDP program attached, has the right privileges to modify
>>>>>> it, but can't because it can't find the pinned bpf_link. Right? Or am I
>>>>>> misunderstanding your proposal?
>>>>>>
>>>>>> Amending my example from before, this could happen by:
>>>>>>
>>>>>> 1. Someone attaches a program to eth0, and pins the bpf_link to
>>>>>>       /sys/fs/bpf/myprog
>>>>>>
>>>>>> 2. eth0 is moved to a different namespace which mounts a new sysfs at
>>>>>>       /sys
>>>>>>
>>>>>> 3. Inside that namespace, /sys/fs/bpf/myprog is no longer accessible, so
>>>>>>       xdp-loader can't get access to the original bpf_link; but the XDP
>>>>>>       program is still attached to eth0.
>>>>>
>>>>> The key to decide is whether moving netdev across netns should be allowed
>>>>> when xdp attached. I think it should be denied. Even when legacy xdp
>>>>> program is attached, since it will confuse user space managing part.
>>>>
>>>> There are perfectly valid use cases where this is done already today (minus
>>>> bpf_link), for example, consider an orchestrator that is setting up the BPF
>>>> program on the device, moving to the newly created application pod during
>>>> the CNI call in k8s, such that the new pod does not have the /sys/fs/bpf/
>>>> mount instance and if unprivileged cannot remove the BPF prog from the dev
>>>> either. We do something like this in case of ipvlan, meaning, we attach a
>>>> rootlet prog that calls into single slot of a tail call map, move it to the
>>>> application pod, and only out of Cilium's own pod and it's pod-local bpf fs
>>>> instance we manage the pinned tail call map to update the main programs in
>>>> that single slot w/o having to switch any netns later on.
>>>
>>> Right. You mentioned this use case before, but I managed to forget about it.
>>> Totally makes sense for prog to stay attached to netdev when it's moved.
>>> I think pod manager would also prefer that pod is not able to replace
>>> xdp prog from inside the container. It sounds to me that steps 1,2,3 above
>>> is exactly the desired behavior. Otherwise what stops some application
>>> that started in a pod to override it?
>>
>> Generally, yes, and it shouldn't need to care nor see what is happening in
>> /sys/fs/bpf/ from the orchestrator at least (or could potentially have its
>> own private mount under /sys/fs/bpf/ or elsewhere). Ideally, the behavior
>> should be that orchestrator does all the setup out of its own namespace,
>> then moves everything over to the newly created target namespace and e.g.
>> only if the pod has the capable(cap_sys_admin) permissions, it could mess
>> around with anything attached there, or via similar model as done in [0]
>> when there is a master device.
> 
> Yup, I can see how this can be a reasonable use case where you *would*
> want the locking. However, my concern is that there should be a way for
> an admin to recover from this (say, if it happens by mistake, or a
> misbehaving application). Otherwise, I fear we'll end up with support
> cases where the only answer is "try rebooting", which is obviously not
> ideal.

I'm not quite sure I follow the concern, if you're an admin and have the right
permissions, then you should be able to introspect and change settings like with
anything else there is today.

>> Last time I looked, there is a down/up cycle on the dev upon netns
>> migration and it flushes e.g. attached qdiscs afaik, so there are
>> limitations that still need to be addressed. Not sure atm if same is
>> happening to XDP right now.
> 
> XDP programs will stay attached. devmaps (for redirect) have a notifier
> that will remove devices when they move out of a namespace. Not sure if
> there are any other issues with netns moves somewhere.
> 
>> In this regards veth devices are a bit nicer to work with since
>> everything can be attached on hostns ingress w/o needing to worry on
>> the peer dev in the pod's netns.
> 
> Presumably the XDP EGRESS hook that David Ahern is working on will make
> this doable for XDP on veth as well?

I'm not sure I see a use-case for XDP egress for Cilium yet, but maybe I'm still
lacking a clear picture on why one should use it. We currently use various
layers where we orchestrate our BPF programs from the agent. XDP/rx on the phys
nic on the one end, BPF sock progs attached to cgroups on the other end of the
spectrum. The processing in between on virtual devices is mainly tc/BPF-based
since everything is skb based anyway and more flexible also with interaction
with the rest of the stack. There is also not this pain of having to linearize
all the skbs, but at least there's a path to tackle that.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-06 10:25                                         ` Daniel Borkmann
@ 2020-03-06 10:42                                           ` Toke Høiland-Jørgensen
  2020-03-06 18:09                                           ` David Ahern
  1 sibling, 0 replies; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-06 10:42 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Networking, Kernel Team

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 3/6/20 9:31 AM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>> On 3/5/20 11:50 PM, Alexei Starovoitov wrote:
>>>> On Thu, Mar 05, 2020 at 11:34:18PM +0100, Daniel Borkmann wrote:
>>>>> On 3/5/20 5:34 PM, Alexei Starovoitov wrote:
>>>>>> On Thu, Mar 05, 2020 at 11:37:11AM +0100, Toke Høiland-Jørgensen wrote:
>>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>>>>>>> On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote:
>>>>> [...]
>>>>>>> Anyway, what I was trying to express:
>>>>>>>
>>>>>>>> Still that doesn't mean that pinned link is 'immutable'.
>>>>>>>
>>>>>>> I don't mean 'immutable' in the sense that it cannot be removed ever.
>>>>>>> Just that we may end up in a situation where an application can see a
>>>>>>> netdev with an XDP program attached, has the right privileges to modify
>>>>>>> it, but can't because it can't find the pinned bpf_link. Right? Or am I
>>>>>>> misunderstanding your proposal?
>>>>>>>
>>>>>>> Amending my example from before, this could happen by:
>>>>>>>
>>>>>>> 1. Someone attaches a program to eth0, and pins the bpf_link to
>>>>>>>       /sys/fs/bpf/myprog
>>>>>>>
>>>>>>> 2. eth0 is moved to a different namespace which mounts a new sysfs at
>>>>>>>       /sys
>>>>>>>
>>>>>>> 3. Inside that namespace, /sys/fs/bpf/myprog is no longer accessible, so
>>>>>>>       xdp-loader can't get access to the original bpf_link; but the XDP
>>>>>>>       program is still attached to eth0.
>>>>>>
>>>>>> The key to decide is whether moving netdev across netns should be allowed
>>>>>> when xdp attached. I think it should be denied. Even when legacy xdp
>>>>>> program is attached, since it will confuse user space managing part.
>>>>>
>>>>> There are perfectly valid use cases where this is done already today (minus
>>>>> bpf_link), for example, consider an orchestrator that is setting up the BPF
>>>>> program on the device, moving to the newly created application pod during
>>>>> the CNI call in k8s, such that the new pod does not have the /sys/fs/bpf/
>>>>> mount instance and if unprivileged cannot remove the BPF prog from the dev
>>>>> either. We do something like this in case of ipvlan, meaning, we attach a
>>>>> rootlet prog that calls into single slot of a tail call map, move it to the
>>>>> application pod, and only out of Cilium's own pod and it's pod-local bpf fs
>>>>> instance we manage the pinned tail call map to update the main programs in
>>>>> that single slot w/o having to switch any netns later on.
>>>>
>>>> Right. You mentioned this use case before, but I managed to forget about it.
>>>> Totally makes sense for prog to stay attached to netdev when it's moved.
>>>> I think pod manager would also prefer that pod is not able to replace
>>>> xdp prog from inside the container. It sounds to me that steps 1,2,3 above
>>>> is exactly the desired behavior. Otherwise what stops some application
>>>> that started in a pod to override it?
>>>
>>> Generally, yes, and it shouldn't need to care nor see what is happening in
>>> /sys/fs/bpf/ from the orchestrator at least (or could potentially have its
>>> own private mount under /sys/fs/bpf/ or elsewhere). Ideally, the behavior
>>> should be that orchestrator does all the setup out of its own namespace,
>>> then moves everything over to the newly created target namespace and e.g.
>>> only if the pod has the capable(cap_sys_admin) permissions, it could mess
>>> around with anything attached there, or via similar model as done in [0]
>>> when there is a master device.
>> 
>> Yup, I can see how this can be a reasonable use case where you *would*
>> want the locking. However, my concern is that there should be a way for
>> an admin to recover from this (say, if it happens by mistake, or a
>> misbehaving application). Otherwise, I fear we'll end up with support
>> cases where the only answer is "try rebooting", which is obviously not
>> ideal.
>
> I'm not quite sure I follow the concern, if you're an admin and have
> the right permissions, then you should be able to introspect and
> change settings like with anything else there is today.

Well, that's what I want to make sure of :)

However, I don't think such introspection is possible today? Or at least
there's no API exposed to do this; you'd have to go write drgn scripts
or something. But I expect an admin will want a command like 'xdp unload
eth0 --yes-i-really-really-mean-this', which would override any locking
done by bpf_link. So how would we implement that? It's not enough to learn
'this bpf_link is pinned at links/id-123 on bpffs'; you'd also need to
learn *which* bpffs, and where to find that mountpoint. So how do you
do that?

Whereas an API that says 'return the bpf_link currently attached to
ifindex X' would sidestep this issue; but then, that is basically the
netlink API we have already, except it doesn't have the bpf_link
abstraction... So why do we need bpf_link?
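
(For comparison, this is roughly what the existing netlink-based query looks
like through libbpf -- it hands back a program ID, not any kind of link or
owner:)

    #include <net/if.h>
    #include <stdio.h>
    #include <linux/types.h>
    #include <bpf/libbpf.h>

    static int show_attached_xdp(const char *dev)
    {
            __u32 prog_id = 0;
            int ifindex = if_nametoindex(dev);

            if (!ifindex)
                    return -1;
            /* existing libbpf helper; returns the ID of the attached XDP
             * program (0 if none), but says nothing about who attached it */
            if (bpf_get_link_xdp_id(ifindex, &prog_id, 0))
                    return -1;
            printf("%s: XDP prog id %u\n", dev, prog_id);
            return 0;
    }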

>>> Last time I looked, there is a down/up cycle on the dev upon netns
>>> migration and it flushes e.g. attached qdiscs afaik, so there are
>>> limitations that still need to be addressed. Not sure atm if same is
>>> happening to XDP right now.
>> 
>> XDP programs will stay attached. devmaps (for redirect) have a notifier
>> that will remove devices when they move out of a namespace. Not sure if
>> there are any other issues with netns moves somewhere.
>> 
>>> In this regards veth devices are a bit nicer to work with since
>>> everything can be attached on hostns ingress w/o needing to worry on
>>> the peer dev in the pod's netns.
>> 
>> Presumably the XDP EGRESS hook that David Ahern is working on will make
>> this doable for XDP on veth as well?
>
> I'm not sure I see a use-case for XDP egress for Cilium yet, but maybe
> I'm still lacking a clear picture on why one should use it. We
> currently use various layers where we orchestrate our BPF programs
> from the agent. XDP/rx on the phys nic on the one end, BPF sock progs
> attached to cgroups on the other end of the spectrum. The processing
> in between on virtual devices is mainly tc/BPF-based since everything
> is skb based anyway and more flexible also with interaction with the
> rest of the stack. There is also not this pain of having to linearize
> all the skbs, but at least there's a path to tackle that.

I agree that there's not really any reason to use XDP on veth as long as
you'll end up with an skb eventually anyway. The only reason to do
something different I can think of is if you have an application inside
a container using AF_XDP, and you want to carry XDP frames all the way
through to that without ever building an skb. Not really sure this is a
good deployment model, but I kinda suspect that the NFV guys will want
to do this eventually...

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-06 10:25                                         ` Daniel Borkmann
  2020-03-06 10:42                                           ` Toke Høiland-Jørgensen
@ 2020-03-06 18:09                                           ` David Ahern
  1 sibling, 0 replies; 50+ messages in thread
From: David Ahern @ 2020-03-06 18:09 UTC (permalink / raw)
  To: Daniel Borkmann, Toke Høiland-Jørgensen, Alexei Starovoitov
  Cc: Alexei Starovoitov, Andrii Nakryiko, Andrii Nakryiko, bpf,
	Networking, Kernel Team

On 3/6/20 3:25 AM, Daniel Borkmann wrote:
>>
>> Presumably the XDP EGRESS hook that David Ahern is working on will make
>> this doable for XDP on veth as well?
> 
> I'm not sure I see a use-case for XDP egress for Cilium yet, but maybe
> I'm still
> lacking a clear picture on why one should use it. We currently use various
> layers where we orchestrate our BPF programs from the agent. XDP/rx on
> the phys
> nic on the one end, BPF sock progs attached to cgroups on the other end
> of the
> spectrum. The processing in between on virtual devices is mainly
> tc/BPF-based
> since everything is skb based anyway and more flexible also with
> interaction
> with the rest of the stack. There is also not this pain of having to
> linearize
> all the skbs, but at least there's a path to tackle that.
> 


{ veth-host } <----> { veth-container }

If you are currently putting an XDP program on veth-container, it is run
on the "Rx" of the packet from the host. That program is visible to the
container, but is managed by the host.

With XDP Tx you can put the same program on veth-host.

For containers, sure, maybe you don't care since you control all aspects
of the networking devices. For VMs, the host does not have access to or
control of the "other end" in the guest. Putting a program on the tap's
"Tx" side allows host managed, per-VM programs.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-05 18:13                                     ` Jakub Kicinski
@ 2020-03-09 11:41                                       ` Toke Høiland-Jørgensen
  2020-03-09 18:50                                         ` Jakub Kicinski
  0 siblings, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-09 11:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Andrii Nakryiko, bpf, Networking, Kernel Team

Jakub Kicinski <kuba@kernel.org> writes:

> On Thu, 05 Mar 2020 12:05:23 +0100 Toke Høiland-Jørgensen wrote:
>> Jakub Kicinski <kuba@kernel.org> writes:
>> > On Wed, 4 Mar 2020 17:07:08 -0800, Alexei Starovoitov wrote:  
>> >> > Maybe also the thief should not have CAP_ADMIN in the first place?
>> >> > And ask a daemon to perform its actions..    
>> >> 
>> >> a daemon idea keeps coming back in circles.
>> >> With FD-based kprobe/uprobe/tracepoint/fexit/fentry that problem is gone,
>> >> but xdp, tc, cgroup still don't have the owner concept.
>> >> Some people argued that these three need three separate daemons.
>> >> Especially since cgroups are mainly managed by systemd plus container
>> >> manager it's quite different from networking (xdp, tc) where something
>> >> like 'networkd' might makes sense.
>> >> But if you take this line of thought all the ways systemd should be that
>> >> single daemon to coordinate attaching to xdp, tc, cgroup because
>> >> in many cases cgroup and tc progs have to coordinate the work.  
>> >
>> > The feature creep could happen, but Toke's proposal has a fairly simple
>> > feature set, which should be easy to cover by a stand alone daemon.
>> >
>> > Toke, I saw that in the library discussion there was no mention of 
>> > a daemon, what makes a daemon solution unsuitable?  
>> 
>> Quoting from the last discussion[0]:
>> 
>> > - Introducing a new, separate code base that we'll have to write, support
>> >   and manage updates to.
>> >
>> > - Add a new dependency to using XDP (now you not only need the kernel
>> >   and libraries, you'll also need the daemon).
>> >
>> > - Have to duplicate or wrap functionality currently found in the kernel;
>> >   at least:
>> >   
>> >     - Keeping track of which XDP programs are loaded and attached to
>> >       each interface (as well as the "new state" of their attachment
>> >       order).
>> >
>> >     - Some kind of interface with the verifier; if an app does
>> >       xdpd_rpc_load(prog), how is the verifier result going to get back
>> >       to the caller?
>> >
>> > - Have to deal with state synchronisation issues (how does xdpd handle
>> >   kernel state changing from underneath it?).
>> > 
>> > While these are issues that are (probably) all solvable, I think the
>> > cost of solving them is far higher than putting the support into the
>> > kernel. Which is why I think kernel support is the best solution :)  
>> 
>> The context was slightly different, since this was before we had
>> freplace support in the kernel. But apart from the point about the
>> verifier, I think the arguments still stand. In fact, now that we have
>> that, we don't even need userspace linking, so basically a daemon's only
>> task would be to arbitrate access to the XDP hook? In my book,
>> arbitrating access to resources is what the kernel is all about...
>
> You said that like the library doesn't arbitrate access and manage
> resources.. It does exactly the same work the daemon would do.

Sure, the logic is in the library, but the state (which programs are
loaded) and synchronisation primitives (atomic replace of attached
program) are provided by the kernel. 

> Daemon just trades off the complexity of making calls for the
> complexity of the system and serializing/de-serializing the state.

What state are we serialising? I'm not sure I would consider just
pinning things in bpffs as "state serialisation"?

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-09 11:41                                       ` Toke Høiland-Jørgensen
@ 2020-03-09 18:50                                         ` Jakub Kicinski
  2020-03-10 12:22                                           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2020-03-09 18:50 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Andrii Nakryiko, bpf, Networking, Kernel Team

On Mon, 09 Mar 2020 12:41:14 +0100 Toke Høiland-Jørgensen wrote:
> > You said that like the library doesn't arbitrate access and manage
> > resources.. It does exactly the same work the daemon would do.  
> 
> Sure, the logic is in the library, but the state (which programs are
> loaded) and synchronisation primitives (atomic replace of attached
> program) are provided by the kernel. 

I see your point of view. The state in the kernel which the library has
to read out every time is what I was thinking of as deserialization.

The library has to take some lock, and then read the state from the
kernel, and then construct its internal state based on that. I think
you have some cleverness there to stuff everything in BTF so far, but
I'd expect if the library grows that may become cumbersome and
wasteful (it's pinned memory after all). 

Parsing the packet once could be an example of something that could be
managed by the library to avoid wasted cycles. Then programs would have
to describe their requirements, and library may need to do rewrites of
the bytecode. 

I guess everything can be stuffed into BTF, but I'm not 100% sure
kernel is supposed to be a database either.

Note that the atomic replace may not be sufficient for safe operation, 
as reading the state from the kernel is also not atomic.

> > Daemon just trades off the complexity of making calls for the
> > complexity of the system and serializing/de-serializing the state.  
> 
> What state are we serialising? I'm not sure I would consider just
> pinning things in bpffs as "state serialisation"?

At a quick glance at your code, e.g. what xdp_parse_run_order() does ;)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
  2020-03-09 18:50                                         ` Jakub Kicinski
@ 2020-03-10 12:22                                           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-03-10 12:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Andrii Nakryiko, bpf, Networking, Kernel Team

Jakub Kicinski <kuba@kernel.org> writes:

> On Mon, 09 Mar 2020 12:41:14 +0100 Toke Høiland-Jørgensen wrote:
>> > You said that like the library doesn't arbitrate access and manage
>> > resources.. It does exactly the same work the daemon would do.  
>> 
>> Sure, the logic is in the library, but the state (which programs are
>> loaded) and synchronisation primitives (atomic replace of attached
>> program) are provided by the kernel. 
>
> I see your point of view. The state in the kernel which the library has
> to read out every time is what I was thinking of as deserialization.

Ohh, right. I consider the BTF-embedded data as 'configuration data'
which is different to 'state' in my mind. Hence my confusion about
what you were talking about re: state :)

> The library has to take some lock, and then read the state from the
> kernel, and then construct its internal state based on that. I think
> you have some cleverness there to stuff everything in BTF so far, but
> I'd expect if the library grows that may become cumbersome and
> wasteful (it's pinned memory after all).
>
> Parsing the packet once could be an example of something that could be
> managed by the library to avoid wasted cycles. Then programs would have
> to describe their requirements, and library may need to do rewrites of
> the bytecode.

Hmm, I've been trying to make libxdp fairly minimal in scope. It seems
like you are assuming that we'll end up with lots of additional
functionality? Do you have anything in particular in mind, or are you
talking in general terms here?

> I guess everything can be stuffed into BTF, but I'm not 100% sure
> kernel is supposed to be a database either.

I actually started out with the BTF approach because I wanted something
that could be part of the program bytecode (instead of, say, an external
config file that had to be carried along with the .o file). That it
survives a round-trip into the kernel turned out to be a nice bonus :)

I do agree with you in general terms, though: There's probably a limit
to how much stuff we can stick into this. The obvious better-suited
storage mechanism for more data is a BPF map, isn't it? I'm not sure
there's any point in moving to that before we have actual use cases for
richer state/metadata, though?
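
(Sketch of what that could look like -- a single-entry array map holding a
config struct instead of BTF-embedded data; the struct layout and names are
made up for illustration:)

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct dispatcher_config {
            __u32 num_progs_enabled;
            __u32 run_prios[10];
    };

    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, struct dispatcher_config);
    } config_map SEC(".maps");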

> Note that the atomic replace may not be sufficient for safe operation, as
> reading the state from the kernel is also not atomic.

Yeah, there's a potential for read-update-write races. However, assuming
that the dispatcher program itself is not modified after initial setup
(i.e., we build a new one every time), I think this can be solved with a
"cmpxchg" operation where userspace includes the fd of the program it
thinks it is replacing, and the kernel refuses the operation if this
doesn't match. Do you disagree that this would be sufficient?
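
(Purely illustrative sketch of such a cmpxchg-style attach from the
userspace side; no such API exists at the time of writing, and all names
here are hypothetical:)

    /* hypothetical: atomically swap the XDP program on ifindex from
     * old_prog_fd to new_prog_fd; the kernel rejects the request if the
     * currently attached program is not the one old_prog_fd refers to */
    int xdp_attach_cmpxchg(int ifindex, int new_prog_fd, int old_prog_fd);

    if (xdp_attach_cmpxchg(ifindex, new_fd, expected_old_fd) < 0) {
            /* somebody changed the attachment underneath us --
             * re-read kernel state and retry */
    }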

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2020-03-10 12:22 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-28 22:39 [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction Andrii Nakryiko
2020-02-28 22:39 ` [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction Andrii Nakryiko
2020-03-02 10:13   ` Toke Høiland-Jørgensen
2020-03-02 18:06     ` Andrii Nakryiko
2020-03-02 21:40       ` Toke Høiland-Jørgensen
2020-03-02 23:37         ` Andrii Nakryiko
2020-03-03  2:50   ` Alexei Starovoitov
2020-03-03  4:18     ` Andrii Nakryiko
2020-02-28 22:39 ` [PATCH bpf-next 2/3] libbpf: add bpf_link pinning/unpinning Andrii Nakryiko
2020-03-02 10:16   ` Toke Høiland-Jørgensen
2020-03-02 18:09     ` Andrii Nakryiko
2020-03-02 21:45       ` Toke Høiland-Jørgensen
2020-02-28 22:39 ` [PATCH bpf-next 3/3] selftests/bpf: add link pinning selftests Andrii Nakryiko
2020-03-02 10:11 ` [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction Toke Høiland-Jørgensen
2020-03-02 18:05   ` Andrii Nakryiko
2020-03-02 22:24     ` Toke Høiland-Jørgensen
2020-03-02 23:35       ` Andrii Nakryiko
2020-03-03  8:12         ` Toke Høiland-Jørgensen
2020-03-03  8:12       ` Daniel Borkmann
2020-03-03 15:46         ` Alexei Starovoitov
2020-03-03 19:23           ` Daniel Borkmann
2020-03-03 19:46             ` Andrii Nakryiko
2020-03-03 20:24               ` Toke Høiland-Jørgensen
2020-03-03 20:53                 ` Daniel Borkmann
2020-03-03 22:01                   ` Alexei Starovoitov
2020-03-03 22:27                     ` Toke Høiland-Jørgensen
2020-03-04  4:36                       ` Alexei Starovoitov
2020-03-04  7:47                         ` Toke Høiland-Jørgensen
2020-03-04 15:47                           ` Alexei Starovoitov
2020-03-05 10:37                             ` Toke Høiland-Jørgensen
2020-03-05 16:34                               ` Alexei Starovoitov
2020-03-05 22:34                                 ` Daniel Borkmann
2020-03-05 22:50                                   ` Alexei Starovoitov
2020-03-05 23:42                                     ` Daniel Borkmann
2020-03-06  8:31                                       ` Toke Høiland-Jørgensen
2020-03-06 10:25                                         ` Daniel Borkmann
2020-03-06 10:42                                           ` Toke Høiland-Jørgensen
2020-03-06 18:09                                           ` David Ahern
2020-03-04 19:41                         ` Jakub Kicinski
2020-03-04 20:45                           ` Alexei Starovoitov
2020-03-04 21:24                             ` Jakub Kicinski
2020-03-05  1:07                               ` Alexei Starovoitov
2020-03-05  8:16                                 ` Jakub Kicinski
2020-03-05 11:05                                   ` Toke Høiland-Jørgensen
2020-03-05 18:13                                     ` Jakub Kicinski
2020-03-09 11:41                                       ` Toke Høiland-Jørgensen
2020-03-09 18:50                                         ` Jakub Kicinski
2020-03-10 12:22                                           ` Toke Høiland-Jørgensen
2020-03-05 16:39                                   ` Alexei Starovoitov
2020-03-03 22:40                 ` Jakub Kicinski
