bpf.vger.kernel.org archive mirror
* [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
@ 2020-03-30  2:59 Andrii Nakryiko
  2020-03-30  2:59 ` [PATCH v3 bpf-next 1/4] bpf: implement bpf_link-based cgroup BPF program attachment Andrii Nakryiko
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2020-03-30  2:59 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel, rdna
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

The bpf_link abstraction itself was formalized in [0], with justifications for
why its semantics are a good fit for attaching BPF programs of various types.
This patch set adds a bpf_link-based attachment mechanism for cgroup BPF
programs.

A cgroup BPF link is semantically compatible with the current BPF_F_ALLOW_MULTI
semantics of attaching cgroup BPF programs directly. Thus a cgroup bpf_link can
co-exist with legacy BPF program multi-attachment.

A bpf_link is destroyed, and the underlying BPF program automatically
detached, when the last open FD holding a reference to the bpf_link is
closed. This means that, by default, when the process that created a bpf_link
exits, the attached BPF program is automatically detached by bpf_link's
cleanup code. A cgroup bpf_link, like any other bpf_link, can be pinned in BPF
FS and thereby survive the exit of the process that created the link. This is
useful in many scenarios that need long-living BPF program attachments.
Pinning also means that a bpf_link can have many owners through independent
FDs.

Additionally, auto-detachment of cgroup bpf_links is implemented. A dying
cgroup automatically detaches all active bpf_links. This ensures that cgroup
cleanup is not delayed by active bpf_links, even though no BPF program could
ever run for that cgroup again. In that sense it's similar to the existing
behavior of dropping the refcnt of an attached bpf_prog. In the bpf_link case,
however, the bpf_link itself is not destroyed and remains available to the
user as long as at least one active FD is still open (or the link is pinned in
BPF FS).

There are two main cgroup-specific differences between bpf_link-based and
direct bpf_prog-based attachment.

First, as opposed to direct bpf_prog attachment, the cgroup itself doesn't
"own" the bpf_link, which makes it possible to auto-clean up an attached
bpf_link when the user process abruptly exits without explicitly detaching the
BPF program. This provides a safe default behavior, already proven for BPF
tracing program types. Conversely, the bpf_link doesn't bump
cgroup->bpf.refcnt, so it doesn't prevent the cgroup from cleaning up its BPF
state.

Second, only owners of the bpf_link (those who created the bpf_link in the
first place, or who obtained a new FD by opening the bpf_link from BPF FS) can
detach and/or update it. This ensures that no other process can accidentally
remove or replace the BPF program.

This patch set also implements the LINK_UPDATE sub-command, which allows
replacing a bpf_link's underlying bpf_prog, similarly to the BPF_F_REPLACE
flag behavior for direct bpf_prog cgroup attachment. Like LINK_CREATE, it is
intended to be a generic command for different types of bpf_links.

  [0] https://lore.kernel.org/bpf/20200228223948.360936-1-andriin@fb.com/

v2->v3:
  - revert back to just MULTI mode (Alexei);
  - fix tinyconfig compilation warning (kbuild test robot);

v1->v2:
  - implement exclusive and overridable exclusive modes (Andrey Ignatov);
  - fix build for !CONFIG_CGROUP_BPF build;
  - add more selftests for non-multi mode and inter-operability;


Andrii Nakryiko (4):
  bpf: implement bpf_link-based cgroup BPF program attachment
  bpf: implement bpf_prog replacement for an active bpf_cgroup_link
  libbpf: add support for bpf_link-based cgroup attachment
  selftests/bpf: test FD-based cgroup attachment

 include/linux/bpf-cgroup.h                    |  41 +-
 include/linux/bpf.h                           |  10 +-
 include/uapi/linux/bpf.h                      |  22 +-
 kernel/bpf/cgroup.c                           | 395 ++++++++++++++----
 kernel/bpf/syscall.c                          | 113 ++++-
 kernel/cgroup/cgroup.c                        |  41 +-
 tools/include/uapi/linux/bpf.h                |  22 +-
 tools/lib/bpf/bpf.c                           |  34 ++
 tools/lib/bpf/bpf.h                           |  19 +
 tools/lib/bpf/libbpf.c                        |  46 ++
 tools/lib/bpf/libbpf.h                        |   8 +-
 tools/lib/bpf/libbpf.map                      |   4 +
 .../selftests/bpf/prog_tests/cgroup_link.c    | 244 +++++++++++
 .../selftests/bpf/progs/test_cgroup_link.c    |  24 ++
 14 files changed, 924 insertions(+), 99 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_link.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cgroup_link.c

-- 
2.17.1



* [PATCH v3 bpf-next 1/4] bpf: implement bpf_link-based cgroup BPF program attachment
  2020-03-30  2:59 [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link Andrii Nakryiko
@ 2020-03-30  2:59 ` Andrii Nakryiko
  2020-03-31  0:05   ` Andrey Ignatov
  2020-03-30  2:59 ` [PATCH v3 bpf-next 2/4] bpf: implement bpf_prog replacement for an active bpf_cgroup_link Andrii Nakryiko
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Andrii Nakryiko @ 2020-03-30  2:59 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel, rdna
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Implement a new sub-command that attaches a cgroup BPF program and returns an
FD-based bpf_link on success. A bpf_link, once attached to a cgroup, cannot be
replaced, except by an owner holding its FD. Cgroup bpf_link supports only
BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI
attachments can be freely intermixed.

To prevent a bpf_cgroup_link from keeping a cgroup alive past the point when
no BPF program can be executed, implement auto-detachment of links. When
cgroup_bpf_release() is called, all attached bpf_links are forced to release
their cgroup refcounts, but they otherwise stay active and allocated, and keep
owning the underlying bpf_prog. This is because user space might still have
FDs open and active, so the bpf_link, as a user-referenced object, can't be
freed yet. Once the last active FD is closed, the bpf_link is freed and the
underlying bpf_prog refcount is dropped. The cgroup refcount, however, isn't
touched, because the cgroup was released already.

The inherent race between bpf_cgroup_link release (from closing the last FD)
and cgroup_bpf_release() is resolved by both operations taking cgroup_mutex.
So the only additional check required is when the bpf_cgroup_link attempts to
detach itself from the cgroup: at that point we need to check whether there is
still a cgroup associated with the link, and if not, exit with success,
because the bpf_cgroup_link was already detached.

Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 include/linux/bpf-cgroup.h     |  29 ++-
 include/linux/bpf.h            |  10 +-
 include/uapi/linux/bpf.h       |  10 +-
 kernel/bpf/cgroup.c            | 315 +++++++++++++++++++++++++--------
 kernel/bpf/syscall.c           |  61 ++++++-
 kernel/cgroup/cgroup.c         |  14 +-
 tools/include/uapi/linux/bpf.h |  10 +-
 7 files changed, 351 insertions(+), 98 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index a7cd5c7a2509..d2d969669564 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -51,9 +51,18 @@ struct bpf_cgroup_storage {
 	struct rcu_head rcu;
 };
 
+struct bpf_cgroup_link {
+	struct bpf_link link;
+	struct cgroup *cgroup;
+	enum bpf_attach_type type;
+};
+
+extern const struct bpf_link_ops bpf_cgroup_link_lops;
+
 struct bpf_prog_list {
 	struct list_head node;
 	struct bpf_prog *prog;
+	struct bpf_cgroup_link *link;
 	struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE];
 };
 
@@ -84,20 +93,23 @@ struct cgroup_bpf {
 int cgroup_bpf_inherit(struct cgroup *cgrp);
 void cgroup_bpf_offline(struct cgroup *cgrp);
 
-int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
-			struct bpf_prog *replace_prog,
+int __cgroup_bpf_attach(struct cgroup *cgrp,
+			struct bpf_prog *prog, struct bpf_prog *replace_prog,
+			struct bpf_cgroup_link *link,
 			enum bpf_attach_type type, u32 flags);
 int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
+			struct bpf_cgroup_link *link,
 			enum bpf_attach_type type);
 int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
 		       union bpf_attr __user *uattr);
 
 /* Wrapper for __cgroup_bpf_*() protected by cgroup_mutex */
-int cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
-		      struct bpf_prog *replace_prog, enum bpf_attach_type type,
+int cgroup_bpf_attach(struct cgroup *cgrp,
+		      struct bpf_prog *prog, struct bpf_prog *replace_prog,
+		      struct bpf_cgroup_link *link, enum bpf_attach_type type,
 		      u32 flags);
 int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
-		      enum bpf_attach_type type, u32 flags);
+		      enum bpf_attach_type type);
 int cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
 		     union bpf_attr __user *uattr);
 
@@ -332,6 +344,7 @@ int cgroup_bpf_prog_attach(const union bpf_attr *attr,
 			   enum bpf_prog_type ptype, struct bpf_prog *prog);
 int cgroup_bpf_prog_detach(const union bpf_attr *attr,
 			   enum bpf_prog_type ptype);
+int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 int cgroup_bpf_prog_query(const union bpf_attr *attr,
 			  union bpf_attr __user *uattr);
 #else
@@ -354,6 +367,12 @@ static inline int cgroup_bpf_prog_detach(const union bpf_attr *attr,
 	return -EINVAL;
 }
 
+static inline int cgroup_bpf_link_attach(const union bpf_attr *attr,
+					 struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
 static inline int cgroup_bpf_prog_query(const union bpf_attr *attr,
 					union bpf_attr __user *uattr)
 {
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3bde59a8453b..56254d880293 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1082,15 +1082,23 @@ extern int sysctl_unprivileged_bpf_disabled;
 int bpf_map_new_fd(struct bpf_map *map, int flags);
 int bpf_prog_new_fd(struct bpf_prog *prog);
 
-struct bpf_link;
+struct bpf_link {
+	atomic64_t refcnt;
+	const struct bpf_link_ops *ops;
+	struct bpf_prog *prog;
+	struct work_struct work;
+};
 
 struct bpf_link_ops {
 	void (*release)(struct bpf_link *link);
 	void (*dealloc)(struct bpf_link *link);
+
 };
 
 void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops,
 		   struct bpf_prog *prog);
+void bpf_link_cleanup(struct bpf_link *link, struct file *link_file,
+		      int link_fd);
 void bpf_link_inc(struct bpf_link *link);
 void bpf_link_put(struct bpf_link *link);
 int bpf_link_new_fd(struct bpf_link *link);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f1fbc36f58d3..8b3f1c098ac0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -111,6 +111,7 @@ enum bpf_cmd {
 	BPF_MAP_LOOKUP_AND_DELETE_BATCH,
 	BPF_MAP_UPDATE_BATCH,
 	BPF_MAP_DELETE_BATCH,
+	BPF_LINK_CREATE,
 };
 
 enum bpf_map_type {
@@ -541,7 +542,7 @@ union bpf_attr {
 		__u32		prog_cnt;
 	} query;
 
-	struct {
+	struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
 		__u64 name;
 		__u32 prog_fd;
 	} raw_tracepoint;
@@ -569,6 +570,13 @@ union bpf_attr {
 		__u64		probe_offset;	/* output: probe_offset */
 		__u64		probe_addr;	/* output: probe_addr */
 	} task_fd_query;
+
+	struct { /* struct used by BPF_LINK_CREATE command */
+		__u32		prog_fd;	/* eBPF program to attach */
+		__u32		target_fd;	/* object to attach to */
+		__u32		attach_type;	/* attach type */
+		__u32		flags;		/* extra flags */
+	} link_create;
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 9c8472823a7f..c24029937431 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -80,6 +80,17 @@ static void bpf_cgroup_storages_unlink(struct bpf_cgroup_storage *storages[])
 		bpf_cgroup_storage_unlink(storages[stype]);
 }
 
+/* Called when bpf_cgroup_link is auto-detached from dying cgroup.
+ * It drops cgroup and bpf_prog refcounts, and marks bpf_link as defunct. It
+ * doesn't free link memory, which will eventually be done by bpf_link's
+ * release() callback, when its last FD is closed.
+ */
+static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link)
+{
+	cgroup_put(link->cgroup);
+	link->cgroup = NULL;
+}
+
 /**
  * cgroup_bpf_release() - put references of all bpf programs and
  *                        release all cgroup bpf data
@@ -100,7 +111,10 @@ static void cgroup_bpf_release(struct work_struct *work)
 
 		list_for_each_entry_safe(pl, tmp, progs, node) {
 			list_del(&pl->node);
-			bpf_prog_put(pl->prog);
+			if (pl->prog)
+				bpf_prog_put(pl->prog);
+			if (pl->link)
+				bpf_cgroup_link_auto_detach(pl->link);
 			bpf_cgroup_storages_unlink(pl->storage);
 			bpf_cgroup_storages_free(pl->storage);
 			kfree(pl);
@@ -134,6 +148,18 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
 	queue_work(system_wq, &cgrp->bpf.release_work);
 }
 
+/* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
+ * link or direct prog.
+ */
+static struct bpf_prog *prog_list_prog(struct bpf_prog_list *pl)
+{
+	if (pl->prog)
+		return pl->prog;
+	if (pl->link)
+		return pl->link->link.prog;
+	return NULL;
+}
+
 /* count number of elements in the list.
  * it's slow but the list cannot be long
  */
@@ -143,7 +169,7 @@ static u32 prog_list_length(struct list_head *head)
 	u32 cnt = 0;
 
 	list_for_each_entry(pl, head, node) {
-		if (!pl->prog)
+		if (!prog_list_prog(pl))
 			continue;
 		cnt++;
 	}
@@ -212,11 +238,11 @@ static int compute_effective_progs(struct cgroup *cgrp,
 			continue;
 
 		list_for_each_entry(pl, &p->bpf.progs[type], node) {
-			if (!pl->prog)
+			if (!prog_list_prog(pl))
 				continue;
 
 			item = &progs->items[cnt];
-			item->prog = pl->prog;
+			item->prog = prog_list_prog(pl);
 			bpf_cgroup_storages_assign(item->cgroup_storage,
 						   pl->storage);
 			cnt++;
@@ -333,19 +359,60 @@ static int update_effective_progs(struct cgroup *cgrp,
 
 #define BPF_CGROUP_MAX_PROGS 64
 
+static struct bpf_prog_list *find_attach_entry(struct list_head *progs,
+					       struct bpf_prog *prog,
+					       struct bpf_cgroup_link *link,
+					       struct bpf_prog *replace_prog,
+					       bool allow_multi)
+{
+	struct bpf_prog_list *pl;
+
+	/* single-attach case */
+	if (!allow_multi) {
+		if (list_empty(progs))
+			return NULL;
+		return list_first_entry(progs, typeof(*pl), node);
+	}
+
+	list_for_each_entry(pl, progs, node) {
+		if (prog && pl->prog == prog)
+			/* disallow attaching the same prog twice */
+			return ERR_PTR(-EINVAL);
+		if (link && pl->link == link)
+			/* disallow attaching the same link twice */
+			return ERR_PTR(-EINVAL);
+	}
+
+	/* direct prog multi-attach w/ replacement case */
+	if (replace_prog) {
+		list_for_each_entry(pl, progs, node) {
+			if (pl->prog == replace_prog)
+				/* a match found */
+				return pl;
+		}
+		/* prog to replace not found for cgroup */
+		return ERR_PTR(-ENOENT);
+	}
+
+	return NULL;
+}
+
 /**
- * __cgroup_bpf_attach() - Attach the program to a cgroup, and
+ * __cgroup_bpf_attach() - Attach the program or the link to a cgroup, and
  *                         propagate the change to descendants
  * @cgrp: The cgroup which descendants to traverse
  * @prog: A program to attach
+ * @link: A link to attach
  * @replace_prog: Previously attached program to replace if BPF_F_REPLACE is set
  * @type: Type of attach operation
  * @flags: Option flags
  *
+ * Exactly one of @prog or @link can be non-null.
  * Must be called with cgroup_mutex held.
  */
-int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
-			struct bpf_prog *replace_prog,
+int __cgroup_bpf_attach(struct cgroup *cgrp,
+			struct bpf_prog *prog, struct bpf_prog *replace_prog,
+			struct bpf_cgroup_link *link,
 			enum bpf_attach_type type, u32 flags)
 {
 	u32 saved_flags = (flags & (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI));
@@ -353,13 +420,19 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
 	struct bpf_prog *old_prog = NULL;
 	struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE],
 		*old_storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {NULL};
-	struct bpf_prog_list *pl, *replace_pl = NULL;
+	struct bpf_prog_list *pl;
 	int err;
 
 	if (((flags & BPF_F_ALLOW_OVERRIDE) && (flags & BPF_F_ALLOW_MULTI)) ||
 	    ((flags & BPF_F_REPLACE) && !(flags & BPF_F_ALLOW_MULTI)))
 		/* invalid combination */
 		return -EINVAL;
+	if (link && (prog || replace_prog))
+		/* only either link or prog/replace_prog can be specified */
+		return -EINVAL;
+	if (!!replace_prog != !!(flags & BPF_F_REPLACE))
+		/* replace_prog implies BPF_F_REPLACE, and vice versa */
+		return -EINVAL;
 
 	if (!hierarchy_allows_attach(cgrp, type))
 		return -EPERM;
@@ -374,26 +447,15 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
 	if (prog_list_length(progs) >= BPF_CGROUP_MAX_PROGS)
 		return -E2BIG;
 
-	if (flags & BPF_F_ALLOW_MULTI) {
-		list_for_each_entry(pl, progs, node) {
-			if (pl->prog == prog)
-				/* disallow attaching the same prog twice */
-				return -EINVAL;
-			if (pl->prog == replace_prog)
-				replace_pl = pl;
-		}
-		if ((flags & BPF_F_REPLACE) && !replace_pl)
-			/* prog to replace not found for cgroup */
-			return -ENOENT;
-	} else if (!list_empty(progs)) {
-		replace_pl = list_first_entry(progs, typeof(*pl), node);
-	}
+	pl = find_attach_entry(progs, prog, link, replace_prog,
+			       flags & BPF_F_ALLOW_MULTI);
+	if (IS_ERR(pl))
+		return PTR_ERR(pl);
 
-	if (bpf_cgroup_storages_alloc(storage, prog))
+	if (bpf_cgroup_storages_alloc(storage, prog ? : link->link.prog))
 		return -ENOMEM;
 
-	if (replace_pl) {
-		pl = replace_pl;
+	if (pl) {
 		old_prog = pl->prog;
 		bpf_cgroup_storages_unlink(pl->storage);
 		bpf_cgroup_storages_assign(old_storage, pl->storage);
@@ -407,6 +469,7 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
 	}
 
 	pl->prog = prog;
+	pl->link = link;
 	bpf_cgroup_storages_assign(pl->storage, storage);
 	cgrp->bpf.flags[type] = saved_flags;
 
@@ -414,80 +477,93 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
 	if (err)
 		goto cleanup;
 
-	static_branch_inc(&cgroup_bpf_enabled_key);
 	bpf_cgroup_storages_free(old_storage);
-	if (old_prog) {
+	if (old_prog)
 		bpf_prog_put(old_prog);
-		static_branch_dec(&cgroup_bpf_enabled_key);
-	}
-	bpf_cgroup_storages_link(storage, cgrp, type);
+	else
+		static_branch_inc(&cgroup_bpf_enabled_key);
+	bpf_cgroup_storages_link(pl->storage, cgrp, type);
 	return 0;
 
 cleanup:
-	/* and cleanup the prog list */
-	pl->prog = old_prog;
+	if (old_prog) {
+		pl->prog = old_prog;
+		pl->link = NULL;
+	}
 	bpf_cgroup_storages_free(pl->storage);
 	bpf_cgroup_storages_assign(pl->storage, old_storage);
 	bpf_cgroup_storages_link(pl->storage, cgrp, type);
-	if (!replace_pl) {
+	if (!old_prog) {
 		list_del(&pl->node);
 		kfree(pl);
 	}
 	return err;
 }
 
+static struct bpf_prog_list *find_detach_entry(struct list_head *progs,
+					       struct bpf_prog *prog,
+					       struct bpf_cgroup_link *link,
+					       bool allow_multi)
+{
+	struct bpf_prog_list *pl;
+
+	if (!allow_multi) {
+		if (list_empty(progs))
+			/* report error when trying to detach and nothing is attached */
+			return ERR_PTR(-ENOENT);
+
+		/* to maintain backward compatibility NONE and OVERRIDE cgroups
+		 * allow detaching with invalid FD (prog==NULL) in legacy mode
+		 */
+		return list_first_entry(progs, typeof(*pl), node);
+	}
+
+	if (!prog && !link)
+		/* to detach MULTI prog the user has to specify valid FD
+		 * of the program or link to be detached
+		 */
+		return ERR_PTR(-EINVAL);
+
+	/* find the prog or link and detach it */
+	list_for_each_entry(pl, progs, node) {
+		if (pl->prog == prog && pl->link == link)
+			return pl;
+	}
+	return ERR_PTR(-ENOENT);
+}
+
 /**
- * __cgroup_bpf_detach() - Detach the program from a cgroup, and
+ * __cgroup_bpf_detach() - Detach the program or link from a cgroup, and
  *                         propagate the change to descendants
  * @cgrp: The cgroup which descendants to traverse
  * @prog: A program to detach or NULL
+ * @link: A link to detach or NULL
  * @type: Type of detach operation
  *
+ * At most one of @prog or @link can be non-NULL.
  * Must be called with cgroup_mutex held.
  */
 int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
-			enum bpf_attach_type type)
+			struct bpf_cgroup_link *link, enum bpf_attach_type type)
 {
 	struct list_head *progs = &cgrp->bpf.progs[type];
 	u32 flags = cgrp->bpf.flags[type];
-	struct bpf_prog *old_prog = NULL;
 	struct bpf_prog_list *pl;
+	struct bpf_prog *old_prog;
 	int err;
 
-	if (flags & BPF_F_ALLOW_MULTI) {
-		if (!prog)
-			/* to detach MULTI prog the user has to specify valid FD
-			 * of the program to be detached
-			 */
-			return -EINVAL;
-	} else {
-		if (list_empty(progs))
-			/* report error when trying to detach and nothing is attached */
-			return -ENOENT;
-	}
+	if (prog && link)
+		/* only one of prog or link can be specified */
+		return -EINVAL;
 
-	if (flags & BPF_F_ALLOW_MULTI) {
-		/* find the prog and detach it */
-		list_for_each_entry(pl, progs, node) {
-			if (pl->prog != prog)
-				continue;
-			old_prog = prog;
-			/* mark it deleted, so it's ignored while
-			 * recomputing effective
-			 */
-			pl->prog = NULL;
-			break;
-		}
-		if (!old_prog)
-			return -ENOENT;
-	} else {
-		/* to maintain backward compatibility NONE and OVERRIDE cgroups
-		 * allow detaching with invalid FD (prog==NULL)
-		 */
-		pl = list_first_entry(progs, typeof(*pl), node);
-		old_prog = pl->prog;
-		pl->prog = NULL;
-	}
+	pl = find_detach_entry(progs, prog, link, flags & BPF_F_ALLOW_MULTI);
+	if (IS_ERR(pl))
+		return PTR_ERR(pl);
+
+	/* mark it deleted, so it's ignored while recomputing effective */
+	old_prog = pl->prog;
+	pl->prog = NULL;
+	pl->link = NULL;
 
 	err = update_effective_progs(cgrp, type);
 	if (err)
@@ -501,14 +577,15 @@ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
 	if (list_empty(progs))
 		/* last program was detached, reset flags to zero */
 		cgrp->bpf.flags[type] = 0;
-
-	bpf_prog_put(old_prog);
+	if (old_prog)
+		bpf_prog_put(old_prog);
 	static_branch_dec(&cgroup_bpf_enabled_key);
 	return 0;
 
 cleanup:
-	/* and restore back old_prog */
+	/* restore back prog or link */
 	pl->prog = old_prog;
+	pl->link = link;
 	return err;
 }
 
@@ -521,6 +598,7 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
 	struct list_head *progs = &cgrp->bpf.progs[type];
 	u32 flags = cgrp->bpf.flags[type];
 	struct bpf_prog_array *effective;
+	struct bpf_prog *prog;
 	int cnt, ret = 0, i;
 
 	effective = rcu_dereference_protected(cgrp->bpf.effective[type],
@@ -551,7 +629,8 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
 
 		i = 0;
 		list_for_each_entry(pl, progs, node) {
-			id = pl->prog->aux->id;
+			prog = prog_list_prog(pl);
+			id = prog->aux->id;
 			if (copy_to_user(prog_ids + i, &id, sizeof(id)))
 				return -EFAULT;
 			if (++i == cnt)
@@ -581,8 +660,8 @@ int cgroup_bpf_prog_attach(const union bpf_attr *attr,
 		}
 	}
 
-	ret = cgroup_bpf_attach(cgrp, prog, replace_prog, attr->attach_type,
-				attr->attach_flags);
+	ret = cgroup_bpf_attach(cgrp, prog, replace_prog, NULL,
+				attr->attach_type, attr->attach_flags);
 
 	if (replace_prog)
 		bpf_prog_put(replace_prog);
@@ -604,7 +683,7 @@ int cgroup_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
 	if (IS_ERR(prog))
 		prog = NULL;
 
-	ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type, 0);
+	ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type);
 	if (prog)
 		bpf_prog_put(prog);
 
@@ -612,6 +691,90 @@ int cgroup_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
 	return ret;
 }
 
+static void bpf_cgroup_link_release(struct bpf_link *link)
+{
+	struct bpf_cgroup_link *cg_link =
+		container_of(link, struct bpf_cgroup_link, link);
+
+	/* link might have been auto-detached by dying cgroup already,
+	 * in that case our work is done here
+	 */
+	if (!cg_link->cgroup)
+		return;
+
+	mutex_lock(&cgroup_mutex);
+
+	/* re-check cgroup under lock again */
+	if (!cg_link->cgroup) {
+		mutex_unlock(&cgroup_mutex);
+		return;
+	}
+
+	WARN_ON(__cgroup_bpf_detach(cg_link->cgroup, NULL, cg_link,
+				    cg_link->type));
+
+	mutex_unlock(&cgroup_mutex);
+	cgroup_put(cg_link->cgroup);
+}
+
+static void bpf_cgroup_link_dealloc(struct bpf_link *link)
+{
+	struct bpf_cgroup_link *cg_link =
+		container_of(link, struct bpf_cgroup_link, link);
+
+	kfree(cg_link);
+}
+
+const struct bpf_link_ops bpf_cgroup_link_lops = {
+	.release = bpf_cgroup_link_release,
+	.dealloc = bpf_cgroup_link_dealloc,
+};
+
+int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct bpf_cgroup_link *link;
+	struct file *link_file;
+	struct cgroup *cgrp;
+	int err, link_fd;
+
+	if (attr->link_create.flags)
+		return -EINVAL;
+
+	cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
+	if (IS_ERR(cgrp))
+		return PTR_ERR(cgrp);
+
+	link = kzalloc(sizeof(*link), GFP_USER);
+	if (!link) {
+		err = -ENOMEM;
+		goto out_put_cgroup;
+	}
+	bpf_link_init(&link->link, &bpf_cgroup_link_lops, prog);
+	link->cgroup = cgrp;
+	link->type = attr->link_create.attach_type;
+
+	link_file = bpf_link_new_file(&link->link, &link_fd);
+	if (IS_ERR(link_file)) {
+		kfree(link);
+		err = PTR_ERR(link_file);
+		goto out_put_cgroup;
+	}
+
+	err = cgroup_bpf_attach(cgrp, NULL, NULL, link, link->type,
+				BPF_F_ALLOW_MULTI);
+	if (err) {
+		bpf_link_cleanup(&link->link, link_file, link_fd);
+		goto out_put_cgroup;
+	}
+
+	fd_install(link_fd, link_file);
+	return link_fd;
+
+out_put_cgroup:
+	cgroup_put(cgrp);
+	return err;
+}
+
 int cgroup_bpf_prog_query(const union bpf_attr *attr,
 			  union bpf_attr __user *uattr)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a616b63f23b4..05412b83ed6c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2175,13 +2175,6 @@ static int bpf_obj_get(const union bpf_attr *attr)
 				attr->file_flags);
 }
 
-struct bpf_link {
-	atomic64_t refcnt;
-	const struct bpf_link_ops *ops;
-	struct bpf_prog *prog;
-	struct work_struct work;
-};
-
 void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops,
 		   struct bpf_prog *prog)
 {
@@ -2195,8 +2188,8 @@ void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops,
  * anon_inode's release() call. This helper manages marking bpf_link as
  * defunct, releases anon_inode file and puts reserved FD.
  */
-static void bpf_link_cleanup(struct bpf_link *link, struct file *link_file,
-			     int link_fd)
+void bpf_link_cleanup(struct bpf_link *link, struct file *link_file,
+		      int link_fd)
 {
 	link->prog = NULL;
 	fput(link_file);
@@ -2266,6 +2259,10 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
 		link_type = "raw_tracepoint";
 	else if (link->ops == &bpf_tracing_link_lops)
 		link_type = "tracing";
+#ifdef CONFIG_CGROUP_BPF
+	else if (link->ops == &bpf_cgroup_link_lops)
+		link_type = "cgroup";
+#endif
 	else
 		link_type = "unknown";
 
@@ -3553,6 +3550,49 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
 	return err;
 }
 
+#define BPF_LINK_CREATE_LAST_FIELD link_create.flags
+static int link_create(union bpf_attr *attr)
+{
+	enum bpf_prog_type ptype;
+	struct bpf_prog *prog;
+	int ret;
+
+	if (CHECK_ATTR(BPF_LINK_CREATE))
+		return -EINVAL;
+
+	ptype = attach_type_to_prog_type(attr->link_create.attach_type);
+	if (ptype == BPF_PROG_TYPE_UNSPEC)
+		return -EINVAL;
+
+	prog = bpf_prog_get_type(attr->link_create.prog_fd, ptype);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	ret = bpf_prog_attach_check_attach_type(prog,
+						attr->link_create.attach_type);
+	if (ret)
+		goto err_out;
+
+	switch (ptype) {
+	case BPF_PROG_TYPE_CGROUP_SKB:
+	case BPF_PROG_TYPE_CGROUP_SOCK:
+	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
+	case BPF_PROG_TYPE_SOCK_OPS:
+	case BPF_PROG_TYPE_CGROUP_DEVICE:
+	case BPF_PROG_TYPE_CGROUP_SYSCTL:
+	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+		ret = cgroup_bpf_link_attach(attr, prog);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+err_out:
+	if (ret < 0)
+		bpf_prog_put(prog);
+	return ret;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -3663,6 +3703,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_MAP_DELETE_BATCH:
 		err = bpf_map_do_batch(&attr, uattr, BPF_MAP_DELETE_BATCH);
 		break;
+	case BPF_LINK_CREATE:
+		err = link_create(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3dead0416b91..219624fba9ba 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6303,27 +6303,31 @@ void cgroup_sk_free(struct sock_cgroup_data *skcd)
 #endif	/* CONFIG_SOCK_CGROUP_DATA */
 
 #ifdef CONFIG_CGROUP_BPF
-int cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
-		      struct bpf_prog *replace_prog, enum bpf_attach_type type,
+int cgroup_bpf_attach(struct cgroup *cgrp,
+		      struct bpf_prog *prog, struct bpf_prog *replace_prog,
+		      struct bpf_cgroup_link *link,
+		      enum bpf_attach_type type,
 		      u32 flags)
 {
 	int ret;
 
 	mutex_lock(&cgroup_mutex);
-	ret = __cgroup_bpf_attach(cgrp, prog, replace_prog, type, flags);
+	ret = __cgroup_bpf_attach(cgrp, prog, replace_prog, link, type, flags);
 	mutex_unlock(&cgroup_mutex);
 	return ret;
 }
+
 int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
-		      enum bpf_attach_type type, u32 flags)
+		      enum bpf_attach_type type)
 {
 	int ret;
 
 	mutex_lock(&cgroup_mutex);
-	ret = __cgroup_bpf_detach(cgrp, prog, type);
+	ret = __cgroup_bpf_detach(cgrp, prog, NULL, type);
 	mutex_unlock(&cgroup_mutex);
 	return ret;
 }
+
 int cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
 		     union bpf_attr __user *uattr)
 {
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f1fbc36f58d3..8b3f1c098ac0 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -111,6 +111,7 @@ enum bpf_cmd {
 	BPF_MAP_LOOKUP_AND_DELETE_BATCH,
 	BPF_MAP_UPDATE_BATCH,
 	BPF_MAP_DELETE_BATCH,
+	BPF_LINK_CREATE,
 };
 
 enum bpf_map_type {
@@ -541,7 +542,7 @@ union bpf_attr {
 		__u32		prog_cnt;
 	} query;
 
-	struct {
+	struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
 		__u64 name;
 		__u32 prog_fd;
 	} raw_tracepoint;
@@ -569,6 +570,13 @@ union bpf_attr {
 		__u64		probe_offset;	/* output: probe_offset */
 		__u64		probe_addr;	/* output: probe_addr */
 	} task_fd_query;
+
+	struct { /* struct used by BPF_LINK_CREATE command */
+		__u32		prog_fd;	/* eBPF program to attach */
+		__u32		target_fd;	/* object to attach to */
+		__u32		attach_type;	/* attach type */
+		__u32		flags;		/* extra flags */
+	} link_create;
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
-- 
2.17.1



* [PATCH v3 bpf-next 2/4] bpf: implement bpf_prog replacement for an active bpf_cgroup_link
  2020-03-30  2:59 [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link Andrii Nakryiko
  2020-03-30  2:59 ` [PATCH v3 bpf-next 1/4] bpf: implement bpf_link-based cgroup BPF program attachment Andrii Nakryiko
@ 2020-03-30  2:59 ` Andrii Nakryiko
  2020-03-30  3:00 ` [PATCH v3 bpf-next 3/4] libbpf: add support for bpf_link-based cgroup attachment Andrii Nakryiko
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2020-03-30  2:59 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel, rdna
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Add a new operation (LINK_UPDATE), which allows replacing the active bpf_prog
under a given bpf_link. Currently this is supported only for bpf_cgroup_link,
but it will be extended to other kinds of bpf_links in follow-up patches.

For bpf_cgroup_link, the implemented functionality matches the existing
semantics of direct bpf_prog attachment (including the BPF_F_REPLACE flag).
The user can either unconditionally set a new bpf_prog, regardless of which
bpf_prog is currently active under the given bpf_link, or can specify the
expected currently active bpf_prog. If the active bpf_prog doesn't match the
expected one, no changes are performed, the old bpf_link stays intact and
attached, and the operation returns a failure.

The cgroup_bpf_replace() operation resolves the race between auto-detachment
and bpf_prog update in the same fashion as bpf_link detachment does, except
that here the update has no way of succeeding, because the target cgroup is
already marked as dying. So in this case an error is returned.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 include/linux/bpf-cgroup.h | 12 ++++++
 include/uapi/linux/bpf.h   | 12 ++++++
 kernel/bpf/cgroup.c        | 80 ++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c       | 52 +++++++++++++++++++++++++
 kernel/cgroup/cgroup.c     | 27 +++++++++++++
 5 files changed, 183 insertions(+)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index d2d969669564..c11b413d5b1a 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -100,6 +100,8 @@ int __cgroup_bpf_attach(struct cgroup *cgrp,
 int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
 			struct bpf_cgroup_link *link,
 			enum bpf_attach_type type);
+int __cgroup_bpf_replace(struct cgroup *cgrp, struct bpf_cgroup_link *link,
+			 struct bpf_prog *new_prog);
 int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
 		       union bpf_attr __user *uattr);
 
@@ -110,6 +112,8 @@ int cgroup_bpf_attach(struct cgroup *cgrp,
 		      u32 flags);
 int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
 		      enum bpf_attach_type type);
+int cgroup_bpf_replace(struct bpf_link *link, struct bpf_prog *old_prog,
+		       struct bpf_prog *new_prog);
 int cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
 		     union bpf_attr __user *uattr);
 
@@ -350,6 +354,7 @@ int cgroup_bpf_prog_query(const union bpf_attr *attr,
 #else
 
 struct bpf_prog;
+struct bpf_link;
 struct cgroup_bpf {};
 static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 static inline void cgroup_bpf_offline(struct cgroup *cgrp) {}
@@ -373,6 +378,13 @@ static inline int cgroup_bpf_link_attach(const union bpf_attr *attr,
 	return -EINVAL;
 }
 
+static inline int cgroup_bpf_replace(struct bpf_link *link,
+				     struct bpf_prog *old_prog,
+				     struct bpf_prog *new_prog)
+{
+	return -EINVAL;
+}
+
 static inline int cgroup_bpf_prog_query(const union bpf_attr *attr,
 					union bpf_attr __user *uattr)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8b3f1c098ac0..6241cbcd2a64 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -112,6 +112,7 @@ enum bpf_cmd {
 	BPF_MAP_UPDATE_BATCH,
 	BPF_MAP_DELETE_BATCH,
 	BPF_LINK_CREATE,
+	BPF_LINK_UPDATE,
 };
 
 enum bpf_map_type {
@@ -577,6 +578,17 @@ union bpf_attr {
 		__u32		attach_type;	/* attach type */
 		__u32		flags;		/* extra flags */
 	} link_create;
+
+	struct { /* struct used by BPF_LINK_UPDATE command */
+		__u32		link_fd;	/* link fd */
+		/* new program fd to update link with */
+		__u32		new_prog_fd;
+		__u32		flags;		/* extra flags */
+		/* expected link's program fd; is specified only if
+		 * BPF_F_REPLACE flag is set in flags */
+		__u32		old_prog_fd;
+	} link_update;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index c24029937431..80676fc00d81 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -500,6 +500,86 @@ int __cgroup_bpf_attach(struct cgroup *cgrp,
 	return err;
 }
 
+/* Swap updated BPF program for given link in effective program arrays across
+ * all descendant cgroups. This function is guaranteed to succeed.
+ */
+static void replace_effective_prog(struct cgroup *cgrp,
+				   enum bpf_attach_type type,
+				   struct bpf_cgroup_link *link)
+{
+	struct bpf_prog_array_item *item;
+	struct cgroup_subsys_state *css;
+	struct bpf_prog_array *progs;
+	struct bpf_prog_list *pl;
+	struct list_head *head;
+	struct cgroup *cg;
+	int pos;
+
+	css_for_each_descendant_pre(css, &cgrp->self) {
+		struct cgroup *desc = container_of(css, struct cgroup, self);
+
+		if (percpu_ref_is_zero(&desc->bpf.refcnt))
+			continue;
+
+		/* find position of link in effective progs array */
+		for (pos = 0, cg = desc; cg; cg = cgroup_parent(cg)) {
+			if (pos && !(cg->bpf.flags[type] & BPF_F_ALLOW_MULTI))
+				continue;
+
+			head = &cg->bpf.progs[type];
+			list_for_each_entry(pl, head, node) {
+				if (!prog_list_prog(pl))
+					continue;
+				if (pl->link == link)
+					goto found;
+				pos++;
+			}
+		}
+found:
+		BUG_ON(!cg);
+		progs = rcu_dereference_protected(
+				desc->bpf.effective[type],
+				lockdep_is_held(&cgroup_mutex));
+		item = &progs->items[pos];
+		WRITE_ONCE(item->prog, link->link.prog);
+	}
+}
+
+/**
+ * __cgroup_bpf_replace() - Replace link's program and propagate the change
+ *                          to descendants
+ * @cgrp: The cgroup whose descendants to traverse
+ * @link: A link for which to replace the BPF program
+ * @new_prog: Updated BPF program to swap in
+ *
+ * Must be called with cgroup_mutex held.
+ */
+int __cgroup_bpf_replace(struct cgroup *cgrp, struct bpf_cgroup_link *link,
+			 struct bpf_prog *new_prog)
+{
+	struct list_head *progs = &cgrp->bpf.progs[link->type];
+	struct bpf_prog *old_prog;
+	struct bpf_prog_list *pl;
+	bool found = false;
+
+	if (link->link.prog->type != new_prog->type)
+		return -EINVAL;
+
+	list_for_each_entry(pl, progs, node) {
+		if (pl->link == link) {
+			found = true;
+			break;
+		}
+	}
+	if (!found)
+		return -ENOENT;
+
+	old_prog = xchg(&link->link.prog, new_prog);
+	replace_effective_prog(cgrp, link->type, link);
+	bpf_prog_put(old_prog);
+	return 0;
+}
+
 static struct bpf_prog_list *find_detach_entry(struct list_head *progs,
 					       struct bpf_prog *prog,
 					       struct bpf_cgroup_link *link,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 05412b83ed6c..43f7201b68c2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3593,6 +3593,55 @@ static int link_create(union bpf_attr *attr)
 	return ret;
 }
 
+#define BPF_LINK_UPDATE_LAST_FIELD link_update.old_prog_fd
+
+static int link_update(union bpf_attr *attr)
+{
+	struct bpf_prog *old_prog = NULL, *new_prog;
+	struct bpf_link *link;
+	u32 flags;
+	int ret;
+
+	if (CHECK_ATTR(BPF_LINK_UPDATE))
+		return -EINVAL;
+
+	flags = attr->link_update.flags;
+	if (flags & ~BPF_F_REPLACE)
+		return -EINVAL;
+
+	link = bpf_link_get_from_fd(attr->link_update.link_fd);
+	if (IS_ERR(link))
+		return PTR_ERR(link);
+
+	new_prog = bpf_prog_get(attr->link_update.new_prog_fd);
+	if (IS_ERR(new_prog))
+		return PTR_ERR(new_prog);
+
+	if (flags & BPF_F_REPLACE) {
+		old_prog = bpf_prog_get(attr->link_update.old_prog_fd);
+		if (IS_ERR(old_prog)) {
+			ret = PTR_ERR(old_prog);
+			old_prog = NULL;
+			goto out_put_progs;
+		}
+	}
+
+#ifdef CONFIG_CGROUP_BPF
+	if (link->ops == &bpf_cgroup_link_lops) {
+		ret = cgroup_bpf_replace(link, old_prog, new_prog);
+		goto out_put_progs;
+	}
+#endif
+	ret = -EINVAL;
+
+out_put_progs:
+	if (old_prog)
+		bpf_prog_put(old_prog);
+	if (ret)
+		bpf_prog_put(new_prog);
+	return ret;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr = {};
@@ -3706,6 +3755,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_LINK_CREATE:
 		err = link_create(&attr);
 		break;
+	case BPF_LINK_UPDATE:
+		err = link_update(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 219624fba9ba..915dda3f7f19 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6317,6 +6317,33 @@ int cgroup_bpf_attach(struct cgroup *cgrp,
 	return ret;
 }
 
+int cgroup_bpf_replace(struct bpf_link *link, struct bpf_prog *old_prog,
+		       struct bpf_prog *new_prog)
+{
+	struct bpf_cgroup_link *cg_link;
+	int ret;
+
+	if (link->ops != &bpf_cgroup_link_lops)
+		return -EINVAL;
+
+	cg_link = container_of(link, struct bpf_cgroup_link, link);
+
+	mutex_lock(&cgroup_mutex);
+	/* link might have been auto-released by dying cgroup, so fail */
+	if (!cg_link->cgroup) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+	if (old_prog && link->prog != old_prog) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+	ret = __cgroup_bpf_replace(cg_link->cgroup, cg_link, new_prog);
+out_unlock:
+	mutex_unlock(&cgroup_mutex);
+	return ret;
+}
+
 int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
 		      enum bpf_attach_type type)
 {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v3 bpf-next 3/4] libbpf: add support for bpf_link-based cgroup attachment
  2020-03-30  2:59 [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link Andrii Nakryiko
  2020-03-30  2:59 ` [PATCH v3 bpf-next 1/4] bpf: implement bpf_link-based cgroup BPF program attachment Andrii Nakryiko
  2020-03-30  2:59 ` [PATCH v3 bpf-next 2/4] bpf: implement bpf_prog replacement for an active bpf_cgroup_link Andrii Nakryiko
@ 2020-03-30  3:00 ` Andrii Nakryiko
  2020-03-30  3:00 ` [PATCH v3 bpf-next 4/4] selftests/bpf: test FD-based " Andrii Nakryiko
  2020-03-30 14:49 ` [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link David Ahern
  4 siblings, 0 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2020-03-30  3:00 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel, rdna
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Add bpf_program__attach_cgroup(), which uses the BPF_LINK_CREATE subcommand to
create an FD-based kernel bpf_link. Also add the low-level bpf_link_create()
API.

If expected_attach_type is not specified explicitly with
bpf_program__set_expected_attach_type(), libbpf will try to determine the
proper attach type from the BPF program's section definition.

Also add support for bpf_link's underlying BPF program replacement:
  - unconditional through high-level bpf_link__update_program() API;
  - cmpxchg-like with specifying expected current BPF program through
    low-level bpf_link_update() API.
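Both new low-level APIs take an optional opts struct whose first field, sz,
the caller sets to sizeof() of the struct it was compiled against. A minimal
userspace model of that forward/backward-compatibility scheme (illustrative
names, not libbpf's actual OPTS_* macros):

```c
#include <stddef.h>

/* Model of libbpf's opts convention (names are illustrative): the
 * library reads a field only if the caller's declared struct size
 * covers it, so new trailing fields can be added without breaking
 * binaries compiled against the older, shorter struct.
 */
struct my_link_update_opts {
	size_t sz;		/* caller sets this to sizeof(*opts) */
	unsigned int flags;
	unsigned int old_prog_fd;
};

static unsigned int my_opts_old_prog_fd(const struct my_link_update_opts *opts)
{
	size_t need = offsetof(struct my_link_update_opts, old_prog_fd) +
		      sizeof(opts->old_prog_fd);

	/* NULL opts or an older/shorter struct yields the default (0) */
	if (!opts || opts->sz < need)
		return 0;
	return opts->old_prog_fd;
}
```

This is why bpf_link_update() can later grow fields past old_prog_fd without
an ABI break: an old caller's smaller sz simply makes the new fields read as
their defaults.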

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/include/uapi/linux/bpf.h | 12 +++++++++
 tools/lib/bpf/bpf.c            | 34 +++++++++++++++++++++++++
 tools/lib/bpf/bpf.h            | 19 ++++++++++++++
 tools/lib/bpf/libbpf.c         | 46 ++++++++++++++++++++++++++++++++++
 tools/lib/bpf/libbpf.h         |  8 +++++-
 tools/lib/bpf/libbpf.map       |  4 +++
 6 files changed, 122 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 8b3f1c098ac0..6241cbcd2a64 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -112,6 +112,7 @@ enum bpf_cmd {
 	BPF_MAP_UPDATE_BATCH,
 	BPF_MAP_DELETE_BATCH,
 	BPF_LINK_CREATE,
+	BPF_LINK_UPDATE,
 };
 
 enum bpf_map_type {
@@ -577,6 +578,17 @@ union bpf_attr {
 		__u32		attach_type;	/* attach type */
 		__u32		flags;		/* extra flags */
 	} link_create;
+
+	struct { /* struct used by BPF_LINK_UPDATE command */
+		__u32		link_fd;	/* link fd */
+		/* new program fd to update link with */
+		__u32		new_prog_fd;
+		__u32		flags;		/* extra flags */
+		/* expected link's program fd; is specified only if
+		 * BPF_F_REPLACE flag is set in flags */
+		__u32		old_prog_fd;
+	} link_update;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 73220176728d..5cc1b0785d18 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -585,6 +585,40 @@ int bpf_prog_detach2(int prog_fd, int target_fd, enum bpf_attach_type type)
 	return sys_bpf(BPF_PROG_DETACH, &attr, sizeof(attr));
 }
 
+int bpf_link_create(int prog_fd, int target_fd,
+		    enum bpf_attach_type attach_type,
+		    const struct bpf_link_create_opts *opts)
+{
+	union bpf_attr attr;
+
+	if (!OPTS_VALID(opts, bpf_link_create_opts))
+		return -EINVAL;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.link_create.prog_fd = prog_fd;
+	attr.link_create.target_fd = target_fd;
+	attr.link_create.attach_type = attach_type;
+
+	return sys_bpf(BPF_LINK_CREATE, &attr, sizeof(attr));
+}
+
+int bpf_link_update(int link_fd, int new_prog_fd,
+		    const struct bpf_link_update_opts *opts)
+{
+	union bpf_attr attr;
+
+	if (!OPTS_VALID(opts, bpf_link_update_opts))
+		return -EINVAL;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.link_update.link_fd = link_fd;
+	attr.link_update.new_prog_fd = new_prog_fd;
+	attr.link_update.flags = OPTS_GET(opts, flags, 0);
+	attr.link_update.old_prog_fd = OPTS_GET(opts, old_prog_fd, 0);
+
+	return sys_bpf(BPF_LINK_UPDATE, &attr, sizeof(attr));
+}
+
 int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
 		   __u32 *attach_flags, __u32 *prog_ids, __u32 *prog_cnt)
 {
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index b976e77316cc..46d47afdd887 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -168,6 +168,25 @@ LIBBPF_API int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
 LIBBPF_API int bpf_prog_detach2(int prog_fd, int attachable_fd,
 				enum bpf_attach_type type);
 
+struct bpf_link_create_opts {
+	size_t sz; /* size of this struct for forward/backward compatibility */
+};
+#define bpf_link_create_opts__last_field sz
+
+LIBBPF_API int bpf_link_create(int prog_fd, int target_fd,
+			       enum bpf_attach_type attach_type,
+			       const struct bpf_link_create_opts *opts);
+
+struct bpf_link_update_opts {
+	size_t sz; /* size of this struct for forward/backward compatibility */
+	__u32 flags;	   /* extra flags */
+	__u32 old_prog_fd; /* expected old program FD */
+};
+#define bpf_link_update_opts__last_field old_prog_fd
+
+LIBBPF_API int bpf_link_update(int link_fd, int new_prog_fd,
+			       const struct bpf_link_update_opts *opts);
+
 struct bpf_prog_test_run_attr {
 	int prog_fd;
 	int repeat;
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 0638e717f502..ff9174282a8c 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -6978,6 +6978,12 @@ struct bpf_link {
 	bool disconnected;
 };
 
+/* Replace link's underlying BPF program with the new one */
+int bpf_link__update_program(struct bpf_link *link, struct bpf_program *prog)
+{
+	return bpf_link_update(bpf_link__fd(link), bpf_program__fd(prog), NULL);
+}
+
 /* Release "ownership" of underlying BPF resource (typically, BPF program
  * attached to some BPF hook, e.g., tracepoint, kprobe, etc). Disconnected
  * link, when destructed through bpf_link__destroy() call won't attempt to
@@ -7533,6 +7539,46 @@ static struct bpf_link *attach_lsm(const struct bpf_sec_def *sec,
 	return bpf_program__attach_lsm(prog);
 }
 
+struct bpf_link *
+bpf_program__attach_cgroup(struct bpf_program *prog, int cgroup_fd)
+{
+	const struct bpf_sec_def *sec_def;
+	enum bpf_attach_type attach_type;
+	char errmsg[STRERR_BUFSIZE];
+	struct bpf_link *link;
+	int prog_fd, link_fd;
+
+	prog_fd = bpf_program__fd(prog);
+	if (prog_fd < 0) {
+		pr_warn("program '%s': can't attach before loaded\n",
+			bpf_program__title(prog, false));
+		return ERR_PTR(-EINVAL);
+	}
+
+	link = calloc(1, sizeof(*link));
+	if (!link)
+		return ERR_PTR(-ENOMEM);
+	link->detach = &bpf_link__detach_fd;
+
+	attach_type = bpf_program__get_expected_attach_type(prog);
+	if (!attach_type) {
+		sec_def = find_sec_def(bpf_program__title(prog, false));
+		if (sec_def)
+			attach_type = sec_def->attach_type;
+	}
+	link_fd = bpf_link_create(prog_fd, cgroup_fd, attach_type, NULL);
+	if (link_fd < 0) {
+		link_fd = -errno;
+		free(link);
+		pr_warn("program '%s': failed to attach to cgroup: %s\n",
+			bpf_program__title(prog, false),
+			libbpf_strerror_r(link_fd, errmsg, sizeof(errmsg)));
+		return ERR_PTR(link_fd);
+	}
+	link->fd = link_fd;
+	return link;
+}
+
 struct bpf_link *bpf_program__attach(struct bpf_program *prog)
 {
 	const struct bpf_sec_def *sec_def;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 55348724c355..44df1d3e7287 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -224,6 +224,8 @@ LIBBPF_API int bpf_link__fd(const struct bpf_link *link);
 LIBBPF_API const char *bpf_link__pin_path(const struct bpf_link *link);
 LIBBPF_API int bpf_link__pin(struct bpf_link *link, const char *path);
 LIBBPF_API int bpf_link__unpin(struct bpf_link *link);
+LIBBPF_API int bpf_link__update_program(struct bpf_link *link,
+					struct bpf_program *prog);
 LIBBPF_API void bpf_link__disconnect(struct bpf_link *link);
 LIBBPF_API int bpf_link__destroy(struct bpf_link *link);
 
@@ -245,13 +247,17 @@ bpf_program__attach_tracepoint(struct bpf_program *prog,
 LIBBPF_API struct bpf_link *
 bpf_program__attach_raw_tracepoint(struct bpf_program *prog,
 				   const char *tp_name);
-
 LIBBPF_API struct bpf_link *
 bpf_program__attach_trace(struct bpf_program *prog);
 LIBBPF_API struct bpf_link *
 bpf_program__attach_lsm(struct bpf_program *prog);
+LIBBPF_API struct bpf_link *
+bpf_program__attach_cgroup(struct bpf_program *prog, int cgroup_fd);
+
 struct bpf_map;
+
 LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(struct bpf_map *map);
+
 struct bpf_insn;
 
 /*
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index eabd3d3e689f..bb8831605b25 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -243,7 +243,11 @@ LIBBPF_0.0.8 {
 		bpf_link__pin;
 		bpf_link__pin_path;
 		bpf_link__unpin;
+		bpf_link__update_program;
+		bpf_link_create;
+		bpf_link_update;
 		bpf_map__set_initial_value;
+		bpf_program__attach_cgroup;
 		bpf_program__attach_lsm;
 		bpf_program__is_lsm;
 		bpf_program__set_attach_target;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v3 bpf-next 4/4] selftests/bpf: test FD-based cgroup attachment
  2020-03-30  2:59 [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2020-03-30  3:00 ` [PATCH v3 bpf-next 3/4] libbpf: add support for bpf_link-based cgroup attachment Andrii Nakryiko
@ 2020-03-30  3:00 ` Andrii Nakryiko
  2020-03-30 14:49 ` [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link David Ahern
  4 siblings, 0 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2020-03-30  3:00 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel, rdna
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

Add selftests exercising FD-based cgroup BPF program attachments and their
intermixing with legacy cgroup BPF attachments. Auto-detachment and program
replacement (both unconditional and cmpxchg-like) are tested as well.
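The counts the selftest checks with bpf_prog_query() follow from how
BPF_F_ALLOW_MULTI programs compose down the hierarchy: a cgroup's effective
programs are its ancestors' programs plus its own. A toy model of that
arithmetic (a sketch, not the kernel's effective-array machinery):

```c
/* Toy model of a chain of nested cgroups, each with some number of
 * multi-attached egress programs; not kernel code, just the counting
 * rule the selftest's expectations rely on.
 */
struct toy_cg {
	int nprogs;	/* programs attached directly to this cgroup */
};

/* Effective programs seen at depth d (0 = root): with ALLOW_MULTI,
 * ancestors' programs run in addition to the cgroup's own, so counts
 * accumulate down the chain.
 */
static int toy_effective_cnt(const struct toy_cg *cgs, int d)
{
	int i, cnt = 0;

	for (i = 0; i <= d; i++)
		cnt += cgs[i].nprogs;
	return cnt;
}
```

With four nested cgroups each holding one link, the root reports 1 effective
program while the deepest reports 4 — the same numbers the test asserts, and
the number of per-packet program invocations ping_and_check() expects.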

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 .../selftests/bpf/prog_tests/cgroup_link.c    | 244 ++++++++++++++++++
 .../selftests/bpf/progs/test_cgroup_link.c    |  24 ++
 2 files changed, 268 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_link.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cgroup_link.c

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_link.c b/tools/testing/selftests/bpf/prog_tests/cgroup_link.c
new file mode 100644
index 000000000000..6e04f8d1d15b
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_link.c
@@ -0,0 +1,244 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include "cgroup_helpers.h"
+#include "test_cgroup_link.skel.h"
+
+static __u32 duration = 0;
+#define PING_CMD	"ping -q -c1 -w1 127.0.0.1 > /dev/null"
+
+static struct test_cgroup_link *skel = NULL;
+
+int ping_and_check(int exp_calls, int exp_alt_calls)
+{
+	skel->bss->calls = 0;
+	skel->bss->alt_calls = 0;
+	CHECK_FAIL(system(PING_CMD));
+	if (CHECK(skel->bss->calls != exp_calls, "call_cnt",
+		  "exp %d, got %d\n", exp_calls, skel->bss->calls))
+		return -EINVAL;
+	if (CHECK(skel->bss->alt_calls != exp_alt_calls, "alt_call_cnt",
+		  "exp %d, got %d\n", exp_alt_calls, skel->bss->alt_calls))
+		return -EINVAL;
+	return 0;
+}
+
+void test_cgroup_link(void)
+{
+	struct {
+		const char *path;
+		int fd;
+	} cgs[] = {
+		{ "/cg1" },
+		{ "/cg1/cg2" },
+		{ "/cg1/cg2/cg3" },
+		{ "/cg1/cg2/cg3/cg4" },
+	};
+	int last_cg = ARRAY_SIZE(cgs) - 1, cg_nr = ARRAY_SIZE(cgs);
+	DECLARE_LIBBPF_OPTS(bpf_link_update_opts, link_upd_opts);
+	struct bpf_link *links[ARRAY_SIZE(cgs)] = {}, *tmp_link;
+	__u32 prog_ids[ARRAY_SIZE(cgs)], prog_cnt = 0, attach_flags;
+	int i = 0, err, prog_fd;
+	bool detach_legacy = false;
+
+	skel = test_cgroup_link__open_and_load();
+	if (CHECK(!skel, "skel_open_load", "failed to open/load skeleton\n"))
+		return;
+	prog_fd = bpf_program__fd(skel->progs.egress);
+
+	err = setup_cgroup_environment();
+	if (CHECK(err, "cg_init", "failed: %d\n", err))
+		goto cleanup;
+
+	for (i = 0; i < cg_nr; i++) {
+		cgs[i].fd = create_and_get_cgroup(cgs[i].path);
+		if (CHECK(cgs[i].fd < 0, "cg_create", "fail: %d\n", cgs[i].fd))
+			goto cleanup;
+	}
+
+	err = join_cgroup(cgs[last_cg].path);
+	if (CHECK(err, "cg_join", "fail: %d\n", err))
+		goto cleanup;
+
+	for (i = 0; i < cg_nr; i++) {
+		links[i] = bpf_program__attach_cgroup(skel->progs.egress,
+						      cgs[i].fd);
+		if (CHECK(IS_ERR(links[i]), "cg_attach", "i: %d, err: %ld\n",
+				 i, PTR_ERR(links[i])))
+			goto cleanup;
+	}
+
+	ping_and_check(cg_nr, 0);
+
+	/* query the number of effective progs and attach flags in root cg */
+	err = bpf_prog_query(cgs[0].fd, BPF_CGROUP_INET_EGRESS,
+			     BPF_F_QUERY_EFFECTIVE, &attach_flags, NULL,
+			     &prog_cnt);
+	CHECK_FAIL(err);
+	CHECK_FAIL(attach_flags != BPF_F_ALLOW_MULTI);
+	if (CHECK(prog_cnt != 1, "effect_cnt", "exp %d, got %d\n", 1, prog_cnt))
+		goto cleanup;
+
+	/* query the number of effective progs in last cg */
+	err = bpf_prog_query(cgs[last_cg].fd, BPF_CGROUP_INET_EGRESS,
+			     BPF_F_QUERY_EFFECTIVE, NULL, NULL,
+			     &prog_cnt);
+	CHECK_FAIL(err);
+	CHECK_FAIL(attach_flags != BPF_F_ALLOW_MULTI);
+	if (CHECK(prog_cnt != cg_nr, "effect_cnt", "exp %d, got %d\n",
+		  cg_nr, prog_cnt))
+		goto cleanup;
+
+	/* query the effective prog IDs in last cg */
+	err = bpf_prog_query(cgs[last_cg].fd, BPF_CGROUP_INET_EGRESS,
+			     BPF_F_QUERY_EFFECTIVE, &attach_flags,
+			     prog_ids, &prog_cnt);
+	CHECK_FAIL(err);
+	CHECK_FAIL(attach_flags != BPF_F_ALLOW_MULTI);
+	if (CHECK(prog_cnt != cg_nr, "effect_cnt", "exp %d, got %d\n",
+		  cg_nr, prog_cnt))
+		goto cleanup;
+	for (i = 1; i < prog_cnt; i++) {
+		CHECK(prog_ids[i - 1] != prog_ids[i], "prog_id_check",
+		      "idx %d, prev id %d, cur id %d\n",
+		      i, prog_ids[i - 1], prog_ids[i]);
+	}
+
+	/* detach bottom program and ping again */
+	bpf_link__destroy(links[last_cg]);
+	links[last_cg] = NULL;
+
+	ping_and_check(cg_nr - 1, 0);
+
+	/* mix in with non link-based multi-attachments */
+	err = bpf_prog_attach(prog_fd, cgs[last_cg].fd,
+			      BPF_CGROUP_INET_EGRESS, BPF_F_ALLOW_MULTI);
+	if (CHECK(err, "cg_attach_legacy", "errno=%d\n", errno))
+		goto cleanup;
+	detach_legacy = true;
+
+	links[last_cg] = bpf_program__attach_cgroup(skel->progs.egress,
+						    cgs[last_cg].fd);
+	if (CHECK(IS_ERR(links[last_cg]), "cg_attach", "err: %ld\n",
+		  PTR_ERR(links[last_cg])))
+		goto cleanup;
+
+	ping_and_check(cg_nr + 1, 0);
+
+	/* detach link */
+	bpf_link__destroy(links[last_cg]);
+	links[last_cg] = NULL;
+
+	/* detach legacy */
+	err = bpf_prog_detach2(prog_fd, cgs[last_cg].fd, BPF_CGROUP_INET_EGRESS);
+	if (CHECK(err, "cg_detach_legacy", "errno=%d\n", errno))
+		goto cleanup;
+	detach_legacy = false;
+
+	/* attach legacy exclusive prog attachment */
+	err = bpf_prog_attach(prog_fd, cgs[last_cg].fd,
+			      BPF_CGROUP_INET_EGRESS, 0);
+	if (CHECK(err, "cg_attach_exclusive", "errno=%d\n", errno))
+		goto cleanup;
+	detach_legacy = true;
+
+	/* attempt to mix in with multi-attach bpf_link */
+	tmp_link = bpf_program__attach_cgroup(skel->progs.egress,
+					      cgs[last_cg].fd);
+	if (CHECK(!IS_ERR(tmp_link), "cg_attach_fail", "unexpected success!\n")) {
+		bpf_link__destroy(tmp_link);
+		goto cleanup;
+	}
+
+	ping_and_check(cg_nr, 0);
+
+	/* detach */
+	err = bpf_prog_detach2(prog_fd, cgs[last_cg].fd, BPF_CGROUP_INET_EGRESS);
+	if (CHECK(err, "cg_detach_legacy", "errno=%d\n", errno))
+		goto cleanup;
+	detach_legacy = false;
+
+	ping_and_check(cg_nr - 1, 0);
+
+	/* attach back link-based one */
+	links[last_cg] = bpf_program__attach_cgroup(skel->progs.egress,
+						    cgs[last_cg].fd);
+	if (CHECK(IS_ERR(links[last_cg]), "cg_attach", "err: %ld\n",
+		  PTR_ERR(links[last_cg])))
+		goto cleanup;
+
+	ping_and_check(cg_nr, 0);
+
+	/* check legacy exclusive prog can't be attached */
+	err = bpf_prog_attach(prog_fd, cgs[last_cg].fd,
+			      BPF_CGROUP_INET_EGRESS, 0);
+	if (CHECK(!err, "cg_attach_exclusive", "unexpected success")) {
+		bpf_prog_detach2(prog_fd, cgs[last_cg].fd, BPF_CGROUP_INET_EGRESS);
+		goto cleanup;
+	}
+
+	/* replace BPF programs inside their links for all but first link */
+	for (i = 1; i < cg_nr; i++) {
+		err = bpf_link__update_program(links[i], skel->progs.egress_alt);
+		if (CHECK(err, "prog_upd", "link #%d\n", i))
+			goto cleanup;
+	}
+
+	ping_and_check(1, cg_nr - 1);
+
+	/* Attempt program update with wrong expected BPF program */
+	link_upd_opts.old_prog_fd = bpf_program__fd(skel->progs.egress_alt);
+	link_upd_opts.flags = BPF_F_REPLACE;
+	err = bpf_link_update(bpf_link__fd(links[0]),
+			      bpf_program__fd(skel->progs.egress_alt),
+			      &link_upd_opts);
+	if (CHECK(err == 0 || errno != EPERM, "prog_cmpxchg1",
+		  "unexpectedly succeeded, err %d, errno %d\n", err, -errno))
+		goto cleanup;
+
+	/* Compare-exchange single link program from egress to egress_alt */
+	link_upd_opts.old_prog_fd = bpf_program__fd(skel->progs.egress);
+	link_upd_opts.flags = BPF_F_REPLACE;
+	err = bpf_link_update(bpf_link__fd(links[0]),
+			      bpf_program__fd(skel->progs.egress_alt),
+			      &link_upd_opts);
+	if (CHECK(err, "prog_cmpxchg2", "errno %d\n", -errno))
+		goto cleanup;
+
+	/* ping */
+	ping_and_check(0, cg_nr);
+
+	/* close cgroup FDs before detaching links */
+	for (i = 0; i < cg_nr; i++) {
+		if (cgs[i].fd > 0) {
+			close(cgs[i].fd);
+			cgs[i].fd = -1;
+		}
+	}
+
+	/* BPF programs should still get called */
+	ping_and_check(0, cg_nr);
+
+	/* leave cgroup and remove them, don't detach programs */
+	cleanup_cgroup_environment();
+
+	/* BPF programs should have been auto-detached */
+	ping_and_check(0, 0);
+
+cleanup:
+	if (detach_legacy)
+		bpf_prog_detach2(prog_fd, cgs[last_cg].fd,
+				 BPF_CGROUP_INET_EGRESS);
+
+	for (i = 0; i < cg_nr; i++) {
+		if (!IS_ERR(links[i]))
+			bpf_link__destroy(links[i]);
+	}
+	test_cgroup_link__destroy(skel);
+
+	for (i = 0; i < cg_nr; i++) {
+		if (cgs[i].fd > 0)
+			close(cgs[i].fd);
+	}
+	cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_cgroup_link.c b/tools/testing/selftests/bpf/progs/test_cgroup_link.c
new file mode 100644
index 000000000000..77e47b9e4446
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_cgroup_link.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2020 Facebook
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+int calls = 0;
+int alt_calls = 0;
+
+SEC("cgroup_skb/egress1")
+int egress(struct __sk_buff *skb)
+{
+	__sync_fetch_and_add(&calls, 1);
+	return 1;
+}
+
+SEC("cgroup_skb/egress2")
+int egress_alt(struct __sk_buff *skb)
+{
+	__sync_fetch_and_add(&alt_calls, 1);
+	return 1;
+}
+
+char _license[] SEC("license") = "GPL";
+
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-30  2:59 [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link Andrii Nakryiko
                   ` (3 preceding siblings ...)
  2020-03-30  3:00 ` [PATCH v3 bpf-next 4/4] selftests/bpf: test FD-based " Andrii Nakryiko
@ 2020-03-30 14:49 ` David Ahern
  2020-03-30 20:20   ` Andrii Nakryiko
  4 siblings, 1 reply; 29+ messages in thread
From: David Ahern @ 2020-03-30 14:49 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel, rdna
  Cc: andrii.nakryiko, kernel-team

On 3/29/20 8:59 PM, Andrii Nakryiko wrote:
> bpf_link abstraction itself was formalized in [0] with justifications for why
> its semantics is a good fit for attaching BPF programs of various types. This
> patch set adds bpf_link-based BPF program attachment mechanism for cgroup BPF
> programs.
> 
> Cgroup BPF link is semantically compatible with current BPF_F_ALLOW_MULTI
> semantics of attaching cgroup BPF programs directly. Thus cgroup bpf_link can
> co-exist with legacy BPF program multi-attachment.
> 
> bpf_link is destroyed and automatically detached when the last open FD holding
> the reference to bpf_link is closed. This means that by default, when the
> process that created bpf_link exits, attached BPF program will be
> automatically detached due to bpf_link's clean up code. Cgroup bpf_link, like
> any other bpf_link, can be pinned in BPF FS and by those means survive the
> exit of process that created the link. This is useful in many scenarios to
> provide long-living BPF program attachments. Pinning also means that there
> could be many owners of bpf_link through independent FDs.
> 
> Additionally, auto-detachment of cgroup bpf_link is implemented. When cgroup is
> dying it will automatically detach all active bpf_links. This ensures that
> cgroup clean up is not delayed due to active bpf_link even despite no chance
> for any BPF program to be run for a given cgroup. In that sense it's similar
> to existing behavior of dropping refcnt of attached bpf_prog. But in the case
> of bpf_link, bpf_link is not destroyed and is still available to user as long
> as at least one active FD is still open (or if it's pinned in BPF FS).
> 
> There are two main cgroup-specific differences between bpf_link-based and
> direct bpf_prog-based attachment.
> 
> First, as opposed to direct bpf_prog attachment, cgroup itself doesn't "own"
> bpf_link, which makes it possible to auto-clean up attached bpf_link when user
> process abruptly exits without explicitly detaching BPF program. This makes
> for a safe default behavior proven in BPF tracing program types. But bpf_link
> doesn't bump cgroup->bpf.refcnt as well and because of that doesn't prevent
> cgroup from cleaning up its BPF state.
> 
> Second, only owners of bpf_link (those who created bpf_link in the first place
> or obtained a new FD by opening bpf_link from BPF FS) can detach and/or update
> it. This makes sure that no other process can accidentally remove/replace BPF
> program.
> 
> This patch set also implements LINK_UPDATE sub-command, which allows to
> replace bpf_link's underlying bpf_prog, similarly to BPF_F_REPLACE flag
> behavior for direct bpf_prog cgroup attachment. Similarly to LINK_CREATE, it
> is supposed to be generic command for different types of bpf_links.
> 

The observability piece should go in the same release as the feature.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-30 14:49 ` [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link David Ahern
@ 2020-03-30 20:20   ` Andrii Nakryiko
  2020-03-30 20:45     ` David Ahern
  0 siblings, 1 reply; 29+ messages in thread
From: Andrii Nakryiko @ 2020-03-30 20:20 UTC (permalink / raw)
  To: David Ahern
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Andrey Ignatov, Kernel Team

On Mon, Mar 30, 2020 at 7:49 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/29/20 8:59 PM, Andrii Nakryiko wrote:
> > bpf_link abstraction itself was formalized in [0] with justifications for why
> > its semantics is a good fit for attaching BPF programs of various types. This
> > patch set adds bpf_link-based BPF program attachment mechanism for cgroup BPF
> > programs.
> >
> > Cgroup BPF link is semantically compatible with current BPF_F_ALLOW_MULTI
> > semantics of attaching cgroup BPF programs directly. Thus cgroup bpf_link can
> > co-exist with legacy BPF program multi-attachment.
> >
> > bpf_link is destroyed and automatically detached when the last open FD holding
> > the reference to bpf_link is closed. This means that, by default, when the
> > process that created a bpf_link exits, the attached BPF program will be
> > automatically detached by bpf_link's clean-up code. Cgroup bpf_link, like
> > any other bpf_link, can be pinned in BPF FS and by those means survive the
> > exit of the process that created the link. This is useful in many scenarios
> > to provide long-lived BPF program attachments. Pinning also means that there
> > can be many owners of a bpf_link through independent FDs.
> >
> > Additionally, auto-detachment of cgroup bpf_link is implemented. When a cgroup
> > is dying, it will automatically detach all active bpf_links. This ensures that
> > cgroup clean-up is not delayed by an active bpf_link, even though no BPF
> > program can run for that cgroup anymore. In that sense it's similar
> > to existing behavior of dropping refcnt of attached bpf_prog. But in the case
> > of bpf_link, bpf_link is not destroyed and is still available to user as long
> > as at least one active FD is still open (or if it's pinned in BPF FS).
> >
> > There are two main cgroup-specific differences between bpf_link-based and
> > direct bpf_prog-based attachment.
> >
> > First, as opposed to direct bpf_prog attachment, the cgroup itself doesn't "own"
> > a bpf_link, which makes it possible to auto-clean up an attached bpf_link when
> > the user process abruptly exits without explicitly detaching the BPF program.
> > This makes for a safe default behavior, already proven for BPF tracing program
> > types. Likewise, bpf_link doesn't bump cgroup->bpf.refcnt, and because of that
> > doesn't prevent the cgroup from cleaning up its BPF state.
> >
> > Second, only owners of a bpf_link (those who created the bpf_link in the first
> > place or obtained a new FD by opening the bpf_link from BPF FS) can detach
> > and/or update it. This ensures that no other process can accidentally remove
> > or replace the BPF program.
> >
> > This patch set also implements the LINK_UPDATE sub-command, which allows
> > replacing bpf_link's underlying bpf_prog, similarly to the BPF_F_REPLACE flag's
> > behavior for direct bpf_prog cgroup attachment. Like LINK_CREATE, it is
> > intended to be a generic command for different types of bpf_links.
> >
>
> The observability piece should go in the same release as the feature.

You mean the LINK_QUERY command I mentioned before? Yes, I'm working on
adding it next, regardless of whether this patch set goes in right now or
later.


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-30 20:20   ` Andrii Nakryiko
@ 2020-03-30 20:45     ` David Ahern
  2020-03-30 20:50       ` Alexei Starovoitov
  2020-03-30 22:50       ` Alexei Starovoitov
  0 siblings, 2 replies; 29+ messages in thread
From: David Ahern @ 2020-03-30 20:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Andrey Ignatov, Kernel Team

On 3/30/20 2:20 PM, Andrii Nakryiko wrote:
>> The observability piece should go in the same release as the feature.
> You mean the LINK_QUERY command I mentioned before? Yes, I'm working on
> adding it next, regardless of whether this patch set goes in right now or
> later.

There was also mention of a "human override", the details of which have not
been discussed. Since the query is not ready either, the create/update
commands should wait so that all of it can land in the same kernel
release. As it stands, this is a half-baked feature.


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-30 20:45     ` David Ahern
@ 2020-03-30 20:50       ` Alexei Starovoitov
  2020-03-30 22:50       ` Alexei Starovoitov
  1 sibling, 0 replies; 29+ messages in thread
From: Alexei Starovoitov @ 2020-03-30 20:50 UTC (permalink / raw)
  To: David Ahern
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On Mon, Mar 30, 2020 at 1:46 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/30/20 2:20 PM, Andrii Nakryiko wrote:
> >> The observability piece should go in the same release as the feature.
> > You mean the LINK_QUERY command I mentioned before? Yes, I'm working on
> > adding it next, regardless of whether this patch set goes in right now or
> > later.
>
> There was also mention of a "human override", the details of which have not
> been discussed. Since the query is not ready either, the create/update
> commands should wait so that all of it can land in the same kernel
> release. As it stands, this is a half-baked feature.

"human override" was discussed only in the context of XDP.
There won't be an override for cgroup-bpf, just as none exists
today; there is no need for it. There will be no override for
cls_bpf either. None of the multi-prog hooks need it.


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-30 20:45     ` David Ahern
  2020-03-30 20:50       ` Alexei Starovoitov
@ 2020-03-30 22:50       ` Alexei Starovoitov
  2020-03-30 23:43         ` David Ahern
  1 sibling, 1 reply; 29+ messages in thread
From: Alexei Starovoitov @ 2020-03-30 22:50 UTC (permalink / raw)
  To: David Ahern
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On Mon, Mar 30, 2020 at 1:46 PM David Ahern <dsahern@gmail.com> wrote:
> release. As it stands, this is a half-baked feature.

speaking of half-baked:
I think, as it stands (even without LINK_QUERY), it's already an extremely
useful addition that doesn't take anything away from existing cgroup-bpf
and doesn't hinder observability. 'bpftool cgroup' works just fine.
So I've applied the set.

Even if it were half-baked, it would still be apply-able.
Many features are developed over the course of multiple
kernel releases. Examples: your nexthops, mptcp, bpf-lsm.


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-30 22:50       ` Alexei Starovoitov
@ 2020-03-30 23:43         ` David Ahern
  2020-03-31  0:32           ` Alexei Starovoitov
  0 siblings, 1 reply; 29+ messages in thread
From: David Ahern @ 2020-03-30 23:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On 3/30/20 4:50 PM, Alexei Starovoitov wrote:
> On Mon, Mar 30, 2020 at 1:46 PM David Ahern <dsahern@gmail.com> wrote:
>> release. As it stands, this is a half-baked feature.
> 
> speaking of half-baked:
> I think, as it stands (even without LINK_QUERY), it's already an extremely
> useful addition that doesn't take anything away from existing cgroup-bpf
> and doesn't hinder observability. 'bpftool cgroup' works just fine.
> So I've applied the set.
> 
> Even if it were half-baked, it would still be apply-able.
> Many features are developed over the course of multiple
> kernel releases. Examples: your nexthops, mptcp, bpf-lsm.
> 

nexthops were not - the refactoring went in one release and the entire
feature went into 5.4. Large features / patch sets often must be spread
across kernel versions because it is not humanly possible to send and
review all the patches at once.

This is not a large feature, and there is no reason for CREATE/UPDATE -
a mere 4-patch set - to go in without something as essential as the
QUERY for observability.


* Re: [PATCH v3 bpf-next 1/4] bpf: implement bpf_link-based cgroup BPF program attachment
  2020-03-30  2:59 ` [PATCH v3 bpf-next 1/4] bpf: implement bpf_link-based cgroup BPF program attachment Andrii Nakryiko
@ 2020-03-31  0:05   ` Andrey Ignatov
  2020-03-31  0:38     ` Alexei Starovoitov
  0 siblings, 1 reply; 29+ messages in thread
From: Andrey Ignatov @ 2020-03-31  0:05 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, netdev, ast, daniel, andrii.nakryiko, kernel-team

Andrii Nakryiko <andriin@fb.com> [Sun, 2020-03-29 20:00 -0700]:
> Implement a new sub-command to attach cgroup BPF programs and return an
> FD-based bpf_link back on success. A bpf_link, once attached to a cgroup,
> cannot be replaced, except by an owner holding its FD. Cgroup bpf_link
> supports only BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based
> BPF_F_ALLOW_MULTI attachments can be freely intermixed.
> 
> To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
> BPF program can be executed, implement auto-detachment of link. When
> cgroup_bpf_release() is called, all attached bpf_links are forced to release
> cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
> well as still owning underlying bpf_prog. This is because user-space might
> still have FDs open and active, so bpf_link as a user-referenced object can't
> be freed yet. Once the last active FD is closed, the bpf_link will be freed
> and the underlying bpf_prog refcount will be dropped. But the cgroup refcount
> won't be touched, because the cgroup has been released already.
> 
> The inherent race between bpf_cgroup_link release (from closing last FD) and
> cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
> the only additional check required is when bpf_cgroup_link attempts to detach
> itself from cgroup. At that time we need to check whether there is still
> cgroup associated with that link. And if not, exit with success, because
> bpf_cgroup_link was already successfully detached.
> 
> Acked-by: Roman Gushchin <guro@fb.com>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>  include/linux/bpf-cgroup.h     |  29 ++-
>  include/linux/bpf.h            |  10 +-
>  include/uapi/linux/bpf.h       |  10 +-
>  kernel/bpf/cgroup.c            | 315 +++++++++++++++++++++++++--------
>  kernel/bpf/syscall.c           |  61 ++++++-
>  kernel/cgroup/cgroup.c         |  14 +-
>  tools/include/uapi/linux/bpf.h |  10 +-
>  7 files changed, 351 insertions(+), 98 deletions(-)
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index a7cd5c7a2509..d2d969669564 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -51,9 +51,18 @@ struct bpf_cgroup_storage {
>  	struct rcu_head rcu;
>  };
>  
> +struct bpf_cgroup_link {
> +	struct bpf_link link;
> +	struct cgroup *cgroup;
> +	enum bpf_attach_type type;
> +};
> +
> +extern const struct bpf_link_ops bpf_cgroup_link_lops;
> +
>  struct bpf_prog_list {
>  	struct list_head node;
>  	struct bpf_prog *prog;
> +	struct bpf_cgroup_link *link;
>  	struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE];
>  };
>  
> @@ -84,20 +93,23 @@ struct cgroup_bpf {
>  int cgroup_bpf_inherit(struct cgroup *cgrp);
>  void cgroup_bpf_offline(struct cgroup *cgrp);
>  
> -int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
> -			struct bpf_prog *replace_prog,
> +int __cgroup_bpf_attach(struct cgroup *cgrp,
> +			struct bpf_prog *prog, struct bpf_prog *replace_prog,
> +			struct bpf_cgroup_link *link,
>  			enum bpf_attach_type type, u32 flags);
>  int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
> +			struct bpf_cgroup_link *link,
>  			enum bpf_attach_type type);
>  int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
>  		       union bpf_attr __user *uattr);
>  
>  /* Wrapper for __cgroup_bpf_*() protected by cgroup_mutex */
> -int cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
> -		      struct bpf_prog *replace_prog, enum bpf_attach_type type,
> +int cgroup_bpf_attach(struct cgroup *cgrp,
> +		      struct bpf_prog *prog, struct bpf_prog *replace_prog,
> +		      struct bpf_cgroup_link *link, enum bpf_attach_type type,
>  		      u32 flags);
>  int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
> -		      enum bpf_attach_type type, u32 flags);
> +		      enum bpf_attach_type type);
>  int cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
>  		     union bpf_attr __user *uattr);
>  
> @@ -332,6 +344,7 @@ int cgroup_bpf_prog_attach(const union bpf_attr *attr,
>  			   enum bpf_prog_type ptype, struct bpf_prog *prog);
>  int cgroup_bpf_prog_detach(const union bpf_attr *attr,
>  			   enum bpf_prog_type ptype);
> +int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>  int cgroup_bpf_prog_query(const union bpf_attr *attr,
>  			  union bpf_attr __user *uattr);
>  #else
> @@ -354,6 +367,12 @@ static inline int cgroup_bpf_prog_detach(const union bpf_attr *attr,
>  	return -EINVAL;
>  }
>  
> +static inline int cgroup_bpf_link_attach(const union bpf_attr *attr,
> +					 struct bpf_prog *prog)
> +{
> +	return -EINVAL;
> +}
> +
>  static inline int cgroup_bpf_prog_query(const union bpf_attr *attr,
>  					union bpf_attr __user *uattr)
>  {
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 3bde59a8453b..56254d880293 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1082,15 +1082,23 @@ extern int sysctl_unprivileged_bpf_disabled;
>  int bpf_map_new_fd(struct bpf_map *map, int flags);
>  int bpf_prog_new_fd(struct bpf_prog *prog);
>  
> -struct bpf_link;
> +struct bpf_link {
> +	atomic64_t refcnt;
> +	const struct bpf_link_ops *ops;
> +	struct bpf_prog *prog;
> +	struct work_struct work;
> +};
>  
>  struct bpf_link_ops {
>  	void (*release)(struct bpf_link *link);
>  	void (*dealloc)(struct bpf_link *link);
> +
>  };
>  
>  void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops,
>  		   struct bpf_prog *prog);
> +void bpf_link_cleanup(struct bpf_link *link, struct file *link_file,
> +		      int link_fd);
>  void bpf_link_inc(struct bpf_link *link);
>  void bpf_link_put(struct bpf_link *link);
>  int bpf_link_new_fd(struct bpf_link *link);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f1fbc36f58d3..8b3f1c098ac0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -111,6 +111,7 @@ enum bpf_cmd {
>  	BPF_MAP_LOOKUP_AND_DELETE_BATCH,
>  	BPF_MAP_UPDATE_BATCH,
>  	BPF_MAP_DELETE_BATCH,
> +	BPF_LINK_CREATE,
>  };
>  
>  enum bpf_map_type {
> @@ -541,7 +542,7 @@ union bpf_attr {
>  		__u32		prog_cnt;
>  	} query;
>  
> -	struct {
> +	struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
>  		__u64 name;
>  		__u32 prog_fd;
>  	} raw_tracepoint;
> @@ -569,6 +570,13 @@ union bpf_attr {
>  		__u64		probe_offset;	/* output: probe_offset */
>  		__u64		probe_addr;	/* output: probe_addr */
>  	} task_fd_query;
> +
> +	struct { /* struct used by BPF_LINK_CREATE command */
> +		__u32		prog_fd;	/* eBPF program to attach */
> +		__u32		target_fd;	/* object to attach to */
> +		__u32		attach_type;	/* attach type */
> +		__u32		flags;		/* extra flags */
> +	} link_create;
>  } __attribute__((aligned(8)));
>  
>  /* The description below is an attempt at providing documentation to eBPF
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 9c8472823a7f..c24029937431 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -80,6 +80,17 @@ static void bpf_cgroup_storages_unlink(struct bpf_cgroup_storage *storages[])
>  		bpf_cgroup_storage_unlink(storages[stype]);
>  }
>  
> +/* Called when bpf_cgroup_link is auto-detached from dying cgroup.
> + * It drops cgroup and bpf_prog refcounts, and marks bpf_link as defunct. It
> + * doesn't free link memory, which will eventually be done by bpf_link's
> + * release() callback, when its last FD is closed.
> + */
> +static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link)
> +{
> +	cgroup_put(link->cgroup);
> +	link->cgroup = NULL;
> +}
> +
>  /**
>   * cgroup_bpf_release() - put references of all bpf programs and
>   *                        release all cgroup bpf data
> @@ -100,7 +111,10 @@ static void cgroup_bpf_release(struct work_struct *work)
>  
>  		list_for_each_entry_safe(pl, tmp, progs, node) {
>  			list_del(&pl->node);
> -			bpf_prog_put(pl->prog);
> +			if (pl->prog)
> +				bpf_prog_put(pl->prog);
> +			if (pl->link)
> +				bpf_cgroup_link_auto_detach(pl->link);
>  			bpf_cgroup_storages_unlink(pl->storage);
>  			bpf_cgroup_storages_free(pl->storage);
>  			kfree(pl);
> @@ -134,6 +148,18 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
>  	queue_work(system_wq, &cgrp->bpf.release_work);
>  }
>  
> +/* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
> + * link or direct prog.
> + */
> +static struct bpf_prog *prog_list_prog(struct bpf_prog_list *pl)
> +{
> +	if (pl->prog)
> +		return pl->prog;
> +	if (pl->link)
> +		return pl->link->link.prog;
> +	return NULL;
> +}
> +
>  /* count number of elements in the list.
>   * it's slow but the list cannot be long
>   */
> @@ -143,7 +169,7 @@ static u32 prog_list_length(struct list_head *head)
>  	u32 cnt = 0;
>  
>  	list_for_each_entry(pl, head, node) {
> -		if (!pl->prog)
> +		if (!prog_list_prog(pl))
>  			continue;
>  		cnt++;
>  	}
> @@ -212,11 +238,11 @@ static int compute_effective_progs(struct cgroup *cgrp,
>  			continue;
>  
>  		list_for_each_entry(pl, &p->bpf.progs[type], node) {
> -			if (!pl->prog)
> +			if (!prog_list_prog(pl))
>  				continue;
>  
>  			item = &progs->items[cnt];
> -			item->prog = pl->prog;
> +			item->prog = prog_list_prog(pl);
>  			bpf_cgroup_storages_assign(item->cgroup_storage,
>  						   pl->storage);
>  			cnt++;
> @@ -333,19 +359,60 @@ static int update_effective_progs(struct cgroup *cgrp,
>  
>  #define BPF_CGROUP_MAX_PROGS 64
>  
> +static struct bpf_prog_list *find_attach_entry(struct list_head *progs,
> +					       struct bpf_prog *prog,
> +					       struct bpf_cgroup_link *link,
> +					       struct bpf_prog *replace_prog,
> +					       bool allow_multi)
> +{
> +	struct bpf_prog_list *pl;
> +
> +	/* single-attach case */
> +	if (!allow_multi) {
> +		if (list_empty(progs))
> +			return NULL;
> +		return list_first_entry(progs, typeof(*pl), node);
> +	}
> +
> +	list_for_each_entry(pl, progs, node) {
> +		if (prog && pl->prog == prog)
> +			/* disallow attaching the same prog twice */
> +			return ERR_PTR(-EINVAL);
> +		if (link && pl->link == link)
> +			/* disallow attaching the same link twice */
> +			return ERR_PTR(-EINVAL);
> +	}
> +
> +	/* direct prog multi-attach w/ replacement case */
> +	if (replace_prog) {
> +		list_for_each_entry(pl, progs, node) {
> +			if (pl->prog == replace_prog)
> +				/* a match found */
> +				return pl;
> +		}
> +		/* prog to replace not found for cgroup */
> +		return ERR_PTR(-ENOENT);
> +	}
> +
> +	return NULL;
> +}
> +
>  /**
> - * __cgroup_bpf_attach() - Attach the program to a cgroup, and
> + * __cgroup_bpf_attach() - Attach the program or the link to a cgroup, and
>   *                         propagate the change to descendants
>   * @cgrp: The cgroup which descendants to traverse
>   * @prog: A program to attach
> + * @link: A link to attach
>   * @replace_prog: Previously attached program to replace if BPF_F_REPLACE is set
>   * @type: Type of attach operation
>   * @flags: Option flags
>   *
> + * Exactly one of @prog or @link can be non-null.
>   * Must be called with cgroup_mutex held.
>   */
> -int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
> -			struct bpf_prog *replace_prog,
> +int __cgroup_bpf_attach(struct cgroup *cgrp,
> +			struct bpf_prog *prog, struct bpf_prog *replace_prog,
> +			struct bpf_cgroup_link *link,
>  			enum bpf_attach_type type, u32 flags)
>  {
>  	u32 saved_flags = (flags & (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI));
> @@ -353,13 +420,19 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
>  	struct bpf_prog *old_prog = NULL;
>  	struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE],
>  		*old_storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {NULL};
> -	struct bpf_prog_list *pl, *replace_pl = NULL;
> +	struct bpf_prog_list *pl;
>  	int err;
>  
>  	if (((flags & BPF_F_ALLOW_OVERRIDE) && (flags & BPF_F_ALLOW_MULTI)) ||
>  	    ((flags & BPF_F_REPLACE) && !(flags & BPF_F_ALLOW_MULTI)))
>  		/* invalid combination */
>  		return -EINVAL;
> +	if (link && (prog || replace_prog))
> +		/* only either link or prog/replace_prog can be specified */
> +		return -EINVAL;
> +	if (!!replace_prog != !!(flags & BPF_F_REPLACE))
> +		/* replace_prog implies BPF_F_REPLACE, and vice versa */
> +		return -EINVAL;
>  
>  	if (!hierarchy_allows_attach(cgrp, type))
>  		return -EPERM;
> @@ -374,26 +447,15 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
>  	if (prog_list_length(progs) >= BPF_CGROUP_MAX_PROGS)
>  		return -E2BIG;
>  
> -	if (flags & BPF_F_ALLOW_MULTI) {
> -		list_for_each_entry(pl, progs, node) {
> -			if (pl->prog == prog)
> -				/* disallow attaching the same prog twice */
> -				return -EINVAL;
> -			if (pl->prog == replace_prog)
> -				replace_pl = pl;
> -		}
> -		if ((flags & BPF_F_REPLACE) && !replace_pl)
> -			/* prog to replace not found for cgroup */
> -			return -ENOENT;
> -	} else if (!list_empty(progs)) {
> -		replace_pl = list_first_entry(progs, typeof(*pl), node);
> -	}
> +	pl = find_attach_entry(progs, prog, link, replace_prog,
> +			       flags & BPF_F_ALLOW_MULTI);
> +	if (IS_ERR(pl))
> +		return PTR_ERR(pl);
>  
> -	if (bpf_cgroup_storages_alloc(storage, prog))
> +	if (bpf_cgroup_storages_alloc(storage, prog ? : link->link.prog))
>  		return -ENOMEM;
>  
> -	if (replace_pl) {
> -		pl = replace_pl;
> +	if (pl) {
>  		old_prog = pl->prog;
>  		bpf_cgroup_storages_unlink(pl->storage);
>  		bpf_cgroup_storages_assign(old_storage, pl->storage);
> @@ -407,6 +469,7 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
>  	}
>  
>  	pl->prog = prog;
> +	pl->link = link;
>  	bpf_cgroup_storages_assign(pl->storage, storage);
>  	cgrp->bpf.flags[type] = saved_flags;
>  
> @@ -414,80 +477,93 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
>  	if (err)
>  		goto cleanup;
>  
> -	static_branch_inc(&cgroup_bpf_enabled_key);
>  	bpf_cgroup_storages_free(old_storage);
> -	if (old_prog) {
> +	if (old_prog)
>  		bpf_prog_put(old_prog);
> -		static_branch_dec(&cgroup_bpf_enabled_key);
> -	}
> -	bpf_cgroup_storages_link(storage, cgrp, type);
> +	else
> +		static_branch_inc(&cgroup_bpf_enabled_key);
> +	bpf_cgroup_storages_link(pl->storage, cgrp, type);
>  	return 0;
>  
>  cleanup:
> -	/* and cleanup the prog list */
> -	pl->prog = old_prog;
> +	if (old_prog) {
> +		pl->prog = old_prog;
> +		pl->link = NULL;
> +	}
>  	bpf_cgroup_storages_free(pl->storage);
>  	bpf_cgroup_storages_assign(pl->storage, old_storage);
>  	bpf_cgroup_storages_link(pl->storage, cgrp, type);
> -	if (!replace_pl) {
> +	if (!old_prog) {
>  		list_del(&pl->node);
>  		kfree(pl);
>  	}
>  	return err;
>  }
>  
> +static struct bpf_prog_list *find_detach_entry(struct list_head *progs,
> +					       struct bpf_prog *prog,
> +					       struct bpf_cgroup_link *link,
> +					       bool allow_multi)
> +{
> +	struct bpf_prog_list *pl;
> +
> +	if (!allow_multi) {
> +		if (list_empty(progs))
> +			/* report error when trying to detach and nothing is attached */
> +			return ERR_PTR(-ENOENT);
> +
> +		/* to maintain backward compatibility NONE and OVERRIDE cgroups
> +		 * allow detaching with invalid FD (prog==NULL) in legacy mode
> +		 */
> +		return list_first_entry(progs, typeof(*pl), node);
> +	}
> +
> +	if (!prog && !link)
> +		/* to detach MULTI prog the user has to specify valid FD
> +		 * of the program or link to be detached
> +		 */
> +		return ERR_PTR(-EINVAL);
> +
> +	/* find the prog or link and detach it */
> +	list_for_each_entry(pl, progs, node) {
> +		if (pl->prog == prog && pl->link == link)
> +			return pl;
> +	}
> +	return ERR_PTR(-ENOENT);
> +}
> +
>  /**
> - * __cgroup_bpf_detach() - Detach the program from a cgroup, and
> + * __cgroup_bpf_detach() - Detach the program or link from a cgroup, and
>   *                         propagate the change to descendants
>   * @cgrp: The cgroup which descendants to traverse
>   * @prog: A program to detach or NULL
> + * @link: A link to detach or NULL
>   * @type: Type of detach operation
>   *
> + * At most one of @prog or @link can be non-NULL.
>   * Must be called with cgroup_mutex held.
>   */
>  int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
> -			enum bpf_attach_type type)
> +			struct bpf_cgroup_link *link, enum bpf_attach_type type)
>  {
>  	struct list_head *progs = &cgrp->bpf.progs[type];
>  	u32 flags = cgrp->bpf.flags[type];
> -	struct bpf_prog *old_prog = NULL;
>  	struct bpf_prog_list *pl;
> +	struct bpf_prog *old_prog;
>  	int err;
>  
> -	if (flags & BPF_F_ALLOW_MULTI) {
> -		if (!prog)
> -			/* to detach MULTI prog the user has to specify valid FD
> -			 * of the program to be detached
> -			 */
> -			return -EINVAL;
> -	} else {
> -		if (list_empty(progs))
> -			/* report error when trying to detach and nothing is attached */
> -			return -ENOENT;
> -	}
> +	if (prog && link)
> +		/* only one of prog or link can be specified */
> +		return -EINVAL;
>  
> -	if (flags & BPF_F_ALLOW_MULTI) {
> -		/* find the prog and detach it */
> -		list_for_each_entry(pl, progs, node) {
> -			if (pl->prog != prog)
> -				continue;
> -			old_prog = prog;
> -			/* mark it deleted, so it's ignored while
> -			 * recomputing effective
> -			 */
> -			pl->prog = NULL;
> -			break;
> -		}
> -		if (!old_prog)
> -			return -ENOENT;
> -	} else {
> -		/* to maintain backward compatibility NONE and OVERRIDE cgroups
> -		 * allow detaching with invalid FD (prog==NULL)
> -		 */
> -		pl = list_first_entry(progs, typeof(*pl), node);
> -		old_prog = pl->prog;
> -		pl->prog = NULL;
> -	}
> +	pl = find_detach_entry(progs, prog, link, flags & BPF_F_ALLOW_MULTI);
> +	if (IS_ERR(pl))
> +		return PTR_ERR(pl);
> +
> +	/* mark it deleted, so it's ignored while recomputing effective */
> +	old_prog = pl->prog;
> +	pl->prog = NULL;
> +	pl->link = NULL;
>  
>  	err = update_effective_progs(cgrp, type);
>  	if (err)
> @@ -501,14 +577,15 @@ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
>  	if (list_empty(progs))
>  		/* last program was detached, reset flags to zero */
>  		cgrp->bpf.flags[type] = 0;
> -
> -	bpf_prog_put(old_prog);
> +	if (old_prog)
> +		bpf_prog_put(old_prog);
>  	static_branch_dec(&cgroup_bpf_enabled_key);
>  	return 0;
>  
>  cleanup:
> -	/* and restore back old_prog */
> +	/* restore back prog or link */
>  	pl->prog = old_prog;
> +	pl->link = link;
>  	return err;
>  }
>  
> @@ -521,6 +598,7 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
>  	struct list_head *progs = &cgrp->bpf.progs[type];
>  	u32 flags = cgrp->bpf.flags[type];
>  	struct bpf_prog_array *effective;
> +	struct bpf_prog *prog;
>  	int cnt, ret = 0, i;
>  
>  	effective = rcu_dereference_protected(cgrp->bpf.effective[type],
> @@ -551,7 +629,8 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
>  
>  		i = 0;
>  		list_for_each_entry(pl, progs, node) {
> -			id = pl->prog->aux->id;
> +			prog = prog_list_prog(pl);
> +			id = prog->aux->id;
>  			if (copy_to_user(prog_ids + i, &id, sizeof(id)))
>  				return -EFAULT;
>  			if (++i == cnt)
> @@ -581,8 +660,8 @@ int cgroup_bpf_prog_attach(const union bpf_attr *attr,
>  		}
>  	}
>  
> -	ret = cgroup_bpf_attach(cgrp, prog, replace_prog, attr->attach_type,
> -				attr->attach_flags);
> +	ret = cgroup_bpf_attach(cgrp, prog, replace_prog, NULL,
> +				attr->attach_type, attr->attach_flags);
>  
>  	if (replace_prog)
>  		bpf_prog_put(replace_prog);
> @@ -604,7 +683,7 @@ int cgroup_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>  	if (IS_ERR(prog))
>  		prog = NULL;
>  
> -	ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type, 0);
> +	ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type);
>  	if (prog)
>  		bpf_prog_put(prog);
>  
> @@ -612,6 +691,90 @@ int cgroup_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
>  	return ret;
>  }
>  
> +static void bpf_cgroup_link_release(struct bpf_link *link)
> +{
> +	struct bpf_cgroup_link *cg_link =
> +		container_of(link, struct bpf_cgroup_link, link);
> +
> +	/* link might have been auto-detached by dying cgroup already,
> +	 * in that case our work is done here
> +	 */
> +	if (!cg_link->cgroup)
> +		return;
> +
> +	mutex_lock(&cgroup_mutex);
> +
> +	/* re-check cgroup under lock again */
> +	if (!cg_link->cgroup) {
> +		mutex_unlock(&cgroup_mutex);
> +		return;
> +	}
> +
> +	WARN_ON(__cgroup_bpf_detach(cg_link->cgroup, NULL, cg_link,
> +				    cg_link->type));
> +
> +	mutex_unlock(&cgroup_mutex);
> +	cgroup_put(cg_link->cgroup);
> +}
> +
> +static void bpf_cgroup_link_dealloc(struct bpf_link *link)
> +{
> +	struct bpf_cgroup_link *cg_link =
> +		container_of(link, struct bpf_cgroup_link, link);
> +
> +	kfree(cg_link);
> +}
> +
> +const struct bpf_link_ops bpf_cgroup_link_lops = {
> +	.release = bpf_cgroup_link_release,
> +	.dealloc = bpf_cgroup_link_dealloc,
> +};
> +
> +int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct bpf_cgroup_link *link;
> +	struct file *link_file;
> +	struct cgroup *cgrp;
> +	int err, link_fd;
> +
> +	if (attr->link_create.flags)
> +		return -EINVAL;
> +
> +	cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
> +	if (IS_ERR(cgrp))
> +		return PTR_ERR(cgrp);
> +
> +	link = kzalloc(sizeof(*link), GFP_USER);
> +	if (!link) {
> +		err = -ENOMEM;
> +		goto out_put_cgroup;
> +	}
> +	bpf_link_init(&link->link, &bpf_cgroup_link_lops, prog);
> +	link->cgroup = cgrp;
> +	link->type = attr->link_create.attach_type;
> +
> +	link_file = bpf_link_new_file(&link->link, &link_fd);
> +	if (IS_ERR(link_file)) {
> +		kfree(link);
> +		err = PTR_ERR(link_file);
> +		goto out_put_cgroup;
> +	}
> +
> +	err = cgroup_bpf_attach(cgrp, NULL, NULL, link, link->type,
> +				BPF_F_ALLOW_MULTI);
> +	if (err) {
> +		bpf_link_cleanup(&link->link, link_file, link_fd);
> +		goto out_put_cgroup;
> +	}
> +
> +	fd_install(link_fd, link_file);
> +	return link_fd;
> +
> +out_put_cgroup:
> +	cgroup_put(cgrp);
> +	return err;
> +}
> +
>  int cgroup_bpf_prog_query(const union bpf_attr *attr,
>  			  union bpf_attr __user *uattr)
>  {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index a616b63f23b4..05412b83ed6c 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2175,13 +2175,6 @@ static int bpf_obj_get(const union bpf_attr *attr)
>  				attr->file_flags);
>  }
>  
> -struct bpf_link {
> -	atomic64_t refcnt;
> -	const struct bpf_link_ops *ops;
> -	struct bpf_prog *prog;
> -	struct work_struct work;
> -};
> -
>  void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops,
>  		   struct bpf_prog *prog)
>  {
> @@ -2195,8 +2188,8 @@ void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops,
>   * anon_inode's release() call. This helper manages marking bpf_link as
>   * defunct, releases anon_inode file and puts reserved FD.
>   */
> -static void bpf_link_cleanup(struct bpf_link *link, struct file *link_file,
> -			     int link_fd)
> +void bpf_link_cleanup(struct bpf_link *link, struct file *link_file,
> +		      int link_fd)
>  {
>  	link->prog = NULL;
>  	fput(link_file);
> @@ -2266,6 +2259,10 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
>  		link_type = "raw_tracepoint";
>  	else if (link->ops == &bpf_tracing_link_lops)
>  		link_type = "tracing";
> +#ifdef CONFIG_CGROUP_BPF
> +	else if (link->ops == &bpf_cgroup_link_lops)
> +		link_type = "cgroup";
> +#endif
>  	else
>  		link_type = "unknown";
>  
> @@ -3553,6 +3550,49 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
>  	return err;
>  }
>  
> +#define BPF_LINK_CREATE_LAST_FIELD link_create.flags
> +static int link_create(union bpf_attr *attr)
> +{

From what I see this function does not check any capability, whereas the
existing bpf_prog_attach() checks for CAP_NET_ADMIN.

This is a pretty important difference, but I don't see it clarified in the
commit message or discussed (or did I miss it?).

Having a way to attach cgroup bpf prog by non-priv users is actually
helpful in some use-cases, e.g. systemd required patching in the past to
make it work with user (non-priv) sessions, see [0].

But in other cases it's also useful to limit the ability to attach
programs to a cgroup while using bpf_link so that only the thing that
controls cgroup setup can attach but not any non-priv process running in
that cgroup. How is this use-case covered in BPF_LINK_CREATE?


[0] https://github.com/systemd/systemd/pull/12745

> +	enum bpf_prog_type ptype;
> +	struct bpf_prog *prog;
> +	int ret;
> +
> +	if (CHECK_ATTR(BPF_LINK_CREATE))
> +		return -EINVAL;
> +
> +	ptype = attach_type_to_prog_type(attr->link_create.attach_type);
> +	if (ptype == BPF_PROG_TYPE_UNSPEC)
> +		return -EINVAL;
> +
> +	prog = bpf_prog_get_type(attr->link_create.prog_fd, ptype);
> +	if (IS_ERR(prog))
> +		return PTR_ERR(prog);
> +
> +	ret = bpf_prog_attach_check_attach_type(prog,
> +						attr->link_create.attach_type);
> +	if (ret)
> +		goto err_out;
> +
> +	switch (ptype) {
> +	case BPF_PROG_TYPE_CGROUP_SKB:
> +	case BPF_PROG_TYPE_CGROUP_SOCK:
> +	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> +	case BPF_PROG_TYPE_SOCK_OPS:
> +	case BPF_PROG_TYPE_CGROUP_DEVICE:
> +	case BPF_PROG_TYPE_CGROUP_SYSCTL:
> +	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> +		ret = cgroup_bpf_link_attach(attr, prog);
> +		break;
> +	default:
> +		ret = -EINVAL;
> +	}
> +
> +err_out:
> +	if (ret < 0)
> +		bpf_prog_put(prog);
> +	return ret;
> +}
> +
>  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
>  {
>  	union bpf_attr attr = {};
> @@ -3663,6 +3703,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>  	case BPF_MAP_DELETE_BATCH:
>  		err = bpf_map_do_batch(&attr, uattr, BPF_MAP_DELETE_BATCH);
>  		break;
> +	case BPF_LINK_CREATE:
> +		err = link_create(&attr);
> +		break;
>  	default:
>  		err = -EINVAL;
>  		break;
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 3dead0416b91..219624fba9ba 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6303,27 +6303,31 @@ void cgroup_sk_free(struct sock_cgroup_data *skcd)
>  #endif	/* CONFIG_SOCK_CGROUP_DATA */
>  
>  #ifdef CONFIG_CGROUP_BPF
> -int cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog,
> -		      struct bpf_prog *replace_prog, enum bpf_attach_type type,
> +int cgroup_bpf_attach(struct cgroup *cgrp,
> +		      struct bpf_prog *prog, struct bpf_prog *replace_prog,
> +		      struct bpf_cgroup_link *link,
> +		      enum bpf_attach_type type,
>  		      u32 flags)
>  {
>  	int ret;
>  
>  	mutex_lock(&cgroup_mutex);
> -	ret = __cgroup_bpf_attach(cgrp, prog, replace_prog, type, flags);
> +	ret = __cgroup_bpf_attach(cgrp, prog, replace_prog, link, type, flags);
>  	mutex_unlock(&cgroup_mutex);
>  	return ret;
>  }
> +
>  int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
> -		      enum bpf_attach_type type, u32 flags)
> +		      enum bpf_attach_type type)
>  {
>  	int ret;
>  
>  	mutex_lock(&cgroup_mutex);
> -	ret = __cgroup_bpf_detach(cgrp, prog, type);
> +	ret = __cgroup_bpf_detach(cgrp, prog, NULL, type);
>  	mutex_unlock(&cgroup_mutex);
>  	return ret;
>  }
> +
>  int cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr,
>  		     union bpf_attr __user *uattr)
>  {
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index f1fbc36f58d3..8b3f1c098ac0 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -111,6 +111,7 @@ enum bpf_cmd {
>  	BPF_MAP_LOOKUP_AND_DELETE_BATCH,
>  	BPF_MAP_UPDATE_BATCH,
>  	BPF_MAP_DELETE_BATCH,
> +	BPF_LINK_CREATE,
>  };
>  
>  enum bpf_map_type {
> @@ -541,7 +542,7 @@ union bpf_attr {
>  		__u32		prog_cnt;
>  	} query;
>  
> -	struct {
> +	struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
>  		__u64 name;
>  		__u32 prog_fd;
>  	} raw_tracepoint;
> @@ -569,6 +570,13 @@ union bpf_attr {
>  		__u64		probe_offset;	/* output: probe_offset */
>  		__u64		probe_addr;	/* output: probe_addr */
>  	} task_fd_query;
> +
> +	struct { /* struct used by BPF_LINK_CREATE command */
> +		__u32		prog_fd;	/* eBPF program to attach */
> +		__u32		target_fd;	/* object to attach to */
> +		__u32		attach_type;	/* attach type */
> +		__u32		flags;		/* extra flags */
> +	} link_create;
>  } __attribute__((aligned(8)));
>  
>  /* The description below is an attempt at providing documentation to eBPF
> -- 
> 2.17.1
> 

-- 
Andrey Ignatov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-30 23:43         ` David Ahern
@ 2020-03-31  0:32           ` Alexei Starovoitov
  2020-03-31  0:57             ` David Ahern
  0 siblings, 1 reply; 29+ messages in thread
From: Alexei Starovoitov @ 2020-03-31  0:32 UTC (permalink / raw)
  To: David Ahern
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On Mon, Mar 30, 2020 at 05:43:52PM -0600, David Ahern wrote:
> On 3/30/20 4:50 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 30, 2020 at 1:46 PM David Ahern <dsahern@gmail.com> wrote:
> >> release. As it stands it is a half-baked feature.
> > 
> > speaking of half-baked.
> > I think as it stands (even without link_query) it's already extremely
> > useful addition and doesn't take anything away from existing cgroup-bpf
> > and doesn't hinder observability. 'bpftool cgroup' works just fine.
> > So I've applied the set.
> > 
> > Even if it were half-baked it would still be apply-able.
> > Many features are developed over the course of multiple
> > kernel releases. Example: your nexthops, mptcp, bpf-lsm.
> > 
> 
> nexthops were not - refactoring in 1 release and the entire feature went
> in to 5.4. Large features / patch sets often must be spread across
> kernel versions because it is not humanly possible to send and review
> the patches.
> 
> This is not a large feature, and there is no reason for CREATE/UPDATE -
> a mere 4 patch set - to go in without something as essential as the
> QUERY for observability.

As I said 'bpftool cgroup' covers it. Observability is not reduced in any way.
Also there is Documentation/bpf/drgn.rst
Kernel is an open book. Just learn this simple tool and everything will
be observable. Not only bpf objects, but the rest of the kernel too.
imo the use case for LINK_QUERY makes sense for xdp only.
There is one program there and it's a dispatcher program that libdispatcher
will be iteratively updating via freplace. For cgroup bpf_links I cannot
think of a reason why apps would need to query it.


* Re: [PATCH v3 bpf-next 1/4] bpf: implement bpf_link-based cgroup BPF program attachment
  2020-03-31  0:05   ` Andrey Ignatov
@ 2020-03-31  0:38     ` Alexei Starovoitov
  2020-03-31  1:06       ` Andrii Nakryiko
  0 siblings, 1 reply; 29+ messages in thread
From: Alexei Starovoitov @ 2020-03-31  0:38 UTC (permalink / raw)
  To: Andrey Ignatov
  Cc: Andrii Nakryiko, bpf, netdev, ast, daniel, andrii.nakryiko, kernel-team

On Mon, Mar 30, 2020 at 05:05:13PM -0700, Andrey Ignatov wrote:
> >  
> > +#define BPF_LINK_CREATE_LAST_FIELD link_create.flags
> > +static int link_create(union bpf_attr *attr)
> > +{
> 
> From what I see this function does not check any capability, whereas the
> existing bpf_prog_attach() checks for CAP_NET_ADMIN.

Great catch! It's a bug.
I fixed it up.

> This is a pretty important difference, but I don't see it clarified in the
> commit message or discussed (or did I miss it?).
> 
> Having a way to attach cgroup bpf prog by non-priv users is actually
> helpful in some use-cases, e.g. systemd required patching in the past to
> make it work with user (non-priv) sessions, see [0].
> 
> But in other cases it's also useful to limit the ability to attach
> programs to a cgroup while using bpf_link so that only the thing that
> controls cgroup setup can attach but not any non-priv process running in
> that cgroup. How is this use-case covered in BPF_LINK_CREATE?
> 
> 
> [0] https://github.com/systemd/systemd/pull/12745

yeah. we need to resurrect the discussion around CAP_BPF.

PS
pls trim your replies.


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-31  0:32           ` Alexei Starovoitov
@ 2020-03-31  0:57             ` David Ahern
  2020-03-31  1:17               ` Alexei Starovoitov
  2020-03-31  3:54               ` Andrii Nakryiko
  0 siblings, 2 replies; 29+ messages in thread
From: David Ahern @ 2020-03-31  0:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
>>
>> This is not a large feature, and there is no reason for CREATE/UPDATE -
>> a mere 4 patch set - to go in without something as essential as the
>> QUERY for observability.
> 
> As I said 'bpftool cgroup' covers it. Observability is not reduced in any way.

You want a feature where a process can prevent another from installing a
program on a cgroup. How do I learn which process is holding the
bpf_link reference and preventing me from installing a program? Unless I
have missed some recent change that is not currently covered by bpftool
cgroup, and there is no way reading kernel code will tell me.

###
To quote Lorenz from an earlier response:

"However, this behaviour concerns me. It's like Windows not
letting you delete a file while an application has it opened, which just
leads to randomly killing programs until you find the right one. It's
frustrating and counter productive.

You're taking power away from the operator. In your deployment scenario
this might make sense, but I think it's a really bad model in general.
If I am privileged I need to be able to exercise that privilege."
###

That is my point. You are restricting what root can do and people will
not want to resort to killing random processes trying to find the one
holding a reference. This is an essential missing piece and should go in
at the same time as this set.


* Re: [PATCH v3 bpf-next 1/4] bpf: implement bpf_link-based cgroup BPF program attachment
  2020-03-31  0:38     ` Alexei Starovoitov
@ 2020-03-31  1:06       ` Andrii Nakryiko
  0 siblings, 0 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2020-03-31  1:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrey Ignatov, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team

On Mon, Mar 30, 2020 at 5:38 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Mar 30, 2020 at 05:05:13PM -0700, Andrey Ignatov wrote:
> > >
> > > +#define BPF_LINK_CREATE_LAST_FIELD link_create.flags
> > > +static int link_create(union bpf_attr *attr)
> > > +{
> >
> > From what I see this function does not check any capability, whereas the
> > existing bpf_prog_attach() checks for CAP_NET_ADMIN.
>
> Great catch! It's a bug.
> I fixed it up.

Thanks!

>
> > This is a pretty important difference, but I don't see it clarified in the
> > commit message or discussed (or did I miss it?).

Yeah, not intentional, thanks for catching!

> >
> > Having a way to attach cgroup bpf prog by non-priv users is actually
> > helpful in some use-cases, e.g. systemd required patching in the past to
> > make it work with user (non-priv) sessions, see [0].
> >
> > But in other cases it's also useful to limit the ability to attach
> > programs to a cgroup while using bpf_link so that only the thing that
> > controls cgroup setup can attach but not any non-priv process running in
> > that cgroup. How is this use-case covered in BPF_LINK_CREATE?
> >
> >
> > [0] https://github.com/systemd/systemd/pull/12745
>
> yeah. we need to resurrect the discussion around CAP_BPF.
>
> PS
> pls trim your replies.


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-31  0:57             ` David Ahern
@ 2020-03-31  1:17               ` Alexei Starovoitov
  2020-04-01  1:42                 ` David Ahern
  2020-03-31  3:54               ` Andrii Nakryiko
  1 sibling, 1 reply; 29+ messages in thread
From: Alexei Starovoitov @ 2020-03-31  1:17 UTC (permalink / raw)
  To: David Ahern
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On Mon, Mar 30, 2020 at 06:57:44PM -0600, David Ahern wrote:
> On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
> >>
> >> This is not a large feature, and there is no reason for CREATE/UPDATE -
> >> a mere 4 patch set - to go in without something as essential as the
> >> QUERY for observability.
> > 
> > As I said 'bpftool cgroup' covers it. Observability is not reduced in any way.
> 
> You want a feature where a process can prevent another from installing a
> program on a cgroup. How do I learn which process is holding the
> bpf_link reference and preventing me from installing a program? Unless I
> have missed some recent change that is not currently covered by bpftool
> cgroup, and there is no way reading kernel code will tell me.

No. That's not the case at all. You misunderstood the concept.

> That is my point. You are restricting what root can do and people will
> not want to resort to killing random processes trying to find the one
> holding a reference. 

Not true either.
bpf_link = old attach with allow_multi (but with extra safety for owner)
The only thing bpf_link protects is the owner of the link from other
processes nuking that link.
It does _not_ prevent other processes attaching their own cgroup-bpf progs
either via old interface or via bpf_link.

It will be different for xdp where only one prog is allowed per xdp hook.
There it will prevent other xdp progs. And there link_query and
"human override" will be necessary.


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-31  0:57             ` David Ahern
  2020-03-31  1:17               ` Alexei Starovoitov
@ 2020-03-31  3:54               ` Andrii Nakryiko
  2020-03-31 16:54                 ` David Ahern
  2020-03-31 21:51                 ` Edward Cree
  1 sibling, 2 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2020-03-31  3:54 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On Mon, Mar 30, 2020 at 5:57 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
> >>
> >> This is not a large feature, and there is no reason for CREATE/UPDATE -
> >> a mere 4 patch set - to go in without something as essential as the
> >> QUERY for observability.
> >
> > As I said 'bpftool cgroup' covers it. Observability is not reduced in any way.
>
> You want a feature where a process can prevent another from installing a
> program on a cgroup. How do I learn which process is holding the
> bpf_link reference and preventing me from installing a program? Unless I
> have missed some recent change that is not currently covered by bpftool
> cgroup, and there is no way reading kernel code will tell me.
>
> ###
> To quote Lorenz from an earlier response:
>
> "However, this behaviour concerns me. It's like Windows not
> letting you delete a file while an application has it opened, which just
> leads to randomly killing programs until you find the right one. It's
> frustrating and counter productive.
>
> You're taking power away from the operator. In your deployment scenario
> this might make sense, but I think it's a really bad model in general.
> If I am privileged I need to be able to exercise that privilege."
> ###
>
> That is my point. You are restricting what root can do and people will
> not want to resort to killing random processes trying to find the one
> holding a reference. This is an essential missing piece and should go in
> at the same time as this set.

No need to kill random processes, you can kill only those that hold
bpf_link FD. You can find them using drgn tool with script like [0].
It will give you quite a lot of information already, though it should
also find pinned bpf_links; I haven't added that yet.

Found total 11 bpf_links.
-------------------------------------------------
type: tracing
prog: 'test1' id:223 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test2' id:224 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test3' id:225 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test4' id:226 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test5' id:227 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: tracing
prog: 'test6' id:228 type:BPF_PROG_TYPE_TRACING
pids: 449027
-------------------------------------------------
type: raw_tp
prog: '' id:237 type:BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
tp: bpf_test_finish
pids: 449462
-------------------------------------------------
type: cgroup
prog: 'egress' id:242 type:BPF_PROG_TYPE_CGROUP_SKB
attach: BPF_CGROUP_INET_EGRESS
cgroup: /cgroup-test-work-dir/cg1
pids: 449881
-------------------------------------------------
type: cgroup
prog: 'egress' id:242 type:BPF_PROG_TYPE_CGROUP_SKB
attach: BPF_CGROUP_INET_EGRESS
cgroup: /cgroup-test-work-dir/cg1/cg2
pids: 449881
-------------------------------------------------
type: cgroup
prog: 'egress' id:242 type:BPF_PROG_TYPE_CGROUP_SKB
attach: BPF_CGROUP_INET_EGRESS
cgroup: /cgroup-test-work-dir/cg1/cg2/cg3
pids: 449881
-------------------------------------------------
type: cgroup
prog: 'egress' id:242 type:BPF_PROG_TYPE_CGROUP_SKB
attach: BPF_CGROUP_INET_EGRESS
cgroup: /cgroup-test-work-dir/cg1/cg2/cg3/cg4
pids: 449881
-------------------------------------------------


   [0] https://gist.github.com/anakryiko/562dff8e39c619a5ee247bb55aa057c7


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-31  3:54               ` Andrii Nakryiko
@ 2020-03-31 16:54                 ` David Ahern
  2020-03-31 17:03                   ` Andrii Nakryiko
  2020-03-31 17:11                   ` Alexei Starovoitov
  2020-03-31 21:51                 ` Edward Cree
  1 sibling, 2 replies; 29+ messages in thread
From: David Ahern @ 2020-03-31 16:54 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On 3/30/20 9:54 PM, Andrii Nakryiko wrote:
> 
>    [0] https://gist.github.com/anakryiko/562dff8e39c619a5ee247bb55aa057c7

#!/usr/bin/env drgn

where do I find drgn? Google search is not turning up anything relevant.


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-31 16:54                 ` David Ahern
@ 2020-03-31 17:03                   ` Andrii Nakryiko
  2020-03-31 17:11                   ` Alexei Starovoitov
  1 sibling, 0 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2020-03-31 17:03 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On Tue, Mar 31, 2020 at 9:54 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/30/20 9:54 PM, Andrii Nakryiko wrote:
> >
> >    [0] https://gist.github.com/anakryiko/562dff8e39c619a5ee247bb55aa057c7
>
> #!/usr/bin/env drgn
>
> where do I find drgn? Google search is not turning up anything relevant.

Right, sorry, there were announcements and discussions about drgn on
the mailing list before, so I wrongly assumed people were aware. It's here:

https://github.com/osandov/drgn

There is another BPF-related tool that shows freplace'd BPF
programs attached to the original program (I stole some parts of it
for my script, thanks Andrey Ignatov!):

https://github.com/osandov/drgn/blob/master/tools/bpf_inspect.py


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-31 16:54                 ` David Ahern
  2020-03-31 17:03                   ` Andrii Nakryiko
@ 2020-03-31 17:11                   ` Alexei Starovoitov
  1 sibling, 0 replies; 29+ messages in thread
From: Alexei Starovoitov @ 2020-03-31 17:11 UTC (permalink / raw)
  To: David Ahern
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On Tue, Mar 31, 2020 at 9:54 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/30/20 9:54 PM, Andrii Nakryiko wrote:
> >
> >    [0] https://gist.github.com/anakryiko/562dff8e39c619a5ee247bb55aa057c7
>
> #!/usr/bin/env drgn
>
> where do I find drgn? Google search is not turning up anything relevant.

It's the link I mentioned in this thread just a few emails ago:
"Also there is Documentation/bpf/drgn.rst"


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-31  3:54               ` Andrii Nakryiko
  2020-03-31 16:54                 ` David Ahern
@ 2020-03-31 21:51                 ` Edward Cree
  2020-03-31 22:44                   ` David Ahern
  2020-04-01  0:22                   ` Andrii Nakryiko
  1 sibling, 2 replies; 29+ messages in thread
From: Edward Cree @ 2020-03-31 21:51 UTC (permalink / raw)
  To: Andrii Nakryiko, David Ahern
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On 31/03/2020 04:54, Andrii Nakryiko wrote:
> No need to kill random processes, you can kill only those that hold
> bpf_link FD. You can find them using drgn tool with script like [0].
For the record, I find the argument "we don't need a query feature,
 because you can just use a kernel debugger" *utterly* *horrifying*.
Now, it seems to be moot, because Alexei has given other, better
 reasons why query doesn't need to land yet; but can we please not
 ever treat debugging interfaces as a substitute for proper APIs?

</scream>
-ed


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-31 21:51                 ` Edward Cree
@ 2020-03-31 22:44                   ` David Ahern
  2020-04-01  0:45                     ` Andrii Nakryiko
  2020-04-01  0:22                   ` Andrii Nakryiko
  1 sibling, 1 reply; 29+ messages in thread
From: David Ahern @ 2020-03-31 22:44 UTC (permalink / raw)
  To: Edward Cree, Andrii Nakryiko
  Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On 3/31/20 3:51 PM, Edward Cree wrote:
> On 31/03/2020 04:54, Andrii Nakryiko wrote:
>> No need to kill random processes, you can kill only those that hold
>> bpf_link FD. You can find them using drgn tool with script like [0].
> For the record, I find the argument "we don't need a query feature,
>  because you can just use a kernel debugger" *utterly* *horrifying*.
> Now, it seems to be moot, because Alexei has given other, better
>  reasons why query doesn't need to land yet; but can we please not
>  ever treat debugging interfaces as a substitute for proper APIs?
> 
> </scream>
> -ed
> 

I was just about to send the same sentiment. Dev packages and processing
/proc/kcore are not a proper observability API for production systems.


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-31 21:51                 ` Edward Cree
  2020-03-31 22:44                   ` David Ahern
@ 2020-04-01  0:22                   ` Andrii Nakryiko
  2020-04-01 14:26                     ` Edward Cree
  1 sibling, 1 reply; 29+ messages in thread
From: Andrii Nakryiko @ 2020-04-01  0:22 UTC (permalink / raw)
  To: Edward Cree
  Cc: David Ahern, Alexei Starovoitov, Andrii Nakryiko, bpf,
	Networking, Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov,
	Kernel Team

On Tue, Mar 31, 2020 at 2:52 PM Edward Cree <ecree@solarflare.com> wrote:
>
> On 31/03/2020 04:54, Andrii Nakryiko wrote:
> > No need to kill random processes, you can kill only those that hold
> > bpf_link FD. You can find them using drgn tool with script like [0].
> For the record, I find the argument "we don't need a query feature,
>  because you can just use a kernel debugger" *utterly* *horrifying*.
> Now, it seems to be moot, because Alexei has given other, better
>  reasons why query doesn't need to land yet; but can we please not
>  ever treat debugging interfaces as a substitute for proper APIs?

Can you please point out where I was objecting to an observability API
(i.e., the LINK_QUERY thing we've discussed, which I didn't oppose and
which I'm going to add next)?

What I'm doubtful of is this "human override" functionality. I think a
tool that shows who's using a given bpf_link (processes and pinned
files in BPF FS) is way more useful, because it allows you to both
"unblock" the BPF hook (by killing "bad" processes and removing pinned
bpf_link files) and learn which processes (read: applications) are
misbehaving.

I'll address the drgn-vs-not concern in my reply to David Ahern, who's
also *utterly horrified*, apparently, so I'll try to calm him as well. ;)

>
> </scream>
> -ed


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-31 22:44                   ` David Ahern
@ 2020-04-01  0:45                     ` Andrii Nakryiko
  0 siblings, 0 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2020-04-01  0:45 UTC (permalink / raw)
  To: David Ahern
  Cc: Edward Cree, Alexei Starovoitov, Andrii Nakryiko, bpf,
	Networking, Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov,
	Kernel Team

On Tue, Mar 31, 2020 at 3:44 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/31/20 3:51 PM, Edward Cree wrote:
> > On 31/03/2020 04:54, Andrii Nakryiko wrote:
> >> No need to kill random processes, you can kill only those that hold
> >> bpf_link FD. You can find them using drgn tool with script like [0].
> > For the record, I find the argument "we don't need a query feature,
> >  because you can just use a kernel debugger" *utterly* *horrifying*.
> > Now, it seems to be moot, because Alexei has given other, better
> >  reasons why query doesn't need to land yet; but can we please not
> >  ever treat debugging interfaces as a substitute for proper APIs?
> >
> > </scream>
> > -ed
> >
>
> just about to send the same intent. Dev packages and processing
> /proc/kcore is not a proper observability API for production systems.

I'm not against observability. LINK_QUERY is going to be added. I'm
also looking into making bpf_link a "lookup-able by ID" object,
similar to bpf_map and bpf_prog, which will make it easy to say
"show me all the BPF attachments in the system", something that is
impossible to do right now, btw.

As for drgn and /proc/kcore: drgn is an awesome tool for lots of
inner-kernel observability work that is impractical to expose through
stable APIs. But you don't have to use it to get the same effect. The
problem that script is solving is showing all the processes that have
an open FD to bpf_link files. This is the same problem the fuser
command solves for normal files, and the solution is similar: fuser
seems to do it by iterating over all processes and their FDs in
procfs. Not the most efficient way, but it works. Here's what you can
already get for a cgroup bpf_link file with my last patch set:

# cat /proc/1366584/fdinfo/14
pos:    0
flags:  02000000
mnt_id: 14
link_type:      cgroup
prog_tag:       9ad187367cf2b9e8
prog_id:        1649


We can extend that information further with relevant details. This is
a good, bigger discussion for the LINK_QUERY API as well, given that it
and fdinfo might be treated as two ways to get the same information.
This is one reason I didn't do it for cgroup bpf_link; there are
already enough related discussions to keep us all involved for more
than a week now.

But it would be nice to start discussing and figuring out these
relevant details, instead of being horrified and terrified, and
spreading FUD. Or inventing ways to violate the good properties of
bpf_link (e.g., by forceful nuking) due to theoretical worries about
the need to detach a bpf_link without finding the application or
pinned file that holds it. As Alexei mentioned, what's there already
(raw_tp, tracing, and now cgroup bpf_links) is no worse than what we
had before. By the time we get to XDP bpf_link, we'll have even more
observability capabilities.


* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-03-31  1:17               ` Alexei Starovoitov
@ 2020-04-01  1:42                 ` David Ahern
  2020-04-01  2:04                   ` Alexei Starovoitov
  0 siblings, 1 reply; 29+ messages in thread
From: David Ahern @ 2020-04-01  1:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On 3/30/20 7:17 PM, Alexei Starovoitov wrote:
> On Mon, Mar 30, 2020 at 06:57:44PM -0600, David Ahern wrote:
>> On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
>>>>
>>>> This is not a large feature, and there is no reason for CREATE/UPDATE -
>>>> a mere 4 patch set - to go in without something as essential as the
>>>> QUERY for observability.
>>>
>>> As I said 'bpftool cgroup' covers it. Observability is not reduced in any way.
>>
>> You want a feature where a process can prevent another from installing a
>> program on a cgroup. How do I learn which process is holding the
>> bpf_link reference and preventing me from installing a program? Unless I
>> have missed some recent change that is not currently covered by bpftool
>> cgroup, and there is no way reading kernel code will tell me.
> 
> No. That's not the case at all. You misunderstood the concept.

I don't think so ...

> 
>> That is my point. You are restricting what root can do and people will
>> not want to resort to killing random processes trying to find the one
>> holding a reference. 
> 
> Not true either.
> bpf_link = old attach with allow_multi (but with extra safety for owner)

cgroup programs existed for roughly 1 year before BPF_F_ALLOW_MULTI.
That's a year for tools like 'ip vrf exec' to exist and be relied on.
'ip vrf exec' does not use MULTI.

I have not done a deep dive on systemd code, but on ubuntu 18.04 system:

$ sudo ~/bin/bpftool cgroup tree
CgroupPath
ID       AttachType      AttachFlags     Name
/sys/fs/cgroup/unified/system.slice/systemd-udevd.service
    5        ingress
    4        egress
/sys/fs/cgroup/unified/system.slice/systemd-journald.service
    3        ingress
    2        egress
/sys/fs/cgroup/unified/system.slice/systemd-logind.service
    7        ingress
    6        egress

suggests that MULTI is not (or at least was not, at some point) common with
systemd either, so 'ip vrf exec' is not alone in not using the flag. There
are most likely many other such tools.
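The AttachFlags column in the output above is populated via the kernel's
BPF_PROG_QUERY command, which is what bpftool uses under the hood. As a rough
sketch (not from the thread; `cgroup_query` is a hypothetical helper name),
querying a cgroup FD for attached program IDs and flags looks like this:

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Query which programs (and which attach flags) are in effect at a
 * cgroup for a given attach type, as 'bpftool cgroup tree' does. */
static int cgroup_query(int cgroup_fd, enum bpf_attach_type type,
			__u32 *flags, __u32 *prog_ids, __u32 *prog_cnt)
{
	union bpf_attr attr;
	int err;

	memset(&attr, 0, sizeof(attr));
	attr.query.target_fd = cgroup_fd;
	attr.query.attach_type = type;
	attr.query.prog_ids = (__u64)(unsigned long)prog_ids;
	attr.query.prog_cnt = *prog_cnt; /* in: array capacity */

	err = syscall(__NR_bpf, BPF_PROG_QUERY, &attr, sizeof(attr));
	if (!err) {
		*flags = attr.query.attach_flags; /* e.g. BPF_F_ALLOW_MULTI */
		*prog_cnt = attr.query.prog_cnt;  /* out: number attached */
	}
	return err;
}
```

An empty AttachFlags value in bpftool's output thus means the program was
attached without ALLOW_MULTI or ALLOW_OVERRIDE.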


> The only thing bpf_link protects is the owner of the link: it prevents
> other processes from nuking that link.
> It does _not_ prevent other processes from attaching their own cgroup-bpf
> progs, either via the old interface or via bpf_link.
> 

It does when that older code does not use the MULTI flag. There is a
history that is going to create conflicts, and being able to identify
which process holds the bpf_link is essential.

And this is really just one use case. There are many other reasons for
wanting to know what process is holding a reference to something.
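The override hazard being debated here comes from the legacy attach API's flag
semantics. As a sketch (hypothetical FDs, not a real attachment; a raw-syscall
equivalent of libbpf's bpf_prog_attach()): with attach_flags == 0 a new attach
roughly replaces the program currently attached at that cgroup, which is the
accidental-nuke scenario, while BPF_F_ALLOW_MULTI lets programs co-exist and
all of them run:

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Minimal wrapper around the bpf(2) BPF_PROG_ATTACH command. */
static int bpf_prog_attach_raw(int prog_fd, int cgroup_fd,
			       enum bpf_attach_type type, __u32 flags)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.target_fd = cgroup_fd;   /* cgroup directory FD */
	attr.attach_bpf_fd = prog_fd; /* loaded BPF program FD */
	attr.attach_type = type;
	attr.attach_flags = flags;    /* 0 = exclusive; BPF_F_ALLOW_MULTI = co-exist */

	return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}
```

A tool that passes flags == 0 (as 'ip vrf exec' reportedly does) therefore
interacts badly with any other program already attached at the same cgroup.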

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-04-01  1:42                 ` David Ahern
@ 2020-04-01  2:04                   ` Alexei Starovoitov
  0 siblings, 0 replies; 29+ messages in thread
From: Alexei Starovoitov @ 2020-04-01  2:04 UTC (permalink / raw)
  To: David Ahern
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov, Kernel Team

On Tue, Mar 31, 2020 at 07:42:46PM -0600, David Ahern wrote:
> On 3/30/20 7:17 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 30, 2020 at 06:57:44PM -0600, David Ahern wrote:
> >> On 3/30/20 6:32 PM, Alexei Starovoitov wrote:
> >>>>
> >>>> This is not a large feature, and there is no reason for CREATE/UPDATE -
> >>>> a mere 4 patch set - to go in without something as essential as the
> >>>> QUERY for observability.
> >>>
> >>> As I said 'bpftool cgroup' covers it. Observability is not reduced in any way.
> >>
> >> You want a feature where a process can prevent another from installing a
> >> program on a cgroup. How do I learn which process is holding the
> >> bpf_link reference and preventing me from installing a program? Unless I
> >> have missed some recent change, that is not currently covered by bpftool
> >> cgroup, and there is no way reading kernel code will tell me.
> > 
> > No. That's not the case at all. You misunderstood the concept.
> 
> I don't think so ...
> 
> > 
> >> That is my point. You are restricting what root can do and people will
> >> not want to resort to killing random processes trying to find the one
> >> holding a reference. 
> > 
> > Not true either.
> > bpf_link = old attach with allow_multi (but with extra safety for owner)
> 
> cgroup programs existed for roughly 1 year before BPF_F_ALLOW_MULTI.
> That's a year for tools like 'ip vrf exec' to exist and be relied on.
> 'ip vrf exec' does not use MULTI.
> 
> I have not done a deep dive on systemd code, but on an Ubuntu 18.04 system:
> 
> $ sudo ~/bin/bpftool cgroup tree
> CgroupPath
> ID       AttachType      AttachFlags     Name
> /sys/fs/cgroup/unified/system.slice/systemd-udevd.service
>     5        ingress
>     4        egress
> /sys/fs/cgroup/unified/system.slice/systemd-journald.service
>     3        ingress
>     2        egress
> /sys/fs/cgroup/unified/system.slice/systemd-logind.service
>     7        ingress
>     6        egress
> 
> suggests that MULTI is not (or at least was not, at some point) common with
> systemd either, so 'ip vrf exec' is not alone in not using the flag. There
> are most likely many other such tools.

Please take a look at systemd source code:
src/core/bpf-devices.c
src/core/bpf-firewall.c
It prefers to use BPF_F_ALLOW_MULTI when possible, since it's the most
sensible flag.
Since 'ip vrf exec' is not using allow_multi, it's breaking several systemd
features (regardless of what bpf_link can and cannot do).

> > The only thing bpf_link protects is the owner of the link: it prevents
> > other processes from nuking that link.
> > It does _not_ prevent other processes from attaching their own cgroup-bpf
> > progs, either via the old interface or via bpf_link.
> > 
> 
> It does when that older code does not use the MULTI flag. There is a
> history that is going to create conflicts, and being able to identify
> which process holds the bpf_link is essential.
> 
> And this is really just one use case. There are many other reasons for
> wanting to know what process is holding a reference to something.

I'm not disagreeing that it's useful to query what is attached where. My point,
once again, is that bpf_link for cgroup didn't change a single bit in this
logic. There are processes (like systemd) that are using allow_multi. When they
switch to bpf_link a few years from now, nothing will change for all other
processes in the system. Only systemd will be assured that its bpf-device prog
will not be accidentally removed by 'ip vrf'. Currently nothing protects
systemd's bpf progs. Any cap_net_admin process can _accidentally_ nuke them.
It's even weirder that the bpf-cgroup-device prog systemd is using is gated by
cap_net_admin. There is nothing networking about it. But that's a separate
discussion.

Maybe you should fix 'ip vrf' first, before systemd folks start yelling, and
then we can continue arguing about the merits of observability?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-04-01  0:22                   ` Andrii Nakryiko
@ 2020-04-01 14:26                     ` Edward Cree
  2020-04-01 17:39                       ` Andrii Nakryiko
  0 siblings, 1 reply; 29+ messages in thread
From: Edward Cree @ 2020-04-01 14:26 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: David Ahern, Alexei Starovoitov, Andrii Nakryiko, bpf,
	Networking, Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov,
	Kernel Team

On 01/04/2020 01:22, Andrii Nakryiko wrote:
> Can you please point out where I was objecting to observability API
> (which is LINK_QUERY thing we've discussed and I didn't oppose, and
> I'm going to add next)?
I didn't say you objected to it.
I just said that you argued that it was OK for it to not land in the
 same release as the rest of the API, because drgn could paper over
 that gap.  Which seems to me to signify a dangerous way of thinking,
 and I wanted to raise the alarm about that.
(If you _don't_ see what's wrong with that argument... well, that'd
 be even _more_ alarming.  Debuggers — and fuser, for that matter —
 are for when things go wrong _in ways the designers of the system
 failed to anticipate_.  They should not be part of a 'normal' work-
 flow for dealing with problems that we already _know_ are possible;
 it's kinda-sorta like how exceptions shouldn't be used for non-
 exceptional situations.)

-ed

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link
  2020-04-01 14:26                     ` Edward Cree
@ 2020-04-01 17:39                       ` Andrii Nakryiko
  0 siblings, 0 replies; 29+ messages in thread
From: Andrii Nakryiko @ 2020-04-01 17:39 UTC (permalink / raw)
  To: Edward Cree
  Cc: David Ahern, Alexei Starovoitov, Andrii Nakryiko, bpf,
	Networking, Alexei Starovoitov, Daniel Borkmann, Andrey Ignatov,
	Kernel Team

On Wed, Apr 1, 2020 at 7:26 AM Edward Cree <ecree@solarflare.com> wrote:
>
> On 01/04/2020 01:22, Andrii Nakryiko wrote:
> > Can you please point out where I was objecting to observability API
> > (which is LINK_QUERY thing we've discussed and I didn't oppose, and
> > I'm going to add next)?
> I didn't say you objected to it.
> I just said that you argued that it was OK for it to not land in the
>  same release as the rest of the API, because drgn could paper over
>  that gap.  Which seems to me to signify a dangerous way of thinking,
>  and I wanted to raise the alarm about that.

Let's have a bit more context first. BTW, please don't trim the chain of
replies (at least not so aggressively) next time.

>>> That is my point. You are restricting what root can do and people will
>>> not want to resort to killing random processes trying to find the one
>>> holding a reference. This is an essential missing piece and should go in
>>> at the same time as this set.
>> No need to kill random processes, you can kill only those that hold
>> bpf_link FD. You can find them using drgn tool with script like [0].
> For the record, I find the argument "we don't need a query feature,
> because you can just use a kernel debugger" *utterly* *horrifying*.

I was addressing the concern of having to randomly kill processes,
which is a "human override" thing, not really an observability one.
And my response is exactly that it's better to be able to see all
owners of a bpf_link than to have a "nuke" option. It so happens
that drgn is a very useful tool for this, and I was able to prototype
such a script quickly enough to share it with others. drgn is not the
only option; you can do the same by iterating procfs and using fdinfo. It
can certainly be improved for the bpf_link-specific case by providing more
information (like the cgroup path, etc.). But in general, it's a question
of "which processes use a given file", which, unfortunately, I don't
think Linux has better observability APIs for. I'm not happy about
that, but it's not really a bpf_link-specific issue either.
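The procfs approach mentioned above can be sketched roughly as follows (an
illustration, not the drgn script referenced in the thread; matching the FD
symlink target against "bpf" is an assumption about how the kernel names these
anon inodes, and it will also catch bpf prog/map FDs):

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Walk /proc/<pid>/fd and report FDs whose symlink target looks like
 * a BPF anon inode (e.g. "anon_inode:bpf_link"). Returns the number
 * of matches found, or -1 if /proc can't be read. */
static int find_bpf_fd_holders(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *p;
	int found = 0;

	if (!proc)
		return -1;
	while ((p = readdir(proc))) {
		char fd_dir[64], link_path[128], target[256];
		struct dirent *f;
		DIR *fds;

		if (p->d_name[0] < '0' || p->d_name[0] > '9')
			continue; /* not a PID directory */
		snprintf(fd_dir, sizeof(fd_dir), "/proc/%s/fd", p->d_name);
		fds = opendir(fd_dir); /* may fail without privileges */
		if (!fds)
			continue;
		while ((f = readdir(fds))) {
			ssize_t n;

			snprintf(link_path, sizeof(link_path), "%s/%s",
				 fd_dir, f->d_name);
			n = readlink(link_path, target, sizeof(target) - 1);
			if (n < 0)
				continue; /* not a symlink / raced with close */
			target[n] = '\0';
			if (strstr(target, "bpf")) {
				printf("pid %s fd %s -> %s\n",
				       p->d_name, f->d_name, target);
				found++;
			}
		}
		closedir(fds);
	}
	closedir(proc);
	return found;
}
```

Reading /proc/&lt;pid&gt;/fdinfo/&lt;fd&gt; for each match would then give
whatever per-FD details the kernel exposes for that file type.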

> (If you _don't_ see what's wrong with that argument... well, that'd
>  be even _more_ alarming.  Debuggers — and fuser, for that matter —
>  are for when things go wrong _in ways the designers of the system
>  failed to anticipate_.  They should not be part of a 'normal' work-
>  flow for dealing with problems that we already _know_ are possible;
>  it's kinda-sorta like how exceptions shouldn't be used for non-
>  exceptional situations.)

I don't think it's quite as black and white as you are stating. For
instance, I *think* lsof, netstat, bcc tools, etc. disagree with you.
Also, needing to kill an application because it's attached to XDP or a
cgroup doesn't seem like a "normal" workflow either. But I really would
rather leave it at that, if you don't mind.

>
> -ed

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-04-01 17:40 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-30  2:59 [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link Andrii Nakryiko
2020-03-30  2:59 ` [PATCH v3 bpf-next 1/4] bpf: implement bpf_link-based cgroup BPF program attachment Andrii Nakryiko
2020-03-31  0:05   ` Andrey Ignatov
2020-03-31  0:38     ` Alexei Starovoitov
2020-03-31  1:06       ` Andrii Nakryiko
2020-03-30  2:59 ` [PATCH v3 bpf-next 2/4] bpf: implement bpf_prog replacement for an active bpf_cgroup_link Andrii Nakryiko
2020-03-30  3:00 ` [PATCH v3 bpf-next 3/4] libbpf: add support for bpf_link-based cgroup attachment Andrii Nakryiko
2020-03-30  3:00 ` [PATCH v3 bpf-next 4/4] selftests/bpf: test FD-based " Andrii Nakryiko
2020-03-30 14:49 ` [PATCH v3 bpf-next 0/4] Add support for cgroup bpf_link David Ahern
2020-03-30 20:20   ` Andrii Nakryiko
2020-03-30 20:45     ` David Ahern
2020-03-30 20:50       ` Alexei Starovoitov
2020-03-30 22:50       ` Alexei Starovoitov
2020-03-30 23:43         ` David Ahern
2020-03-31  0:32           ` Alexei Starovoitov
2020-03-31  0:57             ` David Ahern
2020-03-31  1:17               ` Alexei Starovoitov
2020-04-01  1:42                 ` David Ahern
2020-04-01  2:04                   ` Alexei Starovoitov
2020-03-31  3:54               ` Andrii Nakryiko
2020-03-31 16:54                 ` David Ahern
2020-03-31 17:03                   ` Andrii Nakryiko
2020-03-31 17:11                   ` Alexei Starovoitov
2020-03-31 21:51                 ` Edward Cree
2020-03-31 22:44                   ` David Ahern
2020-04-01  0:45                     ` Andrii Nakryiko
2020-04-01  0:22                   ` Andrii Nakryiko
2020-04-01 14:26                     ` Edward Cree
2020-04-01 17:39                       ` Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).