All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RESEND v3 bpf-next 00/14] BPF token
@ 2023-06-29  5:18 Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object Andrii Nakryiko
                   ` (15 more replies)
  0 siblings, 16 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

This patch set introduces new BPF object, BPF token, which allows to delegate
a subset of BPF functionality from privileged system-wide daemon (e.g.,
systemd or any other container manager) to a *trusted* unprivileged
application. Trust is the key here. This functionality is not about allowing
unconditional unprivileged BPF usage. Establishing trust, though, is
completely up to the discretion of respective privileged application that
would create a BPF token, as different production setups can and do achieve it
through a combination of different means (signing, LSM, code reviews, etc),
and it's undesirable and infeasible for kernel to enforce any particular way
of validating trustworthiness of particular process.

The main motivation for BPF token is a desire to enable containerized
BPF applications to be used together with user namespaces. This is currently
impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
arbitrary memory, and it's impossible to ensure that they only read memory of
processes belonging to any given namespace. This means that it's impossible to
have namespace-aware CAP_BPF capability, and as such another mechanism to
allow safe usage of BPF functionality is necessary. BPF token and delegation
of it to a trusted unprivileged applications is such mechanism. Kernel makes
no assumption about what "trusted" constitutes in any particular case, and
it's up to specific privileged applications and their surrounding
infrastructure to decide that. What kernel provides is a set of APIs to create
and tune BPF token, and pass it around to privileged BPF commands that are
creating new BPF objects like BPF programs, BPF maps, etc.

Previous attempt at addressing this very same problem ([0]) attempted to
utilize authoritative LSM approach, but was conclusively rejected by upstream
LSM maintainers. BPF token concept is not changing anything about LSM
approach, but can be combined with LSM hooks for very fine-grained security
policy. Some ideas about making BPF token more convenient to use with LSM (in
particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
2023 presentation ([1]). E.g., an ability to specify user-provided data
(context), which in combination with BPF LSM would allow implementing a very
dynamic and fine-granular custom security policies on top of BPF token. In the
interest of minimizing API surface area discussions this is going to be
added in follow up patches, as it's not essential to the fundamental concept
of delegatable BPF token.

It should be noted that BPF token is conceptually quite similar to the idea of
/dev/bpf device file, proposed by Song a while ago ([2]). The biggest
difference is the idea of using virtual anon_inode file to hold BPF token and
allowing multiple independent instances of them, each with its own set of
restrictions. BPF pinning solves the problem of exposing such BPF token
through file system (BPF FS, in this case) for cases where transferring FDs
over Unix domain sockets is not convenient. And also, crucially, BPF token
approach is not using any special stateful task-scoped flags. Instead, bpf()
syscall accepts token_fd parameters explicitly for each relevant BPF command.
This addresses main concerns brought up during the /dev/bpf discussion, and
fits better with overall BPF subsystem design.

This patch set adds a basic minimum of functionality to make BPF token useful
and to discuss API and functionality. Currently only low-level libbpf APIs
support passing BPF token around, allowing to test kernel functionality, but
for the most part is not sufficient for real-world applications, which
typically use high-level libbpf APIs based on `struct bpf_object` type. This
was done with the intent to limit the size of patch set and concentrate on
mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
as a separate follow up patch set kernel support makes it upstream.

Another part that should happen once kernel-side BPF token is established, is
a set of conventions between applications (e.g., systemd), tools (e.g.,
bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
at well-defined locations to allow applications take advantage of this in
automatic fashion without explicit code changes on BPF application's side.
But I'd like to postpone this discussion to after BPF token concept lands.

Once important distinctions from v2 that should be noted is a chance in the
semantics of a newly added BPF_TOKEN_CREATE command. Previously,
BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to
user-space, allowing to (optionally) pin it in BPF FS using BPF_OBJ_PIN
command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF
token object creation *and* pinning in BPF FS. Such change ensures that BPF
token is always associated with a specific instance of BPF FS and cannot
"escape" it by application re-pinning it somewhere else using another
BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation,
better containing it inside intended container (under assumption BPF FS is set
up in such a way as to not be shared with other containers on the system).

  [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
  [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
  [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/

v3->v3-resend:
  - I started integrating token_fd into bpf_object_open_opts and higher-level
    libbpf bpf_object APIs, but it started going a bit deeper into bpf_object
    implementation details and how libbpf performs feature detection and
    caching, so I decided to keep it separate from this patch set and not
    distract from the mostly kernel-side changes;
v2->v3:
  - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
    BPF_OBJ_PIN for BPF token;
v1->v2:
  - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
  - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).

Andrii Nakryiko (14):
  bpf: introduce BPF token object
  libbpf: add bpf_token_create() API
  selftests/bpf: add BPF_TOKEN_CREATE test
  bpf: add BPF token support to BPF_MAP_CREATE command
  libbpf: add BPF token support to bpf_map_create() API
  selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
  bpf: add BPF token support to BPF_BTF_LOAD command
  libbpf: add BPF token support to bpf_btf_load() API
  selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
  bpf: add BPF token support to BPF_PROG_LOAD command
  bpf: take into account BPF token when fetching helper protos
  bpf: consistenly use BPF token throughout BPF verifier logic
  libbpf: add BPF token support to bpf_prog_load() API
  selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests

 drivers/media/rc/bpf-lirc.c                   |   2 +-
 include/linux/bpf.h                           |  79 ++++-
 include/linux/filter.h                        |   2 +-
 include/uapi/linux/bpf.h                      |  53 ++++
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/arraymap.c                         |   2 +-
 kernel/bpf/cgroup.c                           |   6 +-
 kernel/bpf/core.c                             |   3 +-
 kernel/bpf/helpers.c                          |   6 +-
 kernel/bpf/inode.c                            |  46 ++-
 kernel/bpf/syscall.c                          | 183 +++++++++---
 kernel/bpf/token.c                            | 201 +++++++++++++
 kernel/bpf/verifier.c                         |  13 +-
 kernel/trace/bpf_trace.c                      |   2 +-
 net/core/filter.c                             |  36 +--
 net/ipv4/bpf_tcp_ca.c                         |   2 +-
 net/netfilter/nf_bpf_link.c                   |   2 +-
 tools/include/uapi/linux/bpf.h                |  53 ++++
 tools/lib/bpf/bpf.c                           |  35 ++-
 tools/lib/bpf/bpf.h                           |  45 ++-
 tools/lib/bpf/libbpf.map                      |   1 +
 .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
 .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
 .../testing/selftests/bpf/prog_tests/token.c  | 277 ++++++++++++++++++
 24 files changed, 957 insertions(+), 104 deletions(-)
 create mode 100644 kernel/bpf/token.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-07-04 12:43   ` Christian Brauner
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 02/14] libbpf: add bpf_token_create() API Andrii Nakryiko
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Add new kind of BPF kernel object, BPF token. BPF token is meant to to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while have a good amount of control over which
privileged operations could be performed using provided BPF token.

This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
allows to create a new BPF token object along with a set of allowed
commands that such BPF token allows to unprivileged applications.
Currently only BPF_TOKEN_CREATE command itself can be
delegated, but other patches gradually add ability to delegate
BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.

The above means that new BPF tokens can be created using existing BPF
token, if original privileged creator allowed BPF_TOKEN_CREATE command.
New derived BPF token cannot be more powerful than the original BPF
token.

Importantly, BPF token is automatically pinned at the specified location
inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
command, unlike BPF prog/map/btf/link. This provides more control over
unintended sharing of BPF tokens through pinning it in another BPF FS
instances.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h            |  47 ++++++++++
 include/uapi/linux/bpf.h       |  38 ++++++++
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/inode.c             |  46 +++++++--
 kernel/bpf/syscall.c           |  17 ++++
 kernel/bpf/token.c             | 167 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  38 ++++++++
 7 files changed, 344 insertions(+), 11 deletions(-)
 create mode 100644 kernel/bpf/token.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f58895830ada..c4f1684aa138 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -51,6 +51,7 @@ struct module;
 struct bpf_func_state;
 struct ftrace_ops;
 struct cgroup;
+struct bpf_token;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1533,6 +1534,12 @@ struct bpf_link_primer {
 	u32 id;
 };
 
+struct bpf_token {
+	struct work_struct work;
+	atomic64_t refcnt;
+	u64 allowed_cmds;
+};
+
 struct bpf_struct_ops_value;
 struct btf_member;
 
@@ -1916,6 +1923,11 @@ bpf_prog_run_array_sleepable(const struct bpf_prog_array __rcu *array_rcu,
 	return ret;
 }
 
+static inline bool bpf_token_capable(const struct bpf_token *token, int cap)
+{
+	return token || capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
+}
+
 #ifdef CONFIG_BPF_SYSCALL
 DECLARE_PER_CPU(int, bpf_prog_active);
 extern struct mutex bpf_stats_enabled_mutex;
@@ -2077,8 +2089,25 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
 
+void bpf_token_inc(struct bpf_token *token);
+void bpf_token_put(struct bpf_token *token);
+int bpf_token_create(union bpf_attr *attr);
+int bpf_token_new_fd(struct bpf_token *token);
+struct bpf_token *bpf_token_get_from_fd(u32 ufd);
+
+bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
+
+enum bpf_type {
+	BPF_TYPE_UNSPEC	= 0,
+	BPF_TYPE_PROG,
+	BPF_TYPE_MAP,
+	BPF_TYPE_LINK,
+	BPF_TYPE_TOKEN,
+};
+
 int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
 int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
+int bpf_obj_pin_any(int path_fd, const char __user *pathname, void *raw, enum bpf_type type);
 
 #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
 #define DEFINE_BPF_ITER_FUNC(target, args...)			\
@@ -2436,6 +2465,24 @@ static inline int bpf_obj_get_user(const char __user *pathname, int flags)
 	return -EOPNOTSUPP;
 }
 
+static inline void bpf_token_inc(struct bpf_token *token)
+{
+}
+
+static inline void bpf_token_put(struct bpf_token *token)
+{
+}
+
+static inline int bpf_token_new_fd(struct bpf_token *token)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static inline void __dev_flush(void)
 {
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 60a9d59beeab..3ff91f52745d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -846,6 +846,24 @@ union bpf_iter_link_info {
  *		Returns zero on success. On error, -1 is returned and *errno*
  *		is set appropriately.
  *
+ * BPF_TOKEN_CREATE
+ *	Description
+ *		Create BPF token with embedded information about what
+ *		BPF-related functionality it allows. This BPF token can be
+ *		passed as an extra parameter to various bpf() syscall commands
+ *		to grant BPF subsystem functionality to unprivileged processes.
+ *		BPF token is automatically pinned at specified location in BPF
+ *		FS. It can be retrieved (to get FD passed to bpf() syscall)
+ *		using BPF_OBJ_GET command. It's not allowed to re-pin BPF
+ *		token using BPF_OBJ_PIN command. Such restrictions ensure BPF
+ *		token stays associated with originally intended BPF FS
+ *		instance and cannot be intentionally or unintentionally pinned
+ *		somewhere else.
+ *
+ *	Return
+ *		Returns zero on success. On error, -1 is returned and *errno*
+ *		is set appropriately.
+ *
  * NOTES
  *	eBPF objects (maps and programs) can be shared between processes.
  *
@@ -900,6 +918,7 @@ enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TOKEN_CREATE,
 };
 
 enum bpf_map_type {
@@ -1622,6 +1641,25 @@ union bpf_attr {
 		__u32		flags;		/* extra flags */
 	} prog_bind_map;
 
+	struct { /* struct used by BPF_TOKEN_CREATE command */
+		/* optional, BPF token FD granting operation */
+		__u32		token_fd;
+		__u32		token_flags;
+		__u32		pin_flags;
+		/* pin_{path_fd,pathname} specify location in BPF FS instance
+		 * to pin BPF token at;
+		 * path_fd + pathname have the same semantics as openat() syscall
+		 */
+		__u32		pin_path_fd;
+		__u64		pin_pathname;
+		/* a bit set of allowed bpf() syscall commands,
+		 * e.g., (1ULL << BPF_TOKEN_CREATE) | (1ULL << BPF_PROG_LOAD)
+		 * will allow creating derived BPF tokens and loading new BPF
+		 * programs
+		 */
+		__u64		allowed_cmds;
+	} token_create;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1d3892168d32..bbc17ea3878f 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -6,7 +6,7 @@ cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
 endif
 CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 4174f76133df..b9b93b81af9a 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -22,13 +22,6 @@
 #include <linux/bpf_trace.h>
 #include "preload/bpf_preload.h"
 
-enum bpf_type {
-	BPF_TYPE_UNSPEC	= 0,
-	BPF_TYPE_PROG,
-	BPF_TYPE_MAP,
-	BPF_TYPE_LINK,
-};
-
 static void *bpf_any_get(void *raw, enum bpf_type type)
 {
 	switch (type) {
@@ -41,6 +34,9 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
 	case BPF_TYPE_LINK:
 		bpf_link_inc(raw);
 		break;
+	case BPF_TYPE_TOKEN:
+		bpf_token_inc(raw);
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		break;
@@ -61,6 +57,9 @@ static void bpf_any_put(void *raw, enum bpf_type type)
 	case BPF_TYPE_LINK:
 		bpf_link_put(raw);
 		break;
+	case BPF_TYPE_TOKEN:
+		bpf_token_put(raw);
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		break;
@@ -89,6 +88,12 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
 		return raw;
 	}
 
+	raw = bpf_token_get_from_fd(ufd);
+	if (!IS_ERR(raw)) {
+		*type = BPF_TYPE_TOKEN;
+		return raw;
+	}
+
 	return ERR_PTR(-EINVAL);
 }
 
@@ -97,6 +102,7 @@ static const struct inode_operations bpf_dir_iops;
 static const struct inode_operations bpf_prog_iops = { };
 static const struct inode_operations bpf_map_iops  = { };
 static const struct inode_operations bpf_link_iops  = { };
+static const struct inode_operations bpf_token_iops  = { };
 
 static struct inode *bpf_get_inode(struct super_block *sb,
 				   const struct inode *dir,
@@ -136,6 +142,8 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
 		*type = BPF_TYPE_MAP;
 	else if (inode->i_op == &bpf_link_iops)
 		*type = BPF_TYPE_LINK;
+	else if (inode->i_op == &bpf_token_iops)
+		*type = BPF_TYPE_TOKEN;
 	else
 		return -EACCES;
 
@@ -369,6 +377,11 @@ static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg)
 			     &bpf_iter_fops : &bpffs_obj_fops);
 }
 
+static int bpf_mktoken(struct dentry *dentry, umode_t mode, void *arg)
+{
+	return bpf_mkobj_ops(dentry, mode, arg, &bpf_token_iops, &bpffs_obj_fops);
+}
+
 static struct dentry *
 bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags)
 {
@@ -435,8 +448,8 @@ static int bpf_iter_link_pin_kernel(struct dentry *parent,
 	return ret;
 }
 
-static int bpf_obj_do_pin(int path_fd, const char __user *pathname, void *raw,
-			  enum bpf_type type)
+int bpf_obj_pin_any(int path_fd, const char __user *pathname, void *raw,
+		    enum bpf_type type)
 {
 	struct dentry *dentry;
 	struct inode *dir;
@@ -469,6 +482,9 @@ static int bpf_obj_do_pin(int path_fd, const char __user *pathname, void *raw,
 	case BPF_TYPE_LINK:
 		ret = vfs_mkobj(dentry, mode, bpf_mklink, raw);
 		break;
+	case BPF_TYPE_TOKEN:
+		ret = vfs_mkobj(dentry, mode, bpf_mktoken, raw);
+		break;
 	default:
 		ret = -EPERM;
 	}
@@ -487,7 +503,15 @@ int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname)
 	if (IS_ERR(raw))
 		return PTR_ERR(raw);
 
-	ret = bpf_obj_do_pin(path_fd, pathname, raw, type);
+	/* disallow BPF_OBJ_PIN command for BPF token; BPF token can only be
+	 * auto-pinned during creation with BPF_TOKEN_CREATE
+	 */
+	if (type == BPF_TYPE_TOKEN) {
+		bpf_any_put(raw, type);
+		return -EOPNOTSUPP;
+	}
+
+	ret = bpf_obj_pin_any(path_fd, pathname, raw, type);
 	if (ret != 0)
 		bpf_any_put(raw, type);
 
@@ -547,6 +571,8 @@ int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags)
 		ret = bpf_map_new_fd(raw, f_flags);
 	else if (type == BPF_TYPE_LINK)
 		ret = (f_flags != O_RDWR) ? -EINVAL : bpf_link_new_fd(raw);
+	else if (type == BPF_TYPE_TOKEN)
+		ret = (f_flags != O_RDWR) ? -EINVAL : bpf_token_new_fd(raw);
 	else
 		return -ENOENT;
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a2aef900519c..745b605fad8e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5095,6 +5095,20 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
 	return ret;
 }
 
+#define BPF_TOKEN_CREATE_LAST_FIELD token_create.allowed_cmds
+
+static int token_create(union bpf_attr *attr)
+{
+	if (CHECK_ATTR(BPF_TOKEN_CREATE))
+		return -EINVAL;
+
+	/* no flags are supported yet */
+	if (attr->token_create.token_flags || attr->token_create.pin_flags)
+		return -EINVAL;
+
+	return bpf_token_create(attr);
+}
+
 static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 {
 	union bpf_attr attr;
@@ -5228,6 +5242,9 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 	case BPF_PROG_BIND_MAP:
 		err = bpf_prog_bind_map(&attr);
 		break;
+	case BPF_TOKEN_CREATE:
+		err = token_create(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
new file mode 100644
index 000000000000..1ece52439701
--- /dev/null
+++ b/kernel/bpf/token.c
@@ -0,0 +1,167 @@
+#include <linux/bpf.h>
+#include <linux/vmalloc.h>
+#include <linux/anon_inodes.h>
+#include <linux/fdtable.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/idr.h>
+#include <linux/namei.h>
+
+DEFINE_IDR(token_idr);
+DEFINE_SPINLOCK(token_idr_lock);
+
+void bpf_token_inc(struct bpf_token *token)
+{
+	atomic64_inc(&token->refcnt);
+}
+
+static void bpf_token_put_deferred(struct work_struct *work)
+{
+	struct bpf_token *token = container_of(work, struct bpf_token, work);
+
+	kvfree(token);
+}
+
+void bpf_token_put(struct bpf_token *token)
+{
+	if (!token)
+		return;
+
+	if (!atomic64_dec_and_test(&token->refcnt))
+		return;
+
+	INIT_WORK(&token->work, bpf_token_put_deferred);
+	schedule_work(&token->work);
+}
+
+static int bpf_token_release(struct inode *inode, struct file *filp)
+{
+	struct bpf_token *token = filp->private_data;
+
+	bpf_token_put(token);
+	return 0;
+}
+
+static ssize_t bpf_dummy_read(struct file *filp, char __user *buf, size_t siz,
+			      loff_t *ppos)
+{
+	/* We need this handler such that alloc_file() enables
+	 * f_mode with FMODE_CAN_READ.
+	 */
+	return -EINVAL;
+}
+
+static ssize_t bpf_dummy_write(struct file *filp, const char __user *buf,
+			       size_t siz, loff_t *ppos)
+{
+	/* We need this handler such that alloc_file() enables
+	 * f_mode with FMODE_CAN_WRITE.
+	 */
+	return -EINVAL;
+}
+
+static const struct file_operations bpf_token_fops = {
+	.release	= bpf_token_release,
+	.read		= bpf_dummy_read,
+	.write		= bpf_dummy_write,
+};
+
+static struct bpf_token *bpf_token_alloc(void)
+{
+	struct bpf_token *token;
+
+	token = kvzalloc(sizeof(*token), GFP_USER);
+	if (!token)
+		return NULL;
+
+	atomic64_set(&token->refcnt, 1);
+
+	return token;
+}
+
+static bool is_bit_subset_of(u32 subset, u32 superset)
+{
+	return (superset & subset) == subset;
+}
+
+int bpf_token_create(union bpf_attr *attr)
+{
+	struct bpf_token *new_token, *token = NULL;
+	int ret;
+
+	if (attr->token_create.token_fd) {
+		token = bpf_token_get_from_fd(attr->token_create.token_fd);
+		if (IS_ERR(token))
+			return PTR_ERR(token);
+		/* if provided BPF token doesn't allow creating new tokens,
+		 * then use system-wide capability checks only
+		 */
+		if (!bpf_token_allow_cmd(token, BPF_TOKEN_CREATE)) {
+			bpf_token_put(token);
+			token = NULL;
+		}
+	}
+
+	ret = -EPERM;
+	if (!bpf_token_capable(token, CAP_SYS_ADMIN))
+		goto out;
+
+	/* requested cmds should be a subset of associated token's set */
+	if (token && !is_bit_subset_of(attr->token_create.allowed_cmds, token->allowed_cmds))
+		goto out;
+
+	new_token = bpf_token_alloc();
+	if (!new_token) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	new_token->allowed_cmds = attr->token_create.allowed_cmds;
+
+	ret = bpf_obj_pin_any(attr->token_create.pin_path_fd,
+			      u64_to_user_ptr(attr->token_create.pin_pathname),
+			      new_token, BPF_TYPE_TOKEN);
+	if (ret < 0)
+		bpf_token_put(new_token);
+out:
+	bpf_token_put(token);
+	return ret;
+}
+
+#define BPF_TOKEN_INODE_NAME "bpf-token"
+
+/* Alloc anon_inode and FD for prepared token.
+ * Returns fd >= 0 on success; negative error, otherwise.
+ */
+int bpf_token_new_fd(struct bpf_token *token)
+{
+	return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC);
+}
+
+struct bpf_token *bpf_token_get_from_fd(u32 ufd)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_token *token;
+
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+	if (f.file->f_op != &bpf_token_fops) {
+		fdput(f);
+		return ERR_PTR(-EINVAL);
+	}
+
+	token = f.file->private_data;
+	bpf_token_inc(token);
+	fdput(f);
+
+	return token;
+}
+
+bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
+{
+	if (!token)
+		return false;
+
+	return token->allowed_cmds & (1ULL << cmd);
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 60a9d59beeab..3ff91f52745d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -846,6 +846,24 @@ union bpf_iter_link_info {
  *		Returns zero on success. On error, -1 is returned and *errno*
  *		is set appropriately.
  *
+ * BPF_TOKEN_CREATE
+ *	Description
+ *		Create BPF token with embedded information about what
+ *		BPF-related functionality it allows. This BPF token can be
+ *		passed as an extra parameter to various bpf() syscall commands
+ *		to grant BPF subsystem functionality to unprivileged processes.
+ *		BPF token is automatically pinned at specified location in BPF
+ *		FS. It can be retrieved (to get FD passed to bpf() syscall)
+ *		using BPF_OBJ_GET command. It's not allowed to re-pin BPF
+ *		token using BPF_OBJ_PIN command. Such restrictions ensure BPF
+ *		token stays associated with originally intended BPF FS
+ *		instance and cannot be intentionally or unintentionally pinned
+ *		somewhere else.
+ *
+ *	Return
+ *		Returns zero on success. On error, -1 is returned and *errno*
+ *		is set appropriately.
+ *
  * NOTES
  *	eBPF objects (maps and programs) can be shared between processes.
  *
@@ -900,6 +918,7 @@ enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TOKEN_CREATE,
 };
 
 enum bpf_map_type {
@@ -1622,6 +1641,25 @@ union bpf_attr {
 		__u32		flags;		/* extra flags */
 	} prog_bind_map;
 
+	struct { /* struct used by BPF_TOKEN_CREATE command */
+		/* optional, BPF token FD granting operation */
+		__u32		token_fd;
+		__u32		token_flags;
+		__u32		pin_flags;
+		/* pin_{path_fd,pathname} specify location in BPF FS instance
+		 * to pin BPF token at;
+		 * path_fd + pathname have the same semantics as openat() syscall
+		 */
+		__u32		pin_path_fd;
+		__u64		pin_pathname;
+		/* a bit set of allowed bpf() syscall commands,
+		 * e.g., (1ULL << BPF_TOKEN_CREATE) | (1ULL << BPF_PROG_LOAD)
+		 * will allow creating derived BPF tokens and loading new BPF
+		 * programs
+		 */
+		__u64		allowed_cmds;
+	} token_create;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 02/14] libbpf: add bpf_token_create() API
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 03/14] selftests/bpf: add BPF_TOKEN_CREATE test Andrii Nakryiko
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Add low-level wrapper API for BPF_TOKEN_CREATE command in bpf() syscall.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/lib/bpf/bpf.c      | 21 +++++++++++++++++++++
 tools/lib/bpf/bpf.h      | 32 ++++++++++++++++++++++++++++++++
 tools/lib/bpf/libbpf.map |  1 +
 3 files changed, 54 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index ed86b37d8024..a247a1612f29 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -1201,3 +1201,24 @@ int bpf_prog_bind_map(int prog_fd, int map_fd,
 	ret = sys_bpf(BPF_PROG_BIND_MAP, &attr, attr_sz);
 	return libbpf_err_errno(ret);
 }
+
+int bpf_token_create(int pin_path_fd, const char *pin_pathname, struct bpf_token_create_opts *opts)
+{
+	const size_t attr_sz = offsetofend(union bpf_attr, token_create);
+	union bpf_attr attr;
+	int ret;
+
+	if (!OPTS_VALID(opts, bpf_token_create_opts))
+		return libbpf_err(-EINVAL);
+
+	memset(&attr, 0, attr_sz);
+	attr.token_create.pin_path_fd = pin_path_fd;
+	attr.token_create.pin_pathname = ptr_to_u64(pin_pathname);
+	attr.token_create.token_fd = OPTS_GET(opts, token_fd, 0);
+	attr.token_create.token_flags = OPTS_GET(opts, token_flags, 0);
+	attr.token_create.pin_flags = OPTS_GET(opts, pin_flags, 0);
+	attr.token_create.allowed_cmds = OPTS_GET(opts, allowed_cmds, 0);
+
+	ret = sys_bpf(BPF_TOKEN_CREATE, &attr, attr_sz);
+	return libbpf_err_errno(ret);
+}
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9aa0ee473754..ab0355d90a2c 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -551,6 +551,38 @@ struct bpf_test_run_opts {
 LIBBPF_API int bpf_prog_test_run_opts(int prog_fd,
 				      struct bpf_test_run_opts *opts);
 
+struct bpf_token_create_opts {
+	size_t sz; /* size of this struct for forward/backward compatibility */
+	__u32 token_fd;
+	__u32 token_flags;
+	__u32 pin_flags;
+	__u64 allowed_cmds;
+	size_t :0;
+};
+#define bpf_token_create_opts__last_field allowed_cmds
+
+/**
+ * @brief **bpf_token_create()** creates a new instance of BPF token, pinning
+ * it at the specified location in BPF FS.
+ *
+ * BPF token created and pinned with this API can be subsequently opened using
+ * bpf_obj_get() API to obtain FD that can be passed to bpf() syscall for
+ * commands like BPF_PROG_LOAD, BPF_MAP_CREATE, etc.
+ *
+ * @param pin_path_fd O_PATH FD (see man 2 openat() for semantics) specifying,
+ * in combination with *pin_pathname*, target location in BPF FS at which to
+ * create and pin BPF token.
+ * @param pin_pathname absolute or relative path specifying, in combination
+ * with *pin_path_fd*, specifying in combination with *pin_path_fd*, target
+ * location in BPF FS at which to create and pin BPF token.
+ * @param opts optional BPF token creation options, can be NULL
+ *
+ * @return 0, on success; negative error code, otherwise (errno is also set to
+ * the error code)
+ */
+LIBBPF_API int bpf_token_create(int pin_path_fd, const char *pin_pathname,
+				struct bpf_token_create_opts *opts);
+
 #ifdef __cplusplus
 } /* extern "C" */
 #endif
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 7521a2fb7626..62cbe4775081 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -395,4 +395,5 @@ LIBBPF_1.2.0 {
 LIBBPF_1.3.0 {
 	global:
 		bpf_obj_pin_opts;
+		bpf_token_create;
 } LIBBPF_1.2.0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 03/14] selftests/bpf: add BPF_TOKEN_CREATE test
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 02/14] libbpf: add bpf_token_create() API Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 04/14] bpf: add BPF token support to BPF_MAP_CREATE command Andrii Nakryiko
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Add a subtest validating BPF_TOKEN_CREATE command, pinning/getting BPF
token in/from BPF FS, and creating derived BPF tokens using token_fd
parameter.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 .../testing/selftests/bpf/prog_tests/token.c  | 96 +++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c

diff --git a/tools/testing/selftests/bpf/prog_tests/token.c b/tools/testing/selftests/bpf/prog_tests/token.c
new file mode 100644
index 000000000000..153c4e26ef6b
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/token.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+#include "linux/bpf.h"
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include "cap_helpers.h"
+
+static int drop_priv_caps(__u64 *old_caps)
+{
+	return cap_disable_effective((1ULL << CAP_BPF) |
+				     (1ULL << CAP_PERFMON) |
+				     (1ULL << CAP_NET_ADMIN) |
+				     (1ULL << CAP_SYS_ADMIN), old_caps);
+}
+
+static int restore_priv_caps(__u64 old_caps)
+{
+	return cap_enable_effective(old_caps, NULL);
+}
+
+#define BPFFS_PATH "/sys/fs/bpf"
+#define TOKEN_PATH BPFFS_PATH "/test_token"
+
+static void subtest_token_create(void)
+{
+	LIBBPF_OPTS(bpf_token_create_opts, opts);
+	int token_fd = 0, limited_token_fd = 0, err;
+	__u64 old_caps = 0;
+
+	/* check that any current and future cmd can be specified */
+	opts.allowed_cmds = ~0ULL;
+	err = bpf_token_create(-EBADF, TOKEN_PATH, &opts);
+	if (!ASSERT_OK(err, "token_create_future_proof"))
+		return;
+	unlink(TOKEN_PATH);
+
+	/* create BPF token which allows creating derived BPF tokens */
+	opts.allowed_cmds = 1ULL << BPF_TOKEN_CREATE;
+	err = bpf_token_create(-EBADF, TOKEN_PATH, &opts);
+	if (!ASSERT_OK(err, "token_create"))
+		return;
+
+	token_fd = bpf_obj_get(TOKEN_PATH);
+	if (!ASSERT_GT(token_fd, 0, "token_get"))
+		goto cleanup;
+	unlink(TOKEN_PATH);
+
+	/* validate pinning and getting works as expected */
+	err = bpf_obj_pin(token_fd, TOKEN_PATH);
+	if (!ASSERT_ERR(err, "token_pin_unexpected_success"))
+		goto cleanup;
+
+
+	/* drop privileges to test token_fd passing */
+	if (!ASSERT_OK(drop_priv_caps(&old_caps), "drop_caps"))
+		goto cleanup;
+
+	/* unprivileged BPF_TOKEN_CREATE should fail */
+	err = bpf_token_create(-EBADF, TOKEN_PATH, NULL);
+	if (!ASSERT_ERR(err, "token_create_unpriv_fail"))
+		goto cleanup;
+
+	/* unprivileged BPF_TOKEN_CREATE using granted BPF token succeeds */
+	opts.allowed_cmds = 0; /* ask for BPF token which doesn't allow new tokens */
+	opts.token_fd = token_fd;
+	err = bpf_token_create(-EBADF, TOKEN_PATH, &opts);
+	if (!ASSERT_OK(limited_token_fd, "token_create_limited"))
+		goto cleanup;
+
+	limited_token_fd = bpf_obj_get(TOKEN_PATH);
+	if (!ASSERT_GT(limited_token_fd, 0, "token_get_limited"))
+		goto cleanup;
+	unlink(TOKEN_PATH);
+
+	/* creating yet another token using "limited" BPF token should fail */
+	opts.allowed_cmds = 0;
+	opts.token_fd = limited_token_fd;
+	err = bpf_token_create(-EBADF, TOKEN_PATH,  &opts);
+	if (!ASSERT_ERR(err, "token_create_from_lim_fail"))
+		goto cleanup;
+
+cleanup:
+	if (token_fd)
+		close(token_fd);
+	if (limited_token_fd)
+		close(limited_token_fd);
+	unlink(TOKEN_PATH);
+	if (old_caps)
+		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
+}
+
+void test_token(void)
+{
+	if (test__start_subtest("token_create"))
+		subtest_token_create();
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 04/14] bpf: add BPF token support to BPF_MAP_CREATE command
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 03/14] selftests/bpf: add BPF_TOKEN_CREATE test Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 05/14] libbpf: add BPF token support to bpf_map_create() API Andrii Nakryiko
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.

Further, add a filter of allowed BPF map types to BPF token, specified
at BPF token creation time. This, in combination with allowed_cmds
allows to create a narrowly-focused BPF token (controlled by privileged
agent) with a restrictive set of BPF maps that application can attempt
to create.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h                           |  3 +
 include/uapi/linux/bpf.h                      |  6 ++
 kernel/bpf/syscall.c                          | 56 +++++++++++++++----
 kernel/bpf/token.c                            | 13 +++++
 tools/include/uapi/linux/bpf.h                |  6 ++
 .../selftests/bpf/prog_tests/libbpf_probes.c  |  2 +
 .../selftests/bpf/prog_tests/libbpf_str.c     |  3 +
 7 files changed, 77 insertions(+), 12 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c4f1684aa138..856a147c8ce8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -251,6 +251,7 @@ struct bpf_map {
 	u32 btf_value_type_id;
 	u32 btf_vmlinux_value_type_id;
 	struct btf *btf;
+	struct bpf_token *token;
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *objcg;
 #endif
@@ -1538,6 +1539,7 @@ struct bpf_token {
 	struct work_struct work;
 	atomic64_t refcnt;
 	u64 allowed_cmds;
+	u64 allowed_map_types;
 };
 
 struct bpf_struct_ops_value;
@@ -2096,6 +2098,7 @@ int bpf_token_new_fd(struct bpf_token *token);
 struct bpf_token *bpf_token_get_from_fd(u32 ufd);
 
 bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
+bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type type);
 
 enum bpf_type {
 	BPF_TYPE_UNSPEC	= 0,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3ff91f52745d..59764ba48ec9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -962,6 +962,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
 	BPF_MAP_TYPE_CGRP_STORAGE,
+	__MAX_BPF_MAP_TYPE
 };
 
 /* Note that tracing related programs such as
@@ -1368,6 +1369,7 @@ union bpf_attr {
 		 * to using 5 hash functions).
 		 */
 		__u64	map_extra;
+		__u32	map_token_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -1658,6 +1660,10 @@ union bpf_attr {
 		 * programs
 		 */
 		__u64		allowed_cmds;
+		/* similarly to allowed_cmds, a bit set of BPF map types that
+		 * are allowed to be created by requested BPF token;
+		 */
+		__u64		allowed_map_types;
 	} token_create;
 
 } __attribute__((aligned(8)));
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 745b605fad8e..cc15b1d5dc26 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -691,6 +691,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
 {
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 	struct btf_record *rec = map->record;
+	struct bpf_token *token = map->token;
 
 	security_bpf_map_free(map);
 	bpf_map_release_memcg(map);
@@ -706,6 +707,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
 	 * template bpf_map struct used during verification.
 	 */
 	btf_record_free(rec);
+	bpf_token_put(token);
 }
 
 static void bpf_map_put_uref(struct bpf_map *map)
@@ -1010,7 +1012,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	if (!IS_ERR_OR_NULL(map->record)) {
 		int i;
 
-		if (!bpf_capable()) {
+		if (!bpf_token_capable(map->token, CAP_BPF)) {
 			ret = -EPERM;
 			goto free_map_tab;
 		}
@@ -1092,11 +1094,12 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	return ret;
 }
 
-#define BPF_MAP_CREATE_LAST_FIELD map_extra
+#define BPF_MAP_CREATE_LAST_FIELD map_token_fd
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
 {
 	const struct bpf_map_ops *ops;
+	struct bpf_token *token = NULL;
 	int numa_node = bpf_map_attr_numa_node(attr);
 	u32 map_type = attr->map_type;
 	struct bpf_map *map;
@@ -1147,14 +1150,32 @@ static int map_create(union bpf_attr *attr)
 	if (!ops->map_mem_usage)
 		return -EINVAL;
 
+	if (attr->map_token_fd) {
+		token = bpf_token_get_from_fd(attr->map_token_fd);
+		if (IS_ERR(token))
+			return PTR_ERR(token);
+
+		/* if current token doesn't grant map creation permissions,
+		 * then we can't use this token, so ignore it and rely on
+		 * system-wide capabilities checks
+		 */
+		if (!bpf_token_allow_cmd(token, BPF_MAP_CREATE) ||
+		    !bpf_token_allow_map_type(token, attr->map_type)) {
+			bpf_token_put(token);
+			token = NULL;
+		}
+	}
+
+	err = -EPERM;
+
 	/* Intent here is for unprivileged_bpf_disabled to block BPF map
 	 * creation for unprivileged users; other actions depend
 	 * on fd availability and access to bpffs, so are dependent on
 	 * object creation success. Even with unprivileged BPF disabled,
 	 * capability checks are still carried out.
 	 */
-	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
-		return -EPERM;
+	if (sysctl_unprivileged_bpf_disabled && !bpf_token_capable(token, CAP_BPF))
+		goto put_token;
 
 	/* check privileged map type permissions */
 	switch (map_type) {
@@ -1187,28 +1208,36 @@ static int map_create(union bpf_attr *attr)
 	case BPF_MAP_TYPE_LRU_PERCPU_HASH:
 	case BPF_MAP_TYPE_STRUCT_OPS:
 	case BPF_MAP_TYPE_CPUMAP:
-		if (!bpf_capable())
-			return -EPERM;
+		if (!bpf_token_capable(token, CAP_BPF))
+			goto put_token;
 		break;
 	case BPF_MAP_TYPE_SOCKMAP:
 	case BPF_MAP_TYPE_SOCKHASH:
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 	case BPF_MAP_TYPE_XSKMAP:
-		if (!capable(CAP_NET_ADMIN))
-			return -EPERM;
+		if (!bpf_token_capable(token, CAP_NET_ADMIN))
+			goto put_token;
 		break;
 	default:
 		WARN(1, "unsupported map type %d", map_type);
-		return -EPERM;
+		goto put_token;
 	}
 
 	map = ops->map_alloc(attr);
-	if (IS_ERR(map))
-		return PTR_ERR(map);
+	if (IS_ERR(map)) {
+		err = PTR_ERR(map);
+		goto put_token;
+	}
 	map->ops = ops;
 	map->map_type = map_type;
 
+	if (token) {
+		/* move token reference into map->token, reuse our refcnt */
+		map->token = token;
+		token = NULL;
+	}
+
 	err = bpf_obj_name_cpy(map->name, attr->map_name,
 			       sizeof(attr->map_name));
 	if (err < 0)
@@ -1281,8 +1310,11 @@ static int map_create(union bpf_attr *attr)
 free_map_sec:
 	security_bpf_map_free(map);
 free_map:
+	bpf_token_put(map->token);
 	btf_put(map->btf);
 	map->ops->map_free(map);
+put_token:
+	bpf_token_put(token);
 	return err;
 }
 
@@ -5095,7 +5127,7 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
 	return ret;
 }
 
-#define BPF_TOKEN_CREATE_LAST_FIELD token_create.allowed_cmds
+#define BPF_TOKEN_CREATE_LAST_FIELD token_create.allowed_map_types
 
 static int token_create(union bpf_attr *attr)
 {
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
index 1ece52439701..91d8d987faea 100644
--- a/kernel/bpf/token.c
+++ b/kernel/bpf/token.c
@@ -110,6 +110,10 @@ int bpf_token_create(union bpf_attr *attr)
 	/* requested cmds should be a subset of associated token's set */
 	if (token && !is_bit_subset_of(attr->token_create.allowed_cmds, token->allowed_cmds))
 		goto out;
+	/* requested map types should be a subset of associated token's set */
+	if (token && !is_bit_subset_of(attr->token_create.allowed_map_types,
+				       token->allowed_map_types))
+		goto out;
 
 	new_token = bpf_token_alloc();
 	if (!new_token) {
@@ -118,6 +122,7 @@ int bpf_token_create(union bpf_attr *attr)
 	}
 
 	new_token->allowed_cmds = attr->token_create.allowed_cmds;
+	new_token->allowed_map_types = attr->token_create.allowed_map_types;
 
 	ret = bpf_obj_pin_any(attr->token_create.pin_path_fd,
 			      u64_to_user_ptr(attr->token_create.pin_pathname),
@@ -165,3 +170,11 @@ bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
 
 	return token->allowed_cmds & (1ULL << cmd);
 }
+
+bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type type)
+{
+	if (!token || type >= __MAX_BPF_MAP_TYPE)
+		return false;
+
+	return token->allowed_map_types & (1ULL << type);
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3ff91f52745d..59764ba48ec9 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -962,6 +962,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
 	BPF_MAP_TYPE_CGRP_STORAGE,
+	__MAX_BPF_MAP_TYPE
 };
 
 /* Note that tracing related programs such as
@@ -1368,6 +1369,7 @@ union bpf_attr {
 		 * to using 5 hash functions).
 		 */
 		__u64	map_extra;
+		__u32	map_token_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -1658,6 +1660,10 @@ union bpf_attr {
 		 * programs
 		 */
 		__u64		allowed_cmds;
+		/* similarly to allowed_cmds, a bit set of BPF map types that
+		 * are allowed to be created by requested BPF token;
+		 */
+		__u64		allowed_map_types;
 	} token_create;
 
 } __attribute__((aligned(8)));
diff --git a/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c b/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
index 9f766ddd946a..573249a2814d 100644
--- a/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
+++ b/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
@@ -68,6 +68,8 @@ void test_libbpf_probe_map_types(void)
 
 		if (map_type == BPF_MAP_TYPE_UNSPEC)
 			continue;
+		if (strcmp(map_type_name, "__MAX_BPF_MAP_TYPE") == 0)
+			continue;
 
 		if (!test__start_subtest(map_type_name))
 			continue;
diff --git a/tools/testing/selftests/bpf/prog_tests/libbpf_str.c b/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
index efb8bd43653c..e677c0435cec 100644
--- a/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
+++ b/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
@@ -132,6 +132,9 @@ static void test_libbpf_bpf_map_type_str(void)
 		const char *map_type_str;
 		char buf[256];
 
+		if (map_type == __MAX_BPF_MAP_TYPE)
+			continue;
+
 		map_type_name = btf__str_by_offset(btf, e->name_off);
 		map_type_str = libbpf_bpf_map_type_str(map_type);
 		ASSERT_OK_PTR(map_type_str, map_type_name);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 05/14] libbpf: add BPF token support to bpf_map_create() API
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (3 preceding siblings ...)
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 04/14] bpf: add BPF token support to BPF_MAP_CREATE command Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 06/14] selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command Andrii Nakryiko
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Add ability to provide token_fd for BPF_MAP_CREATE command through
bpf_map_create() API.

Also wire through token_create.allowed_map_types param for
BPF_TOKEN_CREATE command.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/lib/bpf/bpf.c | 5 ++++-
 tools/lib/bpf/bpf.h | 7 +++++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index a247a1612f29..882297b1e136 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -169,7 +169,7 @@ int bpf_map_create(enum bpf_map_type map_type,
 		   __u32 max_entries,
 		   const struct bpf_map_create_opts *opts)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, map_extra);
+	const size_t attr_sz = offsetofend(union bpf_attr, map_token_fd);
 	union bpf_attr attr;
 	int fd;
 
@@ -198,6 +198,8 @@ int bpf_map_create(enum bpf_map_type map_type,
 	attr.numa_node = OPTS_GET(opts, numa_node, 0);
 	attr.map_ifindex = OPTS_GET(opts, map_ifindex, 0);
 
+	attr.map_token_fd = OPTS_GET(opts, token_fd, 0);
+
 	fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, attr_sz);
 	return libbpf_err_errno(fd);
 }
@@ -1218,6 +1220,7 @@ int bpf_token_create(int pin_path_fd, const char *pin_pathname, struct bpf_token
 	attr.token_create.token_flags = OPTS_GET(opts, token_flags, 0);
 	attr.token_create.pin_flags = OPTS_GET(opts, pin_flags, 0);
 	attr.token_create.allowed_cmds = OPTS_GET(opts, allowed_cmds, 0);
+	attr.token_create.allowed_map_types = OPTS_GET(opts, allowed_map_types, 0);
 
 	ret = sys_bpf(BPF_TOKEN_CREATE, &attr, attr_sz);
 	return libbpf_err_errno(ret);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index ab0355d90a2c..cd3fb5ce6fe2 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -51,8 +51,10 @@ struct bpf_map_create_opts {
 
 	__u32 numa_node;
 	__u32 map_ifindex;
+
+	__u32 token_fd;
 };
-#define bpf_map_create_opts__last_field map_ifindex
+#define bpf_map_create_opts__last_field token_fd
 
 LIBBPF_API int bpf_map_create(enum bpf_map_type map_type,
 			      const char *map_name,
@@ -557,9 +559,10 @@ struct bpf_token_create_opts {
 	__u32 token_flags;
 	__u32 pin_flags;
 	__u64 allowed_cmds;
+	__u64 allowed_map_types;
 	size_t :0;
 };
-#define bpf_token_create_opts__last_field allowed_cmds
+#define bpf_token_create_opts__last_field allowed_map_types
 
 /**
  * @brief **bpf_token_create()** creates a new instance of BPF token, pinning
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 06/14] selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (4 preceding siblings ...)
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 05/14] libbpf: add BPF token support to bpf_map_create() API Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 07/14] bpf: add BPF token support to BPF_BTF_LOAD command Andrii Nakryiko
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Add test for creating BPF token with support for BPF_MAP_CREATE
delegation. And validate that its allowed_map_types filter works as
expected and allows to create privileged BPF maps through delegated
token, as long as they are allowed by privileged creator of a token.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 .../testing/selftests/bpf/prog_tests/token.c  | 55 +++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/token.c b/tools/testing/selftests/bpf/prog_tests/token.c
index 153c4e26ef6b..0f832f9178a2 100644
--- a/tools/testing/selftests/bpf/prog_tests/token.c
+++ b/tools/testing/selftests/bpf/prog_tests/token.c
@@ -89,8 +89,63 @@ static void subtest_token_create(void)
 		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
 }
 
+static void subtest_map_token(void)
+{
+	LIBBPF_OPTS(bpf_token_create_opts, token_opts);
+	LIBBPF_OPTS(bpf_map_create_opts, map_opts);
+	int token_fd = 0, map_fd = 0, err;
+	__u64 old_caps = 0;
+
+	/* check that it's ok to allow any map type */
+	token_opts.allowed_map_types = ~0ULL; /* any current and future map types is allowed */
+	err = bpf_token_create(-EBADF, TOKEN_PATH, &token_opts);
+	if (!ASSERT_OK(err, "token_create_future_proof"))
+		return;
+	unlink(TOKEN_PATH);
+
+	/* create BPF token allowing STACK, but not QUEUE map */
+	token_opts.allowed_cmds = 1ULL << BPF_MAP_CREATE;
+	token_opts.allowed_map_types = 1ULL << BPF_MAP_TYPE_STACK; /* but not QUEUE */
+	err = bpf_token_create(-EBADF, TOKEN_PATH, &token_opts);
+	if (!ASSERT_OK(err, "token_create"))
+		return;
+
+	/* drop privileges to test token_fd passing */
+	if (!ASSERT_OK(drop_priv_caps(&old_caps), "drop_caps"))
+		goto cleanup;
+
+	token_fd = bpf_obj_get(TOKEN_PATH);
+	if (!ASSERT_GT(token_fd, 0, "token_get"))
+		goto cleanup;
+
+	/* BPF_MAP_TYPE_STACK is privileged, but with given token_fd should succeed */
+	map_opts.token_fd = token_fd;
+	map_fd = bpf_map_create(BPF_MAP_TYPE_STACK, "token_stack", 0, 8, 1, &map_opts);
+	if (!ASSERT_GT(map_fd, 0, "stack_map_fd"))
+		goto cleanup;
+	close(map_fd);
+	map_fd = 0;
+
+	/* BPF_MAP_TYPE_QUEUE is privileged, and token doesn't allow it, so should fail */
+	map_opts.token_fd = token_fd;
+	map_fd = bpf_map_create(BPF_MAP_TYPE_QUEUE, "token_queue", 0, 8, 1, &map_opts);
+	if (!ASSERT_EQ(map_fd, -EPERM, "queue_map_fd"))
+		goto cleanup;
+
+cleanup:
+	if (map_fd > 0)
+		close(map_fd);
+	if (token_fd)
+		close(token_fd);
+	unlink(TOKEN_PATH);
+	if (old_caps)
+		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
+}
+
 void test_token(void)
 {
 	if (test__start_subtest("token_create"))
 		subtest_token_create();
+	if (test__start_subtest("map_token"))
+		subtest_map_token();
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 07/14] bpf: add BPF token support to BPF_BTF_LOAD command
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (5 preceding siblings ...)
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 06/14] selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 08/14] libbpf: add BPF token support to bpf_btf_load() API Andrii Nakryiko
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
through delegated BPF token. BTF loading is a pretty straightforward
operation, so as long as BPF token is created with allow_cmds granting
BPF_BTF_LOAD command, kernel proceeds to parsing BTF data and creating
BTF object.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/syscall.c           | 20 ++++++++++++++++++--
 tools/include/uapi/linux/bpf.h |  1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59764ba48ec9..fa6a9e2396e6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1536,6 +1536,7 @@ union bpf_attr {
 		 * truncated), or smaller (if log buffer wasn't filled completely).
 		 */
 		__u32		btf_log_true_size;
+		__u32		btf_token_fd;
 	};
 
 	struct {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cc15b1d5dc26..f295458a35c0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4484,15 +4484,31 @@ static int bpf_obj_get_info_by_fd(const union bpf_attr *attr,
 	return err;
 }
 
-#define BPF_BTF_LOAD_LAST_FIELD btf_log_true_size
+#define BPF_BTF_LOAD_LAST_FIELD btf_token_fd
 
 static int bpf_btf_load(const union bpf_attr *attr, bpfptr_t uattr, __u32 uattr_size)
 {
+	struct bpf_token *token = NULL;
+
 	if (CHECK_ATTR(BPF_BTF_LOAD))
 		return -EINVAL;
 
-	if (!bpf_capable())
+	if (attr->btf_token_fd) {
+		token = bpf_token_get_from_fd(attr->btf_token_fd);
+		if (IS_ERR(token))
+			return PTR_ERR(token);
+		if (!bpf_token_allow_cmd(token, BPF_BTF_LOAD)) {
+			bpf_token_put(token);
+			token = NULL;
+		}
+	}
+
+	if (!bpf_token_capable(token, CAP_BPF)) {
+		bpf_token_put(token);
 		return -EPERM;
+	}
+
+	bpf_token_put(token);
 
 	return btf_new_fd(attr, uattr, uattr_size);
 }
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 59764ba48ec9..fa6a9e2396e6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1536,6 +1536,7 @@ union bpf_attr {
 		 * truncated), or smaller (if log buffer wasn't filled completely).
 		 */
 		__u32		btf_log_true_size;
+		__u32		btf_token_fd;
 	};
 
 	struct {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 08/14] libbpf: add BPF token support to bpf_btf_load() API
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (6 preceding siblings ...)
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 07/14] bpf: add BPF token support to BPF_BTF_LOAD command Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 09/14] selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest Andrii Nakryiko
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Allow user to specify token_fd for bpf_btf_load() API that wraps
kernel's BPF_BTF_LOAD command. This allows loading BTF from unprivileged
process as long as it has BPF token allowing BPF_BTF_LOAD command, which
can be created and delegated by privileged process.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/lib/bpf/bpf.c | 4 +++-
 tools/lib/bpf/bpf.h | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 882297b1e136..6fb915069be7 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -1098,7 +1098,7 @@ int bpf_raw_tracepoint_open(const char *name, int prog_fd)
 
 int bpf_btf_load(const void *btf_data, size_t btf_size, struct bpf_btf_load_opts *opts)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, btf_log_true_size);
+	const size_t attr_sz = offsetofend(union bpf_attr, btf_token_fd);
 	union bpf_attr attr;
 	char *log_buf;
 	size_t log_size;
@@ -1123,6 +1123,8 @@ int bpf_btf_load(const void *btf_data, size_t btf_size, struct bpf_btf_load_opts
 
 	attr.btf = ptr_to_u64(btf_data);
 	attr.btf_size = btf_size;
+	attr.btf_token_fd = OPTS_GET(opts, token_fd, 0);
+
 	/* log_level == 0 and log_buf != NULL means "try loading without
 	 * log_buf, but retry with log_buf and log_level=1 on error", which is
 	 * consistent across low-level and high-level BTF and program loading
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index cd3fb5ce6fe2..dc7c4af21ad9 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -132,9 +132,10 @@ struct bpf_btf_load_opts {
 	 * If kernel doesn't support this feature, log_size is left unchanged.
 	 */
 	__u32 log_true_size;
+	__u32 token_fd;
 	size_t :0;
 };
-#define bpf_btf_load_opts__last_field log_true_size
+#define bpf_btf_load_opts__last_field token_fd
 
 LIBBPF_API int bpf_btf_load(const void *btf_data, size_t btf_size,
 			    struct bpf_btf_load_opts *opts);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 09/14] selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (7 preceding siblings ...)
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 08/14] libbpf: add BPF token support to bpf_btf_load() API Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 10/14] bpf: add BPF token support to BPF_PROG_LOAD command Andrii Nakryiko
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Add a simple test validating that BTF loading can be done from
unprivileged process through delegated BPF token.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 .../testing/selftests/bpf/prog_tests/token.c  | 60 +++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/token.c b/tools/testing/selftests/bpf/prog_tests/token.c
index 0f832f9178a2..113cd4786a70 100644
--- a/tools/testing/selftests/bpf/prog_tests/token.c
+++ b/tools/testing/selftests/bpf/prog_tests/token.c
@@ -142,10 +142,70 @@ static void subtest_map_token(void)
 		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
 }
 
+static void subtest_btf_token(void)
+{
+	LIBBPF_OPTS(bpf_token_create_opts, token_opts);
+	LIBBPF_OPTS(bpf_btf_load_opts, btf_opts);
+	int token_fd = 0, btf_fd = 0, err;
+	const void *raw_btf_data;
+	struct btf *btf = NULL;
+	__u32 raw_btf_size;
+	__u64 old_caps = 0;
+
+	/* create BPF token allowing BPF_BTF_LOAD command */
+	token_opts.allowed_cmds = 1ULL << BPF_BTF_LOAD;
+	err = bpf_token_create(-EBADF, TOKEN_PATH, &token_opts);
+	if (!ASSERT_OK(err, "token_create"))
+		return;
+
+	/* drop privileges to test token_fd passing */
+	if (!ASSERT_OK(drop_priv_caps(&old_caps), "drop_caps"))
+		goto cleanup;
+
+	token_fd = bpf_obj_get(TOKEN_PATH);
+	if (!ASSERT_GT(token_fd, 0, "token_get"))
+		goto cleanup;
+
+	btf = btf__new_empty();
+	if (!ASSERT_OK_PTR(btf, "empty_btf"))
+		goto cleanup;
+
+	ASSERT_GT(btf__add_int(btf, "int", 4, 0), 0, "int_type");
+
+	raw_btf_data = btf__raw_data(btf, &raw_btf_size);
+	if (!ASSERT_OK_PTR(raw_btf_data, "raw_btf_data"))
+		goto cleanup;
+
+	/* validate we can successfully load new BTF with token */
+	btf_opts.token_fd = token_fd;
+	btf_fd = bpf_btf_load(raw_btf_data, raw_btf_size, &btf_opts);
+	if (!ASSERT_GT(btf_fd, 0, "btf_fd"))
+		goto cleanup;
+	close(btf_fd);
+
+	/* now validate that we *cannot* load BTF without token */
+	btf_opts.token_fd = 0;
+	btf_fd = bpf_btf_load(raw_btf_data, raw_btf_size, &btf_opts);
+	if (!ASSERT_EQ(btf_fd, -EPERM, "btf_fd_eperm"))
+		goto cleanup;
+
+cleanup:
+	btf__free(btf);
+	if (btf_fd > 0)
+		close(btf_fd);
+	if (token_fd)
+		close(token_fd);
+	unlink(TOKEN_PATH);
+	if (old_caps)
+		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
+}
+
 void test_token(void)
 {
 	if (test__start_subtest("token_create"))
 		subtest_token_create();
 	if (test__start_subtest("map_token"))
 		subtest_map_token();
+	if (test__start_subtest("btf_token"))
+		subtest_btf_token();
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 10/14] bpf: add BPF token support to BPF_PROG_LOAD command
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (8 preceding siblings ...)
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 09/14] selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 11/14] bpf: take into account BPF token when fetching helper protos Andrii Nakryiko
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Add basic support of BPF token to BPF_PROG_LOAD. Extend BPF token to
allow specifying BPF_PROG_LOAD as an allowed command, and also allow to
specify bit sets of program type and attach type combination that would
be allowed to be loaded by requested BPF token.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h                           |  6 ++
 include/uapi/linux/bpf.h                      |  8 ++
 kernel/bpf/core.c                             |  1 +
 kernel/bpf/syscall.c                          | 89 +++++++++++++------
 kernel/bpf/token.c                            | 21 +++++
 tools/include/uapi/linux/bpf.h                |  8 ++
 .../selftests/bpf/prog_tests/libbpf_probes.c  |  2 +
 .../selftests/bpf/prog_tests/libbpf_str.c     |  3 +
 8 files changed, 113 insertions(+), 25 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 856a147c8ce8..64dcdc18f09a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1411,6 +1411,7 @@ struct bpf_prog_aux {
 #ifdef CONFIG_SECURITY
 	void *security;
 #endif
+	struct bpf_token *token;
 	struct bpf_prog_offload *offload;
 	struct btf *btf;
 	struct bpf_func_info *func_info;
@@ -1540,6 +1541,8 @@ struct bpf_token {
 	atomic64_t refcnt;
 	u64 allowed_cmds;
 	u64 allowed_map_types;
+	u64 allowed_prog_types;
+	u64 allowed_attach_types;
 };
 
 struct bpf_struct_ops_value;
@@ -2099,6 +2102,9 @@ struct bpf_token *bpf_token_get_from_fd(u32 ufd);
 
 bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
 bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type type);
+bool bpf_token_allow_prog_type(const struct bpf_token *token,
+			       enum bpf_prog_type prog_type,
+			       enum bpf_attach_type attach_type);
 
 enum bpf_type {
 	BPF_TYPE_UNSPEC	= 0,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fa6a9e2396e6..6a37ba2f422d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1007,6 +1007,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
 	BPF_PROG_TYPE_NETFILTER,
+	__MAX_BPF_PROG_TYPE
 };
 
 enum bpf_attach_type {
@@ -1439,6 +1440,7 @@ union bpf_attr {
 		 * truncated), or smaller (if log buffer wasn't filled completely).
 		 */
 		__u32		log_true_size;
+		__u32		prog_token_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_OBJ_* commands */
@@ -1665,6 +1667,12 @@ union bpf_attr {
 		 * are allowed to be created by requested BPF token;
 		 */
 		__u64		allowed_map_types;
+		/* similarly to allowed_map_types, bit sets of BPF program
+		 * types and BPF program attach types that are allowed to be
+		 * loaded by requested BPF token
+		 */
+		__u64		allowed_prog_types;
+		__u64		allowed_attach_types;
 	} token_create;
 
 } __attribute__((aligned(8)));
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index dc85240a0134..2ed54d1ed32a 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2599,6 +2599,7 @@ void bpf_prog_free(struct bpf_prog *fp)
 
 	if (aux->dst_prog)
 		bpf_prog_put(aux->dst_prog);
+	bpf_token_put(aux->token);
 	INIT_WORK(&aux->work, bpf_prog_free_deferred);
 	schedule_work(&aux->work);
 }
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f295458a35c0..b9e7cc72429e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2577,13 +2577,15 @@ static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
 }
 
 /* last field in 'union bpf_attr' used by this command */
-#define	BPF_PROG_LOAD_LAST_FIELD log_true_size
+#define BPF_PROG_LOAD_LAST_FIELD prog_token_fd
 
 static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 {
 	enum bpf_prog_type type = attr->prog_type;
 	struct bpf_prog *prog, *dst_prog = NULL;
 	struct btf *attach_btf = NULL;
+	struct bpf_token *token = NULL;
+	bool bpf_cap;
 	int err;
 	char license[128];
 
@@ -2599,10 +2601,31 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 				 BPF_F_XDP_DEV_BOUND_ONLY))
 		return -EINVAL;
 
+	bpf_prog_load_fixup_attach_type(attr);
+
+	if (attr->prog_token_fd) {
+		token = bpf_token_get_from_fd(attr->prog_token_fd);
+		if (IS_ERR(token))
+			return PTR_ERR(token);
+		/* if current token doesn't grant prog loading permissions,
+		 * then we can't use this token, so ignore it and rely on
+		 * system-wide capabilities checks
+		 */
+		if (!bpf_token_allow_cmd(token, BPF_PROG_LOAD) ||
+		    !bpf_token_allow_prog_type(token, attr->prog_type,
+					       attr->expected_attach_type)) {
+			bpf_token_put(token);
+			token = NULL;
+		}
+	}
+
+	bpf_cap = bpf_token_capable(token, CAP_BPF);
+	err = -EPERM;
+
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
 	    (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
-	    !bpf_capable())
-		return -EPERM;
+	    !bpf_cap)
+		goto put_token;
 
 	/* Intent here is for unprivileged_bpf_disabled to block BPF program
 	 * creation for unprivileged users; other actions depend
@@ -2611,21 +2634,23 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	 * capability checks are still carried out for these
 	 * and other operations.
 	 */
-	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
-		return -EPERM;
+	if (sysctl_unprivileged_bpf_disabled && !bpf_cap)
+		goto put_token;
 
 	if (attr->insn_cnt == 0 ||
-	    attr->insn_cnt > (bpf_capable() ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
-		return -E2BIG;
+	    attr->insn_cnt > (bpf_cap ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS)) {
+		err = -E2BIG;
+		goto put_token;
+	}
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
-	    !bpf_capable())
-		return -EPERM;
+	    !bpf_cap)
+		goto put_token;
 
-	if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && !capable(CAP_SYS_ADMIN))
-		return -EPERM;
-	if (is_perfmon_prog_type(type) && !perfmon_capable())
-		return -EPERM;
+	if (is_net_admin_prog_type(type) && !bpf_token_capable(token, CAP_NET_ADMIN))
+		goto put_token;
+	if (is_perfmon_prog_type(type) && !bpf_token_capable(token, CAP_PERFMON))
+		goto put_token;
 
 	/* attach_prog_fd/attach_btf_obj_fd can specify fd of either bpf_prog
 	 * or btf, we need to check which one it is
@@ -2635,27 +2660,33 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 		if (IS_ERR(dst_prog)) {
 			dst_prog = NULL;
 			attach_btf = btf_get_by_fd(attr->attach_btf_obj_fd);
-			if (IS_ERR(attach_btf))
-				return -EINVAL;
+			if (IS_ERR(attach_btf)) {
+				err = -EINVAL;
+				goto put_token;
+			}
 			if (!btf_is_kernel(attach_btf)) {
 				/* attaching through specifying bpf_prog's BTF
 				 * objects directly might be supported eventually
 				 */
 				btf_put(attach_btf);
-				return -ENOTSUPP;
+				err = -ENOTSUPP;
+				goto put_token;
 			}
 		}
 	} else if (attr->attach_btf_id) {
 		/* fall back to vmlinux BTF, if BTF type ID is specified */
 		attach_btf = bpf_get_btf_vmlinux();
-		if (IS_ERR(attach_btf))
-			return PTR_ERR(attach_btf);
-		if (!attach_btf)
-			return -EINVAL;
+		if (IS_ERR(attach_btf)) {
+			err = PTR_ERR(attach_btf);
+			goto put_token;
+		}
+		if (!attach_btf) {
+			err = -EINVAL;
+			goto put_token;
+		}
 		btf_get(attach_btf);
 	}
 
-	bpf_prog_load_fixup_attach_type(attr);
 	if (bpf_prog_load_check_attach(type, attr->expected_attach_type,
 				       attach_btf, attr->attach_btf_id,
 				       dst_prog)) {
@@ -2663,7 +2694,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 			bpf_prog_put(dst_prog);
 		if (attach_btf)
 			btf_put(attach_btf);
-		return -EINVAL;
+		err = -EINVAL;
+		goto put_token;
 	}
 
 	/* plain bpf_prog allocation */
@@ -2673,7 +2705,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 			bpf_prog_put(dst_prog);
 		if (attach_btf)
 			btf_put(attach_btf);
-		return -ENOMEM;
+		err = -EINVAL;
+		goto put_token;
 	}
 
 	prog->expected_attach_type = attr->expected_attach_type;
@@ -2684,6 +2717,10 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
 	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
 
+	/* move token into prog->aux, reuse taken refcnt */
+	prog->aux->token = token;
+	token = NULL;
+
 	err = security_bpf_prog_alloc(prog->aux);
 	if (err)
 		goto free_prog;
@@ -2785,6 +2822,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	if (prog->aux->attach_btf)
 		btf_put(prog->aux->attach_btf);
 	bpf_prog_free(prog);
+put_token:
+	bpf_token_put(token);
 	return err;
 }
 
@@ -3544,7 +3583,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_SK_LOOKUP:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
 	case BPF_PROG_TYPE_CGROUP_SKB:
-		if (!capable(CAP_NET_ADMIN))
+		if (!bpf_token_capable(prog->aux->token, CAP_NET_ADMIN))
 			/* cg-skb progs can be loaded by unpriv user.
 			 * check permissions at attach time.
 			 */
@@ -5143,7 +5182,7 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
 	return ret;
 }
 
-#define BPF_TOKEN_CREATE_LAST_FIELD token_create.allowed_map_types
+#define BPF_TOKEN_CREATE_LAST_FIELD token_create.allowed_attach_types
 
 static int token_create(union bpf_attr *attr)
 {
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
index 91d8d987faea..22449a509048 100644
--- a/kernel/bpf/token.c
+++ b/kernel/bpf/token.c
@@ -114,6 +114,14 @@ int bpf_token_create(union bpf_attr *attr)
 	if (token && !is_bit_subset_of(attr->token_create.allowed_map_types,
 				       token->allowed_map_types))
 		goto out;
+	/* requested prog types should be a subset of associated token's set */
+	if (token && !is_bit_subset_of(attr->token_create.allowed_prog_types,
+				       token->allowed_prog_types))
+		goto out;
+	/* requested attach types should be a subset of associated token's set */
+	if (token && !is_bit_subset_of(attr->token_create.allowed_attach_types,
+				       token->allowed_attach_types))
+		goto out;
 
 	new_token = bpf_token_alloc();
 	if (!new_token) {
@@ -123,6 +131,8 @@ int bpf_token_create(union bpf_attr *attr)
 
 	new_token->allowed_cmds = attr->token_create.allowed_cmds;
 	new_token->allowed_map_types = attr->token_create.allowed_map_types;
+	new_token->allowed_prog_types = attr->token_create.allowed_prog_types;
+	new_token->allowed_attach_types = attr->token_create.allowed_attach_types;
 
 	ret = bpf_obj_pin_any(attr->token_create.pin_path_fd,
 			      u64_to_user_ptr(attr->token_create.pin_pathname),
@@ -178,3 +188,14 @@ bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type t
 
 	return token->allowed_map_types & (1ULL << type);
 }
+
+bool bpf_token_allow_prog_type(const struct bpf_token *token,
+			       enum bpf_prog_type prog_type,
+			       enum bpf_attach_type attach_type)
+{
+	if (!token || prog_type >= __MAX_BPF_PROG_TYPE || attach_type >= __MAX_BPF_ATTACH_TYPE)
+		return false;
+
+	return (token->allowed_prog_types & (1ULL << prog_type)) &&
+	       (token->allowed_attach_types & (1ULL << attach_type));
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index fa6a9e2396e6..6a37ba2f422d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1007,6 +1007,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
 	BPF_PROG_TYPE_NETFILTER,
+	__MAX_BPF_PROG_TYPE
 };
 
 enum bpf_attach_type {
@@ -1439,6 +1440,7 @@ union bpf_attr {
 		 * truncated), or smaller (if log buffer wasn't filled completely).
 		 */
 		__u32		log_true_size;
+		__u32		prog_token_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_OBJ_* commands */
@@ -1665,6 +1667,12 @@ union bpf_attr {
 		 * are allowed to be created by requested BPF token;
 		 */
 		__u64		allowed_map_types;
+		/* similarly to allowed_map_types, bit sets of BPF program
+		 * types and BPF program attach types that are allowed to be
+		 * loaded by requested BPF token
+		 */
+		__u64		allowed_prog_types;
+		__u64		allowed_attach_types;
 	} token_create;
 
 } __attribute__((aligned(8)));
diff --git a/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c b/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
index 573249a2814d..4ed46ed58a7b 100644
--- a/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
+++ b/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
@@ -30,6 +30,8 @@ void test_libbpf_probe_prog_types(void)
 
 		if (prog_type == BPF_PROG_TYPE_UNSPEC)
 			continue;
+		if (strcmp(prog_type_name, "__MAX_BPF_PROG_TYPE") == 0)
+			continue;
 
 		if (!test__start_subtest(prog_type_name))
 			continue;
diff --git a/tools/testing/selftests/bpf/prog_tests/libbpf_str.c b/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
index e677c0435cec..ea2a8c4063a8 100644
--- a/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
+++ b/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
@@ -185,6 +185,9 @@ static void test_libbpf_bpf_prog_type_str(void)
 		const char *prog_type_str;
 		char buf[256];
 
+		if (prog_type == __MAX_BPF_PROG_TYPE)
+			continue;
+
 		prog_type_name = btf__str_by_offset(btf, e->name_off);
 		prog_type_str = libbpf_bpf_prog_type_str(prog_type);
 		ASSERT_OK_PTR(prog_type_str, prog_type_name);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 11/14] bpf: take into account BPF token when fetching helper protos
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (9 preceding siblings ...)
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 10/14] bpf: add BPF token support to BPF_PROG_LOAD command Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 12/14] bpf: consistenly use BPF token throughout BPF verifier logic Andrii Nakryiko
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 drivers/media/rc/bpf-lirc.c |  2 +-
 include/linux/bpf.h         |  5 +++--
 kernel/bpf/cgroup.c         |  6 +++---
 kernel/bpf/helpers.c        |  6 +++---
 kernel/bpf/syscall.c        |  5 +++--
 kernel/trace/bpf_trace.c    |  2 +-
 net/core/filter.c           | 32 ++++++++++++++++----------------
 net/ipv4/bpf_tcp_ca.c       |  2 +-
 net/netfilter/nf_bpf_link.c |  2 +-
 9 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/drivers/media/rc/bpf-lirc.c b/drivers/media/rc/bpf-lirc.c
index fe17c7f98e81..6d07693c6b9f 100644
--- a/drivers/media/rc/bpf-lirc.c
+++ b/drivers/media/rc/bpf-lirc.c
@@ -110,7 +110,7 @@ lirc_mode2_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_get_prandom_u32:
 		return &bpf_get_prandom_u32_proto;
 	case BPF_FUNC_trace_printk:
-		if (perfmon_capable())
+		if (bpf_token_capable(prog->aux->token, CAP_PERFMON))
 			return bpf_get_trace_printk_proto();
 		fallthrough;
 	default:
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 64dcdc18f09a..0e8680e639cb 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2358,7 +2358,8 @@ int btf_check_type_match(struct bpf_verifier_log *log, const struct bpf_prog *pr
 struct bpf_prog *bpf_prog_by_id(u32 id);
 struct bpf_link *bpf_link_by_id(u32 id);
 
-const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id);
+const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id,
+						 const struct bpf_prog *prog);
 void bpf_task_storage_free(struct task_struct *task);
 void bpf_cgrp_storage_free(struct cgroup *cgroup);
 bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog);
@@ -2615,7 +2616,7 @@ static inline int btf_struct_access(struct bpf_verifier_log *log,
 }
 
 static inline const struct bpf_func_proto *
-bpf_base_func_proto(enum bpf_func_id func_id)
+bpf_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	return NULL;
 }
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 5b2741aa0d9b..39d6cfb6f304 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -1615,7 +1615,7 @@ cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_perf_event_output:
 		return &bpf_event_output_data_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
@@ -2173,7 +2173,7 @@ sysctl_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_perf_event_output:
 		return &bpf_event_output_data_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
@@ -2330,7 +2330,7 @@ cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_perf_event_output:
 		return &bpf_event_output_data_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 9e80efa59a5d..6a740af48908 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1663,7 +1663,7 @@ const struct bpf_func_proto bpf_probe_read_kernel_str_proto __weak;
 const struct bpf_func_proto bpf_task_pt_regs_proto __weak;
 
 const struct bpf_func_proto *
-bpf_base_func_proto(enum bpf_func_id func_id)
+bpf_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_map_lookup_elem:
@@ -1714,7 +1714,7 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		break;
 	}
 
-	if (!bpf_capable())
+	if (!bpf_token_capable(prog->aux->token, CAP_BPF))
 		return NULL;
 
 	switch (func_id) {
@@ -1772,7 +1772,7 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		break;
 	}
 
-	if (!perfmon_capable())
+	if (!bpf_token_capable(prog->aux->token, CAP_PERFMON))
 		return NULL;
 
 	switch (func_id) {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b9e7cc72429e..ceb17f10efbe 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5438,7 +5438,7 @@ static const struct bpf_func_proto bpf_sys_bpf_proto = {
 const struct bpf_func_proto * __weak
 tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
-	return bpf_base_func_proto(func_id);
+	return bpf_base_func_proto(func_id, prog);
 }
 
 BPF_CALL_1(bpf_sys_close, u32, fd)
@@ -5488,7 +5488,8 @@ syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_sys_bpf:
-		return !perfmon_capable() ? NULL : &bpf_sys_bpf_proto;
+		return !bpf_token_capable(prog->aux->token, CAP_PERFMON)
+		       ? NULL : &bpf_sys_bpf_proto;
 	case BPF_FUNC_btf_find_by_name_kind:
 		return &bpf_btf_find_by_name_kind_proto;
 	case BPF_FUNC_sys_close:
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 03b7f6b8e4f0..877f2e01d212 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1521,7 +1521,7 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_trace_vprintk:
 		return bpf_get_trace_vprintk_proto();
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 06ba0e56e369..03c411dc1e80 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -83,7 +83,7 @@
 #include <net/netfilter/nf_conntrack_bpf.h>
 
 static const struct bpf_func_proto *
-bpf_sk_base_func_proto(enum bpf_func_id func_id);
+bpf_sk_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
 
 int copy_bpf_fprog_from_user(struct sock_fprog *dst, sockptr_t src, int len)
 {
@@ -7817,7 +7817,7 @@ sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
@@ -7900,7 +7900,7 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 			return NULL;
 		}
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -7919,7 +7919,7 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_perf_event_output:
 		return &bpf_skb_event_output_proto;
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -8106,7 +8106,7 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 #endif
 #endif
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -8165,7 +8165,7 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 #endif
 #endif
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 
 #if IS_MODULE(CONFIG_NF_CONNTRACK) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF_MODULES)
@@ -8226,7 +8226,7 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_sock_proto;
 #endif /* CONFIG_INET */
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -8268,7 +8268,7 @@ sk_msg_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_cgroup_classid_curr_proto;
 #endif
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -8312,7 +8312,7 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_skc_lookup_tcp_proto;
 #endif
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -8323,7 +8323,7 @@ flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_skb_load_bytes:
 		return &bpf_flow_dissector_load_bytes_proto;
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -8350,7 +8350,7 @@ lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_skb_under_cgroup:
 		return &bpf_skb_under_cgroup_proto;
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -11181,7 +11181,7 @@ sk_reuseport_func_proto(enum bpf_func_id func_id,
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
@@ -11363,7 +11363,7 @@ sk_lookup_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_sk_release:
 		return &bpf_sk_release_proto;
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -11697,7 +11697,7 @@ const struct bpf_func_proto bpf_sock_from_file_proto = {
 };
 
 static const struct bpf_func_proto *
-bpf_sk_base_func_proto(enum bpf_func_id func_id)
+bpf_sk_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	const struct bpf_func_proto *func;
 
@@ -11726,10 +11726,10 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 
-	if (!perfmon_capable())
+	if (!bpf_token_capable(prog->aux->token, CAP_PERFMON))
 		return NULL;
 
 	return func;
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 4406d796cc2f..0a3a60e7c282 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -193,7 +193,7 @@ bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id,
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
diff --git a/net/netfilter/nf_bpf_link.c b/net/netfilter/nf_bpf_link.c
index c36da56d756f..d7786ea9c01a 100644
--- a/net/netfilter/nf_bpf_link.c
+++ b/net/netfilter/nf_bpf_link.c
@@ -219,7 +219,7 @@ static bool nf_is_valid_access(int off, int size, enum bpf_access_type type,
 static const struct bpf_func_proto *
 bpf_nf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
-	return bpf_base_func_proto(func_id);
+	return bpf_base_func_proto(func_id, prog);
 }
 
 const struct bpf_verifier_ops netfilter_verifier_ops = {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 12/14] bpf: consistenly use BPF token throughout BPF verifier logic
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (10 preceding siblings ...)
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 11/14] bpf: take into account BPF token when fetching helper protos Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 13/14] libbpf: add BPF token support to bpf_prog_load() API Andrii Nakryiko
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Remove remaining direct queries to perfmon_capable() and bpf_capable()
in BPF verifier logic and instead use BPF token (if available) to make
decisions about privileges.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h    | 18 ++++++++++--------
 include/linux/filter.h |  2 +-
 kernel/bpf/arraymap.c  |  2 +-
 kernel/bpf/core.c      |  2 +-
 kernel/bpf/verifier.c  | 13 ++++++-------
 net/core/filter.c      |  4 ++--
 6 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0e8680e639cb..af9f7dc60f21 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2059,24 +2059,26 @@ bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, size_t align,
 
 extern int sysctl_unprivileged_bpf_disabled;
 
-static inline bool bpf_allow_ptr_leaks(void)
+bool bpf_token_capable(const struct bpf_token *token, int cap);
+
+static inline bool bpf_allow_ptr_leaks(const struct bpf_token *token)
 {
-	return perfmon_capable();
+	return bpf_token_capable(token, CAP_PERFMON);
 }
 
-static inline bool bpf_allow_uninit_stack(void)
+static inline bool bpf_allow_uninit_stack(const struct bpf_token *token)
 {
-	return perfmon_capable();
+	return bpf_token_capable(token, CAP_PERFMON);
 }
 
-static inline bool bpf_bypass_spec_v1(void)
+static inline bool bpf_bypass_spec_v1(const struct bpf_token *token)
 {
-	return perfmon_capable();
+	return bpf_token_capable(token, CAP_PERFMON);
 }
 
-static inline bool bpf_bypass_spec_v4(void)
+static inline bool bpf_bypass_spec_v4(const struct bpf_token *token)
 {
-	return perfmon_capable();
+	return bpf_token_capable(token, CAP_PERFMON);
 }
 
 int bpf_map_new_fd(struct bpf_map *map, int flags);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index f69114083ec7..2391a9025ffd 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1109,7 +1109,7 @@ static inline bool bpf_jit_blinding_enabled(struct bpf_prog *prog)
 		return false;
 	if (!bpf_jit_harden)
 		return false;
-	if (bpf_jit_harden == 1 && bpf_capable())
+	if (bpf_jit_harden == 1 && bpf_token_capable(prog->aux->token, CAP_BPF))
 		return false;
 
 	return true;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 2058e89b5ddd..f0c64df6b6ff 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -82,7 +82,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
 	int numa_node = bpf_map_attr_numa_node(attr);
 	u32 elem_size, index_mask, max_entries;
-	bool bypass_spec_v1 = bpf_bypass_spec_v1();
+	bool bypass_spec_v1 = bpf_bypass_spec_v1(NULL);
 	u64 array_size, mask64;
 	struct bpf_array *array;
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 2ed54d1ed32a..979c10b9399d 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -661,7 +661,7 @@ static bool bpf_prog_kallsyms_candidate(const struct bpf_prog *fp)
 void bpf_prog_kallsyms_add(struct bpf_prog *fp)
 {
 	if (!bpf_prog_kallsyms_candidate(fp) ||
-	    !bpf_capable())
+	    !bpf_token_capable(fp->aux->token, CAP_BPF))
 		return;
 
 	bpf_prog_ksym_set_addr(fp);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 11e54dd8b6dd..9d89ba98f8d8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -19403,7 +19403,12 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
 	env->prog = *prog;
 	env->ops = bpf_verifier_ops[env->prog->type];
 	env->fd_array = make_bpfptr(attr->fd_array, uattr.is_kernel);
-	is_priv = bpf_capable();
+
+	env->allow_ptr_leaks = bpf_allow_ptr_leaks(env->prog->aux->token);
+	env->allow_uninit_stack = bpf_allow_uninit_stack(env->prog->aux->token);
+	env->bypass_spec_v1 = bpf_bypass_spec_v1(env->prog->aux->token);
+	env->bypass_spec_v4 = bpf_bypass_spec_v4(env->prog->aux->token);
+	env->bpf_capable = is_priv = bpf_token_capable(env->prog->aux->token, CAP_BPF);
 
 	bpf_get_btf_vmlinux();
 
@@ -19435,12 +19440,6 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
 	if (attr->prog_flags & BPF_F_ANY_ALIGNMENT)
 		env->strict_alignment = false;
 
-	env->allow_ptr_leaks = bpf_allow_ptr_leaks();
-	env->allow_uninit_stack = bpf_allow_uninit_stack();
-	env->bypass_spec_v1 = bpf_bypass_spec_v1();
-	env->bypass_spec_v4 = bpf_bypass_spec_v4();
-	env->bpf_capable = bpf_capable();
-
 	if (is_priv)
 		env->test_state_freq = attr->prog_flags & BPF_F_TEST_STATE_FREQ;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 03c411dc1e80..a58e6d5608ba 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8525,7 +8525,7 @@ static bool cg_skb_is_valid_access(int off, int size,
 		return false;
 	case bpf_ctx_range(struct __sk_buff, data):
 	case bpf_ctx_range(struct __sk_buff, data_end):
-		if (!bpf_capable())
+		if (!bpf_token_capable(prog->aux->token, CAP_BPF))
 			return false;
 		break;
 	}
@@ -8537,7 +8537,7 @@ static bool cg_skb_is_valid_access(int off, int size,
 		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
 			break;
 		case bpf_ctx_range(struct __sk_buff, tstamp):
-			if (!bpf_capable())
+			if (!bpf_token_capable(prog->aux->token, CAP_BPF))
 				return false;
 			break;
 		default:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 13/14] libbpf: add BPF token support to bpf_prog_load() API
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (11 preceding siblings ...)
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 12/14] bpf: consistenly use BPF token throughout BPF verifier logic Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 14/14] selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests Andrii Nakryiko
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Wire through token_fd into bpf_prog_load(). Also make sure to pass
allowed_{prog,attach}_types to kernel in bpf_token_create().

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/lib/bpf/bpf.c | 5 ++++-
 tools/lib/bpf/bpf.h | 7 +++++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 6fb915069be7..5f331bbf1ad2 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -234,7 +234,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 		  const struct bpf_insn *insns, size_t insn_cnt,
 		  struct bpf_prog_load_opts *opts)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, log_true_size);
+	const size_t attr_sz = offsetofend(union bpf_attr, prog_token_fd);
 	void *finfo = NULL, *linfo = NULL;
 	const char *func_info, *line_info;
 	__u32 log_size, log_level, attach_prog_fd, attach_btf_obj_fd;
@@ -263,6 +263,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 	attr.prog_flags = OPTS_GET(opts, prog_flags, 0);
 	attr.prog_ifindex = OPTS_GET(opts, prog_ifindex, 0);
 	attr.kern_version = OPTS_GET(opts, kern_version, 0);
+	attr.prog_token_fd = OPTS_GET(opts, token_fd, 0);
 
 	if (prog_name && kernel_supports(NULL, FEAT_PROG_NAME))
 		libbpf_strlcpy(attr.prog_name, prog_name, sizeof(attr.prog_name));
@@ -1223,6 +1224,8 @@ int bpf_token_create(int pin_path_fd, const char *pin_pathname, struct bpf_token
 	attr.token_create.pin_flags = OPTS_GET(opts, pin_flags, 0);
 	attr.token_create.allowed_cmds = OPTS_GET(opts, allowed_cmds, 0);
 	attr.token_create.allowed_map_types = OPTS_GET(opts, allowed_map_types, 0);
+	attr.token_create.allowed_prog_types = OPTS_GET(opts, allowed_prog_types, 0);
+	attr.token_create.allowed_attach_types = OPTS_GET(opts, allowed_attach_types, 0);
 
 	ret = sys_bpf(BPF_TOKEN_CREATE, &attr, attr_sz);
 	return libbpf_err_errno(ret);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index dc7c4af21ad9..2ac56fba6027 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -104,9 +104,10 @@ struct bpf_prog_load_opts {
 	 * If kernel doesn't support this feature, log_size is left unchanged.
 	 */
 	__u32 log_true_size;
+	__u32 token_fd;
 	size_t :0;
 };
-#define bpf_prog_load_opts__last_field log_true_size
+#define bpf_prog_load_opts__last_field token_fd
 
 LIBBPF_API int bpf_prog_load(enum bpf_prog_type prog_type,
 			     const char *prog_name, const char *license,
@@ -561,9 +562,11 @@ struct bpf_token_create_opts {
 	__u32 pin_flags;
 	__u64 allowed_cmds;
 	__u64 allowed_map_types;
+	__u64 allowed_prog_types;
+	__u64 allowed_attach_types;
 	size_t :0;
 };
-#define bpf_token_create_opts__last_field allowed_map_types
+#define bpf_token_create_opts__last_field allowed_attach_types
 
 /**
  * @brief **bpf_token_create()** creates a new instance of BPF token, pinning
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH RESEND v3 bpf-next 14/14] selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (12 preceding siblings ...)
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 13/14] libbpf: add BPF token support to bpf_prog_load() API Andrii Nakryiko
@ 2023-06-29  5:18 ` Andrii Nakryiko
  2023-06-29 23:15 ` [PATCH RESEND v3 bpf-next 00/14] BPF token Toke Høiland-Jørgensen
  2023-07-01  2:05 ` Yafang Shao
  15 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-29  5:18 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Add a test validating that BPF token can be used to load privileged BPF
program using privileged BPF helpers through delegated BPF token created
by privileged process.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 .../testing/selftests/bpf/prog_tests/token.c  | 66 +++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/token.c b/tools/testing/selftests/bpf/prog_tests/token.c
index 113cd4786a70..415d49eacd4f 100644
--- a/tools/testing/selftests/bpf/prog_tests/token.c
+++ b/tools/testing/selftests/bpf/prog_tests/token.c
@@ -4,6 +4,7 @@
 #include <test_progs.h>
 #include <bpf/btf.h>
 #include "cap_helpers.h"
+#include <linux/filter.h>
 
 static int drop_priv_caps(__u64 *old_caps)
 {
@@ -200,6 +201,69 @@ static void subtest_btf_token(void)
 		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
 }
 
+static void subtest_prog_token(void)
+{
+	LIBBPF_OPTS(bpf_token_create_opts, token_opts);
+	LIBBPF_OPTS(bpf_prog_load_opts, prog_opts);
+	int token_fd = 0, prog_fd = 0, err;
+	__u64 old_caps = 0;
+	struct bpf_insn insns[] = {
+		/* bpf_jiffies64() requires CAP_BPF */
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_jiffies64),
+		/* bpf_get_current_task() requires CAP_PERFMON */
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_current_task),
+		/* r0 = 0; exit; */
+		BPF_MOV64_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	};
+	size_t insn_cnt = ARRAY_SIZE(insns);
+
+	/* create BPF token allowing BPF_PROG_LOAD command */
+	token_opts.allowed_cmds = 1ULL << BPF_PROG_LOAD;
+	token_opts.allowed_prog_types = 1ULL << BPF_PROG_TYPE_XDP;
+	token_opts.allowed_attach_types = 1ULL << BPF_XDP;
+	err = bpf_token_create(-EBADF, TOKEN_PATH, &token_opts);
+	if (!ASSERT_OK(err, "token_create"))
+		return;
+
+	/* drop privileges to test token_fd passing */
+	if (!ASSERT_OK(drop_priv_caps(&old_caps), "drop_caps"))
+		goto cleanup;
+
+	token_fd = bpf_obj_get(TOKEN_PATH);
+	if (!ASSERT_GT(token_fd, 0, "token_get"))
+		goto cleanup;
+
+	/* validate we can successfully load BPF program with token; this
+	 * being XDP program (CAP_NET_ADMIN) using bpf_jiffies64() (CAP_BPF)
+	 * and bpf_get_current_task() (CAP_PERFMON) helpers validates we have
+	 * BPF token wired properly in a bunch of places in the kernel
+	 */
+	prog_opts.token_fd = token_fd;
+	prog_opts.expected_attach_type = BPF_XDP;
+	prog_fd = bpf_prog_load(BPF_PROG_TYPE_XDP, "token_prog", "GPL",
+				insns, insn_cnt, &prog_opts);
+	if (!ASSERT_GT(prog_fd, 0, "prog_fd"))
+		goto cleanup;
+	close(prog_fd);
+
+	/* now validate that we *cannot* load BPF program without token */
+	prog_opts.token_fd = 0;
+	prog_fd = bpf_prog_load(BPF_PROG_TYPE_XDP, "token_prog", "GPL",
+				insns, insn_cnt, &prog_opts);
+	if (!ASSERT_EQ(prog_fd, -EPERM, "prog_fd_eperm"))
+		goto cleanup;
+
+cleanup:
+	if (prog_fd > 0)
+		close(prog_fd);
+	if (token_fd)
+		close(token_fd);
+	unlink(TOKEN_PATH);
+	if (old_caps)
+		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
+}
+
 void test_token(void)
 {
 	if (test__start_subtest("token_create"))
@@ -208,4 +272,6 @@ void test_token(void)
 		subtest_map_token();
 	if (test__start_subtest("btf_token"))
 		subtest_btf_token();
+	if (test__start_subtest("prog_token"))
+		subtest_prog_token();
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (13 preceding siblings ...)
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 14/14] selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests Andrii Nakryiko
@ 2023-06-29 23:15 ` Toke Høiland-Jørgensen
  2023-06-30 18:25   ` Andrii Nakryiko
                     ` (2 more replies)
  2023-07-01  2:05 ` Yafang Shao
  15 siblings, 3 replies; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-29 23:15 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team, sargun

Andrii Nakryiko <andrii@kernel.org> writes:

> This patch set introduces new BPF object, BPF token, which allows to delegate
> a subset of BPF functionality from privileged system-wide daemon (e.g.,
> systemd or any other container manager) to a *trusted* unprivileged
> application. Trust is the key here. This functionality is not about allowing
> unconditional unprivileged BPF usage. Establishing trust, though, is
> completely up to the discretion of respective privileged application that
> would create a BPF token, as different production setups can and do achieve it
> through a combination of different means (signing, LSM, code reviews, etc),
> and it's undesirable and infeasible for kernel to enforce any particular way
> of validating trustworthiness of particular process.
>
> The main motivation for BPF token is a desire to enable containerized
> BPF applications to be used together with user namespaces. This is currently
> impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> arbitrary memory, and it's impossible to ensure that they only read memory of
> processes belonging to any given namespace. This means that it's impossible to
> have namespace-aware CAP_BPF capability, and as such another mechanism to
> allow safe usage of BPF functionality is necessary. BPF token and delegation
> of it to a trusted unprivileged applications is such mechanism. Kernel makes
> no assumption about what "trusted" constitutes in any particular case, and
> it's up to specific privileged applications and their surrounding
> infrastructure to decide that. What kernel provides is a set of APIs to create
> and tune BPF token, and pass it around to privileged BPF commands that are
> creating new BPF objects like BPF programs, BPF maps, etc.

So a colleague pointed out today that the Seccomp Notify functionality
would be a way to achieve your stated goal of allowing unprivileged
containers to (selectively) perform bpf() syscall operations. Christian
Brauner has a pretty nice writeup of the functionality here:
https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development

In fact he even mentions allowing unprivileged access to bpf() as a
possible use case (in the second-to-last paragraph).

AFAICT this would enable your use case without adding any new kernel
functionality or changing the BPF-using applications, while allowing the
privileged userspace daemon to make case-by-case decisions on each
operation instead of granting blanket capabilities (which is my main
objection to the token proposal, as we discussed on the last iteration
of the series).

So I'm curious whether you considered this as an alternative to
BPF_TOKEN? And if so, what your reason was for rejecting it?

-Toke


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-06-29 23:15 ` [PATCH RESEND v3 bpf-next 00/14] BPF token Toke Høiland-Jørgensen
@ 2023-06-30 18:25   ` Andrii Nakryiko
  2023-07-04  9:38     ` Christian Brauner
  2023-07-04 23:20     ` Toke Høiland-Jørgensen
  2023-07-02  6:59   ` Djalal Harouni
  2023-07-04  9:51   ` Christian Brauner
  2 siblings, 2 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-06-30 18:25 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team, sargun

On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii@kernel.org> writes:
>
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token, as different production setups can and do achieve it
> > through a combination of different means (signing, LSM, code reviews, etc),
> > and it's undesirable and infeasible for kernel to enforce any particular way
> > of validating trustworthiness of particular process.
> >
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> So a colleague pointed out today that the Seccomp Notify functionality
> would be a way to achieve your stated goal of allowing unprivileged
> containers to (selectively) perform bpf() syscall operations. Christian
> Brauner has a pretty nice writeup of the functionality here:
> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>
> In fact he even mentions allowing unprivileged access to bpf() as a
> possible use case (in the second-to-last paragraph).
>
> AFAICT this would enable your use case without adding any new kernel
> functionality or changing the BPF-using applications, while allowing the
> privileged userspace daemon to make case-by-case decisions on each
> operation instead of granting blanket capabilities (which is my main
> objection to the token proposal, as we discussed on the last iteration
> of the series).

It's not "blanket" capabilities. You control types or maps and
programs that could be created. And again, CAP_SYS_ADMIN guarded.
Please, don't give CAP_SYS_ADMIN/root permissions to applications you
can't be sure won't do something stupid and blame kernel API for it.

After all, the root process can setuid() any file and make it run with
elevated permissions, right? Doesn't get more "blanket" than that.

>
> So I'm curious whether you considered this as an alternative to
> BPF_TOKEN? And if so, what your reason was for rejecting it?
>

Yes, I'm aware, Christian has a follow up short blog post specifically
for using this for proxying BPF from privileged process ([0]).

So, in short, I think it's not a good generic solution. It's very
fragile and high-maintenance. It's still proxying BPF UAPI (except
application does preserve illusion of using BPF syscall, yes, that
part is good) with all the implications: needing to replicate all of
UAPI (fetching all those FDs from another process, following all the
pointers from another process' memory, etc), and also writing back all
the correct things (into another process' memory): log content,
log_true_size (out param), any other output parameters. What do we do
when an application uses a newer version of bpf_attr that is supported
by proxy? And honestly, I'm like 99% sure there are lots of less
obvious issues one runs into when starting implementing something like
this.

This sounds like a hack and nightmare to implement and support.
Perhaps that indirectly is supported by the fact that even Christian
half-jokingly calls this a crazy approach. That code basically is
unchanged for the last three years, with only one fix from Christian
one year after initial introduction ([1]) to fix a quirky issue
related to the limitation of pidfd working only for thread group
leaders. It also still supports only BPF_PROG_TYPE_CGROUP_DEVICE
program loading, it doesn't support a bunch of newer BPF_PROG_LOAD
fields and functionality, etc, etc.

So as a technical curiosity it's pretty cool and perhaps is the right
tool for the job for very narrow specific use cases. But as a
realistic generic approach that could be used by industry at large for
safe BPF usage from namespaced containers -- not so much.


  [0] https://brauner.io/2020/08/07/seccomp-notify-intercepting-the-bpf-syscall.html
  [1] https://github.com/lxc/lxd/commit/566d0a3b3cbe288787886c2f3bf5b250ceb930b0


> -Toke
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
                   ` (14 preceding siblings ...)
  2023-06-29 23:15 ` [PATCH RESEND v3 bpf-next 00/14] BPF token Toke Høiland-Jørgensen
@ 2023-07-01  2:05 ` Yafang Shao
  2023-07-05 20:37   ` Andrii Nakryiko
  15 siblings, 1 reply; 48+ messages in thread
From: Yafang Shao @ 2023-07-01  2:05 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, linux-security-module, keescook, brauner, lennart, cyphar,
	luto, kernel-team, sargun

On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> This patch set introduces new BPF object, BPF token, which allows to delegate
> a subset of BPF functionality from privileged system-wide daemon (e.g.,
> systemd or any other container manager) to a *trusted* unprivileged
> application. Trust is the key here. This functionality is not about allowing
> unconditional unprivileged BPF usage. Establishing trust, though, is
> completely up to the discretion of respective privileged application that
> would create a BPF token, as different production setups can and do achieve it
> through a combination of different means (signing, LSM, code reviews, etc),
> and it's undesirable and infeasible for kernel to enforce any particular way
> of validating trustworthiness of particular process.
>
> The main motivation for BPF token is a desire to enable containerized
> BPF applications to be used together with user namespaces. This is currently
> impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> arbitrary memory, and it's impossible to ensure that they only read memory of
> processes belonging to any given namespace. This means that it's impossible to
> have namespace-aware CAP_BPF capability, and as such another mechanism to
> allow safe usage of BPF functionality is necessary. BPF token and delegation
> of it to a trusted unprivileged applications is such mechanism. Kernel makes
> no assumption about what "trusted" constitutes in any particular case, and
> it's up to specific privileged applications and their surrounding
> infrastructure to decide that. What kernel provides is a set of APIs to create
> and tune BPF token, and pass it around to privileged BPF commands that are
> creating new BPF objects like BPF programs, BPF maps, etc.
>
> Previous attempt at addressing this very same problem ([0]) attempted to
> utilize authoritative LSM approach, but was conclusively rejected by upstream
> LSM maintainers. BPF token concept is not changing anything about LSM
> approach, but can be combined with LSM hooks for very fine-grained security
> policy. Some ideas about making BPF token more convenient to use with LSM (in
> particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> 2023 presentation ([1]). E.g., an ability to specify user-provided data
> (context), which in combination with BPF LSM would allow implementing a very
> dynamic and fine-granular custom security policies on top of BPF token. In the
> interest of minimizing API surface area discussions this is going to be
> added in follow up patches, as it's not essential to the fundamental concept
> of delegatable BPF token.
>
> It should be noted that BPF token is conceptually quite similar to the idea of
> /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> difference is the idea of using virtual anon_inode file to hold BPF token and
> allowing multiple independent instances of them, each with its own set of
> restrictions. BPF pinning solves the problem of exposing such BPF token
> through file system (BPF FS, in this case) for cases where transferring FDs
> over Unix domain sockets is not convenient. And also, crucially, BPF token
> approach is not using any special stateful task-scoped flags. Instead, bpf()
> syscall accepts token_fd parameters explicitly for each relevant BPF command.
> This addresses main concerns brought up during the /dev/bpf discussion, and
> fits better with overall BPF subsystem design.
>
> This patch set adds a basic minimum of functionality to make BPF token useful
> and to discuss API and functionality. Currently only low-level libbpf APIs
> support passing BPF token around, allowing to test kernel functionality, but
> for the most part is not sufficient for real-world applications, which
> typically use high-level libbpf APIs based on `struct bpf_object` type. This
> was done with the intent to limit the size of patch set and concentrate on
> mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> as a separate follow up patch set kernel support makes it upstream.
>
> Another part that should happen once kernel-side BPF token is established, is
> a set of conventions between applications (e.g., systemd), tools (e.g.,
> bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> at well-defined locations to allow applications take advantage of this in
> automatic fashion without explicit code changes on BPF application's side.
> But I'd like to postpone this discussion to after BPF token concept lands.
>
> Once important distinctions from v2 that should be noted is a chance in the
> semantics of a newly added BPF_TOKEN_CREATE command. Previously,
> BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to
> user-space, allowing to (optionally) pin it in BPF FS using BPF_OBJ_PIN
> command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF
> token object creation *and* pinning in BPF FS. Such change ensures that BPF
> token is always associated with a specific instance of BPF FS and cannot
> "escape" it by application re-pinning it somewhere else using another
> BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation,
> better containing it inside intended container (under assumption BPF FS is set
> up in such a way as to not be shared with other containers on the system).
>
>   [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
>   [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
>   [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
>
> v3->v3-resend:
>   - I started integrating token_fd into bpf_object_open_opts and higher-level
>     libbpf bpf_object APIs, but it started going a bit deeper into bpf_object
>     implementation details and how libbpf performs feature detection and
>     caching, so I decided to keep it separate from this patch set and not
>     distract from the mostly kernel-side changes;
> v2->v3:
>   - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
>     BPF_OBJ_PIN for BPF token;
> v1->v2:
>   - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
>   - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
>
> Andrii Nakryiko (14):
>   bpf: introduce BPF token object
>   libbpf: add bpf_token_create() API
>   selftests/bpf: add BPF_TOKEN_CREATE test
>   bpf: add BPF token support to BPF_MAP_CREATE command
>   libbpf: add BPF token support to bpf_map_create() API
>   selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
>   bpf: add BPF token support to BPF_BTF_LOAD command
>   libbpf: add BPF token support to bpf_btf_load() API
>   selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
>   bpf: add BPF token support to BPF_PROG_LOAD command
>   bpf: take into account BPF token when fetching helper protos
>   bpf: consistenly use BPF token throughout BPF verifier logic
>   libbpf: add BPF token support to bpf_prog_load() API
>   selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests
>
>  drivers/media/rc/bpf-lirc.c                   |   2 +-
>  include/linux/bpf.h                           |  79 ++++-
>  include/linux/filter.h                        |   2 +-
>  include/uapi/linux/bpf.h                      |  53 ++++
>  kernel/bpf/Makefile                           |   2 +-
>  kernel/bpf/arraymap.c                         |   2 +-
>  kernel/bpf/cgroup.c                           |   6 +-
>  kernel/bpf/core.c                             |   3 +-
>  kernel/bpf/helpers.c                          |   6 +-
>  kernel/bpf/inode.c                            |  46 ++-
>  kernel/bpf/syscall.c                          | 183 +++++++++---
>  kernel/bpf/token.c                            | 201 +++++++++++++
>  kernel/bpf/verifier.c                         |  13 +-
>  kernel/trace/bpf_trace.c                      |   2 +-
>  net/core/filter.c                             |  36 +--
>  net/ipv4/bpf_tcp_ca.c                         |   2 +-
>  net/netfilter/nf_bpf_link.c                   |   2 +-
>  tools/include/uapi/linux/bpf.h                |  53 ++++
>  tools/lib/bpf/bpf.c                           |  35 ++-
>  tools/lib/bpf/bpf.h                           |  45 ++-
>  tools/lib/bpf/libbpf.map                      |   1 +
>  .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
>  .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
>  .../testing/selftests/bpf/prog_tests/token.c  | 277 ++++++++++++++++++
>  24 files changed, 957 insertions(+), 104 deletions(-)
>  create mode 100644 kernel/bpf/token.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c
>
> --
> 2.34.1
>
>


Hi Andrii,

Thanks for your proposal.
That seems to be a useful functionality, and yet I have some questions.

1. Why can't we add security_bpf_probe_read_{kernel,user}?
    If possible, we can use these LSM hooks to refuse the process to
read other tasks' information. E.g. if the other process is not within
the same cgroup or the same namespace, we just refuse the reading. I
think it is not hard to identify if the other process is within the
same cgroup or the same namespace.

2. Why can't we extend bpf_cookie?
   We're now using bpf_cookie to identify each user or each
application, and only the permitted cookies can create new probe
links.  However we find the bpf_cookie is only supported by tracing,
perf_event and kprobe_multi, so we're planning to extend it to other
possible link types, then we can use LSM hooks to control all bpf
links.  I think that the upstream kernel should also support
bpf_cookie for all bpf links. If possible, we will post it to the
upstream in the future.
   After I have read your BPF token proposal, I just have some other
ideas. Why can't we just extend bpf_cookie to all other BPF objects?
For example, all progs and maps should also have the bpf_cookie.


-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-06-29 23:15 ` [PATCH RESEND v3 bpf-next 00/14] BPF token Toke Høiland-Jørgensen
  2023-06-30 18:25   ` Andrii Nakryiko
@ 2023-07-02  6:59   ` Djalal Harouni
  2023-07-04  9:51   ` Christian Brauner
  2 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2023-07-02  6:59 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team, sargun

On Fri, Jun 30, 2023 at 1:16 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii@kernel.org> writes:
>
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token, as different production setups can and do achieve it
> > through a combination of different means (signing, LSM, code reviews, etc),
> > and it's undesirable and infeasible for kernel to enforce any particular way
> > of validating trustworthiness of particular process.
> >
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> So a colleague pointed out today that the Seccomp Notify functionality
> would be a way to achieve your stated goal of allowing unprivileged
> containers to (selectively) perform bpf() syscall operations. Christian
> Brauner has a pretty nice writeup of the functionality here:
> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>
> In fact he even mentions allowing unprivileged access to bpf() as a
> possible use case (in the second-to-last paragraph).
>
> AFAICT this would enable your use case without adding any new kernel
> functionality or changing the BPF-using applications, while allowing the
> privileged userspace daemon to make case-by-case decisions on each
> operation instead of granting blanket capabilities (which is my main
> objection to the token proposal, as we discussed on the last iteration
> of the series).
>
> So I'm curious whether you considered this as an alternative to
> BPF_TOKEN? And if so, what your reason was for rejecting it?

The Seccomp notifier is an answer 1. to special device nodes (or
arguably to simple cases...) , 2. a quick solution without changing
infrastructure and how the kernel deals with device nodes (doesn't
solve the root problem where this BPF series at least tries...), 3.
relies on Seccomp and would inherit its same limitation.

It clashes with BPF! BPF is not mknod, and most of its use cases are
*transparent to the workload*, they can't use Seccomp and are not
interested in it... Fd delegation is good design and applies to *all*
BPF use cases, all tools can take advantage of it, it is not
restricted to a special tool or daemon X.

Going further, hiding behind Seccomp notifier and such prevents BPF
from solving current and future problems.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-06-30 18:25   ` Andrii Nakryiko
@ 2023-07-04  9:38     ` Christian Brauner
  2023-07-04 23:20     ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 48+ messages in thread
From: Christian Brauner @ 2023-07-04  9:38 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Toke Høiland-Jørgensen, Andrii Nakryiko, bpf,
	linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, sargun

On Fri, Jun 30, 2023 at 11:25:57AM -0700, Andrii Nakryiko wrote:
> On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Andrii Nakryiko <andrii@kernel.org> writes:
> >
> > > This patch set introduces new BPF object, BPF token, which allows to delegate
> > > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > > systemd or any other container manager) to a *trusted* unprivileged
> > > application. Trust is the key here. This functionality is not about allowing
> > > unconditional unprivileged BPF usage. Establishing trust, though, is
> > > completely up to the discretion of respective privileged application that
> > > would create a BPF token, as different production setups can and do achieve it
> > > through a combination of different means (signing, LSM, code reviews, etc),
> > > and it's undesirable and infeasible for kernel to enforce any particular way
> > > of validating trustworthiness of particular process.
> > >
> > > The main motivation for BPF token is a desire to enable containerized
> > > BPF applications to be used together with user namespaces. This is currently
> > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > > arbitrary memory, and it's impossible to ensure that they only read memory of
> > > processes belonging to any given namespace. This means that it's impossible to
> > > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > > no assumption about what "trusted" constitutes in any particular case, and
> > > it's up to specific privileged applications and their surrounding
> > > infrastructure to decide that. What kernel provides is a set of APIs to create
> > > and tune BPF token, and pass it around to privileged BPF commands that are
> > > creating new BPF objects like BPF programs, BPF maps, etc.
> >
> > So a colleague pointed out today that the Seccomp Notify functionality
> > would be a way to achieve your stated goal of allowing unprivileged
> > containers to (selectively) perform bpf() syscall operations. Christian
> > Brauner has a pretty nice writeup of the functionality here:
> > https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
> >
> > In fact he even mentions allowing unprivileged access to bpf() as a
> > possible use case (in the second-to-last paragraph).
> >
> > AFAICT this would enable your use case without adding any new kernel
> > functionality or changing the BPF-using applications, while allowing the
> > privileged userspace daemon to make case-by-case decisions on each
> > operation instead of granting blanket capabilities (which is my main
> > objection to the token proposal, as we discussed on the last iteration
> > of the series).
> 
> It's not "blanket" capabilities. You control types or maps and
> programs that could be created. And again, CAP_SYS_ADMIN guarded.
> Please, don't give CAP_SYS_ADMIN/root permissions to applications you
> can't be sure won't do something stupid and blame kernel API for it.
> 
> After all, the root process can setuid() any file and make it run with
> elevated permissions, right? Doesn't get more "blanket" than that.
> 
> >
> > So I'm curious whether you considered this as an alternative to
> > BPF_TOKEN? And if so, what your reason was for rejecting it?
> >
> 
> Yes, I'm aware, Christian has a follow up short blog post specifically
> for using this for proxying BPF from privileged process ([0]).
> 
> So, in short, I think it's not a good generic solution. It's very
> fragile and high-maintenance. It's still proxying BPF UAPI (except
> application does preserve illusion of using BPF syscall, yes, that
> part is good) with all the implications: needing to replicate all of
> UAPI (fetching all those FDs from another process, following all the
> pointers from another process' memory, etc), and also writing back all
> the correct things (into another process' memory): log content,
> log_true_size (out param), any other output parameters. What do we do
> when an application uses a newer version of bpf_attr that is supported
> by proxy? And honestly, I'm like 99% sure there are lots of less
> obvious issues one runs into when starting implementing something like
> this.
> 
> This sounds like a hack and nightmare to implement and support.
> Perhaps that indirectly is supported by the fact that even Christian
> half-jokingly calls this a crazy approach. That code basically is
> unchanged for the last three years, with only one fix from Christian
> one year after initial introduction ([1]) to fix a quirky issue
> related to the limitation of pidfd working only for thread group
> leaders. It also still supports only BPF_PROG_TYPE_CGROUP_DEVICE
> program loading, it doesn't support a bunch of newer BPF_PROG_LOAD
> fields and functionality, etc, etc.
> 
> So as a technical curiosity it's pretty cool and perhaps is the right
> tool for the job for very narrow specific use cases. But as a
> realistic generic approach that could be used by industry at large for
> safe BPF usage from namespaced containers -- not so much.

Some background... When BPF & cgroup moved the devices cgroup from a
file-based cgroup controller into a BPF program it was technically an
immediate widespread regression.

The cgroup v1 controller was file based and supported seemlessly
switching between allow- and denylists. Whether that was ever sensible
is a separate question.

But what this meant was that any container runtime that used a simple
file-based mechanism now had to generate a BPF device program that
mirrored the cgroup v1 semantic such that the old syntax of the cgroup
v1 device controller would be correctly translated into a BPF devices
program.

In addition, this broke some nesting scenarios. So intercepting bpf()
via seccomp was specifically done to avoid devices cgroup regressions.
It was never meant to be a generic solution.

It also doesn't work for all cases as the seccomp notifier's supervision
mechanism isn't really a clean solution.

It's a pipe dream that you can transparently proxy system calls for
another process via seccomp for sufficiently complex system calls. We
did it for specific use-cases where we could sufficiently guarantee that
they could be safe. But to make this work it would involve way more
invasive changes:

* nesting/stacking of seccomp notifiers
* clean handling of pointer arguments in-kernel such that you can safely
  continue system calls being sure that they haven't been modified. This
  is currently only possible in scenarios where safety is guaranteed by
  the kernel refusing nonsensical or unsafe arguments
* correct privilege handling
  The seccomp notifier emulates system calls in userspace and thus has
  to mimick the privilege context of the task it is emulating the system
  call for in such a way that (i) it allows it to succeed by avoiding the
  privilege limitations of why the given system call was supposed to be
  proxied in the first place, (ii) it doesn't allow to circumvent other,
  generic restrictions that would otherwise cause the system call to
  fail. It's like saying e.g., "execute with most of the proxied task's
  creds but let it have a few more privileges". That's frail as Linux
  creds aren't really composable. That's why we have override_creds()
  not "add_creds()" and "subtract_creds()" which would probably be
  nicer.

Or it would have to be a generic first class kernel proxy which begs the
question why not change the subsystems itself to do this cleanly.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-06-29 23:15 ` [PATCH RESEND v3 bpf-next 00/14] BPF token Toke Høiland-Jørgensen
  2023-06-30 18:25   ` Andrii Nakryiko
  2023-07-02  6:59   ` Djalal Harouni
@ 2023-07-04  9:51   ` Christian Brauner
  2023-07-04 23:33     ` Toke Høiland-Jørgensen
  2023-07-05 20:39     ` Andrii Nakryiko
  2 siblings, 2 replies; 48+ messages in thread
From: Christian Brauner @ 2023-07-04  9:51 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, lennart,
	cyphar, luto, kernel-team, sargun

On Fri, Jun 30, 2023 at 01:15:47AM +0200, Toke Høiland-Jørgensen wrote:
> Andrii Nakryiko <andrii@kernel.org> writes:
> 
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token, as different production setups can and do achieve it
> > through a combination of different means (signing, LSM, code reviews, etc),
> > and it's undesirable and infeasible for kernel to enforce any particular way
> > of validating trustworthiness of particular process.
> >
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
> 
> So a colleague pointed out today that the Seccomp Notify functionality
> would be a way to achieve your stated goal of allowing unprivileged
> containers to (selectively) perform bpf() syscall operations. Christian
> Brauner has a pretty nice writeup of the functionality here:
> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development

I'm amazed you read this. :)
The seccomp notifier comes with a lot of caveats. I think it would be
impractical if not infeasible to handle bpf() delegation.

> 
> In fact he even mentions allowing unprivileged access to bpf() as a
> possible use case (in the second-to-last paragraph).

Yeah, I tried to work around a userspace regression with the
introduction of the cgroup v2 devices controller.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object Andrii Nakryiko
@ 2023-07-04 12:43   ` Christian Brauner
  2023-07-04 13:34     ` Christian Brauner
                       ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Christian Brauner @ 2023-07-04 12:43 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, sargun

On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
> Add new kind of BPF kernel object, BPF token. BPF token is meant to to
> allow delegating privileged BPF functionality, like loading a BPF
> program or creating a BPF map, from privileged process to a *trusted*
> unprivileged process, all while have a good amount of control over which
> privileged operations could be performed using provided BPF token.
> 
> This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
> allows to create a new BPF token object along with a set of allowed
> commands that such BPF token allows to unprivileged applications.
> Currently only BPF_TOKEN_CREATE command itself can be
> delegated, but other patches gradually add ability to delegate
> BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
> 
> The above means that new BPF tokens can be created using existing BPF
> token, if original privileged creator allowed BPF_TOKEN_CREATE command.
> New derived BPF token cannot be more powerful than the original BPF
> token.
> 
> Importantly, BPF token is automatically pinned at the specified location
> inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
> command, unlike BPF prog/map/btf/link. This provides more control over
> unintended sharing of BPF tokens through pinning it in another BPF FS
> instances.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---

The main issue I have with the token approach is that it is a completely
separate delegation vector on top of user namespaces. We mentioned this
duringthe conf and this was brought up on the thread here again as well.
Imho, that's a problem both security-wise and complexity-wise.

It's not great if each subsystem gets its own custom delegation
mechanism. This imposes such a taxing complexity on both kernel- and
userspace that it will quickly become a huge liability. So I would
really strongly encourage you to explore another direction.

I do think the spirit of your proposal is workable and that it can
mostly be kept in tact.

As mentioned before, bpffs has all the means to be taught delegation:

        // In container's user namespace
        fd_fs = fsopen("bpffs");

        // Delegating task in host userns (systemd-bpfd whatever you want)
        ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "delegate", ...);

        // In container's user namespace
        fd_mnt = fsmount(fd_fs, 0);

        ret = move_mount(fd_fs, "", -EBADF, "/my/fav/location", MOVE_MOUNT_F_EMPTY_PATH)

Roughly, this would mean:

(i) raise FS_USERNS_MOUNT on bpffs but guard it behind the "delegate"
    mount option. IOW, it's only possibly to mount bpffs as an
    unprivileged user if a delegating process like systemd-bpfd with
    system-level privileges has marked it as delegatable.
(ii) add fine-grained delegation options that you want this
     bpffs instance to allow via new mount options. Idk,

     // allow usage of foo
     fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "foo");

     // also allow usage of bar
     fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "bar");

     // reset allowed options
     fsconfig(fd_fs, FSCONFIG_SET_STRING, "");

     // allow usage of schmoo
     fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "schmoo");

This all seems more intuitive and integrates with user and mount
namespaces of the container. This can also work for restricting
non-userns bpf instances fwiw. You can also share instances via
bind-mount and so on. The userns of the bpffs instance can also be used
for permission checking provided a given functionality has been
delegated by e.g., systemd-bpfd or whatever.

So roughly - untested and unfinished:

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index b9b93b81af9a..c021b0a674bb 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -623,15 +623,24 @@ struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type typ
 }
 EXPORT_SYMBOL(bpf_prog_get_type_path);
 
+struct bpf_mount_opts {
+	umode_t mode;
+	bool delegate;
+	u64 abilities;
+};
+
 /*
  * Display the mount options in /proc/mounts.
  */
 static int bpf_show_options(struct seq_file *m, struct dentry *root)
 {
+	struct bpf_mount_opts *opts = root->d_sb->s_fs_info;
 	umode_t mode = d_inode(root)->i_mode & S_IALLUGO & ~S_ISVTX;
 
 	if (mode != S_IRWXUGO)
 		seq_printf(m, ",mode=%o", mode);
+	if (opts->delegate)
+		seq_printf(m, ",delegate");
 	return 0;
 }
 
@@ -655,17 +664,17 @@ static const struct super_operations bpf_super_ops = {
 
 enum {
 	OPT_MODE,
+	Opt_delegate,
+	Opt_abilities,
 };
 
 static const struct fs_parameter_spec bpf_fs_parameters[] = {
-	fsparam_u32oct	("mode",			OPT_MODE),
+	fsparam_u32oct	     ("mode",			OPT_MODE),
+	fsparam_flag_no	     ("delegate",		Opt_delegate),
+	fsparam_string       ("abilities",		Opt_abilities),
 	{}
 };
 
-struct bpf_mount_opts {
-	umode_t mode;
-};
-
 static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct bpf_mount_opts *opts = fc->fs_private;
@@ -694,6 +703,16 @@ static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	case OPT_MODE:
 		opts->mode = result.uint_32 & S_IALLUGO;
 		break;
+	case Opt_delegate:
+		if (fc->user_ns != &init_user_ns && !capable(CAP_SYS_ADMIN))
+			return -EPERM;
+
+		if (!result.negated)
+			opts->delegate = true;
+		break;
+	case Opt_abilities:
+		// parse param->string to opts->abilities
+		break;
 	}
 
 	return 0;
@@ -768,10 +787,20 @@ static int populate_bpffs(struct dentry *parent)
 static int bpf_fill_super(struct super_block *sb, struct fs_context *fc)
 {
 	static const struct tree_descr bpf_rfiles[] = { { "" } };
-	struct bpf_mount_opts *opts = fc->fs_private;
+	struct bpf_mount_opts *opts = sb->s_fs_info;
 	struct inode *inode;
 	int ret;
 
+	if (fc->user_ns != &init_user_ns && !opts->delegate) {
+		errorfc(fc, "Can't mount bpffs without delegation permissions");
+		return -EPERM;
+	}
+
+	if (opts->abilities && !opts->delegate) {
+		errorfc(fc, "Specifying abilities without enabling delegation");
+		return -EINVAL;
+	}
+
 	ret = simple_fill_super(sb, BPF_FS_MAGIC, bpf_rfiles);
 	if (ret)
 		return ret;
@@ -793,7 +822,10 @@ static int bpf_get_tree(struct fs_context *fc)
 
 static void bpf_free_fc(struct fs_context *fc)
 {
-	kfree(fc->fs_private);
+	struct bpf_mount_opts *opts = fc->s_fs_info;
+
+	if (opts)
+		kfree(opts);
 }
 
 static const struct fs_context_operations bpf_context_ops = {
@@ -815,17 +847,30 @@ static int bpf_init_fs_context(struct fs_context *fc)
 
 	opts->mode = S_IRWXUGO;
 
-	fc->fs_private = opts;
+	/* If an instance is delegated it will start with no abilities. */
+	opts->delegate = false;
+	opts->abilities = 0;
+
+	fc->s_fs_info = opts;
 	fc->ops = &bpf_context_ops;
 	return 0;
 }
 
+static void bpf_kill_super(struct super_block *sb)
+{
+	struct bpf_mount_opts *opts = sb->s_fs_info;
+
+	kill_litter_super(sb);
+	kfree(opts);
+}
+
 static struct file_system_type bpf_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "bpf",
 	.init_fs_context = bpf_init_fs_context,
 	.parameters	= bpf_fs_parameters,
-	.kill_sb	= kill_litter_super,
+	.kill_sb	= bpf_kill_super,
+	.fs_flags	= FS_USERNS_MOUNT,
 };
 
 static int __init bpf_init(void)

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-04 12:43   ` Christian Brauner
@ 2023-07-04 13:34     ` Christian Brauner
  2023-07-04 23:28     ` Toke Høiland-Jørgensen
  2023-07-05 14:16     ` Paul Moore
  2 siblings, 0 replies; 48+ messages in thread
From: Christian Brauner @ 2023-07-04 13:34 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, sargun

On Tue, Jul 04, 2023 at 02:43:59PM +0200, Christian Brauner wrote:
> On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
> > Add new kind of BPF kernel object, BPF token. BPF token is meant to to
> > allow delegating privileged BPF functionality, like loading a BPF
> > program or creating a BPF map, from privileged process to a *trusted*
> > unprivileged process, all while have a good amount of control over which
> > privileged operations could be performed using provided BPF token.
> > 
> > This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
> > allows to create a new BPF token object along with a set of allowed
> > commands that such BPF token allows to unprivileged applications.
> > Currently only BPF_TOKEN_CREATE command itself can be
> > delegated, but other patches gradually add ability to delegate
> > BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
> > 
> > The above means that new BPF tokens can be created using existing BPF
> > token, if original privileged creator allowed BPF_TOKEN_CREATE command.
> > New derived BPF token cannot be more powerful than the original BPF
> > token.
> > 
> > Importantly, BPF token is automatically pinned at the specified location
> > inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
> > command, unlike BPF prog/map/btf/link. This provides more control over
> > unintended sharing of BPF tokens through pinning it in another BPF FS
> > instances.
> > 
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> 
> The main issue I have with the token approach is that it is a completely
> separate delegation vector on top of user namespaces. We mentioned this
> duringthe conf and this was brought up on the thread here again as well.
> Imho, that's a problem both security-wise and complexity-wise.
> 
> It's not great if each subsystem gets its own custom delegation
> mechanism. This imposes such a taxing complexity on both kernel- and
> userspace that it will quickly become a huge liability. So I would
> really strongly encourage you to explore another direction.
> 
> I do think the spirit of your proposal is workable and that it can
> mostly be kept in tact.
> 
> As mentioned before, bpffs has all the means to be taught delegation:
> 
>         // In container's user namespace
>         fd_fs = fsopen("bpffs");
> 
>         // Delegating task in host userns (systemd-bpfd whatever you want)
>         ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "delegate", ...);
> 
>         // In container's user namespace
>         fd_mnt = fsmount(fd_fs, 0);
> 
>         ret = move_mount(fd_fs, "", -EBADF, "/my/fav/location", MOVE_MOUNT_F_EMPTY_PATH)
> 
> Roughly, this would mean:
> 
> (i) raise FS_USERNS_MOUNT on bpffs but guard it behind the "delegate"
>     mount option. IOW, it's only possibly to mount bpffs as an
>     unprivileged user if a delegating process like systemd-bpfd with
>     system-level privileges has marked it as delegatable.
> (ii) add fine-grained delegation options that you want this
>      bpffs instance to allow via new mount options. Idk,
> 
>      // allow usage of foo
>      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "foo");
> 
>      // also allow usage of bar
>      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "bar");
> 
>      // reset allowed options
>      fsconfig(fd_fs, FSCONFIG_SET_STRING, "");
> 
>      // allow usage of schmoo
>      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "schmoo");

This is really just one crummy way of doing this. It's ofc possible to
make this a binary struct if you wanted to; of any form:

struct bpf_delegation_opts {
	u64 a;
	u64 b;
	u64 c;
	u32 d;
	u32 e;
};

and then

struct bpf_delegation_opts opts = {
	.a = SOMETHING_SOMETHING,
	.d = SOMETHING_SOMETHING_ELSE,
};

fsconfig(fd_fs, FSCONFIG_SET_BINARY, "abilities", &opts, sizeof(opts));

you'll get:

param->size == sizeof(opts);
param->blob = memdup_user_nul();

and then you can version this by size like we do for extensible structs
and change whatever you'd like to change in the future.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-06-30 18:25   ` Andrii Nakryiko
  2023-07-04  9:38     ` Christian Brauner
@ 2023-07-04 23:20     ` Toke Høiland-Jørgensen
  2023-07-05 12:57       ` Stefano Brivio
  1 sibling, 1 reply; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-07-04 23:20 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team, sargun

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andrii@kernel.org> writes:
>>
>> > This patch set introduces new BPF object, BPF token, which allows to delegate
>> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
>> > systemd or any other container manager) to a *trusted* unprivileged
>> > application. Trust is the key here. This functionality is not about allowing
>> > unconditional unprivileged BPF usage. Establishing trust, though, is
>> > completely up to the discretion of respective privileged application that
>> > would create a BPF token, as different production setups can and do achieve it
>> > through a combination of different means (signing, LSM, code reviews, etc),
>> > and it's undesirable and infeasible for kernel to enforce any particular way
>> > of validating trustworthiness of particular process.
>> >
>> > The main motivation for BPF token is a desire to enable containerized
>> > BPF applications to be used together with user namespaces. This is currently
>> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
>> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
>> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
>> > arbitrary memory, and it's impossible to ensure that they only read memory of
>> > processes belonging to any given namespace. This means that it's impossible to
>> > have namespace-aware CAP_BPF capability, and as such another mechanism to
>> > allow safe usage of BPF functionality is necessary. BPF token and delegation
>> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
>> > no assumption about what "trusted" constitutes in any particular case, and
>> > it's up to specific privileged applications and their surrounding
>> > infrastructure to decide that. What kernel provides is a set of APIs to create
>> > and tune BPF token, and pass it around to privileged BPF commands that are
>> > creating new BPF objects like BPF programs, BPF maps, etc.
>>
>> So a colleague pointed out today that the Seccomp Notify functionality
>> would be a way to achieve your stated goal of allowing unprivileged
>> containers to (selectively) perform bpf() syscall operations. Christian
>> Brauner has a pretty nice writeup of the functionality here:
>> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>>
>> In fact he even mentions allowing unprivileged access to bpf() as a
>> possible use case (in the second-to-last paragraph).
>>
>> AFAICT this would enable your use case without adding any new kernel
>> functionality or changing the BPF-using applications, while allowing the
>> privileged userspace daemon to make case-by-case decisions on each
>> operation instead of granting blanket capabilities (which is my main
>> objection to the token proposal, as we discussed on the last iteration
>> of the series).
>
> It's not "blanket" capabilities. You control types or maps and
> programs that could be created. And again, CAP_SYS_ADMIN guarded.
> Please, don't give CAP_SYS_ADMIN/root permissions to applications you
> can't be sure won't do something stupid and blame kernel API for it.

Right, I didn't mean "blanket" in the sense of "permission to do
anything on the system"; I do get that you can restrict which subset of
functionality you grant. However, *within* that subset, it's a blanket
permission grant. I.e., you can't issue a token that grants a *specific*
application permission to load a *specific* BPF program - you can only
grant a general "load any program" permission that can be used by anyone
who possesses the token.

I guess we could in principle extend the token mechanism to allow this,
but the kernel doesn't seem like the right place to implement such a
fine-grained policy engine...

> After all, the root process can setuid() any file and make it run with
> elevated permissions, right? Doesn't get more "blanket" than that.

Which is exactly why setuid binaries are not generally how we implement
security delegation these days. So I don't think designing a new
mechanism this way is a good idea.

>> So I'm curious whether you considered this as an alternative to
>> BPF_TOKEN? And if so, what your reason was for rejecting it?
>>
>
> Yes, I'm aware, Christian has a follow up short blog post specifically
> for using this for proxying BPF from privileged process ([0]).
>
> So, in short, I think it's not a good generic solution. It's very
> fragile and high-maintenance. It's still proxying BPF UAPI (except
> application does preserve illusion of using BPF syscall, yes, that
> part is good) with all the implications: needing to replicate all of
> UAPI (fetching all those FDs from another process, following all the
> pointers from another process' memory, etc), and also writing back all
> the correct things (into another process' memory): log content,
> log_true_size (out param), any other output parameters.

Right, OK, that bit does sound pretty tedious (although I'll note that
there are people who are trying to make all this generally more
palatable[0]).

However, all that tediousness could be avoided while still retaining the
model of blocking the syscall and asking a userspace policy daemon to
supply a verdict. This could even be done using the same token
mechanism: instead of attaching a permission to the token itself, just
make it an opaque identifier. Then, when a syscall is made that contains
the token, block it and send a notification to user space and use the
verdict that comes back in place of the token "value". The notification
could go through the same file descriptor (using read/write or an ioctl,
restricted to CAP_SYS_ADMIN), or it could be a separate one that is
returned alongside it on TOKEN_CREATE. The notification could include
all of the syscall args or a subset, depending on the command, but the
kernel can ensure there are no TOCTOU races, and no need for the policy
daemon to go poking into other another process' namespace.

Actually, using this model I don't think we would even strictly speaking
need the explicit token FD to be included by the calling application
inside the container at all? I.e., if the system policy daemon could
just instruct the kernel "please delegate all permission decisions for
this user namespace to me", it could - so to speak - issue tokens on
demand as each call is made, instead of ahead of time. Which would both
enable the policy daemon to make specific usage decisions, and wouldn't
require any change needed to the applications using BPF inside the
container (not even to include the BPF token FD).

-Toke


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-04 12:43   ` Christian Brauner
  2023-07-04 13:34     ` Christian Brauner
@ 2023-07-04 23:28     ` Toke Høiland-Jørgensen
  2023-07-05  7:20       ` Daniel Borkmann
  2023-07-05 14:16     ` Paul Moore
  2 siblings, 1 reply; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-07-04 23:28 UTC (permalink / raw)
  To: Christian Brauner, Andrii Nakryiko
  Cc: bpf, linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, sargun

Christian Brauner <brauner@kernel.org> writes:

> On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
>> Add new kind of BPF kernel object, BPF token. BPF token is meant to to
>> allow delegating privileged BPF functionality, like loading a BPF
>> program or creating a BPF map, from privileged process to a *trusted*
>> unprivileged process, all while have a good amount of control over which
>> privileged operations could be performed using provided BPF token.
>> 
>> This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
>> allows to create a new BPF token object along with a set of allowed
>> commands that such BPF token allows to unprivileged applications.
>> Currently only BPF_TOKEN_CREATE command itself can be
>> delegated, but other patches gradually add ability to delegate
>> BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
>> 
>> The above means that new BPF tokens can be created using existing BPF
>> token, if original privileged creator allowed BPF_TOKEN_CREATE command.
>> New derived BPF token cannot be more powerful than the original BPF
>> token.
>> 
>> Importantly, BPF token is automatically pinned at the specified location
>> inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
>> command, unlike BPF prog/map/btf/link. This provides more control over
>> unintended sharing of BPF tokens through pinning it in another BPF FS
>> instances.
>> 
>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>> ---
>
> The main issue I have with the token approach is that it is a completely
> separate delegation vector on top of user namespaces. We mentioned this
> duringthe conf and this was brought up on the thread here again as well.
> Imho, that's a problem both security-wise and complexity-wise.
>
> It's not great if each subsystem gets its own custom delegation
> mechanism. This imposes such a taxing complexity on both kernel- and
> userspace that it will quickly become a huge liability. So I would
> really strongly encourage you to explore another direction.

I share this concern as well, but I'm not quite sure I follow your
proposal here. IIUC, you're saying that instead of creating the token
using a BPF_TOKEN_CREATE command, the policy daemon should create a
bpffs instance and attach the token value directly to that, right? But
then what? Are you proposing that the calling process inside the
container open a filesystem reference (how? using fspick()?) and pass
that to the bpf syscall? Or is there some way to find the right
filesystem instance to extract this from at the time that the bpf()
syscall is issued inside the container?

-Toke


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-07-04  9:51   ` Christian Brauner
@ 2023-07-04 23:33     ` Toke Høiland-Jørgensen
  2023-07-05 20:39     ` Andrii Nakryiko
  1 sibling, 0 replies; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-07-04 23:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, lennart,
	cyphar, luto, kernel-team, sargun

Christian Brauner <brauner@kernel.org> writes:

> On Fri, Jun 30, 2023 at 01:15:47AM +0200, Toke Høiland-Jørgensen wrote:
>> Andrii Nakryiko <andrii@kernel.org> writes:
>> 
>> > This patch set introduces new BPF object, BPF token, which allows to delegate
>> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
>> > systemd or any other container manager) to a *trusted* unprivileged
>> > application. Trust is the key here. This functionality is not about allowing
>> > unconditional unprivileged BPF usage. Establishing trust, though, is
>> > completely up to the discretion of respective privileged application that
>> > would create a BPF token, as different production setups can and do achieve it
>> > through a combination of different means (signing, LSM, code reviews, etc),
>> > and it's undesirable and infeasible for kernel to enforce any particular way
>> > of validating trustworthiness of particular process.
>> >
>> > The main motivation for BPF token is a desire to enable containerized
>> > BPF applications to be used together with user namespaces. This is currently
>> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
>> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
>> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
>> > arbitrary memory, and it's impossible to ensure that they only read memory of
>> > processes belonging to any given namespace. This means that it's impossible to
>> > have namespace-aware CAP_BPF capability, and as such another mechanism to
>> > allow safe usage of BPF functionality is necessary. BPF token and delegation
>> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
>> > no assumption about what "trusted" constitutes in any particular case, and
>> > it's up to specific privileged applications and their surrounding
>> > infrastructure to decide that. What kernel provides is a set of APIs to create
>> > and tune BPF token, and pass it around to privileged BPF commands that are
>> > creating new BPF objects like BPF programs, BPF maps, etc.
>> 
>> So a colleague pointed out today that the Seccomp Notify functionality
>> would be a way to achieve your stated goal of allowing unprivileged
>> containers to (selectively) perform bpf() syscall operations. Christian
>> Brauner has a pretty nice writeup of the functionality here:
>> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>
> I'm amazed you read this. :)

I found it quite an enjoyable read, actually :)

> The seccomp notifier comes with a lot of caveats. I think it would be
> impractical if not infeasible to handle bpf() delegation.

Right, thank you for chiming in and explaining the context. I replied
elsewhere in the thread on the content, so let's not fork the discussion
any more than we have to...

-Toke


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-04 23:28     ` Toke Høiland-Jørgensen
@ 2023-07-05  7:20       ` Daniel Borkmann
  2023-07-05  8:45         ` Christian Brauner
  0 siblings, 1 reply; 48+ messages in thread
From: Daniel Borkmann @ 2023-07-05  7:20 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Christian Brauner, Andrii Nakryiko
  Cc: bpf, linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, sargun

On 7/5/23 1:28 AM, Toke Høiland-Jørgensen wrote:
> Christian Brauner <brauner@kernel.org> writes:
>> On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
>>> Add new kind of BPF kernel object, BPF token. BPF token is meant to to
>>> allow delegating privileged BPF functionality, like loading a BPF
>>> program or creating a BPF map, from privileged process to a *trusted*
>>> unprivileged process, all while have a good amount of control over which
>>> privileged operations could be performed using provided BPF token.
>>>
>>> This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
>>> allows to create a new BPF token object along with a set of allowed
>>> commands that such BPF token allows to unprivileged applications.
>>> Currently only BPF_TOKEN_CREATE command itself can be
>>> delegated, but other patches gradually add ability to delegate
>>> BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
>>>
>>> The above means that new BPF tokens can be created using existing BPF
>>> token, if original privileged creator allowed BPF_TOKEN_CREATE command.
>>> New derived BPF token cannot be more powerful than the original BPF
>>> token.
>>>
>>> Importantly, BPF token is automatically pinned at the specified location
>>> inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
>>> command, unlike BPF prog/map/btf/link. This provides more control over
>>> unintended sharing of BPF tokens through pinning it in another BPF FS
>>> instances.
>>>
>>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>>> ---
>>
>> The main issue I have with the token approach is that it is a completely
>> separate delegation vector on top of user namespaces. We mentioned this
>> duringthe conf and this was brought up on the thread here again as well.
>> Imho, that's a problem both security-wise and complexity-wise.
>>
>> It's not great if each subsystem gets its own custom delegation
>> mechanism. This imposes such a taxing complexity on both kernel- and
>> userspace that it will quickly become a huge liability. So I would
>> really strongly encourage you to explore another direction.
> 
> I share this concern as well, but I'm not quite sure I follow your
> proposal here. IIUC, you're saying that instead of creating the token
> using a BPF_TOKEN_CREATE command, the policy daemon should create a
> bpffs instance and attach the token value directly to that, right? But
> then what? Are you proposing that the calling process inside the
> container open a filesystem reference (how? using fspick()?) and pass
> that to the bpf syscall? Or is there some way to find the right
> filesystem instance to extract this from at the time that the bpf()
> syscall is issued inside the container?

Given there can be multiple bpffs instances, it would have to be similar
as to what Andrii did in that you need to pass the fd to the bpf(2) for
prog/map creation in order to retrieve the opts->abilities from the super
block.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-05  7:20       ` Daniel Borkmann
@ 2023-07-05  8:45         ` Christian Brauner
  2023-07-05 12:34           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 48+ messages in thread
From: Christian Brauner @ 2023-07-05  8:45 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Toke Høiland-Jørgensen, Andrii Nakryiko, bpf,
	linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, sargun

On Wed, Jul 05, 2023 at 09:20:28AM +0200, Daniel Borkmann wrote:
> On 7/5/23 1:28 AM, Toke Høiland-Jørgensen wrote:
> > Christian Brauner <brauner@kernel.org> writes:
> > > On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
> > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to to
> > > > allow delegating privileged BPF functionality, like loading a BPF
> > > > program or creating a BPF map, from privileged process to a *trusted*
> > > > unprivileged process, all while have a good amount of control over which
> > > > privileged operations could be performed using provided BPF token.
> > > > 
> > > > This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
> > > > allows to create a new BPF token object along with a set of allowed
> > > > commands that such BPF token allows to unprivileged applications.
> > > > Currently only BPF_TOKEN_CREATE command itself can be
> > > > delegated, but other patches gradually add ability to delegate
> > > > BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
> > > > 
> > > > The above means that new BPF tokens can be created using existing BPF
> > > > token, if original privileged creator allowed BPF_TOKEN_CREATE command.
> > > > New derived BPF token cannot be more powerful than the original BPF
> > > > token.
> > > > 
> > > > Importantly, BPF token is automatically pinned at the specified location
> > > > inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
> > > > command, unlike BPF prog/map/btf/link. This provides more control over
> > > > unintended sharing of BPF tokens through pinning it in another BPF FS
> > > > instances.
> > > > 
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > 
> > > The main issue I have with the token approach is that it is a completely
> > > separate delegation vector on top of user namespaces. We mentioned this
> > > duringthe conf and this was brought up on the thread here again as well.
> > > Imho, that's a problem both security-wise and complexity-wise.
> > > 
> > > It's not great if each subsystem gets its own custom delegation
> > > mechanism. This imposes such a taxing complexity on both kernel- and
> > > userspace that it will quickly become a huge liability. So I would
> > > really strongly encourage you to explore another direction.
> > 
> > I share this concern as well, but I'm not quite sure I follow your
> > proposal here. IIUC, you're saying that instead of creating the token
> > using a BPF_TOKEN_CREATE command, the policy daemon should create a
> > bpffs instance and attach the token value directly to that, right? But
> > then what? Are you proposing that the calling process inside the
> > container open a filesystem reference (how? using fspick()?) and pass
> > that to the bpf syscall? Or is there some way to find the right
> > filesystem instance to extract this from at the time that the bpf()
> > syscall is issued inside the container?
> 
> Given there can be multiple bpffs instances, it would have to be similar
> as to what Andrii did in that you need to pass the fd to the bpf(2) for
> prog/map creation in order to retrieve the opts->abilities from the super
> block.

I think it's pretty flexible what one can do here. Off the top of my
head there could be a dedicated file like /sys/fs/bpf/delegate which
only exists if delegation has been enabled. Thought that might be just a
wasted inode. There could be a new ioctl() on bpffsd which has the same
effect.

Probably an ioctl() on the bpffs instance is easier to grok. You could
even take away rights granted by a bpffs instance from such an fd via
additional ioctl() on it.

For increased limitations, it's also possible to have an optional
write-time security check from within the bpf call itself, e.g.,

    sys_bpf(fd_delegate)
    {
                struct fd fd = fdget_raw(fd_delegate);

                /* That token is only valid within a single user namespace ... */
                if (fd.file->f_cred->user_ns != current_user_ns())
                        return -EINVAL;

                /* woah, no CAP_BPF? */
                if (!ns_capable(fd.file->cred->user_ns, CAP_BPF))
                        return -EPERM;

                /* now check abilities */

                return 0;
    }

I'm not claiming that this is the silver bullet but it fits within the
framework of this approach and explicitly ties it into bpffs right from
the get go since this is the delegation mechanism's core.

The systemd-bpfd approach that was once pushed could probably also work
and I'm not up to date on why this was rejected. The issue against
systemd is still open.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-05  8:45         ` Christian Brauner
@ 2023-07-05 12:34           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-07-05 12:34 UTC (permalink / raw)
  To: Christian Brauner, Daniel Borkmann
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, lennart,
	cyphar, luto, kernel-team, sargun

Christian Brauner <brauner@kernel.org> writes:

> On Wed, Jul 05, 2023 at 09:20:28AM +0200, Daniel Borkmann wrote:
>> On 7/5/23 1:28 AM, Toke Høiland-Jørgensen wrote:
>> > Christian Brauner <brauner@kernel.org> writes:
>> > > On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
>> > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to to
>> > > > allow delegating privileged BPF functionality, like loading a BPF
>> > > > program or creating a BPF map, from privileged process to a *trusted*
>> > > > unprivileged process, all while have a good amount of control over which
>> > > > privileged operations could be performed using provided BPF token.
>> > > > 
>> > > > This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
>> > > > allows to create a new BPF token object along with a set of allowed
>> > > > commands that such BPF token allows to unprivileged applications.
>> > > > Currently only BPF_TOKEN_CREATE command itself can be
>> > > > delegated, but other patches gradually add ability to delegate
>> > > > BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
>> > > > 
>> > > > The above means that new BPF tokens can be created using existing BPF
>> > > > token, if original privileged creator allowed BPF_TOKEN_CREATE command.
>> > > > New derived BPF token cannot be more powerful than the original BPF
>> > > > token.
>> > > > 
>> > > > Importantly, BPF token is automatically pinned at the specified location
>> > > > inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
>> > > > command, unlike BPF prog/map/btf/link. This provides more control over
>> > > > unintended sharing of BPF tokens through pinning it in another BPF FS
>> > > > instances.
>> > > > 
>> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>> > > > ---
>> > > 
>> > > The main issue I have with the token approach is that it is a completely
>> > > separate delegation vector on top of user namespaces. We mentioned this
>> > > duringthe conf and this was brought up on the thread here again as well.
>> > > Imho, that's a problem both security-wise and complexity-wise.
>> > > 
>> > > It's not great if each subsystem gets its own custom delegation
>> > > mechanism. This imposes such a taxing complexity on both kernel- and
>> > > userspace that it will quickly become a huge liability. So I would
>> > > really strongly encourage you to explore another direction.
>> > 
>> > I share this concern as well, but I'm not quite sure I follow your
>> > proposal here. IIUC, you're saying that instead of creating the token
>> > using a BPF_TOKEN_CREATE command, the policy daemon should create a
>> > bpffs instance and attach the token value directly to that, right? But
>> > then what? Are you proposing that the calling process inside the
>> > container open a filesystem reference (how? using fspick()?) and pass
>> > that to the bpf syscall? Or is there some way to find the right
>> > filesystem instance to extract this from at the time that the bpf()
>> > syscall is issued inside the container?
>> 
>> Given there can be multiple bpffs instances, it would have to be similar
>> as to what Andrii did in that you need to pass the fd to the bpf(2) for
>> prog/map creation in order to retrieve the opts->abilities from the super
>> block.
>
> I think it's pretty flexible what one can do here. Off the top of my
> head there could be a dedicated file like /sys/fs/bpf/delegate which
> only exists if delegation has been enabled. Thought that might be just a
> wasted inode. There could be a new ioctl() on bpffsd which has the same
> effect.
>
> Probably an ioctl() on the bpffs instance is easier to grok. You could
> even take away rights granted by a bpffs instance from such an fd via
> additional ioctl() on it.

Right, gotcha; I was missing whether there was an existing mechanism to
obtain this; an ioctl makes sense. I can see the utility in attaching
this to the file system instance instead of as a separate object that's
pinned (but see my post in the other subthread about using the "ask
userspace model instead").

-Toke


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-07-04 23:20     ` Toke Høiland-Jørgensen
@ 2023-07-05 12:57       ` Stefano Brivio
  0 siblings, 0 replies; 48+ messages in thread
From: Stefano Brivio @ 2023-07-05 12:57 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, brauner
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, linux-security-module,
	keescook, lennart, cyphar, luto, kernel-team, sargun,
	Alice Frosi

On Wed, 05 Jul 2023 01:20:22 +0200
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> 
> > On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:  
> >>
> >> Andrii Nakryiko <andrii@kernel.org> writes:
> >>  
> >> > This patch set introduces new BPF object, BPF token, which allows to delegate
> >> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> >> > systemd or any other container manager) to a *trusted* unprivileged
> >> > application. Trust is the key here. This functionality is not about allowing
> >> > unconditional unprivileged BPF usage. Establishing trust, though, is
> >> > completely up to the discretion of respective privileged application that
> >> > would create a BPF token, as different production setups can and do achieve it
> >> > through a combination of different means (signing, LSM, code reviews, etc),
> >> > and it's undesirable and infeasible for kernel to enforce any particular way
> >> > of validating trustworthiness of particular process.
> >> >
> >> > The main motivation for BPF token is a desire to enable containerized
> >> > BPF applications to be used together with user namespaces. This is currently
> >> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> >> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> >> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> >> > arbitrary memory, and it's impossible to ensure that they only read memory of
> >> > processes belonging to any given namespace. This means that it's impossible to
> >> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> >> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> >> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> >> > no assumption about what "trusted" constitutes in any particular case, and
> >> > it's up to specific privileged applications and their surrounding
> >> > infrastructure to decide that. What kernel provides is a set of APIs to create
> >> > and tune BPF token, and pass it around to privileged BPF commands that are
> >> > creating new BPF objects like BPF programs, BPF maps, etc.  
> >>
> >> So a colleague pointed out today that the Seccomp Notify functionality
> >> would be a way to achieve your stated goal of allowing unprivileged
> >> containers to (selectively) perform bpf() syscall operations. Christian
> >> Brauner has a pretty nice writeup of the functionality here:
> >> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
> >>
> >> In fact he even mentions allowing unprivileged access to bpf() as a
> >> possible use case (in the second-to-last paragraph).
> >>
> >> AFAICT this would enable your use case without adding any new kernel
> >> functionality or changing the BPF-using applications, while allowing the
> >> privileged userspace daemon to make case-by-case decisions on each
> >> operation instead of granting blanket capabilities (which is my main
> >> objection to the token proposal, as we discussed on the last iteration
> >> of the series).  
> >
> > It's not "blanket" capabilities. You control types or maps and
> > programs that could be created. And again, CAP_SYS_ADMIN guarded.
> > Please, don't give CAP_SYS_ADMIN/root permissions to applications you
> > can't be sure won't do something stupid and blame kernel API for it.  
> 
> Right, I didn't mean "blanket" in the sense of "permission to do
> anything on the system"; I do get that you can restrict which subset of
> functionality you grant. However, *within* that subset, it's a blanket
> permission grant. I.e., you can't issue a token that grants a *specific*
> application permission to load a *specific* BPF program - you can only
> grant a general "load any program" permission that can be used by anyone
> who possesses the token.
> 
> I guess we could in principle extend the token mechanism to allow this,
> but the kernel doesn't seem like the right place to implement such a
> fine-grained policy engine...
> 
> > After all, the root process can setuid() any file and make it run with
> > elevated permissions, right? Doesn't get more "blanket" than that.  
> 
> Which is exactly why setuid binaries are not generally how we implement
> security delegation these days. So I don't think designing a new
> mechanism this way is a good idea.
> 
> >> So I'm curious whether you considered this as an alternative to
> >> BPF_TOKEN? And if so, what your reason was for rejecting it?
> >>  
> >
> > Yes, I'm aware, Christian has a follow up short blog post specifically
> > for using this for proxying BPF from privileged process ([0]).
> >
> > So, in short, I think it's not a good generic solution. It's very
> > fragile and high-maintenance. It's still proxying BPF UAPI (except
> > application does preserve illusion of using BPF syscall, yes, that
> > part is good) with all the implications: needing to replicate all of
> > UAPI (fetching all those FDs from another process, following all the
> > pointers from another process' memory, etc), and also writing back all
> > the correct things (into another process' memory): log content,
> > log_true_size (out param), any other output parameters.  
> 
> Right, OK, that bit does sound pretty tedious (although I'll note that
> there are people who are trying to make all this generally more
> palatable[0]).

[0] https://seitan.rocks/ :)

Some clickbaiting for Christian: the presentation we gave a couple of
weeks ago, also linked from the project website, actually credits you
(slide 29/30, of course).

The code is still very much draft quality (we mostly focused on
demos/feasibility so far, cleaning it up now), and we didn't prove (at
least not yet) that handling complicated stuff such as bpf(2) is
actually convenient, but that's at least in scope as a stretch goal.
I'm not claiming it's doable, but we'd give it a try.

What we have at the moment is a meagre set of eight syscall models,
some blatantly incomplete.

A couple of comments to specific points Christian mentioned:

On Tue, 4 Jul 2023 11:38:38 +0200
Christian Brauner <brauner@kernel.org> wrote:

> It's a pipe dream that you can transparently proxy system calls for
> another process via seccomp for sufficiently complex system calls. We
> did it for specific use-cases where we could sufficiently guarantee that
> they could be safe.

Right, so we're trying to pick it up from there. It's way too early to
claim success, but I thought it would make sense to chime in anyway.

> But to make this work it would involve way more invasive changes:
> 
> * nesting/stacking of seccomp notifiers

The need for stacked seccomp filters is obvious to me and that works more
or less naturally. But why would you actually need to stack, or especially
nest *notifiers* themselves?

> * clean handling of pointer arguments in-kernel such that you can safely
>   continue system calls being sure that they haven't been modified. This
>   is currently only possible in scenarios where safety is guaranteed by
>   the kernel refusing nonsensical or unsafe arguments

We're considering a couple of options. One is to never use
SECCOMP_USER_NOTIF_FLAG_CONTINUE for system calls accepting pointers, or
only allowing that as an explicit "unsafe" option. For a "safe"
implementation, the supervisor (seitan) would in any case replay the
system call, matching the context (namespaces, credentials) of the target
process.

If PID or TID (per se, not in terms of associated context/capabilities) of
the caller matter for a specific system call, though, we simply can't
support that. But that shouldn't actually be relevant for bpf(2).

Strictly speaking, I think it's actually possible to "fix" this in the
kernel by means of checking or copying memory that's addressable by a
thread, but that might prove too invasive or end up in insurmountable
layering violations. This mechanism would involve "control" paths
rather than data paths, though, so the performance impact is not really
worrying.

Another option, which we outlined at this very convenient link:
  https://github.com/alicefr/community/blob/seitan/design-proposals/seitan/security-aspects-seitan.md#if-i-use-the-json-model-as-a-security-filter-can-another-thread-in-the-same-process-context-write-to-the-memory-area-pointed-to-by-system-call-arguments-while-the-calling-thread-is-blocked-and-defy-the-purpose-of-the-filter

would be to make the supervisor perform a deep copy (system calls are
anyway modeled in the seitan-cooker component) and then use good old
ptrace(2) as needed.

> * correct privilege handling
>   The seccomp notifier emulates system calls in userspace and thus has
>   to mimick the privilege context of the task it is emulating the system
>   call for in such a way that (i) it allows it to succeed by avoiding the
>   privilege limitations of why the given system call was supposed to be
>   proxied in the first place, (ii) it doesn't allow to circumvent other,
>   generic restrictions that would otherwise cause the system call to
>   fail. It's like saying e.g., "execute with most of the proxied task's
>   creds but let it have a few more privileges". That's frail as Linux
>   creds aren't really composable. That's why we have override_creds()
>   not "add_creds()" and "subtract_creds()" which would probably be
>   nicer.

Right, at the moment we just run that as root, but we plan to take care
of (ii) (albeit not solving it entirely, I guess), by at least applying a
seccomp filter to the supervisor itself. As to the set of (composed?)
capabilities, we don't have an answer yet.

> Or it would have to be a generic first class kernel proxy which begs the
> question why not change the subsystems itself to do this cleanly.

Well, the fine-grained "policy" implementation we're trying to achieve
looks to me like something that's a bit too complicated for the kernel,
and really more appropriate for userspace.

-- 
Stefano


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-04 12:43   ` Christian Brauner
  2023-07-04 13:34     ` Christian Brauner
  2023-07-04 23:28     ` Toke Høiland-Jørgensen
@ 2023-07-05 14:16     ` Paul Moore
  2023-07-05 14:42       ` Christian Brauner
  2 siblings, 1 reply; 48+ messages in thread
From: Paul Moore @ 2023-07-05 14:16 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, lennart,
	cyphar, luto, kernel-team, sargun

On Tue, Jul 4, 2023 at 8:44 AM Christian Brauner <brauner@kernel.org> wrote:
> On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
> > Add new kind of BPF kernel object, BPF token. BPF token is meant to to
> > allow delegating privileged BPF functionality, like loading a BPF
> > program or creating a BPF map, from privileged process to a *trusted*
> > unprivileged process, all while have a good amount of control over which
> > privileged operations could be performed using provided BPF token.
> >
> > This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
> > allows to create a new BPF token object along with a set of allowed
> > commands that such BPF token allows to unprivileged applications.
> > Currently only BPF_TOKEN_CREATE command itself can be
> > delegated, but other patches gradually add ability to delegate
> > BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
> >
> > The above means that new BPF tokens can be created using existing BPF
> > token, if original privileged creator allowed BPF_TOKEN_CREATE command.
> > New derived BPF token cannot be more powerful than the original BPF
> > token.
> >
> > Importantly, BPF token is automatically pinned at the specified location
> > inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
> > command, unlike BPF prog/map/btf/link. This provides more control over
> > unintended sharing of BPF tokens through pinning it in another BPF FS
> > instances.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
>
> The main issue I have with the token approach is that it is a completely
> separate delegation vector on top of user namespaces. We mentioned this
> duringthe conf and this was brought up on the thread here again as well.
> Imho, that's a problem both security-wise and complexity-wise.
>
> It's not great if each subsystem gets its own custom delegation
> mechanism. This imposes such a taxing complexity on both kernel- and
> userspace that it will quickly become a huge liability. So I would
> really strongly encourage you to explore another direction.
>
> I do think the spirit of your proposal is workable and that it can
> mostly be kept in tact.
>
> As mentioned before, bpffs has all the means to be taught delegation:
>
>         // In container's user namespace
>         fd_fs = fsopen("bpffs");
>
>         // Delegating task in host userns (systemd-bpfd whatever you want)
>         ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "delegate", ...);
>
>         // In container's user namespace
>         fd_mnt = fsmount(fd_fs, 0);
>
>         ret = move_mount(fd_fs, "", -EBADF, "/my/fav/location", MOVE_MOUNT_F_EMPTY_PATH)
>
> Roughly, this would mean:
>
> (i) raise FS_USERNS_MOUNT on bpffs but guard it behind the "delegate"
>     mount option. IOW, it's only possibly to mount bpffs as an
>     unprivileged user if a delegating process like systemd-bpfd with
>     system-level privileges has marked it as delegatable.
> (ii) add fine-grained delegation options that you want this
>      bpffs instance to allow via new mount options. Idk,
>
>      // allow usage of foo
>      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "foo");
>
>      // also allow usage of bar
>      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "bar");
>
>      // reset allowed options
>      fsconfig(fd_fs, FSCONFIG_SET_STRING, "");
>
>      // allow usage of schmoo
>      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "schmoo");
>
> This all seems more intuitive and integrates with user and mount
> namespaces of the container. This can also work for restricting
> non-userns bpf instances fwiw. You can also share instances via
> bind-mount and so on. The userns of the bpffs instance can also be used
> for permission checking provided a given functionality has been
> delegated by e.g., systemd-bpfd or whatever.

I have no arguments against any of the above, and would prefer to see
something like this over a token-based mechanism.  However we do want
to make sure we have the proper LSM control points for either approach
so that admins who rely on LSM-based security policies can manage
delegation via their policies.

Using the fsconfig() approach described by Christian above, I believe
we should have the necessary hooks already in
security_fs_context_parse_param() and security_sb_mnt_opts() but I'm
basing that on a quick look this morning, some additional checking
would need to be done.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-05 14:16     ` Paul Moore
@ 2023-07-05 14:42       ` Christian Brauner
  2023-07-05 16:00         ` Paul Moore
  2023-07-05 21:38         ` Andrii Nakryiko
  0 siblings, 2 replies; 48+ messages in thread
From: Christian Brauner @ 2023-07-05 14:42 UTC (permalink / raw)
  To: Paul Moore
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, lennart,
	cyphar, luto, kernel-team, sargun

On Wed, Jul 05, 2023 at 10:16:13AM -0400, Paul Moore wrote:
> On Tue, Jul 4, 2023 at 8:44 AM Christian Brauner <brauner@kernel.org> wrote:
> > On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
> > > Add new kind of BPF kernel object, BPF token. BPF token is meant to to
> > > allow delegating privileged BPF functionality, like loading a BPF
> > > program or creating a BPF map, from privileged process to a *trusted*
> > > unprivileged process, all while have a good amount of control over which
> > > privileged operations could be performed using provided BPF token.
> > >
> > > This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
> > > allows to create a new BPF token object along with a set of allowed
> > > commands that such BPF token allows to unprivileged applications.
> > > Currently only BPF_TOKEN_CREATE command itself can be
> > > delegated, but other patches gradually add ability to delegate
> > > BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
> > >
> > > The above means that new BPF tokens can be created using existing BPF
> > > token, if original privileged creator allowed BPF_TOKEN_CREATE command.
> > > New derived BPF token cannot be more powerful than the original BPF
> > > token.
> > >
> > > Importantly, BPF token is automatically pinned at the specified location
> > > inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
> > > command, unlike BPF prog/map/btf/link. This provides more control over
> > > unintended sharing of BPF tokens through pinning it in another BPF FS
> > > instances.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> >
> > The main issue I have with the token approach is that it is a completely
> > separate delegation vector on top of user namespaces. We mentioned this
> > duringthe conf and this was brought up on the thread here again as well.
> > Imho, that's a problem both security-wise and complexity-wise.
> >
> > It's not great if each subsystem gets its own custom delegation
> > mechanism. This imposes such a taxing complexity on both kernel- and
> > userspace that it will quickly become a huge liability. So I would
> > really strongly encourage you to explore another direction.
> >
> > I do think the spirit of your proposal is workable and that it can
> > mostly be kept in tact.
> >
> > As mentioned before, bpffs has all the means to be taught delegation:
> >
> >         // In container's user namespace
> >         fd_fs = fsopen("bpffs");
> >
> >         // Delegating task in host userns (systemd-bpfd whatever you want)
> >         ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "delegate", ...);
> >
> >         // In container's user namespace
> >         fd_mnt = fsmount(fd_fs, 0);
> >
> >         ret = move_mount(fd_fs, "", -EBADF, "/my/fav/location", MOVE_MOUNT_F_EMPTY_PATH)
> >
> > Roughly, this would mean:
> >
> > (i) raise FS_USERNS_MOUNT on bpffs but guard it behind the "delegate"
> >     mount option. IOW, it's only possibly to mount bpffs as an
> >     unprivileged user if a delegating process like systemd-bpfd with
> >     system-level privileges has marked it as delegatable.
> > (ii) add fine-grained delegation options that you want this
> >      bpffs instance to allow via new mount options. Idk,
> >
> >      // allow usage of foo
> >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "foo");
> >
> >      // also allow usage of bar
> >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "bar");
> >
> >      // reset allowed options
> >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "");
> >
> >      // allow usage of schmoo
> >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "schmoo");
> >
> > This all seems more intuitive and integrates with user and mount
> > namespaces of the container. This can also work for restricting
> > non-userns bpf instances fwiw. You can also share instances via
> > bind-mount and so on. The userns of the bpffs instance can also be used
> > for permission checking provided a given functionality has been
> > delegated by e.g., systemd-bpfd or whatever.
> 
> I have no arguments against any of the above, and would prefer to see
> something like this over a token-based mechanism.  However we do want
> to make sure we have the proper LSM control points for either approach
> so that admins who rely on LSM-based security policies can manage
> delegation via their policies.
> 
> Using the fsconfig() approach described by Christian above, I believe
> we should have the necessary hooks already in
> security_fs_context_parse_param() and security_sb_mnt_opts() but I'm
> basing that on a quick look this morning, some additional checking
> would need to be done.

I think what I outlined is even unnecessarily complicated. You don't
need that pointless "delegate" mount option at all actually. Permission
to delegate shouldn't be checked when the mount option is set. The
permissions should be checked when the superblock is created. That's the
right point in time. So sm like:

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 4174f76133df..a2eb382f5457 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -746,6 +746,13 @@ static int bpf_fill_super(struct super_block *sb, struct fs_context *fc)
        struct inode *inode;
        int ret;

+       /*
+        * If you want to delegate this instance then you need to be
+        * privileged and know what you're doing. This isn't trust.
+        */
+       if ((fc->user_ns != &init_user_ns) && !capable(CAP_SYS_ADMIN))
+               return -EPERM;
+
        ret = simple_fill_super(sb, BPF_FS_MAGIC, bpf_rfiles);
        if (ret)
                return ret;
@@ -800,6 +807,7 @@ static struct file_system_type bpf_fs_type = {
        .init_fs_context = bpf_init_fs_context,
        .parameters     = bpf_fs_parameters,
        .kill_sb        = kill_litter_super,
+       .fs_flags       = FS_USERNS_MOUNT,
 };

 static int __init bpf_init(void)

In fact this is conceptually generalizable but I'd need to think about
that.

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-05 14:42       ` Christian Brauner
@ 2023-07-05 16:00         ` Paul Moore
  2023-07-05 21:38         ` Andrii Nakryiko
  1 sibling, 0 replies; 48+ messages in thread
From: Paul Moore @ 2023-07-05 16:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, lennart,
	cyphar, luto, kernel-team, sargun

On Wed, Jul 5, 2023 at 10:42 AM Christian Brauner <brauner@kernel.org> wrote:
> On Wed, Jul 05, 2023 at 10:16:13AM -0400, Paul Moore wrote:
> > On Tue, Jul 4, 2023 at 8:44 AM Christian Brauner <brauner@kernel.org> wrote:
> > > On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
> > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to to
> > > > allow delegating privileged BPF functionality, like loading a BPF
> > > > program or creating a BPF map, from privileged process to a *trusted*
> > > > unprivileged process, all while have a good amount of control over which
> > > > privileged operations could be performed using provided BPF token.
> > > >
> > > > This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
> > > > allows to create a new BPF token object along with a set of allowed
> > > > commands that such BPF token allows to unprivileged applications.
> > > > Currently only BPF_TOKEN_CREATE command itself can be
> > > > delegated, but other patches gradually add ability to delegate
> > > > BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
> > > >
> > > > The above means that new BPF tokens can be created using existing BPF
> > > > token, if original privileged creator allowed BPF_TOKEN_CREATE command.
> > > > New derived BPF token cannot be more powerful than the original BPF
> > > > token.
> > > >
> > > > Importantly, BPF token is automatically pinned at the specified location
> > > > inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
> > > > command, unlike BPF prog/map/btf/link. This provides more control over
> > > > unintended sharing of BPF tokens through pinning it in another BPF FS
> > > > instances.
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > >
> > > The main issue I have with the token approach is that it is a completely
> > > separate delegation vector on top of user namespaces. We mentioned this
> > > duringthe conf and this was brought up on the thread here again as well.
> > > Imho, that's a problem both security-wise and complexity-wise.
> > >
> > > It's not great if each subsystem gets its own custom delegation
> > > mechanism. This imposes such a taxing complexity on both kernel- and
> > > userspace that it will quickly become a huge liability. So I would
> > > really strongly encourage you to explore another direction.
> > >
> > > I do think the spirit of your proposal is workable and that it can
> > > mostly be kept in tact.
> > >
> > > As mentioned before, bpffs has all the means to be taught delegation:
> > >
> > >         // In container's user namespace
> > >         fd_fs = fsopen("bpffs");
> > >
> > >         // Delegating task in host userns (systemd-bpfd whatever you want)
> > >         ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "delegate", ...);
> > >
> > >         // In container's user namespace
> > >         fd_mnt = fsmount(fd_fs, 0);
> > >
> > >         ret = move_mount(fd_fs, "", -EBADF, "/my/fav/location", MOVE_MOUNT_F_EMPTY_PATH)
> > >
> > > Roughly, this would mean:
> > >
> > > (i) raise FS_USERNS_MOUNT on bpffs but guard it behind the "delegate"
> > >     mount option. IOW, it's only possibly to mount bpffs as an
> > >     unprivileged user if a delegating process like systemd-bpfd with
> > >     system-level privileges has marked it as delegatable.
> > > (ii) add fine-grained delegation options that you want this
> > >      bpffs instance to allow via new mount options. Idk,
> > >
> > >      // allow usage of foo
> > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "foo");
> > >
> > >      // also allow usage of bar
> > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "bar");
> > >
> > >      // reset allowed options
> > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "");
> > >
> > >      // allow usage of schmoo
> > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "schmoo");
> > >
> > > This all seems more intuitive and integrates with user and mount
> > > namespaces of the container. This can also work for restricting
> > > non-userns bpf instances fwiw. You can also share instances via
> > > bind-mount and so on. The userns of the bpffs instance can also be used
> > > for permission checking provided a given functionality has been
> > > delegated by e.g., systemd-bpfd or whatever.
> >
> > I have no arguments against any of the above, and would prefer to see
> > something like this over a token-based mechanism.  However we do want
> > to make sure we have the proper LSM control points for either approach
> > so that admins who rely on LSM-based security policies can manage
> > delegation via their policies.
> >
> > Using the fsconfig() approach described by Christian above, I believe
> > we should have the necessary hooks already in
> > security_fs_context_parse_param() and security_sb_mnt_opts() but I'm
> > basing that on a quick look this morning, some additional checking
> > would need to be done.
>
> I think what I outlined is even unnecessarily complicated. You don't
> need that pointless "delegate" mount option at all actually. Permission
> to delegate shouldn't be checked when the mount option is set. The
> permissions should be checked when the superblock is created.

From a LSM perspective I think we would want to have policy
enforcement points both when task A enables delegation and when task B
makes use of the delegation.  We would likely also want to be able to
add some additional delegation state to the superblock if delegation
was enabled in the first enforcement point.

I'm not too bothered by how that ends up looking from a userspace
perspective, but it seems like requiring an explicit "this fs can be
delegated" step would be a positive from a security perspective.  In
other words, just because a task *could* delegated a filesystem, may
not mean it *wants* to delegate a filesystem.

-- 
paul-moore.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-07-01  2:05 ` Yafang Shao
@ 2023-07-05 20:37   ` Andrii Nakryiko
  2023-07-06  1:26     ` Yafang Shao
  0 siblings, 1 reply; 48+ messages in thread
From: Andrii Nakryiko @ 2023-07-05 20:37 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team, sargun

On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token, as different production setups can and do achieve it
> > through a combination of different means (signing, LSM, code reviews, etc),
> > and it's undesirable and infeasible for kernel to enforce any particular way
> > of validating trustworthiness of particular process.
> >
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
> >
> > Previous attempt at addressing this very same problem ([0]) attempted to
> > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > LSM maintainers. BPF token concept is not changing anything about LSM
> > approach, but can be combined with LSM hooks for very fine-grained security
> > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > (context), which in combination with BPF LSM would allow implementing a very
> > dynamic and fine-granular custom security policies on top of BPF token. In the
> > interest of minimizing API surface area discussions this is going to be
> > added in follow up patches, as it's not essential to the fundamental concept
> > of delegatable BPF token.
> >
> > It should be noted that BPF token is conceptually quite similar to the idea of
> > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > difference is the idea of using virtual anon_inode file to hold BPF token and
> > allowing multiple independent instances of them, each with its own set of
> > restrictions. BPF pinning solves the problem of exposing such BPF token
> > through file system (BPF FS, in this case) for cases where transferring FDs
> > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > syscall accepts token_fd parameters explicitly for each relevant BPF command.
> > This addresses main concerns brought up during the /dev/bpf discussion, and
> > fits better with overall BPF subsystem design.
> >
> > This patch set adds a basic minimum of functionality to make BPF token useful
> > and to discuss API and functionality. Currently only low-level libbpf APIs
> > support passing BPF token around, allowing to test kernel functionality, but
> > for the most part is not sufficient for real-world applications, which
> > typically use high-level libbpf APIs based on `struct bpf_object` type. This
> > was done with the intent to limit the size of patch set and concentrate on
> > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> > as a separate follow up patch set kernel support makes it upstream.
> >
> > Another part that should happen once kernel-side BPF token is established, is
> > a set of conventions between applications (e.g., systemd), tools (e.g.,
> > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> > at well-defined locations to allow applications take advantage of this in
> > automatic fashion without explicit code changes on BPF application's side.
> > But I'd like to postpone this discussion to after BPF token concept lands.
> >
> > Once important distinctions from v2 that should be noted is a chance in the
> > semantics of a newly added BPF_TOKEN_CREATE command. Previously,
> > BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to
> > user-space, allowing to (optionally) pin it in BPF FS using BPF_OBJ_PIN
> > command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF
> > token object creation *and* pinning in BPF FS. Such change ensures that BPF
> > token is always associated with a specific instance of BPF FS and cannot
> > "escape" it by application re-pinning it somewhere else using another
> > BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation,
> > better containing it inside intended container (under assumption BPF FS is set
> > up in such a way as to not be shared with other containers on the system).
> >
> >   [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
> >   [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> >   [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
> >
> > v3->v3-resend:
> >   - I started integrating token_fd into bpf_object_open_opts and higher-level
> >     libbpf bpf_object APIs, but it started going a bit deeper into bpf_object
> >     implementation details and how libbpf performs feature detection and
> >     caching, so I decided to keep it separate from this patch set and not
> >     distract from the mostly kernel-side changes;
> > v2->v3:
> >   - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
> >     BPF_OBJ_PIN for BPF token;
> > v1->v2:
> >   - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
> >   - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
> >
> > Andrii Nakryiko (14):
> >   bpf: introduce BPF token object
> >   libbpf: add bpf_token_create() API
> >   selftests/bpf: add BPF_TOKEN_CREATE test
> >   bpf: add BPF token support to BPF_MAP_CREATE command
> >   libbpf: add BPF token support to bpf_map_create() API
> >   selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
> >   bpf: add BPF token support to BPF_BTF_LOAD command
> >   libbpf: add BPF token support to bpf_btf_load() API
> >   selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
> >   bpf: add BPF token support to BPF_PROG_LOAD command
> >   bpf: take into account BPF token when fetching helper protos
> >   bpf: consistenly use BPF token throughout BPF verifier logic
> >   libbpf: add BPF token support to bpf_prog_load() API
> >   selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests
> >
> >  drivers/media/rc/bpf-lirc.c                   |   2 +-
> >  include/linux/bpf.h                           |  79 ++++-
> >  include/linux/filter.h                        |   2 +-
> >  include/uapi/linux/bpf.h                      |  53 ++++
> >  kernel/bpf/Makefile                           |   2 +-
> >  kernel/bpf/arraymap.c                         |   2 +-
> >  kernel/bpf/cgroup.c                           |   6 +-
> >  kernel/bpf/core.c                             |   3 +-
> >  kernel/bpf/helpers.c                          |   6 +-
> >  kernel/bpf/inode.c                            |  46 ++-
> >  kernel/bpf/syscall.c                          | 183 +++++++++---
> >  kernel/bpf/token.c                            | 201 +++++++++++++
> >  kernel/bpf/verifier.c                         |  13 +-
> >  kernel/trace/bpf_trace.c                      |   2 +-
> >  net/core/filter.c                             |  36 +--
> >  net/ipv4/bpf_tcp_ca.c                         |   2 +-
> >  net/netfilter/nf_bpf_link.c                   |   2 +-
> >  tools/include/uapi/linux/bpf.h                |  53 ++++
> >  tools/lib/bpf/bpf.c                           |  35 ++-
> >  tools/lib/bpf/bpf.h                           |  45 ++-
> >  tools/lib/bpf/libbpf.map                      |   1 +
> >  .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
> >  .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
> >  .../testing/selftests/bpf/prog_tests/token.c  | 277 ++++++++++++++++++
> >  24 files changed, 957 insertions(+), 104 deletions(-)
> >  create mode 100644 kernel/bpf/token.c
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c
> >
> > --
> > 2.34.1
> >
> >
>
>
> Hi Andrii,
>
> Thanks for your proposal.
> That seems to be a useful functionality, and yet I have some questions.

I've answered them below. But I don't think either of them have any
relation to BPF token and the problem I'm trying to solve.

>
> 1. Why can't we add security_bpf_probe_read_{kernel,user}?
>     If possible, we can use these LSM hooks to refuse the process to
> read other tasks' information. E.g. if the other process is not within
> the same cgroup or the same namespace, we just refuse the reading. I
> think it is not hard to identify if the other process is within the
> same cgroup or the same namespace.

There are probably many reasons. First, performance-wide, LSM hook for
each bpf_probe_read_{kernel,user}() call will be prohibitive. And just
in general, one would need to be very careful with such LSM hooks,
because bpf_probe_read_{kernel,user}() often happens from NMI context,
and LSM policy would have to be written and validated very carefully
with NMI context in mind.

But, more conceptually, for probe_read you get a random address and
you know the process context you are running in (but you might be
actually running in softirq and NMI, and that process context is
irrelevant). How can you efficiently (or at all) tell if that random
address "belongs" to cgroup or namespace? Just at conceptual level?

>
> 2. Why can't we extend bpf_cookie?
>    We're now using bpf_cookie to identify each user or each
> application, and only the permitted cookies can create new probe
> links.  However we find the bpf_cookie is only supported by tracing,
> perf_event and kprobe_multi, so we're planning to extend it to other
> possible link types, then we can use LSM hooks to control all bpf
> links.  I think that the upstream kernel should also support
> bpf_cookie for all bpf links. If possible, we will post it to the
> upstream in the future.
>    After I have read your BPF token proposal, I just have some other
> ideas. Why can't we just extend bpf_cookie to all other BPF objects?
> For example, all progs and maps should also have the bpf_cookie.
>

I'm not exactly clear how you use BPF cookie, but it wasn't intended
to provide any sort of security or validation policy. It's purely a
user-provided u64 to help distinguish different attach points when the
same BPF program is attached in multiple places (e.g., kprobe tracing
many different kernel functions and needing to distinguish between
them at runtime).

I do agree BPF cookie is super useful and we should keep extending
other types of BPF programs with BPF cookie support, of course. It's
just completely orthogonal to BPF token discussion.


>
> --
> Regards
> Yafang

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-07-04  9:51   ` Christian Brauner
  2023-07-04 23:33     ` Toke Høiland-Jørgensen
@ 2023-07-05 20:39     ` Andrii Nakryiko
  1 sibling, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-07-05 20:39 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Toke Høiland-Jørgensen, Andrii Nakryiko, bpf,
	linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, sargun

On Tue, Jul 4, 2023 at 2:52 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, Jun 30, 2023 at 01:15:47AM +0200, Toke Høiland-Jørgensen wrote:
> > Andrii Nakryiko <andrii@kernel.org> writes:
> >
> > > This patch set introduces new BPF object, BPF token, which allows to delegate
> > > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > > systemd or any other container manager) to a *trusted* unprivileged
> > > application. Trust is the key here. This functionality is not about allowing
> > > unconditional unprivileged BPF usage. Establishing trust, though, is
> > > completely up to the discretion of respective privileged application that
> > > would create a BPF token, as different production setups can and do achieve it
> > > through a combination of different means (signing, LSM, code reviews, etc),
> > > and it's undesirable and infeasible for kernel to enforce any particular way
> > > of validating trustworthiness of particular process.
> > >
> > > The main motivation for BPF token is a desire to enable containerized
> > > BPF applications to be used together with user namespaces. This is currently
> > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > > arbitrary memory, and it's impossible to ensure that they only read memory of
> > > processes belonging to any given namespace. This means that it's impossible to
> > > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > > no assumption about what "trusted" constitutes in any particular case, and
> > > it's up to specific privileged applications and their surrounding
> > > infrastructure to decide that. What kernel provides is a set of APIs to create
> > > and tune BPF token, and pass it around to privileged BPF commands that are
> > > creating new BPF objects like BPF programs, BPF maps, etc.
> >
> > So a colleague pointed out today that the Seccomp Notify functionality
> > would be a way to achieve your stated goal of allowing unprivileged
> > containers to (selectively) perform bpf() syscall operations. Christian
> > Brauner has a pretty nice writeup of the functionality here:
> > https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>
> I'm amazed you read this. :)
> The seccomp notifier comes with a lot of caveats. I think it would be
> impractical if not infeasible to handle bpf() delegation.

Thanks for confirming my hunch.

And yeah, I read a bunch of blog posts from your blog post. The one
about new mount APIs was especially useful given how little
documentation I could find on them otherwise :)

>
> >
> > In fact he even mentions allowing unprivileged access to bpf() as a
> > possible use case (in the second-to-last paragraph).
>
> Yeah, I tried to work around a userspace regression with the
> introduction of the cgroup v2 devices controller.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-05 14:42       ` Christian Brauner
  2023-07-05 16:00         ` Paul Moore
@ 2023-07-05 21:38         ` Andrii Nakryiko
  2023-07-06 11:32           ` Toke Høiland-Jørgensen
  2023-07-11 13:33           ` Christian Brauner
  1 sibling, 2 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-07-05 21:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Paul Moore, Andrii Nakryiko, bpf, linux-security-module,
	keescook, lennart, cyphar, luto, kernel-team, sargun

On Wed, Jul 5, 2023 at 7:42 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Wed, Jul 05, 2023 at 10:16:13AM -0400, Paul Moore wrote:
> > On Tue, Jul 4, 2023 at 8:44 AM Christian Brauner <brauner@kernel.org> wrote:
> > > On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
> > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to to
> > > > allow delegating privileged BPF functionality, like loading a BPF
> > > > program or creating a BPF map, from privileged process to a *trusted*
> > > > unprivileged process, all while have a good amount of control over which
> > > > privileged operations could be performed using provided BPF token.
> > > >
> > > > This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
> > > > allows to create a new BPF token object along with a set of allowed
> > > > commands that such BPF token allows to unprivileged applications.
> > > > Currently only BPF_TOKEN_CREATE command itself can be
> > > > delegated, but other patches gradually add ability to delegate
> > > > BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
> > > >
> > > > The above means that new BPF tokens can be created using existing BPF
> > > > token, if original privileged creator allowed BPF_TOKEN_CREATE command.
> > > > New derived BPF token cannot be more powerful than the original BPF
> > > > token.
> > > >
> > > > Importantly, BPF token is automatically pinned at the specified location
> > > > inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
> > > > command, unlike BPF prog/map/btf/link. This provides more control over
> > > > unintended sharing of BPF tokens through pinning it in another BPF FS
> > > > instances.
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > >
> > > The main issue I have with the token approach is that it is a completely
> > > separate delegation vector on top of user namespaces. We mentioned this
> > > duringthe conf and this was brought up on the thread here again as well.
> > > Imho, that's a problem both security-wise and complexity-wise.
> > >
> > > It's not great if each subsystem gets its own custom delegation
> > > mechanism. This imposes such a taxing complexity on both kernel- and
> > > userspace that it will quickly become a huge liability. So I would
> > > really strongly encourage you to explore another direction.

Alright, thanks a lot for elaborating. I did want to keep everything
contained to bpf() for various reasons, but it seems like I won't be
able to get away with this. :)

> > >
> > > I do think the spirit of your proposal is workable and that it can
> > > mostly be kept in tact.

It's good to know that at least conceptually you support the idea of
BPF delegation. I have a few more specific questions below and I'd
appreciate your answers, as I have less familiarity with how exactly
container managers do stuff at container bootstrapping stage.

But first, let's try to get some tentative agreement on design before
I go and implement the BPF-token-as-FS idea. I have basically just two
gripes with exact details of what you are proposing, so let me explain
which and why, and see if we can find some common ground.

First, the idea of coupling and bundling this "delegation" option with
BPF FS doesn't feel right. BPF FS is just a container of BPF objects,
so adding to it a new property of allowing to use privileged BPF
functionality seems a bit off.

Why not just create a new separate FS, let's code-name it "BPF Token
FS" for now (naming suggestions are welcome). Such BPF Token FS would
be dedicated to specifying everything about what's allowable through
BPF, just like my BPF token implementation. It can then be
mounted/bind-mounted inside BPF FS (or really, anywhere, it's just a
FS, right?). User application would open it (I'm guessing with
open_tree(), right?) and pass it as token_fd to bpf() syscall.

Having it as a separate single-purpose FS seems cleaner, because we
have use cases where we'd have one BPF FS instance created for a
container by our container manager, and then exposing a few separate
tokens with different sets of allowed functionality. E.g., one for
main intended workload, another for some BPF-based observability
tools, maybe yet another for more heavy-weight tools like bpftrace for
extra debugging. In the debugging case our container infrastructure
will be "evacuating" any other workloads on the same host to avoid
unnecessary consequences. The point is to not disturb
workload-under-human-debugging as much as possible, so we'd like to
keep userns intact, which is why mounting extra (more permissive) BPF
token inside already running containers is an important consideration.

With such goals, it seems nicer to have a single BPF FS, and few BPF
token FSs mounted inside it. Yes, we could bundle token functionality
with BPF FS, but separating those two seems cleaner to me. WDYT?

Second, mount options usage. I'm hearing stories from our production
folks how some new mount options (on some other FS, not BPF FS) were
breaking tools unintentionally during kernel/tooling
upgrades/downgrades, so it makes me a bit hesitant to have these
complicated sets of mount options to specify parameters of
BPF-token-as-FS. I've been thinking a bit, and I'm starting to lean
towards the idea of allowing to set up (and modify as well) all these
allowed maps/progs/attach types through special auto-created files
within BPF token FS. Something like below:

# pwd
/sys/fs/bpf/workload-token
# ls
allowed_cmds allowed_map_types allowed_prog_types allowed_attach_types
# echo "BPF_PROG_LOAD" > allowed_cmds
# echo "BPF_PROG_TYPE_KPROBE" >> allowed_prog_types
...
# cat allowed_prog_types
BPF_PROG_TYPE_KPROBE,BPF_PROG_TYPE_TRACEPOINT


The above is fake (I haven't implemented anything yet), but hopefully
works as a demonstration. We'll also need to make sure that inside
non-init userns these files are read-only or allow to just further
restrict the subset of allowed functionality, never extend it.

Such an approach will actually make it simpler to test and experiment
with this delegation locally, will make it trivial to observe what's
allowed from simple shell scripts, etc, etc. With fsmount() and O_PATH
it will be possible to set everything up from privileged processes
before ever exposing a BPF Token FS instance through a file system, if
there are any concerns about racing with user space.

That's the high-level approach I'm thinking of right now. Would that
work? How critical is it to reuse BPF FS itself and how important to
you is to rely on mount options vs special files as described above?
Hopefully not critical, and I can start working on it, and we'll get
what you want with using FS as a vehicle for delegation, while
allowing some of the intended use cases that we have in mind in a bit
cleaner fashion?

> > >
> > > As mentioned before, bpffs has all the means to be taught delegation:
> > >
> > >         // In container's user namespace
> > >         fd_fs = fsopen("bpffs");
> > >
> > >         // Delegating task in host userns (systemd-bpfd whatever you want)
> > >         ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "delegate", ...);
> > >
> > >         // In container's user namespace
> > >         fd_mnt = fsmount(fd_fs, 0);
> > >
> > >         ret = move_mount(fd_fs, "", -EBADF, "/my/fav/location", MOVE_MOUNT_F_EMPTY_PATH)
> > >
> > > Roughly, this would mean:
> > >
> > > (i) raise FS_USERNS_MOUNT on bpffs but guard it behind the "delegate"
> > >     mount option. IOW, it's only possibly to mount bpffs as an
> > >     unprivileged user if a delegating process like systemd-bpfd with
> > >     system-level privileges has marked it as delegatable.

Regarding the FS_USERNS_MOUNT flag and fsopen() happening from inside
the user namespace. Am I missing something subtle and important here,
why does it have to happen inside the container's user namespace?
Can't the container manager both fsopen() and fsconfig() everything in
host userns, and only then fsmount+move_mount inside the container's
userns? Just trying to understand if there is some important early
association of userns happening at early steps here?

Also, in your example above, move_mount() should take fd_mnt, not fd_fs, right?

> > > (ii) add fine-grained delegation options that you want this
> > >      bpffs instance to allow via new mount options. Idk,
> > >
> > >      // allow usage of foo
> > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "foo");
> > >
> > >      // also allow usage of bar
> > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "bar");
> > >
> > >      // reset allowed options
> > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "");
> > >
> > >      // allow usage of schmoo
> > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "schmoo");
> > >
> > > This all seems more intuitive and integrates with user and mount
> > > namespaces of the container. This can also work for restricting
> > > non-userns bpf instances fwiw. You can also share instances via
> > > bind-mount and so on. The userns of the bpffs instance can also be used
> > > for permission checking provided a given functionality has been
> > > delegated by e.g., systemd-bpfd or whatever.
> >
> > I have no arguments against any of the above, and would prefer to see
> > something like this over a token-based mechanism.  However we do want
> > to make sure we have the proper LSM control points for either approach
> > so that admins who rely on LSM-based security policies can manage
> > delegation via their policies.
> >
> > Using the fsconfig() approach described by Christian above, I believe
> > we should have the necessary hooks already in
> > security_fs_context_parse_param() and security_sb_mnt_opts() but I'm
> > basing that on a quick look this morning, some additional checking
> > would need to be done.
>
> I think what I outlined is even unnecessarily complicated. You don't
> need that pointless "delegate" mount option at all actually. Permission
> to delegate shouldn't be checked when the mount option is set. The
> permissions should be checked when the superblock is created. That's the
> right point in time. So sm like:
>

I think this gets even more straightforward with BPF Token FS being a
separate one, right? Given BPF Token FS is all about delegation, it
has to be a privileged operation to even create it.

> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> index 4174f76133df..a2eb382f5457 100644
> --- a/kernel/bpf/inode.c
> +++ b/kernel/bpf/inode.c
> @@ -746,6 +746,13 @@ static int bpf_fill_super(struct super_block *sb, struct fs_context *fc)
>         struct inode *inode;
>         int ret;
>
> +       /*
> +        * If you want to delegate this instance then you need to be
> +        * privileged and know what you're doing. This isn't trust.
> +        */
> +       if ((fc->user_ns != &init_user_ns) && !capable(CAP_SYS_ADMIN))
> +               return -EPERM;
> +
>         ret = simple_fill_super(sb, BPF_FS_MAGIC, bpf_rfiles);
>         if (ret)
>                 return ret;
> @@ -800,6 +807,7 @@ static struct file_system_type bpf_fs_type = {
>         .init_fs_context = bpf_init_fs_context,
>         .parameters     = bpf_fs_parameters,
>         .kill_sb        = kill_litter_super,
> +       .fs_flags       = FS_USERNS_MOUNT,

Just an aside thought. It doesn't seem like there is any reason why
BPF FS right now is not created with FS_USERNS_MOUNT, so (separately
from all this discussion) I suspect we can just make it
FS_USERNS_MOUNT right now (unless we combine it with BPF-token-FS,
then yeah, we can't do that unconditionally anymore). Given BPF FS is
just a container of pinned BPF objects, just mounting BPF FS doesn't
seem to be dangerous in any way. But that's just an aside thought
here.

>  };
>
>  static int __init bpf_init(void)
>
> In fact this is conceptually generalizable but I'd need to think about
> that.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-07-05 20:37   ` Andrii Nakryiko
@ 2023-07-06  1:26     ` Yafang Shao
  2023-07-06 20:34       ` Andrii Nakryiko
  0 siblings, 1 reply; 48+ messages in thread
From: Yafang Shao @ 2023-07-06  1:26 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team, sargun

On Thu, Jul 6, 2023 at 4:37 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > This patch set introduces new BPF object, BPF token, which allows to delegate
> > > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > > systemd or any other container manager) to a *trusted* unprivileged
> > > application. Trust is the key here. This functionality is not about allowing
> > > unconditional unprivileged BPF usage. Establishing trust, though, is
> > > completely up to the discretion of respective privileged application that
> > > would create a BPF token, as different production setups can and do achieve it
> > > through a combination of different means (signing, LSM, code reviews, etc),
> > > and it's undesirable and infeasible for kernel to enforce any particular way
> > > of validating trustworthiness of particular process.
> > >
> > > The main motivation for BPF token is a desire to enable containerized
> > > BPF applications to be used together with user namespaces. This is currently
> > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > > arbitrary memory, and it's impossible to ensure that they only read memory of
> > > processes belonging to any given namespace. This means that it's impossible to
> > > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > > no assumption about what "trusted" constitutes in any particular case, and
> > > it's up to specific privileged applications and their surrounding
> > > infrastructure to decide that. What kernel provides is a set of APIs to create
> > > and tune BPF token, and pass it around to privileged BPF commands that are
> > > creating new BPF objects like BPF programs, BPF maps, etc.
> > >
> > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > approach, but can be combined with LSM hooks for very fine-grained security
> > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > (context), which in combination with BPF LSM would allow implementing a very
> > > dynamic and fine-granular custom security policies on top of BPF token. In the
> > > interest of minimizing API surface area discussions this is going to be
> > > added in follow up patches, as it's not essential to the fundamental concept
> > > of delegatable BPF token.
> > >
> > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > allowing multiple independent instances of them, each with its own set of
> > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > > syscall accepts token_fd parameters explicitly for each relevant BPF command.
> > > This addresses main concerns brought up during the /dev/bpf discussion, and
> > > fits better with overall BPF subsystem design.
> > >
> > > This patch set adds a basic minimum of functionality to make BPF token useful
> > > and to discuss API and functionality. Currently only low-level libbpf APIs
> > > support passing BPF token around, allowing to test kernel functionality, but
> > > for the most part is not sufficient for real-world applications, which
> > > typically use high-level libbpf APIs based on `struct bpf_object` type. This
> > > was done with the intent to limit the size of patch set and concentrate on
> > > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> > > as a separate follow up patch set kernel support makes it upstream.
> > >
> > > Another part that should happen once kernel-side BPF token is established, is
> > > a set of conventions between applications (e.g., systemd), tools (e.g.,
> > > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> > > at well-defined locations to allow applications take advantage of this in
> > > automatic fashion without explicit code changes on BPF application's side.
> > > But I'd like to postpone this discussion to after BPF token concept lands.
> > >
> > > Once important distinctions from v2 that should be noted is a chance in the
> > > semantics of a newly added BPF_TOKEN_CREATE command. Previously,
> > > BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to
> > > user-space, allowing to (optionally) pin it in BPF FS using BPF_OBJ_PIN
> > > command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF
> > > token object creation *and* pinning in BPF FS. Such change ensures that BPF
> > > token is always associated with a specific instance of BPF FS and cannot
> > > "escape" it by application re-pinning it somewhere else using another
> > > BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation,
> > > better containing it inside intended container (under assumption BPF FS is set
> > > up in such a way as to not be shared with other containers on the system).
> > >
> > >   [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
> > >   [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> > >   [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
> > >
> > > v3->v3-resend:
> > >   - I started integrating token_fd into bpf_object_open_opts and higher-level
> > >     libbpf bpf_object APIs, but it started going a bit deeper into bpf_object
> > >     implementation details and how libbpf performs feature detection and
> > >     caching, so I decided to keep it separate from this patch set and not
> > >     distract from the mostly kernel-side changes;
> > > v2->v3:
> > >   - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
> > >     BPF_OBJ_PIN for BPF token;
> > > v1->v2:
> > >   - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
> > >   - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
> > >
> > > Andrii Nakryiko (14):
> > >   bpf: introduce BPF token object
> > >   libbpf: add bpf_token_create() API
> > >   selftests/bpf: add BPF_TOKEN_CREATE test
> > >   bpf: add BPF token support to BPF_MAP_CREATE command
> > >   libbpf: add BPF token support to bpf_map_create() API
> > >   selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
> > >   bpf: add BPF token support to BPF_BTF_LOAD command
> > >   libbpf: add BPF token support to bpf_btf_load() API
> > >   selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
> > >   bpf: add BPF token support to BPF_PROG_LOAD command
> > >   bpf: take into account BPF token when fetching helper protos
> > >   bpf: consistenly use BPF token throughout BPF verifier logic
> > >   libbpf: add BPF token support to bpf_prog_load() API
> > >   selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests
> > >
> > >  drivers/media/rc/bpf-lirc.c                   |   2 +-
> > >  include/linux/bpf.h                           |  79 ++++-
> > >  include/linux/filter.h                        |   2 +-
> > >  include/uapi/linux/bpf.h                      |  53 ++++
> > >  kernel/bpf/Makefile                           |   2 +-
> > >  kernel/bpf/arraymap.c                         |   2 +-
> > >  kernel/bpf/cgroup.c                           |   6 +-
> > >  kernel/bpf/core.c                             |   3 +-
> > >  kernel/bpf/helpers.c                          |   6 +-
> > >  kernel/bpf/inode.c                            |  46 ++-
> > >  kernel/bpf/syscall.c                          | 183 +++++++++---
> > >  kernel/bpf/token.c                            | 201 +++++++++++++
> > >  kernel/bpf/verifier.c                         |  13 +-
> > >  kernel/trace/bpf_trace.c                      |   2 +-
> > >  net/core/filter.c                             |  36 +--
> > >  net/ipv4/bpf_tcp_ca.c                         |   2 +-
> > >  net/netfilter/nf_bpf_link.c                   |   2 +-
> > >  tools/include/uapi/linux/bpf.h                |  53 ++++
> > >  tools/lib/bpf/bpf.c                           |  35 ++-
> > >  tools/lib/bpf/bpf.h                           |  45 ++-
> > >  tools/lib/bpf/libbpf.map                      |   1 +
> > >  .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
> > >  .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
> > >  .../testing/selftests/bpf/prog_tests/token.c  | 277 ++++++++++++++++++
> > >  24 files changed, 957 insertions(+), 104 deletions(-)
> > >  create mode 100644 kernel/bpf/token.c
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c
> > >
> > > --
> > > 2.34.1
> > >
> > >
> >
> >
> > Hi Andrii,
> >
> > Thanks for your proposal.
> > That seems to be a useful functionality, and yet I have some questions.
>
> I've answered them below. But I don't think either of them have any
> relation to BPF token and the problem I'm trying to solve.
>
> >
> > 1. Why can't we add security_bpf_probe_read_{kernel,user}?
> >     If possible, we can use these LSM hooks to refuse the process to
> > read other tasks' information. E.g. if the other process is not within
> > the same cgroup or the same namespace, we just refuse the reading. I
> > think it is not hard to identify if the other process is within the
> > same cgroup or the same namespace.
>
> There are probably many reasons. First, performance-wide, LSM hook for
> each bpf_probe_read_{kernel,user}() call will be prohibitive. And just
> in general, one would need to be very careful with such LSM hooks,
> because bpf_probe_read_{kernel,user}() often happens from NMI context,
> and LSM policy would have to be written and validated very carefully
> with NMI context in mind.
>
> But, more conceptually, for probe_read you get a random address and
> you know the process context you are running in (but you might be
> actually running in softirq and NMI, and that process context is
> irrelevant). How can you efficiently (or at all) tell if that random
> address "belongs" to cgroup or namespace? Just at conceptual level?
>
> >
> > 2. Why can't we extend bpf_cookie?
> >    We're now using bpf_cookie to identify each user or each
> > application, and only the permitted cookies can create new probe
> > links.  However we find the bpf_cookie is only supported by tracing,
> > perf_event and kprobe_multi, so we're planning to extend it to other
> > possible link types, then we can use LSM hooks to control all bpf
> > links.  I think that the upstream kernel should also support
> > bpf_cookie for all bpf links. If possible, we will post it to the
> > upstream in the future.
> >    After I have read your BPF token proposal, I just have some other
> > ideas. Why can't we just extend bpf_cookie to all other BPF objects?
> > For example, all progs and maps should also have the bpf_cookie.
> >
>
> I'm not exactly clear how you use BPF cookie, but it wasn't intended
> to provide any sort of security or validation policy. It's purely a
> user-provided u64 to help distinguish different attach points when the
> same BPF program is attached in multiple places (e.g., kprobe tracing
> many different kernel functions and needing to distinguish between
> them at runtime).

In our container environment, we enable the CAP_BPF, CAP_PERMON and
CAP_NET_ADMIN for the containers which want to run BPF programs
inside. However we don't want them to run whatever BPF programs they
want. We only allow them to run the BPF programs we have permitted for
each of them.  So we are using LSM to audit the BPF behavior such as
prog load, map creation and link attach.  We define different BPF
policies for different containers. In order to identify different
containers efficiently, we assign different bpf_cookies for different
containers. bpf_cookie is a u64, that's enough for our use cases.
We didn't use cgroup id to identify different containers because
cgroup id is a local value in a server, while bpf_cookie is a global
value, that would be easy for deployment.
For your use cases, maybe we could enable CAP_BPF (+CAP_PERMON,
+CAP_NET_ADMIN) for all users, and then we assign different
bpf_cookies for different users, so we can use LSM to allow the user
who have the permitted cookies to run BPF program ?

>
> I do agree BPF cookie is super useful and we should keep extending
> other types of BPF programs with BPF cookie support, of course. It's
> just completely orthogonal to BPF token discussion.
>

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-05 21:38         ` Andrii Nakryiko
@ 2023-07-06 11:32           ` Toke Høiland-Jørgensen
  2023-07-06 20:37             ` Andrii Nakryiko
  2023-07-11 13:33           ` Christian Brauner
  1 sibling, 1 reply; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-07-06 11:32 UTC (permalink / raw)
  To: Andrii Nakryiko, Christian Brauner
  Cc: Paul Moore, Andrii Nakryiko, bpf, linux-security-module,
	keescook, lennart, cyphar, luto, kernel-team, sargun

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> Having it as a separate single-purpose FS seems cleaner, because we
> have use cases where we'd have one BPF FS instance created for a
> container by our container manager, and then exposing a few separate
> tokens with different sets of allowed functionality. E.g., one for
> main intended workload, another for some BPF-based observability
> tools, maybe yet another for more heavy-weight tools like bpftrace for
> extra debugging. In the debugging case our container infrastructure
> will be "evacuating" any other workloads on the same host to avoid
> unnecessary consequences. The point is to not disturb
> workload-under-human-debugging as much as possible, so we'd like to
> keep userns intact, which is why mounting extra (more permissive) BPF
> token inside already running containers is an important consideration.

This example (as well as Yafang's in the sibling subthread) makes it
even more apparent to me that it would be better with a model where the
userspace policy daemon can just make decisions on each call directly,
instead of mucking about with different tokens with different embedded
permissions. Why not go that route (see my other reply for details on
what I mean)?

-Toke


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-07-06  1:26     ` Yafang Shao
@ 2023-07-06 20:34       ` Andrii Nakryiko
  2023-07-07  1:42         ` Yafang Shao
  0 siblings, 1 reply; 48+ messages in thread
From: Andrii Nakryiko @ 2023-07-06 20:34 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team, sargun

On Wed, Jul 5, 2023 at 6:27 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, Jul 6, 2023 at 4:37 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > > This patch set introduces new BPF object, BPF token, which allows to delegate
> > > > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > > > systemd or any other container manager) to a *trusted* unprivileged
> > > > application. Trust is the key here. This functionality is not about allowing
> > > > unconditional unprivileged BPF usage. Establishing trust, though, is
> > > > completely up to the discretion of respective privileged application that
> > > > would create a BPF token, as different production setups can and do achieve it
> > > > through a combination of different means (signing, LSM, code reviews, etc),
> > > > and it's undesirable and infeasible for kernel to enforce any particular way
> > > > of validating trustworthiness of particular process.
> > > >
> > > > The main motivation for BPF token is a desire to enable containerized
> > > > BPF applications to be used together with user namespaces. This is currently
> > > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > > > arbitrary memory, and it's impossible to ensure that they only read memory of
> > > > processes belonging to any given namespace. This means that it's impossible to
> > > > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > > > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > > > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > > > no assumption about what "trusted" constitutes in any particular case, and
> > > > it's up to specific privileged applications and their surrounding
> > > > infrastructure to decide that. What kernel provides is a set of APIs to create
> > > > and tune BPF token, and pass it around to privileged BPF commands that are
> > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > > >
> > > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > > approach, but can be combined with LSM hooks for very fine-grained security
> > > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > > (context), which in combination with BPF LSM would allow implementing a very
> > > > dynamic and fine-granular custom security policies on top of BPF token. In the
> > > > interest of minimizing API surface area discussions this is going to be
> > > > added in follow up patches, as it's not essential to the fundamental concept
> > > > of delegatable BPF token.
> > > >
> > > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > > allowing multiple independent instances of them, each with its own set of
> > > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > > > syscall accepts token_fd parameters explicitly for each relevant BPF command.
> > > > This addresses main concerns brought up during the /dev/bpf discussion, and
> > > > fits better with overall BPF subsystem design.
> > > >
> > > > This patch set adds a basic minimum of functionality to make BPF token useful
> > > > and to discuss API and functionality. Currently only low-level libbpf APIs
> > > > support passing BPF token around, allowing to test kernel functionality, but
> > > > for the most part is not sufficient for real-world applications, which
> > > > typically use high-level libbpf APIs based on `struct bpf_object` type. This
> > > > was done with the intent to limit the size of patch set and concentrate on
> > > > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> > > > as a separate follow up patch set kernel support makes it upstream.
> > > >
> > > > Another part that should happen once kernel-side BPF token is established, is
> > > > a set of conventions between applications (e.g., systemd), tools (e.g.,
> > > > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> > > > at well-defined locations to allow applications take advantage of this in
> > > > automatic fashion without explicit code changes on BPF application's side.
> > > > But I'd like to postpone this discussion to after BPF token concept lands.
> > > >
> > > > Once important distinctions from v2 that should be noted is a chance in the
> > > > semantics of a newly added BPF_TOKEN_CREATE command. Previously,
> > > > BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to
> > > > user-space, allowing to (optionally) pin it in BPF FS using BPF_OBJ_PIN
> > > > command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF
> > > > token object creation *and* pinning in BPF FS. Such change ensures that BPF
> > > > token is always associated with a specific instance of BPF FS and cannot
> > > > "escape" it by application re-pinning it somewhere else using another
> > > > BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation,
> > > > better containing it inside intended container (under assumption BPF FS is set
> > > > up in such a way as to not be shared with other containers on the system).
> > > >
> > > >   [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
> > > >   [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> > > >   [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
> > > >
> > > > v3->v3-resend:
> > > >   - I started integrating token_fd into bpf_object_open_opts and higher-level
> > > >     libbpf bpf_object APIs, but it started going a bit deeper into bpf_object
> > > >     implementation details and how libbpf performs feature detection and
> > > >     caching, so I decided to keep it separate from this patch set and not
> > > >     distract from the mostly kernel-side changes;
> > > > v2->v3:
> > > >   - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
> > > >     BPF_OBJ_PIN for BPF token;
> > > > v1->v2:
> > > >   - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
> > > >   - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
> > > >
> > > > Andrii Nakryiko (14):
> > > >   bpf: introduce BPF token object
> > > >   libbpf: add bpf_token_create() API
> > > >   selftests/bpf: add BPF_TOKEN_CREATE test
> > > >   bpf: add BPF token support to BPF_MAP_CREATE command
> > > >   libbpf: add BPF token support to bpf_map_create() API
> > > >   selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
> > > >   bpf: add BPF token support to BPF_BTF_LOAD command
> > > >   libbpf: add BPF token support to bpf_btf_load() API
> > > >   selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
> > > >   bpf: add BPF token support to BPF_PROG_LOAD command
> > > >   bpf: take into account BPF token when fetching helper protos
> > > >   bpf: consistenly use BPF token throughout BPF verifier logic
> > > >   libbpf: add BPF token support to bpf_prog_load() API
> > > >   selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests
> > > >
> > > >  drivers/media/rc/bpf-lirc.c                   |   2 +-
> > > >  include/linux/bpf.h                           |  79 ++++-
> > > >  include/linux/filter.h                        |   2 +-
> > > >  include/uapi/linux/bpf.h                      |  53 ++++
> > > >  kernel/bpf/Makefile                           |   2 +-
> > > >  kernel/bpf/arraymap.c                         |   2 +-
> > > >  kernel/bpf/cgroup.c                           |   6 +-
> > > >  kernel/bpf/core.c                             |   3 +-
> > > >  kernel/bpf/helpers.c                          |   6 +-
> > > >  kernel/bpf/inode.c                            |  46 ++-
> > > >  kernel/bpf/syscall.c                          | 183 +++++++++---
> > > >  kernel/bpf/token.c                            | 201 +++++++++++++
> > > >  kernel/bpf/verifier.c                         |  13 +-
> > > >  kernel/trace/bpf_trace.c                      |   2 +-
> > > >  net/core/filter.c                             |  36 +--
> > > >  net/ipv4/bpf_tcp_ca.c                         |   2 +-
> > > >  net/netfilter/nf_bpf_link.c                   |   2 +-
> > > >  tools/include/uapi/linux/bpf.h                |  53 ++++
> > > >  tools/lib/bpf/bpf.c                           |  35 ++-
> > > >  tools/lib/bpf/bpf.h                           |  45 ++-
> > > >  tools/lib/bpf/libbpf.map                      |   1 +
> > > >  .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
> > > >  .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
> > > >  .../testing/selftests/bpf/prog_tests/token.c  | 277 ++++++++++++++++++
> > > >  24 files changed, 957 insertions(+), 104 deletions(-)
> > > >  create mode 100644 kernel/bpf/token.c
> > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c
> > > >
> > > > --
> > > > 2.34.1
> > > >
> > > >
> > >
> > >
> > > Hi Andrii,
> > >
> > > Thanks for your proposal.
> > > That seems to be a useful functionality, and yet I have some questions.
> >
> > I've answered them below. But I don't think either of them have any
> > relation to BPF token and the problem I'm trying to solve.
> >
> > >
> > > 1. Why can't we add security_bpf_probe_read_{kernel,user}?
> > >     If possible, we can use these LSM hooks to refuse the process to
> > > read other tasks' information. E.g. if the other process is not within
> > > the same cgroup or the same namespace, we just refuse the reading. I
> > > think it is not hard to identify if the other process is within the
> > > same cgroup or the same namespace.
> >
> > There are probably many reasons. First, performance-wide, LSM hook for
> > each bpf_probe_read_{kernel,user}() call will be prohibitive. And just
> > in general, one would need to be very careful with such LSM hooks,
> > because bpf_probe_read_{kernel,user}() often happens from NMI context,
> > and LSM policy would have to be written and validated very carefully
> > with NMI context in mind.
> >
> > But, more conceptually, for probe_read you get a random address and
> > you know the process context you are running in (but you might be
> > actually running in softirq and NMI, and that process context is
> > irrelevant). How can you efficiently (or at all) tell if that random
> > address "belongs" to cgroup or namespace? Just at conceptual level?
> >
> > >
> > > 2. Why can't we extend bpf_cookie?
> > >    We're now using bpf_cookie to identify each user or each
> > > application, and only the permitted cookies can create new probe
> > > links.  However we find the bpf_cookie is only supported by tracing,
> > > perf_event and kprobe_multi, so we're planning to extend it to other
> > > possible link types, then we can use LSM hooks to control all bpf
> > > links.  I think that the upstream kernel should also support
> > > bpf_cookie for all bpf links. If possible, we will post it to the
> > > upstream in the future.
> > >    After I have read your BPF token proposal, I just have some other
> > > ideas. Why can't we just extend bpf_cookie to all other BPF objects?
> > > For example, all progs and maps should also have the bpf_cookie.
> > >
> >
> > I'm not exactly clear how you use BPF cookie, but it wasn't intended
> > to provide any sort of security or validation policy. It's purely a
> > user-provided u64 to help distinguish different attach points when the
> > same BPF program is attached in multiple places (e.g., kprobe tracing
> > many different kernel functions and needing to distinguish between
> > them at runtime).
>
> In our container environment, we enable the CAP_BPF, CAP_PERMON and
> CAP_NET_ADMIN for the containers which want to run BPF programs
> inside. However we don't want them to run whatever BPF programs they
> want. We only allow them to run the BPF programs we have permitted for
> each of them.  So we are using LSM to audit the BPF behavior such as
> prog load, map creation and link attach.  We define different BPF
> policies for different containers. In order to identify different
> containers efficiently, we assign different bpf_cookies for different
> containers. bpf_cookie is a u64, that's enough for our use cases.

I can see how you can use BPF cookies for this, but it's certainly not
an intended use case :) BPF cookie is most useful on BPF side of
things.

But what you are describing is meant to be doable with BPF token. It's
not in first patch set, but I intended to allow user to specify an
extra "user context" blog of bytes which would be stored with BPF
token. And this data should be accessible from BPF LSM programs to
make extra custom policy decisions. But we need to agree on initial
BPF token stuff first, and then build out all the rest.

> We didn't use cgroup id to identify different containers because
> cgroup id is a local value in a server, while bpf_cookie is a global
> value, that would be easy for deployment.
> For your use cases, maybe we could enable CAP_BPF (+CAP_PERMON,
> +CAP_NET_ADMIN) for all users, and then we assign different
> bpf_cookies for different users, so we can use LSM to allow the user
> who have the permitted cookies to run BPF program ?
>
> >
> > I do agree BPF cookie is super useful and we should keep extending
> > other types of BPF programs with BPF cookie support, of course. It's
> > just completely orthogonal to BPF token discussion.
> >
>
> --
> Regards
> Yafang

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-06 11:32           ` Toke Høiland-Jørgensen
@ 2023-07-06 20:37             ` Andrii Nakryiko
  2023-07-07 13:04               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 48+ messages in thread
From: Andrii Nakryiko @ 2023-07-06 20:37 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Christian Brauner, Paul Moore, Andrii Nakryiko, bpf,
	linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, sargun

On Thu, Jul 6, 2023 at 4:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > Having it as a separate single-purpose FS seems cleaner, because we
> > have use cases where we'd have one BPF FS instance created for a
> > container by our container manager, and then exposing a few separate
> > tokens with different sets of allowed functionality. E.g., one for
> > main intended workload, another for some BPF-based observability
> > tools, maybe yet another for more heavy-weight tools like bpftrace for
> > extra debugging. In the debugging case our container infrastructure
> > will be "evacuating" any other workloads on the same host to avoid
> > unnecessary consequences. The point is to not disturb
> > workload-under-human-debugging as much as possible, so we'd like to
> > keep userns intact, which is why mounting extra (more permissive) BPF
> > token inside already running containers is an important consideration.
>
> This example (as well as Yafang's in the sibling subthread) makes it
> even more apparent to me that it would be better with a model where the
> userspace policy daemon can just make decisions on each call directly,
> instead of mucking about with different tokens with different embedded
> permissions. Why not go that route (see my other reply for details on
> what I mean)?

I don't know how you arrived at this conclusion, but we've debated BPF
proxying and separate service at length, there is no point in going on
another round here. Per-call decisions can be achieved nicely by
employing BPF LSM in a restrictive manner on top of BPF token (or no
token, if you are ok without user namespaces).

>
> -Toke
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 00/14] BPF token
  2023-07-06 20:34       ` Andrii Nakryiko
@ 2023-07-07  1:42         ` Yafang Shao
  0 siblings, 0 replies; 48+ messages in thread
From: Yafang Shao @ 2023-07-07  1:42 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team, sargun

On Fri, Jul 7, 2023 at 4:34 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jul 5, 2023 at 6:27 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Thu, Jul 6, 2023 at 4:37 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > >
> > > > > This patch set introduces new BPF object, BPF token, which allows to delegate
> > > > > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > > > > systemd or any other container manager) to a *trusted* unprivileged
> > > > > application. Trust is the key here. This functionality is not about allowing
> > > > > unconditional unprivileged BPF usage. Establishing trust, though, is
> > > > > completely up to the discretion of respective privileged application that
> > > > > would create a BPF token, as different production setups can and do achieve it
> > > > > through a combination of different means (signing, LSM, code reviews, etc),
> > > > > and it's undesirable and infeasible for kernel to enforce any particular way
> > > > > of validating trustworthiness of particular process.
> > > > >
> > > > > The main motivation for BPF token is a desire to enable containerized
> > > > > BPF applications to be used together with user namespaces. This is currently
> > > > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > > > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > > > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > > > > arbitrary memory, and it's impossible to ensure that they only read memory of
> > > > > processes belonging to any given namespace. This means that it's impossible to
> > > > > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > > > > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > > > > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > > > > no assumption about what "trusted" constitutes in any particular case, and
> > > > > it's up to specific privileged applications and their surrounding
> > > > > infrastructure to decide that. What kernel provides is a set of APIs to create
> > > > > and tune BPF token, and pass it around to privileged BPF commands that are
> > > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > > > >
> > > > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > > > approach, but can be combined with LSM hooks for very fine-grained security
> > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > > > (context), which in combination with BPF LSM would allow implementing a very
> > > > > dynamic and fine-granular custom security policies on top of BPF token. In the
> > > > > interest of minimizing API surface area discussions this is going to be
> > > > > added in follow up patches, as it's not essential to the fundamental concept
> > > > > of delegatable BPF token.
> > > > >
> > > > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > > > allowing multiple independent instances of them, each with its own set of
> > > > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > > > > syscall accepts token_fd parameters explicitly for each relevant BPF command.
> > > > > This addresses main concerns brought up during the /dev/bpf discussion, and
> > > > > fits better with overall BPF subsystem design.
> > > > >
> > > > > This patch set adds a basic minimum of functionality to make BPF token useful
> > > > > and to discuss API and functionality. Currently only low-level libbpf APIs
> > > > > support passing BPF token around, allowing to test kernel functionality, but
> > > > > for the most part is not sufficient for real-world applications, which
> > > > > typically use high-level libbpf APIs based on `struct bpf_object` type. This
> > > > > was done with the intent to limit the size of patch set and concentrate on
> > > > > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> > > > > as a separate follow up patch set kernel support makes it upstream.
> > > > >
> > > > > Another part that should happen once kernel-side BPF token is established, is
> > > > > a set of conventions between applications (e.g., systemd), tools (e.g.,
> > > > > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> > > > > at well-defined locations to allow applications take advantage of this in
> > > > > automatic fashion without explicit code changes on BPF application's side.
> > > > > But I'd like to postpone this discussion to after BPF token concept lands.
> > > > >
> > > > > Once important distinctions from v2 that should be noted is a chance in the
> > > > > semantics of a newly added BPF_TOKEN_CREATE command. Previously,
> > > > > BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to
> > > > > user-space, allowing to (optionally) pin it in BPF FS using BPF_OBJ_PIN
> > > > > command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF
> > > > > token object creation *and* pinning in BPF FS. Such change ensures that BPF
> > > > > token is always associated with a specific instance of BPF FS and cannot
> > > > > "escape" it by application re-pinning it somewhere else using another
> > > > > BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation,
> > > > > better containing it inside intended container (under assumption BPF FS is set
> > > > > up in such a way as to not be shared with other containers on the system).
> > > > >
> > > > >   [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
> > > > >   [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> > > > >   [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
> > > > >
> > > > > v3->v3-resend:
> > > > >   - I started integrating token_fd into bpf_object_open_opts and higher-level
> > > > >     libbpf bpf_object APIs, but it started going a bit deeper into bpf_object
> > > > >     implementation details and how libbpf performs feature detection and
> > > > >     caching, so I decided to keep it separate from this patch set and not
> > > > >     distract from the mostly kernel-side changes;
> > > > > v2->v3:
> > > > >   - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
> > > > >     BPF_OBJ_PIN for BPF token;
> > > > > v1->v2:
> > > > >   - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
> > > > >   - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
> > > > >
> > > > > Andrii Nakryiko (14):
> > > > >   bpf: introduce BPF token object
> > > > >   libbpf: add bpf_token_create() API
> > > > >   selftests/bpf: add BPF_TOKEN_CREATE test
> > > > >   bpf: add BPF token support to BPF_MAP_CREATE command
> > > > >   libbpf: add BPF token support to bpf_map_create() API
> > > > >   selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
> > > > >   bpf: add BPF token support to BPF_BTF_LOAD command
> > > > >   libbpf: add BPF token support to bpf_btf_load() API
> > > > >   selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
> > > > >   bpf: add BPF token support to BPF_PROG_LOAD command
> > > > >   bpf: take into account BPF token when fetching helper protos
> > > > >   bpf: consistenly use BPF token throughout BPF verifier logic
> > > > >   libbpf: add BPF token support to bpf_prog_load() API
> > > > >   selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests
> > > > >
> > > > >  drivers/media/rc/bpf-lirc.c                   |   2 +-
> > > > >  include/linux/bpf.h                           |  79 ++++-
> > > > >  include/linux/filter.h                        |   2 +-
> > > > >  include/uapi/linux/bpf.h                      |  53 ++++
> > > > >  kernel/bpf/Makefile                           |   2 +-
> > > > >  kernel/bpf/arraymap.c                         |   2 +-
> > > > >  kernel/bpf/cgroup.c                           |   6 +-
> > > > >  kernel/bpf/core.c                             |   3 +-
> > > > >  kernel/bpf/helpers.c                          |   6 +-
> > > > >  kernel/bpf/inode.c                            |  46 ++-
> > > > >  kernel/bpf/syscall.c                          | 183 +++++++++---
> > > > >  kernel/bpf/token.c                            | 201 +++++++++++++
> > > > >  kernel/bpf/verifier.c                         |  13 +-
> > > > >  kernel/trace/bpf_trace.c                      |   2 +-
> > > > >  net/core/filter.c                             |  36 +--
> > > > >  net/ipv4/bpf_tcp_ca.c                         |   2 +-
> > > > >  net/netfilter/nf_bpf_link.c                   |   2 +-
> > > > >  tools/include/uapi/linux/bpf.h                |  53 ++++
> > > > >  tools/lib/bpf/bpf.c                           |  35 ++-
> > > > >  tools/lib/bpf/bpf.h                           |  45 ++-
> > > > >  tools/lib/bpf/libbpf.map                      |   1 +
> > > > >  .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
> > > > >  .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
> > > > >  .../testing/selftests/bpf/prog_tests/token.c  | 277 ++++++++++++++++++
> > > > >  24 files changed, 957 insertions(+), 104 deletions(-)
> > > > >  create mode 100644 kernel/bpf/token.c
> > > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c
> > > > >
> > > > > --
> > > > > 2.34.1
> > > > >
> > > > >
> > > >
> > > >
> > > > Hi Andrii,
> > > >
> > > > Thanks for your proposal.
> > > > That seems to be a useful functionality, and yet I have some questions.
> > >
> > > I've answered them below. But I don't think either of them have any
> > > relation to BPF token and the problem I'm trying to solve.
> > >
> > > >
> > > > 1. Why can't we add security_bpf_probe_read_{kernel,user}?
> > > >     If possible, we can use these LSM hooks to refuse the process to
> > > > read other tasks' information. E.g. if the other process is not within
> > > > the same cgroup or the same namespace, we just refuse the reading. I
> > > > think it is not hard to identify if the other process is within the
> > > > same cgroup or the same namespace.
> > >
> > > There are probably many reasons. First, performance-wide, LSM hook for
> > > each bpf_probe_read_{kernel,user}() call will be prohibitive. And just
> > > in general, one would need to be very careful with such LSM hooks,
> > > because bpf_probe_read_{kernel,user}() often happens from NMI context,
> > > and LSM policy would have to be written and validated very carefully
> > > with NMI context in mind.
> > >
> > > But, more conceptually, for probe_read you get a random address and
> > > you know the process context you are running in (but you might be
> > > actually running in softirq and NMI, and that process context is
> > > irrelevant). How can you efficiently (or at all) tell if that random
> > > address "belongs" to cgroup or namespace? Just at conceptual level?
> > >
> > > >
> > > > 2. Why can't we extend bpf_cookie?
> > > >    We're now using bpf_cookie to identify each user or each
> > > > application, and only the permitted cookies can create new probe
> > > > links.  However we find the bpf_cookie is only supported by tracing,
> > > > perf_event and kprobe_multi, so we're planning to extend it to other
> > > > possible link types, then we can use LSM hooks to control all bpf
> > > > links.  I think that the upstream kernel should also support
> > > > bpf_cookie for all bpf links. If possible, we will post it to the
> > > > upstream in the future.
> > > >    After I have read your BPF token proposal, I just have some other
> > > > ideas. Why can't we just extend bpf_cookie to all other BPF objects?
> > > > For example, all progs and maps should also have the bpf_cookie.
> > > >
> > >
> > > I'm not exactly clear how you use BPF cookie, but it wasn't intended
> > > to provide any sort of security or validation policy. It's purely a
> > > user-provided u64 to help distinguish different attach points when the
> > > same BPF program is attached in multiple places (e.g., kprobe tracing
> > > many different kernel functions and needing to distinguish between
> > > them at runtime).
> >
> > In our container environment, we enable the CAP_BPF, CAP_PERMON and
> > CAP_NET_ADMIN for the containers which want to run BPF programs
> > inside. However we don't want them to run whatever BPF programs they
> > want. We only allow them to run the BPF programs we have permitted for
> > each of them.  So we are using LSM to audit the BPF behavior such as
> > prog load, map creation and link attach.  We define different BPF
> > policies for different containers. In order to identify different
> > containers efficiently, we assign different bpf_cookies for different
> > containers. bpf_cookie is a u64, that's enough for our use cases.
>
> I can see how you can use BPF cookies for this, but it's certainly not
> an intended use case :) BPF cookie is most useful on BPF side of
> things.

The utilization of the bpf_cookie appid in our use case has proven to
be valuable, thus we continue to rely on its functionality :)

>
> But what you are describing is meant to be doable with BPF token. It's
> not in first patch set, but I intended to allow user to specify an
> extra "user context" blog of bytes which would be stored with BPF
> token. And this data should be accessible from BPF LSM programs to
> make extra custom policy decisions. But we need to agree on initial
> BPF token stuff first, and then build out all the rest.

Sounds good. Introducing support for user context within the BPF token
would enhance its utility and provide even more valuable
functionality.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-06 20:37             ` Andrii Nakryiko
@ 2023-07-07 13:04               ` Toke Høiland-Jørgensen
  2023-07-07 17:58                 ` Andrii Nakryiko
  0 siblings, 1 reply; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-07-07 13:04 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Christian Brauner, Paul Moore, Andrii Nakryiko, bpf,
	linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, sargun

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Thu, Jul 6, 2023 at 4:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > Having it as a separate single-purpose FS seems cleaner, because we
>> > have use cases where we'd have one BPF FS instance created for a
>> > container by our container manager, and then exposing a few separate
>> > tokens with different sets of allowed functionality. E.g., one for
>> > main intended workload, another for some BPF-based observability
>> > tools, maybe yet another for more heavy-weight tools like bpftrace for
>> > extra debugging. In the debugging case our container infrastructure
>> > will be "evacuating" any other workloads on the same host to avoid
>> > unnecessary consequences. The point is to not disturb
>> > workload-under-human-debugging as much as possible, so we'd like to
>> > keep userns intact, which is why mounting extra (more permissive) BPF
>> > token inside already running containers is an important consideration.
>>
>> This example (as well as Yafang's in the sibling subthread) makes it
>> even more apparent to me that it would be better with a model where the
>> userspace policy daemon can just make decisions on each call directly,
>> instead of mucking about with different tokens with different embedded
>> permissions. Why not go that route (see my other reply for details on
>> what I mean)?
>
> I don't know how you arrived at this conclusion,

Because it makes it apparent that you're basically building a policy
engine in the kernel with this...

> but we've debated BPF proxying and separate service at length, there
> is no point in going on another round here.

You had some objections to explicit proxying via RPC calls; I suggested
a way of avoiding that by keeping the kernel in the loop, which you have
not responded to. If you're just going to go ahead with your solution
over any objections you could just have stated so from the beginning and
saved us all a lot of time :/

Can we at least put this thing behind a kconfig option, so we can turn
it off in distro kernels?

> Per-call decisions can be achieved nicely by employing BPF LSM in a
> restrictive manner on top of BPF token (or no token, if you are ok
> without user namespaces).

Building a deficient security delegation mechanism and saying "you can
patch things up using an LSM" is a terrible design, though. Also, this
still means you have to implement all the policy checks in the kernel
(just in BPF) which is awkward at best.

-Toke


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-07 13:04               ` Toke Høiland-Jørgensen
@ 2023-07-07 17:58                 ` Andrii Nakryiko
  2023-07-07 22:00                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 48+ messages in thread
From: Andrii Nakryiko @ 2023-07-07 17:58 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Christian Brauner, Paul Moore, Andrii Nakryiko, bpf,
	linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, sargun

On Fri, Jul 7, 2023 at 6:04 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Thu, Jul 6, 2023 at 4:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >>
> >> > Having it as a separate single-purpose FS seems cleaner, because we
> >> > have use cases where we'd have one BPF FS instance created for a
> >> > container by our container manager, and then exposing a few separate
> >> > tokens with different sets of allowed functionality. E.g., one for
> >> > main intended workload, another for some BPF-based observability
> >> > tools, maybe yet another for more heavy-weight tools like bpftrace for
> >> > extra debugging. In the debugging case our container infrastructure
> >> > will be "evacuating" any other workloads on the same host to avoid
> >> > unnecessary consequences. The point is to not disturb
> >> > workload-under-human-debugging as much as possible, so we'd like to
> >> > keep userns intact, which is why mounting extra (more permissive) BPF
> >> > token inside already running containers is an important consideration.
> >>
> >> This example (as well as Yafang's in the sibling subthread) makes it
> >> even more apparent to me that it would be better with a model where the
> >> userspace policy daemon can just make decisions on each call directly,
> >> instead of mucking about with different tokens with different embedded
> >> permissions. Why not go that route (see my other reply for details on
> >> what I mean)?
> >
> > I don't know how you arrived at this conclusion,
>
> Because it makes it apparent that you're basically building a policy
> engine in the kernel with this...

I disagree that this is a policy engine in the kernel. It's a building
block for delegation and enforcement. The policy itself is implemented
in user-space by a privileged process that decides when to issue BPF
tokens and of which configuration. And, optionally and if necessary,
further restricting using BPF LSM in a more fine-grained and dynamic
way.

>
> > but we've debated BPF proxying and separate service at length, there
> > is no point in going on another round here.
>
> You had some objections to explicit proxying via RPC calls; I suggested
> a way of avoiding that by keeping the kernel in the loop, which you have

I thought we settled the seccomp notify proposal?

> not responded to. If you're just going to go ahead with your solution
> over any objections you could just have stated so from the beginning and
> saved us all a lot of time :/

It would also be good to understand that yours is but one of the
opinions. If you read the thread carefully you'll see that other
people have differing opinions. And yours doesn't necessarily have to
be the deciding one.

I appreciate the feedback, but I don't appreciate the expectation that
your feedback is binding in any way.

>
> Can we at least put this thing behind a kconfig option, so we can turn
> it off in distro kernels?

Why can't distro disable this in some more dynamic way, though? With
existing LSM mechanism, sysctl, whatever? I think it would be useful
to let users have control over this and decide for themselves without
having to rebuild a custom kernel.

>
> > Per-call decisions can be achieved nicely by employing BPF LSM in a
> > restrictive manner on top of BPF token (or no token, if you are ok
> > without user namespaces).
>
> Building a deficient security delegation mechanism and saying "you can
> patch things up using an LSM" is a terrible design, though. Also, this

A bunch of people disagree with you.

> still means you have to implement all the policy checks in the kernel
> (just in BPF) which is awkward at best.

"Patch things up using an LSM", if necessary, in a restrictive manner
is what LSM folks prefer. You are also assuming that it's always
necessary, and I'm saying that in lots of practical contexts LSM won't
be even necessary.

>
> -Toke
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-07 17:58                 ` Andrii Nakryiko
@ 2023-07-07 22:00                   ` Toke Høiland-Jørgensen
  2023-07-07 23:58                     ` Andrii Nakryiko
  0 siblings, 1 reply; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-07-07 22:00 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Christian Brauner, Paul Moore, Andrii Nakryiko, bpf,
	linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, sargun

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Fri, Jul 7, 2023 at 6:04 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Thu, Jul 6, 2023 at 4:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> >>
>> >> > Having it as a separate single-purpose FS seems cleaner, because we
>> >> > have use cases where we'd have one BPF FS instance created for a
>> >> > container by our container manager, and then exposing a few separate
>> >> > tokens with different sets of allowed functionality. E.g., one for
>> >> > main intended workload, another for some BPF-based observability
>> >> > tools, maybe yet another for more heavy-weight tools like bpftrace for
>> >> > extra debugging. In the debugging case our container infrastructure
>> >> > will be "evacuating" any other workloads on the same host to avoid
>> >> > unnecessary consequences. The point is to not disturb
>> >> > workload-under-human-debugging as much as possible, so we'd like to
>> >> > keep userns intact, which is why mounting extra (more permissive) BPF
>> >> > token inside already running containers is an important consideration.
>> >>
>> >> This example (as well as Yafang's in the sibling subthread) makes it
>> >> even more apparent to me that it would be better with a model where the
>> >> userspace policy daemon can just make decisions on each call directly,
>> >> instead of mucking about with different tokens with different embedded
>> >> permissions. Why not go that route (see my other reply for details on
>> >> what I mean)?
>> >
>> > I don't know how you arrived at this conclusion,
>>
>> Because it makes it apparent that you're basically building a policy
>> engine in the kernel with this...
>
> I disagree that this is a policy engine in the kernel. It's a building
> block for delegation and enforcement. The policy itself is implemented
> in user-space by a privileged process that decides when to issue BPF
> tokens and of which configuration. And, optionally and if necessary,
> further restricting using BPF LSM in a more fine-grained and dynamic
> way.

Right, and I'm saying that it's too coarse-grained to be a proper
building block in its own right. As evidenced by the need for adding an
LSM on top to do anything fine-grained; a task which is decidedly
non-trivial to get right, BTW. Which means that the path of least
resistance is going to be to just grant a token and not bother with the
LSM, thus ending up with this being a giant foot gun from a security
PoV.

>> > but we've debated BPF proxying and separate service at length, there
>> > is no point in going on another round here.
>>
>> You had some objections to explicit proxying via RPC calls; I suggested
>> a way of avoiding that by keeping the kernel in the loop, which you have
>
> I thought we settled the seccomp notify proposal?

Your objection to that was that it was too much of a hack to read all
the target process memory (etc) from the policy daemon, which I
acknowledged and suggested a way of keeping the kernel in the loop so it
can take responsibility for the gnarly bits while still allowing
userspace to actually make the decision:

https://lore.kernel.org/r/87v8ezb6x5.fsf@toke.dk

(Last two paragraphs). Maybe that message just got lost somewhere on its
way to your inbox?

>> not responded to. If you're just going to go ahead with your solution
>> over any objections you could just have stated so from the beginning and
>> saved us all a lot of time :/
>
> It would also be good to understand that yours is but one of the
> opinions. If you read the thread carefully you'll see that other
> people have differing opinions. And yours doesn't necessarily have to
> be the deciding one.
>
> I appreciate the feedback, but I don't appreciate the expectation that
> your feedback is binding in any way.

I'm not expecting veto rights, I'm objecting to being ignored. The way
this development process is *supposed* to work (as far as I'm concerned)
is that someone proposes a patch series, the community provides
feedback, and discussion proceeds until there's at least rough consensus
that the solution we've arrived at is the right way forward.

If you're going to cut that process short and just pick and choose which
comments are worth addressing and which are not, I can't stop you,
obviously; but at least do me the favour of being up front about it so I
can stop wasting my time trying to be constructive.

Anyhow, I guess this point is moot for this discussion since I'm about
to leave for vacation for four weeks and won't be able to follow up on
this. Apologies for the bad timing :/ I'll ping some RH folks and try to
get them to keep an eye on this while I'm away...

>> Can we at least put this thing behind a kconfig option, so we can turn
>> it off in distro kernels?
>
> Why can't distro disable this in some more dynamic way, though? With
> existing LSM mechanism, sysctl, whatever? I think it would be useful
> to let users have control over this and decide for themselves without
> having to rebuild a custom kernel.

A sysctl similar to the existing one for unprivileged BPF would be fine
as well. If an LSM ends up being the only way to control it, though,
that will carry so much operational overhead for us to get to a working
state that it'll most likely be simpler to just patch it out of the
kernel.

-Toke


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-07 22:00                   ` Toke Høiland-Jørgensen
@ 2023-07-07 23:58                     ` Andrii Nakryiko
  2023-07-10 23:42                       ` Djalal Harouni
  0 siblings, 1 reply; 48+ messages in thread
From: Andrii Nakryiko @ 2023-07-07 23:58 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Christian Brauner, Paul Moore, Andrii Nakryiko, bpf,
	linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, sargun

On Fri, Jul 7, 2023 at 3:00 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Fri, Jul 7, 2023 at 6:04 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >>
> >> > On Thu, Jul 6, 2023 at 4:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >>
> >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >> >>
> >> >> > Having it as a separate single-purpose FS seems cleaner, because we
> >> >> > have use cases where we'd have one BPF FS instance created for a
> >> >> > container by our container manager, and then exposing a few separate
> >> >> > tokens with different sets of allowed functionality. E.g., one for
> >> >> > main intended workload, another for some BPF-based observability
> >> >> > tools, maybe yet another for more heavy-weight tools like bpftrace for
> >> >> > extra debugging. In the debugging case our container infrastructure
> >> >> > will be "evacuating" any other workloads on the same host to avoid
> >> >> > unnecessary consequences. The point is to not disturb
> >> >> > workload-under-human-debugging as much as possible, so we'd like to
> >> >> > keep userns intact, which is why mounting extra (more permissive) BPF
> >> >> > token inside already running containers is an important consideration.
> >> >>
> >> >> This example (as well as Yafang's in the sibling subthread) makes it
> >> >> even more apparent to me that it would be better with a model where the
> >> >> userspace policy daemon can just make decisions on each call directly,
> >> >> instead of mucking about with different tokens with different embedded
> >> >> permissions. Why not go that route (see my other reply for details on
> >> >> what I mean)?
> >> >
> >> > I don't know how you arrived at this conclusion,
> >>
> >> Because it makes it apparent that you're basically building a policy
> >> engine in the kernel with this...
> >
> > I disagree that this is a policy engine in the kernel. It's a building
> > block for delegation and enforcement. The policy itself is implemented
> > in user-space by a privileged process that decides when to issue BPF
> > tokens and of which configuration. And, optionally and if necessary,
> > further restricting using BPF LSM in a more fine-grained and dynamic
> > way.
>
> Right, and I'm saying that it's too coarse-grained to be a proper

CAP_BPF, CAP_PERFMON, CAP_SYS_ADMIN, CAP_NET_ADMIN are also very
coarse grained. And somehow we get by and make do with them outside of
the user namespace use case.

> building block in its own right. As evidenced by the need for adding an
> LSM on top to do anything fine-grained; a task which is decidedly

There is no *need* to add LSM. For tons of practical use cases you
won't need it. Yes, people will make a decision whether they even have
to bother with more fine grained controls. And if yes, LSM is there to
provide it.

> non-trivial to get right, BTW. Which means that the path of least
> resistance is going to be to just grant a token and not bother with the
> LSM, thus ending up with this being a giant foot gun from a security
> PoV.

If there is no need for LSM, yes, and I think it's totally acceptable.
It will be up to users to decide.

>
> >> > but we've debated BPF proxying and separate service at length, there
> >> > is no point in going on another round here.
> >>
> >> You had some objections to explicit proxying via RPC calls; I suggested
> >> a way of avoiding that by keeping the kernel in the loop, which you have
> >
> > I thought we settled the seccomp notify proposal?
>
> Your objection to that was that it was too much of a hack to read all
> the target process memory (etc) from the policy daemon, which I
> acknowledged and suggested a way of keeping the kernel in the loop so it
> can take responsibility for the gnarly bits while still allowing
> userspace to actually make the decision:
>

Your proposal for some new mechanism for blocking bpf() syscall to let
another user space process make decision and somehow provide all the
necessary data to make this decision without that process needing to
read original process' memory (so presumably kernel will make a copy
of BPF program instructions, BTF contents, all the strings, etc, etc?)
sounded more like a joke and just a contrarian way to provide *any*
alternative, just to disagree with the much simpler and more
straightforward proposal.

I encourage you to spend some time prototyping this new mechanism,
sending RFC and gathering community feedback before using this
handwavy idea as an excuse to block BPF token-like mechanism. I'll be
curious to read the discussion on how it's different from
authoritative LSM, seccomp notify, etc, etc.

> https://lore.kernel.org/r/87v8ezb6x5.fsf@toke.dk
>
> (Last two paragraphs). Maybe that message just got lost somewhere on its
> way to your inbox?
>
> >> not responded to. If you're just going to go ahead with your solution
> >> over any objections you could just have stated so from the beginning and
> >> saved us all a lot of time :/
> >
> > It would also be good to understand that yours is but one of the
> > opinions. If you read the thread carefully you'll see that other
> > people have differing opinions. And yours doesn't necessarily have to
> > be the deciding one.
> >
> > I appreciate the feedback, but I don't appreciate the expectation that
> > your feedback is binding in any way.
>
> I'm not expecting veto rights, I'm objecting to being ignored. The way

You are not being ignored. We are just disagreeing. There is a
difference. BPF proxying was discussed at length and people who manage
large sets of BPF applications voiced their concerns. Security
concerns you have for BPF token are just as applicable to CAP_BPF and
other caps. BPF token actually allows to drop those very
coarse-grained capabilities in a bunch of circumstances and overall
improve the security. Also note, there were security folks in the
discussion which seem to be fine with the BPF token approach, overall.

You don't like my (and others') answers. That's fine, but please don't
pretend like you are being ignored.

> this development process is *supposed* to work (as far as I'm concerned)
> is that someone proposes a patch series, the community provides
> feedback, and discussion proceeds until there's at least rough consensus
> that the solution we've arrived at is the right way forward.

Rough consensus, not 100% consensus, though?.. There will always be
someone who disagrees.

>
> If you're going to cut that process short and just pick and choose which

Yep, clearly, going into the 3rd month of discussions (starting from
LSF/MM, and I don't even include the authoritative LSM discussions
before that) is cutting this process very short, of course.

> comments are worth addressing and which are not, I can't stop you,
> obviously; but at least do me the favour of being up front about it so I
> can stop wasting my time trying to be constructive.

I wouldn't say that a proposal like "some seccomp-notify-like
mechanism to let another process decide if bpf() syscall should
proceed" with not much effort put into thinking about how it should be
done specifically and whether it's actually a better approach was very
constructive. And it felt self-evident that it's not a good way,
especially after Christian himself said that the seccomp-based
approach is also not a good generic solution. Your proposal was just a
weird bpf()-specific (and not very well specified) twist on the
seccomp notify idea. But as I said above, give it a try, perhaps I'm
mistaken and the BPF community would love the idea and implementation.

>
> Anyhow, I guess this point is moot for this discussion since I'm about
> to leave for vacation for four weeks and won't be able to follow up on
> this. Apologies for the bad timing :/ I'll ping some RH folks and try to
> get them to keep an eye on this while I'm away...

Enjoy your vacation!

>
> >> Can we at least put this thing behind a kconfig option, so we can turn
> >> it off in distro kernels?
> >
> > Why can't distro disable this in some more dynamic way, though? With
> > existing LSM mechanism, sysctl, whatever? I think it would be useful
> > to let users have control over this and decide for themselves without
> > having to rebuild a custom kernel.
>
> A sysctl similar to the existing one for unprivileged BPF would be fine
> as well. If an LSM ends up being the only way to control it, though,
> that will carry so much operational overhead for us to get to a working
> state that it'll most likely be simpler to just patch it out of the
> kernel.

Sounds good, I will add sysctl for the next version.

>
> -Toke
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-07 23:58                     ` Andrii Nakryiko
@ 2023-07-10 23:42                       ` Djalal Harouni
  0 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2023-07-10 23:42 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Toke Høiland-Jørgensen, Christian Brauner, Paul Moore,
	Andrii Nakryiko, bpf, linux-security-module, keescook, lennart,
	cyphar, luto, kernel-team, sargun

On Sat, Jul 8, 2023 at 1:59 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
...
> > >
> > > Why can't distro disable this in some more dynamic way, though? With
> > > existing LSM mechanism, sysctl, whatever? I think it would be useful
> > > to let users have control over this and decide for themselves without
> > > having to rebuild a custom kernel.
> >
> > A sysctl similar to the existing one for unprivileged BPF would be fine
> > as well. If an LSM ends up being the only way to control it, though,
> > that will carry so much operational overhead for us to get to a working
> > state that it'll most likely be simpler to just patch it out of the
> > kernel.
>
> Sounds good, I will add sysctl for the next version.

What would be the purpose of the sysctl? or a kconfig? AFAICT the
operation is still privileged, and it's an opt-in? anyway...

It is obvious that this should be part of the BPF core... The other
user space proxy solution tries to solve another use case competing
with LSMs. It won't be able to handle the full context (or today's
nested workload) at bpf() call time... There are obvious reasons why
LSMs do exist...

Thanks for agreeing that it should be attached to the user namespace
at creation time as it is crucial to get it right... and Christian
(thanks BTW ;-) ) maybe we make it walk user ns list up to parent and
allow the token if it's coming from a parent namespace that is part of
the same hierarchy, then theoretically the parent ns is more
privileged...  will check again and reply to the corresponding email.

Thanks!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-05 21:38         ` Andrii Nakryiko
  2023-07-06 11:32           ` Toke Høiland-Jørgensen
@ 2023-07-11 13:33           ` Christian Brauner
  2023-07-11 22:06             ` Andrii Nakryiko
  1 sibling, 1 reply; 48+ messages in thread
From: Christian Brauner @ 2023-07-11 13:33 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Paul Moore, Andrii Nakryiko, bpf, linux-security-module,
	keescook, lennart, cyphar, luto, kernel-team, sargun

On Wed, Jul 05, 2023 at 02:38:43PM -0700, Andrii Nakryiko wrote:
> On Wed, Jul 5, 2023 at 7:42 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Wed, Jul 05, 2023 at 10:16:13AM -0400, Paul Moore wrote:
> > > On Tue, Jul 4, 2023 at 8:44 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
> > > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to to
> > > > > allow delegating privileged BPF functionality, like loading a BPF
> > > > > program or creating a BPF map, from privileged process to a *trusted*
> > > > > unprivileged process, all while have a good amount of control over which
> > > > > privileged operations could be performed using provided BPF token.
> > > > >
> > > > > This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
> > > > > allows to create a new BPF token object along with a set of allowed
> > > > > commands that such BPF token allows to unprivileged applications.
> > > > > Currently only BPF_TOKEN_CREATE command itself can be
> > > > > delegated, but other patches gradually add ability to delegate
> > > > > BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
> > > > >
> > > > > The above means that new BPF tokens can be created using existing BPF
> > > > > token, if original privileged creator allowed BPF_TOKEN_CREATE command.
> > > > > New derived BPF token cannot be more powerful than the original BPF
> > > > > token.
> > > > >
> > > > > Importantly, BPF token is automatically pinned at the specified location
> > > > > inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
> > > > > command, unlike BPF prog/map/btf/link. This provides more control over
> > > > > unintended sharing of BPF tokens through pinning it in another BPF FS
> > > > > instances.
> > > > >
> > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > > ---
> > > >
> > > > The main issue I have with the token approach is that it is a completely
> > > > separate delegation vector on top of user namespaces. We mentioned this
> > > > duringthe conf and this was brought up on the thread here again as well.
> > > > Imho, that's a problem both security-wise and complexity-wise.
> > > >
> > > > It's not great if each subsystem gets its own custom delegation
> > > > mechanism. This imposes such a taxing complexity on both kernel- and
> > > > userspace that it will quickly become a huge liability. So I would
> > > > really strongly encourage you to explore another direction.
> 
> Alright, thanks a lot for elaborating. I did want to keep everything
> contained to bpf() for various reasons, but it seems like I won't be
> able to get away with this. :)
> 
> > > >
> > > > I do think the spirit of your proposal is workable and that it can
> > > > mostly be kept in tact.
> 
> It's good to know that at least conceptually you support the idea of
> BPF delegation. I have a few more specific questions below and I'd
> appreciate your answers, as I have less familiarity with how exactly
> container managers do stuff at container bootstrapping stage.
> 
> But first, let's try to get some tentative agreement on design before
> I go and implement the BPF-token-as-FS idea. I have basically just two
> gripes with exact details of what you are proposing, so let me explain
> which and why, and see if we can find some common ground.

Just fyi, there'll likely be some delays in my replies bc first I need
to think about it and second floods of mails. I'll be on vacation for
starting end of this week.

> 
> First, the idea of coupling and bundling this "delegation" option with
> BPF FS doesn't feel right. BPF FS is just a container of BPF objects,
> so adding to it a new property of allowing to use privileged BPF
> functionality seems a bit off.

Fwiw, I have a series that makes it possible to delegate a superblock of
a filesystem to a user namespace using the new mount api introducing a
vfs generic "delegate" mount option. So this won't be a special bpf
thing. This is generally useful.

> 
> Why not just create a new separate FS, let's code-name it "BPF Token
> FS" for now (naming suggestions are welcome). Such BPF Token FS would
> be dedicated to specifying everything about what's allowable through
> BPF, just like my BPF token implementation. It can then be
> mounted/bind-mounted inside BPF FS (or really, anywhere, it's just a
> FS, right?). User application would open it (I'm guessing with
> open_tree(), right?) and pass it as token_fd to bpf() syscall.
> 
> Having it as a separate single-purpose FS seems cleaner, because we
> have use cases where we'd have one BPF FS instance created for a
> container by our container manager, and then exposing a few separate
> tokens with different sets of allowed functionality. E.g., one for
> main intended workload, another for some BPF-based observability
> tools, maybe yet another for more heavy-weight tools like bpftrace for
> extra debugging. In the debugging case our container infrastructure
> will be "evacuating" any other workloads on the same host to avoid
> unnecessary consequences. The point is to not disturb
> workload-under-human-debugging as much as possible, so we'd like to
> keep userns intact, which is why mounting extra (more permissive) BPF
> token inside already running containers is an important consideration.
> 
> With such goals, it seems nicer to have a single BPF FS, and few BPF
> token FSs mounted inside it. Yes, we could bundle token functionality
> with BPF FS, but separating those two seems cleaner to me. WDYT?

It seems that writing a pseudo filesystem for the kernel is some right
of passage that every kernel developer wants to go through for some
reason. It's not mandatory though, it's actually discouraged.

Joking aside.
I think the danger lies in adding more and more moving parts and
fragmenting this into so many moving pieces that it's hard to see the
bigger picture and have a clear sense of the API.

> 
> Second, mount options usage. I'm hearing stories from our production
> folks how some new mount options (on some other FS, not BPF FS) were
> breaking tools unintentionally during kernel/tooling
> upgrades/downgrades, so it makes me a bit hesitant to have these
> complicated sets of mount options to specify parameters of
> BPF-token-as-FS. I've been thinking a bit, and I'm starting to lean

I don't see this as a good argument for a new pseudo filesystem. It
implies that any new filesystem would end up with the same problem. The
answer here would be to report and fix such bugs.

> towards the idea of allowing to set up (and modify as well) all these
> allowed maps/progs/attach types through special auto-created files
> within BPF token FS. Something like below:
> 
> # pwd
> /sys/fs/bpf/workload-token
> # ls
> allowed_cmds allowed_map_types allowed_prog_types allowed_attach_types
> # echo "BPF_PROG_LOAD" > allowed_cmds
> # echo "BPF_PROG_TYPE_KPROBE" >> allowed_prog_types
> ...
> # cat allowed_prog_types
> BPF_PROG_TYPE_KPROBE,BPF_PROG_TYPE_TRACEPOINT
> 
> 
> The above is fake (I haven't implemented anything yet), but hopefully
> works as a demonstration. We'll also need to make sure that inside
> non-init userns these files are read-only or allow to just further
> restrict the subset of allowed functionality, never extend it.

This implementation would get you into the business of write-time
permission checks. And this almost always means you should use an
ioctl(), not a write() operation on these files.

> 
> Such an approach will actually make it simpler to test and experiment
> with this delegation locally, will make it trivial to observe what's
> allowed from simple shell scripts, etc, etc. With fsmount() and O_PATH
> it will be possible to set everything up from privileged processes
> before ever exposing a BPF Token FS instance through a file system, if
> there are any concerns about racing with user space.
> 
> That's the high-level approach I'm thinking of right now. Would that
> work? How critical is it to reuse BPF FS itself and how important to
> you is to rely on mount options vs special files as described above?

In the end, it's your api and you need to live with it and support it.
What is important is that we don't end up with security issues. The
special files thing will work but be aware that write-time permission
checking is nasty:
* https://git.zx2c4.com/CVE-2012-0056/about/ (Thanks to Aleksa for the link.)
* commit e57457641613 ("cgroup: Use open-time cgroup namespace for process migration perm checks")
There's a lot more. It can be done but it needs stringent permission
checking and an ioctl() is probably the way to go in this case.

Another thing, if you split configuration over multiple files you can
end up introducing race windows. This is a common complaint with cgroups
and sysfs whenever configuration of something is split over multiple
files. It gets especially hairy if the options interact with each other
somehow.

> Hopefully not critical, and I can start working on it, and we'll get
> what you want with using FS as a vehicle for delegation, while
> allowing some of the intended use cases that we have in mind in a bit
> cleaner fashion?
> 
> > > >
> > > > As mentioned before, bpffs has all the means to be taught delegation:
> > > >
> > > >         // In container's user namespace
> > > >         fd_fs = fsopen("bpffs");
> > > >
> > > >         // Delegating task in host userns (systemd-bpfd whatever you want)
> > > >         ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "delegate", ...);
> > > >
> > > >         // In container's user namespace
> > > >         fd_mnt = fsmount(fd_fs, 0);
> > > >
> > > >         ret = move_mount(fd_fs, "", -EBADF, "/my/fav/location", MOVE_MOUNT_F_EMPTY_PATH)
> > > >
> > > > Roughly, this would mean:
> > > >
> > > > (i) raise FS_USERNS_MOUNT on bpffs but guard it behind the "delegate"
> > > >     mount option. IOW, it's only possibly to mount bpffs as an
> > > >     unprivileged user if a delegating process like systemd-bpfd with
> > > >     system-level privileges has marked it as delegatable.
> 
> Regarding the FS_USERNS_MOUNT flag and fsopen() happening from inside
> the user namespace. Am I missing something subtle and important here,
> why does it have to happen inside the container's user namespace?
> Can't the container manager both fsopen() and fsconfig() everything in
> host userns, and only then fsmount+move_mount inside the container's
> userns? Just trying to understand if there is some important early
> association of userns happening at early steps here?

The mount api _currently_ works very roughly like this: if a filesytem
is FS_USERNS_MOUNT enabled fsopen() records the user namespace of the
caller. The recorded userns will later become the owning userns of the
filesystem's superblock (Without going into detail: owning userns of a
superblock != owning userns of a mount. move_mount() on a detached mount
is about the latter.).

I have a patchset that adds a generic "delegate" mount option which will
allow a sufficiently privileged process to do the following:

        fd_fs = fsopen("ext4");
        
        /*
	 * Set owning namespace of the filesystem's superblock.
         * Caller must be privileged over @fd_userns.
         *
	 * Note, must be first mount option to ensure that possible
	 * follow-up ermission checks for other mount options are done
	 * on the final owning namespace.
         */
        fsconfig(fd_fs, FSCONFIG_SET_FD, "delegate", NULL, fd_userns);
        
        /*
         * * If fs is FS_USERNS_MOUNT then permission is checked in @fd_userns.
         * * If fs is not FS_USERNS_MOUNT then permission is check in @init_user_ns.
         *   (Privilege in @init_user_ns implies privilege over @fd_userns.)
         */
        fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, 0);

After this, the sb is owned by @fd_userns. Currently my draft restricts
this to such filesystems that raise FS_ALLOW_IDMAP because they almost
can support delegation and don't need to be checked for any potential
issues. But bpffs could easily support this (without caring about
FS_ALLOW_IDMAP).

> 
> Also, in your example above, move_mount() should take fd_mnt, not fd_fs, right?
> 
> > > > (ii) add fine-grained delegation options that you want this
> > > >      bpffs instance to allow via new mount options. Idk,
> > > >
> > > >      // allow usage of foo
> > > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "foo");
> > > >
> > > >      // also allow usage of bar
> > > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "bar");
> > > >
> > > >      // reset allowed options
> > > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "");
> > > >
> > > >      // allow usage of schmoo
> > > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "schmoo");
> > > >
> > > > This all seems more intuitive and integrates with user and mount
> > > > namespaces of the container. This can also work for restricting
> > > > non-userns bpf instances fwiw. You can also share instances via
> > > > bind-mount and so on. The userns of the bpffs instance can also be used
> > > > for permission checking provided a given functionality has been
> > > > delegated by e.g., systemd-bpfd or whatever.
> > >
> > > I have no arguments against any of the above, and would prefer to see
> > > something like this over a token-based mechanism.  However we do want
> > > to make sure we have the proper LSM control points for either approach
> > > so that admins who rely on LSM-based security policies can manage
> > > delegation via their policies.
> > >
> > > Using the fsconfig() approach described by Christian above, I believe
> > > we should have the necessary hooks already in
> > > security_fs_context_parse_param() and security_sb_mnt_opts() but I'm
> > > basing that on a quick look this morning, some additional checking
> > > would need to be done.
> >
> > I think what I outlined is even unnecessarily complicated. You don't
> > need that pointless "delegate" mount option at all actually. Permission
> > to delegate shouldn't be checked when the mount option is set. The
> > permissions should be checked when the superblock is created. That's the
> > right point in time. So sm like:
> >
> 
> I think this gets even more straightforward with BPF Token FS being a
> separate one, right? Given BPF Token FS is all about delegation, it
> has to be a privileged operation to even create it.
> 
> > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> > index 4174f76133df..a2eb382f5457 100644
> > --- a/kernel/bpf/inode.c
> > +++ b/kernel/bpf/inode.c
> > @@ -746,6 +746,13 @@ static int bpf_fill_super(struct super_block *sb, struct fs_context *fc)
> >         struct inode *inode;
> >         int ret;
> >
> > +       /*
> > +        * If you want to delegate this instance then you need to be
> > +        * privileged and know what you're doing. This isn't trust.
> > +        */
> > +       if ((fc->user_ns != &init_user_ns) && !capable(CAP_SYS_ADMIN))
> > +               return -EPERM;
> > +
> >         ret = simple_fill_super(sb, BPF_FS_MAGIC, bpf_rfiles);
> >         if (ret)
> >                 return ret;
> > @@ -800,6 +807,7 @@ static struct file_system_type bpf_fs_type = {
> >         .init_fs_context = bpf_init_fs_context,
> >         .parameters     = bpf_fs_parameters,
> >         .kill_sb        = kill_litter_super,
> > +       .fs_flags       = FS_USERNS_MOUNT,
> 
> Just an aside thought. It doesn't seem like there is any reason why
> BPF FS right now is not created with FS_USERNS_MOUNT, so (separately
> from all this discussion) I suspect we can just make it
> FS_USERNS_MOUNT right now (unless we combine it with BPF-token-FS,
> then yeah, we can't do that unconditionally anymore). Given BPF FS is
> just a container of pinned BPF objects, just mounting BPF FS doesn't
> seem to be dangerous in any way. But that's just an aside thought
> here.

My two cents: Don't ever expose anything under user namespaces unless it
is guaranteed to be safe and has actual non-cosmetical use-cases.

The eagerness with which features pop up in user namespaces is probably
bankrolling half the infosec community.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object
  2023-07-11 13:33           ` Christian Brauner
@ 2023-07-11 22:06             ` Andrii Nakryiko
  0 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2023-07-11 22:06 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Paul Moore, Andrii Nakryiko, bpf, linux-security-module,
	keescook, lennart, cyphar, luto, kernel-team, sargun

On Tue, Jul 11, 2023 at 6:33 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Wed, Jul 05, 2023 at 02:38:43PM -0700, Andrii Nakryiko wrote:
> > On Wed, Jul 5, 2023 at 7:42 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Wed, Jul 05, 2023 at 10:16:13AM -0400, Paul Moore wrote:
> > > > On Tue, Jul 4, 2023 at 8:44 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > > On Wed, Jun 28, 2023 at 10:18:19PM -0700, Andrii Nakryiko wrote:
> > > > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to to
> > > > > > allow delegating privileged BPF functionality, like loading a BPF
> > > > > > program or creating a BPF map, from privileged process to a *trusted*
> > > > > > unprivileged process, all while have a good amount of control over which
> > > > > > privileged operations could be performed using provided BPF token.
> > > > > >
> > > > > > This patch adds new BPF_TOKEN_CREATE command to bpf() syscall, which
> > > > > > allows to create a new BPF token object along with a set of allowed
> > > > > > commands that such BPF token allows to unprivileged applications.
> > > > > > Currently only BPF_TOKEN_CREATE command itself can be
> > > > > > delegated, but other patches gradually add ability to delegate
> > > > > > BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.
> > > > > >
> > > > > > The above means that new BPF tokens can be created using existing BPF
> > > > > > token, if original privileged creator allowed BPF_TOKEN_CREATE command.
> > > > > > New derived BPF token cannot be more powerful than the original BPF
> > > > > > token.
> > > > > >
> > > > > > Importantly, BPF token is automatically pinned at the specified location
> > > > > > inside an instance of BPF FS and cannot be repinned using BPF_OBJ_PIN
> > > > > > command, unlike BPF prog/map/btf/link. This provides more control over
> > > > > > unintended sharing of BPF tokens through pinning it in another BPF FS
> > > > > > instances.
> > > > > >
> > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > > > ---
> > > > >
> > > > > The main issue I have with the token approach is that it is a completely
> > > > > separate delegation vector on top of user namespaces. We mentioned this
> > > > > duringthe conf and this was brought up on the thread here again as well.
> > > > > Imho, that's a problem both security-wise and complexity-wise.
> > > > >
> > > > > It's not great if each subsystem gets its own custom delegation
> > > > > mechanism. This imposes such a taxing complexity on both kernel- and
> > > > > userspace that it will quickly become a huge liability. So I would
> > > > > really strongly encourage you to explore another direction.
> >
> > Alright, thanks a lot for elaborating. I did want to keep everything
> > contained to bpf() for various reasons, but it seems like I won't be
> > able to get away with this. :)
> >
> > > > >
> > > > > I do think the spirit of your proposal is workable and that it can
> > > > > mostly be kept in tact.
> >
> > It's good to know that at least conceptually you support the idea of
> > BPF delegation. I have a few more specific questions below and I'd
> > appreciate your answers, as I have less familiarity with how exactly
> > container managers do stuff at container bootstrapping stage.
> >
> > But first, let's try to get some tentative agreement on design before
> > I go and implement the BPF-token-as-FS idea. I have basically just two
> > gripes with exact details of what you are proposing, so let me explain
> > which and why, and see if we can find some common ground.
>
> Just fyi, there'll likely be some delays in my replies bc first I need
> to think about it and second floods of mails. I'll be on vacation for
> starting end of this week.

I'll be on vacation for the next month or so starting from tomorrow,
so that's no problem :)

>
> >
> > First, the idea of coupling and bundling this "delegation" option with
> > BPF FS doesn't feel right. BPF FS is just a container of BPF objects,
> > so adding to it a new property of allowing to use privileged BPF
> > functionality seems a bit off.
>
> Fwiw, I have a series that makes it possible to delegate a superblock of
> a filesystem to a user namespace using the new mount api introducing a
> vfs generic "delegate" mount option. So this won't be a special bpf
> thing. This is generally useful.
>
> >
> > Why not just create a new separate FS, let's code-name it "BPF Token
> > FS" for now (naming suggestions are welcome). Such BPF Token FS would
> > be dedicated to specifying everything about what's allowable through
> > BPF, just like my BPF token implementation. It can then be
> > mounted/bind-mounted inside BPF FS (or really, anywhere, it's just a
> > FS, right?). User application would open it (I'm guessing with
> > open_tree(), right?) and pass it as token_fd to bpf() syscall.
> >
> > Having it as a separate single-purpose FS seems cleaner, because we
> > have use cases where we'd have one BPF FS instance created for a
> > container by our container manager, and then exposing a few separate
> > tokens with different sets of allowed functionality. E.g., one for
> > main intended workload, another for some BPF-based observability
> > tools, maybe yet another for more heavy-weight tools like bpftrace for
> > extra debugging. In the debugging case our container infrastructure
> > will be "evacuating" any other workloads on the same host to avoid
> > unnecessary consequences. The point is to not disturb
> > workload-under-human-debugging as much as possible, so we'd like to
> > keep userns intact, which is why mounting extra (more permissive) BPF
> > token inside already running containers is an important consideration.
> >
> > With such goals, it seems nicer to have a single BPF FS, and few BPF
> > token FSs mounted inside it. Yes, we could bundle token functionality
> > with BPF FS, but separating those two seems cleaner to me. WDYT?
>
> It seems that writing a pseudo filesystem for the kernel is some right
> of passage that every kernel developer wants to go through for some
> reason. It's not mandatory though, it's actually discouraged.

Believe me, I tried to avoid this as much as possible.

>
> Joking aside.
> I think the danger lies in adding more and more moving parts and
> fragmenting this into so many moving pieces that it's hard to see the
> bigger picture and have a clear sense of the API.

It's probably a difference of perspective as a BPF developer and user.
To me bundling this delegate option onto BPF FS is completely
counter-intuitive. BPF FS has (in my mind) nothing to do with how I
can use the BPF subsystem. So BPF token as a separate object/FS is way
more natural.

Having said that, I can bundle this new functionality onto BPF FS if
you insist, just to make some progress here and move to solving
further problems with BPF usage within userns. If someone else who
prefers separate FS for BPF token (and I know there are at least few
people who think it's cleaner that way as well) would like to voice
their opinion in support, please do so.

>
> >
> > Second, mount options usage. I'm hearing stories from our production
> > folks how some new mount options (on some other FS, not BPF FS) were
> > breaking tools unintentionally during kernel/tooling
> > upgrades/downgrades, so it makes me a bit hesitant to have these
> > complicated sets of mount options to specify parameters of
> > BPF-token-as-FS. I've been thinking a bit, and I'm starting to lean
>
> I don't see this as a good argument for a new pseudo filesystem. It
> implies that any new filesystem would end up with the same problem. The
> answer here would be to report and fix such bugs.

Sure, this wasn't the reason for separate BPF token FS, of course.

>
> > towards the idea of allowing to set up (and modify as well) all these
> > allowed maps/progs/attach types through special auto-created files
> > within BPF token FS. Something like below:
> >
> > # pwd
> > /sys/fs/bpf/workload-token
> > # ls
> > allowed_cmds allowed_map_types allowed_prog_types allowed_attach_types
> > # echo "BPF_PROG_LOAD" > allowed_cmds
> > # echo "BPF_PROG_TYPE_KPROBE" >> allowed_prog_types
> > ...
> > # cat allowed_prog_types
> > BPF_PROG_TYPE_KPROBE,BPF_PROG_TYPE_TRACEPOINT
> >
> >
> > The above is fake (I haven't implemented anything yet), but hopefully
> > works as a demonstration. We'll also need to make sure that inside
> > non-init userns these files are read-only or allow to just further
> > restrict the subset of allowed functionality, never extend it.
>
> This implementation would get you into the business of write-time
> permission checks. And this almost always means you should use an
> ioctl(), not a write() operation on these files.
>

Ok. I think ioctl() kind of kills all the benefits, so there is little point.

> >
> > Such an approach will actually make it simpler to test and experiment
> > with this delegation locally, will make it trivial to observe what's
> > allowed from simple shell scripts, etc, etc. With fsmount() and O_PATH
> > it will be possible to set everything up from privileged processes
> > before ever exposing a BPF Token FS instance through a file system, if
> > there are any concerns about racing with user space.
> >
> > That's the high-level approach I'm thinking of right now. Would that
> > work? How critical is it to reuse BPF FS itself and how important to
> > you is to rely on mount options vs special files as described above?
>
> In the end, it's your api and you need to live with it and support it.
> What is important is that we don't end up with security issues. The
> special files thing will work but be aware that write-time permission
> checking is nasty:
> * https://git.zx2c4.com/CVE-2012-0056/about/ (Thanks to Aleksa for the link.)

entertaining read :)

> * commit e57457641613 ("cgroup: Use open-time cgroup namespace for process migration perm checks")
> There's a lot more. It can be done but it needs stringent permission
> checking and an ioctl() is probably the way to go in this case.
>
> Another thing, if you split configuration over multiple files you can
> end up introducing race windows. This is a common complaint with cgroups
> and sysfs whenever configuration of something is split over multiple
> files. It gets especially hairy if the options interact with each other
> somehow.

I'm not too worried about races, but all the above makes sense. My
original approach with bpf() syscall creating BPF token object went
for immutable BPF token construction for the very same reasons of
simplicity. Alright, this is all fair enough, I'll give mount options
a try and see how it all works out.

>
> > Hopefully not critical, and I can start working on it, and we'll get
> > what you want with using FS as a vehicle for delegation, while
> > allowing some of the intended use cases that we have in mind in a bit
> > cleaner fashion?
> >
> > > > >
> > > > > As mentioned before, bpffs has all the means to be taught delegation:
> > > > >
> > > > >         // In container's user namespace
> > > > >         fd_fs = fsopen("bpffs");
> > > > >
> > > > >         // Delegating task in host userns (systemd-bpfd whatever you want)
> > > > >         ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "delegate", ...);
> > > > >
> > > > >         // In container's user namespace
> > > > >         fd_mnt = fsmount(fd_fs, 0);
> > > > >
> > > > >         ret = move_mount(fd_fs, "", -EBADF, "/my/fav/location", MOVE_MOUNT_F_EMPTY_PATH)
> > > > >
> > > > > Roughly, this would mean:
> > > > >
> > > > > (i) raise FS_USERNS_MOUNT on bpffs but guard it behind the "delegate"
> > > > >     mount option. IOW, it's only possibly to mount bpffs as an
> > > > >     unprivileged user if a delegating process like systemd-bpfd with
> > > > >     system-level privileges has marked it as delegatable.
> >
> > Regarding the FS_USERNS_MOUNT flag and fsopen() happening from inside
> > the user namespace. Am I missing something subtle and important here,
> > why does it have to happen inside the container's user namespace?
> > Can't the container manager both fsopen() and fsconfig() everything in
> > host userns, and only then fsmount+move_mount inside the container's
> > userns? Just trying to understand if there is some important early
> > association of userns happening at early steps here?
>
> The mount api _currently_ works very roughly like this: if a filesytem
> is FS_USERNS_MOUNT enabled fsopen() records the user namespace of the
> caller. The recorded userns will later become the owning userns of the
> filesystem's superblock (Without going into detail: owning userns of a
> superblock != owning userns of a mount. move_mount() on a detached mount
> is about the latter.).
>
> I have a patchset that adds a generic "delegate" mount option which will
> allow a sufficiently privileged process to do the following:
>
>         fd_fs = fsopen("ext4");
>
>         /*
>          * Set owning namespace of the filesystem's superblock.
>          * Caller must be privileged over @fd_userns.
>          *
>          * Note, must be first mount option to ensure that possible
>          * follow-up ermission checks for other mount options are done
>          * on the final owning namespace.
>          */
>         fsconfig(fd_fs, FSCONFIG_SET_FD, "delegate", NULL, fd_userns);
>
>         /*
>          * * If fs is FS_USERNS_MOUNT then permission is checked in @fd_userns.
>          * * If fs is not FS_USERNS_MOUNT then permission is check in @init_user_ns.
>          *   (Privilege in @init_user_ns implies privilege over @fd_userns.)
>          */
>         fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, 0);
>
> After this, the sb is owned by @fd_userns. Currently my draft restricts
> this to such filesystems that raise FS_ALLOW_IDMAP because they almost
> can support delegation and don't need to be checked for any potential
> issues. But bpffs could easily support this (without caring about
> FS_ALLOW_IDMAP).

I see. Well, I should definitely not use "delegate" option name then
for anythying ;)

>
> >
> > Also, in your example above, move_mount() should take fd_mnt, not fd_fs, right?
> >
> > > > > (ii) add fine-grained delegation options that you want this
> > > > >      bpffs instance to allow via new mount options. Idk,
> > > > >
> > > > >      // allow usage of foo
> > > > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "foo");
> > > > >
> > > > >      // also allow usage of bar
> > > > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "bar");
> > > > >
> > > > >      // reset allowed options
> > > > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "");
> > > > >
> > > > >      // allow usage of schmoo
> > > > >      fsconfig(fd_fs, FSCONFIG_SET_STRING, "abilities", "schmoo");
> > > > >
> > > > > This all seems more intuitive and integrates with user and mount
> > > > > namespaces of the container. This can also work for restricting
> > > > > non-userns bpf instances fwiw. You can also share instances via
> > > > > bind-mount and so on. The userns of the bpffs instance can also be used
> > > > > for permission checking provided a given functionality has been
> > > > > delegated by e.g., systemd-bpfd or whatever.
> > > >
> > > > I have no arguments against any of the above, and would prefer to see
> > > > something like this over a token-based mechanism.  However we do want
> > > > to make sure we have the proper LSM control points for either approach
> > > > so that admins who rely on LSM-based security policies can manage
> > > > delegation via their policies.
> > > >
> > > > Using the fsconfig() approach described by Christian above, I believe
> > > > we should have the necessary hooks already in
> > > > security_fs_context_parse_param() and security_sb_mnt_opts() but I'm
> > > > basing that on a quick look this morning, some additional checking
> > > > would need to be done.
> > >
> > > I think what I outlined is even unnecessarily complicated. You don't
> > > need that pointless "delegate" mount option at all actually. Permission
> > > to delegate shouldn't be checked when the mount option is set. The
> > > permissions should be checked when the superblock is created. That's the
> > > right point in time. So sm like:
> > >
> >
> > I think this gets even more straightforward with BPF Token FS being a
> > separate one, right? Given BPF Token FS is all about delegation, it
> > has to be a privileged operation to even create it.
> >
> > > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> > > index 4174f76133df..a2eb382f5457 100644
> > > --- a/kernel/bpf/inode.c
> > > +++ b/kernel/bpf/inode.c
> > > @@ -746,6 +746,13 @@ static int bpf_fill_super(struct super_block *sb, struct fs_context *fc)
> > >         struct inode *inode;
> > >         int ret;
> > >
> > > +       /*
> > > +        * If you want to delegate this instance then you need to be
> > > +        * privileged and know what you're doing. This isn't trust.
> > > +        */
> > > +       if ((fc->user_ns != &init_user_ns) && !capable(CAP_SYS_ADMIN))
> > > +               return -EPERM;
> > > +
> > >         ret = simple_fill_super(sb, BPF_FS_MAGIC, bpf_rfiles);
> > >         if (ret)
> > >                 return ret;
> > > @@ -800,6 +807,7 @@ static struct file_system_type bpf_fs_type = {
> > >         .init_fs_context = bpf_init_fs_context,
> > >         .parameters     = bpf_fs_parameters,
> > >         .kill_sb        = kill_litter_super,
> > > +       .fs_flags       = FS_USERNS_MOUNT,
> >
> > Just an aside thought. It doesn't seem like there is any reason why
> > BPF FS right now is not created with FS_USERNS_MOUNT, so (separately
> > from all this discussion) I suspect we can just make it
> > FS_USERNS_MOUNT right now (unless we combine it with BPF-token-FS,
> > then yeah, we can't do that unconditionally anymore). Given BPF FS is
> > just a container of pinned BPF objects, just mounting BPF FS doesn't
> > seem to be dangerous in any way. But that's just an aside thought
> > here.
>
> My two cents: Don't ever expose anything under user namespaces unless it
> is guaranteed to be safe and has actual non-cosmetical use-cases.

Doesn't seem cosmetic to be able to have my own private BPF FS
instance created by an application inside the container to persist
and/or share BPF prog/maps between parts of the application. But I'm
not going to do this either, it was just a realization that we seem to
be unnecessarily restrictive with BPF FS (at least until it becomes
also a BPF token itself).

>
> The eagerness with which features pop up in user namespaces is probably
> bankrolling half the infosec community.

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2023-07-11 22:06 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-29  5:18 [PATCH RESEND v3 bpf-next 00/14] BPF token Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object Andrii Nakryiko
2023-07-04 12:43   ` Christian Brauner
2023-07-04 13:34     ` Christian Brauner
2023-07-04 23:28     ` Toke Høiland-Jørgensen
2023-07-05  7:20       ` Daniel Borkmann
2023-07-05  8:45         ` Christian Brauner
2023-07-05 12:34           ` Toke Høiland-Jørgensen
2023-07-05 14:16     ` Paul Moore
2023-07-05 14:42       ` Christian Brauner
2023-07-05 16:00         ` Paul Moore
2023-07-05 21:38         ` Andrii Nakryiko
2023-07-06 11:32           ` Toke Høiland-Jørgensen
2023-07-06 20:37             ` Andrii Nakryiko
2023-07-07 13:04               ` Toke Høiland-Jørgensen
2023-07-07 17:58                 ` Andrii Nakryiko
2023-07-07 22:00                   ` Toke Høiland-Jørgensen
2023-07-07 23:58                     ` Andrii Nakryiko
2023-07-10 23:42                       ` Djalal Harouni
2023-07-11 13:33           ` Christian Brauner
2023-07-11 22:06             ` Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 02/14] libbpf: add bpf_token_create() API Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 03/14] selftests/bpf: add BPF_TOKEN_CREATE test Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 04/14] bpf: add BPF token support to BPF_MAP_CREATE command Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 05/14] libbpf: add BPF token support to bpf_map_create() API Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 06/14] selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 07/14] bpf: add BPF token support to BPF_BTF_LOAD command Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 08/14] libbpf: add BPF token support to bpf_btf_load() API Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 09/14] selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 10/14] bpf: add BPF token support to BPF_PROG_LOAD command Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 11/14] bpf: take into account BPF token when fetching helper protos Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 12/14] bpf: consistenly use BPF token throughout BPF verifier logic Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 13/14] libbpf: add BPF token support to bpf_prog_load() API Andrii Nakryiko
2023-06-29  5:18 ` [PATCH RESEND v3 bpf-next 14/14] selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests Andrii Nakryiko
2023-06-29 23:15 ` [PATCH RESEND v3 bpf-next 00/14] BPF token Toke Høiland-Jørgensen
2023-06-30 18:25   ` Andrii Nakryiko
2023-07-04  9:38     ` Christian Brauner
2023-07-04 23:20     ` Toke Høiland-Jørgensen
2023-07-05 12:57       ` Stefano Brivio
2023-07-02  6:59   ` Djalal Harouni
2023-07-04  9:51   ` Christian Brauner
2023-07-04 23:33     ` Toke Høiland-Jørgensen
2023-07-05 20:39     ` Andrii Nakryiko
2023-07-01  2:05 ` Yafang Shao
2023-07-05 20:37   ` Andrii Nakryiko
2023-07-06  1:26     ` Yafang Shao
2023-07-06 20:34       ` Andrii Nakryiko
2023-07-07  1:42         ` Yafang Shao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.