* [PATCH v2 bpf-next 00/18] BPF token
@ 2023-06-07 23:53 Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 01/18] bpf: introduce BPF token object Andrii Nakryiko
                   ` (22 more replies)
  0 siblings, 23 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

This patch set introduces a new BPF object, the BPF token, which allows
delegating a subset of BPF functionality from a privileged system-wide daemon
(e.g., systemd or any other container manager) to a *trusted* unprivileged
application. Trust is the key here. This functionality is not about allowing
unconditional unprivileged BPF usage. Establishing trust, though, is
completely up to the discretion of the respective privileged application that
would create a BPF token.

The main motivation for BPF token is a desire to enable containerized
BPF applications to be used together with user namespaces. This is currently
impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
helpers like bpf_probe_read_kernel() and bpf_probe_read_user(), can safely read
arbitrary memory, and it's impossible to ensure that they only read memory of
processes belonging to any given namespace. This means that it's impossible to
have a namespace-aware CAP_BPF capability, and as such another mechanism to
allow safe usage of BPF functionality is necessary. BPF token, and delegation
of it to trusted unprivileged applications, is such a mechanism. The kernel
makes no assumption about what "trusted" constitutes in any particular case,
and it's up to specific privileged applications and their surrounding
infrastructure to decide that. What the kernel provides is a set of APIs to
create and tune a BPF token, and to pass it around to privileged BPF commands
that create new BPF objects like BPF programs, BPF maps, etc.

A previous attempt at addressing this very same problem ([0]) tried to
utilize an authoritative LSM approach, but was conclusively rejected by
upstream LSM maintainers. The BPF token concept does not change anything about
the LSM approach, but can be combined with LSM hooks for very fine-grained
security policy. Some ideas about making BPF token more convenient to use with
LSM (in particular custom BPF LSM programs) were briefly described in a recent
LSF/MM/BPF 2023 presentation ([1]). E.g., an ability to specify user-provided
data (context), which in combination with BPF LSM would allow implementing
very dynamic and fine-grained custom security policies on top of BPF token. In
the interest of minimizing API surface area discussions, this is going to be
added in follow-up patches, as it's not essential to the fundamental concept
of a delegatable BPF token.

It should be noted that BPF token is conceptually quite similar to the idea of
a /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
difference is the idea of using a virtual anon_inode file to hold a BPF token
and allowing multiple independent instances of them, each with its own set of
restrictions. BPF pinning solves the problem of exposing such a BPF token
through the file system (BPF FS, in this case) for cases where transferring FDs
over Unix domain sockets is not convenient. Also, crucially, the BPF token
approach does not use any special stateful task-scoped flags. Instead, the
bpf() syscall accepts a token_fd parameter explicitly for each relevant BPF
command. This addresses the main concerns brought up during the /dev/bpf
discussion, and fits better with the overall BPF subsystem design.

This patch set adds the bare minimum of functionality to make BPF token useful
and to discuss API and functionality. Currently only low-level libbpf APIs
support passing a BPF token around. This allows testing kernel functionality,
but for the most part is not sufficient for real-world applications, which
typically use high-level libbpf APIs based on the `struct bpf_object` type.
This was done with the intent to limit the size of the patch set and
concentrate on mostly kernel-side changes. All the necessary plumbing for
libbpf will be sent as a separate follow-up patch set once kernel support
makes it upstream.

Another part that should happen once the kernel-side BPF token is established
is a set of conventions between applications (e.g., systemd), tools (e.g.,
bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
at well-defined locations, to allow applications to take advantage of this
automatically, without explicit code changes on the BPF application's side.
But I'd like to postpone this discussion until after the BPF token concept
lands.

  [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
  [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
  [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/

v1->v2:
  - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
  - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).

Andrii Nakryiko (18):
  bpf: introduce BPF token object
  libbpf: add bpf_token_create() API
  selftests/bpf: add BPF_TOKEN_CREATE test
  bpf: move unprivileged checks into map_create() and bpf_prog_load()
  bpf: inline map creation logic in map_create() function
  bpf: centralize permissions checks for all BPF map types
  bpf: add BPF token support to BPF_MAP_CREATE command
  libbpf: add BPF token support to bpf_map_create() API
  selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
  bpf: add BPF token support to BPF_BTF_LOAD command
  libbpf: add BPF token support to bpf_btf_load() API
  selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
  bpf: keep BPF_PROG_LOAD permission checks clear of validations
  bpf: add BPF token support to BPF_PROG_LOAD command
  bpf: take into account BPF token when fetching helper protos
  bpf: consistently use BPF token throughout BPF verifier logic
  libbpf: add BPF token support to bpf_prog_load() API
  selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests

 drivers/media/rc/bpf-lirc.c                   |   2 +-
 include/linux/bpf.h                           |  70 ++-
 include/linux/filter.h                        |   2 +-
 include/uapi/linux/bpf.h                      |  37 ++
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/arraymap.c                         |   2 +-
 kernel/bpf/bloom_filter.c                     |   3 -
 kernel/bpf/bpf_local_storage.c                |   3 -
 kernel/bpf/bpf_struct_ops.c                   |   3 -
 kernel/bpf/cgroup.c                           |   6 +-
 kernel/bpf/core.c                             |   3 +-
 kernel/bpf/cpumap.c                           |   4 -
 kernel/bpf/devmap.c                           |   3 -
 kernel/bpf/hashtab.c                          |   6 -
 kernel/bpf/helpers.c                          |   6 +-
 kernel/bpf/inode.c                            |  26 ++
 kernel/bpf/lpm_trie.c                         |   3 -
 kernel/bpf/queue_stack_maps.c                 |   4 -
 kernel/bpf/reuseport_array.c                  |   3 -
 kernel/bpf/stackmap.c                         |   3 -
 kernel/bpf/syscall.c                          | 401 ++++++++++++++----
 kernel/bpf/token.c                            | 136 ++++++
 kernel/bpf/verifier.c                         |  13 +-
 kernel/trace/bpf_trace.c                      |   2 +-
 net/core/filter.c                             |  36 +-
 net/core/sock_map.c                           |   4 -
 net/ipv4/bpf_tcp_ca.c                         |   2 +-
 net/netfilter/nf_bpf_link.c                   |   2 +-
 net/xdp/xskmap.c                              |   4 -
 tools/include/uapi/linux/bpf.h                |  39 ++
 tools/lib/bpf/bpf.c                           |  32 +-
 tools/lib/bpf/bpf.h                           |  24 +-
 tools/lib/bpf/libbpf.map                      |   1 +
 .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
 .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
 .../testing/selftests/bpf/prog_tests/token.c  | 260 ++++++++++++
 .../bpf/prog_tests/unpriv_bpf_disabled.c      |   6 +-
 37 files changed, 975 insertions(+), 188 deletions(-)
 create mode 100644 kernel/bpf/token.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c

-- 
2.34.1



* [PATCH v2 bpf-next 01/18] bpf: introduce BPF token object
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 02/18] libbpf: add bpf_token_create() API Andrii Nakryiko
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Add a new kind of BPF kernel object, the BPF token. A BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from a privileged process to a *trusted*
unprivileged process, all while retaining a good amount of control over
which privileged operations can be performed using the provided BPF token.

This patch adds a new BPF_TOKEN_CREATE command to the bpf() syscall, which
allows creating a new BPF token object along with a set of allowed
commands. Currently only the BPF_TOKEN_CREATE command itself can be
delegated, but later patches gradually add the ability to delegate
BPF_MAP_CREATE, BPF_BTF_LOAD, and BPF_PROG_LOAD commands.

The above means that BPF token creation can be allowed by another
existing BPF token, if the original privileged creator allowed that. A new
derived BPF token cannot be more powerful than the original BPF token.
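
To make the derivation rule concrete, here is a minimal user-space sketch
of the subset check on 64-bit command masks. It is an illustration only,
not the kernel code: the names demo_cmd, CMD_BIT, cmd_mask_is_subset() and
derive_mask() are hypothetical, and the command numbers are placeholders
(real values come from enum bpf_cmd).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command numbers, for illustration only; the real values
 * come from enum bpf_cmd in the UAPI header.
 */
enum demo_cmd {
	DEMO_CMD_MAP_CREATE = 0,
	DEMO_CMD_PROG_LOAD = 5,
	DEMO_CMD_TOKEN_CREATE = 36,
};

#define CMD_BIT(cmd) (1ULL << (cmd))

/* Every requested bit must already be present in the parent's mask.
 * Both arguments are 64 bits wide so that command bits at position 32
 * and above are not truncated.
 */
static bool cmd_mask_is_subset(uint64_t requested, uint64_t parent)
{
	return (parent & requested) == requested;
}

/* Returns 0 and stores the derived mask on success, or -1 if the
 * request would grant more than the parent token allows.
 */
static int derive_mask(uint64_t parent, uint64_t requested, uint64_t *out)
{
	assert(out != NULL);

	if (!cmd_mask_is_subset(requested, parent))
		return -1;

	*out = requested;
	return 0;
}
```

With a parent mask of CMD_BIT(DEMO_CMD_TOKEN_CREATE) |
CMD_BIT(DEMO_CMD_PROG_LOAD), asking for only the PROG_LOAD bit (or for an
empty mask) succeeds, while asking for the MAP_CREATE bit is refused
because the parent never had it.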

Lastly, a BPF token can be pinned in and retrieved from BPF FS, just like
progs, maps, BTFs, and links. This allows applications (like container
managers) to share a BPF token with other applications through the file
system just like any other BPF object, and further control access to it
using file system permissions, if desired.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h            |  38 +++++++++++
 include/uapi/linux/bpf.h       |  22 +++++++
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/inode.c             |  26 ++++++++
 kernel/bpf/syscall.c           |  70 ++++++++++++++++++++
 kernel/bpf/token.c             | 117 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  22 +++++++
 7 files changed, 296 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/token.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f58895830ada..5f3944352c26 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -51,6 +51,7 @@ struct module;
 struct bpf_func_state;
 struct ftrace_ops;
 struct cgroup;
+struct bpf_token;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1533,6 +1534,12 @@ struct bpf_link_primer {
 	u32 id;
 };
 
+struct bpf_token {
+	struct work_struct work;
+	atomic64_t refcnt;
+	u64 allowed_cmds;
+};
+
 struct bpf_struct_ops_value;
 struct btf_member;
 
@@ -1916,6 +1923,11 @@ bpf_prog_run_array_sleepable(const struct bpf_prog_array __rcu *array_rcu,
 	return ret;
 }
 
+static inline bool bpf_token_capable(const struct bpf_token *token, int cap)
+{
+	return token || capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
+}
+
 #ifdef CONFIG_BPF_SYSCALL
 DECLARE_PER_CPU(int, bpf_prog_active);
 extern struct mutex bpf_stats_enabled_mutex;
@@ -2077,6 +2089,14 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
 
+void bpf_token_inc(struct bpf_token *token);
+void bpf_token_put(struct bpf_token *token);
+struct bpf_token *bpf_token_alloc(void);
+int bpf_token_new_fd(struct bpf_token *token);
+struct bpf_token *bpf_token_get_from_fd(u32 ufd);
+
+bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
+
 int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
 int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
 
@@ -2436,6 +2456,24 @@ static inline int bpf_obj_get_user(const char __user *pathname, int flags)
 	return -EOPNOTSUPP;
 }
 
+static inline void bpf_token_inc(struct bpf_token *token)
+{
+}
+
+static inline void bpf_token_put(struct bpf_token *token)
+{
+}
+
+static inline int bpf_token_new_fd(struct bpf_token *token)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static inline void __dev_flush(void)
 {
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a7b5e91dd768..3e7e8d8cbe90 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -846,6 +846,16 @@ union bpf_iter_link_info {
  *		Returns zero on success. On error, -1 is returned and *errno*
  *		is set appropriately.
  *
+ * BPF_TOKEN_CREATE
+ *	Description
+ *		Create BPF token with embedded information about what
+ *		BPF-related functionality is allowed. This BPF token can be
+ *		passed as an extra parameter to various bpf() syscall commands.
+ *
+ *	Return
+ *		A new file descriptor (a nonnegative integer), or -1 if an
+ *		error occurred (in which case, *errno* is set appropriately).
+ *
  * NOTES
  *	eBPF objects (maps and programs) can be shared between processes.
  *
@@ -900,6 +910,7 @@ enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TOKEN_CREATE,
 };
 
 enum bpf_map_type {
@@ -1621,6 +1632,17 @@ union bpf_attr {
 		__u32		flags;		/* extra flags */
 	} prog_bind_map;
 
+	struct { /* struct used by BPF_TOKEN_CREATE command */
+		__u32		flags;
+		__u32		token_fd;
+		/* a bit set of allowed bpf() syscall commands,
+		 * e.g., (1ULL << BPF_TOKEN_CREATE) | (1ULL << BPF_PROG_LOAD)
+		 * will allow creating derived BPF tokens and loading new BPF
+		 * programs
+		 */
+		__u64		allowed_cmds;
+	} token_create;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1d3892168d32..bbc17ea3878f 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -6,7 +6,7 @@ cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
 endif
 CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 4174f76133df..55d9a945ad18 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -27,6 +27,7 @@ enum bpf_type {
 	BPF_TYPE_PROG,
 	BPF_TYPE_MAP,
 	BPF_TYPE_LINK,
+	BPF_TYPE_TOKEN,
 };
 
 static void *bpf_any_get(void *raw, enum bpf_type type)
@@ -41,6 +42,9 @@ static void *bpf_any_get(void *raw, enum bpf_type type)
 	case BPF_TYPE_LINK:
 		bpf_link_inc(raw);
 		break;
+	case BPF_TYPE_TOKEN:
+		bpf_token_inc(raw);
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		break;
@@ -61,6 +65,9 @@ static void bpf_any_put(void *raw, enum bpf_type type)
 	case BPF_TYPE_LINK:
 		bpf_link_put(raw);
 		break;
+	case BPF_TYPE_TOKEN:
+		bpf_token_put(raw);
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		break;
@@ -89,6 +96,12 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
 		return raw;
 	}
 
+	raw = bpf_token_get_from_fd(ufd);
+	if (!IS_ERR(raw)) {
+		*type = BPF_TYPE_TOKEN;
+		return raw;
+	}
+
 	return ERR_PTR(-EINVAL);
 }
 
@@ -97,6 +110,7 @@ static const struct inode_operations bpf_dir_iops;
 static const struct inode_operations bpf_prog_iops = { };
 static const struct inode_operations bpf_map_iops  = { };
 static const struct inode_operations bpf_link_iops  = { };
+static const struct inode_operations bpf_token_iops  = { };
 
 static struct inode *bpf_get_inode(struct super_block *sb,
 				   const struct inode *dir,
@@ -136,6 +150,8 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
 		*type = BPF_TYPE_MAP;
 	else if (inode->i_op == &bpf_link_iops)
 		*type = BPF_TYPE_LINK;
+	else if (inode->i_op == &bpf_token_iops)
+		*type = BPF_TYPE_TOKEN;
 	else
 		return -EACCES;
 
@@ -369,6 +385,11 @@ static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg)
 			     &bpf_iter_fops : &bpffs_obj_fops);
 }
 
+static int bpf_mktoken(struct dentry *dentry, umode_t mode, void *arg)
+{
+	return bpf_mkobj_ops(dentry, mode, arg, &bpf_token_iops, &bpffs_obj_fops);
+}
+
 static struct dentry *
 bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags)
 {
@@ -469,6 +490,9 @@ static int bpf_obj_do_pin(int path_fd, const char __user *pathname, void *raw,
 	case BPF_TYPE_LINK:
 		ret = vfs_mkobj(dentry, mode, bpf_mklink, raw);
 		break;
+	case BPF_TYPE_TOKEN:
+		ret = vfs_mkobj(dentry, mode, bpf_mktoken, raw);
+		break;
 	default:
 		ret = -EPERM;
 	}
@@ -547,6 +571,8 @@ int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags)
 		ret = bpf_map_new_fd(raw, f_flags);
 	else if (type == BPF_TYPE_LINK)
 		ret = (f_flags != O_RDWR) ? -EINVAL : bpf_link_new_fd(raw);
+	else if (type == BPF_TYPE_TOKEN)
+		ret = (f_flags != O_RDWR) ? -EINVAL : bpf_token_new_fd(raw);
 	else
 		return -ENOENT;
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 92a57efc77de..1d8b513ce318 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5024,6 +5024,73 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
 	return ret;
 }
 
+static bool is_bit_subset_of(u64 subset, u64 superset)
+{
+	return (superset & subset) == subset;
+}
+
+#define BPF_TOKEN_CMDS_MASK ((1ULL << BPF_TOKEN_CREATE))
+
+#define BPF_TOKEN_CREATE_LAST_FIELD token_create.allowed_cmds
+
+static int token_create(union bpf_attr *attr)
+{
+	struct bpf_token *new_token, *token = NULL;
+	int fd, err;
+
+	if (CHECK_ATTR(BPF_TOKEN_CREATE))
+		return -EINVAL;
+
+	if (attr->token_create.flags)
+		return -EINVAL;
+
+	if (attr->token_create.token_fd) {
+		token = bpf_token_get_from_fd(attr->token_create.token_fd);
+		if (IS_ERR(token))
+			return PTR_ERR(token);
+		/* if provided BPF token doesn't allow creating new tokens,
+		 * then use system-wide capability checks only
+		 */
+		if (!bpf_token_allow_cmd(token, BPF_TOKEN_CREATE)) {
+			bpf_token_put(token);
+			token = NULL;
+		}
+	}
+
+	if (!bpf_token_capable(token, CAP_SYS_ADMIN)) {
+		err = -EPERM;
+		goto err_out;
+	}
+
+	/* requested cmds should be a subset of associated token's set */
+	if (token && !is_bit_subset_of(attr->token_create.allowed_cmds, token->allowed_cmds)) {
+		err = -EPERM;
+		goto err_out;
+	}
+
+	new_token = bpf_token_alloc();
+	if (!new_token) {
+		err = -ENOMEM;
+		goto err_out;
+	}
+
+	new_token->allowed_cmds = attr->token_create.allowed_cmds;
+
+	fd = bpf_token_new_fd(new_token);
+	if (fd < 0) {
+		bpf_token_put(new_token);
+		err = fd;
+		goto err_out;
+	}
+
+	bpf_token_put(token);
+	return fd;
+
+err_out:
+	bpf_token_put(token);
+	return err;
+}
+
 static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 {
 	union bpf_attr attr;
@@ -5172,6 +5239,9 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 	case BPF_PROG_BIND_MAP:
 		err = bpf_prog_bind_map(&attr);
 		break;
+	case BPF_TOKEN_CREATE:
+		err = token_create(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
new file mode 100644
index 000000000000..4257281ca1ec
--- /dev/null
+++ b/kernel/bpf/token.c
@@ -0,0 +1,117 @@
+#include <linux/bpf.h>
+#include <linux/vmalloc.h>
+#include <linux/anon_inodes.h>
+#include <linux/fdtable.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/idr.h>
+
+DEFINE_IDR(token_idr);
+DEFINE_SPINLOCK(token_idr_lock);
+
+void bpf_token_inc(struct bpf_token *token)
+{
+	atomic64_inc(&token->refcnt);
+}
+
+static void bpf_token_put_deferred(struct work_struct *work)
+{
+	struct bpf_token *token = container_of(work, struct bpf_token, work);
+
+	kvfree(token);
+}
+
+void bpf_token_put(struct bpf_token *token)
+{
+	if (!token)
+		return;
+
+	if (!atomic64_dec_and_test(&token->refcnt))
+		return;
+
+	INIT_WORK(&token->work, bpf_token_put_deferred);
+	schedule_work(&token->work);
+}
+
+static int bpf_token_release(struct inode *inode, struct file *filp)
+{
+	struct bpf_token *token = filp->private_data;
+
+	bpf_token_put(token);
+	return 0;
+}
+
+static ssize_t bpf_dummy_read(struct file *filp, char __user *buf, size_t siz,
+			      loff_t *ppos)
+{
+	/* We need this handler such that alloc_file() enables
+	 * f_mode with FMODE_CAN_READ.
+	 */
+	return -EINVAL;
+}
+
+static ssize_t bpf_dummy_write(struct file *filp, const char __user *buf,
+			       size_t siz, loff_t *ppos)
+{
+	/* We need this handler such that alloc_file() enables
+	 * f_mode with FMODE_CAN_WRITE.
+	 */
+	return -EINVAL;
+}
+
+static const struct file_operations bpf_token_fops = {
+	.release	= bpf_token_release,
+	.read		= bpf_dummy_read,
+	.write		= bpf_dummy_write,
+};
+
+struct bpf_token *bpf_token_alloc(void)
+{
+	struct bpf_token *token;
+
+	token = kvzalloc(sizeof(*token), GFP_USER);
+	if (token == NULL)
+		return NULL;
+
+	atomic64_set(&token->refcnt, 1);
+
+	return token;
+}
+
+#define BPF_TOKEN_INODE_NAME "bpf-token"
+
+/* Alloc anon_inode and FD for prepared token.
+ * Returns fd >= 0 on success; negative error, otherwise.
+ */
+int bpf_token_new_fd(struct bpf_token *token)
+{
+	return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC);
+}
+
+struct bpf_token *bpf_token_get_from_fd(u32 ufd)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_token *token;
+
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+	if (f.file->f_op != &bpf_token_fops) {
+		fdput(f);
+		return ERR_PTR(-EINVAL);
+	}
+
+	token = f.file->private_data;
+	bpf_token_inc(token);
+	fdput(f);
+
+	return token;
+}
+
+bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
+{
+	if (!token)
+		return false;
+
+	return token->allowed_cmds & (1ULL << cmd);
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a7b5e91dd768..3e7e8d8cbe90 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -846,6 +846,16 @@ union bpf_iter_link_info {
  *		Returns zero on success. On error, -1 is returned and *errno*
  *		is set appropriately.
  *
+ * BPF_TOKEN_CREATE
+ *	Description
+ *		Create BPF token with embedded information about what
+ *		BPF-related functionality is allowed. This BPF token can be
+ *		passed as an extra parameter to various bpf() syscall commands.
+ *
+ *	Return
+ *		A new file descriptor (a nonnegative integer), or -1 if an
+ *		error occurred (in which case, *errno* is set appropriately).
+ *
  * NOTES
  *	eBPF objects (maps and programs) can be shared between processes.
  *
@@ -900,6 +910,7 @@ enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TOKEN_CREATE,
 };
 
 enum bpf_map_type {
@@ -1621,6 +1632,17 @@ union bpf_attr {
 		__u32		flags;		/* extra flags */
 	} prog_bind_map;
 
+	struct { /* struct used by BPF_TOKEN_CREATE command */
+		__u32		flags;
+		__u32		token_fd;
+		/* a bit set of allowed bpf() syscall commands,
+		 * e.g., (1ULL << BPF_TOKEN_CREATE) | (1ULL << BPF_PROG_LOAD)
+		 * will allow creating derived BPF tokens and loading new BPF
+		 * programs
+		 */
+		__u64		allowed_cmds;
+	} token_create;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
-- 
2.34.1



* [PATCH v2 bpf-next 02/18] libbpf: add bpf_token_create() API
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 01/18] bpf: introduce BPF token object Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 03/18] selftests/bpf: add BPF_TOKEN_CREATE test Andrii Nakryiko
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Add low-level wrapper API for BPF_TOKEN_CREATE command in bpf() syscall.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/lib/bpf/bpf.c      | 18 ++++++++++++++++++
 tools/lib/bpf/bpf.h      | 11 +++++++++++
 tools/lib/bpf/libbpf.map |  1 +
 3 files changed, 30 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index ed86b37d8024..38be66719485 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -1201,3 +1201,21 @@ int bpf_prog_bind_map(int prog_fd, int map_fd,
 	ret = sys_bpf(BPF_PROG_BIND_MAP, &attr, attr_sz);
 	return libbpf_err_errno(ret);
 }
+
+int bpf_token_create(struct bpf_token_create_opts *opts)
+{
+	const size_t attr_sz = offsetofend(union bpf_attr, token_create);
+	union bpf_attr attr;
+	int ret;
+
+	if (!OPTS_VALID(opts, bpf_token_create_opts))
+		return libbpf_err(-EINVAL);
+
+	memset(&attr, 0, attr_sz);
+	attr.token_create.flags = OPTS_GET(opts, flags, 0);
+	attr.token_create.token_fd = OPTS_GET(opts, token_fd, 0);
+	attr.token_create.allowed_cmds = OPTS_GET(opts, allowed_cmds, 0);
+
+	ret = sys_bpf_fd(BPF_TOKEN_CREATE, &attr, attr_sz);
+	return libbpf_err_errno(ret);
+}
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9aa0ee473754..f2b8041ca27a 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -551,6 +551,17 @@ struct bpf_test_run_opts {
 LIBBPF_API int bpf_prog_test_run_opts(int prog_fd,
 				      struct bpf_test_run_opts *opts);
 
+struct bpf_token_create_opts {
+	size_t sz; /* size of this struct for forward/backward compatibility */
+	__u32 flags;
+	__u32 token_fd;
+	__u64 allowed_cmds;
+	size_t :0;
+};
+#define bpf_token_create_opts__last_field allowed_cmds
+
+LIBBPF_API int bpf_token_create(struct bpf_token_create_opts *opts);
+
 #ifdef __cplusplus
 } /* extern "C" */
 #endif
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 7521a2fb7626..62cbe4775081 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -395,4 +395,5 @@ LIBBPF_1.2.0 {
 LIBBPF_1.3.0 {
 	global:
 		bpf_obj_pin_opts;
+		bpf_token_create;
 } LIBBPF_1.2.0;
-- 
2.34.1



* [PATCH v2 bpf-next 03/18] selftests/bpf: add BPF_TOKEN_CREATE test
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 01/18] bpf: introduce BPF token object Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 02/18] libbpf: add bpf_token_create() API Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 04/18] bpf: move unprivileged checks into map_create() and bpf_prog_load() Andrii Nakryiko
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Add a subtest validating BPF_TOKEN_CREATE command, pinning/getting BPF
token in/from BPF FS, and creating derived BPF tokens using token_fd
parameter.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 .../testing/selftests/bpf/prog_tests/token.c  | 93 +++++++++++++++++++
 1 file changed, 93 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c

diff --git a/tools/testing/selftests/bpf/prog_tests/token.c b/tools/testing/selftests/bpf/prog_tests/token.c
new file mode 100644
index 000000000000..cba84c480ac5
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/token.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+#include "linux/bpf.h"
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include "cap_helpers.h"
+
+static int drop_priv_caps(__u64 *old_caps)
+{
+	return cap_disable_effective((1ULL << CAP_BPF) |
+				     (1ULL << CAP_PERFMON) |
+				     (1ULL << CAP_NET_ADMIN) |
+				     (1ULL << CAP_SYS_ADMIN), old_caps);
+}
+
+static int restore_priv_caps(__u64 old_caps)
+{
+	return cap_enable_effective(old_caps, NULL);
+}
+
+#define TOKEN_PATH "/sys/fs/bpf/test_token"
+
+static void subtest_token_create(void)
+{
+	LIBBPF_OPTS(bpf_token_create_opts, opts);
+	int token_fd = 0, limited_token_fd = 0, tmp_fd = 0, err;
+	__u64 old_caps = 0;
+
+	/* check that any current and future cmd can be specified */
+	opts.allowed_cmds = ~0ULL;
+	token_fd = bpf_token_create(&opts);
+	if (!ASSERT_GT(token_fd, 0, "token_create_future_proof"))
+		return;
+	close(token_fd);
+
+	/* create BPF token which allows creating derived BPF tokens */
+	opts.allowed_cmds = 1ULL << BPF_TOKEN_CREATE;
+	token_fd = bpf_token_create(&opts);
+	if (!ASSERT_GT(token_fd, 0, "token_create"))
+		return;
+
+	/* validate pinning and getting works as expected */
+	err = bpf_obj_pin(token_fd, TOKEN_PATH);
+	if (!ASSERT_OK(err, "token_pin"))
+		goto cleanup;
+
+	tmp_fd = bpf_obj_get(TOKEN_PATH);
+	ASSERT_GT(tmp_fd, 0, "token_get");
+	close(tmp_fd);
+	tmp_fd = 0;
+	unlink(TOKEN_PATH);
+
+	/* drop privileges to test token_fd passing */
+	if (!ASSERT_OK(drop_priv_caps(&old_caps), "drop_caps"))
+		goto cleanup;
+
+	/* unprivileged BPF_TOKEN_CREATE should fail */
+	tmp_fd = bpf_token_create(NULL);
+	if (!ASSERT_LT(tmp_fd, 0, "token_create_unpriv_fail"))
+		goto cleanup;
+
+	/* unprivileged BPF_TOKEN_CREATE with associated BPF token succeeds */
+	opts.flags = 0;
+	opts.allowed_cmds = 0; /* ask for BPF token which doesn't allow new tokens */
+	opts.token_fd = token_fd;
+	limited_token_fd = bpf_token_create(&opts);
+	if (!ASSERT_GT(limited_token_fd, 0, "token_create_limited"))
+		goto cleanup;
+
+	/* creating yet another token using "limited" BPF token should fail */
+	opts.flags = 0;
+	opts.allowed_cmds = 0;
+	opts.token_fd = limited_token_fd;
+	tmp_fd = bpf_token_create(&opts);
+	if (!ASSERT_LT(tmp_fd, 0, "token_create_from_lim_fail"))
+		goto cleanup;
+
+cleanup:
+	if (tmp_fd)
+		close(tmp_fd);
+	if (token_fd)
+		close(token_fd);
+	if (limited_token_fd)
+		close(limited_token_fd);
+	if (old_caps)
+		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
+}
+
+void test_token(void)
+{
+	if (test__start_subtest("token_create"))
+		subtest_token_create();
+}
-- 
2.34.1



* [PATCH v2 bpf-next 04/18] bpf: move unprivileged checks into map_create() and bpf_prog_load()
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 03/18] selftests/bpf: add BPF_TOKEN_CREATE test Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 05/18] bpf: inline map creation logic in map_create() function Andrii Nakryiko
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Make each bpf() syscall command a bit more self-contained, making it
easier to further enhance it. We move sysctl_unprivileged_bpf_disabled
handling down to map_create() and bpf_prog_load(), two special commands
in this regard.

Also swap the order of checks, calling bpf_capable() only if
sysctl_unprivileged_bpf_disabled is true, avoiding unnecessary audit
messages.
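
To illustrate the reordering, here is a minimal userspace sketch (not
kernel code). All names in it are hypothetical stand-ins: the counter
models audit-visible capability checks, and model_bpf_capable() models
bpf_capable(). It assumes that every capability check may produce an
audit record, which is a simplification.

```c
#include <stdbool.h>

/* Hypothetical stand-in: counts capability checks that could be audited. */
static int audit_events;

/* Models bpf_capable(): each call is a potentially audited check. */
static bool model_bpf_capable(bool has_cap)
{
	audit_events++;
	return has_cap;
}

/* Old order: the capability check runs unconditionally. */
static bool old_order_denies(bool unpriv_disabled, bool has_cap)
{
	bool capable = model_bpf_capable(has_cap) || !unpriv_disabled;

	return !capable;
}

/* New order: && short-circuits, so the check runs only if the sysctl is set. */
static bool new_order_denies(bool unpriv_disabled, bool has_cap)
{
	return unpriv_disabled && !model_bpf_capable(has_cap);
}
```

Both variants reach the same allow/deny decision; only the number of
capability checks performed differs when the sysctl is not set.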

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/syscall.c | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1d8b513ce318..b7737405e1dd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1157,6 +1157,15 @@ static int map_create(union bpf_attr *attr)
 	     !node_online(numa_node)))
 		return -EINVAL;
 
+	/* Intent here is for unprivileged_bpf_disabled to block BPF map
+	 * creation for unprivileged users; other actions depend
+	 * on fd availability and access to bpffs, so are dependent on
+	 * object creation success. Even with unprivileged BPF disabled,
+	 * capability checks are still carried out.
+	 */
+	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
+		return -EPERM;
+
 	/* find map type and init map: hashtable vs rbtree vs bloom vs ... */
 	map = find_and_alloc_map(attr);
 	if (IS_ERR(map))
@@ -2532,6 +2541,16 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	/* eBPF programs must be GPL compatible to use GPL-ed functions */
 	is_gpl = license_is_gpl_compatible(license);
 
+	/* Intent here is for unprivileged_bpf_disabled to block BPF program
+	 * creation for unprivileged users; other actions depend
+	 * on fd availability and access to bpffs, so are dependent on
+	 * object creation success. Even with unprivileged BPF disabled,
+	 * capability checks are still carried out for these
+	 * and other operations.
+	 */
+	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
+		return -EPERM;
+
 	if (attr->insn_cnt == 0 ||
 	    attr->insn_cnt > (bpf_capable() ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
 		return -E2BIG;
@@ -5094,23 +5113,8 @@ static int token_create(union bpf_attr *attr)
 static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 {
 	union bpf_attr attr;
-	bool capable;
 	int err;
 
-	capable = bpf_capable() || !sysctl_unprivileged_bpf_disabled;
-
-	/* Intent here is for unprivileged_bpf_disabled to block key object
-	 * creation commands for unprivileged users; other actions depend
-	 * of fd availability and access to bpffs, so are dependent on
-	 * object creation success.  Capabilities are later verified for
-	 * operations such as load and map create, so even with unprivileged
-	 * BPF disabled, capability checks are still carried out for these
-	 * and other operations.
-	 */
-	if (!capable &&
-	    (cmd == BPF_MAP_CREATE || cmd == BPF_PROG_LOAD))
-		return -EPERM;
-
 	err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);
 	if (err)
 		return err;
-- 
2.34.1



* [PATCH v2 bpf-next 05/18] bpf: inline map creation logic in map_create() function
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (3 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 04/18] bpf: move unprivileged checks into map_create() and bpf_prog_load() Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 06/18] bpf: centralize permissions checks for all BPF map types Andrii Nakryiko
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Currently find_and_alloc_map() performs two separate functions: some
argument sanity checking and partial handling of the map creation
workflow. Neither of those functions is self-sufficient; both are
augmented by further checks and initialization logic in the caller
(map_create() function). So unify all the sanity checks, permission
checks, and creation and initialization logic in one linear piece of
code in map_create() instead. This also makes it easier to further
enhance permission checks and keep them located in one place.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/syscall.c | 57 +++++++++++++++++++-------------------------
 1 file changed, 24 insertions(+), 33 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b7737405e1dd..20b373dce669 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -109,37 +109,6 @@ const struct bpf_map_ops bpf_map_offload_ops = {
 	.map_mem_usage = bpf_map_offload_map_mem_usage,
 };
 
-static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
-{
-	const struct bpf_map_ops *ops;
-	u32 type = attr->map_type;
-	struct bpf_map *map;
-	int err;
-
-	if (type >= ARRAY_SIZE(bpf_map_types))
-		return ERR_PTR(-EINVAL);
-	type = array_index_nospec(type, ARRAY_SIZE(bpf_map_types));
-	ops = bpf_map_types[type];
-	if (!ops)
-		return ERR_PTR(-EINVAL);
-
-	if (ops->map_alloc_check) {
-		err = ops->map_alloc_check(attr);
-		if (err)
-			return ERR_PTR(err);
-	}
-	if (attr->map_ifindex)
-		ops = &bpf_map_offload_ops;
-	if (!ops->map_mem_usage)
-		return ERR_PTR(-EINVAL);
-	map = ops->map_alloc(attr);
-	if (IS_ERR(map))
-		return map;
-	map->ops = ops;
-	map->map_type = type;
-	return map;
-}
-
 static void bpf_map_write_active_inc(struct bpf_map *map)
 {
 	atomic64_inc(&map->writecnt);
@@ -1127,7 +1096,9 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
 {
+	const struct bpf_map_ops *ops;
 	int numa_node = bpf_map_attr_numa_node(attr);
+	u32 map_type = attr->map_type;
 	struct bpf_map *map;
 	int f_flags;
 	int err;
@@ -1157,6 +1128,25 @@ static int map_create(union bpf_attr *attr)
 	     !node_online(numa_node)))
 		return -EINVAL;
 
+	/* find map type and init map: hashtable vs rbtree vs bloom vs ... */
+	map_type = attr->map_type;
+	if (map_type >= ARRAY_SIZE(bpf_map_types))
+		return -EINVAL;
+	map_type = array_index_nospec(map_type, ARRAY_SIZE(bpf_map_types));
+	ops = bpf_map_types[map_type];
+	if (!ops)
+		return -EINVAL;
+
+	if (ops->map_alloc_check) {
+		err = ops->map_alloc_check(attr);
+		if (err)
+			return err;
+	}
+	if (attr->map_ifindex)
+		ops = &bpf_map_offload_ops;
+	if (!ops->map_mem_usage)
+		return -EINVAL;
+
 	/* Intent here is for unprivileged_bpf_disabled to block BPF map
 	 * creation for unprivileged users; other actions depend
 	 * on fd availability and access to bpffs, so are dependent on
@@ -1166,10 +1156,11 @@ static int map_create(union bpf_attr *attr)
 	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
 		return -EPERM;
 
-	/* find map type and init map: hashtable vs rbtree vs bloom vs ... */
-	map = find_and_alloc_map(attr);
+	map = ops->map_alloc(attr);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
+	map->ops = ops;
+	map->map_type = map_type;
 
 	err = bpf_obj_name_cpy(map->name, attr->map_name,
 			       sizeof(attr->map_name));
-- 
2.34.1



* [PATCH v2 bpf-next 06/18] bpf: centralize permissions checks for all BPF map types
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (4 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 05/18] bpf: inline map creation logic in map_create() function Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 07/18] bpf: add BPF token support to BPF_MAP_CREATE command Andrii Nakryiko
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Move the per-map-type capability checks out of the individual map
implementations and into map_create(). This allows more centralized
decisions later on, and generally makes it very explicit which maps are
privileged and which are not (e.g., LRU_HASH and LRU_PERCPU_HASH, which
are privileged HASH variants, as opposed to unprivileged HASH and
PERCPU_HASH; now this is explicit and easy to verify).
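
As a rough illustration of the idea, here is a userspace sketch of a
single decision point for map type permissions. The enum is a
hypothetical, heavily abbreviated stand-in for enum bpf_map_type, and
the caller's capabilities are passed in as plain flags rather than
queried from the kernel (bpf_capable() is modeled as one boolean).

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, abbreviated list; the real enum bpf_map_type is larger. */
enum sketch_map_type {
	SKETCH_MAP_HASH,
	SKETCH_MAP_LRU_HASH,
	SKETCH_MAP_DEVMAP,
	SKETCH_MAP_UNKNOWN,
};

/* Single place that decides whether a given map type may be created. */
static int sketch_map_type_perm(enum sketch_map_type type,
				bool cap_bpf, bool cap_net_admin)
{
	switch (type) {
	case SKETCH_MAP_HASH:
		return 0;				/* unprivileged */
	case SKETCH_MAP_LRU_HASH:
		return cap_bpf ? 0 : -EPERM;		/* needs CAP_BPF */
	case SKETCH_MAP_DEVMAP:
		return cap_net_admin ? 0 : -EPERM;	/* needs CAP_NET_ADMIN */
	default:
		return -EPERM;				/* unknown types are rejected */
	}
}
```

With every map type listed in one switch, the privileged/unprivileged
split can be audited by reading a single function.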

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/bloom_filter.c                     |  3 --
 kernel/bpf/bpf_local_storage.c                |  3 --
 kernel/bpf/bpf_struct_ops.c                   |  3 --
 kernel/bpf/cpumap.c                           |  4 --
 kernel/bpf/devmap.c                           |  3 --
 kernel/bpf/hashtab.c                          |  6 ---
 kernel/bpf/lpm_trie.c                         |  3 --
 kernel/bpf/queue_stack_maps.c                 |  4 --
 kernel/bpf/reuseport_array.c                  |  3 --
 kernel/bpf/stackmap.c                         |  3 --
 kernel/bpf/syscall.c                          | 47 +++++++++++++++++++
 net/core/sock_map.c                           |  4 --
 net/xdp/xskmap.c                              |  4 --
 .../bpf/prog_tests/unpriv_bpf_disabled.c      |  6 ++-
 14 files changed, 52 insertions(+), 44 deletions(-)

diff --git a/kernel/bpf/bloom_filter.c b/kernel/bpf/bloom_filter.c
index 540331b610a9..addf3dd57b59 100644
--- a/kernel/bpf/bloom_filter.c
+++ b/kernel/bpf/bloom_filter.c
@@ -86,9 +86,6 @@ static struct bpf_map *bloom_map_alloc(union bpf_attr *attr)
 	int numa_node = bpf_map_attr_numa_node(attr);
 	struct bpf_bloom_filter *bloom;
 
-	if (!bpf_capable())
-		return ERR_PTR(-EPERM);
-
 	if (attr->key_size != 0 || attr->value_size == 0 ||
 	    attr->max_entries == 0 ||
 	    attr->map_flags & ~BLOOM_CREATE_FLAG_MASK ||
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 47d9948d768f..b5149cfce7d4 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -723,9 +723,6 @@ int bpf_local_storage_map_alloc_check(union bpf_attr *attr)
 	    !attr->btf_key_type_id || !attr->btf_value_type_id)
 		return -EINVAL;
 
-	if (!bpf_capable())
-		return -EPERM;
-
 	if (attr->value_size > BPF_LOCAL_STORAGE_MAX_VALUE_SIZE)
 		return -E2BIG;
 
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index d3f0a4825fa6..116a0ce378ec 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -655,9 +655,6 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 	const struct btf_type *t, *vt;
 	struct bpf_map *map;
 
-	if (!bpf_capable())
-		return ERR_PTR(-EPERM);
-
 	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
 	if (!st_ops)
 		return ERR_PTR(-ENOTSUPP);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 8ec18faa74ac..8a33e8747a0e 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -28,7 +28,6 @@
 #include <linux/sched.h>
 #include <linux/workqueue.h>
 #include <linux/kthread.h>
-#include <linux/capability.h>
 #include <trace/events/xdp.h>
 #include <linux/btf_ids.h>
 
@@ -89,9 +88,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	u32 value_size = attr->value_size;
 	struct bpf_cpu_map *cmap;
 
-	if (!bpf_capable())
-		return ERR_PTR(-EPERM);
-
 	/* check sanity of attributes */
 	if (attr->max_entries == 0 || attr->key_size != 4 ||
 	    (value_size != offsetofend(struct bpf_cpumap_val, qsize) &&
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 802692fa3905..49cc0b5671c6 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -160,9 +160,6 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	struct bpf_dtab *dtab;
 	int err;
 
-	if (!capable(CAP_NET_ADMIN))
-		return ERR_PTR(-EPERM);
-
 	dtab = bpf_map_area_alloc(sizeof(*dtab), NUMA_NO_NODE);
 	if (!dtab)
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 9901efee4339..56d3da7d0bc6 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -422,12 +422,6 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 	BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
 		     offsetof(struct htab_elem, hash_node.pprev));
 
-	if (lru && !bpf_capable())
-		/* LRU implementation is much complicated than other
-		 * maps.  Hence, limit to CAP_BPF.
-		 */
-		return -EPERM;
-
 	if (zero_seed && !capable(CAP_SYS_ADMIN))
 		/* Guard against local DoS, and discourage production use. */
 		return -EPERM;
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index e0d3ddf2037a..17c7e7782a1f 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -544,9 +544,6 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 {
 	struct lpm_trie *trie;
 
-	if (!bpf_capable())
-		return ERR_PTR(-EPERM);
-
 	/* check sanity of attributes */
 	if (attr->max_entries == 0 ||
 	    !(attr->map_flags & BPF_F_NO_PREALLOC) ||
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index 601609164ef3..8d2ddcb7566b 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -7,7 +7,6 @@
 #include <linux/bpf.h>
 #include <linux/list.h>
 #include <linux/slab.h>
-#include <linux/capability.h>
 #include <linux/btf_ids.h>
 #include "percpu_freelist.h"
 
@@ -46,9 +45,6 @@ static bool queue_stack_map_is_full(struct bpf_queue_stack *qs)
 /* Called from syscall */
 static int queue_stack_map_alloc_check(union bpf_attr *attr)
 {
-	if (!bpf_capable())
-		return -EPERM;
-
 	/* check sanity of attributes */
 	if (attr->max_entries == 0 || attr->key_size != 0 ||
 	    attr->value_size == 0 ||
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index cbf2d8d784b8..4b4f9670f1a9 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -151,9 +151,6 @@ static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 	int numa_node = bpf_map_attr_numa_node(attr);
 	struct reuseport_array *array;
 
-	if (!bpf_capable())
-		return ERR_PTR(-EPERM);
-
 	/* allocate all map elements and zero-initialize them */
 	array = bpf_map_area_alloc(struct_size(array, ptrs, attr->max_entries), numa_node);
 	if (!array)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index b25fce425b2c..458bb80b14d5 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -74,9 +74,6 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 	u64 cost, n_buckets;
 	int err;
 
-	if (!bpf_capable())
-		return ERR_PTR(-EPERM);
-
 	if (attr->map_flags & ~STACK_CREATE_FLAG_MASK)
 		return ERR_PTR(-EINVAL);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 20b373dce669..093472ac40f7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1156,6 +1156,53 @@ static int map_create(union bpf_attr *attr)
 	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
 		return -EPERM;
 
+	/* check privileged map type permissions */
+	switch (map_type) {
+	case BPF_MAP_TYPE_ARRAY:
+	case BPF_MAP_TYPE_PERCPU_ARRAY:
+	case BPF_MAP_TYPE_PROG_ARRAY:
+	case BPF_MAP_TYPE_PERF_EVENT_ARRAY:
+	case BPF_MAP_TYPE_CGROUP_ARRAY:
+	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
+	case BPF_MAP_TYPE_HASH:
+	case BPF_MAP_TYPE_PERCPU_HASH:
+	case BPF_MAP_TYPE_HASH_OF_MAPS:
+	case BPF_MAP_TYPE_RINGBUF:
+	case BPF_MAP_TYPE_USER_RINGBUF:
+	case BPF_MAP_TYPE_CGROUP_STORAGE:
+	case BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE:
+		/* unprivileged */
+		break;
+	case BPF_MAP_TYPE_SK_STORAGE:
+	case BPF_MAP_TYPE_INODE_STORAGE:
+	case BPF_MAP_TYPE_TASK_STORAGE:
+	case BPF_MAP_TYPE_CGRP_STORAGE:
+	case BPF_MAP_TYPE_BLOOM_FILTER:
+	case BPF_MAP_TYPE_LPM_TRIE:
+	case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY:
+	case BPF_MAP_TYPE_STACK_TRACE:
+	case BPF_MAP_TYPE_QUEUE:
+	case BPF_MAP_TYPE_STACK:
+	case BPF_MAP_TYPE_LRU_HASH:
+	case BPF_MAP_TYPE_LRU_PERCPU_HASH:
+	case BPF_MAP_TYPE_STRUCT_OPS:
+	case BPF_MAP_TYPE_CPUMAP:
+		if (!bpf_capable())
+			return -EPERM;
+		break;
+	case BPF_MAP_TYPE_SOCKMAP:
+	case BPF_MAP_TYPE_SOCKHASH:
+	case BPF_MAP_TYPE_DEVMAP:
+	case BPF_MAP_TYPE_DEVMAP_HASH:
+	case BPF_MAP_TYPE_XSKMAP:
+		if (!capable(CAP_NET_ADMIN))
+			return -EPERM;
+		break;
+	default:
+		WARN(1, "unsupported map type %d", map_type);
+		return -EPERM;
+	}
+
 	map = ops->map_alloc(attr);
 	if (IS_ERR(map))
 		return PTR_ERR(map);
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 00afb66cd095..19538d628714 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -32,8 +32,6 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 {
 	struct bpf_stab *stab;
 
-	if (!capable(CAP_NET_ADMIN))
-		return ERR_PTR(-EPERM);
 	if (attr->max_entries == 0 ||
 	    attr->key_size    != 4 ||
 	    (attr->value_size != sizeof(u32) &&
@@ -1085,8 +1083,6 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
 	struct bpf_shtab *htab;
 	int i, err;
 
-	if (!capable(CAP_NET_ADMIN))
-		return ERR_PTR(-EPERM);
 	if (attr->max_entries == 0 ||
 	    attr->key_size    == 0 ||
 	    (attr->value_size != sizeof(u32) &&
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 2c1427074a3b..e1c526f97ce3 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -5,7 +5,6 @@
 
 #include <linux/bpf.h>
 #include <linux/filter.h>
-#include <linux/capability.h>
 #include <net/xdp_sock.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
@@ -68,9 +67,6 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 	int numa_node;
 	u64 size;
 
-	if (!capable(CAP_NET_ADMIN))
-		return ERR_PTR(-EPERM);
-
 	if (attr->max_entries == 0 || attr->key_size != 4 ||
 	    attr->value_size != 4 ||
 	    attr->map_flags & ~(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY))
diff --git a/tools/testing/selftests/bpf/prog_tests/unpriv_bpf_disabled.c b/tools/testing/selftests/bpf/prog_tests/unpriv_bpf_disabled.c
index 8383a99f610f..0adf8d9475cb 100644
--- a/tools/testing/selftests/bpf/prog_tests/unpriv_bpf_disabled.c
+++ b/tools/testing/selftests/bpf/prog_tests/unpriv_bpf_disabled.c
@@ -171,7 +171,11 @@ static void test_unpriv_bpf_disabled_negative(struct test_unpriv_bpf_disabled *s
 				prog_insns, prog_insn_cnt, &load_opts),
 		  -EPERM, "prog_load_fails");
 
-	for (i = BPF_MAP_TYPE_HASH; i <= BPF_MAP_TYPE_BLOOM_FILTER; i++)
+	/* some map types require particular correct parameters which could be
+	 * sanity-checked before enforcing -EPERM, so only validate that
+	 * the simple ARRAY and HASH maps are failing with -EPERM
+	 */
+	for (i = BPF_MAP_TYPE_HASH; i <= BPF_MAP_TYPE_ARRAY; i++)
 		ASSERT_EQ(bpf_map_create(i, NULL, sizeof(int), sizeof(int), 1, NULL),
 			  -EPERM, "map_create_fails");
 
-- 
2.34.1



* [PATCH v2 bpf-next 07/18] bpf: add BPF token support to BPF_MAP_CREATE command
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (5 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 06/18] bpf: centralize permissions checks for all BPF map types Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 08/18] libbpf: add BPF token support to bpf_map_create() API Andrii Nakryiko
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.

Further, add a filter of allowed BPF map types to BPF token, specified
at BPF token creation time. This, in combination with allowed_cmds,
makes it possible to create a narrowly-focused BPF token (controlled by
a privileged agent) with a restrictive set of BPF map types that an
application can attempt to create.
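
The bit set semantics can be sketched in plain userspace C as follows.
This is an illustration only: the function names and the
SKETCH_MAX_MAP_TYPE bound are hypothetical, and the sketch uses 64-bit
operands throughout so that bits at position 32 and above are
preserved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound standing in for __MAX_BPF_MAP_TYPE. */
#define SKETCH_MAX_MAP_TYPE 33

/* True if the bit for this map type is set in the token's filter. */
static bool sketch_allow_map_type(uint64_t allowed_map_types, unsigned int type)
{
	if (type >= SKETCH_MAX_MAP_TYPE)
		return false;
	return (allowed_map_types & (1ULL << type)) != 0;
}

/* True if every bit requested in subset is also present in superset. */
static bool sketch_is_subset(uint64_t subset, uint64_t superset)
{
	return (superset & subset) == subset;
}
```

A derived token would then be acceptable only when its requested
allowed_map_types passes the subset check against the parent token's
set, so a derived token can narrow the filter but not widen it.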

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h                           |  3 +
 include/uapi/linux/bpf.h                      |  6 ++
 kernel/bpf/syscall.c                          | 69 +++++++++++++++----
 kernel/bpf/token.c                            |  8 +++
 tools/include/uapi/linux/bpf.h                | 10 ++-
 .../selftests/bpf/prog_tests/libbpf_str.c     |  3 +
 6 files changed, 84 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5f3944352c26..e0c7eb5b0bd7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -251,6 +251,7 @@ struct bpf_map {
 	u32 btf_value_type_id;
 	u32 btf_vmlinux_value_type_id;
 	struct btf *btf;
+	struct bpf_token *token;
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *objcg;
 #endif
@@ -1538,6 +1539,7 @@ struct bpf_token {
 	struct work_struct work;
 	atomic64_t refcnt;
 	u64 allowed_cmds;
+	u64 allowed_map_types;
 };
 
 struct bpf_struct_ops_value;
@@ -2096,6 +2098,7 @@ int bpf_token_new_fd(struct bpf_token *token);
 struct bpf_token *bpf_token_get_from_fd(u32 ufd);
 
 bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
+bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type type);
 
 int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
 int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3e7e8d8cbe90..7ee499a440a3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -954,6 +954,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
 	BPF_MAP_TYPE_CGRP_STORAGE,
+	__MAX_BPF_MAP_TYPE
 };
 
 /* Note that tracing related programs such as
@@ -1359,6 +1360,7 @@ union bpf_attr {
 		 * to using 5 hash functions).
 		 */
 		__u64	map_extra;
+		__u32	map_token_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -1641,6 +1643,10 @@ union bpf_attr {
 		 * programs
 		 */
 		__u64		allowed_cmds;
+		/* similarly to allowed_cmds, a bit set of BPF map types that
+		 * are allowed to be created by requested BPF token;
+		 */
+		__u64		allowed_map_types;
 	} token_create;
 
 } __attribute__((aligned(8)));
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 093472ac40f7..cba7235d48da 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -691,6 +691,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
 {
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 	struct btf_record *rec = map->record;
+	struct bpf_token *token = map->token;
 
 	security_bpf_map_free(map);
 	bpf_map_release_memcg(map);
@@ -706,6 +707,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
 	 * template bpf_map struct used during verification.
 	 */
 	btf_record_free(rec);
+	bpf_token_put(token);
 }
 
 static void bpf_map_put_uref(struct bpf_map *map)
@@ -1010,7 +1012,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	if (!IS_ERR_OR_NULL(map->record)) {
 		int i;
 
-		if (!bpf_capable()) {
+		if (!bpf_token_capable(map->token, CAP_BPF)) {
 			ret = -EPERM;
 			goto free_map_tab;
 		}
@@ -1092,11 +1094,12 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	return ret;
 }
 
-#define BPF_MAP_CREATE_LAST_FIELD map_extra
+#define BPF_MAP_CREATE_LAST_FIELD map_token_fd
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
 {
 	const struct bpf_map_ops *ops;
+	struct bpf_token *token = NULL;
 	int numa_node = bpf_map_attr_numa_node(attr);
 	u32 map_type = attr->map_type;
 	struct bpf_map *map;
@@ -1147,14 +1150,32 @@ static int map_create(union bpf_attr *attr)
 	if (!ops->map_mem_usage)
 		return -EINVAL;
 
+	if (attr->map_token_fd) {
+		token = bpf_token_get_from_fd(attr->map_token_fd);
+		if (IS_ERR(token))
+			return PTR_ERR(token);
+
+		/* if current token doesn't grant map creation permissions,
+		 * then we can't use this token, so ignore it and rely on
+		 * system-wide capabilities checks
+		 */
+		if (!bpf_token_allow_cmd(token, BPF_MAP_CREATE) ||
+		    !bpf_token_allow_map_type(token, attr->map_type)) {
+			bpf_token_put(token);
+			token = NULL;
+		}
+	}
+
+	err = -EPERM;
+
 	/* Intent here is for unprivileged_bpf_disabled to block BPF map
 	 * creation for unprivileged users; other actions depend
 	 * on fd availability and access to bpffs, so are dependent on
 	 * object creation success. Even with unprivileged BPF disabled,
 	 * capability checks are still carried out.
 	 */
-	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
-		return -EPERM;
+	if (sysctl_unprivileged_bpf_disabled && !bpf_token_capable(token, CAP_BPF))
+		goto put_token;
 
 	/* check privileged map type permissions */
 	switch (map_type) {
@@ -1187,28 +1208,36 @@ static int map_create(union bpf_attr *attr)
 	case BPF_MAP_TYPE_LRU_PERCPU_HASH:
 	case BPF_MAP_TYPE_STRUCT_OPS:
 	case BPF_MAP_TYPE_CPUMAP:
-		if (!bpf_capable())
-			return -EPERM;
+		if (!bpf_token_capable(token, CAP_BPF))
+			goto put_token;
 		break;
 	case BPF_MAP_TYPE_SOCKMAP:
 	case BPF_MAP_TYPE_SOCKHASH:
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 	case BPF_MAP_TYPE_XSKMAP:
-		if (!capable(CAP_NET_ADMIN))
-			return -EPERM;
+		if (!bpf_token_capable(token, CAP_NET_ADMIN))
+			goto put_token;
 		break;
 	default:
 		WARN(1, "unsupported map type %d", map_type);
-		return -EPERM;
+		goto put_token;
 	}
 
 	map = ops->map_alloc(attr);
-	if (IS_ERR(map))
-		return PTR_ERR(map);
+	if (IS_ERR(map)) {
+		err = PTR_ERR(map);
+		goto put_token;
+	}
 	map->ops = ops;
 	map->map_type = map_type;
 
+	if (token) {
+		/* move token reference into map->token, reuse our refcnt */
+		map->token = token;
+		token = NULL;
+	}
+
 	err = bpf_obj_name_cpy(map->name, attr->map_name,
 			       sizeof(attr->map_name));
 	if (err < 0)
@@ -1281,8 +1310,11 @@ static int map_create(union bpf_attr *attr)
 free_map_sec:
 	security_bpf_map_free(map);
 free_map:
+	bpf_token_put(map->token);
 	btf_put(map->btf);
 	map->ops->map_free(map);
+put_token:
+	bpf_token_put(token);
 	return err;
 }
 
@@ -5086,9 +5118,11 @@ static bool is_bit_subset_of(u32 subset, u32 superset)
 	return (superset & subset) == subset;
 }
 
-#define BPF_TOKEN_CMDS_MASK ((1ULL << BPF_TOKEN_CREATE))
-
-#define BPF_TOKEN_CREATE_LAST_FIELD token_create.allowed_cmds
+#define BPF_TOKEN_CMDS_MASK (			\
+	(1ULL << BPF_TOKEN_CREATE)		\
+	| (1ULL << BPF_MAP_CREATE)		\
+)
+#define BPF_TOKEN_CREATE_LAST_FIELD token_create.allowed_map_types
 
 static int token_create(union bpf_attr *attr)
 {
@@ -5124,6 +5158,12 @@ static int token_create(union bpf_attr *attr)
 		err = -EPERM;
 		goto err_out;
 	}
+	/* requested map types should be a subset of associated token's set */
+	if (token && !is_bit_subset_of(attr->token_create.allowed_map_types,
+				       token->allowed_map_types)) {
+		err = -EPERM;
+		goto err_out;
+	}
 
 	new_token = bpf_token_alloc();
 	if (!new_token) {
@@ -5132,6 +5172,7 @@ static int token_create(union bpf_attr *attr)
 	}
 
 	new_token->allowed_cmds = attr->token_create.allowed_cmds;
+	new_token->allowed_map_types = attr->token_create.allowed_map_types;
 
 	fd = bpf_token_new_fd(new_token);
 	if (fd < 0) {
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
index 4257281ca1ec..0abb1fa4f181 100644
--- a/kernel/bpf/token.c
+++ b/kernel/bpf/token.c
@@ -115,3 +115,11 @@ bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
 
 	return token->allowed_cmds & (1ULL << cmd);
 }
+
+bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type type)
+{
+	if (!token || type >= __MAX_BPF_MAP_TYPE)
+		return false;
+
+	return token->allowed_map_types & (1ULL << type);
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3e7e8d8cbe90..0722d42b55ea 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -954,6 +954,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
 	BPF_MAP_TYPE_CGRP_STORAGE,
+	__MAX_BPF_MAP_TYPE
 };
 
 /* Note that tracing related programs such as
@@ -1359,6 +1360,7 @@ union bpf_attr {
 		 * to using 5 hash functions).
 		 */
 		__u64	map_extra;
+		__u32	map_token_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -1638,9 +1640,15 @@ union bpf_attr {
 		/* a bit set of allowed bpf() syscall commands,
 		 * e.g., (1ULL << BPF_TOKEN_CREATE) | (1ULL << BPF_PROG_LOAD)
 		 * will allow creating derived BPF tokens and loading new BPF
-		 * programs
+		 * programs;
+		 * see also BPF_F_TOKEN_IGNORE_UNKNOWN_CMDS for its effect on
+		 * validity checking of this set
 		 */
 		__u64		allowed_cmds;
+		/* similarly to allowed_cmds, a bit set of BPF map types that
+		 * are allowed to be created by requested BPF token;
+		 */
+		__u64		allowed_map_types;
 	} token_create;
 
 } __attribute__((aligned(8)));
diff --git a/tools/testing/selftests/bpf/prog_tests/libbpf_str.c b/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
index efb8bd43653c..e677c0435cec 100644
--- a/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
+++ b/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
@@ -132,6 +132,9 @@ static void test_libbpf_bpf_map_type_str(void)
 		const char *map_type_str;
 		char buf[256];
 
+		if (map_type == __MAX_BPF_MAP_TYPE)
+			continue;
+
 		map_type_name = btf__str_by_offset(btf, e->name_off);
 		map_type_str = libbpf_bpf_map_type_str(map_type);
 		ASSERT_OK_PTR(map_type_str, map_type_name);
-- 
2.34.1



* [PATCH v2 bpf-next 08/18] libbpf: add BPF token support to bpf_map_create() API
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (6 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 07/18] bpf: add BPF token support to BPF_MAP_CREATE command Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 09/18] selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command Andrii Nakryiko
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Add the ability to provide token_fd for the BPF_MAP_CREATE command
through the bpf_map_create() API.

Also wire the token_create.allowed_map_types param through for the
BPF_TOKEN_CREATE command.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/lib/bpf/bpf.c | 5 ++++-
 tools/lib/bpf/bpf.h | 7 +++++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 38be66719485..0318538d43eb 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -169,7 +169,7 @@ int bpf_map_create(enum bpf_map_type map_type,
 		   __u32 max_entries,
 		   const struct bpf_map_create_opts *opts)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, map_extra);
+	const size_t attr_sz = offsetofend(union bpf_attr, map_token_fd);
 	union bpf_attr attr;
 	int fd;
 
@@ -198,6 +198,8 @@ int bpf_map_create(enum bpf_map_type map_type,
 	attr.numa_node = OPTS_GET(opts, numa_node, 0);
 	attr.map_ifindex = OPTS_GET(opts, map_ifindex, 0);
 
+	attr.map_token_fd = OPTS_GET(opts, token_fd, 0);
+
 	fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, attr_sz);
 	return libbpf_err_errno(fd);
 }
@@ -1215,6 +1217,7 @@ int bpf_token_create(struct bpf_token_create_opts *opts)
 	attr.token_create.flags = OPTS_GET(opts, flags, 0);
 	attr.token_create.token_fd = OPTS_GET(opts, token_fd, 0);
 	attr.token_create.allowed_cmds = OPTS_GET(opts, allowed_cmds, 0);
+	attr.token_create.allowed_map_types = OPTS_GET(opts, allowed_map_types, 0);
 
 	ret = sys_bpf_fd(BPF_TOKEN_CREATE, &attr, attr_sz);
 	return libbpf_err_errno(ret);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index f2b8041ca27a..19a43201d1af 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -51,8 +51,10 @@ struct bpf_map_create_opts {
 
 	__u32 numa_node;
 	__u32 map_ifindex;
+
+	__u32 token_fd;
 };
-#define bpf_map_create_opts__last_field map_ifindex
+#define bpf_map_create_opts__last_field token_fd
 
 LIBBPF_API int bpf_map_create(enum bpf_map_type map_type,
 			      const char *map_name,
@@ -556,9 +558,10 @@ struct bpf_token_create_opts {
 	__u32 flags;
 	__u32 token_fd;
 	__u64 allowed_cmds;
+	__u64 allowed_map_types;
 	size_t :0;
 };
-#define bpf_token_create_opts__last_field allowed_cmds
+#define bpf_token_create_opts__last_field allowed_map_types
 
 LIBBPF_API int bpf_token_create(struct bpf_token_create_opts *opts);
 
-- 
2.34.1



* [PATCH v2 bpf-next 09/18] selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (7 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 08/18] libbpf: add BPF token support to bpf_map_create() API Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 10/18] bpf: add BPF token support to BPF_BTF_LOAD command Andrii Nakryiko
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Add a test that creates a BPF token with BPF_MAP_CREATE delegation
enabled. Validate that the token's allowed_map_types filter works as
expected: privileged BPF maps can be created through the delegated token
only if their type was allowed by the privileged creator of the token.
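
As an illustration of what the test exercises, the allowed_map_types
filter can be modeled in plain user-space C. This is only a sketch, not
the kernel code: sketch_allow_map_type() and the SKETCH_* constants are
hypothetical names, and the numeric values are placeholders chosen for
the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder map type IDs for the sketch (assumed values). */
enum sketch_map_type {
	SKETCH_MAP_TYPE_QUEUE = 22,
	SKETCH_MAP_TYPE_STACK = 23,
};

/* A map type is allowed when its bit is set in the 64-bit mask. */
static bool sketch_allow_map_type(uint64_t allowed_map_types, unsigned int type)
{
	if (type >= 64) /* cannot be represented in a 64-bit mask */
		return false;
	return (allowed_map_types & (1ULL << type)) != 0;
}
```

With a mask that has only the STACK bit set, the STACK check passes and
the QUEUE check fails, which mirrors the -EPERM the test expects. A mask
of ~0ULL admits every type that fits in the mask.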

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 .../testing/selftests/bpf/prog_tests/token.c  | 50 +++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/token.c b/tools/testing/selftests/bpf/prog_tests/token.c
index cba84c480ac5..61707b3e81a7 100644
--- a/tools/testing/selftests/bpf/prog_tests/token.c
+++ b/tools/testing/selftests/bpf/prog_tests/token.c
@@ -86,8 +86,58 @@ static void subtest_token_create(void)
 		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
 }
 
+static void subtest_map_token(void)
+{
+	LIBBPF_OPTS(bpf_token_create_opts, token_opts);
+	LIBBPF_OPTS(bpf_map_create_opts, map_opts);
+	int token_fd = 0, map_fd = 0;
+	__u64 old_caps = 0;
+
+	/* check that it's ok to allow any map type */
+	token_opts.allowed_map_types = ~0ULL; /* any current and future map types are allowed */
+	token_fd = bpf_token_create(&token_opts);
+	if (!ASSERT_GT(token_fd, 0, "token_create_future_proof"))
+		return;
+	close(token_fd);
+
+	/* create BPF token allowing STACK, but not QUEUE map */
+	token_opts.allowed_cmds = 1ULL << BPF_MAP_CREATE;
+	token_opts.allowed_map_types = 1ULL << BPF_MAP_TYPE_STACK; /* but not QUEUE */
+	token_fd = bpf_token_create(&token_opts);
+	if (!ASSERT_GT(token_fd, 0, "token_create"))
+		return;
+
+	/* drop privileges to test token_fd passing */
+	if (!ASSERT_OK(drop_priv_caps(&old_caps), "drop_caps"))
+		goto cleanup;
+
+	/* BPF_MAP_TYPE_STACK is privileged, but with given token_fd should succeed */
+	map_opts.token_fd = token_fd;
+	map_fd = bpf_map_create(BPF_MAP_TYPE_STACK, "token_stack", 0, 8, 1, &map_opts);
+	if (!ASSERT_GT(map_fd, 0, "stack_map_fd"))
+		goto cleanup;
+	close(map_fd);
+	map_fd = 0;
+
+	/* BPF_MAP_TYPE_QUEUE is privileged, and token doesn't allow it, so should fail */
+	map_opts.token_fd = token_fd;
+	map_fd = bpf_map_create(BPF_MAP_TYPE_QUEUE, "token_queue", 0, 8, 1, &map_opts);
+	if (!ASSERT_EQ(map_fd, -EPERM, "queue_map_fd"))
+		goto cleanup;
+
+cleanup:
+	if (map_fd > 0)
+		close(map_fd);
+	if (token_fd)
+		close(token_fd);
+	if (old_caps)
+		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
+}
+
 void test_token(void)
 {
 	if (test__start_subtest("token_create"))
 		subtest_token_create();
+	if (test__start_subtest("map_token"))
+		subtest_map_token();
 }
-- 
2.34.1



* [PATCH v2 bpf-next 10/18] bpf: add BPF token support to BPF_BTF_LOAD command
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (8 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 09/18] selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 11/18] libbpf: add BPF token support to bpf_btf_load() API Andrii Nakryiko
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Accept a BPF token FD in the BPF_BTF_LOAD command to allow BTF data
loading through a delegated BPF token. BTF loading is a pretty
straightforward operation, so as long as the BPF token is created with
allowed_cmds granting the BPF_BTF_LOAD command, the kernel proceeds to
parsing BTF data and creating a BTF object.
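
The permission flow described above can be sketched as a small
user-space model. All names here (sketch_token, sketch_btf_load_check(),
SKETCH_*) are hypothetical. The sketch also assumes that a token which
grants the command is sufficient on its own, and that a token which
does not grant it is ignored in favor of the caller's own CAP_BPF. That
capability helper is defined in an earlier patch not shown in this part
of the thread, so treat this as an approximation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EPERM 1 /* stand-in for EPERM */

/* Placeholder command IDs for the sketch (assumed values). */
enum sketch_cmd {
	SKETCH_CMD_MAP_CREATE = 0,
	SKETCH_CMD_BTF_LOAD = 18,
};

struct sketch_token {
	uint64_t allowed_cmds; /* one bit per delegated command */
};

static bool sketch_token_allow_cmd(const struct sketch_token *t, unsigned int cmd)
{
	return t && cmd < 64 && (t->allowed_cmds & (1ULL << cmd)) != 0;
}

/* Returns 0 if BTF loading may proceed, -SKETCH_EPERM otherwise. */
static int sketch_btf_load_check(const struct sketch_token *token, bool has_cap_bpf)
{
	/* A token that does not grant the command is simply ignored. */
	if (token && !sketch_token_allow_cmd(token, SKETCH_CMD_BTF_LOAD))
		token = NULL;

	/* Either a usable token or the caller's own capability is needed. */
	if (!token && !has_cap_bpf)
		return -SKETCH_EPERM;
	return 0;
}
```

Note that in this model an unusable token does not cause an error by
itself; the request only fails if the caller also lacks the capability.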

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/uapi/linux/bpf.h                      |  1 +
 kernel/bpf/syscall.c                          | 21 +++++++++++++++++--
 tools/include/uapi/linux/bpf.h                |  1 +
 .../selftests/bpf/prog_tests/libbpf_probes.c  |  2 ++
 4 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7ee499a440a3..9043a1f8c419 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1527,6 +1527,7 @@ union bpf_attr {
 		 * truncated), or smaller (if log buffer wasn't filled completely).
 		 */
 		__u32		btf_log_true_size;
+		__u32		btf_token_fd;
 	};
 
 	struct {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cba7235d48da..2d9f971ec227 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4475,15 +4475,31 @@ static int bpf_obj_get_info_by_fd(const union bpf_attr *attr,
 	return err;
 }
 
-#define BPF_BTF_LOAD_LAST_FIELD btf_log_true_size
+#define BPF_BTF_LOAD_LAST_FIELD btf_token_fd
 
 static int bpf_btf_load(const union bpf_attr *attr, bpfptr_t uattr, __u32 uattr_size)
 {
+	struct bpf_token *token = NULL;
+
 	if (CHECK_ATTR(BPF_BTF_LOAD))
 		return -EINVAL;
 
-	if (!bpf_capable())
+	if (attr->btf_token_fd) {
+		token = bpf_token_get_from_fd(attr->btf_token_fd);
+		if (IS_ERR(token))
+			return PTR_ERR(token);
+		if (!bpf_token_allow_cmd(token, BPF_BTF_LOAD)) {
+			bpf_token_put(token);
+			token = NULL;
+		}
+	}
+
+	if (!bpf_token_capable(token, CAP_BPF)) {
+		bpf_token_put(token);
 		return -EPERM;
+	}
+
+	bpf_token_put(token);
 
 	return btf_new_fd(attr, uattr, uattr_size);
 }
@@ -5121,6 +5137,7 @@ static bool is_bit_subset_of(u32 subset, u32 superset)
 #define BPF_TOKEN_CMDS_MASK (			\
 	(1ULL << BPF_TOKEN_CREATE)		\
 	| (1ULL << BPF_MAP_CREATE)		\
+	| (1ULL << BPF_BTF_LOAD)		\
 )
 #define BPF_TOKEN_CREATE_LAST_FIELD token_create.allowed_map_types
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0722d42b55ea..366abd8b55b6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1527,6 +1527,7 @@ union bpf_attr {
 		 * truncated), or smaller (if log buffer wasn't filled completely).
 		 */
 		__u32		btf_log_true_size;
+		__u32		btf_token_fd;
 	};
 
 	struct {
diff --git a/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c b/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
index 9f766ddd946a..573249a2814d 100644
--- a/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
+++ b/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
@@ -68,6 +68,8 @@ void test_libbpf_probe_map_types(void)
 
 		if (map_type == BPF_MAP_TYPE_UNSPEC)
 			continue;
+		if (strcmp(map_type_name, "__MAX_BPF_MAP_TYPE") == 0)
+			continue;
 
 		if (!test__start_subtest(map_type_name))
 			continue;
-- 
2.34.1



* [PATCH v2 bpf-next 11/18] libbpf: add BPF token support to bpf_btf_load() API
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (9 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 10/18] bpf: add BPF token support to BPF_BTF_LOAD command Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 12/18] selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest Andrii Nakryiko
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Allow the user to specify token_fd for the bpf_btf_load() API, which
wraps the kernel's BPF_BTF_LOAD command. This allows loading BTF from an
unprivileged process, as long as it holds a BPF token allowing the
BPF_BTF_LOAD command. Such a token can be created and delegated by a
privileged process.
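
For readers unfamiliar with libbpf's extensible options structs, the
following sketch shows why appending token_fd at the end of the struct
stays compatible with callers built against an older layout. It is a
simplified model, not libbpf's actual macros: sketch_opts,
SKETCH_OFFSETOFEND and sketch_get_token_fd() are hypothetical names, and
the struct has fewer fields than the real one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified options struct; 'sz' records the size the caller compiled. */
struct sketch_opts {
	size_t sz;
	uint32_t log_true_size;
	uint32_t token_fd; /* field appended by a newer library version */
};

/* Offset of the first byte past 'field' inside 'type'. */
#define SKETCH_OFFSETOFEND(type, field) \
	(offsetof(type, field) + sizeof(((type *)0)->field))

/* Read token_fd only if the caller's struct is large enough to hold it. */
static uint32_t sketch_get_token_fd(const struct sketch_opts *opts)
{
	if (!opts || opts->sz < SKETCH_OFFSETOFEND(struct sketch_opts, token_fd))
		return 0; /* fallback: no token */
	return opts->token_fd;
}
```

A caller that passes NULL, or a struct whose sz stops before token_fd,
gets the fallback value 0, which the kernel side treats as "no token".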

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/lib/bpf/bpf.c | 4 +++-
 tools/lib/bpf/bpf.h | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 0318538d43eb..193993dbbdc4 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -1098,7 +1098,7 @@ int bpf_raw_tracepoint_open(const char *name, int prog_fd)
 
 int bpf_btf_load(const void *btf_data, size_t btf_size, struct bpf_btf_load_opts *opts)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, btf_log_true_size);
+	const size_t attr_sz = offsetofend(union bpf_attr, btf_token_fd);
 	union bpf_attr attr;
 	char *log_buf;
 	size_t log_size;
@@ -1123,6 +1123,8 @@ int bpf_btf_load(const void *btf_data, size_t btf_size, struct bpf_btf_load_opts
 
 	attr.btf = ptr_to_u64(btf_data);
 	attr.btf_size = btf_size;
+	attr.btf_token_fd = OPTS_GET(opts, token_fd, 0);
+
 	/* log_level == 0 and log_buf != NULL means "try loading without
 	 * log_buf, but retry with log_buf and log_level=1 on error", which is
 	 * consistent across low-level and high-level BTF and program loading
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 19a43201d1af..3153a9e697e2 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -132,9 +132,10 @@ struct bpf_btf_load_opts {
 	 * If kernel doesn't support this feature, log_size is left unchanged.
 	 */
 	__u32 log_true_size;
+	__u32 token_fd;
 	size_t :0;
 };
-#define bpf_btf_load_opts__last_field log_true_size
+#define bpf_btf_load_opts__last_field token_fd
 
 LIBBPF_API int bpf_btf_load(const void *btf_data, size_t btf_size,
 			    struct bpf_btf_load_opts *opts);
-- 
2.34.1



* [PATCH v2 bpf-next 12/18] selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (10 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 11/18] libbpf: add BPF token support to bpf_btf_load() API Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 13/18] bpf: keep BPF_PROG_LOAD permission checks clear of validations Andrii Nakryiko
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Add a simple test validating that BTF loading can be done from an
unprivileged process through a delegated BPF token.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 .../testing/selftests/bpf/prog_tests/token.c  | 55 +++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/token.c b/tools/testing/selftests/bpf/prog_tests/token.c
index 61707b3e81a7..ff8ada405576 100644
--- a/tools/testing/selftests/bpf/prog_tests/token.c
+++ b/tools/testing/selftests/bpf/prog_tests/token.c
@@ -134,10 +134,65 @@ static void subtest_map_token(void)
 		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
 }
 
+static void subtest_btf_token(void)
+{
+	LIBBPF_OPTS(bpf_token_create_opts, token_opts);
+	LIBBPF_OPTS(bpf_btf_load_opts, btf_opts);
+	int token_fd = 0, btf_fd = 0;
+	const void *raw_btf_data;
+	struct btf *btf = NULL;
+	__u32 raw_btf_size;
+	__u64 old_caps = 0;
+
+	/* create BPF token allowing BPF_BTF_LOAD command */
+	token_opts.allowed_cmds = 1ULL << BPF_BTF_LOAD;
+	token_fd = bpf_token_create(&token_opts);
+	if (!ASSERT_GT(token_fd, 0, "token_create"))
+		return;
+
+	/* drop privileges to test token_fd passing */
+	if (!ASSERT_OK(drop_priv_caps(&old_caps), "drop_caps"))
+		goto cleanup;
+
+	btf = btf__new_empty();
+	if (!ASSERT_OK_PTR(btf, "empty_btf"))
+		goto cleanup;
+
+	ASSERT_GT(btf__add_int(btf, "int", 4, 0), 0, "int_type");
+
+	raw_btf_data = btf__raw_data(btf, &raw_btf_size);
+	if (!ASSERT_OK_PTR(raw_btf_data, "raw_btf_data"))
+		goto cleanup;
+
+	/* validate we can successfully load new BTF with token */
+	btf_opts.token_fd = token_fd;
+	btf_fd = bpf_btf_load(raw_btf_data, raw_btf_size, &btf_opts);
+	if (!ASSERT_GT(btf_fd, 0, "btf_fd"))
+		goto cleanup;
+	close(btf_fd);
+
+	/* now validate that we *cannot* load BTF without token */
+	btf_opts.token_fd = 0;
+	btf_fd = bpf_btf_load(raw_btf_data, raw_btf_size, &btf_opts);
+	if (!ASSERT_EQ(btf_fd, -EPERM, "btf_fd_eperm"))
+		goto cleanup;
+
+cleanup:
+	btf__free(btf);
+	if (btf_fd > 0)
+		close(btf_fd);
+	if (token_fd)
+		close(token_fd);
+	if (old_caps)
+		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
+}
+
 void test_token(void)
 {
 	if (test__start_subtest("token_create"))
 		subtest_token_create();
 	if (test__start_subtest("map_token"))
 		subtest_map_token();
+	if (test__start_subtest("btf_token"))
+		subtest_btf_token();
 }
-- 
2.34.1



* [PATCH v2 bpf-next 13/18] bpf: keep BPF_PROG_LOAD permission checks clear of validations
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (11 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 12/18] selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 14/18] bpf: add BPF token support to BPF_PROG_LOAD command Andrii Nakryiko
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Move flags validation and license checks out of the permission checks.
They were intermingled, which makes subsequent changes harder. Clean
this up: perform straightforward flag validation upfront, and fetch and
check the license later, right where we use it. Also consolidate
capability checks in one block, right after basic attribute sanity
checks.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/syscall.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2d9f971ec227..8e5c42af978c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2582,7 +2582,6 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	struct btf *attach_btf = NULL;
 	int err;
 	char license[128];
-	bool is_gpl;
 
 	if (CHECK_ATTR(BPF_PROG_LOAD))
 		return -EINVAL;
@@ -2601,16 +2600,6 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	    !bpf_capable())
 		return -EPERM;
 
-	/* copy eBPF program license from user space */
-	if (strncpy_from_bpfptr(license,
-				make_bpfptr(attr->license, uattr.is_kernel),
-				sizeof(license) - 1) < 0)
-		return -EFAULT;
-	license[sizeof(license) - 1] = 0;
-
-	/* eBPF programs must be GPL compatible to use GPL-ed functions */
-	is_gpl = license_is_gpl_compatible(license);
-
 	/* Intent here is for unprivileged_bpf_disabled to block BPF program
 	 * creation for unprivileged users; other actions depend
 	 * on fd availability and access to bpffs, so are dependent on
@@ -2703,12 +2692,20 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 			     make_bpfptr(attr->insns, uattr.is_kernel),
 			     bpf_prog_insn_size(prog)) != 0)
 		goto free_prog_sec;
+	/* copy eBPF program license from user space */
+	if (strncpy_from_bpfptr(license,
+				make_bpfptr(attr->license, uattr.is_kernel),
+				sizeof(license) - 1) < 0)
+		goto free_prog_sec;
+	license[sizeof(license) - 1] = 0;
+
+	/* eBPF programs must be GPL compatible to use GPL-ed functions */
+	prog->gpl_compatible = license_is_gpl_compatible(license) ? 1 : 0;
 
 	prog->orig_prog = NULL;
 	prog->jited = 0;
 
 	atomic64_set(&prog->aux->refcnt, 1);
-	prog->gpl_compatible = is_gpl ? 1 : 0;
 
 	if (bpf_prog_is_dev_bound(prog->aux)) {
 		err = bpf_prog_dev_bound_init(prog, attr);
-- 
2.34.1



* [PATCH v2 bpf-next 14/18] bpf: add BPF token support to BPF_PROG_LOAD command
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (12 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 13/18] bpf: keep BPF_PROG_LOAD permission checks clear of validations Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 15/18] bpf: take into account BPF token when fetching helper protos Andrii Nakryiko
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Add basic BPF token support to BPF_PROG_LOAD. Extend BPF token to allow
specifying BPF_PROG_LOAD as an allowed command, and also allow
specifying bit sets of program types and attach types; only
combinations drawn from both sets can be loaded with the requested BPF
token.
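
The two-mask check can be sketched in user-space C as follows. This is
an illustrative model with hypothetical names (sketch_prog_token,
sketch_allow_prog_type(), SKETCH_MAX_*); the limits are placeholder
values. It highlights one consequence of using two independent bit
sets: any allowed program type may be paired with any allowed attach
type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder upper bounds for the sketch (assumed values). */
#define SKETCH_MAX_PROG_TYPE   33u
#define SKETCH_MAX_ATTACH_TYPE 48u

struct sketch_prog_token {
	uint64_t allowed_prog_types;
	uint64_t allowed_attach_types;
};

/* Both the program type bit and the attach type bit must be set. */
static bool sketch_allow_prog_type(const struct sketch_prog_token *t,
				   unsigned int prog_type,
				   unsigned int attach_type)
{
	if (!t || prog_type >= SKETCH_MAX_PROG_TYPE ||
	    attach_type >= SKETCH_MAX_ATTACH_TYPE)
		return false;

	return (t->allowed_prog_types & (1ULL << prog_type)) != 0 &&
	       (t->allowed_attach_types & (1ULL << attach_type)) != 0;
}
```

Because the masks are checked separately, a token cannot express "type
A only with attach type x, type B only with attach type y"; it allows
the full cross product of the two sets.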

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h                           |   6 +
 include/uapi/linux/bpf.h                      |   8 ++
 kernel/bpf/core.c                             |   1 +
 kernel/bpf/syscall.c                          | 105 +++++++++++++-----
 kernel/bpf/token.c                            |  11 ++
 tools/include/uapi/linux/bpf.h                |   8 ++
 .../selftests/bpf/prog_tests/libbpf_probes.c  |   2 +
 .../selftests/bpf/prog_tests/libbpf_str.c     |   3 +
 8 files changed, 119 insertions(+), 25 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e0c7eb5b0bd7..d6e0904c9198 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1411,6 +1411,7 @@ struct bpf_prog_aux {
 #ifdef CONFIG_SECURITY
 	void *security;
 #endif
+	struct bpf_token *token;
 	struct bpf_prog_offload *offload;
 	struct btf *btf;
 	struct bpf_func_info *func_info;
@@ -1540,6 +1541,8 @@ struct bpf_token {
 	atomic64_t refcnt;
 	u64 allowed_cmds;
 	u64 allowed_map_types;
+	u64 allowed_prog_types;
+	u64 allowed_attach_types;
 };
 
 struct bpf_struct_ops_value;
@@ -2099,6 +2102,9 @@ struct bpf_token *bpf_token_get_from_fd(u32 ufd);
 
 bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
 bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type type);
+bool bpf_token_allow_prog_type(const struct bpf_token *token,
+			       enum bpf_prog_type prog_type,
+			       enum bpf_attach_type attach_type);
 
 int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
 int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9043a1f8c419..58edd5f106c7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -999,6 +999,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
 	BPF_PROG_TYPE_NETFILTER,
+	__MAX_BPF_PROG_TYPE
 };
 
 enum bpf_attach_type {
@@ -1430,6 +1431,7 @@ union bpf_attr {
 		 * truncated), or smaller (if log buffer wasn't filled completely).
 		 */
 		__u32		log_true_size;
+		__u32		prog_token_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_OBJ_* commands */
@@ -1648,6 +1650,12 @@ union bpf_attr {
 		 * are allowed to be created by requested BPF token;
 		 */
 		__u64		allowed_map_types;
+		/* similarly to allowed_map_types, bit sets of BPF program
+		 * types and BPF program attach types that are allowed to be
+		 * loaded by requested BPF token
+		 */
+		__u64		allowed_prog_types;
+		__u64		allowed_attach_types;
 	} token_create;
 
 } __attribute__((aligned(8)));
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7421487422d4..cd0a93968009 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2597,6 +2597,7 @@ void bpf_prog_free(struct bpf_prog *fp)
 
 	if (aux->dst_prog)
 		bpf_prog_put(aux->dst_prog);
+	bpf_token_put(aux->token);
 	INIT_WORK(&aux->work, bpf_prog_free_deferred);
 	schedule_work(&aux->work);
 }
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8e5c42af978c..c6d2fdb1af2f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2573,13 +2573,15 @@ static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
 }
 
 /* last field in 'union bpf_attr' used by this command */
-#define	BPF_PROG_LOAD_LAST_FIELD log_true_size
+#define BPF_PROG_LOAD_LAST_FIELD prog_token_fd
 
 static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 {
 	enum bpf_prog_type type = attr->prog_type;
 	struct bpf_prog *prog, *dst_prog = NULL;
 	struct btf *attach_btf = NULL;
+	struct bpf_token *token = NULL;
+	bool bpf_cap;
 	int err;
 	char license[128];
 
@@ -2595,10 +2597,31 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 				 BPF_F_XDP_DEV_BOUND_ONLY))
 		return -EINVAL;
 
+	bpf_prog_load_fixup_attach_type(attr);
+
+	if (attr->prog_token_fd) {
+		token = bpf_token_get_from_fd(attr->prog_token_fd);
+		if (IS_ERR(token))
+			return PTR_ERR(token);
+		/* if current token doesn't grant prog loading permissions,
+		 * then we can't use this token, so ignore it and rely on
+		 * system-wide capabilities checks
+		 */
+		if (!bpf_token_allow_cmd(token, BPF_PROG_LOAD) ||
+		    !bpf_token_allow_prog_type(token, attr->prog_type,
+					       attr->expected_attach_type)) {
+			bpf_token_put(token);
+			token = NULL;
+		}
+	}
+
+	bpf_cap = bpf_token_capable(token, CAP_BPF);
+	err = -EPERM;
+
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
 	    (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
-	    !bpf_capable())
-		return -EPERM;
+	    !bpf_cap)
+		goto put_token;
 
 	/* Intent here is for unprivileged_bpf_disabled to block BPF program
 	 * creation for unprivileged users; other actions depend
@@ -2607,21 +2630,23 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	 * capability checks are still carried out for these
 	 * and other operations.
 	 */
-	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
-		return -EPERM;
+	if (sysctl_unprivileged_bpf_disabled && !bpf_cap)
+		goto put_token;
 
 	if (attr->insn_cnt == 0 ||
-	    attr->insn_cnt > (bpf_capable() ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
-		return -E2BIG;
+	    attr->insn_cnt > (bpf_cap ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS)) {
+		err = -E2BIG;
+		goto put_token;
+	}
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
-	    !bpf_capable())
-		return -EPERM;
+	    !bpf_cap)
+		goto put_token;
 
-	if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && !capable(CAP_SYS_ADMIN))
-		return -EPERM;
-	if (is_perfmon_prog_type(type) && !perfmon_capable())
-		return -EPERM;
+	if (is_net_admin_prog_type(type) && !bpf_token_capable(token, CAP_NET_ADMIN))
+		goto put_token;
+	if (is_perfmon_prog_type(type) && !bpf_token_capable(token, CAP_PERFMON))
+		goto put_token;
 
 	/* attach_prog_fd/attach_btf_obj_fd can specify fd of either bpf_prog
 	 * or btf, we need to check which one it is
@@ -2631,27 +2656,33 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 		if (IS_ERR(dst_prog)) {
 			dst_prog = NULL;
 			attach_btf = btf_get_by_fd(attr->attach_btf_obj_fd);
-			if (IS_ERR(attach_btf))
-				return -EINVAL;
+			if (IS_ERR(attach_btf)) {
+				err = -EINVAL;
+				goto put_token;
+			}
 			if (!btf_is_kernel(attach_btf)) {
 				/* attaching through specifying bpf_prog's BTF
 				 * objects directly might be supported eventually
 				 */
 				btf_put(attach_btf);
-				return -ENOTSUPP;
+				err = -ENOTSUPP;
+				goto put_token;
 			}
 		}
 	} else if (attr->attach_btf_id) {
 		/* fall back to vmlinux BTF, if BTF type ID is specified */
 		attach_btf = bpf_get_btf_vmlinux();
-		if (IS_ERR(attach_btf))
-			return PTR_ERR(attach_btf);
-		if (!attach_btf)
-			return -EINVAL;
+		if (IS_ERR(attach_btf)) {
+			err = PTR_ERR(attach_btf);
+			goto put_token;
+		}
+		if (!attach_btf) {
+			err = -EINVAL;
+			goto put_token;
+		}
 		btf_get(attach_btf);
 	}
 
-	bpf_prog_load_fixup_attach_type(attr);
 	if (bpf_prog_load_check_attach(type, attr->expected_attach_type,
 				       attach_btf, attr->attach_btf_id,
 				       dst_prog)) {
@@ -2659,7 +2690,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 			bpf_prog_put(dst_prog);
 		if (attach_btf)
 			btf_put(attach_btf);
-		return -EINVAL;
+		err = -EINVAL;
+		goto put_token;
 	}
 
 	/* plain bpf_prog allocation */
@@ -2669,7 +2701,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 			bpf_prog_put(dst_prog);
 		if (attach_btf)
 			btf_put(attach_btf);
-		return -ENOMEM;
+		err = -ENOMEM;
+		goto put_token;
 	}
 
 	prog->expected_attach_type = attr->expected_attach_type;
@@ -2680,6 +2713,10 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
 	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
 
+	/* move token into prog->aux, reuse taken refcnt */
+	prog->aux->token = token;
+	token = NULL;
+
 	err = security_bpf_prog_alloc(prog->aux);
 	if (err)
 		goto free_prog;
@@ -2781,6 +2818,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	if (prog->aux->attach_btf)
 		btf_put(prog->aux->attach_btf);
 	bpf_prog_free(prog);
+put_token:
+	bpf_token_put(token);
 	return err;
 }
 
@@ -3537,7 +3576,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_SK_LOOKUP:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
 	case BPF_PROG_TYPE_CGROUP_SKB:
-		if (!capable(CAP_NET_ADMIN))
+		if (!bpf_token_capable(prog->aux->token, CAP_NET_ADMIN))
 			/* cg-skb progs can be loaded by unpriv user.
 			 * check permissions at attach time.
 			 */
@@ -5135,8 +5174,10 @@ static bool is_bit_subset_of(u32 subset, u32 superset)
 	(1ULL << BPF_TOKEN_CREATE)		\
 	| (1ULL << BPF_MAP_CREATE)		\
 	| (1ULL << BPF_BTF_LOAD)		\
+	| (1ULL << BPF_PROG_LOAD)		\
 )
-#define BPF_TOKEN_CREATE_LAST_FIELD token_create.allowed_map_types
+
+#define BPF_TOKEN_CREATE_LAST_FIELD token_create.allowed_attach_types
 
 static int token_create(union bpf_attr *attr)
 {
@@ -5178,6 +5219,18 @@ static int token_create(union bpf_attr *attr)
 		err = -EPERM;
 		goto err_out;
 	}
+	/* requested prog types should be a subset of associated token's set */
+	if (token && !is_bit_subset_of(attr->token_create.allowed_prog_types,
+				       token->allowed_prog_types)) {
+		err = -EPERM;
+		goto err_out;
+	}
+	/* requested attach types should be a subset of associated token's set */
+	if (token && !is_bit_subset_of(attr->token_create.allowed_attach_types,
+				       token->allowed_attach_types)) {
+		err = -EPERM;
+		goto err_out;
+	}
 
 	new_token = bpf_token_alloc();
 	if (!new_token) {
@@ -5187,6 +5240,8 @@ static int token_create(union bpf_attr *attr)
 
 	new_token->allowed_cmds = attr->token_create.allowed_cmds;
 	new_token->allowed_map_types = attr->token_create.allowed_map_types;
+	new_token->allowed_prog_types = attr->token_create.allowed_prog_types;
+	new_token->allowed_attach_types = attr->token_create.allowed_attach_types;
 
 	fd = bpf_token_new_fd(new_token);
 	if (fd < 0) {
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
index 0abb1fa4f181..ff75dec8baf5 100644
--- a/kernel/bpf/token.c
+++ b/kernel/bpf/token.c
@@ -123,3 +123,14 @@ bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type t
 
 	return token->allowed_map_types & (1ULL << type);
 }
+
+bool bpf_token_allow_prog_type(const struct bpf_token *token,
+			       enum bpf_prog_type prog_type,
+			       enum bpf_attach_type attach_type)
+{
+	if (!token || prog_type >= __MAX_BPF_PROG_TYPE || attach_type >= __MAX_BPF_ATTACH_TYPE)
+		return false;
+
+	return (token->allowed_prog_types & (1ULL << prog_type)) &&
+	       (token->allowed_attach_types & (1ULL << attach_type));
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 366abd8b55b6..f23d084b196f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -999,6 +999,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
 	BPF_PROG_TYPE_NETFILTER,
+	__MAX_BPF_PROG_TYPE
 };
 
 enum bpf_attach_type {
@@ -1430,6 +1431,7 @@ union bpf_attr {
 		 * truncated), or smaller (if log buffer wasn't filled completely).
 		 */
 		__u32		log_true_size;
+		__u32		prog_token_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_OBJ_* commands */
@@ -1650,6 +1652,12 @@ union bpf_attr {
 		 * are allowed to be created by requested BPF token;
 		 */
 		__u64		allowed_map_types;
+		/* similarly to allowed_map_types, bit sets of BPF program
+		 * types and BPF program attach types that are allowed to be
+		 * loaded by requested BPF token
+		 */
+		__u64		allowed_prog_types;
+		__u64		allowed_attach_types;
 	} token_create;
 
 } __attribute__((aligned(8)));
diff --git a/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c b/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
index 573249a2814d..4ed46ed58a7b 100644
--- a/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
+++ b/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
@@ -30,6 +30,8 @@ void test_libbpf_probe_prog_types(void)
 
 		if (prog_type == BPF_PROG_TYPE_UNSPEC)
 			continue;
+		if (strcmp(prog_type_name, "__MAX_BPF_PROG_TYPE") == 0)
+			continue;
 
 		if (!test__start_subtest(prog_type_name))
 			continue;
diff --git a/tools/testing/selftests/bpf/prog_tests/libbpf_str.c b/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
index e677c0435cec..ea2a8c4063a8 100644
--- a/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
+++ b/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
@@ -185,6 +185,9 @@ static void test_libbpf_bpf_prog_type_str(void)
 		const char *prog_type_str;
 		char buf[256];
 
+		if (prog_type == __MAX_BPF_PROG_TYPE)
+			continue;
+
 		prog_type_name = btf__str_by_offset(btf, e->name_off);
 		prog_type_str = libbpf_bpf_prog_type_str(prog_type);
 		ASSERT_OK_PTR(prog_type_str, prog_type_name);
-- 
2.34.1



* [PATCH v2 bpf-next 15/18] bpf: take into account BPF token when fetching helper protos
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (13 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 14/18] bpf: add BPF token support to BPF_PROG_LOAD command Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 16/18] bpf: consistenly use BPF token throughout BPF verifier logic Andrii Nakryiko
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside the bpf_base_func_proto() function (and
other similar ones) to determine eligibility of a given BPF helper for a
given program, use the BPF token recorded in prog->aux during
BPF_PROG_LOAD command handling to inform the decision.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
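For illustration only, here is a minimal userspace sketch of the decision
this patch changes. It assumes simplified semantics: a token, when present,
stands in for the requested capability, and otherwise the check falls back
to the caller's own capabilities. All names below (token_model, task_model,
cap_model, token_capable_model) are hypothetical stand-ins, not kernel APIs;
the real bpf_token_capable() may do more than this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a BPF token; the real object carries more state. */
struct token_model { int unused; };

/* Hypothetical capability ids, standing in for CAP_BPF and CAP_PERFMON. */
enum cap_model { CAP_MODEL_BPF, CAP_MODEL_PERFMON, CAP_MODEL_COUNT };

/* Models the caller's own capabilities (what a capable() check would consult). */
struct task_model { bool caps[CAP_MODEL_COUNT]; };

/*
 * Simplified model (assumption): any non-NULL token grants the capability,
 * otherwise fall back to the task's own capability set.
 */
static bool token_capable_model(const struct token_model *token,
				const struct task_model *task,
				enum cap_model cap)
{
	return token != NULL || task->caps[cap];
}
```

In this model, a helper lookup that used to ask only "does the current task
have the capability?" now asks the same question of the program's recorded
token first, which is why every *_func_proto() callback has to pass prog
down to bpf_base_func_proto().
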
 drivers/media/rc/bpf-lirc.c |  2 +-
 include/linux/bpf.h         |  5 +++--
 kernel/bpf/cgroup.c         |  6 +++---
 kernel/bpf/helpers.c        |  6 +++---
 kernel/bpf/syscall.c        |  5 +++--
 kernel/trace/bpf_trace.c    |  2 +-
 net/core/filter.c           | 32 ++++++++++++++++----------------
 net/ipv4/bpf_tcp_ca.c       |  2 +-
 net/netfilter/nf_bpf_link.c |  2 +-
 9 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/drivers/media/rc/bpf-lirc.c b/drivers/media/rc/bpf-lirc.c
index fe17c7f98e81..6d07693c6b9f 100644
--- a/drivers/media/rc/bpf-lirc.c
+++ b/drivers/media/rc/bpf-lirc.c
@@ -110,7 +110,7 @@ lirc_mode2_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_get_prandom_u32:
 		return &bpf_get_prandom_u32_proto;
 	case BPF_FUNC_trace_printk:
-		if (perfmon_capable())
+		if (bpf_token_capable(prog->aux->token, CAP_PERFMON))
 			return bpf_get_trace_printk_proto();
 		fallthrough;
 	default:
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d6e0904c9198..9a9212c5e3ff 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2349,7 +2349,8 @@ int btf_check_type_match(struct bpf_verifier_log *log, const struct bpf_prog *pr
 struct bpf_prog *bpf_prog_by_id(u32 id);
 struct bpf_link *bpf_link_by_id(u32 id);
 
-const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id);
+const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id,
+						 const struct bpf_prog *prog);
 void bpf_task_storage_free(struct task_struct *task);
 void bpf_cgrp_storage_free(struct cgroup *cgroup);
 bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog);
@@ -2606,7 +2607,7 @@ static inline int btf_struct_access(struct bpf_verifier_log *log,
 }
 
 static inline const struct bpf_func_proto *
-bpf_base_func_proto(enum bpf_func_id func_id)
+bpf_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	return NULL;
 }
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 5b2741aa0d9b..39d6cfb6f304 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -1615,7 +1615,7 @@ cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_perf_event_output:
 		return &bpf_event_output_data_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
@@ -2173,7 +2173,7 @@ sysctl_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_perf_event_output:
 		return &bpf_event_output_data_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
@@ -2330,7 +2330,7 @@ cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_perf_event_output:
 		return &bpf_event_output_data_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 9e80efa59a5d..6a740af48908 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1663,7 +1663,7 @@ const struct bpf_func_proto bpf_probe_read_kernel_str_proto __weak;
 const struct bpf_func_proto bpf_task_pt_regs_proto __weak;
 
 const struct bpf_func_proto *
-bpf_base_func_proto(enum bpf_func_id func_id)
+bpf_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_map_lookup_elem:
@@ -1714,7 +1714,7 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		break;
 	}
 
-	if (!bpf_capable())
+	if (!bpf_token_capable(prog->aux->token, CAP_BPF))
 		return NULL;
 
 	switch (func_id) {
@@ -1772,7 +1772,7 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		break;
 	}
 
-	if (!perfmon_capable())
+	if (!bpf_token_capable(prog->aux->token, CAP_PERFMON))
 		return NULL;
 
 	switch (func_id) {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c6d2fdb1af2f..889ca5d3afe7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5500,7 +5500,7 @@ static const struct bpf_func_proto bpf_sys_bpf_proto = {
 const struct bpf_func_proto * __weak
 tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
-	return bpf_base_func_proto(func_id);
+	return bpf_base_func_proto(func_id, prog);
 }
 
 BPF_CALL_1(bpf_sys_close, u32, fd)
@@ -5550,7 +5550,8 @@ syscall_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_sys_bpf:
-		return !perfmon_capable() ? NULL : &bpf_sys_bpf_proto;
+		return !bpf_token_capable(prog->aux->token, CAP_PERFMON)
+		       ? NULL : &bpf_sys_bpf_proto;
 	case BPF_FUNC_btf_find_by_name_kind:
 		return &bpf_btf_find_by_name_kind_proto;
 	case BPF_FUNC_sys_close:
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 2bc41e6ac9fe..f5382d8bb690 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1511,7 +1511,7 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_trace_vprintk:
 		return bpf_get_trace_vprintk_proto();
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 428df050d021..59e5f41f2d5b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -83,7 +83,7 @@
 #include <net/netfilter/nf_conntrack_bpf.h>
 
 static const struct bpf_func_proto *
-bpf_sk_base_func_proto(enum bpf_func_id func_id);
+bpf_sk_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
 
 int copy_bpf_fprog_from_user(struct sock_fprog *dst, sockptr_t src, int len)
 {
@@ -7739,7 +7739,7 @@ sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
@@ -7822,7 +7822,7 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 			return NULL;
 		}
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -7841,7 +7841,7 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_perf_event_output:
 		return &bpf_skb_event_output_proto;
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -8028,7 +8028,7 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 #endif
 #endif
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -8087,7 +8087,7 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 #endif
 #endif
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 
 #if IS_MODULE(CONFIG_NF_CONNTRACK) && IS_ENABLED(CONFIG_DEBUG_INFO_BTF_MODULES)
@@ -8148,7 +8148,7 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_sock_proto;
 #endif /* CONFIG_INET */
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -8190,7 +8190,7 @@ sk_msg_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_cgroup_classid_curr_proto;
 #endif
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -8234,7 +8234,7 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_skc_lookup_tcp_proto;
 #endif
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -8245,7 +8245,7 @@ flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_skb_load_bytes:
 		return &bpf_flow_dissector_load_bytes_proto;
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -8272,7 +8272,7 @@ lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_skb_under_cgroup:
 		return &bpf_skb_under_cgroup_proto;
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -11103,7 +11103,7 @@ sk_reuseport_func_proto(enum bpf_func_id func_id,
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
@@ -11285,7 +11285,7 @@ sk_lookup_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_sk_release:
 		return &bpf_sk_release_proto;
 	default:
-		return bpf_sk_base_func_proto(func_id);
+		return bpf_sk_base_func_proto(func_id, prog);
 	}
 }
 
@@ -11619,7 +11619,7 @@ const struct bpf_func_proto bpf_sock_from_file_proto = {
 };
 
 static const struct bpf_func_proto *
-bpf_sk_base_func_proto(enum bpf_func_id func_id)
+bpf_sk_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	const struct bpf_func_proto *func;
 
@@ -11648,10 +11648,10 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 
-	if (!perfmon_capable())
+	if (!bpf_token_capable(prog->aux->token, CAP_PERFMON))
 		return NULL;
 
 	return func;
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 4406d796cc2f..0a3a60e7c282 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -193,7 +193,7 @@ bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id,
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
 	default:
-		return bpf_base_func_proto(func_id);
+		return bpf_base_func_proto(func_id, prog);
 	}
 }
 
diff --git a/net/netfilter/nf_bpf_link.c b/net/netfilter/nf_bpf_link.c
index c36da56d756f..d7786ea9c01a 100644
--- a/net/netfilter/nf_bpf_link.c
+++ b/net/netfilter/nf_bpf_link.c
@@ -219,7 +219,7 @@ static bool nf_is_valid_access(int off, int size, enum bpf_access_type type,
 static const struct bpf_func_proto *
 bpf_nf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
-	return bpf_base_func_proto(func_id);
+	return bpf_base_func_proto(func_id, prog);
 }
 
 const struct bpf_verifier_ops netfilter_verifier_ops = {
-- 
2.34.1



* [PATCH v2 bpf-next 16/18] bpf: consistenly use BPF token throughout BPF verifier logic
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (14 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 15/18] bpf: take into account BPF token when fetching helper protos Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 17/18] libbpf: add BPF token support to bpf_prog_load() API Andrii Nakryiko
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Remove remaining direct queries to perfmon_capable() and bpf_capable()
in BPF verifier logic and instead use BPF token (if available) to make
decisions about privileges.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h    | 18 ++++++++++--------
 include/linux/filter.h |  2 +-
 kernel/bpf/arraymap.c  |  2 +-
 kernel/bpf/core.c      |  2 +-
 kernel/bpf/verifier.c  | 13 ++++++-------
 net/core/filter.c      |  4 ++--
 6 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9a9212c5e3ff..452b935d21ed 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2059,24 +2059,26 @@ bpf_map_alloc_percpu(const struct bpf_map *map, size_t size, size_t align,
 
 extern int sysctl_unprivileged_bpf_disabled;
 
-static inline bool bpf_allow_ptr_leaks(void)
+bool bpf_token_capable(const struct bpf_token *token, int cap);
+
+static inline bool bpf_allow_ptr_leaks(const struct bpf_token *token)
 {
-	return perfmon_capable();
+	return bpf_token_capable(token, CAP_PERFMON);
 }
 
-static inline bool bpf_allow_uninit_stack(void)
+static inline bool bpf_allow_uninit_stack(const struct bpf_token *token)
 {
-	return perfmon_capable();
+	return bpf_token_capable(token, CAP_PERFMON);
 }
 
-static inline bool bpf_bypass_spec_v1(void)
+static inline bool bpf_bypass_spec_v1(const struct bpf_token *token)
 {
-	return perfmon_capable();
+	return bpf_token_capable(token, CAP_PERFMON);
 }
 
-static inline bool bpf_bypass_spec_v4(void)
+static inline bool bpf_bypass_spec_v4(const struct bpf_token *token)
 {
-	return perfmon_capable();
+	return bpf_token_capable(token, CAP_PERFMON);
 }
 
 int bpf_map_new_fd(struct bpf_map *map, int flags);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index f69114083ec7..2391a9025ffd 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1109,7 +1109,7 @@ static inline bool bpf_jit_blinding_enabled(struct bpf_prog *prog)
 		return false;
 	if (!bpf_jit_harden)
 		return false;
-	if (bpf_jit_harden == 1 && bpf_capable())
+	if (bpf_jit_harden == 1 && bpf_token_capable(prog->aux->token, CAP_BPF))
 		return false;
 
 	return true;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 2058e89b5ddd..f0c64df6b6ff 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -82,7 +82,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
 	int numa_node = bpf_map_attr_numa_node(attr);
 	u32 elem_size, index_mask, max_entries;
-	bool bypass_spec_v1 = bpf_bypass_spec_v1();
+	bool bypass_spec_v1 = bpf_bypass_spec_v1(NULL);
 	u64 array_size, mask64;
 	struct bpf_array *array;
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index cd0a93968009..c48303e097ec 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -661,7 +661,7 @@ static bool bpf_prog_kallsyms_candidate(const struct bpf_prog *fp)
 void bpf_prog_kallsyms_add(struct bpf_prog *fp)
 {
 	if (!bpf_prog_kallsyms_candidate(fp) ||
-	    !bpf_capable())
+	    !bpf_token_capable(fp->aux->token, CAP_BPF))
 		return;
 
 	bpf_prog_ksym_set_addr(fp);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1e38584d497c..20fdc1255a0b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -19237,7 +19237,12 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
 	env->prog = *prog;
 	env->ops = bpf_verifier_ops[env->prog->type];
 	env->fd_array = make_bpfptr(attr->fd_array, uattr.is_kernel);
-	is_priv = bpf_capable();
+
+	env->allow_ptr_leaks = bpf_allow_ptr_leaks(env->prog->aux->token);
+	env->allow_uninit_stack = bpf_allow_uninit_stack(env->prog->aux->token);
+	env->bypass_spec_v1 = bpf_bypass_spec_v1(env->prog->aux->token);
+	env->bypass_spec_v4 = bpf_bypass_spec_v4(env->prog->aux->token);
+	env->bpf_capable = is_priv = bpf_token_capable(env->prog->aux->token, CAP_BPF);
 
 	bpf_get_btf_vmlinux();
 
@@ -19269,12 +19274,6 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
 	if (attr->prog_flags & BPF_F_ANY_ALIGNMENT)
 		env->strict_alignment = false;
 
-	env->allow_ptr_leaks = bpf_allow_ptr_leaks();
-	env->allow_uninit_stack = bpf_allow_uninit_stack();
-	env->bypass_spec_v1 = bpf_bypass_spec_v1();
-	env->bypass_spec_v4 = bpf_bypass_spec_v4();
-	env->bpf_capable = bpf_capable();
-
 	if (is_priv)
 		env->test_state_freq = attr->prog_flags & BPF_F_TEST_STATE_FREQ;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 59e5f41f2d5b..0f2e5a15f1fd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8447,7 +8447,7 @@ static bool cg_skb_is_valid_access(int off, int size,
 		return false;
 	case bpf_ctx_range(struct __sk_buff, data):
 	case bpf_ctx_range(struct __sk_buff, data_end):
-		if (!bpf_capable())
+		if (!bpf_token_capable(prog->aux->token, CAP_BPF))
 			return false;
 		break;
 	}
@@ -8459,7 +8459,7 @@ static bool cg_skb_is_valid_access(int off, int size,
 		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
 			break;
 		case bpf_ctx_range(struct __sk_buff, tstamp):
-			if (!bpf_capable())
+			if (!bpf_token_capable(prog->aux->token, CAP_BPF))
 				return false;
 			break;
 		default:
-- 
2.34.1



* [PATCH v2 bpf-next 17/18] libbpf: add BPF token support to bpf_prog_load() API
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (15 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 16/18] bpf: consistenly use BPF token throughout BPF verifier logic Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-07 23:53 ` [PATCH v2 bpf-next 18/18] selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests Andrii Nakryiko
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Wire through token_fd into bpf_prog_load(). Also make sure to pass
allowed_{prog,attach}_types to kernel in bpf_token_create().

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/lib/bpf/bpf.c | 5 ++++-
 tools/lib/bpf/bpf.h | 7 +++++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 193993dbbdc4..cd8f0c525de6 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -234,7 +234,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 		  const struct bpf_insn *insns, size_t insn_cnt,
 		  struct bpf_prog_load_opts *opts)
 {
-	const size_t attr_sz = offsetofend(union bpf_attr, log_true_size);
+	const size_t attr_sz = offsetofend(union bpf_attr, prog_token_fd);
 	void *finfo = NULL, *linfo = NULL;
 	const char *func_info, *line_info;
 	__u32 log_size, log_level, attach_prog_fd, attach_btf_obj_fd;
@@ -263,6 +263,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 	attr.prog_flags = OPTS_GET(opts, prog_flags, 0);
 	attr.prog_ifindex = OPTS_GET(opts, prog_ifindex, 0);
 	attr.kern_version = OPTS_GET(opts, kern_version, 0);
+	attr.prog_token_fd = OPTS_GET(opts, token_fd, 0);
 
 	if (prog_name && kernel_supports(NULL, FEAT_PROG_NAME))
 		libbpf_strlcpy(attr.prog_name, prog_name, sizeof(attr.prog_name));
@@ -1220,6 +1221,8 @@ int bpf_token_create(struct bpf_token_create_opts *opts)
 	attr.token_create.token_fd = OPTS_GET(opts, token_fd, 0);
 	attr.token_create.allowed_cmds = OPTS_GET(opts, allowed_cmds, 0);
 	attr.token_create.allowed_map_types = OPTS_GET(opts, allowed_map_types, 0);
+	attr.token_create.allowed_prog_types = OPTS_GET(opts, allowed_prog_types, 0);
+	attr.token_create.allowed_attach_types = OPTS_GET(opts, allowed_attach_types, 0);
 
 	ret = sys_bpf_fd(BPF_TOKEN_CREATE, &attr, attr_sz);
 	return libbpf_err_errno(ret);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 3153a9e697e2..f9afc7846762 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -104,9 +104,10 @@ struct bpf_prog_load_opts {
 	 * If kernel doesn't support this feature, log_size is left unchanged.
 	 */
 	__u32 log_true_size;
+	__u32 token_fd;
 	size_t :0;
 };
-#define bpf_prog_load_opts__last_field log_true_size
+#define bpf_prog_load_opts__last_field token_fd
 
 LIBBPF_API int bpf_prog_load(enum bpf_prog_type prog_type,
 			     const char *prog_name, const char *license,
@@ -560,9 +561,11 @@ struct bpf_token_create_opts {
 	__u32 token_fd;
 	__u64 allowed_cmds;
 	__u64 allowed_map_types;
+	__u64 allowed_prog_types;
+	__u64 allowed_attach_types;
 	size_t :0;
 };
-#define bpf_token_create_opts__last_field allowed_map_types
+#define bpf_token_create_opts__last_field allowed_attach_types
 
 LIBBPF_API int bpf_token_create(struct bpf_token_create_opts *opts);
 
-- 
2.34.1



* [PATCH v2 bpf-next 18/18] selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (16 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 17/18] libbpf: add BPF token support to bpf_prog_load() API Andrii Nakryiko
@ 2023-06-07 23:53 ` Andrii Nakryiko
  2023-06-08 18:49 ` [PATCH v2 bpf-next 00/18] BPF token Stanislav Fomichev
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-07 23:53 UTC (permalink / raw)
  To: bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Add a test validating that a BPF token can be used to load a privileged BPF
program that uses privileged BPF helpers, via a delegated BPF token created
by a privileged process.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
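For illustration only, the allowed_* fields used by the test below are
bitmasks with one bit per enum value (e.g. 1ULL << BPF_PROG_TYPE_XDP). The
sketch shows how such a mask can be built and queried. The helper names
(allow_bit, mask_allows) are hypothetical, and the membership check is an
assumption about how a consumer of the mask would test it, not the kernel's
actual code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Build a single-bit mask for an enum value, mirroring the
 * "1ULL << VALUE" pattern. Values >= 64 do not fit in a 64-bit
 * mask, so they yield an empty mask instead of shifting out of range.
 */
static uint64_t allow_bit(unsigned int id)
{
	return id < 64 ? 1ULL << id : 0;
}

/* Return true if the bit for the given enum value is set in the mask. */
static bool mask_allows(uint64_t mask, unsigned int id)
{
	return id < 64 && (mask & (1ULL << id)) != 0;
}
```

Several values can be combined with bitwise OR, e.g.
allow_bit(a) | allow_bit(b), to delegate more than one command or type
through the same token.
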
 .../testing/selftests/bpf/prog_tests/token.c  | 62 +++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/token.c b/tools/testing/selftests/bpf/prog_tests/token.c
index ff8ada405576..eea39a91bbaa 100644
--- a/tools/testing/selftests/bpf/prog_tests/token.c
+++ b/tools/testing/selftests/bpf/prog_tests/token.c
@@ -4,6 +4,7 @@
 #include <test_progs.h>
 #include <bpf/btf.h>
 #include "cap_helpers.h"
+#include <linux/filter.h>
 
 static int drop_priv_caps(__u64 *old_caps)
 {
@@ -187,6 +188,65 @@ static void subtest_btf_token(void)
 		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
 }
 
+static void subtest_prog_token(void)
+{
+	LIBBPF_OPTS(bpf_token_create_opts, token_opts);
+	LIBBPF_OPTS(bpf_prog_load_opts, prog_opts);
+	int token_fd = 0, prog_fd = 0;
+	__u64 old_caps = 0;
+	struct bpf_insn insns[] = {
+		/* bpf_jiffies64() requires CAP_BPF */
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_jiffies64),
+		/* bpf_get_current_task() requires CAP_PERFMON */
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_current_task),
+		/* r0 = 0; exit; */
+		BPF_MOV64_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	};
+	size_t insn_cnt = ARRAY_SIZE(insns);
+
+	/* create BPF token allowing BPF_PROG_LOAD command */
+	token_opts.flags = 0;
+	token_opts.allowed_cmds = 1ULL << BPF_PROG_LOAD;
+	token_opts.allowed_prog_types = 1ULL << BPF_PROG_TYPE_XDP;
+	token_opts.allowed_attach_types = 1ULL << BPF_XDP;
+	token_fd = bpf_token_create(&token_opts);
+	if (!ASSERT_GT(token_fd, 0, "token_create"))
+		return;
+
+	/* drop privileges to test token_fd passing */
+	if (!ASSERT_OK(drop_priv_caps(&old_caps), "drop_caps"))
+		goto cleanup;
+
+	/* validate we can successfully load BPF program with token; this
+	 * being XDP program (CAP_NET_ADMIN) using bpf_jiffies64() (CAP_BPF)
+	 * and bpf_get_current_task() (CAP_PERFMON) helpers validates we have
+	 * BPF token wired properly in a bunch of places in the kernel
+	 */
+	prog_opts.token_fd = token_fd;
+	prog_opts.expected_attach_type = BPF_XDP;
+	prog_fd = bpf_prog_load(BPF_PROG_TYPE_XDP, "token_prog", "GPL",
+				insns, insn_cnt, &prog_opts);
+	if (!ASSERT_GT(prog_fd, 0, "prog_fd"))
+		goto cleanup;
+	close(prog_fd);
+
+	/* now validate that we *cannot* load BPF program without token */
+	prog_opts.token_fd = 0;
+	prog_fd = bpf_prog_load(BPF_PROG_TYPE_XDP, "token_prog", "GPL",
+				insns, insn_cnt, &prog_opts);
+	if (!ASSERT_EQ(prog_fd, -EPERM, "prog_fd_eperm"))
+		goto cleanup;
+
+cleanup:
+	if (prog_fd > 0)
+		close(prog_fd);
+	if (token_fd)
+		close(token_fd);
+	if (old_caps)
+		ASSERT_OK(restore_priv_caps(old_caps), "restore_caps");
+}
+
 void test_token(void)
 {
 	if (test__start_subtest("token_create"))
@@ -195,4 +255,6 @@ void test_token(void)
 		subtest_map_token();
 	if (test__start_subtest("btf_token"))
 		subtest_btf_token();
+	if (test__start_subtest("prog_token"))
+		subtest_prog_token();
 }
-- 
2.34.1



* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (17 preceding siblings ...)
  2023-06-07 23:53 ` [PATCH v2 bpf-next 18/18] selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests Andrii Nakryiko
@ 2023-06-08 18:49 ` Stanislav Fomichev
  2023-06-08 22:17   ` Andrii Nakryiko
  2023-06-09 11:17 ` Toke Høiland-Jørgensen
                   ` (3 subsequent siblings)
  22 siblings, 1 reply; 72+ messages in thread
From: Stanislav Fomichev @ 2023-06-08 18:49 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, linux-security-module, keescook, brauner, lennart, cyphar,
	luto, kernel-team

On 06/07, Andrii Nakryiko wrote:
> This patch set introduces new BPF object, BPF token, which allows to delegate
> a subset of BPF functionality from privileged system-wide daemon (e.g.,
> systemd or any other container manager) to a *trusted* unprivileged
> application. Trust is the key here. This functionality is not about allowing
> unconditional unprivileged BPF usage. Establishing trust, though, is
> completely up to the discretion of respective privileged application that
> would create a BPF token.
> 
> The main motivation for BPF token is a desire to enable containerized
> BPF applications to be used together with user namespaces. This is currently
> impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> arbitrary memory, and it's impossible to ensure that they only read memory of
> processes belonging to any given namespace. This means that it's impossible to
> have namespace-aware CAP_BPF capability, and as such another mechanism to
> allow safe usage of BPF functionality is necessary. BPF token and delegation
> of it to a trusted unprivileged applications is such mechanism. Kernel makes
> no assumption about what "trusted" constitutes in any particular case, and
> it's up to specific privileged applications and their surrounding
> infrastructure to decide that. What kernel provides is a set of APIs to create
> and tune BPF token, and pass it around to privileged BPF commands that are
> creating new BPF objects like BPF programs, BPF maps, etc.
> 
> Previous attempt at addressing this very same problem ([0]) attempted to
> utilize authoritative LSM approach, but was conclusively rejected by upstream
> LSM maintainers. BPF token concept is not changing anything about LSM
> approach, but can be combined with LSM hooks for very fine-grained security
> policy. Some ideas about making BPF token more convenient to use with LSM (in
> particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> 2023 presentation ([1]). E.g., an ability to specify user-provided data
> (context), which in combination with BPF LSM would allow implementing a very
> dynamic and fine-granular custom security policies on top of BPF token. In the
> interest of minimizing API surface area discussions this is going to be
> added in follow up patches, as it's not essential to the fundamental concept
> of delegatable BPF token.
> 
> It should be noted that BPF token is conceptually quite similar to the idea of
> /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> difference is the idea of using virtual anon_inode file to hold BPF token and
> allowing multiple independent instances of them, each with its own set of
> restrictions. BPF pinning solves the problem of exposing such BPF token
> through file system (BPF FS, in this case) for cases where transferring FDs
> over Unix domain sockets is not convenient. And also, crucially, BPF token
> approach is not using any special stateful task-scoped flags. Instead, bpf()
> syscall accepts token_fd parameters explicitly for each relevant BPF command.
> This addresses main concerns brought up during the /dev/bpf discussion, and
> fits better with overall BPF subsystem design.
> 
> This patch set adds a basic minimum of functionality to make BPF token useful
> and to discuss API and functionality. Currently only low-level libbpf APIs
> support passing BPF token around, allowing to test kernel functionality, but
> for the most part is not sufficient for real-world applications, which
> typically use high-level libbpf APIs based on `struct bpf_object` type. This
> was done with the intent to limit the size of patch set and concentrate on
> mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> as a separate follow up patch set kernel support makes it upstream.
> 
> Another part that should happen once kernel-side BPF token is established, is
> a set of conventions between applications (e.g., systemd), tools (e.g.,
> bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> at well-defined locations to allow applications take advantage of this in
> automatic fashion without explicit code changes on BPF application's side.
> But I'd like to postpone this discussion to after BPF token concept lands.
> 
>   [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
>   [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
>   [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
> 
> v1->v2:
>   - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
>   - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
 
I went through v2, everything makes sense, the only thing that is
slightly confusing to me is the bpf_token_capable() call.
The name somehow implies that the token is capable of something
where in reality the function does "return token || capable(x)".

IMO, it would be less confusing if we do something like the following,
explicitly, instead of calling a function:

if (token || {bpf_,perfmon_,}capable(x)) ...

(or rename to something like bpf_token_or_capable(x))

Up to you on whether to take any action on that. OTOH, once you
grasp what bpf_token_capable really does, it's not really a problem.


* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-08 18:49 ` [PATCH v2 bpf-next 00/18] BPF token Stanislav Fomichev
@ 2023-06-08 22:17   ` Andrii Nakryiko
  0 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-08 22:17 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Thu, Jun 8, 2023 at 11:49 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 06/07, Andrii Nakryiko wrote:
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
> >
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
> >
> > Previous attempt at addressing this very same problem ([0]) attempted to
> > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > LSM maintainers. BPF token concept is not changing anything about LSM
> > approach, but can be combined with LSM hooks for very fine-grained security
> > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > (context), which in combination with BPF LSM would allow implementing a very
> > dynamic and fine-granular custom security policies on top of BPF token. In the
> > interest of minimizing API surface area discussions this is going to be
> > added in follow up patches, as it's not essential to the fundamental concept
> > of delegatable BPF token.
> >
> > It should be noted that BPF token is conceptually quite similar to the idea of
> > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > difference is the idea of using virtual anon_inode file to hold BPF token and
> > allowing multiple independent instances of them, each with its own set of
> > restrictions. BPF pinning solves the problem of exposing such BPF token
> > through file system (BPF FS, in this case) for cases where transferring FDs
> > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > syscall accepts token_fd parameters explicitly for each relevant BPF command.
> > This addresses main concerns brought up during the /dev/bpf discussion, and
> > fits better with overall BPF subsystem design.
> >
> > This patch set adds a basic minimum of functionality to make BPF token useful
> > and to discuss API and functionality. Currently only low-level libbpf APIs
> > support passing BPF token around, allowing to test kernel functionality, but
> > for the most part is not sufficient for real-world applications, which
> > typically use high-level libbpf APIs based on `struct bpf_object` type. This
> > was done with the intent to limit the size of patch set and concentrate on
> > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> > as a separate follow up patch set kernel support makes it upstream.
> >
> > Another part that should happen once kernel-side BPF token is established, is
> > a set of conventions between applications (e.g., systemd), tools (e.g.,
> > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> > at well-defined locations to allow applications take advantage of this in
> > automatic fashion without explicit code changes on BPF application's side.
> > But I'd like to postpone this discussion to after BPF token concept lands.
> >
> >   [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
> >   [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> >   [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
> >
> > v1->v2:
> >   - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
> >   - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
>
> I went through v2, everything makes sense, the only thing that is
> slightly confusing to me is the bpf_token_capable() call.
> The name somehow implies that the token is capable of something
> where in reality the function does "return token || capable(x)".

heh, "bpf_token_" part is sort of like namespace/object prefix. The
intent here was to have a token-aware capable check. And yes, if we
get a token during prog/map/etc construction, the assumption is that
it provides all relevant permissions.

>
> IMO, it would be less confusing if we do something like the following,
> explicitly, instead of calling a function:
>
> if (token || {bpf_,perfmon_,}capable(x)) ...
>
> (or rename to something like bpf_token_or_capable(x))

I'd rather not open-code `if (token || ...)` checks everywhere, but I
can rename to `bpf_token_or_capable()` if people prefer. I erred on
the side of succinctness, but if it's confusing, then best to rename?
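
For anyone skimming the thread, here is a minimal userspace sketch of
the "token-aware capable check" semantics being discussed, i.e.
"return token || capable(x)". This is not the code from the patch set:
the demo_* names, the capability enum, and passing the task's
capabilities as a bitmask are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's types or capability numbers. */
enum demo_cap { DEMO_CAP_BPF, DEMO_CAP_PERFMON, DEMO_CAP_COUNT };

struct demo_token {
	int id;	/* a real token would carry its delegation settings */
};

/* Models capable(): true if the task's capability mask has the bit set. */
static inline bool demo_capable(unsigned int task_caps, enum demo_cap cap)
{
	assert(cap < DEMO_CAP_COUNT);
	return (task_caps & (1u << cap)) != 0;
}

/*
 * Token-aware check: a token, if one was supplied, is assumed to grant
 * the permission; otherwise fall back to the plain capability check.
 */
static inline bool demo_token_or_capable(const struct demo_token *token,
					 unsigned int task_caps,
					 enum demo_cap cap)
{
	return token != NULL || demo_capable(task_caps, cap);
}
```

Calling demo_token_or_capable(NULL, 0, DEMO_CAP_BPF) is false, while
passing any non-NULL token makes it true regardless of the mask, which
is the behaviour the "_or_capable" name would spell out.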

>
> Up to you on whether to take any action on that. OTOH, once you
> grasp what bpf_token_capable really does, it's not really a problem.

Cool, thanks for taking a look!

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (18 preceding siblings ...)
  2023-06-08 18:49 ` [PATCH v2 bpf-next 00/18] BPF token Stanislav Fomichev
@ 2023-06-09 11:17 ` Toke Høiland-Jørgensen
  2023-06-09 18:21   ` Andrii Nakryiko
  2023-06-09 18:32 ` Andy Lutomirski
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-09 11:17 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf
  Cc: linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

Andrii Nakryiko <andrii@kernel.org> writes:

> This patch set introduces new BPF object, BPF token, which allows to delegate
> a subset of BPF functionality from privileged system-wide daemon (e.g.,
> systemd or any other container manager) to a *trusted* unprivileged
> application. Trust is the key here. This functionality is not about allowing
> unconditional unprivileged BPF usage. Establishing trust, though, is
> completely up to the discretion of respective privileged application that
> would create a BPF token.

I am not convinced that this token-based approach is a good way to solve
this: having the delegation mechanism be one where you can basically
only grant a perpetual delegation with no way to retract it, no way to
check what exactly it's being used for, and that is transitive (can be
passed on to others with no restrictions) seems like a recipe for
disaster. I believe this was basically the point Casey was making as
well in response to v1.

If the goal is to enable a privileged application (such as a container
manager) to grant another unprivileged application the permission to
perform certain bpf() operations, why not just proxy the operations
themselves over some RPC mechanism? That way the granting application
can perform authentication checks on every operation and ensure its
origins are sound at the time it is being made. Instead of just writing
a blank check (in the form of a token) and hoping the receiver of it is
not compromised...

-Toke

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-09 11:17 ` Toke Høiland-Jørgensen
@ 2023-06-09 18:21   ` Andrii Nakryiko
  2023-06-09 21:21     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-09 18:21 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Andrii Nakryiko <andrii@kernel.org> writes:
>
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
>
> I am not convinced that this token-based approach is a good way to solve
> this: having the delegation mechanism be one where you can basically
> only grant a perpetual delegation with no way to retract it, no way to
> check what exactly it's being used for, and that is transitive (can be
> passed on to others with no restrictions) seems like a recipe for
> disaster. I believe this was basically the point Casey was making as
> well in response to v1.

Most of this can be added, if we really need to. Ability to revoke BPF
token is easy to implement (though of course it will apply only for
subsequent operations). We can allocate ID for BPF token just like we
do for BPF prog/map/link and let tools iterate and fetch information
about it. As for controlling who's passing what and where, I don't
think the situation is different for any other FD-based mechanism. You
might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
or BPF FS, and that application can keep doing the same to other
processes.
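
To make the SCM_RIGHTS point concrete, here is a rough sketch of
handing an arbitrary FD to another process over a Unix domain socket.
Only standard socket calls are used; send_fd() and recv_fd() are names
made up for this example, and nothing in it is BPF-specific.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected Unix domain socket. */
static inline int send_fd(int sock, int fd)
{
	char data = 'F';	/* one byte of real data carries the cmsg */
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new FD, or -1 on failure. */
static inline int recv_fd(int sock)
{
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver ends up with its own FD number referring to the same
open file, and nothing stops it from calling send_fd() again on a
different socket.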

Ultimately, currently we have root permissions for applications that
need BPF. That's already very dangerous. But just because something
might be misused or abused doesn't prevent us from making a good
practical use of it, right?

Also, there is LSM on top of all of this to override and control how
the BPF subsystem is used, regardless of BPF token. It can override
any of the privilege mechanisms: capabilities, BPF token, whatnot.

>
> If the goal is to enable a privileged application (such as a container
> manager) to grant another unprivileged application the permission to
> perform certain bpf() operations, why not just proxy the operations
> themselves over some RPC mechanism? That way the granting application

It's explicitly what we *do not* want to do, as it is a major problem
and logistical complication. Every single application will have to be
rewritten to use such a special daemon/service and its API, which is
completely different from bpf() syscall API. It invalidates the use of
all the libbpf (and other bpf libraries') APIs, BPF skeleton is
incompatible with this. It's a nightmare. I've got feedback from
people in another company that do have BPF service with just a tiny
subset of BPF functionality delegated to such service, and it's a pain
and definitely not a preferred way to do things.

Just think about having to mirror a big chunk of bpf() syscall as an
RPC. So no, BPF proxy is definitely not a good solution.


> can perform authentication checks on every operation and ensure its
> origins are sound at the time it is being made. Instead of just writing
> a blank check (in the form of a token) and hoping the receiver of it is
> not compromised...

All this could and should be done through LSM in much more decoupled
and transparent (to application) way. BPF token doesn't prevent this.
It actually helps with this, because organizations can actually
dictate that operations that do not provide BPF token are
automatically rejected, and those that do provide BPF token can be
further checked and granted or rejected based on specific BPF token
instance.

>
> -Toke

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (19 preceding siblings ...)
  2023-06-09 11:17 ` Toke Høiland-Jørgensen
@ 2023-06-09 18:32 ` Andy Lutomirski
  2023-06-09 19:08   ` Andrii Nakryiko
  2023-06-09 22:29 ` Djalal Harouni
  2023-06-12 12:44 ` Dave Tucker
  22 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2023-06-09 18:32 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf
  Cc: linux-security-module, Kees Cook, Christian Brauner, lennart,
	cyphar, kernel-team

On Wed, Jun 7, 2023, at 4:53 PM, Andrii Nakryiko wrote:
> This patch set introduces new BPF object, BPF token, which allows to delegate
> a subset of BPF functionality from privileged system-wide daemon (e.g.,
> systemd or any other container manager) to a *trusted* unprivileged
> application. Trust is the key here. This functionality is not about allowing
> unconditional unprivileged BPF usage. Establishing trust, though, is
> completely up to the discretion of respective privileged application that
> would create a BPF token.
>

I skimmed the description and the LSFMM slides.

Years ago, I sent out a patch set to start down the path of making the bpf() API make sense when used in less-privileged contexts (regarding access control of BPF objects and such).  It went nowhere.

Where does BPF token fit in?  Does a kernel with these patches applied actually behave sensibly if you pass a BPF token into a container?  Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-09 18:32 ` Andy Lutomirski
@ 2023-06-09 19:08   ` Andrii Nakryiko
  2023-06-19 17:40     ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-09 19:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team

On Fri, Jun 9, 2023 at 11:32 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Jun 7, 2023, at 4:53 PM, Andrii Nakryiko wrote:
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
> >
>
> I skimmed the description and the LSFMM slides.
>
> Years ago, I sent out a patch set to start down the path of making the bpf() API make sense when used in less-privileged contexts (regarding access control of BPF objects and such).  It went nowhere.
>
> Where does BPF token fit in?  Does a kernel with these patches applied actually behave sensibly if you pass a BPF token into a container?

Yes?.. In the sense that it is possible to create BPF programs and BPF
maps from inside the container (with BPF token). Right now under user
namespace it's impossible no matter what you do.

> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.

BPF is still a privileged thing. You can't just say that any
unprivileged application should be able to use BPF. That's why BPF
token is about trusting unpriv application in a controlled environment
(production) to not do something crazy. It can be enforced further
through LSM usage, but in a lot of cases, when dealing with internal
production applications it's enough to have a proper application
design and rely on code review process to avoid any negative effects.

So privileged daemon (container manager) will be configured with the
knowledge of which services/containers are allowed to use BPF, and
will grant BPF token only to those that were explicitly allowlisted.

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-09 18:21   ` Andrii Nakryiko
@ 2023-06-09 21:21     ` Toke Høiland-Jørgensen
  2023-06-09 22:03       ` Andrii Nakryiko
  0 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-09 21:21 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Andrii Nakryiko <andrii@kernel.org> writes:
>>
>> > This patch set introduces new BPF object, BPF token, which allows to delegate
>> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
>> > systemd or any other container manager) to a *trusted* unprivileged
>> > application. Trust is the key here. This functionality is not about allowing
>> > unconditional unprivileged BPF usage. Establishing trust, though, is
>> > completely up to the discretion of respective privileged application that
>> > would create a BPF token.
>>
>> I am not convinced that this token-based approach is a good way to solve
>> this: having the delegation mechanism be one where you can basically
>> only grant a perpetual delegation with no way to retract it, no way to
>> check what exactly it's being used for, and that is transitive (can be
>> passed on to others with no restrictions) seems like a recipe for
>> disaster. I believe this was basically the point Casey was making as
>> well in response to v1.
>
> Most of this can be added, if we really need to. Ability to revoke BPF
> token is easy to implement (though of course it will apply only for
> subsequent operations). We can allocate ID for BPF token just like we
> do for BPF prog/map/link and let tools iterate and fetch information
> about it. As for controlling who's passing what and where, I don't
> think the situation is different for any other FD-based mechanism. You
> might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
> or BPF FS, and that application can keep doing the same to other
> processes.

No, but every other fd-based mechanism is limited in scope. E.g., if you
pass a map fd that's one specific map that can be passed around, with a
token it's all operations (of a specific type) which is way broader.

> Ultimately, currently we have root permissions for applications that
> need BPF. That's already very dangerous. But just because something
> might be misused or abused doesn't prevent us from making a good
> practical use of it, right?

That's not a given. It's always a trade-off, and if the mechanism is
likely to open up the system to additional risk that's not a good
trade-off even if it helps in some case. I basically worry that this is
the case here.

> Also, there is LSM on top of all of this to override and control how
> the BPF subsystem is used, regardless of BPF token. It can override
> any of the privileges mechanism, capabilities, BPF token, whatnot.

If this mechanism needs an LSM to be used safely, that's not incredibly
confidence-inspiring. Security mechanisms should fail safe, which this
one does not.

I'm also worried that an LSM policy is the only way to disable the
ability to create a token; with this in the kernel, I suddenly have to
trust not only that all applications with BPF privileges will not load
malicious code, but also that they won't (accidentally or maliciously)
convey extra privileges to someone else. Seems a bit broad to have this
ability (to issue tokens) available to everyone with access to the bpf()
syscall, when (IIUC) it's only a single daemon in the system that would
legitimately do this in the deployment you're envisioning.

>> If the goal is to enable a privileged application (such as a container
>> manager) to grant another unprivileged application the permission to
>> perform certain bpf() operations, why not just proxy the operations
>> themselves over some RPC mechanism? That way the granting application
>
> It's explicitly what we *do not* want to do, as it is a major problem
> and logistical complication. Every single application will have to be
> rewritten to use such a special daemon/service and its API, which is
> completely different from bpf() syscall API. It invalidates the use of
> all the libbpf (and other bpf libraries') APIs, BPF skeleton is
> incompatible with this. It's a nightmare. I've got feedback from
> people in another company that do have BPF service with just a tiny
> subset of BPF functionality delegated to such service, and it's a pain
> and definitely not a preferred way to do things.

But weren't you proposing that libbpf should be able to transparently
look for tokens and load them without any application changes? Why can't
libbpf be taught to use an RPC socket in a similar fashion? It basically
boils down to something like:

static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
			  unsigned int size)
{
	struct stat st;
	int sock;

	/* open_socket(), write_to(), read_response() are placeholders */
	if (!stat("/run/bpf.sock", &st)) {
		sock = open_socket("/run/bpf.sock");
		write_to(sock, cmd, attr, size);
		return read_response(sock);
	} else {
		return syscall(__NR_bpf, cmd, attr, size);
	}
}

> Just think about having to mirror a big chunk of bpf() syscall as an
> RPC. So no, BPF proxy is definitely not a good solution.

The daemon at the other side of the socket in the example above doesn't
*have* to be taught all the semantics of the syscall, it can just look
at the command name and make a decision based on that and the identity
of the socket peer, then just pass the whole thing to the kernel if the
permission check passes.
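
A rough sketch of what such a daemon-side check could look like,
assuming a Unix domain socket and SO_PEERCRED to identify the peer.
struct proxy_rule, peer_uid() and proxy_allows() are hypothetical
names, and the per-UID bitmask of command numbers is just one possible
policy format.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for struct ucred */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical policy entry: which command numbers one UID may proxy. */
struct proxy_rule {
	uid_t uid;
	uint64_t allowed_cmds;	/* bit N set => command number N allowed */
};

/* Ask the kernel who is on the other end of a connected Unix socket. */
static inline int peer_uid(int sock, uid_t *uid)
{
	struct ucred cred;
	socklen_t len = sizeof(cred);

	assert(uid != NULL);
	if (getsockopt(sock, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
		return -1;
	*uid = cred.uid;
	return 0;
}

/* Decide from the command number and the peer identity alone. */
static inline bool proxy_allows(const struct proxy_rule *rules, int n_rules,
				uid_t peer, unsigned int cmd)
{
	int i;

	if (cmd >= 64)
		return false;	/* does not fit the 64-bit mask */
	for (i = 0; i < n_rules; i++) {
		if (rules[i].uid == peer)
			return (rules[i].allowed_cmds &
				(UINT64_C(1) << cmd)) != 0;
	}
	return false;	/* unknown peers are denied */
}
```

Only if proxy_allows() returns true would the daemon forward the
request to the bpf() syscall unchanged; it never has to interpret the
attr payload itself.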

>> can perform authentication checks on every operation and ensure its
>> origins are sound at the time it is being made. Instead of just writing
>> a blank check (in the form of a token) and hoping the receiver of it is
>> not compromised...
>
> All this could and should be done through LSM in much more decoupled
> and transparent (to application) way. BPF token doesn't prevent this.
> It actually helps with this, because organizations can actually
> dictate that operations that do not provide BPF token are
> automatically rejected, and those that do provide BPF token can be
> further checked and granted or rejected based on specific BPF token
> instance.

See above re: needing an LSM policy to make this safe...

-Toke

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-09 21:21     ` Toke Høiland-Jørgensen
@ 2023-06-09 22:03       ` Andrii Nakryiko
  2023-06-12 10:49         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-09 22:03 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> Andrii Nakryiko <andrii@kernel.org> writes:
> >>
> >> > This patch set introduces new BPF object, BPF token, which allows to delegate
> >> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> >> > systemd or any other container manager) to a *trusted* unprivileged
> >> > application. Trust is the key here. This functionality is not about allowing
> >> > unconditional unprivileged BPF usage. Establishing trust, though, is
> >> > completely up to the discretion of respective privileged application that
> >> > would create a BPF token.
> >>
> >> I am not convinced that this token-based approach is a good way to solve
> >> this: having the delegation mechanism be one where you can basically
> >> only grant a perpetual delegation with no way to retract it, no way to
> >> check what exactly it's being used for, and that is transitive (can be
> >> passed on to others with no restrictions) seems like a recipe for
> >> disaster. I believe this was basically the point Casey was making as
> >> well in response to v1.
> >
> > Most of this can be added, if we really need to. Ability to revoke BPF
> > token is easy to implement (though of course it will apply only for
> > subsequent operations). We can allocate ID for BPF token just like we
> > do for BPF prog/map/link and let tools iterate and fetch information
> > about it. As for controlling who's passing what and where, I don't
> > think the situation is different for any other FD-based mechanism. You
> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
> > or BPF FS, and that application can keep doing the same to other
> > processes.
>
> No, but every other fd-based mechanism is limited in scope. E.g., if you
> pass a map fd that's one specific map that can be passed around, with a
> token it's all operations (of a specific type) which is way broader.

It's not black and white. Once you have a BPF program FD, you can
attach it many times, for example, and cause regressions. Sure, here
we are talking about creating multiple BPF maps or loading multiple
BPF programs, so it's wider in scope, but still, it's not that
fundamentally different.

>
> > Ultimately, currently we have root permissions for applications that
> > need BPF. That's already very dangerous. But just because something
> > might be misused or abused doesn't prevent us from making a good
> > practical use of it, right?
>
> That's not a given. It's always a trade-off, and if the mechanism is
> likely to open up the system to additional risk that's not a good
> trade-off even if it helps in some case. I basically worry that this is
> the case here.
>
> > Also, there is LSM on top of all of this to override and control how
> > the BPF subsystem is used, regardless of BPF token. It can override
> > any of the privileges mechanism, capabilities, BPF token, whatnot.
>
> If this mechanism needs an LSM to be used safely, that's not incredibly
> confidence-inspiring. Security mechanisms should fail safe, which this
> one does not.

I proposed to add authoritative LSM hooks that would selectively allow
some of BPF operations on a case-by-case basis. This was rejected,
claiming that the best approach is to give process privilege to do
whatever it needs to do and then restrict it with LSM.

Ok, if not for user namespaces, that would mean giving application
CAP_BPF+CAP_PERFMON+CAP_NET_ADMIN+CAP_SYS_ADMIN, and then restrict it
with LSM. Except with user namespaces that doesn't work. So that's
where BPF token comes in, and it does this more safely by allowing
coarse tuning of which subset of BPF operations is granted.
And then LSM should be used to further restrict it.
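
Roughly, the layering described above could be modeled like this. It
is a userspace illustration only: demo_token, its allowed_cmds
bitmask, and the demo_lsm_hook callback are assumptions for the sketch
and not the UAPI or the LSM hooks from this patch set. The capability
fallback for callers without a token is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical token: a coarse bitmask of delegated command numbers. */
struct demo_token {
	uint64_t allowed_cmds;
};

/* Hypothetical LSM-style hook: may veto, never grant. */
typedef bool (*demo_lsm_hook)(const struct demo_token *token,
			      unsigned int cmd);

static inline uint64_t demo_cmd_bit(unsigned int cmd)
{
	assert(cmd < 64);	/* callers range-check first */
	return UINT64_C(1) << cmd;
}

/* Example policy for the sketch: veto command number 0 only. */
static inline bool demo_lsm_deny_cmd0(const struct demo_token *token,
				      unsigned int cmd)
{
	(void)token;
	return cmd != 0;
}

static inline bool demo_cmd_permitted(const struct demo_token *token,
				      unsigned int cmd, demo_lsm_hook lsm)
{
	/* Coarse gate: the token itself must delegate the command. */
	if (!token || cmd >= 64 || !(token->allowed_cmds & demo_cmd_bit(cmd)))
		return false;
	/* Fine gate: policy can only narrow what the token granted. */
	return lsm ? lsm(token, cmd) : true;
}
```

The point of the ordering is that the hook is consulted only after
the token check has passed, so it can restrict a delegation but cannot
widen it.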

>
> I'm also worried that an LSM policy is the only way to disable the
> ability to create a token; with this in the kernel, I suddenly have to
> trust not only that all applications with BPF privileges will not load
> malicious code, but also that they won't (accidentally or maliciously)
> conveys extra privileges on someone else. Seems a bit broad to have this
> ability (to issue tokens) available to everyone with access to the bpf()
> syscall, when (IIUC) it's only a single daemon in the system that would
> legitimately do this in the deployment you're envisioning.

Note, any process with real CAP_SYS_ADMIN. Let's not forget that.

But would you feel better if BPF_TOKEN_CREATE was guarded behind
sysctl or Kconfig?

Ultimately, worrying is fine, but there are real problems that need to
be solved. And not doing anything isn't a great option.

>
> >> If the goal is to enable a privileged application (such as a container
> >> manager) to grant another unprivileged application the permission to
> >> perform certain bpf() operations, why not just proxy the operations
> >> themselves over some RPC mechanism? That way the granting application
> >
> > It's explicitly what we *do not* want to do, as it is a major problem
> > and logistical complication. Every single application will have to be
> > rewritten to use such a special daemon/service and its API, which is
> > completely different from bpf() syscall API. It invalidates the use of
> > all the libbpf (and other bpf libraries') APIs, BPF skeleton is
> > incompatible with this. It's a nightmare. I've got feedback from
> > people in another company that do have BPF service with just a tiny
> > subset of BPF functionality delegated to such service, and it's a pain
> > and definitely not a preferred way to do things.
>
> But weren't you proposing that libbpf should be able to transparently
> look for tokens and load them without any application changes? Why can't
> libbpf be taught to use an RPC socket in a similar fashion? It basically
> boils down to something like:
>
> static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
>                           unsigned int size)
> {
>         struct stat st;
>         int sock;
>
>         /* open_socket(), write_to(), read_response() are placeholders */
>         if (!stat("/run/bpf.sock", &st)) {
>                 sock = open_socket("/run/bpf.sock");
>                 write_to(sock, cmd, attr, size);
>                 return read_response(sock);
>         } else {
>                 return syscall(__NR_bpf, cmd, attr, size);
>         }
> }
>

Well, for one, Meta will use its own Thrift-based RPC protocol.
Google might use something internal to them based on gRPC, someone else
would want to utilize systemd, yet others will use yet another
implementation. RPC introduces more failure modes. While with syscall
we know that operation either succeeded or failed, with RPC we'll have
to deal with "maybe", if it was some communication error.

Let's not trivialize adding, using, and supporting the RPC version of
bpf() syscall.


> > Just think about having to mirror a big chunk of bpf() syscall as an
> > RPC. So no, BPF proxy is definitely not a good solution.
>
> The daemon at the other side of the socket in the example above doesn't
> *have* to be taught all the semantics of the syscall, it can just look
> at the command name and make a decision based on that and the identity
> of the socket peer, then just pass the whole thing to the kernel if the
> permission check passes.

Let's not trivialize the consequences of adding an RPC protocol to all
this, please. No matter in what form or shape.

>
> >> can perform authentication checks on every operation and ensure its
> >> origins are sound at the time it is being made. Instead of just writing
> >> a blank check (in the form of a token) and hoping the receiver of it is
> >> not compromised...
> >
> > All this could and should be done through LSM in much more decoupled
> > and transparent (to application) way. BPF token doesn't prevent this.
> > It actually helps with this, because organizations can actually
> > dictate that operations that do not provide BPF token are
> > automatically rejected, and those that do provide BPF token can be
> > further checked and granted or rejected based on specific BPF token
> > instance.
>
> See above re: needing an LSM policy to make this safe...

See above. We are talking about the CAP_SYS_ADMIN-enabled process.
It's not safe by definition already.

>
> -Toke

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (20 preceding siblings ...)
  2023-06-09 18:32 ` Andy Lutomirski
@ 2023-06-09 22:29 ` Djalal Harouni
  2023-06-09 22:57   ` Andrii Nakryiko
  2023-06-12 12:44 ` Dave Tucker
  22 siblings, 1 reply; 72+ messages in thread
From: Djalal Harouni @ 2023-06-09 22:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, linux-security-module, keescook, brauner, lennart, cyphar,
	luto, kernel-team

Hi Andrii,

On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> This patch set introduces new BPF object, BPF token, which allows to delegate
> a subset of BPF functionality from privileged system-wide daemon (e.g.,
> systemd or any other container manager) to a *trusted* unprivileged
> application. Trust is the key here. This functionality is not about allowing
> unconditional unprivileged BPF usage. Establishing trust, though, is
> completely up to the discretion of respective privileged application that
> would create a BPF token.
>
> The main motivation for BPF token is a desire to enable containerized
> BPF applications to be used together with user namespaces. This is currently
> impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> arbitrary memory, and it's impossible to ensure that they only read memory of
> processes belonging to any given namespace. This means that it's impossible to
> have namespace-aware CAP_BPF capability, and as such another mechanism to
> allow safe usage of BPF functionality is necessary. BPF token and delegation
> of it to a trusted unprivileged applications is such mechanism. Kernel makes
> no assumption about what "trusted" constitutes in any particular case, and
> it's up to specific privileged applications and their surrounding
> infrastructure to decide that. What kernel provides is a set of APIs to create
> and tune BPF token, and pass it around to privileged BPF commands that are
> creating new BPF objects like BPF programs, BPF maps, etc.

Is there a reason for coupling this only with the userns?
The "trusted unprivileged" assumed by systemd can be in init userns?


> Previous attempt at addressing this very same problem ([0]) attempted to
> utilize authoritative LSM approach, but was conclusively rejected by upstream
> LSM maintainers. BPF token concept is not changing anything about LSM
> approach, but can be combined with LSM hooks for very fine-grained security
> policy. Some ideas about making BPF token more convenient to use with LSM (in
> particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> 2023 presentation ([1]). E.g., an ability to specify user-provided data
> (context), which in combination with BPF LSM would allow implementing a very
> dynamic and fine-granular custom security policies on top of BPF token. In the
> interest of minimizing API surface area discussions this is going to be
> added in follow up patches, as it's not essential to the fundamental concept
> of delegatable BPF token.
>
> It should be noted that BPF token is conceptually quite similar to the idea of
> /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> difference is the idea of using virtual anon_inode file to hold BPF token and
> allowing multiple independent instances of them, each with its own set of
> restrictions. BPF pinning solves the problem of exposing such BPF token
> through file system (BPF FS, in this case) for cases where transferring FDs
> over Unix domain sockets is not convenient. And also, crucially, BPF token
> approach is not using any special stateful task-scoped flags. Instead, bpf()

What's the use case for transferring over unix domain sockets?

Will BPF token translation happen if you cross the different namespaces?

If the token is pinned into different bpffs, will the token share the
same context?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-09 22:29 ` Djalal Harouni
@ 2023-06-09 22:57   ` Andrii Nakryiko
  2023-06-12 12:02     ` Djalal Harouni
  0 siblings, 1 reply; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-09 22:57 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
>
> Hi Andrii,
>
> On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
> >
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> Is there a reason for coupling this only with the userns?

There is no coupling. Without userns it is at least possible to grant
CAP_BPF and other capabilities from init ns. With user namespace that
becomes impossible.

> The "trusted unprivileged" assumed by systemd can be in init userns?

It doesn't have to be systemd, but yes, BPF token can be created only
when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
of commands).

>
>
> > Previous attempt at addressing this very same problem ([0]) attempted to
> > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > LSM maintainers. BPF token concept is not changing anything about LSM
> > approach, but can be combined with LSM hooks for very fine-grained security
> > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > (context), which in combination with BPF LSM would allow implementing a very
> > dynamic and fine-granular custom security policies on top of BPF token. In the
> > interest of minimizing API surface area discussions this is going to be
> > added in follow up patches, as it's not essential to the fundamental concept
> > of delegatable BPF token.
> >
> > It should be noted that BPF token is conceptually quite similar to the idea of
> > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > difference is the idea of using virtual anon_inode file to hold BPF token and
> > allowing multiple independent instances of them, each with its own set of
> > restrictions. BPF pinning solves the problem of exposing such BPF token
> > through file system (BPF FS, in this case) for cases where transferring FDs
> > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > approach is not using any special stateful task-scoped flags. Instead, bpf()
>
> What's the use case for transferring over unix domain sockets?

I'm not sure I understand the question. Unix domain socket
(specifically its SCM_RIGHTS ancillary message) allows to transfer
files between processes, which is one way to pass BPF object (like
prog/map/link, and now token). BPF FS is the other one. In practice
it's usually BPF FS, but there is no presumption about how file
reference is transferred.
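
For readers unfamiliar with SCM_RIGHTS, here is a minimal sketch of passing
an open file descriptor between two processes over a connected AF_UNIX
socket. The helper names send_fd() and recv_fd() are made up for this
illustration; nothing here is specific to BPF, and the same code would carry
any FD (prog, map, link, or a token).

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer, aligned for struct cmsghdr, sized for one FD. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Hypothetical helper: send one open FD over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
	char dummy = 'x';	/* one byte of payload carries the ancillary data */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	union fd_cmsg u;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one FD; returns the new FD or -1. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	union fd_cmsg u;
	int fd;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new FD number that refers to the same open file as the
sender's FD, which is why no copy of the underlying object is involved.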

>
> Will BPF token translation happen if you cross the different namespaces?

What does BPF token translation mean specifically? Currently it's a
very simple kernel object with refcnt and a few flags, so there is
nothing to translate?

>
> If the token is pinned into different bpffs, will the token share the
> same context?

So I was planning to allow a user process creating a BPF token to
specify custom user-provided data (context). This is not in this patch
set, but is it what you are asking about?

Regardless, pinning BPF object in BPF FS is just basically bumping a
refcnt and exposes that object in a way that can be looked up through
file system path (using bpf() syscall's BPF_OBJ_GET command).
Underlying object isn't cloned or copied, it's exactly the same object
with the same shared internal state.
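
As a rough sketch of what pinning and lookup look like at the raw syscall
level (libbpf wraps these commands; the helper names pin_obj() and get_obj()
below are invented for this example, and any path passed in is assumed to be
inside a mounted BPF FS):

```c
#include <assert.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the bpf() syscall. */
static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
	return syscall(__NR_bpf, cmd, attr, size);
}

/* Hypothetical helper: pin an existing BPF object FD at a BPF FS path. */
static int pin_obj(int fd, const char *path)
{
	union bpf_attr attr;

	assert(path != NULL);
	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(unsigned long)path;
	attr.bpf_fd = fd;
	return sys_bpf(BPF_OBJ_PIN, &attr, sizeof(attr));
}

/*
 * Hypothetical helper: look a pinned object up by path.
 * Returns a new FD for the same kernel object, or -1 with errno set.
 */
static int get_obj(const char *path)
{
	union bpf_attr attr;

	assert(path != NULL);
	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(unsigned long)path;
	return sys_bpf(BPF_OBJ_GET, &attr, sizeof(attr));
}
```

A process calling get_obj() on a pinned path ends up holding another
reference to the very same object that was passed to pin_obj(); both calls
fail with a negative return value if the path is not on a BPF FS mount or the
caller lacks permission.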

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-09 22:03       ` Andrii Nakryiko
@ 2023-06-12 10:49         ` Toke Høiland-Jørgensen
  2023-06-12 22:08           ` Andrii Nakryiko
  0 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-12 10:49 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >>
>> >> Andrii Nakryiko <andrii@kernel.org> writes:
>> >>
>> >> > This patch set introduces new BPF object, BPF token, which allows to delegate
>> >> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
>> >> > systemd or any other container manager) to a *trusted* unprivileged
>> >> > application. Trust is the key here. This functionality is not about allowing
>> >> > unconditional unprivileged BPF usage. Establishing trust, though, is
>> >> > completely up to the discretion of respective privileged application that
>> >> > would create a BPF token.
>> >>
>> >> I am not convinced that this token-based approach is a good way to solve
>> >> this: having the delegation mechanism be one where you can basically
>> >> only grant a perpetual delegation with no way to retract it, no way to
>> >> check what exactly it's being used for, and that is transitive (can be
>> >> passed on to others with no restrictions) seems like a recipe for
>> >> disaster. I believe this was basically the point Casey was making as
>> >> well in response to v1.
>> >
>> > Most of this can be added, if we really need to. Ability to revoke BPF
>> > token is easy to implement (though of course it will apply only for
>> > subsequent operations). We can allocate ID for BPF token just like we
>> > do for BPF prog/map/link and let tools iterate and fetch information
>> > about it. As for controlling who's passing what and where, I don't
>> > think the situation is different for any other FD-based mechanism. You
>> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
>> > or BPF FS, and that application can keep doing the same to other
>> > processes.
>>
>> No, but every other fd-based mechanism is limited in scope. E.g., if you
>> pass a map fd that's one specific map that can be passed around, with a
>> token it's all operations (of a specific type) which is way broader.
>
> It's not black and white. Once you have a BPF program FD, you can
> attach it many times, for example, and cause regressions. Sure, here
> we are talking about creating multiple BPF maps or loading multiple
> BPF programs, so it's wider in scope, but still, it's not that
> fundamentally different.

Right, but the difference is that a single BPF program is a known
entity, so even if the application you pass the fd to can attach it
multiple times, it can't make it do new things (e.g., bpf_probe_read()
stuff it is not supposed to). Whereas with bpf_token you have no such
guarantee.

>>
>> > Ultimately, currently we have root permissions for applications that
>> > need BPF. That's already very dangerous. But just because something
>> > might be misused or abused doesn't prevent us from making a good
>> > practical use of it, right?
>>
>> That's not a given. It's always a trade-off, and if the mechanism is
>> likely to open up the system to additional risk that's not a good
>> trade-off even if it helps in some case. I basically worry that this is
>> the case here.
>>
>> > Also, there is LSM on top of all of this to override and control how
>> > the BPF subsystem is used, regardless of BPF token. It can override
>> > any of the privileges mechanism, capabilities, BPF token, whatnot.
>>
>> If this mechanism needs an LSM to be used safely, that's not incredibly
>> confidence-inspiring. Security mechanisms should fail safe, which this
>> one does not.
>
> I proposed to add authoritative LSM hooks that would selectively allow
> some of BPF operations on a case-by-case basis. This was rejected,
> claiming that the best approach is to give process privilege to do
> whatever it needs to do and then restrict it with LSM.
>
> Ok, if not for user namespaces, that would mean giving application
> CAP_BPF+CAP_PERFMON+CAP_NET_ADMIN+CAP_SYS_ADMIN, and then restrict it
> with LSM. Except with user namespace that doesn't work. So that's
> where BPF token comes in, but allows it to do it more safely by
> allowing to coarsely tune what subset of BPF operations is granted.
> And then LSM should be used to further restrict it.

Right, I do understand the use case, my worry is that we're creating a
privilege escalation model that is really broad if it is *not* coupled
with an LSM to restrict it. Which will be the default outside of
controlled environments that really know what they are doing.

So I dunno, maybe some way to restrict the token so it only grants
privilege if there is *also* an explicit LSM verdict on it? I guess
that's still too close to an authoritative LSM hook that it'll pass? I
do think the "explicit grant" model of an authoritative LSM is a better
fit for this kind of thing...

>> I'm also worried that an LSM policy is the only way to disable the
>> ability to create a token; with this in the kernel, I suddenly have to
>> trust not only that all applications with BPF privileges will not load
>> malicious code, but also that they won't (accidentally or maliciously)
>> convey extra privileges to someone else. Seems a bit broad to have this
>> ability (to issue tokens) available to everyone with access to the bpf()
>> syscall, when (IIUC) it's only a single daemon in the system that would
>> legitimately do this in the deployment you're envisioning.
>
> Note, any process with real CAP_SYS_ADMIN. Let's not forget that.
>
> But would you feel better if BPF_TOKEN_CREATE was guarded behind
> sysctl or Kconfig?

Hmm, yeah, some way to make sure it's off by default would be
preferable, IMO.

> Ultimately, worrying is fine, but there are real problems that need to
> be solved. And not doing anything isn't a great option.

Right, it would be good if some of the security folks could chime in
with their view of how this is best achieved without running into any of
the "bad ideas" they are opposed to.

>> >> If the goal is to enable a privileged application (such as a container
>> >> manager) to grant another unprivileged application the permission to
>> >> perform certain bpf() operations, why not just proxy the operations
>> >> themselves over some RPC mechanism? That way the granting application
>> >
>> > It's explicitly what we *do not* want to do, as it is a major problem
>> > and logistical complication. Every single application will have to be
>> > rewritten to use such a special daemon/service and its API, which is
>> > completely different from bpf() syscall API. It invalidates the use of
>> > all the libbpf (and other bpf libraries') APIs, BPF skeleton is
>> > incompatible with this. It's a nightmare. I've got feedback from
>> > people in another company that do have BPF service with just a tiny
>> > subset of BPF functionality delegated to such service, and it's a pain
>> > and definitely not a preferred way to do things.
>>
>> But weren't you proposing that libbpf should be able to transparently
>> look for tokens and load them without any application changes? Why can't
>> libbpf be taught to use an RPC socket in a similar fashion? It basically
>> boils down to something like:
>>
>> static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
>>                           unsigned int size)
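>> {
>>         struct stat st;
>>
>>         /* open_socket(), write_to() and read_response() are placeholders */
>>         if (!stat("/run/bpf.sock", &st)) {
>>                 int sock = open_socket("/run/bpf.sock");
>>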
>>                 write_to(sock, cmd, attr, size);
>>                 return read_response(sock);
>>         } else {
>>                 return syscall(__NR_bpf, cmd, attr, size);
>>         }
>> }
>>
>
> Well, for one, Meta will use its own Thrift-based RPC protocol.
> Google might use something internal for them using GRPC, someone else
> would want to utilize systemd, yet others will use yet another
> implementation. RPC introduces more failure modes. While with syscall
> we know that operation either succeeded or failed, with RPC we'll have
> to deal with "maybe", if it was some communication error.
>
> Let's not trivialize adding, using, and supporting the RPC version of
> bpf() syscall.

I am not trying to trivialise it, I am well aware that it is more
complicated in practice than just adding a wrapper like the above. I am
just arguing with your point that "all applications need to change, so
we can't do RPC". Any mechanism we add along there lines will require
application changes, including the BPF token. And if the way we're going
to avoid that is by baking the support into libbpf, then that can be
done regardless of the mechanism we choose.

Or to put it another way: as you say it may be more *complicated* to add
an RPC-based path to libbpf, but it's not fundamentally impossible, it's
just another technical problem to be solved. And if that added
complexity buys us better security properties, maybe that is a good
trade-off. At least we shouldn't dismiss it out of hand.

-Toke

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-09 22:57   ` Andrii Nakryiko
@ 2023-06-12 12:02     ` Djalal Harouni
  2023-06-12 14:31       ` Djalal Harouni
  2023-06-12 22:27       ` Andrii Nakryiko
  0 siblings, 2 replies; 72+ messages in thread
From: Djalal Harouni @ 2023-06-12 12:02 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
> >
> > Hi Andrii,
> >
> > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > ...
> > > creating new BPF objects like BPF programs, BPF maps, etc.
> >
> > Is there a reason for coupling this only with the userns?
>
> There is no coupling. Without userns it is at least possible to grant
> CAP_BPF and other capabilities from init ns. With user namespace that
> becomes impossible.

But these are not the same: delegate full cap vs delegate an fd mask?

One can argue that unprivileged in the init userns is the same as
privileged in a nested userns.
Being able to delegate an fd in the init userns, and then in nested
ones, seems logical...

> > The "trusted unprivileged" assumed by systemd can be in init userns?
>
> It doesn't have to be systemd, but yes, BPF token can be created only
> when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
> on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
> of commands).

I'm more interested in getting fd delegation to work in the init userns
as well...

I don't understand why that isn't possible or doable.

> >
> >
> > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > approach, but can be combined with LSM hooks for very fine-grained security
> > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > (context), which in combination with BPF LSM would allow implementing a very
> > > dynamic and fine-granular custom security policies on top of BPF token. In the
> > > interest of minimizing API surface area discussions this is going to be
> > > added in follow up patches, as it's not essential to the fundamental concept
> > > of delegatable BPF token.
> > >
> > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > allowing multiple independent instances of them, each with its own set of
> > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> >
> > What's the use case for transferring over unix domain sockets?
>
> I'm not sure I understand the question. Unix domain socket
> (specifically its SCM_RIGHTS ancillary message) allows to transfer
> files between processes, which is one way to pass BPF object (like
> prog/map/link, and now token). BPF FS is the other one. In practice
> it's usually BPF FS, but there is no presumption about how file
> reference is transferred.

Got it.

IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving
userns, no ?

I assume that is what allows setting things up in a hierarchical way...

If I set up the environment to lock things down the line, I find it
strange if a received fd would allow me to do more things than what
was planned when I created the environment: namespaces, mounts, etc

I think you have to add the owning userns context to the fd or
"token", and on the receiving part if the current userns is the same
or a nested one of the current userns hierarchy then allow bpf
operation, otherwise fail with -EACCES or something similar...


> >
> > Will BPF token translation happen if you cross the different namespaces?
>
> What does BPF token translation mean specifically? Currently it's a
> very simple kernel object with refcnt and a few flags, so there is
> nothing to translate?

Please see above comment about the owning userns context

> >
> > If the token is pinned into different bpffs, will the token share the
> > same context?
>
> So I was planning to allow a user process creating a BPF token to
> specify custom user-provided data (context). This is not in this patch
> set, but is it what you are asking about?

Exactly, define what you can access inside the container... this would
align with Andy's suggestion "making BPF behave sensibly in that
container seems like it should also be necessary." I do agree on this.

Again I think LSM and bpf+lsm should have the final word on this too...


> Regardless, pinning BPF object in BPF FS is just basically bumping a
> refcnt and exposes that object in a way that can be looked up through
> file system path (using bpf() syscall's BPF_OBJ_GET command).
> Underlying object isn't cloned or copied, it's exactly the same object
> with the same shared internal state.

This is the part I also find strange, I can understand pinning a bpf
program, map, etc, but an fd that gives some access rights should be
part of the filesystem from the start, I don't get the extra pinning.
Also it seems bpffs is per superblock mount so why not allow
privileged to mount bpffs with the corresponding information, then
privileged can open the fd, set it up and pass it down the line when
executing the main program?  or even allow unprivileged to open it on
bpffs with some restrictive conditions?

Then it would be the business of the privileged to bind mount bpffs in
some other places, share it, etc

Having the fd or "token" that gives access rights pinned in two
separate bpffs mounts seems too much, it crosses namespaces (mount,
userns etc), environments setup by privileged...

I would just make it per bpffs mount and that's it, nothing more. If a
program wants to bind mount it somewhere else then it's not a bpf
problem.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
                   ` (21 preceding siblings ...)
  2023-06-09 22:29 ` Djalal Harouni
@ 2023-06-12 12:44 ` Dave Tucker
  2023-06-12 15:52   ` Djalal Harouni
  2023-06-12 23:04   ` Andrii Nakryiko
  22 siblings, 2 replies; 72+ messages in thread
From: Dave Tucker @ 2023-06-12 12:44 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, linux-security-module, keescook, brauner, lennart, cyphar,
	luto, kernel-team



> On 8 Jun 2023, at 00:53, Andrii Nakryiko <andrii@kernel.org> wrote:
> 
> This patch set introduces new BPF object, BPF token, which allows to delegate
> a subset of BPF functionality from privileged system-wide daemon (e.g.,
> systemd or any other container manager) to a *trusted* unprivileged
> application. Trust is the key here. This functionality is not about allowing
> unconditional unprivileged BPF usage. Establishing trust, though, is
> completely up to the discretion of respective privileged application that
> would create a BPF token.


Hello! Author of bpfd[1] here.

> The main motivation for BPF token is a desire to enable containerized
> BPF applications to be used together with user namespaces. This is currently
> impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> arbitrary memory, and it's impossible to ensure that they only read memory of
> processes belonging to any given namespace. This means that it's impossible to
> have namespace-aware CAP_BPF capability, and as such another mechanism to
> allow safe usage of BPF functionality is necessary. BPF token and delegation
> of it to a trusted unprivileged applications is such mechanism. Kernel makes
> no assumption about what "trusted" constitutes in any particular case, and
> it's up to specific privileged applications and their surrounding
> infrastructure to decide that. What kernel provides is a set of APIs to create
> and tune BPF token, and pass it around to privileged BPF commands that are
> creating new BPF objects like BPF programs, BPF maps, etc.

You could do that… but the problem is created due to the pattern of having a
single binary that is responsible for:

- Loading and attaching the BPF program in question
- Interacting with maps

Let’s set aside some of the other fun concerns of eBPF in containers:
 - Requiring mounting of vmlinux, bpffs, traces etc…
 - How fs permissions on host translate into permissions in containers

While your proposal lets you grant a subset of CAP_BPF to some other process,
which I imagine could also be done with SELinux, it doesn’t stop you from needing
other required permissions for attaching tracing programs in such an
environment. 

For example, say container A wants to attach a uprobe to a process in container B.
Container A needs to be able to nsenter into container B’s pidns in order for attachment
to succeed… but then what I can do with CAP_BPF is the least of my concerns since
I’d wager I’d need to mount `/proc` from the host in container A + have elevated privileges
much scarier than CAP_BPF in the first place.

If you move “Loading and attaching” away to somewhere else (i.e a daemon like bpfd)
then with recent kernels your container workload should be fine to run entirely unprivileged,
or worst case with only CAP_BPF since all you need to do is read/write maps.

Policy control - which process can request to load programs that monitor which other
processes - would happen within this system daemon and you wouldn’t need tokens.

Since it’s easy enough to do this in userspace, I’d be strongly against adding more
complexity into BPF to support this usecase.

> Previous attempt at addressing this very same problem ([0]) attempted to
> utilize authoritative LSM approach, but was conclusively rejected by upstream
> LSM maintainers. BPF token concept is not changing anything about LSM
> approach, but can be combined with LSM hooks for very fine-grained security
> policy. Some ideas about making BPF token more convenient to use with LSM (in
> particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> 2023 presentation ([1]). E.g., an ability to specify user-provided data
> (context), which in combination with BPF LSM would allow implementing a very
> dynamic and fine-granular custom security policies on top of BPF token. In the
> interest of minimizing API surface area discussions this is going to be
> added in follow up patches, as it's not essential to the fundamental concept
> of delegatable BPF token.
> 
> It should be noted that BPF token is conceptually quite similar to the idea of
> /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> difference is the idea of using virtual anon_inode file to hold BPF token and
> allowing multiple independent instances of them, each with its own set of
> restrictions. BPF pinning solves the problem of exposing such BPF token
> through file system (BPF FS, in this case) for cases where transferring FDs
> over Unix domain sockets is not convenient. And also, crucially, BPF token
> approach is not using any special stateful task-scoped flags. Instead, bpf()
> syscall accepts token_fd parameters explicitly for each relevant BPF command.
> This addresses main concerns brought up during the /dev/bpf discussion, and
> fits better with overall BPF subsystem design.
> 
> This patch set adds a basic minimum of functionality to make BPF token useful
> and to discuss API and functionality. Currently only low-level libbpf APIs
> support passing BPF token around, allowing to test kernel functionality, but
> for the most part is not sufficient for real-world applications, which
> typically use high-level libbpf APIs based on `struct bpf_object` type. This
> was done with the intent to limit the size of patch set and concentrate on
> mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> as a separate follow up patch set once kernel support makes it upstream.
> 
> Another part that should happen once kernel-side BPF token is established, is
> a set of conventions between applications (e.g., systemd), tools (e.g.,
> bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> at well-defined locations to allow applications take advantage of this in
> automatic fashion without explicit code changes on BPF application's side.
> But I'd like to postpone this discussion to after BPF token concept lands.
> 
>  [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
>  [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
>  [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
> 

- Dave

[1]: https://github.com/bpfd-dev/bpfd

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-12 12:02     ` Djalal Harouni
@ 2023-06-12 14:31       ` Djalal Harouni
  2023-06-12 22:27       ` Andrii Nakryiko
  1 sibling, 0 replies; 72+ messages in thread
From: Djalal Harouni @ 2023-06-12 14:31 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Mon, Jun 12, 2023 at 2:02 PM Djalal Harouni <tixxdz@gmail.com> wrote:
>
> On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
...
> > I'm not sure I understand the question. Unix domain socket
> > (specifically its SCM_RIGHTS ancillary message) allows to transfer
> > files between processes, which is one way to pass BPF object (like
> > prog/map/link, and now token). BPF FS is the other one. In practice
> > it's usually BPF FS, but there is no presumption about how file
> > reference is transferred.
>
> Got it.
>
> IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving
> userns, no ?
>
> I assume such which allows to set up things in a hierarchical way...
>
> If I set up the environment to lock things down the line, I find it
> strange if a received fd would allow me to do more things than what
> was planned when I created the environment: namespaces, mounts, etc
>
> I think you have to add the owning userns context to the fd or
> "token", and on the receiving part if the current userns is the same
> or a nested one of the current userns hierarchy then allow bpf
> operation, otherwise fail with -EACCES or something similar...

Andrii, to make it clear: take the owning userns that is the owner/creator
of the bpffs mount (better this one, since it prevents the "inherit an fd
and do bad things with it" cases...) and call it userns A. The receiving
process is in userns B. When transferring the fd, if userns B == userns A
or if A is an ancestor of B, then allow doing things with the fd token,
otherwise just deny it...

At least that's how I see things now, but maybe there are corner cases...
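
To make the rule concrete, here is a rough user-space sketch of the check.
The struct and function names are made up for illustration; this is not
kernel code, just the "same or ancestor" walk over parent links:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a user namespace: only the parent link matters. */
struct userns {
	const struct userns *parent;	/* NULL for the initial namespace */
};

/*
 * Return true if 'owner' is the same namespace as 'cur' or one of its
 * ancestors, i.e. 'cur' is nested (at any depth) inside 'owner'.
 */
static bool userns_same_or_ancestor(const struct userns *owner,
				    const struct userns *cur)
{
	for (; cur != NULL; cur = cur->parent) {
		if (cur == owner)
			return true;
	}
	return false;
}
```

With A as the owner, a receiver in A or in anything nested under A passes,
while a sibling of A or a parent of A does not.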


* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-12 12:44 ` Dave Tucker
@ 2023-06-12 15:52   ` Djalal Harouni
  2023-06-12 23:04   ` Andrii Nakryiko
  1 sibling, 0 replies; 72+ messages in thread
From: Djalal Harouni @ 2023-06-12 15:52 UTC (permalink / raw)
  To: Dave Tucker
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Mon, Jun 12, 2023 at 2:45 PM Dave Tucker <datucker@redhat.com> wrote:
>
>
>
> > On 8 Jun 2023, at 00:53, Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
>
>
> Hello! Author of a bpfd[1] here.
>
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> You could do that… but the problem is created due to the pattern of having a
> single binary that is responsible for:
>
> - Loading and attaching the BPF program in question
> - Interacting with maps
>
> Let’s set aside some of the other fun concerns of eBPF in containers:
>  - Requiring mounting of vmlinux, bpffs, traces etc…
>  - How fs permissions on host translate into permissions in containers
>
> While your proposal lets you grant a subset of CAP_BPF to some other process,
> which I imagine could also be done with SELinux, it doesn’t stop you from needing
> other required permissions for attaching tracing programs in such an
> environment.
>
> For example, say container A wants to attach a uprobe to a process in container B.
> Container A needs to be able to nsenter into container B’s pidns in order for attachment
> to succeed… but then what I can do with CAP_BPF is the least of my concerns since
> I’d wager I’d need to mount `/proc` from the host in container A + have elevated privileges
> much scarier than CAP_BPF in the first place.
>
> If you move “Loading and attaching” away to somewhere else (i.e a daemon like bpfd)
> then with recent kernels your container workload should be fine to run entirely unprivileged,
> or worst case with only CAP_BPF since all you need to do is read/write maps.
>
> Policy control - which process can request to load programs that monitor which other
> processes - would happen within this system daemon and you wouldn’t need tokens.
>
> Since it’s easy enough to do this in userspace, I’d be strongly against adding more
> complexity into BPF to support this usecase.

For some cases the complexity could go the other way: bpf programs are by
design small programs that can be loaded/unloaded dynamically and work on
their own... easily adaptable to dynamic workloads... not all bpf programs
are the same...

Stuffing *everything* together and performing round trips between the main
container and the container for transferring, loading and attaching bpf
programs raises the question: what's the advantage?


* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-12 10:49         ` Toke Høiland-Jørgensen
@ 2023-06-12 22:08           ` Andrii Nakryiko
  2023-06-13 21:48             ` Hao Luo
  2023-06-14 12:06             ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-12 22:08 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Mon, Jun 12, 2023 at 3:49 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >>
> >> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> >>
> >> >> Andrii Nakryiko <andrii@kernel.org> writes:
> >> >>
> >> >> > This patch set introduces new BPF object, BPF token, which allows to delegate
> >> >> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> >> >> > systemd or any other container manager) to a *trusted* unprivileged
> >> >> > application. Trust is the key here. This functionality is not about allowing
> >> >> > unconditional unprivileged BPF usage. Establishing trust, though, is
> >> >> > completely up to the discretion of respective privileged application that
> >> >> > would create a BPF token.
> >> >>
> >> >> I am not convinced that this token-based approach is a good way to solve
> >> >> this: having the delegation mechanism be one where you can basically
> >> >> only grant a perpetual delegation with no way to retract it, no way to
> >> >> check what exactly it's being used for, and that is transitive (can be
> >> >> passed on to others with no restrictions) seems like a recipe for
> >> >> disaster. I believe this was basically the point Casey was making as
> >> >> well in response to v1.
> >> >
> >> > Most of this can be added, if we really need to. Ability to revoke BPF
> >> > token is easy to implement (though of course it will apply only for
> >> > subsequent operations). We can allocate ID for BPF token just like we
> >> > do for BPF prog/map/link and let tools iterate and fetch information
> >> > about it. As for controlling who's passing what and where, I don't
> >> > think the situation is different for any other FD-based mechanism. You
> >> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
> >> > or BPF FS, and that application can keep doing the same to other
> >> > processes.
> >>
> >> No, but every other fd-based mechanism is limited in scope. E.g., if you
> >> pass a map fd that's one specific map that can be passed around, with a
> >> token it's all operations (of a specific type) which is way broader.
> >
> > It's not black and white. Once you have a BPF program FD, you can
> > attach it many times, for example, and cause regressions. Sure, here
> > we are talking about creating multiple BPF maps or loading multiple
> > BPF programs, so it's wider in scope, but still, it's not that
> > fundamentally different.
>
> Right, but the difference is that a single BPF program is a known
> entity, so even if the application you pass the fd to can attach it
> multiple times, it can't make it do new things (e.g., bpf_probe_read()
> stuff it is not supposed to). Whereas with bpf_token you have no such
> guarantee.

Sure, I'm not claiming BPF token is just like passing BPF program FD
around. My point is that anything in the kernel that is representable
by FD can be passed around to an unintended process through
SCM_RIGHTS. And if you want to have tighter control over who's passing
what, you'd probably need LSM. But it's not a requirement.
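
For reference, this is roughly what passing any FD over SCM_RIGHTS looks
like. It's a minimal sketch using only the standard sendmsg()/recvmsg()
socket API; the helper names send_fd()/recv_fd() are made up here, and
nothing in it is specific to BPF:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg_buf {
	struct cmsghdr align;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Send one file descriptor over a connected Unix domain socket. */
static int send_fd(int sock, int fd)
{
	char dummy = 'x';	/* one byte of payload to carry the control message */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg_buf u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new FD, or -1 on failure. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg_buf u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver ends up with its own FD number referring to the same
underlying file, whatever kind of kernel object is behind it.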

With BPF token it is important to trust the application you are
passing BPF token to. This is not a mechanism to just freely pass
around the ability to do BPF. You do it only to applications you
control.

You can initiate BPF token from under CAP_SYS_ADMIN only. If you give
CAP_SYS_ADMIN to some application that might pass BPF token to some
random application, you should probably revisit the whole approach.
You can do a lot of harm with that CAP_SYS_ADMIN beyond the BPF
subsystem.

On the other hand, the more correct comparison would be whether to
give some unprivileged application a BPF token versus giving it
CAP_BPF+CAP_PERFMON+CAP_NET_ADMIN+CAP_SYS_ADMIN (or the necessary
subset of it). With BPF token you can narrow down to what exact types
of programs and maps it can use, if at all. BPF token applies to BPF
subsystem only. With caps, you are giving that application way more
power than you'd like, but that's ok in practice, because a) you need
that application to do something useful with BPF, so you take that
risk, and b) you normally would control that application, so you are
mitigating this risk even without any LSM or something like that on
top.
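
Just to illustrate the "narrow down" part, conceptually it's a per-type
allow mask. The sketch below is a hypothetical user-space model (struct
and function names are invented for this example, and it's not meant to
mirror the actual kernel-side layout):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical token contents: one bit per allowed program or map type. */
struct token_sketch {
	uint64_t allowed_prog_types;
	uint64_t allowed_map_types;
};

/* True if the bit for 'type' is set in 'mask'; types >= 64 are rejected. */
static bool mask_allows(uint64_t mask, unsigned int type)
{
	return type < 64 && (mask & (UINT64_C(1) << type)) != 0;
}

static bool token_allows_prog(const struct token_sketch *t, unsigned int type)
{
	return mask_allows(t->allowed_prog_types, type);
}

static bool token_allows_map(const struct token_sketch *t, unsigned int type)
{
	return mask_allows(t->allowed_map_types, type);
}
```

Anything not explicitly enabled in the mask is denied, which is a much
narrower grant than a whole capability.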

We do the latter all the time because we have to. BPF token gives us
a more well-scoped alternative.

With user namespaces, if we could grant CAP_BPF and co to use BPF,
we'd do that. But we can't. BPF token at least gives us this
opportunity.

So while I understand your concerns in principle, I think they are a
bit overblown in practice.

>
> >>
> >> > Ultimately, currently we have root permissions for applications that
> >> > need BPF. That's already very dangerous. But just because something
> >> > might be misused or abused doesn't prevent us from making a good
> >> > practical use of it, right?
> >>
> >> That's not a given. It's always a trade-off, and if the mechanism is
> >> likely to open up the system to additional risk that's not a good
> >> trade-off even if it helps in some case. I basically worry that this is
> >> the case here.
> >>
> >> > Also, there is LSM on top of all of this to override and control how
> >> > the BPF subsystem is used, regardless of BPF token. It can override
> >> > any of the privileges mechanism, capabilities, BPF token, whatnot.
> >>
> >> If this mechanism needs an LSM to be used safely, that's not incredibly
> >> confidence-inspiring. Security mechanisms should fail safe, which this
> >> one does not.
> >
> > I proposed to add authoritative LSM hooks that would selectively allow
> > some of BPF operations on a case-by-case basis. This was rejected,
> > claiming that the best approach is to give process privilege to do
> > whatever it needs to do and then restrict it with LSM.
> >
> > Ok, if not for user namespaces, that would mean giving application
> > CAP_BPF+CAP_PERFMON+CAP_NET_ADMIN+CAP_SYS_ADMIN, and then restrict it
> > with LSM. Except with user namespace that doesn't work. So that's
> > where BPF token comes in, but allows it to do it more safely by
> > allowing to coarsely tune what subset of BPF operations is granted.
> > And then LSM should be used to further restrict it.
>
> Right, I do understand the use case, my worry is that we're creating a
> privilege escalation model that is really broad if it is *not* coupled
> with an LSM to restrict it. Which will be the default outside of
> controlled environments that really know what they are doing.

Look, you are worried that you gave some process root permissions and
that process delegated a small portion of that (BPF token) to an
unprivileged process, which abuses it somehow. Beyond the question of
"why did you grant root permissions to something you can't trust to do
the right thing", isn't there more dangerous stuff (I don't know,
setuid, chmod/chown, etc) that root process can perform to grant
unprivileged process unintended and uncontrolled privileges?

Why is BPF token the one singled out that would have to require
mandatory LSM to be installed?

>
> So I dunno, maybe some way to restrict the token so it only grants
> privilege if there is *also* an explicit LSM verdict on it? I guess
> that's still too close to an authoritative LSM hook that it'll pass? I
> do think the "explicit grant" model of an authoritative LSM is a better
> fit for this kind of thing...
>

I proposed an authoritative LSM, it was pretty plainly rejected and
the model of "grant a lot + restrict with LSM" was suggested.

> >> I'm also worried that an LSM policy is the only way to disable the
> >> ability to create a token; with this in the kernel, I suddenly have to
> >> trust not only that all applications with BPF privileges will not load
> >> malicious code, but also that they won't (accidentally or maliciously)
> >> convey extra privileges on someone else. Seems a bit broad to have this
> >> ability (to issue tokens) available to everyone with access to the bpf()
> >> syscall, when (IIUC) it's only a single daemon in the system that would
> >> legitimately do this in the deployment you're envisioning.
> >
> > Note, any process with real CAP_SYS_ADMIN. Let's not forget that.
> >
> > But would you feel better if BPF_TOKEN_CREATE was guarded behind
> > sysctl or Kconfig?
>
> Hmm, yeah, some way to make sure it's off by default would be
> preferable, IMO.
>
> > Ultimately, worrying is fine, but there are real problems that need to
> > be solved. And not doing anything isn't a great option.
>
> Right, it would be good if some of the security folks could chime in
> with their view of how this is best achieved without running into any of
> the "bad ideas" they are opposed to.

agreed

>
> >> >> If the goal is to enable a privileged application (such as a container
> >> >> manager) to grant another unprivileged application the permission to
> >> >> perform certain bpf() operations, why not just proxy the operations
> >> >> themselves over some RPC mechanism? That way the granting application
> >> >
> >> > It's explicitly what we *do not* want to do, as it is a major problem
> >> > and logistical complication. Every single application will have to be
> >> > rewritten to use such a special daemon/service and its API, which is
> >> > completely different from bpf() syscall API. It invalidates the use of
> >> > all the libbpf (and other bpf libraries') APIs, BPF skeleton is
> >> > incompatible with this. It's a nightmare. I've got feedback from
> >> > people in another company that do have BPF service with just a tiny
> >> > subset of BPF functionality delegated to such service, and it's a pain
> >> > and definitely not a preferred way to do things.
> >>
> >> But weren't you proposing that libbpf should be able to transparently
> >> look for tokens and load them without any application changes? Why can't
> >> libbpf be taught to use an RPC socket in a similar fashion? It basically
> >> boils down to something like:
> >>
> >> static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
> >>                           unsigned int size)
> >> {
> >>         if (!stat("/run/bpf.sock")) {
> >>                 sock = open_socket("/run/bpf.sock");
> >>                 write_to(sock, cmd, attr, size);
> >>                 return read_response(sock);
> >>         } else {
> >>                 return syscall(__NR_bpf, cmd, attr, size);
> >>         }
> >> }
> >>
> >
> > Well, for one, Meta will use its own Thrift-based RPC protocol.
> > Google might use something internal for them using gRPC, someone else
> > would want to utilize systemd, yet others will use yet another
> > implementation. RPC introduces more failure modes. While with syscall
> > we know that operation either succeeded or failed, with RPC we'll have
> > to deal with "maybe", if it was some communication error.
> >
> > Let's not trivialize adding, using, and supporting the RPC version of
> > bpf() syscall.
>
> I am not trying to trivialise it, I am well aware that it is more
> complicated in practice than just adding a wrapper like the above. I am
> just arguing with your point that "all applications need to change, so
> we can't do RPC". Any mechanism we add along these lines will require
> application changes, including the BPF token. And if the way we're going

Well, it depends on what kinds of changes we are talking about. E.g.,
in most explicit case, it would be something like:

int token_fd = bpf_token_get("/sys/fs/bpf/my_granted_token");
if (token_fd < 0) {
        /* we can bail out or just assume no token */
}
LIBBPF_OPTS(bpf_object_open_opts, opts, .token_fd = token_fd);

struct my_skel *skel = my_skel__open_opts(&opts);


That's literally it. And if we have some convention that libbpf will
try to open, say, /sys/fs/bpf/.token automatically, there will be zero
code changes. And I'm not simplifying this.
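
As a rough idea of what such an automatic lookup could do under the hood,
assuming a pinned token is fetched the same way as any other pinned BPF
object (through the BPF_OBJ_GET command), something like the sketch below
would be enough. The helper name is hypothetical and this is not actual
libbpf code:

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Try to get an FD for a BPF object pinned at 'path'.
 * Returns the FD on success or a negative errno value, which the caller
 * can treat as "no token available" and carry on without one.
 */
static int try_get_pinned_token(const char *path)
{
	union bpf_attr attr;
	long fd;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;

	fd = syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
	if (fd < 0)
		return -errno;
	return (int)fd;
}
```

A caller would try the conventional path (say, /sys/fs/bpf/.token) once
at startup and fall back to the no-token behavior on any error.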


> to avoid that is by baking the support into libbpf, then that can be
> done regardless of the mechanism we choose.
>
> Or to put it another way: as you say it may be more *complicated* to add
> an RPC-based path to libbpf, but it's not fundamentally impossible, it's
> just another technical problem to be solved. And if that added
> complexity buys us better security properties, maybe that is a good
> trade-off. At least we shouldn't dismiss it out of hand.

You are oversimplifying this. There is a huge difference between
syscall and RPC interfaces.

The former (syscall approach) will error out only on invalid inputs
(or, highly improbably, if the kernel runs out of memory, which means
your app is dead anyways). You don't code against a syscall interface
with the expectation that it can fail at any point and that you should
be able to recover from it.

With RPC you have to bake it into your application that any RPC can
fail transiently, for many reasons. Service could be down, restarted,
slow, etc, etc. This changes *everything* in how you develop
application, how you write code, how you handle errors, how you
monitor stuff. Everything.
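
As a tiny illustration, even the most basic RPC caller ends up wrapped in
something like the retry loop below, which no one writes around a plain
syscall. This is only a sketch: the rpc_fn type, the use of -EAGAIN to
model a transient failure, and the flaky_rpc() stub are all invented for
the example:

```c
#include <errno.h>

/* Hypothetical RPC stub type: returns 0 on success or a negative errno. */
typedef int (*rpc_fn)(void *ctx);

/*
 * Retry an RPC up to max_attempts times while it reports a transient
 * error (modelled here as -EAGAIN). Any other result is returned as-is.
 */
static int rpc_call_with_retry(rpc_fn fn, void *ctx, int max_attempts)
{
	int err = -EINVAL;	/* returned if max_attempts <= 0 */

	for (int i = 0; i < max_attempts; i++) {
		err = fn(ctx);
		if (err != -EAGAIN)
			break;
	}
	return err;
}

/* Demo stub: fails with -EAGAIN until *remaining_failures reaches zero. */
static int flaky_rpc(void *ctx)
{
	int *remaining_failures = ctx;

	if (*remaining_failures > 0) {
		(*remaining_failures)--;
		return -EAGAIN;
	}
	return 0;
}
```

And that is before timeouts, backoff, idempotency of a retried "load
program" request, and monitoring of the service itself.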

It's impossible to just swap out syscall with RPC transparently
without introducing horrible consequences. This is not some technical
difficulty, it's a fundamental impedance mismatch. One of the early
distributed systems mistakes was to pretend that remote procedure
calls could be reliable, to assume errors are rare, and to treat them
as if they behaved like syscalls or local in-process APIs. It has
been recognized many times over how bad such approaches were. It's
outside of the scope of this discussion to go into more details.
Suffice it to say that libbpf is not going to pretend that syscall and
some RPC are equivalent and can be interchangeable in a transparent
way.

And then, even if we were crazy enough to do the above, there is no
way everyone will settle on one single implementation and/or RPC
protocol and API such that libbpf could implement it in its upstream
version. Big companies most probably will go with their own internal
ones that would give them better integration with internal
infrastructure, better observability, etc. And even in open-source
there probably won't be one single implementation everyone will be
happy with.

>
> -Toke


* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-12 12:02     ` Djalal Harouni
  2023-06-12 14:31       ` Djalal Harouni
@ 2023-06-12 22:27       ` Andrii Nakryiko
  2023-06-14  0:23         ` Djalal Harouni
  1 sibling, 1 reply; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-12 22:27 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote:
>
> On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
> > >
> > > Hi Andrii,
> > >
> > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > > ...
> > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > >
> > > Is there a reason for coupling this only with the userns?
> >
> > There is no coupling. Without userns it is at least possible to grant
> > CAP_BPF and other capabilities from init ns. With user namespace that
> > becomes impossible.
>
> But these are not the same: delegate full cap vs delegate an fd mask?

What FD mask are we talking about here? I don't recall us talking
about any FD masks, so this one is a bit confusing without more
context.

>
> One can argue unprivileged in init userns is the same privileged in
> nested userns
> Getting to delegate fd in init userns, then in nested ones seems logical...

Again, sorry, I'm not following. Can you please elaborate what you mean?

>
> > > The "trusted unprivileged" assumed by systemd can be in init userns?
> >
> > It doesn't have to be systemd, but yes, BPF token can be created only
> > when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
> > on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
> > of commands).
>
> I'm more into getting fd delegation work also in the first init userns...
>
> I can't understand why it's not possible or doable?
>

I don't know what you are proposing, as I mentioned above, so it's
hard to answer this question.

> > >
> > >
> > > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > > approach, but can be combined with LSM hooks for very fine-grained security
> > > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > > particular custom BPF LSM programs) were briefly described in recent LSF/MM/BPF
> > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > > (context), which in combination with BPF LSM would allow implementing very
> > > > dynamic and fine-granular custom security policies on top of BPF token. In the
> > > > interest of minimizing API surface area discussions this is going to be
> > > > added in follow up patches, as it's not essential to the fundamental concept
> > > > of delegatable BPF token.
> > > >
> > > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > > allowing multiple independent instances of them, each with its own set of
> > > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > >
> > > What's the use case for transferring over unix domain sockets?
> >
> > I'm not sure I understand the question. Unix domain socket
> > (specifically its SCM_RIGHTS ancillary message) allows to transfer
> > files between processes, which is one way to pass BPF object (like
> > prog/map/link, and now token). BPF FS is the other one. In practice
> > it's usually BPF FS, but there is no presumption about how file
> > reference is transferred.
>
> Got it.
>
> IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving
> userns, no ?
>
> I assume such which allows to set up things in a hierarchical way...
>
> If I set up the environment to lock things down the line, I find it
> strange if a received fd would allow me to do more things than what
> was planned when I created the environment: namespaces, mounts, etc
>
> I think you have to add the owning userns context to the fd or
> "token", and on the receiving part if the current userns is the same
> or a nested one of the current userns hierarchy then allow bpf
> operation, otherwise fail with -EACCES or something similar...
>

I think I mentioned problems with namespacing BPF itself. It's just
fundamentally impossible due to a system-wide nature of BPF. So we can
pretend to somehow attach/restrict BPF token to some namespace, but it
still allows BPF programs to peek at any kernel state or user-space
process.

So I'd rather us not pretend we can do something that we actually
cannot enforce.

>
> > >
> > > Will BPF token translation happen if you cross the different namespaces?
> >
> > What does BPF token translation mean specifically? Currently it's a
> > very simple kernel object with refcnt and a few flags, so there is
> > nothing to translate?
>
> Please see above comment about the owning userns context
>
> > >
> > > If the token is pinned into different bpffs, will the token share the
> > > same context?
> >
> > So I was planning to allow a user process creating a BPF token to
> > specify custom user-provided data (context). This is not in this patch
> > set, but is it what you are asking about?
>
> Exactly, define what you can access inside the container... this would
> align with Andy's suggestion "making BPF behave sensibly in that
> container seems like it should also be necessary." I do agree on this.
>

I don't know what Andy's suggestion actually is (as I honestly can't
make out what your proposal is, sorry; you guys are not making it easy
on me by being pretty vague and nonspecific). But see above about
pretending to contain BPF within a container. There is no such thing.
BPF is system-wide.

> Again I think LSM and bpf+lsm should have the final word on this too...
>

Yes, I also think that having LSM on top is beneficial. But not a
strict requirement and more or less orthogonal.

>
> > Regardless, pinning BPF object in BPF FS is just basically bumping a
> > refcnt and exposes that object in a way that can be looked up through
> > file system path (using bpf() syscall's BPF_OBJ_GET command).
> > Underlying object isn't cloned or copied, it's exactly the same object
> > with the same shared internal state.
>
> This is the part I also find strange, I can understand pinning a bpf
> program, map, etc, but an fd that gives some access rights should be
> part of the filesystem from the start, I don't get the extra pinning.

BPF pinning of BPF token is optional. Everything still works without
any BPF FS mount at all. It's an FD, BPF FS is just one of the means
to pass FD to another process. I actually don't see why coupling BPF
FS and BPF token is simpler.

Now, BPF token is a kernel object, with its own state. It has an FD
associated with it. It can be passed around and provided as an
argument to bpf() syscall. In that sense it's just like BPF
prog/map/link, just another BPF object.

> Also it seems bpffs is per superblock mount so why not allow
> privileged to mount bpffs with the corresponding information, then
> privileged can open the fd, set it up and pass it down the line when
> executing the main program?  or even allow unprivileged to open it on
> bpffs with some restrictive conditions?
>
> Then it would be the business of the privileged to bind mount bpffs in
> some other places, share it, etc

How is this fundamentally different from BPF token pinning by
*privileged* process? Except we are not conflating BPF FS as a way to
pin/get many different BPF objects with BPF token itself. In both
cases it's up to privileged process to set up sharing of BPF token
appropriately.

>
> Having the fd or "token" that gives access rights pinned in two
> separate bpffs mounts seems too much, it crosses namespaces (mount,
> userns etc), environments setup by privileged...

See above, there is nothing namespaceable about BPF itself, and BPF
token as well. If some production setup benefits from pinning one BPF
token in multiple places, I don't see the problem with that.

>
> I would just make it per bpffs mount and that's it, nothing more. If a
> program wants to bind mount it somewhere else then it's not a bpf
> problem.

And if some application wants to pin BPF token, why would that be BPF
subsystem's problem as well?


* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-12 12:44 ` Dave Tucker
  2023-06-12 15:52   ` Djalal Harouni
@ 2023-06-12 23:04   ` Andrii Nakryiko
  1 sibling, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-12 23:04 UTC (permalink / raw)
  To: Dave Tucker
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Mon, Jun 12, 2023 at 5:45 AM Dave Tucker <datucker@redhat.com> wrote:
>
>
>
> > On 8 Jun 2023, at 00:53, Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
>
>
> Hello! Author of a bpfd[1] here.
>
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> You could do that… but the problem is created due to the pattern of having a
> single binary that is responsible for:
>
> - Loading and attaching the BPF program in question
> - Interacting with maps

It is a very desirable property to couple and deploy user process and
its BPF programs/maps together and manage their lifecycle directly.
All of Meta's production applications are using this model. This
allows for a simple and reliable versioning story. This allows using
BPF skeleton and BPF global variables naturally. It makes it simple
and easy to develop, debug, version, deploy, monitor BPF applications.

It also couples BPF program attachment (link) with lifetime of the
user space process. So if it crashes or restarts without clean
detachment, we don't end up with orphaned BPF programs and maps. We've
had pretty bad issues due to such orphaned programs, and that's why
the whole BPF link concept was formalized.
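
To make the lifetime argument concrete: a kernel object that is referenced
only through a process's FDs is released when that process goes away,
whether or not it cleaned up after itself. The sketch below shows that
general mechanism with an ordinary pipe standing in for a BPF link FD. It
is an analogy under that assumption, not BPF code, and the function name
is made up for illustration.

```python
import os


def child_exit_releases_fd() -> bytes:
    """Fork a child that holds the only write end of a pipe and dies uncleanly."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: keep only the write end, use it, then exit abruptly.
        os.close(read_fd)
        os.write(write_fd, b"attached")
        os._exit(1)  # simulated crash: no cleanup code runs, FD is never closed
    # Parent: drop its own copy so the child holds the last reference.
    os.close(write_fd)
    os.waitpid(pid, 0)
    data = os.read(read_fd, 64)  # what the child wrote before dying
    eof = os.read(read_fd, 64)   # b"" means the kernel released the write end
    os.close(read_fd)
    return data + b"|" + (b"EOF" if eof == b"" else eof)


if __name__ == "__main__":
    print(child_exit_releases_fd())
```

Pinning an object in BPF FS adds a reference that outlives the process,
which is the explicit opt-in for keeping it around.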

So it's actually a desirable approach in a real-world production setup.

>
> Let’s set aside some of the other fun concerns of eBPF in containers:
>  - Requiring mounting of vmlinux, bpffs, traces etc…
>  - How fs permissions on host translate into permissions in containers
>
> While your proposal lets you grant a subset of CAP_BPF to some other process,
> which I imagine could also be done with SELinux, it doesn’t stop you from needing
> other required permissions for attaching tracing programs in such an
> environment.

In some cases yes, there are other parts of the kernel that would
require some more work to be able to be used. But a lot of things are
possible within bpf() syscall already, including tracing stuff.

>
> For example, say container A wants to attach a uprobe to a process in container B.
> Container A needs to be able to nsenter into container B’s pidns in order for attachment
> to succeed… but then what I can do with CAP_BPF is the least of my concerns since
> I’d wager I’d need to mount `/proc` from the host in container A + have elevated privileges
> much scarier than CAP_BPF in the first place.

You'd wager, or you know for sure? I haven't tried, so I won't make any claims.

I do know, though, that our system-wide profiling agent (not running
under a user namespace) can attach to and profile namespaced
applications running inside containers without any nsenter.

But again, uprobe'ing some other container is just one of possible use
cases. Even if some scenarios would require more stuff beyond the BPF
token, it doesn't invalidate the need and usefulness of the BPF token.

>
> If you move “Loading and attaching” away to somewhere else (i.e a daemon like bpfd)
> then with recent kernels your container workload should be fine to run entirely unprivileged,
> or worst case with only CAP_BPF since all you need to do is read/write maps.

Except we explicitly want to avoid the need for some external entity
loading BPF programs on our behalf, as I explained in replies to
Toke.

>
> Policy control - which process can request to load programs that monitor which other
> processes - would happen within this system daemon and you wouldn’t need tokens.

And we can do the same through controlling which containers/services
are issued BPF tokens. And in addition to that could employ LSM for
more dynamic and fine-granular control.

Doing this through a centralized daemon is one way of doing this. But
it's not the universally better way to do this.

>
> Since it’s easy enough to do this in userspace, I’d be strongly against adding more
> complexity into BPF to support this usecase.

I appreciate you trying to get more customers for bpfd, there is
nothing wrong with that. But this approach has major (good and bad)
implications and is not the most appropriate solution in a lot of
cases and setups.

As for complexity: if you look at the code, you'll see that it's a
completely optional feature as far as BPF UAPI goes, so your customers
won't need to care about the BPF token's existence if they are happy
using the bpfd solution.

>
> > Previous attempt at addressing this very same problem ([0]) attempted to
> > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > LSM maintainers. BPF token concept is not changing anything about LSM
> > approach, but can be combined with LSM hooks for very fine-grained security
> > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > particular custom BPF LSM programs) were briefly described in a recent LSF/MM/BPF
> > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > (context), which in combination with BPF LSM would allow implementing very
> > dynamic and fine-grained custom security policies on top of BPF token. In the
> > interest of minimizing API surface area discussions this is going to be
> > added in follow up patches, as it's not essential to the fundamental concept
> > of delegatable BPF token.
> >
> > It should be noted that BPF token is conceptually quite similar to the idea of
> > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > difference is the idea of using virtual anon_inode file to hold BPF token and
> > allowing multiple independent instances of them, each with its own set of
> > restrictions. BPF pinning solves the problem of exposing such BPF token
> > through file system (BPF FS, in this case) for cases where transferring FDs
> > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > syscall accepts token_fd parameters explicitly for each relevant BPF command.
> > This addresses main concerns brought up during the /dev/bpf discussion, and
> > fits better with overall BPF subsystem design.
> >
> > This patch set adds a basic minimum of functionality to make BPF token useful
> > and to discuss API and functionality. Currently only low-level libbpf APIs
> > support passing BPF token around, allowing to test kernel functionality, but
> > for the most part is not sufficient for real-world applications, which
> > typically use high-level libbpf APIs based on `struct bpf_object` type. This
> > was done with the intent to limit the size of patch set and concentrate on
> > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> > as a separate follow-up patch set once kernel support makes it upstream.
> >
> > Another part that should happen once kernel-side BPF token is established, is
> > a set of conventions between applications (e.g., systemd), tools (e.g.,
> > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> > at well-defined locations to allow applications to take advantage of this in
> > automatic fashion without explicit code changes on BPF application's side.
> > But I'd like to postpone this discussion to after BPF token concept lands.
> >
> >  [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
> >  [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> >  [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
> >
>
> - Dave
>
> [1]: https://github.com/bpfd-dev/bpfd
>

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-12 22:08           ` Andrii Nakryiko
@ 2023-06-13 21:48             ` Hao Luo
  2023-06-14 12:06             ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 72+ messages in thread
From: Hao Luo @ 2023-06-13 21:48 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Toke Høiland-Jørgensen, Andrii Nakryiko, bpf,
	linux-security-module, keescook, brauner, lennart, cyphar, luto,
	kernel-team

On Mon, Jun 12, 2023 at 3:08 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jun 12, 2023 at 3:49 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >
<...>
> > to avoid that is by baking the support into libbpf, then that can be
> > done regardless of the mechanism we choose.
> >
> > Or to put it another way: as you say it may be more *complicated* to add
> > an RPC-based path to libbpf, but it's not fundamentally impossible, it's
> > just another technical problem to be solved. And if that added
> > complexity buys us better security properties, maybe that is a good
> > trade-off. At least we shouldn't dismiss it out of hand.
>
> You are oversimplifying this. There is a huge difference between
> syscall and RPC interfaces.
>
> The former (syscall approach) will error out only on invalid inputs
> (or, highly improbably, if the kernel runs out of memory, which means your
> app is dead anyways). You don't code against a syscall interface with the
> expectation that it can fail at any point and that you should be able to
> recover from it.
>
> With RPC you have to bake into your application that any RPC can
> fail transiently, for many reasons. Service could be down, restarted,
> slow, etc, etc. This changes *everything* in how you develop the
> application, how you write code, how you handle errors, how you
> monitor stuff. Everything.
>
> It's impossible to just swap out syscall with RPC transparently
> without introducing horrible consequences. This is not some technical
> difficulty, it's a fundamental impedance mismatch. One of the early
> distributed systems mistakes was to pretend that remote procedure
> calls could be reliable, to assume that errors are rare, and to treat
> RPCs as if they behaved like syscalls or local in-process APIs. It has
> been recognized many times over how bad such approaches were. It's
> outside of the scope of this discussion to go into more details.
> Suffice it to say that libbpf is not going to pretend that syscall and
> some RPC are equivalent and can be interchangeable in a transparent
> way.
>
> And then, even if we were crazy enough to do the above, there is no
> way everyone will settle on one single implementation and/or RPC
> protocol and API such that libbpf could implement it in its upstream
> version. Big companies most probably will go with their own internal
> ones that would give them better integration with internal
> infrastructure, better observability, etc. And even in open-source
> there probably won't be one single implementation everyone will be
> happy with.
>

Hello Toke and Andrii,

I agree with Andrii here. At Google, we have several years of
experience building and using a BPF RPC service. We delegate BPF
operations to this service. From our experience, the RPC approach is
quite limiting and becomes impractical for many BPF use cases.

For programs that do not require much user interaction, it works just
fine. It just loads and attaches the programs, that's all. The problem
is the programs that require much user interaction, for example, the
ones doing observability, which may often read maps or poll on bpf
ringbuf. Overhead and reliability of RPC is one concern. Another
problem is the BPF operations based on mmap, for example, directly
updating/reading BPF global variables as used in skeleton. We still
haven't figured out how to fully support bpf skeleton. We also haven't
figured out how to support BPF ringbuf using RPC. There are also
problems maintaining this service to catch up with some new features
in libbpf.
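
The mmap point can be illustrated with a small sketch: an update through
a shared mapping is an ordinary memory store, so there is no call for an
RPC proxy to intercept or forward. The example uses an anonymous shared
mapping and a forked child as stand-ins for an mmap-ed BPF global variable
page and its two users; the function name and the stand-ins are
assumptions made for illustration only.

```python
import mmap
import os
import struct


def shared_counter_demo() -> int:
    # Anonymous mapping, MAP_SHARED by default on Unix: stands in for the
    # memory-mapped page backing BPF global variables.
    buf = mmap.mmap(-1, mmap.PAGESIZE)
    struct.pack_into("<Q", buf, 0, 0)
    pid = os.fork()
    if pid == 0:
        # "Producer" side: a plain memory store, with no call to proxy.
        struct.pack_into("<Q", buf, 0, 42)
        os._exit(0)
    os.waitpid(pid, 0)
    # "Consumer" side: a plain memory load observes the update.
    (value,) = struct.unpack_from("<Q", buf, 0)
    buf.close()
    return value


if __name__ == "__main__":
    print(shared_counter_demo())
```

An RPC service would have to turn every such load and store into a
request, or copy the whole region back and forth, which is where the
overhead and consistency problems come from.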

Anyway, I think the syscall interface is heavily baked into libbpf
and the bpf kernel interfaces today. There are many BPF use cases where
delegating all BPF operations to a service can't work well. IMHO, to
achieve a good balance between flexibility and security, some
abstraction that conveys controlled trust from priv to unpriv is
necessary. The idea of BPF token makes sense to me. With a token, the
libbpf interface requires only minimal change, and an unpriv user can
call libbpf and the bpf syscall natively, which wins on efficiency and
means less maintenance burden for libbpf developers.

Thanks,
Hao

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-12 22:27       ` Andrii Nakryiko
@ 2023-06-14  0:23         ` Djalal Harouni
  2023-06-14  9:39           ` Christian Brauner
  2023-06-15 22:47           ` Andrii Nakryiko
  0 siblings, 2 replies; 72+ messages in thread
From: Djalal Harouni @ 2023-06-14  0:23 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote:
> >
> > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
> > > >
> > > > Hi Andrii,
> > > >
> > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > >
> > > > > ...
> > > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > > >
> > > > Is there a reason for coupling this only with the userns?
> > >
> > > There is no coupling. Without userns it is at least possible to grant
> > > CAP_BPF and other capabilities from init ns. With user namespace that
> > > becomes impossible.
> >
> > But these are not the same: delegate full cap vs delegate an fd mask?
>
> What FD mask are we talking about here? I don't recall us talking
> about any FD masks, so this one is a bit confusing without more
> context.

Ah err, sorry yes referring to fd token (which I assumed is a mask of
allowed operations or something like that).

So I want the possibility to delegate the fd token in the init userns.

> >
> > One can argue unprivileged in init userns is the same as privileged in
> > nested userns.
> > Getting to delegate fd in init userns, then in nested ones seems logical...
>
> Again, sorry, I'm not following. Can you please elaborate what you mean?

I mean can we use the fd token in the init user namespace too? not
only in the nested user namespaces but in the first one? Sorry I
didn't check the code.


> >
> > > > The "trusted unprivileged" assumed by systemd can be in init userns?
> > >
> > > It doesn't have to be systemd, but yes, BPF token can be created only
> > > when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
> > > on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
> > > of commands).
> >
> > I'm more into getting fd delegation work also in the first init userns...
> >
> > I can't understand why it's not possible or doable?
> >
>
> I don't know what you are proposing, as I mentioned above, so it's
> hard to answer this question.
>


> > > >
> > > >
> > > > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > > > approach, but can be combined with LSM hooks for very fine-grained security
> > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > > > particular custom BPF LSM programs) were briefly described in a recent LSF/MM/BPF
> > > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > > > (context), which in combination with BPF LSM would allow implementing very
> > > > > dynamic and fine-grained custom security policies on top of BPF token. In the
> > > > > interest of minimizing API surface area discussions this is going to be
> > > > > added in follow up patches, as it's not essential to the fundamental concept
> > > > > of delegatable BPF token.
> > > > >
> > > > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > > > allowing multiple independent instances of them, each with its own set of
> > > > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > > >
> > > > What's the use case for transferring over unix domain sockets?
> > >
> > > I'm not sure I understand the question. Unix domain socket
> > > (specifically its SCM_RIGHTS ancillary message) allows to transfer
> > > files between processes, which is one way to pass BPF object (like
> > > prog/map/link, and now token). BPF FS is the other one. In practice
> > > it's usually BPF FS, but there is no presumption about how file
> > > reference is transferred.
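
For readers unfamiliar with the mechanism described above, here is a
minimal sketch of passing an FD with SCM_RIGHTS. It assumes Python 3.9+
(for `socket.send_fds()` and `socket.recv_fds()`), keeps both socket ends
in one process for brevity, and uses a pipe FD as a stand-in for a BPF
object FD. None of this comes from the patch set, and the function name is
hypothetical.

```python
import os
import socket


def pass_fd_over_unix_socket() -> bytes:
    # Connected pair of Unix domain sockets: the two ends would normally
    # live in different processes.
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_fd, write_fd = os.pipe()  # stand-in for a BPF object FD
    try:
        # send_fds() attaches the FD as an SCM_RIGHTS ancillary message.
        socket.send_fds(sender, [b"fd"], [write_fd])
        _msg, fds, _flags, _addr = socket.recv_fds(receiver, 16, 1)
        received_fd = fds[0]  # new FD number, same underlying open file
        os.write(received_fd, b"via received fd")
        os.close(received_fd)
        return os.read(read_fd, 64)
    finally:
        sender.close()
        receiver.close()
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(pass_fd_over_unix_socket())
```

The receiver gets a new FD number that refers to the same underlying
kernel object; nothing is copied or translated along the way.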
> >
> > Got it.
> >
> > IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving
> > userns, no ?
> >
> > I assume that is what allows setting things up in a hierarchical way...
> >
> > If I set up the environment to lock things down the line, I find it
> > strange if a received fd would allow me to do more things than what
> > was planned when I created the environment: namespaces, mounts, etc
> >
> > I think you have to add the owning userns context to the fd or
> > "token", and on the receiving part if the current userns is the same
> > or a nested one of the current userns hierarchy then allow bpf
> > operation, otherwise fail with -EACCES or something similar...
> >
>
> I think I mentioned problems with namespacing BPF itself. It's just
> fundamentally impossible due to a system-wide nature of BPF. So we can
> pretend to somehow attach/restrict BPF token to some namespace, but it
> still allows BPF programs to peek at any kernel state or user-space
> process.

I'm not referring to namespacing BPF, but about the same token that
can fly between containers...
More or less problems mentioned by Casey
https://lore.kernel.org/bpf/20230602150011.1657856-19-andrii@kernel.org/T/#m005dfd937e4fff7a8cc35036f0ce38281f01e823

I think that a token or the fd should be part of the bpffs and should
not be shared between containers or cross namespaces by default
without control... hence the suggested protection:
https://lore.kernel.org/bpf/CAEf4BzazbMqAh_Nj_geKNLshxT+4NXOCd-LkZ+sRKsbZAJ1tUw@mail.gmail.com/T/#m217d041d9ef9e02b598d5f0e1ff61043aeae57fd
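
As a rough illustration of the suggested protection, the check amounts to
an ancestry test: a token records its owning user namespace, and use is
allowed only from that namespace or one nested below it. The toy model
below is one reading of the proposal, not kernel code; `UserNs`,
`in_ns_or_descendant()` and `check_token_use()` are hypothetical names.

```python
import errno
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class UserNs:
    """Toy user namespace: just a name and a link to the parent namespace."""
    name: str
    parent: Optional["UserNs"] = None


def in_ns_or_descendant(current: UserNs, owner: UserNs) -> bool:
    """Return True if `current` is `owner` or is nested below it."""
    ns: Optional[UserNs] = current
    while ns is not None:
        if ns is owner:
            return True
        ns = ns.parent
    return False


def check_token_use(current: UserNs, token_owner: UserNs) -> int:
    """Return 0 if the token may be used from `current`, else -EACCES."""
    return 0 if in_ns_or_descendant(current, token_owner) else -errno.EACCES


if __name__ == "__main__":
    init_ns = UserNs("init")
    container_a = UserNs("a", init_ns)
    nested_a = UserNs("a/child", container_a)
    container_b = UserNs("b", init_ns)
    print(check_token_use(nested_a, container_a))     # allowed: nested below owner
    print(check_token_use(container_b, container_a))  # denied: sibling container
```

Such a check only limits where a token can be used. It does not confine
what a loaded BPF program can observe, which is the separate point being
argued in this thread.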


> So I'd rather us not pretend we can do something that we actually
> cannot enforce.

Actually it is to protect against accidental token sharing or abuse...
so completely different things.


> >
> > > >
> > > > Will BPF token translation happen if you cross the different namespaces?
> > >
> > > What does BPF token translation mean specifically? Currently it's a
> > > very simple kernel object with refcnt and a few flags, so there is
> > > nothing to translate?
> >
> > Please see above comment about the owning userns context
> >
> > > >
> > > > If the token is pinned into different bpffs, will the token share the
> > > > same context?
> > >
> > > So I was planning to allow a user process creating a BPF token to
> > > specify custom user-provided data (context). This is not in this patch
> > > set, but is it what you are asking about?
> >
> > Exactly, define what you can access inside the container... this would
> > align with Andy's suggestion "making BPF behave sensibly in that
> > container seems like it should also be necessary." I do agree on this.
> >
>
> I don't know what Andy's suggestion actually is (as I honestly can't
> make out what your proposal is, sorry; you guys are not making it easy
> on me by being pretty vague and nonspecific). But see above about
> pretending to contain BPF within a container. There is no such thing.
> BPF is system-wide.

Sorry about that. To put it quickly: can you restrict the types of bpf
programs? Can you disable or nop probes if they are running without a
process context, if the triggered probe is owned by root or by a specific
uid, or if the process is under a specific cgroup hierarchy, etc.? Are
the above possible?


> > Again I think LSM and bpf+lsm should have the final word on this too...
> >
>
> Yes, I also think that having LSM on top is beneficial. But not a
> strict requirement and more or less orthogonal.

I do think there should be LSM hooks to tighten this, as LSMs have
more context outside of BPF...


> >
> > > Regardless, pinning BPF object in BPF FS is just basically bumping a
> > > refcnt and exposes that object in a way that can be looked up through
> > > file system path (using bpf() syscall's BPF_OBJ_GET command).
> > > Underlying object isn't cloned or copied, it's exactly the same object
> > > with the same shared internal state.
> >
> > This is the part I also find strange, I can understand pinning a bpf
> > program, map, etc, but an fd that gives some access rights should be
> > part of the filesystem from the start, I don't get the extra pinning.
>
> BPF pinning of BPF token is optional. Everything still works without
> any BPF FS mount at all. It's an FD, BPF FS is just one of the means
> to pass FD to another process. I actually don't see why coupling BPF
> FS and BPF token is simpler.

I think it's better the other way around: since bpffs is per-superblock
with separate mounts, this is already solved; you just get that
special fd from the fs and pass it...


> Now, BPF token is a kernel object, with its own state. It has an FD
> associated with it. It can be passed around and provided as an
> argument to bpf() syscall. In that sense it's just like BPF
> prog/map/link, just another BPF object.
>
> > Also it seems bpffs is per superblock mount so why not allow
> > privileged to mount bpffs with the corresponding information, then
> > privileged can open the fd, set it up and pass it down the line when
> > executing the main program?  or even allow unprivileged to open it on
> > bpffs with some restrictive conditions?
> >
> > Then it would be the business of the privileged to bind mount bpffs in
> > some other places, share it, etc
>
> How is this fundamentally different from BPF token pinning by
> *privileged* process? Except we are not conflating BPF FS as a way to
> pin/get many different BPF objects with BPF token itself. In both
> cases it's up to privileged process to set up sharing of BPF token
> appropriately.

I'm not convinced about the use case of sharing BPF tokens between
containers or services...

Every container or service has its own separate bpffs, what's the
point of pinning a shared token created by a different container
compared to mounting separate bpffs with an fd token prepared to be
used for that specific container?

Then the container/service can delegate it to child processes, etc...
but sharing between containers and crossing user namespaces, mount
namespaces of such containers where bpffs is already separate in that
context? I don't see the point, and it just opens the room to token
misuse...


> >
> > Having the fd or "token" that gives access rights pinned in two
> > separate bpffs mounts seems too much, it crosses namespaces (mount,
> > userns etc), environments setup by privileged...
>
> See above, there is nothing namespaceable about BPF itself, and BPF
> token as well. If some production setup benefits from pinning one BPF
> token in multiple places, I don't see the problem with that.
>
> >
> > I would just make it per bpffs mount and that's it, nothing more. If a
> > program wants to bind mount it somewhere else then it's not a bpf
> > problem.
>
> And if some application wants to pin BPF token, why would that be BPF
> subsystem's problem as well?

The credentials, capabilities, keyring, different namespaces, etc. are
all attached to the owning user namespace. If the BPF subsystem goes
its own way and creates a token to split up CAP_BPF without following
that model, then it's definitely a BPF subsystem problem... I don't
recommend that.

It feels like this is turning into a system-wide approach to opening up
BPF functionality, which ultimately clashes with the stated argument:
delegate a subset of BPF functionality to a *trusted* unprivileged
application. My reading of delegation is that it happens within a
container/service hierarchy, nothing more.

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-14  0:23         ` Djalal Harouni
@ 2023-06-14  9:39           ` Christian Brauner
  2023-06-15 22:48             ` Andrii Nakryiko
  2023-06-15 22:47           ` Andrii Nakryiko
  1 sibling, 1 reply; 72+ messages in thread
From: Christian Brauner @ 2023-06-14  9:39 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, linux-security-module,
	keescook, lennart, cyphar, luto, kernel-team

On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote:
> On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote:
> > >
> > > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
> > > > >
> > > > > Hi Andrii,
> > > > >
> > > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > >
> > > > > > ...
> > > > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > > > >
> > > > > Is there a reason for coupling this only with the userns?
> > > >
> > > > There is no coupling. Without userns it is at least possible to grant
> > > > CAP_BPF and other capabilities from init ns. With user namespace that
> > > > becomes impossible.
> > >
> > > But these are not the same: delegate full cap vs delegate an fd mask?
> >
> > What FD mask are we talking about here? I don't recall us talking
> > about any FD masks, so this one is a bit confusing without more
> > context.
> 
> Ah err, sorry yes referring to fd token (which I assumed is a mask of
> allowed operations or something like that).
> 
> So I want the possibility to delegate the fd token in the init userns.
> 
> > >
> > > > One can argue unprivileged in init userns is the same as privileged in
> > > > nested userns.
> > > Getting to delegate fd in init userns, then in nested ones seems logical...
> >
> > Again, sorry, I'm not following. Can you please elaborate what you mean?
> 
> I mean can we use the fd token in the init user namespace too? not
> only in the nested user namespaces but in the first one? Sorry I
> didn't check the code.
> 
> 
> > >
> > > > > The "trusted unprivileged" assumed by systemd can be in init userns?
> > > >
> > > > It doesn't have to be systemd, but yes, BPF token can be created only
> > > > when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
> > > > on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
> > > > of commands).
> > >
> > > I'm more into getting fd delegation work also in the first init userns...
> > >
> > > I can't understand why it's not possible or doable?
> > >
> >
> > I don't know what you are proposing, as I mentioned above, so it's
> > hard to answer this question.
> >
> 
> 
> > > > >
> > > > >
> > > > > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > > > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > > > > approach, but can be combined with LSM hooks for very fine-grained security
> > > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > > > > > particular custom BPF LSM programs) were briefly described in a recent LSF/MM/BPF
> > > > > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > > > > > (context), which in combination with BPF LSM would allow implementing very
> > > > > > > dynamic and fine-grained custom security policies on top of BPF token. In the
> > > > > > interest of minimizing API surface area discussions this is going to be
> > > > > > added in follow up patches, as it's not essential to the fundamental concept
> > > > > > of delegatable BPF token.
> > > > > >
> > > > > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > > > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > > > > allowing multiple independent instances of them, each with its own set of
> > > > > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > > > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > > > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > > > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > > > >
> > > > > > What's the use case for transferring over unix domain sockets?
> > > >
> > > > I'm not sure I understand the question. Unix domain socket
> > > > (specifically its SCM_RIGHTS ancillary message) allows to transfer
> > > > files between processes, which is one way to pass BPF object (like
> > > > prog/map/link, and now token). BPF FS is the other one. In practice
> > > > it's usually BPF FS, but there is no presumption about how file
> > > > reference is transferred.
> > >
> > > Got it.
> > >
> > > IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving
> > > userns, no ?
> > >
> > > > I assume that is what allows setting things up in a hierarchical way...
> > >
> > > If I set up the environment to lock things down the line, I find it
> > > strange if a received fd would allow me to do more things than what
> > > was planned when I created the environment: namespaces, mounts, etc
> > >
> > > I think you have to add the owning userns context to the fd or
> > > "token", and on the receiving part if the current userns is the same
> > > or a nested one of the current userns hierarchy then allow bpf
> > > > operation, otherwise fail with -EACCES or something similar...
> > >
> >
> > I think I mentioned problems with namespacing BPF itself. It's just
> > fundamentally impossible due to a system-wide nature of BPF. So we can
> > pretend to somehow attach/restrict BPF token to some namespace, but it
> > still allows BPF programs to peek at any kernel state or user-space
> > process.
> 
> I'm not referring to namespacing BPF, but about the same token that
> can fly between containers...
> More or less problems mentioned by Casey
> https://lore.kernel.org/bpf/20230602150011.1657856-19-andrii@kernel.org/T/#m005dfd937e4fff7a8cc35036f0ce38281f01e823
> 
> I think that a token or the fd should be part of the bpffs and should
> not be shared between containers or cross namespaces by default
> without control... hence the suggested protection:
> https://lore.kernel.org/bpf/CAEf4BzazbMqAh_Nj_geKNLshxT+4NXOCd-LkZ+sRKsbZAJ1tUw@mail.gmail.com/T/#m217d041d9ef9e02b598d5f0e1ff61043aeae57fd
> 
> 
> > So I'd rather us not pretend we can do something that we actually
> > cannot enforce.
> 
> Actually it is to protect against accidental token sharing or abuse...
> so completely different things.
> 
> 
> > >
> > > > >
> > > > > Will BPF token translation happen if you cross the different namespaces?
> > > >
> > > > What does BPF token translation mean specifically? Currently it's a
> > > > very simple kernel object with refcnt and a few flags, so there is
> > > > nothing to translate?
> > >
> > > Please see above comment about the owning userns context
> > >
> > > > >
> > > > > If the token is pinned into different bpffs, will the token share the
> > > > > same context?
> > > >
> > > > So I was planning to allow a user process creating a BPF token to
> > > > specify custom user-provided data (context). This is not in this patch
> > > > set, but is it what you are asking about?
> > >
> > > Exactly, define what you can access inside the container... this would
> > > align with Andy's suggestion "making BPF behave sensibly in that
> > > container seems like it should also be necessary." I do agree on this.
> > >
> >
> > I don't know what Andy's suggestion actually is (as I honestly can't
> > make out what your proposal is, sorry; you guys are not making it easy
> > on me by being pretty vague and nonspecific). But see above about
> > pretending to contain BPF within a container. There is no such thing.
> > BPF is system-wide.
> 
> Sorry about that, I can quickly put: you may restrict types of bpf
> programs, you may disable or nop probes if they are running without a
> process context, if the triggered probe is owned by root by specific
> uid? if the process is under a specific cgroup hierarchy etc... Are
> the above possible?
> 
> 
> > > Again I think LSM and bpf+lsm should have the final word on this too...
> > >
> >
> > Yes, I also think that having LSM on top is beneficial. But not a
> > strict requirement and more or less orthogonal.
> 
> I do think there should be LSM hooks to tighten this, as LSMs have
> more context outside of BPF...
> 
> 
> > >
> > > > Regardless, pinning BPF object in BPF FS is just basically bumping a
> > > > refcnt and exposes that object in a way that can be looked up through
> > > > file system path (using bpf() syscall's BPF_OBJ_GET command).
> > > > Underlying object isn't cloned or copied, it's exactly the same object
> > > > with the same shared internal state.
> > >
> > > This is the part I also find strange, I can understand pinning a bpf
> > > program, map, etc, but an fd that gives some access rights should be
> > > part of the filesystem from the start, I don't get the extra pinning.
> >
> > BPF pinning of BPF token is optional. Everything still works without
> > any BPF FS mount at all. It's an FD, BPF FS is just one of the means
> > to pass FD to another process. I actually don't see why coupling BPF
> > FS and BPF token is simpler.
> 
> I think it's better the other way around since bpffs is per super
> block and separate mount then it is already solved, you just get that
> special fd from the fs and pass it...
> 
> 
> > Now, BPF token is a kernel object, with its own state. It has an FD
> > associated with it. It can be passed around and provided as an
> > argument to bpf() syscall. In that sense it's just like BPF
> > prog/map/link, just another BPF object.
> >
> > > Also it seems bpffs is per superblock mount so why not allow
> > > privileged to mount bpffs with the corresponding information, then
> > > privileged can open the fd, set it up and pass it down the line when
> > > executing the main program?  or even allow unprivileged to open it on
> > > bpffs with some restrictive conditions?
> > >
> > > Then it would be the business of the privileged to bind mount bpffs in
> > > some other places, share it, etc
> >
> > How is this fundamentally different from BPF token pinning by
> > *privileged* process? Except we are not conflating BPF FS as a way to
> > pin/get many different BPF objects with BPF token itself. In both
> > cases it's up to privileged process to set up sharing of BPF token
> > appropriately.
> 
> I'm not convinced about the use case of sharing BPF tokens between
> containers or services...
> 
> Every container or service has its own separate bpffs, what's the
> point of pinning a shared token created by a different container
> compared to mounting separate bpffs with an fd token prepared to be
> used for that specific container?
> 
> Then the container/service can delegate it to child processes, etc...
> but sharing between containers and crossing user namespaces, mount
> namespaces of such containers where bpffs is already separate in that
> context? I don't see the point, and it just opens the room to token
> misuse...
> 
> 
> > >
> > > Having the fd or "token" that gives access rights pinned in two
> > > separate bpffs mounts seems too much, it crosses namespaces (mount,
> > > userns etc), environments setup by privileged...
> >
> > See above, there is nothing namespaceable about BPF itself, and BPF
> > token as well. If some production setup benefits from pinning one BPF
> > token in multiple places, I don't see the problem with that.
> >
> > >
> > > I would just make it per bpffs mount and that's it, nothing more. If a
> > > program wants to bind mount it somewhere else then it's not a bpf
> > > problem.
> >
> > And if some application wants to pin BPF token, why would that be BPF
> > subsystem's problem as well?
> 
> The credentials, capabilities, keyring, different namespaces, etc are
> all attached to the owning user namespace, if the BPF subsystem goes
> its own way and creates a token to split up CAP_BPF without following
> that model, then it's definitely a BPF subsystem problem...  I don't
> recommend that.
> 
> Feels it's going more of a system-wide approach opening BPF
> functionality where ultimately it clashes with the argument: delegate
> a subset of BPF functionality to a *trusted* unprivileged application.
> My reading of delegation is within a container/service hierarchy
> nothing more.

You're making the exact arguments that Lennart, Aleksa, and I have been
making in the LSFMM presentation about this topic. It's even recorded:

https://youtu.be/4CCRTWEZLpw?t=1546

So we fully agree with you here.

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-12 22:08           ` Andrii Nakryiko
  2023-06-13 21:48             ` Hao Luo
@ 2023-06-14 12:06             ` Toke Høiland-Jørgensen
  2023-06-15 22:55               ` Andrii Nakryiko
  1 sibling, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-14 12:06 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Mon, Jun 12, 2023 at 3:49 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >>
>> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> >>
>> >> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >> >>
>> >> >> Andrii Nakryiko <andrii@kernel.org> writes:
>> >> >>
>> >> >> > This patch set introduces new BPF object, BPF token, which allows to delegate
>> >> >> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
>> >> >> > systemd or any other container manager) to a *trusted* unprivileged
>> >> >> > application. Trust is the key here. This functionality is not about allowing
>> >> >> > unconditional unprivileged BPF usage. Establishing trust, though, is
>> >> >> > completely up to the discretion of respective privileged application that
>> >> >> > would create a BPF token.
>> >> >>
>> >> >> I am not convinced that this token-based approach is a good way to solve
>> >> >> this: having the delegation mechanism be one where you can basically
>> >> >> only grant a perpetual delegation with no way to retract it, no way to
>> >> >> check what exactly it's being used for, and that is transitive (can be
>> >> >> passed on to others with no restrictions) seems like a recipe for
>> >> >> disaster. I believe this was basically the point Casey was making as
>> >> >> well in response to v1.
>> >> >
>> >> > Most of this can be added, if we really need to. Ability to revoke BPF
>> >> > token is easy to implement (though of course it will apply only for
>> >> > subsequent operations). We can allocate ID for BPF token just like we
>> >> > do for BPF prog/map/link and let tools iterate and fetch information
>> >> > about it. As for controlling who's passing what and where, I don't
>> >> > think the situation is different for any other FD-based mechanism. You
>> >> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
>> >> > or BPF FS, and that application can keep doing the same to other
>> >> > processes.
>> >>
>> >> No, but every other fd-based mechanism is limited in scope. E.g., if you
>> >> pass a map fd that's one specific map that can be passed around, with a
>> >> token it's all operations (of a specific type) which is way broader.
>> >
>> > It's not black and white. Once you have a BPF program FD, you can
>> > attach it many times, for example, and cause regressions. Sure, here
>> > we are talking about creating multiple BPF maps or loading multiple
>> > BPF programs, so it's wider in scope, but still, it's not that
>> > fundamentally different.
>>
>> Right, but the difference is that a single BPF program is a known
>> entity, so even if the application you pass the fd to can attach it
>> multiple times, it can't make it do new things (e.g., bpf_probe_read()
>> stuff it is not supposed to). Whereas with bpf_token you have no such
>> guarantee.
>
> Sure, I'm not claiming BPF token is just like passing BPF program FD
> around. My point is that anything in the kernel that is representable
> by FD can be passed around to an unintended process through
> SCM_RIGHTS. And if you want to have tighter control over who's passing
> what, you'd probably need LSM. But it's not a requirement.
>
> With BPF token it is important to trust the application you are
> passing BPF token to. This is not a mechanism to just freely pass
> around the ability to do BPF. You do it only to applications you
> control.

Trust is not binary, though. "Do I trust this application to perform
this specific action" is different from "do I trust this application to
perform any action in the future". A security mechanism should grant the
minimum privileges required to perform the operation; this token thing
encourages (defaults to) broader grants, which is worrisome.

> With user namespaces, if we could grant CAP_BPF and co to use BPF,
> we'd do that. But we can't. BPF token at least gives us this
> opportunity.

If the use case is to punch holes in the user namespace isolation I feel
like that is better solved at the user namespace level than the BPF
subsystem level...

-Toke


(Ran out of time and I'm about to leave for PTO, so dropping the RPC
discussion for now)


* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-14  0:23         ` Djalal Harouni
  2023-06-14  9:39           ` Christian Brauner
@ 2023-06-15 22:47           ` Andrii Nakryiko
  1 sibling, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-15 22:47 UTC (permalink / raw)
  To: Djalal Harouni
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Tue, Jun 13, 2023 at 5:23 PM Djalal Harouni <tixxdz@gmail.com> wrote:
>
> On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote:
> > >
> > > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
> > > > >
> > > > > Hi Andrii,
> > > > >
> > > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > >
> > > > > > ...
> > > > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > > > >
> > > > > Is there a reason for coupling this only with the userns?
> > > >
> > > > There is no coupling. Without userns it is at least possible to grant
> > > > CAP_BPF and other capabilities from init ns. With user namespace that
> > > > becomes impossible.
> > >
> > > But these are not the same: delegate full cap vs delegate an fd mask?
> >
> > What FD mask are we talking about here? I don't recall us talking
> > about any FD masks, so this one is a bit confusing without more
> > context.
>
> Ah err, sorry yes referring to fd token (which I assumed is a mask of
> allowed operations or something like that).

Ok, so your "FD masks" aka "fd token" is actually a BPF token as
referred to in this patch set, right? Thanks for clarifying!

>
> So I want the possibility to delegate the fd token in the init userns.
>

So as it is right now, BPF token has no association with userns, so
yes, you can delegate it in init userns. It's just a kernel object
with its own FD, which you pass to bpf() syscall operations.
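
Since the token is just an FD, handing it to another process works the
same way as for any other FD. Below is a minimal sketch of the
SCM_RIGHTS route over a Unix domain socket. The helper names send_fd()
and recv_fd() are made up for illustration, and nothing here is
BPF-specific: any open FD (a pipe, a file, a BPF map) can stand in for
the token FD, which would come from the BPF_TOKEN_CREATE command
proposed in this patch set.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error (errno is set by sendmsg()). */
int send_fd(int sock, int fd)
{
	char dummy = 'x';	/* one byte of ordinary data carries the control message */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {			/* union guarantees cmsghdr alignment of the buffer */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor sent with send_fd().
 * Returns the new FD (a fresh number in the receiving process that
 * refers to the same open file), or -1 on error. */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver ends up with its own FD number for the same underlying
kernel object, which is exactly why the question of who may receive a
token FD matters in this thread.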

> > >
> > > One can argue unprivileged in init userns is the same privileged in
> > > nested userns
> > > Getting to delegate fd in init userns, then in nested ones seems logical...
> >
> > Again, sorry, I'm not following. Can you please elaborate what you mean?
>
> I mean can we use the fd token in the init user namespace too? not
> only in the nested user namespaces but in the first one? Sorry I
> didn't check the code.

Yes, absolutely.

>
>
> > >
> > > > > The "trusted unprivileged" assumed by systemd can be in init userns?
> > > >
> > > > It doesn't have to be systemd, but yes, BPF token can be created only
> > > > when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
> > > > on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
> > > > of commands).
> > >
> > > I'm more into getting fd delegation work also in the first init userns...
> > >
> > > I can't understand why it's not possible or doable?
> > >
> >
> > I don't know what you are proposing, as I mentioned above, so it's
> > hard to answer this question.
> >
>
>
> > > > >
> > > > >
> > > > > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > > > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > > > > approach, but can be combined with LSM hooks for very fine-grained security
> > > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > > > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > > > > (context), which in combination with BPF LSM would allow implementing a very
> > > > > > dynamic and fine-granular custom security policies on top of BPF token. In the
> > > > > > interest of minimizing API surface area discussions this is going to be
> > > > > > added in follow up patches, as it's not essential to the fundamental concept
> > > > > > of delegatable BPF token.
> > > > > >
> > > > > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > > > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > > > > allowing multiple independent instances of them, each with its own set of
> > > > > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > > > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > > > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > > > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > > > >
> > > > > What's the use case for transfering over unix domain sockets?
> > > >
> > > > I'm not sure I understand the question. Unix domain socket
> > > > (specifically its SCM_RIGHTS ancillary message) allows to transfer
> > > > files between processes, which is one way to pass BPF object (like
> > > > prog/map/link, and now token). BPF FS is the other one. In practice
> > > > it's usually BPF FS, but there is no presumption about how file
> > > > reference is transferred.
> > >
> > > Got it.
> > >
> > > IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving
> > > userns, no ?
> > >
> > > I assume such which allows to set up things in a hierarchical way...
> > >
> > > If I set up the environment to lock things down the line, I find it
> > > strange if a received fd would allow me to do more things than what
> > > was planned when I created the environment: namespaces, mounts, etc
> > >
> > > I think you have to add the owning userns context to the fd or
> > > "token", and on the receiving part if the current userns is the same
> > > or a nested one of the current userns hierarchy then allow bpf
> > > operation, otherwise fail with -EACCESS or something similar...
> > >
> >
> > I think I mentioned problems with namespacing BPF itself. It's just
> > fundamentally impossible due to a system-wide nature of BPF. So we can
> > pretend to somehow attach/restrict BPF token to some namespace, but it
> > still allows BPF programs to peek at any kernel state or user-space
> > process.
>
> I'm not referring to namespacing BPF, but about the same token that
> can fly between containers...
> More or less problems mentioned by Casey
> https://lore.kernel.org/bpf/20230602150011.1657856-19-andrii@kernel.org/T/#m005dfd937e4fff7a8cc35036f0ce38281f01e823
>
> I think that a token or the fd should be part of the bpffs and should
> not be shared between containers or crosse namespaces by default
> without control... hence the suggested protection:
> https://lore.kernel.org/bpf/CAEf4BzazbMqAh_Nj_geKNLshxT+4NXOCd-LkZ+sRKsbZAJ1tUw@mail.gmail.com/T/#m217d041d9ef9e02b598d5f0e1ff61043aeae57fd
>

Ok, cool, thanks for clarifying! I think we are getting somewhere in
this discussion. It seems like you are not worried about the BPF token
concept per se, rather that it's not bound to namespace and thus can
be "leaked" outside of the intended container. Got it. This makes it
more concrete to talk about, but I'll reply in the email to Christian,
to keep my reply in one place.

>
> > So I'd rather us not pretend we can do something that we actually
> > cannot enforce.
>
> Actually it is to protect against accidental token sharing or abuse...
> so completely different things.
>

Ok, got it. I was worried that there is a perception that BPF token
allows sandboxing a BPF application somehow (which is not the case), so
I wanted to make sure we are not conflating things. With your latest
reply it's clear that the problem most of the discussion revolves
around is containing BPF token *sharing* within the container.


>
> > >
> > > > >
> > > > > Will BPF token translation happen if you cross the different namespaces?
> > > >
> > > > What does BPF token translation mean specifically? Currently it's a
> > > > very simple kernel object with refcnt and a few flags, so there is
> > > > nothing to translate?
> > >
> > > Please see above comment about the owning userns context
> > >
> > > > >
> > > > > If the token is pinned into different bpffs, will the token share the
> > > > > same context?
> > > >
> > > > So I was planning to allow a user process creating a BPF token to
> > > > specify custom user-provided data (context). This is not in this patch
> > > > set, but is it what you are asking about?
> > >
> > > Exactly, define what you can access inside the container... this would
> > > align with Andy's suggestion "making BPF behave sensibly in that
> > > container seems like it should also be necessary." I do agree on this.
> > >
> >
> > I don't know what Andy's suggestion actually is (as I honestly can't
> > make out what your proposal is, sorry; you guys are not making it easy
> > on me by being pretty vague and nonspecific). But see above about
> > pretending to contain BPF within a container. There is no such thing.
> > BPF is system-wide.
>
> Sorry about that, I can quickly put: you may restrict types of bpf
> programs, you may disable or nop probes if they are running without a
> process context, if the triggered probe is owned by root by specific
> uid? if the process is under a specific cgroup hierarchy etc... Are
> the above possible?

Yes to restricting BPF program types. Definitely "No" for "probes
if they are running without a process context, if the triggered probe
is owned by root by specific uid". "Maybe" for "under a specific
cgroup hierarchy", which we could add in some form, but we can only
control where a BPF program is attached. Still, nothing would prevent a
BPF program from reading arbitrary kernel memory. But at least such BPF
programs won't be able to control, say, network traffic of unintended
cgroups. That last part is not implemented in this patch set, though,
and should be discussed separately.
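
To make the "restrict BPF program types" part concrete, here is a
minimal user-space model of such a check. Everything in it is
hypothetical: struct token_model, the MODEL_PROG_* values and
token_model_check_prog() are illustration-only names, not the kernel's
struct bpf_token or the UAPI enum, and the actual layout of the
restrictions in the patches may differ.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a token's restrictions: bit N set means
 * program type N may be loaded through this token. */
struct token_model {
	uint64_t allowed_prog_types;
};

/* Illustration-only program type numbers (not the UAPI values). */
enum model_prog_type {
	MODEL_PROG_SOCKET_FILTER = 1,
	MODEL_PROG_KPROBE = 2,
	MODEL_PROG_XDP = 3,
};

/* Returns 0 if the token allows loading prog_type, -EPERM if it does
 * not, and -EINVAL for a missing token or an out-of-range type. */
int token_model_check_prog(const struct token_model *tok,
			   unsigned int prog_type)
{
	if (!tok || prog_type >= 64)
		return -EINVAL;
	if (tok->allowed_prog_types & (UINT64_C(1) << prog_type))
		return 0;
	return -EPERM;
}
```

A token created with only the MODEL_PROG_XDP bit set would pass the
check for XDP and fail it for kprobes. Note that this only limits what
can be loaded; it says nothing about what an allowed program may read
once it runs.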

>
>
> > > Again I think LSM and bpf+lsm should have the final word on this too...
> > >
> >
> > Yes, I also think that having LSM on top is beneficial. But not a
> > strict requirement and more or less orthogonal.
>
> I do think there should be LSM hooks to tighten this, as LSMs have
> more context outside of BPF...

Agreed, but it should be added on top as a separate follow up patch set.

>
>
> > >
> > > > Regardless, pinning BPF object in BPF FS is just basically bumping a
> > > > refcnt and exposes that object in a way that can be looked up through
> > > > file system path (using bpf() syscall's BPF_OBJ_GET command).
> > > > Underlying object isn't cloned or copied, it's exactly the same object
> > > > with the same shared internal state.
> > >
> > > This is the part I also find strange, I can understand pinning a bpf
> > > program, map, etc, but an fd that gives some access rights should be
> > > part of the filesystem from the start, I don't get the extra pinning.
> >
> > BPF pinning of BPF token is optional. Everything still works without
> > any BPF FS mount at all. It's an FD, BPF FS is just one of the means
> > to pass FD to another process. I actually don't see why coupling BPF
> > FS and BPF token is simpler.
>
> I think it's better the other way around since bpffs is per super
> block and separate mount then it is already solved, you just get that
> special fd from the fs and pass it...
>

Ok, I see your point, I have a slightly alternative proposal for some
parts of it, but I'll explain in reply to Christian.

>
> > Now, BPF token is a kernel object, with its own state. It has an FD
> > associated with it. It can be passed around and provided as an
> > argument to bpf() syscall. In that sense it's just like BPF
> > prog/map/link, just another BPF object.
> >
> > > Also it seems bpffs is per superblock mount so why not allow
> > > privileged to mount bpffs with the corresponding information, then
> > > privileged can open the fd, set it up and pass it down the line when
> > > executing the main program?  or even allow unprivileged to open it on
> > > bpffs with some restrictive conditions?
> > >
> > > Then it would be the business of the privileged to bind mount bpffs in
> > > some other places, share it, etc
> >
> > How is this fundamentally different from BPF token pinning by
> > *privileged* process? Except we are not conflating BPF FS as a way to
> > pin/get many different BPF objects with BPF token itself. In both
> > cases it's up to privileged process to set up sharing of BPF token
> > appropriately.
>
> I'm not convinced about the use case of sharing BPF tokens between
> containers or services...
>
> Every container or service has its own separate bpffs, what's the
> point of pinning a shared token created by a different container
> compared to mounting separate bpffs with an fd token prepared to be
> used for that specific container?
>
> Then the container/service can delegate it to child processes, etc...
> but sharing between containers and crossing user namespaces, mount
> namespaces of such containers where bpffs is already separate in that
> context? I don't see the point, and it just opens the room to token
> misuse...
>

I don't have a specific use case or need for this. It's more of a
principle that an API should not assume or dictate how exactly
user-space is going to use it, so I'd say we shouldn't prevent whatever
crazy scenario doesn't violate common sense.

But I get that lots of people are concerned about BPF token leaking
into unintended neighboring containers, so maybe we should bake in a
mechanism to make this impossible. Again, let's talk in the next
reply.

>
> > >
> > > Having the fd or "token" that gives access rights pinned in two
> > > separate bpffs mounts seems too much, it crosses namespaces (mount,
> > > userns etc), environments setup by privileged...
> >
> > See above, there is nothing namespaceable about BPF itself, and BPF
> > token as well. If some production setup benefits from pinning one BPF
> > token in multiple places, I don't see the problem with that.
> >
> > >
> > > I would just make it per bpffs mount and that's it, nothing more. If a
> > > program wants to bind mount it somewhere else then it's not a bpf
> > > problem.
> >
> > And if some application wants to pin BPF token, why would that be BPF
> > subsystem's problem as well?
>
> The credentials, capabilities, keyring, different namespaces, etc are
> all attached to the owning user namespace, if the BPF subsystem goes
> its own way and creates a token to split up CAP_BPF without following
> that model, then it's definitely a BPF subsystem problem...  I don't
> recommend that.
>
> Feels it's going more of a system-wide approach opening BPF
> functionality where ultimately it clashes with the argument: delegate
> a subset of BPF functionality to a *trusted* unprivileged application.
> My reading of delegation is within a container/service hierarchy
> nothing more.

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-14  9:39           ` Christian Brauner
@ 2023-06-15 22:48             ` Andrii Nakryiko
  2023-06-23 22:18               ` Daniel Borkmann
  0 siblings, 1 reply; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-15 22:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Djalal Harouni, Andrii Nakryiko, bpf, linux-security-module,
	keescook, lennart, cyphar, luto, kernel-team, Sargun Dhillon

On Wed, Jun 14, 2023 at 2:39 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote:
> > On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote:
> > > >
> > > > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
> > > > > >
> > > > > > Hi Andrii,
> > > > > >
> > > > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > > >
> > > > > > > ...
> > > > > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > > > > >
> > > > > > Is there a reason for coupling this only with the userns?
> > > > >
> > > > > There is no coupling. Without userns it is at least possible to grant
> > > > > CAP_BPF and other capabilities from init ns. With user namespace that
> > > > > becomes impossible.
> > > >
> > > > But these are not the same: delegate full cap vs delegate an fd mask?
> > >
> > > What FD mask are we talking about here? I don't recall us talking
> > > about any FD masks, so this one is a bit confusing without more
> > > context.
> >
> > Ah err, sorry yes referring to fd token (which I assumed is a mask of
> > allowed operations or something like that).
> >
> > So I want the possibility to delegate the fd token in the init userns.
> >
> > > >
> > > > One can argue unprivileged in init userns is the same privileged in
> > > > nested userns
> > > > Getting to delegate fd in init userns, then in nested ones seems logical...
> > >
> > > Again, sorry, I'm not following. Can you please elaborate what you mean?
> >
> > I mean can we use the fd token in the init user namespace too? not
> > only in the nested user namespaces but in the first one? Sorry I
> > didn't check the code.
> >

[...]

> >
> > > >
> > > > Having the fd or "token" that gives access rights pinned in two
> > > > separate bpffs mounts seems too much, it crosses namespaces (mount,
> > > > userns etc), environments setup by privileged...
> > >
> > > See above, there is nothing namespaceable about BPF itself, and BPF
> > > token as well. If some production setup benefits from pinning one BPF
> > > token in multiple places, I don't see the problem with that.
> > >
> > > >
> > > > I would just make it per bpffs mount and that's it, nothing more. If a
> > > > program wants to bind mount it somewhere else then it's not a bpf
> > > > problem.
> > >
> > > And if some application wants to pin BPF token, why would that be BPF
> > > subsystem's problem as well?
> >
> > The credentials, capabilities, keyring, different namespaces, etc are
> > all attached to the owning user namespace, if the BPF subsystem goes
> > its own way and creates a token to split up CAP_BPF without following
> > that model, then it's definitely a BPF subsystem problem...  I don't
> > recommend that.
> >
> > Feels it's going more of a system-wide approach opening BPF
> > functionality where ultimately it clashes with the argument: delegate
> > a subset of BPF functionality to a *trusted* unprivileged application.
> > My reading of delegation is within a container/service hierarchy
> > nothing more.
>
> You're making the exact arguments that Lennart, Aleksa, and I have been
> making in the LSFMM presentation about this topic. It's even recorded:

Alright, so (I think) I now have a pretty good feel for what the main
concerns are, and why people are trying to push this to be an FS. And
it's not so much that BPF token grants bpf() syscall usage to unpriv
(but trusted) workloads or that BPF itself is not namespaceable. The
main worry is that BPF token, once issued, could be
illegally/uncontrollably passed outside of the container, intentionally
or not. And by having this association with mount namespace (through BPF
FS) we automatically limit the sharing to only the container that has
access to that BPF FS.

So I agree that it makes sense to have this mount namespace
association, but I would also like to keep BPF token a separate
entity from BPF FS itself, and have the ability to expose multiple
different BPF tokens in a single BPF FS instance. I think the
latter is important.

So how about this slight modification: when a BPF token is created
using BPF_TOKEN_CREATE command, the user has to provide an FD for
"associated" BPF FS instance (superblock). What that does is allow a
BPF token to be created with its BPF FS and/or mount namespace
association set in stone. After that the BPF token can only be pinned in
that BPF FS instance and cannot leave the boundaries of that mount
namespace (specific details to be worked out, this is a new area for me,
so I'm sorry if I'm missing nuances).

What this slight tweak gives us is that we can still have multiple BPF
token instances within a single BPF FS. A token is still
pinnable/gettable through the common bpf() syscall's
BPF_OBJ_PIN/BPF_OBJ_GET commands. You can still have more nuanced file
permissions, and getting a BPF token can be controlled further through
LSM. Also we still get to use an extensible and familiar (to BPF users)
bpf_attr binary approach. Basically, it is very much native to the BPF
subsystem, but it is mount namespace-bound, as was requested by
proponents of merging BPF token and BPF FS together.

I assume that this BPF FS fd can be fetched using fsopen() or fspick()
syscalls, is that right?

WDYT? Does that sound like it would address all the above concerns?
Please point to any important details I might be missing (as I
mentioned, very unfamiliar territory).

>
> https://youtu.be/4CCRTWEZLpw?t=1546
>
> So we fully agree with you here.

I actually just rewatched that entire discussion. :) And after talking
about BPF token at length in the halls of the conference and in email
discussions on this patch set, it was very useful to listen again to
all the finer points that were made back then. Thanks for the
reminder and the link.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-14 12:06             ` Toke Høiland-Jørgensen
@ 2023-06-15 22:55               ` Andrii Nakryiko
  0 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-15 22:55 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, bpf, linux-security-module, keescook, brauner,
	lennart, cyphar, luto, kernel-team

On Wed, Jun 14, 2023 at 5:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Mon, Jun 12, 2023 at 3:49 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >>
> >> > On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> >>
> >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >> >>
> >> >> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> >> >>
> >> >> >> Andrii Nakryiko <andrii@kernel.org> writes:
> >> >> >>
> >> >> >> > This patch set introduces new BPF object, BPF token, which allows to delegate
> >> >> >> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> >> >> >> > systemd or any other container manager) to a *trusted* unprivileged
> >> >> >> > application. Trust is the key here. This functionality is not about allowing
> >> >> >> > unconditional unprivileged BPF usage. Establishing trust, though, is
> >> >> >> > completely up to the discretion of respective privileged application that
> >> >> >> > would create a BPF token.
> >> >> >>
> >> >> >> I am not convinced that this token-based approach is a good way to solve
> >> >> >> this: having the delegation mechanism be one where you can basically
> >> >> >> only grant a perpetual delegation with no way to retract it, no way to
> >> >> >> check what exactly it's being used for, and that is transitive (can be
> >> >> >> passed on to others with no restrictions) seems like a recipe for
> >> >> >> disaster. I believe this was basically the point Casey was making as
> >> >> >> well in response to v1.
> >> >> >
> >> >> > Most of this can be added, if we really need to. Ability to revoke BPF
> >> >> > token is easy to implement (though of course it will apply only for
> >> >> > subsequent operations). We can allocate ID for BPF token just like we
> >> >> > do for BPF prog/map/link and let tools iterate and fetch information
> >> >> > about it. As for controlling who's passing what and where, I don't
> >> >> > think the situation is different for any other FD-based mechanism. You
> >> >> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
> >> >> > or BPF FS, and that application can keep doing the same to other
> >> >> > processes.
> >> >>
> >> >> No, but every other fd-based mechanism is limited in scope. E.g., if you
> >> >> pass a map fd that's one specific map that can be passed around, with a
> >> >> token it's all operations (of a specific type) which is way broader.
> >> >
> >> > It's not black and white. Once you have a BPF program FD, you can
> >> > attach it many times, for example, and cause regressions. Sure, here
> >> > we are talking about creating multiple BPF maps or loading multiple
> >> > BPF programs, so it's wider in scope, but still, it's not that
> >> > fundamentally different.
> >>
> >> Right, but the difference is that a single BPF program is a known
> >> entity, so even if the application you pass the fd to can attach it
> >> multiple times, it can't make it do new things (e.g., bpf_probe_read()
> >> stuff it is not supposed to). Whereas with bpf_token you have no such
> >> guarantee.
> >
> > Sure, I'm not claiming BPF token is just like passing BPF program FD
> > around. My point is that anything in the kernel that is representable
> > by FD can be passed around to an unintended process through
> > SCM_RIGHTS. And if you want to have tighter control over who's passing
> > what, you'd probably need LSM. But it's not a requirement.
> >
> > With BPF token it is important to trust the application you are
> > passing BPF token to. This is not a mechanism to just freely pass
> > around the ability to do BPF. You do it only to applications you
> > control.
>
> Trust is not binary, though. "Do I trust this application to perform
> this specific action" is different from "do I trust this application to
> perform any action in the future". A security mechanism should grant the
> minimum required privileges required to perform the operation; this
> token thing encourages (defaults to) broader grants, which is worrysome.

BPF token defaults to not allowing anything, unless you explicitly
allow commands/progs/maps. If you don't set allow_cmds, you literally
get a useless BPF token that grants you nothing.
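
To illustrate the default-deny behavior, here is a minimal sketch in
plain C. It is a toy model, not the code from this patch set: the
struct, the helper, and the TOY_* names are hypothetical, and only the
idea of an allow_cmds bitmask indexed by bpf() command number is taken
from the discussion above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical command numbers, chosen to mirror enum bpf_cmd. */
enum { TOY_BPF_MAP_CREATE = 0, TOY_BPF_PROG_LOAD = 5 };

/* Hypothetical stand-in for a token's delegation state; not the kernel struct. */
struct toy_bpf_token {
	uint64_t allow_cmds; /* bit N set => bpf() command number N is delegated */
};

/* Default-deny: a command is allowed only if its bit was explicitly set. */
static bool toy_token_allows_cmd(const struct toy_bpf_token *tok, unsigned int cmd)
{
	if (!tok || cmd >= 64)
		return false;
	return (tok->allow_cmds >> cmd) & 1;
}
```

In this model a zero-initialized token rejects every command, and a
token created with only the TOY_BPF_MAP_CREATE bit set still rejects
TOY_BPF_PROG_LOAD.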

>
> > With user namespaces, if we could grant CAP_BPF and co to use BPF,
> > we'd do that. But we can't. BPF token at least gives us this
> > opportunity.
>
> If the use case is to punch holes in the user namespace isolation I feel
> like that is better solved at the user namespace level than the BPF
> subsystem level...
>
> -Toke
>
>
> (Ran out of time and I'm about to leave for PTO, so dropping the RPC
> discussion for now)
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-09 19:08   ` Andrii Nakryiko
@ 2023-06-19 17:40     ` Andy Lutomirski
  2023-06-21 23:48       ` Andrii Nakryiko
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2023-06-19 17:40 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team



On Fri, Jun 9, 2023, at 12:08 PM, Andrii Nakryiko wrote:
> On Fri, Jun 9, 2023 at 11:32 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Wed, Jun 7, 2023, at 4:53 PM, Andrii Nakryiko wrote:
>> > This patch set introduces new BPF object, BPF token, which allows to delegate
>> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
>> > systemd or any other container manager) to a *trusted* unprivileged
>> > application. Trust is the key here. This functionality is not about allowing
>> > unconditional unprivileged BPF usage. Establishing trust, though, is
>> > completely up to the discretion of respective privileged application that
>> > would create a BPF token.
>> >
>>
>> I skimmed the description and the LSFMM slides.
>>
>> Years ago, I sent out a patch set to start down the path of making the bpf() API make sense when used in less-privileged contexts (regarding access control of BPF objects and such).  It went nowhere.
>>
>> Where does BPF token fit in?  Does a kernel with these patches applied actually behave sensibly if you pass a BPF token into a container?
>
> Yes?.. In the sense that it is possible to create BPF programs and BPF
> maps from inside the container (with BPF token). Right now under user
> namespace it's impossible no matter what you do.

I have no problem with creating BPF maps inside a container, but I think the maps should *be in the container*.

My series wasn’t about unprivileged BPF per se.  It was about updating the existing BPF permission model so that it made sense in a context in which it had multiple users that didn’t trust each other.

>
>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
>
> BPF is still a privileged thing. You can't just say that any
> unprivileged application should be able to use BPF. That's why BPF
> token is about trusting unpriv application in a controlled environment
> (production) to not do something crazy. It can be enforced further
> through LSM usage, but in a lot of cases, when dealing with internal
> production applications it's enough to have a proper application
> design and rely on code review process to avoid any negative effects.

We really shouldn’t be creating new kinds of privileged containers that do uncontained things.

If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.

>
> So privileged daemon (container manager) will be configured with the
> knowledge of which services/containers are allowed to use BPF, and
> will grant BPF token only to those that were explicitly allowlisted.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-19 17:40     ` Andy Lutomirski
@ 2023-06-21 23:48       ` Andrii Nakryiko
  2023-06-22  8:22         ` Maryam Tahhan
  0 siblings, 1 reply; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-21 23:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team

On Mon, Jun 19, 2023 at 10:40 AM Andy Lutomirski <luto@kernel.org> wrote:
>
>
>
> On Fri, Jun 9, 2023, at 12:08 PM, Andrii Nakryiko wrote:
> > On Fri, Jun 9, 2023 at 11:32 AM Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >> On Wed, Jun 7, 2023, at 4:53 PM, Andrii Nakryiko wrote:
> >> > This patch set introduces new BPF object, BPF token, which allows to delegate
> >> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> >> > systemd or any other container manager) to a *trusted* unprivileged
> >> > application. Trust is the key here. This functionality is not about allowing
> >> > unconditional unprivileged BPF usage. Establishing trust, though, is
> >> > completely up to the discretion of respective privileged application that
> >> > would create a BPF token.
> >> >
> >>
> >> I skimmed the description and the LSFMM slides.
> >>
> >> Years ago, I sent out a patch set to start down the path of making the bpf() API make sense when used in less-privileged contexts (regarding access control of BPF objects and such).  It went nowhere.
> >>
> >> Where does BPF token fit in?  Does a kernel with these patches applied actually behave sensibly if you pass a BPF token into a container?
> >
> > Yes?.. In the sense that it is possible to create BPF programs and BPF
> > maps from inside the container (with BPF token). Right now under user
> > namespace it's impossible no matter what you do.
>
> I have no problem with creating BPF maps inside a container, but I think the maps should *be in the container*.
>
> My series wasn’t about unprivileged BPF per se.  It was about updating the existing BPF permission model so that it made sense in a context in which it had multiple users that didn’t trust each other.

I don't think it's possible with BPF, in principle, as I mentioned in
the cover letter. Even if some particular types of programs could be
"contained" in some sense, in general BPF is too global by its nature
(it observes everything in kernel memory, it can influence system-wide
behaviors, etc).

>
> >
> >> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
> >
> > BPF is still a privileged thing. You can't just say that any
> > unprivileged application should be able to use BPF. That's why BPF
> > token is about trusting unpriv application in a controlled environment
> > (production) to not do something crazy. It can be enforced further
> > through LSM usage, but in a lot of cases, when dealing with internal
> > production applications it's enough to have a proper application
> > design and rely on code review process to avoid any negative effects.
>
> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
>
> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.

Please see Hao's reply ([0]) about his and Google's (not so rosy)
experiences with building and using such a BPF proxy. We (Meta)
internally didn't go this route at all and strongly prefer not to.
There are lots of downsides and complications to having a BPF proxy.
In the end, this is just shuffling around where the decision about
trusting a given application with BPF access is being made. BPF proxy
adds lots of unnecessary logistical, operational, and development
complexity, but doesn't magically make anything safer.

  [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/

>
> >
> > So privileged daemon (container manager) will be configured with the
> > knowledge of which services/containers are allowed to use BPF, and
> > will grant BPF token only to those that were explicitly allowlisted.
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-21 23:48       ` Andrii Nakryiko
@ 2023-06-22  8:22         ` Maryam Tahhan
  2023-06-22 16:49           ` Andy Lutomirski
  2023-06-22 18:20           ` Andrii Nakryiko
  0 siblings, 2 replies; 72+ messages in thread
From: Maryam Tahhan @ 2023-06-22  8:22 UTC (permalink / raw)
  To: Andrii Nakryiko, Andy Lutomirski
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team

On 22/06/2023 00:48, Andrii Nakryiko wrote:
>
>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
>>> BPF is still a privileged thing. You can't just say that any
>>> unprivileged application should be able to use BPF. That's why BPF
>>> token is about trusting unpriv application in a controlled environment
>>> (production) to not do something crazy. It can be enforced further
>>> through LSM usage, but in a lot of cases, when dealing with internal
>>> production applications it's enough to have a proper application
>>> design and rely on code review process to avoid any negative effects.
>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
>>
>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
> Please see Hao's reply ([0]) about his and Google's (not so rosy)
> experiences with building and using such BPF proxy. We (Meta)
> internally didn't go this route at all and strongly prefer not to.
> There are lots of downsides and complications to having a BPF proxy.
> In the end, this is just shuffling around where the decision about
> trusting a given application with BPF access is being made. BPF proxy
> adds lots of unnecessary logistical, operational, and development
> complexity, but doesn't magically make anything safer.
>
>    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
>
Apologies for being blunt, but the token approach to me seems to be a
workaround for providing the right level/classification for a pod/container
in order to say you support unprivileged containers using eBPF. I think
if your container needs to do privileged things it should have and be
classified with the right permissions (privileges) to do what it needs
to do.

The proxy BPF on behalf of the container approach works for containers
that don't need to do privileged BPF operations.

I have to say that the `proxy BPF on behalf of the container` approach
meets the needs of unprivileged pods, and at the same time giving CAP_BPF
to the applications meets the needs of those pods that need to do
privileged/bpf things without any tokens. Ultimately you are trusting
these apps in the same way as if you were granting a token.


>>> So privileged daemon (container manager) will be configured with the
>>> knowledge of which services/containers are allowed to use BPF, and
>>> will grant BPF token only to those that were explicitly allowlisted.



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-22  8:22         ` Maryam Tahhan
@ 2023-06-22 16:49           ` Andy Lutomirski
       [not found]             ` <5a75d1f0-4ed9-399c-4851-2df0755de9b5@redhat.com>
  2023-06-22 19:05             ` Andrii Nakryiko
  2023-06-22 18:20           ` Andrii Nakryiko
  1 sibling, 2 replies; 72+ messages in thread
From: Andy Lutomirski @ 2023-06-22 16:49 UTC (permalink / raw)
  To: Maryam Tahhan, Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team



On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote:
> On 22/06/2023 00:48, Andrii Nakryiko wrote:
>>
>>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
>>>> BPF is still a privileged thing. You can't just say that any
>>>> unprivileged application should be able to use BPF. That's why BPF
>>>> token is about trusting unpriv application in a controlled environment
>>>> (production) to not do something crazy. It can be enforced further
>>>> through LSM usage, but in a lot of cases, when dealing with internal
>>>> production applications it's enough to have a proper application
>>>> design and rely on code review process to avoid any negative effects.
>>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
>>>
>>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
>> Please see Hao's reply ([0]) about his and Google's (not so rosy)
>> experiences with building and using such BPF proxy. We (Meta)
>> internally didn't go this route at all and strongly prefer not to.
>> There are lots of downsides and complications to having a BPF proxy.
>> In the end, this is just shuffling around where the decision about
>> trusting a given application with BPF access is being made. BPF proxy
>> adds lots of unnecessary logistical, operational, and development
>> complexity, but doesn't magically make anything safer.
>>
>>    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
>>
> Apologies for being blunt, but  the token approach to me seems to be a 
> work around providing the right level/classification for a pod/container 
> in order to say you support unprivileged containers using eBPF. I think 
> if your container needs to do privileged things it should have and be 
> classified with the right permissions (privileges) to do what it needs 
> to do.

Bluntness is great.

I think that this whole level/classification thing is utterly wrong.  Replace "BPF" with basically anything else, and you'll see how absurd it is.

"the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"

That's very 1990's.  Maybe 1980's.  Of *course* giving access to a filesystem has some inherent security exposure.  So we can give containers access to *different* filesystems.  Or we can use ACLs.  Or MAC policy.  Or whatever.  We have many solutions, none of which are perfect, and we're doing okay.

"the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"

The network is a big deal.  For some reason, it's cool these days to treat TCP as highly privileged.  You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.  You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.

This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.

"the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"

My response is: what's wrong with BPF?  BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.  I even *wrote the code*.  But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.

Please try harder.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-22  8:22         ` Maryam Tahhan
  2023-06-22 16:49           ` Andy Lutomirski
@ 2023-06-22 18:20           ` Andrii Nakryiko
  2023-06-23 23:07             ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-22 18:20 UTC (permalink / raw)
  To: Maryam Tahhan
  Cc: Andy Lutomirski, Andrii Nakryiko, bpf, linux-security-module,
	Kees Cook, Christian Brauner, lennart, cyphar, kernel-team

On Thu, Jun 22, 2023 at 1:23 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
>
> On 22/06/2023 00:48, Andrii Nakryiko wrote:
> >
> >>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
> >>> BPF is still a privileged thing. You can't just say that any
> >>> unprivileged application should be able to use BPF. That's why BPF
> >>> token is about trusting unpriv application in a controlled environment
> >>> (production) to not do something crazy. It can be enforced further
> >>> through LSM usage, but in a lot of cases, when dealing with internal
> >>> production applications it's enough to have a proper application
> >>> design and rely on code review process to avoid any negative effects.
> >> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
> >>
> >> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
> > Please see Hao's reply ([0]) about his and Google's (not so rosy)
> > experiences with building and using such BPF proxy. We (Meta)
> > internally didn't go this route at all and strongly prefer not to.
> > There are lots of downsides and complications to having a BPF proxy.
> > In the end, this is just shuffling around where the decision about
> > trusting a given application with BPF access is being made. BPF proxy
> > adds lots of unnecessary logistical, operational, and development
> > complexity, but doesn't magically make anything safer.
> >
> >    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
> >
> Apologies for being blunt, but  the token approach to me seems to be a
> work around providing the right level/classification for a pod/container
> in order to say you support unprivileged containers using eBPF. I think
> if your container needs to do privileged things it should have and be
> classified with the right permissions (privileges) to do what it needs
> to do.

For one, when user namespaces are involved, there is no BPF use at
all, no matter how privileged you want to mark the container. I
mentioned this in the cover letter. Now, the claim is that user
namespaces are indeed useful and necessary, and yet we also want such
user-namespaced applications to be able to use BPF.

Currently there is no solution to that. And an external BPF service is
not a great one; see [0] for real-world users' feedback.

  [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/


>
> The  proxy BPF on behalf of the container approach works for containers
> that don't need to do privileged BPF operations.

BPF usage *is privileged* in all but some tiny use cases that are ok
with heavily limited unprivileged BPF functionality (and even then the
recommendation is to disable unprivileged BPF altogether). Whether you
proxy such privileged BPF usage through an external application or
grant a BPF token to such an application, it falls into the same
category: someone has to decide to trust the application to perform
privileged BPF operations.

And the only debatable thing here is whether the application itself
should do bpf() syscalls directly and be able to use the entire BPF
ecosystem of libraries, tools, techniques, and approaches. Or we go
and rewrite the world to use some RPC-based proxy to bpf() syscall?

And to put it bluntly, the latter is not a realistic (or even good) option.

>
> I have to say that  the `proxy BPF on behalf of the container` meets the
> needs of unprivileged pods and at the same time giving CAP_BPF to the

I tried to make it very clear in the cover letter, but granting
CAP_BPF under user namespace means precisely nothing. CAP_BPF is only
useful in the init namespace.

> applications meets the needs of these PODs that need to do
> privileged/bpf things without any tokens. Ultimately you are trusting
> these apps in the same way as if you were granting a token.

Yes, absolutely. As I mentioned very explicitly, it's a question of
trusting the application. Service vs token is an implementation detail,
but one that has huge implications for how applications are built,
tested, versioned, deployed, etc.

>
>
> >>> So privileged daemon (container manager) will be configured with the
> >>> knowledge of which services/containers are allowed to use BPF, and
> >>> will grant BPF token only to those that were explicitly allowlisted.
>
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
       [not found]             ` <5a75d1f0-4ed9-399c-4851-2df0755de9b5@redhat.com>
@ 2023-06-22 18:40               ` Andrii Nakryiko
  2023-06-22 21:04                 ` Maryam Tahhan
  2023-06-23  1:02                 ` Andy Lutomirski
  0 siblings, 2 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-22 18:40 UTC (permalink / raw)
  To: Maryam Tahhan
  Cc: Andy Lutomirski, Andrii Nakryiko, bpf, linux-security-module,
	Kees Cook, Christian Brauner, lennart, cyphar, kernel-team

On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
>

Please avoid replying in HTML.

> On 22/06/2023 17:49, Andy Lutomirski wrote:
>
> Apologies for being blunt, but  the token approach to me seems to be a
> work around providing the right level/classification for a pod/container
> in order to say you support unprivileged containers using eBPF. I think
> if your container needs to do privileged things it should have and be
> classified with the right permissions (privileges) to do what it needs
> to do.
>
> Bluntness is great.
>
> I think that this whole level/classification thing is utterly wrong.  Replace "BPF" with basically anything else, and you'll see how absurd it is.
>
> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"
>
> That's very 1990's.  Maybe 1980's.  Of *course* giving access to a filesystem has some inherent security exposure.  So we can give containers access to *different* filesystems.  Or we can use ACLs.  Or MAC policy.  Or whatever.  We have many solutions, none of which are perfect, and we're doing okay.
>
> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"
>
> The network is a big deal.  For some reason, it's cool these days to treat TCP as highly privileged.  You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.  You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.
>
> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.
>
> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"
>
> My response is: what's wrong with BPF?  BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.  I even *wrote the code*.  But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.
>
> Please try harder.
>
> I'm going to be honest, I can't tell if we are in agreement or not :). I'm also going to use pod and container interchangeably throughout my response (bear with me)
>
>
> So just to clarify a few things on my end.  When I said "level/classification" I meant privileges --> A container should have the right level of privileges assigned to it for what it's trying to do in the K8s scenario through it's pod spec. To me it seems like BPF token is a way to work around the permissions assigned to a container in K8s for example: with bpf_token I'm marking a pod as unprivileged but then under the hood, through another service I'm giving it a token to do more than what it was specified in it's pod spec. Yeah I have a separate service controlling the tokens but something about it just seems not right (to me). If CAP_BPF is too broad, can we break it down further into something more granular. Something that can be assigned to the container through the pod spec rather than a separate service that seems to be doing things under the hood? This doesn't even start to
> solve the problem I know...

Disclaimer: I don't know anything about Kubernetes, so don't expect me
to reply with correct terminology or a detailed understanding of
container configuration.

But on a more generic and conceptual level, it seems like you are
making some implementation assumptions and arguing based on that.

Like, why can't the container spec have native support for "granted BPF
functionality"? Why would a BPF token have to be granted through some
separate service rather than being integrated into Kubernetes'
"container manager" functionality as a natural extension of the spec?

As for CAP_BPF being too broad: it is broad, yes. If you have good ideas
how to break it down some more -- please propose. But this is all
orthogonal, because the blocking problem is the fundamental
incompatibility of user namespaces (and their implied isolation and
sandboxing of workloads) and BPF functionality, which is global by its
very nature. The latter is unavoidable in principle.

No matter how much you break down CAP_BPF, you can't enforce that BPF
program won't interfere with applications in other containers. Or that
it won't "spy" on them. It's just not what BPF can enforce in
principle.

So that comes back down to a question of trust and then controlled
delegation of BPF functionality. You trust workload with BPF usage
because you reviewed the BPF code, workload, testing, etc? Grant BPF
token and let that container use limited subset of BPF. Employ BPF LSM
to further restrict it beyond what BPF token can control.

You cannot trust an application to not do something harmful? Then you
should grant it neither CAP_BPF in the init namespace nor a BPF token
in a user namespace. That's it. Pick your poison.

But all this cannot be mechanically decided or enforced. There has to
be some humans involved in making these decisions. Kernel's job is to
provide building blocks to grant and control BPF functionality to the
extent that it is technically possible.


>
> I understand the difficulties with trying to deploy BPF in K8s and the concerns around privilege escalation for the containers. I understand not all use cases are created equally but I think this falls into at least 2 categories:
>
> - Pods/Containers that need to do privileged BPF ops but not under a CAP_BPF umbrella --> sure we need something for this.
> - Pods/Containers that don't need to do any privileged BPF ops but still use BPF --> these are happy with a proxy service loading/unloading the bpf progs, creating maps and pinning them... But even in this scenario we need something to isolate the pinned maps/progs by different apps (why not DAC rules?), even better if the maps are in the container...

The above doesn't make much sense to me, sorry. If the application is
ok using unprivileged BPF, there is no problem there. They can today
already and there is no BPF proxy or BPF token involved.

As for "something to isolate the pinned maps/progs by different apps
(why not DAC rules?)", there is no such thing, as I've explained
already.

I can install sched_switch raw_tracepoint BPF program (if I'm allowed
to), and that program has system-wide observability. It cannot be
bound to an application. You can't just say "trigger this sched_switch
program only for scheduler decisions within my container". When you
actually start thinking about just that one example, even assuming we
add some per-container filter in the kernel to not trigger your
program, then what do we do when we switch from process A in container
X to process B in container Y? Is that event belonging to container X?
Or container Y? How can you prevent a program from reading a task's
data that doesn't belong to your container, when both are inputs to
this single tracepoint event?
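
To make the ambiguity concrete, here is a minimal userspace sketch (plain
C, not BPF program code). The struct and both filter functions are
hypothetical, invented only for illustration; no such per-container filter
exists in the kernel. It models a sched_switch event as a pair of container
IDs and shows two obvious filtering rules one might try:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of one sched_switch event: the task being
 * switched out and the task being switched in, each tagged with the ID of
 * the container it runs in. */
struct switch_event {
	uint64_t prev_container;
	uint64_t next_container;
};

/* Strict rule: deliver the event only if both tasks are in the container. */
static bool visible_strict(const struct switch_event *ev, uint64_t container)
{
	return ev->prev_container == container &&
	       ev->next_container == container;
}

/* Permissive rule: deliver the event if either task is in the container. */
static bool visible_permissive(const struct switch_event *ev,
			       uint64_t container)
{
	return ev->prev_container == container ||
	       ev->next_container == container;
}
```

For a switch from a task in container 1 to a task in container 2, the
strict rule hides the event from both containers, while the permissive
rule delivers it to both, so each one sees a task that is not its own.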

Hopefully you can see where I'm going with this. And this is just one
random tiny example. We can think up tons of other cases to prove BPF
is not isolatable to any sort of "container".

>
> Anyway - I hope this clarifies my original intent - which is proxy at least starts to solve one part of the puzzle. Whatever approach(es) we take to solve the rest of these problems the more we can stick to tried and trusted mechanisms the better.

I disagree. BPF proxy complicates logistics, operations, and developer
experience, without resolving the issue of determining trust and the
need to delegate or proxy BPF functionality.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-22 16:49           ` Andy Lutomirski
       [not found]             ` <5a75d1f0-4ed9-399c-4851-2df0755de9b5@redhat.com>
@ 2023-06-22 19:05             ` Andrii Nakryiko
  2023-06-23  3:28               ` Andy Lutomirski
  1 sibling, 1 reply; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-22 19:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Maryam Tahhan, Andrii Nakryiko, bpf, linux-security-module,
	Kees Cook, Christian Brauner, lennart, cyphar, kernel-team

On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@kernel.org> wrote:
>
>
>
> On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote:
> > On 22/06/2023 00:48, Andrii Nakryiko wrote:
> >>
> >>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
> >>>> BPF is still a privileged thing. You can't just say that any
> >>>> unprivileged application should be able to use BPF. That's why BPF
> >>>> token is about trusting unpriv application in a controlled environment
> >>>> (production) to not do something crazy. It can be enforced further
> >>>> through LSM usage, but in a lot of cases, when dealing with internal
> >>>> production applications it's enough to have a proper application
> >>>> design and rely on code review process to avoid any negative effects.
> >>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
> >>>
> >>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
> >> Please see Hao's reply ([0]) about his and Google's (not so rosy)
> >> experiences with building and using such BPF proxy. We (Meta)
> >> internally didn't go this route at all and strongly prefer not to.
> >> There are lots of downsides and complications to having a BPF proxy.
> >> In the end, this is just shuffling around where the decision about
> >> trusting a given application with BPF access is being made. BPF proxy
> >> adds lots of unnecessary logistical, operational, and development
> >> complexity, but doesn't magically make anything safer.
> >>
> >>    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
> >>
> > Apologies for being blunt, but  the token approach to me seems to be a
> > work around providing the right level/classification for a pod/container
> > in order to say you support unprivileged containers using eBPF. I think
> > if your container needs to do privileged things it should have and be
> > classified with the right permissions (privileges) to do what it needs
> > to do.
>
> Bluntness is great.
>
> I think that this whole level/classification thing is utterly wrong.  Replace "BPF" with basically anything else, and you'll see how absurd it is.

BPF is not "anything else"; it's important to understand that BPF is
inherently not compartmentalizable. And it's vast and generic in its
capabilities. This changes everything. So your analogies are
misleading.

>
> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"
>
> That's very 1990's.  Maybe 1980's.  Of *course* giving access to a filesystem has some inherent security exposure.  So we can give containers access to *different* filesystems.  Or we can use ACLs.  Or MAC policy.  Or whatever.  We have many solutions, none of which are perfect, and we're doing okay.
>
> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"
>
> The network is a big deal.  For some reason, it's cool these days to treat TCP as highly privileged.  You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.  You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.
>
> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.
>
> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"
>
> My response is: what's wrong with BPF?  BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.

Can you apply DAC rules to which kernel events BPF program can be run
on? Can you apply DAC rules to which in-kernel data structures a BPF
program can look at and make sure that it doesn't access a
task/socket/etc that "belongs" to some other container/user/etc?

Can we limit XDP or AF_XDP BPF programs from seeing and controlling
network traffic that will be eventually routed to a container that XDP
program "should not" have access to? Without making everything so slow
that it's useless?

> I even *wrote the code*.

Did you submit it upstream for review and wide discussion? Did you
test it and integrate it with production workloads to prove that your
solution is actually a viable real-world solution and not a toy?
Writing the code doesn't mean solving the problem.

> But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.

I won't speak on behalf of the entire BPF community, but I'm trying to
explain that BPF cannot be reasonably sandboxed and has to be
privileged due to its global nature. And I haven't yet seen any
realistic counter-proposal to change that. And it's not about
ownership of the BPF map or BPF program, it's way beyond that.

>
> Please try harder.

Well, maybe there is something in that "some reason" you mentioned
above that you so quickly dismissed?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-22 18:40               ` Andrii Nakryiko
@ 2023-06-22 21:04                 ` Maryam Tahhan
  2023-06-22 23:35                   ` Andrii Nakryiko
  2023-06-23  1:02                 ` Andy Lutomirski
  1 sibling, 1 reply; 72+ messages in thread
From: Maryam Tahhan @ 2023-06-22 21:04 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andy Lutomirski, Andrii Nakryiko, bpf, linux-security-module,
	Kees Cook, Christian Brauner, lennart, cyphar, kernel-team

On Thu, Jun 22, 2023 at 7:40 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
> >
>
> Please avoid replying in HTML.
>

Sorry.

[...]

>
> Disclaimer: I don't know anything about Kubernetes, so don't expect me
> to reply with correct terminology or detailed understanding of
> configuration of containers.
>
> But on a more generic and conceptual level, it seems like you are
> making some implementation assumptions and arguing based on that.
>

Firstly, thank you for taking the time to respond and explain. I can see
where you are coming from.

Yeah, admittedly I did make a few assumptions. I was thrown by the reference
to `unprivileged` processes in the cover letter. It seems like this is a way to
grant namespaced BPF permissions to a process (my gross
oversimplification - sorry).
Looking back throughout your responses there's nothing unprivileged here.

[...]


> Hopefully you can see where I'm going with this. And this is just one
> random tiny example. We can think up tons of other cases to prove BPF
> is not isolatable to any sort of "container".
>
> >
> > Anyway - I hope this clarifies my original intent - which is proxy at least starts to solve one part of the puzzle. Whatever approach(es) we take to solve the rest of these problems the more we can stick to tried and trusted mechanisms the better.
>
> I disagree. BPF proxy complicates logistics, operations, and developer
> experience, without resolving the issue of determining trust and the
> need to delegate or proxy BPF functionality.

I appreciate your viewpoint. I just don't think that this is a
one-solution-fits-every-scenario situation. For example, in the case of
AF_XDP, I'd like to be able to run my containers without any additional
privileges. I've been working on a device plugin for Kubernetes whose job
is to provision netdevs with an XDP redirect program (then later there's a
CNI that moves the netdev into the pod network namespace).  Originally I
was using bpf locally in the device plugin (to load the bpf program and
get the XSK map fd) and SCM_RIGHTS to pass the XSK_MAP over UDS, but
honestly it was relatively cumbersome from an app development POV and very
easy to get wrong, and trying to keep up with the latest bpf api changes
started to become an issue. If I wanted to add more interesting bpf
programs I had to do a full recompile...
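
For readers unfamiliar with the fd-passing step mentioned above, this is
roughly what handing a file descriptor over a Unix domain socket with
SCM_RIGHTS looks like. It is a minimal sketch using only standard socket
calls; the helper names send_fd() and recv_fd() are made up for this
example, and in the device plugin case the fd being passed would be the
XSK map fd:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one file descriptor over a connected Unix domain socket.
 * Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* one dummy data byte carries the control message */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {		/* union keeps the control buffer properly aligned */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on error. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new descriptor number that refers to the same open
object as the sender's, so the receiving process needs no privileges of
its own to obtain it.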

I've now moved to using bpfd for the loading and unloading of the bpf
program on my behalf. It also comes with a bunch of other advantages,
including being able to update my trusted bpf program transparently to
both the device plugin and my application (I don't have to respin this
either when I write/want to add a new bpf prog), but mainly I have a
trusted proxy managing bpffs, bpf progs and maps for me. There's still
more work to do here...

I understand this is a much simplified scenario, and I'm sure I can think
of several more where a proxy is useful. All I'm trying to say is, I'm not
sure there's a one-size-fits-all solution for these issues.

Thanks
Maryam


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-22 21:04                 ` Maryam Tahhan
@ 2023-06-22 23:35                   ` Andrii Nakryiko
  0 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-22 23:35 UTC (permalink / raw)
  To: Maryam Tahhan
  Cc: Andy Lutomirski, Andrii Nakryiko, bpf, linux-security-module,
	Kees Cook, Christian Brauner, lennart, cyphar, kernel-team

On Thu, Jun 22, 2023 at 2:04 PM Maryam Tahhan <mtahhan@redhat.com> wrote:
>
> On Thu, Jun 22, 2023 at 7:40 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
> > >
> >
> > Please avoid replying in HTML.
> >
>
> Sorry.

No worries, the problem is that the mailing list filters out such
messages. So if you go to [0] and scroll to the bottom of the page,
you'll see that your email is not in the lore archive. People not
CC'ed directly will only see what you wrote through my reply quoting
your email.

  [0] https://lore.kernel.org/bpf/CAFdtZitYhOK4TzAJVbFPMfup_homxSSu3Q8zjJCCiHCf22eJvQ@mail.gmail.com/#t

>
> [...]
>
> >
> > Disclaimer: I don't know anything about Kubernetes, so don't expect me
> > to reply with correct terminology or detailed understanding of
> > configuration of containers.
> >
> > But on a more generic and conceptual level, it seems like you are
> > making some implementation assumptions and arguing based on that.
> >
>
> Firstly, thank you for taking the time to respond and explain. I can see
> where you are coming from.
>
> Yeah, admittedly I did make a few assumptions. I was thrown by the reference
> to `unprivileged` processes in the cover letter. It seems like this is a way to
> grant namespaced BPF permissions to a process (my gross
> oversimplification - sorry).

Yep, with the caveat that BPF functionality itself cannot be
namespaced (i.e., contained within the container), so this has to be
granted by a fully privileged process/proxy based on trusting the
workload to not do anything harmful.


> Looking back throughout your responses there's nothing unprivileged here.
>
> [...]
>
>
> > Hopefully you can see where I'm going with this. And this is just one
> > random tiny example. We can think up tons of other cases to prove BPF
> > is not isolatable to any sort of "container".
> >
> > >
> > > Anyway - I hope this clarifies my original intent - which is proxy at least starts to solve one part of the puzzle. Whatever approach(es) we take to solve the rest of these problems the more we can stick to tried and trusted mechanisms the better.
> >
> > I disagree. BPF proxy complicates logistics, operations, and developer
> > experience, without resolving the issue of determining trust and the
> > need to delegate or proxy BPF functionality.
>
> I appreciate your viewpoint. I just don't think that this is a
> one-solution-fits-every-scenario situation.

Absolutely. It's also not my intent or goal to kill any sort of BPF
proxy. What I'm trying to convey is that the BPF proxy approach has
severe downsides, depending on application, deployment practices, etc,
etc. It's not always a (good) answer. So I just want to avoid having
the dichotomy of "BPF token or BPF proxy, there could be only one".

> For example, in the case of AF_XDP, I'd like to be able to run my
> containers without any additional privileges. I've been working on a
> device plugin for Kubernetes whose job is to provision netdevs with an XDP
> redirect program (then later there's a CNI that moves the netdev into the
> pod network namespace).  Originally I was using bpf locally in the device
> plugin (to load the bpf program and get the XSK map fd) and SCM_RIGHTS to
> pass the XSK_MAP over UDS, but honestly it was relatively cumbersome from
> an app development POV and very easy to get wrong, and trying to keep up
> with the latest bpf api changes started to become an issue. If I wanted to
> add more interesting bpf programs I had to do a full recompile...
>
> I've now moved to using bpfd for the loading and unloading of the bpf
> program on my behalf. It also comes with a bunch of other advantages,
> including being able to update my trusted bpf program transparently to
> both the device plugin and my application (I don't have to respin this
> either when I write/want to add a new bpf prog), but mainly I have a
> trusted proxy managing bpffs, bpf progs and maps for me. There's still
> more work to do here...
>

It's a spectrum, and from my observations networking BPF programs lend
themselves more easily to this model of BPF proxy (at least until they
become complicated ensembles of networking and tracing BPF programs).
Very often networking applications can indeed load BPF program
completely independently from user-space parts, keep them "persisted"
in kernel, occasionally control them through pinned BPF maps, etc.

But the further you go towards tracing applications, where the BPF parts
are an integral part of the overall user-space application, the less well
this model works. It's much simpler to have the BPF parts embedded,
loaded, versioned, initialized and interacted with from inside the
same process. And we have lots of such applications. The BPF proxy
approach is a massive complication for such use cases, with a bunch of
downsides.

> I understand this is a much simplified scenario, and I'm sure I can think
> of several more where a proxy is useful. All I'm trying to say is, I'm not
> sure there's a one-size-fits-all solution for these issues.

100% agree. BPF token won't fit all use cases. And BPF proxy won't fit
all use cases either. Both approaches can and should coexist.

>
> Thanks
> Maryam
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-22 18:40               ` Andrii Nakryiko
  2023-06-22 21:04                 ` Maryam Tahhan
@ 2023-06-23  1:02                 ` Andy Lutomirski
  2023-06-23 15:10                   ` Andy Lutomirski
  2023-06-26 22:08                   ` Andrii Nakryiko
  1 sibling, 2 replies; 72+ messages in thread
From: Andy Lutomirski @ 2023-06-23  1:02 UTC (permalink / raw)
  To: Andrii Nakryiko, Maryam Tahhan
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team



On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
> On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
>
> As for CAP_BPF being too broad: it is broad, yes. If you have good ideas
> on how to break it down some more -- please propose. But this is all orthogonal,
> because the blocking problem is fundamental incompatibility of user
> namespaces (and their implied isolation and sandboxing of workloads)
> and BPF functionality, which is global by its very nature. The latter
> is unavoidable in principle.

How, exactly, is BPF global by its very nature?

The *implementation* has some issues with globalness.  Much of it should be fixable.

>
> No matter how much you break down CAP_BPF, you can't enforce that BPF
> program won't interfere with applications in other containers. Or that
> it won't "spy" on them. It's just not what BPF can enforce in
> principle.

The WHOLE POINT of the verifier is to attempt to constrain what BPF programs can and can't do.  There are bugs -- I get that.  There are helper functions that are fundamentally global.  But, in the absence of verifier bugs, BPF has actual boundaries to its functionality.

>
> So that comes back down to a question of trust and then controlled
> delegation of BPF functionality. You trust workload with BPF usage
> because you reviewed the BPF code, workload, testing, etc? Grant BPF
> token and let that container use limited subset of BPF. Employ BPF LSM
> to further restrict it beyond what BPF token can control.
>
> You cannot trust an application to not do something harmful? You
> shouldn't grant it either CAP_BPF in init namespace, nor BPF token in
> user namespace. That's it. Pick your poison.

I think what's lost here is hardening vs restricting intended functionality.

We have access control to restrict intended functionality.  We have other (and generally fairly ad-hoc and awkward) ways to flip off functionality because we want to reduce exposure to any bugs in it.

BPF needs hardening -- this is well established.  Right now, this is accomplished by restricting it to global root (effectively).  It should have access controls, too, but it doesn't.

>
> But all this cannot be mechanically decided or enforced. There has to
> be some humans involved in making these decisions. Kernel's job is to
> provide building blocks to grant and control BPF functionality to the
> extent that it is technically possible.
>

Exactly.  And it DOES NOT.  BPF maps, etc. do not have sensible access controls.  Things that should not be global are global.  I'm saying the kernel should fix THAT.  Once it's in a state where it's at least credible to allow BPF in a user namespace, then come up with a way to allow it.

> As for "something to isolate the pinned maps/progs by different apps
> (why not DAC rules?)", there is no such thing, as I've explained
> already.
>
> I can install sched_switch raw_tracepoint BPF program (if I'm allowed
> to), and that program has system-wide observability. It cannot be
> bound to an application.

Great, a real example!

Either:

(a) don't run this in a container.  Have a service for the container to request the help of this program.

(b) have a way to have root approve a particular program and expose *that* program to the container, and let the program have its own access controls internally (e.g. only output info that belongs to that container).

> then what do we do when we switch from process A in container
> X to process B in container Y? Is that event belonging to container X?
> Or container Y?

I don't know, but you had better answer this question before you run this thing in a container, not just for security but for basic functionality.  If you haven't defined what your program is even supposed to do in a container, don't run it there.


> Hopefully you can see where I'm going with this. And this is just one
> random tiny example. We can think up tons of other cases to prove BPF
> is not isolatable to any sort of "container".

No.  You have not come up with an example of why BPF is not isolatable to a container.  You have come up with an example of why binding to a sched_switch raw tracepoint does not make sense in a container without additional mechanisms to give it well defined functionality and appropriate security.

Please stop conflating BPF (programs, maps, etc) with *attachments* of BPF programs to systemwide things.  They're both under the BPF umbrella.  They're not the same thing.

Passing a token into a container that allows that container to do things like loading its own programs *and attaching them to raw tracepoints* is IMO a complete nonstarter.  It makes no sense.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-22 19:05             ` Andrii Nakryiko
@ 2023-06-23  3:28               ` Andy Lutomirski
  2023-06-23 16:13                 ` Casey Schaufler
  2023-06-26 22:08                 ` Andrii Nakryiko
  0 siblings, 2 replies; 72+ messages in thread
From: Andy Lutomirski @ 2023-06-23  3:28 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Maryam Tahhan, Andrii Nakryiko, bpf, linux-security-module,
	Kees Cook, Christian Brauner, lennart, cyphar, kernel-team

On Thu, Jun 22, 2023, at 12:05 PM, Andrii Nakryiko wrote:
> On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>>
>>
>> On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote:
>> > On 22/06/2023 00:48, Andrii Nakryiko wrote:
>> >>
>> >>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
>> >>>> BPF is still a privileged thing. You can't just say that any
>> >>>> unprivileged application should be able to use BPF. That's why BPF
>> >>>> token is about trusting unpriv application in a controlled environment
>> >>>> (production) to not do something crazy. It can be enforced further
>> >>>> through LSM usage, but in a lot of cases, when dealing with internal
>> >>>> production applications it's enough to have a proper application
>> >>>> design and rely on code review process to avoid any negative effects.
>> >>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
>> >>>
>> >>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
>> >> Please see Hao's reply ([0]) about his and Google's (not so rosy)
>> >> experiences with building and using such BPF proxy. We (Meta)
>> >> internally didn't go this route at all and strongly prefer not to.
>> >> There are lots of downsides and complications to having a BPF proxy.
>> >> In the end, this is just shuffling around where the decision about
>> >> trusting a given application with BPF access is being made. BPF proxy
>> >> adds lots of unnecessary logistical, operational, and development
>> >> complexity, but doesn't magically make anything safer.
>> >>
>> >>    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
>> >>
>> > Apologies for being blunt, but  the token approach to me seems to be a
>> > work around providing the right level/classification for a pod/container
>> > in order to say you support unprivileged containers using eBPF. I think
>> > if your container needs to do privileged things it should have and be
>> > classified with the right permissions (privileges) to do what it needs
>> > to do.
>>
>> Bluntness is great.
>>
>> I think that this whole level/classification thing is utterly wrong.  Replace "BPF" with basically anything else, and you'll see how absurd it is.
>
> BPF is not "anything else"; it's important to understand that BPF is
> inherently not compartmentalizable. And it's vast and generic in its
> capabilities. This changes everything. So your analogies are
> misleading.
>

file descriptors are "vast and generic" -- you can open sockets, files, things in /proc, things in /sys, device nodes, etc.  They are infinitely extensible.  They work in containers.

What is so special about BPF?

>>
>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"
>>
>> That's very 1990's.  Maybe 1980's.  Of *course* giving access to a filesystem has some inherent security exposure.  So we can give containers access to *different* filesystems.  Or we can use ACLs.  Or MAC policy.  Or whatever.  We have many solutions, none of which are perfect, and we're doing okay.
>>
>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"
>>
>> The network is a big deal.  For some reason, it's cool these days to treat TCP as highly privileged.  You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.  You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.
>>
>> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.
>>
>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"
>>
>> My response is: what's wrong with BPF?  BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.
>
> Can you apply DAC rules to which kernel events BPF program can be run
> on? Can you apply DAC rules to which in-kernel data structures a BPF
> program can look at and make sure that it doesn't access a
> task/socket/etc that "belongs" to some other container/user/etc?

No, of course.

If you have a BPF program that is granted the ability to read kernel data structures or to run in response to global events like this, it's basically a kernel module.  It may be subject to a verifier that imposes much stronger type safety than a kernel module is subject to, but it's still effectively a kernel module.

We don't give containers special tokens that let them load arbitrary modules.  We should not give them special tokens that let them do things with BPF that are functionally equivalent to loading arbitrary kernel modules.

But we do have ways that kernel modules (which are "vast and generic", too) can expose their functionality safely to containers.  BPF can learn to do this.

>
> Can we limit XDP or AF_XDP BPF programs from seeing and controlling
> network traffic that will be eventually routed to a container that XDP
> program "should not" have access to? Without making everything so slow
> that it's useless?

Of course you can -- assign an entire NIC or virtual function to a container, and let the XDP program handle that.  Or a vlan or a macvlan or whatever.  (I'm assuming XDP can be scoped like this.  I'm not that familiar with the details.)

>
>> I even *wrote the code*.
>
> Did you submit it upstream for review and wide discussion?

Yes.

> Did you
> test it and integrate it with production workloads to prove that your
> solution is actually a viable real-world solution and not a toy?

I did test it.  I did not integrate it with production workloads.

> Writing the code doesn't mean solving the problem.

Of course not.  My code was a little step in the right direction.  The BPF community was apparently not interested in it. 

>
>> But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.
>
> I won't speak on behalf of the entire BPF community, but I'm trying to
> explain that BPF cannot be reasonably sandboxed and has to be
> privileged due to its global nature. And I haven't yet seen any
> realistic counter-proposal to change that. And it's not about
> ownership of the BPF map or BPF program, it's way beyond that.
>

It's really, really hard to have a useful discussion about a security model when you have, as what appears to be an axiom, the premise that a security model can't be created.

If you actually feel this way, then I think you should not be advocating for allowing unprivileged containers to do the things that you think can't have a security model.

I'm saying that I think there *can* be a security model.  But until the maintainers start to believe that, there won't be one.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-23  1:02                 ` Andy Lutomirski
@ 2023-06-23 15:10                   ` Andy Lutomirski
  2023-06-23 23:23                     ` Daniel Borkmann
  2023-06-26 22:08                   ` Andrii Nakryiko
  1 sibling, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2023-06-23 15:10 UTC (permalink / raw)
  To: Andrii Nakryiko, Maryam Tahhan
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team



On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote:
> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
>
>> Hopefully you can see where I'm going with this. And this is just one
>> random tiny example. We can think up tons of other cases to prove BPF
>> is not isolatable to any sort of "container".
>
> No.  You have not come up with an example of why BPF is not isolatable 
> to a container.  You have come up with an example of why binding to a 
> sched_switch raw tracepoint does not make sense in a container without 
> additional mechanisms to give it well defined functionality and 
> appropriate security.

Thinking about this some more:

Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example).  The workload is in the container.  The tracepoint is global.  Kernel memory is global unless something that is trusted and understands the containers is doing the reading.  And proxying BPF is a mess.

So here are a couple of possible solutions:

(a) Improve BPF maps a bit so that BPF maps work well in containers.  It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags.  (IIRC my patch series was a decent step in this direction.)  Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container.  So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container)", and the daemon arranges for the requesting container to have access to the map it needs to get the data.

(b) Make a way to pass a pre-approved program into a container.  So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container.

I think (a) is better.  In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container.

For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers.

And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance.  You want *one* XDP program fanning the packets out to the relevant containers.
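
To make the scaling argument concrete, here is a minimal userspace sketch in plain C. It is not BPF code and does not model real XDP or tracepoint costs; `struct event`, `struct sink`, and both function names are hypothetical and exist only for this illustration. It counts how many times an event gets inspected when every container runs its own program versus when a single program fans events out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CONTAINERS 8

/* Hypothetical event: the container it belongs to, plus some payload. */
struct event {
	uint32_t container_id;
	uint64_t payload;
};

/* Hypothetical per-container sink, standing in for a BPF map or AF_XDP socket. */
struct sink {
	uint64_t delivered;
};

/*
 * Model 1: one program per container. Every program sees every event and
 * keeps only its own, so the work is nev * ncont inspections.
 */
static size_t run_per_container_programs(const struct event *ev, size_t nev,
					 struct sink *sinks, size_t ncont)
{
	size_t inspections = 0;

	assert(ncont <= MAX_CONTAINERS);
	for (size_t i = 0; i < nev; i++) {
		for (size_t c = 0; c < ncont; c++) {
			inspections++;
			if (ev[i].container_id == c)
				sinks[c].delivered++;
		}
	}
	return inspections;
}

/*
 * Model 2: a single program looks at each event once and routes it to the
 * right sink by index, so the work is nev inspections.
 */
static size_t run_single_fanout_program(const struct event *ev, size_t nev,
					struct sink *sinks, size_t ncont)
{
	size_t inspections = 0;

	assert(ncont <= MAX_CONTAINERS);
	for (size_t i = 0; i < nev; i++) {
		inspections++;
		if (ev[i].container_id < ncont)
			sinks[ev[i].container_id].delivered++;
	}
	return inspections;
}
```

With 4 events and 3 containers, the first model performs 12 inspections and the second performs 4, while both deliver the same events to the same sinks.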

If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation.

--Andy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-23  3:28               ` Andy Lutomirski
@ 2023-06-23 16:13                 ` Casey Schaufler
  2023-06-26 22:08                 ` Andrii Nakryiko
  1 sibling, 0 replies; 72+ messages in thread
From: Casey Schaufler @ 2023-06-23 16:13 UTC (permalink / raw)
  To: Andy Lutomirski, Andrii Nakryiko
  Cc: Maryam Tahhan, Andrii Nakryiko, bpf, linux-security-module,
	Kees Cook, Christian Brauner, lennart, cyphar, kernel-team,
	Casey Schaufler

On 6/22/2023 8:28 PM, Andy Lutomirski wrote:
> On Thu, Jun 22, 2023, at 12:05 PM, Andrii Nakryiko wrote:
>> On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>>
>>> On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote:
>>>> On 22/06/2023 00:48, Andrii Nakryiko wrote:
>>>>>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
>>>>>>> BPF is still a privileged thing. You can't just say that any
>>>>>>> unprivileged application should be able to use BPF. That's why BPF
>>>>>>> token is about trusting unpriv application in a controlled environment
>>>>>>> (production) to not do something crazy. It can be enforced further
>>>>>>> through LSM usage, but in a lot of cases, when dealing with internal
>>>>>>> production applications it's enough to have a proper application
>>>>>>> design and rely on code review process to avoid any negative effects.
>>>>>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
>>>>>>
>>>>>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
>>>>> Please see Hao's reply ([0]) about his and Google's (not so rosy)
>>>>> experiences with building and using such BPF proxy. We (Meta)
>>>>> internally didn't go this route at all and strongly prefer not to.
>>>>> There are lots of downsides and complications to having a BPF proxy.
>>>>> In the end, this is just shuffling around where the decision about
>>>>> trusting a given application with BPF access is being made. BPF proxy
>>>>> adds lots of unnecessary logistical, operational, and development
>>>>> complexity, but doesn't magically make anything safer.
>>>>>
>>>>>    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
>>>>>
>>>> Apologies for being blunt, but  the token approach to me seems to be a
>>>> work around providing the right level/classification for a pod/container
>>>> in order to say you support unprivileged containers using eBPF. I think
>>>> if your container needs to do privileged things it should have and be
>>>> classified with the right permissions (privileges) to do what it needs
>>>> to do.
>>> Bluntness is great.
>>>
>>> I think that this whole level/classification thing is utterly wrong.  Replace "BPF" with basically anything else, and you'll see how absurd it is.
>> BPF is not "anything else", it's important to understand that BPF is
>> inherently not compartmentalizable. And it's vast and generic in its
>> capabilities. This changes everything. So your analogies are
>> misleading.
>>
> file descriptors are "vast and generic" -- you can open sockets, files, things in /proc, things in /sys, device nodes, etc.  They are infinitely extensible.  They work in containers.
>
> What is so special about BPF?
>
>>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"
>>>
>>> That's very 1990's.  Maybe 1980's.  Of *course* giving access to a filesystem has some inherent security exposure.  So we can give containers access to *different* filesystems.  Or we can use ACLs.  Or MAC policy.  Or whatever.  We have many solutions, none of which are perfect, and we're doing okay.
>>>
>>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"
>>>
>>> The network is a big deal.  For some reason, it's cool these days to treat TCP as highly privileged.  You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.  You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.
>>>
>>> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.
>>>
>>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"
>>>
>>> My response is: what's wrong with BPF?  BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.
>> Can you apply DAC rules to which kernel events BPF program can be run
>> on? Can you apply DAC rules to which in-kernel data structures a BPF
>> program can look at and make sure that it doesn't access a
>> task/socket/etc that "belongs" to some other container/user/etc?
> No, of course.
>
> If you have a BPF program that is granted the ability to read kernel data structures or to run in response to global events like this, it's basically a kernel module.  It may be subject to a verifier that imposes much stronger type safety than a kernel module is subject to, but it's still effectively a kernel module.
>
> We don't give containers special tokens that let them load arbitrary modules.  We should not give them special tokens that let them do things with BPF that are functionally equivalent to loading arbitrary kernel modules.
>
> But we do have ways that kernel modules (which are "vast and generic", too) can expose their functionality safely to containers.  BPF can learn to do this.
>
>> Can we limit XDP or AF_XDP BPF programs from seeing and controlling
>> network traffic that will be eventually routed to a container that XDP
>> program "should not" have access to? Without making everything so slow
>> that it's useless?
> Of course you can -- assign an entire NIC or virtual function to a container, and let the XDP program handle that.  Or a vlan or a macvlan or whatever.  (I'm assuming XDP can be scoped like this.  I'm not that familiar with the details.)
>
>>> I even *wrote the code*.
>> Did you submit it upstream for review and wide discussion?
> Yes.
>
>> Did you
>> test it and integrate it with production workloads to prove that your
>> solution is actually a viable real-world solution and not a toy?
> I did test it.  I did not integrate it with production workloads.
>
>> Writing the code doesn't mean solving the problem.
> Of course not.  My code was a little step in the right direction.  The BPF community was apparently not interested in it. 
>
>>> But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.
>> I won't speak on behalf of the entire BPF community, but I'm trying to
>> explain that BPF cannot be reasonably sandboxed and has to be
>> privileged due to its global nature. And I haven't yet seen any
>> realistic counter-proposal to change that. And it's not about
>> ownership of the BPF map or BPF program, it's way beyond that..
>>
> It's really really hard to have a useful discussion about a security model when you have, as what appears to be an axiom, that a security model can't be created.

Agreed. Complete security denial makes development so much easier.
In the 1980's we were told that there was no way UNIX could ever be
made secure, especially because of IP networking and window systems.
It wasn't easy, what with everybody screaming (often literally) about
the performance impact and code complexity of every single change, no
matter how small.

I'm *not* advocating adopting it, but you could look at the Zephyr
security model as a worked example of a system similar to BPF that
does have a security model. I understand that there are many ways to
argue that this won't work for BPF, or that the model has issues of
its own, but have a look.

https://docs.zephyrproject.org/latest/security/security-overview.html

>
> If you actually feel this way, then I think you should not be advocating for allowing unprivileged containers to do the things that you think can't have a security model.
>
> I'm saying that I think there *can* be a security model.  But until the maintainers start to believe that, there won't be one.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-15 22:48             ` Andrii Nakryiko
@ 2023-06-23 22:18               ` Daniel Borkmann
  2023-06-26 22:08                 ` Andrii Nakryiko
  0 siblings, 1 reply; 72+ messages in thread
From: Daniel Borkmann @ 2023-06-23 22:18 UTC (permalink / raw)
  To: Andrii Nakryiko, Christian Brauner
  Cc: Djalal Harouni, Andrii Nakryiko, bpf, linux-security-module,
	keescook, lennart, cyphar, luto, kernel-team, Sargun Dhillon

On 6/16/23 12:48 AM, Andrii Nakryiko wrote:
> On Wed, Jun 14, 2023 at 2:39 AM Christian Brauner <brauner@kernel.org> wrote:
>> On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote:
>>> On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
>>> <andrii.nakryiko@gmail.com> wrote:
>>>> On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote:
>>>>> On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
>>>>> <andrii.nakryiko@gmail.com> wrote:
>>>>>> On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Andrii,
>>>>>>>
>>>>>>> On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
>>>>>>>>
>>>>>>>> ...
>>>>>>>> creating new BPF objects like BPF programs, BPF maps, etc.
>>>>>>>
>>>>>>> Is there a reason for coupling this only with the userns?
>>>>>>
>>>>>> There is no coupling. Without userns it is at least possible to grant
>>>>>> CAP_BPF and other capabilities from init ns. With user namespace that
>>>>>> becomes impossible.
>>>>>
>>>>> But these are not the same: delegate full cap vs delegate an fd mask?
>>>>
>>>> What FD mask are we talking about here? I don't recall us talking
>>>> about any FD masks, so this one is a bit confusing without more
>>>> context.
>>>
>>> Ah err, sorry yes referring to fd token (which I assumed is a mask of
>>> allowed operations or something like that).
>>>
>>> So I want the possibility to delegate the fd token in the init userns.
>>>
>>>>>
>>>>> One can argue unprivileged in init userns is the same privileged in
>>>>> nested userns
>>>>> Getting to delegate fd in init userns, then in nested ones seems logical...
>>>>
>>>> Again, sorry, I'm not following. Can you please elaborate what you mean?
>>>
>>> I mean can we use the fd token in the init user namespace too? not
>>> only in the nested user namespaces but in the first one? Sorry I
>>> didn't check the code.
>>>
> 
> [...]
> 
>>>
>>>>> Having the fd or "token" that gives access rights pinned in two
>>>>> separate bpffs mounts seems too much, it crosses namespaces (mount,
>>>>> userns etc), environments setup by privileged...
>>>>
>>>> See above, there is nothing namespaceable about BPF itself, and BPF
>>>> token as well. If some production setup benefits from pinning one BPF
>>>> token in multiple places, I don't see the problem with that.
>>>>
>>>>>
>>>>> I would just make it per bpffs mount and that's it, nothing more. If a
>>>>> program wants to bind mount it somewhere else then it's not a bpf
>>>>> problem.
>>>>
>>>> And if some application wants to pin BPF token, why would that be BPF
>>>> subsystem's problem as well?
>>>
>>> The credentials, capabilities, keyring, different namespaces, etc are
>>> all attached to the owning user namespace, if the BPF subsystem goes
>>> its own way and creates a token to split up CAP_BPF without following
>>> that model, then it's definitely a BPF subsystem problem...  I don't
>>> recommend that.
>>>
>>> Feels it's going more of a system-wide approach opening BPF
>>> functionality where ultimately it clashes with the argument: delegate
>>> a subset of BPF functionality to a *trusted* unprivileged application.
>>> My reading of delegation is within a container/service hierarchy
>>> nothing more.
>>
>> You're making the exact arguments that Lennart, Aleksa, and I have been
>> making in the LSFMM presentation about this topic. It's even recorded:
> 
> Alright, so (I think) I get a pretty good feel now for what the main
> concerns are, and why people are trying to push this to be an FS. And
> it's not so much that BPF token grants bpf() syscall usage to unpriv
> (but trusted) workloads or that BPF itself is not namespaceable. The
> main worry is that BPF token, once issued, could be
> illegally/uncontrollably passed outside of container, intentionally or
> not. And by having this association with mount namespace (through BPF
> FS) we automatically limit the sharing to only the container that has access
> to that BPF FS.

+1

> So I agree that it makes sense to have this mount namespace
> association, but I also would like to keep BPF token to be a separate
> entity from BPF FS itself, and have the ability to have multiple
> different BPF tokens exposed in a single BPF FS instance. I think the
> latter is important.
> 
> So how about this slight modification: when a BPF token is created
> using BPF_TOKEN_CREATE command, the user has to provide an FD for
> "associated" BPF FS instance (superblock). What that does is allows
> BPF token to be created with BPF FS and/or mount namespace association
> set in stone. After that BPF token can only be pinned in that BPF FS
> instance and cannot leave the boundaries of that mount namespace
> (specific details to be worked out, this is new area for me, so I'm
> sorry if I'm missing nuances).

Given bpffs is not a singleton and there can be multiple bpffs instances
in a container, couldn't we make the token a special bpffs mount/mode?
Something like single .token file in that mount (for example) which can
be opened and the fd then passed along for prog/map creation? And given
the multiple mounts, this also allows potentially for multiple tokens?
In other words, this is already set up by the container manager when it
sets up mounts rather than later, and the regular bpffs instance is sth
separate from all that. Meaning, in your container you get the usual
bpffs instance and then one or more special bpffs instances as tokens
at different paths (and in future they could unlock different subset of
bpf functionality for example).

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-22 18:20           ` Andrii Nakryiko
@ 2023-06-23 23:07             ` Toke Høiland-Jørgensen
  2023-06-26 22:08               ` Andrii Nakryiko
  0 siblings, 1 reply; 72+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-06-23 23:07 UTC (permalink / raw)
  To: Andrii Nakryiko, Maryam Tahhan
  Cc: Andy Lutomirski, Andrii Nakryiko, bpf, linux-security-module,
	Kees Cook, Christian Brauner, lennart, cyphar, kernel-team

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

>> applications meets the needs of these PODs that need to do
>> privileged/bpf things without any tokens. Ultimately you are trusting
>> these apps in the same way as if you were granting a token.
>
> Yes, absolutely. As I mentioned very explicitly, it's the question of
> trusting application. Service vs token is implementation details, but
> the one that has huge implications in how applications are built,
> tested, versioned, deployed, etc.

So one thing that I don't really get is why such a "trusted application"
needs to be run in a user namespace in the first place? If it's trusted,
why not simply run it as a privileged container (without the user
namespace) and grant it the right system-level capabilities, instead of
going to all this trouble just to punch a hole in the user namespace
isolation?

-Toke


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-23 15:10                   ` Andy Lutomirski
@ 2023-06-23 23:23                     ` Daniel Borkmann
  2023-06-24 13:59                       ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Daniel Borkmann @ 2023-06-23 23:23 UTC (permalink / raw)
  To: Andy Lutomirski, Andrii Nakryiko, Maryam Tahhan
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team

On 6/23/23 5:10 PM, Andy Lutomirski wrote:
> On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote:
>> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
>>
>>> Hopefully you can see where I'm going with this. And this is just one
>>> random tiny example. We can think up tons of other cases to prove BPF
>>> is not isolatable to any sort of "container".
>>
>> No.  You have not come up with an example of why BPF is not isolatable
>> to a container.  You have come up with an example of why binding to a
>> sched_switch raw tracepoint does not make sense in a container without
>> additional mechanisms to give it well defined functionality and
>> appropriate security.

One big blocker for isolating BPF to a container is CPU hardware bugs.
There has been plenty of mitigation effort so that the flexibility cannot
be abused as a tool, e.g. as discussed in [0], but ultimately it's a cat and
mouse game and vendors are also not really transparent. So actual
reasonable discussion can be resumed once CPU vendors get their stuff
fixed.

   [0] https://popl22.sigplan.org/details/prisc-2022-papers/11/BPF-and-Spectre-Mitigating-transient-execution-attacks

> Thinking about this some more:
> 
> Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example).  The workload is in the container.  The tracepoint is global.  Kernel memory is global unless something that is trusted and understands the containers is doing the reading.  And proxying BPF is a mess.

Agree that proxy is a mess for various reasons stated earlier.

> So here are a couple of possible solutions:
> 
> (a) Improve BPF maps a bit so that BPF maps work well in containers.  It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags.  (IIRC my patch series was a decent step in this direction.)  Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container.  So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container)", and the daemon arranges for the requesting container to have access to the map it needs to get the data.

I don't think it's very practical, meaning the vast majority of applications
out there today are tightly coupled BPF code + user space application, and in
a lot of cases programs are dynamically created. This would require somehow
splitting up parts of your application to run outside the container in hostns
and other parts inside the container.. for the sake of the mentioned example
it's something fairly static, but real-world applications look different and
are much more complex.

> (b) Make a way to pass a pre-approved program into a container.  So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container.

Same as above. Programs are in most cases very tightly coupled to the application
itself. I'm not sure if the ask is to redesign/implement all the existing user
space infra.

> I think (a) is better.  In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container.
> 
> For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers.

Worst case, sure, but it's not the point. These containers which would receive
the tokens are part of your trusted compute base.. so it's up to the specific
applications and their surrounding infrastructure with regards to what problem
they solve where and approved by operators/platform engs to deploy in your cluster.
I don't particularly see that there's a performance problem. Andrii specifically
mentioned /trusted unprivileged applications/.

> And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance.  You want *one* XDP program fanning the packets out to the relevant containers.
> 
> If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation.
> 
> --Andy
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-23 23:23                     ` Daniel Borkmann
@ 2023-06-24 13:59                       ` Andy Lutomirski
  2023-06-24 15:28                         ` Andy Lutomirski
  2023-06-26 22:31                         ` Andrii Nakryiko
  0 siblings, 2 replies; 72+ messages in thread
From: Andy Lutomirski @ 2023-06-24 13:59 UTC (permalink / raw)
  To: Daniel Borkmann, Andrii Nakryiko, Maryam Tahhan
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team



On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:
> On 6/23/23 5:10 PM, Andy Lutomirski wrote:
>> On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote:
>>> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
>>>
>>>> Hopefully you can see where I'm going with this. And this is just one
>>>> random tiny example. We can think up tons of other cases to prove BPF
>>>> is not isolatable to any sort of "container".
>>>
>>> No.  You have not come up with an example of why BPF is not isolatable
>>> to a container.  You have come up with an example of why binding to a
>>> sched_switch raw tracepoint does not make sense in a container without
>>> additional mechanisms to give it well defined functionality and
>>> appropriate security.
>
> One big blocker for isolating BPF to a container is CPU hardware bugs.
> There has been plenty of mitigation effort so that the flexibility cannot
> be abused as a tool, e.g. as discussed in [0], but ultimately it's a cat and
> mouse game and vendors are also not really transparent. So actual
> reasonable discussion can be resumed once CPU vendors get their stuff
> fixed.
>
>    [0] https://popl22.sigplan.org/details/prisc-2022-papers/11/BPF-and-Spectre-Mitigating-transient-execution-attacks
>

By this standard, shouldn’t we just give up?  Let everyone map /dev/mem readonly and stop pretending we can implement any form of access control.

Of course, we don’t do this. We try pretty hard to squash bugs and keep programs from doing an end run around OS security.

>> Thinking about this some more:
>> 
>> Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example).  The workload is in the container.  The tracepoint is global.  Kernel memory is global unless something that is trusted and understands the containers is doing the reading.  And proxying BPF is a mess.
>
> Agree that proxy is a mess for various reasons stated earlier.
>
>> So here are a couple of possible solutions:
>> 
>> (a) Improve BPF maps a bit so that BPF maps work well in containers.  It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags.  (IIRC my patch series was a decent step in this direction.)  Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container.  So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container)", and the daemon arranges for the requesting container to have access to the map it needs to get the data.
>
> I don't think it's very practical, meaning the vast majority of applications
> out there today are tightly coupled BPF code + user space application, and in
> a lot of cases programs are dynamically created. This would require somehow
> splitting up parts of your application to run outside the container in hostns
> and other parts inside the container.. for the sake of the mentioned example
> it's something fairly static, but real-world applications look different and
> are much more complex.
>

It sounds like you are describing a situation where there is a workload in a container, where the *entire container* is part of the TCB, but the part of the workload that has the explicit right to read all of kernel memory (e.g. bpf_probe_read_kernel) is so tightly coupled to the container that no one outside the container wants to audit it.

And yet someone still wants to run it in a userns.
 
This is IMO a rather bizarre situation.

If I were operating a large fleet, and I had teams developing software to run in a container, I would not want to grant those containers this right without strict controls, and I don’t mean on/off controls. I would want strict auditing of *what exact BPF code* (including source) was run, and why, and who wrote it, and what the intended results are, and what limits access to the results, etc.  After all, we’re talking about the right, BY DESIGN, to access PII, payment card information, medical information, information protected by any jurisdiction’s data control rights, etc. Literally everything.  This ability, as described, isn’t “the right to use BPF.”  It is the right to *read all secrets*, intentionally.  (And modify them, with bpf_probe_write_user, possibly subject to some constraints.)


If this series was about passing a “may load kernel modules” token around, I think it would get an extremely chilly reception, even though we have module signatures.  I don’t see anything about BPF that makes BPF tokens more reasonable unless a real security model is developed first.

>> (b) Make a way to pass a pre-approved program into a container.  So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container.
>
> Same as above. Programs are in most cases very tightly coupled to the application
> itself. I'm not sure if the ask is to redesign/implement all the existing user
> space infra.
>
>> I think (a) is better.  In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container.
>> 
>> For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers.
>
> Worst case, sure, but it's not the point. These containers which would receive
> the tokens are part of your trusted compute base.. so it's up to the specific
> applications and their surrounding infrastructure with regards to what problem
> they solve where and approved by operators/platform engs to deploy in your cluster.
> I don't particularly see that there's a performance problem. Andrii specifically
> mentioned /trusted unprivileged applications/.
>
>> And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance.  You want *one* XDP program fanning the packets out to the relevant containers.
>> 
>> If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation.
>> 
>> --Andy
>>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-24 13:59                       ` Andy Lutomirski
@ 2023-06-24 15:28                         ` Andy Lutomirski
  2023-06-26 15:23                           ` Daniel Borkmann
  2023-06-27 10:22                           ` Djalal Harouni
  2023-06-26 22:31                         ` Andrii Nakryiko
  1 sibling, 2 replies; 72+ messages in thread
From: Andy Lutomirski @ 2023-06-24 15:28 UTC (permalink / raw)
  To: Daniel Borkmann, Andrii Nakryiko, Maryam Tahhan
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team



On Sat, Jun 24, 2023, at 6:59 AM, Andy Lutomirski wrote:
> On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:

>
> If this series was about passing a “may load kernel modules” token 
> around, I think it would get an extremely chilly reception, even though 
> we have module signatures.  I don’t see anything about BPF that makes 
> BPF tokens more reasonable unless a real security model is developed 
> first.
>

To be clear, I'm not saying that there should not be a mechanism to use BPF from a user namespace.  I'm saying the mechanism should have explicit access control.  It wouldn't need to solve all problems right away, but it should allow incrementally more features to be enabled as the access control solution gets more powerful over time.

BPF, unlike kernel modules, has a verifier.  While it would be a departure from current practice, permission to use BPF could come with an explicit list of allowed functions and allowed hooks.

(The hooks wouldn't just be a list, presumably -- permission to install an XDP program would be scoped to networks over which one has CAP_NET_ADMIN.  Other hooks would have their own scoping.  Attaching to a cgroup should (and maybe already does?) require some kind of permission on the cgroup.  Etc.)

If new, more restrictive functions are needed, they could be added.


Alternatively, people could try a limited form of BPF proxying.  It wouldn't need to be a full proxy -- an outside daemon really could approve the attachment of a BPF program, and it could parse the program, examine the list of functions it uses and what the proposed attachment is to, and make an educated decision.  This would need some API changes (maybe), but it seems eminently doable.
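
A minimal sketch of the "examine the list of functions it uses" step, assuming the daemon is handed the raw instruction array and holds an allowlist of helper IDs. The names insn_model, helper_allowed() and first_disallowed_call() are hypothetical and exist only for illustration; the struct copies the field layout of struct bpf_insn so the sketch builds without kernel headers. It says nothing about the attachment target, which would need a separate check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as struct bpf_insn in <linux/bpf.h>; redefined here
 * (under a hypothetical name) so the sketch builds without kernel headers. */
struct insn_model {
	uint8_t code;        /* opcode */
	uint8_t dst_reg : 4; /* destination register */
	uint8_t src_reg : 4; /* source register */
	int16_t off;         /* signed offset */
	int32_t imm;         /* signed immediate */
};

#define OP_CALL 0x85      /* BPF_JMP | BPF_CALL */
#define SRC_HELPER_CALL 0 /* imm holds a helper ID */
#define SRC_PSEUDO_CALL 1 /* BPF-to-BPF call, imm is a relative offset */

/* Linear search of the allowlist; fine for a short list. */
static bool helper_allowed(int32_t id, const int32_t *allow, size_t n_allow)
{
	for (size_t i = 0; i < n_allow; i++)
		if (allow[i] == id)
			return true;
	return false;
}

/* Returns the index of the first call instruction that is not an allowed
 * helper call, or -1 if every call passes. BPF-to-BPF calls are skipped;
 * any other call flavor (e.g. kfuncs) is conservatively rejected. */
static long first_disallowed_call(const struct insn_model *insns, size_t count,
				  const int32_t *allow, size_t n_allow)
{
	assert(insns != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (insns[i].code != OP_CALL)
			continue;
		if (insns[i].src_reg == SRC_PSEUDO_CALL)
			continue;
		if (insns[i].src_reg != SRC_HELPER_CALL ||
		    !helper_allowed(insns[i].imm, allow, n_allow))
			return (long)i;
	}
	return -1;
}
```

With an allowlist of { 1, 5 } (bpf_map_lookup_elem, bpf_ktime_get_ns), a program that calls helper 113 (bpf_probe_read_kernel) would be flagged at the offending instruction. A scan like this only lists what is called; it cannot tell what memory a helper ends up reading.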

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-24 15:28                         ` Andy Lutomirski
@ 2023-06-26 15:23                           ` Daniel Borkmann
  2023-07-04 20:48                             ` Andy Lutomirski
  2023-06-27 10:22                           ` Djalal Harouni
  1 sibling, 1 reply; 72+ messages in thread
From: Daniel Borkmann @ 2023-06-26 15:23 UTC (permalink / raw)
  To: Andy Lutomirski, Andrii Nakryiko, Maryam Tahhan
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team

On 6/24/23 5:28 PM, Andy Lutomirski wrote:
> On Sat, Jun 24, 2023, at 6:59 AM, Andy Lutomirski wrote:
>> On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:
>>
>> If this series was about passing a “may load kernel modules” token
>> around, I think it would get an extremely chilly reception, even though
>> we have module signatures.  I don’t see anything about BPF that makes
>> BPF tokens more reasonable unless a real security model is developed
>> first.
> 
> To be clear, I'm not saying that there should not be a mechanism to use BPF from a user namespace.  I'm saying the mechanism should have explicit access control.  It wouldn't need to solve all problems right away, but it should allow incrementally more features to be enabled as the access control solution gets more powerful over time.
> 
> BPF, unlike kernel modules, has a verifier.  While it would be a departure from current practice, permission to use BPF could come with an explicit list of allowed functions and allowed hooks.
> 
> (The hooks wouldn't just be a list, presumably -- permission to install an XDP program would be scoped to networks over which one has CAP_NET_ADMIN.  Other hooks would have their own scoping.  Attaching to a cgroup should (and maybe already does?) require some kind of permission on the cgroup.  Etc.)
> 
> If new, more restrictive functions are needed, they could be added.

Wasn't this the idea of the BPF tokens proposal, meaning you could create them with
restricted access as you mentioned - allowing an explicit subset of program types to
be loaded, subset of helpers/kfuncs, map types, etc.? Given you pass in this token
context upon program load-time (resp. map creation), the verifier is then extended
for restricted access. For example, see the bpf_token_allow_{cmd,map_type,prog_type}()
in this series. The user namespace relation was part of the use cases, but not strictly
part of the mechanism itself in this series.

With regards to the scoping, are you saying that the current design with the bitmasks
in the token create uapi is not flexible enough? If yes, what concrete alternative do
you propose?

> Alternatively, people could try a limited form of BPF proxying.  It wouldn't need to be a full proxy -- an outside daemon really could approve the attachment of a BPF program, and it could parse the program, examine the list of functions it uses and what the proposed attachment is to, and make an educated decision.  This would need some API changes (maybe), but it seems eminently doable.

Thinking about this from a k8s environment angle, I think this wouldn't really be
practical for various reasons.. you now need to maintain two implementations for your
container images which ship BPF: one which loads programs as today, and another one
which talks to this proxy if available. Then you also need to standardize and support
the various loader libraries for this, you need to deal with yet one more component
in your cluster which could fail (compared to talking to kernel directly), and being
dependent on new proxy functionality becomes similar to waiting for new kernels
to hit mainstream: it could potentially take a very long time until production upgrades.
What is being proposed here in this regard is less complex given no extra proxy is
involved. I would certainly prefer a kernel-based solution.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-23  1:02                 ` Andy Lutomirski
  2023-06-23 15:10                   ` Andy Lutomirski
@ 2023-06-26 22:08                   ` Andrii Nakryiko
  1 sibling, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-26 22:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Maryam Tahhan, Andrii Nakryiko, bpf, linux-security-module,
	Kees Cook, Christian Brauner, lennart, cyphar, kernel-team

On Thu, Jun 22, 2023 at 6:03 PM Andy Lutomirski <luto@kernel.org> wrote:
>
>
>
> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
> > On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
> >
> > As for CAP_BPF being too broad: it is broad, yes. If you have good ideas how to
> > break it down some more -- please propose. But this is all orthogonal,
> > because the blocking problem is fundamental incompatibility of user
> > namespaces (and their implied isolation and sandboxing of workloads)
> > and BPF functionality, which is global by its very nature. The latter
> > is unavoidable in principle.
>
> How, exactly, is BPF global by its very nature?
>
> The *implementation* has some issues with globalness.  Much of it should be fixable.
>

bpf_probe_read_kernel() is widely used and required for real-world
applications. It's global by its nature and in principle not
restrictable. We can say that we'll just disable applications that use
bpf_probe_read_kernel(), but the goal is to enable applications that
are *practically useful*, not just some restricted set of programs
that are provably contained.

> >
> > No matter how much you break down CAP_BPF, you can't enforce that BPF
> > program won't interfere with applications in other containers. Or that
> > it won't "spy" on them. It's just not what BPF can enforce in
> > principle.
>
> The WHOLE POINT of the verifier is to attempt to constrain what BPF programs can and can't do.  There are bugs -- I get that.  There are helper functions that are fundamentally global.  But, in the absence of verifier bugs, BPF has actual boundaries to its functionality.

looking at your other replies, I think you realized yourself that
there are valid use cases where it's impossible to statically validate
boundaries

>
> >
> > So that comes back down to a question of trust and then controlled
> > delegation of BPF functionality. You trust workload with BPF usage
> > because you reviewed the BPF code, workload, testing, etc? Grant BPF
> > token and let that container use limited subset of BPF. Employ BPF LSM
> > to further restrict it beyond what BPF token can control.
> >
> > You cannot trust an application to not do something harmful? You
> > shouldn't grant it either CAP_BPF in init namespace, nor BPF token in
> > user namespace. That's it. Pick your poison.
>
> I think what's lost here is hardening vs restricting intended functionality.
>
> We have access control to restrict intended functionality.  We have other (and generally fairly ad-hoc and awkward) ways to flip off functionality because we want to reduce exposure to any bugs in it.
>
> BPF needs hardening -- this is well established.  Right now, this is accomplished by restricting it to global root (effectively).  It should have access controls, too, but it doesn't.
>
> >
> > But all this cannot be mechanically decided or enforced. There has to
> > be some humans involved in making these decisions. Kernel's job is to
> > provide building blocks to grant and control BPF functionality to the
> > extent that it is technically possible.
> >
>
> Exactly.  And it DOES NOT.  bpf maps, etc do not have sensible access controls.  Things that should not be global are global.  I'm saying the kernel should fix THAT.  Once it's in a state that it's at least credible to allow BPF in a user namespace, then come up with a way to allow it.
>
> > As for "something to isolate the pinned maps/progs by different apps
> > (why not DAC rules?)", there is no such thing, as I've explained
> > already.
> >
> > I can install sched_switch raw_tracepoint BPF program (if I'm allowed
> > to), and that program has system-wide observability. It cannot be
> > bound to an application.
>
> Great, a real example!
>
> Either:
>
> (a) don't run this in a container.  Have a service for the container to request the help of this program.
>
> (b) have a way to have root approve a particular program and expose *that* program to the container, and let the program have its own access controls internally (e.g. only output info that belongs to that container).
>
> > then what do we do when we switch from process A in container
> > X to process B in container Y? Is that event belonging to container X?
> > Or container Y?
>
> I don't know, but you had better answer this question before you run this thing in a container, not just for security but for basic functionality.  If you haven't defined what your program is even supposed to do in a container, don't run it there.

I think you are missing the point I'm making. A specific BPF program
that will use sched_switch is doing the correct and right thing (for
whatever that means in a specific case). We as humans designed,
implemented, validated, reviewed it and are confident enough (as much
as we can be with software) that it does the right thing. It doesn't
try to spy on things, doesn't try to disrupt things.

We know this as humans thanks to our internal development process.

But this is not *provable* in a mechanical sense such that the kernel
can validate and enforce this. And yet it's a practically useful
application which we'd like to be able to launch from inside the
container without rearchitecting and rewriting the entire world and
proxying everything through some external root service.

>
>
> > Hopefully you can see where I'm going with this. And this is just one
> > random tiny example. We can think up tons of other cases to prove BPF
> > is not isolatable to any sort of "container".
>
> No.  You have not come up with an example of why BPF is not isolatable to a container.  You have come up with an example of why binding to a sched_switch raw tracepoint does not make sense in a container without additional mechanisms to give it well defined functionality and appropriate security.
>
> Please stop conflating BPF (programs, maps, etc) with *attachments* of BPF programs to systemwide things.  They're both under the BPF umbrella.  They're not the same thing.

I'm not conflating things. Thinking about BPF maps and BPF programs in
isolation from them being attached somewhere in the kernel and doing
actual and useful work is not useful.

It's the end-to-end functionality, including attaching and running BPF
programs, that matters.

Pedantically drawing the line at the BPF program load step and saying
"this is BPF and everything else is not BPF" isn't really helpful. No
one cares about just loading and validating BPF programs. Developers
care about attaching and running them, that's what it all is about.

>
> Passing a token into a container that allows that container to do things like loading its own programs *and attaching them to raw tracepoints* is IMO a complete nonstarter.  It makes no sense.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-23  3:28               ` Andy Lutomirski
  2023-06-23 16:13                 ` Casey Schaufler
@ 2023-06-26 22:08                 ` Andrii Nakryiko
  1 sibling, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-26 22:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Maryam Tahhan, Andrii Nakryiko, bpf, linux-security-module,
	Kees Cook, Christian Brauner, lennart, cyphar, kernel-team

On Thu, Jun 22, 2023 at 8:29 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Thu, Jun 22, 2023, at 12:05 PM, Andrii Nakryiko wrote:
> > On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >>
> >>
> >> On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote:
> >> > On 22/06/2023 00:48, Andrii Nakryiko wrote:
> >> >>
> >> >>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
> >> >>>> BPF is still a privileged thing. You can't just say that any
> >> >>>> unprivileged application should be able to use BPF. That's why BPF
> >> >>>> token is about trusting unpriv application in a controlled environment
> >> >>>> (production) to not do something crazy. It can be enforced further
> >> >>>> through LSM usage, but in a lot of cases, when dealing with internal
> >> >>>> production applications it's enough to have a proper application
> >> >>>> design and rely on code review process to avoid any negative effects.
> >> >>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
> >> >>>
> >> >>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
> >> >> Please see Hao's reply ([0]) about his and Google's (not so rosy)
> >> >> experiences with building and using such BPF proxy. We (Meta)
> >> >> internally didn't go this route at all and strongly prefer not to.
> >> >> There are lots of downsides and complications to having a BPF proxy.
> >> >> In the end, this is just shuffling around where the decision about
> >> >> trusting a given application with BPF access is being made. BPF proxy
> >> >> adds lots of unnecessary logistical, operational, and development
> >> >> complexity, but doesn't magically make anything safer.
> >> >>
> >> >>    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
> >> >>
> >> > Apologies for being blunt, but  the token approach to me seems to be a
> >> > work around providing the right level/classification for a pod/container
> >> > in order to say you support unprivileged containers using eBPF. I think
> >> > if your container needs to do privileged things it should have and be
> >> > classified with the right permissions (privileges) to do what it needs
> >> > to do.
> >>
> >> Bluntness is great.
> >>
> >> I think that this whole level/classification thing is utterly wrong.  Replace "BPF" with basically anything else, and you'll see how absurd it is.
> >
> > BPF is not "anything else", it's important to understand that BPF is
> > inherently not compartmentalizable. And it's vast and generic in its
> > capabilities. This changes everything. So your analogies are
> > misleading.
> >
>
> file descriptors are "vast and generic" -- you can open sockets, files, things in /proc, things in /sys, device nodes, etc.  They are infinitely extensible.  They work in containers.
>
> What is so special about BPF?

Compare a socket, with its well-defined and constrained interface that defines what
you can do with it (send and receive bytes, in a controlled fashion),
to BPF programs, which intentionally are allowed to have an almost
arbitrarily complex control flow *controlled by the user*, can combine
dozens if not hundreds of "building blocks" (BPF helpers, kfuncs,
various BPF maps, etc), and can be activated at various points
deep in the kernel (running that custom user-provided code in kernel
space). I'd say that yeah, BPF is on another level as far as
genericity goes, compared to other interfaces.

And that's BPF's goal and appeal, nothing wrong with it. But I do
think BPF and sockets, files, things in /proc, etc are pretty
different in terms of how they can be proved and enforced to be
sandboxed.

>
> >>
> >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"
> >>
> >> That's very 1990's.  Maybe 1980's.  Of *course* giving access to a filesystem has some inherent security exposure.  So we can give containers access to *different* filesystems.  Or we can use ACLs.  Or MAC policy.  Or whatever.  We have many solutions, none of which are perfect, and we're doing okay.
> >>
> >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"
> >>
> >> The network is a big deal.  For some reason, it's cool these days to treat TCP as highly privileged.  You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.  You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.
> >>
> >> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.
> >>
> >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"
> >>
> >> My response is: what's wrong with BPF?  BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.
> >
> > Can you apply DAC rules to which kernel events BPF program can be run
> > on? Can you apply DAC rules to which in-kernel data structures a BPF
> > program can look at and make sure that it doesn't access a
> > task/socket/etc that "belongs" to some other container/user/etc?
>
> No, of course.
>
> If you have a BPF program that is granted the ability to read kernel data structures or to run in response to global events like this, it's basically a kernel module.  It may be subject to a verifier that imposes much stronger type safety than a kernel module is subject to, but it's still effectively a kernel module.
>
> We don't give containers special tokens that let them load arbitrary modules.  We should not give them special tokens that let them do things with BPF that are functionally equivalent to loading arbitrary kernel modules.
>
> But we do have ways that kernel modules (which are "vast and generic", too) can expose their functionality safely to containers.  BPF can learn to do this.
>
> >
> > Can we limit XDP or AF_XDP BPF programs from seeing and controlling
> > network traffic that will be eventually routed to a container that XDP
> > program "should not" have access to? Without making everything so slow
> > that it's useless?
>
> Of course you can -- assign an entire NIC or virtual function to a container, and let the XDP program handle that.  Or a vlan or a macvlan or whatever.  (I'm assuming XDP can be scoped like this.  I'm not that familiar with the details.)
>
> >
> >> I even *wrote the code*.
> >
> > Did you submit it upstream for review and wide discussion?
>
> Yes.
>
> > Did you
> > test it and integrate it with production workloads to prove that your
> > solution is actually a viable real-world solution and not a toy?
>
> I did test it.  I did not integrate it with production workloads.
>

Real-world use cases are the ultimate test of APIs and features. No
matter how brilliant and elegant the solution is, if it doesn't work
with real-world applications, it's pretty useless.

It's not that hard to allow only a very limited and very restrictive
subset of BPF to be loaded and attached from containers
without privileged permissions. But the point is to find a solution
that works for complicated (and sometimes very messy) real
applications that were validated by humans (to the best of their
abilities), but can't be proven to be contained within some container.


> > Writing the code doesn't mean solving the problem.
>
> Of course not.  My code was a little step in the right direction.  The BPF community was apparently not interested in it.
>
> >
> >> But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.
> >
> > I won't speak on behalf of the entire BPF community, but I'm trying to
> > explain that BPF cannot be reasonably sandboxed and has to be
> > privileged due to its global nature. And I haven't yet seen any
> > realistic counter-proposal to change that. And it's not about
> > ownership of the BPF map or BPF program, it's way beyond that..
> >
>
> It's really really hard to have a useful discussion about a security model when you have, as what appears to be an axiom, that a security model can't be created.
>
> If you actually feel this way, then I think you should not be advocating for allowing unprivileged containers to do the things that you think can't have a security model.
>
> I'm saying that I think there *can* be a security model.  But until the maintainers start to believe that, there won't be one.

See above, whatever security model you have in mind, it should be
workable with real-world applications. Building some elegant system
that will work for just a (rather small) subset of use cases isn't
appealing.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-23 22:18               ` Daniel Borkmann
@ 2023-06-26 22:08                 ` Andrii Nakryiko
  0 siblings, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-26 22:08 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Christian Brauner, Djalal Harouni, Andrii Nakryiko, bpf,
	linux-security-module, keescook, lennart, cyphar, luto,
	kernel-team, Sargun Dhillon

On Fri, Jun 23, 2023 at 3:18 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 6/16/23 12:48 AM, Andrii Nakryiko wrote:
> > On Wed, Jun 14, 2023 at 2:39 AM Christian Brauner <brauner@kernel.org> wrote:
> >> On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote:
> >>> On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
> >>> <andrii.nakryiko@gmail.com> wrote:
> >>>> On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote:
> >>>>> On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> >>>>> <andrii.nakryiko@gmail.com> wrote:
> >>>>>> On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Andrii,
> >>>>>>>
> >>>>>>> On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> >>>>>>>>
> >>>>>>>> ...
> >>>>>>>> creating new BPF objects like BPF programs, BPF maps, etc.
> >>>>>>>
> >>>>>>> Is there a reason for coupling this only with the userns?
> >>>>>>
> >>>>>> There is no coupling. Without userns it is at least possible to grant
> >>>>>> CAP_BPF and other capabilities from init ns. With user namespace that
> >>>>>> becomes impossible.
> >>>>>
> >>>>> But these are not the same: delegate full cap vs delegate an fd mask?
> >>>>
> >>>> What FD mask are we talking about here? I don't recall us talking
> >>>> about any FD masks, so this one is a bit confusing without more
> >>>> context.
> >>>
> >>> Ah err, sorry yes referring to fd token (which I assumed is a mask of
> >>> allowed operations or something like that).
> >>>
> >>> So I want the possibility to delegate the fd token in the init userns.
> >>>
> >>>>>
> >>>>> One can argue unprivileged in init userns is the same privileged in
> >>>>> nested userns
> >>>>> Getting to delegate fd in init userns, then in nested ones seems logical...
> >>>>
> >>>> Again, sorry, I'm not following. Can you please elaborate what you mean?
> >>>
> >>> I mean can we use the fd token in the init user namespace too? not
> >>> only in the nested user namespaces but in the first one? Sorry I
> >>> didn't check the code.
> >>>
> >
> > [...]
> >
> >>>
> >>>>> Having the fd or "token" that gives access rights pinned in two
> >>>>> separate bpffs mounts seems too much, it crosses namespaces (mount,
> >>>>> userns etc), environments setup by privileged...
> >>>>
> >>>> See above, there is nothing namespaceable about BPF itself, and BPF
> >>>> token as well. If some production setup benefits from pinning one BPF
> >>>> token in multiple places, I don't see the problem with that.
> >>>>
> >>>>>
> >>>>> I would just make it per bpffs mount and that's it, nothing more. If a
> >>>>> program wants to bind mount it somewhere else then it's not a bpf
> >>>>> problem.
> >>>>
> >>>> And if some application wants to pin BPF token, why would that be BPF
> >>>> subsystem's problem as well?
> >>>
> >>> The credentials, capabilities, keyring, different namespaces, etc are
> >>> all attached to the owning user namespace, if the BPF subsystem goes
> >>> its own way and creates a token to split up CAP_BPF without following
> >>> that model, then it's definitely a BPF subsystem problem...  I don't
> >>> recommend that.
> >>>
> >>> Feels it's going more of a system-wide approach opening BPF
> >>> functionality where ultimately it clashes with the argument: delegate
> >>> a subset of BPF functionality to a *trusted* unprivileged application.
> >>> My reading of delegation is within a container/service hierarchy
> >>> nothing more.
> >>
> >> You're making the exact arguments that Lennart, Aleksa, and I have been
> >> making in the LSFMM presentation about this topic. It's even recorded:
> >
> > Alright, so (I think) I get a pretty good feel now for what the main
> > concerns are, and why people are trying to push this to be an FS. And
> > it's not so much that BPF token grants bpf() syscall usage to unpriv
> > (but trusted) workloads or that BPF itself is not namespaceable. The
> > main worry is that BPF token, once issued, could be
> > illegally/uncontrollably passed outside of container, intentionally or
> > not. And by having this association with mount namespace (through BPF
> > FS) we automatically limit the sharing to only the container that has
> > access to that BPF FS.
>
> +1
>
> > So I agree that it makes sense to have this mount namespace
> > association, but I also would like to keep BPF token to be a separate
> > entity from BPF FS itself, and have the ability to have multiple
> > different BPF tokens exposed in a single BPF FS instance. I think the
> > latter is important.
> >
> > So how about this slight modification: when a BPF token is created
> > using BPF_TOKEN_CREATE command, the user has to provide an FD for
> > "associated" BPF FS instance (superblock). What that does is allow
> > BPF token to be created with BPF FS and/or mount namespace association
> > set in stone. After that BPF token can only be pinned in that BPF FS
> > instance and cannot leave the boundaries of that mount namespace
> > (specific details to be worked out, this is new area for me, so I'm
> > sorry if I'm missing nuances).
>
> Given bpffs is not a singleton and there can be multiple bpffs instances
> in a container, couldn't we make the token a special bpffs mount/mode?
> Something like single .token file in that mount (for example) which can
> be opened and the fd then passed along for prog/map creation? And given
> the multiple mounts, this also allows potentially for multiple tokens?
> In other words, this is already set up by the container manager when it
> sets up mounts rather than later, and the regular bpffs instance is sth
> separate from all that. Meaning, in your container you get the usual
> bpffs instance and then one or more special bpffs instances as tokens
> at different paths (and in future they could unlock different subset of
> bpf functionality for example).

Just from a technical point of view we could do that. But I see a lot
of value in keeping BPF token creation as part of BPF syscall and its
API. And the main issue, I believe, was not allowing BPF token to
escape the intended container, which should be more than covered by
BPF_TOKEN_CREATE pinning a token into provided BPF FS instance and not
allowing it to be repinned after that.
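
To make the intended invariant concrete, here is a small userspace model of "bound to one BPF FS instance at creation, never repinned". It is only a sketch of the rule as described in this mail, not the kernel implementation; bpffs_model, token_bind_model and both functions are hypothetical names, and the errno values are arbitrary choices for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one BPF FS instance (superblock). */
struct bpffs_model {
	int id;
};

/* Hypothetical token model: remembers the instance it was created against. */
struct token_bind_model {
	const struct bpffs_model *fs; /* fixed at creation time */
	bool pinned;                  /* set by creation, never cleared */
};

/* Creation binds the token to the given instance and pins it there. */
static struct token_bind_model model_token_create(const struct bpffs_model *fs)
{
	assert(fs != NULL);
	return (struct token_bind_model){ .fs = fs, .pinned = true };
}

/* Any later pin attempt is refused: a different instance is rejected
 * outright, and the original instance is rejected because the token is
 * already pinned. The final return is reached only for a token that was
 * never pinned, which model_token_create() does not produce. */
static int model_token_repin(const struct token_bind_model *tok,
			     const struct bpffs_model *target)
{
	assert(tok != NULL && target != NULL);

	if (tok->fs != target)
		return -EXDEV;
	if (tok->pinned)
		return -EPERM;
	return 0;
}
```

In this model the only path that ever exposes the token is the one chosen at creation, so visibility follows whoever can reach that BPF FS mount.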

>
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-23 23:07             ` Toke Høiland-Jørgensen
@ 2023-06-26 22:08               ` Andrii Nakryiko
  2023-07-04 21:05                 ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-26 22:08 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Maryam Tahhan, Andy Lutomirski, Andrii Nakryiko, bpf,
	linux-security-module, Kees Cook, Christian Brauner, lennart,
	cyphar, kernel-team

On Fri, Jun 23, 2023 at 4:07 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> >> applications meets the needs of these PODs that need to do
> >> privileged/bpf things without any tokens. Ultimately you are trusting
> >> these apps in the same way as if you were granting a token.
> >
> > Yes, absolutely. As I mentioned very explicitly, it's the question of
> > trusting application. Service vs token is implementation details, but
> > the one that has huge implications in how applications are built,
> > tested, versioned, deployed, etc.
>
> So one thing that I don't really get is why such a "trusted application"
> needs to be run in a user namespace in the first place? If it's trusted,
> why not simply run it as a privileged container (without the user
> namespace) and grant it the right system-level capabilities, instead of
> going to all this trouble just to punch a hole in the user namespace
> isolation?

Because it's still useful to provide isolation that user namespace
provides in all other aspects besides BPF usage.

The fact that it's a trusted application doesn't mean that bugs don't
happen, or that some action that was not intended might be attempted
(due to a bug, some deep unintended library "feature", or just because
someone didn't anticipate some interaction).

Trusted here means we believe our BPF usage is not going to spy on
sensitive data, or attempt to disrupt other workloads, because of
design and code reviews, and we intend to maintain that property. But
people are still involved, of course, and bugs do happen. We'd like to
get as much protection as possible, and that's what the user namespace
is offering.

For BPF-side of things, we have to trust the process because there is
no technical solution. Running outside the user namespace we also
don't have any guarantees about BPF. We just have even less protection
in all other aspects outside of BPF. We are trying to improve our
story with user namespace to mitigate what's mitigatable.


>
> -Toke
>


* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-24 13:59                       ` Andy Lutomirski
  2023-06-24 15:28                         ` Andy Lutomirski
@ 2023-06-26 22:31                         ` Andrii Nakryiko
  1 sibling, 0 replies; 72+ messages in thread
From: Andrii Nakryiko @ 2023-06-26 22:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Borkmann, Maryam Tahhan, Andrii Nakryiko, bpf,
	linux-security-module, Kees Cook, Christian Brauner, lennart,
	cyphar, kernel-team

On Sat, Jun 24, 2023 at 7:00 AM Andy Lutomirski <luto@kernel.org> wrote:
>
>
>
> On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:
> > On 6/23/23 5:10 PM, Andy Lutomirski wrote:
> >> On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote:
> >>> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
> >>>
> >>>> Hopefully you can see where I'm going with this. And this is just one
> >>>> random tiny example. We can think up tons of other cases to prove BPF
> >>>> is not isolatable to any sort of "container".
> >>>
> >>> No.  You have not come up with an example of why BPF is not isolatable
> >>> to a container.  You have come up with an example of why binding to a
> >>> sched_switch raw tracepoint does not make sense in a container without
> >>> additional mechanisms to give it well defined functionality and
> >>> appropriate security.
> >
> > One big blocker to isolating BPF to a container is CPU hardware bugs.
> > There has been plenty of mitigation effort so that the flexibility
> > cannot be abused as a tool, e.g. as discussed in [0], but ultimately
> > it's a cat and mouse game and vendors are also not really transparent. So
> > an actual reasonable discussion can be resumed once CPU vendors get their
> > stuff fixed.
> >
> >    [0]
> > https://popl22.sigplan.org/details/prisc-2022-papers/11/BPF-and-Spectre-Mitigating-transient-execution-attacks
> >
>
> By this standard, shouldn’t we just give up?  Let everyone map /dev/mem readonly and stop pretending we can implement any form of access control.
>
> Of course, we don’t do this. We try pretty hard to squash bugs and keep programs from doing an end run around OS security.
>
> >> Thinking about this some more:
> >>
> >> Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example).  The workload is in the container.  The tracepoint is global.  Kernel memory is global unless something that is trusted and understands the containers is doing the reading.  And proxying BPF is a mess.
> >
> > Agree that proxy is a mess for various reasons stated earlier.
> >
> >> So here are a couple of possible solutions:
> >>
> >> (a) Improve BPF maps a bit so that BPF maps work well in containers.  It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags.  (IIRC my patch series was a decent step in this direction.)  Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container.  So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such" (where the BPF program code presumably comes from a library outside the container), and the daemon arranges for the requesting container to have access to the map it needs to get the data.
> >
> > I don't think it's very practical, meaning the vast majority of applications
> > out there today are tightly coupled BPF code + user space application, and in
> > a lot of cases programs are dynamically created. This would require somehow
> > splitting up parts of your application to run outside the container in hostns
> > and other parts inside the container.. for the sake of the mentioned example
> > it's something fairly static, but real-world applications look different and
> > are much more complex.
> >
>
> It sounds like you are describing a situation where there is a workload in a container, where the *entire container* is part of the TCB, but the part of the workload that has the explicit right to read all of kernel memory (e.g. bpf_probe_read_kernel) is so tightly coupled to the container that no one outside the container wants to audit it.
>
> And yet someone still wants to run it in a userns.
>

Yes, to get all the other benefits of userns. Yes, BPF isolation
cannot be enforced and we rely on a human-driven process to decide
whether it's ok to run BPF inside each specific container. But why
can't we also get all the other benefits of userns outside of BPF
usage?

BPF parts are critical for such applications, but they also normally
have a huge user-space part, and use large common libraries, so there
is a lot of benefit to having as much userns-provided isolation as
possible.


> This is IMO a rather bizarre situation.
>
> If I were operating a large fleet, and I had teams developing software to run in a container, I would not want to grant those containers this right without strict controls, and I don’t mean on/off controls. I would want strict auditing of *what exact BPF code* (including source) was run, and why, and who wrote it, and what the intended results are, and what limits access to the results, etc.  After all, we’re talking about the right, BY DESIGN, to access PII, payment card information, medical information, information protected by any jurisdiction’s data control rights, etc. Literally everything.  This ability, as described, isn’t “the right to use BPF.”  It is the right to *read all secrets*, intentionally.  (And modify them, with bpf_probe_write_user, possibly subject to some constraints.)

What makes you think this is not how it's actually done in practice
already (except right now we don't have BPF token, so it's
all-or-nothing, userns or not, root or not, which is overall worse than
what we'll get with BPF token + userns)?

Audit, code review, proper development practices. Then discussions and
reviews between team running container manager and team with BPF-based
workload to make decisions whether it's safe to allow BPF access (and
to what degree) and how teams will maintain privacy and safety
obligations.


>
>
> If this series was about passing a “may load kernel modules” token around, I think it would get an extremely chilly reception, even though we have module signatures.  I don’t see anything about BPF that makes BPF tokens more reasonable unless a real security model is developed first.

If we had dozens of teams developing and loading/unloading their
custom kernel modules all the time, it might not have sounded so
ridiculous?

>
> >> (b) Make a way to pass a pre-approved program into a container.  So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container.
> >
> > Same as above. Programs are in most cases very tightly coupled to the
> > application itself. I'm not sure if the ask is to redesign/implement all
> > the existing user space infra.
> >
> >> I think (a) is better.  In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container.
> >>
> >> For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers.
> >
> > Worst case, sure, but that's not the point. The containers which would
> > receive the tokens are part of your trusted compute base, so it's up to
> > the specific applications and their surrounding infrastructure what
> > problem they solve and where, as approved by operators/platform engineers
> > for deployment in your cluster. I don't particularly see a performance
> > problem. Andrii specifically mentioned /trusted unprivileged
> > applications/.

Yep, performance is not why this is being done.

> >
> >> And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance.  You want *one* XDP program fanning the packets out to the relevant containers.
> >>
> >> If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation.
> >>
> >> --Andy
> >>


* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-24 15:28                         ` Andy Lutomirski
  2023-06-26 15:23                           ` Daniel Borkmann
@ 2023-06-27 10:22                           ` Djalal Harouni
  1 sibling, 0 replies; 72+ messages in thread
From: Djalal Harouni @ 2023-06-27 10:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Borkmann, Andrii Nakryiko, Maryam Tahhan, Andrii Nakryiko,
	bpf, linux-security-module, Kees Cook, Christian Brauner,
	lennart, cyphar, kernel-team

On Sat, Jun 24, 2023 at 5:28 PM Andy Lutomirski <luto@kernel.org> wrote:
>
>
>
> On Sat, Jun 24, 2023, at 6:59 AM, Andy Lutomirski wrote:
> > On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:
>
> >
> > If this series was about passing a “may load kernel modules” token
> > around, I think it would get an extremely chilly reception, even though
> > we have module signatures.  I don’t see anything about BPF that makes
> > BPF tokens more reasonable unless a real security model is developed
> > first.
> >
>
> To be clear, I'm not saying that there should not be a mechanism to use BPF from a user namespace.  I'm saying the mechanism should have explicit access control.  It wouldn't need to solve all problems right away, but it should allow incrementally more features to be enabled as the access control solution gets more powerful over time.
>
> BPF, unlike kernel modules, has a verifier.  While it would be a departure from current practice, permission to use BPF could come with an explicit list of allowed functions and allowed hooks.
>
> (The hooks wouldn't just be a list, presumably -- permission to install an XDP program would be scoped to networks over which one has CAP_NET_ADMIN, presumably.  Other hooks would have their own scoping.  Attaching to a cgroup should (and maybe already does?) require some kind of permission on the cgroup.  Etc.)
>
> If new, more restrictive functions are needed, they could be added.
>

This seems to align with BPF fd/token delegation. I asked in another
thread whether more context/policies could be provided from user space
when configuring the fd, and the answer was that it can be added on top
as a follow-up.

The user namespace is just one use case of many, as also confirmed in
this reply [0]. Getting it to work in the init userns should be the
first logical step anyway. Once you have an fd, you can delegate it or
pass it around to children that create nested user namespaces, etc., as
container managers currently do when they set up environments,
including the uid mapping. And of course there should be some
mechanism to ensure that a delegated fd comes from, say, a parent user
namespace before it is used, and to deny any cross-namespace usage.
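
The "comes from a parent user namespace" check can be sketched as a
walk up the namespace tree. This is a minimal Python model, not kernel
code: the UserNs and DelegatedFd classes and the may_use() function are
hypothetical names invented for illustration, and the patch series does
not necessarily implement this rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class UserNs:
    """Toy user namespace; only the parent link matters here."""
    name: str
    parent: Optional["UserNs"] = None


@dataclass(eq=False)
class DelegatedFd:
    """Toy stand-in for a delegated fd, remembering where it was created."""
    created_in: UserNs


def is_ancestor_or_self(candidate: UserNs, ns: UserNs) -> bool:
    """Return True if `candidate` is `ns` or one of its ancestors."""
    cur: Optional[UserNs] = ns
    while cur is not None:
        if cur is candidate:
            return True
        cur = cur.parent
    return False


def may_use(fd: DelegatedFd, current: UserNs) -> bool:
    """Hypothetical rule: the fd must originate in the caller's own
    namespace or in one of its ancestors; anything else is treated as
    cross-namespace usage and denied."""
    return is_ancestor_or_self(fd.created_in, current)


# init -> manager -> {c1, c2}
init_ns = UserNs("init")
manager_ns = UserNs("manager", init_ns)
c1_ns = UserNs("c1", manager_ns)
c2_ns = UserNs("c2", manager_ns)
```

With this rule a descriptor created by the manager is usable in either
container, while one created in c1 is rejected in c2 and in the manager.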


> Alternatively, people could try a limited form of BPF proxying.  It wouldn't need to be a full proxy -- an outside daemon really could approve the attachment of a BPF program, and it could parse the program, examine the list of function it uses and what the proposed attachment is to, and make an educated decision.  This would need some API changes (maybe), but it seems eminently doable.
>

Even a *limited* BPF proxying seems more in the opposite direction of
what you are suggesting above?

If I have an fd or the bpffs mount with a token properly set up by the
manager, I can use it directly inside my containers and load small bpf
programs without talking to another external API of another
container. I assume the manager passed me the rights or already
pre-approved the operation.

Of course there is also the case of approving the attachment of bpf
programs without passing an fd/token, which I assume is your point, or
in other words denying it, which makes perfect sense. Then yes, an
outside daemon could do this: systemd, container managers, etc., with
the help of LSMs, could *deny* attachment of BPF programs without any
external API changes (they already support LSMs). IIRC there may
already be a hook in the bpf() syscall to restrict some program types,
so future bpf token work should add in-kernel LSM + bpf-lsm hooks,
ensure they are called with the full context, and restrict further.

So for the "limited form of BPF proxying... to approve attachment...",
I think fd delegation via a bpffs mount (which requires privileges to
set up), with in-kernel LSM hooks on top to tighten this up, is the
way to go.


[0] https://lore.kernel.org/bpf/CAEf4BzbjGBY2=XGmTBWX3Vrgkc7h0FRQMTbB-SeKEf28h6OhAQ@mail.gmail.com/


* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-26 15:23                           ` Daniel Borkmann
@ 2023-07-04 20:48                             ` Andy Lutomirski
  2023-07-04 21:06                               ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2023-07-04 20:48 UTC (permalink / raw)
  To: Daniel Borkmann, Andrii Nakryiko, Maryam Tahhan
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team



On Mon, Jun 26, 2023, at 8:23 AM, Daniel Borkmann wrote:
> On 6/24/23 5:28 PM, Andy Lutomirski wrote:
>> On Sat, Jun 24, 2023, at 6:59 AM, Andy Lutomirski wrote:
>>> On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:
>>>
>>> If this series was about passing a “may load kernel modules” token
>>> around, I think it would get an extremely chilly reception, even though
>>> we have module signatures.  I don’t see anything about BPF that makes
>>> BPF tokens more reasonable unless a real security model is developed
>>> first.
>> 
>> To be clear, I'm not saying that there should not be a mechanism to use BPF from a user namespace.  I'm saying the mechanism should have explicit access control.  It wouldn't need to solve all problems right away, but it should allow incrementally more features to be enabled as the access control solution gets more powerful over time.
>> 
>> BPF, unlike kernel modules, has a verifier.  While it would be a departure from current practice, permission to use BPF could come with an explicit list of allowed functions and allowed hooks.
>> 
>> (The hooks wouldn't just be a list, presumably -- permission to install an XDP program would be scoped to networks over which one has CAP_NET_ADMIN, presumably.  Other hooks would have their own scoping.  Attaching to a cgroup should (and maybe already does?) require some kind of permission on the cgroup.  Etc.)
>> 
>> If new, more restrictive functions are needed, they could be added.
>
> Wasn't this the idea of the BPF tokens proposal, meaning you could create
> them with restricted access as you mentioned - allowing an explicit subset
> of program types to be loaded, subset of helpers/kfuncs, map types, etc..
> Given you pass in this token context upon program load-time (resp. map
> creation), the verifier is then extended for restricted access. For
> example, see the bpf_token_allow_{cmd,map_type,prog_type}() in this
> series. The user namespace relation was part of the use cases, but not
> strictly part of the mechanism itself in this series.

Hmm. It's very coarse grained.
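
To make "coarse grained" concrete, here is a minimal Python model of a
token that is nothing more than three bitmasks, in the spirit of the
bpf_token_allow_{cmd,map_type,prog_type}() checks quoted above. The
Token class, its field names and the numeric IDs are hypothetical and
do not mirror the uapi in the series; this is a sketch of the idea, not
the kernel implementation.

```python
from dataclasses import dataclass

# Hypothetical numeric IDs, chosen for illustration only; they are not
# the values of the kernel's bpf_cmd / bpf_map_type / bpf_prog_type enums.
CMD_MAP_CREATE, CMD_PROG_LOAD, CMD_BTF_LOAD = 0, 1, 2
MAP_HASH, MAP_ARRAY = 0, 1
PROG_XDP, PROG_KPROBE = 0, 1


def mask(*ids: int) -> int:
    """Build a bitmask with one bit set per allowed ID."""
    out = 0
    for i in ids:
        out |= 1 << i
    return out


@dataclass(frozen=True)
class Token:
    allowed_cmds: int
    allowed_map_types: int
    allowed_prog_types: int

    def allow_cmd(self, cmd: int) -> bool:
        return bool(self.allowed_cmds & (1 << cmd))

    def allow_map_type(self, map_type: int) -> bool:
        return bool(self.allowed_map_types & (1 << map_type))

    def allow_prog_type(self, prog_type: int) -> bool:
        return bool(self.allowed_prog_types & (1 << prog_type))


# A token that permits creating array maps and loading XDP programs.
token = Token(
    allowed_cmds=mask(CMD_MAP_CREATE, CMD_PROG_LOAD),
    allowed_map_types=mask(MAP_ARRAY),
    allowed_prog_types=mask(PROG_XDP),
)
```

Every decision in this model depends only on the kind of object being
requested. Nothing in it can express which program, which cgroup or
which device the request is about.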

Also, the bpf() attach API seems to be largely (completely?) missing what I would expect to be basic access controls on the things being attached to.   For example, the whole cgroup_bpf_prog_attach() path seems to be entirely missing any checks as to whether its caller has any particular permission over the cgroup in question.  It doesn't even check whether the cgroup is being accessed from the current userns (i.e. whether the fd refers to a struct file with f_path.mnt belonging to the current userns).  So the API in this patchset has no way to restrict permission to attach to cgroups to only apply to cgroups belonging to the container.
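
As a rough illustration of the kind of check described above, the
following Python model ties an attach decision to the user namespace
that owns the mount the cgroup fd was opened through. All class and
function names are hypothetical, the paths are made up, and this is not
existing kernel code; it only shows the shape of the missing check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserNs:
    name: str


@dataclass(frozen=True)
class Mount:
    owner: UserNs  # user namespace that owns the mount


@dataclass(frozen=True)
class CgroupFile:
    path: str
    mnt: Mount  # stands in for f_path.mnt of the cgroup fd


def may_attach_to_cgroup(current: UserNs, cgroup: CgroupFile) -> bool:
    """Hypothetical policy: allow the attach only if the fd refers to a
    file on a mount owned by the caller's user namespace."""
    return cgroup.mnt.owner == current


init_ns = UserNs("init")
container_ns = UserNs("container")

# Illustrative paths only.
host_cgroup = CgroupFile("/sys/fs/cgroup/system.slice", Mount(init_ns))
own_cgroup = CgroupFile("/sys/fs/cgroup/workload", Mount(container_ns))
```

Under this rule the container can attach to its own cgroup but not to
one reached through a host-owned mount.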

>
> With regards to the scoping, are you saying that the current design with
> the bitmasks in the token create uapi is not flexible enough? If yes, what
> concrete alternative do you propose?
>
>> Alternatively, people could try a limited form of BPF proxying.  It wouldn't need to be a full proxy -- an outside daemon really could approve the attachment of a BPF program, and it could parse the program, examine the list of function it uses and what the proposed attachment is to, and make an educated decision.  This would need some API changes (maybe), but it seems eminently doable.
>
> Thinking about this from a k8s environment angle, I think this wouldn't
> really be practical for various reasons.. you now need to maintain two
> implementations for your container images which ship BPF: one which loads
> programs as today, and another one which talks to this proxy if available,

This seems fairly trivially solvable. Agree on an API, say using UNIX sockets to /var/run/bpfd/whatever.socket.  (Or maybe /var/lib?  I’m not sure there’s universal agreement on where things like this go.) The exact same API works uncontained (bpfd running, probably socket-activated) from a binary in the system and as a bind-mount from outside.

I don’t know k8s well at all, but it looks like hostPath can do exactly this.  Off the top of my head, I don’t know whether systemd’s .socket can be configured the right way so the same configuration would work contained and uncontained.  One could certainly work around *that* by having two different paths tried in succession, but that seems a bit silly.
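
A minimal sketch of what such a socket API could look like, written in
Python for brevity. Everything here is an assumption made for
illustration: the JSON wire format, the "approved" field, the program
name and the bpfd.sock file name are invented, and the daemon only
answers yes or no instead of loading anything. The point is only that
the client code is identical whether the socket path is local or
bind-mounted into a container.

```python
import json
import os
import socket
import tempfile
import threading

# Hypothetical allowlist the daemon consults; the name is made up.
APPROVED_PROGRAMS = {"sched_latency_monitor"}


def handle_request(raw: bytes) -> bytes:
    """Daemon side: decide whether the named program may be loaded."""
    req = json.loads(raw)
    ok = req.get("program") in APPROVED_PROGRAMS
    return json.dumps({"approved": ok}).encode()


def serve_once(listener: socket.socket) -> None:
    """Accept a single connection and answer a single request."""
    conn, _ = listener.accept()
    with conn:
        # One recv() is enough for the tiny messages in this sketch.
        conn.sendall(handle_request(conn.recv(4096)))


def request_load(socket_path: str, program: str) -> bool:
    """Client side: the same call works contained and uncontained."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_path)
        s.sendall(json.dumps({"program": program}).encode())
        return json.loads(s.recv(4096))["approved"]


def demo(program: str) -> bool:
    """Start a throwaway daemon on a temporary socket and query it once."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bpfd.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as listener:
            listener.bind(path)
            listener.listen(1)
            worker = threading.Thread(target=serve_once, args=(listener,))
            worker.start()
            try:
                return request_load(path, program)
            finally:
                worker.join()
```

In a real deployment the listener would live in the daemon outside the
container and only request_load() would run inside it.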

This actually seems easier than supplying bpf tokens to a container.

> then you also need to standardize and support the various loader
> libraries for this, you need to deal with yet one more component in your
> cluster which could fail (compared to talking to kernel directly), and
> being dependent on new proxy functionality becomes similar as with waiting
> for new kernels to hit mainstream, it could potentially take a very long
> time until production upgrades. What is being proposed here in this regard
> is less complex given no extra proxy is involved. I would certainly prefer
> a kernel-based solution.

A userspace solution makes it easy to apply some kind of flexible approval and audit policy to the BPF program. I can imagine all kinds of ways that a fleet operator might want to control what can run, and trying to stick it in the kernel seems rather complex and awkward to customize.

I suppose a bpf token could be set up to call out to its creator for permission to load a program, which would involve a different set of tradeoffs.


* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-06-26 22:08               ` Andrii Nakryiko
@ 2023-07-04 21:05                 ` Andy Lutomirski
  0 siblings, 0 replies; 72+ messages in thread
From: Andy Lutomirski @ 2023-07-04 21:05 UTC (permalink / raw)
  To: Andrii Nakryiko, Toke Høiland-Jørgensen
  Cc: Maryam Tahhan, Andrii Nakryiko, bpf, linux-security-module,
	Kees Cook, Christian Brauner, lennart, cyphar, kernel-team

On Mon, Jun 26, 2023, at 3:08 PM, Andrii Nakryiko wrote:
> On Fri, Jun 23, 2023 at 4:07 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> >> applications meets the needs of these PODs that need to do
>> >> privileged/bpf things without any tokens. Ultimately you are trusting
>> >> these apps in the same way as if you were granting a token.
>> >
>> > Yes, absolutely. As I mentioned very explicitly, it's the question of
>> > trusting application. Service vs token is implementation details, but
>> > the one that has huge implications in how applications are built,
>> > tested, versioned, deployed, etc.
>>
>> So one thing that I don't really get is why such a "trusted application"
>> needs to be run in a user namespace in the first place? If it's trusted,
>> why not simply run it as a privileged container (without the user
>> namespace) and grant it the right system-level capabilities, instead of
>> going to all this trouble just to punch a hole in the user namespace
>> isolation?
>
> Because it's still useful to provide isolation that user namespace
> provides in all other aspects besides BPF usage.
>
> The fact that it's a trusted application doesn't mean that bugs don't
> happen, or that some action that was not intended might be attempted
> (due to a bug, some deep unintended library "feature", or just because
> someone didn't anticipate some interaction).
>
> Trusted here means we believe our BPF usage is not going to spy on
> sensitive data, or attempt to disrupt other workloads, because of
> design and code reviews, and we intend to maintain that property. But
> people are still involved, of course, and bugs do happen. We'd like to
> get as much protection as possible, and that's what the user namespace
> is offering.
>

I'm wondering if your approach makes sense for Meta but maybe not outside Meta.  I think Meta is a bit unusual in that it operates a huge fleet, but the developers of the software in that fleet are a fairly tight group.   (I'm speculating here.  I don't know much about what goes on inside Meta, obviously.)

Concretely, you say "we believe our BPF usage is not going to spy on sensitive data".  Who is this "we"?  The kernel developers?  The people developing the BPF programs?  The people setting policy for the fleet?  The people creating container images that want to use BPF and run within the fleet?  Are these all the same "we"?

For a company with actual outside tenants, or a company that needs to comply with various privacy rules for some, but not all, of its applications, there are a lot of "we"s involved.  Some group develops software (or this is outsourced -- the BPF maintainership is essentially within Meta, after all).  Some group administers the fleet.  Some group develops BPF programs (or downloads them from outside and hopefully vets them).  Some group builds container images that want to use those programs.  Some group deploys these images via kubernetes or whatever.  Some group prepares reports that say that certain services offered comply with PCI or HIPAA or FedRAMP or GDPR or whatever.  They're not all the same people.

Obviously bugs exist and mistakes happen.  But, at the end of the day, someone is going to read a BPF program (or a kernel module, or whatever) and take some degree of responsibility for saying "I read this thing, and I approve its use in a certain context".  And then *that permission* should be granted.  With your patchset as it is, the permission granted is not "run this program I approved" but rather "read all kernel memory".  And I don't think that will fly with a lot of potential users.

> For BPF-side of things, we have to trust the process because there is
> no technical solution. Running outside the user namespace we also
> don't have any guarantees about BPF. We just have even less protection
> in all other aspects outside of BPF. We are trying to improve our
> story with user namespace to mitigate what's mitigatable.

But there *are* technical solutions.  At least two broad types, as I've been trying to say.

1. Stronger and more flexible controls as to which specific programs can be loaded and run.  The people doing the trusting may very well want to trust specific things (and audit which things they've trusted, etc.)

2. Stronger and more flexible controls as to what programs can do.  Right now, bpf() can attach to essentially any cgroup or tracepoint if it can attach to any at all.  Programs can access all kernel memory (because alternatives to bpf_probe_read_kernel() aren't really available, and there is no incentive right now to add them, because there isn't even a way AFAIK to turn off bpf_probe_read_kernel()).

Progress on either one of these could go a long way.
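
As a sketch of the first kind of control, the model below approves
specific program images by digest and records every decision, so that
what is granted is "run this program I reviewed" and not a blanket
capability. It is plain Python with hypothetical names
(ApprovalRegistry, may_load, the context strings); nothing like it
exists in the patch set, and the byte strings are placeholders, not
real BPF instructions.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def program_digest(image: bytes) -> str:
    """Identify a program by the SHA-256 of its instruction bytes."""
    return hashlib.sha256(image).hexdigest()


class ApprovalRegistry:
    """Tracks which exact programs were approved for which context."""

    def __init__(self) -> None:
        self._approved: Dict[str, Set[str]] = {}
        # Each entry is (digest, context, allowed).
        self.audit_log: List[Tuple[str, str, bool]] = []

    def approve(self, image: bytes, context: str) -> str:
        """Record that a reviewer approved this image for this context."""
        digest = program_digest(image)
        self._approved.setdefault(digest, set()).add(context)
        return digest

    def may_load(self, image: bytes, context: str) -> bool:
        """Allow only an approved image in an approved context, and log it."""
        digest = program_digest(image)
        allowed = context in self._approved.get(digest, set())
        self.audit_log.append((digest, context, allowed))
        return allowed


registry = ApprovalRegistry()
reviewed = b"placeholder bytes of a reviewed program"
registry.approve(reviewed, "container:monitoring")
```

A one-byte change to the image, or a request from a different context,
is refused, and both outcomes end up in the audit log.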


* Re: [PATCH v2 bpf-next 00/18] BPF token
  2023-07-04 20:48                             ` Andy Lutomirski
@ 2023-07-04 21:06                               ` Andy Lutomirski
  0 siblings, 0 replies; 72+ messages in thread
From: Andy Lutomirski @ 2023-07-04 21:06 UTC (permalink / raw)
  To: Daniel Borkmann, Andrii Nakryiko, Maryam Tahhan
  Cc: Andrii Nakryiko, bpf, linux-security-module, Kees Cook,
	Christian Brauner, lennart, cyphar, kernel-team



On Tue, Jul 4, 2023, at 1:48 PM, Andy Lutomirski wrote:
> On Mon, Jun 26, 2023, at 8:23 AM, Daniel Borkmann wrote:
>> On 6/24/23 5:28 PM, Andy Lutomirski wrote:
>>
>> Wasn't this the idea of the BPF tokens proposal, meaning you could create
>> them with restricted access as you mentioned - allowing an explicit
>> subset of program types to be loaded, subset of helpers/kfuncs, map
>> types, etc.. Given you pass in this token context upon program load-time
>> (resp. map creation), the verifier is then extended for restricted
>> access. For example, see the bpf_token_allow_{cmd,map_type,prog_type}()
>> in this series. The user namespace relation was part of the use cases,
>> but not strictly part of the mechanism itself in this series.
>
> Hmm. It's very coarse grained.
>
> Also, the bpf() attach API seems to be largely (completely?) missing 
> what I would expect to be basic access controls on the things being 
> attached to.   For example, the whole cgroup_bpf_prog_attach() path 
> seems to be entirely missing any checks as to whether its caller has 
> any particular permission over the cgroup in question.  It doesn't even 
> check whether the cgroup is being accessed from the current userns 
> (i.e. whether the fd refers to a struct file with f_path.mnt belonging 
> to the current userns).  So the API in this patchset has no way to 
> restrict permission to attach to cgroups to only apply to cgroups 
> belonging to the container.
>

Forgot to mention: there's also no way to limit the functions that can be called.  While it's currently a bit of a pipe dream to do much useful work without bpf_probe_read_kernel(), it's at least conceptually possible to accomplish quite a bit without it, but there's no way to make that part of the policy.
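
A policy of that kind could be as simple as a set of permitted helper
names checked against the helpers a program calls. The sketch below is
a Python model under that assumption: the helper names are real BPF
helpers, but the allowlist itself, the function names and the idea that
a token would carry such a list are hypothetical and not part of this
series.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical policy that deliberately leaves out bpf_probe_read_kernel().
ALLOWED_HELPERS: FrozenSet[str] = frozenset({
    "bpf_map_lookup_elem",
    "bpf_map_update_elem",
    "bpf_ktime_get_ns",
    "bpf_get_current_pid_tgid",
})


def disallowed_helpers(called: Iterable[str],
                       allowed: FrozenSet[str] = ALLOWED_HELPERS) -> List[str]:
    """Return the helpers a program calls that the policy does not permit,
    sorted for stable reporting."""
    return sorted(set(called) - allowed)


def may_load(called: Iterable[str],
             allowed: FrozenSet[str] = ALLOWED_HELPERS) -> bool:
    """A program is acceptable only if it stays within the allowlist."""
    return not disallowed_helpers(called, allowed)


# Helper sets for two imaginary programs.
counter_prog = ["bpf_get_current_pid_tgid", "bpf_map_update_elem"]
snooping_prog = ["bpf_map_lookup_elem", "bpf_probe_read_kernel"]
```

Here counter_prog passes, while snooping_prog is rejected with
bpf_probe_read_kernel reported as the offending helper.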


Thread overview: 72+ messages
2023-06-07 23:53 [PATCH v2 bpf-next 00/18] BPF token Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 01/18] bpf: introduce BPF token object Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 02/18] libbpf: add bpf_token_create() API Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 03/18] selftests/bpf: add BPF_TOKEN_CREATE test Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 04/18] bpf: move unprivileged checks into map_create() and bpf_prog_load() Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 05/18] bpf: inline map creation logic in map_create() function Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 06/18] bpf: centralize permissions checks for all BPF map types Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 07/18] bpf: add BPF token support to BPF_MAP_CREATE command Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 08/18] libbpf: add BPF token support to bpf_map_create() API Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 09/18] selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 10/18] bpf: add BPF token support to BPF_BTF_LOAD command Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 11/18] libbpf: add BPF token support to bpf_btf_load() API Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 12/18] selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 13/18] bpf: keep BPF_PROG_LOAD permission checks clear of validations Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 14/18] bpf: add BPF token support to BPF_PROG_LOAD command Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 15/18] bpf: take into account BPF token when fetching helper protos Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 16/18] bpf: consistenly use BPF token throughout BPF verifier logic Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 17/18] libbpf: add BPF token support to bpf_prog_load() API Andrii Nakryiko
2023-06-07 23:53 ` [PATCH v2 bpf-next 18/18] selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests Andrii Nakryiko
2023-06-08 18:49 ` [PATCH v2 bpf-next 00/18] BPF token Stanislav Fomichev
2023-06-08 22:17   ` Andrii Nakryiko
2023-06-09 11:17 ` Toke Høiland-Jørgensen
2023-06-09 18:21   ` Andrii Nakryiko
2023-06-09 21:21     ` Toke Høiland-Jørgensen
2023-06-09 22:03       ` Andrii Nakryiko
2023-06-12 10:49         ` Toke Høiland-Jørgensen
2023-06-12 22:08           ` Andrii Nakryiko
2023-06-13 21:48             ` Hao Luo
2023-06-14 12:06             ` Toke Høiland-Jørgensen
2023-06-15 22:55               ` Andrii Nakryiko
2023-06-09 18:32 ` Andy Lutomirski
2023-06-09 19:08   ` Andrii Nakryiko
2023-06-19 17:40     ` Andy Lutomirski
2023-06-21 23:48       ` Andrii Nakryiko
2023-06-22  8:22         ` Maryam Tahhan
2023-06-22 16:49           ` Andy Lutomirski
     [not found]             ` <5a75d1f0-4ed9-399c-4851-2df0755de9b5@redhat.com>
2023-06-22 18:40               ` Andrii Nakryiko
2023-06-22 21:04                 ` Maryam Tahhan
2023-06-22 23:35                   ` Andrii Nakryiko
2023-06-23  1:02                 ` Andy Lutomirski
2023-06-23 15:10                   ` Andy Lutomirski
2023-06-23 23:23                     ` Daniel Borkmann
2023-06-24 13:59                       ` Andy Lutomirski
2023-06-24 15:28                         ` Andy Lutomirski
2023-06-26 15:23                           ` Daniel Borkmann
2023-07-04 20:48                             ` Andy Lutomirski
2023-07-04 21:06                               ` Andy Lutomirski
2023-06-27 10:22                           ` Djalal Harouni
2023-06-26 22:31                         ` Andrii Nakryiko
2023-06-26 22:08                   ` Andrii Nakryiko
2023-06-22 19:05             ` Andrii Nakryiko
2023-06-23  3:28               ` Andy Lutomirski
2023-06-23 16:13                 ` Casey Schaufler
2023-06-26 22:08                 ` Andrii Nakryiko
2023-06-22 18:20           ` Andrii Nakryiko
2023-06-23 23:07             ` Toke Høiland-Jørgensen
2023-06-26 22:08               ` Andrii Nakryiko
2023-07-04 21:05                 ` Andy Lutomirski
2023-06-09 22:29 ` Djalal Harouni
2023-06-09 22:57   ` Andrii Nakryiko
2023-06-12 12:02     ` Djalal Harouni
2023-06-12 14:31       ` Djalal Harouni
2023-06-12 22:27       ` Andrii Nakryiko
2023-06-14  0:23         ` Djalal Harouni
2023-06-14  9:39           ` Christian Brauner
2023-06-15 22:48             ` Andrii Nakryiko
2023-06-23 22:18               ` Daniel Borkmann
2023-06-26 22:08                 ` Andrii Nakryiko
2023-06-15 22:47           ` Andrii Nakryiko
2023-06-12 12:44 ` Dave Tucker
2023-06-12 15:52   ` Djalal Harouni
2023-06-12 23:04   ` Andrii Nakryiko
