All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
@ 2021-05-10 17:22 YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 01/12] seccomp: Move no_new_privs check to after prepare_filter YiFei Zhu
                   ` (12 more replies)
  0 siblings, 13 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html

This patchset enables seccomp filters to be written in eBPF.
Supporting eBPF filters has been proposed a few times in the past.
The main concerns were (1) use cases and (2) security. We have
identified many use cases that can benefit from advanced eBPF
filters, such as:

  * exec-only-once filter / apply filter after exec
  * syscall logging (eg. via maps)
  * expressiveness & better tooling (no need for DSLs like easyseccomp)
  * contained syscall fault injection
  * Temporal System Call Specialization [1] with restrictive
    initialization phases (serving phase syscalls are filtered)
  * possible future extensions such as syscall serialization and
    argument rewriting

These features can also be achieved by user notifier + ptrace but
unfortunately user notifier is a lot of context switches (see the
benchmark results below), and hence much less efficient than eBPF.

For security, for an unprivileged caller, our implementation is as
restrictive as user notifier + ptrace, in regards to capabilities.
eBPF helpers follow the privilege model of original eBPF helpers.

Advanced eBPF feature (maps & helpers) is restricted by a new LSM
hook seccomp_extended. If LSM permits these features, then all standard
bpf helpers are permitted, and tracing helpers are permitted too if the
loader is bpf_capable and perfmon_capable. Mutable privileges should
not be a concern because if seccomp-eBPF is used to implement a mutable
policy of privileges, such policy can be implemented using user
notifier anyhow (which does not require seccomp-eBPF).

Moreover, a mechanism for reading user memory is added. The same
prototypes of bpf_probe_read_user{,str} from tracing are used. However,
when the loader of bpf program does not have CAP_PTRACE, the helper
will return -EPERM if the task under seccomp filter is non-dumpable.
The reason for this is that if we perform reduction from seccomp-eBPF
to user notifier + ptrace, ptrace requires CAP_PTRACE to read from
a non-dumpable process. However, eBPF does not solve the TOCTOU problem
of user notifier, so users should not use this to enforce a policy
based on memory contents.

In addition, a mechanism for storing process states between filter runs
is added. This uses the BPF-LSM task storage. However, since
unprivileged bpf loaders do not have access to ptr to BTF ID for use as
the task parameter to the helpers, the workaround is to use NULL as the
parameter, and the helper will fallback to current's group leader. This
is insufficient, unfortunately, because of the BTF enforcement in
bpf_local_storage_map_alloc_check, and the fact that tasks without
bpf_capable cannot load map BTFs. (Can I ask why this is restricted
this way?)

Giuseppe Scrivano shows how to support eBPF filters in crun [2], based
on which we have tested a number of stateful filters.

Performance wise, Jinghao did a test of 1,000,000 getpid() calls on an
Intel i7-9700K, with stock Ubuntu config. The syscalls are half EPERM
and half passthrough to the getpid() syscall handler [3]. The tests
are done recording a median of 10:

                user notif      eBPF            ratio
QEMU            6808104 us      80508.5 us      84.6
Bare Metal      3403667.5 us    80316 us        42.4

[1] https://www.usenix.org/conference/usenixsecurity20/presentation/ghavamnia
[2] https://github.com/giuseppe/crun/commit/3906b4fbcb671f8f188deef08c94ceae86a80120
[3] https://github.com/xlab-uiuc/seccomp-ebpf-upstream/tree/perf-test

Patch 1 moves no_new_privs check in filter loading.
Patch 2 implements basic support for seccomp-eBPF in the kernel.
Patch 3 enables a ptracer to get a fd to the eBPF for CRIU.
Patch 4 enables libbpf to recognize the section "seccomp".
Patch 5 adds a sample program test_seccomp to samples/bpf.

Patch 6 adds an LSM hook seccomp_extended.
Patch 7 allows bpf verifier hooks to restrict direct map access.
Patch 8 implements restrictions for eBPF filters depending on LSM hooks.
Patch 9 lets Yama LSM restrict seccomp-ebpf based on ptrace_scope.

Patch 10 enables seccomp-ebpf to read user memory.
Patch 11 allows bpf helpers to have nullable ptr to BTF ID as argument.
Patch 12 implements process storage using BPF-LSM task storage.

Sargun Dhillon (3):
  bpf, seccomp: Add eBPF filter capabilities
  seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
    filters
  samples/bpf: Add eBPF seccomp sample programs

YiFei Zhu (9):
  seccomp: Move no_new_privs check to after prepare_filter
  libbpf: recognize section "seccomp"
  lsm: New hook seccomp_extended
  bpf/verifier: allow restricting direct map access
  seccomp-ebpf: restrict filter to almost cBPF if LSM request such
  yama: (concept) restrict seccomp-eBPF with ptrace_scope
  seccomp-ebpf: Add ability to read user memory
  bpf/verifier: support NULL-able ptr to BTF ID as helper argument
  seccomp-ebpf: support task storage from BPF-LSM, defaulting to group
    leader

 arch/Kconfig                    |   7 +
 include/linux/bpf.h             |   8 ++
 include/linux/bpf_types.h       |   4 +
 include/linux/lsm_hook_defs.h   |   4 +
 include/linux/seccomp.h         |  15 +-
 include/linux/security.h        |  13 ++
 include/uapi/linux/bpf.h        |   1 +
 include/uapi/linux/ptrace.h     |   2 +
 include/uapi/linux/seccomp.h    |   1 +
 kernel/bpf/bpf_task_storage.c   |  64 +++++++--
 kernel/bpf/syscall.c            |   1 +
 kernel/bpf/verifier.c           |  15 +-
 kernel/ptrace.c                 |   4 +
 kernel/seccomp.c                | 235 ++++++++++++++++++++++++++++----
 kernel/trace/bpf_trace.c        |  42 ++++++
 samples/bpf/Makefile            |   3 +
 samples/bpf/test_seccomp_kern.c |  41 ++++++
 samples/bpf/test_seccomp_user.c |  49 +++++++
 security/security.c             |   8 ++
 security/yama/yama_lsm.c        |  30 ++++
 tools/include/uapi/linux/bpf.h  |   1 +
 tools/lib/bpf/libbpf.c          |   1 +
 22 files changed, 511 insertions(+), 38 deletions(-)
 create mode 100644 samples/bpf/test_seccomp_kern.c
 create mode 100644 samples/bpf/test_seccomp_user.c

--
2.31.1

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH bpf-next seccomp 01/12] seccomp: Move no_new_privs check to after prepare_filter
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
@ 2021-05-10 17:22 ` YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 02/12] bpf, seccomp: Add eBPF filter capabilities YiFei Zhu
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

This is to make way for eBPF, so that this part of the code can be
shared by both cBPF and eBPF code paths.

Doing the privilege check after prepare_filter means that any
filter issues the caller would get -EINVAL, even when it does not
set no_new_privs or CAP_SYS_ADMIN.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 1e63db4dbd9a..6e5ac0d686a1 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -642,16 +642,6 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 
 	BUG_ON(INT_MAX / fprog->len < sizeof(struct sock_filter));
 
-	/*
-	 * Installing a seccomp filter requires that the task has
-	 * CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
-	 * This avoids scenarios where unprivileged tasks can affect the
-	 * behavior of privileged children.
-	 */
-	if (!task_no_new_privs(current) &&
-			!ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
-		return ERR_PTR(-EACCES);
-
 	/* Allocate a new seccomp_filter */
 	sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
 	if (!sfilter)
@@ -1805,6 +1795,22 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
+	/*
+	 * Installing a seccomp filter requires that the task has
+	 * CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
+	 * This avoids scenarios where unprivileged tasks can affect the
+	 * behavior of privileged children.
+	 *
+	 * This is checked after filter preparation because the user
+	 * will get an EINVAL if their filter is invalid prior to the
+	 * EACCES.
+	 */
+	if (!task_no_new_privs(current) &&
+	    !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN)) {
+		ret = -EACCES;
+		goto out_free;
+	}
+
 	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
 		listener = get_unused_fd_flags(O_CLOEXEC);
 		if (listener < 0) {
-- 
2.31.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH bpf-next seccomp 02/12] bpf, seccomp: Add eBPF filter capabilities
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 01/12] seccomp: Move no_new_privs check to after prepare_filter YiFei Zhu
@ 2021-05-10 17:22 ` YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 03/12] seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp filters YiFei Zhu
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: Sargun Dhillon <sargun@sargun.me>

This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant
to be used for seccomp filters as an alternative to cBPF filters. The
program type has relatively limited capabilities in terms of helpers,
but that can be extended later on.

The eBPF code loading is separated from attachment of the filter, so
a privileged user can load the filter, and pass it back to an
unprivileged user who can attach it and use it at a later time.

In order to attach the filter itself, you need to supply a flag to the
seccomp syscall indicating that a eBPF filter is being attached, as
opposed to a cBPF one. Verification occurs at program load time,
so the user should only receive errors related to attachment.

The behavior of eBPF filters with bitmap cache is that they will
totally negate the bitmap acceleration. Static analysis of eBPF to
create a bitmap could be a potential future work. [YiFei Zhu]

All standard BPF helper calls are supported. If the loader has
CAP_BPF and CAP_PERFMON, all tracing BPF helper calls are also
supported. The reason for this is that, this adds no new attack
vectors I (YiFei) can tell. Standard helpers are already accessible
via BPF_PROG_TYPE_SOCKET_FILTER from an unprivileged loader. Any
policies that can be set through seccomp-eBPF can already be set
via seccomp user notifier, if the user uses this feature to implement
a policy. If a user okay with such advanced features, they may
implement an LSM policy to restrict this. This LSM hook is added in
a later patch. [YiFei Zhu]

rcu_read_lock is also held because maps like BPF_MAP_TYPE_LRU_HASH
requires them. For simplicity, this also affects cBPF filters.
[YiFei Zhu]

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Link: https://lists.linux-foundation.org/pipermail/containers/2018-February/038572.html
Co-developed-by: Jinghao Jia <jinghao7@illinois.edu>
Signed-off-by: Jinghao Jia <jinghao7@illinois.edu>
Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 arch/Kconfig                   |   7 ++
 include/linux/bpf_types.h      |   4 +
 include/linux/seccomp.h        |   3 +-
 include/uapi/linux/bpf.h       |   1 +
 include/uapi/linux/seccomp.h   |   1 +
 kernel/bpf/syscall.c           |   1 +
 kernel/seccomp.c               | 151 +++++++++++++++++++++++++++++----
 tools/include/uapi/linux/bpf.h |   1 +
 8 files changed, 153 insertions(+), 16 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 5bc98d28a6e0..d1180fedcfea 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -519,6 +519,13 @@ config SECCOMP_CACHE_DEBUG
 
 	  If unsure, say N.
 
+config SECCOMP_FILTER_EXTENDED
+	bool "Extended BPF seccomp filters"
+	depends on SECCOMP_FILTER && BPF_SYSCALL
+	help
+	  Enables seccomp filters to be written in eBPF, as opposed
+	  to just cBPF filters.
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index f883f01a5061..92d2126c72a6 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -78,6 +78,10 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LSM, lsm,
 #endif /* CONFIG_BPF_LSM */
 #endif
 
+#ifdef CONFIG_SECCOMP_FILTER_EXTENDED
+BPF_PROG_TYPE(BPF_PROG_TYPE_SECCOMP, seccomp, void *, void *)
+#endif
+
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PROG_ARRAY, prog_array_map_ops)
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 0c564e5d40ff..c0750dc05de5 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -8,7 +8,8 @@
 					 SECCOMP_FILTER_FLAG_LOG | \
 					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
 					 SECCOMP_FILTER_FLAG_NEW_LISTENER | \
-					 SECCOMP_FILTER_FLAG_TSYNC_ESRCH)
+					 SECCOMP_FILTER_FLAG_TSYNC_ESRCH | \
+					 SECCOMP_FILTER_FLAG_EXTENDED)
 
 /* sizeof() the first published struct seccomp_notif_addfd */
 #define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ec6d85a81744..b78d5c9fbb4b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -937,6 +937,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_EXT,
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
+	BPF_PROG_TYPE_SECCOMP,
 };
 
 enum bpf_attach_type {
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 6ba18b82a02e..5f98cc44df56 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -23,6 +23,7 @@
 #define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
 #define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
 #define SECCOMP_FILTER_FLAG_TSYNC_ESRCH		(1UL << 4)
+#define SECCOMP_FILTER_FLAG_EXTENDED	(1UL << 5)
 
 /*
  * All BPF programs must return a 32-bit value.
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 941ca06d9dfa..f3007e7329aa 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2113,6 +2113,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 		return -E2BIG;
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
+	    type != BPF_PROG_TYPE_SECCOMP &&
 	    !bpf_capable())
 		return -EPERM;
 
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 6e5ac0d686a1..1ef26a5bf93f 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -43,6 +43,7 @@
 #include <linux/uaccess.h>
 #include <linux/anon_inodes.h>
 #include <linux/lockdep.h>
+#include <linux/bpf.h>
 
 /*
  * When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced, it had the
@@ -408,7 +409,11 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 	 * value always takes priority (ignoring the DATA).
 	 */
 	for (; f; f = f->prev) {
-		u32 cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd);
+		u32 cur_ret;
+
+		rcu_read_lock();
+		cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd);
+		rcu_read_unlock();
 
 		if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
 			ret = cur_ret;
@@ -508,7 +513,10 @@ static inline pid_t seccomp_can_sync_threads(void)
 static inline void seccomp_filter_free(struct seccomp_filter *filter)
 {
 	if (filter) {
-		bpf_prog_destroy(filter->prog);
+		if (bpf_prog_was_classic(filter->prog))
+			bpf_prog_destroy(filter->prog);
+		else
+			bpf_prog_put(filter->prog);
 		kfree(filter);
 	}
 }
@@ -690,6 +698,52 @@ seccomp_prepare_user_filter(const char __user *user_filter)
 	return filter;
 }
 
+#ifdef CONFIG_SECCOMP_FILTER_EXTENDED
+/**
+ * seccomp_prepare_extended_filter - prepares a user-supplied eBPF fd
+ * @user_filter: pointer to the user data containing an fd.
+ *
+ * Returns 0 on success and non-zero otherwise.
+ */
+static struct seccomp_filter *
+seccomp_prepare_extended_filter(const int __user *user_fd)
+{
+	struct seccomp_filter *sfilter;
+	struct bpf_prog *fp;
+	int fd;
+
+	/* Fetch the fd from userspace */
+	if (get_user(fd, user_fd))
+		return ERR_PTR(-EFAULT);
+
+	/* Allocate a new seccomp_filter */
+	sfilter = kzalloc(sizeof(*sfilter), GFP_KERNEL | __GFP_NOWARN);
+	if (!sfilter)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&sfilter->notify_lock);
+	fp = bpf_prog_get_type(fd, BPF_PROG_TYPE_SECCOMP);
+
+	if (IS_ERR(fp)) {
+		kfree(sfilter);
+		return ERR_CAST(fp);
+	}
+
+	sfilter->prog = fp;
+	refcount_set(&sfilter->refs, 1);
+	refcount_set(&sfilter->users, 1);
+	init_waitqueue_head(&sfilter->wqh);
+
+	return sfilter;
+}
+#else
+static struct seccomp_filter *
+seccomp_prepare_extended_filter(const int __user *filter_fd)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif
+
 #ifdef SECCOMP_ARCH_NATIVE
 /**
  * seccomp_is_const_allow - check if filter is constant allow with given data
@@ -778,7 +832,10 @@ static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
 	struct seccomp_data sd;
 	int nr;
 
-	if (bitmap_prev) {
+	if (!bpf_prog_was_classic(sfilter->prog)) {
+		/* eBPF program, no caching. */
+		bitmap_zero(bitmap, bitmap_size);
+	} else if (bitmap_prev) {
 		/* The new filter must be as restrictive as the last. */
 		bitmap_copy(bitmap, bitmap_prev, bitmap_size);
 	} else {
@@ -1766,9 +1823,10 @@ static bool has_duplicate_listener(struct seccomp_filter *new_child)
  * Returns 0 on success or -EINVAL on failure.
  */
 static long seccomp_set_mode_filter(unsigned int flags,
-				    const char __user *filter)
+				    const void __user *filter)
 {
-	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
+	/* We use SECCOMP_MODE_FILTER for both eBPF and cBPF filters */
+	const unsigned long filter_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
 	int listener = -1;
@@ -1791,7 +1849,11 @@ static long seccomp_set_mode_filter(unsigned int flags,
 		return -EINVAL;
 
 	/* Prepare the new filter before holding any locks. */
-	prepared = seccomp_prepare_user_filter(filter);
+	if (flags & SECCOMP_FILTER_FLAG_EXTENDED)
+		prepared = seccomp_prepare_extended_filter(filter);
+	else
+		prepared = seccomp_prepare_user_filter(filter);
+
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
@@ -1836,7 +1898,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
 
 	spin_lock_irq(&current->sighand->siglock);
 
-	if (!seccomp_may_assign_mode(seccomp_mode))
+	if (!seccomp_may_assign_mode(filter_mode))
 		goto out;
 
 	if (has_duplicate_listener(prepared)) {
@@ -1850,7 +1912,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	/* Do not free the successfully attached filter. */
 	prepared = NULL;
 
-	seccomp_assign_mode(current, seccomp_mode, flags);
+	seccomp_assign_mode(current, filter_mode, flags);
 out:
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
@@ -2046,15 +2108,17 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
 	if (IS_ERR(filter))
 		return PTR_ERR(filter);
 
+	/* This must be a new non-cBPF filter, since we save
+	 * every cBPF filter's orig_prog above when
+	 * CONFIG_CHECKPOINT_RESTORE is enabled.
+	 */
+	ret = -EMEDIUMTYPE;
+
 	fprog = filter->prog->orig_prog;
-	if (!fprog) {
-		/* This must be a new non-cBPF filter, since we save
-		 * every cBPF filter's orig_prog above when
-		 * CONFIG_CHECKPOINT_RESTORE is enabled.
-		 */
-		ret = -EMEDIUMTYPE;
+	if (!fprog)
+		goto out;
+	if (!bpf_prog_was_classic(filter->prog))
 		goto out;
-	}
 
 	ret = fprog->len;
 	if (!data)
@@ -2307,6 +2371,63 @@ static int seccomp_actions_logged_handler(struct ctl_table *ro_table, int write,
 	return ret;
 }
 
+#ifdef CONFIG_SECCOMP_FILTER_EXTENDED
+static bool seccomp_is_valid_access(int off, int size,
+				    enum bpf_access_type type,
+					const struct bpf_prog *prog,
+				    struct bpf_insn_access_aux *info)
+{
+	if (type != BPF_READ)
+		return false;
+
+	if (off < 0 || off + size > sizeof(struct seccomp_data))
+		return false;
+
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]):
+		return (size == sizeof(__u64));
+	case bpf_ctx_range(struct seccomp_data, nr):
+		return (size == sizeof_field(struct seccomp_data, nr));
+	case bpf_ctx_range(struct seccomp_data, arch):
+		return (size == sizeof_field(struct seccomp_data, arch));
+	case bpf_ctx_range(struct seccomp_data, instruction_pointer):
+		return (size == sizeof_field(struct seccomp_data,
+					     instruction_pointer));
+	default:
+		return false;
+	}
+}
+
+static const struct bpf_func_proto *
+seccomp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_get_current_uid_gid:
+		return &bpf_get_current_uid_gid_proto;
+	case BPF_FUNC_get_current_pid_tgid:
+		return &bpf_get_current_pid_tgid_proto;
+	default:
+		break;
+	}
+
+	if (bpf_capable() && perfmon_capable())
+		return bpf_tracing_func_proto(func_id, prog);
+	else
+		return bpf_base_func_proto(func_id);
+}
+
+const struct bpf_prog_ops seccomp_prog_ops = {
+};
+
+const struct bpf_verifier_ops seccomp_verifier_ops = {
+	.get_func_proto		= seccomp_func_proto,
+	.is_valid_access	= seccomp_is_valid_access,
+};
+#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */
+
 static struct ctl_path seccomp_sysctl_path[] = {
 	{ .procname = "kernel", },
 	{ .procname = "seccomp", },
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index ec6d85a81744..b78d5c9fbb4b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -937,6 +937,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_EXT,
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
+	BPF_PROG_TYPE_SECCOMP,
 };
 
 enum bpf_attach_type {
-- 
2.31.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH bpf-next seccomp 03/12] seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp filters
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 01/12] seccomp: Move no_new_privs check to after prepare_filter YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 02/12] bpf, seccomp: Add eBPF filter capabilities YiFei Zhu
@ 2021-05-10 17:22 ` YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 04/12] libbpf: recognize section "seccomp" YiFei Zhu
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: Sargun Dhillon <sargun@sargun.me>

This extends the ptrace API to allow fetching eBPF seccomp filters
attached to programs. This is to enable checkpoint / restore cases.
The user will have to use the traditional PTRACE_SECCOMP_GET_FILTER
API call, and if they get an invalid medium type error they can switch
over to the eBPF variant of the API -- PTRACE_SECCOMP_GET_FILTER_EXTENDED.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Link: https://lists.linux-foundation.org/pipermail/containers/2018-February/038478.html
[YiFei: increase ptrace number to 0x4210]
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 include/linux/seccomp.h     | 12 ++++++++++++
 include/uapi/linux/ptrace.h |  2 ++
 kernel/ptrace.c             |  4 ++++
 kernel/seccomp.c            | 37 +++++++++++++++++++++++++++++++++++++
 4 files changed, 55 insertions(+)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index c0750dc05de5..7ce9e3b3fa80 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -122,6 +122,18 @@ static inline long seccomp_get_metadata(struct task_struct *task,
 	return -EINVAL;
 }
 #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+#if defined(CONFIG_SECCOMP_FILTER_EXTENDED) && defined(CONFIG_CHECKPOINT_RESTORE)
+extern long seccomp_get_filter_extended(struct task_struct *task,
+					unsigned long n,
+					void __user *data);
+#else
+static inline long seccomp_get_filter_extended(struct task_struct *task,
+					       unsigned long n,
+					       void __user *data)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_SECCOMP_FILTER_EXTENDED && CONFIG_CHECKPOINT_RESTORE */
 
 #ifdef CONFIG_SECCOMP_CACHE_DEBUG
 struct seq_file;
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index 3747bf816f9a..725a03614c28 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -112,6 +112,8 @@ struct ptrace_rseq_configuration {
 	__u32 pad;
 };
 
+#define PTRACE_SECCOMP_GET_FILTER_EXTENDED	0x4210
+
 /*
  * These values are stored in task->ptrace_message
  * by tracehook_report_syscall_* to describe the current syscall-stop.
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 76f09456ec4b..1e8d2155231f 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1247,6 +1247,10 @@ int ptrace_request(struct task_struct *child, long request,
 		break;
 #endif
 
+	case PTRACE_SECCOMP_GET_FILTER_EXTENDED:
+		ret = seccomp_get_filter_extended(child, addr, datavp);
+		break;
+
 	default:
 		break;
 	}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 1ef26a5bf93f..8550ae885245 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2168,6 +2168,43 @@ long seccomp_get_metadata(struct task_struct *task,
 }
 #endif
 
+#if defined(CONFIG_SECCOMP_FILTER_EXTENDED) && defined(CONFIG_CHECKPOINT_RESTORE)
+long seccomp_get_filter_extended(struct task_struct *task,
+				 unsigned long filter_off,
+				 void __user *data)
+{
+	struct seccomp_filter *filter;
+	struct bpf_prog *prog;
+	long ret;
+
+	if (!capable(CAP_SYS_ADMIN) ||
+	    current->seccomp.mode != SECCOMP_MODE_DISABLED) {
+		return -EACCES;
+	}
+
+	filter = get_nth_filter(task, filter_off);
+	if (IS_ERR(filter))
+		return PTR_ERR(filter);
+
+	if (bpf_prog_was_classic(filter->prog)) {
+		ret = -EMEDIUMTYPE;
+		goto out;
+	}
+	prog = bpf_prog_inc_not_zero(filter->prog);
+	if (IS_ERR(prog)) {
+		ret = PTR_ERR(prog);
+		goto out;
+	}
+
+	ret = bpf_prog_new_fd(filter->prog);
+	if (ret < 0)
+		bpf_prog_put(prog);
+out:
+	__put_seccomp_filter(filter);
+	return ret;
+}
+#endif
+
 #ifdef CONFIG_SYSCTL
 
 /* Human readable action names for friendly sysctl interaction */
-- 
2.31.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH bpf-next seccomp 04/12] libbpf: recognize section "seccomp"
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
                   ` (2 preceding siblings ...)
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 03/12] seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp filters YiFei Zhu
@ 2021-05-10 17:22 ` YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 05/12] samples/bpf: Add eBPF seccomp sample programs YiFei Zhu
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

The section is set to program type BPF_PROG_TYPE_SECCOMP with no
attach type.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 tools/lib/bpf/libbpf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e2a3cf437814..42ce79b47378 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8958,6 +8958,7 @@ static const struct bpf_sec_def section_defs[] = {
 	BPF_PROG_SEC("struct_ops",		BPF_PROG_TYPE_STRUCT_OPS),
 	BPF_EAPROG_SEC("sk_lookup/",		BPF_PROG_TYPE_SK_LOOKUP,
 						BPF_SK_LOOKUP),
+	BPF_PROG_SEC("seccomp",			BPF_PROG_TYPE_SECCOMP),
 };
 
 #undef BPF_PROG_SEC_IMPL
-- 
2.31.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH bpf-next seccomp 05/12] samples/bpf: Add eBPF seccomp sample programs
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
                   ` (3 preceding siblings ...)
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 04/12] libbpf: recognize section "seccomp" YiFei Zhu
@ 2021-05-10 17:22 ` YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 06/12] lsm: New hook seccomp_extended YiFei Zhu
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: Sargun Dhillon <sargun@sargun.me>

This adds a sample program that uses seccomp-eBPF, called
test_seccomp. It shows the simple ability to code seccomp filters
in C.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Link: https://lists.linux-foundation.org/pipermail/containers/2018-February/038573.html
Co-developed-by: Jinghao Jia <jinghao7@illinois.edu>
Signed-off-by: Jinghao Jia <jinghao7@illinois.edu>
[YiFei: change from bpf_load to libbpf]
Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 samples/bpf/Makefile            |  3 ++
 samples/bpf/test_seccomp_kern.c | 41 +++++++++++++++++++++++++++
 samples/bpf/test_seccomp_user.c | 49 +++++++++++++++++++++++++++++++++
 3 files changed, 93 insertions(+)
 create mode 100644 samples/bpf/test_seccomp_kern.c
 create mode 100644 samples/bpf/test_seccomp_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 45ceca4e2c70..d49e7f91eba6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -55,6 +55,7 @@ tprogs-y += task_fd_query
 tprogs-y += xdp_sample_pkts
 tprogs-y += ibumad
 tprogs-y += hbm
+tprogs-y += test_seccomp
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -113,6 +114,7 @@ task_fd_query-objs := task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o
 ibumad-objs := ibumad_user.o
 hbm-objs := hbm.o $(CGROUP_HELPERS)
+test_seccomp-objs := test_seccomp_user.o
 
 # Tell kbuild to always build the programs
 always-y := $(tprogs-y)
@@ -174,6 +176,7 @@ always-y += ibumad_kern.o
 always-y += hbm_out_kern.o
 always-y += hbm_edt_kern.o
 always-y += xdpsock_kern.o
+always-y += test_seccomp_kern.o
 
 ifeq ($(ARCH), arm)
 # Strip all except -D__LINUX_ARM_ARCH__ option needed to handle linux
diff --git a/samples/bpf/test_seccomp_kern.c b/samples/bpf/test_seccomp_kern.c
new file mode 100644
index 000000000000..efd42f47d9c4
--- /dev/null
+++ b/samples/bpf/test_seccomp_kern.c
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <uapi/linux/seccomp.h>
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/unistd.h>
+#include <uapi/linux/errno.h>
+#include <bpf/bpf_helpers.h>
+#include <uapi/linux/audit.h>
+
+#if defined(__x86_64__)
+#define ARCH	AUDIT_ARCH_X86_64
+#elif defined(__i386__)
+#define ARCH	AUDIT_ARCH_I386
+#else
+#endif
+
+#ifdef ARCH
+/* Returns EPERM when trying to close fd 999 */
+SEC("seccomp")
+int bpf_prog1(struct seccomp_data *ctx)
+{
+	/*
+	 * Make sure this BPF program is being run on the same architecture it
+	 * was compiled on.
+	 */
+	if (ctx->arch != ARCH)
+		return SECCOMP_RET_ERRNO | EPERM;
+	if (ctx->nr == __NR_close && ctx->args[0] == 999)
+		return SECCOMP_RET_ERRNO | EPERM;
+
+	return SECCOMP_RET_ALLOW;
+}
+#else
+#warning Architecture not supported -- Blocking all syscalls
+SEC("seccomp")
+int bpf_prog1(struct seccomp_data *ctx)
+{
+	return SECCOMP_RET_ERRNO | EPERM;
+}
+#endif
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_seccomp_user.c b/samples/bpf/test_seccomp_user.c
new file mode 100644
index 000000000000..ba17e18666b9
--- /dev/null
+++ b/samples/bpf/test_seccomp_user.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <assert.h>
+#include <bpf/libbpf.h>
+#include <errno.h>
+#include <linux/bpf.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+int main(int argc, char **argv)
+{
+	struct bpf_object *obj;
+	char filename[256];
+	int prog_fd;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (bpf_prog_load(filename, BPF_PROG_TYPE_SECCOMP, &obj, &prog_fd))
+		exit(EXIT_FAILURE);
+	if (prog_fd < 0) {
+		fprintf(stderr, "ERROR: no program found: %s\n",
+			strerror(prog_fd));
+		exit(EXIT_FAILURE);
+	}
+
+	/* set new_new_privs so non-privileged users can attach filters */
+	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+		perror("prctl(NO_NEW_PRIVS)");
+		exit(EXIT_FAILURE);
+	}
+
+	if (syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
+		    SECCOMP_FILTER_FLAG_EXTENDED, &prog_fd)) {
+		perror("seccomp");
+		exit(EXIT_FAILURE);
+	}
+
+	close(111);
+	assert(errno == EBADF);
+	close(999);
+	assert(errno == EPERM);
+
+	printf("close syscall successfully filtered\n");
+	return 0;
+}
-- 
2.31.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH bpf-next seccomp 06/12] lsm: New hook seccomp_extended
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
                   ` (4 preceding siblings ...)
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 05/12] samples/bpf: Add eBPF seccomp sample programs YiFei Zhu
@ 2021-05-10 17:22 ` YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 07/12] bpf/verifier: allow restricting direct map access YiFei Zhu
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

This hooks takes no argument, and returns 0 if the current task is
permitted to use extended seccomp-eBPF features, or -errno if it is
not permitted.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 include/linux/lsm_hook_defs.h |  4 ++++
 include/linux/security.h      | 13 +++++++++++++
 security/security.c           |  8 ++++++++
 3 files changed, 25 insertions(+)

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 61f04f7dc1a4..94e18d95e1cc 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -391,6 +391,10 @@ LSM_HOOK(int, 0, bpf_map_alloc_security, struct bpf_map *map)
 LSM_HOOK(void, LSM_RET_VOID, bpf_map_free_security, struct bpf_map *map)
 LSM_HOOK(int, 0, bpf_prog_alloc_security, struct bpf_prog_aux *aux)
 LSM_HOOK(void, LSM_RET_VOID, bpf_prog_free_security, struct bpf_prog_aux *aux)
+
+#ifdef CONFIG_SECCOMP_FILTER_EXTENDED
+LSM_HOOK(int, 0, seccomp_extended, void)
+#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */
 #endif /* CONFIG_BPF_SYSCALL */
 
 LSM_HOOK(int, 0, locked_down, enum lockdown_reason what)
diff --git a/include/linux/security.h b/include/linux/security.h
index 9aeda3f9e838..8e98dd98ac90 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -1960,6 +1960,11 @@ extern int security_bpf_map_alloc(struct bpf_map *map);
 extern void security_bpf_map_free(struct bpf_map *map);
 extern int security_bpf_prog_alloc(struct bpf_prog_aux *aux);
 extern void security_bpf_prog_free(struct bpf_prog_aux *aux);
+
+#ifdef CONFIG_SECCOMP_FILTER_EXTENDED
+extern int security_seccomp_extended(void);
+#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */
+
 #else
 static inline int security_bpf(int cmd, union bpf_attr *attr,
 					     unsigned int size)
@@ -1992,6 +1997,14 @@ static inline int security_bpf_prog_alloc(struct bpf_prog_aux *aux)
 
 static inline void security_bpf_prog_free(struct bpf_prog_aux *aux)
 { }
+
+#ifdef CONFIG_SECCOMP_FILTER_EXTENDED
+static inline int security_seccomp_extended(void)
+{
+	return 0;
+}
+#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */
+
 #endif /* CONFIG_SECURITY */
 #endif /* CONFIG_BPF_SYSCALL */
 
diff --git a/security/security.c b/security/security.c
index 94383f83ba42..301afe76ffb2 100644
--- a/security/security.c
+++ b/security/security.c
@@ -2553,6 +2553,14 @@ void security_bpf_prog_free(struct bpf_prog_aux *aux)
 {
 	call_void_hook(bpf_prog_free_security, aux);
 }
+
+#ifdef CONFIG_SECCOMP_FILTER_EXTENDED
+int security_seccomp_extended(void)
+{
+	return call_int_hook(seccomp_extended, 0);
+}
+#endif
+
 #endif /* CONFIG_BPF_SYSCALL */
 
 int security_locked_down(enum lockdown_reason what)
-- 
2.31.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH bpf-next seccomp 07/12] bpf/verifier: allow restricting direct map access
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
                   ` (5 preceding siblings ...)
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 06/12] lsm: New hook seccomp_extended YiFei Zhu
@ 2021-05-10 17:22 ` YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 08/12] seccomp-ebpf: restrict filter to almost cBPF if LSM request such YiFei Zhu
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

Add a verifier hook that is able to reject direct map access that
does not make use of eBPF helpers. These accesses mostly correspond
to eBPF data section accesses. This allows a program type to disable
maps altogether by resturing direct map accesses and not whitelisting
helpers that perform map accesses.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 include/linux/bpf.h   | 1 +
 kernel/bpf/verifier.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 02b02cb29ce2..86f3e8784e43 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -484,6 +484,7 @@ struct bpf_verifier_ops {
 				 enum bpf_access_type atype,
 				 u32 *next_btf_id);
 	bool (*check_kfunc_call)(u32 kfunc_btf_id);
+	bool (*map_access)(enum bpf_access_type type);
 };
 
 struct bpf_prog_offload_ops {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8fd552c16763..8eec1796caaa 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3100,6 +3100,9 @@ static int check_map_access_type(struct bpf_verifier_env *env, u32 regno,
 	struct bpf_map *map = regs[regno].map_ptr;
 	u32 cap = bpf_map_flags_to_cap(map);
 
+	if (env->ops->map_access && !env->ops->map_access(type))
+		cap = 0;
+
 	if (type == BPF_WRITE && !(cap & BPF_MAP_CAN_WRITE)) {
 		verbose(env, "write into map forbidden, value_size=%d off=%d size=%d\n",
 			map->value_size, off, size);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH bpf-next seccomp 08/12] seccomp-ebpf: restrict filter to almost cBPF if LSM request such
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
                   ` (6 preceding siblings ...)
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 07/12] bpf/verifier: allow restricting direct map access YiFei Zhu
@ 2021-05-10 17:22 ` YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 09/12] yama: (concept) restrict seccomp-eBPF with ptrace_scope YiFei Zhu
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

If LSM hook security_seccomp_extended returns non-zero, seccomp-eBPF
filters are not permitted to use eBPF maps or helpers.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 kernel/seccomp.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 8550ae885245..b9ed9951a05b 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2441,6 +2441,9 @@ static bool seccomp_is_valid_access(int off, int size,
 static const struct bpf_func_proto *
 seccomp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
+	if (security_seccomp_extended())
+		return NULL;
+
 	switch (func_id) {
 	case BPF_FUNC_get_current_uid_gid:
 		return &bpf_get_current_uid_gid_proto;
@@ -2459,9 +2462,15 @@ seccomp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 const struct bpf_prog_ops seccomp_prog_ops = {
 };
 
+static bool seccomp_map_access(enum bpf_access_type type)
+{
+	return !security_seccomp_extended();
+}
+
 const struct bpf_verifier_ops seccomp_verifier_ops = {
 	.get_func_proto		= seccomp_func_proto,
 	.is_valid_access	= seccomp_is_valid_access,
+	.map_access		= seccomp_map_access,
 };
 #endif /* CONFIG_SECCOMP_FILTER_EXTENDED */
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH bpf-next seccomp 09/12] yama: (concept) restrict seccomp-eBPF with ptrace_scope
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
                   ` (7 preceding siblings ...)
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 08/12] seccomp-ebpf: restrict filter to almost cBPF if LSM request such YiFei Zhu
@ 2021-05-10 17:22 ` YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory YiFei Zhu
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

LSM hook seccomp_extended is made to return -EPERM if the current
process may not ptrace its children, depending on the value of
ptrace_scope and CAP_SYS_PTRACE capability.

I'm not sure if this is the right way to do it, since ptrace_scope
is about ptrace and not seccomp. Is there a better policy that would
make more sense?

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 security/yama/yama_lsm.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/security/yama/yama_lsm.c b/security/yama/yama_lsm.c
index 06e226166aab..3b7b408b47a3 100644
--- a/security/yama/yama_lsm.c
+++ b/security/yama/yama_lsm.c
@@ -421,9 +421,39 @@ static int yama_ptrace_traceme(struct task_struct *parent)
 	return rc;
 }
 
+#ifdef CONFIG_SECCOMP_FILTER_EXTENDED
+static int yama_seccomp_extended(void)
+{
+	int rc = 0;
+
+	/* seccomp filter attach can only affect itself and children */
+	switch (ptrace_scope) {
+	case YAMA_SCOPE_DISABLED:
+	case YAMA_SCOPE_RELATIONAL:
+		/* No additional restrictions. */
+		break;
+	case YAMA_SCOPE_CAPABILITY:
+		rcu_read_lock();
+		if (!ns_capable(current_user_ns(), CAP_SYS_PTRACE))
+			rc = -EPERM;
+		rcu_read_unlock();
+		break;
+	case YAMA_SCOPE_NO_ATTACH:
+	default:
+		rc = -EPERM;
+		break;
+	}
+
+	return rc;
+}
+#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */
+
 static struct security_hook_list yama_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(ptrace_access_check, yama_ptrace_access_check),
 	LSM_HOOK_INIT(ptrace_traceme, yama_ptrace_traceme),
+#ifdef CONFIG_SECCOMP_FILTER_EXTENDED
+	LSM_HOOK_INIT(seccomp_extended, yama_seccomp_extended),
+#endif
 	LSM_HOOK_INIT(task_prctl, yama_task_prctl),
 	LSM_HOOK_INIT(task_free, yama_task_free),
 };
-- 
2.31.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
                   ` (8 preceding siblings ...)
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 09/12] yama: (concept) restrict seccomp-eBPF with ptrace_scope YiFei Zhu
@ 2021-05-10 17:22 ` YiFei Zhu
  2021-05-11  2:04   ` Alexei Starovoitov
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 11/12] bpf/verifier: support NULL-able ptr to BTF ID as helper argument YiFei Zhu
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

This uses helpers bpf_probe_read_user{,str}. To repect unprivileged
users may also load filters, when the loader of the filter does not
have CAP_SYS_PTRACE, attempting to read user memory when current mm
is non-dumpable results in -EPERM.

Right now this is not sleepable, -EFAULT may happen for valid memory
addresses. Future work might be adding support to bpf_copy_from_user
via sleepable filters.

Use of memory data to implement policy is discouraged until there is
a solution for time-of-check to time-of-use.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 include/linux/bpf.h      |  4 ++++
 kernel/seccomp.c         |  8 ++++++++
 kernel/trace/bpf_trace.c | 42 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 86f3e8784e43..2019c0893250 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1965,6 +1965,10 @@ extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto;
 extern const struct bpf_func_proto bpf_task_storage_get_proto;
 extern const struct bpf_func_proto bpf_task_storage_delete_proto;
 extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
+extern const struct bpf_func_proto bpf_probe_read_user_proto;
+extern const struct bpf_func_proto bpf_probe_read_user_dumpable_proto;
+extern const struct bpf_func_proto bpf_probe_read_user_str_proto;
+extern const struct bpf_func_proto bpf_probe_read_user_dumpable_str_proto;
 
 const struct bpf_func_proto *bpf_tracing_func_proto(
 	enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b9ed9951a05b..330e9c365cdc 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2449,6 +2449,14 @@ seccomp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_current_uid_gid_proto;
 	case BPF_FUNC_get_current_pid_tgid:
 		return &bpf_get_current_pid_tgid_proto;
+	case BPF_FUNC_probe_read_user:
+		return ns_capable(current_user_ns(), CAP_SYS_PTRACE) ?
+			&bpf_probe_read_user_proto :
+			&bpf_probe_read_user_dumpable_proto;
+	case BPF_FUNC_probe_read_user_str:
+		return ns_capable(current_user_ns(), CAP_SYS_PTRACE) ?
+			&bpf_probe_read_user_str_proto :
+			&bpf_probe_read_user_dumpable_str_proto;
 	default:
 		break;
 	}
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d2d7cf6cfe83..a1d6d64bde08 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -175,6 +175,27 @@ const struct bpf_func_proto bpf_probe_read_user_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_probe_read_user_dumpable, void *, dst, u32, size,
+	   const void __user *, unsafe_ptr)
+{
+	int ret = -EPERM;
+
+	if (get_dumpable(current->mm))
+		ret = copy_from_user_nofault(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+	return ret;
+}
+
+const struct bpf_func_proto bpf_probe_read_user_dumpable_proto = {
+	.func		= bpf_probe_read_user_dumpable,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 static __always_inline int
 bpf_probe_read_user_str_common(void *dst, u32 size,
 			       const void __user *unsafe_ptr)
@@ -212,6 +233,27 @@ const struct bpf_func_proto bpf_probe_read_user_str_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_probe_read_user_dumpable_str, void *, dst, u32, size,
+	   const void __user *, unsafe_ptr)
+{
+	int ret = -EPERM;
+
+	if (get_dumpable(current->mm))
+		ret = strncpy_from_user_nofault(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+	return ret;
+}
+
+const struct bpf_func_proto bpf_probe_read_user_dumpable_str_proto = {
+	.func		= bpf_probe_read_user_dumpable_str,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 static __always_inline int
 bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr)
 {
-- 
2.31.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH bpf-next seccomp 11/12] bpf/verifier: support NULL-able ptr to BTF ID as helper argument
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
                   ` (9 preceding siblings ...)
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory YiFei Zhu
@ 2021-05-10 17:22 ` YiFei Zhu
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 12/12] seccomp-ebpf: support task storage from BPF-LSM, defaulting to group leader YiFei Zhu
  2021-05-10 17:47   ` Andy Lutomirski
  12 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

This is to allow progs with no access to ptr to BPF ID still be
able to call some helpers, with these arguments set as NULL, so
the helper implementation may set a fallback when NULL is passed in.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 include/linux/bpf.h   |  1 +
 kernel/bpf/verifier.c | 12 +++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2019c0893250..efa6444b88d3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -303,6 +303,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock (fullsock) */
 	ARG_PTR_TO_SOCKET_OR_NULL,	/* pointer to bpf_sock (fullsock) or NULL */
 	ARG_PTR_TO_BTF_ID,	/* pointer to in-kernel struct */
+	ARG_PTR_TO_BTF_ID_OR_NULL,	/* pointer to in-kernel struct or NULL */
 	ARG_PTR_TO_ALLOC_MEM,	/* pointer to dynamically allocated memory */
 	ARG_PTR_TO_ALLOC_MEM_OR_NULL,	/* pointer to dynamically allocated memory or NULL */
 	ARG_CONST_ALLOC_SIZE_OR_ZERO,	/* number of allocated bytes requested */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8eec1796caaa..8a08a27e0abc 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -484,7 +484,8 @@ static bool arg_type_may_be_null(enum bpf_arg_type type)
 	       type == ARG_PTR_TO_CTX_OR_NULL ||
 	       type == ARG_PTR_TO_SOCKET_OR_NULL ||
 	       type == ARG_PTR_TO_ALLOC_MEM_OR_NULL ||
-	       type == ARG_PTR_TO_STACK_OR_NULL;
+	       type == ARG_PTR_TO_STACK_OR_NULL ||
+	       type == ARG_PTR_TO_BTF_ID_OR_NULL;
 }
 
 /* Determine whether the function releases some resources allocated by another
@@ -4808,6 +4809,7 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_SOCKET]		= &fullsock_types,
 	[ARG_PTR_TO_SOCKET_OR_NULL]	= &fullsock_types,
 	[ARG_PTR_TO_BTF_ID]		= &btf_ptr_types,
+	[ARG_PTR_TO_BTF_ID_OR_NULL]	= &btf_ptr_types,
 	[ARG_PTR_TO_SPIN_LOCK]		= &spin_lock_types,
 	[ARG_PTR_TO_MEM]		= &mem_types,
 	[ARG_PTR_TO_MEM_OR_NULL]	= &mem_types,
@@ -5436,10 +5438,14 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
-		if (fn->arg_type[i] == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
+		if ((fn->arg_type[i] == ARG_PTR_TO_BTF_ID ||
+		     fn->arg_type[i] == ARG_PTR_TO_BTF_ID_OR_NULL) &&
+		    !fn->arg_btf_id[i])
 			return false;
 
-		if (fn->arg_type[i] != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i])
+		if ((fn->arg_type[i] != ARG_PTR_TO_BTF_ID &&
+		     fn->arg_type[i] != ARG_PTR_TO_BTF_ID_OR_NULL) &&
+		    fn->arg_btf_id[i])
 			return false;
 	}
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH bpf-next seccomp 12/12] seccomp-ebpf: support task storage from BPF-LSM, defaulting to group leader
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
                   ` (10 preceding siblings ...)
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 11/12] bpf/verifier: support NULL-able ptr to BTF ID as helper argument YiFei Zhu
@ 2021-05-10 17:22 ` YiFei Zhu
  2021-05-11  1:58   ` Alexei Starovoitov
  2021-05-10 17:47   ` Andy Lutomirski
  12 siblings, 1 reply; 48+ messages in thread
From: YiFei Zhu @ 2021-05-10 17:22 UTC (permalink / raw)
  To: containers, bpf
  Cc: YiFei Zhu, linux-security-module, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

From: YiFei Zhu <yifeifz2@illinois.edu>

This enables seccomp-eBPF filters to have per-process state even when
the filter is loaded by an unprivileged process. Without CAP_BPF &&
CAP_PERFMON no access to ptr to BTF ID is possible, so the only valid
task the verifier will accept is NULL, and the helper implementation
fallbacks to the group leader to have a per-process storage.

Filters loaded by privileged processes may still access the storage
of arbitrary tasks via a valid task_struct ptr to BTF ID.

Since task storage require rcu being locked. We lock and unlock
rcu before every seccomp-eBPF filter execution.

I'm not sure if this is the best way to do this. One, this introduces
a dependency on BPF-LSM. Two, per-thread storage is not accessible
to unprivileged filter loaders; it has to be per-process.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
---
 include/linux/bpf.h           |  2 ++
 kernel/bpf/bpf_task_storage.c | 64 ++++++++++++++++++++++++++++++-----
 kernel/seccomp.c              |  4 +++
 3 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index efa6444b88d3..7c9755802275 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1964,7 +1964,9 @@ extern const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto;
 extern const struct bpf_func_proto bpf_sock_from_file_proto;
 extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto;
 extern const struct bpf_func_proto bpf_task_storage_get_proto;
+extern const struct bpf_func_proto bpf_task_storage_get_default_leader_proto;
 extern const struct bpf_func_proto bpf_task_storage_delete_proto;
+extern const struct bpf_func_proto bpf_task_storage_delete_default_leader_proto;
 extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
 extern const struct bpf_func_proto bpf_probe_read_user_proto;
 extern const struct bpf_func_proto bpf_probe_read_user_dumpable_proto;
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index 3ce75758d394..5ddf3a92d359 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -224,19 +224,19 @@ static int bpf_pid_task_storage_delete_elem(struct bpf_map *map, void *key)
 	return err;
 }
 
-BPF_CALL_4(bpf_task_storage_get, struct bpf_map *, map, struct task_struct *,
-	   task, void *, value, u64, flags)
+static void *_bpf_task_storage_get(struct bpf_map *map, struct task_struct *task,
+				   void *value, u64 flags)
 {
 	struct bpf_local_storage_data *sdata;
 
 	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
-		return (unsigned long)NULL;
+		return NULL;
 
 	if (!task)
-		return (unsigned long)NULL;
+		return NULL;
 
 	if (!bpf_task_storage_trylock())
-		return (unsigned long)NULL;
+		return NULL;
 
 	sdata = task_storage_lookup(task, map, true);
 	if (sdata)
@@ -251,12 +251,24 @@ BPF_CALL_4(bpf_task_storage_get, struct bpf_map *, map, struct task_struct *,
 
 unlock:
 	bpf_task_storage_unlock();
-	return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL :
-		(unsigned long)sdata->data;
+	return IS_ERR_OR_NULL(sdata) ? NULL : sdata->data;
 }
 
-BPF_CALL_2(bpf_task_storage_delete, struct bpf_map *, map, struct task_struct *,
-	   task)
+BPF_CALL_4(bpf_task_storage_get, struct bpf_map *, map, struct task_struct *,
+	   task, void *, value, u64, flags)
+{
+	return (unsigned long)_bpf_task_storage_get(map, task, value, flags);
+}
+
+BPF_CALL_4(bpf_task_storage_get_default_leader, struct bpf_map *, map,
+	   struct task_struct *, task, void *, value, u64, flags)
+{
+	if (!task)
+		task = current->group_leader;
+	return (unsigned long)_bpf_task_storage_get(map, task, value, flags);
+}
+
+static int _bpf_task_storage_delete(struct bpf_map *map, struct task_struct *task)
 {
 	int ret;
 
@@ -275,6 +287,20 @@ BPF_CALL_2(bpf_task_storage_delete, struct bpf_map *, map, struct task_struct *,
 	return ret;
 }
 
+BPF_CALL_2(bpf_task_storage_delete, struct bpf_map *, map, struct task_struct *,
+	   task)
+{
+	return _bpf_task_storage_delete(map, task);
+}
+
+BPF_CALL_2(bpf_task_storage_delete_default_leader, struct bpf_map *, map,
+	   struct task_struct *, task)
+{
+	if (!task)
+		task = current->group_leader;
+	return _bpf_task_storage_delete(map, task);
+}
+
 static int notsupp_get_next_key(struct bpf_map *map, void *key, void *next_key)
 {
 	return -ENOTSUPP;
@@ -330,6 +356,17 @@ const struct bpf_func_proto bpf_task_storage_get_proto = {
 	.arg4_type = ARG_ANYTHING,
 };
 
+const struct bpf_func_proto bpf_task_storage_get_default_leader_proto = {
+	.func = bpf_task_storage_get_default_leader,
+	.gpl_only = false,
+	.ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_PTR_TO_BTF_ID_OR_NULL,
+	.arg2_btf_id = &bpf_task_storage_btf_ids[0],
+	.arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg4_type = ARG_ANYTHING,
+};
+
 const struct bpf_func_proto bpf_task_storage_delete_proto = {
 	.func = bpf_task_storage_delete,
 	.gpl_only = false,
@@ -338,3 +375,12 @@ const struct bpf_func_proto bpf_task_storage_delete_proto = {
 	.arg2_type = ARG_PTR_TO_BTF_ID,
 	.arg2_btf_id = &bpf_task_storage_btf_ids[0],
 };
+
+const struct bpf_func_proto bpf_task_storage_delete_default_leader_proto = {
+	.func = bpf_task_storage_delete_default_leader,
+	.gpl_only = false,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_PTR_TO_BTF_ID_OR_NULL,
+	.arg2_btf_id = &bpf_task_storage_btf_ids[0],
+};
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 330e9c365cdc..5b41b2aee39c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2457,6 +2457,10 @@ seccomp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return ns_capable(current_user_ns(), CAP_SYS_PTRACE) ?
 			&bpf_probe_read_user_str_proto :
 			&bpf_probe_read_user_dumpable_str_proto;
+	case BPF_FUNC_task_storage_get:
+		return &bpf_task_storage_get_default_leader_proto;
+	case BPF_FUNC_task_storage_delete:
+		return &bpf_task_storage_delete_default_leader_proto;
 	default:
 		break;
 	}
-- 
2.31.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
  2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
@ 2021-05-10 17:47   ` Andy Lutomirski
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 02/12] bpf, seccomp: Add eBPF filter capabilities YiFei Zhu
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 48+ messages in thread
From: Andy Lutomirski @ 2021-05-10 17:47 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Austin Kuo, Claudio Canella, Daniel Borkmann,
	Daniel Gruss, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jann Horn, Jinghao Jia, Josep Torrellas,
	Kees Cook, Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tom Hromatka, Will Drewry

On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>
> From: YiFei Zhu <yifeifz2@illinois.edu>
>
> Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html
>
> This patchset enables seccomp filters to be written in eBPF.
> Supporting eBPF filters has been proposed a few times in the past.
> The main concerns were (1) use cases and (2) security. We have
> identified many use cases that can benefit from advanced eBPF
> filters, such as:

I haven't reviewed this carefully, but I think we need to distinguish
a few things:

1. Using the eBPF *language*.

2. Allowing the use of stateful / non-pure eBPF features.

3. Allowing the eBPF programs to read the target process' memory.

I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
even less convinced by (3).

>
>   * exec-only-once filter / apply filter after exec

This is (2).  I'm not sure it's a good idea.

>   * syscall logging (eg. via maps)

This is (2).  Probably useful, but doesn't obviously belong in
seccomp, or at least not as part of the same seccomp feature as
regular filtering.

>   * expressiveness & better tooling (no need for DSLs like easyseccomp)

(1).  Sounds good.

>   * contained syscall fault injection

(2)?  We can already do this with notifiers.

> For security, for an unprivileged caller, our implementation is as
> restrictive as user notifier + ptrace, in regards to capabilities.
> eBPF helpers follow the privilege model of original eBPF helpers.

eBPF doesn't really have a privilege model yet.  There was a long and
disappointing thread about this awhile back.

> Moreover, a mechanism for reading user memory is added. The same
> prototypes of bpf_probe_read_user{,str} from tracing are used. However,
> when the loader of bpf program does not have CAP_PTRACE, the helper
> will return -EPERM if the task under seccomp filter is non-dumpable.
> The reason for this is that if we perform reduction from seccomp-eBPF
> to user notifier + ptrace, ptrace requires CAP_PTRACE to read from
> a non-dumpable process. However, eBPF does not solve the TOCTOU problem
> of user notifier, so users should not use this to enforce a policy
> based on memory contents.

What is this for?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
@ 2021-05-10 17:47   ` Andy Lutomirski
  0 siblings, 0 replies; 48+ messages in thread
From: Andy Lutomirski @ 2021-05-10 17:47 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Austin Kuo, Claudio Canella, Daniel Borkmann,
	Daniel Gruss, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jann Horn, Jinghao Jia, Josep Torrellas,
	Kees Cook, Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tom Hromatka, Will Drewry

On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>
> From: YiFei Zhu <yifeifz2@illinois.edu>
>
> Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html
>
> This patchset enables seccomp filters to be written in eBPF.
> Supporting eBPF filters has been proposed a few times in the past.
> The main concerns were (1) use cases and (2) security. We have
> identified many use cases that can benefit from advanced eBPF
> filters, such as:

I haven't reviewed this carefully, but I think we need to distinguish
a few things:

1. Using the eBPF *language*.

2. Allowing the use of stateful / non-pure eBPF features.

3. Allowing the eBPF programs to read the target process' memory.

I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
even less convinced by (3).

>
>   * exec-only-once filter / apply filter after exec

This is (2).  I'm not sure it's a good idea.

>   * syscall logging (eg. via maps)

This is (2).  Probably useful, but doesn't obviously belong in
seccomp, or at least not as part of the same seccomp feature as
regular filtering.

>   * expressiveness & better tooling (no need for DSLs like easyseccomp)

(1).  Sounds good.

>   * contained syscall fault injection

(2)?  We can already do this with notifiers.

> For security, for an unprivileged caller, our implementation is as
> restrictive as user notifier + ptrace, in regards to capabilities.
> eBPF helpers follow the privilege model of original eBPF helpers.

eBPF doesn't really have a privilege model yet.  There was a long and
disappointing thread about this awhile back.

> Moreover, a mechanism for reading user memory is added. The same
> prototypes of bpf_probe_read_user{,str} from tracing are used. However,
> when the loader of bpf program does not have CAP_PTRACE, the helper
> will return -EPERM if the task under seccomp filter is non-dumpable.
> The reason for this is that if we perform reduction from seccomp-eBPF
> to user notifier + ptrace, ptrace requires CAP_PTRACE to read from
> a non-dumpable process. However, eBPF does not solve the TOCTOU problem
> of user notifier, so users should not use this to enforce a policy
> based on memory contents.

What is this for?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 12/12] seccomp-ebpf: support task storage from BPF-LSM, defaulting to group leader
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 12/12] seccomp-ebpf: support task storage from BPF-LSM, defaulting to group leader YiFei Zhu
@ 2021-05-11  1:58   ` Alexei Starovoitov
  2021-05-11  5:44       ` YiFei Zhu
  0 siblings, 1 reply; 48+ messages in thread
From: Alexei Starovoitov @ 2021-05-11  1:58 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, bpf, YiFei Zhu, linux-security-module,
	Alexei Starovoitov, Andrea Arcangeli, Andy Lutomirski,
	Austin Kuo, Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jinghao Jia, Josep Torrellas, Kees Cook,
	Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Mon, May 10, 2021 at 12:22:49PM -0500, YiFei Zhu wrote:
> +
> +BPF_CALL_4(bpf_task_storage_get_default_leader, struct bpf_map *, map,
> +	   struct task_struct *, task, void *, value, u64, flags)
> +{
> +	if (!task)
> +		task = current->group_leader;

Did you actually need it to be group_leader or current is enough?
If so loading BTF is not necessary.
You could have exposed it bpf_get_current_task_btf() and passed its
return value into bpf_task_storage_get.

On the other side loading BTF can be relaxed to unpriv,
but doing current->group_leader deref will make it priv only anyway.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory
  2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory YiFei Zhu
@ 2021-05-11  2:04   ` Alexei Starovoitov
  2021-05-11  7:14       ` YiFei Zhu
  0 siblings, 1 reply; 48+ messages in thread
From: Alexei Starovoitov @ 2021-05-11  2:04 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, bpf, YiFei Zhu, linux-security-module,
	Alexei Starovoitov, Andrea Arcangeli, Andy Lutomirski,
	Austin Kuo, Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jinghao Jia, Josep Torrellas, Kees Cook,
	Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Mon, May 10, 2021 at 12:22:47PM -0500, YiFei Zhu wrote:
>  
> +BPF_CALL_3(bpf_probe_read_user_dumpable, void *, dst, u32, size,
> +	   const void __user *, unsafe_ptr)
> +{
> +	int ret = -EPERM;
> +
> +	if (get_dumpable(current->mm))
> +		ret = copy_from_user_nofault(dst, unsafe_ptr, size);

Could you explain a bit more how dumpable flag makes it safe for unpriv?
The unpriv prog is attached to the children tasks only, right?
and dumpable gets cleared if euid changes?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
  2021-05-10 17:47   ` Andy Lutomirski
@ 2021-05-11  5:21     ` YiFei Zhu
  -1 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-11  5:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Austin Kuo, Claudio Canella, Daniel Borkmann,
	Daniel Gruss, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jann Horn, Jinghao Jia, Josep Torrellas,
	Kees Cook, Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tom Hromatka, Will Drewry

On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >
> > From: YiFei Zhu <yifeifz2@illinois.edu>
> >
> > Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html
> >
> > This patchset enables seccomp filters to be written in eBPF.
> > Supporting eBPF filters has been proposed a few times in the past.
> > The main concerns were (1) use cases and (2) security. We have
> > identified many use cases that can benefit from advanced eBPF
> > filters, such as:
>
> I haven't reviewed this carefully, but I think we need to distinguish
> a few things:
>
> 1. Using the eBPF *language*.
>
> 2. Allowing the use of stateful / non-pure eBPF features.
>
> 3. Allowing the eBPF programs to read the target process' memory.
>
> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> even less convinced by (3).
>
> >
> >   * exec-only-once filter / apply filter after exec
>
> This is (2).  I'm not sure it's a good idea.

The basic idea is that for a container runtime it may wait to execute
a program in a container without that program being able to execve
another program, stopping any attack that involves loading another
binary. The container runtime can block any syscall but execve in the
exec-ed process by using only cBPF.

The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
@Andrea and @Giuseppe, could you clarify more in case I missed
something?

> >   * syscall logging (eg. via maps)
>
> This is (2).  Probably useful, but doesn't obviously belong in
> seccomp, or at least not as part of the same seccomp feature as
> regular filtering.
>
> >   * expressiveness & better tooling (no need for DSLs like easyseccomp)
>
> (1).  Sounds good.
>
> >   * contained syscall fault injection
>
> (2)?  We can already do this with notifiers.

To clarify, “we can already do with notifiers” isn’t the point here.
We can do almost everything if you have notifiers and ptrace, but it
may impose significant overhead (see the microbenchmark results).

The reason I’m saying the overhead is important is for the
reproduction / testing of certain race conditions. A syscall failing
quickly in a userspace application could, from a race condition, have
a completely different trace as a syscall failing after a few context
switches. eBPF makes quick fault injection possible.

> > For security, for an unprivileged caller, our implementation is as
> > restrictive as user notifier + ptrace, in regards to capabilities.
> > eBPF helpers follow the privilege model of original eBPF helpers.
>
> eBPF doesn't really have a privilege model yet.  There was a long and
> disappointing thread about this awhile back.

The idea is that “seccomp-eBPF does not make life easier for an
adversary”. Any attack an adversary could potentially utilize
seccomp-eBPF, they can do the same with other eBPF features, i.e. it
would be an issue with eBPF in general rather than specifically
seccomp’s use of eBPF.

Here it is referring to the helpers goes to the base
bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
!perfmon_capable). In this case, if the adversary would utilize eBPF
helpers to perform an attack, they could do it via another
unprivileged prog type.

That said, there are a few additional helpers this patchset is adding:
* get_current_uid_gid
* get_current_pid_tgid
  These two provide public information (are namespaces a concern?). I
have no idea what kind of exploit it could add unless the adversary
somehow side-channels the task_struct? But in that case, how is the
reading of task_struct different from how the rest of the kernel is
reading task_struct?
  Though, if knowing the global uid / pid is a concern then the eBPF
progs will need to keep track of namespaces, and that might not be
trivial.
* probe_read_user
* probe_read_user_str
  Reduction to ptrace. The privilege model of reading another
process’s data (via process_vm_readv or
ptrace(PTRACE_PEEK{TEXT,DATA})) is guarded by
PTRACE_MODE_ATTACH_REALCREDS. However, unprivileged seccomp is
safeguarded by no_new_privs, so, unless the caller have a non-uniform
resuid & fsuid, in which case it’s the caller’s failure to relinquish
privileges, ruid of the seccomp-eBPF executor (which is task whose
syscalls is being filtered) would be the save as the ruid of the
applier (the task that set the seccomp mode, at the time of setting
it).
  The main concern here is LSMs. LSMs can further restrict the scope
of ptrace hence I also allow LSMs to deny all “the use of stateful /
non-pure eBPF features”.
  As for side channels... the copy_from_user_nofault may allow an
adversary to observe what’s in resident memory and what’s swapped out,
but the adversary can already do this by observing the timing of
memory accesses. The non-nofault variant copy_from_user is used
everywhere in the kernel, so if an adversary were to side channel the
kernel by copy_from_user against an address, they can already do it by
using a syscall with a pointer that would be used by copy_from_user.
* task_storage_get
* task_storage_delete
  This is what I’m least sure about. The implementation of
task_storage is more complex than the other helpers, and also assumes
a privileged eBPF loader. It would slightly extend the attack surface.
If this is a big issue then eBPF can emulate such a map by using some
hashmap and having PID as key...

> > Moreover, a mechanism for reading user memory is added. The same
> > prototypes of bpf_probe_read_user{,str} from tracing are used. However,
> > when the loader of bpf program does not have CAP_PTRACE, the helper
> > will return -EPERM if the task under seccomp filter is non-dumpable.
> > The reason for this is that if we perform reduction from seccomp-eBPF
> > to user notifier + ptrace, ptrace requires CAP_PTRACE to read from
> > a non-dumpable process. However, eBPF does not solve the TOCTOU problem
> > of user notifier, so users should not use this to enforce a policy
> > based on memory contents.
>
> What is this for?

Memory reading opens up lots of use cases. For example, logging what
files are being opened without imposing too much performance penalty
from strace. Or as an accelerator for user notify emulation, where
syscalls can be rejected on a fast path if we know the memory contents
does not satisfy certain conditions that user notify will check.

YiFei Zhu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
@ 2021-05-11  5:21     ` YiFei Zhu
  0 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-11  5:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Austin Kuo, Claudio Canella, Daniel Borkmann,
	Daniel Gruss, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jann Horn, Jinghao Jia, Josep Torrellas,
	Kees Cook, Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tom Hromatka, Will Drewry

On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >
> > From: YiFei Zhu <yifeifz2@illinois.edu>
> >
> > Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html
> >
> > This patchset enables seccomp filters to be written in eBPF.
> > Supporting eBPF filters has been proposed a few times in the past.
> > The main concerns were (1) use cases and (2) security. We have
> > identified many use cases that can benefit from advanced eBPF
> > filters, such as:
>
> I haven't reviewed this carefully, but I think we need to distinguish
> a few things:
>
> 1. Using the eBPF *language*.
>
> 2. Allowing the use of stateful / non-pure eBPF features.
>
> 3. Allowing the eBPF programs to read the target process' memory.
>
> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> even less convinced by (3).
>
> >
> >   * exec-only-once filter / apply filter after exec
>
> This is (2).  I'm not sure it's a good idea.

The basic idea is that for a container runtime it may wait to execute
a program in a container without that program being able to execve
another program, stopping any attack that involves loading another
binary. The container runtime can block any syscall but execve in the
exec-ed process by using only cBPF.

The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
@Andrea and @Giuseppe, could you clarify more in case I missed
something?

> >   * syscall logging (eg. via maps)
>
> This is (2).  Probably useful, but doesn't obviously belong in
> seccomp, or at least not as part of the same seccomp feature as
> regular filtering.
>
> >   * expressiveness & better tooling (no need for DSLs like easyseccomp)
>
> (1).  Sounds good.
>
> >   * contained syscall fault injection
>
> (2)?  We can already do this with notifiers.

To clarify, “we can already do with notifiers” isn’t the point here.
We can do almost everything if you have notifiers and ptrace, but it
may impose significant overhead (see the microbenchmark results).

The reason I’m saying the overhead is important is for the
reproduction / testing of certain race conditions. A syscall failing
quickly in a userspace application could, from a race condition, have
a completely different trace as a syscall failing after a few context
switches. eBPF makes quick fault injection possible.

> > For security, for an unprivileged caller, our implementation is as
> > restrictive as user notifier + ptrace, in regards to capabilities.
> > eBPF helpers follow the privilege model of original eBPF helpers.
>
> eBPF doesn't really have a privilege model yet.  There was a long and
> disappointing thread about this awhile back.

The idea is that “seccomp-eBPF does not make life easier for an
adversary”. Any attack an adversary could potentially utilize
seccomp-eBPF, they can do the same with other eBPF features, i.e. it
would be an issue with eBPF in general rather than specifically
seccomp’s use of eBPF.

Here it is referring to the helpers goes to the base
bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
!perfmon_capable). In this case, if the adversary would utilize eBPF
helpers to perform an attack, they could do it via another
unprivileged prog type.

That said, there are a few additional helpers this patchset is adding:
* get_current_uid_gid
* get_current_pid_tgid
  These two provide public information (are namespaces a concern?). I
have no idea what kind of exploit it could add unless the adversary
somehow side-channels the task_struct? But in that case, how is the
reading of task_struct different from how the rest of the kernel is
reading task_struct?
  Though, if knowing the global uid / pid is a concern then the eBPF
progs will need to keep track of namespaces, and that might not be
trivial.
* probe_read_user
* probe_read_user_str
  Reduction to ptrace. The privilege model of reading another
process’s data (via process_vm_readv or
ptrace(PTRACE_PEEK{TEXT,DATA})) is guarded by
PTRACE_MODE_ATTACH_REALCREDS. However, unprivileged seccomp is
safeguarded by no_new_privs, so, unless the caller have a non-uniform
resuid & fsuid, in which case it’s the caller’s failure to relinquish
privileges, ruid of the seccomp-eBPF executor (which is task whose
syscalls is being filtered) would be the save as the ruid of the
applier (the task that set the seccomp mode, at the time of setting
it).
  The main concern here is LSMs. LSMs can further restrict the scope
of ptrace hence I also allow LSMs to deny all “the use of stateful /
non-pure eBPF features”.
  As for side channels... the copy_from_user_nofault may allow an
adversary to observe what’s in resident memory and what’s swapped out,
but the adversary can already do this by observing the timing of
memory accesses. The non-nofault variant copy_from_user is used
everywhere in the kernel, so if an adversary were to side channel the
kernel by copy_from_user against an address, they can already do it by
using a syscall with a pointer that would be used by copy_from_user.
* task_storage_get
* task_storage_delete
  This is what I’m least sure about. The implementation of
task_storage is more complex than the other helpers, and also assumes
a privileged eBPF loader. It would slightly extend the attack surface.
If this is a big issue then eBPF can emulate such a map by using some
hashmap and having PID as key...

> > Moreover, a mechanism for reading user memory is added. The same
> > prototypes of bpf_probe_read_user{,str} from tracing are used. However,
> > when the loader of bpf program does not have CAP_PTRACE, the helper
> > will return -EPERM if the task under seccomp filter is non-dumpable.
> > The reason for this is that if we perform reduction from seccomp-eBPF
> > to user notifier + ptrace, ptrace requires CAP_PTRACE to read from
> > a non-dumpable process. However, eBPF does not solve the TOCTOU problem
> > of user notifier, so users should not use this to enforce a policy
> > based on memory contents.
>
> What is this for?

Memory reading opens up lots of use cases. For example, logging what
files are being opened without imposing too much performance penalty
from strace. Or as an accelerator for user notify emulation, where
syscalls can be rejected on a fast path if we know the memory contents
does not satisfy certain conditions that user notify will check.

YiFei Zhu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 12/12] seccomp-ebpf: support task storage from BPF-LSM, defaulting to group leader
  2021-05-11  1:58   ` Alexei Starovoitov
@ 2021-05-11  5:44       ` YiFei Zhu
  0 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-11  5:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Mon, May 10, 2021 at 8:58 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, May 10, 2021 at 12:22:49PM -0500, YiFei Zhu wrote:
> > +
> > +BPF_CALL_4(bpf_task_storage_get_default_leader, struct bpf_map *, map,
> > +        struct task_struct *, task, void *, value, u64, flags)
> > +{
> > +     if (!task)
> > +             task = current->group_leader;
>
> Did you actually need it to be group_leader or current is enough?
> If so loading BTF is not necessary.

I think if task_storage were to be used to apply a policy onto a set
of tasks, there are probably more use cases to perform the state
transitions across an entire process than state transitions across a
single thread. Though, since seccomp only applies to the process tree
a lot of use cases of a per-process storage would be covered by a
global datasec too.

> You could have exposed it bpf_get_current_task_btf() and passed its
> return value into bpf_task_storage_get.
>
> On the other side loading BTF can be relaxed to unpriv,
> but doing current->group_leader deref will make it priv only anyway.

Yeah, that deref is what I was concerned about. It seems that if I
expose BTF structs to a prog type it gains the ability to deref it,
and I definitely don't want unpriv reading task_structs. Though yeah
we could potentially change the verifier to prohibit PTR_TO_BTF_ID
deref and any pointer arithmetic on it...

How about, we expose bpf_get_current_task_btf to unpriv, prohibit
unpriv deref and pointer arithmetic, and have NULL be
current->group_leader? This way, unpriv has access to both per-thread
and per-process.

YiFei Zhu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 12/12] seccomp-ebpf: support task storage from BPF-LSM, defaulting to group leader
@ 2021-05-11  5:44       ` YiFei Zhu
  0 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-11  5:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Mon, May 10, 2021 at 8:58 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, May 10, 2021 at 12:22:49PM -0500, YiFei Zhu wrote:
> > +
> > +BPF_CALL_4(bpf_task_storage_get_default_leader, struct bpf_map *, map,
> > +        struct task_struct *, task, void *, value, u64, flags)
> > +{
> > +     if (!task)
> > +             task = current->group_leader;
>
> Did you actually need it to be group_leader or current is enough?
> If so loading BTF is not necessary.

I think if task_storage were to be used to apply a policy onto a set
of tasks, there are probably more use cases to perform the state
transitions across an entire process than state transitions across a
single thread. Though, since seccomp only applies to the process tree
a lot of use cases of a per-process storage would be covered by a
global datasec too.

> You could have exposed it bpf_get_current_task_btf() and passed its
> return value into bpf_task_storage_get.
>
> On the other side loading BTF can be relaxed to unpriv,
> but doing current->group_leader deref will make it priv only anyway.

Yeah, that deref is what I was concerned about. It seems that if I
expose BTF structs to a prog type it gains the ability to deref it,
and I definitely don't want unpriv reading task_structs. Though yeah
we could potentially change the verifier to prohibit PTR_TO_BTF_ID
deref and any pointer arithmetic on it...

How about, we expose bpf_get_current_task_btf to unpriv, prohibit
unpriv deref and pointer arithmetic, and have NULL be
current->group_leader? This way, unpriv has access to both per-thread
and per-process.

YiFei Zhu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory
  2021-05-11  2:04   ` Alexei Starovoitov
@ 2021-05-11  7:14       ` YiFei Zhu
  0 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-11  7:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Mon, May 10, 2021 at 9:04 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, May 10, 2021 at 12:22:47PM -0500, YiFei Zhu wrote:
> >
> > +BPF_CALL_3(bpf_probe_read_user_dumpable, void *, dst, u32, size,
> > +        const void __user *, unsafe_ptr)
> > +{
> > +     int ret = -EPERM;
> > +
> > +     if (get_dumpable(current->mm))
> > +             ret = copy_from_user_nofault(dst, unsafe_ptr, size);
>
> Could you explain a bit more how dumpable flag makes it safe for unpriv?
> The unpriv prog is attached to the children tasks only, right?
> and dumpable gets cleared if euid changes?

This is the "reduction to ptrace". The model here is that the eBPF
seccomp filter is doing the equivalent of ptracing the user process
using the privileges of the task at the time of loading the seccomp
filter.

ptrace access control is governed by ptrace.c:__ptrace_may_access. The
requirements are:
* always allow thread group introspection -- assume false so we are
more restrictive than ptrace.
* tracer has CAP_PTRACE in the target user namespace or tracer
r/fsu/gidid equal target resu/gid -- discuss below
* tracer has CAP_PTRACE in the target user namespace or target is
SUID_DUMP_USER (I realized I should probably change the condition to
== SUID_DUMP_USER).
* passes LSM checks (eg yama ptrace_scope) -- we expose a hook to LSM
but it's more of a "disable all advanced seccomp-eBPF features". How
would a better interface to LSM look like?

The dumpable check handles the "target is SUID_DUMP_USER" condition,
in the circumstance that the loader does not have CAP_PTRACE in its
namespace at the time of load. Why would this imply its CAP_PTRACE
capability in target namespace? This is based on my understanding on
how capabilities and user namespaces interact:
For the sake of simplicity, let's first assume that loader is the same
task as the task that attaches the filter (via prctl or seccomp
syscall).
* Case 1: target and loader are the same user namespace. Trivial case,
the two operations are the same.
* Case 2: target is loader's parent namespace. Can't happen under
assumption. Seccomp affects itself and children only, and it is only
possible to join a descendant user ns.
* Case 3: target is loader's descendant namespace. Loader would have
full CAP_PTRACE on target. We are more restrictive than ptrace.
* Case 4: target and loader are on unrelated namespace branches. Can't
happen under assumption. Same as case 2.

Let's break this assumption and see what happens if the loader and
attacher are in different contexts:
* Case 1: attacher is less capable (as a general term of "what it can
do") than loader then all of the above applies, since the model
concerns and checks the capabilities of the loader.
* Case 2: attacher is more capable than loader. The attacher would
need an fd to the prog to attach it:
  * subcase 1: attacher inherited the fd after an exec and became more
capable. uh... why is it trusting fds from a less capable context?
  * subcase 2: attacher has CAP_SYS_ADMIN and gets the fd via
BPF_PROG_GET_FD_BY_ID. uh... why is it trusting random fds and
attaching it?
  * subcase 3: attacher received the fd via a domain socket from a
process which may be in a different user namespace. On my first
thought, I thought, why is it trusting random fds from a less capable
context? Except I just thought of an adversary could:
    * Clone into new userns,
    * Load filter in child, which has CAP_PTRACE in new userns
    * Send filter to the parent which doesn't have CAP_PTRACE in its userns
    * It's broken :(
We'll think more about this case. One way is to check against init
namespace, which means unpriv container runtimes won't have the
non-dumpable override. Though, it shouldn't be affecting most of the
use cases. Alternatively we can store which userns it was loaded from
and reject attaching from a different userns.

Regarding u/gids, for an attacher to attach a seccomp filter, whether
cBPF or eBPF, if it doesn't have CAP_SYS_ADMIN in its current ns, it
will have to set no_new_privs on itself before it can attach. (Unlike
the previous discussion, this check is done at attach time rather than
load.) With no_new_privs the target's privs is a subset of the
attacher's, so the attacher should have a way to match the target's
resuid, so this condition is not a concern.

YiFei Zhu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory
@ 2021-05-11  7:14       ` YiFei Zhu
  0 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-11  7:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Mon, May 10, 2021 at 9:04 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, May 10, 2021 at 12:22:47PM -0500, YiFei Zhu wrote:
> >
> > +BPF_CALL_3(bpf_probe_read_user_dumpable, void *, dst, u32, size,
> > +        const void __user *, unsafe_ptr)
> > +{
> > +     int ret = -EPERM;
> > +
> > +     if (get_dumpable(current->mm))
> > +             ret = copy_from_user_nofault(dst, unsafe_ptr, size);
>
> Could you explain a bit more how dumpable flag makes it safe for unpriv?
> The unpriv prog is attached to the children tasks only, right?
> and dumpable gets cleared if euid changes?

This is the "reduction to ptrace". The model here is that the eBPF
seccomp filter is doing the equivalent of ptracing the user process
using the privileges of the task at the time of loading the seccomp
filter.

ptrace access control is governed by ptrace.c:__ptrace_may_access. The
requirements are:
* always allow thread group introspection -- assume false so we are
more restrictive than ptrace.
* tracer has CAP_PTRACE in the target user namespace or tracer
r/fsu/gidid equal target resu/gid -- discuss below
* tracer has CAP_PTRACE in the target user namespace or target is
SUID_DUMP_USER (I realized I should probably change the condition to
== SUID_DUMP_USER).
* passes LSM checks (eg yama ptrace_scope) -- we expose a hook to LSM
but it's more of a "disable all advanced seccomp-eBPF features". How
would a better interface to LSM look like?

The dumpable check handles the "target is SUID_DUMP_USER" condition,
in the circumstance that the loader does not have CAP_PTRACE in its
namespace at the time of load. Why would this imply its CAP_PTRACE
capability in target namespace? This is based on my understanding on
how capabilities and user namespaces interact:
For the sake of simplicity, let's first assume that loader is the same
task as the task that attaches the filter (via prctl or seccomp
syscall).
* Case 1: target and loader are the same user namespace. Trivial case,
the two operations are the same.
* Case 2: target is loader's parent namespace. Can't happen under
assumption. Seccomp affects itself and children only, and it is only
possible to join a descendant user ns.
* Case 3: target is loader's descendant namespace. Loader would have
full CAP_PTRACE on target. We are more restrictive than ptrace.
* Case 4: target and loader are on unrelated namespace branches. Can't
happen under assumption. Same as case 2.

Let's break this assumption and see what happens if the loader and
attacher are in different contexts:
* Case 1: attacher is less capable (as a general term of "what it can
do") than loader then all of the above applies, since the model
concerns and checks the capabilities of the loader.
* Case 2: attacher is more capable than loader. The attacher would
need an fd to the prog to attach it:
  * subcase 1: attacher inherited the fd after an exec and became more
capable. uh... why is it trusting fds from a less capable context?
  * subcase 2: attacher has CAP_SYS_ADMIN and gets the fd via
BPF_PROG_GET_FD_BY_ID. uh... why is it trusting random fds and
attaching it?
  * subcase 3: attacher received the fd via a domain socket from a
process which may be in a different user namespace. On my first
thought, I thought, why is it trusting random fds from a less capable
context? Except I just thought of an adversary could:
    * Clone into new userns,
    * Load filter in child, which has CAP_PTRACE in new userns
    * Send filter to the parent which doesn't have CAP_PTRACE in its userns
    * It's broken :(
We'll think more about this case. One way is to check against init
namespace, which means unpriv container runtimes won't have the
non-dumpable override. Though, it shouldn't be affecting most of the
use cases. Alternatively we can store which userns it was loaded from
and reject attaching from a different userns.

Regarding u/gids, for an attacher to attach a seccomp filter, whether
cBPF or eBPF, if it doesn't have CAP_SYS_ADMIN in its current ns, it
will have to set no_new_privs on itself before it can attach. (Unlike
the previous discussion, this check is done at attach time rather than
load.) With no_new_privs the target's privs is a subset of the
attacher's, so the attacher should have a way to match the target's
resuid, so this condition is not a concern.

YiFei Zhu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 12/12] seccomp-ebpf: support task storage from BPF-LSM, defaulting to group leader
  2021-05-11  5:44       ` YiFei Zhu
  (?)
@ 2021-05-12 21:56       ` Alexei Starovoitov
  -1 siblings, 0 replies; 48+ messages in thread
From: Alexei Starovoitov @ 2021-05-12 21:56 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Tue, May 11, 2021 at 12:44:46AM -0500, YiFei Zhu wrote:
> On Mon, May 10, 2021 at 8:58 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, May 10, 2021 at 12:22:49PM -0500, YiFei Zhu wrote:
> > > +
> > > +BPF_CALL_4(bpf_task_storage_get_default_leader, struct bpf_map *, map,
> > > +        struct task_struct *, task, void *, value, u64, flags)
> > > +{
> > > +     if (!task)
> > > +             task = current->group_leader;
> >
> > Did you actually need it to be group_leader or current is enough?
> > If so loading BTF is not necessary.
> 
> I think if task_storage were to be used to apply a policy onto a set
> of tasks, there are probably more use cases to perform the state
> transitions across an entire process than state transitions across a
> single thread. Though, since seccomp only applies to the process tree
> a lot of use cases of a per-process storage would be covered by a
> global datasec too.
> 
> > You could have exposed it bpf_get_current_task_btf() and passed its
> > return value into bpf_task_storage_get.
> >
> > On the other side loading BTF can be relaxed to unpriv,
> > but doing current->group_leader deref will make it priv only anyway.
> 
> Yeah, that deref is what I was concerned about. It seems that if I
> expose BTF structs to a prog type it gains the ability to deref it,
> and I definitely don't want unpriv reading task_structs. Though yeah
> we could potentially change the verifier to prohibit PTR_TO_BTF_ID
> deref and any pointer arithmetic on it...
> 
> How about, we expose bpf_get_current_task_btf to unpriv, prohibit
> unpriv deref and pointer arithmetic, and have NULL be
> current->group_leader? This way, unpriv has access to both per-thread
> and per-process.

NULL to mean current->group_leader is too specific and not extensible.
I'd rather add another bpf_get_current_task_btf-like helper that
returns parent or group_leader depending on flags.
For now I wouldn't even do that. As you said hashmap with key=pid will
work fine. task_local_storage is a performance optimization.
I'd focus on landing the key bits before optimizing.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory
  2021-05-11  7:14       ` YiFei Zhu
  (?)
@ 2021-05-12 22:36       ` Alexei Starovoitov
  2021-05-13  5:26           ` YiFei Zhu
  -1 siblings, 1 reply; 48+ messages in thread
From: Alexei Starovoitov @ 2021-05-12 22:36 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Tue, May 11, 2021 at 02:14:01AM -0500, YiFei Zhu wrote:
> On Mon, May 10, 2021 at 9:04 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, May 10, 2021 at 12:22:47PM -0500, YiFei Zhu wrote:
> > >
> > > +BPF_CALL_3(bpf_probe_read_user_dumpable, void *, dst, u32, size,
> > > +        const void __user *, unsafe_ptr)
> > > +{
> > > +     int ret = -EPERM;
> > > +
> > > +     if (get_dumpable(current->mm))
> > > +             ret = copy_from_user_nofault(dst, unsafe_ptr, size);
> >
> > Could you explain a bit more how dumpable flag makes it safe for unpriv?
> > The unpriv prog is attached to the children tasks only, right?
> > and dumpable gets cleared if euid changes?
> 
> This is the "reduction to ptrace". The model here is that the eBPF
> seccomp filter is doing the equivalent of ptracing the user process
> using the privileges of the task at the time of loading the seccomp
> filter.
> 
> ptrace access control is governed by ptrace.c:__ptrace_may_access. The
> requirements are:
> * always allow thread group introspection -- assume false so we are
> more restrictive than ptrace.
> * tracer has CAP_PTRACE in the target user namespace or tracer
> r/fsu/gidid equal target resu/gid -- discuss below
> * tracer has CAP_PTRACE in the target user namespace or target is
> SUID_DUMP_USER (I realized I should probably change the condition to
> == SUID_DUMP_USER).
> * passes LSM checks (eg yama ptrace_scope) -- we expose a hook to LSM
> but it's more of a "disable all advanced seccomp-eBPF features". How
> would a better interface to LSM look like?
> 
> The dumpable check handles the "target is SUID_DUMP_USER" condition,
> in the circumstance that the loader does not have CAP_PTRACE in its
> namespace at the time of load. Why would this imply its CAP_PTRACE
> capability in target namespace? This is based on my understanding on
> how capabilities and user namespaces interact:
> For the sake of simplicity, let's first assume that loader is the same
> task as the task that attaches the filter (via prctl or seccomp
> syscall).
> * Case 1: target and loader are the same user namespace. Trivial case,
> the two operations are the same.
> * Case 2: target is loader's parent namespace. Can't happen under
> assumption. Seccomp affects itself and children only, and it is only
> possible to join a descendant user ns.
> * Case 3: target is loader's descendant namespace. Loader would have
> full CAP_PTRACE on target. We are more restrictive than ptrace.
> * Case 4: target and loader are on unrelated namespace branches. Can't
> happen under assumption. Same as case 2.
> 
> Let's break this assumption and see what happens if the loader and
> attacher are in different contexts:
> * Case 1: attacher is less capable (as a general term of "what it can
> do") than loader then all of the above applies, since the model
> concerns and checks the capabilities of the loader.
> * Case 2: attacher is more capable than loader. The attacher would
> need an fd to the prog to attach it:
>   * subcase 1: attacher inherited the fd after an exec and became more
> capable. uh... why is it trusting fds from a less capable context?
>   * subcase 2: attacher has CAP_SYS_ADMIN and gets the fd via
> BPF_PROG_GET_FD_BY_ID. uh... why is it trusting random fds and
> attaching it?
>   * subcase 3: attacher received the fd via a domain socket from a
> process which may be in a different user namespace. On my first
> thought, I thought, why is it trusting random fds from a less capable
> context? Except I just thought of an adversary could:
>     * Clone into new userns,
>     * Load filter in child, which has CAP_PTRACE in new userns
>     * Send filter to the parent which doesn't have CAP_PTRACE in its userns
>     * It's broken :(
> We'll think more about this case. One way is to check against init
> namespace, which means unpriv container runtimes won't have the
> non-dumpable override. Though, it shouldn't be affecting most of the
> use cases. Alternatively we can store which userns it was loaded from
> and reject attaching from a different userns.

Typically the verifier does all the checks at load time to avoid
run-time overhead during program execution. Then at attach time we
check that attach parameters provided at load time match exactly
to those at attach time. ifindex, attach_btf_id, etc fall into this category.
Doing something similar it should be possible to avoid
doing get_dumpable() at run-time.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory
  2021-05-12 22:36       ` Alexei Starovoitov
@ 2021-05-13  5:26           ` YiFei Zhu
  0 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-13  5:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Wed, May 12, 2021 at 5:36 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> Typically the verifier does all the checks at load time to avoid
> run-time overhead during program execution. Then at attach time we
> check that attach parameters provided at load time match exactly
> to those at attach time. ifindex, attach_btf_id, etc fall into this category.
> Doing something similar it should be possible to avoid
> doing get_dumpable() at run-time.

Do you mean to move the check of dumpable to load time instead of
runtime? I do not think that makes sense. A process may arbitrarily
set its dumpable attribute during execution via prctl. A process could
do set itself to non-dumpable, before interacting with sensitive
information that would better not be possible to be dumped (eg.
ssh-agent does this [1]). Therefore, being dumpable at one point in
time does not indicate anything about whether it stays dumpable at a
later point in time. Besides, seccomp filters are inherited across
clone and exec, attaching to many tasks with no option to detach. What
should the load-time check of task dump-ability be against? The
current task may only be the tip of an iceburg.

[1] https://github.com/openssh/openssh-portable/blob/2dc328023f60212cd29504fc05d849133ae47355/ssh-agent.c#L1398

YiFei Zhu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory
@ 2021-05-13  5:26           ` YiFei Zhu
  0 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-13  5:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Andy Lutomirski, Austin Kuo, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jinghao Jia,
	Josep Torrellas, Kees Cook, Sargun Dhillon, Tianyin Xu,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Wed, May 12, 2021 at 5:36 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> Typically the verifier does all the checks at load time to avoid
> run-time overhead during program execution. Then at attach time we
> check that attach parameters provided at load time match exactly
> to those at attach time. ifindex, attach_btf_id, etc fall into this category.
> Doing something similar it should be possible to avoid
> doing get_dumpable() at run-time.

Do you mean to move the check of dumpable to load time instead of
runtime? I do not think that makes sense. A process may arbitrarily
set its dumpable attribute during execution via prctl. A process could
do set itself to non-dumpable, before interacting with sensitive
information that would better not be possible to be dumped (eg.
ssh-agent does this [1]). Therefore, being dumpable at one point in
time does not indicate anything about whether it stays dumpable at a
later point in time. Besides, seccomp filters are inherited across
clone and exec, attaching to many tasks with no option to detach. What
should the load-time check of task dump-ability be against? The
current task may only be the tip of an iceburg.

[1] https://github.com/openssh/openssh-portable/blob/2dc328023f60212cd29504fc05d849133ae47355/ssh-agent.c#L1398

YiFei Zhu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory
  2021-05-13  5:26           ` YiFei Zhu
  (?)
@ 2021-05-13 14:53           ` Andy Lutomirski
  2021-05-13 17:12               ` YiFei Zhu
  -1 siblings, 1 reply; 48+ messages in thread
From: Andy Lutomirski @ 2021-05-13 14:53 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Alexei Starovoitov, containers, bpf, YiFei Zhu, LSM List,
	Alexei Starovoitov, Andrea Arcangeli, Austin Kuo,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jinghao Jia, Josep Torrellas, Kees Cook,
	Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry


> On May 12, 2021, at 10:26 PM, YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> 
> On Wed, May 12, 2021 at 5:36 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> Typically the verifier does all the checks at load time to avoid
>> run-time overhead during program execution. Then at attach time we
>> check that attach parameters provided at load time match exactly
>> to those at attach time. ifindex, attach_btf_id, etc fall into this category.
>> Doing something similar it should be possible to avoid
>> doing get_dumpable() at run-time.
> 
> Do you mean to move the check of dumpable to load time instead of
> runtime? I do not think that makes sense. A process may arbitrarily
> set its dumpable attribute during execution via prctl. A process could
> do set itself to non-dumpable, before interacting with sensitive
> information that would better not be possible to be dumped (eg.
> ssh-agent does this [1]). Therefore, being dumpable at one point in
> time does not indicate anything about whether it stays dumpable at a
> later point in time. Besides, seccomp filters are inherited across
> clone and exec, attaching to many tasks with no option to detach. What
> should the load-time check of task dump-ability be against? The
> current task may only be the tip of an iceburg.
> 
> [1] https://github.com/openssh/openssh-portable/blob/2dc328023f60212cd29504fc05d849133ae47355/ssh-agent.c#L1398
> 
> 

First things first: why are you checking dumpable at all?  Once you figure out why and whether it’s needed, you may learn something about what task to check.

I don’t think checking dumpable makes any sense.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory
  2021-05-13 14:53           ` Andy Lutomirski
@ 2021-05-13 17:12               ` YiFei Zhu
  0 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-13 17:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, containers, bpf, YiFei Zhu, LSM List,
	Alexei Starovoitov, Andrea Arcangeli, Austin Kuo,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jinghao Jia, Josep Torrellas, Kees Cook,
	Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Thu, May 13, 2021 at 9:53 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > On May 12, 2021, at 10:26 PM, YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >
> > On Wed, May 12, 2021 at 5:36 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> Typically the verifier does all the checks at load time to avoid
> >> run-time overhead during program execution. Then at attach time we
> >> check that attach parameters provided at load time match exactly
> >> to those at attach time. ifindex, attach_btf_id, etc fall into this category.
> >> Doing something similar it should be possible to avoid
> >> doing get_dumpable() at run-time.
> >
> > Do you mean to move the check of dumpable to load time instead of
> > runtime? I do not think that makes sense. A process may arbitrarily
> > set its dumpable attribute during execution via prctl. A process could
> > do set itself to non-dumpable, before interacting with sensitive
> > information that would better not be possible to be dumped (eg.
> > ssh-agent does this [1]). Therefore, being dumpable at one point in
> > time does not indicate anything about whether it stays dumpable at a
> > later point in time. Besides, seccomp filters are inherited across
> > clone and exec, attaching to many tasks with no option to detach. What
> > should the load-time check of task dump-ability be against? The
> > current task may only be the tip of an iceburg.
> >
> > [1] https://github.com/openssh/openssh-portable/blob/2dc328023f60212cd29504fc05d849133ae47355/ssh-agent.c#L1398
> >
> >
>
> First things first: why are you checking dumpable at all?  Once you figure out why and whether it’s needed, you may learn something about what task to check.
>
> I don’t think checking dumpable makes any sense.

ptrace. We don't want to extend one's ability to read another
process's memory if they could not read it via ptrace
(process_vm_readv or ptrace(PTRACE_PEEK{TEXT,DATA})). The constraints
for ptrace to access a target's memory I've written down earlier [1],
but tl;dr: to be at least as restrictive as ptrace, a tracer without
CAP_PTRACE cannot trace a non-dumpable process. What's the target
process (i.e. the process whose memory is being read) in the context
of a seccomp filter? The current task. Does that answer your
questions?

[1] https://lore.kernel.org/bpf/CABqSeAT8iz-VhWjWqABqGbF7ydkoT7LmzJ5Do8K1ANQvQK=FJQ@mail.gmail.com/

YiFei Zhu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory
@ 2021-05-13 17:12               ` YiFei Zhu
  0 siblings, 0 replies; 48+ messages in thread
From: YiFei Zhu @ 2021-05-13 17:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, containers, bpf, YiFei Zhu, LSM List,
	Alexei Starovoitov, Andrea Arcangeli, Austin Kuo,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jinghao Jia, Josep Torrellas, Kees Cook,
	Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Thu, May 13, 2021 at 9:53 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > On May 12, 2021, at 10:26 PM, YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >
> > On Wed, May 12, 2021 at 5:36 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> Typically the verifier does all the checks at load time to avoid
> >> run-time overhead during program execution. Then at attach time we
> >> check that attach parameters provided at load time match exactly
> >> to those at attach time. ifindex, attach_btf_id, etc fall into this category.
> >> Doing something similar it should be possible to avoid
> >> doing get_dumpable() at run-time.
> >
> > Do you mean to move the check of dumpable to load time instead of
> > runtime? I do not think that makes sense. A process may arbitrarily
> > set its dumpable attribute during execution via prctl. A process could
> > do set itself to non-dumpable, before interacting with sensitive
> > information that would better not be possible to be dumped (eg.
> > ssh-agent does this [1]). Therefore, being dumpable at one point in
> > time does not indicate anything about whether it stays dumpable at a
> > later point in time. Besides, seccomp filters are inherited across
> > clone and exec, attaching to many tasks with no option to detach. What
> > should the load-time check of task dump-ability be against? The
> > current task may only be the tip of an iceburg.
> >
> > [1] https://github.com/openssh/openssh-portable/blob/2dc328023f60212cd29504fc05d849133ae47355/ssh-agent.c#L1398
> >
> >
>
> First things first: why are you checking dumpable at all?  Once you figure out why and whether it’s needed, you may learn something about what task to check.
>
> I don’t think checking dumpable makes any sense.

ptrace. We don't want to extend one's ability to read another
process's memory if they could not read it via ptrace
(process_vm_readv or ptrace(PTRACE_PEEK{TEXT,DATA})). The constraints
for ptrace to access a target's memory I've written down earlier [1],
but tl;dr: to be at least as restrictive as ptrace, a tracer without
CAP_PTRACE cannot trace a non-dumpable process. What's the target
process (i.e. the process whose memory is being read) in the context
of a seccomp filter? The current task. Does that answer your
questions?

[1] https://lore.kernel.org/bpf/CABqSeAT8iz-VhWjWqABqGbF7ydkoT7LmzJ5Do8K1ANQvQK=FJQ@mail.gmail.com/

YiFei Zhu

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory
  2021-05-13 17:12               ` YiFei Zhu
@ 2021-05-13 17:15                 ` Andy Lutomirski
  -1 siblings, 0 replies; 48+ messages in thread
From: Andy Lutomirski @ 2021-05-13 17:15 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Alexei Starovoitov, containers, bpf, YiFei Zhu, LSM List,
	Alexei Starovoitov, Andrea Arcangeli, Austin Kuo,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jinghao Jia, Josep Torrellas, Kees Cook,
	Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Thu, May 13, 2021 at 10:13 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>
> On Thu, May 13, 2021 at 9:53 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > > On May 12, 2021, at 10:26 PM, YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > >
> > > On Wed, May 12, 2021 at 5:36 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > >> Typically the verifier does all the checks at load time to avoid
> > >> run-time overhead during program execution. Then at attach time we
> > >> check that attach parameters provided at load time match exactly
> > >> to those at attach time. ifindex, attach_btf_id, etc fall into this category.
> > >> Doing something similar it should be possible to avoid
> > >> doing get_dumpable() at run-time.
> > >
> > > Do you mean to move the check of dumpable to load time instead of
> > > runtime? I do not think that makes sense. A process may arbitrarily
> > > set its dumpable attribute during execution via prctl. A process could
> > > do set itself to non-dumpable, before interacting with sensitive
> > > information that would better not be possible to be dumped (eg.
> > > ssh-agent does this [1]). Therefore, being dumpable at one point in
> > > time does not indicate anything about whether it stays dumpable at a
> > > later point in time. Besides, seccomp filters are inherited across
> > > clone and exec, attaching to many tasks with no option to detach. What
> > > should the load-time check of task dump-ability be against? The
> > > current task may only be the tip of an iceburg.
> > >
> > > [1] https://github.com/openssh/openssh-portable/blob/2dc328023f60212cd29504fc05d849133ae47355/ssh-agent.c#L1398
> > >
> > >
> >
> > First things first: why are you checking dumpable at all?  Once you figure out why and whether it’s needed, you may learn something about what task to check.
> >
> > I don’t think checking dumpable makes any sense.
>
> ptrace. We don't want to extend one's ability to read another
> process's memory if they could not read it via ptrace
> (process_vm_readv or ptrace(PTRACE_PEEK{TEXT,DATA})). The constraints
> for ptrace to access a target's memory I've written down earlier [1],
> but tl;dr: to be at least as restrictive as ptrace, a tracer without
> CAP_PTRACE cannot trace a non-dumpable process. What's the target
> process (i.e. the process whose memory is being read) in the context
> of a seccomp filter? The current task. Does that answer your
> questions?
>
> [1] https://lore.kernel.org/bpf/CABqSeAT8iz-VhWjWqABqGbF7ydkoT7LmzJ5Do8K1ANQvQK=FJQ@mail.gmail.com/

The whole seccomp model is based on the assumption that the filter
installer completely controls the filtered task.  Reading memory is
not qualitatively different.

To be clear, this is not to be interpreted as an ack to allowing
seccomp to read process memory.  I'm saying that, if seccomp gains the
ability to read process memory, I don't think a dumpable or ptrace
check is needed.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory
@ 2021-05-13 17:15                 ` Andy Lutomirski
  0 siblings, 0 replies; 48+ messages in thread
From: Andy Lutomirski @ 2021-05-13 17:15 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: Alexei Starovoitov, containers, bpf, YiFei Zhu, LSM List,
	Alexei Starovoitov, Andrea Arcangeli, Austin Kuo,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jinghao Jia, Josep Torrellas, Kees Cook,
	Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Thu, May 13, 2021 at 10:13 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>
> On Thu, May 13, 2021 at 9:53 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > > On May 12, 2021, at 10:26 PM, YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > >
> > > On Wed, May 12, 2021 at 5:36 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > >> Typically the verifier does all the checks at load time to avoid
> > >> run-time overhead during program execution. Then at attach time we
> > >> check that attach parameters provided at load time match exactly
> > >> to those at attach time. ifindex, attach_btf_id, etc fall into this category.
> > >> Doing something similar it should be possible to avoid
> > >> doing get_dumpable() at run-time.
> > >
> > > Do you mean to move the check of dumpable to load time instead of
> > > runtime? I do not think that makes sense. A process may arbitrarily
> > > set its dumpable attribute during execution via prctl. A process could
> > > do set itself to non-dumpable, before interacting with sensitive
> > > information that would better not be possible to be dumped (eg.
> > > ssh-agent does this [1]). Therefore, being dumpable at one point in
> > > time does not indicate anything about whether it stays dumpable at a
> > > later point in time. Besides, seccomp filters are inherited across
> > > clone and exec, attaching to many tasks with no option to detach. What
> > > should the load-time check of task dump-ability be against? The
> > > current task may only be the tip of an iceburg.
> > >
> > > [1] https://github.com/openssh/openssh-portable/blob/2dc328023f60212cd29504fc05d849133ae47355/ssh-agent.c#L1398
> > >
> > >
> >
> > First things first: why are you checking dumpable at all?  Once you figure out why and whether it’s needed, you may learn something about what task to check.
> >
> > I don’t think checking dumpable makes any sense.
>
> ptrace. We don't want to extend one's ability to read another
> process's memory if they could not read it via ptrace
> (process_vm_readv or ptrace(PTRACE_PEEK{TEXT,DATA})). The constraints
> for ptrace to access a target's memory I've written down earlier [1],
> but tl;dr: to be at least as restrictive as ptrace, a tracer without
> CAP_PTRACE cannot trace a non-dumpable process. What's the target
> process (i.e. the process whose memory is being read) in the context
> of a seccomp filter? The current task. Does that answer your
> questions?
>
> [1] https://lore.kernel.org/bpf/CABqSeAT8iz-VhWjWqABqGbF7ydkoT7LmzJ5Do8K1ANQvQK=FJQ@mail.gmail.com/

The whole seccomp model is based on the assumption that the filter
installer completely controls the filtered task.  Reading memory is
not qualitatively different.

To be clear, this is not to be interpreted as an ack to allowing
seccomp to read process memory.  I'm saying that, if seccomp gains the
ability to read process memory, I don't think a dumpable or ptrace
check is needed.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
  2021-05-11  5:21     ` YiFei Zhu
  (?)
@ 2021-05-15 15:49     ` Andy Lutomirski
  2021-05-20  9:05       ` Christian Brauner
  -1 siblings, 1 reply; 48+ messages in thread
From: Andy Lutomirski @ 2021-05-15 15:49 UTC (permalink / raw)
  To: YiFei Zhu
  Cc: containers, bpf, YiFei Zhu, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Austin Kuo, Claudio Canella, Daniel Borkmann,
	Daniel Gruss, Dimitrios Skarlatos, Giuseppe Scrivano,
	Hubertus Franke, Jann Horn, Jinghao Jia, Josep Torrellas,
	Kees Cook, Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum,
	Tom Hromatka, Will Drewry

On 5/10/21 10:21 PM, YiFei Zhu wrote:
> On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
>> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>>>
>>> From: YiFei Zhu <yifeifz2@illinois.edu>
>>>
>>> Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html
>>>
>>> This patchset enables seccomp filters to be written in eBPF.
>>> Supporting eBPF filters has been proposed a few times in the past.
>>> The main concerns were (1) use cases and (2) security. We have
>>> identified many use cases that can benefit from advanced eBPF
>>> filters, such as:
>>
>> I haven't reviewed this carefully, but I think we need to distinguish
>> a few things:
>>
>> 1. Using the eBPF *language*.
>>
>> 2. Allowing the use of stateful / non-pure eBPF features.
>>
>> 3. Allowing the eBPF programs to read the target process' memory.
>>
>> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
>> even less convinced by (3).
>>
>>>
>>>   * exec-only-once filter / apply filter after exec
>>
>> This is (2).  I'm not sure it's a good idea.
> 
> The basic idea is that for a container runtime it may wait to execute
> a program in a container without that program being able to execve
> another program, stopping any attack that involves loading another
> binary. The container runtime can block any syscall but execve in the
> exec-ed process by using only cBPF.
> 
> The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> @Andrea and @Giuseppe, could you clarify more in case I missed
> something?

We've discussed having a notifier-using filter be able to replace its
filter.  This would allow this and other use cases without any
additional eBPF or cBPF code.

>> eBPF doesn't really have a privilege model yet.  There was a long and
>> disappointing thread about this awhile back.
> 
> The idea is that “seccomp-eBPF does not make life easier for an
> adversary”. Any attack an adversary could potentially utilize
> seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> would be an issue with eBPF in general rather than specifically
> seccomp’s use of eBPF.
> 
> Here it is referring to the helpers goes to the base
> bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> !perfmon_capable). In this case, if the adversary would utilize eBPF
> helpers to perform an attack, they could do it via another
> unprivileged prog type.
> 
> That said, there are a few additional helpers this patchset is adding:
> * get_current_uid_gid
> * get_current_pid_tgid
>   These two provide public information (are namespaces a concern?). I
> have no idea what kind of exploit it could add unless the adversary
> somehow side-channels the task_struct? But in that case, how is the
> reading of task_struct different from how the rest of the kernel is
> reading task_struct?

Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
(what ever happened to that?), and it likely has the same problems for
seccomp.

>>
>> What is this for?
> 
> Memory reading opens up lots of use cases. For example, logging what
> files are being opened without imposing too much performance penalty
> from strace. Or as an accelerator for user notify emulation, where
> syscalls can be rejected on a fast path if we know the memory contents
> does not satisfy certain conditions that user notify will check.
> 

This has all kinds of race conditions.


I hate to be a party pooper, but this patchset is going to very high bar
to acceptance.  Right now, seccomp has a couple of excellent properties:

First, while it has limited expressiveness, it is simple enough that the
implementation can be easily understood and the scope for
vulnerabilities that fall through the cracks of the seccomp sandbox
model is low.  Compare this to Windows' low-integrity/high-integrity
sandbox system: there is a never ending string of sandbox escapes due to
token misuse, unexpected things at various integrity levels, etc.
Seccomp doesn't have tokens or integrity levels, and these bugs don't
happen.

Second, seccomp works, almost unchanged, in a completely unprivileged
context.  The last time making eBPF work sensibly in a less- or
-unprivileged context, the maintainers mostly rejected the idea of
developing/debugging a permission model for maps, cleaning up the bpf
object id system, etc.  You are going to have a very hard time
convincing the seccomp maintainers to let any of these mechanism
interact with seccomp until the underlying permission model is in place.

--Andy

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
       [not found]     ` <fffbea8189794a8da539f6082af3de8e@DM5PR11MB1692.namprd11.prod.outlook.com>
@ 2021-05-16  8:38       ` Tianyin Xu
  2021-05-17 15:40         ` Tycho Andersen
                           ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Tianyin Xu @ 2021-05-16  8:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: YiFei Zhu, containers, bpf, Zhu, YiFei, LSM List,
	Alexei Starovoitov, Andrea Arcangeli, Kuo, Hsuan-Chi,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jia, Jinghao, Torrellas, Josep, Kees Cook,
	Sargun Dhillon, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >>>
> >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> >>>
> >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> >>>
> >>> This patchset enables seccomp filters to be written in eBPF.
> >>> Supporting eBPF filters has been proposed a few times in the past.
> >>> The main concerns were (1) use cases and (2) security. We have
> >>> identified many use cases that can benefit from advanced eBPF
> >>> filters, such as:
> >>
> >> I haven't reviewed this carefully, but I think we need to distinguish
> >> a few things:
> >>
> >> 1. Using the eBPF *language*.
> >>
> >> 2. Allowing the use of stateful / non-pure eBPF features.
> >>
> >> 3. Allowing the eBPF programs to read the target process' memory.
> >>
> >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> >> even less convinced by (3).
> >>
> >>>
> >>>   * exec-only-once filter / apply filter after exec
> >>
> >> This is (2).  I'm not sure it's a good idea.
> >
> > The basic idea is that for a container runtime it may wait to execute
> > a program in a container without that program being able to execve
> > another program, stopping any attack that involves loading another
> > binary. The container runtime can block any syscall but execve in the
> > exec-ed process by using only cBPF.
> >
> > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > @Andrea and @Giuseppe, could you clarify more in case I missed
> > something?
>
> We've discussed having a notifier-using filter be able to replace its
> filter.  This would allow this and other use cases without any
> additional eBPF or cBPF code.
>

A notifier is not always a solution (even ignoring its perf overhead).

One problem, pointed out by Andrea Arcangeli, is that notifiers need
userspace daemons. So, it can hardly be used by daemonless container
engines like Podman.

And, /* sorry for repeating.. */ the performance overhead of notifiers
is not close to ebpf, which prevents use cases that require native
performance.


> >> eBPF doesn't really have a privilege model yet.  There was a long and
> >> disappointing thread about this awhile back.
> >
> > The idea is that “seccomp-eBPF does not make life easier for an
> > adversary”. Any attack an adversary could potentially utilize
> > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > would be an issue with eBPF in general rather than specifically
> > seccomp’s use of eBPF.
> >
> > Here it is referring to the helpers goes to the base
> > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > helpers to perform an attack, they could do it via another
> > unprivileged prog type.
> >
> > That said, there are a few additional helpers this patchset is adding:
> > * get_current_uid_gid
> > * get_current_pid_tgid
> >   These two provide public information (are namespaces a concern?). I
> > have no idea what kind of exploit it could add unless the adversary
> > somehow side-channels the task_struct? But in that case, how is the
> > reading of task_struct different from how the rest of the kernel is
> > reading task_struct?
>
> Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
> (what ever happened to that?), and it likely has the same problems for
> seccomp.
>
> >>
> >> What is this for?
> >
> > Memory reading opens up lots of use cases. For example, logging what
> > files are being opened without imposing too much performance penalty
> > from strace. Or as an accelerator for user notify emulation, where
> > syscalls can be rejected on a fast path if we know the memory contents
> > does not satisfy certain conditions that user notify will check.
> >
>
> This has all kinds of race conditions.
>
>
> I hate to be a party pooper, but this patchset is going to very high bar
> to acceptance.  Right now, seccomp has a couple of excellent properties:
>
> First, while it has limited expressiveness, it is simple enough that the
> implementation can be easily understood and the scope for
> vulnerabilities that fall through the cracks of the seccomp sandbox
> model is low.  Compare this to Windows' low-integrity/high-integrity
> sandbox system: there is a never ending string of sandbox escapes due to
> token misuse, unexpected things at various integrity levels, etc.
> Seccomp doesn't have tokens or integrity levels, and these bugs don't
> happen.
>
> Second, seccomp works, almost unchanged, in a completely unprivileged
> context.  The last time making eBPF work sensibly in a less- or
> -unprivileged context, the maintainers mostly rejected the idea of
> developing/debugging a permission model for maps, cleaning up the bpf
> object id system, etc.  You are going to have a very hard time
> convincing the seccomp maintainers to let any of these mechanism
> interact with seccomp until the underlying permission model is in place.
>
> --Andy

Thanks for pointing out the tradeoff between expressiveness vs. simplicity.

Note that we are _not_ proposing to replace cbpf, but propose to also
support ebpf filters. There certainly are use cases where cbpf is
sufficient, but there are also important use cases ebpf could make
life much easier.

Most importantly, we strongly believe that ebpf filters can be
supported without reducing security.

No worries about “party pooping” and we appreciate the feedback. We’d
love to hear concerns and collect feedback so we can address them to
hit that very high bar.


~t

--
Tianyin Xu
University of Illinois at Urbana-Champaign
https://tianyin.github.io/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
  2021-05-16  8:38       ` Tianyin Xu
@ 2021-05-17 15:40         ` Tycho Andersen
  2021-05-17 17:07           ` Sargun Dhillon
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Tycho Andersen @ 2021-05-17 15:40 UTC (permalink / raw)
  To: Tianyin Xu
  Cc: Andy Lutomirski, YiFei Zhu, containers, bpf, Zhu, YiFei,
	LSM List, Alexei Starovoitov, Andrea Arcangeli, Kuo, Hsuan-Chi,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jia, Jinghao, Torrellas, Josep, Kees Cook,
	Sargun Dhillon, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
> On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > >>>
> > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > >>>
> > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > >>>
> > >>> This patchset enables seccomp filters to be written in eBPF.
> > >>> Supporting eBPF filters has been proposed a few times in the past.
> > >>> The main concerns were (1) use cases and (2) security. We have
> > >>> identified many use cases that can benefit from advanced eBPF
> > >>> filters, such as:
> > >>
> > >> I haven't reviewed this carefully, but I think we need to distinguish
> > >> a few things:
> > >>
> > >> 1. Using the eBPF *language*.
> > >>
> > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > >>
> > >> 3. Allowing the eBPF programs to read the target process' memory.
> > >>
> > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > >> even less convinced by (3).
> > >>
> > >>>
> > >>>   * exec-only-once filter / apply filter after exec
> > >>
> > >> This is (2).  I'm not sure it's a good idea.
> > >
> > > The basic idea is that for a container runtime it may wait to execute
> > > a program in a container without that program being able to execve
> > > another program, stopping any attack that involves loading another
> > > binary. The container runtime can block any syscall but execve in the
> > > exec-ed process by using only cBPF.
> > >
> > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > something?
> >
> > We've discussed having a notifier-using filter be able to replace its
> > filter.  This would allow this and other use cases without any
> > additional eBPF or cBPF code.
> >
> 
> A notifier is not always a solution (even ignoring its perf overhead).
> 
> One problem, pointed out by Andrea Arcangeli, is that notifiers need
> userspace daemons. So, it can hardly be used by daemonless container
> engines like Podman.

I'm not sure I buy this argument. Podman already has a conmon instance
for each container, this could be a child of that conmon process, or
live inside conmon itself.

Tycho

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
  2021-05-16  8:38       ` Tianyin Xu
@ 2021-05-17 17:07           ` Sargun Dhillon
  2021-05-17 17:07           ` Sargun Dhillon
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Sargun Dhillon @ 2021-05-17 17:07 UTC (permalink / raw)
  To: Tianyin Xu
  Cc: Andy Lutomirski, YiFei Zhu, containers, bpf, Zhu, YiFei,
	LSM List, Alexei Starovoitov, Andrea Arcangeli, Kuo, Hsuan-Chi,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jia, Jinghao, Torrellas, Josep, Kees Cook,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Sun, May 16, 2021 at 1:39 AM Tianyin Xu <tyxu@illinois.edu> wrote:
>
> On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > >>>
> > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > >>>
> > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > >>>
> > >>> This patchset enables seccomp filters to be written in eBPF.
> > >>> Supporting eBPF filters has been proposed a few times in the past.
> > >>> The main concerns were (1) use cases and (2) security. We have
> > >>> identified many use cases that can benefit from advanced eBPF
> > >>> filters, such as:
> > >>
> > >> I haven't reviewed this carefully, but I think we need to distinguish
> > >> a few things:
> > >>
> > >> 1. Using the eBPF *language*.
> > >>
> > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > >>
> > >> 3. Allowing the eBPF programs to read the target process' memory.
> > >>
> > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > >> even less convinced by (3).
> > >>
> > >>>
> > >>>   * exec-only-once filter / apply filter after exec
> > >>
> > >> This is (2).  I'm not sure it's a good idea.
> > >
> > > The basic idea is that for a container runtime it may wait to execute
> > > a program in a container without that program being able to execve
> > > another program, stopping any attack that involves loading another
> > > binary. The container runtime can block any syscall but execve in the
> > > exec-ed process by using only cBPF.
> > >
> > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > something?
> >
> > We've discussed having a notifier-using filter be able to replace its
> > filter.  This would allow this and other use cases without any
> > additional eBPF or cBPF code.
> >
>
> A notifier is not always a solution (even ignoring its perf overhead).
>
> One problem, pointed out by Andrea Arcangeli, is that notifiers need
> userspace daemons. So, it can hardly be used by daemonless container
> engines like Podman.
>
> And, /* sorry for repeating.. */ the performance overhead of notifiers
> is not close to ebpf, which prevents use cases that require native
> performance.

While I agree with you that this is the case right now, there's no reason it
has to be the case. There's a variety of mechanisms that can be employed
to significantly speed up the performance of the notifier. For example, right
now the notifier is behind one large per-filter lock. That could be removed
allowing for better concurrency. There are a large number of mechanisms
that scale O(n) with the outstanding notifications -- again, something
that could be improved.

The other big improvement that could be made is being able to use something
like io_uring with the notifier interface, but it would require a
fairly significant
user API change -- and a move away from ioctl. I'm not sure if people are
excited about that idea at the moment.

>
>
> > >> eBPF doesn't really have a privilege model yet.  There was a long and
> > >> disappointing thread about this awhile back.
> > >
> > > The idea is that “seccomp-eBPF does not make life easier for an
> > > adversary”. Any attack an adversary could potentially utilize
> > > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > > would be an issue with eBPF in general rather than specifically
> > > seccomp’s use of eBPF.
> > >
> > > Here it is referring to the helpers goes to the base
> > > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > > helpers to perform an attack, they could do it via another
> > > unprivileged prog type.
> > >
> > > That said, there are a few additional helpers this patchset is adding:
> > > * get_current_uid_gid
> > > * get_current_pid_tgid
> > >   These two provide public information (are namespaces a concern?). I
> > > have no idea what kind of exploit it could add unless the adversary
> > > somehow side-channels the task_struct? But in that case, how is the
> > > reading of task_struct different from how the rest of the kernel is
> > > reading task_struct?
> >
> > Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
> > (what ever happened to that?), and it likely has the same problems for
> > seccomp.
> >
> > >>
> > >> What is this for?
> > >
> > > Memory reading opens up lots of use cases. For example, logging what
> > > files are being opened without imposing too much performance penalty
> > > from strace. Or as an accelerator for user notify emulation, where
> > > syscalls can be rejected on a fast path if we know the memory contents
> > > does not satisfy certain conditions that user notify will check.
> > >
> >
> > This has all kinds of race conditions.
> >
> >
> > I hate to be a party pooper, but this patchset is going to very high bar
> > to acceptance.  Right now, seccomp has a couple of excellent properties:
> >
> > First, while it has limited expressiveness, it is simple enough that the
> > implementation can be easily understood and the scope for
> > vulnerabilities that fall through the cracks of the seccomp sandbox
> > model is low.  Compare this to Windows' low-integrity/high-integrity
> > sandbox system: there is a never ending string of sandbox escapes due to
> > token misuse, unexpected things at various integrity levels, etc.
> > Seccomp doesn't have tokens or integrity levels, and these bugs don't
> > happen.
> >
> > Second, seccomp works, almost unchanged, in a completely unprivileged
> > context.  The last time making eBPF work sensibly in a less- or
> > -unprivileged context, the maintainers mostly rejected the idea of
> > developing/debugging a permission model for maps, cleaning up the bpf
> > object id system, etc.  You are going to have a very hard time
> > convincing the seccomp maintainers to let any of these mechanism
> > interact with seccomp until the underlying permission model is in place.
> >
> > --Andy
>
> Thanks for pointing out the tradeoff between expressiveness vs. simplicity.
>
> Note that we are _not_ proposing to replace cbpf, but propose to also
> support ebpf filters. There certainly are use cases where cbpf is
> sufficient, but there are also important use cases ebpf could make
> life much easier.
>
> Most importantly, we strongly believe that ebpf filters can be
> supported without reducing security.
>
> No worries about “party pooping” and we appreciate the feedback. We’d
> love to hear concerns and collect feedback so we can address them to
> hit that very high bar.
>
>
> ~t
>
> --
> Tianyin Xu
> University of Illinois at Urbana-Champaign
> https://tianyin.github.io/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
@ 2021-05-17 17:07           ` Sargun Dhillon
  0 siblings, 0 replies; 48+ messages in thread
From: Sargun Dhillon @ 2021-05-17 17:07 UTC (permalink / raw)
  To: Tianyin Xu
  Cc: Andy Lutomirski, YiFei Zhu, containers, bpf, Zhu, YiFei,
	LSM List, Alexei Starovoitov, Andrea Arcangeli, Kuo, Hsuan-Chi,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jia, Jinghao, Torrellas, Josep, Kees Cook,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Sun, May 16, 2021 at 1:39 AM Tianyin Xu <tyxu@illinois.edu> wrote:
>
> On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > >>>
> > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > >>>
> > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > >>>
> > >>> This patchset enables seccomp filters to be written in eBPF.
> > >>> Supporting eBPF filters has been proposed a few times in the past.
> > >>> The main concerns were (1) use cases and (2) security. We have
> > >>> identified many use cases that can benefit from advanced eBPF
> > >>> filters, such as:
> > >>
> > >> I haven't reviewed this carefully, but I think we need to distinguish
> > >> a few things:
> > >>
> > >> 1. Using the eBPF *language*.
> > >>
> > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > >>
> > >> 3. Allowing the eBPF programs to read the target process' memory.
> > >>
> > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > >> even less convinced by (3).
> > >>
> > >>>
> > >>>   * exec-only-once filter / apply filter after exec
> > >>
> > >> This is (2).  I'm not sure it's a good idea.
> > >
> > > The basic idea is that for a container runtime it may wait to execute
> > > a program in a container without that program being able to execve
> > > another program, stopping any attack that involves loading another
> > > binary. The container runtime can block any syscall but execve in the
> > > exec-ed process by using only cBPF.
> > >
> > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > something?
> >
> > We've discussed having a notifier-using filter be able to replace its
> > filter.  This would allow this and other use cases without any
> > additional eBPF or cBPF code.
> >
>
> A notifier is not always a solution (even ignoring its perf overhead).
>
> One problem, pointed out by Andrea Arcangeli, is that notifiers need
> userspace daemons. So, it can hardly be used by daemonless container
> engines like Podman.
>
> And, /* sorry for repeating.. */ the performance overhead of notifiers
> is not close to ebpf, which prevents use cases that require native
> performance.

While I agree with you that this is the case right now, there's no reason it
has to be the case. There's a variety of mechanisms that can be employed
to significantly speed up the performance of the notifier. For example, right
now the notifier is behind one large per-filter lock. That could be removed
allowing for better concurrency. There are a large number of mechanisms
that scale O(n) with the outstanding notifications -- again, something
that could be improved.

The other big improvement that could be made is being able to use something
like io_uring with the notifier interface, but it would require a
fairly significant
user API change -- and a move away from ioctl. I'm not sure if people are
excited about that idea at the moment.

>
>
> > >> eBPF doesn't really have a privilege model yet.  There was a long and
> > >> disappointing thread about this awhile back.
> > >
> > > The idea is that “seccomp-eBPF does not make life easier for an
> > > adversary”. Any attack an adversary could potentially utilize
> > > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > > would be an issue with eBPF in general rather than specifically
> > > seccomp’s use of eBPF.
> > >
> > > Here it is referring to the helpers goes to the base
> > > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > > helpers to perform an attack, they could do it via another
> > > unprivileged prog type.
> > >
> > > That said, there are a few additional helpers this patchset is adding:
> > > * get_current_uid_gid
> > > * get_current_pid_tgid
> > >   These two provide public information (are namespaces a concern?). I
> > > have no idea what kind of exploit it could add unless the adversary
> > > somehow side-channels the task_struct? But in that case, how is the
> > > reading of task_struct different from how the rest of the kernel is
> > > reading task_struct?
> >
> > Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
> > (what ever happened to that?), and it likely has the same problems for
> > seccomp.
> >
> > >>
> > >> What is this for?
> > >
> > > Memory reading opens up lots of use cases. For example, logging what
> > > files are being opened without imposing too much performance penalty
> > > from strace. Or as an accelerator for user notify emulation, where
> > > syscalls can be rejected on a fast path if we know the memory contents
> > > does not satisfy certain conditions that user notify will check.
> > >
> >
> > This has all kinds of race conditions.
> >
> >
> > I hate to be a party pooper, but this patchset is going to very high bar
> > to acceptance.  Right now, seccomp has a couple of excellent properties:
> >
> > First, while it has limited expressiveness, it is simple enough that the
> > implementation can be easily understood and the scope for
> > vulnerabilities that fall through the cracks of the seccomp sandbox
> > model is low.  Compare this to Windows' low-integrity/high-integrity
> > sandbox system: there is a never ending string of sandbox escapes due to
> > token misuse, unexpected things at various integrity levels, etc.
> > Seccomp doesn't have tokens or integrity levels, and these bugs don't
> > happen.
> >
> > Second, seccomp works, almost unchanged, in a completely unprivileged
> > context.  The last time making eBPF work sensibly in a less- or
> > -unprivileged context, the maintainers mostly rejected the idea of
> > developing/debugging a permission model for maps, cleaning up the bpf
> > object id system, etc.  You are going to have a very hard time
> > convincing the seccomp maintainers to let any of these mechanism
> > interact with seccomp until the underlying permission model is in place.
> >
> > --Andy
>
> Thanks for pointing out the tradeoff between expressiveness vs. simplicity.
>
> Note that we are _not_ proposing to replace cbpf, but propose to also
> support ebpf filters. There certainly are use cases where cbpf is
> sufficient, but there are also important use cases ebpf could make
> life much easier.
>
> Most importantly, we strongly believe that ebpf filters can be
> supported without reducing security.
>
> No worries about “party pooping” and we appreciate the feedback. We’d
> love to hear concerns and collect feedback so we can address them to
> hit that very high bar.
>
>
> ~t
>
> --
> Tianyin Xu
> University of Illinois at Urbana-Champaign
> https://tianyin.github.io/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
       [not found]         ` <108b4b9c2daa4123805d2b92cf51374b@DM5PR11MB1692.namprd11.prod.outlook.com>
@ 2021-05-20  8:16           ` Tianyin Xu
  2021-05-20  8:56             ` Christian Brauner
       [not found]             ` <00fe481c572d486289bc88780f48e88f@DM5PR11MB1692.namprd11.prod.outlook.com>
  0 siblings, 2 replies; 48+ messages in thread
From: Tianyin Xu @ 2021-05-20  8:16 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Andy Lutomirski, YiFei Zhu, containers, bpf, Zhu, YiFei,
	LSM List, Alexei Starovoitov, Andrea Arcangeli, Kuo, Hsuan-Chi,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jia, Jinghao, Torrellas, Josep, Kees Cook,
	Sargun Dhillon, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
>
> On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
> > On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > >>>
> > > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > > >>>
> > > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > > >>>
> > > >>> This patchset enables seccomp filters to be written in eBPF.
> > > >>> Supporting eBPF filters has been proposed a few times in the past.
> > > >>> The main concerns were (1) use cases and (2) security. We have
> > > >>> identified many use cases that can benefit from advanced eBPF
> > > >>> filters, such as:
> > > >>
> > > >> I haven't reviewed this carefully, but I think we need to distinguish
> > > >> a few things:
> > > >>
> > > >> 1. Using the eBPF *language*.
> > > >>
> > > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > > >>
> > > >> 3. Allowing the eBPF programs to read the target process' memory.
> > > >>
> > > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > > >> even less convinced by (3).
> > > >>
> > > >>>
> > > >>>   * exec-only-once filter / apply filter after exec
> > > >>
> > > >> This is (2).  I'm not sure it's a good idea.
> > > >
> > > > The basic idea is that for a container runtime it may wait to execute
> > > > a program in a container without that program being able to execve
> > > > another program, stopping any attack that involves loading another
> > > > binary. The container runtime can block any syscall but execve in the
> > > > exec-ed process by using only cBPF.
> > > >
> > > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > > something?
> > >
> > > We've discussed having a notifier-using filter be able to replace its
> > > filter.  This would allow this and other use cases without any
> > > additional eBPF or cBPF code.
> > >
> >
> > A notifier is not always a solution (even ignoring its perf overhead).
> >
> > One problem, pointed out by Andrea Arcangeli, is that notifiers need
> > userspace daemons. So, it can hardly be used by daemonless container
> > engines like Podman.
>
> I'm not sure I buy this argument. Podman already has a conmon instance
> for each container, this could be a child of that conmon process, or
> live inside conmon itself.
>
> Tycho

I checked with Andrea Arcangeli and Giuseppe Scrivano who are working on Podman.

You are right that Podman is not completely daemonless. However, “the
fact it's no entirely daemonless doesn't imply it's a good idea to
make it worse and to add complexity to the background conmon daemon or
to add more daemons.”

TL;DR. User notifiers are surely more flexible, but are also more
expensive and complex to implement, compared with ebpf filters. /*
I’ll reply to Sargun’s performance argument in a separate email */

I'm sure you know Podman well, but let me still move some jade from
Andrea and Giuseppe (all credits on podmon/crun are theirs) to
elaborate the point, for folks cced on the list who are not very
familiar with Podman.

Basically, the current order goes as follows:

         podman -> conmon -> crun -> container_binary
                               \
                                - seccomp done at crun level, not conmon

At runtime, what's left is:

         conmon -> container_binary  /* podman disappears; crun disappears */

So, to go through and use seccomp notify to block `exec`, we can
either start the container_binary with a seccomp agent wrapper, or
bloat the common binary (as pointed out by Tycho).

If we go with the first approach, we will have:

         podman -> conmon -> crun -> seccomp_agent -> container_binary

So, at runtime we'd be left with one more daemon:

        conmon -> seccomp_agent -> container_binary

Apparently, nobody likes one more daemon. So, the proposal from
Giuseppe was/is to use user notifiers as plugins (.so) loaded by
conmon:
https://github.com/containers/conmon/pull/190
https://github.com/containers/crun/pull/438

Now, with the ebpf filter support, one can implement the same thing
using an embarrassingly simple ebpf filter and, thanks to Giuseppe,
this is well supported by crun.

-- 
Tianyin Xu
University of Illinois at Urbana-Champaign
https://tianyin.github.io/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
       [not found]         ` <eae2a0e5038b41c4af87edcb3d4cdc13@DM5PR11MB1692.namprd11.prod.outlook.com>
@ 2021-05-20  8:22           ` Tianyin Xu
  2021-05-24 18:55               ` Sargun Dhillon
  0 siblings, 1 reply; 48+ messages in thread
From: Tianyin Xu @ 2021-05-20  8:22 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Andy Lutomirski, YiFei Zhu, containers, bpf, Zhu, YiFei,
	LSM List, Alexei Starovoitov, Andrea Arcangeli, Kuo, Hsuan-Chi,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jia, Jinghao, Torrellas, Josep, Kees Cook,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Mon, May 17, 2021 at 12:08 PM Sargun Dhillon <sargun@sargun.me> wrote:
>
> On Sun, May 16, 2021 at 1:39 AM Tianyin Xu <tyxu@illinois.edu> wrote:
> >
> > On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > >>>
> > > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > > >>>
> > > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > > >>>
> > > >>> This patchset enables seccomp filters to be written in eBPF.
> > > >>> Supporting eBPF filters has been proposed a few times in the past.
> > > >>> The main concerns were (1) use cases and (2) security. We have
> > > >>> identified many use cases that can benefit from advanced eBPF
> > > >>> filters, such as:
> > > >>
> > > >> I haven't reviewed this carefully, but I think we need to distinguish
> > > >> a few things:
> > > >>
> > > >> 1. Using the eBPF *language*.
> > > >>
> > > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > > >>
> > > >> 3. Allowing the eBPF programs to read the target process' memory.
> > > >>
> > > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > > >> even less convinced by (3).
> > > >>
> > > >>>
> > > >>>   * exec-only-once filter / apply filter after exec
> > > >>
> > > >> This is (2).  I'm not sure it's a good idea.
> > > >
> > > > The basic idea is that for a container runtime it may wait to execute
> > > > a program in a container without that program being able to execve
> > > > another program, stopping any attack that involves loading another
> > > > binary. The container runtime can block any syscall but execve in the
> > > > exec-ed process by using only cBPF.
> > > >
> > > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > > something?
> > >
> > > We've discussed having a notifier-using filter be able to replace its
> > > filter.  This would allow this and other use cases without any
> > > additional eBPF or cBPF code.
> > >
> >
> > A notifier is not always a solution (even ignoring its perf overhead).
> >
> > One problem, pointed out by Andrea Arcangeli, is that notifiers need
> > userspace daemons. So, it can hardly be used by daemonless container
> > engines like Podman.
> >
> > And, /* sorry for repeating.. */ the performance overhead of notifiers
> > is not close to ebpf, which prevents use cases that require native
> > performance.
>
> While I agree with you that this is the case right now, there's no reason it
> has to be the case. There's a variety of mechanisms that can be employed
> to significantly speed up the performance of the notifier. For example, right
> now the notifier is behind one large per-filter lock. That could be removed
> allowing for better concurrency. There are a large number of mechanisms
> that scale O(n) with the outstanding notifications -- again, something
> that could be improved.

Thanks for the pointer! But, I don’t think this can fundamentally
eliminate the performance gap between the notifiers and the ebpf
filters. IMHO, the additional context switches of user notifiers make
the difference.

>
> The other big improvement that could be made is being able to use something
> like io_uring with the notifier interface, but it would require a
> fairly significant
> user API change -- and a move away from ioctl. I'm not sure if people are
> excited about that idea at the moment.
>

Apologize that I don’t fully understand your proposal. My
understanding about io_uring is that it allows you to amortize the
cost of context switch but not eliminate it, unless you are willing to
dedicate a core for it. I still believe that, even with io_uring, user
notifiers are going to be much slower than eBPF filters.

Btw, our patches are based on your patch set (thank you!). Are you
using user notifiers (with your improved version?) these days? It will
be nice to hear your opinions on ebpf filters.

> >
> >
> > > >> eBPF doesn't really have a privilege model yet.  There was a long and
> > > >> disappointing thread about this awhile back.
> > > >
> > > > The idea is that “seccomp-eBPF does not make life easier for an
> > > > adversary”. Any attack an adversary could potentially utilize
> > > > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > > > would be an issue with eBPF in general rather than specifically
> > > > seccomp’s use of eBPF.
> > > >
> > > > Here it is referring to the helpers goes to the base
> > > > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > > > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > > > helpers to perform an attack, they could do it via another
> > > > unprivileged prog type.
> > > >
> > > > That said, there are a few additional helpers this patchset is adding:
> > > > * get_current_uid_gid
> > > > * get_current_pid_tgid
> > > >   These two provide public information (are namespaces a concern?). I
> > > > have no idea what kind of exploit it could add unless the adversary
> > > > somehow side-channels the task_struct? But in that case, how is the
> > > > reading of task_struct different from how the rest of the kernel is
> > > > reading task_struct?
> > >
> > > Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
> > > (what ever happened to that?), and it likely has the same problems for
> > > seccomp.
> > >
> > > >>
> > > >> What is this for?
> > > >
> > > > Memory reading opens up lots of use cases. For example, logging what
> > > > files are being opened without imposing too much performance penalty
> > > > from strace. Or as an accelerator for user notify emulation, where
> > > > syscalls can be rejected on a fast path if we know the memory contents
> > > > does not satisfy certain conditions that user notify will check.
> > > >
> > >
> > > This has all kinds of race conditions.
> > >
> > >
> > > I hate to be a party pooper, but this patchset is going to very high bar
> > > to acceptance.  Right now, seccomp has a couple of excellent properties:
> > >
> > > First, while it has limited expressiveness, it is simple enough that the
> > > implementation can be easily understood and the scope for
> > > vulnerabilities that fall through the cracks of the seccomp sandbox
> > > model is low.  Compare this to Windows' low-integrity/high-integrity
> > > sandbox system: there is a never ending string of sandbox escapes due to
> > > token misuse, unexpected things at various integrity levels, etc.
> > > Seccomp doesn't have tokens or integrity levels, and these bugs don't
> > > happen.
> > >
> > > Second, seccomp works, almost unchanged, in a completely unprivileged
> > > context.  The last time making eBPF work sensibly in a less- or
> > > -unprivileged context, the maintainers mostly rejected the idea of
> > > developing/debugging a permission model for maps, cleaning up the bpf
> > > object id system, etc.  You are going to have a very hard time
> > > convincing the seccomp maintainers to let any of these mechanism
> > > interact with seccomp until the underlying permission model is in place.
> > >
> > > --Andy
> >
> > Thanks for pointing out the tradeoff between expressiveness vs. simplicity.
> >
> > Note that we are _not_ proposing to replace cbpf, but propose to also
> > support ebpf filters. There certainly are use cases where cbpf is
> > sufficient, but there are also important use cases ebpf could make
> > life much easier.
> >
> > Most importantly, we strongly believe that ebpf filters can be
> > supported without reducing security.
> >
> > No worries about “party pooping” and we appreciate the feedback. We’d
> > love to hear concerns and collect feedback so we can address them to
> > hit that very high bar.
> >
> >
> > ~t
> >
> > --
> > Tianyin Xu
> > University of Illinois at Urbana-Champaign
> > https://urldefense.com/v3/__https://tianyin.github.io/__;!!DZ3fjg!o4__Ob32oapUDg9_f6hzksoFiX9517CJ5-w8qtG9i-WKFs_xWbGQfUHpLjHjCddw$

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
  2021-05-20  8:16           ` Tianyin Xu
@ 2021-05-20  8:56             ` Christian Brauner
  2021-05-20  9:37               ` Christian Brauner
                                 ` (2 more replies)
       [not found]             ` <00fe481c572d486289bc88780f48e88f@DM5PR11MB1692.namprd11.prod.outlook.com>
  1 sibling, 3 replies; 48+ messages in thread
From: Christian Brauner @ 2021-05-20  8:56 UTC (permalink / raw)
  To: Tianyin Xu
  Cc: Tycho Andersen, Andy Lutomirski, YiFei Zhu, containers, bpf, Zhu,
	YiFei, LSM List, Alexei Starovoitov, Andrea Arcangeli, Kuo,
	Hsuan-Chi, Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jia, Jinghao, Torrellas, Josep, Kees Cook,
	Sargun Dhillon, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
> On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> >
> > On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
> > > On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > >
> > > > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > > >>>
> > > > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > > > >>>
> > > > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > > > >>>
> > > > >>> This patchset enables seccomp filters to be written in eBPF.
> > > > >>> Supporting eBPF filters has been proposed a few times in the past.
> > > > >>> The main concerns were (1) use cases and (2) security. We have
> > > > >>> identified many use cases that can benefit from advanced eBPF
> > > > >>> filters, such as:
> > > > >>
> > > > >> I haven't reviewed this carefully, but I think we need to distinguish
> > > > >> a few things:
> > > > >>
> > > > >> 1. Using the eBPF *language*.
> > > > >>
> > > > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > > > >>
> > > > >> 3. Allowing the eBPF programs to read the target process' memory.
> > > > >>
> > > > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > > > >> even less convinced by (3).
> > > > >>
> > > > >>>
> > > > >>>   * exec-only-once filter / apply filter after exec
> > > > >>
> > > > >> This is (2).  I'm not sure it's a good idea.
> > > > >
> > > > > The basic idea is that for a container runtime it may wait to execute
> > > > > a program in a container without that program being able to execve
> > > > > another program, stopping any attack that involves loading another
> > > > > binary. The container runtime can block any syscall but execve in the
> > > > > exec-ed process by using only cBPF.
> > > > >
> > > > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > > > something?
> > > >
> > > > We've discussed having a notifier-using filter be able to replace its
> > > > filter.  This would allow this and other use cases without any
> > > > additional eBPF or cBPF code.
> > > >
> > >
> > > A notifier is not always a solution (even ignoring its perf overhead).
> > >
> > > One problem, pointed out by Andrea Arcangeli, is that notifiers need
> > > userspace daemons. So, it can hardly be used by daemonless container
> > > engines like Podman.
> >
> > I'm not sure I buy this argument. Podman already has a conmon instance
> > for each container, this could be a child of that conmon process, or
> > live inside conmon itself.
> >
> > Tycho
> 
> I checked with Andrea Arcangeli and Giuseppe Scrivano who are working on Podman.
> 
> You are right that Podman is not completely daemonless. However, “the
> fact it's no entirely daemonless doesn't imply it's a good idea to
> make it worse and to add complexity to the background conmon daemon or
> to add more daemons.”
> 
> TL;DR. User notifiers are surely more flexible, but are also more
> expensive and complex to implement, compared with ebpf filters. /*
> I’ll reply to Sargun’s performance argument in a separate email */
> 
> I'm sure you know Podman well, but let me still move some jade from
> Andrea and Giuseppe (all credits on podmon/crun are theirs) to
> elaborate the point, for folks cced on the list who are not very
> familiar with Podman.
> 
> Basically, the current order goes as follows:
> 
>          podman -> conmon -> crun -> container_binary
>                                \
>                                 - seccomp done at crun level, not conmon
> 
> At runtime, what's left is:
> 
>          conmon -> container_binary  /* podman disappears; crun disappears */
> 
> So, to go through and use seccomp notify to block `exec`, we can
> either start the container_binary with a seccomp agent wrapper, or
> bloat the common binary (as pointed out by Tycho).
> 
> If we go with the first approach, we will have:
> 
>          podman -> conmon -> crun -> seccomp_agent -> container_binary
> 
> So, at runtime we'd be left with one more daemon:
> 
>         conmon -> seccomp_agent -> container_binary

That seems like a strawman. I don't see why this has to be out of
process or a separate daemon. Conmon uses a regular event loop. Adding
support for processing notifier syscall notifications is
straightforward. Moving it to a plugin as you mentioned below is a
design decision not a necessity.

> 
> Apparently, nobody likes one more daemon. So, the proposal from

I'm not sure such a blanket statements about an indeterminate group of
people's alleged preferences constitutes a technical argument wny we
need ebpf in seccomp.

> Giuseppe was/is to use user notifiers as plugins (.so) loaded by
> conmon:
> https://github.com/containers/conmon/pull/190
> https://github.com/containers/crun/pull/438
> 
> Now, with the ebpf filter support, one can implement the same thing
> using an embarrassingly simple ebpf filter and, thanks to Giuseppe,
> this is well supported by crun.

So I think this is trying to jump the gun by saying "Look, the result
might be simpler.". That may even be the case - though I'm not yet
convinced - but Andy's point stands that this brings a slew of issues on
the table that need clear answers. Bringing stateful ebpf features into
seccomp is a pretty big step and especially around the
privilege/security model it looks pretty handwavy right now.

Christian

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
  2021-05-15 15:49     ` Andy Lutomirski
@ 2021-05-20  9:05       ` Christian Brauner
  0 siblings, 0 replies; 48+ messages in thread
From: Christian Brauner @ 2021-05-20  9:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: YiFei Zhu, containers, bpf, YiFei Zhu, LSM List,
	Alexei Starovoitov, Andrea Arcangeli, Austin Kuo,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jinghao Jia, Josep Torrellas, Kees Cook,
	Sargun Dhillon, Tianyin Xu, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Sat, May 15, 2021 at 08:49:01AM -0700, Andy Lutomirski wrote:
> On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >>>
> >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> >>>
> >>> Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html
> >>>
> >>> This patchset enables seccomp filters to be written in eBPF.
> >>> Supporting eBPF filters has been proposed a few times in the past.
> >>> The main concerns were (1) use cases and (2) security. We have
> >>> identified many use cases that can benefit from advanced eBPF
> >>> filters, such as:
> >>
> >> I haven't reviewed this carefully, but I think we need to distinguish
> >> a few things:
> >>
> >> 1. Using the eBPF *language*.
> >>
> >> 2. Allowing the use of stateful / non-pure eBPF features.
> >>
> >> 3. Allowing the eBPF programs to read the target process' memory.
> >>
> >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> >> even less convinced by (3).
> >>
> >>>
> >>>   * exec-only-once filter / apply filter after exec
> >>
> >> This is (2).  I'm not sure it's a good idea.
> > 
> > The basic idea is that for a container runtime it may wait to execute
> > a program in a container without that program being able to execve
> > another program, stopping any attack that involves loading another
> > binary. The container runtime can block any syscall but execve in the
> > exec-ed process by using only cBPF.
> > 
> > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > @Andrea and @Giuseppe, could you clarify more in case I missed
> > something?
> 
> We've discussed having a notifier-using filter be able to replace its
> filter.  This would allow this and other use cases without any
> additional eBPF or cBPF code.

Are you referring to sm like I sketched in
https://lore.kernel.org/containers/20210301110907.2qoxmiy55gpkgwnq@wittgenstein/
?

> 
> >> eBPF doesn't really have a privilege model yet.  There was a long and
> >> disappointing thread about this awhile back.
> > 
> > The idea is that “seccomp-eBPF does not make life easier for an
> > adversary”. Any attack an adversary could potentially utilize
> > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > would be an issue with eBPF in general rather than specifically
> > seccomp’s use of eBPF.
> > 
> > Here it is referring to the helpers goes to the base
> > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > helpers to perform an attack, they could do it via another
> > unprivileged prog type.
> > 
> > That said, there are a few additional helpers this patchset is adding:
> > * get_current_uid_gid
> > * get_current_pid_tgid
> >   These two provide public information (are namespaces a concern?). I

If they are seen from userspace in any way then these must be resolved
relative to the caller's userns or caller's pidns. So yes, namespaces
need to be taken into account.

> > have no idea what kind of exploit it could add unless the adversary
> > somehow side-channels the task_struct? But in that case, how is the
> > reading of task_struct different from how the rest of the kernel is
> > reading task_struct?
> 
> Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
> (what ever happened to that?), and it likely has the same problems for
> seccomp.
> 
> >>
> >> What is this for?
> > 
> > Memory reading opens up lots of use cases. For example, logging what
> > files are being opened without imposing too much performance penalty
> > from strace. Or as an accelerator for user notify emulation, where
> > syscalls can be rejected on a fast path if we know the memory contents
> > does not satisfy certain conditions that user notify will check.
> > 
> 
> This has all kinds of race conditions.
> 
> 
> I hate to be a party pooper, but this patchset is going to very high bar
> to acceptance.  Right now, seccomp has a couple of excellent properties:
> 
> First, while it has limited expressiveness, it is simple enough that the
> implementation can be easily understood and the scope for
> vulnerabilities that fall through the cracks of the seccomp sandbox
> model is low.  Compare this to Windows' low-integrity/high-integrity
> sandbox system: there is a never ending string of sandbox escapes due to
> token misuse, unexpected things at various integrity levels, etc.
> Seccomp doesn't have tokens or integrity levels, and these bugs don't
> happen.
> 
> Second, seccomp works, almost unchanged, in a completely unprivileged
> context.  The last time making eBPF work sensibly in a less- or

Yeah, which is pretty important.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
  2021-05-20  8:56             ` Christian Brauner
@ 2021-05-20  9:37               ` Christian Brauner
  2021-06-01 19:55               ` Kees Cook
  2021-06-09  6:27               ` Jinghao Jia
  2 siblings, 0 replies; 48+ messages in thread
From: Christian Brauner @ 2021-05-20  9:37 UTC (permalink / raw)
  To: Tianyin Xu
  Cc: Tycho Andersen, Andy Lutomirski, YiFei Zhu, containers, bpf, Zhu,
	YiFei, LSM List, Alexei Starovoitov, Andrea Arcangeli, Kuo,
	Hsuan-Chi, Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jia, Jinghao, Torrellas, Josep, Kees Cook,
	Sargun Dhillon, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Thu, May 20, 2021 at 10:56:13AM +0200, Christian Brauner wrote:
> On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
> > On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > >
> > > On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
> > > > On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > > >
> > > > > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > > > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > > > >>>
> > > > > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > > > > >>>
> > > > > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > > > > >>>
> > > > > >>> This patchset enables seccomp filters to be written in eBPF.
> > > > > >>> Supporting eBPF filters has been proposed a few times in the past.
> > > > > >>> The main concerns were (1) use cases and (2) security. We have
> > > > > >>> identified many use cases that can benefit from advanced eBPF
> > > > > >>> filters, such as:
> > > > > >>
> > > > > >> I haven't reviewed this carefully, but I think we need to distinguish
> > > > > >> a few things:
> > > > > >>
> > > > > >> 1. Using the eBPF *language*.
> > > > > >>
> > > > > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > > > > >>
> > > > > >> 3. Allowing the eBPF programs to read the target process' memory.
> > > > > >>
> > > > > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > > > > >> even less convinced by (3).
> > > > > >>
> > > > > >>>
> > > > > >>>   * exec-only-once filter / apply filter after exec
> > > > > >>
> > > > > >> This is (2).  I'm not sure it's a good idea.
> > > > > >
> > > > > > The basic idea is that for a container runtime it may wait to execute
> > > > > > a program in a container without that program being able to execve
> > > > > > another program, stopping any attack that involves loading another
> > > > > > binary. The container runtime can block any syscall but execve in the
> > > > > > exec-ed process by using only cBPF.
> > > > > >
> > > > > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > > > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > > > > something?
> > > > >
> > > > > We've discussed having a notifier-using filter be able to replace its
> > > > > filter.  This would allow this and other use cases without any
> > > > > additional eBPF or cBPF code.
> > > > >
> > > >
> > > > A notifier is not always a solution (even ignoring its perf overhead).
> > > >
> > > > One problem, pointed out by Andrea Arcangeli, is that notifiers need
> > > > userspace daemons. So, it can hardly be used by daemonless container
> > > > engines like Podman.
> > >
> > > I'm not sure I buy this argument. Podman already has a conmon instance
> > > for each container, this could be a child of that conmon process, or
> > > live inside conmon itself.
> > >
> > > Tycho
> > 
> > I checked with Andrea Arcangeli and Giuseppe Scrivano who are working on Podman.
> > 
> > You are right that Podman is not completely daemonless. However, “the
> > fact it's no entirely daemonless doesn't imply it's a good idea to
> > make it worse and to add complexity to the background conmon daemon or
> > to add more daemons.”
> > 
> > TL;DR. User notifiers are surely more flexible, but are also more
> > expensive and complex to implement, compared with ebpf filters. /*
> > I’ll reply to Sargun’s performance argument in a separate email */
> > 
> > I'm sure you know Podman well, but let me still move some jade from
> > Andrea and Giuseppe (all credits on podmon/crun are theirs) to
> > elaborate the point, for folks cced on the list who are not very
> > familiar with Podman.
> > 
> > Basically, the current order goes as follows:
> > 
> >          podman -> conmon -> crun -> container_binary
> >                                \
> >                                 - seccomp done at crun level, not conmon
> > 
> > At runtime, what's left is:
> > 
> >          conmon -> container_binary  /* podman disappears; crun disappears */
> > 
> > So, to go through and use seccomp notify to block `exec`, we can
> > either start the container_binary with a seccomp agent wrapper, or
> > bloat the common binary (as pointed out by Tycho).
> > 
> > If we go with the first approach, we will have:
> > 
> >          podman -> conmon -> crun -> seccomp_agent -> container_binary
> > 
> > So, at runtime we'd be left with one more daemon:
> > 
> >         conmon -> seccomp_agent -> container_binary
> 
> That seems like a strawman. I don't see why this has to be out of
> process or a separate daemon. Conmon uses a regular event loop. Adding
> support for processing notifier syscall notifications is
> straightforward. Moving it to a plugin as you mentioned below is a
> design decision not a necessity.
> 
> > 
> > Apparently, nobody likes one more daemon. So, the proposal from
> 
> I'm not sure such a blanket statements about an indeterminate group of
> people's alleged preferences constitutes a technical argument wny we
> need ebpf in seccomp.

I was missing a :) here.

> 
> > Giuseppe was/is to use user notifiers as plugins (.so) loaded by
> > conmon:
> > https://github.com/containers/conmon/pull/190
> > https://github.com/containers/crun/pull/438
> > 
> > Now, with the ebpf filter support, one can implement the same thing
> > using an embarrassingly simple ebpf filter and, thanks to Giuseppe,
> > this is well supported by crun.
> 
> So I think this is trying to jump the gun by saying "Look, the result
> might be simpler.". That may even be the case - though I'm not yet
> convinced - but Andy's point stands that this brings a slew of issues on
> the table that need clear answers. Bringing stateful ebpf features into
> seccomp is a pretty big step and especially around the
> privilege/security model it looks pretty handwavy right now.
> 
> Christian
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
       [not found]             ` <00fe481c572d486289bc88780f48e88f@DM5PR11MB1692.namprd11.prod.outlook.com>
@ 2021-05-20 22:13               ` Tianyin Xu
  0 siblings, 0 replies; 48+ messages in thread
From: Tianyin Xu @ 2021-05-20 22:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tycho Andersen, Andy Lutomirski, YiFei Zhu, containers, bpf, Zhu,
	YiFei, LSM List, Alexei Starovoitov, Andrea Arcangeli, Kuo,
	Hsuan-Chi, Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jia, Jinghao, Torrellas, Josep, Kees Cook,
	Sargun Dhillon, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry

On Thu, May 20, 2021 at 3:56 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
> > On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > >
> > > On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
> > > > On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > > >
> > > > > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > > > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > > > >>>
> > > > > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > > > > >>>
> > > > > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > > > > >>>
> > > > > >>> This patchset enables seccomp filters to be written in eBPF.
> > > > > >>> Supporting eBPF filters has been proposed a few times in the past.
> > > > > >>> The main concerns were (1) use cases and (2) security. We have
> > > > > >>> identified many use cases that can benefit from advanced eBPF
> > > > > >>> filters, such as:
> > > > > >>
> > > > > >> I haven't reviewed this carefully, but I think we need to distinguish
> > > > > >> a few things:
> > > > > >>
> > > > > >> 1. Using the eBPF *language*.
> > > > > >>
> > > > > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > > > > >>
> > > > > >> 3. Allowing the eBPF programs to read the target process' memory.
> > > > > >>
> > > > > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > > > > >> even less convinced by (3).
> > > > > >>
> > > > > >>>
> > > > > >>>   * exec-only-once filter / apply filter after exec
> > > > > >>
> > > > > >> This is (2).  I'm not sure it's a good idea.
> > > > > >
> > > > > > The basic idea is that for a container runtime it may wait to execute
> > > > > > a program in a container without that program being able to execve
> > > > > > another program, stopping any attack that involves loading another
> > > > > > binary. The container runtime can block any syscall but execve in the
> > > > > > exec-ed process by using only cBPF.
> > > > > >
> > > > > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > > > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > > > > something?
> > > > >
> > > > > We've discussed having a notifier-using filter be able to replace its
> > > > > filter.  This would allow this and other use cases without any
> > > > > additional eBPF or cBPF code.
> > > > >
> > > >
> > > > A notifier is not always a solution (even ignoring its perf overhead).
> > > >
> > > > One problem, pointed out by Andrea Arcangeli, is that notifiers need
> > > > userspace daemons. So, it can hardly be used by daemonless container
> > > > engines like Podman.
> > >
> > > I'm not sure I buy this argument. Podman already has a conmon instance
> > > for each container, this could be a child of that conmon process, or
> > > live inside conmon itself.
> > >
> > > Tycho
> >
> > I checked with Andrea Arcangeli and Giuseppe Scrivano who are working on Podman.
> >
> > You are right that Podman is not completely daemonless. However, “the
> > fact it's no entirely daemonless doesn't imply it's a good idea to
> > make it worse and to add complexity to the background conmon daemon or
> > to add more daemons.”
> >
> > TL;DR. User notifiers are surely more flexible, but are also more
> > expensive and complex to implement, compared with ebpf filters. /*
> > I’ll reply to Sargun’s performance argument in a separate email */
> >
> > I'm sure you know Podman well, but let me still move some jade from
> > Andrea and Giuseppe (all credits on podmon/crun are theirs) to
> > elaborate the point, for folks cced on the list who are not very
> > familiar with Podman.
> >
> > Basically, the current order goes as follows:
> >
> >          podman -> conmon -> crun -> container_binary
> >                                \
> >                                 - seccomp done at crun level, not conmon
> >
> > At runtime, what's left is:
> >
> >          conmon -> container_binary  /* podman disappears; crun disappears */
> >
> > So, to go through and use seccomp notify to block `exec`, we can
> > either start the container_binary with a seccomp agent wrapper, or
> > bloat the common binary (as pointed out by Tycho).
> >
> > If we go with the first approach, we will have:
> >
> >          podman -> conmon -> crun -> seccomp_agent -> container_binary
> >
> > So, at runtime we'd be left with one more daemon:
> >
> >         conmon -> seccomp_agent -> container_binary
>
> That seems like a strawman. I don't see why this has to be out of
> process or a separate daemon. Conmon uses a regular event loop. Adding
> support for processing notifier syscall notifications is
> straightforward. Moving it to a plugin as you mentioned below is a
> design decision not a necessity.
>
> >
> > Apparently, nobody likes one more daemon. So, the proposal from
>
> I'm not sure such a blanket statements about an indeterminate group of
> people's alleged preferences constitutes a technical argument wny we
> need ebpf in seccomp.
>
> > Giuseppe was/is to use user notifiers as plugins (.so) loaded by
> > conmon:
> > https://urldefense.com/v3/__https://github.com/containers/conmon/pull/190__;!!DZ3fjg!qFZ7PXfFe7eI1Bye9J8zsGOxTQQlfL-pBh0D7Arn1YZKevtEpA9uxKqMTP9kA5RJ$
> > https://urldefense.com/v3/__https://github.com/containers/crun/pull/438__;!!DZ3fjg!qFZ7PXfFe7eI1Bye9J8zsGOxTQQlfL-pBh0D7Arn1YZKevtEpA9uxKqMTJrKzhUD$
> >
> > Now, with the ebpf filter support, one can implement the same thing
> > using an embarrassingly simple ebpf filter and, thanks to Giuseppe,
> > this is well supported by crun.
>
> So I think this is trying to jump the gun by saying "Look, the result
> might be simpler.". That may even be the case - though I'm not yet
> convinced - but Andy's point stands that this brings a slew of issues on
> the table that need clear answers. Bringing stateful ebpf features into
> seccomp is a pretty big step and especially around the
> privilege/security model it looks pretty handwavy right now.
>
> Christian

If an alleged gunshot was the impression I left, I apologize.
Seriously, I have great respect for user notifiers -- my intention was
never to disregard it, or to argue that ebpf filters are always
strictly better.

On the other hand, I do believe (and tried to show) ebpf filters have
their own technical advantages, and can be very useful and efficient
in many use cases. Let me know if you don’t buy this.

It’s kinda weird that we are arguing against ebpf filters with user
notifiers (it’s analogous to "we don’t need Seccomp coz we have
ptrace…")

More importantly, I do really want to provide clear answers to the
privilege/security model, but I don’t precisely know what exactly
you’re referring to as "privilege/security model". Are you referring
to the one-way transition model of Seccomp (which may no longer be
held in stateful filters), or something different? It will be great if
you can clarify so we can answer explicitly.

Thanks!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
  2021-05-20  8:22           ` Tianyin Xu
@ 2021-05-24 18:55               ` Sargun Dhillon
  0 siblings, 0 replies; 48+ messages in thread
From: Sargun Dhillon @ 2021-05-24 18:55 UTC (permalink / raw)
  To: Tianyin Xu
  Cc: Andy Lutomirski, YiFei Zhu, containers, bpf, Zhu, YiFei,
	LSM List, Alexei Starovoitov, Andrea Arcangeli, Kuo, Hsuan-Chi,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jia, Jinghao, Torrellas, Josep, Kees Cook,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Thu, May 20, 2021 at 1:22 AM Tianyin Xu <tyxu@illinois.edu> wrote:
>
> On Mon, May 17, 2021 at 12:08 PM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > While I agree with you that this is the case right now, there's no reason it
> > has to be the case. There's a variety of mechanisms that can be employed
> > to significantly speed up the performance of the notifier. For example, right
> > now the notifier is behind one large per-filter lock. That could be removed
> > allowing for better concurrency. There are a large number of mechanisms
> > that scale O(n) with the outstanding notifications -- again, something
> > that could be improved.
>
> Thanks for the pointer! But, I don’t think this can fundamentally
> eliminate the performance gap between the notifiers and the ebpf
> filters. IMHO, the additional context switches of user notifiers make
> the difference.
>
I mean, I still think it can be closed. Or at least get better. I've
thought about
working on performance improvements, but they're lower on the list
than functionality changes.

> >
> > The other big improvement that could be made is being able to use something
> > like io_uring with the notifier interface, but it would require a
> > fairly significant
> > user API change -- and a move away from ioctl. I'm not sure if people are
> > excited about that idea at the moment.
> >
>
> Apologize that I don’t fully understand your proposal. My
> understanding about io_uring is that it allows you to amortize the
> cost of context switch but not eliminate it, unless you are willing to
> dedicate a core for it. I still believe that, even with io_uring, user
> notifiers are going to be much slower than eBPF filters.
The notifier gets significantly slower as a function of the notifications. If
you have a large number of notifications in flight, or if you're trying to
concurrently handle a large number of notifications, it gets slower. This
is where something like io_uring is super useful in terms of reducing
wakeups.

Also, in the original futex2 patches, it had a mechanism to better handle
(scheduling) of notifier like cases[1]. If the seccomp notifier did a similar
thing, we could see better performance.

>
> Btw, our patches are based on your patch set (thank you!). Are you
> using user notifiers (with your improved version?) these days? It will
> be nice to hear your opinions on ebpf filters.
>
I'm so glad that someone is picking up the work on this.

> > >
> > >
> > > > >> eBPF doesn't really have a privilege model yet.  There was a long and
> > > > >> disappointing thread about this awhile back.
> > > > >
> > > > > The idea is that “seccomp-eBPF does not make life easier for an
> > > > > adversary”. Any attack an adversary could potentially utilize
> > > > > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > > > > would be an issue with eBPF in general rather than specifically
> > > > > seccomp’s use of eBPF.
> > > > >
> > > > > Here it is referring to the helpers goes to the base
> > > > > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > > > > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > > > > helpers to perform an attack, they could do it via another
> > > > > unprivileged prog type.
> > > > >
> > > > > That said, there are a few additional helpers this patchset is adding:
> > > > > * get_current_uid_gid
> > > > > * get_current_pid_tgid
> > > > >   These two provide public information (are namespaces a concern?). I
> > > > > have no idea what kind of exploit it could add unless the adversary
> > > > > somehow side-channels the task_struct? But in that case, how is the
> > > > > reading of task_struct different from how the rest of the kernel is
> > > > > reading task_struct?
> > > >
> > > > Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
> > > > (what ever happened to that?), and it likely has the same problems for
> > > > seccomp.
> > > >
So, we actually have a case where we want to inspect an argument --
We want to look at the FD number that's passed to the sendmsg syscall, and then
see if that's an AF_INET socket, and if it is, then pass back to
notifier, otherwise
allow it to continue through. This is an area where I can see eBPF being
very useful.

> > > > >>
> > > > >> What is this for?
> > > > >
> > > > > Memory reading opens up lots of use cases. For example, logging what
> > > > > files are being opened without imposing too much performance penalty
> > > > > from strace. Or as an accelerator for user notify emulation, where
> > > > > syscalls can be rejected on a fast path if we know the memory contents
> > > > > does not satisfy certain conditions that user notify will check.
> > > > >
> > > >
> > > > This has all kinds of race conditions.
> > > >
> > > >
> > > > I hate to be a party pooper, but this patchset is going to very high bar
> > > > to acceptance.  Right now, seccomp has a couple of excellent properties:
> > > >
> > > > First, while it has limited expressiveness, it is simple enough that the
> > > > implementation can be easily understood and the scope for
> > > > vulnerabilities that fall through the cracks of the seccomp sandbox
> > > > model is low.  Compare this to Windows' low-integrity/high-integrity
> > > > sandbox system: there is a never ending string of sandbox escapes due to
> > > > token misuse, unexpected things at various integrity levels, etc.
> > > > Seccomp doesn't have tokens or integrity levels, and these bugs don't
> > > > happen.
> > > >
> > > > Second, seccomp works, almost unchanged, in a completely unprivileged
> > > > context.  The last time making eBPF work sensibly in a less- or
> > > > -unprivileged context, the maintainers mostly rejected the idea of
> > > > developing/debugging a permission model for maps, cleaning up the bpf
> > > > object id system, etc.  You are going to have a very hard time
> > > > convincing the seccomp maintainers to let any of these mechanism
> > > > interact with seccomp until the underlying permission model is in place.
> > > >
> > > > --Andy
> > >
> > > Thanks for pointing out the tradeoff between expressiveness vs. simplicity.
> > >
> > > Note that we are _not_ proposing to replace cbpf, but propose to also
> > > support ebpf filters. There certainly are use cases where cbpf is
> > > sufficient, but there are also important use cases ebpf could make
> > > life much easier.
> > >
> > > Most importantly, we strongly believe that ebpf filters can be
> > > supported without reducing security.
> > >
> > > No worries about “party pooping” and we appreciate the feedback. We’d
> > > love to hear concerns and collect feedback so we can address them to
> > > hit that very high bar.
> > >
> > >
> > > ~t
> > >
> > > --
> > > Tianyin Xu
> > > University of Illinois at Urbana-Champaign
> > > https://urldefense.com/v3/__https://tianyin.github.io/__;!!DZ3fjg!o4__Ob32oapUDg9_f6hzksoFiX9517CJ5-w8qtG9i-WKFs_xWbGQfUHpLjHjCddw$
>

[1]: https://lore.kernel.org/lkml/20210215152404.250281-1-andrealmeid@collabora.com/T/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
@ 2021-05-24 18:55               ` Sargun Dhillon
  0 siblings, 0 replies; 48+ messages in thread
From: Sargun Dhillon @ 2021-05-24 18:55 UTC (permalink / raw)
  To: Tianyin Xu
  Cc: Andy Lutomirski, YiFei Zhu, containers, bpf, Zhu, YiFei,
	LSM List, Alexei Starovoitov, Andrea Arcangeli, Kuo, Hsuan-Chi,
	Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Jia, Jinghao, Torrellas, Josep, Kees Cook,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry

On Thu, May 20, 2021 at 1:22 AM Tianyin Xu <tyxu@illinois.edu> wrote:
>
> On Mon, May 17, 2021 at 12:08 PM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > While I agree with you that this is the case right now, there's no reason it
> > has to be the case. There's a variety of mechanisms that can be employed
> > to significantly speed up the performance of the notifier. For example, right
> > now the notifier is behind one large per-filter lock. That could be removed
> > allowing for better concurrency. There are a large number of mechanisms
> > that scale O(n) with the outstanding notifications -- again, something
> > that could be improved.
>
> Thanks for the pointer! But, I don’t think this can fundamentally
> eliminate the performance gap between the notifiers and the ebpf
> filters. IMHO, the additional context switches of user notifiers make
> the difference.
>
I mean, I still think it can be closed. Or at least get better. I've
thought about
working on performance improvements, but they're lower on the list
than functionality changes.

> >
> > The other big improvement that could be made is being able to use something
> > like io_uring with the notifier interface, but it would require a
> > fairly significant
> > user API change -- and a move away from ioctl. I'm not sure if people are
> > excited about that idea at the moment.
> >
>
> Apologize that I don’t fully understand your proposal. My
> understanding about io_uring is that it allows you to amortize the
> cost of context switch but not eliminate it, unless you are willing to
> dedicate a core for it. I still believe that, even with io_uring, user
> notifiers are going to be much slower than eBPF filters.
The notifier gets significantly slower as a function of the notifications. If
you have a large number of notifications in flight, or if you're trying to
concurrently handle a large number of notifications, it gets slower. This
is where something like io_uring is super useful in terms of reducing
wakeups.

Also, in the original futex2 patches, it had a mechanism to better handle
(scheduling) of notifier like cases[1]. If the seccomp notifier did a similar
thing, we could see better performance.

>
> Btw, our patches are based on your patch set (thank you!). Are you
> using user notifiers (with your improved version?) these days? It will
> be nice to hear your opinions on ebpf filters.
>
I'm so glad that someone is picking up the work on this.

> > >
> > >
> > > > >> eBPF doesn't really have a privilege model yet.  There was a long and
> > > > >> disappointing thread about this awhile back.
> > > > >
> > > > > The idea is that “seccomp-eBPF does not make life easier for an
> > > > > adversary”. Any attack an adversary could potentially utilize
> > > > > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > > > > would be an issue with eBPF in general rather than specifically
> > > > > seccomp’s use of eBPF.
> > > > >
> > > > > Here it is referring to the helpers goes to the base
> > > > > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > > > > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > > > > helpers to perform an attack, they could do it via another
> > > > > unprivileged prog type.
> > > > >
> > > > > That said, there are a few additional helpers this patchset is adding:
> > > > > * get_current_uid_gid
> > > > > * get_current_pid_tgid
> > > > >   These two provide public information (are namespaces a concern?). I
> > > > > have no idea what kind of exploit it could add unless the adversary
> > > > > somehow side-channels the task_struct? But in that case, how is the
> > > > > reading of task_struct different from how the rest of the kernel is
> > > > > reading task_struct?
> > > >
> > > > Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
> > > > (what ever happened to that?), and it likely has the same problems for
> > > > seccomp.
> > > >
So, we actually have a case where we want to inspect an argument --
We want to look at the FD number that's passed to the sendmsg syscall, and then
see if that's an AF_INET socket, and if it is, then pass back to
notifier, otherwise
allow it to continue through. This is an area where I can see eBPF being
very useful.

> > > > >>
> > > > >> What is this for?
> > > > >
> > > > > Memory reading opens up lots of use cases. For example, logging what
> > > > > files are being opened without imposing too much performance penalty
> > > > > from strace. Or as an accelerator for user notify emulation, where
> > > > > syscalls can be rejected on a fast path if we know the memory contents
> > > > > does not satisfy certain conditions that user notify will check.
> > > > >
> > > >
> > > > This has all kinds of race conditions.
> > > >
> > > >
> > > > I hate to be a party pooper, but this patchset is going to very high bar
> > > > to acceptance.  Right now, seccomp has a couple of excellent properties:
> > > >
> > > > First, while it has limited expressiveness, it is simple enough that the
> > > > implementation can be easily understood and the scope for
> > > > vulnerabilities that fall through the cracks of the seccomp sandbox
> > > > model is low.  Compare this to Windows' low-integrity/high-integrity
> > > > sandbox system: there is a never ending string of sandbox escapes due to
> > > > token misuse, unexpected things at various integrity levels, etc.
> > > > Seccomp doesn't have tokens or integrity levels, and these bugs don't
> > > > happen.
> > > >
> > > > Second, seccomp works, almost unchanged, in a completely unprivileged
> > > > context.  The last time making eBPF work sensibly in a less- or
> > > > -unprivileged context, the maintainers mostly rejected the idea of
> > > > developing/debugging a permission model for maps, cleaning up the bpf
> > > > object id system, etc.  You are going to have a very hard time
> > > > convincing the seccomp maintainers to let any of these mechanism
> > > > interact with seccomp until the underlying permission model is in place.
> > > >
> > > > --Andy
> > >
> > > Thanks for pointing out the tradeoff between expressiveness vs. simplicity.
> > >
> > > Note that we are _not_ proposing to replace cbpf, but propose to also
> > > support ebpf filters. There certainly are use cases where cbpf is
> > > sufficient, but there are also important use cases ebpf could make
> > > life much easier.
> > >
> > > Most importantly, we strongly believe that ebpf filters can be
> > > supported without reducing security.
> > >
> > > No worries about “party pooping” and we appreciate the feedback. We’d
> > > love to hear concerns and collect feedback so we can address them to
> > > hit that very high bar.
> > >
> > >
> > > ~t
> > >
> > > --
> > > Tianyin Xu
> > > University of Illinois at Urbana-Champaign
> > > https://urldefense.com/v3/__https://tianyin.github.io/__;!!DZ3fjg!o4__Ob32oapUDg9_f6hzksoFiX9517CJ5-w8qtG9i-WKFs_xWbGQfUHpLjHjCddw$
>

[1]: https://lore.kernel.org/lkml/20210215152404.250281-1-andrealmeid@collabora.com/T/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
  2021-05-20  8:56             ` Christian Brauner
  2021-05-20  9:37               ` Christian Brauner
@ 2021-06-01 19:55               ` Kees Cook
  2021-06-09  6:32                 ` Jinghao Jia
  2021-06-09  6:27               ` Jinghao Jia
  2 siblings, 1 reply; 48+ messages in thread
From: Kees Cook @ 2021-06-01 19:55 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tianyin Xu, Tycho Andersen, Andy Lutomirski, YiFei Zhu,
	containers, bpf, Zhu, YiFei, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Kuo, Hsuan-Chi, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Jia, Jinghao,
	Torrellas, Josep, Sargun Dhillon, Tobin Feldman-Fitzthum,
	Tom Hromatka, Will Drewry

On Thu, May 20, 2021 at 10:56:13AM +0200, Christian Brauner wrote:
> On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
> > On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > >
> > > On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
> > > > On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > > >
> > > > > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > > > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > > > >>>
> > > > > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > > > > >>>
> > > > > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > > > > >>>
> > > > > >>> This patchset enables seccomp filters to be written in eBPF.

Before I dive in, I do want to say that this is very interesting work.
Thanks for working on it, even if we're all so grumpy about accepting
it. :)

> > > > > >>> Supporting eBPF filters has been proposed a few times in the past.
> > > > > >>> The main concerns were (1) use cases and (2) security. We have
> > > > > >>> identified many use cases that can benefit from advanced eBPF
> > > > > >>> filters, such as:
> > > > > >>
> > > > > >> I haven't reviewed this carefully, but I think we need to distinguish
> > > > > >> a few things:
> > > > > >>
> > > > > >> 1. Using the eBPF *language*.

Likely everyone is aware, but I'll point out for anyone new reading this
thread: seccomp uses eBPF under the hood: all the cBPF is transformed to
eBPF at filter attach time. But yes, I get the point: using the _entire_
eBPF language. Though I'd remind folks that seccomp doesn't even use
the entire cBPF language...

> [...] but Andy's point stands that this brings a slew of issues on
> the table that need clear answers. Bringing stateful ebpf features into
> seccomp is a pretty big step and especially around the
> privilege/security model it looks pretty handwavy right now.

This is the blocker as far as I'm concerned: there is no story for
unprivileged eBPF. And even IF there was a story there, I find the rate
of security-related flaws in eBPF to be way too high for a sandboxing
primitive to depend on. There have been around a dozen a year for the
last 4 years:

$ git log --oneline --no-merges --pretty=format:'%as %h %s' \
   -i -E \ --all-match --grep '^Fixes:' --grep \
   '(over|under)flow|\bleak|escalat|expos(e[ds]?|ure)\b|use[- ]?after[- ]?free' \
   -- kernel/bpf/ | cut -d- -f1 | sort | uniq -c
      4 2015
      4 2016
     13 2017
     16 2018
     18 2019
     12 2020
      6 2021

I just can't bring myself to accept that level of risk for seccomp. (And
yes, this might be mitigated by blocking the bpf() syscall within a
filter, but then eBPF seccomp would become kind of useless inside a
container launcher, etc etc.)

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
  2021-05-20  8:56             ` Christian Brauner
  2021-05-20  9:37               ` Christian Brauner
  2021-06-01 19:55               ` Kees Cook
@ 2021-06-09  6:27               ` Jinghao Jia
  2 siblings, 0 replies; 48+ messages in thread
From: Jinghao Jia @ 2021-06-09  6:27 UTC (permalink / raw)
  To: Christian Brauner, Tianyin Xu
  Cc: Tycho Andersen, Andy Lutomirski, YiFei Zhu, containers, bpf, Zhu,
	YiFei, LSM List, Alexei Starovoitov, Andrea Arcangeli, Kuo,
	Hsuan-Chi, Claudio Canella, Daniel Borkmann, Daniel Gruss,
	Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke,
	Jann Horn, Torrellas, Josep, Kees Cook, Sargun Dhillon,
	Tobin Feldman-Fitzthum, Tom Hromatka, Will Drewry


On 5/20/21 3:56 AM, Christian Brauner wrote:
> On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
>> On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
>>> On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
>>>> On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
>>>>> On 5/10/21 10:21 PM, YiFei Zhu wrote:
>>>>>> On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>>>> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>>>>>>>> From: YiFei Zhu <yifeifz2@illinois.edu>
>>>>>>>>
>>>>>>>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
>>>>>>>>
>>>>>>>> This patchset enables seccomp filters to be written in eBPF.
>>>>>>>> Supporting eBPF filters has been proposed a few times in the past.
>>>>>>>> The main concerns were (1) use cases and (2) security. We have
>>>>>>>> identified many use cases that can benefit from advanced eBPF
>>>>>>>> filters, such as:
>>>>>>> I haven't reviewed this carefully, but I think we need to distinguish
>>>>>>> a few things:
>>>>>>>
>>>>>>> 1. Using the eBPF *language*.
>>>>>>>
>>>>>>> 2. Allowing the use of stateful / non-pure eBPF features.
>>>>>>>
>>>>>>> 3. Allowing the eBPF programs to read the target process' memory.
>>>>>>>
>>>>>>> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
>>>>>>> even less convinced by (3).
>>>>>>>
>>>>>>>>    * exec-only-once filter / apply filter after exec
>>>>>>> This is (2).  I'm not sure it's a good idea.
>>>>>> The basic idea is that for a container runtime it may wait to execute
>>>>>> a program in a container without that program being able to execve
>>>>>> another program, stopping any attack that involves loading another
>>>>>> binary. The container runtime can block any syscall but execve in the
>>>>>> exec-ed process by using only cBPF.
>>>>>>
>>>>>> The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
>>>>>> @Andrea and @Giuseppe, could you clarify more in case I missed
>>>>>> something?
>>>>> We've discussed having a notifier-using filter be able to replace its
>>>>> filter.  This would allow this and other use cases without any
>>>>> additional eBPF or cBPF code.
>>>>>
>>>> A notifier is not always a solution (even ignoring its perf overhead).
>>>>
>>>> One problem, pointed out by Andrea Arcangeli, is that notifiers need
>>>> userspace daemons. So, it can hardly be used by daemonless container
>>>> engines like Podman.
>>> I'm not sure I buy this argument. Podman already has a conmon instance
>>> for each container, this could be a child of that conmon process, or
>>> live inside conmon itself.
>>>
>>> Tycho
>> I checked with Andrea Arcangeli and Giuseppe Scrivano who are working on Podman.
>>
>> You are right that Podman is not completely daemonless. However, “the
>> fact it's no entirely daemonless doesn't imply it's a good idea to
>> make it worse and to add complexity to the background conmon daemon or
>> to add more daemons.”
>>
>> TL;DR. User notifiers are surely more flexible, but are also more
>> expensive and complex to implement, compared with ebpf filters. /*
>> I’ll reply to Sargun’s performance argument in a separate email */
>>
>> I'm sure you know Podman well, but let me still move some jade from
>> Andrea and Giuseppe (all credits on podmon/crun are theirs) to
>> elaborate the point, for folks cced on the list who are not very
>> familiar with Podman.
>>
>> Basically, the current order goes as follows:
>>
>>           podman -> conmon -> crun -> container_binary
>>                                 \
>>                                  - seccomp done at crun level, not conmon
>>
>> At runtime, what's left is:
>>
>>           conmon -> container_binary  /* podman disappears; crun disappears */
>>
>> So, to go through and use seccomp notify to block `exec`, we can
>> either start the container_binary with a seccomp agent wrapper, or
>> bloat the common binary (as pointed out by Tycho).
>>
>> If we go with the first approach, we will have:
>>
>>           podman -> conmon -> crun -> seccomp_agent -> container_binary
>>
>> So, at runtime we'd be left with one more daemon:
>>
>>          conmon -> seccomp_agent -> container_binary
> That seems like a strawman. I don't see why this has to be out of
> process or a separate daemon. Conmon uses a regular event loop. Adding
> support for processing notifier syscall notifications is
> straightforward. Moving it to a plugin as you mentioned below is a
> design decision not a necessity.
>
>> Apparently, nobody likes one more daemon. So, the proposal from
> I'm not sure such a blanket statements about an indeterminate group of
> people's alleged preferences constitutes a technical argument wny we
> need ebpf in seccomp.
>
>> Giuseppe was/is to use user notifiers as plugins (.so) loaded by
>> conmon:
>> https://urldefense.com/v3/__https://github.com/containers/conmon/pull/190__;!!DZ3fjg!qjoih4kOsHD09Yg41YKmYQrW_YhB3AzV0sgWZsRK621KIf7eTKiMMhAiew-ySWA_vbUt$
>> https://urldefense.com/v3/__https://github.com/containers/crun/pull/438__;!!DZ3fjg!qjoih4kOsHD09Yg41YKmYQrW_YhB3AzV0sgWZsRK621KIf7eTKiMMhAiew-ySfWBbnxD$
>>
>> Now, with the ebpf filter support, one can implement the same thing
>> using an embarrassingly simple ebpf filter and, thanks to Giuseppe,
>> this is well supported by crun.
> So I think this is trying to jump the gun by saying "Look, the result
> might be simpler.". That may even be the case - though I'm not yet
> convinced - but Andy's point stands that this brings a slew of issues on
> the table that need clear answers. Bringing stateful ebpf features into
> seccomp is a pretty big step and especially around the
> privilege/security model it looks pretty handwavy right now.
For the privilege/security model, I assume that you are referring to a 
way to safely do unprivileged ebpf and to make sure the ebpf features 
can be used by seccomp securely.

In fact, the privilege model is carefully implemented in the patch set . 
As mentioned in the cover letter, we followed the security model of user 
notifier and ptrace in a way that our implementation is as restrictive 
as them. Let me elaborate:

1. We require no less privilege than Seccomp or eBPF individually, (e.g. 
filter loading and uses of BPF helpers)

2. The new seccomp_extended LSM hook restricts the use of advanced bpf 
features (maps and helpers). Only when the hook permits the access can 
filters use standard helpers. The LSM hook is implemented in Yama and 
uses ptrace_scope to determine whether to allow access. This is based on 
the idea of reduction to ptrace, as the eBPF filters can instrument the 
process similar to ptrace.

3. The tracing helpers require additional capabilities (CAP_BPF and 
CAP_PERFMON).

4. For user-memory reading, we require CAP_PTRACE to read memory of 
non-dumpable processes. If the capability is not fulfilled, the 
bpf_user_probe{,str} helper would return -EPERM. This is, again, 
reduction to ptrace.

We acknowledge the concerns about user namespace pointed out by Alexei 
Starovoitov. We are more than happy to roll out the solution in the V2 
patch.

Best,
Jinghao

> Christian

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters
  2021-06-01 19:55               ` Kees Cook
@ 2021-06-09  6:32                 ` Jinghao Jia
  0 siblings, 0 replies; 48+ messages in thread
From: Jinghao Jia @ 2021-06-09  6:32 UTC (permalink / raw)
  To: Kees Cook, Christian Brauner
  Cc: Tianyin Xu, Tycho Andersen, Andy Lutomirski, YiFei Zhu,
	containers, bpf, Zhu, YiFei, LSM List, Alexei Starovoitov,
	Andrea Arcangeli, Kuo, Hsuan-Chi, Claudio Canella,
	Daniel Borkmann, Daniel Gruss, Dimitrios Skarlatos,
	Giuseppe Scrivano, Hubertus Franke, Jann Horn, Torrellas, Josep,
	Sargun Dhillon, Tobin Feldman-Fitzthum, Tom Hromatka,
	Will Drewry


On 6/1/21 2:55 PM, Kees Cook wrote:
> On Thu, May 20, 2021 at 10:56:13AM +0200, Christian Brauner wrote:
>> On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
>>> On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
>>>> On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
>>>>> On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
>>>>>> On 5/10/21 10:21 PM, YiFei Zhu wrote:
>>>>>>> On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>>>>> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>>>>>>>>> From: YiFei Zhu <yifeifz2@illinois.edu>
>>>>>>>>>
>>>>>>>>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
>>>>>>>>>
>>>>>>>>> This patchset enables seccomp filters to be written in eBPF.
> Before I dive in, I do want to say that this is very interesting work.
> Thanks for working on it, even if we're all so grumpy about accepting
> it. :)
>
>>>>>>>>> Supporting eBPF filters has been proposed a few times in the past.
>>>>>>>>> The main concerns were (1) use cases and (2) security. We have
>>>>>>>>> identified many use cases that can benefit from advanced eBPF
>>>>>>>>> filters, such as:
>>>>>>>> I haven't reviewed this carefully, but I think we need to distinguish
>>>>>>>> a few things:
>>>>>>>>
>>>>>>>> 1. Using the eBPF *language*.
> Likely everyone is aware, but I'll point out for anyone new reading this
> thread: seccomp uses eBPF under the hood: all the cBPF is transformed to
> eBPF at filter attach time. But yes, I get the point: using the _entire_
> eBPF language. Though I'd remind folks that seccomp doesn't even use
> the entire cBPF language...
>
>> [...] but Andy's point stands that this brings a slew of issues on
>> the table that need clear answers. Bringing stateful ebpf features into
>> seccomp is a pretty big step and especially around the
>> privilege/security model it looks pretty handwavy right now.
> This is the blocker as far as I'm concerned: there is no story for
> unprivileged eBPF. And even IF there was a story there, I find the rate
> of security-related flaws in eBPF to be way too high for a sandboxing
> primitive to depend on. There have been around a dozen a year for the
> last 4 years:
>
> $ git log --oneline --no-merges --pretty=format:'%as %h %s' \
>     -i -E \ --all-match --grep '^Fixes:' --grep \
>     '(over|under)flow|\bleak|escalat|expos(e[ds]?|ure)\b|use[- ]?after[- ]?free' \
>     -- kernel/bpf/ | cut -d- -f1 | sort | uniq -c
>        4 2015
>        4 2016
>       13 2017
>       16 2018
>       18 2019
>       12 2020
>        6 2021
>
> I just can't bring myself to accept that level of risk for seccomp.

I just want to clarify that the patch is not supposed to add more risks 
to seccomp.


The vulnerabilities of ebpf are inherently there as long as ebpf is 
supported, no matter whether Seccomp supports ebpf filters or not. If 
ebpf is of concern, one can turn off ebpf completely and Seccomp ebpf 
won’t be available. Otherwise, the vulnerabilities are in your socket 
filters anyway.

> (And
> yes, this might be mitigated by blocking the bpf() syscall within a
> filter, but then eBPF seccomp would become kind of useless inside a
> container launcher, etc etc.)

This is an interesting point. I think the main concern is still about 
the additional risks (which I responded above).


I responded to Christian Brauner earlier about the security model. 
Basically, the implementation is as restrictive as user notifier and 
ptrace. For example, if CAP_BPF is not there, the container won’t be 
able to load ebpf filters with tracing helpers.


In fact, one can load the ebpf filter first and then block the bpf 
syscall (if it makes things more secure).


Best,

Jinghao

>
> -Kees
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2021-06-09  7:04 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-10 17:22 [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters YiFei Zhu
2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 01/12] seccomp: Move no_new_privs check to after prepare_filter YiFei Zhu
2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 02/12] bpf, seccomp: Add eBPF filter capabilities YiFei Zhu
2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 03/12] seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp filters YiFei Zhu
2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 04/12] libbpf: recognize section "seccomp" YiFei Zhu
2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 05/12] samples/bpf: Add eBPF seccomp sample programs YiFei Zhu
2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 06/12] lsm: New hook seccomp_extended YiFei Zhu
2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 07/12] bpf/verifier: allow restricting direct map access YiFei Zhu
2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 08/12] seccomp-ebpf: restrict filter to almost cBPF if LSM request such YiFei Zhu
2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 09/12] yama: (concept) restrict seccomp-eBPF with ptrace_scope YiFei Zhu
2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 10/12] seccomp-ebpf: Add ability to read user memory YiFei Zhu
2021-05-11  2:04   ` Alexei Starovoitov
2021-05-11  7:14     ` YiFei Zhu
2021-05-11  7:14       ` YiFei Zhu
2021-05-12 22:36       ` Alexei Starovoitov
2021-05-13  5:26         ` YiFei Zhu
2021-05-13  5:26           ` YiFei Zhu
2021-05-13 14:53           ` Andy Lutomirski
2021-05-13 17:12             ` YiFei Zhu
2021-05-13 17:12               ` YiFei Zhu
2021-05-13 17:15               ` Andy Lutomirski
2021-05-13 17:15                 ` Andy Lutomirski
2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 11/12] bpf/verifier: support NULL-able ptr to BTF ID as helper argument YiFei Zhu
2021-05-10 17:22 ` [RFC PATCH bpf-next seccomp 12/12] seccomp-ebpf: support task storage from BPF-LSM, defaulting to group leader YiFei Zhu
2021-05-11  1:58   ` Alexei Starovoitov
2021-05-11  5:44     ` YiFei Zhu
2021-05-11  5:44       ` YiFei Zhu
2021-05-12 21:56       ` Alexei Starovoitov
2021-05-10 17:47 ` [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters Andy Lutomirski
2021-05-10 17:47   ` Andy Lutomirski
2021-05-11  5:21   ` YiFei Zhu
2021-05-11  5:21     ` YiFei Zhu
2021-05-15 15:49     ` Andy Lutomirski
2021-05-20  9:05       ` Christian Brauner
     [not found]     ` <fffbea8189794a8da539f6082af3de8e@DM5PR11MB1692.namprd11.prod.outlook.com>
2021-05-16  8:38       ` Tianyin Xu
2021-05-17 15:40         ` Tycho Andersen
2021-05-17 17:07         ` Sargun Dhillon
2021-05-17 17:07           ` Sargun Dhillon
     [not found]         ` <108b4b9c2daa4123805d2b92cf51374b@DM5PR11MB1692.namprd11.prod.outlook.com>
2021-05-20  8:16           ` Tianyin Xu
2021-05-20  8:56             ` Christian Brauner
2021-05-20  9:37               ` Christian Brauner
2021-06-01 19:55               ` Kees Cook
2021-06-09  6:32                 ` Jinghao Jia
2021-06-09  6:27               ` Jinghao Jia
     [not found]             ` <00fe481c572d486289bc88780f48e88f@DM5PR11MB1692.namprd11.prod.outlook.com>
2021-05-20 22:13               ` Tianyin Xu
     [not found]         ` <eae2a0e5038b41c4af87edcb3d4cdc13@DM5PR11MB1692.namprd11.prod.outlook.com>
2021-05-20  8:22           ` Tianyin Xu
2021-05-24 18:55             ` Sargun Dhillon
2021-05-24 18:55               ` Sargun Dhillon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.