BPF Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v1 0/3] seccomp: continue syscall from notifier
@ 2019-09-19  9:59 Christian Brauner
  2019-09-19  9:59 ` [PATCH v1 1/3] seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE Christian Brauner
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Christian Brauner @ 2019-09-19  9:59 UTC (permalink / raw)
  To: keescook, luto
  Cc: jannh, wad, shuah, ast, daniel, kafai, songliubraving, yhs,
	linux-kernel, linux-kselftest, netdev, bpf, Christian Brauner

Hey everyone,

This is the patchset coming out of the KSummit session Kees and I gave
in Lisbon last week (cf. [3] which also contains slides with more
details on related things such as deep argument inspection).
The simple idea is to extend the seccomp notifier to allow for the
continuation of a syscall. The rationale for this can be found in the
commit message to [1]. For the curious there is more detail in [2].
This patchset would unblock supervising an extended set of syscalls such
as mount() where a privileged process is supervising the syscalls of a
lesser privileged process and emulates the syscall for the latter in
userspace.
For more comments on security see [1].

Kees, if you prefer a pr the series can be pulled from:
git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/seccomp-notify-syscall-continue-v5.5

For anyone who wants to play with this it's sitting in:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=seccomp_syscall_continue

/* v1 */
- Kees Cook <keescook@chromium.org>:
  - dropped patch because it is already present in linux-next
    [PATCH 2/4] seccomp: add two missing ptrace ifdefines
    Link: https://lore.kernel.org/r/20190918084833.9369-3-christian.brauner@ubuntu.com

/* v0 */
Link: https://lore.kernel.org/r/20190918084833.9369-1-christian.brauner@ubuntu.com

Thanks!
Christian

/* References */
[1]: [PATCH 1/3] seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE
[2]: https://lore.kernel.org/r/20190719093538.dhyopljyr5ns33qx@brauner.io
[3]: https://linuxplumbersconf.org/event/4/contributions/560

Christian Brauner (3):
  seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE
  seccomp: avoid overflow in implicit constant conversion
  seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE

 include/uapi/linux/seccomp.h                  |  20 ++++
 kernel/seccomp.c                              |  28 ++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 105 +++++++++++++++++-
 3 files changed, 146 insertions(+), 7 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v1 1/3] seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE
  2019-09-19  9:59 [PATCH v1 0/3] seccomp: continue syscall from notifier Christian Brauner
@ 2019-09-19  9:59 ` Christian Brauner
  2019-09-19 19:37   ` Jann Horn
  2019-09-19  9:59 ` [PATCH v1 2/3] seccomp: avoid overflow in implicit constant conversion Christian Brauner
  2019-09-19  9:59 ` [PATCH v1 3/3] seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE Christian Brauner
  2 siblings, 1 reply; 8+ messages in thread
From: Christian Brauner @ 2019-09-19  9:59 UTC (permalink / raw)
  To: keescook, luto
  Cc: jannh, wad, shuah, ast, daniel, kafai, songliubraving, yhs,
	linux-kernel, linux-kselftest, netdev, bpf, Christian Brauner,
	Tycho Andersen, Tyler Hicks

This allows the seccomp notifier to continue a syscall. A positive
discussion about this feature was triggered by a post to the
ksummit-discuss mailing list (cf. [3]) and took place during KSummit
(cf. [1]) and again at the containers/checkpoint-restore
micro-conference at Linux Plumbers.

Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF (cf. [4])
which enables a process (watchee) to retrieve an fd for its seccomp
filter. This fd can then be handed to another (usually more privileged)
process (watcher). The watcher will then be able to receive seccomp
messages about the syscalls having been performed by the watchee.

This feature is heavily used in some userspace workloads. For example,
it is currently used to intercept mknod() syscalls in user namespaces
aka in containers.
The mknod() syscall can be easily filtered based on dev_t. This allows
us to only intercept a very specific subset of mknod() syscalls.
Furthermore, mknod() is not possible in user namespaces toto coelo and
so intercepting and denying syscalls that are not in the whitelist on
accident is not a big deal. The watchee won't notice a difference.

In contrast to mknod(), a lot of other syscall we intercept (e.g.
setxattr()) cannot be easily filtered like mknod() because they have
pointer arguments. Additionally, some of them might actually succeed in
user namespaces (e.g. setxattr() for all "user.*" xattrs). Since we
currently cannot tell seccomp to continue from a user notifier we are
stuck with performing all of the syscalls in lieu of the container. This
is a huge security liability since it is extremely difficult to
correctly assume all of the necessary privileges of the calling task
such that the syscall can be successfully emulated without escaping
other additional security restrictions (think missing CAP_MKNOD for
mknod(), or MS_NODEV on a filesystem etc.). This can be solved by
telling seccomp to resume the syscall.

One thing that came up in the discussion was the problem that another
thread could change the memory after userspace has decided to let the
syscall continue which is a well known TOCTOU with seccomp which is
present in other ways already.
The discussion showed that this feature is already very useful for any
syscall without pointer arguments. For any accidentally intercepted
non-pointer syscall it is safe to continue.
For syscalls with pointer arguments there is a race but for any cautious
userspace and the main usec cases the race doesn't matter. The notifier
is intended to be used in a scenario where a more privileged watcher
supervises the syscalls of lesser privileged watchee to allow it to get
around kernel-enforced limitations by performing the syscall for it
whenever deemed save by the watcher. Hence, if a user tricks the watcher
into allowing a syscall they will either get a deny based on
kernel-enforced restrictions later or they will have changed the
arguments in such a way that they manage to perform a syscall with
arguments that they would've been allowed to do anyway.
In general, it is good to point out again, that the notifier fd was not
intended to allow userspace to implement a security policy but rather to
work around kernel security mechanisms in cases where the watcher knows
that a given action is safe to perform.

/* References */
[1]: https://linuxplumbersconf.org/event/4/contributions/560
[2]: https://linuxplumbersconf.org/event/4/contributions/477
[3]: https://lore.kernel.org/r/20190719093538.dhyopljyr5ns33qx@brauner.io
[4]: commit 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")

Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
CC: Tyler Hicks <tyhicks@canonical.com>
---
/* v1 */
- Kees Cook <keescook@chromium.org>, Tycho Andersen <tycho@tycho.ws>:
  - s/SECCOMP_RET_USER_NOTIF_ALLOW/SECCOMP_USER_NOTIF_FLAG_CONTINUE/g
- Kees Cook <keescook@chromium.org>:
  - put giant warning about the dangers, and correct usage of the
    SECCOMP_USER_NOTIF_FLAG_CONTINUE flag
- Kees Cook <keescook@chromium.org>:
  - change return type for seccomp_do_user_notification() to int to align with
    similar functions

/* v0 */
Link: https://lore.kernel.org/r/20190918084833.9369-2-christian.brauner@ubuntu.com
---
 include/uapi/linux/seccomp.h | 20 ++++++++++++++++++++
 kernel/seccomp.c             | 28 ++++++++++++++++++++++------
 2 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 90734aa5aa36..8a5dafed8a64 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -76,6 +76,26 @@ struct seccomp_notif {
 	struct seccomp_data data;
 };
 
+/*
+ * Valid flags for struct seccomp_notif_resp
+ *
+ * Note, the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with caution!
+ * If set by the process supervising the syscalls of another process the
+ * syscall will continue. This is problematic because of an inherent TOCTOU.
+ * An attacker can exploit the time while the supervised process is waiting on
+ * a response from the supervising process to rewrite syscall arguments which
+ * are passed as pointers of the intercepted syscall.
+ * It should be absolutely clear that this means that the seccomp notifier
+ * _cannot_ be used to implement a security policy! It should only ever be used
+ * in scenarios where a more privileged process supervises the syscalls of a
+ * lesser privileged process to get around kernel-enforced security
+ * restrictions when the privileged process deems this safe. In other words,
+ * in order to continue a syscall the supervising process should be sure that
+ * another security mechanism or the kernel itself will sufficiently block
+ * syscalls if arguments are rewritten to something unsafe.
+ */
+#define SECCOMP_USER_NOTIF_FLAG_CONTINUE BIT(0)
+
 struct seccomp_notif_resp {
 	__u64 id;
 	__s64 val;
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index dba52a7db5e8..12d2227e5786 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -75,6 +75,7 @@ struct seccomp_knotif {
 	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
 	int error;
 	long val;
+	u32 flags;
 
 	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
 	struct completion ready;
@@ -732,11 +733,12 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
 	return filter->notif->next_id++;
 }
 
-static void seccomp_do_user_notification(int this_syscall,
-					 struct seccomp_filter *match,
-					 const struct seccomp_data *sd)
+static int seccomp_do_user_notification(int this_syscall,
+					struct seccomp_filter *match,
+					const struct seccomp_data *sd)
 {
 	int err;
+	u32 flags = 0;
 	long ret = 0;
 	struct seccomp_knotif n = {};
 
@@ -764,6 +766,7 @@ static void seccomp_do_user_notification(int this_syscall,
 	if (err == 0) {
 		ret = n.val;
 		err = n.error;
+		flags = n.flags;
 	}
 
 	/*
@@ -780,8 +783,14 @@ static void seccomp_do_user_notification(int this_syscall,
 		list_del(&n.list);
 out:
 	mutex_unlock(&match->notify_lock);
+
+	/* Userspace requests to continue the syscall. */
+	if (flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE)
+		return 0;
+
 	syscall_set_return_value(current, task_pt_regs(current),
 				 err, ret);
+	return -1;
 }
 
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
@@ -867,8 +876,10 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 		return 0;
 
 	case SECCOMP_RET_USER_NOTIF:
-		seccomp_do_user_notification(this_syscall, match, sd);
-		goto skip;
+		if (seccomp_do_user_notification(this_syscall, match, sd))
+			goto skip;
+
+		return 0;
 
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
@@ -1087,7 +1098,11 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
 	if (copy_from_user(&resp, buf, sizeof(resp)))
 		return -EFAULT;
 
-	if (resp.flags)
+	if (resp.flags & ~SECCOMP_USER_NOTIF_FLAG_CONTINUE)
+		return -EINVAL;
+
+	if ((resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE) &&
+	    (resp.error || resp.val))
 		return -EINVAL;
 
 	ret = mutex_lock_interruptible(&filter->notify_lock);
@@ -1116,6 +1131,7 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
 	knotif->state = SECCOMP_NOTIFY_REPLIED;
 	knotif->error = resp.error;
 	knotif->val = resp.val;
+	knotif->flags = resp.flags;
 	complete(&knotif->ready);
 out:
 	mutex_unlock(&filter->notify_lock);
-- 
2.23.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v1 2/3] seccomp: avoid overflow in implicit constant conversion
  2019-09-19  9:59 [PATCH v1 0/3] seccomp: continue syscall from notifier Christian Brauner
  2019-09-19  9:59 ` [PATCH v1 1/3] seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE Christian Brauner
@ 2019-09-19  9:59 ` Christian Brauner
  2019-09-19  9:59 ` [PATCH v1 3/3] seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE Christian Brauner
  2 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2019-09-19  9:59 UTC (permalink / raw)
  To: keescook, luto
  Cc: jannh, wad, shuah, ast, daniel, kafai, songliubraving, yhs,
	linux-kernel, linux-kselftest, netdev, bpf, Christian Brauner,
	Tyler Hicks, Tycho Andersen, stable

USER_NOTIF_MAGIC is assigned to int variables in this test so set it to INT_MAX
to avoid warnings:

seccomp_bpf.c: In function ‘user_notification_continue’:
seccomp_bpf.c:3088:26: warning: overflow in implicit constant conversion [-Woverflow]
 #define USER_NOTIF_MAGIC 116983961184613L
                          ^
seccomp_bpf.c:3572:15: note: in expansion of macro ‘USER_NOTIF_MAGIC’
  resp.error = USER_NOTIF_MAGIC;
               ^~~~~~~~~~~~~~~~

Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: stable@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org
---
/* v1 */
unchanged

/* v0 */
Link: https://lore.kernel.org/r/20190918084833.9369-4-christian.brauner@ubuntu.com
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 6ef7f16c4cf5..e996d7b7fd6e 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -35,6 +35,7 @@
 #include <stdbool.h>
 #include <string.h>
 #include <time.h>
+#include <limits.h>
 #include <linux/elf.h>
 #include <sys/uio.h>
 #include <sys/utsname.h>
@@ -3072,7 +3073,7 @@ static int user_trap_syscall(int nr, unsigned int flags)
 	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
 }
 
-#define USER_NOTIF_MAGIC 116983961184613L
+#define USER_NOTIF_MAGIC INT_MAX
 TEST(user_notification_basic)
 {
 	pid_t pid;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v1 3/3] seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE
  2019-09-19  9:59 [PATCH v1 0/3] seccomp: continue syscall from notifier Christian Brauner
  2019-09-19  9:59 ` [PATCH v1 1/3] seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE Christian Brauner
  2019-09-19  9:59 ` [PATCH v1 2/3] seccomp: avoid overflow in implicit constant conversion Christian Brauner
@ 2019-09-19  9:59 ` Christian Brauner
  2019-09-19 17:13   ` shuah
  2 siblings, 1 reply; 8+ messages in thread
From: Christian Brauner @ 2019-09-19  9:59 UTC (permalink / raw)
  To: keescook, luto
  Cc: jannh, wad, shuah, ast, daniel, kafai, songliubraving, yhs,
	linux-kernel, linux-kselftest, netdev, bpf, Christian Brauner,
	Tycho Andersen, Tyler Hicks, stable

Test whether a syscall can be performed after having been intercepted by
the seccomp notifier. The test uses dup() and kcmp() since it allows us to
nicely test whether the dup() syscall actually succeeded by comparing whether
the fds refer to the same underlying struct file.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Tycho Andersen <tycho@tycho.ws>
CC: Tyler Hicks <tyhicks@canonical.com>
Cc: stable@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org
---
/* v1 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - adapt to new flag name SECCOMP_USER_NOTIF_FLAG_CONTINUE

/* v0 */
Link: https://lore.kernel.org/r/20190918084833.9369-5-christian.brauner@ubuntu.com
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 102 ++++++++++++++++++
 1 file changed, 102 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index e996d7b7fd6e..b0966599acb5 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -44,6 +44,7 @@
 #include <sys/times.h>
 #include <sys/socket.h>
 #include <sys/ioctl.h>
+#include <linux/kcmp.h>
 
 #include <unistd.h>
 #include <sys/syscall.h>
@@ -167,6 +168,10 @@ struct seccomp_metadata {
 
 #define SECCOMP_RET_USER_NOTIF 0x7fc00000U
 
+#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
+#define SECCOMP_USER_NOTIF_FLAG_CONTINUE 0x00000001
+#endif
+
 #define SECCOMP_IOC_MAGIC		'!'
 #define SECCOMP_IO(nr)			_IO(SECCOMP_IOC_MAGIC, nr)
 #define SECCOMP_IOR(nr, type)		_IOR(SECCOMP_IOC_MAGIC, nr, type)
@@ -3481,6 +3486,103 @@ TEST(seccomp_get_notif_sizes)
 	EXPECT_EQ(sizes.seccomp_notif_resp, sizeof(struct seccomp_notif_resp));
 }
 
+static int filecmp(pid_t pid1, pid_t pid2, int fd1, int fd2)
+{
+#ifdef __NR_kcmp
+	return syscall(__NR_kcmp, pid1, pid2, KCMP_FILE, fd1, fd2);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+
+TEST(user_notification_continue)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	struct pollfd pollfd;
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	listener = user_trap_syscall(__NR_dup, SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int dup_fd, pipe_fds[2];
+		pid_t self;
+
+		ret = pipe(pipe_fds);
+		if (ret < 0)
+			exit(EXIT_FAILURE);
+
+		dup_fd = dup(pipe_fds[0]);
+		if (dup_fd < 0)
+			exit(EXIT_FAILURE);
+
+		self = getpid();
+
+		ret = filecmp(self, self, pipe_fds[0], dup_fd);
+		if (ret)
+			exit(EXIT_FAILURE);
+
+		exit(EXIT_SUCCESS);
+	}
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLIN);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLOUT);
+
+	EXPECT_EQ(req.data.nr, __NR_dup);
+
+	resp.id = req.id;
+	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+
+	/*
+	 * Verify that setting SECCOMP_USER_NOTIF_FLAG_CONTINUE enforces other
+	 * args be set to 0.
+	 */
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	resp.error = USER_NOTIF_MAGIC;
+	resp.val = 0;
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	resp.error = 0;
+	resp.val = 0;
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0) {
+		if (errno == EINVAL)
+			XFAIL(goto skip, "Kernel does not support SECCOMP_USER_NOTIF_FLAG_CONTINUE");
+	}
+
+skip:
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
 /*
  * TODO:
  * - add microbenchmarks
-- 
2.23.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1 3/3] seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE
  2019-09-19  9:59 ` [PATCH v1 3/3] seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE Christian Brauner
@ 2019-09-19 17:13   ` shuah
  2019-09-19 21:50     ` Christian Brauner
  0 siblings, 1 reply; 8+ messages in thread
From: shuah @ 2019-09-19 17:13 UTC (permalink / raw)
  To: Christian Brauner, keescook, luto
  Cc: jannh, wad, ast, daniel, kafai, songliubraving, yhs,
	linux-kernel, linux-kselftest, netdev, bpf, Tycho Andersen,
	Tyler Hicks, stable, shuah

On 9/19/19 3:59 AM, Christian Brauner wrote:
> Test whether a syscall can be performed after having been intercepted by
> the seccomp notifier. The test uses dup() and kcmp() since it allows us to
> nicely test whether the dup() syscall actually succeeded by comparing whether
> the fds refer to the same underlying struct file.
> 
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Will Drewry <wad@chromium.org>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Martin KaFai Lau <kafai@fb.com>
> Cc: Song Liu <songliubraving@fb.com>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Tycho Andersen <tycho@tycho.ws>
> CC: Tyler Hicks <tyhicks@canonical.com>
> Cc: stable@vger.kernel.org
> Cc: linux-kselftest@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Cc: bpf@vger.kernel.org
> ---
> /* v1 */
> - Christian Brauner <christian.brauner@ubuntu.com>:
>    - adapt to new flag name SECCOMP_USER_NOTIF_FLAG_CONTINUE
> 
> /* v0 */
> Link: https://lore.kernel.org/r/20190918084833.9369-5-christian.brauner@ubuntu.com
> ---
>   tools/testing/selftests/seccomp/seccomp_bpf.c | 102 ++++++++++++++++++
>   1 file changed, 102 insertions(+)
> 
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index e996d7b7fd6e..b0966599acb5 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -44,6 +44,7 @@
>   #include <sys/times.h>
>   #include <sys/socket.h>
>   #include <sys/ioctl.h>
> +#include <linux/kcmp.h>
>   
>   #include <unistd.h>
>   #include <sys/syscall.h>
> @@ -167,6 +168,10 @@ struct seccomp_metadata {
>   
>   #define SECCOMP_RET_USER_NOTIF 0x7fc00000U
>   
> +#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
> +#define SECCOMP_USER_NOTIF_FLAG_CONTINUE 0x00000001
> +#endif
> +
>   #define SECCOMP_IOC_MAGIC		'!'
>   #define SECCOMP_IO(nr)			_IO(SECCOMP_IOC_MAGIC, nr)
>   #define SECCOMP_IOR(nr, type)		_IOR(SECCOMP_IOC_MAGIC, nr, type)
> @@ -3481,6 +3486,103 @@ TEST(seccomp_get_notif_sizes)
>   	EXPECT_EQ(sizes.seccomp_notif_resp, sizeof(struct seccomp_notif_resp));
>   }
>   
> +static int filecmp(pid_t pid1, pid_t pid2, int fd1, int fd2)
> +{
> +#ifdef __NR_kcmp
> +	return syscall(__NR_kcmp, pid1, pid2, KCMP_FILE, fd1, fd2);
> +#else
> +	errno = ENOSYS;
> +	return -1;

This should be SKIP for kselftest so this isn't counted a failure.
In this case test can't be run because of a missing dependency.

> +#endif
> +}
> +
> +TEST(user_notification_continue)
> +{
> +	pid_t pid;
> +	long ret;
> +	int status, listener;
> +	struct seccomp_notif req = {};
> +	struct seccomp_notif_resp resp = {};
> +	struct pollfd pollfd;
> +
> +	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> +	ASSERT_EQ(0, ret) {
> +		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
> +	}
> +
> +	listener = user_trap_syscall(__NR_dup, SECCOMP_FILTER_FLAG_NEW_LISTENER);
> +	ASSERT_GE(listener, 0);
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		int dup_fd, pipe_fds[2];
> +		pid_t self;
> +
> +		ret = pipe(pipe_fds);
> +		if (ret < 0)
> +			exit(EXIT_FAILURE);
> +
> +		dup_fd = dup(pipe_fds[0]);
> +		if (dup_fd < 0)
> +			exit(EXIT_FAILURE);
> +
> +		self = getpid();
> +
> +		ret = filecmp(self, self, pipe_fds[0], dup_fd);
> +		if (ret)
> +			exit(EXIT_FAILURE);
> +
> +		exit(EXIT_SUCCESS);
> +	}
> +
> +	pollfd.fd = listener;
> +	pollfd.events = POLLIN | POLLOUT;
> +
> +	EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +	EXPECT_EQ(pollfd.revents, POLLIN);
> +
> +	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
> +
> +	pollfd.fd = listener;
> +	pollfd.events = POLLIN | POLLOUT;
> +
> +	EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +	EXPECT_EQ(pollfd.revents, POLLOUT);
> +
> +	EXPECT_EQ(req.data.nr, __NR_dup);
> +
> +	resp.id = req.id;
> +	resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
> +
> +	/*
> +	 * Verify that setting SECCOMP_USER_NOTIF_FLAG_CONTINUE enforces other
> +	 * args be set to 0.
> +	 */
> +	resp.error = 0;
> +	resp.val = USER_NOTIF_MAGIC;
> +	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), -1);
> +	EXPECT_EQ(errno, EINVAL);
> +
> +	resp.error = USER_NOTIF_MAGIC;
> +	resp.val = 0;
> +	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), -1);
> +	EXPECT_EQ(errno, EINVAL);
> +
> +	resp.error = 0;
> +	resp.val = 0;
> +	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0) {
> +		if (errno == EINVAL)
> +			XFAIL(goto skip, "Kernel does not support SECCOMP_USER_NOTIF_FLAG_CONTINUE");
> +	}
> +
> +skip:
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
>   /*
>    * TODO:
>    * - add microbenchmarks
> 

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1 1/3] seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE
  2019-09-19  9:59 ` [PATCH v1 1/3] seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE Christian Brauner
@ 2019-09-19 19:37   ` Jann Horn
  2019-09-19 22:22     ` Christian Brauner
  0 siblings, 1 reply; 8+ messages in thread
From: Jann Horn @ 2019-09-19 19:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Andy Lutomirski, Will Drewry, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, kafai, Song Liu, yhs,
	kernel list, open list:KERNEL SELFTEST FRAMEWORK,
	Network Development, bpf, Tycho Andersen, Tyler Hicks

On Thu, Sep 19, 2019 at 11:59 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> This allows the seccomp notifier to continue a syscall.
[...]
> Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF (cf. [4])
> which enables a process (watchee) to retrieve an fd for its seccomp
> filter. This fd can then be handed to another (usually more privileged)
> process (watcher). The watcher will then be able to receive seccomp
> messages about the syscalls having been performed by the watchee.
[...]
> This can be solved by
> telling seccomp to resume the syscall.
[...]
> @@ -780,8 +783,14 @@ static void seccomp_do_user_notification(int this_syscall,
>                 list_del(&n.list);
>  out:
>         mutex_unlock(&match->notify_lock);
> +
> +       /* Userspace requests to continue the syscall. */
> +       if (flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE)
> +               return 0;
> +
>         syscall_set_return_value(current, task_pt_regs(current),
>                                  err, ret);
> +       return -1;
>  }

Seccomp currently expects the various seccomp return values to be
fully ordered based on how much action the kernel should take against
the requested syscall. Currently, the range of return values is
basically divided into three regions: "block syscall in some way"
(from SECCOMP_RET_KILL_PROCESS to SECCOMP_RET_USER_NOTIF), "let ptrace
decide" (SECCOMP_RET_TRACE) and "allow" (SECCOMP_RET_LOG and
SECCOMP_RET_ALLOW). If SECCOMP_RET_USER_NOTIF becomes able to allow
syscalls, it will be able to override a negative decision from
SECCOMP_RET_TRACE.

In practice, that's probably not a big deal, since I'm not aware of
anyone actually using SECCOMP_RET_TRACE for security purposes, and on
top of that, you'd have to allow ioctl(..., SECCOMP_IOCTL_NOTIF_SEND,
...) and seccomp() with SECCOMP_FILTER_FLAG_NEW_LISTENER in your
seccomp policy for this to work.

More interestingly, what about the case where two
SECCOMP_RET_USER_NOTIF filters are installed? The most recently
installed filter takes precedence if the return values's action parts
are the same (and this is also documented in the manpage); so if a
container engine installs a filter that always intercepts sys_foobar()
(and never uses SECCOMP_USER_NOTIF_FLAG_CONTINUE), and then something
inside the container also installs a filter that always intercepts
sys_foobar() (and always uses SECCOMP_USER_NOTIF_FLAG_CONTINUE), the
container engine's filter will become ineffective.

With my tendency to overcomplicate things, I'm thinking that maybe it
might be a good idea to:
 - collect a list of all filters that returned SECCOMP_RET_USER_NOTIF,
as well as the highest-precedence return value that was less strict
than SECCOMP_RET_USER_NOTIF
 - sequentially send notifications to all of the
SECCOMP_RET_USER_NOTIF filters until one doesn't return
SECCOMP_USER_NOTIF_FLAG_CONTINUE
 - if all returned SECCOMP_USER_NOTIF_FLAG_CONTINUE, go with the
highest-precedence return value that was less strict than
SECCOMP_RET_USER_NOTIF, or allow if no such return value was
encountered

But perhaps, for now, it would also be enough to just expand the big
fat warning note and tell people that if they allow the use of
SECCOMP_IOCTL_NOTIF_SEND and SECCOMP_FILTER_FLAG_NEW_LISTENER in their
filter, SECCOMP_RET_USER_NOTIF is bypassable. And if someone actually
has a usecase where SECCOMP_RET_USER_NOTIF should be secure and nested
SECCOMP_RET_USER_NOTIF support is needed, that more complicated logic
could be added later?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1 3/3] seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE
  2019-09-19 17:13   ` shuah
@ 2019-09-19 21:50     ` Christian Brauner
  0 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2019-09-19 21:50 UTC (permalink / raw)
  To: shuah
  Cc: keescook, luto, jannh, wad, ast, daniel, kafai, songliubraving,
	yhs, linux-kernel, linux-kselftest, netdev, bpf, Tycho Andersen,
	Tyler Hicks, stable

On Thu, Sep 19, 2019 at 11:13:46AM -0600, shuah wrote:
> On 9/19/19 3:59 AM, Christian Brauner wrote:
> > Test whether a syscall can be performed after having been intercepted by
> > the seccomp notifier. The test uses dup() and kcmp() since it allows us to
> > nicely test whether the dup() syscall actually succeeded by comparing whether
> > the fds refer to the same underlying struct file.
> > 
> > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Andy Lutomirski <luto@amacapital.net>
> > Cc: Will Drewry <wad@chromium.org>
> > Cc: Shuah Khan <shuah@kernel.org>
> > Cc: Alexei Starovoitov <ast@kernel.org>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: Martin KaFai Lau <kafai@fb.com>
> > Cc: Song Liu <songliubraving@fb.com>
> > Cc: Yonghong Song <yhs@fb.com>
> > Cc: Tycho Andersen <tycho@tycho.ws>
> > CC: Tyler Hicks <tyhicks@canonical.com>
> > Cc: stable@vger.kernel.org
> > Cc: linux-kselftest@vger.kernel.org
> > Cc: netdev@vger.kernel.org
> > Cc: bpf@vger.kernel.org
> > ---
> > /* v1 */
> > - Christian Brauner <christian.brauner@ubuntu.com>:
> >    - adapt to new flag name SECCOMP_USER_NOTIF_FLAG_CONTINUE
> > 
> > /* v0 */
> > Link: https://lore.kernel.org/r/20190918084833.9369-5-christian.brauner@ubuntu.com
> > ---
> >   tools/testing/selftests/seccomp/seccomp_bpf.c | 102 ++++++++++++++++++
> >   1 file changed, 102 insertions(+)
> > 
> > diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> > index e996d7b7fd6e..b0966599acb5 100644
> > --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> > +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> > @@ -44,6 +44,7 @@
> >   #include <sys/times.h>
> >   #include <sys/socket.h>
> >   #include <sys/ioctl.h>
> > +#include <linux/kcmp.h>
> >   #include <unistd.h>
> >   #include <sys/syscall.h>
> > @@ -167,6 +168,10 @@ struct seccomp_metadata {
> >   #define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> > +#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
> > +#define SECCOMP_USER_NOTIF_FLAG_CONTINUE 0x00000001
> > +#endif
> > +
> >   #define SECCOMP_IOC_MAGIC		'!'
> >   #define SECCOMP_IO(nr)			_IO(SECCOMP_IOC_MAGIC, nr)
> >   #define SECCOMP_IOR(nr, type)		_IOR(SECCOMP_IOC_MAGIC, nr, type)
> > @@ -3481,6 +3486,103 @@ TEST(seccomp_get_notif_sizes)
> >   	EXPECT_EQ(sizes.seccomp_notif_resp, sizeof(struct seccomp_notif_resp));
> >   }
> > +static int filecmp(pid_t pid1, pid_t pid2, int fd1, int fd2)
> > +{
> > +#ifdef __NR_kcmp
> > +	return syscall(__NR_kcmp, pid1, pid2, KCMP_FILE, fd1, fd2);
> > +#else
> > +	errno = ENOSYS;
> > +	return -1;
> 
> This should be SKIP for kselftest so this isn't counted a failure.
> In this case test can't be run because of a missing dependency.

Right, I can just ifdef the whole test and report a skip.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v1 1/3] seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE
  2019-09-19 19:37   ` Jann Horn
@ 2019-09-19 22:22     ` Christian Brauner
  0 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2019-09-19 22:22 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, Andy Lutomirski, Will Drewry, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, kafai, Song Liu, yhs,
	kernel list, open list:KERNEL SELFTEST FRAMEWORK,
	Network Development, bpf, Tycho Andersen, Tyler Hicks

On Thu, Sep 19, 2019 at 09:37:06PM +0200, Jann Horn wrote:
> On Thu, Sep 19, 2019 at 11:59 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> > This allows the seccomp notifier to continue a syscall.
> [...]
> > Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF (cf. [4])
> > which enables a process (watchee) to retrieve an fd for its seccomp
> > filter. This fd can then be handed to another (usually more privileged)
> > process (watcher). The watcher will then be able to receive seccomp
> > messages about the syscalls having been performed by the watchee.
> [...]
> > This can be solved by
> > telling seccomp to resume the syscall.
> [...]
> > @@ -780,8 +783,14 @@ static void seccomp_do_user_notification(int this_syscall,
> >                 list_del(&n.list);
> >  out:
> >         mutex_unlock(&match->notify_lock);
> > +
> > +       /* Userspace requests to continue the syscall. */
> > +       if (flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE)
> > +               return 0;
> > +
> >         syscall_set_return_value(current, task_pt_regs(current),
> >                                  err, ret);
> > +       return -1;
> >  }
> 
> Seccomp currently expects the various seccomp return values to be
> fully ordered based on how much action the kernel should take against
> the requested syscall. Currently, the range of return values is
> basically divided into three regions: "block syscall in some way"
> (from SECCOMP_RET_KILL_PROCESS to SECCOMP_RET_USER_NOTIF), "let ptrace
> decide" (SECCOMP_RET_TRACE) and "allow" (SECCOMP_RET_LOG and
> SECCOMP_RET_ALLOW). If SECCOMP_RET_USER_NOTIF becomes able to allow
> syscalls, it will be able to override a negative decision from
> SECCOMP_RET_TRACE.
> 
> In practice, that's probably not a big deal, since I'm not aware of
> anyone actually using SECCOMP_RET_TRACE for security purposes, and on
> top of that, you'd have to allow ioctl(..., SECCOMP_IOCTL_NOTIF_SEND,
> ...) and seccomp() with SECCOMP_FILTER_FLAG_NEW_LISTENER in your
> seccomp policy for this to work.
> 
> More interestingly, what about the case where two
> SECCOMP_RET_USER_NOTIF filters are installed? The most recently
> installed filter takes precedence if the return values's action parts
> are the same (and this is also documented in the manpage); so if a
> container engine installs a filter that always intercepts sys_foobar()
> (and never uses SECCOMP_USER_NOTIF_FLAG_CONTINUE), and then something
> inside the container also installs a filter that always intercepts
> sys_foobar() (and always uses SECCOMP_USER_NOTIF_FLAG_CONTINUE), the
> container engine's filter will become ineffective.

Excellent point. We discussed the nested container case today.

> 
> With my tendency to overcomplicate things, I'm thinking that maybe it
> might be a good idea to:
>  - collect a list of all filters that returned SECCOMP_RET_USER_NOTIF,
> as well as the highest-precedence return value that was less strict
> than SECCOMP_RET_USER_NOTIF
>  - sequentially send notifications to all of the
> SECCOMP_RET_USER_NOTIF filters until one doesn't return
> SECCOMP_USER_NOTIF_FLAG_CONTINUE
>  - if all returned SECCOMP_USER_NOTIF_FLAG_CONTINUE, go with the
> highest-precedence return value that was less strict than
> SECCOMP_RET_USER_NOTIF, or allow if no such return value was
> encountered
> 
> But perhaps, for now, it would also be enough to just expand the big
> fat warning note and tell people that if they allow the use of
> SECCOMP_IOCTL_NOTIF_SEND and SECCOMP_FILTER_FLAG_NEW_LISTENER in their
> filter, SECCOMP_RET_USER_NOTIF is bypassable. And if someone actually
> has a usecase where SECCOMP_RET_USER_NOTIF should be secure and nested
> SECCOMP_RET_USER_NOTIF support is needed, that more complicated logic
> could be added later?

Yes, I think that is the correct approach for now.
Realistically, the most useful scenario is a host-privileged supervisor
process and a user-namespaced supervised process (or to use a concrete
example, a host-privileged container manager and an unprivileged
container). Having a user-namespaced supervisor process supervising
another nested user-namespaced process is for the most part useless
because the supervisor can't do any of the interesting syscalls (e.g.
mounting block devices that are deemed safe, faking mknod() etc.). So I
expect seccomp with USER_NOTIF to be blocked just for good measure. 
Also - maybe I'm wrong - the warning we added points out that this is
only safe if the supervised process can already rely on kernel (or
other) restrictions, i.e. even if an attacker overwrites pointer syscall
arguments with harmful ones the supervisor must be sure that they are
already blocked anyway. Which can be generalized to: if an unwanted
syscall goes through in _some_ way then the supervisor must be sure that
it is blocked.
Iiuc, for your specific attack all the nested attacker can do is to
never actually get the (outer) supervisor to fake the syscall for it.
A more interesting case might be where the host-privileged supervising
process wants to deny a syscall that would otherwise succeed. But if
that's the case then the outer supervisor is trying to implement a
security policy. But we explicitly point out that this is not possible
with the notifier in general.
But honestly, that is very advanced and it seems unlikely that someone
would want this. So I'd say let's just point this out.

Christian

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, back to index

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-19  9:59 [PATCH v1 0/3] seccomp: continue syscall from notifier Christian Brauner
2019-09-19  9:59 ` [PATCH v1 1/3] seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE Christian Brauner
2019-09-19 19:37   ` Jann Horn
2019-09-19 22:22     ` Christian Brauner
2019-09-19  9:59 ` [PATCH v1 2/3] seccomp: avoid overflow in implicit constant conversion Christian Brauner
2019-09-19  9:59 ` [PATCH v1 3/3] seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE Christian Brauner
2019-09-19 17:13   ` shuah
2019-09-19 21:50     ` Christian Brauner

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org bpf@archiver.kernel.org
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/ public-inbox