archive mirror
 help / color / mirror / Atom feed
From: Christian Brauner <>
	Christian Brauner <>,
	Tycho Andersen <>,
	Tyler Hicks <>
Subject: [PATCH 1/4] seccomp: add SECCOMP_RET_USER_NOTIF_ALLOW
Date: Wed, 18 Sep 2019 10:48:30 +0200	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

This allows the seccomp notifier to continue a syscall. A positive
discussion about this feature was triggered by a post to the
ksummit-discuss mailing list (cf. [3]) and took place during KSummit
(cf. [1]) and again at the containers/checkpoint-restore
micro-conference at Linux Plumbers.

Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF (cf. [4])
which enables a process (watchee) to retrieve an fd for its seccomp
filter. This fd can then be handed to another (usually more privileged)
process (watcher). The watcher will then be able to receive seccomp
messages about the syscalls having been performed by the watchee.

This feature is heavily used in some userspace workloads. For example,
it is currently used to intercept mknod() syscalls in user namespaces
aka in containers.
The mknod() syscall can be easily filtered based on dev_t. This allows
us to only intercept a very specific subset of mknod() syscalls.
Furthermore, mknod() is not possible in user namespaces toto coelo and
so intercepting and denying syscalls that are not in the whitelist on
accident is not a big deal. The watchee won't notice a difference.

In contrast to mknod(), a lot of other syscall we intercept (e.g.
setxattr()) cannot be easily filtered like mknod() because they have
pointer arguments. Additionally, some of them might actually succeed in
user namespaces (e.g. setxattr() for all "user.*" xattrs). Since we
currently cannot tell seccomp to continue from a user notifier we are
stuck with performing all of the syscalls in lieu of the container. This
is a huge security liability since it is extremely difficult to
correctly assume all of the necessary privileges of the calling task
such that the syscall can be successfully emulated without escaping
other additional security restrictions (think missing CAP_MKNOD for
mknod(), or MS_NODEV on a filesystem etc.). This can be solved by
telling seccomp to resume the syscall.

One thing that came up in the discussion was the problem that another
thread could change the memory after userspace has decided to let the
syscall continue which is a well known TOCTOU with seccomp which is
present in other ways already.
The discussion showed that this feature is already very useful for any
syscall without pointer arguments. For any accidentally intercepted
non-pointer syscall it is safe to continue.
For syscalls with pointer arguments there is a race but for any cautious
userspace and the main usec cases the race doesn't matter. The notifier
is intended to be used in a scenario where a more privileged watcher
supervises the syscalls of lesser privileged watchee to allow it to get
around kernel-enforced limitations by performing the syscall for it
whenever deemed save by the watcher. Hence, if a user tricks the watcher
into allowing a syscall they will either get a deny based on
kernel-enforced restrictions later or they will have changed the
arguments in such a way that they manage to perform a syscall with
arguments that they would've been allowed to do anyway.
In general, it is good to point out again, that the notifier fd was not
intended to allow userspace to implement a security policy but rather to
work around kernel security mechanisms in cases where the watcher knows
that a given action is safe to perform.

/* References */
[4]: commit 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")

Signed-off-by: Christian Brauner <>
Cc: Kees Cook <>
Cc: Andy Lutomirski <>
Cc: Will Drewry <>
Cc: Tycho Andersen <>
CC: Tyler Hicks <>
Cc: Jann Horn <>
 include/uapi/linux/seccomp.h |  2 ++
 kernel/seccomp.c             | 24 ++++++++++++++++++++----
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 90734aa5aa36..2c23b9aa6383 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -76,6 +76,8 @@ struct seccomp_notif {
 	struct seccomp_data data;
+#define SECCOMP_RET_USER_NOTIF_ALLOW 0x00000001
 struct seccomp_notif_resp {
 	__u64 id;
 	__s64 val;
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index dba52a7db5e8..cdb90184d6d7 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -75,6 +75,7 @@ struct seccomp_knotif {
 	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
 	int error;
 	long val;
+	u32 flags;
 	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
 	struct completion ready;
@@ -732,11 +733,12 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
 	return filter->notif->next_id++;
-static void seccomp_do_user_notification(int this_syscall,
+static bool seccomp_do_user_notification(int this_syscall,
 					 struct seccomp_filter *match,
 					 const struct seccomp_data *sd)
 	int err;
+	u32 flags = 0;
 	long ret = 0;
 	struct seccomp_knotif n = {};
@@ -764,6 +766,7 @@ static void seccomp_do_user_notification(int this_syscall,
 	if (err == 0) {
 		ret = n.val;
 		err = n.error;
+		flags = n.flags;
@@ -780,8 +783,14 @@ static void seccomp_do_user_notification(int this_syscall,
+	/* perform syscall */
+		return false;
 	syscall_set_return_value(current, task_pt_regs(current),
 				 err, ret);
+	return true;
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
@@ -867,8 +876,10 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 		return 0;
-		seccomp_do_user_notification(this_syscall, match, sd);
-		goto skip;
+		if (seccomp_do_user_notification(this_syscall, match, sd))
+			goto skip;
+		return 0;
 		seccomp_log(this_syscall, 0, action, true);
@@ -1087,7 +1098,11 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
 	if (copy_from_user(&resp, buf, sizeof(resp)))
 		return -EFAULT;
-	if (resp.flags)
+	if (resp.flags & ~SECCOMP_RET_USER_NOTIF_ALLOW)
+		return -EINVAL;
+	if ((resp.flags & SECCOMP_RET_USER_NOTIF_ALLOW) &&
+	    (resp.error || resp.val))
 		return -EINVAL;
 	ret = mutex_lock_interruptible(&filter->notify_lock);
@@ -1116,6 +1131,7 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
 	knotif->state = SECCOMP_NOTIFY_REPLIED;
 	knotif->error = resp.error;
 	knotif->val = resp.val;
+	knotif->flags = resp.flags;

  reply	other threads:[~2019-09-18  8:49 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-18  8:48 [PATCH 0/4] seccomp: continue syscall from notifier Christian Brauner
2019-09-18  8:48 ` Christian Brauner [this message]
2019-09-18 17:30   ` [PATCH 1/4] seccomp: add SECCOMP_RET_USER_NOTIF_ALLOW Kees Cook
2019-09-18 18:07     ` Tycho Andersen
2019-09-19  6:53       ` Christian Brauner
2019-09-19  6:53     ` Christian Brauner
2019-09-18  8:48 ` [PATCH 2/4] seccomp: add two missing ptrace ifdefines Christian Brauner
2019-09-18  9:15   ` Tyler Hicks
2019-09-18 17:33     ` Kees Cook
2019-09-19 10:42       ` Dmitry V. Levin
2019-09-19 16:55         ` Kees Cook
2019-09-19 17:04           ` shuah
2019-09-19 18:30           ` Dmitry V. Levin
2019-09-18  8:48 ` [PATCH 3/4] seccomp: avoid overflow in implicit constant conversion Christian Brauner
2019-09-18 10:01   ` Tyler Hicks
2019-09-18  8:48 ` [PATCH 4/4] seccomp: test SECCOMP_RET_USER_NOTIF_ALLOW Christian Brauner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).