linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] seccomp trap to userspace
@ 2018-05-17 15:12 Tycho Andersen
  2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Tycho Andersen @ 2018-05-17 15:12 UTC
  To: linux-kernel, containers
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman,
	Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda,
	Tobin C . Harding, Tycho Andersen

Hi,

After a while focusing on other things, I finally managed to get a v2 of
this series prepared. I believe I've addressed all the feedback from v1,
except for one major point: switching the communication protocol over
the fd to nlattr. I looked into doing this, but the kernel stuff for
dealing with nlattr seems to require an skb (via nlmsg_{new,put} and
netlink_unicast), which means we need to deal with the netlink sequence
numbers, portids, and create a socket protocol. I can do this if we
still think nlattr is necessary, but based on looking at it, it seems
like a lot of extra code for no real benefit.

I've also added support for passing fds. The code itself is simple, but
the API could/should probably be different, see patch 4 for discussion.

Tycho

Tycho Andersen (4):
  seccomp: add a return code to trap to userspace
  seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  seccomp: add a way to get a listener fd from ptrace
  seccomp: add support for passing fds via USER_NOTIF

 arch/Kconfig                                  |   7 +
 include/linux/seccomp.h                       |  14 +-
 include/uapi/linux/ptrace.h                   |   2 +
 include/uapi/linux/seccomp.h                  |  20 +-
 kernel/ptrace.c                               |   4 +
 kernel/seccomp.c                              | 480 +++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 359 ++++++++++++-
 7 files changed, 878 insertions(+), 8 deletions(-)

-- 
2.17.0


* [PATCH v2 1/4] seccomp: add a return code to trap to userspace
  2018-05-17 15:12 [PATCH v2 0/4] seccomp trap to userspace Tycho Andersen
@ 2018-05-17 15:12 ` Tycho Andersen
  2018-05-17 15:33   ` Oleg Nesterov
                     ` (3 more replies)
  2018-05-17 15:12 ` [PATCH v2 2/4] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
                   ` (3 subsequent siblings)
  4 siblings, 4 replies; 20+ messages in thread
From: Tycho Andersen @ 2018-05-17 15:12 UTC
  To: linux-kernel, containers
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman,
	Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda,
	Tobin C . Harding, Tycho Andersen

This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mknod(), since this checks
capable(CAP_MKNOD). However, harmless devices like /dev/null or
/dev/zero should be ok for containers to mknod, but we'd like to avoid hard
coding some whitelist in the kernel. Another example is mount(), which has
many security restrictions for good reason, but configuration or runtime
knowledge could potentially be used to relax these restrictions.

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since Upstart itself uses ptrace to start services.

The actual implementation of this is fairly small, although getting the
synchronization right was/is slightly complex.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.
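
The "read everything first, then decide" pattern described above can be
sketched roughly as follows. This is only an illustration under stated
assumptions: remote_read_fn is a hypothetical abstraction over however the
handler reads task memory (for example /proc/<pid>/mem), read_local() is a
stand-in that reads the handler's own address space so the sketch can run
without a second task, and the two-device allow list is purely illustrative.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical reader: copy len bytes at addr in task pid into dst. */
typedef int (*remote_read_fn)(pid_t pid, unsigned long addr,
			      void *dst, size_t len);

/* Private copy of the data the policy needs; the task cannot modify it. */
struct path_snapshot {
	char path[256];
};

/* Stand-in reader for illustration: treats addr as a local pointer. */
static int read_local(pid_t pid, unsigned long addr, void *dst, size_t len)
{
	(void)pid;
	memcpy(dst, (const void *)addr, len);
	return 0;
}

/*
 * Step 1: snapshot the task memory exactly once. A fixed-size read is a
 * simplification; a real handler has to cope with unmapped pages.
 */
static int snapshot_path(remote_read_fn rd, pid_t pid, unsigned long addr,
			 struct path_snapshot *snap)
{
	if (rd(pid, addr, snap->path, sizeof(snap->path)) != 0)
		return -1;
	/* Refuse unterminated strings instead of trusting the task. */
	if (memchr(snap->path, '\0', sizeof(snap->path)) == NULL)
		return -1;
	return 0;
}

/* Step 2: apply the policy to the private copy only, never to task memory. */
static int policy_allows(const struct path_snapshot *snap)
{
	return strcmp(snap->path, "/dev/null") == 0 ||
	       strcmp(snap->path, "/dev/zero") == 0;
}
```

Whatever the handler then does on behalf of the task should also use
snap->path rather than re-reading the task's memory, otherwise the race is
reintroduced.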

v2: * make id a u64; the idea here being that it will never overflow,
      because 64 bits is huge (one syscall every nanosecond => wrap every 584
      years)
    * prevent nesting of user notifications: if someone is already attached
      to the tree in one place, nobody else can attach to the tree
    * notify the listener of signals the tracee receives as well
    * implement poll
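
The wrap-around estimate in the first changelog item can be checked with a
few lines of arithmetic. This is a small sketch, not part of the patch;
counter_wrap_seconds() is a hypothetical helper and the one-notification-per-
nanosecond rate is the same deliberately pessimistic assumption used above.

```c
#include <assert.h>

/*
 * Seconds until an unsigned counter of the given width wraps when it is
 * incremented increments_per_second times each second.
 */
static double counter_wrap_seconds(unsigned int bits,
				   double increments_per_second)
{
	double values = 1.0;
	unsigned int i;

	assert(increments_per_second > 0.0);

	/* 2^bits distinct values; doubling is exact in double for bits <= 64. */
	for (i = 0; i < bits; i++)
		values *= 2.0;

	return values / increments_per_second;
}
```

Dividing counter_wrap_seconds(64, 1e9) by the number of seconds in a year
(365.25 * 24 * 60 * 60) gives roughly 584.5 years, whereas a 32-bit id at
the same rate would wrap after about 4.3 seconds.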

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 arch/Kconfig                                  |   7 +
 include/linux/seccomp.h                       |   3 +-
 include/uapi/linux/seccomp.h                  |  18 +-
 kernel/seccomp.c                              | 402 +++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 181 +++++++-
 5 files changed, 605 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8e0d665c8d53..dd99eef3e049 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -401,6 +401,13 @@ config SECCOMP_FILTER
 
 	  See Documentation/prctl/seccomp_filter.txt for details.
 
+config SECCOMP_USER_NOTIFICATION
+	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
+	depends on SECCOMP_FILTER
+	help
+	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
+	  programs to notify a userspace listener that a particular event happened.
+
 config HAVE_GCC_PLUGINS
 	bool
 	help
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index c723a5c4e3ff..0fd3e0676a1c 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,7 +5,8 @@
 #include <uapi/linux/seccomp.h>
 
 #define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
-					 SECCOMP_FILTER_FLAG_LOG)
+					 SECCOMP_FILTER_FLAG_LOG | \
+					 SECCOMP_FILTER_FLAG_GET_LISTENER)
 
 #ifdef CONFIG_SECCOMP
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 2a0bd9dd104d..8160e6cad528 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -17,8 +17,9 @@
 #define SECCOMP_GET_ACTION_AVAIL	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC	1
-#define SECCOMP_FILTER_FLAG_LOG		2
+#define SECCOMP_FILTER_FLAG_TSYNC		1
+#define SECCOMP_FILTER_FLAG_LOG			2
+#define SECCOMP_FILTER_FLAG_GET_LISTENER	4
 
 /*
  * All BPF programs must return a 32-bit value.
@@ -34,6 +35,7 @@
 #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
 #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
 #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
 #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
 #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
 #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
@@ -59,4 +61,16 @@ struct seccomp_data {
 	__u64 args[6];
 };
 
+struct seccomp_notif {
+	__u64 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index dc77548167ef..a169a62cb78a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -38,6 +38,53 @@
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+	SECCOMP_NOTIFY_INIT,
+	SECCOMP_NOTIFY_SENT,
+	SECCOMP_NOTIFY_REPLIED,
+};
+
+struct seccomp_knotif {
+	/* The pid whose filter triggered the notification */
+	pid_t pid;
+
+	/*
+	 * The "cookie" for this request; this is unique for this filter.
+	 */
+	u64 id;
+
+	/*
+	 * The seccomp data. This pointer is valid the entire time this
+	 * notification is active, since it comes from __seccomp_filter which
+	 * eclipses the entire lifecycle here.
+	 */
+	const struct seccomp_data *data;
+
+	/*
+	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
+	 * struct seccomp_knotif is created and starts out in INIT. Once the
+	 * handler reads the notification off of an FD, it transitions to SENT.
+	 * If a signal is received the state transitions back to INIT and
+	 * another message is sent. When the userspace handler replies, state
+	 * transitions to REPLIED.
+	 */
+	enum notify_state state;
+
+	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
+	int error;
+	long val;
+
+	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+	struct completion ready;
+
+	struct list_head list;
+};
+#endif
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -64,6 +111,27 @@ struct seccomp_filter {
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	/*
+	 * A semaphore that users of this notification can wait on for
+	 * changes. Actual reads and writes are still controlled with
+	 * filter->notify_lock.
+	 */
+	struct semaphore request;
+
+	/* A lock for all notification-related accesses. */
+	struct mutex notify_lock;
+
+	/* Is there currently an attached listener? */
+	bool has_listener;
+
+	/* The id of the next request. */
+	u64 next_id;
+
+	/* A list of struct seccomp_knotif elements. */
+	struct list_head notifications;
+#endif
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -383,6 +451,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	mutex_init(&sfilter->notify_lock);
+	sema_init(&sfilter->request, 0);
+	INIT_LIST_HEAD(&sfilter->notifications);
+	sfilter->next_id = get_random_u64();
+#endif
+
 	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
@@ -547,13 +622,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
 #define SECCOMP_LOG_TRACE		(1 << 4)
 #define SECCOMP_LOG_LOG			(1 << 5)
 #define SECCOMP_LOG_ALLOW		(1 << 6)
+#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
 
 static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
 				    SECCOMP_LOG_KILL_THREAD  |
 				    SECCOMP_LOG_TRAP  |
 				    SECCOMP_LOG_ERRNO |
 				    SECCOMP_LOG_TRACE |
-				    SECCOMP_LOG_LOG;
+				    SECCOMP_LOG_LOG |
+				    SECCOMP_LOG_USER_NOTIF;
 
 static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 			       bool requested)
@@ -572,6 +649,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 	case SECCOMP_RET_TRACE:
 		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
+		break;
 	case SECCOMP_RET_LOG:
 		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
 		break;
@@ -645,6 +725,91 @@ void secure_computing_strict(int this_syscall)
 }
 #else
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
+{
+	u64 ret = filter->next_id;
+
+	/* Note: overflow is ok here, the id just needs to be unique */
+	filter->next_id++;
+
+	return ret;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	int err;
+	long ret = 0;
+	struct seccomp_knotif n = {};
+
+	mutex_lock(&match->notify_lock);
+	if (!match->has_listener) {
+		err = -ENOSYS;
+		goto out;
+	}
+
+	n.pid = current->pid;
+	n.state = SECCOMP_NOTIFY_INIT;
+	n.data = sd;
+	n.id = seccomp_next_notify_id(match);
+	init_completion(&n.ready);
+
+	list_add(&n.list, &match->notifications);
+
+	mutex_unlock(&match->notify_lock);
+	up(&match->request);
+
+	err = wait_for_completion_interruptible(&n.ready);
+	mutex_lock(&match->notify_lock);
+
+	/*
+	 * Here it's possible we got a signal and then had to wait on the mutex
+	 * while the reply was sent, so let's be sure there wasn't a response
+	 * in the meantime.
+	 */
+	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
+		/*
+		 * We got a signal. Let's tell userspace about it (potentially
+		 * again, if we had already notified them about the first one).
+		 */
+		if (n.state == SECCOMP_NOTIFY_SENT) {
+			n.state = SECCOMP_NOTIFY_INIT;
+			up(&match->request);
+		}
+		mutex_unlock(&match->notify_lock);
+		err = wait_for_completion_killable(&n.ready);
+		mutex_lock(&match->notify_lock);
+		if (err < 0)
+			goto remove_list;
+	}
+
+	ret = n.val;
+	err = n.error;
+
+	WARN(n.state != SECCOMP_NOTIFY_REPLIED,
+	     "notified about write complete when state is not write");
+
+remove_list:
+	list_del(&n.list);
+out:
+	mutex_unlock(&match->notify_lock);
+	syscall_set_return_value(current, task_pt_regs(current),
+				 err, ret);
+}
+#else
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	/* Must match the signature used by the caller in __seccomp_filter(). */
+	WARN(1, "user notification received, but disabled");
+	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
+	do_exit(SIGSYS);
+}
+#endif
+
 #ifdef CONFIG_SECCOMP_FILTER
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
@@ -722,6 +887,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 
 		return 0;
 
+	case SECCOMP_RET_USER_NOTIF:
+		seccomp_do_user_notification(this_syscall, match, sd);
+		goto skip;
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
 		return 0;
@@ -828,6 +996,11 @@ static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static struct file *init_listener(struct task_struct *,
+				  struct seccomp_filter *);
+#endif
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
@@ -847,6 +1020,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
+	int listener = 0;
+	struct file *listener_f = NULL;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -857,13 +1032,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
+	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+		listener = get_unused_fd_flags(O_RDWR);
+		if (listener < 0) {
+			ret = listener;
+			goto out_free;
+		}
+
+		listener_f = init_listener(current, prepared);
+		if (IS_ERR(listener_f)) {
+			put_unused_fd(listener);
+			ret = PTR_ERR(listener_f);
+			goto out_free;
+		}
+	}
+
 	/*
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_free;
+		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
 
@@ -881,6 +1071,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+		if (ret < 0) {
+			fput(listener_f);
+			put_unused_fd(listener);
+		} else {
+			fd_install(listener, listener_f);
+			ret = listener;
+		}
+	}
 out_free:
 	seccomp_filter_free(prepared);
 	return ret;
@@ -909,6 +1109,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
 	case SECCOMP_RET_LOG:
 	case SECCOMP_RET_ALLOW:
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
+			break;
 	default:
 		return -EOPNOTSUPP;
 	}
@@ -1105,6 +1308,7 @@ long seccomp_get_metadata(struct task_struct *task,
 #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
 #define SECCOMP_RET_TRAP_NAME		"trap"
 #define SECCOMP_RET_ERRNO_NAME		"errno"
+#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
 #define SECCOMP_RET_TRACE_NAME		"trace"
 #define SECCOMP_RET_LOG_NAME		"log"
 #define SECCOMP_RET_ALLOW_NAME		"allow"
@@ -1114,6 +1318,7 @@ static const char seccomp_actions_avail[] =
 				SECCOMP_RET_KILL_THREAD_NAME	" "
 				SECCOMP_RET_TRAP_NAME		" "
 				SECCOMP_RET_ERRNO_NAME		" "
+				SECCOMP_RET_USER_NOTIF_NAME     " "
 				SECCOMP_RET_TRACE_NAME		" "
 				SECCOMP_RET_LOG_NAME		" "
 				SECCOMP_RET_ALLOW_NAME;
@@ -1131,6 +1336,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
 	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
 	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
 	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
+	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
 	{ }
 };
 
@@ -1279,3 +1485,195 @@ static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_knotif *knotif;
+
+	mutex_lock(&filter->notify_lock);
+
+	/*
+	 * If this file is being closed because e.g. the task who owned it
+	 * died, let's wake everyone up who was waiting on us.
+	 */
+	list_for_each_entry(knotif, &filter->notifications, list) {
+		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
+			continue;
+
+		knotif->state = SECCOMP_NOTIFY_REPLIED;
+		knotif->error = -ENOSYS;
+		knotif->val = 0;
+
+		complete(&knotif->ready);
+	}
+
+	filter->has_listener = false;
+	mutex_unlock(&filter->notify_lock);
+	__put_seccomp_filter(filter);
+	return 0;
+}
+
+static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
+				   size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = f->private_data;
+	struct seccomp_knotif *knotif = NULL, *cur;
+	struct seccomp_notif unotif;
+	ssize_t ret;
+
+	/* No offset reads. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	ret = down_interruptible(&filter->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each_entry(cur, &filter->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT) {
+			knotif = cur;
+			break;
+		}
+	}
+
+	/*
+	 * If we didn't find a notification, it could be that the task was
+	 * interrupted between the time we were woken and when we were able to
+	 * acquire notify_lock. Should we retry here or just -ENOENT? -ENOENT
+	 * for now.
+	 */
+	if (!knotif) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	unotif.id = knotif->id;
+	unotif.pid = knotif->pid;
+	unotif.data = *(knotif->data);
+
+	size = min_t(size_t, size, sizeof(struct seccomp_notif));
+	if (copy_to_user(buf, &unotif, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = sizeof(unotif);
+	knotif->state = SECCOMP_NOTIFY_SENT;
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
+				    size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL;
+	ssize_t ret = -EINVAL;
+
+	/* No offset writes. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	size = min_t(size_t, size, sizeof(resp));
+	if (copy_from_user(&resp, buf, size))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(knotif, &filter->notifications, list) {
+		if (knotif->id == resp.id)
+			break;
+	}
+
+	if (!knotif || knotif->id != resp.id) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Allow exactly one reply. */
+	if (knotif->state != SECCOMP_NOTIFY_SENT) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_REPLIED;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static __poll_t seccomp_notify_poll(struct file *file,
+				    struct poll_table_struct *poll_tab)
+{
+	struct seccomp_filter *filter = file->private_data;
+	__poll_t ret = 0;
+	struct seccomp_knotif *cur;
+
+	/* poll can't return an errno, so report a failed lock as EPOLLERR */
+	if (mutex_lock_interruptible(&filter->notify_lock) < 0)
+		return EPOLLERR;
+
+	list_for_each_entry(cur, &filter->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT)
+			ret |= EPOLLIN | EPOLLRDNORM;
+		if (cur->state == SECCOMP_NOTIFY_SENT)
+			ret |= EPOLLOUT | EPOLLWRNORM;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.read = seccomp_notify_read,
+	.write = seccomp_notify_write,
+	.poll = seccomp_notify_poll,
+	.release = seccomp_notify_release,
+};
+
+static struct file *init_listener(struct task_struct *task,
+				  struct seccomp_filter *filter)
+{
+	struct file *ret = ERR_PTR(-EBUSY);
+	struct seccomp_filter *cur;
+	bool have_listener = false;
+
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_lock(&cur->notify_lock);
+		if (cur->has_listener)
+			have_listener = true;
+	}
+
+	if (have_listener)
+		goto out;
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret))
+		goto out;
+
+
+	/* The file has a reference to it now */
+	__get_seccomp_filter(filter);
+	filter->has_listener = true;
+
+out:
+	for (cur = task->seccomp.filter; cur; cur = cur->prev)
+		mutex_unlock(&cur->notify_lock);
+
+	return ret;
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 168c66d74fc5..bb96df66222f 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -40,10 +40,12 @@
 #include <sys/fcntl.h>
 #include <sys/mman.h>
 #include <sys/times.h>
+#include <sys/socket.h>
 
 #define _GNU_SOURCE
 #include <unistd.h>
 #include <sys/syscall.h>
+#include <poll.h>
 
 #include "../kselftest_harness.h"
 
@@ -150,6 +152,24 @@ struct seccomp_metadata {
 };
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
+#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+struct seccomp_notif {
+	__u64 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -2072,7 +2092,8 @@ TEST(seccomp_syscall_mode_lock)
 TEST(detect_seccomp_filter_flags)
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
-				 SECCOMP_FILTER_FLAG_LOG };
+				 SECCOMP_FILTER_FLAG_LOG,
+				 SECCOMP_FILTER_FLAG_GET_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
@@ -2917,6 +2938,164 @@ TEST(get_metadata)
 	ASSERT_EQ(0, kill(pid, SIGKILL));
 }
 
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int read_notif(int listener, struct seccomp_notif *req)
+{
+	int ret;
+
+	do {
+		errno = 0;
+		ret = read(listener, req, sizeof(*req));
+	} while (ret == -1 && errno == ENOENT);
+	return ret;
+}
+
+static void signal_handler(int signal)
+{
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	struct pollfd pollfd;
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		if (user_trap_syscall(__NR_getpid, 0) < 0)
+			exit(1);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_GET_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	/* Installing a second listener in the chain should EBUSY */
+	EXPECT_EQ(user_trap_syscall(__NR_getpid,
+				    SECCOMP_FILTER_FLAG_GET_LISTENER),
+		  -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLOUT);
+
+	EXPECT_EQ(req.data.nr,  __NR_getpid);
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read(listener, &req, sizeof(req));
+	EXPECT_EQ(ret, sizeof(req));
+
+	EXPECT_EQ(kill(pid, SIGKILL), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+
+	resp.id = req.id;
+	ret = write(listener, &resp, sizeof(resp));
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	/*
+	 * Check that we get another notification about a signal in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+			perror("signal");
+			exit(1);
+		}
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.id = req.id;
+	ret = write(listener, &resp, sizeof(resp));
+	EXPECT_EQ(ret, sizeof(resp));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	close(listener);
+}
+
 /*
  * TODO:
  * - add microbenchmarks
-- 
2.17.0


* [PATCH v2 2/4] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  2018-05-17 15:12 [PATCH v2 0/4] seccomp trap to userspace Tycho Andersen
  2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen
@ 2018-05-17 15:12 ` Tycho Andersen
  2018-05-17 15:12 ` [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Tycho Andersen @ 2018-05-17 15:12 UTC
  To: linux-kernel, containers
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman,
	Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda,
	Tobin C . Harding, Tycho Andersen

In the next commit we'll use this same helper to get a listener for the
nth filter, so we need it available outside of CHECKPOINT_RESTORE. This is
slightly looser than necessary, because it really could be
CHECKPOINT_RESTORE || USER_NOTIFICATION, but it's declared static and this
complicates the code less, so hopefully it's ok.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>

v2: new in v2
---
 kernel/seccomp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index a169a62cb78a..f136eca93f2f 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1180,7 +1180,7 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
 	return do_seccomp(op, 0, uargs);
 }
 
-#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
+#if defined(CONFIG_SECCOMP_FILTER)
 static struct seccomp_filter *get_nth_filter(struct task_struct *task,
 					     unsigned long filter_off)
 {
@@ -1227,6 +1227,7 @@ static struct seccomp_filter *get_nth_filter(struct task_struct *task,
 	return filter;
 }
 
+#if defined(CONFIG_CHECKPOINT_RESTORE)
 long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
 			void __user *data)
 {
@@ -1299,7 +1300,8 @@ long seccomp_get_metadata(struct task_struct *task,
 	__put_seccomp_filter(filter);
 	return ret;
 }
-#endif
+#endif /* CONFIG_CHECKPOINT_RESTORE */
+#endif /* CONFIG_SECCOMP_FILTER */
 
 #ifdef CONFIG_SYSCTL
 
-- 
2.17.0


* [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace
  2018-05-17 15:12 [PATCH v2 0/4] seccomp trap to userspace Tycho Andersen
  2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen
  2018-05-17 15:12 ` [PATCH v2 2/4] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
@ 2018-05-17 15:12 ` Tycho Andersen
  2018-05-17 15:41   ` Oleg Nesterov
  2018-05-18 14:05   ` Christian Brauner
  2018-05-17 15:12 ` [PATCH v2 4/4] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen
  2018-05-18 14:03 ` [PATCH v2 0/4] seccomp trap to userspace Christian Brauner
  4 siblings, 2 replies; 20+ messages in thread
From: Tycho Andersen @ 2018-05-17 15:12 UTC
  To: linux-kernel, containers
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman,
	Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda,
	Tobin C . Harding, Tycho Andersen

As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
version which can acquire filters is useful. There are at least two reasons
this is preferable, even though it uses ptrace:

1. You can control tasks that aren't cooperating with you
2. You can control tasks whose filters block sendmsg() and socket(); if the
   task installs a filter which blocks these calls, there's no way with
   SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.

v2: fix a bug where listener mode was not unset when an unused fd was not
    available

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 include/linux/seccomp.h                       | 11 ++++
 include/uapi/linux/ptrace.h                   |  2 +
 kernel/ptrace.c                               |  4 ++
 kernel/seccomp.c                              | 27 ++++++++
 tools/testing/selftests/seccomp/seccomp_bpf.c | 66 +++++++++++++++++++
 5 files changed, 110 insertions(+)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 0fd3e0676a1c..10e684899b7b 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -111,4 +111,15 @@ static inline long seccomp_get_metadata(struct task_struct *task,
 	return -EINVAL;
 }
 #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+extern long seccomp_get_listener(struct task_struct *task,
+				 unsigned long filter_off);
+#else
+static inline long seccomp_get_listener(struct task_struct *task,
+					unsigned long filter_off)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_SECCOMP_USER_NOTIFICATION */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index d5a1b8a492b9..dc0abf81de3b 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -73,6 +73,8 @@ struct seccomp_metadata {
 	__u64 flags;		/* Output: filter's flags */
 };
 
+#define PTRACE_SECCOMP_GET_LISTENER	0x420e
+
 /* Read signals from a shared (process wide) queue */
 #define PTRACE_PEEKSIGINFO_SHARED	(1 << 0)
 
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 21fec73d45d4..fcbdb6f4dc07 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
 		ret = seccomp_get_metadata(child, addr, datavp);
 		break;
 
+	case PTRACE_SECCOMP_GET_LISTENER:
+		ret = seccomp_get_listener(child, addr);
+		break;
+
 	default:
 		break;
 	}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index f136eca93f2f..7c23aee76bb4 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1678,4 +1678,31 @@ static struct file *init_listener(struct task_struct *task,
 
 	return ret;
 }
+
+long seccomp_get_listener(struct task_struct *task,
+			  unsigned long filter_off)
+{
+	struct seccomp_filter *filter;
+	struct file *listener;
+	int fd;
+
+	filter = get_nth_filter(task, filter_off);
+	if (IS_ERR(filter))
+		return PTR_ERR(filter);
+
+	fd = get_unused_fd_flags(O_RDWR);
+	if (fd < 0) {
+		__put_seccomp_filter(filter);
+		return fd;
+	}
+
+	listener = init_listener(task, task->seccomp.filter);
+	if (IS_ERR(listener)) {
+		put_unused_fd(fd);
+		return PTR_ERR(listener);
+	}
+
+	fd_install(fd, listener);
+	return fd;
+}
 #endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index bb96df66222f..473905f33e0b 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -178,6 +178,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
 }
 #endif
 
+#ifndef PTRACE_SECCOMP_GET_LISTENER
+#define PTRACE_SECCOMP_GET_LISTENER 0x420e
+#endif
+
 #if __BYTE_ORDER == __LITTLE_ENDIAN
 #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
 #elif __BYTE_ORDER == __BIG_ENDIAN
@@ -3096,6 +3100,68 @@ TEST(get_user_notification_syscall)
 	close(listener);
 }
 
+TEST(get_user_notification_ptrace)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Test that we get ENOSYS while not attached */
+		EXPECT_EQ(syscall(__NR_getpid), -1);
+		EXPECT_EQ(errno, ENOSYS);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+
+		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+
+	/* EBUSY for second listener */
+	EXPECT_EQ(ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0), -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done and respond with magic */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	close(listener);
+}
+
 /*
  * TODO:
  * - add microbenchmarks
-- 
2.17.0

* [PATCH v2 4/4] seccomp: add support for passing fds via USER_NOTIF
  2018-05-17 15:12 [PATCH v2 0/4] seccomp trap to userspace Tycho Andersen
                   ` (2 preceding siblings ...)
  2018-05-17 15:12 ` [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
@ 2018-05-17 15:12 ` Tycho Andersen
  2018-05-18 14:03 ` [PATCH v2 0/4] seccomp trap to userspace Christian Brauner
  4 siblings, 0 replies; 20+ messages in thread
From: Tycho Andersen @ 2018-05-17 15:12 UTC (permalink / raw)
  To: linux-kernel, containers
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman,
	Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda,
	Tobin C . Harding, Tycho Andersen

The idea here is that the userspace handler should be able to pass an fd
back to the trapped task, for example so it can be returned from socket().

I've proposed one API here, but I'm open to other options. In particular,
this only lets you return an fd from a syscall, which may not be enough in
all cases. For example, if an fd is written to an output parameter instead
of returned, the current API can't handle this. Another case is that
netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
ever decides to install an fd and output it, we wouldn't be able to handle
this either.

Still, the vast majority of interesting cases are covered by this API, so
perhaps it is enough.

I've left it as a separate commit for two reasons:
  * It illustrates the way in which we would grow struct seccomp_notif and
    struct seccomp_notif_resp without using netlink
  * It shows just how little code is needed to accomplish this :)
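
As a rough illustration of the first point, here is a minimal userspace
sketch of how a reader of a growing struct can stay compatible with
callers that only know the older, shorter layout. It is not taken from
this series: struct notif_resp, NOTIF_RESP_SIZE_V1 and parse_resp() are
hypothetical names, and the policy of rejecting buffers longer than the
known struct is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace mirror of the response layout in this patch. */
struct notif_resp {
	uint64_t id;
	int32_t  error;
	int64_t  val;
	uint8_t  return_fd;	/* field added by the newer layout */
	uint32_t fd;		/* field added by the newer layout */
};

/* Size of the older layout: everything up to and including val. */
#define NOTIF_RESP_SIZE_V1 \
	(offsetof(struct notif_resp, val) + sizeof(int64_t))

/*
 * Copy a response of len bytes into out.
 * Fields the caller did not supply are left zeroed, so an old caller
 * behaves as if return_fd were 0.
 * Returns 0 on success, -1 if len is shorter than the old layout or
 * longer than the layout this code knows about (an assumed policy).
 */
static int parse_resp(const void *buf, size_t len, struct notif_resp *out)
{
	if (len < NOTIF_RESP_SIZE_V1 || len > sizeof(*out))
		return -1;

	memset(out, 0, sizeof(*out));
	memcpy(out, buf, len);
	return 0;
}
```

A caller built against the old layout passes NOTIF_RESP_SIZE_V1 bytes and
gets return_fd == 0; a newer caller passes sizeof(struct notif_resp) and
its fd fields are honoured.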

v2: new in v2

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 include/uapi/linux/seccomp.h                  |   2 +
 kernel/seccomp.c                              |  49 +++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 112 ++++++++++++++++++
 3 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 8160e6cad528..3124427219cb 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -71,6 +71,8 @@ struct seccomp_notif_resp {
 	__u64 id;
 	__s32 error;
 	__s64 val;
+	__u8 return_fd;
+	__u32 fd;
 };
 
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 7c23aee76bb4..c783b8dcd001 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -77,6 +77,8 @@ struct seccomp_knotif {
 	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
 	int error;
 	long val;
+	struct file *file;
+	unsigned int flags;
 
 	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
 	struct completion ready;
@@ -785,13 +787,35 @@ static void seccomp_do_user_notification(int this_syscall,
 			goto remove_list;
 	}
 
-	ret = n.val;
-	err = n.error;
+	if (n.file) {
+		int fd;
+
+		fd = get_unused_fd_flags(n.flags);
+		if (fd < 0) {
+			err = fd;
+			ret = -1;
+			goto remove_list;
+		}
+
+		ret = fd;
+		err = 0;
+
+		fd_install(fd, n.file);
+		/* Don't fput, since fd has a reference now */
+		n.file = NULL;
+	} else {
+		ret = n.val;
+		err = n.error;
+	}
+
 
 	WARN(n.state != SECCOMP_NOTIFY_REPLIED,
 	     "notified about write complete when state is not write");
 
 remove_list:
+	if (n.file)
+		fput(n.file);
+
 	list_del(&n.list);
 out:
 	mutex_unlock(&match->notify_lock);
@@ -1610,6 +1634,27 @@ static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
 	knotif->state = SECCOMP_NOTIFY_REPLIED;
 	knotif->error = resp.error;
 	knotif->val = resp.val;
+
+	if (resp.return_fd) {
+		struct fd fd;
+
+		/*
+		 * This is a little hokey: we need a real fget() (i.e. not
+		 * __fget_light(), which is what fdget does), but we also need
+		 * the flags from struct fd. So, we get it, put it, and get it
+		 * again for real.
+		 */
+		fd = fdget(resp.fd);
+		knotif->flags = fd.flags;
+		fdput(fd);
+
+		knotif->file = fget(resp.fd);
+		if (!knotif->file) {
+			ret = -EBADF;
+			goto out;
+		}
+	}
+
 	complete(&knotif->ready);
 out:
 	mutex_unlock(&filter->notify_lock);
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 473905f33e0b..b04d3ecc61f4 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -167,6 +167,8 @@ struct seccomp_notif_resp {
 	__u64 id;
 	__s32 error;
 	__s64 val;
+	__u8 return_fd;
+	__u32 fd;
 };
 #endif
 
@@ -3162,6 +3164,116 @@ TEST(get_user_notification_ptrace)
 	close(listener);
 }
 
+TEST(user_notification_pass_fd)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	long ret;
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		char buf[16];
+
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+		close(sk_pair[1]);
+
+		/* An fd from getpid(). Let the games begin. */
+		fd = syscall(__NR_getpid);
+		EXPECT_GT(fd, 0);
+		EXPECT_EQ(read(fd, buf, sizeof(buf)), 12);
+		close(fd);
+
+		exit(strcmp("hello world", buf));
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done installing so it can do a getpid */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+	close(sk_pair[0]);
+
+	/* Make a new socket pair so we can send half across */
+	EXPECT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.id = req.id;
+	resp.return_fd = 1;
+	resp.fd = sk_pair[1];
+	EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+	close(sk_pair[1]);
+
+	EXPECT_EQ(write(sk_pair[0], "hello world\0", 12), 12);
+	close(sk_pair[0]);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+	close(listener);
+}
+
+TEST(user_notification_struct_size_mismatch)
+{
+	pid_t pid;
+	long ret;
+	int status, listener, len;
+	struct seccomp_notif req;
+	struct seccomp_notif_resp resp;
+
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_GET_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	/*
+	 * Only write a partial structure: this is what was available before we
+	 * had fd support.
+	 */
+	len = offsetof(struct seccomp_notif_resp, val) + sizeof(resp.val);
+	EXPECT_EQ(write(listener, &resp, len), len);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
 /*
  * TODO:
  * - add microbenchmarks
-- 
2.17.0

* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace
  2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen
@ 2018-05-17 15:33   ` Oleg Nesterov
  2018-05-17 15:39     ` Tycho Andersen
  2018-05-18 14:04   ` Christian Brauner
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2018-05-17 15:33 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Tobin C . Harding

I didn't read this series yet, and I don't even understand what you are
trying to do, just one question...

On 05/17, Tycho Andersen wrote:
>
> +static struct file *init_listener(struct task_struct *task,
> +				  struct seccomp_filter *filter)
> +{
> +	struct file *ret = ERR_PTR(-EBUSY);
> +	struct seccomp_filter *cur;
> +	bool have_listener = false;
> +
> +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +		mutex_lock(&cur->notify_lock);

Did you test this patch with CONFIG_LOCKDEP ?

From lockdep pov this loop tries to take the same lock twice or more, it should
complain.

Oleg.

* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace
  2018-05-17 15:33   ` Oleg Nesterov
@ 2018-05-17 15:39     ` Tycho Andersen
  2018-05-17 15:46       ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Tycho Andersen @ 2018-05-17 15:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Tobin C . Harding

Hi Oleg,

Thanks for taking a look!

On Thu, May 17, 2018 at 05:33:24PM +0200, Oleg Nesterov wrote:
> I didn't read this series yet, and I don't even understand what you are
> trying to do, just one question...
> 
> On 05/17, Tycho Andersen wrote:
> >
> > +static struct file *init_listener(struct task_struct *task,
> > +				  struct seccomp_filter *filter)
> > +{
> > +	struct file *ret = ERR_PTR(-EBUSY);
> > +	struct seccomp_filter *cur;
> > +	bool have_listener = false;
> > +
> > +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> > +		mutex_lock(&cur->notify_lock);
> 
> Did you test this patch with CONFIG_LOCKDEP ?

Yes, with,

CONFIG_LOCKDEP=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y

> From lockdep pov this loop tries to take the same lock twice or more, it should
> complain.

It didn't, but I guess that's because it's not trying to take the same lock
twice -- the pointer cur is changing in the loop. Unless I'm misunderstanding
what you're saying.

The idea behind this code is to lock the entire chain of filters up to the
parent so that we can ensure none of them have a listener installed. This is
based on a suggestion from Andy last review cycle to not allow two listeners,
since nesting would be confusing.

Tycho

* Re: [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace
  2018-05-17 15:12 ` [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
@ 2018-05-17 15:41   ` Oleg Nesterov
  2018-05-17 15:57     ` Tycho Andersen
  2018-05-18 14:05   ` Christian Brauner
  1 sibling, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2018-05-17 15:41 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Tobin C . Harding

again, I don't understand this code yet, but

On 05/17, Tycho Andersen wrote:
>
> +long seccomp_get_listener(struct task_struct *task,
> +			  unsigned long filter_off)
> +{
> +	struct seccomp_filter *filter;
> +	struct file *listener;
> +	int fd;
> +
> +	filter = get_nth_filter(task, filter_off);
> +	if (IS_ERR(filter))
> +		return PTR_ERR(filter);
> +
> +	fd = get_unused_fd_flags(O_RDWR);
> +	if (fd < 0) {
> +		__put_seccomp_filter(filter);
> +		return fd;
> +	}
> +
> +	listener = init_listener(task, task->seccomp.filter);
> +	if (IS_ERR(listener)) {
> +		put_unused_fd(fd);
> +		return PTR_ERR(listener);

__put_seccomp_filter() ?

and since init_listener() does __get_seccomp_filter() on success, it is needed
unconditionally?

Oleg.

* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace
  2018-05-17 15:39     ` Tycho Andersen
@ 2018-05-17 15:46       ` Oleg Nesterov
  2018-05-24 15:28         ` Tycho Andersen
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2018-05-17 15:46 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Tobin C . Harding

On 05/17, Tycho Andersen wrote:
>
> > From lockdep pov this loop tries to take the same lock twice or more, it should
> > complain.
>
> It didn't, but I guess that's because it's not trying to take the same lock
> twice -- the pointer cur is changing in the loop.

Yes, I see. But this is the same lock for lockdep, it has the same class.
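
To make the "same class" point concrete, here is a toy userspace model of
a class-based check. It is only a sketch and does not reflect how lockdep
is really implemented (which tracks dependency graphs and supports
subclasses through annotations such as mutex_lock_nested()). The names
toy_lock, held_locks and toy_acquire() are hypothetical. The assumption
modelled is that every lock initialised from one mutex_init() call site
shares a class, so walking a chain of such locks looks like re-taking a
held lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_HELD 8

/* Toy lock: distinct objects may share one class_id. */
struct toy_lock {
	int class_id;
};

/* Classes currently held by one (imaginary) task. */
struct held_locks {
	int classes[MAX_HELD];
	size_t n;
};

/*
 * Record an acquisition. Returns false if a class-based validator
 * would flag it, i.e. a lock of the same class is already held.
 * Also returns false when the toy tracking table is full.
 */
static bool toy_acquire(struct held_locks *held, const struct toy_lock *lock)
{
	for (size_t i = 0; i < held->n; i++)
		if (held->classes[i] == lock->class_id)
			return false;	/* same class already held */

	if (held->n == MAX_HELD)
		return false;

	held->classes[held->n++] = lock->class_id;
	return true;
}
```

With two different toy_lock objects that share class_id 1, the second
toy_acquire() is flagged even though the pointers differ, which is the
situation in the loop over cur->prev.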

Oleg.

* Re: [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace
  2018-05-17 15:41   ` Oleg Nesterov
@ 2018-05-17 15:57     ` Tycho Andersen
  2018-05-17 15:59       ` Tycho Andersen
  0 siblings, 1 reply; 20+ messages in thread
From: Tycho Andersen @ 2018-05-17 15:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Tobin C . Harding

On Thu, May 17, 2018 at 05:41:39PM +0200, Oleg Nesterov wrote:
> again, I don't understand this code yet, but
> 
> On 05/17, Tycho Andersen wrote:
> >
> > +long seccomp_get_listener(struct task_struct *task,
> > +			  unsigned long filter_off)
> > +{
> > +	struct seccomp_filter *filter;
> > +	struct file *listener;
> > +	int fd;
> > +
> > +	filter = get_nth_filter(task, filter_off);
> > +	if (IS_ERR(filter))
> > +		return PTR_ERR(filter);
> > +
> > +	fd = get_unused_fd_flags(O_RDWR);
> > +	if (fd < 0) {
> > +		__put_seccomp_filter(filter);
> > +		return fd;
> > +	}
> > +
> > +	listener = init_listener(task, task->seccomp.filter);
> > +	if (IS_ERR(listener)) {
> > +		put_unused_fd(fd);
> > +		return PTR_ERR(listener);
> 
> __put_seccomp_filter() ?

Yes, I think you're right here.

> and since init_listener() does __get_seccomp_filter() on success, it is needed
> unconditionally?

I think there does need to be a __get_seccomp_filter() on success in
init_listener(), because it's paired with the __put_seccomp_filter in
seccomp_notify_release. The listener fd has a reference to the filter,
and that shouldn't go away until after the fd is freed.

Tycho

* Re: [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace
  2018-05-17 15:57     ` Tycho Andersen
@ 2018-05-17 15:59       ` Tycho Andersen
  0 siblings, 0 replies; 20+ messages in thread
From: Tycho Andersen @ 2018-05-17 15:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Tobin C . Harding

On Thu, May 17, 2018 at 09:57:33AM -0600, Tycho Andersen wrote:
> On Thu, May 17, 2018 at 05:41:39PM +0200, Oleg Nesterov wrote:
> > and since init_listener() does __get_seccomp_filter() on success, it is needed
> > unconditionally?
> 
> I think there does need to be a __get_seccomp_filter() on success in
> init_listener(), because it's paired with the __put_seccomp_filter in
> seccomp_notify_release. The listener fd has a reference to the filter,
> and that shouldn't go away until after the fd is freed.

Oh, sorry. I see what you meant here. Yes, it should be unconditional.
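
For reference, the reference-count pairing discussed here can be modelled
in plain userspace C. This is a sketch under stated assumptions, not the
fix that was eventually posted: toy_filter, toy_lookup(),
toy_init_listener() and toy_get_listener() are hypothetical stand-ins for
the filter, get_nth_filter(), init_listener() and seccomp_get_listener(),
and fd_available simulates whether get_unused_fd_flags() would succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_filter {
	int refs;
	bool has_listener;
};

static void toy_get(struct toy_filter *f) { f->refs++; }
static void toy_put(struct toy_filter *f) { f->refs--; }

/* Stand-in for get_nth_filter(): returns the filter with an extra reference. */
static struct toy_filter *toy_lookup(struct toy_filter *f)
{
	toy_get(f);
	return f;
}

/* Stand-in for init_listener(): on success the listener keeps its own reference. */
static int toy_init_listener(struct toy_filter *f)
{
	if (f->has_listener)
		return -EBUSY;
	f->has_listener = true;
	toy_get(f);
	return 0;
}

/* Stand-in for seccomp_get_listener() with the unconditional put. */
static int toy_get_listener(struct toy_filter *task_filter, bool fd_available)
{
	struct toy_filter *f = toy_lookup(task_filter);
	int ret;

	if (!fd_available) {
		ret = -EMFILE;
		goto out;
	}

	ret = toy_init_listener(f);
out:
	toy_put(f);	/* drop the lookup reference on every path */
	return ret;
}
```

On failure the count returns to its starting value; on success it ends
one higher, and that remaining reference is the one the listener fd would
drop in its release handler.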

Tycho

* Re: [PATCH v2 0/4] seccomp trap to userspace
  2018-05-17 15:12 [PATCH v2 0/4] seccomp trap to userspace Tycho Andersen
                   ` (3 preceding siblings ...)
  2018-05-17 15:12 ` [PATCH v2 4/4] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen
@ 2018-05-18 14:03 ` Christian Brauner
  4 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2018-05-18 14:03 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: linux-kernel, containers, Tobin C . Harding, Kees Cook,
	Akihiro Suda, Oleg Nesterov, Andy Lutomirski, Eric W . Biederman,
	Christian Brauner, Tyler Hicks

On Thu, May 17, 2018 at 09:12:14AM -0600, Tycho Andersen wrote:
> Hi,
> 
> After a while focusing on other things, I finally managed to get a v2 of
> this series prepared. I believe I've addressed all the feedback from v1,
> except for one major point: switching the communication protocol over
> the fd to nlattr. I looked into doing this, but the kernel stuff for
> dealing with nlattr seems to require an skb (via nlmsg_{new,put} and
> netlink_unicast), which means we need to deal with the netlink sequence
> numbers, portids, and create a socket protocol. I can do this if we
> still think nlattr is necessary, but based on looking at it, it seems
> like a lot of extra code for no real benefit.

Yes, we've had that discussion before and I agree. I fail to see the
benefit here too.

Christian

> 
> I've also added support for passing fds. The code itself is simple, but
> the API could/should probably be different, see patch 4 for discussion.
> 
> Tycho
> 
> Tycho Andersen (4):
>   seccomp: add a return code to trap to userspace
>   seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
>   seccomp: add a way to get a listener fd from ptrace
>   seccomp: add support for passing fds via USER_NOTIF
> 
>  arch/Kconfig                                  |   7 +
>  include/linux/seccomp.h                       |  14 +-
>  include/uapi/linux/ptrace.h                   |   2 +
>  include/uapi/linux/seccomp.h                  |  20 +-
>  kernel/ptrace.c                               |   4 +
>  kernel/seccomp.c                              | 480 +++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 359 ++++++++++++-
>  7 files changed, 878 insertions(+), 8 deletions(-)
> 
> -- 
> 2.17.0
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace
  2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen
  2018-05-17 15:33   ` Oleg Nesterov
@ 2018-05-18 14:04   ` Christian Brauner
  2018-05-18 15:21     ` Tycho Andersen
  2018-05-19  0:14   ` kbuild test robot
  2018-05-19  5:01   ` kbuild test robot
  3 siblings, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2018-05-18 14:04 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: linux-kernel, containers, Tobin C . Harding, Kees Cook,
	Akihiro Suda, Oleg Nesterov, Andy Lutomirski, Eric W . Biederman,
	Christian Brauner, Tyler Hicks

On Thu, May 17, 2018 at 09:12:15AM -0600, Tycho Andersen wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
> 
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
> 
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
> 
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
> 
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
> 
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
> 
> v2: * make id a u64; the idea here being that it will never overflow,
>       because 64 is huge (one syscall every nanosecond => wrap every 584
>       years)
>     * prevent nesting of user notifications: if someone is already attached
>       to the tree in one place, nobody else can attach to the tree
>     * notify the listener of signals the tracee receives as well
>     * implement poll
> 
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  arch/Kconfig                                  |   7 +
>  include/linux/seccomp.h                       |   3 +-
>  include/uapi/linux/seccomp.h                  |  18 +-
>  kernel/seccomp.c                              | 402 +++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 181 +++++++-
>  5 files changed, 605 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 8e0d665c8d53..dd99eef3e049 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -401,6 +401,13 @@ config SECCOMP_FILTER
>  
>  	  See Documentation/prctl/seccomp_filter.txt for details.
>  
> +config SECCOMP_USER_NOTIFICATION
> +	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
> +	depends on SECCOMP_FILTER
> +	help
> +	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
> +	  programs to notify a userspace listener that a particular event happened.
> +
>  config HAVE_GCC_PLUGINS
>  	bool
>  	help
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index c723a5c4e3ff..0fd3e0676a1c 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,7 +5,8 @@
>  #include <uapi/linux/seccomp.h>
>  
>  #define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
> -					 SECCOMP_FILTER_FLAG_LOG)
> +					 SECCOMP_FILTER_FLAG_LOG | \
> +					 SECCOMP_FILTER_FLAG_GET_LISTENER)
>  
>  #ifdef CONFIG_SECCOMP
>  
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 2a0bd9dd104d..8160e6cad528 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,8 +17,9 @@
>  #define SECCOMP_GET_ACTION_AVAIL	2
>  
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC	1
> -#define SECCOMP_FILTER_FLAG_LOG		2
> +#define SECCOMP_FILTER_FLAG_TSYNC		1
> +#define SECCOMP_FILTER_FLAG_LOG			2
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER	4
>  
>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -34,6 +35,7 @@
>  #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
>  #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
> @@ -59,4 +61,16 @@ struct seccomp_data {
>  	__u64 args[6];
>  };
>  
> +struct seccomp_notif {
> +	__u64 id;
> +	pid_t pid;
> +	struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +	__u64 id;
> +	__s32 error;
> +	__s64 val;
> +};
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index dc77548167ef..a169a62cb78a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -38,6 +38,53 @@
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +	SECCOMP_NOTIFY_INIT,
> +	SECCOMP_NOTIFY_SENT,
> +	SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +	/* The pid whose filter triggered the notification */
> +	pid_t pid;
> +
> +	/*
> +	 * The "cookie" for this request; this is unique for this filter.
> +	 */
> +	u32 id;
> +
> +	/*
> +	 * The seccomp data. This pointer is valid the entire time this
> +	 * notification is active, since it comes from __seccomp_filter which
> +	 * eclipses the entire lifecycle here.
> +	 */
> +	const struct seccomp_data *data;
> +
> +	/*
> +	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +	 * struct seccomp_knotif is created and starts out in INIT. Once the
> +	 * handler reads the notification off of an FD, it transitions to SENT.
> +	 * If a signal is received the state transitions back to INIT and
> +	 * another message is sent. When the userspace handler replies, state
> +	 * transitions to REPLIED.
> +	 */
> +	enum notify_state state;
> +
> +	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +	int error;
> +	long val;
> +
> +	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +	struct completion ready;
> +
> +	struct list_head list;
> +};
> +#endif
> +
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
>   *
> @@ -64,6 +111,27 @@ struct seccomp_filter {
>  	bool log;
>  	struct seccomp_filter *prev;
>  	struct bpf_prog *prog;
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +	/*
> +	 * A semaphore that users of this notification can wait on for
> +	 * changes. Actual reads and writes are still controlled with
> +	 * filter->notify_lock.
> +	 */
> +	struct semaphore request;
> +
> +	/* A lock for all notification-related accesses. */
> +	struct mutex notify_lock;
> +
> +	/* Is there currently an attached listener? */
> +	bool has_listener;
> +
> +	/* The id of the next request. */
> +	u64 next_id;
> +
> +	/* A list of struct seccomp_knotif elements. */
> +	struct list_head notifications;
> +#endif
>  };
>  
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -383,6 +451,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>  	if (!sfilter)
>  		return ERR_PTR(-ENOMEM);
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +	mutex_init(&sfilter->notify_lock);
> +	sema_init(&sfilter->request, 0);
> +	INIT_LIST_HEAD(&sfilter->notifications);
> +	sfilter->next_id = get_random_u64();
> +#endif
> +
>  	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>  					seccomp_check_filter, save_orig);
>  	if (ret < 0) {
> @@ -547,13 +622,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
>  #define SECCOMP_LOG_TRACE		(1 << 4)
>  #define SECCOMP_LOG_LOG			(1 << 5)
>  #define SECCOMP_LOG_ALLOW		(1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
>  
>  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
>  				    SECCOMP_LOG_KILL_THREAD  |
>  				    SECCOMP_LOG_TRAP  |
>  				    SECCOMP_LOG_ERRNO |
>  				    SECCOMP_LOG_TRACE |
> -				    SECCOMP_LOG_LOG;
> +				    SECCOMP_LOG_LOG |
> +				    SECCOMP_LOG_USER_NOTIF;
>  
>  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>  			       bool requested)
> @@ -572,6 +649,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>  	case SECCOMP_RET_TRACE:
>  		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
>  		break;
> +	case SECCOMP_RET_USER_NOTIF:
> +		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> +		break;
>  	case SECCOMP_RET_LOG:
>  		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
>  		break;
> @@ -645,6 +725,91 @@ void secure_computing_strict(int this_syscall)
>  }
>  #else
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> +	u64 ret = filter->next_id;
> +
> +	/* Note: overflow is ok here, the id just needs to be unique */
> +	filter->next_id++;
> +
> +	return ret;
> +}

Nit: Depending on how averse people are to relying on side-effects this
could be simplified to:

static inline u64 seccomp_next_notify_id(struct seccomp_filter *filter)
{
        /* Note: Overflow is ok. The id just needs to be unique. */
        return filter->next_id++;
}
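
For illustration, here is a minimal userspace sketch of the post-increment
form. struct id_filter is a hypothetical stand-in for the next_id field of
struct seccomp_filter, not the kernel type. It only shows that the caller
gets the pre-increment value and that unsigned wraparound is well defined:

```c
#include <stdint.h>

/* Hypothetical stand-in for the next_id field of struct seccomp_filter. */
struct id_filter {
	uint64_t next_id;
};

/* Hand out the current id, then advance; unsigned overflow wraps to 0. */
static inline uint64_t next_notify_id(struct id_filter *filter)
{
	return filter->next_id++;
}
```

As in the patch, the caller is assumed to hold whatever lock protects
next_id; the sketch does no locking of its own.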

> +
> +static void seccomp_do_user_notification(int this_syscall,
> +					 struct seccomp_filter *match,
> +					 const struct seccomp_data *sd)
> +{
> +	int err;
> +	long ret = 0;
> +	struct seccomp_knotif n = {};
> +
> +	mutex_lock(&match->notify_lock);
> +	if (!match->has_listener) {
> +		err = -ENOSYS;
> +		goto out;
> +	}

Nit:

err = -ENOSYS;
mutex_lock(&match->notify_lock);
if (!match->has_listener)
        goto out;

looks cleaner to me. Alternatively, do the err initialization at the top
of the function. :)
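
A rough userspace sketch of the second variant, with err initialized at the
top so the early exit needs no braces. Everything here is a stand-in:
struct listen_filter and lock_depth are hypothetical and only model
has_listener and the lock/unlock pairing around notify_lock:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in; lock_depth models holding notify_lock. */
struct listen_filter {
	int lock_depth;
	bool has_listener;
};

static int do_user_notification(struct listen_filter *match)
{
	int err = -ENOSYS;	/* default result when nobody is listening */

	match->lock_depth++;	/* stands in for mutex_lock() */
	if (!match->has_listener)
		goto out;

	/* ... queue the notification and wait for the reply here ... */
	err = 0;
out:
	match->lock_depth--;	/* stands in for mutex_unlock() */
	return err;
}
```

Either way the single "out" label keeps the unlock on every path.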

> +
> +	n.pid = current->pid;
> +	n.state = SECCOMP_NOTIFY_INIT;
> +	n.data = sd;
> +	n.id = seccomp_next_notify_id(match);
> +	init_completion(&n.ready);
> +
> +	list_add(&n.list, &match->notifications);
> +
> +	mutex_unlock(&match->notify_lock);
> +	up(&match->request);
> +
> +	err = wait_for_completion_interruptible(&n.ready);
> +	mutex_lock(&match->notify_lock);
> +
> +	/*
> +	 * Here it's possible we got a signal and then had to wait on the mutex
> +	 * while the reply was sent, so let's be sure there wasn't a response
> +	 * in the meantime.
> +	 */
> +	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +		/*
> +		 * We got a signal. Let's tell userspace about it (potentially
> +		 * again, if we had already notified them about the first one).
> +		 */
> +		if (n.state == SECCOMP_NOTIFY_SENT) {
> +			n.state = SECCOMP_NOTIFY_INIT;
> +			up(&match->request);
> +		}
> +		mutex_unlock(&match->notify_lock);
> +		err = wait_for_completion_killable(&n.ready);
> +		mutex_lock(&match->notify_lock);
> +		if (err < 0)
> +			goto remove_list;
> +	}
> +
> +	ret = n.val;
> +	err = n.error;
> +
> +	WARN(n.state != SECCOMP_NOTIFY_REPLIED,
> +	     "notified about write complete when state is not write");

Nit: That message seems a little cryptic.

> +
> +remove_list:
> +	list_del(&n.list);
> +out:
> +	mutex_unlock(&match->notify_lock);
> +	syscall_set_return_value(current, task_pt_regs(current),
> +				 err, ret);
> +}
> +#else
> +static void seccomp_do_user_notification(int this_syscall,
> +					 u32 action,
> +					 struct seccomp_filter *match,
> +					 const struct seccomp_data *sd)
> +{
> +	WARN(1, "user notification received, but disabled");

Nit: "received unexpected user notification" might be clearer

> +	seccomp_log(this_syscall, SIGSYS, action, true);
> +	do_exit(SIGSYS);
> +}
> +#endif
> +
>  #ifdef CONFIG_SECCOMP_FILTER
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  			    const bool recheck_after_trace)
> @@ -722,6 +887,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  
>  		return 0;
>  
> +	case SECCOMP_RET_USER_NOTIF:
> +		seccomp_do_user_notification(this_syscall, match, sd);
> +		goto skip;
>  	case SECCOMP_RET_LOG:
>  		seccomp_log(this_syscall, 0, action, true);
>  		return 0;
> @@ -828,6 +996,11 @@ static long seccomp_set_mode_strict(void)
>  }
>  
>  #ifdef CONFIG_SECCOMP_FILTER
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static struct file *init_listener(struct task_struct *,
> +				  struct seccomp_filter *);
> +#endif
> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -847,6 +1020,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>  	struct seccomp_filter *prepared = NULL;
>  	long ret = -EINVAL;
> +	int listener = 0;
> +	struct file *listener_f = NULL;
>  
>  	/* Validate flags. */
>  	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -857,13 +1032,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	if (IS_ERR(prepared))
>  		return PTR_ERR(prepared);
>  
> +	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +		listener = get_unused_fd_flags(O_RDWR);
> +		if (listener < 0) {
> +			ret = listener;
> +			goto out_free;
> +		}
> +
> +		listener_f = init_listener(current, prepared);
> +		if (IS_ERR(listener_f)) {
> +			put_unused_fd(listener);
> +			ret = PTR_ERR(listener_f);
> +			goto out_free;
> +		}
> +	}
> +
>  	/*
>  	 * Make sure we cannot change seccomp or nnp state via TSYNC
>  	 * while another thread is in the middle of calling exec.
>  	 */
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>  	    mutex_lock_killable(&current->signal->cred_guard_mutex))
> -		goto out_free;
> +		goto out_put_fd;
>  
>  	spin_lock_irq(&current->sighand->siglock);
>  
> @@ -881,6 +1071,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	spin_unlock_irq(&current->sighand->siglock);
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>  		mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +		if (ret < 0) {
> +			fput(listener_f);
> +			put_unused_fd(listener);
> +		} else {
> +			fd_install(listener, listener_f);
> +			ret = listener;
> +		}
> +	}
>  out_free:
>  	seccomp_filter_free(prepared);
>  	return ret;
> @@ -909,6 +1109,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
>  	case SECCOMP_RET_LOG:
>  	case SECCOMP_RET_ALLOW:
>  		break;
> +	case SECCOMP_RET_USER_NOTIF:
> +		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
> +			break;
>  	default:
>  		return -EOPNOTSUPP;
>  	}
> @@ -1105,6 +1308,7 @@ long seccomp_get_metadata(struct task_struct *task,
>  #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
>  #define SECCOMP_RET_TRAP_NAME		"trap"
>  #define SECCOMP_RET_ERRNO_NAME		"errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
>  #define SECCOMP_RET_TRACE_NAME		"trace"
>  #define SECCOMP_RET_LOG_NAME		"log"
>  #define SECCOMP_RET_ALLOW_NAME		"allow"
> @@ -1114,6 +1318,7 @@ static const char seccomp_actions_avail[] =
>  				SECCOMP_RET_KILL_THREAD_NAME	" "
>  				SECCOMP_RET_TRAP_NAME		" "
>  				SECCOMP_RET_ERRNO_NAME		" "
> +				SECCOMP_RET_USER_NOTIF_NAME     " "
>  				SECCOMP_RET_TRACE_NAME		" "
>  				SECCOMP_RET_LOG_NAME		" "
>  				SECCOMP_RET_ALLOW_NAME;
> @@ -1131,6 +1336,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
>  	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
>  	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
>  	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> +	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
>  	{ }
>  };
>  
> @@ -1279,3 +1485,195 @@ static int __init seccomp_sysctl_init(void)
>  device_initcall(seccomp_sysctl_init)
>  
>  #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	struct seccomp_knotif *knotif;
> +
> +	mutex_lock(&filter->notify_lock);
> +
> +	/*
> +	 * If this file is being closed because e.g. the task who owned it
> +	 * died, let's wake everyone up who was waiting on us.
> +	 */
> +	list_for_each_entry(knotif, &filter->notifications, list) {
> +		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> +			continue;
> +
> +		knotif->state = SECCOMP_NOTIFY_REPLIED;
> +		knotif->error = -ENOSYS;
> +		knotif->val = 0;
> +
> +		complete(&knotif->ready);
> +	}
> +
> +	filter->has_listener = false;
> +	mutex_unlock(&filter->notify_lock);
> +	__put_seccomp_filter(filter);
> +	return 0;
> +}
> +
> +static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
> +				   size_t size, loff_t *ppos)
> +{
> +	struct seccomp_filter *filter = f->private_data;
> +	struct seccomp_knotif *knotif = NULL, *cur;
> +	struct seccomp_notif unotif;
> +	ssize_t ret;
> +
> +	/* No offset reads. */
> +	if (*ppos != 0)
> +		return -EINVAL;
> +
> +	ret = down_interruptible(&filter->request);
> +	if (ret < 0)
> +		return ret;
> +
> +	mutex_lock(&filter->notify_lock);
> +	list_for_each_entry(cur, &filter->notifications, list) {
> +		if (cur->state == SECCOMP_NOTIFY_INIT) {
> +			knotif = cur;
> +			break;
> +		}
> +	}
> +
> +	/*
> +	 * If we didn't find a notification, it could be that the task was
> +	 * interrupted between the time we were woken and when we were able to
> +	 * acquire the rw lock. Should we retry here or just -ENOENT? -ENOENT
> +	 * for now.
> +	 */
> +	if (!knotif) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	unotif.id = knotif->id;
> +	unotif.pid = knotif->pid;
> +	unotif.data = *(knotif->data);
> +
> +	size = min_t(size_t, size, sizeof(struct seccomp_notif));
> +	if (copy_to_user(buf, &unotif, size)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = sizeof(unotif);
> +	knotif->state = SECCOMP_NOTIFY_SENT;
> +
> +out:
> +	mutex_unlock(&filter->notify_lock);
> +	return ret;
> +}
> +
> +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
> +				    size_t size, loff_t *ppos)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	struct seccomp_notif_resp resp = {};
> +	struct seccomp_knotif *knotif = NULL;
> +	ssize_t ret = -EINVAL;
> +
> +	/* No partial writes. */
> +	if (*ppos != 0)
> +		return -EINVAL;
> +
> +	size = min_t(size_t, size, sizeof(resp));
> +	if (copy_from_user(&resp, buf, size))
> +		return -EFAULT;
> +
> +	ret = mutex_lock_interruptible(&filter->notify_lock);
> +	if (ret < 0)
> +		return ret;
> +
> +	list_for_each_entry(knotif, &filter->notifications, list) {
> +		if (knotif->id == resp.id)
> +			break;
> +	}
> +
> +	if (!knotif || knotif->id != resp.id) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* Allow exactly one reply. */
> +	if (knotif->state != SECCOMP_NOTIFY_SENT) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	ret = size;
> +	knotif->state = SECCOMP_NOTIFY_REPLIED;
> +	knotif->error = resp.error;
> +	knotif->val = resp.val;
> +	complete(&knotif->ready);
> +out:
> +	mutex_unlock(&filter->notify_lock);
> +	return ret;
> +}
> +
> +static __poll_t seccomp_notify_poll(struct file *file,
> +				    struct poll_table_struct *poll_tab)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	__poll_t ret = 0;
> +	struct seccomp_knotif *cur;
> +
> +	ret = mutex_lock_interruptible(&filter->notify_lock);
> +	if (ret < 0)
> +		return ret;
> +
> +	list_for_each_entry(cur, &filter->notifications, list) {
> +		if (cur->state == SECCOMP_NOTIFY_INIT)
> +			ret |= EPOLLIN | EPOLLRDNORM;
> +		if (cur->state == SECCOMP_NOTIFY_SENT)
> +			ret |= EPOLLOUT | EPOLLWRNORM;
> +	}
> +
> +	mutex_unlock(&filter->notify_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +	.read = seccomp_notify_read,
> +	.write = seccomp_notify_write,
> +	.poll = seccomp_notify_poll,
> +	.release = seccomp_notify_release,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> +				  struct seccomp_filter *filter)
> +{
> +	struct file *ret = ERR_PTR(-EBUSY);
> +	struct seccomp_filter *cur;
> +	bool have_listener = false;
> +
> +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +		mutex_lock(&cur->notify_lock);
> +		if (cur->has_listener)
> +			have_listener = true;
> +	}
> +
> +	if (have_listener)
> +		goto out;
> +
> +	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +				 filter, O_RDWR);
> +	if (IS_ERR(ret))
> +		goto out;
> +
> +
> +	/* The file has a reference to it now */
> +	__get_seccomp_filter(filter);
> +	filter->has_listener = true;
> +
> +out:
> +	for (cur = task->seccomp.filter; cur; cur = cur->prev)
> +		mutex_unlock(&cur->notify_lock);
> +
> +	return ret;
> +}
> +#endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 168c66d74fc5..bb96df66222f 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -40,10 +40,12 @@
>  #include <sys/fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/times.h>
> +#include <sys/socket.h>
>  
>  #define _GNU_SOURCE
>  #include <unistd.h>
>  #include <sys/syscall.h>
> +#include <poll.h>
>  
>  #include "../kselftest_harness.h"
>  
> @@ -150,6 +152,24 @@ struct seccomp_metadata {
>  };
>  #endif
>  
> +#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
> +
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> +
> +struct seccomp_notif {
> +	__u64 id;
> +	pid_t pid;
> +	struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +	__u64 id;
> +	__s32 error;
> +	__s64 val;
> +};
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -2072,7 +2092,8 @@ TEST(seccomp_syscall_mode_lock)
>  TEST(detect_seccomp_filter_flags)
>  {
>  	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
> -				 SECCOMP_FILTER_FLAG_LOG };
> +				 SECCOMP_FILTER_FLAG_LOG,
> +				 SECCOMP_FILTER_FLAG_GET_LISTENER };
>  	unsigned int flag, all_flags;
>  	int i;
>  	long ret;
> @@ -2917,6 +2938,164 @@ TEST(get_metadata)
>  	ASSERT_EQ(0, kill(pid, SIGKILL));
>  }
>  
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +	struct sock_filter filter[] = {
> +		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +			offsetof(struct seccomp_data, nr)),
> +		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +	};
> +
> +	struct sock_fprog prog = {
> +		.len = (unsigned short)ARRAY_SIZE(filter),
> +		.filter = filter,
> +	};
> +
> +	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +static int read_notif(int listener, struct seccomp_notif *req)
> +{
> +	int ret;
> +
> +	do {
> +		errno = 0;
> +		ret = read(listener, req, sizeof(*req));
> +	} while (ret == -1 && errno == ENOENT);
> +	return ret;
> +}
> +
> +static void signal_handler(int signal)
> +{
> +}
> +
> +#define USER_NOTIF_MAGIC 116983961184613L
> +TEST(get_user_notification_syscall)
> +{
> +	pid_t pid;
> +	long ret;
> +	int status, listener;
> +	struct seccomp_notif req = {};
> +	struct seccomp_notif_resp resp = {};
> +	struct pollfd pollfd;
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	/* Check that we get -ENOSYS with no listener attached */
> +	if (pid == 0) {
> +		if (user_trap_syscall(__NR_getpid, 0) < 0)
> +			exit(1);
> +		ret = syscall(__NR_getpid);
> +		exit(ret >= 0 || errno != ENOSYS);
> +	}
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +	/* Check that the basic notification machinery works */
> +	listener = user_trap_syscall(__NR_getpid,
> +				     SECCOMP_FILTER_FLAG_GET_LISTENER);
> +	EXPECT_GE(listener, 0);
> +
> +	/* Installing a second listener in the chain should EBUSY */
> +	EXPECT_EQ(user_trap_syscall(__NR_getpid,
> +				    SECCOMP_FILTER_FLAG_GET_LISTENER),
> +		  -1);
> +	EXPECT_EQ(errno, EBUSY);
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		ret = syscall(__NR_getpid);
> +		exit(ret != USER_NOTIF_MAGIC);
> +	}
> +
> +	EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
> +
> +	pollfd.fd = listener;
> +	pollfd.events = POLLIN | POLLOUT;
> +
> +	EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +	EXPECT_EQ(pollfd.revents, POLLOUT);
> +
> +	EXPECT_EQ(req.data.nr,  __NR_getpid);
> +
> +	resp.id = req.id;
> +	resp.error = 0;
> +	resp.val = USER_NOTIF_MAGIC;
> +
> +	EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +	/*
> +	 * Check that nothing bad happens when we kill the task in the middle
> +	 * of a syscall.
> +	 */
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		ret = syscall(__NR_getpid);
> +		exit(ret != USER_NOTIF_MAGIC);
> +	}
> +
> +	ret = read(listener, &req, sizeof(req));
> +	EXPECT_EQ(ret, sizeof(req));
> +
> +	EXPECT_EQ(kill(pid, SIGKILL), 0);
> +	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +
> +	resp.id = req.id;
> +	ret = write(listener, &resp, sizeof(resp));
> +	EXPECT_EQ(ret, -1);
> +	EXPECT_EQ(errno, EINVAL);
> +
> +	/*
> +	 * Check that we get another notification about a signal in the middle
> +	 * of a syscall.
> +	 */
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> +			perror("signal");
> +			exit(1);
> +		}
> +		ret = syscall(__NR_getpid);
> +		exit(ret != USER_NOTIF_MAGIC);
> +	}
> +
> +	ret = read_notif(listener, &req);
> +	EXPECT_EQ(ret, sizeof(req));
> +	EXPECT_EQ(errno, 0);
> +
> +	EXPECT_EQ(kill(pid, SIGUSR1), 0);
> +
> +	ret = read_notif(listener, &req);
> +	EXPECT_EQ(ret, sizeof(req));
> +	EXPECT_EQ(errno, 0);
> +
> +	resp.id = req.id;
> +	ret = write(listener, &resp, sizeof(resp));
> +	EXPECT_EQ(ret, sizeof(resp));
> +	EXPECT_EQ(errno, 0);
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +	close(listener);
> +}
> +
>  /*
>   * TODO:
>   * - add microbenchmarks
> -- 
> 2.17.0
> 


* Re: [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace
  2018-05-17 15:12 ` [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
  2018-05-17 15:41   ` Oleg Nesterov
@ 2018-05-18 14:05   ` Christian Brauner
  2018-05-18 15:10     ` Tycho Andersen
  1 sibling, 1 reply; 20+ messages in thread
From: Christian Brauner @ 2018-05-18 14:05 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: linux-kernel, containers, Tobin C . Harding, Kees Cook,
	Akihiro Suda, Oleg Nesterov, Andy Lutomirski, Eric W . Biederman,
	Christian Brauner, Tyler Hicks

On Thu, May 17, 2018 at 09:12:17AM -0600, Tycho Andersen wrote:
> As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> version which can acquire filters is useful. There are at least two reasons
> this is preferable, even though it uses ptrace:
> 
> 1. You can control tasks that aren't cooperating with you
> 2. You can control tasks whose filters block sendmsg() and socket(); if the
>    task installs a filter which blocks these calls, there's no way with
>    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.

I get the problem. I guess the question we need to answer is: do we care
enough to bring ptrace into this? Not really objecting, just asking. :)
If blocking sendmsg() or socket() becomes an issue because people like
to shoot themselves in the foot, we can surely add this option later.

Christian

> 
> v2: fix a bug where listener mode was not unset when an unused fd was not
>     available
> 
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  include/linux/seccomp.h                       | 11 ++++
>  include/uapi/linux/ptrace.h                   |  2 +
>  kernel/ptrace.c                               |  4 ++
>  kernel/seccomp.c                              | 27 ++++++++
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 66 +++++++++++++++++++
>  5 files changed, 110 insertions(+)
> 
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 0fd3e0676a1c..10e684899b7b 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -111,4 +111,15 @@ static inline long seccomp_get_metadata(struct task_struct *task,
>  	return -EINVAL;
>  }
>  #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +extern long seccomp_get_listener(struct task_struct *task,
> +				 unsigned long filter_off);
> +#else
> +static inline long seccomp_get_listener(struct task_struct *task,
> +					unsigned long filter_off)
> +{
> +	return -EINVAL;
> +}
> +#endif/* CONFIG_SECCOMP_USER_NOTIFICATION */
>  #endif /* _LINUX_SECCOMP_H */
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index d5a1b8a492b9..dc0abf81de3b 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -73,6 +73,8 @@ struct seccomp_metadata {
>  	__u64 flags;		/* Output: filter's flags */
>  };
>  
> +#define PTRACE_SECCOMP_GET_LISTENER	0x420e
> +
>  /* Read signals from a shared (process wide) queue */
>  #define PTRACE_PEEKSIGINFO_SHARED	(1 << 0)
>  
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 21fec73d45d4..fcbdb6f4dc07 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
>  		ret = seccomp_get_metadata(child, addr, datavp);
>  		break;
>  
> +	case PTRACE_SECCOMP_GET_LISTENER:
> +		ret = seccomp_get_listener(child, addr);
> +		break;
> +
>  	default:
>  		break;
>  	}
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index f136eca93f2f..7c23aee76bb4 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1678,4 +1678,31 @@ static struct file *init_listener(struct task_struct *task,
>  
>  	return ret;
>  }
> +
> +long seccomp_get_listener(struct task_struct *task,
> +			  unsigned long filter_off)
> +{
> +	struct seccomp_filter *filter;
> +	struct file *listener;
> +	int fd;
> +
> +	filter = get_nth_filter(task, filter_off);
> +	if (IS_ERR(filter))
> +		return PTR_ERR(filter);
> +
> +	fd = get_unused_fd_flags(O_RDWR);
> +	if (fd < 0) {
> +		__put_seccomp_filter(filter);
> +		return fd;
> +	}
> +
> +	listener = init_listener(task, task->seccomp.filter);
> +	if (IS_ERR(listener)) {
> +		put_unused_fd(fd);
> +		return PTR_ERR(listener);
> +	}
> +
> +	fd_install(fd, listener);
> +	return fd;
> +}
>  #endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index bb96df66222f..473905f33e0b 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -178,6 +178,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
>  }
>  #endif
>  
> +#ifndef PTRACE_SECCOMP_GET_LISTENER
> +#define PTRACE_SECCOMP_GET_LISTENER 0x420e
> +#endif
> +
>  #if __BYTE_ORDER == __LITTLE_ENDIAN
>  #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
>  #elif __BYTE_ORDER == __BIG_ENDIAN
> @@ -3096,6 +3100,68 @@ TEST(get_user_notification_syscall)
>  	close(listener);
>  }
>  
> +TEST(get_user_notification_ptrace)
> +{
> +	pid_t pid;
> +	int status, listener;
> +	int sk_pair[2];
> +	char c;
> +	struct seccomp_notif req = {};
> +	struct seccomp_notif_resp resp = {};
> +
> +	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +		/* Test that we get ENOSYS while not attached */
> +		EXPECT_EQ(syscall(__NR_getpid), -1);
> +		EXPECT_EQ(errno, ENOSYS);
> +
> +		/* Signal we're ready and have installed the filter. */
> +		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
> +
> +		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +		EXPECT_EQ(c, 'H');
> +
> +		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
> +	}
> +
> +	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> +	EXPECT_EQ(c, 'J');
> +
> +	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
> +	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +	listener = ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0);
> +	EXPECT_GE(listener, 0);
> +
> +	/* EBUSY for second listener */
> +	EXPECT_EQ(ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0), -1);
> +	EXPECT_EQ(errno, EBUSY);
> +
> +	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
> +
> +	/* Now signal we are done and respond with magic */
> +	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +
> +	EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
> +
> +	resp.id = req.id;
> +	resp.error = 0;
> +	resp.val = USER_NOTIF_MAGIC;
> +
> +	EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +	close(listener);
> +}
> +
>  /*
>   * TODO:
>   * - add microbenchmarks
> -- 
> 2.17.0
> 


* Re: [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace
  2018-05-18 14:05   ` Christian Brauner
@ 2018-05-18 15:10     ` Tycho Andersen
  0 siblings, 0 replies; 20+ messages in thread
From: Tycho Andersen @ 2018-05-18 15:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, containers, Tobin C . Harding, Kees Cook,
	Akihiro Suda, Oleg Nesterov, Andy Lutomirski, Eric W . Biederman,
	Christian Brauner, Tyler Hicks

On Fri, May 18, 2018 at 04:05:56PM +0200, Christian Brauner wrote:
> On Thu, May 17, 2018 at 09:12:17AM -0600, Tycho Andersen wrote:
> > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > version which can acquire filters is useful. There are at least two reasons
> > this is preferable, even though it uses ptrace:
> > 
> > 1. You can control tasks that aren't cooperating with you
> > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> >    task installs a filter which blocks these calls, there's no way with
> >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> 
> > I get the problem. I guess the question we need to answer is: do we care
> > enough to bring ptrace into this? Not really objecting, just asking. :)
> > If blocking sendmsg() or socket() becomes an issue because people like
> > to shoot themselves in the foot, we can surely add this option later.

It doesn't seem that unreasonable to me to want to filter socket() or
sendmsg() though, so designing an API that doesn't support that from
the get-go seems like a bad idea. But that's why there are two
alternatives, so we can argue about it :)
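
For reference, this is roughly what the cooperating task has to do to hand
a listener fd to its supervisor: an SCM_RIGHTS message over an AF_UNIX
socket, sent with sendmsg(). The sketch is generic fd passing, not code
from this series; send_fd() and recv_fd() are hypothetical helper names. A
filter that blocks sendmsg() makes the send_fd() half impossible:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one fd over a connected AF_UNIX socket; returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char byte = 'F';	/* one byte of ordinary data carries the cmsg */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment */
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(); returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

With the ptrace variant, none of this has to run inside the filtered task.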

Tycho


* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace
  2018-05-18 14:04   ` Christian Brauner
@ 2018-05-18 15:21     ` Tycho Andersen
  0 siblings, 0 replies; 20+ messages in thread
From: Tycho Andersen @ 2018-05-18 15:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, containers, Tobin C . Harding, Kees Cook,
	Akihiro Suda, Oleg Nesterov, Andy Lutomirski, Eric W . Biederman,
	Christian Brauner, Tyler Hicks

On Fri, May 18, 2018 at 04:04:16PM +0200, Christian Brauner wrote:
> On Thu, May 17, 2018 at 09:12:15AM -0600, Tycho Andersen wrote:
> > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> > +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> > +{
> > +	u64 ret = filter->next_id;
> > +
> > +	/* Note: overflow is ok here, the id just needs to be unique */
> > +	filter->next_id++;
> > +
> > +	return ret;
> > +}
> 
> Nit: Depending on how averse people are to relying on side-effects this
> could be simplified to:
> 
> static inline u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> {
>         /* Note: Overflow is ok. The id just needs to be unique. */
>         return filter->next_id++;
> }

Oh, yes, definitely. I think this is leftover from when this function
worked a different way.

> > +
> > +static void seccomp_do_user_notification(int this_syscall,
> > +					 struct seccomp_filter *match,
> > +					 const struct seccomp_data *sd)
> > +{
> > +	int err;
> > +	long ret = 0;
> > +	struct seccomp_knotif n = {};
> > +
> > +	mutex_lock(&match->notify_lock);
> > +	if (!match->has_listener) {
> > +		err = -ENOSYS;
> > +		goto out;
> > +	}
> 
> Nit:
> 
> err = -ENOSYS;
> mutex_lock(&match->notify_lock);
> if (!match->has_listener)
>         goto out;
> 
> > looks cleaner to me. Alternatively, do the err initialization at the top
> > of the function. :)

Ok :)

> > +
> > +	n.pid = current->pid;
> > +	n.state = SECCOMP_NOTIFY_INIT;
> > +	n.data = sd;
> > +	n.id = seccomp_next_notify_id(match);
> > +	init_completion(&n.ready);
> > +
> > +	list_add(&n.list, &match->notifications);
> > +
> > +	mutex_unlock(&match->notify_lock);
> > +	up(&match->request);
> > +
> > +	err = wait_for_completion_interruptible(&n.ready);
> > +	mutex_lock(&match->notify_lock);
> > +
> > +	/*
> > +	 * Here it's possible we got a signal and then had to wait on the mutex
> > +	 * while the reply was sent, so let's be sure there wasn't a response
> > +	 * in the meantime.
> > +	 */
> > +	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > +		/*
> > +		 * We got a signal. Let's tell userspace about it (potentially
> > +		 * again, if we had already notified them about the first one).
> > +		 */
> > +		if (n.state == SECCOMP_NOTIFY_SENT) {
> > +			n.state = SECCOMP_NOTIFY_INIT;
> > +			up(&match->request);
> > +		}
> > +		mutex_unlock(&match->notify_lock);
> > +		err = wait_for_completion_killable(&n.ready);
> > +		mutex_lock(&match->notify_lock);
> > +		if (err < 0)
> > +			goto remove_list;
> > +	}
> > +
> > +	ret = n.val;
> > +	err = n.error;
> > +
> > +	WARN(n.state != SECCOMP_NOTIFY_REPLIED,
> > +	     "notified about write complete when state is not write");
> 
> Nit: That message seems a little cryptic.

Perhaps we can just drop it. It's just a sanity check, and given the
tests above, it doesn't seem likely to ever fire.

> > +
> > +remove_list:
> > +	list_del(&n.list);
> > +out:
> > +	mutex_unlock(&match->notify_lock);
> > +	syscall_set_return_value(current, task_pt_regs(current),
> > +				 err, ret);
> > +}
> > +#else
> > +static void seccomp_do_user_notification(int this_syscall,
> > +					 u32 action,
> > +					 struct seccomp_filter *match,
> > +					 const struct seccomp_data *sd)
> > +{
> > +	WARN(1, "user notification received, but disabled");
> 
> Nit: "received unexpected user notification" might be clearer

Yes, I wonder if we shouldn't just drop this too -- it's not a kernel
bug, but a userspace bug that they're using features that aren't
enabled.

We could enhance the verifier with a static check for
BPF_RET | BPF_K == SECCOMP_RET_USER_NOTIF and reject such programs if
user notification isn't enabled. Of course, it wouldn't handle the
dynamic case, but it might be useful.
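
Purely as an illustration (this is not code from the series), a static
check along those lines might look roughly like the userspace sketch
below. The names struct insn and prog_uses_user_notif() are made up for
the example, and the SK_SECCOMP_RET_USER_NOTIF value is an assumption
(it is the value later found in mainline uapi headers, which may not
match what this version of the patch defines). It only catches constant
returns; a filter that returns the accumulator (BPF_RET | BPF_A) is the
dynamic case and slips through.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Classic BPF opcode bits, mirroring <linux/bpf_common.h>. */
#define SK_BPF_RET 0x06
#define SK_BPF_K   0x00

/* Assumed values for illustration; check the uapi header of the tree in use. */
#define SK_SECCOMP_RET_ACTION_FULL 0xffff0000U
#define SK_SECCOMP_RET_USER_NOTIF  0x7fc00000U

/* Same layout as struct sock_filter, redefined so the sketch stands alone. */
struct insn {
	uint16_t code;
	uint8_t jt;
	uint8_t jf;
	uint32_t k;
};

/*
 * Return 1 if any instruction is a constant return whose action bits
 * are USER_NOTIF, 0 otherwise.
 */
static int prog_uses_user_notif(const struct insn *prog, size_t len)
{
	size_t i;

	assert(prog != NULL || len == 0);

	for (i = 0; i < len; i++) {
		/* Only BPF_RET | BPF_K carries the action in the k field. */
		if (prog[i].code != (SK_BPF_RET | SK_BPF_K))
			continue;
		if ((prog[i].k & SK_SECCOMP_RET_ACTION_FULL) ==
		    SK_SECCOMP_RET_USER_NOTIF)
			return 1;
	}
	return 0;
}
```

A loader could call this once at filter install time and refuse the
program with an error when the feature is compiled out, instead of
warning at syscall time.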

Tycho


* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace
  2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen
  2018-05-17 15:33   ` Oleg Nesterov
  2018-05-18 14:04   ` Christian Brauner
@ 2018-05-19  0:14   ` kbuild test robot
  2018-05-19  5:01   ` kbuild test robot
  3 siblings, 0 replies; 20+ messages in thread
From: kbuild test robot @ 2018-05-19  0:14 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: kbuild-all, linux-kernel, containers, Kees Cook, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding,
	Tycho Andersen

[-- Attachment #1: Type: text/plain, Size: 15430 bytes --]

Hi Tycho,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc5]
[cannot apply to next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Tycho-Andersen/seccomp-trap-to-userspace/20180519-071527
config: x86_64-randconfig-x010-201819 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   kernel/seccomp.c: In function '__seccomp_filter':
>> kernel/seccomp.c:891:46: warning: passing argument 2 of 'seccomp_do_user_notification' makes integer from pointer without a cast [-Wint-conversion]
      seccomp_do_user_notification(this_syscall, match, sd);
                                                 ^~~~~
   kernel/seccomp.c:802:13: note: expected 'u32 {aka unsigned int}' but argument is of type 'struct seccomp_filter *'
    static void seccomp_do_user_notification(int this_syscall,
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/seccomp.c:891:53: error: passing argument 3 of 'seccomp_do_user_notification' from incompatible pointer type [-Werror=incompatible-pointer-types]
      seccomp_do_user_notification(this_syscall, match, sd);
                                                        ^~
   kernel/seccomp.c:802:13: note: expected 'struct seccomp_filter *' but argument is of type 'const struct seccomp_data *'
    static void seccomp_do_user_notification(int this_syscall,
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> kernel/seccomp.c:891:3: error: too few arguments to function 'seccomp_do_user_notification'
      seccomp_do_user_notification(this_syscall, match, sd);
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/seccomp.c:802:13: note: declared here
    static void seccomp_do_user_notification(int this_syscall,
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/seccomp.c: In function 'seccomp_set_mode_filter':
>> kernel/seccomp.c:1036:14: error: implicit declaration of function 'get_unused_fd_flags'; did you mean 'getname_flags'? [-Werror=implicit-function-declaration]
      listener = get_unused_fd_flags(O_RDWR);
                 ^~~~~~~~~~~~~~~~~~~
                 getname_flags
>> kernel/seccomp.c:1042:16: error: implicit declaration of function 'init_listener'; did you mean 'init_llist_head'? [-Werror=implicit-function-declaration]
      listener_f = init_listener(current, prepared);
                   ^~~~~~~~~~~~~
                   init_llist_head
>> kernel/seccomp.c:1042:14: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
      listener_f = init_listener(current, prepared);
                 ^
>> kernel/seccomp.c:1044:4: error: implicit declaration of function 'put_unused_fd'; did you mean 'put_user_ns'? [-Werror=implicit-function-declaration]
       put_unused_fd(listener);
       ^~~~~~~~~~~~~
       put_user_ns
>> kernel/seccomp.c:1077:4: error: implicit declaration of function 'fput'; did you mean 'iput'? [-Werror=implicit-function-declaration]
       fput(listener_f);
       ^~~~
       iput
>> kernel/seccomp.c:1080:4: error: implicit declaration of function 'fd_install'; did you mean 'fs_initcall'? [-Werror=implicit-function-declaration]
       fd_install(listener, listener_f);
       ^~~~~~~~~~
       fs_initcall
   cc1: some warnings being treated as errors

vim +/seccomp_do_user_notification +891 kernel/seccomp.c

   738	
   739	static void seccomp_do_user_notification(int this_syscall,
   740						 struct seccomp_filter *match,
   741						 const struct seccomp_data *sd)
   742	{
   743		int err;
   744		long ret = 0;
   745		struct seccomp_knotif n = {};
   746	
   747		mutex_lock(&match->notify_lock);
   748		if (!match->has_listener) {
   749			err = -ENOSYS;
   750			goto out;
   751		}
   752	
   753		n.pid = current->pid;
   754		n.state = SECCOMP_NOTIFY_INIT;
   755		n.data = sd;
   756		n.id = seccomp_next_notify_id(match);
   757		init_completion(&n.ready);
   758	
   759		list_add(&n.list, &match->notifications);
   760	
   761		mutex_unlock(&match->notify_lock);
   762		up(&match->request);
   763	
   764		err = wait_for_completion_interruptible(&n.ready);
   765		mutex_lock(&match->notify_lock);
   766	
   767		/*
   768		 * Here it's possible we got a signal and then had to wait on the mutex
   769		 * while the reply was sent, so let's be sure there wasn't a response
   770		 * in the meantime.
   771		 */
   772		if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
   773			/*
   774			 * We got a signal. Let's tell userspace about it (potentially
   775			 * again, if we had already notified them about the first one).
   776			 */
   777			if (n.state == SECCOMP_NOTIFY_SENT) {
   778				n.state = SECCOMP_NOTIFY_INIT;
   779				up(&match->request);
   780			}
   781			mutex_unlock(&match->notify_lock);
   782			err = wait_for_completion_killable(&n.ready);
   783			mutex_lock(&match->notify_lock);
   784			if (err < 0)
   785				goto remove_list;
   786		}
   787	
   788		ret = n.val;
   789		err = n.error;
   790	
   791		WARN(n.state != SECCOMP_NOTIFY_REPLIED,
   792		     "notified about write complete when state is not write");
   793	
   794	remove_list:
   795		list_del(&n.list);
   796	out:
   797		mutex_unlock(&match->notify_lock);
   798		syscall_set_return_value(current, task_pt_regs(current),
   799					 err, ret);
   800	}
   801	#else
 > 802	static void seccomp_do_user_notification(int this_syscall,
   803						 u32 action,
   804						 struct seccomp_filter *match,
   805						 const struct seccomp_data *sd)
   806	{
   807		WARN(1, "user notification received, but disabled");
   808		seccomp_log(this_syscall, SIGSYS, action, true);
   809		do_exit(SIGSYS);
   810	}
   811	#endif
   812	
   813	#ifdef CONFIG_SECCOMP_FILTER
   814	static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
   815				    const bool recheck_after_trace)
   816	{
   817		u32 filter_ret, action;
   818		struct seccomp_filter *match = NULL;
   819		int data;
   820	
   821		/*
   822		 * Make sure that any changes to mode from another thread have
   823		 * been seen after TIF_SECCOMP was seen.
   824		 */
   825		rmb();
   826	
   827		filter_ret = seccomp_run_filters(sd, &match);
   828		data = filter_ret & SECCOMP_RET_DATA;
   829		action = filter_ret & SECCOMP_RET_ACTION_FULL;
   830	
   831		switch (action) {
   832		case SECCOMP_RET_ERRNO:
   833			/* Set low-order bits as an errno, capped at MAX_ERRNO. */
   834			if (data > MAX_ERRNO)
   835				data = MAX_ERRNO;
   836			syscall_set_return_value(current, task_pt_regs(current),
   837						 -data, 0);
   838			goto skip;
   839	
   840		case SECCOMP_RET_TRAP:
   841			/* Show the handler the original registers. */
   842			syscall_rollback(current, task_pt_regs(current));
   843			/* Let the filter pass back 16 bits of data. */
   844			seccomp_send_sigsys(this_syscall, data);
   845			goto skip;
   846	
   847		case SECCOMP_RET_TRACE:
   848			/* We've been put in this state by the ptracer already. */
   849			if (recheck_after_trace)
   850				return 0;
   851	
   852			/* ENOSYS these calls if there is no tracer attached. */
   853			if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
   854				syscall_set_return_value(current,
   855							 task_pt_regs(current),
   856							 -ENOSYS, 0);
   857				goto skip;
   858			}
   859	
   860			/* Allow the BPF to provide the event message */
   861			ptrace_event(PTRACE_EVENT_SECCOMP, data);
   862			/*
   863			 * The delivery of a fatal signal during event
   864			 * notification may silently skip tracer notification,
   865			 * which could leave us with a potentially unmodified
   866			 * syscall that the tracer would have liked to have
   867			 * changed. Since the process is about to die, we just
   868			 * force the syscall to be skipped and let the signal
   869			 * kill the process and correctly handle any tracer exit
   870			 * notifications.
   871			 */
   872			if (fatal_signal_pending(current))
   873				goto skip;
   874			/* Check if the tracer forced the syscall to be skipped. */
   875			this_syscall = syscall_get_nr(current, task_pt_regs(current));
   876			if (this_syscall < 0)
   877				goto skip;
   878	
   879			/*
   880			 * Recheck the syscall, since it may have changed. This
   881			 * intentionally uses a NULL struct seccomp_data to force
   882			 * a reload of all registers. This does not goto skip since
   883			 * a skip would have already been reported.
   884			 */
   885			if (__seccomp_filter(this_syscall, NULL, true))
   886				return -1;
   887	
   888			return 0;
   889	
   890		case SECCOMP_RET_USER_NOTIF:
 > 891			seccomp_do_user_notification(this_syscall, match, sd);
   892			goto skip;
   893		case SECCOMP_RET_LOG:
   894			seccomp_log(this_syscall, 0, action, true);
   895			return 0;
   896	
   897		case SECCOMP_RET_ALLOW:
   898			/*
   899			 * Note that the "match" filter will always be NULL for
   900			 * this action since SECCOMP_RET_ALLOW is the starting
   901			 * state in seccomp_run_filters().
   902			 */
   903			return 0;
   904	
   905		case SECCOMP_RET_KILL_THREAD:
   906		case SECCOMP_RET_KILL_PROCESS:
   907		default:
   908			seccomp_log(this_syscall, SIGSYS, action, true);
   909			/* Dump core only if this is the last remaining thread. */
   910			if (action == SECCOMP_RET_KILL_PROCESS ||
   911			    get_nr_threads(current) == 1) {
   912				siginfo_t info;
   913	
   914				/* Show the original registers in the dump. */
   915				syscall_rollback(current, task_pt_regs(current));
   916				/* Trigger a manual coredump since do_exit skips it. */
   917				seccomp_init_siginfo(&info, this_syscall, data);
   918				do_coredump(&info);
   919			}
   920			if (action == SECCOMP_RET_KILL_PROCESS)
   921				do_group_exit(SIGSYS);
   922			else
   923				do_exit(SIGSYS);
   924		}
   925	
   926		unreachable();
   927	
   928	skip:
   929		seccomp_log(this_syscall, 0, action, match ? match->log : false);
   930		return -1;
   931	}
   932	#else
   933	static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
   934				    const bool recheck_after_trace)
   935	{
   936		BUG();
   937	}
   938	#endif
   939	
   940	int __secure_computing(const struct seccomp_data *sd)
   941	{
   942		int mode = current->seccomp.mode;
   943		int this_syscall;
   944	
   945		if (IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) &&
   946		    unlikely(current->ptrace & PT_SUSPEND_SECCOMP))
   947			return 0;
   948	
   949		this_syscall = sd ? sd->nr :
   950			syscall_get_nr(current, task_pt_regs(current));
   951	
   952		switch (mode) {
   953		case SECCOMP_MODE_STRICT:
   954			__secure_computing_strict(this_syscall);  /* may call do_exit */
   955			return 0;
   956		case SECCOMP_MODE_FILTER:
   957			return __seccomp_filter(this_syscall, sd, false);
   958		default:
   959			BUG();
   960		}
   961	}
   962	#endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */
   963	
   964	long prctl_get_seccomp(void)
   965	{
   966		return current->seccomp.mode;
   967	}
   968	
   969	/**
   970	 * seccomp_set_mode_strict: internal function for setting strict seccomp
   971	 *
   972	 * Once current->seccomp.mode is non-zero, it may not be changed.
   973	 *
   974	 * Returns 0 on success or -EINVAL on failure.
   975	 */
   976	static long seccomp_set_mode_strict(void)
   977	{
   978		const unsigned long seccomp_mode = SECCOMP_MODE_STRICT;
   979		long ret = -EINVAL;
   980	
   981		spin_lock_irq(&current->sighand->siglock);
   982	
   983		if (!seccomp_may_assign_mode(seccomp_mode))
   984			goto out;
   985	
   986	#ifdef TIF_NOTSC
   987		disable_TSC();
   988	#endif
   989		seccomp_assign_mode(current, seccomp_mode);
   990		ret = 0;
   991	
   992	out:
   993		spin_unlock_irq(&current->sighand->siglock);
   994	
   995		return ret;
   996	}
   997	
   998	#ifdef CONFIG_SECCOMP_FILTER
   999	#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
  1000	static struct file *init_listener(struct task_struct *,
  1001					  struct seccomp_filter *);
  1002	#endif
  1003	
  1004	/**
  1005	 * seccomp_set_mode_filter: internal function for setting seccomp filter
  1006	 * @flags:  flags to change filter behavior
  1007	 * @filter: struct sock_fprog containing filter
  1008	 *
  1009	 * This function may be called repeatedly to install additional filters.
  1010	 * Every filter successfully installed will be evaluated (in reverse order)
  1011	 * for each system call the task makes.
  1012	 *
  1013	 * Once current->seccomp.mode is non-zero, it may not be changed.
  1014	 *
  1015	 * Returns 0 on success or -EINVAL on failure.
  1016	 */
  1017	static long seccomp_set_mode_filter(unsigned int flags,
  1018					    const char __user *filter)
  1019	{
  1020		const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
  1021		struct seccomp_filter *prepared = NULL;
  1022		long ret = -EINVAL;
  1023		int listener = 0;
  1024		struct file *listener_f = NULL;
  1025	
  1026		/* Validate flags. */
  1027		if (flags & ~SECCOMP_FILTER_FLAG_MASK)
  1028			return -EINVAL;
  1029	
  1030		/* Prepare the new filter before holding any locks. */
  1031		prepared = seccomp_prepare_user_filter(filter);
  1032		if (IS_ERR(prepared))
  1033			return PTR_ERR(prepared);
  1034	
  1035		if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> 1036			listener = get_unused_fd_flags(O_RDWR);
  1037			if (listener < 0) {
  1038				ret = listener;
  1039				goto out_free;
  1040			}
  1041	
> 1042			listener_f = init_listener(current, prepared);
  1043			if (IS_ERR(listener_f)) {
> 1044				put_unused_fd(listener);
  1045				ret = PTR_ERR(listener_f);
  1046				goto out_free;
  1047			}
  1048		}
  1049	
  1050		/*
  1051		 * Make sure we cannot change seccomp or nnp state via TSYNC
  1052		 * while another thread is in the middle of calling exec.
  1053		 */
  1054		if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
  1055		    mutex_lock_killable(&current->signal->cred_guard_mutex))
  1056			goto out_put_fd;
  1057	
  1058		spin_lock_irq(&current->sighand->siglock);
  1059	
  1060		if (!seccomp_may_assign_mode(seccomp_mode))
  1061			goto out;
  1062	
  1063		ret = seccomp_attach_filter(flags, prepared);
  1064		if (ret)
  1065			goto out;
  1066		/* Do not free the successfully attached filter. */
  1067		prepared = NULL;
  1068	
  1069		seccomp_assign_mode(current, seccomp_mode);
  1070	out:
  1071		spin_unlock_irq(&current->sighand->siglock);
  1072		if (flags & SECCOMP_FILTER_FLAG_TSYNC)
  1073			mutex_unlock(&current->signal->cred_guard_mutex);
  1074	out_put_fd:
  1075		if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
  1076			if (ret < 0) {
> 1077				fput(listener_f);
  1078				put_unused_fd(listener);
  1079			} else {
> 1080				fd_install(listener, listener_f);
  1081				ret = listener;
  1082			}
  1083		}
  1084	out_free:
  1085		seccomp_filter_free(prepared);
  1086		return ret;
  1087	}
  1088	#else
  1089	static inline long seccomp_set_mode_filter(unsigned int flags,
  1090						   const char __user *filter)
  1091	{
  1092		return -EINVAL;
  1093	}
  1094	#endif
  1095	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26139 bytes --]


* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace
  2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen
                     ` (2 preceding siblings ...)
  2018-05-19  0:14   ` kbuild test robot
@ 2018-05-19  5:01   ` kbuild test robot
  2018-05-21 22:55     ` Tycho Andersen
  3 siblings, 1 reply; 20+ messages in thread
From: kbuild test robot @ 2018-05-19  5:01 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: kbuild-all, linux-kernel, containers, Kees Cook, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding,
	Tycho Andersen

[-- Attachment #1: Type: text/plain, Size: 12534 bytes --]

Hi Tycho,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc5]
[cannot apply to next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Tycho-Andersen/seccomp-trap-to-userspace/20180519-071527
config: i386-randconfig-a1-05181545 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   kernel/seccomp.c: In function '__seccomp_filter':
   kernel/seccomp.c:891:46: warning: passing argument 2 of 'seccomp_do_user_notification' makes integer from pointer without a cast
      seccomp_do_user_notification(this_syscall, match, sd);
                                                 ^
   kernel/seccomp.c:802:13: note: expected 'u32' but argument is of type 'struct seccomp_filter *'
    static void seccomp_do_user_notification(int this_syscall,
                ^
>> kernel/seccomp.c:891:53: warning: passing argument 3 of 'seccomp_do_user_notification' from incompatible pointer type
      seccomp_do_user_notification(this_syscall, match, sd);
                                                        ^
   kernel/seccomp.c:802:13: note: expected 'struct seccomp_filter *' but argument is of type 'const struct seccomp_data *'
    static void seccomp_do_user_notification(int this_syscall,
                ^
   kernel/seccomp.c:891:3: error: too few arguments to function 'seccomp_do_user_notification'
      seccomp_do_user_notification(this_syscall, match, sd);
      ^
   kernel/seccomp.c:802:13: note: declared here
    static void seccomp_do_user_notification(int this_syscall,
                ^
   kernel/seccomp.c: In function 'seccomp_set_mode_filter':
>> kernel/seccomp.c:1036:3: error: implicit declaration of function 'get_unused_fd_flags' [-Werror=implicit-function-declaration]
      listener = get_unused_fd_flags(O_RDWR);
      ^
>> kernel/seccomp.c:1042:3: error: implicit declaration of function 'init_listener' [-Werror=implicit-function-declaration]
      listener_f = init_listener(current, prepared);
      ^
   kernel/seccomp.c:1042:14: warning: assignment makes pointer from integer without a cast
      listener_f = init_listener(current, prepared);
                 ^
>> kernel/seccomp.c:1044:4: error: implicit declaration of function 'put_unused_fd' [-Werror=implicit-function-declaration]
       put_unused_fd(listener);
       ^
>> kernel/seccomp.c:1077:4: error: implicit declaration of function 'fput' [-Werror=implicit-function-declaration]
       fput(listener_f);
       ^
>> kernel/seccomp.c:1080:4: error: implicit declaration of function 'fd_install' [-Werror=implicit-function-declaration]
       fd_install(listener, listener_f);
       ^
   cc1: some warnings being treated as errors

vim +/get_unused_fd_flags +1036 kernel/seccomp.c

   812	
   813	#ifdef CONFIG_SECCOMP_FILTER
   814	static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
   815				    const bool recheck_after_trace)
   816	{
   817		u32 filter_ret, action;
   818		struct seccomp_filter *match = NULL;
   819		int data;
   820	
   821		/*
   822		 * Make sure that any changes to mode from another thread have
   823		 * been seen after TIF_SECCOMP was seen.
   824		 */
   825		rmb();
   826	
   827		filter_ret = seccomp_run_filters(sd, &match);
   828		data = filter_ret & SECCOMP_RET_DATA;
   829		action = filter_ret & SECCOMP_RET_ACTION_FULL;
   830	
   831		switch (action) {
   832		case SECCOMP_RET_ERRNO:
   833			/* Set low-order bits as an errno, capped at MAX_ERRNO. */
   834			if (data > MAX_ERRNO)
   835				data = MAX_ERRNO;
   836			syscall_set_return_value(current, task_pt_regs(current),
   837						 -data, 0);
   838			goto skip;
   839	
   840		case SECCOMP_RET_TRAP:
   841			/* Show the handler the original registers. */
   842			syscall_rollback(current, task_pt_regs(current));
   843			/* Let the filter pass back 16 bits of data. */
   844			seccomp_send_sigsys(this_syscall, data);
   845			goto skip;
   846	
   847		case SECCOMP_RET_TRACE:
   848			/* We've been put in this state by the ptracer already. */
   849			if (recheck_after_trace)
   850				return 0;
   851	
   852			/* ENOSYS these calls if there is no tracer attached. */
   853			if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) {
   854				syscall_set_return_value(current,
   855							 task_pt_regs(current),
   856							 -ENOSYS, 0);
   857				goto skip;
   858			}
   859	
   860			/* Allow the BPF to provide the event message */
   861			ptrace_event(PTRACE_EVENT_SECCOMP, data);
   862			/*
   863			 * The delivery of a fatal signal during event
   864			 * notification may silently skip tracer notification,
   865			 * which could leave us with a potentially unmodified
   866			 * syscall that the tracer would have liked to have
   867			 * changed. Since the process is about to die, we just
   868			 * force the syscall to be skipped and let the signal
   869			 * kill the process and correctly handle any tracer exit
   870			 * notifications.
   871			 */
   872			if (fatal_signal_pending(current))
   873				goto skip;
   874			/* Check if the tracer forced the syscall to be skipped. */
   875			this_syscall = syscall_get_nr(current, task_pt_regs(current));
   876			if (this_syscall < 0)
   877				goto skip;
   878	
   879			/*
   880			 * Recheck the syscall, since it may have changed. This
   881			 * intentionally uses a NULL struct seccomp_data to force
   882			 * a reload of all registers. This does not goto skip since
   883			 * a skip would have already been reported.
   884			 */
   885			if (__seccomp_filter(this_syscall, NULL, true))
   886				return -1;
   887	
   888			return 0;
   889	
   890		case SECCOMP_RET_USER_NOTIF:
 > 891			seccomp_do_user_notification(this_syscall, match, sd);
   892			goto skip;
   893		case SECCOMP_RET_LOG:
   894			seccomp_log(this_syscall, 0, action, true);
   895			return 0;
   896	
   897		case SECCOMP_RET_ALLOW:
   898			/*
   899			 * Note that the "match" filter will always be NULL for
   900			 * this action since SECCOMP_RET_ALLOW is the starting
   901			 * state in seccomp_run_filters().
   902			 */
   903			return 0;
   904	
   905		case SECCOMP_RET_KILL_THREAD:
   906		case SECCOMP_RET_KILL_PROCESS:
   907		default:
   908			seccomp_log(this_syscall, SIGSYS, action, true);
   909			/* Dump core only if this is the last remaining thread. */
   910			if (action == SECCOMP_RET_KILL_PROCESS ||
   911			    get_nr_threads(current) == 1) {
   912				siginfo_t info;
   913	
   914				/* Show the original registers in the dump. */
   915				syscall_rollback(current, task_pt_regs(current));
   916				/* Trigger a manual coredump since do_exit skips it. */
   917				seccomp_init_siginfo(&info, this_syscall, data);
   918				do_coredump(&info);
   919			}
   920			if (action == SECCOMP_RET_KILL_PROCESS)
   921				do_group_exit(SIGSYS);
   922			else
   923				do_exit(SIGSYS);
   924		}
   925	
   926		unreachable();
   927	
   928	skip:
   929		seccomp_log(this_syscall, 0, action, match ? match->log : false);
   930		return -1;
   931	}
   932	#else
   933	static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
   934				    const bool recheck_after_trace)
   935	{
   936		BUG();
   937	}
   938	#endif
   939	
   940	int __secure_computing(const struct seccomp_data *sd)
   941	{
   942		int mode = current->seccomp.mode;
   943		int this_syscall;
   944	
   945		if (IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) &&
   946		    unlikely(current->ptrace & PT_SUSPEND_SECCOMP))
   947			return 0;
   948	
   949		this_syscall = sd ? sd->nr :
   950			syscall_get_nr(current, task_pt_regs(current));
   951	
   952		switch (mode) {
   953		case SECCOMP_MODE_STRICT:
   954			__secure_computing_strict(this_syscall);  /* may call do_exit */
   955			return 0;
   956		case SECCOMP_MODE_FILTER:
   957			return __seccomp_filter(this_syscall, sd, false);
   958		default:
   959			BUG();
   960		}
   961	}
   962	#endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */
   963	
   964	long prctl_get_seccomp(void)
   965	{
   966		return current->seccomp.mode;
   967	}
   968	
   969	/**
   970	 * seccomp_set_mode_strict: internal function for setting strict seccomp
   971	 *
   972	 * Once current->seccomp.mode is non-zero, it may not be changed.
   973	 *
   974	 * Returns 0 on success or -EINVAL on failure.
   975	 */
   976	static long seccomp_set_mode_strict(void)
   977	{
   978		const unsigned long seccomp_mode = SECCOMP_MODE_STRICT;
   979		long ret = -EINVAL;
   980	
   981		spin_lock_irq(&current->sighand->siglock);
   982	
   983		if (!seccomp_may_assign_mode(seccomp_mode))
   984			goto out;
   985	
   986	#ifdef TIF_NOTSC
   987		disable_TSC();
   988	#endif
   989		seccomp_assign_mode(current, seccomp_mode);
   990		ret = 0;
   991	
   992	out:
   993		spin_unlock_irq(&current->sighand->siglock);
   994	
   995		return ret;
   996	}
   997	
   998	#ifdef CONFIG_SECCOMP_FILTER
   999	#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
  1000	static struct file *init_listener(struct task_struct *,
  1001					  struct seccomp_filter *);
  1002	#endif
  1003	
  1004	/**
  1005	 * seccomp_set_mode_filter: internal function for setting seccomp filter
  1006	 * @flags:  flags to change filter behavior
  1007	 * @filter: struct sock_fprog containing filter
  1008	 *
  1009	 * This function may be called repeatedly to install additional filters.
  1010	 * Every filter successfully installed will be evaluated (in reverse order)
  1011	 * for each system call the task makes.
  1012	 *
  1013	 * Once current->seccomp.mode is non-zero, it may not be changed.
  1014	 *
  1015	 * Returns 0 on success or -EINVAL on failure.
  1016	 */
  1017	static long seccomp_set_mode_filter(unsigned int flags,
  1018					    const char __user *filter)
  1019	{
  1020		const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
  1021		struct seccomp_filter *prepared = NULL;
  1022		long ret = -EINVAL;
  1023		int listener = 0;
  1024		struct file *listener_f = NULL;
  1025	
  1026		/* Validate flags. */
  1027		if (flags & ~SECCOMP_FILTER_FLAG_MASK)
  1028			return -EINVAL;
  1029	
  1030		/* Prepare the new filter before holding any locks. */
  1031		prepared = seccomp_prepare_user_filter(filter);
  1032		if (IS_ERR(prepared))
  1033			return PTR_ERR(prepared);
  1034	
  1035		if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> 1036			listener = get_unused_fd_flags(O_RDWR);
  1037			if (listener < 0) {
  1038				ret = listener;
  1039				goto out_free;
  1040			}
  1041	
> 1042			listener_f = init_listener(current, prepared);
  1043			if (IS_ERR(listener_f)) {
> 1044				put_unused_fd(listener);
  1045				ret = PTR_ERR(listener_f);
  1046				goto out_free;
  1047			}
  1048		}
  1049	
  1050		/*
  1051		 * Make sure we cannot change seccomp or nnp state via TSYNC
  1052		 * while another thread is in the middle of calling exec.
  1053		 */
  1054		if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
  1055		    mutex_lock_killable(&current->signal->cred_guard_mutex))
  1056			goto out_put_fd;
  1057	
  1058		spin_lock_irq(&current->sighand->siglock);
  1059	
  1060		if (!seccomp_may_assign_mode(seccomp_mode))
  1061			goto out;
  1062	
  1063		ret = seccomp_attach_filter(flags, prepared);
  1064		if (ret)
  1065			goto out;
  1066		/* Do not free the successfully attached filter. */
  1067		prepared = NULL;
  1068	
  1069		seccomp_assign_mode(current, seccomp_mode);
  1070	out:
  1071		spin_unlock_irq(&current->sighand->siglock);
  1072		if (flags & SECCOMP_FILTER_FLAG_TSYNC)
  1073			mutex_unlock(&current->signal->cred_guard_mutex);
  1074	out_put_fd:
  1075		if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
  1076			if (ret < 0) {
> 1077				fput(listener_f);
  1078				put_unused_fd(listener);
  1079			} else {
> 1080				fd_install(listener, listener_f);
  1081				ret = listener;
  1082			}
  1083		}
  1084	out_free:
  1085		seccomp_filter_free(prepared);
  1086		return ret;
  1087	}
  1088	#else
  1089	static inline long seccomp_set_mode_filter(unsigned int flags,
  1090						   const char __user *filter)
  1091	{
  1092		return -EINVAL;
  1093	}
  1094	#endif
  1095	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28767 bytes --]


* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace
  2018-05-19  5:01   ` kbuild test robot
@ 2018-05-21 22:55     ` Tycho Andersen
  0 siblings, 0 replies; 20+ messages in thread
From: Tycho Andersen @ 2018-05-21 22:55 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-kernel, containers, Kees Cook, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding

On Sat, May 19, 2018 at 01:01:15PM +0800, kbuild test robot wrote:
> Hi Tycho,
> 
> I love your patch! Yet something to improve:

Whoops, seems I forgot to compile the
!CONFIG_SECCOMP_USER_NOTIFICATION case. Anyways, I've fixed this for
v3.
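
For anyone following along, the class of mistake the robot caught can
be reproduced in a few lines. This is a standalone sketch and not the
v3 code: FEATURE_ENABLED, struct filter, struct data, do_notify() and
handle_syscall() are invented stand-ins for
CONFIG_SECCOMP_USER_NOTIFICATION, the seccomp structs,
seccomp_do_user_notification() and its caller. The point is that both
branches of the #ifdef need the same parameter list, because the single
call site is compiled either way.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct filter { int has_listener; };
struct data { int nr; };

#ifdef FEATURE_ENABLED
/* Real implementation: only built when the option is on. */
static int do_notify(int this_syscall, struct filter *match,
		     const struct data *sd)
{
	(void)this_syscall;
	(void)sd;
	return match->has_listener ? 0 : -ENOSYS;
}
#else
/*
 * Stub: same prototype as above. An extra or reordered parameter here
 * goes unnoticed until somebody builds with the option off.
 */
static int do_notify(int this_syscall, struct filter *match,
		     const struct data *sd)
{
	(void)this_syscall;
	(void)match;
	(void)sd;
	return -ENOSYS;
}
#endif

/* The one call site, shared by both configurations. */
static int handle_syscall(int this_syscall, struct filter *match,
			  const struct data *sd)
{
	assert(match != NULL);
	return do_notify(this_syscall, match, sd);
}
```

Building it once with -DFEATURE_ENABLED and once without exercises both
branches, which is effectively what a randconfig build does.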

Tycho


* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace
  2018-05-17 15:46       ` Oleg Nesterov
@ 2018-05-24 15:28         ` Tycho Andersen
  0 siblings, 0 replies; 20+ messages in thread
From: Tycho Andersen @ 2018-05-24 15:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Tobin C . Harding

Hi Oleg,

On Thu, May 17, 2018 at 05:46:37PM +0200, Oleg Nesterov wrote:
> On 05/17, Tycho Andersen wrote:
> >
> > > From lockdep pov this loop tries to take the same lock twice or more, it should
> > > complain.
> >
> > I didn't, but I guess that's because it's not trying to take the same lock
> > twice -- the pointer cur is changing in the loop.
> 
> Yes, I see. But this is the same lock for lockdep, it has the same class.

I finally figured this out: I needed CONFIG_PROVE_LOCKING=y too.
Anyway, I've added the nesting annotations for v3. Thanks!

Tycho
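
The point about lock classes can be illustrated with a toy userspace model; this is a sketch of the idea only, not lockdep itself and not the annotation used in v3. The model treats two distinct lock objects of the same class as "the same lock" unless the acquisition carries a different subclass, which is roughly what kernel helpers such as `mutex_lock_nested()` express. All `toy_*` names and `MAX_HELD` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_HELD 8	/* arbitrary depth limit for the toy model */

struct toy_lock {
	int class_id;	/* locks initialised at the same site share a class */
};

struct toy_held {
	int class_id;
	int subclass;
};

static struct toy_held held_stack[MAX_HELD];
static size_t nheld;

/*
 * Record an acquisition. Returns false when a lock of the same class and
 * subclass is already held, which is the case a validator would report as
 * possible recursive locking, even if the lock pointers differ.
 */
bool toy_acquire(const struct toy_lock *lock, int subclass)
{
	for (size_t i = 0; i < nheld; i++) {
		if (held_stack[i].class_id == lock->class_id &&
		    held_stack[i].subclass == subclass)
			return false;
	}
	assert(nheld < MAX_HELD);
	held_stack[nheld].class_id = lock->class_id;
	held_stack[nheld].subclass = subclass;
	nheld++;
	return true;
}

/* Drop every recorded acquisition. */
void toy_release_all(void)
{
	nheld = 0;
}
```

In this model, walking a chain of same-class locks with subclass 0 fails on the second acquisition, while passing a distinct subclass for the nested one succeeds.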


Thread overview: 20+ messages
2018-05-17 15:12 [PATCH v2 0/4] seccomp trap to userspace Tycho Andersen
2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen
2018-05-17 15:33   ` Oleg Nesterov
2018-05-17 15:39     ` Tycho Andersen
2018-05-17 15:46       ` Oleg Nesterov
2018-05-24 15:28         ` Tycho Andersen
2018-05-18 14:04   ` Christian Brauner
2018-05-18 15:21     ` Tycho Andersen
2018-05-19  0:14   ` kbuild test robot
2018-05-19  5:01   ` kbuild test robot
2018-05-21 22:55     ` Tycho Andersen
2018-05-17 15:12 ` [PATCH v2 2/4] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
2018-05-17 15:12 ` [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
2018-05-17 15:41   ` Oleg Nesterov
2018-05-17 15:57     ` Tycho Andersen
2018-05-17 15:59       ` Tycho Andersen
2018-05-18 14:05   ` Christian Brauner
2018-05-18 15:10     ` Tycho Andersen
2018-05-17 15:12 ` [PATCH v2 4/4] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen
2018-05-18 14:03 ` [PATCH v2 0/4] seccomp trap to userspace Christian Brauner
