* [RFC 0/3] seccomp trap to userspace
@ 2018-02-04 10:49 Tycho Andersen
  2018-02-04 10:49 ` [RFC 1/3] seccomp: add a return code to " Tycho Andersen
                   ` (4 more replies)
  0 siblings, 5 replies; 59+ messages in thread
From: Tycho Andersen @ 2018-02-04 10:49 UTC (permalink / raw)
  To: linux-kernel, containers
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman,
	Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda,
	Tycho Andersen

Several months ago at Linux Plumbers, we had a discussion about adding a
feature to seccomp which would allow seccomp to trigger a notification for some
other process. Here's a draft of that feature.

Patch 1 contains the bulk of it; patches 2 & 3 offer an alternative way to
acquire the fd that receives notifications, via ptrace (the method in patch 1
poses some problems). Other suggestions for how to acquire one of these fds
would be welcome.

Take a close look at the synchronization. I think I've got it right, but I
probably don't :)

Thanks!

Tycho Andersen (3):
  seccomp: add a return code to trap to userspace
  seccomp: hoist out filter resolving logic
  seccomp: add a way to get a listener fd from ptrace

 arch/Kconfig                                  |   7 +
 include/linux/seccomp.h                       |  14 +-
 include/uapi/linux/ptrace.h                   |   1 +
 include/uapi/linux/seccomp.h                  |  18 +-
 kernel/ptrace.c                               |   4 +
 kernel/seccomp.c                              | 467 ++++++++++++++++++++++++--
 tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++-
 7 files changed, 653 insertions(+), 38 deletions(-)

-- 
2.14.1


* [RFC 1/3] seccomp: add a return code to trap to userspace
       [not found] ` <20180204104946.25559-1-tycho-E0fblnxP3wo@public.gmane.org>
@ 2018-02-04 10:49   ` Tycho Andersen
  2018-02-04 10:49   ` [RFC 2/3] seccomp: hoist out filter resolving logic Tycho Andersen
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 59+ messages in thread
From: Tycho Andersen @ 2018-02-04 10:49 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Kees Cook, Akihiro Suda, Oleg Nesterov, Andy Lutomirski,
	Eric W . Biederman, Christian Brauner, Tyler Hicks

This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mknod(), since this checks
capable(CAP_MKNOD). However, harmless devices like /dev/null or
/dev/zero should be ok for containers to mknod, but we'd like to avoid hard
coding some whitelist in the kernel. Another example is mount(), which has
many security restrictions for good reason, but configuration or runtime
knowledge could potentially be used to relax these restrictions.
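
To make the mknod() case concrete, here is a minimal sketch of the kind of
policy a userspace handler could apply instead of a kernel whitelist. The
function name mknod_allowed() is made up for illustration and is not part of
this patch; the device numbers are the standard Linux ones for /dev/null
(char 1:3) and /dev/zero (char 1:5).

```c
#define _DEFAULT_SOURCE
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Hypothetical policy helper: returns 1 if the requested node is one of
 * the harmless character devices /dev/null (1:3) or /dev/zero (1:5),
 * 0 otherwise.
 */
int mknod_allowed(mode_t mode, dev_t dev)
{
	if (!S_ISCHR(mode))
		return 0;	/* block devices, FIFOs etc. are refused here */
	if (major(dev) != 1)
		return 0;	/* only the "mem" major is considered */
	return minor(dev) == 3 || minor(dev) == 5;
}
```

A real handler would take mode and dev from the args[] array of the
notification and could of course make the list configurable.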

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since Upstart itself uses ptrace to start services.

The actual implementation of this is fairly small, although getting the
synchronization right was (and still is) slightly complex. It is also worth
noting that one race is still present:

  1. a task does a SECCOMP_RET_USER_NOTIF
  2. the userspace handler reads this notification
  3. the task dies
  4. a new task with the same pid starts
  5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
     that the previous one did
  6. the userspace handler writes a response

There's no way to distinguish this case right now. Maybe we care, maybe we
don't, but it's worth noting.
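
The reason the cookie can repeat is that ids only have to be unique among the
notifications that are queued at that moment. A small userspace model of that
rule, as a sketch only: next_notify_id(), demo_next() and id_reuse_demo() are
illustrative names, and demo_next() is a deterministic stand-in for the random
source the kernel code uses.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model of the id picker: draw from next() until the value
 * is not present in pending[0..n).
 */
uint32_t next_notify_id(const uint32_t *pending, size_t n,
			uint32_t (*next)(void))
{
	uint32_t id = next();
	size_t i = 0;

	while (i < n) {
		if (pending[i] == id) {
			id = next();	/* collision: redraw and rescan */
			i = 0;
		} else {
			i++;
		}
	}
	return id;
}

/* Deterministic stand-in for a random source, for illustration only. */
static const uint32_t demo_seq[] = { 5, 5, 7, 5 };
static size_t demo_pos;

uint32_t demo_next(void)
{
	return demo_seq[demo_pos++ % 4];
}

/* Returns 1 if id 5 is avoided while pending and handed out again later. */
int id_reuse_demo(void)
{
	uint32_t pending[1] = { 5 };
	uint32_t first, second;

	demo_pos = 0;
	/* id 5 is still queued, so the draws 5, 5 are rejected */
	first = next_notify_id(pending, 1, demo_next);
	/* the old entry is gone (task died), so 5 is acceptable again */
	second = next_notify_id(pending, 0, demo_next);
	return first == 7 && second == 5;
}
```

Once the first notification has been removed from the list, nothing stops a
later request from receiving the id the handler is still holding.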

Right now the interface is a simple structure copy across a file
descriptor. We could potentially invent something fancier.
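
For reference, the handler side of that structure copy could look roughly
like the sketch below. The rfc_* structures are local mirrors of the layouts
proposed in this patch (renamed so they cannot clash with any installed
headers), handle_one() is a hypothetical helper, and roundtrip_demo() uses a
socketpair as a stand-in for the listener fd so that the sketch runs on a
kernel without this patch.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Local mirrors of the proposed structures; the names are ours. */
struct rfc_seccomp_data {
	int nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

struct rfc_notif {
	uint32_t id;
	pid_t pid;
	struct rfc_seccomp_data data;
};

struct rfc_notif_resp {
	uint32_t id;
	int error;
	long val;
};

/* Read one request from fd and answer it with val. Returns 0 or -1. */
int handle_one(int fd, long val)
{
	struct rfc_notif req;
	struct rfc_notif_resp resp;

	if (read(fd, &req, sizeof(req)) != (ssize_t)sizeof(req))
		return -1;

	memset(&resp, 0, sizeof(resp));
	resp.id = req.id;	/* echo the cookie so it can be matched */
	resp.error = 0;
	resp.val = val;

	if (write(fd, &resp, sizeof(resp)) != (ssize_t)sizeof(resp))
		return -1;
	return 0;
}

/* Plays the kernel's part over a socketpair. Returns 0 on success. */
int roundtrip_demo(void)
{
	int sv[2];
	struct rfc_notif req;
	struct rfc_notif_resp resp;
	int ok;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	memset(&req, 0, sizeof(req));
	req.id = 42;
	req.pid = 1234;
	req.data.nr = 39;

	ok = write(sv[1], &req, sizeof(req)) == (ssize_t)sizeof(req) &&
	     handle_one(sv[0], 7) == 0 &&
	     read(sv[1], &resp, sizeof(resp)) == (ssize_t)sizeof(resp) &&
	     resp.id == 42 && resp.error == 0 && resp.val == 7;

	close(sv[0]);
	close(sv[1]);
	return ok ? 0 : -1;
}
```

With the real listener fd, the read() would block until a task hits
SECCOMP_RET_USER_NOTIF, and the write() would wake that task up.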

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.
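
A minimal sketch of that copy-then-check pattern, using plain buffers in
place of real task memory. All names here (struct snapshot, take_snapshot(),
policy_allows(), toctou_demo()) are illustrative and the policy is an
arbitrary example; the point is only that the decision and any later action
use the private copy.

```c
#include <stdio.h>
#include <string.h>

#define SNAP_MAX 64

/* Private copy of whatever the handler needs from the task's memory. */
struct snapshot {
	char path[SNAP_MAX];
};

/* Copy the string once; snprintf() always NUL-terminates the copy. */
void take_snapshot(struct snapshot *s, const char *task_mem)
{
	snprintf(s->path, sizeof(s->path), "%s", task_mem);
}

/* Example policy, evaluated on the snapshot only. */
int policy_allows(const struct snapshot *s)
{
	return strcmp(s->path, "/dev/null") == 0;
}

/* Returns 1 if a later edit by the task does not change the outcome. */
int toctou_demo(void)
{
	char task_mem[SNAP_MAX] = "/dev/null";
	struct snapshot s;

	take_snapshot(&s, task_mem);

	/* The task rewrites its buffer after the handler has copied it. */
	strcpy(task_mem, "/dev/sda");

	/* Both the check and the value acted upon come from the copy. */
	return policy_allows(&s) && strcmp(s.path, "/dev/null") == 0;
}
```

The bug to avoid is checking one copy and then re-reading the task's memory
when performing the operation.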

Signed-off-by: Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org>
CC: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
CC: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
CC: "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
CC: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
CC: Tyler Hicks <tyhicks-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
CC: Akihiro Suda <suda.akihiro-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
---
 arch/Kconfig                                  |   7 +
 include/linux/seccomp.h                       |   3 +-
 include/uapi/linux/seccomp.h                  |  18 +-
 kernel/seccomp.c                              | 366 +++++++++++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 114 +++++++-
 5 files changed, 502 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 400b9e1b2f27..2946cb6fd704 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -387,6 +387,13 @@ config SECCOMP_FILTER
 
 	  See Documentation/prctl/seccomp_filter.txt for details.
 
+config SECCOMP_USER_NOTIFICATION
+	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
+	depends on SECCOMP_FILTER
+	help
+	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
+	  programs to notify a userspace listener that a particular event happened.
+
 config HAVE_GCC_PLUGINS
 	bool
 	help
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 10f25f7e4304..ce07da2ffd53 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,7 +5,8 @@
 #include <uapi/linux/seccomp.h>
 
 #define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
-					 SECCOMP_FILTER_FLAG_LOG)
+					 SECCOMP_FILTER_FLAG_LOG | \
+					 SECCOMP_FILTER_FLAG_GET_LISTENER)
 
 #ifdef CONFIG_SECCOMP
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 2a0bd9dd104d..4a342aa2e524 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -17,8 +17,9 @@
 #define SECCOMP_GET_ACTION_AVAIL	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC	1
-#define SECCOMP_FILTER_FLAG_LOG		2
+#define SECCOMP_FILTER_FLAG_TSYNC		1
+#define SECCOMP_FILTER_FLAG_LOG			2
+#define SECCOMP_FILTER_FLAG_GET_LISTENER	4
 
 /*
  * All BPF programs must return a 32-bit value.
@@ -34,6 +35,7 @@
 #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
 #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
 #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF	 0x7fc00000U /* notifies userspace */
 #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
 #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
 #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
@@ -59,4 +61,16 @@ struct seccomp_data {
 	__u64 args[6];
 };
 
+struct seccomp_notif {
+	__u32 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u32 id;
+	int error;
+	long val;
+};
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 5f0dfb2abb8d..9541eb379e74 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -38,6 +38,52 @@
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+	SECCOMP_NOTIFY_INIT,
+	SECCOMP_NOTIFY_READ,
+	SECCOMP_NOTIFY_WRITE,
+};
+
+struct seccomp_knotif {
+	/* The pid whose filter triggered the notification */
+	pid_t pid;
+
+	/*
+	 * The "cookie" for this request; this is unique for this filter.
+	 */
+	u32 id;
+
+	/*
+	 * The seccomp data. This pointer is valid the entire time this
+	 * notification is active, since it comes from __seccomp_filter which
+	 * eclipses the entire lifecycle here.
+	 */
+	const struct seccomp_data *data;
+
+	/*
+	 * SECCOMP_NOTIFY_INIT: someone has made this request, but it has not
+	 * 	yet been sent to userspace
+	 * SECCOMP_NOTIFY_READ: sent to userspace but no response yet
+	 * SECCOMP_NOTIFY_WRITE: we have a response from userspace, but it has
+	 * 	not yet been written back to the application
+	 */
+	enum notify_state state;
+
+	/* The return values, only valid when in SECCOMP_NOTIFY_WRITE */
+	int error;
+	long val;
+
+	/* Signals when this has entered SECCOMP_NOTIFY_WRITE */
+	struct completion ready;
+
+	struct list_head list;
+};
+#endif
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -64,6 +110,30 @@ struct seccomp_filter {
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	/*
+	 * A semaphore that users of this notification can wait on for
+	 * changes. Actual reads and writes are still controlled with
+	 * filter->notify_lock.
+	 */
+	struct semaphore request;
+
+	/*
+	 * A lock for all notification-related accesses.
+	 */
+	struct mutex notify_lock;
+
+	/*
+	 * Is there currently an attached listener?
+	 */
+	bool has_listener;
+
+	/*
+	 * A list of struct seccomp_knotif elements.
+	 */
+	struct list_head notifications;
+#endif
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -383,6 +453,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	mutex_init(&sfilter->notify_lock);
+	sema_init(&sfilter->request, 0);
+	INIT_LIST_HEAD(&sfilter->notifications);
+#endif
+
 	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
@@ -547,13 +623,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
 #define SECCOMP_LOG_TRACE		(1 << 4)
 #define SECCOMP_LOG_LOG			(1 << 5)
 #define SECCOMP_LOG_ALLOW		(1 << 6)
+#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
 
 static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
 				    SECCOMP_LOG_KILL_THREAD  |
 				    SECCOMP_LOG_TRAP  |
 				    SECCOMP_LOG_ERRNO |
 				    SECCOMP_LOG_TRACE |
-				    SECCOMP_LOG_LOG;
+				    SECCOMP_LOG_LOG |
+				    SECCOMP_LOG_USER_NOTIF;
 
 static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 			       bool requested)
@@ -572,6 +650,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 	case SECCOMP_RET_TRACE:
 		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
+		break;
 	case SECCOMP_RET_LOG:
 		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
 		break;
@@ -645,6 +726,89 @@ void secure_computing_strict(int this_syscall)
 }
 #else
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+/*
+ * Finds the next unique notification id.
+ */
+static u32 seccomp_next_notify_id(struct list_head *list)
+{
+	struct seccomp_knotif *knotif = NULL;
+	struct list_head *cur;
+	u32 id = get_random_u32();
+
+again:
+	list_for_each(cur, list) {
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+
+		if (knotif->id == id) {
+			id = get_random_u32();
+			goto again;
+		}
+	}
+
+	return id;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	int err;
+	long ret = 0;
+	struct seccomp_knotif n = {};
+
+	mutex_lock(&match->notify_lock);
+	if (!match->has_listener) {
+		err = -ENOSYS;
+		goto out;
+	}
+
+	n.pid = current->pid;
+	n.state = SECCOMP_NOTIFY_INIT;
+	n.data = sd;
+	n.id = seccomp_next_notify_id(&match->notifications);
+	init_completion(&n.ready);
+
+	list_add(&n.list, &match->notifications);
+
+	mutex_unlock(&match->notify_lock);
+	up(&match->request);
+
+	err = wait_for_completion_interruptible(&n.ready);
+	/*
+	 * This syscall is getting interrupted. We no longer need to
+	 * tell userspace about it, and any userspace responses should
+	 * be ignored.
+	 */
+	mutex_lock(&match->notify_lock);
+	if (err < 0)
+		goto remove_list;
+
+	ret = n.val;
+	err = n.error;
+
+	WARN(n.state != SECCOMP_NOTIFY_WRITE,
+	     "notified about write complete when state is not write");
+
+remove_list:
+	list_del(&n.list);
+out:
+	mutex_unlock(&match->notify_lock);
+	syscall_set_return_value(current, task_pt_regs(current),
+				 err, ret);
+}
+#else
+/* Same prototype as the enabled variant, so the call site builds either way. */
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	WARN(1, "user notification received, but disabled");
+	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
+	do_exit(SIGSYS);
+}
+#endif
+
 #ifdef CONFIG_SECCOMP_FILTER
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
@@ -722,6 +886,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 
 		return 0;
 
+	case SECCOMP_RET_USER_NOTIF:
+		seccomp_do_user_notification(this_syscall, match, sd);
+		goto skip;
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
 		return 0;
@@ -828,6 +995,10 @@ static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static struct file *init_listener(struct seccomp_filter *filter);
+#endif
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
@@ -847,6 +1018,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
+	int listener = 0;
+	struct file *listener_f = NULL;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -857,13 +1030,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
+	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+		listener = get_unused_fd_flags(O_RDWR);
+		if (listener < 0) {
+			ret = listener;
+			goto out_free;
+		}
+
+		listener_f = init_listener(prepared);
+		if (IS_ERR(listener_f)) {
+			put_unused_fd(listener);
+			ret = PTR_ERR(listener_f);
+			goto out_free;
+		}
+	}
+
 	/*
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_free;
+		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
 
@@ -881,6 +1069,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+		if (ret < 0) {
+			fput(listener_f);
+			put_unused_fd(listener);
+		} else {
+			fd_install(listener, listener_f);
+			ret = listener;
+		}
+	}
 out_free:
 	seccomp_filter_free(prepared);
 	return ret;
@@ -909,6 +1107,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
 	case SECCOMP_RET_LOG:
 	case SECCOMP_RET_ALLOW:
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
+			break;
 	default:
 		return -EOPNOTSUPP;
 	}
@@ -1057,6 +1258,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
 #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
 #define SECCOMP_RET_TRAP_NAME		"trap"
 #define SECCOMP_RET_ERRNO_NAME		"errno"
+#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
 #define SECCOMP_RET_TRACE_NAME		"trace"
 #define SECCOMP_RET_LOG_NAME		"log"
 #define SECCOMP_RET_ALLOW_NAME		"allow"
@@ -1066,6 +1268,7 @@ static const char seccomp_actions_avail[] =
 				SECCOMP_RET_KILL_THREAD_NAME	" "
 				SECCOMP_RET_TRAP_NAME		" "
 				SECCOMP_RET_ERRNO_NAME		" "
+				SECCOMP_RET_USER_NOTIF_NAME	" "
 				SECCOMP_RET_TRACE_NAME		" "
 				SECCOMP_RET_LOG_NAME		" "
 				SECCOMP_RET_ALLOW_NAME;
@@ -1083,6 +1286,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
 	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
 	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
 	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
+	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
 	{ }
 };
 
@@ -1231,3 +1435,161 @@ static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct list_head *cur;
+
+	mutex_lock(&filter->notify_lock);
+
+	/*
+	 * If this file is being closed because e.g. the task who owned it
+	 * died, let's wake everyone up who was waiting on us.
+	 */
+	list_for_each(cur, &filter->notifications) {
+		struct seccomp_knotif *knotif;
+
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+
+		knotif->state = SECCOMP_NOTIFY_WRITE;
+		knotif->error = -ENOSYS;
+		knotif->val = 0;
+		complete(&knotif->ready);
+	}
+
+	filter->has_listener = false;
+	mutex_unlock(&filter->notify_lock);
+	__put_seccomp_filter(filter);
+	return 0;
+}
+
+static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
+				   size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = f->private_data;
+	struct seccomp_knotif *knotif = NULL;
+	struct seccomp_notif unotif;
+	struct list_head *cur;
+	ssize_t ret;
+
+	/* No offset reads. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	ret = down_interruptible(&filter->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each(cur, &filter->notifications) {
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+		if (knotif->state == SECCOMP_NOTIFY_INIT)
+			break;
+	}
+
+	/*
+	 * We didn't find anything which is odd, because at least one
+	 * thing should have been queued.
+	 */
+	if (!knotif || knotif->state != SECCOMP_NOTIFY_INIT) {
+		ret = -ENOENT;
+		WARN(1, "no seccomp notification found");
+		goto out;
+	}
+
+	unotif.id = knotif->id;
+	unotif.pid = knotif->pid;
+	unotif.data = *(knotif->data);
+
+	size = min_t(size_t, size, sizeof(struct seccomp_notif));
+	if (copy_to_user(buf, &unotif, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = sizeof(unotif);
+	knotif->state = SECCOMP_NOTIFY_READ;
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
+				    size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL;
+	struct list_head *cur;
+	ssize_t ret = -EINVAL;
+
+	/* No offset writes. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	size = min_t(size_t, size, sizeof(resp));
+	if (copy_from_user(&resp, buf, size))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each(cur, &filter->notifications) {
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+
+		if (knotif->id == resp.id)
+			break;
+	}
+
+	if (!knotif || knotif->id != resp.id) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_WRITE;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.read = seccomp_notify_read,
+	.write = seccomp_notify_write,
+	/* TODO: poll */
+	.release = seccomp_notify_release,
+};
+
+static struct file *init_listener(struct seccomp_filter *filter)
+{
+	struct file *ret;
+
+	mutex_lock(&filter->notify_lock);
+	if (filter->has_listener) {
+		mutex_unlock(&filter->notify_lock);
+		return ERR_PTR(-EBUSY);
+	}
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret)) {
+		__put_seccomp_filter(filter);
+	} else {
+		/*
+		 * Intentionally don't put_seccomp_filter(). The file
+		 * has a reference to it now.
+		 */
+		filter->has_listener = true;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 24dbf634e2dd..b43e2a70b08c 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -40,6 +40,7 @@
 #include <sys/fcntl.h>
 #include <sys/mman.h>
 #include <sys/times.h>
+#include <sys/socket.h>
 
 #define _GNU_SOURCE
 #include <unistd.h>
@@ -141,6 +142,24 @@ struct seccomp_data {
 #define SECCOMP_FILTER_FLAG_LOG 2
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
+#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+struct seccomp_notif {
+	__u32 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u32 id;
+	int error;
+	long val;
+};
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -2063,7 +2082,8 @@ TEST(seccomp_syscall_mode_lock)
 TEST(detect_seccomp_filter_flags)
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
-				 SECCOMP_FILTER_FLAG_LOG };
+				 SECCOMP_FILTER_FLAG_LOG,
+				 SECCOMP_FILTER_FLAG_GET_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
@@ -2845,6 +2865,98 @@ TEST(get_action_avail)
 	EXPECT_EQ(errno, EOPNOTSUPP);
 }
 
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req;
+	struct seccomp_notif_resp resp;
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		ASSERT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_EQ(true, WIFEXITED(status));
+	ASSERT_EQ(0, WEXITSTATUS(status));
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_GET_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ASSERT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	ASSERT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_EQ(true, WIFEXITED(status));
+	ASSERT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read(listener, &req, sizeof(req));
+	ASSERT_EQ(ret, sizeof(req));
+
+	ASSERT_EQ(kill(pid, SIGKILL), 0);
+	ASSERT_EQ(waitpid(pid, NULL, 0), pid);
+
+	resp.id = req.id;
+	ret = write(listener, &resp, sizeof(resp));
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	close(listener);
+}
+
 /*
  * TODO:
  * - add microbenchmarks
-- 
2.14.1


* [RFC 1/3] seccomp: add a return code to trap to userspace
  2018-02-04 10:49 [RFC 0/3] seccomp trap to userspace Tycho Andersen
@ 2018-02-04 10:49 ` Tycho Andersen
  2018-02-13 21:09   ` Kees Cook
       [not found]   ` <20180204104946.25559-2-tycho-E0fblnxP3wo@public.gmane.org>
  2018-02-04 10:49 ` [RFC 2/3] seccomp: hoist out filter resolving logic Tycho Andersen
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 59+ messages in thread
From: Tycho Andersen @ 2018-02-04 10:49 UTC (permalink / raw)
  To: linux-kernel, containers
  Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman,
	Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda,
	Tycho Andersen

This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mknod(), since this checks
capable(CAP_MKNOD). However, harmless devices like /dev/null or
/dev/zero should be ok for containers to mknod, but we'd like to avoid hard
coding some whitelist in the kernel. Another example is mount(), which has
many security restrictions for good reason, but configuration or runtime
knowledge could potentially be used to relax these restrictions.

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since Upstart itself uses ptrace to start services.

The actual implementation of this is fairly small, although getting the
synchronization right was (and still is) slightly complex. It is also worth
noting that one race is still present:

  1. a task does a SECCOMP_RET_USER_NOTIF
  2. the userspace handler reads this notification
  3. the task dies
  4. a new task with the same pid starts
  5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
     that the previous one did
  6. the userspace handler writes a response

There's no way to distinguish this case right now. Maybe we care, maybe we
don't, but it's worth noting.

Right now the interface is a simple structure copy across a file
descriptor. We could potentially invent something fancier.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 arch/Kconfig                                  |   7 +
 include/linux/seccomp.h                       |   3 +-
 include/uapi/linux/seccomp.h                  |  18 +-
 kernel/seccomp.c                              | 366 +++++++++++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 114 +++++++-
 5 files changed, 502 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 400b9e1b2f27..2946cb6fd704 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -387,6 +387,13 @@ config SECCOMP_FILTER
 
 	  See Documentation/prctl/seccomp_filter.txt for details.
 
+config SECCOMP_USER_NOTIFICATION
+	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
+	depends on SECCOMP_FILTER
+	help
+	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
+	  programs to notify a userspace listener that a particular event happened.
+
 config HAVE_GCC_PLUGINS
 	bool
 	help
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 10f25f7e4304..ce07da2ffd53 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,7 +5,8 @@
 #include <uapi/linux/seccomp.h>
 
 #define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
-					 SECCOMP_FILTER_FLAG_LOG)
+					 SECCOMP_FILTER_FLAG_LOG | \
+					 SECCOMP_FILTER_FLAG_GET_LISTENER)
 
 #ifdef CONFIG_SECCOMP
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 2a0bd9dd104d..4a342aa2e524 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -17,8 +17,9 @@
 #define SECCOMP_GET_ACTION_AVAIL	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC	1
-#define SECCOMP_FILTER_FLAG_LOG		2
+#define SECCOMP_FILTER_FLAG_TSYNC		1
+#define SECCOMP_FILTER_FLAG_LOG			2
+#define SECCOMP_FILTER_FLAG_GET_LISTENER	4
 
 /*
  * All BPF programs must return a 32-bit value.
@@ -34,6 +35,7 @@
 #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
 #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
 #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF	 0x7fc00000U /* notifies userspace */
 #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
 #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
 #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
@@ -59,4 +61,16 @@ struct seccomp_data {
 	__u64 args[6];
 };
 
+struct seccomp_notif {
+	__u32 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u32 id;
+	int error;
+	long val;
+};
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 5f0dfb2abb8d..9541eb379e74 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -38,6 +38,52 @@
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+	SECCOMP_NOTIFY_INIT,
+	SECCOMP_NOTIFY_READ,
+	SECCOMP_NOTIFY_WRITE,
+};
+
+struct seccomp_knotif {
+	/* The pid whose filter triggered the notification */
+	pid_t pid;
+
+	/*
+	 * The "cookie" for this request; this is unique for this filter.
+	 */
+	u32 id;
+
+	/*
+	 * The seccomp data. This pointer is valid the entire time this
+	 * notification is active, since it comes from __seccomp_filter, whose
+	 * invocation spans the entire lifecycle here.
+	 */
+	const struct seccomp_data *data;
+
+	/*
+	 * SECCOMP_NOTIFY_INIT: someone has made this request, but it has not
+	 * 	yet been sent to userspace
+	 * SECCOMP_NOTIFY_READ: sent to userspace but no response yet
+	 * SECCOMP_NOTIFY_WRITE: we have a response from userspace, but it has
+	 * 	not yet been written back to the application
+	 */
+	enum notify_state state;
+
+	/* The return values, only valid when in SECCOMP_NOTIFY_WRITE */
+	int error;
+	long val;
+
+	/* Signals when this has entered SECCOMP_NOTIFY_WRITE */
+	struct completion ready;
+
+	struct list_head list;
+};
+#endif
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -64,6 +110,30 @@ struct seccomp_filter {
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	/*
+	 * A semaphore that users of this notification can wait on for
+	 * changes. Actual reads and writes are still controlled with
+	 * filter->notify_lock.
+	 */
+	struct semaphore request;
+
+	/*
+	 * A lock for all notification-related accesses.
+	 */
+	struct mutex notify_lock;
+
+	/*
+	 * Is there currently an attached listener?
+	 */
+	bool has_listener;
+
+	/*
+	 * A list of struct seccomp_knotif elements.
+	 */
+	struct list_head notifications;
+#endif
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -383,6 +453,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	mutex_init(&sfilter->notify_lock);
+	sema_init(&sfilter->request, 0);
+	INIT_LIST_HEAD(&sfilter->notifications);
+#endif
+
 	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
@@ -547,13 +623,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
 #define SECCOMP_LOG_TRACE		(1 << 4)
 #define SECCOMP_LOG_LOG			(1 << 5)
 #define SECCOMP_LOG_ALLOW		(1 << 6)
+#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
 
 static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
 				    SECCOMP_LOG_KILL_THREAD  |
 				    SECCOMP_LOG_TRAP  |
 				    SECCOMP_LOG_ERRNO |
 				    SECCOMP_LOG_TRACE |
-				    SECCOMP_LOG_LOG;
+				    SECCOMP_LOG_LOG |
+				    SECCOMP_LOG_USER_NOTIF;
 
 static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 			       bool requested)
@@ -572,6 +650,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 	case SECCOMP_RET_TRACE:
 		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
+		break;
 	case SECCOMP_RET_LOG:
 		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
 		break;
@@ -645,6 +726,89 @@ void secure_computing_strict(int this_syscall)
 }
 #else
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+/*
+ * Finds the next unique notification id.
+ */
+static u32 seccomp_next_notify_id(struct list_head *list)
+{
+	struct seccomp_knotif *knotif = NULL;
+	struct list_head *cur;
+	u32 id = get_random_u32();
+
+again:
+	list_for_each(cur, list) {
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+
+		if (knotif->id == id) {
+			id = get_random_u32();
+			goto again;
+		}
+	}
+
+	return id;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	int err;
+	long ret = 0;
+	struct seccomp_knotif n = {};
+
+	mutex_lock(&match->notify_lock);
+	if (!match->has_listener) {
+		err = -ENOSYS;
+		goto out;
+	}
+
+	n.pid = current->pid;
+	n.state = SECCOMP_NOTIFY_INIT;
+	n.data = sd;
+	n.id = seccomp_next_notify_id(&match->notifications);
+	init_completion(&n.ready);
+
+	list_add(&n.list, &match->notifications);
+
+	mutex_unlock(&match->notify_lock);
+	up(&match->request);
+
+	err = wait_for_completion_interruptible(&n.ready);
+	/*
+	 * This syscall is getting interrupted. We no longer need to
+	 * tell userspace about it, and any userspace responses should
+	 * be ignored.
+	 */
+	mutex_lock(&match->notify_lock);
+	if (err < 0)
+		goto remove_list;
+
+	ret = n.val;
+	err = n.error;
+
+	WARN(n.state != SECCOMP_NOTIFY_WRITE,
+	     "notified about write complete when state is not write");
+
+remove_list:
+	list_del(&n.list);
+out:
+	mutex_unlock(&match->notify_lock);
+	syscall_set_return_value(current, task_pt_regs(current),
+				 err, ret);
+}
+#else
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	/* Built without CONFIG_SECCOMP_USER_NOTIFICATION: kill the task. */
+	WARN(1, "user notification received, but disabled");
+	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
+	do_exit(SIGSYS);
+}
+#endif
+
 #ifdef CONFIG_SECCOMP_FILTER
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
@@ -722,6 +886,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 
 		return 0;
 
+	case SECCOMP_RET_USER_NOTIF:
+		seccomp_do_user_notification(this_syscall, match, sd);
+		goto skip;
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
 		return 0;
@@ -828,6 +995,10 @@ static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static struct file *init_listener(struct seccomp_filter *filter);
+#endif
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
@@ -847,6 +1018,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
+	int listener = 0;
+	struct file *listener_f = NULL;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -857,13 +1030,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
+	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+		listener = get_unused_fd_flags(O_RDWR);
+		if (listener < 0) {
+			ret = listener;
+			goto out_free;
+		}
+
+		listener_f = init_listener(prepared);
+		if (IS_ERR(listener_f)) {
+			put_unused_fd(listener);
+			ret = PTR_ERR(listener_f);
+			goto out_free;
+		}
+	}
+
 	/*
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_free;
+		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
 
@@ -881,6 +1069,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
+		if (ret < 0) {
+			fput(listener_f);
+			put_unused_fd(listener);
+		} else {
+			fd_install(listener, listener_f);
+			ret = listener;
+		}
+	}
 out_free:
 	seccomp_filter_free(prepared);
 	return ret;
@@ -909,6 +1107,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
 	case SECCOMP_RET_LOG:
 	case SECCOMP_RET_ALLOW:
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
+			break;
 	default:
 		return -EOPNOTSUPP;
 	}
@@ -1057,6 +1258,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
 #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
 #define SECCOMP_RET_TRAP_NAME		"trap"
 #define SECCOMP_RET_ERRNO_NAME		"errno"
+#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
 #define SECCOMP_RET_TRACE_NAME		"trace"
 #define SECCOMP_RET_LOG_NAME		"log"
 #define SECCOMP_RET_ALLOW_NAME		"allow"
@@ -1066,6 +1268,7 @@ static const char seccomp_actions_avail[] =
 				SECCOMP_RET_KILL_THREAD_NAME	" "
 				SECCOMP_RET_TRAP_NAME		" "
 				SECCOMP_RET_ERRNO_NAME		" "
+				SECCOMP_RET_USER_NOTIF_NAME	" "
 				SECCOMP_RET_TRACE_NAME		" "
 				SECCOMP_RET_LOG_NAME		" "
 				SECCOMP_RET_ALLOW_NAME;
@@ -1083,6 +1286,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
 	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
 	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
 	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
+	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
 	{ }
 };
 
@@ -1231,3 +1435,161 @@ static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct list_head *cur;
+
+	mutex_lock(&filter->notify_lock);
+
+	/*
+	 * If this file is being closed because e.g. the task who owned it
+	 * died, let's wake everyone up who was waiting on us.
+	 */
+	list_for_each(cur, &filter->notifications) {
+		struct seccomp_knotif *knotif;
+
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+
+		knotif->state = SECCOMP_NOTIFY_WRITE;
+		knotif->error = -ENOSYS;
+		knotif->val = 0;
+		complete(&knotif->ready);
+	}
+
+	filter->has_listener = false;
+	mutex_unlock(&filter->notify_lock);
+	__put_seccomp_filter(filter);
+	return 0;
+}
+
+static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
+				   size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = f->private_data;
+	struct seccomp_knotif *knotif = NULL;
+	struct seccomp_notif unotif;
+	struct list_head *cur;
+	ssize_t ret;
+
+	/* No offset reads. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	ret = down_interruptible(&filter->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each(cur, &filter->notifications) {
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+		if (knotif->state == SECCOMP_NOTIFY_INIT)
+			break;
+	}
+
+	/*
+	 * We didn't find anything which is odd, because at least one
+	 * thing should have been queued.
+	 */
+	if (!knotif || knotif->state != SECCOMP_NOTIFY_INIT) {
+		ret = -ENOENT;
+		WARN(1, "no seccomp notification found");
+		goto out;
+	}
+
+	unotif.id = knotif->id;
+	unotif.pid = knotif->pid;
+	unotif.data = *(knotif->data);
+
+	size = min_t(size_t, size, sizeof(struct seccomp_notif));
+	if (copy_to_user(buf, &unotif, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_READ;
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
+				    size_t size, loff_t *ppos)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL;
+	struct list_head *cur;
+	ssize_t ret = -EINVAL;
+
+	/* No offset writes. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	size = min_t(size_t, size, sizeof(resp));
+	if (copy_from_user(&resp, buf, size))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each(cur, &filter->notifications) {
+		knotif = list_entry(cur, struct seccomp_knotif, list);
+
+		if (knotif->id == resp.id)
+			break;
+	}
+
+	if (!knotif || knotif->id != resp.id) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_WRITE;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.read = seccomp_notify_read,
+	.write = seccomp_notify_write,
+	/* TODO: poll */
+	.release = seccomp_notify_release,
+};
+
+static struct file *init_listener(struct seccomp_filter *filter)
+{
+	struct file *ret;
+
+	mutex_lock(&filter->notify_lock);
+	if (filter->has_listener) {
+		mutex_unlock(&filter->notify_lock);
+		return ERR_PTR(-EBUSY);
+	}
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret)) {
+		__put_seccomp_filter(filter);
+	} else {
+		/*
+		 * Intentionally don't put_seccomp_filter(). The file
+		 * has a reference to it now.
+		 */
+		filter->has_listener = true;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 24dbf634e2dd..b43e2a70b08c 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -40,6 +40,7 @@
 #include <sys/fcntl.h>
 #include <sys/mman.h>
 #include <sys/times.h>
+#include <sys/socket.h>
 
 #define _GNU_SOURCE
 #include <unistd.h>
@@ -141,6 +142,24 @@ struct seccomp_data {
 #define SECCOMP_FILTER_FLAG_LOG 2
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
+#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+struct seccomp_notif {
+	__u32 id;
+	pid_t pid;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u32 id;
+	int error;
+	long val;
+};
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -2063,7 +2082,8 @@ TEST(seccomp_syscall_mode_lock)
 TEST(detect_seccomp_filter_flags)
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
-				 SECCOMP_FILTER_FLAG_LOG };
+				 SECCOMP_FILTER_FLAG_LOG,
+				 SECCOMP_FILTER_FLAG_GET_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
@@ -2845,6 +2865,98 @@ TEST(get_action_avail)
 	EXPECT_EQ(errno, EOPNOTSUPP);
 }
 
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req;
+	struct seccomp_notif_resp resp;
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		ASSERT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_EQ(true, WIFEXITED(status));
+	ASSERT_EQ(0, WEXITSTATUS(status));
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_GET_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ASSERT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	ASSERT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_EQ(true, WIFEXITED(status));
+	ASSERT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read(listener, &req, sizeof(req));
+	ASSERT_EQ(ret, sizeof(req));
+
+	ASSERT_EQ(kill(pid, SIGKILL), 0);
+	ASSERT_EQ(waitpid(pid, NULL, 0), pid);
+
+	resp.id = req.id;
+	ret = write(listener, &resp, sizeof(resp));
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	close(listener);
+}
+
 /*
  * TODO:
  * - add microbenchmarks
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC 2/3] seccomp: hoist out filter resolving logic
       [not found] ` <20180204104946.25559-1-tycho-E0fblnxP3wo@public.gmane.org>
  2018-02-04 10:49   ` [RFC 1/3] seccomp: add a return code to trap to userspace Tycho Andersen
@ 2018-02-04 10:49   ` Tycho Andersen
  2018-02-04 10:49   ` [RFC 3/3] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
  2018-03-15 16:09   ` [RFC 0/3] seccomp trap to userspace Christian Brauner
  3 siblings, 0 replies; 59+ messages in thread
From: Tycho Andersen @ 2018-02-04 10:49 UTC (permalink / raw)
  To: linux-kernel, containers
  Cc: Kees Cook, Akihiro Suda, Oleg Nesterov, Andy Lutomirski,
	Eric W . Biederman, Christian Brauner, Tyler Hicks

Hoist the nth-filter resolving logic that ptrace uses out into a new
function, get_nth_filter(). The next patch uses it to implement the new
PTRACE_SECCOMP_GET_LISTENER command. This is based on an older patch that I
sent a while ago; it significantly revamps the get_nth_filter logic based on
previous suggestions from Oleg.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 kernel/seccomp.c | 77 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 45 insertions(+), 32 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 9541eb379e74..800db3f2866f 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1179,49 +1179,68 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
 }
 
 #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
-long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
-			void __user *data)
+static struct seccomp_filter *get_nth_filter(struct task_struct *task,
+					     unsigned long filter_off)
 {
-	struct seccomp_filter *filter;
-	struct sock_fprog_kern *fprog;
-	long ret;
-	unsigned long count = 0;
-
-	if (!capable(CAP_SYS_ADMIN) ||
-	    current->seccomp.mode != SECCOMP_MODE_DISABLED) {
-		return -EACCES;
-	}
+	struct seccomp_filter *orig, *filter;
+	unsigned long count;
 
+	/*
+	 * Note: this is only correct because the caller should be the (ptrace)
+	 * tracer of the task, otherwise lock_task_sighand is needed.
+	 */
 	spin_lock_irq(&task->sighand->siglock);
+
 	if (task->seccomp.mode != SECCOMP_MODE_FILTER) {
-		ret = -EINVAL;
-		goto out;
+		spin_unlock_irq(&task->sighand->siglock);
+		return ERR_PTR(-EINVAL);
 	}
 
-	filter = task->seccomp.filter;
-	while (filter) {
-		filter = filter->prev;
+	orig = task->seccomp.filter;
+	__get_seccomp_filter(orig);
+	spin_unlock_irq(&task->sighand->siglock);
+
+	count = 0;
+	for (filter = orig; filter; filter = filter->prev)
 		count++;
-	}
 
 	if (filter_off >= count) {
-		ret = -ENOENT;
+		filter = ERR_PTR(-ENOENT);
 		goto out;
 	}
-	count -= filter_off;
 
-	filter = task->seccomp.filter;
-	while (filter && count > 1) {
-		filter = filter->prev;
+	count -= filter_off;
+	for (filter = orig; filter && count > 1; filter = filter->prev)
 		count--;
-	}
 
 	if (WARN_ON(count != 1 || !filter)) {
-		/* The filter tree shouldn't shrink while we're using it. */
-		ret = -ENOENT;
+		filter = ERR_PTR(-ENOENT);
 		goto out;
 	}
 
+	__get_seccomp_filter(filter);
+
+out:
+	__put_seccomp_filter(orig);
+	return filter;
+}
+
+long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
+			void __user *data)
+{
+	struct seccomp_filter *filter;
+	struct sock_fprog_kern *fprog;
+	long ret;
+
+	if (!capable(CAP_SYS_ADMIN) ||
+	    current->seccomp.mode != SECCOMP_MODE_DISABLED) {
+		return -EACCES;
+	}
+
+	filter = get_nth_filter(task, filter_off);
+	if (IS_ERR(filter))
+		return PTR_ERR(filter);
+
 	fprog = filter->prog->orig_prog;
 	if (!fprog) {
 		/* This must be a new non-cBPF filter, since we save
@@ -1236,17 +1255,11 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
 	if (!data)
 		goto out;
 
-	__get_seccomp_filter(filter);
-	spin_unlock_irq(&task->sighand->siglock);
-
 	if (copy_to_user(data, fprog->filter, bpf_classic_proglen(fprog)))
 		ret = -EFAULT;
 
-	__put_seccomp_filter(filter);
-	return ret;
-
 out:
-	spin_unlock_irq(&task->sighand->siglock);
+	__put_seccomp_filter(filter);
 	return ret;
 }
 #endif
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC 3/3] seccomp: add a way to get a listener fd from ptrace
       [not found] ` <20180204104946.25559-1-tycho-E0fblnxP3wo@public.gmane.org>
  2018-02-04 10:49   ` [RFC 1/3] seccomp: add a return code to trap to userspace Tycho Andersen
  2018-02-04 10:49   ` [RFC 2/3] seccomp: hoist out filter resolving logic Tycho Andersen
@ 2018-02-04 10:49   ` Tycho Andersen
  2018-03-15 16:09   ` [RFC 0/3] seccomp trap to userspace Christian Brauner
  3 siblings, 0 replies; 59+ messages in thread
From: Tycho Andersen @ 2018-02-04 10:49 UTC (permalink / raw)
  To: linux-kernel, containers
  Cc: Kees Cook, Akihiro Suda, Oleg Nesterov, Andy Lutomirski,
	Eric W . Biederman, Christian Brauner, Tyler Hicks

As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, a ptrace() request
that acquires a listener fd for one of a task's filters may be useful. There
are at least two reasons this is preferable, even though it uses ptrace:

1. You can control tasks that aren't cooperating with you.
2. You can control tasks whose filters block sendmsg() and socket(): if the
   task installs a filter which blocks these calls, there is no way for it to
   pass an fd obtained with SECCOMP_FILTER_FLAG_GET_LISTENER out to the
   privileged task.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 include/linux/seccomp.h                       | 11 +++++
 include/uapi/linux/ptrace.h                   |  1 +
 kernel/ptrace.c                               |  4 ++
 kernel/seccomp.c                              | 24 ++++++++++
 tools/testing/selftests/seccomp/seccomp_bpf.c | 66 +++++++++++++++++++++++++++
 5 files changed, 106 insertions(+)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index ce07da2ffd53..0d4750e04bb1 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -103,4 +103,15 @@ static inline long seccomp_get_filter(struct task_struct *task,
 	return -EINVAL;
 }
 #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+extern long seccomp_get_listener(struct task_struct *task,
+				 unsigned long filter_off);
+#else
+static inline long seccomp_get_listener(struct task_struct *task,
+					unsigned long filter_off)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_SECCOMP_USER_NOTIFICATION */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index e3939e00980b..60113de59b04 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -66,6 +66,7 @@ struct ptrace_peeksiginfo_args {
 #define PTRACE_SETSIGMASK	0x420b
 
 #define PTRACE_SECCOMP_GET_FILTER	0x420c
+#define PTRACE_SECCOMP_GET_LISTENER	0x420d
 
 /* Read signals from a shared (process wide) queue */
 #define PTRACE_PEEKSIGINFO_SHARED	(1 << 0)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 84b1367935e4..50d8cc8be054 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1092,6 +1092,10 @@ int ptrace_request(struct task_struct *child, long request,
 		ret = seccomp_get_filter(child, addr, datavp);
 		break;
 
+	case PTRACE_SECCOMP_GET_LISTENER:
+		ret = seccomp_get_listener(child, addr);
+		break;
+
 	default:
 		break;
 	}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 800db3f2866f..0b1f65273d2a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1605,4 +1605,28 @@ static struct file *init_listener(struct seccomp_filter *filter)
 	mutex_unlock(&filter->notify_lock);
 	return ret;
 }
+
+long seccomp_get_listener(struct task_struct *task,
+			  unsigned long filter_off)
+{
+	struct seccomp_filter *filter;
+	struct file *listener;
+	int fd;
+
+	filter = get_nth_filter(task, filter_off);
+	if (IS_ERR(filter))
+		return PTR_ERR(filter);
+
+	listener = init_listener(filter);
+	if (IS_ERR(listener))
+		return PTR_ERR(listener);
+
+	fd = get_unused_fd_flags(O_RDWR);
+	if (fd < 0)
+		fput(listener);
+	else
+		fd_install(fd, listener);
+
+	return fd;
+}
 #endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index b43e2a70b08c..80f89a766895 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -168,6 +168,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
 }
 #endif
 
+#ifndef PTRACE_SECCOMP_GET_LISTENER
+#define PTRACE_SECCOMP_GET_LISTENER 0x420d
+#endif
+
 #if __BYTE_ORDER == __LITTLE_ENDIAN
 #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
 #elif __BYTE_ORDER == __BIG_ENDIAN
@@ -2957,6 +2961,68 @@ TEST(get_user_notification_syscall)
 	close(listener);
 }
 
+TEST(get_user_notification_ptrace)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req;
+	struct seccomp_notif_resp resp;
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ASSERT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Test that we get ENOSYS while not attached */
+		ASSERT_EQ(syscall(__NR_getpid), -1);
+		ASSERT_EQ(errno, ENOSYS);
+
+		/* Signal we're ready and have installed the filter. */
+		ASSERT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		ASSERT_EQ(read(sk_pair[1], &c, 1), 1);
+		ASSERT_EQ(c, 'H');
+
+		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+	}
+
+	ASSERT_EQ(read(sk_pair[0], &c, 1), 1);
+	ASSERT_EQ(c, 'J');
+
+	ASSERT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	ASSERT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0);
+	ASSERT_GE(listener, 0);
+
+	/* EBUSY for second listener */
+	ASSERT_EQ(ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0), -1);
+	ASSERT_EQ(errno, EBUSY);
+
+	ASSERT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done and respond with magic */
+	ASSERT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	ASSERT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
+
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	ASSERT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
+
+	ASSERT_EQ(waitpid(pid, &status, 0), pid);
+	ASSERT_EQ(true, WIFEXITED(status));
+	ASSERT_EQ(0, WEXITSTATUS(status));
+
+	close(listener);
+}
+
 /*
  * TODO:
  * - add microbenchmarks
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [RFC 1/3] seccomp: add a return code to trap to userspace
  2018-02-04 10:49 ` [RFC 1/3] seccomp: add a return code to " Tycho Andersen
@ 2018-02-04 17:36       ` Andy Lutomirski
       [not found]   ` <20180204104946.25559-2-tycho-E0fblnxP3wo@public.gmane.org>
  1 sibling, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2018-02-04 17:36 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W . Biederman, Christian Brauner, Tyler Hicks

On Sun, Feb 4, 2018 at 10:49 AM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.

Neat!

>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex. Also worth noting that there
> is one race still present:
>
>   1. a task does a SECCOMP_RET_USER_NOTIF
>   2. the userspace handler reads this notification
>   3. the task dies
>   4. a new task with the same pid starts
>   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
>      that the previous one did
>   6. the userspace handler writes a response

I'm slightly confused.  I thought the id was never reused for a given
struct seccomp_filter.  (Also, shouldn't the id be u64, not u32?)
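
To make the id question concrete, here is a minimal userspace sketch. It is not code from the patch. It assumes each filter hands out ids from a 64-bit counter that only ever increases, and that a response whose id matches no pending notification is rejected. The names toy_filter, toy_notify, toy_task_died and toy_respond are hypothetical, as is the choice of -ENOENT for a stale id.

```c
#include <errno.h>
#include <stdint.h>

/* Toy model: one pending notification per filter, ids never reused. */
struct toy_filter {
	uint64_t next_id;	/* monotonically increasing, never reset */
	uint64_t pending_id;	/* id of the notification awaiting a reply */
	int has_pending;	/* nonzero while a reply is outstanding */
};

/* A task hits the filter: allocate a fresh id for this notification. */
uint64_t toy_notify(struct toy_filter *f)
{
	f->pending_id = f->next_id++;
	f->has_pending = 1;
	return f->pending_id;
}

/* The task died before a reply arrived: drop its notification. */
void toy_task_died(struct toy_filter *f)
{
	f->has_pending = 0;
}

/* The handler replies: accept only if the id matches the pending one. */
int toy_respond(struct toy_filter *f, uint64_t id)
{
	if (!f->has_pending || f->pending_id != id)
		return -ENOENT;	/* assumed error code for a stale id */
	f->has_pending = 0;
	return 0;
}
```

In this model the reply written in step 6 of the race carries the dead task's id, so it cannot be mistaken for a reply to the new task's notification, even though the pid is the same.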

On very quick reading, I have a question.  What happens if a process
has two seccomp_filters attached, one of them returns
SECCOMP_RET_USER_NOTIF, and the *other* one has a listener?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 1/3] seccomp: add a return code to trap to userspace
  2018-02-04 17:36       ` Andy Lutomirski
@ 2018-02-04 20:01           ` Tycho Andersen
  -1 siblings, 0 replies; 59+ messages in thread
From: Tycho Andersen @ 2018-02-04 20:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W . Biederman, Christian Brauner, Tyler Hicks

Hi Andy,

On Sun, Feb 04, 2018 at 05:36:33PM +0000, Andy Lutomirski wrote:
> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex. Also worth noting that there
> > is one race still present:
> >
> >   1. a task does a SECCOMP_RET_USER_NOTIF
> >   2. the userspace handler reads this notification
> >   3. the task dies
> >   4. a new task with the same pid starts
> >   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
> >      that the previous one did
> >   6. the userspace handler writes a response
> 
> I'm slightly confused.  I thought the id was never reused for a given
> struct seccomp_filter.  (Also, shouldn't the id be u64, not u32?)

Well, what happens when u32/64 overflows? Eventually it will wrap.

> On very quick reading, I have a question.  What happens if a process
> has two seccomp_filters attached, one of them returns
> SECCOMP_RET_USER_NOTIF, and the *other* one has a listener?

Good question. In seccomp_run_filters(), the first (lowest, last
applied) filter that returns SECCOMP_RET_USER_NOTIF is the one that
gets the notification, and the other receives nothing.

I don't really have any reason to prefer this behavior, it's just what
happened without much thought.
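
As a rough illustration of the behavior described above, the following toy model walks a filter stack from the most recently applied filter to the oldest and hands the notification to the first filter that asks for it. It is a userspace sketch only, not seccomp_run_filters() itself; chain_filter, chain_action and chain_run_filters are hypothetical names.

```c
#include <stddef.h>

enum chain_action { CHAIN_ALLOW, CHAIN_USER_NOTIF };

struct chain_filter {
	enum chain_action action;	/* what this filter returns for the syscall */
	int has_listener;		/* is a listener fd attached to this filter? */
	int notified;			/* notifications delivered to this filter */
	struct chain_filter *prev;	/* next older filter in the stack */
};

/*
 * Walk from the most recently applied filter towards the oldest and
 * deliver the notification to the first filter that returns
 * CHAIN_USER_NOTIF. Returns that filter, or NULL if none asked.
 */
struct chain_filter *chain_run_filters(struct chain_filter *newest)
{
	struct chain_filter *f;

	for (f = newest; f; f = f->prev) {
		if (f->action == CHAIN_USER_NOTIF) {
			f->notified++;
			return f;
		}
	}
	return NULL;
}
```

With an older filter that has a listener but returns CHAIN_ALLOW, and a newer filter that returns CHAIN_USER_NOTIF without a listener, the walk stops at the newer filter and the older filter's listener never sees anything, which is the case Andy asked about.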

Cheers,

Tycho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 1/3] seccomp: add a return code to trap to userspace
  2018-02-04 20:01           ` Tycho Andersen
  (?)
  (?)
@ 2018-02-04 20:33           ` Andy Lutomirski
  -1 siblings, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2018-02-04 20:33 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W . Biederman, Christian Brauner, Tyler Hicks

On Sun, Feb 4, 2018 at 8:01 PM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> Hi Andy,
>
> On Sun, Feb 04, 2018 at 05:36:33PM +0000, Andy Lutomirski wrote:
>> > The actual implementation of this is fairly small, although getting the
>> > synchronization right was/is slightly complex. Also worth noting that there
>> > is one race still present:
>> >
>> >   1. a task does a SECCOMP_RET_USER_NOTIF
>> >   2. the userspace handler reads this notification
>> >   3. the task dies
>> >   4. a new task with the same pid starts
>> >   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
>> >      that the previous one did
>> >   6. the userspace handler writes a response
>>
>> I'm slightly confused.  I thought the id was never reused for a given
>> struct seccomp_filter.  (Also, shouldn't the id be u64, not u32?)
>
> Well, what happens when u32/64 overflows? Eventually it will wrap.

I think we can safely assume that u64 won't overflow.  Even if we
processed one user return notification on a given seccomp_filter every
nanosecond (which would be insanely fast), that's 584 years.
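
The arithmetic behind that estimate can be checked with a few lines of C. This is a small sketch under the stated assumption of one id consumed per nanosecond; the helper names seconds_until_wrap and seconds_to_years are made up for illustration, and a year is taken as 365.25 days.

```c
/* Seconds until a counter of `bits` bits wraps at `ids_per_sec` ids per second. */
double seconds_until_wrap(unsigned int bits, double ids_per_sec)
{
	double ids = 1.0;
	unsigned int i;

	for (i = 0; i < bits; i++)
		ids *= 2.0;	/* exact: powers of two are representable */
	return ids / ids_per_sec;
}

/* Convert seconds to years of 365.25 days. */
double seconds_to_years(double seconds)
{
	return seconds / (365.25 * 24.0 * 60.0 * 60.0);
}
```

Calling seconds_to_years(seconds_until_wrap(64, 1e9)) gives roughly 584.5 years, while seconds_until_wrap(32, 1e9) is only about 4.3 seconds, which is why the width of the id matters.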

>
>> On very quick reading, I have a question.  What happens if a process
>> has two seccomp_filters attached, one of them returns
>> SECCOMP_RET_USER_NOTIF, and the *other* one has a listener?
>
> Good question, in seccomp_run_filters(), the first (lowest, last
> applied) filter who returns SECCOMP_RET_USER_NOTIF is the one that
> gets the notification and the other receives nothing.
>
> I don't really have any reason to prefer this behavior, it's just what
> happened without much thought.

Hmm.  This won't nest right.  Maybe we should just disallow a
user-notification-using filter from being applied if there is already
one in the stack.  Then, if anyone cares about making these things
nest right, they can fix it.
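
A check along those lines might look like the sketch below. It is a userspace toy, not kernel code: stack_filter and stack_attach are hypothetical names, and returning -EBUSY for the refused filter is an assumption, not something decided in this thread.

```c
#include <errno.h>
#include <stddef.h>

struct stack_filter {
	int uses_user_notif;		/* can this filter return USER_NOTIF? */
	struct stack_filter *prev;	/* next older filter in the stack */
};

/*
 * Push `new_filter` on top of `*top`, refusing a second
 * user-notification filter. Returns 0 on success, or -EBUSY
 * (assumed error code) if one is already in the stack.
 */
int stack_attach(struct stack_filter **top, struct stack_filter *new_filter)
{
	struct stack_filter *f;

	if (new_filter->uses_user_notif) {
		for (f = *top; f; f = f->prev)
			if (f->uses_user_notif)
				return -EBUSY;
	}
	new_filter->prev = *top;
	*top = new_filter;
	return 0;
}
```

Filters that never use user notification still stack freely in this model; only a second notification-using filter is turned away, and the existing stack is left unchanged when that happens.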

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 1/3] seccomp: add a return code to trap to userspace
       [not found]             ` <CALCETrV81yr_zhuBbCTE8NgYx42oq=qvP=nLMsST0iS2wtOZng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-05  8:47               ` Tycho Andersen
  0 siblings, 0 replies; 59+ messages in thread
From: Tycho Andersen @ 2018-02-05  8:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W . Biederman, Christian Brauner, Tyler Hicks

On Sun, Feb 04, 2018 at 08:33:25PM +0000, Andy Lutomirski wrote:
> On Sun, Feb 4, 2018 at 8:01 PM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> > Hi Andy,
> >
> > On Sun, Feb 04, 2018 at 05:36:33PM +0000, Andy Lutomirski wrote:
> >> > The actual implementation of this is fairly small, although getting the
> >> > synchronization right was/is slightly complex. Also worth noting that there
> >> > is one race still present:
> >> >
> >> >   1. a task does a SECCOMP_RET_USER_NOTIF
> >> >   2. the userspace handler reads this notification
> >> >   3. the task dies
> >> >   4. a new task with the same pid starts
> >> >   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
> >> >      that the previous one did
> >> >   6. the userspace handler writes a response
> >>
> >> I'm slightly confused.  I thought the id was never reused for a given
> >> struct seccomp_filter.  (Also, shouldn't the id be u64, not u32?)
> >
> > Well, what happens when u32/64 overflows? Eventually it will wrap.
> 
> I think we can safely assume that u64 won't overflow.  Even if we
> processed one user return notification on a given seccomp_filter every
> nanosecond (which would be insanely fast), that's 584 years.

Yes, fair point re u64. I'll make the change.

> >
> >> On very quick reading, I have a question.  What happens if a process
> >> has two seccomp_filters attached, one of them returns
> >> SECCOMP_RET_USER_NOTIF, and the *other* one has a listener?
> >
> > Good question, in seccomp_run_filters(), the first (lowest, last
> > applied) filter who returns SECCOMP_RET_USER_NOTIF is the one that
> > gets the notification and the other receives nothing.
> >
> > I don't really have any reason to prefer this behavior, it's just what
> > happened without much thought.
> 
> Hmm.  This won't nest right.  Maybe we should just disallow a
> user-notification-using filter from being applied if there is already
> one in the stack.  Then, if anyone cares about making these things
> nest right, they can fix it.

Sounds fine to me, I'll add a check.

Cheers,

Tycho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 1/3] seccomp: add a return code to trap to userspace
       [not found]   ` <20180204104946.25559-2-tycho-E0fblnxP3wo@public.gmane.org>
  2018-02-04 17:36       ` Andy Lutomirski
@ 2018-02-13 21:09     ` Kees Cook
  1 sibling, 0 replies; 59+ messages in thread
From: Kees Cook @ 2018-02-13 21:09 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Linux Containers, Akihiro Suda, Oleg Nesterov, LKML, Paul Moore,
	Eric W . Biederman, Tyler Hicks, Sargun Dhillon,
	Christian Brauner, Andy Lutomirski

On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.

Related to the eBPF seccomp thread, can the logic for these things be
handled entirely by eBPF? My assumption is that you still need to stop
the process to do something (i.e. do a mknod, or a mount) before
letting it continue. Is there some "wait for notification" system in
eBPF?

> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.

Agreed: notification is extremely painful right now. The container
case is compelling, since it will always want a way to trick out these
kinds of filesystem calls.

> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex. Also worth noting that there
> is one race still present:
>
>   1. a task does a SECCOMP_RET_USER_NOTIF
>   2. the userspace handler reads this notification
>   3. the task dies
>   4. a new task with the same pid starts
>   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
>      that the previous one did
>   6. the userspace handler writes a response
>
> There's no way to distinguish this case right now. Maybe we care, maybe we
> don't, but it's worth noting.

So, I'd like to avoid the cookie if possible (surprise). Why isn't it
possible to close the kernel-side of the fd to indicate that it lost
the pid it was attached to? Is this just that the reader has no idea
who is sending messages? So the risk is a fork/die loop within the
same process tree (i.e. attached to the same filter)? Hrmpf. I can't
think of a better way to handle the
one(fd)-to-many(task-with-that-filter-attached) situation...

> Right now the interface is a simple structure copy across a file
> descriptor. We could potentially invent something fancier.

I wonder if this communication should be netlink, which gives a more
well-structured way to describe what's on the wire? The reason I ask
is because if we ever change the seccomp_data structure, we'll now
have two places where we need to deal with it (the first being within
the BPF itself). My initial idea was to prefix the communication with
a size field, then send the structure, and then I had nightmares, and
realized this was basically netlink reinvented.
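
For reference, the size-prefixed idea could be sketched roughly as follows. This is only an illustration of the concept, not a proposed ABI: the wire_notif_v1 layout and the wire_decode helper are hypothetical, and sender and receiver are assumed to share the same struct layout rules.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire layout: a length field followed by the payload. */
struct wire_notif_v1 {
	uint32_t len;	/* total bytes the sender wrote, including len */
	uint64_t id;	/* notification id */
	int32_t nr;	/* syscall number */
};

/*
 * Decode a message of `buflen` bytes into `out`. A shorter (older)
 * message leaves the missing fields zeroed; a longer (newer) one has
 * its unknown tail ignored. Returns 0 or -EINVAL.
 */
int wire_decode(const void *buf, size_t buflen, struct wire_notif_v1 *out)
{
	uint32_t len;

	if (buflen < sizeof(len))
		return -EINVAL;
	memcpy(&len, buf, sizeof(len));
	if (len < sizeof(len) || len > buflen)
		return -EINVAL;

	/* Zero-fill, then copy only the part both sides understand. */
	memset(out, 0, sizeof(*out));
	memcpy(out, buf, len < sizeof(*out) ? len : sizeof(*out));
	return 0;
}
```

Every reader ends up carrying this kind of length checking and truncation logic by hand, which is the part netlink attributes already provide.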

> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.

Is this really true? Couldn't a multi-threaded process muck with
memory out from under both the manager and the stopped process?
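
To pin down what the "read everything first" pattern does and does not buy, here is a small userspace sketch. It assumes the handler copies the argument once into its own buffer and then makes its decision, and performs any action, using only that copy. snapshot_arg, policy_allows and SNAP_MAX are hypothetical names, and the /dev/null and /dev/zero policy is borrowed from the mknod example in the commit message.

```c
#include <stddef.h>
#include <string.h>

#define SNAP_MAX 64

/*
 * Copy up to SNAP_MAX-1 bytes of a (possibly changing) string into
 * handler-owned storage and NUL-terminate it.
 */
void snapshot_arg(char dst[SNAP_MAX], const volatile char *task_mem)
{
	size_t i;

	for (i = 0; i < SNAP_MAX - 1; i++) {
		char c = task_mem[i];	/* read each byte exactly once */

		if (c == '\0')
			break;
		dst[i] = c;
	}
	dst[i] = '\0';
}

/* Policy check that looks only at the snapshot. */
int policy_allows(const char snap[SNAP_MAX])
{
	return strcmp(snap, "/dev/null") == 0 ||
	       strcmp(snap, "/dev/zero") == 0;
}
```

Another thread can still rewrite the original buffer at any time, as asked above. The snapshot only guarantees that the handler checks and acts on the same bytes; anything that re-reads the task's memory afterwards sees whatever is there at that point.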

> Signed-off-by: Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org>
> CC: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> CC: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
> CC: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> CC: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> CC: "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
> CC: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> CC: Tyler Hicks <tyhicks-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
> CC: Akihiro Suda <suda.akihiro-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
> ---
>  arch/Kconfig                                  |   7 +
>  include/linux/seccomp.h                       |   3 +-
>  include/uapi/linux/seccomp.h                  |  18 +-
>  kernel/seccomp.c                              | 366 +++++++++++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 114 +++++++-
>  5 files changed, 502 insertions(+), 6 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 400b9e1b2f27..2946cb6fd704 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -387,6 +387,13 @@ config SECCOMP_FILTER
>
>           See Documentation/prctl/seccomp_filter.txt for details.
>
> +config SECCOMP_USER_NOTIFICATION
> +       bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
> +       depends on SECCOMP_FILTER
> +       help
> +         Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
> +         programs to notify a userspace listener that a particular event happened.
> +
>  config HAVE_GCC_PLUGINS
>         bool
>         help
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 10f25f7e4304..ce07da2ffd53 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,7 +5,8 @@
>  #include <uapi/linux/seccomp.h>
>
>  #define SECCOMP_FILTER_FLAG_MASK       (SECCOMP_FILTER_FLAG_TSYNC | \
> -                                        SECCOMP_FILTER_FLAG_LOG)
> +                                        SECCOMP_FILTER_FLAG_LOG | \
> +                                        SECCOMP_FILTER_FLAG_GET_LISTENER)
>
>  #ifdef CONFIG_SECCOMP
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 2a0bd9dd104d..4a342aa2e524 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,8 +17,9 @@
>  #define SECCOMP_GET_ACTION_AVAIL       2
>
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC      1
> -#define SECCOMP_FILTER_FLAG_LOG                2
> +#define SECCOMP_FILTER_FLAG_TSYNC              1
> +#define SECCOMP_FILTER_FLAG_LOG                        2
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER       4
>
>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -34,6 +35,7 @@
>  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */

/me tries to come up with an ordering rationale here and fails.

An ERRNO filter would block a USER_NOTIF because it's unconditional.
TRACE could be either, USER_NOTIF could be either.

This means TRACE rules would be bumped by a USER_NOTIF... hmm.
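
A small sketch of that precedence, assuming the usual rule in
seccomp_run_filters(): when several filters are attached, the result
whose action value compares lowest as a signed 32-bit number wins. The
USER_NOTIF value is the one proposed in this patch; the ACT_* names and
pick_result() are local stand-ins, not kernel identifiers.

```c
#include <stddef.h>
#include <stdint.h>

/* Action values: USER_NOTIF as proposed in this patch, the rest as in uapi seccomp.h. */
#define ACT_KILL_PROCESS 0x80000000U
#define ACT_ERRNO        0x00050000U
#define ACT_USER_NOTIF   0x7fc00000U
#define ACT_TRACE        0x7ff00000U
#define ACT_ALLOW        0x7fff0000U
#define ACT_FULL_MASK    0xffff0000U

/* Strip the data bits and compare as signed, so KILL_PROCESS sorts lowest. */
static int32_t action_only(uint32_t ret)
{
	return (int32_t)(ret & ACT_FULL_MASK);
}

/* Return the filter result that takes precedence among n results. */
static uint32_t pick_result(const uint32_t *rets, size_t n)
{
	uint32_t best = ACT_ALLOW;
	size_t i;

	for (i = 0; i < n; i++)
		if (action_only(rets[i]) < action_only(best))
			best = rets[i];
	return best;
}
```

With these values an ERRNO result beats USER_NOTIF, and USER_NOTIF
beats TRACE, which matches the observation above.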

>  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> @@ -59,4 +61,16 @@ struct seccomp_data {
>         __u64 args[6];
>  };
>
> +struct seccomp_notif {
> +       __u32 id;
> +       pid_t pid;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u32 id;
> +       int error;
> +       long val;
> +};
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 5f0dfb2abb8d..9541eb379e74 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -38,6 +38,52 @@
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION

I wonder if it's time to split up seccomp.c ... probably not, but I've
always been unhappy with the #ifdefs even for just regular _FILTER. ;)

> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +       SECCOMP_NOTIFY_INIT,
> +       SECCOMP_NOTIFY_READ,
> +       SECCOMP_NOTIFY_WRITE,
> +};
> +
> +struct seccomp_knotif {
> +       /* The pid whose filter triggered the notification */
> +       pid_t pid;
> +
> +       /*
> +        * The "cookie" for this request; this is unique for this filter.
> +        */
> +       u32 id;
> +
> +       /*
> +        * The seccomp data. This pointer is valid the entire time this
> +        * notification is active, since it comes from __seccomp_filter which
> +        * eclipses the entire lifecycle here.
> +        */
> +       const struct seccomp_data *data;
> +
> +       /*
> +        * SECCOMP_NOTIFY_INIT: someone has made this request, but it has not
> +        *      yet been sent to userspace
> +        * SECCOMP_NOTIFY_READ: sent to userspace but no response yet
> +        * SECCOMP_NOTIFY_WRITE: we have a response from userspace, but it has
> +        *      not yet been written back to the application
> +        */
> +       enum notify_state state;
> +
> +       /* The return values, only valid when in SECCOMP_NOTIFY_WRITE */
> +       int error;
> +       long val;
> +
> +       /* Signals when this has entered SECCOMP_NOTIFY_WRITE */
> +       struct completion ready;
> +
> +       struct list_head list;
> +};
> +#endif
> +
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
>   *
> @@ -64,6 +110,30 @@ struct seccomp_filter {
>         bool log;
>         struct seccomp_filter *prev;
>         struct bpf_prog *prog;
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +       /*
> +        * A semaphore that users of this notification can wait on for
> +        * changes. Actual reads and writes are still controlled with
> +        * filter->notify_lock.
> +        */
> +       struct semaphore request;
> +
> +       /*
> +        * A lock for all notification-related accesses.
> +        */
> +       struct mutex notify_lock;
> +
> +       /*
> +        * Is there currently an attached listener?
> +        */
> +       bool has_listener;
> +
> +       /*
> +        * A list of struct seccomp_knotif elements.
> +        */

Nit: these 3 above can be one-line comments.

> +       struct list_head notifications;
> +#endif
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -383,6 +453,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>         if (!sfilter)
>                 return ERR_PTR(-ENOMEM);
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +       mutex_init(&sfilter->notify_lock);
> +       sema_init(&sfilter->request, 0);
> +       INIT_LIST_HEAD(&sfilter->notifications);
> +#endif
> +
>         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>                                         seccomp_check_filter, save_orig);
>         if (ret < 0) {
> @@ -547,13 +623,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
>  #define SECCOMP_LOG_TRACE              (1 << 4)
>  #define SECCOMP_LOG_LOG                        (1 << 5)
>  #define SECCOMP_LOG_ALLOW              (1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
>
>  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
>                                     SECCOMP_LOG_KILL_THREAD  |
>                                     SECCOMP_LOG_TRAP  |
>                                     SECCOMP_LOG_ERRNO |
>                                     SECCOMP_LOG_TRACE |
> -                                   SECCOMP_LOG_LOG;
> +                                   SECCOMP_LOG_LOG |
> +                                   SECCOMP_LOG_USER_NOTIF;
>
>  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>                                bool requested)
> @@ -572,6 +650,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>         case SECCOMP_RET_TRACE:
>                 log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
>                 break;
> +       case SECCOMP_RET_USER_NOTIF:
> +               log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> +               break;
>         case SECCOMP_RET_LOG:
>                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
>                 break;
> @@ -645,6 +726,89 @@ void secure_computing_strict(int this_syscall)
>  }
>  #else
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +/*
> + * Finds the next unique notification id.
> + */
> +static u32 seccomp_next_notify_id(struct list_head *list)
> +{
> +       struct seccomp_knotif *knotif = NULL;
> +       struct list_head *cur;
> +       u32 id = get_random_u32();
> +
> +again:
> +       list_for_each(cur, list) {
> +               knotif = list_entry(cur, struct seccomp_knotif, list);
> +
> +               if (knotif->id == id) {
> +                       id = get_random_u32();
> +                       goto again;
> +               }
> +       }
> +
> +       return id;
> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       int err;
> +       long ret = 0;
> +       struct seccomp_knotif n = {};
> +
> +       mutex_lock(&match->notify_lock);
> +       if (!match->has_listener) {
> +               err = -ENOSYS;
> +               goto out;
> +       }
> +
> +       n.pid = current->pid;
> +       n.state = SECCOMP_NOTIFY_INIT;
> +       n.data = sd;
> +       n.id = seccomp_next_notify_id(&match->notifications);
> +       init_completion(&n.ready);
> +
> +       list_add(&n.list, &match->notifications);
> +
> +       mutex_unlock(&match->notify_lock);
> +       up(&match->request);
> +
> +       err = wait_for_completion_interruptible(&n.ready);
> +       /*
> +        * This syscall is getting interrupted. We no longer need to
> +        * tell userspace about it, and any userspace responses should
> +        * be ignored.
> +        */
> +       mutex_lock(&match->notify_lock);
> +       if (err < 0)
> +               goto remove_list;
> +
> +       ret = n.val;
> +       err = n.error;
> +
> +       WARN(n.state != SECCOMP_NOTIFY_WRITE,
> +            "notified about write complete when state is not write");
> +
> +remove_list:
> +       list_del(&n.list);
> +out:
> +       mutex_unlock(&match->notify_lock);
> +       syscall_set_return_value(current, task_pt_regs(current),
> +                                err, ret);
> +}
> +#else
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        u32 action,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       WARN(1, "user notification received, but disabled");
> +       seccomp_log(this_syscall, SIGSYS, action, true);
> +       do_exit(SIGSYS);
> +}
> +#endif
> +
>  #ifdef CONFIG_SECCOMP_FILTER
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>                             const bool recheck_after_trace)
> @@ -722,6 +886,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>
>                 return 0;
>
> +       case SECCOMP_RET_USER_NOTIF:
> +               seccomp_do_user_notification(this_syscall, match, sd);
> +               goto skip;
>         case SECCOMP_RET_LOG:
>                 seccomp_log(this_syscall, 0, action, true);
>                 return 0;
> @@ -828,6 +995,10 @@ static long seccomp_set_mode_strict(void)
>  }
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static struct file *init_listener(struct seccomp_filter *filter);
> +#endif
> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -847,6 +1018,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>         struct seccomp_filter *prepared = NULL;
>         long ret = -EINVAL;
> +       int listener = 0;
> +       struct file *listener_f = NULL;
>
>         /* Validate flags. */
>         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -857,13 +1030,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         if (IS_ERR(prepared))
>                 return PTR_ERR(prepared);
>
> +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +               listener = get_unused_fd_flags(O_RDWR);
> +               if (listener < 0) {
> +                       ret = listener;
> +                       goto out_free;
> +               }
> +
> +               listener_f = init_listener(prepared);
> +               if (IS_ERR(listener_f)) {
> +                       put_unused_fd(listener);
> +                       ret = PTR_ERR(listener_f);
> +                       goto out_free;
> +               }
> +       }
> +
>         /*
>          * Make sure we cannot change seccomp or nnp state via TSYNC
>          * while another thread is in the middle of calling exec.
>          */
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>             mutex_lock_killable(&current->signal->cred_guard_mutex))
> -               goto out_free;
> +               goto out_put_fd;
>
>         spin_lock_irq(&current->sighand->siglock);
>
> @@ -881,6 +1069,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         spin_unlock_irq(&current->sighand->siglock);
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>                 mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +               if (ret < 0) {
> +                       fput(listener_f);
> +                       put_unused_fd(listener);
> +               } else {
> +                       fd_install(listener, listener_f);
> +                       ret = listener;
> +               }
> +       }
>  out_free:
>         seccomp_filter_free(prepared);
>         return ret;
> @@ -909,6 +1107,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
>         case SECCOMP_RET_LOG:
>         case SECCOMP_RET_ALLOW:
>                 break;
> +       case SECCOMP_RET_USER_NOTIF:
> +               if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
> +                       break;
>         default:
>                 return -EOPNOTSUPP;
>         }
> @@ -1057,6 +1258,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
>  #define SECCOMP_RET_KILL_THREAD_NAME   "kill_thread"
>  #define SECCOMP_RET_TRAP_NAME          "trap"
>  #define SECCOMP_RET_ERRNO_NAME         "errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME    "user_notif"
>  #define SECCOMP_RET_TRACE_NAME         "trace"
>  #define SECCOMP_RET_LOG_NAME           "log"
>  #define SECCOMP_RET_ALLOW_NAME         "allow"
> @@ -1066,6 +1268,7 @@ static const char seccomp_actions_avail[] =
>                                 SECCOMP_RET_KILL_THREAD_NAME    " "
>                                 SECCOMP_RET_TRAP_NAME           " "
>                                 SECCOMP_RET_ERRNO_NAME          " "
> +                               SECCOMP_RET_USER_NOTIF_NAME     " "
>                                 SECCOMP_RET_TRACE_NAME          " "
>                                 SECCOMP_RET_LOG_NAME            " "
>                                 SECCOMP_RET_ALLOW_NAME;
> @@ -1083,6 +1286,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
>         { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
>         { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
>         { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> +       { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
>         { }
>  };
>
> @@ -1231,3 +1435,161 @@ static int __init seccomp_sysctl_init(void)
>  device_initcall(seccomp_sysctl_init)
>
>  #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct list_head *cur;
> +
> +       mutex_lock(&filter->notify_lock);
> +
> +       /*
> +        * If this file is being closed because e.g. the task who owned it
> +        * died, let's wake everyone up who was waiting on us.
> +        */
> +       list_for_each(cur, &filter->notifications) {
> +               struct seccomp_knotif *knotif;
> +
> +               knotif = list_entry(cur, struct seccomp_knotif, list);
> +
> +               knotif->state = SECCOMP_NOTIFY_WRITE;
> +               knotif->error = -ENOSYS;
> +               knotif->val = 0;
> +               complete(&knotif->ready);
> +       }
> +
> +       filter->has_listener = false;
> +       mutex_unlock(&filter->notify_lock);
> +       __put_seccomp_filter(filter);
> +       return 0;
> +}
> +
> +static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
> +                                  size_t size, loff_t *ppos)
> +{
> +       struct seccomp_filter *filter = f->private_data;
> +       struct seccomp_knotif *knotif = NULL;
> +       struct seccomp_notif unotif;
> +       struct list_head *cur;
> +       ssize_t ret;
> +
> +       /* No offset reads. */
> +       if (*ppos != 0)
> +               return -EINVAL;
> +
> +       ret = down_interruptible(&filter->request);
> +       if (ret < 0)
> +               return ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       list_for_each(cur, &filter->notifications) {
> +               knotif = list_entry(cur, struct seccomp_knotif, list);
> +               if (knotif->state == SECCOMP_NOTIFY_INIT)
> +                       break;
> +       }
> +
> +       /*
> +        * We didn't find anything which is odd, because at least one
> +        * thing should have been queued.
> +        */
> +       if (knotif->state != SECCOMP_NOTIFY_INIT) {
> +               ret = -ENOENT;
> +               WARN(1, "no seccomp notification found");

I tend to prefer WARN_ONCE, just in case this ever finds itself
exposed to being triggered trivially from userspace.

> +               goto out;
> +       }
> +
> +       unotif.id = knotif->id;
> +       unotif.pid = knotif->pid;
> +       unotif.data = *(knotif->data);
> +
> +       size = min_t(size_t, size, sizeof(struct seccomp_notif));
> +       if (copy_to_user(buf, &unotif, size)) {
> +               ret = -EFAULT;
> +               goto out;
> +       }
> +
> +       ret = sizeof(unotif);
> +       knotif->state = SECCOMP_NOTIFY_READ;
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
> +                                   size_t size, loff_t *ppos)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct seccomp_notif_resp resp = {};
> +       struct seccomp_knotif *knotif = NULL;
> +       struct list_head *cur;
> +       ssize_t ret = -EINVAL;
> +
> +       /* No partial writes. */
> +       if (*ppos != 0)
> +               return -EINVAL;
> +
> +       size = min_t(size_t, size, sizeof(resp));

In this case, we can't use min_t, size _must_ be == sizeof(resp),
otherwise we're operating on what's in the stack (which is zeroed, but
still).
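
Roughly what I have in mind, as a userspace sketch rather than kernel
code: memcpy() stands in for copy_from_user(), and struct notif_resp
and parse_resp() are hypothetical names that mirror the layout of
seccomp_notif_resp from this patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as seccomp_notif_resp in the patch. */
struct notif_resp {
	uint32_t id;
	int error;
	long val;
};

/*
 * Accept a response only if the caller supplied exactly one full
 * structure. Returns the number of bytes consumed, or -EINVAL.
 */
static long parse_resp(struct notif_resp *out, const void *buf, size_t size)
{
	if (size != sizeof(*out))
		return -EINVAL;

	memcpy(out, buf, sizeof(*out)); /* copy_from_user() in the kernel */
	return (long)sizeof(*out);
}
```

A short write is then rejected outright instead of being completed
with whatever the zeroed stack copy happens to contain.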

> +       if (copy_from_user(&resp, buf, size))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each(cur, &filter->notifications) {
> +               knotif = list_entry(cur, struct seccomp_knotif, list);
> +
> +               if (knotif->id == resp.id)
> +                       break;

So we're finding the matching id here. Now, I'm trying to think about
how this will look in real-world use: the pid will be _blocked_ while
this is happening. And all the other pids that trip this filter will
_also_ be blocked, since they're all waiting for the reader to read
and respond. The risk is pid death while waiting, and having another
appear with the same pid, trigger the same filter, get blocked, and
then the reader replies for the old pid, and the new pid gets the
results?

Since this notification queue is already linear, can't we use ordering
to enforce this? i.e. only the pid at the head of the filter
notification queue is going to have anything happening to it. Or is
the idea to have multiple readers/writers of the fd?
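
A toy model of the ordering idea, purely as a sketch under the
assumption of a single reader: responses always complete the oldest
waiter, and a task that dies while waiting is removed from the queue so
a late response cannot land on a successor. struct notif_queue and the
queue_*() helpers are hypothetical and use a fixed-size array instead
of the list in the patch.

```c
#include <stddef.h>

#define NOTIF_QUEUE_MAX 8

struct notif_queue {
	int pids[NOTIF_QUEUE_MAX]; /* waiting tasks, oldest first from head */
	size_t head, count;
};

/* Queue a newly blocked task. Returns 0, or -1 if the queue is full. */
static int queue_push(struct notif_queue *q, int pid)
{
	if (q->count == NOTIF_QUEUE_MAX)
		return -1;
	q->pids[(q->head + q->count) % NOTIF_QUEUE_MAX] = pid;
	q->count++;
	return 0;
}

/* A response completes the oldest waiter. Returns its pid, or -1 if none. */
static int queue_respond(struct notif_queue *q)
{
	int pid;

	if (q->count == 0)
		return -1;
	pid = q->pids[q->head];
	q->head = (q->head + 1) % NOTIF_QUEUE_MAX;
	q->count--;
	return pid;
}

/* Drop a waiter that died. Returns 0 if it was queued, -1 otherwise. */
static int queue_remove(struct notif_queue *q, int pid)
{
	size_t i, j;

	for (i = 0; i < q->count; i++) {
		if (q->pids[(q->head + i) % NOTIF_QUEUE_MAX] != pid)
			continue;
		/* Shift the younger entries down by one slot. */
		for (j = i; j + 1 < q->count; j++)
			q->pids[(q->head + j) % NOTIF_QUEUE_MAX] =
				q->pids[(q->head + j + 1) % NOTIF_QUEUE_MAX];
		q->count--;
		return 0;
	}
	return -1;
}
```

This falls apart as soon as two handlers read from the same fd and
answer out of order, which is presumably the case the id is meant to
cover.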

> +       }
> +
> +       if (!knotif || knotif->id != resp.id) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_WRITE;
> +       knotif->error = resp.error;
> +       knotif->val = resp.val;
> +       complete(&knotif->ready);
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +       .read = seccomp_notify_read,
> +       .write = seccomp_notify_write,
> +       /* TODO: poll */

What's needed for poll? I think you've got all the pieces you need
already, i.e. wait queue, notifications, etc.

> +       .release = seccomp_notify_release,
> +};
> +
> +static struct file *init_listener(struct seccomp_filter *filter)
> +{
> +       struct file *ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       if (filter->has_listener) {
> +               mutex_unlock(&filter->notify_lock);
> +               return ERR_PTR(-EBUSY);
> +       }
> +
> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +                                filter, O_RDWR);
> +       if (IS_ERR(ret)) {
> +               __put_seccomp_filter(filter);
> +       } else {
> +               /*
> +                * Intentionally don't put_seccomp_filter(). The file
> +                * has a reference to it now.
> +                */
> +               filter->has_listener = true;
> +       }

I spent some time staring at this, and I don't see it: where is the
get_() for this? The caller of init_listener() already does a put() on
the failure path. It seems like there is a get() missing near the
start of init_listener(), or I've entirely missed something.
(Regardless, I think the usage counting needs a comment somewhere,
maybe near the top of seccomp.c with the field?)
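
The pairing I would expect, as a userspace toy: a plain int stands in
for the kernel's reference counter, and struct filter, filter_get(),
filter_put(), listener_attach() and listener_release() are hypothetical
names, not the functions in seccomp.c.

```c
#include <errno.h>

struct filter {
	int usage;        /* stand-in for the kernel's reference count */
	int has_listener;
};

static void filter_get(struct filter *f)
{
	f->usage++;
}

/* Returns 1 when the last reference was dropped (the filter would be freed). */
static int filter_put(struct filter *f)
{
	return --f->usage == 0;
}

/* Attach a listener: take the reference that the listener file will own. */
static int listener_attach(struct filter *f)
{
	if (f->has_listener)
		return -EBUSY;
	filter_get(f);
	f->has_listener = 1;
	return 0;
}

/* Release the listener file: drop the reference taken in listener_attach(). */
static int listener_release(struct filter *f)
{
	f->has_listener = 0;
	return filter_put(f);
}
```

With an explicit get at attach time, the caller's put on its own
reference and the release handler's put each have an obvious partner.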

> +
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +#endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 24dbf634e2dd..b43e2a70b08c 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -40,6 +40,7 @@
>  #include <sys/fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/times.h>
> +#include <sys/socket.h>
>
>  #define _GNU_SOURCE
>  #include <unistd.h>
> @@ -141,6 +142,24 @@ struct seccomp_data {
>  #define SECCOMP_FILTER_FLAG_LOG 2
>  #endif
>
> +#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
> +
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> +
> +struct seccomp_notif {
> +       __u32 id;
> +       pid_t pid;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u32 id;
> +       int error;
> +       long val;
> +};
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -2063,7 +2082,8 @@ TEST(seccomp_syscall_mode_lock)
>  TEST(detect_seccomp_filter_flags)
>  {
>         unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
> -                                SECCOMP_FILTER_FLAG_LOG };
> +                                SECCOMP_FILTER_FLAG_LOG,
> +                                SECCOMP_FILTER_FLAG_GET_LISTENER };
>         unsigned int flag, all_flags;
>         int i;
>         long ret;
> @@ -2845,6 +2865,98 @@ TEST(get_action_avail)
>         EXPECT_EQ(errno, EOPNOTSUPP);
>  }
>
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                       offsetof(struct seccomp_data, nr)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +#define USER_NOTIF_MAGIC 116983961184613L

Is this just you mashing the numpad? :)

> +TEST(get_user_notification_syscall)
> +{
> +       pid_t pid;
> +       long ret;
> +       int status, listener;
> +       struct seccomp_notif req;
> +       struct seccomp_notif_resp resp;
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       /* Check that we get -ENOSYS with no listener attached */
> +       if (pid == 0) {
> +               ASSERT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +               ret = syscall(__NR_getpid);
> +               exit(ret >= 0 || errno != ENOSYS);
> +       }
> +
> +       ASSERT_EQ(waitpid(pid, &status, 0), pid);
> +       ASSERT_EQ(true, WIFEXITED(status));
> +       ASSERT_EQ(0, WEXITSTATUS(status));
> +
> +       /* Check that the basic notification machinery works */
> +       listener = user_trap_syscall(__NR_getpid,
> +                                    SECCOMP_FILTER_FLAG_GET_LISTENER);
> +       ASSERT_GE(listener, 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ASSERT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
> +
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       ASSERT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
> +
> +       ASSERT_EQ(waitpid(pid, &status, 0), pid);
> +       ASSERT_EQ(true, WIFEXITED(status));
> +       ASSERT_EQ(0, WEXITSTATUS(status));
> +
> +       /*
> +        * Check that nothing bad happens when we kill the task in the middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ret = read(listener, &req, sizeof(req));
> +       ASSERT_EQ(ret, sizeof(req));
> +
> +       ASSERT_EQ(kill(pid, SIGKILL), 0);
> +       ASSERT_EQ(waitpid(pid, NULL, 0), pid);
> +
> +       resp.id = req.id;
> +       ret = write(listener, &resp, sizeof(resp));
> +       EXPECT_EQ(ret, -1);
> +       EXPECT_EQ(errno, EINVAL);
> +
> +       close(listener);
> +}

Yay selftests! :)

-Kees

> +
>  /*
>   * TODO:
>   * - add microbenchmarks
> --
> 2.14.1
>



-- 
Kees Cook
Pixel Security


* Re: [RFC 1/3] seccomp: add a return code to trap to userspace
  2018-02-04 10:49 ` [RFC 1/3] seccomp: add a return code to " Tycho Andersen
@ 2018-02-13 21:09   ` Kees Cook
       [not found]     ` <CAGXu5jLAAKY19a9iC1PmXRyuwdn1Zxr2Cb318zdzkqgYt8vtdg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]   ` <20180204104946.25559-2-tycho-E0fblnxP3wo@public.gmane.org>
  1 sibling, 1 reply; 59+ messages in thread
From: Kees Cook @ 2018-02-13 21:09 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: LKML, Linux Containers, Andy Lutomirski, Oleg Nesterov,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Tom Hromatka, Sargun Dhillon,
	Paul Moore

On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.

Related to the eBPF seccomp thread, can the logic for these things be
handled entirely by eBPF? My assumption is that you still need to stop
the process to do something (i.e. do a mknod, or a mount) before
letting it continue. Is there some "wait for notification" system in
eBPF?

> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.

Agreed: notification is extremely painful right now. The container
case is compelling, since it will always want a way to trick out these
kinds of filesystem calls.

> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex. Also worth noting that there
> is one race still present:
>
>   1. a task does a SECCOMP_RET_USER_NOTIF
>   2. the userspace handler reads this notification
>   3. the task dies
>   4. a new task with the same pid starts
>   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
>      that the previous one did
>   6. the userspace handler writes a response
>
> There's no way to distinguish this case right now. Maybe we care, maybe we
> don't, but it's worth noting.

So, I'd like to avoid the cookie if possible (surprise). Why isn't it
possible to close the kernel-side of the fd to indicate that it lost
the pid it was attached to? Is this just that the reader has no idea
who is sending messages? So the risk is a fork/die loop within the
same process tree (i.e. attached to the same filter)? Hrmpf. I can't
think of a better way to handle the
one(fd)-to-many(task-with-that-filter-attached) situation...

> Right now the interface is a simple structure copy across a file
> descriptor. We could potentially invent something fancier.

I wonder if this communication should be netlink, which gives a more
well-structured way to describe what's on the wire? The reason I ask
is because if we ever change the seccomp_data structure, we'll now
have two places where we need to deal with it (the first being within
the BPF itself). My initial idea was to prefix the communication with
a size field, then send the structure, and then I had nightmares, and
realized this was basically netlink reinvented.

> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.

Is this really true? Couldn't a multi-threaded process muck with
memory out from under both the manager and the stopped process?
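
The "careful design" being described amounts to copy once, check the copy, and
act only on the copy. A minimal sketch of that pattern follows. The helper
names and the /dev/ policy are hypothetical, and a real handler would read
from the task's address space rather than from a local pointer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy at most dst_len - 1 bytes of a NUL-terminated string out of memory
 * that another thread may be rewriting, and always terminate the copy.
 * Returns 0 on success, -1 if no terminator was seen within the limit.
 */
static int snapshot_str(char *dst, size_t dst_len, const volatile char *src)
{
	size_t i;

	if (dst_len == 0)
		return -1;
	for (i = 0; i + 1 < dst_len; i++) {
		dst[i] = src[i];	/* each source byte is read exactly once */
		if (dst[i] == '\0')
			return 0;
	}
	dst[dst_len - 1] = '\0';
	return -1;
}

/* Hypothetical policy: only paths under /dev/ are acceptable. */
static int path_allowed(const char *snap)
{
	return strncmp(snap, "/dev/", 5) == 0;
}
```

Another thread can still scribble on the original buffer, but that only
matters if someone reads it again after the check. Since the patch skips the
real syscall, that someone would have to be the handler itself.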

> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  arch/Kconfig                                  |   7 +
>  include/linux/seccomp.h                       |   3 +-
>  include/uapi/linux/seccomp.h                  |  18 +-
>  kernel/seccomp.c                              | 366 +++++++++++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 114 +++++++-
>  5 files changed, 502 insertions(+), 6 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 400b9e1b2f27..2946cb6fd704 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -387,6 +387,13 @@ config SECCOMP_FILTER
>
>           See Documentation/prctl/seccomp_filter.txt for details.
>
> +config SECCOMP_USER_NOTIFICATION
> +       bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
> +       depends on SECCOMP_FILTER
> +       help
> +         Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
> +         programs to notify a userspace listener that a particular event happened.
> +
>  config HAVE_GCC_PLUGINS
>         bool
>         help
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 10f25f7e4304..ce07da2ffd53 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -5,7 +5,8 @@
>  #include <uapi/linux/seccomp.h>
>
>  #define SECCOMP_FILTER_FLAG_MASK       (SECCOMP_FILTER_FLAG_TSYNC | \
> -                                        SECCOMP_FILTER_FLAG_LOG)
> +                                        SECCOMP_FILTER_FLAG_LOG | \
> +                                        SECCOMP_FILTER_FLAG_GET_LISTENER)
>
>  #ifdef CONFIG_SECCOMP
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 2a0bd9dd104d..4a342aa2e524 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,8 +17,9 @@
>  #define SECCOMP_GET_ACTION_AVAIL       2
>
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC      1
> -#define SECCOMP_FILTER_FLAG_LOG                2
> +#define SECCOMP_FILTER_FLAG_TSYNC              1
> +#define SECCOMP_FILTER_FLAG_LOG                        2
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER       4
>
>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -34,6 +35,7 @@
>  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */

/me tries to come up with an ordering rationale here and fails.

An ERRNO filter would block a USER_NOTIF because it's unconditional.
TRACE could be either, USER_NOTIF could be either.

This means TRACE rules would be bumped by a USER_NOTIF... hmm.
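
To make the ordering concrete, here is a small model of how the winning action
is chosen when several filters are attached. It assumes the existing rule, as
I understand seccomp_run_filters(), that the lowest action value (compared as
signed 32-bit) takes precedence. The RET_* names are local stand-ins, and the
USER_NOTIF value is the one proposed in this patch.

```c
#include <stdint.h>

#define RET_ACTION_FULL	0xffff0000U
#define RET_ERRNO	0x00050000U
#define RET_USER_NOTIF	0x7fc00000U	/* value proposed in this patch */
#define RET_TRACE	0x7ff00000U
#define RET_ALLOW	0x7fff0000U

/* Action bits as a signed value, so lower means higher precedence. */
static int32_t action_only(uint32_t ret)
{
	return (int32_t)(ret & RET_ACTION_FULL);
}

/* Pick the winning result among n filter return values. */
static uint32_t pick_action(const uint32_t *rets, int n)
{
	uint32_t best = RET_ALLOW;
	int i;

	for (i = 0; i < n; i++)
		if (action_only(rets[i]) < action_only(best))
			best = rets[i];
	return best;
}
```

Under that rule, ERRNO from any filter beats USER_NOTIF, and USER_NOTIF beats
TRACE.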

>  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> @@ -59,4 +61,16 @@ struct seccomp_data {
>         __u64 args[6];
>  };
>
> +struct seccomp_notif {
> +       __u32 id;
> +       pid_t pid;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u32 id;
> +       int error;
> +       long val;
> +};
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 5f0dfb2abb8d..9541eb379e74 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -38,6 +38,52 @@
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION

I wonder if it's time to split up seccomp.c ... probably not, but I've
always been unhappy with the #ifdefs even for just regular _FILTER. ;)

> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +       SECCOMP_NOTIFY_INIT,
> +       SECCOMP_NOTIFY_READ,
> +       SECCOMP_NOTIFY_WRITE,
> +};
> +
> +struct seccomp_knotif {
> +       /* The pid whose filter triggered the notification */
> +       pid_t pid;
> +
> +       /*
> +        * The "cookie" for this request; this is unique for this filter.
> +        */
> +       u32 id;
> +
> +       /*
> +        * The seccomp data. This pointer is valid the entire time this
> +        * notification is active, since it comes from __seccomp_filter which
> +        * eclipses the entire lifecycle here.
> +        */
> +       const struct seccomp_data *data;
> +
> +       /*
> +        * SECCOMP_NOTIFY_INIT: someone has made this request, but it has not
> +        *      yet been sent to userspace
> +        * SECCOMP_NOTIFY_READ: sent to userspace but no response yet
> +        * SECCOMP_NOTIFY_WRITE: we have a response from userspace, but it has
> +        *      not yet been written back to the application
> +        */
> +       enum notify_state state;
> +
> +       /* The return values, only valid when in SECCOMP_NOTIFY_WRITE */
> +       int error;
> +       long val;
> +
> +       /* Signals when this has entered SECCOMP_NOTIFY_WRITE */
> +       struct completion ready;
> +
> +       struct list_head list;
> +};
> +#endif
> +
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
>   *
> @@ -64,6 +110,30 @@ struct seccomp_filter {
>         bool log;
>         struct seccomp_filter *prev;
>         struct bpf_prog *prog;
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +       /*
> +        * A semaphore that users of this notification can wait on for
> +        * changes. Actual reads and writes are still controlled with
> +        * filter->notify_lock.
> +        */
> +       struct semaphore request;
> +
> +       /*
> +        * A lock for all notification-related accesses.
> +        */
> +       struct mutex notify_lock;
> +
> +       /*
> +        * Is there currently an attached listener?
> +        */
> +       bool has_listener;
> +
> +       /*
> +        * A list of struct seccomp_knotif elements.
> +        */

Nit: these 3 above can be one-line comments.

> +       struct list_head notifications;
> +#endif
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -383,6 +453,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>         if (!sfilter)
>                 return ERR_PTR(-ENOMEM);
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +       mutex_init(&sfilter->notify_lock);
> +       sema_init(&sfilter->request, 0);
> +       INIT_LIST_HEAD(&sfilter->notifications);
> +#endif
> +
>         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>                                         seccomp_check_filter, save_orig);
>         if (ret < 0) {
> @@ -547,13 +623,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
>  #define SECCOMP_LOG_TRACE              (1 << 4)
>  #define SECCOMP_LOG_LOG                        (1 << 5)
>  #define SECCOMP_LOG_ALLOW              (1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
>
>  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
>                                     SECCOMP_LOG_KILL_THREAD  |
>                                     SECCOMP_LOG_TRAP  |
>                                     SECCOMP_LOG_ERRNO |
>                                     SECCOMP_LOG_TRACE |
> -                                   SECCOMP_LOG_LOG;
> +                                   SECCOMP_LOG_LOG |
> +                                   SECCOMP_LOG_USER_NOTIF;
>
>  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>                                bool requested)
> @@ -572,6 +650,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>         case SECCOMP_RET_TRACE:
>                 log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
>                 break;
> +       case SECCOMP_RET_USER_NOTIF:
> +               log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> +               break;
>         case SECCOMP_RET_LOG:
>                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
>                 break;
> @@ -645,6 +726,89 @@ void secure_computing_strict(int this_syscall)
>  }
>  #else
>
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +/*
> + * Finds the next unique notification id.
> + */
> +static u32 seccomp_next_notify_id(struct list_head *list)
> +{
> +       struct seccomp_knotif *knotif = NULL;
> +       struct list_head *cur;
> +       u32 id = get_random_u32();
> +
> +again:
> +       list_for_each(cur, list) {
> +               knotif = list_entry(cur, struct seccomp_knotif, list);
> +
> +               if (knotif->id == id) {
> +                       id = get_random_u32();
> +                       goto again;
> +               }
> +       }
> +
> +       return id;
> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       int err;
> +       long ret = 0;
> +       struct seccomp_knotif n = {};
> +
> +       mutex_lock(&match->notify_lock);
> +       if (!match->has_listener) {
> +               err = -ENOSYS;
> +               goto out;
> +       }
> +
> +       n.pid = current->pid;
> +       n.state = SECCOMP_NOTIFY_INIT;
> +       n.data = sd;
> +       n.id = seccomp_next_notify_id(&match->notifications);
> +       init_completion(&n.ready);
> +
> +       list_add(&n.list, &match->notifications);
> +
> +       mutex_unlock(&match->notify_lock);
> +       up(&match->request);
> +
> +       err = wait_for_completion_interruptible(&n.ready);
> +       /*
> +        * This syscall is getting interrupted. We no longer need to
> +        * tell userspace about it, and any userspace responses should
> +        * be ignored.
> +        */
> +       mutex_lock(&match->notify_lock);
> +       if (err < 0)
> +               goto remove_list;
> +
> +       ret = n.val;
> +       err = n.error;
> +
> +       WARN(n.state != SECCOMP_NOTIFY_WRITE,
> +            "notified about write complete when state is not write");
> +
> +remove_list:
> +       list_del(&n.list);
> +out:
> +       mutex_unlock(&match->notify_lock);
> +       syscall_set_return_value(current, task_pt_regs(current),
> +                                err, ret);
> +}
> +#else
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        u32 action,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       WARN(1, "user notification received, but disabled");
> +       seccomp_log(this_syscall, SIGSYS, action, true);
> +       do_exit(SIGSYS);
> +}
> +#endif
> +
>  #ifdef CONFIG_SECCOMP_FILTER
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>                             const bool recheck_after_trace)
> @@ -722,6 +886,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>
>                 return 0;
>
> +       case SECCOMP_RET_USER_NOTIF:
> +               seccomp_do_user_notification(this_syscall, match, sd);
> +               goto skip;
>         case SECCOMP_RET_LOG:
>                 seccomp_log(this_syscall, 0, action, true);
>                 return 0;
> @@ -828,6 +995,10 @@ static long seccomp_set_mode_strict(void)
>  }
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static struct file *init_listener(struct seccomp_filter *filter);
> +#endif
> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -847,6 +1018,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>         struct seccomp_filter *prepared = NULL;
>         long ret = -EINVAL;
> +       int listener = 0;
> +       struct file *listener_f = NULL;
>
>         /* Validate flags. */
>         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -857,13 +1030,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         if (IS_ERR(prepared))
>                 return PTR_ERR(prepared);
>
> +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +               listener = get_unused_fd_flags(O_RDWR);
> +               if (listener < 0) {
> +                       ret = listener;
> +                       goto out_free;
> +               }
> +
> +               listener_f = init_listener(prepared);
> +               if (IS_ERR(listener_f)) {
> +                       put_unused_fd(listener);
> +                       ret = PTR_ERR(listener_f);
> +                       goto out_free;
> +               }
> +       }
> +
>         /*
>          * Make sure we cannot change seccomp or nnp state via TSYNC
>          * while another thread is in the middle of calling exec.
>          */
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>             mutex_lock_killable(&current->signal->cred_guard_mutex))
> -               goto out_free;
> +               goto out_put_fd;
>
>         spin_lock_irq(&current->sighand->siglock);
>
> @@ -881,6 +1069,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         spin_unlock_irq(&current->sighand->siglock);
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>                 mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +               if (ret < 0) {
> +                       fput(listener_f);
> +                       put_unused_fd(listener);
> +               } else {
> +                       fd_install(listener, listener_f);
> +                       ret = listener;
> +               }
> +       }
>  out_free:
>         seccomp_filter_free(prepared);
>         return ret;
> @@ -909,6 +1107,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
>         case SECCOMP_RET_LOG:
>         case SECCOMP_RET_ALLOW:
>                 break;
> +       case SECCOMP_RET_USER_NOTIF:
> +               if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
> +                       break;
>         default:
>                 return -EOPNOTSUPP;
>         }
> @@ -1057,6 +1258,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
>  #define SECCOMP_RET_KILL_THREAD_NAME   "kill_thread"
>  #define SECCOMP_RET_TRAP_NAME          "trap"
>  #define SECCOMP_RET_ERRNO_NAME         "errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME    "user_notif"
>  #define SECCOMP_RET_TRACE_NAME         "trace"
>  #define SECCOMP_RET_LOG_NAME           "log"
>  #define SECCOMP_RET_ALLOW_NAME         "allow"
> @@ -1066,6 +1268,7 @@ static const char seccomp_actions_avail[] =
>                                 SECCOMP_RET_KILL_THREAD_NAME    " "
>                                 SECCOMP_RET_TRAP_NAME           " "
>                                 SECCOMP_RET_ERRNO_NAME          " "
> +                               SECCOMP_RET_USER_NOTIF_NAME     " "
>                                 SECCOMP_RET_TRACE_NAME          " "
>                                 SECCOMP_RET_LOG_NAME            " "
>                                 SECCOMP_RET_ALLOW_NAME;
> @@ -1083,6 +1286,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
>         { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
>         { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
>         { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> +       { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
>         { }
>  };
>
> @@ -1231,3 +1435,161 @@ static int __init seccomp_sysctl_init(void)
>  device_initcall(seccomp_sysctl_init)
>
>  #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct list_head *cur;
> +
> +       mutex_lock(&filter->notify_lock);
> +
> +       /*
> +        * If this file is being closed because e.g. the task who owned it
> +        * died, let's wake everyone up who was waiting on us.
> +        */
> +       list_for_each(cur, &filter->notifications) {
> +               struct seccomp_knotif *knotif;
> +
> +               knotif = list_entry(cur, struct seccomp_knotif, list);
> +
> +               knotif->state = SECCOMP_NOTIFY_WRITE;
> +               knotif->error = -ENOSYS;
> +               knotif->val = 0;
> +               complete(&knotif->ready);
> +       }
> +
> +       filter->has_listener = false;
> +       mutex_unlock(&filter->notify_lock);
> +       __put_seccomp_filter(filter);
> +       return 0;
> +}
> +
> +static ssize_t seccomp_notify_read(struct file *f, char __user *buf,
> +                                  size_t size, loff_t *ppos)
> +{
> +       struct seccomp_filter *filter = f->private_data;
> +       struct seccomp_knotif *knotif = NULL;
> +       struct seccomp_notif unotif;
> +       struct list_head *cur;
> +       ssize_t ret;
> +
> +       /* No offset reads. */
> +       if (*ppos != 0)
> +               return -EINVAL;
> +
> +       ret = down_interruptible(&filter->request);
> +       if (ret < 0)
> +               return ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       list_for_each(cur, &filter->notifications) {
> +               knotif = list_entry(cur, struct seccomp_knotif, list);
> +               if (knotif->state == SECCOMP_NOTIFY_INIT)
> +                       break;
> +       }
> +
> +       /*
> +        * We didn't find anything which is odd, because at least one
> +        * thing should have been queued.
> +        */
> +       if (knotif->state != SECCOMP_NOTIFY_INIT) {
> +               ret = -ENOENT;
> +               WARN(1, "no seccomp notification found");

I tend to prefer WARN_ONCE, just in case this ever finds itself
exposed to being triggered trivially from userspace.

> +               goto out;
> +       }
> +
> +       unotif.id = knotif->id;
> +       unotif.pid = knotif->pid;
> +       unotif.data = *(knotif->data);
> +
> +       size = min_t(size_t, size, sizeof(struct seccomp_notif));
> +       if (copy_to_user(buf, &unotif, size)) {
> +               ret = -EFAULT;
> +               goto out;
> +       }
> +
> +       ret = sizeof(unotif);
> +       knotif->state = SECCOMP_NOTIFY_READ;
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
> +                                   size_t size, loff_t *ppos)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct seccomp_notif_resp resp = {};
> +       struct seccomp_knotif *knotif = NULL;
> +       struct list_head *cur;
> +       ssize_t ret = -EINVAL;
> +
> +       /* No partial writes. */
> +       if (*ppos != 0)
> +               return -EINVAL;
> +
> +       size = min_t(size_t, size, sizeof(resp));

In this case, we can't use min_t, size _must_ be == sizeof(resp),
otherwise we're operating on what's in the stack (which is zeroed, but
still).
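
Something along these lines, modelled in plain C so it can be tried outside
the kernel. notif_resp mirrors seccomp_notif_resp from the patch, resp_parse
is a hypothetical name, and memcpy() stands in for copy_from_user().

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct notif_resp {	/* mirrors seccomp_notif_resp from the patch */
	unsigned int id;
	int error;
	long val;
};

/*
 * Accept a response only when the caller supplied exactly one full struct.
 * Returns 0 and fills *resp, or -EINVAL for a short or oversized write.
 */
static int resp_parse(struct notif_resp *resp, const void *buf, size_t size)
{
	if (size != sizeof(*resp))
		return -EINVAL;
	memcpy(resp, buf, sizeof(*resp));	/* stands in for copy_from_user() */
	return 0;
}
```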

> +       if (copy_from_user(&resp, buf, size))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each(cur, &filter->notifications) {
> +               knotif = list_entry(cur, struct seccomp_knotif, list);
> +
> +               if (knotif->id == resp.id)
> +                       break;

So we're finding the matching id here. Now, I'm trying to think about
how this will look in real-world use: the pid will be _blocked_ while
this is happening. And all the other pids that trip this filter will
_also_ be blocked, since they're all waiting for the reader to read
and respond. The risk is pid death while waiting, and having another
appear with the same pid, trigger the same filter, get blocked, and
then the reader replies for the old pid, and the new pid gets the
results?

Since this notification queue is already linear, can't we use ordering
to enforce this? i.e. only the pid at the head of the filter
notification queue is going to have anything happening to it. Or is
the idea to have multiple readers/writers of the fd?

> +       }
> +
> +       if (!knotif || knotif->id != resp.id) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_WRITE;
> +       knotif->error = resp.error;
> +       knotif->val = resp.val;
> +       complete(&knotif->ready);
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +       .read = seccomp_notify_read,
> +       .write = seccomp_notify_write,
> +       /* TODO: poll */

What's needed for poll? I think you've got all the pieces you need
already, i.e. wait queue, notifications, etc.
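
From the handler's side, poll support would allow something like the sketch
below, so one thread can watch the listener alongside other fds. This assumes
the listener fd would report POLLIN when a notification is queued, which the
patch does not implement yet. read_when_ready is a hypothetical helper.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable, then read from it.
 * Returns the number of bytes read, 0 on timeout (or EOF), -1 on error.
 */
static ssize_t read_when_ready(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;
	if (!(pfd.revents & POLLIN))
		return -1;
	return read(fd, buf, len);
}
```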

> +       .release = seccomp_notify_release,
> +};
> +
> +static struct file *init_listener(struct seccomp_filter *filter)
> +{
> +       struct file *ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       if (filter->has_listener) {
> +               mutex_unlock(&filter->notify_lock);
> +               return ERR_PTR(-EBUSY);
> +       }
> +
> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +                                filter, O_RDWR);
> +       if (IS_ERR(ret)) {
> +               __put_seccomp_filter(filter);
> +       } else {
> +               /*
> +                * Intentionally don't put_seccomp_filter(). The file
> +                * has a reference to it now.
> +                */
> +               filter->has_listener = true;
> +       }

I spent some time staring at this, and I don't see it: where is the
get_() for this? The caller of init_listener() already does a put() on
the failure path. It seems like there is a get() missing near the
start of init_listener(), or I've entirely missed something.
(Regardless, I think the usage counting needs a comment somewhere,
maybe near the top of seccomp.c with the field?)

> +
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +#endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 24dbf634e2dd..b43e2a70b08c 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -40,6 +40,7 @@
>  #include <sys/fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/times.h>
> +#include <sys/socket.h>
>
>  #define _GNU_SOURCE
>  #include <unistd.h>
> @@ -141,6 +142,24 @@ struct seccomp_data {
>  #define SECCOMP_FILTER_FLAG_LOG 2
>  #endif
>
> +#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER
> +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4
> +
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> +
> +struct seccomp_notif {
> +       __u32 id;
> +       pid_t pid;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u32 id;
> +       int error;
> +       long val;
> +};
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -2063,7 +2082,8 @@ TEST(seccomp_syscall_mode_lock)
>  TEST(detect_seccomp_filter_flags)
>  {
>         unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
> -                                SECCOMP_FILTER_FLAG_LOG };
> +                                SECCOMP_FILTER_FLAG_LOG,
> +                                SECCOMP_FILTER_FLAG_GET_LISTENER };
>         unsigned int flag, all_flags;
>         int i;
>         long ret;
> @@ -2845,6 +2865,98 @@ TEST(get_action_avail)
>         EXPECT_EQ(errno, EOPNOTSUPP);
>  }
>
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                       offsetof(struct seccomp_data, nr)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +#define USER_NOTIF_MAGIC 116983961184613L

Is this just you mashing the numpad? :)

> +TEST(get_user_notification_syscall)
> +{
> +       pid_t pid;
> +       long ret;
> +       int status, listener;
> +       struct seccomp_notif req;
> +       struct seccomp_notif_resp resp;
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       /* Check that we get -ENOSYS with no listener attached */
> +       if (pid == 0) {
> +               ASSERT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +               ret = syscall(__NR_getpid);
> +               exit(ret >= 0 || errno != ENOSYS);
> +       }
> +
> +       ASSERT_EQ(waitpid(pid, &status, 0), pid);
> +       ASSERT_EQ(true, WIFEXITED(status));
> +       ASSERT_EQ(0, WEXITSTATUS(status));
> +
> +       /* Check that the basic notification machinery works */
> +       listener = user_trap_syscall(__NR_getpid,
> +                                    SECCOMP_FILTER_FLAG_GET_LISTENER);
> +       ASSERT_GE(listener, 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ASSERT_EQ(read(listener, &req, sizeof(req)), sizeof(req));
> +
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       ASSERT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp));
> +
> +       ASSERT_EQ(waitpid(pid, &status, 0), pid);
> +       ASSERT_EQ(true, WIFEXITED(status));
> +       ASSERT_EQ(0, WEXITSTATUS(status));
> +
> +       /*
> +        * Check that nothing bad happens when we kill the task in the middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ret = read(listener, &req, sizeof(req));
> +       ASSERT_EQ(ret, sizeof(req));
> +
> +       ASSERT_EQ(kill(pid, SIGKILL), 0);
> +       ASSERT_EQ(waitpid(pid, NULL, 0), pid);
> +
> +       resp.id = req.id;
> +       ret = write(listener, &resp, sizeof(resp));
> +       EXPECT_EQ(ret, -1);
> +       EXPECT_EQ(errno, EINVAL);
> +
> +       close(listener);
> +}

Yay selftests! :)

-Kees

> +
>  /*
>   * TODO:
>   * - add microbenchmarks
> --
> 2.14.1
>



-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 2/3] seccomp: hoist out filter resolving logic
       [not found]   ` <20180204104946.25559-3-tycho-E0fblnxP3wo@public.gmane.org>
@ 2018-02-13 21:29     ` Kees Cook
  0 siblings, 0 replies; 59+ messages in thread
From: Kees Cook @ 2018-02-13 21:29 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W . Biederman, Tyler Hicks, Christian Brauner,
	Andy Lutomirski

On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> Hoist out the nth filter resolving logic that ptrace uses into a new
> function. We'll use this in the next patch to implement the new
> PTRACE_SECCOMP_GET_FILTER_FLAGS command. This is based on an older patch
> that I had sent a while ago; it significantly revamps the get_nth_filter
> logic based on previous suggestions from Oleg.

Is this the same as f06eae831f0c1fc5b982ea200daf552810e1dd55 ? Quick
compare says yes? Either way, please rebase to v4.16-rc1 (or -rc2 in
the future). :)

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 3/3] seccomp: add a way to get a listener fd from ptrace
  2018-02-04 10:49 ` [RFC 3/3] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
@ 2018-02-13 21:32       ` Kees Cook
  0 siblings, 0 replies; 59+ messages in thread
From: Kees Cook @ 2018-02-13 21:32 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W . Biederman, Tyler Hicks, Christian Brauner,
	Andy Lutomirski

On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> version which can acquire filters is useful. There are at least two reasons
> this is preferable, even though it uses ptrace:
>
> 1. You can control tasks that aren't cooperating with you
> 2. You can control tasks whose filters block sendmsg() and socket(); if the
>    task installs a filter which blocks these calls, there's no way with
>    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.

I got worried for a second that this would get us into a many-to-many
state, but I see init_listener enforces a single listener per filter.
Whew. Seems legit. :)

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 1/3] seccomp: add a return code to trap to userspace
  2018-02-13 21:09   ` Kees Cook
@ 2018-02-14 15:29         ` Tycho Andersen
  0 siblings, 0 replies; 59+ messages in thread
From: Tycho Andersen @ 2018-02-14 15:29 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, Akihiro Suda, Oleg Nesterov, LKML, Paul Moore,
	Eric W . Biederman, Tyler Hicks, Sargun Dhillon,
	Christian Brauner, Andy Lutomirski

Hey Kees,

Thanks for taking a look!

On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> > This patch introduces a means for syscalls matched in seccomp to notify
> > some other task that a particular filter has been triggered.
> >
> > The motivation for this is primarily for use with containers. For example,
> > if a container does an init_module(), we obviously don't want to load this
> > untrusted code, which may be compiled for the wrong version of the kernel
> > anyway. Instead, we could parse the module image, figure out which module
> > the container is trying to load and load it on the host.
> >
> > As another example, containers cannot mknod(), since this checks
> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > coding some whitelist in the kernel. Another example is mount(), which has
> > many security restrictions for good reason, but configuration or runtime
> > knowledge could potentially be used to relax these restrictions.
> 
> Related to the eBPF seccomp thread, can the logic for these things be
> handled entirely by eBPF? My assumption is that you still need to stop
> the process to do something (i.e. do a mknod, or a mount) before
> letting it continue. Is there some "wait for notification" system in
> eBPF?

I replied in the other thread
(https://patchwork.ozlabs.org/cover/872938/#1856642 for those
following along at home), but no, at least not that I know of.

> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex. Also worth noting that there
> > is one race still present:
> >
> >   1. a task does a SECCOMP_RET_USER_NOTIF
> >   2. the userspace handler reads this notification
> >   3. the task dies
> >   4. a new task with the same pid starts
> >   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
> >      that the previous one did
> >   6. the userspace handler writes a response
> >
> > There's no way to distinguish this case right now. Maybe we care, maybe we
> > don't, but it's worth noting.
> 
> So, I'd like to avoid the cookie if possible (surprise). Why isn't it
> possible to close the kernel-side of the fd to indicate that it lost
> the pid it was attached to?

Because the fd is for a filter, not a task.

> Is this just that the reader has no idea
> who is sending messages? So the risk is a fork/die loop within the
> same process tree (i.e. attached to the same filter)? Hrmpf. I can't
> think of a better way to handle the
> one(fd)-to-many(task-with-that-filter-attached) situation...

Yes, exactly. The cookie just adds uniqueness, and as Andy pointed out
if we switch to u64, the race above basically ("u64 should be enough
for anybody") goes away.
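To make the uniqueness argument concrete, here is a minimal userspace sketch, not code from the patch. The names toy_filter, toy_next_id, and toy_id_matches are hypothetical; the assumption is that each filter hands out cookies from a 64-bit counter that is never reset, so a cookie read for a task that has since died cannot match a later notification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for per-filter state; not the kernel struct. */
struct toy_filter {
	uint64_t next_id;	/* monotonically increasing, never reset */
};

/* Hand out a fresh cookie for each notification. */
static uint64_t toy_next_id(struct toy_filter *f)
{
	return f->next_id++;
}

/* A response is accepted only if its cookie matches the pending one. */
static bool toy_id_matches(uint64_t pending_id, uint64_t resp_id)
{
	return pending_id == resp_id;
}
```

With this scheme a pid can be recycled freely: the response written for the old notification carries the old cookie and is rejected, because the new task's notification got a different one.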

> > Right now the interface is a simple structure copy across a file
> > descriptor. We could potentially invent something fancier.
> 
> I wonder if this communication should be netlink, which gives a more
> well-structured way to describe what's on the wire? The reason I ask
> is because if we ever change the seccomp_data structure, we'll now
> have two places where we need to deal with it (the first being within
> the BPF itself). My initial idea was to prefix the communication with
> a size field, then send the structure, and then I had nightmares, and
> realized this was basically netlink reinvented.

I suggested netlink in LA, and everyone (especially Andy) groaned very
loudly :). I'm happy to switch it to netlink if you like, although I
think memcpy() of structs should be safe here, since the return value
from read or write can indicate the size of things.

> > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > memory data from the task still applies here, but can be avoided with
> > careful design of the userspace handler: if the userspace handler reads all
> > of the task memory that is necessary before applying its security policy,
> > the tracee's subsequent memory edits will not be read by the tracer.
> 
> Is this really true? Couldn't a multi-threaded process muck with
> memory out from under both the manager and the stopped process?

Sure, but as long as the manager copies the relevant arguments out of
the tracee's memory *before* evaluating whether it's safe to do the
thing the tracee wants to do, it's ok. The assumption here is that the
tracee can't corrupt the manager's memory (because if it could, lots
of other things would already be broken).
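The snapshot-then-check pattern described above might look roughly like this in a userspace handler. This is a sketch under assumptions: toy_path_allowed and toy_check_request are hypothetical names, the policy is just the /dev/null and /dev/zero example from the patch description, and tracee_mem stands in for memory the tracee (or one of its threads) can still modify.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PATH_MAX 64

/* Hypothetical policy: only these two device nodes may be created. */
static bool toy_path_allowed(const char *path)
{
	return strcmp(path, "/dev/null") == 0 || strcmp(path, "/dev/zero") == 0;
}

/*
 * Snapshot the argument first, then decide on the snapshot only.
 * On success the checked path is left in `snapshot`, and the caller
 * must act on that copy rather than re-reading tracee_mem.
 */
static bool toy_check_request(const char *tracee_mem,
			      char snapshot[TOY_PATH_MAX])
{
	strncpy(snapshot, tracee_mem, TOY_PATH_MAX - 1);
	snapshot[TOY_PATH_MAX - 1] = '\0';
	return toy_path_allowed(snapshot);
}
```

The point is that the policy decision and the eventual action both use the manager's private copy, so later edits to the tracee's memory cannot change what was approved.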

> >  /*
> >   * All BPF programs must return a 32-bit value.
> > @@ -34,6 +35,7 @@
> >  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
> >  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
> >  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> > +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
> >  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
> 
> /me tries to come up with an ordering rationale here and fails.
> 
> An ERRNO filter would block a USER_NOTIF because it's unconditional.
> TRACE could be either, USER_NOTIF could be either.
> 
> This means TRACE rules would be bumped by a USER_NOTIF... hmm.

Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all
seemed more important than USER_NOTIF, but TRACE didn't. I don't have
a strong opinion about what to do here, because users can adjust their
filters accordingly. Let me know what you prefer.
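For reference, the ordering question can be sketched as follows, assuming seccomp keeps its existing rule that, across all attached filters, the action with the lowest value (compared as a signed 32-bit number on the action bits) takes precedence. The TOY_ names are hypothetical copies of the constants quoted above; 0x7fc00000 is the value proposed in this RFC, not a settled ABI value.

```c
#include <stdint.h>

#define TOY_RET_ACTION_FULL	0xffff0000U
#define TOY_RET_TRAP		0x00030000U
#define TOY_RET_ERRNO		0x00050000U
#define TOY_RET_USER_NOTIF	0x7fc00000U	/* value proposed in this RFC */
#define TOY_RET_TRACE		0x7ff00000U
#define TOY_RET_ALLOW		0x7fff0000U

/* Compare on the action bits only, as a signed value. */
static int32_t toy_action_only(uint32_t ret)
{
	return (int32_t)(ret & TOY_RET_ACTION_FULL);
}

/* Return whichever of two filter results takes precedence. */
static uint32_t toy_pick(uint32_t a, uint32_t b)
{
	return toy_action_only(b) < toy_action_only(a) ? b : a;
}
```

Under that rule, toy_pick(TOY_RET_USER_NOTIF, TOY_RET_TRACE) yields USER_NOTIF, which is the "TRACE rules would be bumped" case, while an ERRNO result from any filter still overrides USER_NOTIF.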

> > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> 
> I wonder if it's time to split up seccomp.c ... probably not, but I've
> always been unhappy with the #ifdefs even for just regular _FILTER. ;)

A reasonable question. I'm happy to do that as a separate series
before this one goes in if you want.

> > +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf,
> > +                                   size_t size, loff_t *ppos)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       struct seccomp_notif_resp resp = {};
> > +       struct seccomp_knotif *knotif = NULL;
> > +       struct list_head *cur;
> > +       ssize_t ret = -EINVAL;
> > +
> > +       /* No partial writes. */
> > +       if (*ppos != 0)
> > +               return -EINVAL;
> > +
> > +       size = min_t(size_t, size, sizeof(resp));
> 
> In this case, we can't use min_t, size _must_ be == sizeof(resp),
> otherwise we're operating on what's in the stack (which is zeroed, but
> still).

I'm not sure I follow. If the user passes in an old (smaller) struct
seccomp_notif_resp, we don't want to copy more than they specified. If
they pass in a bigger one, this will be sizeof(resp).
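A userspace sketch of the size-tolerant copy being described here, for illustration only. The struct toy_resp layout and the helper toy_copy_resp are hypothetical; only the field names id, error, and val come from the quoted patch, and the field types are assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical response layout; field names follow the quoted patch. */
struct toy_resp {
	uint64_t id;
	int32_t error;
	int64_t val;
};

/*
 * Copy at most sizeof(*dst) bytes from a caller buffer of `size` bytes.
 * Fields the caller did not supply stay zero; trailing bytes beyond the
 * struct are ignored. Returns the number of bytes consumed.
 */
static size_t toy_copy_resp(struct toy_resp *dst, const void *src, size_t size)
{
	size_t n = size < sizeof(*dst) ? size : sizeof(*dst);

	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, n);
	return n;
}
```

This also shows the trade-off behind the objection: a short write succeeds and the missing fields are silently treated as zero, whereas a strict size == sizeof(resp) check would reject it outright.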

> > +       if (copy_from_user(&resp, buf, size))
> > +               return -EFAULT;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       list_for_each(cur, &filter->notifications) {
> > +               knotif = list_entry(cur, struct seccomp_knotif, list);
> > +
> > +               if (knotif->id == resp.id)
> > +                       break;
> 
> So we're finding the matching id here. Now, I'm trying to think about
> how this will look in real-world use: the pid will be _blocked_ while
> this is happening. And all the other pids that trip this filter will
> _also_ be blocked, since they're all waiting for the reader to read
> and respond. The risk is pid death while waiting, and having another
> appear with the same pid, trigger the same filter, get blocked, and
> then the reader replies for the old pid, and the new pid gets the
> results?

Yep, exactly.

> Since this notification queue is already linear, can't we use ordering
> to enforce this? i.e. only the pid at the head of the filter
> notification queue is going to have anything happening to it. Or is
> the idea to have multiple readers/writers of the fd?

I'm not really sure how we prevent multiple readers/writers of the fd.
But even with a single writer, the case you described "could" happen
(although again, with u64 cookies it shouldn't be a problem).

I'm not sure how ordering helps us though; the problem is really that
one entry for a pid was deleted, and a whole new one was created. So
ordering will look ok, but the response will go to the wrong pid.

> > +       }
> > +
> > +       if (!knotif || knotif->id != resp.id) {
> > +               ret = -EINVAL;
> > +               goto out;
> > +       }
> > +
> > +       ret = size;
> > +       knotif->state = SECCOMP_NOTIFY_WRITE;
> > +       knotif->error = resp.error;
> > +       knotif->val = resp.val;
> > +       complete(&knotif->ready);
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static const struct file_operations seccomp_notify_ops = {
> > +       .read = seccomp_notify_read,
> > +       .write = seccomp_notify_write,
> > +       /* TODO: poll */
> 
> What's needed for poll? I think you've got all the pieces you need
> already, i.e. wait queue, notifications, etc.

Nothing, I just didn't implement it. I will do so for v2.

> > +       .release = seccomp_notify_release,
> > +};
> > +
> > +static struct file *init_listener(struct seccomp_filter *filter)
> > +{
> > +       struct file *ret;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +       if (filter->has_listener) {
> > +               mutex_unlock(&filter->notify_lock);
> > +               return ERR_PTR(-EBUSY);
> > +       }
> > +
> > +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> > +                                filter, O_RDWR);
> > +       if (IS_ERR(ret)) {
> > +               __put_seccomp_filter(filter);
> > +       } else {
> > +               /*
> > +                * Intentionally don't put_seccomp_filter(). The file
> > +                * has a reference to it now.
> > +                */
> > +               filter->has_listener = true;
> > +       }
> 
> I spent some time staring at this, and I don't see it: where is the
> get_() for this? The caller of init_listener() already does a put() on
> the failure path. It seems like there is a get() missing near the
> start of init_listener(), or I've entirely missed something.

Ugh, yes. For the SECCOMP_FILTER_FLAG_GET_LISTENER case, you're right.
Originally I only had the ptrace-based one, and that has a get() in
get_nth_filter(), so the comment makes sense in that case.

I'll straighten this out for v2 and

> (Regardless, I think the usage counting needs a comment somewhere,
> maybe near the top of seccomp.c with the field?)

...add a comment.

> > +#define USER_NOTIF_MAGIC 116983961184613L
> 
> Is this just you mashing the numpad? :)

Is there some better way to generate magic numbers? :)

Cheers,

Tycho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 2/3] seccomp: hoist out filter resolving logic
  2018-02-13 21:29   ` Kees Cook
  2018-02-14 15:33     ` Tycho Andersen
@ 2018-02-14 15:33     ` Tycho Andersen
  1 sibling, 0 replies; 59+ messages in thread
From: Tycho Andersen @ 2018-02-14 15:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W . Biederman, Tyler Hicks, Christian Brauner,
	Andy Lutomirski

On Tue, Feb 13, 2018 at 01:29:23PM -0800, Kees Cook wrote:
> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> > Hoist out the nth filter resolving logic that ptrace uses into a new
> > function. We'll use this in the next patch to implement the new
> > PTRACE_SECCOMP_GET_FILTER_FLAGS command. This is based on an older patch
> > that I had sent a while ago; it significantly revamps the get_nth_filter
> > logic based on previous suggestions from Oleg.
> 
> Is this the same as f06eae831f0c1fc5b982ea200daf552810e1dd55 ? Quick
> compare says yes? Either way, please rebase to v4.16-rc1 (or -rc2 in
> the future). :)

Yep, there was no tagged tree with that when I did these; I'll do that
for the next version.

Cheers,

Tycho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 3/3] seccomp: add a way to get a listener fd from ptrace
       [not found]       ` <CAGXu5jLS2dzCjZOKa-W4kUdOPoJkRAq5Rsw1t5jX99v34yaoQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-14 15:33         ` Tycho Andersen
  0 siblings, 0 replies; 59+ messages in thread
From: Tycho Andersen @ 2018-02-14 15:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W . Biederman, Tyler Hicks, Christian Brauner,
	Andy Lutomirski

On Tue, Feb 13, 2018 at 01:32:26PM -0800, Kees Cook wrote:
> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > version which can acquire filters is useful. There are at least two reasons
> > this is preferable, even though it uses ptrace:
> >
> > 1. You can control tasks that aren't cooperating with you
> > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> >    task installs a filter which blocks these calls, there's no way with
> >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> 
> I got worried for a second that this would get us into a many-to-many
> state, but I see init_listener enforces a single listener per filter.
> Whew. Seems legit. :)

Yes, although if you sendmsg() the listener fd, you could still get
into that state, so it's still maybe a concern?
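For context, this is how any open fd, a listener fd included, can be handed to another process over a Unix socket with SCM_RIGHTS, which is why the single-listener check in init_listener does not by itself guarantee a single reader. The helpers toy_send_fd and toy_recv_fd are hypothetical names; the sendmsg(), recvmsg(), and CMSG_* calls are the standard socket API.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one fd. */
union toy_cmsg_buf {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send `fd` over the connected Unix socket `sock`. Returns 0 or -1. */
static int toy_send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union toy_cmsg_buf u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from `sock`. Returns the new fd or -1. */
static int toy_recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union toy_cmsg_buf u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver ends up with its own descriptor referring to the same open file, so both processes could read notifications from and write responses to the same filter.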

Tycho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 1/3] seccomp: add a return code to trap to userspace
  2018-02-14 15:29         ` Tycho Andersen
  (?)
  (?)
@ 2018-02-14 17:19         ` Andy Lutomirski
  -1 siblings, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2018-02-14 17:19 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Paul Moore, Sargun Dhillon, Eric W . Biederman,
	Christian Brauner, Tyler Hicks

On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> Hey Kees,
>
> Thanks for taking a look!
>
> On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
>> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
>> > This patch introduces a means for syscalls matched in seccomp to notify
>> > some other task that a particular filter has been triggered.
>> >
>> > The motivation for this is primarily for use with containers. For example,
>> > if a container does an init_module(), we obviously don't want to load this
>> > untrusted code, which may be compiled for the wrong version of the kernel
>> > anyway. Instead, we could parse the module image, figure out which module
>> > the container is trying to load and load it on the host.
>> >
>> > As another example, containers cannot mknod(), since this checks
>> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
>> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
>> > coding some whitelist in the kernel. Another example is mount(), which has
>> > many security restrictions for good reason, but configuration or runtime
>> > knowledge could potentially be used to relax these restrictions.
>>
>> Related to the eBPF seccomp thread, can the logic for these things be
>> handled entirely by eBPF? My assumption is that you still need to stop
>> the process to do something (i.e. do a mknod, or a mount) before
>> letting it continue. Is there some "wait for notification" system in
>> eBPF?
>
> I replied in the other thread
> (https://patchwork.ozlabs.org/cover/872938/#1856642 for those
> following along at home), but no, at least not that I know of.

eBPF can call functions.  One of those functions could put the caller
to sleep.  In fact, I think I once proposed doing this for the seccomp
logging action as well.

>> I wonder if this communication should be netlink, which gives a more
>> well-structured way to describe what's on the wire? The reason I ask
>> is because if we ever change the seccomp_data structure, we'll now
>> have two places where we need to deal with it (the first being within
>> the BPF itself). My initial idea was to prefix the communication with
>> a size field, then send the structure, and then I had nightmares, and
>> realized this was basically netlink reinvented.
>
> I suggested netlink in LA, and everyone (especially Andy) groaned very
> loudly :). I'm happy to switch it to netlink if you like, although I
> think memcpy() of structs should be safe here, since the return value
> from read or write can indicate the size of things.

I could easily get on board with "netlink" (i.e. NLA) messages sent
over an fd.  I will object strongly to the use of netlink *sockets*.
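As a rough illustration of what NLA-style messages over a plain fd could mean, here is a hand-rolled sketch of the attribute framing: a 16-bit length covering header plus payload, a 16-bit type, and padding to a 4-byte boundary. The names toy_attr, toy_align, and toy_put_attr are hypothetical, and this is neither the kernel's nla_put() nor anything proposed in the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same shape as a netlink attribute header: total length, then type. */
struct toy_attr {
	uint16_t len;	/* header plus payload, excluding padding */
	uint16_t type;
};

/* Round up to the 4-byte alignment netlink attributes use. */
static size_t toy_align(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/*
 * Append one attribute to buf at offset *off.
 * Returns 0 on success, -1 if it does not fit.
 */
static int toy_put_attr(unsigned char *buf, size_t cap, size_t *off,
			uint16_t type, const void *data, uint16_t data_len)
{
	struct toy_attr hdr;
	size_t total;

	if (data_len > UINT16_MAX - sizeof(hdr))
		return -1;
	hdr.len = (uint16_t)(sizeof(hdr) + data_len);
	hdr.type = type;
	total = toy_align(hdr.len);

	if (*off > cap || total > cap - *off)
		return -1;
	memcpy(buf + *off, &hdr, sizeof(hdr));
	memcpy(buf + *off + sizeof(hdr), data, data_len);
	memset(buf + *off + hdr.len, 0, total - hdr.len);	/* zero padding */
	*off += total;
	return 0;
}
```

A reader that does not recognize an attribute type can skip it by its length, which is the property that would let the message format grow without a fixed struct layout on the wire.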

>
>> An ERRNO filter would block a USER_NOTIF because it's unconditional.
>> TRACE could be either, USER_NOTIF could be either.
>>
>> This means TRACE rules would be bumped by a USER_NOTIF... hmm.
>
> Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all
> seemed more important than USER_NOTIF, but TRACE didn't. I don't have
> a strong opinion about what to do here, because users can adjust their
> filters accordingly. Let me know what you prefer.

If we switched to eBPF functions, this whole issue goes away.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 1/3] seccomp: add a return code to trap to userspace
       [not found]           ` <CALCETrXeZZfVzXh7SwKhyB=+ySDk5fhrrdrXrcABsQ=JpQT7Tg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-14 17:23             ` Tycho Andersen
  2018-02-15 14:48             ` Christian Brauner
  2018-02-27  0:49             ` Kees Cook
  2 siblings, 0 replies; 59+ messages in thread
From: Tycho Andersen @ 2018-02-14 17:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Paul Moore, Sargun Dhillon, Eric W . Biederman,
	Christian Brauner, Tyler Hicks

On Wed, Feb 14, 2018 at 05:19:52PM +0000, Andy Lutomirski wrote:
> On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> > Hey Kees,
> >
> > Thanks for taking a look!
> >
> > On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
> >> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> >> > This patch introduces a means for syscalls matched in seccomp to notify
> >> > some other task that a particular filter has been triggered.
> >> >
> >> > The motivation for this is primarily for use with containers. For example,
> >> > if a container does an init_module(), we obviously don't want to load this
> >> > untrusted code, which may be compiled for the wrong version of the kernel
> >> > anyway. Instead, we could parse the module image, figure out which module
> >> > the container is trying to load and load it on the host.
> >> >
> >> > As another example, containers cannot mknod(), since this checks
> >> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> >> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> >> > coding some whitelist in the kernel. Another example is mount(), which has
> >> > many security restrictions for good reason, but configuration or runtime
> >> > knowledge could potentially be used to relax these restrictions.
> >>
> >> Related to the eBPF seccomp thread, can the logic for these things be
> >> handled entirely by eBPF? My assumption is that you still need to stop
> >> the process to do something (i.e. do a mknod, or a mount) before
> >> letting it continue. Is there some "wait for notification" system in
> >> eBPF?
> >
> > I replied in the other thread
> > (https://patchwork.ozlabs.org/cover/872938/#1856642 for those
> > following along at home), but no, at least not that I know of.
> 
> eBPF can call functions.  One of those functions could put the caller
> to sleep.  In fact, I think I once proposed doing this for the seccomp
> logging action as well.

Yes, true. We could always add a bpf_func_map_lookup_wait or
something. I can look into that if it's preferable.
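
For anyone following along, the "stop the task until someone answers"
behaviour under discussion can be modelled in plain userspace C. This is
only a sketch under loud assumptions: the struct layouts, the toy policy
and every function name below are invented for illustration, and none of
it is what the RFC or any eBPF helper actually does. Two pipes and a
forked child stand in for the listener fd and the supervisor.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical wire structs, invented for this sketch. */
struct notif_req {
	uint64_t id;	/* cookie used to match the response */
	int32_t nr;	/* syscall number the task attempted */
};

struct notif_resp {
	uint64_t id;	/* echoes notif_req.id */
	int32_t error;	/* 0 to allow, -1 to deny */
};

/* Hypothetical policy: only syscall number 1 is allowed. */
static int32_t supervisor_decide(int32_t nr)
{
	return nr == 1 ? 0 : -1;
}

/* Child side: block for one request, decide, send one response. */
static void supervisor_loop(int req_fd, int resp_fd)
{
	struct notif_req req;
	struct notif_resp resp = { 0 };

	if (read(req_fd, &req, sizeof(req)) != (ssize_t)sizeof(req))
		_exit(1);
	resp.id = req.id;
	resp.error = supervisor_decide(req.nr);
	if (write(resp_fd, &resp, sizeof(resp)) != (ssize_t)sizeof(resp))
		_exit(1);
	_exit(0);
}

/*
 * Task side: send a request, then sleep in read() until the supervisor
 * answers. Returns the verdict, or -2 on any setup or I/O failure.
 */
static int32_t ask_supervisor(int32_t nr)
{
	int to_sup[2], from_sup[2];
	struct notif_req req = { .id = 42, .nr = nr };
	struct notif_resp resp;
	int32_t ret = -2;
	pid_t pid;

	assert(nr >= 0);	/* syscall numbers are non-negative here */
	if (pipe(to_sup) < 0)
		return -2;
	if (pipe(from_sup) < 0) {
		close(to_sup[0]);
		close(to_sup[1]);
		return -2;
	}

	pid = fork();
	if (pid == 0) {
		close(to_sup[1]);
		close(from_sup[0]);
		supervisor_loop(to_sup[0], from_sup[1]);	/* never returns */
	}

	close(to_sup[0]);
	close(from_sup[1]);
	if (pid > 0 &&
	    write(to_sup[1], &req, sizeof(req)) == (ssize_t)sizeof(req) &&
	    read(from_sup[0], &resp, sizeof(resp)) == (ssize_t)sizeof(resp) &&
	    resp.id == req.id)
		ret = resp.error;

	close(to_sup[1]);
	close(from_sup[0]);
	if (pid > 0)
		waitpid(pid, NULL, 0);
	return ret;
}
```

With this toy policy, ask_supervisor(1) comes back as 0 and
ask_supervisor(2) as -1. The point is only that the caller makes no
progress until the other side has written its answer.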

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 1/3] seccomp: add a return code to trap to userspace
       [not found]           ` <CALCETrXeZZfVzXh7SwKhyB=+ySDk5fhrrdrXrcABsQ=JpQT7Tg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-02-14 17:23             ` Tycho Andersen
@ 2018-02-15 14:48             ` Christian Brauner
  2018-02-27  0:49             ` Kees Cook
  2 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2018-02-15 14:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Paul Moore, Sargun Dhillon, Eric W . Biederman,
	Christian Brauner, Tyler Hicks

On Wed, Feb 14, 2018 at 05:19:52PM +0000, Andy Lutomirski wrote:
> On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> > Hey Kees,
> >
> > Thanks for taking a look!
> >
> > On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
> >> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> >> > This patch introduces a means for syscalls matched in seccomp to notify
> >> > some other task that a particular filter has been triggered.
> >> >
> >> > The motivation for this is primarily for use with containers. For example,
> >> > if a container does an init_module(), we obviously don't want to load this
> >> > untrusted code, which may be compiled for the wrong version of the kernel
> >> > anyway. Instead, we could parse the module image, figure out which module
> >> > the container is trying to load and load it on the host.
> >> >
> >> > As another example, containers cannot mknod(), since this checks
> >> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> >> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> >> > coding some whitelist in the kernel. Another example is mount(), which has
> >> > many security restrictions for good reason, but configuration or runtime
> >> > knowledge could potentially be used to relax these restrictions.
> >>
> >> Related to the eBPF seccomp thread, can the logic for these things be
> >> handled entirely by eBPF? My assumption is that you still need to stop
> >> the process to do something (i.e. do a mknod, or a mount) before
> >> letting it continue. Is there some "wait for notification" system in
> >> eBPF?
> >
> > I replied in the other thread
> > (https://patchwork.ozlabs.org/cover/872938/#1856642 for those
> > following along at home), but no, at least not that I know of.
> 
> eBPF can call functions.  One of those functions could put the caller
> to sleep.  In fact, I think I once proposed doing this for the seccomp
> logging action as well.
> 
> >> I wonder if this communication should be netlink, which gives a more
> >> well-structured way to describe what's on the wire? The reason I ask
> >> is because if we ever change the seccomp_data structure, we'll now
> >> have two places where we need to deal with it (the first being within
> >> the BPF itself). My initial idea was to prefix the communication with
> >> a size field, then send the structure, and then I had nightmares, and
> >> realized this was basically netlink reinvented.
> >
> > I suggested netlink in LA, and everyone (especially Andy) groaned very
> > loudly :). I'm happy to switch it to netlink if you like, although i
> > think memcpy() of structs should be safe here, since the return value
> > from read or write can indicate the size of things.
> 
> I could easily get on board with "netlink" (i.e. NLA) messages sent
> over an fd.  I will object strongly to the use of netlink *sockets*.

I think sending netlink messages makes perfect sense here although we
burden userspace with all those nice macros to parse these messages.
Are there already other cases where userspace gets netlink messages on
fds without having opened a netlink socket?
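
To make "NLA messages over an fd" a bit more concrete, here is a minimal
sketch of the type-length-value framing on its own, with no netlink
socket involved. The header mirrors the usual attribute shape (16-bit
length covering header plus payload, 16-bit type, payload padded to 4
bytes), but the names attr_hdr, attr_put, attr_get and the two attribute
types are assumptions made up for this example, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Attribute header: len counts the header plus the unpadded payload. */
struct attr_hdr {
	uint16_t len;
	uint16_t type;
};

#define ATTR_ALIGN(n)	(((size_t)(n) + 3) & ~(size_t)3)
#define ATTR_HDRLEN	ATTR_ALIGN(sizeof(struct attr_hdr))

/* Hypothetical attribute types, invented for this sketch. */
enum { ATTR_SYSCALL_NR = 1, ATTR_PID = 2 };

/* Append one attribute at offset off. Returns the new offset, 0 if no room. */
static size_t attr_put(unsigned char *buf, size_t cap, size_t off,
		       uint16_t type, const void *data, uint16_t dlen)
{
	struct attr_hdr h;
	size_t total = ATTR_HDRLEN + dlen;
	size_t padded = ATTR_ALIGN(total);

	assert(buf != NULL && data != NULL);
	if (total > UINT16_MAX || off > cap || padded > cap - off)
		return 0;

	h.len = (uint16_t)total;
	h.type = type;
	memcpy(buf + off, &h, sizeof(h));
	memcpy(buf + off + ATTR_HDRLEN, data, dlen);
	memset(buf + off + total, 0, padded - total);	/* zero the padding */
	return off + padded;
}

/*
 * Find the first attribute of the given type and copy its payload to out.
 * Returns the payload length, or -1 if it is missing, malformed, or does
 * not fit in outlen bytes.
 */
static int attr_get(const unsigned char *buf, size_t len, uint16_t type,
		    void *out, size_t outlen)
{
	size_t off = 0;

	assert(buf != NULL && out != NULL);
	while (off <= len && len - off >= ATTR_HDRLEN) {
		struct attr_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < ATTR_HDRLEN || h.len > len - off)
			return -1;	/* malformed length */
		if (h.type == type) {
			size_t plen = h.len - ATTR_HDRLEN;

			if (plen > outlen)
				return -1;
			memcpy(out, buf + off + ATTR_HDRLEN, plen);
			return (int)plen;
		}
		off += ATTR_ALIGN(h.len);	/* skip unknown attributes */
	}
	return -1;
}
```

A reader that does not know a given type simply skips it, which is what
would let the message grow new fields later without breaking old
listeners.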

> 
> >
> >> An ERRNO filter would block a USER_NOTIF because it's unconditional.
> >> TRACE could be either, USER_NOTIF could be either.
> >>
> >> This means TRACE rules would be bumped by a USER_NOTIF... hmm.
> >
> > Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all
> > seemed more important than USER_NOTIF, but TRACE didn't. I don't have
> > a strong opinion about what to do here, because users can adjust their
> > filters accordingly. Let me know what you prefer.
> 
> If we switched to eBPF functions, this whole issue goes away.
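
On the precedence question quoted above, a small sketch of how stacked
filters could be resolved may help: every filter returns an action, and
the numerically lowest action wins. The constant values below are
assumptions modelled on the existing uapi header and on the ordering
Tycho describes (ERRNO, TRAP and KILL below USER_NOTIF, TRACE above it).
The value this RFC actually assigns to USER_NOTIF may differ, and
resolve_actions is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative action values; the USER_NOTIF slot is an assumption. */
#define RET_KILL	0x00000000u
#define RET_TRAP	0x00030000u
#define RET_ERRNO	0x00050000u
#define RET_USER_NOTIF	0x7fc00000u
#define RET_TRACE	0x7ff00000u
#define RET_ALLOW	0x7fff0000u
#define RET_ACTION_MASK	0xffff0000u	/* high 16 bits: action */

/*
 * Pick the winning return value from n filters: the one whose action
 * part is lowest. The low 16 bits (data) ride along with the winner.
 */
static uint32_t resolve_actions(const uint32_t *rets, size_t n)
{
	uint32_t best = RET_ALLOW;

	assert(rets != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int32_t cur = (int32_t)(rets[i] & RET_ACTION_MASK);
		int32_t old = (int32_t)(best & RET_ACTION_MASK);

		if (cur < old)
			best = rets[i];
	}
	return best;
}
```

With this ordering, a TRACE filter stacked with a USER_NOTIF filter
resolves to USER_NOTIF, while an ERRNO filter beats both. That is the
"TRACE rules would be bumped" effect Kees points out.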

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 1/3] seccomp: add a return code to trap to userspace
       [not found]           ` <CALCETrXeZZfVzXh7SwKhyB=+ySDk5fhrrdrXrcABsQ=JpQT7Tg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-02-14 17:23             ` Tycho Andersen
  2018-02-15 14:48             ` Christian Brauner
@ 2018-02-27  0:49             ` Kees Cook
  2 siblings, 0 replies; 59+ messages in thread
From: Kees Cook @ 2018-02-27  0:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux Containers, Akihiro Suda, LKML, Oleg Nesterov, Paul Moore,
	Eric W . Biederman, Sargun Dhillon, Christian Brauner,
	Tyler Hicks

On Wed, Feb 14, 2018 at 9:19 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
>> On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
>>> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
>>> I wonder if this communication should be netlink, which gives a more
>>> well-structured way to describe what's on the wire? The reason I ask
>>> is because if we ever change the seccomp_data structure, we'll now
>>> have two places where we need to deal with it (the first being within
>>> the BPF itself). My initial idea was to prefix the communication with
>>> a size field, then send the structure, and then I had nightmares, and
>>> realized this was basically netlink reinvented.
>>
>> I suggested netlink in LA, and everyone (especially Andy) groaned very
>> loudly :). I'm happy to switch it to netlink if you like, although i
>> think memcpy() of structs should be safe here, since the return value
>> from read or write can indicate the size of things.
>
> I could easily get on board with "netlink" (i.e. NLA) messages sent
> over an fd.  I will object strongly to the use of netlink *sockets*.

Yeah, I was thinking NLA over the fd; not a netlink socket.
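
For comparison, the "let the size tell you the version" idea quoted
above can be sketched as a size-aware copy. This is a minimal
illustration, not the RFC's code: msg_v1, msg_v2 and copy_versioned are
hypothetical names, and the rule chosen here (zero-fill what the sender
did not provide, refuse non-zero bytes the reader does not understand)
is one possible policy among several.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message layouts: v2 only appends a field to v1. */
struct msg_v1 {
	uint32_t nr;
	uint32_t pid;
};

struct msg_v2 {
	uint32_t nr;
	uint32_t pid;
	uint64_t extra;
};

/*
 * Copy a src_size-byte message into a dst_size-byte struct.
 * A shorter source is zero-extended. A longer source is accepted only
 * if the bytes the reader cannot represent are all zero.
 * Returns 0 on success, -1 if unknown trailing fields are in use.
 */
static int copy_versioned(void *dst, size_t dst_size,
			  const void *src, size_t src_size)
{
	const unsigned char *s = src;
	size_t n = src_size < dst_size ? src_size : dst_size;

	assert(dst != NULL && src != NULL);
	for (size_t i = dst_size; i < src_size; i++)
		if (s[i] != 0)
			return -1;

	memset(dst, 0, dst_size);
	memcpy(dst, src, n);
	return 0;
}
```

The size passed in would come from the return value of read() or
write(), which is the property Tycho relies on in the quoted text.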

>>> An ERRNO filter would block a USER_NOTIF because it's unconditional.
>>> TRACE could be either, USER_NOTIF could be either.
>>>
>>> This means TRACE rules would be bumped by a USER_NOTIF... hmm.
>>
>> Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all
>> seemed more important than USER_NOTIF, but TRACE didn't. I don't have
>> a strong opinion about what to do here, because users can adjust their
>> filters accordingly. Let me know what you prefer.
>
> If we switched to eBPF functions, this whole issue goes away.

Yeah, though we'd still need some kind of "wait for answer" eBPF
function. It feels wrong to re-use maps for that...

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 1/3] seccomp: add a return code to trap to userspace
  2018-02-27  0:49           ` Kees Cook
@ 2018-02-27  3:27                 ` Andy Lutomirski
  0 siblings, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2018-02-27  3:27 UTC (permalink / raw)
  To: Kees Cook
  Cc: ast-DgEjT+Ai2ygdnm+yROfE0A, Linux Containers, Akihiro Suda, LKML,
	Oleg Nesterov, Paul Moore, Eric W . Biederman, Sargun Dhillon,
	Christian Brauner, Tyler Hicks



> On Feb 26, 2018, at 4:49 PM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
> 
>> On Wed, Feb 14, 2018 at 9:19 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Wed, Feb 14, 2018 at 3:29 PM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
>>>> On Tue, Feb 13, 2018 at 01:09:20PM -0800, Kees Cook wrote:
>>>> On Sun, Feb 4, 2018 at 2:49 AM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
>>>> I wonder if this communication should be netlink, which gives a more
>>>> well-structured way to describe what's on the wire? The reason I ask
>>>> is because if we ever change the seccomp_data structure, we'll now
>>>> have two places where we need to deal with it (the first being within
>>>> the BPF itself). My initial idea was to prefix the communication with
>>>> a size field, then send the structure, and then I had nightmares, and
>>>> realized this was basically netlink reinvented.
>>> 
>>> I suggested netlink in LA, and everyone (especially Andy) groaned very
>>> loudly :). I'm happy to switch it to netlink if you like, although i
>>> think memcpy() of structs should be safe here, since the return value
>>> from read or write can indicate the size of things.
>> 
>> I could easily get on board with "netlink" (i.e. NLA) messages sent
>> over an fd.  I will object strongly to the use of netlink *sockets*.
> 
> Yeah, I was thinking NLA over the fd; not a netlink socket.
> 
>>>> An ERRNO filter would block a USER_NOTIF because it's unconditional.
>>>> TRACE could be either, USER_NOTIF could be either.
>>>> 
>>>> This means TRACE rules would be bumped by a USER_NOTIF... hmm.
>>> 
>>> Yes, I didn't exactly know what to do here. ERRNO, TRAP, and KILL all
>>> seemed more important than USER_NOTIF, but TRACE didn't. I don't have
>>> a strong opinion about what to do here, because users can adjust their
>>> filters accordingly. Let me know what you prefer.
>> 
>> If we switched to eBPF functions, this whole issue goes away.
> 
> Yeah, though we'd still need some kind of "wait for answer" eBPF
> function. It feels wrong to re-use maps for that...
> 

BPF_CALL.

Alexei, can we make it so that each bpf program type can easily limit which BPF_CALL helpers can be used and allow bpf program types to add their own helpers?
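
To illustrate what is being asked for, here is a userspace sketch of a
per-program-type helper lookup: each program type supplies a callback
that either returns a helper prototype or NULL, and a verifier-like
check rejects any call that resolves to NULL. Every identifier below
(the enums, helper_proto, resolve_helper, the "seccomp_wait" helper) is
hypothetical and exists only in this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical program types and helper ids. */
enum prog_type { PROG_SOCKET_FILTER, PROG_SECCOMP, PROG_TYPE_MAX };
enum helper_id { HELPER_MAP_LOOKUP, HELPER_KTIME_GET_NS,
		 HELPER_SECCOMP_WAIT, HELPER_MAX };

struct helper_proto {
	const char *name;
	long (*fn)(long arg);
};

/* Stub helper bodies; real helpers would do actual work. */
static long helper_map_lookup(long key) { return key; }
static long helper_ktime_get_ns(long unused) { (void)unused; return 0; }
static long helper_seccomp_wait(long id) { return id; }

static const struct helper_proto map_lookup_proto = {
	"map_lookup", helper_map_lookup };
static const struct helper_proto ktime_proto = {
	"ktime_get_ns", helper_ktime_get_ns };
static const struct helper_proto seccomp_wait_proto = {
	"seccomp_wait", helper_seccomp_wait };

typedef const struct helper_proto *(*get_proto_fn)(enum helper_id id);

/* Socket filters: generic helpers only. */
static const struct helper_proto *socket_filter_get_proto(enum helper_id id)
{
	switch (id) {
	case HELPER_MAP_LOOKUP:		return &map_lookup_proto;
	case HELPER_KTIME_GET_NS:	return &ktime_proto;
	default:			return NULL;
	}
}

/* Seccomp: a restricted set plus its own type-specific helper. */
static const struct helper_proto *seccomp_get_proto(enum helper_id id)
{
	switch (id) {
	case HELPER_KTIME_GET_NS:	return &ktime_proto;
	case HELPER_SECCOMP_WAIT:	return &seccomp_wait_proto;
	default:			return NULL;
	}
}

static const get_proto_fn prog_ops[PROG_TYPE_MAX] = {
	[PROG_SOCKET_FILTER] = socket_filter_get_proto,
	[PROG_SECCOMP] = seccomp_get_proto,
};

/* Returns the helper a program of this type may call, or NULL to reject. */
static const struct helper_proto *resolve_helper(enum prog_type type,
						 enum helper_id id)
{
	assert(id < HELPER_MAX);
	if ((unsigned int)type >= PROG_TYPE_MAX || !prog_ops[type])
		return NULL;
	return prog_ops[type](id);
}
```

Here a seccomp program may call the hypothetical seccomp_wait helper but
not map_lookup, and a socket filter gets the opposite answer, so both
the limiting and the "add their own helpers" halves live in the
per-type callback.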

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
       [not found] ` <20180204104946.25559-1-tycho-E0fblnxP3wo@public.gmane.org>
                     ` (2 preceding siblings ...)
  2018-02-04 10:49   ` [RFC 3/3] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
@ 2018-03-15 16:09   ` Christian Brauner
  3 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2018-03-15 16:09 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Akihiro Suda, Oleg Nesterov, Andy Lutomirski, Eric W . Biederman,
	Christian Brauner, Tyler Hicks

On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
> Several months ago at Linux Plumber's, we had a discussion about adding a
> feature to seccomp which would allow seccomp to trigger a notification for some
> other process. Here's a draft of that feature.
> 
> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
> acquire the fd that receives notifications via ptrace (the method in patch 1
> poses some problems). Other suggestions for how to acquire one of these fds
> would be welcome.
> 
> Take a close look at the synchronization. I think I've got it right, but I
> probably don't :)
> 
> Thanks!
> 
> Tycho Andersen (3):
>   seccomp: add a return code to trap to userspace
>   seccomp: hoist out filter resolving logic
>   seccomp: add a way to get a listener fd from ptrace
> 
>  arch/Kconfig                                  |   7 +
>  include/linux/seccomp.h                       |  14 +-
>  include/uapi/linux/ptrace.h                   |   1 +
>  include/uapi/linux/seccomp.h                  |  18 +-
>  kernel/ptrace.c                               |   4 +
>  kernel/seccomp.c                              | 467 ++++++++++++++++++++++++--
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++-
>  7 files changed, 653 insertions(+), 38 deletions(-)

Hey,

So, I've been following the discussion silently in the background and I
see that it got sidetracked into seccomp + ebpf. While I can see that
there is value in adding ebpf support to seccomp I'd really like to see
this decoupled from this patchset. Afaict, this patchset would just work
fine without the ebpf portion (but I might just have missed the
point). So if possible I would like to see a second version of this with
the comments accounted for and - if possible - have this up for merging
independent of the ebpf patchset that's floating around.

Christian

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
  2018-02-04 10:49 [RFC 0/3] seccomp trap to userspace Tycho Andersen
                   ` (3 preceding siblings ...)
  2018-02-04 10:49 ` [RFC 3/3] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
@ 2018-03-15 16:09 ` Christian Brauner
       [not found]   ` <20180315160924.GA12744-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  4 siblings, 1 reply; 59+ messages in thread
From: Christian Brauner @ 2018-03-15 16:09 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda

On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
> Several months ago at Linux Plumber's, we had a discussion about adding a
> feature to seccomp which would allow seccomp to trigger a notification for some
> other process. Here's a draft of that feature.
> 
> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
> acquire the fd that receives notifications via ptrace (the method in patch 1
> poses some problems). Other suggestions for how to acquire one of these fds
> would be welcome.
> 
> Take a close look at the synchronization. I think I've got it right, but I
> probably don't :)
> 
> Thanks!
> 
> Tycho Andersen (3):
>   seccomp: add a return code to trap to userspace
>   seccomp: hoist out filter resolving logic
>   seccomp: add a way to get a listener fd from ptrace
> 
>  arch/Kconfig                                  |   7 +
>  include/linux/seccomp.h                       |  14 +-
>  include/uapi/linux/ptrace.h                   |   1 +
>  include/uapi/linux/seccomp.h                  |  18 +-
>  kernel/ptrace.c                               |   4 +
>  kernel/seccomp.c                              | 467 ++++++++++++++++++++++++--
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++-
>  7 files changed, 653 insertions(+), 38 deletions(-)

Hey,

So, I've been following the discussion silently in the background and I
see that it got sidetracked into seccomp + ebpf. While I can see that
there is value in adding ebpf support to seccomp I'd really like to see
this decoupled from this patchset. Afaict, this patchset would just work
fine without the ebpf portion (but I might just have missed the
point). So if possible I would like to see a second version of this with
the comments accounted for and - if possible - have this up for merging
independent of the ebpf patchset that's floating around.

Christian

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
  2018-03-15 16:09 ` [RFC 0/3] seccomp trap to userspace Christian Brauner
@ 2018-03-15 16:56       ` Andy Lutomirski
  0 siblings, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2018-03-15 16:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W . Biederman, Christian Brauner, Tyler Hicks

On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
<christian.brauner-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote:
> On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
>> Several months ago at Linux Plumber's, we had a discussion about adding a
>> feature to seccomp which would allow seccomp to trigger a notification for some
>> other process. Here's a draft of that feature.
>>
>> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
>> acquire the fd that receives notifications via ptrace (the method in patch 1
>> poses some problems). Other suggestions for how to acquire one of these fds
>> would be welcome.
>>
>> Take a close look at the synchronization. I think I've got it right, but I
>> probably don't :)
>>
>> Thanks!
>>
>> Tycho Andersen (3):
>>   seccomp: add a return code to trap to userspace
>>   seccomp: hoist out filter resolving logic
>>   seccomp: add a way to get a listener fd from ptrace
>>
>>  arch/Kconfig                                  |   7 +
>>  include/linux/seccomp.h                       |  14 +-
>>  include/uapi/linux/ptrace.h                   |   1 +
>>  include/uapi/linux/seccomp.h                  |  18 +-
>>  kernel/ptrace.c                               |   4 +
>>  kernel/seccomp.c                              | 467 ++++++++++++++++++++++++--
>>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++-
>>  7 files changed, 653 insertions(+), 38 deletions(-)
>
> Hey,
>
> So, I've been following the discussion silently in the background and I
> see that it got sidetracked into seccomp + ebpf. While I can see that
> there is value in adding ebpf support to seccomp I'd really like to see
> this decoupled from this patchset. Afaict, this patchset would just work
> fine without the ebpf portion (but I might just have missed the
> point). So if possible I would like to see a second version of this with
> the comments accounted for and - if possible - have this up for merging
> independent of the ebpf patchset that's floating around.
>

The issue is that it might be (and, then again, might not be) nicer to
*synchronously* call out to the monitor in the filter.  eBPF can do
that very cleanly, whereas classic BPF can't.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
@ 2018-03-15 16:56       ` Andy Lutomirski
  0 siblings, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2018-03-15 16:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Tycho Andersen, LKML, Linux Containers, Kees Cook, Oleg Nesterov,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda

On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
<christian.brauner@canonical.com> wrote:
> On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
>> Several months ago at Linux Plumber's, we had a discussion about adding a
>> feature to seccomp which would allow seccomp to trigger a notification for some
>> other process. Here's a draft of that feature.
>>
>> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
>> acquire the fd that receives notifications via ptrace (the method in patch 1
>> poses some problems). Other suggestions for how to acquire one of these fds
>> would be welcome.
>>
>> Take a close look at the synchronization. I think I've got it right, but I
>> probably don't :)
>>
>> Thanks!
>>
>> Tycho Andersen (3):
>>   seccomp: add a return code to trap to userspace
>>   seccomp: hoist out filter resolving logic
>>   seccomp: add a way to get a listener fd from ptrace
>>
>>  arch/Kconfig                                  |   7 +
>>  include/linux/seccomp.h                       |  14 +-
>>  include/uapi/linux/ptrace.h                   |   1 +
>>  include/uapi/linux/seccomp.h                  |  18 +-
>>  kernel/ptrace.c                               |   4 +
>>  kernel/seccomp.c                              | 467 ++++++++++++++++++++++++--
>>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++-
>>  7 files changed, 653 insertions(+), 38 deletions(-)
>
> Hey,
>
> So, I've been following the discussion silently in the background and I
> see that it got sidetracked into seccomp + ebpf. While I can see that
> there is value in adding ebpf support to seccomp I'd really like to see
> this decoupled from this patchset. Afaict, this patchset would just work
> fine without the ebpf portion (but I might just have missed the
> point). So if possible I would like to see a second version of this with
> the comments accounted for and - if possible - have this up for merging
> independent of the ebpf patchset that's floating around.
>

The issue is that it might be (and, then again, might not be) nicer to
*synchronously* call out to the monitor in the filter.  eBPF can do
that very cleanly, whereas classic BPF can't.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
       [not found]       ` <CALCETrVnvbZLx5v=DMu2N1JtR+ys507X5CYBi-qQnus3VMQdwg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-03-15 17:05         ` Serge E. Hallyn
  0 siblings, 0 replies; 59+ messages in thread
From: Serge E. Hallyn @ 2018-03-15 17:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Christian Brauner, Eric W . Biederman, Christian Brauner,
	Tyler Hicks

Quoting Andy Lutomirski (luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
> <christian.brauner-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote:
> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
> >> Several months ago at Linux Plumber's, we had a discussion about adding a
> >> feature to seccomp which would allow seccomp to trigger a notification for some
> >> other process. Here's a draft of that feature.
> >>
> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
> >> acquire the fd that receives notifications via ptrace (the method in patch 1
> >> poses some problems). Other suggestions for how to acquire one of these fds
> >> would be welcome.
> >>
> >> Take a close look at the synchronization. I think I've got it right, but I
> >> probably don't :)
> >>
> >> Thanks!
> >>
> >> Tycho Andersen (3):
> >>   seccomp: add a return code to trap to userspace
> >>   seccomp: hoist out filter resolving logic
> >>   seccomp: add a way to get a listener fd from ptrace
> >>
> >>  arch/Kconfig                                  |   7 +
> >>  include/linux/seccomp.h                       |  14 +-
> >>  include/uapi/linux/ptrace.h                   |   1 +
> >>  include/uapi/linux/seccomp.h                  |  18 +-
> >>  kernel/ptrace.c                               |   4 +
> >>  kernel/seccomp.c                              | 467 ++++++++++++++++++++++++--
> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++-
> >>  7 files changed, 653 insertions(+), 38 deletions(-)
> >
> > Hey,
> >
> > So, I've been following the discussion silently in the background and I
> > see that it got sidetracked into seccomp + ebpf. While I can see that
> > there is value in adding ebpf support to seccomp I'd really like to see
> > this decoupled from this patchset. Afaict, this patchset would just work
> > fine without the ebpf portion (but I might just have missed the
> > point). So if possible I would like to see a second version of this with
> > the comments accounted for and - if possible - have this up for merging
> > independent of the ebpf patchset that's floating around.
> >
> 
> The issue is that it might be (and, then again, might not be) nicer to
> *synchronously* call out to the monitor in the filter.  eBPF can do
> that very cleanly, whereas classic BPF can't.

Hm, synchronously - that brings to mind a thought...  I should re-look at
Tycho's patches first, but, if I'm in a container, start some syscall that
gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
the handler be interrupted and have it return -EINTR.  Is that going to
be possible with the synchronous approach?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
  2018-03-15 16:56       ` Andy Lutomirski
  (?)
@ 2018-03-15 17:05       ` Serge E. Hallyn
  2018-03-15 17:11         ` Andy Lutomirski
       [not found]         ` <20180315170509.GA32766-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  -1 siblings, 2 replies; 59+ messages in thread
From: Serge E. Hallyn @ 2018-03-15 17:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christian Brauner, Tycho Andersen, LKML, Linux Containers,
	Kees Cook, Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda

Quoting Andy Lutomirski (luto@kernel.org):
> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
> <christian.brauner@canonical.com> wrote:
> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
> >> Several months ago at Linux Plumber's, we had a discussion about adding a
> >> feature to seccomp which would allow seccomp to trigger a notification for some
> >> other process. Here's a draft of that feature.
> >>
> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
> >> acquire the fd that receives notifications via ptrace (the method in patch 1
> >> poses some problems). Other suggestions for how to acquire one of these fds
> >> would be welcome.
> >>
> >> Take a close look at the synchronization. I think I've got it right, but I
> >> probably don't :)
> >>
> >> Thanks!
> >>
> >> Tycho Andersen (3):
> >>   seccomp: add a return code to trap to userspace
> >>   seccomp: hoist out filter resolving logic
> >>   seccomp: add a way to get a listener fd from ptrace
> >>
> >>  arch/Kconfig                                  |   7 +
> >>  include/linux/seccomp.h                       |  14 +-
> >>  include/uapi/linux/ptrace.h                   |   1 +
> >>  include/uapi/linux/seccomp.h                  |  18 +-
> >>  kernel/ptrace.c                               |   4 +
> >>  kernel/seccomp.c                              | 467 ++++++++++++++++++++++++--
> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++-
> >>  7 files changed, 653 insertions(+), 38 deletions(-)
> >
> > Hey,
> >
> > So, I've been following the discussion silently in the background and I
> > see that it got sidetracked into seccomp + ebpf. While I can see that
> > there is value in adding ebpf support to seccomp I'd really like to see
> > this decoupled from this patchset. Afaict, this patchset would just work
> > fine without the ebpf portion (but I might just have missed the
> > point). So if possible I would like to see a second version of this with
> > the comments accounted for and - if possible - have this up for merging
> > independent of the ebpf patchset that's floating around.
> >
> 
> The issue is that it might be (and, then again, might not be) nicer to
> *synchronously* call out to the monitor in the filter.  eBPF can do
> that very cleanly, whereas classic BPF can't.

Hm, synchronously - that brings to mind a thought...  I should re-look at
Tycho's patches first, but, if I'm in a container, start some syscall that
gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
the handler be interrupted and have it return -EINTR.  Is that going to
be possible with the synchronous approach?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
       [not found]         ` <20180315170509.GA32766-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2018-03-15 17:11           ` Andy Lutomirski
  0 siblings, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2018-03-15 17:11 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Kees Cook, Linux Containers, Akihiro Suda, LKML, Oleg Nesterov,
	Christian Brauner, Eric W . Biederman, Andy Lutomirski,
	Christian Brauner, Tyler Hicks

On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Andy Lutomirski (luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
>> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
>> <christian.brauner-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote:
>> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
>> >> Several months ago at Linux Plumber's, we had a discussion about adding a
>> >> feature to seccomp which would allow seccomp to trigger a notification for some
>> >> other process. Here's a draft of that feature.
>> >>
>> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
>> >> acquire the fd that receives notifications via ptrace (the method in patch 1
>> >> poses some problems). Other suggestions for how to acquire one of these fds
>> >> would be welcome.
>> >>
>> >> Take a close look at the synchronization. I think I've got it right, but I
>> >> probably don't :)
>> >>
>> >> Thanks!
>> >>
>> >> Tycho Andersen (3):
>> >>   seccomp: add a return code to trap to userspace
>> >>   seccomp: hoist out filter resolving logic
>> >>   seccomp: add a way to get a listener fd from ptrace
>> >>
>> >>  arch/Kconfig                                  |   7 +
>> >>  include/linux/seccomp.h                       |  14 +-
>> >>  include/uapi/linux/ptrace.h                   |   1 +
>> >>  include/uapi/linux/seccomp.h                  |  18 +-
>> >>  kernel/ptrace.c                               |   4 +
>> >>  kernel/seccomp.c                              | 467 ++++++++++++++++++++++++--
>> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++-
>> >>  7 files changed, 653 insertions(+), 38 deletions(-)
>> >
>> > Hey,
>> >
>> > So, I've been following the discussion silently in the background and I
>> > see that it got sidetracked into seccomp + ebpf. While I can see that
>> > there is value in adding ebpf support to seccomp I'd really like to see
>> > this decoupled from this patchset. Afaict, this patchset would just work
>> > fine without the ebpf portion (but I might just have missed the
>> > point). So if possible I would like to see a second version of this with
>> > the comments accounted for and - if possible - have this up for merging
>> > independent of the ebpf patchset that's floating around.
>> >
>>
>> The issue is that it might be (and, then again, might not be) nicer to
>> *synchronously* call out to the monitor in the filter.  eBPF can do
>> that very cleanly, whereas classic BPF can't.
>
> Hm, synchronously - that brings to mind a thought...  I should re-look at
> Tycho's patches first, but, if I'm in a container, start some syscall that
> gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> the handler be interrupted and have it return -EINTR.  Is that going to
> be possible with the synchronous approach?

I think so, but it should be possible with the classic async approach
too.  The main issue is the difference between a classic filter like
this (pseudocode):

if (nr == SYS_mount) return TRAP_TO_USERSPACE;

and the eBPF variant:

if (nr == SYS_mount) trap_to_userspace();

I admit that it's still not 100% clear to me that the latter is
genuinely more useful than the former.
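
A minimal sketch of what the classic variant looks like as an actual cBPF
program, for comparison. It assumes the trap action is spelled
SECCOMP_RET_USER_NOTIF with value 0x7fc00000U, which is the name and value
mainline Linux later used; the return code in this RFC may differ. The array
and struct names are illustrative only, and a real filter would also check
seccomp_data.arch before looking at the syscall number.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Assumption: fallback for older headers; value taken from later mainline. */
#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif

/*
 * Classic BPF can only communicate through its return value, so
 * "trap to userspace" has to be one of the filter's verdicts.
 * NOTE: the seccomp_data.arch check is omitted to keep the sketch short.
 */
static const struct sock_filter trap_mount_filter[] = {
	/* A = seccomp_data.nr */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* if (A == SYS_mount) run the next instruction, else skip it */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mount, 0, 1),
	/* return TRAP_TO_USERSPACE */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
	/* everything else is allowed */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Program descriptor in the form seccomp(2) / prctl(2) expect. */
static const struct sock_fprog trap_mount_prog = {
	.len = sizeof(trap_mount_filter) / sizeof(trap_mount_filter[0]),
	.filter = (struct sock_filter *)trap_mount_filter,
};
```

The program is only built here, not installed. Installing it would go through
seccomp(SECCOMP_SET_MODE_FILTER, ...) after setting PR_SET_NO_NEW_PRIVS.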

The case where I think the synchronous function call is a huge win is this one:

if (nr  == SYS_mount) {
  log("Someone called mount with args %lx\n", ...);
  return RET_KILL;
}

The idea being that the log message wouldn't show up in the kernel log
-- it would get sent to the listener socket belonging to whoever
created the filter, and that process could then go and log it
properly.  This would work perfectly in containers and in totally
unprivileged applications like Chromium.
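
The log-then-kill case can be modelled in plain userspace C to make the
control flow concrete. This is only a sketch under stated assumptions: no such
eBPF helper exists in this patchset, and struct listener, log_to_listener(),
model_filter() and the verdict enum are all hypothetical stand-ins. The
"listener" is just an in-memory buffer standing in for the fd owned by
whoever installed the filter.

```c
#include <stdio.h>
#include <sys/syscall.h>

/* Hypothetical verdicts standing in for the real seccomp return codes. */
enum verdict { RET_ALLOW, RET_KILL };

/* Hypothetical listener: models the fd owned by the filter's creator. */
struct listener {
	char buf[128];	/* last message delivered */
	int nmsgs;	/* number of messages delivered so far */
};

/* Models a helper the filter calls *synchronously*, before returning. */
static void log_to_listener(struct listener *l, const char *name,
			    unsigned long arg)
{
	snprintf(l->buf, sizeof(l->buf),
		 "Someone called %s with args %lx\n", name, arg);
	l->nmsgs++;
}

/*
 * The filter does two things for mount: it reports to the listener and
 * then still returns its own verdict. A classic filter cannot do this,
 * because its only output is the return value.
 */
static enum verdict model_filter(struct listener *l, int nr,
				 unsigned long arg0)
{
	if (nr == SYS_mount) {
		log_to_listener(l, "mount", arg0);
		return RET_KILL;
	}
	return RET_ALLOW;
}
```

Calling model_filter() with SYS_mount leaves one message in the listener and
yields RET_KILL; any other syscall number yields RET_ALLOW and leaves the
listener untouched.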

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
  2018-03-15 17:05       ` Serge E. Hallyn
@ 2018-03-15 17:11         ` Andy Lutomirski
  2018-03-15 17:35           ` Tycho Andersen
       [not found]           ` <CALCETrXPcCNbpFJhXktkVS9gOPpmnU_bbY6Z8RrsBarq0dP4Lg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]         ` <20180315170509.GA32766-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  1 sibling, 2 replies; 59+ messages in thread
From: Andy Lutomirski @ 2018-03-15 17:11 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Andy Lutomirski, Christian Brauner, Tycho Andersen, LKML,
	Linux Containers, Kees Cook, Oleg Nesterov, Eric W . Biederman,
	Christian Brauner, Tyler Hicks, Akihiro Suda

On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> Quoting Andy Lutomirski (luto@kernel.org):
>> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
>> <christian.brauner@canonical.com> wrote:
>> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
>> >> Several months ago at Linux Plumber's, we had a discussion about adding a
>> >> feature to seccomp which would allow seccomp to trigger a notification for some
>> >> other process. Here's a draft of that feature.
>> >>
>> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
>> >> acquire the fd that receives notifications via ptrace (the method in patch 1
>> >> poses some problems). Other suggestions for how to acquire one of these fds
>> >> would be welcome.
>> >>
>> >> Take a close look at the synchronization. I think I've got it right, but I
>> >> probably don't :)
>> >>
>> >> Thanks!
>> >>
>> >> Tycho Andersen (3):
>> >>   seccomp: add a return code to trap to userspace
>> >>   seccomp: hoist out filter resolving logic
>> >>   seccomp: add a way to get a listener fd from ptrace
>> >>
>> >>  arch/Kconfig                                  |   7 +
>> >>  include/linux/seccomp.h                       |  14 +-
>> >>  include/uapi/linux/ptrace.h                   |   1 +
>> >>  include/uapi/linux/seccomp.h                  |  18 +-
>> >>  kernel/ptrace.c                               |   4 +
>> >>  kernel/seccomp.c                              | 467 ++++++++++++++++++++++++--
>> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++-
>> >>  7 files changed, 653 insertions(+), 38 deletions(-)
>> >
>> > Hey,
>> >
>> > So, I've been following the discussion silently in the background and I
>> > see that it got sidetracked into seccomp + ebpf. While I can see that
>> > there is value in adding ebpf support to seccomp I'd really like to see
>> > this decoupled from this patchset. Afaict, this patchset would just work
>> > fine without the ebpf portion (but I might just have missed the
>> > point). So if possible I would like to see a second version of this with
>> > the comments accounted for and - if possible - have this up for merging
>> > independent of the ebpf patchset that's floating around.
>> >
>>
>> The issue is that it might be (and, then again, might not be) nicer to
>> *synchronously* call out to the monitor in the filter.  eBPF can do
>> that very cleanly, whereas classic BPF can't.
>
> Hm, synchronously - that brings to mind a thought...  I should re-look at
> Tycho's patches first, but, if I'm in a container, start some syscall that
> gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> the handler be interrupted and have it return -EINTR.  Is that going to
> be possible with the synchronous approach?

I think so, but it should be possible with the classic async approach
too.  The main issue is the difference between a classic filter like
this (pseudocode):

if (nr == SYS_mount) return TRAP_TO_USERSPACE;

and the eBPF variant:

if (nr == SYS_mount) trap_to_userspace();

I admit that it's still not 100% clear to me that the latter is
genuinely more useful than the former.

The case where I think the synchronous function call is a huge win is this one:

if (nr  == SYS_mount) {
  log("Someone called mount with args %lx\n", ...);
  return RET_KILL;
}

The idea being that the log message wouldn't show up in the kernel log
-- it would get sent to the listener socket belonging to whoever
created the filter, and that process could then go and log it
properly.  This would work perfectly in containers and in totally
unprivileged applications like Chromium.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
  2018-03-15 17:11         ` Andy Lutomirski
@ 2018-03-15 17:25               ` Christian Brauner
       [not found]           ` <CALCETrXPcCNbpFJhXktkVS9gOPpmnU_bbY6Z8RrsBarq0dP4Lg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2018-03-15 17:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W . Biederman, Christian Brauner, Tyler Hicks

On Thu, Mar 15, 2018 at 05:11:32PM +0000, Andy Lutomirski wrote:
> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> > Quoting Andy Lutomirski (luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> >> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
> >> <christian.brauner-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote:
> >> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
> >> >> Several months ago at Linux Plumber's, we had a discussion about adding a
> >> >> feature to seccomp which would allow seccomp to trigger a notification for some
> >> >> other process. Here's a draft of that feature.
> >> >>
> >> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
> >> >> acquire the fd that receives notifications via ptrace (the method in patch 1
> >> >> poses some problems). Other suggestions for how to acquire one of these fds
> >> >> would be welcome.
> >> >>
> >> >> Take a close look at the synchronization. I think I've got it right, but I
> >> >> probably don't :)
> >> >>
> >> >> Thanks!
> >> >>
> >> >> Tycho Andersen (3):
> >> >>   seccomp: add a return code to trap to userspace
> >> >>   seccomp: hoist out filter resolving logic
> >> >>   seccomp: add a way to get a listener fd from ptrace
> >> >>
> >> >>  arch/Kconfig                                  |   7 +
> >> >>  include/linux/seccomp.h                       |  14 +-
> >> >>  include/uapi/linux/ptrace.h                   |   1 +
> >> >>  include/uapi/linux/seccomp.h                  |  18 +-
> >> >>  kernel/ptrace.c                               |   4 +
> >> >>  kernel/seccomp.c                              | 467 ++++++++++++++++++++++++--
> >> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++-
> >> >>  7 files changed, 653 insertions(+), 38 deletions(-)
> >> >
> >> > Hey,
> >> >
> >> > So, I've been following the discussion silently in the background and I
> >> > see that it got sidetracked into seccomp + ebpf. While I can see that
> >> > there is value in adding ebpf support to seccomp I'd really like to see
> >> > this decoupled from this patchset. Afaict, this patchset would just work
> >> > fine without the ebpf portion (but I might just have missed the
> >> > point). So if possible I would like to see a second version of this with
> >> > the comments accounted for and - if possible - have this up for merging
> >> > independent of the ebpf patchset that's floating around.
> >> >
> >>
> >> The issue is that it might be (and, then again, might not be) nicer to
> >> *synchronously* call out to the monitor in the filter.  eBPF can do
> >> that very cleanly, whereas classic BPF can't.
> >
> > Hm, synchronously - that brings to mind a thought...  I should re-look at
> > Tycho's patches first, but, if I'm in a container, start some syscall that
> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> > the handler be interrupted and have it return -EINTR.  Is that going to
> > be possible with the synchronous approach?
> 
> I think so, but it should be possible with the classic async approach
> too.  The main issue is the difference between a classic filter like
> this (pseudocode):
> 
> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
> 
> and the eBPF variant:
> 
> if (nr == SYS_mount) trap_to_userspace();
> 
> I admit that it's still not 100% clear to me that the latter is
> genuinely more useful than the former.

We've just discussed this on irc and the fact that most problems can be
addressed by interfaces we already have makes it questionable what ebpf
brings to the game here. Especially since the discussion gave the
impression that if ebpf ever makes it to seccomp it will basically be
because it allows a nice implementation of the trap to userspace. If
it's even unclear whether it is really the better choice for this task
then we could consider not trying to make this patchset use it. (I
probably sound way more polemic than I intend to.)

> 
> The case where I think the synchronous function call is a huge win is this one:
> 
> if (nr  == SYS_mount) {
>   log("Someone called mount with args %lx\n", ...);
>   return RET_KILL;
> }
> 
> The idea being that the log message wouldn't show up in the kernel log
> -- it would get sent to the listener socket belonging to whoever
> created the filter, and that process could then go and log it
> properly.  This would work perfectly in containers and in totally
> unprivileged applications like Chromium.

Hm, that is a decent point but that's also a non-essential feature. I
also wonder if there's any reason to not simply extend it to use ebpf
later if seccomp ever uses it?

Christian

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
@ 2018-03-15 17:25               ` Christian Brauner
  0 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2018-03-15 17:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Serge E. Hallyn, Tycho Andersen, LKML, Linux Containers,
	Kees Cook, Oleg Nesterov, Eric W . Biederman, Christian Brauner,
	Tyler Hicks, Akihiro Suda

On Thu, Mar 15, 2018 at 05:11:32PM +0000, Andy Lutomirski wrote:
> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> > Quoting Andy Lutomirski (luto@kernel.org):
> >> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
> >> <christian.brauner@canonical.com> wrote:
> >> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
> >> >> Several months ago at Linux Plumber's, we had a discussion about adding a
> >> >> feature to seccomp which would allow seccomp to trigger a notification for some
> >> >> other process. Here's a draft of that feature.
> >> >>
> >> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
> >> >> acquire the fd that receives notifications via ptrace (the method in patch 1
> >> >> poses some problems). Other suggestions for how to acquire one of these fds
> >> >> would be welcome.
> >> >>
> >> >> Take a close look at the synchronization. I think I've got it right, but I
> >> >> probably don't :)
> >> >>
> >> >> Thanks!
> >> >>
> >> >> Tycho Andersen (3):
> >> >>   seccomp: add a return code to trap to userspace
> >> >>   seccomp: hoist out filter resolving logic
> >> >>   seccomp: add a way to get a listener fd from ptrace
> >> >>
> >> >>  arch/Kconfig                                  |   7 +
> >> >>  include/linux/seccomp.h                       |  14 +-
> >> >>  include/uapi/linux/ptrace.h                   |   1 +
> >> >>  include/uapi/linux/seccomp.h                  |  18 +-
> >> >>  kernel/ptrace.c                               |   4 +
> >> >>  kernel/seccomp.c                              | 467 ++++++++++++++++++++++++--
> >> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++-
> >> >>  7 files changed, 653 insertions(+), 38 deletions(-)
> >> >
> >> > Hey,
> >> >
> >> > So, I've been following the discussion silently in the background and I
> >> > see that it got sidetracked into seccomp + ebpf. While I can see that
> >> > there is value in adding ebpf support to seccomp I'd really like to see
> >> > this decoupled from this patchset. Afaict, this patchset would just work
> >> > fine without the ebpf portion (but I might just have missed the
> >> > point). So if possible I would like to see a second version of this with
> >> > the comments accounted for and - if possible - have this up for merging
> >> > independent of the ebpf patchset that's floating around.
> >> >
> >>
> >> The issue is that it might be (and, then again, might not be) nicer to
> >> *synchronously* call out to the monitor in the filter.  eBPF can do
> >> that very cleanly, whereas classic BPF can't.
> >
> > Hm, synchronously - that brings to mind a thought...  I should re-look at
> > Tycho's patches first, but, if I'm in a container, start some syscall that
> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> > the handler be interrupted and have it return -EINTR.  Is that going to
> > be possible with the synchronous approach?
> 
> I think so, but it should be possible with the classic async approach
> too.  The main issue is the difference between a classic filter like
> this (pseudocode):
> 
> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
> 
> and the eBPF variant:
> 
> if (nr == SYS_mount) trap_to_userspace();
> 
> I admit that it's still not 100% clear to me that the latter is
> genuinely more useful than the former.

We've just discussed this on irc and the fact that most problems can be
addressed by interfaces we already have makes it questionable what ebpf
brings to the game here. Especially since the discussion gave the
impression that if ebpf ever makes it to seccomp it will basically be
because it allows a nice implementation of the trap to userspace. If
it's even unclear whether it is really the better choice for this task
then we could consider not trying to make this patchset use it. (I
probably sound way more polemic than I intend to.)

> 
> The case where I think the synchronous function call is a huge win is this one:
> 
> if (nr  == SYS_mount) {
>   log("Someone called mount with args %lx\n", ...);
>   return RET_KILL;
> }
> 
> The idea being that the log message wouldn't show up in the kernel log
> -- it would get sent to the listener socket belonging to whoever
> created the filter, and that process could then go and log it
> properly.  This would work perfectly in containers and in totally
> unprivileged applications like Chromium.

Hm, that is a decent point but that's also a non-essential feature. I
also wonder if there's any reason to not simply extend it to use ebpf
later if seccomp ever uses it?

Christian

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
  2018-03-15 17:25               ` Christian Brauner
@ 2018-03-15 17:30                   ` Andy Lutomirski
  -1 siblings, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2018-03-15 17:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Linux Containers, Akihiro Suda, LKML, Oleg Nesterov,
	Eric W . Biederman, Andy Lutomirski, Christian Brauner,
	Tyler Hicks

On Thu, Mar 15, 2018 at 5:25 PM, Christian Brauner
<christian.brauner-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote:
> On Thu, Mar 15, 2018 at 05:11:32PM +0000, Andy Lutomirski wrote:
>> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>> > Quoting Andy Lutomirski (luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
>> >> On Thu, Mar 15, 2018 at 4:09 PM, Christian Brauner
>> >> <christian.brauner-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote:
>> >> > On Sun, Feb 04, 2018 at 11:49:43AM +0100, Tycho Andersen wrote:
>> >> >> Several months ago at Linux Plumber's, we had a discussion about adding a
>> >> >> feature to seccomp which would allow seccomp to trigger a notification for some
>> >> >> other process. Here's a draft of that feature.
>> >> >>
>> >> >> Patch 1 contains the bulk of it, patches 2 & 3 offer an alternative way to
>> >> >> acquire the fd that receives notifications via ptrace (the method in patch 1
>> >> >> poses some problems). Other suggestions for how to acquire one of these fds
>> >> >> would be welcome.
>> >> >>
>> >> >> Take a close look at the synchronization. I think I've got it right, but I
>> >> >> probably don't :)
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> Tycho Andersen (3):
>> >> >>   seccomp: add a return code to trap to userspace
>> >> >>   seccomp: hoist out filter resolving logic
>> >> >>   seccomp: add a way to get a listener fd from ptrace
>> >> >>
>> >> >>  arch/Kconfig                                  |   7 +
>> >> >>  include/linux/seccomp.h                       |  14 +-
>> >> >>  include/uapi/linux/ptrace.h                   |   1 +
>> >> >>  include/uapi/linux/seccomp.h                  |  18 +-
>> >> >>  kernel/ptrace.c                               |   4 +
>> >> >>  kernel/seccomp.c                              | 467 ++++++++++++++++++++++++--
>> >> >>  tools/testing/selftests/seccomp/seccomp_bpf.c | 180 +++++++++-
>> >> >>  7 files changed, 653 insertions(+), 38 deletions(-)
>> >> >
>> >> > Hey,
>> >> >
>> >> > So, I've been following the discussion silently in the background and I
>> >> > see that it got sidetracked into seccomp + ebpf. While I can see that
>> >> > there is value in adding ebpf support to seccomp I'd really like to see
>> >> > this decoupled from this patchset. Afaict, this patchset would just work
>> >> > fine without the ebpf portion (but I might just have missed the
>> >> > point). So if possible I would like to see a second version of this with
>> >> > the comments accounted for and - if possible - have this up for merging
>> >> > independent of the ebpf patchset that's floating around.
>> >> >
>> >>
>> >> The issue is that it might be (and, then again, might not be) nicer to
>> >> *synchronously* call out to the monitor in the filter.  eBPF can do
>> >> that very cleanly, whereas classic BPF can't.
>> >
>> > Hm, synchronously - that brings to mind a thought...  I should re-look at
>> > Tycho's patches first, but, if I'm in a container, start some syscall that
>> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
>> > the handler be interrupted and have it return -EINTR.  Is that going to
>> > be possible with the synchronous approach?
>>
>> I think so, but it should be possible with the classic async approach
>> too.  The main issue is the difference between a classic filter like
>> this (pseudocode):
>>
>> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
>>
>> and the eBPF variant:
>>
>> if (nr == SYS_mount) trap_to_userspace();
>>
>> I admit that it's still not 100% clear to me that the latter is
>> genuinely more useful than the former.
>
> We've just discussed this on irc and the fact that most problems can be
> addressed by interfaces we already have makes it questionable what ebpf
> brings to the game here. Especially since the discussion gave the
> impression that if ebpf ever makes it to seccomp it will basically be
> because it allows a nice implementation of the trap to userspace. If
> it's even unclear whether it is really the better choice for this task
> then we could consider not trying to make this patchset use it. (I
> probably sound way more polemic than I intend to.)
>

No argument from me.

To be clear, I don't think that trap to userspace should block on eBPF
at all.  It was just a coincidence that both patches showed up around
the same time, and if they actually work well together, then it could
make sense to combine them.  But if trap to userspace ends up working
perfectly without eBPF, that's fine too.
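
For reference, the classic-BPF form of the filter quoted above
("if (nr == SYS_mount) return TRAP_TO_USERSPACE;") could be sketched
roughly as below.  This is an illustration only, not code from the
patchset: build_mount_filter() and MOUNT_FILTER_LEN are made-up names,
the trap action is passed in as a parameter because the actual return
code is still under discussion, and a real filter would also check
seccomp_data.arch before looking at the syscall number.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Hypothetical helper: number of instructions in the sketch filter. */
#define MOUNT_FILTER_LEN 4

/*
 * Fill 'out' with a classic BPF program that returns 'trap_action' for
 * mount(2) and SECCOMP_RET_ALLOW for every other syscall.
 * 'trap_action' stands in for whatever "trap to userspace" return code
 * the series ends up defining (an assumption, not a real constant).
 */
static void build_mount_filter(struct sock_filter out[MOUNT_FILTER_LEN],
			       uint32_t trap_action)
{
	/* A = seccomp_data.nr */
	out[0] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			offsetof(struct seccomp_data, nr));
	/* if (A == __NR_mount) skip one instruction, else fall through */
	out[1] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
			__NR_mount, 1, 0);
	/* not mount: allow */
	out[2] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
			SECCOMP_RET_ALLOW);
	/* mount: hand the decision to the listener */
	out[3] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, trap_action);
}
```

The point of the sketch is that the classic filter can only *return* a
value; whatever talks to the listener has to happen in seccomp itself
after the filter has run, which is the asymmetry with the eBPF variant.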

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
       [not found]           ` <CALCETrXPcCNbpFJhXktkVS9gOPpmnU_bbY6Z8RrsBarq0dP4Lg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-03-15 17:25               ` Christian Brauner
@ 2018-03-15 17:35             ` Tycho Andersen
  1 sibling, 0 replies; 59+ messages in thread
From: Tycho Andersen @ 2018-03-15 17:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Christian Brauner, Eric W . Biederman, Christian Brauner,
	Tyler Hicks, Alexei Starovoitov

Hi Andy,

On Thu, Mar 15, 2018 at 05:11:32PM +0000, Andy Lutomirski wrote:
> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> > Hm, synchronously - that brings to mind a thought...  I should re-look at
> > Tycho's patches first, but, if I'm in a container, start some syscall that
> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> > the handler be interrupted and have it return -EINTR.  Is that going to
> > be possible with the synchronous approach?
> 
> I think so, but it should be possible with the classic async approach
> too.  The main issue is the difference between a classic filter like
> this (pseudocode):
> 
> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
> 
> and the eBPF variant:
> 
> if (nr == SYS_mount) trap_to_userspace();

Sargun started a private design discussion thread that I don't think
you were on, but Alexei said something to the effect of "eBPF programs
will never wait on userspace", so I'm not sure we can do something
like this in an eBPF program. I'm cc-ing him here again to confirm,
but I doubt things have changed.

> I admit that it's still not 100% clear to me that the latter is
> genuinely more useful than the former.
> 
> The case where I think the synchronous function call is a huge win is this one:
> 
> if (nr  == SYS_mount) {
>   log("Someone called mount with args %lx\n", ...);
>   return RET_KILL;
> }
> 
> The idea being that the log message wouldn't show up in the kernel log
> -- it would get sent to the listener socket belonging to whoever
> created the filter, and that process could then go and log it
> properly.  This would work perfectly in containers and in totally
> unprivileged applications like Chromium.

The current implementation can't do exactly this, but you could do:

if (nr == SYS_mount) {
    log(...);
    kill(pid, SIGKILL);
}

from the handler instead.

I guess Serge is asking a slightly different question: what if the
task gets e.g. SIGINT from the user doing a ^C or SIGALRM or
something, we should probably send the handler some sort of message or
interrupt to let it know that the syscall was cancelled. Right now the
current set doesn't behave that way, and the handler will just
continue on its merry way and get an EINVAL when it tries to respond
with the cancelled cookie.

Anyway, I think these last two points can be addressed with the
approach from this series. The notification to the handler about a
cancelled syscall might be slightly awkward, but I'll take a look.

Cheers,

Tycho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
  2018-03-15 17:35           ` Tycho Andersen
@ 2018-03-16  0:46               ` Andy Lutomirski
  0 siblings, 0 replies; 59+ messages in thread
From: Andy Lutomirski @ 2018-03-16  0:46 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, Linux Containers, Akihiro Suda, LKML, Oleg Nesterov,
	Christian Brauner, Eric W . Biederman, Andy Lutomirski,
	Christian Brauner, Tyler Hicks, Alexei Starovoitov

On Thu, Mar 15, 2018 at 5:35 PM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> Hi Andy,
>
> On Thu, Mar 15, 2018 at 05:11:32PM +0000, Andy Lutomirski wrote:
>> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>> > Hm, synchronously - that brings to mind a thought...  I should re-look at
>> > Tycho's patches first, but, if I'm in a container, start some syscall that
>> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
>> > the handler be interrupted and have it return -EINTR.  Is that going to
>> > be possible with the synchronous approach?
>>
>> I think so, but it should be possible with the classic async approach
>> too.  The main issue is the difference between a classic filter like
>> this (pseudocode):
>>
>> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
>>
>> and the eBPF variant:
>>
>> if (nr == SYS_mount) trap_to_userspace();
>
> Sargun started a private design discussion thread that I don't think
> you were on, but Alexei said something to the effect of "eBPF programs
> will never wait on userspace", so I'm not sure we can do something
> like this in an eBPF program. I'm cc-ing him here again to confirm,
> but I doubt things have changed.
>
>> I admit that it's still not 100% clear to me that the latter is
>> genuinely more useful than the former.
>>
>> The case where I think the synchronous function call is a huge win is this one:
>>
>> if (nr  == SYS_mount) {
>>   log("Someone called mount with args %lx\n", ...);
>>   return RET_KILL;
>> }
>>
>> The idea being that the log message wouldn't show up in the kernel log
>> -- it would get sent to the listener socket belonging to whoever
>> created the filter, and that process could then go and log it
>> properly.  This would work perfectly in containers and in totally
>> unprivileged applications like Chromium.
>
> The current implementation can't do exactly this, but you could do:
>
> if (nr == SYS_mount) {
>     log(...);
>     kill(pid, SIGKILL);
> }
>
> from the handler instead.
>
> I guess Serge is asking a slightly different question: what if the
> task gets e.g. SIGINT from the user doing a ^C or SIGALRM or
> something, we should probably send the handler some sort of message or
> interrupt to let it know that the syscall was cancelled. Right now the
> current set doesn't behave that way, and the handler will just
> continue on its merry way and get an EINVAL when it tries to respond
> with the cancelled cookie.

Hmm, I think we have to be very careful to avoid nasty races.  I think
the correct approach is to notice the signal and send a message to the
listener that a signal is pending but to take no additional action.
If the handler ends up completing the syscall with a successful
return, we don't want to replace it with -EINTR.  IOW the code looks
kind of like:

send_to_listener("hey I got a signal");
wait_ret = wait_interruptible for the listener to reply;
if (wait_ret == -EINTR) {
  send_to_listener("hey there's a signal");
  wait_ret = wait_killable for the listener to reply to the original request;
}

if (wait_ret == -EINTR) {
  /* hmm, this next line might not actually be necessary, but it's
     harmless and possibly useful */
  send_to_listener("hey we're going away");
  /* and stop waiting */
}

... actually handle the result.
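
To make that control flow concrete, here is a small userspace model of
it.  Everything in it is an assumption for illustration: listener_ops,
trap_and_wait() and the "script" stand-in are made-up names, the first
message is taken to be the original syscall request, and the real
kernel code would use the wait_event_*() helpers (which report
-ERESTARTSYS rather than -EINTR) instead of callbacks.

```c
#include <errno.h>

enum wait_mode { WAIT_INTERRUPTIBLE, WAIT_KILLABLE };

/* Hypothetical hooks standing in for the notification channel. */
struct listener_ops {
	void (*send)(void *ctx, const char *msg);
	/* 0 once the listener replied, -EINTR if the wait was interrupted */
	int (*wait_reply)(void *ctx, enum wait_mode mode);
};

/*
 * Returns 0 if the listener replied to the original request (its result
 * is then used as-is, even if a signal arrived in between), or -EINTR
 * if the task is being killed and gives up waiting.
 */
static int trap_and_wait(const struct listener_ops *ops, void *ctx)
{
	int ret;

	ops->send(ctx, "syscall request");
	ret = ops->wait_reply(ctx, WAIT_INTERRUPTIBLE);
	if (ret == -EINTR) {
		/* Tell the listener, but keep waiting for its answer. */
		ops->send(ctx, "signal pending");
		ret = ops->wait_reply(ctx, WAIT_KILLABLE);
	}
	if (ret == -EINTR)
		ops->send(ctx, "going away");
	return ret;
}

/* Scripted stand-in for a listener, only for exercising the flow. */
struct script {
	int results[2];	/* values returned by successive waits */
	int waits;	/* number of wait_reply calls so far */
	int sends;	/* number of messages sent so far */
};

static void script_send(void *ctx, const char *msg)
{
	(void)msg;
	((struct script *)ctx)->sends++;
}

static int script_wait(void *ctx, enum wait_mode mode)
{
	struct script *s = ctx;

	(void)mode;
	return s->results[s->waits++];
}
```

With the scripted listener, a reply on the first wait sends one
message, a signal followed by a reply sends two and still returns the
listener's result, and only a fatal signal sends the third message and
returns -EINTR.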

--Andy

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
  2018-03-16  0:46               ` Andy Lutomirski
@ 2018-03-16 14:47                   ` Christian Brauner
  -1 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2018-03-16 14:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Christian Brauner, Eric W . Biederman, Christian Brauner,
	Tyler Hicks, Alexei Starovoitov

On Fri, Mar 16, 2018 at 12:46:55AM +0000, Andy Lutomirski wrote:
> On Thu, Mar 15, 2018 at 5:35 PM, Tycho Andersen <tycho-E0fblnxP3wo@public.gmane.org> wrote:
> > Hi Andy,
> >
> > On Thu, Mar 15, 2018 at 05:11:32PM +0000, Andy Lutomirski wrote:
> >> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> >> > Hm, synchronously - that brings to mind a thought...  I should re-look at
> >> > Tycho's patches first, but, if I'm in a container, start some syscall that
> >> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> >> > the handler be interrupted and have it return -EINTR.  Is that going to
> >> > be possible with the synchronous approach?
> >>
> >> I think so, but it should be possible with the classic async approach
> >> too.  The main issue is the difference between a classic filter like
> >> this (pseudocode):
> >>
> >> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
> >>
> >> and the eBPF variant:
> >>
> >> if (nr == SYS_mount) trap_to_userspace();
> >
> > Sargun started a private design discussion thread that I don't think
> > you were on, but Alexei said something to the effect of "eBPF programs
> > will never wait on userspace", so I'm not sure we can do something
> > like this in an eBPF program. I'm cc-ing him here again to confirm,
> > but I doubt things have changed.
> >
> >> I admit that it's still not 100% clear to me that the latter is
> >> genuinely more useful than the former.
> >>
> >> The case where I think the synchronous function call is a huge win is this one:
> >>
> >> if (nr  == SYS_mount) {
> >>   log("Someone called mount with args %lx\n", ...);
> >>   return RET_KILL;
> >> }
> >>
> >> The idea being that the log message wouldn't show up in the kernel log
> >> -- it would get sent to the listener socket belonging to whoever
> >> created the filter, and that process could then go and log it
> >> properly.  This would work perfectly in containers and in totally
> >> unprivileged applications like Chromium.
> >
> > The current implementation can't do exactly this, but you could do:
> >
> > if (nr == SYS_mount) {
> >     log(...);
> >     kill(pid, SIGKILL);
> > }
> >
> > from the handler instead.
> >
> > I guess Serge is asking a slightly different question: what if the
> > > task gets e.g. SIGINT from the user doing a ^C or SIGALRM or
> > something, we should probably send the handler some sort of message or
> > interrupt to let it know that the syscall was cancelled. Right now the
> > current set doesn't behave that way, and the handler will just
> > continue on its merry way and get an EINVAL when it tries to respond
> > with the cancelled cookie.
> 
> Hmm, I think we have to be very careful to avoid nasty races.  I think
> the correct approach is to notice the signal and send a message to the
> listener that a signal is pending but to take no additional action.
> If the handler ends up completing the syscall with a successful
> return, we don't want to replace it with -EINTR.  IOW the code looks
> kind of like:
> 
> send_to_listener("hey I got a signal");
> wait_ret = wait_interruptible for the listener to reply;
> if (wait_ret == -EINTR) {

Hm, so from the pseudo-code it looks like: The handler would inform the
listener that it received a signal (either from the syscall requester or
from somewhere else) and then wait for the listener to reply to that
message.  This would allow the listener to decide what action it wants
the handler to take based on the signal, i.e. either cancel the request
or retry?  The comment, however, makes it sound like the handler doesn't
really wait on the listener when it receives a signal; it simply moves
on.
So does "taking no additional action" here mean that the listener, not
the handler, decides whether to abort?

Sorry if I'm being dense.

Christian

>   send_to_listener("hey there's a signal");
>   wait_ret = wait_killable for the listener to reply to the original request;
> }
> 
> if (wait_ret == -EINTR) {
>   /* hmm, this next line might not actually be necessary, but it's
> harmless and possibly useful */
>   send_to_listener("hey we're going away");
>   /* and stop waiting */
> }
> 
> ... actually handle the result.
> 
> --Andy

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC 0/3] seccomp trap to userspace
@ 2018-03-16 14:47                   ` Christian Brauner
  0 siblings, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2018-03-16 14:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tycho Andersen, Kees Cook, Linux Containers, Akihiro Suda, LKML,
	Oleg Nesterov, Christian Brauner, Eric W . Biederman,
	Christian Brauner, Tyler Hicks, Alexei Starovoitov

On Fri, Mar 16, 2018 at 12:46:55AM +0000, Andy Lutomirski wrote:
> On Thu, Mar 15, 2018 at 5:35 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > Hi Andy,
> >
> > On Thu, Mar 15, 2018 at 05:11:32PM +0000, Andy Lutomirski wrote:
> >> On Thu, Mar 15, 2018 at 5:05 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> >> > Hm, synchronously - that brings to mind a thought...  I should re-look at
> >> > Tycho's patches first, but, if I'm in a container, start some syscall that
> >> > gets trapped to userspace, then I hit ctrl-c.  I'd like to be able to have
> >> > the handler be interrupted and have it return -EINTR.  Is that going to
> >> > be possible with the synchronous approach?
> >>
> >> I think so, but it should be possible with the classic async approach
> >> too.  The main issue is the difference between a classic filter like
> >> this (pseudocode):
> >>
> >> if (nr == SYS_mount) return TRAP_TO_USERSPACE;
> >>
> >> and the eBPF variant:
> >>
> >> if (nr == SYS_mount) trap_to_userspace();
> >
> > Sargun started a private design discussion thread that I don't think
> > you were on, but Alexei said something to the effect of "eBPF programs
> > will never wait on userspace", so I'm not sure we can do something
> > like this in an eBPF program. I'm cc-ing him here again to confirm,
> > but I doubt things have changed.
> >
> >> I admit that it's still not 100% clear to me that the latter is
> >> genuinely more useful than the former.
> >>
> >> The case where I think the synchronous function call is a huge win is this one:
> >>
> >> if (nr  == SYS_mount) {
> >>   log("Someone called mount with args %lx\n", ...);
> >>   return RET_KILL;
> >> }
> >>
> >> The idea being that the log message wouldn't show up in the kernel log
> >> -- it would get sent to the listener socket belonging to whoever
> >> created the filter, and that process could then go and log it
> >> properly.  This would work perfectly in containers and in totally
> >> unprivileged applications like Chromium.
> >
> > The current implementation can't do exactly this, but you could do:
> >
> > if (nr == SYS_mount) {
> >     log(...);
> >     kill(pid, SIGKILL);
> > }
> >
> > from the handler instead.
> >
> > I guess Serge is asking a slightly different question: what if the
> > task gets e.g. SIGINT from the user doing a ^C or SIGALRM or
> > something, we should probably send the handler some sort of message or
> > interrupt to let it know that the syscall was cancelled. Right now the
> > current set doesn't behave that way, and the handler will just
> > continue on its merry way and get an EINVAL when it tries to respond
> > with the cancelled cookie.
> 
> Hmm, I think we have to be very careful to avoid nasty races.  I think
> the correct approach is to notice the signal and send a message to the
> listener that a signal is pending but to take no additional action.
> If the handler ends up completing the syscall with a successful
> return, we don't want to replace it with -EINTR.  IOW the code looks
> kind of like:
> 
> send_to_listener("hey I got a signal");
> wait_ret = wait_interruptible for the listener to reply;
> if (wait_ret == -EINTR) {

Hm, so from the pseudo-code it looks like: The handler would inform the
listener that it received a signal (either from the syscall requester or
from somewhere else) and then wait for the listener to reply to that
message.  This would allow the listener to decide what action it wants
the handler to take based on the signal, i.e. either cancel the request
or retry?  The comment makes it sound like the handler doesn't
really wait on the listener when it receives a signal; it simply moves
on.
So "taking no additional action" here means that the listener, not the
handler, decides whether to abort?

Sorry if I'm being dense.

Christian

>   send_to_listener("hey there's a signal");
>   wait_ret = wait_killable for the listener to reply to the original request;
> }
> 
> if (wait_ret == -EINTR) {
>   /* hmm, this next line might not actually be necessary, but it's
> harmless and possibly useful */
>   send_to_listener("hey we're going away");
>   /* and stop waiting */
> }
> 
> ... actually handle the result.
> 
> --Andy

* Re: [RFC 0/3] seccomp trap to userspace
  2018-03-16 14:47                   ` Christian Brauner
  (?)
@ 2018-03-16 16:01                   ` Andy Lutomirski
       [not found]                     ` <D73E5C37-DC92-4D58-A163-0B20143AAEEB-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
  2018-03-16 16:40                     ` Christian Brauner
  -1 siblings, 2 replies; 59+ messages in thread
From: Andy Lutomirski @ 2018-03-16 16:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Tycho Andersen, Kees Cook, Linux Containers,
	Akihiro Suda, LKML, Oleg Nesterov, Christian Brauner,
	Eric W . Biederman, Christian Brauner, Tyler Hicks,
	Alexei Starovoitov



> On Mar 16, 2018, at 7:47 AM, Christian Brauner <christian.brauner@mailbox.org> wrote:
> 
>> On Fri, Mar 16, 2018 at 12:46:55AM +0000, Andy Lutomirski wrote:


I bet I confused everyone with a blatant typo:

>> 
>> Hmm, I think we have to be very careful to avoid nasty races.  I think
>> the correct approach is to notice the signal and send a message to the
>> listener that a signal is pending but to take no additional action.
>> If the handler ends up completing the syscall with a successful
>> return, we don't want to replace it with -EINTR.  IOW the code looks
>> kind of like:
>> 
>> send_to_listener("hey I got a signal");

That should be “hey I got a syscall”.   D’oh!

>> wait_ret = wait_interruptible for the listener to reply;
>> if (wait_ret == -EINTR) {
> 
> Hm, so from the pseudo-code it looks like: The handler would inform the
> listener that it received a signal (either from the syscall requester or
> from somewhere else) and then wait for the listener to reply to that
> message.  This would allow the listener to decide what action it wants
> the handler to take based on the signal, i.e. either cancel the request
> or retry?  The comment makes it sound like the handler doesn't
> really wait on the listener when it receives a signal; it simply moves
> on.

It keeps waiting killably but not interruptibly. 
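
To make the two-phase wait concrete, here is a minimal userspace model in C. It is only a sketch under stated assumptions: trap_wait(), enum wait_event and struct wait_result are hypothetical names invented for illustration (nothing in the patch set), and the event array stands in for what the kernel would observe while the task sleeps. The -EAGAIN value is purely a model artifact meaning "still waiting".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical events the trapped task can observe while it waits. */
enum wait_event {
	EV_LISTENER_REPLY,	/* listener answered the original request */
	EV_SIGNAL,		/* a non-fatal signal became pending */
	EV_FATAL_SIGNAL,	/* a fatal signal, e.g. SIGKILL */
};

struct wait_result {
	int ret;		/* 0: listener replied, -EINTR: task is dying */
	int signal_notices;	/* "hey there's a signal" messages sent */
	int goodbye_notices;	/* "hey we're going away" messages sent */
};

/*
 * Model of the pseudocode: wait interruptibly first; on the first signal,
 * tell the listener once and fall back to a killable wait, so that later
 * non-fatal signals no longer wake the task.
 */
static struct wait_result trap_wait(const enum wait_event *ev, size_t n)
{
	struct wait_result r = { .ret = -EAGAIN };	/* model only: still waiting */
	int interruptible = 1;

	assert(ev != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (ev[i] == EV_LISTENER_REPLY) {
			r.ret = 0;	/* the listener's answer wins, no -EINTR */
			break;
		}
		if (interruptible) {
			/* First signal of any kind ends the interruptible wait. */
			r.signal_notices++;
			interruptible = 0;
		}
		if (ev[i] == EV_FATAL_SIGNAL) {
			/* The killable wait is broken only by a fatal signal. */
			r.goodbye_notices++;
			r.ret = -EINTR;
			break;
		}
	}
	return r;
}
```

In this model, the sequence EV_SIGNAL, EV_SIGNAL, EV_LISTENER_REPLY produces exactly one signal notice and a result of 0: the second signal is ignored because the wait is already killable-only, and the listener's reply is what completes the syscall.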

> So "taking no additional action" here means that the listener, not the
> handler, decides whether to abort?

If by “handler” you mean kernel, then yes. 

There’s no userspace syscall handler involved. From the kernel’s perspective, a syscall is never still in progress when a signal handler is invoked: we only actually invoke signal handlers in prepare_exit_to_usermode() (or the non-x86 equivalent) and the functions it calls. While a syscall is running, the kernel might notice that a signal is pending and do one of a few things:

1. Just keep going. Not all syscalls can be interrupted. 

2. Try to finish early. If a send() call has already sent some but not all data, it can stop waiting and return the number of bytes sent.

3. Abort with -EINTR.

4. Abort with -ERESTARTSYS or one of its relatives. These fiddle with user registers in a somewhat unpleasant way to pretend that the syscall never actually happened.  This works for syscalls that wait with an absolute timeout, for example. 

5. Set up restart_syscall() magic, rewrite regs so it looks like the user was about to call restart_syscall() when the signal happened, and abort. 

In all cases, the signal is dealt with afterwards. This could result in changing regs to call the handler or in simply returning. 

1-3 should work fully in seccomp. The only issue is that the kernel doesn’t know *which* to do, nor can the kernel force the listener to abort cleanly, so I think we have no real choice but to let the listener decide.

4 could be supported just like 1-3. 5 is awful, and I don’t think we should support it for user listeners. 
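
As a rough illustration of what "let the listener decide" could look like for options 1-4, here is a small C sketch using the send() example from option 2. Everything in it is an assumption made for illustration: enum listener_choice and signalled_send_result() are hypothetical names, MODEL_ERESTARTSYS just mirrors the kernel-internal value 512 (which userspace <errno.h> does not define), and returning -EINTR when nothing has been sent yet is a simplification. Option 5 is deliberately left out.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal ERESTARTSYS value, mirrored here for the model only. */
#define MODEL_ERESTARTSYS 512

/* Hypothetical choices a listener could make once told a signal is pending. */
enum listener_choice {
	LISTENER_KEEP_GOING,	/* option 1: finish the whole request */
	LISTENER_FINISH_EARLY,	/* option 2: report the partial progress */
	LISTENER_ABORT_EINTR,	/* option 3: abort with -EINTR */
	LISTENER_ABORT_RESTART,	/* option 4: abort, syscall gets restarted */
};

/*
 * Result the trapped task would see for a send()-like request of
 * `requested` bytes when the signal arrived after `sent_so_far` bytes.
 */
static long signalled_send_result(enum listener_choice c,
				  long requested, long sent_so_far)
{
	assert(0 <= sent_so_far && sent_so_far <= requested);

	switch (c) {
	case LISTENER_KEEP_GOING:
		return requested;
	case LISTENER_FINISH_EARLY:
		/* Simplification: with no progress there is nothing to report. */
		return sent_so_far > 0 ? sent_so_far : -EINTR;
	case LISTENER_ABORT_EINTR:
		return -EINTR;
	case LISTENER_ABORT_RESTART:
		return -MODEL_ERESTARTSYS;
	}
	return -EINVAL;	/* unknown choice */
}
```

The point of the sketch is only that the kernel side stays the same in all four cases: it relays whatever the listener answers, and signal delivery happens afterwards.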


* Re: [RFC 0/3] seccomp trap to userspace
  2018-03-16 16:01                   ` Andy Lutomirski
       [not found]                     ` <D73E5C37-DC92-4D58-A163-0B20143AAEEB-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
@ 2018-03-16 16:40                     ` Christian Brauner
  1 sibling, 0 replies; 59+ messages in thread
From: Christian Brauner @ 2018-03-16 16:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Tycho Andersen, Kees Cook, Linux Containers,
	Akihiro Suda, LKML, Oleg Nesterov, Christian Brauner,
	Eric W . Biederman, Christian Brauner, Tyler Hicks,
	Alexei Starovoitov

On Fri, Mar 16, 2018 at 09:01:47AM -0700, Andy Lutomirski wrote:
> 
> 
> > On Mar 16, 2018, at 7:47 AM, Christian Brauner <christian.brauner@mailbox.org> wrote:
> > 
> >> On Fri, Mar 16, 2018 at 12:46:55AM +0000, Andy Lutomirski wrote:
> 
> 
> I bet I confused everyone with a blatant typo:
> 
> >> 
> >> Hmm, I think we have to be very careful to avoid nasty races.  I think
> >> the correct approach is to notice the signal and send a message to the
> >> listener that a signal is pending but to take no additional action.
> >> If the handler ends up completing the syscall with a successful
> >> return, we don't want to replace it with -EINTR.  IOW the code looks
> >> kind of like:
> >> 
> >> send_to_listener("hey I got a signal");
> 
> That should be “hey I got a syscall”.   D’oh!

Ha ok, that's what led me to believe that listener != handler and I was
trying to make sense of this. :)

Thanks!
Christian

> 
> >> wait_ret = wait_interruptible for the listener to reply;
> >> if (wait_ret == -EINTR) {
> > 
> > Hm, so from the pseudo-code it looks like: The handler would inform the
> > listener that it received a signal (either from the syscall requester or
> > from somewhere else) and then wait for the listener to reply to that
> > message.  This would allow the listener to decide what action it wants
> > the handler to take based on the signal, i.e. either cancel the request
> > or retry?  The comment makes it sound like the handler doesn't
> > really wait on the listener when it receives a signal; it simply moves
> > on.
> 
> It keeps waiting killably but not interruptibly. 
> 
> > So "taking no additional action" here means that the listener, not the
> > handler, decides whether to abort?
> 
> If by “handler” you mean kernel, then yes. 
> 
> There’s no userspace syscall handler involved. From the kernel’s perspective, a syscall is never still in progress when a signal handler is invoked: we only actually invoke signal handlers in prepare_exit_to_usermode() (or the non-x86 equivalent) and the functions it calls. While a syscall is running, the kernel might notice that a signal is pending and do one of a few things:
> 
> 1. Just keep going. Not all syscalls can be interrupted. 
> 
> 2. Try to finish early. If a send() call has already sent some but not all data, it can stop waiting and return the number of bytes sent.
> 
> 3. Abort with -EINTR.
> 
> 4. Abort with -ERESTARTSYS or one of its relatives. These fiddle with user registers in a somewhat unpleasant way to pretend that the syscall never actually happened.  This works for syscalls that wait with an absolute timeout, for example. 
> 
> 5. Set up restart_syscall() magic, rewrite regs so it looks like the user was about to call restart_syscall() when the signal happened, and abort. 
> 
> In all cases, the signal is dealt with afterwards. This could result in changing regs to call the handler or in simply returning. 
> 
> 1-3 should work fully in seccomp. The only issue is that the kernel doesn’t know *which* to do, nor can the kernel force the listener to abort cleanly, so I think we have no real choice but to let the listener decide.
> 
> 4 could be supported just like 1-3. 5 is awful, and I don’t think we should support it for user listeners. 


end of thread, other threads:[~2018-03-16 16:40 UTC | newest]

Thread overview: 59+ messages
2018-02-04 10:49 [RFC 0/3] seccomp trap to userspace Tycho Andersen
2018-02-04 10:49 ` [RFC 1/3] seccomp: add a return code to " Tycho Andersen
2018-02-13 21:09   ` Kees Cook
     [not found]     ` <CAGXu5jLAAKY19a9iC1PmXRyuwdn1Zxr2Cb318zdzkqgYt8vtdg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-14 15:29       ` Tycho Andersen
2018-02-14 15:29         ` Tycho Andersen
2018-02-14 17:19         ` Andy Lutomirski
2018-02-14 17:23           ` Tycho Andersen
2018-02-15 14:48           ` Christian Brauner
     [not found]           ` <CALCETrXeZZfVzXh7SwKhyB=+ySDk5fhrrdrXrcABsQ=JpQT7Tg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-14 17:23             ` Tycho Andersen
2018-02-15 14:48             ` Christian Brauner
2018-02-27  0:49             ` Kees Cook
2018-02-27  0:49           ` Kees Cook
     [not found]             ` <CAGXu5jKBmej+fXhEc+Jy7Guy+vXEZkHnc=4LNm1NNEsc1=DFVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-27  3:27               ` Andy Lutomirski
2018-02-27  3:27                 ` Andy Lutomirski
2018-02-14 17:19         ` Andy Lutomirski
     [not found]   ` <20180204104946.25559-2-tycho-E0fblnxP3wo@public.gmane.org>
2018-02-04 17:36     ` Andy Lutomirski
2018-02-04 17:36       ` Andy Lutomirski
     [not found]       ` <CALCETrWgu5n+SMqrsZQ7MVYPtzs8otuc7hpA5uPH+JNtFrMBkQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-04 20:01         ` Tycho Andersen
2018-02-04 20:01           ` Tycho Andersen
2018-02-04 20:33           ` Andy Lutomirski
     [not found]             ` <CALCETrV81yr_zhuBbCTE8NgYx42oq=qvP=nLMsST0iS2wtOZng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-05  8:47               ` Tycho Andersen
2018-02-05  8:47             ` Tycho Andersen
2018-02-04 20:33           ` Andy Lutomirski
2018-02-13 21:09     ` Kees Cook
2018-02-04 10:49 ` [RFC 2/3] seccomp: hoist out filter resolving logic Tycho Andersen
     [not found]   ` <20180204104946.25559-3-tycho-E0fblnxP3wo@public.gmane.org>
2018-02-13 21:29     ` Kees Cook
2018-02-13 21:29   ` Kees Cook
2018-02-14 15:33     ` Tycho Andersen
2018-02-14 15:33     ` Tycho Andersen
     [not found] ` <20180204104946.25559-1-tycho-E0fblnxP3wo@public.gmane.org>
2018-02-04 10:49   ` [RFC 1/3] seccomp: add a return code to trap to userspace Tycho Andersen
2018-02-04 10:49   ` [RFC 2/3] seccomp: hoist out filter resolving logic Tycho Andersen
2018-02-04 10:49   ` [RFC 3/3] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
2018-03-15 16:09   ` [RFC 0/3] seccomp trap to userspace Christian Brauner
2018-02-04 10:49 ` [RFC 3/3] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
     [not found]   ` <20180204104946.25559-4-tycho-E0fblnxP3wo@public.gmane.org>
2018-02-13 21:32     ` Kees Cook
2018-02-13 21:32       ` Kees Cook
     [not found]       ` <CAGXu5jLS2dzCjZOKa-W4kUdOPoJkRAq5Rsw1t5jX99v34yaoQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-14 15:33         ` Tycho Andersen
2018-02-14 15:33       ` Tycho Andersen
2018-03-15 16:09 ` [RFC 0/3] seccomp trap to userspace Christian Brauner
     [not found]   ` <20180315160924.GA12744-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-03-15 16:56     ` Andy Lutomirski
2018-03-15 16:56       ` Andy Lutomirski
2018-03-15 17:05       ` Serge E. Hallyn
2018-03-15 17:11         ` Andy Lutomirski
2018-03-15 17:35           ` Tycho Andersen
2018-03-16  0:46             ` Andy Lutomirski
2018-03-16  0:46               ` Andy Lutomirski
     [not found]               ` <CALCETrWH7HbY2gS6O_cYKfp9QqqWBWVcHb++GaP3uUiSO9oo6g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-03-16 14:47                 ` Christian Brauner
2018-03-16 14:47                   ` Christian Brauner
2018-03-16 16:01                   ` Andy Lutomirski
     [not found]                     ` <D73E5C37-DC92-4D58-A163-0B20143AAEEB-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
2018-03-16 16:40                       ` Christian Brauner
2018-03-16 16:40                     ` Christian Brauner
     [not found]                   ` <20180316144751.GA3304-cl+VPiYnx/1AfugRpC6u6w@public.gmane.org>
2018-03-16 16:01                     ` Andy Lutomirski
     [not found]           ` <CALCETrXPcCNbpFJhXktkVS9gOPpmnU_bbY6Z8RrsBarq0dP4Lg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-03-15 17:25             ` Christian Brauner
2018-03-15 17:25               ` Christian Brauner
     [not found]               ` <20180315172558.GA28108-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-03-15 17:30                 ` Andy Lutomirski
2018-03-15 17:30                   ` Andy Lutomirski
2018-03-15 17:35             ` Tycho Andersen
     [not found]         ` <20180315170509.GA32766-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2018-03-15 17:11           ` Andy Lutomirski
     [not found]       ` <CALCETrVnvbZLx5v=DMu2N1JtR+ys507X5CYBi-qQnus3VMQdwg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-03-15 17:05         ` Serge E. Hallyn
