* [PATCH v6 0/5] seccomp trap to userspace
@ 2018-09-06 15:28 Tycho Andersen
  2018-09-06 15:28 ` [PATCH v6 1/5] seccomp: add a return code to " Tycho Andersen
                   ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-06 15:28 UTC
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	Tycho Andersen

Hi all,

Here's a v6 of the seccomp trap to userspace series. v5 [1] was fairly
quiet, with Christian reminding me that I had forgotten to update the
docs for the ioctl change. Other than that, there are no changes.

[1]: https://lkml.org/lkml/2018/8/28/590

Thoughts welcome,

Tycho

Tycho Andersen (5):
  seccomp: add a return code to trap to userspace
  seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  seccomp: add a way to get a listener fd from ptrace
  seccomp: add support for passing fds via USER_NOTIF
  samples: add an example of seccomp user trap

 Documentation/ioctl/ioctl-number.txt          |   1 +
 .../userspace-api/seccomp_filter.rst          |  84 +++
 arch/Kconfig                                  |   9 +
 include/linux/seccomp.h                       |  18 +-
 include/uapi/linux/ptrace.h                   |   2 +
 include/uapi/linux/seccomp.h                  |  36 +-
 kernel/ptrace.c                               |   4 +
 kernel/seccomp.c                              | 538 +++++++++++++++-
 samples/seccomp/.gitignore                    |   1 +
 samples/seccomp/Makefile                      |   7 +-
 samples/seccomp/user-trap.c                   | 312 ++++++++++
 tools/testing/selftests/seccomp/seccomp_bpf.c | 587 +++++++++++++++++-
 12 files changed, 1586 insertions(+), 13 deletions(-)
 create mode 100644 samples/seccomp/user-trap.c

-- 
2.17.1



* [PATCH v6 1/5] seccomp: add a return code to trap to userspace
  2018-09-06 15:28 [PATCH v6 0/5] seccomp trap to userspace Tycho Andersen
@ 2018-09-06 15:28 ` Tycho Andersen
  2018-09-06 22:15   ` Tyler Hicks
  2018-09-06 15:28 ` [PATCH v6 2/5] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: Tycho Andersen @ 2018-09-06 15:28 UTC
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	Tycho Andersen

This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The primary motivation for this is use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mknod(), since this checks
capable(CAP_MKNOD). However, harmless devices like /dev/null or
/dev/zero should be ok for containers to mknod, but we'd like to avoid hard
coding some whitelist in the kernel. Another example is mount(), which has
many security restrictions for good reason, but configuration or runtime
knowledge could potentially be used to relax these restrictions.

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since Upstart itself uses ptrace to start services.
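
As a rough illustration of the second approach (not part of this series), a
filter that hands a single syscall to a ptrace tracer might look like the
sketch below. build_trace_filter() is a hypothetical helper name, and the
architecture check that a production filter needs is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: fill prog[0..3] with a classic-BPF seccomp program
 * that returns SECCOMP_RET_TRACE for syscall number nr and
 * SECCOMP_RET_ALLOW for everything else. The caller must provide room for
 * four instructions. Returns the number of instructions written.
 */
static unsigned short build_trace_filter(struct sock_filter *prog, int nr)
{
	const struct sock_filter insns[] = {
		/* A = seccomp_data.nr */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* if (A == nr) run the next insn, else skip one */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};

	assert(prog != NULL);
	memcpy(prog, insns, sizeof(insns));
	return sizeof(insns) / sizeof(insns[0]);
}
```

The tracer would then set PTRACE_O_TRACESECCOMP and wait for
PTRACE_EVENT_SECCOMP stops, which is exactly the single-tracer slot that this
series tries to leave free.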

The actual implementation of this is fairly small, although getting the
synchronization right was/is slightly complex.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.
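
The snapshot-then-decide pattern can be sketched as follows. This is only an
illustration under stated assumptions: the "tracee memory" is an ordinary
in-process buffer standing in for data a real handler would copy out via
ptrace() or /proc, and snapshot_path(), policy_allows() and the /dev/null,
/dev/zero policy are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PATH_SNAPSHOT 256

/*
 * Copy a NUL-terminated string out of (simulated) tracee memory into the
 * handler's own buffer. Returns false if no terminator is found within len
 * bytes, in which case the request should be refused.
 */
static bool snapshot_path(const char *tracee_mem, char *snap, size_t len)
{
	const char *end;

	assert(tracee_mem != NULL && snap != NULL && len > 0);
	end = memchr(tracee_mem, '\0', len);
	if (!end)
		return false;
	memcpy(snap, tracee_mem, (size_t)(end - tracee_mem) + 1);
	return true;
}

/* Hypothetical policy: only these two device paths are acceptable. */
static bool policy_allows(const char *path)
{
	return strcmp(path, "/dev/null") == 0 ||
	       strcmp(path, "/dev/zero") == 0;
}
```

The handler calls policy_allows() on the snapshot only, and later acts on the
snapshot only. If the tracee rewrites its buffer after the copy, the decision
and the action still refer to the same bytes.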

v2: * make id a u64; the idea here being that it will never overflow,
      because 64 bits is huge (one syscall every nanosecond => wrap every
      584 years) (Andy)
    * prevent nesting of user notifications: if someone is already attached
      to the tree in one place, nobody else can attach to the tree (Andy)
    * notify the listener of signals the tracee receives as well (Andy)
    * implement poll
v3: * lockdep fix (Oleg)
    * drop unnecessary WARN()s (Christian)
    * rearrange error returns to be more pretty (Christian)
    * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
v4: * fix implementation of poll to use poll_wait() (Jann)
    * change listener's fd flags to be 0 (Jann)
    * hoist filter initialization out of ifdefs to its own function
      init_user_notification()
    * add some more testing around poll() and closing the listener while a
      syscall is in action
    * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
      creates a new one (Matthew)
    * correctly handle pid namespaces, add some testcases (Matthew)
    * use EINPROGRESS instead of EINVAL when a notification response is
      written twice (Matthew)
    * fix comment typo from older version (SEND vs READ) (Matthew)
    * whitespace and logic simplification (Tobin)
    * add some Documentation/ bits on userspace trapping
v5: * fix documentation typos (Jann)
    * add signalled field to struct seccomp_notif (Jann)
    * switch to using ioctls instead of read()/write() for struct passing
      (Jann)
    * add an ioctl to ensure an id is still valid
v6: * docs typo fixes, update docs for ioctl() change (Christian)
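
For anyone wanting to check the 584-year figure from the v2 notes, here is a
small arithmetic sketch. years_until_u64_wrap() is a name invented for this
example, and a year is taken to be 365 days.

```c
#include <assert.h>
#include <stdint.h>

/* Whole years until a 64-bit counter wraps at the given increment rate. */
static uint64_t years_until_u64_wrap(uint64_t increments_per_sec)
{
	const uint64_t secs_per_year = 365ULL * 24 * 60 * 60;

	assert(increments_per_sec > 0);
	return UINT64_MAX / increments_per_sec / secs_per_year;
}
```

With one increment per nanosecond (a rate of 1000000000 per second) this
comes out to 584 whole years.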

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 Documentation/ioctl/ioctl-number.txt          |   1 +
 .../userspace-api/seccomp_filter.rst          |  73 +++
 arch/Kconfig                                  |   9 +
 include/linux/seccomp.h                       |   7 +-
 include/uapi/linux/seccomp.h                  |  33 +-
 kernel/seccomp.c                              | 453 +++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 403 +++++++++++++++-
 7 files changed, 969 insertions(+), 10 deletions(-)

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 13a7c999c04a..31e9707f7e06 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -345,4 +345,5 @@ Code  Seq#(hex)	Include File		Comments
 					<mailto:raph@8d.com>
 0xF6	all	LTTng			Linux Trace Toolkit Next Generation
 					<mailto:mathieu.desnoyers@efficios.com>
+0xF7    00-1F   uapi/linux/seccomp.h
 0xFD	all	linux/dm-ioctl.h
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index 82a468bc7560..d1498885c1c7 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -122,6 +122,11 @@ In precedence order, they are:
 	Results in the lower 16-bits of the return value being passed
 	to userland as the errno without executing the system call.
 
+``SECCOMP_RET_USER_NOTIF``:
+    Results in a ``struct seccomp_notif`` message sent on the userspace
+    notification fd, if it is attached, or ``-ENOSYS`` if it is not. See below
+    for a discussion of how to handle user notifications.
+
 ``SECCOMP_RET_TRACE``:
 	When returned, this value will cause the kernel to attempt to
 	notify a ``ptrace()``-based tracer prior to executing the system
@@ -183,6 +188,74 @@ The ``samples/seccomp/`` directory contains both an x86-specific example
 and a more generic example of a higher level macro interface for BPF
 program generation.
 
+Userspace Notification
+======================
+
+The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a
+particular syscall to userspace to be handled. This may be useful for
+applications like container managers, which wish to intercept particular
+syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior.
+
+There are currently two APIs to acquire a userspace notification fd for a
+particular filter. First, when the filter is installed, the installing task
+can request one from the ``seccomp()`` syscall:
+
+.. code-block::
+
+    fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
+
+which (on success) will return a listener fd for the filter, which can then be
+passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
+acquired via:
+
+.. code-block::
+
+    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+
+which grabs the 0th filter for some task which the tracer has privilege over.
+Note that filter fds correspond to a particular filter, and not a particular
+task. So if this task then forks, notifications from both tasks will appear on
+the same filter fd. Reads and writes to/from a filter fd are also synchronized,
+so a filter fd can safely have many readers.
+
+The interface for a seccomp notification fd consists of two structures:
+
+.. code-block::
+
+    struct seccomp_notif {
+        __u16 len;
+        __u64 id;
+        __u32 pid;
+        __u8 signalled;
+        struct seccomp_data data;
+    };
+
+    struct seccomp_notif_resp {
+        __u16 len;
+        __u64 id;
+        __s32 error;
+        __s64 val;
+    };
+
+Users can call ``ioctl(SECCOMP_NOTIF_RECV)`` (optionally after ``poll()``) on a
+seccomp notification fd to receive a ``struct seccomp_notif``, which has five
+members: the input length of the structure, a globally unique ``id``, the
+``pid`` of the task which triggered this request (which may be 0 if the task is
+in a pid ns not visible from the listener's pid namespace), a flag representing
+whether or not the notification is a result of a non-fatal signal, and the
+``data`` passed to seccomp. Userspace can then make a decision based on this
+information about what to do, and ``ioctl(SECCOMP_NOTIF_SEND)`` a response,
+indicating what should be returned to userspace. The ``id`` member of ``struct
+seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.
+
+It is worth noting that ``struct seccomp_data`` contains the values of register
+arguments to the syscall, but does not contain pointers to memory. The task's
+memory is accessible to suitably privileged tracers via ``ptrace()`` or
+``/proc/pid/map_files/``. However, care should be taken to avoid the TOCTOU
+mentioned above in this document: all arguments being read from the tracee's
+memory should be read into the tracer's memory before any policy decisions are
+made. This allows for an atomic decision on syscall arguments.
+
 Sysctls
 =======
 
diff --git a/arch/Kconfig b/arch/Kconfig
index 6801123932a5..42f3585d925d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -419,6 +419,15 @@ config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+config SECCOMP_USER_NOTIFICATION
+	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
+	depends on SECCOMP_FILTER
+	help
+	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
+	  programs to notify a userspace listener that a particular event happened.
+
+	  See Documentation/userspace-api/seccomp_filter.rst for details.
+
 config HAVE_STACKPROTECTOR
 	bool
 	help
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index e5320f6c8654..017444b5efed 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -4,9 +4,10 @@
 
 #include <uapi/linux/seccomp.h>
 
-#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC	| \
-					 SECCOMP_FILTER_FLAG_LOG	| \
-					 SECCOMP_FILTER_FLAG_SPEC_ALLOW)
+#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
+					 SECCOMP_FILTER_FLAG_LOG | \
+					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
+					 SECCOMP_FILTER_FLAG_NEW_LISTENER)
 
 #ifdef CONFIG_SECCOMP
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 9efc0e73d50b..aa5878972128 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -17,9 +17,10 @@
 #define SECCOMP_GET_ACTION_AVAIL	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC	(1UL << 0)
-#define SECCOMP_FILTER_FLAG_LOG		(1UL << 1)
-#define SECCOMP_FILTER_FLAG_SPEC_ALLOW	(1UL << 2)
+#define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
+#define SECCOMP_FILTER_FLAG_LOG			(1UL << 1)
+#define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
 
 /*
  * All BPF programs must return a 32-bit value.
@@ -35,6 +36,7 @@
 #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
 #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
 #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
 #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
 #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
 #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
@@ -60,4 +62,29 @@ struct seccomp_data {
 	__u64 args[6];
 };
 
+struct seccomp_notif {
+	__u16 len;
+	__u64 id;
+	__u32 pid;
+	__u8 signalled;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u16 len;
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+
+#define SECCOMP_IOC_MAGIC		0xF7
+
+/* Flags for seccomp notification fd ioctl. */
+#define SECCOMP_NOTIF_RECV		_IOWR(SECCOMP_IOC_MAGIC, 0,	\
+						struct seccomp_notif)
+#define SECCOMP_NOTIF_SEND		_IOWR(SECCOMP_IOC_MAGIC, 1,	\
+						struct seccomp_notif_resp)
+#define SECCOMP_NOTIF_IS_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
+						__u64)
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index fd023ac24e10..a09eb5c05f68 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -33,6 +33,7 @@
 #endif
 
 #ifdef CONFIG_SECCOMP_FILTER
+#include <linux/file.h>
 #include <linux/filter.h>
 #include <linux/pid.h>
 #include <linux/ptrace.h>
@@ -40,6 +41,53 @@
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+	SECCOMP_NOTIFY_INIT,
+	SECCOMP_NOTIFY_SENT,
+	SECCOMP_NOTIFY_REPLIED,
+};
+
+struct seccomp_knotif {
+	/* The struct pid of the task whose filter triggered the notification */
+	struct pid *pid;
+
+	/* The "cookie" for this request; this is unique for this filter. */
+	u64 id;
+
+	/* Whether or not this task has been given an interruptible signal. */
+	bool signalled;
+
+	/*
+	 * The seccomp data. This pointer is valid the entire time this
+	 * notification is active, since it comes from __seccomp_filter which
+	 * eclipses the entire lifecycle here.
+	 */
+	const struct seccomp_data *data;
+
+	/*
+	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
+	 * struct seccomp_knotif is created and starts out in INIT. Once the
+	 * handler reads the notification off of an FD, it transitions to SENT.
+	 * If a signal is received the state transitions back to INIT and
+	 * another message is sent. When the userspace handler replies, state
+	 * transitions to REPLIED.
+	 */
+	enum notify_state state;
+
+	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
+	int error;
+	long val;
+
+	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+	struct completion ready;
+
+	struct list_head list;
+};
+#endif
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -66,6 +114,30 @@ struct seccomp_filter {
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	/*
+	 * A semaphore that users of this notification can wait on for
+	 * changes. Actual reads and writes are still controlled with
+	 * filter->notify_lock.
+	 */
+	struct semaphore request;
+
+	/* A lock for all notification-related accesses. */
+	struct mutex notify_lock;
+
+	/* Is there currently an attached listener? */
+	bool has_listener;
+
+	/* The id of the next request. */
+	u64 next_id;
+
+	/* A list of struct seccomp_knotif elements. */
+	struct list_head notifications;
+
+	/* A wait queue for poll. */
+	wait_queue_head_t wqh;
+#endif
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -359,6 +431,19 @@ static inline void seccomp_sync_threads(unsigned long flags)
 	}
 }
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static void init_user_notification(struct seccomp_filter *sfilter)
+{
+	mutex_init(&sfilter->notify_lock);
+	sema_init(&sfilter->request, 0);
+	INIT_LIST_HEAD(&sfilter->notifications);
+	sfilter->next_id = get_random_u64();
+	init_waitqueue_head(&sfilter->wqh);
+}
+#else
+static inline void init_user_notification(struct seccomp_filter *sfilter) { }
+#endif
+
 /**
  * seccomp_prepare_filter: Prepares a seccomp filter for use.
  * @fprog: BPF program to install
@@ -392,6 +477,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);
 
+	init_user_notification(sfilter);
+
 	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
@@ -556,13 +643,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
 #define SECCOMP_LOG_TRACE		(1 << 4)
 #define SECCOMP_LOG_LOG			(1 << 5)
 #define SECCOMP_LOG_ALLOW		(1 << 6)
+#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
 
 static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
 				    SECCOMP_LOG_KILL_THREAD  |
 				    SECCOMP_LOG_TRAP  |
 				    SECCOMP_LOG_ERRNO |
 				    SECCOMP_LOG_TRACE |
-				    SECCOMP_LOG_LOG;
+				    SECCOMP_LOG_LOG |
+				    SECCOMP_LOG_USER_NOTIF;
 
 static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 			       bool requested)
@@ -581,6 +670,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 	case SECCOMP_RET_TRACE:
 		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
+		break;
 	case SECCOMP_RET_LOG:
 		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
 		break;
@@ -651,6 +743,83 @@ void secure_computing_strict(int this_syscall)
 }
 #else
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
+{
+	/* Note: overflow is ok here, the id just needs to be unique */
+	return filter->next_id++;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	int err;
+	long ret = 0;
+	struct seccomp_knotif n = {};
+
+	mutex_lock(&match->notify_lock);
+	err = -ENOSYS;
+	if (!match->has_listener)
+		goto out;
+
+	n.pid = task_pid(current);
+	n.state = SECCOMP_NOTIFY_INIT;
+	n.data = sd;
+	n.id = seccomp_next_notify_id(match);
+	init_completion(&n.ready);
+
+	list_add(&n.list, &match->notifications);
+	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
+
+	mutex_unlock(&match->notify_lock);
+	up(&match->request);
+
+	err = wait_for_completion_interruptible(&n.ready);
+	mutex_lock(&match->notify_lock);
+
+	/*
+	 * Here it's possible we got a signal and then had to wait on the mutex
+	 * while the reply was sent, so let's be sure there wasn't a response
+	 * in the meantime.
+	 */
+	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
+		/*
+		 * We got a signal. Let's tell userspace about it (potentially
+		 * again, if we had already notified them about the first one).
+		 */
+		n.signalled = true;
+		if (n.state == SECCOMP_NOTIFY_SENT) {
+			n.state = SECCOMP_NOTIFY_INIT;
+			up(&match->request);
+		}
+		mutex_unlock(&match->notify_lock);
+		err = wait_for_completion_killable(&n.ready);
+		mutex_lock(&match->notify_lock);
+		if (err < 0)
+			goto remove_list;
+	}
+
+	ret = n.val;
+	err = n.error;
+
+remove_list:
+	list_del(&n.list);
+out:
+	mutex_unlock(&match->notify_lock);
+	syscall_set_return_value(current, task_pt_regs(current),
+				 err, ret);
+}
+#else
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
+	do_exit(SIGSYS);
+}
+#endif
+
 #ifdef CONFIG_SECCOMP_FILTER
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
@@ -728,6 +897,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 
 		return 0;
 
+	case SECCOMP_RET_USER_NOTIF:
+		seccomp_do_user_notification(this_syscall, match, sd);
+		goto skip;
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
 		return 0;
@@ -834,6 +1006,9 @@ static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+static struct file *init_listener(struct task_struct *,
+				  struct seccomp_filter *);
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
@@ -853,6 +1028,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
+	int listener = 0;
+	struct file *listener_f = NULL;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -863,13 +1040,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
+	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
+		listener = get_unused_fd_flags(0);
+		if (listener < 0) {
+			ret = listener;
+			goto out_free;
+		}
+
+		listener_f = init_listener(current, prepared);
+		if (IS_ERR(listener_f)) {
+			put_unused_fd(listener);
+			ret = PTR_ERR(listener_f);
+			goto out_free;
+		}
+	}
+
 	/*
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_free;
+		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
 
@@ -887,6 +1079,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
+		if (ret < 0) {
+			fput(listener_f);
+			put_unused_fd(listener);
+		} else {
+			fd_install(listener, listener_f);
+			ret = listener;
+		}
+	}
 out_free:
 	seccomp_filter_free(prepared);
 	return ret;
@@ -915,6 +1117,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
 	case SECCOMP_RET_LOG:
 	case SECCOMP_RET_ALLOW:
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
+			break;
 	default:
 		return -EOPNOTSUPP;
 	}
@@ -1111,6 +1316,7 @@ long seccomp_get_metadata(struct task_struct *task,
 #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
 #define SECCOMP_RET_TRAP_NAME		"trap"
 #define SECCOMP_RET_ERRNO_NAME		"errno"
+#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
 #define SECCOMP_RET_TRACE_NAME		"trace"
 #define SECCOMP_RET_LOG_NAME		"log"
 #define SECCOMP_RET_ALLOW_NAME		"allow"
@@ -1120,6 +1326,7 @@ static const char seccomp_actions_avail[] =
 				SECCOMP_RET_KILL_THREAD_NAME	" "
 				SECCOMP_RET_TRAP_NAME		" "
 				SECCOMP_RET_ERRNO_NAME		" "
+				SECCOMP_RET_USER_NOTIF_NAME     " "
 				SECCOMP_RET_TRACE_NAME		" "
 				SECCOMP_RET_LOG_NAME		" "
 				SECCOMP_RET_ALLOW_NAME;
@@ -1137,6 +1344,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
 	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
 	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
 	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
+	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
 	{ }
 };
 
@@ -1342,3 +1550,244 @@ static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_knotif *knotif;
+
+	mutex_lock(&filter->notify_lock);
+
+	/*
+	 * If this file is being closed because e.g. the task who owned it
+	 * died, let's wake everyone up who was waiting on us.
+	 */
+	list_for_each_entry(knotif, &filter->notifications, list) {
+		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
+			continue;
+
+		knotif->state = SECCOMP_NOTIFY_REPLIED;
+		knotif->error = -ENOSYS;
+		knotif->val = 0;
+
+		complete(&knotif->ready);
+	}
+
+	wake_up_all(&filter->wqh);
+	filter->has_listener = false;
+	mutex_unlock(&filter->notify_lock);
+	__put_seccomp_filter(filter);
+	return 0;
+}
+
+static long seccomp_notify_recv(struct seccomp_filter *filter,
+				unsigned long arg)
+{
+	struct seccomp_knotif *knotif = NULL, *cur;
+	struct seccomp_notif unotif = {};
+	ssize_t ret;
+	u16 size;
+	void __user *buf = (void __user *)arg;
+
+	if (copy_from_user(&size, buf, sizeof(size)))
+		return -EFAULT;
+
+	ret = down_interruptible(&filter->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each_entry(cur, &filter->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT) {
+			knotif = cur;
+			break;
+		}
+	}
+
+	/*
+	 * If we didn't find a notification, it could be that the task was
+	 * interrupted between the time we were woken and when we were able to
+	 * acquire notify_lock.
+	 */
+	if (!knotif) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	size = min_t(size_t, size, sizeof(unotif));
+
+	unotif.len = size;
+	unotif.id = knotif->id;
+	unotif.pid = pid_vnr(knotif->pid);
+	unotif.signalled = knotif->signalled;
+	unotif.data = *(knotif->data);
+
+	if (copy_to_user(buf, &unotif, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = sizeof(unotif);
+	knotif->state = SECCOMP_NOTIFY_SENT;
+	wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_send(struct seccomp_filter *filter,
+				unsigned long arg)
+{
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL;
+	long ret;
+	u16 size;
+	void __user *buf = (void __user *)arg;
+
+	if (copy_from_user(&size, buf, sizeof(size)))
+		return -EFAULT;
+	size = min_t(size_t, size, sizeof(resp));
+	if (copy_from_user(&resp, buf, size))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(knotif, &filter->notifications, list) {
+		if (knotif->id == resp.id)
+			break;
+	}
+
+	if (!knotif || knotif->id != resp.id) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Allow exactly one reply. */
+	if (knotif->state != SECCOMP_NOTIFY_SENT) {
+		ret = -EINPROGRESS;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_REPLIED;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_is_id_valid(struct seccomp_filter *filter,
+				       unsigned long arg)
+{
+	struct seccomp_knotif *knotif = NULL;
+	void __user *buf = (void __user *)arg;
+	u64 id;
+
+	if (copy_from_user(&id, buf, sizeof(id)))
+		return -EFAULT;
+
+	list_for_each_entry(knotif, &filter->notifications, list) {
+		if (knotif->id == id)
+			return 1;
+	}
+
+	return 0;
+}
+
+static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
+				 unsigned long arg)
+{
+	struct seccomp_filter *filter = file->private_data;
+
+	switch (cmd) {
+	case SECCOMP_NOTIF_RECV:
+		return seccomp_notify_recv(filter, arg);
+	case SECCOMP_NOTIF_SEND:
+		return seccomp_notify_send(filter, arg);
+	case SECCOMP_NOTIF_IS_ID_VALID:
+		return seccomp_notify_is_id_valid(filter, arg);
+	default:
+		return -EINVAL;
+	}
+}
+
+static __poll_t seccomp_notify_poll(struct file *file,
+				    struct poll_table_struct *poll_tab)
+{
+	struct seccomp_filter *filter = file->private_data;
+	__poll_t ret = 0;
+	struct seccomp_knotif *cur;
+
+	poll_wait(file, &filter->wqh, poll_tab);
+
+	/* __poll_t is unsigned, so test the lock result directly. */
+	if (mutex_lock_interruptible(&filter->notify_lock) < 0)
+		return EPOLLERR;
+
+	list_for_each_entry(cur, &filter->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT)
+			ret |= EPOLLIN | EPOLLRDNORM;
+		if (cur->state == SECCOMP_NOTIFY_SENT)
+			ret |= EPOLLOUT | EPOLLWRNORM;
+		if (ret & EPOLLIN && ret & EPOLLOUT)
+			break;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.poll = seccomp_notify_poll,
+	.release = seccomp_notify_release,
+	.unlocked_ioctl = seccomp_notify_ioctl,
+};
+
+static struct file *init_listener(struct task_struct *task,
+				  struct seccomp_filter *filter)
+{
+	struct file *ret = ERR_PTR(-EBUSY);
+	struct seccomp_filter *cur, *last_locked = NULL;
+	int filter_nesting = 0;
+
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_lock_nested(&cur->notify_lock, filter_nesting);
+		filter_nesting++;
+		last_locked = cur;
+		if (cur->has_listener)
+			goto out;
+	}
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret))
+		goto out;
+
+
+	/* The file has a reference to it now */
+	__get_seccomp_filter(filter);
+	filter->has_listener = true;
+
+out:
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_unlock(&cur->notify_lock);
+		if (cur == last_locked)
+			break;
+	}
+
+	return ret;
+}
+#else
+static struct file *init_listener(struct task_struct *task,
+				  struct seccomp_filter *filter)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index e1473234968d..89f2c788a06b 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -5,6 +5,7 @@
  * Test code for seccomp bpf.
  */
 
+#define _GNU_SOURCE
 #include <sys/types.h>
 
 /*
@@ -40,10 +41,12 @@
 #include <sys/fcntl.h>
 #include <sys/mman.h>
 #include <sys/times.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
 
-#define _GNU_SOURCE
 #include <unistd.h>
 #include <sys/syscall.h>
+#include <poll.h>
 
 #include "../kselftest_harness.h"
 
@@ -154,6 +157,34 @@ struct seccomp_metadata {
 };
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+#define SECCOMP_IOC_MAGIC		0xF7
+#define SECCOMP_NOTIF_RECV		_IOWR(SECCOMP_IOC_MAGIC, 0,	\
+						struct seccomp_notif)
+#define SECCOMP_NOTIF_SEND		_IOWR(SECCOMP_IOC_MAGIC, 1,	\
+						struct seccomp_notif_resp)
+#define SECCOMP_NOTIF_IS_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
+						__u64)
+struct seccomp_notif {
+	__u16 len;
+	__u64 id;
+	__u32 pid;
+	__u8 signalled;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u16 len;
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -2077,7 +2108,8 @@ TEST(detect_seccomp_filter_flags)
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
 				 SECCOMP_FILTER_FLAG_LOG,
-				 SECCOMP_FILTER_FLAG_SPEC_ALLOW };
+				 SECCOMP_FILTER_FLAG_SPEC_ALLOW,
+				 SECCOMP_FILTER_FLAG_NEW_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
@@ -2933,6 +2965,373 @@ TEST(get_metadata)
 	ASSERT_EQ(0, kill(pid, SIGKILL));
 }
 
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int read_notif(int listener, struct seccomp_notif *req)
+{
+	int ret;
+
+	do {
+		errno = 0;
+		req->len = sizeof(*req);
+		ret = ioctl(listener, SECCOMP_NOTIF_RECV, req);
+	} while (ret == -1 && errno == ENOENT);
+	return ret;
+}
+
+static void signal_handler(int signal)
+{
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	struct pollfd pollfd;
+
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		if (user_trap_syscall(__NR_getpid, 0) < 0)
+			exit(1);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/* Add some no-op filters so that we (don't) trigger lockdep. */
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	/* Installing a second listener in the chain should fail with EBUSY */
+	EXPECT_EQ(user_trap_syscall(__NR_getpid,
+				    SECCOMP_FILTER_FLAG_NEW_LISTENER),
+		  -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLIN);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLOUT);
+
+	EXPECT_EQ(req.data.nr,  __NR_getpid);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_IS_ID_VALID, &req.id), 1);
+
+	EXPECT_EQ(kill(pid, SIGKILL), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_IS_ID_VALID, &req.id), 0);
+
+	resp.id = req.id;
+	ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	/*
+	 * Check that we get another notification about a signal in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+			perror("signal");
+			exit(1);
+		}
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(req.signalled, 1);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
+	EXPECT_EQ(ret, sizeof(resp));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that we get an ENOSYS when the listener is closed.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0) {
+		close(listener);
+		ret = syscall(__NR_getpid);
+		exit(ret != -1 || errno != ENOSYS);
+	}
+
+	close(listener);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+/*
+ * Check that a pid in a child namespace still shows up as valid in ours.
+ */
+TEST(user_notification_child_pid_ns)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+	ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+
+		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done and respond with magic */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+	EXPECT_EQ(req.pid, pid);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+	close(listener);
+}
+
+/*
+ * Check that a pid in a sibling (i.e. unrelated) namespace shows up as 0, i.e.
+ * invalid.
+ */
+TEST(user_notification_sibling_pid_ns)
+{
+	pid_t pid, pid2;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int child_pair[2];
+
+		ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+
+		ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, child_pair), 0);
+
+		pid2 = fork();
+		ASSERT_GE(pid2, 0);
+
+		if (pid2 == 0) {
+			close(child_pair[0]);
+			EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+			/* Signal we're ready and have installed the filter. */
+			EXPECT_EQ(write(child_pair[1], "J", 1), 1);
+
+			EXPECT_EQ(read(child_pair[1], &c, 1), 1);
+			EXPECT_EQ(c, 'H');
+
+			exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+		}
+
+		/* check that child has installed the filter */
+		EXPECT_EQ(read(child_pair[0], &c, 1), 1);
+		EXPECT_EQ(c, 'J');
+
+		/* tell parent who child is */
+		EXPECT_EQ(write(sk_pair[1], &pid2, sizeof(pid2)), sizeof(pid2));
+
+		/* parent has installed listener, tell child to call syscall */
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+		EXPECT_EQ(write(child_pair[0], "H", 1), 1);
+
+		EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+		EXPECT_EQ(true, WIFEXITED(status));
+		EXPECT_EQ(0, WEXITSTATUS(status));
+		exit(WEXITSTATUS(status));
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &pid2, sizeof(pid2)), sizeof(pid2));
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid2), 0);
+	EXPECT_EQ(waitpid(pid2, NULL, 0), pid2);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid2, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(errno, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid2, NULL, 0), 0);
+
+	/* Create the sibling ns, and sibling in it. */
+	EXPECT_EQ(unshare(CLONE_NEWPID), 0);
+	EXPECT_EQ(errno, 0);
+
+	pid2 = fork();
+	EXPECT_GE(pid2, 0);
+
+	if (pid2 == 0) {
+		req.len = sizeof(req);
+		ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+		/*
+		 * The pid should be 0, i.e. the task is in some namespace that
+		 * we can't "see".
+		 */
+		ASSERT_EQ(req.pid, 0);
+
+		resp.len = sizeof(resp);
+		resp.id = req.id;
+		resp.error = 0;
+		resp.val = USER_NOTIF_MAGIC;
+
+		ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+		exit(0);
+	}
+
+	close(listener);
+
+	/* Now signal we are done setting up sibling listener. */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+
 /*
  * TODO:
  * - add microbenchmarks
-- 
2.17.1



* [PATCH v6 2/5] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  2018-09-06 15:28 [PATCH v6 0/5] seccomp trap to userspace Tycho Andersen
  2018-09-06 15:28 ` [PATCH v6 1/5] seccomp: add a return code to " Tycho Andersen
@ 2018-09-06 15:28 ` Tycho Andersen
  2018-09-11 10:25   ` kbuild test robot
  2018-09-06 15:28 ` [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: Tycho Andersen @ 2018-09-06 15:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	Tycho Andersen

In the next commit we'll use the same filter-offset lookup to get a
listener for the nth filter, so get_nth_filter() needs to be available
outside of CHECKPOINT_RESTORE, in the USER_NOTIFICATION case as well.

v2: new in v2
v3: no changes
v4: no changes
v5: switch to CHECKPOINT_RESTORE || USER_NOTIFICATION to avoid warning when
    only CONFIG_SECCOMP_FILTER is enabled.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 kernel/seccomp.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index a09eb5c05f68..ed786655186d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1188,7 +1188,8 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
 	return do_seccomp(op, 0, uargs);
 }
 
-#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
+#if defined(CONFIG_CHECKPOINT_RESTORE) || \
+	defined(CONFIG_SECCOMP_USER_NOTIFICATION)
 static struct seccomp_filter *get_nth_filter(struct task_struct *task,
 					     unsigned long filter_off)
 {
@@ -1235,6 +1236,7 @@ static struct seccomp_filter *get_nth_filter(struct task_struct *task,
 	return filter;
 }
 
+#if defined(CONFIG_CHECKPOINT_RESTORE)
 long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
 			void __user *data)
 {
@@ -1307,7 +1309,8 @@ long seccomp_get_metadata(struct task_struct *task,
 	__put_seccomp_filter(filter);
 	return ret;
 }
-#endif
+#endif /* CONFIG_CHECKPOINT_RESTORE */
+#endif /* CONFIG_CHECKPOINT_RESTORE || CONFIG_SECCOMP_USER_NOTIFICATION */
 
 #ifdef CONFIG_SYSCTL
 
-- 
2.17.1



* [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace
  2018-09-06 15:28 [PATCH v6 0/5] seccomp trap to userspace Tycho Andersen
  2018-09-06 15:28 ` [PATCH v6 1/5] seccomp: add a return code to " Tycho Andersen
  2018-09-06 15:28 ` [PATCH v6 2/5] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
@ 2018-09-06 15:28 ` Tycho Andersen
  2018-09-06 15:45   ` Jann Horn
  2018-09-13  0:00   ` Andy Lutomirski
  2018-09-06 15:28 ` [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen
  2018-09-06 15:28 ` [PATCH v6 5/5] samples: add an example of seccomp user trap Tycho Andersen
  4 siblings, 2 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-06 15:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	Tycho Andersen

As an alternative to SECCOMP_FILTER_FLAG_NEW_LISTENER, perhaps a ptrace()
version which can acquire a listener for a task's filter is useful. There
are at least two reasons this is preferable, even though it uses ptrace:

1. You can control tasks that aren't cooperating with you
2. You can control tasks whose filters block sendmsg() and socket(); if the
   task installs a filter which blocks these calls, there's no way with
   SECCOMP_FILTER_FLAG_NEW_LISTENER to get the fd out to the privileged task.
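
For illustration, the tracer-side sequence this enables might look roughly
like the sketch below. It follows the same attach / request / detach order
as the selftest in this patch. The helper name new_listener_via_ptrace() is
made up for the example, and the request value only means something on a
kernel carrying this series; elsewhere the call is expected to fail.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Value proposed by this patch; not present in released headers. */
#ifndef PTRACE_SECCOMP_NEW_LISTENER
#define PTRACE_SECCOMP_NEW_LISTENER 0x420e
#endif

/*
 * Hypothetical helper: attach to pid, request a listener fd for the
 * filter at offset 0, and detach again. Returns the fd, or -1 with
 * errno set.
 */
static int new_listener_via_ptrace(pid_t pid)
{
	long fd;
	int saved;

	if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0)
		return -1;

	/* Wait for the tracee to stop before issuing further requests. */
	if (waitpid(pid, NULL, 0) != pid) {
		saved = errno;
		ptrace(PTRACE_DETACH, pid, NULL, NULL);
		errno = saved;
		return -1;
	}

	fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, NULL, NULL);
	saved = errno;

	/* Detach either way so the tracee can continue running. */
	ptrace(PTRACE_DETACH, pid, NULL, NULL);
	errno = saved;

	return fd < 0 ? -1 : (int)fd;
}
```

The tracee does not have to cooperate at any point here, which is the
first reason above.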

v2: fix a bug where listener mode was not unset when an unused fd was not
    available
v3: fix refcounting bug (Oleg)
v4: * change the listener's fd flags to be 0
    * rename GET_LISTENER to NEW_LISTENER (Matthew)
v5: * add capable(CAP_SYS_ADMIN) requirement

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 include/linux/seccomp.h                       | 11 +++
 include/uapi/linux/ptrace.h                   |  2 +
 kernel/ptrace.c                               |  4 ++
 kernel/seccomp.c                              | 31 +++++++++
 tools/testing/selftests/seccomp/seccomp_bpf.c | 68 +++++++++++++++++++
 5 files changed, 116 insertions(+)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 017444b5efed..c17c7d051af0 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -112,4 +112,15 @@ static inline long seccomp_get_metadata(struct task_struct *task,
 	return -EINVAL;
 }
 #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+extern long seccomp_new_listener(struct task_struct *task,
+				 unsigned long filter_off);
+#else
+static inline long seccomp_new_listener(struct task_struct *task,
+					unsigned long filter_off)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_SECCOMP_USER_NOTIFICATION */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index d5a1b8a492b9..e80ecb1bd427 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -73,6 +73,8 @@ struct seccomp_metadata {
 	__u64 flags;		/* Output: filter's flags */
 };
 
+#define PTRACE_SECCOMP_NEW_LISTENER	0x420e
+
 /* Read signals from a shared (process wide) queue */
 #define PTRACE_PEEKSIGINFO_SHARED	(1 << 0)
 
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 21fec73d45d4..289960ac181b 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
 		ret = seccomp_get_metadata(child, addr, datavp);
 		break;
 
+	case PTRACE_SECCOMP_NEW_LISTENER:
+		ret = seccomp_new_listener(child, addr);
+		break;
+
 	default:
 		break;
 	}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ed786655186d..580888785324 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1787,6 +1787,37 @@ static struct file *init_listener(struct task_struct *task,
 
 	return ret;
 }
+
+long seccomp_new_listener(struct task_struct *task,
+			  unsigned long filter_off)
+{
+	struct seccomp_filter *filter;
+	struct file *listener;
+	int fd;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	filter = get_nth_filter(task, filter_off);
+	if (IS_ERR(filter))
+		return PTR_ERR(filter);
+
+	fd = get_unused_fd_flags(0);
+	if (fd < 0) {
+		__put_seccomp_filter(filter);
+		return fd;
+	}
+
+	listener = init_listener(task, filter);
+	__put_seccomp_filter(filter);
+	if (IS_ERR(listener)) {
+		put_unused_fd(fd);
+		return PTR_ERR(listener);
+	}
+
+	fd_install(fd, listener);
+	return fd;
+}
 #else
 static struct file *init_listener(struct task_struct *task,
 				  struct seccomp_filter *filter)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 89f2c788a06b..61b8e3c5c06b 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -193,6 +193,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
 }
 #endif
 
+#ifndef PTRACE_SECCOMP_NEW_LISTENER
+#define PTRACE_SECCOMP_NEW_LISTENER 0x420e
+#endif
+
 #if __BYTE_ORDER == __LITTLE_ENDIAN
 #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
 #elif __BYTE_ORDER == __BIG_ENDIAN
@@ -3165,6 +3169,70 @@ TEST(get_user_notification_syscall)
 	EXPECT_EQ(0, WEXITSTATUS(status));
 }
 
+TEST(get_user_notification_ptrace)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Test that we get ENOSYS while not attached */
+		EXPECT_EQ(syscall(__NR_getpid), -1);
+		EXPECT_EQ(errno, ENOSYS);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+
+		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+
+	/* EBUSY for second listener */
+	EXPECT_EQ(ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0), -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done and respond with magic */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	close(listener);
+}
+
 /*
  * Check that a pid in a child namespace still shows up as valid in ours.
  */
-- 
2.17.1



* [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-06 15:28 [PATCH v6 0/5] seccomp trap to userspace Tycho Andersen
                   ` (2 preceding siblings ...)
  2018-09-06 15:28 ` [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
@ 2018-09-06 15:28 ` Tycho Andersen
  2018-09-06 16:15   ` Jann Horn
  2018-09-12 23:52   ` Andy Lutomirski
  2018-09-06 15:28 ` [PATCH v6 5/5] samples: add an example of seccomp user trap Tycho Andersen
  4 siblings, 2 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-06 15:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	Tycho Andersen

The idea here is that the userspace handler should be able to pass an fd
back to the trapped task, for example so it can be returned from socket().

I've proposed one API here, but I'm open to other options. In particular,
this only lets you return an fd from a syscall, which may not be enough in
all cases. For example, if an fd is written to an output parameter instead
of being returned, the current API can't handle it. Another case is netlink,
which sometimes takes fds as input (e.g. IFLA_NET_NS_FD). If netlink ever
decides to install an fd and output it, we wouldn't be able to handle that
either.

Still, the vast majority of interesting cases are covered by this API, so
perhaps it is Enough.
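
As a rough sketch of the handler side, a reply that passes an fd could be
filled in as below. The struct is a local mirror of the extended
struct seccomp_notif_resp from this patch (the names notif_resp_sketch and
fill_fd_reply() are made up for the example), so treat it as an
illustration of the field usage rather than the actual uapi header.

```c
#include <stdint.h>
#include <string.h>

/*
 * Local mirror of struct seccomp_notif_resp as extended by this patch.
 * The real definition lives in the patched include/uapi/linux/seccomp.h.
 */
struct notif_resp_sketch {
	uint16_t len;
	uint64_t id;
	int32_t  error;
	int64_t  val;
	uint8_t  return_fd;
	uint32_t fd;
	uint32_t fd_flags;
};

/*
 * Hypothetical helper: prepare a reply that hands local_fd to the
 * trapped task instead of returning a plain value.
 */
static void fill_fd_reply(struct notif_resp_sketch *resp, uint64_t id,
			  int local_fd, uint32_t flags)
{
	memset(resp, 0, sizeof(*resp));
	resp->len = sizeof(*resp);
	resp->id = id;			/* id taken from the notification */
	resp->return_fd = 1;		/* ask the kernel to install an fd */
	resp->fd = (uint32_t)local_fd;	/* fd number in the handler's table */
	resp->fd_flags = flags;		/* flags for the tracee's fd, e.g. O_CLOEXEC */
}
```

The filled structure would then go to ioctl(listener, SECCOMP_NOTIF_SEND,
&resp), after which the handler can close its own copy of the fd, as the
selftest does.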

I've left it as a separate commit for two reasons:
  * It illustrates the way in which we would grow struct seccomp_notif and
    struct seccomp_notif_resp without using netlink
  * It shows just how little code is needed to accomplish this :)
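
To make the first point concrete, here is a small userspace sketch of the
length-prefixed idea: the receiver copies at most as many bytes as the
sender declared and leaves the rest zeroed, so an older sender that knows
nothing about the fd fields keeps working. The names resp_v1, resp_v2 and
copy_versioned() are hypothetical, and this is not the kernel code itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical old and new layouts of a response; only the tail grows. */
struct resp_v1 {
	uint16_t len;
	uint64_t id;
	int32_t  error;
	int64_t  val;
};

struct resp_v2 {
	uint16_t len;
	uint64_t id;
	int32_t  error;
	int64_t  val;
	uint8_t  return_fd;
	uint32_t fd;
	uint32_t fd_flags;
};

/*
 * Copy a caller-supplied structure of src_len bytes into the newest
 * layout. Bytes the caller did not provide stay zero, so fields added
 * later default to "unused". Returns the number of bytes consumed.
 */
static size_t copy_versioned(struct resp_v2 *dst, const void *src,
			     size_t src_len)
{
	size_t n = src_len < sizeof(*dst) ? src_len : sizeof(*dst);

	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, n);
	return n;
}
```

With this shape, a reply built against the old layout leaves return_fd at
zero and therefore falls back to the plain val/error path.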

v2: new in v2
v3: no changes
v4: * pass fd flags back from userspace as well (Jann)
    * update same cgroup data on fd pass as SCM_RIGHTS (Alban)
    * only set the REPLIED state /after/ successful fdget (Alban)
    * reflect GET_LISTENER -> NEW_LISTENER changes
    * add to the new Documentation/ on user notifications about fd replies
v5: * fix documentation typo (O_EXCL -> O_CLOEXEC)

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 .../userspace-api/seccomp_filter.rst          |  11 ++
 include/uapi/linux/seccomp.h                  |   3 +
 kernel/seccomp.c                              |  51 +++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 116 ++++++++++++++++++
 4 files changed, 179 insertions(+), 2 deletions(-)

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index d1498885c1c7..1c0aab306426 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -235,6 +235,9 @@ The interface for a seccomp notification fd consists of two structures:
         __u64 id;
         __s32 error;
         __s64 val;
+        __u8 return_fd;
+        __u32 fd;
+        __u32 fd_flags;
     };
 
 Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
@@ -256,6 +259,14 @@ mentioned above in this document: all arguments being read from the tracee's
 memory should be read into the tracer's memory before any policy decisions are
 made. This allows for an atomic decision on syscall arguments.
 
+Userspace can also return file descriptors. For example, one may decide to
+intercept ``socket()`` syscalls, and return some file descriptor from those
+based on some policy. To return a file descriptor, the ``return_fd`` member
+should be non-zero, the ``fd`` argument should be the fd in the listener's
+table to send to the tracee (similar to how ``SCM_RIGHTS`` works), and
+``fd_flags`` should be the flags that the fd in the tracee's table is opened
+with (e.g. ``O_CLOEXEC`` or similar).
+
 Sysctls
 =======
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index aa5878972128..93f1bd5c7cf0 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -75,6 +75,9 @@ struct seccomp_notif_resp {
 	__u64 id;
 	__s32 error;
 	__s64 val;
+	__u8 return_fd;
+	__u32 fd;
+	__u32 fd_flags;
 };
 
 #define SECCOMP_IOC_MAGIC		0xF7
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 580888785324..4a6db4076ec5 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -43,6 +43,7 @@
 
 #ifdef CONFIG_SECCOMP_USER_NOTIFICATION
 #include <linux/anon_inodes.h>
+#include <net/cls_cgroup.h>
 
 enum notify_state {
 	SECCOMP_NOTIFY_INIT,
@@ -80,6 +81,8 @@ struct seccomp_knotif {
 	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
 	int error;
 	long val;
+	struct file *file;
+	unsigned int flags;
 
 	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
 	struct completion ready;
@@ -800,10 +803,44 @@ static void seccomp_do_user_notification(int this_syscall,
 			goto remove_list;
 	}
 
-	ret = n.val;
-	err = n.error;
+	if (n.file) {
+		int fd;
+		struct socket *sock;
+
+		fd = get_unused_fd_flags(n.flags);
+		if (fd < 0) {
+			err = fd;
+			ret = -1;
+			goto remove_list;
+		}
+
+		/*
+		 * Similar to what SCM_RIGHTS does, let's re-set the cgroup
+		 * data to point to the tracee's cgroups instead of the
+		 * listener's.
+		 */
+		sock = sock_from_file(n.file, &err);
+		if (sock) {
+			sock_update_netprioidx(&sock->sk->sk_cgrp_data);
+			sock_update_classid(&sock->sk->sk_cgrp_data);
+		}
+
+		ret = fd;
+		err = 0;
+
+		fd_install(fd, n.file);
+		/* Don't fput, since fd has a reference now */
+		n.file = NULL;
+	} else {
+		ret = n.val;
+		err = n.error;
+	}
+
 
 remove_list:
+	if (n.file)
+		fput(n.file);
+
 	list_del(&n.list);
 out:
 	mutex_unlock(&match->notify_lock);
@@ -1675,10 +1712,20 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
 		goto out;
 	}
 
+	if (resp.return_fd) {
+		knotif->flags = resp.fd_flags;
+		knotif->file = fget(resp.fd);
+		if (!knotif->file) {
+			ret = -EBADF;
+			goto out;
+		}
+	}
+
 	ret = size;
 	knotif->state = SECCOMP_NOTIFY_REPLIED;
 	knotif->error = resp.error;
 	knotif->val = resp.val;
+
 	complete(&knotif->ready);
 out:
 	mutex_unlock(&filter->notify_lock);
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 61b8e3c5c06b..c756722faa88 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -182,6 +182,9 @@ struct seccomp_notif_resp {
 	__u64 id;
 	__s32 error;
 	__s64 val;
+	__u8 return_fd;
+	__u32 fd;
+	__u32 fd_flags;
 };
 #endif
 
@@ -3233,6 +3236,119 @@ TEST(get_user_notification_ptrace)
 	close(listener);
 }
 
+TEST(user_notification_pass_fd)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	long ret;
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		char buf[16];
+
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+		close(sk_pair[1]);
+
+		/* An fd from getpid(). Let the games begin. */
+		fd = syscall(__NR_getpid);
+		EXPECT_GT(fd, 0);
+		EXPECT_EQ(read(fd, buf, sizeof(buf)), 12);
+		close(fd);
+
+		exit(strcmp("hello world", buf));
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done installing so it can do a getpid */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+	close(sk_pair[0]);
+
+	/* Make a new socket pair so we can send half across */
+	EXPECT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.return_fd = 1;
+	resp.fd = sk_pair[1];
+	resp.fd_flags = 0;
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+	close(sk_pair[1]);
+
+	EXPECT_EQ(write(sk_pair[0], "hello world\0", 12), 12);
+	close(sk_pair[0]);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+	close(listener);
+}
+
+TEST(user_notification_struct_size_mismatch)
+{
+	pid_t pid;
+	long ret;
+	int status, listener, len;
+	struct seccomp_notif req;
+	struct seccomp_notif_resp resp;
+
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+
+	/*
+	 * Only write a partial structure: this is what was available before we
+	 * had fd support.
+	 */
+	len = offsetof(struct seccomp_notif_resp, val) + sizeof(resp.val);
+	resp.len = len;
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), len);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
 /*
  * Check that a pid in a child namespace still shows up as valid in ours.
  */
-- 
2.17.1



* [PATCH v6 5/5] samples: add an example of seccomp user trap
  2018-09-06 15:28 [PATCH v6 0/5] seccomp trap to userspace Tycho Andersen
                   ` (3 preceding siblings ...)
  2018-09-06 15:28 ` [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen
@ 2018-09-06 15:28 ` Tycho Andersen
  4 siblings, 0 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-06 15:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	Tycho Andersen

The idea here is just to give a demonstration of how one could safely use
the SECCOMP_RET_USER_NOTIF feature to do mount policies. This particular
policy is (as noted in the comment) not very interesting, but it serves to
illustrate how one might apply a policy while avoiding the various TOCTOU
issues.
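
For orientation, the filter half of such a policy can be as small as the
sketch below, which is modelled on user_trap_syscall() from the selftests.
The helper name build_user_notif_filter() is made up here, and a real
filter would normally also check seccomp_data.arch before trusting the
syscall number.

```c
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Value proposed by this series; older headers do not define it. */
#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif

/*
 * Hypothetical helper: fill prog (4 instructions) with a filter that
 * hands syscall nr to the userspace listener and allows everything else.
 */
static void build_user_notif_filter(struct sock_filter prog[4],
				    unsigned int nr)
{
	const struct sock_filter insns[4] = {
		/* Load the syscall number from seccomp_data. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* On a match fall through, otherwise skip one instruction. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};

	memcpy(prog, insns, sizeof(insns));
}
```

Passing __NR_mount as nr would route mount() calls to the listener, where
the handler can copy the arguments into its own memory before deciding.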

v5: new in v5

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 samples/seccomp/.gitignore  |   1 +
 samples/seccomp/Makefile    |   7 +-
 samples/seccomp/user-trap.c | 312 ++++++++++++++++++++++++++++++++++++
 3 files changed, 319 insertions(+), 1 deletion(-)

diff --git a/samples/seccomp/.gitignore b/samples/seccomp/.gitignore
index 78fb78184291..d1e2e817d556 100644
--- a/samples/seccomp/.gitignore
+++ b/samples/seccomp/.gitignore
@@ -1,3 +1,4 @@
 bpf-direct
 bpf-fancy
 dropper
+user-trap
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
index cf34ff6b4065..4920903c8009 100644
--- a/samples/seccomp/Makefile
+++ b/samples/seccomp/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 ifndef CROSS_COMPILE
-hostprogs-$(CONFIG_SAMPLE_SECCOMP) := bpf-fancy dropper bpf-direct
+hostprogs-$(CONFIG_SAMPLE_SECCOMP) := bpf-fancy dropper bpf-direct user-trap
 
 HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
 HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
@@ -16,6 +16,10 @@ HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
 HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
 bpf-direct-objs := bpf-direct.o
 
+HOSTCFLAGS_user-trap.o += -I$(objtree)/usr/include
+HOSTCFLAGS_user-trap.o += -idirafter $(objtree)/include
+user-trap-objs := user-trap.o
+
 # Try to match the kernel target.
 ifndef CONFIG_64BIT
 
@@ -33,6 +37,7 @@ HOSTCFLAGS_bpf-fancy.o += $(MFLAG)
 HOSTLDLIBS_bpf-direct += $(MFLAG)
 HOSTLDLIBS_bpf-fancy += $(MFLAG)
 HOSTLDLIBS_dropper += $(MFLAG)
+HOSTLDLIBS_user-trap += $(MFLAG)
 endif
 always := $(hostprogs-m)
 endif
diff --git a/samples/seccomp/user-trap.c b/samples/seccomp/user-trap.c
new file mode 100644
index 000000000000..571eb32fd80b
--- /dev/null
+++ b/samples/seccomp/user-trap.c
@@ -0,0 +1,312 @@
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <string.h>
+#include <stddef.h>
+#include <sys/sysmacros.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/user.h>
+#include <sys/ioctl.h>
+#include <sys/ptrace.h>
+#include <sys/mount.h>
+#include <linux/limits.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+
+/*
+ * Because of some grossness, we can't include linux/ptrace.h here, so we
+ * re-define PTRACE_SECCOMP_NEW_LISTENER.
+ */
+#ifndef PTRACE_SECCOMP_NEW_LISTENER
+#define PTRACE_SECCOMP_NEW_LISTENER	0x420e
+#endif
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+static int seccomp(unsigned int op, unsigned int flags, void *args)
+{
+	errno = 0;
+	return syscall(__NR_seccomp, op, flags, args);
+}
+
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int handle_req(struct seccomp_notif *req,
+		      struct seccomp_notif_resp *resp, int listener)
+{
+	char path[PATH_MAX], source[PATH_MAX], target[PATH_MAX];
+	int ret = -1, mem;
+
+	resp->len = sizeof(*resp);
+	resp->id = req->id;
+	resp->error = -EPERM;
+	resp->val = 0;
+
+	if (req->data.nr != __NR_mount) {
+		fprintf(stderr, "huh? trapped something besides mount? %d\n", req->data.nr);
+		return -1;
+	}
+
+	/* Only allow bind mounts. */
+	if (!(req->data.args[3] & MS_BIND))
+		return 0;
+
+	/*
+	 * Ok, let's read the task's memory to see where they wanted their
+	 * mount to go.
+	 */
+	snprintf(path, sizeof(path), "/proc/%d/mem", req->pid);
+	mem = open(path, O_RDONLY);
+	if (mem < 0) {
+		perror("open mem");
+		return -1;
+	}
+
+	/*
+	 * Now we avoid a TOCTOU: we referred to the task by its pid, but since
+	 * the task that made the syscall may have died, we need to confirm that
+	 * the pid is still valid after we open its /proc/pid/mem file. We can
+	 * ask the listener fd this as follows.
+	 *
+	 * Note that this check should occur *after* any task-specific
+	 * resources are opened, to make sure that the task has not died and
+	 * we're not wrongly reading someone else's state in order to make
+	 * decisions.
+	 */
+	if (ioctl(listener, SECCOMP_NOTIF_IS_ID_VALID, &req->id) != 1) {
+		fprintf(stderr, "task died before we could map its memory\n");
+		goto out;
+	}
+
+	/*
+	 * Phew, we've got the right /proc/pid/mem. Now we can read it. Note
+	 * that to avoid another TOCTOU, we should read all of the pointer args
+	 * before we decide to allow the syscall.
+	 */
+	if (lseek(mem, req->data.args[0], SEEK_SET) < 0) {
+		perror("seek");
+		goto out;
+	}
+
+	ret = read(mem, source, sizeof(source));
+	if (ret < 0) {
+		perror("read");
+		goto out;
+	}
+
+	if (lseek(mem, req->data.args[1], SEEK_SET) < 0) {
+		perror("seek");
+		goto out;
+	}
+
+	ret = read(mem, target, sizeof(target));
+	if (ret < 0) {
+		perror("read");
+		goto out;
+	}
+
+	/*
+	 * Our policy is to only allow bind mounts inside /tmp. This isn't very
+	 * interesting, because we could do unprivileged bind mounts with user
+	 * namespaces already, but you get the idea.
+	 */
+	if (!strncmp(source, "/tmp", 4) && !strncmp(target, "/tmp", 4)) {
+		if (mount(source, target, NULL, req->data.args[3], NULL) < 0) {
+			ret = -1;
+			perror("actual mount");
+			goto out;
+		}
+		resp->error = 0;
+	}
+
+	/* Even if we didn't allow it because of policy, generating the
+	 * response was a success, because we want to tell the worker EPERM.
+	 */
+	ret = 0;
+
+out:
+	close(mem);
+	return ret;
+}
+
+int main(void)
+{
+	int sk_pair[2], ret = 1, status, listener;
+	pid_t worker = 0, tracer = 0;
+	char c;
+
+	if (socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair) < 0) {
+		perror("socketpair");
+		return 1;
+	}
+
+	worker = fork();
+	if (worker < 0) {
+		perror("fork");
+		goto close_pair;
+	}
+
+	if (worker == 0) {
+		if (user_trap_syscall(__NR_mount, 0) < 0) {
+			perror("seccomp");
+			exit(1);
+		}
+
+		if (setuid(1000) < 0) {
+			perror("setuid");
+			exit(1);
+		}
+
+		if (write(sk_pair[1], "a", 1) != 1) {
+			perror("write");
+			exit(1);
+		}
+
+		if (read(sk_pair[1], &c, 1) != 1) {
+			perror("read");
+			exit(1);
+		}
+
+		if (mkdir("/tmp/foo", 0755) < 0) {
+			perror("mkdir");
+			exit(1);
+		}
+
+		if (mount("/dev/sda", "/tmp/foo", NULL, 0, NULL) != -1) {
+			fprintf(stderr, "huh? mounted /dev/sda?\n");
+			exit(1);
+		}
+
+		if (errno != EPERM) {
+			perror("bad error from mount");
+			exit(1);
+		}
+
+		if (mount("/tmp/foo", "/tmp/foo", NULL, MS_BIND, NULL) < 0) {
+			perror("mount");
+			exit(1);
+		}
+
+		exit(0);
+	}
+
+	if (read(sk_pair[0], &c, 1) != 1) {
+		perror("read ready signal");
+		goto out_kill;
+	}
+
+	if (ptrace(PTRACE_ATTACH, worker) < 0) {
+		perror("ptrace");
+		goto out_kill;
+	}
+
+	if (waitpid(worker, NULL, 0) != worker) {
+		perror("waitpid");
+		goto out_kill;
+	}
+
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, worker, 0);
+	if (listener < 0) {
+		perror("ptrace get listener");
+		goto out_kill;
+	}
+
+	if (ptrace(PTRACE_DETACH, worker, NULL, 0) < 0) {
+		perror("ptrace detach");
+		goto out_kill;
+	}
+
+	if (write(sk_pair[0], "a", 1) != 1) {
+		perror("write");
+		exit(1);
+	}
+
+	tracer = fork();
+	if (tracer < 0) {
+		perror("fork");
+		goto out_kill;
+	}
+
+	if (tracer == 0) {
+		while (1) {
+			struct seccomp_notif req = {};
+			struct seccomp_notif_resp resp = {};
+
+			req.len = sizeof(req);
+			if (ioctl(listener, SECCOMP_NOTIF_RECV, &req) != sizeof(req)) {
+				perror("ioctl recv");
+				goto out_close;
+			}
+
+			if (handle_req(&req, &resp, listener) < 0)
+				goto out_close;
+
+			if (ioctl(listener, SECCOMP_NOTIF_SEND, &resp) != sizeof(resp)) {
+				perror("ioctl send");
+				goto out_close;
+			}
+		}
+out_close:
+		close(listener);
+		exit(1);
+	}
+
+	close(listener);
+
+	if (waitpid(worker, &status, 0) != worker) {
+		perror("waitpid");
+		goto out_kill;
+	}
+
+	if (umount2("/tmp/foo", MNT_DETACH) < 0 && errno != EINVAL) {
+		perror("umount2");
+		goto out_kill;
+	}
+
+	if (remove("/tmp/foo") < 0 && errno != ENOENT) {
+		perror("remove");
+		exit(1);
+	}
+
+	if (!WIFEXITED(status) || WEXITSTATUS(status)) {
+		fprintf(stderr, "worker exited nonzero\n");
+		goto out_kill;
+	}
+
+	ret = 0;
+
+out_kill:
+	if (tracer > 0)
+		kill(tracer, SIGKILL);
+	if (worker > 0)
+		kill(worker, SIGKILL);
+
+close_pair:
+	close(sk_pair[0]);
+	close(sk_pair[1]);
+	return ret;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace
  2018-09-06 15:28 ` [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
@ 2018-09-06 15:45   ` Jann Horn
  2018-09-06 15:50     ` Tycho Andersen
  2018-09-13  0:00   ` Andy Lutomirski
  1 sibling, 1 reply; 38+ messages in thread
From: Jann Horn @ 2018-09-06 15:45 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro

On Thu, Sep 6, 2018 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
>
> As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> version which can acquire filters is useful. There are at least two reasons
> this is preferable, even though it uses ptrace:
>
> 1. You can control tasks that aren't cooperating with you
> 2. You can control tasks whose filters block sendmsg() and socket(); if the
>    task installs a filter which blocks these calls, there's no way with
>    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
[...]
> +long seccomp_new_listener(struct task_struct *task,
> +                         unsigned long filter_off)
> +{
> +       struct seccomp_filter *filter;
> +       struct file *listener;
> +       int fd;
> +
> +       if (!capable(CAP_SYS_ADMIN))
> +               return -EACCES;
> +
> +       filter = get_nth_filter(task, filter_off);
> +       if (IS_ERR(filter))
> +               return PTR_ERR(filter);
> +
> +       fd = get_unused_fd_flags(0);
> +       if (fd < 0) {
> +               __put_seccomp_filter(filter);
> +               return fd;
> +       }
> +
> +       listener = init_listener(task, task->seccomp.filter);

Did you mean to write something like `init_listener(task, filter)` here?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace
  2018-09-06 15:45   ` Jann Horn
@ 2018-09-06 15:50     ` Tycho Andersen
  0 siblings, 0 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-06 15:50 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro

On Thu, Sep 06, 2018 at 05:45:25PM +0200, Jann Horn wrote:
> On Thu, Sep 6, 2018 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > version which can acquire filters is useful. There are at least two reasons
> > this is preferable, even though it uses ptrace:
> >
> > 1. You can control tasks that aren't cooperating with you
> > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> >    task installs a filter which blocks these calls, there's no way with
> >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> [...]
> > +long seccomp_new_listener(struct task_struct *task,
> > +                         unsigned long filter_off)
> > +{
> > +       struct seccomp_filter *filter;
> > +       struct file *listener;
> > +       int fd;
> > +
> > +       if (!capable(CAP_SYS_ADMIN))
> > +               return -EACCES;
> > +
> > +       filter = get_nth_filter(task, filter_off);
> > +       if (IS_ERR(filter))
> > +               return PTR_ERR(filter);
> > +
> > +       fd = get_unused_fd_flags(0);
> > +       if (fd < 0) {
> > +               __put_seccomp_filter(filter);
> > +               return fd;
> > +       }
> > +
> > +       listener = init_listener(task, task->seccomp.filter);
> 
> Did you mean to write something like `init_listener(task, filter)` here?

Yes, yes I did. Thanks, Jann.

Tycho

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-06 15:28 ` [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen
@ 2018-09-06 16:15   ` Jann Horn
  2018-09-06 16:22     ` Tycho Andersen
  2018-09-12 23:52   ` Andy Lutomirski
  1 sibling, 1 reply; 38+ messages in thread
From: Jann Horn @ 2018-09-06 16:15 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro

On Thu, Sep 6, 2018 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> The idea here is that the userspace handler should be able to pass an fd
> back to the trapped task, for example so it can be returned from socket().
[...]
> diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> index d1498885c1c7..1c0aab306426 100644
> --- a/Documentation/userspace-api/seccomp_filter.rst
> +++ b/Documentation/userspace-api/seccomp_filter.rst
> @@ -235,6 +235,9 @@ The interface for a seccomp notification fd consists of two structures:
>          __u64 id;
>          __s32 error;
>          __s64 val;
> +        __u8 return_fd;
> +        __u32 fd;
> +        __u32 fd_flags;

Normally, syscalls that take an optional file descriptor accept a
signed 32-bit number, with -1 standing for "no file descriptor". Is
there a reason why this uses a separate variable to signal whether an
fd was provided?

Apart from that, this patch looks good to me.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-06 16:15   ` Jann Horn
@ 2018-09-06 16:22     ` Tycho Andersen
  2018-09-06 18:30       ` Tycho Andersen
  0 siblings, 1 reply; 38+ messages in thread
From: Tycho Andersen @ 2018-09-06 16:22 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro

On Thu, Sep 06, 2018 at 06:15:18PM +0200, Jann Horn wrote:
> On Thu, Sep 6, 2018 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > The idea here is that the userspace handler should be able to pass an fd
> > back to the trapped task, for example so it can be returned from socket().
> [...]
> > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> > index d1498885c1c7..1c0aab306426 100644
> > --- a/Documentation/userspace-api/seccomp_filter.rst
> > +++ b/Documentation/userspace-api/seccomp_filter.rst
> > @@ -235,6 +235,9 @@ The interface for a seccomp notification fd consists of two structures:
> >          __u64 id;
> >          __s32 error;
> >          __s64 val;
> > +        __u8 return_fd;
> > +        __u32 fd;
> > +        __u32 fd_flags;
> 
> Normally, syscalls that take an optional file descriptor accept a
> signed 32-bit number, with -1 standing for "no file descriptor". Is
> there a reason why this uses a separate variable to signal whether an
> fd was provided?

No real reason other than I looked at the bpf code and they were using
__u32 for bpf (but I think in their case the fd args are not
optional). I'll switch it to __s32/-1 for the next version.

> Apart from that, this patch looks good to me.

Thanks,

Tycho

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-06 16:22     ` Tycho Andersen
@ 2018-09-06 18:30       ` Tycho Andersen
  2018-09-10 17:00         ` Jann Horn
  0 siblings, 1 reply; 38+ messages in thread
From: Tycho Andersen @ 2018-09-06 18:30 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro

On Thu, Sep 06, 2018 at 10:22:46AM -0600, Tycho Andersen wrote:
> On Thu, Sep 06, 2018 at 06:15:18PM +0200, Jann Horn wrote:
> > On Thu, Sep 6, 2018 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > The idea here is that the userspace handler should be able to pass an fd
> > > back to the trapped task, for example so it can be returned from socket().
> > [...]
> > > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> > > index d1498885c1c7..1c0aab306426 100644
> > > --- a/Documentation/userspace-api/seccomp_filter.rst
> > > +++ b/Documentation/userspace-api/seccomp_filter.rst
> > > @@ -235,6 +235,9 @@ The interface for a seccomp notification fd consists of two structures:
> > >          __u64 id;
> > >          __s32 error;
> > >          __s64 val;
> > > +        __u8 return_fd;
> > > +        __u32 fd;
> > > +        __u32 fd_flags;
> > 
> > Normally, syscalls that take an optional file descriptor accept a
> > signed 32-bit number, with -1 standing for "no file descriptor". Is
> > there a reason why this uses a separate variable to signal whether an
> > fd was provided?
> 
> No real reason other than I looked at the bpf code and they were using
> __u32 for bpf (but I think in their case the fd args are not
> optional). I'll switch it to __s32/-1 for the next version.

Oh, I think there is a reason actually: since this is an API addition,
the "0" value needs to be the previously default behavior if userspace
doesn't specify it. Since the previously default behavior was not to
return an fd, and we want to allow fd == 0, we need the extra flag to
make this work.

This is really only a problem because we're introducing this stuff in
a second patch (mostly to illustrate how extending the response
structure would work). I could fold this into the first patch if we
want, or we could keep the return_fd bits if the illustration is
useful.
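
A minimal sketch of that reasoning, assuming a len-prefixed response that is
copied into a zeroed, newer layout. The struct and function names (resp_v1,
resp_v2, parse_resp) are hypothetical and this is plain userspace C, not the
code in kernel/seccomp.c:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical v1 layout: what an older userspace would fill in. */
struct resp_v1 {
	uint16_t len;
	uint64_t id;
	int32_t error;
	int64_t val;
};

/* Hypothetical v2 layout: v1 with the fd-passing fields appended. */
struct resp_v2 {
	uint16_t len;
	uint64_t id;
	int32_t error;
	int64_t val;
	uint8_t return_fd;
	uint32_t fd;
	uint32_t fd_flags;
};

/*
 * Copy a caller-supplied response of `len` bytes into a zeroed v2 struct.
 * Fields the caller did not supply stay 0, so an old caller ends up with
 * return_fd == 0, i.e. "no fd". Returns 0 on success, -1 on a bad length.
 */
int parse_resp(const void *src, uint16_t len, struct resp_v2 *out)
{
	if (len < sizeof(struct resp_v1) || len > sizeof(*out))
		return -1;

	memset(out, 0, sizeof(*out));
	memcpy(out, src, len);
	return 0;
}
```

With a signed fd and -1 as the "no fd" marker instead, the zero-filled tail
of an old-style response would read as fd 0, which is a valid descriptor
rather than "none"; the separate flag keeps all-zeroes meaning the old
behavior.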

Tycho

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/5] seccomp: add a return code to trap to userspace
  2018-09-06 15:28 ` [PATCH v6 1/5] seccomp: add a return code to " Tycho Andersen
@ 2018-09-06 22:15   ` Tyler Hicks
  2018-09-07 15:45     ` Tycho Andersen
  2018-09-08 20:35     ` Tycho Andersen
  0 siblings, 2 replies; 38+ messages in thread
From: Tyler Hicks @ 2018-09-06 22:15 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Akihiro Suda, Jann Horn

Hey Tycho - I'm finally getting around to reviewing this patch set. I
don't have access to previous review comments while I'm doing this
review so I hope I'm not revisiting too many previous discussions.

On 2018-09-06 09:28:55, Tycho Andersen wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
> 
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
> 
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
> 
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
> 
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
> 
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
> 
> v2: * make id a u64; the idea here being that it will never overflow,
>       because 64 is huge (one syscall every nanosecond => wrap every 584
>       years) (Andy)
>     * prevent nesting of user notifications: if someone is already attached
>       to the tree in one place, nobody else can attach to the tree (Andy)
>     * notify the listener of signals the tracee receives as well (Andy)
>     * implement poll
> v3: * lockdep fix (Oleg)
>     * drop unnecessary WARN()s (Christian)
>     * rearrange error returns to be more pretty (Christian)
>     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> v4: * fix implementation of poll to use poll_wait() (Jann)
>     * change listener's fd flags to be 0 (Jann)
>     * hoist filter initialization out of ifdefs to its own function
>       init_user_notification()
>     * add some more testing around poll() and closing the listener while a
>       syscall is in action
>     * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
>       creates a new one (Matthew)
>     * correctly handle pid namespaces, add some testcases (Matthew)
>     * use EINPROGRESS instead of EINVAL when a notification response is
>       written twice (Matthew)
>     * fix comment typo from older version (SEND vs READ) (Matthew)
>     * whitespace and logic simplification (Tobin)
>     * add some Documentation/ bits on userspace trapping
> v5: * fix documentation typos (Jann)
>     * add signalled field to struct seccomp_notif (Jann)
>     * switch to using ioctls instead of read()/write() for struct passing
>       (Jann)
>     * add an ioctl to ensure an id is still valid
> v6: * docs typo fixes, update docs for ioctl() change (Christian)
> 
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  Documentation/ioctl/ioctl-number.txt          |   1 +
>  .../userspace-api/seccomp_filter.rst          |  73 +++
>  arch/Kconfig                                  |   9 +
>  include/linux/seccomp.h                       |   7 +-
>  include/uapi/linux/seccomp.h                  |  33 +-
>  kernel/seccomp.c                              | 453 +++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 403 +++++++++++++++-
>  7 files changed, 969 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> index 13a7c999c04a..31e9707f7e06 100644
> --- a/Documentation/ioctl/ioctl-number.txt
> +++ b/Documentation/ioctl/ioctl-number.txt
> @@ -345,4 +345,5 @@ Code  Seq#(hex)	Include File		Comments
>  					<mailto:raph@8d.com>
>  0xF6	all	LTTng			Linux Trace Toolkit Next Generation
>  					<mailto:mathieu.desnoyers@efficios.com>
> +0xF7    00-1F   uapi/linux/seccomp.h
>  0xFD	all	linux/dm-ioctl.h
> diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> index 82a468bc7560..d1498885c1c7 100644
> --- a/Documentation/userspace-api/seccomp_filter.rst
> +++ b/Documentation/userspace-api/seccomp_filter.rst
> @@ -122,6 +122,11 @@ In precedence order, they are:
>  	Results in the lower 16-bits of the return value being passed
>  	to userland as the errno without executing the system call.
>  
> +``SECCOMP_RET_USER_NOTIF``:
> +    Results in a ``struct seccomp_notif`` message sent on the userspace
> +    notification fd, if it is attached, or ``-ENOSYS`` if it is not. See below
> +    for a discussion of how to handle user notifications.
> +
>  ``SECCOMP_RET_TRACE``:
>  	When returned, this value will cause the kernel to attempt to
>  	notify a ``ptrace()``-based tracer prior to executing the system
> @@ -183,6 +188,74 @@ The ``samples/seccomp/`` directory contains both an x86-specific example
>  and a more generic example of a higher level macro interface for BPF
>  program generation.
>  
> +Userspace Notification
> +======================
> +
> +The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a
> +particular syscall to userspace to be handled. This may be useful for
> +applications like container managers, which wish to intercept particular
> +syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior.
> +
> +There are currently two APIs to acquire a userspace notification fd for a
> +particular filter. The first applies when the filter is installed: the task
> +installing the filter can ask the ``seccomp()`` syscall for one:
> +
> +.. code-block::
> +
> +    fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
> +
> +which (on success) will return a listener fd for the filter, which can then be
> +passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
> +acquired via:
> +
> +.. code-block::
> +
> +    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> +
> +which grabs the 0th filter for some task which the tracer has privilege over.
> +Note that filter fds correspond to a particular filter, and not a particular
> +task. So if this task then forks, notifications from both tasks will appear on
> +the same filter fd. Reads and writes to/from a filter fd are also synchronized,
> +so a filter fd can safely have many readers.
> +
> +The interface for a seccomp notification fd consists of two structures:
> +
> +.. code-block::
> +
> +    struct seccomp_notif {
> +        __u16 len;
> +        __u64 id;
> +        pid_t pid;
> +        __u8 signalled;
> +        struct seccomp_data data;
> +    };
> +
> +    struct seccomp_notif_resp {
> +        __u16 len;
> +        __u64 id;
> +        __s32 error;
> +        __s64 val;
> +    };
> +
> +Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
> +notification fd to receive a ``struct seccomp_notif``, which contains five
> +members: the input length of the structure, a globally unique ``id``, the

This documentation says that id is "globally unique" but an in-code
comment below says "this is unique for this filter". IIUC, the id is
only guaranteed to be unique for the filter so this documentation should
be updated slightly to make it clear that the id is only global in those
terms.

> +``pid`` of the task which triggered this request (which may be 0 if the task is
> +in a pid ns not visible from the listener's pid namespace), a flag representing
> +whether or not the notification is a result of a non-fatal signal, and the
> +``data`` passed to seccomp. Userspace can then make a decision based on this
> +information about what to do, and ``ioctl(SECCOMP_NOTIF_SEND)`` a response,
> +indicating what should be returned to userspace. The ``id`` member of ``struct
> +seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.
> +
> +It is worth noting that ``struct seccomp_data`` contains the values of register
> +arguments to the syscall, but does not contain pointers to memory. The task's
> +memory is accessible to suitably privileged tracers via ``ptrace()`` or
> +``/proc/pid/map_files/``. However, care should be taken to avoid the TOCTOU
> +mentioned above in this document: all arguments being read from the tracee's
> +memory should be read into the tracer's memory before any policy decisions are
> +made. This allows for an atomic decision on syscall arguments.
> +
>  Sysctls
>  =======
>  
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 6801123932a5..42f3585d925d 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -419,6 +419,15 @@ config SECCOMP_FILTER
>  
>  	  See Documentation/userspace-api/seccomp_filter.rst for details.
>  
> +config SECCOMP_USER_NOTIFICATION

Did someone request a Kconfig option for this new feature? If not, I
think that nuking the Kconfig option would reduce the test matrix. No
other filter flags have their own build time option but maybe it makes
sense in this case if this filter flag exposes the kernel to significant
new attack surface since there's more to this than just a new filter
flag.

If someone has a requirement to disable this feature, maybe it'd be
better to leave the decision up to the distro *and* the admin via a
sysctl instead of taking the admin out of the decision with a build
time option.

> +	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
> +	depends on SECCOMP_FILTER
> +	help
> +	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
> +	  programs to notify a userspace listener that a particular event happened.
> +
> +	  See Documentation/userspace-api/seccomp_filter.rst for details.
> +
>  config HAVE_STACKPROTECTOR
>  	bool
>  	help
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index e5320f6c8654..017444b5efed 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -4,9 +4,10 @@
>  
>  #include <uapi/linux/seccomp.h>
>  
> -#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC	| \
> -					 SECCOMP_FILTER_FLAG_LOG	| \
> -					 SECCOMP_FILTER_FLAG_SPEC_ALLOW)
> +#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
> +					 SECCOMP_FILTER_FLAG_LOG | \
> +					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
> +					 SECCOMP_FILTER_FLAG_NEW_LISTENER)
>  
>  #ifdef CONFIG_SECCOMP
>  
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 9efc0e73d50b..aa5878972128 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,9 +17,10 @@
>  #define SECCOMP_GET_ACTION_AVAIL	2
>  
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC	(1UL << 0)
> -#define SECCOMP_FILTER_FLAG_LOG		(1UL << 1)
> -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW	(1UL << 2)
> +#define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
> +#define SECCOMP_FILTER_FLAG_LOG			(1UL << 1)
> +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
> +#define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
>  
>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -35,6 +36,7 @@
>  #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
>  #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
> @@ -60,4 +62,29 @@ struct seccomp_data {
>  	__u64 args[6];
>  };
>  
> +struct seccomp_notif {
> +	__u16 len;
> +	__u64 id;
> +	__u32 pid;
> +	__u8 signalled;

I think signaled is the best spelling to go with. There are a lot of
other instances of signalled in the kernel sources but, ultimately, it
makes sense to follow the lead of the WIFSIGNALED macro from wait(2).

> +	struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +	__u16 len;
> +	__u64 id;
> +	__s32 error;
> +	__s64 val;
> +};
> +
> +#define SECCOMP_IOC_MAGIC		0xF7
> +
> +/* Flags for seccomp notification fd ioctl. */
> +#define SECCOMP_NOTIF_RECV		_IOWR(SECCOMP_IOC_MAGIC, 0,	\
> +						struct seccomp_notif)
> +#define SECCOMP_NOTIF_SEND		_IOWR(SECCOMP_IOC_MAGIC, 1,	\
> +						struct seccomp_notif_resp)

This is pedantic but it would make sense to me to have the ioctl names
match the struct names. That would leave us with:

#define SECCOMP_NOTIF			_IOWR(SECCOMP_IOC_MAGIC, 0,	\
					      struct seccomp_notif)
#define SECCOMP_NOTIF_RESP		_IOWR(SECCOMP_IOC_MAGIC, 1,	\
					      struct seccomp_notif_resp)

Change it if you agree. Ignore this comment if you don't.

> +#define SECCOMP_NOTIF_IS_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
> +						__u64)
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index fd023ac24e10..a09eb5c05f68 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -33,6 +33,7 @@
>  #endif
>  
>  #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/pid.h>
>  #include <linux/ptrace.h>
> @@ -40,6 +41,53 @@
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +	SECCOMP_NOTIFY_INIT,
> +	SECCOMP_NOTIFY_SENT,
> +	SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +	/* The struct pid of the task whose filter triggered the notification */
> +	struct pid *pid;
> +
> +	/* The "cookie" for this request; this is unique for this filter. */
> +	u32 id;
> +
> +	/* Whether or not this task has been given an interruptible signal. */
> +	bool signalled;
> +
> +	/*
> +	 * The seccomp data. This pointer is valid the entire time this
> +	 * notification is active, since it comes from __seccomp_filter which
> +	 * eclipses the entire lifecycle here.
> +	 */
> +	const struct seccomp_data *data;
> +
> +	/*
> +	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +	 * struct seccomp_knotif is created and starts out in INIT. Once the
> +	 * handler reads the notification off of an FD, it transitions to SENT.
> +	 * If a signal is received the state transitions back to INIT and
> +	 * another message is sent. When the userspace handler replies, state
> +	 * transitions to REPLIED.
> +	 */
> +	enum notify_state state;
> +
> +	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +	int error;
> +	long val;
> +
> +	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +	struct completion ready;
> +
> +	struct list_head list;
> +};
> +#endif
> +
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
>   *
> @@ -66,6 +114,30 @@ struct seccomp_filter {
>  	bool log;
>  	struct seccomp_filter *prev;
>  	struct bpf_prog *prog;
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +	/*
> +	 * A semaphore that users of this notification can wait on for
> +	 * changes. Actual reads and writes are still controlled with
> +	 * filter->notify_lock.
> +	 */
> +	struct semaphore request;
> +
> +	/* A lock for all notification-related accesses. */
> +	struct mutex notify_lock;
> +
> +	/* Is there currently an attached listener? */
> +	bool has_listener;
> +
> +	/* The id of the next request. */
> +	u64 next_id;
> +
> +	/* A list of struct seccomp_knotif elements. */
> +	struct list_head notifications;
> +
> +	/* A wait queue for poll. */
> +	wait_queue_head_t wqh;
> +#endif

I suspect that these additions would benefit from better struct packing
since there could be a lot of seccomp_filter structs floating around in
memory on a system with a large number of running containers or
otherwise sandboxed processes.

IIRC, there's a 3 byte hole following the log member that could be used
by has_listener, at least, and I'm not sure how the rest of the new
members affect things.
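
To make the hole concrete, here is a minimal userspace sketch. The struct
names and fields are stand-ins that I made up for illustration (they only
mirror the first few members of struct seccomp_filter plus two of the new
ones), so treat the sizes as indicative rather than exact:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the kernel's struct seccomp_filter. */
struct filter_spread {
	uint32_t usage;		/* stands in for refcount_t */
	bool log;		/* padding follows, up to the next pointer */
	void *prev;
	void *prog;
	bool has_listener;	/* opens a second run of padding */
	uint64_t next_id;
};

struct filter_packed {
	uint32_t usage;
	bool log;
	bool has_listener;	/* reuses the padding that follows log */
	void *prev;
	void *prog;
	uint64_t next_id;
};

/* Bytes saved per instance by moving has_listener next to log. */
static size_t packing_savings(void)
{
	return sizeof(struct filter_spread) - sizeof(struct filter_packed);
}
```

Running pahole against the real struct would give the authoritative layout;
the sketch only shows why placing has_listener right after log is free.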

>  };
>  
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -359,6 +431,19 @@ static inline void seccomp_sync_threads(unsigned long flags)
>  	}
>  }
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static void init_user_notification(struct seccomp_filter *sfilter)
> +{
> +	mutex_init(&sfilter->notify_lock);
> +	sema_init(&sfilter->request, 0);
> +	INIT_LIST_HEAD(&sfilter->notifications);
> +	sfilter->next_id = get_random_u64();
> +	init_waitqueue_head(&sfilter->wqh);
> +}
> +#else
> +static inline void init_user_notification(struct seccomp_filter *sfilter) { }
> +#endif
> +
>  /**
>   * seccomp_prepare_filter: Prepares a seccomp filter for use.
>   * @fprog: BPF program to install
> @@ -392,6 +477,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>  	if (!sfilter)
>  		return ERR_PTR(-ENOMEM);
>  
> +	init_user_notification(sfilter);
> +
>  	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>  					seccomp_check_filter, save_orig);
>  	if (ret < 0) {
> @@ -556,13 +643,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
>  #define SECCOMP_LOG_TRACE		(1 << 4)
>  #define SECCOMP_LOG_LOG			(1 << 5)
>  #define SECCOMP_LOG_ALLOW		(1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
>  
>  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
>  				    SECCOMP_LOG_KILL_THREAD  |
>  				    SECCOMP_LOG_TRAP  |
>  				    SECCOMP_LOG_ERRNO |
>  				    SECCOMP_LOG_TRACE |
> -				    SECCOMP_LOG_LOG;
> +				    SECCOMP_LOG_LOG |
> +				    SECCOMP_LOG_USER_NOTIF;
>  
>  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>  			       bool requested)
> @@ -581,6 +670,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>  	case SECCOMP_RET_TRACE:
>  		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
>  		break;
> +	case SECCOMP_RET_USER_NOTIF:
> +		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> +		break;
>  	case SECCOMP_RET_LOG:
>  		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
>  		break;
> @@ -651,6 +743,83 @@ void secure_computing_strict(int this_syscall)
>  }
>  #else
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> +	/* Note: overflow is ok here, the id just needs to be unique */
> +	return filter->next_id++;
> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +					 struct seccomp_filter *match,
> +					 const struct seccomp_data *sd)
> +{
> +	int err;
> +	long ret = 0;
> +	struct seccomp_knotif n = {};
> +
> +	mutex_lock(&match->notify_lock);
> +	err = -ENOSYS;
> +	if (!match->has_listener)
> +		goto out;
> +
> +	n.pid = task_pid(current);
> +	n.state = SECCOMP_NOTIFY_INIT;
> +	n.data = sd;
> +	n.id = seccomp_next_notify_id(match);
> +	init_completion(&n.ready);
> +
> +	list_add(&n.list, &match->notifications);
> +	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
> +
> +	mutex_unlock(&match->notify_lock);
> +	up(&match->request);
> +
> +	err = wait_for_completion_interruptible(&n.ready);
> +	mutex_lock(&match->notify_lock);
> +
> +	/*
> +	 * Here it's possible we got a signal and then had to wait on the mutex
> +	 * while the reply was sent, so let's be sure there wasn't a response
> +	 * in the meantime.
> +	 */
> +	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +		/*
> +		 * We got a signal. Let's tell userspace about it (potentially
> +		 * again, if we had already notified them about the first one).
> +		 */
> +		n.signalled = true;
> +		if (n.state == SECCOMP_NOTIFY_SENT) {
> +			n.state = SECCOMP_NOTIFY_INIT;
> +			up(&match->request);
> +		}
> +		mutex_unlock(&match->notify_lock);

Is it intentional that you call mutex_unlock() followed by up() when
initially waiting but then switch up the order before re-waiting here? I
don't yet fully understand the locking but this looked odd to me.

> +		err = wait_for_completion_killable(&n.ready);
> +		mutex_lock(&match->notify_lock);
> +		if (err < 0)
> +			goto remove_list;
> +	}
> +
> +	ret = n.val;
> +	err = n.error;
> +
> +remove_list:
> +	list_del(&n.list);
> +out:
> +	mutex_unlock(&match->notify_lock);
> +	syscall_set_return_value(current, task_pt_regs(current),
> +				 err, ret);
> +}
> +#else
> +static void seccomp_do_user_notification(int this_syscall,
> +					 struct seccomp_filter *match,
> +					 const struct seccomp_data *sd)
> +{
> +	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
> +	do_exit(SIGSYS);
> +}
> +#endif
> +
>  #ifdef CONFIG_SECCOMP_FILTER
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  			    const bool recheck_after_trace)
> @@ -728,6 +897,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  
>  		return 0;
>  
> +	case SECCOMP_RET_USER_NOTIF:
> +		seccomp_do_user_notification(this_syscall, match, sd);
> +		goto skip;
>  	case SECCOMP_RET_LOG:
>  		seccomp_log(this_syscall, 0, action, true);
>  		return 0;
> @@ -834,6 +1006,9 @@ static long seccomp_set_mode_strict(void)
>  }
>  
>  #ifdef CONFIG_SECCOMP_FILTER
> +static struct file *init_listener(struct task_struct *,
> +				  struct seccomp_filter *);
> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -853,6 +1028,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>  	struct seccomp_filter *prepared = NULL;
>  	long ret = -EINVAL;
> +	int listener = 0;
> +	struct file *listener_f = NULL;
>  
>  	/* Validate flags. */
>  	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -863,13 +1040,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	if (IS_ERR(prepared))
>  		return PTR_ERR(prepared);
>  
> +	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +		listener = get_unused_fd_flags(0);
> +		if (listener < 0) {
> +			ret = listener;
> +			goto out_free;
> +		}
> +
> +		listener_f = init_listener(current, prepared);
> +		if (IS_ERR(listener_f)) {
> +			put_unused_fd(listener);
> +			ret = PTR_ERR(listener_f);
> +			goto out_free;
> +		}
> +	}
> +
>  	/*
>  	 * Make sure we cannot change seccomp or nnp state via TSYNC
>  	 * while another thread is in the middle of calling exec.
>  	 */
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>  	    mutex_lock_killable(&current->signal->cred_guard_mutex))
> -		goto out_free;
> +		goto out_put_fd;
>  
>  	spin_lock_irq(&current->sighand->siglock);
>  
> @@ -887,6 +1079,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	spin_unlock_irq(&current->sighand->siglock);
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>  		mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +		if (ret < 0) {
> +			fput(listener_f);
> +			put_unused_fd(listener);
> +		} else {
> +			fd_install(listener, listener_f);
> +			ret = listener;
> +		}
> +	}
>  out_free:
>  	seccomp_filter_free(prepared);
>  	return ret;
> @@ -915,6 +1117,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
>  	case SECCOMP_RET_LOG:
>  	case SECCOMP_RET_ALLOW:
>  		break;
> +	case SECCOMP_RET_USER_NOTIF:
> +		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
> +			break;

Let's add a "/* fall through */" comment here to support the ongoing
effort of marking these sorts of cases in prep for
-Wimplicit-fallthrough.
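
Roughly what I have in mind, as a standalone sketch. The enum and the
user_notif_enabled parameter are hypothetical; in the patch the check is
IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION):

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical action codes, for illustration only. */
enum action { ACTION_ALLOW, ACTION_LOG, ACTION_USER_NOTIF, ACTION_UNKNOWN };

/* Returns 0 if the action is available, -EOPNOTSUPP otherwise. */
static int action_avail(enum action a, int user_notif_enabled)
{
	switch (a) {
	case ACTION_ALLOW:
	case ACTION_LOG:
		break;
	case ACTION_USER_NOTIF:
		if (user_notif_enabled)
			break;
		/* fall through */
	default:
		return -EOPNOTSUPP;
	}

	return 0;
}
```

The comment sits directly before the next label, which is where the
warning looks for it.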

>  	default:
>  		return -EOPNOTSUPP;
>  	}
> @@ -1111,6 +1316,7 @@ long seccomp_get_metadata(struct task_struct *task,
>  #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
>  #define SECCOMP_RET_TRAP_NAME		"trap"
>  #define SECCOMP_RET_ERRNO_NAME		"errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
>  #define SECCOMP_RET_TRACE_NAME		"trace"
>  #define SECCOMP_RET_LOG_NAME		"log"
>  #define SECCOMP_RET_ALLOW_NAME		"allow"
> @@ -1120,6 +1326,7 @@ static const char seccomp_actions_avail[] =
>  				SECCOMP_RET_KILL_THREAD_NAME	" "
>  				SECCOMP_RET_TRAP_NAME		" "
>  				SECCOMP_RET_ERRNO_NAME		" "
> +				SECCOMP_RET_USER_NOTIF_NAME     " "
>  				SECCOMP_RET_TRACE_NAME		" "
>  				SECCOMP_RET_LOG_NAME		" "
>  				SECCOMP_RET_ALLOW_NAME;
> @@ -1137,6 +1344,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
>  	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
>  	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
>  	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> +	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },

Probably best to keep this list in order. Can you stick it in front of
the element for TRACE?

>  	{ }
>  };
>  
> @@ -1342,3 +1550,244 @@ static int __init seccomp_sysctl_init(void)
>  device_initcall(seccomp_sysctl_init)
>  
>  #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	struct seccomp_knotif *knotif;
> +
> +	mutex_lock(&filter->notify_lock);
> +
> +	/*
> +	 * If this file is being closed because e.g. the task who owned it
> +	 * died, let's wake everyone up who was waiting on us.
> +	 */
> +	list_for_each_entry(knotif, &filter->notifications, list) {
> +		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> +			continue;
> +
> +		knotif->state = SECCOMP_NOTIFY_REPLIED;
> +		knotif->error = -ENOSYS;

ENOSYS seems odd to me since the functionality is implemented. Is EIO
more appropriate? It feels like it better matches an EIO from read(2),
for example.

> +		knotif->val = 0;
> +
> +		complete(&knotif->ready);
> +	}
> +
> +	wake_up_all(&filter->wqh);
> +	filter->has_listener = false;
> +	mutex_unlock(&filter->notify_lock);
> +	__put_seccomp_filter(filter);
> +	return 0;
> +}
> +
> +static long seccomp_notify_recv(struct seccomp_filter *filter,
> +				unsigned long arg)
> +{
> +	struct seccomp_knotif *knotif = NULL, *cur;
> +	struct seccomp_notif unotif = {};
> +	ssize_t ret;
> +	u16 size;
> +	void __user *buf = (void __user *)arg;
> +
> +	if (copy_from_user(&size, buf, sizeof(size)))
> +		return -EFAULT;
> +
> +	ret = down_interruptible(&filter->request);
> +	if (ret < 0)
> +		return ret;
> +
> +	mutex_lock(&filter->notify_lock);
> +	list_for_each_entry(cur, &filter->notifications, list) {
> +		if (cur->state == SECCOMP_NOTIFY_INIT) {
> +			knotif = cur;
> +			break;
> +		}
> +	}
> +
> +	/*
> +	 * If we didn't find a notification, it could be that the task was
> +	 * interrupted between the time we were woken and when we were able to
> +	 * acquire the rw lock.
> +	 */
> +	if (!knotif) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	size = min_t(size_t, size, sizeof(unotif));
> +
> +	unotif.len = size;
> +	unotif.id = knotif->id;
> +	unotif.pid = pid_vnr(knotif->pid);
> +	unotif.signalled = knotif->signalled;
> +	unotif.data = *(knotif->data);
> +
> +	if (copy_to_user(buf, &unotif, size)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = sizeof(unotif);

I would have thought that ret = size since only size bytes are copied.
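
A userspace sketch of the behavior I would expect. The struct and the
helper name are made up for the example, and memcpy() stands in for
copy_to_user():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct seccomp_notif. */
struct notif {
	unsigned short len;
	unsigned long long id;
	unsigned int pid;
};

/* Copy at most dst_len bytes and report how many were actually copied. */
static long copy_notif(void *dst, size_t dst_len, const struct notif *src)
{
	size_t size = dst_len < sizeof(*src) ? dst_len : sizeof(*src);

	memcpy(dst, src, size);
	return (long)size;	/* not sizeof(*src): only size bytes went out */
}
```

A caller with a short buffer then learns that it received a truncated
structure instead of being told that the full one was written.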

> +	knotif->state = SECCOMP_NOTIFY_SENT;
> +	wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
> +
> +out:
> +	mutex_unlock(&filter->notify_lock);
> +	return ret;
> +}
> +
> +static long seccomp_notify_send(struct seccomp_filter *filter,
> +				unsigned long arg)
> +{
> +	struct seccomp_notif_resp resp = {};
> +	struct seccomp_knotif *knotif = NULL;
> +	long ret;
> +	u16 size;
> +	void __user *buf = (void __user *)arg;
> +
> +	if (copy_from_user(&size, buf, sizeof(size)))
> +		return -EFAULT;
> +	size = min_t(size_t, size, sizeof(resp));
> +	if (copy_from_user(&resp, buf, size))
> +		return -EFAULT;
> +
> +	ret = mutex_lock_interruptible(&filter->notify_lock);
> +	if (ret < 0)
> +		return ret;
> +
> +	list_for_each_entry(knotif, &filter->notifications, list) {
> +		if (knotif->id == resp.id)
> +			break;
> +	}
> +
> +	if (!knotif || knotif->id != resp.id) {
> +		ret = -EINVAL;

ENOENT here instead? It clearly conveys that there is no notification
matching the requested ID. We'll probably have a more ambiguous error
path that we can use to abuse EINVAL. :)

> +		goto out;
> +	}
> +
> +	/* Allow exactly one reply. */
> +	if (knotif->state != SECCOMP_NOTIFY_SENT) {
> +		ret = -EINPROGRESS;
> +		goto out;
> +	}
> +
> +	ret = size;
> +	knotif->state = SECCOMP_NOTIFY_REPLIED;
> +	knotif->error = resp.error;
> +	knotif->val = resp.val;
> +	complete(&knotif->ready);
> +out:
> +	mutex_unlock(&filter->notify_lock);
> +	return ret;
> +}
> +
> +static long seccomp_notify_is_id_valid(struct seccomp_filter *filter,
> +				       unsigned long arg)
> +{
> +	struct seccomp_knotif *knotif = NULL;
> +	void __user *buf = (void __user *)arg;
> +	u64 id;
> +
> +	if (copy_from_user(&id, buf, sizeof(id)))
> +		return -EFAULT;
> +
> +	list_for_each_entry(knotif, &filter->notifications, list) {
> +		if (knotif->id == id)
> +			return 1;
> +	}
> +
> +	return 0;

I understand the desire to return 1 from
ioctl(fd, SECCOMP_NOTIF_IS_ID_VALID, id) when id is valid but it goes
against the common case where a syscall returns 0 on success. Also, the
ioctl_list(2) man page states:

 Decent ioctls return 0 on success and -1 on error, ...

The only suggestion that I have here is to change the ioctl name to
SECCOMP_NOTIF_VALID_ID (or similar) and return 0 if the id is valid and
-EINVAL if the id is invalid. I don't feel strongly about it so take it
or leave it.

> +}
> +
> +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> +				 unsigned long arg)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +
> +	switch (cmd) {
> +	case SECCOMP_NOTIF_RECV:
> +		return seccomp_notify_recv(filter, arg);
> +	case SECCOMP_NOTIF_SEND:
> +		return seccomp_notify_send(filter, arg);
> +	case SECCOMP_NOTIF_IS_ID_VALID:
> +		return seccomp_notify_is_id_valid(filter, arg);
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +static __poll_t seccomp_notify_poll(struct file *file,
> +				    struct poll_table_struct *poll_tab)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	__poll_t ret = 0;
> +	struct seccomp_knotif *cur;
> +
> +	poll_wait(file, &filter->wqh, poll_tab);
> +
> +	ret = mutex_lock_interruptible(&filter->notify_lock);
> +	if (ret < 0)
> +		return ret;
> +
> +	list_for_each_entry(cur, &filter->notifications, list) {
> +		if (cur->state == SECCOMP_NOTIFY_INIT)
> +			ret |= EPOLLIN | EPOLLRDNORM;
> +		if (cur->state == SECCOMP_NOTIFY_SENT)
> +			ret |= EPOLLOUT | EPOLLWRNORM;
> +		if (ret & EPOLLIN && ret & EPOLLOUT)
> +			break;
> +	}
> +
> +	mutex_unlock(&filter->notify_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +	.poll = seccomp_notify_poll,
> +	.release = seccomp_notify_release,
> +	.unlocked_ioctl = seccomp_notify_ioctl,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> +				  struct seccomp_filter *filter)
> +{
> +	struct file *ret = ERR_PTR(-EBUSY);
> +	struct seccomp_filter *cur, *last_locked = NULL;
> +	int filter_nesting = 0;
> +
> +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +		mutex_lock_nested(&cur->notify_lock, filter_nesting);
> +		filter_nesting++;
> +		last_locked = cur;
> +		if (cur->has_listener)
> +			goto out;
> +	}
> +
> +	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +				 filter, O_RDWR);
> +	if (IS_ERR(ret))
> +		goto out;
> +
> +
> +	/* The file has a reference to it now */
> +	__get_seccomp_filter(filter);
> +	filter->has_listener = true;
> +
> +out:
> +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +		mutex_unlock(&cur->notify_lock);
> +		if (cur == last_locked)
> +			break;
> +	}
> +
> +	return ret;
> +}
> +#else
> +static struct file *init_listener(struct task_struct *task,
> +				  struct seccomp_filter *filter)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +#endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index e1473234968d..89f2c788a06b 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -5,6 +5,7 @@
>   * Test code for seccomp bpf.
>   */

[...]

I only gave the tests a quick review so far but nothing stood out.

I'm anxious to give this patch set some testing. I'll get to the other
patches soon.

Tyler


* Re: [PATCH v6 1/5] seccomp: add a return code to trap to userspace
  2018-09-06 22:15   ` Tyler Hicks
@ 2018-09-07 15:45     ` Tycho Andersen
  2018-09-08 20:35     ` Tycho Andersen
  1 sibling, 0 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-07 15:45 UTC (permalink / raw)
  To: Tyler Hicks
  Cc: Kees Cook, linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Akihiro Suda, Jann Horn

Hey Tyler,

On Thu, Sep 06, 2018 at 10:15:12PM +0000, Tyler Hicks wrote:
> > +Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
> > +notification fd to receive a ``struct seccomp_notif``, which contains five
> > +members: the input length of the structure, a globally unique ``id``, the
> 
> This documentation says that id is "globally unique" but an in-code
> comment below says "this is unique for this filter". IIUC, the id is
> only guaranteed to be unique for the filter so this documentation should
> be updated slightly to make it clear that the id is only global in those
> terms.

Yup, thanks.

> > +``pid`` of the task which triggered this request (which may be 0 if the task is
> > +in a pid ns not visible from the listener's pid namespace), a flag representing
> > +whether or not the notification is a result of a non-fatal signal, and the
> > +``data`` passed to seccomp. Userspace can then make a decision based on this
> > +information about what to do, and ``ioctl(SECCOMP_NOTIF_SEND)`` a response,
> > +indicating what should be returned to userspace. The ``id`` member of ``struct
> > +seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.
> > +
> > +It is worth noting that ``struct seccomp_data`` contains the values of register
> > +arguments to the syscall, but does not contain pointers to memory. The task's
> > +memory is accessible to suitably privileged tracers via ``ptrace()`` or
> > +``/proc/pid/map_files/``. However, care should be taken to avoid the TOCTOU
> > +mentioned above in this document: all arguments being read from the tracee's
> > +memory should be read into the tracer's memory before any policy decisions are
> > +made. This allows for an atomic decision on syscall arguments.
> > +
> >  Sysctls
> >  =======
> >  
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index 6801123932a5..42f3585d925d 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -419,6 +419,15 @@ config SECCOMP_FILTER
> >  
> >  	  See Documentation/userspace-api/seccomp_filter.rst for details.
> >  
> > +config SECCOMP_USER_NOTIFICATION
> 
> Did someone request a Kconfig option for this new feature? If not, I
> think that nuking the Kconfig option would reduce the test matrix. No
> other filter flags have their own build time option but maybe it makes
> sense in this case if this filter flag exposes the kernel to significant
> new attack surface since there's more to this than just a new filter
> flag.
> 
> If someone has a requirement to disable this feature, maybe it'd be
> better to leave the decision up to the distro *and* the admin via a
> sysctl instead of taking the admin out of the decision with a build
> time option.

No, there was no explicit request by anyone, I just did it so I
wouldn't offend anyone with this code. I'll drop it for the next
version.

> >  /**
> >   * struct seccomp_filter - container for seccomp BPF programs
> >   *
> > @@ -66,6 +114,30 @@ struct seccomp_filter {
> >  	bool log;
> >  	struct seccomp_filter *prev;
> >  	struct bpf_prog *prog;
> > +
> > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> > +	/*
> > +	 * A semaphore that users of this notification can wait on for
> > +	 * changes. Actual reads and writes are still controlled with
> > +	 * filter->notify_lock.
> > +	 */
> > +	struct semaphore request;
> > +
> > +	/* A lock for all notification-related accesses. */
> > +	struct mutex notify_lock;
> > +
> > +	/* Is there currently an attached listener? */
> > +	bool has_listener;
> > +
> > +	/* The id of the next request. */
> > +	u64 next_id;
> > +
> > +	/* A list of struct seccomp_knotif elements. */
> > +	struct list_head notifications;
> > +
> > +	/* A wait queue for poll. */
> > +	wait_queue_head_t wqh;
> > +#endif
> 
> I suspect that these additions would benefit from better struct packing
> since there could be a lot of seccomp_filter structs floating around in
> memory on a system with a large number of running containers or
> otherwise sandboxed processes.
> 
> IIRC, there's a 3 byte hole following the log member that could be used
> by has_listener, at least, and I'm not sure how the rest of the new
> members affect things.

Ok, I'll take a look.

> > +static void seccomp_do_user_notification(int this_syscall,
> > +					 struct seccomp_filter *match,
> > +					 const struct seccomp_data *sd)
> > +{
> > +	int err;
> > +	long ret = 0;
> > +	struct seccomp_knotif n = {};
> > +
> > +	mutex_lock(&match->notify_lock);
> > +	err = -ENOSYS;
> > +	if (!match->has_listener)
> > +		goto out;
> > +
> > +	n.pid = task_pid(current);
> > +	n.state = SECCOMP_NOTIFY_INIT;
> > +	n.data = sd;
> > +	n.id = seccomp_next_notify_id(match);
> > +	init_completion(&n.ready);
> > +
> > +	list_add(&n.list, &match->notifications);
> > +	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
> > +
> > +	mutex_unlock(&match->notify_lock);
> > +	up(&match->request);
> > +
> > +	err = wait_for_completion_interruptible(&n.ready);
> > +	mutex_lock(&match->notify_lock);
> > +
> > +	/*
> > +	 * Here it's possible we got a signal and then had to wait on the mutex
> > +	 * while the reply was sent, so let's be sure there wasn't a response
> > +	 * in the meantime.
> > +	 */
> > +	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > +		/*
> > +		 * We got a signal. Let's tell userspace about it (potentially
> > +		 * again, if we had already notified them about the first one).
> > +		 */
> > +		n.signalled = true;
> > +		if (n.state == SECCOMP_NOTIFY_SENT) {
> > +			n.state = SECCOMP_NOTIFY_INIT;
> > +			up(&match->request);
> > +		}
> > +		mutex_unlock(&match->notify_lock);
> 
> Is it intentional that you call mutex_unlock() followed by up() when
> initially waiting but then switch up the order before re-waiting here? I
> don't yet fully understand the locking but this looked odd to me.

No, not intentional. Assuming everything is correct, the order doesn't
matter here. It might be slightly better to do the up() after, since
then the woken task won't immediately sleep waiting on the mutex, but
who knows :)

> > +out_put_fd:
> > +	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +		if (ret < 0) {
> > +			fput(listener_f);
> > +			put_unused_fd(listener);
> > +		} else {
> > +			fd_install(listener, listener_f);
> > +			ret = listener;
> > +		}
> > +	}
> >  out_free:
> >  	seccomp_filter_free(prepared);
> >  	return ret;
> > @@ -915,6 +1117,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
> >  	case SECCOMP_RET_LOG:
> >  	case SECCOMP_RET_ALLOW:
> >  		break;
> > +	case SECCOMP_RET_USER_NOTIF:
> > +		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
> > +			break;
> 
> Let's add a "/* fall through */" comment here to support the ongoing
> effort of marking these sorts of cases in prep for
> -Wimplicit-fallthrough.

Will do.

> >  	default:
> >  		return -EOPNOTSUPP;
> >  	}
> > @@ -1111,6 +1316,7 @@ long seccomp_get_metadata(struct task_struct *task,
> >  #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
> >  #define SECCOMP_RET_TRAP_NAME		"trap"
> >  #define SECCOMP_RET_ERRNO_NAME		"errno"
> > +#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
> >  #define SECCOMP_RET_TRACE_NAME		"trace"
> >  #define SECCOMP_RET_LOG_NAME		"log"
> >  #define SECCOMP_RET_ALLOW_NAME		"allow"
> > @@ -1120,6 +1326,7 @@ static const char seccomp_actions_avail[] =
> >  				SECCOMP_RET_KILL_THREAD_NAME	" "
> >  				SECCOMP_RET_TRAP_NAME		" "
> >  				SECCOMP_RET_ERRNO_NAME		" "
> > +				SECCOMP_RET_USER_NOTIF_NAME     " "
> >  				SECCOMP_RET_TRACE_NAME		" "
> >  				SECCOMP_RET_LOG_NAME		" "
> >  				SECCOMP_RET_ALLOW_NAME;
> > @@ -1137,6 +1344,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
> >  	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
> >  	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
> >  	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> > +	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
> 
> Probably best to keep this list in order. Can you stick it in front of
> the element for TRACE?

Yep!

> > +	/*
> > +	 * If this file is being closed because e.g. the task who owned it
> > +	 * died, let's wake everyone up who was waiting on us.
> > +	 */
> > +	list_for_each_entry(knotif, &filter->notifications, list) {
> > +		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> > +			continue;
> > +
> > +		knotif->state = SECCOMP_NOTIFY_REPLIED;
> > +		knotif->error = -ENOSYS;
> 
> ENOSYS seems odd to me since the functionality is implemented. Is EIO
> more appropriate? It feels like it better matches an EIO from read(2),
> for example.

I copied the ENOSYS usage from SECCOMP_RET_TRACE: when there is no
tracer attached, it responds to any SECCOMP_RET_TRACE with ENOSYS.
Seems like keeping the same behavior here is useful.

> > +	unotif.len = size;
> > +	unotif.id = knotif->id;
> > +	unotif.pid = pid_vnr(knotif->pid);
> > +	unotif.signalled = knotif->signalled;
> > +	unotif.data = *(knotif->data);
> > +
> > +	if (copy_to_user(buf, &unotif, size)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	ret = sizeof(unotif);
> 
> I would have thought that ret = size since only size bytes are copied.

Yes, right you are.

> > +	knotif->state = SECCOMP_NOTIFY_SENT;
> > +	wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
> > +
> > +out:
> > +	mutex_unlock(&filter->notify_lock);
> > +	return ret;
> > +}
> > +
> > +static long seccomp_notify_send(struct seccomp_filter *filter,
> > +				unsigned long arg)
> > +{
> > +	struct seccomp_notif_resp resp = {};
> > +	struct seccomp_knotif *knotif = NULL;
> > +	long ret;
> > +	u16 size;
> > +	void __user *buf = (void __user *)arg;
> > +
> > +	if (copy_from_user(&size, buf, sizeof(size)))
> > +		return -EFAULT;
> > +	size = min_t(size_t, size, sizeof(resp));
> > +	if (copy_from_user(&resp, buf, size))
> > +		return -EFAULT;
> > +
> > +	ret = mutex_lock_interruptible(&filter->notify_lock);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	list_for_each_entry(knotif, &filter->notifications, list) {
> > +		if (knotif->id == resp.id)
> > +			break;
> > +	}
> > +
> > +	if (!knotif || knotif->id != resp.id) {
> > +		ret = -EINVAL;
> 
> ENOENT here instead? It clearly conveys that there is no notification
> matching the requested ID. We'll probably have a more ambiguous error
> path that we can use to abuse EINVAL. :)

Yes, will do :)

> > +		goto out;
> > +	}
> > +
> > +	/* Allow exactly one reply. */
> > +	if (knotif->state != SECCOMP_NOTIFY_SENT) {
> > +		ret = -EINPROGRESS;
> > +		goto out;
> > +	}
> > +
> > +	ret = size;
> > +	knotif->state = SECCOMP_NOTIFY_REPLIED;
> > +	knotif->error = resp.error;
> > +	knotif->val = resp.val;
> > +	complete(&knotif->ready);
> > +out:
> > +	mutex_unlock(&filter->notify_lock);
> > +	return ret;
> > +}
> > +
> > +static long seccomp_notify_is_id_valid(struct seccomp_filter *filter,
> > +				       unsigned long arg)
> > +{
> > +	struct seccomp_knotif *knotif = NULL;
> > +	void __user *buf = (void __user *)arg;
> > +	u64 id;
> > +
> > +	if (copy_from_user(&id, buf, sizeof(id)))
> > +		return -EFAULT;
> > +
> > +	list_for_each_entry(knotif, &filter->notifications, list) {
> > +		if (knotif->id == id)
> > +			return 1;
> > +	}
> > +
> > +	return 0;
> 
> I understand the desire to return 1 from
> ioctl(fd, SECCOMP_NOTIF_IS_ID_VALID, id) when id is valid but it goes
> against the common case where a syscall returns 0 on success. Also, the
> ioctl_list(2) man page states:
> 
>  Decent ioctls return 0 on success and -1 on error, ...
> 
> The only suggestion that I have here is to change the ioctl name to
> SECCOMP_NOTIF_VALID_ID (or similar) and return 0 if the id is valid and
> -EINVAL if the id is invalid. I don't feel strongly about it so take it
> or leave it.

Sure, will do.
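
A small sketch of the convention being suggested, as seen from both sides.
demo_check_id() is a hypothetical stand-in for the renamed ioctl handler, and
demo_wrap() models what a libc-style wrapper usually does with a raw kernel
return value; neither is code from the series.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical handler: 0 if the id is live, -EINVAL if it is not. */
static long demo_check_id(const uint64_t *live, int n, uint64_t id)
{
	for (int i = 0; i < n; i++)
		if (live[i] == id)
			return 0;
	return -EINVAL;
}

/* Model of a libc-style wrapper: negative values become -1 plus errno. */
static int demo_wrap(long raw)
{
	if (raw < 0) {
		errno = (int)-raw;
		return -1;
	}
	return (int)raw;
}
```

Userspace then writes the usual "if (ioctl(...) < 0)" check instead of
treating a return of 1 as the success case.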

> > +}
> > +
> > +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> > +				 unsigned long arg)
> > +{
> > +	struct seccomp_filter *filter = file->private_data;
> > +
> > +	switch (cmd) {
> > +	case SECCOMP_NOTIF_RECV:
> > +		return seccomp_notify_recv(filter, arg);
> > +	case SECCOMP_NOTIF_SEND:
> > +		return seccomp_notify_send(filter, arg);
> > +	case SECCOMP_NOTIF_IS_ID_VALID:
> > +		return seccomp_notify_is_id_valid(filter, arg);
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +}
> > +
> > +static __poll_t seccomp_notify_poll(struct file *file,
> > +				    struct poll_table_struct *poll_tab)
> > +{
> > +	struct seccomp_filter *filter = file->private_data;
> > +	__poll_t ret = 0;
> > +	struct seccomp_knotif *cur;
> > +
> > +	poll_wait(file, &filter->wqh, poll_tab);
> > +
> > +	ret = mutex_lock_interruptible(&filter->notify_lock);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	list_for_each_entry(cur, &filter->notifications, list) {
> > +		if (cur->state == SECCOMP_NOTIFY_INIT)
> > +			ret |= EPOLLIN | EPOLLRDNORM;
> > +		if (cur->state == SECCOMP_NOTIFY_SENT)
> > +			ret |= EPOLLOUT | EPOLLWRNORM;
> > +		if (ret & EPOLLIN && ret & EPOLLOUT)
> > +			break;
> > +	}
> > +
> > +	mutex_unlock(&filter->notify_lock);
> > +
> > +	return ret;
> > +}
> > +
> > +static const struct file_operations seccomp_notify_ops = {
> > +	.poll = seccomp_notify_poll,
> > +	.release = seccomp_notify_release,
> > +	.unlocked_ioctl = seccomp_notify_ioctl,
> > +};
> > +
> > +static struct file *init_listener(struct task_struct *task,
> > +				  struct seccomp_filter *filter)
> > +{
> > +	struct file *ret = ERR_PTR(-EBUSY);
> > +	struct seccomp_filter *cur, *last_locked = NULL;
> > +	int filter_nesting = 0;
> > +
> > +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> > +		mutex_lock_nested(&cur->notify_lock, filter_nesting);
> > +		filter_nesting++;
> > +		last_locked = cur;
> > +		if (cur->has_listener)
> > +			goto out;
> > +	}
> > +
> > +	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> > +				 filter, O_RDWR);
> > +	if (IS_ERR(ret))
> > +		goto out;
> > +
> > +
> > +	/* The file has a reference to it now */
> > +	__get_seccomp_filter(filter);
> > +	filter->has_listener = true;
> > +
> > +out:
> > +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> > +		mutex_unlock(&cur->notify_lock);
> > +		if (cur == last_locked)
> > +			break;
> > +	}
> > +
> > +	return ret;
> > +}
> > +#else
> > +static struct file *init_listener(struct task_struct *task,
> > +				  struct seccomp_filter *filter)
> > +{
> > +	return ERR_PTR(-EINVAL);
> > +}
> > +#endif
> > diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> > index e1473234968d..89f2c788a06b 100644
> > --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> > +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> > @@ -5,6 +5,7 @@
> >   * Test code for seccomp bpf.
> >   */
> 
> [...]
> 
> I only gave the tests a quick review so far but nothing stood out.
> 
> I'm anxious to give this patch set some testing. I'll get to the other
> patches soon.

Thanks!

Tycho


* Re: [PATCH v6 1/5] seccomp: add a return code to trap to userspace
  2018-09-06 22:15   ` Tyler Hicks
  2018-09-07 15:45     ` Tycho Andersen
@ 2018-09-08 20:35     ` Tycho Andersen
  1 sibling, 0 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-08 20:35 UTC (permalink / raw)
  To: Tyler Hicks
  Cc: Kees Cook, linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Akihiro Suda, Jann Horn

On Thu, Sep 06, 2018 at 10:15:12PM +0000, Tyler Hicks wrote:
> On 2018-09-06 09:28:55, Tycho Andersen wrote:
> >  /**
> >   * struct seccomp_filter - container for seccomp BPF programs
> >   *
> > @@ -66,6 +114,30 @@ struct seccomp_filter {
> >  	bool log;
> >  	struct seccomp_filter *prev;
> >  	struct bpf_prog *prog;
> > +
> > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> > +	/*
> > +	 * A semaphore that users of this notification can wait on for
> > +	 * changes. Actual reads and writes are still controlled with
> > +	 * filter->notify_lock.
> > +	 */
> > +	struct semaphore request;
> > +
> > +	/* A lock for all notification-related accesses. */
> > +	struct mutex notify_lock;
> > +
> > +	/* Is there currently an attached listener? */
> > +	bool has_listener;
> > +
> > +	/* The id of the next request. */
> > +	u64 next_id;
> > +
> > +	/* A list of struct seccomp_knotif elements. */
> > +	struct list_head notifications;
> > +
> > +	/* A wait queue for poll. */
> > +	wait_queue_head_t wqh;
> > +#endif
> 
> I suspect that these additions would benefit from better struct packing
> since there could be a lot of seccomp_filter structs floating around in
> memory on a system with a large number of running containers or
> otherwise sandboxed processes.
> 
> IIRC, there's a 3 byte hole following the log member that could be used
> by has_listener, at least, and I'm not sure how the rest of the new
> members affect things.

So it turns out the additions are fairly major. Previously,
sizeof(struct seccomp_filter) == 24 bytes on x86_64, with the three
byte hole you mentioned.

The new members' actual sizes alone are:

sizeof(struct semaphore) request == 80
sizeof(struct mutex) notify_lock == 128
sizeof(struct list_head) notifications == 16
sizeof(wait_queue_head_t) wqh == 72

Adding the base types of next_id (8 bytes) and has_listener (1 byte)
gives a grand total of 305 additional bytes, assuming it's packed
perfectly. That seems like quite a huge hit for everyone to endure,
especially since it won't be perfectly packed.

Instead, what if we add a struct notification, and a struct
notification* to struct seccomp_filter? Then we can drop the bool
has_listener because we can use a null test, and the 304 bytes are
only paid by people who actually use this feature (as well as the cost
of an additional indirection, but who cares, they're trapping to
userspace anyway). Unless I hear any objections, I'll do this for v7
:)
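
A sketch of the two layouts, under stated assumptions: the opaque demo_*
types only reproduce the x86_64 sizes quoted above, the filter structs carry
just the members visible in the quoted hunk, and "struct demo_notification"
is a placeholder for whatever the new struct ends up being called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-ins sized per the x86_64 figures quoted in the mail. */
struct demo_semaphore  { unsigned char b[80];  };
struct demo_mutex      { unsigned char b[128]; };
struct demo_list_head  { unsigned char b[16];  };
struct demo_wait_queue { unsigned char b[72];  };

/* Placeholder for the proposed struct: everything except the bool. */
struct demo_notification {
	struct demo_semaphore request;
	struct demo_mutex notify_lock;
	uint64_t next_id;
	struct demo_list_head notifications;
	struct demo_wait_queue wqh;
};

/* Layout as posted in v6: members embedded in every filter. */
struct demo_filter_embedded {
	bool log;
	bool has_listener;
	struct demo_filter_embedded *prev;
	void *prog;
	struct demo_notification notif;
};

/* Proposed layout: one pointer per filter; NULL means "no listener". */
struct demo_filter_indirect {
	bool log;
	struct demo_filter_indirect *prev;
	void *prog;
	struct demo_notification *notif;
};

/* 80 + 128 + 8 + 16 + 72 */
_Static_assert(sizeof(struct demo_notification) == 304,
	       "model should match the 304 bytes quoted above");

static bool demo_has_listener(const struct demo_filter_indirect *f)
{
	return f->notif != NULL;
}
```

Filters that never use the feature then pay for one pointer instead of the
whole block.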

Tycho


* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-06 18:30       ` Tycho Andersen
@ 2018-09-10 17:00         ` Jann Horn
  2018-09-11 20:29           ` Tycho Andersen
  0 siblings, 1 reply; 38+ messages in thread
From: Jann Horn @ 2018-09-10 17:00 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro

On Thu, Sep 6, 2018 at 8:30 PM Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 06, 2018 at 10:22:46AM -0600, Tycho Andersen wrote:
> > On Thu, Sep 06, 2018 at 06:15:18PM +0200, Jann Horn wrote:
> > > On Thu, Sep 6, 2018 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > The idea here is that the userspace handler should be able to pass an fd
> > > > back to the trapped task, for example so it can be returned from socket().
> > > [...]
> > > > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> > > > index d1498885c1c7..1c0aab306426 100644
> > > > --- a/Documentation/userspace-api/seccomp_filter.rst
> > > > +++ b/Documentation/userspace-api/seccomp_filter.rst
> > > > @@ -235,6 +235,9 @@ The interface for a seccomp notification fd consists of two structures:
> > > >          __u64 id;
> > > >          __s32 error;
> > > >          __s64 val;
> > > > +        __u8 return_fd;
> > > > +        __u32 fd;
> > > > +        __u32 fd_flags;
> > >
> > > Normally, syscalls that take an optional file descriptor accept a
> > > signed 32-bit number, with -1 standing for "no file descriptor". Is
> > > there a reason why this uses a separate variable to signal whether an
> > > fd was provided?
> >
> > No real reason other than I looked at the bpf code and they were using
> > __u32 for bpf (but I think in their case the fd args are not
> > optional). I'll switch it to __s32/-1 for the next version.
>
> Oh, I think there is a reason actually: since this is an API addition,
> the "0" value needs to be the previously default behavior if userspace
> doesn't specify it. Since the previously default behavior was not to
> return an fd, and we want to allow fd == 0, we need the extra flag to
> make this work.
>
> This is really only a problem because we're introducing this stuff in
> a second patch (mostly to illustrate how extending the response
> structure would work). I could fold this into the first patch if we
> want, or we could keep the return_fd bits if the illustration is
> useful.

I feel like adding extra struct fields just so that it is possible to
write programs against the intermediate new API between two kernel
commits is taking things a bit far.


* Re: [PATCH v6 2/5] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  2018-09-06 15:28 ` [PATCH v6 2/5] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
@ 2018-09-11 10:25   ` kbuild test robot
  0 siblings, 0 replies; 38+ messages in thread
From: kbuild test robot @ 2018-09-11 10:25 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: kbuild-all, Kees Cook, linux-kernel, containers, linux-api,
	Andy Lutomirski, Oleg Nesterov, Eric W . Biederman,
	Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda,
	Jann Horn, Tycho Andersen

[-- Attachment #1: Type: text/plain, Size: 11798 bytes --]

Hi Tycho,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.19-rc3]
[cannot apply to next-20180910]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Tycho-Andersen/seccomp-trap-to-userspace/20180907-130604
config: sparc64-allyesconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=sparc64 

All errors (new ones prefixed by >>):

   kernel/seccomp.c: In function 'get_nth_filter':
>> kernel/seccomp.c:1211:2: error: implicit declaration of function '__get_seccomp_filter'; did you mean 'get_seccomp_filter'? [-Werror=implicit-function-declaration]
     __get_seccomp_filter(orig);
     ^~~~~~~~~~~~~~~~~~~~
     get_seccomp_filter
>> kernel/seccomp.c:1215:45: error: dereferencing pointer to incomplete type 'struct seccomp_filter'
     for (filter = orig; filter; filter = filter->prev)
                                                ^~
>> kernel/seccomp.c:1235:2: error: implicit declaration of function '__put_seccomp_filter'; did you mean 'put_seccomp_filter'? [-Werror=implicit-function-declaration]
     __put_seccomp_filter(orig);
     ^~~~~~~~~~~~~~~~~~~~
     put_seccomp_filter
   kernel/seccomp.c: At top level:
>> kernel/seccomp.c:1240:6: error: redefinition of 'seccomp_get_filter'
    long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
         ^~~~~~~~~~~~~~~~~~
   In file included from include/linux/sched.h:21:0,
                    from include/linux/audit.h:26,
                    from kernel/seccomp.c:18:
   include/linux/seccomp.h:103:20: note: previous definition of 'seccomp_get_filter' was here
    static inline long seccomp_get_filter(struct task_struct *task,
                       ^~~~~~~~~~~~~~~~~~
   kernel/seccomp.c: In function 'seccomp_get_filter':
>> kernel/seccomp.c:1266:13: error: dereferencing pointer to incomplete type 'struct sock_fprog_kern'
     ret = fprog->len;
                ^~
>> kernel/seccomp.c:1270:40: error: implicit declaration of function 'bpf_classic_proglen' [-Werror=implicit-function-declaration]
     if (copy_to_user(data, fprog->filter, bpf_classic_proglen(fprog)))
                                           ^~~~~~~~~~~~~~~~~~~
   kernel/seccomp.c: At top level:
>> kernel/seccomp.c:1278:6: error: redefinition of 'seccomp_get_metadata'
    long seccomp_get_metadata(struct task_struct *task,
         ^~~~~~~~~~~~~~~~~~~~
   In file included from include/linux/sched.h:21:0,
                    from include/linux/audit.h:26,
                    from kernel/seccomp.c:18:
   include/linux/seccomp.h:108:20: note: previous definition of 'seccomp_get_metadata' was here
    static inline long seccomp_get_metadata(struct task_struct *task,
                       ^~~~~~~~~~~~~~~~~~~~
   kernel/seccomp.c:1791:21: warning: 'init_listener' defined but not used [-Wunused-function]
    static struct file *init_listener(struct task_struct *task,
                        ^~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +1211 kernel/seccomp.c

f8e529ed Tycho Andersen 2015-10-27  1190  
4f037b54 Tycho Andersen 2018-09-06  1191  #if defined(CONFIG_CHECKPOINT_RESTORE) || \
4f037b54 Tycho Andersen 2018-09-06  1192  	defined(CONFIG_SECCOMP_USER_NOTIFICATION)
f06eae83 Tycho Andersen 2017-10-11  1193  static struct seccomp_filter *get_nth_filter(struct task_struct *task,
f06eae83 Tycho Andersen 2017-10-11  1194  					     unsigned long filter_off)
f8e529ed Tycho Andersen 2015-10-27  1195  {
f06eae83 Tycho Andersen 2017-10-11  1196  	struct seccomp_filter *orig, *filter;
f06eae83 Tycho Andersen 2017-10-11  1197  	unsigned long count;
f8e529ed Tycho Andersen 2015-10-27  1198  
f06eae83 Tycho Andersen 2017-10-11  1199  	/*
f06eae83 Tycho Andersen 2017-10-11  1200  	 * Note: this is only correct because the caller should be the (ptrace)
f06eae83 Tycho Andersen 2017-10-11  1201  	 * tracer of the task, otherwise lock_task_sighand is needed.
f06eae83 Tycho Andersen 2017-10-11  1202  	 */
f8e529ed Tycho Andersen 2015-10-27  1203  	spin_lock_irq(&task->sighand->siglock);
f06eae83 Tycho Andersen 2017-10-11  1204  
f8e529ed Tycho Andersen 2015-10-27  1205  	if (task->seccomp.mode != SECCOMP_MODE_FILTER) {
f06eae83 Tycho Andersen 2017-10-11  1206  		spin_unlock_irq(&task->sighand->siglock);
f06eae83 Tycho Andersen 2017-10-11  1207  		return ERR_PTR(-EINVAL);
f8e529ed Tycho Andersen 2015-10-27  1208  	}
f8e529ed Tycho Andersen 2015-10-27  1209  
f06eae83 Tycho Andersen 2017-10-11  1210  	orig = task->seccomp.filter;
f06eae83 Tycho Andersen 2017-10-11 @1211  	__get_seccomp_filter(orig);
f06eae83 Tycho Andersen 2017-10-11  1212  	spin_unlock_irq(&task->sighand->siglock);
f06eae83 Tycho Andersen 2017-10-11  1213  
f06eae83 Tycho Andersen 2017-10-11  1214  	count = 0;
f06eae83 Tycho Andersen 2017-10-11 @1215  	for (filter = orig; filter; filter = filter->prev)
f8e529ed Tycho Andersen 2015-10-27  1216  		count++;
f8e529ed Tycho Andersen 2015-10-27  1217  
f8e529ed Tycho Andersen 2015-10-27  1218  	if (filter_off >= count) {
f06eae83 Tycho Andersen 2017-10-11  1219  		filter = ERR_PTR(-ENOENT);
f8e529ed Tycho Andersen 2015-10-27  1220  		goto out;
f8e529ed Tycho Andersen 2015-10-27  1221  	}
f8e529ed Tycho Andersen 2015-10-27  1222  
f06eae83 Tycho Andersen 2017-10-11  1223  	count -= filter_off;
f06eae83 Tycho Andersen 2017-10-11  1224  	for (filter = orig; filter && count > 1; filter = filter->prev)
f8e529ed Tycho Andersen 2015-10-27  1225  		count--;
f8e529ed Tycho Andersen 2015-10-27  1226  
f8e529ed Tycho Andersen 2015-10-27  1227  	if (WARN_ON(count != 1 || !filter)) {
f06eae83 Tycho Andersen 2017-10-11  1228  		filter = ERR_PTR(-ENOENT);
f8e529ed Tycho Andersen 2015-10-27  1229  		goto out;
f8e529ed Tycho Andersen 2015-10-27  1230  	}
f8e529ed Tycho Andersen 2015-10-27  1231  
f06eae83 Tycho Andersen 2017-10-11  1232  	__get_seccomp_filter(filter);
f06eae83 Tycho Andersen 2017-10-11  1233  
f06eae83 Tycho Andersen 2017-10-11  1234  out:
f06eae83 Tycho Andersen 2017-10-11 @1235  	__put_seccomp_filter(orig);
f06eae83 Tycho Andersen 2017-10-11  1236  	return filter;
f06eae83 Tycho Andersen 2017-10-11  1237  }
f06eae83 Tycho Andersen 2017-10-11  1238  
4f037b54 Tycho Andersen 2018-09-06  1239  #if defined(CONFIG_CHECKPOINT_RESTORE)
f06eae83 Tycho Andersen 2017-10-11 @1240  long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
f06eae83 Tycho Andersen 2017-10-11  1241  			void __user *data)
f06eae83 Tycho Andersen 2017-10-11  1242  {
f06eae83 Tycho Andersen 2017-10-11  1243  	struct seccomp_filter *filter;
f06eae83 Tycho Andersen 2017-10-11  1244  	struct sock_fprog_kern *fprog;
f06eae83 Tycho Andersen 2017-10-11  1245  	long ret;
f06eae83 Tycho Andersen 2017-10-11  1246  
f06eae83 Tycho Andersen 2017-10-11  1247  	if (!capable(CAP_SYS_ADMIN) ||
f06eae83 Tycho Andersen 2017-10-11  1248  	    current->seccomp.mode != SECCOMP_MODE_DISABLED) {
f06eae83 Tycho Andersen 2017-10-11  1249  		return -EACCES;
f06eae83 Tycho Andersen 2017-10-11  1250  	}
f06eae83 Tycho Andersen 2017-10-11  1251  
f06eae83 Tycho Andersen 2017-10-11  1252  	filter = get_nth_filter(task, filter_off);
f06eae83 Tycho Andersen 2017-10-11  1253  	if (IS_ERR(filter))
f06eae83 Tycho Andersen 2017-10-11  1254  		return PTR_ERR(filter);
f06eae83 Tycho Andersen 2017-10-11  1255  
f8e529ed Tycho Andersen 2015-10-27  1256  	fprog = filter->prog->orig_prog;
f8e529ed Tycho Andersen 2015-10-27  1257  	if (!fprog) {
470bf1f2 Mickaël Salaün 2016-03-24  1258  		/* This must be a new non-cBPF filter, since we save
f8e529ed Tycho Andersen 2015-10-27  1259  		 * every cBPF filter's orig_prog above when
f8e529ed Tycho Andersen 2015-10-27  1260  		 * CONFIG_CHECKPOINT_RESTORE is enabled.
f8e529ed Tycho Andersen 2015-10-27  1261  		 */
f8e529ed Tycho Andersen 2015-10-27  1262  		ret = -EMEDIUMTYPE;
f8e529ed Tycho Andersen 2015-10-27  1263  		goto out;
f8e529ed Tycho Andersen 2015-10-27  1264  	}
f8e529ed Tycho Andersen 2015-10-27  1265  
f8e529ed Tycho Andersen 2015-10-27 @1266  	ret = fprog->len;
f8e529ed Tycho Andersen 2015-10-27  1267  	if (!data)
f8e529ed Tycho Andersen 2015-10-27  1268  		goto out;
f8e529ed Tycho Andersen 2015-10-27  1269  
f8e529ed Tycho Andersen 2015-10-27 @1270  	if (copy_to_user(data, fprog->filter, bpf_classic_proglen(fprog)))
f8e529ed Tycho Andersen 2015-10-27  1271  		ret = -EFAULT;
f8e529ed Tycho Andersen 2015-10-27  1272  
f8e529ed Tycho Andersen 2015-10-27  1273  out:
66a733ea Oleg Nesterov  2017-09-27  1274  	__put_seccomp_filter(filter);
f8e529ed Tycho Andersen 2015-10-27  1275  	return ret;
f8e529ed Tycho Andersen 2015-10-27  1276  }
f8e529ed Tycho Andersen 2015-10-27  1277  
26500475 Tycho Andersen 2017-10-11 @1278  long seccomp_get_metadata(struct task_struct *task,
26500475 Tycho Andersen 2017-10-11  1279  			  unsigned long size, void __user *data)
26500475 Tycho Andersen 2017-10-11  1280  {
26500475 Tycho Andersen 2017-10-11  1281  	long ret;
26500475 Tycho Andersen 2017-10-11  1282  	struct seccomp_filter *filter;
26500475 Tycho Andersen 2017-10-11  1283  	struct seccomp_metadata kmd = {};
26500475 Tycho Andersen 2017-10-11  1284  
26500475 Tycho Andersen 2017-10-11  1285  	if (!capable(CAP_SYS_ADMIN) ||
26500475 Tycho Andersen 2017-10-11  1286  	    current->seccomp.mode != SECCOMP_MODE_DISABLED) {
26500475 Tycho Andersen 2017-10-11  1287  		return -EACCES;
26500475 Tycho Andersen 2017-10-11  1288  	}
26500475 Tycho Andersen 2017-10-11  1289  
26500475 Tycho Andersen 2017-10-11  1290  	size = min_t(unsigned long, size, sizeof(kmd));
26500475 Tycho Andersen 2017-10-11  1291  
63bb0045 Tycho Andersen 2018-02-20  1292  	if (size < sizeof(kmd.filter_off))
63bb0045 Tycho Andersen 2018-02-20  1293  		return -EINVAL;
63bb0045 Tycho Andersen 2018-02-20  1294  
63bb0045 Tycho Andersen 2018-02-20  1295  	if (copy_from_user(&kmd.filter_off, data, sizeof(kmd.filter_off)))
26500475 Tycho Andersen 2017-10-11  1296  		return -EFAULT;
26500475 Tycho Andersen 2017-10-11  1297  
26500475 Tycho Andersen 2017-10-11  1298  	filter = get_nth_filter(task, kmd.filter_off);
26500475 Tycho Andersen 2017-10-11  1299  	if (IS_ERR(filter))
26500475 Tycho Andersen 2017-10-11  1300  		return PTR_ERR(filter);
26500475 Tycho Andersen 2017-10-11  1301  
26500475 Tycho Andersen 2017-10-11  1302  	if (filter->log)
26500475 Tycho Andersen 2017-10-11  1303  		kmd.flags |= SECCOMP_FILTER_FLAG_LOG;
26500475 Tycho Andersen 2017-10-11  1304  
26500475 Tycho Andersen 2017-10-11  1305  	ret = size;
26500475 Tycho Andersen 2017-10-11  1306  	if (copy_to_user(data, &kmd, size))
26500475 Tycho Andersen 2017-10-11  1307  		ret = -EFAULT;
26500475 Tycho Andersen 2017-10-11  1308  
26500475 Tycho Andersen 2017-10-11  1309  	__put_seccomp_filter(filter);
f8e529ed Tycho Andersen 2015-10-27  1310  	return ret;
f8e529ed Tycho Andersen 2015-10-27  1311  }
4f037b54 Tycho Andersen 2018-09-06  1312  #endif /* CONFIG_CHECKPOINT_RESTORE */
4f037b54 Tycho Andersen 2018-09-06  1313  #endif /* CONFIG_SECCOMP_FILTER */
8e5f1ad1 Tyler Hicks    2017-08-11  1314  
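
One plausible reading of the errors above, offered as an inference since the
report does not state a cause: the guard at line 1191 no longer depends on
the seccomp filter option, so a config with checkpoint/restore enabled and
seccomp filters disabled compiles get_nth_filter() without struct
seccomp_filter being defined. The sketch below models that with made-up
DEMO_* macros; the "guarded" form is an assumed repair, not something
proposed in this thread.

```c
#include <assert.h>

/* Model of the suspected failing config: C/R on, seccomp filter off. */
#define DEMO_CHECKPOINT_RESTORE 1
/* DEMO_SECCOMP_FILTER and DEMO_USER_NOTIFICATION intentionally undefined. */

/* Guard shaped like the one at line 1191 of the listing. */
#if defined(DEMO_CHECKPOINT_RESTORE) || defined(DEMO_USER_NOTIFICATION)
#define DEMO_HELPER_BUILT_AS_POSTED 1
#else
#define DEMO_HELPER_BUILT_AS_POSTED 0
#endif

/* Assumed repair: additionally require the filter option. */
#if defined(DEMO_SECCOMP_FILTER) && \
	(defined(DEMO_CHECKPOINT_RESTORE) || defined(DEMO_USER_NOTIFICATION))
#define DEMO_HELPER_BUILT_GUARDED 1
#else
#define DEMO_HELPER_BUILT_GUARDED 0
#endif

/* Report whether the helper would be compiled under each guard. */
static int demo_helper_built(int guarded)
{
	return guarded ? DEMO_HELPER_BUILT_GUARDED : DEMO_HELPER_BUILT_AS_POSTED;
}
```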

:::::: The code at line 1211 was first introduced by commit
:::::: f06eae831f0c1fc5b982ea200daf552810e1dd55 seccomp: hoist out filter resolving logic

:::::: TO: Tycho Andersen <tycho@docker.com>
:::::: CC: Kees Cook <keescook@chromium.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 55342 bytes --]


* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-10 17:00         ` Jann Horn
@ 2018-09-11 20:29           ` Tycho Andersen
  0 siblings, 0 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-11 20:29 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro

On Mon, Sep 10, 2018 at 07:00:43PM +0200, Jann Horn wrote:
> On Thu, Sep 6, 2018 at 8:30 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, Sep 06, 2018 at 10:22:46AM -0600, Tycho Andersen wrote:
> > > On Thu, Sep 06, 2018 at 06:15:18PM +0200, Jann Horn wrote:
> > > > On Thu, Sep 6, 2018 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > The idea here is that the userspace handler should be able to pass an fd
> > > > > back to the trapped task, for example so it can be returned from socket().
> > > > [...]
> > > > > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> > > > > index d1498885c1c7..1c0aab306426 100644
> > > > > --- a/Documentation/userspace-api/seccomp_filter.rst
> > > > > +++ b/Documentation/userspace-api/seccomp_filter.rst
> > > > > @@ -235,6 +235,9 @@ The interface for a seccomp notification fd consists of two structures:
> > > > >          __u64 id;
> > > > >          __s32 error;
> > > > >          __s64 val;
> > > > > +        __u8 return_fd;
> > > > > +        __u32 fd;
> > > > > +        __u32 fd_flags;
> > > >
> > > > Normally, syscalls that take an optional file descriptor accept a
> > > > signed 32-bit number, with -1 standing for "no file descriptor". Is
> > > > there a reason why this uses a separate variable to signal whether an
> > > > fd was provided?
> > >
> > > No real reason other than I looked at the bpf code and they were using
> > > __u32 for bpf (but I think in their case the fd args are not
> > > optional). I'll switch it to __s32/-1 for the next version.
> >
> > Oh, I think there is a reason actually: since this is an API addition,
> > the "0" value needs to be the previously default behavior if userspace
> > doesn't specify it. Since the previously default behavior was not to
> > return an fd, and we want to allow fd == 0, we need the extra flag to
> > make this work.
> >
> > This is really only a problem because we're introducing this stuff in
> > a second patch (mostly to illustrate how extending the response
> > structure would work). I could fold this into the first patch if we
> > want, or we could keep the return_fd bits if the illustration is
> > useful.
> 
> I feel like adding extra struct fields just so that it is possible to
> write programs against the intermediate new API between two kernel
> commits is taking things a bit far.

Yep, I tend to agree with you. I'll fold the whole thing into the
first patch for the next version.
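
For anyone following the zero-default argument, a minimal sketch of the two
encodings. The demo_* structs and helpers are illustrative: demo_resp_flag
mirrors the fields quoted above, demo_resp_sentinel is the __s32/-1 variant,
and demo_copy_in() models a kernel that zero-fills whatever an older, shorter
userspace struct did not supply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Response without any fd fields (what an "old" caller would send). */
struct demo_resp_v1 {
	uint64_t id;
	int32_t error;
	int64_t val;
};

/* Extended with a flag: an all-zero tail means "no fd". */
struct demo_resp_flag {
	uint64_t id;
	int32_t error;
	int64_t val;
	uint8_t return_fd;
	uint32_t fd;
	uint32_t fd_flags;
};

/* Extended with a sentinel: fd == -1 means "no fd". */
struct demo_resp_sentinel {
	uint64_t id;
	int32_t error;
	int64_t val;
	int32_t fd;
	uint32_t fd_flags;
};

/* Zero-fill the kernel copy, then take only what the caller supplied. */
static void demo_copy_in(void *dst, size_t dst_size,
			 const void *src, size_t src_size)
{
	assert(src_size <= dst_size);
	memset(dst, 0, dst_size);
	memcpy(dst, src, src_size);
}

static bool demo_flag_wants_fd(const struct demo_resp_flag *r)
{
	return r->return_fd != 0;
}

static bool demo_sentinel_wants_fd(const struct demo_resp_sentinel *r)
{
	return r->fd != -1;
}
```

An old-sized response copied into the sentinel variant reads as "install
fd 0", which is the problem described above. It goes away once the fd fields
are part of the first version of the struct, since there is then no older
layout to stay compatible with.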

Tycho


* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-06 15:28 ` [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen
  2018-09-06 16:15   ` Jann Horn
@ 2018-09-12 23:52   ` Andy Lutomirski
  2018-09-13  9:25     ` Tycho Andersen
                       ` (2 more replies)
  1 sibling, 3 replies; 38+ messages in thread
From: Andy Lutomirski @ 2018-09-12 23:52 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> The idea here is that the userspace handler should be able to pass an fd
> back to the trapped task, for example so it can be returned from socket().
>
> I've proposed one API here, but I'm open to other options. In particular,
> this only lets you return an fd from a syscall, which may not be enough in
> all cases. For example, if an fd is written to an output parameter instead
> of returned, the current API can't handle this. Another case is that
> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> ever decides to install an fd and output it, we wouldn't be able to handle
> this either.

An alternative could be to have an API (an ioctl on the listener,
perhaps) that just copies an fd into the tracee.  There would be the
obvious set of options: do we replace an existing fd or allocate a new
one, and is it CLOEXEC.  Then the tracer could add an fd and then
return it just like it's a regular number.

I feel like this would be more flexible and conceptually simpler, but
maybe a little slower for the common cases.  What do you think?
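
The option set mentioned above (replace vs. allocate, CLOEXEC or not) can be
illustrated with a same-process analogy. This is not the proposed ioctl,
which would act on another task's fd table and does not exist in this
series; demo_install_fd() is a hypothetical helper built only from dup(),
dup2() and fcntl().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * target < 0:  allocate the lowest free descriptor number.
 * target >= 0: replace whatever is open at that number.
 * Returns the new descriptor, or a negative errno value on failure.
 */
static int demo_install_fd(int src, int target, bool cloexec)
{
	int fd = target < 0 ? dup(src) : dup2(src, target);

	if (fd < 0)
		return -errno;
	if (cloexec && fcntl(fd, F_SETFD, FD_CLOEXEC) < 0) {
		int err = errno;

		close(fd);
		return -err;
	}
	return fd;
}
```

The handler would then put the resulting number into the ordinary return
value. Note that the sketch sets CLOEXEC in a second step; a real interface
would presumably want that to be atomic.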


* Re: [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace
  2018-09-06 15:28 ` [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
  2018-09-06 15:45   ` Jann Horn
@ 2018-09-13  0:00   ` Andy Lutomirski
  2018-09-13  9:24     ` Tycho Andersen
  2018-10-17  7:25     ` Michael Tirado
  1 sibling, 2 replies; 38+ messages in thread
From: Andy Lutomirski @ 2018-09-13  0:00 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> version which can acquire filters is useful. There are at least two reasons
> this is preferable, even though it uses ptrace:
>
> 1. You can control tasks that aren't cooperating with you
> 2. You can control tasks whose filters block sendmsg() and socket(); if the
>    task installs a filter which blocks these calls, there's no way with
>    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.

Hmm.  I contemplated this a bit and looked at your example a bit, and
I have a few thoughts:

 - What happens if you nest code like your sample?  That is, if you
are already in some container that is seccomped and there's a
listener, can you even run your sample?

 - Is there any association between the filter layer that uses the
USER_NOTIF return and the listener?  How would this API express such a
relationship?

I realize that my dream of how this should all work requires eBPF and
BPF_CALL, so it may not be viable right now, but I'd like a better
understanding of how this all fits together.

Also, I think that it's not strictly true that a filter that blocks
sendmsg() is problematic.  You could clone a thread, call seccomp() in
that thread, then get a listener, then execve().  Or we could have a
seccomp() mode that adds a filter but only kicks in after execve().
The latter could be generally useful.

--Andy


* Re: [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace
  2018-09-13  0:00   ` Andy Lutomirski
@ 2018-09-13  9:24     ` Tycho Andersen
  2018-10-17  7:25     ` Michael Tirado
  1 sibling, 0 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-13  9:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Wed, Sep 12, 2018 at 05:00:54PM -0700, Andy Lutomirski wrote:
> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > version which can acquire filters is useful. There are at least two reasons
> > this is preferable, even though it uses ptrace:
> >
> > 1. You can control tasks that aren't cooperating with you
> > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> >    task installs a filter which blocks these calls, there's no way with
> >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> 
> Hmm.  I contemplated this a bit and looked at your example a bit, and
> I have a few thoughts:
> 
>  - What happens if you nest code like your sample?  That is, if you
> are already in some container that is seccomped and there's a
> listener, can you even run your sample?

You can attach a filter with SECCOMP_RET_USER_NOTIF, but you can't
attach a listener to any such filter as long as there is another
listener somewhere in the chain (I disallowed this based on some
feedback you sent earlier, it's an artificial restriction).
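
A simplified model of that restriction, loosely following init_listener()
from patch 1. Locking, the anon inode and refcounting are left out, and the
demo_* names are illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_filter {
	bool has_listener;
	struct demo_filter *prev;	/* older filter in the chain */
};

/*
 * Attach a listener to `target`, refusing if any filter reachable from
 * `newest` already has one. Returns 0 on success, -EBUSY otherwise.
 */
static int demo_attach_listener(struct demo_filter *newest,
				struct demo_filter *target)
{
	for (struct demo_filter *cur = newest; cur; cur = cur->prev)
		if (cur->has_listener)
			return -EBUSY;

	target->has_listener = true;
	return 0;
}
```

So a nested copy of the sample can still install its filter, but its request
for a listener fails with EBUSY if an outer filter already has one.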

>  - Is there any association between the filter layer that uses the
> USER_NOTIF return and the listener?  How would this API express such a
> relationship?

Not sure exactly what you're asking here. There is the struct file*,
but there could be many threads listening to it.

> I realize that my dream of how this should all work requires eBPF and
> BPF_CALL, so it may not be viable right now, but I'd like a better
> understanding of how this all fits together.
> 
> Also, I think that it's not strictly true that a filter that blocks
> sendmsg() is problematic.  You could clone a thread, call seccomp() in
> that thread, then get a listener, then execve().  Or we could have a
> seccomp() mode that adds a filter but only kicks in after execve().
> The latter could be generally useful.

Yes, in fact some of the test code works this way. However, the case I
was thinking of is the way a typical container manager works: it does
some initial setup, clone()s the init task, does some final setup,
loads the seccomp profile, and exec()s the container's init binary.
There's no way for this container to get its seccomp fd back out of
the container to the host if sendmsg() is blocked.

Tycho
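
For readers following along, the fd handoff that a filter blocking
sendmsg() would prevent can be sketched as below. This is a minimal
userspace illustration of SCM_RIGHTS passing over a unix socket, not
code from the series; send_fd(), recv_fd() and fd_roundtrip() are
hypothetical helper names chosen for this sketch.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send fd as SCM_RIGHTS ancillary data; returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
	char byte = 'x';	/* at least one byte of real data must go along */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	/* This is the call a restrictive seccomp profile may forbid. */
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(); returns the new fd or -1 on error. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Pass the read end of a pipe across a socketpair and read through the copy. */
int fd_roundtrip(void)
{
	int sk[2], p[2], got;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sk) < 0 || pipe(p) < 0)
		return -1;
	if (send_fd(sk[0], p[0]) < 0)
		return -1;
	got = recv_fd(sk[1]);
	if (got < 0)
		return -1;
	if (write(p[1], "k", 1) != 1 || read(got, &c, 1) != 1)
		return -1;

	close(got); close(p[0]); close(p[1]); close(sk[0]); close(sk[1]);
	return c == 'k' ? 0 : -1;
}
```

In the container-manager flow described above, send_fd() would have to
run inside the container after the profile is loaded, which is exactly
where a filter denying sendmsg() gets in the way.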


* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-12 23:52   ` Andy Lutomirski
@ 2018-09-13  9:25     ` Tycho Andersen
  2018-09-13  9:42     ` Aleksa Sarai
  2018-09-19  9:55     ` Tycho Andersen
  2 siblings, 0 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-13  9:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > The idea here is that the userspace handler should be able to pass an fd
> > back to the trapped task, for example so it can be returned from socket().
> >
> > I've proposed one API here, but I'm open to other options. In particular,
> > this only lets you return an fd from a syscall, which may not be enough in
> > all cases. For example, if an fd is written to an output parameter instead
> > of returned, the current API can't handle this. Another case is that
> > netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> > ever decides to install an fd and output it, we wouldn't be able to handle
> > this either.
> 
> An alternative could be to have an API (an ioctl on the listener,
> perhaps) that just copies an fd into the tracee.  There would be the
> obvious set of options: do we replace an existing fd or allocate a new
> one, and is it CLOEXEC.  Then the tracer could add an fd and then
> return it just like it's a regular number.
> 
> I feel like this would be more flexible and conceptually simpler, but
> maybe a little slower for the common cases.  What do you think?

Yes, I like this. It also future (or current-) proofs the API against
instances where we return an FD in a structure and not via the return
code of the syscall.

Tycho


* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-12 23:52   ` Andy Lutomirski
  2018-09-13  9:25     ` Tycho Andersen
@ 2018-09-13  9:42     ` Aleksa Sarai
  2018-09-19  9:55     ` Tycho Andersen
  2 siblings, 0 replies; 38+ messages in thread
From: Aleksa Sarai @ 2018-09-13  9:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tycho Andersen, Kees Cook, Jann Horn, Linux API,
	Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W . Biederman, Christian Brauner


On 2018-09-12, Andy Lutomirski <luto@amacapital.net> wrote:
> > The idea here is that the userspace handler should be able to pass an fd
> > back to the trapped task, for example so it can be returned from socket().
> >
> > I've proposed one API here, but I'm open to other options. In particular,
> > this only lets you return an fd from a syscall, which may not be enough in
> > all cases. For example, if an fd is written to an output parameter instead
> > of returned, the current API can't handle this. Another case is that
> > netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> > ever decides to install an fd and output it, we wouldn't be able to handle
> > this either.
> 
> An alternative could be to have an API (an ioctl on the listener,
> perhaps) that just copies an fd into the tracee.  There would be the
> obvious set of options: do we replace an existing fd or allocate a new
> one, and is it CLOEXEC.  Then the tracer could add an fd and then
> return it just like it's a regular number.
> 
> I feel like this would be more flexible and conceptually simpler, but
> maybe a little slower for the common cases.  What do you think?

When we first discussed this I sent a (probably somewhat broken) patch
for "dup4" which would let you inject a file descriptor into a different
process -- I still think that having a method for injecting a file
descriptor without needing ptrace (and SCM_RIGHTS) shenanigans would be
generally useful. (With "dup4" you have a more obvious API for flags and
whether you allocate a new fd or use a specific one.)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-12 23:52   ` Andy Lutomirski
  2018-09-13  9:25     ` Tycho Andersen
  2018-09-13  9:42     ` Aleksa Sarai
@ 2018-09-19  9:55     ` Tycho Andersen
  2018-09-19 14:19       ` Andy Lutomirski
  2 siblings, 1 reply; 38+ messages in thread
From: Tycho Andersen @ 2018-09-19  9:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > The idea here is that the userspace handler should be able to pass an fd
> > back to the trapped task, for example so it can be returned from socket().
> >
> > I've proposed one API here, but I'm open to other options. In particular,
> > this only lets you return an fd from a syscall, which may not be enough in
> > all cases. For example, if an fd is written to an output parameter instead
> > of returned, the current API can't handle this. Another case is that
> > netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> > ever decides to install an fd and output it, we wouldn't be able to handle
> > this either.
> 
> An alternative could be to have an API (an ioctl on the listener,
> perhaps) that just copies an fd into the tracee.  There would be the
> obvious set of options: do we replace an existing fd or allocate a new
> one, and is it CLOEXEC.  Then the tracer could add an fd and then
> return it just like it's a regular number.
> 
> I feel like this would be more flexible and conceptually simpler, but
> maybe a little slower for the common cases.  What do you think?

I'm just implementing this now, and there's one question: when do we
actually do the fd install? Should we do it when the user calls
SECCOMP_NOTIF_PUT_FD, or when the actual response is sent? It feels
like we should do it when the response is sent, instead of doing it
right when SECCOMP_NOTIF_PUT_FD is called, since if there's a
subsequent signal and the tracer decides to discard the response,
we'll have to implement some delete mechanism to delete the fd, but it
would have already been visible to the process, etc. So I'll go
forward with this unless there are strong objections, but I thought
I'd point it out just to avoid another round trip.

Tycho


* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-19  9:55     ` Tycho Andersen
@ 2018-09-19 14:19       ` Andy Lutomirski
  2018-09-19 14:38         ` Tycho Andersen
  0 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2018-09-19 14:19 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn



> On Sep 19, 2018, at 2:55 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> 
>> On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
>>> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>>> The idea here is that the userspace handler should be able to pass an fd
>>> back to the trapped task, for example so it can be returned from socket().
>>> 
>>> I've proposed one API here, but I'm open to other options. In particular,
>>> this only lets you return an fd from a syscall, which may not be enough in
>>> all cases. For example, if an fd is written to an output parameter instead
>>> of returned, the current API can't handle this. Another case is that
>>> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
>>> ever decides to install an fd and output it, we wouldn't be able to handle
>>> this either.
>> 
>> An alternative could be to have an API (an ioctl on the listener,
>> perhaps) that just copies an fd into the tracee.  There would be the
>> obvious set of options: do we replace an existing fd or allocate a new
>> one, and is it CLOEXEC.  Then the tracer could add an fd and then
>> return it just like it's a regular number.
>> 
>> I feel like this would be more flexible and conceptually simpler, but
>> maybe a little slower for the common cases.  What do you think?
> 
> I'm just implementing this now, and there's one question: when do we
> actually do the fd install? Should we do it when the user calls
> SECCOMP_NOTIF_PUT_FD, or when the actual response is sent? It feels
> like we should do it when the response is sent, instead of doing it
> right when SECCOMP_NOTIF_PUT_FD is called, since if there's a
> subsequent signal and the tracer decides to discard the response,
> we'll have to implement some delete mechanism to delete the fd, but it
> would have already been visible to the process, etc. So I'll go
> forward with this unless there are strong objections, but I thought
> I'd point it out just to avoid another round trip.
> 
> 

Can you do that non-racily?  That is, you need to commit to an fd *number* right away, but what if another thread uses the number before you actually install the fd?

Do we really allow non-“kill” signals to interrupt the whole process?  It might be the case that we don’t really need to clean up from signals if there’s a guarantee that the thread dies.


* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-19 14:19       ` Andy Lutomirski
@ 2018-09-19 14:38         ` Tycho Andersen
  2018-09-19 19:58           ` Andy Lutomirski
  0 siblings, 1 reply; 38+ messages in thread
From: Tycho Andersen @ 2018-09-19 14:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Wed, Sep 19, 2018 at 07:19:56AM -0700, Andy Lutomirski wrote:
> 
> 
> > On Sep 19, 2018, at 2:55 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > 
> >> On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
> >>> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >>> The idea here is that the userspace handler should be able to pass an fd
> >>> back to the trapped task, for example so it can be returned from socket().
> >>> 
> >>> I've proposed one API here, but I'm open to other options. In particular,
> >>> this only lets you return an fd from a syscall, which may not be enough in
> >>> all cases. For example, if an fd is written to an output parameter instead
> >>> of returned, the current API can't handle this. Another case is that
> >>> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> >>> ever decides to install an fd and output it, we wouldn't be able to handle
> >>> this either.
> >> 
> >> An alternative could be to have an API (an ioctl on the listener,
> >> perhaps) that just copies an fd into the tracee.  There would be the
> >> obvious set of options: do we replace an existing fd or allocate a new
> >> one, and is it CLOEXEC.  Then the tracer could add an fd and then
> >> return it just like it's a regular number.
> >> 
> >> I feel like this would be more flexible and conceptually simpler, but
> >> maybe a little slower for the common cases.  What do you think?
> > 
> > I'm just implementing this now, and there's one question: when do we
> > actually do the fd install? Should we do it when the user calls
> > SECCOMP_NOTIF_PUT_FD, or when the actual response is sent? It feels
> > like we should do it when the response is sent, instead of doing it
> > right when SECCOMP_NOTIF_PUT_FD is called, since if there's a
> > subsequent signal and the tracer decides to discard the response,
> > we'll have to implement some delete mechanism to delete the fd, but it
> > would have already been visible to the process, etc. So I'll go
> > forward with this unless there are strong objections, but I thought
> > I'd point it out just to avoid another round trip.
> > 
> > 
> 
> Can you do that non-racily?  That is, you need to commit to an fd *number* right away, but what if another thread uses the number before you actually install the fd?

I was thinking we could just do an __alloc_fd() and then do the
fd_install() when the response is sent, or clean up in the case that
the listener or task dies. I haven't actually tried to run the code
yet, so it's possible the locking won't work :)

> Do we really allow non-“kill” signals to interrupt the whole process?  It might be the case that we don’t really need to clean up from signals if there’s a guarantee that the thread dies.

Yes, we do, because of this: https://lkml.org/lkml/2018/3/15/1122

I could change that to just be a killable wait, though; I don't have
strong opinions about it and several people have commented that the
code is kind of weird.

Tycho
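
The reserve-now, install-later idea discussed above can be modelled
with a toy table. This is a simplified single-threaded sketch with no
locking, not the kernel's fdtable; every toy_* name is hypothetical.
toy_reserve(), toy_install() and toy_release() loosely mirror the roles
of __alloc_fd(), fd_install() and put_unused_fd().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FDS 8

enum slot_state { SLOT_FREE, SLOT_RESERVED, SLOT_INSTALLED };

struct toy_fdtable {
	enum slot_state state[TOY_MAX_FDS];
	const void *file[TOY_MAX_FDS];
};

/* Reserve the lowest free number: busy, but no file behind it yet. */
int toy_reserve(struct toy_fdtable *t)
{
	for (int i = 0; i < TOY_MAX_FDS; i++) {
		if (t->state[i] == SLOT_FREE) {
			t->state[i] = SLOT_RESERVED;
			return i;
		}
	}
	return -1;
}

/* Commit a reservation: the number now refers to file. */
int toy_install(struct toy_fdtable *t, int fd, const void *file)
{
	if (fd < 0 || fd >= TOY_MAX_FDS || t->state[fd] != SLOT_RESERVED)
		return -1;
	t->file[fd] = file;
	t->state[fd] = SLOT_INSTALLED;
	return 0;
}

/* Abandon a reservation that was never installed. */
int toy_release(struct toy_fdtable *t, int fd)
{
	if (fd < 0 || fd >= TOY_MAX_FDS || t->state[fd] != SLOT_RESERVED)
		return -1;
	t->state[fd] = SLOT_FREE;
	return 0;
}

/* Lookup sees nothing at a free or merely reserved number. */
const void *toy_lookup(const struct toy_fdtable *t, int fd)
{
	if (fd < 0 || fd >= TOY_MAX_FDS || t->state[fd] != SLOT_INSTALLED)
		return NULL;
	return t->file[fd];
}
```

In this model a reserved number is never handed to a second caller, so
committing to the number early does not by itself race with other
allocations. What the model leaves open is who calls toy_release() when
the reserving side goes away, which is the cleanup question raised in
the thread.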


* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-19 14:38         ` Tycho Andersen
@ 2018-09-19 19:58           ` Andy Lutomirski
  2018-09-20 23:42             ` Tycho Andersen
  0 siblings, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2018-09-19 19:58 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Wed, Sep 19, 2018 at 7:38 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> On Wed, Sep 19, 2018 at 07:19:56AM -0700, Andy Lutomirski wrote:
>>
>>
>> > On Sep 19, 2018, at 2:55 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> >
>> >> On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
>> >>> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> >>> The idea here is that the userspace handler should be able to pass an fd
>> >>> back to the trapped task, for example so it can be returned from socket().
>> >>>
>> >>> I've proposed one API here, but I'm open to other options. In particular,
>> >>> this only lets you return an fd from a syscall, which may not be enough in
>> >>> all cases. For example, if an fd is written to an output parameter instead
>> >>> of returned, the current API can't handle this. Another case is that
>> >>> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
>> >>> ever decides to install an fd and output it, we wouldn't be able to handle
>> >>> this either.
>> >>
>> >> An alternative could be to have an API (an ioctl on the listener,
>> >> perhaps) that just copies an fd into the tracee.  There would be the
>> >> obvious set of options: do we replace an existing fd or allocate a new
>> >> one, and is it CLOEXEC.  Then the tracer could add an fd and then
>> >> return it just like it's a regular number.
>> >>
>> >> I feel like this would be more flexible and conceptually simpler, but
>> >> maybe a little slower for the common cases.  What do you think?
>> >
>> > I'm just implementing this now, and there's one question: when do we
>> > actually do the fd install? Should we do it when the user calls
>> > SECCOMP_NOTIF_PUT_FD, or when the actual response is sent? It feels
>> > like we should do it when the response is sent, instead of doing it
>> > right when SECCOMP_NOTIF_PUT_FD is called, since if there's a
>> > subsequent signal and the tracer decides to discard the response,
>> > we'll have to implement some delete mechanism to delete the fd, but it
>> > would have already been visible to the process, etc. So I'll go
>> > forward with this unless there are strong objections, but I thought
>> > I'd point it out just to avoid another round trip.
>> >
>> >
>>
>> Can you do that non-racily?  That is, you need to commit to an fd *number* right away, but what if another thread uses the number before you actually install the fd?
>
> I was thinking we could just do an __alloc_fd() and then do the
> fd_install() when the response is sent or clean up the case that the
> listener or task dies. I haven't actually tried to run the code yet,
> so it's possible the locking won't work :)

I would be very surprised if the locking works.  How can you run a
thread in a process when another thread has allocated but not
installed an fd and is blocked for an arbitrarily long time?

>
>> Do we really allow non-“kill” signals to interrupt the whole process?  It might be the case that we don’t really need to clean up from signals if there’s a guarantee that the thread dies.
>
> Yes, we do, because of this: https://lkml.org/lkml/2018/3/15/1122
>

I'm still not sure I see the problem.  Suppose I'm implementing a user
notifier for a nasty syscall like recvmsg().  If I'm the tracer, by
the time I decide to install an fd, I've committed to returning
something other than -EINTR, even if a non-fatal signal is sent before
I finish.  No rollback should be necessary.

In the (unlikely?) event that some tracer needs to be able to rollback
an fd installation to return -EINTR, a SECCOMP_NOTIF_CLOSE_FD
operation should be good enough, I think.  Or maybe PUT_FD can put -1
to delete an fd.

--Andy


* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-19 19:58           ` Andy Lutomirski
@ 2018-09-20 23:42             ` Tycho Andersen
  2018-09-21  2:18               ` Andy Lutomirski
  0 siblings, 1 reply; 38+ messages in thread
From: Tycho Andersen @ 2018-09-20 23:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W . Biederman, Serge E . Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Wed, Sep 19, 2018 at 12:58:20PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 19, 2018 at 7:38 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > On Wed, Sep 19, 2018 at 07:19:56AM -0700, Andy Lutomirski wrote:
> >>
> >>
> >> > On Sep 19, 2018, at 2:55 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> >
> >> >> On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
> >> >>> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> >>> The idea here is that the userspace handler should be able to pass an fd
> >> >>> back to the trapped task, for example so it can be returned from socket().
> >> >>>
> >> >>> I've proposed one API here, but I'm open to other options. In particular,
> >> >>> this only lets you return an fd from a syscall, which may not be enough in
> >> >>> all cases. For example, if an fd is written to an output parameter instead
> >> >>> of returned, the current API can't handle this. Another case is that
> >> >>> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> >> >>> ever decides to install an fd and output it, we wouldn't be able to handle
> >> >>> this either.
> >> >>
> >> >> An alternative could be to have an API (an ioctl on the listener,
> >> >> perhaps) that just copies an fd into the tracee.  There would be the
> >> >> obvious set of options: do we replace an existing fd or allocate a new
> >> >> one, and is it CLOEXEC.  Then the tracer could add an fd and then
> >> >> return it just like it's a regular number.
> >> >>
> >> >> I feel like this would be more flexible and conceptually simpler, but
> >> >> maybe a little slower for the common cases.  What do you think?
> >> >
> >> > I'm just implementing this now, and there's one question: when do we
> >> > actually do the fd install? Should we do it when the user calls
> >> > SECCOMP_NOTIF_PUT_FD, or when the actual response is sent? It feels
> >> > like we should do it when the response is sent, instead of doing it
> >> > right when SECCOMP_NOTIF_PUT_FD is called, since if there's a
> >> > subsequent signal and the tracer decides to discard the response,
> >> > we'll have to implement some delete mechanism to delete the fd, but it
> >> > would have already been visible to the process, etc. So I'll go
> >> > forward with this unless there are strong objections, but I thought
> >> > I'd point it out just to avoid another round trip.
> >> >
> >> >
> >>
> >> Can you do that non-racily?  That is, you need to commit to an fd *number* right away, but what if another thread uses the number before you actually install the fd?
> >
> > I was thinking we could just do an __alloc_fd() and then do the
> > fd_install() when the response is sent or clean up the case that the
> > listener or task dies. I haven't actually tried to run the code yet,
> > so it's possible the locking won't work :)
> 
> I would be very surprised if the locking works.  How can you run a
> thread in a process when another thread has allocated but not
> installed an fd and is blocked for an arbitrarily long time?

I think the trick is that there's no actual locking required (except
for a brief locking of task->files). I've run the patch below and it
seems to work. But perhaps that's abusing __alloc_fd a little too
hard, I don't really know.

> >
> >> Do we really allow non-“kill” signals to interrupt the whole process?  It might be the case that we don’t really need to clean up from signals if there’s a guarantee that the thread dies.
> >
> > Yes, we do, because of this: https://lkml.org/lkml/2018/3/15/1122
> >
> 
> I'm still not sure I see the problem.  Suppose I'm implementing a user
> notifier for a nasty syscall like recvmsg().  If I'm the tracer, by
> the time I decide to install an fd, I've committed to returning
> something other than -EINTR, even if a non-fatal signal is sent before
> I finish.  No rollback should be necessary.

I don't understand why this is true. Surely you could stop a handler
on receipt of a new signal, and have it do something else entirely?

> In the (unlikely?) event that some tracer needs to be able to rollback
> an fd installation to return -EINTR, a SECCOMP_NOTIF_CLOSE_FD
> operation should be good enough, I think.  Or maybe PUT_FD can put -1
> to delete an fd.

Yes, I think even with something like what I did below we'd need some
sort of REMOVE_FD option, because otherwise there's no way to change
your mind and send -EINTR without the fd you just PUT_FD'd.

Tycho


From bfca7337cb53791aca74b595eb45e9afa3babac2 Mon Sep 17 00:00:00 2001
From: Tycho Andersen <tycho@tycho.ws>
Date: Thu, 20 Sep 2018 06:49:49 -0600
Subject: [PATCH] implement SECCOMP_NOTIF_PUT_FD ioctl

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
---
 include/uapi/linux/seccomp.h                  |  8 ++
 kernel/seccomp.c                              | 74 ++++++++++++-------
 tools/testing/selftests/seccomp/seccomp_bpf.c | 24 +++++-
 3 files changed, 77 insertions(+), 29 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 8fb2c024c0a1..62e474c372d4 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -80,6 +80,12 @@ struct seccomp_notif_resp {
 	__u32 fd_flags;
 };
 
+struct seccomp_notif_put_fd {
+	__u64 id;
+	__s32 fd;
+	__u32 fd_flags;
+};
+
 #define SECCOMP_IOC_MAGIC		0xF7
 
 /* Flags for seccomp notification fd ioctl. */
@@ -89,5 +95,7 @@ struct seccomp_notif_resp {
 					struct seccomp_notif_resp)
 #define SECCOMP_NOTIF_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
 					__u64)
+#define SECCOMP_NOTIF_PUT_FD	_IOR(SECCOMP_IOC_MAGIC, 3,	\
+					struct seccomp_notif_put_fd)
 
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 21b24cc07237..6bdf413863ca 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -41,6 +41,7 @@
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 #include <linux/anon_inodes.h>
+#include <linux/fdtable.h>
 #include <net/cls_cgroup.h>
 
 enum notify_state {
@@ -51,6 +52,7 @@ enum notify_state {
 
 struct seccomp_knotif {
 	/* The struct pid of the task whose filter triggered the notification */
+	struct task_struct *task;
 	struct pid *pid;
 
 	/* The "cookie" for this request; this is unique for this filter. */
@@ -80,7 +82,7 @@ struct seccomp_knotif {
 	int error;
 	long val;
 	struct file *file;
-	unsigned int flags;
+	int fd;
 
 	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
 	struct completion ready;
@@ -748,9 +750,11 @@ static void seccomp_do_user_notification(int this_syscall,
 	if (!match->notif)
 		goto out;
 
+	n.task = current;
 	n.pid = task_pid(current);
 	n.state = SECCOMP_NOTIFY_INIT;
 	n.data = sd;
+	n.fd = -1;
 	n.id = seccomp_next_notify_id(match);
 	init_completion(&n.ready);
 
@@ -786,16 +790,8 @@ static void seccomp_do_user_notification(int this_syscall,
 	}
 
 	if (n.file) {
-		int fd;
 		struct socket *sock;
 
-		fd = get_unused_fd_flags(n.flags);
-		if (fd < 0) {
-			err = fd;
-			ret = -1;
-			goto remove_list;
-		}
-
 		/*
 		 * Similar to what SCM_RIGHTS does, let's re-set the cgroup
 		 * data to point ot the tracee's cgroups instead of the
@@ -807,21 +803,20 @@ static void seccomp_do_user_notification(int this_syscall,
 			sock_update_classid(&sock->sk->sk_cgrp_data);
 		}
 
-		ret = fd;
-		err = 0;
-
-		fd_install(fd, n.file);
+		fd_install(n.fd, n.file);
 		/* Don't fput, since fd has a reference now */
 		n.file = NULL;
-	} else {
-		ret = n.val;
-		err = n.error;
+		n.fd = -1;
 	}
 
+	ret = n.val;
+	err = n.error;
 
 remove_list:
 	if (n.file)
 		fput(n.file);
+	if (n.fd >= 0)
+		put_unused_fd(n.fd);
 
 	list_del(&n.list);
 out:
@@ -1683,15 +1678,6 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
 		goto out;
 	}
 
-	if (resp.return_fd) {
-		knotif->flags = resp.fd_flags;
-		knotif->file = fget(resp.fd);
-		if (!knotif->file) {
-			ret = -EBADF;
-			goto out;
-		}
-	}
-
 	ret = size;
 	knotif->state = SECCOMP_NOTIFY_REPLIED;
 	knotif->error = resp.error;
@@ -1731,6 +1717,42 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
 	return ret;
 }
 
+static long seccomp_notify_put_fd(struct seccomp_filter *filter,
+				  unsigned long arg)
+{
+	struct seccomp_notif_put_fd req;
+	void __user *buf = (void __user *)arg;
+	struct seccomp_knotif *knotif = NULL;
+	long ret;
+
+	if (copy_from_user(&req, buf, sizeof(req)))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	ret = -ENOENT;
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		unsigned long max_files;
+
+		if (knotif->id != req.id)
+			continue;
+
+		max_files = task_rlimit(knotif->task, RLIMIT_NOFILE);
+
+		knotif->file = fget(req.fd);
+		knotif->fd = __alloc_fd(knotif->task->files, 0, max_files,
+					req.fd_flags);
+
+		ret = knotif->fd;
+		break;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
 static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
 				 unsigned long arg)
 {
@@ -1743,6 +1765,8 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
 		return seccomp_notify_send(filter, arg);
 	case SECCOMP_NOTIF_ID_VALID:
 		return seccomp_notify_id_valid(filter, arg);
+	case SECCOMP_NOTIF_PUT_FD:
+		return seccomp_notify_put_fd(filter, arg);
 	default:
 		return -EINVAL;
 	}
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 3dec856717a7..ae8daf992231 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -169,6 +169,9 @@ struct seccomp_metadata {
 					struct seccomp_notif_resp)
 #define SECCOMP_NOTIF_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
 					__u64)
+#define SECCOMP_NOTIF_PUT_FD	_IOR(SECCOMP_IOC_MAGIC, 3,	\
+					struct seccomp_notif_put_fd)
+
 struct seccomp_notif {
 	__u16 len;
 	__u64 id;
@@ -186,6 +189,12 @@ struct seccomp_notif_resp {
 	__u32 fd;
 	__u32 fd_flags;
 };
+
+struct seccomp_notif_put_fd {
+	__u64 id;
+	__s32 fd;
+	__u32 fd_flags;
+};
 #endif
 
 #ifndef seccomp
@@ -3239,11 +3248,12 @@ TEST(get_user_notification_ptrace)
 TEST(user_notification_pass_fd)
 {
 	pid_t pid;
-	int status, listener;
+	int status, listener, fd;
 	int sk_pair[2];
 	char c;
 	struct seccomp_notif req = {};
 	struct seccomp_notif_resp resp = {};
+	struct seccomp_notif_put_fd putfd = {};
 	long ret;
 
 	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
@@ -3295,9 +3305,15 @@ TEST(user_notification_pass_fd)
 
 	resp.len = sizeof(resp);
 	resp.id = req.id;
-	resp.return_fd = 1;
-	resp.fd = sk_pair[1];
-	resp.fd_flags = 0;
+
+	putfd.id = req.id;
+	putfd.fd = sk_pair[1];
+	putfd.fd_flags = 0;
+
+	fd = ioctl(listener, SECCOMP_NOTIF_PUT_FD, &putfd);
+	EXPECT_GE(fd, 0);
+	resp.val = fd;
+
 	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
 	close(sk_pair[1]);
 
-- 
2.17.1
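
As a reading aid for the uapi hunk above, the following sketch decodes
what the SECCOMP_NOTIF_PUT_FD ioctl number encodes. The struct and
macro are copied from the patch; decode_ioctl(), struct ioc_fields and
the _IOW variant are hypothetical and exist only for comparison. By the
usual convention _IOR marks data flowing from the kernel to userspace
and _IOW the reverse, and the handler in the patch only does a
copy_from_user(), so the _IOW form may be the more conventional choice.

```c
#include <assert.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Copied from the patch above. */
struct seccomp_notif_put_fd {
	__u64 id;
	__s32 fd;
	__u32 fd_flags;
};

#define SECCOMP_IOC_MAGIC	0xF7
#define SECCOMP_NOTIF_PUT_FD	_IOR(SECCOMP_IOC_MAGIC, 3,	\
					struct seccomp_notif_put_fd)

/* Hypothetical alternative, for comparison only; not part of the patch. */
#define SECCOMP_NOTIF_PUT_FD_IOW	_IOW(SECCOMP_IOC_MAGIC, 3,	\
					struct seccomp_notif_put_fd)

struct ioc_fields {
	unsigned int dir;	/* _IOC_READ and/or _IOC_WRITE */
	unsigned int type;	/* the "magic" byte */
	unsigned int nr;	/* command number within that magic */
	unsigned int size;	/* sizeof() the argument struct */
};

/* Split an ioctl command number into its four encoded fields. */
struct ioc_fields decode_ioctl(unsigned int cmd)
{
	struct ioc_fields f = {
		.dir = _IOC_DIR(cmd),
		.type = _IOC_TYPE(cmd),
		.nr = _IOC_NR(cmd),
		.size = _IOC_SIZE(cmd),
	};

	return f;
}
```

The direction bits are advisory: the kernel does not enforce them, but
tools that decode ioctl numbers rely on them being accurate.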



* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-20 23:42             ` Tycho Andersen
@ 2018-09-21  2:18               ` Andy Lutomirski
  2018-09-21 13:39                 ` Tycho Andersen
  2018-09-25 12:53                 ` Tycho Andersen
  0 siblings, 2 replies; 38+ messages in thread
From: Andy Lutomirski @ 2018-09-21  2:18 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W. Biederman, Serge E. Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Thu, Sep 20, 2018 at 4:42 PM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Wed, Sep 19, 2018 at 12:58:20PM -0700, Andy Lutomirski wrote:
> > On Wed, Sep 19, 2018 at 7:38 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > > On Wed, Sep 19, 2018 at 07:19:56AM -0700, Andy Lutomirski wrote:
> > >>
> > >>
> > >> > On Sep 19, 2018, at 2:55 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > >> >
> > >> >> On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
> > >> >>> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > >> >>> The idea here is that the userspace handler should be able to pass an fd
> > >> >>> back to the trapped task, for example so it can be returned from socket().
> > >> >>>
> > >> >>> I've proposed one API here, but I'm open to other options. In particular,
> > >> >>> this only lets you return an fd from a syscall, which may not be enough in
> > >> >>> all cases. For example, if an fd is written to an output parameter instead
> > >> >>> of returned, the current API can't handle this. Another case is that
> > >> >>> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> > >> >>> ever decides to install an fd and output it, we wouldn't be able to handle
> > >> >>> this either.
> > >> >>
> > >> >> An alternative could be to have an API (an ioctl on the listener,
> > >> >> perhaps) that just copies an fd into the tracee.  There would be the
> > >> >> obvious set of options: do we replace an existing fd or allocate a new
> > >> >> one, and is it CLOEXEC.  Then the tracer could add an fd and then
> > >> >> return it just like it's a regular number.
> > >> >>
> > >> >> I feel like this would be more flexible and conceptually simpler, but
> > >> >> maybe a little slower for the common cases.  What do you think?
> > >> >
> > >> > I'm just implementing this now, and there's one question: when do we
> > >> > actually do the fd install? Should we do it when the user calls
> > >> > SECCOMP_NOTIF_PUT_FD, or when the actual response is sent? It feels
> > >> > like we should do it when the response is sent, instead of doing it
> > >> > right when SECCOMP_NOTIF_PUT_FD is called, since if there's a
> > >> > subsequent signal and the tracer decides to discard the response,
> > >> > we'll have to implement some delete mechanism to delete the fd, but it
> > >> > would have already been visible to the process, etc. So I'll go
> > >> > forward with this unless there are strong objections, but I thought
> > >> > I'd point it out just to avoid another round trip.
> > >> >
> > >> >
> > >>
> > >> Can you do that non-racily?  That is, you need to commit to an fd *number* right away, but what if another thread uses the number before you actually install the fd?
> > >
> > > I was thinking we could just do an __alloc_fd() and then do the
> > > fd_install() when the response is sent or clean up the case that the
> > > listener or task dies. I haven't actually tried to run the code yet,
> > > so it's possible the locking won't work :)
> >
> > I would be very surprised if the locking works.  How can you run a
> > thread in a process when another thread has allocated but not
> > installed an fd and is blocked for an arbitrarily long time?
>
> I think the trick is that there's no actual locking required (except
> for a brief locking of task->files). I've run the patch below and it
> seems to work. But perhaps that's abusing __alloc_fd a little too
> hard, I don't really know.
>

Hmm.  This makes me highly nervous.  If nothing else, what releases
the busy-but-not-open fd if the whole process aborts?

> > >
> > >> Do we really allow non-“kill” signals to interrupt the whole process?  It might be the case that we don’t really need to clean up from signals if there’s a guarantee that the thread dies.
> > >
> > > Yes, we do, because of this: https://lkml.org/lkml/2018/3/15/1122
> > >
> >
> > I'm still not sure I see the problem.  Suppose I'm implementing a user
> > notifier for a nasty syscall like recvmsg().  If I'm the tracer, by
> > the time I decide to install an fd, I've committed to returning
> > something other than -EINTR, even if a non-fatal signal is sent before
> > I finish.  No rollback should be necessary.
>
> I don't understand why this is true. Surely you could stop a handler
> on receipt of a new signal, and have it do something else entirely?

I think you *could*, but I'm not sure why you would.  In general,
syscalls never execute signal handlers mid-syscall.  There is a very
small number of syscalls that use sys_restart_syscall(), but I don't
think any of them allocate fds, and I'm not sure we need or want to
support them with user notifiers.  The rest of the syscalls will, if
they're behaving correctly, either do *something* (like reading some or
all of a buffer) and return success or they'll do nothing and return
-EINTR.  Or they return an -ERESTARTSYS variant.  And then, only
*after* the syscall logically returns (i.e. completely finishes
processing and puts its return code into the relevant register) will a
signal be delivered.  In other words, the case where something like
recv() gets interrupted but still returns a success code does not mean
that a signal handler was called and then recv() resumed.  It means
that recv() noticed the signal, stopped receiving, returned the number
of bytes read, and then allowed the signal to be delivered.

In the -ERESTARTSYS case, the syscall returns -ERESTARTSYS (or a
variant) without doing anything.  But it returns in a special case
where, after the signal handler returns, the syscall will happen
again.

So, for user notifiers, I think that any sane handler that notices a
non-fatal signal will do one of these things:

 - Return -EINTR without changing any tracee state.

 - Return success, possibly without blocking as long as it would have
without the signal.

 - Return -ERESTARTSYS without changing any tracee state.

 - Kill the tracee.

None of these would involve backing out an fd that was already
installed.  I suppose another way of looking at this is that.

Although... now that I think about it, there are some special cases,
like socketpair().  Look for put_unused_fd().  So maybe we need to
expose get_unused_fd_flags() and put_unused_fd(), but I think that
these are exceptions and will be very uncommon in the context of
seccomp user notifiers.  (For example, socketpair() can be implemented
almost correctly without put_unused_fd().)

Hmm.  This does mean that we need a test case for a user notifier
returning -ERESTARTSYS.  It should Just Work (tm), but those are
famous last words.

-ERESTARTSYS_RESTARTBLOCK is the case that I don't think we need to worry about.

>
> > In the (unlikely?) event that some tracer needs to be able to rollback
> > an fd installation to return -EINTR, a SECCOMP_NOTIF_CLOSE_FD
> > operation should be good enough, I think.  Or maybe PUT_FD can put -1
> > to delete an fd.
>
> Yes, I think even with something like what I did below we'd need some
> sort of REMOVE_FD option, because otherwise there's no way to change
> your mind and send -EINTR without the fd you just PUT_FD'd.
>

I think we just want the operation to cover all the cases.  Let PUT_FD
take a source fd and a dest fd.  If the source fd is -1, the dest is
closed.  If the source is -1 and the dest is -1, return -EINVAL.  If
the dest is -1, allocate an fd.  If the dest is >= 0, work like
dup2().  (The latter could be necessary to emulate things like, say,
dup2 :))
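
To make the four cases above concrete, here is a minimal sketch of how
the (source, dest) pair could be decoded.  The enum and the function
name are hypothetical and appear nowhere in the series, and rejecting
values below -1 is an extra assumption of this sketch, not something
stated above.

```c
#include <errno.h>

/* Hypothetical operation kinds; these names are illustrative only. */
enum put_fd_op {
	PUT_FD_CLOSE,   /* source == -1: close dest in the tracee */
	PUT_FD_ALLOC,   /* dest == -1: install source at a new, unused fd */
	PUT_FD_REPLACE, /* dest >= 0: install source over dest, like dup2() */
};

/*
 * Decide what a PUT_FD request means from its (source, dest) pair.
 * Returns 0 and stores the operation in *op, or -EINVAL if the pair
 * is not meaningful.
 */
static int put_fd_classify(int source, int dest, enum put_fd_op *op)
{
	/* Assumption: anything below -1 is simply invalid. */
	if (source < -1 || dest < -1)
		return -EINVAL;
	if (source == -1 && dest == -1)
		return -EINVAL;
	if (source == -1) {
		*op = PUT_FD_CLOSE;
		return 0;
	}
	if (dest == -1) {
		*op = PUT_FD_ALLOC;
		return 0;
	}
	*op = PUT_FD_REPLACE;
	return 0;
}
```

Keeping the decoding in one place means the ioctl handler only has to
switch on three operations, and supporting fewer of them at first is
just a matter of returning an error for the unsupported ones.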

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-21  2:18               ` Andy Lutomirski
@ 2018-09-21 13:39                 ` Tycho Andersen
  2018-09-21 18:27                   ` Andy Lutomirski
  2018-09-21 20:46                   ` Jann Horn
  2018-09-25 12:53                 ` Tycho Andersen
  1 sibling, 2 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-21 13:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W. Biederman, Serge E. Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Thu, Sep 20, 2018 at 07:18:45PM -0700, Andy Lutomirski wrote:
> On Thu, Sep 20, 2018 at 4:42 PM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Wed, Sep 19, 2018 at 12:58:20PM -0700, Andy Lutomirski wrote:
> > > On Wed, Sep 19, 2018 at 7:38 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > > > On Wed, Sep 19, 2018 at 07:19:56AM -0700, Andy Lutomirski wrote:
> > > >>
> > > >>
> > > >> > On Sep 19, 2018, at 2:55 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > > >> >
> > > >> >> On Wed, Sep 12, 2018 at 04:52:38PM -0700, Andy Lutomirski wrote:
> > > >> >>> On Thu, Sep 6, 2018 at 8:28 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > > >> >>> The idea here is that the userspace handler should be able to pass an fd
> > > >> >>> back to the trapped task, for example so it can be returned from socket().
> > > >> >>>
> > > >> >>> I've proposed one API here, but I'm open to other options. In particular,
> > > >> >>> this only lets you return an fd from a syscall, which may not be enough in
> > > >> >>> all cases. For example, if an fd is written to an output parameter instead
> > > >> >>> of returned, the current API can't handle this. Another case is that
> > > >> >>> netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink
> > > >> >>> ever decides to install an fd and output it, we wouldn't be able to handle
> > > >> >>> this either.
> > > >> >>
> > > >> >> An alternative could be to have an API (an ioctl on the listener,
> > > >> >> perhaps) that just copies an fd into the tracee.  There would be the
> > > >> >> obvious set of options: do we replace an existing fd or allocate a new
> > > >> >> one, and is it CLOEXEC.  Then the tracer could add an fd and then
> > > >> >> return it just like it's a regular number.
> > > >> >>
> > > >> >> I feel like this would be more flexible and conceptually simpler, but
> > > >> >> maybe a little slower for the common cases.  What do you think?
> > > >> >
> > > >> > I'm just implementing this now, and there's one question: when do we
> > > >> > actually do the fd install? Should we do it when the user calls
> > > >> > SECCOMP_NOTIF_PUT_FD, or when the actual response is sent? It feels
> > > >> > like we should do it when the response is sent, instead of doing it
> > > >> > right when SECCOMP_NOTIF_PUT_FD is called, since if there's a
> > > >> > subsequent signal and the tracer decides to discard the response,
> > > >> > we'll have to implement some delete mechanism to delete the fd, but it
> > > >> > would have already been visible to the process, etc. So I'll go
> > > >> > forward with this unless there are strong objections, but I thought
> > > >> > I'd point it out just to avoid another round trip.
> > > >> >
> > > >> >
> > > >>
> > > >> Can you do that non-racily?  That is, you need to commit to an fd *number* right away, but what if another thread uses the number before you actually install the fd?
> > > >
> > > > I was thinking we could just do an __alloc_fd() and then do the
> > > > fd_install() when the response is sent or clean up the case that the
> > > > listener or task dies. I haven't actually tried to run the code yet,
> > > > so it's possible the locking won't work :)
> > >
> > > I would be very surprised if the locking works.  How can you run a
> > > thread in a process when another thread has allocated but not
> > > installed an fd and is blocked for an arbitrarily long time?
> >
> > I think the trick is that there's no actual locking required (except
> > for a brief locking of task->files). I've run the patch below and it
> > seems to work. But perhaps that's abusing __alloc_fd a little too
> > hard, I don't really know.
> >
> 
> Hmm.  This makes me highly nervous.  If nothing else, what releases
> the busy-but-not-open fd if the whole process aborts?

Nothing right now, it gets installed even though the syscall gets
-ENOSYS. So not ideal, but that's why I was thinking we needed some
form of delete support. But,

> > > >
> > > >> Do we really allow non-“kill” signals to interrupt the whole process?  It might be the case that we don’t really need to clean up from signals if there’s a guarantee that the thread dies.
> > > >
> > > > Yes, we do, because of this: https://lkml.org/lkml/2018/3/15/1122
> > > >
> > >
> > > I'm still not sure I see the problem.  Suppose I'm implementing a user
> > > notifier for a nasty syscall like recvmsg().  If I'm the tracer, by
> > > the time I decide to install an fd, I've committed to returning
> > > something other than -EINTR, even if a non-fatal signal is sent before
> > > I finish.  No rollback should be necessary.
> >
> > I don't understand why this is true. Surely you could stop a handler
> > on receipt of a new signal, and have it do something else entirely?
> 
> I think you *could*, but I'm not sure why you would.  In general,
> syscalls never execute signal handlers mid-syscall.  There is a very
> small number of syscalls that use sys_restart_syscall(), but I don't
> think any of them allocate fds, and I'm not sure we need or want to
> support them with user notifiers.  The rest of the syscalls will, if
> they're behaving correctly, either do *something* (like reading some or
> all of a buffer) and return success or they'll do nothing and return
> -EINTR.  Or they return an -ERESTARTSYS variant.  And then, only
> *after* the syscall logically returns (i.e. completely finishes
> processing and puts its return code into the relevant register) will a
> signal be delivered.  In other words, the case where something like
> recv() gets interrupted but still returns a success code does not mean
> that a signal handler was called and then recv() resumed.  It means
> that recv() noticed the signal, stopped receiving, returned the number
> of bytes read, and then allowed the signal to be delivered.
> 
> In the -ERESTARTSYS case, the syscall returns -ERESTARTSYS (or a
> variant) and returns without doing anything.  But it returns in a
> special case where, after the signal returns, the syscall will happen
> again.
> 
> So, for user notifiers, I think that any sane handler that notices a
> non-fatal signal will do one of these things:
> 
>  - Return -EINTR without changing any tracee state.
> 
>  - Return success, possibly without blocking as long as it would have
> without the signal.
> 
>  - Return -ERESTARTSYS without changing any tracee state.
> 
>  - Kill the tracee.
> 
> None of these would involve backing out an fd that was already
> installed.  I suppose another way of looking at this is that.
> 
> Although... now that I think about it, there are some special cases,
> like socketpair().  Look for put_unused_fd().  So maybe we need to
> expose get_unused_fd_flags() and put_unused_fd(), but I think that
> these are exceptions and will be very uncommon in the context of
> seccomp user notifiers.  (For example, socketpair() can be implemented
> almost correctly without put_unused_fd().)

socketpair() is a good point. In particular, if we use this queuing
thing I've done above, then you can only ever send one fd, and you'll
need to send two here. So perhaps we really do need to do this as soon
as the tracer calls ioctl(), vs queuing and waiting.

> Hmm.  This does mean that we need a test case for a user notifier
> returning -ERESTARTSYS.  It should Just Work (tm), but those are
> famous last words.
> 
> -ERESTARTSYS_RESTARTBLOCK is the case that I don't think we need to worry about.
> 
> >
> > > In the (unlikely?) event that some tracer needs to be able to rollback
> > > an fd installation to return -EINTR, a SECCOMP_NOTIF_CLOSE_FD
> > > operation should be good enough, I think.  Or maybe PUT_FD can put -1
> > > to delete an fd.
> >
> > Yes, I think even with something like what I did below we'd need some
> > sort of REMOVE_FD option, because otherwise there's no way to change
> > your mind and send -EINTR without the fd you just PUT_FD'd.
> >
> 
> I think we just want the operation to cover all the cases.  Let PUT_FD
> take a source fd and a dest fd.  If the source fd is -1, the dest is
> closed.  If the source is -1 and the dest is -1, return -EINVAL.  If
> the dest is -1, allocate an fd.  If the dest is >= 0, work like
> dup2().  (The latter could be necessary to emulate things like, say,
> dup2 :))

...then if we're going to allow overwriting fds, we'd need to lift out
the logic from do_dup2 somewhere? Is this getting too complicated? :)

Tycho

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-21 13:39                 ` Tycho Andersen
@ 2018-09-21 18:27                   ` Andy Lutomirski
  2018-09-21 22:03                     ` Tycho Andersen
  2018-09-21 20:46                   ` Jann Horn
  1 sibling, 1 reply; 38+ messages in thread
From: Andy Lutomirski @ 2018-09-21 18:27 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W. Biederman, Serge E. Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Fri, Sep 21, 2018 at 6:39 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Thu, Sep 20, 2018 at 07:18:45PM -0700, Andy Lutomirski wrote:
> >
> > I think we just want the operation to cover all the cases.  Let PUT_FD
> > take a source fd and a dest fd.  If the source fd is -1, the dest is
> > closed.  If the source is -1 and the dest is -1, return -EINVAL.  If
> > the dest is -1, allocate an fd.  If the dest is >= 0, work like
> > dup2().  (The latter could be necessary to emulate things like, say,
> > dup2 :))
>
> ...then if we're going to allow overwriting fds, we'd need to lift out
> the logic from do_dup2 somewhere? Is this getting too complicated? :)
>

fds are complicated :-p

More seriously, though, I think it's okay if we don't support
everything out of the box.  Getting the general semantics I suggested
is kind of nice because the resulting API is conceptually simple, even
if it encapsulates three cases.  But I'd be okay with only supporting
add-an-fd-at-an-unused-position and delete-an-fd out of the box --
more can be added if there's demand.

But I think that exposing an operation that allocates and reserves an
fd without putting anything in the slot is awkward, and it opens us up
to weird corner cases becoming visible that are currently there but
mostly hidden.  For example, what happens if someone overwrites a
reserved fd with dup2()?  (The answer is apparently -EBUSY -- see the
big comment in do_dup2() in fs/file.c.)  But there's a more
significant nastiness: what happens if someone abuses your new
mechanism to overwrite a reserved fd that belongs to a different
thread?  It looks like you'll hit the BUG_ON(fdt->fd[fd] != NULL); in
__fd_install().  So unless you actually track which unused fds you own
and enforce that the final installation installs in the right slot,
you have a problem.
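
To illustrate the kind of bookkeeping that would be needed, here is a
toy model of a table with reserved slots.  It is only a sketch: none
of these names exist in the kernel, the real fdtable looks nothing
like this, and using -EBUSY for a mismatched install is my assumption
(chosen to mirror what do_dup2() reports for a reserved slot).

```c
#include <errno.h>
#include <stddef.h>

#define TOY_NR_FDS 8

enum toy_state { TOY_FREE, TOY_RESERVED, TOY_OPEN };

struct toy_slot {
	enum toy_state state;
	int owner;        /* who reserved the slot; valid when RESERVED */
	const void *file; /* stand-in for struct file *; valid when OPEN */
};

struct toy_fdtable {
	struct toy_slot slot[TOY_NR_FDS];
};

/* Reserve the lowest free slot for owner; returns the fd or -EMFILE. */
static int toy_reserve(struct toy_fdtable *t, int owner)
{
	for (int fd = 0; fd < TOY_NR_FDS; fd++) {
		if (t->slot[fd].state == TOY_FREE) {
			t->slot[fd].state = TOY_RESERVED;
			t->slot[fd].owner = owner;
			return fd;
		}
	}
	return -EMFILE;
}

/* Install file into fd, but only if this owner reserved that slot. */
static int toy_install(struct toy_fdtable *t, int owner, int fd,
		       const void *file)
{
	if (fd < 0 || fd >= TOY_NR_FDS)
		return -EBADF;
	if (t->slot[fd].state != TOY_RESERVED || t->slot[fd].owner != owner)
		return -EBUSY;
	t->slot[fd].state = TOY_OPEN;
	t->slot[fd].file = file;
	return 0;
}

/* Give back a reservation that was never filled. */
static int toy_unreserve(struct toy_fdtable *t, int owner, int fd)
{
	if (fd < 0 || fd >= TOY_NR_FDS)
		return -EBADF;
	if (t->slot[fd].state != TOY_RESERVED || t->slot[fd].owner != owner)
		return -EBUSY;
	t->slot[fd].state = TOY_FREE;
	return 0;
}
```

The point is only that every install and every cleanup path has to
check the owner, which is state the kernel does not keep today.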

BTW, socketpair() isn't the only thing that can add two fds.
recvmsg() can, too, as can pipe() and pipe2().  Some of the DRM ioctls
may as well for all I know.  But socketpair(), pipe(), and recvmsg()
can be credibly emulated by adding each fd in sequence and then
deleting them all if one fails.  Sure, this could race against dup2(),
but I'm not sure we care.
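
The add-in-sequence-then-delete-on-failure pattern can be sketched in
plain userspace terms, with dup() standing in for "add an fd to the
tracee" and close() for "delete it".  This is an analogy under those
assumptions, not the proposed ioctl; install_all_or_none() is a
made-up name.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Duplicate each of src[0..n-1] in order, storing the new fds in out[].
 * If any dup() fails, close everything installed so far, set those
 * entries to -1, and return -errno, so no fd from a failed attempt
 * stays open.
 */
static int install_all_or_none(const int *src, int n, int *out)
{
	for (int i = 0; i < n; i++) {
		out[i] = dup(src[i]);
		if (out[i] < 0) {
			int err = errno; /* save before close() can clobber it */

			while (--i >= 0) {
				close(out[i]);
				out[i] = -1;
			}
			return -err;
		}
	}
	return 0;
}
```

As in the emulation described above, another thread could observe or
reuse the fd numbers between the install and the rollback; the sketch
makes no attempt to close that window.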

--Andy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-21 13:39                 ` Tycho Andersen
  2018-09-21 18:27                   ` Andy Lutomirski
@ 2018-09-21 20:46                   ` Jann Horn
  1 sibling, 0 replies; 38+ messages in thread
From: Jann Horn @ 2018-09-21 20:46 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Andy Lutomirski, Kees Cook, kernel list, containers, Linux API,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro

On Fri, Sep 21, 2018 at 3:39 PM Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 20, 2018 at 07:18:45PM -0700, Andy Lutomirski wrote:
> > On Thu, Sep 20, 2018 at 4:42 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > On Wed, Sep 19, 2018 at 12:58:20PM -0700, Andy Lutomirski wrote:
> > > > On Wed, Sep 19, 2018 at 7:38 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > On Wed, Sep 19, 2018 at 07:19:56AM -0700, Andy Lutomirski wrote:
> > > > >> Do we really allow non-“kill” signals to interrupt the whole process?  It might be the case that we don’t really need to clean up from signals if there’s a guarantee that the thread dies.
> > > > >
> > > > > Yes, we do, because of this: https://lkml.org/lkml/2018/3/15/1122
> > > > >
> > > >
> > > > I'm still not sure I see the problem.  Suppose I'm implementing a user
> > > > notifier for a nasty syscall like recvmsg().  If I'm the tracer, by
> > > > the time I decide to install an fd, I've committed to returning
> > > > something other than -EINTR, even if a non-fatal signal is sent before
> > > > I finish.  No rollback should be necessary.
> > >
> > > I don't understand why this is true. Surely you could stop a handler
> > > on receipt of a new signal, and have it do something else entirely?
> >
> > I think you *could*, but I'm not sure why you would.  In general,
> > syscalls never execute signal handlers mid-syscall.  There is a very
> > small number of syscalls that use sys_restart_syscall(), but I don't
> > think any of them allocate fds, and I'm not sure we need or want to
> > support them with user notifiers.  The rest of the syscalls will, if
> > they're behaving correctly, either do *something* (like reading some or
> > all of a buffer) and return success or they'll do nothing and return
> > -EINTR.  Or they return an -ERESTARTSYS variant.  And then, only
> > *after* the syscall logically returns (i.e. completely finishes
> > processing and puts its return code into the relevant register) will a
> > signal be delivered.  In other words, the case where something like
> > recv() gets interrupted but still returns a success code does not mean
> > that a signal handler was called and then recv() resumed.  It means
> > that recv() noticed the signal, stopped receiving, returned the number
> > of bytes read, and then allowed the signal to be delivered.
> >
> > In the -ERESTARTSYS case, the syscall returns -ERESTARTSYS (or a
> > variant) and returns without doing anything.  But it returns in a
> > special case where, after the signal returns, the syscall will happen
> > again.
> >
> > So, for user notifiers, I think that any sane handler that notices a
> > non-fatal signal will do one of these things:
> >
> >  - Return -EINTR without changing any tracee state.
> >
> >  - Return success, possibly without blocking as long as it would have
> > without the signal.
> >
> >  - Return -ERESTARTSYS without changing any tracee state.
> >
> >  - Kill the tracee.
> >
> > None of these would involve backing out an fd that was already
> > installed.  I suppose another way of looking at this is that.
> >
> > Although... now that I think about it, there are some special cases,
> > like socketpair().  Look for put_unused_fd().  So maybe we need to
> > expose get_unused_fd_flags() and put_unused_fd(), but I think that
> > these are exceptions and will be very uncommon in the context of
> > seccomp user notifiers.  (For example, socketpair() can be implemented
> > almost correctly without put_unused_fd().)
>
> socketpair() is a good point. In particular, if we use this queuing
> thing I've done above, then you can only ever send one fd, and you'll
> need to send two here. So perhaps we really do need to do this as soon
> as the tracer calls ioctl(), vs queuing and waiting.
>
> > Hmm.  This does mean that we need a test case for a user notifier
> > returning -ERESTARTSYS.  It should Just Work (tm), but those are
> > famous last words.
> >
> > -ERESTARTSYS_RESTARTBLOCK is the case that I don't think we need to worry about.
> >
> > >
> > > > In the (unlikely?) event that some tracer needs to be able to rollback
> > > > an fd installation to return -EINTR, a SECCOMP_NOTIF_CLOSE_FD
> > > > operation should be good enough, I think.  Or maybe PUT_FD can put -1
> > > > to delete an fd.
> > >
> > > Yes, I think even with something like what I did below we'd need some
> > > sort of REMOVE_FD option, because otherwise there's no way to change
> > > your mind and send -EINTR without the fd you just PUT_FD'd.
> > >
> >
> > I think we just want the operation to cover all the cases.  Let PUT_FD
> > take a source fd and a dest fd.  If the source fd is -1, the dest is
> > closed.  If the source is -1 and the dest is -1, return -EINVAL.  If
> > the dest is -1, allocate an fd.  If the dest is >= 0, work like
> > dup2().  (The latter could be necessary to emulate things like, say,
> > dup2 :))
>
> ...then if we're going to allow overwriting fds, we'd need to lift out
> the logic from do_dup2 somewhere? Is this getting too complicated? :)

In particular if you end up allowing overwriting fds of a remote task,
please add a scary warning to the code that does that, informing the
reader that that's only safe because you know that the target task is
stopped outside syscall context, and that it would be a very bad idea
to just copypaste that code to somewhere else. If someone tried doing
that to a single-threaded task that's in the middle of a syscall, the
results would be interesting - and by "interesting", I mean
"use-after-free on a struct file".

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-21 18:27                   ` Andy Lutomirski
@ 2018-09-21 22:03                     ` Tycho Andersen
  0 siblings, 0 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-21 22:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W. Biederman, Serge E. Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Fri, Sep 21, 2018 at 11:27:59AM -0700, Andy Lutomirski wrote:
> On Fri, Sep 21, 2018 at 6:39 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Thu, Sep 20, 2018 at 07:18:45PM -0700, Andy Lutomirski wrote:
> > >
> > > I think we just want the operation to cover all the cases.  Let PUT_FD
> > > take a source fd and a dest fd.  If the source fd is -1, the dest is
> > > closed.  If the source is -1 and the dest is -1, return -EINVAL.  If
> > > the dest is -1, allocate an fd.  If the dest is >= 0, work like
> > > dup2().  (The latter could be necessary to emulate things like, say,
> > > dup2 :))
> >
> > ...then if we're going to allow overwriting fds, we'd need to lift out
> > the logic from do_dup2 somewhere? Is this getting too complicated? :)
> >
> 
> fds are complicated :-p

:D

> More seriously, though, I think it's okay if we don't support
> everything out of the box.  Getting the general semantics I suggested
> is kind of nice because the resulting API is conceptually simple, even
> if it encapsulates three cases.  But I'd be okay with only supporting
> add-an-fd-at-an-unused-position and delete-an-fd out of the box --
> more can be added if there's demand.

It's the delete/replace-an-fd one that has me worried. Anyway, I'll
take a look and see what I can figure out.

> But I think that exposing an operation that allocates and reserves an
> fd without putting anything in the slot is awkward, and it opens us up
> to weird corner cases becoming visible that are currently there but
> mostly hidden.  For example, what happens if someone overwrites a
> reserved fd with dup2()?  (The answer is apparently -EBUSY -- see the
> big comment in do_dup2() in fs/file.c.)  But there's a more
> significant nastiness: what happens if someone abuses your new
> mechanism to overwrite a reserved fd that belongs to a different
> thread?  It looks like you'll hit the BUG_ON(fdt->fd[fd] != NULL); in
> __fd_install().  So unless you actually track which unused fds you own
> and enforce that the final installation installs in the right slot,
> you have a problem.
> 
> BTW, socketpair() isn't the only thing that can add two fds.
> recvmsg() can, too, as can pipe() and pipe2().  Some of the DRM ioctls
> may as well for all I know.  But socketpair(), pipe(), and recvmsg()
> can be credibly emulated by adding each fd in sequence and then
> deleting them all if one fails.  Sure, this could race against dup2(),
> but I'm not sure we care.

Yup agreed. We need to do the install when the ioctl() is called.

Tycho

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-09-21  2:18               ` Andy Lutomirski
  2018-09-21 13:39                 ` Tycho Andersen
@ 2018-09-25 12:53                 ` Tycho Andersen
  1 sibling, 0 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-09-25 12:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, LKML, Linux Containers, Linux API, Oleg Nesterov,
	Eric W. Biederman, Serge E. Hallyn, Christian Brauner,
	Tyler Hicks, Akihiro Suda, Jann Horn

On Thu, Sep 20, 2018 at 07:18:45PM -0700, Andy Lutomirski wrote:
> Hmm.  This does mean that we need a test case for a user notifier
> returning -ERESTARTSYS.  It should Just Work (tm), but those are
> famous last words.

Just to confirm, I've got a test case that works like this:

1. fork and install a SIGUSR1 handler
2. tracee does a syscall that gets trapped
3. send SIGUSR1
4. respond from the listener with -ERESTARTSYS
5. see another of the same syscall, even though the tracee still thinks
   it's in the first one
6. respond with something reasonable, the tracee sees this response

I think that's the intended behavior. Note that when the listener
responds with -ERESTARTSYS and there is no signal pending, the task
just dies. That might be reasonable, I'm not sure.

Tycho

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace
  2018-09-13  0:00   ` Andy Lutomirski
  2018-09-13  9:24     ` Tycho Andersen
@ 2018-10-17  7:25     ` Michael Tirado
  2018-10-17 15:00       ` Tycho Andersen
  2018-10-17 18:31       ` Kees Cook
  1 sibling, 2 replies; 38+ messages in thread
From: Michael Tirado @ 2018-10-17  7:25 UTC (permalink / raw)
  To: Andy Lutomirski, Tycho Andersen; +Cc: LKML, Kees Cook

On Thu, Sep 13, 2018 at 12:02 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> Or we could have a
> seccomp() mode that adds a filter but only kicks in after execve().
>
> --Andy

Hey that's a pretty good idea, then we could block execve in a seccomp
launcher without post-exec cooperation, or that patch I wrote that used
an execve counter which probably should have been through prctl instead.

As for the rest of this long thread,
has anyone mentioned a specific use case that I missed? I didn't see code
patches sent to the linux-kernel mailing list, only this discussion thread
so I'm probably missing some important context.  Was it for loading modules
into kernel from a container?  Couldn't that be handled completely in user
space without using seccomp at all? Do we really want to turn seccomp into
a container IPC mechanism? It seems out of scope IMO, and especially
if it could be handled in user space already.

Why does it have to be a file descriptor, what would you be writing back to?
Could waitid be used somehow instead of ptrace to get notification
from a filter?
tldr, can someone kindly tell me how to find all the details surrounding these
patches so I can stop making really bad guesses?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace
  2018-10-17  7:25     ` Michael Tirado
@ 2018-10-17 15:00       ` Tycho Andersen
       [not found]         ` <CAMkWEXM1c7AGTH=tpgoHtPnFFY-V+05nGOU90Sa1E3EPY9OhKQ@mail.gmail.com>
  2018-10-17 18:31       ` Kees Cook
  1 sibling, 1 reply; 38+ messages in thread
From: Tycho Andersen @ 2018-10-17 15:00 UTC (permalink / raw)
  To: Michael Tirado; +Cc: Andy Lutomirski, LKML, Kees Cook

On Wed, Oct 17, 2018 at 07:25:00AM +0000, Michael Tirado wrote:
> On Thu, Sep 13, 2018 at 12:02 AM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Or we could have a
> > seccomp() mode that adds a filter but only kicks in after execve().
> >
> > --Andy
> 
> Hey that's a pretty good idea, then we could block execve in a seccomp
> launcher without post-exec cooperation, or that patch I wrote that used
> an execve counter which probably should have been through prctl instead.
> 
> As for the rest of this long thread,
> has anyone mentioned a specific use case that I missed? I didn't see code
> patches sent to the linux-kernel mailing list, only this discussion thread
> so I'm probably missing some important context.  Was it for loading modules
> into kernel from a container?  Couldn't that be handled completely in user
> space without using seccomp at all? Do we really want to turn seccomp into
> a container IPC mechanism? It seems out of scope IMO, and especially
> if it could be handled in user space already.

That's one of the use cases, but there are a large number of others. I
discuss a few in patch 1:
https://www.spinics.net/lists/linux-containers/msg33956.html

> Why does it have to be a file descriptor, what would you be writing back to?
> Could waitid be used somehow instead of ptrace to get notification
> from a filter?

You can already do this with SECCOMP_RET_TRACE. Of course, that means
the task has to be traced, and avoiding that is the point of this
series.

> tldr, can someone kindly tell me how to find all the details surrounding these
> patches so I can stop making really bad guesses?

FWIW, I'm dropping the ptrace bits (and the fd passing bits) from the
next version, because they seem fairly controversial. So this will be
going away.

Tycho

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace
       [not found]         ` <CAMkWEXM1c7AGTH=tpgoHtPnFFY-V+05nGOU90Sa1E3EPY9OhKQ@mail.gmail.com>
@ 2018-10-17 18:15           ` Michael Tirado
  2018-10-21 16:00             ` Tycho Andersen
  0 siblings, 1 reply; 38+ messages in thread
From: Michael Tirado @ 2018-10-17 18:15 UTC (permalink / raw)
  To: Tycho Andersen; +Cc: LKML, Kees Cook, Andy Lutomirski

Tycho, Sorry for the duplicate, I forgot to CC the list :(

On Wed, Oct 17, 2018 at 3:00 PM Tycho Andersen <tycho@tycho.ws> wrote:
>
>
> That's one of the use cases, but there are a large number of others. I
> discuss a few in patch 1:
> https://www.spinics.net/lists/linux-containers/msg33956.html
>

Thanks, this is making more sense to me now.

I haven't been keeping up with the list and just did a bunch
of reading. It seems that stackable LSMs are making some real
progress now. If those patches are merged, would a stacked
security module approach be worth exploring, assuming it
provides the same or greater flexibility and all syscalls of
interest can be hooked somehow?

> FWIW, I'm dropping the ptrace bits (and the fd passing bits)
> from the next version, because they seem fairly controversial.

Yeah, ptrace can be difficult to work with; no doubt it is
controversial. <3 Yama. I've used the ptrace method for counting
syscall failures. Even ignoring the non-trivial amount of time it
took me to learn the API and then write working code, the
performance loss (in a syscall-heavy program like a web browser)
is a noticeable problem outside of a debugging or analysis context.


* Re: [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace
  2018-10-17  7:25     ` Michael Tirado
  2018-10-17 15:00       ` Tycho Andersen
@ 2018-10-17 18:31       ` Kees Cook
  1 sibling, 0 replies; 38+ messages in thread
From: Kees Cook @ 2018-10-17 18:31 UTC (permalink / raw)
  To: Michael Tirado; +Cc: Andy Lutomirski, Tycho Andersen, LKML

On Wed, Oct 17, 2018 at 12:25 AM, Michael Tirado <mtirado418@gmail.com> wrote:
> On Thu, Sep 13, 2018 at 12:02 AM Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Or we could have a
>> seccomp() mode that adds a filter but only kicks in after execve().
>>
>> --Andy
>
> Hey that's a pretty good idea, then we could block execve in a seccomp
> launcher without post-exec cooperation, or that patch I wrote that used
> an execve counter which probably should have been through prctl instead.

This has come up a few times before, actually. I had a working
prototype, but it needed some more shaking-out. I do like the idea of
"activate later" filters, though I'd always felt like using execve as
the boundary was a bit limiting. I wonder if we could do some kind of
external trigger (i.e. the fd passed to the caller) for when to
activate... likely the synchronization is a horror show, though, so if
execve is "good enough", I'll probably be happy with that. :)

-Kees

-- 
Kees Cook
Pixel Security


* Re: [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace
  2018-10-17 18:15           ` Michael Tirado
@ 2018-10-21 16:00             ` Tycho Andersen
  0 siblings, 0 replies; 38+ messages in thread
From: Tycho Andersen @ 2018-10-21 16:00 UTC (permalink / raw)
  To: Michael Tirado; +Cc: LKML, Kees Cook, Andy Lutomirski

On Wed, Oct 17, 2018 at 06:15:47PM +0000, Michael Tirado wrote:
> Tycho, Sorry for the duplicate, I forgot to CC the list :(
> 
> On Wed, Oct 17, 2018 at 3:00 PM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> >
> > That's one of the use cases, but there are a large number of others. I
> > discuss a few in patch 1:
> > https://www.spinics.net/lists/linux-containers/msg33956.html
> >
> 
> Thanks, this is making more sense to me now.
> 
> I haven't been keeping up with the list and just did a bunch
> of reading. It seems that stackable LSMs are making some real
> progress now. If those patches are merged, would a stacked
> security module approach be worth exploring, assuming it
> provides the same or greater flexibility and all syscalls of
> interest can be hooked somehow?

Sorry, I only just noticed that this was a duplicate and that the one
I replied to was the off-list one. Anyway, no, I don't think that'll
work. The LSM code right now can't do anything besides refuse an
access, and that's a very specific design constraint of it. In
particular, it can't mutate any task state.

What we want in this series is basically the equivalent of
SECCOMP_RET_TRACE, without having to involve ptrace (for a variety of
reasons, mostly that applications want to use ptrace for their own
things). So seccomp seems like the most natural fit.

Tycho


Thread overview: 38+ messages
2018-09-06 15:28 [PATCH v6 0/5] seccomp trap to userspace Tycho Andersen
2018-09-06 15:28 ` [PATCH v6 1/5] seccomp: add a return code to " Tycho Andersen
2018-09-06 22:15   ` Tyler Hicks
2018-09-07 15:45     ` Tycho Andersen
2018-09-08 20:35     ` Tycho Andersen
2018-09-06 15:28 ` [PATCH v6 2/5] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
2018-09-11 10:25   ` kbuild test robot
2018-09-06 15:28 ` [PATCH v6 3/5] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
2018-09-06 15:45   ` Jann Horn
2018-09-06 15:50     ` Tycho Andersen
2018-09-13  0:00   ` Andy Lutomirski
2018-09-13  9:24     ` Tycho Andersen
2018-10-17  7:25     ` Michael Tirado
2018-10-17 15:00       ` Tycho Andersen
     [not found]         ` <CAMkWEXM1c7AGTH=tpgoHtPnFFY-V+05nGOU90Sa1E3EPY9OhKQ@mail.gmail.com>
2018-10-17 18:15           ` Michael Tirado
2018-10-21 16:00             ` Tycho Andersen
2018-10-17 18:31       ` Kees Cook
2018-09-06 15:28 ` [PATCH v6 4/5] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen
2018-09-06 16:15   ` Jann Horn
2018-09-06 16:22     ` Tycho Andersen
2018-09-06 18:30       ` Tycho Andersen
2018-09-10 17:00         ` Jann Horn
2018-09-11 20:29           ` Tycho Andersen
2018-09-12 23:52   ` Andy Lutomirski
2018-09-13  9:25     ` Tycho Andersen
2018-09-13  9:42     ` Aleksa Sarai
2018-09-19  9:55     ` Tycho Andersen
2018-09-19 14:19       ` Andy Lutomirski
2018-09-19 14:38         ` Tycho Andersen
2018-09-19 19:58           ` Andy Lutomirski
2018-09-20 23:42             ` Tycho Andersen
2018-09-21  2:18               ` Andy Lutomirski
2018-09-21 13:39                 ` Tycho Andersen
2018-09-21 18:27                   ` Andy Lutomirski
2018-09-21 22:03                     ` Tycho Andersen
2018-09-21 20:46                   ` Jann Horn
2018-09-25 12:53                 ` Tycho Andersen
2018-09-06 15:28 ` [PATCH v6 5/5] samples: add an example of seccomp user trap Tycho Andersen
