* [PATCH v5 0/5] seccomp trap to userspace
@ 2018-08-28 14:35 Tycho Andersen
  2018-08-28 14:35 ` [PATCH v5 1/5] seccomp: add a return code to " Tycho Andersen
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Tycho Andersen @ 2018-08-28 14:35 UTC
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Tycho Andersen

Hi all,

Here's v5 of the seccomp user set. Major changes from v4 include:

* switching to ioctl vs read/write
* adding a way to query whether a notification id is valid
* adding a sample program that shows complete usage of the API, with notes
  about various TOCTOUs

as well as a bunch of smaller fixes. See individual patch notes for
details.

Thanks,

Tycho

Tycho Andersen (5):
  seccomp: add a return code to trap to userspace
  seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  seccomp: add a way to get a listener fd from ptrace
  seccomp: add support for passing fds via USER_NOTIF
  samples: add an example of seccomp user trap

 Documentation/ioctl/ioctl-number.txt          |   1 +
 .../userspace-api/seccomp_filter.rst          |  80 +++
 arch/Kconfig                                  |   9 +
 include/linux/seccomp.h                       |  18 +-
 include/uapi/linux/ptrace.h                   |   2 +
 include/uapi/linux/seccomp.h                  |  36 +-
 kernel/ptrace.c                               |   4 +
 kernel/seccomp.c                              | 538 +++++++++++++++-
 samples/seccomp/.gitignore                    |   1 +
 samples/seccomp/Makefile                      |   9 +-
 samples/seccomp/user-trap.c                   | 312 ++++++++++
 tools/testing/selftests/seccomp/seccomp_bpf.c | 587 +++++++++++++++++-
 12 files changed, 1584 insertions(+), 13 deletions(-)
 create mode 100644 samples/seccomp/user-trap.c

-- 
2.17.1


* [PATCH v5 1/5] seccomp: add a return code to trap to userspace
  2018-08-28 14:35 [PATCH v5 0/5] seccomp trap to userspace Tycho Andersen
@ 2018-08-28 14:35 ` Tycho Andersen
  2018-08-29 18:59   ` Christian Brauner
  2018-08-28 14:36 ` [PATCH v5 2/5] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Tycho Andersen @ 2018-08-28 14:35 UTC
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Tycho Andersen

This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mknod(), since this checks
capable(CAP_MKNOD). However, harmless devices like /dev/null or
/dev/zero should be ok for containers to mknod, but we'd like to avoid hard
coding some whitelist in the kernel. Another example is mount(), which has
many security restrictions for good reason, but configuration or runtime
knowledge could potentially be used to relax these restrictions.

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since upstart itself uses ptrace to start services.
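
For comparison, the filter for the SECCOMP_RET_TRACE approach and for the
new return code can have the same shape; only the action differs. The sketch
below follows the four-instruction filter used in the selftests of this
patch. The helper name and FILTER_LEN are made up for illustration, and the
fallback define uses the value this patch proposes in case the installed
headers lack it.

```c
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Value proposed by this patch; only defined if the headers lack it. */
#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif

#define FILTER_LEN 4

/*
 * Fill out[] with a filter that returns `action` for syscall `nr` and
 * allows everything else. A production filter should also check
 * seccomp_data.arch; that is omitted here to keep the sketch short.
 */
static void build_single_syscall_filter(struct sock_filter out[FILTER_LEN],
					int nr, unsigned int action)
{
	const struct sock_filter f[FILTER_LEN] = {
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* If it matches, fall through; otherwise skip one insn. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, action),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};

	memcpy(out, f, sizeof(f));
}
```

Passing SECCOMP_RET_TRACE yields the ptrace-based variant described above;
passing SECCOMP_RET_USER_NOTIF yields the variant this patch adds.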

The actual implementation of this is fairly small, although getting the
synchronization right was/is slightly complex.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.

v2: * make id a u64; the idea here being that it will never overflow,
      because 64 bits is huge (one syscall every nanosecond => wrap every
      584 years) (Andy)
    * prevent nesting of user notifications: if someone is already attached
      to the tree in one place, nobody else can attach to the tree (Andy)
    * notify the listener of signals the tracee receives as well (Andy)
    * implement poll
v3: * lockdep fix (Oleg)
    * drop unnecessary WARN()s (Christian)
    * rearrange error returns to be more pretty (Christian)
    * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
v4: * fix implementation of poll to use poll_wait() (Jann)
    * change listener's fd flags to be 0 (Jann)
    * hoist filter initialization out of ifdefs to its own function
      init_user_notification()
    * add some more testing around poll() and closing the listener while a
      syscall is in action
    * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
      creates a new one (Matthew)
    * correctly handle pid namespaces, add some testcases (Matthew)
    * use EINPROGRESS instead of EINVAL when a notification response is
      written twice (Matthew)
    * fix comment typo from older version (SEND vs READ) (Matthew)
    * whitespace and logic simplification (Tobin)
    * add some Documentation/ bits on userspace trapping
v5: * fix documentation typos (Jann)
    * add signalled field to struct seccomp_notif (Jann)
    * switch to using ioctls instead of read()/write() for struct passing
      (Jann)
    * add an ioctl to ensure an id is still valid
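
For readers skimming the changelog, the v5 ioctl flow could be driven
roughly as in the sketch below. The struct layouts and ioctl numbers are
copied from this patch, but under a hypothetical v5_/V5_ prefix so they
cannot clash with whatever the installed <linux/seccomp.h> defines; the
helper names are also made up. It only works against a kernel carrying
this series.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/seccomp.h>
#include <linux/types.h>

/* Local mirrors of the structures in this patch (prefix is hypothetical). */
struct v5_seccomp_notif {
	__u16 len;
	__u64 id;
	__u32 pid;
	__u8 signalled;
	struct seccomp_data data;
};

struct v5_seccomp_notif_resp {
	__u16 len;
	__u64 id;
	__s32 error;
	__s64 val;
};

#define V5_SECCOMP_IOC_MAGIC 0xF7
#define V5_SECCOMP_NOTIF_RECV \
	_IOWR(V5_SECCOMP_IOC_MAGIC, 0, struct v5_seccomp_notif)
#define V5_SECCOMP_NOTIF_SEND \
	_IOWR(V5_SECCOMP_IOC_MAGIC, 1, struct v5_seccomp_notif_resp)
#define V5_SECCOMP_NOTIF_IS_ID_VALID \
	_IOR(V5_SECCOMP_IOC_MAGIC, 2, __u64)

/* Build a reply for `req`; `error` is 0 or a negative errno. */
static void prepare_response(const struct v5_seccomp_notif *req,
			     struct v5_seccomp_notif_resp *resp,
			     int error, long long val)
{
	memset(resp, 0, sizeof(*resp));
	resp->len = sizeof(*resp);
	resp->id = req->id;	/* must echo the id from the notification */
	resp->error = error;
	resp->val = val;
}

/* Receive one notification and deny it with -EPERM. Returns 0 or -1. */
static int deny_one(int listener)
{
	struct v5_seccomp_notif req;
	struct v5_seccomp_notif_resp resp;
	__u64 id;

	memset(&req, 0, sizeof(req));
	req.len = sizeof(req);
	if (ioctl(listener, V5_SECCOMP_NOTIF_RECV, &req) < 0)
		return -1;

	/* A real handler would inspect req.data and copy tracee memory here. */

	/* Check that the task is still waiting on this notification. */
	id = req.id;
	if (ioctl(listener, V5_SECCOMP_NOTIF_IS_ID_VALID, &id) != 1)
		return -1;

	prepare_response(&req, &resp, -EPERM, 0);
	if (ioctl(listener, V5_SECCOMP_NOTIF_SEND, &resp) < 0)
		return -1;
	return 0;
}
```

A caller would typically poll() the listener fd for POLLIN and call
deny_one() (or a policy-aware equivalent) each time it becomes readable.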

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 Documentation/ioctl/ioctl-number.txt          |   1 +
 .../userspace-api/seccomp_filter.rst          |  69 +++
 arch/Kconfig                                  |   9 +
 include/linux/seccomp.h                       |   7 +-
 include/uapi/linux/seccomp.h                  |  33 +-
 kernel/seccomp.c                              | 453 +++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 403 +++++++++++++++-
 7 files changed, 965 insertions(+), 10 deletions(-)

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 480c8609dc58..21fb661d3e0d 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -342,4 +342,5 @@ Code  Seq#(hex)	Include File		Comments
 					<mailto:raph@8d.com>
 0xF6	all	LTTng			Linux Trace Toolkit Next Generation
 					<mailto:mathieu.desnoyers@efficios.com>
+0xF7    00-1F   uapi/linux/seccomp.h
 0xFD	all	linux/dm-ioctl.h
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index 82a468bc7560..312472d8e9c5 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -122,6 +122,11 @@ In precedence order, they are:
 	Results in the lower 16-bits of the return value being passed
 	to userland as the errno without executing the system call.
 
+``SECCOMP_RET_USER_NOTIF``:
+    Results in a ``struct seccomp_notif`` message sent on the userspace
+    notification fd, if it is attached, or ``-ENOSYS`` if it is not. See below
+    for a discussion of how to handle user notifications.
+
 ``SECCOMP_RET_TRACE``:
 	When returned, this value will cause the kernel to attempt to
 	notify a ``ptrace()``-based tracer prior to executing the system
@@ -183,6 +188,70 @@ The ``samples/seccomp/`` directory contains both an x86-specific example
 and a more generic example of a higher level macro interface for BPF
 program generation.
 
+Userspace Notification
+======================
+
+The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a
+particular syscall to userspace to be handled. This may be useful for
+applications like container managers, which wish to intercept particular
+syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior.
+
+There are currently two APIs to acquire a userspace notification fd for a
+particular filter. The first is when the filter is installed, the task
+installing the filter can ask the ``seccomp()`` syscall:
+
+.. code-block::
+
+    fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
+
+which (on success) will return a listener fd for the filter, which can then be
+passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
+acquired via:
+
+.. code-block::
+
+    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+
+which grabs the 0th filter for some task which the tracer has privilege over.
+Note that filter fds correspond to a particular filter, and not a particular
+task. So if this task then forks, notifications from both tasks will appear on
+the same filter fd. Reads and writes to/from a filter fd are also synchronized,
+so a filter fd can safely have many readers.
+
+The interface for a seccomp notification fd consists of two structures:
+
+.. code-block::
+
+    struct seccomp_notif {
+        __u64 id;
+        pid_t pid;
+        __u8 signalled;
+        struct seccomp_data data;
+    };
+
+    struct seccomp_notif_resp {
+        __u64 id;
+        __s32 error;
+        __s64 val;
+    };
+
+Users can ``poll()`` on a seccomp notification fd and then receive a ``struct
+seccomp_notif`` via the ``SECCOMP_NOTIF_RECV`` ioctl. It contains a unique
+``id``, the ``pid`` of the task which triggered this request (which may be 0 if
+the task is in a pid ns not visible from the listener's pid namespace), whether
+the task was ``signalled``, and the ``data`` passed to seccomp. Userspace can
+then decide what to do and send a ``struct seccomp_notif_resp`` via the
+``SECCOMP_NOTIF_SEND`` ioctl, indicating what should be returned to the task.
+The ``id`` member of the response should be the ``id`` from the notification.
+
+It is worth noting that ``struct seccomp_data`` contains the values of register
+arguments to the syscall, but does not contain pointers to memory. The task's
+memory is accessible to suitably privileged tracers via ``ptrace()`` or
+``/proc/pid/map_files/``. However, care should be taken to avoid the TOCTOU
+mentioned above in this document: all arguments being read from the tracee's
+memory should be read into the tracer's memory before any policy decisions are
+made. This allows for an atomic decision on syscall arguments.
+
 Sysctls
 =======
 
diff --git a/arch/Kconfig b/arch/Kconfig
index 1aa59063f1fd..6d9d4b7f7a40 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -405,6 +405,15 @@ config SECCOMP_FILTER
 
 	  See Documentation/userspace-api/seccomp_filter.rst for details.
 
+config SECCOMP_USER_NOTIFICATION
+	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
+	depends on SECCOMP_FILTER
+	help
+	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
+	  programs to notify a userspace listener that a particular event happened.
+
+	  See Documentation/userspace-api/seccomp_filter.rst for details.
+
 preferred-plugin-hostcc := $(if-success,[ $(gcc-version) -ge 40800 ],$(HOSTCXX),$(HOSTCC))
 
 config PLUGIN_HOSTCC
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index e5320f6c8654..017444b5efed 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -4,9 +4,10 @@
 
 #include <uapi/linux/seccomp.h>
 
-#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC	| \
-					 SECCOMP_FILTER_FLAG_LOG	| \
-					 SECCOMP_FILTER_FLAG_SPEC_ALLOW)
+#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
+					 SECCOMP_FILTER_FLAG_LOG | \
+					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
+					 SECCOMP_FILTER_FLAG_NEW_LISTENER)
 
 #ifdef CONFIG_SECCOMP
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 9efc0e73d50b..aa5878972128 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -17,9 +17,10 @@
 #define SECCOMP_GET_ACTION_AVAIL	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC	(1UL << 0)
-#define SECCOMP_FILTER_FLAG_LOG		(1UL << 1)
-#define SECCOMP_FILTER_FLAG_SPEC_ALLOW	(1UL << 2)
+#define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
+#define SECCOMP_FILTER_FLAG_LOG			(1UL << 1)
+#define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
 
 /*
  * All BPF programs must return a 32-bit value.
@@ -35,6 +36,7 @@
 #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
 #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
 #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
 #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
 #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
 #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
@@ -60,4 +62,29 @@ struct seccomp_data {
 	__u64 args[6];
 };
 
+struct seccomp_notif {
+	__u16 len;
+	__u64 id;
+	__u32 pid;
+	__u8 signalled;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u16 len;
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+
+#define SECCOMP_IOC_MAGIC		0xF7
+
+/* Flags for seccomp notification fd ioctl. */
+#define SECCOMP_NOTIF_RECV		_IOWR(SECCOMP_IOC_MAGIC, 0,	\
+						struct seccomp_notif)
+#define SECCOMP_NOTIF_SEND		_IOWR(SECCOMP_IOC_MAGIC, 1,	\
+						struct seccomp_notif_resp)
+#define SECCOMP_NOTIF_IS_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
+						__u64)
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index fd023ac24e10..a09eb5c05f68 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -33,6 +33,7 @@
 #endif
 
 #ifdef CONFIG_SECCOMP_FILTER
+#include <linux/file.h>
 #include <linux/filter.h>
 #include <linux/pid.h>
 #include <linux/ptrace.h>
@@ -40,6 +41,53 @@
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+	SECCOMP_NOTIFY_INIT,
+	SECCOMP_NOTIFY_SENT,
+	SECCOMP_NOTIFY_REPLIED,
+};
+
+struct seccomp_knotif {
+	/* The struct pid of the task whose filter triggered the notification */
+	struct pid *pid;
+
+	/* The "cookie" for this request; this is unique for this filter. */
+	u64 id;
+
+	/* Whether or not this task has been given an interruptible signal. */
+	bool signalled;
+
+	/*
+	 * The seccomp data. This pointer is valid the entire time this
+	 * notification is active, since it comes from __seccomp_filter which
+	 * eclipses the entire lifecycle here.
+	 */
+	const struct seccomp_data *data;
+
+	/*
+	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
+	 * struct seccomp_knotif is created and starts out in INIT. Once the
+	 * handler reads the notification off of an FD, it transitions to SENT.
+	 * If a signal is received the state transitions back to INIT and
+	 * another message is sent. When the userspace handler replies, state
+	 * transitions to REPLIED.
+	 */
+	enum notify_state state;
+
+	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
+	int error;
+	long val;
+
+	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+	struct completion ready;
+
+	struct list_head list;
+};
+#endif
+
 /**
  * struct seccomp_filter - container for seccomp BPF programs
  *
@@ -66,6 +114,30 @@ struct seccomp_filter {
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+	/*
+	 * A semaphore that users of this notification can wait on for
+	 * changes. Actual reads and writes are still controlled with
+	 * filter->notify_lock.
+	 */
+	struct semaphore request;
+
+	/* A lock for all notification-related accesses. */
+	struct mutex notify_lock;
+
+	/* Is there currently an attached listener? */
+	bool has_listener;
+
+	/* The id of the next request. */
+	u64 next_id;
+
+	/* A list of struct seccomp_knotif elements. */
+	struct list_head notifications;
+
+	/* A wait queue for poll. */
+	wait_queue_head_t wqh;
+#endif
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -359,6 +431,19 @@ static inline void seccomp_sync_threads(unsigned long flags)
 	}
 }
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static void init_user_notification(struct seccomp_filter *sfilter)
+{
+	mutex_init(&sfilter->notify_lock);
+	sema_init(&sfilter->request, 0);
+	INIT_LIST_HEAD(&sfilter->notifications);
+	sfilter->next_id = get_random_u64();
+	init_waitqueue_head(&sfilter->wqh);
+}
+#else
+static inline void init_user_notification(struct seccomp_filter *sfilter) { }
+#endif
+
 /**
  * seccomp_prepare_filter: Prepares a seccomp filter for use.
  * @fprog: BPF program to install
@@ -392,6 +477,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);
 
+	init_user_notification(sfilter);
+
 	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
@@ -556,13 +643,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
 #define SECCOMP_LOG_TRACE		(1 << 4)
 #define SECCOMP_LOG_LOG			(1 << 5)
 #define SECCOMP_LOG_ALLOW		(1 << 6)
+#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
 
 static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
 				    SECCOMP_LOG_KILL_THREAD  |
 				    SECCOMP_LOG_TRAP  |
 				    SECCOMP_LOG_ERRNO |
 				    SECCOMP_LOG_TRACE |
-				    SECCOMP_LOG_LOG;
+				    SECCOMP_LOG_LOG |
+				    SECCOMP_LOG_USER_NOTIF;
 
 static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 			       bool requested)
@@ -581,6 +670,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 	case SECCOMP_RET_TRACE:
 		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
+		break;
 	case SECCOMP_RET_LOG:
 		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
 		break;
@@ -651,6 +743,83 @@ void secure_computing_strict(int this_syscall)
 }
 #else
 
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
+{
+	/* Note: overflow is ok here, the id just needs to be unique */
+	return filter->next_id++;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	int err;
+	long ret = 0;
+	struct seccomp_knotif n = {};
+
+	mutex_lock(&match->notify_lock);
+	err = -ENOSYS;
+	if (!match->has_listener)
+		goto out;
+
+	n.pid = task_pid(current);
+	n.state = SECCOMP_NOTIFY_INIT;
+	n.data = sd;
+	n.id = seccomp_next_notify_id(match);
+	init_completion(&n.ready);
+
+	list_add(&n.list, &match->notifications);
+	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
+
+	mutex_unlock(&match->notify_lock);
+	up(&match->request);
+
+	err = wait_for_completion_interruptible(&n.ready);
+	mutex_lock(&match->notify_lock);
+
+	/*
+	 * Here it's possible we got a signal and then had to wait on the mutex
+	 * while the reply was sent, so let's be sure there wasn't a response
+	 * in the meantime.
+	 */
+	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
+		/*
+		 * We got a signal. Let's tell userspace about it (potentially
+		 * again, if we had already notified them about the first one).
+		 */
+		n.signalled = true;
+		if (n.state == SECCOMP_NOTIFY_SENT) {
+			n.state = SECCOMP_NOTIFY_INIT;
+			up(&match->request);
+		}
+		mutex_unlock(&match->notify_lock);
+		err = wait_for_completion_killable(&n.ready);
+		mutex_lock(&match->notify_lock);
+		if (err < 0)
+			goto remove_list;
+	}
+
+	ret = n.val;
+	err = n.error;
+
+remove_list:
+	list_del(&n.list);
+out:
+	mutex_unlock(&match->notify_lock);
+	syscall_set_return_value(current, task_pt_regs(current),
+				 err, ret);
+}
+#else
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
+	do_exit(SIGSYS);
+}
+#endif
+
 #ifdef CONFIG_SECCOMP_FILTER
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
@@ -728,6 +897,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 
 		return 0;
 
+	case SECCOMP_RET_USER_NOTIF:
+		seccomp_do_user_notification(this_syscall, match, sd);
+		goto skip;
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
 		return 0;
@@ -834,6 +1006,9 @@ static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+static struct file *init_listener(struct task_struct *,
+				  struct seccomp_filter *);
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
@@ -853,6 +1028,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
+	int listener = 0;
+	struct file *listener_f = NULL;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -863,13 +1040,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
+	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
+		listener = get_unused_fd_flags(0);
+		if (listener < 0) {
+			ret = listener;
+			goto out_free;
+		}
+
+		listener_f = init_listener(current, prepared);
+		if (IS_ERR(listener_f)) {
+			put_unused_fd(listener);
+			ret = PTR_ERR(listener_f);
+			goto out_free;
+		}
+	}
+
 	/*
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_free;
+		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
 
@@ -887,6 +1079,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
+		if (ret < 0) {
+			fput(listener_f);
+			put_unused_fd(listener);
+		} else {
+			fd_install(listener, listener_f);
+			ret = listener;
+		}
+	}
 out_free:
 	seccomp_filter_free(prepared);
 	return ret;
@@ -915,6 +1117,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
 	case SECCOMP_RET_LOG:
 	case SECCOMP_RET_ALLOW:
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
+			break;
 	default:
 		return -EOPNOTSUPP;
 	}
@@ -1111,6 +1316,7 @@ long seccomp_get_metadata(struct task_struct *task,
 #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
 #define SECCOMP_RET_TRAP_NAME		"trap"
 #define SECCOMP_RET_ERRNO_NAME		"errno"
+#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
 #define SECCOMP_RET_TRACE_NAME		"trace"
 #define SECCOMP_RET_LOG_NAME		"log"
 #define SECCOMP_RET_ALLOW_NAME		"allow"
@@ -1120,6 +1326,7 @@ static const char seccomp_actions_avail[] =
 				SECCOMP_RET_KILL_THREAD_NAME	" "
 				SECCOMP_RET_TRAP_NAME		" "
 				SECCOMP_RET_ERRNO_NAME		" "
+				SECCOMP_RET_USER_NOTIF_NAME     " "
 				SECCOMP_RET_TRACE_NAME		" "
 				SECCOMP_RET_LOG_NAME		" "
 				SECCOMP_RET_ALLOW_NAME;
@@ -1137,6 +1344,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
 	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
 	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
 	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
+	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
 	{ }
 };
 
@@ -1342,3 +1550,244 @@ static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_knotif *knotif;
+
+	mutex_lock(&filter->notify_lock);
+
+	/*
+	 * If this file is being closed because e.g. the task who owned it
+	 * died, let's wake everyone up who was waiting on us.
+	 */
+	list_for_each_entry(knotif, &filter->notifications, list) {
+		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
+			continue;
+
+		knotif->state = SECCOMP_NOTIFY_REPLIED;
+		knotif->error = -ENOSYS;
+		knotif->val = 0;
+
+		complete(&knotif->ready);
+	}
+
+	wake_up_all(&filter->wqh);
+	filter->has_listener = false;
+	mutex_unlock(&filter->notify_lock);
+	__put_seccomp_filter(filter);
+	return 0;
+}
+
+static long seccomp_notify_recv(struct seccomp_filter *filter,
+				unsigned long arg)
+{
+	struct seccomp_knotif *knotif = NULL, *cur;
+	struct seccomp_notif unotif = {};
+	ssize_t ret;
+	u16 size;
+	void __user *buf = (void __user *)arg;
+
+	if (copy_from_user(&size, buf, sizeof(size)))
+		return -EFAULT;
+
+	ret = down_interruptible(&filter->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each_entry(cur, &filter->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT) {
+			knotif = cur;
+			break;
+		}
+	}
+
+	/*
+	 * If we didn't find a notification, it could be that the task was
+	 * interrupted between the time we were woken and when we were able to
+	 * acquire notify_lock.
+	 */
+	if (!knotif) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	size = min_t(size_t, size, sizeof(unotif));
+
+	unotif.len = size;
+	unotif.id = knotif->id;
+	unotif.pid = pid_vnr(knotif->pid);
+	unotif.signalled = knotif->signalled;
+	unotif.data = *(knotif->data);
+
+	if (copy_to_user(buf, &unotif, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = sizeof(unotif);
+	knotif->state = SECCOMP_NOTIFY_SENT;
+	wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_send(struct seccomp_filter *filter,
+				unsigned long arg)
+{
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL;
+	long ret;
+	u16 size;
+	void __user *buf = (void __user *)arg;
+
+	if (copy_from_user(&size, buf, sizeof(size)))
+		return -EFAULT;
+	size = min_t(size_t, size, sizeof(resp));
+	if (copy_from_user(&resp, buf, size))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(knotif, &filter->notifications, list) {
+		if (knotif->id == resp.id)
+			break;
+	}
+
+	if (!knotif || knotif->id != resp.id) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Allow exactly one reply. */
+	if (knotif->state != SECCOMP_NOTIFY_SENT) {
+		ret = -EINPROGRESS;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_REPLIED;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_is_id_valid(struct seccomp_filter *filter,
+				       unsigned long arg)
+{
+	struct seccomp_knotif *knotif = NULL;
+	void __user *buf = (void __user *)arg;
+	u64 id;
+
+	if (copy_from_user(&id, buf, sizeof(id)))
+		return -EFAULT;
+
+	list_for_each_entry(knotif, &filter->notifications, list) {
+		if (knotif->id == id)
+			return 1;
+	}
+
+	return 0;
+}
+
+static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
+				 unsigned long arg)
+{
+	struct seccomp_filter *filter = file->private_data;
+
+	switch (cmd) {
+	case SECCOMP_NOTIF_RECV:
+		return seccomp_notify_recv(filter, arg);
+	case SECCOMP_NOTIF_SEND:
+		return seccomp_notify_send(filter, arg);
+	case SECCOMP_NOTIF_IS_ID_VALID:
+		return seccomp_notify_is_id_valid(filter, arg);
+	default:
+		return -EINVAL;
+	}
+}
+
+static __poll_t seccomp_notify_poll(struct file *file,
+				    struct poll_table_struct *poll_tab)
+{
+	struct seccomp_filter *filter = file->private_data;
+	__poll_t ret = 0;
+	struct seccomp_knotif *cur;
+
+	poll_wait(file, &filter->wqh, poll_tab);
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(cur, &filter->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT)
+			ret |= EPOLLIN | EPOLLRDNORM;
+		if (cur->state == SECCOMP_NOTIFY_SENT)
+			ret |= EPOLLOUT | EPOLLWRNORM;
+		if (ret & EPOLLIN && ret & EPOLLOUT)
+			break;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.poll = seccomp_notify_poll,
+	.release = seccomp_notify_release,
+	.unlocked_ioctl = seccomp_notify_ioctl,
+};
+
+static struct file *init_listener(struct task_struct *task,
+				  struct seccomp_filter *filter)
+{
+	struct file *ret = ERR_PTR(-EBUSY);
+	struct seccomp_filter *cur, *last_locked = NULL;
+	int filter_nesting = 0;
+
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_lock_nested(&cur->notify_lock, filter_nesting);
+		filter_nesting++;
+		last_locked = cur;
+		if (cur->has_listener)
+			goto out;
+	}
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret))
+		goto out;
+
+
+	/* The file has a reference to it now */
+	__get_seccomp_filter(filter);
+	filter->has_listener = true;
+
+out:
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_unlock(&cur->notify_lock);
+		if (cur == last_locked)
+			break;
+	}
+
+	return ret;
+}
+#else
+static struct file *init_listener(struct task_struct *task,
+				  struct seccomp_filter *filter)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index e1473234968d..89f2c788a06b 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -5,6 +5,7 @@
  * Test code for seccomp bpf.
  */
 
+#define _GNU_SOURCE
 #include <sys/types.h>
 
 /*
@@ -40,10 +41,12 @@
 #include <sys/fcntl.h>
 #include <sys/mman.h>
 #include <sys/times.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
 
-#define _GNU_SOURCE
 #include <unistd.h>
 #include <sys/syscall.h>
+#include <poll.h>
 
 #include "../kselftest_harness.h"
 
@@ -154,6 +157,34 @@ struct seccomp_metadata {
 };
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+#define SECCOMP_IOC_MAGIC		0xF7
+#define SECCOMP_NOTIF_RECV		_IOWR(SECCOMP_IOC_MAGIC, 0,	\
+						struct seccomp_notif)
+#define SECCOMP_NOTIF_SEND		_IOWR(SECCOMP_IOC_MAGIC, 1,	\
+						struct seccomp_notif_resp)
+#define SECCOMP_NOTIF_IS_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
+						__u64)
+struct seccomp_notif {
+	__u16 len;
+	__u64 id;
+	__u32 pid;
+	__u8 signalled;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u16 len;
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -2077,7 +2108,8 @@ TEST(detect_seccomp_filter_flags)
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
 				 SECCOMP_FILTER_FLAG_LOG,
-				 SECCOMP_FILTER_FLAG_SPEC_ALLOW };
+				 SECCOMP_FILTER_FLAG_SPEC_ALLOW,
+				 SECCOMP_FILTER_FLAG_NEW_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
@@ -2933,6 +2965,373 @@ TEST(get_metadata)
 	ASSERT_EQ(0, kill(pid, SIGKILL));
 }
 
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int read_notif(int listener, struct seccomp_notif *req)
+{
+	int ret;
+
+	do {
+		errno = 0;
+		req->len = sizeof(*req);
+		ret = ioctl(listener, SECCOMP_NOTIF_RECV, req);
+	} while (ret == -1 && errno == ENOENT);
+	return ret;
+}
+
+static void signal_handler(int signal)
+{
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	struct pollfd pollfd;
+
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		if (user_trap_syscall(__NR_getpid, 0) < 0)
+			exit(1);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/* Add some no-op filters so that we (don't) trigger lockdep. */
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	/* Installing a second listener in the chain should EBUSY */
+	EXPECT_EQ(user_trap_syscall(__NR_getpid,
+				    SECCOMP_FILTER_FLAG_NEW_LISTENER),
+		  -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLIN);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLOUT);
+
+	EXPECT_EQ(req.data.nr,  __NR_getpid);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_IS_ID_VALID, &req.id), 1);
+
+	EXPECT_EQ(kill(pid, SIGKILL), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_IS_ID_VALID, &req.id), 0);
+
+	resp.id = req.id;
+	ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, EINVAL);
+
+	/*
+	 * Check that we get another notification about a signal in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+			perror("signal");
+			exit(1);
+		}
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(req.signalled, 1);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
+	EXPECT_EQ(ret, sizeof(resp));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that we get an ENOSYS when the listener is closed.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0) {
+		close(listener);
+		ret = syscall(__NR_getpid);
+		exit(ret != -1 || errno != ENOSYS);
+	}
+
+	close(listener);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+/*
+ * Check that a pid in a child namespace still shows up as valid in ours.
+ */
+TEST(user_notification_child_pid_ns)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+	ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+
+		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done and respond with magic */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+	EXPECT_EQ(req.pid, pid);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+	close(listener);
+}
+
+/*
+ * Check that a pid in a sibling (i.e. unrelated) namespace shows up as 0, i.e.
+ * invalid.
+ */
+TEST(user_notification_sibling_pid_ns)
+{
+	pid_t pid, pid2;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int child_pair[2];
+
+		ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+
+		ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, child_pair), 0);
+
+		pid2 = fork();
+		ASSERT_GE(pid2, 0);
+
+		if (pid2 == 0) {
+			close(child_pair[0]);
+			EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+			/* Signal we're ready and have installed the filter. */
+			EXPECT_EQ(write(child_pair[1], "J", 1), 1);
+
+			EXPECT_EQ(read(child_pair[1], &c, 1), 1);
+			EXPECT_EQ(c, 'H');
+
+			exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+		}
+
+		/* check that child has installed the filter */
+		EXPECT_EQ(read(child_pair[0], &c, 1), 1);
+		EXPECT_EQ(c, 'J');
+
+		/* tell parent who child is */
+		EXPECT_EQ(write(sk_pair[1], &pid2, sizeof(pid2)), sizeof(pid2));
+
+		/* parent has installed listener, tell child to call syscall */
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+		EXPECT_EQ(write(child_pair[0], "H", 1), 1);
+
+		EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+		EXPECT_EQ(true, WIFEXITED(status));
+		EXPECT_EQ(0, WEXITSTATUS(status));
+		exit(WEXITSTATUS(status));
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &pid2, sizeof(pid2)), sizeof(pid2));
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid2), 0);
+	EXPECT_EQ(waitpid(pid2, NULL, 0), pid2);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid2, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(errno, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid2, NULL, 0), 0);
+
+	/* Create the sibling ns, and sibling in it. */
+	EXPECT_EQ(unshare(CLONE_NEWPID), 0);
+	EXPECT_EQ(errno, 0);
+
+	pid2 = fork();
+	EXPECT_GE(pid2, 0);
+
+	if (pid2 == 0) {
+		req.len = sizeof(req);
+		ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+		/*
+		 * The pid should be 0, i.e. the task is in some namespace that
+		 * we can't "see".
+		 */
+		ASSERT_EQ(req.pid, 0);
+
+		resp.len = sizeof(resp);
+		resp.id = req.id;
+		resp.error = 0;
+		resp.val = USER_NOTIF_MAGIC;
+
+		ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+		exit(0);
+	}
+
+	close(listener);
+
+	/* Now signal we are done setting up sibling listener. */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+
 /*
  * TODO:
  * - add microbenchmarks
-- 
2.17.1


* [PATCH v5 2/5] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  2018-08-28 14:35 [PATCH v5 0/5] seccomp trap to userspace Tycho Andersen
  2018-08-28 14:35 ` [PATCH v5 1/5] seccomp: add a return code to " Tycho Andersen
@ 2018-08-28 14:36 ` Tycho Andersen
  2018-08-29 19:07   ` Christian Brauner
  2018-08-28 14:36 ` [PATCH v5 3/5] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Tycho Andersen @ 2018-08-28 14:36 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Tycho Andersen

In the next commit we'll use this same mnemonic to get a listener for the
nth filter, so we need it available outside of CHECKPOINT_RESTORE in the
USER_NOTIFICATION case as well.
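
To make the resulting guard layout concrete, here is a rough standalone
sketch. The DEMO_* macros and demo_* functions are placeholders invented
for illustration, not kernel symbols. The shared helper is built when
either feature is enabled, while the checkpoint/restore-only caller keeps
its own inner guard.

```c
#include <errno.h>

/* Stand-ins for Kconfig symbols; only user notification is "enabled". */
#define DEMO_SECCOMP_USER_NOTIFICATION 1
/* #define DEMO_CHECKPOINT_RESTORE 1 */

#if defined(DEMO_CHECKPOINT_RESTORE) || \
	defined(DEMO_SECCOMP_USER_NOTIFICATION)
/* Shared helper: compiled when either feature needs it. */
static int demo_get_nth_filter(unsigned long filter_off)
{
	/* Pretend the task has exactly one filter, at offset 0. */
	return filter_off == 0 ? 0 : -ENOENT;
}

#if defined(DEMO_CHECKPOINT_RESTORE)
/* Checkpoint/restore-only user of the helper (compiled out here). */
static long demo_get_filter(unsigned long filter_off)
{
	return demo_get_nth_filter(filter_off);
}
#endif /* DEMO_CHECKPOINT_RESTORE */

#if defined(DEMO_SECCOMP_USER_NOTIFICATION)
/* User-notification user of the same helper. */
static long demo_new_listener(unsigned long filter_off)
{
	return demo_get_nth_filter(filter_off);
}
#endif /* DEMO_SECCOMP_USER_NOTIFICATION */
#endif /* DEMO_CHECKPOINT_RESTORE || DEMO_SECCOMP_USER_NOTIFICATION */
```

With only the user-notification stand-in defined, the helper and its
user-notification caller still compile, which is the property this patch
is after.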

v2: new in v2
v3: no changes
v4: no changes
v5: switch to CHECKPOINT_RESTORE || USER_NOTIFICATION to avoid warning when
    only CONFIG_SECCOMP_FILTER is enabled.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 kernel/seccomp.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index a09eb5c05f68..ed786655186d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1188,7 +1188,8 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
 	return do_seccomp(op, 0, uargs);
 }
 
-#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
+#if defined(CONFIG_CHECKPOINT_RESTORE) || \
+	defined(CONFIG_SECCOMP_USER_NOTIFICATION)
 static struct seccomp_filter *get_nth_filter(struct task_struct *task,
 					     unsigned long filter_off)
 {
@@ -1235,6 +1236,7 @@ static struct seccomp_filter *get_nth_filter(struct task_struct *task,
 	return filter;
 }
 
+#if defined(CONFIG_CHECKPOINT_RESTORE)
 long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
 			void __user *data)
 {
@@ -1307,7 +1309,8 @@ long seccomp_get_metadata(struct task_struct *task,
 	__put_seccomp_filter(filter);
 	return ret;
 }
-#endif
+#endif /* CONFIG_CHECKPOINT_RESTORE */
+#endif /* CONFIG_CHECKPOINT_RESTORE || CONFIG_SECCOMP_USER_NOTIFICATION */
 
 #ifdef CONFIG_SYSCTL
 
-- 
2.17.1


* [PATCH v5 3/5] seccomp: add a way to get a listener fd from ptrace
  2018-08-28 14:35 [PATCH v5 0/5] seccomp trap to userspace Tycho Andersen
  2018-08-28 14:35 ` [PATCH v5 1/5] seccomp: add a return code to " Tycho Andersen
  2018-08-28 14:36 ` [PATCH v5 2/5] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
@ 2018-08-28 14:36 ` Tycho Andersen
  2018-08-28 14:36 ` [PATCH v5 4/5] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen
  2018-08-28 14:36 ` [PATCH v5 5/5] samples: add an example of seccomp user trap Tycho Andersen
  4 siblings, 0 replies; 9+ messages in thread
From: Tycho Andersen @ 2018-08-28 14:36 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Tycho Andersen

As an alternative to SECCOMP_FILTER_FLAG_NEW_LISTENER, perhaps a ptrace()
version which can acquire filters is useful. There are at least two reasons
this is preferable, even though it uses ptrace:

1. You can control tasks that aren't cooperating with you
2. You can control tasks whose filters block sendmsg() and socket(); if the
   task installs a filter which blocks these calls, there's no way with
   SECCOMP_FILTER_FLAG_NEW_LISTENER to get the fd out to the privileged task.
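
For readers who want to see the tracer side, a minimal sketch follows. It
mirrors the attach / request / detach sequence used in the selftest below.
demo_listener_from_ptrace() is a made-up helper name, and the request
number is the one proposed in this series: on kernels without the series
the same value may be unassigned or mean something else, so this only
makes sense against a kernel carrying the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Request number as proposed in this series; treat it as an assumption. */
#ifndef PTRACE_SECCOMP_NEW_LISTENER
#define PTRACE_SECCOMP_NEW_LISTENER 0x420e
#endif

/* Hypothetical helper: returns a listener fd for pid's filter, or -1. */
static int demo_listener_from_ptrace(pid_t pid, unsigned long filter_off)
{
	long fd;
	int saved_errno;

	if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0)
		return -1;

	/* Wait for the attach stop before issuing further requests. */
	if (waitpid(pid, NULL, 0) != pid) {
		saved_errno = errno;
		ptrace(PTRACE_DETACH, pid, NULL, NULL);
		errno = saved_errno;
		return -1;
	}

	/* addr carries the filter offset, as with PTRACE_SECCOMP_GET_FILTER. */
	fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, (void *)filter_off, NULL);

	/* Always detach, but report the errno of the request itself. */
	saved_errno = errno;
	ptrace(PTRACE_DETACH, pid, NULL, NULL);
	errno = saved_errno;

	return fd < 0 ? -1 : (int)fd;
}
```

The tracee does not have to cooperate or keep sendmsg()/socket() allowed,
which is the point of the two reasons listed above.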

v2: fix a bug where listener mode was not unset when an unused fd was not
    available
v3: fix refcounting bug (Oleg)
v4: * change the listener's fd flags to be 0
    * rename GET_LISTENER to NEW_LISTENER (Matthew)
v5: * add capable(CAP_SYS_ADMIN) requirement

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 include/linux/seccomp.h                       | 11 +++
 include/uapi/linux/ptrace.h                   |  2 +
 kernel/ptrace.c                               |  4 ++
 kernel/seccomp.c                              | 31 +++++++++
 tools/testing/selftests/seccomp/seccomp_bpf.c | 68 +++++++++++++++++++
 5 files changed, 116 insertions(+)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 017444b5efed..c17c7d051af0 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -112,4 +112,15 @@ static inline long seccomp_get_metadata(struct task_struct *task,
 	return -EINVAL;
 }
 #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
+extern long seccomp_new_listener(struct task_struct *task,
+				 unsigned long filter_off);
+#else
+static inline long seccomp_new_listener(struct task_struct *task,
+					unsigned long filter_off)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_SECCOMP_USER_NOTIFICATION */
 #endif /* _LINUX_SECCOMP_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index d5a1b8a492b9..e80ecb1bd427 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -73,6 +73,8 @@ struct seccomp_metadata {
 	__u64 flags;		/* Output: filter's flags */
 };
 
+#define PTRACE_SECCOMP_NEW_LISTENER	0x420e
+
 /* Read signals from a shared (process wide) queue */
 #define PTRACE_PEEKSIGINFO_SHARED	(1 << 0)
 
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 21fec73d45d4..289960ac181b 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
 		ret = seccomp_get_metadata(child, addr, datavp);
 		break;
 
+	case PTRACE_SECCOMP_NEW_LISTENER:
+		ret = seccomp_new_listener(child, addr);
+		break;
+
 	default:
 		break;
 	}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ed786655186d..580888785324 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1787,6 +1787,37 @@ static struct file *init_listener(struct task_struct *task,
 
 	return ret;
 }
+
+long seccomp_new_listener(struct task_struct *task,
+			  unsigned long filter_off)
+{
+	struct seccomp_filter *filter;
+	struct file *listener;
+	int fd;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	filter = get_nth_filter(task, filter_off);
+	if (IS_ERR(filter))
+		return PTR_ERR(filter);
+
+	fd = get_unused_fd_flags(0);
+	if (fd < 0) {
+		__put_seccomp_filter(filter);
+		return fd;
+	}
+
+	listener = init_listener(task, filter);
+	__put_seccomp_filter(filter);
+	if (IS_ERR(listener)) {
+		put_unused_fd(fd);
+		return PTR_ERR(listener);
+	}
+
+	fd_install(fd, listener);
+	return fd;
+}
 #else
 static struct file *init_listener(struct task_struct *task,
 				  struct seccomp_filter *filter)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 89f2c788a06b..61b8e3c5c06b 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -193,6 +193,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
 }
 #endif
 
+#ifndef PTRACE_SECCOMP_NEW_LISTENER
+#define PTRACE_SECCOMP_NEW_LISTENER 0x420e
+#endif
+
 #if __BYTE_ORDER == __LITTLE_ENDIAN
 #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
 #elif __BYTE_ORDER == __BIG_ENDIAN
@@ -3165,6 +3169,70 @@ TEST(get_user_notification_syscall)
 	EXPECT_EQ(0, WEXITSTATUS(status));
 }
 
+TEST(get_user_notification_ptrace)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Test that we get ENOSYS while not attached */
+		EXPECT_EQ(syscall(__NR_getpid), -1);
+		EXPECT_EQ(errno, ENOSYS);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+
+		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+
+	/* EBUSY for second listener */
+	EXPECT_EQ(ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0), -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done and respond with magic */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	close(listener);
+}
+
 /*
  * Check that a pid in a child namespace still shows up as valid in ours.
  */
-- 
2.17.1


* [PATCH v5 4/5] seccomp: add support for passing fds via USER_NOTIF
  2018-08-28 14:35 [PATCH v5 0/5] seccomp trap to userspace Tycho Andersen
                   ` (2 preceding siblings ...)
  2018-08-28 14:36 ` [PATCH v5 3/5] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
@ 2018-08-28 14:36 ` Tycho Andersen
  2018-08-28 14:36 ` [PATCH v5 5/5] samples: add an example of seccomp user trap Tycho Andersen
  4 siblings, 0 replies; 9+ messages in thread
From: Tycho Andersen @ 2018-08-28 14:36 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Tycho Andersen

The idea here is that the userspace handler should be able to pass an fd
back to the trapped task, for example so it can be returned from socket().
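
As a rough illustration of the handler side, the sketch below fills in a
reply that asks the kernel to install one of the handler's fds in the
trapped task. struct demo_notif_resp is a local mirror of struct
seccomp_notif_resp as laid out in this patch, and demo_fd_reply() is a
hypothetical helper; neither is part of the proposed uapi.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of struct seccomp_notif_resp as laid out in this patch. */
struct demo_notif_resp {
	uint16_t len;
	uint64_t id;
	int32_t error;
	int64_t val;
	uint8_t return_fd;
	uint32_t fd;
	uint32_t fd_flags;
};

/*
 * Hypothetical helper: build a reply that returns `fd` (an fd in the
 * handler's table) as the syscall result for notification `id`.
 * fd_flags are the flags for the new fd in the tracee, e.g. O_CLOEXEC.
 */
static struct demo_notif_resp demo_fd_reply(uint64_t id, int fd,
					    uint32_t fd_flags)
{
	struct demo_notif_resp resp;

	/* Zero everything first so padding and unused members are clean. */
	memset(&resp, 0, sizeof(resp));
	resp.len = sizeof(resp);
	resp.id = id;
	resp.return_fd = 1;
	resp.fd = (uint32_t)fd;
	resp.fd_flags = fd_flags;
	return resp;
}
```

The handler would then pass the struct to the SECCOMP_NOTIF_SEND ioctl on
the listener and can close its own copy of the fd afterwards, as the
selftest in this patch does.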

I've proposed one API here, but I'm open to other options. In particular,
this only lets you return an fd from a syscall, which may not be enough in
all cases. For example, if an fd is written to an output parameter instead
of returned, the current API can't handle this. Another case is netlink,
which sometimes takes fds as input (e.g. IFLA_NET_NS_FD). If netlink ever
decides to install an fd and output it, we wouldn't be able to handle this
either.

Still, the vast majority of interesting cases are covered by this API, so
perhaps it is Enough.

I've left it as a separate commit for two reasons:
  * It illustrates the way in which we would grow struct seccomp_notif and
    struct seccomp_notif_resp without using netlink
  * It shows just how little code is needed to accomplish this :)
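
A sketch of the size-versioning idea, under the assumption that the
receiving side copies at most the length announced in the leading len
member and leaves everything else zeroed. demo_resp and demo_copy_resp()
are illustrative names, and this is userspace C standing in for what the
kernel side would do with copy_from_user().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the response struct after this patch. */
struct demo_resp {
	uint16_t len;
	uint64_t id;
	int32_t error;
	int64_t val;
	uint8_t return_fd;	/* members from here on are new */
	uint32_t fd;
	uint32_t fd_flags;
};

/*
 * Copy a caller-supplied struct whose size is given by its leading len
 * member. Members the caller did not know about stay zero. Returns the
 * number of bytes consumed, or -1 if len cannot even cover itself.
 */
static long demo_copy_resp(struct demo_resp *dst, const void *src)
{
	uint16_t len;
	size_t size;

	memcpy(&len, src, sizeof(len));
	if (len < sizeof(len))
		return -1;

	/* Never read more than we understand, nor more than was sent. */
	size = len < sizeof(*dst) ? len : sizeof(*dst);
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, size);
	return (long)size;
}
```

An old caller that stops at val therefore ends up with return_fd == 0,
which matches what the struct_size_mismatch selftest in this patch
exercises.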

v2: new in v2
v3: no changes
v4: * pass fd flags back from userspace as well (Jann)
    * update same cgroup data on fd pass as SCM_RIGHTS (Alban)
    * only set the REPLIED state /after/ successful fdget (Alban)
    * reflect GET_LISTENER -> NEW_LISTENER changes
    * add to the new Documentation/ on user notifications about fd replies
v5: * fix documentation typo (O_EXCL -> O_CLOEXEC)

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 .../userspace-api/seccomp_filter.rst          |  11 ++
 include/uapi/linux/seccomp.h                  |   3 +
 kernel/seccomp.c                              |  51 +++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 116 ++++++++++++++++++
 4 files changed, 179 insertions(+), 2 deletions(-)

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index 312472d8e9c5..668d93b425c1 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -233,6 +233,9 @@ The interface for a seccomp notification fd consists of two structures:
         __u64 id;
         __s32 error;
         __s64 val;
+        __u8 return_fd;
+        __u32 fd;
+        __u32 fd_flags;
     };
 
 Users can ``read()`` or ``poll()`` on a seccomp notification fd to receive a
@@ -252,6 +255,14 @@ mentioned above in this document: all arguments being read from the tracee's
 memory should be read into the tracer's memory before any policy decisions are
 made. This allows for an atomic decision on syscall arguments.
 
+Userspace can also return file descriptors. For example, one may decide to
+intercept ``socket()`` syscalls, and return some file descriptor from those
+based on some policy. To return a file descriptor, the ``return_fd`` member
+should be non-zero, the ``fd`` member should be the fd in the listener's
+table to send to the tracee (similar to how ``SCM_RIGHTS`` works), and
+``fd_flags`` should be the flags that the fd in the tracee's table is opened
+with (e.g. ``O_CLOEXEC`` or similar).
+
 Sysctls
 =======
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index aa5878972128..93f1bd5c7cf0 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -75,6 +75,9 @@ struct seccomp_notif_resp {
 	__u64 id;
 	__s32 error;
 	__s64 val;
+	__u8 return_fd;
+	__u32 fd;
+	__u32 fd_flags;
 };
 
 #define SECCOMP_IOC_MAGIC		0xF7
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 580888785324..4a6db4076ec5 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -43,6 +43,7 @@
 
 #ifdef CONFIG_SECCOMP_USER_NOTIFICATION
 #include <linux/anon_inodes.h>
+#include <net/cls_cgroup.h>
 
 enum notify_state {
 	SECCOMP_NOTIFY_INIT,
@@ -80,6 +81,8 @@ struct seccomp_knotif {
 	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
 	int error;
 	long val;
+	struct file *file;
+	unsigned int flags;
 
 	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
 	struct completion ready;
@@ -800,10 +803,44 @@ static void seccomp_do_user_notification(int this_syscall,
 			goto remove_list;
 	}
 
-	ret = n.val;
-	err = n.error;
+	if (n.file) {
+		int fd;
+		struct socket *sock;
+
+		fd = get_unused_fd_flags(n.flags);
+		if (fd < 0) {
+			err = fd;
+			ret = -1;
+			goto remove_list;
+		}
+
+		/*
+		 * Similar to what SCM_RIGHTS does, let's re-set the cgroup
+		 * data to point to the tracee's cgroups instead of the
+		 * listener's.
+		 */
+		sock = sock_from_file(n.file, &err);
+		if (sock) {
+			sock_update_netprioidx(&sock->sk->sk_cgrp_data);
+			sock_update_classid(&sock->sk->sk_cgrp_data);
+		}
+
+		ret = fd;
+		err = 0;
+
+		fd_install(fd, n.file);
+		/* Don't fput, since fd has a reference now */
+		n.file = NULL;
+	} else {
+		ret = n.val;
+		err = n.error;
+	}
+
 
 remove_list:
+	if (n.file)
+		fput(n.file);
+
 	list_del(&n.list);
 out:
 	mutex_unlock(&match->notify_lock);
@@ -1675,10 +1712,20 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
 		goto out;
 	}
 
+	if (resp.return_fd) {
+		knotif->flags = resp.fd_flags;
+		knotif->file = fget(resp.fd);
+		if (!knotif->file) {
+			ret = -EBADF;
+			goto out;
+		}
+	}
+
 	ret = size;
 	knotif->state = SECCOMP_NOTIFY_REPLIED;
 	knotif->error = resp.error;
 	knotif->val = resp.val;
+
 	complete(&knotif->ready);
 out:
 	mutex_unlock(&filter->notify_lock);
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 61b8e3c5c06b..c756722faa88 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -182,6 +182,9 @@ struct seccomp_notif_resp {
 	__u64 id;
 	__s32 error;
 	__s64 val;
+	__u8 return_fd;
+	__u32 fd;
+	__u32 fd_flags;
 };
 #endif
 
@@ -3233,6 +3236,119 @@ TEST(get_user_notification_ptrace)
 	close(listener);
 }
 
+TEST(user_notification_pass_fd)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	long ret;
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		char buf[16];
+
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+		close(sk_pair[1]);
+
+		/* An fd from getpid(). Let the games begin. */
+		fd = syscall(__NR_getpid);
+		EXPECT_GT(fd, 0);
+		EXPECT_EQ(read(fd, buf, sizeof(buf)), 12);
+		close(fd);
+
+		exit(strcmp("hello world", buf));
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done installing so it can do a getpid */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+	close(sk_pair[0]);
+
+	/* Make a new socket pair so we can send half across */
+	EXPECT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.return_fd = 1;
+	resp.fd = sk_pair[1];
+	resp.fd_flags = 0;
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+	close(sk_pair[1]);
+
+	EXPECT_EQ(write(sk_pair[0], "hello world\0", 12), 12);
+	close(sk_pair[0]);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+	close(listener);
+}
+
+TEST(user_notification_struct_size_mismatch)
+{
+	pid_t pid;
+	long ret;
+	int status, listener, len;
+	struct seccomp_notif req;
+	struct seccomp_notif_resp resp;
+
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+
+	/*
+	 * Only write a partial structure: this is what was available before we
+	 * had fd support.
+	 */
+	len = offsetof(struct seccomp_notif_resp, val) + sizeof(resp.val);
+	resp.len = len;
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), len);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
 /*
  * Check that a pid in a child namespace still shows up as valid in ours.
  */
-- 
2.17.1


* [PATCH v5 5/5] samples: add an example of seccomp user trap
  2018-08-28 14:35 [PATCH v5 0/5] seccomp trap to userspace Tycho Andersen
                   ` (3 preceding siblings ...)
  2018-08-28 14:36 ` [PATCH v5 4/5] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen
@ 2018-08-28 14:36 ` Tycho Andersen
  4 siblings, 0 replies; 9+ messages in thread
From: Tycho Andersen @ 2018-08-28 14:36 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Tycho Andersen

The idea here is just to give a demonstration of how one could safely use
the SECCOMP_RET_USER_NOTIF feature to do mount policies. This particular
policy is (as noted in the comment) not very interesting, but it serves to
illustrate how one might apply a policy while dodging the various TOCTOU
issues.
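
One way to picture the pid-reuse part of that problem is the sketch below:
open the task's memory by pid, confirm the notification is still live, and
only then copy the argument and decide on the private copy. All demo_*
names are invented for illustration, and the id check is passed in as a
callback because the real check would be an ioctl on the listener fd,
which is outside what this sketch can exercise.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Stand-in for asking the listener whether the notification id is live. */
static int demo_id_always_valid(void)
{
	return 1;
}

/*
 * Hypothetical helper: copy up to len - 1 bytes from pid's memory at addr
 * into buf and NUL-terminate. Returns 0 on success, -1 on failure.
 */
static int demo_fetch_arg(pid_t pid, uint64_t addr, char *buf, size_t len,
			  int (*id_still_valid)(void))
{
	char path[64];
	ssize_t n;
	int fd;

	if (len < 2)
		return -1;

	snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/*
	 * The pid may have died and been recycled before open(); only
	 * trust this fd if the notification is still pending.
	 */
	if (!id_still_valid()) {
		close(fd);
		return -1;
	}

	/* Copy once; all policy decisions are made on this copy. */
	n = pread(fd, buf, len - 1, (off_t)addr);
	close(fd);
	if (n <= 0)
		return -1;

	buf[n] = '\0';
	return 0;
}
```

The handler never re-reads the tracee's memory after deciding, so a
racing thread in the tracee cannot change what was checked.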

v5: new in v5

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 samples/seccomp/.gitignore  |   1 +
 samples/seccomp/Makefile    |   9 +-
 samples/seccomp/user-trap.c | 312 ++++++++++++++++++++++++++++++++++++
 3 files changed, 321 insertions(+), 1 deletion(-)

diff --git a/samples/seccomp/.gitignore b/samples/seccomp/.gitignore
index 78fb78184291..d1e2e817d556 100644
--- a/samples/seccomp/.gitignore
+++ b/samples/seccomp/.gitignore
@@ -1,3 +1,4 @@
 bpf-direct
 bpf-fancy
 dropper
+user-trap
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
index ba942e3ead89..0ab120c95e38 100644
--- a/samples/seccomp/Makefile
+++ b/samples/seccomp/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 ifndef CROSS_COMPILE
-hostprogs-$(CONFIG_SAMPLE_SECCOMP) := bpf-fancy dropper bpf-direct
+hostprogs-$(CONFIG_SAMPLE_SECCOMP) := bpf-fancy dropper bpf-direct user-trap
 
 HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
 HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
@@ -16,6 +16,10 @@ HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
 HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
 bpf-direct-objs := bpf-direct.o
 
+HOSTCFLAGS_user-trap.o += -I$(objtree)/usr/include
+HOSTCFLAGS_user-trap.o += -idirafter $(objtree)/include
+user-trap-objs := user-trap.o
+
 # Try to match the kernel target.
 ifndef CONFIG_64BIT
 
@@ -30,9 +34,12 @@ HOSTCFLAGS_bpf-direct.o += $(MFLAG)
 HOSTCFLAGS_dropper.o += $(MFLAG)
 HOSTCFLAGS_bpf-helper.o += $(MFLAG)
 HOSTCFLAGS_bpf-fancy.o += $(MFLAG)
+HOSTCFLAGS_user-trap.o += $(MFLAG)
 HOSTLOADLIBES_bpf-direct += $(MFLAG)
 HOSTLOADLIBES_bpf-fancy += $(MFLAG)
 HOSTLOADLIBES_dropper += $(MFLAG)
+HOSTLOADLIBES_user-trap += $(MFLAG)
+
 endif
 always := $(hostprogs-m)
 endif
diff --git a/samples/seccomp/user-trap.c b/samples/seccomp/user-trap.c
new file mode 100644
index 000000000000..571eb32fd80b
--- /dev/null
+++ b/samples/seccomp/user-trap.c
@@ -0,0 +1,312 @@
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <string.h>
+#include <stddef.h>
+#include <sys/sysmacros.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/user.h>
+#include <sys/ioctl.h>
+#include <sys/ptrace.h>
+#include <sys/mount.h>
+#include <linux/limits.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+
+/*
+ * Because of some grossness, we can't include linux/ptrace.h here, so we
+ * re-define PTRACE_SECCOMP_NEW_LISTENER.
+ */
+#ifndef PTRACE_SECCOMP_NEW_LISTENER
+#define PTRACE_SECCOMP_NEW_LISTENER	0x420e
+#endif
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+static int seccomp(unsigned int op, unsigned int flags, void *args)
+{
+	errno = 0;
+	return syscall(__NR_seccomp, op, flags, args);
+}
+
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int handle_req(struct seccomp_notif *req,
+		      struct seccomp_notif_resp *resp, int listener)
+{
+	char path[PATH_MAX], source[PATH_MAX], target[PATH_MAX];
+	int ret = -1, mem;
+
+	resp->len = sizeof(*resp);
+	resp->id = req->id;
+	resp->error = -EPERM;
+	resp->val = 0;
+
+	if (req->data.nr != __NR_mount) {
+		fprintf(stderr, "huh? trapped something besides mount? %d\n", req->data.nr);
+		return -1;
+	}
+
+	/* Only allow bind mounts. */
+	if (!(req->data.args[3] & MS_BIND))
+		return 0;
+
+	/*
+	 * Ok, let's read the task's memory to see where they wanted their
+	 * mount to go.
+	 */
+	snprintf(path, sizeof(path), "/proc/%d/mem", req->pid);
+	mem = open(path, O_RDONLY);
+	if (mem < 0) {
+		perror("open mem");
+		return -1;
+	}
+
+	/*
+	 * Now we avoid a TOCTOU: we referred to the task by its pid, but since
+	 * the pid that made the syscall may have died, we need to confirm that
+	 * the pid is still valid after we open its /proc/pid/mem file. We can
+	 * ask the listener fd this as follows.
+	 *
+	 * Note that this check should occur *after* any task-specific
+	 * resources are opened, to make sure that the task has not died and
+	 * we're not wrongly reading someone else's state in order to make
+	 * decisions.
+	 */
+	if (ioctl(listener, SECCOMP_NOTIF_IS_ID_VALID, &req->id) != 1) {
+		fprintf(stderr, "task died before we could map its memory\n");
+		goto out;
+	}
+
+	/*
+	 * Phew, we've got the right /proc/pid/mem. Now we can read it. Note
+	 * that to avoid another TOCTOU, we should read all of the pointer args
+	 * before we decide to allow the syscall.
+	 */
+	if (lseek(mem, req->data.args[0], SEEK_SET) < 0) {
+		perror("seek");
+		goto out;
+	}
+
+	ret = read(mem, source, sizeof(source));
+	if (ret < 0) {
+		perror("read");
+		goto out;
+	}
+
+	if (lseek(mem, req->data.args[1], SEEK_SET) < 0) {
+		perror("seek");
+		goto out;
+	}
+
+	ret = read(mem, target, sizeof(target));
+	if (ret < 0) {
+		perror("read");
+		goto out;
+	}
+
+	/*
+	 * Our policy is to only allow bind mounts inside /tmp. This isn't very
+	 * interesting, because we could do unprivileged bind mounts with user
+	 * namespaces already, but you get the idea.
+	 */
+	if (!strncmp(source, "/tmp", 4) && !strncmp(target, "/tmp", 4)) {
+		if (mount(source, target, NULL, req->data.args[3], NULL) < 0) {
+			ret = -1;
+			perror("actual mount");
+			goto out;
+		}
+		resp->error = 0;
+	}
+
+	/* Even if we didn't allow it because of policy, generating the
+	 * response was a success, because we want to tell the worker EPERM.
+	 */
+	ret = 0;
+
+out:
+	close(mem);
+	return ret;
+}
+
+int main(void)
+{
+	int sk_pair[2], ret = 1, status, listener;
+	pid_t worker = 0 , tracer = 0;
+	char c;
+
+	if (socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair) < 0) {
+		perror("socketpair");
+		return 1;
+	}
+
+	worker = fork();
+	if (worker < 0) {
+		perror("fork");
+		goto close_pair;
+	}
+
+	if (worker == 0) {
+		if (user_trap_syscall(__NR_mount, 0) < 0) {
+			perror("seccomp");
+			exit(1);
+		}
+
+		if (setuid(1000) < 0) {
+			perror("setuid");
+			exit(1);
+		}
+
+		if (write(sk_pair[1], "a", 1) != 1) {
+			perror("write");
+			exit(1);
+		}
+
+		if (read(sk_pair[1], &c, 1) != 1) {
+			perror("read");
+			exit(1);
+		}
+
+		if (mkdir("/tmp/foo", 0755) < 0) {
+			perror("mkdir");
+			exit(1);
+		}
+
+		if (mount("/dev/sda", "/tmp/foo", NULL, 0, NULL) != -1) {
+			fprintf(stderr, "huh? mounted /dev/sda?\n");
+			exit(1);
+		}
+
+		if (errno != EPERM) {
+			perror("bad error from mount");
+			exit(1);
+		}
+
+		if (mount("/tmp/foo", "/tmp/foo", NULL, MS_BIND, NULL) < 0) {
+			perror("mount");
+			exit(1);
+		}
+
+		exit(0);
+	}
+
+	if (read(sk_pair[0], &c, 1) != 1) {
+		perror("read ready signal");
+		goto out_kill;
+	}
+
+	if (ptrace(PTRACE_ATTACH, worker) < 0) {
+		perror("ptrace");
+		goto out_kill;
+	}
+
+	if (waitpid(worker, NULL, 0) != worker) {
+		perror("waitpid");
+		goto out_kill;
+	}
+
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, worker, 0);
+	if (listener < 0) {
+		perror("ptrace get listener");
+		goto out_kill;
+	}
+
+	if (ptrace(PTRACE_DETACH, worker, NULL, 0) < 0) {
+		perror("ptrace detach");
+		goto out_kill;
+	}
+
+	if (write(sk_pair[0], "a", 1) != 1) {
+		perror("write");
+		goto out_kill;
+	}
+
+	tracer = fork();
+	if (tracer < 0) {
+		perror("fork");
+		goto out_kill;
+	}
+
+	if (tracer == 0) {
+		while (1) {
+			struct seccomp_notif req = {};
+			struct seccomp_notif_resp resp = {};
+
+			req.len = sizeof(req);
+			if (ioctl(listener, SECCOMP_NOTIF_RECV, &req) != sizeof(req)) {
+				perror("ioctl recv");
+				goto out_close;
+			}
+
+			if (handle_req(&req, &resp, listener) < 0)
+				goto out_close;
+
+			if (ioctl(listener, SECCOMP_NOTIF_SEND, &resp) != sizeof(resp)) {
+				perror("ioctl send");
+				goto out_close;
+			}
+		}
+out_close:
+		close(listener);
+		exit(1);
+	}
+
+	close(listener);
+
+	if (waitpid(worker, &status, 0) != worker) {
+		perror("waitpid");
+		goto out_kill;
+	}
+
+	if (umount2("/tmp/foo", MNT_DETACH) < 0 && errno != EINVAL) {
+		perror("umount2");
+		goto out_kill;
+	}
+
+	if (remove("/tmp/foo") < 0 && errno != ENOENT) {
+		perror("remove");
+		goto out_kill;
+	}
+
+	if (!WIFEXITED(status) || WEXITSTATUS(status)) {
+		fprintf(stderr, "worker exited nonzero\n");
+		goto out_kill;
+	}
+
+	ret = 0;
+
+out_kill:
+	if (tracer > 0)
+		kill(tracer, SIGKILL);
+	if (worker > 0)
+		kill(worker, SIGKILL);
+
+close_pair:
+	close(sk_pair[0]);
+	close(sk_pair[1]);
+	return ret;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 1/5] seccomp: add a return code to trap to userspace
  2018-08-28 14:35 ` [PATCH v5 1/5] seccomp: add a return code to " Tycho Andersen
@ 2018-08-29 18:59   ` Christian Brauner
  2018-08-29 21:21     ` Tycho Andersen
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2018-08-29 18:59 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, linux-api, containers, Akihiro Suda, Oleg Nesterov,
	linux-kernel, Eric W . Biederman, Christian Brauner,
	Andy Lutomirski

On Tue, Aug 28, 2018 at 08:35:59AM -0600, Tycho Andersen wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
> 
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
> 
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
> 
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
> 
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
> 
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
> 
> v2: * make id a u64; the idea here being that it will never overflow,
>       because 64 is huge (one syscall every nanosecond => wrap every 584
>       years) (Andy)
>     * prevent nesting of user notifications: if someone is already attached
>       the tree in one place, nobody else can attach to the tree (Andy)
>     * notify the listener of signals the tracee receives as well (Andy)
>     * implement poll
> v3: * lockdep fix (Oleg)
>     * drop unnecessary WARN()s (Christian)
>     * rearrange error returns to be more pretty (Christian)
>     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> v4: * fix implementation of poll to use poll_wait() (Jann)
>     * change listener's fd flags to be 0 (Jann)
>     * hoist filter initialization out of ifdefs to its own function
>       init_user_notification()
>     * add some more testing around poll() and closing the listener while a
>       syscall is in action
>     * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
>       creates a new one (Matthew)
>     * correctly handle pid namespaces, add some testcases (Matthew)
>     * use EINPROGRESS instead of EINVAL when a notification response is
>       written twice (Matthew)
>     * fix comment typo from older version (SEND vs READ) (Matthew)
>     * whitespace and logic simplification (Tobin)
>     * add some Documentation/ bits on userspace trapping
> v5: * fix documentation typos (Jann)
>     * add signalled field to struct seccomp_notif (Jann)
>     * switch to using ioctls instead of read()/write() for struct passing
>       (Jann)
>     * add an ioctl to ensure an id is still valid
> 
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>

I know how much you love bikeshedding, Tycho. So let me start. :)

> ---
>  Documentation/ioctl/ioctl-number.txt          |   1 +
>  .../userspace-api/seccomp_filter.rst          |  69 +++
>  arch/Kconfig                                  |   9 +
>  include/linux/seccomp.h                       |   7 +-
>  include/uapi/linux/seccomp.h                  |  33 +-
>  kernel/seccomp.c                              | 453 +++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 403 +++++++++++++++-
>  7 files changed, 965 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> index 480c8609dc58..21fb661d3e0d 100644
> --- a/Documentation/ioctl/ioctl-number.txt
> +++ b/Documentation/ioctl/ioctl-number.txt
> @@ -342,4 +342,5 @@ Code  Seq#(hex)	Include File		Comments
>  					<mailto:raph@8d.com>
>  0xF6	all	LTTng			Linux Trace Toolkit Next Generation
>  					<mailto:mathieu.desnoyers@efficios.com>
> +0xF7    00-1F   uapi/linux/seccomp.h
>  0xFD	all	linux/dm-ioctl.h
> diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> index 82a468bc7560..312472d8e9c5 100644
> --- a/Documentation/userspace-api/seccomp_filter.rst
> +++ b/Documentation/userspace-api/seccomp_filter.rst
> @@ -122,6 +122,11 @@ In precedence order, they are:
>  	Results in the lower 16-bits of the return value being passed
>  	to userland as the errno without executing the system call.
>  
> +``SECCOMP_RET_USER_NOTIF``:
> +    Results in a ``struct seccomp_notif`` message sent on the userspace
> +    notification fd, if it is attached, or ``-ENOSYS`` if it is not. See below
> +    on discussion of how to handle user notifications.
> +
>  ``SECCOMP_RET_TRACE``:
>  	When returned, this value will cause the kernel to attempt to
>  	notify a ``ptrace()``-based tracer prior to executing the system
> @@ -183,6 +188,70 @@ The ``samples/seccomp/`` directory contains both an x86-specific example
>  and a more generic example of a higher level macro interface for BPF
>  program generation.
>  
> +Userspace Notification
> +======================
> +
> +The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a
> +particular syscall to userspace to be handled. This may be useful for
> +applications like container managers, which wish to intercept particular
> +syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior.
> +
> +There are currently two APIs to acquire a userspace notification fd for a
> +particular filter. The first is when the filter is installed, the task
> +installing the filter can ask the ``seccomp()`` syscall:
> +
> +.. code-block::
> +
> +    fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
> +
> +which (on success) will return a listener fd for the filter, which can them be

s/them/then/

> +passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
> +acquired via:
> +
> +.. code-block::
> +
> +    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> +
> +which grabs the 0th filter for some task which the tracer has privilege over.
> +Note that filter fds correspond to a particular filter, and not a particular
> +task. So if this task then forks, notifications from both tasks will appear on
> +the same filter fd. Reads and writes to/from a filter fd are also synchronized,
> +so a filter fd can safely have many readers.
> +
> +The interface for a seccomp notification fd consists of two structures:
> +
> +.. code-block::
> +
> +    struct seccomp_notif {
> +        __u64 id;
> +        pid_t pid;
> +        __u8 signalled;
> +        struct seccomp_data data;
> +    };
> +
> +    struct seccomp_notif_resp {
> +        __u64 id;
> +        __s32 error;
> +        __s64 val;
> +    };
> +
> +Users can ``read()`` or ``poll()`` on a seccomp notification fd to receive a

You have changed this from read() to ioctl(), right?

> +``struct seccomp_notif``, which contains three members: a globally unique
> +``id``, the ``pid`` of the task which triggered this request (which may be 0 if
> +the task is in a pid ns not visible from the listener's pid namespace), and the
> +``data`` passed to seccomp. Userspace can then make a decision based on this
> +information about what to do, and ``write()`` a response, indicating what

Same question as above. :)

> +should be returned to userspace. The ``id`` member of ``struct
> +seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.
> +
> +It is worth noting that ``struct seccomp_data`` contains the values of register
> +arguments to the syscall, but does not contain pointers to memory. The task's
> +memory is accessible to suitably privileged traces via via ``ptrace()`` or

s/via via/via/

> +``/proc/pid/map_files/``. However, care should be taken to avoid the TOCTOU
> +mentioned above in this document: all arguments being read from the tracee's
> +memory should be read into the tracer's memory before any policy decisions are
> +made. This allows for an atomic decision on syscall arguments.
> +
>  Sysctls
>  =======
>  
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 1aa59063f1fd..6d9d4b7f7a40 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -405,6 +405,15 @@ config SECCOMP_FILTER
>  
>  	  See Documentation/userspace-api/seccomp_filter.rst for details.
>  
> +config SECCOMP_USER_NOTIFICATION
> +	bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action"
> +	depends on SECCOMP_FILTER
> +	help
> +	  Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp
> +	  programs to notify a userspace listener that a particular event happened.
> +
> +	  See Documentation/userspace-api/seccomp_filter.rst for details.
> +
>  preferred-plugin-hostcc := $(if-success,[ $(gcc-version) -ge 40800 ],$(HOSTCXX),$(HOSTCC))
>  
>  config PLUGIN_HOSTCC
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index e5320f6c8654..017444b5efed 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -4,9 +4,10 @@
>  
>  #include <uapi/linux/seccomp.h>
>  
> -#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC	| \
> -					 SECCOMP_FILTER_FLAG_LOG	| \
> -					 SECCOMP_FILTER_FLAG_SPEC_ALLOW)
> +#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
> +					 SECCOMP_FILTER_FLAG_LOG | \
> +					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
> +					 SECCOMP_FILTER_FLAG_NEW_LISTENER)
>  
>  #ifdef CONFIG_SECCOMP
>  
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 9efc0e73d50b..aa5878972128 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,9 +17,10 @@
>  #define SECCOMP_GET_ACTION_AVAIL	2
>  
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC	(1UL << 0)
> -#define SECCOMP_FILTER_FLAG_LOG		(1UL << 1)
> -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW	(1UL << 2)
> +#define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
> +#define SECCOMP_FILTER_FLAG_LOG			(1UL << 1)
> +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
> +#define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
>  
>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -35,6 +36,7 @@
>  #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
>  #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
> @@ -60,4 +62,29 @@ struct seccomp_data {
>  	__u64 args[6];
>  };
>  
> +struct seccomp_notif {
> +	__u16 len;
> +	__u64 id;
> +	__u32 pid;
> +	__u8 signalled;
> +	struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +	__u16 len;
> +	__u64 id;
> +	__s32 error;
> +	__s64 val;
> +};
> +
> +#define SECCOMP_IOC_MAGIC		0xF7
> +
> +/* Flags for seccomp notification fd ioctl. */
> +#define SECCOMP_NOTIF_RECV		_IOWR(SECCOMP_IOC_MAGIC, 0,	\
> +						struct seccomp_notif)
> +#define SECCOMP_NOTIF_SEND		_IOWR(SECCOMP_IOC_MAGIC, 1,	\
> +						struct seccomp_notif_resp)
> +#define SECCOMP_NOTIF_IS_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
> +						__u64)
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index fd023ac24e10..a09eb5c05f68 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -33,6 +33,7 @@
>  #endif
>  
>  #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/pid.h>
>  #include <linux/ptrace.h>
> @@ -40,6 +41,53 @@
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +	SECCOMP_NOTIFY_INIT,
> +	SECCOMP_NOTIFY_SENT,
> +	SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +	/* The struct pid of the task whose filter triggered the notification */
> +	struct pid *pid;
> +
> +	/* The "cookie" for this request; this is unique for this filter. */
> +	u32 id;
> +
> +	/* Whether or not this task has been given an interruptible signal. */
> +	bool signalled;
> +
> +	/*
> +	 * The seccomp data. This pointer is valid the entire time this
> +	 * notification is active, since it comes from __seccomp_filter which
> +	 * eclipses the entire lifecycle here.
> +	 */
> +	const struct seccomp_data *data;
> +
> +	/*
> +	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +	 * struct seccomp_knotif is created and starts out in INIT. Once the
> +	 * handler reads the notification off of an FD, it transitions to SENT.
> +	 * If a signal is received the state transitions back to INIT and
> +	 * another message is sent. When the userspace handler replies, state
> +	 * transitions to REPLIED.
> +	 */
> +	enum notify_state state;
> +
> +	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +	int error;
> +	long val;
> +
> +	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +	struct completion ready;
> +
> +	struct list_head list;
> +};
> +#endif
> +
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
>   *
> @@ -66,6 +114,30 @@ struct seccomp_filter {
>  	bool log;
>  	struct seccomp_filter *prev;
>  	struct bpf_prog *prog;
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +	/*
> +	 * A semaphore that users of this notification can wait on for
> +	 * changes. Actual reads and writes are still controlled with
> +	 * filter->notify_lock.
> +	 */
> +	struct semaphore request;
> +
> +	/* A lock for all notification-related accesses. */
> +	struct mutex notify_lock;
> +
> +	/* Is there currently an attached listener? */
> +	bool has_listener;
> +
> +	/* The id of the next request. */
> +	u64 next_id;
> +
> +	/* A list of struct seccomp_knotif elements. */
> +	struct list_head notifications;
> +
> +	/* A wait queue for poll. */
> +	wait_queue_head_t wqh;
> +#endif
>  };
>  
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -359,6 +431,19 @@ static inline void seccomp_sync_threads(unsigned long flags)
>  	}
>  }
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static void init_user_notification(struct seccomp_filter *sfilter)
> +{
> +	mutex_init(&sfilter->notify_lock);
> +	sema_init(&sfilter->request, 0);
> +	INIT_LIST_HEAD(&sfilter->notifications);
> +	sfilter->next_id = get_random_u64();
> +	init_waitqueue_head(&sfilter->wqh);
> +}
> +#else
> +static inline void init_user_notification(struct seccomp_filter *sfilter) { }
> +#endif
> +
>  /**
>   * seccomp_prepare_filter: Prepares a seccomp filter for use.
>   * @fprog: BPF program to install
> @@ -392,6 +477,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>  	if (!sfilter)
>  		return ERR_PTR(-ENOMEM);
>  
> +	init_user_notification(sfilter);
> +
>  	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>  					seccomp_check_filter, save_orig);
>  	if (ret < 0) {
> @@ -556,13 +643,15 @@ static void seccomp_send_sigsys(int syscall, int reason)
>  #define SECCOMP_LOG_TRACE		(1 << 4)
>  #define SECCOMP_LOG_LOG			(1 << 5)
>  #define SECCOMP_LOG_ALLOW		(1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
>  
>  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
>  				    SECCOMP_LOG_KILL_THREAD  |
>  				    SECCOMP_LOG_TRAP  |
>  				    SECCOMP_LOG_ERRNO |
>  				    SECCOMP_LOG_TRACE |
> -				    SECCOMP_LOG_LOG;
> +				    SECCOMP_LOG_LOG |
> +				    SECCOMP_LOG_USER_NOTIF;
>  
>  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>  			       bool requested)
> @@ -581,6 +670,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>  	case SECCOMP_RET_TRACE:
>  		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
>  		break;
> +	case SECCOMP_RET_USER_NOTIF:
> +		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> +		break;
>  	case SECCOMP_RET_LOG:
>  		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
>  		break;
> @@ -651,6 +743,83 @@ void secure_computing_strict(int this_syscall)
>  }
>  #else
>  
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> +	/* Note: overflow is ok here, the id just needs to be unique */
> +	return filter->next_id++;
> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +					 struct seccomp_filter *match,
> +					 const struct seccomp_data *sd)
> +{
> +	int err;
> +	long ret = 0;
> +	struct seccomp_knotif n = {};
> +
> +	mutex_lock(&match->notify_lock);
> +	err = -ENOSYS;
> +	if (!match->has_listener)
> +		goto out;
> +
> +	n.pid = task_pid(current);
> +	n.state = SECCOMP_NOTIFY_INIT;
> +	n.data = sd;
> +	n.id = seccomp_next_notify_id(match);
> +	init_completion(&n.ready);
> +
> +	list_add(&n.list, &match->notifications);
> +	wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
> +
> +	mutex_unlock(&match->notify_lock);
> +	up(&match->request);
> +
> +	err = wait_for_completion_interruptible(&n.ready);
> +	mutex_lock(&match->notify_lock);
> +
> +	/*
> +	 * Here it's possible we got a signal and then had to wait on the mutex
> +	 * while the reply was sent, so let's be sure there wasn't a response
> +	 * in the meantime.
> +	 */
> +	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +		/*
> +		 * We got a signal. Let's tell userspace about it (potentially
> +		 * again, if we had already notified them about the first one).
> +		 */
> +		n.signalled = true;
> +		if (n.state == SECCOMP_NOTIFY_SENT) {
> +			n.state = SECCOMP_NOTIFY_INIT;
> +			up(&match->request);
> +		}
> +		mutex_unlock(&match->notify_lock);
> +		err = wait_for_completion_killable(&n.ready);
> +		mutex_lock(&match->notify_lock);
> +		if (err < 0)
> +			goto remove_list;
> +	}
> +
> +	ret = n.val;
> +	err = n.error;
> +
> +remove_list:
> +	list_del(&n.list);
> +out:
> +	mutex_unlock(&match->notify_lock);
> +	syscall_set_return_value(current, task_pt_regs(current),
> +				 err, ret);
> +}
> +#else
> +static void seccomp_do_user_notification(int this_syscall,
> +					 struct seccomp_filter *match,
> +					 const struct seccomp_data *sd)
> +{
> +	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_USER_NOTIF, true);
> +	do_exit(SIGSYS);
> +}
> +#endif
> +
>  #ifdef CONFIG_SECCOMP_FILTER
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  			    const bool recheck_after_trace)
> @@ -728,6 +897,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  
>  		return 0;
>  
> +	case SECCOMP_RET_USER_NOTIF:
> +		seccomp_do_user_notification(this_syscall, match, sd);
> +		goto skip;
>  	case SECCOMP_RET_LOG:
>  		seccomp_log(this_syscall, 0, action, true);
>  		return 0;
> @@ -834,6 +1006,9 @@ static long seccomp_set_mode_strict(void)
>  }
>  
>  #ifdef CONFIG_SECCOMP_FILTER
> +static struct file *init_listener(struct task_struct *,
> +				  struct seccomp_filter *);
> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -853,6 +1028,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>  	struct seccomp_filter *prepared = NULL;
>  	long ret = -EINVAL;
> +	int listener = 0;
> +	struct file *listener_f = NULL;
>  
>  	/* Validate flags. */
>  	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -863,13 +1040,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	if (IS_ERR(prepared))
>  		return PTR_ERR(prepared);
>  
> +	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +		listener = get_unused_fd_flags(0);
> +		if (listener < 0) {
> +			ret = listener;
> +			goto out_free;
> +		}
> +
> +		listener_f = init_listener(current, prepared);
> +		if (IS_ERR(listener_f)) {
> +			put_unused_fd(listener);
> +			ret = PTR_ERR(listener_f);
> +			goto out_free;
> +		}
> +	}
> +
>  	/*
>  	 * Make sure we cannot change seccomp or nnp state via TSYNC
>  	 * while another thread is in the middle of calling exec.
>  	 */
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>  	    mutex_lock_killable(&current->signal->cred_guard_mutex))
> -		goto out_free;
> +		goto out_put_fd;
>  
>  	spin_lock_irq(&current->sighand->siglock);
>  
> @@ -887,6 +1079,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	spin_unlock_irq(&current->sighand->siglock);
>  	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>  		mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +		if (ret < 0) {
> +			fput(listener_f);
> +			put_unused_fd(listener);
> +		} else {
> +			fd_install(listener, listener_f);
> +			ret = listener;
> +		}
> +	}
>  out_free:
>  	seccomp_filter_free(prepared);
>  	return ret;
> @@ -915,6 +1117,9 @@ static long seccomp_get_action_avail(const char __user *uaction)
>  	case SECCOMP_RET_LOG:
>  	case SECCOMP_RET_ALLOW:
>  		break;
> +	case SECCOMP_RET_USER_NOTIF:
> +		if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION))
> +			break;
>  	default:
>  		return -EOPNOTSUPP;
>  	}
> @@ -1111,6 +1316,7 @@ long seccomp_get_metadata(struct task_struct *task,
>  #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
>  #define SECCOMP_RET_TRAP_NAME		"trap"
>  #define SECCOMP_RET_ERRNO_NAME		"errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
>  #define SECCOMP_RET_TRACE_NAME		"trace"
>  #define SECCOMP_RET_LOG_NAME		"log"
>  #define SECCOMP_RET_ALLOW_NAME		"allow"
> @@ -1120,6 +1326,7 @@ static const char seccomp_actions_avail[] =
>  				SECCOMP_RET_KILL_THREAD_NAME	" "
>  				SECCOMP_RET_TRAP_NAME		" "
>  				SECCOMP_RET_ERRNO_NAME		" "
> +				SECCOMP_RET_USER_NOTIF_NAME     " "
>  				SECCOMP_RET_TRACE_NAME		" "
>  				SECCOMP_RET_LOG_NAME		" "
>  				SECCOMP_RET_ALLOW_NAME;
> @@ -1137,6 +1344,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
>  	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
>  	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
>  	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> +	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
>  	{ }
>  };
>  
> @@ -1342,3 +1550,244 @@ static int __init seccomp_sysctl_init(void)
>  device_initcall(seccomp_sysctl_init)
>  
>  #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	struct seccomp_knotif *knotif;
> +
> +	mutex_lock(&filter->notify_lock);
> +
> +	/*
> +	 * If this file is being closed because e.g. the task who owned it
> +	 * died, let's wake everyone up who was waiting on us.
> +	 */
> +	list_for_each_entry(knotif, &filter->notifications, list) {
> +		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> +			continue;
> +
> +		knotif->state = SECCOMP_NOTIFY_REPLIED;
> +		knotif->error = -ENOSYS;
> +		knotif->val = 0;
> +
> +		complete(&knotif->ready);
> +	}
> +
> +	wake_up_all(&filter->wqh);
> +	filter->has_listener = false;
> +	mutex_unlock(&filter->notify_lock);
> +	__put_seccomp_filter(filter);
> +	return 0;
> +}
> +
> +static long seccomp_notify_recv(struct seccomp_filter *filter,
> +				unsigned long arg)
> +{
> +	struct seccomp_knotif *knotif = NULL, *cur;
> +	struct seccomp_notif unotif = {};
> +	ssize_t ret;
> +	u16 size;
> +	void __user *buf = (void __user *)arg;
> +
> +	if (copy_from_user(&size, buf, sizeof(size)))
> +		return -EFAULT;
> +
> +	ret = down_interruptible(&filter->request);
> +	if (ret < 0)
> +		return ret;
> +
> +	mutex_lock(&filter->notify_lock);
> +	list_for_each_entry(cur, &filter->notifications, list) {
> +		if (cur->state == SECCOMP_NOTIFY_INIT) {
> +			knotif = cur;
> +			break;
> +		}
> +	}
> +
> +	/*
> +	 * If we didn't find a notification, it could be that the task was
> +	 * interrupted between the time we were woken and when we were able to
> +	 * acquire the rw lock.
> +	 */
> +	if (!knotif) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	size = min_t(size_t, size, sizeof(unotif));
> +
> +	unotif.len = size;
> +	unotif.id = knotif->id;
> +	unotif.pid = pid_vnr(knotif->pid);
> +	unotif.signalled = knotif->signalled;
> +	unotif.data = *(knotif->data);
> +
> +	if (copy_to_user(buf, &unotif, size)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = sizeof(unotif);
> +	knotif->state = SECCOMP_NOTIFY_SENT;
> +	wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
> +
> +out:
> +	mutex_unlock(&filter->notify_lock);
> +	return ret;
> +}
> +
> +static long seccomp_notify_send(struct seccomp_filter *filter,
> +				unsigned long arg)
> +{
> +	struct seccomp_notif_resp resp = {};
> +	struct seccomp_knotif *knotif = NULL;
> +	long ret;
> +	u16 size;
> +	void __user *buf = (void __user *)arg;
> +
> +	if (copy_from_user(&size, buf, sizeof(size)))
> +		return -EFAULT;
> +	size = min_t(size_t, size, sizeof(resp));
> +	if (copy_from_user(&resp, buf, size))
> +		return -EFAULT;
> +
> +	ret = mutex_lock_interruptible(&filter->notify_lock);
> +	if (ret < 0)
> +		return ret;
> +
> +	list_for_each_entry(knotif, &filter->notifications, list) {
> +		if (knotif->id == resp.id)
> +			break;
> +	}
> +
> +	if (!knotif || knotif->id != resp.id) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* Allow exactly one reply. */
> +	if (knotif->state != SECCOMP_NOTIFY_SENT) {
> +		ret = -EINPROGRESS;
> +		goto out;
> +	}
> +
> +	ret = size;
> +	knotif->state = SECCOMP_NOTIFY_REPLIED;
> +	knotif->error = resp.error;
> +	knotif->val = resp.val;
> +	complete(&knotif->ready);
> +out:
> +	mutex_unlock(&filter->notify_lock);
> +	return ret;
> +}
> +
> +static long seccomp_notify_is_id_valid(struct seccomp_filter *filter,
> +				       unsigned long arg)
> +{
> +	struct seccomp_knotif *knotif = NULL;
> +	void __user *buf = (void __user *)arg;
> +	u64 id;
> +
> +	if (copy_from_user(&id, buf, sizeof(id)))
> +		return -EFAULT;
> +
> +	list_for_each_entry(knotif, &filter->notifications, list) {
> +		if (knotif->id == id)
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> +				 unsigned long arg)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +
> +	switch (cmd) {
> +	case SECCOMP_NOTIF_RECV:
> +		return seccomp_notify_recv(filter, arg);
> +	case SECCOMP_NOTIF_SEND:
> +		return seccomp_notify_send(filter, arg);
> +	case SECCOMP_NOTIF_IS_ID_VALID:
> +		return seccomp_notify_is_id_valid(filter, arg);
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +static __poll_t seccomp_notify_poll(struct file *file,
> +				    struct poll_table_struct *poll_tab)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	__poll_t ret = 0;
> +	struct seccomp_knotif *cur;
> +
> +	poll_wait(file, &filter->wqh, poll_tab);
> +
> +	ret = mutex_lock_interruptible(&filter->notify_lock);
> +	if (ret < 0)
> +		return ret;
> +
> +	list_for_each_entry(cur, &filter->notifications, list) {
> +		if (cur->state == SECCOMP_NOTIFY_INIT)
> +			ret |= EPOLLIN | EPOLLRDNORM;
> +		if (cur->state == SECCOMP_NOTIFY_SENT)
> +			ret |= EPOLLOUT | EPOLLWRNORM;
> +		if (ret & EPOLLIN && ret & EPOLLOUT)
> +			break;
> +	}
> +
> +	mutex_unlock(&filter->notify_lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +	.poll = seccomp_notify_poll,
> +	.release = seccomp_notify_release,
> +	.unlocked_ioctl = seccomp_notify_ioctl,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> +				  struct seccomp_filter *filter)
> +{
> +	struct file *ret = ERR_PTR(-EBUSY);
> +	struct seccomp_filter *cur, *last_locked = NULL;
> +	int filter_nesting = 0;
> +
> +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +		mutex_lock_nested(&cur->notify_lock, filter_nesting);
> +		filter_nesting++;
> +		last_locked = cur;
> +		if (cur->has_listener)
> +			goto out;
> +	}
> +
> +	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +				 filter, O_RDWR);
> +	if (IS_ERR(ret))
> +		goto out;
> +
> +
> +	/* The file has a reference to it now */
> +	__get_seccomp_filter(filter);
> +	filter->has_listener = true;
> +
> +out:
> +	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +		mutex_unlock(&cur->notify_lock);
> +		if (cur == last_locked)
> +			break;
> +	}
> +
> +	return ret;
> +}
> +#else
> +static struct file *init_listener(struct task_struct *task,
> +				  struct seccomp_filter *filter)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +#endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index e1473234968d..89f2c788a06b 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -5,6 +5,7 @@
>   * Test code for seccomp bpf.
>   */
>  
> +#define _GNU_SOURCE
>  #include <sys/types.h>
>  
>  /*
> @@ -40,10 +41,12 @@
>  #include <sys/fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/times.h>
> +#include <sys/socket.h>
> +#include <sys/ioctl.h>
>  
> -#define _GNU_SOURCE
>  #include <unistd.h>
>  #include <sys/syscall.h>
> +#include <poll.h>
>  
>  #include "../kselftest_harness.h"
>  
> @@ -154,6 +157,34 @@ struct seccomp_metadata {
>  };
>  #endif
>  
> +#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
> +#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
> +
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> +
> +#define SECCOMP_IOC_MAGIC		0xF7
> +#define SECCOMP_NOTIF_RECV		_IOWR(SECCOMP_IOC_MAGIC, 0,	\
> +						struct seccomp_notif)
> +#define SECCOMP_NOTIF_SEND		_IOWR(SECCOMP_IOC_MAGIC, 1,	\
> +						struct seccomp_notif_resp)
> +#define SECCOMP_NOTIF_IS_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
> +						__u64)
> +struct seccomp_notif {
> +	__u16 len;
> +	__u64 id;
> +	__u32 pid;
> +	__u8 signalled;
> +	struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +	__u16 len;
> +	__u64 id;
> +	__s32 error;
> +	__s64 val;
> +};
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -2077,7 +2108,8 @@ TEST(detect_seccomp_filter_flags)
>  {
>  	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
>  				 SECCOMP_FILTER_FLAG_LOG,
> -				 SECCOMP_FILTER_FLAG_SPEC_ALLOW };
> +				 SECCOMP_FILTER_FLAG_SPEC_ALLOW,
> +				 SECCOMP_FILTER_FLAG_NEW_LISTENER };
>  	unsigned int flag, all_flags;
>  	int i;
>  	long ret;
> @@ -2933,6 +2965,373 @@ TEST(get_metadata)
>  	ASSERT_EQ(0, kill(pid, SIGKILL));
>  }
>  
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +	struct sock_filter filter[] = {
> +		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +			offsetof(struct seccomp_data, nr)),
> +		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +	};
> +
> +	struct sock_fprog prog = {
> +		.len = (unsigned short)ARRAY_SIZE(filter),
> +		.filter = filter,
> +	};
> +
> +	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +static int read_notif(int listener, struct seccomp_notif *req)
> +{
> +	int ret;
> +
> +	do {
> +		errno = 0;
> +		req->len = sizeof(*req);
> +		ret = ioctl(listener, SECCOMP_NOTIF_RECV, req);
> +	} while (ret == -1 && errno == ENOENT);
> +	return ret;
> +}
> +
> +static void signal_handler(int signal)
> +{
> +}
> +
> +#define USER_NOTIF_MAGIC 116983961184613L
> +TEST(get_user_notification_syscall)
> +{
> +	pid_t pid;
> +	long ret;
> +	int status, listener;
> +	struct seccomp_notif req = {};
> +	struct seccomp_notif_resp resp = {};
> +	struct pollfd pollfd;
> +
> +	struct sock_filter filter[] = {
> +		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> +	};
> +	struct sock_fprog prog = {
> +		.len = (unsigned short)ARRAY_SIZE(filter),
> +		.filter = filter,
> +	};
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	/* Check that we get -ENOSYS with no listener attached */
> +	if (pid == 0) {
> +		if (user_trap_syscall(__NR_getpid, 0) < 0)
> +			exit(1);
> +		ret = syscall(__NR_getpid);
> +		exit(ret >= 0 || errno != ENOSYS);
> +	}
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +	/* Add some no-op filters so that we (don't) trigger lockdep. */
> +	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +
> +	/* Check that the basic notification machinery works */
> +	listener = user_trap_syscall(__NR_getpid,
> +				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
> +	EXPECT_GE(listener, 0);
> +
> +	/* Installing a second listener in the chain should EBUSY */
> +	EXPECT_EQ(user_trap_syscall(__NR_getpid,
> +				    SECCOMP_FILTER_FLAG_NEW_LISTENER),
> +		  -1);
> +	EXPECT_EQ(errno, EBUSY);
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		ret = syscall(__NR_getpid);
> +		exit(ret != USER_NOTIF_MAGIC);
> +	}
> +
> +	pollfd.fd = listener;
> +	pollfd.events = POLLIN | POLLOUT;
> +
> +	EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +	EXPECT_EQ(pollfd.revents, POLLIN);
> +
> +	req.len = sizeof(req);
> +	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +
> +	pollfd.fd = listener;
> +	pollfd.events = POLLIN | POLLOUT;
> +
> +	EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +	EXPECT_EQ(pollfd.revents, POLLOUT);
> +
> +	EXPECT_EQ(req.data.nr,  __NR_getpid);
> +
> +	resp.len = sizeof(resp);
> +	resp.id = req.id;
> +	resp.error = 0;
> +	resp.val = USER_NOTIF_MAGIC;
> +
> +	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +	/*
> +	 * Check that nothing bad happens when we kill the task in the middle
> +	 * of a syscall.
> +	 */
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		ret = syscall(__NR_getpid);
> +		exit(ret != USER_NOTIF_MAGIC);
> +	}
> +
> +	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_IS_ID_VALID, &req.id), 1);
> +
> +	EXPECT_EQ(kill(pid, SIGKILL), 0);
> +	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +
> +	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_IS_ID_VALID, &req.id), 0);
> +
> +	resp.id = req.id;
> +	ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> +	EXPECT_EQ(ret, -1);
> +	EXPECT_EQ(errno, EINVAL);
> +
> +	/*
> +	 * Check that we get another notification about a signal in the middle
> +	 * of a syscall.
> +	 */
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> +			perror("signal");
> +			exit(1);
> +		}
> +		ret = syscall(__NR_getpid);
> +		exit(ret != USER_NOTIF_MAGIC);
> +	}
> +
> +	ret = read_notif(listener, &req);
> +	EXPECT_EQ(ret, sizeof(req));
> +	EXPECT_EQ(errno, 0);
> +
> +	EXPECT_EQ(kill(pid, SIGUSR1), 0);
> +
> +	ret = read_notif(listener, &req);
> +	EXPECT_EQ(req.signalled, 1);
> +	EXPECT_EQ(ret, sizeof(req));
> +	EXPECT_EQ(errno, 0);
> +
> +	resp.len = sizeof(resp);
> +	resp.id = req.id;
> +	ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> +	EXPECT_EQ(ret, sizeof(resp));
> +	EXPECT_EQ(errno, 0);
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +	/*
> +	 * Check that we get an ENOSYS when the listener is closed.
> +	 */
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +	if (pid == 0) {
> +		close(listener);
> +		ret = syscall(__NR_getpid);
> +		exit(ret != -1 && errno != ENOSYS);
> +	}
> +
> +	close(listener);
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
> +/*
> + * Check that a pid in a child namespace still shows up as valid in ours.
> + */
> +TEST(user_notification_child_pid_ns)
> +{
> +	pid_t pid;
> +	int status, listener;
> +	int sk_pair[2];
> +	char c;
> +	struct seccomp_notif req = {};
> +	struct seccomp_notif_resp resp = {};
> +
> +	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +	ASSERT_EQ(unshare(CLONE_NEWPID), 0);
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +		/* Signal we're ready and have installed the filter. */
> +		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
> +
> +		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +		EXPECT_EQ(c, 'H');
> +
> +		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
> +	}
> +
> +	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> +	EXPECT_EQ(c, 'J');
> +
> +	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
> +	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> +	EXPECT_GE(listener, 0);
> +	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
> +
> +	/* Now signal we are done and respond with magic */
> +	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +
> +	req.len = sizeof(req);
> +	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +	EXPECT_EQ(req.pid, pid);
> +
> +	resp.len = sizeof(resp);
> +	resp.id = req.id;
> +	resp.error = 0;
> +	resp.val = USER_NOTIF_MAGIC;
> +
> +	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +	close(listener);
> +}
> +
> +/*
> + * Check that a pid in a sibling (i.e. unrelated) namespace shows up as 0, i.e.
> + * invalid.
> + */
> +TEST(user_notification_sibling_pid_ns)
> +{
> +	pid_t pid, pid2;
> +	int status, listener;
> +	int sk_pair[2];
> +	char c;
> +	struct seccomp_notif req = {};
> +	struct seccomp_notif_resp resp = {};
> +
> +	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		int child_pair[2];
> +
> +		ASSERT_EQ(unshare(CLONE_NEWPID), 0);
> +
> +		ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, child_pair), 0);
> +
> +		pid2 = fork();
> +		ASSERT_GE(pid2, 0);
> +
> +		if (pid2 == 0) {
> +			close(child_pair[0]);
> +			EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +			/* Signal we're ready and have installed the filter. */
> +			EXPECT_EQ(write(child_pair[1], "J", 1), 1);
> +
> +			EXPECT_EQ(read(child_pair[1], &c, 1), 1);
> +			EXPECT_EQ(c, 'H');
> +
> +			exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
> +		}
> +
> +		/* check that child has installed the filter */
> +		EXPECT_EQ(read(child_pair[0], &c, 1), 1);
> +		EXPECT_EQ(c, 'J');
> +
> +		/* tell parent who child is */
> +		EXPECT_EQ(write(sk_pair[1], &pid2, sizeof(pid2)), sizeof(pid2));
> +
> +		/* parent has installed listener, tell child to call syscall */
> +		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +		EXPECT_EQ(c, 'H');
> +		EXPECT_EQ(write(child_pair[0], "H", 1), 1);
> +
> +		EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
> +		EXPECT_EQ(true, WIFEXITED(status));
> +		EXPECT_EQ(0, WEXITSTATUS(status));
> +		exit(WEXITSTATUS(status));
> +	}
> +
> +	EXPECT_EQ(read(sk_pair[0], &pid2, sizeof(pid2)), sizeof(pid2));
> +
> +	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid2), 0);
> +	EXPECT_EQ(waitpid(pid2, NULL, 0), pid2);
> +	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid2, 0);
> +	EXPECT_GE(listener, 0);
> +	EXPECT_EQ(errno, 0);
> +	EXPECT_EQ(ptrace(PTRACE_DETACH, pid2, NULL, 0), 0);
> +
> +	/* Create the sibling ns, and sibling in it. */
> +	EXPECT_EQ(unshare(CLONE_NEWPID), 0);
> +	EXPECT_EQ(errno, 0);
> +
> +	pid2 = fork();
> +	EXPECT_GE(pid2, 0);
> +
> +	if (pid2 == 0) {
> +		req.len = sizeof(req);
> +		ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +		/*
> +		 * The pid should be 0, i.e. the task is in some namespace that
> +		 * we can't "see".
> +		 */
> +		ASSERT_EQ(req.pid, 0);
> +
> +		resp.len = sizeof(resp);
> +		resp.id = req.id;
> +		resp.error = 0;
> +		resp.val = USER_NOTIF_MAGIC;
> +
> +		ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +		exit(0);
> +	}
> +
> +	close(listener);
> +
> +	/* Now signal we are done setting up sibling listener. */
> +	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +	EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
> +
>  /*
>   * TODO:
>   * - add microbenchmarks
> -- 
> 2.17.1
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
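
The notification lifecycle that seccomp_notify_recv(), seccomp_notify_send(),
seccomp_notify_poll() and seccomp_notify_release() implement in the quoted
patch can be modelled in plain userspace C. This is only a sketch of the state
transitions as they read in the diff: the model_* names, the flat array (the
kernel uses a locked list) and the MODEL_POLL* bits are assumptions made for
illustration, and no locking or wakeups are modelled.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace mirror of the SECCOMP_NOTIFY_* states in the patch. */
enum notify_state { NOTIFY_INIT, NOTIFY_SENT, NOTIFY_REPLIED };

struct model_notif {
	unsigned long long id;
	enum notify_state state;
	int error;
	long long val;
};

/* Stand-ins for EPOLLIN / EPOLLOUT; the values are arbitrary. */
#define MODEL_POLLIN  1u
#define MODEL_POLLOUT 2u

/* Like seccomp_notify_recv(): hand out the first INIT entry and mark it SENT. */
static struct model_notif *model_recv(struct model_notif *n, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (n[i].state == NOTIFY_INIT) {
			n[i].state = NOTIFY_SENT;
			return &n[i];
		}
	}
	return NULL; /* the patch returns -ENOENT in this case */
}

/*
 * Like seccomp_notify_send(): an unknown id is -EINVAL, and only an entry in
 * the SENT state may be answered, so each notification gets exactly one reply.
 * Returns 0 on success (the patch returns the copied size instead).
 */
static int model_send(struct model_notif *n, size_t count,
		      unsigned long long id, int error, long long val)
{
	for (size_t i = 0; i < count; i++) {
		if (n[i].id != id)
			continue;
		if (n[i].state != NOTIFY_SENT)
			return -EINPROGRESS;
		n[i].state = NOTIFY_REPLIED;
		n[i].error = error;
		n[i].val = val;
		return 0;
	}
	return -EINVAL;
}

/* Like seccomp_notify_poll(): readable if anything is INIT, writable if SENT. */
static unsigned int model_poll(const struct model_notif *n, size_t count)
{
	unsigned int mask = 0;

	for (size_t i = 0; i < count; i++) {
		if (n[i].state == NOTIFY_INIT)
			mask |= MODEL_POLLIN;
		if (n[i].state == NOTIFY_SENT)
			mask |= MODEL_POLLOUT;
	}
	return mask;
}

/* Like seccomp_notify_release(): fail every unanswered entry with -ENOSYS. */
static void model_release(struct model_notif *n, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (n[i].state == NOTIFY_REPLIED)
			continue;
		n[i].state = NOTIFY_REPLIED;
		n[i].error = -ENOSYS;
		n[i].val = 0;
	}
}
```

With a single pending notification this matches what the selftest expects:
poll reports only "readable" before the receive and only "writable" after it,
a second reply to the same id is refused, and closing the listener turns any
unanswered notification into ENOSYS for the trapped task.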



* Re: [PATCH v5 2/5] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  2018-08-28 14:36 ` [PATCH v5 2/5] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
@ 2018-08-29 19:07   ` Christian Brauner
  0 siblings, 0 replies; 9+ messages in thread
From: Christian Brauner @ 2018-08-29 19:07 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, linux-api, containers, Akihiro Suda, Oleg Nesterov,
	linux-kernel, Eric W . Biederman, Christian Brauner,
	Andy Lutomirski, Serge Hallyn, Jann Horn

On Tue, Aug 28, 2018 at 08:36:00AM -0600, Tycho Andersen wrote:
> In the next commit we'll use this same mnemonic to get a listener for the
> nth filter, so we need it available outside of CHECKPOINT_RESTORE in the
> USER_NOTIFICATION case as well.
> 
> v2: new in v2
> v3: no changes
> v4: no changes
> v5: switch to CHECKPOINT_RESTORE || USER_NOTIFICATION to avoid warning when
>     only CONFIG_SECCOMP_FILTER is enabled.
> 
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  kernel/seccomp.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)

(Putting Serge and Jann in Cc. They seem to have been left out by
accident. :))

Acked-by: Christian Brauner <christian@brauner.io>
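
The guard layout in the hunk quoted below can be tried in isolation. The
following is a minimal sketch, assuming only CONFIG_SECCOMP_USER_NOTIFICATION
is set; the *_stub functions and HAVE_* macros are hypothetical names used
purely for illustration and do not exist in the kernel.

```c
/*
 * Hypothetical stand-in for a Kconfig symbol. CONFIG_CHECKPOINT_RESTORE is
 * deliberately left undefined to model a notification-only build.
 */
#define CONFIG_SECCOMP_USER_NOTIFICATION 1

#if defined(CONFIG_CHECKPOINT_RESTORE) || \
	defined(CONFIG_SECCOMP_USER_NOTIFICATION)
/* Shared helper: built when either feature needs it. */
static int get_nth_filter_stub(void)
{
	return 1;
}
#define HAVE_GET_NTH_FILTER 1

#if defined(CONFIG_CHECKPOINT_RESTORE)
/* Checkpoint/restore-only users of the helper live in the inner block. */
static int seccomp_get_filter_stub(void)
{
	return get_nth_filter_stub();
}
#define HAVE_SECCOMP_GET_FILTER 1
#endif /* CONFIG_CHECKPOINT_RESTORE */
#endif /* CONFIG_CHECKPOINT_RESTORE || CONFIG_SECCOMP_USER_NOTIFICATION */

/* Fallbacks so callers can test which pieces were compiled in. */
#ifndef HAVE_GET_NTH_FILTER
#define HAVE_GET_NTH_FILTER 0
#endif
#ifndef HAVE_SECCOMP_GET_FILTER
#define HAVE_SECCOMP_GET_FILTER 0
#endif
```

With neither symbol defined the static helper is not compiled at all, which
appears to be how the v5 change avoids an unused-function warning when only
CONFIG_SECCOMP_FILTER is enabled.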

> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index a09eb5c05f68..ed786655186d 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1188,7 +1188,8 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
>  	return do_seccomp(op, 0, uargs);
>  }
>  
> -#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> +#if defined(CONFIG_CHECKPOINT_RESTORE) || \
> +	defined(CONFIG_SECCOMP_USER_NOTIFICATION)
>  static struct seccomp_filter *get_nth_filter(struct task_struct *task,
>  					     unsigned long filter_off)
>  {
> @@ -1235,6 +1236,7 @@ static struct seccomp_filter *get_nth_filter(struct task_struct *task,
>  	return filter;
>  }
>  
> +#if defined(CONFIG_CHECKPOINT_RESTORE)
>  long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
>  			void __user *data)
>  {
> @@ -1307,7 +1309,8 @@ long seccomp_get_metadata(struct task_struct *task,
>  	__put_seccomp_filter(filter);
>  	return ret;
>  }
> -#endif
> +#endif /* CONFIG_CHECKPOINT_RESTORE */
> +#endif /* CONFIG_SECCOMP_FILTER */
>  
>  #ifdef CONFIG_SYSCTL
>  
> -- 
> 2.17.1
> 


* Re: [PATCH v5 1/5] seccomp: add a return code to trap to userspace
  2018-08-29 18:59   ` Christian Brauner
@ 2018-08-29 21:21     ` Tycho Andersen
  0 siblings, 0 replies; 9+ messages in thread
From: Tycho Andersen @ 2018-08-29 21:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, linux-api, containers, Akihiro Suda, Oleg Nesterov,
	linux-kernel, Eric W . Biederman, Christian Brauner,
	Andy Lutomirski

On Wed, Aug 29, 2018 at 08:59:18PM +0200, Christian Brauner wrote:
> On Tue, Aug 28, 2018 at 08:35:59AM -0600, Tycho Andersen wrote:
> > +Users can ``read()`` or ``poll()`` on a seccomp notification fd to receive a
> 
> You have changed this from read() to ioctl(), right?

Derp, yes. I'll re-write this bit. Thanks for the others, I'll fix
them too. It's not bike shedding if it's incorrect :)

Thanks,

Tycho
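
For the documentation point above, the ioctl()-based receive/reply cycle from
this series could be sketched as follows. The struct layouts and ioctl numbers
are copied from the fallback definitions in the quoted selftest; the v5_/V5_
prefixes and both helper functions are hypothetical names, chosen so the sketch
does not clash with whatever <linux/seccomp.h> is installed, and the layout is
specific to this v5 posting.

```c
#include <errno.h>
#include <linux/seccomp.h>	/* struct seccomp_data */
#include <linux/types.h>
#include <string.h>
#include <sys/ioctl.h>

/* Layouts as proposed in this series; prefixed to avoid header clashes. */
struct v5_notif {
	__u16 len;
	__u64 id;
	__u32 pid;
	__u8 signalled;
	struct seccomp_data data;
};

struct v5_notif_resp {
	__u16 len;
	__u64 id;
	__s32 error;
	__s64 val;
};

#define V5_IOC_MAGIC	0xF7
#define V5_NOTIF_RECV	_IOWR(V5_IOC_MAGIC, 0, struct v5_notif)
#define V5_NOTIF_SEND	_IOWR(V5_IOC_MAGIC, 1, struct v5_notif_resp)

/* Build the reply for a received notification (hypothetical helper). */
static void v5_fill_resp(const struct v5_notif *req,
			 struct v5_notif_resp *resp, int error, long long val)
{
	memset(resp, 0, sizeof(*resp));
	resp->len = sizeof(*resp);	/* size the caller was built against */
	resp->id = req->id;		/* must echo the id from the request */
	resp->error = error;
	resp->val = val;
}

/*
 * Receive one notification from the listener fd and answer it.
 * Returns 0 on success, -1 with errno set by ioctl() on failure.
 */
static int v5_answer_one(int listener, int error, long long val)
{
	struct v5_notif req;
	struct v5_notif_resp resp;

	memset(&req, 0, sizeof(req));
	req.len = sizeof(req);
	if (ioctl(listener, V5_NOTIF_RECV, &req) < 0)
		return -1;

	v5_fill_resp(&req, &resp, error, val);
	if (ioctl(listener, V5_NOTIF_SEND, &resp) < 0)
		return -1;
	return 0;
}
```

A supervisor would typically poll() the listener for POLLIN before calling
v5_answer_one(), and may want to retry on ENOENT the way read_notif() in the
selftest does.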


Thread overview: 9+ messages
2018-08-28 14:35 [PATCH v5 0/5] seccomp trap to userspace Tycho Andersen
2018-08-28 14:35 ` [PATCH v5 1/5] seccomp: add a return code to " Tycho Andersen
2018-08-29 18:59   ` Christian Brauner
2018-08-29 21:21     ` Tycho Andersen
2018-08-28 14:36 ` [PATCH v5 2/5] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
2018-08-29 19:07   ` Christian Brauner
2018-08-28 14:36 ` [PATCH v5 3/5] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
2018-08-28 14:36 ` [PATCH v5 4/5] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen
2018-08-28 14:36 ` [PATCH v5 5/5] samples: add an example of seccomp user trap Tycho Andersen
