linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v7 0/6] seccomp trap to userspace
@ 2018-09-27 15:11 Tycho Andersen
  2018-09-27 15:11 ` [PATCH v7 1/6] seccomp: add a return code to " Tycho Andersen
                   ` (6 more replies)
  0 siblings, 7 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 15:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel, Tycho Andersen

Hi all,

Here's v7 of the seccomp trap to userspace set. There are various minor
changes and bug fixes, but two major changes:

* We now pass fds to the tracee via an ioctl, and do it immediately when
  the ioctl is called. For this we needed some help from the vfs, so
  I've put the one patch in this series and cc'd fsdevel. This does have
  the advantage that the feature is now totally decoupled from the rest
  of the set, which is itself useful (thanks Andy!)

* Instead of putting all of the notification related stuff into the
  struct seccomp_filter, it now lives in its own struct notification,
  which is pointed to by struct seccomp_filter. This will save a lot of
  memory (thanks Tyler!)

v6 discussion: https://lkml.org/lkml/2018/9/6/769

Thoughts welcome,

Tycho

Tycho Andersen (6):
  seccomp: add a return code to trap to userspace
  seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  seccomp: add a way to get a listener fd from ptrace
  files: add a replace_fd_files() function
  seccomp: add a way to pass FDs via a notification fd
  samples: add an example of seccomp user trap

 Documentation/ioctl/ioctl-number.txt          |   1 +
 .../userspace-api/seccomp_filter.rst          |  89 +++
 fs/file.c                                     |  22 +-
 include/linux/file.h                          |   8 +
 include/linux/seccomp.h                       |  14 +-
 include/uapi/linux/ptrace.h                   |   2 +
 include/uapi/linux/seccomp.h                  |  42 +-
 kernel/ptrace.c                               |   4 +
 kernel/seccomp.c                              | 527 ++++++++++++++-
 samples/seccomp/.gitignore                    |   1 +
 samples/seccomp/Makefile                      |   7 +-
 samples/seccomp/user-trap.c                   | 312 +++++++++
 tools/testing/selftests/seccomp/seccomp_bpf.c | 607 +++++++++++++++++-
 13 files changed, 1617 insertions(+), 19 deletions(-)
 create mode 100644 samples/seccomp/user-trap.c

-- 
2.17.1


* [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 15:11 [PATCH v7 0/6] seccomp trap to userspace Tycho Andersen
@ 2018-09-27 15:11 ` Tycho Andersen
  2018-09-27 21:31   ` Kees Cook
                     ` (2 more replies)
  2018-09-27 15:11 ` [PATCH v7 2/6] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 15:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel, Tycho Andersen

This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mknod(), since this checks
capable(CAP_MKNOD). However, harmless devices like /dev/null or
/dev/zero should be ok for containers to mknod, but we'd like to avoid hard
coding some whitelist in the kernel. Another example is mount(), which has
many security restrictions for good reason, but configuration or runtime
knowledge could potentially be used to relax these restrictions.

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since upstart itself uses ptrace to start services.
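
For comparison with the SECCOMP_RET_TRACE approach, here is a rough sketch of
the kind of filter this series expects, modeled on the selftest helper later
in this patch. The helper name build_notif_filter() is made up for
illustration, and the fallback #define uses the SECCOMP_RET_USER_NOTIF value
proposed in this patch. A real filter should also check seccomp_data.arch,
which this sketch skips.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_RET_USER_NOTIF
/* Assumption: value proposed by this patch; not in older uapi headers. */
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif

/*
 * Hypothetical helper: fill `out` with a four-instruction classic BPF
 * program that returns SECCOMP_RET_USER_NOTIF for syscall `nr` and
 * SECCOMP_RET_ALLOW for everything else. Returns the instruction count.
 */
static size_t build_notif_filter(struct sock_filter out[4], int nr)
{
	const struct sock_filter prog[] = {
		/* Load the syscall number from struct seccomp_data. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* If it matches, fall through; otherwise skip one insn. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};

	assert(out != NULL);
	memcpy(out, prog, sizeof(prog));
	return sizeof(prog) / sizeof(prog[0]);
}
```

The resulting array would be wrapped in a struct sock_fprog and installed with
seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog), as
the documentation hunk below describes.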

The actual implementation of this is fairly small, although getting the
synchronization right was/is slightly complex.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.
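
The "read everything first, then decide" rule can be sketched in plain C. This
is only an illustration of the ordering, not code from the series: struct
path_snapshot, snapshot_path() and policy_allows() are hypothetical names, and
the source buffer stands in for memory that a real handler would copy out of
the tracee (for example via ptrace()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical private copy of a path argument taken from the tracee. */
struct path_snapshot {
	char path[256];
};

/*
 * Copy the argument exactly once into handler-owned memory. Refuses input
 * that is not NUL-terminated within src_len or does not fit the snapshot.
 */
static bool snapshot_path(struct path_snapshot *snap, const char *src,
			  size_t src_len)
{
	const char *end;
	size_t n;

	assert(snap != NULL && src != NULL);
	end = memchr(src, '\0', src_len);
	if (!end)
		return false;
	n = (size_t)(end - src);
	if (n >= sizeof(snap->path))
		return false;
	memcpy(snap->path, src, n + 1);
	return true;
}

/* Hypothetical policy: only two harmless device paths are acceptable. */
static bool policy_allows(const struct path_snapshot *snap)
{
	return strcmp(snap->path, "/dev/null") == 0 ||
	       strcmp(snap->path, "/dev/zero") == 0;
}
```

Because policy_allows() only ever looks at the snapshot, a later write to the
original buffer cannot change the outcome. The handler must also act on the
snapshot, not on a second read, when it carries out the syscall on the task's
behalf.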

v2: * make id a u64; the idea here being that it will never overflow,
      because 64 bits is huge (one syscall every nanosecond => wrap every
      584 years) (Andy)
    * prevent nesting of user notifications: if someone is already attached
      to the tree in one place, nobody else can attach to the tree (Andy)
    * notify the listener of signals the tracee receives as well (Andy)
    * implement poll
v3: * lockdep fix (Oleg)
    * drop unnecessary WARN()s (Christian)
    * rearrange error returns to be more pretty (Christian)
    * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
v4: * fix implementation of poll to use poll_wait() (Jann)
    * change listener's fd flags to be 0 (Jann)
    * hoist filter initialization out of ifdefs to its own function
      init_user_notification()
    * add some more testing around poll() and closing the listener while a
      syscall is in action
    * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
      creates a new one (Matthew)
    * correctly handle pid namespaces, add some testcases (Matthew)
    * use EINPROGRESS instead of EINVAL when a notification response is
      written twice (Matthew)
    * fix comment typo from older version (SEND vs READ) (Matthew)
    * whitespace and logic simplification (Tobin)
    * add some Documentation/ bits on userspace trapping
v5: * fix documentation typos (Jann)
    * add signalled field to struct seccomp_notif (Jann)
    * switch to using ioctls instead of read()/write() for struct passing
      (Jann)
    * add an ioctl to ensure an id is still valid
v6: * docs typo fixes, update docs for ioctl() change (Christian)
v7: * switch struct seccomp_knotif's id member to a u64 (derp :)
    * use notify_lock in IS_ID_VALID query to avoid racing
    * s/signalled/signaled (Tyler)
    * fix docs to reflect that ids are not globally unique (Tyler)
    * add a test to check -ERESTARTSYS behavior (Tyler)
    * drop CONFIG_SECCOMP_USER_NOTIFICATION (Tyler)
    * reorder USER_NOTIF in seccomp return codes list (Tyler)
    * return size instead of sizeof(struct user_notif) (Tyler)
    * ENOENT instead of EINVAL when invalid id is passed (Tyler)
    * drop CONFIG_SECCOMP_USER_NOTIFICATION guards (Tyler)
    * s/IS_ID_VALID/ID_VALID and switch ioctl to be "well behaved" (Tyler)
    * add a new struct notification to minimize the additions to
      struct seccomp_filter, also pack the necessary additions a bit more
      cleverly (Tyler)
    * switch to keeping track of the task itself instead of the pid (we'll
      use this for implementing PUT_FD)
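
The "wrap every 584 years" figure in the v2 notes can be checked with a few
lines of integer arithmetic. This is a back-of-the-envelope sketch that
assumes a 365-day year. The function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Whole years until a 64-bit counter wraps when it is incremented
 * `per_second` times per second (365-day years, rounded down).
 */
static uint64_t years_until_u64_wrap(uint64_t per_second)
{
	const uint64_t seconds_per_year = 365ULL * 24 * 60 * 60;

	assert(per_second != 0);
	return UINT64_MAX / per_second / seconds_per_year;
}
```

Calling years_until_u64_wrap(1000000000ULL) models one notification per
nanosecond.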

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 Documentation/ioctl/ioctl-number.txt          |   1 +
 .../userspace-api/seccomp_filter.rst          |  73 +++
 include/linux/seccomp.h                       |   7 +-
 include/uapi/linux/seccomp.h                  |  33 +-
 kernel/seccomp.c                              | 436 +++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 413 ++++++++++++++++-
 6 files changed, 954 insertions(+), 9 deletions(-)

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 13a7c999c04a..31e9707f7e06 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -345,4 +345,5 @@ Code  Seq#(hex)	Include File		Comments
 					<mailto:raph@8d.com>
 0xF6	all	LTTng			Linux Trace Toolkit Next Generation
 					<mailto:mathieu.desnoyers@efficios.com>
+0xF7    00-1F   uapi/linux/seccomp.h
 0xFD	all	linux/dm-ioctl.h
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index 82a468bc7560..d2e61f1c0a0b 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -122,6 +122,11 @@ In precedence order, they are:
 	Results in the lower 16-bits of the return value being passed
 	to userland as the errno without executing the system call.
 
+``SECCOMP_RET_USER_NOTIF``:
+    Results in a ``struct seccomp_notif`` message sent on the userspace
+    notification fd, if it is attached, or ``-ENOSYS`` if it is not. See below
+    on discussion of how to handle user notifications.
+
 ``SECCOMP_RET_TRACE``:
 	When returned, this value will cause the kernel to attempt to
 	notify a ``ptrace()``-based tracer prior to executing the system
@@ -183,6 +188,74 @@ The ``samples/seccomp/`` directory contains both an x86-specific example
 and a more generic example of a higher level macro interface for BPF
 program generation.
 
+Userspace Notification
+======================
+
+The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a
+particular syscall to userspace to be handled. This may be useful for
+applications like container managers, which wish to intercept particular
+syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior.
+
+There are currently two APIs to acquire a userspace notification fd for a
+particular filter. First, when the filter is installed, the task installing
+it can ask the ``seccomp()`` syscall for one:
+
+.. code-block::
+
+    fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
+
+which (on success) will return a listener fd for the filter, which can then be
+passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
+acquired via:
+
+.. code-block::
+
+    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+
+which grabs the 0th filter for some task which the tracer has privilege over.
+Note that filter fds correspond to a particular filter, and not a particular
+task. So if this task then forks, notifications from both tasks will appear on
+the same filter fd. Reads and writes to/from a filter fd are also synchronized,
+so a filter fd can safely have many readers.
+
+The interface for a seccomp notification fd consists of two structures:
+
+.. code-block::
+
+    struct seccomp_notif {
+        __u16 len;
+        __u64 id;
+        __u32 pid;
+        __u8 signaled;
+        struct seccomp_data data;
+    };
+
+    struct seccomp_notif_resp {
+        __u16 len;
+        __u64 id;
+        __s32 error;
+        __s64 val;
+    };
+
+Users can read via ``ioctl(SECCOMP_NOTIF_RECV)`` (or ``poll()``) on a seccomp
+notification fd to receive a ``struct seccomp_notif``, which contains five
+members: the input length of the structure, a unique-per-filter ``id``, the
+``pid`` of the task which triggered this request (which may be 0 if the task is
+in a pid ns not visible from the listener's pid namespace), a flag representing
+whether or not the notification is a result of a non-fatal signal, and the
+``data`` passed to seccomp. Userspace can then make a decision based on this
+information about what to do, and ``ioctl(SECCOMP_NOTIF_SEND)`` a response,
+indicating what should be returned to userspace. The ``id`` member of ``struct
+seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.
+
+It is worth noting that ``struct seccomp_data`` contains the values of register
+arguments to the syscall, but does not contain pointers to memory. The task's
+memory is accessible to suitably privileged tracers via ``ptrace()`` or
+``/proc/pid/map_files/``. However, care should be taken to avoid the TOCTOU
+mentioned above in this document: all arguments being read from the tracee's
+memory should be read into the tracer's memory before any policy decisions are
+made. This allows for an atomic decision on syscall arguments.
+
 Sysctls
 =======
 
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index e5320f6c8654..017444b5efed 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -4,9 +4,10 @@
 
 #include <uapi/linux/seccomp.h>
 
-#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC	| \
-					 SECCOMP_FILTER_FLAG_LOG	| \
-					 SECCOMP_FILTER_FLAG_SPEC_ALLOW)
+#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
+					 SECCOMP_FILTER_FLAG_LOG | \
+					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
+					 SECCOMP_FILTER_FLAG_NEW_LISTENER)
 
 #ifdef CONFIG_SECCOMP
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 9efc0e73d50b..d4ccb32fe089 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -17,9 +17,10 @@
 #define SECCOMP_GET_ACTION_AVAIL	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC	(1UL << 0)
-#define SECCOMP_FILTER_FLAG_LOG		(1UL << 1)
-#define SECCOMP_FILTER_FLAG_SPEC_ALLOW	(1UL << 2)
+#define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
+#define SECCOMP_FILTER_FLAG_LOG			(1UL << 1)
+#define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
 
 /*
  * All BPF programs must return a 32-bit value.
@@ -35,6 +36,7 @@
 #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
 #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
 #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
 #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
 #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
 #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
@@ -60,4 +62,29 @@ struct seccomp_data {
 	__u64 args[6];
 };
 
+struct seccomp_notif {
+	__u16 len;
+	__u64 id;
+	__u32 pid;
+	__u8 signaled;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u16 len;
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+
+#define SECCOMP_IOC_MAGIC		0xF7
+
+/* Flags for seccomp notification fd ioctl. */
+#define SECCOMP_NOTIF_RECV	_IOWR(SECCOMP_IOC_MAGIC, 0,	\
+					struct seccomp_notif)
+#define SECCOMP_NOTIF_SEND	_IOWR(SECCOMP_IOC_MAGIC, 1,	\
+					struct seccomp_notif_resp)
+#define SECCOMP_NOTIF_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
+					__u64)
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index fd023ac24e10..fa6fe9756c80 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -33,12 +33,78 @@
 #endif
 
 #ifdef CONFIG_SECCOMP_FILTER
+#include <linux/file.h>
 #include <linux/filter.h>
 #include <linux/pid.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+	SECCOMP_NOTIFY_INIT,
+	SECCOMP_NOTIFY_SENT,
+	SECCOMP_NOTIFY_REPLIED,
+};
+
+struct seccomp_knotif {
+	/* The task whose filter triggered the notification */
+	struct task_struct *task;
+
+	/* The "cookie" for this request; this is unique for this filter. */
+	u64 id;
+
+	/* Whether or not this task has been given an interruptible signal. */
+	bool signaled;
+
+	/*
+	 * The seccomp data. This pointer is valid the entire time this
+	 * notification is active, since it comes from __seccomp_filter which
+	 * eclipses the entire lifecycle here.
+	 */
+	const struct seccomp_data *data;
+
+	/*
+	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
+	 * struct seccomp_knotif is created and starts out in INIT. Once the
+	 * handler reads the notification off of an FD, it transitions to SENT.
+	 * If a signal is received the state transitions back to INIT and
+	 * another message is sent. When the userspace handler replies, state
+	 * transitions to REPLIED.
+	 */
+	enum notify_state state;
+
+	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
+	int error;
+	long val;
+
+	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+	struct completion ready;
+
+	struct list_head list;
+};
+
+/**
+ * struct notification - container for seccomp userspace notifications. Since
+ * most seccomp filters will not have notification listeners attached and this
+ * structure is fairly large, we store the notification-specific stuff in a
+ * separate structure.
+ *
+ * @request: A semaphore that users of this notification can wait on for
+ *           changes. Actual reads and writes are still controlled with
+ *           filter->notify_lock.
+ * @notify_lock: A lock for all notification-related accesses.
+ * @next_id: The id of the next request.
+ * @notifications: A list of struct seccomp_knotif elements.
+ * @wqh: A wait queue for poll.
+ */
+struct notification {
+	struct semaphore request;
+	u64 next_id;
+	struct list_head notifications;
+	wait_queue_head_t wqh;
+};
 
 /**
  * struct seccomp_filter - container for seccomp BPF programs
@@ -66,6 +132,8 @@ struct seccomp_filter {
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+	struct notification *notif;
+	struct mutex notify_lock;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);
 
+	mutex_init(&sfilter->notify_lock);
 	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
@@ -556,11 +625,13 @@ static void seccomp_send_sigsys(int syscall, int reason)
 #define SECCOMP_LOG_TRACE		(1 << 4)
 #define SECCOMP_LOG_LOG			(1 << 5)
 #define SECCOMP_LOG_ALLOW		(1 << 6)
+#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
 
 static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
 				    SECCOMP_LOG_KILL_THREAD  |
 				    SECCOMP_LOG_TRAP  |
 				    SECCOMP_LOG_ERRNO |
+				    SECCOMP_LOG_USER_NOTIF |
 				    SECCOMP_LOG_TRACE |
 				    SECCOMP_LOG_LOG;
 
@@ -581,6 +652,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 	case SECCOMP_RET_TRACE:
 		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
+		break;
 	case SECCOMP_RET_LOG:
 		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
 		break;
@@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
 #else
 
 #ifdef CONFIG_SECCOMP_FILTER
+static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
+{
+	/* Note: overflow is ok here, the id just needs to be unique */
+	return filter->notif->next_id++;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	int err;
+	long ret = 0;
+	struct seccomp_knotif n = {};
+
+	mutex_lock(&match->notify_lock);
+	err = -ENOSYS;
+	if (!match->notif)
+		goto out;
+
+	n.task = current;
+	n.state = SECCOMP_NOTIFY_INIT;
+	n.data = sd;
+	n.id = seccomp_next_notify_id(match);
+	init_completion(&n.ready);
+
+	list_add(&n.list, &match->notif->notifications);
+	wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
+
+	mutex_unlock(&match->notify_lock);
+	up(&match->notif->request);
+
+	err = wait_for_completion_interruptible(&n.ready);
+	mutex_lock(&match->notify_lock);
+
+	/*
+	 * Here it's possible we got a signal and then had to wait on the mutex
+	 * while the reply was sent, so let's be sure there wasn't a response
+	 * in the meantime.
+	 */
+	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
+		/*
+		 * We got a signal. Let's tell userspace about it (potentially
+		 * again, if we had already notified them about the first one).
+		 */
+		n.signaled = true;
+		if (n.state == SECCOMP_NOTIFY_SENT) {
+			n.state = SECCOMP_NOTIFY_INIT;
+			up(&match->notif->request);
+		}
+		mutex_unlock(&match->notify_lock);
+		err = wait_for_completion_killable(&n.ready);
+		mutex_lock(&match->notify_lock);
+		if (err < 0)
+			goto remove_list;
+	}
+
+	ret = n.val;
+	err = n.error;
+
+remove_list:
+	list_del(&n.list);
+out:
+	mutex_unlock(&match->notify_lock);
+	syscall_set_return_value(current, task_pt_regs(current),
+				 err, ret);
+}
+
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
 {
@@ -728,6 +869,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 
 		return 0;
 
+	case SECCOMP_RET_USER_NOTIF:
+		seccomp_do_user_notification(this_syscall, match, sd);
+		goto skip;
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
 		return 0;
@@ -834,6 +978,9 @@ static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+static struct file *init_listener(struct task_struct *,
+				  struct seccomp_filter *);
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
@@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
+	int listener = 0;
+	struct file *listener_f = NULL;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
+	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
+		listener = get_unused_fd_flags(0);
+		if (listener < 0) {
+			ret = listener;
+			goto out_free;
+		}
+
+		listener_f = init_listener(current, prepared);
+		if (IS_ERR(listener_f)) {
+			put_unused_fd(listener);
+			ret = PTR_ERR(listener_f);
+			goto out_free;
+		}
+	}
+
 	/*
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_free;
+		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
 
@@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
+		if (ret < 0) {
+			fput(listener_f);
+			put_unused_fd(listener);
+		} else {
+			fd_install(listener, listener_f);
+			ret = listener;
+		}
+	}
 out_free:
 	seccomp_filter_free(prepared);
 	return ret;
@@ -911,6 +1085,7 @@ static long seccomp_get_action_avail(const char __user *uaction)
 	case SECCOMP_RET_KILL_THREAD:
 	case SECCOMP_RET_TRAP:
 	case SECCOMP_RET_ERRNO:
+	case SECCOMP_RET_USER_NOTIF:
 	case SECCOMP_RET_TRACE:
 	case SECCOMP_RET_LOG:
 	case SECCOMP_RET_ALLOW:
@@ -1111,6 +1286,7 @@ long seccomp_get_metadata(struct task_struct *task,
 #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
 #define SECCOMP_RET_TRAP_NAME		"trap"
 #define SECCOMP_RET_ERRNO_NAME		"errno"
+#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
 #define SECCOMP_RET_TRACE_NAME		"trace"
 #define SECCOMP_RET_LOG_NAME		"log"
 #define SECCOMP_RET_ALLOW_NAME		"allow"
@@ -1120,6 +1296,7 @@ static const char seccomp_actions_avail[] =
 				SECCOMP_RET_KILL_THREAD_NAME	" "
 				SECCOMP_RET_TRAP_NAME		" "
 				SECCOMP_RET_ERRNO_NAME		" "
+				SECCOMP_RET_USER_NOTIF_NAME     " "
 				SECCOMP_RET_TRACE_NAME		" "
 				SECCOMP_RET_LOG_NAME		" "
 				SECCOMP_RET_ALLOW_NAME;
@@ -1134,6 +1311,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
 	{ SECCOMP_LOG_KILL_THREAD, SECCOMP_RET_KILL_THREAD_NAME },
 	{ SECCOMP_LOG_TRAP, SECCOMP_RET_TRAP_NAME },
 	{ SECCOMP_LOG_ERRNO, SECCOMP_RET_ERRNO_NAME },
+	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
 	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
 	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
 	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
@@ -1342,3 +1520,259 @@ static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_FILTER
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_knotif *knotif;
+
+	mutex_lock(&filter->notify_lock);
+
+	/*
+	 * If this file is being closed because e.g. the task who owned it
+	 * died, let's wake everyone up who was waiting on us.
+	 */
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
+			continue;
+
+		knotif->state = SECCOMP_NOTIFY_REPLIED;
+		knotif->error = -ENOSYS;
+		knotif->val = 0;
+
+		complete(&knotif->ready);
+	}
+
+	wake_up_all(&filter->notif->wqh);
+	kfree(filter->notif);
+	filter->notif = NULL;
+	mutex_unlock(&filter->notify_lock);
+	__put_seccomp_filter(filter);
+	return 0;
+}
+
+static long seccomp_notify_recv(struct seccomp_filter *filter,
+				unsigned long arg)
+{
+	struct seccomp_knotif *knotif = NULL, *cur;
+	struct seccomp_notif unotif = {};
+	ssize_t ret;
+	u16 size;
+	void __user *buf = (void __user *)arg;
+
+	if (copy_from_user(&size, buf, sizeof(size)))
+		return -EFAULT;
+
+	ret = down_interruptible(&filter->notif->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each_entry(cur, &filter->notif->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT) {
+			knotif = cur;
+			break;
+		}
+	}
+
+	/*
+	 * If we didn't find a notification, it could be that the task was
+	 * interrupted between the time we were woken and when we were able to
+	 * acquire notify_lock.
+	 */
+	if (!knotif) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	size = min_t(size_t, size, sizeof(unotif));
+
+	unotif.len = size;
+	unotif.id = knotif->id;
+	unotif.pid = task_pid_vnr(knotif->task);
+	unotif.signaled = knotif->signaled;
+	unotif.data = *(knotif->data);
+
+	if (copy_to_user(buf, &unotif, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_SENT;
+	wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
+
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_send(struct seccomp_filter *filter,
+				unsigned long arg)
+{
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL;
+	long ret;
+	u16 size;
+	void __user *buf = (void __user *)arg;
+
+	if (copy_from_user(&size, buf, sizeof(size)))
+		return -EFAULT;
+	size = min_t(size_t, size, sizeof(resp));
+	if (copy_from_user(&resp, buf, size))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		if (knotif->id == resp.id)
+			break;
+	}
+
+	if (!knotif || knotif->id != resp.id) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	/* Allow exactly one reply. */
+	if (knotif->state != SECCOMP_NOTIFY_SENT) {
+		ret = -EINPROGRESS;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_REPLIED;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_id_valid(struct seccomp_filter *filter,
+				    unsigned long arg)
+{
+	struct seccomp_knotif *knotif = NULL;
+	void __user *buf = (void __user *)arg;
+	u64 id;
+	long ret;
+
+	if (copy_from_user(&id, buf, sizeof(id)))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	ret = -1;
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		if (knotif->id == id) {
+			ret = 0;
+			goto out;
+		}
+	}
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
+				 unsigned long arg)
+{
+	struct seccomp_filter *filter = file->private_data;
+
+	switch (cmd) {
+	case SECCOMP_NOTIF_RECV:
+		return seccomp_notify_recv(filter, arg);
+	case SECCOMP_NOTIF_SEND:
+		return seccomp_notify_send(filter, arg);
+	case SECCOMP_NOTIF_ID_VALID:
+		return seccomp_notify_id_valid(filter, arg);
+	default:
+		return -EINVAL;
+	}
+}
+
+static __poll_t seccomp_notify_poll(struct file *file,
+				    struct poll_table_struct *poll_tab)
+{
+	struct seccomp_filter *filter = file->private_data;
+	__poll_t ret = 0;
+	struct seccomp_knotif *cur;
+
+	poll_wait(file, &filter->notif->wqh, poll_tab);
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(cur, &filter->notif->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT)
+			ret |= EPOLLIN | EPOLLRDNORM;
+		if (cur->state == SECCOMP_NOTIFY_SENT)
+			ret |= EPOLLOUT | EPOLLWRNORM;
+		if (ret & EPOLLIN && ret & EPOLLOUT)
+			break;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.poll = seccomp_notify_poll,
+	.release = seccomp_notify_release,
+	.unlocked_ioctl = seccomp_notify_ioctl,
+};
+
+static struct file *init_listener(struct task_struct *task,
+				  struct seccomp_filter *filter)
+{
+	struct file *ret = ERR_PTR(-EBUSY);
+	struct seccomp_filter *cur, *last_locked = NULL;
+	int filter_nesting = 0;
+
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_lock_nested(&cur->notify_lock, filter_nesting);
+		filter_nesting++;
+		last_locked = cur;
+		if (cur->notif)
+			goto out;
+	}
+
+	ret = ERR_PTR(-ENOMEM);
+	filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
+	if (!filter->notif)
+		goto out;
+
+	sema_init(&filter->notif->request, 0);
+	INIT_LIST_HEAD(&filter->notif->notifications);
+	filter->notif->next_id = get_random_u64();
+	init_waitqueue_head(&filter->notif->wqh);
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret))
+		goto out;
+
+
+	/* The file has a reference to it now */
+	__get_seccomp_filter(filter);
+
+out:
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_unlock(&cur->notify_lock);
+		if (cur == last_locked)
+			break;
+	}
+
+	return ret;
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index e1473234968d..5f4b836a6792 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -5,6 +5,7 @@
  * Test code for seccomp bpf.
  */
 
+#define _GNU_SOURCE
 #include <sys/types.h>
 
 /*
@@ -40,10 +41,12 @@
 #include <sys/fcntl.h>
 #include <sys/mman.h>
 #include <sys/times.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
 
-#define _GNU_SOURCE
 #include <unistd.h>
 #include <sys/syscall.h>
+#include <poll.h>
 
 #include "../kselftest_harness.h"
 
@@ -154,6 +157,34 @@ struct seccomp_metadata {
 };
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+#define SECCOMP_IOC_MAGIC		0xF7
+#define SECCOMP_NOTIF_RECV	_IOWR(SECCOMP_IOC_MAGIC, 0,	\
+					struct seccomp_notif)
+#define SECCOMP_NOTIF_SEND	_IOWR(SECCOMP_IOC_MAGIC, 1,	\
+					struct seccomp_notif_resp)
+#define SECCOMP_NOTIF_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
+					__u64)
+struct seccomp_notif {
+	__u16 len;
+	__u64 id;
+	__u32 pid;
+	__u8 signaled;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u16 len;
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -2077,7 +2108,8 @@ TEST(detect_seccomp_filter_flags)
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
 				 SECCOMP_FILTER_FLAG_LOG,
-				 SECCOMP_FILTER_FLAG_SPEC_ALLOW };
+				 SECCOMP_FILTER_FLAG_SPEC_ALLOW,
+				 SECCOMP_FILTER_FLAG_NEW_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
@@ -2933,6 +2965,383 @@ TEST(get_metadata)
 	ASSERT_EQ(0, kill(pid, SIGKILL));
 }
 
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int read_notif(int listener, struct seccomp_notif *req)
+{
+	int ret;
+
+	do {
+		errno = 0;
+		req->len = sizeof(*req);
+		ret = ioctl(listener, SECCOMP_NOTIF_RECV, req);
+	} while (ret == -1 && errno == ENOENT);
+	return ret;
+}
+
+static void signal_handler(int signal)
+{
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	struct pollfd pollfd;
+
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		if (user_trap_syscall(__NR_getpid, 0) < 0)
+			exit(1);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/* Add some no-op filters so that we (don't) trigger lockdep. */
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	/* Installing a second listener in the chain should EBUSY */
+	EXPECT_EQ(user_trap_syscall(__NR_getpid,
+				    SECCOMP_FILTER_FLAG_NEW_LISTENER),
+		  -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLIN);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLOUT);
+
+	EXPECT_EQ(req.data.nr,  __NR_getpid);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), 0);
+
+	EXPECT_EQ(kill(pid, SIGKILL), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), -1);
+
+	resp.id = req.id;
+	ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, ENOENT);
+
+	/*
+	 * Check that we get another notification about a signal in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+			perror("signal");
+			exit(1);
+		}
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(req.signaled, 1);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = -512; /* -ERESTARTSYS */
+	resp.val = 0;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	ret = read_notif(listener, &req);
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+	ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
+	EXPECT_EQ(ret, sizeof(resp));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that we get an ENOSYS when the listener is closed.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0) {
+		close(listener);
+		ret = syscall(__NR_getpid);
+		exit(ret != -1 || errno != ENOSYS);
+	}
+
+	close(listener);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+/*
+ * Check that a pid in a child namespace still shows up as valid in ours.
+ */
+TEST(user_notification_child_pid_ns)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+	ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+
+		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done and respond with magic */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+	EXPECT_EQ(req.pid, pid);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+	close(listener);
+}
+
+/*
+ * Check that a pid in a sibling (i.e. unrelated) namespace shows up as 0, i.e.
+ * invalid.
+ */
+TEST(user_notification_sibling_pid_ns)
+{
+	pid_t pid, pid2;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int child_pair[2];
+
+		ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+
+		ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, child_pair), 0);
+
+		pid2 = fork();
+		ASSERT_GE(pid2, 0);
+
+		if (pid2 == 0) {
+			close(child_pair[0]);
+			EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+			/* Signal we're ready and have installed the filter. */
+			EXPECT_EQ(write(child_pair[1], "J", 1), 1);
+
+			EXPECT_EQ(read(child_pair[1], &c, 1), 1);
+			EXPECT_EQ(c, 'H');
+
+			exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+		}
+
+		/* check that child has installed the filter */
+		EXPECT_EQ(read(child_pair[0], &c, 1), 1);
+		EXPECT_EQ(c, 'J');
+
+		/* tell parent who child is */
+		EXPECT_EQ(write(sk_pair[1], &pid2, sizeof(pid2)), sizeof(pid2));
+
+		/* parent has installed listener, tell child to call syscall */
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+		EXPECT_EQ(write(child_pair[0], "H", 1), 1);
+
+		EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+		EXPECT_EQ(true, WIFEXITED(status));
+		EXPECT_EQ(0, WEXITSTATUS(status));
+		exit(WEXITSTATUS(status));
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &pid2, sizeof(pid2)), sizeof(pid2));
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid2), 0);
+	EXPECT_EQ(waitpid(pid2, NULL, 0), pid2);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid2, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(errno, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid2, NULL, 0), 0);
+
+	/* Create the sibling ns, and sibling in it. */
+	EXPECT_EQ(unshare(CLONE_NEWPID), 0);
+	EXPECT_EQ(errno, 0);
+
+	pid2 = fork();
+	EXPECT_GE(pid2, 0);
+
+	if (pid2 == 0) {
+		req.len = sizeof(req);
+		ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+		/*
+		 * The pid should be 0, i.e. the task is in some namespace that
+		 * we can't "see".
+		 */
+		ASSERT_EQ(req.pid, 0);
+
+		resp.len = sizeof(resp);
+		resp.id = req.id;
+		resp.error = 0;
+		resp.val = USER_NOTIF_MAGIC;
+
+		ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+		exit(0);
+	}
+
+	close(listener);
+
+	/* Now signal we are done setting up sibling listener. */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+
 /*
  * TODO:
  * - add microbenchmarks
-- 
2.17.1

* [PATCH v7 2/6] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  2018-09-27 15:11 [PATCH v7 0/6] seccomp trap to userspace Tycho Andersen
  2018-09-27 15:11 ` [PATCH v7 1/6] seccomp: add a return code to " Tycho Andersen
@ 2018-09-27 15:11 ` Tycho Andersen
  2018-09-27 16:51   ` Jann Horn
                     ` (2 more replies)
  2018-09-27 15:11 ` [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 15:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel, Tycho Andersen

In the next commit we'll use this same helper to get a listener for the
nth filter, so we need it available whenever CONFIG_SECCOMP_FILTER is
enabled, not only under CHECKPOINT_RESTORE.

v2: new in v2
v3: no changes
v4: no changes
v5: switch to CHECKPOINT_RESTORE || USER_NOTIFICATION to avoid warning when
    only CONFIG_SECCOMP_FILTER is enabled.
v7: drop USER_NOTIFICATION bits

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 kernel/seccomp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index fa6fe9756c80..44a31ac8373a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1158,7 +1158,7 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
 	return do_seccomp(op, 0, uargs);
 }
 
-#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
+#if defined(CONFIG_SECCOMP_FILTER)
 static struct seccomp_filter *get_nth_filter(struct task_struct *task,
 					     unsigned long filter_off)
 {
@@ -1205,6 +1205,7 @@ static struct seccomp_filter *get_nth_filter(struct task_struct *task,
 	return filter;
 }
 
+#if defined(CONFIG_CHECKPOINT_RESTORE)
 long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
 			void __user *data)
 {
@@ -1277,7 +1278,8 @@ long seccomp_get_metadata(struct task_struct *task,
 	__put_seccomp_filter(filter);
 	return ret;
 }
-#endif
+#endif /* CONFIG_CHECKPOINT_RESTORE */
+#endif /* CONFIG_SECCOMP_FILTER */
 
 #ifdef CONFIG_SYSCTL
 
-- 
2.17.1

* [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-09-27 15:11 [PATCH v7 0/6] seccomp trap to userspace Tycho Andersen
  2018-09-27 15:11 ` [PATCH v7 1/6] seccomp: add a return code to " Tycho Andersen
  2018-09-27 15:11 ` [PATCH v7 2/6] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
@ 2018-09-27 15:11 ` Tycho Andersen
  2018-09-27 16:20   ` Jann Horn
                     ` (3 more replies)
  2018-09-27 15:11 ` [PATCH v7 4/6] files: add a replace_fd_files() function Tycho Andersen
                   ` (3 subsequent siblings)
  6 siblings, 4 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 15:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel, Tycho Andersen

As an alternative to SECCOMP_FILTER_FLAG_NEW_LISTENER, a ptrace() version
which can acquire a listener for another task's filter is useful. There are
at least two reasons this is preferable, even though it uses ptrace:

1. You can control tasks that aren't cooperating with you
2. You can control tasks whose filters block sendmsg() and socket(); if the
   task installs a filter which blocks these calls, there's no way with
   SECCOMP_FILTER_FLAG_NEW_LISTENER to get the fd out to the privileged task.
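
For context on point 2, here is a minimal userspace sketch of the cooperative
path that such a filter would cut off: the task that owns the listener fd
hands it to the privileged task over an AF_UNIX socket using SCM_RIGHTS. The
helper names send_fd() and recv_fd() are made up for illustration and are not
part of this series; only standard socket calls are assumed.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: send one fd over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. This is the sendmsg() call that a
 * restrictive filter in the sending task would block.
 */
int send_fd(int sock, int fd)
{
	char byte = 'F';
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment */
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: receive one fd sent by send_fd().
 * Returns the new fd in the receiver's table, or -1 on error.
 */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

With the ptrace request the tracer gets the fd installed directly in its own
table, so neither the socket nor the sendmsg() call is needed in the target.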

v2: fix a bug where listener mode was not unset when an unused fd was not
    available
v3: fix refcounting bug (Oleg)
v4: * change the listener's fd flags to be 0
    * rename GET_LISTENER to NEW_LISTENER (Matthew)
v5: * add capable(CAP_SYS_ADMIN) requirement
v7: * point the new listener at the right filter (Jann)

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 include/linux/seccomp.h                       |  7 ++
 include/uapi/linux/ptrace.h                   |  2 +
 kernel/ptrace.c                               |  4 ++
 kernel/seccomp.c                              | 31 +++++++++
 tools/testing/selftests/seccomp/seccomp_bpf.c | 68 +++++++++++++++++++
 5 files changed, 112 insertions(+)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 017444b5efed..234c61b37405 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -83,6 +83,8 @@ static inline int seccomp_mode(struct seccomp *s)
 #ifdef CONFIG_SECCOMP_FILTER
 extern void put_seccomp_filter(struct task_struct *tsk);
 extern void get_seccomp_filter(struct task_struct *tsk);
+extern long seccomp_new_listener(struct task_struct *task,
+				 unsigned long filter_off);
 #else  /* CONFIG_SECCOMP_FILTER */
 static inline void put_seccomp_filter(struct task_struct *tsk)
 {
@@ -92,6 +94,11 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
 {
 	return;
 }
+static inline long seccomp_new_listener(struct task_struct *task,
+					unsigned long filter_off)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_SECCOMP_FILTER */
 
 #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index d5a1b8a492b9..e80ecb1bd427 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -73,6 +73,8 @@ struct seccomp_metadata {
 	__u64 flags;		/* Output: filter's flags */
 };
 
+#define PTRACE_SECCOMP_NEW_LISTENER	0x420e
+
 /* Read signals from a shared (process wide) queue */
 #define PTRACE_PEEKSIGINFO_SHARED	(1 << 0)
 
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 21fec73d45d4..289960ac181b 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
 		ret = seccomp_get_metadata(child, addr, datavp);
 		break;
 
+	case PTRACE_SECCOMP_NEW_LISTENER:
+		ret = seccomp_new_listener(child, addr);
+		break;
+
 	default:
 		break;
 	}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 44a31ac8373a..17685803a2af 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
 
 	return ret;
 }
+
+long seccomp_new_listener(struct task_struct *task,
+			  unsigned long filter_off)
+{
+	struct seccomp_filter *filter;
+	struct file *listener;
+	int fd;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	filter = get_nth_filter(task, filter_off);
+	if (IS_ERR(filter))
+		return PTR_ERR(filter);
+
+	fd = get_unused_fd_flags(0);
+	if (fd < 0) {
+		__put_seccomp_filter(filter);
+		return fd;
+	}
+
+	listener = init_listener(task, filter);
+	__put_seccomp_filter(filter);
+	if (IS_ERR(listener)) {
+		put_unused_fd(fd);
+		return PTR_ERR(listener);
+	}
+
+	fd_install(fd, listener);
+	return fd;
+}
 #endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 5f4b836a6792..c6ba3ed5392e 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -193,6 +193,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
 }
 #endif
 
+#ifndef PTRACE_SECCOMP_NEW_LISTENER
+#define PTRACE_SECCOMP_NEW_LISTENER 0x420e
+#endif
+
 #if __BYTE_ORDER == __LITTLE_ENDIAN
 #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
 #elif __BYTE_ORDER == __BIG_ENDIAN
@@ -3175,6 +3179,70 @@ TEST(get_user_notification_syscall)
 	EXPECT_EQ(0, WEXITSTATUS(status));
 }
 
+TEST(get_user_notification_ptrace)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Test that we get ENOSYS while not attached */
+		EXPECT_EQ(syscall(__NR_getpid), -1);
+		EXPECT_EQ(errno, ENOSYS);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+
+		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+
+	/* EBUSY for second listener */
+	EXPECT_EQ(ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0), -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done and respond with magic */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	close(listener);
+}
+
 /*
  * Check that a pid in a child namespace still shows up as valid in ours.
  */
-- 
2.17.1

* [PATCH v7 4/6] files: add a replace_fd_files() function
  2018-09-27 15:11 [PATCH v7 0/6] seccomp trap to userspace Tycho Andersen
                   ` (2 preceding siblings ...)
  2018-09-27 15:11 ` [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
@ 2018-09-27 15:11 ` Tycho Andersen
  2018-09-27 16:49   ` Jann Horn
  2018-09-27 21:59   ` Kees Cook
  2018-09-27 15:11 ` [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd Tycho Andersen
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 15:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel, Tycho Andersen, Alexander Viro

Similar to fd_install/__fd_install, we want to be able to replace an fd in
the files_struct of an arbitrary task, not just current's. Add
replace_fd_task() for this. We'll use it in the next patch to implement the
seccomp ioctl that allows inserting fds into a stopped process's context.
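
As a rough userspace analogy for what "replace" means here, the sketch below
does the same thing to the calling process's own fd table: overwrite a slot
with another open file, or close it when no source is given. The name
replace_fd_user() and its cloexec argument are invented for this illustration.
The kernel helper takes a struct file and a task rather than two fd numbers,
so this only mirrors the semantics, not the implementation.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical userspace analogue of replace_fd() for the caller's own
 * table. If src < 0, slot dst is closed (like passing a NULL file).
 * Otherwise dst is made to refer to the same open file as src, closing
 * whatever dst referred to before. Returns dst (or 0 for the close case)
 * on success, -1 on error.
 */
int replace_fd_user(int dst, int src, int cloexec)
{
	if (src < 0)
		return close(dst);

	if (dup2(src, dst) < 0)
		return -1;

	/* dup2() clears close-on-exec on dst; set it again if requested. */
	if (cloexec && fcntl(dst, F_SETFD, FD_CLOEXEC) < 0)
		return -1;

	return dst;
}
```

The point of the new kernel helper is that the same operation can target a
table other than the caller's, which userspace cannot do with dup2().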

v7: new in v7

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 fs/file.c            | 22 +++++++++++++++-------
 include/linux/file.h |  8 ++++++++
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 7ffd6e9d103d..3b3c5aadaadb 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -850,24 +850,32 @@ __releases(&files->file_lock)
 }
 
 int replace_fd(unsigned fd, struct file *file, unsigned flags)
+{
+	return replace_fd_task(current, fd, file, flags);
+}
+
+/*
+ * Same warning as __alloc_fd()/__fd_install() here.
+ */
+int replace_fd_task(struct task_struct *task, unsigned fd,
+		    struct file *file, unsigned flags)
 {
 	int err;
-	struct files_struct *files = current->files;
 
 	if (!file)
-		return __close_fd(files, fd);
+		return __close_fd(task->files, fd);
 
-	if (fd >= rlimit(RLIMIT_NOFILE))
+	if (fd >= task_rlimit(task, RLIMIT_NOFILE))
 		return -EBADF;
 
-	spin_lock(&files->file_lock);
-	err = expand_files(files, fd);
+	spin_lock(&task->files->file_lock);
+	err = expand_files(task->files, fd);
 	if (unlikely(err < 0))
 		goto out_unlock;
-	return do_dup2(files, file, fd, flags);
+	return do_dup2(task->files, file, fd, flags);
 
 out_unlock:
-	spin_unlock(&files->file_lock);
+	spin_unlock(&task->files->file_lock);
 	return err;
 }
 
diff --git a/include/linux/file.h b/include/linux/file.h
index 6b2fb032416c..f94277fee038 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -11,6 +11,7 @@
 #include <linux/posix_types.h>
 
 struct file;
+struct task_struct;
 
 extern void fput(struct file *);
 
@@ -79,6 +80,13 @@ static inline void fdput_pos(struct fd f)
 
 extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
 extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
+/*
+ * Warning! This is only safe if you know the owner of the files_struct is
+ * stopped outside syscall context. It's a very bad idea to use this unless you
+ * have similar guarantees in your code.
+ */
+extern int replace_fd_task(struct task_struct *task, unsigned fd,
+			   struct file *file, unsigned flags);
 extern void set_close_on_exec(unsigned int fd, int flag);
 extern bool get_close_on_exec(unsigned int fd);
 extern int get_unused_fd_flags(unsigned flags);
-- 
2.17.1

* [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd
  2018-09-27 15:11 [PATCH v7 0/6] seccomp trap to userspace Tycho Andersen
                   ` (3 preceding siblings ...)
  2018-09-27 15:11 ` [PATCH v7 4/6] files: add a replace_fd_files() function Tycho Andersen
@ 2018-09-27 15:11 ` Tycho Andersen
  2018-09-27 16:39   ` Jann Horn
                     ` (2 more replies)
  2018-09-27 15:11 ` [PATCH v7 6/6] samples: add an example of seccomp user trap Tycho Andersen
  2018-09-28 21:57 ` [PATCH v7 0/6] seccomp trap to userspace Michael Kerrisk (man-pages)
  6 siblings, 3 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 15:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel, Tycho Andersen

This patch adds a way to insert FDs into the tracee's process (also
close/overwrite fds for the tracee). This functionality is necessary to
mock things like socketpair() or dup2() or similar, but since it depends on
external (vfs) patches, I've left it as a separate patch as before so the
core functionality can still be merged while we argue about this. Except
this time it doesn't add any ugliness to the API :)
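
To make the three modes (insert, overwrite, close) concrete, here is a small
sketch of how a listener might fill the request. The struct layout is copied
from this patch but spelled with <stdint.h> types so it builds without the
patched headers; fill_put_fd() is a hypothetical helper, and its -EINVAL check
simply mirrors the one in seccomp_notify_put_fd() below.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the layout proposed in this patch (assumption: unchanged). */
struct seccomp_notif_put_fd {
	uint64_t id;		/* id from the received struct seccomp_notif */
	int32_t fd;		/* fd in the listener's table, or -1 to close */
	uint32_t fd_flags;	/* e.g. O_CLOEXEC for the tracee's new fd */
	int32_t to_replace;	/* fd in the tracee's table, or -1 for a new one */
};

/*
 * Hypothetical helper: build a PUT_FD request.
 *   fd >= 0, to_replace <  0: install fd at a newly allocated number
 *   fd >= 0, to_replace >= 0: overwrite to_replace with fd
 *   fd <  0, to_replace >= 0: close to_replace in the tracee
 *   fd <  0, to_replace <  0: rejected, as the kernel side does
 */
int fill_put_fd(struct seccomp_notif_put_fd *req, uint64_t id,
		int fd, uint32_t fd_flags, int to_replace)
{
	if (fd < 0 && to_replace < 0)
		return -EINVAL;

	memset(req, 0, sizeof(*req));
	req->id = id;
	req->fd = fd;
	req->fd_flags = fd_flags;
	req->to_replace = to_replace;
	return 0;
}
```

The filled request would then be passed to ioctl(listener,
SECCOMP_NOTIF_PUT_FD, &req) while the notification with that id is still
pending, before the SECCOMP_NOTIF_SEND that lets the tracee continue.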

v7: new in v7

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 .../userspace-api/seccomp_filter.rst          |  16 +++
 include/uapi/linux/seccomp.h                  |   9 ++
 kernel/seccomp.c                              |  54 ++++++++
 tools/testing/selftests/seccomp/seccomp_bpf.c | 126 ++++++++++++++++++
 4 files changed, 205 insertions(+)

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index d2e61f1c0a0b..383a8dbae304 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -237,6 +237,13 @@ The interface for a seccomp notification fd consists of two structures:
         __s64 val;
     };
 
+    struct seccomp_notif_put_fd {
+        __u64 id;
+        __s32 fd;
+        __u32 fd_flags;
+        __s32 to_replace;
+    };
+
 Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
 notification fd to receive a ``struct seccomp_notif``, which contains five
 members: the input length of the structure, a unique-per-filter ``id``, the
@@ -256,6 +263,15 @@ mentioned above in this document: all arguments being read from the tracee's
 memory should be read into the tracer's memory before any policy decisions are
 made. This allows for an atomic decision on syscall arguments.
 
+Userspace can also insert (or overwrite) file descriptors of the tracee using
+``ioctl(SECCOMP_NOTIF_PUT_FD)``. The ``id`` member is the ``id`` of the
+notification whose task should receive the fd. The ``fd`` is the fd in the
+listener's table to send, or ``-1`` if an fd should be closed instead. The
+``to_replace`` fd is the fd in the tracee's table that should be overwritten,
+or ``-1`` if a new fd should be installed. ``fd_flags`` should be the flags
+that the fd in the tracee's table is opened with (e.g. ``O_CLOEXEC`` or
+similar). The return value from this ioctl is the fd number that was installed.
+
 Sysctls
 =======
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index d4ccb32fe089..91d77f041fbb 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -77,6 +77,13 @@ struct seccomp_notif_resp {
 	__s64 val;
 };
 
+struct seccomp_notif_put_fd {
+	__u64 id;
+	__s32 fd;
+	__u32 fd_flags;
+	__s32 to_replace;
+};
+
 #define SECCOMP_IOC_MAGIC		0xF7
 
 /* Flags for seccomp notification fd ioctl. */
@@ -86,5 +93,7 @@ struct seccomp_notif_resp {
 					struct seccomp_notif_resp)
 #define SECCOMP_NOTIF_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
 					__u64)
+#define SECCOMP_NOTIF_PUT_FD	_IOR(SECCOMP_IOC_MAGIC, 3,	\
+					struct seccomp_notif_put_fd)
 
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 17685803a2af..07a05ad59731 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -41,6 +41,8 @@
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
 #include <linux/anon_inodes.h>
+#include <linux/fdtable.h>
+#include <net/cls_cgroup.h>
 
 enum notify_state {
 	SECCOMP_NOTIFY_INIT,
@@ -1684,6 +1686,56 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
 	return ret;
 }
 
+static long seccomp_notify_put_fd(struct seccomp_filter *filter,
+				  unsigned long arg)
+{
+	struct seccomp_notif_put_fd req;
+	void __user *buf = (void __user *)arg;
+	struct seccomp_knotif *knotif = NULL;
+	long ret;
+
+	if (copy_from_user(&req, buf, sizeof(req)))
+		return -EFAULT;
+
+	if (req.fd < 0 && req.to_replace < 0)
+		return -EINVAL;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	ret = -ENOENT;
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		struct file *file = NULL;
+
+		if (knotif->id != req.id)
+			continue;
+
+		if (req.fd >= 0)
+			file = fget(req.fd);
+
+		if (req.to_replace >= 0) {
+			ret = replace_fd_task(knotif->task, req.to_replace,
+					      file, req.fd_flags);
+		} else {
+			unsigned long max_files;
+
+			max_files = task_rlimit(knotif->task, RLIMIT_NOFILE);
+			ret = __alloc_fd(knotif->task->files, 0, max_files,
+					 req.fd_flags);
+			if (ret < 0)
+				break;
+
+			__fd_install(knotif->task->files, ret, file);
+		}
+
+		break;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
 static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
 				 unsigned long arg)
 {
@@ -1696,6 +1748,8 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
 		return seccomp_notify_send(filter, arg);
 	case SECCOMP_NOTIF_ID_VALID:
 		return seccomp_notify_id_valid(filter, arg);
+	case SECCOMP_NOTIF_PUT_FD:
+		return seccomp_notify_put_fd(filter, arg);
 	default:
 		return -EINVAL;
 	}
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index c6ba3ed5392e..cd1322c02b92 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -43,6 +43,7 @@
 #include <sys/times.h>
 #include <sys/socket.h>
 #include <sys/ioctl.h>
+#include <linux/kcmp.h>
 
 #include <unistd.h>
 #include <sys/syscall.h>
@@ -169,6 +170,9 @@ struct seccomp_metadata {
 					struct seccomp_notif_resp)
 #define SECCOMP_NOTIF_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
 					__u64)
+#define SECCOMP_NOTIF_PUT_FD	_IOR(SECCOMP_IOC_MAGIC, 3,	\
+					struct seccomp_notif_put_fd)
+
 struct seccomp_notif {
 	__u16 len;
 	__u64 id;
@@ -183,6 +187,13 @@ struct seccomp_notif_resp {
 	__s32 error;
 	__s64 val;
 };
+
+struct seccomp_notif_put_fd {
+	__u64 id;
+	__s32 fd;
+	__u32 fd_flags;
+	__s32 to_replace;
+};
 #endif
 
 #ifndef seccomp
@@ -193,6 +204,14 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
 }
 #endif
 
+#ifndef kcmp
+int kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1,
+	 unsigned long idx2)
+{
+	return syscall(__NR_kcmp, pid1, pid2, type, idx1, idx2);
+}
+#endif
+
 #ifndef PTRACE_SECCOMP_NEW_LISTENER
 #define PTRACE_SECCOMP_NEW_LISTENER 0x420e
 #endif
@@ -3243,6 +3262,113 @@ TEST(get_user_notification_ptrace)
 	close(listener);
 }
 
+TEST(user_notification_pass_fd)
+{
+	pid_t pid;
+	int status, listener, fd;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_notif_put_fd putfd = {};
+	long ret;
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		char buf[16];
+
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+		close(sk_pair[1]);
+
+		/* An fd from getpid(). Let the games begin. */
+		fd = syscall(__NR_getpid);
+		EXPECT_GT(fd, 0);
+		EXPECT_EQ(read(fd, buf, sizeof(buf)), 12);
+		close(fd);
+
+		exit(strcmp("hello world", buf));
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done installing so it can do a getpid */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+	close(sk_pair[0]);
+
+	/* Make a new socket pair so we can send half across */
+	EXPECT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+
+	putfd.id = req.id;
+	putfd.fd_flags = 0;
+
+	/* First, let's just create a new fd with our stdin. */
+	putfd.fd = 0;
+	putfd.to_replace = -1;
+	fd = ioctl(listener, SECCOMP_NOTIF_PUT_FD, &putfd);
+	EXPECT_GE(fd, 0);
+	EXPECT_EQ(kcmp(req.pid, getpid(), KCMP_FILE, fd, 0), 0);
+
+	/* Dup something else over the top of it. */
+	putfd.fd = sk_pair[1];
+	putfd.to_replace = fd;
+	fd = ioctl(listener, SECCOMP_NOTIF_PUT_FD, &putfd);
+	EXPECT_GE(fd, 0);
+	EXPECT_EQ(kcmp(req.pid, getpid(), KCMP_FILE, fd, sk_pair[1]), 0);
+
+	/* Now, try to close it. */
+	putfd.fd = -1;
+	putfd.to_replace = fd;
+	fd = ioctl(listener, SECCOMP_NOTIF_PUT_FD, &putfd);
+	EXPECT_GE(fd, 0);
+	EXPECT_EQ(kcmp(req.pid, getpid(), KCMP_FILE, fd, sk_pair[1]), 1);
+
+	/* Ok, we tried the three cases, now let's do what we really want. */
+	putfd.fd = sk_pair[1];
+	putfd.to_replace = -1;
+	fd = ioctl(listener, SECCOMP_NOTIF_PUT_FD, &putfd);
+	EXPECT_GE(fd, 0);
+	EXPECT_EQ(kcmp(req.pid, getpid(), KCMP_FILE, fd, sk_pair[1]), 0);
+
+	resp.val = fd;
+	resp.error = 0;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+	close(sk_pair[1]);
+
+	EXPECT_EQ(write(sk_pair[0], "hello world\0", 12), 12);
+	close(sk_pair[0]);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+	close(listener);
+}
+
 /*
  * Check that a pid in a child namespace still shows up as valid in ours.
  */
-- 
2.17.1


* [PATCH v7 6/6] samples: add an example of seccomp user trap
  2018-09-27 15:11 [PATCH v7 0/6] seccomp trap to userspace Tycho Andersen
                   ` (4 preceding siblings ...)
  2018-09-27 15:11 ` [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd Tycho Andersen
@ 2018-09-27 15:11 ` Tycho Andersen
  2018-09-27 22:11   ` Kees Cook
  2018-09-28 21:57 ` [PATCH v7 0/6] seccomp trap to userspace Michael Kerrisk (man-pages)
  6 siblings, 1 reply; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 15:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel, Tycho Andersen

The idea here is just to give a demonstration of how one could safely use
the SECCOMP_RET_USER_NOTIF feature to do mount policies. This particular
policy is (as noted in the comment) not very interesting, but it serves to
illustrate how one might apply a policy while dodging the various TOCTOU issues.

v5: new in v5
v7: updates for v7 API changes

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 samples/seccomp/.gitignore  |   1 +
 samples/seccomp/Makefile    |   7 +-
 samples/seccomp/user-trap.c | 312 ++++++++++++++++++++++++++++++++++++
 3 files changed, 319 insertions(+), 1 deletion(-)

diff --git a/samples/seccomp/.gitignore b/samples/seccomp/.gitignore
index 78fb78184291..d1e2e817d556 100644
--- a/samples/seccomp/.gitignore
+++ b/samples/seccomp/.gitignore
@@ -1,3 +1,4 @@
 bpf-direct
 bpf-fancy
 dropper
+user-trap
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
index cf34ff6b4065..4920903c8009 100644
--- a/samples/seccomp/Makefile
+++ b/samples/seccomp/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 ifndef CROSS_COMPILE
-hostprogs-$(CONFIG_SAMPLE_SECCOMP) := bpf-fancy dropper bpf-direct
+hostprogs-$(CONFIG_SAMPLE_SECCOMP) := bpf-fancy dropper bpf-direct user-trap
 
 HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
 HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
@@ -16,6 +16,10 @@ HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
 HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
 bpf-direct-objs := bpf-direct.o
 
+HOSTCFLAGS_user-trap.o += -I$(objtree)/usr/include
+HOSTCFLAGS_user-trap.o += -idirafter $(objtree)/include
+user-trap-objs := user-trap.o
+
 # Try to match the kernel target.
 ifndef CONFIG_64BIT
 
@@ -33,6 +37,7 @@ HOSTCFLAGS_bpf-fancy.o += $(MFLAG)
 HOSTLDLIBS_bpf-direct += $(MFLAG)
 HOSTLDLIBS_bpf-fancy += $(MFLAG)
 HOSTLDLIBS_dropper += $(MFLAG)
+HOSTLDLIBS_user-trap += $(MFLAG)
 endif
 always := $(hostprogs-m)
 endif
diff --git a/samples/seccomp/user-trap.c b/samples/seccomp/user-trap.c
new file mode 100644
index 000000000000..63c9a5994dc1
--- /dev/null
+++ b/samples/seccomp/user-trap.c
@@ -0,0 +1,312 @@
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <string.h>
+#include <stddef.h>
+#include <sys/sysmacros.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/user.h>
+#include <sys/ioctl.h>
+#include <sys/ptrace.h>
+#include <sys/mount.h>
+#include <linux/limits.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+
+/*
+ * Because of some grossness, we can't include linux/ptrace.h here, so we
+ * re-define PTRACE_SECCOMP_NEW_LISTENER.
+ */
+#ifndef PTRACE_SECCOMP_NEW_LISTENER
+#define PTRACE_SECCOMP_NEW_LISTENER	0x420e
+#endif
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+static int seccomp(unsigned int op, unsigned int flags, void *args)
+{
+	errno = 0;
+	return syscall(__NR_seccomp, op, flags, args);
+}
+
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int handle_req(struct seccomp_notif *req,
+		      struct seccomp_notif_resp *resp, int listener)
+{
+	char path[PATH_MAX], source[PATH_MAX], target[PATH_MAX];
+	int ret = -1, mem;
+
+	resp->len = sizeof(*resp);
+	resp->id = req->id;
+	resp->error = -EPERM;
+	resp->val = 0;
+
+	if (req->data.nr != __NR_mount) {
+		fprintf(stderr, "huh? trapped something besides mount? %d\n", req->data.nr);
+		return -1;
+	}
+
+	/* Only allow bind mounts. */
+	if (!(req->data.args[3] & MS_BIND))
+		return 0;
+
+	/*
+	 * Ok, let's read the task's memory to see where they wanted their
+	 * mount to go.
+	 */
+	snprintf(path, sizeof(path), "/proc/%d/mem", req->pid);
+	mem = open(path, O_RDONLY);
+	if (mem < 0) {
+		perror("open mem");
+		return -1;
+	}
+
+	/*
+	 * Now we avoid a TOCTOU: we referred to a task by its pid, but since
+	 * the pid that made the syscall may have died, we need to confirm that
+	 * the pid is still valid after we open its /proc/pid/mem file. We can
+	 * ask the listener fd this as follows.
+	 *
+	 * Note that this check should occur *after* any task-specific
+	 * resources are opened, to make sure that the task has not died and
+	 * we're not wrongly reading someone else's state in order to make
+	 * decisions.
+	 */
+	if (ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req->id) < 0) {
+		fprintf(stderr, "task died before we could map its memory\n");
+		goto out;
+	}
+
+	/*
+	 * Phew, we've got the right /proc/pid/mem. Now we can read it. Note
+	 * that to avoid another TOCTOU, we should read all of the pointer args
+	 * before we decide to allow the syscall.
+	 */
+	if (lseek(mem, req->data.args[0], SEEK_SET) < 0) {
+		perror("seek");
+		goto out;
+	}
+
+	ret = read(mem, source, sizeof(source));
+	if (ret < 0) {
+		perror("read");
+		goto out;
+	}
+
+	if (lseek(mem, req->data.args[1], SEEK_SET) < 0) {
+		perror("seek");
+		goto out;
+	}
+
+	ret = read(mem, target, sizeof(target));
+	if (ret < 0) {
+		perror("read");
+		goto out;
+	}
+
+	/*
+	 * Our policy is to only allow bind mounts inside /tmp. This isn't very
+	 * interesting, because we could do unprivileged bind mounts with user
+	 * namespaces already, but you get the idea.
+	 */
+	if (!strncmp(source, "/tmp", 4) && !strncmp(target, "/tmp", 4)) {
+		if (mount(source, target, NULL, req->data.args[3], NULL) < 0) {
+			ret = -1;
+			perror("actual mount");
+			goto out;
+		}
+		resp->error = 0;
+	}
+
+	/* Even if we didn't allow it because of policy, generating the
+	 * response was a success, because we want to tell the worker EPERM.
+	 */
+	ret = 0;
+
+out:
+	close(mem);
+	return ret;
+}
+
+int main(void)
+{
+	int sk_pair[2], ret = 1, status, listener;
+	pid_t worker = 0, tracer = 0;
+	char c;
+
+	if (socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair) < 0) {
+		perror("socketpair");
+		return 1;
+	}
+
+	worker = fork();
+	if (worker < 0) {
+		perror("fork");
+		goto close_pair;
+	}
+
+	if (worker == 0) {
+		if (user_trap_syscall(__NR_mount, 0) < 0) {
+			perror("seccomp");
+			exit(1);
+		}
+
+		if (setuid(1000) < 0) {
+			perror("setuid");
+			exit(1);
+		}
+
+		if (write(sk_pair[1], "a", 1) != 1) {
+			perror("write");
+			exit(1);
+		}
+
+		if (read(sk_pair[1], &c, 1) != 1) {
+			perror("read");
+			exit(1);
+		}
+
+		if (mkdir("/tmp/foo", 0755) < 0) {
+			perror("mkdir");
+			exit(1);
+		}
+
+		if (mount("/dev/sda", "/tmp/foo", NULL, 0, NULL) != -1) {
+			fprintf(stderr, "huh? mounted /dev/sda?\n");
+			exit(1);
+		}
+
+		if (errno != EPERM) {
+			perror("bad error from mount");
+			exit(1);
+		}
+
+		if (mount("/tmp/foo", "/tmp/foo", NULL, MS_BIND, NULL) < 0) {
+			perror("mount");
+			exit(1);
+		}
+
+		exit(0);
+	}
+
+	if (read(sk_pair[0], &c, 1) != 1) {
+		perror("read ready signal");
+		goto out_kill;
+	}
+
+	if (ptrace(PTRACE_ATTACH, worker) < 0) {
+		perror("ptrace");
+		goto out_kill;
+	}
+
+	if (waitpid(worker, NULL, 0) != worker) {
+		perror("waitpid");
+		goto out_kill;
+	}
+
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, worker, 0);
+	if (listener < 0) {
+		perror("ptrace get listener");
+		goto out_kill;
+	}
+
+	if (ptrace(PTRACE_DETACH, worker, NULL, 0) < 0) {
+		perror("ptrace detach");
+		goto out_kill;
+	}
+
+	if (write(sk_pair[0], "a", 1) != 1) {
+		perror("write");
+		goto out_kill;
+	}
+
+	tracer = fork();
+	if (tracer < 0) {
+		perror("fork");
+		goto out_kill;
+	}
+
+	if (tracer == 0) {
+		while (1) {
+			struct seccomp_notif req = {};
+			struct seccomp_notif_resp resp = {};
+
+			req.len = sizeof(req);
+			if (ioctl(listener, SECCOMP_NOTIF_RECV, &req) != sizeof(req)) {
+				perror("ioctl recv");
+				goto out_close;
+			}
+
+			if (handle_req(&req, &resp, listener) < 0)
+				goto out_close;
+
+			if (ioctl(listener, SECCOMP_NOTIF_SEND, &resp) != sizeof(resp)) {
+				perror("ioctl send");
+				goto out_close;
+			}
+		}
+out_close:
+		close(listener);
+		exit(1);
+	}
+
+	close(listener);
+
+	if (waitpid(worker, &status, 0) != worker) {
+		perror("waitpid");
+		goto out_kill;
+	}
+
+	if (umount2("/tmp/foo", MNT_DETACH) < 0 && errno != EINVAL) {
+		perror("umount2");
+		goto out_kill;
+	}
+
+	if (remove("/tmp/foo") < 0 && errno != ENOENT) {
+		perror("remove");
+		goto out_kill;
+	}
+
+	if (!WIFEXITED(status) || WEXITSTATUS(status)) {
+		fprintf(stderr, "worker exited nonzero\n");
+		goto out_kill;
+	}
+
+	ret = 0;
+
+out_kill:
+	if (tracer > 0)
+		kill(tracer, SIGKILL);
+	if (worker > 0)
+		kill(worker, SIGKILL);
+
+close_pair:
+	close(sk_pair[0]);
+	close(sk_pair[1]);
+	return ret;
+}
-- 
2.17.1
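
A side note on the string handling in handle_req() above: it read()s
PATH_MAX bytes at each pointer and then treats the buffers as C strings,
which relies on the worker having passed NUL-terminated paths. A supervisor
facing an uncooperative task would probably want to check for the terminator
itself. The following is only a sketch of one way to do that;
open_task_mem() and read_remote_string() are hypothetical helpers, not part
of this series, and the code assumes a 64-bit off_t so that the address fits
in pread()'s offset argument.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open /proc/<pid>/mem read-only, or return -1. */
static int open_task_mem(pid_t pid)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
	return open(path, O_RDONLY | O_CLOEXEC);
}

/*
 * Hypothetical helper: copy a NUL-terminated string that lives at addr in
 * the task whose /proc/<pid>/mem is open on mem. Returns the string length,
 * or -1 if a read fails or no terminator shows up within len bytes.
 */
static ssize_t read_remote_string(int mem, unsigned long addr,
				  char *buf, size_t len)
{
	size_t off = 0;

	assert(buf != NULL && len > 0);

	while (off < len) {
		/* A short read is fine: the mapping may end before len. */
		ssize_t n = pread(mem, buf + off, len - off,
				  (off_t)(addr + off));
		char *nul;

		if (n <= 0)
			return -1;

		nul = memchr(buf + off, '\0', (size_t)n);
		if (nul)
			return nul - buf;

		off += (size_t)n;
	}

	/* Buffer exhausted without a terminator: refuse to use it. */
	return -1;
}
```

In the sample, a helper like this would stand in for each lseek()/read()
pair, and it would still have to run after the SECCOMP_NOTIF_ID_VALID check.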


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-09-27 15:11 ` [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
@ 2018-09-27 16:20   ` Jann Horn
  2018-09-27 16:34     ` Tycho Andersen
  2018-09-27 17:35   ` Jann Horn
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-09-27 16:20 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel

On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> version which can acquire filters is useful. There are at least two reasons
> this is preferable, even though it uses ptrace:
>
> 1. You can control tasks that aren't cooperating with you
> 2. You can control tasks whose filters block sendmsg() and socket(); if the
>    task installs a filter which blocks these calls, there's no way with
>    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
>
> v2: fix a bug where listener mode was not unset when an unused fd was not
>     available
> v3: fix refcounting bug (Oleg)
> v4: * change the listener's fd flags to be 0
>     * rename GET_LISTENER to NEW_LISTENER (Matthew)
> v5: * add capable(CAP_SYS_ADMIN) requirement
> v7: * point the new listener at the right filter (Jann)
>
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>

If you address the two nits below, you can add:
Reviewed-by: Jann Horn <jannh@google.com>

>  include/linux/seccomp.h                       |  7 ++
>  include/uapi/linux/ptrace.h                   |  2 +
>  kernel/ptrace.c                               |  4 ++
>  kernel/seccomp.c                              | 31 +++++++++
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 68 +++++++++++++++++++
>  5 files changed, 112 insertions(+)
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 017444b5efed..234c61b37405 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -83,6 +83,8 @@ static inline int seccomp_mode(struct seccomp *s)
>  #ifdef CONFIG_SECCOMP_FILTER
>  extern void put_seccomp_filter(struct task_struct *tsk);
>  extern void get_seccomp_filter(struct task_struct *tsk);
> +extern long seccomp_new_listener(struct task_struct *task,
> +                                unsigned long filter_off);

Nit: Sorry, I only noticed this just now, but this should have return
type int, not long. ptrace_request() returns an int, and an fd is also
normally represented as an int, not a long.

>  #else  /* CONFIG_SECCOMP_FILTER */
>  static inline void put_seccomp_filter(struct task_struct *tsk)
>  {
> @@ -92,6 +94,11 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
>  {
>         return;
>  }
> +static inline long seccomp_new_listener(struct task_struct *task,
> +                                       unsigned long filter_off)
> +{
> +       return -EINVAL;
> +}
>  #endif /* CONFIG_SECCOMP_FILTER */
>
>  #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index d5a1b8a492b9..e80ecb1bd427 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -73,6 +73,8 @@ struct seccomp_metadata {
>         __u64 flags;            /* Output: filter's flags */
>  };
>
> +#define PTRACE_SECCOMP_NEW_LISTENER    0x420e
> +
>  /* Read signals from a shared (process wide) queue */
>  #define PTRACE_PEEKSIGINFO_SHARED      (1 << 0)
>
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 21fec73d45d4..289960ac181b 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
>                 ret = seccomp_get_metadata(child, addr, datavp);
>                 break;
>
> +       case PTRACE_SECCOMP_NEW_LISTENER:
> +               ret = seccomp_new_listener(child, addr);
> +               break;
> +
>         default:
>                 break;
>         }
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 44a31ac8373a..17685803a2af 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
>
>         return ret;
>  }
> +
> +long seccomp_new_listener(struct task_struct *task,
> +                         unsigned long filter_off)
> +{
> +       struct seccomp_filter *filter;
> +       struct file *listener;
> +       int fd;
> +
> +       if (!capable(CAP_SYS_ADMIN))
> +               return -EACCES;
> +
> +       filter = get_nth_filter(task, filter_off);
> +       if (IS_ERR(filter))
> +               return PTR_ERR(filter);
> +
> +       fd = get_unused_fd_flags(0);

s/0/O_CLOEXEC/ ? If userspace needs a non-cloexec fd, userspace can
easily unset O_CLOEXEC; but the reverse isn't true, because it'd be
racy.
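
To spell out the userspace side of that, here is a rough sketch. The helper
names has_cloexec(), clear_cloexec() and open_inheritable() are made up for
illustration, and the open() call is only a placeholder for whatever hands
back the fd (here it would be the ptrace() request returning the listener).
The point is that clearing the flag afterwards is a plain fcntl(), whereas
setting it afterwards leaves a window in which another thread can fork() and
exec() with the fd still inheritable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if not, -1 on error. */
static int has_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	return flags < 0 ? -1 : !!(flags & FD_CLOEXEC);
}

/* Clears FD_CLOEXEC on fd. Returns 0 on success, -1 on error. */
static int clear_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/*
 * Get an fd that is close-on-exec from the moment it exists, then opt out
 * explicitly. Nothing can leak in between, because the fd only becomes
 * inheritable once we decide it should be.
 */
static int open_inheritable(const char *path)
{
	int fd = open(path, O_RDONLY | O_CLOEXEC);

	if (fd < 0)
		return -1;

	assert(has_cloexec(fd) == 1);	/* set atomically by open() */

	if (clear_cloexec(fd) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```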

> +       if (fd < 0) {
> +               __put_seccomp_filter(filter);
> +               return fd;
> +       }
> +
> +       listener = init_listener(task, filter);
> +       __put_seccomp_filter(filter);
> +       if (IS_ERR(listener)) {
> +               put_unused_fd(fd);
> +               return PTR_ERR(listener);
> +       }
> +
> +       fd_install(fd, listener);
> +       return fd;
> +}
>  #endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 5f4b836a6792..c6ba3ed5392e 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -193,6 +193,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
>  }
>  #endif
>
> +#ifndef PTRACE_SECCOMP_NEW_LISTENER
> +#define PTRACE_SECCOMP_NEW_LISTENER 0x420e
> +#endif
> +
>  #if __BYTE_ORDER == __LITTLE_ENDIAN
>  #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
>  #elif __BYTE_ORDER == __BIG_ENDIAN
> @@ -3175,6 +3179,70 @@ TEST(get_user_notification_syscall)
>         EXPECT_EQ(0, WEXITSTATUS(status));
>  }
>
> +TEST(get_user_notification_ptrace)
> +{
> +       pid_t pid;
> +       int status, listener;
> +       int sk_pair[2];
> +       char c;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +
> +       ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +               /* Test that we get ENOSYS while not attached */
> +               EXPECT_EQ(syscall(__NR_getpid), -1);
> +               EXPECT_EQ(errno, ENOSYS);
> +
> +               /* Signal we're ready and have installed the filter. */
> +               EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
> +
> +               EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +               EXPECT_EQ(c, 'H');
> +
> +               exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
> +       }
> +
> +       EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> +       EXPECT_EQ(c, 'J');
> +
> +       EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
> +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +       listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> +       EXPECT_GE(listener, 0);
> +
> +       /* EBUSY for second listener */
> +       EXPECT_EQ(ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0), -1);
> +       EXPECT_EQ(errno, EBUSY);
> +
> +       EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
> +
> +       /* Now signal we are done and respond with magic */
> +       EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +
> +       req.len = sizeof(req);
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       close(listener);
> +}
> +
>  /*
>   * Check that a pid in a child namespace still shows up as valid in ours.
>   */
> --
> 2.17.1
>


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-09-27 16:20   ` Jann Horn
@ 2018-09-27 16:34     ` Tycho Andersen
  0 siblings, 0 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 16:34 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel

On Thu, Sep 27, 2018 at 06:20:23PM +0200, Jann Horn wrote:
> On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > version which can acquire filters is useful. There are at least two reasons
> > this is preferable, even though it uses ptrace:
> >
> > 1. You can control tasks that aren't cooperating with you
> > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> >    task installs a filter which blocks these calls, there's no way with
> >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> >
> > v2: fix a bug where listener mode was not unset when an unused fd was not
> >     available
> > v3: fix refcounting bug (Oleg)
> > v4: * change the listener's fd flags to be 0
> >     * rename GET_LISTENER to NEW_LISTENER (Matthew)
> > v5: * add capable(CAP_SYS_ADMIN) requirement
> > v7: * point the new listener at the right filter (Jann)
> >
> > Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> > CC: Kees Cook <keescook@chromium.org>
> > CC: Andy Lutomirski <luto@amacapital.net>
> > CC: Oleg Nesterov <oleg@redhat.com>
> > CC: Eric W. Biederman <ebiederm@xmission.com>
> > CC: "Serge E. Hallyn" <serge@hallyn.com>
> > CC: Christian Brauner <christian.brauner@ubuntu.com>
> > CC: Tyler Hicks <tyhicks@canonical.com>
> > CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> 
> If you address the two nits below, you can add:
> Reviewed-by: Jann Horn <jannh@google.com>

Thanks!

> >  include/linux/seccomp.h                       |  7 ++
> >  include/uapi/linux/ptrace.h                   |  2 +
> >  kernel/ptrace.c                               |  4 ++
> >  kernel/seccomp.c                              | 31 +++++++++
> >  tools/testing/selftests/seccomp/seccomp_bpf.c | 68 +++++++++++++++++++
> >  5 files changed, 112 insertions(+)
> >
> > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> > index 017444b5efed..234c61b37405 100644
> > --- a/include/linux/seccomp.h
> > +++ b/include/linux/seccomp.h
> > @@ -83,6 +83,8 @@ static inline int seccomp_mode(struct seccomp *s)
> >  #ifdef CONFIG_SECCOMP_FILTER
> >  extern void put_seccomp_filter(struct task_struct *tsk);
> >  extern void get_seccomp_filter(struct task_struct *tsk);
> > +extern long seccomp_new_listener(struct task_struct *task,
> > +                                unsigned long filter_off);
> 
> Nit: Sorry, I only noticed this just now, but this should have return
> type int, not long. ptrace_request() returns an int, and an fd is also
> normally represented as an int, not a long.

Ugh, I could have sworn I checked this. In particular because the
other seccomp code that's called from ptrace returns a long too :)

I'll fix that for the next version, and send a different patch for the
other two.

> >  #else  /* CONFIG_SECCOMP_FILTER */
> >  static inline void put_seccomp_filter(struct task_struct *tsk)
> >  {
> > @@ -92,6 +94,11 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
> >  {
> >         return;
> >  }
> > +static inline long seccomp_new_listener(struct task_struct *task,
> > +                                       unsigned long filter_off)
> > +{
> > +       return -EINVAL;
> > +}
> >  #endif /* CONFIG_SECCOMP_FILTER */
> >
> >  #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> > diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> > index d5a1b8a492b9..e80ecb1bd427 100644
> > --- a/include/uapi/linux/ptrace.h
> > +++ b/include/uapi/linux/ptrace.h
> > @@ -73,6 +73,8 @@ struct seccomp_metadata {
> >         __u64 flags;            /* Output: filter's flags */
> >  };
> >
> > +#define PTRACE_SECCOMP_NEW_LISTENER    0x420e
> > +
> >  /* Read signals from a shared (process wide) queue */
> >  #define PTRACE_PEEKSIGINFO_SHARED      (1 << 0)
> >
> > diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> > index 21fec73d45d4..289960ac181b 100644
> > --- a/kernel/ptrace.c
> > +++ b/kernel/ptrace.c
> > @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
> >                 ret = seccomp_get_metadata(child, addr, datavp);
> >                 break;
> >
> > +       case PTRACE_SECCOMP_NEW_LISTENER:
> > +               ret = seccomp_new_listener(child, addr);
> > +               break;
> > +
> >         default:
> >                 break;
> >         }
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 44a31ac8373a..17685803a2af 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> >
> >         return ret;
> >  }
> > +
> > +long seccomp_new_listener(struct task_struct *task,
> > +                         unsigned long filter_off)
> > +{
> > +       struct seccomp_filter *filter;
> > +       struct file *listener;
> > +       int fd;
> > +
> > +       if (!capable(CAP_SYS_ADMIN))
> > +               return -EACCES;
> > +
> > +       filter = get_nth_filter(task, filter_off);
> > +       if (IS_ERR(filter))
> > +               return PTR_ERR(filter);
> > +
> > +       fd = get_unused_fd_flags(0);
> 
> s/0/O_CLOEXEC/ ? If userspace needs a non-cloexec fd, userspace can
> easily unset O_CLOEXEC; but the reverse isn't true, because it'd be
> racy.

Sure, will do.

Tycho


* Re: [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd
  2018-09-27 15:11 ` [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd Tycho Andersen
@ 2018-09-27 16:39   ` Jann Horn
  2018-09-27 22:13     ` Tycho Andersen
  2018-09-27 19:28   ` Jann Horn
  2018-09-27 22:09   ` Kees Cook
  2 siblings, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-09-27 16:39 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel

On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> This patch adds a way to insert FDs into the tracee's process (also
> close/overwrite fds for the tracee). This functionality is necessary to
> mock things like socketpair() or dup2() or similar, but since it depends on
> external (vfs) patches, I've left it as a separate patch as before so the
> core functionality can still be merged while we argue about this. Except
> this time it doesn't add any ugliness to the API :)
[...]
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 17685803a2af..07a05ad59731 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -41,6 +41,8 @@
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/fdtable.h>
> +#include <net/cls_cgroup.h>
>
>  enum notify_state {
>         SECCOMP_NOTIFY_INIT,
> @@ -1684,6 +1686,56 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
>         return ret;
>  }
>
> +static long seccomp_notify_put_fd(struct seccomp_filter *filter,
> +                                 unsigned long arg)
> +{
> +       struct seccomp_notif_put_fd req;
> +       void __user *buf = (void __user *)arg;
> +       struct seccomp_knotif *knotif = NULL;
> +       long ret;
> +
> +       if (copy_from_user(&req, buf, sizeof(req)))
> +               return -EFAULT;
> +
> +       if (req.fd < 0 && req.to_replace < 0)
> +               return -EINVAL;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       ret = -ENOENT;
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               struct file *file = NULL;
> +
> +               if (knotif->id != req.id)
> +                       continue;
> +
> +               if (req.fd >= 0)
> +                       file = fget(req.fd);

So here we take a reference on `file`.

> +               if (req.to_replace >= 0) {
> +                       ret = replace_fd_task(knotif->task, req.to_replace,
> +                                             file, req.fd_flags);

Then here we try to place the file in knotif->task's file descriptor
table. This can either fail (e.g. due to exceeded rlimit), in which
case nothing happens, or it can do do_dup2(), which first takes an
extra reference to the file, then places it in the task's fd table.

Either way, afterwards, we still hold a reference to the file.

> +               } else {
> +                       unsigned long max_files;
> +
> +                       max_files = task_rlimit(knotif->task, RLIMIT_NOFILE);
> +                       ret = __alloc_fd(knotif->task->files, 0, max_files,
> +                                        req.fd_flags);
> +                       if (ret < 0)
> +                               break;

If we bail out here, we still hold a reference to `file`.

Suggestion: Change this to "if (ret >= 0) {" and make the following
code conditional instead of breaking.

> +                       __fd_install(knotif->task->files, ret, file);

But if we reach this point, __fd_install() consumes the file pointer,
so `file` is a dangling pointer now.

Suggestion: Add "break;" here.

> +               }

Suggestion: Add "if (file != NULL) fput(file);" here.

> +               break;
> +       }
> +
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
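
Taken together, the three suggestions above amount to one rule: the
temporary reference from fget() is dropped on every path except the one
where __fd_install() consumes it. A minimal userspace sketch of that
control flow follows. It is a toy model, not kernel code: the toy_*
names, the fixed slot number 3, and the `fail` switch are all made up
for illustration, and the "close an fd" case (req.fd < 0) is left out.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for struct file: only the reference count matters here. */
struct toy_file {
	int refs;
};

/* Models fget(): hand the caller one new reference. */
static struct toy_file *toy_fget(struct toy_file *f)
{
	f->refs++;
	return f;
}

/* Models fput(): drop one reference that the caller owns. */
static void toy_fput(struct toy_file *f)
{
	assert(f->refs > 0);
	f->refs--;
}

/*
 * Models the loop body for the matching notification.
 * to_replace >= 0 selects the replace_fd_task() path, otherwise the
 * __alloc_fd()/__fd_install() path. `fail` forces the chosen path to fail.
 * Returns the fd number on success or a negative errno.
 */
int toy_put_fd(struct toy_file *src, int to_replace, int fail)
{
	struct toy_file *file = toy_fget(src);	/* our temporary reference */
	int ret;

	if (to_replace >= 0) {
		if (fail) {
			ret = -EBADF;		/* nothing was installed */
		} else {
			file->refs++;		/* do_dup2() takes its own ref */
			ret = to_replace;
		}
	} else {
		ret = fail ? -EMFILE : 3;	/* models __alloc_fd() */
		if (ret >= 0) {
			/* __fd_install() consumes our reference: no fput(). */
			return ret;
		}
	}

	toy_fput(file);				/* every other path drops it */
	return ret;
}
```

Starting from a count of 1 (the supervisor's own fd table entry), each
successful call should leave the count at 2 and each failed call should
leave it at 1, whichever branch was taken.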


* Re: [PATCH v7 4/6] files: add a replace_fd_files() function
  2018-09-27 15:11 ` [PATCH v7 4/6] files: add a replace_fd_files() function Tycho Andersen
@ 2018-09-27 16:49   ` Jann Horn
  2018-09-27 18:04     ` Tycho Andersen
  2018-09-27 21:59   ` Kees Cook
  1 sibling, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-09-27 16:49 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel,
	Al Viro

On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> Similar to fd_install/__fd_install, we want to be able to replace an fd of
> an arbitrary struct files_struct, not just current's. We'll use this in the
> next patch to implement the seccomp ioctl that allows inserting fds into a
> stopped process' context.
[...]
> diff --git a/fs/file.c b/fs/file.c
> index 7ffd6e9d103d..3b3c5aadaadb 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -850,24 +850,32 @@ __releases(&files->file_lock)
>  }
>
>  int replace_fd(unsigned fd, struct file *file, unsigned flags)
> +{
> +       return replace_fd_task(current, fd, file, flags);
> +}
> +
> +/*
> + * Same warning as __alloc_fd()/__fd_install() here.
> + */
> +int replace_fd_task(struct task_struct *task, unsigned fd,
> +                   struct file *file, unsigned flags)
>  {
>         int err;
> -       struct files_struct *files = current->files;

Why did you remove this? You could just do s/current/task/ instead, right?

>         if (!file)
> -               return __close_fd(files, fd);
> +               return __close_fd(task->files, fd);
>
> -       if (fd >= rlimit(RLIMIT_NOFILE))
> +       if (fd >= task_rlimit(task, RLIMIT_NOFILE))
>                 return -EBADF;
>
> -       spin_lock(&files->file_lock);
> -       err = expand_files(files, fd);
> +       spin_lock(&task->files->file_lock);
> +       err = expand_files(task->files, fd);
>         if (unlikely(err < 0))
>                 goto out_unlock;
> -       return do_dup2(files, file, fd, flags);
> +       return do_dup2(task->files, file, fd, flags);
>
>  out_unlock:
> -       spin_unlock(&files->file_lock);
> +       spin_unlock(&task->files->file_lock);
>         return err;
>  }
>
> diff --git a/include/linux/file.h b/include/linux/file.h
> index 6b2fb032416c..f94277fee038 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -11,6 +11,7 @@
>  #include <linux/posix_types.h>
>
>  struct file;
> +struct task_struct;
>
>  extern void fput(struct file *);
>
> @@ -79,6 +80,13 @@ static inline void fdput_pos(struct fd f)
>
>  extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
>  extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
> +/*
> + * Warning! This is only safe if you know the owner of the files_struct is
> + * stopped outside syscall context. It's a very bad idea to use this unless you
> + * have similar guarantees in your code.
> + */
> +extern int replace_fd_task(struct task_struct *task, unsigned fd,
> +                          struct file *file, unsigned flags);

I think Linux kernel coding style is normally to have comments on the
implementations of functions, not in the headers? Maybe replace the
warning above the implementation of replace_fd_task() with this
comment.


* Re: [PATCH v7 2/6] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  2018-09-27 15:11 ` [PATCH v7 2/6] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
@ 2018-09-27 16:51   ` Jann Horn
  2018-09-27 21:42   ` Kees Cook
  2018-10-08 13:55   ` Christian Brauner
  2 siblings, 0 replies; 91+ messages in thread
From: Jann Horn @ 2018-09-27 16:51 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel

On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> In the next commit we'll use this same mnemonic to get a listener for the
> nth filter, so we need it available outside of CHECKPOINT_RESTORE in the
> USER_NOTIFICATION case as well.
>
> v2: new in v2
> v3: no changes
> v4: no changes
> v5: switch to CHECKPOINT_RESTORE || USER_NOTIFICATION to avoid warning when
>     only CONFIG_SECCOMP_FILTER is enabled.
> v7: drop USER_NOTIFICATION bits
>
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>

Reviewed-by: Jann Horn <jannh@google.com>

> ---
>  kernel/seccomp.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index fa6fe9756c80..44a31ac8373a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1158,7 +1158,7 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
>         return do_seccomp(op, 0, uargs);
>  }
>
> -#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> +#if defined(CONFIG_SECCOMP_FILTER)
>  static struct seccomp_filter *get_nth_filter(struct task_struct *task,
>                                              unsigned long filter_off)
>  {
> @@ -1205,6 +1205,7 @@ static struct seccomp_filter *get_nth_filter(struct task_struct *task,
>         return filter;
>  }
>
> +#if defined(CONFIG_CHECKPOINT_RESTORE)
>  long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
>                         void __user *data)
>  {
> @@ -1277,7 +1278,8 @@ long seccomp_get_metadata(struct task_struct *task,
>         __put_seccomp_filter(filter);
>         return ret;
>  }
> -#endif
> +#endif /* CONFIG_CHECKPOINT_RESTORE */
> +#endif /* CONFIG_SECCOMP_FILTER */
>
>  #ifdef CONFIG_SYSCTL
>
> --
> 2.17.1
>


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-09-27 15:11 ` [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
  2018-09-27 16:20   ` Jann Horn
@ 2018-09-27 17:35   ` Jann Horn
  2018-09-27 18:09     ` Tycho Andersen
  2018-09-27 21:53   ` Kees Cook
  2018-10-08 15:16   ` Christian Brauner
  3 siblings, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-09-27 17:35 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel

On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
>
> As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> version which can acquire filters is useful. There are at least two reasons
> this is preferable, even though it uses ptrace:
>
> 1. You can control tasks that aren't cooperating with you
> 2. You can control tasks whose filters block sendmsg() and socket(); if the
>    task installs a filter which blocks these calls, there's no way with
>    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
[...]
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 21fec73d45d4..289960ac181b 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
>                 ret = seccomp_get_metadata(child, addr, datavp);
>                 break;
>
> +       case PTRACE_SECCOMP_NEW_LISTENER:
> +               ret = seccomp_new_listener(child, addr);
> +               break;

Actually, could you amend this to also ensure that `data == 0` and
return -EINVAL otherwise? Then if we want to abuse `data` for passing
flags in the future, we don't have to worry about what happens if
someone passes in garbage as `data`.
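
The pattern being asked for is the usual "reserved arguments must be
zero" check. A minimal userspace sketch, assuming a hypothetical
toy_new_listener_request() that stands in for the new case in
ptrace_request() (the real call into seccomp_new_listener() is replaced
by a placeholder return value):

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for the PTRACE_SECCOMP_NEW_LISTENER case.
 * `addr` selects the filter; `data` is currently unused and must be 0.
 */
long toy_new_listener_request(unsigned long addr, unsigned long data)
{
	/* Reject garbage now so `data` can carry flags later. */
	if (data != 0)
		return -EINVAL;

	(void)addr;
	return 0;	/* placeholder for seccomp_new_listener(child, addr) */
}

/* Usage: a zero `data` is accepted, anything else is refused. */
void toy_new_listener_usage(void)
{
	assert(toy_new_listener_request(0, 0) == 0);
	assert(toy_new_listener_request(0, 0x1) == -EINVAL);
}
```

Because every non-zero value fails today, a later kernel can assign
meaning to individual bits without breaking callers that worked before.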


* Re: [PATCH v7 4/6] files: add a replace_fd_files() function
  2018-09-27 16:49   ` Jann Horn
@ 2018-09-27 18:04     ` Tycho Andersen
  0 siblings, 0 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 18:04 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel,
	Al Viro

On Thu, Sep 27, 2018 at 06:49:02PM +0200, Jann Horn wrote:
> On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > Similar to fd_install/__fd_install, we want to be able to replace an fd of
> > an arbitrary struct files_struct, not just current's. We'll use this in the
> > next patch to implement the seccomp ioctl that allows inserting fds into a
> > stopped process' context.
> [...]
> > diff --git a/fs/file.c b/fs/file.c
> > index 7ffd6e9d103d..3b3c5aadaadb 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -850,24 +850,32 @@ __releases(&files->file_lock)
> >  }
> >
> >  int replace_fd(unsigned fd, struct file *file, unsigned flags)
> > +{
> > +       return replace_fd_task(current, fd, file, flags);
> > +}
> > +
> > +/*
> > + * Same warning as __alloc_fd()/__fd_install() here.
> > + */
> > +int replace_fd_task(struct task_struct *task, unsigned fd,
> > +                   struct file *file, unsigned flags)
> >  {
> >         int err;
> > -       struct files_struct *files = current->files;
> 
> Why did you remove this? You could just do s/current/task/ instead, right?

No reason, probably just flailing around trying to figure out what
exactly I wanted. I'll make the change, thanks.

> >         if (!file)
> > -               return __close_fd(files, fd);
> > +               return __close_fd(task->files, fd);
> >
> > -       if (fd >= rlimit(RLIMIT_NOFILE))
> > +       if (fd >= task_rlimit(task, RLIMIT_NOFILE))
> >                 return -EBADF;
> >
> > -       spin_lock(&files->file_lock);
> > -       err = expand_files(files, fd);
> > +       spin_lock(&task->files->file_lock);
> > +       err = expand_files(task->files, fd);
> >         if (unlikely(err < 0))
> >                 goto out_unlock;
> > -       return do_dup2(files, file, fd, flags);
> > +       return do_dup2(task->files, file, fd, flags);
> >
> >  out_unlock:
> > -       spin_unlock(&files->file_lock);
> > +       spin_unlock(&task->files->file_lock);
> >         return err;
> >  }
> >
> > diff --git a/include/linux/file.h b/include/linux/file.h
> > index 6b2fb032416c..f94277fee038 100644
> > --- a/include/linux/file.h
> > +++ b/include/linux/file.h
> > @@ -11,6 +11,7 @@
> >  #include <linux/posix_types.h>
> >
> >  struct file;
> > +struct task_struct;
> >
> >  extern void fput(struct file *);
> >
> > @@ -79,6 +80,13 @@ static inline void fdput_pos(struct fd f)
> >
> >  extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
> >  extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
> > +/*
> > + * Warning! This is only safe if you know the owner of the files_struct is
> > + * stopped outside syscall context. It's a very bad idea to use this unless you
> > + * have similar guarantees in your code.
> > + */
> > +extern int replace_fd_task(struct task_struct *task, unsigned fd,
> > +                          struct file *file, unsigned flags);
> 
> I think Linux kernel coding style is normally to have comments on the
> implementations of functions, not in the headers? Maybe replace the
> warning above the implementation of replace_fd_task() with this
> comment.

Will do.

Cheers,

Tycho


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-09-27 17:35   ` Jann Horn
@ 2018-09-27 18:09     ` Tycho Andersen
  0 siblings, 0 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 18:09 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel

On Thu, Sep 27, 2018 at 07:35:06PM +0200, Jann Horn wrote:
> On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > version which can acquire filters is useful. There are at least two reasons
> > this is preferable, even though it uses ptrace:
> >
> > 1. You can control tasks that aren't cooperating with you
> > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> >    task installs a filter which blocks these calls, there's no way with
> >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> [...]
> > diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> > index 21fec73d45d4..289960ac181b 100644
> > --- a/kernel/ptrace.c
> > +++ b/kernel/ptrace.c
> > @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
> >                 ret = seccomp_get_metadata(child, addr, datavp);
> >                 break;
> >
> > +       case PTRACE_SECCOMP_NEW_LISTENER:
> > +               ret = seccomp_new_listener(child, addr);
> > +               break;
> 
> Actually, could you amend this to also ensure that `data == 0` and
> return -EINVAL otherwise? Then if we want to abuse `data` for passing
> flags in the future, we don't have to worry about what happens if
> someone passes in garbage as `data`.

Yes, good idea. Thanks!

Tycho


* Re: [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd
  2018-09-27 15:11 ` [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd Tycho Andersen
  2018-09-27 16:39   ` Jann Horn
@ 2018-09-27 19:28   ` Jann Horn
  2018-09-27 22:14     ` Tycho Andersen
  2018-09-27 22:09   ` Kees Cook
  2 siblings, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-09-27 19:28 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel

On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> This patch adds a way to insert FDs into the tracee's process (also
> close/overwrite fds for the tracee). This functionality is necessary to
> mock things like socketpair() or dup2() or similar, but since it depends on
> external (vfs) patches, I've left it as a separate patch as before so the
> core functionality can still be merged while we argue about this. Except
> this time it doesn't add any ugliness to the API :)
[...]
> +static long seccomp_notify_put_fd(struct seccomp_filter *filter,
> +                                 unsigned long arg)
> +{
> +       struct seccomp_notif_put_fd req;
> +       void __user *buf = (void __user *)arg;
> +       struct seccomp_knotif *knotif = NULL;
> +       long ret;
> +
> +       if (copy_from_user(&req, buf, sizeof(req)))
> +               return -EFAULT;
> +
> +       if (req.fd < 0 && req.to_replace < 0)
> +               return -EINVAL;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       ret = -ENOENT;
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               struct file *file = NULL;
> +
> +               if (knotif->id != req.id)
> +                       continue;

Are you intentionally permitting non-SENT states here? It shouldn't
make a big difference, but I think it'd be nice to at least block the
use of notifications in SECCOMP_NOTIFY_REPLIED state.
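
For illustration, such a check could look roughly like the userspace
sketch below. The enum mirrors the one in the patch; the function name
toy_put_fd_state_ok() and the choice of -EINPROGRESS as the error are
assumptions made for this example, not something the patch defines.
This is the strict variant that accepts only SENT; the minimal variant
would reject only REPLIED.

```c
#include <assert.h>
#include <errno.h>

/* Same states, in the same order, as enum notify_state in the patch. */
enum notify_state {
	SECCOMP_NOTIFY_INIT,
	SECCOMP_NOTIFY_SENT,
	SECCOMP_NOTIFY_REPLIED,
};

/*
 * Hypothetical gate for the PUT_FD path: only a notification that the
 * listener has received and not yet answered may have fds installed.
 * Returns 0 if allowed, a negative errno otherwise (errno is assumed).
 */
long toy_put_fd_state_ok(enum notify_state state)
{
	if (state != SECCOMP_NOTIFY_SENT)
		return -EINPROGRESS;
	return 0;
}

/* Usage: SENT passes, INIT and REPLIED are refused. */
void toy_put_fd_state_usage(void)
{
	assert(toy_put_fd_state_ok(SECCOMP_NOTIFY_SENT) == 0);
	assert(toy_put_fd_state_ok(SECCOMP_NOTIFY_INIT) == -EINPROGRESS);
	assert(toy_put_fd_state_ok(SECCOMP_NOTIFY_REPLIED) == -EINPROGRESS);
}
```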

> +               if (req.fd >= 0)
> +                       file = fget(req.fd);


* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 15:11 ` [PATCH v7 1/6] seccomp: add a return code to " Tycho Andersen
@ 2018-09-27 21:31   ` Kees Cook
  2018-09-27 22:48     ` Tycho Andersen
  2018-10-17 20:29     ` Tycho Andersen
  2018-09-27 21:51   ` Jann Horn
  2018-09-29  0:28   ` Aleksa Sarai
  2 siblings, 2 replies; 91+ messages in thread
From: Kees Cook @ 2018-09-27 21:31 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
>
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
>
> v2: * make id a u64; the idea here being that it will never overflow,
>       because 64 is huge (one syscall every nanosecond => wrap every 584
>       years) (Andy)
>     * prevent nesting of user notifications: if someone is already attached
>       to the tree in one place, nobody else can attach to the tree (Andy)
>     * notify the listener of signals the tracee receives as well (Andy)
>     * implement poll
> v3: * lockdep fix (Oleg)
>     * drop unnecessary WARN()s (Christian)
>     * rearrange error returns to be more pretty (Christian)
>     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> v4: * fix implementation of poll to use poll_wait() (Jann)
>     * change listener's fd flags to be 0 (Jann)
>     * hoist filter initialization out of ifdefs to its own function
>       init_user_notification()
>     * add some more testing around poll() and closing the listener while a
>       syscall is in action
>     * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
>       creates a new one (Matthew)
>     * correctly handle pid namespaces, add some testcases (Matthew)
>     * use EINPROGRESS instead of EINVAL when a notification response is
>       written twice (Matthew)
>     * fix comment typo from older version (SEND vs READ) (Matthew)
>     * whitespace and logic simplification (Tobin)
>     * add some Documentation/ bits on userspace trapping
> v5: * fix documentation typos (Jann)
>     * add signalled field to struct seccomp_notif (Jann)
>     * switch to using ioctls instead of read()/write() for struct passing
>       (Jann)
>     * add an ioctl to ensure an id is still valid
> v6: * docs typo fixes, update docs for ioctl() change (Christian)
> v7: * switch struct seccomp_knotif's id member to a u64 (derp :)
>     * use notify_lock in IS_ID_VALID query to avoid racing
>     * s/signalled/signaled (Tyler)
>     * fix docs to reflect that ids are not globally unique (Tyler)
>     * add a test to check -ERESTARTSYS behavior (Tyler)
>     * drop CONFIG_SECCOMP_USER_NOTIFICATION (Tyler)
>     * reorder USER_NOTIF in seccomp return codes list (Tyler)
>     * return size instead of sizeof(struct user_notif) (Tyler)
>     * ENOENT instead of EINVAL when invalid id is passed (Tyler)
>     * drop CONFIG_SECCOMP_USER_NOTIFICATION guards (Tyler)
>     * s/IS_ID_VALID/ID_VALID and switch ioctl to be "well behaved" (Tyler)
>     * add a new struct notification to minimize the additions to
>       struct seccomp_filter, also pack the necessary additions a bit more
>       cleverly (Tyler)
>     * switch to keeping track of the task itself instead of the pid (we'll
>       use this for implementing PUT_FD)

Patch-sending nit: can you put the versioning below the "---" line so
it isn't included in the final commit? (And I normally read these
backwards, so I'd expect v7 at the top, but that's not a big deal. I
mean... neither is the --- thing, but it makes "git am" easier for me
since I don't have to go edit the versioning out of the log.)

> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  Documentation/ioctl/ioctl-number.txt          |   1 +
>  .../userspace-api/seccomp_filter.rst          |  73 +++
>  include/linux/seccomp.h                       |   7 +-
>  include/uapi/linux/seccomp.h                  |  33 +-
>  kernel/seccomp.c                              | 436 +++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 413 ++++++++++++++++-
>  6 files changed, 954 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> index 13a7c999c04a..31e9707f7e06 100644
> --- a/Documentation/ioctl/ioctl-number.txt
> +++ b/Documentation/ioctl/ioctl-number.txt
> @@ -345,4 +345,5 @@ Code  Seq#(hex)     Include File            Comments
>                                         <mailto:raph@8d.com>
>  0xF6   all     LTTng                   Linux Trace Toolkit Next Generation
>                                         <mailto:mathieu.desnoyers@efficios.com>
> +0xF7    00-1F   uapi/linux/seccomp.h
>  0xFD   all     linux/dm-ioctl.h

I spent some time looking at this, and yes, it seems preferred to add
an entry here.

> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index e5320f6c8654..017444b5efed 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -4,9 +4,10 @@
>
>  #include <uapi/linux/seccomp.h>
>
> -#define SECCOMP_FILTER_FLAG_MASK       (SECCOMP_FILTER_FLAG_TSYNC      | \
> -                                        SECCOMP_FILTER_FLAG_LOG        | \
> -                                        SECCOMP_FILTER_FLAG_SPEC_ALLOW)
> +#define SECCOMP_FILTER_FLAG_MASK       (SECCOMP_FILTER_FLAG_TSYNC | \
> +                                        SECCOMP_FILTER_FLAG_LOG | \
> +                                        SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
> +                                        SECCOMP_FILTER_FLAG_NEW_LISTENER)
>
>  #ifdef CONFIG_SECCOMP
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 9efc0e73d50b..d4ccb32fe089 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,9 +17,10 @@
>  #define SECCOMP_GET_ACTION_AVAIL       2
>
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC      (1UL << 0)
> -#define SECCOMP_FILTER_FLAG_LOG                (1UL << 1)
> -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
> +#define SECCOMP_FILTER_FLAG_TSYNC              (1UL << 0)
> +#define SECCOMP_FILTER_FLAG_LOG                        (1UL << 1)
> +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW         (1UL << 2)
> +#define SECCOMP_FILTER_FLAG_NEW_LISTENER       (1UL << 3)

Since these are all getting indentation updates, can you switch them
to BIT(0), BIT(1), etc?

>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -35,6 +36,7 @@
>  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
>  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> @@ -60,4 +62,29 @@ struct seccomp_data {
>         __u64 args[6];
>  };
>
> +struct seccomp_notif {
> +       __u16 len;
> +       __u64 id;
> +       __u32 pid;
> +       __u8 signaled;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u16 len;
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};

So, len has to come first, for versioning. However, since it's ahead
of a u64, this leaves a struct padding hole. pahole output:

struct seccomp_notif {
        __u16                      len;                  /*     0     2 */

        /* XXX 6 bytes hole, try to pack */

        __u64                      id;                   /*     8     8 */
        __u32                      pid;                  /*    16     4 */
        __u8                       signaled;             /*    20     1 */

        /* XXX 3 bytes hole, try to pack */

        struct seccomp_data        data;                 /*    24    64 */
        /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */

        /* size: 88, cachelines: 2, members: 5 */
        /* sum members: 79, holes: 2, sum holes: 9 */
        /* last cacheline: 24 bytes */
};
struct seccomp_notif_resp {
        __u16                      len;                  /*     0     2 */

        /* XXX 6 bytes hole, try to pack */

        __u64                      id;                   /*     8     8 */
        __s32                      error;                /*    16     4 */

        /* XXX 4 bytes hole, try to pack */

        __s64                      val;                  /*    24     8 */

        /* size: 32, cachelines: 1, members: 4 */
        /* sum members: 22, holes: 2, sum holes: 10 */
        /* last cacheline: 32 bytes */
};

How about making len u32, and moving pid and error above "id"? This
leaves a hole after signaled, so changing "len" won't be sufficient
for versioning here. Perhaps move it after data?
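
One reading of that reordering, as a userspace sketch: len becomes 32
bits, pid and error sit next to it, and signaled moves to the end. The
toy_* struct names are made up for this example, and fixed-width
<stdint.h> types stand in for the __u32/__u64 uapi types. The static
assertions check that no interior holes remain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order and sizes as the uapi struct seccomp_data. */
struct toy_seccomp_data {
	int32_t nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/* len widened to 32 bits, pid moved up beside it, signaled moved last. */
struct toy_seccomp_notif {
	uint32_t len;
	uint32_t pid;
	uint64_t id;
	struct toy_seccomp_data data;
	uint8_t signaled;
};

/* len widened to 32 bits, error moved up beside it. */
struct toy_seccomp_notif_resp {
	uint32_t len;
	int32_t error;
	uint64_t id;
	int64_t val;
};

/* Each member starts exactly where the previous one ends. */
static_assert(offsetof(struct toy_seccomp_notif, id) == 8, "hole before id");
static_assert(offsetof(struct toy_seccomp_notif, data) == 16, "hole before data");
static_assert(offsetof(struct toy_seccomp_notif, signaled) == 80, "hole before signaled");
static_assert(sizeof(struct toy_seccomp_notif_resp) == 24, "hole in resp");
```

The compiler still pads the tail of the first struct after signaled up
to the struct's alignment, so this removes the interior holes but not
the trailing padding.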

> +
> +#define SECCOMP_IOC_MAGIC              0xF7

Was there any specific reason for picking this value? There are lots
of fun ASCII codes left like '!' or '*'. :)

> +
> +/* Flags for seccomp notification fd ioctl. */
> +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> +                                       struct seccomp_notif)
> +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> +                                       struct seccomp_notif_resp)
> +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> +                                       __u64)

To match other UAPI ioctls, can these have a prefix of "SECCOMP_IOCTL_..."?

It may also be useful to match how other uapis do this, like for DRM:

#define DRM_IOCTL_BASE                  'd'
#define DRM_IO(nr)                      _IO(DRM_IOCTL_BASE,nr)
#define DRM_IOR(nr,type)                _IOR(DRM_IOCTL_BASE,nr,type)
#define DRM_IOW(nr,type)                _IOW(DRM_IOCTL_BASE,nr,type)
#define DRM_IOWR(nr,type)               _IOWR(DRM_IOCTL_BASE,nr,type)

#define DRM_IOCTL_VERSION               DRM_IOWR(0x00, struct drm_version)
#define DRM_IOCTL_GET_UNIQUE            DRM_IOWR(0x01, struct drm_unique)
#define DRM_IOCTL_GET_MAGIC             DRM_IOR( 0x02, struct drm_auth)
...
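
Applied to seccomp, the same pattern might look like the sketch below.
The SECCOMP_IO*() wrappers and the SECCOMP_IOCTL_* names are
hypothetical, and the two toy_* structs are placeholders for the
notification structs whose layout is still being discussed. Only
SECCOMP_IOC_MAGIC and the command numbers 0 to 2 come from the patch.

```c
#include <assert.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Placeholder payloads, not the real uapi layouts. */
struct toy_notif {
	__u64 id;
};

struct toy_notif_resp {
	__u64 id;
	__s64 val;
};

#define SECCOMP_IOC_MAGIC		0xF7
#define SECCOMP_IO(nr)			_IO(SECCOMP_IOC_MAGIC, nr)
#define SECCOMP_IOR(nr, type)		_IOR(SECCOMP_IOC_MAGIC, nr, type)
#define SECCOMP_IOW(nr, type)		_IOW(SECCOMP_IOC_MAGIC, nr, type)
#define SECCOMP_IOWR(nr, type)		_IOWR(SECCOMP_IOC_MAGIC, nr, type)

/* Hypothetical names following the SECCOMP_IOCTL_ prefix suggestion. */
#define SECCOMP_IOCTL_NOTIF_RECV	SECCOMP_IOWR(0, struct toy_notif)
#define SECCOMP_IOCTL_NOTIF_SEND	SECCOMP_IOWR(1, struct toy_notif_resp)
#define SECCOMP_IOCTL_NOTIF_ID_VALID	SECCOMP_IOR(2, __u64)

/* The ioctl-number.txt entry in the patch reserves sequence 00-1F. */
static_assert(_IOC_NR(SECCOMP_IOCTL_NOTIF_ID_VALID) <= 0x1F,
	      "command number outside the documented range");
```

The wrappers keep the magic number in one place, so changing it later
would touch a single line.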


> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index fd023ac24e10..fa6fe9756c80 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -33,12 +33,78 @@
>  #endif
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/pid.h>
>  #include <linux/ptrace.h>
>  #include <linux/security.h>
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +       SECCOMP_NOTIFY_INIT,
> +       SECCOMP_NOTIFY_SENT,
> +       SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +       /* The struct pid of the task whose filter triggered the notification */
> +       struct task_struct *task;
> +
> +       /* The "cookie" for this request; this is unique for this filter. */
> +       u64 id;
> +
> +       /* Whether or not this task has been given an interruptible signal. */
> +       bool signaled;
> +
> +       /*
> +        * The seccomp data. This pointer is valid the entire time this
> +        * notification is active, since it comes from __seccomp_filter which
> +        * eclipses the entire lifecycle here.
> +        */
> +       const struct seccomp_data *data;
> +
> +       /*
> +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +        * struct seccomp_knotif is created and starts out in INIT. Once the
> +        * handler reads the notification off of an FD, it transitions to SENT.
> +        * If a signal is received the state transitions back to INIT and
> +        * another message is sent. When the userspace handler replies, state
> +        * transitions to REPLIED.
> +        */
> +       enum notify_state state;
> +
> +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +       int error;
> +       long val;
> +
> +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +       struct completion ready;
> +
> +       struct list_head list;
> +};
> +
> +/**
> + * struct notification - container for seccomp userspace notifications. Since
> + * most seccomp filters will not have notification listeners attached and this
> + * structure is fairly large, we store the notification-specific stuff in a
> + * separate structure.
> + *
> + * @request: A semaphore that users of this notification can wait on for
> + *           changes. Actual reads and writes are still controlled with
> + *           filter->notify_lock.
> + * @notify_lock: A lock for all notification-related accesses.
> + * @next_id: The id of the next request.
> + * @notifications: A list of struct seccomp_knotif elements.
> + * @wqh: A wait queue for poll.
> + */
> +struct notification {
> +       struct semaphore request;
> +       u64 next_id;
> +       struct list_head notifications;
> +       wait_queue_head_t wqh;
> +};
>
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
> @@ -66,6 +132,8 @@ struct seccomp_filter {
>         bool log;
>         struct seccomp_filter *prev;
>         struct bpf_prog *prog;
> +       struct notification *notif;
> +       struct mutex notify_lock;
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>         if (!sfilter)
>                 return ERR_PTR(-ENOMEM);
>
> +       mutex_init(&sfilter->notify_lock);
>         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>                                         seccomp_check_filter, save_orig);
>         if (ret < 0) {
> @@ -556,11 +625,13 @@ static void seccomp_send_sigsys(int syscall, int reason)
>  #define SECCOMP_LOG_TRACE              (1 << 4)
>  #define SECCOMP_LOG_LOG                        (1 << 5)
>  #define SECCOMP_LOG_ALLOW              (1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
>
>  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
>                                     SECCOMP_LOG_KILL_THREAD  |
>                                     SECCOMP_LOG_TRAP  |
>                                     SECCOMP_LOG_ERRNO |
> +                                   SECCOMP_LOG_USER_NOTIF |
>                                     SECCOMP_LOG_TRACE |
>                                     SECCOMP_LOG_LOG;
>
> @@ -581,6 +652,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>         case SECCOMP_RET_TRACE:
>                 log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
>                 break;
> +       case SECCOMP_RET_USER_NOTIF:
> +               log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> +               break;
>         case SECCOMP_RET_LOG:
>                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
>                 break;
> @@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
>  #else
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> +       /* Note: overflow is ok here, the id just needs to be unique */

Maybe just clarify in the comment: unique to the filter.

> +       return filter->notif->next_id++;

Also, it might be useful to add for both documentation and lockdep:

lockdep_assert_held(&filter->notify_lock);

into this function?


> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       int err;
> +       long ret = 0;
> +       struct seccomp_knotif n = {};
> +
> +       mutex_lock(&match->notify_lock);
> +       err = -ENOSYS;
> +       if (!match->notif)
> +               goto out;
> +
> +       n.task = current;
> +       n.state = SECCOMP_NOTIFY_INIT;
> +       n.data = sd;
> +       n.id = seccomp_next_notify_id(match);
> +       init_completion(&n.ready);
> +
> +       list_add(&n.list, &match->notif->notifications);
> +       wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> +
> +       mutex_unlock(&match->notify_lock);
> +       up(&match->notif->request);
> +

Maybe add a big comment here saying this is where we're waiting for a reply?

> +       err = wait_for_completion_interruptible(&n.ready);
> +       mutex_lock(&match->notify_lock);
> +
> +       /*
> +        * Here it's possible we got a signal and then had to wait on the mutex
> +        * while the reply was sent, so let's be sure there wasn't a response
> +        * in the meantime.
> +        */
> +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +               /*
> +                * We got a signal. Let's tell userspace about it (potentially
> +                * again, if we had already notified them about the first one).
> +                */
> +               n.signaled = true;
> +               if (n.state == SECCOMP_NOTIFY_SENT) {
> +                       n.state = SECCOMP_NOTIFY_INIT;
> +                       up(&match->notif->request);
> +               }
> +               mutex_unlock(&match->notify_lock);
> +               err = wait_for_completion_killable(&n.ready);
> +               mutex_lock(&match->notify_lock);
> +               if (err < 0)
> +                       goto remove_list;
> +       }
> +
> +       ret = n.val;
> +       err = n.error;
> +
> +remove_list:
> +       list_del(&n.list);
> +out:
> +       mutex_unlock(&match->notify_lock);
> +       syscall_set_return_value(current, task_pt_regs(current),
> +                                err, ret);
> +}
> +
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>                             const bool recheck_after_trace)
>  {
> @@ -728,6 +869,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>
>                 return 0;
>
> +       case SECCOMP_RET_USER_NOTIF:
> +               seccomp_do_user_notification(this_syscall, match, sd);
> +               goto skip;

Nit: please add a blank line here (to match the other cases).

>         case SECCOMP_RET_LOG:
>                 seccomp_log(this_syscall, 0, action, true);
>                 return 0;
> @@ -834,6 +978,9 @@ static long seccomp_set_mode_strict(void)
>  }
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +static struct file *init_listener(struct task_struct *,
> +                                 struct seccomp_filter *);

Why is the forward declaration needed instead of just moving the
function here? I didn't see anything in it that looked like it
couldn't move.

> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>         struct seccomp_filter *prepared = NULL;
>         long ret = -EINVAL;
> +       int listener = 0;

Nit: "invalid fd" should be -1, not 0.
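
To illustrate with a small userspace sketch (the helper name is made up
for this mail, it is not from the patch): 0 is an ordinary descriptor
number, while a negative value can never refer to an open file, which
is why -1 is the conventional "no fd yet" marker.

```c
#include <fcntl.h>
#include <unistd.h>

/* Conventional "no descriptor" marker; 0 would alias stdin. */
#define INVALID_FD_SKETCH (-1)

/* Hypothetical helper: returns 1 if fd refers to an open descriptor. */
static int fd_is_open(int fd)
{
	return fcntl(fd, F_GETFD) != -1;
}
```

With this, fd_is_open(INVALID_FD_SKETCH) is always 0, whereas
fd_is_open(0) depends on whether stdin happens to be open.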

> +       struct file *listener_f = NULL;
>
>         /* Validate flags. */
>         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         if (IS_ERR(prepared))
>                 return PTR_ERR(prepared);
>
> +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +               listener = get_unused_fd_flags(0);

As with the other place pointed out by Jann, this should maybe be O_CLOEXEC too?
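
For reference, a userspace sketch of what the flag changes (the helper
name is hypothetical): a descriptor created without O_CLOEXEC survives
exec(), one created with it does not, and FD_CLOEXEC is how you can
observe the difference.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd is close-on-exec, 0 if it is not, -1 on error. */
static int fd_is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

On the kernel side I'd expect this to amount to passing O_CLOEXEC
instead of 0 to get_unused_fd_flags().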

> +               if (listener < 0) {
> +                       ret = listener;
> +                       goto out_free;
> +               }
> +
> +               listener_f = init_listener(current, prepared);
> +               if (IS_ERR(listener_f)) {
> +                       put_unused_fd(listener);
> +                       ret = PTR_ERR(listener_f);
> +                       goto out_free;
> +               }
> +       }
> +
>         /*
>          * Make sure we cannot change seccomp or nnp state via TSYNC
>          * while another thread is in the middle of calling exec.
>          */
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>             mutex_lock_killable(&current->signal->cred_guard_mutex))
> -               goto out_free;
> +               goto out_put_fd;
>
>         spin_lock_irq(&current->sighand->siglock);
>
> @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         spin_unlock_irq(&current->sighand->siglock);
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>                 mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +               if (ret < 0) {
> +                       fput(listener_f);
> +                       put_unused_fd(listener);
> +               } else {
> +                       fd_install(listener, listener_f);
> +                       ret = listener;
> +               }
> +       }

Can you update the kernel-doc for seccomp_set_mode_filter(), since we
can now return positive values?

 * Returns 0 on success or -EINVAL on failure.

(I realize this shouldn't say only -EINVAL, either.)

I have to say, I'm vaguely nervous about changing the semantics here
for passing back the fd as the return code from the seccomp() syscall.
Alternatives seem less appealing, though: changing the meaning of the
uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
example. Hmm.

>  out_free:
>         seccomp_filter_free(prepared);
>         return ret;
> @@ -911,6 +1085,7 @@ static long seccomp_get_action_avail(const char __user *uaction)
>         case SECCOMP_RET_KILL_THREAD:
>         case SECCOMP_RET_TRAP:
>         case SECCOMP_RET_ERRNO:
> +       case SECCOMP_RET_USER_NOTIF:
>         case SECCOMP_RET_TRACE:
>         case SECCOMP_RET_LOG:
>         case SECCOMP_RET_ALLOW:
> @@ -1111,6 +1286,7 @@ long seccomp_get_metadata(struct task_struct *task,
>  #define SECCOMP_RET_KILL_THREAD_NAME   "kill_thread"
>  #define SECCOMP_RET_TRAP_NAME          "trap"
>  #define SECCOMP_RET_ERRNO_NAME         "errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME    "user_notif"
>  #define SECCOMP_RET_TRACE_NAME         "trace"
>  #define SECCOMP_RET_LOG_NAME           "log"
>  #define SECCOMP_RET_ALLOW_NAME         "allow"
> @@ -1120,6 +1296,7 @@ static const char seccomp_actions_avail[] =
>                                 SECCOMP_RET_KILL_THREAD_NAME    " "
>                                 SECCOMP_RET_TRAP_NAME           " "
>                                 SECCOMP_RET_ERRNO_NAME          " "
> +                               SECCOMP_RET_USER_NOTIF_NAME     " "
>                                 SECCOMP_RET_TRACE_NAME          " "
>                                 SECCOMP_RET_LOG_NAME            " "
>                                 SECCOMP_RET_ALLOW_NAME;
> @@ -1134,6 +1311,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
>         { SECCOMP_LOG_KILL_THREAD, SECCOMP_RET_KILL_THREAD_NAME },
>         { SECCOMP_LOG_TRAP, SECCOMP_RET_TRAP_NAME },
>         { SECCOMP_LOG_ERRNO, SECCOMP_RET_ERRNO_NAME },
> +       { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
>         { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
>         { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
>         { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> @@ -1342,3 +1520,259 @@ static int __init seccomp_sysctl_init(void)
>  device_initcall(seccomp_sysctl_init)
>
>  #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_FILTER
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct seccomp_knotif *knotif;
> +
> +       mutex_lock(&filter->notify_lock);
> +
> +       /*
> +        * If this file is being closed because e.g. the task who owned it
> +        * died, let's wake everyone up who was waiting on us.
> +        */
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> +                       continue;
> +
> +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> +               knotif->error = -ENOSYS;
> +               knotif->val = 0;
> +
> +               complete(&knotif->ready);
> +       }
> +
> +       wake_up_all(&filter->notif->wqh);
> +       kfree(filter->notif);
> +       filter->notif = NULL;
> +       mutex_unlock(&filter->notify_lock);

It looks like that means nothing waiting on knotif->ready can access
filter->notif without rechecking it, yes?

e.g. in seccomp_do_user_notification() I see:

                        up(&match->notif->request);

I *think* this isn't reachable due to the test for n.state !=
SECCOMP_NOTIFY_REPLIED, though. Perhaps, just for sanity and because
it's not fast-path, we could add a WARN_ON() while checking for
unreplied signal death?

                n.signaled = true;
                if (n.state == SECCOMP_NOTIFY_SENT) {
                        n.state = SECCOMP_NOTIFY_INIT;
                        if (!WARN_ON(!match->notif))
                                up(&match->notif->request);
                }
                mutex_unlock(&match->notify_lock);


> +       __put_seccomp_filter(filter);
> +       return 0;
> +}
> +
> +static long seccomp_notify_recv(struct seccomp_filter *filter,
> +                               unsigned long arg)
> +{
> +       struct seccomp_knotif *knotif = NULL, *cur;
> +       struct seccomp_notif unotif = {};
> +       ssize_t ret;
> +       u16 size;
> +       void __user *buf = (void __user *)arg;

I'd prefer this casting happen in seccomp_notify_ioctl(). This keeps
anything from accidentally using "arg" directly here.

> +
> +       if (copy_from_user(&size, buf, sizeof(size)))
> +               return -EFAULT;
> +
> +       ret = down_interruptible(&filter->notif->request);
> +       if (ret < 0)
> +               return ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> +                       knotif = cur;
> +                       break;
> +               }
> +       }
> +
> +       /*
> +        * If we didn't find a notification, it could be that the task was
> +        * interrupted between the time we were woken and when we were able to
> +        * acquire the rw lock.
> +        */
> +       if (!knotif) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       size = min_t(size_t, size, sizeof(unotif));
> +

It is possible (though unlikely given the type widths involved here)
for unotif = {} to not initialize padding, so I would recommend an
explicit memset(&unotif, 0, sizeof(unotif)) here.
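
A rough userspace sketch of the concern, using a simplified stand-in
for struct seccomp_notif (the data member is left out and all names
here are made up): the members do not account for every byte of the
struct, and memset() is the way to be sure the remaining bytes are
zero before anything is copied out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct seccomp_notif; not the real layout. */
struct notif_sketch {
	uint16_t len;
	uint64_t id;
	uint32_t pid;
	uint8_t signaled;
};

/* Bytes of the struct that belong to no member, i.e. padding. */
static size_t notif_padding_bytes(void)
{
	return sizeof(struct notif_sketch) -
	       (sizeof(uint16_t) + sizeof(uint64_t) +
		sizeof(uint32_t) + sizeof(uint8_t));
}

/* Zero the whole object first, padding included, then set members. */
static void notif_fill(struct notif_sketch *n, uint64_t id, uint32_t pid)
{
	memset(n, 0, sizeof(*n));
	n->len = sizeof(*n);
	n->id = id;
	n->pid = pid;
}
```

The exact amount of padding depends on the ABI, so the sketch computes
it rather than hard-coding a number.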

> +       unotif.len = size;
> +       unotif.id = knotif->id;
> +       unotif.pid = task_pid_vnr(knotif->task);
> +       unotif.signaled = knotif->signaled;
> +       unotif.data = *(knotif->data);
> +
> +       if (copy_to_user(buf, &unotif, size)) {
> +               ret = -EFAULT;
> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_SENT;
> +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> +
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);

Is there some way to rearrange the locking here to avoid holding the
mutex while doing copy_to_user() (which userspace could block with
userfaultfd, and then stall all the other notifications for this
filter)?
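
One possible shape, as a userspace model only (pthread mutex instead of
the kernel mutex, memcpy() standing in for copy_to_user(), a NULL
buffer standing in for a faulting copy; every name below is invented):
claim the entry under the lock, drop the lock for the copy, and retake
it only to roll the state back if the copy failed.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

enum sk_state { SK_INIT, SK_SENT, SK_REPLIED };

struct sk_entry {
	uint64_t id;
	enum sk_state state;
	int in_use;
};

#define SK_MAX 4
static struct sk_entry sk_table[SK_MAX];
static pthread_mutex_t sk_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold sk_lock. */
static struct sk_entry *sk_find(uint64_t id)
{
	int i;

	for (i = 0; i < SK_MAX; i++)
		if (sk_table[i].in_use && sk_table[i].id == id)
			return &sk_table[i];
	return NULL;
}

static int sk_recv(uint64_t *user_buf)
{
	struct sk_entry *e = NULL;
	uint64_t id;
	int i;

	pthread_mutex_lock(&sk_lock);
	for (i = 0; i < SK_MAX; i++) {
		if (sk_table[i].in_use && sk_table[i].state == SK_INIT) {
			e = &sk_table[i];
			break;
		}
	}
	if (!e) {
		pthread_mutex_unlock(&sk_lock);
		return -ENOENT;
	}
	/* Claim the entry and snapshot what we need while locked. */
	id = e->id;
	e->state = SK_SENT;
	pthread_mutex_unlock(&sk_lock);

	/* The potentially blocking copy runs without the lock held. */
	if (user_buf) {
		memcpy(user_buf, &id, sizeof(id));
		return 0;
	}

	/* Copy failed: look the entry up again by id and un-claim it. */
	pthread_mutex_lock(&sk_lock);
	e = sk_find(id);
	if (e && e->state == SK_SENT)
		e->state = SK_INIT;
	pthread_mutex_unlock(&sk_lock);
	return -EFAULT;
}
```

The lookup by id after relocking matters because the entry may have
gone away while the lock was dropped; a real version would also have
to give the semaphore count back on the failure path.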

> +       return ret;
> +}
> +
> +static long seccomp_notify_send(struct seccomp_filter *filter,
> +                               unsigned long arg)
> +{
> +       struct seccomp_notif_resp resp = {};
> +       struct seccomp_knotif *knotif = NULL;
> +       long ret;
> +       u16 size;
> +       void __user *buf = (void __user *)arg;

Same cast note as above.

> +
> +       if (copy_from_user(&size, buf, sizeof(size)))
> +               return -EFAULT;
> +       size = min_t(size_t, size, sizeof(resp));
> +       if (copy_from_user(&resp, buf, size))
> +               return -EFAULT;

For sanity checking on a double-read from userspace, please add:

    if (resp.len != size)
        return -EINVAL;
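
Spelled out as a userspace model (memcpy() stands in for
copy_from_user(), the struct mirrors the selftest's copy of struct
seccomp_notif_resp, and the function name is made up):

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the selftest's struct seccomp_notif_resp. */
struct resp_sketch {
	uint16_t len;
	uint64_t id;
	int32_t error;
	int64_t val;
};

/*
 * ubuf must be readable for at least sizeof(struct resp_sketch) bytes.
 * Returns 0 on success, -EINVAL if the length seen by the second fetch
 * does not match the size that was actually copied.
 */
static int resp_copy_in(const unsigned char *ubuf, struct resp_sketch *out)
{
	uint16_t size;

	/* First fetch: only the length field. */
	memcpy(&size, ubuf, sizeof(size));
	if (size > sizeof(*out))
		size = sizeof(*out);

	/* Second fetch: the structure itself, len included. */
	memset(out, 0, sizeof(*out));
	memcpy(out, ubuf, size);

	if (out->len != size)
		return -EINVAL;
	return 0;
}
```

Note that in this sketch the check also rejects a len larger than the
structure, because size has been clamped by then; whether that is
wanted for future extension of the struct is worth deciding on purpose.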

> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->id == resp.id)
> +                       break;
> +       }
> +
> +       if (!knotif || knotif->id != resp.id) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       /* Allow exactly one reply. */
> +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> +               ret = -EINPROGRESS;
> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_REPLIED;
> +       knotif->error = resp.error;
> +       knotif->val = resp.val;
> +       complete(&knotif->ready);
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> +                                   unsigned long arg)
> +{
> +       struct seccomp_knotif *knotif = NULL;
> +       void __user *buf = (void __user *)arg;
> +       u64 id;
> +       long ret;
> +
> +       if (copy_from_user(&id, buf, sizeof(id)))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       ret = -1;

Isn't this EPERM? Shouldn't it be -ENOENT?
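
In other words, something along these lines (userspace sketch with a
plain array instead of the notification list; the function name is
invented):

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if id is present in ids[], -ENOENT otherwise. */
static long id_valid_sketch(const uint64_t *ids, size_t n, uint64_t id)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ids[i] == id)
			return 0;
	return -ENOENT;
}
```

Since EPERM is 1 on Linux, a bare "ret = -1" reads as -EPERM to the
caller. The selftest's check that the ioctl returns -1 should be
unaffected, as libc reports any error as -1 plus errno.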

> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->id == id) {
> +                       ret = 0;
> +                       goto out;
> +               }
> +       }
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> +                                unsigned long arg)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +
> +       switch (cmd) {
> +       case SECCOMP_NOTIF_RECV:
> +               return seccomp_notify_recv(filter, arg);
> +       case SECCOMP_NOTIF_SEND:
> +               return seccomp_notify_send(filter, arg);
> +       case SECCOMP_NOTIF_ID_VALID:
> +               return seccomp_notify_id_valid(filter, arg);
> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
> +static __poll_t seccomp_notify_poll(struct file *file,
> +                                   struct poll_table_struct *poll_tab)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       __poll_t ret = 0;
> +       struct seccomp_knotif *cur;
> +
> +       poll_wait(file, &filter->notif->wqh, poll_tab);
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT)
> +                       ret |= EPOLLIN | EPOLLRDNORM;
> +               if (cur->state == SECCOMP_NOTIFY_SENT)
> +                       ret |= EPOLLOUT | EPOLLWRNORM;
> +               if (ret & EPOLLIN && ret & EPOLLOUT)

My eyes! :) Can you wrap the bit operations in parens here?
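
Something like this, shown as a tiny userspace helper (POLLIN/POLLOUT
from <poll.h> in place of the EPOLL* constants; the helper name is made
up):

```c
#include <poll.h>

/* Returns 1 once both the readable and the writable bit are set. */
static int both_ready(short mask)
{
	return (mask & POLLIN) && (mask & POLLOUT);
}
```

The unparenthesized form parses the same way, since & binds tighter
than &&, so this is purely about readability.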

> +                       break;
> +       }

Should POLLERR be handled here too? I don't quite see which conditions
might surface it. For example, if all the processes using the filter
die, what does poll report here?

> +
> +       mutex_unlock(&filter->notify_lock);
> +
> +       return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +       .poll = seccomp_notify_poll,
> +       .release = seccomp_notify_release,
> +       .unlocked_ioctl = seccomp_notify_ioctl,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> +                                 struct seccomp_filter *filter)
> +{
> +       struct file *ret = ERR_PTR(-EBUSY);
> +       struct seccomp_filter *cur, *last_locked = NULL;
> +       int filter_nesting = 0;
> +
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> +               filter_nesting++;
> +               last_locked = cur;
> +               if (cur->notif)
> +                       goto out;
> +       }
> +
> +       ret = ERR_PTR(-ENOMEM);
> +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
> +       if (!filter->notif)
> +               goto out;
> +
> +       sema_init(&filter->notif->request, 0);
> +       INIT_LIST_HEAD(&filter->notif->notifications);
> +       filter->notif->next_id = get_random_u64();
> +       init_waitqueue_head(&filter->notif->wqh);
> +
> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +                                filter, O_RDWR);
> +       if (IS_ERR(ret))
> +               goto out;
> +
> +
> +       /* The file has a reference to it now */
> +       __get_seccomp_filter(filter);
> +
> +out:
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +               mutex_unlock(&cur->notify_lock);
> +               if (cur == last_locked)
> +                       break;
> +       }
> +
> +       return ret;
> +}
> +#endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index e1473234968d..5f4b836a6792 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -5,6 +5,7 @@
>   * Test code for seccomp bpf.
>   */
>
> +#define _GNU_SOURCE
>  #include <sys/types.h>
>
>  /*
> @@ -40,10 +41,12 @@
>  #include <sys/fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/times.h>
> +#include <sys/socket.h>
> +#include <sys/ioctl.h>
>
> -#define _GNU_SOURCE
>  #include <unistd.h>
>  #include <sys/syscall.h>
> +#include <poll.h>
>
>  #include "../kselftest_harness.h"
>
> @@ -154,6 +157,34 @@ struct seccomp_metadata {
>  };
>  #endif
>
> +#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
> +#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
> +
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> +
> +#define SECCOMP_IOC_MAGIC              0xF7
> +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> +                                       struct seccomp_notif)
> +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> +                                       struct seccomp_notif_resp)
> +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> +                                       __u64)
> +struct seccomp_notif {
> +       __u16 len;
> +       __u64 id;
> +       __u32 pid;
> +       __u8 signaled;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u16 len;
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -2077,7 +2108,8 @@ TEST(detect_seccomp_filter_flags)
>  {
>         unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
>                                  SECCOMP_FILTER_FLAG_LOG,
> -                                SECCOMP_FILTER_FLAG_SPEC_ALLOW };
> +                                SECCOMP_FILTER_FLAG_SPEC_ALLOW,
> +                                SECCOMP_FILTER_FLAG_NEW_LISTENER };
>         unsigned int flag, all_flags;
>         int i;
>         long ret;
> @@ -2933,6 +2965,383 @@ TEST(get_metadata)
>         ASSERT_EQ(0, kill(pid, SIGKILL));
>  }
>
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                       offsetof(struct seccomp_data, nr)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +static int read_notif(int listener, struct seccomp_notif *req)
> +{
> +       int ret;
> +
> +       do {
> +               errno = 0;
> +               req->len = sizeof(*req);
> +               ret = ioctl(listener, SECCOMP_NOTIF_RECV, req);
> +       } while (ret == -1 && errno == ENOENT);
> +       return ret;
> +}
> +
> +static void signal_handler(int signal)
> +{
> +}
> +
> +#define USER_NOTIF_MAGIC 116983961184613L
> +TEST(get_user_notification_syscall)
> +{
> +       pid_t pid;
> +       long ret;
> +       int status, listener;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +       struct pollfd pollfd;
> +
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       /* Check that we get -ENOSYS with no listener attached */
> +       if (pid == 0) {
> +               if (user_trap_syscall(__NR_getpid, 0) < 0)
> +                       exit(1);
> +               ret = syscall(__NR_getpid);
> +               exit(ret >= 0 || errno != ENOSYS);
> +       }
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /* Add some no-op filters so that we (don't) trigger lockdep. */
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +
> +       /* Check that the basic notification machinery works */
> +       listener = user_trap_syscall(__NR_getpid,
> +                                    SECCOMP_FILTER_FLAG_NEW_LISTENER);
> +       EXPECT_GE(listener, 0);
> +
> +       /* Installing a second listener in the chain should EBUSY */
> +       EXPECT_EQ(user_trap_syscall(__NR_getpid,
> +                                   SECCOMP_FILTER_FLAG_NEW_LISTENER),
> +                 -1);
> +       EXPECT_EQ(errno, EBUSY);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       pollfd.fd = listener;
> +       pollfd.events = POLLIN | POLLOUT;
> +
> +       EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +       EXPECT_EQ(pollfd.revents, POLLIN);
> +
> +       req.len = sizeof(req);
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +
> +       pollfd.fd = listener;
> +       pollfd.events = POLLIN | POLLOUT;
> +
> +       EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +       EXPECT_EQ(pollfd.revents, POLLOUT);
> +
> +       EXPECT_EQ(req.data.nr,  __NR_getpid);
> +
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /*
> +        * Check that nothing bad happens when we kill the task in the middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), 0);
> +
> +       EXPECT_EQ(kill(pid, SIGKILL), 0);
> +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), -1);

Please document SECCOMP_NOTIF_ID_VALID in seccomp_filter.rst. I had
been wondering what it's for, and now I see it's kind of an advisory
"is the other end still alive?" test.

> +
> +       resp.id = req.id;
> +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> +       EXPECT_EQ(ret, -1);
> +       EXPECT_EQ(errno, ENOENT);
> +
> +       /*
> +        * Check that we get another notification about a signal in the middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> +                       perror("signal");
> +                       exit(1);
> +               }
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       EXPECT_EQ(kill(pid, SIGUSR1), 0);
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(req.signaled, 1);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = -512; /* -ERESTARTSYS */
> +       resp.val = 0;
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +       ret = read_notif(listener, &req);
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);

I was slightly confused here: why have there been 3 reads? I was
expecting one notification for hitting getpid and one from catching a
signal. But on rereading, I see that NOTIF_RECV returns the most
recent notification that has not been answered yet, yes?

But... catching a signal replaces the existing seccomp_knotif? I
remain confused about how signal handling is meant to work here. What
happens if two signals get sent? It looks like you just block without
allowing more signals? (Thank you for writing the tests!)

(And can you document the expected behavior in the seccomp_filter.rst too?)
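
To check my own understanding, here is how I read the state machine
from the comment in struct seccomp_knotif and from
seccomp_do_user_notification(), written as a small userspace model.
All names are invented, and the "only the first signal re-queues" rule
is my reading of the switch to wait_for_completion_killable(), not
something the patch states.

```c
#include <stdbool.h>

enum notify_state_sketch { NS_INIT, NS_SENT, NS_REPLIED };

struct knotif_sketch {
	enum notify_state_sketch state;
	bool signaled;
};

/* Listener reads the notification: INIT -> SENT. */
static bool ks_recv(struct knotif_sketch *k)
{
	if (k->state != NS_INIT)
		return false;
	k->state = NS_SENT;
	return true;
}

/*
 * Task is interrupted while waiting. Only the first signal has an
 * effect: it sets signaled and, if the notification was already read,
 * moves it back to INIT so the listener sees it again.
 */
static void ks_signal(struct knotif_sketch *k)
{
	if (k->state == NS_REPLIED || k->signaled)
		return;
	k->signaled = true;
	if (k->state == NS_SENT)
		k->state = NS_INIT;
}

/* Listener replies: only allowed from SENT. */
static bool ks_reply(struct knotif_sketch *k)
{
	if (k->state != NS_SENT)
		return false;
	k->state = NS_REPLIED;
	return true;
}
```

If that model is right, the first two reads in the test are the same
notification (before and after SIGUSR1), and the third read is a fresh
one created when getpid is restarted after the -ERESTARTSYS reply.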

> +       EXPECT_EQ(ret, sizeof(resp));
> +       EXPECT_EQ(errno, 0);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /*
> +        * Check that we get an ENOSYS when the listener is closed.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +       if (pid == 0) {
> +               close(listener);
> +               ret = syscall(__NR_getpid);
> +               exit(ret != -1 && errno != ENOSYS);
> +       }
> +
> +       close(listener);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
> +/*
> + * Check that a pid in a child namespace still shows up as valid in ours.
> + */
> +TEST(user_notification_child_pid_ns)
> +{
> +       pid_t pid;
> +       int status, listener;
> +       int sk_pair[2];
> +       char c;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +
> +       ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +       ASSERT_EQ(unshare(CLONE_NEWPID), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +               /* Signal we're ready and have installed the filter. */
> +               EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
> +
> +               EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +               EXPECT_EQ(c, 'H');
> +
> +               exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
> +       }
> +
> +       EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> +       EXPECT_EQ(c, 'J');
> +
> +       EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
> +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +       listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> +       EXPECT_GE(listener, 0);
> +       EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
> +
> +       /* Now signal we are done and respond with magic */
> +       EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +
> +       req.len = sizeof(req);
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +       EXPECT_EQ(req.pid, pid);
> +
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +       close(listener);
> +}
> +
> +/*
> + * Check that a pid in a sibling (i.e. unrelated) namespace shows up as 0, i.e.
> + * invalid.
> + */
> +TEST(user_notification_sibling_pid_ns)
> +{
> +       pid_t pid, pid2;
> +       int status, listener;
> +       int sk_pair[2];
> +       char c;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +
> +       ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               int child_pair[2];
> +
> +               ASSERT_EQ(unshare(CLONE_NEWPID), 0);
> +
> +               ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, child_pair), 0);
> +
> +               pid2 = fork();
> +               ASSERT_GE(pid2, 0);
> +
> +               if (pid2 == 0) {
> +                       close(child_pair[0]);
> +                       EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +                       /* Signal we're ready and have installed the filter. */
> +                       EXPECT_EQ(write(child_pair[1], "J", 1), 1);
> +
> +                       EXPECT_EQ(read(child_pair[1], &c, 1), 1);
> +                       EXPECT_EQ(c, 'H');
> +
> +                       exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
> +               }
> +
> +               /* check that child has installed the filter */
> +               EXPECT_EQ(read(child_pair[0], &c, 1), 1);
> +               EXPECT_EQ(c, 'J');
> +
> +               /* tell parent who child is */
> +               EXPECT_EQ(write(sk_pair[1], &pid2, sizeof(pid2)), sizeof(pid2));
> +
> +               /* parent has installed listener, tell child to call syscall */
> +               EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +               EXPECT_EQ(c, 'H');
> +               EXPECT_EQ(write(child_pair[0], "H", 1), 1);
> +
> +               EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
> +               EXPECT_EQ(true, WIFEXITED(status));
> +               EXPECT_EQ(0, WEXITSTATUS(status));
> +               exit(WEXITSTATUS(status));
> +       }
> +
> +       EXPECT_EQ(read(sk_pair[0], &pid2, sizeof(pid2)), sizeof(pid2));
> +
> +       EXPECT_EQ(ptrace(PTRACE_ATTACH, pid2), 0);
> +       EXPECT_EQ(waitpid(pid2, NULL, 0), pid2);
> +       listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid2, 0);
> +       EXPECT_GE(listener, 0);
> +       EXPECT_EQ(errno, 0);
> +       EXPECT_EQ(ptrace(PTRACE_DETACH, pid2, NULL, 0), 0);
> +
> +       /* Create the sibling ns, and sibling in it. */
> +       EXPECT_EQ(unshare(CLONE_NEWPID), 0);
> +       EXPECT_EQ(errno, 0);
> +
> +       pid2 = fork();
> +       EXPECT_GE(pid2, 0);
> +
> +       if (pid2 == 0) {
> +               req.len = sizeof(req);
> +               ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +               /*
> +                * The pid should be 0, i.e. the task is in some namespace that
> +                * we can't "see".
> +                */
> +               ASSERT_EQ(req.pid, 0);
> +
> +               resp.len = sizeof(resp);
> +               resp.id = req.id;
> +               resp.error = 0;
> +               resp.val = USER_NOTIF_MAGIC;
> +
> +               ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +               exit(0);
> +       }
> +
> +       close(listener);
> +
> +       /* Now signal we are done setting up sibling listener. */
> +       EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
> +
>  /*
>   * TODO:
>   * - add microbenchmarks
> --
> 2.17.1
>

Looking good!

-Kees

-- 
Kees Cook
Pixel Security


* Re: [PATCH v7 2/6] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  2018-09-27 15:11 ` [PATCH v7 2/6] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
  2018-09-27 16:51   ` Jann Horn
@ 2018-09-27 21:42   ` Kees Cook
  2018-10-08 13:55   ` Christian Brauner
  2 siblings, 0 replies; 91+ messages in thread
From: Kees Cook @ 2018-09-27 21:42 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> In the next commit we'll use this same mnemonic to get a listener for the
> nth filter, so we need it available outside of CHECKPOINT_RESTORE in the
> USER_NOTIFICATION case as well.
>
> v2: new in v2
> v3: no changes
> v4: no changes
> v5: switch to CHECKPOINT_RESTORE || USER_NOTIFICATION to avoid warning when
>     only CONFIG_SECCOMP_FILTER is enabled.
> v7: drop USER_NOTIFICATION bits
>
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  kernel/seccomp.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index fa6fe9756c80..44a31ac8373a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1158,7 +1158,7 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
>         return do_seccomp(op, 0, uargs);
>  }
>
> -#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> +#if defined(CONFIG_SECCOMP_FILTER)
>  static struct seccomp_filter *get_nth_filter(struct task_struct *task,
>                                              unsigned long filter_off)
>  {
> @@ -1205,6 +1205,7 @@ static struct seccomp_filter *get_nth_filter(struct task_struct *task,
>         return filter;
>  }
>
> +#if defined(CONFIG_CHECKPOINT_RESTORE)
>  long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
>                         void __user *data)
>  {
> @@ -1277,7 +1278,8 @@ long seccomp_get_metadata(struct task_struct *task,
>         __put_seccomp_filter(filter);
>         return ret;
>  }
> -#endif
> +#endif /* CONFIG_CHECKPOINT_RESTORE */
> +#endif /* CONFIG_SECCOMP_FILTER */
>
>  #ifdef CONFIG_SYSCTL

Yup, looks fine.

-Kees

-- 
Kees Cook
Pixel Security


* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 15:11 ` [PATCH v7 1/6] seccomp: add a return code to " Tycho Andersen
  2018-09-27 21:31   ` Kees Cook
@ 2018-09-27 21:51   ` Jann Horn
  2018-09-27 22:45     ` Kees Cook
  2018-09-27 23:04     ` Tycho Andersen
  2018-09-29  0:28   ` Aleksa Sarai
  2 siblings, 2 replies; 91+ messages in thread
From: Jann Horn @ 2018-09-27 21:51 UTC (permalink / raw)
  To: Tycho Andersen, hch, Al Viro, linux-fsdevel
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro

+Christoph Hellwig, Al Viro, fsdevel: For two questions about the poll
interface (search for "seccomp_notify_poll" and
"seccomp_notify_release" in the patch)

@Tycho: FYI, I've gone through all of v7 now, apart from the
test/sample code. So don't wait for more comments from me before
sending out v8.

On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.

Note that in that case, the trusted runtime needs to be in the same
mount namespace as the container. mount() doesn't work on the mount
structure of a foreign mount namespace; check_mnt() specifically
checks for this case, and I think pretty much everything in
sys_mount() uses that check. So you'd have to join the container's
mount namespace before forwarding a mount syscall.

> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
>
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here,

Actually, it doesn't, right? It would apply if you told the kernel "go
ahead, that syscall is fine", but that's not how the API works - you
always intercept the syscall, copy argument data to a trusted tracer,
and then the tracer can make a replacement syscall. Sounds fine to me.

> but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
[...]
> diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
[...]
> +which (on success) will return a listener fd for the filter, which can then be
> +passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
> +acquired via:
> +
> +.. code-block::
> +
> +    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);

The manpage documents ptrace() as taking four arguments, not three. I
know that the header defines it with varargs, but it would probably be
more useful to require passing in zero as the fourth argument so that
we have a place to stick flags if necessary in the future.

> +which grabs the 0th filter for some task which the tracer has privilege over.
> +Note that filter fds correspond to a particular filter, and not a particular
> +task. So if this task then forks, notifications from both tasks will appear on
> +the same filter fd. Reads and writes to/from a filter fd are also synchronized,
> +so a filter fd can safely have many readers.

Add a note about needing CAP_SYS_ADMIN here? Also, might be useful to
clarify in which direction "nth filter" counts.

> +The interface for a seccomp notification fd consists of two structures:
> +
> +.. code-block::
> +
> +    struct seccomp_notif {
> +        __u16 len;
> +        __u64 id;
> +        pid_t pid;
> +        __u8 signalled;
> +        struct seccomp_data data;
> +    };
> +
> +    struct seccomp_notif_resp {
> +        __u16 len;
> +        __u64 id;
> +        __s32 error;
> +        __s64 val;
> +    };
> +
> +Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
> +notification fd to receive a ``struct seccomp_notif``, which contains five
> +members: the input length of the structure, a unique-per-filter ``id``, the
> +``pid`` of the task which triggered this request (which may be 0 if the task is
> +in a pid ns not visible from the listener's pid namespace), a flag representing
> +whether or not the notification is a result of a non-fatal signal, and the
> +``data`` passed to seccomp. Userspace can then make a decision based on this
> +information about what to do, and ``ioctl(SECCOMP_NOTIF_SEND)`` a response,
> +indicating what should be returned to userspace. The ``id`` member of ``struct
> +seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.
> +
> +It is worth noting that ``struct seccomp_data`` contains the values of register
> +arguments to the syscall, but does not contain pointers to memory. The task's
> +memory is accessible to suitably privileged traces via ``ptrace()`` or
> +``/proc/pid/map_files/``.

You probably don't actually want to use /proc/pid/map_files here; you
can't use that to access anonymous memory, and it needs CAP_SYS_ADMIN.
And while reading memory via ptrace() is possible, the interface is
really ugly (e.g. you can only read data in 4-byte chunks), and your
caveat about locking out other ptracers (or getting locked out by
them) applies. I'm not even sure if you could read memory via ptrace
while a process is stopped in the seccomp logic? PTRACE_PEEKDATA
requires the target to be in a __TASK_TRACED state.
The two interfaces you might want to use instead are /proc/$pid/mem
and process_vm_{readv,writev}, which allow you to do nice,
arbitrarily-sized, vectored IO on the memory of another process.

> However, care should be taken to avoid the TOCTOU
> +mentioned above in this document: all arguments being read from the tracee's
> +memory should be read into the tracer's memory before any policy decisions are
> +made. This allows for an atomic decision on syscall arguments.

Again, I don't really see how you could get this wrong.
[...]
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
[...]
>  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
>  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> @@ -60,4 +62,29 @@ struct seccomp_data {
>         __u64 args[6];
>  };
>
> +struct seccomp_notif {
> +       __u16 len;
> +       __u64 id;
> +       __u32 pid;
> +       __u8 signaled;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u16 len;
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};
> +
> +#define SECCOMP_IOC_MAGIC              0xF7
> +
> +/* Flags for seccomp notification fd ioctl. */
> +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> +                                       struct seccomp_notif)
> +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> +                                       struct seccomp_notif_resp)
> +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> +                                       __u64)
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index fd023ac24e10..fa6fe9756c80 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -33,12 +33,78 @@
>  #endif
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/pid.h>
>  #include <linux/ptrace.h>
>  #include <linux/security.h>
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +       SECCOMP_NOTIFY_INIT,
> +       SECCOMP_NOTIFY_SENT,
> +       SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +       /* The struct pid of the task whose filter triggered the notification */
> +       struct task_struct *task;
> +
> +       /* The "cookie" for this request; this is unique for this filter. */
> +       u64 id;
> +
> +       /* Whether or not this task has been given an interruptible signal. */
> +       bool signaled;
> +
> +       /*
> +        * The seccomp data. This pointer is valid the entire time this
> +        * notification is active, since it comes from __seccomp_filter which
> +        * eclipses the entire lifecycle here.
> +        */
> +       const struct seccomp_data *data;
> +
> +       /*
> +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +        * struct seccomp_knotif is created and starts out in INIT. Once the
> +        * handler reads the notification off of an FD, it transitions to SENT.
> +        * If a signal is received the state transitions back to INIT and
> +        * another message is sent. When the userspace handler replies, state
> +        * transitions to REPLIED.
> +        */
> +       enum notify_state state;
> +
> +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +       int error;
> +       long val;
> +
> +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +       struct completion ready;
> +
> +       struct list_head list;
> +};
> +
> +/**
> + * struct notification - container for seccomp userspace notifications. Since
> + * most seccomp filters will not have notification listeners attached and this
> + * structure is fairly large, we store the notification-specific stuff in a
> + * separate structure.
> + *
> + * @request: A semaphore that users of this notification can wait on for
> + *           changes. Actual reads and writes are still controlled with
> + *           filter->notify_lock.
> + * @notify_lock: A lock for all notification-related accesses.

notify_lock is documented here, but is a member of struct
seccomp_filter, not of struct notification.

> + * @next_id: The id of the next request.
> + * @notifications: A list of struct seccomp_knotif elements.
> + * @wqh: A wait queue for poll.
> + */
> +struct notification {
> +       struct semaphore request;
> +       u64 next_id;
> +       struct list_head notifications;
> +       wait_queue_head_t wqh;
> +};
>
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
> @@ -66,6 +132,8 @@ struct seccomp_filter {
>         bool log;
>         struct seccomp_filter *prev;
>         struct bpf_prog *prog;
> +       struct notification *notif;
> +       struct mutex notify_lock;
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>         if (!sfilter)
>                 return ERR_PTR(-ENOMEM);
>
> +       mutex_init(&sfilter->notify_lock);
>         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>                                         seccomp_check_filter, save_orig);
>         if (ret < 0) {
[...]
> @@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
>  #else
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> +       /* Note: overflow is ok here, the id just needs to be unique */
> +       return filter->notif->next_id++;
> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       int err;
> +       long ret = 0;
> +       struct seccomp_knotif n = {};
> +
> +       mutex_lock(&match->notify_lock);
> +       err = -ENOSYS;
> +       if (!match->notif)
> +               goto out;
> +
> +       n.task = current;
> +       n.state = SECCOMP_NOTIFY_INIT;
> +       n.data = sd;
> +       n.id = seccomp_next_notify_id(match);
> +       init_completion(&n.ready);
> +
> +       list_add(&n.list, &match->notif->notifications);
> +       wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> +
> +       mutex_unlock(&match->notify_lock);
> +       up(&match->notif->request);
> +
> +       err = wait_for_completion_interruptible(&n.ready);
> +       mutex_lock(&match->notify_lock);
> +
> +       /*
> +        * Here it's possible we got a signal and then had to wait on the mutex
> +        * while the reply was sent, so let's be sure there wasn't a response
> +        * in the meantime.
> +        */
> +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +               /*
> +                * We got a signal. Let's tell userspace about it (potentially
> +                * again, if we had already notified them about the first one).
> +                */
> +               n.signaled = true;
> +               if (n.state == SECCOMP_NOTIFY_SENT) {
> +                       n.state = SECCOMP_NOTIFY_INIT;
> +                       up(&match->notif->request);
> +               }

Do you need another wake_up_poll() here?

> +               mutex_unlock(&match->notify_lock);
> +               err = wait_for_completion_killable(&n.ready);
> +               mutex_lock(&match->notify_lock);
> +               if (err < 0)
> +                       goto remove_list;

Add a comment here explaining that we intentionally leave the
semaphore count too high (because otherwise we'd have to block), and
seccomp_notify_recv() compensates for that?
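
To make the bookkeeping concrete, here is a single-threaded userspace model
of the sequence described above. It is a sketch only: the *_model names are
hypothetical, the semaphore is a plain counter, and the notification list
is reduced to one slot. It shows how a fatal signal can leave the count one
too high, so that a later receive wakes up and finds nothing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum notify_state { NOTIFY_INIT, NOTIFY_SENT, NOTIFY_REPLIED };

struct knotif_model {
	enum notify_state state;
};

struct listener_model {
	int sem;			/* models notif->request */
	struct knotif_model *pending;	/* models a one-entry notification list */
};

static void model_up(struct listener_model *l)
{
	l->sem++;
}

static int model_recv(struct listener_model *l)
{
	if (l->sem == 0)
		return -EAGAIN;		/* the real code would sleep here */
	l->sem--;

	if (l->pending && l->pending->state == NOTIFY_INIT) {
		l->pending->state = NOTIFY_SENT;
		return 0;
	}
	return -ENOENT;			/* count was too high: nothing to hand out */
}

static int model_fatal_signal_scenario(void)
{
	struct listener_model l = { 0, NULL };
	struct knotif_model n = { NOTIFY_INIT };
	int ret;

	l.pending = &n;			/* task queues the notification */
	model_up(&l);

	ret = model_recv(&l);		/* handler reads it: INIT -> SENT */
	assert(ret == 0 && n.state == NOTIFY_SENT);

	n.state = NOTIFY_INIT;		/* non-fatal signal: re-queue, up() again */
	model_up(&l);

	l.pending = NULL;		/* fatal signal: list_del() without a down() */

	return model_recv(&l);		/* handler wakes up but finds nothing */
}
```

In this model the final receive consumes the stale count and reports
-ENOENT, which is the compensation the comment should spell out.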

> +       }
> +
> +       ret = n.val;
> +       err = n.error;
> +
> +remove_list:
> +       list_del(&n.list);
> +out:
> +       mutex_unlock(&match->notify_lock);
> +       syscall_set_return_value(current, task_pt_regs(current),
> +                                err, ret);
> +}
> +
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>                             const bool recheck_after_trace)
>  {
[...]
>  #ifdef CONFIG_SECCOMP_FILTER
> +static struct file *init_listener(struct task_struct *,
> +                                 struct seccomp_filter *);
> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>         struct seccomp_filter *prepared = NULL;
>         long ret = -EINVAL;
> +       int listener = 0;
> +       struct file *listener_f = NULL;
>
>         /* Validate flags. */
>         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         if (IS_ERR(prepared))
>                 return PTR_ERR(prepared);
>
> +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +               listener = get_unused_fd_flags(0);
> +               if (listener < 0) {
> +                       ret = listener;
> +                       goto out_free;
> +               }
> +
> +               listener_f = init_listener(current, prepared);
> +               if (IS_ERR(listener_f)) {
> +                       put_unused_fd(listener);
> +                       ret = PTR_ERR(listener_f);
> +                       goto out_free;
> +               }
> +       }
> +
>         /*
>          * Make sure we cannot change seccomp or nnp state via TSYNC
>          * while another thread is in the middle of calling exec.
>          */
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>             mutex_lock_killable(&current->signal->cred_guard_mutex))
> -               goto out_free;
> +               goto out_put_fd;
>
>         spin_lock_irq(&current->sighand->siglock);
>
> @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         spin_unlock_irq(&current->sighand->siglock);
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>                 mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +               if (ret < 0) {
> +                       fput(listener_f);
> +                       put_unused_fd(listener);
> +               } else {
> +                       fd_install(listener, listener_f);
> +                       ret = listener;
> +               }
> +       }
>  out_free:
>         seccomp_filter_free(prepared);
>         return ret;
[...]
> +
> +#ifdef CONFIG_SECCOMP_FILTER
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct seccomp_knotif *knotif;
> +
> +       mutex_lock(&filter->notify_lock);
> +
> +       /*
> +        * If this file is being closed because e.g. the task who owned it
> +        * died, let's wake everyone up who was waiting on us.
> +        */
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> +                       continue;
> +
> +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> +               knotif->error = -ENOSYS;
> +               knotif->val = 0;
> +
> +               complete(&knotif->ready);
> +       }
> +
> +       wake_up_all(&filter->notif->wqh);

If select() is polling us, a reference to the open file is being held,
and this can't be reached; and I think if epoll is polling us,
eventpoll_release() will remove itself from the wait queue, right? So
can this wake_up_all() actually ever notify anyone?

> +       kfree(filter->notif);
> +       filter->notif = NULL;
> +       mutex_unlock(&filter->notify_lock);
> +       __put_seccomp_filter(filter);
> +       return 0;
> +}
> +
> +static long seccomp_notify_recv(struct seccomp_filter *filter,
> +                               unsigned long arg)
> +{
> +       struct seccomp_knotif *knotif = NULL, *cur;
> +       struct seccomp_notif unotif = {};
> +       ssize_t ret;
> +       u16 size;
> +       void __user *buf = (void __user *)arg;
> +
> +       if (copy_from_user(&size, buf, sizeof(size)))
> +               return -EFAULT;
> +
> +       ret = down_interruptible(&filter->notif->request);
> +       if (ret < 0)
> +               return ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> +                       knotif = cur;
> +                       break;
> +               }
> +       }
> +
> +       /*
> +        * If we didn't find a notification, it could be that the task was
> +        * interrupted between the time we were woken and when we were able to

s/interrupted/interrupted by a fatal signal/ ?

> +        * acquire the rw lock.

State more explicitly here that we are compensating for an incorrectly
high semaphore count?

> +        */
> +       if (!knotif) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       size = min_t(size_t, size, sizeof(unotif));
> +
> +       unotif.len = size;
> +       unotif.id = knotif->id;
> +       unotif.pid = task_pid_vnr(knotif->task);
> +       unotif.signaled = knotif->signaled;
> +       unotif.data = *(knotif->data);
> +
> +       if (copy_to_user(buf, &unotif, size)) {
> +               ret = -EFAULT;
> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_SENT;
> +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> +
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static long seccomp_notify_send(struct seccomp_filter *filter,
> +                               unsigned long arg)
> +{
> +       struct seccomp_notif_resp resp = {};
> +       struct seccomp_knotif *knotif = NULL;
> +       long ret;
> +       u16 size;
> +       void __user *buf = (void __user *)arg;
> +
> +       if (copy_from_user(&size, buf, sizeof(size)))
> +               return -EFAULT;
> +       size = min_t(size_t, size, sizeof(resp));
> +       if (copy_from_user(&resp, buf, size))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->id == resp.id)
> +                       break;
> +       }
> +
> +       if (!knotif || knotif->id != resp.id) {

Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
NULL here. If `filter->notif->notifications` is empty, I think
`knotif` will be `container_of(&filter->notif->notifications, struct
seccomp_knotif, list)` - in other words, you'll have a type confusion,
and `knotif` probably points into some random memory in front of
`filter->notif`.

Am I missing something?
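
For illustration, a userspace sketch of the iteration behaviour in question.
The list helpers below are simplified stand-ins for <linux/list.h>, and
struct knotif is a hypothetical cut-down seccomp_knotif. The sketch shows
that the cursor is never NULL once the loop ends, and one common way around
that: assign a separate result pointer only when an entry actually matches.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's list primitives. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)				\
	for (pos = container_of((head)->next, __typeof__(*pos), member);	\
	     &pos->member != (head);					\
	     pos = container_of(pos->member.next, __typeof__(*pos), member))

struct knotif {			/* hypothetical, cut-down seccomp_knotif */
	unsigned long long id;
	struct list_head list;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	assert(entry && head);
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Safe lookup: only a matching entry is ever assigned to the result. */
static struct knotif *find_knotif(struct list_head *head,
				  unsigned long long id)
{
	struct knotif *cur, *found = NULL;

	list_for_each_entry(cur, head, list) {
		if (cur->id == id) {
			found = cur;
			break;
		}
	}
	return found;
}

/*
 * Always returns 1: after the loop the cursor either points at a match or
 * at the bogus "entry" computed from the list head itself, never at NULL.
 * The cursor is deliberately not dereferenced here.
 */
static int cursor_is_non_null(struct list_head *head, unsigned long long id)
{
	struct knotif *cur;

	list_for_each_entry(cur, head, list) {
		if (cur->id == id)
			break;
	}
	return cur != NULL;
}
```

With a lookup shaped like find_knotif(), the "not found" case reduces to a
plain NULL check and no id comparison on the cursor is needed.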

> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       /* Allow exactly one reply. */
> +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> +               ret = -EINPROGRESS;
> +               goto out;
> +       }

This means that if seccomp_do_user_notification() has in the meantime
received a signal and transitioned from SENT back to INIT, this will
fail, right? So we fail here, then we read the new notification, and
then we can retry SECCOMP_NOTIF_SEND? Is that intended?

> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_REPLIED;
> +       knotif->error = resp.error;
> +       knotif->val = resp.val;
> +       complete(&knotif->ready);
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> +                                   unsigned long arg)
> +{
> +       struct seccomp_knotif *knotif = NULL;
> +       void __user *buf = (void __user *)arg;
> +       u64 id;
> +       long ret;
> +
> +       if (copy_from_user(&id, buf, sizeof(id)))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       ret = -1;

In strace, this is going to show up as EPERM. Maybe use something like
-ENOENT instead? Or whatever you think resembles a fitting error
number.
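
A small sketch of why a bare -1 reads as EPERM. Return values in the range
-4095..-1 are reported to userspace as errno = -ret, and EPERM is 1 on
Linux. Both function names below are hypothetical and only model that
convention.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a handler that reports "not found" as -1. */
static long id_valid_bare_minus_one(void)
{
	return -1;
}

/* Decode a kernel-style negative return value into the errno it implies. */
static int errno_from_ret(long ret)
{
	assert(ret < 0 && ret >= -4095);
	return (int)-ret;
}
```

Here errno_from_ret(id_valid_bare_minus_one()) yields EPERM, while
returning -ENOENT would decode to ENOENT as intended.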

> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->id == id) {
> +                       ret = 0;

Would it make sense to treat notifications that have already been
replied to as invalid?

> +                       goto out;
> +               }
> +       }
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
[...]
> +static __poll_t seccomp_notify_poll(struct file *file,
> +                                   struct poll_table_struct *poll_tab)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       __poll_t ret = 0;
> +       struct seccomp_knotif *cur;
> +
> +       poll_wait(file, &filter->notif->wqh, poll_tab);
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;

Looking at the callers of vfs_poll(), as far as I can tell, a poll
handler is not allowed to return error codes. Perhaps someone who
knows the poll interface better can weigh in here. I've CCed some
people who should hopefully know better how this stuff works.
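
To illustrate the concern: `__poll_t` is a bitmask of ready events, so a negative errno returned through it would be read as a set of event bits. The userspace sketch below reinterprets -EINTR (which mutex_lock_interruptible() can return) as a mask and tests it against the <poll.h> constants. `errno_as_poll_mask()` is a hypothetical helper; the bit values assumed are the Linux ones.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>

/*
 * Userspace model: reinterpret a negative errno as an event mask, the way
 * a caller would if a poll handler passed it back as its return value.
 */
static unsigned int errno_as_poll_mask(int err)
{
	assert(err < 0);	/* only negative errno values are of interest */
	return (unsigned int)err;
}
```

With EINTR == 4 the mask is 0xfffffffc, so POLLOUT, POLLERR and POLLHUP all appear set while POLLIN does not. One possible way out, offered only as a suggestion, would be to return an explicit error event such as EPOLLERR when the lock cannot be taken.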

> +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT)
> +                       ret |= EPOLLIN | EPOLLRDNORM;
> +               if (cur->state == SECCOMP_NOTIFY_SENT)
> +                       ret |= EPOLLOUT | EPOLLWRNORM;
> +               if (ret & EPOLLIN && ret & EPOLLOUT)
> +                       break;
> +       }
> +
> +       mutex_unlock(&filter->notify_lock);
> +
> +       return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +       .poll = seccomp_notify_poll,
> +       .release = seccomp_notify_release,
> +       .unlocked_ioctl = seccomp_notify_ioctl,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> +                                 struct seccomp_filter *filter)
> +{

Why does this function take a `task` pointer instead of always
accessing `current`? If `task` actually wasn't `current`, I would have
concurrency concerns. A comment in seccomp.h even explains:

 *          @filter must only be accessed from the context of current as there
 *          is no read locking.

Unless there's a good reason for it, I would prefer it if this
function didn't take a `task` pointer.

> +       struct file *ret = ERR_PTR(-EBUSY);
> +       struct seccomp_filter *cur, *last_locked = NULL;
> +       int filter_nesting = 0;
> +
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> +               filter_nesting++;
> +               last_locked = cur;
> +               if (cur->notif)
> +                       goto out;
> +       }
> +
> +       ret = ERR_PTR(-ENOMEM);
> +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);

sizeof(struct notification) instead, to make the code clearer?

> +       if (!filter->notif)
> +               goto out;
> +
> +       sema_init(&filter->notif->request, 0);
> +       INIT_LIST_HEAD(&filter->notif->notifications);
> +       filter->notif->next_id = get_random_u64();
> +       init_waitqueue_head(&filter->notif->wqh);

Nit: next_id and notifications are declared in reverse order in the
struct. Could you flip them around here?

> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +                                filter, O_RDWR);
> +       if (IS_ERR(ret))
> +               goto out;
> +
> +
> +       /* The file has a reference to it now */
> +       __get_seccomp_filter(filter);

__get_seccomp_filter() has a comment in it that claims "/* Reference
count is bounded by the number of total processes. */". I think this
change invalidates that comment. I think it should be fine to just
remove the comment.

> +out:
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {

s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
here, something went very wrong.
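
The locking pattern under discussion can be modelled in userspace with flags instead of mutexes. `struct model_filter`, `lock_chain()` and `unlock_chain()` are hypothetical; the early return for an empty chain is an addition of this sketch, not something in the patch. The unlock loop terminates on `last_locked` only, and an assert stands in for "something went very wrong" if it ever runs off the end of the list:

```c
#include <assert.h>
#include <stddef.h>

struct model_filter {
	struct model_filter *prev;
	int locked;	/* stands in for notify_lock being held */
	int has_notif;	/* stands in for filter->notif != NULL */
};

/*
 * Lock from the newest filter toward the root, stopping early at the first
 * filter that already has a listener. Returns the last filter locked.
 */
static struct model_filter *lock_chain(struct model_filter *newest, int *busy)
{
	struct model_filter *cur, *last_locked = NULL;

	*busy = 0;
	for (cur = newest; cur; cur = cur->prev) {
		cur->locked = 1;
		last_locked = cur;
		if (cur->has_notif) {
			*busy = 1;
			break;
		}
	}
	return last_locked;
}

/* Unlock exactly what lock_chain() locked; the loop ends at last_locked. */
static void unlock_chain(struct model_filter *newest,
			 struct model_filter *last_locked)
{
	struct model_filter *cur = newest;

	if (!last_locked)
		return;		/* empty chain: nothing was locked */
	for (;;) {
		assert(cur != NULL);	/* walked past last_locked: a bug */
		cur->locked = 0;
		if (cur == last_locked)
			break;
		cur = cur->prev;
	}
}
```

With a three-filter chain where the middle filter already has a listener, the root is never locked and never unlocked, which is the asymmetry the `last_locked` bookkeeping exists to handle.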

> +               mutex_unlock(&cur->notify_lock);
> +               if (cur == last_locked)
> +                       break;
> +       }
> +
> +       return ret;
> +}
> +#endif

^ permalink raw reply	[flat|nested] 91+ messages in thread


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-09-27 15:11 ` [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
  2018-09-27 16:20   ` Jann Horn
  2018-09-27 17:35   ` Jann Horn
@ 2018-09-27 21:53   ` Kees Cook
  2018-10-08 15:16   ` Christian Brauner
  3 siblings, 0 replies; 91+ messages in thread
From: Kees Cook @ 2018-09-27 21:53 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> version which can acquire filters is useful. There are at least two reasons
> this is preferable, even though it uses ptrace:
>
> 1. You can control tasks that aren't cooperating with you
> 2. You can control tasks whose filters block sendmsg() and socket(); if the
>    task installs a filter which blocks these calls, there's no way with
>    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
>
> v2: fix a bug where listener mode was not unset when an unused fd was not
>     available
> v3: fix refcounting bug (Oleg)
> v4: * change the listener's fd flags to be 0
>     * rename GET_LISTENER to NEW_LISTENER (Matthew)
> v5: * add capable(CAP_SYS_ADMIN) requirement
> v7: * point the new listener at the right filter (Jann)
>
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  include/linux/seccomp.h                       |  7 ++
>  include/uapi/linux/ptrace.h                   |  2 +
>  kernel/ptrace.c                               |  4 ++
>  kernel/seccomp.c                              | 31 +++++++++
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 68 +++++++++++++++++++
>  5 files changed, 112 insertions(+)
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 017444b5efed..234c61b37405 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -83,6 +83,8 @@ static inline int seccomp_mode(struct seccomp *s)
>  #ifdef CONFIG_SECCOMP_FILTER
>  extern void put_seccomp_filter(struct task_struct *tsk);
>  extern void get_seccomp_filter(struct task_struct *tsk);
> +extern long seccomp_new_listener(struct task_struct *task,
> +                                unsigned long filter_off);
>  #else  /* CONFIG_SECCOMP_FILTER */
>  static inline void put_seccomp_filter(struct task_struct *tsk)
>  {
> @@ -92,6 +94,11 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
>  {
>         return;
>  }
> +static inline long seccomp_new_listener(struct task_struct *task,
> +                                       unsigned long filter_off)
> +{
> +       return -EINVAL;
> +}
>  #endif /* CONFIG_SECCOMP_FILTER */
>
>  #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index d5a1b8a492b9..e80ecb1bd427 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -73,6 +73,8 @@ struct seccomp_metadata {
>         __u64 flags;            /* Output: filter's flags */
>  };
>
> +#define PTRACE_SECCOMP_NEW_LISTENER    0x420e
> +
>  /* Read signals from a shared (process wide) queue */
>  #define PTRACE_PEEKSIGINFO_SHARED      (1 << 0)
>
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 21fec73d45d4..289960ac181b 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
>                 ret = seccomp_get_metadata(child, addr, datavp);
>                 break;
>
> +       case PTRACE_SECCOMP_NEW_LISTENER:
> +               ret = seccomp_new_listener(child, addr);
> +               break;
> +
>         default:
>                 break;
>         }
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 44a31ac8373a..17685803a2af 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
>
>         return ret;
>  }
> +
> +long seccomp_new_listener(struct task_struct *task,
> +                         unsigned long filter_off)
> +{
> +       struct seccomp_filter *filter;
> +       struct file *listener;
> +       int fd;
> +
> +       if (!capable(CAP_SYS_ADMIN))
> +               return -EACCES;
> +
> +       filter = get_nth_filter(task, filter_off);
> +       if (IS_ERR(filter))
> +               return PTR_ERR(filter);
> +
> +       fd = get_unused_fd_flags(0);
> +       if (fd < 0) {
> +               __put_seccomp_filter(filter);
> +               return fd;
> +       }
> +
> +       listener = init_listener(task, filter);
> +       __put_seccomp_filter(filter);
> +       if (IS_ERR(listener)) {
> +               put_unused_fd(fd);
> +               return PTR_ERR(listener);
> +       }
> +
> +       fd_install(fd, listener);
> +       return fd;
> +}

Observation both here and with SECCOMP_FILTER_FLAG_NEW_LISTENER:
nothing actually checks that there is a RET_USER_NOTIF bpf rule in the
filter. *shrug* Not a problem, just a weird state.
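
For illustration only (the observation above says this is not a problem, and nothing here proposes adding such a check), a scan of a classic BPF program for a constant RET_USER_NOTIF return could look like the sketch below. `prog_may_return_user_notif()` is a hypothetical helper. The fallback values for `SECCOMP_RET_USER_NOTIF` and `SECCOMP_RET_ACTION_FULL` are assumptions taken from later mainline headers and may not match this v7 series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Assumed values, used only if the installed headers lack them. */
#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF	0x7fc00000U
#endif
#ifndef SECCOMP_RET_ACTION_FULL
#define SECCOMP_RET_ACTION_FULL	0xffff0000U
#endif

/*
 * Return true if any "ret k" instruction in the program carries the
 * USER_NOTIF action. "ret A" instructions are ignored, so this sketch
 * can under-report.
 */
static bool prog_may_return_user_notif(const struct sock_filter *insns,
				       size_t len)
{
	size_t i;

	assert(insns != NULL || len == 0);
	for (i = 0; i < len; i++) {
		const struct sock_filter *insn = &insns[i];

		if (BPF_CLASS(insn->code) != BPF_RET)
			continue;
		if (BPF_RVAL(insn->code) != BPF_K)
			continue;
		if ((insn->k & SECCOMP_RET_ACTION_FULL) == SECCOMP_RET_USER_NOTIF)
			return true;
	}
	return false;
}
```

A filter that only ever returns SECCOMP_RET_ALLOW would make this return false, which is the "weird state" described: a listener fd that can never receive a notification.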

>  #endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 5f4b836a6792..c6ba3ed5392e 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -193,6 +193,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
>  }
>  #endif
>
> +#ifndef PTRACE_SECCOMP_NEW_LISTENER
> +#define PTRACE_SECCOMP_NEW_LISTENER 0x420e
> +#endif
> +
>  #if __BYTE_ORDER == __LITTLE_ENDIAN
>  #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
>  #elif __BYTE_ORDER == __BIG_ENDIAN
> @@ -3175,6 +3179,70 @@ TEST(get_user_notification_syscall)
>         EXPECT_EQ(0, WEXITSTATUS(status));
>  }
>
> +TEST(get_user_notification_ptrace)
> +{
> +       pid_t pid;
> +       int status, listener;
> +       int sk_pair[2];
> +       char c;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +
> +       ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +               /* Test that we get ENOSYS while not attached */
> +               EXPECT_EQ(syscall(__NR_getpid), -1);
> +               EXPECT_EQ(errno, ENOSYS);
> +
> +               /* Signal we're ready and have installed the filter. */
> +               EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
> +
> +               EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +               EXPECT_EQ(c, 'H');
> +
> +               exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
> +       }
> +
> +       EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> +       EXPECT_EQ(c, 'J');
> +
> +       EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
> +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +       listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> +       EXPECT_GE(listener, 0);
> +
> +       /* EBUSY for second listener */
> +       EXPECT_EQ(ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0), -1);
> +       EXPECT_EQ(errno, EBUSY);
> +
> +       EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
> +
> +       /* Now signal we are done and respond with magic */
> +       EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +
> +       req.len = sizeof(req);
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       close(listener);
> +}
> +
>  /*
>   * Check that a pid in a child namespace still shows up as valid in ours.
>   */
> --
> 2.17.1
>

And FWIW, I agree with Jann's review notes here too. :) Looks good!

-Kees

-- 
Kees Cook
Pixel Security


* Re: [PATCH v7 4/6] files: add a replace_fd_files() function
  2018-09-27 15:11 ` [PATCH v7 4/6] files: add a replace_fd_files() function Tycho Andersen
  2018-09-27 16:49   ` Jann Horn
@ 2018-09-27 21:59   ` Kees Cook
  2018-09-28  2:20     ` Kees Cook
  1 sibling, 1 reply; 91+ messages in thread
From: Kees Cook @ 2018-09-27 21:59 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel, Alexander Viro

On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> Similar to fd_install/__fd_install, we want to be able to replace an fd of
> an arbitrary struct files_struct, not just current's. We'll use this in the
> next patch to implement the seccomp ioctl that allows inserting fds into a
> stopped process' context.
>
> v7: new in v7
>
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Alexander Viro <viro@zeniv.linux.org.uk>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  fs/file.c            | 22 +++++++++++++++-------
>  include/linux/file.h |  8 ++++++++
>  2 files changed, 23 insertions(+), 7 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 7ffd6e9d103d..3b3c5aadaadb 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -850,24 +850,32 @@ __releases(&files->file_lock)
>  }
>
>  int replace_fd(unsigned fd, struct file *file, unsigned flags)
> +{
> +       return replace_fd_task(current, fd, file, flags);
> +}
> +
> +/*
> + * Same warning as __alloc_fd()/__fd_install() here.
> + */
> +int replace_fd_task(struct task_struct *task, unsigned fd,
> +                   struct file *file, unsigned flags)
>  {
>         int err;
> -       struct files_struct *files = current->files;

Same feedback as Jann: on a purely "smaller diff" note, this could
just be s/current/task/ here and all the other s/files/task->files/
would go away...

>
>         if (!file)
> -               return __close_fd(files, fd);
> +               return __close_fd(task->files, fd);
>
> -       if (fd >= rlimit(RLIMIT_NOFILE))
> +       if (fd >= task_rlimit(task, RLIMIT_NOFILE))
>                 return -EBADF;
>
> -       spin_lock(&files->file_lock);
> -       err = expand_files(files, fd);
> +       spin_lock(&task->files->file_lock);
> +       err = expand_files(task->files, fd);
>         if (unlikely(err < 0))
>                 goto out_unlock;
> -       return do_dup2(files, file, fd, flags);
> +       return do_dup2(task->files, file, fd, flags);
>
>  out_unlock:
> -       spin_unlock(&files->file_lock);
> +       spin_unlock(&task->files->file_lock);
>         return err;
>  }
>
> diff --git a/include/linux/file.h b/include/linux/file.h
> index 6b2fb032416c..f94277fee038 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -11,6 +11,7 @@
>  #include <linux/posix_types.h>
>
>  struct file;
> +struct task_struct;
>
>  extern void fput(struct file *);
>
> @@ -79,6 +80,13 @@ static inline void fdput_pos(struct fd f)
>
>  extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
>  extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
> +/*
> + * Warning! This is only safe if you know the owner of the files_struct is
> + * stopped outside syscall context. It's a very bad idea to use this unless you
> + * have similar guarantees in your code.
> + */
> +extern int replace_fd_task(struct task_struct *task, unsigned fd,
> +                          struct file *file, unsigned flags);

Perhaps call this __replace_fd() to indicate the "please don't use
this unless you're very sure"ness of it?

>  extern void set_close_on_exec(unsigned int fd, int flag);
>  extern bool get_close_on_exec(unsigned int fd);
>  extern int get_unused_fd_flags(unsigned flags);
> --
> 2.17.1
>

If I can get an Ack from Al, that would be very nice. :)

-Kees

-- 
Kees Cook
Pixel Security


* Re: [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd
  2018-09-27 15:11 ` [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd Tycho Andersen
  2018-09-27 16:39   ` Jann Horn
  2018-09-27 19:28   ` Jann Horn
@ 2018-09-27 22:09   ` Kees Cook
  2018-09-27 22:15     ` Tycho Andersen
  2 siblings, 1 reply; 91+ messages in thread
From: Kees Cook @ 2018-09-27 22:09 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> This patch adds a way to insert FDs into the tracee's process (also
> close/overwrite fds for the tracee). This functionality is necessary to
> mock things like socketpair() or dup2() or similar, but since it depends on
> external (vfs) patches, I've left it as a separate patch as before so the
> core functionality can still be merged while we argue about this. Except
> this time it doesn't add any ugliness to the API :)
>
> v7: new in v7
>
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  .../userspace-api/seccomp_filter.rst          |  16 +++
>  include/uapi/linux/seccomp.h                  |   9 ++
>  kernel/seccomp.c                              |  54 ++++++++
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 126 ++++++++++++++++++
>  4 files changed, 205 insertions(+)
>
> diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> index d2e61f1c0a0b..383a8dbae304 100644
> --- a/Documentation/userspace-api/seccomp_filter.rst
> +++ b/Documentation/userspace-api/seccomp_filter.rst
> @@ -237,6 +237,13 @@ The interface for a seccomp notification fd consists of two structures:
>          __s64 val;
>      };
>
> +    struct seccomp_notif_put_fd {
> +        __u64 id;
> +        __s32 fd;
> +        __u32 fd_flags;
> +        __s32 to_replace;
> +    };
> +
>  Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
>  notification fd to receive a ``struct seccomp_notif``, which contains five
>  members: the input length of the structure, a unique-per-filter ``id``, the
> @@ -256,6 +263,15 @@ mentioned above in this document: all arguments being read from the tracee's
>  memory should be read into the tracer's memory before any policy decisions are
>  made. This allows for an atomic decision on syscall arguments.
>
> +Userspace can also insert (or overwrite) file descriptors of the tracee using
> +``ioctl(SECCOMP_NOTIF_PUT_FD)``. The ``id`` member is the request/pid to insert
> +the fd into. The ``fd`` is the fd in the listener's table to send or ``-1`` if
> +an fd should be closed instead. The ``to_replace`` fd is the fd in the tracee's
> +table that should be overwritten, or -1 if a new fd is installed. ``fd_flags``
> +should be the flags that the fd in the tracee's table is opened with (e.g.
> +``O_CLOEXEC`` or similar). The return value from this ioctl is the fd number
> +that was installed.
> +
>  Sysctls
>  =======
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index d4ccb32fe089..91d77f041fbb 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -77,6 +77,13 @@ struct seccomp_notif_resp {
>         __s64 val;
>  };
>
> +struct seccomp_notif_put_fd {
> +       __u64 id;
> +       __s32 fd;
> +       __u32 fd_flags;
> +       __s32 to_replace;
> +};
> +
>  #define SECCOMP_IOC_MAGIC              0xF7
>
>  /* Flags for seccomp notification fd ioctl. */
> @@ -86,5 +93,7 @@ struct seccomp_notif_resp {
>                                         struct seccomp_notif_resp)
>  #define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
>                                         __u64)
> +#define SECCOMP_NOTIF_PUT_FD   _IOR(SECCOMP_IOC_MAGIC, 3,      \
> +                                       struct seccomp_notif_put_fd)
>
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 17685803a2af..07a05ad59731 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -41,6 +41,8 @@
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/fdtable.h>
> +#include <net/cls_cgroup.h>
>
>  enum notify_state {
>         SECCOMP_NOTIFY_INIT,
> @@ -1684,6 +1686,56 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
>         return ret;
>  }
>
> +static long seccomp_notify_put_fd(struct seccomp_filter *filter,
> +                                 unsigned long arg)
> +{
> +       struct seccomp_notif_put_fd req;
> +       void __user *buf = (void __user *)arg;
> +       struct seccomp_knotif *knotif = NULL;
> +       long ret;
> +
> +       if (copy_from_user(&req, buf, sizeof(req)))
> +               return -EFAULT;
> +
> +       if (req.fd < 0 && req.to_replace < 0)
> +               return -EINVAL;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       ret = -ENOENT;
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               struct file *file = NULL;
> +
> +               if (knotif->id != req.id)
> +                       continue;
> +
> +               if (req.fd >= 0)
> +                       file = fget(req.fd);

Shouldn't we test for !file here?
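
Reading this hunk together with patch 4/6: without the check, a stale `req.fd` with `to_replace >= 0` hands `file == NULL` to replace_fd_task(), which treats `!file` as a close, and with `to_replace < 0` the NULL reaches __fd_install(). The sketch below models the request classification with the missing test added. `classify_put_fd()` and `enum put_fd_action` are hypothetical, `fd_is_open` stands in for fget() returning non-NULL, and -EBADF is an assumed choice of error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum put_fd_action {
	PUT_FD_INSTALL_NEW,	/* fd >= 0, to_replace <  0 */
	PUT_FD_REPLACE,		/* fd >= 0, to_replace >= 0 */
	PUT_FD_CLOSE,		/* fd <  0, to_replace >= 0 */
};

/*
 * Hypothetical helper: returns 0 and sets *action, or a negative errno.
 * fd_is_open models whether fget(fd) would succeed in the listener.
 */
static int classify_put_fd(int fd, int to_replace, bool fd_is_open,
			   enum put_fd_action *action)
{
	assert(action != NULL);

	if (fd < 0 && to_replace < 0)
		return -EINVAL;	/* same check as the quoted hunk */
	if (fd >= 0 && !fd_is_open)
		return -EBADF;	/* the missing !file test; errno assumed */

	if (fd < 0)
		*action = PUT_FD_CLOSE;
	else if (to_replace >= 0)
		*action = PUT_FD_REPLACE;
	else
		*action = PUT_FD_INSTALL_NEW;
	return 0;
}
```

The three successful outcomes correspond to the three cases the documentation hunk describes: install a new fd, overwrite an existing one, or close one.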

> +
> +               if (req.to_replace >= 0) {
> +                       ret = replace_fd_task(knotif->task, req.to_replace,
> +                                             file, req.fd_flags);
> +               } else {
> +                       unsigned long max_files;
> +
> +                       max_files = task_rlimit(knotif->task, RLIMIT_NOFILE);
> +                       ret = __alloc_fd(knotif->task->files, 0, max_files,
> +                                        req.fd_flags);
> +                       if (ret < 0)
> +                               break;
> +
> +                       __fd_install(knotif->task->files, ret, file);
> +               }
> +
> +               break;
> +       }
> +
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
>  static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
>                                  unsigned long arg)
>  {
> @@ -1696,6 +1748,8 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
>                 return seccomp_notify_send(filter, arg);
>         case SECCOMP_NOTIF_ID_VALID:
>                 return seccomp_notify_id_valid(filter, arg);
> +       case SECCOMP_NOTIF_PUT_FD:
> +               return seccomp_notify_put_fd(filter, arg);
>         default:
>                 return -EINVAL;
>         }
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index c6ba3ed5392e..cd1322c02b92 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -43,6 +43,7 @@
>  #include <sys/times.h>
>  #include <sys/socket.h>
>  #include <sys/ioctl.h>
> +#include <linux/kcmp.h>
>
>  #include <unistd.h>
>  #include <sys/syscall.h>
> @@ -169,6 +170,9 @@ struct seccomp_metadata {
>                                         struct seccomp_notif_resp)
>  #define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
>                                         __u64)
> +#define SECCOMP_NOTIF_PUT_FD   _IOR(SECCOMP_IOC_MAGIC, 3,      \
> +                                       struct seccomp_notif_put_fd)
> +
>  struct seccomp_notif {
>         __u16 len;
>         __u64 id;
> @@ -183,6 +187,13 @@ struct seccomp_notif_resp {
>         __s32 error;
>         __s64 val;
>  };
> +
> +struct seccomp_notif_put_fd {
> +       __u64 id;
> +       __s32 fd;
> +       __u32 fd_flags;
> +       __s32 to_replace;
> +};
>  #endif
>
>  #ifndef seccomp
> @@ -193,6 +204,14 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
>  }
>  #endif
>
> +#ifndef kcmp
> +int kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1,
> +        unsigned long idx2)
> +{
> +       return syscall(__NR_kcmp, pid1, pid2, type, idx1, idx2);
> +}
> +#endif
> +
>  #ifndef PTRACE_SECCOMP_NEW_LISTENER
>  #define PTRACE_SECCOMP_NEW_LISTENER 0x420e
>  #endif
> @@ -3243,6 +3262,113 @@ TEST(get_user_notification_ptrace)
>         close(listener);
>  }
>
> +TEST(user_notification_pass_fd)
> +{
> +       pid_t pid;
> +       int status, listener, fd;
> +       int sk_pair[2];
> +       char c;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +       struct seccomp_notif_put_fd putfd = {};
> +       long ret;
> +
> +       ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               int fd;
> +               char buf[16];
> +
> +               EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +               /* Signal we're ready and have installed the filter. */
> +               EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
> +
> +               EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +               EXPECT_EQ(c, 'H');
> +               close(sk_pair[1]);
> +
> +               /* An fd from getpid(). Let the games begin. */
> +               fd = syscall(__NR_getpid);
> +               EXPECT_GT(fd, 0);
> +               EXPECT_EQ(read(fd, buf, sizeof(buf)), 12);
> +               close(fd);
> +
> +               exit(strcmp("hello world", buf));
> +       }
> +
> +       EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> +       EXPECT_EQ(c, 'J');
> +
> +       EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
> +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +       listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> +       EXPECT_GE(listener, 0);
> +       EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
> +
> +       /* Now signal we are done installing so it can do a getpid */
> +       EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +       close(sk_pair[0]);
> +
> +       /* Make a new socket pair so we can send half across */
> +       EXPECT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +
> +       putfd.id = req.id;
> +       putfd.fd_flags = 0;
> +
> +       /* First, let's just create a new fd with our stdout. */
> +       putfd.fd = 0;
> +       putfd.to_replace = -1;
> +       fd = ioctl(listener, SECCOMP_NOTIF_PUT_FD, &putfd);
> +       EXPECT_GE(fd, 0);
> +       EXPECT_EQ(kcmp(req.pid, getpid(), KCMP_FILE, fd, 0), 0);
> +
> +       /* Dup something else over the top of it. */
> +       putfd.fd = sk_pair[1];
> +       putfd.to_replace = fd;
> +       fd = ioctl(listener, SECCOMP_NOTIF_PUT_FD, &putfd);
> +       EXPECT_GE(fd, 0);
> +       EXPECT_EQ(kcmp(req.pid, getpid(), KCMP_FILE, fd, sk_pair[1]), 0);
> +
> +       /* Now, try to close it. */
> +       putfd.fd = -1;
> +       putfd.to_replace = fd;
> +       fd = ioctl(listener, SECCOMP_NOTIF_PUT_FD, &putfd);
> +       EXPECT_GE(fd, 0);
> +       EXPECT_EQ(kcmp(req.pid, getpid(), KCMP_FILE, fd, sk_pair[1]), 1);
> +
> +       /* Ok, we tried the three cases, now let's do what we really want. */
> +       putfd.fd = sk_pair[1];
> +       putfd.to_replace = -1;
> +       fd = ioctl(listener, SECCOMP_NOTIF_PUT_FD, &putfd);
> +       EXPECT_GE(fd, 0);
> +       EXPECT_EQ(kcmp(req.pid, getpid(), KCMP_FILE, fd, sk_pair[1]), 0);
> +
> +       resp.val = fd;
> +       resp.error = 0;
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +       close(sk_pair[1]);
> +
> +       EXPECT_EQ(write(sk_pair[0], "hello world\0", 12), 12);
> +       close(sk_pair[0]);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +       close(listener);
> +}
> +
>  /*
>   * Check that a pid in a child namespace still shows up as valid in ours.
>   */
> --
> 2.17.1
>

To no one's surprise, I agree with Jann's feedback too.

And thank you again for the tests! :) It's really nice to see some
"live samples" of how the API is meant to be used.

-Kees

-- 
Kees Cook
Pixel Security


* Re: [PATCH v7 6/6] samples: add an example of seccomp user trap
  2018-09-27 15:11 ` [PATCH v7 6/6] samples: add an example of seccomp user trap Tycho Andersen
@ 2018-09-27 22:11   ` Kees Cook
  0 siblings, 0 replies; 91+ messages in thread
From: Kees Cook @ 2018-09-27 22:11 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> The idea here is just to give a demonstration of how one could safely use
> the SECCOMP_RET_USER_NOTIF feature to do mount policies. This particular
> policy is (as noted in the comment) not very interesting, but it serves to
> illustrate how one might apply a policy dodging the various TOCTOU issues.
>
> v5: new in v5
> v7: updates for v7 API changes
>
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  samples/seccomp/.gitignore  |   1 +
>  samples/seccomp/Makefile    |   7 +-
>  samples/seccomp/user-trap.c | 312 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 319 insertions(+), 1 deletion(-)
>
> diff --git a/samples/seccomp/.gitignore b/samples/seccomp/.gitignore
> index 78fb78184291..d1e2e817d556 100644
> --- a/samples/seccomp/.gitignore
> +++ b/samples/seccomp/.gitignore
> @@ -1,3 +1,4 @@
>  bpf-direct
>  bpf-fancy
>  dropper
> +user-trap
> diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
> index cf34ff6b4065..4920903c8009 100644
> --- a/samples/seccomp/Makefile
> +++ b/samples/seccomp/Makefile
> @@ -1,6 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0
>  ifndef CROSS_COMPILE
> -hostprogs-$(CONFIG_SAMPLE_SECCOMP) := bpf-fancy dropper bpf-direct
> +hostprogs-$(CONFIG_SAMPLE_SECCOMP) := bpf-fancy dropper bpf-direct user-trap
>
>  HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
>  HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
> @@ -16,6 +16,10 @@ HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
>  HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
>  bpf-direct-objs := bpf-direct.o
>
> +HOSTCFLAGS_user-trap.o += -I$(objtree)/usr/include
> +HOSTCFLAGS_user-trap.o += -idirafter $(objtree)/include
> +user-trap-objs := user-trap.o
> +
>  # Try to match the kernel target.
>  ifndef CONFIG_64BIT
>
> @@ -33,6 +37,7 @@ HOSTCFLAGS_bpf-fancy.o += $(MFLAG)
>  HOSTLDLIBS_bpf-direct += $(MFLAG)
>  HOSTLDLIBS_bpf-fancy += $(MFLAG)
>  HOSTLDLIBS_dropper += $(MFLAG)
> +HOSTLDLIBS_user-trap += $(MFLAG)
>  endif
>  always := $(hostprogs-m)
>  endif
> diff --git a/samples/seccomp/user-trap.c b/samples/seccomp/user-trap.c
> new file mode 100644
> index 000000000000..63c9a5994dc1
> --- /dev/null
> +++ b/samples/seccomp/user-trap.c
> @@ -0,0 +1,312 @@
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <string.h>
> +#include <stddef.h>
> +#include <sys/sysmacros.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <sys/socket.h>
> +#include <sys/stat.h>
> +#include <sys/mman.h>
> +#include <sys/syscall.h>
> +#include <sys/user.h>
> +#include <sys/ioctl.h>
> +#include <sys/ptrace.h>
> +#include <sys/mount.h>
> +#include <linux/limits.h>
> +#include <linux/filter.h>
> +#include <linux/seccomp.h>
> +
> +/*
> + * Because of some grossness, we can't include linux/ptrace.h here, so we
> + * re-define PTRACE_SECCOMP_NEW_LISTENER.
> + */
> +#ifndef PTRACE_SECCOMP_NEW_LISTENER
> +#define PTRACE_SECCOMP_NEW_LISTENER    0x420e
> +#endif
> +
> +#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
> +
> +static int seccomp(unsigned int op, unsigned int flags, void *args)
> +{
> +       errno = 0;
> +       return syscall(__NR_seccomp, op, flags, args);
> +}
> +
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                       offsetof(struct seccomp_data, nr)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +static int handle_req(struct seccomp_notif *req,
> +                     struct seccomp_notif_resp *resp, int listener)
> +{
> +       char path[PATH_MAX], source[PATH_MAX], target[PATH_MAX];
> +       int ret = -1, mem;
> +
> +       resp->len = sizeof(*resp);
> +       resp->id = req->id;
> +       resp->error = -EPERM;
> +       resp->val = 0;
> +
> +       if (req->data.nr != __NR_mount) {
> +               fprintf(stderr, "huh? trapped something besides mount? %d\n", req->data.nr);
> +               return -1;
> +       }
> +
> +       /* Only allow bind mounts. */
> +       if (!(req->data.args[3] & MS_BIND))
> +               return 0;
> +
> +       /*
> +        * Ok, let's read the task's memory to see where they wanted their
> +        * mount to go.
> +        */
> +       snprintf(path, sizeof(path), "/proc/%d/mem", req->pid);
> +       mem = open(path, O_RDONLY);
> +       if (mem < 0) {
> +               perror("open mem");
> +               return -1;
> +       }
> +
> +       /*
> +        * Now we avoid a TOCTOU: we referred to a pid by its pid, but since
> +        * the pid that made the syscall may have died, we need to confirm that
> +        * the pid is still valid after we open its /proc/pid/mem file. We can
> +        * ask the listener fd this as follows.
> +        *
> +        * Note that this check should occur *after* any task-specific
> +        * resources are opened, to make sure that the task has not died and
> +        * we're not wrongly reading someone else's state in order to make
> +        * decisions.
> +        */
> +       if (ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req->id) < 0) {
> +               fprintf(stderr, "task died before we could map its memory\n");
> +               goto out;
> +       }
> +
> +       /*
> +        * Phew, we've got the right /proc/pid/mem. Now we can read it. Note
> +        * that to avoid another TOCTOU, we should read all of the pointer args
> +        * before we decide to allow the syscall.
> +        */
> +       if (lseek(mem, req->data.args[0], SEEK_SET) < 0) {
> +               perror("seek");
> +               goto out;
> +       }
> +
> +       ret = read(mem, source, sizeof(source));
> +       if (ret < 0) {
> +               perror("read");
> +               goto out;
> +       }
> +
> +       if (lseek(mem, req->data.args[1], SEEK_SET) < 0) {
> +               perror("seek");
> +               goto out;
> +       }
> +
> +       ret = read(mem, target, sizeof(target));
> +       if (ret < 0) {
> +               perror("read");
> +               goto out;
> +       }
> +
> +       /*
> +        * Our policy is to only allow bind mounts inside /tmp. This isn't very
> +        * interesting, because we could do unprivileged bind mounts with user
> +        * namespaces already, but you get the idea.
> +        */
> +       if (!strncmp(source, "/tmp", 4) && !strncmp(target, "/tmp", 4)) {
> +               if (mount(source, target, NULL, req->data.args[3], NULL) < 0) {
> +                       ret = -1;
> +                       perror("actual mount");
> +                       goto out;
> +               }
> +               resp->error = 0;
> +       }
> +
> +       /* Even if we didn't allow it because of policy, generating the
> +        * response was a success, because we want to tell the worker EPERM.
> +        */
> +       ret = 0;
> +
> +out:
> +       close(mem);
> +       return ret;
> +}
> +
> +int main(void)
> +{
> +       int sk_pair[2], ret = 1, status, listener;
> +       pid_t worker = 0 , tracer = 0;
> +       char c;
> +
> +       if (socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair) < 0) {
> +               perror("socketpair");
> +               return 1;
> +       }
> +
> +       worker = fork();
> +       if (worker < 0) {
> +               perror("fork");
> +               goto close_pair;
> +       }
> +
> +       if (worker == 0) {
> +               if (user_trap_syscall(__NR_mount, 0) < 0) {
> +                       perror("seccomp");
> +                       exit(1);
> +               }
> +
> +               if (setuid(1000) < 0) {
> +                       perror("setuid");
> +                       exit(1);
> +               }
> +
> +               if (write(sk_pair[1], "a", 1) != 1) {
> +                       perror("write");
> +                       exit(1);
> +               }
> +
> +               if (read(sk_pair[1], &c, 1) != 1) {
> +                       perror("read");
> +                       exit(1);
> +               }
> +
> +               if (mkdir("/tmp/foo", 0755) < 0) {
> +                       perror("mkdir");
> +                       exit(1);
> +               }
> +
> +               if (mount("/dev/sda", "/tmp/foo", NULL, 0, NULL) != -1) {
> +                       fprintf(stderr, "huh? mounted /dev/sda?\n");
> +                       exit(1);
> +               }
> +
> +               if (errno != EPERM) {
> +                       perror("bad error from mount");
> +                       exit(1);
> +               }
> +
> +               if (mount("/tmp/foo", "/tmp/foo", NULL, MS_BIND, NULL) < 0) {
> +                       perror("mount");
> +                       exit(1);
> +               }
> +
> +               exit(0);
> +       }
> +
> +       if (read(sk_pair[0], &c, 1) != 1) {
> +               perror("read ready signal");
> +               goto out_kill;
> +       }
> +
> +       if (ptrace(PTRACE_ATTACH, worker) < 0) {
> +               perror("ptrace");
> +               goto out_kill;
> +       }
> +
> +       if (waitpid(worker, NULL, 0) != worker) {
> +               perror("waitpid");
> +               goto out_kill;
> +       }
> +
> +       listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, worker, 0);
> +       if (listener < 0) {
> +               perror("ptrace get listener");
> +               goto out_kill;
> +       }
> +
> +       if (ptrace(PTRACE_DETACH, worker, NULL, 0) < 0) {
> +               perror("ptrace detach");
> +               goto out_kill;
> +       }
> +
> +       if (write(sk_pair[0], "a", 1) != 1) {
> +               perror("write");
> +               exit(1);
> +       }
> +
> +       tracer = fork();
> +       if (tracer < 0) {
> +               perror("fork");
> +               goto out_kill;
> +       }
> +
> +       if (tracer == 0) {
> +               while (1) {
> +                       struct seccomp_notif req = {};
> +                       struct seccomp_notif_resp resp = {};
> +
> +                       req.len = sizeof(req);
> +                       if (ioctl(listener, SECCOMP_NOTIF_RECV, &req) != sizeof(req)) {
> +                               perror("ioctl recv");
> +                               goto out_close;
> +                       }
> +
> +                       if (handle_req(&req, &resp, listener) < 0)
> +                               goto out_close;
> +
> +                       if (ioctl(listener, SECCOMP_NOTIF_SEND, &resp) != sizeof(resp)) {
> +                               perror("ioctl send");
> +                               goto out_close;
> +                       }
> +               }
> +out_close:
> +               close(listener);
> +               exit(1);
> +       }
> +
> +       close(listener);
> +
> +       if (waitpid(worker, &status, 0) != worker) {
> +               perror("waitpid");
> +               goto out_kill;
> +       }
> +
> +       if (umount2("/tmp/foo", MNT_DETACH) < 0 && errno != EINVAL) {
> +               perror("umount2");
> +               goto out_kill;
> +       }
> +
> +       if (remove("/tmp/foo") < 0 && errno != ENOENT) {
> +               perror("remove");
> +               exit(1);
> +       }
> +
> +       if (!WIFEXITED(status) || WEXITSTATUS(status)) {
> +               fprintf(stderr, "worker exited nonzero\n");
> +               goto out_kill;
> +       }
> +
> +       ret = 0;
> +
> +out_kill:
> +       if (tracer > 0)
> +               kill(tracer, SIGKILL);
> +       if (worker > 0)
> +               kill(worker, SIGKILL);
> +
> +close_pair:
> +       close(sk_pair[0]);
> +       close(sk_pair[1]);
> +       return ret;
> +}
> --
> 2.17.1
>

handle_req() is well commented, but main() isn't. Since this is
explicitly a "sample", can you add operational comments to main() as
well? I think it might help people follow what is happening (and what
is expected) during main().

Beyond that, yay! Samples! :)

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd
  2018-09-27 16:39   ` Jann Horn
@ 2018-09-27 22:13     ` Tycho Andersen
  0 siblings, 0 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 22:13 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel

On Thu, Sep 27, 2018 at 06:39:02PM +0200, Jann Horn wrote:
> On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > This patch adds a way to insert FDs into the tracee's process (also
> > close/overwrite fds for the tracee). This functionality is necessary to
> > mock things like socketpair() or dup2() or similar, but since it depends on
> > external (vfs) patches, I've left it as a separate patch as before so the
> > core functionality can still be merged while we argue about this. Except
> > this time it doesn't add any ugliness to the API :)
> [...]
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 17685803a2af..07a05ad59731 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -41,6 +41,8 @@
> >  #include <linux/tracehook.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/anon_inodes.h>
> > +#include <linux/fdtable.h>
> > +#include <net/cls_cgroup.h>
> >
> >  enum notify_state {
> >         SECCOMP_NOTIFY_INIT,
> > @@ -1684,6 +1686,56 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> >         return ret;
> >  }
> >
> > +static long seccomp_notify_put_fd(struct seccomp_filter *filter,
> > +                                 unsigned long arg)
> > +{
> > +       struct seccomp_notif_put_fd req;
> > +       void __user *buf = (void __user *)arg;
> > +       struct seccomp_knotif *knotif = NULL;
> > +       long ret;
> > +
> > +       if (copy_from_user(&req, buf, sizeof(req)))
> > +               return -EFAULT;
> > +
> > +       if (req.fd < 0 && req.to_replace < 0)
> > +               return -EINVAL;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       ret = -ENOENT;
> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               struct file *file = NULL;
> > +
> > +               if (knotif->id != req.id)
> > +                       continue;
> > +
> > +               if (req.fd >= 0)
> > +                       file = fget(req.fd);
> 
> So here we take a reference on `file`.
> 
> > +               if (req.to_replace >= 0) {
> > +                       ret = replace_fd_task(knotif->task, req.to_replace,
> > +                                             file, req.fd_flags);
> 
> Then here we try to place the file in knotif->task's file descriptor
> table. This can either fail (e.g. due to exceeded rlimit), in which
> case nothing happens, or it can do do_dup2(), which first takes an
> extra reference to the file, then places it in the task's fd table.
> 
> Either way, afterwards, we still hold a reference to the file.
> 
> > +               } else {
> > +                       unsigned long max_files;
> > +
> > +                       max_files = task_rlimit(knotif->task, RLIMIT_NOFILE);
> > +                       ret = __alloc_fd(knotif->task->files, 0, max_files,
> > +                                        req.fd_flags);
> > +                       if (ret < 0)
> > +                               break;
> 
> If we bail out here, we still hold a reference to `file`.
> 
> Suggestion: Change this to "if (ret >= 0) {" and make the following
> code conditional instead of breaking.
> 
> > +                       __fd_install(knotif->task->files, ret, file);
> 
> But if we reach this point, __fd_install() consumes the file pointer,
> so `file` is a dangling pointer now.
> 
> Suggestion: Add "break;" here.
> 
> > +               }
> 
> Suggestion: Add "if (file != NULL) fput(file);" here.

Ugh, yes, thanks.
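
For reference, the reference accounting Jann describes can be modelled
in user space. This is only a toy sketch, not the kernel code: struct
file is reduced to a bare counter, and replace_fd_model(),
fd_install_model() and put_fd_model() are hypothetical stand-ins for
replace_fd_task(), __fd_install() and the body of the loop. It assumes
req.fd >= 0, so the file pointer is never NULL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: a "file" is nothing but a reference count. */
struct file { int refs; };

/* Like fget(): the caller gets its own reference. */
static struct file *fget(struct file *f) { f->refs++; return f; }

/* Like fput(): the caller drops its reference. */
static void fput(struct file *f) { f->refs--; }

/* Hypothetical stand-in for __fd_install(): consumes the caller's reference. */
static void fd_install_model(struct file **slot, struct file *f)
{
	*slot = f;
}

/*
 * Hypothetical stand-in for the replace path: on success it takes an
 * extra reference of its own (as do_dup2() does), on failure it does
 * nothing. Either way the caller still owns its reference afterwards.
 */
static bool replace_fd_model(struct file **slot, struct file *f, bool fail)
{
	if (fail)
		return false;
	f->refs++;
	*slot = f;
	return true;
}

enum path { PATH_REPLACE, PATH_INSTALL };

/* One pass through the loop body with the three suggestions applied. */
static bool put_fd_model(struct file *src, struct file **slot,
			 enum path p, bool fail)
{
	struct file *file = fget(src);	/* our temporary reference */
	bool ok;

	if (p == PATH_REPLACE) {
		ok = replace_fd_model(slot, file, fail);
	} else {
		ok = !fail;		/* models __alloc_fd() succeeding */
		if (ok) {
			fd_install_model(slot, file);
			return true;	/* reference consumed, skip the fput */
		}
	}

	fput(file);			/* every other path drops our reference */
	return ok;
}
```

In this model the count ends at one (the listener's own table) after
any failure and at two after any success, whichever branch was taken.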

Tycho

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd
  2018-09-27 19:28   ` Jann Horn
@ 2018-09-27 22:14     ` Tycho Andersen
  2018-09-27 22:17       ` Jann Horn
  0 siblings, 1 reply; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 22:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel

On Thu, Sep 27, 2018 at 09:28:07PM +0200, Jann Horn wrote:
> On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > This patch adds a way to insert FDs into the tracee's process (also
> > close/overwrite fds for the tracee). This functionality is necessary to
> > mock things like socketpair() or dup2() or similar, but since it depends on
> > external (vfs) patches, I've left it as a separate patch as before so the
> > core functionality can still be merged while we argue about this. Except
> > this time it doesn't add any ugliness to the API :)
> [...]
> > +static long seccomp_notify_put_fd(struct seccomp_filter *filter,
> > +                                 unsigned long arg)
> > +{
> > +       struct seccomp_notif_put_fd req;
> > +       void __user *buf = (void __user *)arg;
> > +       struct seccomp_knotif *knotif = NULL;
> > +       long ret;
> > +
> > +       if (copy_from_user(&req, buf, sizeof(req)))
> > +               return -EFAULT;
> > +
> > +       if (req.fd < 0 && req.to_replace < 0)
> > +               return -EINVAL;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       ret = -ENOENT;
> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               struct file *file = NULL;
> > +
> > +               if (knotif->id != req.id)
> > +                       continue;
> 
> Are you intentionally permitting non-SENT states here? It shouldn't
> make a big difference, but I think it'd be nice to at least block the
> use of notifications in SECCOMP_NOTIFY_REPLIED state.

Agreed, I'll block everything besides REPLIED.

Tycho

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd
  2018-09-27 22:09   ` Kees Cook
@ 2018-09-27 22:15     ` Tycho Andersen
  0 siblings, 0 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 22:15 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Thu, Sep 27, 2018 at 03:09:06PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > This patch adds a way to insert FDs into the tracee's process (also
> > close/overwrite fds for the tracee). This functionality is necessary to
> > mock things like socketpair() or dup2() or similar, but since it depends on
> > external (vfs) patches, I've left it as a separate patch as before so the
> > core functionality can still be merged while we argue about this. Except
> > this time it doesn't add any ugliness to the API :)
> >
> > v7: new in v7
> >
> > Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> > CC: Kees Cook <keescook@chromium.org>
> > CC: Andy Lutomirski <luto@amacapital.net>
> > CC: Oleg Nesterov <oleg@redhat.com>
> > CC: Eric W. Biederman <ebiederm@xmission.com>
> > CC: "Serge E. Hallyn" <serge@hallyn.com>
> > CC: Christian Brauner <christian.brauner@ubuntu.com>
> > CC: Tyler Hicks <tyhicks@canonical.com>
> > CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> > ---
> >  .../userspace-api/seccomp_filter.rst          |  16 +++
> >  include/uapi/linux/seccomp.h                  |   9 ++
> >  kernel/seccomp.c                              |  54 ++++++++
> >  tools/testing/selftests/seccomp/seccomp_bpf.c | 126 ++++++++++++++++++
> >  4 files changed, 205 insertions(+)
> >
> > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> > index d2e61f1c0a0b..383a8dbae304 100644
> > --- a/Documentation/userspace-api/seccomp_filter.rst
> > +++ b/Documentation/userspace-api/seccomp_filter.rst
> > @@ -237,6 +237,13 @@ The interface for a seccomp notification fd consists of two structures:
> >          __s64 val;
> >      };
> >
> > +    struct seccomp_notif_put_fd {
> > +        __u64 id;
> > +        __s32 fd;
> > +        __u32 fd_flags;
> > +        __s32 to_replace;
> > +    };
> > +
> >  Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
> >  notification fd to receive a ``struct seccomp_notif``, which contains five
> >  members: the input length of the structure, a unique-per-filter ``id``, the
> > @@ -256,6 +263,15 @@ mentioned above in this document: all arguments being read from the tracee's
> >  memory should be read into the tracer's memory before any policy decisions are
> >  made. This allows for an atomic decision on syscall arguments.
> >
> > +Userspace can also insert (or overwrite) file descriptors of the tracee using
> > +``ioctl(SECCOMP_NOTIF_PUT_FD)``. The ``id`` member is the request/pid to insert
> > +the fd into. The ``fd`` is the fd in the listener's table to send or ``-1`` if
> > +an fd should be closed instead. The ``to_replace`` fd is the fd in the tracee's
> > +table that should be overwritten, or -1 if a new fd is installed. ``fd_flags``
> > +should be the flags that the fd in the tracee's table is opened with (e.g.
> > +``O_CLOEXEC`` or similar). The return value from this ioctl is the fd number
> > +that was installed.
> > +
> >  Sysctls
> >  =======
> >
> > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > index d4ccb32fe089..91d77f041fbb 100644
> > --- a/include/uapi/linux/seccomp.h
> > +++ b/include/uapi/linux/seccomp.h
> > @@ -77,6 +77,13 @@ struct seccomp_notif_resp {
> >         __s64 val;
> >  };
> >
> > +struct seccomp_notif_put_fd {
> > +       __u64 id;
> > +       __s32 fd;
> > +       __u32 fd_flags;
> > +       __s32 to_replace;
> > +};
> > +
> >  #define SECCOMP_IOC_MAGIC              0xF7
> >
> >  /* Flags for seccomp notification fd ioctl. */
> > @@ -86,5 +93,7 @@ struct seccomp_notif_resp {
> >                                         struct seccomp_notif_resp)
> >  #define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> >                                         __u64)
> > +#define SECCOMP_NOTIF_PUT_FD   _IOR(SECCOMP_IOC_MAGIC, 3,      \
> > +                                       struct seccomp_notif_put_fd)
> >
> >  #endif /* _UAPI_LINUX_SECCOMP_H */
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 17685803a2af..07a05ad59731 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -41,6 +41,8 @@
> >  #include <linux/tracehook.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/anon_inodes.h>
> > +#include <linux/fdtable.h>
> > +#include <net/cls_cgroup.h>
> >
> >  enum notify_state {
> >         SECCOMP_NOTIFY_INIT,
> > @@ -1684,6 +1686,56 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> >         return ret;
> >  }
> >
> > +static long seccomp_notify_put_fd(struct seccomp_filter *filter,
> > +                                 unsigned long arg)
> > +{
> > +       struct seccomp_notif_put_fd req;
> > +       void __user *buf = (void __user *)arg;
> > +       struct seccomp_knotif *knotif = NULL;
> > +       long ret;
> > +
> > +       if (copy_from_user(&req, buf, sizeof(req)))
> > +               return -EFAULT;
> > +
> > +       if (req.fd < 0 && req.to_replace < 0)
> > +               return -EINVAL;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       ret = -ENOENT;
> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               struct file *file = NULL;
> > +
> > +               if (knotif->id != req.id)
> > +                       continue;
> > +
> > +               if (req.fd >= 0)
> > +                       file = fget(req.fd);
> 
> Shouldn't we test for !file here?

Yes. Derp.

Tycho

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd
  2018-09-27 22:14     ` Tycho Andersen
@ 2018-09-27 22:17       ` Jann Horn
  2018-09-27 22:49         ` Tycho Andersen
  0 siblings, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-09-27 22:17 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel

On Fri, Sep 28, 2018 at 12:14 AM Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 27, 2018 at 09:28:07PM +0200, Jann Horn wrote:
> > On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > This patch adds a way to insert FDs into the tracee's process (also
> > > close/overwrite fds for the tracee). This functionality is necessary to
> > > mock things like socketpair() or dup2() or similar, but since it depends on
> > > external (vfs) patches, I've left it as a separate patch as before so the
> > > core functionality can still be merged while we argue about this. Except
> > > this time it doesn't add any ugliness to the API :)
> > [...]
> > > +static long seccomp_notify_put_fd(struct seccomp_filter *filter,
> > > +                                 unsigned long arg)
> > > +{
> > > +       struct seccomp_notif_put_fd req;
> > > +       void __user *buf = (void __user *)arg;
> > > +       struct seccomp_knotif *knotif = NULL;
> > > +       long ret;
> > > +
> > > +       if (copy_from_user(&req, buf, sizeof(req)))
> > > +               return -EFAULT;
> > > +
> > > +       if (req.fd < 0 && req.to_replace < 0)
> > > +               return -EINVAL;
> > > +
> > > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > > +       if (ret < 0)
> > > +               return ret;
> > > +
> > > +       ret = -ENOENT;
> > > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > > +               struct file *file = NULL;
> > > +
> > > +               if (knotif->id != req.id)
> > > +                       continue;
> >
> > Are you intentionally permitting non-SENT states here? It shouldn't
> > make a big difference, but I think it'd be nice to at least block the
> > use of notifications in SECCOMP_NOTIFY_REPLIED state.
>
> Agreed, I'll block everything besides REPLIED.

Do you mean SENT? In REPLIED state, seccomp_notify_put_fd()
is racy because the target task is in the process of waking up, right?
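
To make the question concrete, here is a minimal sketch of the check
under discussion. It assumes the states are INIT, SENT and REPLIED;
only SECCOMP_NOTIFY_INIT and SECCOMP_NOTIFY_REPLIED are spelled out in
this thread, so the SENT name, the helper put_fd_state_check() and the
-EINVAL return value are assumptions for illustration only.

```c
#include <errno.h>

/* Assumed state names; SECCOMP_NOTIFY_SENT is inferred, not quoted. */
enum notify_state {
	SECCOMP_NOTIFY_INIT,
	SECCOMP_NOTIFY_SENT,
	SECCOMP_NOTIFY_REPLIED,
};

/*
 * Hypothetical helper: allow an fd to be installed only while the
 * notification has been handed to the listener and not yet answered.
 * In REPLIED state the target task may already be waking up, so
 * touching its fd table would race with it.
 */
static long put_fd_state_check(enum notify_state state)
{
	if (state != SECCOMP_NOTIFY_SENT)
		return -EINVAL;	/* error code chosen for illustration */
	return 0;
}
```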

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 21:51   ` Jann Horn
@ 2018-09-27 22:45     ` Kees Cook
  2018-09-27 23:08       ` Tycho Andersen
  2018-09-27 23:04     ` Tycho Andersen
  1 sibling, 1 reply; 91+ messages in thread
From: Kees Cook @ 2018-09-27 22:45 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Christoph Hellwig, Al Viro, linux-fsdevel,
	kernel list, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda

On Thu, Sep 27, 2018 at 2:51 PM, Jann Horn <jannh@google.com> wrote:
> On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
>> However, care should be taken to avoid the TOCTOU
>> +mentioned above in this document: all arguments being read from the tracee's
>> +memory should be read into the tracer's memory before any policy decisions are
>> +made. This allows for an atomic decision on syscall arguments.
>
> Again, I don't really see how you could get this wrong.

Doesn't hurt to mention it, IMO.

>> +static long seccomp_notify_send(struct seccomp_filter *filter,
>> +                               unsigned long arg)
>> +{
>> +       struct seccomp_notif_resp resp = {};
>> +       struct seccomp_knotif *knotif = NULL;
>> +       long ret;
>> +       u16 size;
>> +       void __user *buf = (void __user *)arg;
>> +
>> +       if (copy_from_user(&size, buf, sizeof(size)))
>> +               return -EFAULT;
>> +       size = min_t(size_t, size, sizeof(resp));
>> +       if (copy_from_user(&resp, buf, size))
>> +               return -EFAULT;
>> +
>> +       ret = mutex_lock_interruptible(&filter->notify_lock);
>> +       if (ret < 0)
>> +               return ret;
>> +
>> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
>> +               if (knotif->id == resp.id)
>> +                       break;
>> +       }
>> +
>> +       if (!knotif || knotif->id != resp.id) {
>
> Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
> NULL here. If `filter->notif->notifications` is empty, I think
> `knotif` will be `container_of(&filter->notif->notifications, struct
> seccomp_knotif, list)` - in other words, you'll have a type confusion,
> and `knotif` probably points into some random memory in front of
> `filter->notif`.
>
> Am I missing something?

Oh, good catch. This just needs to be fixed like it's done in
seccomp_notif_recv (separate cur and knotif).
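
A user-space sketch of the problem and of the cur/knotif split, in case
it helps. The list helpers below are cut-down stand-ins for the ones in
<linux/list.h> (they rely on the GNU __typeof__ extension), struct
notification has a hypothetical pad field in place of its real members,
and find_knotif() and cursor_after_loop() are made-up names.

```c
#include <stddef.h>

/* Cut-down stand-ins for the kernel's list helpers. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)				\
	for (pos = container_of((head)->next, __typeof__(*pos), member); \
	     &pos->member != (head);					\
	     pos = container_of(pos->member.next, __typeof__(*pos), member))

struct knotif {
	unsigned long long id;
	struct list_head list;
};

/* Hypothetical layout: pad stands in for whatever precedes the list. */
struct notification {
	unsigned long long pad;
	struct list_head notifications;
};

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/*
 * The problem: after the loop runs to completion the cursor is not
 * NULL. It is container_of() applied to the list head itself, i.e. a
 * bogus "struct knotif" overlapping the memory in front of the head.
 */
static struct knotif *cursor_after_loop(struct list_head *head)
{
	struct knotif *cur;

	list_for_each_entry(cur, head, list)
		;	/* walk to the end without matching anything */
	return cur;
}

/* The fix: keep the cursor private and publish only a real match. */
static struct knotif *find_knotif(struct list_head *head,
				  unsigned long long id)
{
	struct knotif *cur, *knotif = NULL;

	list_for_each_entry(cur, head, list) {
		if (cur->id == id) {
			knotif = cur;
			break;
		}
	}
	return knotif;	/* NULL when the list is empty or has no match */
}
```

With the split, a plain "if (!knotif)" after the loop is enough, and
the extra id comparison is no longer needed.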

>> +static struct file *init_listener(struct task_struct *task,
>> +                                 struct seccomp_filter *filter)
>> +{
>
> Why does this function take a `task` pointer instead of always
> accessing `current`? If `task` actually wasn't `current`, I would have
> concurrency concerns. A comment in seccomp.h even explains:
>
>  *          @filter must only be accessed from the context of current as there
>  *          is no read locking.
>
> Unless there's a good reason for it, I would prefer it if this
> function didn't take a `task` pointer.

This is to support PTRACE_SECCOMP_NEW_LISTENER.

But you make an excellent point. Even TSYNC expects to operate only on
the current thread group. Hmm.

While the process is stopped by ptrace, we could, in theory, update
task->seccomp.filter via something like TSYNC.

So perhaps use:

mutex_lock_killable(&task->signal->cred_guard_mutex);

before walking the notify_locks?

>
>> +       struct file *ret = ERR_PTR(-EBUSY);
>> +       struct seccomp_filter *cur, *last_locked = NULL;
>> +       int filter_nesting = 0;
>> +
>> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
>> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
>> +               filter_nesting++;
>> +               last_locked = cur;
>> +               if (cur->notif)
>> +                       goto out;
>> +       }
>> +
>> +       ret = ERR_PTR(-ENOMEM);
>> +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
>
> sizeof(struct notification) instead, to make the code clearer?

I prefer what Tycho has: I want to allocate an instance of whatever
filter->notif is.

Though, let's do the kzalloc outside of the locking, instead?

>> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
>> +                                filter, O_RDWR);
>> +       if (IS_ERR(ret))
>> +               goto out;
>> +
>> +
>> +       /* The file has a reference to it now */
>> +       __get_seccomp_filter(filter);
>
> __get_seccomp_filter() has a comment in it that claims "/* Reference
> count is bounded by the number of total processes. */". I think this
> change invalidates that comment. I think it should be fine to just
> remove the comment.

Update it to "bounded by total processes and notification listeners"?

>> +out:
>> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
>
> s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
> here, something went very wrong.

Hm? This is correct. This is how seccomp_run_filters() walks the list too:

        struct seccomp_filter *f =
                        READ_ONCE(current->seccomp.filter);
        ...
        for (; f; f = f->prev) {

Especially if we'll be holding the cred_guard_mutex.

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 21:31   ` Kees Cook
@ 2018-09-27 22:48     ` Tycho Andersen
  2018-09-27 23:10       ` Kees Cook
  2018-10-08 14:58       ` Christian Brauner
  2018-10-17 20:29     ` Tycho Andersen
  1 sibling, 2 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 22:48 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > This patch introduces a means for syscalls matched in seccomp to notify
> > some other task that a particular filter has been triggered.
> >
> > The motivation for this is primarily for use with containers. For example,
> > if a container does an init_module(), we obviously don't want to load this
> > untrusted code, which may be compiled for the wrong version of the kernel
> > anyway. Instead, we could parse the module image, figure out which module
> > the container is trying to load and load it on the host.
> >
> > As another example, containers cannot mknod(), since this checks
> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > coding some whitelist in the kernel. Another example is mount(), which has
> > many security restrictions for good reason, but configuration or runtime
> > knowledge could potentially be used to relax these restrictions.
> >
> > This patch adds functionality that is already possible via at least two
> > other means that I know about, both of which involve ptrace(): first, one
> > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > Unfortunately this is slow, so a faster version would be to install a
> > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > Since ptrace allows only one tracer, if the container runtime is that
> > tracer, users inside the container (or outside) trying to debug it will not
> > be able to use ptrace, which is annoying. It also means that older
> > distributions based on Upstart cannot boot inside containers using ptrace,
> > since upstart itself uses ptrace to start services.
> >
> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex.
> >
> > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > memory data from the task still applies here, but can be avoided with
> > careful design of the userspace handler: if the userspace handler reads all
> > of the task memory that is necessary before applying its security policy,
> > the tracee's subsequent memory edits will not be read by the tracer.
> >
> > v2: * make id a u64; the idea here being that it will never overflow,
> >       because 64 is huge (one syscall every nanosecond => wrap every 584
> >       years) (Andy)
> >     * prevent nesting of user notifications: if someone is already attached
> >       the tree in one place, nobody else can attach to the tree (Andy)
> >     * notify the listener of signals the tracee receives as well (Andy)
> >     * implement poll
> > v3: * lockdep fix (Oleg)
> >     * drop unnecessary WARN()s (Christian)
> >     * rearrange error returns to be more pretty (Christian)
> >     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> > v4: * fix implementation of poll to use poll_wait() (Jann)
> >     * change listener's fd flags to be 0 (Jann)
> >     * hoist filter initialization out of ifdefs to its own function
> >       init_user_notification()
> >     * add some more testing around poll() and closing the listener while a
> >       syscall is in action
> >     * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
> >       creates a new one (Matthew)
> >     * correctly handle pid namespaces, add some testcases (Matthew)
> >     * use EINPROGRESS instead of EINVAL when a notification response is
> >       written twice (Matthew)
> >     * fix comment typo from older version (SEND vs READ) (Matthew)
> >     * whitespace and logic simplification (Tobin)
> >     * add some Documentation/ bits on userspace trapping
> > v5: * fix documentation typos (Jann)
> >     * add signalled field to struct seccomp_notif (Jann)
> >     * switch to using ioctls instead of read()/write() for struct passing
> >       (Jann)
> >     * add an ioctl to ensure an id is still valid
> > v6: * docs typo fixes, update docs for ioctl() change (Christian)
> > v7: * switch struct seccomp_knotif's id member to a u64 (derp :)
> >     * use notify_lock in IS_ID_VALID query to avoid racing
> >     * s/signalled/signaled (Tyler)
> >     * fix docs to reflect that ids are not globally unique (Tyler)
> >     * add a test to check -ERESTARTSYS behavior (Tyler)
> >     * drop CONFIG_SECCOMP_USER_NOTIFICATION (Tyler)
> >     * reorder USER_NOTIF in seccomp return codes list (Tyler)
> >     * return size instead of sizeof(struct user_notif) (Tyler)
> >     * ENOENT instead of EINVAL when invalid id is passed (Tyler)
> >     * drop CONFIG_SECCOMP_USER_NOTIFICATION guards (Tyler)
> >     * s/IS_ID_VALID/ID_VALID and switch ioctl to be "well behaved" (Tyler)
> >     * add a new struct notification to minimize the additions to
> >       struct seccomp_filter, also pack the necessary additions a bit more
> >       cleverly (Tyler)
> >     * switch to keeping track of the task itself instead of the pid (we'll
> >       use this for implementing PUT_FD)
> 
> Patch-sending nit: can you put the versioning below the "---" line so
> it isn't included in the final commit? (And I normally read these
> backwards, so I'd expect v7 at the top, but that's not a big deal. I
> mean... neither is the --- thing, but it makes "git am" easier for me
> since I don't have to go edit the versioning out of the log.)

Sure, will do.

> > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > index 9efc0e73d50b..d4ccb32fe089 100644
> > --- a/include/uapi/linux/seccomp.h
> > +++ b/include/uapi/linux/seccomp.h
> > @@ -17,9 +17,10 @@
> >  #define SECCOMP_GET_ACTION_AVAIL       2
> >
> >  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> > -#define SECCOMP_FILTER_FLAG_TSYNC      (1UL << 0)
> > -#define SECCOMP_FILTER_FLAG_LOG                (1UL << 1)
> > -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
> > +#define SECCOMP_FILTER_FLAG_TSYNC              (1UL << 0)
> > +#define SECCOMP_FILTER_FLAG_LOG                        (1UL << 1)
> > +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW         (1UL << 2)
> > +#define SECCOMP_FILTER_FLAG_NEW_LISTENER       (1UL << 3)
> 
> Since these are all getting indentation updates, can you switch them
> to BIT(0), BIT(1), etc?

Will do.

> >  /*
> >   * All BPF programs must return a 32-bit value.
> > @@ -35,6 +36,7 @@
> >  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
> >  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
> >  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> > +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
> >  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
> >  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
> >  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> > @@ -60,4 +62,29 @@ struct seccomp_data {
> >         __u64 args[6];
> >  };
> >
> > +struct seccomp_notif {
> > +       __u16 len;
> > +       __u64 id;
> > +       __u32 pid;
> > +       __u8 signaled;
> > +       struct seccomp_data data;
> > +};
> > +
> > +struct seccomp_notif_resp {
> > +       __u16 len;
> > +       __u64 id;
> > +       __s32 error;
> > +       __s64 val;
> > +};
> 
> So, len has to come first, for versioning. However, since it's ahead
> of a u64, this leaves a struct padding hole. pahole output:
> 
> struct seccomp_notif {
>         __u16                      len;                  /*     0     2 */
> 
>         /* XXX 6 bytes hole, try to pack */
> 
>         __u64                      id;                   /*     8     8 */
>         __u32                      pid;                  /*    16     4 */
>         __u8                       signaled;             /*    20     1 */
> 
>         /* XXX 3 bytes hole, try to pack */
> 
>         struct seccomp_data        data;                 /*    24    64 */
>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> 
>         /* size: 88, cachelines: 2, members: 5 */
>         /* sum members: 79, holes: 2, sum holes: 9 */
>         /* last cacheline: 24 bytes */
> };
> struct seccomp_notif_resp {
>         __u16                      len;                  /*     0     2 */
> 
>         /* XXX 6 bytes hole, try to pack */
> 
>         __u64                      id;                   /*     8     8 */
>         __s32                      error;                /*    16     4 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         __s64                      val;                  /*    24     8 */
> 
>         /* size: 32, cachelines: 1, members: 4 */
>         /* sum members: 22, holes: 2, sum holes: 10 */
>         /* last cacheline: 32 bytes */
> };
> 
> How about making len u32, and moving pid and error above "id"? This
> leaves a hole after signaled, so changing "len" won't be sufficient
> for versioning here. Perhaps move it after data?

I'm not sure what you mean by "len won't be sufficient for versioning
here"? Anyway, I can do some packing on these; I didn't bother before
since I figured it's a userspace interface, so saving a few bytes
isn't a huge deal.
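
To make the layout question concrete, here is a minimal sketch comparing the
posted layout with the suggested reordering. The struct names ending in _v7,
_packed and _sketch are hypothetical stand-ins, fixed-width userspace types
replace the __u/__s kernel types, and the offsets assume a common LP64 ABI
where u64 is 8-byte aligned (as in the pahole output above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct seccomp_data: 64 bytes, 8-byte aligned. */
struct data_sketch {
	int32_t nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/* Layout as posted in v7: a 6-byte hole follows the u16 len. */
struct notif_v7 {
	uint16_t len;
	uint64_t id;
	uint32_t pid;
	uint8_t signaled;
	struct data_sketch data;
};

struct resp_v7 {
	uint16_t len;
	uint64_t id;
	int32_t error;
	int64_t val;
};

/* Suggested reordering: u32 len, 32-bit members ahead of id. */
struct notif_packed {
	uint32_t len;
	uint32_t pid;
	uint64_t id;
	uint8_t signaled;	/* a hole still follows before data */
	struct data_sketch data;
};

struct resp_packed {
	uint32_t len;
	int32_t error;
	uint64_t id;
	int64_t val;
};

/* Bytes of padding between signaled and data in the reordered struct. */
static size_t notif_packed_hole(void)
{
	return offsetof(struct notif_packed, data) -
	       (offsetof(struct notif_packed, signaled) + sizeof(uint8_t));
}
```

Under those assumptions the response struct shrinks from 32 to 24 bytes with
no holes, while the notification struct stays at 88 bytes because the hole
simply moves to after signaled.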

> > +
> > +#define SECCOMP_IOC_MAGIC              0xF7
> 
> Was there any specific reason for picking this value? There are lots
> of fun ASCII code left like '!' or '*'. :)

No, ! it is :)

> > +
> > +/* Flags for seccomp notification fd ioctl. */
> > +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> > +                                       struct seccomp_notif)
> > +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> > +                                       struct seccomp_notif_resp)
> > +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> > +                                       __u64)
> 
> To match other UAPI ioctls, can these have a prefix of "SECCOMP_IOCTL_..."?
> 
> It may also be useful to match how other uapis do this, like for DRM:
> 
> #define DRM_IOCTL_BASE                  'd'
> #define DRM_IO(nr)                      _IO(DRM_IOCTL_BASE,nr)
> #define DRM_IOR(nr,type)                _IOR(DRM_IOCTL_BASE,nr,type)
> #define DRM_IOW(nr,type)                _IOW(DRM_IOCTL_BASE,nr,type)
> #define DRM_IOWR(nr,type)               _IOWR(DRM_IOCTL_BASE,nr,type)
> 
> #define DRM_IOCTL_VERSION               DRM_IOWR(0x00, struct drm_version)
> #define DRM_IOCTL_GET_UNIQUE            DRM_IOWR(0x01, struct drm_unique)
> #define DRM_IOCTL_GET_MAGIC             DRM_IOR( 0x02, struct drm_auth)
> ...

Will do.
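
Roughly, following the DRM pattern with '!' as the magic, this could look
like the sketch below. The SECCOMP_IO* wrapper names and the
SECCOMP_IOCTL_NOTIF_* names are assumptions for illustration, not what the
next revision necessarily uses, and the two *_sketch structs are stand-ins
for the real payloads.

```c
#include <assert.h>
#include <linux/ioctl.h>
#include <stdint.h>

/* Stand-in payloads; the real structs are defined by the patch. */
struct notif_sketch { uint32_t len; uint32_t pid; uint64_t id; };
struct resp_sketch  { uint32_t len; int32_t error; uint64_t id; int64_t val; };

#define SECCOMP_IOC_MAGIC		'!'
#define SECCOMP_IO(nr)			_IO(SECCOMP_IOC_MAGIC, nr)
#define SECCOMP_IOR(nr, type)		_IOR(SECCOMP_IOC_MAGIC, nr, type)
#define SECCOMP_IOW(nr, type)		_IOW(SECCOMP_IOC_MAGIC, nr, type)
#define SECCOMP_IOWR(nr, type)		_IOWR(SECCOMP_IOC_MAGIC, nr, type)

/* One line per command, numbered from 0 as in the current patch. */
#define SECCOMP_IOCTL_NOTIF_RECV	SECCOMP_IOWR(0, struct notif_sketch)
#define SECCOMP_IOCTL_NOTIF_SEND	SECCOMP_IOWR(1, struct resp_sketch)
#define SECCOMP_IOCTL_NOTIF_ID_VALID	SECCOMP_IOR(2, uint64_t)
```

The wrappers keep the magic in one place, so each command definition only
carries its number and payload type.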

> 
> > +
> >  #endif /* _UAPI_LINUX_SECCOMP_H */
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index fd023ac24e10..fa6fe9756c80 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -33,12 +33,78 @@
> >  #endif
> >
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +#include <linux/file.h>
> >  #include <linux/filter.h>
> >  #include <linux/pid.h>
> >  #include <linux/ptrace.h>
> >  #include <linux/security.h>
> >  #include <linux/tracehook.h>
> >  #include <linux/uaccess.h>
> > +#include <linux/anon_inodes.h>
> > +
> > +enum notify_state {
> > +       SECCOMP_NOTIFY_INIT,
> > +       SECCOMP_NOTIFY_SENT,
> > +       SECCOMP_NOTIFY_REPLIED,
> > +};
> > +
> > +struct seccomp_knotif {
> > +       /* The struct pid of the task whose filter triggered the notification */
> > +       struct task_struct *task;
> > +
> > +       /* The "cookie" for this request; this is unique for this filter. */
> > +       u64 id;
> > +
> > +       /* Whether or not this task has been given an interruptible signal. */
> > +       bool signaled;
> > +
> > +       /*
> > +        * The seccomp data. This pointer is valid the entire time this
> > +        * notification is active, since it comes from __seccomp_filter which
> > +        * eclipses the entire lifecycle here.
> > +        */
> > +       const struct seccomp_data *data;
> > +
> > +       /*
> > +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> > +        * struct seccomp_knotif is created and starts out in INIT. Once the
> > +        * handler reads the notification off of an FD, it transitions to SENT.
> > +        * If a signal is received the state transitions back to INIT and
> > +        * another message is sent. When the userspace handler replies, state
> > +        * transitions to REPLIED.
> > +        */
> > +       enum notify_state state;
> > +
> > +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> > +       int error;
> > +       long val;
> > +
> > +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> > +       struct completion ready;
> > +
> > +       struct list_head list;
> > +};
> > +
> > +/**
> > + * struct notification - container for seccomp userspace notifications. Since
> > + * most seccomp filters will not have notification listeners attached and this
> > + * structure is fairly large, we store the notification-specific stuff in a
> > + * separate structure.
> > + *
> > + * @request: A semaphore that users of this notification can wait on for
> > + *           changes. Actual reads and writes are still controlled with
> > + *           filter->notify_lock.
> > + * @notify_lock: A lock for all notification-related accesses.
> > + * @next_id: The id of the next request.
> > + * @notifications: A list of struct seccomp_knotif elements.
> > + * @wqh: A wait queue for poll.
> > + */
> > +struct notification {
> > +       struct semaphore request;
> > +       u64 next_id;
> > +       struct list_head notifications;
> > +       wait_queue_head_t wqh;
> > +};
> >
> >  /**
> >   * struct seccomp_filter - container for seccomp BPF programs
> > @@ -66,6 +132,8 @@ struct seccomp_filter {
> >         bool log;
> >         struct seccomp_filter *prev;
> >         struct bpf_prog *prog;
> > +       struct notification *notif;
> > +       struct mutex notify_lock;
> >  };
> >
> >  /* Limit any path through the tree to 256KB worth of instructions. */
> > @@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
> >         if (!sfilter)
> >                 return ERR_PTR(-ENOMEM);
> >
> > +       mutex_init(&sfilter->notify_lock);
> >         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
> >                                         seccomp_check_filter, save_orig);
> >         if (ret < 0) {
> > @@ -556,11 +625,13 @@ static void seccomp_send_sigsys(int syscall, int reason)
> >  #define SECCOMP_LOG_TRACE              (1 << 4)
> >  #define SECCOMP_LOG_LOG                        (1 << 5)
> >  #define SECCOMP_LOG_ALLOW              (1 << 6)
> > +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
> >
> >  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
> >                                     SECCOMP_LOG_KILL_THREAD  |
> >                                     SECCOMP_LOG_TRAP  |
> >                                     SECCOMP_LOG_ERRNO |
> > +                                   SECCOMP_LOG_USER_NOTIF |
> >                                     SECCOMP_LOG_TRACE |
> >                                     SECCOMP_LOG_LOG;
> >
> > @@ -581,6 +652,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
> >         case SECCOMP_RET_TRACE:
> >                 log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
> >                 break;
> > +       case SECCOMP_RET_USER_NOTIF:
> > +               log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> > +               break;
> >         case SECCOMP_RET_LOG:
> >                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
> >                 break;
> > @@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
> >  #else
> >
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> > +{
> > +       /* Note: overflow is ok here, the id just needs to be unique */
> 
> Maybe just clarify in the comment: unique to the filter.
> 
> > +       return filter->notif->next_id++;
> 
> Also, it might be useful to add for both documentation and lockdep:
> 
> lockdep_assert_held(&filter->notify_lock);
> 
> into this function?

Will do.

> 
> > +}
> > +
> > +static void seccomp_do_user_notification(int this_syscall,
> > +                                        struct seccomp_filter *match,
> > +                                        const struct seccomp_data *sd)
> > +{
> > +       int err;
> > +       long ret = 0;
> > +       struct seccomp_knotif n = {};
> > +
> > +       mutex_lock(&match->notify_lock);
> > +       err = -ENOSYS;
> > +       if (!match->notif)
> > +               goto out;
> > +
> > +       n.task = current;
> > +       n.state = SECCOMP_NOTIFY_INIT;
> > +       n.data = sd;
> > +       n.id = seccomp_next_notify_id(match);
> > +       init_completion(&n.ready);
> > +
> > +       list_add(&n.list, &match->notif->notifications);
> > +       wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> > +
> > +       mutex_unlock(&match->notify_lock);
> > +       up(&match->notif->request);
> > +
> 
> Maybe add a big comment here saying this is where we're waiting for a reply?

Will do.

> > +       err = wait_for_completion_interruptible(&n.ready);
> > +       mutex_lock(&match->notify_lock);
> > +
> > +       /*
> > +        * Here it's possible we got a signal and then had to wait on the mutex
> > +        * while the reply was sent, so let's be sure there wasn't a response
> > +        * in the meantime.
> > +        */
> > +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > +               /*
> > +                * We got a signal. Let's tell userspace about it (potentially
> > +                * again, if we had already notified them about the first one).
> > +                */
> > +               n.signaled = true;
> > +               if (n.state == SECCOMP_NOTIFY_SENT) {
> > +                       n.state = SECCOMP_NOTIFY_INIT;
> > +                       up(&match->notif->request);
> > +               }
> > +               mutex_unlock(&match->notify_lock);
> > +               err = wait_for_completion_killable(&n.ready);
> > +               mutex_lock(&match->notify_lock);
> > +               if (err < 0)
> > +                       goto remove_list;
> > +       }
> > +
> > +       ret = n.val;
> > +       err = n.error;
> > +
> > +remove_list:
> > +       list_del(&n.list);
> > +out:
> > +       mutex_unlock(&match->notify_lock);
> > +       syscall_set_return_value(current, task_pt_regs(current),
> > +                                err, ret);
> > +}
> > +
> >  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> >                             const bool recheck_after_trace)
> >  {
> > @@ -728,6 +869,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> >
> >                 return 0;
> >
> > +       case SECCOMP_RET_USER_NOTIF:
> > +               seccomp_do_user_notification(this_syscall, match, sd);
> > +               goto skip;
> 
> Nit: please add a blank line here (to match the other cases).
> 
> >         case SECCOMP_RET_LOG:
> >                 seccomp_log(this_syscall, 0, action, true);
> >                 return 0;
> > @@ -834,6 +978,9 @@ static long seccomp_set_mode_strict(void)
> >  }
> >
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +static struct file *init_listener(struct task_struct *,
> > +                                 struct seccomp_filter *);
> 
> Why is the forward declaration needed instead of just moving the
> function here? I didn't see anything in it that looked like it
> couldn't move.

I think there was a cycle in some earlier version, but I agree there
isn't now. I'll fix it.

> > +
> >  /**
> >   * seccomp_set_mode_filter: internal function for setting seccomp filter
> >   * @flags:  flags to change filter behavior
> > @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> >         struct seccomp_filter *prepared = NULL;
> >         long ret = -EINVAL;
> > +       int listener = 0;
> 
> Nit: "invalid fd" should be -1, not 0.
> 
> > +       struct file *listener_f = NULL;
> >
> >         /* Validate flags. */
> >         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         if (IS_ERR(prepared))
> >                 return PTR_ERR(prepared);
> >
> > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +               listener = get_unused_fd_flags(0);
> 
> As with the other place pointed out by Jann, this should maybe be O_CLOEXEC too?

Yep, will do.

> > +               if (listener < 0) {
> > +                       ret = listener;
> > +                       goto out_free;
> > +               }
> > +
> > +               listener_f = init_listener(current, prepared);
> > +               if (IS_ERR(listener_f)) {
> > +                       put_unused_fd(listener);
> > +                       ret = PTR_ERR(listener_f);
> > +                       goto out_free;
> > +               }
> > +       }
> > +
> >         /*
> >          * Make sure we cannot change seccomp or nnp state via TSYNC
> >          * while another thread is in the middle of calling exec.
> >          */
> >         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> >             mutex_lock_killable(&current->signal->cred_guard_mutex))
> > -               goto out_free;
> > +               goto out_put_fd;
> >
> >         spin_lock_irq(&current->sighand->siglock);
> >
> > @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         spin_unlock_irq(&current->sighand->siglock);
> >         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> >                 mutex_unlock(&current->signal->cred_guard_mutex);
> > +out_put_fd:
> > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +               if (ret < 0) {
> > +                       fput(listener_f);
> > +                       put_unused_fd(listener);
> > +               } else {
> > +                       fd_install(listener, listener_f);
> > +                       ret = listener;
> > +               }
> > +       }
> 
> Can you update the kern-docs for seccomp_set_mode_filter(), since we
> can now return positive values?
> 
>  * Returns 0 on success or -EINVAL on failure.
> 
> (this shouldn't say only -EINVAL, I realize too)

Sure, I can fix both of these.

> I have to say, I'm vaguely nervous about changing the semantics here
> for passing back the fd as the return code from the seccomp() syscall.
> Alternatives seem less appealing, though: changing the meaning of the
> uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> example. Hmm.

From my perspective we can drop this whole thing. The only thing I'll
ever use is the ptrace version. Someone at some point (I don't
remember who, maybe stgraber) suggested this version would be useful
as well.

Anyway, let me know if your nervousness outweighs this, I'm happy to
drop it.

> > @@ -1342,3 +1520,259 @@ static int __init seccomp_sysctl_init(void)
> >  device_initcall(seccomp_sysctl_init)
> >
> >  #endif /* CONFIG_SYSCTL */
> > +
> > +#ifdef CONFIG_SECCOMP_FILTER
> > +static int seccomp_notify_release(struct inode *inode, struct file *file)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       struct seccomp_knotif *knotif;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +
> > +       /*
> > +        * If this file is being closed because e.g. the task who owned it
> > +        * died, let's wake everyone up who was waiting on us.
> > +        */
> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> > +                       continue;
> > +
> > +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> > +               knotif->error = -ENOSYS;
> > +               knotif->val = 0;
> > +
> > +               complete(&knotif->ready);
> > +       }
> > +
> > +       wake_up_all(&filter->notif->wqh);
> > +       kfree(filter->notif);
> > +       filter->notif = NULL;
> > +       mutex_unlock(&filter->notify_lock);
> 
> It looks like that means nothing waiting on knotif->ready can access
> filter->notif without rechecking it, yes?
> 
> e.g. in seccomp_do_user_notification() I see:
> 
>                         up(&match->notif->request);
> 
> I *think* this isn't reachable due to the test for n.state !=
> SECCOMP_NOTIFY_REPLIED, though. Perhaps, just for sanity and because
> it's not fast-path, we could add a WARN_ON() while checking for
> unreplied signal death?
> 
>                 n.signaled = true;
>                 if (n.state == SECCOMP_NOTIFY_SENT) {
>                         n.state = SECCOMP_NOTIFY_INIT;
>                         if (!WARN_ON(!match->notif))
>                                 up(&match->notif->request);
>                 }
>                 mutex_unlock(&match->notify_lock);

So this code path should actually be safe, since notify_lock is held
throughout, as it is in the release handler. However, there is one just above
it that is not, because we do:

        mutex_unlock(&match->notify_lock);
        up(&match->notif->request);

When this was all a member of struct seccomp_filter the order didn't matter,
but now it very much does, and I think you're right that these statements need
to be reordered. There may be others, I'll check everything else as well.
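
As a userspace analogy of the ordering I have in mind (a sketch only, not
the kernel code): a pthread mutex stands in for notify_lock and a plain
counter stands in for the request semaphore. All names here are
hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct notification. */
struct notif_sketch {
	unsigned int pending;	/* stands in for the request semaphore count */
};

struct filter_sketch {
	pthread_mutex_t notify_lock;
	struct notif_sketch *notif;	/* freed when the listener is released */
};

/* Queue side: touch notif only while notify_lock is still held. */
static int queue_request(struct filter_sketch *f)
{
	int ret = -1;

	pthread_mutex_lock(&f->notify_lock);
	if (f->notif) {
		/* release_listener() cannot free notif while we hold the lock */
		f->notif->pending++;
		ret = 0;
	}
	pthread_mutex_unlock(&f->notify_lock);
	return ret;
}

/* Release side: free notif under the same lock. */
static void release_listener(struct filter_sketch *f)
{
	pthread_mutex_lock(&f->notify_lock);
	free(f->notif);
	f->notif = NULL;
	pthread_mutex_unlock(&f->notify_lock);
}
```

The point is only that the "signal" step and the free are serialized by the
same lock; signalling after the unlock leaves a window where notif may
already be gone.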

> 
> > +       __put_seccomp_filter(filter);
> > +       return 0;
> > +}
> > +
> > +static long seccomp_notify_recv(struct seccomp_filter *filter,
> > +                               unsigned long arg)
> > +{
> > +       struct seccomp_knotif *knotif = NULL, *cur;
> > +       struct seccomp_notif unotif = {};
> > +       ssize_t ret;
> > +       u16 size;
> > +       void __user *buf = (void __user *)arg;
> 
> I'd prefer this casting happen in seccomp_notify_ioctl(). This keeps
> anything from accidentally using "arg" directly here.

Will do.

> > +
> > +       if (copy_from_user(&size, buf, sizeof(size)))
> > +               return -EFAULT;
> > +
> > +       ret = down_interruptible(&filter->notif->request);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> > +                       knotif = cur;
> > +                       break;
> > +               }
> > +       }
> > +
> > +       /*
> > +        * If we didn't find a notification, it could be that the task was
> > +        * interrupted between the time we were woken and when we were able to
> > +        * acquire the rw lock.
> > +        */
> > +       if (!knotif) {
> > +               ret = -ENOENT;
> > +               goto out;
> > +       }
> > +
> > +       size = min_t(size_t, size, sizeof(unotif));
> > +
> 
> It is possible (though unlikely given the type widths involved here)
> for unotif = {} to not initialize padding, so I would recommend an
> explicit memset(&unotif, 0, sizeof(unotif)) here.

Orly? I didn't know that, thanks.
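
So the pattern would be something like the sketch below: clear the whole
object with memset() and then assign the members, so the padding bytes also
have a defined value before the copy out. The struct and function names are
hypothetical, and this relies on the usual compiler behaviour of leaving
padding alone when individual members are stored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in with the same holes as the v7 struct seccomp_notif. */
struct notif_sketch {
	uint16_t len;		/* followed by padding on most ABIs */
	uint64_t id;
	uint32_t pid;
	uint8_t signaled;	/* followed by tail padding */
};

/*
 * Fill `out` so that every byte, padding included, is defined before the
 * struct is copied across a trust boundary.
 */
static void fill_notif(struct notif_sketch *out, uint64_t id, uint32_t pid)
{
	memset(out, 0, sizeof(*out));	/* clears members and padding */
	out->len = sizeof(*out);
	out->id = id;
	out->pid = pid;
	out->signaled = 0;
}
```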

> > +       unotif.len = size;
> > +       unotif.id = knotif->id;
> > +       unotif.pid = task_pid_vnr(knotif->task);
> > +       unotif.signaled = knotif->signaled;
> > +       unotif.data = *(knotif->data);
> > +
> > +       if (copy_to_user(buf, &unotif, size)) {
> > +               ret = -EFAULT;
> > +               goto out;
> > +       }
> > +
> > +       ret = size;
> > +       knotif->state = SECCOMP_NOTIFY_SENT;
> > +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> > +
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> 
> Is there some way to rearrange the locking here to avoid holding the
> mutex while doing copy_to_user() (which userspace could block with
> userfaultfd, and then stall all the other notifications for this
> filter)?

Yes, I don't think it'll cause any problems to release the lock earlier.

> > +       return ret;
> > +}
> > +
> > +static long seccomp_notify_send(struct seccomp_filter *filter,
> > +                               unsigned long arg)
> > +{
> > +       struct seccomp_notif_resp resp = {};
> > +       struct seccomp_knotif *knotif = NULL;
> > +       long ret;
> > +       u16 size;
> > +       void __user *buf = (void __user *)arg;
> 
> Same cast note as above.
> 
> > +
> > +       if (copy_from_user(&size, buf, sizeof(size)))
> > +               return -EFAULT;
> > +       size = min_t(size_t, size, sizeof(resp));
> > +       if (copy_from_user(&resp, buf, size))
> > +               return -EFAULT;
> 
> For sanity checking on a double-read from userspace, please add:
> 
>     if (resp.len != size)
>         return -EINVAL;

Won't that fail if sizeof(resp) < resp.len, because of the min_t()?
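
One possible reading, sketched below in userspace: compare resp.len against
the value from the first read, before the clamp, so a caller with a larger
struct still passes. Here memcpy() stands in for copy_from_user(), and the
struct and function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

struct resp_sketch {
	uint16_t len;
	uint64_t id;
	int32_t error;
	int64_t val;
};

/*
 * Model of the two-step copy: read the length, clamp it, copy the body,
 * then confirm the length did not change between the two reads.
 * `ubuf` stands in for the user buffer.
 */
static int read_resp(struct resp_sketch *resp, const void *ubuf)
{
	uint16_t claimed, size;

	memcpy(&claimed, ubuf, sizeof(claimed));	/* first fetch */
	if (claimed < sizeof(claimed))
		return -EINVAL;
	size = claimed < sizeof(*resp) ? claimed : sizeof(*resp);

	memset(resp, 0, sizeof(*resp));
	memcpy(resp, ubuf, size);			/* second fetch */

	/* Compare with the unclamped value, so oversized callers still pass. */
	if (resp->len != claimed)
		return -EINVAL;
	return size;
}
```

With the comparison done against the clamped size instead, any caller whose
struct is larger than the kernel's would always get -EINVAL, which is the
failure I was asking about.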

> > +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> > +                                   unsigned long arg)
> > +{
> > +       struct seccomp_knotif *knotif = NULL;
> > +       void __user *buf = (void __user *)arg;
> > +       u64 id;
> > +       long ret;
> > +
> > +       if (copy_from_user(&id, buf, sizeof(id)))
> > +               return -EFAULT;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       ret = -1;
> 
> Isn't this EPERM? Shouldn't it be -ENOENT?

Yes, I wasn't thinking of errno here, I'll switch it.
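
For reference, a tiny sketch of the lookup with an explicit errno. The
function name and the array standing in for the notifications list are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Return 0 if `id` is among the pending ids, -ENOENT otherwise. */
static long id_valid(const uint64_t *pending, size_t n, uint64_t id)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (pending[i] == id)
			return 0;
	return -ENOENT;	/* a bare -1 would read as -EPERM */
}
```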

> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->id == id) {
> > +                       ret = 0;
> > +                       goto out;
> > +               }
> > +       }
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> > +                                unsigned long arg)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +
> > +       switch (cmd) {
> > +       case SECCOMP_NOTIF_RECV:
> > +               return seccomp_notify_recv(filter, arg);
> > +       case SECCOMP_NOTIF_SEND:
> > +               return seccomp_notify_send(filter, arg);
> > +       case SECCOMP_NOTIF_ID_VALID:
> > +               return seccomp_notify_id_valid(filter, arg);
> > +       default:
> > +               return -EINVAL;
> > +       }
> > +}
> > +
> > +static __poll_t seccomp_notify_poll(struct file *file,
> > +                                   struct poll_table_struct *poll_tab)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       __poll_t ret = 0;
> > +       struct seccomp_knotif *cur;
> > +
> > +       poll_wait(file, &filter->notif->wqh, poll_tab);
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT)
> > +                       ret |= EPOLLIN | EPOLLRDNORM;
> > +               if (cur->state == SECCOMP_NOTIFY_SENT)
> > +                       ret |= EPOLLOUT | EPOLLWRNORM;
> > +               if (ret & EPOLLIN && ret & EPOLLOUT)
> 
> My eyes! :) Can you wrap the bit operations in parens here?
> 
> > +                       break;
> > +       }
> 
> Should POLLERR be handled here too? I don't quite see the conditions
> that might be exposed? All the processes die for the filter, which
> does what here?

I think it shouldn't do anything, because I was thinking of the semantics of
poll() as "when a tracee does a syscall that matches, fire". So a task could
start, never make a targeted syscall, and exit, and poll() shouldn't report
any events. Maybe it's useful to write that down somewhere, though.
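
For reference, a userspace sketch of the mask computation with the bit tests
parenthesized as requested. The enum and function names are hypothetical
stand-ins for the SECCOMP_NOTIFY_* states, not code from the patch; the point
is only that an empty or fully replied list yields no events.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical userspace mirror of the notification states in the patch. */
enum notify_state { NOTIFY_INIT, NOTIFY_SENT, NOTIFY_REPLIED };

/* Compute the readiness mask for an array of notification states. */
static uint32_t notify_poll_mask(const enum notify_state *states, size_t n)
{
	uint32_t mask = 0;

	for (size_t i = 0; i < n; i++) {
		if (states[i] == NOTIFY_INIT)
			mask |= EPOLLIN | EPOLLRDNORM;
		if (states[i] == NOTIFY_SENT)
			mask |= EPOLLOUT | EPOLLWRNORM;
		/* Both directions are ready; nothing more to learn. */
		if ((mask & EPOLLIN) && (mask & EPOLLOUT))
			break;
	}
	return mask;
}
```

With no INIT or SENT entries the function returns 0, which matches the "task
never made a targeted syscall" case described above.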

> > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), 0);
> > +
> > +       EXPECT_EQ(kill(pid, SIGKILL), 0);
> > +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> > +
> > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), -1);
> 
> Please document SECCOMP_NOTIF_ID_VALID in seccomp_filter.rst. I had
> been wondering what it's for, and now I see it's kind of an advisory
> "is the other end still alive?" test.

Yes, in fact it's necessary for avoiding races. There are some comments in the
sample code, but I'll update seccomp_filter.rst too.
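
A rough sketch of the kind of race this guards against, assuming the tracer
opens /proc/<pid>/mem after receiving a notification: if the tracee dies in
between, the pid could be recycled, so the id is rechecked after the open. The
id_valid_fn callback and open_tracee_mem_checked() are hypothetical names; in
a real tracer the callback would wrap
ioctl(listener, SECCOMP_NOTIF_ID_VALID, &id).

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback: returns 0 while the notification id is still live. */
typedef int (*id_valid_fn)(int listener, uint64_t id);

/*
 * Open /proc/<pid>/mem, then confirm the notification is still valid.
 * Returns an open fd, or -1 if the open failed or the id went stale.
 */
static int open_tracee_mem_checked(int listener, uint64_t id, pid_t pid,
				   id_valid_fn id_valid)
{
	char path[64];
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* If the id is still valid now, the pid was not recycled before the open. */
	if (id_valid(listener, id) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Stub validators standing in for the ioctl, for illustration only. */
static int always_valid(int listener, uint64_t id)
{
	(void)listener;
	(void)id;
	return 0;
}

static int always_stale(int listener, uint64_t id)
{
	(void)listener;
	(void)id;
	return -1;
}
```

The ordering is the important part: open first, validate second, and only then
read through the fd.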

> > +
> > +       resp.id = req.id;
> > +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> > +       EXPECT_EQ(ret, -1);
> > +       EXPECT_EQ(errno, ENOENT);
> > +
> > +       /*
> > +        * Check that we get another notification about a signal in the middle
> > +        * of a syscall.
> > +        */
> > +       pid = fork();
> > +       ASSERT_GE(pid, 0);
> > +
> > +       if (pid == 0) {
> > +               if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> > +                       perror("signal");
> > +                       exit(1);
> > +               }
> > +               ret = syscall(__NR_getpid);
> > +               exit(ret != USER_NOTIF_MAGIC);
> > +       }
> > +
> > +       ret = read_notif(listener, &req);
> > +       EXPECT_EQ(ret, sizeof(req));
> > +       EXPECT_EQ(errno, 0);
> > +
> > +       EXPECT_EQ(kill(pid, SIGUSR1), 0);
> > +
> > +       ret = read_notif(listener, &req);
> > +       EXPECT_EQ(req.signaled, 1);
> > +       EXPECT_EQ(ret, sizeof(req));
> > +       EXPECT_EQ(errno, 0);
> > +
> > +       resp.len = sizeof(resp);
> > +       resp.id = req.id;
> > +       resp.error = -512; /* -ERESTARTSYS */
> > +       resp.val = 0;
> > +
> > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> > +
> > +       ret = read_notif(listener, &req);
> > +       resp.len = sizeof(resp);
> > +       resp.id = req.id;
> > +       resp.error = 0;
> > +       resp.val = USER_NOTIF_MAGIC;
> > +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> 
> I was slightly confused here: why have there been 3 reads? I was
> expecting one notification for hitting getpid and one from catching a
> signal. But in rereading, I see that NOTIF_RECV will return the most
> recently unresponded notification, yes?

The three reads are:

1. original syscall
# send SIGUSR1
2. another notif with signaled set
# respond with -ERESTARTSYS to make sure that works
3. this is the result of -ERESTARTSYS
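
The sequence above can be modeled as a small state machine. This is a userspace
sketch only: the struct, the ST_* names and the helper functions are
hypothetical simplifications of seccomp_knotif and the recv/send paths, and the
restarted syscall is modeled as a second, fresh notification.

```c
#include <assert.h>
#include <stdbool.h>

enum state { ST_INIT, ST_SENT, ST_REPLIED };

struct knotif {
	enum state state;
	bool signaled;
	int error;
};

#define ERESTARTSYS_VAL 512 /* same value the selftest hardcodes */

/* Tracer side: pick up a pending notification (INIT -> SENT). */
static bool tracer_recv(struct knotif *n)
{
	if (n->state != ST_INIT)
		return false;
	n->state = ST_SENT;
	return true;
}

/* Tracee side: a non-fatal signal arrives while waiting for the reply. */
static void tracee_signaled(struct knotif *n)
{
	n->signaled = true;
	if (n->state == ST_SENT)
		n->state = ST_INIT; /* make the tracer read it again */
}

/* Tracer side: reply exactly once (SENT -> REPLIED). */
static bool tracer_send(struct knotif *n, int error)
{
	if (n->state != ST_SENT)
		return false;
	n->state = ST_REPLIED;
	n->error = error;
	return true;
}

/* Walk the sequence from the selftest and count successful reads. */
static int count_reads(void)
{
	struct knotif first = { ST_INIT, false, 0 };
	struct knotif restarted = { ST_INIT, false, 0 };
	int reads = 0;

	reads += tracer_recv(&first);          /* 1. original syscall */
	tracee_signaled(&first);               /* SIGUSR1 */
	reads += tracer_recv(&first);          /* 2. same notification, signaled set */
	tracer_send(&first, -ERESTARTSYS_VAL); /* ask for a restart */
	reads += tracer_recv(&restarted);      /* 3. the restarted syscall */
	tracer_send(&restarted, 0);
	return reads;
}
```

Under these assumptions count_reads() comes out at three, one per numbered step
in the list.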

> But... catching a signal replaces the existing seccomp_knotif? I
> remain confused about how signal handling is meant to work here. What
> happens if two signals get sent? It looks like you just block without
> allowing more signals? (Thank you for writing the tests!)

Yes, that's the idea. This is an implementation of Andy's pseudocode:
https://lkml.org/lkml/2018/3/15/1122

> (And can you document the expected behavior in the seccomp_filter.rst too?)

Will do.

> 
> Looking good!

Thanks for your review!

Tycho


* Re: [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd
  2018-09-27 22:17       ` Jann Horn
@ 2018-09-27 22:49         ` Tycho Andersen
  0 siblings, 0 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 22:49 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, kernel list, containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, suda.akihiro, linux-fsdevel

On Fri, Sep 28, 2018 at 12:17:07AM +0200, Jann Horn wrote:
> On Fri, Sep 28, 2018 at 12:14 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, Sep 27, 2018 at 09:28:07PM +0200, Jann Horn wrote:
> > > On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > This patch adds a way to insert FDs into the tracee's process (also
> > > > close/overwrite fds for the tracee). This functionality is necessary to
> > > > mock things like socketpair() or dup2() or similar, but since it depends on
> > > > external (vfs) patches, I've left it as a separate patch as before so the
> > > > core functionality can still be merged while we argue about this. Except
> > > > this time it doesn't add any ugliness to the API :)
> > > [...]
> > > > +static long seccomp_notify_put_fd(struct seccomp_filter *filter,
> > > > +                                 unsigned long arg)
> > > > +{
> > > > +       struct seccomp_notif_put_fd req;
> > > > +       void __user *buf = (void __user *)arg;
> > > > +       struct seccomp_knotif *knotif = NULL;
> > > > +       long ret;
> > > > +
> > > > +       if (copy_from_user(&req, buf, sizeof(req)))
> > > > +               return -EFAULT;
> > > > +
> > > > +       if (req.fd < 0 && req.to_replace < 0)
> > > > +               return -EINVAL;
> > > > +
> > > > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > > > +       if (ret < 0)
> > > > +               return ret;
> > > > +
> > > > +       ret = -ENOENT;
> > > > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > > > +               struct file *file = NULL;
> > > > +
> > > > +               if (knotif->id != req.id)
> > > > +                       continue;
> > >
> > > Are you intentionally permitting non-SENT states here? It shouldn't
> > > make a big difference, but I think it'd be nice to at least block the
> > > use of notifications in SECCOMP_NOTIFY_REPLIED state.
> >
> > Agreed, I'll block everything besides REPLIED.
> 
> Do you mean SENT? In REPLIED state, seccomp_notify_put_fd()
> is racy because the target task is in the process of waking up, right?

Yes, sorry, I mean SENT.

Tycho


* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 21:51   ` Jann Horn
  2018-09-27 22:45     ` Kees Cook
@ 2018-09-27 23:04     ` Tycho Andersen
  2018-09-27 23:37       ` Jann Horn
  1 sibling, 1 reply; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 23:04 UTC (permalink / raw)
  To: Jann Horn
  Cc: hch, Al Viro, linux-fsdevel, Kees Cook, kernel list, containers,
	Linux API, Andy Lutomirski, Oleg Nesterov, Eric W. Biederman,
	Serge E. Hallyn, Christian Brauner, Tyler Hicks, suda.akihiro

On Thu, Sep 27, 2018 at 11:51:40PM +0200, Jann Horn wrote:
> +Christoph Hellwig, Al Viro, fsdevel: For two questions about the poll
> interface (search for "seccomp_notify_poll" and
> "seccomp_notify_release" in the patch)
> 
> @Tycho: FYI, I've gone through all of v7 now, apart from the
> test/sample code. So don't wait for more comments from me before
> sending out v8.

(assuming you meant v8 -> v9) yes thanks for your reviews! Much
appreciated.

> On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > This patch introduces a means for syscalls matched in seccomp to notify
> > some other task that a particular filter has been triggered.
> >
> > The motivation for this is primarily for use with containers. For example,
> > if a container does an init_module(), we obviously don't want to load this
> > untrusted code, which may be compiled for the wrong version of the kernel
> > anyway. Instead, we could parse the module image, figure out which module
> > the container is trying to load and load it on the host.
> >
> > As another example, containers cannot mknod(), since this checks
> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > coding some whitelist in the kernel. Another example is mount(), which has
> > many security restrictions for good reason, but configuration or runtime
> > knowledge could potentially be used to relax these restrictions.
> 
> Note that in that case, the trusted runtime needs to be in the same
> mount namespace as the container. mount() doesn't work on the mount
> structure of a foreign mount namespace; check_mnt() specifically
> checks for this case, and I think pretty much everything in
> sys_mount() uses that check. So you'd have to join the container's
> mount namespace before forwarding a mount syscall.

Yep, Serge came up with a pretty neat trick that we used in LXD to
accomplish sending mounts to containers, but it requires some
coordination up front.

> > This patch adds functionality that is already possible via at least two
> > other means that I know about, both of which involve ptrace(): first, one
> > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > Unfortunately this is slow, so a faster version would be to install a
> > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > Since ptrace allows only one tracer, if the container runtime is that
> > tracer, users inside the container (or outside) trying to debug it will not
> > be able to use ptrace, which is annoying. It also means that older
> > distributions based on Upstart cannot boot inside containers using ptrace,
> > since upstart itself uses ptrace to start services.
> >
> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex.
> >
> > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > memory data from the task still applies here,
> 
> Actually, it doesn't, right? It would apply if you told the kernel "go
> ahead, that syscall is fine", but that's not how the API works - you
> always intercept the syscall, copy argument data to a trusted tracer,
> and then the tracer can make a replacement syscall. Sounds fine to me.

Right, I guess the point here is just "you need to copy all the data
to the tracer *before* making a policy decision".
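
A minimal sketch of that ordering, assuming the tracer reads through
/proc/<pid>/mem as the sample does. snapshot_tracee_mem() and
path_is_allowed() are hypothetical helpers, and the allow-list is only an
illustration; what matters is that the policy looks at the private copy and
never at tracee memory again.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy up to len - 1 bytes from address addr in task pid into buf via
 * /proc/<pid>/mem and NUL-terminate the result.
 * Returns the number of bytes copied, or -1 on error.
 */
static ssize_t snapshot_tracee_mem(pid_t pid, uint64_t addr, char *buf,
				   size_t len)
{
	char path[64];
	ssize_t n;
	int fd;

	if (len == 0)
		return -1;
	snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = pread(fd, buf, len - 1, (off_t)addr);
	close(fd);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return n;
}

/* Hypothetical policy: decide on the snapshot only. */
static int path_is_allowed(const char *snapshot)
{
	return strcmp(snapshot, "/dev/null") == 0 ||
	       strcmp(snapshot, "/dev/zero") == 0;
}
```

If the tracee rewrites the buffer after the snapshot is taken, the decision is
unaffected, since path_is_allowed() never sees the tracee's copy.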

> > but can be avoided with
> > careful design of the userspace handler: if the userspace handler reads all
> > of the task memory that is necessary before applying its security policy,
> > the tracee's subsequent memory edits will not be read by the tracer.
> [...]
> > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> [...]
> > +which (on success) will return a listener fd for the filter, which can then be
> > +passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
> > +acquired via:
> > +
> > +.. code-block::
> > +
> > +    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> 
> The manpage documents ptrace() as taking four arguments, not three. I
> know that the header defines it with varargs, but it would probably be
> more useful to require passing in zero as the fourth argument so that
> we have a place to stick flags if necessary in the future.

Yep, I'll fix this, thanks. But also this documentation should really
live in the seccomp patch; some rebase got screwed up somewhere.

> > +which grabs the 0th filter for some task which the tracer has privilege over.
> > +Note that filter fds correspond to a particular filter, and not a particular
> > +task. So if this task then forks, notifications from both tasks will appear on
> > +the same filter fd. Reads and writes to/from a filter fd are also synchronized,
> > +so a filter fd can safely have many readers.
> 
> Add a note about needing CAP_SYS_ADMIN here? Also, might be useful to
> clarify in which direction "nth filter" counts.

Will do.

> > +The interface for a seccomp notification fd consists of two structures:
> > +
> > +.. code-block::
> > +
> > +    struct seccomp_notif {
> > +        __u16 len;
> > +        __u64 id;
> > +        pid_t pid;
> > +        __u8 signaled;
> > +        struct seccomp_data data;
> > +    };
> > +
> > +    struct seccomp_notif_resp {
> > +        __u16 len;
> > +        __u64 id;
> > +        __s32 error;
> > +        __s64 val;
> > +    };
> > +
> > +Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
> > +notification fd to receive a ``struct seccomp_notif``, which contains five
> > +members: the input length of the structure, a unique-per-filter ``id``, the
> > +``pid`` of the task which triggered this request (which may be 0 if the task is
> > +in a pid ns not visible from the listener's pid namespace), a flag representing
> > +whether or not the notification is a result of a non-fatal signal, and the
> > +``data`` passed to seccomp. Userspace can then make a decision based on this
> > +information about what to do, and ``ioctl(SECCOMP_NOTIF_SEND)`` a response,
> > +indicating what should be returned to userspace. The ``id`` member of ``struct
> > +seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.
> > +
> > +It is worth noting that ``struct seccomp_data`` contains the values of register
> > +arguments to the syscall, but does not contain pointers to memory. The task's
> > +memory is accessible to suitably privileged tracers via ``ptrace()`` or
> > +``/proc/pid/map_files/``.
> 
> You probably don't actually want to use /proc/pid/map_files here; you
> can't use that to access anonymous memory, and it needs CAP_SYS_ADMIN.
> And while reading memory via ptrace() is possible, the interface is
> really ugly (e.g. you can only read data in 4-byte chunks), and your
> caveat about locking out other ptracers (or getting locked out by
> them) applies. I'm not even sure if you could read memory via ptrace
> while a process is stopped in the seccomp logic? PTRACE_PEEKDATA
> requires the target to be in a __TASK_TRACED state.
> The two interfaces you might want to use instead are /proc/$pid/mem
> and process_vm_{readv,writev}, which allow you to do nice,
> arbitrarily-sized, vectored IO on the memory of another process.

Yes, in fact the sample code does use /proc/$pid/mem, but the docs
should be correct :)
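
For the process_vm_readv() route Jann mentions, a minimal sketch might look
like the following. read_tracee() is a hypothetical wrapper name; the call
needs ptrace-style access to the target, which this sketch simply assumes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Copy len bytes from remote_addr in task pid into buf with a single
 * process_vm_readv() call. Returns the number of bytes copied, or -1
 * with errno set.
 */
static ssize_t read_tracee(pid_t pid, uint64_t remote_addr, void *buf,
			   size_t len)
{
	struct iovec local = { .iov_base = buf, .iov_len = len };
	struct iovec remote = {
		.iov_base = (void *)(uintptr_t)remote_addr,
		.iov_len = len,
	};

	return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```

A tracer would pass the pid from the notification and a pointer argument taken
from struct seccomp_data as remote_addr. A short return value means only part
of the range was readable.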

> > +       /*
> > +        * Here it's possible we got a signal and then had to wait on the mutex
> > +        * while the reply was sent, so let's be sure there wasn't a response
> > +        * in the meantime.
> > +        */
> > +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > +               /*
> > +                * We got a signal. Let's tell userspace about it (potentially
> > +                * again, if we had already notified them about the first one).
> > +                */
> > +               n.signaled = true;
> > +               if (n.state == SECCOMP_NOTIFY_SENT) {
> > +                       n.state = SECCOMP_NOTIFY_INIT;
> > +                       up(&match->notif->request);
> > +               }
> 
> Do you need another wake_up_poll() here?

Yes! Good point.

> > +               mutex_unlock(&match->notify_lock);
> > +               err = wait_for_completion_killable(&n.ready);
> > +               mutex_lock(&match->notify_lock);
> > +               if (err < 0)
> > +                       goto remove_list;
> 
> Add a comment here explaining that we intentionally leave the
> semaphore count too high (because otherwise we'd have to block), and
> seccomp_notify_recv() compensates for that?

Will do.

> > +       }
> > +
> > +       ret = n.val;
> > +       err = n.error;
> > +
> > +remove_list:
> > +       list_del(&n.list);
> > +out:
> > +       mutex_unlock(&match->notify_lock);
> > +       syscall_set_return_value(current, task_pt_regs(current),
> > +                                err, ret);
> > +}
> > +
> >  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> >                             const bool recheck_after_trace)
> >  {
> [...]
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +static struct file *init_listener(struct task_struct *,
> > +                                 struct seccomp_filter *);
> > +
> >  /**
> >   * seccomp_set_mode_filter: internal function for setting seccomp filter
> >   * @flags:  flags to change filter behavior
> > @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> >         struct seccomp_filter *prepared = NULL;
> >         long ret = -EINVAL;
> > +       int listener = 0;
> > +       struct file *listener_f = NULL;
> >
> >         /* Validate flags. */
> >         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         if (IS_ERR(prepared))
> >                 return PTR_ERR(prepared);
> >
> > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +               listener = get_unused_fd_flags(0);
> > +               if (listener < 0) {
> > +                       ret = listener;
> > +                       goto out_free;
> > +               }
> > +
> > +               listener_f = init_listener(current, prepared);
> > +               if (IS_ERR(listener_f)) {
> > +                       put_unused_fd(listener);
> > +                       ret = PTR_ERR(listener_f);
> > +                       goto out_free;
> > +               }
> > +       }
> > +
> >         /*
> >          * Make sure we cannot change seccomp or nnp state via TSYNC
> >          * while another thread is in the middle of calling exec.
> >          */
> >         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> >             mutex_lock_killable(&current->signal->cred_guard_mutex))
> > -               goto out_free;
> > +               goto out_put_fd;
> >
> >         spin_lock_irq(&current->sighand->siglock);
> >
> > @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         spin_unlock_irq(&current->sighand->siglock);
> >         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> >                 mutex_unlock(&current->signal->cred_guard_mutex);
> > +out_put_fd:
> > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +               if (ret < 0) {
> > +                       fput(listener_f);
> > +                       put_unused_fd(listener);
> > +               } else {
> > +                       fd_install(listener, listener_f);
> > +                       ret = listener;
> > +               }
> > +       }
> >  out_free:
> >         seccomp_filter_free(prepared);
> >         return ret;
> [...]
> > +
> > +#ifdef CONFIG_SECCOMP_FILTER
> > +static int seccomp_notify_release(struct inode *inode, struct file *file)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       struct seccomp_knotif *knotif;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +
> > +       /*
> > +        * If this file is being closed because e.g. the task who owned it
> > +        * died, let's wake everyone up who was waiting on us.
> > +        */
> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> > +                       continue;
> > +
> > +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> > +               knotif->error = -ENOSYS;
> > +               knotif->val = 0;
> > +
> > +               complete(&knotif->ready);
> > +       }
> > +
> > +       wake_up_all(&filter->notif->wqh);
> 
> If select() is polling us, a reference to the open file is being held,
> and this can't be reached; and I think if epoll is polling us,
> eventpoll_release() will remove itself from the wait queue, right? So
> can this wake_up_all() actually ever notify anyone?

I don't know actually, I just thought better safe than sorry. I can
drop it, though.

> > +       kfree(filter->notif);
> > +       filter->notif = NULL;
> > +       mutex_unlock(&filter->notify_lock);
> > +       __put_seccomp_filter(filter);
> > +       return 0;
> > +}
> > +
> > +static long seccomp_notify_recv(struct seccomp_filter *filter,
> > +                               unsigned long arg)
> > +{
> > +       struct seccomp_knotif *knotif = NULL, *cur;
> > +       struct seccomp_notif unotif = {};
> > +       ssize_t ret;
> > +       u16 size;
> > +       void __user *buf = (void __user *)arg;
> > +
> > +       if (copy_from_user(&size, buf, sizeof(size)))
> > +               return -EFAULT;
> > +
> > +       ret = down_interruptible(&filter->notif->request);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> > +                       knotif = cur;
> > +                       break;
> > +               }
> > +       }
> > +
> > +       /*
> > +        * If we didn't find a notification, it could be that the task was
> > +        * interrupted between the time we were woken and when we were able to
> 
> s/interrupted/interrupted by a fatal signal/ ?
> 
> > +        * acquire the rw lock.
> 
> State more explicitly here that we are compensating for an incorrectly
> high semaphore count?

Will do, thanks.

> > +        */
> > +       if (!knotif) {
> > +               ret = -ENOENT;
> > +               goto out;
> > +       }
> > +
> > +       size = min_t(size_t, size, sizeof(unotif));
> > +
> > +       unotif.len = size;
> > +       unotif.id = knotif->id;
> > +       unotif.pid = task_pid_vnr(knotif->task);
> > +       unotif.signaled = knotif->signaled;
> > +       unotif.data = *(knotif->data);
> > +
> > +       if (copy_to_user(buf, &unotif, size)) {
> > +               ret = -EFAULT;
> > +               goto out;
> > +       }
> > +
> > +       ret = size;
> > +       knotif->state = SECCOMP_NOTIFY_SENT;
> > +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> > +
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static long seccomp_notify_send(struct seccomp_filter *filter,
> > +                               unsigned long arg)
> > +{
> > +       struct seccomp_notif_resp resp = {};
> > +       struct seccomp_knotif *knotif = NULL;
> > +       long ret;
> > +       u16 size;
> > +       void __user *buf = (void __user *)arg;
> > +
> > +       if (copy_from_user(&size, buf, sizeof(size)))
> > +               return -EFAULT;
> > +       size = min_t(size_t, size, sizeof(resp));
> > +       if (copy_from_user(&resp, buf, size))
> > +               return -EFAULT;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->id == resp.id)
> > +                       break;
> > +       }
> > +
> > +       if (!knotif || knotif->id != resp.id) {
> 
> Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
> NULL here. If `filter->notif->notifications` is empty, I think
> `knotif` will be `container_of(&filter->notif->notifications, struct
> seccomp_knotif, list)` - in other words, you'll have a type confusion,
> and `knotif` probably points into some random memory in front of
> `filter->notif`.
> 
> Am I missing something?

No, I just flubbed the list API.
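
To spell out the fix, here is a userspace sketch with a hand-rolled circular
list standing in for the kernel's list.h (list_init(), list_add_tail() and
find_knotif() here are hypothetical reimplementations, not the kernel
helpers). The loop cursor is never NULL after the walk, so the match is stored
in a separate pointer that starts out NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct knotif {
	uint64_t id;
	struct list_head list;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Return the notification with a matching id, or NULL if there is none. */
static struct knotif *find_knotif(struct list_head *head, uint64_t id)
{
	struct knotif *found = NULL;
	struct list_head *pos;

	for (pos = head->next; pos != head; pos = pos->next) {
		struct knotif *cur = container_of(pos, struct knotif, list);

		if (cur->id == id) {
			found = cur;
			break;
		}
	}
	/* Only 'found' is safe to test; the cursor may point at the head. */
	return found;
}
```

On an empty list the loop body never runs and the function returns NULL,
which is the case the original check got wrong.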

> > +               ret = -ENOENT;
> > +               goto out;
> > +       }
> > +
> > +       /* Allow exactly one reply. */
> > +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> > +               ret = -EINPROGRESS;
> > +               goto out;
> > +       }
> 
> This means that if seccomp_do_user_notification() has in the meantime
> received a signal and transitioned from SENT back to INIT, this will
> fail, right? So we fail here, then we read the new notification, and
> then we can retry SECCOMP_NOTIF_SEND? Is that intended?

I think so, the idea being that you might want to do something
different if a signal was sent. But Andy seemed to think that we might
not actually do anything different.

Either way, for the case you describe, EINPROGRESS is a little weird.
Perhaps it should be:

if (knotif->state == SECCOMP_NOTIFY_INIT) {
        ret = -EBUSY; /* or something? */
        goto out;
} else if (knotif->state == SECCOMP_NOTIFY_REPLIED) {
        ret = -EINPROGRESS;
        goto out;
}

?

> > +       ret = size;
> > +       knotif->state = SECCOMP_NOTIFY_REPLIED;
> > +       knotif->error = resp.error;
> > +       knotif->val = resp.val;
> > +       complete(&knotif->ready);
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> > +                                   unsigned long arg)
> > +{
> > +       struct seccomp_knotif *knotif = NULL;
> > +       void __user *buf = (void __user *)arg;
> > +       u64 id;
> > +       long ret;
> > +
> > +       if (copy_from_user(&id, buf, sizeof(id)))
> > +               return -EFAULT;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       ret = -1;
> 
> In strace, this is going to show up as EPERM. Maybe use something like
> -ENOENT instead? Or whatever you think resembles a fitting error
> number.

Yep, will do.
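
For anyone wondering why a bare -1 reads as EPERM: raw syscall return values
in the range [-4095, -1] are interpreted as -errno, and EPERM is 1. A tiny
sketch, with raw_return_to_errno() as a hypothetical helper name:

```c
#include <assert.h>
#include <errno.h>

/* Map a raw kernel return value to the errno userspace would see, or 0. */
static int raw_return_to_errno(long ret)
{
	if (ret < 0 && ret >= -4095)
		return (int)-ret;
	return 0;
}
```

So returning -ENOENT from the ioctl would surface as ENOENT rather than an
accidental EPERM.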

> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->id == id) {
> > +                       ret = 0;
> 
> Would it make sense to treat notifications that have already been
> replied to as invalid?

I suppose so, since we aren't going to let you reply to them anyway.

> > +                       goto out;
> > +               }
> > +       }
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> [...]
> > +static __poll_t seccomp_notify_poll(struct file *file,
> > +                                   struct poll_table_struct *poll_tab)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       __poll_t ret = 0;
> > +       struct seccomp_knotif *cur;
> > +
> > +       poll_wait(file, &filter->notif->wqh, poll_tab);
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> 
> Looking at the callers of vfs_poll(), as far as I can tell, a poll
> handler is not allowed to return error codes. Perhaps someone who
> knows the poll interface better can weigh in here. I've CCed some
> people who should hopefully know better how this stuff works.

Thanks.
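
To illustrate the concern, here is a userspace sketch of what happens if a
negative errno is read back as an event mask. It assumes the lock attempt
fails with -EINTR and that the caller treats the handler's return value purely
as a bitmask; errno_as_poll_mask() is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Reinterpret a negative errno the way a mask-only caller would see it. */
static uint32_t errno_as_poll_mask(int err)
{
	return (uint32_t)err;
}
```

With -EINTR the result has nearly every bit set, including EPOLLOUT, EPOLLERR
and EPOLLHUP, so the caller would see spurious events instead of an error. One
option might be to return something like EPOLLERR on lock failure, but whether
that is right is a question for the poll folks.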

> > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT)
> > +                       ret |= EPOLLIN | EPOLLRDNORM;
> > +               if (cur->state == SECCOMP_NOTIFY_SENT)
> > +                       ret |= EPOLLOUT | EPOLLWRNORM;
> > +               if (ret & EPOLLIN && ret & EPOLLOUT)
> > +                       break;
> > +       }
> > +
> > +       mutex_unlock(&filter->notify_lock);
> > +
> > +       return ret;
> > +}
> > +
> > +static const struct file_operations seccomp_notify_ops = {
> > +       .poll = seccomp_notify_poll,
> > +       .release = seccomp_notify_release,
> > +       .unlocked_ioctl = seccomp_notify_ioctl,
> > +};
> > +
> > +static struct file *init_listener(struct task_struct *task,
> > +                                 struct seccomp_filter *filter)
> > +{
> 
> Why does this function take a `task` pointer instead of always
> accessing `current`? If `task` actually wasn't `current`, I would have
> concurrency concerns. A comment in seccomp.h even explains:
> 
>  *          @filter must only be accessed from the context of current as there
>  *          is no read locking.
> 
> Unless there's a good reason for it, I would prefer it if this
> function didn't take a `task` pointer.

I think Kees replied already, but yes, this is a good point :(. We can
continue in his thread.

> > +       struct file *ret = ERR_PTR(-EBUSY);
> > +       struct seccomp_filter *cur, *last_locked = NULL;
> > +       int filter_nesting = 0;
> > +
> > +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> > +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> > +               filter_nesting++;
> > +               last_locked = cur;
> > +               if (cur->notif)
> > +                       goto out;
> > +       }
> > +
> > +       ret = ERR_PTR(-ENOMEM);
> > +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
> 
> sizeof(struct notification) instead, to make the code clearer?
> 
> > +       if (!filter->notif)
> > +               goto out;
> > +
> > +       sema_init(&filter->notif->request, 0);
> > +       INIT_LIST_HEAD(&filter->notif->notifications);
> > +       filter->notif->next_id = get_random_u64();
> > +       init_waitqueue_head(&filter->notif->wqh);
> 
> Nit: next_id and notifications are declared in reverse order in the
> struct. Could you flip them around here?

Sure, will do.

> > +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> > +                                filter, O_RDWR);
> > +       if (IS_ERR(ret))
> > +               goto out;
> > +
> > +
> > +       /* The file has a reference to it now */
> > +       __get_seccomp_filter(filter);
> 
> __get_seccomp_filter() has a comment in it that claims "/* Reference
> count is bounded by the number of total processes. */". I think this
> change invalidates that comment. I think it should be fine to just
> remove the comment.

Will do, thanks.

> > +out:
> > +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> 
> s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
> here, something went very wrong.

I suppose so, given that we have last_locked.

Tycho


* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 22:45     ` Kees Cook
@ 2018-09-27 23:08       ` Tycho Andersen
  0 siblings, 0 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-27 23:08 UTC (permalink / raw)
  To: Kees Cook
  Cc: Jann Horn, Christoph Hellwig, Al Viro, linux-fsdevel,
	kernel list, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda

On Thu, Sep 27, 2018 at 03:45:11PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 2:51 PM, Jann Horn <jannh@google.com> wrote:
> > On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> >> However, care should be taken to avoid the TOCTOU
> >> +mentioned above in this document: all arguments being read from the tracee's
> >> +memory should be read into the tracer's memory before any policy decisions are
> >> +made. This allows for an atomic decision on syscall arguments.
> >
> > Again, I don't really see how you could get this wrong.
> 
> Doesn't hurt to mention it, IMO.
> 
> >> +static long seccomp_notify_send(struct seccomp_filter *filter,
> >> +                               unsigned long arg)
> >> +{
> >> +       struct seccomp_notif_resp resp = {};
> >> +       struct seccomp_knotif *knotif = NULL;
> >> +       long ret;
> >> +       u16 size;
> >> +       void __user *buf = (void __user *)arg;
> >> +
> >> +       if (copy_from_user(&size, buf, sizeof(size)))
> >> +               return -EFAULT;
> >> +       size = min_t(size_t, size, sizeof(resp));
> >> +       if (copy_from_user(&resp, buf, size))
> >> +               return -EFAULT;
> >> +
> >> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> >> +       if (ret < 0)
> >> +               return ret;
> >> +
> >> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> >> +               if (knotif->id == resp.id)
> >> +                       break;
> >> +       }
> >> +
> >> +       if (!knotif || knotif->id != resp.id) {
> >
> > Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
> > NULL here. If `filter->notif->notifications` is empty, I think
> > `knotif` will be `container_of(&filter->notif->notifications, struct
> > seccomp_knotif, list)` - in other words, you'll have a type confusion,
> > and `knotif` probably points into some random memory in front of
> > `filter->notif`.
> >
> > Am I missing something?
> 
> Oh, good catch. This just needs to be fixed like it's done in
> seccomp_notif_recv (separate cur and knotif).
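
For reference, a minimal userspace sketch of the "separate cur and knotif"
pattern mentioned above. The list helpers and struct layout here are
simplified stand-ins written for illustration, not the kernel's list.h or
the real struct seccomp_knotif. The point is only that the result pointer
stays NULL unless a real entry matched, so an empty list can never yield a
pointer computed from the list head.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel's intrusive list (illustrative only). */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct knotif {
	unsigned long long id;
	struct list_head list;
};

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/*
 * Return the entry whose id matches, or NULL if there is none. The cursor
 * is only converted to a struct knotif while it points at a real entry;
 * the result lives in a separate variable that starts out NULL.
 */
static struct knotif *find_knotif(struct list_head *head, unsigned long long id)
{
	struct knotif *found = NULL;
	struct list_head *pos;

	for (pos = head->next; pos != head; pos = pos->next) {
		struct knotif *cur = container_of(pos, struct knotif, list);

		if (cur->id == id) {
			found = cur;
			break;
		}
	}
	return found;
}
```

With this shape the caller only needs an `if (!found)` check; the extra
`knotif->id != resp.id` comparison becomes unnecessary.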
> 
> >> +static struct file *init_listener(struct task_struct *task,
> >> +                                 struct seccomp_filter *filter)
> >> +{
> >
> > Why does this function take a `task` pointer instead of always
> > accessing `current`? If `task` actually wasn't `current`, I would have
> > concurrency concerns. A comment in seccomp.h even explains:
> >
> >  *          @filter must only be accessed from the context of current as there
> >  *          is no read locking.
> >
> > Unless there's a good reason for it, I would prefer it if this
> > function didn't take a `task` pointer.
> 
> This is to support PTRACE_SECCOMP_NEW_LISTENER.
> 
> But you make an excellent point. Even TSYNC expects to operate only on
> the current thread group. Hmm.
> 
> While the process is stopped by ptrace, we could, in theory, update
> task->seccomp.filter via something like TSYNC.
> 
> So perhaps use:
> 
> mutex_lock_killable(&task->signal->cred_guard_mutex);
> 
> before walking the notify_locks?

This means that all the seccomp/ptrace code probably needs to be
updated for this? I'll try to send patches for this as well as the
return code thing Jann pointed out.

> >
> >> +       struct file *ret = ERR_PTR(-EBUSY);
> >> +       struct seccomp_filter *cur, *last_locked = NULL;
> >> +       int filter_nesting = 0;
> >> +
> >> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> >> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> >> +               filter_nesting++;
> >> +               last_locked = cur;
> >> +               if (cur->notif)
> >> +                       goto out;
> >> +       }
> >> +
> >> +       ret = ERR_PTR(-ENOMEM);
> >> +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
> >
> > sizeof(struct notification) instead, to make the code clearer?
> 
> I prefer what Tycho has: I want to allocate an instance of whatever
> filter->notif is.
> 
> Though, let's do the kzalloc outside of the locking, instead?

Yep, sounds good.

> >> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> >> +                                filter, O_RDWR);
> >> +       if (IS_ERR(ret))
> >> +               goto out;
> >> +
> >> +
> >> +       /* The file has a reference to it now */
> >> +       __get_seccomp_filter(filter);
> >
> > __get_seccomp_filter() has a comment in it that claims "/* Reference
> > count is bounded by the number of total processes. */". I think this
> > change invalidates that comment. I think it should be fine to just
> > remove the comment.
> 
> Update it to "bounded by total processes and notification listeners"?

Will do.

> >> +out:
> >> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> >
> > s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
> > here, something went very wrong.
> 
> Hm? This is correct. This is how seccomp_run_filters() walks the list too:
> 
>         struct seccomp_filter *f =
>                         READ_ONCE(current->seccomp.filter);
>         ...
>         for (; f; f = f->prev) {
> 
> Especially if we'll be holding the cred_guard_mutex.

There is a last_locked local here though, I think that's what Jann is
pointing out.
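
To make that concrete, here is a rough userspace sketch of the lock/unlock
walk, assuming the unlock loop stops at last_locked. The struct and the
integer "locked" flag are stand-ins invented for illustration; the real
code uses struct seccomp_filter and notify_lock.

```c
#include <stddef.h>

/* Illustrative stand-in for a filter in the chain. */
struct filter {
	int locked;		/* stand-in for holding notify_lock */
	int has_listener;	/* stand-in for cur->notif != NULL */
	struct filter *prev;
};

/*
 * Lock from the newest filter towards the root, stopping early at the
 * first filter that already has a listener. Returns the last filter that
 * was locked, or NULL for an empty chain.
 */
static struct filter *lock_chain(struct filter *head)
{
	struct filter *cur, *last_locked = NULL;

	for (cur = head; cur; cur = cur->prev) {
		cur->locked = 1;
		last_locked = cur;
		if (cur->has_listener)
			break;
	}
	return last_locked;
}

/*
 * Unlock exactly the filters that lock_chain() locked. The walk ends at
 * last_locked, so a NULL check on cur is not what terminates the loop.
 */
static void unlock_chain(struct filter *head, struct filter *last_locked)
{
	struct filter *cur;

	if (!last_locked)
		return;
	for (cur = head; ; cur = cur->prev) {
		cur->locked = 0;
		if (cur == last_locked)
			break;
	}
}
```

Since last_locked is always reachable from the head, the walk cannot run
off the end of the chain, which is why the `cur` test in the loop condition
should never fire.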

Cheers,

Tycho


* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 22:48     ` Tycho Andersen
@ 2018-09-27 23:10       ` Kees Cook
  2018-09-28 14:39         ` Tycho Andersen
  2018-10-08 14:58       ` Christian Brauner
  1 sibling, 1 reply; 91+ messages in thread
From: Kees Cook @ 2018-09-27 23:10 UTC (permalink / raw)
  To: Tycho Andersen, Stephane Graber
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Thu, Sep 27, 2018 at 3:48 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
>> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> struct seccomp_notif {
>>         __u16                      len;                  /*     0     2 */
>>
>>         /* XXX 6 bytes hole, try to pack */
>>
>>         __u64                      id;                   /*     8     8 */
>>         __u32                      pid;                  /*    16     4 */
>>         __u8                       signaled;             /*    20     1 */
>>
>>         /* XXX 3 bytes hole, try to pack */
>>
>>         struct seccomp_data        data;                 /*    24    64 */
>>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
>>
>>         /* size: 88, cachelines: 2, members: 5 */
>>         /* sum members: 79, holes: 2, sum holes: 9 */
>>         /* last cacheline: 24 bytes */
>> };
>> struct seccomp_notif_resp {
>>         __u16                      len;                  /*     0     2 */
>>
>>         /* XXX 6 bytes hole, try to pack */
>>
>>         __u64                      id;                   /*     8     8 */
>>         __s32                      error;                /*    16     4 */
>>
>>         /* XXX 4 bytes hole, try to pack */
>>
>>         __s64                      val;                  /*    24     8 */
>>
>>         /* size: 32, cachelines: 1, members: 4 */
>>         /* sum members: 22, holes: 2, sum holes: 10 */
>>         /* last cacheline: 32 bytes */
>> };
>>
>> How about making len u32, and moving pid and error above "id"? This
>> leaves a hole after signaled, so changing "len" won't be sufficient
>> for versioning here. Perhaps move it after data?
>
> I'm not sure what you mean by "len won't be sufficient for versioning
> here"? Anyway, I can do some packing on these; I didn't bother before
> since I figured it's a userspace interface, so saving a few bytes
> isn't a huge deal.

I was thinking the "len" portion was for determining if the API ever
changes in the future. My point was that given the padding holes, e.g.
adding a u8 after signaled, "len" wouldn't change, so the kernel might
expect to start reading something after signaled that it wasn't
checking before, but the len would be the same.
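
A small sketch of what I mean, using stand-in structs with the same member
order as the pahole output above. The "new_field" member and the v1/v2
names are hypothetical, and the 64-byte seccomp_data is approximated by an
array of eight u64s so the example compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the current layout (data approximates struct seccomp_data). */
struct notif_v1 {
	uint16_t len;
	uint64_t id;
	uint32_t pid;
	uint8_t  signaled;
	uint64_t data[8];
};

/* Hypothetical future layout: one u8 added into the hole after signaled. */
struct notif_v2 {
	uint16_t len;
	uint64_t id;
	uint32_t pid;
	uint8_t  signaled;
	uint8_t  new_field;	/* hypothetical addition */
	uint64_t data[8];
};

/* Returns nonzero only if a size-based "len" could tell the two apart. */
static int len_distinguishes_versions(void)
{
	return sizeof(struct notif_v1) != sizeof(struct notif_v2);
}
```

On the usual ABIs the new member lands in existing padding, so both structs
have the same size and a size-based len cannot distinguish them.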

>> I have to say, I'm vaguely nervous about changing the semantics here
>> for passing back the fd as the return code from the seccomp() syscall.
>> Alternatives seem less appealing, though: changing the meaning of the
>> uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
>> example. Hmm.
>
> From my perspective we can drop this whole thing. The only thing I'll
> ever use is the ptrace version. Someone at some point (I don't
> remember who, maybe stgraber) suggested this version would be useful
> as well.

Well that would certainly change the exposure of the interface pretty
drastically. :)

So, let's talk more about this, as it raises another thought I had
too: for the PTRACE interface to work, you have to know specifically
which filter you want to get notifications for. Won't that be slightly
tricky?

> Anyway, let me know if your nervousness outweighs this, I'm happy to
> drop it.

I'm not opposed to keeping it, but if you don't think anyone will use
it ... we should probably drop it just to avoid the complexity. It's a
cool API, though, so I'd like to hear from others first before you go
tearing it out. ;) (stgraber added to CC)

>> It is possible (though unlikely given the type widths involved here)
>> for unotif = {} to not initialize padding, so I would recommend an
>> explicit memset(&unotif, 0, sizeof(unotif)) here.
>
> Orly? I didn't know that, thanks.

Yeah, it's a pretty annoying C-ism. The spec says that struct
_members_ will get zero-initialized, but it doesn't say anything about
padding. >_< In most cases, the padding gets initialized too, just
because of bit widths being small enough that they're caught in the
member initialization that the compiler does. But for REALLY big
holes, they may get missed. In this case, while the padding is small,
it's directly exposed to userspace, so I want to make it robust.
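
Roughly, the pattern I'm asking for looks like the sketch below. The struct
is a trimmed stand-in for seccomp_notif and notif_fill() is a hypothetical
helper name; the only point is that the memset() covers the padding bytes
before the members are assigned.

```c
#include <stdint.h>
#include <string.h>

/* Trimmed stand-in for the notification struct; has holes after len
 * and after signaled on common ABIs. */
struct notif {
	uint16_t len;
	uint64_t id;
	uint32_t pid;
	uint8_t  signaled;
};

/*
 * Zero the whole object, padding included, and only then assign the
 * members. An "= {}" initializer is only guaranteed to zero the members.
 */
static void notif_fill(struct notif *n, uint64_t id, uint32_t pid)
{
	memset(n, 0, sizeof(*n));
	n->len = sizeof(*n);
	n->id = id;
	n->pid = pid;
}
```

Anything that later copies the whole struct out to userspace then cannot
leak stale stack bytes through the holes.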

>> > +       if (copy_from_user(&size, buf, sizeof(size)))
>> > +               return -EFAULT;
>> > +       size = min_t(size_t, size, sizeof(resp));
>> > +       if (copy_from_user(&resp, buf, size))
>> > +               return -EFAULT;
>>
>> For sanity checking on a double-read from userspace, please add:
>>
>>     if (resp.len != size)
>>         return -EINVAL;
>
> Won't that fail if sizeof(resp) < resp.len, because of the min_t()?

Ah, true. In that case, probably do resp.len = size to avoid any logic
failures due to the double-read? I just want to avoid any chance of
confusing the size and actually using it somewhere.
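
Something along these lines, as a userspace sketch. copy_in() is a
hypothetical stand-in for copy_from_user() and the struct mirrors
seccomp_notif_resp only loosely; the relevant part is that the clamped
size from the first fetch overwrites whatever len the second fetch saw.

```c
#include <stdint.h>
#include <string.h>

/* Loose stand-in for struct seccomp_notif_resp. */
struct resp {
	uint16_t len;
	uint64_t id;
	int32_t  error;
	int64_t  val;
};

/* Hypothetical stand-in for copy_from_user(); returns 0 on success. */
static int copy_in(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

static int read_resp(const void *ubuf, struct resp *out)
{
	uint16_t size;

	memset(out, 0, sizeof(*out));
	if (copy_in(&size, ubuf, sizeof(size)))
		return -1;
	if (size > sizeof(*out))
		size = sizeof(*out);
	if (copy_in(out, ubuf, size))
		return -1;
	/* The second fetch may have seen a different len; trust only size. */
	out->len = size;
	return 0;
}
```

After this, out->len always matches the number of bytes actually copied, no
matter what userspace did to the buffer between the two reads.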

-Kees

-- 
Kees Cook
Pixel Security


* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 23:04     ` Tycho Andersen
@ 2018-09-27 23:37       ` Jann Horn
  0 siblings, 0 replies; 91+ messages in thread
From: Jann Horn @ 2018-09-27 23:37 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: hch, Al Viro, linux-fsdevel, Kees Cook, kernel list, containers,
	Linux API, Andy Lutomirski, Oleg Nesterov, Eric W. Biederman,
	Serge E. Hallyn, Christian Brauner, Tyler Hicks, suda.akihiro

On Fri, Sep 28, 2018 at 1:04 AM Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 27, 2018 at 11:51:40PM +0200, Jann Horn wrote:
> > > +It is worth noting that ``struct seccomp_data`` contains the values of register
> > > +arguments to the syscall, but does not contain pointers to memory. The task's
> > > +memory is accessible to suitably privileged tracers via ``ptrace()`` or
> > > +``/proc/pid/map_files/``.
> >
> > You probably don't actually want to use /proc/pid/map_files here; you
> > can't use that to access anonymous memory, and it needs CAP_SYS_ADMIN.
> > And while reading memory via ptrace() is possible, the interface is
> > really ugly (e.g. you can only read data in 4-byte chunks), and your
> > caveat about locking out other ptracers (or getting locked out by
> > them) applies. I'm not even sure if you could read memory via ptrace
> > while a process is stopped in the seccomp logic? PTRACE_PEEKDATA
> > requires the target to be in a __TASK_TRACED state.
> > The two interfaces you might want to use instead are /proc/$pid/mem
> > and process_vm_{readv,writev}, which allow you to do nice,
> > arbitrarily-sized, vectored IO on the memory of another process.
>
> Yes, in fact the sample code does use /proc/$pid/mem, but the docs
> should be correct :)

Please also mention the process_vm_readv/writev syscalls though, given
that fast access to remote processes is what they were made for.
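
As a rough illustration of the /proc/$pid/mem route, something like the
sketch below reads an arbitrarily sized chunk in one call. The helper name
read_task_mem() is made up for this example, and it assumes the caller has
ptrace-level access to the target. process_vm_readv() takes the same
(pid, remote address, length) information as iovecs and avoids the open().

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read len bytes at remote_addr in task pid into buf via /proc/<pid>/mem.
 * Returns the number of bytes read, or -1 on error. Hypothetical helper,
 * for illustration only.
 */
static ssize_t read_task_mem(pid_t pid, const void *remote_addr,
			     void *buf, size_t len)
{
	char path[64];
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/proc/%ld/mem", (long)pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	/* The file offset is the virtual address in the target task. */
	n = pread(fd, buf, len, (off_t)(uintptr_t)remote_addr);
	close(fd);
	return n;
}
```

A handler would call this with the pid from the notification and a pointer
value taken from the syscall arguments, copying everything it needs before
making any policy decision.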

> > > +#ifdef CONFIG_SECCOMP_FILTER
> > > +static int seccomp_notify_release(struct inode *inode, struct file *file)
[...]
> > > +       wake_up_all(&filter->notif->wqh);
> >
> > If select() is polling us, a reference to the open file is being held,
> > and this can't be reached; and I think if epoll is polling us,
> > eventpoll_release() will remove itself from the wait queue, right? So
> > can this wake_up_all() actually ever notify anyone?
>
> I don't know actually, I just thought better safe than sorry. I can
> drop it, though.

Let's see if any fs people have some insight...

> > > +               ret = -ENOENT;
> > > +               goto out;
> > > +       }
> > > +
> > > +       /* Allow exactly one reply. */
> > > +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> > > +               ret = -EINPROGRESS;
> > > +               goto out;
> > > +       }
> >
> > This means that if seccomp_do_user_notification() has in the meantime
> > received a signal and transitioned from SENT back to INIT, this will
> > fail, right? So we fail here, then we read the new notification, and
> > then we can retry SECCOMP_NOTIF_SEND? Is that intended?
>
> I think so, the idea being that you might want to do something
> different if a signal was sent. But Andy seemed to think that we might
> not actually do anything different.

If you already have the proper response ready, you'd probably want to
just go through with it, no? Otherwise you'll just end up re-emulating
the syscall afterwards for no good reason. If you noticed the
interruption in the middle of the emulated syscall, that'd be
different, but since this is the case where we're already done with
the emulation and getting ready to continue...


* Re: [PATCH v7 4/6] files: add a replace_fd_files() function
  2018-09-27 21:59   ` Kees Cook
@ 2018-09-28  2:20     ` Kees Cook
  2018-09-28  2:46       ` Jann Horn
  2018-09-28  5:23       ` Tycho Andersen
  0 siblings, 2 replies; 91+ messages in thread
From: Kees Cook @ 2018-09-28  2:20 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel, Alexander Viro

On Thu, Sep 27, 2018 at 2:59 PM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> Similar to fd_install/__fd_install, we want to be able to replace an fd of
>> an arbitrary struct files_struct, not just current's. We'll use this in the
>> next patch to implement the seccomp ioctl that allows inserting fds into a
>> stopped process' context.
>>
>> v7: new in v7
>>
>> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
>> CC: Alexander Viro <viro@zeniv.linux.org.uk>
>> CC: Kees Cook <keescook@chromium.org>
>> CC: Andy Lutomirski <luto@amacapital.net>
>> CC: Oleg Nesterov <oleg@redhat.com>
>> CC: Eric W. Biederman <ebiederm@xmission.com>
>> CC: "Serge E. Hallyn" <serge@hallyn.com>
>> CC: Christian Brauner <christian.brauner@ubuntu.com>
>> CC: Tyler Hicks <tyhicks@canonical.com>
>> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
>> ---
>>  fs/file.c            | 22 +++++++++++++++-------
>>  include/linux/file.h |  8 ++++++++
>>  2 files changed, 23 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index 7ffd6e9d103d..3b3c5aadaadb 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -850,24 +850,32 @@ __releases(&files->file_lock)
>>  }
>>
>>  int replace_fd(unsigned fd, struct file *file, unsigned flags)
>> +{
>> +       return replace_fd_task(current, fd, file, flags);
>> +}
>> +
>> +/*
>> + * Same warning as __alloc_fd()/__fd_install() here.
>> + */
>> +int replace_fd_task(struct task_struct *task, unsigned fd,
>> +                   struct file *file, unsigned flags)
>>  {
>>         int err;
>> -       struct files_struct *files = current->files;
>
> Same feedback as Jann: on a purely "smaller diff" note, this could
> just be s/current/task/ here and all the other s/files/task->files/
> would go away...
>
>>
>>         if (!file)
>> -               return __close_fd(files, fd);
>> +               return __close_fd(task->files, fd);
>>
>> -       if (fd >= rlimit(RLIMIT_NOFILE))
>> +       if (fd >= task_rlimit(task, RLIMIT_NOFILE))
>>                 return -EBADF;
>>
>> -       spin_lock(&files->file_lock);
>> -       err = expand_files(files, fd);
>> +       spin_lock(&task->files->file_lock);
>> +       err = expand_files(task->files, fd);
>>         if (unlikely(err < 0))
>>                 goto out_unlock;
>> -       return do_dup2(files, file, fd, flags);
>> +       return do_dup2(task->files, file, fd, flags);
>>
>>  out_unlock:
>> -       spin_unlock(&files->file_lock);
>> +       spin_unlock(&task->files->file_lock);
>>         return err;
>>  }
>>
>> diff --git a/include/linux/file.h b/include/linux/file.h
>> index 6b2fb032416c..f94277fee038 100644
>> --- a/include/linux/file.h
>> +++ b/include/linux/file.h
>> @@ -11,6 +11,7 @@
>>  #include <linux/posix_types.h>
>>
>>  struct file;
>> +struct task_struct;
>>
>>  extern void fput(struct file *);
>>
>> @@ -79,6 +80,13 @@ static inline void fdput_pos(struct fd f)
>>
>>  extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
>>  extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
>> +/*
>> + * Warning! This is only safe if you know the owner of the files_struct is
>> + * stopped outside syscall context. It's a very bad idea to use this unless you
>> + * have similar guarantees in your code.
>> + */
>> +extern int replace_fd_task(struct task_struct *task, unsigned fd,
>> +                          struct file *file, unsigned flags);
>
> Perhaps call this __replace_fd() to indicate the "please don't use
> this unless you're very sure"ness of it?
>
>>  extern void set_close_on_exec(unsigned int fd, int flag);
>>  extern bool get_close_on_exec(unsigned int fd);
>>  extern int get_unused_fd_flags(unsigned flags);
>> --
>> 2.17.1
>>
>
> If I can get an Ack from Al, that would be very nice. :)

In out-of-band feedback from Al, he's pointed out a much cleaner
approach: do the work on the "current" side. i.e. current is stopped
in __seccomp_filter in the case SECCOMP_RET_USER_NOTIFY. Instead of
having the ioctl-handling process doing the work, have it done on the
other side. This may cause some additional complexity on the ioctl
return path, but it solves both this problem and the "ptrace attach"
issue: have the work delayed until "current" gets caught by seccomp.

-Kees

-- 
Kees Cook
Pixel Security


* Re: [PATCH v7 4/6] files: add a replace_fd_files() function
  2018-09-28  2:20     ` Kees Cook
@ 2018-09-28  2:46       ` Jann Horn
  2018-09-28  5:23       ` Tycho Andersen
  1 sibling, 0 replies; 91+ messages in thread
From: Jann Horn @ 2018-09-28  2:46 UTC (permalink / raw)
  To: Kees Cook
  Cc: Tycho Andersen, kernel list, containers, Linux API,
	Andy Lutomirski, Oleg Nesterov, Eric W. Biederman,
	Serge E. Hallyn, Christian Brauner, Tyler Hicks, suda.akihiro,
	linux-fsdevel, Al Viro

On Fri, Sep 28, 2018 at 4:20 AM Kees Cook <keescook@chromium.org> wrote:
> On Thu, Sep 27, 2018 at 2:59 PM, Kees Cook <keescook@chromium.org> wrote:
> > On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> Similar to fd_install/__fd_install, we want to be able to replace an fd of
> >> an arbitrary struct files_struct, not just current's. We'll use this in the
> >> next patch to implement the seccomp ioctl that allows inserting fds into a
> >> stopped process' context.
> >>
> >> v7: new in v7
> >>
> >> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> >> CC: Alexander Viro <viro@zeniv.linux.org.uk>
> >> CC: Kees Cook <keescook@chromium.org>
> >> CC: Andy Lutomirski <luto@amacapital.net>
> >> CC: Oleg Nesterov <oleg@redhat.com>
> >> CC: Eric W. Biederman <ebiederm@xmission.com>
> >> CC: "Serge E. Hallyn" <serge@hallyn.com>
> >> CC: Christian Brauner <christian.brauner@ubuntu.com>
> >> CC: Tyler Hicks <tyhicks@canonical.com>
> >> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> >> ---
> >>  fs/file.c            | 22 +++++++++++++++-------
> >>  include/linux/file.h |  8 ++++++++
> >>  2 files changed, 23 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/fs/file.c b/fs/file.c
> >> index 7ffd6e9d103d..3b3c5aadaadb 100644
> >> --- a/fs/file.c
> >> +++ b/fs/file.c
> >> @@ -850,24 +850,32 @@ __releases(&files->file_lock)
> >>  }
> >>
> >>  int replace_fd(unsigned fd, struct file *file, unsigned flags)
> >> +{
> >> +       return replace_fd_task(current, fd, file, flags);
> >> +}
> >> +
> >> +/*
> >> + * Same warning as __alloc_fd()/__fd_install() here.
> >> + */
> >> +int replace_fd_task(struct task_struct *task, unsigned fd,
> >> +                   struct file *file, unsigned flags)
> >>  {
> >>         int err;
> >> -       struct files_struct *files = current->files;
> >
> > Same feedback as Jann: on a purely "smaller diff" note, this could
> > just be s/current/task/ here and all the other s/files/task->files/
> > would go away...
> >
> >>
> >>         if (!file)
> >> -               return __close_fd(files, fd);
> >> +               return __close_fd(task->files, fd);
> >>
> >> -       if (fd >= rlimit(RLIMIT_NOFILE))
> >> +       if (fd >= task_rlimit(task, RLIMIT_NOFILE))
> >>                 return -EBADF;
> >>
> >> -       spin_lock(&files->file_lock);
> >> -       err = expand_files(files, fd);
> >> +       spin_lock(&task->files->file_lock);
> >> +       err = expand_files(task->files, fd);
> >>         if (unlikely(err < 0))
> >>                 goto out_unlock;
> >> -       return do_dup2(files, file, fd, flags);
> >> +       return do_dup2(task->files, file, fd, flags);
> >>
> >>  out_unlock:
> >> -       spin_unlock(&files->file_lock);
> >> +       spin_unlock(&task->files->file_lock);
> >>         return err;
> >>  }
> >>
> >> diff --git a/include/linux/file.h b/include/linux/file.h
> >> index 6b2fb032416c..f94277fee038 100644
> >> --- a/include/linux/file.h
> >> +++ b/include/linux/file.h
> >> @@ -11,6 +11,7 @@
> >>  #include <linux/posix_types.h>
> >>
> >>  struct file;
> >> +struct task_struct;
> >>
> >>  extern void fput(struct file *);
> >>
> >> @@ -79,6 +80,13 @@ static inline void fdput_pos(struct fd f)
> >>
> >>  extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
> >>  extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
> >> +/*
> >> + * Warning! This is only safe if you know the owner of the files_struct is
> >> + * stopped outside syscall context. It's a very bad idea to use this unless you
> >> + * have similar guarantees in your code.
> >> + */
> >> +extern int replace_fd_task(struct task_struct *task, unsigned fd,
> >> +                          struct file *file, unsigned flags);
> >
> > Perhaps call this __replace_fd() to indicate the "please don't use
> > this unless you're very sure"ness of it?
> >
> >>  extern void set_close_on_exec(unsigned int fd, int flag);
> >>  extern bool get_close_on_exec(unsigned int fd);
> >>  extern int get_unused_fd_flags(unsigned flags);
> >> --
> >> 2.17.1
> >>
> >
> > If I can get an Ack from Al, that would be very nice. :)
>
> In out-of-band feedback from Al, he's pointed out a much cleaner
> approach: do the work on the "current" side. i.e. current is stopped
> in __seccomp_filter in the case SECCOMP_RET_USER_NOTIFY. Instead of
> having the ioctl-handling process doing the work, have it done on the
> other side. This may cause some additional complexity on the ioctl
> return path, but it solves both this problem and the "ptrace attach"
> issue: have the work delayed until "current" gets caught by seccomp.

Can you elaborate on this? Are you saying you want to, for every file
descriptor that should be transferred, put a reference to the file
into the kernel's seccomp notification data structure, wake up the
task that's waiting for a reply, let the task install an fd, send back
a response on whether installing the FD worked, and then return that
response back to the container manager process? That sounds
like a pretty complicated dance that I'd prefer to avoid.


* Re: [PATCH v7 4/6] files: add a replace_fd_files() function
  2018-09-28  2:20     ` Kees Cook
  2018-09-28  2:46       ` Jann Horn
@ 2018-09-28  5:23       ` Tycho Andersen
  1 sibling, 0 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-28  5:23 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel, Alexander Viro

On Thu, Sep 27, 2018 at 07:20:50PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 2:59 PM, Kees Cook <keescook@chromium.org> wrote:
> > On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> Similar to fd_install/__fd_install, we want to be able to replace an fd of
> >> an arbitrary struct files_struct, not just current's. We'll use this in the
> >> next patch to implement the seccomp ioctl that allows inserting fds into a
> >> stopped process' context.
> >>
> >> v7: new in v7
> >>
> >> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> >> CC: Alexander Viro <viro@zeniv.linux.org.uk>
> >> CC: Kees Cook <keescook@chromium.org>
> >> CC: Andy Lutomirski <luto@amacapital.net>
> >> CC: Oleg Nesterov <oleg@redhat.com>
> >> CC: Eric W. Biederman <ebiederm@xmission.com>
> >> CC: "Serge E. Hallyn" <serge@hallyn.com>
> >> CC: Christian Brauner <christian.brauner@ubuntu.com>
> >> CC: Tyler Hicks <tyhicks@canonical.com>
> >> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> >> ---
> >>  fs/file.c            | 22 +++++++++++++++-------
> >>  include/linux/file.h |  8 ++++++++
> >>  2 files changed, 23 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/fs/file.c b/fs/file.c
> >> index 7ffd6e9d103d..3b3c5aadaadb 100644
> >> --- a/fs/file.c
> >> +++ b/fs/file.c
> >> @@ -850,24 +850,32 @@ __releases(&files->file_lock)
> >>  }
> >>
> >>  int replace_fd(unsigned fd, struct file *file, unsigned flags)
> >> +{
> >> +       return replace_fd_task(current, fd, file, flags);
> >> +}
> >> +
> >> +/*
> >> + * Same warning as __alloc_fd()/__fd_install() here.
> >> + */
> >> +int replace_fd_task(struct task_struct *task, unsigned fd,
> >> +                   struct file *file, unsigned flags)
> >>  {
> >>         int err;
> >> -       struct files_struct *files = current->files;
> >
> > Same feedback as Jann: on a purely "smaller diff" note, this could
> > just be s/current/task/ here and all the other s/files/task->files/
> > would go away...
> >
> >>
> >>         if (!file)
> >> -               return __close_fd(files, fd);
> >> +               return __close_fd(task->files, fd);
> >>
> >> -       if (fd >= rlimit(RLIMIT_NOFILE))
> >> +       if (fd >= task_rlimit(task, RLIMIT_NOFILE))
> >>                 return -EBADF;
> >>
> >> -       spin_lock(&files->file_lock);
> >> -       err = expand_files(files, fd);
> >> +       spin_lock(&task->files->file_lock);
> >> +       err = expand_files(task->files, fd);
> >>         if (unlikely(err < 0))
> >>                 goto out_unlock;
> >> -       return do_dup2(files, file, fd, flags);
> >> +       return do_dup2(task->files, file, fd, flags);
> >>
> >>  out_unlock:
> >> -       spin_unlock(&files->file_lock);
> >> +       spin_unlock(&task->files->file_lock);
> >>         return err;
> >>  }
> >>
> >> diff --git a/include/linux/file.h b/include/linux/file.h
> >> index 6b2fb032416c..f94277fee038 100644
> >> --- a/include/linux/file.h
> >> +++ b/include/linux/file.h
> >> @@ -11,6 +11,7 @@
> >>  #include <linux/posix_types.h>
> >>
> >>  struct file;
> >> +struct task_struct;
> >>
> >>  extern void fput(struct file *);
> >>
> >> @@ -79,6 +80,13 @@ static inline void fdput_pos(struct fd f)
> >>
> >>  extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
> >>  extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
> >> +/*
> >> + * Warning! This is only safe if you know the owner of the files_struct is
> >> + * stopped outside syscall context. It's a very bad idea to use this unless you
> >> + * have similar guarantees in your code.
> >> + */
> >> +extern int replace_fd_task(struct task_struct *task, unsigned fd,
> >> +                          struct file *file, unsigned flags);
> >
> > Perhaps call this __replace_fd() to indicate the "please don't use
> > this unless you're very sure"ness of it?
> >
> >>  extern void set_close_on_exec(unsigned int fd, int flag);
> >>  extern bool get_close_on_exec(unsigned int fd);
> >>  extern int get_unused_fd_flags(unsigned flags);
> >> --
> >> 2.17.1
> >>
> >
> > If I can get an Ack from Al, that would be very nice. :)
> 
> In out-of-band feedback from Al, he's pointed out a much cleaner
> approach: do the work on the "current" side. i.e. current is stopped
> in __seccomp_filter in the case SECCOMP_RET_USER_NOTIFY. Instead of
> having the ioctl-handling process doing the work, have it done on the
> other side. This may cause some additional complexity on the ioctl
> return path, but it solves both this problem and the "ptrace attach"
> issue: have the work delayed until "current" gets caught by seccomp.

So this is pretty much what we had in v6 (a one fd version, but the
idea is the same). The biggest issue is that in the case of e.g.
socketpair(), the fd values need to be written somewhere in the task's
memory, which means they need to be known before the response is sent.
If we have to wait until we're back in the task's context to install
them, we can't know the fd values.

V6 implementation: https://lkml.org/lkml/2018/9/6/773

Tycho


* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 23:10       ` Kees Cook
@ 2018-09-28 14:39         ` Tycho Andersen
  0 siblings, 0 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-09-28 14:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: Stephane Graber, LKML, Linux Containers, Linux API,
	Andy Lutomirski, Oleg Nesterov, Eric W . Biederman,
	Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda,
	Jann Horn, linux-fsdevel

On Thu, Sep 27, 2018 at 04:10:29PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 3:48 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> >> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> struct seccomp_notif {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __u32                      pid;                  /*    16     4 */
> >>         __u8                       signaled;             /*    20     1 */
> >>
> >>         /* XXX 3 bytes hole, try to pack */
> >>
> >>         struct seccomp_data        data;                 /*    24    64 */
> >>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >>
> >>         /* size: 88, cachelines: 2, members: 5 */
> >>         /* sum members: 79, holes: 2, sum holes: 9 */
> >>         /* last cacheline: 24 bytes */
> >> };
> >> struct seccomp_notif_resp {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __s32                      error;                /*    16     4 */
> >>
> >>         /* XXX 4 bytes hole, try to pack */
> >>
> >>         __s64                      val;                  /*    24     8 */
> >>
> >>         /* size: 32, cachelines: 1, members: 4 */
> >>         /* sum members: 22, holes: 2, sum holes: 10 */
> >>         /* last cacheline: 32 bytes */
> >> };
> >>
> >> How about making len u32, and moving pid and error above "id"? This
> >> leaves a hole after signaled, so changing "len" won't be sufficient
> >> for versioning here. Perhaps move it after data?
> >
> > I'm not sure what you mean by "len won't be sufficient for versioning
> > here"? Anyway, I can do some packing on these; I didn't bother before
> > since I figured it's a userspace interface, so saving a few bytes
> > isn't a huge deal.
> 
> I was thinking the "len" portion was for determining if the API ever
> changes in the future. My point was that given the padding holes, e.g.
> adding a u8 after signaled, "len" wouldn't change, so the kernel might
> expect to start reading something after signaled that it wasn't
> checking before, but the len would be the same.

Oh, yeah. That's ugly :(

> >> I have to say, I'm vaguely nervous about changing the semantics here
> >> for passing back the fd as the return code from the seccomp() syscall.
> >> Alternatives seem less appealing, though: changing the meaning of the
> >> uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> >> example. Hmm.
> >
> > From my perspective we can drop this whole thing. The only thing I'll
> > ever use is the ptrace version. Someone at some point (I don't
> > remember who, maybe stgraber) suggested this version would be useful
> > as well.
> 
> Well that would certainly change the exposure of the interface pretty
> drastically. :)
> 
> So, let's talk more about this, as it raises another thought I had
> too: for the PTRACE interface to work, you have to know specifically
> which filter you want to get notifications for. Won't that be slightly
> tricky?

Not necessarily. The way I imagine using it is:

1. container manager forks init task
2. init task does a bunch of setup stuff, then installs the filter
3. optionally install any user specified filter (or just merge the
   filter with step 2 instead of chaining them)
4. container manager grabs the listener fd from the container init via
   ptrace
5. init execs the user specified init

So the offset will always be known at least in my usecase. The
container manager doesn't want to install the filter on itself, so it
won't use NEW_LISTENER. Similarly, we don't want init to use
NEW_LISTENER, because if the user has decided to block sendmsg as part
of their policy, there's no way to get the fd out.

> > Anyway, let me know if your nervousness outweighs this, I'm happy to
> > drop it.
> 
> I'm not opposed to keeping it, but if you don't think anyone will use
> it ... we should probably drop it just to avoid the complexity. It's a
> cool API, though, so I'd like to hear from others first before you go
> tearing it out. ;) (stgraber added to CC)

It does seem useful for lighter weight cases than a container. The "I
want to run some random binary that I don't have the source for that
tries to make some privileged calls it doesn't really need" case. But
as a Container Guy I think I have in my contract somewhere that I have
to use containers :). But let's see what people think.

> >> > +       if (copy_from_user(&size, buf, sizeof(size)))
> >> > +               return -EFAULT;
> >> > +       size = min_t(size_t, size, sizeof(resp));
> >> > +       if (copy_from_user(&resp, buf, size))
> >> > +               return -EFAULT;
> >>
> >> For sanity checking on a double-read from userspace, please add:
> >>
> >>     if (resp.len != size)
> >>         return -EINVAL;
> >
> > Won't that fail if sizeof(resp) < resp.len, because of the min_t()?
> 
> Ah, true. In that case, probably do resp.len = size to avoid any logic
> failures due to the double-read? I just want to avoid any chance of
> confusing the size and actually using it somewhere.

Yep, sounds good.
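
For reference, a rough userspace sketch of what that would look like. This
is not the patch code: memcpy() stands in for copy_from_user(), struct resp
only mirrors the field order of the v7 struct seccomp_notif_resp, and
read_resp() is a made-up name. Zeroing the destination first is also an
assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field order as the v7 struct seccomp_notif_resp. */
struct resp {
	uint16_t len;
	uint64_t id;
	int32_t  error;
	int64_t  val;
};

/*
 * memcpy() stands in for copy_from_user(); ubuf must be readable for
 * min(len, sizeof(struct resp)) bytes.
 */
int read_resp(struct resp *out, const void *ubuf)
{
	uint16_t size;
	size_t n;

	/* First fetch: only the length field. */
	memcpy(&size, ubuf, sizeof(size));
	n = size;
	if (n > sizeof(*out))
		n = sizeof(*out);

	/* Zeroing is an addition of this sketch, so short inputs leave the tail at 0. */
	memset(out, 0, sizeof(*out));

	/* Second fetch: the len inside this copy may differ from 'size'. */
	memcpy(out, ubuf, n);

	/* Overwrite it with the clamped value so later code sees one length. */
	out->len = (uint16_t)n;
	return 0;
}

#ifdef DEMO_MAIN
int main(void)
{
	struct resp in = { .len = 200, .id = 42, .error = -1, .val = 7 };
	struct resp out;

	assert(read_resp(&out, &in) == 0);
	assert(out.len == sizeof(struct resp));	/* clamped, not 200 */
	assert(out.id == 42);
	return 0;
}
#endif
```

The point is just that whatever len userspace wrote on the second read is
discarded, so nothing later in the function can pick up an unchecked size.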

Tycho

* Re: [PATCH v7 0/6] seccomp trap to userspace
  2018-09-27 15:11 [PATCH v7 0/6] seccomp trap to userspace Tycho Andersen
                   ` (5 preceding siblings ...)
  2018-09-27 15:11 ` [PATCH v7 6/6] samples: add an example of seccomp user trap Tycho Andersen
@ 2018-09-28 21:57 ` Michael Kerrisk (man-opages)
  2018-09-28 22:03   ` Tycho Andersen
  6 siblings, 1 reply; 91+ messages in thread
From: Michael Kerrisk (man-opages) @ 2018-09-28 21:57 UTC (permalink / raw)
  To: Tycho Andersen, Kees Cook
  Cc: mtk.manpages, linux-kernel, containers, linux-api,
	Andy Lutomirski, Oleg Nesterov, Eric W . Biederman,
	Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda,
	Jann Horn, linux-fsdevel

Hi Tycho,

On 09/27/2018 05:11 PM, Tycho Andersen wrote:
> Hi all,
> 
> Here's v7 of the seccomp trap to userspace set. There are various minor
> changes and bug fixes, but two major changes:
> 
> * We now pass fds to the tracee via an ioctl, and do it immediately when
>    the ioctl is called. For this we needed some help from the vfs, so
>    I've put the one patch in this series and cc'd fsdevel. This does have
>    the advantage that the feature is now totally decoupled from the rest
>    of the set, which is itself useful (thanks Andy!)
> 
> * Instead of putting all of the notification related stuff into the
>    struct seccomp_filter, it now lives in its own struct notification,
>    which is pointed to by struct seccomp_filter. This will save a lot of
>    memory (thanks Tyler!)

Is there a documentation (man page) patch for this API change?

Thanks,

Michael

> v6 discussion: https://lkml.org/lkml/2018/9/6/769
> 
> Thoughts welcome,
> 
> Tycho
> 
> Tycho Andersen (6):
>    seccomp: add a return code to trap to userspace
>    seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
>    seccomp: add a way to get a listener fd from ptrace
>    files: add a replace_fd_files() function
>    seccomp: add a way to pass FDs via a notification fd
>    samples: add an example of seccomp user trap
> 
>   Documentation/ioctl/ioctl-number.txt          |   1 +
>   .../userspace-api/seccomp_filter.rst          |  89 +++
>   fs/file.c                                     |  22 +-
>   include/linux/file.h                          |   8 +
>   include/linux/seccomp.h                       |  14 +-
>   include/uapi/linux/ptrace.h                   |   2 +
>   include/uapi/linux/seccomp.h                  |  42 +-
>   kernel/ptrace.c                               |   4 +
>   kernel/seccomp.c                              | 527 ++++++++++++++-
>   samples/seccomp/.gitignore                    |   1 +
>   samples/seccomp/Makefile                      |   7 +-
>   samples/seccomp/user-trap.c                   | 312 +++++++++
>   tools/testing/selftests/seccomp/seccomp_bpf.c | 607 +++++++++++++++++-
>   13 files changed, 1617 insertions(+), 19 deletions(-)
>   create mode 100644 samples/seccomp/user-trap.c
> 

* Re: [PATCH v7 0/6] seccomp trap to userspace
  2018-09-28 21:57 ` [PATCH v7 0/6] seccomp trap to userspace Michael Kerrisk (man-opages)
@ 2018-09-28 22:03   ` Tycho Andersen
  2018-09-28 22:16     ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 91+ messages in thread
From: Tycho Andersen @ 2018-09-28 22:03 UTC (permalink / raw)
  To: Michael Kerrisk (man-opages)
  Cc: Kees Cook, linux-kernel, containers, linux-api, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Fri, Sep 28, 2018 at 11:57:40PM +0200, Michael Kerrisk (man-opages) wrote:
> Hi Tycho,
> 
> On 09/27/2018 05:11 PM, Tycho Andersen wrote:
> > Hi all,
> > 
> > Here's v7 of the seccomp trap to userspace set. There are various minor
> > changes and bug fixes, but two major changes:
> > 
> > * We now pass fds to the tracee via an ioctl, and do it immediately when
> >    the ioctl is called. For this we needed some help from the vfs, so
> >    I've put the one patch in this series and cc'd fsdevel. This does have
> >    the advantage that the feature is now totally decoupled from the rest
> >    of the set, which is itself useful (thanks Andy!)
> > 
> > * Instead of putting all of the notification related stuff into the
> >    struct seccomp_filter, it now lives in its own struct notification,
> >    which is pointed to by struct seccomp_filter. This will save a lot of
> >    memory (thanks Tyler!)
> 
> Is there a documentation (man page) patch for this API change?

Not yet, but once we decide on a final API I'll prepare one.

Cheers,

Tycho

* Re: [PATCH v7 0/6] seccomp trap to userspace
  2018-09-28 22:03   ` Tycho Andersen
@ 2018-09-28 22:16     ` Michael Kerrisk (man-pages)
  2018-09-28 22:34       ` Kees Cook
  0 siblings, 1 reply; 91+ messages in thread
From: Michael Kerrisk (man-pages) @ 2018-09-28 22:16 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, lkml, Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

Hi Tycho,

On Sat, 29 Sep 2018 at 00:04, Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Fri, Sep 28, 2018 at 11:57:40PM +0200, Michael Kerrisk (man-opages) wrote:
> > Hi Tycho,
> >
> > On 09/27/2018 05:11 PM, Tycho Andersen wrote:
> > > Hi all,
> > >
> > > Here's v7 of the seccomp trap to userspace set. There are various minor
> > > changes and bug fixes, but two major changes:
> > >
> > > * We now pass fds to the tracee via an ioctl, and do it immediately when
> > >    the ioctl is called. For this we needed some help from the vfs, so
> > >    I've put the one patch in this series and cc'd fsdevel. This does have
> > >    the advantage that the feature is now totally decoupled from the rest
> > >    of the set, which is itself useful (thanks Andy!)
> > >
> > > * Instead of putting all of the notification related stuff into the
> > >    struct seccomp_filter, it now lives in its own struct notification,
> > >    which is pointed to by struct seccomp_filter. This will save a lot of
> > >    memory (thanks Tyler!)
> >
> > Is there a documentation (man page) patch for this API change?
>
> Not yet, but once we decide on a final API I'll prepare one.

Honestly, the production of such documentation should be part of the
evolution towards the final API...

Documentation is not an afterthought. It's a tool for pushing you, the
developer (and others, your reviewers) to more deeply consider your
design.

Thanks,

Michael
-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [PATCH v7 0/6] seccomp trap to userspace
  2018-09-28 22:16     ` Michael Kerrisk (man-pages)
@ 2018-09-28 22:34       ` Kees Cook
  2018-09-28 22:46         ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 91+ messages in thread
From: Kees Cook @ 2018-09-28 22:34 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Tycho Andersen, lkml, Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Fri, Sep 28, 2018 at 3:16 PM, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> Hi Tycho,
>
> On Sat, 29 Sep 2018 at 00:04, Tycho Andersen <tycho@tycho.ws> wrote:
>>
>> On Fri, Sep 28, 2018 at 11:57:40PM +0200, Michael Kerrisk (man-opages) wrote:
>> > Hi Tycho,
>> >
>> > On 09/27/2018 05:11 PM, Tycho Andersen wrote:
>> > > Hi all,
>> > >
>> > > Here's v7 of the seccomp trap to userspace set. There are various minor
>> > > changes and bug fixes, but two major changes:
>> > >
>> > > * We now pass fds to the tracee via an ioctl, and do it immediately when
>> > >    the ioctl is called. For this we needed some help from the vfs, so
>> > >    I've put the one patch in this series and cc'd fsdevel. This does have
>> > >    the advantage that the feature is now totally decoupled from the rest
>> > >    of the set, which is itself useful (thanks Andy!)
>> > >
>> > > * Instead of putting all of the notification related stuff into the
>> > >    struct seccomp_filter, it now lives in its own struct notification,
>> > >    which is pointed to by struct seccomp_filter. This will save a lot of
>> > >    memory (thanks Tyler!)
>> >
>> > Is there a documentation (man page) patch for this API change?
>>
>> Not yet, but once we decide on a final API I'll prepare one.
>
> Honestly, the production of such documentation should be part of the
> evolution towards the final API...
>
> Documentation is not an afterthought. It's a tool for pushing you, the
> developer (and others, your reviewers) to more deeply consider your
> design.

In Tycho's defense, he did write up documentation in Documentation/
for the feature, so it won't be an afterthought. :) But yes, there's
no manpage delta yet.

-Kees

-- 
Kees Cook
Pixel Security

* Re: [PATCH v7 0/6] seccomp trap to userspace
  2018-09-28 22:34       ` Kees Cook
@ 2018-09-28 22:46         ` Michael Kerrisk (man-pages)
  2018-09-28 22:48           ` Jann Horn
  0 siblings, 1 reply; 91+ messages in thread
From: Michael Kerrisk (man-pages) @ 2018-09-28 22:46 UTC (permalink / raw)
  To: Kees Cook
  Cc: Tycho Andersen, lkml, Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W. Biederman, Serge E. Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

Hi Kees,

On Sat, 29 Sep 2018 at 00:35, Kees Cook <keescook@chromium.org> wrote:
>
> On Fri, Sep 28, 2018 at 3:16 PM, Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
> > Hi Tycho,
> >
> > On Sat, 29 Sep 2018 at 00:04, Tycho Andersen <tycho@tycho.ws> wrote:
> >>
> >> On Fri, Sep 28, 2018 at 11:57:40PM +0200, Michael Kerrisk (man-opages) wrote:
> >> > Hi Tycho,
> >> >
> >> > On 09/27/2018 05:11 PM, Tycho Andersen wrote:
> >> > > Hi all,
> >> > >
> >> > > Here's v7 of the seccomp trap to userspace set. There are various minor
> >> > > changes and bug fixes, but two major changes:
> >> > >
> >> > > * We now pass fds to the tracee via an ioctl, and do it immediately when
> >> > >    the ioctl is called. For this we needed some help from the vfs, so
> >> > >    I've put the one patch in this series and cc'd fsdevel. This does have
> >> > >    the advantage that the feature is now totally decoupled from the rest
> >> > >    of the set, which is itself useful (thanks Andy!)
> >> > >
> >> > > * Instead of putting all of the notification related stuff into the
> >> > >    struct seccomp_filter, it now lives in its own struct notification,
> >> > >    which is pointed to by struct seccomp_filter. This will save a lot of
> >> > >    memory (thanks Tyler!)
> >> >
> >> > Is there a documentation (man page) patch for this API change?
> >>
> >> Not yet, but once we decide on a final API I'll prepare one.
> >
> > Honestly, the production of such documentation should be part of the
> > evolution towards the final API...
> >
> > Documentation is not an afterthought. It's a tool for pushing you, the
> > developer (and others, your reviewers) to more deeply consider your
> > design.
>
> In Tycho's defense, he did write up documentation in Documentation/
> for the feature, so it won't be an afterthought. :)

So, I missed that... How do I find this Documentation/ ?

> But yes, there's
> no manpage delta yet.

But, really, there should be, as part of the ongoing evolution of the patch...

(Apologies, Tycho. It may be that I came across a bit harshly.)

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [PATCH v7 0/6] seccomp trap to userspace
  2018-09-28 22:46         ` Michael Kerrisk (man-pages)
@ 2018-09-28 22:48           ` Jann Horn
  0 siblings, 0 replies; 91+ messages in thread
From: Jann Horn @ 2018-09-28 22:48 UTC (permalink / raw)
  To: Michael Kerrisk-manpages
  Cc: Kees Cook, Tycho Andersen, kernel list, containers, Linux API,
	Andy Lutomirski, Oleg Nesterov, Eric W. Biederman,
	Serge E. Hallyn, Christian Brauner, Tyler Hicks, suda.akihiro,
	linux-fsdevel

On Sat, Sep 29, 2018 at 12:47 AM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> On Sat, 29 Sep 2018 at 00:35, Kees Cook <keescook@chromium.org> wrote:
> > On Fri, Sep 28, 2018 at 3:16 PM, Michael Kerrisk (man-pages)
> > <mtk.manpages@gmail.com> wrote:
> > > On Sat, 29 Sep 2018 at 00:04, Tycho Andersen <tycho@tycho.ws> wrote:
> > >> On Fri, Sep 28, 2018 at 11:57:40PM +0200, Michael Kerrisk (man-opages) wrote:
> > >> > On 09/27/2018 05:11 PM, Tycho Andersen wrote:
> > >> > > Here's v7 of the seccomp trap to userspace set. There are various minor
> > >> > > changes and bug fixes, but two major changes:
> > >> > >
> > >> > > * We now pass fds to the tracee via an ioctl, and do it immediately when
> > >> > >    the ioctl is called. For this we needed some help from the vfs, so
> > >> > >    I've put the one patch in this series and cc'd fsdevel. This does have
> > >> > >    the advantage that the feature is now totally decoupled from the rest
> > >> > >    of the set, which is itself useful (thanks Andy!)
> > >> > >
> > >> > > * Instead of putting all of the notification related stuff into the
> > >> > >    struct seccomp_filter, it now lives in its own struct notification,
> > >> > >    which is pointed to by struct seccomp_filter. This will save a lot of
> > >> > >    memory (thanks Tyler!)
> > >> >
> > >> > Is there a documentation (man page) patch for this API change?
> > >>
> > >> Not yet, but once we decide on a final API I'll prepare one.
> > >
> > > Honestly, the production of such documentation should be part of the
> > > evolution towards the final API...
> > >
> > > Documentation is not an afterthought. It's a tool for pushing you, the
> > > developer (and others, your reviewers) to more deeply consider your
> > > design.
> >
> > In Tycho's defense, he did write up documentation in Documentation/
> > for the feature, so it won't be an afterthought. :)
>
> So, I missed that... How do I find this Documentation/ ?

It's in patch 1:
https://lore.kernel.org/lkml/20180927151119.9989-2-tycho@tycho.ws/

* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 15:11 ` [PATCH v7 1/6] seccomp: add a return code to " Tycho Andersen
  2018-09-27 21:31   ` Kees Cook
  2018-09-27 21:51   ` Jann Horn
@ 2018-09-29  0:28   ` Aleksa Sarai
  2 siblings, 0 replies; 91+ messages in thread
From: Aleksa Sarai @ 2018-09-29  0:28 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, Jann Horn, linux-api, containers, Akihiro Suda,
	Oleg Nesterov, linux-kernel, Eric W . Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski

On 2018-09-27, Tycho Andersen <tycho@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
> 
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
> 
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.

Minor thing, but the mknod() part is no longer _entirely_ true (now it checks
ns_capable(sb->s_user_ns)). I think the kernel module auto-loading is a
much more interesting example, but since this is just a commit message
feel free to ignore my pedantry. :P

> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>

Would you mind adding me to the Cc: list for the next round of patches?
It's looking pretty neat!

Thanks!

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: [PATCH v7 2/6] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE
  2018-09-27 15:11 ` [PATCH v7 2/6] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
  2018-09-27 16:51   ` Jann Horn
  2018-09-27 21:42   ` Kees Cook
@ 2018-10-08 13:55   ` Christian Brauner
  2 siblings, 0 replies; 91+ messages in thread
From: Christian Brauner @ 2018-10-08 13:55 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, Jann Horn, linux-api, containers, Akihiro Suda,
	Oleg Nesterov, linux-kernel, Eric W . Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski

On Thu, Sep 27, 2018 at 09:11:15AM -0600, Tycho Andersen wrote:
> In the next commit we'll use this same mnemonic to get a listener for the
> nth filter, so we need it available outside of CHECKPOINT_RESTORE in the
> USER_NOTIFICATION case as well.
> 
> v2: new in v2
> v3: no changes
> v4: no changes
> v5: switch to CHECKPOINT_RESTORE || USER_NOTIFICATION to avoid warning when
>     only CONFIG_SECCOMP_FILTER is enabled.
> v7: drop USER_NOTIFICATION bits
> 
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>

Acked-by: Christian Brauner <christian@brauner.io>

> ---
>  kernel/seccomp.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index fa6fe9756c80..44a31ac8373a 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1158,7 +1158,7 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
>  	return do_seccomp(op, 0, uargs);
>  }
>  
> -#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> +#if defined(CONFIG_SECCOMP_FILTER)
>  static struct seccomp_filter *get_nth_filter(struct task_struct *task,
>  					     unsigned long filter_off)
>  {
> @@ -1205,6 +1205,7 @@ static struct seccomp_filter *get_nth_filter(struct task_struct *task,
>  	return filter;
>  }
>  
> +#if defined(CONFIG_CHECKPOINT_RESTORE)
>  long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
>  			void __user *data)
>  {
> @@ -1277,7 +1278,8 @@ long seccomp_get_metadata(struct task_struct *task,
>  	__put_seccomp_filter(filter);
>  	return ret;
>  }
> -#endif
> +#endif /* CONFIG_CHECKPOINT_RESTORE */
> +#endif /* CONFIG_SECCOMP_FILTER */
>  
>  #ifdef CONFIG_SYSCTL
>  
> -- 
> 2.17.1
> 

* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 22:48     ` Tycho Andersen
  2018-09-27 23:10       ` Kees Cook
@ 2018-10-08 14:58       ` Christian Brauner
  2018-10-09 14:28         ` Tycho Andersen
  1 sibling, 1 reply; 91+ messages in thread
From: Christian Brauner @ 2018-10-08 14:58 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, Jann Horn, Linux API, Linux Containers, Akihiro Suda,
	Oleg Nesterov, LKML, Eric W . Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski

On Thu, Sep 27, 2018 at 04:48:39PM -0600, Tycho Andersen wrote:
> On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > > This patch introduces a means for syscalls matched in seccomp to notify
> > > some other task that a particular filter has been triggered.
> > >
> > > The motivation for this is primarily for use with containers. For example,
> > > if a container does an init_module(), we obviously don't want to load this
> > > untrusted code, which may be compiled for the wrong version of the kernel
> > > anyway. Instead, we could parse the module image, figure out which module
> > > the container is trying to load and load it on the host.
> > >
> > > As another example, containers cannot mknod(), since this checks
> > > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > > coding some whitelist in the kernel. Another example is mount(), which has
> > > many security restrictions for good reason, but configuration or runtime
> > > knowledge could potentially be used to relax these restrictions.
> > >
> > > This patch adds functionality that is already possible via at least two
> > > other means that I know about, both of which involve ptrace(): first, one
> > > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > > Unfortunately this is slow, so a faster version would be to install a
> > > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > > Since ptrace allows only one tracer, if the container runtime is that
> > > tracer, users inside the container (or outside) trying to debug it will not
> > > be able to use ptrace, which is annoying. It also means that older
> > > distributions based on Upstart cannot boot inside containers using ptrace,
> > > since upstart itself uses ptrace to start services.
> > >
> > > The actual implementation of this is fairly small, although getting the
> > > synchronization right was/is slightly complex.
> > >
> > > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > > memory data from the task still applies here, but can be avoided with
> > > careful design of the userspace handler: if the userspace handler reads all
> > > of the task memory that is necessary before applying its security policy,
> > > the tracee's subsequent memory edits will not be read by the tracer.
> > >
> > > v2: * make id a u64; the idea here being that it will never overflow,
> > >       because 64 is huge (one syscall every nanosecond => wrap every 584
> > >       years) (Andy)
> > >     * prevent nesting of user notifications: if someone is already attached
> > >       the tree in one place, nobody else can attach to the tree (Andy)
> > >     * notify the listener of signals the tracee receives as well (Andy)
> > >     * implement poll
> > > v3: * lockdep fix (Oleg)
> > >     * drop unnecessary WARN()s (Christian)
> > > >     * rearrange error returns to be more pretty (Christian)
> > >     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> > > v4: * fix implementation of poll to use poll_wait() (Jann)
> > >     * change listener's fd flags to be 0 (Jann)
> > >     * hoist filter initialization out of ifdefs to its own function
> > >       init_user_notification()
> > >     * add some more testing around poll() and closing the listener while a
> > >       syscall is in action
> > >     * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
> > >       creates a new one (Matthew)
> > >     * correctly handle pid namespaces, add some testcases (Matthew)
> > >     * use EINPROGRESS instead of EINVAL when a notification response is
> > >       written twice (Matthew)
> > >     * fix comment typo from older version (SEND vs READ) (Matthew)
> > >     * whitespace and logic simplification (Tobin)
> > >     * add some Documentation/ bits on userspace trapping
> > > v5: * fix documentation typos (Jann)
> > >     * add signalled field to struct seccomp_notif (Jann)
> > >     * switch to using ioctls instead of read()/write() for struct passing
> > >       (Jann)
> > >     * add an ioctl to ensure an id is still valid
> > > v6: * docs typo fixes, update docs for ioctl() change (Christian)
> > > v7: * switch struct seccomp_knotif's id member to a u64 (derp :)
> > >     * use notify_lock in IS_ID_VALID query to avoid racing
> > >     * s/signalled/signaled (Tyler)
> > >     * fix docs to reflect that ids are not globally unique (Tyler)
> > >     * add a test to check -ERESTARTSYS behavior (Tyler)
> > >     * drop CONFIG_SECCOMP_USER_NOTIFICATION (Tyler)
> > >     * reorder USER_NOTIF in seccomp return codes list (Tyler)
> > >     * return size instead of sizeof(struct user_notif) (Tyler)
> > >     * ENOENT instead of EINVAL when invalid id is passed (Tyler)
> > >     * drop CONFIG_SECCOMP_USER_NOTIFICATION guards (Tyler)
> > >     * s/IS_ID_VALID/ID_VALID and switch ioctl to be "well behaved" (Tyler)
> > >     * add a new struct notification to minimize the additions to
> > >       struct seccomp_filter, also pack the necessary additions a bit more
> > >       cleverly (Tyler)
> > >     * switch to keeping track of the task itself instead of the pid (we'll
> > >       use this for implementing PUT_FD)
> > 
> > Patch-sending nit: can you put the versioning below the "---" line so
> > it isn't included in the final commit? (And I normally read these
> > backwards, so I'd expect v7 at the top, but that's not a big deal. I
> > mean... neither is the --- thing, but it makes "git am" easier for me
> > since I don't have to go edit the versioning out of the log.)
> 
> Sure, will do.
> 
> > > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > > index 9efc0e73d50b..d4ccb32fe089 100644
> > > --- a/include/uapi/linux/seccomp.h
> > > +++ b/include/uapi/linux/seccomp.h
> > > @@ -17,9 +17,10 @@
> > >  #define SECCOMP_GET_ACTION_AVAIL       2
> > >
> > >  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> > > -#define SECCOMP_FILTER_FLAG_TSYNC      (1UL << 0)
> > > -#define SECCOMP_FILTER_FLAG_LOG                (1UL << 1)
> > > -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
> > > +#define SECCOMP_FILTER_FLAG_TSYNC              (1UL << 0)
> > > +#define SECCOMP_FILTER_FLAG_LOG                        (1UL << 1)
> > > +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW         (1UL << 2)
> > > +#define SECCOMP_FILTER_FLAG_NEW_LISTENER       (1UL << 3)
> > 
> > Since these are all getting indentation updates, can you switch them
> > to BIT(0), BIT(1), etc?
> 
> Will do.
> 
> > >  /*
> > >   * All BPF programs must return a 32-bit value.
> > > @@ -35,6 +36,7 @@
> > >  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
> > >  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
> > >  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> > > +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
> > >  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
> > >  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
> > >  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> > > @@ -60,4 +62,29 @@ struct seccomp_data {
> > >         __u64 args[6];
> > >  };
> > >
> > > +struct seccomp_notif {
> > > +       __u16 len;
> > > +       __u64 id;
> > > +       __u32 pid;
> > > +       __u8 signaled;
> > > +       struct seccomp_data data;
> > > +};
> > > +
> > > +struct seccomp_notif_resp {
> > > +       __u16 len;
> > > +       __u64 id;
> > > +       __s32 error;
> > > +       __s64 val;
> > > +};
> > 
> > So, len has to come first, for versioning. However, since it's ahead
> > of a u64, this leaves a struct padding hole. pahole output:
> > 
> > struct seccomp_notif {
> >         __u16                      len;                  /*     0     2 */
> > 
> >         /* XXX 6 bytes hole, try to pack */
> > 
> >         __u64                      id;                   /*     8     8 */
> >         __u32                      pid;                  /*    16     4 */
> >         __u8                       signaled;             /*    20     1 */
> > 
> >         /* XXX 3 bytes hole, try to pack */
> > 
> >         struct seccomp_data        data;                 /*    24    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> > 
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* sum members: 79, holes: 2, sum holes: 9 */
> >         /* last cacheline: 24 bytes */
> > };
> > struct seccomp_notif_resp {
> >         __u16                      len;                  /*     0     2 */
> > 
> >         /* XXX 6 bytes hole, try to pack */
> > 
> >         __u64                      id;                   /*     8     8 */
> >         __s32                      error;                /*    16     4 */
> > 
> >         /* XXX 4 bytes hole, try to pack */
> > 
> >         __s64                      val;                  /*    24     8 */
> > 
> >         /* size: 32, cachelines: 1, members: 4 */
> >         /* sum members: 22, holes: 2, sum holes: 10 */
> >         /* last cacheline: 32 bytes */
> > };
> > 
> > How about making len u32, and moving pid and error above "id"? This
> > leaves a hole after signaled, so changing "len" won't be sufficient
> > for versioning here. Perhaps move it after data?
> 
> I'm not sure what you mean by "len won't be sufficient for versioning
> here"? Anyway, I can do some packing on these; I didn't bother before
> since I figured it's a userspace interface, so saving a few bytes
> isn't a huge deal.
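
To make the packing question concrete, here is a minimal userspace sketch, assuming a typical Linux ABI. The ex_* names are hypothetical stand-ins and not the uapi structs; the "packed" layouts are only one reading of the suggestion above (32-bit len, pid/error moved above id, signaled moved after data), not what the series will necessarily adopt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct seccomp_data (64 bytes in the pahole output above). */
struct ex_data {
	int32_t nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

static_assert(sizeof(struct ex_data) == 64, "stand-in should match pahole");

/* Response layout as posted in v7: a u16 ahead of a u64 leaves a hole. */
struct ex_resp_v7 {
	uint16_t len;
	uint64_t id;
	int32_t error;
	int64_t val;
};

/* Hypothetical repacking: 32-bit len, error moved above id. */
struct ex_resp_packed {
	uint32_t len;
	int32_t error;
	uint64_t id;
	int64_t val;
};

/* Hypothetical repacking of the notification: signaled moved after data. */
struct ex_notif_packed {
	uint32_t len;
	uint32_t pid;
	uint64_t id;
	struct ex_data data;
	uint8_t signaled;
};

/* Padding bytes = total size minus the sum of the member sizes. */
static size_t ex_resp_v7_padding(void)
{
	return sizeof(struct ex_resp_v7) -
	       (sizeof(uint16_t) + sizeof(uint64_t) +
		sizeof(int32_t) + sizeof(int64_t));
}

static size_t ex_resp_packed_padding(void)
{
	return sizeof(struct ex_resp_packed) -
	       (sizeof(uint32_t) + sizeof(int32_t) +
		sizeof(uint64_t) + sizeof(int64_t));
}

/* Interior holes only: bytes before the last member not used by members. */
static size_t ex_notif_packed_holes(void)
{
	return offsetof(struct ex_notif_packed, signaled) -
	       (sizeof(uint32_t) + sizeof(uint32_t) +
		sizeof(uint64_t) + sizeof(struct ex_data));
}
```

Note that ex_notif_packed still carries tail padding after signaled on ABIs where u64 is 8-byte aligned, so anything copied out to userspace would still need those bytes zeroed explicitly.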
> 
> > > +
> > > +#define SECCOMP_IOC_MAGIC              0xF7
> > 
> > Was there any specific reason for picking this value? There are lots
> > of fun ASCII code left like '!' or '*'. :)
> 
> No, ! it is :)
> 
> > > +
> > > +/* Flags for seccomp notification fd ioctl. */
> > > +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> > > +                                       struct seccomp_notif)
> > > +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> > > +                                       struct seccomp_notif_resp)
> > > +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> > > +                                       __u64)
> > 
> > To match other UAPI ioctls, can these have a prefix of "SECCOMP_IOCTL_..."?
> > 
> > It may also be useful to match how other uapis do this, like for DRM:
> > 
> > #define DRM_IOCTL_BASE                  'd'
> > #define DRM_IO(nr)                      _IO(DRM_IOCTL_BASE,nr)
> > #define DRM_IOR(nr,type)                _IOR(DRM_IOCTL_BASE,nr,type)
> > #define DRM_IOW(nr,type)                _IOW(DRM_IOCTL_BASE,nr,type)
> > #define DRM_IOWR(nr,type)               _IOWR(DRM_IOCTL_BASE,nr,type)
> > 
> > #define DRM_IOCTL_VERSION               DRM_IOWR(0x00, struct drm_version)
> > #define DRM_IOCTL_GET_UNIQUE            DRM_IOWR(0x01, struct drm_unique)
> > #define DRM_IOCTL_GET_MAGIC             DRM_IOR( 0x02, struct drm_auth)
> > ...
> 
> Will do.
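
For reference, a DRM-style wrapper set for these ioctls might look roughly like the sketch below. Everything prefixed EX_/ex_ is hypothetical (including the payload structs, whose layout is still being discussed above); only the '!' magic and the command numbers 0..2 come from this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>

/* Hypothetical payloads, only here so the size encoding has something to use. */
struct ex_notif {
	uint32_t len;
	uint32_t pid;
	uint64_t id;
};

struct ex_notif_resp {
	uint32_t len;
	int32_t error;
	uint64_t id;
	int64_t val;
};

/* One base character plus per-direction wrappers, as in the DRM header. */
#define EX_SECCOMP_IOC_MAGIC		'!'
#define EX_SECCOMP_IO(nr)		_IO(EX_SECCOMP_IOC_MAGIC, nr)
#define EX_SECCOMP_IOR(nr, type)	_IOR(EX_SECCOMP_IOC_MAGIC, nr, type)
#define EX_SECCOMP_IOW(nr, type)	_IOW(EX_SECCOMP_IOC_MAGIC, nr, type)
#define EX_SECCOMP_IOWR(nr, type)	_IOWR(EX_SECCOMP_IOC_MAGIC, nr, type)

/* The three commands from the patch, renamed with an IOCTL_ prefix. */
#define EX_SECCOMP_IOCTL_NOTIF_RECV	EX_SECCOMP_IOWR(0, struct ex_notif)
#define EX_SECCOMP_IOCTL_NOTIF_SEND	EX_SECCOMP_IOWR(1, struct ex_notif_resp)
#define EX_SECCOMP_IOCTL_NOTIF_ID_VALID	EX_SECCOMP_IOR(2, uint64_t)

/* Decode helper: true if cmd belongs to the (hypothetical) seccomp range. */
static int ex_is_seccomp_ioctl(unsigned long cmd)
{
	return _IOC_TYPE(cmd) == EX_SECCOMP_IOC_MAGIC && _IOC_NR(cmd) <= 2;
}
```

Since the payload size is encoded in the command number, any later repacking of the structs changes the ioctl values as well.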
> 
> > 
> > > +
> > >  #endif /* _UAPI_LINUX_SECCOMP_H */
> > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > index fd023ac24e10..fa6fe9756c80 100644
> > > --- a/kernel/seccomp.c
> > > +++ b/kernel/seccomp.c
> > > @@ -33,12 +33,78 @@
> > >  #endif
> > >
> > >  #ifdef CONFIG_SECCOMP_FILTER
> > > +#include <linux/file.h>
> > >  #include <linux/filter.h>
> > >  #include <linux/pid.h>
> > >  #include <linux/ptrace.h>
> > >  #include <linux/security.h>
> > >  #include <linux/tracehook.h>
> > >  #include <linux/uaccess.h>
> > > +#include <linux/anon_inodes.h>
> > > +
> > > +enum notify_state {
> > > +       SECCOMP_NOTIFY_INIT,
> > > +       SECCOMP_NOTIFY_SENT,
> > > +       SECCOMP_NOTIFY_REPLIED,
> > > +};
> > > +
> > > +struct seccomp_knotif {
> > > +       /* The struct pid of the task whose filter triggered the notification */
> > > +       struct task_struct *task;
> > > +
> > > +       /* The "cookie" for this request; this is unique for this filter. */
> > > +       u64 id;
> > > +
> > > +       /* Whether or not this task has been given an interruptible signal. */
> > > +       bool signaled;
> > > +
> > > +       /*
> > > +        * The seccomp data. This pointer is valid the entire time this
> > > +        * notification is active, since it comes from __seccomp_filter which
> > > +        * eclipses the entire lifecycle here.
> > > +        */
> > > +       const struct seccomp_data *data;
> > > +
> > > +       /*
> > > +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> > > +        * struct seccomp_knotif is created and starts out in INIT. Once the
> > > +        * handler reads the notification off of an FD, it transitions to SENT.
> > > +        * If a signal is received the state transitions back to INIT and
> > > +        * another message is sent. When the userspace handler replies, state
> > > +        * transitions to REPLIED.
> > > +        */
> > > +       enum notify_state state;
> > > +
> > > +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> > > +       int error;
> > > +       long val;
> > > +
> > > +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> > > +       struct completion ready;
> > > +
> > > +       struct list_head list;
> > > +};
> > > +
> > > +/**
> > > + * struct notification - container for seccomp userspace notifications. Since
> > > + * most seccomp filters will not have notification listeners attached and this
> > > + * structure is fairly large, we store the notification-specific stuff in a
> > > + * separate structure.
> > > + *
> > > + * @request: A semaphore that users of this notification can wait on for
> > > + *           changes. Actual reads and writes are still controlled with
> > > + *           filter->notify_lock.
> > > + * @notify_lock: A lock for all notification-related accesses.
> > > + * @next_id: The id of the next request.
> > > + * @notifications: A list of struct seccomp_knotif elements.
> > > + * @wqh: A wait queue for poll.
> > > + */
> > > +struct notification {
> > > +       struct semaphore request;
> > > +       u64 next_id;
> > > +       struct list_head notifications;
> > > +       wait_queue_head_t wqh;
> > > +};
> > >
> > >  /**
> > >   * struct seccomp_filter - container for seccomp BPF programs
> > > @@ -66,6 +132,8 @@ struct seccomp_filter {
> > >         bool log;
> > >         struct seccomp_filter *prev;
> > >         struct bpf_prog *prog;
> > > +       struct notification *notif;
> > > +       struct mutex notify_lock;
> > >  };
> > >
> > >  /* Limit any path through the tree to 256KB worth of instructions. */
> > > @@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
> > >         if (!sfilter)
> > >                 return ERR_PTR(-ENOMEM);
> > >
> > > +       mutex_init(&sfilter->notify_lock);
> > >         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
> > >                                         seccomp_check_filter, save_orig);
> > >         if (ret < 0) {
> > > @@ -556,11 +625,13 @@ static void seccomp_send_sigsys(int syscall, int reason)
> > >  #define SECCOMP_LOG_TRACE              (1 << 4)
> > >  #define SECCOMP_LOG_LOG                        (1 << 5)
> > >  #define SECCOMP_LOG_ALLOW              (1 << 6)
> > > +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
> > >
> > >  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
> > >                                     SECCOMP_LOG_KILL_THREAD  |
> > >                                     SECCOMP_LOG_TRAP  |
> > >                                     SECCOMP_LOG_ERRNO |
> > > +                                   SECCOMP_LOG_USER_NOTIF |
> > >                                     SECCOMP_LOG_TRACE |
> > >                                     SECCOMP_LOG_LOG;
> > >
> > > @@ -581,6 +652,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
> > >         case SECCOMP_RET_TRACE:
> > >                 log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
> > >                 break;
> > > +       case SECCOMP_RET_USER_NOTIF:
> > > +               log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> > > +               break;
> > >         case SECCOMP_RET_LOG:
> > >                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
> > >                 break;
> > > @@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
> > >  #else
> > >
> > >  #ifdef CONFIG_SECCOMP_FILTER
> > > +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> > > +{
> > > +       /* Note: overflow is ok here, the id just needs to be unique */
> > 
> > Maybe just clarify in the comment: unique to the filter.
> > 
> > > +       return filter->notif->next_id++;
> > 
> > Also, it might be useful to add for both documentation and lockdep:
> > 
> > lockdep_assert_held(&filter->notify_lock);
> > 
> > into this function?
> 
> Will do.
> 
> > 
> > > +}
> > > +
> > > +static void seccomp_do_user_notification(int this_syscall,
> > > +                                        struct seccomp_filter *match,
> > > +                                        const struct seccomp_data *sd)
> > > +{
> > > +       int err;
> > > +       long ret = 0;
> > > +       struct seccomp_knotif n = {};
> > > +
> > > +       mutex_lock(&match->notify_lock);
> > > +       err = -ENOSYS;
> > > +       if (!match->notif)
> > > +               goto out;
> > > +
> > > +       n.task = current;
> > > +       n.state = SECCOMP_NOTIFY_INIT;
> > > +       n.data = sd;
> > > +       n.id = seccomp_next_notify_id(match);
> > > +       init_completion(&n.ready);
> > > +
> > > +       list_add(&n.list, &match->notif->notifications);
> > > +       wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> > > +
> > > +       mutex_unlock(&match->notify_lock);
> > > +       up(&match->notif->request);
> > > +
> > 
> > Maybe add a big comment here saying this is where we're waiting for a reply?
> 
> Will do.
> 
> > > +       err = wait_for_completion_interruptible(&n.ready);
> > > +       mutex_lock(&match->notify_lock);
> > > +
> > > +       /*
> > > +        * Here it's possible we got a signal and then had to wait on the mutex
> > > +        * while the reply was sent, so let's be sure there wasn't a response
> > > +        * in the meantime.
> > > +        */
> > > +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > > +               /*
> > > +                * We got a signal. Let's tell userspace about it (potentially
> > > +                * again, if we had already notified them about the first one).
> > > +                */
> > > +               n.signaled = true;
> > > +               if (n.state == SECCOMP_NOTIFY_SENT) {
> > > +                       n.state = SECCOMP_NOTIFY_INIT;
> > > +                       up(&match->notif->request);
> > > +               }
> > > +               mutex_unlock(&match->notify_lock);
> > > +               err = wait_for_completion_killable(&n.ready);
> > > +               mutex_lock(&match->notify_lock);
> > > +               if (err < 0)
> > > +                       goto remove_list;
> > > +       }
> > > +
> > > +       ret = n.val;
> > > +       err = n.error;
> > > +
> > > +remove_list:
> > > +       list_del(&n.list);
> > > +out:
> > > +       mutex_unlock(&match->notify_lock);
> > > +       syscall_set_return_value(current, task_pt_regs(current),
> > > +                                err, ret);
> > > +}
> > > +
> > >  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> > >                             const bool recheck_after_trace)
> > >  {
> > > @@ -728,6 +869,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> > >
> > >                 return 0;
> > >
> > > +       case SECCOMP_RET_USER_NOTIF:
> > > +               seccomp_do_user_notification(this_syscall, match, sd);
> > > +               goto skip;
> > 
> > Nit: please add a blank line here (to match the other cases).
> > 
> > >         case SECCOMP_RET_LOG:
> > >                 seccomp_log(this_syscall, 0, action, true);
> > >                 return 0;
> > > @@ -834,6 +978,9 @@ static long seccomp_set_mode_strict(void)
> > >  }
> > >
> > >  #ifdef CONFIG_SECCOMP_FILTER
> > > +static struct file *init_listener(struct task_struct *,
> > > +                                 struct seccomp_filter *);
> > 
> > Why is the forward declaration needed instead of just moving the
> > function here? I didn't see anything in it that looked like it
> > couldn't move.
> 
> I think there was a cycle in some earlier version, but I agree there
> isn't now. I'll fix it.
> 
> > > +
> > >  /**
> > >   * seccomp_set_mode_filter: internal function for setting seccomp filter
> > >   * @flags:  flags to change filter behavior
> > > @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > >         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> > >         struct seccomp_filter *prepared = NULL;
> > >         long ret = -EINVAL;
> > > +       int listener = 0;
> > 
> > Nit: "invalid fd" should be -1, not 0.
> > 
> > > +       struct file *listener_f = NULL;
> > >
> > >         /* Validate flags. */
> > >         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > > @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > >         if (IS_ERR(prepared))
> > >                 return PTR_ERR(prepared);
> > >
> > > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > > +               listener = get_unused_fd_flags(0);
> > 
> > As with the other place pointed out by Jann, this should maybe be O_CLOEXEC too?
> 
> Yep, will do.
> 
> > > +               if (listener < 0) {
> > > +                       ret = listener;
> > > +                       goto out_free;
> > > +               }
> > > +
> > > +               listener_f = init_listener(current, prepared);
> > > +               if (IS_ERR(listener_f)) {
> > > +                       put_unused_fd(listener);
> > > +                       ret = PTR_ERR(listener_f);
> > > +                       goto out_free;
> > > +               }
> > > +       }
> > > +
> > >         /*
> > >          * Make sure we cannot change seccomp or nnp state via TSYNC
> > >          * while another thread is in the middle of calling exec.
> > >          */
> > >         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> > >             mutex_lock_killable(&current->signal->cred_guard_mutex))
> > > -               goto out_free;
> > > +               goto out_put_fd;
> > >
> > >         spin_lock_irq(&current->sighand->siglock);
> > >
> > > @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > >         spin_unlock_irq(&current->sighand->siglock);
> > >         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> > >                 mutex_unlock(&current->signal->cred_guard_mutex);
> > > +out_put_fd:
> > > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > > +               if (ret < 0) {
> > > +                       fput(listener_f);
> > > +                       put_unused_fd(listener);
> > > +               } else {
> > > +                       fd_install(listener, listener_f);
> > > +                       ret = listener;
> > > +               }
> > > +       }
> > 
> > Can you update the kern-docs for seccomp_set_mode_filter(), since we
> > can now return positive values?
> > 
> >  * Returns 0 on success or -EINVAL on failure.
> > 
> > (this shouldn't say only -EINVAL, I realize too)
> 
> Sure, I can fix both of these.
> 
> > I have to say, I'm vaguely nervous about changing the semantics here
> > for passing back the fd as the return code from the seccomp() syscall.
> > Alternatives seem less appealing, though: changing the meaning of the
> > uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> > example. Hmm.
> 
> From my perspective we can drop this whole thing. The only thing I'll
> ever use is the ptrace version. Someone at some point (I don't
> remember who, maybe stgraber) suggested this version would be useful
> as well.

So I think we want to have the ability to get an fd via seccomp(),
especially if all we worry about is weird semantics. When we
discussed this we knew the whole patchset was going to be weird. :)

This is a seccomp feature so seccomp should - if feasible - equip you
with everything to use it in a meaningful way without having to go
through a different kernel api. I know ptrace and seccomp are
already connected but I still find this cleaner. :)

Another thing is that the container itself might be traced for some
reason while you still might want to get an fd out.

Also, I wonder what happens if you want to filter the ptrace() syscall
itself? Then you'd deadlock?

Also, it seems that getting an fd via ptrace requires CAP_SYS_ADMIN in
the initial user namespace (which I only just realized), whereas getting
the fd via seccomp() doesn't seem to.
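
To illustrate the seccomp()-only path, here is a rough sketch of a filter that traps a single syscall to the listener. It only builds the BPF program; with this series applied one would then pass it to seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog) and get the fd back as the return value. ex_build_notify_filter() and EX_FILTER_LEN are hypothetical names, and the fallback #define just repeats the value from patch 1/6.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* value proposed in patch 1/6 */
#endif

#define EX_FILTER_LEN 4

/*
 * Fill insns[] with a program that returns SECCOMP_RET_USER_NOTIF for
 * syscall number nr and SECCOMP_RET_ALLOW for everything else.
 * A real filter should also check seccomp_data.arch first; that is
 * left out to keep the sketch short.
 */
static size_t ex_build_notify_filter(int nr, struct sock_filter *insns)
{
	const struct sock_filter prog[EX_FILTER_LEN] = {
		/* A = seccomp_data.nr */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* if (A == nr) fall through, else skip one instruction */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	size_t i;

	for (i = 0; i < EX_FILTER_LEN; i++)
		insns[i] = prog[i];

	return EX_FILTER_LEN;
}
```

The caller would wrap the array in a struct sock_fprog with len = EX_FILTER_LEN before handing it to seccomp().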

> 
> Anyway, let me know if your nervousness outweighs this, I'm happy to
> drop it.
> 
> > > @@ -1342,3 +1520,259 @@ static int __init seccomp_sysctl_init(void)
> > >  device_initcall(seccomp_sysctl_init)
> > >
> > >  #endif /* CONFIG_SYSCTL */
> > > +
> > > +#ifdef CONFIG_SECCOMP_FILTER
> > > +static int seccomp_notify_release(struct inode *inode, struct file *file)
> > > +{
> > > +       struct seccomp_filter *filter = file->private_data;
> > > +       struct seccomp_knotif *knotif;
> > > +
> > > +       mutex_lock(&filter->notify_lock);
> > > +
> > > +       /*
> > > +        * If this file is being closed because e.g. the task who owned it
> > > +        * died, let's wake everyone up who was waiting on us.
> > > +        */
> > > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > > +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> > > +                       continue;
> > > +
> > > +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> > > +               knotif->error = -ENOSYS;
> > > +               knotif->val = 0;
> > > +
> > > +               complete(&knotif->ready);
> > > +       }
> > > +
> > > +       wake_up_all(&filter->notif->wqh);
> > > +       kfree(filter->notif);
> > > +       filter->notif = NULL;
> > > +       mutex_unlock(&filter->notify_lock);
> > 
> > It looks like that means nothing waiting on knotif->ready can access
> > filter->notif without rechecking it, yes?
> > 
> > e.g. in seccomp_do_user_notification() I see:
> > 
> >                         up(&match->notif->request);
> > 
> > I *think* this isn't reachable due to the test for n.state !=
> > SECCOMP_NOTIFY_REPLIED, though. Perhaps, just for sanity and because
> > it's not fast-path, we could add a WARN_ON() while checking for
> > unreplied signal death?
> > 
> >                 n.signaled = true;
> >                 if (n.state == SECCOMP_NOTIFY_SENT) {
> >                         n.state = SECCOMP_NOTIFY_INIT;
> >                         if (!WARN_ON(!match->notif))
> >                             up(&match->notif->request);
> >                 }
> >                 mutex_unlock(&match->notify_lock);
> 
> So this code path should actually be safe, since notify_lock is held
> throughout, as it is in the release handler. However, there is one just above
> it that is not, because we do:
> 
>         mutex_unlock(&match->notify_lock);
>         up(&match->notif->request);
> 
> When this was all a member of struct seccomp_filter the order didn't matter,
> but now it very much does, and I think you're right that these statements need
> to be reordered. There may be others, I'll check everything else as well.
> 
> > 
> > > +       __put_seccomp_filter(filter);
> > > +       return 0;
> > > +}
> > > +
> > > +static long seccomp_notify_recv(struct seccomp_filter *filter,
> > > +                               unsigned long arg)
> > > +{
> > > +       struct seccomp_knotif *knotif = NULL, *cur;
> > > +       struct seccomp_notif unotif = {};
> > > +       ssize_t ret;
> > > +       u16 size;
> > > +       void __user *buf = (void __user *)arg;
> > 
> > I'd prefer this casting happen in seccomp_notify_ioctl(). This keeps
> > anything from accidentally using "arg" directly here.
> 
> Will do.
> 
> > > +
> > > +       if (copy_from_user(&size, buf, sizeof(size)))
> > > +               return -EFAULT;
> > > +
> > > +       ret = down_interruptible(&filter->notif->request);
> > > +       if (ret < 0)
> > > +               return ret;
> > > +
> > > +       mutex_lock(&filter->notify_lock);
> > > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > > +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> > > +                       knotif = cur;
> > > +                       break;
> > > +               }
> > > +       }
> > > +
> > > +       /*
> > > +        * If we didn't find a notification, it could be that the task was
> > > +        * interrupted between the time we were woken and when we were able to
> > > +        * acquire the rw lock.
> > > +        */
> > > +       if (!knotif) {
> > > +               ret = -ENOENT;
> > > +               goto out;
> > > +       }
> > > +
> > > +       size = min_t(size_t, size, sizeof(unotif));
> > > +
> > 
> > It is possible (though unlikely given the type widths involved here)
> > for unotif = {} to not initialize padding, so I would recommend an
> > explicit memset(&unotif, 0, sizeof(unotif)) here.
> 
> Orly? I didn't know that, thanks.
> 
> > > +       unotif.len = size;
> > > +       unotif.id = knotif->id;
> > > +       unotif.pid = task_pid_vnr(knotif->task);
> > > +       unotif.signaled = knotif->signaled;
> > > +       unotif.data = *(knotif->data);
> > > +
> > > +       if (copy_to_user(buf, &unotif, size)) {
> > > +               ret = -EFAULT;
> > > +               goto out;
> > > +       }
> > > +
> > > +       ret = size;
> > > +       knotif->state = SECCOMP_NOTIFY_SENT;
> > > +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> > > +
> > > +
> > > +out:
> > > +       mutex_unlock(&filter->notify_lock);
> > 
> > Is there some way to rearrange the locking here to avoid holding the
> > mutex while doing copy_to_user() (which userspace could block with
> > userfaultfd, and then stall all the other notifications for this
> > filter)?
> 
> Yes, I don't think it'll cause any problems to release the lock earlier.
> 
> > > +       return ret;
> > > +}
> > > +
> > > +static long seccomp_notify_send(struct seccomp_filter *filter,
> > > +                               unsigned long arg)
> > > +{
> > > +       struct seccomp_notif_resp resp = {};
> > > +       struct seccomp_knotif *knotif = NULL;
> > > +       long ret;
> > > +       u16 size;
> > > +       void __user *buf = (void __user *)arg;
> > 
> > Same cast note as above.
> > 
> > > +
> > > +       if (copy_from_user(&size, buf, sizeof(size)))
> > > +               return -EFAULT;
> > > +       size = min_t(size_t, size, sizeof(resp));
> > > +       if (copy_from_user(&resp, buf, size))
> > > +               return -EFAULT;
> > 
> > For sanity checking on a double-read from userspace, please add:
> > 
> >     if (resp.len != size)
> >         return -EINVAL;
> 
> Won't that fail if sizeof(resp) < resp.len, because of the min_t()?
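
The interaction between the clamp and the suggested check can be shown with plain arithmetic. This is only a sketch: the ex_* helpers are hypothetical, and comparing against the unclamped first read is just one possible way to keep both the double-fetch check and forward compatibility, not necessarily what the next version will do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

static size_t ex_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * first_len:  len fetched by the first copy_from_user()
 * second_len: resp.len as seen after the second, full copy
 * ksize:      sizeof(resp) in the running kernel
 *
 * Both helpers return the number of bytes to copy, or -EINVAL.
 */

/* Check against the clamped size, as in the suggested snippet. */
static long ex_check_clamped(uint16_t first_len, uint16_t second_len,
			     size_t ksize)
{
	size_t size = ex_min(first_len, ksize);

	/* Rejects newer userspace whose struct is larger than ours. */
	if (second_len != size)
		return -EINVAL;
	return (long)size;
}

/* Check against the value from the first read instead. */
static long ex_check_unclamped(uint16_t first_len, uint16_t second_len,
			       size_t ksize)
{
	size_t size = ex_min(first_len, ksize);

	/* Only fails if userspace changed len between the two reads. */
	if (second_len != first_len)
		return -EINVAL;
	return (long)size;
}
```

With a 32-byte kernel struct and a 40-byte userspace struct, the clamped variant fails even though nothing changed between the reads, which is the case asked about above.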
> 
> > > +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> > > +                                   unsigned long arg)
> > > +{
> > > +       struct seccomp_knotif *knotif = NULL;
> > > +       void __user *buf = (void __user *)arg;
> > > +       u64 id;
> > > +       long ret;
> > > +
> > > +       if (copy_from_user(&id, buf, sizeof(id)))
> > > +               return -EFAULT;
> > > +
> > > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > > +       if (ret < 0)
> > > +               return ret;
> > > +
> > > +       ret = -1;
> > 
> > Isn't this EPERM? Shouldn't it be -ENOENT?
> 
> Yes, I wasn't thinking of errno here, I'll switch it.
> 
> > > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > > +               if (knotif->id == id) {
> > > +                       ret = 0;
> > > +                       goto out;
> > > +               }
> > > +       }
> > > +
> > > +out:
> > > +       mutex_unlock(&filter->notify_lock);
> > > +       return ret;
> > > +}
> > > +
> > > +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> > > +                                unsigned long arg)
> > > +{
> > > +       struct seccomp_filter *filter = file->private_data;
> > > +
> > > +       switch (cmd) {
> > > +       case SECCOMP_NOTIF_RECV:
> > > +               return seccomp_notify_recv(filter, arg);
> > > +       case SECCOMP_NOTIF_SEND:
> > > +               return seccomp_notify_send(filter, arg);
> > > +       case SECCOMP_NOTIF_ID_VALID:
> > > +               return seccomp_notify_id_valid(filter, arg);
> > > +       default:
> > > +               return -EINVAL;
> > > +       }
> > > +}
> > > +
> > > +static __poll_t seccomp_notify_poll(struct file *file,
> > > +                                   struct poll_table_struct *poll_tab)
> > > +{
> > > +       struct seccomp_filter *filter = file->private_data;
> > > +       __poll_t ret = 0;
> > > +       struct seccomp_knotif *cur;
> > > +
> > > +       poll_wait(file, &filter->notif->wqh, poll_tab);
> > > +
> > > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > > +       if (ret < 0)
> > > +               return ret;
> > > +
> > > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > > +               if (cur->state == SECCOMP_NOTIFY_INIT)
> > > +                       ret |= EPOLLIN | EPOLLRDNORM;
> > > +               if (cur->state == SECCOMP_NOTIFY_SENT)
> > > +                       ret |= EPOLLOUT | EPOLLWRNORM;
> > > +               if (ret & EPOLLIN && ret & EPOLLOUT)
> > 
> > My eyes! :) Can you wrap the bit operations in parens here?
> > 
> > > +                       break;
> > > +       }
> > 
> > Should POLLERR be handled here too? I don't quite see the conditions
> > that might be exposed? All the processes die for the filter, which
> > does what here?
> 
> I think it shouldn't do anything, because I was thinking of the semantics of
> poll() as "when a tracee does a syscall that matches, fire". So a task could
> start, never make a targeted syscall, and exit, and poll() shouldn't return a
> value. Maybe it's useful to write that down somewhere, though.
> 
> > > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> > > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), 0);
> > > +
> > > +       EXPECT_EQ(kill(pid, SIGKILL), 0);
> > > +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> > > +
> > > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), -1);
> > 
> > Please document SECCOMP_NOTIF_ID_VALID in seccomp_filter.rst. I had
> > been wondering what it's for, and now I see it's kind of an advisory
> > "is the other end still alive?" test.
> 
> Yes, in fact it's necessary for avoiding races. There's some comments in the
> sample code, but I'll update seccomp_filter.rst too.
> 
> > > +
> > > +       resp.id = req.id;
> > > +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> > > +       EXPECT_EQ(ret, -1);
> > > +       EXPECT_EQ(errno, ENOENT);
> > > +
> > > +       /*
> > > +        * Check that we get another notification about a signal in the middle
> > > +        * of a syscall.
> > > +        */
> > > +       pid = fork();
> > > +       ASSERT_GE(pid, 0);
> > > +
> > > +       if (pid == 0) {
> > > +               if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> > > +                       perror("signal");
> > > +                       exit(1);
> > > +               }
> > > +               ret = syscall(__NR_getpid);
> > > +               exit(ret != USER_NOTIF_MAGIC);
> > > +       }
> > > +
> > > +       ret = read_notif(listener, &req);
> > > +       EXPECT_EQ(ret, sizeof(req));
> > > +       EXPECT_EQ(errno, 0);
> > > +
> > > +       EXPECT_EQ(kill(pid, SIGUSR1), 0);
> > > +
> > > +       ret = read_notif(listener, &req);
> > > +       EXPECT_EQ(req.signaled, 1);
> > > +       EXPECT_EQ(ret, sizeof(req));
> > > +       EXPECT_EQ(errno, 0);
> > > +
> > > +       resp.len = sizeof(resp);
> > > +       resp.id = req.id;
> > > +       resp.error = -512; /* -ERESTARTSYS */
> > > +       resp.val = 0;
> > > +
> > > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> > > +
> > > +       ret = read_notif(listener, &req);
> > > +       resp.len = sizeof(resp);
> > > +       resp.id = req.id;
> > > +       resp.error = 0;
> > > +       resp.val = USER_NOTIF_MAGIC;
> > > +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> > 
> > I was slightly confused here: why have there been 3 reads? I was
> > expecting one notification for hitting getpid and one from catching a
> > signal. But in rereading, I see that NOTIF_RECV will return the most
> > recently unresponded notification, yes?
> 
> The three reads are:
> 
> 1. original syscall
> # send SIGUSR1
> 2. another notif with signaled set
> # respond with -ERESTARTSYS to make sure that works
> 3. this is the result of -ERESTARTSYS
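
The three reads can be traced with a toy model of the INIT/SENT/REPLIED states described in the patch comment. This is a userspace simulation with hypothetical ex_* names, not the kernel code; in particular, the rule that a reply is only accepted in the SENT state is an assumption, since seccomp_notify_send() is not fully quoted here.

```c
#include <assert.h>
#include <stdbool.h>

enum ex_state { EX_INIT, EX_SENT, EX_REPLIED };

struct ex_knotif {
	enum ex_state state;
	bool signaled;
	int reads;	/* how many times the listener received something */
};

/* Listener side of NOTIF_RECV: only INIT notifications are handed out. */
static bool ex_recv(struct ex_knotif *n)
{
	if (n->state != EX_INIT)
		return false;
	n->state = EX_SENT;
	n->reads++;
	return true;
}

/* The waiting task gets a signal before a reply arrived. */
static void ex_signal(struct ex_knotif *n)
{
	if (n->state == EX_REPLIED)
		return;
	n->signaled = true;
	if (n->state == EX_SENT)
		n->state = EX_INIT;	/* queue it for the listener again */
}

/* Listener side of NOTIF_SEND (assumed to require the SENT state). */
static bool ex_send(struct ex_knotif *n)
{
	if (n->state != EX_SENT)
		return false;
	n->state = EX_REPLIED;
	return true;
}

/* -ERESTARTSYS: the syscall re-enters the filter as a fresh notification. */
static void ex_restart(struct ex_knotif *n)
{
	n->state = EX_INIT;
	n->signaled = false;
}

/* Replays the test sequence and returns the number of reads. */
static int ex_three_reads(void)
{
	struct ex_knotif n = { .state = EX_INIT };

	ex_recv(&n);	/* 1. original syscall */
	ex_signal(&n);	/* SIGUSR1 arrives */
	ex_recv(&n);	/* 2. same request again, signaled set */
	ex_send(&n);	/* reply with -ERESTARTSYS */
	ex_restart(&n);
	ex_recv(&n);	/* 3. the restarted syscall */
	ex_send(&n);	/* final reply */

	return n.reads;
}
```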
> 
> > But... catching a signal replaces the existing seccomp_knotif? I
> > remain confused about how signal handling is meant to work here. What
> > happens if two signals get sent? It looks like you just block without
> > allowing more signals? (Thank you for writing the tests!)
> 
> Yes, that's the idea. This is an implementation of Andy's pseudocode:
> https://lkml.org/lkml/2018/3/15/1122
> 
> > (And can you document the expected behavior in the seccomp_filter.rst too?)
> 
> Will do.
> 
> > 
> > Looking good!
> 
> Thanks for your review!
> 
> Tycho

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-09-27 15:11 ` [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
                     ` (2 preceding siblings ...)
  2018-09-27 21:53   ` Kees Cook
@ 2018-10-08 15:16   ` Christian Brauner
  2018-10-08 15:33     ` Jann Horn
  2018-10-08 18:00     ` Tycho Andersen
  3 siblings, 2 replies; 91+ messages in thread
From: Christian Brauner @ 2018-10-08 15:16 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, Jann Horn, linux-api, containers, Akihiro Suda,
	Oleg Nesterov, linux-kernel, Eric W . Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski

On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> version which can acquire filters is useful. There are at least two reasons
> this is preferable, even though it uses ptrace:
> 
> 1. You can control tasks that aren't cooperating with you
> 2. You can control tasks whose filters block sendmsg() and socket(); if the
>    task installs a filter which blocks these calls, there's no way with
>    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.

So for the slow of mind aka me:
I'm not sure I completely understand this problem. Can you outline how
sendmsg() and socket() are involved in this?

I'm also not sure that this holds (but I might misunderstand the
problem). Afaict, you could try to get the fd out via CLONE_FILES and
other means, so something like:

// let's pretend the libc wrapper for clone actually has sane semantics
pid = clone(CLONE_FILES);
if (pid == 0) {
        fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);

        // Now this fd will be valid in both parent and child.
        // If you haven't blocked it you can inform the parent what
        // the fd number is via pipe2(). If you have blocked it you can
        // use dup2() and dup to a known fd number.
}
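
A slightly more complete version of the same idea, as a sketch only: it uses the raw clone syscall to get fork-like semantics with a shared fd table, and a /dev/null fd stands in for the listener fd since the seccomp() flag is not available without this series. EX_KNOWN_FD and ex_share_fd_via_clone_files() are hypothetical names, and the raw argument order assumed here (flags first) is not the same on every architecture.

```c
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_FILES
#define CLONE_FILES 0x00000400 /* only visible from <sched.h> with _GNU_SOURCE */
#endif

#define EX_KNOWN_FD 100 /* hypothetical fd number both sides agree on */

/* Returns 0 if EX_KNOWN_FD became valid in the parent, -1 otherwise. */
static int ex_share_fd_via_clone_files(void)
{
	int status;
	long pid;

	/* Fork-like clone: separate memory, shared file descriptor table. */
	pid = syscall(SYS_clone, CLONE_FILES | SIGCHLD, 0, 0, 0, 0);
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Stand-in for the seccomp() call returning a listener fd. */
		int fd = open("/dev/null", O_RDONLY);

		if (fd < 0 || dup2(fd, EX_KNOWN_FD) < 0)
			_exit(1);
		close(fd);
		_exit(0);
	}

	if (waitpid((pid_t)pid, &status, 0) != pid)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	/* The child is gone, but the fd it installed is still in our table. */
	return fcntl(EX_KNOWN_FD, F_GETFD) < 0 ? -1 : 0;
}
```

The parent never receives anything over a socket or pipe here; it simply finds the fd at the agreed number once the child has exited.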

> 
> v2: fix a bug where listener mode was not unset when an unused fd was not
>     available
> v3: fix refcounting bug (Oleg)
> v4: * change the listener's fd flags to be 0
>     * rename GET_LISTENER to NEW_LISTENER (Matthew)
> v5: * add capable(CAP_SYS_ADMIN) requirement
> v7: * point the new listener at the right filter (Jann)
> 
> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  include/linux/seccomp.h                       |  7 ++
>  include/uapi/linux/ptrace.h                   |  2 +
>  kernel/ptrace.c                               |  4 ++
>  kernel/seccomp.c                              | 31 +++++++++
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 68 +++++++++++++++++++
>  5 files changed, 112 insertions(+)
> 
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 017444b5efed..234c61b37405 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -83,6 +83,8 @@ static inline int seccomp_mode(struct seccomp *s)
>  #ifdef CONFIG_SECCOMP_FILTER
>  extern void put_seccomp_filter(struct task_struct *tsk);
>  extern void get_seccomp_filter(struct task_struct *tsk);
> +extern long seccomp_new_listener(struct task_struct *task,
> +				 unsigned long filter_off);
>  #else  /* CONFIG_SECCOMP_FILTER */
>  static inline void put_seccomp_filter(struct task_struct *tsk)
>  {
> @@ -92,6 +94,11 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
>  {
>  	return;
>  }
> +static inline long seccomp_new_listener(struct task_struct *task,
> +					unsigned long filter_off)
> +{
> +	return -EINVAL;
> +}
>  #endif /* CONFIG_SECCOMP_FILTER */
>  
>  #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index d5a1b8a492b9..e80ecb1bd427 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -73,6 +73,8 @@ struct seccomp_metadata {
>  	__u64 flags;		/* Output: filter's flags */
>  };
>  
> +#define PTRACE_SECCOMP_NEW_LISTENER	0x420e
> +
>  /* Read signals from a shared (process wide) queue */
>  #define PTRACE_PEEKSIGINFO_SHARED	(1 << 0)
>  
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 21fec73d45d4..289960ac181b 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
>  		ret = seccomp_get_metadata(child, addr, datavp);
>  		break;
>  
> +	case PTRACE_SECCOMP_NEW_LISTENER:
> +		ret = seccomp_new_listener(child, addr);
> +		break;
> +
>  	default:
>  		break;
>  	}
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 44a31ac8373a..17685803a2af 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
>  
>  	return ret;
>  }
> +
> +long seccomp_new_listener(struct task_struct *task,
> +			  unsigned long filter_off)
> +{
> +	struct seccomp_filter *filter;
> +	struct file *listener;
> +	int fd;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EACCES;

I know this might have been discussed a while back but why exactly do we
require CAP_SYS_ADMIN in init_userns and not in the target userns? What
if I want to do a setns(fd, CLONE_NEWUSER) to the target process and
use ptrace from in there?

> +
> +	filter = get_nth_filter(task, filter_off);
> +	if (IS_ERR(filter))
> +		return PTR_ERR(filter);
> +
> +	fd = get_unused_fd_flags(0);
> +	if (fd < 0) {
> +		__put_seccomp_filter(filter);
> +		return fd;
> +	}
> +
> +	listener = init_listener(task, filter);
> +	__put_seccomp_filter(filter);
> +	if (IS_ERR(listener)) {
> +		put_unused_fd(fd);
> +		return PTR_ERR(listener);
> +	}
> +
> +	fd_install(fd, listener);
> +	return fd;
> +}
>  #endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 5f4b836a6792..c6ba3ed5392e 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -193,6 +193,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
>  }
>  #endif
>  
> +#ifndef PTRACE_SECCOMP_NEW_LISTENER
> +#define PTRACE_SECCOMP_NEW_LISTENER 0x420e
> +#endif
> +
>  #if __BYTE_ORDER == __LITTLE_ENDIAN
>  #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
>  #elif __BYTE_ORDER == __BIG_ENDIAN
> @@ -3175,6 +3179,70 @@ TEST(get_user_notification_syscall)
>  	EXPECT_EQ(0, WEXITSTATUS(status));
>  }
>  
> +TEST(get_user_notification_ptrace)
> +{
> +	pid_t pid;
> +	int status, listener;
> +	int sk_pair[2];
> +	char c;
> +	struct seccomp_notif req = {};
> +	struct seccomp_notif_resp resp = {};
> +
> +	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> +	pid = fork();
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +		/* Test that we get ENOSYS while not attached */
> +		EXPECT_EQ(syscall(__NR_getpid), -1);
> +		EXPECT_EQ(errno, ENOSYS);
> +
> +		/* Signal we're ready and have installed the filter. */
> +		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
> +
> +		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +		EXPECT_EQ(c, 'H');
> +
> +		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
> +	}
> +
> +	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> +	EXPECT_EQ(c, 'J');
> +
> +	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
> +	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> +	EXPECT_GE(listener, 0);
> +
> +	/* EBUSY for second listener */
> +	EXPECT_EQ(ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0), -1);
> +	EXPECT_EQ(errno, EBUSY);
> +
> +	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
> +
> +	/* Now signal we are done and respond with magic */
> +	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +
> +	req.len = sizeof(req);
> +	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +
> +	resp.len = sizeof(resp);
> +	resp.id = req.id;
> +	resp.error = 0;
> +	resp.val = USER_NOTIF_MAGIC;
> +
> +	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +	close(listener);
> +}
> +
>  /*
>   * Check that a pid in a child namespace still shows up as valid in ours.
>   */
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-08 15:16   ` Christian Brauner
@ 2018-10-08 15:33     ` Jann Horn
  2018-10-08 16:21       ` Christian Brauner
  2018-10-08 18:00     ` Tycho Andersen
  1 sibling, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-10-08 15:33 UTC (permalink / raw)
  To: christian
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski

On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
>
> On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > version which can acquire filters is useful. There are at least two reasons
> > this is preferable, even though it uses ptrace:
> >
> > 1. You can control tasks that aren't cooperating with you
> > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> >    task installs a filter which blocks these calls, there's no way with
> >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
>
> So for the slow of mind aka me:
> I'm not sure I completely understand this problem. Can you outline how
> sendmsg() and socket() are involved in this?
>
> I'm also not sure that this holds (but I might misunderstand the
> problem). Afaict, you could try to get the fd out via CLONE_FILES and
> other means, so something like:
>
> // let's pretend the libc wrapper for clone actually has sane semantics
> pid = clone(CLONE_FILES);
> if (pid == 0) {
>         fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
>
>         // Now this fd will be valid in both parent and child.
>         // If you haven't blocked it you can inform the parent what
>         // the fd number is via pipe2(). If you have blocked it you can
>         // use dup2() and dup to a known fd number.
> }
>
> >
> > v2: fix a bug where listener mode was not unset when an unused fd was not
> >     available
> > v3: fix refcounting bug (Oleg)
> > v4: * change the listener's fd flags to be 0
> >     * rename GET_LISTENER to NEW_LISTENER (Matthew)
> > v5: * add capable(CAP_SYS_ADMIN) requirement
> > v7: * point the new listener at the right filter (Jann)
> >
> > Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> > CC: Kees Cook <keescook@chromium.org>
> > CC: Andy Lutomirski <luto@amacapital.net>
> > CC: Oleg Nesterov <oleg@redhat.com>
> > CC: Eric W. Biederman <ebiederm@xmission.com>
> > CC: "Serge E. Hallyn" <serge@hallyn.com>
> > CC: Christian Brauner <christian.brauner@ubuntu.com>
> > CC: Tyler Hicks <tyhicks@canonical.com>
> > CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> > ---
> >  include/linux/seccomp.h                       |  7 ++
> >  include/uapi/linux/ptrace.h                   |  2 +
> >  kernel/ptrace.c                               |  4 ++
> >  kernel/seccomp.c                              | 31 +++++++++
> >  tools/testing/selftests/seccomp/seccomp_bpf.c | 68 +++++++++++++++++++
> >  5 files changed, 112 insertions(+)
> >
> > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> > index 017444b5efed..234c61b37405 100644
> > --- a/include/linux/seccomp.h
> > +++ b/include/linux/seccomp.h
> > @@ -83,6 +83,8 @@ static inline int seccomp_mode(struct seccomp *s)
> >  #ifdef CONFIG_SECCOMP_FILTER
> >  extern void put_seccomp_filter(struct task_struct *tsk);
> >  extern void get_seccomp_filter(struct task_struct *tsk);
> > +extern long seccomp_new_listener(struct task_struct *task,
> > +                              unsigned long filter_off);
> >  #else  /* CONFIG_SECCOMP_FILTER */
> >  static inline void put_seccomp_filter(struct task_struct *tsk)
> >  {
> > @@ -92,6 +94,11 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
> >  {
> >       return;
> >  }
> > +static inline long seccomp_new_listener(struct task_struct *task,
> > +                                     unsigned long filter_off)
> > +{
> > +     return -EINVAL;
> > +}
> >  #endif /* CONFIG_SECCOMP_FILTER */
> >
> >  #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> > diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> > index d5a1b8a492b9..e80ecb1bd427 100644
> > --- a/include/uapi/linux/ptrace.h
> > +++ b/include/uapi/linux/ptrace.h
> > @@ -73,6 +73,8 @@ struct seccomp_metadata {
> >       __u64 flags;            /* Output: filter's flags */
> >  };
> >
> > +#define PTRACE_SECCOMP_NEW_LISTENER  0x420e
> > +
> >  /* Read signals from a shared (process wide) queue */
> >  #define PTRACE_PEEKSIGINFO_SHARED    (1 << 0)
> >
> > diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> > index 21fec73d45d4..289960ac181b 100644
> > --- a/kernel/ptrace.c
> > +++ b/kernel/ptrace.c
> > @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
> >               ret = seccomp_get_metadata(child, addr, datavp);
> >               break;
> >
> > +     case PTRACE_SECCOMP_NEW_LISTENER:
> > +             ret = seccomp_new_listener(child, addr);
> > +             break;
> > +
> >       default:
> >               break;
> >       }
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 44a31ac8373a..17685803a2af 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> >
> >       return ret;
> >  }
> > +
> > +long seccomp_new_listener(struct task_struct *task,
> > +                       unsigned long filter_off)
> > +{
> > +     struct seccomp_filter *filter;
> > +     struct file *listener;
> > +     int fd;
> > +
> > +     if (!capable(CAP_SYS_ADMIN))
> > +             return -EACCES;
>
> I know this might have been discussed a while back but why exactly do we
> require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> if I want to do a setns(fd, CLONE_NEWUSER) to the target process and
> use ptrace from in there?

See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
. Basically, the problem is that this doesn't just give you capability
over the target task, but also over every other task that has the same
filter installed; you need some sort of "is the caller capable over
the filter and anyone who uses it" check.
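
A small userspace sketch of why one filter can end up attached to more than
one task, assuming only the long-standing rule that seccomp filters are
inherited across fork(). The helper names are hypothetical and the filter is a
trivial allow-everything program; nothing here uses the notification API from
this series. The grandchild never installs a filter itself, yet it reports
filter mode because it inherited the one its parent installed.

```c
#define _GNU_SOURCE
#include <assert.h>	/* for the caller's checks */
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parse the "Seccomp:" field of /proc/self/status; -1 on failure. */
static int seccomp_mode_of_self(void)
{
	FILE *f = fopen("/proc/self/status", "r");
	char line[256];
	int mode = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Seccomp: %d", &mode) == 1)
			break;
	}
	fclose(f);
	return mode;
}

/*
 * Install an allow-all filter in a child, fork a grandchild, and return the
 * seccomp mode the grandchild observes (-1 on any setup error).
 */
int inherited_seccomp_mode(void)
{
	struct sock_filter insns[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = { .len = 1, .filter = insns };
	int status, code;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		pid_t gpid;
		int mode;

		/* Unprivileged tasks need no_new_privs to install a filter. */
		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
		    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
			_exit(100);

		gpid = fork();
		if (gpid < 0)
			_exit(101);
		if (gpid == 0) {
			/* The grandchild installed nothing on its own. */
			mode = seccomp_mode_of_self();
			_exit(mode < 0 ? 102 : mode);
		}
		if (waitpid(gpid, &status, 0) != gpid || !WIFEXITED(status))
			_exit(103);
		_exit(WEXITSTATUS(status));
	}

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	code = WEXITSTATUS(status);
	return code >= 100 ? -1 : code;
}
```

On a kernel with seccomp filter support, inherited_seccomp_mode() should
return SECCOMP_MODE_FILTER. A listener obtained for that filter would therefore
see syscalls from every task that inherited it, not only from the task that
was ptrace-attached.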

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-08 15:33     ` Jann Horn
@ 2018-10-08 16:21       ` Christian Brauner
  2018-10-08 16:42         ` Jann Horn
  0 siblings, 1 reply; 91+ messages in thread
From: Christian Brauner @ 2018-10-08 16:21 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski

On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
> On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > > version which can acquire filters is useful. There are at least two reasons
> > > this is preferable, even though it uses ptrace:
> > >
> > > 1. You can control tasks that aren't cooperating with you
> > > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> > >    task installs a filter which blocks these calls, there's no way with
> > >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> >
> > So for the slow of mind aka me:
> > I'm not sure I completely understand this problem. Can you outline how
> > sendmsg() and socket() are involved in this?
> >
> > I'm also not sure that this holds (but I might misunderstand the
> > problem). Afaict, you could try to get the fd out via CLONE_FILES and
> > other means, so something like:
> >
> > // let's pretend the libc wrapper for clone actually has sane semantics
> > pid = clone(CLONE_FILES);
> > if (pid == 0) {
> >         fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
> >
> >         // Now this fd will be valid in both parent and child.
> >         // If you haven't blocked it you can inform the parent what
> >         // the fd number is via pipe2(). If you have blocked it you can
> >         // use dup2() and dup to a known fd number.
> > }
> >
> > >
> > > v2: fix a bug where listener mode was not unset when an unused fd was not
> > >     available
> > > v3: fix refcounting bug (Oleg)
> > > v4: * change the listener's fd flags to be 0
> > >     * rename GET_LISTENER to NEW_LISTENER (Matthew)
> > > v5: * add capable(CAP_SYS_ADMIN) requirement
> > > v7: * point the new listener at the right filter (Jann)
> > >
> > > Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> > > CC: Kees Cook <keescook@chromium.org>
> > > CC: Andy Lutomirski <luto@amacapital.net>
> > > CC: Oleg Nesterov <oleg@redhat.com>
> > > CC: Eric W. Biederman <ebiederm@xmission.com>
> > > CC: "Serge E. Hallyn" <serge@hallyn.com>
> > > CC: Christian Brauner <christian.brauner@ubuntu.com>
> > > CC: Tyler Hicks <tyhicks@canonical.com>
> > > CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> > > ---
> > >  include/linux/seccomp.h                       |  7 ++
> > >  include/uapi/linux/ptrace.h                   |  2 +
> > >  kernel/ptrace.c                               |  4 ++
> > >  kernel/seccomp.c                              | 31 +++++++++
> > >  tools/testing/selftests/seccomp/seccomp_bpf.c | 68 +++++++++++++++++++
> > >  5 files changed, 112 insertions(+)
> > >
> > > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> > > index 017444b5efed..234c61b37405 100644
> > > --- a/include/linux/seccomp.h
> > > +++ b/include/linux/seccomp.h
> > > @@ -83,6 +83,8 @@ static inline int seccomp_mode(struct seccomp *s)
> > >  #ifdef CONFIG_SECCOMP_FILTER
> > >  extern void put_seccomp_filter(struct task_struct *tsk);
> > >  extern void get_seccomp_filter(struct task_struct *tsk);
> > > +extern long seccomp_new_listener(struct task_struct *task,
> > > +                              unsigned long filter_off);
> > >  #else  /* CONFIG_SECCOMP_FILTER */
> > >  static inline void put_seccomp_filter(struct task_struct *tsk)
> > >  {
> > > @@ -92,6 +94,11 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
> > >  {
> > >       return;
> > >  }
> > > +static inline long seccomp_new_listener(struct task_struct *task,
> > > +                                     unsigned long filter_off)
> > > +{
> > > +     return -EINVAL;
> > > +}
> > >  #endif /* CONFIG_SECCOMP_FILTER */
> > >
> > >  #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> > > diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> > > index d5a1b8a492b9..e80ecb1bd427 100644
> > > --- a/include/uapi/linux/ptrace.h
> > > +++ b/include/uapi/linux/ptrace.h
> > > @@ -73,6 +73,8 @@ struct seccomp_metadata {
> > >       __u64 flags;            /* Output: filter's flags */
> > >  };
> > >
> > > +#define PTRACE_SECCOMP_NEW_LISTENER  0x420e
> > > +
> > >  /* Read signals from a shared (process wide) queue */
> > >  #define PTRACE_PEEKSIGINFO_SHARED    (1 << 0)
> > >
> > > diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> > > index 21fec73d45d4..289960ac181b 100644
> > > --- a/kernel/ptrace.c
> > > +++ b/kernel/ptrace.c
> > > @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
> > >               ret = seccomp_get_metadata(child, addr, datavp);
> > >               break;
> > >
> > > +     case PTRACE_SECCOMP_NEW_LISTENER:
> > > +             ret = seccomp_new_listener(child, addr);
> > > +             break;
> > > +
> > >       default:
> > >               break;
> > >       }
> > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > index 44a31ac8373a..17685803a2af 100644
> > > --- a/kernel/seccomp.c
> > > +++ b/kernel/seccomp.c
> > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> > >
> > >       return ret;
> > >  }
> > > +
> > > +long seccomp_new_listener(struct task_struct *task,
> > > +                       unsigned long filter_off)
> > > +{
> > > +     struct seccomp_filter *filter;
> > > +     struct file *listener;
> > > +     int fd;
> > > +
> > > +     if (!capable(CAP_SYS_ADMIN))
> > > +             return -EACCES;
> >
> > I know this might have been discussed a while back but why exactly do we
> > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> > if I want to do a setns(fd, CLONE_NEWUSER) to the target process and
> > use ptrace from in there?
> 
> See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> . Basically, the problem is that this doesn't just give you capability
> over the target task, but also over every other task that has the same
> filter installed; you need some sort of "is the caller capable over
> the filter and anyone who uses it" check.

Thanks.
But then this new ptrace feature as it stands is imho currently broken.
If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
are ns_capable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
if you are ns_capable(CAP_SYS_ADMIN), then either the new ptrace() api
extension should be fixed to allow for this too, or the seccomp() way of
retrieving the pid - which I really think we want - needs to be fixed to
require capable(CAP_SYS_ADMIN) too.
The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
the preferred way to solve this.
Everything else will just be confusing.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-08 16:21       ` Christian Brauner
@ 2018-10-08 16:42         ` Jann Horn
  2018-10-08 18:18           ` Christian Brauner
  0 siblings, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-10-08 16:42 UTC (permalink / raw)
  To: christian
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module

On Mon, Oct 8, 2018 at 6:21 PM Christian Brauner <christian@brauner.io> wrote:
> On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
> > On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > > > version which can acquire filters is useful. There are at least two reasons
> > > > this is preferable, even though it uses ptrace:
> > > >
> > > > 1. You can control tasks that aren't cooperating with you
> > > > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> > > >    task installs a filter which blocks these calls, there's no way with
> > > >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> > >
> > > So for the slow of mind aka me:
> > > I'm not sure I completely understand this problem. Can you outline how
> > > sendmsg() and socket() are involved in this?
> > >
> > > I'm also not sure that this holds (but I might misunderstand the
> > > problem). Afaict, you could try to get the fd out via CLONE_FILES and
> > > other means, so something like:
> > >
> > > // let's pretend the libc wrapper for clone actually has sane semantics
> > > pid = clone(CLONE_FILES);
> > > if (pid == 0) {
> > >         fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
> > >
> > >         // Now this fd will be valid in both parent and child.
> > >         // If you haven't blocked it you can inform the parent what
> > >         // the fd number is via pipe2(). If you have blocked it you can
> > >         // use dup2() and dup to a known fd number.
> > > }
> > >
> > > >
> > > > v2: fix a bug where listener mode was not unset when an unused fd was not
> > > >     available
> > > > v3: fix refcounting bug (Oleg)
> > > > v4: * change the listener's fd flags to be 0
> > > >     * rename GET_LISTENER to NEW_LISTENER (Matthew)
> > > > v5: * add capable(CAP_SYS_ADMIN) requirement
> > > > v7: * point the new listener at the right filter (Jann)
> > > >
> > > > Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> > > > CC: Kees Cook <keescook@chromium.org>
> > > > CC: Andy Lutomirski <luto@amacapital.net>
> > > > CC: Oleg Nesterov <oleg@redhat.com>
> > > > CC: Eric W. Biederman <ebiederm@xmission.com>
> > > > CC: "Serge E. Hallyn" <serge@hallyn.com>
> > > > CC: Christian Brauner <christian.brauner@ubuntu.com>
> > > > CC: Tyler Hicks <tyhicks@canonical.com>
> > > > CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> > > > ---
> > > >  include/linux/seccomp.h                       |  7 ++
> > > >  include/uapi/linux/ptrace.h                   |  2 +
> > > >  kernel/ptrace.c                               |  4 ++
> > > >  kernel/seccomp.c                              | 31 +++++++++
> > > >  tools/testing/selftests/seccomp/seccomp_bpf.c | 68 +++++++++++++++++++
> > > >  5 files changed, 112 insertions(+)
> > > >
> > > > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> > > > index 017444b5efed..234c61b37405 100644
> > > > --- a/include/linux/seccomp.h
> > > > +++ b/include/linux/seccomp.h
> > > > @@ -83,6 +83,8 @@ static inline int seccomp_mode(struct seccomp *s)
> > > >  #ifdef CONFIG_SECCOMP_FILTER
> > > >  extern void put_seccomp_filter(struct task_struct *tsk);
> > > >  extern void get_seccomp_filter(struct task_struct *tsk);
> > > > +extern long seccomp_new_listener(struct task_struct *task,
> > > > +                              unsigned long filter_off);
> > > >  #else  /* CONFIG_SECCOMP_FILTER */
> > > >  static inline void put_seccomp_filter(struct task_struct *tsk)
> > > >  {
> > > > @@ -92,6 +94,11 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
> > > >  {
> > > >       return;
> > > >  }
> > > > +static inline long seccomp_new_listener(struct task_struct *task,
> > > > +                                     unsigned long filter_off)
> > > > +{
> > > > +     return -EINVAL;
> > > > +}
> > > >  #endif /* CONFIG_SECCOMP_FILTER */
> > > >
> > > >  #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> > > > diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> > > > index d5a1b8a492b9..e80ecb1bd427 100644
> > > > --- a/include/uapi/linux/ptrace.h
> > > > +++ b/include/uapi/linux/ptrace.h
> > > > @@ -73,6 +73,8 @@ struct seccomp_metadata {
> > > >       __u64 flags;            /* Output: filter's flags */
> > > >  };
> > > >
> > > > +#define PTRACE_SECCOMP_NEW_LISTENER  0x420e
> > > > +
> > > >  /* Read signals from a shared (process wide) queue */
> > > >  #define PTRACE_PEEKSIGINFO_SHARED    (1 << 0)
> > > >
> > > > diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> > > > index 21fec73d45d4..289960ac181b 100644
> > > > --- a/kernel/ptrace.c
> > > > +++ b/kernel/ptrace.c
> > > > @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
> > > >               ret = seccomp_get_metadata(child, addr, datavp);
> > > >               break;
> > > >
> > > > +     case PTRACE_SECCOMP_NEW_LISTENER:
> > > > +             ret = seccomp_new_listener(child, addr);
> > > > +             break;
> > > > +
> > > >       default:
> > > >               break;
> > > >       }
> > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > > index 44a31ac8373a..17685803a2af 100644
> > > > --- a/kernel/seccomp.c
> > > > +++ b/kernel/seccomp.c
> > > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> > > >
> > > >       return ret;
> > > >  }
> > > > +
> > > > +long seccomp_new_listener(struct task_struct *task,
> > > > +                       unsigned long filter_off)
> > > > +{
> > > > +     struct seccomp_filter *filter;
> > > > +     struct file *listener;
> > > > +     int fd;
> > > > +
> > > > +     if (!capable(CAP_SYS_ADMIN))
> > > > +             return -EACCES;
> > >
> > > I know this might have been discussed a while back but why exactly do we
> > > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> > > if I want to do a setns(fd, CLONE_NEWUSER) to the target process and
> > > use ptrace from in there?
> >
> > See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> > . Basically, the problem is that this doesn't just give you capability
> > over the target task, but also over every other task that has the same
> > filter installed; you need some sort of "is the caller capable over
> > the filter and anyone who uses it" check.
>
> Thanks.
> But then this new ptrace feature as it stands is imho currently broken.
> If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
> are ns_capable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
> if you are ns_capable(CAP_SYS_ADMIN), then either the new ptrace() api
> extension should be fixed to allow for this too, or the seccomp() way of
> retrieving the pid - which I really think we want - needs to be fixed to
> require capable(CAP_SYS_ADMIN) too.
> The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
> the preferred way to solve this.
> Everything else will just be confusing.

First you say "broken", then you say "confusing". Which one do you mean?

Regarding requiring ns_capable() for ptrace: That means that you'll
have to stash namespace information in the seccomp filter. You'd also
potentially be eliding the LSM check that would normally have to occur
between the tracer and the tracee; but I guess that's probably fine?
CAP_SYS_ADMIN in the init namespace already has some abilities that
LSMs can't observe; you could argue that CAP_SYS_ADMIN in another
namespace should have similar semantics, but I'm not sure whether that
matches what the LSM people want as semantics.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-08 15:16   ` Christian Brauner
  2018-10-08 15:33     ` Jann Horn
@ 2018-10-08 18:00     ` Tycho Andersen
  2018-10-08 18:41       ` Christian Brauner
  2018-10-10 17:45       ` Andy Lutomirski
  1 sibling, 2 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-10-08 18:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Jann Horn, linux-api, containers, Akihiro Suda,
	Oleg Nesterov, linux-kernel, Eric W . Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski

On Mon, Oct 08, 2018 at 05:16:30PM +0200, Christian Brauner wrote:
> On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > version which can acquire filters is useful. There are at least two reasons
> > this is preferable, even though it uses ptrace:
> > 
> > 1. You can control tasks that aren't cooperating with you
> > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> >    task installs a filter which blocks these calls, there's no way with
> >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> 
> So for the slow of mind aka me:
> I'm not sure I completely understand this problem. Can you outline how
> sendmsg() and socket() are involved in this?
> 
> I'm also not sure that this holds (but I might misunderstand the
> problem). Afaict, you could try to get the fd out via CLONE_FILES and
> other means, so something like:
> 
> // let's pretend the libc wrapper for clone actually has sane semantics
> pid = clone(CLONE_FILES);
> if (pid == 0) {
>         fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
> 
>         // Now this fd will be valid in both parent and child.
>         // If you haven't blocked it you can inform the parent what
>         // the fd number is via pipe2(). If you have blocked it you can
>         // use dup2() and dup to a known fd number.
> }

But what if your seccomp filter wants to block both pipe2() and
dup2()? Whatever syscall you want to use to do this could be blocked
by some seccomp policy, which means you might not be able to use this
feature in some cases.

Perhaps it's unlikely, and we can just go forward knowing this. But it
seems like it is worth at least acknowledging that you can wedge
yourself into a corner.
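
To make the corner case concrete, here is a hedged sketch of such a policy. It
is an assumption-laden illustration, not part of the series: the helper name
is hypothetical, the filter denies the raw dup3 and pipe2 syscalls (dup3 is
used instead of dup2 because dup2 is not a syscall on every architecture), and
the architecture check that a real filter needs is left out for brevity. Once
the filter is installed, the task can no longer move an fd to a known number
or create a pipe to report it.

```c
#define _GNU_SOURCE
#include <assert.h>	/* for the caller's checks */
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if both syscalls fail with EPERM under the filter, 0 if one of
 * them still works, -1 if the filter could not be installed.
 */
int fd_moving_syscalls_blocked(void)
{
	struct sock_filter insns[] = {
		/* A = seccomp_data.nr (a real filter also checks .arch) */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* dup3 -> jump two instructions ahead to the deny return */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_dup3, 2, 0),
		/* pipe2 -> jump one instruction ahead to the deny return */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_pipe2, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};
	int status, code;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		int p[2];

		/* Do everything in a child so the caller stays unfiltered. */
		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
		    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
			_exit(2);

		errno = 0;
		if (syscall(__NR_dup3, 0, 200, 0) != -1 || errno != EPERM)
			_exit(1);
		errno = 0;
		if (syscall(__NR_pipe2, p, 0) != -1 || errno != EPERM)
			_exit(1);
		_exit(0);
	}

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	code = WEXITSTATUS(status);
	if (code == 2)
		return -1;
	return code == 0;
}
```

With a policy like this, the only way left to hand the listener to a
supervisor is something that does not rely on the filtered task making further
syscalls, which is the gap the ptrace() variant is meant to cover.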

Tycho

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-08 16:42         ` Jann Horn
@ 2018-10-08 18:18           ` Christian Brauner
  2018-10-09 12:39             ` Jann Horn
  0 siblings, 1 reply; 91+ messages in thread
From: Christian Brauner @ 2018-10-08 18:18 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module

On Mon, Oct 08, 2018 at 06:42:00PM +0200, Jann Horn wrote:
> On Mon, Oct 8, 2018 at 6:21 PM Christian Brauner <christian@brauner.io> wrote:
> > On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
> > > On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > > > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > > > > version which can acquire filters is useful. There are at least two reasons
> > > > > this is preferable, even though it uses ptrace:
> > > > >
> > > > > 1. You can control tasks that aren't cooperating with you
> > > > > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> > > > >    task installs a filter which blocks these calls, there's no way with
> > > > >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> > > >
> > > > So for the slow of mind aka me:
> > > > I'm not sure I completely understand this problem. Can you outline how
> > > > sendmsg() and socket() are involved in this?
> > > >
> > > > I'm also not sure that this holds (but I might misunderstand the
> > > > problem) afaict, you could do try to get the fd out via CLONE_FILES and
> > > > other means so something like:
> > > >
> > > > // let's pretend the libc wrapper for clone actually has sane semantics
> > > > pid = clone(CLONE_FILES);
> > > > if (pid == 0) {
> > > >         fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
> > > >
> > > >         // Now this fd will be valid in both parent and child.
> > > >         // If you haven't blocked it you can inform the parent what
> > > >         // the fd number is via pipe2(). If you have blocked it you can
> > > >         // use dup2() and dup to a known fd number.
> > > > }
> > > >
> > > > >
> > > > > v2: fix a bug where listener mode was not unset when an unused fd was not
> > > > >     available
> > > > > v3: fix refcounting bug (Oleg)
> > > > > v4: * change the listener's fd flags to be 0
> > > > >     * rename GET_LISTENER to NEW_LISTENER (Matthew)
> > > > > v5: * add capable(CAP_SYS_ADMIN) requirement
> > > > > v7: * point the new listener at the right filter (Jann)
> > > > >
> > > > > Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> > > > > CC: Kees Cook <keescook@chromium.org>
> > > > > CC: Andy Lutomirski <luto@amacapital.net>
> > > > > CC: Oleg Nesterov <oleg@redhat.com>
> > > > > CC: Eric W. Biederman <ebiederm@xmission.com>
> > > > > CC: "Serge E. Hallyn" <serge@hallyn.com>
> > > > > CC: Christian Brauner <christian.brauner@ubuntu.com>
> > > > > CC: Tyler Hicks <tyhicks@canonical.com>
> > > > > CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> > > > > ---
> > > > >  include/linux/seccomp.h                       |  7 ++
> > > > >  include/uapi/linux/ptrace.h                   |  2 +
> > > > >  kernel/ptrace.c                               |  4 ++
> > > > >  kernel/seccomp.c                              | 31 +++++++++
> > > > >  tools/testing/selftests/seccomp/seccomp_bpf.c | 68 +++++++++++++++++++
> > > > >  5 files changed, 112 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> > > > > index 017444b5efed..234c61b37405 100644
> > > > > --- a/include/linux/seccomp.h
> > > > > +++ b/include/linux/seccomp.h
> > > > > @@ -83,6 +83,8 @@ static inline int seccomp_mode(struct seccomp *s)
> > > > >  #ifdef CONFIG_SECCOMP_FILTER
> > > > >  extern void put_seccomp_filter(struct task_struct *tsk);
> > > > >  extern void get_seccomp_filter(struct task_struct *tsk);
> > > > > +extern long seccomp_new_listener(struct task_struct *task,
> > > > > +                              unsigned long filter_off);
> > > > >  #else  /* CONFIG_SECCOMP_FILTER */
> > > > >  static inline void put_seccomp_filter(struct task_struct *tsk)
> > > > >  {
> > > > > @@ -92,6 +94,11 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
> > > > >  {
> > > > >       return;
> > > > >  }
> > > > > +static inline long seccomp_new_listener(struct task_struct *task,
> > > > > +                                     unsigned long filter_off)
> > > > > +{
> > > > > +     return -EINVAL;
> > > > > +}
> > > > >  #endif /* CONFIG_SECCOMP_FILTER */
> > > > >
> > > > >  #if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
> > > > > diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> > > > > index d5a1b8a492b9..e80ecb1bd427 100644
> > > > > --- a/include/uapi/linux/ptrace.h
> > > > > +++ b/include/uapi/linux/ptrace.h
> > > > > @@ -73,6 +73,8 @@ struct seccomp_metadata {
> > > > >       __u64 flags;            /* Output: filter's flags */
> > > > >  };
> > > > >
> > > > > +#define PTRACE_SECCOMP_NEW_LISTENER  0x420e
> > > > > +
> > > > >  /* Read signals from a shared (process wide) queue */
> > > > >  #define PTRACE_PEEKSIGINFO_SHARED    (1 << 0)
> > > > >
> > > > > diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> > > > > index 21fec73d45d4..289960ac181b 100644
> > > > > --- a/kernel/ptrace.c
> > > > > +++ b/kernel/ptrace.c
> > > > > @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request,
> > > > >               ret = seccomp_get_metadata(child, addr, datavp);
> > > > >               break;
> > > > >
> > > > > +     case PTRACE_SECCOMP_NEW_LISTENER:
> > > > > +             ret = seccomp_new_listener(child, addr);
> > > > > +             break;
> > > > > +
> > > > >       default:
> > > > >               break;
> > > > >       }
> > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > > > index 44a31ac8373a..17685803a2af 100644
> > > > > --- a/kernel/seccomp.c
> > > > > +++ b/kernel/seccomp.c
> > > > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> > > > >
> > > > >       return ret;
> > > > >  }
> > > > > +
> > > > > +long seccomp_new_listener(struct task_struct *task,
> > > > > +                       unsigned long filter_off)
> > > > > +{
> > > > > +     struct seccomp_filter *filter;
> > > > > +     struct file *listener;
> > > > > +     int fd;
> > > > > +
> > > > > +     if (!capable(CAP_SYS_ADMIN))
> > > > > +             return -EACCES;
> > > >
> > > > I know this might have been discussed a while back but why exactly do we
> > > > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> > > > if I want to do a setns()fd, CLONE_NEWUSER) to the target process and
> > > > use ptrace from in there?
> > >
> > > See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> > > . Basically, the problem is that this doesn't just give you capability
> > > over the target task, but also over every other task that has the same
> > > filter installed; you need some sort of "is the caller capable over
> > > the filter and anyone who uses it" check.
> >
> > Thanks.
> > But then this new ptrace feature as it stands is imho currently broken.
> > If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
> > are ns_cpabable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
> > if you are ns_cpabable(CAP_SYS_ADMIN) then either the new ptrace() api
> > extension should be fixed to allow for this too or the seccomp() way of
> > retrieving the pid - which I really think we want - needs to be fixed to
> > require capable(CAP_SYS_ADMIN) too.
> > The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
> > the preferred way to solve this.
> > Everything else will just be confusing.
> 
> First you say "broken", then you say "confusing". Which one do you mean?

Both. It's broken in so far as it places a seemingly unnecessary
restriction that could be fixed. You outlined one possible fix yourself
in the link you provided. And it's confusing in so far as there is a way
via seccomp() to get the fd without said requirement.

> 
> Regarding requiring ns_capable() for ptrace: That means that you'll
> have to stash namespace information in the seccomp filter. You'd also
> potentially be eliding the LSM check that would normally have to occur
> between the tracer and the tracee; but I guess that's probably fine?
> CAP_SYS_ADMIN in the init namespace already has some abilities that
> LSMs can't observe; you could argue that CAP_SYS_ADMIN in another
> namespace should have similar semantics, but I'm not sure whether that
> matches what the LSM people want as semantics.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-08 18:00     ` Tycho Andersen
@ 2018-10-08 18:41       ` Christian Brauner
  2018-10-10 17:45       ` Andy Lutomirski
  1 sibling, 0 replies; 91+ messages in thread
From: Christian Brauner @ 2018-10-08 18:41 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, Jann Horn, linux-api, containers, Akihiro Suda,
	Oleg Nesterov, linux-kernel, Eric W . Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski

On Mon, Oct 08, 2018 at 12:00:43PM -0600, Tycho Andersen wrote:
> On Mon, Oct 08, 2018 at 05:16:30PM +0200, Christian Brauner wrote:
> > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > > version which can acquire filters is useful. There are at least two reasons
> > > this is preferable, even though it uses ptrace:
> > > 
> > > 1. You can control tasks that aren't cooperating with you
> > > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> > >    task installs a filter which blocks these calls, there's no way with
> > >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> > 
> > So for the slow of mind aka me:
> > I'm not sure I completely understand this problem. Can you outline how
> > sendmsg() and socket() are involved in this?
> > 
> > I'm also not sure that this holds (but I might misunderstand the
> > problem) afaict, you could do try to get the fd out via CLONE_FILES and
> > other means so something like: 
> > 
> > // let's pretend the libc wrapper for clone actually has sane semantics
> > pid = clone(CLONE_FILES);
> > if (pid == 0) {
> >         fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
> > 
> >         // Now this fd will be valid in both parent and child.
> >         // If you haven't blocked it you can inform the parent what
> >         // the fd number is via pipe2(). If you have blocked it you can
> >         // use dup2() and dup to a known fd number.
> > }
> 
> But what if your seccomp filter wants to block both pipe2() and
> dup2()? Whatever syscall you want to use to do this could be blocked

(Fwiw, set up shared memory before the clone(CLONE_FILES) and write the
fd number into the shared memory. :))


> by some seccomp policy, which means you might not be able to use this
> feature in some cases.

> 
> Perhaps it's unlikely, and we can just go forward knowing this. But it
> seems like it is worth at least acknowledging that you can wedge
> yourself into a corner.

Sure, if you try really, really hard to shoot yourself in the foot you'll
always be able to. Off the top of my head, I'd say you can probably at
least filter the seccomp() syscall with the listener argument. Once
you've loaded the policy you're out of luck.
You might also be seccomp-confined and unable to use the ptrace()
syscall. AppArmor might prevent you from using ptrace(), etc.

So I think we really want both ways but the seccomp interface is way
cleaner. To get the fd from ptrace() you need three syscalls. With
seccomp() you need one.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-08 18:18           ` Christian Brauner
@ 2018-10-09 12:39             ` Jann Horn
  2018-10-09 13:28               ` Christian Brauner
  0 siblings, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-10-09 12:39 UTC (permalink / raw)
  To: christian
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module

On Mon, Oct 8, 2018 at 8:18 PM Christian Brauner <christian@brauner.io> wrote:
> On Mon, Oct 08, 2018 at 06:42:00PM +0200, Jann Horn wrote:
> > On Mon, Oct 8, 2018 at 6:21 PM Christian Brauner <christian@brauner.io> wrote:
> > > On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
> > > > On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > > > > index 44a31ac8373a..17685803a2af 100644
> > > > > > --- a/kernel/seccomp.c
> > > > > > +++ b/kernel/seccomp.c
> > > > > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> > > > > >
> > > > > >       return ret;
> > > > > >  }
> > > > > > +
> > > > > > +long seccomp_new_listener(struct task_struct *task,
> > > > > > +                       unsigned long filter_off)
> > > > > > +{
> > > > > > +     struct seccomp_filter *filter;
> > > > > > +     struct file *listener;
> > > > > > +     int fd;
> > > > > > +
> > > > > > +     if (!capable(CAP_SYS_ADMIN))
> > > > > > +             return -EACCES;
> > > > >
> > > > > I know this might have been discussed a while back but why exactly do we
> > > > > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> > > > > if I want to do a setns()fd, CLONE_NEWUSER) to the target process and
> > > > > use ptrace from in there?
> > > >
> > > > See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> > > > . Basically, the problem is that this doesn't just give you capability
> > > > over the target task, but also over every other task that has the same
> > > > filter installed; you need some sort of "is the caller capable over
> > > > the filter and anyone who uses it" check.
> > >
> > > Thanks.
> > > But then this new ptrace feature as it stands is imho currently broken.
> > > If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
> > > are ns_cpabable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
> > > if you are ns_cpabable(CAP_SYS_ADMIN)

Actually, you don't need CAP_SYS_ADMIN for seccomp() at all as long as
you enable the NNP flag, I think?

> > > then either the new ptrace() api
> > > extension should be fixed to allow for this too or the seccomp() way of
> > > retrieving the pid - which I really think we want - needs to be fixed to
> > > require capable(CAP_SYS_ADMIN) too.
> > > The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
> > > the preferred way to solve this.
> > > Everything else will just be confusing.
> >
> > First you say "broken", then you say "confusing". Which one do you mean?
>
> Both. It's broken in so far as it places a seemingly unnecessary
> restriction that could be fixed. You outlined one possible fix yourself
> in the link you provided.

If by "possible fix" you mean "check whether the seccomp filter is
only attached to a single task": That wouldn't fundamentally change
the situation, it would only add an additional special case.

> And it's confusing in so far as there is a way
> via seccomp() to get the fd without said requirement.

I don't find it confusing at all. seccomp() and ptrace() are very
different situations: When you use seccomp(), infrastructure is
already in place for ensuring that your filter is only applied to
processes over which you are capable, and propagation is limited by
inheritance from your task down. When you use ptrace(), you need a
pretty different sort of access check that checks whether you're
privileged over ancestors, siblings and so on of the target task.
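
To illustrate the inheritance part, here is a rough sketch that is not
taken from the patch set. The child installs a filter (with NNP set, so
no capability is needed) that makes getppid() fail with EPERM, then
forks; the grandchild never calls seccomp() itself but is still subject
to the filter. filter_is_inherited() is a hypothetical helper name, and
the filter skips the usual seccomp_data.arch check to stay short.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a forked task sees its parent's filter, 0 if not, -1 on error. */
static int filter_is_inherited(void)
{
	struct sock_filter insns[] = {
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* getppid(): fall through to ERRNO; anything else: skip to ALLOW. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};
	int status;
	pid_t pid, gc;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Confine only this child, never the caller. */
		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0 ||
		    syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog) < 0)
			_exit(2);
		gc = fork();
		if (gc < 0)
			_exit(2);
		if (gc == 0) {
			/* The grandchild installed nothing, yet is filtered. */
			errno = 0;
			_exit(syscall(SYS_getppid) == -1 && errno == EPERM ? 0 : 1);
		}
		if (waitpid(gc, &status, 0) != gc || !WIFEXITED(status))
			_exit(2);
		_exit(WEXITSTATUS(status));
	}
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status) ||
	    WEXITSTATUS(status) == 2)
		return -1;
	return WEXITSTATUS(status) == 0;
}
```

The filter only ever flows from the installing task down to its
descendants, which is why the seccomp() path needs no check on unrelated
tasks.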

But thinking about it more, I think that CAP_SYS_ADMIN over the saved
current->mm->user_ns of the task that installed the filter (stored as
a "struct user_namespace *" in the filter) should be acceptable.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-09 12:39             ` Jann Horn
@ 2018-10-09 13:28               ` Christian Brauner
  2018-10-09 13:36                 ` Jann Horn
  0 siblings, 1 reply; 91+ messages in thread
From: Christian Brauner @ 2018-10-09 13:28 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module

On Tue, Oct 09, 2018 at 02:39:53PM +0200, Jann Horn wrote:
> On Mon, Oct 8, 2018 at 8:18 PM Christian Brauner <christian@brauner.io> wrote:
> > On Mon, Oct 08, 2018 at 06:42:00PM +0200, Jann Horn wrote:
> > > On Mon, Oct 8, 2018 at 6:21 PM Christian Brauner <christian@brauner.io> wrote:
> > > > On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
> > > > > On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > > > > > index 44a31ac8373a..17685803a2af 100644
> > > > > > > --- a/kernel/seccomp.c
> > > > > > > +++ b/kernel/seccomp.c
> > > > > > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> > > > > > >
> > > > > > >       return ret;
> > > > > > >  }
> > > > > > > +
> > > > > > > +long seccomp_new_listener(struct task_struct *task,
> > > > > > > +                       unsigned long filter_off)
> > > > > > > +{
> > > > > > > +     struct seccomp_filter *filter;
> > > > > > > +     struct file *listener;
> > > > > > > +     int fd;
> > > > > > > +
> > > > > > > +     if (!capable(CAP_SYS_ADMIN))
> > > > > > > +             return -EACCES;
> > > > > >
> > > > > > I know this might have been discussed a while back but why exactly do we
> > > > > > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> > > > > > if I want to do a setns()fd, CLONE_NEWUSER) to the target process and
> > > > > > use ptrace from in there?
> > > > >
> > > > > See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> > > > > . Basically, the problem is that this doesn't just give you capability
> > > > > over the target task, but also over every other task that has the same
> > > > > filter installed; you need some sort of "is the caller capable over
> > > > > the filter and anyone who uses it" check.
> > > >
> > > > Thanks.
> > > > But then this new ptrace feature as it stands is imho currently broken.
> > > > If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
> > > > are ns_cpabable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
> > > > if you are ns_cpabable(CAP_SYS_ADMIN)
> 
> Actually, you don't need CAP_SYS_ADMIN for seccomp() at all as long as
> you enable the NNP flag, I think?

Yes, if you turn on NNP you don't even need sys_admin.
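
For reference, a minimal sketch of the unprivileged path, assuming a
kernel and headers recent enough to provide the seccomp() syscall number.
install_allow_all_filter() is a made-up name, and the one-instruction
allow-everything filter is only a placeholder for a real policy.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 0 on success or a negative errno value. */
static int install_allow_all_filter(void)
{
	struct sock_filter insns[] = {
		/* Placeholder policy: allow every syscall. */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	/* With no_new_privs set, the filter install needs no CAP_SYS_ADMIN. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
		return -errno;
	if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog) < 0)
		return -errno;
	return 0;
}
```

Both steps are irreversible for the calling task: NNP cannot be cleared
again and an installed filter cannot be removed.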

> 
> > > > then either the new ptrace() api
> > > > extension should be fixed to allow for this too or the seccomp() way of
> > > > retrieving the pid - which I really think we want - needs to be fixed to
> > > > require capable(CAP_SYS_ADMIN) too.
> > > > The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
> > > > the preferred way to solve this.
> > > > Everything else will just be confusing.
> > >
> > > First you say "broken", then you say "confusing". Which one do you mean?
> >
> > Both. It's broken in so far as it places a seemingly unnecessary
> > restriction that could be fixed. You outlined one possible fix yourself
> > in the link you provided.
> 
> If by "possible fix" you mean "check whether the seccomp filter is
> only attached to a single task": That wouldn't fundamentally change
> the situation, it would only add an additional special case.
> 
> > And it's confusing in so far as there is a way
> > via seccomp() to get the fd without said requirement.
> 
> I don't find it confusing at all. seccomp() and ptrace() are very

Fine, then that's a matter of opinion. I find it counterintuitive that
you can get an fd without privileges via one interface but not via
another.

> different situations: When you use seccomp(), infrastructure is

Sure. Note that this is _one_ of the reasons why I want to make sure we
keep the native seccomp()-only way of getting an fd without
forcing userspace to switch to a different kernel API.

> already in place for ensuring that your filter is only applied to
> processes over which you are capable, and propagation is limited by
> inheritance from your task down. When you use ptrace(), you need a
> pretty different sort of access check that checks whether you're
> privileged over ancestors, siblings and so on of the target task.

So, don't get me wrong I'm not arguing against the ptrace() interface in
general. If this is something that people find useful, fine. But, I
would like to have a simple single-syscall pure-seccomp() based way of
getting an fd, i.e. what we have in patch 1 of this series.

> 
> But thinking about it more, I think that CAP_SYS_ADMIN over the saved
> current->mm->user_ns of the task that installed the filter (stored as
> a "struct user_namespace *" in the filter) should be acceptable.

Hm... Why not CAP_SYS_PTRACE?


One more thing. Citing from [1]:

> I think there's a security problem here. Imagine the following scenario:
> 
> 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> 2. task A forks off a child B
> 3. task B uses setuid(1) to drop its privileges
> 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> or via execve()
> 5. task C (the attacker, uid==1) attaches to task B via ptrace
> 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B

Sorry to be late to the party, but would this really pass
__ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
that it would... It doesn't look like it would get past:

 	tcred = __task_cred(task);
	if (uid_eq(caller_uid, tcred->euid) &&
	    uid_eq(caller_uid, tcred->suid) &&
	    uid_eq(caller_uid, tcred->uid)  &&
	    gid_eq(caller_gid, tcred->egid) &&
	    gid_eq(caller_gid, tcred->sgid) &&
	    gid_eq(caller_gid, tcred->gid))
		goto ok;
	if (ptrace_has_cap(tcred->user_ns, mode))
		goto ok;
	rcu_read_unlock();
	return -EPERM;
ok:
	rcu_read_unlock();
	mm = task->mm;
	if (mm &&
	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
	     !ptrace_has_cap(mm->user_ns, mode)))
	    return -EPERM;

> 7. because the seccomp filter is shared by task A and task B, task C
> is now able to influence syscall results for syscalls performed by
> task A

[1]: https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-09 13:28               ` Christian Brauner
@ 2018-10-09 13:36                 ` Jann Horn
  2018-10-09 13:49                   ` Christian Brauner
  2018-10-10 15:31                   ` Paul Moore
  0 siblings, 2 replies; 91+ messages in thread
From: Jann Horn @ 2018-10-09 13:36 UTC (permalink / raw)
  To: christian
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module,
	selinux, Paul Moore, Stephen Smalley, Eric Paris

+cc selinux people explicitly, since they probably have opinions on this

On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> On Tue, Oct 09, 2018 at 02:39:53PM +0200, Jann Horn wrote:
> > On Mon, Oct 8, 2018 at 8:18 PM Christian Brauner <christian@brauner.io> wrote:
> > > On Mon, Oct 08, 2018 at 06:42:00PM +0200, Jann Horn wrote:
> > > > On Mon, Oct 8, 2018 at 6:21 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
> > > > > > On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > > > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > > > > > > index 44a31ac8373a..17685803a2af 100644
> > > > > > > > --- a/kernel/seccomp.c
> > > > > > > > +++ b/kernel/seccomp.c
> > > > > > > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> > > > > > > >
> > > > > > > >       return ret;
> > > > > > > >  }
> > > > > > > > +
> > > > > > > > +long seccomp_new_listener(struct task_struct *task,
> > > > > > > > +                       unsigned long filter_off)
> > > > > > > > +{
> > > > > > > > +     struct seccomp_filter *filter;
> > > > > > > > +     struct file *listener;
> > > > > > > > +     int fd;
> > > > > > > > +
> > > > > > > > +     if (!capable(CAP_SYS_ADMIN))
> > > > > > > > +             return -EACCES;
> > > > > > >
> > > > > > > I know this might have been discussed a while back but why exactly do we
> > > > > > > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> > > > > > > if I want to do a setns()fd, CLONE_NEWUSER) to the target process and
> > > > > > > use ptrace from in there?
> > > > > >
> > > > > > See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> > > > > > . Basically, the problem is that this doesn't just give you capability
> > > > > > over the target task, but also over every other task that has the same
> > > > > > filter installed; you need some sort of "is the caller capable over
> > > > > > the filter and anyone who uses it" check.
> > > > >
> > > > > Thanks.
> > > > > But then this new ptrace feature as it stands is imho currently broken.
> > > > > If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
> > > > > are ns_cpabable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
> > > > > if you are ns_cpabable(CAP_SYS_ADMIN)
> >
> > Actually, you don't need CAP_SYS_ADMIN for seccomp() at all as long as
> > you enable the NNP flag, I think?
>
> Yes, if you turn on NNP you don't even need sys_admin.
>
> >
> > > > > then either the new ptrace() api
> > > > > extension should be fixed to allow for this too or the seccomp() way of
> > > > > retrieving the pid - which I really think we want - needs to be fixed to
> > > > > require capable(CAP_SYS_ADMIN) too.
> > > > > The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
> > > > > the preferred way to solve this.
> > > > > Everything else will just be confusing.
> > > >
> > > > First you say "broken", then you say "confusing". Which one do you mean?
> > >
> > > Both. It's broken in so far as it places a seemingly unnecessary
> > > restriction that could be fixed. You outlined one possible fix yourself
> > > in the link you provided.
> >
> > If by "possible fix" you mean "check whether the seccomp filter is
> > only attached to a single task": That wouldn't fundamentally change
> > the situation, it would only add an additional special case.
> >
> > > And it's confusing in so far as there is a way
> > > via seccomp() to get the fd without said requirement.
> >
> > I don't find it confusing at all. seccomp() and ptrace() are very
>
> Fine, then that's a matter of opinion. I find it counterintuitive that
> you can get an fd without privileges via one interface but not via
> another.
>
> > different situations: When you use seccomp(), infrastructure is
>
> Sure. Note, that this is _one_ of the reasons why I want to make sure we
> keep the native seccomp() only based way of getting an fd without
> forcing userspace to switching to a differnet kernel api.
>
> > already in place for ensuring that your filter is only applied to
> > processes over which you are capable, and propagation is limited by
> > inheritance from your task down. When you use ptrace(), you need a
> > pretty different sort of access check that checks whether you're
> > privileged over ancestors, siblings and so on of the target task.
>
> So, don't get me wrong I'm not arguing against the ptrace() interface in
> general. If this is something that people find useful, fine. But, I
> would like to have a simple single-syscall pure-seccomp() based way of
> getting an fd, i.e. what we have in patch 1 of this series.

Yeah, I also prefer the seccomp() one.

> > But thinking about it more, I think that CAP_SYS_ADMIN over the saved
> > current->mm->user_ns of the task that installed the filter (stored as
> > a "struct user_namespace *" in the filter) should be acceptable.
>
> Hm... Why not CAP_SYS_PTRACE?

Because LSMs like SELinux add extra checks that apply even if you have
CAP_SYS_PTRACE, and this would subvert those. The only capability I
know of that lets you bypass LSM checks by design (if no LSM blocks
the capability itself) is CAP_SYS_ADMIN.

> One more thing. Citing from [1]
>
> > I think there's a security problem here. Imagine the following scenario:
> >
> > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > 2. task A forks off a child B
> > 3. task B uses setuid(1) to drop its privileges
> > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > or via execve()
> > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
>
> Sorry, to be late to the party but would this really pass
> __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> that it would... Doesn't look like it would get past:
>
>         tcred = __task_cred(task);
>         if (uid_eq(caller_uid, tcred->euid) &&
>             uid_eq(caller_uid, tcred->suid) &&
>             uid_eq(caller_uid, tcred->uid)  &&
>             gid_eq(caller_gid, tcred->egid) &&
>             gid_eq(caller_gid, tcred->sgid) &&
>             gid_eq(caller_gid, tcred->gid))
>                 goto ok;
>         if (ptrace_has_cap(tcred->user_ns, mode))
>                 goto ok;
>         rcu_read_unlock();
>         return -EPERM;
> ok:
>         rcu_read_unlock();
>         mm = task->mm;
>         if (mm &&
>             ((get_dumpable(mm) != SUID_DUMP_USER) &&
>              !ptrace_has_cap(mm->user_ns, mode)))
>             return -EPERM;

Which specific check would prevent task C from attaching to task B? If
the UIDs match, the first "goto ok" executes; and you're dumpable, so
you don't trigger the second "return -EPERM".
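
To make that concrete, here is a simplified userspace model of just the
two checks quoted above. It is a sketch, not the kernel code: the
model_* names are hypothetical, caller_has_cap stands in for both
ptrace_has_cap() calls, LSM hooks are ignored, and it assumes the GIDs of
tasks B and C also match.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical stand-ins for the kernel's cred and mm state. */
struct model_cred {
	uid_t uid, euid, suid;
	gid_t gid, egid, sgid;
};

struct model_task {
	struct model_cred cred;
	bool dumpable;	/* models get_dumpable(mm) == SUID_DUMP_USER */
};

/* Returns 0 if the modelled checks pass, -EPERM otherwise. */
static int model_ptrace_may_access(uid_t caller_uid, gid_t caller_gid,
				   bool caller_has_cap,
				   const struct model_task *t)
{
	const struct model_cred *c = &t->cred;
	bool ids_match = caller_uid == c->euid && caller_uid == c->suid &&
			 caller_uid == c->uid && caller_gid == c->egid &&
			 caller_gid == c->sgid && caller_gid == c->gid;

	/* First check: matching IDs or a capability reach "ok". */
	if (!ids_match && !caller_has_cap)
		return -EPERM;
	/* Second check: a non-dumpable target needs the capability. */
	if (!t->dumpable && !caller_has_cap)
		return -EPERM;
	return 0;
}
```

With task B at uid 1 after setuid(1) and dumpable again, an unprivileged
caller with uid 1 gets 0 from this model; clearing dumpable turns that
into -EPERM.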

> > 7. because the seccomp filter is shared by task A and task B, task C
> > is now able to influence syscall results for syscalls performed by
> > task A
>
> [1]: https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-09 13:36                 ` Jann Horn
@ 2018-10-09 13:49                   ` Christian Brauner
  2018-10-09 13:50                     ` Jann Horn
  2018-10-10 15:31                   ` Paul Moore
  1 sibling, 1 reply; 91+ messages in thread
From: Christian Brauner @ 2018-10-09 13:49 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module,
	selinux, Paul Moore, Stephen Smalley, Eric Paris

On Tue, Oct 09, 2018 at 03:36:04PM +0200, Jann Horn wrote:
> +cc selinux people explicitly, since they probably have opinions on this
> 
> On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > On Tue, Oct 09, 2018 at 02:39:53PM +0200, Jann Horn wrote:
> > > On Mon, Oct 8, 2018 at 8:18 PM Christian Brauner <christian@brauner.io> wrote:
> > > > On Mon, Oct 08, 2018 at 06:42:00PM +0200, Jann Horn wrote:
> > > > > On Mon, Oct 8, 2018 at 6:21 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
> > > > > > > On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > > > > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > > > > > > > index 44a31ac8373a..17685803a2af 100644
> > > > > > > > > --- a/kernel/seccomp.c
> > > > > > > > > +++ b/kernel/seccomp.c
> > > > > > > > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> > > > > > > > >
> > > > > > > > >       return ret;
> > > > > > > > >  }
> > > > > > > > > +
> > > > > > > > > +long seccomp_new_listener(struct task_struct *task,
> > > > > > > > > +                       unsigned long filter_off)
> > > > > > > > > +{
> > > > > > > > > +     struct seccomp_filter *filter;
> > > > > > > > > +     struct file *listener;
> > > > > > > > > +     int fd;
> > > > > > > > > +
> > > > > > > > > +     if (!capable(CAP_SYS_ADMIN))
> > > > > > > > > +             return -EACCES;
> > > > > > > >
> > > > > > > > I know this might have been discussed a while back but why exactly do we
> > > > > > > > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> > > > > > > > if I want to do a setns(fd, CLONE_NEWUSER) to the target process and
> > > > > > > > use ptrace from in there?
> > > > > > >
> > > > > > > See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> > > > > > > . Basically, the problem is that this doesn't just give you capability
> > > > > > > over the target task, but also over every other task that has the same
> > > > > > > filter installed; you need some sort of "is the caller capable over
> > > > > > > the filter and anyone who uses it" check.
> > > > > >
> > > > > > Thanks.
> > > > > > But then this new ptrace feature as it stands is imho currently broken.
> > > > > > If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
> > > > > > are ns_capable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
> > > > > > if you are ns_capable(CAP_SYS_ADMIN)
> > >
> > > Actually, you don't need CAP_SYS_ADMIN for seccomp() at all as long as
> > > you enable the NNP flag, I think?
> >
> > Yes, if you turn on NNP you don't even need sys_admin.
> >
> > >
> > > > > > then either the new ptrace() api
> > > > > > extension should be fixed to allow for this too or the seccomp() way of
> > > > > > retrieving the pid - which I really think we want - needs to be fixed to
> > > > > > require capable(CAP_SYS_ADMIN) too.
> > > > > > The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
> > > > > > the preferred way to solve this.
> > > > > > Everything else will just be confusing.
> > > > >
> > > > > First you say "broken", then you say "confusing". Which one do you mean?
> > > >
> > > > Both. It's broken in so far as it places a seemingly unnecessary
> > > > restriction that could be fixed. You outlined one possible fix yourself
> > > > in the link you provided.
> > >
> > > If by "possible fix" you mean "check whether the seccomp filter is
> > > only attached to a single task": That wouldn't fundamentally change
> > > the situation, it would only add an additional special case.
> > >
> > > > And it's confusing in so far as there is a way
> > > > via seccomp() to get the fd without said requirement.
> > >
> > > I don't find it confusing at all. seccomp() and ptrace() are very
> >
> > Fine, then that's a matter of opinion. I find it counterintuitive that
> > you can get an fd without privileges via one interface but not via
> > another.
> >
> > > different situations: When you use seccomp(), infrastructure is
> >
> > Sure. Note, that this is _one_ of the reasons why I want to make sure we
> > keep the native seccomp() only based way of getting an fd without
> > forcing userspace to switch to a different kernel api.
> >
> > > already in place for ensuring that your filter is only applied to
> > > processes over which you are capable, and propagation is limited by
> > > inheritance from your task down. When you use ptrace(), you need a
> > > pretty different sort of access check that checks whether you're
> > > privileged over ancestors, siblings and so on of the target task.
> >
> > So, don't get me wrong I'm not arguing against the ptrace() interface in
> > general. If this is something that people find useful, fine. But, I
> > would like to have a simple single-syscall pure-seccomp() based way of
> > getting an fd, i.e. what we have in patch 1 of this series.
> 
> Yeah, I also prefer the seccomp() one.
> 
> > > But thinking about it more, I think that CAP_SYS_ADMIN over the saved
> > > current->mm->user_ns of the task that installed the filter (stored as
> > > a "struct user_namespace *" in the filter) should be acceptable.
> >
> > Hm... Why not CAP_SYS_PTRACE?
> 
> Because LSMs like SELinux add extra checks that apply even if you have
> CAP_SYS_PTRACE, and this would subvert those. The only capability I
> know of that lets you bypass LSM checks by design (if no LSM blocks
> the capability itself) is CAP_SYS_ADMIN.
> 
> > One more thing. Citing from [1]
> >
> > > I think there's a security problem here. Imagine the following scenario:
> > >
> > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > 2. task A forks off a child B
> > > 3. task B uses setuid(1) to drop its privileges
> > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > or via execve()
> > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> >
> > Sorry, to be late to the party but would this really pass
> > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > that it would... Doesn't look like it would get past:
> >
> >         tcred = __task_cred(task);
> >         if (uid_eq(caller_uid, tcred->euid) &&
> >             uid_eq(caller_uid, tcred->suid) &&
> >             uid_eq(caller_uid, tcred->uid)  &&
> >             gid_eq(caller_gid, tcred->egid) &&
> >             gid_eq(caller_gid, tcred->sgid) &&
> >             gid_eq(caller_gid, tcred->gid))
> >                 goto ok;
> >         if (ptrace_has_cap(tcred->user_ns, mode))
> >                 goto ok;
> >         rcu_read_unlock();
> >         return -EPERM;
> > ok:
> >         rcu_read_unlock();
> >         mm = task->mm;
> >         if (mm &&
> >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> >              !ptrace_has_cap(mm->user_ns, mode)))
> >             return -EPERM;
> 
> Which specific check would prevent task C from attaching to task B? If
> the UIDs match, the first "goto ok" executes; and you're dumpable, so
> you don't trigger the second "return -EPERM".

You'd also need CAP_SYS_PTRACE in the mm->user_ns which you shouldn't
have if you did a setuid to an unpriv user. (But I always find that code
confusing.)

> 
> > > 7. because the seccomp filter is shared by task A and task B, task C
> > > is now able to influence syscall results for syscalls performed by
> > > task A
> >
> > [1]: https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-09 13:49                   ` Christian Brauner
@ 2018-10-09 13:50                     ` Jann Horn
  2018-10-09 14:09                       ` Christian Brauner
  0 siblings, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-10-09 13:50 UTC (permalink / raw)
  To: christian
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module,
	selinux, Paul Moore, Stephen Smalley, Eric Paris

On Tue, Oct 9, 2018 at 3:49 PM Christian Brauner <christian@brauner.io> wrote:
>
> On Tue, Oct 09, 2018 at 03:36:04PM +0200, Jann Horn wrote:
> > +cc selinux people explicitly, since they probably have opinions on this
> >
> > On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > > On Tue, Oct 09, 2018 at 02:39:53PM +0200, Jann Horn wrote:
> > > > On Mon, Oct 8, 2018 at 8:18 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > On Mon, Oct 08, 2018 at 06:42:00PM +0200, Jann Horn wrote:
> > > > > > On Mon, Oct 8, 2018 at 6:21 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
> > > > > > > > On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > > > > > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > > > > > > > > index 44a31ac8373a..17685803a2af 100644
> > > > > > > > > > --- a/kernel/seccomp.c
> > > > > > > > > > +++ b/kernel/seccomp.c
> > > > > > > > > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> > > > > > > > > >
> > > > > > > > > >       return ret;
> > > > > > > > > >  }
> > > > > > > > > > +
> > > > > > > > > > +long seccomp_new_listener(struct task_struct *task,
> > > > > > > > > > +                       unsigned long filter_off)
> > > > > > > > > > +{
> > > > > > > > > > +     struct seccomp_filter *filter;
> > > > > > > > > > +     struct file *listener;
> > > > > > > > > > +     int fd;
> > > > > > > > > > +
> > > > > > > > > > +     if (!capable(CAP_SYS_ADMIN))
> > > > > > > > > > +             return -EACCES;
> > > > > > > > >
> > > > > > > > > I know this might have been discussed a while back but why exactly do we
> > > > > > > > > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> > > > > > > > > if I want to do a setns(fd, CLONE_NEWUSER) to the target process and
> > > > > > > > > use ptrace from in there?
> > > > > > > >
> > > > > > > > See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> > > > > > > > . Basically, the problem is that this doesn't just give you capability
> > > > > > > > over the target task, but also over every other task that has the same
> > > > > > > > filter installed; you need some sort of "is the caller capable over
> > > > > > > > the filter and anyone who uses it" check.
> > > > > > >
> > > > > > > Thanks.
> > > > > > > But then this new ptrace feature as it stands is imho currently broken.
> > > > > > > If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
> > > > > > > are ns_capable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
> > > > > > > if you are ns_capable(CAP_SYS_ADMIN)
> > > >
> > > > Actually, you don't need CAP_SYS_ADMIN for seccomp() at all as long as
> > > > you enable the NNP flag, I think?
> > >
> > > Yes, if you turn on NNP you don't even need sys_admin.
> > >
> > > >
> > > > > > > then either the new ptrace() api
> > > > > > > extension should be fixed to allow for this too or the seccomp() way of
> > > > > > > retrieving the pid - which I really think we want - needs to be fixed to
> > > > > > > require capable(CAP_SYS_ADMIN) too.
> > > > > > > The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
> > > > > > > the preferred way to solve this.
> > > > > > > Everything else will just be confusing.
> > > > > >
> > > > > > First you say "broken", then you say "confusing". Which one do you mean?
> > > > >
> > > > > Both. It's broken in so far as it places a seemingly unnecessary
> > > > > restriction that could be fixed. You outlined one possible fix yourself
> > > > > in the link you provided.
> > > >
> > > > If by "possible fix" you mean "check whether the seccomp filter is
> > > > only attached to a single task": That wouldn't fundamentally change
> > > > the situation, it would only add an additional special case.
> > > >
> > > > > And it's confusing in so far as there is a way
> > > > > via seccomp() to get the fd without said requirement.
> > > >
> > > > I don't find it confusing at all. seccomp() and ptrace() are very
> > >
> > > Fine, then that's a matter of opinion. I find it counterintuitive that
> > > you can get an fd without privileges via one interface but not via
> > > another.
> > >
> > > > different situations: When you use seccomp(), infrastructure is
> > >
> > > Sure. Note, that this is _one_ of the reasons why I want to make sure we
> > > keep the native seccomp() only based way of getting an fd without
> > > forcing userspace to switch to a different kernel api.
> > >
> > > > already in place for ensuring that your filter is only applied to
> > > > processes over which you are capable, and propagation is limited by
> > > > inheritance from your task down. When you use ptrace(), you need a
> > > > pretty different sort of access check that checks whether you're
> > > > privileged over ancestors, siblings and so on of the target task.
> > >
> > > So, don't get me wrong I'm not arguing against the ptrace() interface in
> > > general. If this is something that people find useful, fine. But, I
> > > would like to have a simple single-syscall pure-seccomp() based way of
> > > getting an fd, i.e. what we have in patch 1 of this series.
> >
> > Yeah, I also prefer the seccomp() one.
> >
> > > > But thinking about it more, I think that CAP_SYS_ADMIN over the saved
> > > > current->mm->user_ns of the task that installed the filter (stored as
> > > > a "struct user_namespace *" in the filter) should be acceptable.
> > >
> > > Hm... Why not CAP_SYS_PTRACE?
> >
> > Because LSMs like SELinux add extra checks that apply even if you have
> > CAP_SYS_PTRACE, and this would subvert those. The only capability I
> > know of that lets you bypass LSM checks by design (if no LSM blocks
> > the capability itself) is CAP_SYS_ADMIN.
> >
> > > One more thing. Citing from [1]
> > >
> > > > I think there's a security problem here. Imagine the following scenario:
> > > >
> > > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > > 2. task A forks off a child B
> > > > 3. task B uses setuid(1) to drop its privileges
> > > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > > or via execve()
> > > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> > >
> > > Sorry, to be late to the party but would this really pass
> > > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > > that it would... Doesn't look like it would get past:
> > >
> > >         tcred = __task_cred(task);
> > >         if (uid_eq(caller_uid, tcred->euid) &&
> > >             uid_eq(caller_uid, tcred->suid) &&
> > >             uid_eq(caller_uid, tcred->uid)  &&
> > >             gid_eq(caller_gid, tcred->egid) &&
> > >             gid_eq(caller_gid, tcred->sgid) &&
> > >             gid_eq(caller_gid, tcred->gid))
> > >                 goto ok;
> > >         if (ptrace_has_cap(tcred->user_ns, mode))
> > >                 goto ok;
> > >         rcu_read_unlock();
> > >         return -EPERM;
> > > ok:
> > >         rcu_read_unlock();
> > >         mm = task->mm;
> > >         if (mm &&
> > >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> > >              !ptrace_has_cap(mm->user_ns, mode)))
> > >             return -EPERM;
> >
> > Which specific check would prevent task C from attaching to task B? If
> > the UIDs match, the first "goto ok" executes; and you're dumpable, so
> > you don't trigger the second "return -EPERM".
>
> You'd also need CAP_SYS_PTRACE in the mm->user_ns which you shouldn't
> have if you did a setuid to an unpriv user. (But I always find that code
> confusing.)

Only if the target hasn't gone through execve() since setuid().

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-09 13:50                     ` Jann Horn
@ 2018-10-09 14:09                       ` Christian Brauner
  2018-10-09 15:26                         ` Jann Horn
  0 siblings, 1 reply; 91+ messages in thread
From: Christian Brauner @ 2018-10-09 14:09 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module,
	selinux, Paul Moore, Stephen Smalley, Eric Paris

On Tue, Oct 09, 2018 at 03:50:53PM +0200, Jann Horn wrote:
> On Tue, Oct 9, 2018 at 3:49 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Tue, Oct 09, 2018 at 03:36:04PM +0200, Jann Horn wrote:
> > > +cc selinux people explicitly, since they probably have opinions on this
> > >
> > > On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > > > On Tue, Oct 09, 2018 at 02:39:53PM +0200, Jann Horn wrote:
> > > > > On Mon, Oct 8, 2018 at 8:18 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > On Mon, Oct 08, 2018 at 06:42:00PM +0200, Jann Horn wrote:
> > > > > > > On Mon, Oct 8, 2018 at 6:21 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
> > > > > > > > > On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > > > > > > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > > > > > > > > > index 44a31ac8373a..17685803a2af 100644
> > > > > > > > > > > --- a/kernel/seccomp.c
> > > > > > > > > > > +++ b/kernel/seccomp.c
> > > > > > > > > > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> > > > > > > > > > >
> > > > > > > > > > >       return ret;
> > > > > > > > > > >  }
> > > > > > > > > > > +
> > > > > > > > > > > +long seccomp_new_listener(struct task_struct *task,
> > > > > > > > > > > +                       unsigned long filter_off)
> > > > > > > > > > > +{
> > > > > > > > > > > +     struct seccomp_filter *filter;
> > > > > > > > > > > +     struct file *listener;
> > > > > > > > > > > +     int fd;
> > > > > > > > > > > +
> > > > > > > > > > > +     if (!capable(CAP_SYS_ADMIN))
> > > > > > > > > > > +             return -EACCES;
> > > > > > > > > >
> > > > > > > > > > I know this might have been discussed a while back but why exactly do we
> > > > > > > > > > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> > > > > > > > > > if I want to do a setns(fd, CLONE_NEWUSER) to the target process and
> > > > > > > > > > use ptrace from in there?
> > > > > > > > >
> > > > > > > > > See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> > > > > > > > > . Basically, the problem is that this doesn't just give you capability
> > > > > > > > > over the target task, but also over every other task that has the same
> > > > > > > > > filter installed; you need some sort of "is the caller capable over
> > > > > > > > > the filter and anyone who uses it" check.
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > > But then this new ptrace feature as it stands is imho currently broken.
> > > > > > > > If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
> > > > > > > > are ns_capable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
> > > > > > > > if you are ns_capable(CAP_SYS_ADMIN)
> > > > >
> > > > > Actually, you don't need CAP_SYS_ADMIN for seccomp() at all as long as
> > > > > you enable the NNP flag, I think?
> > > >
> > > > Yes, if you turn on NNP you don't even need sys_admin.
> > > >
> > > > >
> > > > > > > > then either the new ptrace() api
> > > > > > > > extension should be fixed to allow for this too or the seccomp() way of
> > > > > > > > retrieving the pid - which I really think we want - needs to be fixed to
> > > > > > > > require capable(CAP_SYS_ADMIN) too.
> > > > > > > > The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
> > > > > > > > the preferred way to solve this.
> > > > > > > > Everything else will just be confusing.
> > > > > > >
> > > > > > > First you say "broken", then you say "confusing". Which one do you mean?
> > > > > >
> > > > > > Both. It's broken in so far as it places a seemingly unnecessary
> > > > > > restriction that could be fixed. You outlined one possible fix yourself
> > > > > > in the link you provided.
> > > > >
> > > > > If by "possible fix" you mean "check whether the seccomp filter is
> > > > > only attached to a single task": That wouldn't fundamentally change
> > > > > the situation, it would only add an additional special case.
> > > > >
> > > > > > And it's confusing in so far as there is a way
> > > > > > via seccomp() to get the fd without said requirement.
> > > > >
> > > > > I don't find it confusing at all. seccomp() and ptrace() are very
> > > >
> > > > Fine, then that's a matter of opinion. I find it counterintuitive that
> > > > you can get an fd without privileges via one interface but not via
> > > > another.
> > > >
> > > > > different situations: When you use seccomp(), infrastructure is
> > > >
> > > > Sure. Note, that this is _one_ of the reasons why I want to make sure we
> > > > keep the native seccomp() only based way of getting an fd without
> > > > forcing userspace to switch to a different kernel api.
> > > >
> > > > > already in place for ensuring that your filter is only applied to
> > > > > processes over which you are capable, and propagation is limited by
> > > > > inheritance from your task down. When you use ptrace(), you need a
> > > > > pretty different sort of access check that checks whether you're
> > > > > privileged over ancestors, siblings and so on of the target task.
> > > >
> > > > So, don't get me wrong I'm not arguing against the ptrace() interface in
> > > > general. If this is something that people find useful, fine. But, I
> > > > would like to have a simple single-syscall pure-seccomp() based way of
> > > > getting an fd, i.e. what we have in patch 1 of this series.
> > >
> > > Yeah, I also prefer the seccomp() one.
> > >
> > > > > But thinking about it more, I think that CAP_SYS_ADMIN over the saved
> > > > > current->mm->user_ns of the task that installed the filter (stored as
> > > > > a "struct user_namespace *" in the filter) should be acceptable.
> > > >
> > > > Hm... Why not CAP_SYS_PTRACE?
> > >
> > > Because LSMs like SELinux add extra checks that apply even if you have
> > > CAP_SYS_PTRACE, and this would subvert those. The only capability I
> > > know of that lets you bypass LSM checks by design (if no LSM blocks
> > > the capability itself) is CAP_SYS_ADMIN.
> > >
> > > > One more thing. Citing from [1]
> > > >
> > > > > I think there's a security problem here. Imagine the following scenario:
> > > > >
> > > > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > > > 2. task A forks off a child B
> > > > > 3. task B uses setuid(1) to drop its privileges
> > > > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > > > or via execve()
> > > > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> > > >
> > > > Sorry, to be late to the party but would this really pass
> > > > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > > > that it would... Doesn't look like it would get past:
> > > >
> > > >         tcred = __task_cred(task);
> > > >         if (uid_eq(caller_uid, tcred->euid) &&
> > > >             uid_eq(caller_uid, tcred->suid) &&
> > > >             uid_eq(caller_uid, tcred->uid)  &&
> > > >             gid_eq(caller_gid, tcred->egid) &&
> > > >             gid_eq(caller_gid, tcred->sgid) &&
> > > >             gid_eq(caller_gid, tcred->gid))
> > > >                 goto ok;
> > > >         if (ptrace_has_cap(tcred->user_ns, mode))
> > > >                 goto ok;
> > > >         rcu_read_unlock();
> > > >         return -EPERM;
> > > > ok:
> > > >         rcu_read_unlock();
> > > >         mm = task->mm;
> > > >         if (mm &&
> > > >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> > > >              !ptrace_has_cap(mm->user_ns, mode)))
> > > >             return -EPERM;
> > >
> > > Which specific check would prevent task C from attaching to task B? If
> > > the UIDs match, the first "goto ok" executes; and you're dumpable, so
> > > you don't trigger the second "return -EPERM".
> >
> > You'd also need CAP_SYS_PTRACE in the mm->user_ns which you shouldn't
> > have if you did a setuid to an unpriv user. (But I always find that code
> > confusing.)
> 
> Only if the target hasn't gone through execve() since setuid().

Sorry if I want to know this in excessive detail but I'd like to
understand this properly so bear with me :)
- If task B has setuid()ed and prctl(PR_SET_DUMPABLE, 1)ed but not
  execve()ed then C won't pass ptrace_has_cap(mm->user_ns, mode).
- If task B has setuid()ed and execve()ed, it will get its dumpable flag
  set to /proc/sys/fs/suid_dumpable, which by default is 0. So C won't
  pass (get_dumpable(mm) != SUID_DUMP_USER).
In both cases PTRACE_ATTACH shouldn't work. Now, if
/proc/sys/fs/suid_dumpable is 1 I'd find it acceptable for this to work.
This is an administrator choice.

* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-10-08 14:58       ` Christian Brauner
@ 2018-10-09 14:28         ` Tycho Andersen
  2018-10-09 16:24           ` Christian Brauner
  0 siblings, 1 reply; 91+ messages in thread
From: Tycho Andersen @ 2018-10-09 14:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Jann Horn, Linux API, Linux Containers, Akihiro Suda,
	Oleg Nesterov, LKML, Eric W . Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski

On Mon, Oct 08, 2018 at 04:58:05PM +0200, Christian Brauner wrote:
> On Thu, Sep 27, 2018 at 04:48:39PM -0600, Tycho Andersen wrote:
> > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > > I have to say, I'm vaguely nervous about changing the semantics here
> > > for passing back the fd as the return code from the seccomp() syscall.
> > > Alternatives seem less appealing, though: changing the meaning of the
> > > uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> > > example. Hmm.
> > 
> > From my perspective we can drop this whole thing. The only thing I'll
> > ever use is the ptrace version. Someone at some point (I don't
> > remember who, maybe stgraber) suggested this version would be useful
> > as well.
> 
> So I think we want to have the ability to get an fd via seccomp().
> Especially if all we worry about are weird semantics. When we
> discussed this we knew the whole patchset was going to be weird. :)
> 
> This is a seccomp feature so seccomp should - if feasible - equip you
> with everything to use it in a meaningful way without having to go
> through a different kernel api. I know ptrace and seccomp are
> already connected but I still find this cleaner. :)
> 
> Another thing is that the container itself might be traced for some
> reason while you still might want to get an fd out.

Sure, I don't see the problem here.

> Also, I wonder what happens if you want to filter the ptrace() syscall
> itself? Then you'd deadlock?

No, are you confusing the tracee with the tracer here? Filtering
ptrace() will happen just like any other syscall... what would you
deadlock with?

> Also, it seems that getting an fd via ptrace requires CAP_SYS_ADMIN in
> the initial user namespace (which I just realized now) whereas getting
> the fd via seccomp() doesn't seem to.

Yep, I'll leave this discussion to the other thread.

Tycho

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-09 14:09                       ` Christian Brauner
@ 2018-10-09 15:26                         ` Jann Horn
  2018-10-09 16:20                           ` Christian Brauner
  0 siblings, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-10-09 15:26 UTC (permalink / raw)
  To: christian
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module,
	selinux, Paul Moore, Stephen Smalley, Eric Paris

On Tue, Oct 9, 2018 at 4:09 PM Christian Brauner <christian@brauner.io> wrote:
> On Tue, Oct 09, 2018 at 03:50:53PM +0200, Jann Horn wrote:
> > On Tue, Oct 9, 2018 at 3:49 PM Christian Brauner <christian@brauner.io> wrote:
> > > On Tue, Oct 09, 2018 at 03:36:04PM +0200, Jann Horn wrote:
> > > > On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > One more thing. Citing from [1]
> > > > >
> > > > > > I think there's a security problem here. Imagine the following scenario:
> > > > > >
> > > > > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > > > > 2. task A forks off a child B
> > > > > > 3. task B uses setuid(1) to drop its privileges
> > > > > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > > > > or via execve()
> > > > > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > > > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> > > > >
> > > > > Sorry, to be late to the party but would this really pass
> > > > > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > > > > that it would... Doesn't look like it would get past:
> > > > >
> > > > >         tcred = __task_cred(task);
> > > > >         if (uid_eq(caller_uid, tcred->euid) &&
> > > > >             uid_eq(caller_uid, tcred->suid) &&
> > > > >             uid_eq(caller_uid, tcred->uid)  &&
> > > > >             gid_eq(caller_gid, tcred->egid) &&
> > > > >             gid_eq(caller_gid, tcred->sgid) &&
> > > > >             gid_eq(caller_gid, tcred->gid))
> > > > >                 goto ok;
> > > > >         if (ptrace_has_cap(tcred->user_ns, mode))
> > > > >                 goto ok;
> > > > >         rcu_read_unlock();
> > > > >         return -EPERM;
> > > > > ok:
> > > > >         rcu_read_unlock();
> > > > >         mm = task->mm;
> > > > >         if (mm &&
> > > > >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> > > > >              !ptrace_has_cap(mm->user_ns, mode)))
> > > > >             return -EPERM;
> > > >
> > > > Which specific check would prevent task C from attaching to task B? If
> > > > the UIDs match, the first "goto ok" executes; and you're dumpable, so
> > > > you don't trigger the second "return -EPERM".
> > >
> > > You'd also need CAP_SYS_PTRACE in the mm->user_ns which you shouldn't
> > > have if you did a setuid to an unpriv user. (But I always find that code
> > > confusing.)
> >
> > Only if the target hasn't gone through execve() since setuid().
>
> Sorry if I want to know this in excessive detail but I'd like to
> understand this properly so bear with me :)
> - If task B has setuid()ed and prctl(PR_SET_DUMPABLE, 1)ed but not
>   execve()ed then C won't pass ptrace_has_cap(mm->user_ns, mode).

Yeah.

> - If task B has setuid()ed and execve()ed, it will get its dumpable flag set
>   to /proc/sys/fs/suid_dumpable

Not if you changed all UIDs (e.g. by calling setuid() as root). In
that case, setup_new_exec() calls "set_dumpable(current->mm,
SUID_DUMP_USER)".
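
As a rough illustration of that branch, here is a userspace model of the
exec-time decision. Everything in it is an assumption made for the sketch:
struct exec_ids and dumpable_after_exec() are hypothetical names,
"changed all UIDs" is modelled as euid == uid and egid == gid at exec time,
and enforce_nondump stands in for whatever else forces the sysctl value
(an unreadable binary, for example). The real check lives in fs/exec.c and
should be read there.

```c
#include <assert.h>
#include <stdbool.h>

#define SUID_DUMP_DISABLE 0
#define SUID_DUMP_USER    1

/* Hypothetical snapshot of the credentials seen at exec time. */
struct exec_ids {
	unsigned int uid, euid;
	unsigned int gid, egid;
};

/*
 * Returns the dumpable value the new mm would get in this model:
 * the suid_dumpable sysctl when the ids are mixed (or nondump is
 * enforced), SUID_DUMP_USER when real and effective ids agree.
 */
static int dumpable_after_exec(const struct exec_ids *ids,
			       bool enforce_nondump, int suid_dumpable)
{
	if (enforce_nondump ||
	    ids->euid != ids->uid || ids->egid != ids->gid)
		return suid_dumpable;

	return SUID_DUMP_USER;
}

/* Root calls setuid(1), then execve(): all uids are 1 afterwards. */
static void exec_model_self_check(void)
{
	struct exec_ids dropped = { .uid = 1, .euid = 1, .gid = 0, .egid = 0 };
	struct exec_ids mixed   = { .uid = 1, .euid = 0, .gid = 0, .egid = 0 };

	/* Fully dropped ids: dumpable again regardless of the sysctl. */
	assert(dumpable_after_exec(&dropped, false, SUID_DUMP_DISABLE)
	       == SUID_DUMP_USER);

	/* Mixed ids (setuid-binary style): the sysctl value applies. */
	assert(dumpable_after_exec(&mixed, false, SUID_DUMP_DISABLE)
	       == SUID_DUMP_DISABLE);
}
```

Under these assumptions, the setuid()-as-root case lands in the
SUID_DUMP_USER branch, so the suid_dumpable default of 0 never comes into
play for task B.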

> which by default is 0. So C won't pass
>   (get_dumpable(mm) != SUID_DUMP_USER).
> In both cases PTRACE_ATTACH shouldn't work. Now, if
> /proc/sys/fs/suid_dumpable is 1 I'd find it acceptable for this to work.
> This is an administrator choice.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-09 15:26                         ` Jann Horn
@ 2018-10-09 16:20                           ` Christian Brauner
  2018-10-09 16:26                             ` Jann Horn
  0 siblings, 1 reply; 91+ messages in thread
From: Christian Brauner @ 2018-10-09 16:20 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module,
	selinux, Paul Moore, Stephen Smalley, Eric Paris

On Tue, Oct 09, 2018 at 05:26:26PM +0200, Jann Horn wrote:
> On Tue, Oct 9, 2018 at 4:09 PM Christian Brauner <christian@brauner.io> wrote:
> > On Tue, Oct 09, 2018 at 03:50:53PM +0200, Jann Horn wrote:
> > > On Tue, Oct 9, 2018 at 3:49 PM Christian Brauner <christian@brauner.io> wrote:
> > > > On Tue, Oct 09, 2018 at 03:36:04PM +0200, Jann Horn wrote:
> > > > > On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > One more thing. Citing from [1]
> > > > > >
> > > > > > > I think there's a security problem here. Imagine the following scenario:
> > > > > > >
> > > > > > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > > > > > 2. task A forks off a child B
> > > > > > > 3. task B uses setuid(1) to drop its privileges
> > > > > > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > > > > > or via execve()
> > > > > > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > > > > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> > > > > >
> > > > > > Sorry, to be late to the party but would this really pass
> > > > > > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > > > > > that it would... Doesn't look like it would get past:
> > > > > >
> > > > > >         tcred = __task_cred(task);
> > > > > >         if (uid_eq(caller_uid, tcred->euid) &&
> > > > > >             uid_eq(caller_uid, tcred->suid) &&
> > > > > >             uid_eq(caller_uid, tcred->uid)  &&
> > > > > >             gid_eq(caller_gid, tcred->egid) &&
> > > > > >             gid_eq(caller_gid, tcred->sgid) &&
> > > > > >             gid_eq(caller_gid, tcred->gid))
> > > > > >                 goto ok;
> > > > > >         if (ptrace_has_cap(tcred->user_ns, mode))
> > > > > >                 goto ok;
> > > > > >         rcu_read_unlock();
> > > > > >         return -EPERM;
> > > > > > ok:
> > > > > >         rcu_read_unlock();
> > > > > >         mm = task->mm;
> > > > > >         if (mm &&
> > > > > >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> > > > > >              !ptrace_has_cap(mm->user_ns, mode)))
> > > > > >             return -EPERM;
> > > > >
> > > > > Which specific check would prevent task C from attaching to task B? If
> > > > > the UIDs match, the first "goto ok" executes; and you're dumpable, so
> > > > > you don't trigger the second "return -EPERM".
> > > >
> > > > You'd also need CAP_SYS_PTRACE in the mm->user_ns which you shouldn't
> > > > have if you did a setuid to an unpriv user. (But I always find that code
> > > > confusing.)
> > >
> > > Only if the target hasn't gone through execve() since setuid().
> >
> > Sorry if I want to know this in excessive detail but I'd like to
> > understand this properly so bear with me :)
> > - If task B has setuid()ed and prctl(PR_SET_DUMPABLE, 1)ed but not
> >   execve()ed then C won't pass ptrace_has_cap(mm->user_ns, mode).
> 
> Yeah.
> 
> > - If task B has setuid()ed, exeved()ed it will get its dumpable flag set
> >   to /proc/sys/fs/suid_dumpable
> 
> Not if you changed all UIDs (e.g. by calling setuid() as root). In
> that case, setup_new_exec() calls "set_dumpable(current->mm,
> SUID_DUMP_USER)".

Actually, looking at this: when C tries to PTRACE_ATTACH to B as an
unprivileged user, then even if B has execve()ed and is dumpable, C still
wouldn't have CAP_SYS_PTRACE in the mm->user_ns unless it is already
privileged over mm->user_ns, which means it must be in an ancestor
user_ns.

> 
> > which by default is 0. So C won't pass
> >   (get_dumpable(mm) != SUID_DUMP_USER).
> > In both cases PTRACE_ATTACH shouldn't work. Now, if
> > /proc/sys/fs/suid_dumpable is 1 I'd find it acceptable for this to work.
> > This is an administrator choice.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-10-09 14:28         ` Tycho Andersen
@ 2018-10-09 16:24           ` Christian Brauner
  2018-10-09 16:29             ` Tycho Andersen
  0 siblings, 1 reply; 91+ messages in thread
From: Christian Brauner @ 2018-10-09 16:24 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, Jann Horn, Linux API, Linux Containers, Akihiro Suda,
	Oleg Nesterov, LKML, Eric W . Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski

On Tue, Oct 09, 2018 at 07:28:33AM -0700, Tycho Andersen wrote:
> On Mon, Oct 08, 2018 at 04:58:05PM +0200, Christian Brauner wrote:
> > On Thu, Sep 27, 2018 at 04:48:39PM -0600, Tycho Andersen wrote:
> > > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > > > I have to say, I'm vaguely nervous about changing the semantics here
> > > > for passing back the fd as the return code from the seccomp() syscall.
> > > > Alternatives seem less appealing, though: changing the meaning of the
> > > > uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> > > > example. Hmm.
> > > 
> > > From my perspective we can drop this whole thing. The only thing I'll
> > > ever use is the ptrace version. Someone at some point (I don't
> > > remember who, maybe stgraber) suggested this version would be useful
> > > as well.
> > 
> > So I think we want to have the ability to get an fd via seccomp().
> > Especially, if we all we worry about are weird semantics. When we
> > discussed this we knew the whole patchset was going to be weird. :)
> > 
> > This is a seccomp feature so seccomp should - if feasible - equip you
> > with everything to use it in a meaningful way without having to go
> > through a different kernel api. I know ptrace and seccomp are
> > already connected but I still find this cleaner. :)
> > 
> > Another thing is that the container itself might be traced for some
> > reason while you still might want to get an fd out.
> 
> Sure, I don't see the problem here.

How'd you do PTRACE_ATTACH in that case?

Anyway, the whole point is, as we've discussed in the other thread, that
we really want a one-syscall-only, purely seccomp()-based way of getting
the fd. There are multiple options to get the fd even when you block
sendmsg()/socket() or whatever, and there's no good reason to only be able
to get the fd via a three-syscall ptrace dance. :)

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-09 16:20                           ` Christian Brauner
@ 2018-10-09 16:26                             ` Jann Horn
  2018-10-10 12:54                               ` Christian Brauner
  0 siblings, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-10-09 16:26 UTC (permalink / raw)
  To: christian
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module,
	selinux, Paul Moore, Stephen Smalley, Eric Paris

On Tue, Oct 9, 2018 at 6:20 PM Christian Brauner <christian@brauner.io> wrote:
> On Tue, Oct 09, 2018 at 05:26:26PM +0200, Jann Horn wrote:
> > On Tue, Oct 9, 2018 at 4:09 PM Christian Brauner <christian@brauner.io> wrote:
> > > On Tue, Oct 09, 2018 at 03:50:53PM +0200, Jann Horn wrote:
> > > > On Tue, Oct 9, 2018 at 3:49 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > On Tue, Oct 09, 2018 at 03:36:04PM +0200, Jann Horn wrote:
> > > > > > On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > One more thing. Citing from [1]
> > > > > > >
> > > > > > > > I think there's a security problem here. Imagine the following scenario:
> > > > > > > >
> > > > > > > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > > > > > > 2. task A forks off a child B
> > > > > > > > 3. task B uses setuid(1) to drop its privileges
> > > > > > > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > > > > > > or via execve()
> > > > > > > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > > > > > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> > > > > > >
> > > > > > > Sorry, to be late to the party but would this really pass
> > > > > > > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > > > > > > that it would... Doesn't look like it would get past:
> > > > > > >
> > > > > > >         tcred = __task_cred(task);
> > > > > > >         if (uid_eq(caller_uid, tcred->euid) &&
> > > > > > >             uid_eq(caller_uid, tcred->suid) &&
> > > > > > >             uid_eq(caller_uid, tcred->uid)  &&
> > > > > > >             gid_eq(caller_gid, tcred->egid) &&
> > > > > > >             gid_eq(caller_gid, tcred->sgid) &&
> > > > > > >             gid_eq(caller_gid, tcred->gid))
> > > > > > >                 goto ok;
> > > > > > >         if (ptrace_has_cap(tcred->user_ns, mode))
> > > > > > >                 goto ok;
> > > > > > >         rcu_read_unlock();
> > > > > > >         return -EPERM;
> > > > > > > ok:
> > > > > > >         rcu_read_unlock();
> > > > > > >         mm = task->mm;
> > > > > > >         if (mm &&
> > > > > > >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> > > > > > >              !ptrace_has_cap(mm->user_ns, mode)))
> > > > > > >             return -EPERM;
> > > > > >
> > > > > > Which specific check would prevent task C from attaching to task B? If
> > > > > > the UIDs match, the first "goto ok" executes; and you're dumpable, so
> > > > > > you don't trigger the second "return -EPERM".
> > > > >
> > > > > You'd also need CAP_SYS_PTRACE in the mm->user_ns which you shouldn't
> > > > > have if you did a setuid to an unpriv user. (But I always find that code
> > > > > confusing.)
> > > >
> > > > Only if the target hasn't gone through execve() since setuid().
> > >
> > > Sorry if I want to know this in excessive detail but I'd like to
> > > understand this properly so bear with me :)
> > > - If task B has setuid()ed and prctl(PR_SET_DUMPABLE, 1)ed but not
> > >   execve()ed then C won't pass ptrace_has_cap(mm->user_ns, mode).
> >
> > Yeah.
> >
> > > - If task B has setuid()ed, exeved()ed it will get its dumpable flag set
> > >   to /proc/sys/fs/suid_dumpable
> >
> > Not if you changed all UIDs (e.g. by calling setuid() as root). In
> > that case, setup_new_exec() calls "set_dumpable(current->mm,
> > SUID_DUMP_USER)".
>
> Actually, looking at this when C is trying to PTRACE_ATTACH to B as an
> unprivileged user even if B execve()ed and it is dumpable C still
> wouldn't have CAP_SYS_PTRACE in the mm->user_ns unless it already is
> privileged over mm->user_ns which means it must be in an ancestor
> user_ns.

Huh? Why would you need CAP_SYS_PTRACE for anything here? You can
ptrace another process running under your UID just fine, no matter
what the namespaces are. I'm not sure what you're saying.
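
To make the two gates in the quoted snippet concrete, here is a rough
userspace model of them. This is only a sketch, not the kernel code: the
struct and function names are hypothetical, the two separate
ptrace_has_cap() calls (tcred->user_ns and mm->user_ns) are collapsed
into a single boolean, and LSM hooks such as Yama are left out entirely.

```c
#include <errno.h>
#include <stdbool.h>

#define SUID_DUMP_USER 1	/* value the kernel uses for "dumpable" */

/* Hypothetical, flattened stand-ins for the caller's and target's state. */
struct model_caller {
	unsigned int uid, gid;
	bool cap_sys_ptrace;	/* assumed: capability over the target's user_ns */
};

struct model_target {
	unsigned int uid, euid, suid;
	unsigned int gid, egid, sgid;
	int dumpable;
};

/* Returns 0 if the modelled checks would allow the attach, else -EPERM. */
static int model_ptrace_may_access(const struct model_caller *c,
				   const struct model_target *t)
{
	bool ids_match = c->uid == t->euid && c->uid == t->suid &&
			 c->uid == t->uid  && c->gid == t->egid &&
			 c->gid == t->sgid && c->gid == t->gid;

	/* First gate: matching IDs, or the capability as a fallback. */
	if (!ids_match && !c->cap_sys_ptrace)
		return -EPERM;

	/* Second gate: the capability only matters if the target is not dumpable. */
	if (t->dumpable != SUID_DUMP_USER && !c->cap_sys_ptrace)
		return -EPERM;

	return 0;
}
```

In this model a caller without the capability passes as long as all IDs
match and the target is dumpable, which is the case described in the
scenario.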

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-10-09 16:24           ` Christian Brauner
@ 2018-10-09 16:29             ` Tycho Andersen
  0 siblings, 0 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-10-09 16:29 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Jann Horn, Linux API, Linux Containers, Akihiro Suda,
	Oleg Nesterov, LKML, Eric W . Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski

On Tue, Oct 09, 2018 at 06:24:14PM +0200, Christian Brauner wrote:
> On Tue, Oct 09, 2018 at 07:28:33AM -0700, Tycho Andersen wrote:
> > On Mon, Oct 08, 2018 at 04:58:05PM +0200, Christian Brauner wrote:
> > > On Thu, Sep 27, 2018 at 04:48:39PM -0600, Tycho Andersen wrote:
> > > > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > > > > I have to say, I'm vaguely nervous about changing the semantics here
> > > > > for passing back the fd as the return code from the seccomp() syscall.
> > > > > Alternatives seem less appealing, though: changing the meaning of the
> > > > > uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> > > > > example. Hmm.
> > > > 
> > > > From my perspective we can drop this whole thing. The only thing I'll
> > > > ever use is the ptrace version. Someone at some point (I don't
> > > > remember who, maybe stgraber) suggested this version would be useful
> > > > as well.
> > > 
> > > So I think we want to have the ability to get an fd via seccomp().
> > > Especially, if we all we worry about are weird semantics. When we
> > > discussed this we knew the whole patchset was going to be weird. :)
> > > 
> > > This is a seccomp feature so seccomp should - if feasible - equip you
> > > with everything to use it in a meaningful way without having to go
> > > through a different kernel api. I know ptrace and seccomp are
> > > already connected but I still find this cleaner. :)
> > > 
> > > Another thing is that the container itself might be traced for some
> > > reason while you still might want to get an fd out.
> > 
> > Sure, I don't see the problem here.
> 
> How'd you to PTRACE_ATTACH in that case?

Oh, you mean if someone has *ptrace*'d the task, and a third party
wants to get a seccomp fd? I think "too bad" is the answer; I don't
really mind not supporting this case.

> Anyway, the whole point is as we've discusses in the other thread we
> really want a one-syscall-only, purely-seccomp() based way of getting
> the fd. There are multiple options to get the fd even when you block
> sendmsg()/socket() whatever and there's no good reason to only be able
> to get the fd via a three-syscall-ptrace dance. :)

Ok, I'll leave these bits in then for v8.

Tycho

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-09 16:26                             ` Jann Horn
@ 2018-10-10 12:54                               ` Christian Brauner
  2018-10-10 13:09                                 ` Christian Brauner
  2018-10-10 13:10                                 ` Jann Horn
  0 siblings, 2 replies; 91+ messages in thread
From: Christian Brauner @ 2018-10-10 12:54 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module,
	selinux, Paul Moore, Stephen Smalley, Eric Paris

On Tue, Oct 09, 2018 at 06:26:47PM +0200, Jann Horn wrote:
> On Tue, Oct 9, 2018 at 6:20 PM Christian Brauner <christian@brauner.io> wrote:
> > On Tue, Oct 09, 2018 at 05:26:26PM +0200, Jann Horn wrote:
> > > On Tue, Oct 9, 2018 at 4:09 PM Christian Brauner <christian@brauner.io> wrote:
> > > > On Tue, Oct 09, 2018 at 03:50:53PM +0200, Jann Horn wrote:
> > > > > On Tue, Oct 9, 2018 at 3:49 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > On Tue, Oct 09, 2018 at 03:36:04PM +0200, Jann Horn wrote:
> > > > > > > On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > One more thing. Citing from [1]
> > > > > > > >
> > > > > > > > > I think there's a security problem here. Imagine the following scenario:
> > > > > > > > >
> > > > > > > > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > > > > > > > 2. task A forks off a child B
> > > > > > > > > 3. task B uses setuid(1) to drop its privileges
> > > > > > > > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > > > > > > > or via execve()
> > > > > > > > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > > > > > > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> > > > > > > >
> > > > > > > > Sorry, to be late to the party but would this really pass
> > > > > > > > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > > > > > > > that it would... Doesn't look like it would get past:
> > > > > > > >
> > > > > > > >         tcred = __task_cred(task);
> > > > > > > >         if (uid_eq(caller_uid, tcred->euid) &&
> > > > > > > >             uid_eq(caller_uid, tcred->suid) &&
> > > > > > > >             uid_eq(caller_uid, tcred->uid)  &&
> > > > > > > >             gid_eq(caller_gid, tcred->egid) &&
> > > > > > > >             gid_eq(caller_gid, tcred->sgid) &&
> > > > > > > >             gid_eq(caller_gid, tcred->gid))
> > > > > > > >                 goto ok;
> > > > > > > >         if (ptrace_has_cap(tcred->user_ns, mode))
> > > > > > > >                 goto ok;
> > > > > > > >         rcu_read_unlock();
> > > > > > > >         return -EPERM;
> > > > > > > > ok:
> > > > > > > >         rcu_read_unlock();
> > > > > > > >         mm = task->mm;
> > > > > > > >         if (mm &&
> > > > > > > >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> > > > > > > >              !ptrace_has_cap(mm->user_ns, mode)))
> > > > > > > >             return -EPERM;
> > > > > > >
> > > > > > > Which specific check would prevent task C from attaching to task B? If
> > > > > > > the UIDs match, the first "goto ok" executes; and you're dumpable, so
> > > > > > > you don't trigger the second "return -EPERM".
> > > > > >
> > > > > > You'd also need CAP_SYS_PTRACE in the mm->user_ns which you shouldn't
> > > > > > have if you did a setuid to an unpriv user. (But I always find that code
> > > > > > confusing.)
> > > > >
> > > > > Only if the target hasn't gone through execve() since setuid().
> > > >
> > > > Sorry if I want to know this in excessive detail but I'd like to
> > > > understand this properly so bear with me :)
> > > > - If task B has setuid()ed and prctl(PR_SET_DUMPABLE, 1)ed but not
> > > >   execve()ed then C won't pass ptrace_has_cap(mm->user_ns, mode).
> > >
> > > Yeah.
> > >
> > > > - If task B has setuid()ed, exeved()ed it will get its dumpable flag set
> > > >   to /proc/sys/fs/suid_dumpable
> > >
> > > Not if you changed all UIDs (e.g. by calling setuid() as root). In
> > > that case, setup_new_exec() calls "set_dumpable(current->mm,
> > > SUID_DUMP_USER)".
> >
> > Actually, looking at this when C is trying to PTRACE_ATTACH to B as an
> > unprivileged user even if B execve()ed and it is dumpable C still
> > wouldn't have CAP_SYS_PTRACE in the mm->user_ns unless it already is
> > privileged over mm->user_ns which means it must be in an ancestor
> > user_ns.
> 
> Huh? Why would you need CAP_SYS_PTRACE for anything here? You can
> ptrace another process running under your UID just fine, no matter
> what the namespaces are. I'm not sure what you're saying.

Sorry, I was out the door yesterday when answering this and was too
brief. I forgot to mention: /proc/sys/kernel/yama/ptrace_scope. It
should be enabled by default on nearly all distros and even if not -
which is an administrator's choice - you can usually easily enable it via
sysctl.

1 ("restricted ptrace") [default value]
When performing an operation that requires a PTRACE_MODE_ATTACH check,
the calling process must either have the CAP_SYS_PTRACE capability in
the user namespace of the target process or it must have a predefined
relationship with the target process.  By default, the predefined
relationship is that the target process must be a descendant of the
caller.

If you don't have it set you're already susceptible to all kinds of
other attacks and I'm still not convinced we need to bring out the big
capable(CAP_SYS_ADMIN) gun here.
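
Whether the sysctl is actually set on a given machine can be checked
rather than assumed. A minimal sketch, assuming the procfs path named
above; read_ptrace_scope() and describe_ptrace_scope() are hypothetical
helper names, and the labels follow the Yama documentation's values 0-3.

```c
#include <stdio.h>

/* Path from the discussion; only present when Yama is built in. */
#define YAMA_PTRACE_SCOPE "/proc/sys/kernel/yama/ptrace_scope"

/* Returns the integer stored at path, or -1 if it cannot be read. */
static int read_ptrace_scope(const char *path)
{
	FILE *f = fopen(path, "r");
	int scope = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &scope) != 1)
		scope = -1;
	fclose(f);
	return scope;
}

/* Maps a ptrace_scope value to a short human-readable label. */
static const char *describe_ptrace_scope(int scope)
{
	switch (scope) {
	case 0: return "classic ptrace permissions";
	case 1: return "restricted ptrace";
	case 2: return "admin-only attach";
	case 3: return "no attach";
	default: return "Yama not available or value unreadable";
	}
}
```

describe_ptrace_scope(read_ptrace_scope(YAMA_PTRACE_SCOPE)) then gives a
one-line summary; a result of -1 means the file is missing, e.g. because
Yama is not enabled in the kernel config.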

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-10 12:54                               ` Christian Brauner
@ 2018-10-10 13:09                                 ` Christian Brauner
  2018-10-10 13:10                                 ` Jann Horn
  1 sibling, 0 replies; 91+ messages in thread
From: Christian Brauner @ 2018-10-10 13:09 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module,
	selinux, Paul Moore, Stephen Smalley, Eric Paris

On Wed, Oct 10, 2018 at 02:54:22PM +0200, Christian Brauner wrote:
> On Tue, Oct 09, 2018 at 06:26:47PM +0200, Jann Horn wrote:
> > On Tue, Oct 9, 2018 at 6:20 PM Christian Brauner <christian@brauner.io> wrote:
> > > On Tue, Oct 09, 2018 at 05:26:26PM +0200, Jann Horn wrote:
> > > > On Tue, Oct 9, 2018 at 4:09 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > On Tue, Oct 09, 2018 at 03:50:53PM +0200, Jann Horn wrote:
> > > > > > On Tue, Oct 9, 2018 at 3:49 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > On Tue, Oct 09, 2018 at 03:36:04PM +0200, Jann Horn wrote:
> > > > > > > > On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > > One more thing. Citing from [1]
> > > > > > > > >
> > > > > > > > > > I think there's a security problem here. Imagine the following scenario:
> > > > > > > > > >
> > > > > > > > > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > > > > > > > > 2. task A forks off a child B
> > > > > > > > > > 3. task B uses setuid(1) to drop its privileges
> > > > > > > > > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > > > > > > > > or via execve()
> > > > > > > > > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > > > > > > > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> > > > > > > > >
> > > > > > > > > Sorry, to be late to the party but would this really pass
> > > > > > > > > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > > > > > > > > that it would... Doesn't look like it would get past:
> > > > > > > > >
> > > > > > > > >         tcred = __task_cred(task);
> > > > > > > > >         if (uid_eq(caller_uid, tcred->euid) &&
> > > > > > > > >             uid_eq(caller_uid, tcred->suid) &&
> > > > > > > > >             uid_eq(caller_uid, tcred->uid)  &&
> > > > > > > > >             gid_eq(caller_gid, tcred->egid) &&
> > > > > > > > >             gid_eq(caller_gid, tcred->sgid) &&
> > > > > > > > >             gid_eq(caller_gid, tcred->gid))
> > > > > > > > >                 goto ok;
> > > > > > > > >         if (ptrace_has_cap(tcred->user_ns, mode))
> > > > > > > > >                 goto ok;
> > > > > > > > >         rcu_read_unlock();
> > > > > > > > >         return -EPERM;
> > > > > > > > > ok:
> > > > > > > > >         rcu_read_unlock();
> > > > > > > > >         mm = task->mm;
> > > > > > > > >         if (mm &&
> > > > > > > > >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> > > > > > > > >              !ptrace_has_cap(mm->user_ns, mode)))
> > > > > > > > >             return -EPERM;
> > > > > > > >
> > > > > > > > Which specific check would prevent task C from attaching to task B? If
> > > > > > > > the UIDs match, the first "goto ok" executes; and you're dumpable, so
> > > > > > > > you don't trigger the second "return -EPERM".
> > > > > > >
> > > > > > > You'd also need CAP_SYS_PTRACE in the mm->user_ns which you shouldn't
> > > > > > > have if you did a setuid to an unpriv user. (But I always find that code
> > > > > > > confusing.)
> > > > > >
> > > > > > Only if the target hasn't gone through execve() since setuid().
> > > > >
> > > > > Sorry if I want to know this in excessive detail but I'd like to
> > > > > understand this properly so bear with me :)
> > > > > - If task B has setuid()ed and prctl(PR_SET_DUMPABLE, 1)ed but not
> > > > >   execve()ed then C won't pass ptrace_has_cap(mm->user_ns, mode).
> > > >
> > > > Yeah.
> > > >
> > > > > - If task B has setuid()ed, exeved()ed it will get its dumpable flag set
> > > > >   to /proc/sys/fs/suid_dumpable
> > > >
> > > > Not if you changed all UIDs (e.g. by calling setuid() as root). In
> > > > that case, setup_new_exec() calls "set_dumpable(current->mm,
> > > > SUID_DUMP_USER)".
> > >
> > > Actually, looking at this when C is trying to PTRACE_ATTACH to B as an
> > > unprivileged user even if B execve()ed and it is dumpable C still
> > > wouldn't have CAP_SYS_PTRACE in the mm->user_ns unless it already is
> > > privileged over mm->user_ns which means it must be in an ancestor
> > > user_ns.
> > 
> > Huh? Why would you need CAP_SYS_PTRACE for anything here? You can
> > ptrace another process running under your UID just fine, no matter
> > what the namespaces are. I'm not sure what you're saying.
> 
> Sorry, I was out the door yesterday when answering this and was too
> brief. I forgot to mention: /proc/sys/kernel/yama/ptrace_scope. It
> should be enabled by default on nearly all distros and even if not -
> which is an administrators choice - you can usually easily enable it via
> sysctl.
> 
> 1 ("restricted ptrace") [default value]
> When performing an operation that requires a PTRACE_MODE_ATTACH check,
> the calling process must either have the CAP_SYS_PTRACE capability in
> the user namespace of the target process or it must have a predefined
> relationship with the target process.  By default, the predefined
> relationship is that the target process must be a descendant of the
> caller.
> 
> If you don't have it set you're already susceptible to all kinds of
> other attacks and I'm still not convinced we need to bring out the big
> capable(CAP_SYS_ADMIN) gun here.

That being said, Tycho agreed to leave in the native seccomp() way of
retrieving an fd from the task without the sys_admin restriction [1],
which is what we prefer. If this gets merged with that feature, I care
way less about whether or not the restriction is upheld for ptrace().

[1]: https://lists.linuxfoundation.org/pipermail/containers/2018-October/039553.html

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-10 12:54                               ` Christian Brauner
  2018-10-10 13:09                                 ` Christian Brauner
@ 2018-10-10 13:10                                 ` Jann Horn
  2018-10-10 13:18                                   ` Christian Brauner
  1 sibling, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-10-10 13:10 UTC (permalink / raw)
  To: christian
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module,
	selinux, Paul Moore, Stephen Smalley, Eric Paris

On Wed, Oct 10, 2018 at 2:54 PM Christian Brauner <christian@brauner.io> wrote:
> On Tue, Oct 09, 2018 at 06:26:47PM +0200, Jann Horn wrote:
> > On Tue, Oct 9, 2018 at 6:20 PM Christian Brauner <christian@brauner.io> wrote:
> > > On Tue, Oct 09, 2018 at 05:26:26PM +0200, Jann Horn wrote:
> > > > On Tue, Oct 9, 2018 at 4:09 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > On Tue, Oct 09, 2018 at 03:50:53PM +0200, Jann Horn wrote:
> > > > > > On Tue, Oct 9, 2018 at 3:49 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > On Tue, Oct 09, 2018 at 03:36:04PM +0200, Jann Horn wrote:
> > > > > > > > On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > > One more thing. Citing from [1]
> > > > > > > > >
> > > > > > > > > > I think there's a security problem here. Imagine the following scenario:
> > > > > > > > > >
> > > > > > > > > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > > > > > > > > 2. task A forks off a child B
> > > > > > > > > > 3. task B uses setuid(1) to drop its privileges
> > > > > > > > > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > > > > > > > > or via execve()
> > > > > > > > > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > > > > > > > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> > > > > > > > >
> > > > > > > > > Sorry, to be late to the party but would this really pass
> > > > > > > > > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > > > > > > > > that it would... Doesn't look like it would get past:
> > > > > > > > >
> > > > > > > > >         tcred = __task_cred(task);
> > > > > > > > >         if (uid_eq(caller_uid, tcred->euid) &&
> > > > > > > > >             uid_eq(caller_uid, tcred->suid) &&
> > > > > > > > >             uid_eq(caller_uid, tcred->uid)  &&
> > > > > > > > >             gid_eq(caller_gid, tcred->egid) &&
> > > > > > > > >             gid_eq(caller_gid, tcred->sgid) &&
> > > > > > > > >             gid_eq(caller_gid, tcred->gid))
> > > > > > > > >                 goto ok;
> > > > > > > > >         if (ptrace_has_cap(tcred->user_ns, mode))
> > > > > > > > >                 goto ok;
> > > > > > > > >         rcu_read_unlock();
> > > > > > > > >         return -EPERM;
> > > > > > > > > ok:
> > > > > > > > >         rcu_read_unlock();
> > > > > > > > >         mm = task->mm;
> > > > > > > > >         if (mm &&
> > > > > > > > >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> > > > > > > > >              !ptrace_has_cap(mm->user_ns, mode)))
> > > > > > > > >             return -EPERM;
> > > > > > > >
> > > > > > > > Which specific check would prevent task C from attaching to task B? If
> > > > > > > > the UIDs match, the first "goto ok" executes; and you're dumpable, so
> > > > > > > > you don't trigger the second "return -EPERM".
> > > > > > >
> > > > > > > You'd also need CAP_SYS_PTRACE in the mm->user_ns which you shouldn't
> > > > > > > have if you did a setuid to an unpriv user. (But I always find that code
> > > > > > > confusing.)
> > > > > >
> > > > > > Only if the target hasn't gone through execve() since setuid().
> > > > >
> > > > > Sorry if I want to know this in excessive detail but I'd like to
> > > > > understand this properly so bear with me :)
> > > > > - If task B has setuid()ed and prctl(PR_SET_DUMPABLE, 1)ed but not
> > > > >   execve()ed then C won't pass ptrace_has_cap(mm->user_ns, mode).
> > > >
> > > > Yeah.
> > > >
> > > > > - If task B has setuid()ed, exeved()ed it will get its dumpable flag set
> > > > >   to /proc/sys/fs/suid_dumpable
> > > >
> > > > Not if you changed all UIDs (e.g. by calling setuid() as root). In
> > > > that case, setup_new_exec() calls "set_dumpable(current->mm,
> > > > SUID_DUMP_USER)".
> > >
> > > Actually, looking at this when C is trying to PTRACE_ATTACH to B as an
> > > unprivileged user even if B execve()ed and it is dumpable C still
> > > wouldn't have CAP_SYS_PTRACE in the mm->user_ns unless it already is
> > > privileged over mm->user_ns which means it must be in an ancestor
> > > user_ns.
> >
> > Huh? Why would you need CAP_SYS_PTRACE for anything here? You can
> > ptrace another process running under your UID just fine, no matter
> > what the namespaces are. I'm not sure what you're saying.
>
> Sorry, I was out the door yesterday when answering this and was too
> brief. I forgot to mention: /proc/sys/kernel/yama/ptrace_scope. It
> should be enabled by default on nearly all distros

"nearly all distros"? AFAIK it's off on Debian, for starters. And Yama
still doesn't help you if one of the tasks enters a new user namespace
or whatever.

Yama is a little bit of extra, heuristic, **opt-in** hardening enabled
in some configurations. It is **not** a fundamental building block you
can rely on.

> and even if not -
> which is an administrators choice - you can usually easily enable it via
> sysctl.

Opt-in security isn't good enough. Kernel interfaces should still be
safe to use even on a system that has all the LSM stuff disabled in
the kernel config.
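
For reference, here is a simplified userspace model of the
__ptrace_may_access() excerpt quoted earlier in this thread, with no LSM
involved. It is only a sketch: the struct and function names are invented,
a single caller_has_cap flag stands in for both ptrace_has_cap() calls (the
kernel checks them against different user namespaces), and the target is
assumed to have an mm.

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-ins for kernel state; all names are illustrative. */
struct model_cred {
	unsigned int uid, euid, suid;
	unsigned int gid, egid, sgid;
};

struct model_task {
	struct model_cred cred;
	bool dumpable;	/* models get_dumpable(mm) == SUID_DUMP_USER */
};

/*
 * Model of the credential and dumpable checks from the quoted excerpt.
 * LSM hooks such as Yama are deliberately left out.
 */
int model_ptrace_may_attach(unsigned int caller_uid, unsigned int caller_gid,
			    bool caller_has_cap,
			    const struct model_task *target)
{
	const struct model_cred *t = &target->cred;
	bool ids_match = caller_uid == t->euid && caller_uid == t->suid &&
			 caller_uid == t->uid && caller_gid == t->egid &&
			 caller_gid == t->sgid && caller_gid == t->gid;

	if (!ids_match && !caller_has_cap)
		return -EPERM;	/* first "return -EPERM" */
	if (!target->dumpable && !caller_has_cap)
		return -EPERM;	/* second "return -EPERM" */
	return 0;
}
```

With matching IDs and a dumpable target, the model returns 0 even when
caller_has_cap is false. That is the case for task C attaching to task B in
the scenario above, assuming the GIDs match as well.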

> 1 ("restricted ptrace") [default value]
> When  performing an operation that requires a PTRACE_MODE_ATTACH check,
> the calling process must either have the CAP_SYS_PTRACE capability in
> the user namespace of the target process or it must have a predefined
> relationship with the target process.  By default, the predefined
> relationship is that the target process must be a descendant of the
> caller.
>
> If you don't have it set you're already susceptible to all kinds of
> other attacks

Oh? Can you be more specific, please?

> and I'm still not convinced we need to bring out the big
> capable(CAP_SYS_ADMIN) gun here.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-10 13:10                                 ` Jann Horn
@ 2018-10-10 13:18                                   ` Christian Brauner
  0 siblings, 0 replies; 91+ messages in thread
From: Christian Brauner @ 2018-10-10 13:18 UTC (permalink / raw)
  To: Jann Horn
  Cc: Tycho Andersen, Kees Cook, Linux API, containers, suda.akihiro,
	Oleg Nesterov, kernel list, Eric W. Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski, linux-security-module,
	selinux, Paul Moore, Stephen Smalley, Eric Paris

On Wed, Oct 10, 2018 at 03:10:11PM +0200, Jann Horn wrote:
> On Wed, Oct 10, 2018 at 2:54 PM Christian Brauner <christian@brauner.io> wrote:
> > On Tue, Oct 09, 2018 at 06:26:47PM +0200, Jann Horn wrote:
> > > On Tue, Oct 9, 2018 at 6:20 PM Christian Brauner <christian@brauner.io> wrote:
> > > > On Tue, Oct 09, 2018 at 05:26:26PM +0200, Jann Horn wrote:
> > > > > On Tue, Oct 9, 2018 at 4:09 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > On Tue, Oct 09, 2018 at 03:50:53PM +0200, Jann Horn wrote:
> > > > > > > On Tue, Oct 9, 2018 at 3:49 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > On Tue, Oct 09, 2018 at 03:36:04PM +0200, Jann Horn wrote:
> > > > > > > > > On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > > > One more thing. Citing from [1]
> > > > > > > > > >
> > > > > > > > > > > I think there's a security problem here. Imagine the following scenario:
> > > > > > > > > > >
> > > > > > > > > > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > > > > > > > > > 2. task A forks off a child B
> > > > > > > > > > > 3. task B uses setuid(1) to drop its privileges
> > > > > > > > > > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > > > > > > > > > or via execve()
> > > > > > > > > > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > > > > > > > > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> > > > > > > > > >
> > > > > > > > > > Sorry, to be late to the party but would this really pass
> > > > > > > > > > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > > > > > > > > > that it would... Doesn't look like it would get past:
> > > > > > > > > >
> > > > > > > > > >         tcred = __task_cred(task);
> > > > > > > > > >         if (uid_eq(caller_uid, tcred->euid) &&
> > > > > > > > > >             uid_eq(caller_uid, tcred->suid) &&
> > > > > > > > > >             uid_eq(caller_uid, tcred->uid)  &&
> > > > > > > > > >             gid_eq(caller_gid, tcred->egid) &&
> > > > > > > > > >             gid_eq(caller_gid, tcred->sgid) &&
> > > > > > > > > >             gid_eq(caller_gid, tcred->gid))
> > > > > > > > > >                 goto ok;
> > > > > > > > > >         if (ptrace_has_cap(tcred->user_ns, mode))
> > > > > > > > > >                 goto ok;
> > > > > > > > > >         rcu_read_unlock();
> > > > > > > > > >         return -EPERM;
> > > > > > > > > > ok:
> > > > > > > > > >         rcu_read_unlock();
> > > > > > > > > >         mm = task->mm;
> > > > > > > > > >         if (mm &&
> > > > > > > > > >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> > > > > > > > > >              !ptrace_has_cap(mm->user_ns, mode)))
> > > > > > > > > >             return -EPERM;
> > > > > > > > >
> > > > > > > > > Which specific check would prevent task C from attaching to task B? If
> > > > > > > > > the UIDs match, the first "goto ok" executes; and you're dumpable, so
> > > > > > > > > you don't trigger the second "return -EPERM".
> > > > > > > >
> > > > > > > > You'd also need CAP_SYS_PTRACE in the mm->user_ns which you shouldn't
> > > > > > > > have if you did a setuid to an unpriv user. (But I always find that code
> > > > > > > > confusing.)
> > > > > > >
> > > > > > > Only if the target hasn't gone through execve() since setuid().
> > > > > >
> > > > > > Sorry if I want to know this in excessive detail but I'd like to
> > > > > > understand this properly so bear with me :)
> > > > > > - If task B has setuid()ed and prctl(PR_SET_DUMPABLE, 1)ed but not
> > > > > >   execve()ed then C won't pass ptrace_has_cap(mm->user_ns, mode).
> > > > >
> > > > > Yeah.
> > > > >
> > > > > > - If task B has setuid()ed, execve()ed it will get its dumpable flag set
> > > > > >   to /proc/sys/fs/suid_dumpable
> > > > >
> > > > > Not if you changed all UIDs (e.g. by calling setuid() as root). In
> > > > > that case, setup_new_exec() calls "set_dumpable(current->mm,
> > > > > SUID_DUMP_USER)".
> > > >
> > > > Actually, looking at this when C is trying to PTRACE_ATTACH to B as an
> > > > unprivileged user even if B execve()ed and it is dumpable C still
> > > > wouldn't have CAP_SYS_PTRACE in the mm->user_ns unless it already is
> > > > privileged over mm->user_ns which means it must be in an ancestor
> > > > user_ns.
> > >
> > > Huh? Why would you need CAP_SYS_PTRACE for anything here? You can
> > > ptrace another process running under your UID just fine, no matter
> > > what the namespaces are. I'm not sure what you're saying.
> >
> > Sorry, I was out the door yesterday when answering this and was too
> > brief. I forgot to mention: /proc/sys/kernel/yama/ptrace_scope. It
> > should be enabled by default on nearly all distros
> 
> "nearly all distros"? AFAIK it's off on Debian, for starters. And Yama
> still doesn't help you if one of the tasks enters a new user namespace
> or whatever.
> 
> Yama is a little bit of extra, heuristic, **opt-in** hardening enabled
> in some configurations. It is **not** a fundamental building block you
> can rely on.
> 
> > and even if not -
> > which is an administrators choice - you can usually easily enable it via
> > sysctl.
> 
> Opt-in security isn't good enough. Kernel interfaces should still be
> safe to use even on a system that has all the LSM stuff disabled in
> the kernel config.

Then ptrace() isn't, I guess?

But see https://lists.linuxfoundation.org/pipermail/containers/2018-October/039567.html
I don't care as long as we have a way of getting the fd without the
CAP_SYS_ADMIN requirement through seccomp().

> 
> > 1 ("restricted ptrace") [default value]
> > When  performing an operation that requires a PTRACE_MODE_ATTACH check,
> > the calling process must either have the CAP_SYS_PTRACE capability in
> > the user namespace of the target process or it must have a predefined
> > relationship with the target process.  By default, the predefined
> > relationship is that the target process must be a descendant of the
> > caller.
> >
> > If you don't have it set you're already susceptible to all kinds of
> > other attacks
> 
> Oh? Can you be more specific, please?

I was referring to attacks where you attach to processes that run as
your user but might expose in-memory credentials or other sensitive
information (essentially what the manpage is outlining).

> 
> > and I'm still not convinced we need to bring out the big
> > capable(CAP_SYS_ADMIN) gun here.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-09 13:36                 ` Jann Horn
  2018-10-09 13:49                   ` Christian Brauner
@ 2018-10-10 15:31                   ` Paul Moore
  2018-10-10 15:33                     ` Jann Horn
  1 sibling, 1 reply; 91+ messages in thread
From: Paul Moore @ 2018-10-10 15:31 UTC (permalink / raw)
  To: jannh
  Cc: christian, Tycho Andersen, keescook, linux-api, containers,
	suda.akihiro, oleg, linux-kernel, ebiederm, linux-fsdevel,
	christian.brauner, luto, linux-security-module, selinux,
	Stephen Smalley, Eric Paris

On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
> +cc selinux people explicitly, since they probably have opinions on this

I just spent about twenty minutes working my way through this thread,
and digging through the containers archive trying to get a good
understanding of what you guys are trying to do, and I'm not quite
sure I understand it all.  However, from what I have seen, this
approach looks very ptrace-y to me (I imagine to others as well based
on the comments) and because of this I think ensuring the usual ptrace
access controls are evaluated, including the ptrace LSM hooks, is the
right thing to do.

If I've missed something, or I'm thinking about this wrong, please
educate me; just a heads-up that I'm largely offline for most of this
week so responses on my end are going to be delayed much more than
usual.

> On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > On Tue, Oct 09, 2018 at 02:39:53PM +0200, Jann Horn wrote:
> > > On Mon, Oct 8, 2018 at 8:18 PM Christian Brauner <christian@brauner.io> wrote:
> > > > On Mon, Oct 08, 2018 at 06:42:00PM +0200, Jann Horn wrote:
> > > > > On Mon, Oct 8, 2018 at 6:21 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
> > > > > > > On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > > > > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > > > > > > > index 44a31ac8373a..17685803a2af 100644
> > > > > > > > > --- a/kernel/seccomp.c
> > > > > > > > > +++ b/kernel/seccomp.c
> > > > > > > > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> > > > > > > > >
> > > > > > > > >       return ret;
> > > > > > > > >  }
> > > > > > > > > +
> > > > > > > > > +long seccomp_new_listener(struct task_struct *task,
> > > > > > > > > +                       unsigned long filter_off)
> > > > > > > > > +{
> > > > > > > > > +     struct seccomp_filter *filter;
> > > > > > > > > +     struct file *listener;
> > > > > > > > > +     int fd;
> > > > > > > > > +
> > > > > > > > > +     if (!capable(CAP_SYS_ADMIN))
> > > > > > > > > +             return -EACCES;
> > > > > > > >
> > > > > > > > I know this might have been discussed a while back but why exactly do we
> > > > > > > > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> > > > > > > > > if I want to do a setns(fd, CLONE_NEWUSER) to the target process and
> > > > > > > > use ptrace from in there?
> > > > > > >
> > > > > > > See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> > > > > > > . Basically, the problem is that this doesn't just give you capability
> > > > > > > over the target task, but also over every other task that has the same
> > > > > > > filter installed; you need some sort of "is the caller capable over
> > > > > > > the filter and anyone who uses it" check.
> > > > > >
> > > > > > Thanks.
> > > > > > But then this new ptrace feature as it stands is imho currently broken.
> > > > > > If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
> > > > > > > are ns_capable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
> > > > > > > if you are ns_capable(CAP_SYS_ADMIN)
> > >
> > > Actually, you don't need CAP_SYS_ADMIN for seccomp() at all as long as
> > > you enable the NNP flag, I think?
> >
> > Yes, if you turn on NNP you don't even need sys_admin.
> >
> > >
> > > > > > then either the new ptrace() api
> > > > > > extension should be fixed to allow for this too or the seccomp() way of
> > > > > > retrieving the pid - which I really think we want - needs to be fixed to
> > > > > > require capable(CAP_SYS_ADMIN) too.
> > > > > > The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
> > > > > > the preferred way to solve this.
> > > > > > Everything else will just be confusing.
> > > > >
> > > > > First you say "broken", then you say "confusing". Which one do you mean?
> > > >
> > > > Both. It's broken in so far as it places a seemingly unnecessary
> > > > restriction that could be fixed. You outlined one possible fix yourself
> > > > in the link you provided.
> > >
> > > If by "possible fix" you mean "check whether the seccomp filter is
> > > only attached to a single task": That wouldn't fundamentally change
> > > the situation, it would only add an additional special case.
> > >
> > > > And it's confusing in so far as there is a way
> > > > via seccomp() to get the fd without said requirement.
> > >
> > > I don't find it confusing at all. seccomp() and ptrace() are very
> >
> > Fine, then that's a matter of opinion. I find it counterintuitive that
> > you can get an fd without privileges via one interface but not via
> > another.
> >
> > > different situations: When you use seccomp(), infrastructure is
> >
> > Sure. Note, that this is _one_ of the reasons why I want to make sure we
> > keep the native seccomp() only based way of getting an fd without
> > > forcing userspace to switch to a different kernel api.
> >
> > > already in place for ensuring that your filter is only applied to
> > > processes over which you are capable, and propagation is limited by
> > > inheritance from your task down. When you use ptrace(), you need a
> > > pretty different sort of access check that checks whether you're
> > > privileged over ancestors, siblings and so on of the target task.
> >
> > So, don't get me wrong I'm not arguing against the ptrace() interface in
> > general. If this is something that people find useful, fine. But, I
> > would like to have a simple single-syscall pure-seccomp() based way of
> > getting an fd, i.e. what we have in patch 1 of this series.
>
> Yeah, I also prefer the seccomp() one.
>
> > > But thinking about it more, I think that CAP_SYS_ADMIN over the saved
> > > current->mm->user_ns of the task that installed the filter (stored as
> > > a "struct user_namespace *" in the filter) should be acceptable.
> >
> > Hm... Why not CAP_SYS_PTRACE?
>
> Because LSMs like SELinux add extra checks that apply even if you have
> CAP_SYS_PTRACE, and this would subvert those. The only capability I
> know of that lets you bypass LSM checks by design (if no LSM blocks
> the capability itself) is CAP_SYS_ADMIN.
>
> > One more thing. Citing from [1]
> >
> > > I think there's a security problem here. Imagine the following scenario:
> > >
> > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > 2. task A forks off a child B
> > > 3. task B uses setuid(1) to drop its privileges
> > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > or via execve()
> > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> >
> > Sorry, to be late to the party but would this really pass
> > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > that it would... Doesn't look like it would get past:
> >
> >         tcred = __task_cred(task);
> >         if (uid_eq(caller_uid, tcred->euid) &&
> >             uid_eq(caller_uid, tcred->suid) &&
> >             uid_eq(caller_uid, tcred->uid)  &&
> >             gid_eq(caller_gid, tcred->egid) &&
> >             gid_eq(caller_gid, tcred->sgid) &&
> >             gid_eq(caller_gid, tcred->gid))
> >                 goto ok;
> >         if (ptrace_has_cap(tcred->user_ns, mode))
> >                 goto ok;
> >         rcu_read_unlock();
> >         return -EPERM;
> > ok:
> >         rcu_read_unlock();
> >         mm = task->mm;
> >         if (mm &&
> >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> >              !ptrace_has_cap(mm->user_ns, mode)))
> >             return -EPERM;
>
> Which specific check would prevent task C from attaching to task B? If
> the UIDs match, the first "goto ok" executes; and you're dumpable, so
> you don't trigger the second "return -EPERM".
>
> > > 7. because the seccomp filter is shared by task A and task B, task C
> > > is now able to influence syscall results for syscalls performed by
> > > task A
> >
> > [1]: https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/



-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-10 15:31                   ` Paul Moore
@ 2018-10-10 15:33                     ` Jann Horn
  2018-10-10 15:39                       ` Christian Brauner
  2018-10-11  7:24                       ` Paul Moore
  0 siblings, 2 replies; 91+ messages in thread
From: Jann Horn @ 2018-10-10 15:33 UTC (permalink / raw)
  To: Paul Moore
  Cc: christian, Tycho Andersen, Kees Cook, Linux API, containers,
	suda.akihiro, Oleg Nesterov, kernel list, Eric W. Biederman,
	linux-fsdevel, Christian Brauner, Andy Lutomirski,
	linux-security-module, selinux, Stephen Smalley, Eric Paris

On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
> On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
> > +cc selinux people explicitly, since they probably have opinions on this
>
> I just spent about twenty minutes working my way through this thread,
> and digging through the containers archive trying to get a good
> understanding of what you guys are trying to do, and I'm not quite
> sure I understand it all.  However, from what I have seen, this
> approach looks very ptrace-y to me (I imagine to others as well based
> on the comments) and because of this I think ensuring the usual ptrace
> access controls are evaluated, including the ptrace LSM hooks, is the
> right thing to do.

Basically the problem is that this new ptrace() API does something
that doesn't just influence the target task, but also every other task
that has the same seccomp filter. So the classic ptrace check doesn't
work here.
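
A toy model of why that is, in case it helps. This is a sketch and not the
kernel code: the type and function names below are invented, and the real
refcounting in kernel/seccomp.c is more involved. The only point is that
fork() makes the child point at the same filter object rather than a copy.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: one filter object that several tasks may point at. */
struct shared_filter {
	int users;		/* number of tasks sharing this filter */
	bool has_listener;	/* set once someone obtains a listener fd */
};

struct shared_task {
	struct shared_filter *filter;
};

/* fork() model: the child references the parent's filter, it is not copied. */
void shared_fork(const struct shared_task *parent, struct shared_task *child)
{
	child->filter = parent->filter;
	if (child->filter)
		child->filter->users++;
}

/*
 * Listener model: attaching via one task marks the shared filter itself.
 * Returns the number of tasks whose syscalls the listener can now see.
 */
int shared_new_listener(struct shared_task *target)
{
	if (!target->filter)
		return 0;
	target->filter->has_listener = true;
	return target->filter->users;
}
```

If task A installs the filter and forks B, a listener obtained through B
reports two affected tasks, and A's filter has the listener set as well. A
ptrace check against B alone says nothing about the caller's rights over A.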

> If I've missed something, or I'm thinking about this wrong, please
> educate me; just a heads-up that I'm largely offline for most of this
> week so responses on my end are going to be delayed much more than
> usual.
>
> > On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > > On Tue, Oct 09, 2018 at 02:39:53PM +0200, Jann Horn wrote:
> > > > On Mon, Oct 8, 2018 at 8:18 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > On Mon, Oct 08, 2018 at 06:42:00PM +0200, Jann Horn wrote:
> > > > > > On Mon, Oct 8, 2018 at 6:21 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
> > > > > > > > On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > > > > > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > > > > > > > > index 44a31ac8373a..17685803a2af 100644
> > > > > > > > > > --- a/kernel/seccomp.c
> > > > > > > > > > +++ b/kernel/seccomp.c
> > > > > > > > > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> > > > > > > > > >
> > > > > > > > > >       return ret;
> > > > > > > > > >  }
> > > > > > > > > > +
> > > > > > > > > > +long seccomp_new_listener(struct task_struct *task,
> > > > > > > > > > +                       unsigned long filter_off)
> > > > > > > > > > +{
> > > > > > > > > > +     struct seccomp_filter *filter;
> > > > > > > > > > +     struct file *listener;
> > > > > > > > > > +     int fd;
> > > > > > > > > > +
> > > > > > > > > > +     if (!capable(CAP_SYS_ADMIN))
> > > > > > > > > > +             return -EACCES;
> > > > > > > > >
> > > > > > > > > I know this might have been discussed a while back but why exactly do we
> > > > > > > > > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> > > > > > > > > > if I want to do a setns(fd, CLONE_NEWUSER) to the target process and
> > > > > > > > > use ptrace from in there?
> > > > > > > >
> > > > > > > > See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> > > > > > > > . Basically, the problem is that this doesn't just give you capability
> > > > > > > > over the target task, but also over every other task that has the same
> > > > > > > > filter installed; you need some sort of "is the caller capable over
> > > > > > > > the filter and anyone who uses it" check.
> > > > > > >
> > > > > > > Thanks.
> > > > > > > But then this new ptrace feature as it stands is imho currently broken.
> > > > > > > If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
> > > > > > > > are ns_capable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
> > > > > > > > if you are ns_capable(CAP_SYS_ADMIN)
> > > >
> > > > Actually, you don't need CAP_SYS_ADMIN for seccomp() at all as long as
> > > > you enable the NNP flag, I think?
> > >
> > > Yes, if you turn on NNP you don't even need sys_admin.
> > >
> > > >
> > > > > > > then either the new ptrace() api
> > > > > > > extension should be fixed to allow for this too or the seccomp() way of
> > > > > > > retrieving the pid - which I really think we want - needs to be fixed to
> > > > > > > require capable(CAP_SYS_ADMIN) too.
> > > > > > > The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
> > > > > > > the preferred way to solve this.
> > > > > > > Everything else will just be confusing.
> > > > > >
> > > > > > First you say "broken", then you say "confusing". Which one do you mean?
> > > > >
> > > > > Both. It's broken in so far as it places a seemingly unnecessary
> > > > > restriction that could be fixed. You outlined one possible fix yourself
> > > > > in the link you provided.
> > > >
> > > > If by "possible fix" you mean "check whether the seccomp filter is
> > > > only attached to a single task": That wouldn't fundamentally change
> > > > the situation, it would only add an additional special case.
> > > >
> > > > > And it's confusing in so far as there is a way
> > > > > via seccomp() to get the fd without said requirement.
> > > >
> > > > I don't find it confusing at all. seccomp() and ptrace() are very
> > >
> > > Fine, then that's a matter of opinion. I find it counterintuitive that
> > > you can get an fd without privileges via one interface but not via
> > > another.
> > >
> > > > different situations: When you use seccomp(), infrastructure is
> > >
> > > Sure. Note, that this is _one_ of the reasons why I want to make sure we
> > > keep the native seccomp() only based way of getting an fd without
> > > > forcing userspace to switch to a different kernel api.
> > >
> > > > already in place for ensuring that your filter is only applied to
> > > > processes over which you are capable, and propagation is limited by
> > > > inheritance from your task down. When you use ptrace(), you need a
> > > > pretty different sort of access check that checks whether you're
> > > > privileged over ancestors, siblings and so on of the target task.
> > >
> > > So, don't get me wrong I'm not arguing against the ptrace() interface in
> > > general. If this is something that people find useful, fine. But, I
> > > would like to have a simple single-syscall pure-seccomp() based way of
> > > getting an fd, i.e. what we have in patch 1 of this series.
> >
> > Yeah, I also prefer the seccomp() one.
> >
> > > > But thinking about it more, I think that CAP_SYS_ADMIN over the saved
> > > > current->mm->user_ns of the task that installed the filter (stored as
> > > > a "struct user_namespace *" in the filter) should be acceptable.
> > >
> > > Hm... Why not CAP_SYS_PTRACE?
> >
> > Because LSMs like SELinux add extra checks that apply even if you have
> > CAP_SYS_PTRACE, and this would subvert those. The only capability I
> > know of that lets you bypass LSM checks by design (if no LSM blocks
> > the capability itself) is CAP_SYS_ADMIN.
> >
> > > One more thing. Citing from [1]
> > >
> > > > I think there's a security problem here. Imagine the following scenario:
> > > >
> > > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > > 2. task A forks off a child B
> > > > 3. task B uses setuid(1) to drop its privileges
> > > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > > or via execve()
> > > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> > >
> > > Sorry, to be late to the party but would this really pass
> > > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > > that it would... Doesn't look like it would get past:
> > >
> > >         tcred = __task_cred(task);
> > >         if (uid_eq(caller_uid, tcred->euid) &&
> > >             uid_eq(caller_uid, tcred->suid) &&
> > >             uid_eq(caller_uid, tcred->uid)  &&
> > >             gid_eq(caller_gid, tcred->egid) &&
> > >             gid_eq(caller_gid, tcred->sgid) &&
> > >             gid_eq(caller_gid, tcred->gid))
> > >                 goto ok;
> > >         if (ptrace_has_cap(tcred->user_ns, mode))
> > >                 goto ok;
> > >         rcu_read_unlock();
> > >         return -EPERM;
> > > ok:
> > >         rcu_read_unlock();
> > >         mm = task->mm;
> > >         if (mm &&
> > >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> > >              !ptrace_has_cap(mm->user_ns, mode)))
> > >             return -EPERM;
> >
> > Which specific check would prevent task C from attaching to task B? If
> > the UIDs match, the first "goto ok" executes; and you're dumpable, so
> > you don't trigger the second "return -EPERM".
> >
> > > > 7. because the seccomp filter is shared by task A and task B, task C
> > > > is now able to influence syscall results for syscalls performed by
> > > > task A
> > >
> > > [1]: https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
>
>
>
> --
> paul moore
> www.paul-moore.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-10 15:33                     ` Jann Horn
@ 2018-10-10 15:39                       ` Christian Brauner
  2018-10-10 16:54                         ` Tycho Andersen
  2018-10-11  7:24                       ` Paul Moore
  1 sibling, 1 reply; 91+ messages in thread
From: Christian Brauner @ 2018-10-10 15:39 UTC (permalink / raw)
  To: Jann Horn
  Cc: Paul Moore, Tycho Andersen, Kees Cook, Linux API, containers,
	suda.akihiro, Oleg Nesterov, kernel list, Eric W. Biederman,
	linux-fsdevel, Christian Brauner, Andy Lutomirski,
	linux-security-module, selinux, Stephen Smalley, Eric Paris

On Wed, Oct 10, 2018 at 05:33:43PM +0200, Jann Horn wrote:
> On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
> > On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
> > > +cc selinux people explicitly, since they probably have opinions on this
> >
> > I just spent about twenty minutes working my way through this thread,
> > and digging through the containers archive trying to get a good
> > understanding of what you guys are trying to do, and I'm not quite
> > sure I understand it all.  However, from what I have seen, this
> > approach looks very ptrace-y to me (I imagine to others as well based
> > on the comments) and because of this I think ensuring the usual ptrace
> > access controls are evaluated, including the ptrace LSM hooks, is the
> > right thing to do.
> 
> Basically the problem is that this new ptrace() API does something
> that doesn't just influence the target task, but also every other task
> that has the same seccomp filter. So the classic ptrace check doesn't
> work here.

Just to throw this into the mix: then maybe ptrace() isn't the right
interface and we should just go with the native seccomp() approach for
now.

> 
> > If I've missed something, or I'm thinking about this wrong, please
> > educate me; just a heads-up that I'm largely offline for most of this
> > week so responses on my end are going to be delayed much more than
> > usual.
> >
> > > On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
> > > > On Tue, Oct 09, 2018 at 02:39:53PM +0200, Jann Horn wrote:
> > > > > On Mon, Oct 8, 2018 at 8:18 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > On Mon, Oct 08, 2018 at 06:42:00PM +0200, Jann Horn wrote:
> > > > > > > On Mon, Oct 8, 2018 at 6:21 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
> > > > > > > > > On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > > > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > > > > > > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > > > > > > > > > index 44a31ac8373a..17685803a2af 100644
> > > > > > > > > > > --- a/kernel/seccomp.c
> > > > > > > > > > > +++ b/kernel/seccomp.c
> > > > > > > > > > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
> > > > > > > > > > >
> > > > > > > > > > >       return ret;
> > > > > > > > > > >  }
> > > > > > > > > > > +
> > > > > > > > > > > +long seccomp_new_listener(struct task_struct *task,
> > > > > > > > > > > +                       unsigned long filter_off)
> > > > > > > > > > > +{
> > > > > > > > > > > +     struct seccomp_filter *filter;
> > > > > > > > > > > +     struct file *listener;
> > > > > > > > > > > +     int fd;
> > > > > > > > > > > +
> > > > > > > > > > > +     if (!capable(CAP_SYS_ADMIN))
> > > > > > > > > > > +             return -EACCES;
> > > > > > > > > >
> > > > > > > > > > I know this might have been discussed a while back but why exactly do we
> > > > > > > > > > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
> > > > > > > > > > > if I want to do a setns(fd, CLONE_NEWUSER) to the target process and
> > > > > > > > > > use ptrace from in there?
> > > > > > > > >
> > > > > > > > > See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> > > > > > > > > . Basically, the problem is that this doesn't just give you capability
> > > > > > > > > over the target task, but also over every other task that has the same
> > > > > > > > > filter installed; you need some sort of "is the caller capable over
> > > > > > > > > the filter and anyone who uses it" check.
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > > But then this new ptrace feature as it stands is imho currently broken.
> > > > > > > > If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
> > > > > > > > > are ns_capable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
> > > > > > > > > if you are ns_capable(CAP_SYS_ADMIN)
> > > > >
> > > > > Actually, you don't need CAP_SYS_ADMIN for seccomp() at all as long as
> > > > > you enable the NNP flag, I think?
> > > >
> > > > Yes, if you turn on NNP you don't even need sys_admin.
> > > >
> > > > >
> > > > > > > > then either the new ptrace() api
> > > > > > > > extension should be fixed to allow for this too or the seccomp() way of
> > > > > > > > retrieving the pid - which I really think we want - needs to be fixed to
> > > > > > > > require capable(CAP_SYS_ADMIN) too.
> > > > > > > > The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
> > > > > > > > the preferred way to solve this.
> > > > > > > > Everything else will just be confusing.
> > > > > > >
> > > > > > > First you say "broken", then you say "confusing". Which one do you mean?
> > > > > >
> > > > > > Both. It's broken in so far as it places a seemingly unnecessary
> > > > > > restriction that could be fixed. You outlined one possible fix yourself
> > > > > > in the link you provided.
> > > > >
> > > > > If by "possible fix" you mean "check whether the seccomp filter is
> > > > > only attached to a single task": That wouldn't fundamentally change
> > > > > the situation, it would only add an additional special case.
> > > > >
> > > > > > And it's confusing in so far as there is a way
> > > > > > via seccomp() to get the fd without said requirement.
> > > > >
> > > > > I don't find it confusing at all. seccomp() and ptrace() are very
> > > >
> > > > Fine, then that's a matter of opinion. I find it counterintuitive that
> > > > you can get an fd without privileges via one interface but not via
> > > > another.
> > > >
> > > > > different situations: When you use seccomp(), infrastructure is
> > > >
> > > > Sure. Note, that this is _one_ of the reasons why I want to make sure we
> > > > keep the native seccomp() only based way of getting an fd without
> > > > > forcing userspace to switch to a different kernel api.
> > > >
> > > > > already in place for ensuring that your filter is only applied to
> > > > > processes over which you are capable, and propagation is limited by
> > > > > inheritance from your task down. When you use ptrace(), you need a
> > > > > pretty different sort of access check that checks whether you're
> > > > > privileged over ancestors, siblings and so on of the target task.
> > > >
> > > > So, don't get me wrong I'm not arguing against the ptrace() interface in
> > > > general. If this is something that people find useful, fine. But, I
> > > > would like to have a simple single-syscall pure-seccomp() based way of
> > > > getting an fd, i.e. what we have in patch 1 of this series.
> > >
> > > Yeah, I also prefer the seccomp() one.
> > >
> > > > > But thinking about it more, I think that CAP_SYS_ADMIN over the saved
> > > > > current->mm->user_ns of the task that installed the filter (stored as
> > > > > a "struct user_namespace *" in the filter) should be acceptable.
> > > >
> > > > Hm... Why not CAP_SYS_PTRACE?
> > >
> > > Because LSMs like SELinux add extra checks that apply even if you have
> > > CAP_SYS_PTRACE, and this would subvert those. The only capability I
> > > know of that lets you bypass LSM checks by design (if no LSM blocks
> > > the capability itself) is CAP_SYS_ADMIN.
> > >
> > > > One more thing. Citing from [1]
> > > >
> > > > > I think there's a security problem here. Imagine the following scenario:
> > > > >
> > > > > 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
> > > > > 2. task A forks off a child B
> > > > > 3. task B uses setuid(1) to drop its privileges
> > > > > 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
> > > > > or via execve()
> > > > > 5. task C (the attacker, uid==1) attaches to task B via ptrace
> > > > > 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
> > > >
> > > > Sorry, to be late to the party but would this really pass
> > > > __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
> > > > that it would... Doesn't look like it would get past:
> > > >
> > > >         tcred = __task_cred(task);
> > > >         if (uid_eq(caller_uid, tcred->euid) &&
> > > >             uid_eq(caller_uid, tcred->suid) &&
> > > >             uid_eq(caller_uid, tcred->uid)  &&
> > > >             gid_eq(caller_gid, tcred->egid) &&
> > > >             gid_eq(caller_gid, tcred->sgid) &&
> > > >             gid_eq(caller_gid, tcred->gid))
> > > >                 goto ok;
> > > >         if (ptrace_has_cap(tcred->user_ns, mode))
> > > >                 goto ok;
> > > >         rcu_read_unlock();
> > > >         return -EPERM;
> > > > ok:
> > > >         rcu_read_unlock();
> > > >         mm = task->mm;
> > > >         if (mm &&
> > > >             ((get_dumpable(mm) != SUID_DUMP_USER) &&
> > > >              !ptrace_has_cap(mm->user_ns, mode)))
> > > >             return -EPERM;
> > >
> > > Which specific check would prevent task C from attaching to task B? If
> > > the UIDs match, the first "goto ok" executes; and you're dumpable, so
> > > you don't trigger the second "return -EPERM".
> > >
> > > > > 7. because the seccomp filter is shared by task A and task B, task C
> > > > > is now able to influence syscall results for syscalls performed by
> > > > > task A
> > > >
> > > > [1]: https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
> >
> >
> >
> > --
> > paul moore
> > www.paul-moore.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-10 15:39                       ` Christian Brauner
@ 2018-10-10 16:54                         ` Tycho Andersen
  2018-10-10 17:15                           ` Christian Brauner
  0 siblings, 1 reply; 91+ messages in thread
From: Tycho Andersen @ 2018-10-10 16:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Paul Moore, Kees Cook, Linux API, containers,
	suda.akihiro, Oleg Nesterov, kernel list, Eric W. Biederman,
	linux-fsdevel, Christian Brauner, Andy Lutomirski,
	linux-security-module, selinux, Stephen Smalley, Eric Paris

On Wed, Oct 10, 2018 at 05:39:57PM +0200, Christian Brauner wrote:
> On Wed, Oct 10, 2018 at 05:33:43PM +0200, Jann Horn wrote:
> > On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
> > > On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
> > > > +cc selinux people explicitly, since they probably have opinions on this
> > >
> > > I just spent about twenty minutes working my way through this thread,
> > > and digging through the containers archive trying to get a good
> > > understanding of what you guys are trying to do, and I'm not quite
> > > sure I understand it all.  However, from what I have seen, this
> > > approach looks very ptrace-y to me (I imagine to others as well based
> > > on the comments) and because of this I think ensuring the usual ptrace
> > > access controls are evaluated, including the ptrace LSM hooks, is the
> > > right thing to do.
> > 
> > Basically the problem is that this new ptrace() API does something
> > that doesn't just influence the target task, but also every other task
> > that has the same seccomp filter. So the classic ptrace check doesn't
> > work here.
> 
> Just to throw this into the mix: then maybe ptrace() isn't the right
> interface and we should just go with the native seccomp() approach for
> now.

Please no :).

I don't buy your arguments that 3-syscalls vs. one is better. If I'm
doing this setup with a new container, I have to do
clone(CLONE_FILES), do this seccomp thing, so that my parent can pick
it up again, then do another clone without CLONE_FILES, because in the
general case I don't want to share my fd table with the container,
wait on the middle task for errors, etc. So we're still doing a bunch
of setup, and it feels more awkward than ptrace, with at least as many
syscalls, and it only works for your children.

I don't mind leaving capable(CAP_SYS_ADMIN) for the ptrace() part,
though. So if that's ok, then I think we can agree :)

Tycho


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-10 16:54                         ` Tycho Andersen
@ 2018-10-10 17:15                           ` Christian Brauner
  2018-10-10 17:26                             ` Tycho Andersen
  0 siblings, 1 reply; 91+ messages in thread
From: Christian Brauner @ 2018-10-10 17:15 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Jann Horn, Paul Moore, Kees Cook, Linux API, containers,
	suda.akihiro, Oleg Nesterov, kernel list, Eric W. Biederman,
	linux-fsdevel, Christian Brauner, Andy Lutomirski,
	linux-security-module, selinux, Stephen Smalley, Eric Paris

On Wed, Oct 10, 2018 at 09:54:58AM -0700, Tycho Andersen wrote:
> On Wed, Oct 10, 2018 at 05:39:57PM +0200, Christian Brauner wrote:
> > On Wed, Oct 10, 2018 at 05:33:43PM +0200, Jann Horn wrote:
> > > On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
> > > > On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
> > > > > +cc selinux people explicitly, since they probably have opinions on this
> > > >
> > > > I just spent about twenty minutes working my way through this thread,
> > > > and digging through the containers archive trying to get a good
> > > > understanding of what you guys are trying to do, and I'm not quite
> > > > sure I understand it all.  However, from what I have seen, this
> > > > approach looks very ptrace-y to me (I imagine to others as well based
> > > > on the comments) and because of this I think ensuring the usual ptrace
> > > > access controls are evaluated, including the ptrace LSM hooks, is the
> > > > right thing to do.
> > > 
> > > Basically the problem is that this new ptrace() API does something
> > > that doesn't just influence the target task, but also every other task
> > > that has the same seccomp filter. So the classic ptrace check doesn't
> > > work here.
> > 
> > Just to throw this into the mix: then maybe ptrace() isn't the right
> > interface and we should just go with the native seccomp() approach for
> > now.
> 
> Please no :).
> 
> I don't buy your arguments that 3-syscalls vs. one is better. If I'm
> doing this setup with a new container, I have to do
> clone(CLONE_FILES), do this seccomp thing, so that my parent can pick
> it up again, then do another clone without CLONE_FILES, because in the
> general case I don't want to share my fd table with the container,
> wait on the middle task for errors, etc. So we're still doing a bunch
> of setup, and it feels more awkward than ptrace, with at least as many
> syscalls, and it only works for your children.

You're talking about the case where you already have shot yourself in
the foot by blocking basically all other sensible ways of getting the fd
out.

Also, this was meant to show that parts of your initial justification
for implementing the ptrace() way of getting an fd don't really hold up.
Even with ptrace() you can get into situations where you're not able to
get an fd. (see prior threads)

> 
> I don't mind leaving capable(CAP_SYS_ADMIN) for the ptrace() part,

Again, (prior threads) ptrace() or no ptrace() is something I do not
particularly care about as long as we have the
non-capable(CAP_SYS_ADMIN) seccomp() way of getting an fd out.
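
For reference, a minimal sketch of the unprivileged path that Jann's NNP
remark earlier in the thread refers to: set no_new_privs, then install a
filter with seccomp(). The allow-all filter and the function name
install_filter_unprivileged() are placeholders of mine, not something
from the series, and the sketch does not request a listener fd since
that flag is only proposed here. It runs in a forked child so the caller
stays unfiltered.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if a child could install a filter
 * after setting no_new_privs, 1 if prctl() failed, 2 if seccomp()
 * failed, and -1 if the child could not be created or reaped.
 */
int install_filter_unprivileged(void)
{
	/* Placeholder policy: allow every syscall. */
	struct sock_filter insns[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Without CAP_SYS_ADMIN, seccomp() requires no_new_privs. */
		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
			_exit(1);
		if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog))
			_exit(2);
		_exit(0);
	}
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

A return value of 0 when run as an ordinary user shows the install path
itself needs no capability once NNP is set.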

> though. So if that's ok, then I think we can agree :)
> 
> Tycho


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-10 17:15                           ` Christian Brauner
@ 2018-10-10 17:26                             ` Tycho Andersen
  2018-10-10 18:28                               ` Christian Brauner
  0 siblings, 1 reply; 91+ messages in thread
From: Tycho Andersen @ 2018-10-10 17:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Paul Moore, Kees Cook, Linux API, containers,
	suda.akihiro, Oleg Nesterov, kernel list, Eric W. Biederman,
	linux-fsdevel, Christian Brauner, Andy Lutomirski,
	linux-security-module, selinux, Stephen Smalley, Eric Paris

On Wed, Oct 10, 2018 at 07:15:02PM +0200, Christian Brauner wrote:
> On Wed, Oct 10, 2018 at 09:54:58AM -0700, Tycho Andersen wrote:
> > On Wed, Oct 10, 2018 at 05:39:57PM +0200, Christian Brauner wrote:
> > > On Wed, Oct 10, 2018 at 05:33:43PM +0200, Jann Horn wrote:
> > > > On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
> > > > > On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
> > > > > > +cc selinux people explicitly, since they probably have opinions on this
> > > > >
> > > > > I just spent about twenty minutes working my way through this thread,
> > > > > and digging through the containers archive trying to get a good
> > > > > understanding of what you guys are trying to do, and I'm not quite
> > > > > sure I understand it all.  However, from what I have seen, this
> > > > > approach looks very ptrace-y to me (I imagine to others as well based
> > > > > on the comments) and because of this I think ensuring the usual ptrace
> > > > > access controls are evaluated, including the ptrace LSM hooks, is the
> > > > > right thing to do.
> > > > 
> > > > Basically the problem is that this new ptrace() API does something
> > > > that doesn't just influence the target task, but also every other task
> > > > that has the same seccomp filter. So the classic ptrace check doesn't
> > > > work here.
> > > 
> > > Just to throw this into the mix: then maybe ptrace() isn't the right
> > > interface and we should just go with the native seccomp() approach for
> > > now.
> > 
> > Please no :).
> > 
> > I don't buy your arguments that 3-syscalls vs. one is better. If I'm
> > doing this setup with a new container, I have to do
> > clone(CLONE_FILES), do this seccomp thing, so that my parent can pick
> > it up again, then do another clone without CLONE_FILES, because in the
> > general case I don't want to share my fd table with the container,
> > wait on the middle task for errors, etc. So we're still doing a bunch
> > of setup, and it feels more awkward than ptrace, with at least as many
> > syscalls, and it only works for your children.
> 
> You're talking about the case where you already have shot yourself in
> the foot by blocking basically all other sensible ways of getting the fd
> out.

Ok, but these other ways involve syscalls too (sendmsg() or whatever).
And if you're going to allow arbitrary policy from your users, you
have to be maximally flexible.
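
To make the sendmsg() route concrete, here is a rough sketch of passing
an fd over a UNIX socket with SCM_RIGHTS, which is the usual mechanism
and the one a filter blocking sendmsg() would break. The helper names
send_fd() and recv_fd() are made up for illustration, and a one-byte
dummy payload is assumed.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: send one fd over a UNIX socket. 0 on success. */
int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment */
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd. Returns the new fd, or -1. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

Both ends need socket() or socketpair() set up beforehand, plus
sendmsg()/recvmsg() at transfer time, so any policy that denies those
closes this path.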

> Also, this was meant to show that parts of your initial justification
> for implementing the ptrace() way of getting an fd doesn't really stand.
> And it doesn't really. Even with ptrace() you can get into situations
> where you're not able to get an fd. (see prior threads)

Of course. I guess my point was that we shouldn't design an API that's
impossible to use. I'll drop the notes about sendmsg() from the commit
message.

Tycho


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-08 18:00     ` Tycho Andersen
  2018-10-08 18:41       ` Christian Brauner
@ 2018-10-10 17:45       ` Andy Lutomirski
  2018-10-10 18:26         ` Christian Brauner
  1 sibling, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2018-10-10 17:45 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Christian Brauner, Kees Cook, Jann Horn, Linux API,
	Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W. Biederman, Linux FS Devel, Christian Brauner

On Mon, Oct 8, 2018 at 11:00 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Mon, Oct 08, 2018 at 05:16:30PM +0200, Christian Brauner wrote:
> > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > > version which can acquire filters is useful. There are at least two reasons
> > > this is preferable, even though it uses ptrace:
> > >
> > > 1. You can control tasks that aren't cooperating with you
> > > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> > >    task installs a filter which blocks these calls, there's no way with
> > >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> >
> > So for the slow of mind aka me:
> > I'm not sure I completely understand this problem. Can you outline how
> > sendmsg() and socket() are involved in this?
> >
> > I'm also not sure that this holds (but I might misunderstand the
> > > problem) afaict, you could try to get the fd out via CLONE_FILES and
> > other means so something like:
> >
> > // let's pretend the libc wrapper for clone actually has sane semantics
> > pid = clone(CLONE_FILES);
> > if (pid == 0) {
> >         fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
> >
> >         // Now this fd will be valid in both parent and child.
> >         // If you haven't blocked it you can inform the parent what
> >         // the fd number is via pipe2(). If you have blocked it you can
> >         // use dup2() and dup to a known fd number.
> > }
>
> But what if your seccomp filter wants to block both pipe2() and
> dup2()? Whatever syscall you want to use to do this could be blocked
> by some seccomp policy, which means you might not be able to use this
> feature in some cases.

You don't need a syscall at all. You can use shared memory.

>
> Perhaps it's unlikely, and we can just go forward knowing this. But it
> seems like it is worth at least acknowledging that you can wedge
> yourself into a corner.
>

I think that what we *really* want is a way to create a seccomp filter
and activate it later (on execve or via another call to seccomp(),
perhaps).  And we already sort of have that using ptrace() but a
better interface would be nice when a real use case gets figured out.


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-10 17:45       ` Andy Lutomirski
@ 2018-10-10 18:26         ` Christian Brauner
  0 siblings, 0 replies; 91+ messages in thread
From: Christian Brauner @ 2018-10-10 18:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Tycho Andersen, Kees Cook, Jann Horn, Linux API,
	Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W. Biederman, Linux FS Devel, Christian Brauner

On Wed, Oct 10, 2018 at 10:45:29AM -0700, Andy Lutomirski wrote:
> On Mon, Oct 8, 2018 at 11:00 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Mon, Oct 08, 2018 at 05:16:30PM +0200, Christian Brauner wrote:
> > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
> > > > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace()
> > > > version which can acquire filters is useful. There are at least two reasons
> > > > this is preferable, even though it uses ptrace:
> > > >
> > > > 1. You can control tasks that aren't cooperating with you
> > > > 2. You can control tasks whose filters block sendmsg() and socket(); if the
> > > >    task installs a filter which blocks these calls, there's no way with
> > > >    SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task.
> > >
> > > So for the slow of mind aka me:
> > > I'm not sure I completely understand this problem. Can you outline how
> > > sendmsg() and socket() are involved in this?
> > >
> > > I'm also not sure that this holds (but I might misunderstand the
> > > > problem) afaict, you could try to get the fd out via CLONE_FILES and
> > > other means so something like:
> > >
> > > // let's pretend the libc wrapper for clone actually has sane semantics
> > > pid = clone(CLONE_FILES);
> > > if (pid == 0) {
> > >         fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
> > >
> > >         // Now this fd will be valid in both parent and child.
> > >         // If you haven't blocked it you can inform the parent what
> > >         // the fd number is via pipe2(). If you have blocked it you can
> > >         // use dup2() and dup to a known fd number.
> > > }
> >
> > But what if your seccomp filter wants to block both pipe2() and
> > dup2()? Whatever syscall you want to use to do this could be blocked
> > by some seccomp policy, which means you might not be able to use this
> > feature in some cases.
> 
> You don't need a syscall at all. You can use shared memory.

Yeah, I pointed that out too in the next mail. :)

> 
> >
> > Perhaps it's unlikely, and we can just go forward knowing this. But it
> > seems like it is worth at least acknowledging that you can wedge
> > yourself into a corner.
> >
> 
> I think that what we *really* want is a way to create a seccomp filter

I thought about this exact thing when discussing my reservations about
ptrace() but I didn't want to defer this patchset any longer. But I
really like this idea of being able to get an fd *before* the filter is
loaded.

> and activate it later (on execve or via another call to seccomp(),
> perhaps).  And we already sort of have that using ptrace() but a
> better interface would be nice when a real use case gets figured out.


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-10 17:26                             ` Tycho Andersen
@ 2018-10-10 18:28                               ` Christian Brauner
  0 siblings, 0 replies; 91+ messages in thread
From: Christian Brauner @ 2018-10-10 18:28 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Jann Horn, Paul Moore, Kees Cook, Linux API, containers,
	suda.akihiro, Oleg Nesterov, kernel list, Eric W. Biederman,
	linux-fsdevel, Christian Brauner, Andy Lutomirski,
	linux-security-module, selinux, Stephen Smalley, Eric Paris

On Wed, Oct 10, 2018 at 10:26:22AM -0700, Tycho Andersen wrote:
> On Wed, Oct 10, 2018 at 07:15:02PM +0200, Christian Brauner wrote:
> > On Wed, Oct 10, 2018 at 09:54:58AM -0700, Tycho Andersen wrote:
> > > On Wed, Oct 10, 2018 at 05:39:57PM +0200, Christian Brauner wrote:
> > > > On Wed, Oct 10, 2018 at 05:33:43PM +0200, Jann Horn wrote:
> > > > > On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
> > > > > > On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
> > > > > > > +cc selinux people explicitly, since they probably have opinions on this
> > > > > >
> > > > > > I just spent about twenty minutes working my way through this thread,
> > > > > > and digging through the containers archive trying to get a good
> > > > > > understanding of what you guys are trying to do, and I'm not quite
> > > > > > sure I understand it all.  However, from what I have seen, this
> > > > > > approach looks very ptrace-y to me (I imagine to others as well based
> > > > > > on the comments) and because of this I think ensuring the usual ptrace
> > > > > > access controls are evaluated, including the ptrace LSM hooks, is the
> > > > > > right thing to do.
> > > > > 
> > > > > Basically the problem is that this new ptrace() API does something
> > > > > that doesn't just influence the target task, but also every other task
> > > > > that has the same seccomp filter. So the classic ptrace check doesn't
> > > > > work here.
> > > > 
> > > > Just to throw this into the mix: then maybe ptrace() isn't the right
> > > > interface and we should just go with the native seccomp() approach for
> > > > now.
> > > 
> > > Please no :).
> > > 
> > > I don't buy your arguments that 3-syscalls vs. one is better. If I'm
> > > doing this setup with a new container, I have to do
> > > clone(CLONE_FILES), do this seccomp thing, so that my parent can pick
> > > it up again, then do another clone without CLONE_FILES, because in the
> > > general case I don't want to share my fd table with the container,
> > > wait on the middle task for errors, etc. So we're still doing a bunch
> > > of setup, and it feels more awkward than ptrace, with at least as many
> > > syscalls, and it only works for your children.
> > 
> > You're talking about the case where you already have shot yourself in
> > the foot by blocking basically all other sensible ways of getting the fd
> > out.
> 
> Ok, but these other ways involve syscalls too (sendmsg() or whatever).
> And if you're going to allow arbitrary policy from your users, you
> have to be maximally flexible.

So, I totally like the idea of being able to get an fd before the filter
is active. If this could be done in seccomp()-only it would be A+ (See
Andy's mail in the other thread.)
But I really don't want to keep you working on this forever. :)

> 
> > Also, this was meant to show that parts of your initial justification
> > for implementing the ptrace() way of getting an fd doesn't really stand.
> > And it doesn't really. Even with ptrace() you can get into situations
> > where you're not able to get an fd. (see prior threads)
> 
> Of course. I guess my point was that we shouldn't design an API that's
> impossible to use. I'll drop the notes about sendmsg() from the commit
> message.
> 
> Tycho


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-10 15:33                     ` Jann Horn
  2018-10-10 15:39                       ` Christian Brauner
@ 2018-10-11  7:24                       ` Paul Moore
  2018-10-11 13:39                         ` Jann Horn
  1 sibling, 1 reply; 91+ messages in thread
From: Paul Moore @ 2018-10-11  7:24 UTC (permalink / raw)
  To: Jann Horn
  Cc: christian, Tycho Andersen, Kees Cook, Linux API, containers,
	suda.akihiro, Oleg Nesterov, kernel list, Eric W. Biederman,
	linux-fsdevel, Christian Brauner, Andy Lutomirski,
	linux-security-module, selinux, Stephen Smalley, Eric Paris

On October 10, 2018 11:34:11 AM Jann Horn <jannh@google.com> wrote:
> On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
>> On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
>>> +cc selinux people explicitly, since they probably have opinions on this
>>
>> I just spent about twenty minutes working my way through this thread,
>> and digging through the containers archive trying to get a good
>> understanding of what you guys are trying to do, and I'm not quite
>> sure I understand it all.  However, from what I have seen, this
>> approach looks very ptrace-y to me (I imagine to others as well based
>> on the comments) and because of this I think ensuring the usual ptrace
>> access controls are evaluated, including the ptrace LSM hooks, is the
>> right thing to do.
>
> Basically the problem is that this new ptrace() API does something
> that doesn't just influence the target task, but also every other task
> that has the same seccomp filter. So the classic ptrace check doesn't
> work here.

Due to some rather unfortunate events today I'm suddenly without easy access to the kernel code, but would it be possible to run the LSM ptrace access control checks against all of the affected tasks?  If it is possible, how painful would it be?
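
To pin down what I'm asking, here is a userspace toy model, not kernel
code. It mirrors only the uid/gid and dumpable tests from the
__ptrace_may_access() excerpt quoted below and applies them to every
task sharing a filter. All struct and function names are invented for
illustration, the capability check is collapsed into a single boolean,
and the LSM hooks themselves are left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; nothing here exists in the kernel. */
struct task_model {
	unsigned int uid, euid, suid;
	unsigned int gid, egid, sgid;
	bool dumpable;		/* get_dumpable(mm) == SUID_DUMP_USER */
};

struct caller_model {
	unsigned int uid, gid;
	bool has_ptrace_cap;	/* stands in for ptrace_has_cap() */
};

/* Model of the credential and dumpable checks for a single target. */
int may_access_model(const struct caller_model *c,
		     const struct task_model *t)
{
	bool ids_match = c->uid == t->euid && c->uid == t->suid &&
			 c->uid == t->uid && c->gid == t->egid &&
			 c->gid == t->sgid && c->gid == t->gid;

	if (!ids_match && !c->has_ptrace_cap)
		return -EPERM;
	if (!t->dumpable && !c->has_ptrace_cap)
		return -EPERM;
	return 0;
}

/* Run the same check against every task that shares the filter. */
int may_access_all_model(const struct caller_model *c,
			 const struct task_model *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int err = may_access_model(c, &tasks[i]);

		if (err)
			return err;
	}
	return 0;
}
```

In the scenario from the thread, with matching gids assumed, an
uncapable caller with uid 1 passes the check against task B (uid 1,
dumpable) but gets -EPERM once task A (uid 0) is included in the set.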

>
>> If I've missed something, or I'm thinking about this wrong, please
>> educate me; just a heads-up that I'm largely offline for most of this
>> week so responses on my end are going to be delayed much more than
>> usual.
>>
>>> On Tue, Oct 9, 2018 at 3:29 PM Christian Brauner <christian@brauner.io> wrote:
>>>> On Tue, Oct 09, 2018 at 02:39:53PM +0200, Jann Horn wrote:
>>>>> On Mon, Oct 8, 2018 at 8:18 PM Christian Brauner <christian@brauner.io> wrote:
>>>>>> On Mon, Oct 08, 2018 at 06:42:00PM +0200, Jann Horn wrote:
>>>>>>> On Mon, Oct 8, 2018 at 6:21 PM Christian Brauner <christian@brauner.io> wrote:
>>>>>>> > On Mon, Oct 08, 2018 at 05:33:22PM +0200, Jann Horn wrote:
>>>>>>> > > On Mon, Oct 8, 2018 at 5:16 PM Christian Brauner <christian@brauner.io> wrote:
>>>>>>> > > > On Thu, Sep 27, 2018 at 09:11:16AM -0600, Tycho Andersen wrote:
>>>>>>> > > > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>>>>>>> > > > > index 44a31ac8373a..17685803a2af 100644
>>>>>>> > > > > --- a/kernel/seccomp.c
>>>>>>> > > > > +++ b/kernel/seccomp.c
>>>>>>> > > > > @@ -1777,4 +1777,35 @@ static struct file *init_listener(struct task_struct *task,
>>>>>>> > > > >
>>>>>>> > > > >       return ret;
>>>>>>> > > > >  }
>>>>>>> > > > > +
>>>>>>> > > > > +long seccomp_new_listener(struct task_struct *task,
>>>>>>> > > > > +                       unsigned long filter_off)
>>>>>>> > > > > +{
>>>>>>> > > > > +     struct seccomp_filter *filter;
>>>>>>> > > > > +     struct file *listener;
>>>>>>> > > > > +     int fd;
>>>>>>> > > > > +
>>>>>>> > > > > +     if (!capable(CAP_SYS_ADMIN))
>>>>>>> > > > > +             return -EACCES;
>>>>>>> > > >
>>>>>>> > > > I know this might have been discussed a while back but why exactly do we
>>>>>>> > > > require CAP_SYS_ADMIN in init_userns and not in the target userns? What
>>>>>>> > > > if I want to do a setns(fd, CLONE_NEWUSER) to the target process and
>>>>>>> > > > use ptrace from in there?
>>>>>>> > >
>>>>>>> > > See https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
>>>>>>> > > . Basically, the problem is that this doesn't just give you capability
>>>>>>> > > over the target task, but also over every other task that has the same
>>>>>>> > > filter installed; you need some sort of "is the caller capable over
>>>>>>> > > the filter and anyone who uses it" check.
>>>>>>> >
>>>>>>> > Thanks.
>>>>>>> > But then this new ptrace feature as it stands is imho currently broken.
>>>>>>> > If you can install a seccomp filter with SECCOMP_RET_USER_NOTIF if you
>>>>>>> > are ns_capable(CAP_SYS_ADMIN) and also get an fd via seccomp() itself
>>>>>>> > if you are ns_capable(CAP_SYS_ADMIN)
>>>>>
>>>>> Actually, you don't need CAP_SYS_ADMIN for seccomp() at all as long as
>>>>> you enable the NNP flag, I think?
>>>>
>>>> Yes, if you turn on NNP you don't even need sys_admin.
>>>>
>>>>>
>>>>>>> > then either the new ptrace() api
>>>>>>> > extension should be fixed to allow for this too or the seccomp() way of
>>>>>>> > retrieving the pid - which I really think we want - needs to be fixed to
>>>>>>> > require capable(CAP_SYS_ADMIN) too.
>>>>>>> > The solution where both require ns_capable(CAP_SYS_ADMIN) is - imho -
>>>>>>> > the preferred way to solve this.
>>>>>>> > Everything else will just be confusing.
>>>>>>>
>>>>>>> First you say "broken", then you say "confusing". Which one do you mean?
>>>>>>
>>>>>> Both. It's broken in so far as it places a seemingly unnecessary
>>>>>> restriction that could be fixed. You outlined one possible fix yourself
>>>>>> in the link you provided.
>>>>>
>>>>> If by "possible fix" you mean "check whether the seccomp filter is
>>>>> only attached to a single task": That wouldn't fundamentally change
>>>>> the situation, it would only add an additional special case.
>>>>>
>>>>>> And it's confusing in so far as there is a way
>>>>>> via seccomp() to get the fd without said requirement.
>>>>>
>>>>> I don't find it confusing at all. seccomp() and ptrace() are very
>>>>
>>>> Fine, then that's a matter of opinion. I find it counterintuitive that
>>>> you can get an fd without privileges via one interface but not via
>>>> another.
>>>>
>>>>> different situations: When you use seccomp(), infrastructure is
>>>>
>>>> Sure. Note, that this is _one_ of the reasons why I want to make sure we
>>>> keep the native seccomp() only based way of getting an fd without
>>>> forcing userspace to switch to a different kernel api.
>>>>
>>>>> already in place for ensuring that your filter is only applied to
>>>>> processes over which you are capable, and propagation is limited by
>>>>> inheritance from your task down. When you use ptrace(), you need a
>>>>> pretty different sort of access check that checks whether you're
>>>>> privileged over ancestors, siblings and so on of the target task.
>>>>
>>>> So, don't get me wrong I'm not arguing against the ptrace() interface in
>>>> general. If this is something that people find useful, fine. But, I
>>>> would like to have a simple single-syscall pure-seccomp() based way of
>>>> getting an fd, i.e. what we have in patch 1 of this series.
>>>
>>> Yeah, I also prefer the seccomp() one.
>>>
>>>>> But thinking about it more, I think that CAP_SYS_ADMIN over the saved
>>>>> current->mm->user_ns of the task that installed the filter (stored as
>>>>> a "struct user_namespace *" in the filter) should be acceptable.
>>>>
>>>> Hm... Why not CAP_SYS_PTRACE?
>>>
>>> Because LSMs like SELinux add extra checks that apply even if you have
>>> CAP_SYS_PTRACE, and this would subvert those. The only capability I
>>> know of that lets you bypass LSM checks by design (if no LSM blocks
>>> the capability itself) is CAP_SYS_ADMIN.
>>>
>>>> One more thing. Citing from [1]
>>>>
>>>>> I think there's a security problem here. Imagine the following scenario:
>>>>>
>>>>> 1. task A (uid==0) sets up a seccomp filter that uses SECCOMP_RET_USER_NOTIF
>>>>> 2. task A forks off a child B
>>>>> 3. task B uses setuid(1) to drop its privileges
>>>>> 4. task B becomes dumpable again, either via prctl(PR_SET_DUMPABLE, 1)
>>>>> or via execve()
>>>>> 5. task C (the attacker, uid==1) attaches to task B via ptrace
>>>>> 6. task C uses PTRACE_SECCOMP_NEW_LISTENER on task B
>>>>
>>>> Sorry to be late to the party, but would this really pass
>>>> __ptrace_may_access() in ptrace_attach()? It doesn't seem obvious to me
>>>> that it would... Doesn't look like it would get past:
>>>>
>>>>    tcred = __task_cred(task);
>>>>    if (uid_eq(caller_uid, tcred->euid) &&
>>>>        uid_eq(caller_uid, tcred->suid) &&
>>>>        uid_eq(caller_uid, tcred->uid)  &&
>>>>        gid_eq(caller_gid, tcred->egid) &&
>>>>        gid_eq(caller_gid, tcred->sgid) &&
>>>>        gid_eq(caller_gid, tcred->gid))
>>>>            goto ok;
>>>>    if (ptrace_has_cap(tcred->user_ns, mode))
>>>>            goto ok;
>>>>    rcu_read_unlock();
>>>>    return -EPERM;
>>>> ok:
>>>>    rcu_read_unlock();
>>>>    mm = task->mm;
>>>>    if (mm &&
>>>>        ((get_dumpable(mm) != SUID_DUMP_USER) &&
>>>>         !ptrace_has_cap(mm->user_ns, mode)))
>>>>        return -EPERM;
>>>
>>> Which specific check would prevent task C from attaching to task B? If
>>> the UIDs match, the first "goto ok" executes; and you're dumpable, so
>>> you don't trigger the second "return -EPERM".
>>>
>>>>> 7. because the seccomp filter is shared by task A and task B, task C
>>>>> is now able to influence syscall results for syscalls performed by
>>>>> task A
>>>>
>>>> [1]: https://lore.kernel.org/lkml/CAG48ez3R+ZJ1vwGkDfGzKX2mz6f=jjJWsO5pCvnH68P+RKO8Ow@mail.gmail.com/
>>
>>
>>
>> --
>> paul moore
>> www.paul-moore.com


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-11  7:24                       ` Paul Moore
@ 2018-10-11 13:39                         ` Jann Horn
  2018-10-11 23:10                           ` Paul Moore
  0 siblings, 1 reply; 91+ messages in thread
From: Jann Horn @ 2018-10-11 13:39 UTC (permalink / raw)
  To: Paul Moore
  Cc: christian, Tycho Andersen, Kees Cook, Linux API, containers,
	suda.akihiro, Oleg Nesterov, kernel list, Eric W. Biederman,
	linux-fsdevel, Christian Brauner, Andy Lutomirski,
	linux-security-module, selinux, Stephen Smalley, Eric Paris

On Thu, Oct 11, 2018 at 9:24 AM Paul Moore <paul@paul-moore.com> wrote:
> On October 10, 2018 11:34:11 AM Jann Horn <jannh@google.com> wrote:
> > On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
> >> On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
> >>> +cc selinux people explicitly, since they probably have opinions on this
> >>
> >> I just spent about twenty minutes working my way through this thread,
> >> and digging through the containers archive trying to get a good
> >> understanding of what you guys are trying to do, and I'm not quite
> >> sure I understand it all.  However, from what I have seen, this
> >> approach looks very ptrace-y to me (I imagine to others as well based
> >> on the comments) and because of this I think ensuring the usual ptrace
> >> access controls are evaluated, including the ptrace LSM hooks, is the
> >> right thing to do.
> >
> > Basically the problem is that this new ptrace() API does something
> > that doesn't just influence the target task, but also every other task
> > that has the same seccomp filter. So the classic ptrace check doesn't
> > work here.
>
> Due to some rather unfortunate events today I'm suddenly without easy access to the kernel code, but would it be possible to run the LSM ptrace access control checks against all of the affected tasks?  If it is possible, how painful would it be?

There are currently no backlinks from seccomp filters to the tasks
that use them; the only thing you have is a refcount. If the refcount
is 1, and the target task uses the filter directly (it is the last
installed one), you'd be able to infer that the ptrace target is the
only task with a reference to the filter, and you could just do the
direct check; but if the refcount is >1, you might end up having to
take some spinlock and then iterate over all tasks' filters with that
spinlock held, or something like that.


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-11 13:39                         ` Jann Horn
@ 2018-10-11 23:10                           ` Paul Moore
  2018-10-12  1:02                             ` Andy Lutomirski
  0 siblings, 1 reply; 91+ messages in thread
From: Paul Moore @ 2018-10-11 23:10 UTC (permalink / raw)
  To: Jann Horn
  Cc: christian, Tycho Andersen, Kees Cook, Linux API, containers,
	suda.akihiro, Oleg Nesterov, kernel list, Eric W. Biederman,
	linux-fsdevel, Christian Brauner, Andy Lutomirski,
	linux-security-module, selinux, Stephen Smalley, Eric Paris

On October 11, 2018 9:40:06 AM Jann Horn <jannh@google.com> wrote:
> On Thu, Oct 11, 2018 at 9:24 AM Paul Moore <paul@paul-moore.com> wrote:
>> On October 10, 2018 11:34:11 AM Jann Horn <jannh@google.com> wrote:
>>> On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
>>>> On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
>>>>> +cc selinux people explicitly, since they probably have opinions on this
>>>>
>>>> I just spent about twenty minutes working my way through this thread,
>>>> and digging through the containers archive trying to get a good
>>>> understanding of what you guys are trying to do, and I'm not quite
>>>> sure I understand it all.  However, from what I have seen, this
>>>> approach looks very ptrace-y to me (I imagine to others as well based
>>>> on the comments) and because of this I think ensuring the usual ptrace
>>>> access controls are evaluated, including the ptrace LSM hooks, is the
>>>> right thing to do.
>>>
>>> Basically the problem is that this new ptrace() API does something
>>> that doesn't just influence the target task, but also every other task
>>> that has the same seccomp filter. So the classic ptrace check doesn't
>>> work here.
>>
>> Due to some rather unfortunate events today I'm suddenly without easy access to the kernel code, but would it be possible to run the LSM ptrace access control checks against all of the affected tasks?  If it is possible, how painful would it be?
>
> There are currently no backlinks from seccomp filters to the tasks
> that use them; the only thing you have is a refcount. If the refcount
> is 1, and the target task uses the filter directly (it is the last
> installed one), you'd be able to infer that the ptrace target is the
> only task with a reference to the filter, and you could just do the
> direct check; but if the refcount is >1, you might end up having to
> take some spinlock and then iterate over all tasks' filters with that
> spinlock held, or something like that.

That's what I was afraid of.

Unfortunately, I stand by my previous statements that we still probably
want an LSM access check similar to what we currently do for ptrace.

--
paul moore
www.paul-moore.com


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-11 23:10                           ` Paul Moore
@ 2018-10-12  1:02                             ` Andy Lutomirski
  2018-10-12 20:02                               ` Tycho Andersen
  0 siblings, 1 reply; 91+ messages in thread
From: Andy Lutomirski @ 2018-10-12  1:02 UTC (permalink / raw)
  To: Paul Moore
  Cc: Jann Horn, Christian Brauner, Tycho Andersen, Kees Cook,
	Linux API, Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W. Biederman, Linux FS Devel, Christian Brauner, LSM List,
	SELinux-NSA, Stephen Smalley, Eric Paris

On Thu, Oct 11, 2018 at 4:10 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On October 11, 2018 9:40:06 AM Jann Horn <jannh@google.com> wrote:
> > On Thu, Oct 11, 2018 at 9:24 AM Paul Moore <paul@paul-moore.com> wrote:
> >> On October 10, 2018 11:34:11 AM Jann Horn <jannh@google.com> wrote:
> >>> On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
> >>>> On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
> >>>>> +cc selinux people explicitly, since they probably have opinions on this
> >>>>
> >>>> I just spent about twenty minutes working my way through this thread,
> >>>> and digging through the containers archive trying to get a good
> >>>> understanding of what you guys are trying to do, and I'm not quite
> >>>> sure I understand it all.  However, from what I have seen, this
> >>>> approach looks very ptrace-y to me (I imagine to others as well based
> >>>> on the comments) and because of this I think ensuring the usual ptrace
> >>>> access controls are evaluated, including the ptrace LSM hooks, is the
> >>>> right thing to do.
> >>>
> >>> Basically the problem is that this new ptrace() API does something
> >>> that doesn't just influence the target task, but also every other task
> >>> that has the same seccomp filter. So the classic ptrace check doesn't
> >>> work here.
> >>
> >> Due to some rather unfortunate events today I'm suddenly without easy access to the kernel code, but would it be possible to run the LSM ptrace access control checks against all of the affected tasks?  If it is possible, how painful would it be?
> >
> > There are currently no backlinks from seccomp filters to the tasks
> > that use them; the only thing you have is a refcount. If the refcount
> > is 1, and the target task uses the filter directly (it is the last
> > installed one), you'd be able to infer that the ptrace target is the
> > only task with a reference to the filter, and you could just do the
> > direct check; but if the refcount is >1, you might end up having to
> > take some spinlock and then iterate over all tasks' filters with that
> > spinlock held, or something like that.
>
> That's what I was afraid of.
>
> Unfortunately, I stand by my previous statements that we still probably want a LSM access check similar to what we currently do for ptrace.
>

I would argue that once "LSM" enters this conversation, it just means
we're on the wrong track.  Let's try to make this work without ptrace
if possible :)  The whole seccomp() mechanism is very carefully
designed so that it's perfectly safe to install seccomp filters
without involving LSM or even involving credentials at all.

--Andy


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-12  1:02                             ` Andy Lutomirski
@ 2018-10-12 20:02                               ` Tycho Andersen
  2018-10-12 20:06                                 ` Jann Horn
  2018-10-12 20:11                                 ` Christian Brauner
  0 siblings, 2 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-10-12 20:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paul Moore, Jann Horn, Christian Brauner, Kees Cook, Linux API,
	Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W. Biederman, Linux FS Devel, Christian Brauner, LSM List,
	SELinux-NSA, Stephen Smalley, Eric Paris

On Thu, Oct 11, 2018 at 06:02:06PM -0700, Andy Lutomirski wrote:
> On Thu, Oct 11, 2018 at 4:10 PM Paul Moore <paul@paul-moore.com> wrote:
> >
> > On October 11, 2018 9:40:06 AM Jann Horn <jannh@google.com> wrote:
> > > On Thu, Oct 11, 2018 at 9:24 AM Paul Moore <paul@paul-moore.com> wrote:
> > >> On October 10, 2018 11:34:11 AM Jann Horn <jannh@google.com> wrote:
> > >>> On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
> > >>>> On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
> > >>>>> +cc selinux people explicitly, since they probably have opinions on this
> > >>>>
> > >>>> I just spent about twenty minutes working my way through this thread,
> > >>>> and digging through the containers archive trying to get a good
> > >>>> understanding of what you guys are trying to do, and I'm not quite
> > >>>> sure I understand it all.  However, from what I have seen, this
> > >>>> approach looks very ptrace-y to me (I imagine to others as well based
> > >>>> on the comments) and because of this I think ensuring the usual ptrace
> > >>>> access controls are evaluated, including the ptrace LSM hooks, is the
> > >>>> right thing to do.
> > >>>
> > >>> Basically the problem is that this new ptrace() API does something
> > >>> that doesn't just influence the target task, but also every other task
> > >>> that has the same seccomp filter. So the classic ptrace check doesn't
> > >>> work here.
> > >>
> > >> Due to some rather unfortunate events today I'm suddenly without easy access to the kernel code, but would it be possible to run the LSM ptrace access control checks against all of the affected tasks?  If it is possible, how painful would it be?
> > >
> > > There are currently no backlinks from seccomp filters to the tasks
> > > that use them; the only thing you have is a refcount. If the refcount
> > > is 1, and the target task uses the filter directly (it is the last
> > > installed one), you'd be able to infer that the ptrace target is the
> > > only task with a reference to the filter, and you could just do the
> > > direct check; but if the refcount is >1, you might end up having to
> > > take some spinlock and then iterate over all tasks' filters with that
> > > spinlock held, or something like that.
> >
> > That's what I was afraid of.
> >
> > Unfortunately, I stand by my previous statements that we still probably want a LSM access check similar to what we currently do for ptrace.
> >
> 
> I would argue that once "LSM" enters this conversation, it just means
> we're on the wrong track.  Let's try to make this work without ptrace
> if possible :)  The whole seccomp() mechanism is very carefully
> designed so that it's perfectly safe to install seccomp filters
> without involving LSM or even involving credentials at all.

In a last ditch effort to save the ptrace() interface: can we just
only allow it when refcount_read(filter->usage) == 1?

Tycho


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-12 20:02                               ` Tycho Andersen
@ 2018-10-12 20:06                                 ` Jann Horn
  2018-10-12 20:11                                 ` Christian Brauner
  1 sibling, 0 replies; 91+ messages in thread
From: Jann Horn @ 2018-10-12 20:06 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Andy Lutomirski, Paul Moore, christian, Kees Cook, Linux API,
	containers, suda.akihiro, Oleg Nesterov, kernel list,
	Eric W. Biederman, linux-fsdevel, Christian Brauner,
	linux-security-module, selinux, Stephen Smalley, Eric Paris

On Fri, Oct 12, 2018 at 10:02 PM Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Oct 11, 2018 at 06:02:06PM -0700, Andy Lutomirski wrote:
> > On Thu, Oct 11, 2018 at 4:10 PM Paul Moore <paul@paul-moore.com> wrote:
> > >
> > > On October 11, 2018 9:40:06 AM Jann Horn <jannh@google.com> wrote:
> > > > On Thu, Oct 11, 2018 at 9:24 AM Paul Moore <paul@paul-moore.com> wrote:
> > > >> On October 10, 2018 11:34:11 AM Jann Horn <jannh@google.com> wrote:
> > > >>> On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
> > > >>>> On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
> > > >>>>> +cc selinux people explicitly, since they probably have opinions on this
> > > >>>>
> > > >>>> I just spent about twenty minutes working my way through this thread,
> > > >>>> and digging through the containers archive trying to get a good
> > > >>>> understanding of what you guys are trying to do, and I'm not quite
> > > >>>> sure I understand it all.  However, from what I have seen, this
> > > >>>> approach looks very ptrace-y to me (I imagine to others as well based
> > > >>>> on the comments) and because of this I think ensuring the usual ptrace
> > > >>>> access controls are evaluated, including the ptrace LSM hooks, is the
> > > >>>> right thing to do.
> > > >>>
> > > >>> Basically the problem is that this new ptrace() API does something
> > > >>> that doesn't just influence the target task, but also every other task
> > > >>> that has the same seccomp filter. So the classic ptrace check doesn't
> > > >>> work here.
> > > >>
> > > >> Due to some rather unfortunate events today I'm suddenly without easy access to the kernel code, but would it be possible to run the LSM ptrace access control checks against all of the affected tasks?  If it is possible, how painful would it be?
> > > >
> > > > There are currently no backlinks from seccomp filters to the tasks
> > > > that use them; the only thing you have is a refcount. If the refcount
> > > > is 1, and the target task uses the filter directly (it is the last
> > > > installed one), you'd be able to infer that the ptrace target is the
> > > > only task with a reference to the filter, and you could just do the
> > > > direct check; but if the refcount is >1, you might end up having to
> > > > take some spinlock and then iterate over all tasks' filters with that
> > > > spinlock held, or something like that.
> > >
> > > That's what I was afraid of.
> > >
> > > Unfortunately, I stand by my previous statements that we still probably want a LSM access check similar to what we currently do for ptrace.
> > >
> >
> > I would argue that once "LSM" enters this conversation, it just means
> > we're on the wrong track.  Let's try to make this work without ptrace
> > if possible :)  The whole seccomp() mechanism is very carefully
> > designed so that it's perfectly safe to install seccomp filters
> > without involving LSM or even involving credentials at all.
>
> In a last ditch effort to save the ptrace() interface: can we just
> only allow it when refcount_read(filter->usage) == 1?

From a security perspective, I think that would be fine, assuming that
we know that the target task is stopped. (But note that if the target
process e.g. uses the filter on multiple threads, the refcount will be
higher.)


* Re: [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace
  2018-10-12 20:02                               ` Tycho Andersen
  2018-10-12 20:06                                 ` Jann Horn
@ 2018-10-12 20:11                                 ` Christian Brauner
  1 sibling, 0 replies; 91+ messages in thread
From: Christian Brauner @ 2018-10-12 20:11 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Andy Lutomirski, Paul Moore, Jann Horn, Kees Cook, Linux API,
	Linux Containers, Akihiro Suda, Oleg Nesterov, LKML,
	Eric W. Biederman, Linux FS Devel, Christian Brauner, LSM List,
	SELinux-NSA, Stephen Smalley, Eric Paris

On Fri, Oct 12, 2018 at 01:02:20PM -0700, Tycho Andersen wrote:
> On Thu, Oct 11, 2018 at 06:02:06PM -0700, Andy Lutomirski wrote:
> > On Thu, Oct 11, 2018 at 4:10 PM Paul Moore <paul@paul-moore.com> wrote:
> > >
> > > On October 11, 2018 9:40:06 AM Jann Horn <jannh@google.com> wrote:
> > > > On Thu, Oct 11, 2018 at 9:24 AM Paul Moore <paul@paul-moore.com> wrote:
> > > >> On October 10, 2018 11:34:11 AM Jann Horn <jannh@google.com> wrote:
> > > >>> On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@paul-moore.com> wrote:
> > > >>>> On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@google.com> wrote:
> > > >>>>> +cc selinux people explicitly, since they probably have opinions on this
> > > >>>>
> > > >>>> I just spent about twenty minutes working my way through this thread,
> > > >>>> and digging through the containers archive trying to get a good
> > > >>>> understanding of what you guys are trying to do, and I'm not quite
> > > >>>> sure I understand it all.  However, from what I have seen, this
> > > >>>> approach looks very ptrace-y to me (I imagine to others as well based
> > > >>>> on the comments) and because of this I think ensuring the usual ptrace
> > > >>>> access controls are evaluated, including the ptrace LSM hooks, is the
> > > >>>> right thing to do.
> > > >>>
> > > >>> Basically the problem is that this new ptrace() API does something
> > > >>> that doesn't just influence the target task, but also every other task
> > > >>> that has the same seccomp filter. So the classic ptrace check doesn't
> > > >>> work here.
> > > >>
> > > >> Due to some rather unfortunate events today I'm suddenly without easy access to the kernel code, but would it be possible to run the LSM ptrace access control checks against all of the affected tasks?  If it is possible, how painful would it be?
> > > >
> > > > There are currently no backlinks from seccomp filters to the tasks
> > > > that use them; the only thing you have is a refcount. If the refcount
> > > > is 1, and the target task uses the filter directly (it is the last
> > > > installed one), you'd be able to infer that the ptrace target is the
> > > > only task with a reference to the filter, and you could just do the
> > > > direct check; but if the refcount is >1, you might end up having to
> > > > take some spinlock and then iterate over all tasks' filters with that
> > > > spinlock held, or something like that.
> > >
> > > That's what I was afraid of.
> > >
> > > Unfortunately, I stand by my previous statements that we still probably want a LSM access check similar to what we currently do for ptrace.
> > >
> > 
> > I would argue that once "LSM" enters this conversation, it just means
> > we're on the wrong track.  Let's try to make this work without ptrace
> > if possible :)  The whole seccomp() mechanism is very carefully
> > designed so that it's perfectly safe to install seccomp filters
> > without involving LSM or even involving credentials at all.
> 
> In a last ditch effort to save the ptrace() interface: can we just
> only allow it when refcount_read(filter->usage) == 1?

I mean, the filter->usage == 1 check lets us get rid of
capable(CAP_SYS_ADMIN), making the ptrace() way of getting an fd usable
in nesting scenarios and from within user namespaces. That makes it a
whole lot more useful and aligns it with the seccomp() way of getting
the fd. So I wouldn't argue against it.
I guess it comes down to (for me) whether you consider this a necessary
part of this patchset, meaning that without it the patchset wouldn't be
usable.

Christian


* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-09-27 21:31   ` Kees Cook
  2018-09-27 22:48     ` Tycho Andersen
@ 2018-10-17 20:29     ` Tycho Andersen
  2018-10-17 22:21       ` Kees Cook
  1 sibling, 1 reply; 91+ messages in thread
From: Tycho Andersen @ 2018-10-17 20:29 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > @@ -60,4 +62,29 @@ struct seccomp_data {
> >         __u64 args[6];
> >  };
> >
> > +struct seccomp_notif {
> > +       __u16 len;
> > +       __u64 id;
> > +       __u32 pid;
> > +       __u8 signaled;
> > +       struct seccomp_data data;
> > +};
> > +
> > +struct seccomp_notif_resp {
> > +       __u16 len;
> > +       __u64 id;
> > +       __s32 error;
> > +       __s64 val;
> > +};
> 
> So, len has to come first, for versioning. However, since it's ahead
> of a u64, this leaves a struct padding hole. pahole output:
> 
> struct seccomp_notif {
>         __u16                      len;                  /*     0     2 */
> 
>         /* XXX 6 bytes hole, try to pack */
> 
>         __u64                      id;                   /*     8     8 */
>         __u32                      pid;                  /*    16     4 */
>         __u8                       signaled;             /*    20     1 */
> 
>         /* XXX 3 bytes hole, try to pack */
> 
>         struct seccomp_data        data;                 /*    24    64 */
>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> 
>         /* size: 88, cachelines: 2, members: 5 */
>         /* sum members: 79, holes: 2, sum holes: 9 */
>         /* last cacheline: 24 bytes */
> };
> struct seccomp_notif_resp {
>         __u16                      len;                  /*     0     2 */
> 
>         /* XXX 6 bytes hole, try to pack */
> 
>         __u64                      id;                   /*     8     8 */
>         __s32                      error;                /*    16     4 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         __s64                      val;                  /*    24     8 */
> 
>         /* size: 32, cachelines: 1, members: 4 */
>         /* sum members: 22, holes: 2, sum holes: 10 */
>         /* last cacheline: 32 bytes */
> };
> 
> How about making len u32, and moving pid and error above "id"? This
> leaves a hole after signaled, so changing "len" won't be sufficient
> for versioning here. Perhaps move it after data?

Just to confirm my understanding; I've got these as:

struct seccomp_notif {
	__u32                      len;                  /*     0     4 */
	__u32                      pid;                  /*     4     4 */
	__u64                      id;                   /*     8     8 */
	__u8                       signaled;             /*    16     1 */

	/* XXX 7 bytes hole, try to pack */

	struct seccomp_data        data;                 /*    24    64 */
	/* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */

	/* size: 88, cachelines: 2, members: 5 */
	/* sum members: 81, holes: 1, sum holes: 7 */
	/* last cacheline: 24 bytes */
};
struct seccomp_notif_resp {
	__u32                      len;                  /*     0     4 */
	__s32                      error;                /*     4     4 */
	__u64                      id;                   /*     8     8 */
	__s64                      val;                  /*    16     8 */

	/* size: 24, cachelines: 1, members: 4 */
	/* last cacheline: 24 bytes */
};

in the next version. Since the structure has no padding at the end of
it, I think the Right Thing will happen. Note that this is slightly
different from what Kees suggested; if I move signaled after data, then
I end up with:

struct seccomp_notif {
	__u32                      len;                  /*     0     4 */
	__u32                      pid;                  /*     4     4 */
	__u64                      id;                   /*     8     8 */
	struct seccomp_data        data;                 /*    16    64 */
	/* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
	__u8                       signaled;             /*    80     1 */

	/* size: 88, cachelines: 2, members: 5 */
	/* padding: 7 */
	/* last cacheline: 24 bytes */
};

which I think will have the versioning problem if the next member
introduced is < 7 bytes.

Tycho


* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-10-17 20:29     ` Tycho Andersen
@ 2018-10-17 22:21       ` Kees Cook
  2018-10-17 22:33         ` Tycho Andersen
  2018-10-21 16:04         ` Tycho Andersen
  0 siblings, 2 replies; 91+ messages in thread
From: Kees Cook @ 2018-10-17 22:21 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Wed, Oct 17, 2018 at 1:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
>> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> > @@ -60,4 +62,29 @@ struct seccomp_data {
>> >         __u64 args[6];
>> >  };
>> >
>> > +struct seccomp_notif {
>> > +       __u16 len;
>> > +       __u64 id;
>> > +       __u32 pid;
>> > +       __u8 signaled;
>> > +       struct seccomp_data data;
>> > +};
>> > +
>> > +struct seccomp_notif_resp {
>> > +       __u16 len;
>> > +       __u64 id;
>> > +       __s32 error;
>> > +       __s64 val;
>> > +};
>>
>> So, len has to come first, for versioning. However, since it's ahead
>> of a u64, this leaves a struct padding hole. pahole output:
>>
>> struct seccomp_notif {
>>         __u16                      len;                  /*     0     2 */
>>
>>         /* XXX 6 bytes hole, try to pack */
>>
>>         __u64                      id;                   /*     8     8 */
>>         __u32                      pid;                  /*    16     4 */
>>         __u8                       signaled;             /*    20     1 */
>>
>>         /* XXX 3 bytes hole, try to pack */
>>
>>         struct seccomp_data        data;                 /*    24    64 */
>>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
>>
>>         /* size: 88, cachelines: 2, members: 5 */
>>         /* sum members: 79, holes: 2, sum holes: 9 */
>>         /* last cacheline: 24 bytes */
>> };
>> struct seccomp_notif_resp {
>>         __u16                      len;                  /*     0     2 */
>>
>>         /* XXX 6 bytes hole, try to pack */
>>
>>         __u64                      id;                   /*     8     8 */
>>         __s32                      error;                /*    16     4 */
>>
>>         /* XXX 4 bytes hole, try to pack */
>>
>>         __s64                      val;                  /*    24     8 */
>>
>>         /* size: 32, cachelines: 1, members: 4 */
>>         /* sum members: 22, holes: 2, sum holes: 10 */
>>         /* last cacheline: 32 bytes */
>> };
>>
>> How about making len u32, and moving pid and error above "id"? This
>> leaves a hole after signaled, so changing "len" won't be sufficient
>> for versioning here. Perhaps move it after data?
>
> Just to confirm my understanding; I've got these as:
>
> struct seccomp_notif {
>         __u32                      len;                  /*     0     4 */
>         __u32                      pid;                  /*     4     4 */
>         __u64                      id;                   /*     8     8 */
>         __u8                       signaled;             /*    16     1 */
>
>         /* XXX 7 bytes hole, try to pack */
>
>         struct seccomp_data        data;                 /*    24    64 */
>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
>
>         /* size: 88, cachelines: 2, members: 5 */
>         /* sum members: 81, holes: 1, sum holes: 7 */
>         /* last cacheline: 24 bytes */
> };
> struct seccomp_notif_resp {
>         __u32                      len;                  /*     0     4 */
>         __s32                      error;                /*     4     4 */
>         __u64                      id;                   /*     8     8 */
>         __s64                      val;                  /*    16     8 */
>
>         /* size: 24, cachelines: 1, members: 4 */
>         /* last cacheline: 24 bytes */
> };
>
> in the next version. Since the structure has no padding at the end of
> it, I think the Right Thing will happen. Note that this is slightly
> different than what Kees suggested, if I add signaled after data, then
> I end up with:
>
> struct seccomp_notif {
>         __u32                      len;                  /*     0     4 */
>         __u32                      pid;                  /*     4     4 */
>         __u64                      id;                   /*     8     8 */
>         struct seccomp_data        data;                 /*    16    64 */
>         /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
>         __u8                       signaled;             /*    80     1 */
>
>         /* size: 88, cachelines: 2, members: 5 */
>         /* padding: 7 */
>         /* last cacheline: 24 bytes */
> };
>
> which I think will have the versioning problem if the next member
> introduces is < 7 bytes.

It'll be a problem in either place. What I was thinking was that
specific versioning is required instead of just length.
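
To make the point concrete, here is a minimal sketch, assuming an LP64 ABI
such as x86-64 where a u64 is 8-byte aligned. The struct names, the `extra`
member and the helper are hypothetical and only mirror the layouts quoted
above. Both placements of `signaled` leave 7 spare bytes, so a later member
that fits in them does not change sizeof() and a len-only check cannot see it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in with the same shape as struct seccomp_data (64 bytes). */
struct data_stub {
	int32_t  nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/* signaled before data: 7-byte hole in the middle. */
struct notif_hole {
	uint32_t len;
	uint32_t pid;
	uint64_t id;
	uint8_t  signaled;
	struct data_stub data;
};

/* signaled after data: 7 bytes of tail padding. */
struct notif_tail {
	uint32_t len;
	uint32_t pid;
	uint64_t id;
	struct data_stub data;
	uint8_t  signaled;
};

/* Same as notif_tail plus a hypothetical later member in the old padding. */
struct notif_tail_v2 {
	uint32_t len;
	uint32_t pid;
	uint64_t id;
	struct data_stub data;
	uint8_t  signaled;
	uint8_t  extra; /* hypothetical new field */
};

/* Layout assumptions checked at compile time (LP64 only). */
static_assert(sizeof(struct notif_hole) == 88, "hole layout is 88 bytes");
static_assert(sizeof(struct notif_tail) == 88, "tail layout is 88 bytes");
static_assert(sizeof(struct notif_tail_v2) == 88, "new member hides in padding");

/* Returns 1 if comparing sizes alone can tell two layouts apart. */
static int len_can_detect(size_t old_size, size_t new_size)
{
	return old_size != new_size;
}
```

With these definitions, len_can_detect(sizeof(struct notif_tail),
sizeof(struct notif_tail_v2)) is 0, which is the case where length alone is
not enough.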

-Kees

-- 
Kees Cook
Pixel Security


* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-10-17 22:21       ` Kees Cook
@ 2018-10-17 22:33         ` Tycho Andersen
  2018-10-21 16:04         ` Tycho Andersen
  1 sibling, 0 replies; 91+ messages in thread
From: Tycho Andersen @ 2018-10-17 22:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Wed, Oct 17, 2018 at 03:21:02PM -0700, Kees Cook wrote:
> On Wed, Oct 17, 2018 at 1:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> >> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> > @@ -60,4 +62,29 @@ struct seccomp_data {
> >> >         __u64 args[6];
> >> >  };
> >> >
> >> > +struct seccomp_notif {
> >> > +       __u16 len;
> >> > +       __u64 id;
> >> > +       __u32 pid;
> >> > +       __u8 signaled;
> >> > +       struct seccomp_data data;
> >> > +};
> >> > +
> >> > +struct seccomp_notif_resp {
> >> > +       __u16 len;
> >> > +       __u64 id;
> >> > +       __s32 error;
> >> > +       __s64 val;
> >> > +};
> >>
> >> So, len has to come first, for versioning. However, since it's ahead
> >> of a u64, this leaves a struct padding hole. pahole output:
> >>
> >> struct seccomp_notif {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __u32                      pid;                  /*    16     4 */
> >>         __u8                       signaled;             /*    20     1 */
> >>
> >>         /* XXX 3 bytes hole, try to pack */
> >>
> >>         struct seccomp_data        data;                 /*    24    64 */
> >>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >>
> >>         /* size: 88, cachelines: 2, members: 5 */
> >>         /* sum members: 79, holes: 2, sum holes: 9 */
> >>         /* last cacheline: 24 bytes */
> >> };
> >> struct seccomp_notif_resp {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __s32                      error;                /*    16     4 */
> >>
> >>         /* XXX 4 bytes hole, try to pack */
> >>
> >>         __s64                      val;                  /*    24     8 */
> >>
> >>         /* size: 32, cachelines: 1, members: 4 */
> >>         /* sum members: 22, holes: 2, sum holes: 10 */
> >>         /* last cacheline: 32 bytes */
> >> };
> >>
> >> How about making len u32, and moving pid and error above "id"? This
> >> leaves a hole after signaled, so changing "len" won't be sufficient
> >> for versioning here. Perhaps move it after data?
> >
> > Just to confirm my understanding; I've got these as:
> >
> > struct seccomp_notif {
> >         __u32                      len;                  /*     0     4 */
> >         __u32                      pid;                  /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         __u8                       signaled;             /*    16     1 */
> >
> >         /* XXX 7 bytes hole, try to pack */
> >
> >         struct seccomp_data        data;                 /*    24    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* sum members: 81, holes: 1, sum holes: 7 */
> >         /* last cacheline: 24 bytes */
> > };
> > struct seccomp_notif_resp {
> >         __u32                      len;                  /*     0     4 */
> >         __s32                      error;                /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         __s64                      val;                  /*    16     8 */
> >
> >         /* size: 24, cachelines: 1, members: 4 */
> >         /* last cacheline: 24 bytes */
> > };
> >
> > in the next version. Since the structure has no padding at the end of
> > it, I think the Right Thing will happen. Note that this is slightly
> > different than what Kees suggested, if I add signaled after data, then
> > I end up with:
> >
> > struct seccomp_notif {
> >         __u32                      len;                  /*     0     4 */
> >         __u32                      pid;                  /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         struct seccomp_data        data;                 /*    16    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
> >         __u8                       signaled;             /*    80     1 */
> >
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* padding: 7 */
> >         /* last cacheline: 24 bytes */
> > };
> >
> > which I think will have the versioning problem if the next member
> > introduces is < 7 bytes.
> 
> It'll be a problem in either place. What I was thinking was that
> specific versioning is required instead of just length.

Oh, if we decide to use the padded space? Yes, that makes sense. Ok,
I'll switch it to a version.
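
As a rough sketch of what an explicit version could look like (not the code
from the series): the struct name, the version constants and the `extra`
member below are all hypothetical. The reader consults the version tag rather
than the size before touching a member that lives in former padding.

```c
#include <assert.h>
#include <stdint.h>

#define NOTIF_VERSION_1 1u /* hypothetical */
#define NOTIF_VERSION_2 2u /* hypothetical: defines 'extra' */

struct notif_versioned {
	uint32_t version;
	uint32_t pid;
	uint64_t id;
	uint8_t  signaled;
	uint8_t  extra;       /* only meaningful when version >= 2 */
	uint8_t  reserved[6]; /* explicit padding, kept zero */
};

/* The size is the same for both versions, so only the tag can differ. */
static_assert(sizeof(struct notif_versioned) == 24, "no implicit padding");

/* Returns 'extra' if the producer's version defines it, else 'fallback'. */
static uint8_t notif_extra_or(const struct notif_versioned *n, uint8_t fallback)
{
	return n->version >= NOTIF_VERSION_2 ? n->extra : fallback;
}
```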

Tycho


* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-10-17 22:21       ` Kees Cook
  2018-10-17 22:33         ` Tycho Andersen
@ 2018-10-21 16:04         ` Tycho Andersen
  2018-10-22  9:42           ` Christian Brauner
  1 sibling, 1 reply; 91+ messages in thread
From: Tycho Andersen @ 2018-10-21 16:04 UTC (permalink / raw)
  To: Kees Cook
  Cc: LKML, Linux Containers, Linux API, Andy Lutomirski,
	Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn,
	Christian Brauner, Tyler Hicks, Akihiro Suda, Jann Horn,
	linux-fsdevel

On Wed, Oct 17, 2018 at 03:21:02PM -0700, Kees Cook wrote:
> It'll be a problem in either place. What I was thinking was that
> specific versioning is required instead of just length.

Euh, so I implemented this, and it sucks :). It's ugly, and generally
feels bad.

What if instead we just get rid of versioning altogether, and
instead introduce a u32 flags? We could have one flag right now
(SECCOMP_NOTIF_FLAG_SIGNALED), and introduce others as we add more
information to the response. Then we can add
SECCOMP_NOTIF_FLAG_EXTRA_FOO, and add another SECCOMP_IOCTL_GET_FOO to
grab the info?
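
Roughly, as a sketch only: the flag names are the ones above, but the bit
values, the struct name and the helpers are assumptions made up for
illustration, and `data` is left out for brevity.

```c
#include <assert.h>
#include <stdint.h>

#define SECCOMP_NOTIF_FLAG_SIGNALED  (1u << 0) /* bit value assumed */
#define SECCOMP_NOTIF_FLAG_EXTRA_FOO (1u << 1) /* placeholder name, bit assumed */

#define NOTIF_KNOWN_FLAGS \
	(SECCOMP_NOTIF_FLAG_SIGNALED | SECCOMP_NOTIF_FLAG_EXTRA_FOO)

/* Hypothetical notification header with flags in place of len/signaled. */
struct notif_flags_sketch {
	uint32_t flags;
	uint32_t pid;
	uint64_t id;
};

static_assert(sizeof(struct notif_flags_sketch) == 16, "no holes");

/* Non-zero if the task was signaled while the notification was pending. */
static int notif_signaled(const struct notif_flags_sketch *n)
{
	return (n->flags & SECCOMP_NOTIF_FLAG_SIGNALED) != 0;
}

/* Non-zero if extra info exists and a follow-up ioctl would fetch it. */
static int notif_has_extra_foo(const struct notif_flags_sketch *n)
{
	return (n->flags & SECCOMP_NOTIF_FLAG_EXTRA_FOO) != 0;
}

/* Bits this reader does not understand; an old reader can ignore them. */
static uint32_t notif_unknown_flags(const struct notif_flags_sketch *n)
{
	return n->flags & ~NOTIF_KNOWN_FLAGS;
}
```

The struct itself never grows here; new information is announced by a bit
and fetched separately.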

FWIW, it's not really clear to me that we'll ever add anything to the
response since hopefully we'll land PUT_FD, so maybe this is all moot
anyway.

Tycho


* Re: [PATCH v7 1/6] seccomp: add a return code to trap to userspace
  2018-10-21 16:04         ` Tycho Andersen
@ 2018-10-22  9:42           ` Christian Brauner
  0 siblings, 0 replies; 91+ messages in thread
From: Christian Brauner @ 2018-10-22  9:42 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Kees Cook, Jann Horn, Linux API, Linux Containers, Akihiro Suda,
	Oleg Nesterov, LKML, Eric W . Biederman, linux-fsdevel,
	Christian Brauner, Andy Lutomirski

On Sun, Oct 21, 2018 at 05:04:37PM +0100, Tycho Andersen wrote:
> On Wed, Oct 17, 2018 at 03:21:02PM -0700, Kees Cook wrote:
> > It'll be a problem in either place. What I was thinking was that
> > specific versioning is required instead of just length.
> 
> Euh, so I implemented this, and it sucks :). It's ugly, and generally
> feels bad.
> 
> What if instead we just get rid of versioning all together, and
> instead introduce a u32 flags? We could have one flag right now
> (SECCOMP_NOTIF_FLAG_SIGNALED), and use introduce others as we add more
> information to the response. Then we can add
> SECCOMP_NOTIF_FLAG_EXTRA_FOO, and add another SECCOMP_IOCTL_GET_FOO to
> grab the info?
> 
> FWIW, it's not really clear to me that we'll ever add anything to the
> response since hopefully we'll land PUT_FD, so maybe this is all moot
> anyway.

I guess the only argument against a flag would be that you run out of
bits quickly if your interface grows (cf. mount, netlink etc.). But this
is likely not a concern here.
I actually think that the way vfs capabilities are done is pretty
nice. By accident or design they allow transparent translation between
old and new formats in-kernel. So it would be cool if we could have the
same guarantee for this interface.
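
A rough sketch of that kind of translation, applied to a made-up response
format: this is not the vfs capability code, and the struct names, version
numbers and the `extra` member are all hypothetical. An old-format buffer is
upgraded on the way in, so the rest of the code only ever sees the newest
layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical old format. */
struct resp_v1 {
	uint32_t version; /* 1 */
	int32_t  error;
	uint64_t id;
	int64_t  val;
};

/* Hypothetical new format: same prefix plus one new member. */
struct resp_v2 {
	uint32_t version; /* 2 */
	int32_t  error;
	uint64_t id;
	int64_t  val;
	uint64_t extra;   /* zero when upgraded from v1 */
};

/*
 * Copies a v1 or v2 buffer into *out as v2.
 * Returns 0 on success, -1 on an unknown version or a short buffer.
 */
static int resp_to_v2(const void *buf, size_t len, struct resp_v2 *out)
{
	uint32_t version;

	if (len < sizeof(version))
		return -1;
	memcpy(&version, buf, sizeof(version));
	memset(out, 0, sizeof(*out));

	if (version == 1 && len >= sizeof(struct resp_v1)) {
		/* The v1 layout is a prefix of v2, so copy it and retag. */
		memcpy(out, buf, sizeof(struct resp_v1));
		out->version = 2;
		return 0;
	}
	if (version == 2 && len >= sizeof(struct resp_v2)) {
		memcpy(out, buf, sizeof(struct resp_v2));
		return 0;
	}
	return -1;
}
```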

Christian


Thread overview: 91+ messages
2018-09-27 15:11 [PATCH v7 0/6] seccomp trap to userspace Tycho Andersen
2018-09-27 15:11 ` [PATCH v7 1/6] seccomp: add a return code to " Tycho Andersen
2018-09-27 21:31   ` Kees Cook
2018-09-27 22:48     ` Tycho Andersen
2018-09-27 23:10       ` Kees Cook
2018-09-28 14:39         ` Tycho Andersen
2018-10-08 14:58       ` Christian Brauner
2018-10-09 14:28         ` Tycho Andersen
2018-10-09 16:24           ` Christian Brauner
2018-10-09 16:29             ` Tycho Andersen
2018-10-17 20:29     ` Tycho Andersen
2018-10-17 22:21       ` Kees Cook
2018-10-17 22:33         ` Tycho Andersen
2018-10-21 16:04         ` Tycho Andersen
2018-10-22  9:42           ` Christian Brauner
2018-09-27 21:51   ` Jann Horn
2018-09-27 22:45     ` Kees Cook
2018-09-27 23:08       ` Tycho Andersen
2018-09-27 23:04     ` Tycho Andersen
2018-09-27 23:37       ` Jann Horn
2018-09-29  0:28   ` Aleksa Sarai
2018-09-27 15:11 ` [PATCH v7 2/6] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen
2018-09-27 16:51   ` Jann Horn
2018-09-27 21:42   ` Kees Cook
2018-10-08 13:55   ` Christian Brauner
2018-09-27 15:11 ` [PATCH v7 3/6] seccomp: add a way to get a listener fd from ptrace Tycho Andersen
2018-09-27 16:20   ` Jann Horn
2018-09-27 16:34     ` Tycho Andersen
2018-09-27 17:35   ` Jann Horn
2018-09-27 18:09     ` Tycho Andersen
2018-09-27 21:53   ` Kees Cook
2018-10-08 15:16   ` Christian Brauner
2018-10-08 15:33     ` Jann Horn
2018-10-08 16:21       ` Christian Brauner
2018-10-08 16:42         ` Jann Horn
2018-10-08 18:18           ` Christian Brauner
2018-10-09 12:39             ` Jann Horn
2018-10-09 13:28               ` Christian Brauner
2018-10-09 13:36                 ` Jann Horn
2018-10-09 13:49                   ` Christian Brauner
2018-10-09 13:50                     ` Jann Horn
2018-10-09 14:09                       ` Christian Brauner
2018-10-09 15:26                         ` Jann Horn
2018-10-09 16:20                           ` Christian Brauner
2018-10-09 16:26                             ` Jann Horn
2018-10-10 12:54                               ` Christian Brauner
2018-10-10 13:09                                 ` Christian Brauner
2018-10-10 13:10                                 ` Jann Horn
2018-10-10 13:18                                   ` Christian Brauner
2018-10-10 15:31                   ` Paul Moore
2018-10-10 15:33                     ` Jann Horn
2018-10-10 15:39                       ` Christian Brauner
2018-10-10 16:54                         ` Tycho Andersen
2018-10-10 17:15                           ` Christian Brauner
2018-10-10 17:26                             ` Tycho Andersen
2018-10-10 18:28                               ` Christian Brauner
2018-10-11  7:24                       ` Paul Moore
2018-10-11 13:39                         ` Jann Horn
2018-10-11 23:10                           ` Paul Moore
2018-10-12  1:02                             ` Andy Lutomirski
2018-10-12 20:02                               ` Tycho Andersen
2018-10-12 20:06                                 ` Jann Horn
2018-10-12 20:11                                 ` Christian Brauner
2018-10-08 18:00     ` Tycho Andersen
2018-10-08 18:41       ` Christian Brauner
2018-10-10 17:45       ` Andy Lutomirski
2018-10-10 18:26         ` Christian Brauner
2018-09-27 15:11 ` [PATCH v7 4/6] files: add a replace_fd_files() function Tycho Andersen
2018-09-27 16:49   ` Jann Horn
2018-09-27 18:04     ` Tycho Andersen
2018-09-27 21:59   ` Kees Cook
2018-09-28  2:20     ` Kees Cook
2018-09-28  2:46       ` Jann Horn
2018-09-28  5:23       ` Tycho Andersen
2018-09-27 15:11 ` [PATCH v7 5/6] seccomp: add a way to pass FDs via a notification fd Tycho Andersen
2018-09-27 16:39   ` Jann Horn
2018-09-27 22:13     ` Tycho Andersen
2018-09-27 19:28   ` Jann Horn
2018-09-27 22:14     ` Tycho Andersen
2018-09-27 22:17       ` Jann Horn
2018-09-27 22:49         ` Tycho Andersen
2018-09-27 22:09   ` Kees Cook
2018-09-27 22:15     ` Tycho Andersen
2018-09-27 15:11 ` [PATCH v7 6/6] samples: add an example of seccomp user trap Tycho Andersen
2018-09-27 22:11   ` Kees Cook
2018-09-28 21:57 ` [PATCH v7 0/6] seccomp trap to userspace Michael Kerrisk (man-opages)
2018-09-28 22:03   ` Tycho Andersen
2018-09-28 22:16     ` Michael Kerrisk (man-pages)
2018-09-28 22:34       ` Kees Cook
2018-09-28 22:46         ` Michael Kerrisk (man-pages)
2018-09-28 22:48           ` Jann Horn
